docs/guides/recovery.md

Select File:
docs/guides/recovery.md

# Recovery and Checkpoint Restart (Opt-in)

The recovery layer is **disabled by default**. It lets you detect corrupted training runs and restart from checkpoints automatically using explicit, user-controlled processes.

## Components

- `Tinkex.Recovery.Policy` – configuration (defaults off, conservative backoff, optimizer restore on)
- `Tinkex.Recovery.Monitor` – polls `Rest.get_training_run/2` for `corrupted: true`
- `Tinkex.Recovery.Executor` – bounded worker that restarts runs from checkpoints

Telemetry events: `:detected`, `:started`, `:checkpoint_selected`, `:client_created`, `:completed`, `:failed`, `:exhausted`, `:poll_error` (all under `[:tinkex, :recovery, ...]`).

## Quickstart

```elixir
config = Tinkex.Config.new(api_key: api_key, recovery: %{enabled: true})

policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    checkpoint_strategy: :latest,        # or {:specific, "tinker://..."}
    restore_optimizer: true,             # false for weights-only
    poll_interval_ms: 15_000,
    max_attempts: 3,
    backoff_ms: 5_000,
    max_backoff_ms: 60_000,
    on_recovery: fn old_pid, new_pid, cp ->
      Logger.info("Recovered #{cp.tinker_path} -> #{inspect(new_pid)} (old=#{inspect(old_pid)})")
      :ok
    end,
    on_failure: fn run_id, reason ->
      Logger.warning("Recovery failed for #{run_id}: #{inspect(reason)}")
      :ok
    end
  )

{:ok, service} = Tinkex.ServiceClient.start_link(config: config)

{:ok, executor} =
  Tinkex.Recovery.Executor.start_link(
    max_concurrent: 2
  )

{:ok, monitor} =
  Tinkex.Recovery.Monitor.start_link(
    executor: executor,
    policy: policy
  )

:ok = Tinkex.Recovery.Monitor.monitor_run(monitor, "run-123", service, %{training_pid: training_pid})
```

## Checkpoint Strategy

- `:latest` (default): uses `TrainingRun.last_checkpoint`
- `{:specific, path}`: explicit checkpoint path
- `:best`: reserved for future support

If `last_checkpoint` is missing, the executor will fetch the run via REST (requires passing `config`).

## Optimizer Restore

`restore_optimizer: true` uses `ServiceClient.create_training_client_from_state_with_optimizer/3`. Set to `false` for weights-only restarts (fresh optimizer).

## Concurrency and Backoff

- `max_concurrent` (executor option) caps simultaneous recoveries (default: 1).
- `max_attempts`, `backoff_ms`, `max_backoff_ms` control retry behavior; `:exhausted` is emitted when attempts are spent.

## Telemetry and Observability

Attach handlers to `[:tinkex, :recovery, *]` for tracing. `:poll_error` indicates REST polling issues (monitor keeps state).

## Safety Notes

- Nothing runs automatically; you must start monitor/executor and set `policy.enabled: true`.
- Callbacks should be cheap and resilient; errors are swallowed to avoid cascading failures.
- Ensure `config` is available for REST lookups when monitor/executor are used outside the original process.

## Offline Example

Run `mix run examples/recovery_simulated.exs` to see a fully offline flow that seeds a checkpoint, flips a run to `corrupted: true`, and exercises the monitor + executor with stubbed REST/service modules. No network or API key is required.

## Live Example with Injected Corruption

Run `mix run examples/recovery_live_injected.exs` to create a training run, save a real checkpoint, inject a single `corrupted: true` poll response (no server change), and let the monitor + executor restore from that checkpoint. The example then saves a second checkpoint from the recovered client to prove the restart path. Requires `TINKER_API_KEY` (and optional `TINKER_BASE_URL`/`TINKER_BASE_MODEL`).