Skip to main content

guides/snapshots-and-resume.md

# Snapshots And Resume

A Jidoka snapshot is the serializable pause point of a turn. It captures the
spec, request, pending work, journal, and any pending review interrupt. It never
captures pids, sockets, or provider clients.

## When To Use This

- Use snapshots whenever a turn might be paused and continued later: human
  approval, batch deferral, long-running tool work, or a redeploy in the
  middle of a long conversation.
- Use snapshots as the unit you persist in a job queue. They are designed to
  round-trip through any binary-safe store.
- Do not use snapshots as a substitute for sessions. A session contains a
  list of snapshots plus durable metadata. See
  [Sessions And Stores](sessions-and-stores.md).

## Prerequisites

- An agent and a runtime capability that can produce a turn (LLM, and
  operations if the agent declares tools).
- Familiarity with `Jidoka.turn/3` and `Jidoka.resume/2`; the resume path is
  the only API that consumes a snapshot.
- A persistent target (database, queue, file) for serialized snapshots
  when crossing process or node boundaries.

```bash
mix deps.get
mix test
```

## Quick Example

The smallest hibernate/resume cycle uses the `:after_prompt` checkpoint.

```elixir
{:hibernate, snapshot} =
  Jidoka.turn(MyApp.SupportAgent, "Hello",
    checkpoint: :after_prompt
  )

{:ok, serialized} = Jidoka.Runtime.AgentSnapshot.serialize(snapshot)
String.starts_with?(serialized, "jidoka:snapshot:v1:")
#=> true

{:ok, result} = Jidoka.resume(serialized)
result.content
```

The snapshot carried everything Jidoka needed to keep going. Runtime
capabilities still come from the host process at resume time.

## Concepts

A snapshot is data. The runtime treats it as an inert payload until
`Jidoka.resume/2` lifts it back into a live `Turn.State`.

```diagram
╭──────────────────╮     ╭──────────────────────╮     ╭──────────────╮
│ Jidoka.turn/3    │────▶│   Turn.State + Cursor │────▶│ AgentSnapshot│
│ checkpoint: ...  │     ╰──────────────────────╯     ╰──────┬───────╯
╰──────────────────╯                                          │
                                                ╭──────────────────────╮
                                                │ serialize / store /  │
                                                │ deserialize          │
                                                ╰──────┬───────────────╯
                                                ╭──────────────────────╮
                                                │ Jidoka.resume/2      │
                                                │ same capabilities    │
                                                ╰──────────────────────╯
```

Key facts:

- [`Jidoka.Runtime.AgentSnapshot`](`Jidoka.Runtime.AgentSnapshot`) has a
  `schema_version/0` of `1`. Unknown versions fail at normalization.
- `serialize/1` returns `"jidoka:snapshot:v1:" <> base64`. The body is
  `:erlang.term_to_binary/1` over the validated struct.
- Snapshots are validated for portability before serialization: pids,
  ports, references, and functions are rejected so a snapshot can never
  capture local-only runtime state.
- `from_input/1` accepts a struct, a map of attributes, a keyword list, or
  the opaque string returned by `serialize/1`. This is what makes
  `Jidoka.resume/2` flexible without leaking format details.
- The `cursor` field describes where the turn paused: `:after_prompt`,
  `:before_effect`, or `:review`. Resume reads it to decide whether to
  apply an approval response or continue with the pending effect.

## How To

### Step 1: Choose A Checkpoint Policy

The turn runner accepts one of four policies on `:checkpoint`:

- `:none` is the default. The turn runs to completion or to an error and
  only hibernates if an operation control returns an interrupt.
- `:after_prompt` hibernates immediately after the first prompt is
  assembled and before the first effect runs. Use this when you want to
  inspect or persist work before paying for any model call.
- `:after_each_phase` hibernates after prompt assembly and again before any
  pending effect. Use this for batch pipelines that resume one phase per
  job.
- `:before_each_effect` hibernates right before each pending effect.
  Use this for the tightest external durability boundary.

```elixir
{:hibernate, snapshot} =
  Jidoka.turn(MyApp.SupportAgent, "Look up A1001",
    llm: llm,
    operations: operations,
    checkpoint: :before_each_effect
  )

snapshot.cursor.phase
#=> :before_effect
```

### Step 2: Serialize And Persist

Snapshots survive any byte-safe transport. The serialized payload is opaque;
the contract is the `"jidoka:snapshot:v1:"` prefix and the `schema_version`
field.

```elixir
{:ok, payload} = Jidoka.Runtime.AgentSnapshot.serialize(snapshot)

:ok = MyApp.Queue.enqueue(job_id, payload)
```

`serialize/1` raises through `serialize!/1`, but production code should
prefer the tuple form so that a non-portable value (a stray pid in
metadata, for example) is surfaced as `{:error,
{:non_serializable_snapshot_value, _, _}}` rather than an exception.

### Step 3: Resume From Any Snapshot Input

`Jidoka.resume/2` accepts every shape `AgentSnapshot.from_input/1` accepts:

```elixir
# A struct.
{:ok, result} = Jidoka.resume(snapshot, llm: llm)

# Map-shaped attributes that match the schema.
{:ok, result} = Jidoka.resume(Map.from_struct(snapshot), llm: llm)

# The opaque serialized string.
{:ok, result} = Jidoka.resume(payload, llm: llm)
```

Resume runs through the same harness boundary as `Jidoka.turn/3`. Supply
the same runtime capabilities (`llm:`, `operations:`, and optionally
`memory_store:`) and, when resuming a review pause, an `:approval` option.

### Step 4: Continue, Hibernate, Or Error

Resume returns the same three outcomes as `turn/3`:

```elixir
case Jidoka.resume(snapshot, llm: llm, operations: operations) do
  {:ok, %Jidoka.Turn.Result{} = result} ->
    handle_result(result)

  {:hibernate, %Jidoka.Runtime.AgentSnapshot{} = snapshot} ->
    persist_again(snapshot)

  {:error, reason} ->
    log_failure(reason)
end
```

`{:hibernate, snapshot}` is normal: a single resume may hit another
checkpoint or another review interrupt. Always loop until you see `{:ok,
_}` or `{:error, _}`.

### Step 5: Honor Schema Versioning

The struct carries `schema_version: 1`. Anything else fails up front:

```elixir
Jidoka.Runtime.AgentSnapshot.new(%{
  schema_version: 99,
  snapshot_id: "snap_x",
  agent_id: "support",
  cursor: cursor,
  turn_state: turn_state
})
#=> {:error, {:unsupported_snapshot_schema_version, 99, 1}}
```

Likewise, `deserialize/1` only accepts the `"jidoka:snapshot:v1:"` prefix:

```elixir
Jidoka.Runtime.AgentSnapshot.deserialize("v0:garbage")
#=> {:error, :invalid_snapshot_serialization}
```

When the snapshot version eventually changes, older payloads will not be
silently coerced; Jidoka returns a versioned error and the application owns the
migration.

### Step 6: Reuse Snapshots In A Session

The session keeps snapshots in order. The latest snapshot is what
`Jidoka.Session.resume/2` continues from.

```elixir
{:hibernate, session, snapshot} =
  Jidoka.Session.chat(session_id, "Refund A1001",
    store: store,
    llm: llm,
    checkpoint: :after_prompt
  )

{:ok, ^snapshot} = Jidoka.Session.get(store, session_id) |> then(&{:ok, Jidoka.Harness.Session.latest_snapshot(elem(&1, 1))})
```

Use sessions when you want lifecycle, pending reviews, and metadata for
free. Use raw snapshots when the durable unit is a job, not a conversation.

## Common Patterns

- **Pair the snapshot with its request id.** The snapshot is portable, but
  observability ties to `request_id`. Persist both.
- **Round-trip in tests.** Always exercise `serialize/1` followed by
  `deserialize/1` in unit tests for any code that stashes snapshots. This
  catches non-portable metadata immediately.
- **Validate at boundaries, not in handlers.** Let `from_input/1` reject
  bad inputs at the entry point instead of pattern-matching deep inside
  application code.
- **Treat checkpoint policy as request-scoped.** Different callers can pick
  different policies against the same agent without redefining the spec.
- **Do not mutate snapshot fields.** Build a new struct with `new!/1` if
  you really need to project a derived shape; never reach into
  `turn_state` directly.

## Testing

A round-trip test gives you most of the value with very little setup:

```elixir
test "snapshot round-trips through opaque serialization" do
  llm = fn _intent, _journal ->
    {:ok, %{type: :final, content: "ok"}}
  end

  {:hibernate, snapshot} =
    Jidoka.turn(MyApp.SupportAgent, "Hello",
      llm: llm,
      checkpoint: :after_prompt
    )

  assert {:ok, serialized} = Jidoka.Runtime.AgentSnapshot.serialize(snapshot)
  assert String.starts_with?(serialized, "jidoka:snapshot:v1:")

  assert {:ok, ^snapshot} = Jidoka.Runtime.AgentSnapshot.deserialize(serialized)
  assert {:ok, %Jidoka.Turn.Result{content: "ok"}} = Jidoka.resume(serialized, llm: llm)
end
```

For approval flows, see the resume-with-`:approval` examples in
[Human In The Loop](human-in-the-loop.md).

## Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| `{:error, :invalid_snapshot_serialization}` | Payload does not start with `"jidoka:snapshot:v1:"`. | Re-serialize from the source `AgentSnapshot` or migrate the persisted row. |
| `{:error, {:non_serializable_snapshot_value, path, :pid}}` | A pid was placed in snapshot metadata or context. | Remove runtime references before persisting; keep only data. |
| `{:error, {:unsupported_snapshot_schema_version, n, 1}}` | Persisted snapshot was written under a different schema. | Migrate the persisted payload or discard the older snapshot. |
| `Jidoka.resume/2` returns `{:hibernate, _}` again | Checkpoint policy or review interrupt still in effect. | Loop until `{:ok, _}` or `{:error, _}`; supply `:approval` if waiting on review. |
| `{:error, {:missing_pending_effect, _}}` on resume | The snapshot was finalized or already consumed. | Start a new turn; do not resume a snapshot whose work has already completed. |

## Reference

Key modules touched in this guide:

- [`Jidoka.Runtime.AgentSnapshot`](`Jidoka.Runtime.AgentSnapshot`) -
  `new/1`, `from_input/1`, `serialize/1`, `deserialize/1`,
  `schema_version/0`, `from_turn_state/3`.
- [`Jidoka.Turn.Cursor`](`Jidoka.Turn.Cursor`) - the `cursor.phase` field
  on a snapshot (`:after_prompt`, `:before_effect`, `:review`).
- [`Jidoka.Turn.State`](`Jidoka.Turn.State`) - the inner runtime state a
  snapshot wraps.
- [`Jidoka.Harness`](`Jidoka.Harness`) - `resume/2` boundary that
  `Jidoka.resume/2` delegates to.
- [`Jidoka.Effect.Journal`](`Jidoka.Effect.Journal`) - replay-safe record
  of effect intents and results inside the snapshot.

## Related Guides

- [Sessions And Stores](sessions-and-stores.md) - the durable wrapper that
  owns snapshots in order.
- [Human In The Loop](human-in-the-loop.md) - resuming with an
  `:approval` response.
- [Idempotency And Safety](idempotency-and-safety.md) - how the journal
  decides which effects re-run on resume.
- [Runtime And Harness](runtime-and-harness.md) - architectural overview.