# Retry
Failure-and-retry agent. `handle_error/3` returns
`{:prompt, retry_text, state}` to self-chain a new attempt. The
retry decision lives on agent state: attempt count, accumulated
errors, and a configurable cap.
## When to reach for this
You expect transient failures (rate limits, network blips,
flaky backends) and want the agent to absorb them without the
manager having to notice and retry. The retry logic benefits
from being stateful -- you want to count attempts, track the
sequence of errors, apply backoff, and give up after a cap.
The pattern hinges on one gen_agent primitive: `handle_error/3`
has the same return shape as `handle_response/3`, so returning
`{:prompt, text, state}` from `handle_error` self-chains a retry
turn just as cleanly as self-chaining after a success.
## What it exercises in gen_agent
- **`handle_error/3` returning `{:prompt, ..., state}`** as a
retry primitive -- the whole pattern is this one return shape.
- **Attempt counting and error accumulation** on agent state.
- **Giving-up semantics** via `{:halt, %{state | phase: :failed}}`
when `max_attempts` is reached.
- **Backoff** via `Process.sleep/1` inside `handle_error/3`
before returning the retry prompt (the agent's own process is
the one doing the sleep, so it's naturally rate-limited).
## The pattern
One callback module. No facade needed -- the manager just starts
the agent and polls status.
```elixir
defmodule Retry.Agent do
  use GenAgent

  defmodule State do
    defstruct [
      :task,
      :max_attempts,
      :result,
      phase: :running,
      attempts: 0,
      errors: []
    ]
  end

  @impl true
  def init_agent(opts) do
    state = %State{
      task: Keyword.fetch!(opts, :task),
      max_attempts: Keyword.get(opts, :max_attempts, 3)
    }

    system = "You are a persistent assistant. Answer concisely in 1-2 sentences."

    backend_opts = [
      system: system,
      max_tokens: Keyword.get(opts, :max_tokens, 100)
    ]

    {:ok, backend_opts, state}
  end

  @impl true
  def handle_response(_ref, response, %State{} = state) do
    new_attempts = state.attempts + 1
    text = String.trim(response.text)

    {:halt, %{state | result: text, attempts: new_attempts, phase: :succeeded}}
  end

  @impl true
  def handle_error(_ref, reason, %State{} = state) do
    new_attempts = state.attempts + 1
    new_errors = state.errors ++ [reason]
    new_state = %{state | attempts: new_attempts, errors: new_errors}

    if new_attempts < state.max_attempts do
      # Optional: exponential backoff before retrying.
      backoff_ms = :math.pow(2, new_attempts - 1) |> round() |> Kernel.*(1000)
      Process.sleep(backoff_ms)
      retry_prompt = "The previous attempt failed. Retry the task: #{state.task}"
      {:prompt, retry_prompt, new_state}
    else
      {:halt, %{new_state | phase: :failed}}
    end
  end
end
```
## Using it
```elixir
{:ok, _pid} =
  GenAgent.start_agent(Retry.Agent,
    name: "retry-haiku",
    backend: GenAgent.Backends.Anthropic,
    task: "write a haiku about persistence",
    max_attempts: 5
  )

# Kick off the first attempt.
{:ok, _ref} = GenAgent.tell("retry-haiku", "write a haiku about persistence")

# Wait for phase in [:succeeded, :failed], then read the result.
%{agent_state: state} = GenAgent.status("retry-haiku")

IO.inspect(%{
  phase: state.phase,
  attempts: state.attempts,
  errors: state.errors,
  result: state.result
})
GenAgent.stop("retry-haiku")
```
## Testing without burning tokens
In the playground we test this pattern by injecting a stateful
`http_fn` into the Anthropic backend. The function counts calls
and fails the first N with a synthetic `{:http_error, 429, ...}`,
then succeeds. No real backend calls, no tokens spent.
```elixir
defp failing_http_fn(fail_count) do
  # Stateful counter so the closure can track how many calls it has seen.
  {:ok, counter} = Elixir.Agent.start_link(fn -> 0 end)

  fn _request ->
    n = Elixir.Agent.get_and_update(counter, fn n -> {n, n + 1} end)

    if n < fail_count do
      {:error,
       {:http_error, 429,
        %{"error" => %{"type" => "rate_limit_error", "message" => "simulated (call #{n + 1})"}}}}
    else
      {:ok, canned_success_response(n + 1)}
    end
  end
end
```
Pass it as `http_fn: failing_http_fn(2)` in your start opts when
testing. Worth cribbing for any unit/integration test of
retry logic.
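The snippet above leans on a `canned_success_response/1` helper that isn't shown. A minimal sketch, assuming the backend hands `http_fn` results back as a decoded, string-keyed map shaped like the Anthropic Messages API response body -- the exact fields are an assumption; match them to whatever your backend extracts `response.text` from:
```elixir
# Hypothetical helper: a decoded Messages-API-shaped success body.
# Field names assume the backend reads `content` text blocks; adjust
# to match what GenAgent.Backends.Anthropic actually parses.
defp canned_success_response(call_number) do
  %{
    "id" => "msg_test_#{call_number}",
    "type" => "message",
    "role" => "assistant",
    "content" => [
      %{"type" => "text", "text" => "Succeeded on call #{call_number}."}
    ],
    "stop_reason" => "end_turn",
    "usage" => %{"input_tokens" => 0, "output_tokens" => 0}
  }
end
```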
## Variations
- **Exponential backoff with jitter.** Instead of a fixed
`2^(n-1) * 1000`, add random jitter: `sleep(backoff + :rand.uniform(500))`.
Avoids thundering-herd when many agents retry simultaneously.
- **Error-class-aware retry.** Not every error should be
retried. Pattern-match on the reason in `handle_error/3`:
`:interrupted` and `:timeout` should probably not retry,
`{:http_error, 429, _}` and `{:http_error, 503, _}` should.
Treat non-retryable errors as immediate halt.
- **Retry budget.** Instead of a count cap, a time budget: halt
if `System.monotonic_time(:millisecond) - state.started_at`
exceeds the budget (record `started_at` in the same unit in
`init_agent/1` -- the no-argument `System.monotonic_time/0`
returns native units, not seconds). More natural for "best
effort within a deadline."
- **Different prompt on retry.** The retry prompt could
incorporate the error, e.g. "The previous attempt failed with:
#{inspect(reason)}. Try a different approach." -- useful when
the failure is prompt-shaped rather than transport-shaped.
- **Retry with a different backend.** If the first backend
errors three times, swap to a fallback backend. Requires the
callback module to track which backend it started with and
to restart the session mid-run, which is more work than the
minimal pattern shown here.
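The first, second, and fourth variations compose naturally in one `handle_error/3` clause. A sketch, assuming the same `State` struct as the pattern above; the `retryable?/1` helper and its status-code list are illustrative choices, not part of gen_agent:
```elixir
@impl true
def handle_error(_ref, reason, %State{} = state) do
  new_attempts = state.attempts + 1
  new_state = %{state | attempts: new_attempts, errors: state.errors ++ [reason]}

  cond do
    # Error-class-aware: non-retryable reasons halt immediately.
    not retryable?(reason) ->
      {:halt, %{new_state | phase: :failed}}

    new_attempts < state.max_attempts ->
      # Exponential backoff plus up to 500 ms of random jitter,
      # so a fleet of agents doesn't retry in lockstep.
      backoff_ms = round(:math.pow(2, new_attempts - 1)) * 1000 + :rand.uniform(500)
      Process.sleep(backoff_ms)

      # Retry prompt that carries the error, in case the failure
      # is prompt-shaped rather than transport-shaped.
      retry_prompt =
        "The previous attempt failed with: #{inspect(reason)}. " <>
          "Try a different approach to the task: #{state.task}"

      {:prompt, retry_prompt, new_state}

    true ->
      {:halt, %{new_state | phase: :failed}}
  end
end

# Only transient, transport-level errors are worth retrying.
defp retryable?({:http_error, status, _body}) when status in [429, 500, 502, 503], do: true
defp retryable?(_reason), do: false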