# The Ralph Loop

A "Ralph loop" is the dumbest agent that converges: send the same prompt
to the same model, over and over, until it says it's done. Named after
Ralph Wiggum — no planning, no memory, just persistence. Geoffrey Huntley
popularized the pattern; it's a few lines of code wrapped around one
piece of disk.

This guide is mostly about that piece of disk.

## The pattern in one sentence

A `TODO.md` lives on disk; a loop hands the LLM the same prompt each
iteration; the LLM reads `TODO.md`, does the top unchecked item, marks
it done, commits, and the loop runs again until no unchecked items
remain.

The interesting part isn't the loop. It's why `TODO.md` makes a
stateless loop converge on a non-trivial outcome.

## Why a file, not a conversation

The loop assumes the LLM remembers nothing between iterations. In the
strictest version of Ralph that is literally true, because each
iteration is a fresh agent. State has to live somewhere durable, and a
conversation buffer is the wrong place:

| Conversation | Filesystem |
|---|---|
| Bounded by context window | Unbounded |
| Lossy under compaction | Lossless |
| Dies with the process | Survives crashes |
| Opaque to humans | `cat TODO.md` |
| Not diffable | `git diff` |

If iteration 7 crashes mid-edit, iteration 8 reads `TODO.md` and resumes.
There is no "resume" code to write. The filesystem *is* the resume.

## The five-step contract

Each iteration does exactly this, in order:

1. **Read** `TODO.md`.
2. **Pick** the top unchecked item.
3. **Do** it — edit code, run tests, whatever the item requires.
4. **Mark** it `[x]`. Append any new subtasks discovered along the way.
5. **Commit** with the item text as the commit message.

Step 4's second clause is the one most readers gloss over. The list
typically *grows* for the first several iterations as the agent uncovers
complexity, then shrinks. A reader expecting monotonic burndown will
think it's broken on iteration 3. It isn't — Ralph is discovering the
shape of the problem.

This is also why the prompt is the same every iteration: there is
nothing iteration-specific to say. The contract is the prompt.
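
Steps 2 and 4 are plain text manipulation, which is part of why the contract is easy to audit. The sketch below is one possible reading of them, not code from SkillKit: `TodoSketch`, `next_item/1`, and `complete_top/2` are hypothetical names, and it assumes checkboxes appear only under `## MVP`, as in the examples in this guide.

```elixir
defmodule TodoSketch do
  # Hypothetical helper, not part of SkillKit.

  # Step 2: text of the first unchecked item, or nil when none remain.
  def next_item(markdown) do
    case Regex.run(~r/^- \[ \] (.+)$/m, markdown) do
      [_, text] -> text
      nil -> nil
    end
  end

  # Step 4: flip the first unchecked box to [x], then append any newly
  # discovered subtasks after the last checkbox line in the file.
  def complete_top(markdown, new_subtasks \\ []) do
    marked = Regex.replace(~r/^- \[ \] /m, markdown, "- [x] ", global: false)
    lines = String.split(marked, "\n")

    last_box =
      lines
      |> Enum.with_index()
      |> Enum.filter(fn {line, _idx} -> line =~ ~r/^- \[[ x]\] / end)
      |> List.last()

    case last_box do
      nil ->
        marked

      {_line, idx} ->
        {head, tail} = Enum.split(lines, idx + 1)
        added = Enum.map(new_subtasks, &"- [ ] #{&1}")
        Enum.join(head ++ added ++ tail, "\n")
    end
  end
end
```

Calling `complete_top/2` with a non-empty subtask list makes the file longer even though one box got checked, which is the early growth described above.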

## What a good `TODO.md` entry looks like

Items have to be **verifiable**, **single-iteration-sized**, and
**ordered by dependency**.

```markdown
## MVP

- [ ] Add `Foo.parse/1` that turns a binary into `{:ok, %Foo{}}` or `{:error, term}`
- [ ] Add unit tests covering empty input, malformed input, and the happy path
- [ ] Wire `Foo.parse/1` into the existing `Bar.ingest/1` pipeline
- [ ] Update the `Bar` doctest to reflect the new return shape

## FUTURE

- streaming parser
- benchmarks
```

What goes wrong without this discipline:

- **"Fix the API"** — too vague. The agent thrashes, marks it done
  without doing much, or expands it into ten items it then half-finishes.
- **"Rename `foo` to `bar` in `lib/x.ex` line 42"** — too small. That's a
  code review note, not an iteration.
- **Items in arbitrary order** — Ralph picks the top item, so dependency
  order is enforced by list order. If item 3 depends on item 5, you'll
  watch Ralph break item 3, give up, and mark it done anyway.

Keep nice-to-haves out of `## MVP`. Put them in `## FUTURE` (or a
separate `FUTURE.md`). Otherwise Ralph will keep finding work forever —
see *Livelock* below.

## Git is the backstop

Commit-per-iteration is non-negotiable. Three reasons:

1. **Bisect.** When the build breaks on iteration 23, you want
   `git bisect` to land on the exact iteration that broke it.
2. **Revert.** A bad iteration is one `git revert` away from gone. If
   five iterations are squashed into a single commit, you have to
   untangle them by hand.
3. **Audit.** The commit log *is* the record of what Ralph did. Every
   step has a message (the TODO item), a diff, and a timestamp.

The prompt should require it. If Ralph forgets to commit, the next
iteration starts on a dirty tree. Have the prompt or the driver treat
non-empty `git status --porcelain` output as a failure, which surfaces
the problem instead of hiding it.
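
A driver can make a forgotten commit loud by inspecting the working tree before each iteration. This is a minimal sketch, assuming `git` is on the `PATH`; `TreeCheck` and its functions are hypothetical names, not part of SkillKit.

```elixir
defmodule TreeCheck do
  # Hypothetical helper, not part of SkillKit.

  # `git status --porcelain` prints one line per changed or untracked
  # path and nothing at all when the working tree is clean.
  def dirty?(porcelain_output), do: String.trim(porcelain_output) != ""

  # Runs git in `dir` and reports whether uncommitted work was left behind.
  def check(dir \\ ".") do
    case System.cmd("git", ["status", "--porcelain"], cd: dir, stderr_to_stdout: true) do
      {out, 0} ->
        if dirty?(out), do: {:error, {:dirty_tree, out}}, else: :ok

      {out, status} ->
        {:error, {:git_failed, status, out}}
    end
  end
end
```

Keeping `dirty?/1` separate from the `System.cmd/3` call lets the decision be tested without a repository.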

## Failure modes

### Livelock by infinite subtasks

The agent keeps appending "while I'm in here, I should also..." items.
The list never shrinks.

**Fix:** A hard-coded `## MVP` section with an explicit definition of
done. The prompt says "DONE means every checkbox line under `## MVP`
starts with `- [x]`." Items the agent thinks of beyond that go to
`## FUTURE` and don't count.
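
Scoping the definition of done to one section can be checked mechanically. The following is a sketch under the assumption that sections are introduced by `## ` headings, as in the example list earlier; `MvpScope` is a hypothetical module, not part of SkillKit.

```elixir
defmodule MvpScope do
  # Hypothetical helper, not part of SkillKit.

  # Lines between the `## MVP` heading and the next `## ` heading.
  def mvp_lines(markdown) do
    markdown
    |> String.split("\n")
    |> Enum.drop_while(&(String.trim(&1) != "## MVP"))
    |> Enum.drop(1)
    |> Enum.take_while(&(not String.starts_with?(&1, "## ")))
  end

  # True when the MVP section has at least one checkbox and every
  # checkbox in it is checked. Anything under other headings is ignored.
  def mvp_done?(markdown) do
    boxes = Enum.filter(mvp_lines(markdown), &(&1 =~ ~r/^- \[[ x]\] /))
    boxes != [] and Enum.all?(boxes, &String.starts_with?(&1, "- [x] "))
  end
end
```

An empty `## MVP` section is treated as not done here. That is a design choice in this sketch, made so that a truncated or missing list does not read as success.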

### Premature DONE

The agent declares done with items still unchecked, because the prompt
said "say DONE when you're finished" and the LLM decided it was tired.

**Fix:** Make the sentinel mechanically checkable. Not "when you're
done" but "when `grep -c '^- \[ \]' TODO.md` prints 0 *and* `mix test`
exits 0."
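
The driver can run the same check itself rather than trusting the agent's word. A minimal sketch, assuming it runs from the project root where `mix test` works; `DoneGate` is a hypothetical module, not part of SkillKit.

```elixir
defmodule DoneGate do
  # Hypothetical helper, not part of SkillKit.

  # Number of unchecked boxes: one match per line that starts with "- [ ]",
  # the same count `grep -c '^- \[ \]'` prints.
  def unchecked(markdown), do: length(Regex.scan(~r/^- \[ \]/m, markdown))

  # Pure decision: done only with zero unchecked boxes AND a zero exit status.
  def done?(markdown, test_exit_status) do
    unchecked(markdown) == 0 and test_exit_status == 0
  end

  # Side-effecting wrapper: runs the suite, reads the file, applies done?/2.
  def check(path \\ "TODO.md") do
    {_output, status} = System.cmd("mix", ["test"], stderr_to_stdout: true)
    done?(File.read!(path), status)
  end
end
```

With a gate like this, a `DONE` reply becomes a claim the loop verifies instead of a signal it obeys.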

### Phantom completion

The agent marks an item `[x]` without doing the work. The diff for that
iteration is just the checkbox flip.

**Fix:** Two layers. First, the prompt requires the commit to include
the work, not just the checkbox. Second, the loop runs `mix test`
between iterations and refuses to proceed on red. (A verifier subagent
that reads the commit diff against the item text is the next step up.)
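
A cheap check that catches the checkbox-only commit before reaching for a verifier subagent is to look at which paths the last commit touched. This is a sketch assuming `git` is on the `PATH` and the TODO file is named `TODO.md`; `PhantomCheck` is a hypothetical module, not part of SkillKit.

```elixir
defmodule PhantomCheck do
  # Hypothetical helper, not part of SkillKit.

  # A commit that touched nothing except the TODO file did no work.
  # An empty path list (an empty commit) also counts as phantom.
  def phantom?(changed_paths, todo_path \\ "TODO.md") do
    Enum.all?(changed_paths, &(&1 == todo_path))
  end

  # Paths changed by the most recent commit in `dir`.
  def last_commit_paths(dir \\ ".") do
    args = ["diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"]
    {out, 0} = System.cmd("git", args, cd: dir)
    String.split(out, "\n", trim: true)
  end
end
```

This only detects the crudest case. A commit that changes one unrelated line alongside the checkbox still passes, which is where the test run and the verifier come in.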

### The wrong thing, correctly

Tests pass. Feature is wrong. Ralph cannot detect this — there's no
ground truth in the loop.

**Fix:** Human checkpoints, or a grader subagent that compares the diff
to the original spec. Ralph is for narrow, well-specified work; it is
not for "build me a product."

## Where `TODO.md` comes from

This is where most Ralph attempts fall over. Bad input, bad output.

Two reasonable starting points:

- **Human-written.** You sit down for fifteen minutes and write
  twenty checkboxes. This is the most reliable mode and the one Geoffrey
  Huntley uses for production work.
- **Planning pass.** A separate agent (or Ralph's iteration 0 with a
  different prompt) decomposes a goal into checkboxes. Cheap, but the
  list quality is only as good as the planner; budget time to edit it
  by hand before kicking off the loop.

Either way, **read the list before you run Ralph**. It will get done.
You want it to be the thing you actually wanted.

## Running it

The loop ships as a built-in mix task:

```bash
# Loop on an existing TODO.md in the current directory
mix skill_kit.ralph TODO.md

# Generate TODO.md from a prompt, then loop
mix skill_kit.ralph TODO.md --prompt "Add JSON parsing to lib/foo.ex with tests"

# Use a different agent (default: ralph)
mix skill_kit.ralph TODO.md --agent some-other-ralph
```

The contract lives in skills, not in the task. The task is a thin
driver that starts the agent, sends per-turn triggers, and watches
for the sentinel.

```
examples/agents/ralph/
├── AGENT.md                  # identity + skill routing
└── skills/
    ├── plan/SKILL.md         # write a TODO from a goal
    └── iterate/SKILL.md      # do one item: pick, edit, test, mark, commit
```

The `iterate` skill uses SkillKit's `` !`cmd` `` syntax to inline the
current TODO contents into the prompt at render time:

````markdown
TODO file path: $ARGUMENTS

Current contents:

```
!`cat $ARGUMENTS 2>/dev/null || echo "(file not found)"`
```
````

That keeps the iteration prompt fresh every turn without an extra
shell tool call.

The agent's job is to route — its `AGENT.md` says "if the user asks
to plan, activate `plan`; if to iterate, activate `iterate`; then
echo the skill's final word verbatim." That last clause is what lets
the driver detect `DONE` reliably without a fuzzy match.

## The loop itself

For completeness: it's footnote-sized. This version uses
`SkillKit.send_message/2` on a single long-running agent. That is cheap,
and although the conversation accumulates, `TODO.md` remains the source
of truth:

```elixir
defmodule Ralph do
  alias SkillKit.Event.Error, as: EventError
  alias SkillKit.Types.AssistantMessage

  # The same prompt is sent on every iteration; TODO.md carries the state.
  @prompt """
  Read TODO.md. Pick the top item under `## MVP` whose box is unchecked.
  Do it. Mark it [x]. Append any subtasks you discovered to `## MVP`.
  Stage and commit your work; the commit message is the item text.

  Reply with exactly the word DONE if and only if every checkbox line
  under `## MVP` starts with `- [x]` AND `mix test` exits 0.
  """

  def run(source) do
    {:ok, agent} = SkillKit.start_agent(source, tools: [{SkillKit.Tools.Shell, cwd: "."}])

    # Stop the agent even if the loop raises.
    try do
      loop(agent, 1)
    after
      SkillKit.stop_agent(agent)
    end
  end

  defp loop(agent, iter) do
    IO.puts("--- iter #{iter} ---")
    :ok = SkillKit.send_message(agent, @prompt)

    receive do
      # Sentinel: the reply begins with DONE.
      %AssistantMessage{content: "DONE" <> _} -> :done
      # Any other reply means there is more work; send the prompt again.
      %AssistantMessage{} -> loop(agent, iter + 1)
      %EventError{reason: reason} -> {:error, reason}
    end
  end
end
```

The classic-Ralph variant (fresh agent every iteration, zero
conversational memory) swaps the body of `loop/2` to call
`SkillKit.start_agent`, `send_message_sync`, and `stop_agent` per turn.
It is more expensive (a full supervision tree per iteration), but no
conversation state carries over from one iteration to the next.
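
The control flow of that variant can be sketched without committing to SkillKit's return shapes, which this guide does not show for `send_message_sync`. In the sketch below the per-turn work is injected as a function. `FreshLoop` and the `{:ok, reply}` / `{:error, reason}` convention are assumptions made for illustration, not SkillKit's API.

```elixir
defmodule FreshLoop do
  # Hypothetical driver, not part of SkillKit.
  #
  # `run_once` is a zero-arity function that is expected to start a fresh
  # agent, send the prompt, stop the agent, and return {:ok, reply_text}
  # or {:error, reason}. How it does that is left to the caller.
  def run(run_once, iter \\ 1) when is_function(run_once, 0) do
    case run_once.() do
      {:ok, reply} ->
        # Exact match after trimming whitespace, not a prefix or fuzzy match.
        if String.trim(reply) == "DONE" do
          {:done, iter}
        else
          run(run_once, iter + 1)
        end

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

Because nothing is threaded between calls except the iteration counter, anything an iteration needs to know has to come from `TODO.md` and the repository.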

There is no `Stream.take(50)`. There is no `:timer.minutes(10)`. The
exit conditions are the `DONE` sentinel, a `SkillKit.Event.Error` event,
or you hitting Ctrl-C because you ran out of API budget.

## Pacing

Don't put rate limiting in the loop. The loop is sequential — one
request in flight at a time — and `Anthropic.Client` already retries
429s with `Retry-After` honored (`lib/anthropic/client.ex:45`). That's
enough for a single Ralph.

The shape that needs more is *many concurrent Ralphs sharing an API
key*. There is no centralized LLM gateway in SkillKit today; each agent
hits the provider directly. If you fan out, expect collisions on the
shared budget and plan accordingly (separate keys, or build the
gateway).

## When not to use Ralph

- **Tasks without a verifier.** If `mix test` can't tell you it's
  working, Ralph can't either. You'll get green checkboxes and broken
  code.
- **Tasks that need taste.** Ralph optimizes for "ship the item." It
  will not push back, redesign, or notice the spec is wrong.
- **Tasks you haven't spec'd.** The entire premise is that `TODO.md`
  encodes intent. If you can't write the list, Ralph can't run it.

Ralph is a hammer for the narrow case where the work decomposes into
checkboxes a test suite can grade. Inside that case it is remarkably
effective. Outside it, it is an expensive way to produce a clean
commit history of wrong code.