# Testing And Evals
Use deterministic tests for agent behaviour. Inject a fake LLM and local
operations, assert the compiled spec shape, and run small eval cases without
calling a provider.
## When To Use This
- Use this guide when adding tests for a new agent, tool, control, or
memory contract.
- Use this guide when setting up regression coverage for the
DSL-to-`Jidoka.Agent.Spec` contract.
- Use this guide when building a small eval suite for CI.
- Do not use this guide for live model evaluations or benchmarking; those
belong in opt-in suites that explicitly require provider credentials.
## Prerequisites
- A working Jidoka project (see [Getting Started](getting-started.md)).
- Familiarity with the operation contract from
[Tools And Operations](tools-and-operations.md).
- No provider keys are required for any example below.
```bash
mix deps.get
mix test
```
## Quick Example
A minimal eval pins both capabilities, declares one assertion, and runs
the case through the same harness as production.
```elixir
defmodule MyApp.TimeAgent do
use Jidoka.Agent
agent :time_agent do
model "openai:gpt-4o-mini"
instructions "Call local_time when asked for the time."
end
tools do
action MyApp.Tools.LocalTime
end
end
operations =
Jidoka.Runtime.LocalOperations.operations(%{
"local_time" => fn _args -> {:ok, %{city: "Chicago", time: "09:30"}} end
})
llm = fn _intent, journal ->
case map_size(journal.results) do
0 -> {:ok, %{type: :operation, name: "local_time", arguments: %{}}}
_ -> {:ok, %{type: :final, content: "Chicago time is 09:30."}}
end
end
{:ok, run} =
Jidoka.Eval.run_case(
%{
id: "time_basic",
agent: MyApp.TimeAgent.spec(),
input: "What time is it?",
assertions: %{
contains: "09:30",
operation_called: "local_time"
}
},
llm: llm,
operations: operations
)
run.status
#=> :passed
```
The run is reproducible. The same inputs always produce the same
`Jidoka.Eval.Run`, so this example doubles as a regression test.
## Concepts
Deterministic testing in Jidoka uses four building blocks.
1. **Fake LLM function.** Every LLM capability is a 2-arity function
`fn intent, journal -> {:ok, decision} | {:error, reason} end`. The
decision shape is `%{type: :operation, name: ..., arguments: ...}` or
`%{type: :final, content: ...}`. The journal is the replay trace;
counting `map_size(journal.results)` is the standard way to drive
multi-step decisions.
2. **Local operation capability.** `Jidoka.Runtime.LocalOperations.operations/1`
wraps a map of `%{name => handler}` into a capability. Handlers may be
`(args -> term)` or `(intent, journal -> term)`. The same helper is what
`Jidoka.Operation.Source.Local` uses under the hood.
3. **Golden DSL-to-spec tests.** `Jidoka.project/1` produces compact,
deterministic maps from any Jidoka data. Snapshotting those projections
locks the DSL/import contract; changes show up as diffs in the golden
file.
4. **`Jidoka.Eval`.** `Jidoka.Eval.Case` packages an agent + request +
assertion set into one value. `Jidoka.Eval.run_case/2` runs the case
through the normal turn runtime and returns a `Jidoka.Eval.Run` with
status, evaluated assertions, and observations.
```diagram
╭──────────────────╮ ╭───────────────────╮ ╭──────────────────╮
│ Eval.Case data │────▶│ Jidoka.Eval │────▶│ turn runtime │
│ - agent (spec) │ │ .run_case/2 │ │ .run_turn/3 │
│ - request/input │ ╰─────────┬─────────╯ ╰────────┬─────────╯
│ - assertions │ │ │
╰──────────────────╯ │ ▼
│ {:ok, Turn.Result}
│ | {:hibernate, Snap}
│ | {:error, reason}
▼ │
╭───────────────────╮ │
│ evaluate/2 │◀────────────╯
│ - contains │
│ - equals │
│ - operation_called│
╰─────────┬─────────╯
▼
╭───────────────────╮
│ Jidoka.Eval.Run │
│ status: │
│ :passed │
│ :failed │
│ :error │
╰───────────────────╯
```
### Three Kinds Of Outcome
`Jidoka.Eval.Run.status` is one of:
- `:passed` - the harness returned `{:ok, %Turn.Result{}}` and every
evaluated assertion passed.
- `:failed` - the harness returned `{:ok, _result}` but at least one
assertion failed. The `:assertions` list contains `:passed`/`:failed`
entries with `:expected` and `:actual`.
- `:error` - the harness did not produce a result. Two subcases live here:
- **Input validation errors** (`{:error, %Jidoka.Error.Invalid{}}` from
request normalization, context schema mismatch, or spec compilation).
`run.error` is the projected error map.
- **Execution errors** (`{:error, reason}` from the operation or LLM
capability). `run.error` carries the same shape.
- **Hibernation outcomes** (`{:hibernate, snapshot}` from an operation
control returning `{:interrupt, ...}`). `run.error` is
`%{reason: :hibernated, snapshot: ...}`. The eval does not resume
automatically; treat hibernation as a non-pass outcome and feed the
snapshot into a `Jidoka.resume/2` test if you need to drive the rest.
## How To
### Step 1: Author A Fake LLM
The simplest fake returns one decision regardless of journal:
```elixir
llm = fn _intent, _journal ->
{:ok, %{type: :final, content: "pong"}}
end
```
For multi-step tests, branch on `map_size(journal.results)`:
```elixir
llm = fn _intent, journal ->
case map_size(journal.results) do
0 -> {:ok, %{type: :operation, name: "local_time", arguments: %{}}}
1 -> {:ok, %{type: :final, content: "09:30"}}
end
end
```
You can also branch on intent metadata or the journal contents when you
need to assert specific tool arguments came back. The fake is just a
function; complexity lives in your test, not in a mock framework.
### Step 2: Provide Local Operations
`Jidoka.Runtime.LocalOperations.operations/1` is the helper for local
operation tests:
```elixir
operations =
Jidoka.Runtime.LocalOperations.operations(%{
"local_time" => fn _args -> {:ok, %{time: "09:30"}} end,
"echo" => fn %{"phrase" => phrase} -> {:ok, %{echoed: phrase}} end
})
```
Handlers can be `(args -> term)` or `(intent, journal -> term)`. A return
value that is not `{:ok, _}` or `{:error, _}` is wrapped in `{:ok, value}`.
Pass it to `turn/3` (or to `Jidoka.Eval.run_case/2`) as `operations:`. The
runtime routes any intent with `kind: :operation` through this capability.
### Step 3: Write A Golden DSL-To-Spec Test
The DSL is data-first; the most effective regression test compares the
projected spec to a snapshot.
```elixir
defmodule MyApp.Golden.TimeAgentTest do
use ExUnit.Case, async: true
test "compiled spec matches the golden projection" do
projection =
MyApp.TimeAgent.spec()
|> Jidoka.project()
|> drop_volatile_fields()
expected = %{
id: "time_agent",
operations: [
%{name: "local_time", idempotency: :idempotent}
]
}
assert match?(^expected, projection)
end
defp drop_volatile_fields(%{} = projection) do
Map.update!(projection, :operations, fn operations ->
Enum.map(operations, &Map.take(&1, [:name, :idempotency]))
end)
|> Map.take([:id, :operations])
end
end
```
In the Jidoka repository, `test/jidoka/golden/dsl_to_spec_test.exs` asserts
the full projection against a recorded snapshot.
### Step 4: Use Jidoka.Eval.run_case For Behavior Tests
`Jidoka.Eval.run_case/2` accepts a `Jidoka.Eval.Case` struct, a map, or a
keyword list. Three assertion kinds are supported today:
- `contains: "substring"` (or a list of substrings) - asserts
`result.content` contains each.
- `equals: "exact content"` - asserts `result.content` equals the value.
- `operation_called: "name"` (or a list) - asserts each name appears in
`result.agent_state.operation_results`.
```elixir
{:ok, run} =
Jidoka.Eval.run_case(
%{
id: "time_basic",
agent: MyApp.TimeAgent.spec(),
input: "What time is it?",
assertions: %{
contains: ["09:30", "Chicago"],
operation_called: ["local_time"]
}
},
llm: llm,
operations: operations
)
run.status
#=> :passed
run.observations
#=> %{content: "Chicago time is 09:30.", operation_calls: ["local_time"], ...}
```
The `Run` struct also carries `result` (the full `Turn.Result`),
`assertions` (with `:expected` and `:actual`), and `metadata` so test
output can stay close to the source data.
### Step 5: Distinguish Outcome Kinds
When a test fails, look at `run.status` and `run.error` first:
```elixir
case Jidoka.Eval.run_case(case_input, llm: llm, operations: operations) do
{:ok, %Jidoka.Eval.Run{status: :passed} = run} -> {:ok, run}
{:ok, %Jidoka.Eval.Run{status: :failed, assertions: as}} -> {:failed, as}
{:ok, %Jidoka.Eval.Run{status: :error, error: %{reason: :hibernated} = e}} ->
{:hibernated, e.snapshot}
{:ok, %Jidoka.Eval.Run{status: :error, error: e}} -> {:execution_error, e}
{:error, reason} -> {:case_validation_error, reason}
end
```
`{:error, reason}` from `run_case/2` itself is the **case validation**
path - the input could not be normalized into a `Jidoka.Eval.Case`. The
three statuses inside the run cover the runtime outcomes.
### Step 6: Build A Small Eval Suite
Eval cases are plain data, so they compose well into a regular ExUnit
suite. Iterate the case list, attach the agent spec, and assert on
`run.status`. `Jidoka.Eval` is not a replacement for ExUnit, just a
packaging convenience for the agent/request/assertions trio.
## Common Patterns
- **One fake per scenario.** Resist building a single mega-fake. Each test
is clearest when the LLM function shows exactly the decisions that
matter for that case.
- **Use the journal as the state machine.** `map_size(journal.results)`
and `Map.values(journal.results)` are usually enough to branch decisions
without inventing a separate test state.
- **Inspect before asserting.** When an assertion fails, run
`Jidoka.inspect(run.result)` to see the timeline, then refine the
assertion or the fake.
- **Project, then snapshot.** Golden tests should compare
`Jidoka.project/1` output, not raw structs.
- **Treat hibernation as data.** When a test deliberately exercises a
control interrupt, assert on `run.error.reason == :hibernated` and use
`Jidoka.resume/2` in a follow-up test to drive the resume path.
## Testing
The dedicated tests under `test/jidoka/eval` exercise this guide's surface
end to end. The recipe is short: build a spec, pin an LLM and operations
capability, then assert on `Jidoka.Eval.Run.status`.
```elixir
test "passes when content and operations match" do
operations =
Jidoka.Runtime.LocalOperations.operations(%{
"echo" => fn %{"phrase" => phrase} -> {:ok, %{echoed: phrase}} end
})
llm = fn _intent, journal ->
case map_size(journal.results) do
0 -> {:ok, %{type: :operation, name: "echo", arguments: %{"phrase" => "hi"}}}
_ -> {:ok, %{type: :final, content: "hi"}}
end
end
spec =
Jidoka.agent!(
id: "echo_agent",
instructions: "Echo the user's input.",
operations: [Jidoka.Agent.Spec.Operation.new!(name: "echo")]
)
assert {:ok, %Jidoka.Eval.Run{status: :passed}} =
Jidoka.Eval.run_case(
%{id: "echo_basic", agent: spec, input: "hi",
assertions: %{contains: "hi", operation_called: "echo"}},
llm: llm,
operations: operations
)
end
```
For tests that need to inspect the full run shape, project it with
`Jidoka.project(run)`.
## Troubleshooting
| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| `{:error, %Jidoka.Error.Invalid{}}` from `run_case/2` | The case input was malformed (missing `:agent`, invalid `:input`). | Verify the case keys; `agent:` is required and must be a spec or compatible map. |
| `run.status == :error` with `error.reason == :hibernated` | An operation control returned `{:interrupt, _}`. | Either remove the control for the test or assert on hibernation and resume in a follow-up. |
| `run.status == :error` with a Splode error map | The LLM or operation capability returned `{:error, _}`. | Inspect `run.error.details`; the capability is the fastest place to fix. |
| Assertions report `:passed` but content is wrong | The fake LLM returned the expected string by accident even when the operation was never called. | Add `operation_called:` to lock down the path. |
| Golden test fails after an unrelated change | Volatile fields (ids, timestamps) leaked into the snapshot. | Project the spec, drop the volatile keys, then assert. |
## Reference
- [`Jidoka.Eval`](`Jidoka.Eval`) - `run_case/2` and `evaluate/2`.
- [`Jidoka.Eval.Case`](`Jidoka.Eval.Case`) - case schema, `new/2`,
`new!/2`, `from_input/2`.
- [`Jidoka.Eval.Run`](`Jidoka.Eval.Run`) - run schema, `:passed | :failed |
:error` status, assertions, observations.
- [`Jidoka.Runtime.LocalOperations`](`Jidoka.Runtime.LocalOperations`) -
`operations/1` helper that wraps a handler map.
- [`Jidoka.Operation.Source.Local`](`Jidoka.Operation.Source.Local`) -
source-shaped wrapper around the same handlers.
- [`Jidoka.Projection`](`Jidoka.Projection`) - data projector used by
golden tests.
- [`Jidoka`](`Jidoka`) - public facade: `turn/3`, `chat/3`, `resume/2`,
`inspect/2`, `project/1`.
## Related Guides
- [Tools And Operations](tools-and-operations.md) - shape of the operation
contract under test.
- [Memory](memory.md) - test patterns for memory-backed turns.
- [Handoffs](handoffs.md) - testing ownership transitions.
- [Inspection And Preflight](inspection-and-preflight.md) - debugging
failures before adding assertions.
- [Runtime And Harness](runtime-and-harness.md) - hibernation and resume
flows referenced by error-status cases.