# Evals
Evals are the test counterpart to skills. Where a `SKILL.md` injects
instructions into an agent, an `EVAL.md` describes behaviors the skill should
produce and the criteria for success. The eval harness loads the skill under
test into a fresh agent, sends each case's prompt, and asks an LLM judge
whether the resulting transcript meets the criteria.
`SkillKit.Eval.Case` plugs evals into ExUnit, so `mix test` runs your skill
evals alongside your unit tests.
SkillKit dogfoods its own harness: the skills under `examples/skills/` carry
colocated `EVAL.md` suites, wired up in `test/examples/skills_eval_test.exs`.
Run them against a real provider with
`mix test test/examples/skills_eval_test.exs --only eval`.
## Writing an eval
An `EVAL.md` is a *suite* of cases. Each `##` heading is one case (its text is
the case name); under it, a `### Prompt` section is the message sent to the
agent and a `### Expect` section is the rubric the LLM judge scores against.
When the `EVAL.md` lives next to the `SKILL.md` it tests, that's all you need —
no frontmatter:
```
skills/greeter/
  SKILL.md
  EVAL.md
```
```markdown
## greets the user by name
### Prompt
Hi, I'm Sam
### Expect
The assistant greets the user by their name in a warm, friendly tone.
## handles a missing name
### Prompt
Hello there
### Expect
The assistant greets politely without inventing a name.
```
Headings named `Prompt` / `Expect` (case-insensitive, any level) are section
markers; every other `##` heading starts a new case. Other heading levels
inside a section stay part of its content, so a `### Step 1` inside a prompt is
just prompt text.
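
The heading rules above can be sketched as a small line-by-line parser. This is a simplified illustration, not SkillKit's actual parser: the `EvalSketch.Parser` module and its map shape are hypothetical, and it ignores frontmatter and fenced code blocks.

```elixir
defmodule EvalSketch.Parser do
  # Hypothetical module: a simplified reading of the heading rules,
  # not SkillKit's parser. Frontmatter and code fences are not handled.

  # Parses EVAL.md text into a list of %{name, prompt, expect} maps.
  def parse(text) do
    text
    |> String.split("\n")
    |> Enum.reduce({[], nil}, &step/2)
    |> elem(0)
    |> Enum.reverse()
    |> Enum.map(&finish/1)
  end

  # The accumulator is {cases (newest first), current section or nil}.
  defp step(line, {cases, section}) do
    case classify(line) do
      {:marker, name} -> {cases, name}
      {:case, name} -> {[%{name: name, prompt: [], expect: []} | cases], nil}
      :text -> {append(cases, section, line), section}
    end
  end

  defp classify(line) do
    case Regex.run(~r/^(#+)\s+(.+?)\s*$/, line) do
      [_, hashes, title] ->
        cond do
          # Prompt / Expect at any heading level, case-insensitive.
          String.downcase(title) == "prompt" -> {:marker, :prompt}
          String.downcase(title) == "expect" -> {:marker, :expect}
          # Any other level-2 heading starts a new case.
          hashes == "##" -> {:case, title}
          # Other heading levels are ordinary section content.
          true -> :text
        end

      nil ->
        :text
    end
  end

  defp append([current | rest], section, line) when section in [:prompt, :expect] do
    [Map.update!(current, section, &[line | &1]) | rest]
  end

  # Text outside a Prompt/Expect section is dropped in this sketch.
  defp append(cases, _section, _line), do: cases

  defp finish(c), do: %{c | prompt: join(c.prompt), expect: join(c.expect)}

  defp join(lines), do: lines |> Enum.reverse() |> Enum.join("\n") |> String.trim()
end
```

Calling `EvalSketch.Parser.parse/1` on the two-case suite shown earlier yields two maps, one per `##` heading, with a `#### Step 1` line inside a prompt kept as prompt text.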
### Optional frontmatter
To test a skill that isn't colocated, or to add tools or pin a model, use
frontmatter — every field is optional:
```markdown
---
skills:
- "skills/greeter"
tools:
- "SkillKit.Tools.Shell"
model: "anthropic:claude-sonnet-4-6"
system: "You are being evaluated."
---
## greets the user by name
...
```
| Field | Notes |
|-------|-------|
| `skills` | Skill providers under test — paths (`"skills/greeter"`) or module names (`"SkillKit.Tools.Shell"`). Overrides the colocated `SKILL.md`. |
| `tools` | Tool providers, same forms as `skills`. |
| `model` | Model URI for the eval agent; falls back to the default provider. |
| `system` | System prompt for the eval agent. |
The skill under test resolves in this order: explicit `skills:` frontmatter, else
a `SKILL.md` sitting next to the `EVAL.md`, else nothing.
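
As a rough illustration of that precedence, the lookup might be written as below. The `EvalSketch.Resolve` module, the string-keyed frontmatter map, and the return shape are all assumptions made for this sketch, not SkillKit's internals.

```elixir
defmodule EvalSketch.Resolve do
  # Hypothetical helper illustrating the precedence; not part of SkillKit.
  # Assumes frontmatter has been decoded into a map with string keys.
  def skills_under_test(frontmatter, eval_path) do
    explicit = Map.get(frontmatter, "skills", [])
    dir = Path.dirname(eval_path)

    cond do
      # 1. Explicit `skills:` frontmatter wins.
      explicit != [] -> explicit
      # 2. Otherwise a SKILL.md sitting next to the EVAL.md.
      File.exists?(Path.join(dir, "SKILL.md")) -> [dir]
      # 3. Otherwise no skill is loaded.
      true -> []
    end
  end
end
```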
## Colocating evals with code
When an eval exercises **application code** — a tool module, or a skill whose
behavior runs through your modules — keep the eval next to that code. The eval
then anchors to the module, and the eval cache keys on the module's compiled
hash (`module.module_info(:md5)`): change the code and the eval re-runs; leave
it untouched and a prior pass is reused. No dependency lists to maintain.
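
To see what such a key looks like, the compiled MD5 of any loaded module can be read with the standard `module_info/1` function. The `EvalSketch.Anchor` wrapper below is a hypothetical helper for illustration; how SkillKit encodes or stores the value is not shown here.

```elixir
defmodule EvalSketch.Anchor do
  # Hypothetical helper: hex-encodes the MD5 the compiler records for a module.
  def module_key(module) when is_atom(module) do
    # Raises if the module cannot be loaded.
    Code.ensure_loaded!(module)
    module.module_info(:md5) |> Base.encode16(case: :lower)
  end
end

# Any compiled module works; `String` is used only as a stand-in.
IO.puts(EvalSketch.Anchor.module_key(String))
```

The value stays the same until the module is recompiled with different code, which is what makes it usable as a cache anchor.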
Two forms, both setting the eval's subject `module`:
**`@eval` attribute** — the eval lives in the module, doctest-style:
```elixir
defmodule MyApp.Greeter do
  use SkillKit.Eval

  @eval """
  ## greets the user by name
  ### Prompt
  Hi, I'm Sam
  ### Expect
  Greets the user by name.
  """
  def greet(name), do: ...
end
```
**Sidecar file** — keep the markdown in a file named after the source file. The
subject module is read from the sibling `.ex`; no frontmatter:
```
lib/my_app/
  greeter.ex        # defmodule MyApp.Greeter
  greeter.EVAL.md   # ## greets the user by name …
```
If the subject module is itself a `SkillKit.Tool` it's offered to the agent as a
tool; if it's a kit/skill provider it's loaded as a skill. Either way the
module's MD5 anchors the cache. Discover `@eval` modules with
`use SkillKit.Eval.Case, modules: [MyApp.Greeter]`; sidecars are found by
pointing `dir:` at your source tree (e.g. `dir: "lib"`). For the rare case where
the sidecar can't sit beside its `.ex`, an explicit `module:` frontmatter key
still works.
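
The sidecar convention can be pictured as a path mapping plus a read of the sibling source. The sketch below is hypothetical (`EvalSketch.Sidecar` is not a SkillKit module), and using a regex to find the `defmodule` name is an assumption; the real harness may resolve the module differently.

```elixir
defmodule EvalSketch.Sidecar do
  # Hypothetical helper, not SkillKit's discovery code.

  # Maps "lib/my_app/greeter.EVAL.md" to its sibling "lib/my_app/greeter.ex".
  def source_for(eval_path) do
    if String.ends_with?(eval_path, ".EVAL.md") do
      {:ok, String.replace_suffix(eval_path, ".EVAL.md", ".ex")}
    else
      :error
    end
  end

  # Pulls the first `defmodule` name out of Elixir source text.
  # Assumption: a regex is enough for a sketch; parsing the AST would be sturdier.
  def subject_module(source) do
    case Regex.run(~r/^\s*defmodule\s+([A-Z][\w.]*)/m, source) do
      [_, name] -> {:ok, Module.concat([name])}
      nil -> :error
    end
  end
end
```

A plain `EVAL.md` (no `name.` prefix) returns `:error` from `source_for/1`, so it would fall through to the colocated-skill or frontmatter paths instead.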
## Evaluating whole agents
To eval an agent rather than a single skill, drop an `EVAL.md` next to its
`AGENT.md`:
```
agents/researcher/
  AGENT.md
  EVAL.md
```
The runner boots the **whole agent** — its `AGENT.md` identity, skills, and
sub-agents — via `SkillKit.start_agent/2`, sends the prompt, and judges the
transcript. The eval anchors to the agent directory, so the cache keys on its
contents (`AGENT.md` + every skill under it); change anything the agent is made
of and the eval re-runs.
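
One way to key a cache on a directory's contents is to fold every file's relative path and bytes into a single digest. This is a minimal sketch under stated assumptions: `EvalSketch.DirHash` is hypothetical, SHA-256 is chosen arbitrarily, and the files SkillKit actually includes may differ.

```elixir
defmodule EvalSketch.DirHash do
  # Hypothetical helper: one digest over every regular file under `dir`.
  # Note: Path.wildcard/1 skips dotfiles by default.
  def fingerprint(dir) do
    dir
    |> Path.join("**/*")
    |> Path.wildcard()
    |> Enum.filter(&File.regular?/1)
    # Sort so the digest does not depend on filesystem listing order.
    |> Enum.sort()
    |> Enum.reduce(:crypto.hash_init(:sha256), fn path, acc ->
      acc
      |> :crypto.hash_update(Path.relative_to(path, dir))
      |> :crypto.hash_update(File.read!(path))
    end)
    |> :crypto.hash_final()
    |> Base.encode16(case: :lower)
  end
end
```

Editing `AGENT.md` or any file beneath the directory changes the digest, so a recorded pass no longer matches.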
The colocated `AGENT.md` is inferred automatically; point elsewhere with an
`agent:` frontmatter key. The model is taken from `:run`/frontmatter (so the
eval hits a known provider) and otherwise falls back to the agent's own model.
```markdown
---
agent: "agents/researcher"
---
## cites sources
### Prompt
What logging library does this project use?
### Expect
Names the library and cites the file where it's configured.
```
## Running evals as tests
Point `SkillKit.Eval.Case` at a directory of evals:
```elixir
defmodule MyApp.SkillEvalTest do
  use SkillKit.Eval.Case, dir: "skills"
end
```
This discovers every case under `dir` at compile time and defines one test per
case. Test names are qualified by the eval file's directory (e.g.
`"greeter: greets the user by name"`) so cases from different files don't
collide. Each test runs the case through `SkillKit.Eval.Runner` and asserts
that all of its checks pass.
Generated tests are tagged `:eval`. Because they drive a real agent and an LLM
judge, exclude them from the default suite and opt in explicitly:
```elixir
# test_helper.exs
ExUnit.start(exclude: [:eval])
```
```bash
# run only the skill evals against a real provider
ANTHROPIC_API_KEY=... mix test --only eval
```
Because the default test provider is the mock, pin the agent (and judge) to an
explicit provider URI so the cases hit the real API:
```elixir
use SkillKit.Eval.Case,
  dir: "skills",
  run: [
    model: "anthropic:claude-sonnet-4-6",
    judge_model: "anthropic:claude-sonnet-4-6"
  ]
```
Eval skills are loaded from real files on disk, but the test environment
defaults to in-memory storage. `SkillKit.Eval.Case` handles this for you: a
per-test `setup` swaps in `SkillKit.Storage.File` while each `:eval` test runs
(and restores the prior provider after), so colocated `SKILL.md` files resolve.
Pass `storage: false` to the macro to leave your configured provider in place,
or `storage: MyApp.Storage` to swap in a different one.
Forward options to the runner with `:run`:
```elixir
use SkillKit.Eval.Case, dir: "skills", run: [timeout: 60_000]
```
## How scoring works
For each case the runner produces a `SkillKit.Eval.Result` made of
`SkillKit.Eval.Check`s. The case passes only when **every** check passes:
1. **Completion**: the agent produced a response (not an error or timeout).
   A run that doesn't complete fails here and is not sent to the judge.
2. **LLM judge**: `SkillKit.Eval.Judge` gives a model the user prompt, the
   tools the agent called, and its final response, and asks whether the
   transcript satisfies the `### Expect` rubric. The verdict is
   **severity-weighted** and always resolves to pass or fail:
   - `FAIL` is reserved for *critical* shortfalls: a security or safety
     problem, a vulnerability, incorrect/harmful output, or a critical failure
     to do what the rubric asks.
   - Everything else `PASS`es. When the substance is right but the transcript
     deviates in a non-critical way (different wording, optional suggestions,
     extra caveats, hypothetical edge cases), the judge passes it and attaches
     a one-line `WARNING:`. The rubric sets the bar for *substance*, not exact
     wording the agent must reproduce.

   This keeps a capable agent from failing over non-critical nitpicks while
   still hard-failing genuinely bad behavior.
When a check fails, ExUnit prints the failing checks and the captured
transcript (prompt, tools called, response) via
`SkillKit.Eval.Result.failure_message/1`. Warnings on a *passing* eval are
printed too — ExUnit shows nothing for a pass otherwise — and are available via
`SkillKit.Eval.Result.warnings/1`. Pass `run: [judge: false]` to skip the judge
— a cheap smoke test that the agent responds at all without spending judge
tokens.
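
The pass/fail logic described in this section can be modeled in a few lines. The sketch below uses plain maps rather than `SkillKit.Eval.Check` structs, and the judge reply format (a `PASS` or `FAIL` line plus optional `WARNING:` lines) is an assumption made for illustration.

```elixir
defmodule EvalSketch.Scoring do
  # Hypothetical shapes: checks are plain maps, not SkillKit.Eval.Check structs.

  # A completed run gets a completion check plus a judge check.
  def checks({:ok, response}, judge) when is_function(judge, 1) do
    [%{name: :completion, passed?: true, warnings: []}, judge_check(judge.(response))]
  end

  # A run that errored or timed out fails completion and never reaches the judge.
  def checks({:error, _reason}, _judge) do
    [%{name: :completion, passed?: false, warnings: []}]
  end

  # Assumed reply format: only a line starting with "FAIL" fails the check;
  # "WARNING:" lines are collected and do not affect the verdict.
  def judge_check(reply) do
    lines = reply |> String.split("\n") |> Enum.map(&String.trim/1)
    warnings = for "WARNING:" <> rest <- lines, do: String.trim(rest)
    failed? = Enum.any?(lines, &String.starts_with?(&1, "FAIL"))
    %{name: :judge, passed?: not failed?, warnings: warnings}
  end

  # The case passes only when every check passes.
  def passed?(checks), do: Enum.all?(checks, & &1.passed?)

  def warnings(checks), do: Enum.flat_map(checks, & &1.warnings)
end
```

With this model a reply of `PASS` followed by a `WARNING:` line still passes, while a timeout produces a single failed completion check and no judge call.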
## Caching
Evals are expensive — each is an agent run plus a judge call — so the runner
can skip a case that already passed when nothing in its *scope* has changed.
Enable it with `run: [cache: true]`:
```elixir
use SkillKit.Eval.Case, dir: "skills", run: [cache: true]
```
The scope fingerprint (`SkillKit.Eval.Cache`) covers the case text (name,
prompt, rubric, system), the agent and judge models, the source of every skill
and tool under test (file contents for path providers, the **compiled MD5** for
module providers), the subject `module`'s MD5 when the eval is colocated with
code, and a harness-version token bumped when scoring changes. A case whose
fingerprint matches a recorded **pass** is skipped — its result is marked
`cached: true` and no LLM is called. Failures and unknown fingerprints always
run; failures are never cached. Because module providers and module-anchored
evals hash compiled code, changing the application code an eval exercises
re-runs it rather than serving a stale pass.
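
A scope fingerprint of this kind can be sketched as a hash over a canonical encoding of the scope. `EvalSketch.Fingerprint`, the scope keys, the version token, and the use of SHA-256 are all assumptions here; `SkillKit.Eval.Cache` may compute its fingerprint differently.

```elixir
defmodule EvalSketch.Fingerprint do
  # Hypothetical token, bumped whenever the scoring rules change.
  @harness_version 1

  # `scope` is a map of everything the result depends on (assumed keys).
  def compute(scope) when is_map(scope) do
    # Sort the pairs so the encoding does not depend on map ordering.
    {@harness_version, scope |> Map.to_list() |> Enum.sort()}
    |> :erlang.term_to_binary()
    |> then(&:crypto.hash(:sha256, &1))
    |> Base.encode16(case: :lower)
  end
end

scope = %{
  name: "greets the user by name",
  prompt: "Hi, I'm Sam",
  rubric: "The assistant greets the user by their name.",
  model: "anthropic:claude-sonnet-4-6",
  judge_model: "anthropic:claude-sonnet-4-6",
  # Compiled MD5 of a module provider; `String` is only a stand-in.
  module_md5: String.module_info(:md5)
}

IO.puts(EvalSketch.Fingerprint.compute(scope))
```

Changing any field, including the module MD5 after a recompile, produces a different fingerprint and so forces a fresh run.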
The cache is a term file. `cache: true` stores it under `_build/<env>/`
(ephemeral, already gitignored — a fresh CI checkout runs every eval); pass a
path string to put it elsewhere and commit it to share skips with CI:
```elixir
use SkillKit.Eval.Case, dir: "skills", run: [cache: ".skill_kit/eval_cache.bin"]
```
Because LLMs are non-deterministic, a cache hit means "this exact scope already
passed, trust it" rather than a guaranteed-identical re-run — the right
contract for an expensive suite, like a build cache. Delete the cache file to
force a full re-run.
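
For intuition, a term-file cache that records only passes might look like the sketch below. The `EvalSketch.Store` module and its on-disk shape (a map of fingerprint to `true`) are hypothetical; the format SkillKit writes is not documented here.

```elixir
defmodule EvalSketch.Store do
  # Hypothetical store: a map of passing fingerprints serialized as an Erlang term.

  # A missing or unreadable file behaves like an empty cache.
  def load(path) do
    case File.read(path) do
      {:ok, bin} -> :erlang.binary_to_term(bin, [:safe])
      {:error, _reason} -> %{}
    end
  end

  def hit?(passes, fingerprint), do: Map.has_key?(passes, fingerprint)

  # Only passes are recorded; failures leave the store unchanged, so they always re-run.
  def record(passes, fingerprint, true), do: Map.put(passes, fingerprint, true)
  def record(passes, _fingerprint, false), do: passes

  def save(passes, path) do
    File.mkdir_p!(Path.dirname(path))
    File.write!(path, :erlang.term_to_binary(passes))
  end
end
```

Deleting the file makes `load/1` return an empty map, which matches the "delete to force a full re-run" behavior described above.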
## Running evals in CI
SkillKit's own CI (`.github/workflows/ci.yml`) runs the dogfood evals as a
separate, blocking `evals` job, and persists the result cache across runs so
only changed skills cost an API call:
```yaml
- name: Cache eval results
  uses: actions/cache@v4
  with:
    path: .skill_kit/eval_cache.bin
    # run_id never pre-exists, so the cache is re-saved every run; restore-keys
    # loads the most recent prior copy.
    key: ${{ runner.os }}-evalcache-${{ github.run_id }}
    restore-keys: ${{ runner.os }}-evalcache-
- name: Run skill evals
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: mix test test/examples/skills_eval_test.exs --only eval
```
The run is scoped to the dogfood suite file: a bare `--only eval` sweeps in
every `:eval`-tagged test in the repo, including mock-based fixtures that have
no real provider and can't pass on their own.
A plain `actions/cache` keyed on `mix.lock` (like the deps cache) will **not**
work for results: that key only changes with dependencies, and a cache is
immutable per key, so an existing entry is never re-saved. The rolling
`run_id` key above re-saves on every run.
The eval job is gated by a `RUN_EVALS` flag so any repo can opt out: set the
`RUN_EVALS` repository variable to `"false"` to skip it, or trigger a one-off
run with the workflow's `run_evals` input. It is on by default and blocking — a
failing eval fails the check.
## Running an eval directly
The harness is plain functions, so you can run a case outside ExUnit:
```elixir
{:ok, [eval | _]} = SkillKit.Eval.load_file("skills/greeter/EVAL.md")
result = SkillKit.Eval.Runner.run(eval, model: "anthropic:claude-sonnet-4-6")
SkillKit.Eval.Result.passed?(result)
#=> true
```
## Evals as meta-skills
Because an eval captures the *intended behavior* of a skill independently of
its prose, it doubles as a specification you can author a skill against: write
the eval first, draft the `SKILL.md` next to it, and iterate until the eval is
green — test-driven development for skills. A generator that drafts and refines
the application skill from its eval builds directly on this harness.