# Observability
Cantrip emits structured `:telemetry` events at process, gate, and medium
boundaries. This doc is the canonical reference for what gets emitted, how to
subscribe, and what to alert on.
**Audience:** operators deploying Cantrip, instrumentation engineers,
production support.
**Standard:** every documented event is asserted by a regression test. Events
not on this list are not load-bearing.
---
## Event registry
All events are emitted under the `[:cantrip, ...]` prefix.
| Event | Measurements | Metadata | Emitted from |
|---|---|---|---|
| `[:cantrip, :entity, :start]` | — | `entity_id, intent, trace_id` | `EntityServer.handle_call(:run, ...)` when an episode begins |
| `[:cantrip, :entity, :stop]` | `duration` | `entity_id, reason, trace_id` | `EntityServer.emit_entity_stop/2` when an episode terminates or is truncated |
| `[:cantrip, :turn, :start]` | — | `entity_id, turn_number, trace_id` | `EntityServer.run_loop/1` per turn |
| `[:cantrip, :turn, :stop]` | `duration` | `entity_id, turn_number, trace_id` | `EntityServer.emit_turn_stop/3` per turn |
| `[:cantrip, :gate, :start]` | — | `entity_id, gate_name, trace_id` | `Gate.Executor.emit_gate_start/2` per gate invocation |
| `[:cantrip, :gate, :stop]` | `duration` | `entity_id, gate_name, is_error, trace_id` | `Gate.Executor.emit_gate_stop/4` per gate invocation |
| `[:cantrip, :code, :eval]` | `duration` | `entity_id, trace_id` | `Medium.Code` per LLM-emitted Elixir evaluation |
| `[:cantrip, :bash, :eval]` | `duration` | `entity_id, trace_id` | `Medium.Bash` per shell command |
| `[:cantrip, :usage]` | `prompt_tokens, completion_tokens, total_tokens` | `entity_id, turn_number, trace_id` | `EntityServer.run_loop/1` after provider response |
| `[:cantrip, :redact, :hit]` | `count` | `entity_id, trace_id` | `Redact.scan/1` when boundary redaction removes a credential |
| `[:cantrip, :fold, :trigger]` | — | `entity_id, turn_number, trace_id` | `EntityServer.run_loop/1` when folding fires |
| `[:cantrip, :ward, :truncate]` | — | `entity_id, ward, trace_id` | `EntityServer.run_loop/1` when a ward stops execution |
| `[:cantrip, :ward, :child_rejected]` | `count` | `entity_id, child_id, child_medium, reason, trace_id` | child-cast coordinator when declaration-time child wards reject a spawn |
| `[:cantrip, :child, :start]` | — | `entity_id, child_depth, trace_id` | child-cast coordinator before child cast |
| `[:cantrip, :child, :stop]` | — | `entity_id, child_depth, outcome, trace_id` | child-cast coordinator after child cast |
| `[:cantrip, :loom, :persist_error]` | `count` | `storage_module, event_type, reason, trace_id` | `Loom.append_event/2` when the storage backend rejects a write |
| `[:cantrip, :compile_and_load]` | `duration` | `entity_id, module, outcome, trace_id` | `EntityServer.execute_compile_and_load/2` per hot-load attempt |
`duration` measurements are `System.monotonic_time/0` deltas (native units —
convert with `System.convert_time_unit/3` at the subscriber).
### Metadata invariants
- **`entity_id`** is always a binary, present on every event.
- **`trace_id`** is always a binary, present on every event. Propagates from
parent cantrip context through child cantrips so a full trace forms a tree
rooted at the originating episode.
- User-supplied strings that are intentionally useful for operations, such as
root intents, pass through the internal redaction boundary before emission so
credential-shaped substrings are scrubbed. LLM responses, provider response
bodies, bearer tokens, and raw credentials must not appear in event metadata.
---
## Subscribing
### Quick local logging
```elixir
:telemetry.attach_many(
"cantrip-logger",
[
[:cantrip, :entity, :start],
[:cantrip, :entity, :stop],
[:cantrip, :turn, :stop],
[:cantrip, :gate, :stop]
],
fn event, measurements, metadata, _config ->
Logger.info(
"#{Enum.join(event, ".")} | #{inspect(measurements)} | #{inspect(metadata)}"
)
end,
nil
)
```
### Production observability stack
The event prefix `[:cantrip, ...]` maps cleanly to most metric backends.
Recommended subscriptions for production deployments:
- **`[:cantrip, :turn, :stop]`** → histogram of `duration` per
`entity_id` for turn-latency tracking.
- **`[:cantrip, :gate, :stop]`** → histogram of `duration` per `gate_name`;
counter of `is_error: true` per `gate_name` for gate-error rates.
- **`[:cantrip, :entity, :stop]`** → counter per `reason` to track terminated
vs truncated vs error termination.
- **`[:cantrip, :usage]`** → counters for prompt/completion/total token
volume per `entity_id`.
- **`[:cantrip, :ward, :truncate]`** → counter per `ward` to see which guard
is stopping work.
- **`[:cantrip, :ward, :child_rejected]`** → counter per `reason` to catch
child-spawn policy pressure or prompt drift.
- **`[:cantrip, :redact, :hit]`** → counter of credential-shaped content
removed from entity/model-visible boundaries.
- **`[:cantrip, :child, :start]` / `[:cantrip, :child, :stop]`** → counters
and outcome tags for delegation fanout.
- **`[:cantrip, :code, :eval]`** and **`[:cantrip, :bash, :eval]`** →
histogram of `duration` for medium-evaluation latency.
Example StatsD attachment (using `telemetry_metrics_statsd`):
```elixir
metrics = [
Telemetry.Metrics.distribution("cantrip.turn.stop.duration",
event_name: [:cantrip, :turn, :stop],
measurement: :duration,
unit: {:native, :millisecond}
),
Telemetry.Metrics.distribution("cantrip.gate.stop.duration",
event_name: [:cantrip, :gate, :stop],
measurement: :duration,
unit: {:native, :millisecond},
tags: [:gate_name]
),
Telemetry.Metrics.counter("cantrip.gate.error.count",
event_name: [:cantrip, :gate, :stop],
keep: &(&1.is_error)
)
]
TelemetryMetricsStatsd.start_link(metrics: metrics)
```
Prometheus, Datadog, and other backends have equivalent
`Telemetry.Metrics`-based adapters.
---
## Recommended alerts
| Signal | Threshold | Why |
|---|---|---|
| `cantrip.gate.error.rate` | > 5% over 5 min, per `gate_name` | High gate error rate = LLM misuse or provider drift |
| `cantrip.turn.stop.duration` p95 | > 60s | Long turns suggest provider slowness, runaway code-medium evaluation, or hung gate |
| `cantrip.entity.stop.reason` = `:truncated` | > 10% over 1 hour | High truncation rate = `max_turns` ward set too low for the workload |
| `cantrip.ward.truncate.count` | sudden increase by `ward` | A runtime guard is stopping work more often than expected |
| `cantrip.redact.hit.count` | any unexpected sustained rate | User data or files contain credential-shaped content reaching observation boundaries |
| `cantrip.code.eval.duration` p95 | > 30s | Long code-medium evaluations suggest sandbox starvation or hung port |
---
## Trace correlation
`trace_id` propagates through child cantrips via the parent context. A full
trace for a parent episode that spawns N child cantrips is:
```
trace_id = "<root-uuid>"
├─ [:cantrip, :entity, :start] entity_id=parent_id
│ ├─ [:cantrip, :turn, :start] turn_number=1
│ ├─ [:cantrip, :gate, :start] gate_name=call_entity → spawns child
│ │ ├─ [:cantrip, :entity, :start] entity_id=child_id (same trace_id)
│ │ ├─ [:cantrip, :turn, :start] turn_number=1
│ │ └─ [:cantrip, :entity, :stop] entity_id=child_id
│ ├─ [:cantrip, :gate, :stop] gate_name=call_entity
│ └─ [:cantrip, :turn, :stop] turn_number=1
└─ [:cantrip, :entity, :stop] entity_id=parent_id
```
All events in this tree carry the same `trace_id`. To correlate to external
systems (HTTP request IDs, job queue IDs, etc.), pass the external ID as
`trace_id` when running the top-level cantrip:
```elixir
Cantrip.cast(cantrip, intent, trace_id: external_request_id)
```
ACP requests can use the protocol metadata channel. Put a non-empty string in
`_meta.trace_id` (or `_meta.cantrip_trace_id`) on `session/new` or
`session/prompt`; the Familiar ACP runtime stores it on the session and passes
it into `Cantrip.summon/3` or `Cantrip.send/3` so entity, turn, gate, usage,
child, and code events carry the caller's external trace ID. Other `_meta`
fields are ignored by Cantrip's ACP boundary; editor metadata cannot override
the configured LLM, loom path, or turn budget.
```json
{
"jsonrpc": "2.0",
"id": 7,
"method": "session/prompt",
"params": {
"sessionId": "sess_123",
"_meta": {"trace_id": "http-request-abc"},
"prompt": [{"type": "text", "text": "Inspect the failing test"}]
}
}
```
When no external trace ID is supplied, Cantrip mints a fresh per-session entity
trace ID.
---
## What is not emitted (and why)
- **LLM provider request/response bodies.** Too large and contain prompts.
Use `:telemetry.attach_many` with your own redaction if you need partial
visibility into provider traffic; do not log raw bodies.
- **Loom record contents.** The loom is the durable trace; subscribe to the
loom directly via `Cantrip.Loom` API if you need turn-level data. Telemetry
is for operational metrics, not data plane.
- **Stack traces.** Errors arrive as already-redacted observation strings.
Unredacted stack traces stay internal.
---
## Event Registry In Code
The runtime event registry is used by tests and documentation review. New
telemetry surfaces should be added there first, then pinned by a regression
test and documented in the table above.