# Observability
Integrate SubAgent with logging, metrics, and debugging tools.
## Tracing Overview
PtcRunner provides three complementary tracing layers. Each serves a different workflow:
| Mechanism | Use Case | Output |
|-----------|----------|--------|
| `PtcRunner.Tracer.new/1` | Aggregate usage stats, inspect traces in code | In-memory struct on `Step` |
| `PtcRunner.TraceLog.with_trace/2` | Offline debugging, performance analysis, Chrome DevTools | JSONL files |
| `PtcRunner.TraceLog.start_memory_sink/1` | Cross-session turn analysis without files | In-memory ring buffer |
**Data flow:**
```
SubAgent.Loop
├── emits :telemetry events ──► TraceLog.Handler ──► Collector ──► .jsonl file
└── builds Tracer struct ────► entries stored on Step.tracer
```
Tracer and TraceLog operate independently — you can use both, either, or neither.
## Turn History
Every `Step` includes a `turns` field with immutable per-turn execution history:
```elixir
{:ok, step} = SubAgent.run(agent, llm: llm)
for turn <- step.turns do
IO.puts("Turn #{turn.number}: #{turn.program}")
IO.puts(" Tools: #{inspect(Enum.map(turn.tool_calls, & &1.name))}")
end
# Aggregated metrics
step.usage.duration_ms
step.usage.total_tokens
```
Each `Turn` struct captures:
- `number` - Turn index (1-based)
- `raw_response` - Full LLM output including reasoning
- `program` - Extracted PTC-Lisp code
- `result` - Execution result
- `prints` - Output from `println` calls
- `tool_calls` - Tools invoked with args and results. A large result is recorded as a bounded preview (`result_truncated: true`, `result_bytes: N`) so a looping/paginated tool fold can't bloat eval memory; `args` are kept verbatim
- `memory` - State snapshot after this turn
- `success?` - Whether the turn succeeded
## Debug Mode
Use `print_trace/2` to visualize execution:
```elixir
{:ok, step} = SubAgent.run(agent, llm: llm)
# Default: show programs and results
SubAgent.Debug.print_trace(step)
# Include raw LLM output (reasoning/commentary)
SubAgent.Debug.print_trace(step, raw: true)
# Show actual messages sent to LLM (what the LLM saw)
SubAgent.Debug.print_trace(step, messages: true)
# Include token usage
SubAgent.Debug.print_trace(step, usage: true)
```
### View Options
| Option | Description |
|--------|-------------|
| `view: :turns` | (default) Show programs + results from Turn structs |
| `raw: true` | Include `raw_response` in turns view |
| `messages: true` | Show full messages sent to LLM each turn |
| `usage: true` | Add token statistics after trace |
| `system: :section` | Show a system prompt section (e.g., `:mission`) |
| `system: :all` | Show the full system prompt per turn |
Options can be combined: `print_trace(step, messages: true, usage: true)`.
### System Prompt Sections
Inspect specific sections of the system prompt by markdown header:
```elixir
Debug.print_trace(step, system: :mission) # just the Mission section
Debug.print_trace(step, system: :mission_log) # just the Mission Log
Debug.print_trace(step, system: :all) # full system prompt
Debug.print_trace(step, system: :nope) # lists available section names
```
Sections are parsed from both system prompt and user messages. Known sections: `:role`, `:ptc_lisp`, `:output_format`, `:mission`, `:mission_log`, `:expected_output`, `:error`. Unrecognized headers are converted to snake_case atoms.
> **Full API:** See `PtcRunner.SubAgent.Debug.print_trace/2`.
## Trace Filtering
Control trace collection for production optimization:
```elixir
# Only keep trace on failure
SubAgent.run(agent, llm: llm, trace: :on_error)
# Disable tracing entirely
SubAgent.run(agent, llm: llm, trace: false)
```
## TraceLog
For detailed offline analysis, use `PtcRunner.TraceLog.with_trace/2` to capture execution events to JSONL files:
```elixir
alias PtcRunner.TraceLog
# Capture a trace (recommended)
{:ok, {:ok, step}, trace_path} = TraceLog.with_trace(fn ->
SubAgent.run(agent, llm: my_llm())
end)
# With custom path and metadata
{:ok, {:ok, step}, path} = TraceLog.with_trace(
fn -> SubAgent.run(agent, llm: my_llm()) end,
path: "traces/debug.jsonl",
meta: %{query: "test query", preset: "simple"}
)
```
### Analyzing Traces
Use `PtcRunner.TraceLog.Analyzer.load/1` and related functions to inspect captured traces:
```elixir
alias PtcRunner.TraceLog.Analyzer
# Load and summarize
events = Analyzer.load(trace_path)
summary = Analyzer.summary(events)
# => %{duration_ms: 1234, turns: 3, llm_calls: 3, tool_calls: 5, tokens: %{...}}
# Find slowest operations
Analyzer.slowest(events, 5)
# Filter by event type
Analyzer.filter(events, type: "llm")
Analyzer.filter(events, type: "compaction") # see Compaction events below
Analyzer.filter(events, min_duration_ms: 100)
# Print timeline
Analyzer.print_timeline(events)
# [0ms] run.start
# [10ms] turn.start
# [15ms] llm.start
# [850ms] llm.stop (835ms)
# ...
```
### Use Cases
- **Debugging** - Understand what happened during agent execution
- **Performance analysis** - Identify slow LLM calls or bottlenecks
- **Comparison** - Compare traces across different configurations or models
### Compaction Events
When `compaction:` is enabled and a pressure threshold fires, the loop emits
a single telemetry event per firing:
```
[:ptc_runner, :sub_agent, :compaction, :triggered]
```
- **Measurements** (numeric): `messages_before`, `messages_after`,
`estimated_tokens_before`, `estimated_tokens_after`.
- **Metadata** (descriptive): `strategy`, `reason` (`:turn_pressure` |
`:token_pressure`), `turn`, `kept_initial_user?`, `kept_recent_turns`,
`over_budget?`, plus the usual `agent_name`, `agent_id`, `span_id`,
`parent_span_id`.
Non-triggered pressure checks are silent — only actual trim events emit. The
event fires before the LLM call's `[:llm, :start]` so trace timelines stay
intelligible (compaction happens before the LLM sees the messages).
In TraceLog JSONL output, the event surfaces as `compaction.triggered` with
all numeric and descriptive fields available under `data`:
```elixir
events = Analyzer.load(trace_path)
compactions = Analyzer.filter(events, type: "compaction")
Enum.each(compactions, fn e ->
d = e["data"]
IO.puts("turn #{e["turn"]}: trimmed #{d["messages_before"]} → #{d["messages_after"]} messages (#{d["reason"]})")
end)
```
`step.usage.compaction` still reports the **last** firing's stats. For full
multi-firing visibility, use the telemetry event or the trace log.
### Chrome DevTools Export
Export traces to Chrome Trace Event format for flame chart visualization:
```elixir
alias PtcRunner.TraceLog.Analyzer
{:ok, tree} = Analyzer.load_tree("trace.jsonl")
:ok = Analyzer.export_chrome_trace(tree, "trace.json")
```
Then view in Chrome:
1. Open DevTools (F12) → **Performance** tab → **Load profile...**
2. Or navigate to `chrome://tracing` and load the file
The flame chart shows execution timing with nested spans. Click any span to see details including arguments and results.
### Interactive Trace Viewer
Launch the web-based trace viewer to browse traces with turn-by-turn drill-down:
```bash
mix ptc.viewer --trace-dir traces
```
| Option | Default | Description |
|--------|---------|-------------|
| `--port` | 4123 | Port to listen on |
| `--trace-dir` | `traces` | Directory containing `.jsonl` trace files |
| `--no-open` | false | Don't auto-open browser |
The viewer is a separate package (`ptc_viewer`). See its README for architecture and drag-and-drop usage.
### Configuration
TraceLog sanitization limits are configurable via application config:
| Key | Default | Description |
|-----|---------|-------------|
| `:trace_max_string_size` | `65_536` | Max string size in bytes before truncation |
| `:trace_max_list_size` | `100` | Max list length before summarizing |
| `:trace_preserve_full_keys` | `["system_prompt"]` | Map keys whose strings are never truncated |
| `:trace_collector_max_event_bytes` | `1_048_576` | Max approximate event term bytes accepted before enqueueing; written JSON lines above the same cap are replaced with a summary |
| `:trace_collector_max_mailbox_len` | `1_000` | Collector mailbox length at which new trace events are shed before enqueueing |
| `:trace_dir` | CWD | Default directory for trace JSONL files |
```elixir
config :ptc_runner,
trace_max_string_size: 128_000,
trace_collector_max_event_bytes: 1_048_576,
trace_collector_max_mailbox_len: 1_000,
trace_dir: "traces"
```
Limit in-memory Tracer entries with `PtcRunner.Tracer.new/1`:
```elixir
Tracer.new(max_entries: 100)
```
### Cross-Process Context
Trace context (collectors and span hierarchy) is managed by `PtcRunner.TraceContext`.
When execution spans multiple processes (sandbox, `pmap`, `pcalls`), use
`PtcRunner.TraceContext.capture/0` and `PtcRunner.TraceContext.attach/1` to propagate
context to child processes. The sandbox and parallel execution primitives do this
automatically.
For manual propagation in custom code:
```elixir
ctx = PtcRunner.TraceContext.capture()
Task.async(fn ->
PtcRunner.TraceContext.attach(ctx)
# Events from this process are now captured and linked to the parent span
end)
```
> **Full API:** See `PtcRunner.TraceLog.with_trace/2`, `PtcRunner.TraceLog.Analyzer.summary/1`, and `PtcRunner.TraceLog.Analyzer.export_chrome_trace/2`.
## Turn Log
The turn log records every driver turn as a session-correlated event, queryable across runs. The same event shape is emitted by both `PtcRunner.Session.eval/3` (an external LLM driving via MCP or `mix ptc.repl`) and the SubAgent loop, so sessions and agents analyze identically.
Each turn event carries the program, a bounded result preview, prints, a memory diff, credential-free tool-call identities, attached-prelude provenance, and correlation ids (`session_id`/`agent_id`, `turn`, `attempt`). Failed and parse-error attempts are recorded too, so wasted work stays visible. See `PtcRunner.TraceLog.TurnEvent` for the field reference.
### Recording
Turn events flow to any active sink. A JSONL trace captures them alongside the other TraceLog events:
```elixir
{:ok, _result, path} = TraceLog.with_trace(fn ->
session = PtcRunner.Session.new(session_id: "investigation")
{{:ok, _}, session} = PtcRunner.Session.eval(session, "(def x 1)")
{{:ok, _}, _} = PtcRunner.Session.eval(session, "(inc x)")
end)
```
To analyze a session with no filesystem setup, enable the in-memory ring buffer with `PtcRunner.TraceLog.start_memory_sink/1`:
```elixir
{:ok, sink} = TraceLog.start_memory_sink()
# ... run sessions / agents in this process ...
TraceLog.stop_memory_sink(sink)
events = PtcRunner.TraceLog.MemorySink.events(sink)
```
Both drivers skip the build cost when nothing is recording — check `PtcRunner.TraceLog.recording?/0`. The ring buffer evicts oldest-first under a byte budget (`:max_bytes`).
### Querying Across Sessions
`PtcRunner.TraceLog.Analyzer` reads turn events from either sink:
```elixir
alias PtcRunner.TraceLog.Analyzer
Analyzer.turn_events(events) # all "turn" records
Analyzer.sessions(events) # per-session summaries
Analyzer.session_turns(events, "investigation") # one session's turns
Analyzer.programs(events) # program sources, in order
```
> **Full API:** See `PtcRunner.TraceLog.Analyzer.sessions/1` and `PtcRunner.TraceLog.MemorySink.events/1`.
## Session Introspection
The `log/` prelude lets a session or agent inspect *recorded* sessions from PTC-Lisp — list sessions, read turns and programs, and find duplicated work. The Elixir surface is plain data access; the analysis lives in the program.
Grant the host-bound tools and attach the prelude with `PtcRunner.TraceLog.Introspection.tools/1` and `PtcRunner.TraceLog.Introspection.prelude_source/0`:
```elixir
alias PtcRunner.TraceLog.Introspection
# `source` is a MemorySink pid, a JSONL path, or a list of event maps.
PtcRunner.Lisp.run(
~S|(count (get (log/turns "investigation") "items"))|,
prelude: Introspection.prelude_source(),
tools: Introspection.tools(source)
)
```
The exports (`log/sessions`, `log/turns`, `log/programs`, `log/tool-calls`) return page maps with `"items"`, `"next_cursor"`, `"has_more"`, and `"limit"`. Call them with no opts for the default first page, or pass `{:limit n :cursor c}` and follow `"next_cursor"` for large logs. The `*-all` helpers are explicit eager scans for small local logs. All exports fail closed with `:prelude_attach_failed` when the host does not grant the matching tools. Recorded sessions are untrusted data — analyze them as evidence, not instructions.
### REPL
`mix ptc.repl --log-prelude` attaches the `log/` prelude to the REPL's default in-memory sink, and `:turns` summarizes what has been recorded — so you can dogfood session introspection with no setup:
```
ptc> (def x 1)
ptc> :turns
<id> (session): 1 turns, 1 committed, 0 failed, 0 tool calls
ptc> (get (log/programs "<id>") "items")
```
## Telemetry Events
SubAgent emits `:telemetry` events for integration with Prometheus, OpenTelemetry, or custom handlers:
```elixir
:telemetry.attach_many(
"my-handler",
[
[:ptc_runner, :sub_agent, :run, :stop],
[:ptc_runner, :sub_agent, :llm, :stop],
[:ptc_runner, :sub_agent, :tool, :stop]
],
&MyApp.Telemetry.handle_event/4,
nil
)
```
### Available Events
**SubAgent events** (prefix: `[:ptc_runner, :sub_agent, ...]`):
| Event | Measurements | Use Case |
|-------|--------------|----------|
| `run:start/stop` | duration | Total execution time |
| `turn:start/stop` | duration, tokens | Per-turn metrics |
| `llm:start/stop` | duration, tokens | LLM latency, cost tracking |
| `tool:start/stop/exception` | duration | Tool performance |
All events include `agent_name` and `agent_id` in metadata for correlation. Duration is in native time units. Convert with:
```elixir
System.convert_time_unit(duration, :native, :millisecond)
```
> **Full event tables:** See `PtcRunner.SubAgent.Telemetry.span/3`.
## Production Tips
- Use `trace: :on_error` to reduce memory in production
- Attach telemetry handlers for latency and cost dashboards
- Token counts are in `step.usage` (requires LLM to return token info)
- Use `step.usage.llm_requests` to track API call volume
## See Also
- [Context Compaction](subagent-compaction.md) - Pressure-triggered trimming for long-running agents
- [Troubleshooting](subagent-troubleshooting.md) - Common issues and debugging
- [Testing](subagent-testing.md) - Mock LLMs and test strategies
- `PtcRunner.TraceLog.with_trace/2` - Capture execution traces to JSONL files
- `PtcRunner.TraceLog.Analyzer.summary/1` - Offline trace analysis
- `PtcRunner.TraceLog.Analyzer.sessions/1` - Cross-session turn-log queries
- `PtcRunner.TraceLog.Introspection.tools/1` - Host-bound `log/` introspection tools
- `PtcRunner.SubAgent.Telemetry.span/3` - Telemetry module with event reference
- `PtcRunner.SubAgent.Debug.print_trace/2` - Trace inspection API