# AITrace
> The unified observability layer for the AI Control Plane.
`AITrace` transforms opaque, non-deterministic AI processes into interpretable, debuggable execution traces.
Its mission is to create an Elixir-native instrumentation library and a corresponding data model that captures the complete causal chain of an AI agent's reasoning process—from initial prompt to final output, including all thoughts, tool calls, and state changes—enabling a true "Execution Cinema" experience for developers and operators.
## The Problem: Why Traditional Observability Fails
Debugging a simple web request is a solved problem. We have structured logs, metrics, and distributed tracing (like OpenTelemetry) that show the path of a request through a series of stateless services.
Debugging an AI agent is fundamentally different. It is like performing forensic analysis on a dream. The challenges are unique:
* **Non-Determinism:** The same input can produce different outputs and, more importantly, different *reasoning paths*.
* **Deeply Nested Causality:** A final answer may be the result of a multi-step chain of thought, where an LLM hallucinates, calls the wrong tool with the wrong arguments, misinterprets the result, and then tries to correct itself.
* **Stateful Complexity:** Agents are not stateless. Their behavior is conditioned by memory, scratchpads, and the history of the conversation. A simple log line is insufficient to capture the state that led to a decision.
* **Polyglot Execution:** An agent's "thought" may happen in Elixir, but its "action" (e.g., running a code interpreter) happens in a sandboxed Python environment. Tracing this flow across language boundaries is notoriously difficult.
`Logger.info/1` is inadequate. Traditional APM tools provide a high-level view but lack the granular, AI-specific context needed to answer the most important question: **"Why did the agent do *that*?"**
## Core Concepts & Data Model
`AITrace` is built on a few simple but powerful concepts, heavily inspired by OpenTelemetry but adapted for AI workflows.
* **Trace:** The complete, end-to-end record of a single transaction (e.g., one user message to an agent). It is identified by a unique `trace_id`. A trace is composed of a root `Span` and many nested `Spans` and `Events`.
* **Span:** A record of a timed operation with a distinct start and end. A span represents a unit of work. Examples: `llm_call`, `tool_execution`, `prompt_rendering`. Spans can be nested to represent a call graph. Each span has a `name`, `start_time`, `end_time`, and a key-value map of `attributes`.
* **Event:** A point-in-time annotation within a `Span`. It represents a notable occurrence that isn't a timed operation. Examples: `agent_state_updated`, `validation_failed`, `tool_not_found`.
* **Context:** An immutable Elixir struct (`%AITrace.Context{}`) that carries the `trace_id` and the current `span_id`. This context is explicitly passed through the entire call stack of a traced operation, ensuring all telemetry is correctly correlated.
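The data model above could be represented with structs along these lines. This is a sketch only; the field names and defaults are illustrative assumptions, not a finalized schema:

```elixir
defmodule AITrace.Context do
  @moduledoc "Immutable correlation context passed through a traced call stack."
  @enforce_keys [:trace_id, :span_id]
  defstruct [:trace_id, :span_id]
end

defmodule AITrace.Span do
  @moduledoc "A timed unit of work, e.g. an llm_call or tool_execution."
  defstruct [
    :span_id,
    # nil for the root span of a trace
    :parent_span_id,
    # e.g. "llm_call", "prompt_rendering"
    :name,
    :start_time,
    :end_time,
    # arbitrary key-value metadata
    attributes: %{},
    # point-in-time annotations recorded within this span
    events: []
  ]
end

defmodule AITrace.Event do
  @moduledoc "A point-in-time annotation, e.g. validation_failed."
  defstruct [:name, :timestamp, attributes: %{}]
end
```

Nesting is expressed through `parent_span_id` rather than by embedding child spans, which lets exporters stream spans independently and reassemble the call graph later.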
## Architectural Design & API
`AITrace` is designed as a set of ergonomic Elixir macros and functions that make instrumentation feel natural.
### The Core API
```elixir
# lib/my_app/agent.ex
def handle_user_message(message, state) do
  # 1. Start a new trace for the entire transaction.
  # The `trace` macro injects a `ctx` variable into its scope.
  AITrace.trace "agent.handle_message" do
    # 2. Add point-in-time events with rich metadata.
    AITrace.add_event(ctx, "Initial state loaded", %{agent_id: state.id})

    # 3. Wrap discrete, timed operations in spans.
    # The `span` macro injects a `span_ctx` variable into its body and
    # returns the updated context alongside the block's result.
    {:ok, response, new_ctx} =
      AITrace.span ctx, "reasoning_loop" do
        DSPex.execute(reasoning_logic, %{message: message}, context: span_ctx)
      end

    # `new_ctx` now carries the completed span's information.
    AITrace.add_event(new_ctx, "Reasoning complete", %{token_usage: response.token_usage})

    {:reply, response.answer, update_state(state)}
  end
end
```
### Key Design Points
* **Context Propagation:** The biggest challenge is passing the `ctx`. The API will provide helpers and patterns (like using a `with` block) to make this propagation clear and explicit, avoiding "magic" context passing.
* **OTP Integration:** The library will include helpers for stashing and retrieving the `AITrace` context in `GenServer` calls and `Oban` jobs, making it easy to continue a trace across process boundaries.
* **Pluggable Backends (Exporters):** `AITrace` itself is only responsible for generating telemetry data. A configurable "Exporter" decides where that data goes, keeping the core library free of backend-specific dependencies.
* `AITrace.Exporter.Console`: Prints human-readable traces to the terminal for local development.
* `AITrace.Exporter.File`: Dumps structured JSON traces to a file.
* `AITrace.Exporter.Phoenix`: (Future) A backend that sends traces over Phoenix Channels to a live UI.
* `AITrace.Exporter.OpenTelemetry`: (Future) A backend that converts `AITrace` data into OTel format for integration with existing systems like Jaeger or Honeycomb.
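As a sketch of the OTP integration, one pattern is to carry the context as an explicit argument in `GenServer` calls, so a trace continues inside the server process as ordinary message data. The helper names and return shapes below are assumptions about the eventual API:

```elixir
defmodule MyApp.ToolServer do
  use GenServer

  # Client API: the caller passes its AITrace context explicitly,
  # avoiding any "magic" (process-dictionary) propagation.
  def run_tool(server, tool, args, %AITrace.Context{} = ctx) do
    GenServer.call(server, {:run_tool, tool, args, ctx})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:run_tool, tool, args, ctx}, _from, state) do
    # The trace continues here because the context crossed the
    # process boundary inside the call message.
    {:ok, result, _new_ctx} =
      AITrace.span ctx, "tool_execution" do
        execute_tool(tool, args)
      end

    {:reply, result, state}
  end

  defp execute_tool(_tool, _args), do: {:ok, :done}
end
```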
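The exporter contract can be as small as a single callback. The behaviour below is a sketch; the callback name and the shape of the `trace` argument are assumptions:

```elixir
defmodule AITrace.Exporter do
  @moduledoc "Behaviour implemented by all trace exporters."
  @callback export(trace :: map()) :: :ok | {:error, term()}
end

defmodule AITrace.Exporter.Console do
  @behaviour AITrace.Exporter

  @impl true
  def export(trace) do
    # Human-readable dump for local development.
    IO.inspect(trace, label: "AITrace", pretty: true)
    :ok
  end
end

# Selecting an exporter via application config:
# config :ai_trace, exporter: AITrace.Exporter.Console
```

Because the contract is a behaviour, the future Phoenix and OpenTelemetry backends are just additional modules implementing `export/1`.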
## Integration with the Ecosystem
`AITrace` is the glue that binds the entire portfolio into a single, observable system.
* **`DSPex` Instrumentation:** `DSPex` will be modified to accept an optional `AITrace.Context`. When provided, it will automatically create spans for `llm_call`, `prompt_render`, and `output_parsing`. It will add events for which modules are being used (`Predict`, `ChainOfThought`) and log the full, raw prompt/completion data as span attributes.
* **`Altar` Instrumentation:** The host application's tool-execution wrapper (as designed in Architecture A) will use `AITrace.span` to wrap every call to `Altar.LATER.Executor`. The span's attributes will include the tool's name, arguments, and the validated result or error. This provides perfect observability into tool usage.
* **`Snakepit` Instrumentation:** `Snakepit` will be enhanced to accept an `AITrace.Context`. It will extract the `trace_id` and `span_id` and pass them as gRPC metadata to the Python worker. A corresponding Python library will then be able to reconstruct the trace context on the other side, enabling true cross-language distributed tracing. The duration of the gRPC call itself will be captured in a span.
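Cross-language propagation can reuse the W3C `traceparent` convention over gRPC metadata, which the Python side can parse with standard tooling. The module below is a sketch; it assumes `trace_id` and `span_id` are dash-free hex strings, and the helper names are illustrative:

```elixir
defmodule AITrace.Propagation do
  @moduledoc "Serializes an AITrace.Context into gRPC metadata headers."

  # Encode the context as a W3C-style traceparent header so the
  # Python worker can reconstruct the trace on the other side.
  # Format: version-trace_id-span_id-flags
  def to_grpc_metadata(%AITrace.Context{trace_id: trace_id, span_id: span_id}) do
    %{"traceparent" => "00-#{trace_id}-#{span_id}-01"}
  end

  # Rebuild a context from incoming metadata (assumes dash-free ids).
  def from_grpc_metadata(%{"traceparent" => header}) do
    ["00", trace_id, span_id, _flags] = String.split(header, "-")
    %AITrace.Context{trace_id: trace_id, span_id: span_id}
  end
end
```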
## The End Goal: The "Execution Cinema"
The data generated by `AITrace` is structured to power a rich, interactive debugging UI. This UI is the ultimate user-facing product of the library.
### Features
* **Waterfall View:** A visual timeline showing the nested spans of a trace, allowing a developer to immediately spot long-running operations.
* **Context Explorer:** Clicking on any span reveals its full attributes—the exact prompt sent to an LLM, the JSON returned from a tool, the error message from a validation failure.
* **State Diff:** For spans that include agent state changes, the UI can show a "diff" of the state before and after the operation.
* **Causal Flow:** The UI will clearly visualize the flow of data and control, making it easy to follow the agent's "train of thought."
This is not just a log viewer; it is a purpose-built, interactive debugger for AI reasoning.