# AITrace

<p align="center">
  <img src="assets/ai_trace.svg" alt="AITrace logo" width="220" />
</p>

> The unified observability layer for the AI Control Plane.

`AITrace` turns opaque, non-deterministic AI processes into fully interpretable and debuggable execution traces.

It pairs an Elixir-native instrumentation API with a data model that captures the complete causal chain of an AI agent's reasoning process, from the initial prompt to the final output, including every thought, tool call, and state change, enabling a true "Execution Cinema" experience for developers and operators.

## The Problem: Why Traditional Observability Fails

Debugging a simple web request is a solved problem. We have structured logs, metrics, and distributed tracing (like OpenTelemetry) that show the path of a request through a series of stateless services.

Debugging an AI agent is fundamentally different. It is like performing forensic analysis on a dream. The challenges are unique:

*   **Non-Determinism:** The same input can produce different outputs and, more importantly, different *reasoning paths*.
*   **Deeply Nested Causality:** A final answer may be the result of a multi-step chain of thought, where an LLM hallucinates, calls the wrong tool with the wrong arguments, misinterprets the result, and then tries to correct itself.
*   **Stateful Complexity:** Agents are not stateless. Their behavior is conditioned by memory, scratchpads, and the history of the conversation. A simple log line is insufficient to capture the state that led to a decision.
*   **Polyglot Execution:** An agent's "thought" may happen in Elixir, but its "action" (e.g., running a code interpreter) happens in a sandboxed Python environment. Tracing this flow across language boundaries is notoriously difficult.

`Logger.info/1` is inadequate. Traditional APM tools provide a high-level view but lack the granular, AI-specific context needed to answer the most important question: **"Why did the agent do *that*?"**

## Core Concepts & Data Model

`AITrace` is built on a few simple but powerful concepts, heavily inspired by OpenTelemetry but adapted for AI workflows.

*   **Trace:** The complete, end-to-end record of a single transaction (e.g., one user message to an agent). It is identified by a unique `trace_id`. A trace is composed of a root `Span` and many nested `Spans` and `Events`.

*   **Span:** A record of a timed operation with a distinct start and end. A span represents a unit of work. Examples: `llm_call`, `tool_execution`, `prompt_rendering`. Spans can be nested to represent a call graph. Each span has a `name`, `start_time`, `end_time`, and a key-value map of `attributes`.

*   **Event:** A point-in-time annotation within a `Span`. It represents a notable occurrence that isn't a timed operation. Examples: `agent_state_updated`, `validation_failed`, `tool_not_found`.

*   **Context:** An immutable Elixir struct (`%AITrace.Context{}`) that carries the `trace_id` and the current `span_id`. Within a traced operation the context is propagated implicitly (via the process dictionary), so all telemetry emitted along the way is correctly correlated. A rough sketch of these structures follows.
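
As a rough mental model, these concepts map onto plain Elixir structs along the lines of the sketch below. The `Sketch.*` module names and any fields not described above (such as `parent_id` and the event timestamp) are assumptions; the real modules are `AITrace.Context`, `AITrace.Span`, `AITrace.Event`, and `AITrace.Trace`.

```elixir
# Illustrative sketch only; the actual struct definitions inside AITrace
# are internal and may differ in field names and defaults.

defmodule Sketch.Context do
  # Correlation identifiers carried through a traced operation.
  defstruct [:trace_id, :span_id]
end

defmodule Sketch.Event do
  # A point-in-time annotation attached to a span.
  defstruct [:name, :timestamp, attributes: %{}]
end

defmodule Sketch.Span do
  # A timed unit of work; spans nest (here via a parent_id) to form a call graph.
  defstruct [:span_id, :parent_id, :name, :start_time, :end_time,
             attributes: %{}, events: []]
end

defmodule Sketch.Trace do
  # The complete record of one transaction.
  defstruct [:trace_id, spans: []]
end
```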

## Installation

Add `aitrace` to your `mix.exs` dependencies:

```elixir
def deps do
  [
    {:aitrace, "~> 0.1.0"}
  ]
end
```

## Quick Start

```elixir
defmodule MyApp.Agent do
  require AITrace  # Required to use the macros

  def handle_user_message(message, state) do
    # 1. Start a new trace for the entire transaction
    AITrace.trace "agent.handle_message" do
      # 2. Add point-in-time events with rich metadata
      AITrace.add_event("request_received", %{message_length: String.length(message)})

      # 3. Wrap discrete, timed operations in spans
      response = AITrace.span "reasoning_loop" do
        # Add attributes to the current span
        AITrace.with_attributes(%{model: "gpt-4", temperature: 0.7})

        # Perform reasoning
        think_about(message)
      end

      AITrace.add_event("reasoning_complete", %{token_usage: response.tokens})

      {:reply, response.answer, update_state(state)}
    end
  end
end
```

## Core API

### Starting a Trace

```elixir
AITrace.trace "operation_name" do
  # Your code here - context is stored in process dictionary
end
```

### Creating Spans

```elixir
AITrace.span "span_name" do
  # Timed operation - duration is automatically measured
end
```

### Adding Events

```elixir
AITrace.add_event("event_name", %{key: "value"})
AITrace.add_event("simple_event")  # No attributes
```

### Adding Attributes

```elixir
AITrace.with_attributes(%{user_id: 42, region: "us-west"})
```

### Accessing Context

```elixir
ctx = AITrace.get_current_context()
IO.inspect(ctx.trace_id)
IO.inspect(ctx.span_id)
```
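
Because these are plain values, they compose with standard Elixir tooling. One pattern (ordinary `Logger` usage, not part of the AITrace API) is to attach them to the process's logger metadata so regular log lines can be correlated with the active trace:

```elixir
require Logger

ctx = AITrace.get_current_context()

# Attach the trace identifiers to this process's Logger metadata so later
# log lines can be joined against the exported trace.
Logger.metadata(trace_id: ctx.trace_id, span_id: ctx.span_id)

Logger.info("calling external tool")
```

Note that the default console backend only prints metadata keys listed in its `:metadata` option, so add `:trace_id` and `:span_id` there if you want them to appear in log output.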

## Configuration

Configure exporters in your application config:

```elixir
# config/config.exs
config :aitrace,
  exporters: [
    {AITrace.Exporter.Console, verbose: true, color: true},
    {AITrace.Exporter.File, directory: "./traces"}
  ]
```

### Available Exporters

*   **`AITrace.Exporter.Console`** - Prints human-readable traces to stdout
  - Options: `verbose` (show attributes/events), `color` (ANSI colors)

*   **`AITrace.Exporter.File`** - Writes JSON traces to files
  - Options: `directory` (output directory, default: "./traces")

### Creating Custom Exporters

Implement the `AITrace.Exporter` behaviour:

```elixir
defmodule MyApp.CustomExporter do
  @behaviour AITrace.Exporter

  @impl true
  def init(opts), do: {:ok, opts}

  @impl true
  def export(trace, state) do
    # Send trace to your backend
    IO.inspect(trace)
    {:ok, state}
  end

  @impl true
  def shutdown(_state), do: :ok
end
```
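
Then register it under the same `:exporters` key shown in the Configuration section. The options in the tuple are presumably the `opts` handed to your `init/1` callback; the `endpoint` option below is just a made-up example for the custom exporter:

```elixir
# config/config.exs
config :aitrace,
  exporters: [
    {AITrace.Exporter.Console, verbose: true},
    # `endpoint` is a hypothetical option consumed by MyApp.CustomExporter.init/1
    {MyApp.CustomExporter, endpoint: "https://traces.example.com"}
  ]
```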

## Examples

See `examples/basic_usage.exs` for a complete working example:

```bash
mix run examples/basic_usage.exs
```

Output:
```
Trace: b37b73325dbd626481e0ff3e89de02c8
▸ reasoning (10.84ms) ✓
  Attributes: %{model: "gpt-4", temperature: 0.7}
    • reasoning_complete
      %{thought_count: 3}
▸ tool_execution (5.95ms) ✓
  Attributes: %{tool: "web_search"}
▸ response_generation (8.98ms) ✓
  Attributes: %{tokens: 150}
```

## Architecture

### Data Model

- **AITrace.Context** - Carries trace_id and span_id through the call stack
- **AITrace.Span** - Timed operations with start/end times, attributes, and events
- **AITrace.Event** - Point-in-time annotations within spans
- **AITrace.Trace** - Complete trace containing all spans

### Runtime

- **AITrace.Collector** - In-memory Agent storing active traces
- **AITrace.Application** - Supervision tree managing the collector
- Context stored in process dictionary for implicit propagation
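
Because the context lives in the calling process's dictionary, it is visible to everything running in that process but does not automatically follow work spawned into other processes (which is why cross-process propagation is listed under planned features). Conceptually the mechanism is something like the sketch below; the key name and storage details are assumptions, not the library's actual internals:

```elixir
# Conceptual sketch only; AITrace's real key and internal representation differ.
defmodule PdictContextSketch do
  @key :aitrace_context  # hypothetical process-dictionary key

  # Store the active context for the current process.
  def put(ctx), do: Process.put(@key, ctx)

  # Read the active context; returns nil when no trace is running here.
  def get, do: Process.get(@key)
end
```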

### Future Integrations

`AITrace` is designed to integrate with other AI infrastructure:

*   **DSPex** - Automatic instrumentation for LLM calls and prompt rendering
*   **Altar** - Tool execution tracing with arguments and results
*   **Snakepit** - Cross-language tracing via gRPC metadata
*   **Phoenix Channels** - Real-time trace streaming to web UIs
*   **OpenTelemetry** - Export to standard observability platforms

## Development Status

**✅ Implemented (v0.1.0)**
- Core data structures (Context, Span, Event, Trace)
- Trace and span macros with automatic timing
- Event and attribute APIs
- Console exporter (human-readable output)
- File exporter (JSON format)
- Comprehensive test suite (80 tests)
- Working examples

**🚧 Planned**
- Phoenix Channel exporter for real-time streaming
- OpenTelemetry exporter
- OTP integration helpers (GenServer, Oban)
- Cross-process context propagation
- "Execution Cinema" web UI with waterfall views
- DSPex, Altar, and Snakepit integrations

## Testing

```bash
# Run all tests
mix test

# Run with coverage
mix test --cover

# Run example
mix run examples/basic_usage.exs
```

## License

MIT - See [LICENSE](LICENSE) for details.

## Contributing

AITrace is part of the AI Control Plane ecosystem. Contributions welcome!