<p align="center">
<img src="assets/crucible_telemetry.svg" alt="Telemetry" width="150"/>
</p>
# CrucibleTelemetry
**Research-grade instrumentation and metrics collection for AI/ML experiments in Elixir.**
## Overview
**CrucibleTelemetry** provides specialized observability for rigorous scientific experimentation, going beyond standard production telemetry with features designed for AI/ML research:
- **Experiment Isolation**: Run multiple experiments concurrently without cross-contamination
- **Rich Metadata**: Automatic enrichment with experiment context, timestamps, and custom tags
- **Multiple Export Formats**: CSV, JSON Lines for analysis in Python, R, Julia, Excel
- **Complete Event Capture**: No sampling by default - full reproducibility
- **Statistical Analysis**: Built-in descriptive statistics and metrics calculations
- **Low Overhead**: Negligible cost when no experiment is actively collecting data
## Why CrucibleTelemetry?
Standard production telemetry libraries focus on monitoring and alerting, but research experiments have different requirements:
| Production Telemetry | Research Telemetry (this library) |
|---------------------|-----------------------------------|
| Real-time dashboards | Statistical analysis and exports |
| Sampling for efficiency | Complete capture for reproducibility |
| Fixed metrics | Rich, experiment-specific metadata |
| Single workload tracking | Multiple concurrent experiments |
| JSON/logs output | CSV and JSON Lines for analysis tools (Parquet planned) |
## Installation
Add `crucible_telemetry` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:crucible_telemetry, "~> 0.1.0"}
  ]
end
```
Or install from GitHub:
```elixir
def deps do
  [
    {:crucible_telemetry, github: "nshkrdotcom/elixir_ai_research", sparse: "apps/telemetry_research"}
  ]
end
```
## Quick Start
```elixir
# 1. Start an experiment
{:ok, experiment} = CrucibleTelemetry.start_experiment(
  name: "ensemble_vs_single",
  hypothesis: "5-model ensemble achieves >99% reliability",
  condition: "treatment",
  tags: ["accuracy", "reliability"]
)
# 2. Run your AI workload - events are automatically collected
# Your existing code with :telemetry.execute() calls works unchanged
# 3. Stop and analyze
{:ok, experiment} = CrucibleTelemetry.stop_experiment(experiment.id)
metrics = CrucibleTelemetry.calculate_metrics(experiment.id)
# => %{
# latency: %{mean: 150.5, p95: 250.0, ...},
# cost: %{total: 0.025, mean_per_request: 0.0025, ...},
# reliability: %{success_rate: 0.99, ...}
# }
# 4. Export for analysis
{:ok, path} = CrucibleTelemetry.export(experiment.id, :csv)
# Now analyze in Python: pd.read_csv(path)
```
## Core Concepts
### Experiments
An **experiment** is an isolated collection session with its own:
- Unique ID and metadata
- Dedicated storage (ETS table)
- Telemetry event handlers
- Tags and conditions for comparison
```elixir
{:ok, experiment} = CrucibleTelemetry.start_experiment(
  name: "gpt4_baseline",
  hypothesis: "Single GPT-4 achieves 90% accuracy on benchmark",
  condition: "control",
  tags: ["h1", "baseline", "gpt4"],
  sample_size: 1000,
  metadata: %{
    researcher: "alice",
    benchmark: "mmlu",
    version: "v1"
  }
)
```
### Event Collection
CrucibleTelemetry automatically attaches to these standard telemetry events (an emission sketch follows the list):
- `[:req_llm, :request, :start|stop|exception]` - LLM API calls
- `[:ensemble, :prediction, :start|stop]` - Ensemble predictions
- `[:ensemble, :vote, :completed]` - Voting results
- `[:hedging, :request, :*]` - Request hedging events
- `[:causal_trace, :event, :created]` - Reasoning traces
- `[:altar, :tool, :*]` - Tool invocations
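Any code that already emits one of these events via `:telemetry.execute/3` is captured while an experiment is running; nothing needs to change. A minimal sketch of such an emission (the measurement and metadata keys here are illustrative, not a contract of this library):
```elixir
# Emit a completion event for an LLM request. While an experiment is active,
# CrucibleTelemetry's attached handlers pick it up and store it for that experiment.
:telemetry.execute(
  [:req_llm, :request, :stop],
  %{duration: System.convert_time_unit(1_500, :millisecond, :native)},
  %{model: "gpt-4", provider: :openai, status: :ok}
)
```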
Each captured event is enriched with the following (an illustrative example appears after the list):
- Experiment context (ID, name, condition, tags)
- Computed metrics (latency, cost, success)
- Timestamps (microsecond precision)
- Custom metadata
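After enrichment, a stored event might look roughly like the map below. Field names such as `latency_ms`, `cost_usd`, and `success` match the query and export examples in this README; the exact shape is illustrative:
```elixir
%{
  event_name: [:req_llm, :request, :stop],
  experiment_id: "exp_a1b2c3",
  experiment_name: "ensemble_vs_single",
  condition: "treatment",
  tags: ["accuracy", "reliability"],
  latency_ms: 150.5,
  cost_usd: 0.0025,
  success: true,
  timestamp: 1_733_000_000_000_000,  # microsecond precision
  metadata: %{model: "gpt-4"}
}
```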
### Storage
Events are stored in **ETS tables** for fast in-memory access:
```elixir
# Query events by filters
events = CrucibleTelemetry.Store.query(experiment.id, %{
  event_name: [:req_llm, :request, :stop],
  success: true,
  time_range: {start_time, end_time}
})
```
ETS storage is ideal for experiments with <1M events. For longer experiments or persistent storage, PostgreSQL backend support is planned.
### Metrics & Analysis
Calculate comprehensive metrics automatically:
```elixir
metrics = CrucibleTelemetry.calculate_metrics(experiment.id)
# Latency metrics
metrics.latency.mean # Average latency
metrics.latency.median # Median latency
metrics.latency.p50 # 50th percentile
metrics.latency.p95 # 95th percentile
metrics.latency.p99 # 99th percentile
metrics.latency.std_dev # Standard deviation
# Cost metrics
metrics.cost.total # Total cost in USD
metrics.cost.mean_per_request # Average cost per request
metrics.cost.cost_per_1k_requests # Projected cost for 1K requests
metrics.cost.cost_per_1m_requests # Projected cost for 1M requests
# Reliability metrics
metrics.reliability.success_rate # Success rate (0.0-1.0)
metrics.reliability.successful # Count of successful requests
metrics.reliability.failed # Count of failed requests
metrics.reliability.sla_99 # Meets 99% SLA?
metrics.reliability.sla_999 # Meets 99.9% SLA?
# Token metrics (if available)
metrics.tokens.total_prompt # Total prompt tokens
metrics.tokens.total_completion # Total completion tokens
metrics.tokens.mean_total # Average tokens per request
```
### Export Formats
Export data for analysis in your preferred tool:
#### CSV (Excel, pandas, R)
```elixir
{:ok, path} = CrucibleTelemetry.export(experiment.id, :csv,
  path: "results/experiment.csv"
)
# Then in Python:
# import pandas as pd
# df = pd.read_csv("results/experiment.csv")
# df.groupby('condition')['latency_ms'].describe()
```
#### JSON Lines (streaming, jq)
```elixir
{:ok, path} = CrucibleTelemetry.export(experiment.id, :jsonl,
  path: "results/experiment.jsonl"
)
# Then with jq:
# cat results/experiment.jsonl | jq '.latency_ms' | jq -s 'add/length'
```
## Use Cases
### 1. A/B Testing
Compare two approaches side-by-side:
```elixir
# Control: Single model
{:ok, control} = CrucibleTelemetry.start_experiment(
  name: "control_single_model",
  condition: "control",
  tags: ["ab_test"]
)
# Treatment: Ensemble
{:ok, treatment} = CrucibleTelemetry.start_experiment(
  name: "treatment_ensemble",
  condition: "treatment",
  tags: ["ab_test"]
)
# ... run workloads ...
# Compare results
comparison = CrucibleTelemetry.Analysis.compare_experiments([
  control.id,
  treatment.id
])
```
### 2. Performance Benchmarking
Track performance over time:
```elixir
{:ok, exp} = CrucibleTelemetry.start_experiment(
  name: "gemini_2_flash_benchmark",
  tags: ["benchmark", "latency", "2024-12"]
)
# Run benchmark suite
Enum.each(benchmark_queries, fn query ->
  # Make LLM calls - automatically tracked
end)
{:ok, _} = CrucibleTelemetry.stop_experiment(exp.id)
# Export for historical tracking
CrucibleTelemetry.export(exp.id, :csv,
  path: "benchmarks/gemini_2_flash_#{Date.utc_today()}.csv"
)
```
### 3. Hypothesis Testing
Test specific hypotheses about your system:
```elixir
{:ok, exp} = CrucibleTelemetry.start_experiment(
  name: "ensemble_reliability",
  hypothesis: "5-model ensemble achieves >99% reliability",
  condition: "ensemble_5x",
  tags: ["h1", "reliability"],
  sample_size: 1000
)
# ... collect 1000 samples ...
metrics = CrucibleTelemetry.calculate_metrics(exp.id)
# Test hypothesis
hypothesis_confirmed = metrics.reliability.success_rate > 0.99
IO.puts("Hypothesis #{if hypothesis_confirmed, do: "CONFIRMED", else: "REJECTED"}")
IO.puts("Success rate: #{metrics.reliability.success_rate * 100}%")
```
### 4. Cost Analysis
Track and optimize costs:
```elixir
{:ok, exp} = CrucibleTelemetry.start_experiment(
  name: "cost_optimization",
  tags: ["cost", "optimization"]
)
# ... run workload ...
metrics = CrucibleTelemetry.calculate_metrics(exp.id)
IO.puts("Total cost: $#{metrics.cost.total}")
IO.puts("Cost per 1M requests: $#{metrics.cost.cost_per_1m_requests}")
# Identify expensive requests
expensive_events =
  exp.id
  |> CrucibleTelemetry.Store.query(%{})
  |> Enum.filter(&(&1.cost_usd > 0.01))
  |> Enum.sort_by(& &1.cost_usd, :desc)
```
## API Reference
### CrucibleTelemetry
Main module with convenience functions.
- `start_experiment(opts)` - Start a new experiment
- `stop_experiment(experiment_id)` - Stop an experiment
- `get_experiment(experiment_id)` - Get experiment details
- `list_experiments()` - List all experiments
- `export(experiment_id, format, opts)` - Export data
- `calculate_metrics(experiment_id)` - Calculate metrics
### CrucibleTelemetry.Experiment
Experiment lifecycle management.
- `start(opts)` - Start experiment with options
- `stop(experiment_id)` - Stop experiment
- `get(experiment_id)` - Get experiment
- `list()` - List experiments
- `archive(experiment_id, opts)` - Archive to file/S3
- `cleanup(experiment_id, opts)` - Clean up resources
### CrucibleTelemetry.Store
Data storage and querying.
- `get_all(experiment_id)` - Get all events
- `query(experiment_id, filters)` - Query with filters
- `insert(experiment_id, event)` - Insert event (internal)
- `delete_experiment(experiment_id)` - Delete all data
### CrucibleTelemetry.Export
Export to various formats.
- `export(experiment_id, format, opts)` - Export data
- `export_multiple(experiment_ids, format, opts)` - Export multiple
### CrucibleTelemetry.Analysis
Statistical analysis and metrics.
- `calculate_metrics(experiment_id)` - Calculate all metrics
- `compare_experiments(experiment_ids)` - Compare experiments
## Examples
See the `examples/` directory for complete examples:
- `basic_usage.exs` - Basic workflow walkthrough
- `ab_testing.exs` - A/B testing with two experiments
- `custom_metrics.exs` - Custom event tracking
Run examples with:
```bash
cd apps/telemetry_research
mix run examples/basic_usage.exs
```
## Testing
Run the test suite:
```bash
cd apps/telemetry_research
mix test
```
Run with coverage:
```bash
mix test --cover
```
## Architecture
```
┌──────────────────────────────────────────────────────┐
│                  CrucibleTelemetry                   │
│                                                      │
│  ┌────────────┐   ┌──────────┐   ┌───────────────┐   │
│  │ Experiment │   │ Handler  │   │     Store     │   │
│  │  Manager   │   │ Pipeline │   │      ETS      │   │
│  └────────────┘   └──────────┘   └───────────────┘   │
│        │               │                 │           │
│        └───────────────┴─────────────────┘           │
│                        │                             │
│           ┌────────────▼────────────┐                │
│           │   Export   │  Analysis  │                │
│           │  CSV/JSON  │  Metrics   │                │
│           └────────────┴────────────┘                │
└──────────────────────────────────────────────────────┘
```
## Telemetry Events
CrucibleTelemetry listens for these standard events (a sketch for emitting compatible events from your own code follows the lists):
### req_llm Events
- `[:req_llm, :request, :start]` - LLM request started
- `[:req_llm, :request, :stop]` - LLM request completed
- `[:req_llm, :request, :exception]` - LLM request failed
### Ensemble Events
- `[:ensemble, :prediction, :start]` - Ensemble prediction started
- `[:ensemble, :prediction, :stop]` - Ensemble prediction completed
- `[:ensemble, :vote, :completed]` - Voting completed
### Hedging Events
- `[:hedging, :request, :start]` - Hedging request started
- `[:hedging, :request, :duplicated]` - Request duplicated
- `[:hedging, :request, :stop]` - Hedging request completed
### Causal Trace Events
- `[:causal_trace, :event, :created]` - Reasoning event created
### Altar Tool Events
- `[:altar, :tool, :start]` - Tool invocation started
- `[:altar, :tool, :stop]` - Tool invocation completed
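If your own code does not yet emit these events, `:telemetry.span/3` from the standard `:telemetry` library is one way to produce matching `:start`/`:stop`/`:exception` pairs. A minimal sketch (the wrapper module and metadata keys are hypothetical):
```elixir
defmodule MyApp.InstrumentedLLM do
  # Hypothetical wrapper: emits [:req_llm, :request, :start | :stop | :exception]
  # events around whatever function actually performs the LLM call.
  def complete(prompt, call_fun) when is_function(call_fun, 1) do
    :telemetry.span([:req_llm, :request], %{prompt_bytes: byte_size(prompt)}, fn ->
      result = call_fun.(prompt)
      {result, %{status: :ok}}
    end)
  end
end
```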
## Performance
CrucibleTelemetry is designed for minimal overhead (a quick way to check this in your own environment follows the list):
- **Event handling**: <1μs per event (in-memory ETS insert)
- **Storage**: Up to 1M events in memory (~100-500MB depending on metadata)
- **Query**: Fast filtering with ETS ordered_set
- **Export**: Streaming to avoid memory spikes
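A rough way to sanity-check the per-event cost yourself, assuming an experiment has already been started so handlers are attached (the event name and measurements below are placeholders):
```elixir
# Time 100_000 synthetic events end-to-end (handler dispatch + ETS insert).
{micros, :ok} =
  :timer.tc(fn ->
    for _ <- 1..100_000 do
      :telemetry.execute([:req_llm, :request, :stop], %{duration: 1}, %{status: :ok})
    end

    :ok
  end)

IO.puts("~#{Float.round(micros / 100_000, 2)} µs per event")
```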
## Roadmap
- [ ] PostgreSQL backend for persistent storage
- [ ] TimescaleDB support for time-series optimization
- [ ] Parquet export format
- [ ] LiveView dashboard for real-time monitoring
- [ ] Statistical hypothesis testing (t-test, chi-square)
- [ ] Continuous aggregates
- [ ] S3 archival support
- [ ] Multi-node distributed experiments
## License
MIT License - see [LICENSE](https://github.com/North-Shore-AI/crucible_telemetry/blob/main/LICENSE) file for details