<p align="center">
<img src="assets/metrics_ex.svg" alt="MetricsEx" width="200">
</p>
<h1 align="center">MetricsEx</h1>
<p align="center">
<a href="https://github.com/North-Shore-AI/metrics_ex/actions"><img src="https://github.com/North-Shore-AI/metrics_ex/workflows/CI/badge.svg" alt="CI Status"></a>
<a href="https://hex.pm/packages/metrics_ex"><img src="https://img.shields.io/hexpm/v/metrics_ex.svg" alt="Hex.pm"></a>
<a href="https://hexdocs.pm/metrics_ex"><img src="https://img.shields.io/badge/docs-hexdocs-blue.svg" alt="Documentation"></a>
<img src="https://img.shields.io/badge/elixir-%3E%3D%201.14-purple.svg" alt="Elixir">
<a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License"></a>
</p>
<p align="center">
Metrics collection and aggregation for Elixir, with multiple export formats and real-time streaming
</p>
---
## Purpose
MetricsEx provides comprehensive metrics collection, aggregation, and querying for:
- **Experiment results** (Crucible)
- **Model performance** (CNS agents)
- **System health** (Work jobs, services)
- **Training progress** (Tinkex)
## Features
- **Multiple metric types**: counters, gauges, histograms
- **Fast in-memory storage**: ETS-based with configurable retention
- **Flexible aggregations**: mean, sum, count, min, max, percentiles
- **Time series support**: Fixed-interval buckets (minute, hour, day)
- **Telemetry integration**: Auto-attach to telemetry events
- **Export formats**: JSON API, Prometheus text format
- **Real-time streaming**: Phoenix PubSub integration
- **Dashboard-ready**: Pre-computed rollups for UI consumption
## Installation
Add `metrics_ex` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:metrics_ex, "~> 0.1.0"}
  ]
end
```
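Then fetch it with `mix deps.get`.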
## Configuration
```elixir
# config/config.exs
config :metrics_ex,
  retention_hours: 24,
  pubsub: MyApp.PubSub # Optional: for real-time streaming
```
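Retention can be tuned per environment. For example, a shorter window in tests keeps the in-memory tables small (a sketch, assuming the standard Mix config layout):

```elixir
# config/test.exs — keep only one hour of metrics during test runs
config :metrics_ex, retention_hours: 1
```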
## Quick Start
### Recording Metrics
```elixir
# Record experiment results
MetricsEx.record(:experiment_result, %{
  experiment_id: "exp_123",
  metric: :entailment_score,
  value: 0.75,
  tags: %{model: "llama-3.1", dataset: "scifact"}
})

# Increment counters
MetricsEx.increment(:jobs_completed, tags: %{tenant: "cns"})
MetricsEx.increment(:requests_total, 5, tags: %{endpoint: "/api"})

# Record gauge values (point-in-time)
MetricsEx.gauge(:queue_depth, 42, tags: %{queue: "sno_validation"})

# Record histogram values (distributions)
MetricsEx.histogram(:response_time, 123.45, tags: %{endpoint: "/api"})

# Measure execution time
result = MetricsEx.measure(:database_query, fn ->
  # expensive operation
  MyApp.run_query()
end, tags: %{query: "SELECT"})
```
### Querying Metrics
```elixir
# Get raw metrics
MetricsEx.get_metrics(name: :jobs_completed, limit: 100)
# => %{metrics: [...], count: 100, timestamp: "2025-12-06T..."}

# Aggregate with grouping
MetricsEx.query(:experiment_result,
  group_by: [:model],
  aggregation: :mean,
  window: :last_24h
)
# => [
#   %{model: "llama-3.1", mean: 0.72},
#   %{model: "qwen", mean: 0.68}
# ]

# Time series data
MetricsEx.time_series(:jobs_completed,
  interval: :hour,
  aggregation: :count,
  window: :last_24h
)
# => [
#   %{timestamp: ~U[2025-12-06 00:00:00Z], count: 45},
#   %{timestamp: ~U[2025-12-06 01:00:00Z], count: 52},
#   ...
# ]

# Pre-computed rollups for dashboards
MetricsEx.rollup(:experiment_result,
  group_by: [:model, :dataset],
  aggregations: [:mean, :count, :p95],
  window: :last_24h
)
# => %{
#   "llama-3.1/scifact" => %{mean: 0.72, count: 150, p95: 0.89},
#   "qwen/fever" => %{mean: 0.68, count: 200, p95: 0.85}
# }
```
### Telemetry Integration
```elixir
# Attach to telemetry events
MetricsEx.attach_telemetry([
  {[:work, :job, :completed], :counter},
  {[:work, :job, :duration], :histogram},
  {[:crucible, :experiment, :completed], :histogram},
  {[:queue, :depth], :gauge}
])

# Now telemetry events are automatically recorded
:telemetry.execute([:work, :job, :completed], %{count: 1}, %{tenant: "cns"})
```
### Prometheus Export
```elixir
# Export all metrics in Prometheus format
prometheus_text = MetricsEx.Storage.Prometheus.export()

# Export a specific metric
prometheus_text = MetricsEx.Storage.Prometheus.export(:jobs_completed)

# Use in a Phoenix controller
defmodule MyAppWeb.MetricsController do
  use MyAppWeb, :controller

  def prometheus(conn, _params) do
    metrics = MetricsEx.Storage.Prometheus.export()

    conn
    |> put_resp_content_type(MetricsEx.Storage.Prometheus.content_type())
    |> send_resp(200, metrics)
  end
end
```
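To expose the endpoint for scraping, route to the controller above from your Phoenix router. A minimal sketch; the scope and pipeline names are assumptions from a typical Phoenix app:

```elixir
# router.ex — let Prometheus scrape GET /metrics
scope "/", MyAppWeb do
  pipe_through :api

  get "/metrics", MetricsController, :prometheus
end
```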
## Architecture
### Supervision Tree
```
MetricsEx.Supervisor
├── Phoenix.PubSub (optional)
├── MetricsEx.Storage.ETS (storage backend)
└── MetricsEx.Recorder (recording coordinator)
```
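You can confirm the running tree from IEx, assuming the supervisor is registered under its module name:

```elixir
# Lists the supervisor's children with their pids and types
Supervisor.which_children(MetricsEx.Supervisor)
```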
### Components
#### 1. Metric Types (`MetricsEx.Metric`)
Defines three core metric types:
- **Counter**: Monotonically increasing (e.g., request count, jobs completed)
- **Gauge**: Point-in-time value (e.g., queue depth, memory usage)
- **Histogram**: Distribution of values (e.g., response times, scores)
#### 2. Storage Backend (`MetricsEx.Storage.ETS`)
- Fast in-memory ETS storage
- Configurable retention (default: 24 hours)
- Automatic cleanup of old metrics
- Concurrent reads/writes
#### 3. Recorder (`MetricsEx.Recorder`)
- GenServer for recording metrics
- Real-time PubSub broadcasting (see the subscriber sketch below)
- Type inference based on metric name/value
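When a `pubsub` is configured, each recorded metric is broadcast as it arrives. A minimal subscriber sketch; the topic name `"metrics"` and the `{:metric_recorded, metric}` message shape are assumptions, so check the `MetricsEx.Recorder` docs for the actual values:

```elixir
# Topic and message shape are hypothetical — verify against MetricsEx.Recorder
Phoenix.PubSub.subscribe(MyApp.PubSub, "metrics")

receive do
  {:metric_recorded, metric} -> IO.inspect(metric, label: "new metric")
end
```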
#### 4. Aggregator (`MetricsEx.Aggregator`)
- Flexible aggregation functions: mean, sum, count, min, max, percentiles
- Group by tags
- Time series generation
- Pre-computed rollups
#### 5. Telemetry Handler (`MetricsEx.TelemetryHandler`)
- Auto-attach to telemetry events
- Converts telemetry measurements to metrics
- Extracts tags from metadata
#### 6. API (`MetricsEx.API`)
- JSON-compatible data structures
- Dashboard-ready endpoints
- Time window helpers
## Aggregation Functions
| Function | Description | Example |
|----------|-------------|---------|
| `:count` | Number of metrics | `count: 150` |
| `:sum` | Total sum of values | `sum: 1234` |
| `:mean` | Average value | `mean: 0.72` |
| `:min` | Minimum value | `min: 0.45` |
| `:max` | Maximum value | `max: 0.95` |
| `:p50` | 50th percentile (median) | `p50: 0.70` |
| `:p95` | 95th percentile | `p95: 0.89` |
| `:p99` | 99th percentile | `p99: 0.93` |
## Time Windows
| Window | Description |
|--------|-------------|
| `:last_hour` | Last 60 minutes |
| `:last_24h` | Last 24 hours |
| `:last_7d` | Last 7 days |
| `:last_30d` | Last 30 days |
## Time Intervals
| Interval | Description |
|----------|-------------|
| `:minute` | 1-minute buckets |
| `:hour` | 1-hour buckets |
| `:day` | 1-day buckets |
## System Statistics
```elixir
MetricsEx.get_stats()
# => %{
#   storage: %{
#     total_metrics: 12345,
#     metrics_stored: 12345,
#     metrics_pruned: 567,
#     retention_hours: 24,
#     memory_bytes: 1048576
#   },
#   recorder: %{
#     metrics_recorded: 12345
#   },
#   timestamp: "2025-12-06T12:00:00Z"
# }
```
## Integration Examples
### With Crucible Experiments
```elixir
# Record experiment metrics
MetricsEx.record(:crucible_experiment, %{
  value: entailment_score,
  tags: %{
    experiment_id: experiment.id,
    model: "llama-3.1",
    dataset: "scifact",
    stage: "validation"
  }
})

# Query experiment results
MetricsEx.query(:crucible_experiment,
  group_by: [:model, :dataset],
  aggregation: :mean,
  window: :last_7d
)
```
### With CNS Agents
```elixir
# Record agent performance
MetricsEx.record(:cns_agent_metric, %{
  value: beta1_score,
  tags: %{
    agent: "antagonist",
    sno_id: sno.id,
    iteration: 3
  }
})

# Track agent iterations
MetricsEx.time_series(:cns_agent_metric,
  interval: :hour,
  aggregation: :mean,
  tags: %{agent: "synthesizer"},
  window: :last_24h
)
```
### With Work Job System
```elixir
# Auto-record via telemetry
:telemetry.execute(
  [:work, :job, :completed],
  %{duration: job_duration_ms},
  %{tenant: tenant, queue: queue_name}
)

# Monitor job throughput
MetricsEx.rollup(:work_job_completed,
  group_by: [:tenant, :queue],
  aggregations: [:count, :mean, :p95],
  window: :last_hour
)
```
## Testing
```bash
# Run all tests
mix test

# Run with coverage
mix test --cover

# Run specific test file
mix test test/metrics_ex/aggregator_test.exs
```
## Performance
- **Write throughput**: 100K+ metrics/sec (async casts to GenServer)
- **Query latency**: <10ms for typical aggregations (in-memory ETS)
- **Memory footprint**: ~100 bytes per metric + overhead
- **Retention cleanup**: Every 5 minutes (configurable)
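These figures depend on hardware, tag cardinality, and payload size. A crude local check of write throughput, using only the public `increment/2` call shown earlier:

```elixir
# Rough throughput probe — results vary with hardware and scheduler load
n = 100_000

{usec, :ok} =
  :timer.tc(fn ->
    Enum.each(1..n, fn i ->
      MetricsEx.increment(:bench_counter, tags: %{shard: rem(i, 10)})
    end)
  end)

IO.puts("~#{round(n / (usec / 1_000_000))} metrics/sec")
```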
## Roadmap
- [ ] Persistent storage backend (PostgreSQL, ClickHouse)
- [ ] Advanced percentile algorithms (t-digest)
- [ ] Alert rules and notifications
- [ ] Metric cardinality limits
- [ ] Query result caching
- [ ] Grafana integration
- [ ] OpenTelemetry compatibility
## Contributing
This project is part of the [North-Shore-AI](https://github.com/North-Shore-AI) monorepo. See the main repository for contribution guidelines.
## License
MIT License - see [LICENSE](LICENSE) for details.
## Related Projects
- [crucible_framework](https://github.com/North-Shore-AI/crucible_framework) - ML experimentation orchestration
- [cns](https://github.com/North-Shore-AI/cns) - Critic-Network Synthesis
- [crucible_telemetry](https://github.com/North-Shore-AI/crucible_telemetry) - Research-grade instrumentation
- [ex_work](https://github.com/North-Shore-AI/ex_work) - Background job processing