<p align="center">
<img src="assets/eval_ex.svg" alt="EvalEx" width="200">
</p>
<h1 align="center">EvalEx</h1>
<p align="center">
<a href="https://github.com/North-Shore-AI/eval_ex/actions"><img src="https://github.com/North-Shore-AI/eval_ex/workflows/CI/badge.svg" alt="CI Status"></a>
<a href="https://hex.pm/packages/eval_ex"><img src="https://img.shields.io/hexpm/v/eval_ex.svg" alt="Hex.pm"></a>
<a href="https://hexdocs.pm/eval_ex"><img src="https://img.shields.io/badge/docs-hexdocs-blue.svg" alt="Documentation"></a>
<img src="https://img.shields.io/badge/elixir-%3E%3D%201.14-purple.svg" alt="Elixir">
<a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License"></a>
</p>
<p align="center">
Model evaluation harness with comprehensive metrics and statistical analysis
</p>
---
EvalEx provides a framework for defining, running, and comparing model evaluations with built-in metrics, benchmark suites, and Crucible integration. Designed for the CNS 3.0 dialectical reasoning system and compatible with any ML evaluation workflow.
## Features
- **Evaluation Behaviour**: Define custom evaluations with standardized structure
- **Built-in Metrics**: Exact match, F1, BLEU, ROUGE, entailment, citation accuracy, schema compliance
- **CNS Benchmark Suites**: Pre-configured evaluations for Proposer, Antagonist, and full pipeline
- **Result Comparison**: Statistical comparison of multiple evaluation runs
- **Crucible Integration**: Submit results to Crucible Framework for tracking and visualization
- **Parallel Execution**: Run evaluations in parallel for faster results
## Installation
Add `eval_ex` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:eval_ex, path: "../eval_ex"}
  ]
end
```
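The `path` dependency assumes a checkout alongside this package, as in the North Shore AI monorepo. If you are pulling the published package from Hex instead, use a version requirement such as `{:eval_ex, "~> 0.1"}` (version illustrative; see the Hex badge above for the current release).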
## Quick Start
### Define a Custom Evaluation
```elixir
defmodule MyEval do
  use EvalEx.Evaluation

  @impl true
  def name, do: "proposer_scifact"

  @impl true
  def dataset, do: :scifact

  @impl true
  def metrics, do: [:entailment, :citation_accuracy, :schema_compliance]

  @impl true
  def evaluate(prediction, ground_truth) do
    %{
      entailment: EvalEx.Metrics.entailment(prediction, ground_truth),
      citation_accuracy: EvalEx.Metrics.citation_accuracy(prediction, ground_truth),
      schema_compliance: EvalEx.Metrics.schema_compliance(prediction, ground_truth)
    }
  end
end
```
### Run Evaluation
```elixir
# Prepare your data
predictions = [
  %{hypothesis: "Vitamin D reduces COVID severity", claims: [...], evidence: [...]},
  # ... more predictions
]

ground_truth = [
  %{hypothesis: "Ground truth hypothesis", evidence: [...]},
  # ... more ground truth
]

# Run evaluation
{:ok, result} = EvalEx.run(MyEval, predictions, ground_truth: ground_truth)

# View results
IO.puts(EvalEx.Result.format(result))
# => Evaluation: proposer_scifact
#    Dataset: scifact
#    Samples: 100
#    Duration: 1234ms
#
#    Metrics:
#      entailment: 0.7500 (±0.1200)
#      citation_accuracy: 0.9600 (±0.0500)
#      schema_compliance: 1.0000 (±0.0000)
```
### Use CNS Benchmark Suites
```elixir
# Use pre-configured CNS Proposer evaluation
{:ok, result} = EvalEx.run(
  EvalEx.Suites.cns_proposer(),
  model_outputs,
  ground_truth: scifact_data
)

# Use CNS Antagonist evaluation
{:ok, result} = EvalEx.run(
  EvalEx.Suites.cns_antagonist(),
  antagonist_outputs,
  ground_truth: synthetic_contradictions
)

# Use full pipeline evaluation
{:ok, result} = EvalEx.run(
  EvalEx.Suites.cns_full(),
  pipeline_outputs,
  ground_truth: ground_truth
)
```
### Compare Results
```elixir
# Run multiple evaluations
{:ok, result1} = EvalEx.run(MyEval, predictions_v1, ground_truth: ground_truth)
{:ok, result2} = EvalEx.run(MyEval, predictions_v2, ground_truth: ground_truth)
{:ok, result3} = EvalEx.run(MyEval, predictions_v3, ground_truth: ground_truth)

# Compare
comparison = EvalEx.compare([result1, result2, result3])

# View comparison
IO.puts(EvalEx.Comparison.format(comparison))
# => Comparison of 3 evaluations
#    Best: proposer_v2
#
#    Rankings:
#      1. proposer_v2: 0.8750
#      2. proposer_v3: 0.8250
#      3. proposer_v1: 0.7800
```
### Crucible Integration
```elixir
# Run with Crucible tracking
{:ok, result} = EvalEx.run_with_crucible(
  MyEval,
  predictions,
  experiment_name: "proposer_eval_v3",
  ground_truth: ground_truth,
  track_metrics: true,
  tags: ["proposer", "scifact", "v3"],
  description: "Evaluating improved claim extraction"
)

# Export for Crucible
{:ok, json} = EvalEx.Crucible.export(result, :json)
```
## Built-in Metrics
### Text Metrics
- `exact_match/2` - Exact string match (case-insensitive, trimmed)
- `fuzzy_match/2` - Fuzzy string matching using Levenshtein distance
- `f1/2` - Token-level F1 score
- `bleu/3` - BLEU score with n-gram overlap
- `rouge/2` - ROUGE-L score (longest common subsequence)
- `meteor/2` - METEOR score approximation with alignment and word order
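For the metrics not shown in the Usage section below, the calls look like this (a sketch: argument order is assumed to be prediction first, reference second, and the third argument of `bleu/3` is assumed to be the maximum n-gram order; check the `EvalEx.Metrics` docs):

```elixir
# Assumed signatures; consult the EvalEx.Metrics docs for the actual ones
EvalEx.Metrics.bleu("the cat sat on the mat", "the cat sat on a mat", 4)
EvalEx.Metrics.rouge("the cat sat on the mat", "the cat sat on a mat")
```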
### Semantic & Quality Metrics
- `entailment/2` - Entailment score (placeholder for NLI model integration)
- `bert_score/2` - BERTScore placeholder (returns precision, recall, f1)
- `factual_consistency/2` - Validates facts in prediction align with ground truth
### Code Generation Metrics
- `pass_at_k/3` - Pass@k metric for code generation (probability that at least one of k samples passes the tests; see the estimator note after this list)
- `perplexity/1` - Perplexity metric for language model outputs
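For reference, the unbiased pass@k estimator popularized by HumanEval is `pass@k = 1 - C(n - c, k) / C(n, k)`, where `n` is the number of samples generated per problem and `c` the number that pass the tests; the naive fraction of passing samples is a biased estimate whenever `k < n`. Check the `EvalEx.Metrics` docs for which estimator `pass_at_k/3` implements.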
### Diversity & Quality Metrics
- `diversity/1` - Text diversity via distinct n-grams (distinct-1, distinct-2, distinct-3: the ratio of unique n-grams to total n-grams)
### CNS-Specific Metrics
- `citation_accuracy/2` - Validates citations exist and support claims
- `schema_compliance/2` - Validates prediction conforms to expected schema
### Usage
```elixir
# Simple text comparison
EvalEx.Metrics.exact_match("hello world", "Hello World")
# => 1.0

# Token overlap
EvalEx.Metrics.f1("the cat sat on the mat", "the dog sat on a mat")
# => 0.8

# Citation validation
prediction = %{
  hypothesis: "Claim text",
  citations: ["e1", "e2"]
}

ground_truth = %{
  evidence: [
    %{id: "e1", text: "Evidence 1"},
    %{id: "e2", text: "Evidence 2"}
  ]
}

EvalEx.Metrics.citation_accuracy(prediction, ground_truth)
# => 1.0

# Schema validation
prediction = %{name: "test", value: 42, status: "ok"}
schema = %{required: [:name, :value, :status]}

EvalEx.Metrics.schema_compliance(prediction, schema)
# => 1.0
```
## Statistical Analysis
EvalEx provides comprehensive statistical analysis tools for comparing evaluation results:
### Confidence Intervals
```elixir
# Calculate confidence intervals for all metrics
intervals = EvalEx.Comparison.confidence_intervals(result, 0.95)
# => %{
#   accuracy: %{mean: 0.85, lower: 0.82, upper: 0.88, confidence: 0.95},
#   f1: %{mean: 0.80, lower: 0.77, upper: 0.83, confidence: 0.95}
# }
```
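Intervals of this shape are typically parametric, i.e. `mean ± z * s / sqrt(n)` over the per-sample scores, which assumes those scores are roughly normally distributed; for skewed metrics or small samples, prefer the bootstrap intervals below.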
### Effect Size (Cohen's d)
```elixir
# Calculate effect size between two results
effect = EvalEx.Comparison.effect_size(result1, result2, :accuracy)
# => -0.45 (negative means result2 has higher accuracy)

# Interpretation (Cohen's conventional thresholds, on |d|):
# - Small:  |d| ≈ 0.2
# - Medium: |d| ≈ 0.5
# - Large:  |d| ≈ 0.8
```
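Cohen's d is the difference in means divided by the pooled standard deviation, `d = (mean1 - mean2) / s_pooled`, which is why a negative value indicates that the second result scored higher.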
### Bootstrap Confidence Intervals
```elixir
# More robust than parametric methods for non-normal distributions
values = [0.7, 0.75, 0.8, 0.85, 0.9]
ci = EvalEx.Comparison.bootstrap_ci(values, 1000, 0.95)
# => %{mean: 0.80, lower: 0.71, upper: 0.89}
```
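For intuition, the percentile bootstrap resamples the observed values with replacement many times and reads the interval off the tails of the resampled statistic. A minimal sketch of the idea (illustrative only, not EvalEx's implementation):

```elixir
# Illustrative percentile bootstrap; EvalEx.Comparison.bootstrap_ci/3
# encapsulates this, so the sketch is for intuition only.
values = [0.7, 0.75, 0.8, 0.85, 0.9]
n_resamples = 1000

sorted_means =
  1..n_resamples
  |> Enum.map(fn _ ->
    # Resample with replacement, then take the mean
    resample = Enum.map(values, fn _ -> Enum.random(values) end)
    Enum.sum(resample) / length(resample)
  end)
  |> Enum.sort()

# 95% CI from the 2.5th and 97.5th percentiles of the resampled means
lower = Enum.at(sorted_means, round(0.025 * n_resamples))
upper = Enum.at(sorted_means, round(0.975 * n_resamples) - 1)
IO.inspect({lower, upper})
```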
### ANOVA (Analysis of Variance)
```elixir
# Test for significant differences across multiple results
result = EvalEx.Comparison.anova([result1, result2, result3], :accuracy)
# => %{
#   f_statistic: 5.2,
#   df_between: 2,
#   df_within: 6,
#   significant: true,
#   interpretation: "Strong evidence of difference"
# }
```
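The F-statistic is the ratio of between-group to within-group mean squares, `F = MS_between / MS_within`; with `k` runs and `N` per-sample scores in total, `df_between = k - 1` and `df_within = N - k` (the 2 and 6 above correspond to three runs of three samples each).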
## CNS Benchmark Suites
### CNS Proposer (`EvalEx.Suites.CNSProposer`)
Evaluates claim extraction, evidence grounding, and schema compliance.
**Metrics:**
- Schema compliance: 100% target (hard requirement)
- Citation accuracy: 96%+ target (hard gate)
- Entailment score: 0.75+ target
- Semantic similarity: 0.70+ target
**Dataset:** SciFact
### CNS Antagonist (`EvalEx.Suites.CNSAntagonist`)
Evaluates contradiction detection, precision, recall, and beta-1 quantification.
**Metrics:**
- Precision: 0.8+ target (minimize false alarms)
- Recall: 0.7+ target (don't miss real contradictions)
- Beta-1 accuracy: Within ±10% of ground truth
- Flag actionability: 80%+ of HIGH flags lead to action
**Dataset:** Synthetic contradictions
### CNS Full Pipeline (`EvalEx.Suites.CNSFull`)
Evaluates end-to-end Proposer → Antagonist → Synthesizer pipeline.
**Metrics:**
- Schema compliance: Proposer output validation
- Citation accuracy: Evidence grounding
- Beta-1 reduction: Synthesis quality (target: 30%+ reduction)
- Critic pass rate: All critics passing thresholds
- Convergence: Iterations to completion
**Dataset:** SciFact
## Architecture
```
eval_ex/
├── lib/
│   └── eval_ex/
│       ├── evaluation.ex    # Evaluation behaviour
│       ├── runner.ex        # Evaluation runner
│       ├── result.ex        # Result struct
│       ├── metrics.ex       # Built-in metrics
│       ├── comparison.ex    # Result comparison
│       ├── crucible.ex      # Crucible integration
│       └── suites/
│           ├── cns_proposer.ex
│           ├── cns_antagonist.ex
│           └── cns_full.ex
└── test/
```
## CNS 3.0 Integration
EvalEx implements the evaluation framework specified in the CNS 3.0 Agent Playbook:
- **Semantic Grounding**: 4-stage validation pipeline (citation → entailment → similarity → paraphrase); see the sketch below
- **Agent Metrics**: Standardized success thresholds for each CNS agent
- **Statistical Testing**: T-tests for comparing evaluation runs
- **Actionable Feedback**: Detailed breakdowns for debugging and improvement
See the CNS 3.0 Agent Playbook in the tinkerer project for complete specifications.
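As an illustration, the four grounding stages could be composed as a short-circuiting gate over the built-in metrics. This is a sketch under assumed thresholds, not the canonical CNS 3.0 gate:

```elixir
defmodule GroundingGate do
  # Hypothetical composition of the 4-stage pipeline using EvalEx metrics.
  # Thresholds are illustrative, not the CNS 3.0 specified values.
  def grounded?(prediction, ground_truth) do
    with true <- EvalEx.Metrics.citation_accuracy(prediction, ground_truth) >= 0.96,
         true <- EvalEx.Metrics.entailment(prediction, ground_truth) >= 0.75 do
      # Stages 3-4 (semantic similarity, paraphrase validation) would chain here
      true
    else
      _ -> false
    end
  end
end
```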
## Development
```bash
# Get dependencies
mix deps.get

# Run tests
mix test

# Generate documentation
mix docs

# Code quality
mix format
mix credo --strict
```
## License
MIT
## Links
- [GitHub Repository](https://github.com/North-Shore-AI/eval_ex)
- [North Shore AI Monorepo](https://github.com/North-Shore-AI)
- Crucible Framework (see North Shore AI monorepo)
- CNS Project (see North Shore AI monorepo)