README.md

<p align="center">
  <img src="assets/crucible_datasets.svg" alt="Datasets" width="150"/>
</p>

# CrucibleDatasets

[![Elixir](https://img.shields.io/badge/elixir-1.14+-purple.svg)](https://elixir-lang.org)
[![Hex.pm](https://img.shields.io/hexpm/v/crucible_datasets.svg)](https://hex.pm/packages/crucible_datasets)
[![Documentation](https://img.shields.io/badge/docs-hexdocs-purple.svg)](https://hexdocs.pm/crucible_datasets)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/North-Shore-AI/crucible_datasets/blob/main/LICENSE)

**Lightweight dataset management library for AI evaluation research in Elixir.**

CrucibleDatasets provides a unified interface for loading, caching, evaluating, and sampling benchmark datasets (MMLU, HumanEval, GSM8K) with support for versioning, reproducible evaluation, and custom datasets.

> **Note:** v0.5.0 removes the HuggingFace Hub integration from v0.4.x. Versions 0.4.0 and 0.4.1 are deprecated. See [CHANGELOG.md](CHANGELOG.md) for details.

## Features

- **Automatic Caching**: Fast access with local caching and version tracking
- **Comprehensive Metrics**: Exact match, F1, BLEU, and ROUGE evaluation metrics
- **Dataset Sampling**: Random and stratified sampling, train/test splits, and k-fold cross-validation
- **Reproducibility**: Deterministic sampling with seeds, version tracking
- **Result Persistence**: Save and query evaluation results
- **Export Tools**: CSV, JSONL, Markdown, HTML export
- **CrucibleIR Integration**: Unified dataset references via `DatasetRef`
- **Extensible**: Easy integration of custom datasets and metrics

## Supported Datasets

- **MMLU** (Massive Multitask Language Understanding) - 57 subjects across STEM, humanities, social sciences
- **HumanEval** - Code generation benchmark with 164 programming problems
- **GSM8K** - 8,500 grade school math word problems
- **Custom Datasets** - Load from local JSONL files

## Installation

Add `crucible_datasets` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:crucible_datasets, "~> 0.5.0"}
  ]
end
```

## Quick Start

```elixir
# Load a dataset
{:ok, dataset} = CrucibleDatasets.load(:mmlu_stem, sample_size: 100)

# Create predictions (example with perfect predictions)
predictions = Enum.map(dataset.items, fn item ->
  %{
    id: item.id,
    predicted: item.expected,
    metadata: %{latency_ms: 100}
  }
end)

# Evaluate
{:ok, results} = CrucibleDatasets.evaluate(predictions,
  dataset: dataset,
  metrics: [:exact_match, :f1],
  model_name: "my_model"
)

IO.puts("Accuracy: #{results.accuracy * 100}%")
# => Accuracy: 100.0%
```

## DatasetRef Integration

CrucibleDatasets supports `CrucibleIR.DatasetRef` for unified dataset references across the Crucible framework:

```elixir
alias CrucibleIR.DatasetRef

# Create a DatasetRef
ref = %DatasetRef{
  name: :mmlu_stem,
  split: :train,
  options: [sample_size: 100]
}

# Load dataset using DatasetRef
{:ok, dataset} = CrucibleDatasets.load(ref)

# DatasetRef works seamlessly with all dataset operations
predictions = generate_predictions(dataset)
{:ok, results} = CrucibleDatasets.evaluate(predictions, dataset: dataset)
```

This lets the same dataset reference flow through other Crucible components such as `crucible_harness`, `crucible_ensemble`, and `crucible_bench`.

## Usage Examples

### Loading Datasets

```elixir
# Load by name
{:ok, mmlu} = CrucibleDatasets.load(:mmlu_stem, sample_size: 200)
{:ok, gsm8k} = CrucibleDatasets.load(:gsm8k)
{:ok, humaneval} = CrucibleDatasets.load(:humaneval)

# Load custom dataset from file
{:ok, custom} = CrucibleDatasets.load("my_dataset", source: "path/to/data.jsonl")
```
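
Custom datasets are read from local JSONL files, one item per line. Below is a minimal sketch of preparing and loading such a file; the field names are an assumption, mirroring the unified item schema shown under "Dataset Schema" below, and the custom loader's exact requirements may differ:

```elixir
# Sketch: field names mirror the unified item schema (id/input/expected/metadata);
# the custom loader's exact requirements are an assumption here.
jsonl = """
{"id": "my_dataset_0", "input": {"question": "What is 2 + 2?"}, "expected": "4", "metadata": {"difficulty": "easy"}}
{"id": "my_dataset_1", "input": {"question": "What is 3 + 3?"}, "expected": "6", "metadata": {"difficulty": "easy"}}
"""

File.write!("path/to/data.jsonl", jsonl)
{:ok, custom} = CrucibleDatasets.load("my_dataset", source: "path/to/data.jsonl")
```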

### Evaluation

```elixir
# Single model evaluation
{:ok, results} = CrucibleDatasets.evaluate(predictions,
  dataset: :mmlu_stem,
  metrics: [:exact_match, :f1],
  model_name: "gpt4"
)

# Batch evaluation (compare multiple models)
model_predictions = [
  {"model_a", predictions_a},
  {"model_b", predictions_b},
  {"model_c", predictions_c}
]

{:ok, all_results} = CrucibleDatasets.evaluate_batch(model_predictions,
  dataset: :mmlu_stem,
  metrics: [:exact_match, :f1]
)
```

### Sampling and Splitting

```elixir
# Random sampling
{:ok, sample} = CrucibleDatasets.random_sample(dataset,
  size: 50,
  seed: 42
)

# Stratified sampling (maintain subject distribution)
{:ok, stratified} = CrucibleDatasets.stratified_sample(dataset,
  size: 100,
  strata_field: [:metadata, :subject]
)

# Train/test split
{:ok, {train, test}} = CrucibleDatasets.train_test_split(dataset,
  test_size: 0.2,
  shuffle: true
)

# K-fold cross-validation
{:ok, folds} = CrucibleDatasets.k_fold(dataset, k: 5)

Enum.each(folds, fn {train, test} ->
  # Train and evaluate on each fold
end)
```
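
A sketch of cross-validated evaluation over the folds, reusing the `generate_predictions/1` helper from the DatasetRef example above as a stand-in for your model:

```elixir
# Evaluate each fold's test split and average the accuracy.
fold_accuracies =
  Enum.map(folds, fn {_train, test} ->
    predictions = generate_predictions(test)

    {:ok, results} =
      CrucibleDatasets.evaluate(predictions, dataset: test, metrics: [:exact_match])

    results.accuracy
  end)

mean_accuracy = Enum.sum(fold_accuracies) / length(fold_accuracies)
IO.puts("Mean accuracy across folds: #{Float.round(mean_accuracy * 100, 1)}%")
```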

### Result Persistence

```elixir
# Save evaluation results
CrucibleDatasets.save_result(results, "my_experiment")

# Load saved results
{:ok, saved} = CrucibleDatasets.load_result("my_experiment")

# Query results with filters
{:ok, matching} = CrucibleDatasets.query_results(
  model: "gpt4",
  dataset: "mmlu_stem"
)
```

### Export

```elixir
# Export to various formats
CrucibleDatasets.export_csv(results, "results.csv")
CrucibleDatasets.export_jsonl(results, "results.jsonl")
CrucibleDatasets.export_markdown(results, "results.md")
CrucibleDatasets.export_html(results, "results.html")
```

### Cache Management

```elixir
# List cached datasets
cached = CrucibleDatasets.list_cached()

# Invalidate specific cache
CrucibleDatasets.invalidate_cache(:mmlu_stem)

# Clear all cache
CrucibleDatasets.clear_cache()
```

## Dataset Schema

All datasets follow a unified schema:

```elixir
%CrucibleDatasets.Dataset{
  name: "mmlu_stem",
  version: "1.0",
  items: [
    %{
      id: "mmlu_stem_physics_0",
      input: %{
        question: "What is the speed of light?",
        choices: ["3x10^8 m/s", "3x10^6 m/s", "3x10^5 m/s", "3x10^7 m/s"]
      },
      expected: 0,  # Index of correct answer
      metadata: %{
        subject: "physics",
        difficulty: "medium"
      }
    },
    # ... more items
  ],
  metadata: %{
    source: "huggingface:cais/mmlu",
    license: "MIT",
    domain: "STEM",
    total_items: 200,
    loaded_at: ~U[2024-01-15 10:30:00Z],
    checksum: "abc123..."
  }
}
```
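
Because every loader returns this same struct, downstream code can treat items generically. For example, counting items per subject uses only the fields shown above:

```elixir
# Count items per subject using only the unified schema fields.
subject_counts =
  dataset.items
  |> Enum.group_by(& &1.metadata.subject)
  |> Map.new(fn {subject, items} -> {subject, length(items)} end)

IO.inspect(subject_counts, label: "items per subject")
```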

## Evaluation Metrics

### Exact Match

Binary metric (1.0 or 0.0) with normalization:
- Case-insensitive string comparison
- Whitespace normalization
- Numerical comparison with tolerance
- Type coercion (string <-> number)

```elixir
CrucibleDatasets.Evaluator.ExactMatch.compute("Paris", "paris")
# => 1.0

CrucibleDatasets.Evaluator.ExactMatch.compute(42, "42")
# => 1.0
```

### F1 Score

Token-level F1 (harmonic mean of token precision and recall):

```elixir
CrucibleDatasets.Evaluator.F1.compute(
  "The quick brown fox",
  "The fast brown fox"
)
# => 0.75 (precision 3/4, recall 3/4)
```

### BLEU and ROUGE

Machine translation and summarization metrics:

```elixir
CrucibleDatasets.Evaluator.BLEU.compute(predicted, reference)
CrucibleDatasets.Evaluator.ROUGE.compute(predicted, reference)
```

### Custom Metrics

Define custom metrics as functions:

```elixir
semantic_similarity = fn predicted, expected ->
  # Your custom metric logic
  0.95
end

{:ok, results} = CrucibleDatasets.evaluate(predictions,
  dataset: dataset,
  metrics: [:exact_match, semantic_similarity]
)
```
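
A concrete custom metric is just a two-argument function returning a score. Here is a small sketch of a token-overlap (Jaccard) metric, assuming custom metrics receive `(predicted, expected)` and return a float in `0.0..1.0`, as in the placeholder above:

```elixir
# Jaccard similarity over lowercased whitespace tokens.
jaccard = fn predicted, expected ->
  tokens = fn text ->
    text |> to_string() |> String.downcase() |> String.split() |> MapSet.new()
  end

  p = tokens.(predicted)
  e = tokens.(expected)

  case MapSet.size(MapSet.union(p, e)) do
    0 -> 1.0
    union_size -> MapSet.size(MapSet.intersection(p, e)) / union_size
  end
end

{:ok, results} = CrucibleDatasets.evaluate(predictions,
  dataset: dataset,
  metrics: [:exact_match, jaccard]
)
```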

## Architecture

```
CrucibleDatasets/
├── CrucibleDatasets             # Main API
├── Dataset                      # Dataset schema
├── EvaluationResult             # Evaluation result schema
├── Loader/                      # Dataset loaders
│   ├── MMLU                     # MMLU loader
│   ├── HumanEval                # HumanEval loader
│   └── GSM8K                    # GSM8K loader
├── Registry                     # Dataset registry
├── Cache                        # Local caching
├── Evaluator/                   # Evaluation engine
│   ├── ExactMatch               # Exact match metric
│   ├── F1                       # F1 score metric
│   ├── BLEU                     # BLEU score metric
│   └── ROUGE                    # ROUGE score metric
├── Sampler                      # Sampling utilities
├── ResultStore                  # Result persistence
└── Exporter                     # Export utilities
```

## Cache Directory

Datasets are cached in: `~/.elixir_ai_research/datasets/`

```
datasets/
├── manifest.json              # Index of all cached datasets
├── mmlu_stem/
│   └── 1.0/
│       ├── data.etf          # Serialized dataset
│       └── metadata.json     # Version info
├── humaneval/
└── gsm8k/
```

## Result Storage Directory

Evaluation results are stored by default in `~/.elixir_ai_research/results/`. To change the location:

```bash
export CRUCIBLE_DATASETS_RESULTS_DIR=/tmp/crucible_results
```

## Testing

```bash
# Run tests
mix test

# Run with coverage
mix test --cover
```

## Static Analysis

```bash
mix dialyzer
```

## Examples

```bash
mix run examples/basic_usage.exs
mix run examples/evaluation_workflow.exs
mix run examples/sampling_strategies.exs
mix run examples/batch_evaluation.exs
mix run examples/cross_validation.exs
mix run examples/custom_metrics.exs
```

## Integration with Crucible Framework

CrucibleDatasets integrates with other Crucible components:

- **crucible_harness**: Experiment orchestration
- **crucible_ensemble**: Multi-model voting
- **crucible_bench**: Statistical comparison
- **crucible_ir**: Unified dataset references

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for version history.