# CrucibleTrain
<p align="center">
<img src="assets/crucible_train.svg" alt="CrucibleTrain Logo" width="200"/>
</p>
<p align="center">
<strong>Unified ML training infrastructure for Elixir/BEAM</strong>
</p>
<p align="center">
<a href="https://hex.pm/packages/crucible_train"><img src="https://img.shields.io/hexpm/v/crucible_train.svg" alt="Hex Version"/></a>
<a href="https://hexdocs.pm/crucible_train"><img src="https://img.shields.io/badge/hex-docs-blue.svg" alt="Hex Docs"/></a>
<a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License"/></a>
</p>
---
CrucibleTrain provides a complete, platform-agnostic training infrastructure for ML workloads on the BEAM. It includes:
- **Renderers**: Message-to-token transformation for all major model families (Llama3, Qwen3, DeepSeek, etc.)
- **Training Loops**: Supervised learning, RL, DPO, and distillation
- **Type System**: Unified Datum, ModelInput, and related types
- **Ports & Adapters**: Pluggable backends for any training platform
- **Logging**: Multiplexed ML logging (JSON, console, custom backends)
- **Crucible Integration**: Stage implementations for pipeline composition
## Installation
Add to your `mix.exs`:
```elixir
def deps do
  [
    {:crucible_train, "~> 0.2.0"}
  ]
end
```
## Quick Start
```elixir
alias CrucibleTrain.Supervised.{Train, Config}
alias CrucibleTrain.Renderers

# Pick the renderer that converts chat messages into tokens for this model family
renderer = Renderers.get_renderer("meta-llama/Llama-3.1-8B")

config = %Config{
  training_client: my_client,
  train_dataset: my_dataset,
  learning_rate: 1.0e-4,
  num_epochs: 3
}

{:ok, result} = Train.main(config)
```
## Training Stages
This package provides Crucible stages for ML training workflows:
| Stage | Name | Description |
|-------|------|-------------|
| `SupervisedTrain` | `:supervised_train` | Standard supervised learning with configurable optimizer/loss |
| `DPOTrain` | `:dpo_train` | Direct Preference Optimization with beta parameter |
| `RLTrain` | `:rl_train` | Reinforcement Learning (PPO, DQN, A2C, REINFORCE) |
| `Distillation` | `:distillation` | Knowledge Distillation with temperature/alpha |
All stages implement the `Crucible.Stage` behaviour with full `describe/1` schemas for introspection.
```elixir
# View stage schema
schema = CrucibleTrain.Stages.SupervisedTrain.describe(%{})
# => %{
#   name: :supervised_train,
#   description: "Runs supervised learning training...",
#   required: [],
#   optional: [:epochs, :batch_size, :learning_rate, :optimizer, :loss_fn, :metrics],
#   types: %{epochs: :integer, batch_size: :integer, ...}
# }
```
Use in Crucible pipelines:
```elixir
alias CrucibleIR.StageDef
stages = [
  %StageDef{name: :supervised_train, options: %{epochs: 3, batch_size: 32}}
]
```
## Logging Backends
CrucibleTrain supports multiple logging backends for experiment tracking:
```elixir
alias CrucibleTrain.Logging

# Local JSONL logging
{:ok, logger} = Logging.create_logger(:json, log_dir: "./logs")

# Console table output
{:ok, logger} = Logging.create_logger(:pretty)

# Log metrics and hyperparameters
Logging.log_hparams(logger, %{learning_rate: 1.0e-4})
Logging.log_metrics(logger, step, %{loss: 0.5, accuracy: 0.9})
Logging.close(logger)
```
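The feature list and the `multiplex_logger_example.exs` example refer to multiplexed logging, i.e. fanning one stream of calls out to several backends at once. The exact constructor isn't shown in this README, so the snippet below is only a hypothetical sketch: the `:multiplex` backend name and `:loggers` option are assumptions, and `examples/multiplex_logger_example.exs` shows the real API.

```elixir
# Hypothetical sketch only: fan every logging call out to JSONL and console.
# The :multiplex backend and :loggers option are assumed names.
{:ok, logger} =
  Logging.create_logger(:multiplex,
    loggers: [
      {:json, log_dir: "./logs"},
      {:pretty, []}
    ]
  )

Logging.log_metrics(logger, step, %{loss: 0.5, accuracy: 0.9})
Logging.close(logger)
```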
### Weights & Biases Integration
Full integration with [Weights & Biases](https://wandb.ai) for experiment tracking:
```elixir
# Setup: export WANDB_API_KEY="your-api-key"

{:ok, logger} = Logging.create_logger(:wandb,
  api_key: System.get_env("WANDB_API_KEY"),
  project: "my-project",
  entity: "my-team",        # optional
  run_name: "experiment-1"  # optional
)

# Get run URL
url = Logging.get_url(logger)
# => "https://wandb.ai/my-team/my-project/runs/experiment-1"

# Log hyperparameters (nested maps supported)
Logging.log_hparams(logger, %{
  model: "llama-3.1-8b",
  optimizer: %{name: "adamw", lr: 1.0e-4}
})

# Log training metrics
Logging.log_metrics(logger, step, %{loss: 0.5, accuracy: 0.9})

# Log long-form text (summaries, notes)
Logging.log_long_text(logger, "eval_notes", "Model performed well on...")

Logging.close(logger)
```
**Rate Limiting**: The W&B free tier enforces strict rate limits (roughly 60 requests per minute per run). Rate limiting is **enabled by default** with conservative settings:
- 500ms minimum interval between requests
- Automatic retry with exponential backoff on 429 errors
```elixir
# Custom rate limit settings
{:ok, logger} = Logging.create_logger(:wandb,
  project: "my-project",
  rate_limit: [min_interval_ms: 1000, max_retries: 5]
)

# Disable rate limiting (not recommended for free tier)
{:ok, logger} = Logging.create_logger(:wandb,
  project: "my-project",
  rate_limit: false
)
```
### Neptune.ai Integration
Full integration with [Neptune.ai](https://neptune.ai) for experiment tracking:
```elixir
# Setup:
# export NEPTUNE_API_TOKEN="your-api-token"
# export NEPTUNE_PROJECT="workspace/project-name"

{:ok, logger} = Logging.create_logger(:neptune,
  api_token: System.get_env("NEPTUNE_API_TOKEN"),
  project: System.get_env("NEPTUNE_PROJECT")
)

# Get run URL
url = Logging.get_url(logger)
# => "https://app.neptune.ai/workspace/project-name/e/RUN-1"

# Same logging API as other backends
Logging.log_hparams(logger, %{model: "deepseek-v3", batch_size: 64})
Logging.log_metrics(logger, step, %{loss: 0.3, grad_norm: 1.2})

Logging.close(logger)
```
**Rate Limiting**: Enabled by default with a 200ms minimum interval. Configure it the same way as the W&B logger.
### Rate Limit Configuration
Both W&B and Neptune loggers support the following rate limit options:
| Option | Default (W&B) | Default (Neptune) | Description |
|--------|---------------|-------------------|-------------|
| `min_interval_ms` | 500 | 200 | Minimum interval between requests (ms) |
| `max_retries` | 3 | 3 | Retry attempts on HTTP 429 responses |
| `base_backoff_ms` | 1000 | 1000 | Initial backoff duration (ms) |
| `max_backoff_ms` | 30000 | 30000 | Maximum backoff cap (ms) |
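As a rough illustration of how these options interact, the sketch below computes the delay before each retry using standard exponential backoff. The `BackoffSketch` helper is purely illustrative and not part of the CrucibleTrain API; the actual retry logic (e.g. whether it adds jitter) may differ.

```elixir
# Illustrative only: delay(n) = min(base_backoff_ms * 2^n, max_backoff_ms)
# for retry attempt n after a 429 response.
defmodule BackoffSketch do
  def delay_ms(attempt, base_backoff_ms \\ 1_000, max_backoff_ms \\ 30_000) do
    min(base_backoff_ms * Integer.pow(2, attempt), max_backoff_ms)
  end
end

Enum.map(0..4, &BackoffSketch.delay_ms/1)
# => [1000, 2000, 4000, 8000, 16000]
```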
## Evaluation & Scoring
Pluggable scoring system for model evaluation:
```elixir
alias CrucibleTrain.Eval.{Scoring, BatchRunner}

# Score individual outputs
Scoring.score(:exact_match, "Paris", "Paris")        # => 1.0
Scoring.score(:contains, "The answer is 42", "42")   # => 1.0

# Streaming batch evaluation
results =
  samples
  |> BatchRunner.stream_evaluate(config, chunk_size: 25)
  |> Enum.to_list()

metrics = BatchRunner.aggregate_metrics(results)
# => %{mean_score: 0.85, total: 100, correct: 85}
```
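For intuition, the built-in `:exact_match` and `:contains` scorers behave roughly like the anonymous functions below. This is an illustrative reimplementation, not the library's code.

```elixir
# Rough behavioural sketch of the scorers used above (illustrative only).
exact_match = fn output, expected ->
  if output == expected, do: 1.0, else: 0.0
end

contains = fn output, expected ->
  if String.contains?(output, expected), do: 1.0, else: 0.0
end

exact_match.("Paris", "Paris")          # => 1.0
contains.("The answer is 42", "42")     # => 1.0
```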
## Learning Rate Scheduling
Flexible LR schedules with warmup support:
```elixir
alias CrucibleTrain.Supervised.Config

# Cosine annealing with warmup
config = %Config{
  learning_rate: 1.0e-4,
  lr_schedule: {:warmup, 100, :cosine}
}

# Available schedules: :constant, :linear, :cosine
# Warmup: {:warmup, warmup_steps, base_schedule}
```
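As a sanity check on what these schedules mean, the sketch below computes the learning rate at a given step for cosine annealing with linear warmup. The formula is the standard one and may differ in detail from CrucibleTrain's implementation; `LRSketch` is an illustrative helper, not part of the library.

```elixir
# Illustrative only: standard linear-warmup + cosine-annealing schedule.
# lr(t) = base_lr * t / warmup_steps                for t < warmup_steps
# lr(t) = base_lr * 0.5 * (1 + cos(pi * progress))  afterwards
defmodule LRSketch do
  def warmup_cosine(step, base_lr, warmup_steps, total_steps) do
    if step < warmup_steps do
      base_lr * step / warmup_steps
    else
      progress = (step - warmup_steps) / (total_steps - warmup_steps)
      base_lr * 0.5 * (1.0 + :math.cos(:math.pi() * progress))
    end
  end
end

LRSketch.warmup_cosine(50, 1.0e-4, 100, 1_000)
# => 5.0e-5 (halfway through warmup)
```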
## Examples
See the `examples/` directory for runnable demos:
```bash
# Run all local examples
./examples/run_all.sh

# Run individual examples
mix run --no-start examples/json_logger_example.exs
mix run --no-start examples/wandb_logger_example.exs
mix run --no-start examples/scoring_example.exs
```
| Example | Description |
|---------|-------------|
| `json_logger_example.exs` | Local JSONL logging |
| `pretty_print_logger_example.exs` | Console table output |
| `multiplex_logger_example.exs` | Multiple backends |
| `wandb_logger_example.exs` | Weights & Biases |
| `neptune_logger_example.exs` | Neptune.ai |
| `scoring_example.exs` | Evaluation scoring |
| `batch_runner_example.exs` | Batch evaluation |
| `lr_scheduling_example.exs` | LR schedules |
See [`examples/README.md`](examples/README.md) for setup instructions for cloud services.
## Documentation
Full documentation available at [HexDocs](https://hexdocs.pm/crucible_train).
## License
MIT License - see [LICENSE](LICENSE) for details.