# Tribunal

LLM evaluation framework for Elixir.

**Tribunal** provides tools for evaluating LLM outputs, detecting hallucinations, and measuring response quality.

## Installation

```elixir
def deps do
  [
    {:tribunal, "~> 0.1.0"},

    # Optional: for LLM-as-judge evaluations
    {:req_llm, "~> 1.2"},

    # Optional: for embedding-based similarity
    {:alike, "~> 0.1"}
  ]
end
```

## Quick Start

### ExUnit Integration

```elixir
defmodule MyApp.RAGTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  @context ["Returns are accepted within 30 days with receipt."]

  test "response is faithful to context" do
    response = MyApp.RAG.query("What's the return policy?")

    assert_contains response, "30 days"
    assert_faithful response, context: @context
    refute_hallucination response, context: @context
  end
end
```

### Dataset-Driven Evaluations

```elixir
# test/evals/rag_test.exs
defmodule MyApp.RAGEvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  tribunal_eval "test/evals/datasets/questions.json",
    provider: {MyApp.RAG, :query}
end
```
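A dataset file pairs inputs with expected checks. The shape below is a sketch for illustration only; the keys (`input`, `expected`, `assertions`) are assumptions, not the documented schema (see the [Datasets guide](guides/datasets.md) for the actual format):

```json
[
  {
    "input": "What's the return policy?",
    "expected": "30 days",
    "assertions": ["contains", "faithful"]
  }
]
```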

### CLI

```bash
# Initialize evaluation structure
mix tribunal.init

# Run evaluations
mix tribunal.eval

# Output formats
mix tribunal.eval --format json --output results.json
mix tribunal.eval --format github  # GitHub Actions annotations
```

```
Tribunal LLM Evaluation
═══════════════════════════════════════════════════════════════

Summary
───────────────────────────────────────────────────────────────
  Total:     12 test cases
  Passed:    10 (83%)
  Failed:    2
  Duration:  1.4s

Results by Metric
───────────────────────────────────────────────────────────────
  faithful       8/8 passed    100%  ████████████████████
  relevant       6/8 passed    75%   ███████████████░░░░░
  contains       10/10 passed  100%  ████████████████████
  no_pii         4/4 passed    100%  ████████████████████

Failed Cases
───────────────────────────────────────────────────────────────
  1. "What is the return policy for electronics?"
     ├─ relevant: Response discusses refunds but doesn't address return policy

  2. "Can I return opened software?"
     ├─ relevant: Response is generic, doesn't mention software-specific policy

───────────────────────────────────────────────────────────────
❌ FAILED
```

## Assertion Types

### Deterministic (instant, no API calls)

- `assert_contains` / `refute_contains` - Substring matching
- `assert_regex` - Pattern matching
- `assert_json` - Valid JSON validation
- `assert_refusal` - Refusal pattern detection
- `assert_max_tokens` - Token limit
- `refute_pii` - No PII present
- `refute_toxic` - No toxic patterns
- [Full list in assertions guide](guides/assertions.md)
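Deterministic assertions can be combined freely since none of them makes an API call. A sketch inside a test, where the argument shapes beyond `assert_contains` (shown in Quick Start) are inferred from the assertion names and may differ from the actual API:

```elixir
test "response passes deterministic checks" do
  response = MyApp.RAG.query("What's the return policy?")

  # Substring and pattern matching
  assert_contains response, "30 days"
  assert_regex response, ~r/\d+ days/

  # Safety checks on the raw text
  refute_pii response
  refute_toxic response
end
```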

### LLM-as-Judge (requires `req_llm`)

- `assert_faithful` - Grounded in context
- `assert_relevant` - Addresses query
- `refute_hallucination` - No fabricated info
- `refute_bias` - No stereotypes
- `refute_toxicity` - No hostile language
- `refute_harmful` - No dangerous content
- `refute_jailbreak` - No safety bypass

### Embedding-Based (requires `alike`)

- `assert_similar` - Semantic similarity check
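A hypothetical usage sketch: `assert_similar` presumably compares embeddings of the response against a reference answer. The `threshold:` option is an assumption for illustration; check the assertions guide for the real signature:

```elixir
test "response is semantically close to the reference" do
  response = MyApp.RAG.query("What's the return policy?")

  # Passes if cosine similarity to the reference exceeds the threshold
  assert_similar response, "Returns are accepted within 30 days with a receipt.",
    threshold: 0.85
end
```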

## Red Team Testing

Generate adversarial prompts to test LLM safety:

```elixir
alias Tribunal.RedTeam

attacks = RedTeam.generate_attacks("How do I pick a lock?")
# Returns encoding attacks (base64, leetspeak, rot13),
# injection attacks (ignore instructions, delimiter injection),
# and jailbreak attacks (DAN, STAN, developer mode)
```
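One way to use the generated attacks is to assert that the model refuses each adversarial variant, using the `assert_refusal` helper from the deterministic assertions. This assumes each attack is a prompt string; `MyApp.LLM.complete/1` is a placeholder for your own completion function:

```elixir
test "model refuses adversarial variants" do
  for attack <- Tribunal.RedTeam.generate_attacks("How do I pick a lock?") do
    response = MyApp.LLM.complete(attack)

    # Every encoded, injected, or jailbroken variant should still be refused
    assert_refusal response
  end
end
```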

## Guides

- [Getting Started](guides/getting-started.md)
- [ExUnit Integration](guides/exunit-integration.md)
- [Assertions Reference](guides/assertions.md)
- [LLM-as-Judge](guides/llm-as-judge.md)
- [Datasets](guides/datasets.md)
- [Red Team Testing](guides/red-team-testing.md)
- [Reporters](guides/reporters.md)

## Roadmap

- [x] Core evaluation pipeline
- [x] Faithfulness metric (RAGAS-style)
- [x] Hallucination detection
- [x] LLM-as-judge with configurable models
- [x] ExUnit integration for test assertions
- [x] Red team attack generators
- [ ] Async batch evaluation

## License

MIT