guides/fixture-testing.md

# Fixture Testing Guide

ReqLLM uses a comprehensive fixture-based testing system to ensure reliability across all supported models and providers. This guide explains how "Supported Models" are validated and describes the testing infrastructure behind that claim.

## Overview

The testing system validates models through the `mix req_llm.model_compat` task, which runs capability-focused tests against models selected from the registry.

## The Model Compatibility Task

### Basic Usage

```bash
# Validate all models with passing fixtures (fastest)
mix req_llm.model_compat

# Alias
mix mc
```

This runs tests against cached fixtures; no API calls are made. It validates models whose previously passing results are recorded in `priv/supported_models.json`.

### Validating Specific Models

```bash
# Validate all Anthropic models
mix mc anthropic

# Validate specific model
mix mc "openai:gpt-4o"

# Validate all models for a provider
mix mc "xai:*"

# List all available models from registry
mix mc --available
```

### Recording New Fixtures

To test against live APIs and (re)generate fixtures:

```bash
# Re-record fixtures for xAI models
mix mc "xai:*" --record

# Re-record all models (not recommended, expensive)
mix mc "*:*" --record
```

### Testing Model Subsets

```bash
# Test sample models per provider (uses config/config.exs sample list)
mix mc --sample

# Test specific provider samples
mix mc --sample anthropic
```
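
The sample list referenced above lives in application config. The snippet below is only a hedged sketch of what such an entry could look like; the `:sample_models` key and the chosen model specs are assumptions, so check `config/config.exs` in the repository for the real key names and values.

```elixir
import Config

# Illustrative only: the actual config key ReqLLM reads for sample
# selection may differ from :sample_models.
config :req_llm,
  sample_models: %{
    anthropic: ["anthropic:claude-3-5-haiku-20241022"],
    openai: ["openai:gpt-4o-mini"]
  }
```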

## Architecture

### Model Registry

Model metadata lives in `priv/models_dev/*.json` files, automatically synced from [models.dev](https://models.dev) via `mix req_llm.model_sync`.

Each model entry includes (see the illustrative sketch after this list):
- Capabilities (`tool_call`, `reasoning`, `attachment`, `temperature`)
- Modalities (`input: [:text, :image]`, `output: [:text]`)
- Limits (`context`, `output` token limits)
- Costs (`input`, `output` per 1M tokens)
- API-specific metadata
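
For orientation, here is roughly what one entry looks like once decoded into Elixir terms. The field names follow the bullet list above, but treat the exact keys and numbers as illustrative rather than authoritative; the shipped `priv/models_dev/*.json` files are the source of truth.

```elixir
# Illustrative entry (decoded from priv/models_dev/anthropic.json);
# exact keys and values may differ from the shipped metadata.
%{
  "id" => "claude-3-5-sonnet-20241022",
  "tool_call" => true,
  "reasoning" => false,
  "attachment" => true,
  "temperature" => true,
  "modalities" => %{"input" => ["text", "image"], "output" => ["text"]},
  "limit" => %{"context" => 200_000, "output" => 8_192},
  "cost" => %{"input" => 3.0, "output" => 15.0}
}
```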

### Fixture State

The `priv/supported_models.json` file tracks which models have passing fixtures. This file is auto-generated and should not be manually edited.

### Comprehensive Test Macro

Tests use the `ReqLLM.ProviderTest.Comprehensive` macro (in `test/support/provider_test/comprehensive.ex`), which generates up to 9 focused tests per model based on its capabilities (a simplified sketch of the capability gating follows this list):

1. **Basic generate_text** (non-streaming) - All models
2. **Streaming** with system context + creative params - Models with streaming support
3. **Token limit constraints** - All models
4. **Usage metrics and cost calculations** - All models
5. **Tool calling - multi-tool selection** - Models with `:tool_call` capability
6. **Tool calling - no tool when inappropriate** - Models with `:tool_call` capability
7. **Object generation (non-streaming)** - Models with object generation support
8. **Object generation (streaming)** - Models with object generation support
9. **Reasoning/thinking tokens** - Models with `:reasoning` capability
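
Conceptually, the macro iterates over the selected models and only emits the capability-gated tests when the metadata advertises that capability. The sketch below illustrates the idea with hard-coded data; it is not the macro's actual implementation, and the module name and in-line capability list are made up for the example.

```elixir
# Simplified sketch of capability gating, not the real macro internals.
defmodule CapabilityGatingSketch do
  use ExUnit.Case, async: true

  # Hard-coded stand-in for the registry / ModelMatrix lookup.
  @models [
    {"anthropic:claude-3-5-sonnet-20241022", [:tool_call]},
    {"anthropic:claude-3-5-haiku-20241022", []}
  ]

  for {spec, capabilities} <- @models do
    @current_spec spec

    # Always generated.
    test "basic generate_text (#{spec})" do
      assert is_binary(@current_spec)
    end

    # Only generated when the model advertises :tool_call.
    if :tool_call in capabilities do
      test "tool calling: multi-tool selection (#{spec})" do
        assert is_binary(@current_spec)
      end
    end
  end
end
```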

### Test Organization

```
test/coverage/
├── anthropic/
│   └── comprehensive_test.exs
├── openai/
│   └── comprehensive_test.exs
├── google/
│   └── comprehensive_test.exs
└── ...
```

Each provider has a single comprehensive test file:

```elixir
defmodule ReqLLM.Coverage.Anthropic.ComprehensiveTest do
  use ReqLLM.ProviderTest.Comprehensive, provider: :anthropic
end
```

The macro automatically:
- Selects models from `ModelMatrix` based on provider and operation type
- Generates tests for each model based on capabilities
- Handles fixture recording and replay
- Tags tests with provider, model, and scenario

## How "Supported Models" is Defined

A model is considered "supported" when it:

1. **Has metadata** in `priv/models_dev/<provider>.json`
2. **Passes comprehensive tests** for its advertised capabilities
3. **Has fixture** evidence stored for validation

The count you see in documentation ("135+ models currently pass our comprehensive fixture-based test suite") comes from models in `priv/supported_models.json`.
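
If you want to see where that number comes from locally, decoding the state file and counting its entries is enough. This snippet assumes `Jason` is available and makes no assumption about the file's exact shape beyond it being a single JSON document.

```elixir
# Rough count of supported models; the JSON shape is an assumption,
# so adjust the counting if the file nests entries differently.
"priv/supported_models.json"
|> File.read!()
|> Jason.decode!()
|> Enum.count()
```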

## Semantic Tags

Tests use structured tags for precise filtering:

```elixir
@moduletag :coverage                     # All coverage tests
@moduletag provider: "anthropic"         # Provider filter
@describetag model: "claude-3-5-sonnet"  # Model filter (without provider prefix)
@tag scenario: :basic                    # Scenario filter
```

Run specific subsets:

```bash
# All coverage tests
mix test --only coverage

# Specific provider
mix test --only "provider:anthropic"

# Specific scenario
mix test --only "scenario:basic"
mix test --only "scenario:streaming"
mix test --only "scenario:tool_multi"

# Specific model
mix test --only "model:claude-3-5-haiku-20241022"

# Combine filters
mix test --only "provider:openai" --only "scenario:basic"
```

## Environment Variables

### Fixture Mode Control

```bash
# Use cached fixtures (default, no API calls)
mix mc

# Record new fixtures (makes live API calls)
REQ_LLM_FIXTURES_MODE=record mix mc
# OR
mix mc --record
```

### Model Selection

```bash
# Test all available models
REQ_LLM_MODELS="all" mix mc

# Test all models from a provider
REQ_LLM_MODELS="anthropic:*" mix mc

# Test specific models (comma-separated)
REQ_LLM_MODELS="openai:gpt-4o,anthropic:claude-3-5-sonnet" mix mc

# Sample N models per provider
REQ_LLM_SAMPLE=2 mix mc

# Exclude specific models
REQ_LLM_EXCLUDE="gpt-4o-mini,gpt-3.5-turbo" mix mc
```

### Debug Output

```bash
# Verbose fixture debugging
REQ_LLM_DEBUG=1 mix mc
```

## Fixture System Details

### Fixture Storage

Fixtures are stored next to test files:

```
test/coverage/<provider>/fixtures/
├── basic.json
├── streaming.json
├── token_limit.json
├── usage.json
├── tool_multi.json
├── no_tool.json
├── object_basic.json
├── object_streaming.json
└── reasoning_basic.json
```

### Fixture Format

Fixtures capture the complete API response:

```json
{
  "captured_at": "2025-01-15T10:30:00Z",
  "model_spec": "anthropic:claude-3-5-sonnet-20241022",
  "scenario": "basic",
  "result": {
    "ok": true,
    "response": {
      "id": "msg_123",
      "model": "claude-3-5-sonnet-20241022",
      "message": {...},
      "usage": {...}
    }
  }
}
```
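
For debugging, you can load a fixture by hand and pattern-match on the shape shown above. This is only a convenience sketch (the real replay logic lives in the test support code), and it assumes `Jason` is available.

```elixir
# Manually inspect a recorded fixture using the shape documented above.
fixture =
  "test/coverage/anthropic/fixtures/basic.json"
  |> File.read!()
  |> Jason.decode!()

%{"result" => %{"ok" => true, "response" => response}} = fixture
IO.inspect(response["usage"], label: "usage")
```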

### Parallel Execution

The fixture system supports parallel test execution:

- Tests run concurrently for speed
- State tracking skips models with passing fixtures
- Use `--record` or `--record-all` to regenerate

## Development Workflow

### Adding a New Provider

1. Implement provider module and metadata
2. Create test file using `Comprehensive` macro
3. Record initial fixtures:
   ```bash
   mix mc "<provider>:*" --record
   ```
4. Verify all tests pass:
   ```bash
   mix mc "<provider>"
   ```

### Updating Model Coverage

1. Sync latest model metadata:
   ```bash
   mix req_llm.model_sync
   ```
2. Record fixtures for new models:
   ```bash
   mix mc "<provider>:new-model" --record
   ```
3. Validate updated coverage:
   ```bash
   mix mc "<provider>"
   ```

### Refreshing Fixtures

Periodically refresh fixtures to catch API changes:

```bash
# Refresh specific provider
mix mc "anthropic:*" --record

# Refresh specific capability
REQ_LLM_FIXTURES_MODE=record mix test --only "scenario:streaming"

# Refresh all (expensive, requires all API keys)
mix mc "*:*" --record
```

## Quality Commitments

We guarantee that all "supported models" (those counted in our documentation):

1. **Have passing fixtures** for basic functionality
2. **Are tested against live APIs** before fixture capture
3. **Pass capability-focused tests** for advertised features
4. **Are regularly refreshed** to catch provider-side changes

### What's Tested

For each supported model:

- ✅ Text generation (streaming and non-streaming)
- ✅ Token limits and truncation behavior
- ✅ Usage metrics and cost calculation
- ✅ Tool calling (if advertised)
- ✅ Object generation (if advertised)
- ✅ Reasoning tokens (if advertised)

### What's NOT Guaranteed

- Complex edge cases beyond basic capabilities
- Provider-specific features not in model metadata
- Real-time behavior (fixtures may be cached)
- Exact API response formats (providers may change)

## Troubleshooting

### Fixture Mismatch

If tests fail with fixture mismatches:

```bash
# Re-record fixtures for the affected model
mix mc "<provider>:<model>" --record
```

### Missing API Key

Tests are skipped when the required API key is unavailable:

```bash
# Set in .env file
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
```

### Debugging Fixture Issues

Enable verbose output:

```bash
REQ_LLM_DEBUG=1 mix test --only "provider:anthropic" --only "scenario:basic"
```

## Best Practices

1. **Run locally before CI**: `mix mc` before committing
2. **Record incrementally**: Don't re-record all fixtures at once
3. **Use samples for development**: `mix mc --sample` for quick validation
4. **Keep fixtures fresh**: Refresh fixtures when providers update APIs
5. **Tag tests appropriately**: Use semantic tags for precise test selection

## Commands Reference

```bash
# Validation (using fixtures)
mix mc                          # All models with passing fixtures
mix mc anthropic                # All Anthropic models
mix mc "openai:gpt-4o"          # Specific model
mix mc --sample                 # Sample models per provider
mix mc --available              # List all registry models

# Recording (live API calls)
mix mc --record                 # Re-record passing models
mix mc "xai:*" --record         # Re-record xAI models
mix mc "<provider>:*" --record  # Re-record specific provider

# Environment variables
REQ_LLM_FIXTURES_MODE=record    # Force recording
REQ_LLM_MODELS="pattern"        # Model selection pattern
REQ_LLM_SAMPLE=N                # Sample N per provider
REQ_LLM_EXCLUDE="model1,model2" # Exclude models
REQ_LLM_DEBUG=1                 # Verbose output
```

## Summary

The fixture-based testing system provides:

- **Fast local validation** with cached fixtures
- **Comprehensive coverage** across capabilities
- **Parallel execution** for speed
- **Clear model support guarantees** backed by test evidence
- **Easy provider addition** with minimal boilerplate

This system is how ReqLLM backs up the claim of "135+ supported models": each one has fixture evidence of passing the comprehensive capability tests.