docs/04-testing-strategy.md

# Testing Strategy

## Overview

The Elixir Codex SDK follows a comprehensive test-driven development (TDD) approach using Supertester for deterministic OTP testing. This document outlines our testing philosophy, strategies, tools, and best practices.

## Testing Philosophy

### Core Principles

1. **Test First**: Write tests before implementation
2. **Deterministic**: Zero flaky tests, zero `Process.sleep`
3. **Fast**: Full suite < 5 minutes, average test < 50ms
4. **Comprehensive**: 95%+ coverage, all edge cases
5. **Maintainable**: Clear, readable, well-organized tests
6. **Async**: All tests run with `async: true` where possible

### Red-Green-Refactor Cycle

1. **Red**: Write a failing test that defines desired behavior
2. **Green**: Write minimal code to make test pass
3. **Refactor**: Improve code quality while keeping tests green
4. **Repeat**: Continue with next feature
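
As a minimal illustration of one cycle (the `Codex.ThreadId.valid?/1` helper here is hypothetical, not part of the SDK):

```elixir
# Red: this test fails first because Codex.ThreadId does not exist yet.
test "accepts only ids with the thread_ prefix" do
  assert Codex.ThreadId.valid?("thread_abc123")
  refute Codex.ThreadId.valid?("abc123")
end

# Green: the minimal implementation that makes the test pass.
defmodule Codex.ThreadId do
  def valid?("thread_" <> rest), do: rest != ""
  def valid?(_), do: false
end
```

The refactor step then improves naming, guards, and docs while the test stays green.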

## Test Categories

### 1. Unit Tests

**Purpose**: Test individual functions and modules in isolation.

**Characteristics**:
- Run with `async: true`
- Mock all external dependencies
- Focus on single responsibility
- Fast (< 1ms per test)
- High coverage of edge cases

**Example**:
```elixir
defmodule Codex.EventsTest do
  use ExUnit.Case, async: true

  describe "ThreadStarted" do
    test "creates struct with required fields" do
      event = %Codex.Events.ThreadStarted{
        type: :thread_started,
        thread_id: "thread_abc123"
      }

      assert event.type == :thread_started
      assert event.thread_id == "thread_abc123"
    end

    test "enforces required fields" do
      assert_raise ArgumentError, fn ->
        %Codex.Events.ThreadStarted{}
      end
    end
  end
end
```

### 2. Integration Tests

**Purpose**: Test interactions between components.

**Characteristics**:
- Tagged `:integration`
- Use mock codex-rs script
- Test full workflows
- Medium speed (< 100ms per test)
- May run synchronously

**Example**:
```elixir
defmodule Codex.Thread.IntegrationTest do
  use ExUnit.Case
  use Supertester

  @moduletag :integration

  test "full turn execution with mock codex" do
    mock_script = create_mock_codex_script([
      ~s({"type":"thread.started","thread_id":"thread_123"}),
      ~s({"type":"turn.started"}),
      ~s({"type":"item.completed","item":{"id":"1","type":"agent_message","text":"Hello"}}),
      ~s({"type":"turn.completed","usage":{"input_tokens":10,"cached_input_tokens":0,"output_tokens":5}})
    ])

    codex_opts = %Codex.Options{codex_path_override: mock_script}
    {:ok, thread} = Codex.start_thread(codex_opts)

    {:ok, result} = Codex.Thread.run(thread, "test input")

    assert result.final_response == "Hello"
    assert result.usage.input_tokens == 10
    assert thread.thread_id == "thread_123"

    File.rm!(mock_script)
  end

  defp create_mock_codex_script(events) do
    script = """
    #!/bin/bash
    #{Enum.map_join(events, "\n", &"echo '#{&1}'")}
    """

    path = Path.join(System.tmp_dir!(), "mock_codex_#{:rand.uniform(10000)}")
    File.write!(path, script)
    File.chmod!(path, 0o755)
    path
  end
end
```

### 3. Live Tests

**Purpose**: Test against real codex-rs binary and OpenAI API.

**Characteristics**:
- Tagged `:live`
- Require API key via environment variable
- Optional (skip in CI by default)
- Slow (seconds per test)
- Useful for validation and debugging

**Example**:
```elixir
defmodule Codex.LiveTest do
  use ExUnit.Case

  @moduletag :live
  @moduletag timeout: 60_000

  # Filters must be configured before the suite starts, so the API-key
  # guard belongs in test_helper.exs rather than in a setup block:
  #
  #     unless System.get_env("CODEX_API_KEY") do
  #       ExUnit.configure(exclude: [:live])
  #     end

  test "real turn execution" do
    {:ok, thread} = Codex.start_thread()

    {:ok, result} = Codex.Thread.run(thread, "Say 'test successful' and nothing else")

    assert result.final_response =~ "test successful"
    assert result.usage.input_tokens > 0
  end
end
```

### 4. Property Tests

**Purpose**: Test properties that should hold for all inputs.

**Characteristics**:
- Use StreamData for generation
- Test invariants and laws
- Discover edge cases automatically
- Run many iterations

**Example**:
```elixir
defmodule Codex.Events.PropertyTest do
  use ExUnit.Case, async: true
  use ExUnitProperties

  property "all events encode and decode correctly" do
    check all event <- event_generator() do
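      # Assumes each event struct implements Jason.Encoder so the :type atom
      # is emitted as the wire-format string (e.g. "thread.started").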
      json = Jason.encode!(event)
      {:ok, decoded} = Jason.decode(json)

      assert decoded["type"] in [
        "thread.started", "turn.started", "turn.completed",
        "turn.failed", "item.started", "item.updated",
        "item.completed", "error"
      ]
    end
  end

  defp event_generator do
    gen all type <- member_of([:thread_started, :turn_started, :turn_completed]),
            thread_id <- string(:alphanumeric, min_length: 1, max_length: 50) do
      case type do
        :thread_started ->
          %Codex.Events.ThreadStarted{
            type: :thread_started,
            thread_id: thread_id
          }

        :turn_started ->
          %Codex.Events.TurnStarted{type: :turn_started}

        :turn_completed ->
          %Codex.Events.TurnCompleted{
            type: :turn_completed,
            usage: %Codex.Events.Usage{
              input_tokens: 10,
              cached_input_tokens: 0,
              output_tokens: 5
            }
          }
      end
    end
  end
end
```

### 5. Chaos Tests

**Purpose**: Test system resilience under adverse conditions.

**Characteristics**:
- Simulate process crashes
- Test resource cleanup
- Verify supervision behavior
- Test under high load

**Example**:
```elixir
defmodule Codex.ChaosTest do
  use ExUnit.Case
  use Supertester

  describe "resilience" do
    test "handles Exec GenServer crash during turn" do
      {:ok, thread} = Codex.start_thread()

      # Start turn in separate process
      task = Task.async(fn ->
        Codex.Thread.run(thread, "test")
      end)

      # Wait for the Exec GenServer to register instead of sleeping
      eventually(fn ->
        assert [_] = Registry.lookup(CodexSdk.ExecRegistry, thread.thread_id)
      end)

      # Find and kill the Exec GenServer
      [{exec_pid, _}] = Registry.lookup(CodexSdk.ExecRegistry, thread.thread_id)
      Process.exit(exec_pid, :kill)

      # Should return error, not crash
      assert {:error, _} = Task.await(task)
    end

    test "cleans up resources on early stream halt" do
      {:ok, thread} = Codex.start_thread()
      {:ok, stream} = Codex.Thread.run_streamed(thread, "test")

      # Track temp files before
      temp_files_before = count_temp_files()

      # Take only first event, halting stream early
      [_first_event | _] = Enum.take(stream, 1)

      # Wait for cleanup deterministically instead of sleeping
      eventually(fn ->
        assert count_temp_files() <= temp_files_before
      end)
    end

    defp count_temp_files do
      Path.wildcard(Path.join(System.tmp_dir!(), "codex-output-schema-*"))
      |> length()
    end
  end
end
```

## Supertester Integration

### Why Supertester?

[Supertester](https://hex.pm/packages/supertester) provides deterministic OTP testing without `Process.sleep`. It enables:

1. **Proper Synchronization**: Wait for actual conditions, not arbitrary timeouts
2. **Async Safety**: All tests can run `async: true`
3. **Clear Assertions**: Readable test code with helpful error messages
4. **Zero Flakes**: Deterministic behavior eliminates timing issues

### Basic Usage

```elixir
defmodule Codex.Exec.SupertesterTest do
  use ExUnit.Case, async: true
  use Supertester

  test "GenServer receives message" do
    {:ok, pid} = Codex.Exec.start_link(...)

    # Send message
    send(pid, {:test, self()})

    # Wait for response (not Process.sleep!)
    assert_receive {:response, value}
    assert value == :expected
  end

  test "GenServer state changes" do
    {:ok, pid} = Codex.Exec.start_link(...)

    # Trigger state change
    GenServer.call(pid, :change_state)

    # Assert state changed
    assert :sys.get_state(pid).changed == true
  end
end
```

### Advanced Patterns

**Testing Async Workflows**:
```elixir
test "async event processing" do
  {:ok, pid} = Codex.Exec.start_link(...)

  ref = make_ref()
  GenServer.cast(pid, {:process, ref, self()})

  # Wait for specific message pattern
  assert_receive {:processed, ^ref, result}, 1000
  assert result.success
end
```

**Testing Supervision**:
```elixir
test "supervised restart" do
  {:ok, sup} = Codex.Supervisor.start_link()

  # Get child pid
  [{:undefined, pid, :worker, _}] = Supervisor.which_children(sup)

  # Kill child
  Process.exit(pid, :kill)

  # Wait for restart
  eventually(fn ->
    [{:undefined, new_pid, :worker, _}] = Supervisor.which_children(sup)
    assert new_pid != pid
    assert Process.alive?(new_pid)
  end)
end
```

## Mock Strategies

### 1. Mox for Behaviours

**When**: Testing modules that depend on behaviours.

**Example**:
```elixir
# Define behavior
defmodule Codex.ExecBehaviour do
  @callback run_turn(pid(), String.t(), map()) :: reference()
  @callback get_events(pid(), reference()) :: [Codex.Events.t()]
end

# Define mock in test_helper.exs
Mox.defmock(Codex.ExecMock, for: Codex.ExecBehaviour)

# Use in tests
test "thread uses exec" do
  Mox.expect(Codex.ExecMock, :run_turn, fn _pid, input, _opts ->
    assert input == "test"
    make_ref()
  end)

  Mox.expect(Codex.ExecMock, :get_events, fn _pid, _ref ->
    [
      %Codex.Events.ThreadStarted{...},
      %Codex.Events.TurnCompleted{...}
    ]
  end)

  # Test with mock
  thread = %Codex.Thread{exec: Codex.ExecMock, ...}
  {:ok, result} = Codex.Thread.run(thread, "test")
end
```
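
In practice, Mox-based test modules also import the helpers and verify expectations when each test exits:

```elixir
defmodule Codex.ThreadMoxTest do
  use ExUnit.Case, async: true

  import Mox

  # Fails the test if any expectation set with Mox.expect was never called
  setup :verify_on_exit!
end
```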

### 2. Mock Scripts for Exec

**When**: Testing Exec GenServer with controlled output.

**Example**:
```elixir
defmodule MockCodexScript do
  def create(events) when is_list(events) do
    script_content = """
    #!/bin/bash
    # Read stdin (ignore for mock)
    cat > /dev/null

    # Output events
    #{Enum.map_join(events, "\n", &"echo '#{&1}'")}

    exit 0
    """

    path = Path.join(System.tmp_dir!(), "mock_codex_#{System.unique_integer([:positive])}.sh")
    File.write!(path, script_content)
    File.chmod!(path, 0o755)

    path
  end

  def cleanup(path) do
    File.rm(path)
  end
end

# Usage in tests
test "exec processes events" do
  events = [
    Jason.encode!(%{type: "thread.started", thread_id: "t1"}),
    Jason.encode!(%{type: "turn.completed", usage: %{input_tokens: 5}})
  ]

  script = MockCodexScript.create(events)

  try do
    {:ok, pid} = Codex.Exec.start_link(codex_path: script, input: "test")
    # ... test assertions
  after
    MockCodexScript.cleanup(script)
  end
end
```

### 3. Test Doubles for Data

**When**: Testing with known data structures.

**Example**:
```elixir
defmodule Codex.Fixtures do
  def thread_started_event(thread_id \\ "thread_test123") do
    %Codex.Events.ThreadStarted{
      type: :thread_started,
      thread_id: thread_id
    }
  end

  def agent_message_item(text \\ "Hello") do
    %Codex.Items.AgentMessage{
      id: "msg_#{System.unique_integer([:positive])}",
      type: :agent_message,
      text: text
    }
  end

  def complete_turn_result do
    %Codex.Turn.Result{
      items: [agent_message_item()],
      final_response: "Hello",
      usage: %Codex.Events.Usage{
        input_tokens: 10,
        cached_input_tokens: 0,
        output_tokens: 5
      }
    }
  end
end
```

## Coverage Goals

### Overall Coverage: 95%+

**Per Module**:
- Core modules (Codex, Thread, Exec): **100%**
- Type modules (Events, Items, Options): **100%**
- Utility modules (OutputSchemaFile): **95%**
- Test support modules: **80%**

### Coverage Tool: ExCoveralls

**Configuration** in `mix.exs`:
```elixir
def project do
  [
    test_coverage: [tool: ExCoveralls],
    preferred_cli_env: [
      coveralls: :test,
      "coveralls.detail": :test,
      "coveralls.post": :test,
      "coveralls.html": :test
    ]
  ]
end
```
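
ExCoveralls must also be declared as a test-only dependency (the version requirement below is indicative):

```elixir
defp deps do
  [
    # Coverage reporting; only needed in the test environment
    {:excoveralls, "~> 0.18", only: :test}
  ]
end
```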

**Commands**:
```bash
# Run tests with coverage
mix coveralls

# Detailed coverage report
mix coveralls.detail

# HTML coverage report
mix coveralls.html

# CI coverage (upload to Coveralls.io)
mix coveralls.github
```

### Coverage Exceptions

Some code is deliberately excluded:
```elixir
# coveralls-ignore-start
def debug_helper do
  # Only used in development
end
# coveralls-ignore-stop
```

## Test Organization

### Directory Structure

```
test/
├── codex_test.exs                    # Codex module tests
├── codex/
│   ├── thread_test.exs              # Thread module tests
│   ├── exec_test.exs                # Exec GenServer tests
│   ├── exec/
│   │   ├── parser_test.exs          # Event parser tests
│   │   └── integration_test.exs     # Exec integration tests
│   ├── events_test.exs              # Event type tests
│   ├── items_test.exs               # Item type tests
│   ├── options_test.exs             # Option struct tests
│   └── output_schema_file_test.exs  # Schema file helper tests
├── integration/
│   ├── basic_workflow_test.exs      # End-to-end workflows
│   ├── streaming_test.exs           # Streaming workflows
│   └── error_scenarios_test.exs     # Error handling
├── live/
│   └── real_codex_test.exs          # Tests with real API
├── property/
│   ├── events_property_test.exs     # Event properties
│   └── parsing_property_test.exs    # Parser properties
├── chaos/
│   └── resilience_test.exs          # Chaos engineering
├── support/
│   ├── fixtures.ex                  # Test data fixtures
│   ├── mock_codex_script.ex         # Mock script helper
│   └── supertester_helpers.ex       # Supertester utilities
└── test_helper.exs                  # Test configuration
```
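
One possible `test_helper.exs` that wires the tag conventions together, making the heavier suites opt-in:

```elixir
# test/test_helper.exs
# Integration and live tests run only with --include or --only.
ExUnit.start(exclude: [:integration, :live])

# Without credentials, exclude live tests regardless of other flags.
unless System.get_env("CODEX_API_KEY") do
  ExUnit.configure(exclude: [:live])
end
```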

### File Naming

- `*_test.exs`: Standard tests
- `*_integration_test.exs`: Integration tests
- `*_property_test.exs`: Property-based tests

### Test Naming

**Descriptive Names**:
```elixir
# Good
test "returns error when codex binary not found"
test "accumulates events until turn completes"
test "cleans up temp files on early stream halt"

# Bad
test "it works"
test "error case"
test "cleanup"
```

**Describe Blocks**:
```elixir
describe "run/3" do
  test "executes turn successfully" do
    # ...
  end

  test "handles API errors gracefully" do
    # ...
  end
end

describe "run/3 with output schema" do
  test "creates temporary schema file" do
    # ...
  end

  test "cleans up schema file after turn" do
    # ...
  end
end
```

## Assertions and Matchers

### Standard Assertions

```elixir
# Equality
assert result == expected
refute result == unexpected

# Pattern matching
assert %Codex.Events.ThreadStarted{thread_id: id} = event
assert id =~ ~r/thread_\w+/

# Boolean
assert Process.alive?(pid)
assert File.exists?(path)

# Membership
assert value in list
assert Map.has_key?(map, :key)

# Exceptions
assert_raise ArgumentError, fn ->
  %Codex.Events.ThreadStarted{}
end

# Messages
assert_receive {:event, ^ref, event}, 1000
refute_received {:error, _}
```

### Custom Assertions

```elixir
defmodule CodexSdk.Assertions do
  import ExUnit.Assertions

  def assert_valid_thread_id(thread_id) do
    assert is_binary(thread_id), "thread_id must be a string"
    assert String.starts_with?(thread_id, "thread_"), "thread_id must start with 'thread_'"
    assert String.length(thread_id) > 7, "thread_id must have content after prefix"
  end

  def assert_complete_turn_result(result) do
    assert %Codex.Turn.Result{} = result
    assert is_list(result.items)
    assert is_binary(result.final_response)
    assert %Codex.Events.Usage{} = result.usage
    assert result.usage.input_tokens > 0
  end

  def assert_events_in_order(events, expected_types) do
    actual_types = Enum.map(events, & &1.type)
    assert actual_types == expected_types,
      "Events out of order.\nExpected: #{inspect(expected_types)}\nActual: #{inspect(actual_types)}"
  end
end
```
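
Importing the helpers keeps individual tests terse:

```elixir
defmodule Codex.Turn.ResultTest do
  use ExUnit.Case, async: true

  import CodexSdk.Assertions

  test "fixture results are complete" do
    # complete_turn_result/0 comes from the Codex.Fixtures module above
    Codex.Fixtures.complete_turn_result()
    |> assert_complete_turn_result()
  end
end
```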

## Error Testing

### Expected Errors

```elixir
test "returns error for invalid schema" do
  thread = %Codex.Thread{...}

  result = Codex.Thread.run(
    thread,
    "test",
    %Codex.Turn.Options{output_schema: "invalid"}
  )

  assert {:error, {:invalid_schema, _}} = result
end
```

### Error Propagation

```elixir
test "propagates turn failure from codex" do
  mock_script = create_failing_mock([
    ~s({"type":"thread.started","thread_id":"t1"}),
    ~s({"type":"turn.failed","error":{"message":"API error"}})
  ])

  codex_opts = %Codex.Options{codex_path_override: mock_script}
  {:ok, thread} = Codex.start_thread(codex_opts)

  result = Codex.Thread.run(thread, "test")

  assert {:error, {:turn_failed, error}} = result
  assert error.message == "API error"
end
```

### Error Recovery

```elixir
test "recovers from transient errors" do
  # Test retry logic, fallbacks, etc.
end
```
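
A minimal sketch of what such a test can look like, assuming the SDK exposes no retry API yet (the `with_retry/2` helper below is defined inline purely for illustration):

```elixir
defmodule RetryHelper do
  # Retries fun until it returns {:ok, _} or attempts run out.
  def with_retry(fun, attempts) when attempts > 0 do
    case fun.() do
      {:ok, _} = ok -> ok
      {:error, _} when attempts > 1 -> with_retry(fun, attempts - 1)
      {:error, _} = error -> error
    end
  end
end

test "retries transient failures until success" do
  {:ok, counter} = Agent.start_link(fn -> 0 end)

  # Fails twice, then succeeds, simulating a transient error
  flaky = fn ->
    case Agent.get_and_update(counter, &{&1, &1 + 1}) do
      attempt when attempt < 2 -> {:error, :transient}
      _ -> {:ok, :done}
    end
  end

  assert {:ok, :done} = RetryHelper.with_retry(flaky, 3)
end
```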

## Performance Testing

### Timing Assertions

```elixir
test "parses event in under 1ms" do
  event_json = ~s({"type":"thread.started","thread_id":"t1"})

  {time_us, result} = :timer.tc(fn ->
    Codex.Exec.Parser.parse_event(event_json)
  end)

  assert {:ok, _event} = result
  assert time_us < 1000, "Parsing took #{time_us}µs, expected < 1000µs"
end
```

### Load Testing

```elixir
test "handles 100 concurrent turns" do
  threads = for _ <- 1..100 do
    {:ok, thread} = Codex.start_thread()
    thread
  end

  tasks = for thread <- threads do
    Task.async(fn ->
      Codex.Thread.run(thread, "test")
    end)
  end

  results = Task.await_many(tasks, 30_000)

  assert Enum.all?(results, fn
    {:ok, _} -> true
    _ -> false
  end)
end
```

### Memory Testing

```elixir
test "streaming does not accumulate memory" do
  {:ok, thread} = Codex.start_thread()
  {:ok, stream} = Codex.Thread.run_streamed(thread, "generate 1000 items")

  # Force garbage collection before each reading so they are comparable
  :erlang.garbage_collect()
  memory_before = :erlang.memory(:total)

  # Consume stream
  Enum.each(stream, fn _ -> :ok end)

  :erlang.garbage_collect()
  memory_after = :erlang.memory(:total)
  memory_delta = memory_after - memory_before

  # Should be roughly constant (< 1MB growth)
  assert memory_delta < 1_000_000,
    "Memory grew by #{memory_delta} bytes, expected < 1MB"
end
```

## CI/CD Integration

### GitHub Actions Workflow

```yaml
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        elixir: ['1.14', '1.15', '1.16']
        # OTP 27 requires Elixir >= 1.17, so it is excluded from this matrix
        otp: ['25', '26']

    steps:
      - uses: actions/checkout@v3

      - name: Setup Elixir
        uses: erlef/setup-beam@v1
        with:
          elixir-version: ${{ matrix.elixir }}
          otp-version: ${{ matrix.otp }}

      - name: Restore dependencies cache
        uses: actions/cache@v3
        with:
          path: deps
          key: ${{ runner.os }}-mix-${{ hashFiles('**/mix.lock') }}
          restore-keys: ${{ runner.os }}-mix-

      - name: Install dependencies
        run: mix deps.get

      - name: Run tests
        run: mix test --exclude live

      - name: Run integration tests
        run: mix test --only integration

      - name: Check coverage
        run: mix coveralls.github
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Run Dialyzer
        run: mix dialyzer

      - name: Run Credo
        run: mix credo --strict
```

### Local Pre-commit Checks

```bash
#!/bin/bash
# .git/hooks/pre-commit

echo "Running tests..."
mix test || exit 1

echo "Running Dialyzer..."
mix dialyzer || exit 1

echo "Running Credo..."
mix credo --strict || exit 1

echo "Checking coverage..."
mix coveralls || exit 1

echo "All checks passed!"
```

## Best Practices

### DO

1. **Write tests first** - TDD approach
2. **Use descriptive names** - Clearly state what's being tested
3. **Test one thing** - Single responsibility per test
4. **Use Supertester** - No `Process.sleep`
5. **Mock external deps** - Fast, deterministic tests
6. **Test edge cases** - Null, empty, invalid inputs
7. **Test errors** - Both expected and unexpected
8. **Keep tests simple** - Easy to understand and maintain
9. **Use fixtures** - DRY test data
10. **Run tests often** - Continuous feedback

### DON'T

1. **Don't use `Process.sleep`** - Use proper synchronization
2. **Don't test implementation** - Test behavior, not internals
3. **Don't share state** - Each test should be independent
4. **Don't skip failing tests** - Fix or remove them
5. **Don't write flaky tests** - Always reproducible
6. **Don't mock everything** - Test real integrations when possible
7. **Don't ignore warnings** - Keep Dialyzer clean
8. **Don't hardcode values** - Use variables and constants
9. **Don't write long tests** - Break into smaller tests
10. **Don't test external APIs** - Mock or tag as :live

## Troubleshooting

### Flaky Tests

**Symptoms**: Test sometimes passes, sometimes fails.

**Common Causes**:
- Using `Process.sleep` for synchronization
- Shared state between tests
- Race conditions
- Timing assumptions

**Solutions**:
- Use Supertester for proper sync
- Ensure `async: true` is safe
- Use `assert_receive` with a timeout (see the sketch below)
- Check for shared resources
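
For example, an arbitrary sleep can usually be replaced by waiting for an explicit message:

```elixir
test "waits for completion deterministically" do
  parent = self()
  worker = spawn(fn -> send(parent, {:work_done, self()}) end)

  # Flaky alternative: Process.sleep(50) and hope the work finished.
  # Deterministic: block until the expected message actually arrives.
  assert_receive {:work_done, ^worker}, 1000
end
```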

### Slow Tests

**Symptoms**: Tests take too long to run.

**Common Causes**:
- Real API calls
- Large data generation
- Inefficient algorithms
- Too much setup

**Solutions**:
- Mock external calls
- Use smaller test data
- Optimize code under test
- Cache expensive setup (see the `setup_all` sketch below)
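
Expensive setup can often move to `setup_all`, which runs once per module (a sketch, reusing the fixtures module above):

```elixir
defmodule Codex.CachedSetupTest do
  use ExUnit.Case, async: true

  # Runs once for the whole module instead of once per test
  setup_all do
    {:ok, result: Codex.Fixtures.complete_turn_result()}
  end

  test "reuses the cached fixture", %{result: result} do
    assert result.final_response == "Hello"
  end
end
```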

### Low Coverage

**Symptoms**: Coverage below target.

**Common Causes**:
- Missing edge case tests
- Untested error paths
- Dead code

**Solutions**:
- Review coverage report
- Add missing tests
- Remove dead code
- Test all branches

## Conclusion

A comprehensive testing strategy is essential for building reliable, maintainable software. By following TDD principles, using Supertester for deterministic OTP testing, maintaining high coverage, and organizing tests clearly, we ensure the Elixir Codex SDK is production-ready and trustworthy.

Key takeaways:
- **Test first** - Write tests before implementation
- **No flakes** - Use proper synchronization, not sleeps
- **High coverage** - 95%+ with focus on critical paths
- **Fast feedback** - Quick test runs enable rapid iteration
- **Clear organization** - Well-structured tests are maintainable tests