guides/usage-and-billing.md

# Usage & Billing

## Overview

ReqLLM provides normalized usage tracking and best-effort cost calculation for API requests. Every response includes usage data that works consistently across providers, with detailed breakdowns for tokens, tools, and images when the provider exposes enough information.

## Pricing Policy

ReqLLM currently targets **"some assistance, no guarantees"** for pricing.

In practice, that means:

- `response.usage` is intended to be useful for product analytics, tenant attribution, dashboards, and rough billing estimates
- token, tool, image, and caching costs are calculated from provider usage data plus model pricing metadata when those inputs exist
- the resulting USD totals are not guaranteed to match provider invoices exactly

When exact billing matters, treat ReqLLM usage as a helpful estimate and reconcile against provider-side reporting. For the full contract, known gaps, and production guidance, see the [Pricing Policy](pricing-policy.md) guide.
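In that spirit, one way to use these estimates is to accumulate `response.usage.total_cost` per tenant and reconcile the totals against provider-side reports later. A minimal sketch (the accumulator shape and tenant keys are illustrative, not part of ReqLLM's API):

```elixir
defmodule CostEstimator do
  @moduledoc """
  Accumulates best-effort USD estimates per tenant from `response.usage` maps.
  Illustrative only: treat the totals as estimates, not invoice amounts.
  """

  # Add one response's estimated cost to a tenant's running total.
  def add_usage(acc, tenant_id, usage) do
    cost = usage_cost(usage)
    Map.update(acc, tenant_id, cost, &(&1 + cost))
  end

  # `total_cost` may be absent when pricing metadata is missing;
  # count missing costs as zero rather than failing.
  defp usage_cost(usage), do: Map.get(usage, :total_cost) || 0.0
end
```

Because `total_cost` can be `nil` or absent when pricing metadata is unavailable, the sketch defaults missing costs to zero instead of raising.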

## The Usage Structure

Every `ReqLLM.Response` includes a `usage` map with normalized metrics:

```elixir
{:ok, response} = ReqLLM.generate_text("anthropic:claude-haiku-4-5", "Hello")

response.usage
#=> %{
#     # Token counts
#     input_tokens: 8,
#     output_tokens: 12,
#     total_tokens: 20,
#
#     # Cost summary (USD)
#     input_cost: 0.00024,
#     output_cost: 0.00036,
#     total_cost: 0.0006,
#
#     # Detailed cost breakdown
#     cost: %{
#       tokens: 0.0006,
#       tools: 0.0,
#       images: 0.0,
#       total: 0.0006
#     }
#   }
```

## Token Usage

### Standard Tokens

All providers report basic token counts:

| Field | Description |
|-------|-------------|
| `input_tokens` | Tokens in the request (prompt, context, tools) |
| `output_tokens` | Tokens generated by the model |
| `total_tokens` | Sum of input and output tokens |

### Reasoning Tokens

For reasoning models (OpenAI o1/o3/gpt-5, Anthropic extended thinking, Google thinking):

```elixir
{:ok, response} = ReqLLM.generate_text("openai:o3-mini", prompt)

response.usage.reasoning_tokens
#=> 1250  # Tokens used for internal reasoning
```

The `reasoning_tokens` field tracks tokens used for chain-of-thought reasoning. Depending on the provider, these may be billed at a different rate from standard output tokens.
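If you want to estimate those components separately, a hedged sketch (the rate arguments are placeholders you would source from model pricing metadata, and this assumes the provider folds reasoning tokens into `output_tokens`, which varies by provider):

```elixir
defmodule ReasoningCost do
  # Sketch: split reasoning tokens out of an output-cost estimate.
  # Rates are placeholder USD-per-token values, not real pricing.
  def estimate(usage, output_rate, reasoning_rate) do
    reasoning = Map.get(usage, :reasoning_tokens) || 0
    # Guard against providers that report reasoning tokens separately.
    standard = max(usage.output_tokens - reasoning, 0)
    standard * output_rate + reasoning * reasoning_rate
  end
end
```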

### Cached Tokens

For providers that support prompt caching (Anthropic, OpenAI):

```elixir
response.usage.cached_tokens
#=> 500  # Input tokens served from cache

response.usage.cache_creation_tokens
#=> 0    # Tokens used to create new cache entries
```

Cached tokens are typically billed at a reduced rate. See [Anthropic Prompt Caching](anthropic.md#anthropic_prompt_cache) for details.

## Tool Usage

When using tools like web search, usage is tracked in `tool_usage`:

```elixir
response.usage.tool_usage
#=> %{
#     web_search: %{count: 2, unit: "call"}
#   }
```

### Web Search

Each provider has slightly different web search tracking:

| Provider | Unit | Notes |
|----------|------|-------|
| Anthropic | `"call"` | $10 per 1,000 searches |
| OpenAI | `"call"` | Responses API models only |
| xAI | `"call"` or `"source"` | Varies by response format |
| Google | `"query"` | Grounding queries |

**Anthropic Example:**
```elixir
{:ok, response} = ReqLLM.generate_text(
  "anthropic:claude-sonnet-4-5",
  "What's happening in AI today?",
  provider_options: [web_search: %{max_uses: 5}]
)

response.usage.tool_usage.web_search
#=> %{count: 3, unit: "call"}
```

**xAI Example:**
```elixir
{:ok, response} = ReqLLM.generate_text(
  "xai:grok-4-1-fast-reasoning",
  "Latest tech news",
  xai_tools: [%{type: "web_search"}]
)

response.usage.tool_usage.web_search
#=> %{count: 5, unit: "call"}
```

**Google Grounding Example:**
```elixir
{:ok, response} = ReqLLM.generate_text(
  "google:gemini-3-flash-preview",
  "Current stock market trends",
  provider_options: [google_grounding: %{enable: true}]
)

response.usage.tool_usage.web_search
#=> %{count: 2, unit: "query"}
```

## Image Usage

For image generation, usage is tracked in `image_usage`:

```elixir
{:ok, response} = ReqLLM.generate_image("openai:gpt-image-1", prompt)

response.usage.image_usage
#=> %{
#     generated: %{count: 1, size_class: "1024x1024"}
#   }
```

### Size Classes

Image costs vary by resolution:

| Provider | Size Classes |
|----------|-------------|
| OpenAI GPT Image | `"1024x1024"`, `"1536x1024"`, `"1024x1536"`, `"auto"` |
| OpenAI DALL-E 3 | `"1024x1024"`, `"1792x1024"`, `"1024x1792"` |
| Google | Based on aspect ratio |

### Multiple Images

```elixir
{:ok, response} = ReqLLM.generate_image("openai:dall-e-2", prompt, n: 3)

response.usage.image_usage.generated
#=> %{count: 3, size_class: "1024x1024"}
```

## Cost Breakdown

The `cost` map provides a detailed breakdown by category:

```elixir
response.usage.cost
#=> %{
#     tokens: 0.001,    # Token-based costs (input + output)
#     tools: 0.02,      # Web search and tool costs
#     images: 0.04,     # Image generation costs
#     total: 0.061,     # Sum of all costs
#     line_items: [...]  # Per-component details
#   }
```

### Line Items

For detailed billing analysis, `line_items` provides per-component costs:

```elixir
response.usage.cost.line_items
#=> [
#     %{component: "token.input", cost: 0.0003, quantity: 100},
#     %{component: "token.output", cost: 0.0007, quantity: 50},
#     %{component: "tool.web_search", cost: 0.02, quantity: 2}
#   ]
```
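For example, line items can be rolled up by top-level category with plain `Enum` calls (a sketch over the shape shown above; the sample values are illustrative):

```elixir
line_items = [
  %{component: "token.input", cost: 0.0003, quantity: 100},
  %{component: "token.output", cost: 0.0007, quantity: 50},
  %{component: "tool.web_search", cost: 0.02, quantity: 2}
]

# Group by the prefix before the first "." ("token", "tool", ...)
# and sum each group's costs.
by_category =
  line_items
  |> Enum.group_by(fn item ->
    item.component |> String.split(".") |> hd()
  end)
  |> Map.new(fn {category, items} ->
    {category, items |> Enum.map(& &1.cost) |> Enum.sum()}
  end)
```

This yields per-category totals suitable for dashboards or tenant-level rollups.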

## Provider-Specific Notes

### Anthropic

- **Web search**: $10 per 1,000 searches
- **Prompt caching**: Reduced rates for cached tokens
- **Extended thinking**: Reasoning tokens tracked separately

### OpenAI

- **Responses API**: Web search available for o1, o3, gpt-5 models
- **Chat Completions API**: No built-in web search
- **Image generation**: Costs vary by model and size

### xAI

- **Web search**: Via `xai_tools` option
- **Deprecated**: `live_search` is no longer supported
- **Units**: May report as `"call"` or `"source"`

### Google

- **Grounding**: Search via `google_grounding` option
- **Units**: Reports as `"query"`
- **Image generation**: Gemini image models supported

## Known Limits

ReqLLM does not currently guarantee support for every provider billing surface. In particular:

- realtime audio/text billing is not modeled yet
- video generation billing is not modeled yet
- account-specific discounts, credits, taxes, and regional pricing are outside the public contract

## Telemetry

ReqLLM emits three telemetry event families:

- `[:req_llm, :request, :start | :stop | :exception]` for lifecycle timing, request and response summaries, usage, and standardized reasoning metadata
- `[:req_llm, :reasoning, :start | :update | :stop]` for provider-neutral thinking and reasoning milestones
- `[:req_llm, :token_usage]` for backwards-compatible token and cost tracking

For billing and tenant attribution, use `[:req_llm, :request, :stop]` as the source of truth. It includes duration in measurements plus `request_id`, `usage`, `finish_reason`, and normalized `reasoning` metadata in the event metadata. The token usage event remains useful if you only want token and cost totals.

When you audit reasoning-heavy workloads, prefer the normalized `reasoning` snapshot on the request lifecycle events over raw provider payloads. It captures both the originally requested reasoning settings and the effective translated request, so you can see when a provider rewrites or disables a reasoning configuration before you attribute cost or behavior to a tenant.

```elixir
:telemetry.attach_many(
  "my-req-llm-billing",
  [
    [:req_llm, :request, :stop],
    [:req_llm, :request, :exception],
    [:req_llm, :token_usage]
  ],
  fn event, measurements, metadata, _config ->
    case event do
      [:req_llm, :request, :stop] ->
        duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)

        IO.inspect(
          %{
            request_id: metadata.request_id,
            duration_ms: duration_ms,
            finish_reason: metadata.finish_reason,
            usage: metadata.usage,
            reasoning: metadata.reasoning
          },
          label: "Request"
        )

      [:req_llm, :request, :exception] ->
        IO.inspect(metadata, label: "Failed request")

      [:req_llm, :token_usage] ->
        IO.inspect(%{measurements: measurements, metadata: metadata}, label: "Usage")
    end
  end,
  nil
)
```

`[:req_llm, :token_usage]` remains available on every request, including streaming:

```elixir
:telemetry.attach(
  "my-usage-handler",
  [:req_llm, :token_usage],
  fn _event, measurements, metadata, _config ->
    IO.inspect(measurements, label: "Usage")
    IO.inspect(metadata, label: "Metadata")
  end,
  nil
)
```

Event measurements include:
- `input_tokens`, `output_tokens`, `total_tokens`
- `input_cost`, `output_cost`, `total_cost`
- `reasoning_tokens` (when applicable)

See the [Telemetry Guide](telemetry.md) for the full event contract, reasoning lifecycle, milestone semantics, and payload capture options.

## Example: Complete Usage Tracking

```elixir
defmodule UsageTracker do
  def track_request(model, prompt, opts \\ []) do
    {duration_us, result} = :timer.tc(fn ->
      ReqLLM.generate_text(model, prompt, opts)
    end)

    case result do
      {:ok, response} ->
        usage = response.usage

        IO.puts("""
        Request completed in #{duration_us / 1000}ms

        Tokens:
          Input: #{usage.input_tokens}
          Output: #{usage.output_tokens}
          Total: #{usage.total_tokens}
          #{if Map.get(usage, :reasoning_tokens), do: "Reasoning: #{usage.reasoning_tokens}", else: ""}

        Cost:
          Input: $#{format_cost(usage.input_cost)}
          Output: $#{format_cost(usage.output_cost)}
          Total: $#{format_cost(usage.total_cost)}

        #{format_tool_usage(Map.get(usage, :tool_usage))}
        #{format_image_usage(Map.get(usage, :image_usage))}
        """)

        {:ok, response}

      error ->
        error
    end
  end

  defp format_cost(nil), do: "n/a"
  defp format_cost(cost), do: :erlang.float_to_binary(cost, decimals: 6)

  defp format_tool_usage(nil), do: ""
  defp format_tool_usage(tool_usage) do
    Enum.map_join(tool_usage, "\n", fn {tool, %{count: count, unit: unit}} ->
      "Tool Usage: #{tool} = #{count} #{unit}(s)"
    end)
  end

  defp format_image_usage(nil), do: ""
  defp format_image_usage(%{generated: %{count: count, size_class: size}}) do
    "Image Usage: #{count} image(s) at #{size}"
  end
  defp format_image_usage(_), do: ""
end
```

## See Also

- [Data Structures](data-structures.md) - Response structure details
- [Anthropic Guide](anthropic.md) - Web search and prompt caching
- [OpenAI Guide](openai.md) - Responses API and image generation
- [xAI Guide](xai.md) - Grok web search
- [Google Guide](google.md) - Grounding and search
- [Image Generation Guide](image-generation.md) - Image costs