guides/batch.md

# Batch Processing Guide

This guide covers the batch API for encoding and decoding multiple items in a single NIF call.

## When to Use Batch Operations

Batch operations are useful when you need to process many separate strings or binaries:

- Decoding/encoding rows from a database
- Processing lists of filenames or paths
- Converting multiple user inputs
- Data migration tasks

For streaming a single large file, use `EncodingRs.Decoder` instead (see the [Streaming Guide](streaming.md)).

## The Problem

Each NIF call has overhead: scheduler context switching, argument marshalling, and result conversion. When processing many small items, this overhead can dominate:

```elixir
# Inefficient: one NIF call per item
items
|> Enum.map(fn {data, encoding} ->
  EncodingRs.decode(data, encoding)
end)
```

## The Solution

Batch operations process all items in a single NIF call, amortizing the dispatch overhead:

```elixir
# Efficient: a single NIF call for the whole list
EncodingRs.decode_batch(items)
```
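
To gauge the dispatch savings on your own hardware, `:timer.tc/1` gives a quick comparison. A rough micro-benchmark sketch (the sample data is made up, and absolute numbers will vary by hardware and BEAM configuration):

```elixir
# 1,000 identical small items; purely illustrative input
items = List.duplicate({"Hello, world!", "windows-1252"}, 1_000)

# One NIF call per item
{per_item_us, _} =
  :timer.tc(fn ->
    Enum.map(items, fn {s, enc} -> EncodingRs.encode(s, enc) end)
  end)

# One NIF call for the whole batch
{batch_us, _} = :timer.tc(fn -> EncodingRs.encode_batch(items) end)

IO.puts("per item: #{per_item_us}µs, batched: #{batch_us}µs")
```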

## Usage

### Decoding Multiple Binaries

```elixir
items = [
  {<<72, 101, 108, 108, 111>>, "windows-1252"},
  {<<0x82, 0xA0>>, "shift_jis"},
  {<<0xC4, 0xE3, 0xBA, 0xC3>>, "gbk"}
]

results = EncodingRs.decode_batch(items)
# => [{:ok, "Hello"}, {:ok, "あ"}, {:ok, "你好"}]
```

### Encoding Multiple Strings

```elixir
items = [
  {"Hello", "windows-1252"},
  {"あ", "shift_jis"},
  {"你好", "gbk"}
]

results = EncodingRs.encode_batch(items)
# => [{:ok, <<72, 101, 108, 108, 111>>}, {:ok, <<130, 160>>}, {:ok, <<196, 227, 186, 195>>}]
```

### Handling Errors

Results are returned in the same order as input. Check each result individually:

```elixir
items = [
  {"Hello", "windows-1252"},
  {"Test", "invalid-encoding"},
  {"World", "utf-8"}
]

results = EncodingRs.encode_batch(items)
# => [{:ok, "Hello"}, {:error, :unknown_encoding}, {:ok, "World"}]

# Process results
items
|> Enum.zip(results)
|> Enum.each(fn {{input, encoding}, result} ->
  case result do
    {:ok, _encoded} ->
      IO.puts("Encoded #{inspect(input)} to #{encoding}")

    {:error, reason} ->
      IO.puts("Failed to encode #{inspect(input)}: #{inspect(reason)}")
  end
end)
```
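
If you only need successes separated from failures, `Enum.split_with/2` does it in one pass over the zipped pairs. A small sketch building on the `items` and `results` above:

```elixir
{oks, failures} =
  items
  |> Enum.zip(results)
  |> Enum.split_with(fn {_item, result} -> match?({:ok, _}, result) end)

encoded = Enum.map(oks, fn {_item, {:ok, binary}} -> binary end)
errors = Enum.map(failures, fn {{input, enc}, {:error, reason}} -> {input, enc, reason} end)
```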

### Mixed Encodings

Batch operations support a different encoding for each item:

```elixir
# Database rows with encoding metadata
rows = [
  %{content: <<...>>, encoding: "shift_jis", id: 1},
  %{content: <<...>>, encoding: "gbk", id: 2},
  %{content: <<...>>, encoding: "windows-1252", id: 3}
]

items = Enum.map(rows, &{&1.content, &1.encoding})
results = EncodingRs.decode_batch(items)

# Combine results back with original data, keeping any failures visible
rows
|> Enum.zip(results)
|> Enum.map(fn
  {row, {:ok, decoded}} -> Map.put(row, :content_utf8, decoded)
  {row, {:error, reason}} -> Map.put(row, :decode_error, reason)
end)
```

## Dirty Scheduler Behavior

Batch operations **always** use dirty CPU schedulers, regardless of input size or item count.

### Rationale

Batch operations are typically used for throughput-focused workloads where:

1. **Total work is significant** - Even if individual items are small, processing many items adds up
2. **Predictability matters** - Consistent dirty scheduler usage avoids variable latency
3. **Simplicity** - No threshold logic to tune or understand

### Trade-offs

| Aspect | Batch (always dirty) | Single-item (threshold-based) |
|--------|---------------------|------------------------------|
| Small workloads | Slight overhead from dirty scheduler | Uses normal scheduler |
| Large workloads | Optimal | Optimal |
| Latency | Consistent | Variable based on size |
| Complexity | Simple | Requires threshold tuning |

### When This Matters

For most use cases, always using dirty schedulers is the right choice. The overhead is minimal and the behavior is predictable.

If you have a latency-sensitive application processing very small batches (< 10 items, each < 1KB), you may see slightly better latency using individual `decode/2` or `encode/2` calls, which respect the configured dirty threshold.
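
If you want that choice made automatically, a thin wrapper can route each batch. A minimal sketch; `MyApp.Codec` and the 10-item / 1 KB cutoffs are illustrative, not measured thresholds:

```elixir
defmodule MyApp.Codec do
  @max_items 10
  @max_item_bytes 1024

  # Tiny batches of tiny items go through decode/2, which respects the
  # configured dirty threshold; everything else takes the batch path.
  def decode_batch(items) when length(items) < @max_items do
    if Enum.all?(items, fn {data, _enc} -> byte_size(data) < @max_item_bytes end) do
      Enum.map(items, fn {data, enc} -> EncodingRs.decode(data, enc) end)
    else
      EncodingRs.decode_batch(items)
    end
  end

  def decode_batch(items), do: EncodingRs.decode_batch(items)
end
```

Benchmark both paths with your real workload before adopting a wrapper like this.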

## Known Limitations

### No Batch Streaming

The batch API is for one-shot processing of complete binaries only. It does not support stateful streaming decoding where characters may be split across chunk boundaries.

For streaming use cases, use `EncodingRs.Decoder`, which maintains state between chunks. However, each decoder handles a single stream; there is currently no way to batch-process chunks from multiple streams in a single NIF call.

If you need to process multiple streams concurrently, create separate `EncodingRs.Decoder` instances for each stream.
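
For example, a sketch that decodes several Shift_JIS files concurrently, one decoder-backed stream per file. The paths are made up, and this assumes `EncodingRs.Decoder.stream/2` accepts an enumerable of chunks plus an encoding name, as described in the Streaming Guide:

```elixir
paths = ["logs/a.sjis", "logs/b.sjis", "logs/c.sjis"]

decoded =
  paths
  |> Task.async_stream(fn path ->
    path
    |> File.stream!([], 64 * 1024)              # read in 64 KB binary chunks
    |> EncodingRs.Decoder.stream("shift_jis")   # stateful streaming decode
    |> Enum.join()
  end, max_concurrency: 4)
  |> Enum.map(fn {:ok, text} -> text end)
```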

## Future Options

The following options may be added in future versions based on user feedback:

- **Batch streaming** - Process chunks from multiple decoders in a single NIF call
- **Threshold-based routing** - Check total bytes and route to normal/dirty scheduler
- **Item count threshold** - Use dirty scheduler only above N items
- **Explicit scheduler choice** - `decode_batch/2` with options like `[scheduler: :normal]`

If you have a use case that would benefit from these options, please [open an issue](https://github.com/jeffhuen/encoding_rs/issues).

## Performance Tips

1. **Batch similar-sized items** - Helps with memory allocation efficiency

2. **Reasonable batch sizes** - Batches of 100-10,000 items work well. Extremely large batches (100K+) may cause memory pressure.

3. **Consider chunking very large lists**:
   ```elixir
   large_list
   |> Enum.chunk_every(1000)
   |> Enum.flat_map(&EncodingRs.decode_batch/1)
   ```

4. **Parallel batches** - For very large workloads, split across processes:
   ```elixir
   items
   |> Enum.chunk_every(1000)
   |> Task.async_stream(&EncodingRs.decode_batch/1, max_concurrency: 4)
   |> Enum.flat_map(fn {:ok, results} -> results end)
   ```

## Comparison: Batch vs Streaming vs One-Shot

| Scenario | Best Approach |
|----------|---------------|
| Single small binary | `EncodingRs.decode/2` |
| Single large file | `EncodingRs.Decoder.stream/2` |
| Many separate items | `EncodingRs.decode_batch/1` |
| Network stream | `EncodingRs.Decoder` |
| Database rows | `EncodingRs.decode_batch/1` |