# Streaming Guide
This guide covers how to use `EncodingRs.Decoder` for processing large files and streaming data in multibyte encodings like Shift_JIS, GBK, Big5, and EUC-JP.
## The Problem
When processing files in chunks, multibyte characters can be split across chunk boundaries:
```
File content: "Hello世界" (in Shift_JIS)
Bytes:    <<72, 101, 108, 108, 111, 144, 162, 138, 69>>
                                       ↑
                              chunk boundary here

Chunk 1: <<72, 101, 108, 108, 111, 144>>  → "Hello" + lone lead byte of 世
Chunk 2: <<162, 138, 69>>                 → lone trail byte + "界"
```
Using one-shot `EncodingRs.decode/2` on each chunk independently produces corrupted output with replacement characters (`�`).
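For example, naively decoding the two chunks above one at a time (a sketch; the `{:ok, string}` return shape follows the other examples in this guide):

```elixir
chunk1 = <<72, 101, 108, 108, 111, 144>>
chunk2 = <<162, 138, 69>>

# Each call sees a byte sequence that is invalid at its boundary.
{:ok, part1} = EncodingRs.decode(chunk1, "shift_jis")
{:ok, part2} = EncodingRs.decode(chunk2, "shift_jis")

# part1 <> part2 is mojibake: the split 世 never decodes correctly,
# and the stray bytes surface as replacement or wrong characters.
```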
## The Solution
`EncodingRs.Decoder` maintains state between chunks, buffering incomplete byte sequences until they can be completed.
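A minimal sketch of the stateful API, using the same `Decoder.new/1` and `decode_chunk/3` calls as the GenServer example later in this guide:

```elixir
{:ok, decoder} = EncodingRs.Decoder.new("shift_jis")

# Feed the same two chunks; the trailing lead byte of 世 is buffered.
{:ok, part1, _errors} =
  EncodingRs.Decoder.decode_chunk(decoder, <<72, 101, 108, 108, 111, 144>>, false)

# is_last = true flushes anything still buffered.
{:ok, part2, _errors} =
  EncodingRs.Decoder.decode_chunk(decoder, <<162, 138, 69>>, true)

part1 <> part2  # "Hello世界"
```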
## Real-World Examples
### Processing a Large CSV File
```elixir
defmodule CsvProcessor do
  def process_shift_jis_csv(path) do
    path
    |> File.stream!([], 64 * 1024)  # 64KB chunks
    |> EncodingRs.Decoder.stream("shift_jis")
    |> Enum.join()
    |> String.split("\n")
    |> Enum.map(&String.split(&1, ","))
  end
end
```
### Streaming HTTP Response
```elixir
defmodule HttpClient do
  def fetch_and_decode(url, encoding) do
    # Using Req: stream the response body into a list of chunks.
    # (Req's `into:` also accepts a fun, but that fun receives
    # {request, response} as its accumulator, so a Collectable is simpler.)
    Req.get!(url, into: []).body
    |> EncodingRs.Decoder.stream(encoding)
    |> Enum.join()
  end
end
```
### Processing with Error Tracking
```elixir
defmodule DataImporter do
  require Logger

  def import_with_validation(path, encoding) do
    {content, had_errors} =
      path
      |> File.stream!([], 8192)
      |> EncodingRs.Decoder.stream_with_errors(encoding)
      |> Enum.reduce({"", false}, fn {chunk, errors}, {acc, had_any} ->
        if errors do
          Logger.warning("Invalid bytes detected in chunk")
        end

        {acc <> chunk, had_any or errors}
      end)

    if had_errors do
      Logger.warning("File contained invalid byte sequences")
    end

    content
  end
end
```
### GenServer for Continuous Stream Processing
```elixir
defmodule StreamProcessor do
  use GenServer

  def start_link(encoding) do
    GenServer.start_link(__MODULE__, encoding)
  end

  def process_chunk(pid, chunk, is_last \\ false) do
    GenServer.call(pid, {:process, chunk, is_last})
  end

  @impl true
  def init(encoding) do
    {:ok, decoder} = EncodingRs.Decoder.new(encoding)
    {:ok, %{decoder: decoder}}
  end

  @impl true
  def handle_call({:process, chunk, is_last}, _from, state) do
    {:ok, output, had_errors} =
      EncodingRs.Decoder.decode_chunk(state.decoder, chunk, is_last)

    {:reply, {:ok, output, had_errors}, state}
  end
end

# Usage
{:ok, pid} = StreamProcessor.start_link("gbk")
{:ok, out1, _} = StreamProcessor.process_chunk(pid, chunk1)
{:ok, out2, _} = StreamProcessor.process_chunk(pid, chunk2)
{:ok, out3, _} = StreamProcessor.process_chunk(pid, chunk3, true)
```
### Converting File Encoding
```elixir
defmodule EncodingConverter do
  @doc """
  Convert a file from one encoding to another.
  """
  def convert(input_path, output_path, from_encoding, to_encoding) do
    input_path
    |> File.stream!([], 64 * 1024)
    |> EncodingRs.Decoder.stream(from_encoding)
    |> Stream.map(fn chunk ->
      {:ok, encoded} = EncodingRs.encode(chunk, to_encoding)
      encoded
    end)
    |> Stream.into(File.stream!(output_path))
    |> Stream.run()
  end
end
# Convert Shift_JIS to UTF-8
EncodingConverter.convert("input.txt", "output.txt", "shift_jis", "utf-8")
```
## Choosing Chunk Size
The chunk size affects memory usage and performance:
| Chunk Size | Memory | Performance | Use Case |
|------------|--------|-------------|----------|
| 4KB | Low | More overhead | Memory-constrained |
| 64KB | Medium | Good balance | General purpose |
| 256KB+ | Higher | Less overhead | Large files, fast storage |
```elixir
# Memory-constrained environment
File.stream!(path, [], 4096)
# General purpose (recommended)
File.stream!(path, [], 64 * 1024)
# High-throughput processing
File.stream!(path, [], 256 * 1024)
```
## When to Use Streaming vs One-Shot
| Scenario | Approach |
|----------|----------|
| Small files (<1MB) | `EncodingRs.decode/2` |
| Large files | `EncodingRs.Decoder.stream/2` |
| Network streams | `EncodingRs.Decoder` |
| Unknown size | `EncodingRs.Decoder.stream/2` |
| Memory-constrained | `EncodingRs.Decoder.stream/2` |
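The table above can be folded into a small dispatch helper. This is a sketch: `decode_file/2` and the 1MB threshold are illustrative, not part of the library.

```elixir
defmodule SmartDecode do
  @one_shot_limit 1_000_000  # ~1MB, per the table above

  def decode_file(path, encoding) do
    if File.stat!(path).size < @one_shot_limit do
      # Small file: one-shot decode of the whole binary.
      {:ok, text} = EncodingRs.decode(File.read!(path), encoding)
      text
    else
      # Large file: stream in 64KB chunks with a stateful decoder.
      path
      |> File.stream!([], 64 * 1024)
      |> EncodingRs.Decoder.stream(encoding)
      |> Enum.join()
    end
  end
end
```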
## Common Encodings
| Region | Common Encodings |
|--------|------------------|
| Japanese | `shift_jis`, `euc-jp`, `iso-2022-jp` |
| Chinese (Simplified) | `gbk`, `gb18030` |
| Chinese (Traditional) | `big5` |
| Korean | `euc-kr` |
| Western European | `windows-1252`, `iso-8859-1` |
| Cyrillic | `windows-1251`, `koi8-r` |
## Tips
1. **Always flush**: Pass `true` for `is_last` on the final chunk to flush any buffered bytes.
2. **Don't share decoders**: Each decoder maintains mutable state. Don't share across processes.
3. **Check for errors**: Use `stream_with_errors/2` if you need to know about invalid byte sequences.
4. **BOM detection**: For files with BOMs, detect and strip first:
```elixir
content = File.read!(path)

{encoding, data} =
  case EncodingRs.detect_and_strip_bom(content) do
    {:ok, enc, rest} -> {enc, rest}
    {:error, :no_bom} -> {"utf-8", content}  # default
  end

{:ok, decoded} = EncodingRs.decode(data, encoding)
```