# IREE.Tokenizers

`IREE.Tokenizers` is an inference-only Elixir tokenizer package backed by the
[IREE tokenizer runtime](https://github.com/iree-org/iree-tokenizer-py). It lets
Elixir applications load common LLM tokenizer assets and run fast local
encode/decode without a Python service. I first discovered IREE's tokenizer work
through the [ZML.ai blog](https://zml.ai/posts/iree-tokenizer/), and I deeply
admire the company and the engineering behind it.

In one sentence: this package turns Hugging Face `tokenizer.json`, OpenAI
`.tiktoken`, and SentencePiece `.model` files into BEAM-friendly tokenizer
handles with one-shot, batch, streaming, offset, mask, and vocab helper APIs.

## What this package does

- Loads tokenizer assets from local files, in-memory buffers, or the Hugging Face
  Hub.
- Supports Hugging Face `tokenizer.json`, OpenAI `.tiktoken`, and SentencePiece
  `.model` formats.
- Supports BPE, WordPiece, and Unigram model families.
- Encodes and decodes single inputs, lists of inputs, and streams of chunks.
- Returns token IDs, token strings, type IDs, attention masks, special-token
  masks, and optional byte offsets.
- Applies tokenizer-level `tokenizer.json` padding/truncation defaults where the
  reference `tokenizers` package applies them.
- Uses a native Rust/C runtime through Rustler, with precompiled NIFs for common
  release targets and local source builds in development/test.

## Why use it

Use this package when an Elixir system needs fast, in-process tokenization for
LLM workloads without leaving the BEAM:

- serving or batching LLM prompts in Phoenix, Livebook, Broadway, Oban, Nx, or
  custom inference services
- counting or packing tokens before model calls
- streaming tokenization for large prompts or ingestion pipelines
- using OpenAI/tiktoken-compatible encodings from Elixir
- loading SentencePiece `.model` files directly when a model repository does not
  expose the exact `tokenizer.json` path you want
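
As an illustration of the "counting or packing tokens" use case, here is a minimal sketch that greedily groups prompts into batches under a token budget. The `TokenPacking` module is hypothetical and not part of this package; it assumes you have already computed a token count per prompt, for example with `length(encoding.ids)` after `Tokenizer.encode/3`.

```elixir
defmodule TokenPacking do
  @moduledoc """
  Hypothetical helper (not part of IREE.Tokenizers): greedily packs prompts
  into batches whose total token count stays within a budget.
  """

  # `counted` is a list of {prompt, token_count} tuples. In practice the count
  # would come from something like `length(encoding.ids)` after an encode call.
  def pack(counted, max_tokens) when is_integer(max_tokens) and max_tokens > 0 do
    {batches, current, _used} =
      Enum.reduce(counted, {[], [], 0}, fn {prompt, count}, {batches, current, used} ->
        cond do
          # A prompt over the budget gets its own batch so the caller can
          # decide whether to truncate or reject it.
          count > max_tokens ->
            {[[prompt] | flush(current, batches)], [], 0}

          # The prompt does not fit in the current batch: start a new one.
          used + count > max_tokens ->
            {flush(current, batches), [prompt], count}

          true ->
            {batches, [prompt | current], used + count}
        end
      end)

    current |> flush(batches) |> Enum.reverse()
  end

  # Pushes the in-progress batch (stored in reverse) onto the finished list.
  defp flush([], batches), do: batches
  defp flush(current, batches), do: [Enum.reverse(current) | batches]
end
```

Greedy packing keeps prompts in their original order, which is usually what a serving queue wants; a bin-packing strategy could fill batches more tightly at the cost of reordering.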

## Current results

The checked-in benchmark and parity files are generated by scripts in `bench/`.
The README only summarizes results that have corresponding artifacts in
`bench/results/`.

### Correctness/parity

`bench/validate_parity.exs` compares `IREE.Tokenizers` with
[`elixir-nx/tokenizers`](https://hex.pm/packages/tokenizers), the Rust-backed
Hugging Face `tokenizers` reference package. The current selected matrix is
green for 7 model/load-path combinations (6 public models, with
`google-t5/t5-small` checked from both `tokenizer.json` and SentencePiece
`.model`), 19 representative inputs per combination, and both
`add_special_tokens: true` and `false` modes. It also checks batch encode and
stream encode parity.

See the full report: [`bench/results/parity_report.md`](bench/results/parity_report.md).
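
The idea behind a parity check can be sketched in a few lines. This is not the code in `bench/validate_parity.exs`; `ParityCheck` is a hypothetical module, and `candidate` and `reference` are assumed to be one-arity functions that return token IDs for an input string.

```elixir
defmodule ParityCheck do
  # Hypothetical sketch, not the code in bench/validate_parity.exs.
  # `candidate` and `reference` are one-arity functions returning token IDs.
  def mismatches(inputs, candidate, reference) do
    Enum.flat_map(inputs, fn input ->
      got = candidate.(input)
      expected = reference.(input)

      # Keep only the cases where the two encoders disagree.
      if got == expected do
        []
      else
        [%{input: input, got: got, expected: expected}]
      end
    end)
  end
end
```

An empty result means the two encoders agree on every input in the list; anything else pinpoints the disagreeing inputs.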

Currently green selected matrix:

| Model / load path | Coverage in the report |
| --- | --- |
| `Qwen/Qwen2.5-7B-Instruct` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-bert/bert-base-uncased` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `openai-community/gpt2` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `microsoft/Phi-3-mini-4k-instruct` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-t5/t5-small` from `tokenizer.json` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-t5/t5-small` from SentencePiece `.model` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `sentence-transformers/all-MiniLM-L6-v2` | 19/19 cases, both special-token modes; batch OK; stream OK |

The benchmark-matrix rows currently published in
[`bench/results/model_matrix.md`](bench/results/model_matrix.md) were also
re-checked on this branch for representative one-shot, batch, and stream parity:

- `LiquidAI/LFM2.5-1.2B-Instruct`
- `Qwen/Qwen3.5-9B`
- `zai-org/GLM-5.1`
- `mistralai/Ministral-3-3B-Reasoning-2512`
- `google/gemma-4-31B-it`

Historical upstream/runtime gaps and local fixes are documented in
[`docs/UPSTREAM_BUGS.md`](docs/UPSTREAM_BUGS.md). Do not treat that file as the
live status by itself; the latest parity report is the authoritative current
result.

### Performance

Benchmark numbers depend on hardware, OTP/Elixir versions, and cache state, so
treat the checked-in numbers as indicative of relative performance, not as
absolute figures:

| Benchmark artifact | Summary |
| --- | --- |
| [`bench/results/model_matrix.md`](bench/results/model_matrix.md) | Curated real-model prompt workload: IREE one-shot is 1.6x-5.6x faster than `tokenizers`; IREE stream is 5.4x-14.0x faster on the published rows. |
| [`bench/results/tokenizers_compare.md`](bench/results/tokenizers_compare.md) | Local BPE fixture: medium/long encode is about 1.3x faster; medium/long decode is about 10x faster. |
| [`bench/results/sentencepiece_compare.md`](bench/results/sentencepiece_compare.md) | Direct `.model` loading: T5-small encode is 1.97x faster; LLaMA tokenizer encode is 1.18x faster; LLaMA decode is 1.81x faster. |

The model-matrix run reports latency only for rows where the benchmark corpus
produces equivalent outputs across both libraries, and reports stream numbers
only when streamed output matches IREE one-shot output on that corpus.

Latency chart:

![Model matrix latency](https://github.com/goodhamgupta/iree_tokenizers/blob/main/bench/results/model_matrix_latency.svg?raw=1)

Speedup chart:

![Model matrix speedup](https://github.com/goodhamgupta/iree_tokenizers/blob/main/bench/results/model_matrix_speedup.svg?raw=1)

## Installation

Add the package to your Mix dependencies:

```elixir
def deps do
  [
    {:iree_tokenizers, "~> 0.7.0"}
  ]
end
```

Then run:

```bash
mix deps.get
```

The package uses `rustler_precompiled` for release builds. The current prebuilt
NIF target list is:

- `aarch64-apple-darwin`
- `x86_64-apple-darwin`
- `x86_64-unknown-linux-gnu`

In `:dev` and `:test`, the project forces a local Rust source build. You can
also force a local build with:

```bash
IREE_TOKENIZERS_BUILD=1 mix compile
```

## Quick start

### Load from the Hugging Face Hub

```elixir
alias IREE.Tokenizers.Tokenizer

{:ok, tokenizer} = Tokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello from Elixir", add_special_tokens: false)

encoding.ids
#=> token ids

{:ok, text} = Tokenizer.decode(tokenizer, encoding.ids, skip_special_tokens: false)

text
#=> "Hello from Elixir"
```

For gated or private Hugging Face repositories, pass a token:

```elixir
{:ok, tokenizer} =
  Tokenizer.from_pretrained("some/private-model",
    token: System.fetch_env!("HF_TOKEN")
  )
```

`from_pretrained/2` caches downloaded tokenizer assets by ETag in a per-user
cache directory by default. You can pass `cache_dir:`, `revision:`, `subfolder:`,
`filename:`, `use_cache: false`, or a custom `http_client:`.

### Load a local `tokenizer.json`

```elixir
{:ok, tokenizer} = Tokenizer.from_file("tokenizer.json")

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello world",
    add_special_tokens: true,
    track_offsets: true
  )

encoding.ids
encoding.tokens
encoding.offsets
encoding.attention_mask
encoding.special_tokens_mask
```
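
Byte offsets are most useful for mapping tokens back to the source text. The sketch below assumes each offset is a `{start_byte, end_byte}` tuple with an exclusive end; that shape is an assumption, so check the `IREE.Tokenizers.Encoding` docs before relying on it. `OffsetSlices` is a hypothetical helper, not part of the package.

```elixir
defmodule OffsetSlices do
  # Hypothetical helper: recover the source bytes covered by each token.
  # Assumes offsets are `{start_byte, end_byte}` tuples with an exclusive end.
  def slices(text, offsets) when is_binary(text) and is_list(offsets) do
    Enum.map(offsets, fn {start, stop} ->
      # binary_part/3 counts bytes, not graphemes, which matches byte offsets
      # even when the text contains multi-byte UTF-8 characters.
      binary_part(text, start, stop - start)
    end)
  end
end
```

Slicing by bytes with `binary_part/3` is deliberate here: `String.slice/3` counts graphemes, so it would drift on any non-ASCII input.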

### Load OpenAI `.tiktoken` encodings

```elixir
{:ok, tokenizer} =
  Tokenizer.from_pretrained("gpt-4o", format: :tiktoken)

{:ok, cl100k} =
  Tokenizer.from_pretrained("openai/cl100k_base", format: :tiktoken)

Tokenizer.supported_tiktoken_encodings()
#=> ["cl100k_base", "o200k_base", "o200k_harmony", "r50k_base", "gpt2", "p50k_base", "p50k_edit"]
```

For local `.tiktoken` files, pass `format: :tiktoken` when inference from the
filename is not enough:

```elixir
{:ok, tokenizer} =
  Tokenizer.from_file("gpt2.tiktoken", format: :tiktoken)

{:ok, tokenizer} =
  Tokenizer.from_buffer(buffer,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base"
  )
```

### Load SentencePiece `.model` files

Local files ending in `.model` are inferred automatically:

```elixir
{:ok, tokenizer} = Tokenizer.from_file("spiece.model")
```

From Hugging Face, request the SentencePiece path explicitly:

```elixir
{:ok, tokenizer} =
  Tokenizer.from_pretrained("google-t5/t5-small",
    format: :sentencepiece_model
  )
```

### Batch encode/decode

```elixir
{:ok, encodings} =
  Tokenizer.encode_batch(tokenizer, ["short prompt", "another prompt"],
    add_special_tokens: false
  )

ids_batch = Enum.map(encodings, & &1.ids)
{:ok, texts} = Tokenizer.decode_batch(tokenizer, ids_batch, skip_special_tokens: false)
```

`encode_batch/3` is intentionally parity-first: it routes through the same
single-input `encode/3` path for each item so tokenizer defaults, local fixes,
and transformations are identical to one-shot encoding.
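
A parity-first batch can be pictured as a plain loop over the one-shot path. The following is a hypothetical sketch, not the package's actual implementation: `encode_fun` stands in for something like `&Tokenizer.encode(tokenizer, &1, opts)`, and stopping at the first error is an assumption about the desired behavior.

```elixir
defmodule ParityFirstBatch do
  # Hypothetical sketch: `encode_fun` stands in for a one-shot encoder and
  # must return `{:ok, encoding}` or `{:error, reason}`.
  def encode_batch(inputs, encode_fun) when is_list(inputs) and is_function(encode_fun, 1) do
    inputs
    |> Enum.reduce_while({:ok, []}, fn input, {:ok, acc} ->
      case encode_fun.(input) do
        {:ok, encoding} -> {:cont, {:ok, [encoding | acc]}}
        # Stop at the first failure and surface it unchanged.
        {:error, _reason} = error -> {:halt, error}
      end
    end)
    |> case do
      # Encodings were accumulated in reverse, so restore input order.
      {:ok, encodings} -> {:ok, Enum.reverse(encodings)}
      error -> error
    end
  end
end
```

Because every item goes through the same function as a single encode, any fix applied to the one-shot path is picked up by the batch path for free.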

### Streaming encode/decode

```elixir
alias IREE.Tokenizers.{DecodeStream, EncodeStream}

{:ok, stream} = EncodeStream.new(tokenizer, add_special_tokens: false)
{:ok, ids1} = EncodeStream.feed(stream, "Hello ")
{:ok, ids2} = EncodeStream.feed(stream, "world")
{:ok, ids3} = EncodeStream.finalize(stream)
ids = ids1 ++ ids2 ++ ids3

{:ok, decode_stream} = DecodeStream.new(tokenizer, skip_special_tokens: false)
{:ok, text1} = DecodeStream.feed(decode_stream, Enum.take(ids, 2))
{:ok, text2} = DecodeStream.feed(decode_stream, Enum.drop(ids, 2))
{:ok, text3} = DecodeStream.finalize(decode_stream)
text = text1 <> text2 <> text3
```

For tokenizer families where the native streaming runtime can diverge at chunk
boundaries, the wrapper uses buffered-finalize strategies so the final stream
output still matches one-shot encode.
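
One way to picture a buffered-finalize strategy: `feed` only collects chunks, and `finalize` encodes the concatenated input once. The sketch below is hypothetical and purely functional (it returns an updated struct from `feed`), whereas the real `EncodeStream` returns `{:ok, ids}` from `feed/2` as shown above. `encode_fun` is an assumed stand-in for a one-shot encoder.

```elixir
defmodule BufferedStream do
  # Hypothetical sketch of a buffered-finalize encode stream. `encode_fun`
  # stands in for a one-shot encoder returning `{:ok, ids}`.
  defstruct encode_fun: nil, chunks: []

  def new(encode_fun) when is_function(encode_fun, 1) do
    %__MODULE__{encode_fun: encode_fun}
  end

  # Feeding only buffers the chunk, so no IDs are emitted yet.
  def feed(%__MODULE__{chunks: chunks} = stream, chunk) when is_binary(chunk) do
    {:ok, [], %{stream | chunks: [chunk | chunks]}}
  end

  # Finalizing encodes the whole input once, so chunk boundaries cannot
  # change the result.
  def finalize(%__MODULE__{encode_fun: encode_fun, chunks: chunks}) do
    chunks |> Enum.reverse() |> IO.iodata_to_binary() |> encode_fun.()
  end
end
```

The trade-off is latency and memory: nothing is emitted until `finalize`, which is why this kind of strategy makes sense only for tokenizer families where incremental output would otherwise diverge.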

### Encode transformations

```elixir
alias IREE.Tokenizers.Encoding.Transformation

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "hello",
    add_special_tokens: false,
    encoding_transformations: [
      Transformation.truncate(128),
      Transformation.pad(128, pad_id: 0, pad_token: "[PAD]")
    ]
  )
```

When a Hugging Face `tokenizer.json` carries fixed padding or truncation config,
that default config is applied automatically. Explicit transformations are then
applied after those defaults.
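
To make the effect of truncate-then-pad concrete, here is a minimal sketch on plain lists. It assumes right-side truncation and right-side padding, which may not match every tokenizer's configuration, and `TruncatePadSketch` is a hypothetical module, not the library's `Transformation` implementation.

```elixir
defmodule TruncatePadSketch do
  # Hypothetical sketch of fixed-length shaping on plain lists. Assumes
  # right-side truncation and right-side padding.
  def fit(ids, target, pad_id)
      when is_list(ids) and is_integer(target) and target >= 0 do
    kept = Enum.take(ids, target)
    missing = target - length(kept)

    %{
      ids: kept ++ List.duplicate(pad_id, missing),
      # 1 marks a real token, 0 marks padding.
      attention_mask: List.duplicate(1, length(kept)) ++ List.duplicate(0, missing)
    }
  end
end
```

With a target of 5 and a pad ID of 0, the IDs `[5, 6, 7]` become `[5, 6, 7, 0, 0]` with an attention mask of `[1, 1, 1, 0, 0]`.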

## API map

| Module | Purpose |
| --- | --- |
| `IREE.Tokenizers.Tokenizer` | Main load/encode/decode/vocab API. |
| `IREE.Tokenizers.Encoding` | Struct and helpers for token IDs, masks, offsets, tokens, padding, and truncation. |
| `IREE.Tokenizers.Encoding.Transformation` | Builders for post-encode transformations. |
| `IREE.Tokenizers.EncodeStream` | Incremental encode state. |
| `IREE.Tokenizers.DecodeStream` | Incremental decode state. |
| `IREE.Tokenizers.Model` and model modules | Build simple BPE, WordPiece, or Unigram specs from Elixir data. |

## Supported scope

Supported now:

- inference-time encode/decode
- Hugging Face `tokenizer.json`
- OpenAI `.tiktoken`
- SentencePiece `.model`
- BPE, WordPiece, and Unigram tokenizers
- single input encode/decode
- list input batch encode/decode
- streaming encode/decode
- token offsets, type IDs, attention masks, special-token masks, token strings
- special token ID lookup helpers
- tokenizer vocabulary lookup helpers

Deferred or intentionally out of scope for v1:

- pair-sequence encode input such as `{left, right}`
- tokenizer training APIs
- full tokenizer mutation APIs
- full surface-area parity with every `elixir-nx/tokenizers` option
- word ID tracking and overflowing-window output

Unsupported pair input returns:

```elixir
{:error, {:invalid_argument, "pair sequence inputs are not supported in v1"}}
```
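
Callers can pattern-match on that tuple to separate rejected input from other failures. The helper below is a hypothetical sketch; only the `{:error, {:invalid_argument, message}}` shape comes from this README, and the `:rejected` and `:failed` tags are illustrative names.

```elixir
defmodule EncodeResult do
  # Hypothetical helper: classify encode results for the caller. Only the
  # `:invalid_argument` tuple shape is taken from the README.
  def describe({:ok, encoding}), do: {:ok, encoding}

  # Invalid input (such as a pair sequence) is a caller error, so keep the
  # message for logging or for returning to the client.
  def describe({:error, {:invalid_argument, message}}) when is_binary(message) do
    {:rejected, message}
  end

  # Anything else is treated as an unexpected failure.
  def describe({:error, other}), do: {:failed, other}
end
```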

## How it is implemented

The implementation has four layers:

1. Elixir public API
   - `lib/iree/tokenizers/tokenizer.ex` owns loading, options, Hugging Face
     downloads/caching, batch behavior, tokenizer JSON defaults, and public
     result shaping.
   - `lib/iree/tokenizers/encoding.ex` mirrors the practical `Encoding` helper
     surface: IDs, masks, offsets, tokens, pad/truncate/transform.
   - `lib/iree/tokenizers/encode_stream.ex` and `decode_stream.ex` provide BEAM
     stream state wrappers.

2. Rust NIF bridge
   - `lib/iree/tokenizers/native.ex` uses `RustlerPrecompiled` in releases and
     source builds in development/test.
   - `native/iree_tokenizers_native/src/tokenizer.rs` maps Rust resources and
     NIF structs to the Elixir API.
   - Dirty CPU NIFs are used for encode/decode paths that can do significant
     native work.

3. Vendored IREE tokenizer runtime
   - The native crate builds a curated C source bundle under
     `native/iree_tokenizers_native/vendor/iree_tokenizer_src`.
   - The pinned upstream commit is recorded in
     `native/iree_tokenizers_native/vendor/IREE_COMMIT`.
   - `scripts/update_iree_bundle.sh` refreshes the vendored source bundle from a
     matching upstream IREE checkout.

4. Parity-preserving compatibility layer
   - SentencePiece `.model` buffers are converted to tokenizer JSON in Rust
     before construction.
   - Some tokenizer families use special decode or buffered stream strategies to
     match the Hugging Face reference output.
   - Encode buffers grow with bounded retry logic so native output-capacity
     issues return clear errors instead of silently truncating or exhausting the
     BEAM.
   - `encode_batch/3` delegates through one-shot `encode/3` for each input to
     preserve correctness across known native batch-runtime edge cases.
   - Hugging Face `tokenizer.json` padding/truncation defaults are parsed and
     applied in the Elixir layer.
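
The bounded retry idea from layer 4 can be sketched as follows. This is a hypothetical Elixir illustration of the pattern, not the package's native code: `attempt_fun` stands in for a native call that takes an output capacity, and the `:capacity_exceeded` and `:output_too_large` atoms, as well as the doubling growth factor, are assumptions.

```elixir
defmodule BoundedRetry do
  # Hypothetical sketch: `attempt_fun` stands in for a native call that takes
  # an output capacity and returns `{:ok, result}` or
  # `{:error, :capacity_exceeded}`. Names and error atoms are illustrative.
  def run(attempt_fun, capacity, max_attempts)
      when is_function(attempt_fun, 1) and capacity > 0 and max_attempts > 0 do
    case attempt_fun.(capacity) do
      {:ok, result} ->
        {:ok, result}

      {:error, :capacity_exceeded} when max_attempts > 1 ->
        # Double the buffer and try again, with one fewer attempt left.
        run(attempt_fun, capacity * 2, max_attempts - 1)

      {:error, :capacity_exceeded} ->
        # Out of attempts: report a clear error instead of truncating.
        {:error, {:output_too_large, capacity}}

      {:error, _reason} = error ->
        error
    end
  end
end
```

Capping the number of attempts bounds the largest buffer that can ever be requested, which is what keeps a pathological input from exhausting memory.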

## Repository usage

Install dependencies and run the normal local checks from the repository root:

```bash
mix deps.get
mix test
cargo test --manifest-path native/iree_tokenizers_native/Cargo.toml
```

Format Elixir and Rust code:

```bash
mix format
cargo fmt --manifest-path native/iree_tokenizers_native/Cargo.toml
```

Run optional pretrained integration suites:

```bash
RUN_PRETRAINED_BATCH_INTEGRATION=1 mix test test/iree_tokenizers/batch_integration_test.exs
RUN_PRETRAINED_STREAM_INTEGRATION=1 mix test test/iree_tokenizers/stream_integration_test.exs
RUN_SENTENCEPIECE_INTEGRATION=1 mix test test/iree_tokenizers/sentencepiece_integration_test.exs
```

Run the full selected parity matrix:

```bash
cd bench
mix deps.get
mix run validate_parity.exs
```

Limit the parity matrix while iterating:

```bash
cd bench
MODEL_FILTER="Qwen/Qwen2.5-7B-Instruct" mix run validate_parity.exs
```

The parity report is written to `bench/results/parity_report.md`.

## Benchmark harness

Set up once:

```bash
cd bench
mix deps.get
```

Run the generic fixture comparison:

```bash
mix run compare.exs
```

Generate the SentencePiece `.model` comparison charts:

```bash
mix run sentencepiece_compare.exs
```

Generate the curated model latency/speedup matrix:

```bash
mix run model_matrix_graphs.exs
```

Limit a model-matrix run while iterating:

```bash
MODEL_FILTER="Qwen/Qwen3.5-9B" mix run model_matrix_graphs.exs
```

All benchmark outputs are written to `bench/results/`. If a benchmark target
requires authentication, set `HF_TOKEN` before running the script.

## Vendored IREE bundle

The native crate builds against the vendored source bundle under
`native/iree_tokenizers_native/vendor/iree_tokenizer_src`.

The pinned IREE commit is recorded in:

```text
native/iree_tokenizers_native/vendor/IREE_COMMIT
```

To refresh the bundle from a matching upstream checkout:

```bash
scripts/update_iree_bundle.sh /path/to/iree
```

After any vendor refresh, run Rust tests, Elixir tests, and the pretrained
parity suites. Vendor updates can overwrite local C patches that are required
for parity.

## License

This package is distributed under the Apache-2.0 license. The vendored IREE
runtime carries its own license file under
`native/iree_tokenizers_native/vendor/iree_tokenizer_src/IREE-LICENSE`.