# IREE.Tokenizers
`IREE.Tokenizers` is an inference-only Elixir tokenizer package backed by the
[IREE tokenizer runtime](https://github.com/iree-org/iree-tokenizer-py). It lets
Elixir applications load common LLM tokenizer assets and run fast local
encode/decode without a Python service. I first discovered IREE's tokenizer work
through the [ZML.ai blog](https://zml.ai/posts/iree-tokenizer/), and deeply
admire the company and the engineering behind it.
In one sentence: this package turns Hugging Face `tokenizer.json`, OpenAI
`.tiktoken`, and SentencePiece `.model` files into BEAM-friendly tokenizer
handles with one-shot, batch, streaming, offset, mask, and vocab helper APIs.
## What this package does
- Loads tokenizer assets from local files, in-memory buffers, or the Hugging Face
  Hub.
- Supports Hugging Face `tokenizer.json`, OpenAI `.tiktoken`, and SentencePiece
  `.model` formats.
- Supports BPE, WordPiece, and Unigram model families.
- Encodes and decodes single inputs, lists of inputs, and streams of chunks.
- Returns token IDs, token strings, type IDs, attention masks, special-token
  masks, and optional byte offsets.
- Applies tokenizer-level `tokenizer.json` padding/truncation defaults where the
  reference `tokenizers` package applies them.
- Uses a native Rust/C runtime through Rustler, with precompiled NIFs for common
  release targets and local source builds in development/test.
## Why use it
Use this package when an Elixir system needs tokenizer performance and LLM-style
runtime ergonomics without leaving the BEAM:
- serving or batching LLM prompts in Phoenix, Livebook, Broadway, Oban, Nx, or
  custom inference services
- counting or packing tokens before model calls
- streaming tokenization for large prompts or ingestion pipelines
- using OpenAI/tiktoken-compatible encodings from Elixir
- loading SentencePiece `.model` files directly when a model repository does not
  expose the exact `tokenizer.json` path you want
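
As a rough illustration of the counting and packing use case, the sketch below counts tokens per text and greedily groups texts under a token budget. `TokenBudget` and its functions are hypothetical helpers, not part of `IREE.Tokenizers`. The encoder is passed in as a function, and the only assumption about its result is an `ids` list.

```elixir
defmodule TokenBudget do
  # Hypothetical helper, not part of IREE.Tokenizers.
  # `encode_fun` is a 1-arity function returning {:ok, %{ids: [...]}} or
  # {:error, reason}.

  def count(encode_fun, text) do
    with {:ok, %{ids: ids}} <- encode_fun.(text) do
      {:ok, length(ids)}
    end
  end

  # Greedy packing that preserves input order. A single text larger than
  # the budget is rejected rather than silently truncated.
  def pack(encode_fun, texts, max_tokens) do
    result =
      Enum.reduce_while(texts, {:ok, [], [], 0}, fn text, {:ok, done, current, used} ->
        case count(encode_fun, text) do
          {:ok, n} when n > max_tokens ->
            {:halt, {:error, {:too_long, text, n}}}

          {:ok, n} when used + n > max_tokens ->
            # Close the current group and start a new one with this text.
            {:cont, {:ok, [Enum.reverse(current) | done], [text], n}}

          {:ok, n} ->
            {:cont, {:ok, done, [text | current], used + n}}

          {:error, _reason} = error ->
            {:halt, error}
        end
      end)

    case result do
      {:ok, done, [], _used} -> {:ok, Enum.reverse(done)}
      {:ok, done, current, _used} -> {:ok, Enum.reverse([Enum.reverse(current) | done])}
      {:error, _reason} = error -> error
    end
  end
end
```

In an application the encoder could be `&Tokenizer.encode(tokenizer, &1, add_special_tokens: false)`, with `tokenizer` loaded as shown in the quick start.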
## Current results
The checked-in benchmark and parity files are generated by scripts in `bench/`.
The README only summarizes results that have corresponding artifacts in
`bench/results/`.
### Correctness/parity
`bench/validate_parity.exs` compares `IREE.Tokenizers` with
[`elixir-nx/tokenizers`](https://hex.pm/packages/tokenizers), the Rust-backed
Hugging Face `tokenizers` reference package. The current selected matrix is
green for 7 model/load-path combinations (6 public models, with
`google-t5/t5-small` loaded two ways), 19 representative inputs per
combination, and both `add_special_tokens: true` and `false` modes. It also
checks batch encode and stream encode parity.
See the full report: [`bench/results/parity_report.md`](bench/results/parity_report.md).
Rows currently green in the selected matrix:
| Model / load path | Coverage in the report |
| --- | --- |
| `Qwen/Qwen2.5-7B-Instruct` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-bert/bert-base-uncased` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `openai-community/gpt2` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `microsoft/Phi-3-mini-4k-instruct` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-t5/t5-small` from `tokenizer.json` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-t5/t5-small` from SentencePiece `.model` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `sentence-transformers/all-MiniLM-L6-v2` | 19/19 cases, both special-token modes; batch OK; stream OK |
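
The shape of such a parity check can be sketched as follows. This is an illustration only, not the logic of `bench/validate_parity.exs`: `ParitySketch` is a hypothetical module, and both encoders are passed in as two-argument functions `(text, opts)` so that the comparison covers every input in both special-token modes.

```elixir
defmodule ParitySketch do
  # Hypothetical illustration, not the code in bench/validate_parity.exs.
  # `candidate` and `reference` are 2-arity functions (text, opts) that
  # return whatever result should be compared, for example {:ok, ids}.
  # Returns one map per mismatching case; an empty list means parity.
  def mismatches(candidate, reference, inputs) do
    for text <- inputs,
        special? <- [true, false],
        opts = [add_special_tokens: special?],
        expected = reference.(text, opts),
        actual = candidate.(text, opts),
        actual != expected do
      %{input: text, add_special_tokens: special?, expected: expected, actual: actual}
    end
  end
end
```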
The benchmark-matrix rows currently published in
[`bench/results/model_matrix.md`](bench/results/model_matrix.md) were also
re-checked on this branch for representative one-shot, batch, and stream parity:
- `LiquidAI/LFM2.5-1.2B-Instruct`
- `Qwen/Qwen3.5-9B`
- `zai-org/GLM-5.1`
- `mistralai/Ministral-3-3B-Reasoning-2512`
- `google/gemma-4-31B-it`
Historical upstream/runtime gaps and local fixes are documented in
[`docs/UPSTREAM_BUGS.md`](docs/UPSTREAM_BUGS.md). That file is a historical
record, not the live status; the latest parity report is the authoritative
current result.
### Performance
Benchmark numbers depend on the hardware, OTP/Elixir versions, and cache state.
The checked-in numbers indicate the current trend:
| Benchmark artifact | Summary |
| --- | --- |
| [`bench/results/model_matrix.md`](bench/results/model_matrix.md) | Curated real-model prompt workload: IREE one-shot is 1.6x-5.6x faster than `tokenizers`; IREE stream is 5.4x-14.0x faster on the published rows. |
| [`bench/results/tokenizers_compare.md`](bench/results/tokenizers_compare.md) | Local BPE fixture: medium/long encode is about 1.3x faster; medium/long decode is about 10x faster. |
| [`bench/results/sentencepiece_compare.md`](bench/results/sentencepiece_compare.md) | Direct `.model` loading: T5-small encode is 1.97x faster; LLaMA tokenizer encode is 1.18x faster; LLaMA decode is 1.81x faster. |
The model-matrix run reports latency only for rows where the benchmark corpus
produces equivalent outputs across both libraries, and reports stream numbers
only when streamed output matches IREE one-shot output on that corpus.
[Figure: latency chart]

[Figure: speedup chart]
## Installation
Add the package to your Mix dependencies:
```elixir
def deps do
  [
    {:iree_tokenizers, "~> 0.7.0"}
  ]
end
```
Then run:
```bash
mix deps.get
```
The package uses `rustler_precompiled` for release builds. The current prebuilt
NIF target list is:
- `aarch64-apple-darwin`
- `x86_64-apple-darwin`
- `x86_64-unknown-linux-gnu`
In `:dev` and `:test`, the project forces a local Rust source build. You can
also force a local build with:
```bash
IREE_TOKENIZERS_BUILD=1 mix compile
```
## Quick start
### Load from the Hugging Face Hub
```elixir
alias IREE.Tokenizers.Tokenizer

{:ok, tokenizer} = Tokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello from Elixir", add_special_tokens: false)

encoding.ids
#=> token ids

{:ok, text} = Tokenizer.decode(tokenizer, encoding.ids, skip_special_tokens: false)
text
#=> "Hello from Elixir"
```
For gated or private Hugging Face repositories, pass a token:
```elixir
{:ok, tokenizer} =
  Tokenizer.from_pretrained("some/private-model",
    token: System.fetch_env!("HF_TOKEN")
  )
```
`from_pretrained/2` caches downloaded tokenizer assets by ETag in a per-user
cache directory by default. You can pass `cache_dir:`, `revision:`, `subfolder:`,
`filename:`, `use_cache: false`, or a custom `http_client:`.
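
A minimal sketch of assembling these options from the environment follows. The option keys `token:`, `cache_dir:`, and `revision:` come from the list above; the module name `HubOptionsSketch` and the environment variables `TOKENIZER_CACHE_DIR` and `TOKENIZER_REVISION` are assumptions made for illustration.

```elixir
defmodule HubOptionsSketch do
  # Hypothetical helper, not part of IREE.Tokenizers. Builds a keyword list
  # for from_pretrained/2 and leaves out any option whose variable is unset.
  # TOKENIZER_CACHE_DIR and TOKENIZER_REVISION are illustrative names.
  def build(env \\ System.get_env()) do
    []
    |> put_if(:token, env["HF_TOKEN"])
    |> put_if(:cache_dir, env["TOKENIZER_CACHE_DIR"])
    |> put_if(:revision, env["TOKENIZER_REVISION"])
  end

  defp put_if(opts, _key, nil), do: opts
  defp put_if(opts, key, value), do: Keyword.put(opts, key, value)
end
```

The result can be passed straight through, for example `Tokenizer.from_pretrained("some/private-model", HubOptionsSketch.build())`.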
### Load a local `tokenizer.json`
```elixir
{:ok, tokenizer} = Tokenizer.from_file("tokenizer.json")

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello world",
    add_special_tokens: true,
    track_offsets: true
  )

encoding.ids
encoding.tokens
encoding.offsets
encoding.attention_mask
encoding.special_tokens_mask
```
### Load OpenAI `.tiktoken` encodings
```elixir
{:ok, tokenizer} =
  Tokenizer.from_pretrained("gpt-4o", format: :tiktoken)

{:ok, cl100k} =
  Tokenizer.from_pretrained("openai/cl100k_base", format: :tiktoken)

Tokenizer.supported_tiktoken_encodings()
#=> ["cl100k_base", "o200k_base", "o200k_harmony", "r50k_base", "gpt2", "p50k_base", "p50k_edit"]
```
For local `.tiktoken` files, pass `format: :tiktoken` when inference from the
filename is not enough:
```elixir
{:ok, tokenizer} =
  Tokenizer.from_file("gpt2.tiktoken", format: :tiktoken)

{:ok, tokenizer} =
  Tokenizer.from_buffer(buffer,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base"
  )
```
### Load SentencePiece `.model` files
Local files ending in `.model` are inferred automatically:
```elixir
{:ok, tokenizer} = Tokenizer.from_file("spiece.model")
```
From Hugging Face, request the SentencePiece path explicitly:
```elixir
{:ok, tokenizer} =
  Tokenizer.from_pretrained("google-t5/t5-small",
    format: :sentencepiece_model
  )
```
### Batch encode/decode
```elixir
{:ok, encodings} =
  Tokenizer.encode_batch(tokenizer, ["short prompt", "another prompt"],
    add_special_tokens: false
  )

ids_batch = Enum.map(encodings, & &1.ids)

{:ok, texts} = Tokenizer.decode_batch(tokenizer, ids_batch, skip_special_tokens: false)
```
`encode_batch/3` is intentionally parity-first: it routes through the same
single-input `encode/3` path for each item so tokenizer defaults, local fixes,
and transformations are identical to one-shot encoding.
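
The parity-first idea can be pictured with a short sketch: every input goes through the same single-input function, and results are collected in order. `BatchSketch` is a hypothetical module, and stopping at the first error is an assumption of this sketch rather than documented behavior of `encode_batch/3`.

```elixir
defmodule BatchSketch do
  # Hypothetical illustration, not the library's implementation.
  # `encode_fun` is the single-input encoder: input -> {:ok, enc} | {:error, reason}.
  def encode_batch(encode_fun, inputs) when is_list(inputs) do
    inputs
    |> Enum.reduce_while({:ok, []}, fn input, {:ok, acc} ->
      case encode_fun.(input) do
        {:ok, encoding} -> {:cont, {:ok, [encoding | acc]}}
        # Assumption: the first failing item aborts the whole batch.
        {:error, _reason} = error -> {:halt, error}
      end
    end)
    |> case do
      {:ok, encodings} -> {:ok, Enum.reverse(encodings)}
      error -> error
    end
  end
end
```

Because each item uses the one-shot path, a batch result is by construction the list of one-shot results, at the cost of not using any native batch fast path.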
### Streaming encode/decode
```elixir
alias IREE.Tokenizers.{DecodeStream, EncodeStream}

{:ok, stream} = EncodeStream.new(tokenizer, add_special_tokens: false)
{:ok, ids1} = EncodeStream.feed(stream, "Hello ")
{:ok, ids2} = EncodeStream.feed(stream, "world")
{:ok, ids3} = EncodeStream.finalize(stream)

ids = ids1 ++ ids2 ++ ids3

{:ok, decode_stream} = DecodeStream.new(tokenizer, skip_special_tokens: false)
{:ok, text1} = DecodeStream.feed(decode_stream, Enum.take(ids, 2))
{:ok, text2} = DecodeStream.feed(decode_stream, Enum.drop(ids, 2))
{:ok, text3} = DecodeStream.finalize(decode_stream)

text = text1 <> text2 <> text3
```
For tokenizer families where the native streaming runtime can diverge at chunk
boundaries, the wrapper uses buffered-finalize strategies so the final stream
output still matches one-shot encode.
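
A buffered-finalize strategy can be illustrated with a small, purely functional sketch: chunks are only collected while feeding, and the full text is encoded once at the end, so chunk boundaries cannot affect the result. `BufferedEncodeSketch` is hypothetical and, unlike `EncodeStream`, returns its updated state explicitly; it is not the wrapper's actual implementation.

```elixir
defmodule BufferedEncodeSketch do
  # Hypothetical illustration of buffered finalize, not the library's code.
  # `encode_fun` is a one-shot encoder: text -> {:ok, ids} | {:error, reason}.
  defstruct [:encode_fun, chunks: []]

  def new(encode_fun), do: %__MODULE__{encode_fun: encode_fun}

  # Emits no IDs while feeding; the caller threads the returned state.
  def feed(%__MODULE__{chunks: chunks} = state, chunk) when is_binary(chunk) do
    {:ok, [], %{state | chunks: [chunk | chunks]}}
  end

  # Joins the chunks in arrival order and encodes them in one shot.
  def finalize(%__MODULE__{encode_fun: encode_fun, chunks: chunks}) do
    chunks |> Enum.reverse() |> IO.iodata_to_binary() |> encode_fun.()
  end
end
```

The trade-off is latency: no IDs are available until `finalize/1`, which is why a strategy like this would only make sense for tokenizer families where incremental output is not reliable.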
### Encode transformations
```elixir
alias IREE.Tokenizers.Encoding.Transformation

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "hello",
    add_special_tokens: false,
    encoding_transformations: [
      Transformation.truncate(128),
      Transformation.pad(128, pad_id: 0, pad_token: "[PAD]")
    ]
  )
```
When a Hugging Face `tokenizer.json` carries fixed padding or truncation config,
that default config is applied automatically. Explicit transformations are then
applied after those defaults.
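
To make the ordering concrete, here is a sketch of truncate-then-pad on plain maps, with transformations applied left to right. `PadTruncateSketch` is hypothetical and does not use the library's `Encoding` struct; right-side padding and an attention mask of `0` for padded positions are assumptions of this sketch.

```elixir
defmodule PadTruncateSketch do
  # Hypothetical illustration on plain maps, not the library's Encoding struct.

  # Keeps at most `max_length` leading positions.
  def truncate(%{ids: ids, attention_mask: mask} = enc, max_length) do
    %{enc | ids: Enum.take(ids, max_length), attention_mask: Enum.take(mask, max_length)}
  end

  # Right-pads up to `target`; padded positions get attention mask 0.
  # Inputs already at or beyond `target` are returned unchanged.
  def pad(%{ids: ids, attention_mask: mask} = enc, target, pad_id) do
    missing = max(target - length(ids), 0)

    %{
      enc
      | ids: ids ++ List.duplicate(pad_id, missing),
        attention_mask: mask ++ List.duplicate(0, missing)
    }
  end

  # Applies 1-arity transformations in list order, so defaults placed first
  # run before explicit ones placed after them.
  def apply_all(enc, transformations) do
    Enum.reduce(transformations, enc, fn fun, acc -> fun.(acc) end)
  end
end
```

For example, applying `[&PadTruncateSketch.truncate(&1, 2), &PadTruncateSketch.pad(&1, 4, 0)]` to `%{ids: [5, 6, 7], attention_mask: [1, 1, 1]}` yields IDs `[5, 6, 0, 0]` and mask `[1, 1, 0, 0]`.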
## API map
| Module | Purpose |
| --- | --- |
| `IREE.Tokenizers.Tokenizer` | Main load/encode/decode/vocab API. |
| `IREE.Tokenizers.Encoding` | Struct and helpers for token IDs, masks, offsets, tokens, padding, and truncation. |
| `IREE.Tokenizers.Encoding.Transformation` | Builders for post-encode transformations. |
| `IREE.Tokenizers.EncodeStream` | Incremental encode state. |
| `IREE.Tokenizers.DecodeStream` | Incremental decode state. |
| `IREE.Tokenizers.Model` and model modules | Build simple BPE, WordPiece, or Unigram specs from Elixir data. |
## Supported scope
Supported now:
- inference-time encode/decode
- Hugging Face `tokenizer.json`
- OpenAI `.tiktoken`
- SentencePiece `.model`
- BPE, WordPiece, and Unigram tokenizers
- single input encode/decode
- list input batch encode/decode
- streaming encode/decode
- token offsets, type IDs, attention masks, special-token masks, token strings
- special token ID lookup helpers
- tokenizer vocabulary lookup helpers
Deferred or intentionally out of scope for v1:
- pair-sequence encode input such as `{left, right}`
- tokenizer training APIs
- full tokenizer mutation APIs
- full surface-area parity with every `elixir-nx/tokenizers` option
- word ID tracking and overflowing-window output
Unsupported pair input returns:
```elixir
{:error, {:invalid_argument, "pair sequence inputs are not supported in v1"}}
```
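
A caller that may receive mixed input can match on this tuple and decide what to do with rejected items. The sketch below is one possible caller-side pattern; `PairInputSketch` and the `{:skip, message}` return value are hypothetical, and `:invalid_argument` may also cover other rejected inputs.

```elixir
defmodule PairInputSketch do
  # Hypothetical caller-side helper, not part of IREE.Tokenizers.
  # `encode_fun` wraps the real encode call:
  # input -> {:ok, encoding} | {:error, reason}.
  def encode_or_skip(encode_fun, input) do
    case encode_fun.(input) do
      {:ok, encoding} ->
        {:ok, encoding}

      {:error, {:invalid_argument, message}} ->
        # Documented shape for rejected input such as {left, right} pairs.
        {:skip, message}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```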
## How it is implemented
The implementation has four layers:
1. Elixir public API
   - `lib/iree/tokenizers/tokenizer.ex` owns loading, options, Hugging Face
     downloads/caching, batch behavior, tokenizer JSON defaults, and public
     result shaping.
   - `lib/iree/tokenizers/encoding.ex` mirrors the practical `Encoding` helper
     surface: IDs, masks, offsets, tokens, pad/truncate/transform.
   - `lib/iree/tokenizers/encode_stream.ex` and `decode_stream.ex` provide BEAM
     stream state wrappers.
2. Rust NIF bridge
   - `lib/iree/tokenizers/native.ex` uses `RustlerPrecompiled` in releases and
     source builds in development/test.
   - `native/iree_tokenizers_native/src/tokenizer.rs` maps Rust resources and
     NIF structs to the Elixir API.
   - Dirty CPU NIFs are used for encode/decode paths that can do significant
     native work.
3. Vendored IREE tokenizer runtime
   - The native crate builds a curated C source bundle under
     `native/iree_tokenizers_native/vendor/iree_tokenizer_src`.
   - The pinned upstream commit is recorded in
     `native/iree_tokenizers_native/vendor/IREE_COMMIT`.
   - `scripts/update_iree_bundle.sh` refreshes the vendored source bundle from a
     matching upstream IREE checkout.
4. Parity-preserving compatibility layer
   - SentencePiece `.model` buffers are converted to tokenizer JSON in Rust
     before construction.
   - Some tokenizer families use special decode or buffered stream strategies to
     match the Hugging Face reference output.
   - Encode buffers grow with bounded retry logic so native output-capacity
     issues return clear errors instead of silently truncating or exhausting the
     BEAM.
   - `encode_batch/3` delegates through one-shot `encode/3` for each input to
     preserve correctness across known native batch-runtime edge cases.
   - Hugging Face `tokenizer.json` padding/truncation defaults are parsed and
     applied in the Elixir layer.
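
The bounded-retry idea for output buffers can be sketched in a few lines. This is an illustration written in Elixir for readability, not the Rust code in the native crate: `GrowRetrySketch`, the doubling growth factor, and the `:buffer_too_small` and `:output_capacity_exceeded` atoms are all assumptions.

```elixir
defmodule GrowRetrySketch do
  # Hypothetical illustration of bounded buffer growth, not the native code.
  # `native_fun` takes an output capacity and returns {:ok, ids} or
  # {:error, :buffer_too_small} (an assumed error shape).
  def encode_with_retry(native_fun, capacity, max_capacity)
      when capacity > 0 and max_capacity >= capacity do
    case native_fun.(capacity) do
      {:ok, ids} ->
        {:ok, ids}

      {:error, :buffer_too_small} when capacity < max_capacity ->
        # Double the capacity, but never exceed the hard limit.
        encode_with_retry(native_fun, min(capacity * 2, max_capacity), max_capacity)

      {:error, :buffer_too_small} ->
        # Limit reached: report a clear error instead of truncating.
        {:error, {:output_capacity_exceeded, max_capacity}}

      {:error, _reason} = error ->
        error
    end
  end
end
```

The hard limit is what keeps a pathological input from growing the buffer without bound, and the explicit error keeps a too-small buffer from being mistaken for a complete encoding.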
## Repository usage
Install dependencies and run the normal local checks from the repository root:
```bash
mix deps.get
mix test
cargo test --manifest-path native/iree_tokenizers_native/Cargo.toml
```
Format Elixir and Rust code:
```bash
mix format
cargo fmt --manifest-path native/iree_tokenizers_native/Cargo.toml
```
Run optional pretrained integration suites:
```bash
RUN_PRETRAINED_BATCH_INTEGRATION=1 mix test test/iree_tokenizers/batch_integration_test.exs
RUN_PRETRAINED_STREAM_INTEGRATION=1 mix test test/iree_tokenizers/stream_integration_test.exs
RUN_SENTENCEPIECE_INTEGRATION=1 mix test test/iree_tokenizers/sentencepiece_integration_test.exs
```
Run the full selected parity matrix:
```bash
cd bench
mix deps.get
mix run validate_parity.exs
```
Limit the parity matrix while iterating:
```bash
cd bench
MODEL_FILTER="Qwen/Qwen2.5-7B-Instruct" mix run validate_parity.exs
```
The parity report is written to `bench/results/parity_report.md`.
## Benchmark harness
Set up once:
```bash
cd bench
mix deps.get
```
Run the generic fixture comparison:
```bash
mix run compare.exs
```
Generate the SentencePiece `.model` comparison charts:
```bash
mix run sentencepiece_compare.exs
```
Generate the curated model latency/speedup matrix:
```bash
mix run model_matrix_graphs.exs
```
Limit a model-matrix run while iterating:
```bash
MODEL_FILTER="Qwen/Qwen3.5-9B" mix run model_matrix_graphs.exs
```
All benchmark outputs are written to `bench/results/`. If a benchmark target
requires authentication, set `HF_TOKEN` before running the script.
## Vendored IREE bundle
The native crate builds against the vendored source bundle under
`native/iree_tokenizers_native/vendor/iree_tokenizer_src`.
The pinned IREE commit is recorded in:
```text
native/iree_tokenizers_native/vendor/IREE_COMMIT
```
To refresh the bundle from a matching upstream checkout:
```bash
scripts/update_iree_bundle.sh /path/to/iree
```
After any vendor refresh, run Rust tests, Elixir tests, and the pretrained
parity suites. Vendor updates can overwrite local C patches that are required
for parity.
## License
This package is distributed under the Apache-2.0 license. The vendored IREE
runtime carries its own license file under
`native/iree_tokenizers_native/vendor/iree_tokenizer_src/IREE-LICENSE`.