# FastestTiktoken
Fast OpenAI-compatible tokenization for Elixir.
`FastestTiktoken` is a Rustler-backed Elixir library built on the
high-performance pure-Rust [`tiktoken`](https://crates.io/crates/tiktoken) crate.
It is designed for projects that need exact OpenAI tokenizer behavior without
depending on older wrappers around `tiktoken-rs`.
The test suite checks full public-behavior parity with official OpenAI
[`tiktoken`](https://github.com/openai/tiktoken) `0.13.0` for the encodings and
API surfaces exposed here: model mapping, GPT-2/r50k fixtures, regex edge
cases, roundtrips, special-token handling, `o200k_harmony`, large inputs, and
batch helpers.
## Installation
Add `fastest_tiktoken` to your dependencies:
```elixir
def deps do
  [
    {:fastest_tiktoken, "~> 0.1.1"}
  ]
end
```
Then fetch and compile:
```bash
mix deps.get
mix compile
```
Published releases use precompiled NIFs from GitHub Releases. Local source
builds require Rust 1.94 or newer.
## Quick Start
Count tokens by OpenAI model:
```elixir
iex> FastestTiktoken.count_tokens("hello world", model: "gpt-4o")
{:ok, 2}
```
Encode and decode text:
```elixir
{:ok, tokens} = FastestTiktoken.encode("hello world", model: "gpt-4o")
#=> {:ok, [24912, 2375]}
FastestTiktoken.decode(tokens, model: "gpt-4o")
#=> {:ok, "hello world"}
```
Use an explicit encoding:
```elixir
FastestTiktoken.encode("hello world", encoding: :cl100k_base)
#=> {:ok, [15339, 1917]}
```
Resolve GPT OSS models through the official `o200k_harmony` mapping:
```elixir
FastestTiktoken.encoding_for_model("gpt-oss-120b")
#=> {:ok, "o200k_harmony"}
FastestTiktoken.encode("<|start|>hello<|end|>",
  model: "gpt-oss-120b",
  allowed_special: :all
)
#=> {:ok, [200006, 24912, 200007]}
```
Batch encode and decode:
```elixir
{:ok, batch} =
  FastestTiktoken.encode_batch(["hello world", "goodbye world"],
    encoding: :cl100k_base
  )
FastestTiktoken.decode_batch(batch, encoding: :cl100k_base)
#=> {:ok, ["hello world", "goodbye world"]}
```
Handle special tokens explicitly:
```elixir
FastestTiktoken.encode("hello <|endoftext|>",
  encoding: :cl100k_base,
  allowed_special: :all
)
#=> {:ok, [15339, 220, 100257]}
FastestTiktoken.encode("hello <|endoftext|>",
  encoding: :cl100k_base,
  allowed_special: ["<|endoftext|>"]
)
#=> {:ok, [15339, 220, 100257]}
```
By default, special token strings are treated as ordinary text. That matches
`encode_ordinary` semantics and keeps `count_tokens/2` on the Rust crate's
zero-allocation count path.
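To see the default in practice, the sketch below compares the default encoding of a string with its `allowed_special: :all` encoding. `SpecialTokenCheck` is a hypothetical helper, not part of this library; it assumes only the `FastestTiktoken.encode/2` call shapes shown above, and the encoder is injectable so the comparison can be tried with a stub.

```elixir
defmodule SpecialTokenCheck do
  # Hypothetical helper for illustration; not part of FastestTiktoken.
  # `encode` defaults to FastestTiktoken.encode/2 and can be swapped for a stub.
  def differs?(text, opts, encode \\ &FastestTiktoken.encode/2) do
    # Default path: no :allowed_special option is passed.
    {:ok, ordinary} = encode.(text, opts)

    # Opt-in path: the same options plus allowed_special: :all.
    {:ok, special} = encode.(text, Keyword.put(opts, :allowed_special, :all))

    # True when the two encodings produce different token lists.
    ordinary != special
  end
end
```

If the default behaves as described, `SpecialTokenCheck.differs?("hello <|endoftext|>", encoding: :cl100k_base)` should return `true`, while text without any special token string should return `false`.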
## Why FastestTiktoken
- Uses the faster pure-Rust
[`tiktoken`](https://crates.io/crates/tiktoken) crate rather than the older
`tiktoken-rs` bindings that other Elixir tokenizer wrappers depend on.
- Keeps a small Elixir API with explicit `{:ok, value}` / `{:error, reason}`
return values.
- Supports RustlerPrecompiled artifacts so production installs do not need a
Rust toolchain.
- Parity-tested against official OpenAI `tiktoken` `0.13.0`, including the
`o200k_harmony` special-token table used by GPT OSS models.
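Because every call returns a tagged tuple, results can be passed straight into pattern-matching function heads. The sketch below is a hypothetical `TokenBudget` helper (not part of this library) that checks a `count_tokens/2` result against a limit. The shape of the error `reason` is not documented here, so the helper passes it through unchanged; the `:over_budget` tuple is this sketch's own invention.

```elixir
defmodule TokenBudget do
  # Hypothetical helper for illustration; not part of FastestTiktoken.

  # Count is within the limit: pass the result through.
  def check({:ok, count}, limit) when is_integer(count) and count <= limit do
    {:ok, count}
  end

  # Count exceeds the limit: return this sketch's own error tuple.
  def check({:ok, count}, limit) when is_integer(count) do
    {:error, {:over_budget, count, limit}}
  end

  # Tokenizer error: forward the reason without inspecting it.
  def check({:error, reason}, _limit), do: {:error, reason}
end
```

For example, `FastestTiktoken.count_tokens("hello world", model: "gpt-4o") |> TokenBudget.check(8)` would yield `{:ok, 2}`, given the count shown in the Quick Start.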
## More Documentation
- [Usage guide](docs/usage.md)
- [Parity and performance](docs/parity-and-performance.md)
## Source Builds
To force a local Rust build instead of using a precompiled NIF:
```bash
FASTEST_TIKTOKEN_BUILD=1 mix test
```
Source builds require Rust 1.94 or newer, as declared by the native crate.