# FastestTiktoken

Fast OpenAI-compatible tokenization for Elixir.

`FastestTiktoken` is a Rustler-backed Elixir library built on the
high-performance pure-Rust [`tiktoken`](https://crates.io/crates/tiktoken) crate.
It is designed for projects that need exact OpenAI tokenizer behavior without
depending on older wrappers around `tiktoken-rs`.

Public behavior is parity-tested against the official OpenAI
[`tiktoken`](https://github.com/openai/tiktoken) `0.13.0` across the encodings
and API surfaces exposed here: model mapping, GPT-2/r50k fixtures, regex edge
cases, roundtrips, special-token handling, `o200k_harmony`, large inputs, and
batch helpers.

## Installation

Add `fastest_tiktoken` to your dependencies:

```elixir
def deps do
  [
    {:fastest_tiktoken, "~> 0.1.1"}
  ]
end
```

Then fetch and compile:

```bash
mix deps.get
mix compile
```

Published releases use precompiled NIFs from GitHub Releases. Local source
builds require Rust 1.94 or newer.

## Quick Start

Count tokens by OpenAI model:

```elixir
FastestTiktoken.count_tokens("hello world", model: "gpt-4o")
#=> {:ok, 2}
```

Encode and decode text:

```elixir
{:ok, tokens} = FastestTiktoken.encode("hello world", model: "gpt-4o")
#=> {:ok, [24912, 2375]}

FastestTiktoken.decode(tokens, model: "gpt-4o")
#=> {:ok, "hello world"}
```

Use an explicit encoding:

```elixir
FastestTiktoken.encode("hello world", encoding: :cl100k_base)
#=> {:ok, [15339, 1917]}
```

Resolve GPT OSS models through the official `o200k_harmony` mapping:

```elixir
FastestTiktoken.encoding_for_model("gpt-oss-120b")
#=> {:ok, "o200k_harmony"}

FastestTiktoken.encode("<|start|>hello<|end|>",
  model: "gpt-oss-120b",
  allowed_special: :all
)
#=> {:ok, [200006, 24912, 200007]}
```

Batch encode and decode:

```elixir
{:ok, batch} =
  FastestTiktoken.encode_batch(["hello world", "goodbye world"],
    encoding: :cl100k_base
  )

FastestTiktoken.decode_batch(batch, encoding: :cl100k_base)
#=> {:ok, ["hello world", "goodbye world"]}
```

Handle special tokens explicitly:

```elixir
FastestTiktoken.encode("hello <|endoftext|>",
  encoding: :cl100k_base,
  allowed_special: :all
)
#=> {:ok, [15339, 220, 100257]}

FastestTiktoken.encode("hello <|endoftext|>",
  encoding: :cl100k_base,
  allowed_special: ["<|endoftext|>"]
)
#=> {:ok, [15339, 220, 100257]}
```

By default, special-token strings such as `<|endoftext|>` are treated as
ordinary text. This matches the `encode_ordinary` semantics of OpenAI's
`tiktoken` and keeps `count_tokens/2` on the Rust crate's zero-allocation
count path.
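
As a minimal sketch of that default, the snippet below encodes the same string
with and without `allowed_special` and checks whether the `<|endoftext|>` id
(`100257` in `cl100k_base`, as in the example above) shows up in the result. It
uses only `encode/2` options already shown in this README. The ordinary-text
token ids themselves are not listed because they were not verified here.

```elixir
text = "hello <|endoftext|>"

# Default: no :allowed_special, so the marker is tokenized as plain text.
{:ok, ordinary} = FastestTiktoken.encode(text, encoding: :cl100k_base)

# Opt in: the marker is mapped to its single special-token id.
{:ok, special} =
  FastestTiktoken.encode(text, encoding: :cl100k_base, allowed_special: :all)

# 100257 is the <|endoftext|> id in cl100k_base.
IO.inspect(100257 in special, label: "special id with allowed_special: :all")
IO.inspect(100257 in ordinary, label: "special id with defaults")
```

If the default behaves as described, the first check is `true` and the second
is `false`, which is why untrusted input cannot inject control tokens unless
you opt in.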

## Why FastestTiktoken

- Uses the faster pure-Rust
  [`tiktoken`](https://crates.io/crates/tiktoken) crate instead of the older
  `tiktoken-rs` bindings that other Elixir tokenizer wrappers depend on.
- Keeps a small Elixir API with explicit `{:ok, value}` / `{:error, reason}`
  return values.
- Supports RustlerPrecompiled artifacts so production installs do not need a
  Rust toolchain.
- Parity-tested against official OpenAI `tiktoken` `0.13.0`, including the
  `o200k_harmony` special-token table used by GPT OSS models.
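
The tagged return values mentioned above compose naturally with `case`. The
following is a sketch only: `TokenBudget` and its `check/3` function are
hypothetical names, not part of the library, and the shape of the error
`reason` is assumed to be opaque, so it is passed through unchanged.

```elixir
defmodule TokenBudget do
  # Hypothetical helper for illustration; not part of FastestTiktoken.

  @doc "Returns :ok when `text` fits within `limit` tokens for `model`."
  def check(text, model, limit) do
    case FastestTiktoken.count_tokens(text, model: model) do
      {:ok, count} when count <= limit ->
        :ok

      {:ok, count} ->
        {:error, {:over_budget, count, limit}}

      # The structure of `reason` is not documented here, so do not
      # match on it; hand it back to the caller as-is.
      {:error, reason} ->
        {:error, reason}
    end
  end
end

TokenBudget.check("hello world", "gpt-4o", 8)
#=> :ok
```

Because `"hello world"` counts as 2 tokens for `gpt-4o` (see Quick Start), the
call above fits an 8-token budget, while a limit of 1 would return
`{:error, {:over_budget, 2, 1}}`.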

## More Documentation

- [Usage guide](docs/usage.md)
- [Parity and performance](docs/parity-and-performance.md)

## Source Builds

To force a local Rust build instead of using a precompiled NIF:

```bash
FASTEST_TIKTOKEN_BUILD=1 mix test
```

Source builds require Rust 1.94 or newer, as declared by the native crate.