<div align="center">
<img src="assets/tiktoken_ex.svg" width="400" alt="TiktokenEx Logo" />
</div>
# TiktokenEx
**Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible).**
[CI](https://github.com/North-Shore-AI/tiktoken_ex/actions/workflows/ci.yml)
[Hex.pm](https://hex.pm/packages/tiktoken_ex)
[Docs](https://hexdocs.pm/tiktoken_ex)
[License](LICENSE)
TiktokenEx is a small, dependency-light implementation of the core TikToken
idea:
- Split text with a Unicode-aware regex (`pat_str`)
- Encode pieces with byte-pair encoding (BPE) using `mergeable_ranks`
- Optionally recognize special tokens (e.g. `<|im_end|>`)
It’s focused on matching the behavior of MoonshotAI’s **Kimi K2** tokenizers
that ship a `tiktoken.model` file and a TikToken-compatible `pat_str`.
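
To make the split-then-merge idea above concrete, here is a minimal standalone sketch in plain Elixir. It is not TiktokenEx's implementation: the `BpeSketch` module and its function names are hypothetical, it ignores special tokens, and it assumes every single byte of the input has an entry in the rank table.

```elixir
defmodule BpeSketch do
  # Hypothetical illustration module; not part of TiktokenEx.

  # Split `text` with `pat_str`, then BPE-encode each piece independently.
  def encode(text, pat_str, ranks) do
    pat_str
    |> Regex.compile!("u")
    |> Regex.scan(text)
    |> Enum.flat_map(fn [piece | _] -> encode_piece(piece, ranks) end)
  end

  # Start from single bytes, merge greedily, then look up each part's rank.
  # Map.fetch!/2 raises if a part is missing from the table.
  def encode_piece(piece, ranks) when is_binary(piece) do
    parts = for <<byte <- piece>>, do: <<byte>>

    parts
    |> merge(ranks)
    |> Enum.map(&Map.fetch!(ranks, &1))
  end

  # Repeatedly merge the adjacent pair whose concatenation has the lowest
  # rank (leftmost pair on ties) until no adjacent pair is in the table.
  defp merge(parts, ranks) do
    candidates =
      parts
      |> Enum.chunk_every(2, 1, :discard)
      |> Enum.with_index()
      |> Enum.flat_map(fn {[a, b], index} ->
        case Map.fetch(ranks, a <> b) do
          {:ok, rank} -> [{rank, index}]
          :error -> []
        end
      end)

    case candidates do
      [] ->
        parts

      _ ->
        {_rank, index} = Enum.min(candidates)
        {left, [a, b | right]} = Enum.split(parts, index)
        merge(left ++ [a <> b | right], ranks)
    end
  end
end

ranks = %{"He" => 0, "ll" => 1, "llo" => 2, "H" => 10, "e" => 11, "l" => 12, "o" => 13}

# Merges proceed H e l l o -> He l l o -> He ll o -> He llo.
BpeSketch.encode("Hello", ".+", ranks) |> IO.inspect()
# prints [0, 2]
```

The sketch rescans the whole piece after every merge, which is quadratic in the piece length; that is acceptable for illustration because pieces produced by the split regex are typically short.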
## Installation
Add `tiktoken_ex` to your dependencies:
```elixir
def deps do
  [
    {:tiktoken_ex, "~> 0.1.0"}
  ]
end
```
## Usage
### Build an encoding directly
```elixir
alias TiktokenEx.Encoding

# Toy vocabulary: byte sequences mapped to their merge ranks.
mergeable_ranks = %{
  "He" => 0,
  "ll" => 1,
  "llo" => 2,
  "H" => 10,
  "e" => 11,
  "l" => 12,
  "o" => 13
}

# `.+` keeps each line of input as a single piece before BPE runs.
{:ok, enc} = Encoding.new(pat_str: ".+", mergeable_ranks: mergeable_ranks)

{:ok, ids} = Encoding.encode(enc, "Hello")
{:ok, text} = Encoding.decode(enc, ids)
```
### Load a Kimi K2 encoding from local HuggingFace artifacts
The Kimi K2 HuggingFace artifacts include:
- `tiktoken.model` (mergeable ranks)
- `tokenizer_config.json` (special tokens, etc.)
```elixir
alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_files(
    tiktoken_model_path: "/path/to/tiktoken.model",
    tokenizer_config_path: "/path/to/tokenizer_config.json"
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")
{:ok, decoded} = Encoding.decode(enc, ids)
```
### Special tokens
Special tokens are recognized by default. To treat them as plain text:
```elixir
{:ok, ids} = TiktokenEx.Encoding.encode(enc, "<|im_end|>", allow_special_tokens: false)
```
#### Special token matching
When special tokens overlap (one is a prefix of another), which token matches depends on the
order of the alternatives in the special-token regex.
- Default: `special_token_matching: :parity` (alternative order is unspecified; closer to upstream `tiktoken`).
- Optional: `special_token_matching: :longest` (deterministic: the longest matching token wins).
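
The effect of alternative order can be seen with plain `Regex`, independent of this library. The sketch below uses two made-up overlapping tokens and a hypothetical `SpecialMatchSketch` module. It only illustrates why a longest-first ordering is deterministic; it does not show how TiktokenEx builds its regex internally.

```elixir
defmodule SpecialMatchSketch do
  # Hypothetical illustration module; not part of TiktokenEx.

  # Build one alternation regex from the tokens, in the order given.
  def build(tokens) do
    tokens
    |> Enum.map(&Regex.escape/1)
    |> Enum.join("|")
    |> Regex.compile!()
  end

  # Put longer tokens first so a token is always tried before its prefixes.
  def longest_first(tokens), do: Enum.sort_by(tokens, &byte_size/1, :desc)
end

# Made-up tokens where the first is a prefix of the second.
tokens = ["<|x|>", "<|x|>y"]
text = "<|x|>y"

# PCRE tries alternatives left to right, so the shorter token wins here.
tokens
|> SpecialMatchSketch.build()
|> Regex.run(text)
|> IO.inspect()
# prints ["<|x|>"]

# With longest-first ordering the longer token wins regardless of input order.
tokens
|> SpecialMatchSketch.longest_first()
|> SpecialMatchSketch.build()
|> Regex.run(text)
|> IO.inspect()
# prints ["<|x|>y"]
```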
### Regex compatibility note
Kimi’s upstream `pat_str` uses character-class intersections (`&&`), which are
not supported by Erlang’s PCRE engine. `TiktokenEx.Kimi.pat_str/0` provides a
PCRE-compatible translation.
## Development
- Run tests: `mix test`
- Run oracle parity tests (downloads HF artifacts): `mix test --include oracle`
- Run tests across backends: `scripts/test_backends.sh` (add `--oracle` to include parity)
- Run dialyzer: `mix dialyzer`
## License
MIT © 2025 North-Shore-AI