# Tiktokenex

Pure Elixir BPE tokenizer compatible with OpenAI's [tiktoken](https://github.com/openai/tiktoken). No NIFs, no Python, no external dependencies.

Supports `cl100k_base` (GPT-4, GPT-3.5) and `o200k_base` (GPT-4o) encodings.

## Usage

```elixir
# Encode text to token IDs
Tiktokenex.encode("Hello, world!")
#=> [9906, 11, 1917, 0]

# Decode back to text
Tiktokenex.decode([9906, 11, 1917, 0])
#=> "Hello, world!"

# Count tokens
Tiktokenex.count("Hello, world!")
#=> 4

# See the BPE chunks
Tiktokenex.encode_to_chunks("Hello, world!")
#=> ["Hello", ",", " world", "!"]

# Use o200k_base encoding
Tiktokenex.encode("Hello", :o200k_base)
```

## Installation

Add it to your `mix.exs` as either a git or a path dependency (use one entry, not both):

```elixir
def deps do
  [
    # Option 1: git dependency
    {:tiktokenex, git: "https://github.com/phiat/tiktokenex.git"}
    # Option 2: a sibling working copy for development (use instead of the git entry)
    # {:tiktokenex, path: "../tiktokenex"}
  ]
end
```

BPE rank files are not tracked in git — fetch them once with the bundled justfile recipe:

```bash
git clone https://github.com/phiat/tiktokenex.git
cd tiktokenex
just setup        # mix deps.get + downloads cl100k_base + o200k_base into priv/ranks/
```

Or download manually:

```bash
mkdir -p priv/ranks
curl -o priv/ranks/cl100k_base.tiktoken \
  https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
curl -o priv/ranks/o200k_base.tiktoken \
  https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
```
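For orientation, each `.tiktoken` file is plain text with one `<base64-encoded token bytes> <rank>` pair per line. The sketch below shows one way such a file could be parsed and cached; the `RanksSketch` module, its function names, and the `persistent_term` key are hypothetical and are not taken from Tiktokenex's own `Ranks` module.

```elixir
defmodule RanksSketch do
  @moduledoc "Illustrative only; not Tiktokenex's actual rank loader."

  # Parse the contents of a .tiktoken file into %{token_bytes => rank}.
  def parse(contents) when is_binary(contents) do
    contents
    |> String.split("\n", trim: true)
    |> Map.new(fn line ->
      [token_b64, rank] = String.split(line, " ")
      {Base.decode64!(token_b64), String.to_integer(rank)}
    end)
  end

  # Read and parse a rank file once, then serve it from :persistent_term.
  # The {__MODULE__, encoding} key is an assumption made for this sketch.
  def load(encoding, path) do
    key = {__MODULE__, encoding}

    case :persistent_term.get(key, nil) do
      nil ->
        ranks = path |> File.read!() |> parse()
        :persistent_term.put(key, ranks)
        ranks

      ranks ->
        ranks
    end
  end
end
```

Under these assumptions, `RanksSketch.load(:cl100k_base, "priv/ranks/cl100k_base.tiktoken")` would read the file on the first call and return the cached map on later calls.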

## How It Works

1. **Pre-tokenization** (`Pretokenizer`): splits text into coarse chunks using tiktoken's regex patterns
2. **BPE encoding** (`BPE`): applies byte-pair merges within each chunk, in the order given by the rank tables
3. **Rank loading** (`Ranks`): parses `.tiktoken` rank files and caches them in `persistent_term`
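
As a rough illustration of the pre-tokenization step, the sketch below splits text with the `cl100k_base` pattern as published in tiktoken's source. Treat the pattern as illustrative and check it against upstream. `PretokenizeSketch` is a hypothetical module, not Tiktokenex's `Pretokenizer`, and `o200k_base` uses a different pattern.

```elixir
defmodule PretokenizeSketch do
  @moduledoc "Illustrative only; not Tiktokenex's actual Pretokenizer."

  # Split text into coarse chunks. Each regex match becomes one chunk.
  def split(text) when is_binary(text) do
    # cl100k_base split pattern (assumed to match tiktoken's upstream definition).
    pattern =
      ~r/'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+/u

    pattern
    |> Regex.scan(text)
    |> Enum.map(&hd/1)
  end
end

PretokenizeSketch.split("Hello, world!")
#=> ["Hello", ",", " world", "!"]
```

For this input the chunks happen to coincide with the final BPE chunks shown in Usage, because each chunk is a single token in `cl100k_base`; longer or rarer words would be split further by the BPE step.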

The algorithm matches tiktoken's output exactly. See `test/` for reference vectors.
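
The merge loop at the heart of the BPE step can be sketched with a toy rank table. This is a deliberately simple version, assuming a map from byte strings to ranks in which every single byte of the input is present; `BPESketch` and the toy ranks are hypothetical, and the real `BPE` module may be structured differently and more efficiently.

```elixir
defmodule BPESketch do
  @moduledoc "Toy byte-pair merge loop; not Tiktokenex's actual implementation."

  # Encode one pre-tokenized chunk. `ranks` maps a binary to its rank
  # (a lower rank means the merge is applied earlier).
  def encode_chunk(chunk, ranks) when is_binary(chunk) do
    chunk
    |> :binary.bin_to_list()
    |> Enum.map(&<<&1>>)
    |> merge(ranks)
    |> Enum.map(&Map.fetch!(ranks, &1))
  end

  # Repeatedly merge the adjacent pair with the lowest rank (leftmost on ties)
  # until no adjacent pair is present in the rank table.
  defp merge(parts, ranks) do
    best =
      parts
      |> Enum.chunk_every(2, 1, :discard)
      |> Enum.with_index()
      |> Enum.flat_map(fn {[a, b], i} ->
        case Map.fetch(ranks, a <> b) do
          {:ok, rank} -> [{rank, i}]
          :error -> []
        end
      end)
      |> Enum.min(fn -> nil end)

    case best do
      nil ->
        parts

      {_rank, i} ->
        {left, [a, b | right]} = Enum.split(parts, i)
        merge(left ++ [a <> b | right], ranks)
    end
  end
end

# Toy rank table, invented for this example.
toy_ranks = %{"a" => 0, "b" => 1, "c" => 2, "ab" => 3, "abc" => 4}

BPESketch.encode_chunk("abcab", toy_ranks)
#=> [4, 3]
```

Here both `"ab"` pairs (rank 3) are merged first, then `"ab" <> "c"` becomes `"abc"` (rank 4), leaving the parts `["abc", "ab"]`.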

## API

| Function | Description |
|----------|-------------|
| `encode(text, encoding)` | Text to token ID list |
| `decode(ids, encoding)` | Token IDs back to text |
| `encode_to_chunks(text, encoding)` | Text to BPE chunk strings |
| `count(text, encoding)` | Token count |

The `encoding` argument is optional and defaults to `:cl100k_base`. Pass `:o200k_base` as the second argument for GPT-4o tokenization.
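
A common use of `count/2` is checking text against a token budget. The helper below is a minimal sketch: `TokenBudget` and `fits?/4` are hypothetical names, and the counting function is passed in (defaulting to `Tiktokenex.count/2`) so the helper can be exercised without the rank files.

```elixir
defmodule TokenBudget do
  @moduledoc "Hypothetical helper built on the README's count/2; not part of Tiktokenex."

  # Returns true when `text` encodes to at most `max_tokens` tokens.
  # `count_fun` must accept (text, encoding) and return an integer.
  def fits?(text, max_tokens, encoding \\ :cl100k_base, count_fun \\ &Tiktokenex.count/2) do
    count_fun.(text, encoding) <= max_tokens
  end
end
```

With the library and rank files installed, `TokenBudget.fits?("Hello, world!", 4)` would rely on the count of 4 shown in Usage.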

## Tests

```bash
just check    # mix test + credo + compile-with-warnings-as-errors
just test     # tests only
```

## License

MIT — see [LICENSE](LICENSE).