# tok

Pure Erlang tokenizer for [HuggingFace](https://huggingface.co) `tokenizer.json` files.
No NIFs, no Python, no native dependencies — drop a `tokenizer.json` next to your application and encode text directly from Erlang.

## Supported formats

| Type | Models |
|------|--------|
| WordPiece | BERT, DistilBERT, multilingual BERT, ... |
| BPE — ByteLevel | GPT-2, RoBERTa, Falcon, Llama 3, Mistral-Nemo, ... |
| BPE — Metaspace | Llama 2, Mistral 7B, Phi-3, ... |
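
If you are unsure which variant a given file uses, the `model.type` and `pre_tokenizer` fields of `tokenizer.json` tell you. A quick check with `jq` (assuming it is installed):

```shell
# "WordPiece" or "BPE"
jq -r '.model.type' tokenizer.json

# For BPE files, distinguishes ByteLevel from Metaspace handling.
# Note: this may be "Sequence" when several pre-tokenizers are chained.
jq -r '.pre_tokenizer.type' tokenizer.json
```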

## Installation

```erlang
%% rebar.config
{deps, [{tok, "0.2.0"}]}.
```

## Quick start

```erlang
{ok, Tok} = tok:load("path/to/tokenizer.json"),

%% Encode — returns {InputIds, AttentionMask, TokenTypeIds} as flat binaries
%% of int32 little-endian values, padded to max_length from the tokenizer config.
{IdsBin, MaskBin, _TypeBin} = tok:encode(Tok, <<"Hello world">>),

%% Extract the int32 ids from the binary
Ids = [Id || <<Id:32/signed-little>> <= IdsBin],

%% Decode back to text (strips special tokens)
Text = tok:decode(Tok, Ids),

%% Count tokens without building the output binary
N = tok:count_tokens(Tok, <<"Hello world">>).
```

## Getting a tokenizer.json

Download directly from HuggingFace:

```bash
# Any model page → Files → tokenizer.json
curl -L https://huggingface.co/<org>/<model>/resolve/main/tokenizer.json \
     -o tokenizer.json
```

Or save from the Python `transformers` library:

```python
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained("bert-base-uncased").save_pretrained(".")
# tokenizer.json is now in the current directory
```

## API

```erlang
%% Load a tokenizer from a tokenizer.json file.
-spec load(file:filename()) -> {ok, tokenizer()} | {error, term()}.

%% Encode text. Returns three binaries of int32 little-endian values,
%% each padded to max_length as configured in the tokenizer file.
-spec encode(tokenizer(), binary()) ->
    {InputIds :: binary(), AttentionMask :: binary(), TokenTypeIds :: binary()}.

%% Encode with options.
%%   add_special_tokens => false  skips CLS/SEP (WordPiece) or BOS/EOS (BPE).
-spec encode(tokenizer(), binary(), #{add_special_tokens => boolean()}) ->
    {InputIds :: binary(), AttentionMask :: binary(), TokenTypeIds :: binary()}.

%% Encode a list of texts.
-spec encode_batch(tokenizer(), [binary()]) ->
    [{InputIds :: binary(), AttentionMask :: binary(), TokenTypeIds :: binary()}].
-spec encode_batch(tokenizer(), [binary()], #{add_special_tokens => boolean()}) ->
    [{InputIds :: binary(), AttentionMask :: binary(), TokenTypeIds :: binary()}].

%% Decode a list of token IDs back to text. Special tokens are stripped.
-spec decode(tokenizer(), [integer()]) -> binary().

%% Count real tokens (after truncation, including special tokens).
%% Cheaper than encode/2 — does not allocate output binaries.
-spec count_tokens(tokenizer(), binary()) -> non_neg_integer().

%% Return vocabulary size.
-spec vocab_size(tokenizer()) -> integer().
```
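
For example, encoding without special tokens and in batch might look like this (a sketch; the exact ids depend on your tokenizer file):

```erlang
{ok, Tok} = tok:load("tokenizer.json"),

%% Raw subword ids only — no CLS/SEP or BOS/EOS wrapping.
{RawIds, _Mask, _Types} = tok:encode(Tok, <<"Hello world">>,
                                     #{add_special_tokens => false}),

%% One {InputIds, AttentionMask, TokenTypeIds} tuple per input, in order.
Batch = tok:encode_batch(Tok, [<<"first text">>, <<"second text">>]),
2 = length(Batch).
```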

### Reading the output binary

```erlang
{IdsBin, MaskBin, TypeBin} = tok:encode(Tok, Text),

InputIds      = [Id || <<Id:32/signed-little>> <= IdsBin],
AttentionMask = [M  || <<M:32/signed-little>> <= MaskBin],
TokenTypeIds  = [T  || <<T:32/signed-little>> <= TypeBin].
```

The binary format matches what most ONNX runtimes and NIF-based inference libraries expect directly, so you can often pass `IdsBin` through without decoding.
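
Because each value is a fixed-width int32, the output binaries have a predictable size. A quick sanity check, assuming the default `max_length` of 512 described in the Notes:

```erlang
{IdsBin, MaskBin, TypeBin} = tok:encode(Tok, <<"Hello world">>),

%% 4 bytes per value; all three binaries share the same padded length.
MaxLength = byte_size(IdsBin) div 4,
512 = MaxLength,                        %% assumed default; yours may differ
MaxLength = byte_size(MaskBin) div 4,
MaxLength = byte_size(TypeBin) div 4.
```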

## Notes

- **max_length** is read from the `truncation` section of `tokenizer.json`. If absent, defaults to 512.
- **pad_id** is read from the `padding` section. If absent, it falls back to the `[PAD]` token id when the vocabulary defines one, otherwise 0.
- BOS/EOS tokens are injected automatically when a `TemplateProcessing` post-processor is present in the tokenizer file.
- `byte_fallback` (Llama 2 / Mistral style) is supported: characters not in the vocabulary are split into `<0xNN>` byte tokens.
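
To illustrate the byte-fallback behavior: an out-of-vocabulary character is replaced by one `<0xNN>` token per UTF-8 byte, with `NN` in uppercase hex. The mapping can be sketched in plain Erlang (no library calls involved):

```erlang
%% "é" encodes to the two UTF-8 bytes 16#C3 and 16#A9.
Bytes = binary_to_list(<<"é"/utf8>>),

%% Render each byte as its <0xNN> fallback token name.
Fallback = [iolist_to_binary(io_lib:format("<0x~2.16.0B>", [B]))
            || B <- Bytes].
%% → [<<"<0xC3>">>, <<"<0xA9>">>]
```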

## License

Apache 2.0 — see [LICENSE](LICENSE).