# Razdel
[![Hex.pm](https://img.shields.io/hexpm/v/razdel.svg)](https://hex.pm/packages/razdel)
Rule-based Russian sentence and word tokenization — Elixir port of [Natasha Razdel](https://github.com/natasha/razdel).
Part of the [natasha-ex](https://github.com/natasha-ex) ecosystem: Russian NLP for Elixir.
## Usage
```elixir
iex> Razdel.sentenize("Привет. Как дела?")
[%Razdel.Substring{start: 0, stop: 7, text: "Привет."},
%Razdel.Substring{start: 8, stop: 17, text: "Как дела?"}]
iex> Razdel.tokenize("Кружка-термос на 0.5л")
[%Razdel.Substring{start: 0, stop: 13, text: "Кружка-термос"},
%Razdel.Substring{start: 14, stop: 16, text: "на"},
%Razdel.Substring{start: 17, stop: 20, text: "0.5"},
%Razdel.Substring{start: 20, stop: 21, text: "л"}]
```
Handles Russian abbreviations (`т.е.`, `г.`, `ст.`, `к.п.н.`), initials (`Л. В. Щербы`), quotes, brackets, dialogue dashes, list bullets, and smileys.
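For instance (a hypothetical input; result structs follow the `Substring` shape shown in Usage), the smiley handling means `:)` should come back as a single token rather than being split into `:` and `)`:

```elixir
# Hypothetical example of the smiley handling noted above:
# ":)" is expected to survive tokenization as one token.
Razdel.tokenize("Это смайлик :)")
```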
## Installation
```elixir
def deps do
  [{:razdel, "~> 0.1.0"}]
end
```
## Algorithm
Split-then-rejoin: the text is first split at every potential delimiter (`.`, `?`, `!`, `;`, `…`, closing quotes such as `"`/`»`, and closing brackets), then a chain of heuristic rules decides which splits are false positives and rejoins them.
Rules (in priority order):
1. **empty_side** — join if either side is empty
2. **no_space_prefix** — join if no space after delimiter
3. **lower_right** — join if next token is lowercase
4. **delimiter_right** — join if next token is punctuation
5. **abbr_left** — join for known abbreviations (400+ entries)
6. **inside_pair_abbr** — join for paired abbreviations (т.е., и т.д.)
7. **initials_left** — join for single uppercase letters (initials)
8. **list_item** — join for numbered/lettered list bullets
9. **close_quote** — handle closing quotes correctly
10. **close_bracket** — handle closing brackets
11. **dash_right** — join dialogue dashes before lowercase words
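A sketch of how the rules interact on a hypothetical input (result structs follow the `Substring` shape from Usage):

```elixir
# "т.е." matches inside_pair_abbr, so the splits after its dots are
# rejoined; the split before the uppercase "Конец" survives, so the
# result should be two sentences.
Razdel.sentenize("Он опоздал, т.е. пришёл позже. Конец.")
```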
## Performance
Benchmarked on the original test data (48,735 sentence texts, 208,995 token texts), Apple M5:
| Operation | Python (CPython 3.13) | Elixir (OTP 27) | Faster |
| ---------- | --------------------: | --------------: | ----------------- |
| sentenize | 78,000 /s | 23,000 /s | Python ~3.4× |
| tokenize | 323,000 /s | 363,000 /s | **Elixir ~1.1×** |
The sentenize gap comes from UTF-8 char-counting overhead in `Substring.locate`: the BEAM has no O(1) byte→char offset conversion, so each offset requires scanning the prefix. The split+segment core alone runs faster than Python.
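A minimal sketch of why the conversion is linear, using only standard-library Elixir (not the library's internals): UTF-8 is variable-width, so mapping a byte offset to a character offset means counting characters in the prefix.

```elixir
# Map a byte offset into a character offset by scanning the UTF-8 prefix.
# "Привет" is 6 Cyrillic characters but 12 bytes (2 bytes per character).
text = "Привет, мир"
byte_offset = 6
<<prefix::binary-size(byte_offset), _rest::binary>> = text
char_offset = String.length(prefix)
# char_offset == 3 ("При"); the scan is O(byte_offset), not O(1)
```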
Run the benchmark yourself with:

```bash
mix run bench/bench.exs
```
## License
MIT — Danila Poyarkov