# Kiri (切り)

Japanese morphological analyzer for Elixir, powered by Sudachi dictionaries.

Kiri reads Sudachi-format dictionaries (converted to `.kiri` format) and produces
segmented morphemes with part-of-speech tags, readings, normalized forms, and
synonym group IDs. The default backend is pure Elixir, so no Rust toolchain is
required.

## Installation

Add `kiri` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:kiri, "~> 0.2"}
  ]
end
```

## Dictionary Setup

Download a Sudachi dictionary and convert it to `.kiri` format:

```bash
# Download
mkdir -p ~/.kiri-ji/dict
curl -L -o ~/.kiri-ji/dict/sudachi-dictionary-core.zip \
  https://github.com/WorksApplications/SudachiDict/releases/download/v20260116/sudachi-dictionary-20260116-core.zip
unzip -o ~/.kiri-ji/dict/sudachi-dictionary-core.zip -d ~/.kiri-ji/dict
mv ~/.kiri-ji/dict/sudachi-dictionary-*/system_core.dic ~/.kiri-ji/dict/

# Convert to .kiri format (one-time step)
mix kiri.convert ~/.kiri-ji/dict/system_core.dic ~/.kiri-ji/system_core.kiri
```

## Usage

```elixir
# Load once at application startup.
# Path.expand/1 resolves the leading "~" to the user's home directory.
{:ok, dict} = Kiri.load_dictionary(Path.expand("~/.kiri-ji/system_core.kiri"))

# Tokenize from any process: safe to call concurrently, no GenServer needed
morphemes = Kiri.tokenize(dict, "東京都に行った")

for m <- morphemes do
  IO.puts "#{m.surface}\t#{Enum.join(m.part_of_speech, ",")}\t#{m.normalized_form}"
end
# 東京都  名詞,固有名詞,地名,一般,*,*  東京都
# に      助詞,格助詞,*,*,*,*            に
# 行っ    動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
# た      助動詞,*,*,*,助動詞-タ,終止形-一般  た
```

## Concurrency

The `%Dictionary{}` struct is a ~2 KB handle. The actual ~150 MB of binary data
lives in `:persistent_term`, so every process reads it without copying.

```elixir
# Each element of the result is {:ok, morphemes}, in the same order as texts.
texts
|> Task.async_stream(&Kiri.tokenize(dict, &1), max_concurrency: 100)
|> Enum.to_list()
```
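
The handle-plus-shared-data arrangement can be illustrated without Kiri itself. The following is a minimal sketch, assuming a hypothetical `HandleSketch` module (not Kiri's actual struct or internals): the large binary is stored once in `:persistent_term`, and worker tasks receive only a small struct that names it.

```elixir
defmodule HandleSketch do
  # Hypothetical stand-in for a dictionary handle; not Kiri's actual struct.
  defstruct [:key, :size]

  # Store the large binary once and return a small struct that only names it.
  def load(key, binary) when is_binary(binary) do
    :persistent_term.put({__MODULE__, key}, binary)
    %__MODULE__{key: key, size: byte_size(binary)}
  end

  # Any process can look up the shared binary and read from it directly.
  def byte_at(%__MODULE__{key: key}, offset) do
    data = :persistent_term.get({__MODULE__, key})
    :binary.at(data, offset)
  end
end

# A 1 MB binary stands in for the dictionary data.
handle = HandleSketch.load(:demo, :binary.copy(<<1, 2, 3, 4>>, 250_000))

# Only the small handle is sent to each task; results come back in input order.
results =
  0..7
  |> Task.async_stream(&HandleSketch.byte_at(handle, &1), max_concurrency: 4)
  |> Enum.map(fn {:ok, byte} -> byte end)
```

Because `Task.async_stream/3` wraps each result in `{:ok, value}`, the final `Enum.map/2` unwraps the tuples. The same unwrapping step applies to the Kiri example above.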

## Split Modes

Override the default split mode per call:

```elixir
morphemes = Kiri.tokenize(dict, "関西国際空港", mode: :a)
```

- **`:c`** (default): longest units / named entities
- **`:b`**: middle-length units
- **`:a`**: shortest units (equivalent to UniDic short units)
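
To see how the three modes differ on the same input, it can help to collect the surfaces side by side. The following is a minimal sketch, assuming a hypothetical `ModeCompare` helper that is not part of Kiri. It takes the tokenizer as a function so it can be tried with a stub, and it relies only on the `mode:` option and the `surface` field shown in this README.

```elixir
defmodule ModeCompare do
  # `tokenize` is any 2-arity function (text, opts) that returns a list of
  # maps or structs with a :surface field.
  def surfaces_by_mode(tokenize, text, modes \\ [:a, :b, :c]) do
    Map.new(modes, fn mode ->
      {mode, tokenize.(text, mode: mode) |> Enum.map(& &1.surface)}
    end)
  end
end

# Stub tokenizer for illustration only: it splits into single characters in
# mode :a and returns the whole string otherwise. Real segmentation comes
# from Kiri and a dictionary, not from this stub.
stub = fn text, opts ->
  case Keyword.fetch!(opts, :mode) do
    :a -> text |> String.graphemes() |> Enum.map(&%{surface: &1})
    _ -> [%{surface: text}]
  end
end

by_mode = ModeCompare.surfaces_by_mode(stub, "空港")
```

With a loaded dictionary, the stub would be replaced by `&Kiri.tokenize(dict, &1, &2)`.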

## Options

Options can be passed to `Kiri.tokenize/3`:

| Option                      | Type                | Default    | Description                                |
| --------------------------- | ------------------- | ---------- | ------------------------------------------ |
| `mode`                      | `:a \| :b \| :c`   | `:c`       | Split mode (A/B/C)                         |
| `prolonged_sound_marks`     | `boolean`           | `false`    | Collapse repeated prolonged sound marks    |
| `ignore_yomigana`           | `boolean`           | `false`    | Strip bracketed readings after kanji       |
| `disable_normalization`     | `boolean`           | `false`    | Skip NFKC input text normalization         |
| `disable_numeric_normalize` | `boolean`           | `false`    | Skip numeric normalization in path rewrite |
| `backend`                   | `:elixir \| :nif`   | `:elixir`  | Tokenization backend                       |
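
Options are passed as a keyword list, so per-call overrides combine naturally with the defaults in the table. The following is a minimal sketch, assuming a hypothetical `OptionsSketch` helper that is not part of Kiri. How Kiri itself treats unknown keys is not stated here, so the rejection step is an assumption made for illustration.

```elixir
defmodule OptionsSketch do
  # Defaults copied from the options table above.
  @defaults [
    mode: :c,
    prolonged_sound_marks: false,
    ignore_yomigana: false,
    disable_normalization: false,
    disable_numeric_normalize: false,
    backend: :elixir
  ]

  # Merge caller overrides onto the defaults, rejecting keys not in the table.
  def resolve(overrides \\ []) do
    case Keyword.keys(overrides) -- Keyword.keys(@defaults) do
      [] -> {:ok, Keyword.merge(@defaults, overrides)}
      unknown -> {:error, {:unknown_options, unknown}}
    end
  end
end

{:ok, opts} = OptionsSketch.resolve(mode: :a, ignore_yomigana: true)
```

The resolved list could then be passed as the third argument, as in `Kiri.tokenize(dict, text, opts)`.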

## Architecture

Kiri is implemented in pure Elixir. The full plugin stack (input text
normalization, path rewriting, split modes, prolonged sound marks, yomigana
stripping) and the core algorithms (Viterbi lattice solver, Darts-clone
double-array trie, MeCab-style out-of-vocabulary handling) use binary pattern
matching against dictionary sections stored in `:persistent_term`. An optional
NIF backend (`backend: :nif`) is available for users who want Rust-accelerated
lattice construction.
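
As an illustration of the binary pattern matching approach, the following is a minimal sketch of reading one fixed-width record from a binary section. The record layout (three little-endian 16-bit fields) and the `SectionSketch` module are assumptions made for this example and do not describe the actual `.kiri` format.

```elixir
defmodule SectionSketch do
  # Hypothetical record: left_id (u16), right_id (u16), cost (i16),
  # all little-endian, 6 bytes in total.
  @entry_size 6

  # Read entry `index` directly by offset, without decoding earlier records.
  def entry(section, index) when is_binary(section) and index >= 0 do
    skip = index * @entry_size

    case section do
      <<_::binary-size(skip), left::little-16, right::little-16,
        cost::little-signed-16, _::binary>> ->
        {:ok, %{left_id: left, right_id: right, cost: cost}}

      _ ->
        # The section is too short to contain the requested entry.
        :error
    end
  end
end

# Two records packed back to back.
section =
  <<1::little-16, 2::little-16, -300::little-signed-16,
    7::little-16, 9::little-16, 4500::little-signed-16>>

first = SectionSketch.entry(section, 0)
second = SectionSketch.entry(section, 1)
missing = SectionSketch.entry(section, 2)
```

Matching by offset means a lookup touches only the bytes it needs, which suits large binaries that are shared rather than copied.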

## License

Apache-2.0