
# Kiri (切り)

Japanese morphological analyzer for Elixir, powered by Sudachi dictionaries.

Kiri reads Sudachi-format binary `.dic` dictionaries and produces segmented
morphemes with part-of-speech tags, readings, normalized forms, and synonym
group IDs.

## Installation

Add `kiri` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:kiri, "~> 0.1.0"}
  ]
end
```

You also need a Sudachi dictionary file (`system_core.dic`). Download one from
the [SudachiDict releases](https://github.com/WorksApplications/SudachiDict/releases).

## Usage

```elixir
# Create a tokenizer from a system dictionary
{:ok, tokenizer} = Kiri.create_tokenizer("path/to/system_core.dic")

# Tokenize text (default mode :c — longest units)
morphemes = Kiri.tokenize(tokenizer, "東京都に行った")

# Inspect results
for m <- morphemes do
  IO.puts("#{m.surface}\t#{Enum.join(m.part_of_speech, ",")}\t#{m.normalized_form}")
end
# 東京都   名詞,固有名詞,地名,一般,*,*   東京都
# に       助詞,格助詞,*,*,*,*             に
# 行っ     動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
# た       助動詞,*,*,*,助動詞-タ,終止形-一般            た
```

## Split modes

Kiri supports three split modes matching Sudachi's behavior:

- **`:c`** (default) — longest units / named entities (e.g. `"東京都"`)
- **`:b`** — middle-length units
- **`:a`** — shortest units (e.g. `"東京"`, `"都"`)

```elixir
# Override the split mode for a single call; in mode :a, "東京都" splits into "東京" and "都"
Kiri.tokenize(tokenizer, "東京都", mode: :a)
```

## Options

Pass options as a keyword list in the second argument to `Kiri.create_tokenizer/2`:

| Option                       | Default | Description                                  |
| ---------------------------- | ------- | -------------------------------------------- |
| `:mode`                      | `:c`    | Default split mode                           |
| `:user_dictionaries`         | `[]`    | List of user dictionary file paths           |
| `:prolonged_sound_marks`     | `false` | Collapse repeated prolonged sound marks      |
| `:ignore_yomigana`           | `false` | Strip bracketed readings before tokenization |
| `:disable_normalization`     | `false` | Skip NFKC normalization                      |
| `:disable_numeric_normalize` | `false` | Skip numeric sequence normalization          |
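
As an illustration, the options above might be combined as follows. This is a sketch: the user dictionary path is a placeholder, and only option names from the table are used.

```elixir
# Build a tokenizer that defaults to the shortest split mode, loads one
# user dictionary (hypothetical path), and collapses repeated prolonged
# sound marks before tokenization.
{:ok, tokenizer} =
  Kiri.create_tokenizer("path/to/system_core.dic",
    mode: :a,
    user_dictionaries: ["path/to/user.dic"],
    prolonged_sound_marks: true
  )

# No :mode is given here, so the tokenizer's default (:a) applies.
morphemes = Kiri.tokenize(tokenizer, "東京都に行った")
```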

## Architecture

Kiri uses a Rust NIF (via [Rustler](https://github.com/rustler-magic/rustler))
for dictionary loading and Viterbi lattice search. The input text processing
pipeline (NFKC normalization, character categories, prolonged sound marks) and
path rewrite plugins (katakana OOV joining, numeric normalization) are
implemented in pure Elixir.
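
To give a feel for what the pure Elixir preprocessing stage does, here is a minimal standalone sketch of two of its steps: NFKC normalization and collapsing prolonged sound marks. It is not Kiri's actual implementation. The module name `PreprocessSketch` and its functions are hypothetical, only the mark `ー` (U+30FC) is handled, and the option names are borrowed from the table above.

```elixir
defmodule PreprocessSketch do
  @moduledoc false
  # Hypothetical module for illustration only; not part of Kiri.

  # NFKC folds compatibility variants, e.g. full-width Latin letters
  # and digits become their ASCII equivalents.
  def nfkc(text), do: String.normalize(text, :nfkc)

  # Collapse a run of two or more "ー" into a single one.
  # The /u flag makes the quantifier apply to the whole code point.
  def collapse_prolonged(text), do: String.replace(text, ~r/ー{2,}/u, "ー")

  # Apply the steps in order, honoring two of the options from the table.
  def preprocess(text, opts \\ []) do
    text =
      if Keyword.get(opts, :disable_normalization, false),
        do: text,
        else: nfkc(text)

    if Keyword.get(opts, :prolonged_sound_marks, false),
      do: collapse_prolonged(text),
      else: text
  end
end

PreprocessSketch.preprocess("Ｋｉｒｉ１２３")
# => "Kiri123"
PreprocessSketch.preprocess("すごーーーい", prolonged_sound_marks: true)
# => "すごーい"
```

Keeping steps like these as plain string functions makes each one easy to test on its own and to toggle per tokenizer through options.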

## License

Apache-2.0