# Text.Stemmer
Pre-compiled [Snowball](https://snowballstem.org) stemmers for Elixir.
This package ships 36 stemming algorithms covering a wide range of natural languages, accessible through a single `Text.Stemmer.stem/2` entry point. The stemmers themselves were generated from the canonical Snowball algorithm sources using the [`:snowball`](https://hex.pm/packages/snowball) compiler, which is included as a runtime dependency.
## Installation
Add `:text_stemmer` to your `mix.exs` deps:
```elixir
def deps do
  [
    {:text_stemmer, "~> 0.1"}
  ]
end
```
## Usage
```elixir
iex> Text.Stemmer.stem("generalizations", :en)
"general"
iex> Text.Stemmer.stem("gouvernements", :fr)
"gouvern"
iex> Text.Stemmer.stem_list(["running", "ran", "runs"], :en)
["run", "ran", "run"]
```
Languages are identified by atoms matching their [ISO 639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) two-letter codes, such as `:en` and `:fr`. Algorithm-specific variants use a `<code>_<algorithm>` form: `:en_porter`, `:en_lovins`, `:nl_porter`. See the `Text.Stemmer` moduledoc for the full table.
```elixir
iex> length(Text.Stemmer.supported_languages())
36
```
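As a rough sketch of how these functions might be combined, the hypothetical `StemPipeline` module below tokenizes a string, checks the requested language against `Text.Stemmer.supported_languages/0`, and stems the tokens with `Text.Stemmer.stem_list/2`. The module name, the tokenizing regex, the error tuple, and the assumption that `supported_languages/0` returns a list of atoms are illustrative only and are not part of this package's documented API.

```elixir
defmodule StemPipeline do
  @moduledoc "Hypothetical helper for illustration; not part of Text.Stemmer."

  # Lowercase the text and split it on any run of non-letter characters.
  @spec tokenize(String.t()) :: [String.t()]
  def tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}]+/u, trim: true)
  end

  # Stem every token in `text`, or return an error tuple when the
  # language is not listed by the library.
  @spec stem_text(String.t(), atom()) :: {:ok, [String.t()]} | {:error, term()}
  def stem_text(text, lang) do
    # Assumption: supported_languages/0 returns a list of atoms such as :en.
    if lang in Text.Stemmer.supported_languages() do
      {:ok, Text.Stemmer.stem_list(tokenize(text), lang)}
    else
      {:error, {:unsupported_language, lang}}
    end
  end
end
```

Given the `stem_list/2` result shown in the usage example, `StemPipeline.stem_text("Running, ran; runs!", :en)` would be expected to return `{:ok, ["run", "ran", "run"]}`.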
## Regenerating stemmers
The pre-generated stemmer modules under `lib/text/stemmer/stemmers/` are produced from the `.sbl` algorithm sources vendored in `src/algorithms/` (taken from [snowballstem/snowball](https://github.com/snowballstem/snowball)). To regenerate after editing or updating a source file, run:
```bash
mix snowball.gen --module-prefix Text.Stemmer.Stemmers \
  --output-dir lib/text/stemmer/stemmers
```
The `mix snowball.gen` task is supplied by the [`:snowball`](https://hex.pm/packages/snowball) compiler dependency.
## Compliance testing
Each generated stemmer is verified against the canonical Snowball corpus from [snowballstem/snowball-data](https://github.com/snowballstem/snowball-data), vendored under `test/data/<lang>/` as gzipped `voc.txt`/`output.txt` pairs. Compliance tests are tagged `:compliance` and excluded by default; run them explicitly with:
```bash
mix test --only compliance
```
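The idea behind such a check can be sketched as follows: decompress a vocabulary file and its expected-output file, pair them line by line, and report any word whose stem differs. This is a minimal illustration under stated assumptions, not the package's actual test code; the `ComplianceSketch` module, its function names, and the choice to pass the stemmer in as a function are all hypothetical, and the file paths are left to the caller.

```elixir
defmodule ComplianceSketch do
  @moduledoc "Hypothetical illustration; the package's real tests may differ."

  # Return {word, expected, actual} for every vocabulary word whose
  # stem (as computed by `stem_fun`) differs from the expected output.
  @spec mismatches(Path.t(), Path.t(), (String.t() -> String.t())) ::
          [{String.t(), String.t(), String.t()}]
  def mismatches(voc_path, output_path, stem_fun) do
    words = read_gz_lines(voc_path)
    expected = read_gz_lines(output_path)

    words
    |> Enum.zip(expected)
    |> Enum.map(fn {word, want} -> {word, want, stem_fun.(word)} end)
    |> Enum.reject(fn {_word, want, got} -> want == got end)
  end

  # Read a gzipped text file and return its lines, ignoring the
  # trailing newline at the end of the file.
  defp read_gz_lines(path) do
    path
    |> File.read!()
    |> :zlib.gunzip()
    |> String.trim_trailing("\n")
    |> String.split("\n")
  end
end
```

With the library loaded, a call such as `ComplianceSketch.mismatches(voc, out, &Text.Stemmer.stem(&1, :en))` would return an empty list when every word in the corpus stems as expected, where `voc` and `out` are the paths to one language's gzipped pair.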
The corpus is **not shipped with the Hex package**; it lives in the source tree only. Per-language licensing notes from upstream are preserved in `test/data/<lang>/COPYING` files. The Arabic corpus is GPLv3; the rest are mostly BSD-3-Clause or CC BY-SA. See `test/data/COPYING` for the umbrella terms.
## Documentation
Full API documentation is published at [https://hexdocs.pm/text_stemmer](https://hexdocs.pm/text_stemmer).
## License
Apache-2.0. See [LICENSE.md](https://github.com/kipcole9/text_stemmer/blob/v0.1.0/LICENSE.md).