# Text.Stemmer

Pre-compiled [Snowball](https://snowballstem.org) stemmers for Elixir.

This package ships 36 stemming algorithms covering a wide range of natural languages, accessible through a single `Text.Stemmer.stem/2` entry point. The stemmers themselves were generated from the canonical Snowball algorithm sources using the [`:snowball`](https://hex.pm/packages/snowball) compiler, which is included as a runtime dependency.

## Installation

Add `:text_stemmer` to your `mix.exs` deps:

```elixir
def deps do
  [
    {:text_stemmer, "~> 0.1"}
  ]
end
```

## Usage

```elixir
iex> Text.Stemmer.stem("generalizations", :en)
"general"

iex> Text.Stemmer.stem("gouvernements", :fr)
"gouvern"

iex> Text.Stemmer.stem_list(["running", "ran", "runs"], :en)
["run", "ran", "run"]
```

Languages are identified by their [ISO 639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) two-letter code. Algorithm-specific variants use a `<code>_<algorithm>` form: `:en_porter`, `:en_lovins`, `:nl_porter`. See the `Text.Stemmer` moduledoc for the full table.

```elixir
iex> length(Text.Stemmer.supported_languages())
36
```
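This README does not say what `Text.Stemmer.stem/2` does when given a language code outside that list, so callers handling user-supplied codes may want to check first. The following is a minimal sketch of such a guard, assuming `supported_languages/0` returns a list of language atoms such as `:en`. `SafeStemSketch` is a hypothetical helper and not part of the package.

```elixir
defmodule SafeStemSketch do
  # Hypothetical wrapper, not part of Text.Stemmer.
  # The stemmer module is passed in (defaulting to Text.Stemmer) so the
  # guard can also be exercised with a stand-in module.
  def stem(word, lang, stemmer \\ Text.Stemmer) do
    # Assumption: supported_languages/0 returns a list of language atoms.
    if lang in stemmer.supported_languages() do
      {:ok, stemmer.stem(word, lang)}
    else
      {:error, {:unsupported_language, lang}}
    end
  end
end
```

With this in place, `SafeStemSketch.stem(word, :en_porter)` would return a tagged tuple rather than relying on whatever `stem/2` itself does for unknown codes.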

## Regenerating stemmers

The pre-generated stemmer modules under `lib/text/stemmer/stemmers/` are produced from the `.sbl` algorithm sources vendored in `src/algorithms/` (taken from [snowballstem/snowball](https://github.com/snowballstem/snowball)). To regenerate after editing or updating a source file, run:

```bash
mix snowball.gen --module-prefix Text.Stemmer.Stemmers \
                 --output-dir lib/text/stemmer/stemmers
```

The `mix snowball.gen` task is supplied by the [`:snowball`](https://hex.pm/packages/snowball) compiler dependency.

## Compliance testing

Each generated stemmer is verified against the canonical Snowball corpus from [snowballstem/snowball-data](https://github.com/snowballstem/snowball-data), vendored under `test/data/<lang>/` as gzipped `voc.txt`/`output.txt` pairs. Compliance tests are tagged `:compliance` and excluded by default; run them explicitly with:

```bash
mix test --only compliance
```

The corpus is **not shipped with the Hex package**; it lives in the source tree only. Per-language licensing notes from upstream are preserved in `test/data/<lang>/COPYING` files. The Arabic corpus is GPLv3; the rest are mostly BSD-3-Clause or CC BY-SA. See `test/data/COPYING` for the umbrella terms.
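The comparison behind a compliance check can be sketched as follows. In the upstream corpus, `voc.txt` lists one input word per line and `output.txt` lists the expected stem on the matching line. `ComplianceSketch` is a hypothetical helper, not the package's actual test code, and it takes the gzipped file contents as binaries so that no file names need to be assumed.

```elixir
defmodule ComplianceSketch do
  # Hypothetical helper, not the package's actual test code.

  # Takes the gzipped contents of a vocabulary file and of the matching
  # expected-output file, plus a one-argument stemming function.
  # Returns {word, expected, actual} for every word whose stem differs.
  def mismatches(voc_gz, output_gz, stem_fun) do
    words = lines(voc_gz)
    expected = lines(output_gz)

    words
    |> Enum.zip(expected)
    |> Enum.map(fn {word, want} -> {word, want, stem_fun.(word)} end)
    |> Enum.reject(fn {_word, want, got} -> want == got end)
  end

  # Decompress a gzip binary and split it into non-empty lines.
  defp lines(gz) do
    gz
    |> :zlib.gunzip()
    |> String.split("\n", trim: true)
  end
end
```

To try it against a real corpus pair, read both files with `File.read!/1` and pass something like `&Text.Stemmer.stem(&1, :en)` as the stemming function. An empty list means every word matched. Note that `Enum.zip/2` stops at the shorter list, so a stricter check would also compare the two line counts.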

## Documentation

Full API documentation is published at [https://hexdocs.pm/text_stemmer](https://hexdocs.pm/text_stemmer).

## License

Apache-2.0. See [LICENSE.md](https://github.com/kipcole9/text_stemmer/blob/v0.1.0/LICENSE.md).