# motif

Pure Erlang keyword and topic extraction using the
[RAKE](https://www.researchgate.net/publication/227988510) algorithm.
Supports French, English, and German with built-in stop-word lists.
No external dependencies.

## Installation

```erlang
%% rebar.config
{deps, [{motif, "0.1.0"}]}.
```

## Quick start

```erlang
%% Extract from English text (language auto-detected)
Results = motif:extract(<<"Red roses are a symbol of love and beauty.">>),
%% [{<<"red roses">>, 4.0}, {<<"symbol">>, 1.0}, {<<"love">>, 1.0}, {<<"beauty">>, 1.0}]

%% Explicit language and a cap on the number of results (Text is any binary())
Top3 = motif:extract(Text, #{lang => fr, max => 3}),

%% Auto-detect language (samples first 200 words)
Auto = motif:extract(Text, #{lang => auto}),

%% Get the stop-word list for a language
Stops = motif:stop_words(fr).
```

## API

```erlang
%% Extract keyword candidates. Returns [{Keyword, Score}] sorted by score desc.
-spec extract(binary()) -> [{binary(), float()}].
-spec extract(binary(), #{max  => pos_integer(),
                           lang => fr | en | de | auto}) -> [{binary(), float()}].

%% Return the built-in stop-word list for a language.
-spec stop_words(fr | en | de) -> [binary()].
```
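
Since results come back as a plain list of `{Keyword, Score}` pairs, post-processing needs nothing beyond list comprehensions. The helper module below is a hypothetical sketch (`motif_results_sketch`, `at_least/2`, and `keywords/1` are illustrative names, not part of motif) that works on any list with the documented shape:

```erlang
%% Hypothetical helpers, not part of motif's API.
-module(motif_results_sketch).
-export([at_least/2, keywords/1]).

%% Keep only the pairs whose score is at least Min.
-spec at_least([{binary(), float()}], number()) -> [{binary(), float()}].
at_least(Results, Min) ->
    [{K, S} || {K, S} <- Results, S >= Min].

%% Drop the scores, keeping keywords in their existing order.
-spec keywords([{binary(), float()}]) -> [binary()].
keywords(Results) ->
    [K || {K, _Score} <- Results].
```

For example, `motif_results_sketch:keywords(motif_results_sketch:at_least(Results, 2.0))` would keep only the keywords scoring 2.0 or more.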

## Algorithm

RAKE (Rapid Automatic Keyword Extraction):

1. Split text into sentences on `. ! ?`
2. Within each sentence, split into candidate phrases on stop words
3. Score each word: `degree(word) / frequency(word)`,
   where `degree(word)` is the summed length (in words) of every candidate
   phrase containing `word`, and `frequency(word)` is how often it occurs
4. Score each candidate: sum of its word scores
5. Deduplicate the candidates and return them sorted by score, descending

Long phrases score highest: a word's score grows with the length of the
phrases it appears in, and a phrase's score is the sum over its words.
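
The five steps can be sketched in a few dozen lines of Erlang. This is an illustration of the algorithm as described here, not motif's actual source: the module name `rake_sketch`, the explicit stop-word argument, and the choice of word separators are all assumptions, and ties are returned in no particular order.

```erlang
%% Hypothetical module, not part of motif: a minimal RAKE sketch.
-module(rake_sketch).
-export([extract/2]).

%% Text is a binary; Stops is a list of lowercase stop-word binaries.
-spec extract(binary(), [binary()]) -> [{binary(), float()}].
extract(Text, Stops) ->
    StopSet = sets:from_list(Stops),
    %% Step 1: lowercase, then split into sentences on . ! ?
    Sentences = re:split(string:lowercase(Text), <<"[.!?]">>, [{return, binary}]),
    %% Step 2: cut each sentence into candidate phrases at stop words.
    Phrases = lists:flatmap(fun(S) -> phrases(S, StopSet) end, Sentences),
    %% Step 3: per-word score = degree / frequency.
    WordScores = word_scores(Phrases),
    %% Step 4: phrase score = sum of word scores; the map deduplicates.
    Scored = maps:from_list([{join(P), phrase_score(P, WordScores)} || P <- Phrases]),
    %% Step 5: sort by score, highest first.
    lists:sort(fun({_, A}, {_, B}) -> A >= B end, maps:to_list(Scored)).

%% Tokenise on whitespace and basic punctuation (an assumed separator set).
phrases(Sentence, StopSet) ->
    Words = string:lexemes(Sentence, " \t\n,;:"),
    split_on_stops(Words, StopSet, [], []).

split_on_stops([], _Stops, Cur, Acc) ->
    lists:reverse(flush(Cur, Acc));
split_on_stops([W | Rest], Stops, Cur, Acc) ->
    case sets:is_element(W, Stops) of
        true  -> split_on_stops(Rest, Stops, [], flush(Cur, Acc));
        false -> split_on_stops(Rest, Stops, [W | Cur], Acc)
    end.

%% Close the phrase being built, if it has any words.
flush([], Acc) -> Acc;
flush(Cur, Acc) -> [lists:reverse(Cur) | Acc].

%% degree(w) = summed length of phrases containing w; frequency(w) = count.
word_scores(Phrases) ->
    Add = fun(N) -> fun(V) -> V + N end end,
    {Freq, Deg} = lists:foldl(
        fun(P, Acc) ->
            Len = length(P),
            lists:foldl(
                fun(W, {F, D}) ->
                    {maps:update_with(W, Add(1), 1, F),
                     maps:update_with(W, Add(Len), Len, D)}
                end, Acc, P)
        end, {#{}, #{}}, Phrases),
    maps:map(fun(W, D) -> D / maps:get(W, Freq) end, Deg).

phrase_score(P, WordScores) ->
    lists:sum([maps:get(W, WordScores) || W <- P]).

join(P) ->
    iolist_to_binary(lists:join(<<" ">>, P)).
```

With the stop words `are`, `a`, `of`, and `and`, the sentence "Red roses are a symbol of love and beauty." yields `red roses` at 4.0 (each of its two words has degree 2 and frequency 1) and the three single-word candidates at 1.0.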

## Language detection

`lang => auto` samples the first 200 words, counts stop-word hits per
language, and picks the language with the most hits. Falls back to `en`
on a tie or empty input.
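
A stop-word vote of this kind might look like the sketch below. It is an assumption-laden illustration rather than motif's implementation: `lang_detect_sketch` is a hypothetical module, the stop-word lists are passed in as a map instead of coming from `motif:stop_words/1`, and the tokeniser's separator set is a guess.

```erlang
%% Hypothetical module, not part of motif: stop-word-vote language detection.
-module(lang_detect_sketch).
-export([detect/2]).

%% StopLists maps a language atom to its list of lowercase stop-word binaries.
-spec detect(binary(), #{atom() => [binary()]}) -> atom().
detect(Text, StopLists) ->
    %% Sample at most the first 200 words.
    AllWords = string:lexemes(string:lowercase(Text), " \t\n.,;:!?"),
    Words = lists:sublist(AllWords, 200),
    %% Count stop-word hits per language.
    Hits = [{count_hits(Words, sets:from_list(Stops)), Lang}
            || {Lang, Stops} <- maps:to_list(StopLists)],
    %% Highest count first; a strict winner is required.
    case lists:reverse(lists:sort(Hits)) of
        [{Best, Lang}] when Best > 0 -> Lang;
        [{Best, Lang}, {Second, _} | _] when Best > Second -> Lang;
        _ -> en  %% tie, no hits, or empty input
    end.

count_hits(Words, StopSet) ->
    length([W || W <- Words, sets:is_element(W, StopSet)]).
```

Requiring a strict winner keeps the fallback rule in one place: empty input produces zero hits for every language, which is just another tie.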

## License

Apache 2.0 — see [LICENSE](LICENSE).