# ExNLP

A comprehensive Natural Language Processing library for Elixir, providing tokenization, stemming, ranking algorithms, similarity metrics, and text analysis tools. Inspired by Python's NLTK and designed with idiomatic Elixir patterns.

## Features

- **🔤 Tokenization**: Multiple tokenizers (standard, whitespace, regex, n-gram, keyword) with an NLTK-inspired API
- **✂️ Stemming**: Snowball stemmers for 7 languages (English, Spanish, Portuguese, French, German, Italian, Polish)
- **📊 Ranking Algorithms**: TF-IDF and BM25 implementations for document ranking and search
- **🔍 Similarity Metrics**: Levenshtein, Jaccard, Dice, Jaro-Winkler, Hamming distance, and more
- **🚫 Stopwords**: Built-in stopword lists for 30+ languages
- **🔧 Text Filtering**: Case conversion, length filtering, pattern replacement, and stopword removal
- **📈 Statistics**: Term frequency, document frequency, corpus-level statistics
- **🔗 Co-occurrence Analysis**: Term co-occurrence matrices and analysis
- **📝 N-grams**: Character and word n-gram generation
- **🎯 Idiomatic Elixir**: Clean, functional code following Elixir best practices

## Installation

Add `ex_nlp` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:ex_nlp, "~> 0.1.0"}
  ]
end
```

Then run `mix deps.get`.

## Quick Start

### Tokenization

```elixir
# Simple word tokenization
iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]

# Get tokens with position and offset information
iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]

# Custom regex tokenizer
iex> ExNlp.Tokenizer.regexp_tokenize("abc123 def456", "\\d+")
["123", "456"]
```

### Stemming

```elixir
# Stem words in multiple languages
iex> ExNlp.Snowball.stem("running", :english)
"run"

iex> ExNlp.Snowball.stem("caminando", :spanish)
"camin"

iex> ExNlp.Snowball.stem_words(["running", "jumping", "beautiful"], :english)
["run", "jump", "beauti"]

# Check supported languages
iex> ExNlp.Snowball.supported_languages()
[:english, :spanish, :portuguese, :french, :german, :italian, :polish]
```

### Ranking Algorithms

#### TF-IDF

```elixir
# Calculate TF-IDF score for a term in a document
iex> documents = ["The quick brown fox", "A brown dog"]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.5108256237659907

# With preprocessing options
iex> ExNlp.Ranking.TfIdf.calculate("running", "The runner is running fast", documents,
...>   stem: true, language: :english, remove_stopwords: true)
0.6931471805599453
```
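TF-IDF itself is just term frequency multiplied by inverse document frequency. Here is a minimal pure-Elixir sketch of the idea — not ExNlp's implementation, whose IDF smoothing evidently differs, as the scores above show:

```elixir
defmodule TfIdfSketch do
  # Raw term count in a tokenized document.
  def tf(term, tokens), do: Enum.count(tokens, &(&1 == term))

  # Natural-log IDF, ln(N / df); assumes the term occurs somewhere in the corpus.
  def idf(term, corpus) do
    df = Enum.count(corpus, fn doc -> term in doc end)
    :math.log(length(corpus) / df)
  end

  def tf_idf(term, tokens, corpus), do: tf(term, tokens) * idf(term, corpus)
end

corpus = [~w(the quick brown fox), ~w(a brown dog)]
TfIdfSketch.tf_idf("fox", ~w(the quick brown fox), corpus)
# => 0.6931471805599453  (1 occurrence × ln(2/1))
```

Terms that appear in every document get an IDF of `ln(1) = 0`, which is why common words contribute nothing to the score.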

#### BM25

```elixir
# Score documents against a query
iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"])
[1.8455076734299591, 1.0126973514850315]

# Tune with custom parameters (exact scores depend on k1, b, and preprocessing)
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"],
...>   k1: 1.5, b: 0.75, stem: true, language: :english)
```
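For intuition, the Okapi BM25 formula fits in a few lines of plain Elixir. This sketch uses the common `log((N - df + 0.5) / (df + 0.5) + 1)` IDF variant, so its absolute scores will not match ExNlp's:

```elixir
defmodule Bm25Sketch do
  # Okapi BM25 score of one tokenized document against a query.
  # k1 controls term-frequency saturation; b controls length normalization.
  def score(doc, docs, query, k1 \\ 1.2, b \\ 0.75) do
    n = length(docs)
    avgdl = Enum.sum(Enum.map(docs, &length/1)) / n

    Enum.reduce(query, 0.0, fn term, acc ->
      tf = Enum.count(doc, &(&1 == term))
      df = Enum.count(docs, fn d -> term in d end)
      idf = :math.log((n - df + 0.5) / (df + 0.5) + 1)
      norm = tf + k1 * (1 - b + b * length(doc) / avgdl)
      acc + idf * tf * (k1 + 1) / norm
    end)
  end
end

docs = [~w(bm25 is a ranking function), ~w(used by search engines)]
Bm25Sketch.score(hd(docs), docs, ~w(ranking search))
```

Query terms absent from the document contribute zero, and longer-than-average documents are penalized through the `norm` term.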

### Similarity Metrics

```elixir
# Levenshtein distance
iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3

# Levenshtein similarity (normalized)
iex> ExNlp.Similarity.levenshtein_similarity("kitten", "sitting")
0.5714285714285714

# Jaccard similarity
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333

# Jaro-Winkler similarity
iex> ExNlp.Similarity.jaro_winkler_similarity("martha", "marhta")
0.9611111111111111

# Dice coefficient
iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "bird"])
0.5
```
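Levenshtein distance is the classic row-by-row dynamic program. A self-contained sketch in plain Elixir (ExNlp's implementation may differ in details):

```elixir
defmodule LevSketch do
  # prev_row holds edit distances for the prefix of `a` one character shorter;
  # each new cell is the min of insert, delete, and substitute costs.
  def distance(a, b) do
    bl = String.graphemes(b)
    initial = Enum.to_list(0..length(bl))

    a
    |> String.graphemes()
    |> Enum.with_index(1)
    |> Enum.reduce(initial, fn {ca, i}, prev_row -> next_row(ca, bl, prev_row, i) end)
    |> List.last()
  end

  defp next_row(ca, bl, prev_row, i) do
    {rev, _} =
      bl
      |> Enum.zip(Enum.zip(prev_row, tl(prev_row)))
      |> Enum.reduce({[i], i}, fn {cb, {diag, up}}, {acc, left} ->
        cost = if ca == cb, do: 0, else: 1
        val = Enum.min([up + 1, left + 1, diag + cost])
        {[val | acc], val}
      end)

    Enum.reverse(rev)
  end
end

LevSketch.distance("kitten", "sitting")
# => 3
```

Keeping only two rows at a time makes the memory cost linear in the length of `b` rather than quadratic.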

### Stopwords

```elixir
# Check if a word is a stopword
iex> ExNlp.Stopwords.is_stopword?("the", :english)
true

# Remove stopwords from a list
iex> words = ["the", "quick", "brown", "fox"]
iex> ExNlp.Stopwords.remove(words, :english)
["quick", "brown", "fox"]

# Get list of stopwords
iex> ExNlp.Stopwords.list(:english)
["a", "all", "and", "as", "at", ...]
```
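Internally, a stopword check is just set membership. The same effect can be sketched with Elixir's `MapSet`, here with a tiny hand-rolled list rather than ExNlp's bundled ones:

```elixir
# Hypothetical mini stopword list; ExNlp ships full per-language lists.
stopwords = MapSet.new(~w(the a an and of on))

~w(the quick brown fox)
|> Enum.reject(&MapSet.member?(stopwords, &1))
# => ["quick", "brown", "fox"]
```

A `MapSet` gives effectively constant-time lookups, which matters when filtering large token streams.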

### Text Filtering

```elixir
# Build a filtering pipeline
iex> tokens = ExNlp.Tokenizer.tokenize("The Quick Brown Fox")
iex> tokens
...> |> ExNlp.Filter.lowercase()
...> |> ExNlp.Filter.stop_words(:english)
...> |> ExNlp.Filter.min_length(3)
[
  %ExNlp.Token{text: "quick", ...},
  %ExNlp.Token{text: "brown", ...},
  %ExNlp.Token{text: "fox", ...}
]
```

### N-grams

```elixir
# Character n-grams
iex> ExNlp.Ngram.char_ngrams("hello", 2)
["he", "el", "ll", "lo"]

# Word n-grams
iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown", "fox"], 2)
[["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
```
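Both variants are sliding windows; in plain Elixir the same results fall out of `Enum.chunk_every/4` with a step of 1 (a sketch of the idea, not ExNlp's code):

```elixir
# Character bigrams: windows of 2 graphemes, advancing 1 at a time;
# :discard drops the trailing incomplete window.
"hello"
|> String.graphemes()
|> Enum.chunk_every(2, 1, :discard)
|> Enum.map(&Enum.join/1)
# => ["he", "el", "ll", "lo"]

# Word bigrams are the same window over a token list.
Enum.chunk_every(~w(the quick brown fox), 2, 1, :discard)
# => [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
```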

### Statistics

```elixir
# Term frequency in a document
iex> ExNlp.Statistics.term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])
1

# Document frequency in a corpus
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]
iex> ExNlp.Statistics.document_frequency("cat", corpus)
2

# Most frequent terms
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> ExNlp.Statistics.most_frequent(corpus, 2)
[{"cat", 3}, {"dog", 2}]
```
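Corpus-level counts like `most_frequent/2` boil down to `Enum.frequencies/1`; a plain-Elixir sketch of the same result:

```elixir
corpus = [~w(cat dog), ~w(cat cat), ~w(dog)]

corpus
|> List.flatten()            # pool all tokens across documents
|> Enum.frequencies()        # %{"cat" => 3, "dog" => 2}
|> Enum.sort_by(fn {_term, count} -> -count end)
|> Enum.take(2)
# => [{"cat", 3}, {"dog", 2}]
```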

### Co-occurrence Analysis

```elixir
# Build co-occurrence matrix
iex> corpus = [["cat", "dog"], ["cat", "bird", "dog"], ["bird"]]
iex> matrix = ExNlp.Cooccurrence.cooccurrence_matrix(corpus)
iex> matrix["cat"]["dog"]
2

# Find co-occurring terms
iex> ExNlp.Cooccurrence.cooccurring_terms("cat", corpus, 2)
[{"dog", 2}, {"bird", 1}]
```
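A co-occurrence matrix is a nested count over term pairs within each document. One way to sketch it in plain Elixir (again, not ExNlp's implementation):

```elixir
defmodule CoocSketch do
  # For each document, count every ordered pair of distinct terms;
  # duplicate terms within a document are collapsed first.
  def matrix(corpus) do
    for doc <- corpus,
        a <- Enum.uniq(doc),
        b <- Enum.uniq(doc),
        a != b,
        reduce: %{} do
      acc -> update_in(acc, [Access.key(a, %{}), Access.key(b, 0)], &(&1 + 1))
    end
  end
end

corpus = [~w(cat dog), ~w(cat bird dog), ~w(bird)]
CoocSketch.matrix(corpus)["cat"]["dog"]
# => 2
```

`Access.key/2` supplies the `%{}` and `0` defaults, so the nested map grows on demand; a single-term document contributes no pairs.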

## Supported Languages

### Stemming

- **English** - Porter2 algorithm (the revised Porter stemmer)
- **Spanish**, **Portuguese**, **French**, **German**, **Italian**, **Polish** - Snowball stemmers

### Stopwords

Stopword lists are available for 30+ languages including: English, Spanish, Portuguese, French, German, Italian, Polish, Russian, Dutch, Swedish, Norwegian, Danish, Finnish, Turkish, Arabic, Chinese, and more. See `priv/stopwords/` for the complete list.

## Architecture

The library is organized into logical modules:

- **`ExNlp.Tokenizer`** - Text tokenization with multiple strategies
- **`ExNlp.Snowball`** - Word stemming algorithms
- **`ExNlp.Ranking`** - Document ranking (TF-IDF, BM25)
- **`ExNlp.Similarity`** - String and set similarity metrics
- **`ExNlp.Stopwords`** - Stopword detection and filtering
- **`ExNlp.Filter`** - Token filtering and transformation
- **`ExNlp.Statistics`** - Text and corpus statistics
- **`ExNlp.Cooccurrence`** - Term co-occurrence analysis
- **`ExNlp.Ngram`** - N-gram generation

## Performance

The library includes benchmark suites for critical operations. Run benchmarks with:

```bash
mix run benchmarks/tokenizer_bench.exs
mix run benchmarks/similarity_bench.exs
mix run benchmarks/ranking_bench.exs
```

## Testing

Run the test suite with:

```bash
mix test
```

## Documentation

Generate documentation with:

```bash
mix docs
```

## Contributing

Contributions are welcome! This library aims to be a comprehensive NLP toolkit for Elixir. Areas for contribution:

- Additional language support for stemming
- More stopword lists
- Additional similarity metrics
- Performance optimizations
- Documentation improvements

## Credits

- Stemming algorithms based on the [Snowball Stemming Algorithms](http://snowballstem.org/)
- Inspired by Python's [NLTK](https://www.nltk.org/) and [spaCy](https://spacy.io/)
- Stopword lists compiled from various open sources

## License

MIT License - see [LICENSE](LICENSE) file for details.