# Rake

[![Hex.pm](https://img.shields.io/hexpm/v/rake.svg)](https://hex.pm/packages/rake)
[![Docs](https://img.shields.io/badge/hex-docs-blue.svg)](https://hexdocs.pm/rake)

**RAKE (Rapid Automatic Keyword Extraction)** for Elixir.

Extract keywords from documents using word co-occurrence patterns. RAKE is an unsupervised, domain-independent, and language-independent algorithm that requires no training data.

Based on the paper: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). *Automatic Keyword Extraction from Individual Documents*. In Text Mining: Applications and Theory. John Wiley & Sons.

## Features

- **Zero dependencies** - Pure Elixir implementation
- **No training required** - Works on individual documents without a corpus
- **Language independent** - Configure stop words for any language
- **Multiple scoring metrics** - Choose between frequency, degree, or degree/frequency ratio
- **Configurable** - Adjust minimum word length, maximum phrase length, and more
- **Graph inspection** - Access the word co-occurrence graph for analysis

## Installation

Add `rake` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:rake, "~> 0.1.0"}
  ]
end
```

## Quick Start

```elixir
text = """
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict
inequations, and nonstrict inequations are considered.
"""

Rake.extract(text)
# => [
#   %{keyword: "linear Diophantine equations", score: 8.5, words: ["linear", "diophantine", "equations"]},
#   %{keyword: "linear constraints", score: 4.5, words: ["linear", "constraints"]},
#   %{keyword: "nonstrict inequations", score: 4.0, words: ["nonstrict", "inequations"]},
#   %{keyword: "strict inequations", score: 4.0, words: ["strict", "inequations"]},
#   ...
# ]
```

## How RAKE Works

RAKE identifies keywords through a simple but effective process:

1. **Split text into candidate keywords** - The text is divided at stop words (like "the", "and", "of") and phrase delimiters (punctuation). The sequences of words between these boundaries become candidate keywords.

2. **Build a word co-occurrence graph** - For each candidate keyword, RAKE tracks which words appear together. Words that frequently co-occur in candidates are likely part of important phrases.

3. **Score words and candidates** - Each word gets a score based on its frequency and degree (number of co-occurrences). Candidate keywords are scored as the sum of their word scores.

4. **Return top keywords** - Candidates are ranked by score. By default, RAKE returns the top T, where T is one third of the number of unique words in the co-occurrence graph.
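
The four steps above can be sketched in a few dozen lines of plain Elixir. This is a minimal illustration under stated assumptions, not the library's implementation: the `RakeSketch` module, its tiny stop list, the delimiter regex, and the integer-division rounding of T are all hypothetical choices made for the example.

```elixir
defmodule RakeSketch do
  # Hypothetical module for illustration only; not part of the rake package.

  # Assumed, deliberately tiny stop list.
  @stop_words MapSet.new(~w(the and of a in is for to over are))

  # Step 1: split at phrase delimiters, then at stop words.
  def candidates(text, stop_words \\ @stop_words) do
    text
    |> String.downcase()
    |> String.split(~r/[.,;:!?()]+/, trim: true)
    |> Enum.flat_map(fn fragment ->
      fragment
      |> String.split(~r/\s+/, trim: true)
      |> Enum.chunk_by(&MapSet.member?(stop_words, &1))
      |> Enum.reject(fn [word | _] -> MapSet.member?(stop_words, word) end)
    end)
  end

  # Step 2: freq counts occurrences of a word across candidates;
  # deg sums the lengths of the candidates it occurs in (self included).
  def word_stats(candidates) do
    for phrase <- candidates, word <- phrase, reduce: %{} do
      acc ->
        Map.update(acc, word, %{freq: 1, deg: length(phrase)}, fn s ->
          %{freq: s.freq + 1, deg: s.deg + length(phrase)}
        end)
    end
  end

  # Step 3: a candidate's score is the sum of deg/freq over its words.
  def score(candidates, stats) do
    candidates
    |> Enum.uniq()
    |> Enum.map(fn phrase ->
      total = phrase |> Enum.map(fn w -> stats[w].deg / stats[w].freq end) |> Enum.sum()
      %{keyword: Enum.join(phrase, " "), score: total}
    end)
    |> Enum.sort_by(& &1.score, :desc)
  end

  # Step 4: keep the top T candidates, T = unique words / 3 (rounded down here).
  def extract(text) do
    cands = candidates(text)
    stats = word_stats(cands)
    top = max(div(map_size(stats), 3), 1)
    cands |> score(stats) |> Enum.take(top)
  end
end

text =
  "Compatibility of systems of linear constraints over the set of natural numbers. " <>
    "Criteria of compatibility of a system of linear Diophantine equations, " <>
    "strict inequations, and nonstrict inequations are considered."

[best | _] = RakeSketch.extract(text)
# best.keyword => "linear diophantine equations"
# best.score   => 8.5
```

Here "linear" occurs in a two-word and a three-word candidate, so its score is 5 / 2 = 2.5, while "diophantine" and "equations" each score 3 / 1 = 3.0, giving 8.5 for the phrase.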

## Usage

### Basic Extraction

```elixir
# Uses default English stop words
keywords = Rake.extract("Your document text here...")
```

### Custom Stop Words

```elixir
# For domain-specific text or other languages
stop_words = ~w(el la los las un una de en y que)
keywords = Rake.extract(spanish_text, stop_words: stop_words)
```

### Scoring Metrics

RAKE supports three scoring metrics:

| Metric | Word score | Effect |
|--------|------------|--------|
| `:deg_freq` | degree(word) / frequency(word) | **Default.** Favors words that occur predominantly in longer phrases |
| `:deg` | degree(word) | Favors words that occur often and in longer phrases |
| `:freq` | frequency(word) | Favors frequently occurring words, regardless of phrase length |

```elixir
# Use degree scoring (favors words in longer phrases)
Rake.extract(text, score_metric: :deg)

# Use frequency scoring
Rake.extract(text, score_metric: :freq)
```
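
To make the difference concrete, here is a small worked example of the three word scores for a single word. `MetricSketch` is a hypothetical helper written for this README, not a function exported by the library; it only assumes the `%{freq: ..., deg: ...}` shape of word statistics.

```elixir
defmodule MetricSketch do
  # Hypothetical helper mirroring the three metrics in the table above.
  def word_score(%{freq: freq}, :freq), do: freq
  def word_score(%{deg: deg}, :deg), do: deg
  def word_score(%{freq: freq, deg: deg}, :deg_freq), do: deg / freq
end

# A word seen once in a 2-word candidate and once in a 3-word candidate.
stats = %{freq: 2, deg: 5}

MetricSketch.word_score(stats, :freq)      # => 2
MetricSketch.word_score(stats, :deg)       # => 5
MetricSketch.word_score(stats, :deg_freq)  # => 2.5
```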

### Controlling Output

```elixir
# Return exactly 10 keywords
Rake.extract(text, top: 10)

# Return ALL scored candidates (no limit)
Rake.extract(text, top: :all)
```

### Filtering

```elixir
# Only include words with 3+ characters
Rake.extract(text, min_word_length: 3)

# Limit phrases to 4 words maximum
Rake.extract(text, max_words: 4)
```

### Adjoining Keywords

RAKE can detect keywords that frequently appear adjacent to each other (with stop words between) and combine them:

```elixir
# "axis" + "of" + "evil" -> "axis of evil"
Rake.extract(text, adjoining: true)
```

A combined keyword is created only if the pair appears adjacent at least twice in the document.
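
The idea can be sketched as follows: within each phrase fragment, look for a keyword, a run of stop words, and another keyword in sequence, then keep the combinations seen at least twice. `AdjoiningSketch`, its stop list, and the delimiter regex are assumptions made for illustration; the library's own detection may differ in detail.

```elixir
defmodule AdjoiningSketch do
  # Hypothetical module for illustration only.
  @stop_words MapSet.new(~w(the of a and is was in to))

  # Returns the "keyword stop-words keyword" phrases seen at least min_count times.
  def adjoining(text, stop_words \\ @stop_words, min_count \\ 2) do
    text
    |> String.downcase()
    |> String.split(~r/[.,;:!?()]+/, trim: true)
    |> Enum.flat_map(&triples(&1, stop_words))
    |> Enum.frequencies()
    |> Enum.filter(fn {_phrase, count} -> count >= min_count end)
    |> Enum.map(fn {phrase, _count} -> phrase end)
  end

  # Chunks alternate keyword / stop-word runs; a window of three chunks
  # stepping by two always starts and ends on a keyword chunk.
  defp triples(fragment, stop_words) do
    fragment
    |> String.split(~r/\s+/, trim: true)
    |> Enum.chunk_by(&MapSet.member?(stop_words, &1))
    |> Enum.drop_while(fn [word | _] -> MapSet.member?(stop_words, word) end)
    |> Enum.chunk_every(3, 2, :discard)
    |> Enum.map(fn parts -> parts |> List.flatten() |> Enum.join(" ") end)
  end
end

text = "The axis of evil was named in a speech. Critics rejected the axis of evil."

AdjoiningSketch.adjoining(text)
# => ["axis of evil"]
```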

### Inspecting the Word Graph

For debugging or analysis, you can access the word co-occurrence graph:

```elixir
{keywords, graph} = Rake.extract_with_graph(text)

# Check statistics for a specific word
graph.words["algorithm"]
# => %{freq: 3, deg: 7}

# freq = how many times the word appears in candidates
# deg = sum of co-occurrences (how connected the word is)
```
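
As a sketch of what such analysis might look like, the snippet below ranks words by their degree/frequency ratio. It uses a hand-written map as a stand-in for `graph.words`, with made-up values, and assumes that `graph.words` is a plain map from word to `%{freq: ..., deg: ...}`.

```elixir
# Stand-in for graph.words; the numbers are invented for illustration.
words = %{
  "algorithm" => %{freq: 3, deg: 7},
  "keyword" => %{freq: 2, deg: 6},
  "data" => %{freq: 4, deg: 4}
}

# Rank words by deg/freq, highest first.
ranked =
  words
  |> Enum.map(fn {word, %{freq: freq, deg: deg}} -> {word, deg / freq} end)
  |> Enum.sort_by(fn {_word, ratio} -> ratio end, :desc)

order = Enum.map(ranked, fn {word, _ratio} -> word end)
# => ["keyword", "algorithm", "data"]
```

A word like "data" with deg equal to freq only ever appears alone, which is why it ranks last here.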

## Options Reference

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `:stop_words` | list | `Rake.StopWords.english()` | Words that split candidate keywords |
| `:score_metric` | atom | `:deg_freq` | One of `:freq`, `:deg`, or `:deg_freq` |
| `:top` | integer or `:all` | T = unique words / 3 | Number of keywords to return |
| `:min_word_length` | integer | 1 | Minimum characters per word |
| `:max_words` | integer | unlimited | Maximum words per keyword phrase |
| `:adjoining` | boolean | false | Detect adjoining keyword pairs |

## Default Stop Words

`Rake.StopWords.english()` provides ~100 common English stop words derived from the original RAKE paper's keyword adjacency stoplist:

```elixir
Rake.StopWords.english()
# => ["the", "and", "of", "a", "in", "is", "for", "to", ...]
```

For other languages or domains, provide your own stop word list.
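
If your stop words live in a text file with one word per line, a small parser can turn them into the list that `:stop_words` expects. The sketch below is hypothetical (`StopListSketch` is not part of the library), and it assumes a format where lines starting with `#` are comments. Lowercasing the entries is also an assumption; check how the library compares words before relying on it.

```elixir
defmodule StopListSketch do
  # Hypothetical helper: one word per line, "#" lines are comments.
  def parse(raw) do
    raw
    |> String.split("\n", trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == "" or String.starts_with?(&1, "#")))
    |> Enum.map(&String.downcase/1)
    |> Enum.uniq()
  end
end

raw = """
# articles
der
die
das
Der
"""

german = StopListSketch.parse(raw)
# => ["der", "die", "das"]

# The result can then be passed along, e.g.:
# Rake.extract(german_text, stop_words: german)
```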

## Performance

RAKE is designed for efficiency:

- **Single pass scoring** - Unlike iterative algorithms (e.g., TextRank), RAKE scores keywords in one pass
- **No external dependencies** - Pure Elixir, no NIFs or external services
- **Scales linearly** - Processing time grows linearly with document size

The original paper reports RAKE extracting keywords from 500 abstracts in 160 ms, versus 1002 ms for TextRank. These figures describe the paper's implementation, not this library.

## Examples

### Scientific Abstract

```elixir
abstract = """
Machine learning models require large amounts of labeled training data.
Active learning reduces labeling costs by selecting the most informative
samples for annotation. This paper presents a novel active learning
strategy for deep neural networks.
"""

Rake.extract(abstract, top: 5)
# => [
#   %{keyword: "active learning strategy", score: 9.0, ...},
#   %{keyword: "deep neural networks", score: 9.0, ...},
#   %{keyword: "labeled training data", score: 9.0, ...},
#   %{keyword: "Machine learning models", score: 9.0, ...},
#   %{keyword: "labeling costs", score: 4.0, ...}
# ]
```

### News Article

```elixir
article = """
The Federal Reserve announced today that interest rates will remain
unchanged. Fed Chair Jerome Powell cited ongoing inflation concerns
and labor market strength as key factors in the decision.
"""

Rake.extract(article, top: 5)
# => [
#   %{keyword: "Fed Chair Jerome Powell", score: 16.0, ...},
#   %{keyword: "ongoing inflation concerns", score: 9.0, ...},
#   %{keyword: "labor market strength", score: 9.0, ...},
#   %{keyword: "Federal Reserve", score: 4.0, ...},
#   %{keyword: "interest rates", score: 4.0, ...}
# ]
```

### With Custom Domain Stop Words

```elixir
# For legal documents, add domain-specific stop words
legal_stop_words = Rake.StopWords.english() ++ ~w(
  hereby whereas therefore pursuant herein thereof
  plaintiff defendant court jurisdiction
)

Rake.extract(legal_document, stop_words: legal_stop_words)
```

## Comparison with Other Approaches

| Method | Training Required | Corpus Required | Speed |
|--------|-------------------|-----------------|-------|
| **RAKE** | No | No | Fast (single pass) |
| TF-IDF | No | Yes | Fast |
| TextRank | No | No | Slower (iterative) |
| BERT KeyPhrase | Yes | Yes | Slow (neural) |

RAKE is ideal when you need fast, unsupervised keyword extraction from individual documents without maintaining a corpus.

## License

MIT License. See [LICENSE](LICENSE) for details.

## References

- Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), *Text Mining: Applications and Theory*. John Wiley & Sons.