# WordNet Integration

Complete guide to using WordNet with Nasty for word sense disambiguation and semantic similarity.

## Overview

Nasty integrates **Open English WordNet** (OEWN) and **Open Multilingual WordNet** (OMW) to provide comprehensive lexical database support. WordNet enhances natural language processing by:

- **Word Sense Disambiguation** - Determine which meaning of a word is used in context
- **Semantic Similarity** - Measure how similar two words or concepts are
- **Synonym/Antonym Discovery** - Find related words
- **Hierarchical Relationships** - Navigate hypernym/hyponym taxonomies
- **Cross-lingual Support** - Link concepts across English, Spanish, and Catalan

## Quick Start

```elixir
alias Nasty.Lexical.WordNet

# Get all meanings of "bank"
synsets = WordNet.synsets("bank", :noun)
# => [
#   %Synset{definition: "financial institution", ...},
#   %Synset{definition: "land alongside water", ...}
# ]

# Get definition
WordNet.definition(synset_id)
# => "a financial institution that accepts deposits"

# Find synonyms
WordNet.synonyms("big", :adj)
# => ["large", "big", "great"]

# Get hypernyms (more general concepts)
WordNet.hypernyms(synset_id)
# => ["oewn-02083346-n"]  # canine

# Calculate semantic similarity
alias Nasty.Lexical.WordNet.Similarity
Similarity.wup_similarity(dog_id, cat_id)
# => 0.857  # High similarity
```

## Installation

### 1. Download WordNet Data

```bash
# Download English WordNet (required for most features)
mix nasty.wordnet.download --language en

# Optional: Download Spanish
mix nasty.wordnet.download --language es

# Optional: Download Catalan
mix nasty.wordnet.download --language ca
```

Data files are downloaded to `priv/wordnet/` by default.

### 2. Verify Installation

```bash
mix nasty.wordnet.list
```

Expected output:
```
WordNet Data Status
============================================================

English (en)
  Status: Installed
  Path: priv/wordnet/oewn-2025.json
  Size: 45.2 MB
  Loaded: No (will load on first use)

Spanish (es)
  Status: Not installed
  Download: mix nasty.wordnet.download --language es
...
```

## Core Concepts

### Synsets

A **synset** (synonym set) groups words with the same meaning:

```elixir
# Get synsets for "dog"
synsets = WordNet.synsets("dog", :noun)

# First synset
synset = hd(synsets)
synset.id          # => "oewn-02084071-n"
synset.definition  # => "a member of the genus Canis"
synset.examples    # => ["the dog barked all night"]
synset.lemmas      # => ["dog", "domestic dog", "Canis familiaris"]
synset.pos         # => :noun
```

### Lemmas

A **lemma** entry pairs a word form with one of its senses (a synset):

```elixir
lemmas = WordNet.lemmas("run", :verb)
# Multiple senses of "run" as a verb

lemma = hd(lemmas)
lemma.word        # => "run"
lemma.synset_id   # => "oewn-01926311-v"
lemma.sense_key   # => "run%2:38:00::"
```

### Relations

WordNet defines **semantic relations** between synsets:

```elixir
# Hypernyms (more general)
WordNet.hypernyms(dog_id)  # => [canine_id]

# Hyponyms (more specific)
WordNet.hyponyms(canine_id)  # => [dog_id, wolf_id, fox_id, ...]

# Meronyms (part-of)
WordNet.meronyms(car_id)  # => [wheel_id, door_id, engine_id, ...]

# Holonyms (whole-of)
WordNet.holonyms(wheel_id)  # => [car_id, bicycle_id, ...]

# Antonyms (opposites)
WordNet.antonyms(hot_id)  # => [cold_id]

# Similar concepts
WordNet.similar(hot_id)  # => [warm_id, ...]
```

## API Reference

### Synset Operations

#### `synsets/3`

Get all synsets for a word.

```elixir
WordNet.synsets(word, pos \\ nil, language \\ :en)
```

**Parameters:**
- `word` - Word to look up (string)
- `pos` - Part of speech filter: `:noun`, `:verb`, `:adj`, `:adv`, or `nil` for all
- `language` - Language code: `:en`, `:es`, `:ca`

**Returns:** List of `Synset` structs

**Examples:**
```elixir
# All senses of "run"
WordNet.synsets("run")

# Only verb senses
WordNet.synsets("run", :verb)

# Spanish word
WordNet.synsets("perro", :noun, :es)
```

#### `synset/2`

Get a specific synset by ID.

```elixir
WordNet.synset(synset_id, language \\ :en)
```

#### `definition/2`

Get the definition of a synset.

```elixir
WordNet.definition(synset_id, language \\ :en)
# => "a member of the genus Canis"
```

#### `examples/2`

Get usage examples for a synset.

```elixir
WordNet.examples(synset_id, language \\ :en)
# => ["the dog barked all night"]
```

### Relation Operations

#### Taxonomic Relations

```elixir
# More general concepts
WordNet.hypernyms(synset_id, language \\ :en)

# More specific concepts
WordNet.hyponyms(synset_id, language \\ :en)
```

#### Part-Whole Relations

```elixir
# Parts of this concept
WordNet.meronyms(synset_id, language \\ :en)

# Wholes that contain this concept
WordNet.holonyms(synset_id, language \\ :en)
```

#### Similarity/Opposition

```elixir
# Opposite concepts
WordNet.antonyms(synset_id, language \\ :en)

# Similar concepts
WordNet.similar(synset_id, language \\ :en)
```

#### All Relations

```elixir
# Get all relations from a synset
WordNet.all_relations(synset_id, language \\ :en)
# => [{:hypernym, "target-id"}, {:meronym, "another-id"}, ...]
```

### Synonym/Antonym Discovery

#### `synonyms/3`

Find synonyms by collecting all words that share a synset with the given word.

```elixir
WordNet.synonyms(word, pos \\ nil, language \\ :en)

# Examples
WordNet.synonyms("big")
# => ["big", "large", "great", "huge"]

WordNet.synonyms("run", :verb)
# => ["run", "jog", "sprint", ...]
```

### Semantic Path Operations

#### `common_hypernyms/3`

Find shared ancestors of two synsets.

```elixir
WordNet.common_hypernyms(synset1_id, synset2_id, language \\ :en)
# => [common_ancestor_id, ...]
```

#### `shortest_path/3`

Find shortest path length between synsets.

```elixir
WordNet.shortest_path(synset1_id, synset2_id, language \\ :en)
# => 3  # number of edges
```
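
The BFS mentioned for shortest paths can be sketched on a toy graph. This is an illustration only: the module name, the edge list, and the choice to treat hypernym links as undirected are assumptions, not Nasty's implementation.

```elixir
defmodule ShortestPathSketch do
  # Toy hypernym edges (child -> parent); the search treats them as undirected.
  @edges [dog: :canine, wolf: :canine, canine: :carnivore, cat: :feline, feline: :carnivore]

  defp neighbors(node) do
    for {a, b} <- @edges, a == node or b == node, do: if(a == node, do: b, else: a)
  end

  # Breadth-first search; returns the number of edges, or nil if unreachable.
  def shortest_path(from, to), do: bfs([{from, 0}], MapSet.new([from]), to)

  defp bfs([], _seen, _to), do: nil
  defp bfs([{node, dist} | _rest], _seen, to) when node == to, do: dist

  defp bfs([{node, dist} | rest], seen, to) do
    # Enqueue only neighbors that have not been visited yet.
    fresh = Enum.reject(neighbors(node), &MapSet.member?(seen, &1))
    seen = Enum.reduce(fresh, seen, &MapSet.put(&2, &1))
    bfs(rest ++ Enum.map(fresh, &{&1, dist + 1}), seen, to)
  end
end

ShortestPathSketch.shortest_path(:dog, :cat)  # 4 (dog, canine, carnivore, feline, cat)
```

Because BFS visits nodes in order of distance, the first time the target is dequeued its distance is minimal.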

### Cross-lingual Operations

#### `from_ili/2`

Find synsets in the target language via the Interlingual Index (ILI).

```elixir
# Find English equivalent of Spanish word
spanish_synsets = WordNet.synsets("perro", :noun, :es)
spanish_synset = hd(spanish_synsets)

# Get ILI
ili_id = spanish_synset.ili  # => "i2084071"

# Find in English
english_synsets = WordNet.from_ili(ili_id, :en)
# => [%Synset{lemmas: ["dog", ...]}]
```
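
To see why a shared ILI key is enough to cross languages, here is a minimal self-contained sketch. The data, the `"i-dog"` style keys, and the module name are hypothetical; real ILI identifiers and Nasty's storage layout will differ.

```elixir
defmodule IliPivotSketch do
  # Hypothetical miniature data: one ILI key identifies the same concept in every language.
  @synsets %{
    en: %{"i-dog" => ["dog", "domestic dog"], "i-cat" => ["cat"]},
    es: %{"i-dog" => ["perro", "can"], "i-cat" => ["gato"]}
  }

  # ILI keys of every synset in `lang` whose lemmas include `word`.
  def ilis_for(word, lang) do
    for {ili, lemmas} <- Map.get(@synsets, lang, %{}), word in lemmas, do: ili
  end

  # Pivot through the ILI: source word -> ILI keys -> target-language lemmas.
  def translate(word, from, to) do
    target = Map.get(@synsets, to, %{})

    word
    |> ilis_for(from)
    |> Enum.flat_map(&Map.get(target, &1, []))
    |> Enum.uniq()
  end
end

IliPivotSketch.translate("perro", :es, :en)  # ["dog", "domestic dog"]
```

Neither language needs to know about the other; each only maps its own synsets to ILI keys.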

## Semantic Similarity

The `Nasty.Lexical.WordNet.Similarity` module provides various similarity metrics.

### Path Similarity

Based on the shortest path in the hypernym hierarchy:

```elixir
alias Nasty.Lexical.WordNet.Similarity

# Path similarity (0.0 to 1.0)
Similarity.path_similarity(dog_id, canine_id)
# => 0.5  # 1 edge apart (direct hypernym)

Similarity.path_similarity(dog_id, organism_id)
# => 0.25  # 3 edges apart
```
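
The example values above match the common definition `1 / (1 + path_length)`. Whether Nasty uses exactly this formula is an assumption; the sketch below only shows how such a score behaves, and the module and function names are made up for illustration.

```elixir
defmodule PathSimilaritySketch do
  # Convert a shortest-path length (number of edges) into a score in (0, 1].
  # Identical synsets (0 edges) score 1.0; the score shrinks as distance grows.
  def from_path_length(edges) when is_integer(edges) and edges >= 0 do
    1 / (1 + edges)
  end
end

PathSimilaritySketch.from_path_length(1)  # 0.5
PathSimilaritySketch.from_path_length(3)  # 0.25
```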

### Wu-Palmer Similarity

Based on the depth of the Least Common Subsumer (LCS):

```elixir
# Wu-Palmer similarity (0.0 to 1.0)
Similarity.wup_similarity(dog_id, cat_id)
# => 0.857  # High similarity (both mammals)

Similarity.wup_similarity(dog_id, tree_id)
# => 0.133  # Low similarity (different domains)
```

**Formula:** `2 * depth(LCS) / (depth(synset1) + depth(synset2))`
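
The formula can be worked through on a toy taxonomy. This sketch assumes single inheritance and counts depth in edges (root = 0); the taxonomy and module name are hypothetical, and real WordNet synsets can have several hypernyms.

```elixir
defmodule WuPalmerSketch do
  # Toy taxonomy: each synset maps to its single hypernym (nil marks the root).
  @parents %{
    entity: nil,
    animal: :entity,
    plant: :entity,
    carnivore: :animal,
    dog: :carnivore,
    cat: :carnivore,
    tree: :plant
  }

  # Path from a node up to the root, starting with the node itself.
  def ancestors(nil), do: []
  def ancestors(node), do: [node | ancestors(Map.fetch!(@parents, node))]

  # Depth counted in edges, so the root has depth 0.
  def depth(node), do: length(ancestors(node)) - 1

  # Least common subsumer: the first ancestor of `a` that is also an ancestor of `b`.
  def lcs(a, b) do
    b_ancestors = MapSet.new(ancestors(b))
    Enum.find(ancestors(a), &MapSet.member?(b_ancestors, &1))
  end

  # Identical synsets score 1.0 (this also avoids 0 / 0 at the root).
  def wup(a, a), do: 1.0
  def wup(a, b), do: 2 * depth(lcs(a, b)) / (depth(a) + depth(b))
end

WuPalmerSketch.lcs(:dog, :cat)   # :carnivore
WuPalmerSketch.wup(:dog, :cat)   # 2 * 2 / (3 + 3), about 0.667
WuPalmerSketch.wup(:dog, :tree)  # 0.0, because the LCS is the root
```

With edge-counted depth, any pair whose only shared ancestor is the root scores 0.0, which is why unrelated domains end up near the bottom of the scale.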

### Lesk Similarity

Based on definition overlap:

```elixir
# Lesk similarity (0.0 to 1.0)
Similarity.lesk_similarity(dog_id, cat_id)
# => 0.15  # Some overlapping words in definitions
```
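
Definition overlap can be measured in several ways. The sketch below assumes a Jaccard ratio over content words with a small hand-picked stopword list; both choices, and the module name, are illustrative assumptions rather than Nasty's actual scoring.

```elixir
defmodule LeskOverlapSketch do
  # Tiny stopword list chosen for the example only.
  @stopwords ~w(a an the of and or that is in with)

  # Lowercase, split on non-letters, and drop very common words.
  def content_words(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^a-z]+/, trim: true)
    |> Enum.reject(&(&1 in @stopwords))
    |> MapSet.new()
  end

  # Jaccard overlap of the two definitions: shared words / all distinct words.
  def similarity(def_a, def_b) do
    a = content_words(def_a)
    b = content_words(def_b)
    union_size = MapSet.size(MapSet.union(a, b))

    if union_size == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union_size
    end
  end
end

# Shares "carnivorous" and "mammal" out of 4 distinct content words.
LeskOverlapSketch.similarity("a domesticated carnivorous mammal", "a small carnivorous mammal")  # 0.5
```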

### Combined Similarity

Weighted combination of multiple metrics:

```elixir
Similarity.combined_similarity(
  dog_id,
  cat_id,
  :en,
  metrics: [:path, :wup, :lesk],
  weights: [0.3, 0.5, 0.2]
)
# => 0.654
```
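
One plausible way to combine metrics is a weighted average, sketched here on plain numbers. The normalization by the weight total is an assumption made for the example; it may not match how `combined_similarity` treats its weights.

```elixir
defmodule CombinedSimilaritySketch do
  # `scores` and `weights` are parallel lists, e.g. path/wup/lesk scores
  # and their weights. Dividing by the weight total keeps the result in
  # the 0.0..1.0 range even when the weights do not sum to 1.
  def combine(scores, weights) when length(scores) == length(weights) do
    total = Enum.sum(weights)

    weighted =
      scores
      |> Enum.zip(weights)
      |> Enum.map(fn {score, weight} -> score * weight end)
      |> Enum.sum()

    if total == 0, do: 0.0, else: weighted / total
  end
end

CombinedSimilaritySketch.combine([1.0, 0.5], [1, 1])  # 0.75
```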

### Word Similarity

Compare words directly (not synsets):

```elixir
Similarity.word_similarity("dog", "cat", :noun)
# => 0.857  # Max similarity across all synset pairs

Similarity.word_similarity("happy", "sad", :adj, :en, metric: :wup)
# => 0.5  # Moderate similarity (both emotions)
```

## Word Sense Disambiguation

With WordNet senses available, WSD accuracy rises from roughly 60% to 75% or higher.

### Basic WSD

```elixir
alias Nasty.Language.English.WordSenseDisambiguator, as: WSD

# Disambiguate "bank" in context (assumes Nasty's `Token` struct is aliased)
context_tokens = [
  %Token{text: "river", pos_tag: :noun},
  %Token{text: "flowing", pos_tag: :verb}
]

{:ok, sense} = WSD.disambiguate("bank", context_tokens, pos_tag: :noun)

sense.definition  # => "land alongside a body of water"
sense.synset_id   # => "oewn-..."
```

### How It Works

1. **Get all senses** from WordNet (not just 5 hardcoded ones!)
2. **Score each sense** using the Lesk algorithm:
   - Context-definition overlap
   - Related words (hypernyms, synonyms)
   - Frequency ranking
3. **Return best match**
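
The steps above can be sketched with a simplified Lesk scorer. This version uses only context-definition overlap and breaks ties by sense order; the two-sense inventory, shortened glosses, and module name are hypothetical, and the related-word and frequency signals are left out.

```elixir
defmodule SimplifiedLeskSketch do
  # Hypothetical sense inventory for "bank", ordered most frequent first.
  @senses [
    %{id: :bank_financial, gloss: "financial institution that accepts deposits and lends money"},
    %{id: :bank_river, gloss: "sloping land alongside a river or other body of water"}
  ]

  defp words(text) do
    text |> String.downcase() |> String.split(~r/[^a-z]+/, trim: true) |> MapSet.new()
  end

  # Score = number of context words that also occur in the sense's gloss.
  def score(sense, context_words) do
    sense.gloss |> words() |> MapSet.intersection(MapSet.new(context_words)) |> MapSet.size()
  end

  # Pick the highest-scoring sense; on a tie the earlier sense wins.
  def disambiguate(context_words) do
    Enum.max_by(@senses, &score(&1, context_words))
  end
end

SimplifiedLeskSketch.disambiguate(["river", "flowing"]).id  # :bank_river
```

With no overlapping context at all, the sketch falls back to the first listed sense, which mirrors using frequency ranking as a tiebreaker.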

### Full Pipeline

```elixir
alias Nasty.Language.English
alias Nasty.Language.English.WordSenseDisambiguator, as: WSD

# Parse sentence
{:ok, tokens} = English.tokenize("The river bank was muddy.")
{:ok, tagged} = English.tag_pos(tokens)

# Disambiguate all content words
disambiguated = WSD.disambiguate_all(tagged)

Enum.each(disambiguated, fn {token, sense} ->
  IO.puts("#{token.text}: #{sense.definition}")
end)

# Output:
# river: a large natural stream of water
# bank: land alongside a body of water
# muddy: covered with mud
```

## Advanced Usage

### Depth Calculation

```elixir
alias Nasty.Lexical.WordNet.Similarity

# Calculate depth in taxonomy
Similarity.depth(entity_id)  # => 0  (root)
Similarity.depth(dog_id)     # => 13 (deep in hierarchy)
```

### Least Common Subsumer

```elixir
# Find most specific common ancestor
lcs_id = Similarity.lcs(dog_id, cat_id)
# => carnivore_id  # most specific ancestor shared by dog and cat
```

### Statistics

```elixir
# Get statistics for loaded data
WordNet.stats(:en)
# => %{synsets: 120532, lemmas: 155287, relations: 207016}
```

### Manual Loading

```elixir
# Pre-load data (otherwise loads on first use)
WordNet.ensure_loaded(:en)
WordNet.ensure_loaded(:es)

# Check if loaded
WordNet.loaded?(:en)  # => true
```

## Performance

### Memory Usage

- **English (OEWN):** ~200MB RAM (120K synsets)
- **Spanish (OMW):** ~50MB RAM (30K synsets)
- **Catalan (OMW):** ~40MB RAM (25K synsets)

### Load Time

- **JSON parsing:** ~1-2 seconds per language
- **ETS table building:** ~1 second
- **Total:** 2-3 seconds per language

### Query Performance

- **Synset lookup by ID:** O(1), <1ms
- **Lemma lookup by word:** O(1), <1ms
- **Hypernym traversal:** O(d) where d=depth, <5ms typical
- **Similarity calculation:** O(d1 + d2), <10ms typical
- **Shortest path:** BFS, depends on distance

### Optimization

WordNet uses **lazy loading** - data loads only when first accessed:

```elixir
# Fast - no loading
WordNet.loaded?(:en)  # => false

# First query triggers loading (2-3 seconds)
WordNet.synsets("dog")

# Subsequent queries are instant
WordNet.synsets("cat")  # <1ms
```
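
The load-on-first-use pattern can be sketched with a named ETS table. This is a simplified illustration: the module name, table name, and stand-in data are hypothetical, and it ignores concerns a real implementation has to handle, such as concurrent first calls and which process owns the table.

```elixir
defmodule LazyStoreSketch do
  @table :lazy_store_sketch

  # Cheap check: does the named table exist yet?
  def loaded?, do: :ets.whereis(@table) != :undefined

  # Build the table on first use; later calls find it already present.
  def ensure_loaded do
    if not loaded?() do
      :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
      # Stand-in for the expensive step (parsing the data file).
      :ets.insert(@table, [{"dog", [:dog_synset]}, {"cat", [:cat_synset]}])
    end

    :ok
  end

  # Every query goes through ensure_loaded/0, so callers never load explicitly.
  def lookup(word) do
    ensure_loaded()

    case :ets.lookup(@table, word) do
      [{^word, synsets}] -> synsets
      [] -> []
    end
  end
end
```

Only the first `lookup/1` pays the loading cost; afterwards each call is a single ETS read.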

## Troubleshooting

### WordNet Not Found

```
WordNet data file not found for en: priv/wordnet/oewn-2025.json
Run 'mix nasty.wordnet.download --language en' to download.
```

**Solution:** Download the data file:
```bash
mix nasty.wordnet.download --language en
```

### No Synsets Found

```elixir
WordNet.synsets("misspelled")
# => []
```

**Solutions:**
1. Check spelling
2. Try lemmatized form: "running" → "run"
3. Try different POS tag
4. Word may not be in WordNet

### Memory Issues

If loading multiple languages causes memory issues:

1. Only load languages you need
2. Use lazy loading (don't pre-load)
3. Consider clearing unused languages:
   ```elixir
   Storage.clear(:es)  # Free Spanish data
   ```

### Slow First Query

The first query loads WordNet data, which takes 2-3 seconds. To avoid the delay at query time, pre-load during startup:

```elixir
# Pre-load during application startup
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    # Load WordNet in the background so startup is not blocked
    Task.start(fn -> Nasty.Lexical.WordNet.ensure_loaded(:en) end)

    # ...
  end
end
```

## Examples

### Example 1: Find Related Words

```elixir
defmodule RelatedWords do
  alias Nasty.Lexical.WordNet

  def find_related(word, pos \\ :noun) do
    case WordNet.synsets(word, pos) do
      [] ->
        # Word (with this POS) is not in WordNet
        nil

      # Use first (most common) sense
      [synset | _] ->
        # Get hypernyms
        hypernym_ids = WordNet.hypernyms(synset.id)
        hypernyms = Enum.map(hypernym_ids, &WordNet.synset(&1))

        # Get hyponyms
        hyponym_ids = WordNet.hyponyms(synset.id)
        hyponyms = Enum.map(hyponym_ids, &WordNet.synset(&1))

        %{
          word: word,
          definition: synset.definition,
          synonyms: synset.lemmas,
          more_general: Enum.flat_map(hypernyms, & &1.lemmas),
          more_specific: Enum.flat_map(hyponyms, & &1.lemmas)
        }
    end
  end
end

RelatedWords.find_related("dog")
# => %{
#   word: "dog",
#   definition: "a member of the genus Canis",
#   synonyms: ["dog", "domestic dog", "Canis familiaris"],
#   more_general: ["canine", "canid"],
#   more_specific: ["puppy", "hound", "working dog", ...]
# }
```

### Example 2: Semantic Search

```elixir
defmodule SemanticSearch do
  alias Nasty.Lexical.WordNet
  alias Nasty.Lexical.WordNet.Similarity

  def find_similar(query_word, candidate_words, threshold \\ 0.5) do
    case WordNet.synsets(query_word, :noun) do
      [] ->
        # Query word is not in WordNet, so nothing can be compared
        []

      # Use first (most common) sense of the query word
      [query_synset | _] ->
        candidate_words
        |> Enum.map(fn word ->
          case WordNet.synsets(word, :noun) do
            [] -> {word, 0.0}
            synsets -> {word, max_similarity(query_synset, synsets)}
          end
        end)
        |> Enum.filter(fn {_word, sim} -> sim >= threshold end)
        |> Enum.sort_by(fn {_word, sim} -> sim end, :desc)
    end
  end

  # Best Wu-Palmer score between the query sense and any candidate sense
  defp max_similarity(query_synset, candidate_synsets) do
    candidate_synsets
    |> Enum.map(&Similarity.wup_similarity(query_synset.id, &1.id))
    |> Enum.max()
  end
end

SemanticSearch.find_similar("dog", ["cat", "wolf", "tree", "house"])
# => [
#   {"wolf", 0.923},
#   {"cat", 0.857}
# ]
# "tree" (0.133) and "house" (0.125) fall below the default 0.5 threshold
```

### Example 3: Cross-lingual Translation

```elixir
defmodule CrossLingual do
  alias Nasty.Lexical.WordNet

  def translate(word, from_lang, to_lang) do
    # Get synsets in source language
    synsets = WordNet.synsets(word, nil, from_lang)
    
    # For each synset, find equivalent in target language
    Enum.flat_map(synsets, fn synset ->
      if synset.ili do
        target_synsets = WordNet.from_ili(synset.ili, to_lang)
        Enum.flat_map(target_synsets, & &1.lemmas)
      else
        []
      end
    end)
    |> Enum.uniq()
  end
end

CrossLingual.translate("perro", :es, :en)
# => ["dog", "domestic dog", "Canis familiaris"]

CrossLingual.translate("dog", :en, :es)
# => ["perro", "can"]
```

## References

- [Open English WordNet](https://github.com/globalwordnet/english-wordnet)
- [Open Multilingual WordNet](https://omwn.org/)
- [WN-LMF Specification](https://globalwordnet.github.io/schemas/)
- [Princeton WordNet](https://wordnet.princeton.edu/)
- [Wu & Palmer (1994)](https://dl.acm.org/doi/10.3115/981732.981751) - Wu-Palmer Similarity
- [Lesk (1986)](https://dl.acm.org/doi/10.1145/318723.318728) - Lesk Algorithm

## See Also

- [PARSING_GUIDE.md](PARSING_GUIDE.md) - NLP pipeline overview
- [ENGLISH_GRAMMAR.md](languages/ENGLISH_GRAMMAR.md) - Grammar specification  
- [USER_GUIDE.md](USER_GUIDE.md) - General usage guide