# Parsing Guide
This document provides a comprehensive technical guide to all parsing algorithms implemented in Nasty, including tokenization, POS tagging, morphological analysis, phrase parsing, sentence parsing, and dependency extraction.
## Table of Contents
1. [Pipeline Overview](#pipeline-overview)
2. [Tokenization](#tokenization)
3. [POS Tagging](#pos-tagging)
4. [Morphological Analysis](#morphological-analysis)
5. [Phrase Parsing](#phrase-parsing)
6. [Sentence Parsing](#sentence-parsing)
7. [Dependency Extraction](#dependency-extraction)
8. [Integration Example](#integration-example)
9. [Performance Considerations](#performance-considerations)
10. [Further Reading](#further-reading)
## Pipeline Overview
The Nasty NLP pipeline processes text through the following stages:
```mermaid
flowchart TD
A[Input Text]
B["[1] Tokenization (NimbleParsec)"]
C["[2] POS Tagging (Rule-based / HMM / Neural)"]
D["[3] Morphological Analysis (Lemmatization + Features)"]
E["[4] Phrase Parsing (Bottom-up CFG)"]
F["[5] Sentence Parsing (Clause Detection)"]
G["[6] Dependency Extraction (UD Relations)"]
H[Complete AST]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
```
Each stage:
- Takes structured input from the previous stage
- Adds linguistic annotations
- Preserves position tracking (span information)
- Maintains language metadata
## Tokenization
### Algorithm: NimbleParsec Combinator Parsing
**Module**: `Nasty.Language.English.Tokenizer`
**Approach**: Combinator-based parsing using NimbleParsec, processing text left-to-right with greedy matching and ordered choice (more specific patterns are tried first).
### Token Types
1. **Hyphenated words**: `well-known`, `twenty-one`
2. **Contractions**: `don't`, `I'm`, `we've`, `it's`
3. **Numbers**: integers (`123`), decimals (`3.14`)
4. **Words**: alphabetic sequences
5. **Punctuation**: sentence-ending (`.`, `!`, `?`), commas, quotes, brackets, etc.
### Parser Combinators
```elixir
# Order matters - more specific patterns first
token = choice([
  hyphenated,    # "well-known"
  contraction,   # "don't"
  number,        # "123", "3.14"
  word,          # "cat"
  punctuation    # ".", ",", etc.
])
```
### Position Tracking
Every token includes precise position information:
```elixir
%Token{
  text: "cat",
  span: %{
    start_pos: {1, 5},    # {line, column}
    start_offset: 4,      # byte offset
    end_pos: {1, 8},
    end_offset: 7
  }
}
```
Position tracking handles:
- Multi-line text with newline counting
- Whitespace between tokens (ignored but tracked)
- UTF-8 byte offsets vs. character positions
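The byte-offset vs. character-position distinction matters as soon as the input contains non-ASCII characters. A plain-Elixir illustration (not part of the tokenizer API) of why the span carries both:
```elixir
# Character count and byte count diverge for non-ASCII text, so spans keep
# both a {line, column} position and a byte offset.
String.length("café")  # => 4  (characters / columns)
byte_size("café")      # => 5  ("é" occupies two bytes in UTF-8)
```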
### Edge Cases
- **Empty text**: Returns `{:ok, []}`
- **Whitespace-only**: Returns `{:ok, []}`
- **Unparseable text**: Returns `{:error, {:parse_incomplete, ...}}`
- **Contractions**: Parsed as single tokens, not split
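For example (using the return shapes documented above):
```elixir
# Empty and whitespace-only input produce an empty token list, not an error.
{:ok, []} = Tokenizer.tokenize("")
{:ok, []} = Tokenizer.tokenize("   \n\t  ")

# Contractions come back as a single token rather than being split.
{:ok, [token]} = Tokenizer.tokenize("don't")
token.text
# => "don't"
```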
### Example
```elixir
{:ok, tokens} = Tokenizer.tokenize("I don't know.")
# => [
# %Token{text: "I", pos_tag: :x, span: ...},
# %Token{text: "don't", pos_tag: :x, span: ...},
# %Token{text: "know", pos_tag: :x, span: ...},
# %Token{text: ".", pos_tag: :punct, span: ...}
# ]
```
## POS Tagging
### Three Tagging Models
**Module**: `Nasty.Language.English.POSTagger`
Nasty supports three POS tagging approaches with different accuracy/speed tradeoffs:
| Model | Accuracy | Speed | Method |
|-------|----------|-------|--------|
| Rule-based | ~85% | Very Fast | Lexical lookup + morphology + context |
| HMM (Trigram) | ~95% | Fast | Viterbi decoding with add-k smoothing |
| Neural (BiLSTM-CRF) | 97-98% | Moderate | Deep learning with contextual embeddings |
### 1. Rule-Based Tagging
**Algorithm**: Sequential pattern matching with three-tier lookup
#### Tagging Strategy
1. **Lexical Lookup**: Closed-class words (determiners, pronouns, prepositions, etc.)
- 450+ words in lookup tables
- Example: `"the"` → `:det`, `"in"` → `:adp`, `"and"` → `:cconj`
2. **Morphological Analysis**: Suffix-based tagging for open-class words
```
Nouns: -tion, -sion, -ment, -ness, -ity, -ism
Verbs: -ing, -ed, -s/-es (3rd person singular)
Adjectives: -ful, -less, -ous, -ive, -able, -ible
Adverbs: -ly
```
3. **Contextual Disambiguation**: Local context rules
- Word after determiner → likely noun
- Word after preposition → likely noun
- Word before noun → likely adjective
- Capitalized words → proper nouns
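The three tiers above can be pictured as a simple cascade. The sketch below is illustrative only; the lookup tables are truncated and the helper names are not the module's actual internals:
```elixir
defmodule TaggerSketch do
  # Tiny excerpt of the closed-class lexicon (the real table has 450+ entries).
  @lexicon %{"the" => :det, "in" => :adp, "and" => :cconj, "he" => :pron}

  # Tier 1 → Tier 2 → Tier 3, falling back to :noun if nothing matches.
  def tag(word, prev_tag) do
    lexical(word) || morphological(word) || contextual(word, prev_tag) || :noun
  end

  # Tier 1: closed-class lexical lookup.
  defp lexical(word), do: Map.get(@lexicon, String.downcase(word))

  # Tier 2: suffix heuristics for open-class words.
  defp morphological(word) do
    cond do
      String.ends_with?(word, ["tion", "ment", "ness", "ity"]) -> :noun
      String.ends_with?(word, ["ing", "ed"]) -> :verb
      String.ends_with?(word, ["ful", "less", "ous", "able"]) -> :adj
      String.ends_with?(word, "ly") -> :adv
      true -> nil
    end
  end

  # Tier 3: capitalization and local context.
  defp contextual(word, prev_tag) do
    cond do
      String.match?(word, ~r/^[A-Z]/) -> :propn
      prev_tag in [:det, :adp] -> :noun
      true -> nil
    end
  end
end
```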
#### Third-Person Singular Verb Detection
Conservative approach to avoid mistagging plural nouns as verbs:
```elixir
# "walks" → :verb (stem "walk" in common verb list)
# "books" → :noun (not a verb stem)
# "stations" → :noun (ends with -tions, noun suffix)
```
Checks:
- Exclude capitalized words (proper nouns)
- Exclude words with clear noun suffixes (-tions, -ments, etc.)
- Verify stem is in common verb list (140+ verbs)
### 2. HMM-Based Tagging
**Algorithm**: Viterbi decoding with trigram Hidden Markov Model
#### Model Components
1. **Emission Probabilities**: P(word|tag)
- Learned from tagged training data
- Smoothing for unknown words: add-k smoothing (k=0.001)
2. **Transition Probabilities**: P(tag₃|tag₁, tag₂)
- Trigram model for better context
- Special START markers for sentence boundaries
- Add-k smoothing for unseen trigrams
3. **Initial Probabilities**: P(tag) at sentence start
- Distribution of first tags in training sentences
#### Training Process
```elixir
training_data = [
  {["The", "cat", "sat"], [:det, :noun, :verb]},
  ...
]

model = HMMTagger.new()
{:ok, trained} = HMMTagger.train(model, training_data, [])
```
Counts:
- Emission counts: `{word, tag}` pairs
- Transition counts: `{tag1, tag2} → tag3` trigrams
- Initial counts: first tag in each sequence
Normalization:
```
P(word|tag)       = (count(word, tag) + k) / (count(tag) + k * vocab_size)
P(tag3|tag1,tag2) = (count(tag1, tag2, tag3) + k) / (count(tag1, tag2) + k * num_tags)
```
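In code, the smoothed emission lookup reduces to counts plus the constant. The sketch below assumes hypothetical field names on the trained model (`emission_counts`, `tag_counts`, `vocab_size`, `k`):
```elixir
defmodule SmoothingSketch do
  # Assumed shapes: emission_counts is %{{word, tag} => count},
  # tag_counts is %{tag => total occurrences of tag}, k is e.g. 0.001.
  def emission_prob(model, word, tag) do
    count = Map.get(model.emission_counts, {word, tag}, 0)
    total = Map.get(model.tag_counts, tag, 0)

    (count + model.k) / (total + model.k * model.vocab_size)
  end
end
```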
#### Viterbi Decoding
Dynamic programming algorithm to find most likely tag sequence:
```
score[t][tag] = max over prev_tags of:
  score[t-1][prev_tag] +
  log P(tag | prev_prev_tag, prev_tag) +
  log P(word_t | tag)
```
Steps:
1. **Initialization**: Score each tag for first word
2. **Forward Pass**: Compute best score for each (position, tag) pair
3. **Backpointers**: Track best previous tag for reconstruction
4. **Backtracking**: Reconstruct best path from end to start
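The core of the decoder is the recurrence above. The sketch below is a simplified bigram variant (one previous tag instead of two) and carries the best path alongside each score rather than keeping separate backpointer/backtracking passes; `init`, `trans`, and `emit` are assumed to return log-probabilities:
```elixir
defmodule ViterbiSketch do
  # Simplified bigram Viterbi for illustration; the actual tagger conditions on
  # two previous tags, but the dynamic-programming shape is the same.
  def decode([first | rest], tags, init, trans, emit) do
    # Initialization: score every tag for the first word.
    start = Map.new(tags, fn tag -> {tag, {init.(tag) + emit.(first, tag), [tag]}} end)

    # Forward pass: best score (and the path that produced it) per (position, tag).
    final =
      Enum.reduce(rest, start, fn word, prev ->
        Map.new(tags, fn tag ->
          {best_prev, {best_score, path}} =
            Enum.max_by(prev, fn {prev_tag, {score, _path}} -> score + trans.(prev_tag, tag) end)

          new_score = best_score + trans.(best_prev, tag) + emit.(word, tag)
          {tag, {new_score, path ++ [tag]}}
        end)
      end)

    # Termination: return the path behind the highest-scoring final tag.
    {_tag, {_score, best_path}} = Enum.max_by(final, fn {_tag, {score, _path}} -> score end)
    best_path
  end
end
```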
### 3. Neural Tagging (BiLSTM-CRF)
**Algorithm**: Bidirectional LSTM with Conditional Random Field layer
**Module**: `Nasty.Statistics.POSTagging.NeuralTagger`
#### Architecture
```mermaid
flowchart TD
A["Input: Word IDs [batch_size, seq_len]"]
B["Word Embeddings [batch_size, seq_len, embedding_dim]"]
C["BiLSTM Layers (×2) [batch_size, seq_len, hidden_size * 2]"]
D["Linear Projection [batch_size, seq_len, num_tags]"]
E["CRF Layer (optional) [batch_size, seq_len, num_tags]"]
F["Output: Tag IDs [batch_size, seq_len]"]
A --> B
B --> C
C --> D
D --> E
E --> F
```
#### Key Components
1. **Word Embeddings**: 300-dimensional learned representations
- Vocabulary built from training data (min frequency = 2)
- Unknown words mapped to special UNK token
2. **Bidirectional LSTM**: 2 layers, 256 hidden units each
- Forward LSTM: left-to-right context
- Backward LSTM: right-to-left context
- Concatenated outputs: 512 dimensions
3. **CRF Layer**: Learns tag transition constraints
- Enforces valid tag sequences (e.g., DET → NOUN more likely than DET → VERB)
- Joint decoding over entire sequence
4. **Dropout**: 0.3 rate for regularization
#### Training
```elixir
tagger = NeuralTagger.new(vocab_size: 10000, num_tags: 17)
training_data = [{["The", "cat"], [:det, :noun]}, ...]

{:ok, trained} = NeuralTagger.train(tagger, training_data,
  epochs: 10,
  batch_size: 32,
  learning_rate: 0.001,
  validation_split: 0.1
)
```
Training features:
- Adam optimizer (adaptive learning rate)
- Cross-entropy loss (or CRF loss if using CRF layer)
- Early stopping with patience=3
- Validation set monitoring (10% split)
#### Inference
```elixir
{:ok, tags} = NeuralTagger.predict(trained, ["The", "cat", "sat"], [])
# => {:ok, [:det, :noun, :verb]}
```
Steps:
1. Convert words to IDs using vocabulary
2. Pad sequences to batch size
3. Run through BiLSTM-CRF model
4. Argmax over tag dimension (or Viterbi if using CRF)
5. Convert tag IDs back to atoms
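Steps 1 and 5 are plain map lookups around the model call; a sketch, with the vocabulary and tag-index shapes assumed:
```elixir
# Assumed shapes: vocab is %{"the" => 2, ...} with a reserved UNK id,
# and id_to_tag is %{0 => :det, 1 => :noun, ...}.
word_ids = Enum.map(words, fn word -> Map.get(vocab, String.downcase(word), unk_id) end)

# ... pad, run the BiLSTM-CRF, take argmax (or Viterbi) over the tag dimension ...

tags = Enum.map(predicted_ids, fn id -> Map.fetch!(id_to_tag, id) end)
```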
### Model Selection
Use `:model` option in `POSTagger.tag_pos/2`:
```elixir
# Rule-based (fast, ~85% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :rule_based)
# HMM (fast, ~95% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :hmm)
# Neural (moderate, 97-98% accuracy)
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :neural)
# Ensemble: HMM + rule-based fallback for punctuation/numbers
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :ensemble)
# Neural ensemble: Neural + rule-based fallback
{:ok, tokens} = POSTagger.tag_pos(tokens, model: :neural_ensemble)
```
## Morphological Analysis
### Algorithm: Dictionary + Rule-Based Lemmatization
**Module**: `Nasty.Language.English.Morphology`
**Approach**: Two-tier lemmatization with irregular form lookup followed by rule-based suffix removal.
### Lemmatization Process
#### 1. Irregular Form Lookup
Check dictionaries for common irregular forms:
**Verbs** (80+ irregular verbs):
```
"went" → "go", "was" → "be", "ate" → "eat", "ran" → "run"
```
**Nouns** (12 irregular nouns):
```
"children" → "child", "men" → "man", "mice" → "mouse"
```
**Adjectives** (12 irregular comparatives/superlatives):
```
"better" → "good", "best" → "good", "worse" → "bad"
```
#### 2. Rule-Based Suffix Removal
If no irregular form found, apply POS-specific rules:
**Verbs**:
```
-ing → stem (handling doubled consonants)
"running" → "run" (remove doubled 'n')
"making" → "make"
-ed → stem (handling doubled consonants, silent e)
"stopped" → "stop" (remove doubled 'p')
"liked" → "like" (restore silent 'e')
-s → base form (3rd person singular)
"walks" → "walk"
```
**Nouns**:
```
-ies → -y (flies → fly)
-es → base (if stem ends in s/x/z/ch/sh)
"boxes" → "box", "dishes" → "dish"
-s → base (cats → cat)
```
**Adjectives**:
```
-est → base (superlative)
"fastest" → "fast" (handle doubled consonants)
-er → base (comparative)
"faster" → "fast"
```
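Putting the two tiers together, verb lemmatization follows a lookup-then-rules shape like the sketch below (the irregular table is truncated, silent-e restoration is omitted, and the helper names are illustrative):
```elixir
defmodule LemmatizerSketch do
  # Tiny excerpt of the irregular table (the real one covers 80+ verbs).
  @irregular_verbs %{"went" => "go", "was" => "be", "ate" => "eat", "ran" => "run"}

  # Tier 1: irregular lookup; Tier 2: suffix rules.
  def lemmatize_verb(word) do
    Map.get(@irregular_verbs, word) || strip_verb_suffix(word)
  end

  defp strip_verb_suffix(word) do
    cond do
      # "running" -> "runn" -> "run"
      String.ends_with?(word, "ing") -> undouble(String.slice(word, 0..-4//1))
      # "stopped" -> "stopp" -> "stop" (silent-e restoration not shown)
      String.ends_with?(word, "ed") -> undouble(String.slice(word, 0..-3//1))
      # "walks" -> "walk"
      String.ends_with?(word, "s") -> String.slice(word, 0..-2//1)
      true -> word
    end
  end

  # Drop a doubled final consonant ("runn" -> "run").
  defp undouble(stem) do
    case stem |> String.graphemes() |> Enum.reverse() do
      [last, last | _] -> String.slice(stem, 0..-2//1)
      _ -> stem
    end
  end
end
```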
### Morphological Feature Extraction
#### Verb Features
```elixir
%{
  tense: :present | :past,
  aspect: :progressive,    # for -ing forms
  person: 3,               # for 3rd person singular
  number: :singular
}
```
Examples:
- `"running"` → `%{tense: :present, aspect: :progressive}`
- `"walked"` → `%{tense: :past}`
- `"walks"` → `%{tense: :present, person: 3, number: :singular}`
#### Noun Features
```elixir
%{number: :singular | :plural}
```
Examples:
- `"cat"` → `%{number: :singular}`
- `"cats"` → `%{number: :plural}`
#### Adjective Features
```elixir
%{degree: :positive | :comparative | :superlative}
```
Examples:
- `"fast"` → `%{degree: :positive}`
- `"faster"` → `%{degree: :comparative}`
- `"fastest"` → `%{degree: :superlative}`
### Example
```elixir
{:ok, tokens} = Tokenizer.tokenize("running cats")
{:ok, tagged} = POSTagger.tag_pos(tokens)
{:ok, analyzed} = Morphology.analyze(tagged)
# => [
# %Token{text: "running", pos_tag: :verb, lemma: "run",
# morphology: %{tense: :present, aspect: :progressive}},
# %Token{text: "cats", pos_tag: :noun, lemma: "cat",
# morphology: %{number: :plural}}
# ]
```
## Phrase Parsing
### Algorithm: Bottom-Up Pattern Matching with Context-Free Grammar
**Module**: `Nasty.Language.English.PhraseParser`
**Approach**: Greedy longest-match, left-to-right phrase construction using simplified CFG rules.
### Grammar Rules
```
NP → Det? Adj* (Noun | PropN | Pron) (PP | RelClause)*
VP → Aux* Verb (NP)? (PP | AdvP)*
PP → Prep NP
AdjP → Adv? Adj
AdvP → Adv
RC → RelPron/RelAdv Clause
```
### Phrase Types
#### 1. Noun Phrase (NP)
**Components**:
- **Determiner** (optional): `the`, `a`, `my`, `some`
- **Modifiers** (0+): adjectives, adjectival phrases
- **Head** (required): noun, proper noun, or pronoun
- **Post-modifiers** (0+): prepositional phrases, relative clauses
**Examples**:
```
"the cat" → [det: "the", head: "cat"]
"the big cat" → [det: "the", modifiers: ["big"], head: "cat"]
"the cat on the mat" → [det: "the", head: "cat",
post_modifiers: [PP("on", NP("the mat"))]]
```
**Special Cases**:
- **Pronouns as NPs**: `"I"`, `"he"`, `"they"` can stand alone
- **Multi-word proper nouns**: `"New York"` → consecutive PROPNs merged as modifiers
#### 2. Verb Phrase (VP)
**Components**:
- **Auxiliaries** (0+): `is`, `have`, `will`, `can`
- **Head** (required): main verb
- **Complements** (0+): object NP, PPs, adverbs
**Examples**:
```
"sat" → [head: "sat"]
"is running" → [auxiliaries: ["is"], head: "running"]
"saw the cat" → [head: "saw", complements: [NP("the cat")]]
"sat on the mat" → [head: "sat", complements: [PP("on", NP("the mat"))]]
```
**Special Case - Copula Construction**:
If only auxiliaries found (no main verb), treat last auxiliary as main verb:
```
"is happy" → [head: "is", complements: [AdjP("happy")]]
"are engineers" → [head: "are", complements: [NP("engineers")]]
```
#### 3. Prepositional Phrase (PP)
**Structure**: `Prep + NP`
**Examples**:
```
"on the mat" → [head: "on", object: NP("the mat")]
"in the house" → [head: "in", object: NP("the house")]
```
#### 4. Adjectival Phrase (AdjP)
**Structure**: `Adv? + Adj`
**Examples**:
```
"very big" → [intensifier: "very", head: "big"]
"quite small" → [intensifier: "quite", head: "small"]
```
#### 5. Adverbial Phrase (AdvP)
**Structure**: `Adv` (currently simple single-word adverbs)
**Examples**:
```
"quickly" → [head: "quickly"]
"often" → [head: "often"]
```
#### 6. Relative Clause (RC)
**Structure**: `RelPron/RelAdv + Clause`
**Relativizers**:
- Pronouns: `who`, `whom`, `whose`, `which`, `that`
- Adverbs: `where`, `when`, `why`
**Examples**:
```
"that sits" → [relativizer: "that", clause: VP("sits")]
"who I know" → [relativizer: "who", clause: [subject: NP("I"), predicate: VP("know")]]
```
**Two Patterns**:
1. **Relativizer as subject**: `"that sits"` → clause has only VP
2. **Relativizer as object**: `"that I see"` → clause has NP subject + VP
### Parsing Process
Each `parse_*_phrase` function:
1. Checks current position in token list
2. Attempts to consume tokens matching the pattern
3. Recursively parses sub-phrases (e.g., NP within PP)
4. Calculates span from first to last consumed token
5. Returns `{:ok, phrase, next_position}` or `:error`
**Greedy Matching**: Consumes as many tokens as possible for each phrase (e.g., all consecutive adjectives as modifiers).
**Position Tracking**: Every phrase includes span covering all constituent tokens.
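The prepositional phrase parser is the smallest instance of this contract. A sketch, assuming the token list is indexable with `Enum.at/2` and that `merge_spans/2` is a helper for combining spans (both assumptions):
```elixir
# Illustrative only: follows the {:ok, phrase, next_position} | :error contract.
def parse_prepositional_phrase(tokens, pos) do
  with %Token{pos_tag: :adp} = prep <- Enum.at(tokens, pos),
       {:ok, np, next_pos} <- parse_noun_phrase(tokens, pos + 1) do
    phrase = %PrepositionalPhrase{
      head: prep,
      object: np,
      # Span covers everything from the preposition through the object NP.
      span: merge_spans(prep.span, np.span)
    }

    {:ok, phrase, next_pos}
  else
    _ -> :error
  end
end
```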
### Example
```elixir
tokens = [
  %Token{text: "the", pos_tag: :det},
  %Token{text: "big", pos_tag: :adj},
  %Token{text: "cat", pos_tag: :noun},
  %Token{text: "on", pos_tag: :adp},
  %Token{text: "the", pos_tag: :det},
  %Token{text: "mat", pos_tag: :noun}
]

{:ok, np, _pos} = PhraseParser.parse_noun_phrase(tokens, 0)
# => %NounPhrase{
#      determiner: "the",
#      modifiers: ["big"],
#      head: "cat",
#      post_modifiers: [
#        %PrepositionalPhrase{
#          head: "on",
#          object: %NounPhrase{determiner: "the", head: "mat"}
#        }
#      ]
#    }
```
## Sentence Parsing
### Algorithm: Clause Detection with Coordination and Subordination
**Module**: `Nasty.Language.English.SentenceParser`
**Approach**: Split on sentence boundaries, then parse each sentence into clauses with support for simple, compound, and complex structures.
### Sentence Structures
1. **Simple**: Single independent clause
- `"The cat sat."`
2. **Compound**: Multiple coordinated independent clauses
- `"The cat sat and the dog ran."`
3. **Complex**: Independent clause with subordinate clause(s)
- `"The cat sat because it was tired."`
4. **Fragment**: Incomplete sentence (e.g., subordinate clause alone)
### Sentence Functions
Inferred from punctuation:
- `.` → `:declarative` (statement)
- `?` → `:interrogative` (question)
- `!` → `:exclamative` (exclamation)
### Parsing Process
#### 1. Sentence Boundary Detection
Split on sentence-ending punctuation (`.`, `!`, `?`):
```elixir
split_sentences(tokens)
# Groups tokens into sentence units
```
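One way to implement the grouping is `Enum.chunk_while/4`, closing a chunk whenever a sentence-final punctuation token is seen. A sketch, not the module's actual implementation:
```elixir
# Groups tokens into sentence units, keeping the final punctuation token
# inside the sentence it terminates.
def split_sentences(tokens) do
  Enum.chunk_while(
    tokens,
    [],
    fn token, acc ->
      if token.text in [".", "!", "?"] do
        {:cont, Enum.reverse([token | acc]), []}
      else
        {:cont, [token | acc]}
      end
    end,
    fn
      [] -> {:cont, []}
      acc -> {:cont, Enum.reverse(acc), []}
    end
  )
end
```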
#### 2. Clause Parsing
For each sentence group, parse into clause structure:
**Grammar**:
```
Sentence → Clause+
Clause → SubordConj? NP? VP
```
**Three Clause Types**:
- **Independent**: Can stand alone as complete sentence
- **Subordinate**: Begins with subordinating conjunction (`because`, `if`, `when`, etc.)
- **Relative**: Part of relative clause structure (handled in phrase parsing)
#### 3. Coordination Detection
Look for coordinating conjunctions (`:cconj`):
- `and`, `or`, `but`, `nor`, `yet`, `so`, `for`
If found, split and parse both sides:
```elixir
"The cat sat and the dog ran"
# Split at "and"
# Parse: Clause1 ("The cat sat") + Clause2 ("the dog ran")
# Result: [Clause1, Clause2]
```
#### 4. Subordination Detection
Check for subordinating conjunction (`:sconj`) at start:
- `after`, `although`, `because`, `before`, `if`, `since`, `when`, `while`, etc.
If found, mark clause as subordinate:
```elixir
"because it was tired"
# Parse: Clause with subordinator: "because"
# Type: :subordinate
```
### Simple Clause Parsing
**Algorithm**: Find verb, split at verb to identify subject and predicate.
**Steps**:
1. Find first verb/auxiliary in token sequence
2. **If verb at position 0**: Imperative sentence (no subject)
- Parse VP starting at position 0
- Subject = nil
3. **If verb at position > 0**: Declarative sentence
- Try to parse NP before verb (subject)
- Parse VP starting at end of subject (predicate)
4. **If no subject found**: Try VP alone (imperative or fragment)
**Fallback**: If parsing fails, create minimal clause with first verb found.
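The verb-anchored split can be sketched as follows; `parse_noun_phrase/2` and `parse_verb_phrase/2` are assumed to follow the phrase-parser contract, and the fallback branch from step 4 is omitted:
```elixir
def parse_simple_clause(tokens) do
  case Enum.find_index(tokens, &(&1.pos_tag in [:verb, :aux])) do
    nil ->
      :error

    0 ->
      # Verb in first position: imperative, no subject.
      with {:ok, predicate, _next} <- parse_verb_phrase(tokens, 0),
           do: {:ok, %Clause{type: :independent, subject: nil, predicate: predicate}}

    _verb_index ->
      # Declarative: the NP before the verb is the subject, the VP after it
      # (starting where the subject ended) is the predicate.
      with {:ok, subject, next} <- parse_noun_phrase(tokens, 0),
           {:ok, predicate, _} <- parse_verb_phrase(tokens, next) do
        {:ok, %Clause{type: :independent, subject: subject, predicate: predicate}}
      end
  end
end
```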
### Clause Structure
```elixir
%Clause{
  type: :independent | :subordinate | :relative,
  subordinator: Token.t() | nil,    # "because", "if", etc.
  subject: NounPhrase.t() | nil,
  predicate: VerbPhrase.t(),
  language: :en,
  span: span
}
```
### Sentence Structure
```elixir
%Sentence{
  function: :declarative | :interrogative | :exclamative,
  structure: :simple | :compound | :complex | :fragment,
  main_clause: Clause.t(),
  additional_clauses: [Clause.t()],    # for compound sentences
  language: :en,
  span: span
}
```
### Example
```elixir
tokens = tokenize_and_tag("The cat sat and the dog ran.")
{:ok, [sentence]} = SentenceParser.parse_sentences(tokens)
# => %Sentence{
#      function: :declarative,
#      structure: :compound,
#      main_clause: %Clause{
#        type: :independent,
#        subject: NP("The cat"),
#        predicate: VP("sat")
#      },
#      additional_clauses: [
#        %Clause{
#          type: :independent,
#          subject: NP("the dog"),
#          predicate: VP("ran")
#        }
#      ]
#    }
```
## Dependency Extraction
### Algorithm: Phrase Structure to Universal Dependencies Conversion
**Module**: `Nasty.Language.English.DependencyExtractor`
**Approach**: Traverse phrase structure AST and extract grammatical relations as Universal Dependencies (UD) relations.
### Universal Dependencies Relations
Nasty uses the UD relation taxonomy:
**Core Arguments**:
- `nsubj` - nominal subject
- `obj` - direct object
- `iobj` - indirect object
**Non-Core Dependents**:
- `obl` - oblique nominal (prepositional complement to verb)
- `advmod` - adverbial modifier
- `aux` - auxiliary verb
**Nominal Dependents**:
- `det` - determiner
- `amod` - adjectival modifier
- `nmod` - nominal modifier (prepositional complement to noun)
- `case` - case marking (preposition)
**Clausal Dependents**:
- `acl` - adnominal clause (relative clause)
- `mark` - subordinating marker
**Coordination**:
- `conj` - conjunct
- `cc` - coordinating conjunction
### Extraction Process
#### 1. Sentence-Level Extraction
```elixir
extract(sentence)
# Extracts from main_clause + additional_clauses
```
#### 2. Clause-Level Extraction
For each clause:
1. **Subject Dependency**: `nsubj(predicate_head, subject_head)`
- Extract head token from subject NP
- Extract head token from predicate VP
- Create dependency relation
2. **Predicate Dependencies**: Extract from VP (see below)
3. **Subordinator Dependency** (if present): `mark(predicate_head, subordinator)`
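A sketch of the clause-level step: build `nsubj` and, when a subordinator is present, `mark`, then hand off to the VP extraction. `head_of/1` (returning the head token of a phrase) and `extract_verb_phrase/2` are assumed helpers:
```elixir
def extract_clause(%Clause{} = clause) do
  verb = head_of(clause.predicate)

  subject_deps =
    if clause.subject,
      do: [%Dependency{relation: :nsubj, head: verb, dependent: head_of(clause.subject)}],
      else: []

  marker_deps =
    if clause.subordinator,
      do: [%Dependency{relation: :mark, head: verb, dependent: clause.subordinator}],
      else: []

  subject_deps ++ marker_deps ++ extract_verb_phrase(clause.predicate, verb)
end
```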
#### 3. Noun Phrase Dependencies
From NP structure:
1. **Determiner**: `det(head, determiner)`
- `"the cat"` → `det(cat, the)`
2. **Adjectival Modifiers**: `amod(head, modifier)`
- `"big cat"` → `amod(cat, big)`
3. **Post-modifiers**:
- **PP**: `case(pp_object_head, preposition)` + `nmod(np_head, pp_object_head)`
- `"cat on mat"` → `case(mat, on)` + `nmod(cat, mat)`
- **Relative Clause**: `mark(clause_head, relativizer)` + `acl(np_head, clause_head)`
- `"cat that sits"` → `mark(sits, that)` + `acl(cat, sits)`
#### 4. Verb Phrase Dependencies
From VP structure:
1. **Auxiliaries**: `aux(main_verb, auxiliary)`
- `"is running"` → `aux(running, is)`
2. **Complements**:
- **Direct Object NP**: `obj(verb, np_head)`
- `"saw cat"` → `obj(saw, cat)`
- **PP Complement**: `case(pp_object, preposition)` + `obl(verb, pp_object)`
- `"sat on mat"` → `case(mat, on)` + `obl(sat, mat)`
- **Adverb**: `advmod(verb, adverb)`
- `"ran quickly"` → `advmod(ran, quickly)`
#### 5. Prepositional Phrase Dependencies
From PP structure:
1. **Case Marking**: `case(pp_object_head, preposition)`
2. **Oblique/Nominal Modifier**:
- If governor is verb: `obl(governor, pp_object_head)`
- If governor is noun: `nmod(governor, pp_object_head)`
### Dependency Structure
```elixir
%Dependency{
  relation: :nsubj | :obj | :det | ...,
  head: Token.t(),         # Governor token
  dependent: Token.t(),    # Dependent token
  span: span
}
```
### Example
```elixir
# Input: "The cat sat on the mat."
sentence = parse("The cat sat on the mat.")
dependencies = DependencyExtractor.extract(sentence)
# => [
# %Dependency{relation: :det, head: "cat", dependent: "the"},
# %Dependency{relation: :nsubj, head: "sat", dependent: "cat"},
# %Dependency{relation: :case, head: "mat", dependent: "on"},
# %Dependency{relation: :det, head: "mat", dependent: "the"},
# %Dependency{relation: :obl, head: "sat", dependent: "mat"}
# ]
```
### Visualization
Dependencies can be visualized as a directed graph:
```mermaid
graph TD
Root["sat (ROOT)"]
Cat[cat]
Mat[mat]
The1[the]
On[on]
The2[the]
Root -->|nsubj| Cat
Root -->|obl| Mat
Cat -->|det| The1
Mat -->|case| On
Mat -->|det| The2
```
## Integration Example
Complete pipeline from text to dependencies:
```elixir
alias Nasty.Language.English.{
Tokenizer, POSTagger, Morphology,
PhraseParser, SentenceParser, DependencyExtractor
}
# Input text
text = "The big cat sat on the mat."
# Step 1: Tokenization
{:ok, tokens} = Tokenizer.tokenize(text)
# => [Token("The"), Token("big"), Token("cat"), ...]
# Step 2: POS Tagging (choose model)
{:ok, tagged} = POSTagger.tag_pos(tokens, model: :neural)
# => [Token("The", :det), Token("big", :adj), Token("cat", :noun), ...]
# Step 3: Morphological Analysis
{:ok, analyzed} = Morphology.analyze(tagged)
# => [Token("The", :det, lemma: "the"), ...]
# Step 4: Sentence Parsing (includes phrase parsing internally)
{:ok, sentences} = SentenceParser.parse_sentences(analyzed)
# => [Sentence(...)]
# Step 5: Dependency Extraction
sentence = hd(sentences)
dependencies = DependencyExtractor.extract(sentence)
# => [Dependency(:det, "cat", "The"), ...]
# Result: Complete AST with dependencies
sentence
# => %Sentence{
#      main_clause: %Clause{
#        subject: %NounPhrase{
#          determiner: Token("The"),
#          modifiers: [Token("big")],
#          head: Token("cat")
#        },
#        predicate: %VerbPhrase{
#          head: Token("sat"),
#          complements: [
#            %PrepositionalPhrase{
#              head: Token("on"),
#              object: %NounPhrase{...}
#            }
#          ]
#        }
#      }
#    }
```
## Performance Considerations
### Model Selection
**For Production**:
- Use neural models for highest accuracy
- Cache loaded models in memory
- Batch sentences for GPU acceleration (if available)
**For Development/Testing**:
- Use rule-based for fastest iteration
- HMM for good balance of speed and accuracy
### Optimization Tips
1. **Batch Processing**: Process multiple sentences together
2. **Model Caching**: Load models once, reuse across requests
3. **Lazy Loading**: Only load neural models when needed
4. **Parallel Processing**: Use `Task.async_stream` for multiple sentences
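For instance, the parallel-processing tip amounts to the following, where `parse_one/1` stands in for the per-sentence pipeline (a sketch, not an existing helper):
```elixir
results =
  sentences
  |> Task.async_stream(&parse_one/1, max_concurrency: System.schedulers_online())
  |> Enum.map(fn {:ok, result} -> result end)
```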
### Accuracy Benchmarks
Tested on Universal Dependencies English-EWT test set:
| Component | Accuracy |
|-----------|----------|
| Tokenization | 99.9% |
| Rule-based POS | 85% |
| HMM POS | 95% |
| Neural POS | 97-98% |
| Phrase Parsing | 87% (F1) |
| Dependency Extraction | 82% (UAS) |
## Further Reading
- [Universal Dependencies](https://universaldependencies.org/) - UD relations and guidelines
- [Penn Treebank POS Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
- [NimbleParsec Documentation](https://hexdocs.pm/nimble_parsec/)
- [Axon Neural Networks](https://hexdocs.pm/axon/)
- See `docs/ARCHITECTURE.md` for overall system design
- See `docs/NEURAL_MODELS.md` for neural network details