# Nasty Architecture
This document describes the architecture of Nasty, a language-agnostic NLP library for Elixir that treats natural language with the same rigor as programming languages.
## Design Philosophy
Nasty is built on three core principles:
1. **Grammar-First**: Treat natural language as a formal grammar with an Abstract Syntax Tree (AST), similar to how compilers handle programming languages
2. **Language-Agnostic**: Use behaviours to define a common interface, allowing multiple natural languages to coexist
3. **Pure Elixir**: No external NLP dependencies; built entirely in Elixir using NimbleParsec and functional programming patterns
## System Architecture
### High-Level Overview
```mermaid
flowchart TD
API["Public API (Nasty)<br/>parse/2, render/2, summarize/2, to_code/2, explain_code/2"]
Registry["Language Registry<br/>Manages language implementations & auto-detection"]
English["Nasty.Language.English<br/>(Full implementation)"]
Other["Nasty.Language.Spanish/Catalan<br/>(Future)"]
Pipeline["NLP Pipeline<br/>Tokenization → POS Tagging → Parsing → Semantic Analysis"]
AST["AST Structures<br/>Document → Paragraph → Sentence → Clause → Phrases → Token"]
Translation["Translation System"]
Operations["AST Operations<br/>Query, Validation, Transform, Traversal"]
API --> Registry
Registry --> English
Registry --> Other
English --> Pipeline
Pipeline --> AST
AST --> Translation
AST --> Operations
```
## Core Components
### 1. Language Behaviour System
`Nasty.Language.Behaviour` defines the interface that every language implementation must provide:
#### Required Callbacks
```elixir
@callback language_code() :: atom()
@callback tokenize(String.t(), options()) :: {:ok, [Token.t()]} | {:error, term()}
@callback tag_pos([Token.t()], options()) :: {:ok, [Token.t()]} | {:error, term()}
@callback parse([Token.t()], options()) :: {:ok, Document.t()} | {:error, term()}
@callback render(struct(), options()) :: {:ok, String.t()} | {:error, term()}
```
#### Optional Callbacks
```elixir
@callback metadata() :: map()
@optional_callbacks metadata: 0
```
#### Benefits
- **Pluggability**: New languages can be added without changing core code
- **Type Safety**: The compiler warns about missing callbacks, and Dialyzer can check implementations against the callback specs
- **Consistency**: All languages provide the same interface
- **Testing**: Easy to mock and test language-specific behavior
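The pluggability claim can be illustrated with a trimmed-down behaviour. This is a minimal sketch, not Nasty's actual code: `Sketch.Language`, `Sketch.Language.Toy`, and `Sketch.Core` are hypothetical names, and the contract is reduced to two required callbacks plus one optional one.

```elixir
defmodule Sketch.Language do
  @moduledoc "Hypothetical, reduced stand-in for Nasty.Language.Behaviour."

  @callback language_code() :: atom()
  @callback tokenize(String.t(), keyword()) :: {:ok, [map()]} | {:error, term()}
  @callback metadata() :: map()
  @optional_callbacks metadata: 0
end

defmodule Sketch.Language.Toy do
  @moduledoc "Toy implementation (assumption): whitespace tokenization only."
  @behaviour Sketch.Language

  @impl true
  def language_code, do: :toy

  @impl true
  def tokenize(text, _opts) when is_binary(text) do
    tokens = text |> String.split() |> Enum.map(fn word -> %{text: word} end)
    {:ok, tokens}
  end
end

defmodule Sketch.Core do
  @moduledoc "Core code depends on the contract, never on a concrete language."

  def token_count(language_module, text) do
    {:ok, tokens} = language_module.tokenize(text, [])
    length(tokens)
  end
end
```

Because `Sketch.Core` only calls callbacks, a test can pass in any module that implements the behaviour, which is what makes mocking straightforward.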
### 2. Language Registry
The `Nasty.Language.Registry` is an Agent-based registry that:
- **Registers** language implementations at runtime
- **Validates** that implementations comply with the behaviour
- **Provides lookup** by language code (`:en`, `:es`, `:ca`)
- **Detects language** from text using heuristics
```elixir
# Registration (happens at application startup)
Registry.register(Nasty.Language.English)
# Lookup
{:ok, module} = Registry.get(:en)
# Detection
{:ok, :en} = Registry.detect_language("Hello world")
```
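How an Agent-based registry with validation might work can be sketched as follows. The module names, the list of required functions, and the error atoms are assumptions for illustration; the real `Nasty.Language.Registry` may differ.

```elixir
defmodule Sketch.ToyLanguage do
  # Hypothetical language module exposing the two functions the sketch checks.
  def language_code, do: :toy
  def tokenize(text, _opts), do: {:ok, String.split(text)}
end

defmodule Sketch.Registry do
  use Agent

  # Assumed subset of the contract that registration verifies.
  @required [language_code: 0, tokenize: 2]

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  @doc "Stores the module under its language code if it exports the required functions."
  def register(module) when is_atom(module) do
    with {:module, ^module} <- Code.ensure_loaded(module),
         true <- Enum.all?(@required, fn {name, arity} -> function_exported?(module, name, arity) end) do
      Agent.update(__MODULE__, &Map.put(&1, module.language_code(), module))
    else
      _ -> {:error, :invalid_implementation}
    end
  end

  @doc "Looks up the implementation registered for a language code."
  def get(code) do
    case Agent.get(__MODULE__, &Map.fetch(&1, code)) do
      {:ok, module} -> {:ok, module}
      :error -> {:error, :language_not_found}
    end
  end
end
```

After `Sketch.Registry.start_link()`, calling `Sketch.Registry.register(Sketch.ToyLanguage)` returns `:ok`, while registering a module that lacks the functions (for example `String`) returns `{:error, :invalid_implementation}`.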
### 3. NLP Pipeline
Each language implementation follows a multi-stage pipeline:
#### Stage 1: Tokenization
**Purpose**: Split raw text into atomic units (tokens)
**Responsibilities**:
- Sentence boundary detection
- Word segmentation
- Contraction handling ("don't" → "do" + "n't")
- Position tracking (line, column, byte offsets)
**Implementation**: NimbleParsec combinators for efficient parsing
**Output**: `[Token.t()]` with text and position information
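The responsibilities above can be sketched without dependencies. Note the swap: this example uses a plain `Regex` instead of NimbleParsec so that it runs standalone, and the token map keys are assumptions rather than the real `Token` struct.

```elixir
defmodule Sketch.Tokenizer do
  @doc """
  Splits text into word, contraction, and punctuation tokens.
  Each token carries a 1-based {line, column} and byte offsets.
  """
  def tokenize(text) when is_binary(text) do
    # "n't" is split off as its own token ("don't" -> "do" + "n't").
    pattern = ~r/n't|\w+?(?=n't\b)|\w+|[^\w\s]/u

    pattern
    |> Regex.scan(text, return: :index)
    |> Enum.map(fn [{start_byte, byte_len}] ->
      %{
        text: binary_part(text, start_byte, byte_len),
        start_pos: line_and_column(text, start_byte),
        start_byte: start_byte,
        end_byte: start_byte + byte_len
      }
    end)
  end

  # Derives line and column from everything that precedes the token.
  defp line_and_column(text, byte_offset) do
    lines = text |> binary_part(0, byte_offset) |> String.split("\n")
    {length(lines), String.length(List.last(lines)) + 1}
  end
end
```

Recomputing the line and column from the prefix is quadratic in the worst case; a production tokenizer would track positions incrementally while scanning.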
#### Stage 2: POS Tagging
**Purpose**: Assign part-of-speech tags and morphological features
**Responsibilities**:
- Tag assignment using Universal Dependencies tagset
- Morphological analysis (tense, number, person, case, etc.)
- Lemmatization (reduce to dictionary form)
**Methods**:
- Rule-based tagging
- Statistical models (HMM)
- Hybrid approaches
**Output**: `[Token.t()]` with `pos_tag`, `lemma`, and `morphology` filled
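A rule-based tagger of the kind listed under Methods can be sketched as a lexicon lookup followed by suffix heuristics. The lexicon entries, the lowercase tag atoms, and the module name are illustrative assumptions; only the tag names follow Universal Dependencies.

```elixir
defmodule Sketch.RuleTagger do
  # Closed-class words are looked up directly (tiny illustrative lexicon).
  @lexicon %{
    "the" => :det, "a" => :det, "an" => :det,
    "on" => :adp, "in" => :adp, "of" => :adp,
    "is" => :aux, "are" => :aux,
    "he" => :pron, "she" => :pron, "it" => :pron
  }

  @doc "Adds a :pos_tag key to every token map."
  def tag(tokens) do
    Enum.map(tokens, fn token -> Map.put(token, :pos_tag, tag_word(token.text)) end)
  end

  # Rules are tried in order; the first match wins, nouns are the default.
  defp tag_word(word) do
    lower = String.downcase(word)

    cond do
      Map.has_key?(@lexicon, lower) -> Map.fetch!(@lexicon, lower)
      String.match?(word, ~r/^[[:punct:]]+$/) -> :punct
      String.ends_with?(lower, "ly") -> :adv
      String.ends_with?(lower, ["ing", "ed"]) -> :verb
      true -> :noun
    end
  end
end
```

Suffix rules like these misfire on words such as "family" or "bed", which is one reason the document lists statistical and hybrid approaches alongside them.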
#### Stage 3: Parsing
**Purpose**: Build hierarchical syntactic structure
**Responsibilities**:
- Phrase structure parsing (NP, VP, PP, AP, AdvP)
- Clause identification (independent, subordinate, relative)
- Sentence structure determination (simple, compound, complex)
- Document and paragraph organization
**Approaches**:
- Recursive descent parsing
- Chart parsing (future)
- Statistical parsing (future)
**Output**: `Document.t()` with complete AST hierarchy
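Recursive descent parsing can be illustrated on a toy grammar. The grammar, the map-based node shapes, and the module name below are assumptions chosen for brevity; the real parser covers far more structure.

```elixir
defmodule Sketch.Parser do
  # Toy grammar:
  #   S  -> NP VP
  #   NP -> det? adj* noun
  #   VP -> verb NP?
  @doc "Parses tagged tokens into a sentence node, or returns {:error, :no_parse}."
  def parse_sentence(tokens) do
    with {:ok, np, rest} <- parse_np(tokens),
         {:ok, vp, []} <- parse_vp(rest) do
      {:ok, %{type: :sentence, subject: np, predicate: vp}}
    else
      _ -> {:error, :no_parse}
    end
  end

  defp parse_np(tokens) do
    {det, rest} = optional(tokens, :det)
    {adjs, rest} = Enum.split_while(rest, &(&1.pos_tag == :adj))

    case rest do
      [%{pos_tag: :noun} = head | rest] ->
        {:ok, %{type: :noun_phrase, determiner: det, modifiers: adjs, head: head}, rest}

      _ ->
        :error
    end
  end

  defp parse_vp([%{pos_tag: :verb} = verb | rest]) do
    # The object is optional: fall back to an intransitive verb phrase.
    case parse_np(rest) do
      {:ok, np, rest} -> {:ok, %{type: :verb_phrase, head: verb, object: np}, rest}
      :error -> {:ok, %{type: :verb_phrase, head: verb, object: nil}, rest}
    end
  end

  defp parse_vp(_tokens), do: :error

  # Consumes one token when its tag matches, otherwise consumes nothing.
  defp optional([%{pos_tag: tag} = token | rest], tag), do: {token, rest}
  defp optional(tokens, _tag), do: {nil, tokens}
end
```

Each grammar rule maps to one function that consumes tokens from the front of the list and returns the remainder, which is the defining shape of a recursive descent parser.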
#### Stage 4: Semantic Analysis (Optional)
**Purpose**: Extract meaning and relationships
**Components**:
- **Named Entity Recognition (NER)**: Identify persons, organizations, locations, dates
- **Dependency Extraction**: Extract grammatical relationships between words
- **Semantic Role Labeling (SRL)**: Identify who did what to whom
- **Coreference Resolution**: Link pronouns to referents
- **Relation Extraction**: Extract entity relationships
- **Event Extraction**: Identify events and participants
**Output**: Enriched `Document.t()` with semantic annotations
#### Stage 5: Rendering
**Purpose**: Convert AST back to natural language text
**Responsibilities**:
- Surface realization (choose correct word forms)
- Agreement enforcement (subject-verb, etc.)
- Word order application (language-specific)
- Punctuation insertion
- Capitalization and formatting
**Output**: Rendered text string
### 4. AST Structure
The AST is a hierarchical, linguistically precise representation:
```mermaid
graph TD
Doc["Document (root)"]
P1[Paragraph]
P2[Paragraph]
S1[Sentence]
S2[Sentence]
C1["Clause (main)"]
C2["Clause (subordinate)"]
Subj["Subject (NounPhrase)"]
Pred["Predicate (VerbPhrase)"]
V["Verb (Token)"]
Comp["Complement (NounPhrase)"]
Adv["Adverbial (PrepositionalPhrase)"]
Doc --> P1
Doc --> P2
P1 --> S1
P1 --> S2
S1 --> C1
S1 --> C2
C1 --> Subj
C1 --> Pred
Pred --> V
Pred --> Comp
Pred --> Adv
```
#### Node Types
**Document Nodes**:
- `Document` - Root container
- `Paragraph` - Topic-related sentences
**Sentence Nodes**:
- `Sentence` - Complete grammatical unit
- `Clause` - Subject + predicate
**Phrase Nodes**:
- `NounPhrase` - Noun-headed (the cat, big house)
- `VerbPhrase` - Verb-headed (is running, gave a book)
- `PrepositionalPhrase` - Preposition-headed (on the mat)
- `AdjectivalPhrase` - Adjective-headed (very happy)
- `AdverbialPhrase` - Adverb-headed (quite quickly)
**Atomic Nodes**:
- `Token` - Single word/punctuation with POS tag
**Semantic Nodes**:
- `Entity` - Named entity
- `Relation` - Entity relationship
- `Event` - Event with participants
- `CorefChain` - Coreference links
- `Frame` - Semantic role frame
#### Universal Properties
All nodes include:
```elixir
%{
  language: atom(),   # :en, :es, :ca
  span: %{            # Position tracking
    start_pos: {line, column},
    start_byte: integer(),
    end_pos: {line, column},
    end_byte: integer()
  }
}
```
### 5. AST Utilities
#### Query Module
Search and extract information from AST:
```elixir
Nasty.AST.Query.find_subject(sentence)
Nasty.AST.Query.extract_tokens(document)
Nasty.AST.Query.find_entities(document)
```
#### Validation Module
Ensure AST structural integrity:
```elixir
case Nasty.AST.Validation.validate(document) do
  :ok -> :ok
  {:error, errors} -> handle_errors(errors)
end
```
#### Transform Module
Modify AST nodes:
```elixir
transformed = Nasty.AST.Transform.map(document, fn node ->
  # Transform logic
  node
end)
```
#### Traversal Module
Navigate AST with different strategies:
```elixir
Nasty.AST.Traversal.pre_order(document, visitor_fn)
Nasty.AST.Traversal.post_order(document, visitor_fn)
Nasty.AST.Traversal.breadth_first(document, visitor_fn)
```
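The three strategies differ only in visiting order. The sketch below shows that difference on generic nodes, assuming each node is a map with an optional `:children` list and that the traversal returns the visitor's results as a list; the real `Nasty.AST.Traversal` signatures and return values may differ.

```elixir
defmodule Sketch.Traversal do
  @doc "Visits a node before its children."
  def pre_order(node, fun) do
    [fun.(node) | Enum.flat_map(children(node), &pre_order(&1, fun))]
  end

  @doc "Visits a node after all of its children."
  def post_order(node, fun) do
    Enum.flat_map(children(node), &post_order(&1, fun)) ++ [fun.(node)]
  end

  @doc "Visits nodes level by level, starting at the root."
  def breadth_first(node, fun), do: by_level([node], fun)

  defp by_level([], _fun), do: []

  defp by_level(level, fun) do
    Enum.map(level, fun) ++ by_level(Enum.flat_map(level, &children/1), fun)
  end

  defp children(node), do: Map.get(node, :children, [])
end
```

For a document with paragraphs `p1` (sentences `s1`, `s2`) and `p2` (sentence `s3`), pre-order yields doc, p1, s1, s2, p2, s3, post-order yields s1, s2, p1, s3, p2, doc, and breadth-first yields doc, p1, p2, s1, s2, s3.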
### 6. Statistical & Neural Models
#### Model Infrastructure
**Registry**: Agent-based model storage
- `ModelRegistry.register/2` - Store model
- `ModelRegistry.get/1` - Retrieve model
- `ModelRegistry.list_models/0` - List all
**Loader**: Serialize/deserialize models
- `ModelLoader.load/1` - Load from file
- `ModelLoader.save/2` - Save to file
- `ModelLoader.load_from_priv/1` - Load from app resources
#### Model Types
**HMM (Hidden Markov Model)**:
- POS tagging with ~95% accuracy
- Viterbi algorithm for decoding
- Fast inference, low memory
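Viterbi decoding for an HMM tagger can be sketched in a few lines. This is an illustrative version in log space, not the library's implementation: the model layout (`:states`, `:start`, `:trans`, `:emit`), the floor probability for unseen events, and the module name are all assumptions.

```elixir
defmodule Sketch.Viterbi do
  # Log-probability assigned to unseen transitions and emissions (assumed value).
  @floor :math.log(1.0e-6)

  @doc "Returns the most likely state sequence for the observations."
  def decode([], _model), do: []

  def decode([first | rest], %{states: states, start: start, trans: trans, emit: emit}) do
    # Maps each state to {best log-probability, reversed best path ending there}.
    initial =
      Map.new(states, fn s ->
        {s, {logp(start, s) + logp(emit, {s, first}), [s]}}
      end)

    final =
      Enum.reduce(rest, initial, fn obs, prev ->
        Map.new(states, fn s ->
          # Choose the predecessor that maximizes the path score into state s.
          {score, path} =
            prev
            |> Enum.map(fn {p, {p_score, p_path}} -> {p_score + logp(trans, {p, s}), p_path} end)
            |> Enum.max_by(fn {score, _path} -> score end)

          {s, {score + logp(emit, {s, obs}), [s | path]}}
        end)
      end)

    {_score, path} = final |> Map.values() |> Enum.max_by(fn {score, _path} -> score end)
    Enum.reverse(path)
  end

  defp logp(table, key) do
    case Map.fetch(table, key) do
      {:ok, p} when p > 0 -> :math.log(p)
      _ -> @floor
    end
  end
end
```

Working with log-probabilities turns products into sums, which avoids floating-point underflow on long sentences. The cost is proportional to the sentence length times the square of the number of tags.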
**BiLSTM-CRF (Neural)**:
- POS tagging with 97-98% accuracy
- Bidirectional LSTM with CRF layer
- Built with Axon/EXLA for GPU acceleration
- Character-level CNN for OOV handling
- Pre-trained embedding support
**Naive Bayes**:
- Text classification
- Multinomial variant for document classification
**Future Models**:
- PCFG (Probabilistic Context-Free Grammar) for parsing
- CRF (Conditional Random Fields) for NER
- Pre-trained transformers (BERT, RoBERTa via Bumblebee)
### 7. Code Interoperability
Bidirectional conversion between natural language and code:
#### NL → Code Pipeline
```
Natural Language
↓
Intent Recognition (parse to Intent AST)
↓
Code Generation (Intent → Elixir AST)
↓
Validation
↓
Elixir Code String
```
**Example**:
```elixir
Nasty.to_code("Filter users where age is greater than 18",
  source_language: :en,
  target_language: :elixir)
# => "Enum.filter(users, fn item -> item > 18 end)"
```
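The pipeline stages can be made concrete with a deliberately tiny example. It is a sketch under strong assumptions: intent recognition is a single regular expression that understands only "sort" and "reverse", and `Sketch.ToCode` is a hypothetical module, not how `Nasty.to_code/2` works internally.

```elixir
defmodule Sketch.ToCode do
  @doc "Runs intent recognition, code generation, and validation in sequence."
  def to_code(text) when is_binary(text) do
    with {:ok, intent} <- recognize(text),
         code = intent |> generate() |> Macro.to_string(),
         {:ok, _ast} <- Code.string_to_quoted(code) do
      {:ok, code}
    end
  end

  # Intent recognition: "<verb> [the] <variable>" for two known verbs.
  defp recognize(text) do
    normalized = text |> String.trim() |> String.downcase()

    case Regex.run(~r/^(sort|reverse)\s+(?:the\s+)?([a-z_][a-z0-9_]*)$/, normalized) do
      [_, "sort", target] -> {:ok, %{action: :sort, target: target}}
      [_, "reverse", target] -> {:ok, %{action: :reverse, target: target}}
      nil -> {:error, :unrecognized_intent}
    end
  end

  # Code generation: build Elixir AST with quote/unquote rather than strings.
  defp generate(%{action: :sort, target: target}) do
    quote do: Enum.sort(unquote(variable(target)))
  end

  defp generate(%{action: :reverse, target: target}) do
    quote do: Enum.reverse(unquote(variable(target)))
  end

  # String.to_atom/1 creates atoms from input; acceptable only in a sketch.
  defp variable(name), do: Macro.var(String.to_atom(name), nil)
end
```

Generating quoted expressions and printing them with `Macro.to_string/1` guarantees syntactically valid output, so the validation step mainly guards against generator bugs.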
#### Code → NL Pipeline
```
Elixir Code String
↓
Parse to Elixir AST
↓
Traverse & Explain (AST → Natural Language)
↓
Natural Language Description
```
**Example**:
```elixir
Nasty.explain_code("Enum.sort(list)",
  source_language: :elixir,
  target_language: :en)
# => "Sort list"
```
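The reverse direction can be sketched with `Code.string_to_quoted/1` and pattern matching on the resulting AST. The phrasing rules and the `Sketch.Explain` module are assumptions that cover only two `Enum` functions; anything else falls back to printing the code.

```elixir
defmodule Sketch.Explain do
  @doc "Parses a code string and describes the resulting AST in English."
  def explain(code) when is_binary(code) do
    with {:ok, ast} <- Code.string_to_quoted(code) do
      {:ok, ast |> describe() |> upcase_first()}
    end
  end

  # A remote call Enum.fun(arg) parses to
  # {{:., _, [{:__aliases__, _, [:Enum]}, :fun]}, _, [arg]}.
  defp describe({{:., _, [{:__aliases__, _, [:Enum]}, :sort]}, _, [arg]}) do
    "sort " <> describe(arg)
  end

  defp describe({{:., _, [{:__aliases__, _, [:Enum]}, :reverse]}, _, [arg]}) do
    "reverse " <> describe(arg)
  end

  # A variable parses to {name, meta, context} with an atom context.
  defp describe({name, _meta, context}) when is_atom(name) and is_atom(context) do
    Atom.to_string(name)
  end

  # Anything unrecognized is rendered back as code.
  defp describe(other), do: Macro.to_string(other)

  defp upcase_first(text) do
    {first, rest} = String.split_at(text, 1)
    String.upcase(first) <> rest
  end
end
```

Syntax errors surface as the `{:error, _}` tuple returned by the parser, so callers can handle invalid input without rescuing exceptions.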
### 8. Translation System
AST-based translation between natural languages:
#### Translation Pipeline
```
Source AST (Language A)
↓
AST Transformation (structural changes)
↓
Token Translation (lemma-to-lemma mapping)
↓
Morphological Agreement (gender/number/person)
↓
Word Order Application (language-specific rules)
↓
Target AST (Language B)
↓
Rendering
↓
Target Text
```
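The middle stages of this pipeline can be traced on a single noun phrase. The sketch below is a simplification under stated assumptions: a four-entry in-memory lexicon, flat token maps instead of AST nodes, agreement limited to the definite article, and reordering limited to phrases with exactly one noun. `Sketch.Translate` is a hypothetical module.

```elixir
defmodule Sketch.Translate do
  # Tiny English -> Spanish lexicon keyed by {lemma, part of speech}.
  @lexicon %{
    {"the", :det} => %{lemma: "el"},
    {"big", :adj} => %{lemma: "grande"},
    {"house", :noun} => %{lemma: "casa", gender: :fem},
    {"cat", :noun} => %{lemma: "gato", gender: :masc}
  }

  @doc "Translates one flat noun phrase given as tagged tokens."
  def translate_phrase(tokens) do
    tokens
    |> Enum.map(&translate_token/1)
    |> apply_agreement()
    |> apply_word_order()
    |> Enum.map_join(" ", & &1.text)
  end

  # Token translation; unknown lemmas fall back to the source lemma.
  defp translate_token(%{lemma: lemma, pos_tag: pos} = token) do
    case Map.fetch(@lexicon, {lemma, pos}) do
      {:ok, entry} -> token |> Map.merge(entry) |> Map.put(:text, entry.lemma)
      :error -> Map.put(token, :text, lemma)
    end
  end

  # Agreement: the article takes the gender of the noun ("el" -> "la").
  defp apply_agreement(tokens) do
    gender =
      Enum.find_value(tokens, :masc, fn t -> t.pos_tag == :noun && Map.get(t, :gender) end)

    Enum.map(tokens, fn
      %{pos_tag: :det, lemma: "el"} = det when gender == :fem -> %{det | text: "la"}
      other -> other
    end)
  end

  # Word order: adjectives move behind the noun when there is exactly one noun.
  defp apply_word_order(tokens) do
    {adjectives, others} = Enum.split_with(tokens, &(&1.pos_tag == :adj))

    if Enum.count(others, &(&1.pos_tag == :noun)) == 1 do
      Enum.flat_map(others, fn
        %{pos_tag: :noun} = noun -> [noun | adjectives]
        other -> [other]
      end)
    else
      tokens
    end
  end
end
```

Running the stages in this order matters: agreement needs the target noun's gender, which only exists after token translation, and reordering works on the already inflected tokens.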
**Components:**
**ASTTransformer** - Transforms AST nodes between languages:
```elixir
alias Nasty.Translation.ASTTransformer
{:ok, spanish_doc} = ASTTransformer.transform_document(english_doc, :es)
```
**TokenTranslator** - Lemma-to-lemma translation with POS awareness:
```elixir
alias Nasty.Translation.TokenTranslator
# cat (noun) → gato (noun)
translated = TokenTranslator.translate_token(token, :en, :es)
```
**Agreement** - Enforces morphological agreement:
```elixir
alias Nasty.Translation.Agreement
# Ensure "el gato" (masc) not "la gato"
adjusted = Agreement.apply_agreement(tokens, :es)
```
**WordOrder** - Applies language-specific word order:
```elixir
alias Nasty.Translation.WordOrder
# "the big house" → "la casa grande" (adjective after noun in Spanish)
ordered = WordOrder.apply_order(phrase, :es)
```
**LexiconLoader** - Manages bidirectional lexicons with ETS caching:
```elixir
alias Nasty.Translation.LexiconLoader
# Load English-Spanish lexicon
{:ok, lexicon} = LexiconLoader.load(:en, :es)
# Bidirectional lookup
"gato" = LexiconLoader.lookup(lexicon, "cat", :noun)
"cat" = LexiconLoader.lookup(lexicon, "gato", :noun)
```
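A bidirectional lexicon backed by ETS might look like the sketch below. The table layout, the `load/1` signature taking an in-memory entry list, and the `nil` result for misses are assumptions; the real `LexiconLoader` loads lexicons by language pair.

```elixir
defmodule Sketch.Lexicon do
  @doc """
  Builds an ETS table from {source, target, pos} triples.
  Each pair is stored in both directions so lookups work either way.
  """
  def load(entries) when is_list(entries) do
    table = :ets.new(:sketch_lexicon, [:set, :protected, read_concurrency: true])

    rows =
      Enum.flat_map(entries, fn {source, target, pos} ->
        [{{source, pos}, target}, {{target, pos}, source}]
      end)

    :ets.insert(table, rows)
    {:ok, table}
  end

  @doc "Returns the translation for a word and part of speech, or nil."
  def lookup(table, word, pos) do
    case :ets.lookup(table, {word, pos}) do
      [{_key, translation}] -> translation
      [] -> nil
    end
  end
end
```

Keeping both directions in one table is the simplest layout but conflates the two vocabularies; a word spelled identically in both languages would need a language tag in the key.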
**Features:**
- AST-aware translation preserving grammatical structure
- Morphological feature agreement
- Language-specific word order rules (SVO, pro-drop, adjective position)
- Idiomatic expression support
- Fallback to original text for untranslatable content
- Bidirectional translation (English ↔ Spanish, English ↔ Catalan)
### 9. Rendering & Visualization
#### Text Rendering
Convert AST to formatted text:
```elixir
Nasty.Rendering.Text.render(document)
```
#### Pretty Printing
Human-readable AST inspection:
```elixir
Nasty.Rendering.PrettyPrint.inspect(ast)
```
#### DOT Visualization
Generate Graphviz diagrams:
```elixir
{:ok, dot} = Nasty.Rendering.Visualization.to_dot(ast)
File.write!("ast.dot", dot)
```
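Producing DOT output amounts to numbering nodes and emitting one label line per node and one edge line per parent-child pair. The sketch below assumes generic nodes with a `:label` string and an optional `:children` list; `Sketch.Dot` is hypothetical and unrelated to the internals of `Nasty.Rendering.Visualization`.

```elixir
defmodule Sketch.Dot do
  @doc "Renders a tree of %{label: ..., children: [...]} maps as a DOT digraph."
  def to_dot(tree) do
    {lines, _next_id} = walk(tree, 0)
    "digraph AST {\n" <> Enum.map_join(lines, "\n", &("  " <> &1)) <> "\n}\n"
  end

  # Returns {dot_lines, next_free_id}; the visited node itself receives `id`.
  defp walk(node, id) do
    own = "n#{id} [label=\"#{escape(node.label)}\"];"

    node
    |> Map.get(:children, [])
    |> Enum.reduce({[own], id + 1}, fn child, {lines, next_id} ->
      {child_lines, after_child} = walk(child, next_id)
      {lines ++ ["n#{id} -> n#{next_id};"] ++ child_lines, after_child}
    end)
  end

  # Double quotes inside labels must be escaped in DOT strings.
  defp escape(label), do: String.replace(label, "\"", "\\\"")
end
```

The resulting file can be rendered with Graphviz, for example `dot -Tpng ast.dot -o ast.png`.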
#### JSON Export
Export to JSON for external tools:
```elixir
{:ok, json} = Nasty.Rendering.Visualization.to_json(ast)
```
### 10. Data Layer
#### CoNLL-U Support
Parse and generate Universal Dependencies format:
```elixir
{:ok, sentences} = Nasty.Data.CoNLLU.parse_file("corpus.conllu")
conllu_string = Nasty.Data.CoNLLU.format(sentence)
```
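For readers unfamiliar with the format: CoNLL-U stores one token per line in ten tab-separated fields, comments start with `#`, and a blank line separates sentences. A minimal parser for that layout could look as follows; the field atoms and the `Sketch.CoNLLU` module are assumptions, and multiword-token ranges are not treated specially.

```elixir
defmodule Sketch.CoNLLU do
  # The ten CoNLL-U columns, in file order.
  @fields [:id, :form, :lemma, :upos, :xpos, :feats, :head, :deprel, :deps, :misc]

  @doc "Parses CoNLL-U text into a list of sentences, each a list of token maps."
  def parse(text) when is_binary(text) do
    text
    |> String.split("\n\n", trim: true)
    |> Enum.map(&parse_sentence/1)
  end

  defp parse_sentence(block) do
    block
    |> String.split("\n", trim: true)
    |> Enum.reject(&String.starts_with?(&1, "#"))
    |> Enum.map(&parse_token/1)
  end

  defp parse_token(line) do
    case String.split(line, "\t") do
      values when length(values) == 10 ->
        token = @fields |> Enum.zip(values) |> Map.new()
        %{token | feats: parse_feats(token.feats)}

      _ ->
        raise ArgumentError, "expected 10 tab-separated fields, got: #{inspect(line)}"
    end
  end

  # FEATS is "_" when empty, otherwise "Key=Value" pairs joined by "|".
  defp parse_feats("_"), do: %{}

  defp parse_feats(feats) do
    feats
    |> String.split("|")
    |> Map.new(fn pair ->
      [key, value] = String.split(pair, "=", parts: 2)
      {key, value}
    end)
  end
end
```

All fields other than `feats` stay as strings here; converting `id` and `head` to integers is left out because multiword-token and empty-node IDs are not plain integers.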
#### Corpus Management
Manage training corpora:
```elixir
{:ok, corpus} = Nasty.Data.Corpus.load("path/to/corpus")
stats = Nasty.Data.Corpus.statistics(corpus)
```
## Application Supervision
```elixir
defmodule Nasty.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Language Registry Agent
      Nasty.Language.Registry,
      # Model Registry Agent
      Nasty.Statistics.ModelRegistry
    ]

    opts = [strategy: :one_for_one, name: Nasty.Supervisor]

    # Register languages only after the supervisor (and registry) has started
    with {:ok, pid} <- Supervisor.start_link(children, opts) do
      Nasty.Language.Registry.register(Nasty.Language.English)
      {:ok, pid}
    end
  end
end
```
## Extension Points
### Adding a New Language
1. Implement `Nasty.Language.Behaviour`
2. Create language module in `lib/language/your_language/`
3. Implement required callbacks
4. Register in `application.ex`
5. Add tests
See [Language Guide](LANGUAGE_GUIDE.md) for details.
### Adding New NLP Features
1. Create module in appropriate layer (`lib/language/`, `lib/semantic/`, etc.)
2. Define behaviour if language-agnostic
3. Implement for each language
4. Add to pipeline if needed
5. Update AST if new node types needed
### Adding Statistical Models
1. Implement model training in `lib/statistics/`
2. Create Mix task for training
3. Add model to registry
4. Integrate into pipeline
## Performance Considerations
### Efficiency
- **NimbleParsec**: Compiled parser combinators for fast tokenization
- **Agent-based registries**: Fast in-memory lookup
- **Streaming**: Process documents incrementally where possible
- **Lazy evaluation**: Use streams for large corpora
### Scalability
- **Stateless processing**: Pipeline stages are pure functions; only the registries hold state
- **Concurrent processing**: Parse multiple documents in parallel
- **Distributed**: Can run across multiple nodes (future)
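Because the pipeline stages do not share state, documents can be processed concurrently with the standard library alone. The sketch below uses `Task.async_stream/3`; `Sketch.Parallel` and its `parse_fun` parameter are hypothetical stand-ins for whatever function parses a single document.

```elixir
defmodule Sketch.Parallel do
  @doc """
  Applies parse_fun to every text concurrently and returns results in input order.
  A crash or timeout in any task crashes the caller.
  """
  def parse_all(texts, parse_fun) when is_function(parse_fun, 1) do
    texts
    |> Task.async_stream(parse_fun,
      max_concurrency: System.schedulers_online(),
      ordered: true
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

Since `Task.async_stream/3` is lazy, `texts` can itself be a stream (for example from `File.stream!/1`), so a large corpus never has to be loaded into memory at once.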
## Testing Strategy
### Unit Tests
- Test each module in isolation
- Use `async: true` for parallel execution
- Mock language implementations when testing core
### Integration Tests
- Test full pipeline from text to AST
- Test rendering round-trips
- Test code interoperability
### Property-Based Testing
- Generate random ASTs and validate
- Test parsing/rendering round-trips
- Verify AST invariants
## Future Directions
### Architecture Evolution
1. **Generic Layers**: Extract `lib/parsing/`, `lib/semantic/`, `lib/operations/`
2. **Plugin System**: Dynamic language loading
3. **Streaming Pipeline**: Process infinite text streams
4. **Distributed Processing**: Multi-node coordination
### Advanced Features
1. **Neural Models**: Transformer-based parsing and tagging
2. **Multi-lingual**: True cross-language support
3. **Incremental Parsing**: Update AST on edits
4. **Error Recovery**: Graceful handling of malformed input
## See Also
- [API Documentation](API.md)
- [AST Reference](AST_REFERENCE.md)
- [Language Guide](LANGUAGE_GUIDE.md)
- [User Guide](USER_GUIDE.md)