docs/ARCHITECTURE.md

# Nasty Architecture

This document describes the architecture of Nasty, a language-agnostic NLP library for Elixir that treats natural language with the same rigor as programming languages.

## Design Philosophy

Nasty is built on three core principles:

1. **Grammar-First**: Treat natural language as a formal grammar with an Abstract Syntax Tree (AST), similar to how compilers handle programming languages
2. **Language-Agnostic**: Use behaviours to define a common interface, allowing multiple natural languages to coexist
3. **Pure Elixir**: No external NLP dependencies; built entirely in Elixir using NimbleParsec and functional programming patterns

## System Architecture

### High-Level Overview

```mermaid
flowchart TD
    API["Public API (Nasty)<br/>parse/2, render/2, summarize/2, to_code/2, explain_code/2"]
    Registry["Language Registry<br/>Manages language implementations & auto-detection"]
    English["Nasty.Language.English<br/>(Full implementation)"]
    Other["Nasty.Language.Spanish/Catalan<br/>(Future)"]
    Pipeline["NLP Pipeline<br/>Tokenization → POS Tagging → Parsing → Semantic Analysis"]
    AST["AST Structures<br/>Document → Paragraph → Sentence → Clause → Phrases → Token"]
    Translation["Translation System"]
    Operations["AST Operations<br/>Query, Validation, Transform, Traversal"]
    
    API --> Registry
    Registry --> English
    Registry --> Other
    English --> Pipeline
    Pipeline --> AST
    AST --> Translation
    AST --> Operations
```

## Core Components

### 1. Language Behaviour System

The `Nasty.Language.Behaviour` defines the interface that all language implementations must follow:

#### Required Callbacks

```elixir
@callback language_code() :: atom()
@callback tokenize(String.t(), options()) :: {:ok, [Token.t()]} | {:error, term()}
@callback tag_pos([Token.t()], options()) :: {:ok, [Token.t()]} | {:error, term()}
@callback parse([Token.t()], options()) :: {:ok, Document.t()} | {:error, term()}
@callback render(struct(), options()) :: {:ok, String.t()} | {:error, term()}
```

#### Optional Callbacks

```elixir
@callback metadata() :: map()
```

#### Benefits

- **Pluggability**: New languages can be added without changing core code
- **Type Safety**: Dialyzer checks that implementations satisfy the callback specs
- **Consistency**: All languages provide the same interface
- **Testing**: Easy to mock and test language-specific behavior
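
As a sketch, a language implementation skeleton might look like the following. The module name and function bodies are placeholders; only the callback names come from the behaviour above:

```elixir
defmodule Nasty.Language.Skeleton do
  @moduledoc "Hypothetical skeleton illustrating the behaviour callbacks."
  @behaviour Nasty.Language.Behaviour

  @impl true
  def language_code, do: :xx

  @impl true
  def tokenize(text, _opts) when is_binary(text) do
    # Real implementations build %Token{} structs with NimbleParsec;
    # this placeholder simply splits on whitespace.
    {:ok, String.split(text)}
  end

  @impl true
  def tag_pos(tokens, _opts), do: {:ok, tokens}

  @impl true
  def parse(_tokens, _opts), do: {:error, :not_implemented}

  @impl true
  def render(_ast, _opts), do: {:error, :not_implemented}
end
```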

### 2. Language Registry

The `Nasty.Language.Registry` is an Agent-based registry that:

- **Registers** language implementations at runtime
- **Validates** implementations comply with the Behaviour
- **Provides lookup** by language code (`:en`, `:es`, `:ca`)
- **Detects language** from text using heuristics

```elixir
# Registration (happens at application startup)
Registry.register(Nasty.Language.English)

# Lookup
{:ok, module} = Registry.get(:en)

# Detection
{:ok, :en} = Registry.detect_language("Hello world")
```

### 3. NLP Pipeline

Each language implementation follows a multi-stage pipeline:

#### Stage 1: Tokenization

**Purpose**: Split raw text into atomic units (tokens)

**Responsibilities**:
- Sentence boundary detection
- Word segmentation
- Contraction handling ("don't" → "do" + "n't")
- Position tracking (line, column, byte offsets)

**Implementation**: NimbleParsec combinators for efficient parsing

**Output**: `[Token.t()]` with text and position information
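
A minimal sketch of this stage in use; the struct fields shown in the comments are illustrative, not the exact layout:

```elixir
{:ok, tokens} = Nasty.Language.English.tokenize("Don't stop.", [])

# Each token pairs surface text with its position, roughly:
#   %Token{text: "Do",   span: %{start_pos: {1, 1}, ...}}
#   %Token{text: "n't",  span: %{start_pos: {1, 3}, ...}}
#   %Token{text: "stop", ...}
#   %Token{text: ".",    ...}
```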

#### Stage 2: POS Tagging

**Purpose**: Assign part-of-speech tags and morphological features

**Responsibilities**:
- Tag assignment using Universal Dependencies tagset
- Morphological analysis (tense, number, person, case, etc.)
- Lemmatization (reduce to dictionary form)

**Methods**:
- Rule-based tagging
- Statistical models (HMM)
- Hybrid approaches

**Output**: `[Token.t()]` with `pos_tag`, `lemma`, and `morphology` filled
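
For example (the tag, lemma, and morphology values below are illustrative):

```elixir
{:ok, tokens} = Nasty.Language.English.tokenize("Cats sleep.", [])
{:ok, tagged} = Nasty.Language.English.tag_pos(tokens, [])

# Each token now carries its tag, lemma, and features, roughly:
#   %Token{text: "Cats",  pos_tag: :noun, lemma: "cat",   morphology: %{number: :plural}}
#   %Token{text: "sleep", pos_tag: :verb, lemma: "sleep", morphology: %{person: 3, number: :plural}}
```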

#### Stage 3: Parsing

**Purpose**: Build hierarchical syntactic structure

**Responsibilities**:
- Phrase structure parsing (NP, VP, PP, AP, AdvP)
- Clause identification (independent, subordinate, relative)
- Sentence structure determination (simple, compound, complex)
- Document and paragraph organization

**Approaches**:
- Recursive descent parsing
- Chart parsing (future)
- Statistical parsing (future)

**Output**: `Document.t()` with complete AST hierarchy
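
Chaining the first three stages together (a sketch; option handling elided):

```elixir
text = "The cat sat on the mat."

with {:ok, tokens} <- Nasty.Language.English.tokenize(text, []),
     {:ok, tagged} <- Nasty.Language.English.tag_pos(tokens, []),
     {:ok, document} <- Nasty.Language.English.parse(tagged, []) do
  # document is the root of the AST hierarchy described below:
  # %Document{paragraphs: [%Paragraph{sentences: [%Sentence{clauses: [...]}]}]}
  document
end
```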

#### Stage 4: Semantic Analysis (Optional)

**Purpose**: Extract meaning and relationships

**Components**:
- **Named Entity Recognition (NER)**: Identify persons, organizations, locations, dates
- **Dependency Extraction**: Extract grammatical relationships between words
- **Semantic Role Labeling (SRL)**: Identify who did what to whom
- **Coreference Resolution**: Link pronouns to referents
- **Relation Extraction**: Extract entity relationships
- **Event Extraction**: Identify events and participants

**Output**: Enriched `Document.t()` with semantic annotations
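
A sketch of consuming those annotations via the Query helpers documented under "AST Utilities" below; the `language:` option name and the entity fields in the comment are assumptions:

```elixir
{:ok, document} = Nasty.parse("Ada Lovelace worked in London.", language: :en)

entities = Nasty.AST.Query.find_entities(document)
# => [%Entity{label: :person, text: "Ada Lovelace"},
#     %Entity{label: :location, text: "London"}]
```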

#### Stage 5: Rendering

**Purpose**: Convert AST back to natural language text

**Responsibilities**:
- Surface realization (choose correct word forms)
- Agreement enforcement (subject-verb, etc.)
- Word order application (language-specific)
- Punctuation insertion
- Capitalization and formatting

**Output**: Rendered text string
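
A round-trip sketch using the behaviour callbacks directly:

```elixir
with {:ok, tokens} <- Nasty.Language.English.tokenize("The cat sat on the mat.", []),
     {:ok, tagged} <- Nasty.Language.English.tag_pos(tokens, []),
     {:ok, document} <- Nasty.Language.English.parse(tagged, []),
     {:ok, text} <- Nasty.Language.English.render(document, []) do
  text
end
# => "The cat sat on the mat." (modulo normalization)
```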

### 4. AST Structure

The AST is a hierarchical, linguistically precise representation:

```mermaid
graph TD
    Doc["Document (root)"]
    P1[Paragraph]
    P2[Paragraph]
    S1[Sentence]
    S2[Sentence]
    C1["Clause (main)"]
    C2["Clause (subordinate)"]
    Subj["Subject (NounPhrase)"]
    Pred["Predicate (VerbPhrase)"]
    V["Verb (Token)"]
    Comp["Complement (NounPhrase)"]
    Adv["Adverbial (PrepositionalPhrase)"]
    
    Doc --> P1
    Doc --> P2
    P1 --> S1
    P1 --> S2
    S1 --> C1
    S1 --> C2
    C1 --> Subj
    C1 --> Pred
    Pred --> V
    Pred --> Comp
    Pred --> Adv
```

#### Node Types

**Document Nodes**:
- `Document` - Root container
- `Paragraph` - Topic-related sentences

**Sentence Nodes**:
- `Sentence` - Complete grammatical unit
- `Clause` - Subject + predicate

**Phrase Nodes**:
- `NounPhrase` - Noun-headed (the cat, big house)
- `VerbPhrase` - Verb-headed (is running, gave a book)
- `PrepositionalPhrase` - Preposition-headed (on the mat)
- `AdjectivalPhrase` - Adjective-headed (very happy)
- `AdverbialPhrase` - Adverb-headed (quite quickly)

**Atomic Nodes**:
- `Token` - Single word/punctuation with POS tag

**Semantic Nodes**:
- `Entity` - Named entity
- `Relation` - Entity relationship
- `Event` - Event with participants
- `CorefChain` - Coreference links
- `Frame` - Semantic role frame

#### Universal Properties

All nodes include:

```elixir
%{
  language: atom(),  # :en, :es, :ca
  span: %{          # Position tracking
    start_pos: {line, column},
    start_byte: integer(),
    end_pos: {line, column},
    end_byte: integer()
  }
}
```

### 5. AST Utilities

#### Query Module

Search and extract information from AST:

```elixir
Nasty.AST.Query.find_subject(sentence)
Nasty.AST.Query.extract_tokens(document)
Nasty.AST.Query.find_entities(document)
```

#### Validation Module

Ensure AST structural integrity:

```elixir
case Nasty.AST.Validation.validate(document) do
  :ok -> :ok
  {:error, errors} -> handle_errors(errors)
end
```

#### Transform Module

Modify AST nodes:

```elixir
transformed = Nasty.AST.Transform.map(document, fn node ->
  # Transform logic
  node
end)
```

#### Traversal Module

Navigate AST with different strategies:

```elixir
Nasty.AST.Traversal.pre_order(document, visitor_fn)
Nasty.AST.Traversal.post_order(document, visitor_fn)
Nasty.AST.Traversal.breadth_first(document, visitor_fn)
```

### 6. Statistical & Neural Models

#### Model Infrastructure

**Registry**: Agent-based model storage
- `ModelRegistry.register/2` - Store model
- `ModelRegistry.get/1` - Retrieve model
- `ModelRegistry.list_models/0` - List all

**Loader**: Serialize/deserialize models
- `ModelLoader.load/1` - Load from file
- `ModelLoader.save/2` - Save to file
- `ModelLoader.load_from_priv/1` - Load from app resources
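
Putting the two together (a sketch; the `ModelLoader` namespace, the model name, the file path, and the `register/2` argument order are assumptions):

```elixir
# Load a serialized model from the application's priv/ directory
# and make it available under a well-known name.
{:ok, model} = Nasty.Statistics.ModelLoader.load_from_priv("models/en_pos_hmm.model")
:ok = Nasty.Statistics.ModelRegistry.register(:en_pos_hmm, model)

# Later, wherever tagging happens, retrieve it by name.
{:ok, model} = Nasty.Statistics.ModelRegistry.get(:en_pos_hmm)
```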

#### Model Types

**HMM (Hidden Markov Model)**:
- POS tagging with ~95% accuracy
- Viterbi algorithm for decoding
- Fast inference, low memory

**BiLSTM-CRF (Neural)**:
- POS tagging with 97-98% accuracy
- Bidirectional LSTM with CRF layer
- Built with Axon/EXLA for GPU acceleration
- Character-level CNN for OOV handling
- Pre-trained embedding support

**Naive Bayes**:
- Text classification
- Multinomial variant for document classification

**Future Models**:
- PCFG (Probabilistic Context-Free Grammar) for parsing
- CRF (Conditional Random Fields) for NER
- Pre-trained transformers (BERT, RoBERTa via Bumblebee)

### 7. Code Interoperability

Bidirectional conversion between natural language and code:

#### NL → Code Pipeline

```
Natural Language
    ↓
Intent Recognition (parse to Intent AST)
    ↓
Code Generation (Intent → Elixir AST)
    ↓
Validation
    ↓
Elixir Code String
```

**Example**:
```elixir
Nasty.to_code("Filter users where age is greater than 18", 
  source_language: :en, 
  target_language: :elixir)
# => "Enum.filter(users, fn item -> item > 18 end)"
```

#### Code → NL Pipeline

```
Elixir Code String
    ↓
Parse to Elixir AST
    ↓
Traverse & Explain (AST → Natural Language)
    ↓
Natural Language Description
```

**Example**:
```elixir
Nasty.explain_code("Enum.sort(list)", 
  source_language: :elixir, 
  target_language: :en)
# => "Sort list"
```

### 8. Translation System

AST-based translation between natural languages:

#### Translation Pipeline

```
Source AST (Language A)
    ↓
AST Transformation (structural changes)
    ↓
Token Translation (lemma-to-lemma mapping)
    ↓
Morphological Agreement (gender/number/person)
    ↓
Word Order Application (language-specific rules)
    ↓
Target AST (Language B)
    ↓
Rendering
    ↓
Target Text
```

**Components:**

**ASTTransformer** - Transforms AST nodes between languages:
```elixir
alias Nasty.Translation.ASTTransformer

{:ok, spanish_doc} = ASTTransformer.transform_document(english_doc, :es)
```

**TokenTranslator** - Lemma-to-lemma translation with POS awareness:
```elixir
alias Nasty.Translation.TokenTranslator

# cat (noun) → gato (noun)
translated = TokenTranslator.translate_token(token, :en, :es)
```

**Agreement** - Enforces morphological agreement:
```elixir
alias Nasty.Translation.Agreement

# Ensure "el gato" (masc) not "la gato"
adjusted = Agreement.apply_agreement(tokens, :es)
```

**WordOrder** - Applies language-specific word order:
```elixir
alias Nasty.Translation.WordOrder

# "the big house" → "la casa grande" (adjective after noun in Spanish)
ordered = WordOrder.apply_order(phrase, :es)
```

**LexiconLoader** - Manages bidirectional lexicons with ETS caching:
```elixir
alias Nasty.Translation.LexiconLoader

# Load English-Spanish lexicon
{:ok, lexicon} = LexiconLoader.load(:en, :es)

# Bidirectional lookup
"gato" = LexiconLoader.lookup(lexicon, "cat", :noun)
"cat" = LexiconLoader.lookup(lexicon, "gato", :noun)
```

**Features:**
- AST-aware translation preserving grammatical structure
- Morphological feature agreement
- Language-specific word order rules (SVO, pro-drop, adjective position)
- Idiomatic expression support
- Fallback to original text for untranslatable content
- Bidirectional translation (English ↔ Spanish, English ↔ Catalan)
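
End to end, the components compose roughly as follows (a sketch; the `language:` option names and the Spanish output are illustrative):

```elixir
alias Nasty.Translation.ASTTransformer

with {:ok, en_doc} <- Nasty.parse("The cat sleeps.", language: :en),
     {:ok, es_doc} <- ASTTransformer.transform_document(en_doc, :es),
     {:ok, es_text} <- Nasty.render(es_doc, language: :es) do
  es_text
end
# => "El gato duerme." (illustrative)
```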

### 9. Rendering & Visualization

#### Text Rendering

Convert AST to formatted text:
```elixir
Nasty.Rendering.Text.render(document)
```

#### Pretty Printing

Human-readable AST inspection:
```elixir
Nasty.Rendering.PrettyPrint.inspect(ast)
```

#### DOT Visualization

Generate Graphviz diagrams:
```elixir
{:ok, dot} = Nasty.Rendering.Visualization.to_dot(ast)
File.write("ast.dot", dot)
```

#### JSON Export

Export to JSON for external tools:
```elixir
{:ok, json} = Nasty.Rendering.Visualization.to_json(ast)
```

### 10. Data Layer

#### CoNLL-U Support

Parse and generate Universal Dependencies format:
```elixir
{:ok, sentences} = Nasty.Data.CoNLLU.parse_file("corpus.conllu")
conllu_string = Nasty.Data.CoNLLU.format(sentence)
```

#### Corpus Management

Manage training corpora:
```elixir
{:ok, corpus} = Nasty.Data.Corpus.load("path/to/corpus")
stats = Nasty.Data.Corpus.statistics(corpus)
```

## Application Supervision

```elixir
defmodule Nasty.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Language Registry Agent
      Nasty.Language.Registry,
      
      # Model Registry Agent
      Nasty.Statistics.ModelRegistry
    ]

    opts = [strategy: :one_for_one, name: Nasty.Supervisor]
    result = Supervisor.start_link(children, opts)
    
    # Register languages at startup
    Nasty.Language.Registry.register(Nasty.Language.English)
    
    result
  end
end
```

## Extension Points

### Adding a New Language

1. Implement `Nasty.Language.Behaviour`
2. Create language module in `lib/language/your_language/`
3. Implement required callbacks
4. Register in `application.ex`
5. Add tests

See [Language Guide](LANGUAGE_GUIDE.md) for details.

### Adding New NLP Features

1. Create module in appropriate layer (`lib/language/`, `lib/semantic/`, etc.)
2. Define behaviour if language-agnostic
3. Implement for each language
4. Add to pipeline if needed
5. Update AST if new node types needed

### Adding Statistical Models

1. Implement model training in `lib/statistics/`
2. Create Mix task for training
3. Add model to registry
4. Integrate into pipeline

## Performance Considerations

### Efficiency

- **NimbleParsec**: Compiled parser combinators for fast tokenization
- **Agent-based registries**: Fast in-memory lookup
- **Streaming**: Process documents incrementally where possible
- **Lazy evaluation**: Use streams for large corpora

### Scalability

- **Stateless processing**: All functions are pure
- **Concurrent processing**: Parse multiple documents in parallel (see the sketch below)
- **Distributed**: Can run across multiple nodes (future)
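
For instance, a batch of documents can be parsed concurrently with `Task.async_stream/3` (a sketch; the `language:` option name is assumed):

```elixir
texts = ["First document ...", "Second document ...", "Third document ..."]

parsed =
  texts
  |> Task.async_stream(&Nasty.parse(&1, language: :en),
    max_concurrency: System.schedulers_online()
  )
  |> Enum.map(fn {:ok, result} -> result end)
```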

## Testing Strategy

### Unit Tests

- Test each module in isolation
- Use `async: true` for parallel execution
- Mock language implementations when testing core

### Integration Tests

- Test full pipeline from text to AST
- Test rendering round-trips
- Test code interoperability

### Property-Based Testing

- Generate random ASTs and validate
- Test parsing/rendering round-trips (sketched below)
- Verify AST invariants
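
A minimal round-trip property sketch using the `stream_data` library (assumed as a test dependency; the `language:` option name is also an assumption):

```elixir
defmodule Nasty.RoundTripPropertyTest do
  use ExUnit.Case, async: true
  use ExUnitProperties

  property "parsed documents can always be rendered" do
    check all text <- StreamData.string(:alphanumeric, min_length: 1, max_length: 80) do
      case Nasty.parse(text, language: :en) do
        {:ok, document} ->
          assert {:ok, rendered} = Nasty.render(document, language: :en)
          assert is_binary(rendered)

        {:error, _reason} ->
          # Random strings may legitimately fail to parse.
          :ok
      end
    end
  end
end
```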

## Future Directions

### Architecture Evolution

1. **Generic Layers**: Extract `lib/parsing/`, `lib/semantic/`, `lib/operations/`
2. **Plugin System**: Dynamic language loading
3. **Streaming Pipeline**: Process infinite text streams
4. **Distributed Processing**: Multi-node coordination

### Advanced Features

1. **Neural Models**: Transformer-based parsing and tagging
2. **Multi-lingual**: True cross-language support
3. **Incremental Parsing**: Update AST on edits
4. **Error Recovery**: Graceful handling of malformed input

## See Also

- [API Documentation](API.md)
- [AST Reference](AST_REFERENCE.md)
- [Language Guide](LANGUAGE_GUIDE.md)
- [User Guide](USER_GUIDE.md)