# Translation System Guide

Comprehensive guide to Nasty's AST-based translation system for natural language translation between English, Spanish, and Catalan.

## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Quick Start](#quick-start)
4. [Core Components](#core-components)
5. [Translation Pipeline](#translation-pipeline)
6. [Morphological Agreement](#morphological-agreement)
7. [Word Order Rules](#word-order-rules)
8. [Lexicon Management](#lexicon-management)
9. [Supported Language Pairs](#supported-language-pairs)
10. [Customization](#customization)
11. [Best Practices](#best-practices)
12. [Limitations](#limitations)
13. [Future Enhancements](#future-enhancements)
14. [See Also](#see-also)

## Overview

Nasty's translation system operates at the Abstract Syntax Tree (AST) level, providing grammatically aware translation that preserves linguistic structure. Unlike token-by-token machine translation, this approach:

- Preserves grammatical relationships
- Applies morphological agreement rules
- Handles language-specific word order
- Supports bidirectional translation
- Enables round-trip translation with minimal loss

## Architecture

### System Diagram

```mermaid
flowchart TD
    A["Source Text<br/>(Language A)"]
    B["Parse to AST<br/>(Source Lang)"]
    C["AST Transform<br/>(Structural)"] -.-> C1[ASTTransformer]
    D["Token Translate<br/>(Lemma mapping)"] -.-> D1[TokenTranslator]
    E["Agreement<br/>(Morphology)"] -.-> E1[Agreement]
    F["Word Order<br/>(Reordering)"] -.-> F1[WordOrder]
    G["Render to Text<br/>(Target Lang)"] -.-> G1[AST.Renderer]
    H["Target Text<br/>(Language B)"]
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
```

### Module Structure

```mermaid
graph TD
    Root[lib/]
    Trans[translation/]
    AST[ast/]
    Priv[priv/]
    TransSub[translation/]
    Lex[lexicons/]
    
    Root --> Trans
    Root --> AST
    Root --> Priv
    
    Trans --> T1[translator.ex<br/>Main API]
    Trans --> T2[ast_transformer.ex<br/>AST node transformation]
    Trans --> T3[token_translator.ex<br/>Token-level translation]
    Trans --> T4[agreement.ex<br/>Morphological agreement]
    Trans --> T5[word_order.ex<br/>Word order rules]
    Trans --> T6[lexicon_loader.ex<br/>Lexicon management]
    
    AST --> A1[renderer.ex<br/>AST to text rendering]
    
    Priv --> TransSub
    TransSub --> Lex
    Lex --> L1[en_es.exs<br/>English → Spanish]
    Lex --> L2[es_en.exs<br/>Spanish → English]
    Lex --> L3[en_ca.exs<br/>English → Catalan]
    Lex --> L4[ca_en.exs<br/>Catalan → English]
```

## Quick Start

### Basic Translation

```elixir
alias Nasty.Translation.Translator

# English to Spanish
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)
{:ok, doc_es} = Translator.translate(doc_en, :es)
{:ok, text_es} = Nasty.render(doc_es)
IO.puts(text_es)
# => "El gato corre."

# Spanish to English
{:ok, doc_es} = Nasty.parse("El perro grande.", language: :es)
{:ok, doc_en} = Translator.translate(doc_es, :en)
{:ok, text_en} = Nasty.render(doc_en)
IO.puts(text_en)
# => "The big dog."
```

### Using the High-Level API

```elixir
# Translate text directly
{:ok, text_es} = Nasty.translate_text("The quick cat.", :en, :es)
# => "El gato rápido."

# Or with explicit parsing
{:ok, ast} = Nasty.parse("The house is big.", language: :en)
{:ok, translated_ast} = Nasty.translate(ast, :es)
{:ok, text} = Nasty.render(translated_ast)
```

## Core Components

### 1. ASTTransformer

Transforms AST nodes between language structures.

**Module:** `Nasty.Translation.ASTTransformer`

**Functions:**
- `transform_document/2` - Transform entire document
- `transform_sentence/2` - Transform sentence
- `transform_phrase/2` - Transform phrase structures
- `transform_clause/2` - Transform clause

**Example:**
```elixir
alias Nasty.Translation.ASTTransformer

{:ok, spanish_doc} = ASTTransformer.transform_document(english_doc, :es)
```

### 2. TokenTranslator

Performs lemma-to-lemma translation with POS awareness.

**Module:** `Nasty.Translation.TokenTranslator`

**Functions:**
- `translate_token/3` - Translate single token
- `translate_with_morphology/3` - Translate preserving morphology
- `lookup_translation/3` - Lookup in lexicon

**Example:**
```elixir
alias Nasty.Translation.TokenTranslator

# cat (noun) → gato (noun)
translated = TokenTranslator.translate_token(token, :en, :es)

# Preserves morphology
# cats (noun, plural) → gatos (noun, plural)
translated = TokenTranslator.translate_with_morphology(token, :en, :es)
```
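
To make "POS awareness" concrete, here is a minimal sketch of a lemma lookup keyed by part of speech. `LemmaLookupSketch` and its inline lexicon are hypothetical and do not reflect how `TokenTranslator` is actually implemented; the point is only that the same source lemma can map to different targets depending on its tag.

```elixir
defmodule LemmaLookupSketch do
  # Hypothetical inline lexicon, organized by POS tag.
  @lexicon %{
    noun: %{"run" => "carrera", "cat" => "gato"},
    verb: %{"run" => "correr"}
  }

  # Returns {:ok, target_lemma}, or :error when the lexicon has no entry
  # for this lemma under this POS tag.
  def translate_lemma(lemma, pos) do
    with {:ok, entries} <- Map.fetch(@lexicon, pos) do
      Map.fetch(entries, String.downcase(lemma))
    end
  end
end

LemmaLookupSketch.translate_lemma("run", :verb)
# => {:ok, "correr"}
LemmaLookupSketch.translate_lemma("run", :noun)
# => {:ok, "carrera"}
LemmaLookupSketch.translate_lemma("run", :adj)
# => :error
```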

### 3. Agreement

Enforces morphological agreement rules (gender, number, person).

**Module:** `Nasty.Translation.Agreement`

**Functions:**
- `apply_agreement/2` - Apply all agreement rules
- `apply_determiner_noun/2` - Determiner-noun agreement
- `apply_noun_adjective/2` - Noun-adjective agreement
- `apply_subject_verb/2` - Subject-verb agreement

**Example:**
```elixir
alias Nasty.Translation.Agreement

# Ensure "el gato" (masculine) not "la gato"
adjusted = Agreement.apply_agreement(tokens, :es)

# Ensure "los gatos grandes" (plural agreement throughout)
adjusted = Agreement.apply_agreement(tokens, :es)
```
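
As an illustration of what a determiner-noun rule has to do, the sketch below picks the Spanish definite article that matches a noun's gender and number. The module name and the plain-map token shape (`:pos`, `:text`, `:gender`, `:number`) are assumptions made for this example, not Nasty's token struct.

```elixir
defmodule DeterminerAgreementSketch do
  # Spanish definite articles indexed by {gender, number}.
  @definite %{
    {:masculine, :singular} => "el",
    {:feminine, :singular} => "la",
    {:masculine, :plural} => "los",
    {:feminine, :plural} => "las"
  }

  # Replaces the determiner's text with the form that matches the noun.
  def agree(%{pos: :det} = det, %{pos: :noun, gender: gender, number: number}) do
    %{det | text: Map.fetch!(@definite, {gender, number})}
  end
end

det = %{pos: :det, text: "el"}
noun = %{pos: :noun, text: "casas", gender: :feminine, number: :plural}

DeterminerAgreementSketch.agree(det, noun)
# => %{pos: :det, text: "las"}
```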

### 4. WordOrder

Applies language-specific word order transformations.

**Module:** `Nasty.Translation.WordOrder`

**Functions:**
- `apply_order/2` - Apply all word order rules
- `apply_adjective_order/2` - Position adjectives correctly
- `apply_svo_order/2` - Subject-Verb-Object ordering
- `handle_clitics/2` - Clitic placement

**Example:**
```elixir
alias Nasty.Translation.WordOrder

# "the big house" → "la casa grande" (adjective after noun)
ordered = WordOrder.apply_order(phrase, :es)

# "I eat it" → "Lo como" (clitic before verb in Spanish)
ordered = WordOrder.handle_clitics(phrase, :es)
```

### 5. LexiconLoader

Manages bidirectional lexicons with ETS caching for fast lookup.

**Module:** `Nasty.Translation.LexiconLoader`

**Functions:**
- `load/2` - Load lexicon for language pair
- `lookup/3` - Look up translation
- `reload/2` - Reload lexicon from file

**Example:**
```elixir
alias Nasty.Translation.LexiconLoader

# Load lexicon (cached in ETS)
{:ok, lexicon} = LexiconLoader.load(:en, :es)

# Bidirectional lookup
"gato" = LexiconLoader.lookup(lexicon, "cat", :noun)
"cat" = LexiconLoader.lookup(lexicon, "gato", :noun)

# Reload after editing lexicon file
LexiconLoader.reload(:en, :es)
```
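
The caching idea can be sketched with a named ETS table that stores one lexicon per language pair and only invokes a loader function on a cache miss. `LexiconCacheSketch`, its table name, and the `loader` callback are hypothetical; only the `:ets` calls are standard Erlang/OTP. Note that in this sketch the table is owned by whichever process first creates it.

```elixir
defmodule LexiconCacheSketch do
  @table :lexicon_cache_sketch

  # Creates the named ETS table once; later calls are no-ops.
  def ensure_table do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    end

    :ok
  end

  # Returns the cached lexicon for the pair, calling `loader` only on a miss.
  def fetch(source, target, loader) when is_function(loader, 0) do
    ensure_table()
    key = {source, target}

    case :ets.lookup(@table, key) do
      [{^key, lexicon}] ->
        lexicon

      [] ->
        lexicon = loader.()
        :ets.insert(@table, {key, lexicon})
        lexicon
    end
  end
end

loader = fn -> %{noun: %{"cat" => "gato"}} end
lexicon = LexiconCacheSketch.fetch(:en, :es, loader)
lexicon.noun["cat"]
# => "gato"
```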

### 6. AST.Renderer

Renders AST back to natural language text.

**Module:** `Nasty.AST.Renderer`

**Functions:**
- `render_document/1` - Render complete document
- `render_sentence/1` - Render single sentence
- `render_phrase/1` - Render phrase
- `render_tokens/1` - Render token sequence

**Example:**
```elixir
alias Nasty.AST.Renderer

# Render with proper spacing and punctuation
{:ok, text} = Renderer.render_document(document)

# Render phrase
{:ok, text} = Renderer.render_phrase(noun_phrase)
# => "el gato grande"
```
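
"Proper spacing and punctuation" can be illustrated with a small sketch that joins token texts, attaches closing punctuation to the preceding word, and capitalizes the first character. `RenderSketch` works on plain strings and is a hypothetical stand-in, not the `Nasty.AST.Renderer` implementation; Spanish opening marks such as `¿` and `¡` are not handled.

```elixir
defmodule RenderSketch do
  @no_space_before [".", ",", ";", ":", "!", "?"]

  # Joins token texts with single spaces, attaches closing punctuation to
  # the preceding word, and capitalizes the first character.
  def render_tokens(tokens) do
    tokens
    |> Enum.reduce("", fn
      token, "" -> token
      token, acc when token in @no_space_before -> acc <> token
      token, acc -> acc <> " " <> token
    end)
    |> capitalize_first()
  end

  defp capitalize_first(""), do: ""

  defp capitalize_first(text) do
    # Upcase only the first grapheme; the rest of the text is left as-is.
    {first, rest} = String.split_at(text, 1)
    String.upcase(first) <> rest
  end
end

RenderSketch.render_tokens(["el", "gato", "grande", "corre", "."])
# => "El gato grande corre."
```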

## Translation Pipeline

### Step-by-Step Process

#### 1. Parse Source Text

```elixir
alias Nasty.Language.English

text = "The quick brown fox jumps."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)
```

**AST Structure:**
```mermaid
graph TD
    Doc["Document (language: :en)"]
    Para[Paragraph]
    Sent[Sentence]
    Clause[Clause]
    Subj["Subject: NounPhrase"]
    Det["Determiner: 'The'"]
    Mod["Modifiers: ['quick', 'brown']"]
    Head1["Head: 'fox'"]
    Pred["Predicate: VerbPhrase"]
    Head2["Head: 'jumps'"]
    
    Doc --> Para
    Para --> Sent
    Sent --> Clause
    Clause --> Subj
    Clause --> Pred
    Subj --> Det
    Subj --> Mod
    Subj --> Head1
    Pred --> Head2
```

#### 2. Transform AST Structure

```elixir
alias Nasty.Translation.ASTTransformer

{:ok, doc_es} = ASTTransformer.transform_document(doc, :es)
```

This step changes `language: :en` to `language: :es` throughout the AST.

#### 3. Translate Tokens

```elixir
alias Nasty.Translation.TokenTranslator

# For each token in AST:
# "fox" (noun) → "zorro" (noun)
# "jumps" (verb) → "salta" (verb)
```

#### 4. Apply Agreement

```elixir
alias Nasty.Translation.Agreement

# Ensure gender/number agreement:
# "el" (masculine singular) + "zorro" (masculine singular) ✓
# "los" (masculine plural) + "zorros" (masculine plural) ✓
```

#### 5. Apply Word Order

```elixir
alias Nasty.Translation.WordOrder

# "the quick brown fox" → "el zorro rápido pardo"
# (adjectives after noun in Spanish for most adjectives)
```

#### 6. Render to Text

```elixir
alias Nasty.AST.Renderer

{:ok, text} = Renderer.render_document(doc_es)
# => "El zorro rápido pardo salta."
```
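
The six steps above can be read as a chain of functions where each stage either passes its result on or stops the run with an error. The sketch below shows that composition with toy stages over a list of words; `PipelineSketch`, the inline lexicon, and the stage functions are all hypothetical and far simpler than the real modules.

```elixir
defmodule PipelineSketch do
  # Runs each stage in order, stopping at the first {:error, reason}.
  def run(input, stages) do
    Enum.reduce_while(stages, {:ok, input}, fn stage, {:ok, value} ->
      case stage.(value) do
        {:ok, _} = ok -> {:cont, ok}
        {:error, _} = error -> {:halt, error}
      end
    end)
  end
end

# Toy lexicon and stages; each stage stands in for one step above.
lexicon = %{"the" => "el", "quick" => "rápido", "fox" => "zorro", "jumps" => "salta"}

translate = fn words ->
  case Enum.find(words, &(not Map.has_key?(lexicon, &1))) do
    nil -> {:ok, Enum.map(words, &Map.fetch!(lexicon, &1))}
    missing -> {:error, {:unknown_word, missing}}
  end
end

# Moves the adjective behind the noun: [det, adj, noun, ...] -> [det, noun, adj, ...]
reorder = fn [det, adj, noun | rest] -> {:ok, [det, noun, adj | rest]} end

render = fn words -> {:ok, Enum.join(words, " ")} end

PipelineSketch.run(["the", "quick", "fox", "jumps"], [translate, reorder, render])
# => {:ok, "el zorro rápido salta"}
PipelineSketch.run(["the", "quick", "wolf", "jumps"], [translate, reorder, render])
# => {:error, {:unknown_word, "wolf"}}
```

Returning tagged tuples from every stage keeps failure handling in one place, which mirrors the `{:ok, ...}` results used throughout this guide.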

## Morphological Agreement

### Gender Agreement

Spanish and Catalan have grammatical gender (masculine/feminine).

**Determiner-Noun:**
```elixir
# English: "the cat"
# Spanish: "el gato" (masculine)

# English: "the house"
# Spanish: "la casa" (feminine)
```

**Noun-Adjective:**
```elixir
# English: "the red car"
# Spanish: "el carro rojo" (masculine)

# English: "the red house"
# Spanish: "la casa roja" (feminine)
```

### Number Agreement

Determiners, nouns, and adjectives must agree in number.

```elixir
# English: "the cats"
# Spanish: "los gatos" (plural)

# English: "the big cats"
# Spanish: "los gatos grandes" (plural throughout)
```
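
Number agreement presupposes a way to form plurals. The sketch below covers only the regular Spanish patterns (`-z` becomes `-ces`, a final vowel takes `-s`, other endings take `-es`). `PluralSketch` is hypothetical; invariant nouns such as "lunes" and accent shifts such as "canción" → "canciones" are deliberately out of scope.

```elixir
defmodule PluralSketch do
  # Final vowels that take a plain "-s" in the regular pattern.
  @vowels ~w(a e i o u á é ó)

  # Regular Spanish plural: "-z" becomes "-ces", a final vowel takes "-s",
  # and any other ending takes "-es".
  def pluralize(word) do
    cond do
      String.ends_with?(word, "z") -> String.replace_suffix(word, "z", "ces")
      String.ends_with?(word, @vowels) -> word <> "s"
      true -> word <> "es"
    end
  end
end

Enum.map(["gato", "grande", "ciudad", "luz"], &PluralSketch.pluralize/1)
# => ["gatos", "grandes", "ciudades", "luces"]
```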

### Person Agreement

Subject-verb agreement by grammatical person.

```elixir
# English: "I run"
# Spanish: "Yo corro" (first person singular)

# English: "They run"
# Spanish: "Ellos corren" (third person plural)
```
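
For regular verbs, person agreement reduces to choosing an ending by person and number. The following sketch conjugates regular Spanish verbs in the present indicative; `ConjugationSketch` is a hypothetical helper, and irregular or stem-changing verbs (for example "dormir" → "duermo") are not covered.

```elixir
defmodule ConjugationSketch do
  # Order of the forms in each ending list below.
  @slots [
    {1, :singular}, {2, :singular}, {3, :singular},
    {1, :plural}, {2, :plural}, {3, :plural}
  ]

  # Present-indicative endings for regular verbs, keyed by infinitive ending.
  @endings %{
    "ar" => ~w(o as a amos áis an),
    "er" => ~w(o es e emos éis en),
    "ir" => ~w(o es e imos ís en)
  }

  # Returns {:ok, form}, or :error for an unknown ending or person/number.
  def conjugate(infinitive, person, number) do
    {stem, ending} = String.split_at(infinitive, -2)

    with {:ok, forms} <- Map.fetch(@endings, ending),
         index when is_integer(index) <-
           Enum.find_index(@slots, &(&1 == {person, number})) do
      {:ok, stem <> Enum.at(forms, index)}
    else
      _ -> :error
    end
  end
end

ConjugationSketch.conjugate("correr", 1, :singular)
# => {:ok, "corro"}
ConjugationSketch.conjugate("correr", 3, :plural)
# => {:ok, "corren"}
```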

## Word Order Rules

### SVO Order

English, Spanish, and Catalan all use Subject-Verb-Object (SVO) as their basic clause order, so the main constituents usually keep their relative positions:

```elixir
# English: "The cat eats fish."
# Spanish: "El gato come pescado."
# Catalan: "El gat menja peix."
```

### Adjective Position

**English:** Adjectives before nouns
```
"the red car"
"the big house"
```

**Spanish/Catalan:** Most adjectives after nouns
```
"el carro rojo" (the car red)
"la casa grande" (the house big)
```

**Exceptions:** Some adjectives usually precede the noun
```
"el buen libro" (the good book) - usually preferred over "el libro bueno"
"la primera vez" (the first time) - usually preferred over "la vez primera"
```
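
The default-plus-exceptions rule can be sketched as a partition of the adjectives into those that go before the noun and those that follow it. `AdjectiveOrderSketch`, its noun-phrase map shape, and the short `@prenominal` list are assumptions for illustration only.

```elixir
defmodule AdjectiveOrderSketch do
  # Hypothetical list of Spanish adjective forms that usually precede the noun.
  @prenominal ~w(buen gran mal primer primera)

  # Takes a noun phrase as %{det: ..., adjectives: [...], noun: ...} and
  # returns the Spanish surface order as a list of words.
  def order(%{det: det, adjectives: adjectives, noun: noun}) do
    {before_noun, after_noun} = Enum.split_with(adjectives, &(&1 in @prenominal))
    [det] ++ before_noun ++ [noun] ++ after_noun
  end
end

AdjectiveOrderSketch.order(%{det: "el", adjectives: ["rojo"], noun: "carro"})
# => ["el", "carro", "rojo"]
AdjectiveOrderSketch.order(%{det: "el", adjectives: ["buen", "rojo"], noun: "libro"})
# => ["el", "buen", "libro", "rojo"]
```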

### Clitic Placement

**Spanish clitics** (lo, la, me, te, se) precede a conjugated verb as a separate word and attach to the end of an infinitive:

```elixir
# English: "I see it"
# Spanish: "Lo veo" (clitic before conjugated verb)

# English: "I want to see it"
# Spanish: "Quiero verlo" (clitic after infinitive)
```
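
The two placements shown above can be expressed as two function clauses that match on the shape of the verb group. `CliticSketch` and its `%{form: ..., text: ...}` verb maps are hypothetical; the sketch always attaches the clitic to the infinitive, although Spanish also allows "Lo quiero ver".

```elixir
defmodule CliticSketch do
  # Finite verb alone: the clitic goes before it as a separate word.
  def place(clitic, [%{form: :finite, text: verb}]) do
    [clitic, verb]
  end

  # Finite verb + infinitive: the clitic is attached to the infinitive.
  def place(clitic, [%{form: :finite, text: verb}, %{form: :infinitive, text: inf}]) do
    [verb, inf <> clitic]
  end
end

CliticSketch.place("lo", [%{form: :finite, text: "veo"}])
# => ["lo", "veo"]
CliticSketch.place("lo", [
  %{form: :finite, text: "quiero"},
  %{form: :infinitive, text: "ver"}
])
# => ["quiero", "verlo"]
```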

## Lexicon Management

### Lexicon Format

Lexicons are Elixir maps organized by POS tag:

```elixir
# priv/translation/lexicons/en_es.exs
%{
  noun: %{
    "cat" => "gato",
    "house" => "casa",
    "book" => "libro"
  },
  verb: %{
    "run" => "correr",
    "eat" => "comer",
    "sleep" => "dormir"
  },
  adj: %{
    "big" => "grande",
    "red" => "rojo",
    "quick" => "rápido"
  },
  det: %{
    "the" => "el",
    "a" => "un",
    "some" => "algunos"
  }
}
```

### Morphological Information

Include gender/number for target language:

```elixir
%{
  noun: %{
    "cat" => %{lemma: "gato", gender: :masculine},
    "house" => %{lemma: "casa", gender: :feminine},
    "dog" => %{lemma: "perro", gender: :masculine}
  }
}
```
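
Because entries may then be either a plain string or a map with morphological data, code that reads a lexicon benefits from normalizing both shapes first. A minimal sketch, assuming a hypothetical `LexiconEntrySketch` helper that is not part of Nasty:

```elixir
defmodule LexiconEntrySketch do
  # Plain string entries carry no gender information.
  def normalize(lemma) when is_binary(lemma), do: %{lemma: lemma, gender: nil}

  # Map entries keep their lemma and gender (nil when gender is absent).
  def normalize(%{lemma: lemma} = entry) do
    %{lemma: lemma, gender: Map.get(entry, :gender)}
  end
end

nouns = %{"cat" => "gato", "house" => %{lemma: "casa", gender: :feminine}}

normalized =
  Map.new(nouns, fn {source, entry} ->
    {source, LexiconEntrySketch.normalize(entry)}
  end)

normalized["cat"].lemma
# => "gato"
normalized["house"].gender
# => :feminine
```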

### Idiomatic Expressions

Handle multi-word expressions:

```elixir
%{
  idioms: %{
    "kick the bucket" => "estirar la pata",
    "break the ice" => "romper el hielo",
    "piece of cake" => "pan comido"
  }
}
```
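
One way to apply such a table is to split the text into idiom and non-idiom segments before word-level translation, so that idioms are replaced as a unit. The sketch below does this with a regular expression built from the idiom keys; `IdiomSketch` is hypothetical, matches only the exact base form (not "kicked the bucket"), and says nothing about how Nasty itself handles idioms.

```elixir
defmodule IdiomSketch do
  @idioms %{
    "kick the bucket" => "estirar la pata",
    "break the ice" => "romper el hielo",
    "piece of cake" => "pan comido"
  }

  # Splits text into {:idiom, translation} segments for known idioms and
  # {:text, original} segments for everything else.
  def segment(text) do
    alternatives =
      @idioms
      |> Map.keys()
      # Longest idioms first, so a longer idiom wins over a shorter prefix.
      |> Enum.sort_by(&String.length/1, :desc)
      |> Enum.map_join("|", &Regex.escape/1)

    regex = Regex.compile!("\\b(?:" <> alternatives <> ")\\b", "i")

    regex
    |> Regex.split(text, include_captures: true, trim: true)
    |> Enum.map(fn part ->
      case Map.fetch(@idioms, String.downcase(part)) do
        {:ok, translation} -> {:idiom, translation}
        :error -> {:text, part}
      end
    end)
  end
end

IdiomSketch.segment("He will kick the bucket soon")
# => [{:text, "He will "}, {:idiom, "estirar la pata"}, {:text, " soon"}]
```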

### Custom Lexicons

Add domain-specific vocabulary:

```elixir
# priv/translation/lexicons/custom_tech_en_es.exs
%{
  noun: %{
    "widget" => "componente",
    "server" => "servidor",
    "database" => "base de datos"
  },
  verb: %{
    "deploy" => "desplegar",
    "compile" => "compilar",
    "debug" => "depurar"
  }
}
```

Load custom lexicons:
```elixir
LexiconLoader.load(:en, :es, path: "priv/translation/lexicons/custom_tech_en_es.exs")
```
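
When a domain lexicon should override a general one, the two POS-keyed maps can be merged so that custom entries win on conflict. This is a sketch of that merge with a hypothetical `LexiconMergeSketch` module; it does not describe what `LexiconLoader.load/3` does internally.

```elixir
defmodule LexiconMergeSketch do
  # Merges two POS-keyed lexicons; on conflicts the custom entry wins.
  def merge(base, custom) do
    Map.merge(base, custom, fn _pos, base_entries, custom_entries ->
      Map.merge(base_entries, custom_entries)
    end)
  end
end

base = %{noun: %{"server" => "camarero", "cat" => "gato"}}
custom = %{noun: %{"server" => "servidor"}, verb: %{"deploy" => "desplegar"}}

merged = LexiconMergeSketch.merge(base, custom)
merged.noun["server"]
# => "servidor"
merged.noun["cat"]
# => "gato"
merged.verb["deploy"]
# => "desplegar"
```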

## Supported Language Pairs

### Direct Pairs

- **English ↔ Spanish** - Full bidirectional support
- **English ↔ Catalan** - Full bidirectional support

### Transitive Pairs

- **Spanish ↔ Catalan** - Via English (two-step translation)

```elixir
# Spanish → Catalan (via English)
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)
{:ok, doc_en} = Translator.translate(doc_es, :en)
{:ok, doc_ca} = Translator.translate(doc_en, :ca)
{:ok, text_ca} = Nasty.render(doc_ca)
# => "El gat corre."
```
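
The choice between a direct and a two-step translation can be made mechanically from the list of direct pairs. The sketch below returns the sequence of hops for a given source and target, using English as the pivot; `PivotRouteSketch` is a hypothetical helper and not part of the `Translator` API.

```elixir
defmodule PivotRouteSketch do
  # Language pairs with a direct lexicon, as listed above.
  @direct [{:en, :es}, {:es, :en}, {:en, :ca}, {:ca, :en}]
  @pivot :en

  # Same language: nothing to translate.
  def route(lang, lang), do: {:ok, []}

  # Returns the list of translation hops from source to target.
  def route(source, target) do
    cond do
      {source, target} in @direct ->
        {:ok, [{source, target}]}

      {source, @pivot} in @direct and {@pivot, target} in @direct ->
        {:ok, [{source, @pivot}, {@pivot, target}]}

      true ->
        {:error, :unsupported_pair}
    end
  end
end

PivotRouteSketch.route(:en, :es)
# => {:ok, [{:en, :es}]}
PivotRouteSketch.route(:es, :ca)
# => {:ok, [{:es, :en}, {:en, :ca}]}
PivotRouteSketch.route(:en, :fr)
# => {:error, :unsupported_pair}
```

Each hop can introduce its own lexical and agreement errors, so a pivot route is generally expected to be less accurate than a direct pair.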

## Customization

### Extending Lexicons

1. Edit lexicon files in `priv/translation/lexicons/`
2. Add new entries maintaining the POS structure
3. Reload lexicons: `LexiconLoader.reload(:en, :es)`

### Custom Agreement Rules

Extend `Nasty.Translation.Agreement`:

```elixir
defmodule MyApp.CustomAgreement do
  def apply_custom_rule(tokens, _language) do
    # Custom agreement logic
    tokens
  end
end
```
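
As a concrete example of a rule one might add, Spanish feminine singular nouns that begin with a stressed "a" take "el" rather than "la" ("el agua", "el águila"). The sketch below implements that rule over a plain list of word strings; the module name, the word list, and the string-based token shape are assumptions, and real tokens would carry more structure.

```elixir
defmodule StressedAAgreementSketch do
  # Hypothetical sample of feminine nouns beginning with a stressed "a".
  @stressed_a_nouns ~w(agua águila hacha alma)

  # "la" directly before one of these nouns becomes "el".
  def apply_rule(["la", noun | rest], :es) when noun in @stressed_a_nouns do
    ["el", noun | apply_rule(rest, :es)]
  end

  def apply_rule([token | rest], :es), do: [token | apply_rule(rest, :es)]
  def apply_rule([], :es), do: []

  # Other target languages pass through unchanged.
  def apply_rule(tokens, _language), do: tokens
end

StressedAAgreementSketch.apply_rule(["la", "agua", "fría"], :es)
# => ["el", "agua", "fría"]
StressedAAgreementSketch.apply_rule(["la", "casa"], :es)
# => ["la", "casa"]
```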

### Custom Word Order Rules

Extend `Nasty.Translation.WordOrder`:

```elixir
defmodule MyApp.CustomWordOrder do
  def apply_custom_order(phrase, _language) do
    # Custom word order logic
    phrase
  end
end
```

## Best Practices

### 1. Sentence-Level Translation

Translate sentence by sentence for best results:

```elixir
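alias Nasty.Translation.Translator

# Split after sentence-final punctuation so each sentence keeps its
# terminator and no empty fragments are produced.
sentences = String.split(text, ~r/(?<=[.!?])\s+/, trim: true)

translated =
  sentences
  |> Enum.map(fn sentence ->
    {:ok, doc} = Nasty.parse(sentence, language: :en)
    {:ok, translated_doc} = Translator.translate(doc, :es)
    {:ok, rendered} = Nasty.render(translated_doc)
    rendered
  end)
  |> Enum.join(" ")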
```

### 2. Review Idiomatic Expressions

Idiomatic expressions may not translate literally:

```elixir
# "It's raining cats and dogs"
# Literal: "Está lloviendo gatos y perros" ❌
# Idiomatic: "Está lloviendo a cántaros" ✓
```

### 3. Extend Lexicons for Domain Text

For technical/specialized text, add domain vocabulary:

```elixir
# Add medical, legal, technical terms
# to custom lexicon files
```

### 4. Use for Formal/Technical Text

Best for:
- Technical documentation
- Formal correspondence
- News articles
- Academic text

Less suitable for:
- Poetry
- Idiomatic speech
- Creative writing

### 5. Verify Grammatical Gender

Some nouns have unexpected gender:

```elixir
# "problem" → "problema" (masculine in Spanish!)
# "hand" → "mano" (feminine)
```

Check lexicons and adjust if needed.

## Limitations

### Current Limitations

1. **Idiomatic Expressions**
   - May translate literally rather than idiomatically
   - Solution: Add idiom mappings to lexicons

2. **Complex Verb Tenses**
   - Some tense combinations may not map perfectly
   - Solution: Manual review for complex tenses

3. **Cultural Context**
   - Cultural references not adapted
   - Solution: Add context-aware transformations

4. **Ambiguous Words**
   - First lexicon entry used for ambiguous words
   - Solution: Add context-aware lexicon lookup

5. **Limited Language Pairs**
   - Currently English, Spanish, Catalan only
   - Solution: Add more language implementations

### Workarounds

**For idiomatic text:**
```elixir
# Pre-process idioms before translation
text = String.replace(text, "kick the bucket", "die")
```

**For ambiguous words:**
```elixir
# Use context or manual disambiguation
# "bank" (financial) vs "bank" (river)
```
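
A simple form of context-based disambiguation is to score each candidate sense by how many of its cue words appear in the surrounding sentence. The sketch below does this for "bank"; `SenseSketch`, its sense inventory, and the cue lists are invented for illustration and are not a feature of the current lexicon format.

```elixir
defmodule SenseSketch do
  # Hypothetical sense inventory: each sense lists cue words that suggest it.
  @senses %{
    "bank" => [
      %{translation: "banco", cues: ~w(money account loan deposit)},
      %{translation: "orilla", cues: ~w(river water shore fishing)}
    ]
  }

  # Picks the sense whose cue words overlap most with the context words.
  # Ties, including zero overlap, fall back to the first listed sense.
  def translate(word, context_words) do
    case Map.fetch(@senses, word) do
      {:ok, senses} ->
        best =
          Enum.max_by(senses, fn sense ->
            Enum.count(context_words, &(&1 in sense.cues))
          end)

        {:ok, best.translation}

      :error ->
        :error
    end
  end
end

SenseSketch.translate("bank", ~w(he sat by the river))
# => {:ok, "orilla"}
SenseSketch.translate("bank", ~w(she opened an account))
# => {:ok, "banco"}
```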

**For complex grammar:**
```elixir
# Simplify sentence structure before translation
# "Having been running..." → "He ran..."
```

## Future Enhancements

- Neural translation integration
- Context-aware lexicon selection
- Multi-sentence context for pronouns
- Statistical phrase translation
- User feedback learning
- More language pairs (French, German, etc.)

## See Also

- [API.md](API.md) - Translation API reference
- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture
- [USER_GUIDE.md](USER_GUIDE.md) - User guide with examples
- [CROSS_LINGUAL.md](CROSS_LINGUAL.md) - Cross-lingual transfer learning