# Language Implementation Guide
This guide explains how to add support for a new natural language to Nasty.
## Overview
Adding a new language requires:
1. Implementing the `Nasty.Language.Behaviour`
2. Creating language-specific parsing/tagging modules
3. Registering the language
4. Adding tests and resources
## Step-by-Step Guide
### Step 1: Create Language Module
Create `lib/language/your_language.ex`:
```elixir
defmodule Nasty.Language.YourLanguage do
@moduledoc """
Your Language implementation for Nasty.
Provides tokenization, POS tagging, parsing, and rendering for YourLanguage.
"""
@behaviour Nasty.Language.Behaviour
alias Nasty.AST.{Document, Token}
@impl true
def language_code, do: :yl # Your ISO 639-1 code
@impl true
def tokenize(text, _opts \\ []) do
# Implement tokenization
# See lib/language/english/tokenizer.ex for reference
{:ok, tokens}
end
@impl true
def tag_pos(tokens, _opts \\ []) do
# Implement POS tagging
# See lib/language/english/pos_tagger.ex for reference
{:ok, tagged_tokens}
end
@impl true
def parse(tokens, _opts \\ []) do
# Implement parsing
# See lib/language/english.ex for reference
{:ok, document}
end
@impl true
def render(ast, _opts \\ []) do
# Implement rendering
# Use Nasty.Rendering.Text as a base
{:ok, text}
end
@impl true
def metadata do
%{
version: "1.0.0",
features: [:tokenization, :pos_tagging, :parsing, :rendering]
}
end
end
```
### Step 2: Implement Tokenization
Create `lib/language/your_language/tokenizer.ex`:
```elixir
defmodule Nasty.Language.YourLanguage.Tokenizer do
@moduledoc"""
Tokenizer for YourLanguage using NimbleParsec.
"""
import NimbleParsec
alias Nasty.AST.{Node, Token}
# Define language-specific patterns
word = ascii_string([?a..?z, ?A..?Z], min: 1)
punctuation = ascii_char([?., ?!, ?,, ?;, ?:])
whitespace = ascii_string([?\s, ?\n, ?\t], min: 1)
defparsec :token, choice([word, punctuation])
def tokenize(text) do
# Implement tokenization logic
# Return {:ok, [Token.t()]} | {:error, reason}
end
end
```
### Step 3: Implement POS Tagging
Create `lib/language/your_language/pos_tagger.ex`:
```elixir
defmodule Nasty.Language.YourLanguage.POSTagger do
@moduledoc """
Part-of-speech tagger for YourLanguage.
Uses Universal Dependencies tagset for consistency.
"""
alias Nasty.AST.Token
def tag(tokens, _opts \\ []) do
# Implement tagging logic
tagged = Enum.map(tokens, &tag_token/1)
{:ok, tagged}
end
defp tag_token(token) do
# Assign POS tag based on rules or statistical model
%{token | pos_tag: determine_tag(token.text)}
end
defp determine_tag(word) do
# Your tagging logic
:noun # placeholder
end
end
```
### Step 4: Implement Morphology
Create `lib/language/your_language/morphology.ex`:
```elixir
defmodule Nasty.Language.YourLanguage.Morphology do
@moduledoc """
Morphological analysis for YourLanguage.
"""
def lemmatize(word) do
# Return base form of word
end
def analyze(token) do
# Return morphological features
%{
number: :singular,
tense: :present,
# ... other features
}
end
end
```
### Step 5: Implement Parsing
Create parsing modules for phrase and sentence structure:
`lib/language/your_language/phrase_parser.ex`:
```elixir
defmodule Nasty.Language.YourLanguage.PhraseParser do
@moduledoc """
Builds phrase structures (NP, VP, PP) for YourLanguage.
"""
alias Nasty.AST.{NounPhrase, VerbPhrase, PrepositionalPhrase}
def parse_noun_phrase(tokens) do
# Build NounPhrase from tokens
end
def parse_verb_phrase(tokens) do
# Build VerbPhrase from tokens
end
end
```
`lib/language/your_language/sentence_parser.ex`:
```elixir
defmodule Nasty.Language.YourLanguage.SentenceParser do
@moduledoc """
Builds sentence and clause structures for YourLanguage.
"""
alias Nasty.AST.{Sentence, Clause}
def parse_sentence(tokens) do
# Build Sentence with clauses
end
end
```
### Step 6: Register Language
Add to `lib/nasty/application.ex`:
```elixir
defmodule Nasty.Application do
use Application
def start(_type, _args) do
# ... existing code ...
# Register languages
:ok = Nasty.Language.Registry.register(Nasty.Language.English)
:ok = Nasty.Language.Registry.register(Nasty.Language.YourLanguage) # Add this
result
end
end
```
### Step 7: Add Language Detection
Update `lib/language/registry.ex` to support your language:
```elixir
# Add character set scoring
defp character_set_score(text, :yl) do
# Score based on your language's character set
end
# Add common word scoring
defp common_word_score(words, :yl) do
common_words = MapSet.new(["word1", "word2", ...])
score_against_common_words(words, common_words)
end
```
### Step 8: Add Resources
Create resource files in `priv/languages/your_language/`:
```
priv/languages/your_language/
├── lexicons/
│ ├── irregular_verbs.txt
│ ├── irregular_nouns.txt
│ └── stop_words.txt
└── grammars/
└── phrase_rules.ex
```
### Step 9: Add Tests
Create `test/language/your_language_test.exs`:
```elixir
defmodule Nasty.Language.YourLanguageTest do
use ExUnit.Case, async: true
alias Nasty.Language.YourLanguage
describe "tokenize/2" do
test "tokenizes simple sentence" do
{:ok, tokens} = YourLanguage.tokenize("Simple sentence.", [])
assert length(tokens) == 3
end
end
describe "tag_pos/2" do
test "tags parts of speech" do
{:ok, tokens} = YourLanguage.tokenize("Word.", [])
{:ok, tagged} = YourLanguage.tag_pos(tokens, [])
assert hd(tagged).pos_tag != nil
end
end
describe "parse/2" do
test "parses to document AST" do
text = "Simple sentence."
{:ok, tokens} = YourLanguage.tokenize(text, [])
{:ok, tagged} = YourLanguage.tag_pos(tokens, [])
{:ok, doc} = YourLanguage.parse(tagged, [])
assert %Nasty.AST.Document{} = doc
assert doc.language == :yl
end
end
describe "render/2" do
test "renders AST to text" do
# Create simple AST
# Test rendering
end
end
end
```
## Language-Specific Considerations
### Word Order
Different languages have different word orders:
- **SVO** (Subject-Verb-Object): English, Spanish
- **SOV**: Japanese, Korean
- **VSO**: Welsh, Arabic (Classical)
Implement word order in your `render/2` function.
### Morphology
Languages vary in morphological complexity:
- **Isolating**: Chinese (minimal morphology)
- **Agglutinative**: Turkish, Finnish (many affixes)
- **Fusional**: Spanish, Russian (inflection)
Implement appropriate morphological analysis.
### Syntax
Consider language-specific syntax:
- **Gender agreement**: Spanish, French
- **Case marking**: German, Russian
- **Postpositions vs. Prepositions**: Japanese vs. English
- **Relative clause placement**: English vs. Japanese
### Punctuation
Handle language-specific punctuation:
- **Quotation marks**: «» in French, 「」 in Japanese
- **Question marks**: ¿? in Spanish
- **Spacing**: No spaces in Chinese
## Universal Dependencies
Always use Universal Dependencies standards:
### POS Tags
Use UD POS tags: `:noun`, `:verb`, `:adj`, etc.
### Dependency Relations
Use UD dependency relations: `:nsubj`, `:obj`, `:obl`, etc.
### Morphological Features
Use UD features: `number: :singular`, `tense: :past`, etc.
## Testing Checklist
- [ ] Tokenization handles edge cases (contractions, URLs, etc.)
- [ ] POS tagging achieves reasonable accuracy (>90%)
- [ ] Parser handles all sentence types
- [ ] Rendering produces grammatical output
- [ ] Language detection works correctly
- [ ] All tests pass
- [ ] Documentation is complete
## Example Implementations
### Spanish Implementation ✓
Spanish is fully implemented and serves as a reference for adding new languages.
See the complete implementation in `lib/language/spanish/`.
**Key Features Implemented**:
- ✓ Gender agreement (el gato, la gata)
- ✓ Inverted punctuation (¿Cómo estás?, ¡Hola!)
- ✓ Verb conjugations (all tenses)
- ✓ Clitic pronouns (dámelo, dáselo)
- ✓ Complete adapter pattern (3 adapters, 843 lines)
- ✓ Spanish discourse markers, stop words, entity lexicons
- ✓ 45% code reduction through generic algorithm reuse
**Quick Reference**:
```elixir
defmodule Nasty.Language.Spanish do
@behaviour Nasty.Language.Behaviour
@impl true
def language_code, do: :es
# Complete implementation in lib/language/spanish/
# See docs/languages/SPANISH_IMPLEMENTATION.md for details
end
```
**Adapters**:
- `Spanish.Adapters.SummarizerAdapter` (241 lines)
- `Spanish.Adapters.EntityRecognizerAdapter` (346 lines)
- `Spanish.Adapters.CoreferenceResolverAdapter` (256 lines)
For a complete guide, see:
- [SPANISH_IMPLEMENTATION.md](languages/SPANISH_IMPLEMENTATION.md) - Full documentation
- `examples/spanish_example.exs` - Working code examples
- `test/language/spanish/` - Test suite
### Catalan Implementation ✓
Catalan is fully implemented (Phases 1-7) and demonstrates language-specific features.
See the implementation in `lib/language/catalan/` (7 modules) and documentation in `docs/languages/CATALAN.md`.
**Key Features Implemented**:
- ✓ Interpunct handling (col·laborar, intel·ligent)
- ✓ Apostrophe contractions (l', d', s', n', m', t')
- ✓ Article contractions (del, al, pel)
- ✓ 10 Catalan diacritics (à, è, é, í, ï, ò, ó, ú, ü, ç)
- ✓ 3 verb conjugation classes (-ar, -re, -ir)
- ✓ Post-nominal adjectives and flexible word order
- ✓ Full parsing pipeline (phrase/sentence parsing, dependencies, NER)
- ✓ Externalized grammar rules (phrase_rules.exs, dependency_rules.exs)
- ✓ 74 comprehensive tests, 100% passing
**Quick Reference**:
```elixir
defmodule Nasty.Language.Catalan do
@behaviour Nasty.Language.Behaviour
@impl true
def language_code, do: :ca
# Complete implementation in lib/language/catalan/
# See docs/languages/CATALAN.md for details
end
```
**Modules**:
- `Catalan.Tokenizer` (145 lines)
- `Catalan.POSTagger` (509 lines)
- `Catalan.Morphology` (519 lines)
- `Catalan.PhraseParser` (334 lines)
- `Catalan.SentenceParser` (281 lines)
- `Catalan.DependencyExtractor` (226 lines)
- `Catalan.EntityRecognizer` (285 lines)
For complete details, see:
- [CATALAN.md](languages/CATALAN.md) - Full documentation
- `test/language/catalan/` - Test suite (74 tests)
## Resources
- [Universal Dependencies](https://universaldependencies.org/)
- [ISO 639-1 Language Codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
- [NimbleParsec Documentation](https://hexdocs.pm/nimble_parsec)
## See Also
- [Architecture](ARCHITECTURE.md)
- [API Documentation](API.md)
- [AST Reference](AST_REFERENCE.md)