# Catalan Language Support

Comprehensive Catalan language support for the Nasty NLP library.

## Status

**Implemented (Phases 1-7):**
- Tokenization with Catalan-specific features
- POS tagging with Universal Dependencies tagset
- Morphological analysis and lemmatization
- Grammar resource files (phrase and dependency rules)
- Phrase and sentence parsing (NP, VP, PP, clause detection)
- Dependency extraction (Universal Dependencies relations)
- Named entity recognition (PERSON, LOCATION, ORGANIZATION, DATE, MONEY, PERCENT)

**Pending (Phase 8):**
- Text summarization (stub implementation)
- Coreference resolution
- Semantic role labeling

## Features

### Tokenization

The Catalan tokenizer handles the following language-specific features:

- **Interpunct (l·l)**: Kept as single token
  - Example: `"Col·laborar"` → `["Col·laborar"]`
  - Marks the geminated l (ela geminada): col·laborar, intel·ligent, il·lusió

- **Apostrophe Contractions**: Separated as distinct tokens
  - Determiners: `l'` (el/la)
  - Prepositions: `d'` (de)
  - Pronouns: `s'` (es/se), `n'` (en), `m'` (me), `t'` (te)
  - Example: `"L'home d'or"` → `["L'", "home", "d'", "or"]`

- **Article Contractions**: Recognized as single tokens
  - `del` = de + el
  - `al` = a + el
  - `pel` = per + el
  - Example: `"Vaig al mercat"` → `["Vaig", "al", "mercat"]`

- **Diacritics**: Complete support for all 10 Catalan diacritics
  - Vowels: à, è, é, í, ï, ò, ó, ú, ü
  - Consonant: ç (ce trencada)
  - Unicode NFC normalization
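
The rules above can be approximated with a single regular expression. The sketch below is not Nasty's tokenizer: the module name `CatalanTokenizerSketch` and its regex are assumptions made for illustration, and it covers only NFC normalization, the interpunct, and word-initial elisions.

```elixir
defmodule CatalanTokenizerSketch do
  # Hypothetical module for illustration; not part of Nasty.

  # Alternatives are tried in order at each position:
  #   1. an elided form (l', d', s', n', m', t') directly before a letter
  #   2. a word, optionally with interpuncts (U+00B7) between letter runs
  #   3. a run of digits
  #   4. any other single non-space character (punctuation)
  defp token_regex, do: ~r/[ldsnmt]['’](?=\p{L})|\p{L}+(?:·\p{L}+)*|\d+|\S/iu

  def tokenize(text) do
    # Normalize first so that "a" + combining grave becomes the single letter "à".
    normalized = String.normalize(text, :nfc)

    token_regex()
    |> Regex.scan(normalized)
    |> List.flatten()
  end
end

CatalanTokenizerSketch.tokenize("L'home d'or")
# => ["L'", "home", "d'", "or"]
```

Normalizing before matching matters here: a decomposed `à` contains a combining mark, which `\p{L}` does not match, so the word would otherwise be split.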

### POS Tagging

Rule-based POS tagger using the Universal Dependencies tagset:

- **Comprehensive Lexicon**: 300+ word forms
  - Articles, pronouns, prepositions
  - Common verbs, nouns, adjectives, adverbs
  - Function words and particles

- **Verb Conjugations**: Patterns for the main tenses and non-finite forms
  - Present, preterite, imperfect, future, conditional
  - Subjunctive mood patterns
  - Gerunds and past participles

- **Context-Based Disambiguation**
  - Post-nominal adjective detection
  - Determiner-noun sequences
  - Preposition-noun patterns
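
As a rough illustration of lexicon lookup followed by context-based disambiguation, the sketch below tags known words from a tiny lexicon and guesses unknown words from the previous tag. Everything in it is an assumption for illustration: the module `CatalanTaggerSketch`, the seven-entry lexicon, the lowercase atom tags, and the three fallback rules.

```elixir
defmodule CatalanTaggerSketch do
  # Hypothetical mini-lexicon for illustration; not Nasty's lexicon.
  @lexicon %{
    "el" => :det,
    "la" => :det,
    "gat" => :noun,
    "casa" => :noun,
    "dorm" => :verb,
    "de" => :adp,
    "a" => :adp
  }

  # Returns a list of {word, tag} tuples.
  def tag(words) do
    {tagged, _last_tag} =
      Enum.map_reduce(words, nil, fn word, prev ->
        tag = Map.get(@lexicon, String.downcase(word)) || guess(prev)
        {{word, tag}, tag}
      end)

    tagged
  end

  # Unknown word after a determiner or preposition: assume a noun.
  defp guess(prev) when prev in [:det, :adp], do: :noun
  # Unknown word right after a noun: assume a post-nominal adjective.
  defp guess(:noun), do: :adj
  # No usable context: fall back to noun.
  defp guess(_prev), do: :noun
end

CatalanTaggerSketch.tag(["El", "llibre", "interessant"])
# => [{"El", :det}, {"llibre", :noun}, {"interessant", :adj}]
```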

### Morphology

Morphological analyzer with lemmatization:

- **Verb Classes**: 3 conjugation classes
  - `-ar` verbs: parlar → parlar, parlant → parlar
  - `-re` verbs: viure → viure, vivint → viure
  - `-ir` verbs: dormir → dormir, dormint → dormir

- **Irregular Verbs**: Dictionary of 100+ forms
  - ser, estar, haver (auxiliaries)
  - anar, fer, dir, poder, voler (common verbs)
  - tenir, venir, veure (irregulars)

- **Morphological Features**
  - Gender: masculine/feminine
  - Number: singular/plural
  - Tense: present, past, future, conditional, imperfect
  - Mood: indicative, conditional, subjunctive
  - Aspect: progressive, perfective
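
A minimal sketch of how an irregular-form dictionary and suffix rules might combine for gerund lemmatization. The module `CatalanLemmaSketch`, its four-entry irregular table, and the two suffix rules are assumptions for illustration. A real analyzer would apply such rules only to tokens already tagged as verbs, since nouns can also end in `-ant` or `-int`.

```elixir
defmodule CatalanLemmaSketch do
  # Hypothetical irregular-form table for illustration; checked before any rule.
  @irregular %{
    "vivint" => "viure",
    "és" => "ser",
    "vaig" => "anar",
    "fet" => "fer"
  }

  def lemmatize(form) do
    word = String.downcase(form)
    Map.get(@irregular, word) || apply_suffix_rules(word)
  end

  # Regular gerunds: -ant maps back to -ar, -int maps back to -ir.
  defp apply_suffix_rules(word) do
    cond do
      String.ends_with?(word, "ant") -> String.replace_suffix(word, "ant", "ar")
      String.ends_with?(word, "int") -> String.replace_suffix(word, "int", "ir")
      true -> word
    end
  end
end

CatalanLemmaSketch.lemmatize("parlant")
# => "parlar"
```

Looking up the irregular table first is what lets `vivint` resolve to `viure` rather than to the non-existent `vivir`.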

### Grammar Rules

Externalized grammar files in `priv/languages/ca/grammars/`:

**Phrase Rules** (`phrase_rules.exs`):
- Noun phrases with post-nominal adjectives
- Verb phrases with flexible word order
- Prepositional, adjectival, adverbial phrases
- Relative clause patterns
- Special rules for Catalan-specific features

**Dependency Rules** (`dependency_rules.exs`):
- Universal Dependencies v2 relations
- Core arguments (subject, object, indirect object)
- Non-core dependents (oblique, adverbials)
- Function word relations
- Catalan-specific patterns (clitics, pro-drop)
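
To make the idea of a noun-phrase rule with post-nominal adjectives concrete, here is a hedged sketch of such a rule written as plain Elixir. It does not show the actual format of `phrase_rules.exs`, which is not documented here. The module `NounPhraseRuleSketch`, the `{word, tag}` tuples, and the result map are all assumptions.

```elixir
defmodule NounPhraseRuleSketch do
  # Hypothetical rule for illustration: optional determiner, a noun head,
  # then any number of adjectives following the noun.
  def match_np(tagged) do
    {det, rest} = take_det(tagged)

    case rest do
      [{_word, :noun} = head | tail] ->
        {adjs, remaining} = Enum.split_while(tail, fn {_w, tag} -> tag == :adj end)
        {:ok, %{determiner: det, head: head, modifiers: adjs}, remaining}

      _ ->
        :error
    end
  end

  defp take_det([{_word, :det} = det | rest]), do: {det, rest}
  defp take_det(rest), do: {nil, rest}
end

NounPhraseRuleSketch.match_np([{"la", :det}, {"casa", :noun}, {"gran", :adj}, {"és", :verb}])
# => {:ok,
#     %{determiner: {"la", :det}, head: {"casa", :noun}, modifiers: [{"gran", :adj}]},
#     [{"és", :verb}]}
```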

## Usage

```elixir
alias Nasty.Language.Catalan

# Complete pipeline
text = "El gat dorm al sofà."
{:ok, tokens} = Catalan.tokenize(text)
{:ok, tagged} = Catalan.tag_pos(tokens)
{:ok, document} = Catalan.parse(tagged)

# Extract entities
alias Nasty.Language.Catalan.EntityRecognizer
{:ok, entities} = EntityRecognizer.recognize(tagged)
# => e.g. [%Entity{type: :person, text: "Joan Garcia", ...}] for a text that mentions "Joan Garcia"

# Extract dependencies
alias Nasty.Language.Catalan.DependencyExtractor
sentences = document.paragraphs |> Enum.flat_map(& &1.sentences)
deps = Enum.flat_map(sentences, &DependencyExtractor.extract/1)
# => [%Dependency{relation: :nsubj, head: "dorm", dependent: "gat", ...}]

# Individual components
{:ok, tokens} = Catalan.Tokenizer.tokenize("El gat dorm al sofà.")
{:ok, tagged} = Catalan.POSTagger.tag_pos(tokens)
{:ok, analyzed} = Catalan.Morphology.analyze(tagged)

# Access lemmas and features
Enum.each(analyzed, fn token ->
  IO.puts("#{token.text} [#{token.pos_tag}] → #{token.lemma}")
end)
```

## Linguistic Features

### Word Order

Catalan allows flexible word order while keeping SVO as the default:

- **SVO** (Subject-Verb-Object): `"El gat menja peix"` (The cat eats fish)
- **VSO** (Verb-Subject-Object): `"Menja el gat peix"` (Eats the cat fish) - emphatic
- **VOS** (Verb-Object-Subject): `"Menja peix el gat"` (Eats fish the cat) - rare

### Pro-Drop

Subject pronouns are often omitted when the verb ending or the context identifies the subject:

- `"Parla català"` (He/she speaks Catalan) - subject implicit
- `"Hem anat al mercat"` (We have gone to the market) - subject implicit

### Post-Nominal Adjectives

Descriptive adjectives typically follow nouns:

- `"casa gran"` (big house)
- `"llibre interessant"` (interesting book)
- Exception: `"bon dia"` (good day) - a few common adjectives such as `bon` precede the noun

### Clitic Pronouns

Weak pronouns attach to the verb as clitics, joined by an apostrophe or a hyphen:

- `"Dona'm el llibre"` (Give me the book) - `'m` = me
- `"Digues-li la veritat"` (Tell him/her the truth) - `-li` = him/her
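
A small sketch of how a verb form with attached pronouns could be separated into the verb and its enclitics. The helper `CatalanCliticSketch.split_enclitics/1` is hypothetical and is not part of Nasty. It assumes the input is a single verb form already known to carry enclitics, and it does not validate the pieces against a list of Catalan weak pronouns.

```elixir
defmodule CatalanCliticSketch do
  # Hypothetical helper for illustration; not part of Nasty.
  # Each match is a run of letters, optionally led by the apostrophe or
  # hyphen that joins a clitic to the preceding form.
  def split_enclitics(word) do
    case List.flatten(Regex.scan(~r/['’-]?\p{L}+/u, word)) do
      [verb | clitics] -> {:ok, %{verb: verb, clitics: clitics}}
      [] -> :error
    end
  end
end

CatalanCliticSketch.split_enclitics("Dona'm")
# => {:ok, %{verb: "Dona", clitics: ["'m"]}}
```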

## Test Coverage

**74 tests, 0 failures**

- Tokenization: 54 tests
  - Interpunct words
  - Apostrophe and article contractions
  - Diacritics
  - Position tracking
  - Edge cases

- POS Tagging: 20 tests
  - Basic word classes
  - Verb conjugations
  - Catalan-specific features
  - Context-based tagging

## Implementation Details

### Phrase Parser (`lib/language/catalan/phrase_parser.ex` - 334 lines)

- `parse_noun_phrase/2`: Handles quantifiers, determiners, adjectives, and post-modifiers
- `parse_verb_phrase/2`: Processes auxiliaries, main verbs, objects, and complements
- `parse_prep_phrase/2`: Parses preposition + noun phrase structures
- Catalan-specific: Post-nominal adjectives, quantifying adjectives (molt, poc, algun, tot)

### Sentence Parser (`lib/language/catalan/sentence_parser.ex` - 281 lines)

- `parse_sentences/2`: Sentence boundary detection and splitting
- `parse_clause/2`: Subject and predicate extraction
- Catalan subordinators: que, perquè, quan, on, si, encara, mentre, així, doncs, ja
- Coordination: i, o, però, sinó, ni
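
As a rough illustration of how the coordinator list could drive clause splitting, the sketch below cuts a word list at each coordinating conjunction. `CatalanClauseSketch` is a hypothetical module, not the parser's implementation. It is deliberately naive: it would also split coordinated noun phrases such as `el gat i el gos`, which a real parser has to tell apart from coordinated clauses.

```elixir
defmodule CatalanClauseSketch do
  # Coordinators taken from the list above.
  @coordinators ~w(i o però sinó ni)

  # Splits a list of words into chunks at each coordinator, dropping the
  # coordinator itself. Hypothetical helper for illustration.
  def split_clauses(words) do
    Enum.chunk_while(
      words,
      [],
      fn word, acc ->
        cond do
          String.downcase(word) not in @coordinators -> {:cont, [word | acc]}
          acc == [] -> {:cont, acc}
          true -> {:cont, Enum.reverse(acc), []}
        end
      end,
      fn
        [] -> {:cont, []}
        acc -> {:cont, Enum.reverse(acc), []}
      end
    )
  end
end

CatalanClauseSketch.split_clauses(~w(El gat dorm i el gos menja))
# => [["El", "gat", "dorm"], ["el", "gos", "menja"]]
```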

### Dependency Extractor (`lib/language/catalan/dependency_extractor.ex` - 226 lines)

- Extracts Universal Dependencies relations from parsed structures
- Core relations: nsubj (nominal subject), obj (object), iobj (indirect object)
- Modifiers: det (determiner), amod (adjectival modifier), advmod (adverbial modifier)
- Function words: aux (auxiliary), case (case marking), mark (subordinating conjunction)
- Coordination: cc (coordinating conjunction), conj (conjunct)
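
The modifier relations can be illustrated on a single noun phrase. The sketch below assumes a parsed noun phrase is a map with `:determiner`, `:head`, and `:modifiers` keys holding `{word, tag}` tuples. That shape, the module `CatalanDependencySketch`, and the plain-map output are assumptions for illustration and do not reflect Nasty's structs.

```elixir
defmodule CatalanDependencySketch do
  # Hypothetical helper: derives det and amod relations from a noun phrase.
  # Every dependent points at the head noun.
  def np_dependencies(%{head: {head, _tag}} = np) do
    det =
      case np.determiner do
        {word, _tag} -> [%{relation: :det, head: head, dependent: word}]
        nil -> []
      end

    amods =
      for {word, _tag} <- np.modifiers do
        %{relation: :amod, head: head, dependent: word}
      end

    det ++ amods
  end
end

CatalanDependencySketch.np_dependencies(%{
  determiner: {"la", :det},
  head: {"casa", :noun},
  modifiers: [{"gran", :adj}]
})
# => [
#   %{relation: :det, head: "casa", dependent: "la"},
#   %{relation: :amod, head: "casa", dependent: "gran"}
# ]
```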

### Entity Recognizer (`lib/language/catalan/entity_recognizer.ex` - 285 lines)

- Rule-based NER with 6 entity types
- **PERSON**: Catalan titles (Sr., Sra., Dr., Dra., Don, Donya), capitalized name sequences
- **LOCATION**: Catalan places (Barcelona, Catalunya, València, Girona, Tarragona, Lleida, Andorra)
- **ORGANIZATION**: Indicators (banc, universitat, hospital, ajuntament, govern)
- **DATE**: Catalan months and days (gener, febrer, març, dilluns, dimarts)
- **MONEY**: Currency symbols and words (€, euros, dòlar, dòlars)
- **PERCENT**: Percentage symbols (%, per cent)
- Confidence scoring: 0.5-0.95 based on pattern strength
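
A minimal sketch of two of these rules, a title followed by capitalized words and a location gazetteer lookup, to show how pattern strength might map to a confidence score. The module `CatalanNerSketch`, the plain-map entities, and the specific scores 0.95 and 0.9 are assumptions chosen within the documented range. They are not the recognizer's actual values.

```elixir
defmodule CatalanNerSketch do
  # Titles and places taken from the lists above.
  @titles ~w(Sr. Sra. Dr. Dra. Don Donya)
  @locations ~w(Barcelona Catalunya València Girona Tarragona Lleida Andorra)

  def recognize(words), do: scan(words, [])

  defp scan([], acc), do: Enum.reverse(acc)

  defp scan([word | rest], acc) do
    cond do
      word in @titles ->
        # A title followed by capitalized words is a strong PERSON cue.
        {name, remaining} = Enum.split_while(rest, &capitalized?/1)

        if name == [] do
          scan(rest, acc)
        else
          entity = %{type: :person, text: Enum.join(name, " "), confidence: 0.95}
          scan(remaining, [entity | acc])
        end

      word in @locations ->
        scan(rest, [%{type: :location, text: word, confidence: 0.9} | acc])

      true ->
        scan(rest, acc)
    end
  end

  # True when the first character changes under downcasing.
  defp capitalized?(word) do
    first = String.first(word)
    first != nil and first != String.downcase(first)
  end
end

CatalanNerSketch.recognize(["El", "Dr.", "Joan", "Garcia", "viu", "a", "Girona", "."])
# => [
#   %{type: :person, text: "Joan Garcia", confidence: 0.95},
#   %{type: :location, text: "Girona", confidence: 0.9}
# ]
```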

## Future Work (Phase 8 and Beyond)

1. **Summarizer**: Extractive and abstractive text summarization
2. **Coreference Resolution**: Link mentions across sentences
3. **Semantic Role Labeling**: Predicate-argument structure
4. **End-to-end Tests**: Integration tests for complete pipeline
5. **Advanced Entity Recognition**: ML-based NER with larger lexicons
6. **Question Answering**: Extractive QA for Catalan texts
7. **Text Classification**: Sentiment analysis, topic classification

## References

- Universal Dependencies Catalan Treebank: [UD_Catalan-AnCora](https://github.com/UniversalDependencies/UD_Catalan-AnCora)
- Catalan Grammar: Institut d'Estudis Catalans
- Linguistic Patterns: Based on Central Catalan (Barcelona dialect)

## Language Code

ISO 639-1: `ca`  
ISO 639-3: `cat`

## Contributing

When enhancing Catalan support:
1. Maintain consistency with Spanish implementation patterns
2. Follow Universal Dependencies standards
3. Document Catalan-specific features
4. Add comprehensive tests for new functionality
5. Update this documentation