docs/GRAMMAR_CUSTOMIZATION.md

Select File:
docs/GRAMMAR_CUSTOMIZATION.md

# Grammar Customization Guide

This document explains how to customize and extend Nasty's grammar rules by creating external grammar resource files.

## Overview

Starting with version 0.2.0, Nasty externalizes grammar rules from hardcoded Elixir modules into configurable `.exs` resource files. This allows you to:

- Customize existing grammar rules without modifying source code
- Create domain-specific grammar variants (e.g., legal, medical, technical)
- Add support for new languages
- A/B test different parsing strategies
- Share grammar rule sets across projects

## Architecture

Grammar rules are stored as Elixir term files (`.exs`) in:

```
priv/languages/{language_code}/grammars/{rule_type}.exs
```

For variants (e.g., formal, informal, technical):

```
priv/languages/{language_code}/variants/{variant_name}/{rule_type}.exs
```

### Language Codes

- English: `en` or `english`
- Spanish: `es` or `spanish`
- Catalan: `ca` or `catalan` (future)

### Rule Types

Each language can have the following grammar rule files:

1. `phrase_rules.exs` - Phrase structure patterns (NP, VP, PP, AdjP, AdvP)
2. `dependency_rules.exs` - Universal Dependencies relations and extraction rules
3. `coordination_rules.exs` - Coordinating conjunctions and coordination patterns
4. `subordination_rules.exs` - Subordinating conjunctions and subordinate clause patterns

## Grammar Loader API

### Loading Grammar Rules

```elixir
alias Nasty.Language.GrammarLoader

# Load default grammar rules
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)

# Load with variant
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: "formal")

# Force reload (bypass cache)
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, force_reload: true)
```

### Cache Management

```elixir
# Clear all cached grammar
GrammarLoader.clear_cache()

# Clear specific cached rules
GrammarLoader.clear_cache(:en, :phrase_rules, :default)
```

### Direct File Loading

```elixir
# Load from custom path
{:ok, rules} = GrammarLoader.load_file("/path/to/custom_rules.exs")
```

## Creating Grammar Files

### File Structure

Grammar files are Elixir term files that evaluate to a map:

```elixir
%{
  # Top-level keys define rule categories
  rule_category_1: [...],
  rule_category_2: %{...},
  
  # Metadata
  notes: %{
    key: "description"
  }
}
```

### Example: Simple Phrase Rules

Create `priv/languages/en/grammars/custom_phrase_rules.exs`:

```elixir
%{
  # Noun phrase patterns
  noun_phrases: [
    # Simple NP: Det + Noun
    {:np, [:det, :noun]},
    
    # NP with adjective: Det + Adj + Noun
    {:np, [:det, :adj, :noun]},
    
    # NP with PP: Det + Noun + PP
    {:np, [:det, :noun, :pp]}
  ],
  
  # Verb phrase patterns
  verb_phrases: [
    # Simple VP: just Verb
    {:vp, [:verb]},
    
    # VP with object: Verb + NP
    {:vp, [:verb, :np]},
    
    # VP with auxiliary: Aux + Verb
    {:vp, [:aux, :verb]}
  ],
  
  notes: %{
    version: "1.0.0",
    author: "Your Name",
    description: "Custom phrase rules for domain-specific parsing"
  }
}
```

## English Grammar Reference

### Phrase Rules (`phrase_rules.exs`)

See `priv/languages/en/grammars/phrase_rules.exs` for the complete reference.

Key sections:

```elixir
%{
  noun_phrases: [
    # List of NP patterns
    {:np, [:det, :noun]},
    {:np, [:det, :adj, :noun]},
    # ...
  ],
  
  verb_phrases: [
    # List of VP patterns
    {:vp, [:verb]},
    {:vp, [:aux, :verb, :np]},
    # ...
  ],
  
  prepositional_phrases: [
    # PP patterns
    {:pp, [:prep, :np]},
    # ...
  ],
  
  adjectival_phrases: [
    # AdjP patterns
    {:adjp, [:adv, :adj]},
    # ...
  ],
  
  adverbial_phrases: [
    # AdvP patterns
    {:advp, [:adv]},
    # ...
  ],
  
  relative_clauses: [
    # Relative clause patterns
    {:relative_clause, [:relative_marker, :clause]},
    # ...
  ],
  
  special_rules: [
    # Special handling rules
    {:comparative_than, :pseudo_prep},
    # ...
  ]
}
```

### Dependency Rules (`dependency_rules.exs`)

See `priv/languages/en/grammars/dependency_rules.exs` for the complete reference.

Key sections:

```elixir
%{
  core_arguments: [
    # Subject, object, complements
    %{
      relation: :nsubj,
      description: "Nominal subject",
      head_pos: [:verb],
      dependent_pos: [:noun, :propn, :pron],
      example: "The cat sleeps → nsubj(sleeps, cat)"
    },
    # ...
  ],
  
  nominal_dependents: [
    # Determiners, modifiers
    %{relation: :det, ...},
    %{relation: :amod, ...},
    # ...
  ],
  
  function_words: [
    # Auxiliaries, copulas, markers
    %{relation: :aux, ...},
    # ...
  ],
  
  extraction_priorities: [
    # Order of dependency extraction
    :nsubj, :obj, :det, :amod, # ...
  ]
}
```

### Coordination Rules (`coordination_rules.exs`)

Key sections:

```elixir
%{
  coordinating_conjunctions: [
    %{
      conjunction: "and",
      type: :copulative,
      example: "cats and dogs"
    },
    # ...
  ],
  
  coordination_patterns: [
    %{
      pattern: :np_coordination,
      structure: "NP CCONJ NP",
      example: "cats and dogs"
    },
    # ...
  ],
  
  special_cases: [
    # Correlative conjunctions, etc.
    %{
      type: :correlative,
      patterns: [
        %{pair: ["both", "and"], example: "both cats and dogs"},
        # ...
      ]
    }
  ]
}
```

### Subordination Rules (`subordination_rules.exs`)

Key sections:

```elixir
%{
  subordinating_conjunctions: [
    %{
      conjunction: "because",
      type: :causal,
      example: "I stayed because it rained"
    },
    # ...
  ],
  
  relative_markers: [
    %{
      marker: "who",
      type: :relative_pronoun,
      example: "the person who came"
    },
    # ...
  ],
  
  subordinate_clause_types: [
    %{
      type: :adverbial,
      dependency_relation: :advcl,
      subtypes: [:temporal, :causal, :conditional, ...]
    },
    # ...
  ]
}
```

## Spanish Grammar Reference

Spanish grammar files follow the same structure but include Spanish-specific features:

- Post-nominal adjectives: `la casa roja` (the red house)
- Pro-drop: null subjects allowed
- Flexible word order: SVO, VSO, VOS
- Clitic pronouns: `dámelo` (give-me-it)
- Personal 'a': `Veo a Juan` (I see Juan)
- Two copulas: `ser` vs. `estar`
- Phonetic variants: `y`→`e`, `o`→`u` before vowels

See files in `priv/languages/es/grammars/` for complete Spanish grammar.

## Creating Domain-Specific Variants

### Example: Technical English

Create `priv/languages/en/variants/technical/phrase_rules.exs`:

```elixir
%{
  # Inherit base rules and add technical-specific patterns
  noun_phrases: [
    # Standard patterns
    {:np, [:det, :noun]},
    
    # Technical compound nouns (e.g., "TCP/IP protocol")
    {:np, [:propn, :noun]},
    {:np, [:propn, :sym, :propn, :noun]},
    
    # Noun phrases with technical modifiers
    {:np, [:num, {:unit, [:noun]}, :noun]},  # "5 GB memory"
    
    # Multi-word technical terms
    {:np, [{:many, :noun}]}  # "machine learning model"
  ],
  
  verb_phrases: [
    # Standard patterns
    {:vp, [:verb, :np]},
    
    # Technical action verbs (instantiate, serialize, etc.)
    {:vp, [:tech_verb, :np, :pp]},
    
    # Passive constructions common in technical writing
    {:vp, [:aux, :verb, :pp]}
  ],
  
  notes: %{
    domain: "technical",
    use_case: "Software documentation, API specs, technical papers"
  }
}
```

### Example: Legal English

```elixir
%{
  noun_phrases: [
    # Legal entities
    {:np, [:det, :legal_entity]},  # "the plaintiff", "the defendant"
    
    # Complex legal terms
    {:np, [:det, :adj, :legal_term, :pp]},  # "the aforementioned contractual obligation"
    
    # References (Section X, Article Y)
    {:np, [:legal_ref_type, :num]}  # "Section 5"
  ],
  
  subordination_patterns: [
    # Legal conditionals (provided that, in the event that)
    {:conditional, :multiword_legal_conj}
  ],
  
  notes: %{
    domain: "legal",
    use_case: "Contracts, legislation, court documents"
  }
}
```

## Using Custom Grammar in Code

### Option 1: Load and Use Directly

```elixir
# Load custom grammar
{:ok, custom_phrase_rules} = GrammarLoader.load(:en, :custom_phrase_rules)

# Use in your parser
custom_np_patterns = custom_phrase_rules.noun_phrases
# Process with custom patterns...
```

### Option 2: Extend Parser Module

```elixir
defmodule MyApp.CustomParser do
  alias Nasty.Language.GrammarLoader
  
  def parse_technical_text(text) do
    # Load technical variant
    {:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: "technical")
    
    # Parse using custom rules
    # ... your parsing logic using rules ...
  end
end
```

### Option 3: Runtime Configuration

```elixir
# In config/config.exs
config :nasty,
  default_grammar_variant: "technical"

# In your code
variant = Application.get_env(:nasty, :default_grammar_variant, :default)
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, variant: variant)
```

## Grammar Validation

The grammar loader validates that all files return a map:

```elixir
# Valid
%{
  rules: [...],
  notes: %{}
}

# Invalid - will raise error
[1, 2, 3]  # Not a map
```

For more complex validation, extend `GrammarLoader.validate_rules/1`.

## Best Practices

### 1. Start with Base Grammar

Copy existing grammar files and modify rather than starting from scratch:

```bash
cp priv/languages/en/grammars/phrase_rules.exs \
   priv/languages/en/variants/custom/phrase_rules.exs
```

### 2. Document Your Rules

Include comprehensive notes in your grammar files:

```elixir
%{
  rules: [...],
  
  notes: %{
    version: "1.0.0",
    author: "Team Name",
    created: "2026-01-08",
    description: "Custom grammar for medical text parsing",
    changes: [
      "Added medical entity patterns",
      "Extended VP patterns for medical procedures"
    ],
    examples: [
      "The patient underwent cardiac catheterization",
      "Diagnose: Type 2 diabetes mellitus"
    ]
  }
}
```

### 3. Test Your Grammar

Create tests for custom grammar:

```elixir
defmodule MyApp.CustomGrammarTest do
  use ExUnit.Case
  alias Nasty.Language.GrammarLoader
  
  test "custom grammar loads successfully" do
    assert {:ok, rules} = GrammarLoader.load(:en, :custom_rules)
    assert is_map(rules)
    assert Map.has_key?(rules, :noun_phrases)
  end
  
  test "custom grammar includes domain patterns" do
    {:ok, rules} = GrammarLoader.load(:en, :custom_rules, variant: "medical")
    assert Enum.any?(rules.noun_phrases, fn pattern ->
      # Check for medical-specific patterns
    end)
  end
end
```

### 4. Version Your Grammar

Track grammar versions for reproducibility:

```elixir
%{
  metadata: %{
    version: "2.1.0",
    compatible_with: "nasty >= 0.2.0"
  },
  # ... rules ...
}
```

### 5. Keep Grammar Files Focused

Separate concerns across different rule types:

- Phrase structure → `phrase_rules.exs`
- Dependencies → `dependency_rules.exs`
- Coordination → `coordination_rules.exs`
- Subordination → `subordination_rules.exs`

Don't mix all rules into one file.

## Performance Considerations

### Caching

Grammar files are cached in ETS after first load:

```elixir
# First load: reads from disk
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)  # ~5ms

# Subsequent loads: from cache
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules)  # ~0.1ms
```

Clear cache when updating grammar during development:

```elixir
GrammarLoader.clear_cache()
```

### File Size

Keep grammar files under 1MB for fast loading. If needed, split into multiple files:

```
phrase_rules_np.exs  # Noun phrase patterns
phrase_rules_vp.exs  # Verb phrase patterns
phrase_rules_pp.exs  # Prepositional phrase patterns
```

## Troubleshooting

### Grammar File Not Found

```
Grammar file not found: .../en/grammars/missing_rules.exs, using empty rules
```

**Solution**: Check file exists and path is correct. Grammar files must be in `priv/languages/{lang}/grammars/`.

### Invalid Grammar Format

```
** (ArgumentError) Grammar rules must be a map, got: [...]
```

**Solution**: Ensure file evaluates to a map:

```elixir
# Correct
%{rules: [...]}

# Wrong
[...]
```

### Compilation Errors

```
** (SyntaxError) invalid syntax
```

**Solution**: Grammar files must be valid Elixir. Test with:

```bash
elixir priv/languages/en/grammars/your_rules.exs
```

### Cache Issues

If changes to grammar files aren't reflected:

```elixir
# Clear cache
Nasty.Language.GrammarLoader.clear_cache()

# Or force reload
{:ok, rules} = GrammarLoader.load(:en, :phrase_rules, force_reload: true)
```

## Examples Repository

See working examples in the main repository:

- English grammar: `priv/languages/en/grammars/`
- Spanish grammar: `priv/languages/es/grammars/`
- Test fixtures: `test/fixtures/grammars/`

## Contributing Custom Grammars

To contribute grammar variants to the Nasty project:

1. Create grammar files following the structure above
2. Add tests demonstrating the grammar works
3. Document the use case and domain
4. Submit a pull request to the main repository

## Further Reading

- [PARSING_GUIDE.md](PARSING_GUIDE.md) - Understanding the parsing pipeline
- [ENGLISH_GRAMMAR.md](languages/ENGLISH_GRAMMAR.md) - English grammar specification
- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture overview
- Universal Dependencies: https://universaldependencies.org/