# Getting Started with Nasty
A beginner-friendly guide to Natural Abstract Syntax Tree processing in Elixir.
## Table of Contents
1. [Installation](#installation)
2. [Your First Steps](#your-first-steps)
3. [Core Concepts](#core-concepts)
4. [Common Patterns](#common-patterns)
5. [Language Support](#language-support)
6. [Troubleshooting](#troubleshooting)
7. [Next Steps](#next-steps)
## Installation
### Prerequisites
- **Elixir**: Version 1.14 or later
- **Erlang/OTP**: Version 25 or later
Check your versions:
```bash
elixir --version
# Erlang/OTP 25 [erts-13.0] [source] [64-bit]
# Elixir 1.14.0 (compiled with Erlang/OTP 25)
```
### Adding Nasty to Your Project
Add `nasty` to your `mix.exs` dependencies:
```elixir
def deps do
[
{:nasty, "~> 0.1.0"}
]
end
```
Then run:
```bash
mix deps.get
mix compile
```
### Verifying Installation
Test that everything works:
```elixir
# In IEx
iex> alias Nasty.Language.English
iex> {:ok, tokens} = English.tokenize("Hello world!")
iex> IO.inspect(tokens)
```
## Your First Steps
### Example 1: Parse a Simple Sentence
```elixir
alias Nasty.Language.English
# Step 1: Tokenize
text = "The cat runs."
{:ok, tokens} = English.tokenize(text)
# Step 2: POS Tag
{:ok, tagged} = English.tag_pos(tokens)
# Step 3: Parse
{:ok, document} = English.parse(tagged)
# Examine the result
IO.inspect(document)
```
**What just happened?**
1. **Tokenization**: Split text into words and punctuation
2. **POS Tagging**: Assigned grammatical categories (noun, verb, etc.)
3. **Parsing**: Built an Abstract Syntax Tree (AST)
### Example 2: Extract Information
```elixir
alias Nasty.Language.English
text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
# Extract named entities
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged)
Enum.each(entities, fn entity ->
IO.puts("#{entity.text} is a #{entity.type}")
end)
# Output:
# John Smith is a person
# Google is a org
# New York is a gpe
```
### Example 3: Translate Between Languages
```elixir
alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator
# Parse English
{:ok, tokens} = English.tokenize("The cat runs.")
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)
# Translate to Spanish
{:ok, doc_es} = Translator.translate(doc, :es)
# Render Spanish text
{:ok, text_es} = Nasty.Rendering.Text.render(doc_es)
IO.puts(text_es)
# Output: El gato corre.
```
## Core Concepts
### The AST Structure
Nasty represents text as a tree:
```
Document
└── Paragraph
└── Sentence
└── Clause
├── Subject (NounPhrase)
│ ├── Determiner: "The"
│ └── Head: "cat"
└── Predicate (VerbPhrase)
└── Head: "runs"
```
### Tokens
Every word is a **Token** with:
- `text`: The actual word ("runs")
- `lemma`: Dictionary form ("run")
- `pos_tag`: Part of speech (`:verb`)
- `morphology`: Features (`%{tense: :present}`)
- `language`: Language code (`:en`)
- `span`: Position in text
### Phrases
Phrases group related tokens:
- **NounPhrase**: "the big cat"
- **VerbPhrase**: "is running quickly"
- **PrepositionalPhrase**: "in the house"
### The Processing Pipeline
```
Text → Tokenization → POS Tagging → Morphology → Parsing → AST
```
Each step enriches the data:
1. **Tokenization**: Split into atomic units
2. **POS Tagging**: Add grammatical categories
3. **Morphology**: Add features (tense, number, etc.)
4. **Parsing**: Build hierarchical structure
## Common Patterns
### Pattern 1: Batch Processing
Process multiple texts efficiently:
```elixir
alias Nasty.Language.English
texts = [
"The first sentence.",
"The second sentence.",
"The third sentence."
]
results =
texts
|> Task.async_stream(fn text ->
with {:ok, tokens} <- English.tokenize(text),
{:ok, tagged} <- English.tag_pos(tokens),
{:ok, doc} <- English.parse(tagged) do
{:ok, doc}
end
end, max_concurrency: System.schedulers_online())
|> Enum.to_list()
```
### Pattern 2: Extract Specific Information
Find all nouns in a document:
```elixir
alias Nasty.Utils.Query
{:ok, doc} = Nasty.parse("The cat and dog play.", language: :en)
# Find all nouns
nouns = Query.find_by_pos(doc, :noun)
Enum.each(nouns, fn token ->
IO.puts(token.text)
end)
# Output:
# cat
# dog
```
### Pattern 3: Transform Text
Normalize and clean text:
```elixir
alias Nasty.Utils.Transform
{:ok, doc} = Nasty.parse("The CAT runs QUICKLY!", language: :en)
# Lowercase everything
normalized = Transform.normalize_case(doc, :lower)
# Remove punctuation
no_punct = Transform.remove_punctuation(normalized)
# Render back to text
{:ok, clean_text} = Nasty.render(no_punct)
IO.puts(clean_text)
# Output: the cat runs quickly
```
### Pattern 4: Error Handling
Always handle errors gracefully:
```elixir
alias Nasty.Language.English
text = "Some text..."
case English.tokenize(text) do
{:ok, tokens} ->
case English.tag_pos(tokens) do
{:ok, tagged} ->
case English.parse(tagged) do
{:ok, doc} ->
# Success! Process doc
process_document(doc)
{:error, reason} ->
IO.puts("Parse error: #{inspect(reason)}")
end
{:error, reason} ->
IO.puts("Tagging error: #{inspect(reason)}")
end
{:error, reason} ->
IO.puts("Tokenization error: #{inspect(reason)}")
end
```
Or use `with`:
```elixir
with {:ok, tokens} <- English.tokenize(text),
{:ok, tagged} <- English.tag_pos(tokens),
{:ok, doc} <- English.parse(tagged) do
process_document(doc)
else
{:error, reason} ->
IO.puts("Error: #{inspect(reason)}")
end
```
## Language Support
### Supported Languages
Nasty currently supports:
- **English** (`:en`) - Fully implemented
- **Spanish** (`:es`) - Fully implemented
- **Catalan** (`:ca`) - Fully implemented
### Using Different Languages
Each language has its own module:
```elixir
# English
alias Nasty.Language.English
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)
# Spanish
alias Nasty.Language.Spanish
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)
# Catalan
alias Nasty.Language.Catalan
{:ok, doc_ca} = Nasty.parse("El gat corre.", language: :ca)
```
### Language Detection
Auto-detect the language:
```elixir
{:ok, lang} = Nasty.Language.Registry.detect_language("Hola mundo")
# => {:ok, :es}
{:ok, lang} = Nasty.Language.Registry.detect_language("Hello world")
# => {:ok, :en}
```
## Troubleshooting
### Common Issues
#### Issue 1: Module Not Found
**Error:**
```
** (UndefinedFunctionError) function Nasty.Language.English.tokenize/1 is undefined
```
**Solution:**
Make sure you've compiled the project:
```bash
mix deps.get
mix compile
```
#### Issue 2: Empty Token List
**Problem:**
```elixir
{:ok, []} = English.tokenize("")
```
**Solution:**
Empty strings return empty token lists. Check your input:
```elixir
text = String.trim(user_input)
if text != "" do
English.tokenize(text)
else
{:error, :empty_input}
end
```
#### Issue 3: Parse Errors with Long Sentences
**Problem:**
Very long or complex sentences may fail to parse.
**Solution:**
Split long sentences:
```elixir
sentences = String.split(text, ~r/[.!?]+/)
|> Enum.map(&String.trim/1)
|> Enum.filter(&(&1 != ""))
Enum.each(sentences, fn sent ->
{:ok, doc} = Nasty.parse(sent, language: :en)
# Process doc
end)
```
#### Issue 4: Low Entity Recognition
**Problem:**
Named entities not detected.
**Solution:**
Entities depend on lexicons. For specialized domains, you may need to add custom entity patterns or use statistical models:
```elixir
# Use rule-based (default)
{:ok, tagged} = English.tag_pos(tokens)
entities = EntityRecognizer.recognize(tagged)
# Or use CRF model (better accuracy)
entities = EntityRecognizer.recognize(tagged, model: :crf)
```
### Performance Issues
#### Slow Processing
If processing is slow:
1. **Use parallel processing** for multiple documents
2. **Cache parsed documents** to avoid re-parsing
3. **Use simpler models** for POS tagging (`:rule` instead of `:neural`)
```elixir
# Fast rule-based tagging
{:ok, tagged} = English.tag_pos(tokens, model: :rule)
# Better accuracy but slower
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)
```
### Getting Help
- **Documentation**: Check [docs/](docs) for detailed guides
- **Examples**: See [examples/](examples) for working code
- **Issues**: Report bugs on [GitHub](https://github.com/am-kantox/nasty/issues)
## Next Steps
### Learn More
1. **Read the User Guide**: [USER_GUIDE.md](USER_GUIDE.md) for comprehensive examples
2. **Explore Examples**: [EXAMPLES.md](EXAMPLES.md) for runnable scripts
3. **Understand Architecture**: [ARCHITECTURE.md](ARCHITECTURE.md) for system design
4. **Try Translation**: [TRANSLATION.md](TRANSLATION.md) for multilingual features
### Try the Examples
Run the example scripts:
```bash
# Basic tokenization
elixir examples/tokenizer_example.exs
# Question answering
elixir examples/question_answering.exs
# Translation
elixir examples/translation_example.exs
# Multilingual comparison
elixir examples/multilingual_pipeline.exs
```
### Build Something
Now that you understand the basics, try building:
1. **Text Analyzer**: Extract keywords, entities, and sentiment
2. **Translation Tool**: Translate documents between languages
3. **Chatbot**: Parse user input and generate responses
4. **Content Categorizer**: Classify documents by topic
5. **Grammar Checker**: Analyze and correct grammatical errors
### Advanced Topics
Once comfortable with basics, explore:
- **Statistical Models**: Train custom POS taggers
- **Neural Networks**: Use BiLSTM-CRF for better accuracy
- **Information Extraction**: Extract relations and events
- **Question Answering**: Build Q&A systems
- **Custom Grammars**: Define domain-specific grammar rules
## Quick Reference
### Essential Functions
```elixir
# Parsing
Nasty.parse(text, language: :en)
# Rendering
Nasty.render(ast)
# Translation
Nasty.Translation.Translator.translate(ast, target_language)
# Querying
Nasty.Utils.Query.find_by_pos(doc, :noun)
Nasty.Utils.Query.extract_entities(doc)
# Transformation
Nasty.Utils.Transform.normalize_case(doc, :lower)
Nasty.Utils.Transform.remove_punctuation(doc)
```
### Language Modules
```elixir
Nasty.Language.English
Nasty.Language.Spanish
Nasty.Language.Catalan
```
### Common Modules
```elixir
alias Nasty.Language.English
alias Nasty.Translation.Translator
alias Nasty.Utils.{Query, Transform, Traversal}
alias Nasty.Rendering.Text
```
## Summary
You now know how to:
- ✓ Install and set up Nasty
- ✓ Parse text into an AST
- ✓ Extract information from documents
- ✓ Translate between languages
- ✓ Handle common issues
- ✓ Use best practices
**Happy parsing!** 🚀