docs/GETTING_STARTED.md

# Getting Started with Nasty

A beginner-friendly guide to Natural Abstract Syntax Tree processing in Elixir.

## Table of Contents

1. [Installation](#installation)
2. [Your First Steps](#your-first-steps)
3. [Core Concepts](#core-concepts)
4. [Common Patterns](#common-patterns)
5. [Language Support](#language-support)
6. [Troubleshooting](#troubleshooting)
7. [Next Steps](#next-steps)

## Installation

### Prerequisites

- **Elixir**: Version 1.14 or later
- **Erlang/OTP**: Version 25 or later

Check your versions:
```bash
elixir --version
# Erlang/OTP 25 [erts-13.0] [source] [64-bit]
# Elixir 1.14.0 (compiled with Erlang/OTP 25)
```

### Adding Nasty to Your Project

Add `nasty` to your `mix.exs` dependencies:

```elixir
def deps do
  [
    {:nasty, "~> 0.1.0"}
  ]
end
```

Then run:
```bash
mix deps.get
mix compile
```

### Verifying Installation

Test that everything works:

```elixir
# In IEx
iex> alias Nasty.Language.English
iex> {:ok, tokens} = English.tokenize("Hello world!")
iex> IO.inspect(tokens)
```

## Your First Steps

### Example 1: Parse a Simple Sentence

```elixir
alias Nasty.Language.English

# Step 1: Tokenize
text = "The cat runs."
{:ok, tokens} = English.tokenize(text)

# Step 2: POS Tag
{:ok, tagged} = English.tag_pos(tokens)

# Step 3: Parse
{:ok, document} = English.parse(tagged)

# Examine the result
IO.inspect(document)
```

**What just happened?**
1. **Tokenization**: Split text into words and punctuation
2. **POS Tagging**: Assigned grammatical categories (noun, verb, etc.)
3. **Parsing**: Built an Abstract Syntax Tree (AST)
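
If you don't need the intermediate results, the top-level `Nasty.parse/2` (used in later examples in this guide) wraps the same pipeline in a single call:

```elixir
# One-call equivalent of the tokenize -> tag -> parse steps above
{:ok, document} = Nasty.parse("The cat runs.", language: :en)
```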

### Example 2: Extract Information

```elixir
alias Nasty.Language.English

text = "John Smith works at Google in New York."
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)

# Extract named entities
alias Nasty.Language.English.EntityRecognizer
entities = EntityRecognizer.recognize(tagged)

Enum.each(entities, fn entity ->
  IO.puts("#{entity.text} is a #{entity.type}")
end)
# Output:
# John Smith is a person
# Google is a org
# New York is a gpe
```

### Example 3: Translate Between Languages

```elixir
alias Nasty.Language.{English, Spanish}
alias Nasty.Translation.Translator

# Parse English
{:ok, tokens} = English.tokenize("The cat runs.")
{:ok, tagged} = English.tag_pos(tokens)
{:ok, doc} = English.parse(tagged)

# Translate to Spanish
{:ok, doc_es} = Translator.translate(doc, :es)

# Render Spanish text
{:ok, text_es} = Nasty.Rendering.Text.render(doc_es)
IO.puts(text_es)
# Output: El gato corre.
```

## Core Concepts

### The AST Structure

Nasty represents text as a tree:

```
Document
└── Paragraph
    └── Sentence
        └── Clause
            ├── Subject (NounPhrase)
            │   ├── Determiner: "The"
            │   └── Head: "cat"
            └── Predicate (VerbPhrase)
                └── Head: "runs"
```
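
As a minimal sketch of walking this tree with a comprehension, assuming each node is a struct whose child fields mirror the diagram; the field names used below (`paragraphs`, `sentences`, `clauses`, `subject`, `head`) are assumptions, so check the actual struct definitions:

```elixir
# Field names here (paragraphs, sentences, clauses, subject, head) are
# assumptions for illustration; consult the real AST structs.
defmodule SubjectHeads do
  @doc "Collects the head word of every subject noun phrase in a document."
  def collect(document) do
    for paragraph <- document.paragraphs,
        sentence <- paragraph.sentences,
        clause <- sentence.clauses,
        do: clause.subject.head.text
  end
end
```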

### Tokens

Every word and punctuation mark is a **Token** with:
- `text`: The actual word ("runs")
- `lemma`: Dictionary form ("run")
- `pos_tag`: Part of speech (`:verb`)
- `morphology`: Features (`%{tense: :present}`)
- `language`: Language code (`:en`)
- `span`: Position in text
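
Put together, a tagged token for "runs" carries roughly this shape (shown as a plain map for illustration; the actual struct name and the exact `span` format are assumptions):

```elixir
%{
  text: "runs",
  lemma: "run",
  pos_tag: :verb,
  morphology: %{tense: :present},
  language: :en,
  # the span format below is an assumption; it marks the token's position in the source
  span: {8, 12}
}
```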

### Phrases

Phrases group related tokens:
- **NounPhrase**: "the big cat"
- **VerbPhrase**: "is running quickly"
- **PrepositionalPhrase**: "in the house"

### The Processing Pipeline

```
Text → Tokenization → POS Tagging → Morphology → Parsing → AST
```

Each step enriches the data:
1. **Tokenization**: Split into atomic units
2. **POS Tagging**: Add grammatical categories
3. **Morphology**: Add features (tense, number, etc.)
4. **Parsing**: Build hierarchical structure
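
In code, the pipeline corresponds to the three functions shown earlier; in this guide's examples there is no separate morphology call, as the features appear on the tokens after tagging. A small helper keeps call sites tidy (the module name is arbitrary):

```elixir
defmodule MyApp.Pipeline do
  alias Nasty.Language.English

  @doc "Runs the full English pipeline, returning {:ok, document} or the first {:error, reason}."
  def run(text) do
    with {:ok, tokens} <- English.tokenize(text),
         {:ok, tagged} <- English.tag_pos(tokens),
         {:ok, document} <- English.parse(tagged) do
      {:ok, document}
    end
  end
end
```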

## Common Patterns

### Pattern 1: Batch Processing

Process multiple texts efficiently:

```elixir
alias Nasty.Language.English

texts = [
  "The first sentence.",
  "The second sentence.",
  "The third sentence."
]

results = 
  texts
  |> Task.async_stream(fn text ->
    with {:ok, tokens} <- English.tokenize(text),
         {:ok, tagged} <- English.tag_pos(tokens),
         {:ok, doc} <- English.parse(tagged) do
      {:ok, doc}
    end
  end, max_concurrency: System.schedulers_online())
  |> Enum.to_list()
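
# Each element of `results` is {:ok, {:ok, doc}} (or {:ok, {:error, reason}}),
# because Task.async_stream wraps every result in its own :ok tuple.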
```

### Pattern 2: Extract Specific Information

Find all nouns in a document:

```elixir
alias Nasty.Utils.Query

{:ok, doc} = Nasty.parse("The cat and dog play.", language: :en)

# Find all nouns
nouns = Query.find_by_pos(doc, :noun)

Enum.each(nouns, fn token ->
  IO.puts(token.text)
end)
# Output:
# cat
# dog
```

### Pattern 3: Transform Text

Normalize and clean text:

```elixir
alias Nasty.Utils.Transform

{:ok, doc} = Nasty.parse("The CAT runs QUICKLY!", language: :en)

# Lowercase everything
normalized = Transform.normalize_case(doc, :lower)

# Remove punctuation
no_punct = Transform.remove_punctuation(normalized)

# Render back to text
{:ok, clean_text} = Nasty.render(no_punct)
IO.puts(clean_text)
# Output: the cat runs quickly
```

### Pattern 4: Error Handling

Always handle errors gracefully:

```elixir
alias Nasty.Language.English

text = "Some text..."

case English.tokenize(text) do
  {:ok, tokens} ->
    case English.tag_pos(tokens) do
      {:ok, tagged} ->
        case English.parse(tagged) do
          {:ok, doc} -> 
            # Success! Process doc
            process_document(doc)
          {:error, reason} ->
            IO.puts("Parse error: #{inspect(reason)}")
        end
      {:error, reason} ->
        IO.puts("Tagging error: #{inspect(reason)}")
    end
  {:error, reason} ->
    IO.puts("Tokenization error: #{inspect(reason)}")
end
```

Or use `with`:

```elixir
with {:ok, tokens} <- English.tokenize(text),
     {:ok, tagged} <- English.tag_pos(tokens),
     {:ok, doc} <- English.parse(tagged) do
  process_document(doc)
else
  {:error, reason} -> 
    IO.puts("Error: #{inspect(reason)}")
end
```

## Language Support

### Supported Languages

Nasty currently supports:
- **English** (`:en`) - Fully implemented
- **Spanish** (`:es`) - Fully implemented
- **Catalan** (`:ca`) - Fully implemented

### Using Different Languages

Each language has its own module; you can call it directly (as in the earlier examples) or select it with the `language:` option on `Nasty.parse/2`:

```elixir
# English
alias Nasty.Language.English
{:ok, doc_en} = Nasty.parse("The cat runs.", language: :en)

# Spanish
alias Nasty.Language.Spanish
{:ok, doc_es} = Nasty.parse("El gato corre.", language: :es)

# Catalan
alias Nasty.Language.Catalan
{:ok, doc_ca} = Nasty.parse("El gat corre.", language: :ca)
```

### Language Detection

Auto-detect the language:

```elixir
{:ok, lang} = Nasty.Language.Registry.detect_language("Hola mundo")
# => {:ok, :es}

{:ok, lang} = Nasty.Language.Registry.detect_language("Hello world")
# => {:ok, :en}
```
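
Detection pairs well with parsing when the input language is not known in advance. A small sketch, assuming the detected code can be passed straight to `Nasty.parse/2` as the `:language` option (the wrapper module name is arbitrary):

```elixir
defmodule MyApp.AutoParse do
  @doc "Detects the language of `text`, then parses it with the detected language."
  def parse(text) do
    with {:ok, lang} <- Nasty.Language.Registry.detect_language(text),
         {:ok, doc} <- Nasty.parse(text, language: lang) do
      {:ok, lang, doc}
    end
  end
end
```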

## Troubleshooting

### Common Issues

#### Issue 1: Module Not Found

**Error:**
```
** (UndefinedFunctionError) function Nasty.Language.English.tokenize/1 is undefined
```

**Solution:**
Make sure you've compiled the project:
```bash
mix deps.get
mix compile
```

#### Issue 2: Empty Token List

**Problem:**
```elixir
{:ok, []} = English.tokenize("")
```

**Solution:**
Empty strings return empty token lists. Check your input:
```elixir
text = String.trim(user_input)
if text != "" do
  English.tokenize(text)
else
  {:error, :empty_input}
end
```

#### Issue 3: Parse Errors with Long Sentences

**Problem:**
Very long or complex sentences may fail to parse.

**Solution:**
Split long sentences:
```elixir
sentences =
  text
  |> String.split(~r/[.!?]+/)
  |> Enum.map(&String.trim/1)
  |> Enum.filter(&(&1 != ""))

Enum.each(sentences, fn sent ->
  {:ok, doc} = Nasty.parse(sent, language: :en)
  # Process doc
end)
```

#### Issue 4: Entities Not Recognized

**Problem:**
Named entities in your text are not detected.

**Solution:**
Entity recognition depends on lexicons. For specialized domains, you may need to add custom entity patterns or use statistical models:

```elixir
# Use rule-based (default)
{:ok, tagged} = English.tag_pos(tokens)
entities = EntityRecognizer.recognize(tagged)

# Or use CRF model (better accuracy)
entities = EntityRecognizer.recognize(tagged, model: :crf)
```

### Performance Issues

#### Slow Processing

If processing is slow:

1. **Use parallel processing** for multiple documents
2. **Cache parsed documents** to avoid re-parsing (see the caching sketch below)
3. **Use simpler models** for POS tagging (`:rule` instead of `:neural`)

```elixir
# Fast rule-based tagging
{:ok, tagged} = English.tag_pos(tokens, model: :rule)

# Better accuracy but slower
{:ok, tagged} = English.tag_pos(tokens, model: :hmm)
```
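
For the caching suggestion above, here is a minimal sketch using ETS; the table and module names are arbitrary, and `init/0` should be called once at application startup:

```elixir
defmodule MyApp.ParseCache do
  @table :nasty_parse_cache

  @doc "Creates the cache table; call once, e.g. from Application.start/2."
  def init do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
    :ok
  end

  @doc "Parses each unique text once and caches the resulting document."
  def parse(text, opts \\ [language: :en]) do
    case :ets.lookup(@table, text) do
      [{^text, doc}] ->
        {:ok, doc}

      [] ->
        with {:ok, doc} <- Nasty.parse(text, opts) do
          :ets.insert(@table, {text, doc})
          {:ok, doc}
        end
    end
  end
end
```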

### Getting Help

- **Documentation**: Check [docs/](docs) for detailed guides
- **Examples**: See [examples/](examples) for working code
- **Issues**: Report bugs on [GitHub](https://github.com/am-kantox/nasty/issues)

## Next Steps

### Learn More

1. **Read the User Guide**: [USER_GUIDE.md](USER_GUIDE.md) for comprehensive examples
2. **Explore Examples**: [EXAMPLES.md](EXAMPLES.md) for runnable scripts
3. **Understand Architecture**: [ARCHITECTURE.md](ARCHITECTURE.md) for system design
4. **Try Translation**: [TRANSLATION.md](TRANSLATION.md) for multilingual features

### Try the Examples

Run the example scripts:

```bash
# Basic tokenization
elixir examples/tokenizer_example.exs

# Question answering
elixir examples/question_answering.exs

# Translation
elixir examples/translation_example.exs

# Multilingual comparison
elixir examples/multilingual_pipeline.exs
```

### Build Something

Now that you understand the basics, try building:

1. **Text Analyzer**: Extract keywords, entities, and sentiment
2. **Translation Tool**: Translate documents between languages
3. **Chatbot**: Parse user input and generate responses
4. **Content Categorizer**: Classify documents by topic
5. **Grammar Checker**: Analyze and correct grammatical errors

### Advanced Topics

Once you're comfortable with the basics, explore:

- **Statistical Models**: Train custom POS taggers
- **Neural Networks**: Use BiLSTM-CRF for better accuracy
- **Information Extraction**: Extract relations and events
- **Question Answering**: Build Q&A systems
- **Custom Grammars**: Define domain-specific grammar rules

## Quick Reference

### Essential Functions

```elixir
# Parsing
Nasty.parse(text, language: :en)

# Rendering
Nasty.render(ast)

# Translation
Nasty.Translation.Translator.translate(ast, target_language)

# Querying
Nasty.Utils.Query.find_by_pos(doc, :noun)
Nasty.Utils.Query.extract_entities(doc)

# Transformation
Nasty.Utils.Transform.normalize_case(doc, :lower)
Nasty.Utils.Transform.remove_punctuation(doc)
```

### Language Modules

```elixir
Nasty.Language.English
Nasty.Language.Spanish
Nasty.Language.Catalan
```

### Common Modules

```elixir
alias Nasty.Language.English
alias Nasty.Translation.Translator
alias Nasty.Utils.{Query, Transform, Traversal}
alias Nasty.Rendering.Text
```

## Summary

You now know how to:
- ✓ Install and set up Nasty
- ✓ Parse text into an AST
- ✓ Extract information from documents
- ✓ Translate between languages
- ✓ Handle common issues
- ✓ Use best practices

**Happy parsing!** 🚀