# Training Neural Models Guide
This guide provides detailed instructions for training neural models in Nasty, from data preparation to deployment.
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Data Preparation](#data-preparation)
3. [Training POS Tagging Models](#training-pos-tagging-models)
4. [Advanced Training Options](#advanced-training-options)
5. [Model Evaluation](#model-evaluation)
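6. [Troubleshooting](#troubleshooting)
7. [Best Practices](#best-practices)
8. [Production Deployment](#production-deployment)
9. [Next Steps](#next-steps)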
## Prerequisites
### System Requirements
- **Memory**: Minimum 4GB RAM for training, 8GB+ recommended
- **CPU**: Multi-core CPU (4+ cores recommended)
- **GPU**: Optional but highly recommended (roughly 10-100x speedup with EXLA, depending on model size and hardware)
- **Storage**: 500MB-2GB for models and training data
### Dependencies
All neural dependencies are included in `mix.exs`:
```elixir
{:axon, "~> 0.7"},
{:nx, "~> 0.9"},
{:exla, "~> 0.9"},
{:bumblebee, "~> 0.6"}
```
Install with:
```bash
mix deps.get
```
### Enable GPU Acceleration (Optional)
Set the `XLA_TARGET` environment variable before compiling dependencies so that EXLA is built with GPU support:
```bash
export XLA_TARGET=cuda12 # recent xla releases; older releases used cuda120 or cuda118; rocm, etc. also exist
mix deps.compile
```
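Compiling EXLA with GPU support does not by itself route tensor operations through it: Nx uses its pure-Elixir binary backend unless told otherwise. Whether Nasty already configures this is not stated in this guide, so the fragment below is a sketch under the assumption that you need to set it yourself. Both keys are standard Nx settings.

```elixir
# config/config.exs
import Config

# Assumption: Nasty does not already set these for you.
# Store tensors on the EXLA backend by default.
config :nx, default_backend: EXLA.Backend

# Compile numerical definitions (defn) with EXLA.
config :nx, :default_defn_options, compiler: EXLA
```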
## Data Preparation
### CoNLL-U Format
Neural models train on CoNLL-U formatted data. Sentences are separated by a blank line, with one token per line:
```
1	The	the	DET	DT	_	2	det	_	_
2	cat	cat	NOUN	NN	_	3	nsubj	_	_
3	sat	sit	VERB	VBD	_	0	root	_	_

1	Dogs	dog	NOUN	NNS	_	2	nsubj	_	_
2	run	run	VERB	VBP	_	0	root	_	_
```
Columns (tab-separated):
1. Index
2. Word form
3. Lemma
4. **UPOS tag** (used for training)
5. XPOS tag
6. Features
7. Head
8. Dependency relation
9. Enhanced dependencies (DEPS)
10. Miscellaneous annotations (MISC)
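To make the column layout concrete, here is a minimal sketch that pulls the word form (column 2) and UPOS tag (column 4) out of CoNLL-U text. `ConlluSketch` is a hypothetical helper written for this guide, not Nasty's `DataLoader`; it returns `{words, tags}` tuples, the shape the evaluation example in this guide iterates over.

```elixir
defmodule ConlluSketch do
  @moduledoc "Hypothetical helper, not part of Nasty: extracts {words, upos_tags} per sentence."

  def parse(text) do
    text
    # A blank line marks a sentence boundary.
    |> String.split(~r/\r?\n[ \t]*\r?\n/, trim: true)
    |> Enum.map(&parse_sentence/1)
    |> Enum.reject(fn {words, _tags} -> words == [] end)
  end

  defp parse_sentence(block) do
    block
    |> String.split(~r/\r?\n/, trim: true)
    # Lines starting with "#" are sentence-level comments.
    |> Enum.reject(&String.starts_with?(&1, "#"))
    |> Enum.map(&String.split(&1, "\t"))
    |> Enum.filter(fn cols -> length(cols) == 10 and integer_id?(hd(cols)) end)
    |> Enum.map(fn [_id, form, _lemma, upos | _rest] -> {form, upos} end)
    |> Enum.unzip()
  end

  # Skips multiword-token ranges ("1-2") and empty nodes ("1.1").
  defp integer_id?(id), do: match?({_n, ""}, Integer.parse(id))
end

sample =
  [
    ~w(1 The the DET DT _ 2 det _ _),
    ~w(2 cat cat NOUN NN _ 3 nsubj _ _),
    ~w(3 sat sit VERB VBD _ 0 root _ _),
    [],
    ~w(1 Dogs dog NOUN NNS _ 2 nsubj _ _),
    ~w(2 run run VERB VBP _ 0 root _ _)
  ]
  |> Enum.map_join("\n", &Enum.join(&1, "\t"))

ConlluSketch.parse(sample)
# => [{["The", "cat", "sat"], ["DET", "NOUN", "VERB"]}, {["Dogs", "run"], ["NOUN", "VERB"]}]
```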
### Where to Get Training Data
**Universal Dependencies** corpora:
- English: [UD_English-EWT](https://github.com/UniversalDependencies/UD_English-EWT)
- Spanish: [UD_Spanish-GSD](https://github.com/UniversalDependencies/UD_Spanish-GSD)
- Catalan: [UD_Catalan-AnCora](https://github.com/UniversalDependencies/UD_Catalan-AnCora)
Download and extract:
```bash
mkdir -p data && cd data
git clone https://github.com/UniversalDependencies/UD_English-EWT
```
### Data Split Recommendations
- **Training**: 80% (or use provided train split)
- **Validation**: 10% (or use provided dev split)
- **Test**: 10% (or use provided test split)
If you provide a single file, the training pipeline splits off the validation portion automatically (controlled by `--validation-split`, default 0.1).
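If a corpus ships as a single file and you also want an explicit held-out test set, the 80/10/10 recommendation can be sketched as follows. `SplitSketch` is a hypothetical helper, not part of Nasty; it shuffles with a fixed seed so the split is reproducible, and uses integer arithmetic to avoid floating-point rounding at the boundaries.

```elixir
defmodule SplitSketch do
  # Hypothetical helper: shuffles with a fixed seed, then cuts 80/10/10.
  def split(sentences, seed \\ 42) do
    # Seeding the process-local generator makes Enum.shuffle/1 deterministic.
    :rand.seed(:exsss, {seed, seed, seed})
    shuffled = Enum.shuffle(sentences)

    n = length(shuffled)
    n_train = div(n * 8, 10)
    n_valid = div(n, 10)

    {train, rest} = Enum.split(shuffled, n_train)
    {valid, test} = Enum.split(rest, n_valid)
    %{train: train, valid: valid, test: test}
  end
end

parts = SplitSketch.split(Enum.to_list(1..100))
{length(parts.train), length(parts.valid), length(parts.test)}
# => {80, 10, 10}
```

Whatever remains after the training and validation cuts goes to the test set, so no sentence is dropped when the corpus size is not a multiple of ten.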
## Training POS Tagging Models
### Quick Start - CLI Training
The easiest way to train is using the Mix task:
```bash
mix nasty.train.neural_pos \
--corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
--output models/pos_neural_v1.axon \
--epochs 10 \
--batch-size 32
```
### CLI Options Reference
```bash
mix nasty.train.neural_pos [options]

Required:
  --corpus PATH              Path to CoNLL-U training corpus

Optional:
  --output PATH              Model save path (default: pos_neural.axon)
  --validation PATH          Path to validation corpus (auto-split if not provided)
  --epochs N                 Number of training epochs (default: 10)
  --batch-size N             Batch size (default: 32)
  --learning-rate F          Learning rate (default: 0.001)
  --hidden-size N            LSTM hidden size (default: 256)
  --embedding-dim N          Word embedding dimension (default: 300)
  --num-layers N             Number of LSTM layers (default: 2)
  --dropout F                Dropout rate (default: 0.3)
  --use-char-cnn             Enable character CNN (default: enabled)
  --no-char-cnn              Disable character CNN
  --char-embedding-dim N     Character embedding dim (default: 50)
  --char-vocab-size N        Character vocabulary size
  --optimizer NAME           Optimizer: adam, sgd, adamw (default: adam)
  --early-stopping N         Early stopping patience (default: 3)
  --checkpoint-dir PATH      Save checkpoints during training
  --checkpoint-frequency N   Save a checkpoint every N epochs
  --min-freq N               Min word frequency for vocab (default: 1)
  --validation-split F       Validation split fraction (default: 0.1)
```
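The effect of `--min-freq` can be illustrated with a small vocabulary builder. This is a sketch under the assumption that words below the threshold are mapped to a shared unknown token; `VocabSketch` and the `<PAD>`/`<UNK>` token names are hypothetical, and Nasty's actual vocabulary code may differ.

```elixir
defmodule VocabSketch do
  @pad "<PAD>"
  @unk "<UNK>"

  # Builds a word -> index map, dropping words seen fewer than min_freq times.
  def build(sentences, min_freq \\ 1) do
    kept =
      sentences
      |> Enum.flat_map(fn {words, _tags} -> words end)
      |> Enum.frequencies()
      |> Enum.filter(fn {_word, count} -> count >= min_freq end)
      |> Enum.map(fn {word, _count} -> word end)
      |> Enum.sort()

    # Reserve the first two indices for padding and unknown words.
    [@pad, @unk | kept]
    |> Enum.with_index()
    |> Map.new()
  end

  # Words missing from the vocabulary fall back to the unknown-token index.
  def lookup(vocab, word), do: Map.get(vocab, word, Map.fetch!(vocab, @unk))
end

sentences = [
  {["the", "cat"], ["DET", "NOUN"]},
  {["the", "dog"], ["DET", "NOUN"]}
]

vocab = VocabSketch.build(sentences, 2)
# => %{"<PAD>" => 0, "<UNK>" => 1, "the" => 2}
VocabSketch.lookup(vocab, "cat")
# => 1
```

Raising the threshold shrinks the embedding table and forces the model to practice on unknown tokens during training, at the cost of losing word-specific information for rare words.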
### Programmatic Training
For more control, train programmatically:
```elixir
alias Nasty.Statistics.POSTagging.NeuralTagger
alias Nasty.Statistics.Neural.DataLoader

# Load training data
{:ok, sentences} = DataLoader.load_conllu_file("data/train.conllu")

# Split into train/validation
{train_data, valid_data} = DataLoader.split_data(sentences, validation_split: 0.1)

# Create and configure tagger
tagger = NeuralTagger.new(training_data: train_data)

# Train with custom options
{:ok, trained_tagger} =
  NeuralTagger.train(tagger, train_data,
    epochs: 20,
    batch_size: 32,
    learning_rate: 0.001,
    hidden_size: 512,
    embedding_dim: 300,
    num_lstm_layers: 3,
    dropout: 0.5,
    use_char_cnn: true,
    validation_data: valid_data,
    early_stopping_patience: 5
  )

# Save trained model
:ok = NeuralTagger.save(trained_tagger, "models/pos_advanced.axon")
```
## Advanced Training Options
### Hyperparameter Tuning
**Hidden Size** (`--hidden-size`):
- Small (under 256): Faster training, less memory, slightly lower accuracy
- Medium (256-512): Balanced performance (default: 256)
- Large (above 512, up to 1024): Best accuracy, requires more memory/time
**Embedding Dimension** (`--embedding-dim`):
- Small (50-100): Fast, low memory
- Medium (300): Good balance (default, matches GloVe)
- Large (above 300, up to 1024): For very large corpora
**Number of LSTM Layers** (`--num-layers`):
- 1 layer: Fast, simple patterns
- 2 layers: Balanced (default, recommended)
- 3+ layers: Complex patterns, risk overfitting
**Dropout** (`--dropout`):
- 0.0: No regularization (risk overfitting)
- 0.3: Good default
- 0.5: Strong regularization for small datasets
**Batch Size** (`--batch-size`):
- Small (8-16): Better generalization, slower
- Medium (32): Good balance (default)
- Large (64-128): Faster training, needs more memory
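To see how these settings drive memory use, the sketch below estimates the number of trainable parameters in the embedding table and the LSTM stack. It assumes a textbook unidirectional LSTM (four gates, each with input weights, recurrent weights, and one bias vector) and ignores the character CNN and the output layer, so it is a rough lower bound rather than a description of Nasty's actual architecture. A bidirectional model would roughly double the LSTM portion. `ParamEstimate` and the vocabulary size of 20,000 are hypothetical.

```elixir
defmodule ParamEstimate do
  # Textbook LSTM layer: 4 gates x (input weights + recurrent weights + bias).
  def lstm_layer(input_dim, hidden) do
    4 * (input_dim * hidden + hidden * hidden + hidden)
  end

  # Embedding table plus a stack of num_layers (>= 1) LSTM layers.
  def total(vocab_size, embedding_dim, hidden, num_layers) do
    embedding = vocab_size * embedding_dim
    first = lstm_layer(embedding_dim, hidden)
    rest = (num_layers - 1) * lstm_layer(hidden, hidden)
    embedding + first + rest
  end

  # Weight storage only, assuming 32-bit floats by default.
  def megabytes(params, bytes_per_param \\ 4), do: params * bytes_per_param / 1_000_000
end

# Defaults from this guide with a hypothetical 20,000-word vocabulary.
params = ParamEstimate.total(20_000, 300, 256, 2)
# => 7_095_680
ParamEstimate.megabytes(params)
```

In this estimate the embedding table (6,000,000 of the parameters) dominates, which is why reducing `--embedding-dim` or raising `--min-freq` tends to save more memory than trimming the hidden size. Training needs additional memory for gradients, optimizer state, and activations, which grow with batch size.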
### Character CNN Configuration
A character-level CNN builds word representations from characters, which helps with out-of-vocabulary words:
```bash
mix nasty.train.neural_pos \
--corpus data/train.conllu \
--use-char-cnn \
--char-embedding-dim 50 \
--char-vocab-size 150
```
Disable it if training is too slow:
```bash
mix nasty.train.neural_pos \
--corpus data/train.conllu \
--no-char-cnn
```
### Using Pre-trained Embeddings
Load GloVe embeddings for better initialization:
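```elixir
alias Nasty.Statistics.Neural.Embeddings
alias Nasty.Statistics.POSTagging.NeuralTagger

# `base_tagger` and `train_data` are assumed to come from the Programmatic
# Training example; `word_vocab` is the vocabulary built from the training data.

# Load GloVe vectors
glove_embeddings = Embeddings.load_glove("data/glove.6B.300d.txt", word_vocab)

# Train with pre-trained embeddings
{:ok, tagger} =
  NeuralTagger.train(base_tagger, train_data,
    pretrained_embeddings: glove_embeddings,
    freeze_embeddings: false # Allow fine-tuning
  )
```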
Note: GloVe loading is currently a placeholder. Full implementation coming soon.
### Optimizer Selection
**Adam** (default):
- Adaptive learning rates
- Works well out-of-the-box
- Good for most use cases
**SGD**:
- Simple, stable
- May need learning rate scheduling
- Good baseline
**AdamW**:
- Adam with weight decay
- Better generalization
- Recommended for large models
```bash
mix nasty.train.neural_pos \
--corpus data/train.conllu \
--optimizer adamw \
--learning-rate 0.0001
```
### Early Stopping
Automatically stop training when validation performance plateaus:
```bash
mix nasty.train.neural_pos \
--corpus data/train.conllu \
--validation data/dev.conllu \
--early-stopping 5 # Stop after 5 epochs without improvement
```
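The patience rule can be sketched in a few lines. This is an illustration of the general technique, not Nasty's training loop: `EarlyStoppingSketch` is a hypothetical module that replays a list of per-epoch validation losses and reports where training would stop.

```elixir
defmodule EarlyStoppingSketch do
  # Replays validation losses and stops once `patience` consecutive epochs
  # have passed without a new best loss.
  def run(val_losses, patience) do
    initial = %{best: nil, best_epoch: 0, waited: 0, last_epoch: 0, stopped?: false}

    val_losses
    |> Enum.with_index(1)
    |> Enum.reduce_while(initial, fn {loss, epoch}, state ->
      state = %{state | last_epoch: epoch}

      cond do
        # New best: remember it and reset the patience counter.
        is_nil(state.best) or loss < state.best ->
          {:cont, %{state | best: loss, best_epoch: epoch, waited: 0}}

        # Patience exhausted: halt training.
        state.waited + 1 >= patience ->
          {:halt, %{state | waited: state.waited + 1, stopped?: true}}

        # No improvement yet, but still within patience.
        true ->
          {:cont, %{state | waited: state.waited + 1}}
      end
    end)
  end
end

result = EarlyStoppingSketch.run([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5], 3)
{result.stopped?, result.last_epoch, result.best_epoch}
# => {true, 6, 3}
```

In this run the loss at epoch 7 would have been a new best, but training had already stopped at epoch 6. That is the trade-off behind the patience value: a small value saves time, a larger one tolerates temporary plateaus.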
### Checkpointing
Save model checkpoints during training:
```bash
mix nasty.train.neural_pos \
--corpus data/train.conllu \
--checkpoint-dir checkpoints/ \
--checkpoint-frequency 2 # Save every 2 epochs
```
Checkpoints are named: `checkpoint_epoch_001.axon`, `checkpoint_epoch_002.axon`, etc.
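Because the epoch number is zero-padded, checkpoint names sort lexicographically in epoch order, which makes it easy to find the most recent one. The sketch below assumes only the naming pattern shown above; `CheckpointSketch` is a hypothetical helper, not a Nasty module.

```elixir
defmodule CheckpointSketch do
  # Builds a name following the pattern described in this guide.
  def filename(epoch) do
    "checkpoint_epoch_" <> String.pad_leading(Integer.to_string(epoch), 3, "0") <> ".axon"
  end

  # Returns the path of the highest-numbered checkpoint in dir, or nil if none exist.
  def latest(dir) do
    dir
    |> Path.join("checkpoint_epoch_*.axon")
    |> Path.wildcard()
    |> Enum.sort()
    |> List.last()
  end
end

CheckpointSketch.filename(2)
# => "checkpoint_epoch_002.axon"
```

The returned path could then be passed to `NeuralTagger.load/1`, assuming checkpoints use the same file format as the final saved model, which this guide does not confirm.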
## Model Evaluation
### During Training
After each epoch, the training task prints the loss, overall accuracy, and per-tag accuracy:
```
Epoch 1/10
Loss: 0.456
Accuracy: 0.923
Per-tag accuracy:
NOUN: 0.957
VERB: 0.942
DET: 0.989
...
```
### Post-Training Evaluation
Evaluate on test set:
```bash
mix nasty.eval.neural_pos \
--model models/pos_neural_v1.axon \
--test data/en_ewt-ud-test.conllu
```
Or programmatically:
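```elixir
alias Nasty.Statistics.POSTagging.NeuralTagger
alias Nasty.Statistics.Neural.DataLoader

{:ok, model} = NeuralTagger.load("models/pos_neural_v1.axon")
{:ok, test_sentences} = DataLoader.load_conllu_file("data/test.conllu")

# Accumulate counts with Enum.reduce/3. Rebinding variables inside a `for`
# comprehension does not escape its scope, so counters would stay at 0.
{correct, total} =
  Enum.reduce(test_sentences, {0, 0}, fn {words, gold_tags}, {correct, total} ->
    {:ok, pred_tags} = NeuralTagger.predict(model, words, [])
    matches = pred_tags |> Enum.zip(gold_tags) |> Enum.count(fn {p, g} -> p == g end)
    {correct + matches, total + length(gold_tags)}
  end)

accuracy = correct / total
IO.puts("Accuracy: #{Float.round(accuracy * 100, 2)}%")
```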
### Metrics to Track
- **Overall Accuracy**: Percentage of correctly tagged tokens
- **Per-Tag Accuracy**: Accuracy for each POS tag
- **Per-Tag Precision/Recall**: For detailed error analysis
- **OOV Accuracy**: Performance on out-of-vocabulary words
- **Training Time**: Total time and time per epoch
- **Convergence**: Number of epochs to best validation score
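Per-tag precision and recall can be computed directly from gold and predicted tags. The following is a minimal sketch; `TagMetrics` is a hypothetical helper, and the input is a flat list of `{gold, predicted}` pairs collected over the whole test set.

```elixir
defmodule TagMetrics do
  # pairs: list of {gold_tag, predicted_tag} tuples, one per token.
  def per_tag(pairs) do
    tags = pairs |> Enum.flat_map(fn {g, p} -> [g, p] end) |> Enum.uniq()

    Map.new(tags, fn tag ->
      true_pos = Enum.count(pairs, fn {g, p} -> g == tag and p == tag end)
      false_pos = Enum.count(pairs, fn {g, p} -> g != tag and p == tag end)
      false_neg = Enum.count(pairs, fn {g, p} -> g == tag and p != tag end)

      {tag,
       %{
         precision: ratio(true_pos, true_pos + false_pos),
         recall: ratio(true_pos, true_pos + false_neg)
       }}
    end)
  end

  # Report 0.0 rather than dividing by zero when a tag was never predicted or never gold.
  defp ratio(_num, 0), do: 0.0
  defp ratio(num, den), do: num / den
end

pairs = [{"NOUN", "NOUN"}, {"NOUN", "VERB"}, {"VERB", "VERB"}, {"DET", "DET"}]
metrics = TagMetrics.per_tag(pairs)
metrics["NOUN"]
# => %{precision: 1.0, recall: 0.5}
```

Low recall for a tag means the model misses it; low precision means the model over-predicts it. In the example, one NOUN was mistagged as VERB, which lowers NOUN recall and VERB precision at the same time.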
## Troubleshooting
### Out of Memory
**Symptoms**: Process crashes with memory error
**Solutions**:
1. Reduce batch size: `--batch-size 16` or `--batch-size 8`
2. Reduce hidden size: `--hidden-size 128`
3. Reduce embedding dimension: `--embedding-dim 100`
4. Disable character CNN: `--no-char-cnn`
5. Use smaller training corpus subset
### Training Too Slow
**Symptoms**: Hours per epoch
**Solutions**:
1. Enable EXLA GPU support (see Prerequisites)
2. Increase batch size: `--batch-size 64`
3. Disable character CNN if not needed
4. Use fewer LSTM layers: `--num-layers 1`
5. Reduce hidden size: `--hidden-size 128`
### Overfitting
**Symptoms**: High training accuracy, low validation accuracy
**Solutions**:
1. Increase dropout: `--dropout 0.5`
2. Use more training data
3. Enable early stopping: `--early-stopping 3`
4. Reduce model complexity (fewer layers, smaller hidden size)
5. Add weight decay (L2-style regularization), for example with `--optimizer adamw`
### Underfitting
**Symptoms**: Low training and validation accuracy
**Solutions**:
1. Increase model capacity: `--hidden-size 512 --num-layers 3`
2. Train longer: `--epochs 20`
3. Lower dropout: `--dropout 0.2`
4. Increase learning rate: `--learning-rate 0.01`
5. Check data quality (wrong labels, formatting issues)
### Validation Loss Not Decreasing
**Symptoms**: Validation loss stays flat or increases
**Solutions**:
1. Lower learning rate: `--learning-rate 0.0001`
2. Add early stopping: `--early-stopping 3`
3. Check for data issues (train/validation overlap, different distributions)
4. Try different optimizer: `--optimizer adamw`
### CoNLL-U Loading Errors
**Symptoms**: Parser errors, wrong tag counts
**Solutions**:
1. Verify file format (tab-separated, 10 columns)
2. Check that a blank line separates consecutive sentences
3. Ensure UTF-8 encoding
4. Remove or fix malformed lines
5. Validate with UD validator: https://universaldependencies.org/tools.html
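Before reaching for the full UD validator, a quick structural check often locates the offending lines. The sketch below covers only the first three items in the list above (tab separation, column count, UTF-8); `ConlluCheck` is a hypothetical helper, not part of Nasty.

```elixir
defmodule ConlluCheck do
  # True when the binary is valid UTF-8.
  def utf8?(text), do: String.valid?(text)

  # Returns a list of {line_number, reason} for structurally suspicious lines.
  def problems(text) do
    text
    |> String.split("\n")
    |> Enum.with_index(1)
    |> Enum.flat_map(fn {raw, n} ->
      line = String.trim_trailing(raw, "\r")
      columns = length(String.split(line, "\t"))

      cond do
        # Blank lines (sentence boundaries) and comments are fine.
        line == "" or String.starts_with?(line, "#") -> []
        columns == 1 -> [{n, "no tab separators found"}]
        columns != 10 -> [{n, "expected 10 columns, got #{columns}"}]
        true -> []
      end
    end)
  end
end

bad = "1 The the DET DT _ 2 det _ _\n1\tThe\tthe\tDET"
ConlluCheck.problems(bad)
# => [{1, "no tab separators found"}, {2, "expected 10 columns, got 4"}]
```

To check a file on disk, read it first with `File.read!/1` and pass the contents to `ConlluCheck.utf8?/1` and `ConlluCheck.problems/1`.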
### Model Not Learning
**Symptoms**: Loss stays constant, accuracy at baseline
**Solutions**:
1. Check data quality (are labels correct?)
2. Verify vocabulary is being built correctly
3. Increase learning rate: `--learning-rate 0.01`
4. Remove or reduce dropout initially
5. Check for bugs in data preprocessing
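To judge whether accuracy really is "at baseline", it helps to know what the baseline is. A common reference point is the majority-tag baseline: always predict the most frequent tag in the training data. The sketch below computes it; `BaselineSketch` is a hypothetical helper operating on `{words, tags}` tuples.

```elixir
defmodule BaselineSketch do
  # Returns {majority_tag, accuracy} when every test token gets the
  # most frequent training tag.
  def majority_tag_accuracy(train, test) do
    {majority, _count} =
      train
      |> Enum.flat_map(fn {_words, tags} -> tags end)
      |> Enum.frequencies()
      |> Enum.max_by(fn {_tag, count} -> count end)

    test_tags = Enum.flat_map(test, fn {_words, tags} -> tags end)
    hits = Enum.count(test_tags, &(&1 == majority))
    {majority, hits / length(test_tags)}
  end
end

train = [
  {["the", "cat", "sat"], ["DET", "NOUN", "VERB"]},
  {["dogs", "run", "home"], ["NOUN", "VERB", "NOUN"]}
]

test = [{["cats", "dogs", "run", "the"], ["NOUN", "NOUN", "VERB", "DET"]}]

BaselineSketch.majority_tag_accuracy(train, test)
# => {"NOUN", 0.5}
```

If the trained model's validation accuracy is close to this number, it has most likely collapsed to predicting a single tag, which points to a data or learning-rate problem rather than insufficient capacity.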
## Best Practices
### For Small Datasets (<5K sentences)
```bash
mix nasty.train.neural_pos \
--corpus data/small_corpus.conllu \
--epochs 20 \
--batch-size 16 \
--hidden-size 128 \
--embedding-dim 100 \
--dropout 0.5 \
--early-stopping 5 \
--no-char-cnn
```
### For Medium Datasets (5K-50K sentences)
```bash
mix nasty.train.neural_pos \
--corpus data/medium_corpus.conllu \
--epochs 15 \
--batch-size 32 \
--hidden-size 256 \
--embedding-dim 300 \
--dropout 0.3 \
--use-char-cnn \
--early-stopping 3
```
### For Large Datasets (50K+ sentences)
```bash
mix nasty.train.neural_pos \
--corpus data/large_corpus.conllu \
--epochs 10 \
--batch-size 64 \
--hidden-size 512 \
--embedding-dim 300 \
--num-layers 3 \
--dropout 0.3 \
--use-char-cnn \
--optimizer adamw \
--learning-rate 0.0001
```
## Production Deployment
After training, deploy your model:
1. **Save the trained model**:
   ```bash
   # Model is already saved by training task
   ls -lh models/pos_neural_v1.axon
   ```
2. **Load in production**:
   ```elixir
   {:ok, model} = NeuralTagger.load("models/pos_neural_v1.axon")
   ```
3. **Integrate with POSTagger**:
   ```elixir
   # Use neural mode
   {:ok, ast} = Nasty.parse(text, language: :en, model: :neural, neural_model: model)
   # Or use ensemble mode
   {:ok, ast} = Nasty.parse(text, language: :en, model: :neural_ensemble, neural_model: model)
   ```
4. **Monitor performance**:
   - Track accuracy on representative sample
   - Monitor latency (should be <100ms per sentence on CPU)
   - Watch memory usage
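Latency can be measured with Erlang's `:timer.tc/1`. The sketch below times an arbitrary function over a list of inputs and summarizes the results; `LatencySketch` is a hypothetical helper, and the stand-in function should be replaced by your real call, for example `fn text -> Nasty.parse(text, language: :en, model: :neural, neural_model: model) end`.

```elixir
defmodule LatencySketch do
  # Runs fun on each input and returns per-call latencies in milliseconds.
  def measure(fun, inputs) do
    Enum.map(inputs, fn input ->
      {micros, _result} = :timer.tc(fn -> fun.(input) end)
      micros / 1000
    end)
  end

  # Mean, 95th percentile (nearest-rank), and maximum of a non-empty list.
  def summary(latencies_ms) do
    sorted = Enum.sort(latencies_ms)
    n = length(sorted)
    p95_index = min(n - 1, ceil(n * 0.95) - 1)

    %{
      mean: Enum.sum(sorted) / n,
      p95: Enum.at(sorted, p95_index),
      max: List.last(sorted)
    }
  end
end

# Stand-in workload; swap in the real tagging call in production.
stand_in = fn sentence -> String.split(sentence) end

latencies = LatencySketch.measure(stand_in, ["The cat sat", "Dogs run"])
LatencySketch.summary(latencies)
```

Tracking the 95th percentile alongside the mean is useful because long sentences can take far longer than typical ones and are hidden by an average.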
## Next Steps
- Read [NEURAL_MODELS.md](NEURAL_MODELS.md) for architecture details
- See [PRETRAINED_MODELS.md](PRETRAINED_MODELS.md) for using Bumblebee transformers
- Check [examples/](../examples/) for complete training scripts
- Explore UD treebanks for more training data