docs/FINE_TUNING.md

# Fine-tuning Transformers Guide

Complete guide to fine-tuning pre-trained transformer models on custom datasets in Nasty.

## Overview

Fine-tuning adapts a pre-trained transformer (BERT, RoBERTa, etc.) to your specific NLP task. Instead of training from scratch, you:

1. Start with a model trained on billions of tokens
2. Train for a few epochs on your task-specific data (1000+ examples)
3. Achieve state-of-the-art accuracy in minutes/hours instead of days/weeks

**Benefits:**
- 98-99% POS tagging accuracy (vs 97-98% BiLSTM-CRF)
- 93-95% NER F1 score (vs 75-80% rule-based)
- 10-100x less training data required
- Transfer learning from massive pre-training

## Quick Start

```bash
# Fine-tune RoBERTa for POS tagging
mix nasty.fine_tune.pos \
  --model roberta_base \
  --train data/en_ewt-ud-train.conllu \
  --validation data/en_ewt-ud-dev.conllu \
  --output models/pos_finetuned \
  --epochs 3 \
  --batch-size 16

# Fine-tuning time: 10-30 minutes (CPU), 2-5 minutes (GPU)
# Result: 98-99% accuracy on UD English
```

## Prerequisites

### System Requirements

- **Memory**: 8GB+ RAM (16GB recommended)
- **Storage**: 2GB for models and data
- **GPU**: Optional but highly recommended (10-30x speedup with EXLA)
- **Time**: 10-30 minutes per run (CPU), 2-5 minutes (GPU)

### Required Data

Training data must be in **CoNLL-U format**:

```
1	The	the	DET	DT	_	2	det	_	_
2	cat	cat	NOUN	NN	_	3	nsubj	_	_
3	sat	sit	VERB	VBD	_	0	root	_	_

1	Dogs	dog	NOUN	NNS	_	2	nsubj	_	_
2	run	run	VERB	VBP	_	0	root	_	_
```

Download Universal Dependencies corpora:
- English: [UD_English-EWT](https://github.com/UniversalDependencies/UD_English-EWT)
- Spanish: [UD_Spanish-GSD](https://github.com/UniversalDependencies/UD_Spanish-GSD)
- More: [Universal Dependencies](https://universaldependencies.org/)
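
Before fine-tuning, it helps to confirm the corpus parses cleanly. A minimal check using `DataLoader.load_conllu_file/1` (the same helper used in the programmatic example later in this guide); the path is the EWT training file from the Quick Start:

```elixir
alias Nasty.Statistics.Neural.DataLoader

# Parse the downloaded treebank and make sure it loads without errors.
{:ok, sentences} = DataLoader.load_conllu_file("data/en_ewt-ud-train.conllu")

IO.puts("Loaded #{length(sentences)} sentences")

# Each sentence carries tokens annotated with UPOS tags.
first = hd(sentences)

first.tokens
|> Enum.map(& &1.pos)
|> IO.inspect(label: "UPOS tags of first sentence")
```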

## POS Tagging Fine-tuning

### Basic Usage

```bash
mix nasty.fine_tune.pos \
  --model roberta_base \
  --train data/train.conllu \
  --epochs 3
```

### Full Configuration

```bash
mix nasty.fine_tune.pos \
  --model bert_base_cased \
  --train data/en_ewt-ud-train.conllu \
  --validation data/en_ewt-ud-dev.conllu \
  --output models/pos_bert_finetuned \
  --epochs 5 \
  --batch-size 32 \
  --learning-rate 0.00002 \
  --max-length 512 \
  --eval-steps 500
```

### Options Reference

| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Base transformer (required) | - |
| `--train` | Training CoNLL-U file (required) | - |
| `--validation` | Validation file | None |
| `--output` | Output directory | priv/models/finetuned |
| `--epochs` | Training epochs | 3 |
| `--batch-size` | Batch size | 16 |
| `--learning-rate` | Learning rate | 3e-5 |
| `--max-length` | Max sequence length | 512 |
| `--eval-steps` | Evaluate every N steps | 500 |

## Supported Models

### English Models

**bert-base-cased** (110M params):
- Best for: Case-sensitive tasks, proper nouns
- Memory: ~500MB
- Speed: Medium

**roberta-base** (125M params):
- Best for: General purpose, highest accuracy
- Memory: ~550MB
- Speed: Medium
- **Recommended for most tasks**

**distilbert-base** (66M params):
- Best for: Fast inference, lower memory
- Memory: ~300MB
- Speed: Fast
- Accuracy: ~97% (vs 98% full BERT)

### Multilingual Models

**xlm-roberta-base** (270M params):
- Languages: 100 languages
- Best for: Spanish, Catalan, multilingual
- Memory: ~1.1GB
- Cross-lingual transfer: 90-95% of monolingual

**bert-base-multilingual-cased** (110M params):
- Languages: 104 languages
- Good baseline for many languages
- Memory: ~500MB
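
When loading models programmatically, the CLI identifiers become atoms (as with `:roberta_base` in the programmatic example below); the multilingual atom here is assumed to mirror its CLI flag:

```elixir
alias Nasty.Statistics.Neural.Transformers.Loader

# English, recommended default
{:ok, roberta} = Loader.load_model(:roberta_base)

# Multilingual; atom assumed to mirror the CLI's `xlm_roberta_base`
{:ok, xlm_roberta} = Loader.load_model(:xlm_roberta_base)
```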

## Data Preparation

### Minimum Dataset Size

| Task | Minimum | Recommended | Optimal |
|------|---------|-------------|---------|
| POS Tagging | 1,000 sentences | 5,000 sentences | 10,000+ sentences |
| NER | 500 sentences | 2,000 sentences | 5,000+ sentences |
| Classification | 100 examples/class | 500 examples/class | 1,000+ examples/class |

### Data Splitting

Standard split ratios:

```
Total data: 12,000 sentences

Training:   9,600 (80%)
Validation: 1,200 (10%)
Test:       1,200 (10%)
```
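
A minimal sketch of producing this split in Elixir, assuming all sentences live in a single CoNLL-U file (the `data/all.conllu` path is illustrative); the seed only makes the shuffle reproducible:

```elixir
alias Nasty.Statistics.Neural.DataLoader

{:ok, sentences} = DataLoader.load_conllu_file("data/all.conllu")

# Shuffle deterministically so the split can be reproduced later.
:rand.seed(:exsss, {42, 42, 42})
shuffled = Enum.shuffle(sentences)

total = length(shuffled)
train_count = div(total * 80, 100)
val_count = div(total * 10, 100)

train = Enum.slice(shuffled, 0, train_count)
validation = Enum.slice(shuffled, train_count, val_count)
test = Enum.drop(shuffled, train_count + val_count)
```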

### Data Quality Checklist

- [ ] Consistent annotation scheme (use Universal Dependencies)
- [ ] Balanced representation across domains (news, social media, technical)
- [ ] Clean text (no encoding errors, proper Unicode)
- [ ] No data leakage (train/val/test are disjoint; see the check below)
- [ ] Representative of production data
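
A quick leakage check compares the splits for identical sentences. This sketch assumes the `train`, `validation`, and `test` lists from the splitting example above and only catches exact duplicates:

```elixir
# Sentences that appear in more than one split indicate leakage.
train_set = MapSet.new(train)
held_out = MapSet.union(MapSet.new(validation), MapSet.new(test))

overlap = MapSet.intersection(train_set, held_out)

if MapSet.size(overlap) > 0 do
  IO.puts("Data leakage: #{MapSet.size(overlap)} sentences shared between splits")
else
  IO.puts("Splits are disjoint")
end
```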

## Hyperparameter Tuning

### Learning Rate

Most important hyperparameter!

```bash
# Too high: Model doesn't converge
--learning-rate 0.001  # DON'T USE

# Too low: Learning is very slow
--learning-rate 0.000001  # DON'T USE

# Good defaults:
--learning-rate 0.00003  # RoBERTa, BERT (3e-5)
--learning-rate 0.00002  # DistilBERT (2e-5)
--learning-rate 0.00005  # XLM-RoBERTa (5e-5)
```

### Batch Size

Balance between speed and memory:

```bash
# Small dataset or low memory
--batch-size 8

# Balanced (recommended)
--batch-size 16

# Large dataset, lots of memory
--batch-size 32

# Very large dataset, GPU
--batch-size 64
```

Memory usage by batch size:
- Batch 8: ~2GB GPU memory
- Batch 16: ~4GB GPU memory
- Batch 32: ~8GB GPU memory
- Batch 64: ~16GB GPU memory

### Number of Epochs

```bash
# Small dataset (1K-5K examples)
--epochs 5

# Medium dataset (5K-20K examples)
--epochs 3

# Large dataset (20K+ examples)
--epochs 2
```

Rule of thumb: Stop when validation loss plateaus (use validation set!)
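
In practice the plateau check reduces to comparing consecutive validation losses. A generic sketch over a list of per-epoch losses (the values are illustrative, not real training output):

```elixir
# Validation loss recorded after each epoch (illustrative values).
val_losses = [0.1842, 0.0921, 0.0654, 0.0651]

# Treat the run as plateaued when the last epoch improved by less than 1%.
plateaued? =
  case Enum.take(val_losses, -2) do
    [previous, latest] -> (previous - latest) / previous < 0.01
    _ -> false
  end

IO.puts("Stop training? #{plateaued?}")
```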

### Max Sequence Length

```bash
# Short texts (tweets, titles)
--max-length 128  # Faster, uses less memory

# Normal texts (sentences, paragraphs)
--max-length 512  # Default, good balance

# Long texts (documents)
# Standard BERT/RoBERTa checkpoints only support 512 positions,
# so split longer documents into chunks instead of raising
# --max-length above 512
```

## Programmatic Fine-tuning

For more control, use the API directly:

```elixir
alias Nasty.Statistics.Neural.Transformers.{Loader, FineTuner}
alias Nasty.Statistics.Neural.DataLoader

# Load base model
{:ok, base_model} = Loader.load_model(:roberta_base)

# Load training data
{:ok, train_sentences} = DataLoader.load_conllu_file("data/train.conllu")

# Prepare examples
training_data = 
  Enum.map(train_sentences, fn sentence ->
    tokens = sentence.tokens
    labels = Enum.map(tokens, & &1.pos)
    {tokens, labels}
  end)

# Create label map (UPOS tags)
label_map = %{
  0 => "ADJ", 1 => "ADP", 2 => "ADV", 3 => "AUX",
  4 => "CCONJ", 5 => "DET", 6 => "INTJ", 7 => "NOUN",
  8 => "NUM", 9 => "PART", 10 => "PRON", 11 => "PROPN",
  12 => "PUNCT", 13 => "SCONJ", 14 => "SYM", 15 => "VERB", 16 => "X"
}

# Fine-tune
{:ok, finetuned} = FineTuner.fine_tune(
  base_model,
  training_data,
  :pos_tagging,
  num_labels: 17,
  label_map: label_map,
  epochs: 3,
  batch_size: 16,
  learning_rate: 3.0e-5
)

# Save
File.write!("models/pos_finetuned.axon", :erlang.term_to_binary(finetuned))
```
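
The saved file is just a serialized Erlang term, so it can be loaded back with the inverse call (only deserialize files you wrote yourself, since `:erlang.binary_to_term/1` trusts its input):

```elixir
# Restore the fine-tuned model saved above.
finetuned =
  "models/pos_finetuned.axon"
  |> File.read!()
  |> :erlang.binary_to_term()
```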

## Evaluation

### During Training

The CLI automatically evaluates on the validation set:

```
Fine-tuning POS tagger
  Model: roberta_base
  Training data: data/train.conllu
  Output: models/pos_finetuned

Loading base model...
Model loaded: roberta_base

Loading training data...
Training examples: 8,724
Validation examples: 1,091
Number of POS tags: 17

Starting fine-tuning...

Epoch 1/3, Iteration 100: loss=0.3421, accuracy=0.891
Epoch 1/3, Iteration 200: loss=0.2156, accuracy=0.934
Epoch 1 completed. validation_loss: 0.1842, validation_accuracy: 0.951

Epoch 2/3, Iteration 100: loss=0.1523, accuracy=0.963
Epoch 2/3, Iteration 200: loss=0.1298, accuracy=0.971
Epoch 2 completed. validation_loss: 0.0921, validation_accuracy: 0.979

Epoch 3/3, Iteration 100: loss=0.0876, accuracy=0.981
Epoch 3/3, Iteration 200: loss=0.0745, accuracy=0.985
Epoch 3 completed. validation_loss: 0.0654, validation_accuracy: 0.987

Fine-tuning completed successfully!
Model saved to: models/pos_finetuned

Evaluating on validation set...

Validation Results:
  Accuracy: 98.72%
  Total predictions: 16,427
  Correct predictions: 16,217
```

### Post-training Evaluation

Test on held-out test set:

```bash
mix nasty.eval \
  --model models/pos_finetuned.axon \
  --test data/en_ewt-ud-test.conllu \
  --type pos_tagging
```
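
Token-level accuracy itself is just the share of matching labels. A self-contained sketch over two flat lists of UPOS tags (how you obtain the predictions depends on your inference setup):

```elixir
# Gold and predicted tags for the same tokens, in the same order.
gold      = ["DET", "NOUN", "VERB", "PUNCT"]
predicted = ["DET", "NOUN", "VERB", "NOUN"]

correct =
  gold
  |> Enum.zip(predicted)
  |> Enum.count(fn {g, p} -> g == p end)

accuracy = correct / length(gold)
IO.puts("Accuracy: #{Float.round(accuracy * 100, 2)}%")
# => Accuracy: 75.0%
```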

## Troubleshooting

### Out of Memory

**Symptoms**: Process crashes, CUDA out of memory

**Solutions**:
1. Reduce batch size: `--batch-size 8`
2. Reduce max length: `--max-length 256`
3. Use a smaller model: `distilbert-base` instead of `roberta-base`
4. Use gradient accumulation (API only)

### Training Too Slow

**Symptoms**: Hours per epoch

**Solutions**:
1. Enable GPU: Set `XLA_TARGET=cuda` env var
2. Increase batch size: `--batch-size 32`
3. Reduce max length: `--max-length 256`
4. Use DistilBERT instead of BERT

### Poor Accuracy

**Symptoms**: Validation accuracy <95%

**Solutions**:
1. Train longer: `--epochs 5`
2. Increase dataset size (need 5K+ sentences)
3. Lower learning rate: `--learning-rate 0.00001`
4. Check data quality (annotation errors?)
5. Try a different model: RoBERTa instead of BERT

### Overfitting

**Symptoms**: High training accuracy, low validation accuracy

**Solutions**:
1. More training data
2. Fewer epochs: `--epochs 2`
3. Use a smaller model (e.g., `distilbert-base`), which has less capacity to memorize the training set
4. Use validation set for early stopping

### Model Not Learning

**Symptoms**: Loss stays constant

**Solutions**:
1. Higher learning rate: `--learning-rate 0.0001`
2. Check data format (is it loading correctly?)
3. Verify labels are correct
4. Try a different optimizer (requires editing the FineTuner code)

## Best Practices

### 1. Always Use Validation Set

```bash
# GOOD: Monitor validation performance
mix nasty.fine_tune.pos \
  --train data/train.conllu \
  --validation data/dev.conllu

# BAD: No way to detect overfitting
mix nasty.fine_tune.pos \
  --train data/train.conllu
```

### 2. Start with Defaults

Don't tune hyperparameters until you have a baseline from the defaults:

```bash
# First run: Use defaults
mix nasty.fine_tune.pos --model roberta_base --train data/train.conllu

# Then: Tune if needed
```

### 3. Use RoBERTa for Best Accuracy

```bash
# Highest accuracy
--model roberta_base

# Choose BERT or DistilBERT only if you need faster inference or a smaller footprint
```

### 4. Save Intermediate Checkpoints

Models are saved automatically to the output directory. Keep multiple versions:

```
models/
  pos_epoch1.axon
  pos_epoch2.axon
  pos_epoch3.axon
  pos_final.axon  # Best model
```

### 5. Document Your Configuration

Keep a log of what worked:

```
# models/pos_finetuned/README.md

Model: RoBERTa-base
Training data: UD_English-EWT (8,724 sentences)
Epochs: 3
Batch size: 16
Learning rate: 3e-5
Final accuracy: 98.7%
Training time: 15 minutes (GPU)
```

## Production Deployment

After fine-tuning, deploy to production:

### 1. Quantize for Efficiency

```bash
mix nasty.quantize \
  --model models/pos_finetuned.axon \
  --calibration data/calibration.conllu \
  --output models/pos_finetuned_int8.axon
```

Result: 4x smaller, 2-3x faster, <1% accuracy loss
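
You can verify the size reduction directly from the two files produced above (paths taken from the quantization command):

```elixir
# Compare on-disk sizes before and after quantization.
original_bytes = File.stat!("models/pos_finetuned.axon").size
quantized_bytes = File.stat!("models/pos_finetuned_int8.axon").size

IO.puts("Compression: #{Float.round(original_bytes / quantized_bytes, 1)}x smaller")
```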

### 2. Load in Production

```elixir
# INT8 is the quantization module (alias as needed for your project);
# apply_model/2 is a placeholder for your model's inference call.
defmodule MyApp.POSTagger do
  # Load the quantized model (e.g., once at application start).
  def load_model do
    {:ok, model} = INT8.load("models/pos_finetuned_int8.axon")
    model
  end

  # Tag a sentence with the loaded model.
  def tag_sentence(model, text) do
    {:ok, tokens} = Nasty.parse(text, language: :en)
    {:ok, tagged} = apply_model(model, tokens)
    tagged
  end

  # Placeholder: replace with the actual inference call.
  defp apply_model(_model, tokens), do: {:ok, tokens}
end
```

### 3. Monitor Performance

Track key metrics:
- Accuracy on representative samples (weekly)
- Inference latency (should be <100ms per sentence; see the timing sketch below)
- Memory usage (should be stable)
- Error rate by domain/source
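
A simple latency check times the tagging call over a small sample. This sketch reuses the example `MyApp.POSTagger` module from the loading section above (itself an illustration, not a fixed API):

```elixir
# Time per-sentence inference in milliseconds over a small sample.
model = MyApp.POSTagger.load_model()

sample = [
  "The cat sat on the mat.",
  "Dogs run fast."
]

latencies_ms =
  Enum.map(sample, fn text ->
    {microseconds, _tagged} = :timer.tc(fn -> MyApp.POSTagger.tag_sentence(model, text) end)
    microseconds / 1000
  end)

average = Enum.sum(latencies_ms) / length(latencies_ms)
IO.puts("Average latency: #{Float.round(average, 1)} ms per sentence")
```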

## Advanced Topics

### Few-shot Learning

Fine-tune with minimal data (100-500 examples):

```elixir
FineTuner.few_shot_fine_tune(
  base_model,
  small_dataset,
  :pos_tagging,
  epochs: 10,
  learning_rate: 1.0e-5,
  data_augmentation: true
)
```

### Domain Adaptation

Fine-tune on domain-specific data:

```bash
# Medical text
mix nasty.fine_tune.pos \
  --model roberta_base \
  --train data/medical_train.conllu

# Legal text
mix nasty.fine_tune.pos \
  --model roberta_base \
  --train data/legal_train.conllu
```

### Multilingual Fine-tuning

Use XLM-RoBERTa for multiple languages:

```bash
mix nasty.fine_tune.pos \
  --model xlm_roberta_base \
  --train data/multilingual_train.conllu  # Mix of en, es, ca
```

## See Also

- [QUANTIZATION.md](QUANTIZATION.md) - Optimize fine-tuned models
- [ZERO_SHOT.md](ZERO_SHOT.md) - Classification without training
- [CROSS_LINGUAL.md](CROSS_LINGUAL.md) - Transfer across languages
- [NEURAL_MODELS.md](NEURAL_MODELS.md) - Neural architecture details