# Training Optimization Guide
## Table of Contents
1. [Understanding Gradient Computation](#understanding-gradient-computation)
2. [Choosing an Optimizer](#choosing-an-optimizer)
3. [Learning Rate Strategies](#learning-rate-strategies)
4. [Gradient Clipping](#gradient-clipping)
5. [Weight Decay](#weight-decay)
6. [Gradient Accumulation](#gradient-accumulation)
7. [Batch Size Selection](#batch-size-selection)
8. [Memory Optimization](#memory-optimization)
9. [Common Problems and Solutions](#common-problems-and-solutions)
10. [Performance Benchmarks](#performance-benchmarks)
11. [Quick Reference: Recommended Settings](#quick-reference-recommended-settings)
## Understanding Gradient Computation
### Current Limitation: Numerical Gradients
ExBurn v0.1.0 uses **numerical differentiation** (finite differences) to compute gradients. This is the main performance bottleneck.
```
Central differences: ∂L/∂w ≈ (L(w + ε) - L(w - ε)) / (2ε)
One-sided: ∂L/∂w ≈ (L(w + ε) - L(w)) / ε
```
**Impact**: For a model with N scalar parameters, central differences require 2N forward passes per mini-batch, so a 100K-parameter model needs 200K forward passes per batch.
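To make the two formulas concrete, here is a minimal pure-Elixir sketch of both finite-difference schemes applied to a plain list of floats. The module name `FiniteDiffSketch` and the toy loss are hypothetical and are not part of ExBurn; the real implementation operates on tensors and may differ in detail.

```elixir
defmodule FiniteDiffSketch do
  # Illustrative only: not part of the ExBurn API.

  # Central differences: two loss evaluations per parameter (2N in total).
  def central(loss_fn, params, eps \\ 1.0e-4) do
    params
    |> Enum.with_index()
    |> Enum.map(fn {w, i} ->
      plus = loss_fn.(List.replace_at(params, i, w + eps))
      minus = loss_fn.(List.replace_at(params, i, w - eps))
      (plus - minus) / (2 * eps)
    end)
  end

  # One-sided differences: one shared baseline evaluation plus one
  # evaluation per parameter (N + 1 in total).
  def one_sided(loss_fn, params, eps \\ 1.0e-4) do
    base = loss_fn.(params)

    params
    |> Enum.with_index()
    |> Enum.map(fn {w, i} ->
      (loss_fn.(List.replace_at(params, i, w + eps)) - base) / eps
    end)
  end
end

# Toy loss L(w) = w1^2 + 3 * w2; the exact gradient at [2.0, 1.0] is [4.0, 3.0].
loss = fn [w1, w2] -> w1 * w1 + 3 * w2 end
IO.inspect(FiniteDiffSketch.central(loss, [2.0, 1.0]), label: "central")
IO.inspect(FiniteDiffSketch.one_sided(loss, [2.0, 1.0]), label: "one-sided")
```

On this quadratic loss the central estimate matches the exact gradient up to rounding, while the one-sided estimate of the first component is off by roughly ε, which reflects the O(ε²) versus O(ε) error orders.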
### Choosing a Gradient Method
```elixir
# Default: central differences (more accurate, slower)
grads = ExBurn.Training.compute_gradients(model, {x, y}, grad_method: :numerical)
# Faster: one-sided differences (less accurate, ~2x faster)
grads = ExBurn.Training.compute_gradients(model, {x, y}, grad_method: :numerical_batch)
```
| Method | Forward Passes | Error Order | When to Use |
|---|---|---|---|
| `:numerical` | 2N | O(ε²) | Small models, high accuracy needed |
| `:numerical_batch` | N+1 | O(ε) | Larger models, speed matters more |
### When Autodiff Arrives (v0.3.0)
Burn's Autodiff backend will compute exact gradients with one forward pass and a **single backward pass**, regardless of parameter count, so the number of passes no longer grows with model size:
```
Numerical (v0.1.0): 200K forward passes for 100K params
Autodiff (v0.3.0): 1 backward pass for any model size
```
**Recommendation**: For now, keep models small (< 50K params) for training. Use larger models only for inference.
## Choosing an Optimizer
### Adam (Default)
A strong general-purpose optimizer that adapts the learning rate for each parameter.
```elixir
ExBurn.Model.compile(model, optimizer: :adam, learning_rate: 0.001)
# beta1=0.9, beta2=0.999, epsilon=1e-8
```
**When to use**: Default choice for most tasks. Works well with default hyperparameters.
**Tips**:
- `learning_rate: 0.001` is a good starting point
- Reduce to `0.0001` if training is unstable
- Increase to `0.01` if convergence is very slow
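For intuition about what "adapts learning rates per-parameter" means, the following sketch performs one textbook Adam step on flat lists of floats, using the hyperparameters quoted above as defaults. `AdamSketch` and its state map are hypothetical names for illustration; this is not ExBurn's internal implementation.

```elixir
defmodule AdamSketch do
  # Illustrative only: textbook Adam on flat lists of floats.
  # state = %{m: first moments, v: second moments, t: step count}
  def step(params, grads, state, opts \\ []) do
    lr = Keyword.get(opts, :learning_rate, 0.001)
    b1 = Keyword.get(opts, :beta1, 0.9)
    b2 = Keyword.get(opts, :beta2, 0.999)
    eps = Keyword.get(opts, :epsilon, 1.0e-8)
    t = state.t + 1

    # Exponential moving averages of the gradient and the squared gradient.
    m = Enum.zip_with(state.m, grads, fn m_i, g -> b1 * m_i + (1 - b1) * g end)
    v = Enum.zip_with(state.v, grads, fn v_i, g -> b2 * v_i + (1 - b2) * g * g end)

    # Bias correction compensates for the zero-initialized averages.
    m_scale = 1 / (1 - :math.pow(b1, t))
    v_scale = 1 / (1 - :math.pow(b2, t))

    new_params =
      Enum.zip_with([params, m, v], fn [p, m_i, v_i] ->
        p - lr * (m_i * m_scale) / (:math.sqrt(v_i * v_scale) + eps)
      end)

    {new_params, %{m: m, v: v, t: t}}
  end
end

state = %{m: [0.0, 0.0], v: [0.0, 0.0], t: 0}
{params, _state} = AdamSketch.step([1.0, -1.0], [0.5, -2.0], state)
IO.inspect(params, label: "after one step")
```

On the first step each parameter moves by about `learning_rate` in the direction opposite its gradient, whatever the gradient's magnitude. That normalization is why the default learning rate tends to work across very different models, and the two moment lists are the extra optimizer state Adam has to store.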
### SGD with Momentum
Can achieve better generalization than Adam with proper tuning.
```elixir
ExBurn.Model.compile(model, optimizer: :sgd, learning_rate: 0.01)
# momentum=0.9
```
**When to use**: When you need maximum generalization and have time to tune.
**Tips**:
- Requires a higher learning rate than Adam (typically 0.01–0.1)
- Use Nesterov momentum for faster convergence:
```elixir
ExBurn.Training.fit(model, data, nesterov: true)
```
- Combine with cosine annealing LR schedule for best results
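As a rough illustration of classical (non-Nesterov) momentum, the sketch below keeps a single velocity list alongside the parameters. `MomentumSketch` is a hypothetical module, and the exact update form ExBurn uses (for example, whether the learning rate is folded into the velocity) is an assumption here.

```elixir
defmodule MomentumSketch do
  # Illustrative only: classical momentum on flat lists of floats.
  def step(params, grads, velocity, lr \\ 0.01, momentum \\ 0.9) do
    # The velocity is a decaying sum of past gradients.
    new_velocity = Enum.zip_with(velocity, grads, fn v, g -> momentum * v + g end)
    new_params = Enum.zip_with(params, new_velocity, fn p, v -> p - lr * v end)
    {new_params, new_velocity}
  end
end

# Two steps with the same gradient: the second step is larger than the first.
{p1, v1} = MomentumSketch.step([1.0], [1.0], [0.0])
{p2, _v2} = MomentumSketch.step(p1, [1.0], v1)
IO.inspect({p1, p2}, label: "params after steps 1 and 2")
```

When consecutive gradients point the same way the velocity builds up, so steps grow; when they alternate in sign the contributions partly cancel, which damps oscillation.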
### RMSprop
Good for recurrent networks and non-stationary objectives.
```elixir
ExBurn.Model.compile(model, optimizer: :rmsprop, learning_rate: 0.001)
# decay=0.9, epsilon=1e-8
```
**When to use**: RNNs, LSTMs, or when Adam diverges.
### Optimizer Comparison
| Optimizer | Convergence Speed | Generalization | Tuning Effort | Memory |
|---|---|---|---|---|
| Adam | Fast | Good | Low | 2x params (m + v) |
| SGD + Momentum | Medium | Best | High | 1x params (velocity) |
| RMSprop | Medium | Good | Medium | 1x params (cache) |
## Learning Rate Strategies
### Fixed Learning Rate
```elixir
# No schedule — use constant learning rate
ExBurn.Model.compile(model, learning_rate: 0.001)
```
### Step Decay
Reduce LR by a factor every N epochs. Good for long training runs.
```elixir
# Halve the learning rate every 10 epochs
lr_schedule: {:step, 0.001, 10, 0.5}
```
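A minimal sketch of how such a tuple could be turned into a per-epoch learning rate, assuming the fields mean `{:step, base_lr, step_size, factor}` and that epochs are counted from 0. `StepDecaySketch` is a hypothetical helper, not an ExBurn function.

```elixir
defmodule StepDecaySketch do
  # Illustrative only. Assumes {:step, base_lr, step_size, factor}
  # and 0-based epoch numbering.
  def lr({:step, base_lr, step_size, factor}, epoch) do
    # Integer division counts how many complete steps have elapsed.
    base_lr * :math.pow(factor, div(epoch, step_size))
  end
end

schedule = {:step, 0.001, 10, 0.5}
for epoch <- [0, 9, 10, 25] do
  IO.inspect({epoch, StepDecaySketch.lr(schedule, epoch)})
end
```

Under these assumptions the rate stays at 0.001 for epochs 0 through 9, drops to 0.0005 at epoch 10, and to 0.00025 at epoch 20.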
### Exponential Decay
Smooth decay. Good for medium-length training.
```elixir
# Multiply LR by 0.95 each epoch
lr_schedule: {:exponential, 0.001, 0.95}
```
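The same idea for exponential decay, assuming the tuple means `{:exponential, base_lr, decay_rate}` with 0-based epochs. `ExponentialDecaySketch` is a hypothetical helper for illustration only.

```elixir
defmodule ExponentialDecaySketch do
  # Illustrative only. Assumes {:exponential, base_lr, decay_rate}
  # and 0-based epoch numbering.
  def lr({:exponential, base_lr, decay_rate}, epoch) do
    base_lr * :math.pow(decay_rate, epoch)
  end
end

schedule = {:exponential, 0.001, 0.95}
for epoch <- [0, 1, 2, 20] do
  IO.inspect({epoch, ExponentialDecaySketch.lr(schedule, epoch)})
end
```

Because the rate is multiplied by the same factor every epoch, it never reaches zero, but with a factor of 0.95 it falls below half of its starting value after 14 epochs.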
### Cosine Annealing
Smoothly decay from base_lr to min_lr following a cosine curve. Often gives the best results.
```elixir
# Decay from 0.001 to 0.00001 over the training run
lr_schedule: {:cosine, 0.001, 1.0e-5}
```
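The cosine curve can be written down directly. This sketch assumes the tuple means `{:cosine, base_lr, min_lr}` and that the schedule spans the whole run, so the total epoch count is passed in separately; `CosineSketch` and the `total_epochs` argument are hypothetical.

```elixir
defmodule CosineSketch do
  # Illustrative only. Assumes {:cosine, base_lr, min_lr} and a schedule
  # that runs from epoch 0 to total_epochs.
  def lr({:cosine, base_lr, min_lr}, epoch, total_epochs) do
    progress = epoch / total_epochs
    min_lr + 0.5 * (base_lr - min_lr) * (1 + :math.cos(:math.pi() * progress))
  end
end

schedule = {:cosine, 0.001, 1.0e-5}
for epoch <- [0, 25, 50, 75, 100] do
  IO.inspect({epoch, CosineSketch.lr(schedule, epoch, 100)})
end
```

The rate starts at `base_lr`, passes through the midpoint of the two values halfway through training, and ends at `min_lr`, changing slowly at both ends and fastest in the middle.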
### Learning Rate Schedule Comparison
```
LR (y-axis: 0.001 → 0.0001 → 0.00001) vs. Epochs (x-axis)
Step:   flat plateaus with sudden drops
Cosine: smooth, continuous decay to the minimum LR
```
### Tips
- Start with Adam + cosine annealing for best results
- If loss oscillates, reduce the base learning rate
- If convergence is too slow, increase the base learning rate
- Use warmup (planned) for large batch sizes
## Gradient Clipping
Gradient clipping caps the magnitude of gradients so that exploding gradients do not push the loss to NaN.
### Clip by Norm
Scales all gradients so their total norm doesn't exceed a threshold:
```elixir
# If ||gradients||_2 > 1.0, scale them down
clip_norm: 1.0
```
**When to use**: Always enable for recurrent networks. Recommended for deep networks.
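Norm clipping amounts to one global rescale. The sketch below shows the arithmetic on a flat list of floats; `ClipNormSketch` is a hypothetical module, and treating all gradients as a single flattened vector is an assumption about how the "total norm" is computed.

```elixir
defmodule ClipNormSketch do
  # Illustrative only: global L2-norm clipping on a flat list of floats.
  def clip(grads, max_norm) do
    norm = grads |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt()

    if norm > max_norm do
      # Every element is scaled by the same factor, so direction is preserved.
      scale = max_norm / norm
      Enum.map(grads, &(&1 * scale))
    else
      grads
    end
  end
end

# Norm is 5.0, so the gradient is scaled by 0.2.
IO.inspect(ClipNormSketch.clip([3.0, 4.0], 1.0), label: "clipped")
# Norm is 0.5, so the gradient is returned unchanged.
IO.inspect(ClipNormSketch.clip([0.3, 0.4], 1.0), label: "untouched")
```

Because the whole vector is scaled together, the update direction is unchanged and only its length is limited.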
### Clip by Value
Clips each gradient element to a range:
```elixir
# Clip each gradient to [-5.0, 5.0]
clip_value: 5.0
```
**When to use**: As a safety net alongside norm clipping.
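Value clipping, by contrast, clamps each element independently. A minimal sketch on a flat list of floats, with `ClipValueSketch` as a hypothetical module name:

```elixir
defmodule ClipValueSketch do
  # Illustrative only: clamp each gradient element to [-limit, limit].
  def clip(grads, limit) do
    Enum.map(grads, fn g -> g |> max(-limit) |> min(limit) end)
  end
end

IO.inspect(ClipValueSketch.clip([-7.5, 0.2, 9.0], 5.0))
```

Unlike norm clipping, this can change the direction of the gradient, because large elements are cut back while small ones are left alone.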
### Tips
- `clip_norm: 1.0` is a good default
- If you see NaN loss, enable clipping immediately
- Clipping doesn't prevent vanishing gradients — use residual connections for that
## Weight Decay
L2 regularization that penalizes large weights, improving generalization:
```elixir
ExBurn.Model.compile(model, weight_decay: 1.0e-4)
```
This adds `weight_decay * param` to each gradient before the optimizer step.
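The gradient adjustment described above is a one-liner per parameter. This sketch applies it to flat lists of floats; `WeightDecaySketch` is a hypothetical module, not an ExBurn function.

```elixir
defmodule WeightDecaySketch do
  # Illustrative only: add weight_decay * param to each gradient.
  def apply_decay(params, grads, weight_decay) do
    Enum.zip_with(params, grads, fn p, g -> g + weight_decay * p end)
  end
end

params = [2.0, -1.0]
grads = [0.1, 0.1]
IO.inspect(WeightDecaySketch.apply_decay(params, grads, 1.0e-4))
```

The added term always points away from zero in gradient space, so the subsequent descent step pulls each weight toward zero in proportion to its size.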
### Tips
- `1.0e-4` is a good default for most tasks
- `1.0e-5` when only light regularization is needed (for example, a small model that is not overfitting)
- `1.0e-3` for large models that overfit
- AdamW (decoupled weight decay) is not yet implemented. With standard Adam, the decay term is added to the gradient, so it gets rescaled by the adaptive learning rate and regularizes less uniformly than decoupled decay
## Gradient Accumulation
Simulates a larger batch size by accumulating gradients across multiple mini-batches:
```elixir
# Effective batch size = 32 * 4 = 128
ExBurn.Training.fit(model, data,
batch_size: 32,
accumulate_gradients: 4
)
```
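Conceptually, accumulation collects the gradients of several mini-batches and applies a single optimizer step to their combination. The sketch below averages them, which is an assumption (an implementation could also sum them); `AccumulationSketch` is a hypothetical module.

```elixir
defmodule AccumulationSketch do
  # Illustrative only: element-wise average of per-mini-batch gradients.
  # grad_batches is a list of gradient lists, one per mini-batch.
  def average(grad_batches) do
    n = length(grad_batches)

    grad_batches
    |> Enum.zip_with(&Enum.sum/1)
    |> Enum.map(&(&1 / n))
  end
end

# Gradients from 4 mini-batches of 32 samples each.
batches = [[1.0, 2.0], [3.0, 4.0], [1.0, 0.0], [3.0, 2.0]]
IO.inspect(AccumulationSketch.average(batches), label: "accumulated")
```

Only one mini-batch needs to be resident in memory at a time, while the optimizer sees a gradient equivalent to the one from a batch of 128 when the mini-batches are equally sized.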
### When to Use
- GPU memory limits your batch size
- You want the stability of large batches but can't fit them in memory
- Training on mobile devices with limited RAM
### Tips
- Scale the learning rate with the accumulation factor: linear scaling gives 4x LR for 4x accumulation, while square-root scaling (2x LR) is a more conservative choice
- Batch normalization (when available) will still see the small mini-batch statistics
## Batch Size Selection
| Batch Size | Pros | Cons |
|---|---|---|
| 8–16 | Better generalization, less memory | Noisy gradients, slower training |
| 32–64 | Good default, balanced speed and memory | Neither the best generalization nor the fastest training |
| 128–256 | Faster training, stable gradients | May generalize worse, more memory |
| 512+ | Very stable gradients | Often worse generalization, high memory |
### Tips
- Start with 32 and increase if you have memory headroom
- If you increase batch size, increase learning rate proportionally
- Use gradient accumulation to simulate large batches on memory-constrained devices
## Memory Optimization
### On Desktop (CUDA/Metal)
```elixir
# Use f16 for 2x memory reduction
# (convert parameters to f16 before training)
# Use gradient accumulation to reduce per-batch memory
accumulate_gradients: 4
```
### On Mobile (iOS/Android)
```elixir
# Keep models small (< 10M params; < 50K if training with numerical gradients)
# Use CPU for training (GPU autodiff is memory-intensive)
ExBurn.Model.compile(model, device: :cpu)
# Free intermediate tensors explicitly
ExBurn.Tensor.free(intermediate_tensor)
```
### Memory-Saving Tips
1. **Reduce batch size** — the single biggest lever
2. **Use gradient accumulation** — same effective batch, less memory
3. **Free tensors explicitly** — don't wait for GC
4. **Use f16 precision** — halves memory for tensors
5. **Avoid storing all intermediate activations** — use gradient checkpointing (planned)
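For a back-of-the-envelope estimate of how precision and optimizer choice affect memory, the sketch below counts one copy of the parameters, one of the gradients, and the optimizer state listed in the comparison table earlier. It ignores activations and assumes the optimizer state is stored in the same precision as the parameters; `MemorySketch` is a hypothetical helper.

```elixir
defmodule MemorySketch do
  # Illustrative only: rough parameter-related memory during training.
  @bytes_per_element %{f32: 4, f16: 2}
  # Extra parameter-sized buffers each optimizer keeps.
  @state_copies %{adam: 2, sgd: 1, rmsprop: 1}

  def training_bytes(param_count, dtype, optimizer) do
    # parameters + gradients + optimizer state
    copies = 1 + 1 + Map.fetch!(@state_copies, optimizer)
    param_count * copies * Map.fetch!(@bytes_per_element, dtype)
  end
end

IO.inspect(MemorySketch.training_bytes(100_000, :f32, :adam), label: "f32 + Adam (bytes)")
IO.inspect(MemorySketch.training_bytes(100_000, :f16, :adam), label: "f16 + Adam (bytes)")
```

For small models this total is dwarfed by activation memory, which scales with batch size, and that is why reducing the batch size is listed first.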
## Common Problems and Solutions
### Loss is NaN
**Causes**: Exploding gradients, learning rate too high, numerical instability
**Solutions**:
```elixir
# 1. Enable gradient clipping
clip_norm: 1.0
# 2. Reduce learning rate
learning_rate: 0.0001
# 3. Use the :numerical gradient method (central differences, more accurate)
grad_method: :numerical
```
### Loss Doesn't Decrease
**Causes**: Learning rate too low, bad initialization, wrong loss function
**Solutions**:
```elixir
# 1. Increase learning rate
learning_rate: 0.01
# 2. Check loss function matches task
# Classification → :cross_entropy
# Regression → :mse
# Binary → :binary_cross_entropy
# 3. Verify data preprocessing (normalization, etc.)
```
### Loss Oscillates
**Causes**: Learning rate too high, batch size too small
**Solutions**:
```elixir
# 1. Reduce learning rate
learning_rate: 0.0005
# 2. Increase batch size or use gradient accumulation
accumulate_gradients: 4
# 3. Use learning rate schedule
lr_schedule: {:cosine, 0.001, 1.0e-6}
```
### Overfitting
**Causes**: Model too complex, not enough data, no regularization
**Solutions**:
```elixir
# 1. Add weight decay
weight_decay: 1.0e-3
# 2. Add dropout in the Axon model
|> Axon.dropout(rate: 0.5)
# 3. Freeze early layers
model = ExBurn.Model.freeze(model, ["hidden1"])
# 4. Use early stopping
callbacks: [ExBurn.Training.EarlyStoppingCallback.wait(5)]
```
### Training is Very Slow
**Causes**: Numerical gradients on large model, too many epochs
**Solutions**:
```elixir
# 1. Use faster gradient method
grad_method: :numerical_batch
# 2. Reduce model size
# 3. Use fewer epochs with early stopping
callbacks: [ExBurn.Training.EarlyStoppingCallback.wait(3)]
# 4. Increase batch size (fewer optimizer steps)
batch_size: 128
```
## Performance Benchmarks
Approximate training times per epoch on synthetic data (will vary by hardware):
| Model Size | Params | Batch | Method | Time/Epoch |
|---|---|---|---|---|
| Tiny MLP | 1K | 32 | :numerical | ~2s |
| Small MLP | 10K | 32 | :numerical | ~15s |
| Small MLP | 10K | 32 | :numerical_batch | ~8s |
| Medium MLP | 100K | 32 | :numerical | ~3min |
| Medium MLP | 100K | 32 | :numerical_batch | ~1.5min |
**Key takeaway**: With numerical gradients, training time grows roughly linearly with parameter count in these measurements. Keep models under 50K parameters for interactive training, or switch to inference-only for larger models until autodiff arrives in v0.3.0.
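To estimate the workload before starting a run, the forward-pass counts from the gradient method table can be multiplied by the number of mini-batches per epoch. `ForwardPassSketch` is a hypothetical helper, and the dataset size used below is made up for illustration.

```elixir
defmodule ForwardPassSketch do
  # Illustrative only: forward passes implied by each gradient method.
  def per_batch(n_params, :numerical), do: 2 * n_params
  def per_batch(n_params, :numerical_batch), do: n_params + 1

  def per_epoch(n_samples, batch_size, n_params, method) do
    # Round up so a final partial mini-batch is counted.
    batches = div(n_samples + batch_size - 1, batch_size)
    batches * per_batch(n_params, method)
  end
end

# Hypothetical dataset of 1,000 samples, batch size 32, 10K-parameter model.
IO.inspect(ForwardPassSketch.per_epoch(1_000, 32, 10_000, :numerical), label: ":numerical")
IO.inspect(ForwardPassSketch.per_epoch(1_000, 32, 10_000, :numerical_batch), label: ":numerical_batch")
```

Multiplying the result by a measured time per forward pass gives a rough lower bound on epoch time for a given model and dataset.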
## Quick Reference: Recommended Settings
### For Quick Experiments
```elixir
compiled = ExBurn.Model.compile(model,
loss: :cross_entropy,
optimizer: :adam,
learning_rate: 0.001
)
ExBurn.Training.fit(compiled, data,
epochs: 10,
batch_size: 32,
verbose: true
)
```
### For Best Results
```elixir
compiled = ExBurn.Model.compile(model,
loss: :cross_entropy,
optimizer: :adam,
learning_rate: 0.001,
weight_decay: 1.0e-4
)
ExBurn.Training.fit(compiled, data,
epochs: 50,
batch_size: 64,
shuffle: true,
validation_data: val_data,
lr_schedule: {:cosine, 0.001, 1.0e-6},
clip_norm: 1.0,
accuracy: true,
callbacks: [
&ExBurn.Training.LoggingCallback.log/1,
ExBurn.Training.EarlyStoppingCallback.wait(10, 1.0e-5),
ExBurn.Training.CheckpointCallback.every(10, "/checkpoints")
]
)
```
### For Memory-Constrained Devices
```elixir
compiled = ExBurn.Model.compile(model,
loss: :cross_entropy,
optimizer: :adam,
learning_rate: 0.0005,
device: :cpu
)
ExBurn.Training.fit(compiled, data,
epochs: 20,
batch_size: 16,
accumulate_gradients: 4,
clip_norm: 1.0
)
```