# Mobile Deployment with ExBurn

## Overview

ExBurn compiles models for mobile deployment via Burn's CubeCL backend:

- **iOS**: Metal via CubeCL
- **Android**: Vulkan via CubeCL

The typical workflow is: train on a desktop GPU → save the model → load and run inference on mobile.

## Training and Saving on Desktop

```elixir
# Train on desktop (CUDA or Metal)
model =
  Axon.input("input", shape: {nil, 784})
  |> Axon.dense(128, activation: :relu)
  |> Axon.dropout(rate: 0.2)
  |> Axon.dense(10)

compiled = ExBurn.Model.compile(model,
  loss: :cross_entropy,
  optimizer: :adam,
  learning_rate: 0.001
)

trained = ExBurn.Training.fit(compiled, {train_x, train_y},
  epochs: 20,
  batch_size: 64
)

# Save for deployment
ExBurn.Model.save(trained, "model.bin")
```

## Loading and Inference on Mobile

```elixir
# Load the saved weights on the mobile device.
# `compiled` is the result of ExBurn.Model.compile/2, built as in the training step.
{:ok, model} = ExBurn.Model.load(compiled, "model.bin")

# Run inference
{:ok, output} = ExBurn.Model.predict(model, input_tensor)
```

## Using ExBurn.Serving for Batched Inference

For production inference with concurrent batching:

```elixir
serving = ExBurn.Serving.build(model,
  batch_size: 32,
  batch_timeout: 50,
  partitions: System.schedulers_online()
)

output = Nx.Serving.run(serving, input_tensor)
```

## Cross-Compilation

### iOS (Metal)

```bash
# Add the iOS target
rustup target add aarch64-apple-ios

# Build the NIF for iOS
cd native/ex_burn_nif
cargo build --target aarch64-apple-ios --features metal --no-default-features --release
```

### Android (Vulkan)

```bash
# Add the Android target
rustup target add aarch64-linux-android

# Build the NIF for Android
cd native/ex_burn_nif
cargo build --target aarch64-linux-android --features vulkan --no-default-features --release
```

### CPU-only Fallback

```bash
cd native/ex_burn_nif
cargo build --no-default-features --release
```
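
The three builds above differ only in the target triple and the Cargo feature flag. As a minimal sketch, the mapping could be captured in Elixir as shown here. `NifBuild` and `cargo_args/1` are hypothetical names used for illustration and are not part of ExBurn; the triples and features are copied from the commands above.

```elixir
defmodule NifBuild do
  @moduledoc """
  Hypothetical helper (not part of ExBurn): builds the `cargo` argument
  list for each platform described in this guide.
  """

  # Target triple and Cargo feature per platform, taken from the commands above.
  # The CPU-only fallback builds for the host, so it has neither.
  @platforms %{
    ios: %{target: "aarch64-apple-ios", feature: "metal"},
    android: %{target: "aarch64-linux-android", feature: "vulkan"},
    cpu: %{target: nil, feature: nil}
  }

  @doc "Returns the arguments to pass to `cargo` for `platform`."
  def cargo_args(platform) do
    # Raises KeyError for an unknown platform.
    %{target: target, feature: feature} = Map.fetch!(@platforms, platform)

    ["build"] ++
      optional("--target", target) ++
      optional("--features", feature) ++
      ["--no-default-features", "--release"]
  end

  # Emit a flag and its value only when the value is set.
  defp optional(_flag, nil), do: []
  defp optional(flag, value), do: [flag, value]
end

IO.inspect(NifBuild.cargo_args(:ios))
```

The resulting list can be passed to `System.cmd/3`, for example `System.cmd("cargo", NifBuild.cargo_args(:android), cd: "native/ex_burn_nif")`.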

## Model Optimization for Mobile

### 1. Use f16 Precision

Storing parameters as f16 instead of f32 halves their memory footprint, usually with minimal accuracy loss at inference time:

```elixir
# Convert parameters to f16
# (planned — currently use Nx's built-in type conversion)
```
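
Until a dedicated helper exists, the Nx conversion mentioned above might look like the following sketch. It assumes the parameters are available as a flat map of Nx tensors; the map layout and the key names are hypothetical, and ExBurn's actual parameter structure may differ. The shapes mirror the first dense layer of the example model (784 inputs, 128 units).

```elixir
# Assumes Nx is available; Mix.install fetches it when run as a standalone script.
Mix.install([{:nx, "~> 0.7"}])

# Hypothetical flat parameter map (names are illustrative, not ExBurn's).
# Float scalars broadcast to f32 tensors by default.
params = %{
  "dense_0_kernel" => Nx.broadcast(0.5, {784, 128}),
  "dense_0_bias" => Nx.broadcast(0.0, {128})
}

# Cast every tensor in the map to f16.
to_f16 = fn params ->
  Map.new(params, fn {name, tensor} -> {name, Nx.as_type(tensor, :f16)} end)
end

# Total size in bytes of all tensors in a parameter map.
total_bytes = fn params ->
  params |> Map.values() |> Enum.map(&Nx.byte_size/1) |> Enum.sum()
end

params_f16 = to_f16.(params)

IO.puts("f32: #{total_bytes.(params)} bytes")
IO.puts("f16: #{total_bytes.(params_f16)} bytes")
```

Casting to f16 after training keeps the training run itself in f32, so only the deployed copy loses precision.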

### 2. Reduce Model Size

| Model Size | Feasibility on Mobile |
|---|---|
| < 1M params | ✅ Comfortable on all modern devices |
| 1M – 10M params | ✅ Fine for inference, training may run out of memory |
| 10M – 50M params | ⚠️ Inference only, may need quantization |
| > 50M params | ❌ Not recommended for mobile |
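
To relate the parameter counts in the table to actual memory, a rough estimate multiplies the count by the bytes per parameter (4 for f32, 2 for f16). The sketch below does only that arithmetic; `ParamMemory` is a hypothetical module, and the result covers weights alone, not activations or runtime overhead.

```elixir
defmodule ParamMemory do
  # Bytes per parameter for the two precisions discussed in this guide.
  @bytes_per_param %{f32: 4, f16: 2}

  @doc "Megabytes (10^6 bytes) needed to store `param_count` parameters."
  def weights_mb(param_count, precision \\ :f32) do
    param_count * Map.fetch!(@bytes_per_param, precision) / 1_000_000
  end
end

# Print the estimate at each boundary used in the table.
for params <- [1_000_000, 10_000_000, 50_000_000] do
  IO.puts(
    "#{params} params: #{ParamMemory.weights_mb(params)} MB (f32), " <>
      "#{ParamMemory.weights_mb(params, :f16)} MB (f16)"
  )
end
```

By this estimate, a 10M-parameter model needs about 40 MB for its weights in f32 and about 20 MB in f16.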

### 3. Use ExCubecl Pipelines

Chain multiple GPU kernels without CPU round-trips:

```elixir
{:ok, pipeline} = ExBurn.CubeclBridge.pipeline()
ExBurn.CubeclBridge.pipeline_add(pipeline, "dense", [input_buf, weight_buf, bias_buf], output_buf)
ExBurn.CubeclBridge.pipeline_add(pipeline, "relu", [output_buf], output_buf)
{:ok, _} = ExBurn.CubeclBridge.pipeline_run(pipeline)
```

### 4. Batch Inference

Process multiple inputs together for better GPU utilization:

```elixir
serving = ExBurn.Serving.build(model, batch_size: 16, batch_timeout: 100)
```

## Supported Operations

| Operation | iOS (Metal) | Android (Vulkan) | Notes |
|---|---|---|---|
| Dense / Linear | ✅ | ✅ | |
| Conv2D | ✅ | ✅ | |
| ReLU | ✅ | ✅ | |
| Sigmoid | ✅ | ✅ | |
| Softmax | ✅ | ✅ | |
| Dropout | ✅ | ✅ | No-op during inference |
| LayerNorm | ✅ | ✅ | |
| MatMul | ✅ | ✅ | |
| Transpose | ✅ | ✅ | |
| Reshape | ✅ | ✅ | |
| Concatenate | ✅ | ✅ | |
| Slice | ✅ | ✅ | |

## Memory Considerations

- Burn's Autodiff backend is memory-intensive. **Training on mobile is only feasible for small models** (< 10M parameters).
- **Inference is the primary use case** for mobile deployment.
- Minimum recommended hardware: 4 GB RAM and an A12 or newer chip (iOS) or a Snapdragon 700-series or newer chip (Android).
- Gradient checkpointing, planned for v0.3.0, will reduce training memory.
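
To see why training is so much heavier than inference, consider only the per-parameter state. Training holds the weights and their gradients, and Adam additionally keeps two moment estimates per parameter. The sketch below computes that lower bound for f32; `TrainingMemory` is a hypothetical module, and the figure excludes the activations retained for backpropagation, which often dominate in practice.

```elixir
defmodule TrainingMemory do
  @bytes_f32 4

  # Number of parameter-sized f32 buffers held during training:
  #   :sgd  -> weights + gradients
  #   :adam -> weights + gradients + first moment + second moment
  @copies %{sgd: 2, adam: 4}

  @doc "Lower bound in megabytes (10^6 bytes), excluding activations."
  def lower_bound_mb(param_count, optimizer \\ :adam) do
    param_count * @bytes_f32 * Map.fetch!(@copies, optimizer) / 1_000_000
  end
end

IO.puts("10M params with Adam: #{TrainingMemory.lower_bound_mb(10_000_000)} MB")
IO.puts("10M params with SGD: #{TrainingMemory.lower_bound_mb(10_000_000, :sgd)} MB")
```

For 10M parameters with Adam this comes to 160 MB before any activations are counted, four times the memory needed to hold the weights for inference.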

## Precompiled NIFs (v0.2.0)

Starting with v0.2.0, precompiled NIF binaries are distributed via `rustler_precompiled`, so end users no longer need a Rust toolchain. At compile time, `rustler_precompiled` downloads the binary that matches the target platform.