# Picking a delegate
A TFLite "delegate" is the backend that executes some or all of the
model graph. TFLite ships several; picking the right one for your
platform and model combination is the single biggest perf lever.
This guide lays out the decision matrix and explains the quirks we
discovered while measuring real hardware.
## Quick decision tree
```
Are you on iOS (any iPhone/iPad)?
├── Model is FP16 or FP32? → delegate: "coreml" (24 ms iPhone SE A15)
└── Model is INT8? → delegate: "xnnpack" (27 ms best, 36 ms median, iPhone SE A15)
    (NOT coreml: 0/256 nodes delegate)
Are you on Android?
├── MediaTek Dimensity? → delegate: "nnapi", accelerator: "mtk-gpu_shim"
├── Qualcomm Snapdragon? → delegate: "nnapi", accelerator: "qti-gpu"
├── Pixel? → delegate: "nnapi", accelerator: "google-edgetpu"
├── Unknown OEM? → delegate: "xnnpack" (always works)
└── Want to discover? → see "Discovering NNAPI accelerators" below
Mac / Linux dev host? → delegate: "xnnpack"
```
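The tree above can be sketched as a pure function. This is a minimal illustration, not part of the library: the module name `DelegatePicker`, the atoms used for platform, precision and SoC, and the keyword-list shape of the result are all assumptions made for the example. Only the delegate and accelerator strings come from the tree.

```elixir
defmodule DelegatePicker do
  @moduledoc "Hypothetical helper mirroring the decision tree; not a library API."

  # platform:  :ios | :android | :host
  # precision: :int8 | :fp16 | :fp32  (only consulted on iOS)
  # soc:       :mediatek | :qualcomm | :pixel | :unknown  (only consulted on Android)
  def pick(platform, precision \\ nil, soc \\ :unknown)

  # iOS: Core ML for float models, XNNPACK for INT8 (Core ML delegates 0/256 nodes).
  def pick(:ios, :int8, _soc), do: [delegate: "xnnpack"]
  def pick(:ios, precision, _soc) when precision in [:fp16, :fp32], do: [delegate: "coreml"]

  # Android: always name the accelerator explicitly.
  def pick(:android, _precision, :mediatek), do: [delegate: "nnapi", accelerator: "mtk-gpu_shim"]
  def pick(:android, _precision, :qualcomm), do: [delegate: "nnapi", accelerator: "qti-gpu"]
  def pick(:android, _precision, :pixel), do: [delegate: "nnapi", accelerator: "google-edgetpu"]

  # Unknown OEMs, dev hosts, and anything unrecognised fall through to the CPU path.
  def pick(_platform, _precision, _soc), do: [delegate: "xnnpack"]
end
```

For example, `DelegatePicker.pick(:android, :fp16, :mediatek)` returns `[delegate: "nnapi", accelerator: "mtk-gpu_shim"]`, while `DelegatePicker.pick(:host)` returns `[delegate: "xnnpack"]`. Keeping the catch-all clause last means an unexpected input degrades to the option that always works.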
## Per-delegate detail
### `xnnpack` — CPU+SIMD
Bundled into TFLite. Cross-platform. Default when no other delegate
is set explicitly.
Highly optimised: tuned ARM NEON / Intel AVX kernels, with FP32, FP16
and quantized INT8 paths.
| Pro | Con |
|---|---|
| Works everywhere TFLite runs | No GPU / NPU |
| Reproducible numbers (less prone to thermal throttling) | Slower than accelerator paths when those work |
| No vendor-driver dependencies | |
Options:
* `num_threads:` — CPU thread count (default 6). Up to physical core
count helps; oversubscription hurts.
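One way to avoid oversubscription is to clamp the requested count to what the VM can see. This is a sketch only: the `Threads` module is hypothetical, and `System.schedulers_online/0` reports logical cores available to the BEAM, which is an approximation of the physical core count (on big.LITTLE phones it also counts the efficiency cores).

```elixir
defmodule Threads do
  # Clamp a requested thread count to the number of schedulers the BEAM
  # has online. This is an upper bound, not a tuned value: measure on the
  # target device before settling on a number.
  def clamp(requested) when is_integer(requested) and requested > 0 do
    min(requested, System.schedulers_online())
  end
end

# Hypothetical options list using the clamped value.
opts = [delegate: "xnnpack", num_threads: Threads.clamp(6)]
IO.inspect(opts)
```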
On the Moto G Power 5G (Dimensity 7020), XNNPACK matched the GPU
delegate at **77 ms** for YOLOv8n. On modern phones with strong CPU
cores, XNNPACK is a competitive default.
### `nnapi` — Android Neural Networks API
Android's neural-net dispatch layer. Each device's vendor ships a
HAL driver (`libmtk-gpu-shim.so`, `libqti-gpu.so`, etc.); NNAPI picks
one based on the `accelerator` name (or by default, badly — see
below).
Options:
* `accelerator:` — vendor HAL name string. **Always pass this
explicitly.** NNAPI's auto-selection on at least one MediaTek device
picks the NPU, which is roughly 3-5× slower than the GPU for
YOLO-class models (355 ms vs 75-117 ms on the Moto G Power 5G).
* `allow_fp16:` — let the HAL promote FP32 ops to FP16 (default
`true`). Lossy but typically fine for inference.
#### Discovering NNAPI accelerators on a connected device
The standalone `bench` CLI (in `scripts/bench_android/`) has a
`list-nnapi` mode:
```bash
adb push scripts/bench_android/bench /data/local/tmp/
adb push ~/.mob/cache/tflite-2.16.1-android_arm64/jni/arm64-v8a/libtensorflowlite_jni.so /data/local/tmp/
adb shell 'cd /data/local/tmp && LD_LIBRARY_PATH=. ./bench list-nnapi'
```
Output is a list of accelerator names available on this device:
```
mtk-gpu_shim
mtk-neuron_shim
nnapi-reference (CPU emulation — slow)
```
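If you capture that output (for example from `adb shell`), choosing an accelerator can be automated. The sketch below assumes the format shown above, one name per line with an optional parenthesised note; the `NnapiList` module and the preference list are hypothetical.

```elixir
defmodule NnapiList do
  # Parse `bench list-nnapi` output into a list of accelerator names,
  # dropping blank lines and any trailing note after the name.
  def parse(output) do
    output
    |> String.split("\n", trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.map(fn line -> line |> String.split(~r/\s+/, parts: 2) |> hd() end)
  end

  # Return the first entry of `preferred` that the device reports, or nil.
  # The CPU-emulation reference driver is never selected.
  def choose(names, preferred) do
    usable = Enum.reject(names, &(&1 == "nnapi-reference"))
    Enum.find(preferred, fn name -> name in usable end)
  end
end
```

A `nil` result from `choose/2` is the signal to fall back to `delegate: "xnnpack"` rather than letting NNAPI auto-select.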
#### Known accelerator perf rankings (YOLOv8n on Moto G Power 5G)
| Accelerator | Median |
|---|---|
| `mtk-gpu_shim` | 75-117 ms (best — MediaTek's PowerVR HAL) |
| `xnnpack` CPU | 77-91 ms (tied with GPU; deterministic) |
| `mtk-neuron_shim` | 355 ms (NPU — slower because YOLO post-processing falls back to CPU) |
| `nnapi-reference` | 358 ms (CPU emulation — never use) |
| `nnapi` (no accelerator) | 358 ms (defaults to mtk-neuron_shim — never do this) |
The NPU loses despite being a "real" neural-net accelerator because
YOLO has `concat` + `reshape` ops in its post-processing that aren't
in the NPU's supported set. TFLite falls back to CPU for those ops
mid-graph, with cross-device buffer transfers between each fallback.
The transfer overhead swamps any per-op NPU speedup.
A model designed end-to-end for the NPU (no reshape/concat in the
inference graph) would likely land much faster on `mtk-neuron_shim`.
YOLOv8n as exported doesn't fit.
### `coreml` — Apple Core ML
Routes the delegated portion of the graph through Apple's Core ML
framework, which internally schedules to the **Apple Neural Engine**
when ops are supported on devices that expose one to Core ML (the
TFLite delegate enables itself by default only on A12 and later,
i.e. iPhone XS onward).
Options:
* `coreml_ane_only:` — when `true`, `load_module/2` returns
`{:error, _}` instead of falling back to CPU on devices without an
ANE. Useful for "ANE-only or skip" logic. Default `false`.
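The "ANE-only or skip" pattern might look like the sketch below. To keep it self-contained, the loader is passed in as a 2-arity function shaped like the `load_module/2` described above (model path plus options, returning `{:ok, handle}` or `{:error, reason}`). The `LoadWithFallback` module, the argument order, and the options-as-keyword-list shape are assumptions for illustration, not the library's confirmed API.

```elixir
defmodule LoadWithFallback do
  # Try Core ML in ANE-only mode first; if that is refused, load the same
  # model on the XNNPACK CPU path. Returns {:ok, handle, delegate_used}
  # or the error from the final attempt.
  def load(loader, path) when is_function(loader, 2) do
    case loader.(path, [delegate: "coreml", coreml_ane_only: true]) do
      {:ok, handle} ->
        {:ok, handle, :coreml}

      {:error, _reason} ->
        # No ANE available (or Core ML rejected the model).
        with {:ok, handle} <- loader.(path, [delegate: "xnnpack"]) do
          {:ok, handle, :xnnpack}
        end
    end
  end
end
```

Returning which delegate was actually used makes it easy to log or display the active path, which matters when comparing latency numbers across devices.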
#### Op-coverage caveat — the INT8 trap
**Don't use Core ML with INT8 models.** The Ultralytics
`yolov8n_full_integer_quant.tflite` export uses INT8 quantization ops
that Core ML's tooling doesn't translate to ANE primitives. The
result: **0 out of 256 nodes delegated**, and the whole model falls
back to CPU which is slower than just running XNNPACK directly.
For Core ML you want the **FP16 or FP32 model variant**:
| Model | Core ML delegation rate | Latency (iPhone SE A15) |
|---|---|---|
| INT8 | 0/256 (0%) — full CPU fallback | 36-42 ms (don't use) |
| FP16 | 214/385 (56%) | 23-25 ms |
| FP32 | 214/254 (84%) | 24-25 ms |
FP16 and FP32 hit the same wall-clock because the delegated portion
is the same (214 conv-shaped ops). FP16 wins on bundle size (~6 MB
vs ~12 MB).
The 44% of nodes that fall to CPU on FP16 are the post-processing
ops (concat / reshape / NMS-prep), the same shape as the Android NPU
problem. Core ML handles the boundary more gracefully than the NNAPI
NPU path does (cheap shared-memory transitions on Apple silicon),
which is why this works at all.
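The delegation percentages follow directly from the node counts in the table. A short check, using a hypothetical `Delegation` helper:

```elixir
defmodule Delegation do
  # Share of graph nodes handed to the delegate, as a whole percentage.
  def rate(delegated, total) when total > 0, do: round(delegated / total * 100)
end

# Node counts taken from the table above.
for {variant, delegated, total} <- [{"INT8", 0, 256}, {"FP16", 214, 385}, {"FP32", 214, 254}] do
  pct = Delegation.rate(delegated, total)
  IO.puts("#{variant}: #{pct}% delegated, #{100 - pct}% on CPU")
end
```

This prints 0%, 56% and 84% delegated respectively. The delegated count is 214 for both float variants; only the total node count differs, which is why the percentages diverge while the latency does not.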
### `metal` — Apple Metal GPU (planned)
TFLite ships `TensorFlowLiteCMetal.xcframework` with a Metal GPU
delegate, but the current NIF doesn't expose it as a `delegate:`
option. PR welcome.
Core ML is usually faster than Metal on Apple Silicon since it can
pick ANE for supported ops + Metal as a fallback. Metal-only is
mainly useful for older devices without an ANE.
## Comparing the paths on the same device
Same iPhone SE 3rd gen A15, same `.tflite` model files, varying the
delegate:
| Variant | Delegate | Delegation | Min / Median / Max |
|---|---|---|---|
| INT8 | xnnpack | n/a (CPU+NEON) | 27 / 36 / 37 ms |
| INT8 | coreml | 0/256 (full fallback) | 36 / 39 / 42 ms |
| FP16 | xnnpack | n/a (CPU+NEON) | 86 / 98 / 265 ms |
| **FP16** | **coreml** | 214/385 (56%) | **23 / 25 / 26 ms** |
| FP32 | coreml | 214/254 (84%) | 24 / 24 / 25 ms |
The standout: **FP16 + Core ML wins** at 25 ms median, with half the
bundle size of FP32 and effectively identical wall-clock (24 ms
median for FP32). The INT8 XNNPACK path on CPU+NEON is impressive at
36 ms median; for context, our standalone bench measurements show it
consistently within 30 ms of the GPU/ANE paths on modern phones.
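For reproducing a Min / Median / Max column from your own timing runs, a minimal summary helper could look like this. The `LatencyStats` module is hypothetical and the sample values in the usage note are illustrative, not measurements.

```elixir
defmodule LatencyStats do
  # Summarise a non-empty list of latency samples (ms) as {min, median, max}.
  def summary([_ | _] = samples) do
    sorted = Enum.sort(samples)
    {List.first(sorted), median(sorted), List.last(sorted)}
  end

  # Median of an already-sorted list: the middle element, or the mean of
  # the two middle elements when the count is even.
  defp median(sorted) do
    n = length(sorted)
    mid = div(n, 2)

    if rem(n, 2) == 1 do
      Enum.at(sorted, mid)
    else
      (Enum.at(sorted, mid - 1) + Enum.at(sorted, mid)) / 2
    end
  end
end
```

`LatencyStats.summary([37, 27, 36])` returns `{27, 36, 37}`. Reporting the median rather than the mean keeps a single outlier, such as the 265 ms maximum in the FP16 XNNPACK row, from distorting the headline number.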
## Composing with `Nx` backends
TFLite delegates handle the model graph. The pre/post-processing in
your Elixir code is separate compute that you can route to a
different backend:
```elixir
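# Input prep on EMLX (Metal GPU on iOS) — useful for batch
# transformations, scaling, normalization.
# Assumes camera_bytes already holds 640*640*3 float32 values in
# 0..255. For raw 8-bit pixels, read them as :u8 instead;
# Nx.divide/2 by a float then yields an f32 tensor.
input_bytes =
camera_bytes
|> Nx.from_binary(:f32, backend: {EMLX.Backend, device: :gpu})
|> Nx.reshape({1, 640, 640, 3})
|> Nx.divide(255.0)
|> Nx.to_binary()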
# Model inference on TFLite + Core ML → ANE
{:ok, [out]} = NxTfliteMob.call(handle, [input_bytes])
# Output decode on EMLX again
out
|> Nx.from_binary(:f32, backend: {EMLX.Backend, device: :gpu})
|> Nx.reshape({1, 84, 8400})
|> ...
```
Two distinct compute paths, one screen. The TFLite delegate doesn't
care what backend your Nx code uses — it sees only the bytes you
hand to `call/2`.
## When `xnnpack` is the right answer even when GPU/NPU is available
* Deterministic numbers (CPU paths don't thermal-throttle as
aggressively)
* Cold-start (delegate init for Core ML / NNAPI is 100-500 ms)
* Tiny models (the delegate dispatch overhead dominates inference
for sub-ms models)
* Cross-platform parity for tests
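The cold-start point can be made concrete with a break-even estimate: how many inferences it takes for a delegate's one-off init cost to be repaid by its per-inference saving. The `BreakEven` module is a hypothetical sketch, and the 300 ms init figure in the usage note is an assumed value inside the 100-500 ms range quoted above.

```elixir
defmodule BreakEven do
  # Number of inferences after which the delegate's init cost is repaid
  # by its per-inference saving over the CPU path.
  # Returns :never when the delegate is not faster per inference.
  def runs(init_ms, cpu_ms, delegate_ms) when cpu_ms > delegate_ms do
    ceil(init_ms / (cpu_ms - delegate_ms))
  end

  def runs(_init_ms, _cpu_ms, _delegate_ms), do: :never
end
```

With the iPhone SE medians from the comparison table (36 ms XNNPACK, 25 ms Core ML) and an assumed 300 ms init, `BreakEven.runs(300, 36, 25)` returns `28`: a one-shot classification is better served by XNNPACK, while a live camera feed repays the init cost in about a second. For a sub-millisecond model where the delegate is no faster, such as `BreakEven.runs(300, 0.5, 0.8)`, the result is `:never`.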