# Nx.Vulkan

A GPU tensor backend for [Nx](https://github.com/elixir-nx/nx) that runs on **anything with a Vulkan driver** — including FreeBSD, where CUDA and Metal don't exist.

```
✓ Linux + NVIDIA RTX 3060 Ti      (proprietary driver)
✓ FreeBSD + NVIDIA GT 750M        (NVIDIA legacy driver)
✓ FreeBSD + NVIDIA GT 650M        (NVIDIA legacy driver)
```

Two backends live in this repo:

- **`Nx.Vulkan.VulkanoBackend`** — pure-Rust, [vulkano](https://github.com/vulkano-rs/vulkano)-backed, current primary. 24 ops native + host-fallback for the long tail. Validated on Axon training (forward + autograd + SGD), the eXMC regime-model log-posterior at f64 precision, and Scholar linear regression (coefficients match to 2e-6).
- **`Nx.Vulkan.Backend`** — C++ spirit-backed, predecessor. Chain-shader synthesis pipeline (`Synthesis`, `ShaderTemplate`, `ChainShaderSpecs`) still ships here and produces SPV blobs that either backend can dispatch.

## What works today

| Capability | VulkanoBackend | spirit (legacy) |
|---|---|---|
| Buffer alloc / upload / download | Arc-managed | raw pointer |
| Elementwise binary (add/sub/mul/div/pow/max/min) | ✓ f32 + f64 | ✓ f32 |
| Elementwise unary (exp/log/sqrt/sigmoid/tanh/abs/neg/floor/ceil/sign) | ✓ f32 + f64 | ✓ f32 |
| Reductions (sum, reduce_max, reduce_min, axis + leading + trailing) | ✓ f32 + f64 | ✓ f32 |
| Shape / movement (reshape, squeeze, transpose-2D) | ✓ | ✓ |
| Matmul (rank-2 × rank-2) | ✓ f32 | ✓ f32 |
| Slice / as_type / general dot axes | host-fallback | host-fallback |
| Chain shader synthesis (Mission II) | dispatches generated SPV | dispatches generated SPV |
| `Nx.Defn.grad` (autograd) | works automatically | works automatically |
| Axon training step | **✓ validated** (1e-8 grad match vs BinaryBackend) | partial |
| eXMC NUTS sampler integration | **✓ regime log_p byte-identical** | ✓ chain-shader path |
| Scholar linear regression | **✓ coefficients match to 2e-6** (SVD via host-fallback) | partial |
| Long-running workloads (5000+ dispatches) | **✓ pipeline cache** | ✗ stale-buffer crash class |

The chain-shader dispatch path (`leapfrog_chain_synth`) is shared between both backends — same SPV cache, same content-addressing, same runtime synthesis pipeline. The difference is the backend that handles general Nx tensor ops outside that dispatch.

## Why two backends

The spirit backend reached production first, with chain-shader synthesis, runtime SPV compilation, a content-addressed disk cache, and a long-lived `Nx.Vulkan.Node` GenServer. Then a use-after-free in the C++ FFI layer crashed the live trader three minutes after every restart. The failure surfaced as `Nx.Vulkan.Native.byte_size` raising `:badarg` on a stale `VkBuf*` pointer: a classic FFI ownership bug that the C++ type system cannot detect. The vulkano backend grew from a spike that proved the migration was mechanical: same SPV bytes in, byte-identical chain tensors out, perf within ten percent on the bench target.

The two coexist while we backfill the long tail of ops. Long-term, the spirit path retires.

See [`docs/VULKANO_BACKEND_ROADMAP.md`](docs/VULKANO_BACKEND_ROADMAP.md) for the full stage breakdown. The full story is in [*The Backend That Didn't Need to Know*](http://www.dataalienist.com/blog-backend-didnt-need-to-know.html).

## Benchmarks (May 2026)

Square matmul, milliseconds per dispatch, median of 50–200 iterations. `bin` is `Nx.BinaryBackend`; super-io is the Linux host and mac-247 is the FreeBSD GT 650M box:

| size | bin (super-io) | bin (mac-247) | vulkano (super-io) | vulkano (mac-247) | spirit (mac-247) |
|---|---|---|---|---|---|
| 16×16 | 2.76 | 2.51 | 1.18 | **1.06** | 1.16 |
| 64×64 | 130.76 | 158.45 | 7.07 | 7.92 | **7.56** |
| 256×256 | 20,097 | 13,891 | 149.19 | **136.10** | 141.73 |
| 1024×1024 | n/a (hours) | n/a (hours) | 2,323 | 2,843 | 2,845 |

Two observations:

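1. **vulkano and spirit agree within 10% on every matmul size where both run** on the same hardware (mac-247). The largest gap is at 16×16 (1.06 ms vs 1.16 ms); from 64×64 upward the two are within 5%. The C++ path doesn't buy back its maintenance cost.
2. **The Vulkan path beats BinaryBackend by roughly 100–135× at 256×256**: about 102× on mac-247's GT 650M and about 135× on super-io. The mac-247 GPU is from 2013; what changes is moving the loop off the BEAM scheduler.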
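
The ratios can be re-derived from the 256×256 row of the table. The snippet below is a small plain-Elixir sketch that does only that arithmetic; the map layout and variable names are illustrative and not part of the bench script.

```elixir
# Median ms per dispatch at 256×256, copied from the benchmark table.
timings = %{
  "super-io" => %{bin: 20_097.0, vulkano: 149.19},
  "mac-247" => %{bin: 13_891.0, vulkano: 136.10, spirit: 141.73}
}

# Speedup of the vulkano path over BinaryBackend on each host.
speedups =
  Map.new(timings, fn {host, t} -> {host, Float.round(t.bin / t.vulkano, 1)} end)

# Relative gap between spirit and vulkano on mac-247, in percent.
mac = timings["mac-247"]
gap_pct = Float.round(abs(mac.spirit - mac.vulkano) / mac.vulkano * 100, 1)

IO.inspect(speedups, label: "vulkano speedup over BinaryBackend")
IO.inspect(gap_pct, label: "spirit vs vulkano gap on mac-247 (%)")
```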

The C++ spirit path crashed on Linux super-io mid-bench with a memory-supervisor high-watermark warning — same fragility class that motivated the migration in the first place. The vulkano path completed cleanly on both hosts. Full bench script: [`examples/full_bench.exs`](examples/full_bench.exs).

## Position vs EXLA and EMLX

Three GPU backends exist for Nx today. Each won a different platform first.

| | EXLA | EMLX | Nx.Vulkan.VulkanoBackend |
|---|---|---|---|
| **Backing API** | Google XLA | Apple MLX (Metal) | Khronos Vulkan via vulkano (Rust) |
| **Maturity** | Years; production | Released 2024 | Released 2026 |
| **Linux + NVIDIA CUDA** | ✓ canonical | ✗ | ✓ via Vulkan |
| **macOS + Apple Silicon** | ✗ | ✓ canonical | ✓ via MoltenVK |
| **FreeBSD + NVIDIA** | ✗ | ✗ | **✓ only path** |
| **Windows / WSL2** | partial via TF | ✗ | ✓ (Vulkan ships on Windows) |
| **Op coverage** | full Nx surface (~200) | full Nx surface | 24 native, rest via host fallback |
| **`Nx.Defn.grad` (autograd)** | full | full | **✓ free** (graph transformation) |
| **fp64 compute** | full | none (Metal limit) | ✓ binary/unary/reduce |
| **Production use** | Google scale | Apple devices | eXMC trader on mac-247 |

### The autograd insight

`Nx.Defn.grad` is a graph transformation that runs at compile time on the `Nx.Defn.Expr` AST. For every forward op in the graph, it inserts the corresponding backward op expressed in terms of *more forward ops*. The backend never sees a "backward op" — it just keeps executing forward primitives. Forward op coverage IS gradient coverage when running through `Nx.Defn.Evaluator`.

That means **VulkanoBackend supports gradients for any function expressible in its 24 native ops + host-fallback long tail**. No backward callbacks were written. Validated by running a complete Axon training step (Dense → sigmoid → Dense → MSE → `Nx.Defn.value_and_grad`) on `Nx.Vulkan.VulkanoBackend`, with gradient sum agreeing to 1e-8 against the `BinaryBackend` reference.
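
A minimal way to see this for yourself is to compare `Nx.Defn.grad/1` against a hand-derived gradient. The sketch below is a standalone script that fetches only Nx, so it runs on the default `BinaryBackend`; the `GradSketch` module and its functions are illustrative names, not part of this repo. Transferring `x` to `Nx.Vulkan.VulkanoBackend` first should exercise the same graph on the GPU, though that variant is not shown or tested here.

```elixir
# Standalone script: fetch Nx only. Nx.Vulkan is deliberately not required,
# so everything below runs on the default BinaryBackend.
Mix.install([{:nx, "~> 0.10"}])

defmodule GradSketch do
  import Nx.Defn

  # Forward pass built only from elementwise ops and a sum reduction.
  defn loss(x) do
    x |> Nx.sigmoid() |> Nx.pow(2) |> Nx.sum()
  end

  # Hand-derived gradient: d/dx sum(s^2) = 2 * s * s * (1 - s), s = sigmoid(x).
  defn analytic_grad(x) do
    s = Nx.sigmoid(x)
    2 * s * s * (1 - s)
  end
end

x = Nx.tensor([0.5, -1.0, 2.0], type: :f32)

# Nx.Defn.grad/1 rewrites the expression graph; the result is then evaluated
# by Nx.Defn.Evaluator as ordinary forward ops on the tensor's backend.
auto = Nx.Defn.grad(&GradSketch.loss/1).(x)
manual = GradSketch.analytic_grad(x)

close? = Nx.all_close(auto, manual, atol: 1.0e-6) |> Nx.to_number()
IO.inspect(close?, label: "autograd matches analytic gradient (1 = yes)")
```

Every op in the backward graph (`sigmoid`, multiply, subtract) is itself a forward primitive, which is why no backward callbacks are needed in the backend.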

### What's missing

**Op coverage — the long tail.** Convolutions, FFTs, sort, scatter, `Nx.LinAlg.solve`/`qr`/`svd`, complex types, sparse ops. Most of these have host-fallback paths that work today but are slow. Native shaders for each are 50–100 LOC of vulkano apiece. Estimated effort to reach feature parity with EXLA: 6–12 months of focused work, parallelisable.

**`Nx.Defn` custom compiler.** Today we run through `Nx.Defn.Evaluator`, which dispatches ops one at a time. EXLA compiles whole graphs to optimised HLO. A custom Defn compiler that batches dispatches, fuses elementwise chains, and caches compiled graphs would close most of the remaining perf gap. Estimated effort: 3–6 months.

**Persistent buffer pool.** Buffers are currently allocated per call through vulkano's `StandardMemoryAllocator`. This works, but it costs about a millisecond per dispatch that an explicit pool could reclaim. **Mid-2026 work.**

**f64 matmul.** `matmul.spv` is f32-only. f64 dot products fall back to host, which is slow for large tensors. **2 weeks** to add the f64 variant.

**Scholar — linalg fast paths.** Linear regression (normal equation + SVD) now smoke-tests cleanly via a host-fallback `block/4` callback that routes `Nx.Block.LinAlg.SVD`/`QR`/`solve`/`cholesky` through `BinaryBackend`. Coefficients match to 2e-6. The fallback works for any Scholar algorithm whose linalg uses `Nx.Block`; native SVD/QR shaders would speed things up but aren't blocking correctness. **2-4 weeks** to add the most-used linalg shaders natively.
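
The host-fallback idea itself is simple: copy the operand to `BinaryBackend`, run the op there, and move the result back. The sketch below shows that round trip as a user-level helper. `HostFallbackSketch.via_host/2` is a hypothetical name and is not how the backend's `block/4` callback is implemented; the script fetches only Nx, so both sides of the transfer are `BinaryBackend` here.

```elixir
# Standalone script: only Nx is fetched, so "device" and host are both
# BinaryBackend here. The helper itself does not care which backend it gets.
Mix.install([{:nx, "~> 0.10"}])

defmodule HostFallbackSketch do
  # Hypothetical helper (not an Nx.Vulkan API): copy the tensor to
  # BinaryBackend, run `fun` there, then move the result back to the
  # backend the input lived on.
  def via_host(%Nx.Tensor{data: %backend{}} = tensor, fun) when is_function(fun, 1) do
    tensor
    |> Nx.backend_copy(Nx.BinaryBackend)
    |> fun.()
    |> Nx.backend_transfer(backend)
  end
end

a = Nx.tensor([[2.0, 0.0], [0.0, 4.0]])
b = Nx.tensor([2.0, 8.0])

# Solve a * x = b on the host; this diagonal system has the solution [1, 2].
x = HostFallbackSketch.via_host(a, &Nx.LinAlg.solve(&1, b))
IO.inspect(x)
```

The cost of this pattern is two transfers per call, which is why the fallback is correct but slow for large tensors.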

## Quickstart

### As a backend in your project

```elixir
# mix.exs
def deps do
  [
    {:nx, "~> 0.10"},
    {:nx_vulkan, git: "https://github.com/borodark/nx_vulkan"}
  ]
end
```

```elixir
# Build a tensor, transfer to GPU, do work
x_bin = Nx.tensor([1.0, 2.0, 3.0, 4.0], type: :f32)
x_vk  = Nx.backend_transfer(x_bin, Nx.Vulkan.VulkanoBackend)

y_vk  = Nx.sigmoid(x_vk)
y_bin = Nx.backend_transfer(y_vk, Nx.BinaryBackend)
IO.inspect(Nx.to_list(y_bin))
# [0.7310585975646973, 0.8807970881462097, 0.9525741338729858, 0.9820137619972229]
```
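
If you would rather not transfer tensors by hand, Nx lets you set a default backend. The fragment below uses Nx's standard `:default_backend` configuration key and `Nx.global_default_backend/1`; using them with `Nx.Vulkan.VulkanoBackend` is an assumption based on it being a regular Nx backend, and ops outside the native set would still go through host fallback.

```elixir
# config/config.exs
import Config

# Application-wide default: tensors are allocated by the Vulkan backend
# unless a call passes an explicit `backend:` option.
config :nx, :default_backend, Nx.Vulkan.VulkanoBackend

# Runtime alternative for scripts or IEx:
#   Nx.global_default_backend(Nx.Vulkan.VulkanoBackend)
```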

### Try the Axon training example

```sh
git clone https://github.com/borodark/nx_vulkan
cd nx_vulkan
mix deps.get && mix compile
elixir examples/axon_training_loop.exs
```

The script runs a 100-step Dense(4→32, tanh)→Dense(1) regression with manual SGD and compares loss trajectories on `BinaryBackend` vs `VulkanoBackend`. It reports a PASS verdict on both Linux and FreeBSD.
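
For readers unfamiliar with "manual SGD", the sketch below shows the shape of such a loop on a much smaller model (a single linear unit, no Axon). It is not the contents of `examples/axon_training_loop.exs`; `SgdSketch`, the data, and the learning rate are all illustrative, and the script runs on the default `BinaryBackend`.

```elixir
# Standalone script: fetch Nx only; runs on the default BinaryBackend.
Mix.install([{:nx, "~> 0.10"}])

defmodule SgdSketch do
  import Nx.Defn

  # Mean squared error of a linear model y_hat = w * x + b.
  defn loss({w, b}, x, y) do
    diff = w * x + b - y
    Nx.mean(diff * diff)
  end

  # One SGD step: differentiate the loss w.r.t. {w, b}, then step downhill.
  defn step({w, b}, x, y, lr) do
    {value, {gw, gb}} = value_and_grad({w, b}, fn params -> loss(params, x, y) end)
    {{w - lr * gw, b - lr * gb}, value}
  end
end

x = Nx.tensor([0.0, 1.0, 2.0, 3.0])
y = Nx.tensor([1.0, 3.0, 5.0, 7.0])  # generated by y = 2x + 1
init = {Nx.tensor(0.0), Nx.tensor(0.0)}

# Run 100 steps, collecting the loss observed before each update.
{_params, losses} =
  Enum.reduce(1..100, {init, []}, fn _i, {params, acc} ->
    {params, value} = SgdSketch.step(params, x, y, 0.05)
    {params, [Nx.to_number(value) | acc]}
  end)

losses = Enum.reverse(losses)
IO.inspect({hd(losses), List.last(losses)}, label: "loss first/last")
```

Comparing two backends then amounts to running the same loop twice with the input tensors placed on each backend and checking that the loss trajectories agree.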

### Try the full bench

```sh
mix run examples/full_bench.exs
```

The script covers per-op, end-to-end, and robustness benchmarks across every backend Nx can find, and auto-detects EXLA availability. It runs in ~10 minutes on the RTX 3060 Ti and ~15 on the GT 650M.

## Why FreeBSD matters

Nx today has three GPU backends. Two of them — EXLA and EMLX — explicitly do not run on FreeBSD. If you have NVIDIA hardware on FreeBSD, Vulkan is the only path. **mac-248** (FreeBSD 15.0 / GT 750M) and **mac-247** (FreeBSD 15.0 / GT 650M Mac Edition) are the canonical bring-up boxes; every commit gets verified there alongside the Linux dev host.

The companion blog series:

- [*The Backend That Didn't Need to Know*](http://www.dataalienist.com/blog-backend-didnt-need-to-know.html) — the C++→vulkano migration; descriptor pool debugging; autograd was free
- [*The GPU That Doesn't Need CUDA*](http://www.dataalienist.com/blog-vulkan-on-freebsd.html) — the FreeBSD Vulkan story (spirit-era)
- [*A Walkable Path Under the Mountain*](http://www.dataalienist.com/blog-walkable-path.html) — eXMC + zed integration

## Architecture

```
   ┌─────────────────────────────────────────────────────────┐
   │  Nx layer                                                │
   │  • Nx.Vulkan.VulkanoBackend  (current)                   │
   │  • Nx.Vulkan.Backend         (legacy, C++ path)          │
   └──────────────┬─────────────────────────┬─────────────────┘
                  │                         │
   ┌──────────────▼──────────┐  ┌──────────▼──────────────────┐
   │  Nx.Vulkan.NativeV       │  │  Nx.Vulkan.Native            │
   │  (Rustler crate          │  │  (Rustler crate              │
   │   nx_vulkan_vulkano)     │  │   nx_vulkan_native)          │
   │  • Arc<Buffer> resources │  │  • C++ shim NIFs             │
   │  • pipeline cache        │  │  • opaque VkBuf* pointers    │
   │  • specialisation        │  │                              │
   └──────────┬───────────────┘  └─────────┬────────────────────┘
              │                            │
              │                       ┌────▼─────────┐
              │                       │  C++ shim    │
              │                       │  (legacy)    │
              │                       └────┬─────────┘
              │                            │
              │                       ┌────▼─────────┐
              │                       │   spirit     │
              │                       │   (vendored) │
              │                       └────┬─────────┘
              │                            │
              └──────────┬─────────────────┘
                         ▼
              ┌─────────────────────────┐
              │  Vulkan driver (loader) │
              └─────────────────────────┘
                         │
              ┌──────────▼──────────────┐
              │  priv/shaders/*.spv      │
              │  • elementwise_binary    │
              │  • elementwise_unary     │
              │  • reduce_axis           │
              │  • matmul                │
              │  • transpose             │
              │  • synthesised chain     │
              │    shaders (Mission II)  │
              │  • 9 hand-written leap-  │
              │    frog families         │
              └──────────────────────────┘
```

The SPV catalog under `priv/shaders/` is shared by both backends. The synthesis pipeline that produces new chain shaders at runtime (`Nx.Vulkan.Synthesis`, `Nx.Vulkan.ShaderTemplate`, `Nx.Vulkan.ChainShaderSpecs`) lives in the Elixir layer and is backend-agnostic.

Old spirit-era infrastructure that survives unchanged:

- **`Nx.Vulkan.Node`** — long-lived named GenServer that owns the `vkPipelineCache` blob and serialises dispatch via `with_node/2`. Used by the legacy backend; new backend doesn't require it but cooperates with it.
- **`Nx.Vulkan.PipelineCache`** — disk-persistent `vkPipelineCache` with UUID validation. Survives BEAM restarts.
- **Runtime chain shader synthesis** — render a `FamilySpec`, hand to `Synthesis.compile/1`, get a content-addressed SPV path back. ~150 ms cold, 5 ms cache hit. Both backends consume the output.
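
Content addressing here means the cache key is derived from the shader source itself, so identical sources always map to the same SPV file. The sketch below illustrates the idea with the standard library only. `ContentCacheSketch`, its hashing scheme, and the fake compile function are assumptions for illustration; they do not describe how `Nx.Vulkan.Synthesis` names or stores its files.

```elixir
defmodule ContentCacheSketch do
  # Hypothetical illustration of content addressing; not Nx.Vulkan.Synthesis.

  # Derive a stable file path from the bytes of the shader source.
  def path_for(source, cache_dir) when is_binary(source) do
    digest = :crypto.hash(:sha256, source) |> Base.encode16(case: :lower)
    Path.join(cache_dir, digest <> ".spv")
  end

  # Return {:hit, path} if the file is already cached; otherwise call
  # compile_fun, write its output, and return {:miss, path}.
  def fetch(source, cache_dir, compile_fun) when is_function(compile_fun, 1) do
    path = path_for(source, cache_dir)

    if File.exists?(path) do
      {:hit, path}
    else
      File.mkdir_p!(cache_dir)
      File.write!(path, compile_fun.(source))
      {:miss, path}
    end
  end
end

cache_dir =
  Path.join(System.tmp_dir!(), "spv_cache_sketch_#{System.unique_integer([:positive])}")

# Stand-in for a real GLSL→SPIR-V compile step.
fake_compile = fn source -> "SPV:" <> source end

source = "void main() {}"
{first, path} = ContentCacheSketch.fetch(source, cache_dir, fake_compile)
{second, ^path} = ContentCacheSketch.fetch(source, cache_dir, fake_compile)
IO.inspect({first, second}, label: "first and second lookup")
File.rm_rf!(cache_dir)
```

The second lookup skips the compile step entirely, which is the same reason a cache hit is so much cheaper than a cold synthesis.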

## Building

### Prerequisites

- Erlang/OTP 26+, Elixir 1.17+
- Rust 1.78+ (the repo's `rust-toolchain.toml` pins rustc to 1.85)
- C++ compiler (only needed for the legacy spirit backend; vulkano is pure Rust)
- Vulkan SDK + `glslangValidator`:
  - Debian/Ubuntu: `apt install libvulkan-dev vulkan-tools glslang-tools`
  - FreeBSD: `pkg install vulkan-loader vulkan-headers vulkan-tools glslang shaderc`

### Build

```sh
mix deps.get
mix compile
```

Vulkano compiles in ~30 s on Linux and ~3 min 18 s on FreeBSD 15.0 (mostly dependency compilation). The spirit/C++ path compiles in parallel.

### Rust toolchain pin

`rust-toolchain.toml` pins rustc to 1.85. The reason is in the file's comment; bump when upstream rustler emits a corrected `rustler-sys` signature.

## Status

**Phase 3 in progress** (May 2026): vulkano backend covers stages 1–8 of [the roadmap](docs/VULKANO_BACKEND_ROADMAP.md).

| Feature | Status |
|---|---|
| Vulkano buffer lifecycle (alloc/upload/download/free) | ✓ |
| 24 native compute ops via specialised SPVs | ✓ |
| f64 shader paths (binary/unary/reduce) | ✓ |
| Pipeline cache (correctness + perf) | ✓ |
| Cross-host validation (Linux + 2× FreeBSD) | ✓ |
| Axon training step end-to-end | ✓ |
| eXMC regime log_p (f64) byte-identical | ✓ |
| Autograd via `Nx.Defn.grad` | ✓ |
| Persistent buffer pool | mid-2026 |
| f64 matmul | mid-2026 |
| Scholar linear regression (coefs match to 2e-6) | ✓ |
| Scholar native linalg shaders (SVD/QR/cholesky/solve) | mid-2026 |
| Custom `Nx.Defn` compiler | 2026 H2 |
| Conv / FFT / sort / scatter | 2026 H2–Q4 |

Plan history is in [`PLAN_GPU_NODE.md`](PLAN_GPU_NODE.md) (Phase 1–2 era) and [`docs/VULKANO_BACKEND_ROADMAP.md`](docs/VULKANO_BACKEND_ROADMAP.md) (Phase 3+). Per-workstream notes in [`research/gpu_node/`](research/gpu_node/).

## Sibling: zed

[`zed`](../zed/) is the declarative ZFS + Elixir deploy tool that orchestrates BEAM nodes. `nx_vulkan` is consumed *inside* deployed BEAM nodes — not as a zed dependency. See `specs/nx-vulkan-execution.md` in the zed repo for the integration story.

## License

Apache 2.0. Same as spirit and Nx.