# Nx.Vulkan — 10-Minute Intro

```elixir
Mix.install([
  {:nx_vulkan, path: Path.expand("../", __DIR__)},
  {:axon, "~> 0.7"}
])
```

## What this notebook is

Ten minutes from cold start to a working Axon training step on
GPU through `Nx.Vulkan.VulkanoBackend`. Linux NVIDIA, FreeBSD
NVIDIA, macOS via MoltenVK — anywhere with a Vulkan loader, this
runs.

Five sections, each timed:

1. **Boot the backend** (1 min) — verify the device is found, no
   driver setup needed beyond the package manager.
2. **Move a tensor to GPU** (1 min) — `backend_transfer` round-trip.
3. **Run some ops** (3 min) — binary, unary, reduce, matmul.
4. **Run a forward pass on a model** (2 min) — Axon
   `Dense → sigmoid → Dense` with parameters on GPU.
5. **One training step with gradient** (3 min) — `Nx.Defn.grad`
   handles backprop, no backward callbacks required.

## 1. Boot the backend

The backend lazy-initialises on first NIF call. Trigger it by
allocating a small buffer; the device name is logged as a side effect.

```elixir
# Allocate a tiny buffer just to force vulkano context init.
{:ok, _ref} = Nx.Vulkan.NativeV.buf_alloc(4)
```

You should see a single `eprintln!` line (stderr) above this cell:

```
[nx_vulkan_vulkano] device: NVIDIA GeForce ... (DiscreteGpu)
```

If you see `IntegratedGpu`, that's fine too: the backend prefers a
discrete GPU and otherwise falls back to the next-best
compute-capable device on the system.
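
One way to picture this preference is as a ranking over Vulkan physical device types. The sketch below is a hypothetical illustration in plain Elixir, not the backend's actual selection code (which lives on the Rust side); the `DeviceRankSketch` module, the map shape, and the rank order are all assumptions.

```elixir
defmodule DeviceRankSketch do
  # Assumed ranking, lower is preferred. The keys mirror Vulkan's physical
  # device types; the order itself is a guess made for illustration.
  @rank %{discrete_gpu: 0, integrated_gpu: 1, virtual_gpu: 2, cpu: 3, other: 4}

  # Keep compute-capable devices, then take the best-ranked one.
  # Returns nil when nothing qualifies.
  def pick(devices) do
    devices
    |> Enum.filter(fn device -> device.compute end)
    |> Enum.min_by(fn device -> Map.fetch!(@rank, device.type) end, &<=/2, fn -> nil end)
  end
end

devices = [
  %{name: "llvmpipe", type: :cpu, compute: true},
  %{name: "Integrated", type: :integrated_gpu, compute: true},
  %{name: "Discrete", type: :discrete_gpu, compute: true}
]

DeviceRankSketch.pick(devices).name
```

With the list above the expression returns `"Discrete"`; remove that entry and the pick falls back to `"Integrated"`.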

## 2. Move a tensor to GPU

Tensors normally live on `Nx.BinaryBackend` (Elixir binaries managed
by the BEAM). `backend_transfer/2` uploads to whatever target backend
you point it at.

```elixir
x = Nx.tensor([1.0, 2.0, 3.0, 4.0, 5.0], type: :f32)
IO.inspect(x.data.__struct__, label: "before transfer")

x_vk = Nx.backend_transfer(x, Nx.Vulkan.VulkanoBackend)
IO.inspect(x_vk.data.__struct__, label: "after transfer")
```

The struct in the `data` field changes from `Nx.BinaryBackend` to
`Nx.Vulkan.VulkanoBackend`. The tensor is now backed by an
`Arc<Buffer<u8>>` inside vulkano. When the Elixir reference is
garbage-collected, the `Arc` is released and vulkano runs
`vkDestroyBuffer` automatically, so there is nothing to free by hand
and no leak to track.

Round-trip back to confirm bytes are preserved:

```elixir
x_back = Nx.backend_transfer(x_vk, Nx.BinaryBackend)
IO.inspect(Nx.to_list(x_back), label: "round-trip")
```
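
If you want the round-trip as a yes/no check rather than a printed list, a small helper can compare raw bytes. This is a sketch: `RoundTripSketch` is a hypothetical name, and it uses `Nx.backend_copy/2` (which leaves the source tensor intact) instead of `backend_transfer/2`. It runs against `Nx.BinaryBackend` here so the cell works without a GPU.

```elixir
defmodule RoundTripSketch do
  # Copy the tensor to `backend`, copy it back to BinaryBackend, and
  # compare the raw bytes with the original.
  def preserved?(tensor, backend) do
    there = Nx.backend_copy(tensor, backend)
    back = Nx.backend_copy(there, Nx.BinaryBackend)
    Nx.to_binary(tensor) == Nx.to_binary(back)
  end
end

sample = Nx.tensor([1.0, 2.0, 3.0, 4.0, 5.0], type: :f32)
RoundTripSketch.preserved?(sample, Nx.BinaryBackend)
```

Passing `Nx.Vulkan.VulkanoBackend` as the second argument exercises the GPU upload and download path instead.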

## 3. Run some ops

Every Nx op you call dispatches through the backend's callback.
For tensors on `VulkanoBackend`, that means a Vulkan compute
shader.

### Elementwise binary

```elixir
a = Nx.tensor([1.0, 2.0, 3.0, 4.0], backend: Nx.Vulkan.VulkanoBackend)
b = Nx.tensor([10.0, 20.0, 30.0, 40.0], backend: Nx.Vulkan.VulkanoBackend)

Nx.add(a, b) |> Nx.backend_transfer(Nx.BinaryBackend) |> Nx.to_list()
```

Expected: `[11.0, 22.0, 33.0, 44.0]`.

The shader is `elementwise_binary.spv`, specialised at constant
ID 0 with op-code 0 (add). Subsequent calls with the same shader
and op-code hit the pipeline cache and skip the
shader-module-build step.
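
The caching idea can be sketched as memoisation keyed on `{shader, op_code}`. The code below is a plain-Elixir illustration under that assumption only; `PipelineCacheSketch` is a hypothetical module, the real cache lives in the native layer, and `make_ref/0` merely stands in for an expensive pipeline build.

```elixir
defmodule PipelineCacheSketch do
  # Returns {pipeline, cache, :hit | :miss}. `build` runs only on a miss.
  def fetch(cache, {_shader, _op_code} = key, build) when is_function(build, 0) do
    case cache do
      %{^key => pipeline} ->
        {pipeline, cache, :hit}

      _ ->
        pipeline = build.()
        {pipeline, Map.put(cache, key, pipeline), :miss}
    end
  end
end

key = {"elementwise_binary.spv", 0}
build = fn -> make_ref() end

# First call builds, second call with the same key reuses the result.
{p1, cache, :miss} = PipelineCacheSketch.fetch(%{}, key, build)
{p2, _cache, :hit} = PipelineCacheSketch.fetch(cache, key, build)
p1 == p2
```

A different op-code on the same shader is a different key, so it would miss once and then hit on later calls.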

### Elementwise unary

```elixir
Nx.sigmoid(a) |> Nx.backend_transfer(Nx.BinaryBackend) |> Nx.to_list()
```

`elementwise_unary.spv` op-code 5 (sigmoid). Result:
`[0.7310585975646973, 0.8807970881462097, ...]`.
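
To sanity-check those numbers without trusting any backend, here is a plain-Elixir reference using `:math.exp/1`. `SigmoidRef` is a hypothetical helper; it computes in f64, so the comparison uses a tolerance rather than exact equality with the f32 results.

```elixir
defmodule SigmoidRef do
  # Logistic sigmoid in f64, via Erlang's :math module.
  def sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))

  # True when both lists have the same length and agree within `tol`.
  def close?(xs, ys, tol \\ 1.0e-6) do
    length(xs) == length(ys) and
      Enum.all?(Enum.zip(xs, ys), fn {x, y} -> abs(x - y) <= tol end)
  end
end

reference = Enum.map([1.0, 2.0], &SigmoidRef.sigmoid/1)
gpu_prefix = [0.7310585975646973, 0.8807970881462097]
SigmoidRef.close?(reference, gpu_prefix)
```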

### Reduction

```elixir
m = Nx.iota({4, 4}, type: :f32, backend: Nx.Vulkan.VulkanoBackend)
{
  Nx.sum(m) |> Nx.backend_transfer(Nx.BinaryBackend) |> Nx.to_number(),
  Nx.sum(m, axes: [0]) |> Nx.backend_transfer(Nx.BinaryBackend) |> Nx.to_list()
}
```

Full sum: 120.0. Per-column sum: `[24.0, 28.0, 32.0, 36.0]`.
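
The same reductions can be reproduced on nested lists, which makes the meaning of `axes: [0]` concrete: it collapses the row axis, leaving one sum per column. This is a plain-Elixir sketch for checking the arithmetic, independent of any backend.

```elixir
# Row-major 4x4 matrix holding 0.0..15.0, matching Nx.iota({4, 4}).
rows = 0..15 |> Enum.map(&(&1 * 1.0)) |> Enum.chunk_every(4)

# Sum over every element.
full_sum = rows |> List.flatten() |> Enum.sum()

# Summing over axis 0 walks down each column.
column_sums = Enum.zip_with(rows, &Enum.sum/1)

# Summing over axis 1 would walk across each row instead.
row_sums = Enum.map(rows, &Enum.sum/1)

{full_sum, column_sums, row_sums}
```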

### Matmul

```elixir
a = Nx.iota({2, 3}, type: :f32, backend: Nx.Vulkan.VulkanoBackend) |> Nx.divide(Nx.tensor(1.0))
b = Nx.iota({3, 2}, type: :f32, backend: Nx.Vulkan.VulkanoBackend) |> Nx.divide(Nx.tensor(1.0))
Nx.dot(a, b) |> Nx.backend_transfer(Nx.BinaryBackend) |> Nx.to_list()
```

Expected: `[[10.0, 13.0], [28.0, 40.0]]`. The shader is
`matmul.spv` at 16×16 tile size; for these tiny matrices most of
the cost is dispatch overhead.
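
For a backend-independent check of the expected result, a naive list-of-lists matmul is enough at this size. `MatmulRef` is a hypothetical helper written for this notebook; it makes no attempt at tiling or performance.

```elixir
defmodule MatmulRef do
  # Naive matrix product on lists of rows: each output entry is the dot
  # product of a row of `a` with a column of `b`.
  def dot(a, b) do
    columns = Enum.zip_with(b, fn column -> column end)

    for row <- a do
      for column <- columns do
        row
        |> Enum.zip_with(column, fn x, y -> x * y end)
        |> Enum.sum()
      end
    end
  end
end

lhs = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
rhs = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
MatmulRef.dot(lhs, rhs)
```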

## 4. Forward pass on an Axon model

Two-layer MLP, parameters initialised on `BinaryBackend` then
transferred to GPU.

```elixir
model =
  Axon.input("x", shape: {nil, 8})
  |> Axon.dense(16, activation: :sigmoid)
  |> Axon.dense(2)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(%{"x" => Nx.template({1, 8}, :f32)}, Axon.ModelState.empty())

# Walk the model state, transferring each parameter tensor.
transfer = fn state ->
  %{state | data:
    Map.new(state.data, fn {layer, ps} ->
      {layer, Map.new(ps, fn {k, v} -> {k, Nx.backend_transfer(v, Nx.Vulkan.VulkanoBackend)} end)}
    end)}
end

params_vk = transfer.(params)
x_vk = Nx.tensor([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]],
                  type: :f32, backend: Nx.Vulkan.VulkanoBackend)

out = predict_fn.(params_vk, %{"x" => x_vk})
out |> Nx.backend_transfer(Nx.BinaryBackend) |> Nx.to_list()
```

The forward pass runs `Dense → sigmoid → Dense`, with every op
dispatching through `VulkanoBackend`: matmul → add (with broadcast)
→ sigmoid → matmul → add (with broadcast). `Axon.dense` adds a bias
by default, so that is five shader dispatches in total.

## 5. One training step with gradient

This is the interesting one. We never wrote a backward callback
for any of our ops. `Nx.Defn.grad` transforms the graph at
compile time, inserting backward ops *expressed in terms of
forward ops*. Our 24 forward ops therefore cover all the
gradients they could conceivably need.

```elixir
target = Nx.tensor([[1.0, -1.0]], type: :f32, backend: Nx.Vulkan.VulkanoBackend)

loss_fn = fn params, x_in, y_in ->
  out = predict_fn.(params, %{"x" => x_in})
  diff = Nx.subtract(out, y_in)
  Nx.divide(Nx.sum(Nx.multiply(diff, diff)), Nx.tensor(elem(Nx.shape(y_in), 0) * 1.0))
end

grad_fn = fn p, x_in, y_in ->
  Nx.Defn.value_and_grad(p, fn pp -> loss_fn.(pp, x_in, y_in) end)
end

{loss, grads} =
  Nx.Defn.jit_apply(grad_fn, [params_vk, x_vk, target], compiler: Nx.Defn.Evaluator)

{
  Nx.to_number(loss),
  grads.data["dense_0"]["kernel"]
  |> Nx.backend_transfer(Nx.BinaryBackend)
  |> Nx.sum()
  |> Nx.to_number()
}
```

Returns `{loss_value, sum_of_first_layer_kernel_gradient}`. Both
numbers match what `Nx.BinaryBackend` would produce to f32
precision.
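
To see the gradient transform in isolation, it helps to differentiate something small enough to check by hand. The sketch below uses `Nx.Defn.grad/1` on an anonymous function and runs on the default backend; the function and input are arbitrary choices for illustration.

```elixir
# f(v) = sum(v * v), so the gradient with respect to v is 2 * v.
square_sum = fn v -> Nx.sum(Nx.multiply(v, v)) end
grad_square_sum = Nx.Defn.grad(square_sum)

point = Nx.tensor([1.0, 2.0, 3.0], type: :f32)
Nx.to_list(grad_square_sum.(point))
```

The result is `[2.0, 4.0, 6.0]`. The only ops involved are elementwise ones, which is why a backend that implements the forward ops gets this gradient for free.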

Apply a single SGD update to verify parameters actually move:

```elixir
lr = 0.01

apply_sgd = fn p, g ->
  %{p | data: Map.new(p.data, fn {layer, layer_p} ->
    layer_g = g.data[layer]
    {layer, Map.new(layer_p, fn {pname, w} ->
      {pname, Nx.subtract(w, Nx.multiply(layer_g[pname], Nx.tensor(lr)))}
    end)}
  end)}
end

params_updated = apply_sgd.(params_vk, grads)

# Check the dense_0 kernel changed
before = params_vk.data["dense_0"]["kernel"] |> Nx.sum() |> Nx.to_number()
after_step = params_updated.data["dense_0"]["kernel"] |> Nx.sum() |> Nx.to_number()
IO.puts("dense_0 kernel sum: #{before} → #{after_step}")
```

You should see the sum change by a small amount: `-lr * sum(grad)`,
up to f32 rounding.
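
The relationship follows from linearity: summing `w - lr * g` over all entries gives `sum(w) - lr * sum(g)`. Here is a plain-Elixir sketch of that identity on flat lists; `SgdSketch` and the sample values are made up for illustration.

```elixir
defmodule SgdSketch do
  # One SGD step on flat lists: w' = w - lr * g, elementwise.
  def step(weights, gradients, lr) do
    Enum.zip_with(weights, gradients, fn w, g -> w - lr * g end)
  end
end

ws = [0.5, -0.25, 1.0]
gs = [0.1, 0.2, -0.4]
step_size = 0.01

ws_next = SgdSketch.step(ws, gs, step_size)

# The change in the sum should equal -lr * sum(g), up to float rounding.
delta = Enum.sum(ws_next) - Enum.sum(ws)
expected = -step_size * Enum.sum(gs)
abs(delta - expected) < 1.0e-12
```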

## Where to go from here

The `examples/` directory in this repo has more:

* `examples/axon_training_loop.exs` — 100-step training run with
  loss-trajectory comparison vs `BinaryBackend`. PASS verdict.
* `examples/full_bench.exs` — per-op latency curves, end-to-end
  workloads, robustness run. Cross-host comparison.

The roadmap (`docs/VULKANO_BACKEND_ROADMAP.md`) lists the open
work: persistent buffer pool, f64 matmul, custom `Nx.Defn`
compiler, conv/FFT/sort/scatter for the long tail.

The blog post that tells the whole story:
[*The Backend That Didn't Need to Know*](http://www.dataalienist.com/blog-backend-didnt-need-to-know.html).

Most production users will hit the same boundary we did: forward
op coverage is enough for the workloads that matter, and the
backend that supports them most completely is the one that knows
the least about gradients.