# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/),
and this project adheres to [Semantic Versioning](https://semver.org/).

<!-- %% CHANGELOG_ENTRIES %% -->

## 0.7.2 - 2026-06-13

### Fixed

- The README performance section now compares Emily against both
  benchmark baselines — EXLA (host CPU) and EMLX (the older MLX-backed
  Nx backend on the Metal GPU) — instead of EXLA alone, and its
  rule-of-thumb figures (ViT-base, DistilBERT) are reconciled with the
  current benchmark report.
- The benchmark report's environment block now records the Emily
  version the numbers were produced on (0.7.0) and drops a misleading
  run timestamp.
- The `MAINTAINING.md` release runbook is corrected: `mix publisho` is
  no longer described as pushing (it only commits and tags), and the
  obsolete manual draft-promotion step is dropped — `release-nif.yml`
  now publishes the release automatically once the NIFs are built.

## 0.7.1 - 2026-06-13

### Fixed

- Documentation no longer fails to build over autolink references to the
  hidden `Emily.Native.async_eval/2` and `Emily.Native.fast_rope_int/8`
  NIF stubs in the changelog; both are excluded from ex_doc autolinking.

## 0.7.0 - 2026-06-13

### Added

- **Native Expr compiler — on by default under
  `compiler: Emily.Compiler`.** Lowers a traced `Nx.Defn.Expr` to a
  flat IR once and replays the whole forward graph in a **single NIF
  call per invocation**, collapsing the per-op BEAM↔worker round-trips
  a step-evaluated decode loop would otherwise pay. Weights cross the
  NIF boundary once (captured by the compiled program) and are never
  re-serialised per call. It is the default, so a bare
  `compiler: Emily.Compiler` compiles native:

      Nx.Defn.jit(&forward/1, compiler: Emily.Compiler).(input)

  Coverage is the full Nx primitive set (with `Emily.Backend`'s
  dtype-coercion and op-composition semantics ported into the
  lowering), the fused `Emily.Fast.*` kernels (RMSNorm, LayerNorm,
  RoPE, scaled dot-product attention and its mask / sink / mask+sink
  variants), `Nx.Block.*` including the full `LinAlg` family
  (`cholesky` / `solve` / `qr` / `eigh` / `lu` / `svd` /
  `determinant`), `Nx.Random`, and the control flow `cond` /
  `defn while` (with the host loop driven entirely from the worker
  thread). Anything the IR can't lower yet routes through
  `Nx.Defn.Evaluator` under the default `native_fallback: :eval` (with
  a one-shot `[:emily, :compiler, :fallback]` telemetry event), so the
  native lane is safe as the default on any model. The default is read
  from `config :emily, :native` (defaulting to `true`), so
  `config :emily, native: false` opts every defn out of the native lane
  application-wide — e.g. on a memory-constrained host where the
  one-shot compile peak is too large; a per-call `native:` option
  always wins over the app-env default.

  `native_fallback: :raise` fails instead — the conformance suites use
  this to prove a model lowers fully native.

  End-to-end: DistilBERT (question answering with `Nx.Serving`), ViT,
  Whisper (`speech_to_text` end-to-end including the featurizer STFT,
  encoder/decoder, and autoregressive decode loop), and Bumblebee
  `Text.generation` (greedy *and* multinomial sampling) all compile
  fully native under `native_fallback: :raise`. Bumblebee generation
  on Qwen3-0.6B measures **~5× the evaluator's decode throughput**
  (~61 vs ~12 tok/s on an M-series Mac), with byte-identical
  completions. Native training drives Axon end-to-end — a LeNet CNN
  and a dense MLP train on real MNIST entirely through the single-NIF
  path (forward, categorical-cross-entropy, backward, Adam) to the
  same >97% / >96% accuracy as the evaluator.
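
  As a configuration sketch of the opt-out and strict-fallback switches
  described in this entry: the option names come from the text above,
  while `forward/1` and `input` are the same placeholders used in the
  example call and stand in for the caller's own code.

  ```elixir
  # config/config.exs
  import Config

  # Opt every defn out of the native lane application-wide.
  config :emily, native: false

  # A per-call option still wins over the app-env default, e.g. at a call site:
  #
  #   Nx.Defn.jit(&forward/1, compiler: Emily.Compiler, native: true).(input)
  #
  # and `native_fallback: :raise` makes an incomplete lowering fail instead of
  # routing through the evaluator:
  #
  #   Nx.Defn.jit(&forward/1,
  #     compiler: Emily.Compiler, native_fallback: :raise).(input)
  ```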

- **`Emily.Compiler` — `:fuse` opt-in.** Adds `mx::compile` fusion on
  top of the replay, fusing elementwise runs (RMSNorm, softmax, SiLU
  gating, residual adds) the plain replay leaves as separate kernels.
  For a `defn while`, the loop body is fused under `mx::compile` and
  cached per stream so it cache-hits across iterations rather than
  recompiling per step. Enable on top of the native generation path:

      Nx.Defn.jit(&forward/1,
        compiler: Emily.Compiler, native: true, fuse: true)

  On Qwen3-0.6B this lifts greedy decode to **~5.4× the evaluator**,
  ~1.1× over the plain native lane (~68 vs ~62 tok/s); in
  isolation on a decode-shaped transformer block, fusion measures
  ~1.5–1.6× over the plain replay. Trade-off: `mx::compile`
  reassociates f32 to within a few ULP, so output is **not**
  bit-identical to the evaluator. Greedy argmax is robust to that
  empirically (Qwen3-0.6B token ids matched the evaluator exactly in
  our run), but the match is empirical, not guaranteed — a near-tie
  top-2 logit can flip a token. **Sampling strategies will diverge
  from the evaluator under fusion** even with a fixed seed.

- **`Emily.Generation` — a model-agnostic decode-loop driver.**
  JIT-compiles a caller-supplied shape-stable per-token forward
  (`fn token, offset, cache, params -> {logits, cache} end`) with the
  native single-NIF compiler and drives the autoregressive loop from
  Elixir — offset bookkeeping, KV-cache threading, stop conditions,
  next-token selection (greedy by default), and per-token streaming
  via `:on_token`. The forward runs fully native; the loop stays in
  Elixir, so token streaming and host-side control are preserved.
  Emily supplies only the mechanism — the model (forward + cache) is
  the caller's.
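
  To make the division of labour concrete, here is a minimal pure-Elixir
  sketch of such a loop. It is not `Emily.Generation`'s API or
  implementation: `DecodeLoopSketch`, its option names other than
  `:on_token`, and the toy forward are all hypothetical, and logits are
  modelled as a plain list of floats rather than a tensor.

  ```elixir
  defmodule DecodeLoopSketch do
    # Hypothetical illustration only; this is not Emily.Generation's API.
    # `forward` has the shape fn token, offset, cache, params -> {logits, cache} end.
    def generate(forward, params, first_token, opts \\ []) do
      max_new = Keyword.get(opts, :max_new_tokens, 16)
      stop? = Keyword.get(opts, :stop?, fn _token -> false end)
      on_token = Keyword.get(opts, :on_token, fn _token -> :ok end)
      cache = Keyword.get(opts, :cache, [])
      loop(forward, params, first_token, 0, cache, max_new, stop?, on_token, [])
    end

    # Token budget exhausted: return what was generated, oldest first.
    defp loop(_forward, _params, _token, _offset, _cache, 0, _stop?, _on_token, acc),
      do: Enum.reverse(acc)

    defp loop(forward, params, token, offset, cache, remaining, stop?, on_token, acc) do
      # One forward step; the updated cache is threaded into the next step.
      {logits, cache} = forward.(token, offset, cache, params)
      next = greedy(logits)
      # Per-token streaming hook.
      on_token.(next)

      if stop?.(next) do
        Enum.reverse([next | acc])
      else
        loop(forward, params, next, offset + 1, cache, remaining - 1, stop?, on_token, [next | acc])
      end
    end

    # Greedy selection: index of the largest logit.
    defp greedy(logits) do
      {_logit, index} =
        logits |> Enum.with_index() |> Enum.max_by(fn {logit, _index} -> logit end)

      index
    end
  end

  # Toy forward: always prefers (token + 1) mod vocab_size; `params` is just the vocab size.
  toy_forward = fn token, _offset, cache, vocab_size ->
    target = rem(token + 1, vocab_size)
    logits = for id <- 0..(vocab_size - 1), do: if(id == target, do: 1.0, else: 0.0)
    {logits, [token | cache]}
  end

  tokens = DecodeLoopSketch.generate(toy_forward, 5, 0, stop?: &(&1 == 3))
  # tokens == [1, 2, 3]
  ```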

- `Emily.async_eval/1` (and `Emily.Native.async_eval/2`) schedule
  evaluation of one or more lazy graphs **without blocking on the
  GPU**, wrapping `mlx::core::async_eval`. The work is handed to the
  device's command queue and the call returns as soon as it is
  enqueued — not when it finishes. Lets a caller keep dispatching the
  next step's ops while the device computes the current one (e.g. an
  autoregressive decode loop), blocking only when a value is actually
  read back on the host via `to_binary/1` / `eval/1`. Pass every
  output of a step (logits plus all KV-cache buffers) in one call.

- `Emily.Native.fast_rope_int/8` — RoPE with an **integer**
  absolute-position `offset` (routing to MLX's int-offset `rope`
  overload), for incremental decode where the caller tracks position
  host-side. Complements the existing tensor-offset `fast_rope/8`.
  Note: feed the kernel the 4-D `{batch, heads, seq, head_dim}`
  layout — in 3-D, MLX 0.31 mis-rotates single-token (`seq == 1`)
  inputs.

### Fixed

- **Dilated window reductions (`window_dilations > 1`) returned wrong
  values.** `window_sum`/`window_max`/`window_min`/`window_product`
  with a dilated kernel silently produced garbage for windows past the
  first stride positions, on both the eager backend and the native
  compiler (they share the window-reduce core). A dilated kernel axis
  gets an `as_strided` stride > 1, so the sliding-window view aliases
  fewer physical elements than its logical size; MLX's strided-reduce
  fast path then read past the aliased buffer. The view is now
  materialised contiguously before the reduce when any dilation > 1
  (the common non-dilated pooling path is unchanged and stays
  copy-free).
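
  For readers checking results by hand, the following is a small
  reference oracle for a 1-D dilated window sum in plain Elixir. It is a
  sketch of the semantics only (valid padding, list input), not the
  backend's implementation; `DilatedWindowSketch` is a hypothetical name.

  ```elixir
  defmodule DilatedWindowSketch do
    # Reference 1-D window sum over a list, with no padding.
    # A window of `size` taps spaced `dilation` apart spans
    # (size - 1) * dilation + 1 input elements.
    def window_sum(list, size, stride, dilation) do
      span = (size - 1) * dilation + 1
      last_start = length(list) - span

      if last_start < 0 do
        []
      else
        for start <- 0..last_start//stride do
          Enum.sum(for tap <- 0..(size - 1), do: Enum.at(list, start + tap * dilation))
        end
      end
    end
  end

  input = [1, 2, 3, 4, 5, 6, 7, 8]
  dilated = DilatedWindowSketch.window_sum(input, 3, 1, 2)
  # dilated == [9, 12, 15, 18]
  ```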

## 0.6.1 - 2026-05-31

### Changed

- Documentation updated for the 0.6.x release: the README installation
  instructions and the example notebooks now reference
  `{:emily, "~> 0.6"}`.

## 0.6.0 - 2026-05-31

This release is a security-hardening pass over the native (NIF) boundary
and the build/release pipeline: direct `Emily.Native` calls now validate
their arguments instead of trusting Elixir-side normalization,
precompiled-NIF downloads verify against a checksum pinned in the hex
package (a trust root independent of the GitHub release), and the
per-stream worker is bounded and tears down without blocking a BEAM
scheduler. It is backward compatible, but two behaviour changes matter
for high-concurrency callers: the per-worker async queue is now bounded
(`worker_queue_limit`, default 8192) and rejects when full, and a stopped
or dropped worker replies `{:error, :stopped}` to queued callers instead
of running their work.

### Added

- `Emily.Stream.close/1` stops a stream's worker thread deterministically
  instead of waiting for garbage collection: queued operations are
  cancelled (their callers get a `RuntimeError`), the in-flight op
  finishes, and the OS thread is joined off the BEAM schedulers.
- `config :emily, worker_queue_limit: N` (default `8192`) bounds the
  per-worker async queue, and `config :emily, await_timeout: ms` (default
  `:infinity`) sets an optional timeout for awaiting native results.
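
  A configuration sketch for the two keys above. The key names and
  defaults are the ones listed in this entry; the values shown are
  illustrative only and should be sized for the workload.

  ```elixir
  # config/config.exs
  import Config

  config :emily,
    # Bound the per-worker async queue (default: 8192).
    worker_queue_limit: 1024,
    # Time out awaits on native results after 30 s (default: :infinity).
    await_timeout: 30_000
  ```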

### Security

- Worker-thread teardown no longer blocks a BEAM scheduler. The resource
  destructor previously drained the worker's entire queue and joined the
  OS thread inline, so collecting a busy stream during GC could stall a
  scheduler. Workers are now joined off-scheduler by a dedicated reaper
  (itself joined at NIF unload), and on stop the worker cancels its
  queued tasks — replying `{:error, :stopped}` — instead of running them.

- The async NIF worker queue is now bounded (`worker_queue_limit`, reject
  when full) so a flood of operations can't grow it without limit and pin
  host/GPU memory, and a stopped or dropped worker now replies
  `{:error, :stopped}` to every queued caller instead of leaving it
  blocked forever. `Emily.Native.worker_queue_depth/1` exposes the depth
  for observability.

- The dev/CI source-build path now refuses to trust an MLX install
  directory it doesn't own and keeps the build cache `0700`, so a shared
  or attacker-controlled `EMILY_CACHE` can't plant a `libmlx.a` that is
  then statically linked into the NIF. Fixed system tools (`getconf`,
  `id`, `sw_vers`, plus `xcrun`/`sysctl`/`ps` in `build-mlx.sh`) resolve
  from absolute/system paths rather than `$PATH`, and the MLX-build lock
  records the holder's process start time so a recycled PID can't be
  mistaken for the original holder. Build-time only; no runtime change.

- Precompiled NIF downloads are now verified against checksums pinned
  inside the hex package (`native_checksums.txt`) rather than a `.sha256`
  sidecar fetched from the same GitHub release as the tarball. Because
  the package contents are covered by Hex's package hash in the
  consumer's `mix.lock`, the trust root no longer lives in the mutable
  release. The tarball is also extracted with `:erl_tar` against a strict
  entry allowlist (`libemily.{so,dylib}` + `mlx.metallib`), rejecting
  symlinks, hardlinks, `..` traversal, absolute paths, and unexpected
  entries — closing a path-traversal/arbitrary-write vector in the old
  `tar -xzf` extraction. New `mix emily.checksums` task regenerates the
  pinned file per release.

- Integer arguments crossing the NIF boundary are now range-checked
  before being narrowed from Elixir's `int64` to C++ `int`. Previously an
  out-of-range axis, count, or shape entry wrapped silently (e.g. an axis
  of `2^32 + 3` became `3`), dispatching the wrong MLX operation; and
  unbounded sample counts in `random_split`/`random_categorical` could
  drive huge allocations. Out-of-range values, and negative counts, now
  raise `ArgumentError`. Centralized as `checked_int` / `require_count`
  helpers applied across the reduce, shape, sort, random, index, linalg,
  conv, and fast NIFs.
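
  The wrap described above can be reproduced in plain Elixir. This is an
  illustrative sketch, not the C++ `checked_int` helper itself:
  `NarrowingSketch` is a hypothetical module, and it assumes the C++
  `int` is 32 bits wide.

  ```elixir
  defmodule NarrowingSketch do
    @int_min -2_147_483_648
    @int_max 2_147_483_647

    # What an unchecked narrowing does: keep only the low 32 bits.
    def wrap(value) when is_integer(value) do
      <<narrowed::signed-integer-size(32)>> = <<value::signed-integer-size(32)>>
      narrowed
    end

    # The checked alternative: pass in-range values through, reject the rest.
    def checked(value) when is_integer(value) and value >= @int_min and value <= @int_max,
      do: value

    def checked(value),
      do: raise(ArgumentError, "#{inspect(value)} does not fit in a 32-bit int")
  end

  # 2^32 + 3 silently becomes 3 when narrowed without a check.
  wrapped = NarrowingSketch.wrap(4_294_967_299)
  ```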

- Native indexing and window NIFs now validate their vector arguments
  against the tensor rank before indexing, and reject non-positive
  strides, dilations, and window dimensions. Previously a direct
  `Emily.Native` call with a malformed `slice_update` start, a short
  pad/window vector, or a zero window stride could read a C++ vector out
  of bounds or trigger an integer divide-by-zero (SIGFPE) — both of which
  crash the whole BEAM VM rather than raising in the caller. They now
  raise `ArgumentError`.

- `Emily.Native.from_binary/3` now validates tensor shapes at the NIF
  boundary. Dimensions above `INT32_MAX` are rejected (previously they
  silently truncated through MLX's `int32` `ShapeElem`), and the element
  and byte counts are computed with overflow checking. Without this an
  attacker-chosen shape whose element product wrapped (e.g.
  `[2^21, 2^21, 2^22]` → `0`) could pass the binary-size check against an
  undersized — even empty — binary and build an array whose shape outran
  its allocation, an out-of-bounds read on the next `eval`/`to_binary`.
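
  The example shape can be checked by hand. The sketch below emulates a
  wrapping 64-bit element count in plain Elixir (whose integers do not
  overflow) next to an overflow-checked one. `ShapeCheckSketch` is a
  hypothetical module, and the 64-bit unsigned counter width is an
  assumption made for illustration.

  ```elixir
  defmodule ShapeCheckSketch do
    @int32_max 2_147_483_647
    # 2^64: the modulus of an assumed 64-bit unsigned element counter.
    @uint64_mod 18_446_744_073_709_551_616

    # Element count as a wrapping 64-bit multiply would compute it.
    def wrapped_count(shape), do: rem(Enum.product(shape), @uint64_mod)

    # Element count with per-dimension and overflow checks.
    def checked_count(shape) do
      Enum.reduce_while(shape, {:ok, 1}, fn dim, {:ok, acc} ->
        cond do
          not is_integer(dim) or dim < 0 or dim > @int32_max ->
            {:halt, {:error, {:bad_dim, dim}}}

          acc * dim >= @uint64_mod ->
            {:halt, {:error, :overflow}}

          true ->
            {:cont, {:ok, acc * dim}}
        end
      end)
    end
  end

  # [2^21, 2^21, 2^22] multiplies out to exactly 2^64.
  shape = [Integer.pow(2, 21), Integer.pow(2, 21), Integer.pow(2, 22)]
  wrapped = ShapeCheckSketch.wrapped_count(shape)
  checked = ShapeCheckSketch.checked_count(shape)
  ```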

- `Emily.Native.conv_general/8` now rejects a non-positive `groups`
  argument with `ArgumentError` instead of crashing the BEAM VM. MLX's
  convolution checks compute `in_channels % groups`, so `groups <= 0`
  (or a large value that narrows to zero through the `int64 → int`
  conversion) was an integer modulo-by-zero — a SIGFPE that bypassed the
  NIF's exception path and terminated the entire node. The guard
  validates the un-narrowed value at the NIF boundary.

## 0.5.1 - 2026-05-23

### Fixed

- `CHANGELOG.md` — corrected the 0.5.0 entry. The published release
  carried two `### Changed` headings and listed three new-functionality
  items (`mix emily.doctor`, `config :emily, fallback:`, and the
  `Emily.Memory` public allocator API) under Changed rather than
  Added. Merged the duplicate Changed sections, moved the
  new-functionality items to Added, and put items into reverse
  chronological order. No code change.

## 0.5.0 - 2026-05-23

### Added

- `Emily.Quantization.dequantize_defn/1` now supports the `nvfp4`
  microscaled mode in addition to `affine`, `mxfp4`, and `mxfp8` —
  the full MLX `QuantizationMode` enum now runs through the
  defn-native dequant path. `nvfp4` reuses the FP4-E2M1 lane LUT
  from `mxfp4` and the FP8-E4M3 LUT from `mxfp8` (consumed against
  the per-group scale bytes rather than lane codes — the NVIDIA
  microscaled convention uses finer-grained group_size=16 with
  FP8-E4M3 scales instead of mxfp4/mxfp8's group_size=32 with
  FP8-E8M0 scales). Output dtype is bf16 to match
  `QuantizedWeight.to_dense/1`, and the round-trip is bit-identical
  (max abs diff = 0.0). `Emily.Quantization.Transform` accepts
  `mode: "nvfp4"`.

- `Emily.Quantization.dequantize_defn/1` now supports the `mxfp8`
  microscaled mode in addition to `affine` and `mxfp4`. Each 8-bit
  lane code decodes through a 256-entry FP8-E4M3 lookup table
  precomputed via MLX's `FromFP8` bit-trick (strip sign, shift the
  low 7 bits left by 7 to align the E4M3 exponent into f16's
  exponent field, multiply by 256 for the bias difference, restore
  sign). Per-group scales reuse the FP8-E8M0 decode from the mxfp4
  path. Output dtype is bf16 to match `QuantizedWeight.to_dense/1`,
  and the round-trip is bit-identical (max abs diff = 0.0) on
  realistic data. `Emily.Quantization.Transform` accepts
  `mode: "mxfp8"`; only `nvfp4` (which uses an FP8-E4M3 per-group
  scale instead of FP8-E8M0) remains defn-unsupported.
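
  The bit-trick is easy to try out in plain Elixir, which can
  reinterpret 16 bits as a half-precision float. This is a sketch of the
  arithmetic described above for a single code, not the library's LUT
  builder; `Fp8Sketch` is a hypothetical module and the example codes
  were picked by hand.

  ```elixir
  defmodule Fp8Sketch do
    import Bitwise

    # Decode one FP8-E4M3 code: strip the sign, shift the low 7 bits left by 7
    # so they land in f16's exponent/mantissa fields, scale by 256 for the
    # exponent-bias difference (15 - 7 = 8), then restore the sign.
    def decode(code) when code in 0..255 do
      sign = code >>> 7
      magnitude_bits = (code &&& 0x7F) <<< 7
      <<half::float-size(16)>> = <<magnitude_bits::unsigned-integer-size(16)>>
      value = half * 256.0
      if sign == 1, do: -value, else: value
    end
  end

  one = Fp8Sketch.decode(0x38)
  largest = Fp8Sketch.decode(0x7E)
  ```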

- `Emily.Quantization.dequantize_defn/1` now supports the `mxfp4`
  microscaled mode in addition to `affine`. Each 4-bit lane code
  decodes through MLX's FP4-E2M1 lookup table (`+0.0, +0.5, +1.0,
  +1.5, +2.0, +3.0, +4.0, +6.0` and their negatives); each u8 scale
  byte decodes through `2^(s - 127)` (FP8-E8M0). Output dtype is
  bf16 to match `QuantizedWeight.to_dense/1`, and the round-trip is
  bit-identical (max abs diff = 0.0) on realistic scale bytes
  because every FP4 LUT entry and every E8M0 power-of-two is exact
  in bf16. `Emily.Quantization.Transform` gains a `:mode` option
  (default `"affine"`, accepts `"mxfp4"`); `mxfp8` and `nvfp4` are
  still defn-unsupported and route through the Native NIF.
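
  As a worked illustration of the two decodes, the sketch below
  dequantizes a single lane in plain Elixir. The magnitude table is the
  one listed above; that the sign sits in bit 3 of the lane code is an
  assumption (the usual E2M1 layout), and `Mxfp4Sketch` is a hypothetical
  module rather than part of the library.

  ```elixir
  defmodule Mxfp4Sketch do
    import Bitwise

    # FP4-E2M1 magnitudes, indexed by the low 3 bits of the lane code.
    @magnitudes {0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0}

    # Dequantize one 4-bit lane code against its group's u8 scale byte.
    def decode(code, scale_byte) when code in 0..15 and scale_byte in 0..255 do
      magnitude = elem(@magnitudes, code &&& 0b0111)
      # Assumed layout: bit 3 is the sign bit.
      sign = if (code &&& 0b1000) == 0, do: 1.0, else: -1.0
      # FP8-E8M0 scale: 2^(s - 127).
      sign * magnitude * :math.pow(2.0, scale_byte - 127)
    end
  end

  # 6.0 * 2^(128 - 127)
  positive = Mxfp4Sketch.decode(0b0111, 128)
  # -0.5 * 2^(126 - 127)
  negative = Mxfp4Sketch.decode(0b1001, 126)
  ```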

- `Emily.Quantization.dequantize_defn/1` now supports int3 and int6
  weights in addition to int2/int4/int8. The new path reads each
  lane's two adjacent u32 words as a u64, shifts by the in-word bit
  offset, and masks — handling the cross-u32 packing MLX uses for
  bit widths that don't divide 32 cleanly. `defn_supported_bits/0`
  now returns `[2, 3, 4, 6, 8]`; quantized Axon graphs rewritten
  via `Emily.Quantization.Transform` (and `Emily.Quantization.Layers.quantized_dense/4`)
  pick the expanded set up automatically. Previously the defn path
  rejected `bits ∈ {3, 6}` and callers had to fall back to
  `QuantizedWeight.to_dense/1` (the Native NIF).
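
  The cross-word read can be sketched on plain integers. This is an
  illustration of the technique rather than the defn code: `UnpackSketch`
  is a hypothetical module, and it assumes lanes are packed
  least-significant-bit first across consecutive u32 words.

  ```elixir
  defmodule UnpackSketch do
    import Bitwise

    # Read lane `index` of width `bits` by viewing two adjacent u32 words
    # as one u64, shifting by the in-word bit offset, and masking.
    def lane(words, bits, index) do
      bit_pos = index * bits
      word = div(bit_pos, 32)
      offset = rem(bit_pos, 32)
      lo = Enum.at(words, word)
      hi = Enum.at(words, word + 1, 0)
      u64 = (hi <<< 32) ||| lo
      (u64 >>> offset) &&& ((1 <<< bits) - 1)
    end

    # Test helper: pack lanes LSB-first into u32 words (assumed layout).
    def pack(lanes, bits) do
      total =
        lanes
        |> Enum.with_index()
        |> Enum.reduce(0, fn {value, i}, acc -> acc ||| (value <<< (i * bits)) end)

      word_count = div(length(lanes) * bits + 31, 32)
      for w <- 0..(word_count - 1), do: (total >>> (w * 32)) &&& 0xFFFFFFFF
    end
  end

  # Twelve 3-bit lanes occupy 36 bits, so lane 10 straddles the first word boundary.
  lanes = [1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4]
  words = UnpackSketch.pack(lanes, 3)
  unpacked = for i <- 0..(length(lanes) - 1), do: UnpackSketch.lane(words, 3, i)
  ```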

- `ARCHITECTURE.md` — current shape of the library extracted from
  `PLAN.md`. Covers the four-layer dispatch model, the worker-thread
  + per-process-stream concurrency model, the public `Emily.Memory`
  allocator API, the telemetry event catalogue, the
  `:debug_bounds_check` / `:debug_detect_nan_inf` compile-time flags,
  build/packaging notes, the per-layer testing oracle table, and the
  active risk register. Linked from the README under a new
  Documentation section and grouped under "Project" in the HexDocs
  sidebar.
- `ROADMAP.md` — active and future work, separated from the
  historical milestone log. Lists deferred-to-post-1.0 items
  (typed exceptions, GPU interop pointers, source-build doctor
  probes) and the open in-roadmap MLX capability gaps (sparse / MoE
  matmuls, FP8 dtype, `ThreadLocalStream`).
- `mix emily.doctor` — diagnostic Mix task that verifies the local
  Emily runtime installation. Checks the host platform (OS, arch,
  macOS version against the active variant's minimum), the active
  MLX variant, `priv/libemily.so` and `priv/mlx.metallib`, NIF
  loadability, and a tiny `Emily.Backend` smoke test that asserts
  the result didn't silently fall back to `Nx.BinaryBackend`. Checks
  short-circuit: when a prerequisite fails, dependent checks report
  `[skip]` rather than producing cascading noise. Supports
  `--variant aot|jit` for "would this host satisfy :jit?" probes and
  `--help` for usage.
- `config :emily, fallback: :silent | :warn | :raise` — strict
  fallback modes for development and CI. `:silent` (the default)
  preserves today's behaviour; `:warn` emits the one-shot
  `Logger.warning` per `{op, input_shapes}` pair previously gated by
  `:warn_on_fallback`; `:raise` raises `RuntimeError` with op,
  shapes, and dtypes on entry, letting CI fail the build when a hot
  path unexpectedly routes through `Nx.BinaryBackend`. An invalid
  `:fallback` value raises `ArgumentError` on the first fallback so
  typos surface immediately.
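
  A configuration sketch for the CI use case described above. Placing it
  in `config/test.exs` is an assumption about project layout, not a
  requirement of the option.

  ```elixir
  # config/test.exs
  import Config

  # Fail loudly when an op unexpectedly routes through Nx.BinaryBackend.
  config :emily, fallback: :raise
  ```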
- `Emily.Memory` — public allocator API for long-running serving and
  training workloads that need to observe and manage MLX memory
  without reaching into `Emily.Native`. Exposes `stats/0` (active,
  peak, and cached bytes, also emitting `[:emily, :memory, :stats]`),
  `reset_peak/0`, and `clear_cache/0`. Documented under the README's
  Observability section and grouped with `Emily.Telemetry` in the
  ExDoc sidebar.

### Changed

- `PLAN.md` slimmed to its milestone-history role. The current-shape
  sections (architecture diagram, core design decisions, testing
  philosophy, risks-and-mitigations) moved to `ARCHITECTURE.md`;
  goals, non-goals, and deferred-milestone summaries moved to
  `ROADMAP.md`. The M0–M27 milestone narratives, the ratified
  project decisions, and the 2026-04-22 MLX capability audit stay in
  `PLAN.md` as the historical record. The stale "narrow
  `with_stream/2` + `new/1` + `synchronize/1` surface" reference (no
  `synchronize/1` ever shipped) and the planned `set_default_stream/1`
  primary deliverable (removed during the post-M14 fixes) drop out
  with the prologue rewrite.
- `Emily.Native` now annotates NIF errors with operation, input
  shape/dtype, options, and worker context. `ArgumentError` and
  `RuntimeError` raised from async ops get an `Emily.Native context:
  op=… inputs=[…] options=[…] stream=…` suffix, so common failures
  (shape mismatches in `matmul`, divisibility errors in `quantize`,
  mask shape bugs in `fast_scaled_dot_product_attention`, etc.) are
  diagnosable from the message alone. The error-formatting path is
  total — bad context maps degrade to `?` markers rather than masking
  the underlying NIF error.
- The legacy `config :emily, :warn_on_fallback, true` boolean is
  soft-deprecated in favour of `:fallback`. It is still honoured
  when `:fallback` is unset (`true` → `:warn`); when both are set,
  `:fallback` wins.
- `Emily.Telemetry.memory_stats/0` now delegates to
  `Emily.Memory.stats/0`. Behaviour is unchanged — same event,
  measurements, and return shape — but new code should prefer the
  `Emily.Memory` entry point.

## 0.4.0 - 2026-05-17

### Changed

- Upgraded to Nx 0.12 / Bumblebee 0.7 / Axon 0.8. Nx 0.12 replaces
  the optional-callback list (`lu`, `svd`, `qr`, `cholesky`, `eigh`,
  `solve`, `take`, `take_along_axis`, `fft2`, `ifft2`,
  `cumulative_*`, `logical_not`, `all_close`) with a single
  generic `Nx.Backend.block/4` dispatch keyed on `Nx.Block.*`
  structs. `Emily.Backend` now routes every previously-native op
  through `block/4`, preserving the MLX fast paths without losing
  the BinaryBackend fallback when an unknown block arrives. Existing
  `Emily.Backend` consumers see no behavioural change.
- Migrated `Emily.Fast.*` from the now-removed
  `Nx.Defn.Expr.optional/3` extension point to `Nx.block/4`. Each
  fused kernel (`rms_norm`, `layer_norm`, `rope`, `rope_with_freqs`,
  `scaled_dot_product_attention` with and without mask/sinks) now
  emits an `Emily.Fast.Block.*` struct that `Emily.Backend.block/4`
  pattern-matches to the matching `mx::fast::*` NIF. The
  composed-defn fallbacks under non-Emily backends are unchanged.
- Bumblebee 0.7 ships Qwen3 first-class, so
  `notebooks/qwen3_quantized.livemd` no longer needs the `main`-ref
  Bumblebee pin from the 0.6.3 era.

### Added

- `Nx.rfft/2` and `Nx.irfft/2` support. The underlying
  `Native.rfftn` / `Native.irfftn` NIFs were already in place from
  earlier MLX work; Nx 0.12 surfaces these as backend-block ops so
  Emily wires them up at no MLX-side cost.
- Smoke tests for three new Bumblebee 0.7 model families on
  `Emily.Backend`: NomicBERT (`:nomic_embeddings`), SmolLM3
  (`:smollm3`), and ModernBERT (`:modernbert`). All three drive a
  tiny synthetic spec end-to-end through `Axon.predict` so they
  remain offline-friendly; tagged `:conformance`.
- Runnable Livebooks for each of the three new Bumblebee 0.7
  families: `notebooks/nomic_embeddings.livemd` (NomicBERT
  embeddings with cosine similarity), `notebooks/smollm3_chat.livemd`
  (SmolLM3-3B chat completion with a `<think>` toggle for hybrid
  reasoning), and `notebooks/modernbert_classification.livemd`
  (ModernBERT NLI fine-tune). All three are published under the
  HexDocs Notebooks group.
- A `[:emily, :block, :fallback]` telemetry event fires whenever
  `Emily.Backend.block/4` falls through to the supplied default
  `fun`. Surfaces ops we used to handle natively but now land on
  the composed-defn path — useful in soak runs to spot silent
  regressions after a Bumblebee bump.

### Fixed

- `mix docs` no longer emits autolinker warnings for the
  `Emily.Backend.block/4` and `Nx.Defn.Expr.optional/3` references
  in the `Emily.Fast` and `Emily.Fast.Block` moduledocs. The
  references resolved to `@doc false` callees (the backend callback
  is hidden by `Nx.Backend`, and `optional/3` was removed in Nx 0.12);
  the prose stays, the `Mod.fun/arity` shape is broken up so the
  autolinker no longer follows it. Same pattern as the earlier
  fix in `ee32c7c`.

### Removed

- `{:f8_e4m3fn, 8}` (introduced in Nx 0.11) is rejected at the
  backend boundary with the same "no MLX primitive" `ArgumentError`
  pattern as `{:f, 64}`. MLX has no float-8 dtype; cast to `:f16` or
  `:bf16`.

## 0.3.5 - 2026-05-03

## 0.3.4 - 2026-05-03

### Fixed

- `Nx.LinAlg.svd(tensor, full_matrices?: false)` on rank-2 inputs no
  longer routes through MLX's full-matrices SVD and post-slices —
  MLX's SVD has no thin switch, so the old path materialised the full
  m × m U on device and instantly OOM'd Metal for tall matrices like
  the Qwen3-0.6B embedder kernel (151936 × 1024 → ~92 GB U). The thin
  case now computes `G = MᵀM → eigh → S, V; U = MV / S` (or the
  symmetric `MMᵀ` route for wide matrices), keeping the decomposition
  at min(m, n)². See the `Emily.Backend` moduledoc Divergences section
  for the numerical caveat (the Gram step squares M's condition
  number). Refs #84.
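
  The size difference is simple arithmetic, sketched here in plain
  Elixir. It assumes 4-byte (f32) elements, which is consistent with the
  ~92 GB figure above; `SvdFootprintSketch` is a hypothetical helper.

  ```elixir
  defmodule SvdFootprintSketch do
    # Assumed element size: f32.
    @f32_bytes 4

    # Bytes for the full m x m U that the old path materialised.
    def full_u_bytes(m), do: m * m * @f32_bytes

    # Bytes for the min(m, n) x min(m, n) Gram matrix the thin path decomposes.
    def gram_bytes(m, n) do
      k = min(m, n)
      k * k * @f32_bytes
    end
  end

  full = SvdFootprintSketch.full_u_bytes(151_936)
  gram = SvdFootprintSketch.gram_bytes(151_936, 1024)
  # full == 92_338_192_384 (about 92 GB); gram == 4_194_304 (4 MiB)
  ```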
- `mix docs` runs cleanly. The MNIST notebook referenced
  `Axon.Loop`'s `trainer/2` (no such arity); three other inline
  references resolved to `@doc false` callees in upstream libraries
  (`Nx.Defn.Expr`'s `optional/3`, Bumblebee's `rms_norm/2`)
  and triggered autolinker warnings on every doc build. The notebook
  now uses the correct `trainer/3` arity, and the prose references
  have been reshaped so the autolinker no longer follows them,
  keeping the build warning-free for future `--warnings-as-errors`
  enforcement. Refs #83.

## 0.3.3 - 2026-05-03

### Fixed

- `Emily.Compiler` now silently drops options it doesn't recognise
  instead of raising `ArgumentError`. This matches the behaviour of
  `Nx.Defn.Evaluator` and EXLA, and restores compatibility with
  higher-level libraries that forward caller-supplied options through
  the JIT compiler — notably `Axon.build/2`, whose contract states
  that "all other options are forwarded to the underlying JIT
  compiler". Hit when running a Bumblebee-built Axon model with
  `Axon.predict(..., global_layer_options: [output_hidden_states:
  true])` under Emily as the global defn compiler. Refs #81.

## 0.3.2 - 2026-04-25

## 0.3.1 - 2026-04-25

### Fixed

- Precompiled NIF download no longer times out on the `:peer.call/4`
  default 5s `gen_server.call` deadline. Consumers installing
  `{:emily, "~> 0.3"}` on a cold cache could see `:gen_server.call`
  timeouts while fetching the multi-MB tarball; the `.sha256` sidecar
  fit in the window but the main asset did not. The peer RPC now runs
  with `:infinity` so httpc's own request timing drives cancellation.

## 0.3.0 - 2026-04-25

### Changed

- Hex consumers now receive a precompiled NIF
  (`libemily.{so,dylib}` + `mlx.metallib`) instead of source. First
  `mix compile` downloads the matching `emily-nif-<v>-<variant>-
  <target>.tar.gz` (and its `.sha256` sidecar) from the emily GitHub
  release for the pinned version, verifies the tarball against the
  published SHA256, and extracts into `priv/`. No cmake / Xcode /
  C++ toolchain is needed on the consumer side.
- In-repo / CI builds now clone MLX's source via a Mix git dep
  (`:mlx_src`) and build libmlx from source; `release-mlx.yml` is
  retired.
- Variant selection is unified under the `:variant` app-config key
  (`:aot` | `:jit`). Contributors flip variants via
  `EMILY_MLX_VARIANT=jit` (read by `config/config.exs`); consumers
  set `config :emily, variant: :jit` in their own
  `config/config.exs`. The old `:mlx_variant` key and
  `config/local.exs` override are gone.
- macOS default cache location moves from `~/Library/Caches/emily/`
  to `DARWIN_USER_CACHE_DIR` (`/private/var/folders/<hash>/C/emily`)
  — the per-user sandboxed cache root Apple's own sandboxed apps
  use. Persistent across reboots, lives outside `~/Library/`.
  Linux / Windows still use the XDG convention. Override via
  `EMILY_CACHE`. Existing macOS users can `rm -rf
  ~/Library/Caches/emily/` to reclaim the orphaned data after
  upgrade.
- NIF object files move from the user-level cache to
  `$(MIX_APP_PATH)/obj/` (i.e. `_build/<env>/lib/emily/obj/`). As a
  consequence, plain `mix clean` now correctly removes them via the
  existing Makefile rule — they were previously left behind because
  `make clean` didn't see the cache-dir env vars.

### Added

- `.github/workflows/release-nif.yml` — on bare-semver tag push,
  builds the precompiled NIF for each `(variant × target)` cell and
  uploads tarball + `.sha256` sidecar to a draft GitHub release.
  `workflow_dispatch` is also wired for out-of-band rebuilds
  (artefacts go to workflow storage; the release is untouched).
- `mix clean.mlx` — wipes the MLX install dir(s) under the cache.
  Plain `mix clean` deliberately preserves them since rebuilding
  MLX from source is ~5-7 minutes.

### Fixed

- MLX source builds are now atomic. The build script installs into
  `${PREFIX}.staging` and only `mv`s onto the final path after the
  artefact sanity checks pass; an EXIT trap wipes the scratch dirs
  on failure. Previously, an interrupted build (Ctrl-C, killed
  process, concurrent run) left an empty install dir that
  subsequent `mix compile` runs misread as "MLX is already
  installed", silently skipping the build and bombing out in
  `elixir_make` with `make: *** No rule to make target
  '.../mlx.metallib'`. The compile-time check now requires both
  `lib/libmlx.a` and `lib/mlx.metallib` to be present before
  trusting the dir.
- Concurrent invocations of `build-mlx.sh` against the same install
  prefix are now serialised via a `mkdir`-based lock with
  stale-PID reclaim. ElixirLS uses its own build path
  (`.elixir_ls/build/...`) so an LSP-driven `mix compile` and a CLI
  `mix compile.emily_mlx --force` lock on *different*
  `Mix.Project.with_build_lock` keys and freely raced into the same
  MLX cache dir, clobbering each other's `${PREFIX}.build/`
  mid-build and surfacing as `clang ... Rename failed: ... No such
  file or directory` during Metal-shader compilation.
- CMake's FetchContent sub-build of metal_cpp / json / fmt during
  configure runs with `CMAKE_BUILD_PARALLEL_LEVEL=1`, dodging a
  race in its download → extract → rename → stamp-touch pipeline
  that surfaced as `getcwd: cannot access parent directories`
  followed by `cd: <dir>/_deps: No such file or directory`. The
  main MLX build still runs at full NCPU jobs.
- The MLX scratch build dir (`${PREFIX}.build`) is preserved on
  configure failure so `CMakeError.log` survives for diagnostics.

### Removed

- `config/local.exs` override (obsoleted by the env-var plumbing).
- `.github/workflows/release-mlx.yml` (MLX build is folded into the
  NIF workflow).
- `scripts/build-mlx-prebuilt.sh` (superseded by in-tree
  `scripts/build-mlx.sh`).
- `scripts/smoke-test-package.sh` and the tagged `smoke-test` job in
  `ci.yml` (simulated a source-compile consumer, no longer
  applicable).

See `MAINTAINING.md` for the updated release flow.

## 0.2.2 - 2026-04-23

### Fixed

- MLX prebuilt download now runs on a peer VM (`:peer.start_link/1` with
  stdio connection) so it is unaffected by Mix's code-path pruning
  during dep compilation. Previous releases crashed in the tagged
  `smoke-test` CI lane with `{:error, :nofile}` / "module :public_key
  is not available" on clean caches, because Mix removed the
  `:ssl`/`:public_key`/`:asn1`/`:inets` ebin directories from the
  parent VM's code path even though the apps were started. The peer
  node has a fresh code path, so standard `httpc` + `public_key` work
  without further shimming.

## 0.2.1 - 2026-04-22

### Fixed

- **`mix compile` crash on a cold MLX download in a clean consumer
  project.** `http_download!/2` in `mix.exs` called
  `:public_key.cacerts_get/0` right after
  `Application.ensure_all_started(:ssl)`. The app-start path pulled
  `:public_key` in transitively, but the module itself was not
  guaranteed to be loaded at call time — the tag-triggered Hex
  smoke test on CI blew up with
  `UndefinedFunctionError ... module :public_key is not available`
  on 0.2.0. `http_download!` now force-loads the module via
  `:code.ensure_loaded/1` before touching it. Any checkout with a
  populated `~/Library/Caches/emily/mlx-<v>-*` directory skipped
  this path, which is why the break only surfaced in the first
  clean CI run.

## 0.2.0 - 2026-04-22

### Added

- **MLX prebuilt-release workflow
  (`.github/workflows/release-mlx.yml`).** Manual workflow that
  builds `libmlx.a` + `mlx.metallib` + headers from a chosen
  `ml-explore/mlx` tag and uploads the tarball to a draft GitHub
  release tagged `mlx-<version>` on this repo. Used to produce the
  prebuilts that Emily's compile step downloads instead of the
  previous source-build path. To cut a new MLX prebuilt release:
  1. Run the workflow with `build_type=no-jit` on macos-14
     (produces `mlx-<v>-macos-arm64-aot.tar.gz`).
  2. Run it again with `build_type=jit` on macos-26 (produces
     `mlx-<v>-macos-arm64-jit.tar.gz`).
  3. Copy the two SHA256s from the draft release's `.sha256`
     sidecars into `@mlx_checksums` in `mix.exs`.
  4. Un-draft the release so consumers can fetch it.

  The heavy lifting sits in `scripts/build-mlx-prebuilt.sh`, which
  runs standalone for local debugging:
  `scripts/build-mlx-prebuilt.sh path/to/mlx-src 0.31.2 0`.
- **`Emily.Fast.einsum/2`** — eager-only wrapper around MLX's
  path-optimised `mx::einsum`. Accepts a standard Einstein-summation
  string and a list of `Emily.Backend`-backed tensors; MLX picks the
  contraction order internally. Operands on any other backend raise
  `ArgumentError` with a transfer-first message. It is a
  direct-call eager helper (same pattern as
  `Emily.Quantization.quantized_matmul/2`) and is intentionally **not**
  `defn`-callable — a fallback via `Nx.Defn.Expr`'s `optional/3` would
  require a full einsum-string parser and is deferred until a user
  needs cross-backend composability.

### Fixed

- **`Nx.top_k/2` on Emily tensors.** The backend's `top_k/3`
  override pattern-matched `out` as a single `%Nx.Tensor{}` and
  returned a single tensor, but the real Nx callback contract takes
  `{out_values, out_indices}` and returns a `{values, indices}`
  tuple. Any call to `Nx.top_k` raised `FunctionClauseError`.
  Dropped the override so Nx falls back to `argsort(:desc) +
  take_along_axis + slice_along_axis`, each of which routes
  through Emily's backend.

### Changed

- **MLX prebuilt download replaces the vendored source build.** The
  `vendor/mlx` submodule and the cmake-from-source path are gone.
  `mix compile` now downloads a SHA256-verified `libmlx.a` +
  `mlx.metallib` + headers tarball for the pinned `@mlx_version` from
  this repo's releases into `$EMILY_CACHE` and links the NIF against
  it directly. Consumer prerequisites drop from "Xcode + Metal
  toolchain + cmake + submodule checkout" to just macOS Apple Silicon.
  The JIT / no-JIT switch moves from the `EMILY_MLX_JIT` env var to
  `config :emily, mlx_variant: :jit | :no_jit` in `config/config.exs`
  (default `:no_jit`); the variant is read via `Config.Reader.read!` at
  project load, so a gitignored `config/local.exs` is the supported
  per-checkout override. Version bumps are a single-commit change of
  `@mlx_version` + `@mlx_checksums` in `mix.exs`, paired with a new
  `mlx-<version>` GitHub release produced by `release-mlx.yml`. First
  MLX pin under the new scheme: **0.31.2**.
- **Microscaled quantization modes on `Emily.QuantizedWeight`.** The
  container now carries a `:mode` field (default `"affine"`) and
  accepts `"mxfp4"`, `"mxfp8"`, `"nvfp4"` — MLX's full
  `QuantizationMode` enum (`vendor/mlx/mlx/primitives.h:155`).
  `from_dense/2`, `to_dense/1`, and `Emily.Quantization.quantized_matmul/2`
  all thread the mode through to MLX; mode-specific
  `{group_size, bits}` constraints are validated up front with a
  clear Emily error before the NIF call. Microscaled modes carry
  a placeholder biases tensor — MLX's `fp_quantize` returns only
  `(wq, scales)`, and the Native layer substitutes `nil` before
  the MLX call. `Emily.Quantization.dequantize_defn/1` is
  affine-only (it's a hand-rolled nibble unpacker) and now raises
  `ArgumentError` on non-affine modes, pointing users at
  `to_dense/1`. Smoke-tested end-to-end on Metal for all four modes
  (Apple Silicon, macOS 26).
- **SDPA attention sinks (`mx::fast::scaled_dot_product_attention`
  `sinks` param).** `Emily.Fast.scaled_dot_product_attention/4` and
  `scaled_dot_product_attention_with_mask/5` now accept an optional
  `:sinks` keyword opt — a per-head tensor broadcastable to
  `{1, heads, 1, 1}` whose entries participate in the softmax
  denominator as extra "null destinations" (StreamingLLM). When
  `:sinks` is absent, the helpers emit the pre-existing optional node, so
  `Emily.Bumblebee.FastKernels` and direct callers stay source- and
  bit-compatible. The defn fallback implements the same semantics
  in numerically-stable form; equivalence vs. the fused kernel was
  measured at ~2e-7 max-abs-diff on f32.
- **MLX JIT build no longer patches vendored MLX.** The
  `patches/mlx-jit-nax-gate.patch` workaround (and the
  `maybe_apply_mlx_patches` plumbing in `mix.exs`) has been removed.
  The JIT build now requires the macOS 26.2+ SDK directly, which
  ships `<MetalPerformancePrimitives/MetalPerformancePrimitives.h>`;
  the AOT (default) build is unchanged and still works on older
  macOS. Upstream discussion:
  [ml-explore/mlx#3426](https://github.com/ml-explore/mlx/pull/3426).
- **CI matrix split across macOS versions.** The `jit=0` row stays
  on `macos-14` to keep AOT coverage on older macOS; the `jit=1`
  row now runs on `macos-26` so the Metal Performance Primitives
  SDK is available natively.
- **Native axis reversal via `mx::slice` with stride -1.** The
  descending branches of `Nx.sort` and `Nx.argsort` (and
  `Nx.reverse`) previously built an `arange` index tensor and
  gathered with `take`. They now call a new `Native.flip/3` NIF
  that lowers to a single strided slice, saving the index
  allocation and gather kernel per call.
- **Parallel NIF C++ build.** `elixir_make` doesn't pass `-j` by
  default and `mix.exs` didn't set `:make_args`, so every `.cpp`
  in `c_src/` compiled serially. `mix.exs` now passes
  `-j#{System.schedulers_online()}` through, and the vestigial
  `JOBS` / `MAKE_JOBS` pair in the `Makefile` (computed but never
  referenced) has been removed. On an 8-core M-series, a clean NIF
  build drops from ~19 s to ~7 s.
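
For the parallel NIF build entry above, the relevant `mix.exs` setting might look like the excerpt below. It is a sketch of elixir_make's `:make_args` option in a hypothetical project (`ExampleNif`), not a copy of Emily's own `mix.exs`; the app name and version are placeholders.

```elixir
defmodule ExampleNif.MixProject do
  use Mix.Project

  def project do
    [
      # Placeholder app name and version for illustration only.
      app: :example_nif,
      version: "0.0.0",
      compilers: [:elixir_make] ++ Mix.compilers(),
      # Ask make to run one job per online BEAM scheduler.
      make_args: ["-j#{System.schedulers_online()}"]
    ]
  end
end
```

Tying the job count to `System.schedulers_online/0` keeps the setting portable across machines, since it follows the number of schedulers the build VM actually has available.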

## 0.1.2 - 2026-04-19

### Fixed

- **HexDocs source links.** `mix.exs`'s `source_url_pattern`
  prepended a `v` prefix to the version tag, but the project's
  release convention (via `mix publisho`) uses bare semver tags.
  The generated `[source]` links in HexDocs pointed at nonexistent
  `v<version>` tags. Dropped the prefix so links resolve to the
  actual tag.

## 0.1.1 - 2026-04-19

Initial release. See the git history for per-milestone detail.

### Added

- **Nx backend.** `Emily.Backend` implements every required
  `Nx.Backend` callback against MLX, with transparent fallback to
  `Nx.BinaryBackend` for ops without a native primitive.
- **Defn compiler.** `Emily.Compiler` runs `defn` / `Nx.Serving` /
  Bumblebee on Emily; pins the result backend and caps partition
  concurrency so `Nx.Serving` stays compatible.
- **Fused transformer kernels.** `Emily.Fast` exposes
  `mx::fast::rms_norm`, `layer_norm`, `rope`, and scaled-dot-product
  attention as defn-callable helpers with composed-defn fallbacks
  for non-Emily backends. `Emily.Bumblebee.FastKernels` rewrites a
  Bumblebee Axon graph to call the fused kernels in place; it is
  declared with optional deps on `:axon` + `:bumblebee` and elides
  cleanly if either is absent.
- **Affine group-wise quantization.** `Emily.QuantizedWeight` and
  `Emily.Quantization` wrap MLX `quantize` / `dequantize` /
  `quantized_matmul` for int2 / int4 / int8 inference.
  `Emily.Quantization.dequantize_defn/1` provides a defn-native
  dequantize for use inside Axon forward passes.
- **Mixed-precision training.** `Emily.MixedPrecision` ships the
  bf16 recipe: `cast_params` for the forward pass, f32 master
  weights, dynamic loss scaling with overflow detection.
- **Per-process Metal streams.** `Emily.Stream` lets each BEAM
  process own its own Metal command queue, enabling concurrent
  inference on a shared model.
- **Zero-copy `to_binary`.** `Nx.to_binary/1` on an Emily tensor
  returns a BEAM resource binary aliasing the MLX buffer — no memcpy.
- **Native gradient + training primitives.** `gather`, `scatter`,
  `scatter_add`, `conv`, and the window-reduction family lower
  directly to MLX so `Nx.Defn.grad` and CNN training stay native.
- **Native linalg.** `lu`, `svd`, `qr`, `cholesky`, `eigh`, `solve`,
  and `triangular_solve` dispatch to `mx::linalg::*` instead of
  round-tripping through `Nx.BinaryBackend`.
- **Telemetry.** `[:emily, :eval, *]`, `[:emily, :to_binary, *]`,
  `[:emily, :fallback, *]`, and `[:emily, :memory, :stats]` span
  events; opt-in one-shot fallback warnings via
  `config :emily, :warn_on_fallback, true`.
- **Compile-time debug flags.** `:debug_bounds_check` and
  `:debug_detect_nan_inf` re-enable runtime assertions on hot paths;
  default off with zero runtime cost.
- **Bumblebee conformance.** End-to-end suites for DistilBERT,
  Qwen3-0.6B (dense and quantized), ViT-base, and Whisper-tiny,
  pinned against HuggingFace reference values.
- **Worker-thread dispatch.** Each MLX stream is owned by a
  dedicated OS thread. NIFs enqueue work on the worker and return
  immediately; the worker posts the result back to the caller via
  `enif_send`, and the public wrapper awaits it with `receive`. No
  BEAM scheduler (regular or dirty) blocks on MLX work, and the
  per-thread Metal `CommandEncoder` state stays consistent regardless
  of how the BEAM migrates Elixir processes between schedulers.
- **Vendored MLX build.** MLX is built from source via cmake from
  `vendor/mlx` (git submodule); no prebuilt download. Build cache
  keyed on the submodule SHA under `~/Library/Caches/emily/`.
- **Documentation.** Per-module HexDocs, five runnable Livebooks
  (`notebooks/distilbert_qa.livemd`,
  `notebooks/qwen3_quantized.livemd`,
  `notebooks/mnist_training.livemd`,
  `notebooks/whisper_transcription.livemd`,
  `notebooks/fast_kernels.livemd`), and worked Bumblebee examples in
  the conformance suite.