# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/),
and this project adheres to [Semantic Versioning](https://semver.org/).
<!-- %% CHANGELOG_ENTRIES %% -->
## 0.7.2 - 2026-06-13
### Fixed
- The README performance section now compares Emily against both
benchmark baselines — EXLA (host CPU) and EMLX (the older MLX-backed
Nx backend on the Metal GPU) — instead of EXLA alone, and its
rule-of-thumb figures (ViT-base, DistilBERT) are reconciled with the
current benchmark report.
- The benchmark report's environment block now records the Emily
version the numbers were produced on (0.7.0) and drops a misleading
run timestamp.
- The `MAINTAINING.md` release runbook is corrected: `mix publisho` is
no longer described as pushing (it only commits and tags), and the
obsolete manual draft-promotion step is dropped — `release-nif.yml`
now publishes the release automatically once the NIFs are built.
## 0.7.1 - 2026-06-13
### Fixed
- Documentation no longer fails to build over autolink references to the
hidden `Emily.Native.async_eval/2` and `Emily.Native.fast_rope_int/8`
NIF stubs in the changelog; both are excluded from ex_doc autolinking.
## 0.7.0 - 2026-06-13
### Added
- **Native Expr compiler — on by default under
`compiler: Emily.Compiler`.** Lowers a traced `Nx.Defn.Expr` to a
flat IR once and replays the whole forward graph in a **single NIF
call per invocation**, collapsing the per-op BEAM↔worker round-trips
a step-evaluated decode loop would otherwise pay. Weights cross the
NIF boundary once (captured by the compiled program) and are never
re-serialised per call. It is the default, so a bare
`compiler: Emily.Compiler` compiles native:
Nx.Defn.jit(&forward/1, compiler: Emily.Compiler).(input)
Coverage is the full Nx primitive set (with `Emily.Backend`'s
dtype-coercion and op-composition semantics ported into the
lowering), the fused `Emily.Fast.*` kernels (RMSNorm, LayerNorm,
RoPE, scaled dot-product attention and its mask / sink / mask+sink
variants), `Nx.Block.*` including the full `LinAlg` family
(`cholesky` / `solve` / `qr` / `eigh` / `lu` / `svd` /
`determinant`), `Nx.Random`, and the control flow `cond` /
`defn while` (with the host loop driven entirely from the worker
thread). Anything the IR can't lower yet routes through
`Nx.Defn.Evaluator` under the default `native_fallback: :eval` (with
a one-shot `[:emily, :compiler, :fallback]` telemetry event), so the
native lane is safe as the default on any model. The default is read
from `config :emily, :native` (defaulting to `true`), so
`config :emily, native: false` opts every defn out of the native lane
application-wide — e.g. on a memory-constrained host where the
one-shot compile peak is too large; a per-call `native:` option
always wins over the app-env default.
`native_fallback: :raise` fails instead — the conformance suites use
this to prove a model lowers fully native.
End-to-end: DistilBERT (question answering with `Nx.Serving`), ViT,
Whisper (`speech_to_text` end-to-end including the featurizer STFT,
encoder/decoder, and autoregressive decode loop), and Bumblebee
`Text.generation` (greedy *and* multinomial sampling) all compile
fully native under `native_fallback: :raise`. Bumblebee generation
on Qwen3-0.6B measures **~5× the evaluator's decode throughput**
(~61 vs ~12 tok/s on an M-series Mac), with byte-identical
completions. Native training drives Axon end-to-end — a LeNet CNN
and a dense MLP train on real MNIST entirely through the single-NIF
path (forward, categorical-cross-entropy, backward, Adam) to the
same >97% / >96% accuracy as the evaluator.
- **`Emily.Compiler` — `:fuse` opt-in.** Adds `mx::compile` fusion on
top of the replay, fusing elementwise runs (RMSNorm, softmax, SiLU
gating, residual adds) the plain replay leaves as separate kernels.
For a `defn while`, the loop body is fused under `mx::compile` and
cached per stream so it cache-hits across iterations rather than
recompiling per step. Enable on top of the native generation path:
Nx.Defn.jit(&forward/1,
compiler: Emily.Compiler, native: true, fuse: true)
On Qwen3-0.6B this lifts greedy decode to **~5.4× the evaluator
(~1.1× over the plain native lane)**, ~68 vs ~62 tok/s; in
isolation on a decode-shaped transformer block, fusion measures
~1.5–1.6× over the plain replay. Trade-off: `mx::compile`
reassociates f32 to within a few ULP, so output is **not**
bit-identical to the evaluator. Greedy argmax is robust to that
empirically (Qwen3-0.6B token ids matched the evaluator exactly in
our run), but the match is empirical, not guaranteed — a near-tie
top-2 logit can flip a token. **Sampling strategies will diverge
from the evaluator under fusion** even with a fixed seed.
- **`Emily.Generation` — a model-agnostic decode-loop driver.**
JIT-compiles a caller-supplied shape-stable per-token forward
(`fn token, offset, cache, params -> {logits, cache} end`) with the
native single-NIF compiler and drives the autoregressive loop from
Elixir — offset bookkeeping, KV-cache threading, stop conditions,
next-token selection (greedy by default), and per-token streaming
via `:on_token`. The forward runs fully native; the loop stays in
Elixir, so token streaming and host-side control are preserved.
Emily supplies only the mechanism — the model (forward + cache) is
the caller's.
- `Emily.async_eval/1` (and `Emily.Native.async_eval/2`) schedule
evaluation of one or more lazy graphs **without blocking on the
GPU**, wrapping `mlx::core::async_eval`. The work is handed to the
device's command queue and the call returns as soon as it is
enqueued — not when it finishes. Lets a caller keep dispatching the
next step's ops while the device computes the current one (e.g. an
autoregressive decode loop), blocking only when a value is actually
read back on the host via `to_binary/1` / `eval/1`. Pass every
output of a step (logits plus all KV-cache buffers) in one call.
- `Emily.Native.fast_rope_int/8` — RoPE with an **integer**
absolute-position `offset` (routing to MLX's int-offset `rope`
overload), for incremental decode where the caller tracks position
host-side. Complements the existing tensor-offset `fast_rope/8`.
Note: feed the kernel the 4-D `{batch, heads, seq, head_dim}`
layout — in 3-D, MLX 0.31 mis-rotates single-token (`seq == 1`)
inputs.
### Fixed
- **Dilated window reductions (`window_dilations > 1`) returned wrong
values.** `window_sum`/`window_max`/`window_min`/`window_product`
with a dilated kernel silently produced garbage for windows past the
first stride positions, on both the eager backend and the native
compiler (they share the window-reduce core). A dilated kernel axis
gets an `as_strided` stride > 1, so the sliding-window view aliases
fewer physical elements than its logical size; MLX's strided-reduce
fast path then read past the aliased buffer. The view is now
materialised contiguously before the reduce when any dilation > 1
(the common non-dilated pooling path is unchanged and stays
copy-free).
## 0.6.1 - 2026-05-31
### Changed
- Documentation updated for the 0.6.x release: the README installation
instructions and the example notebooks now reference
`{:emily, "~> 0.6"}`.
## 0.6.0 - 2026-05-31
This release is a security-hardening pass over the native (NIF) boundary
and the build/release pipeline: direct `Emily.Native` calls now validate
their arguments instead of trusting Elixir-side normalization,
precompiled-NIF downloads verify against a checksum pinned in the hex
package (a trust root independent of the GitHub release), and the
per-stream worker is bounded and tears down without blocking a BEAM
scheduler. It is backward compatible, but two behaviour changes matter
for high-concurrency callers: the per-worker async queue is now bounded
(`worker_queue_limit`, default 8192) and rejects when full, and a stopped
or dropped worker replies `{:error, :stopped}` to queued callers instead
of running their work.
### Added
- `Emily.Stream.close/1` stops a stream's worker thread deterministically
instead of waiting for garbage collection: queued operations are
cancelled (their callers get a `RuntimeError`), the in-flight op
finishes, and the OS thread is joined off the BEAM schedulers.
- `config :emily, worker_queue_limit: N` (default `8192`) bounds the
per-worker async queue, and `config :emily, await_timeout: ms` (default
`:infinity`) sets an optional timeout for awaiting native results.
### Security
- Worker-thread teardown no longer blocks a BEAM scheduler. The resource
destructor previously drained the worker's entire queue and joined the
OS thread inline, so collecting a busy stream during GC could stall a
scheduler. Workers are now joined off-scheduler by a dedicated reaper
(itself joined at NIF unload), and on stop the worker cancels its
queued tasks — replying `{:error, :stopped}` — instead of running them.
- The async NIF worker queue is now bounded (`worker_queue_limit`, reject
when full) so a flood of operations can't grow it without limit and pin
host/GPU memory, and a stopped or dropped worker now replies
`{:error, :stopped}` to every queued caller instead of leaving it
blocked forever. `Emily.Native.worker_queue_depth/1` exposes the depth
for observability.
- The dev/CI source-build path now refuses to trust an MLX install
directory it doesn't own and keeps the build cache `0700`, so a shared
or attacker-controlled `EMILY_CACHE` can't plant a `libmlx.a` that is
then statically linked into the NIF. Fixed system tools (`getconf`,
`id`, `sw_vers`, plus `xcrun`/`sysctl`/`ps` in `build-mlx.sh`) resolve
from absolute/system paths rather than `$PATH`, and the MLX-build lock
records the holder's process start time so a recycled PID can't be
mistaken for the original holder. Build-time only; no runtime change.
- Precompiled NIF downloads are now verified against checksums pinned
inside the hex package (`native_checksums.txt`) rather than a `.sha256`
sidecar fetched from the same GitHub release as the tarball. Because
the package contents are covered by Hex's package hash in the
consumer's `mix.lock`, the trust root no longer lives in the mutable
release. The tarball is also extracted with `:erl_tar` against a strict
entry allowlist (`libemily.{so,dylib}` + `mlx.metallib`), rejecting
symlinks, hardlinks, `..` traversal, absolute paths, and unexpected
entries — closing a path-traversal/arbitrary-write vector in the old
`tar -xzf` extraction. New `mix emily.checksums` task regenerates the
pinned file per release.
- Integer arguments crossing the NIF boundary are now range-checked
before being narrowed from Elixir's `int64` to C++ `int`. Previously an
out-of-range axis, count, or shape entry wrapped silently (e.g. an axis
of `2^32 + 3` became `3`), dispatching the wrong MLX operation; and
unbounded sample counts in `random_split`/`random_categorical` could
drive huge allocations. Out-of-range values, and negative counts, now
raise `ArgumentError`. Centralized as `checked_int` / `require_count`
helpers applied across the reduce, shape, sort, random, index, linalg,
conv, and fast NIFs.
- Native indexing and window NIFs now validate their vector arguments
against the tensor rank before indexing, and reject non-positive
strides, dilations, and window dimensions. Previously a direct
`Emily.Native` call with a malformed `slice_update` start, a short
pad/window vector, or a zero window stride could read a C++ vector out
of bounds or trigger an integer divide-by-zero (SIGFPE) — both of which
crash the whole BEAM VM rather than raising in the caller. They now
raise `ArgumentError`.
- `Emily.Native.from_binary/3` now validates tensor shapes at the NIF
boundary. Dimensions above `INT32_MAX` are rejected (previously they
silently truncated through MLX's `int32` `ShapeElem`), and the element
and byte counts are computed with overflow checking. Without this an
attacker-chosen shape whose element product wrapped (e.g.
`[2^21, 2^21, 2^22]` → `0`) could pass the binary-size check against an
undersized — even empty — binary and build an array whose shape outran
its allocation, an out-of-bounds read on the next `eval`/`to_binary`.
- `Emily.Native.conv_general/8` now rejects a non-positive `groups`
argument with `ArgumentError` instead of crashing the BEAM VM. MLX's
convolution checks compute `in_channels % groups`, so `groups <= 0`
(or a large value that narrows to zero through the `int64 → int`
conversion) was an integer modulo-by-zero — a SIGFPE that bypassed the
NIF's exception path and terminated the entire node. The guard
validates the un-narrowed value at the NIF boundary.
## 0.5.1 - 2026-05-23
### Fixed
- `CHANGELOG.md` — corrected the 0.5.0 entry. The published release
carried two `### Changed` headings and listed three new-functionality
items (`mix emily.doctor`, `config :emily, fallback:`, and the
`Emily.Memory` public allocator API) under Changed rather than
Added. Merged the duplicate Changed sections, moved the
new-functionality items to Added, and put items into reverse
chronological order. No code change.
## 0.5.0 - 2026-05-23
### Added
- `Emily.Quantization.dequantize_defn/1` now supports the `nvfp4`
microscaled mode in addition to `affine`, `mxfp4`, and `mxfp8` —
the full MLX `QuantizationMode` enum now runs through the
defn-native dequant path. `nvfp4` reuses the FP4-E2M1 lane LUT
from `mxfp4` and the FP8-E4M3 LUT from `mxfp8` (consumed against
the per-group scale bytes rather than lane codes — the NVIDIA
microscaled convention uses finer-grained group_size=16 with
FP8-E4M3 scales instead of mxfp4/mxfp8's group_size=32 with
FP8-E8M0 scales). Output dtype is bf16 to match
`QuantizedWeight.to_dense/1`, round-trip is bit-identical (max
abs diff = 0.0). `Emily.Quantization.Transform` accepts
`mode: "nvfp4"`.
- `Emily.Quantization.dequantize_defn/1` now supports the `mxfp8`
microscaled mode in addition to `affine` and `mxfp4`. Each 8-bit
lane code decodes through a 256-entry FP8-E4M3 lookup table
precomputed via MLX's `FromFP8` bit-trick (strip sign, shift the
low 7 bits left by 7 to align the E4M3 exponent into f16's
exponent field, multiply by 256 for the bias difference, restore
sign). Per-group scales reuse the FP8-E8M0 decode from the mxfp4
path. Output dtype is bf16 to match `QuantizedWeight.to_dense/1`,
and the round-trip is bit-identical (max abs diff = 0.0) on
realistic data. `Emily.Quantization.Transform` accepts
`mode: "mxfp8"`; only `nvfp4` (which uses an FP8-E4M3 per-group
scale instead of FP8-E8M0) remains defn-unsupported.
- `Emily.Quantization.dequantize_defn/1` now supports the `mxfp4`
microscaled mode in addition to `affine`. Each 4-bit lane code
decodes through MLX's FP4-E2M1 lookup table (`+0.0, +0.5, +1.0,
+1.5, +2.0, +3.0, +4.0, +6.0` and their negatives); each u8 scale
byte decodes through `2^(s - 127)` (FP8-E8M0). Output dtype is
bf16 to match `QuantizedWeight.to_dense/1`, and the round-trip is
bit-identical (max abs diff = 0.0) on realistic scale bytes
because every FP4 LUT entry and every E8M0 power-of-two is exact
in bf16. `Emily.Quantization.Transform` gains a `:mode` option
(default `"affine"`, accepts `"mxfp4"`); `mxfp8` and `nvfp4` are
still defn-unsupported and route through the Native NIF.
- `Emily.Quantization.dequantize_defn/1` now supports int3 and int6
weights in addition to int2/int4/int8. The new path reads each
lane's two adjacent u32 words as a u64, shifts by the in-word bit
offset, and masks — handling the cross-u32 packing MLX uses for
bit widths that don't divide 32 cleanly. `defn_supported_bits/0`
now returns `[2, 3, 4, 6, 8]`; quantized Axon graphs rewritten
via `Emily.Quantization.Transform` (and `Emily.Quantization.Layers.quantized_dense/4`)
pick the expanded set up automatically. Previously the defn path
rejected `bits ∈ {3, 6}` and callers had to fall back to
`QuantizedWeight.to_dense/1` (the Native NIF).
- `ARCHITECTURE.md` — current shape of the library extracted from
`PLAN.md`. Covers the four-layer dispatch model, the worker-thread
+ per-process-stream concurrency model, the public `Emily.Memory`
allocator API, the telemetry event catalogue, the
`:debug_bounds_check` / `:debug_detect_nan_inf` compile-time flags,
build/packaging notes, the per-layer testing oracle table, and the
active risk register. Linked from the README under a new
Documentation section and grouped under "Project" in the HexDocs
sidebar.
- `ROADMAP.md` — active and future work, separated from the
historical milestone log. Lists deferred-to-post-1.0 items
(typed exceptions, GPU interop pointers, source-build doctor
probes) and the open in-roadmap MLX capability gaps (sparse / MoE
matmuls, FP8 dtype, `ThreadLocalStream`).
- `mix emily.doctor` — diagnostic Mix task that verifies the local
Emily runtime installation. Checks the host platform (OS, arch,
macOS version against the active variant's minimum), the active
MLX variant, `priv/libemily.so` and `priv/mlx.metallib`, NIF
loadability, and a tiny `Emily.Backend` smoke test that asserts
the result didn't silently fall back to `Nx.BinaryBackend`. Checks
short-circuit: when a prerequisite fails, dependent checks report
`[skip]` rather than producing cascading noise. Supports
`--variant aot|jit` for "would this host satisfy :jit?" probes and
`--help` for usage.
- `config :emily, fallback: :silent | :warn | :raise` — strict
fallback modes for development and CI. `:silent` (the default)
preserves today's behaviour; `:warn` emits the one-shot
`Logger.warning` per `{op, input_shapes}` pair previously gated by
`:warn_on_fallback`; `:raise` raises `RuntimeError` with op,
shapes, and dtypes on entry, letting CI fail the build when a hot
path unexpectedly routes through `Nx.BinaryBackend`. An invalid
`:fallback` value raises `ArgumentError` on the first fallback so
typos surface immediately.
- `Emily.Memory` — public allocator API for long-running serving and
training workloads that need to observe and manage MLX memory
without reaching into `Emily.Native`. Exposes `stats/0` (active,
peak, and cached bytes, also emitting `[:emily, :memory, :stats]`),
`reset_peak/0`, and `clear_cache/0`. Documented under the README's
Observability section and grouped with `Emily.Telemetry` in the
ExDoc sidebar.
### Changed
- `PLAN.md` slimmed to its milestone-history role. The current-shape
sections (architecture diagram, core design decisions, testing
philosophy, risks-and-mitigations) moved to `ARCHITECTURE.md`;
goals, non-goals, and deferred-milestone summaries moved to
`ROADMAP.md`. The M0–M27 milestone narratives, the ratified
project decisions, and the 2026-04-22 MLX capability audit stay in
`PLAN.md` as the historical record. The stale "narrow
`with_stream/2` + `new/1` + `synchronize/1` surface" reference (no
`synchronize/1` ever shipped) and the planned `set_default_stream/1`
primary deliverable (removed during the post-M14 fixes) drop out
with the prologue rewrite.
- `Emily.Native` now annotates NIF errors with operation, input
shape/dtype, options, and worker context. `ArgumentError` and
`RuntimeError` raised from async ops get an `Emily.Native context:
op=… inputs=[…] options=[…] stream=…` suffix, so common failures
(shape mismatches in `matmul`, divisibility errors in `quantize`,
mask shape bugs in `fast_scaled_dot_product_attention`, etc.) are
diagnosable from the message alone. The error-formatting path is
total — bad context maps degrade to `?` markers rather than masking
the underlying NIF error.
- The legacy `config :emily, :warn_on_fallback, true` boolean is
soft-deprecated in favour of `:fallback`. It is still honoured
when `:fallback` is unset (`true` → `:warn`); when both are set,
`:fallback` wins.
- `Emily.Telemetry.memory_stats/0` now delegates to
`Emily.Memory.stats/0`. Behaviour is unchanged — same event,
measurements, and return shape — but new code should prefer the
`Emily.Memory` entry point.
## 0.4.0 - 2026-05-17
### Changed
- Upgraded to Nx 0.12 / Bumblebee 0.7 / Axon 0.8. Nx 0.12 replaces
the optional-callback list (`lu`, `svd`, `qr`, `cholesky`, `eigh`,
`solve`, `take`, `take_along_axis`, `fft2`, `ifft2`,
`cumulative_*`, `logical_not`, `all_close`) with a single
generic `Nx.Backend.block/4` dispatch keyed on `Nx.Block.*`
structs. `Emily.Backend` now routes every previously-native op
through `block/4`, preserving the MLX fast paths without losing
the BinaryBackend fallback when an unknown block arrives. Existing
`Emily.Backend` consumers see no behavioural change.
- Migrated `Emily.Fast.*` from the now-removed
`Nx.Defn.Expr.optional/3` extension point to `Nx.block/4`. Each
fused kernel (`rms_norm`, `layer_norm`, `rope`, `rope_with_freqs`,
`scaled_dot_product_attention` with and without mask/sinks) now
emits an `Emily.Fast.Block.*` struct that `Emily.Backend.block/4`
pattern-matches to the matching `mx::fast::*` NIF. The
composed-defn fallbacks under non-Emily backends are unchanged.
- Bumblebee 0.7 ships Qwen3 first-class, so
`notebooks/qwen3_quantized.livemd` no longer needs the `main`-ref
Bumblebee pin from the 0.6.3 era.
### Added
- `Nx.rfft/2` and `Nx.irfft/2` support. The underlying
`Native.rfftn` / `Native.irfftn` NIFs were already in place from
earlier MLX work; Nx 0.12 surfaces these as backend-block ops so
Emily wires them up at no MLX-side cost.
- Smoke tests for three new Bumblebee 0.7 model families on
`Emily.Backend`: NomicBERT (`:nomic_embeddings`), SmolLM3
(`:smollm3`), and ModernBERT (`:modernbert`). All three drive a
tiny synthetic spec end-to-end through `Axon.predict` so they
remain offline-friendly; tagged `:conformance`.
- Runnable Livebooks for each of the three new Bumblebee 0.7
families: `notebooks/nomic_embeddings.livemd` (NomicBERT
embeddings with cosine similarity), `notebooks/smollm3_chat.livemd`
(SmolLM3-3B chat completion with a `<think>` toggle for hybrid
reasoning), and `notebooks/modernbert_classification.livemd`
(ModernBERT NLI fine-tune). All three are published under the
HexDocs Notebooks group.
- A `[:emily, :block, :fallback]` telemetry event fires whenever
`Emily.Backend.block/4` falls through to the supplied default
`fun`. Surfaces ops we used to handle natively but now land on
the composed-defn path — useful in soak runs to spot silent
regressions after a Bumblebee bump.
### Fixed
- `mix docs` no longer emits autolinker warnings for the
`Emily.Backend.block/4` and `Nx.Defn.Expr.optional/3` references
in the `Emily.Fast` and `Emily.Fast.Block` moduledocs. The
references resolved to `@doc false` callees (the backend callback
is hidden by `Nx.Backend`, and `optional/3` was removed in Nx 0.12);
the prose stays, the `Mod.fun/arity` shape is broken up so the
autolinker no longer follows it. Same pattern as the earlier
fix in `ee32c7c`.
### Removed
- `{:f8_e4m3fn, 8}` (introduced in Nx 0.11) is rejected at the
backend boundary with the same "no MLX primitive" `ArgumentError`
pattern as `{:f, 64}`. MLX has no float-8 dtype; cast to `:f16` or
`:bf16`.
## 0.3.5 - 2026-05-03
## 0.3.4 - 2026-05-03
### Fixed
- `Nx.LinAlg.svd(tensor, full_matrices?: false)` on rank-2 inputs no
longer routes through MLX's full-matrices SVD and post-slices —
MLX's SVD has no thin switch, so the old path materialised the full
m × m U on device and instantly OOM'd Metal for tall matrices like
the Qwen3-0.6B embedder kernel (151936 × 1024 → ~92 GB U). The thin
case now computes `G = MᵀM → eigh → S, V; U = MV / S` (or the
symmetric `MMᵀ` route for wide matrices), keeping the decomposition
at min(m, n)². See the `Emily.Backend` moduledoc Divergences section
for the numerical caveat (the Gram step squares M's condition
number). Refs #84.
- `mix docs` runs cleanly. The MNIST notebook referenced
`Axon.Loop`'s `trainer/2` (no such arity); three other inline
references resolved to `@doc false` callees in upstream libraries
(`Nx.Defn.Expr`'s `optional/3`, Bumblebee's `rms_norm/2`)
and triggered autolinker warnings on every doc build. The notebook
now uses the correct `trainer/3` arity, and the prose references
have been reshaped so the autolinker no longer follows them,
keeping the build warning-free for future `--warnings-as-errors`
enforcement. Refs #83.
## 0.3.3 - 2026-05-03
### Fixed
- `Emily.Compiler` now silently drops options it doesn't recognise
instead of raising `ArgumentError`. This matches the behaviour of
`Nx.Defn.Evaluator` and EXLA, and restores compatibility with
higher-level libraries that forward caller-supplied options through
the JIT compiler — notably `Axon.build/2`, whose contract states
that "all other options are forwarded to the underlying JIT
compiler". Hit when running a Bumblebee-built Axon model with
`Axon.predict(..., global_layer_options: [output_hidden_states:
true])` under Emily as the global defn compiler. Refs #81.
## 0.3.2 - 2026-04-25
## 0.3.1 - 2026-04-25
### Fixed
- Precompiled NIF download no longer times out on the `:peer.call/4`
default 5s `gen_server.call` deadline. Consumers installing
`{:emily, "~> 0.3"}` on a cold cache could see `:gen_server.call`
timeouts while fetching the multi-MB tarball; the `.sha256` sidecar
fit in the window but the main asset did not. The peer RPC now runs
with `:infinity` so httpc's own request timing drives cancellation.
## 0.3.0 - 2026-04-25
### Changed
- Hex consumers now receive a precompiled NIF
(`libemily.{so,dylib}` + `mlx.metallib`) instead of source. First
`mix compile` downloads the matching `emily-nif-<v>-<variant>-
<target>.tar.gz` (and its `.sha256` sidecar) from the emily GitHub
release for the pinned version, verifies the tarball against the
published SHA256, and extracts into `priv/`. No cmake / Xcode /
C++ toolchain is needed on the consumer side.
- In-repo / CI builds now clone MLX's source via a Mix git dep
(`:mlx_src`) and build libmlx from source; `release-mlx.yml` is
retired.
- Variant selection is unified under the `:variant` app-config key
(`:aot` | `:jit`). Contributors flip variants via
`EMILY_MLX_VARIANT=jit` (read by `config/config.exs`); consumers
set `config :emily, variant: :jit` in their own
`config/config.exs`. The old `:mlx_variant` key and
`config/local.exs` override are gone.
- macOS default cache location moves from `~/Library/Caches/emily/`
to `DARWIN_USER_CACHE_DIR` (`/private/var/folders/<hash>/C/emily`)
— the per-user sandboxed cache root Apple's own sandboxed apps
use. Persistent across reboots, lives outside `~/Library/`.
Linux / Windows still use the XDG convention. Override via
`EMILY_CACHE`. Existing macOS users can `rm -rf
~/Library/Caches/emily/` to reclaim the orphaned data after
upgrade.
- NIF object files move from the user-level cache to
`$(MIX_APP_PATH)/obj/` (i.e. `_build/<env>/lib/emily/obj/`). As a
consequence, plain `mix clean` now correctly removes them via the
existing Makefile rule — they were previously left behind because
`make clean` didn't see the cache-dir env vars.
### Added
- `.github/workflows/release-nif.yml` — on bare-semver tag push,
builds the precompiled NIF for each `(variant × target)` cell and
uploads tarball + `.sha256` sidecar to a draft GitHub release.
`workflow_dispatch` is also wired for out-of-band rebuilds
(artefacts go to workflow storage; the release is untouched).
- `mix clean.mlx` — wipes the MLX install dir(s) under the cache.
Plain `mix clean` deliberately preserves them since rebuilding
MLX from source is ~5-7 minutes.
### Fixed
- MLX source builds are now atomic. The build script installs into
`${PREFIX}.staging` and only `mv`s onto the final path after the
artefact sanity checks pass; an EXIT trap wipes the scratch dirs
on failure. Previously, an interrupted build (Ctrl-C, killed
process, concurrent run) left an empty install dir that
subsequent `mix compile` runs misread as "MLX is already
installed", silently skipping the build and bombing out in
`elixir_make` with `make: *** No rule to make target
'.../mlx.metallib'`. The compile-time check now requires both
`lib/libmlx.a` and `lib/mlx.metallib` to be present before
trusting the dir.
- Concurrent invocations of `build-mlx.sh` against the same install
prefix are now serialised via a `mkdir`-based lock with
stale-PID reclaim. ElixirLS uses its own build path
(`.elixir_ls/build/...`) so an LSP-driven `mix compile` and a CLI
`mix compile.emily_mlx --force` lock on *different*
`Mix.Project.with_build_lock` keys and freely raced into the same
MLX cache dir, clobbering each other's `${PREFIX}.build/`
mid-build and surfacing as `clang ... Rename failed: ... No such
file or directory` during Metal-shader compilation.
- CMake's FetchContent sub-build of metal_cpp / json / fmt during
configure runs with `CMAKE_BUILD_PARALLEL_LEVEL=1`, dodging a
race in its download → extract → rename → stamp-touch pipeline
that surfaced as `getcwd: cannot access parent directories`
followed by `cd: <dir>/_deps: No such file or directory`. The
main MLX build still runs at full NCPU jobs.
- The MLX scratch build dir (`${PREFIX}.build`) is preserved on
configure failure so `CMakeError.log` survives for diagnostics.
### Removed
- `config/local.exs` override (obsoleted by the env-var plumbing).
- `.github/workflows/release-mlx.yml` (MLX build is folded into the
NIF workflow).
- `scripts/build-mlx-prebuilt.sh` (superseded by in-tree
`scripts/build-mlx.sh`).
- `scripts/smoke-test-package.sh` and the tagged `smoke-test` job in
`ci.yml` (simulated a source-compile consumer, no longer
applicable).
See `MAINTAINING.md` for the updated release flow.
## 0.2.2 - 2026-04-23
### Fixed
- MLX prebuilt download now runs on a peer VM (`:peer.start_link/1` with
stdio connection) so it is unaffected by Mix's code-path pruning
during dep compilation. Previous releases crashed in the tagged
`smoke-test` CI lane with `{:error, :nofile}` / "module :public_key
is not available" on clean caches, because Mix removed the
`:ssl`/`:public_key`/`:asn1`/`:inets` ebin directories from the
parent VM's code path even though the apps were started. The peer
node has a fresh code path, so standard `httpc` + `public_key` work
without further shimming.
## 0.2.1 - 2026-04-22
### Fixed
- **`mix compile` crash on a cold MLX download in a clean consumer
project.** `http_download!/2` in `mix.exs` called
`:public_key.cacerts_get/0` right after
`Application.ensure_all_started(:ssl)`. The app-start path pulled
`:public_key` in transitively, but the module itself was not
guaranteed to be loaded at call time — the tag-triggered Hex
smoke test on CI blew up with
`UndefinedFunctionError ... module :public_key is not available`
on 0.2.0. `http_download!` now force-loads the module via
`:code.ensure_loaded/1` before touching it. Any checkout with a
populated `~/Library/Caches/emily/mlx-<v>-*` directory skipped
this path, which is why the break only surfaced in the first
clean CI run.
## 0.2.0 - 2026-04-22
### Added
- **MLX prebuilt-release workflow
(`.github/workflows/release-mlx.yml`).** Manual workflow that
builds `libmlx.a` + `mlx.metallib` + headers from a chosen
`ml-explore/mlx` tag and uploads the tarball to a draft GitHub
release tagged `mlx-<version>` on this repo. Used to produce the
prebuilts that Emily's compile step downloads instead of the
previous source-build path. To cut a new MLX prebuilt release:
1. Run the workflow with `build_type=no-jit` on macos-14
(produces `mlx-<v>-macos-arm64-aot.tar.gz`).
2. Run it again with `build_type=jit` on macos-26 (produces
`mlx-<v>-macos-arm64-jit.tar.gz`).
3. Copy the two SHA256s from the draft release's `.sha256`
sidecars into `@mlx_checksums` in `mix.exs`.
4. Un-draft the release so consumers can fetch.
The heavy lifting sits in `scripts/build-mlx-prebuilt.sh`, which
runs standalone for local debugging:
`scripts/build-mlx-prebuilt.sh path/to/mlx-src 0.31.2 0`.
- **`Emily.Fast.einsum/2`** — eager-only wrapper around MLX's
path-optimised `mx::einsum`. Accepts a standard Einstein-summation
string and a list of `Emily.Backend`-backed tensors; MLX picks the
contraction order internally. Operands on any other backend raise
`ArgumentError` with a transfer-first message. The helper is a
direct-call eager helper (same pattern as
`Emily.Quantization.quantized_matmul/2`) and is intentionally **not**
`defn`-callable — a fallback via `Nx.Defn.Expr`'s `optional/3` would
require a full einsum-string parser and is deferred until a user
needs cross-backend composability.
### Fixed
- **`Nx.top_k/2` on Emily tensors.** The backend's `top_k/3`
override pattern-matched `out` as a single `%Nx.Tensor{}` and
returned a single tensor, but the real Nx callback contract takes
`{out_values, out_indices}` and returns a `{values, indices}`
tuple. Any call to `Nx.top_k` raised `FunctionClauseError`.
Dropped the override so Nx falls back to `argsort(:desc) +
take_along_axis + slice_along_axis`, each of which routes
through Emily's backend.
### Changed
- **MLX prebuilt download replaces the vendored source build.** The
`vendor/mlx` submodule and the cmake-from-source path are gone.
`mix compile` now downloads a SHA256-verified `libmlx.a` +
`mlx.metallib` + headers tarball for the pinned `@mlx_version` from
this repo's releases into `$EMILY_CACHE` and links the NIF against
it directly. Consumer prerequisites drop from "Xcode + Metal
toolchain + cmake + submodule checkout" to just macOS Apple Silicon.
The JIT / no-JIT switch moves from the `EMILY_MLX_JIT` env var to
`config :emily, mlx_variant: :jit | :no_jit` in `config/config.exs`
(default `:no_jit`); variant is read via `Config.Reader.read!` at
project load, so a gitignored `config/local.exs` is the supported
per-checkout override. Version bumps are a single-commit change of
`@mlx_version` + `@mlx_checksums` in `mix.exs`, paired with a new
`mlx-<version>` GitHub release produced by `release-mlx.yml`. First
MLX pin under the new scheme: **0.31.2**.
- **Microscaled quantization modes on `Emily.QuantizedWeight`.** The
container now carries a `:mode` field (default `"affine"`) and
accepts `"mxfp4"`, `"mxfp8"`, `"nvfp4"` — MLX's full
`QuantizationMode` enum (`vendor/mlx/mlx/primitives.h:155`).
`from_dense/2`, `to_dense/1`, and `Emily.Quantization.quantized_matmul/2`
all thread the mode through to MLX; mode-specific
`{group_size, bits}` constraints are validated up front with a
clear Emily error before the NIF call. Microscaled modes carry
a placeholder biases tensor — MLX's `fp_quantize` returns only
`(wq, scales)`, and the Native layer substitutes `nil` before
the MLX call. `Emily.Quantization.dequantize_defn/1` is
affine-only (it's a hand-rolled nibble unpacker) and now raises
`ArgumentError` on non-affine modes, pointing users at
`to_dense/1`. Smoke-tested end-to-end on Metal for all four modes
(Apple Silicon, macOS 26).
- **SDPA attention sinks (`mx::fast::scaled_dot_product_attention`
`sinks` param).** `Emily.Fast.scaled_dot_product_attention/4` and
`scaled_dot_product_attention_with_mask/5` now accept an optional
`:sinks` keyword opt — a per-head tensor broadcastable to
`{1, heads, 1, 1}` whose entries participate in the softmax
denominator as extra "null destinations" (StreamingLLM). When
absent the helpers emit the pre-existing optional-node, so
`Emily.Bumblebee.FastKernels` and direct callers stay source- and
bit-compatible. The defn fallback implements the same semantics
in numerically-stable form; equivalence vs. the fused kernel was
measured at ~2e-7 max-abs-diff on f32.
- **MLX JIT build no longer patches vendored MLX.** The
`patches/mlx-jit-nax-gate.patch` workaround (and the
`maybe_apply_mlx_patches` plumbing in `mix.exs`) has been removed.
The JIT build now requires the macOS 26.2+ SDK directly, which
ships `<MetalPerformancePrimitives/MetalPerformancePrimitives.h>`;
the AOT (default) build is unchanged and still works on older
macOS. Upstream discussion:
[ml-explore/mlx#3426](https://github.com/ml-explore/mlx/pull/3426).
- **CI matrix split across macOS versions.** The `jit=0` row stays
on `macos-14` to keep AOT coverage on older macOS; the `jit=1`
row now runs on `macos-26` so the Metal Performance Primitives
SDK is available natively.
- **Native axis reversal via `mx::slice` with stride -1.** The
descending branches of `Nx.sort` and `Nx.argsort` (and
`Nx.reverse`) previously built an `arange` index tensor and
gathered with `take`. They now call a new `Native.flip/3` NIF
that lowers to a single strided slice, saving the index
allocation and gather kernel per call.
- **Parallel NIF C++ build.** `elixir_make` doesn't pass `-j` by
default and `mix.exs` didn't set `:make_args`, so every `.cpp`
in `c_src/` compiled serially. `mix.exs` now passes
`-j#{System.schedulers_online()}` through, and the vestigial
`JOBS` / `MAKE_JOBS` pair in the `Makefile` (computed but never
referenced) has been removed. On an 8-core M-series, a clean NIF
build drops from ~19 s to ~7 s.
## 0.1.2 - 2026-04-19
### Fixed
- **HexDocs source links.** `mix.exs`'s `source_url_pattern`
prepended a `v` prefix to the version tag, but the project's
release convention (via `mix publisho`) uses bare semver tags.
The generated `[source]` links in HexDocs pointed at nonexistent
`v<version>` tags. Dropped the prefix so links resolve to the
actual tag.
## 0.1.1 - 2026-04-19
Initial release. See the git history for per-milestone detail.
### Added
- **Nx backend.** `Emily.Backend` implements every required
`Nx.Backend` callback against MLX, with transparent fallback to
`Nx.BinaryBackend` for ops without a native primitive.
- **Defn compiler.** `Emily.Compiler` runs `defn` / `Nx.Serving` /
Bumblebee on Emily; pins the result backend and caps partition
concurrency so `Nx.Serving` stays compatible.
- **Fused transformer kernels.** `Emily.Fast` exposes
`mx::fast::rms_norm`, `layer_norm`, `rope`, and scaled-dot-product
attention as defn-callable helpers with composed-defn fallbacks
for non-Emily backends. `Emily.Bumblebee.FastKernels` rewrites a
Bumblebee Axon graph to call the fused kernels in place; declared
as an optional dep on `:axon` + `:bumblebee`, elides cleanly if
either is absent.
- **Affine group-wise quantization.** `Emily.QuantizedWeight` and
`Emily.Quantization` wrap MLX `quantize` / `dequantize` /
`quantized_matmul` for int2 / int4 / int8 inference.
`Emily.Quantization.dequantize_defn/1` provides a defn-native
dequantize for use inside Axon forward passes.
- **Mixed-precision training.** `Emily.MixedPrecision` ships the
bf16 recipe: `cast_params` for the forward pass, f32 master
weights, dynamic loss scaling with overflow detection.
- **Per-process Metal streams.** `Emily.Stream` lets each BEAM
process own its own Metal command queue, enabling concurrent
inference on a shared model.
- **Zero-copy `to_binary`.** `Nx.to_binary/1` on an Emily tensor
returns a BEAM resource binary aliasing the MLX buffer — no memcpy.
- **Native gradient + training primitives.** `gather`, `scatter`,
`scatter_add`, `conv`, and the window-reduction family lower
directly to MLX so `Nx.Defn.grad` and CNN training stay native.
- **Native linalg.** `lu`, `svd`, `qr`, `cholesky`, `eigh`, `solve`,
and `triangular_solve` dispatch to `mx::linalg::*` instead of
rounding through `Nx.BinaryBackend`.
- **Telemetry.** `[:emily, :eval, *]`, `[:emily, :to_binary, *]`,
`[:emily, :fallback, *]`, and `[:emily, :memory, :stats]` span
events; opt-in one-shot fallback warnings via
`config :emily, :warn_on_fallback, true`.
- **Compile-time debug flags.** `:debug_bounds_check` and
`:debug_detect_nan_inf` re-enable runtime assertions on hot paths;
default off with zero runtime cost.
- **Bumblebee conformance.** End-to-end suites for DistilBERT,
Qwen3-0.6B (dense and quantized), ViT-base, and Whisper-tiny,
pinned against HuggingFace reference values.
- **Worker-thread dispatch.** Each MLX stream is owned by a
dedicated OS thread. NIFs enqueue work on the worker and return
immediately; the worker posts the result back to the caller via
`enif_send`, and the public wrapper awaits it with `receive`. No
BEAM scheduler (regular or dirty) blocks on MLX work, and the
per-thread Metal `CommandEncoder` state stays consistent regardless
of how the BEAM migrates Elixir processes between schedulers.
- **Vendored MLX build.** MLX is built from source via cmake from
`vendor/mlx` (git submodule); no prebuilt download. Build cache
keyed on the submodule SHA under `~/Library/Caches/emily/`.
- **Documentation.** Per-module HexDocs, five runnable Livebooks
(`notebooks/distilbert_qa.livemd`,
`notebooks/qwen3_quantized.livemd`,
`notebooks/mnist_training.livemd`,
`notebooks/whisper_transcription.livemd`,
`notebooks/fast_kernels.livemd`), and worked Bumblebee examples in
the conformance suite.