# C / NIF safety audit
Audit branch: `audit/c-safety-review`. Scope: `c_src/erllama_nif.c`,
`c_src/erllama_safe.cpp`, `c_src/crc32c.{c,h}`. The upstream
`c_src/llama.cpp/` submodule was not examined.
The audit looked for segfaults, deadlocks, infinite loops, leaks, and
non-thread-safe operations. Findings are grouped by severity. Line
numbers are against the tree at the time of audit.
## High severity
### H1. Adapter outlives its underlying `llama_model` — use-after-free on adapter free
- Sites: `nif_free_model` (`c_src/erllama_nif.c:602-633`),
`nif_adapter_load` (`c_src/erllama_nif.c:2051-2095`),
`nif_adapter_free` (`c_src/erllama_nif.c:2097-2113`),
`adapter_dtor` (`c_src/erllama_nif.c:351-366`).
- Failure mode: `llama.h` documents that "loaded adapters that are not
manually freed will be freed when the associated model is deleted".
`nif_free_model` ignores adapters: when `active_contexts == 0` it
frees the model immediately, which implicitly frees every adapter
derived from it. The wrapping `erllama_adapter_t` keeps the *resource*
alive (via `enif_keep_resource`) but `a->adapter` is now a dangling
pointer. A subsequent `nif_adapter_free` or `adapter_dtor` calls
`llama_adapter_lora_free` on freed memory — UAF / double-free.
- Fix sketch: track `active_adapters` on `erllama_model_t` next to
`active_contexts`. `nif_free_model` should set `release_pending`
whenever either counter is non-zero. Decrement `active_adapters`
from the adapter destructor (mirror of `context_drops_model`) and
perform the deferred free there.
### H2. Adapter pointer race in `nif_set_adapters`
- Site: `c_src/erllama_nif.c:2144-2191`.
- Failure mode: for each adapter the code locks `a->mu`, copies
`a->adapter` into the array, then unlocks immediately. The pointer
array is then passed to `erllama_safe_set_adapters_lora` outside any
adapter lock. A concurrent `nif_adapter_free` on any of the listed
adapters will null the underlying llama object between the unlock
and the llama call — the call sees a dangling pointer.
- Fix sketch: either (a) hold every `a->mu` for the duration of
`set_adapters_lora` (lock them in deterministic order — e.g. by
resource pointer — to avoid AB-BA), or (b) add an `in_use` counter
to `erllama_adapter_t` that `nif_adapter_free` waits on, or (c)
serialise adapter mutation through a coarser lock (one per model
is enough since adapters are always model-scoped).
## Medium severity
### M1. Partial chat-message build leaks role/content pairs
- Site: `build_chat_msgs_from_list` (`c_src/erllama_nif.c:1377-1405`),
caller `nif_apply_chat_template` (`c_src/erllama_nif.c:1533-1541`).
- Failure mode: when the loop fails on message k > 0 (bad map,
missing role/content, or OOM allocating the content string), the
helper returns `-1` / `-2` without freeing the role+content pairs
it already wrote into `out[idx0..idx-1]`. The caller invokes
`free_chat_msgs(msgs, n_msgs)` with the pre-call `n_msgs`
(0 or 1 for the optional synthetic-system entry), so every
successfully-built message before the failure leaks. Per-message
leak is `enif_alloc(role_bin.size + 1) + enif_alloc(content_bin.size + 1)`.
- Fix sketch: have the helper free `out[idx0..idx-1]` before
returning, or accept an `int *out_idx` so the caller can free up
to `idx` on error.
## Low severity / contention notes
### L1. `nif_detokenize` holds `m->mu` across the whole token loop
- Site: `c_src/erllama_nif.c:1165-1263`.
- Not a deadlock, but the model lock is held across N
`llama_token_to_piece` calls plus per-call `enif_realloc`. For a
multi-MB detokenize the entire model surface (tokenize,
apply_chat_template, free_model contention) stalls for the
duration. Consider releasing the lock per chunk or splitting
detokenize into a per-token primitive driven from Erlang.
### L2. `nif_kv_pack` allocates a fresh empty binary on `need == 0`
- Site: `c_src/erllama_nif.c:873-880`. Minor. Returning a shared
empty term would be cheaper, but `enif_alloc_binary(0, ...)` is
legal per ERTS docs and the path is rare.
## Considered and ruled out
### Per-resource mutex pattern
`erllama_model_t`, `erllama_context_t`, `erllama_adapter_t`, and
`erllama_sampler_t` each carry a `pthread_mutex_t` that every NIF
entry takes before dereferencing the wrapped llama pointer. This
serialises concurrent dirty-NIF calls with explicit `free_*` calls
on the same resource: a free cannot interleave with a live llama
call. Different resources stay independent. Intentional and
correct.
### `context_drops_model` deferred-free
The combination of `m->active_contexts` and `m->release_pending`
under `m->mu` is the only state involved. The single decrement
that observes `release_pending && active_contexts == 0 && m->model`
performs the free atomically, all under the lock. Race-free for
contexts. (Adapters are *not* covered — see H1.)
### `enif_keep_resource` / `enif_release_resource` pairing
- Context wrapper keeps the model resource at `nif_new_context:700`
and releases it at `ctx_dtor:318` and `nif_free_context:735`. One
keep, one release per lifecycle path.
- Adapter wrapper keeps the model at `nif_adapter_load:2090`,
releases at `adapter_dtor:359`. Single pair.
- Sampler wrapper keeps the context at `nif_sampler_new:2249`,
releases at `sampler_dtor:338`. Single pair.
### `pthread_mutex_init` failure paths
All four resource constructors (`nif_load_model`, `nif_new_context`,
`nif_adapter_load`, `nif_sampler_new`) zero-init the resource before
attempting `pthread_mutex_init`. On init failure they
`enif_release_resource` (which will run the destructor with
`mu_inited == 0`) and free the underlying llama object explicitly.
The destructors all guard `pthread_mutex_destroy` on `mu_inited`.
Clean.
### `enif_alloc(sizeof(T) * n)` overflow
`n` is always a validated `int32_t` capped at `ERLLAMA_MAX_TOKENS`
(`1 << 20`) or `(1 << 24)` in detokenize, or it comes from
`enif_get_list_length` (unsigned int) and ends up multiplied by
small `sizeof`s. On 64-bit platforms `size_t` is 8 bytes; no
overflow is reachable. 32-bit BEAM would be exposed in the
`nif_set_adapters` path but that target is not supported.
### `llama_context` thread-safety
`llama.cpp` contexts are not thread-safe. Every entry that touches a
context (`erllama_safe_decode`, `erllama_safe_state_seq_*`,
`erllama_safe_sampler_sample`, `erllama_safe_get_embeddings*`,
`erllama_safe_set_adapters_lora`, `erllama_safe_set_embeddings`) is
gated by the matching `c->mu`. Same for `m->mu` on model calls.
Single-threaded per resource is the design.
### `pthread_once` use in `crc32c_init` and `erllama_safe_backend_init_once`
Both correctly serialise one-time initialisation; the static `rc`
read by `erllama_safe_backend_init_once` after `pthread_once`
returns is covered by the happens-before guarantees of pthread_once.
### Infinite loops
None found. Every loop is bounded by validated input size
(`ERLLAMA_MAX_TOKENS`, `(1 << 24)`, `ERLLAMA_MAX_TOKEN_TEXT`,
`grammar_bin.size`, `enif_get_list_length` capped), by external
list iteration with `enif_get_list_cell` returning false on
exhaustion, or by `pthread_once`.
## Suggested fix priority
1. **H1**: track adapters on the model, defer free while adapters
exist. Prevents the most likely VM crash in real use (anyone
calling `unload/1` while still holding a LoRA reference).
2. **H2**: hold adapter locks (or use an `in_use` flag) across
`set_adapters_lora`.
3. **M1**: free partial work in `build_chat_msgs_from_list` on
error to plug the per-bad-request leak.
4. **L1** is a contention/SLA concern, not a correctness bug;
address only if `detokenize` shows up in scheduler stall traces.