internals/c-safety-audit.md

Select File
internals/c-safety-audit.md

# C / NIF safety audit

Audit branch: `audit/c-safety-review`. Scope: `c_src/erllama_nif.c`,
`c_src/erllama_safe.cpp`, `c_src/crc32c.{c,h}`. The upstream
`c_src/llama.cpp/` submodule was not examined.

The audit looked for segfaults, deadlocks, infinite loops, leaks, and
non-thread-safe operations. Findings are grouped by severity. Line
numbers are against the tree at the time of audit.

## High severity

### H1. Adapter outlives its underlying `llama_model` — use-after-free on adapter free

- Sites: `nif_free_model` (`c_src/erllama_nif.c:602-633`),
  `nif_adapter_load` (`c_src/erllama_nif.c:2051-2095`),
  `nif_adapter_free` (`c_src/erllama_nif.c:2097-2113`),
  `adapter_dtor` (`c_src/erllama_nif.c:351-366`).
- Failure mode: `llama.h` documents that "loaded adapters that are not
  manually freed will be freed when the associated model is deleted".
  `nif_free_model` ignores adapters: when `active_contexts == 0` it
  frees the model immediately, which implicitly frees every adapter
  derived from it. The wrapping `erllama_adapter_t` keeps the *resource*
  alive (via `enif_keep_resource`) but `a->adapter` is now a dangling
  pointer. A subsequent `nif_adapter_free` or `adapter_dtor` calls
  `llama_adapter_lora_free` on freed memory — UAF / double-free.
- Fix sketch: track `active_adapters` on `erllama_model_t` next to
  `active_contexts`. `nif_free_model` should set `release_pending`
  whenever either counter is non-zero. Decrement `active_adapters`
  from the adapter destructor (mirror of `context_drops_model`) and
  perform the deferred free there.

### H2. Adapter pointer race in `nif_set_adapters`

- Site: `c_src/erllama_nif.c:2144-2191`.
- Failure mode: for each adapter the code locks `a->mu`, copies
  `a->adapter` into the array, then unlocks immediately. The pointer
  array is then passed to `erllama_safe_set_adapters_lora` outside any
  adapter lock. A concurrent `nif_adapter_free` on any of the listed
  adapters will null the underlying llama object between the unlock
  and the llama call — the call sees a dangling pointer.
- Fix sketch: either (a) hold every `a->mu` for the duration of
  `set_adapters_lora` (lock them in deterministic order — e.g. by
  resource pointer — to avoid AB-BA), or (b) add an `in_use` counter
  to `erllama_adapter_t` that `nif_adapter_free` waits on, or (c)
  serialise adapter mutation through a coarser lock (one per model
  is enough since adapters are always model-scoped).

## Medium severity

### M1. Partial chat-message build leaks role/content pairs

- Site: `build_chat_msgs_from_list` (`c_src/erllama_nif.c:1377-1405`),
  caller `nif_apply_chat_template` (`c_src/erllama_nif.c:1533-1541`).
- Failure mode: when the loop fails on message k > 0 (bad map,
  missing role/content, or OOM allocating the content string), the
  helper returns `-1` / `-2` without freeing the role+content pairs
  it already wrote into `out[idx0..idx-1]`. The caller invokes
  `free_chat_msgs(msgs, n_msgs)` with the pre-call `n_msgs`
  (0 or 1 for the optional synthetic-system entry), so every
  successfully-built message before the failure leaks. Per-message
  leak is `enif_alloc(role_bin.size + 1) + enif_alloc(content_bin.size + 1)`.
- Fix sketch: have the helper free `out[idx0..idx-1]` before
  returning, or accept an `int *out_idx` so the caller can free up
  to `idx` on error.

## Low severity / contention notes

### L1. `nif_detokenize` holds `m->mu` across the whole token loop

- Site: `c_src/erllama_nif.c:1165-1263`.
- Not a deadlock, but the model lock is held across N
  `llama_token_to_piece` calls plus per-call `enif_realloc`. For a
  multi-MB detokenize the entire model surface (tokenize,
  apply_chat_template, free_model contention) stalls for the
  duration. Consider releasing the lock per chunk or splitting
  detokenize into a per-token primitive driven from Erlang.

### L2. `nif_kv_pack` allocates a fresh empty binary on `need == 0`

- Site: `c_src/erllama_nif.c:873-880`. Minor. Returning a shared
  empty term would be cheaper, but `enif_alloc_binary(0, ...)` is
  legal per ERTS docs and the path is rare.

## Considered and ruled out

### Per-resource mutex pattern

`erllama_model_t`, `erllama_context_t`, `erllama_adapter_t`, and
`erllama_sampler_t` each carry a `pthread_mutex_t` that every NIF
entry takes before dereferencing the wrapped llama pointer. This
serialises concurrent dirty-NIF calls with explicit `free_*` calls
on the same resource: a free cannot interleave with a live llama
call. Different resources stay independent. Intentional and
correct.

### `context_drops_model` deferred-free

The combination of `m->active_contexts` and `m->release_pending`
under `m->mu` is the only state involved. The single decrement
that observes `release_pending && active_contexts == 0 && m->model`
performs the free atomically, all under the lock. Race-free for
contexts. (Adapters are *not* covered — see H1.)

### `enif_keep_resource` / `enif_release_resource` pairing

- Context wrapper keeps the model resource at `nif_new_context:700`
  and releases it at `ctx_dtor:318` and `nif_free_context:735`. One
  keep, one release per lifecycle path.
- Adapter wrapper keeps the model at `nif_adapter_load:2090`,
  releases at `adapter_dtor:359`. Single pair.
- Sampler wrapper keeps the context at `nif_sampler_new:2249`,
  releases at `sampler_dtor:338`. Single pair.

### `pthread_mutex_init` failure paths

All four resource constructors (`nif_load_model`, `nif_new_context`,
`nif_adapter_load`, `nif_sampler_new`) zero-init the resource before
attempting `pthread_mutex_init`. On init failure they
`enif_release_resource` (which will run the destructor with
`mu_inited == 0`) and free the underlying llama object explicitly.
The destructors all guard `pthread_mutex_destroy` on `mu_inited`.
Clean.

### `enif_alloc(sizeof(T) * n)` overflow

`n` is always a validated `int32_t` capped at `ERLLAMA_MAX_TOKENS`
(`1 << 20`) or `(1 << 24)` in detokenize, or it comes from
`enif_get_list_length` (unsigned int) and ends up multiplied by
small `sizeof`s. On 64-bit platforms `size_t` is 8 bytes; no
overflow is reachable. 32-bit BEAM would be exposed in the
`nif_set_adapters` path but that target is not supported.

### `llama_context` thread-safety

`llama.cpp` contexts are not thread-safe. Every entry that touches a
context (`erllama_safe_decode`, `erllama_safe_state_seq_*`,
`erllama_safe_sampler_sample`, `erllama_safe_get_embeddings*`,
`erllama_safe_set_adapters_lora`, `erllama_safe_set_embeddings`) is
gated by the matching `c->mu`. Same for `m->mu` on model calls.
Single-threaded per resource is the design.

### `pthread_once` use in `crc32c_init` and `erllama_safe_backend_init_once`

Both correctly serialise one-time initialisation; the static `rc`
read by `erllama_safe_backend_init_once` after `pthread_once`
returns is covered by the happens-before guarantees of pthread_once.

### Infinite loops

None found. Every loop is bounded by validated input size
(`ERLLAMA_MAX_TOKENS`, `(1 << 24)`, `ERLLAMA_MAX_TOKEN_TEXT`,
`grammar_bin.size`, `enif_get_list_length` capped), by external
list iteration with `enif_get_list_cell` returning false on
exhaustion, or by `pthread_once`.

## Suggested fix priority

1. **H1**: track adapters on the model, defer free while adapters
   exist. Prevents the most likely VM crash in real use (anyone
   calling `unload/1` while still holding a LoRA reference).
2. **H2**: hold adapter locks (or use an `in_use` flag) across
   `set_adapters_lora`.
3. **M1**: free partial work in `build_chat_msgs_from_list` on
   error to plug the per-bad-request leak.
4. **L1** is a contention/SLA concern, not a correctness bug;
   address only if `detokenize` shows up in scheduler stall traces.