# erllama

[![CI](https://github.com/erllama/erllama/actions/workflows/ci.yml/badge.svg)](https://github.com/erllama/erllama/actions/workflows/ci.yml)
[![Hex.pm](https://img.shields.io/hexpm/v/erllama.svg)](https://hex.pm/packages/erllama)

erllama is a native Erlang/OTP wrapper around `llama.cpp` with a
**token-exact, multi-tier, supervised KV cache**. It turns a
multi-second prefill into a millisecond restore, and lets you keep
**more warm state than fits in RAM** by demoting cold-but-popular
prefixes to the disk tier.

If you have ever waited five seconds for a chat assistant to
acknowledge "hello" — that's prompt prefill. erllama caches the
work so the second turn, the third turn, and every subsequent agent
sharing the same system prompt skip it.

## A 30-second taste

```erlang
1> {ok, _} = application:ensure_all_started(erllama).
2> Path = "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf".
3> {ok, Bin} = file:read_file(Path).
4> {ok, M} = erllama:load_model(#{
       backend     => erllama_model_llama,
       model_path  => Path,
       fingerprint => crypto:hash(sha256, Bin)
   }).
{ok, <<"erllama_model_2375">>}

5> {ok, Reply, _} = erllama:complete(M, <<"Once upon a time">>).
%% ~3 s on a CPU box. Prompt prefill, async cold save fired.

6> {ok, Reply2, _} = erllama:complete(M, <<"Once upon a time">>).
%% ~10 ms. Cache hit; KV state restored, one decode for fresh logits.

7> {ok, _, _} = erllama:complete(M, <<"Once upon a time, in a quiet village">>).
%% ~50 ms. Longest-prefix walk found the cached row even though
%% the new prompt is longer.
```

`load_model/1` returns a binary `model_id` that is also the registered
name for the underlying gen_statem. Pass it to `complete/2,3`,
`unload/1`, etc.

That is the whole pitch. The cache is on by default, runs under
its own supervisor, and never returns approximate matches.

## What you get

- **Many models in one BEAM.** Load TinyLlama and Llama-3-8B side by
  side, hot-swap a model without bouncing the cache, give each model
  its own `policy` and `tier`. One shared cache; rows are
  fingerprint-segregated so models never collide on identical
  prompts.
- **Token-exact hits.** Cache key is
  `sha256(model_fp || quant || ctx_params || tokens_le32)`. Same
  tokens, same key, guaranteed-correct restore. A sketch of the key
  construction follows this list.
- **Three storage tiers.** ETS slabs for the hottest data, files
  on `/dev/shm` for the warm working set, on-disk files (plain read
  I/O) for everything else. Each tier is supervised independently
  with its own quota and LRU.
- **Bigger than RAM.** Disk is a first-class tier, not a fallback.
  A 70B model in Q4 already takes ~40 GB of weights; the disk tier
  holds the warm KV state your working set needs without crowding
  weights out of RAM.
- **Shared-prefix hits across agents.** Spawn N workers that all
  start with the same system prompt: the first cold-prefills, every
  subsequent worker gets a longest-prefix hit on the shared part
  (see the fan-out sketch further down).
- **Multi-turn warmth.** Pass the previous turn's `parent_key` and
  the cache waits up to `session_resume_wait_ms` for the in-flight
  finish save to publish.
- **Stateless-friendly.** OpenAI/Anthropic-shaped servers that
  resend the full conversation each turn get hits automatically
  through a longest-prefix walk. No `parent_key` needed.
- **Crash-safe saves.** Reserve, write temp, validate, atomic
  `link(2)`, announce. Two-stage TTL cleanup adopts orphans on
  writer crash.
- **Memory-pressure-driven eviction.** Pluggable pressure source
  (`memsup`, `nvidia-smi`, or your own callback). Off by default.
- **Always-on metrics.** Hits, misses, saves, evictions, and
  per-path latency totals exposed via `erllama_cache:get_counters/0`.
  Per-counter cost is ~10-20 ns; you cannot meaningfully turn them
  off.
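
The key formula in the token-exact bullet reduces to a single
`crypto:hash/2` call over the concatenated parts. A minimal sketch of
the idea; the encoding of the quant field and the separator-free
concatenation are assumptions here, and real code should go through
`erllama_cache_key:make/1` (shown in the multi-turn example further
down):

```erlang
%% Illustration of the key formula only. The exact byte encoding is an
%% assumption; use erllama_cache_key:make/1 in real code.
cache_key(ModelFp, QuantType, CtxParamsHash, Tokens) ->
    %% tokens_le32: each token id packed as a 32-bit little-endian integer.
    TokensLE32 = << <<T:32/little>> || T <- Tokens >>,
    crypto:hash(sha256,
                [ModelFp, atom_to_binary(QuantType, utf8),
                 CtxParamsHash, TokensLE32]).
```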

## Installation

erllama targets Erlang/OTP **28** with rebar3 **3.25+**.

Add to `rebar.config`:

```erlang
{deps, [
    {erllama, "~> 0.1"}
]}.
```

Then make sure the application is started before your supervision
tree loads any models:

```erlang
ok = application:ensure_started(erllama).
```

The first compile builds the vendored llama.cpp (~3 minutes on a fast
machine). Subsequent builds are cached. See [requirements](#requirements)
for the toolchain.

## Documentation

| Guide | What it covers |
|---|---|
| [Loading a model](guides/loading.md) | Every option to `erllama:load_model/1,2`, with examples and pitfalls. |
| [Caching](guides/caching.md) | Tiers, save reasons, lookup paths, watermarks. The operator's manual. |
| [Configuration](guides/configuration.md) | Full `sys.config` and per-model option reference. |
| [Building](guides/building.md) | Platform-specific build notes (Linux, macOS, FreeBSD), CUDA/Metal toggles, common build issues. |
| [Examples](guides/examples.md) | Drop-in patterns for one-shot completion, stateless HTTP servers, multi-turn sessions, concurrent agents, cache inspection. |

For the API reference (`erllama`, `erllama_cache`, `erllama_scheduler`,
`erllama_nif`), see the **[generated module docs on
HexDocs](https://hexdocs.pm/erllama)** or run `rebar3 ex_doc`
locally.

For the design rationale behind the cache:

- [Cache design](internals/cache-design.md) — why multi-tier, why
  token-exact, what was deliberately left out.
- [Publish protocol](internals/publish-protocol.md) — the
  five-stage crash-safe save protocol.
- [NIF safety](internals/nif-safety.md) — two-resource lifetime,
  exception shim, why disk reads use plain `file:read_file/1`.

## Many models in one BEAM

Each loaded model is its own supervised `gen_statem` under
`erllama_model_sup`. The cache is shared across the whole VM and
segregates rows by fingerprint, so the only thing two models share
is the byte budget.

```erlang
{ok, _} = erllama:load_model(<<"tiny">>, TinyConfig).
{ok, _} = erllama:load_model(<<"big">>,  BigConfig).

{ok, R1, _} = erllama:complete(<<"tiny">>, <<"summarise: ...">>).
{ok, R2, _} = erllama:complete(<<"big">>,  <<"deep analysis of: ...">>).

ok = erllama:unload(<<"tiny">>).
```

| Capability | How |
|---|---|
| N models in one BEAM | `load_model/2` per binary id; each is one `gen_statem` |
| No cross-model collisions | Cache key includes the model fingerprint |
| Hot-swap a model | `unload/1` then `load_model/2`; the cache survives (sketch below) |
| Per-model `policy` | `policy => #{...}` on the load; merges over app-env defaults |
| Per-model `tier` | `tier_srv => MyDisk, tier => disk` per model |
| Shared-prefix hits across agents | Longest-prefix walk on every cold prompt |
| Concurrent saves bounded | Single writer pool with a leak-proof semaphore |

Tested end-to-end in
`test/erllama_SUITE.erl:concurrent_complete_under_writer_cap` —
four models with distinct fingerprints running parallel completions
under one writer cap.
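
Hot-swapping in practice is an `unload/1` followed by a `load_model/2`
under the same id. A minimal sketch, assuming `BigConfig` is the map
used above and has the same shape as the full load example below; the
hypothetical new path and recomputed fingerprint are the only fields
that change:

```erlang
%% Hypothetical swap of <<"big">> to a newer GGUF. The cache is untouched;
%% rows keyed to the old fingerprint stop matching and age out under the
%% tier LRU.
NewPath = "/srv/models/llama-3.1-8b-instruct.Q4_K_M.gguf",
{ok, NewWeights} = file:read_file(NewPath),
ok = erllama:unload(<<"big">>),
{ok, _} = erllama:load_model(<<"big">>, BigConfig#{
    model_path  => NewPath,
    fingerprint => crypto:hash(sha256, NewWeights)
}).
```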

## A slightly longer example

A real load with all the cache parameters. The disk tier requires a
running `erllama_cache_disk_srv` started by the operator; the RAM tier
(`erllama_cache_ram`) starts automatically with the application.

```erlang
{ok, _} = erllama_cache_disk_srv:start_link(my_disk, "/var/lib/erllama/kvc"),
{ok, Bin} = file:read_file("/srv/models/llama-3.1-8b.Q4_K_M.gguf"),
Fp = crypto:hash(sha256, Bin),
CtxHash = crypto:hash(sha256, term_to_binary({8192, 4096})),

{ok, M} = erllama:load_model(#{
    backend          => erllama_model_llama,
    model_path       => "/srv/models/llama-3.1-8b.Q4_K_M.gguf",
    model_opts       => #{n_gpu_layers => 99},
    context_opts     => #{n_ctx => 8192, n_batch => 4096},
    fingerprint      => Fp,
    fingerprint_mode => safe,
    quant_type       => q4_k_m,
    quant_bits       => 4,
    ctx_params_hash  => CtxHash,
    context_size     => 8192,
    tier_srv         => my_disk,
    tier             => disk,
    policy           => #{
        boundary_trim_tokens   => 32,
        boundary_align_tokens  => 256,
        session_resume_wait_ms => 500
    }
}).
```

Stateless OpenAI/Anthropic-shaped server:

```erlang
handle_completion(ModelId, Prompt) ->
    {ok, Reply, _Tokens} = erllama:complete(ModelId, Prompt),
    Reply.
```

No `parent_key`. The cache walks the new prompt backward by the
configured stride and finds the longest cached prefix. If the new
prompt is yesterday's conversation plus one fresh turn, the walk
hits.

Stateful Erlang-native multi-turn: the session layer threads
`parent_key` between turns. The previous turn's finish-save key is
the parent of the next call; that key is held by the calling session
process, not retrieved from the cache.

```erlang
%% First turn: cold prefill. The model fires an async finish save
%% whose key is sha256(fingerprint || quant || ctx_params || tokens).
{ok, R1, Tokens1} = erllama:complete(M, Prompt1),
ParentKey = erllama_cache_key:make(#{
    fingerprint => Fp,
    quant_type  => q4_k_m,
    ctx_params_hash => CtxHash,
    tokens      => Tokens1
}),

%% Second turn: pass ParentKey to skip the longest-prefix walk.
{ok, R2, _} = erllama:complete(M, Prompt2, #{parent_key => ParentKey}).
```
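
Concurrent agents sharing a system prompt (the shared-prefix bullet
under "What you get"): a minimal fan-out sketch with hypothetical
prompts, assuming `M` is the model loaded above. Only the first worker
pays the cold prefill; the rest get longest-prefix hits on the shared
`System` part.

```erlang
%% Hypothetical prompts; the shared System prefix is what the cache reuses.
System = <<"You are a terse research assistant.\n\n">>,
Tasks  = [<<"Summarise paper A.">>, <<"Summarise paper B.">>,
          <<"Summarise paper C.">>],
Self = self(),
[spawn_link(fun() ->
     {ok, Reply, _} = erllama:complete(M, <<System/binary, Task/binary>>),
     Self ! {agent_done, Task, Reply}
 end) || Task <- Tasks],
%% Collect one reply per task, in completion order.
Replies = [receive {agent_done, _, R} -> R end || _ <- Tasks].
```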

Inspect cache state from a shell:

```erlang
1> erllama_cache:get_counters().
#{hits_exact => 142, hits_resume => 17, hits_longest_prefix => 89,
  misses => 12, saves_cold => 12, saves_continued => 67,
  saves_finish => 31, evictions => 3, ...}

2> erllama_cache_meta_srv:dump().
%% List of raw ETS rows:
%%   {Key, Tier, Size, LastUsedNs, Refcount, Status, HeaderBin,
%%    Location, TokensRef, Hits}
[{<<_:256>>, disk, 8388608, 1737..., 0, available, _, _, _, 4}, ...]
```
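
A quick hit-rate calculation over those counters, assuming only the
counter names shown above:

```erlang
#{hits_exact := HE, hits_resume := HR, hits_longest_prefix := HP,
  misses := Miss} = erllama_cache:get_counters(),
Hits = HE + HR + HP,
HitRate = Hits / max(1, Hits + Miss).
```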

## Requirements

- Erlang/OTP **28**
- rebar3 **3.25+**
- C++17 toolchain (Apple clang or recent gcc; `cmake` >= 3.20)
- Apple Silicon: Metal + Accelerate auto-detected.
- Linux: BLAS auto-detected; CUDA off by default (toggle via
  `ERLLAMA_OPTS=-DGGML_CUDA=ON`).
- FreeBSD: `erlang-runtime28` from ports, plus `cmake bash gmake`.

## Architecture at a glance

```
erllama_sup
├── erllama_cache_sup
│   ├── erllama_cache_meta_srv      sole writer; meta + LRU + reservations
│   ├── erllama_cache_ram           RAM tier (ETS slabs)
│   ├── erllama_cache_ramfile_srv   ram_file tier
│   ├── erllama_cache_disk_srv      disk tier (plain read/write I/O)
│   └── erllama_cache_writer        writer pool, leak-proof semaphore
├── erllama_model_sup               simple_one_for_one for dynamic models
└── erllama_scheduler               memory-pressure poller (off by default)
```

Inside a request:

1. `erllama:complete/2` enters the per-model `gen_statem`.
2. **prefilling** — tokenize, then either hit the cache and
   `kv_unpack` (warm) or run `llama_decode` over the prompt (cold).
   Cold path fires an async `cold` save at the trimmed-prefix
   boundary.
3. **generating** — token-by-token greedy `llama_decode`. Every
   `continued_interval` tokens, fire an async `continued` save.
4. **idle** — fire an async `finish` save for the full prompt +
   reply. The KV state becomes evictable.

For the publish protocol, the reservation state machine, and the
exception-safe NIF wrappers, see
[internals/publish-protocol.md](internals/publish-protocol.md) and
[internals/nif-safety.md](internals/nif-safety.md).

## Status

**Pre-release.** Core cache, scheduler, and NIF: 166 EUnit + 11
PropEr + 7 stub Common Test cases. End-to-end CT suite gated on
`LLAMA_TEST_MODEL` (6 cases, passing locally with TinyLlama 1.1B
Q4_K_M).

See [CHANGELOG.md](CHANGELOG.md) for the release notes.

## Contributing

The contributor guide is [AGENTS.md](AGENTS.md). The short version:

```bash
rebar3 fmt          # auto-format (always run first)
rebar3 compile      # warnings_as_errors
rebar3 eunit        # unit tests
rebar3 proper       # property tests
rebar3 ct           # Common Test (without a real model)
rebar3 lint         # Elvis
rebar3 dialyzer     # static analysis
rebar3 xref         # cross-reference
```

End-to-end against a real GGUF:

```bash
LLAMA_TEST_MODEL=/path/to/tinyllama-1.1b-chat.gguf \
    rebar3 ct --suite=test/erllama_real_model_SUITE
```

Bumping the vendored llama.cpp: see [UPDATE_LLAMA.md](UPDATE_LLAMA.md).

## Coming next: erllama_cluster

A separate OTP application is in development to coordinate a fleet of
erllama nodes into a single inference cluster. Each node continues to
run erllama as a standalone library — local model loading, local KV
cache, local inference. The cluster layer sits on top and decides
which node serves which request.

Three distribution strategies, all in v1:

- **Request distribution** with pluggable load-balancing and
  cache-affinity routing — follow-up requests prefer the node that
  warmed the KV cache for the prefix.
- **Speculative decoding across nodes** — small draft model on one
  node, large verifier on another, coordinated per request.
- **Pipeline parallelism** — models too large for one node split by
  layer ranges across multiple nodes, hidden states passed between
  shards as Erlang binaries.

Transport is QUIC, via Erlang distribution carried over
[erlang_quic](https://github.com/benoitc/erlang_quic) — a pure Erlang
QUIC implementation, no C NIF in the protocol path. Circuit breakers
per `{Node, ModelId}` driven by `nodeup`/`nodedown` rather than
application-level pings. A globally registered scheduler handles
cluster-wide GPU budgeting and on-demand model placement, with local
fallback schedulers elected by `pg` quorum on network partition.

Repository: <https://github.com/erllama/erllama_cluster> (under
construction).

## Acknowledgements

Same idea as [antirez/ds4](https://github.com/antirez/ds4).

## License

MIT. Copyright (c) 2026 Benoit Chesneau. See [LICENSE](LICENSE).

The vendored `c_src/llama.cpp/` retains its upstream MIT license; see
`c_src/llama.cpp/LICENSE`.