# Loading a model
erllama serves one or more loaded models concurrently. Each loaded
model is a supervised `gen_statem` that owns a single
`llama_context*`, sits behind a registered name, and shares the
process-wide KV cache with every other model.
This guide walks through the load call and every option that
matters in practice.
## The minimal call
```erlang
1> {ok, _} = application:ensure_all_started(erllama).
2> {ok, Bin} = file:read_file("/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf").
3> {ok, M} = erllama:load_model(#{
backend => erllama_model_llama,
model_path => "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf",
fingerprint => crypto:hash(sha256, Bin)
}).
{ok, <<"erllama_model_2375">>}
```
That is enough to run a completion. erllama fills in the cache
parameters from the application defaults; with no `tier`/`tier_srv`
override the model writes to the RAM tier (the only one started by
default).
`M` is a binary registered name. Use it for every subsequent call:
`erllama:complete(M, ...)`, `erllama:unload(M)`, etc. You can also
pass an explicit id via `load_model/2` (also binary).
## The full option map
```erlang
#{
backend => erllama_model_llama,
model_path => "/srv/models/llama-3.1-8b-instruct.Q4_K_M.gguf",
model_opts => #{n_gpu_layers => 99, use_mmap => true},
context_opts => #{n_ctx => 8192, n_batch => 4096, n_threads => 8},
fingerprint => Fp,
fingerprint_mode => safe,
quant_type => q4_k_m,
quant_bits => 4,
ctx_params_hash => HashOfCtxParams,
context_size => 8192,
tier_srv => my_disk,
tier => disk,
policy => #{ ... }
}
```
### `backend`
Module implementing the `erllama_model_backend` behaviour. Two
shipped today:
- `erllama_model_llama` — the real llama.cpp backend.
- `erllama_model_stub` — a no-op backend used by the unit tests.
### `model_path`
Absolute path to a GGUF file. The model layer hands it to llama.cpp
verbatim; relative paths work too but are resolved against the BEAM's
current working directory, which is rarely what you want under a
release.
### `model_opts`
Pass-through to `llama_model_default_params()`. The fields that
matter day-to-day:
| Key | Default | Notes |
|---|---|---|
| `n_gpu_layers` | 0 | Number of transformer layers offloaded to GPU. Set high enough to cover the model on Metal/CUDA boxes. 99 effectively means "all". |
| `use_mmap` | true | mmap the GGUF instead of copying into anon RAM. Leave on. |
| `use_mlock` | false | `mlock(2)` the model pages. Useful on workloads where `vm.swappiness` is non-zero and you can't afford to page out weights. |
| `vocab_only` | false | Open the file but skip weight loading. Tokenizer-only mode. |
### `context_opts`
Pass-through to `llama_context_default_params()`.
| Key | Default | Notes |
|---|---|---|
| `n_ctx` | 2048 | Maximum context length the model will accept. Caching is keyed on this. Setting it higher than the model trained on will silently degrade quality past the training horizon. |
| `n_batch` | 512 | Maximum tokens fed to a single `llama_decode` call. Bigger values prefill faster but use more VRAM/RAM. 4096 is a sane upper bound for 8B-class models on a 24 GB GPU. |
| `n_ubatch` | n_batch | Micro-batch size. Usually leave equal to `n_batch`. |
| `n_threads` | hw_concurrency | CPU threads for prompt eval. |
| `n_threads_batch` | n_threads | CPU threads for batch eval. |
### `fingerprint`
A 32-byte SHA-256 over the model file. The cache key includes this
fingerprint so a hit is bound to the exact GGUF that produced it; if
you replace the model on disk, old cache rows are no longer
addressable and will be evicted by LRU.
```erlang
{ok, Bin} = file:read_file(Path),
Fp = crypto:hash(sha256, Bin).
```
### `fingerprint_mode`
How aggressively the cache trusts the fingerprint:
- `safe` — recompute the fingerprint at load time. Slow on multi-GB
files but ironclad.
- `gguf_chunked` — fingerprint the GGUF metadata chunk and the first
weights tensor only. Order of magnitude faster; defeats accidental
but not malicious tampering.
- `fast_unsafe` — trust whatever you pass in. Use only if you
fingerprint upstream and pass the result through.
### `quant_type` and `quant_bits`
Identifies the quantisation byte-for-byte. Two models with the same
weights but different quant schemes have different cache rows.
### `ctx_params_hash`
A SHA-256 over the parts of `context_opts` that change KV layout —
typically `(n_ctx, n_batch)`. erllama treats two contexts with
different params as different cache namespaces.
```erlang
CtxHash = crypto:hash(sha256, term_to_binary({Nctx, Nbatch})).
```
### `context_size`
Plain integer copy of `n_ctx`. The cache uses it for bounds checks.
### `tier_srv` and `tier`
Where saves go.
- `tier_srv` is the registered name of the tier server. Only the RAM
tier (`erllama_cache_ram`) is started automatically by the
application. To use `ram_file` or `disk`, start a tier server
yourself and pass its name:
```erlang
{ok, _} = erllama_cache_disk_srv:start_link(my_disk, "/var/lib/erllama/kvc"),
...
tier_srv => my_disk,
tier => disk,
```
- `tier` is the symbolic tier (`ram | ram_file | disk`). It must
match the backend the `tier_srv` was started with.
For production deployments use the disk tier — it survives restarts
and is the cheapest place to keep warm state.
### `policy`
Optional per-model overrides of the cache save-policy gates. Any
keys you omit fall back to the application defaults declared in
`erllama.app.src` (`min_tokens`, `cold_min_tokens`,
`cold_max_tokens`, `continued_interval`, `boundary_trim_tokens`,
`boundary_align_tokens`, `session_resume_wait_ms`). See the
[caching guide](caching.md) for what each gate means. Pass an empty
map (or omit the key entirely) to use the defaults.
## Loading multiple models
`load_model/2` takes an explicit binary id and is idempotent against
`{already_started, _}`: calling it twice with the same id returns
`{error, already_loaded}` the second time. To run two distinct models
concurrently:
```erlang
{ok, _} = erllama:load_model(<<"tiny">>, TinyConfig).
{ok, _} = erllama:load_model(<<"big">>, BigConfig).
{ok, R, _} = erllama:complete(<<"tiny">>, <<"hello">>).
{ok, R2, _} = erllama:complete(<<"big">>, <<"hello">>).
```
Both share one `erllama_cache` instance — cache rows are scoped by
fingerprint, so they never collide.
## Unloading
```erlang
ok = erllama:unload(M).
```
Triggers a synchronous `shutdown` save (best-effort: capped by
`evict_save_timeout_ms`) and terminates the gen_statem. Any
in-flight cache writes are awaited up to that timeout.
## Common pitfalls
- **Forgetting the fingerprint.** Without it the cache key falls back
to the path string, which means renaming the file invalidates the
cache. Always pass an actual hash.
- **Wrong `n_ctx`.** The cache key includes `ctx_params_hash`. If you
bump `n_ctx` for a tenant, expect a one-shot cold prefill across
every cached prefix until the new rows accumulate.
- **Mismatched `tier` / `tier_srv`.** `tier => disk` against an
`erllama_cache_ram` server name fails at first save; verify the
pair before deploy. The RAM tier is the only one auto-started; for
`ram_file` / `disk`, start the relevant `erllama_cache_disk_srv`
yourself and pass its registered name.