# whisper_cpp usage rules
For agents and humans writing code against `whisper_cpp`. These rules are
shipped with the Hex package so downstream consumers can opt in to a
consistent set of conventions.
## Loading models
- Pass a path to a `.bin` or `.gguf` whisper.cpp checkpoint to
`WhisperCpp.load_model/2`. Download checkpoints from
<https://huggingface.co/ggerganov/whisper.cpp>.
- Cache the `%WhisperCpp.Model{}` for the process lifetime; loading is
expensive and the underlying NIF resource is safe to share across
BEAM processes - concurrent `transcribe/3` calls do not serialise.
- Prefer `device: :auto` (the default). Explicit device selection that
does not match the installed NIF artefact fails with reason
`:invalid_request`.
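
One way to keep a single loaded model around is `:persistent_term`. The sketch below is illustrative only: `ModelCache` is a hypothetical module, and it assumes `WhisperCpp.load_model/2` returns `{:ok, model}` on success, which these rules do not spell out. The `loader` argument is a seam for tests, not part of the library.

```elixir
defmodule ModelCache do
  # Hypothetical helper: loads the model once and reuses the handle afterwards.
  @key {__MODULE__, :model}

  # ASSUMPTION: the loader returns {:ok, model} on success.
  def fetch!(path, opts \\ [device: :auto], loader \\ &WhisperCpp.load_model/2) do
    case :persistent_term.get(@key, :none) do
      :none ->
        # First call: pay the load cost, then store the handle VM-wide.
        {:ok, model} = loader.(path, opts)
        :persistent_term.put(@key, model)
        model

      model ->
        # Later calls: reuse the cached handle from any BEAM process.
        model
    end
  end
end
```

Two processes calling `fetch!/3` at the same moment may both load the model; warming the cache once from the application's start-up code avoids that.
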
## Audio input
- `transcribe/3` accepts exactly one shape: `{:pcm_f32, binary()}`,
where the binary is little-endian IEEE-754 `f32` samples, mono,
16 kHz, normalised to `[-1.0, 1.0]`.
- This library does **not** decode audio file formats. Decode WAV,
MP3, FLAC, M4A, Opus, etc. upstream and hand the PCM in. Standard
recipe with ffmpeg:
```bash
ffmpeg -i input.mp3 -f f32le -ac 1 -ar 16000 input.pcm
```
In Elixir: `pcm = File.read!("input.pcm")`, then
`WhisperCpp.transcribe(model, {:pcm_f32, pcm}, ...)`.
- Bare binaries (without the `{:pcm_f32, _}` wrapper) and file paths
are rejected with `:invalid_request`. A typo'd path used to turn
into garbage PCM; the wrapper surfaces the bug instead.
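
Because the expected layout is fixed (4-byte `f32` samples, mono, 16 kHz), a cheap sanity check on the buffer can catch truncated or wrongly decoded input before it reaches `transcribe/3`. A minimal sketch; `PcmInfo` and its error atom are hypothetical names, not part of the library:

```elixir
defmodule PcmInfo do
  # Layout stated in the rules above: f32 samples, mono, 16 kHz.
  @sample_rate 16_000
  @bytes_per_sample 4

  # Returns the sample count and duration of the buffer, or an error when
  # the byte size cannot be a whole number of f32 samples.
  def describe(pcm) when is_binary(pcm) and rem(byte_size(pcm), @bytes_per_sample) == 0 do
    samples = div(byte_size(pcm), @bytes_per_sample)
    {:ok, %{samples: samples, duration_s: samples / @sample_rate}}
  end

  def describe(pcm) when is_binary(pcm), do: {:error, :not_whole_f32_samples}
end
```

A duration far from what the source file reports usually means the ffmpeg flags (`-f f32le -ac 1 -ar 16000`) were not applied.
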
## Slicing PCM
- Use `WhisperCpp.transcribe_slice/4` to transcribe a `[start_s, end_s)`
window of an already-decoded master PCM buffer. It handles the byte
math, runs whisper.cpp on the slice, and shifts segment/word times
back into the absolute timeline.
- Slices shorter than 0.3 s return an empty transcription, because
whisper.cpp pads short inputs and hallucinates into the padding.
Filter such windows out of VAD output before slicing rather than
passing raw VAD output through.
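
To make the byte math concrete, the sketch below shows how a `[start_s, end_s)` window maps onto a 16 kHz mono `f32` buffer and how sub-0.3 s windows could be dropped up front. It is an illustration of the arithmetic, not the library's implementation; `PcmSlice` and its function names are hypothetical, and it assumes the buffer holds whole samples.

```elixir
defmodule PcmSlice do
  @sample_rate 16_000
  @bytes_per_sample 4
  @min_slice_s 0.3

  # Converts a [start_s, end_s) window into {byte_offset, byte_length}.
  def byte_range(start_s, end_s) when start_s >= 0 and end_s >= start_s do
    first = round(start_s * @sample_rate)
    last = round(end_s * @sample_rate)
    {first * @bytes_per_sample, (last - first) * @bytes_per_sample}
  end

  # Cuts the window out of the master buffer, clamping to the buffer end.
  def slice(pcm, start_s, end_s) when is_binary(pcm) do
    {offset, len} = byte_range(start_s, end_s)
    offset = min(offset, byte_size(pcm))
    len = min(len, byte_size(pcm) - offset)
    binary_part(pcm, offset, len)
  end

  # Drops {start_s, end_s} windows shorter than the 0.3 s floor.
  def drop_short(windows) do
    Enum.filter(windows, fn {start_s, end_s} -> end_s - start_s >= @min_slice_s end)
  end
end
```

Segment times obtained from such a slice are relative to the slice; adding `start_s` back is the shift that `transcribe_slice/4` is described as doing for you.
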
## Voice activity detection
- Pass `:vad_model_path` (a Silero VAD model in GGML format, ~0.9 MB, from
<https://huggingface.co/ggml-org/whisper-vad>) to let whisper.cpp strip
silence before the encoder; timestamps are remapped to the original
timeline. Tune with `:vad_threshold`, `:vad_min_speech_ms`,
`:vad_min_silence_ms`, and `:vad_speech_pad_ms`.
- Audio with no detected speech returns
`{:ok, %Transcription{text: "", segments: []}}` - treat empty
segments as "no speech", not as an error.
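
Handling the "no speech" case can be a plain pattern match on the result. A small sketch with a hypothetical `SpeechCheck` module; the map patterns rely only on the `text` and `segments` fields shown above and also match the `%Transcription{}` struct:

```elixir
defmodule SpeechCheck do
  # An :ok result with no segments means silence, not failure.
  def classify({:ok, %{segments: []}}), do: :no_speech

  # Any other :ok result carries recognised text.
  def classify({:ok, %{text: text}}), do: {:speech, text}

  # Errors pass through untouched for the caller to handle.
  def classify({:error, _} = error), do: error
end
```
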
## Cancellation and progress
- For cancellable transcribes, mint a `%WhisperCpp.AbortHandle{}` via
`WhisperCpp.AbortHandle.new/0` and pass it via `:abort_handle`.
Signal cancellation from another process with
`WhisperCpp.AbortHandle.abort/1`. The call returns
`{:ok, partial_transcription}` with whatever segments completed
before whisper.cpp's next abort poll.
- For progress, pass `:progress_pid` (commonly `self()` inside a
`Task`). The pid receives `{:whisper_progress, percent}` messages
(0..100) as work advances; duplicate percentages are coalesced.
- Both hooks are zero-cost when omitted.
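
The two hooks combine naturally when the transcribe runs in a `Task` and the caller consumes progress messages while it waits. The sketch below is one possible arrangement, not the library's prescribed pattern: `CancellableJob` is hypothetical, and `transcribe_fun` stands in for a closure around `WhisperCpp.transcribe/3` that receives the pid to report progress to.

```elixir
defmodule CancellableJob do
  # Runs `transcribe_fun` in a Task and forwards progress to `on_progress`
  # until the Task replies. `transcribe_fun` receives the caller's pid so
  # it can pass it on as :progress_pid.
  def run(transcribe_fun, on_progress) do
    parent = self()
    task = Task.async(fn -> transcribe_fun.(parent) end)
    loop(task, on_progress)
  end

  defp loop(%Task{ref: ref} = task, on_progress) do
    receive do
      {:whisper_progress, percent} ->
        on_progress.(percent)
        loop(task, on_progress)

      {^ref, result} ->
        # Task reply: stop monitoring and hand the result back.
        Process.demonitor(ref, [:flush])
        result

      {:DOWN, ^ref, :process, _pid, reason} ->
        exit(reason)
    end
  end
end

# Intended use with the options described above (needs a loaded model):
#
#   handle = WhisperCpp.AbortHandle.new()
#
#   CancellableJob.run(
#     fn pid ->
#       WhisperCpp.transcribe(model, {:pcm_f32, pcm},
#         abort_handle: handle, progress_pid: pid)
#     end,
#     fn percent -> IO.puts("progress: #{percent}%") end
#   )
#
#   # From any other process, to cancel:
#   WhisperCpp.AbortHandle.abort(handle)
```

Progress messages that arrive after the Task reply stay in the caller's mailbox; flush them if the process is long-lived.
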
## Options and errors
- Pass options as keyword lists. Unknown keys and out-of-range values
fail with `{:error, %WhisperCpp.Error{reason: :invalid_request}}`
before reaching the NIF - rely on this for input validation.
- `:language` takes an ISO 639-1 code (`"de"`), a full language name
(`"german"`), or `"auto"`. Unknown codes - including BCP 47 tags like
`"de-CH"` - return `:invalid_request`. `nil` (default) auto-detects on
multilingual models; English-only models resolve `nil`/`"auto"` to
`"en"` and reject other languages.
- Match `%WhisperCpp.Error{}` (or its `:reason` field) rather than
inspecting message strings.
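
Two small helpers illustrate these rules: reducing a BCP 47 tag to its primary subtag before passing it as `:language`, and branching on `:reason` instead of on message text. `TranscribeOpts` and the outcome atoms are hypothetical. The map pattern `%{reason: ...}` is used so the sketch compiles on its own; in a project with the dependency, `%WhisperCpp.Error{reason: ...}` is the stricter pattern.

```elixir
defmodule TranscribeOpts do
  # "de-CH" -> "de". Leaves "auto" and plain codes unchanged (lower-cased).
  # nil stays nil so the default auto-detection still applies.
  def primary_language(nil), do: nil

  def primary_language(tag) when is_binary(tag) do
    tag
    |> String.split(["-", "_"], parts: 2)
    |> hd()
    |> String.downcase()
  end

  # Maps a transcribe result onto an application-level outcome by
  # matching the error's :reason field, never its message string.
  def outcome({:ok, transcription}), do: {:ok, transcription}
  def outcome({:error, %{reason: :invalid_request}}), do: {:error, :bad_input}
  def outcome({:error, %{reason: reason}}), do: {:error, {:transcription_failed, reason}}
end
```
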
## Logging
- Native whisper.cpp/GGML logs are filtered to warnings and errors.
Set `WHISPER_CPP_NATIVE_LOG` to `none`, `error`, `warn` (default),
`info`, or `debug` before the NIF loads to change that - `info`
restores the classic full model-load output for diagnosis.
## Performance
- `:n_threads` defaults to 4. On dedicated nodes, set it to the number
of physical cores.
- Word timestamps add one dynamic time warping (DTW) pass; enable
`:word_timestamps` only when you need them.
- For latency-sensitive workloads, prefer `:single_segment` on short
clips to skip the segment-split pass.
- Beam search (`:beam_size` greater than 1) is roughly 2-3x slower than
greedy decoding. It is worth the cost when you need the lowest word
error rate (WER) on long-form audio; for short slices, greedy is
usually fine.
- A single loaded model handle is safe to share: parallel transcribe
calls do not serialise on the context lock, so saturating a GPU or
multi-core CPU from many BEAM processes is the expected pattern.
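
Fanning clips out over one shared model handle can be sketched with `Task.async_stream/3`. `ParallelTranscribe` is a hypothetical helper, and `transcribe_fun` stands in for a closure around `WhisperCpp.transcribe/3`; the concurrency default here is only a starting point.

```elixir
defmodule ParallelTranscribe do
  # Runs `transcribe_fun` over every clip concurrently and returns the
  # results in input order.
  def run(clips, transcribe_fun, max_concurrency \\ System.schedulers_online()) do
    clips
    |> Task.async_stream(transcribe_fun,
      max_concurrency: max_concurrency,
      # Transcribes can outlast the default 5 s per-item timeout.
      timeout: :infinity
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# Intended use with a shared model handle (needs a loaded model):
#
#   ParallelTranscribe.run(
#     pcm_clips,
#     fn pcm -> WhisperCpp.transcribe(model, {:pcm_f32, pcm}, n_threads: 4) end,
#     2
#   )
```

On CPU, total native threads are roughly `max_concurrency * n_threads`, so it is likely worth lowering one when raising the other.
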