# whisper_cpp usage rules

For agents and humans writing code against `whisper_cpp`. These rules are
shipped with the Hex package so downstream consumers can opt in to a
consistent set of conventions.

## Loading models

- Pass a path to a `.bin` or `.gguf` whisper.cpp checkpoint to
  `WhisperCpp.load_model/2`. Download checkpoints from
  <https://huggingface.co/ggerganov/whisper.cpp>.
- Cache the `%WhisperCpp.Model{}` for the process lifetime; loading is
  expensive and the underlying NIF resource is safe to share across
  BEAM processes - concurrent `transcribe/3` calls do not serialise.
- Prefer `device: :auto` (the default). Explicit device selection that
  does not match the installed NIF artefact returns `:invalid_request`.
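
One way to cache the model for the process lifetime is to store it in `:persistent_term`. The sketch below is a hypothetical helper, not part of `whisper_cpp`: `ModelCache` and its `fetch/2` are illustrative names, and the loader is injected so the caching logic can be read on its own. It assumes the loader returns `{:ok, model}` or `{:error, reason}`; these rules do not state the return shape of `WhisperCpp.load_model/2`.

```elixir
defmodule ModelCache do
  @moduledoc "Hypothetical helper: load a model once, then reuse it from :persistent_term."

  # `loader` is a zero-arity function returning {:ok, model} or {:error, reason}.
  # In real code it would wrap the library call, for example (path is illustrative):
  #   fn -> WhisperCpp.load_model("models/ggml-base.bin", device: :auto) end
  def fetch(key, loader) when is_function(loader, 0) do
    case :persistent_term.get({__MODULE__, key}, :missing) do
      :missing ->
        # First use: run the loader and remember the model only on success.
        with {:ok, model} <- loader.() do
          :persistent_term.put({__MODULE__, key}, model)
          {:ok, model}
        end

      model ->
        # Already cached: no second load.
        {:ok, model}
    end
  end
end
```

Two processes that call `fetch/2` at the same moment may both run the loader once; if that matters, load the model in the application's supervision tree before any caller starts.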

## Audio input

- `transcribe/3` accepts exactly one shape: `{:pcm_f32, binary()}`,
  where the binary is little-endian IEEE-754 `f32` samples, mono,
  16 kHz, normalised to `[-1.0, 1.0]`.
- This library does **not** decode audio file formats. Decode WAV,
  MP3, FLAC, M4A, Opus, etc. upstream and hand the PCM in. Standard
  recipe with ffmpeg:

  ```bash
  ffmpeg -i input.mp3 -f f32le -ac 1 -ar 16000 input.pcm
  ```

  In Elixir: `pcm = File.read!("input.pcm")`, then
  `WhisperCpp.transcribe(model, {:pcm_f32, pcm}, ...)`.

- Bare binaries (without the `{:pcm_f32, _}` wrapper) and file paths
  are rejected with `:invalid_request`. In earlier versions a mistyped
  path was silently interpreted as raw PCM and produced garbage output;
  the wrapper turns that mistake into an explicit error.
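
The expected byte layout can be checked before calling the library. The following is a minimal sketch; `PcmInput` and its functions are hypothetical helpers, not part of `whisper_cpp`. It only verifies that the buffer is a whole number of 4-byte samples. It cannot detect a wrong sample rate or channel count.

```elixir
defmodule PcmInput do
  @sample_rate 16_000
  @bytes_per_sample 4

  # Encode a list of floats (assumed mono, 16 kHz, within [-1.0, 1.0])
  # as little-endian IEEE-754 f32 bytes.
  def from_samples(samples) when is_list(samples) do
    for s <- samples, into: <<>>, do: <<s::float-32-little>>
  end

  # Wrap a buffer in the {:pcm_f32, binary} shape if its size is a
  # multiple of 4 bytes; anything else cannot be valid f32 PCM.
  def wrap(pcm) when is_binary(pcm) and rem(byte_size(pcm), @bytes_per_sample) == 0 do
    {:ok, {:pcm_f32, pcm}}
  end

  def wrap(_other), do: {:error, :not_f32_aligned}

  # Duration in seconds of a mono 16 kHz f32 buffer.
  def duration_s(pcm) when is_binary(pcm) do
    byte_size(pcm) / (@sample_rate * @bytes_per_sample)
  end
end
```

A buffer read with `File.read!("input.pcm")` can be passed through `PcmInput.wrap/1` and the resulting tuple handed to `transcribe/3`.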

## Slicing PCM

- Use `WhisperCpp.transcribe_slice/4` to transcribe a `[start_s, end_s)`
  window of an already-decoded master PCM buffer. It handles the byte
  math, runs whisper.cpp on the slice, and shifts segment/word times
  back into the absolute timeline.
- Slices shorter than 0.3 s return an empty transcription, because
  whisper.cpp pads short inputs and hallucinates into the padding.
  Drop windows shorter than 0.3 s from VAD output before slicing
  instead of passing the output through unfiltered.
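
For reference, the byte math that `transcribe_slice/4` takes care of looks roughly like the sketch below. This illustrates the arithmetic and is not the library's implementation: `PcmSlice` is a hypothetical module, and rounding to the nearest sample and clamping to the buffer end are assumptions.

```elixir
defmodule PcmSlice do
  @sample_rate 16_000
  @bytes_per_sample 4
  @min_slice_s 0.3

  # Extract the [start_s, end_s) window from a mono 16 kHz f32le buffer.
  # Offsets are computed in samples first so a cut never splits a sample.
  def slice(pcm, start_s, end_s)
      when is_binary(pcm) and start_s >= 0 and end_s >= start_s do
    total_samples = div(byte_size(pcm), @bytes_per_sample)
    first = min(round(start_s * @sample_rate), total_samples)
    last = min(round(end_s * @sample_rate), total_samples)
    binary_part(pcm, first * @bytes_per_sample, (last - first) * @bytes_per_sample)
  end

  # Drop {start_s, end_s} windows shorter than 0.3 s, for example raw VAD output.
  def drop_short(windows) do
    Enum.filter(windows, fn {start_s, end_s} -> end_s - start_s >= @min_slice_s end)
  end
end
```

In practice, prefer `transcribe_slice/4`, which also shifts the timestamps back to the absolute timeline; the filter is the part that stays in caller code.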

## Voice activity detection

- Pass `:vad_model_path` (a Silero GGML model, ~0.9 MB, from
  `huggingface.co/ggml-org/whisper-vad`) to let whisper.cpp strip
  silence before the encoder; timestamps are remapped to the original
  timeline. Tune with `:vad_threshold`, `:vad_min_speech_ms`,
  `:vad_min_silence_ms`, and `:vad_speech_pad_ms`.
- Audio with no detected speech returns
  `{:ok, %Transcription{text: "", segments: []}}` - treat empty
  segments as "no speech", not as an error.
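
A sketch of how caller code might build the VAD options and tell "no speech" apart from a failure. `SpeechResult` is a hypothetical module, and the option values and their types (a float threshold, integer milliseconds) are illustrative assumptions, not the library's defaults. The result is matched with plain map patterns, which also match structs that carry the same fields.

```elixir
defmodule SpeechResult do
  # Illustrative values only; tune them against your own audio.
  def vad_opts(vad_model_path) do
    [
      vad_model_path: vad_model_path,
      vad_threshold: 0.5,
      vad_min_speech_ms: 250,
      vad_min_silence_ms: 100,
      vad_speech_pad_ms: 30
    ]
  end

  # An :ok result with no segments means "no speech detected", not an error.
  def classify({:ok, %{segments: []}}), do: :no_speech

  def classify({:ok, %{text: text, segments: segments}}),
    do: {:speech, text, length(segments)}

  def classify({:error, error}), do: {:failed, error}
end
```

The keyword list from `vad_opts/1` would be passed as the options argument of `transcribe/3`, and the return value piped into `classify/1`.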

## Cancellation and progress

- For cancellable transcribes, mint a `%WhisperCpp.AbortHandle{}` via
  `WhisperCpp.AbortHandle.new/0` and pass it via `:abort_handle`.
  Signal cancellation from another process with
  `WhisperCpp.AbortHandle.abort/1`. The call returns
  `{:ok, partial_transcription}` with whatever segments completed
  before whisper.cpp's next abort poll.
- For progress, pass `:progress_pid` (commonly `self()` inside a
  `Task`). The pid receives `{:whisper_progress, percent}` messages
  (0..100) as work advances; duplicate percentages are coalesced.
- Both hooks are zero-cost when omitted.
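
One possible arrangement, sketched under assumptions: run the transcribe in a `Task`, route progress messages to the calling process, and handle them while waiting for the result. `CancellableRun` is a hypothetical module, and the transcribe call is injected as a function so the message handling can be shown without the NIF.

```elixir
defmodule CancellableRun do
  # `run` receives the pid that should get progress messages. In real code
  # it would be something like (an assumed call shape):
  #   fn pid ->
  #     WhisperCpp.transcribe(model, {:pcm_f32, pcm},
  #       progress_pid: pid, abort_handle: handle)
  #   end
  # where `handle` comes from WhisperCpp.AbortHandle.new/0 and another
  # process may call WhisperCpp.AbortHandle.abort(handle) at any time.
  def run_with_progress(run, on_progress)
      when is_function(run, 1) and is_function(on_progress, 1) do
    parent = self()
    task = Task.async(fn -> run.(parent) end)
    loop(task, on_progress)
  end

  defp loop(%Task{ref: ref} = task, on_progress) do
    receive do
      {:whisper_progress, percent} ->
        on_progress.(percent)
        loop(task, on_progress)

      {^ref, result} ->
        # Task reply: stop monitoring and return the transcribe result.
        Process.demonitor(ref, [:flush])
        result

      {:DOWN, ^ref, :process, _pid, reason} ->
        exit(reason)
    end
  end
end
```

After an abort, the result is still `{:ok, partial_transcription}`, so the caller must remember that it requested cancellation if it needs to distinguish partial from complete output.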

## Options and errors

- Pass options as keyword lists. Unknown keys and out-of-range values
  fail with `{:error, %WhisperCpp.Error{reason: :invalid_request}}`
  before reaching the NIF - rely on this for input validation.
- `:language` takes an ISO 639-1 code (`"de"`), a full language name
  (`"german"`), or `"auto"`. Unknown codes - including BCP 47 tags like
  `"de-CH"` - return `:invalid_request`. `nil` (default) auto-detects on
  multilingual models; English-only models resolve `nil`/`"auto"` to
  `"en"` and reject other languages.
- Match `%WhisperCpp.Error{}` (or its `:reason` field) rather than
  inspecting message strings.
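
Because BCP 47 tags are rejected, callers that receive locale strings may want to reduce them to the primary language subtag first. The following is a minimal sketch; `LanguageTag` is a hypothetical helper, not part of `whisper_cpp`.

```elixir
defmodule LanguageTag do
  # nil and "auto" pass through untouched so auto-detection still works.
  def primary(nil), do: nil
  def primary("auto"), do: "auto"

  # "de-CH" -> "de", "pt_BR" -> "pt", "DE" -> "de".
  # Full names such as "german" contain no separator and pass through lowercased.
  def primary(tag) when is_binary(tag) do
    tag
    |> String.split(["-", "_"], parts: 2)
    |> hd()
    |> String.downcase()
  end
end
```

The reduced tag can still be unknown to the model, so keep matching on `{:error, %WhisperCpp.Error{reason: :invalid_request}}` after normalising.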

## Logging

- Native whisper.cpp/GGML logs are filtered to warnings and errors.
  Set `WHISPER_CPP_NATIVE_LOG` to `none`, `error`, `warn` (default),
  `info`, or `debug` before the NIF loads to change that - `info`
  restores the classic full model-load output for diagnosis.
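
A sketch of setting the level from Elixir. `NativeLog` is a hypothetical helper. These rules do not say exactly when the NIF loads, so treat calling this early in application start as an assumption; exporting the variable in the shell before starting the VM is the safer route.

```elixir
defmodule NativeLog do
  @levels ~w(none error warn info debug)

  # Accept only the documented level names; anything else raises
  # FunctionClauseError instead of silently setting an ignored value.
  def set_level(level) when level in @levels do
    System.put_env("WHISPER_CPP_NATIVE_LOG", level)
  end
end
```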

## Performance

- `:n_threads` defaults to 4. On dedicated nodes, set it to the number
  of physical cores.
- Word timestamps add one dynamic time warping (DTW) pass; enable
  `:word_timestamps` only when you need them.
- For latency-sensitive workloads, prefer `:single_segment` on short
  clips to skip the segment-split pass.
- Beam search (`:beam_size` greater than 1) is roughly 2-3x slower than
  greedy decoding. It is worth the cost when you need the lowest word
  error rate (WER) on long-form audio; for short slices, greedy is
  usually fine.
- A single loaded model handle is safe to share: parallel transcribe
  calls do not serialise on the context lock, so saturating a GPU or
  multi-core CPU from many BEAM processes is the expected pattern.
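
The fan-out pattern can be sketched with `Task.async_stream/3`. `ParallelTranscribe` is a hypothetical module, and the per-clip function is injected; in real code it would close over the shared model and call `transcribe/3`. Using `System.schedulers_online/0` as the default concurrency is an assumption, not a library recommendation.

```elixir
defmodule ParallelTranscribe do
  # `transcribe` is a 1-arity function applied to each clip, for example:
  #   fn pcm -> WhisperCpp.transcribe(model, {:pcm_f32, pcm}, n_threads: 2) end
  def run(clips, transcribe, max_concurrency \\ System.schedulers_online())
      when is_function(transcribe, 1) do
    clips
    |> Task.async_stream(transcribe,
      max_concurrency: max_concurrency,
      ordered: true,
      timeout: :infinity
    )
    # Each stream element is {:ok, result}; results keep the input order.
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

On CPU, the product of `max_concurrency` and `:n_threads` is the number to watch: keeping it near the physical core count is a reasonable starting point to measure from.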