# whisper_ct2 usage rules
Rules for LLM coding agents using `whisper_ct2` in a consumer project.
Published per the [`usage_rules`](https://hex.pm/packages/usage_rules)
convention; sync into your project with `mix usage_rules.sync`.
## Load the model once, reuse the struct
`WhisperCt2.load_model/2` returns `{:ok, %WhisperCt2.Model{ref: ref}}` where
`ref` is a NIF resource pointing at the live CTranslate2 model. The model
stays in memory as long as some process holds the struct. Do **not** call
`load_model/2` per request - hold it in a long-lived process.
```elixir
defmodule MyApp.Whisper do
use GenServer
def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)
def transcribe(audio, opts \\ []),
do: GenServer.call(__MODULE__, {:transcribe, audio, opts}, :infinity)
@impl true
def init(path) do
{:ok, model} = WhisperCt2.load_model(path)
{:ok, model}
end
@impl true
def handle_call({:transcribe, audio, opts}, _from, model) do
{:reply, WhisperCt2.transcribe(model, audio, opts), model}
end
end
```
Put it under your supervision tree. When the process dies the NIF resource
is freed, so let the supervisor reload it.
## Parallelism: one Model serialises inside ct2rs
A single `%Model{}` processes calls serially through the NIF. For real
concurrency across multiple callers, load N replicas (one per process) and
pool them - e.g. with `:poolboy`, `nimble_pool`, or a `Registry`-keyed set
of GenServers. Increasing `:max_queued_batches` only deepens the queue, not
the worker count.
Do not share the same model across OS threads expecting parallel inference;
share it across BEAM processes for fan-in, not fan-out.
## Batched transcribe collapses per-call overhead
`WhisperCt2.transcribe_batch(model, [audio1, audio2, ...], opts)` stacks
every chunk of every input into one mel batch and runs the encoder once
across the whole thing. For diarization-driven workflows (one call per
turn, dozens to hundreds of turns) this is materially faster than
looping `transcribe/3` because CTranslate2 amortises the encoder
forward pass across the batch.
`:language` applies to every audio in the batch; pass `nil` to
auto-detect per-audio (only meaningful on multilingual checkpoints).
For carving sub-windows out of an already-decoded buffer, use
`WhisperCt2.Pcm.slice(samples, sample_rate, start_s, duration_s)` -
it does the f32 byte math (4 bytes/sample) and bounds-checks against the
buffer size, so a slice past the end fails loudly instead of decoding
garbage.
```elixir
pcm = File.read!("call.pcm") # f32 LE, 16 kHz mono, prepared upstream
{:ok, turn} = WhisperCt2.Pcm.slice(pcm, model.sampling_rate, 12.3, 4.7)
WhisperCt2.transcribe(model, {:pcm_f32, turn}, language: "en")
```
## Word-level timestamps are opt-in
Pass `word_timestamps: true` to attach `%WhisperCt2.Word{text, start, end,
probability}` entries to each segment. Implementation reuses the encoder
output from `generate` and runs one extra batched `align` call (DTW over
decoder attention) across every chunk in the batch. Cost is on the order
of the alignment pass itself, not a second encoder forward. Use it for
caption alignment or diarization-aware splicing; skip it when you only
need segment timing.
## Segment timestamps: `:with_timestamps`
`:with_timestamps` defaults to `true`: the prompt asks Whisper to emit
`<|t_..|>` tokens that split each 30 s chunk into sub-segments. Leave it
on for stock OpenAI / `Systran/faster-whisper-*` checkpoints.
Set `with_timestamps: false` for fine-tunes that ignore the timestamp
instruction or were trained to emit plain text (e.g. some domain
fine-tunes). The chunk's full text then becomes one segment spanning
`[0, chunk_duration_s)` instead of being silently dropped.
`:word_timestamps` implicitly forces `:with_timestamps` back to `true` -
the DTW alignment needs the timestamp scaffolding. Don't combine
`with_timestamps: false` with `word_timestamps: true` and expect the
former to win.
## Initial prompt and prefix
- `:initial_prompt` - free-text conditioning prepended via
`<|startofprev|>`. Bias the decoder toward domain vocabulary, names,
or speaker style ("Discussion of CTranslate2 internals", "Dialogue
between Alice and Bob"). Same role as in faster-whisper.
- `:prefix` - forced text the generation must start with. Useful when
the first words are already known (caption corrections, fixed
intro lines).
Both are tokenised inside the NIF without special-token expansion, so
control tokens in the strings are not interpreted.
## Pass `:language` when you know it
`:language` defaults to `nil`, which makes Whisper auto-detect from the first
chunk. Auto-detection adds latency and can misfire on short or noisy clips
(English-only fine-tunes still sometimes guess `:cy` or `:fr`). Always pass
`language: "en"` (or the relevant ISO code) when the source language is known.
`model.multilingual` tells you whether the loaded checkpoint can do anything
other than English - `faster-whisper-*.en` variants are monolingual and ignore
`:language`. Branch on `model.multilingual` if your code supports both.
## Result shape
`{:ok, %WhisperCt2.Transcription{text, segments, language, duration_s}}`:
- `text` - all segment texts joined by `" "` and `String.trim/1`'d. Use this
for display or downstream NLP.
- `segments` - list of `%WhisperCt2.Segment{}`, each carrying absolute
`:start` / `:end` seconds, `:no_speech_prob`, `:avg_logprob`, the
underlying text-token IDs (`:tokens`), and `:words` (`nil` unless
`:word_timestamps` was set).
- `language` - resolved ISO code (auto-detected when not pinned).
- `duration_s` - input audio length in seconds.
Segment timestamps are real fields, not embedded tokens - do **not** regex
the text for `<|t_..|>`. Boundaries are produced by Whisper's own timestamp
tokens, parsed inside the NIF.
`:no_speech_prob` and `:avg_logprob` are always populated; filter
hallucination with e.g. `seg.avg_logprob < -1.0` or
`seg.no_speech_prob > 0.6`.
## Model struct fields are part of the API
Illustrative shape (`:ref` and `:path` omitted for brevity; both are
also part of the struct):
```elixir
%WhisperCt2.Model{
sampling_rate: 16_000, # always 16 kHz for published Whisper
n_samples: 480_000, # samples in one Whisper window (30 s)
multilingual: true, # false for *.en variants
device: :cpu, # resolved (never :auto)
compute_type: :int8, # resolved (never :default / :auto)
...
}
```
Read these at runtime instead of hardcoding. `device` and `compute_type` are
the **resolved** values - `:auto` and `:default` are normalised at load time.
## Audio contract is strict
CTranslate2 wants **mono `f32` PCM at the model's sample rate** (always
16 kHz for published Whisper checkpoints), normalised to `-1.0..1.0`.
`transcribe/3` and `transcribe_batch/3` accept exactly one shape:
- `{:pcm_f32, binary}` - little-endian f32 samples at the model's sample
rate.
There is **no built-in decoder**. Paths, raw bare binaries, WAV bytes,
MP3, etc. are all rejected at the boundary with a clear
`:invalid_request` error. Decoding, downmixing, and resampling are the
caller's job; use `ffmpeg`, Membrane, or your platform audio stack
upstream.
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -f f32le output.pcm
```
```elixir
pcm = File.read!("output.pcm")
WhisperCt2.transcribe(model, {:pcm_f32, pcm}, language: "en")
```
For microphone or streaming sources, build the f32 buffer yourself and pass
`{:pcm_f32, binary}`. The authoritative sample rate is `model.sampling_rate`
on the loaded struct, not a hard-coded `16_000`.
Audio longer than 30 s is split into Whisper-window chunks automatically;
per-chunk text is in `transcription.segments`.
## Return shape: never raises on the happy path
Every public function returns `{:ok, _} | {:error, %WhisperCt2.Error{}}`.
The error struct implements `Exception`, so `raise/1` works if you want
let-it-crash behaviour, but do not write `case` clauses that assume an
`{:ok, _}` pattern only. `Error.reason` is one of:
- `:invalid_request` - bad options or audio shape; rejected before the NIF
- `:load_error` - model directory missing or unreadable
- `:inference_error` - CTranslate2 raised during transcription
- `:runtime_error` - other ct2rs-side failure
- `:nif_panic` - Rust panic caught by the panic boundary
- `:native_error` - fallback for unrecognised native errors
## Device and compute_type selection
Probe before deciding:
```elixir
WhisperCt2.available_devices()
#=> {:ok, %{cpu: 1, cuda: 1, cuda_supported: true}}
```
- `device: :auto` (default) picks CUDA when the artefact was built with it
and at least one device is visible; otherwise CPU. Use this unless you
have a reason not to.
- `device: :cuda` returns
`{:error, %WhisperCt2.Error{reason: :invalid_request}}` if CUDA is
unavailable - do not assume it succeeds.
- `compute_type: :default` keeps the stored quantisation of the model
(recommended for `Systran/faster-whisper-*` int8 builds).
`compute_type: :auto` lets ct2rs pick the fastest supported on-device.
Do not hardcode `:float16` / `:int8_float16` unless you know the target
hardware supports it - mismatches raise `:load_error`.
## Model files
`load_model/2` needs a directory containing:
```
model.bin
config.json
tokenizer.json
vocabulary.txt
preprocessor_config.json
```
`Systran/faster-whisper-*` ships the first four. `preprocessor_config.json`
must be copied from any `openai/whisper-*` repo (all sizes share the file).
A missing `preprocessor_config.json` is the most common `:load_error`
cause; check this first when load fails.
## Backend selection at install time
The published Hex package picks the right precompiled NIF from your target
triple automatically. Two consumer-facing knobs:
- `WHISPER_CT2_VARIANT=mkl` on `x86_64-unknown-linux-gnu` selects the Intel
MKL artefact instead of oneDNN. Only set this on Intel-only fleets.
- `WHISPER_CT2_BUILD=1` (or
`config :rustler_precompiled, :force_build, whisper_ct2: true`) forces a
source build. First build of CTranslate2 takes ~10 minutes and needs
Rust, CMake, and a C++17 toolchain. Do not enable this in CI unless you
understand the cost.
x86_64 macOS and Windows are not shipped - source build only.
## Do not
- Do not call `load_model/2` per transcription.
- Do not pass `.wav` (or any other file) paths, raw bare binaries,
encoded WAV bytes, mp3, opus, or non-16 kHz audio to `transcribe/3` -
the audio contract is `{:pcm_f32, binary}` only. Decode and resample
upstream (`ffmpeg -ar 16000 -ac 1 -f f32le`).
- Do not assume `device: :cuda` succeeds; check `available_devices/0` or
use `:auto`.
- Do not share a single `%Model{}` to get parallel inference; pool replicas.
- Do not catch `:nif_panic` and retry blindly - it indicates a bug worth
reporting.
- Do not hardcode `16_000` as the sample rate - read `model.sampling_rate`.
- Do not pass `:language` to a `*.en` checkpoint and expect anything but
English; check `model.multilingual` if the language is dynamic.
- Do not regex segment text for `<|t_..|>` tokens - segment timestamps
are real fields (`:start`, `:end`) populated from the model output.
- Do not loop `transcribe/3` over a list of short clips when
`transcribe_batch/3` would batch them through one encoder pass.
- Do not pass control tokens like `<|en|>` inside `:initial_prompt` or
`:prefix`; they are tokenised as plain text and will not behave as
special tokens.