# whisper_ct2
`whisper_ct2` is an Elixir library for running OpenAI Whisper speech-to-text
models inside the BEAM. It loads CTranslate2-converted Whisper models through a
Rustler NIF, so Elixir code can transcribe f32 PCM buffers without starting
Python or a separate inference service.

CTranslate2 is the speed-optimised C++ inference engine that powers
[`faster-whisper`](https://github.com/SYSTRAN/faster-whisper). It runs 4-8x
faster than vanilla `openai-whisper` on the same hardware and supports
int8 / int8-float16 quantisation and CUDA / oneDNN / MKL / Accelerate backends.
## Installation
```elixir
def deps do
  [{:whisper_ct2, "~> 0.5"}]
end
```
Installation downloads a precompiled NIF artefact matching your target triple
from the project's GitHub releases. No Rust toolchain or CMake is needed on
the consumer side.
### Source builds
Set `WHISPER_CT2_BUILD=1` in your environment (or
`config :rustler_precompiled, :force_build, whisper_ct2: true` in your parent
project) to compile from source instead. The first source build of CTranslate2
takes ~10 minutes and requires:
- Rust toolchain (`rustup`, stable)
- `cmake`, a C++17 compiler, `make`
- Linux: `libstdc++`, `libgomp` available at link time
- CUDA toolkit 12+ if building with `cuda` or `cuda-dynamic` features
## Models
Point `WhisperCt2.load_model/2` at a directory containing a CTranslate2-converted
Whisper model. Required files:
```text
model.bin
config.json
tokenizer.json
vocabulary.txt
preprocessor_config.json
```
The [`Systran/faster-whisper-*`](https://huggingface.co/Systran) repositories
ship the first four directly. They do **not** include
`preprocessor_config.json`; copy the canonical one from `openai/whisper-tiny.en`
(or any other `openai/whisper-*` repo - all Whisper sizes share the same file):
```bash
uvx hf download Systran/faster-whisper-tiny.en \
  --local-dir models/faster-whisper-tiny.en
uvx hf download openai/whisper-tiny.en preprocessor_config.json \
  --local-dir models/faster-whisper-tiny.en
```
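Because a missing `preprocessor_config.json` is easy to overlook, it can help to check the directory before loading. The sketch below is a hypothetical helper (`ModelDirCheck` is not part of `whisper_ct2`) that only tests for the five file names listed above; it does not validate their contents.

```elixir
defmodule ModelDirCheck do
  # Hypothetical helper, not part of whisper_ct2.
  # File names are taken from the "Required files" list above.
  @required ~w(model.bin config.json tokenizer.json vocabulary.txt preprocessor_config.json)

  # Returns :ok when every required file exists, otherwise
  # {:error, missing} with the absent file names in list order.
  def check(dir) do
    case Enum.reject(@required, &File.regular?(Path.join(dir, &1))) do
      [] -> :ok
      missing -> {:error, missing}
    end
  end
end
```

Calling `ModelDirCheck.check("models/faster-whisper-tiny.en")` before `WhisperCt2.load_model/2` turns an incomplete download into an explicit list of missing files.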
## Backends
The published Hex package ships four precompiled artefacts; installation
selects the right one automatically based on your target triple:
| Target triple | CPU backend | GPU | Notes |
| ----------------------------------- | ----------- | -------------- | ---------------------------------------------------- |
| `aarch64-apple-darwin` | Accelerate | none | Apple Silicon (M1+). Uses Accelerate / AMX paths. |
| `x86_64-unknown-linux-gnu` | oneDNN | `cuda-dynamic` | Default x86_64 binary; runs well on Intel and AMD. |
| `x86_64-unknown-linux-gnu` (`mkl`) | Intel MKL | `cuda-dynamic` | Intel-tuned variant. Opt in via env var (below). |
| `aarch64-unknown-linux-gnu` | oneDNN | `cuda-dynamic` | Graviton/Grace, optional CUDA on GH200-class hosts. |

`cuda-dynamic` defers loading `libcudart` until first GPU use, so each artefact
still runs on hosts without CUDA installed. `:device` selection picks CUDA when
available, otherwise CPU.

Artefacts for x86_64 macOS and Windows are not shipped.
### Selecting the MKL variant
For Intel-only fleets where you want maximum SGEMM throughput:
```bash
WHISPER_CT2_VARIANT=mkl mix deps.compile whisper_ct2
```
`rustler_precompiled` reads this env var at install time and selects the `--mkl`
artefact instead of the default.
### Build from source with a custom backend
For source builds you can pick any combination of `ct2rs` features:
```bash
WHISPER_CT2_BUILD=1 WHISPER_CT2_FEATURES="dnnl cuda-dynamic" mix compile
# other options: mkl, openblas, accelerate, cuda, cuda-dynamic
```
### Runtime device selection
```elixir
WhisperCt2.available_devices()
#=> {:ok, %{cpu: 1, cuda: 1, cuda_supported: true}}

{:ok, model} =
  WhisperCt2.load_model("models/faster-whisper-tiny.en",
    device: :auto,        # :cpu | :cuda | :auto (default)
    compute_type: :auto,  # :default | :auto | :float16 | :int8_float16 | ...
    device_indices: [0]
  )
```
`:auto` picks CUDA when the artefact supports it and at least one CUDA device
is visible; otherwise CPU. Explicit `:cuda` returns
`{:error, %WhisperCt2.Error{reason: :invalid_request}}` if either condition
fails.
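If you want to make the same decision yourself (for example to log it, or to pair each device with a compute type), the rule described above can be sketched as a pure function over the map returned by `available_devices/0`. `DeviceChoice` is a hypothetical module, and the compute types it pairs with each device are an application-level assumption, not library defaults.

```elixir
defmodule DeviceChoice do
  # Hypothetical helper mirroring the documented :auto rule; this is not
  # the library's own selection code.

  # CUDA only when the artefact supports it AND at least one device is visible.
  def pick(%{cuda_supported: true, cuda: n}) when is_integer(n) and n > 0, do: :cuda
  def pick(%{}), do: :cpu

  # Assumed pairing of device and compute type; adjust to your own needs.
  def load_opts(devices) do
    case pick(devices) do
      :cuda -> [device: :cuda, compute_type: :int8_float16]
      :cpu -> [device: :cpu, compute_type: :auto]
    end
  end
end
```

With `{:ok, devices} = WhisperCt2.available_devices()`, the result of `DeviceChoice.load_opts(devices)` can be passed as the second argument to `WhisperCt2.load_model/2`.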
## Usage
```elixir
{:ok, model} = WhisperCt2.load_model("models/faster-whisper-tiny.en")

# Decode/resample to 16 kHz mono f32 PCM upstream (ffmpeg, Membrane,
# anything that produces little-endian f32 bytes).
pcm = File.read!("jfk.pcm")

{:ok, %WhisperCt2.Transcription{text: text, segments: segs}} =
  WhisperCt2.transcribe(model, {:pcm_f32, pcm}, language: "en")

IO.puts(text)
#=> "And so, my fellow Americans ask not what your country can do for you ..."

for s <- segs do
  IO.puts("[#{s.start}-#{s.end}] (no_speech=#{Float.round(s.no_speech_prob, 3)}) #{s.text}")
end
```
`%WhisperCt2.Segment{}` carries absolute `:start` / `:end` seconds,
`:no_speech_prob`, `:avg_logprob`, the underlying text token IDs, and
(when `:word_timestamps` is on) a list of `%WhisperCt2.Word{}` with
per-word timing.
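As an illustration of consuming segment timing, the following sketch renders segments as SRT subtitle text. `SrtSketch` is a hypothetical module that relies only on the `:start`, `:end`, and `:text` fields described above, so it works with plain maps as well as `%WhisperCt2.Segment{}` structs.

```elixir
defmodule SrtSketch do
  # Hypothetical formatter, not part of whisper_ct2.

  # Renders a list of segments as SRT text, numbering cues from 1.
  def render(segments) do
    segments
    |> Enum.with_index(1)
    |> Enum.map_join("\n", fn {seg, index} ->
      stop = Map.fetch!(seg, :end)
      "#{index}\n#{stamp(seg.start)} --> #{stamp(stop)}\n#{String.trim(seg.text)}\n"
    end)
  end

  # Formats seconds as HH:MM:SS,mmm (the SRT timestamp layout).
  def stamp(seconds) do
    total_ms = round(seconds * 1000)
    hours = div(total_ms, 3_600_000)
    minutes = div(rem(total_ms, 3_600_000), 60_000)
    secs = div(rem(total_ms, 60_000), 1000)
    millis = rem(total_ms, 1000)
    "#{pad(hours, 2)}:#{pad(minutes, 2)}:#{pad(secs, 2)},#{pad(millis, 3)}"
  end

  defp pad(n, width), do: n |> Integer.to_string() |> String.pad_leading(width, "0")
end
```

Writing `SrtSketch.render(segs)` to a file gives a subtitle track for the `segs` list from the example above.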
### Audio contract
CTranslate2 expects **mono `f32` PCM samples** at the model's sample rate
(16 kHz for every published Whisper checkpoint), normalised to the
`-1.0..1.0` range. `transcribe/3` and `transcribe_batch/3` accept exactly
one shape:
- `{:pcm_f32, binary}` - little-endian f32 samples at the model's
  sample rate.

Anything else (paths, raw bare binaries, WAV bytes, MP3, 44.1 kHz, ...)
is rejected at the boundary with an `:invalid_request` error. There is
no bundled audio decoder; decode, downmix, and resample upstream using
your tool of choice. For a one-shot file conversion:
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -f f32le output.pcm
```
Audio longer than 30 s is chunked into Whisper windows automatically; the
encoder runs once across every chunk in the batch.
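If your upstream pipeline produces 16-bit integer PCM rather than f32, the conversion to the required format is small enough to do in Elixir. The sketch below is a hypothetical helper (`PcmConvert` is not part of `whisper_ct2`); it assumes the input is already mono at the model's sample rate and only changes the sample format.

```elixir
defmodule PcmConvert do
  # Hypothetical helper, not part of whisper_ct2.

  # Converts signed 16-bit little-endian PCM to little-endian f32,
  # scaling each sample into the -1.0..1.0 range.
  def s16le_to_f32le(pcm) when is_binary(pcm) and rem(byte_size(pcm), 2) == 0 do
    for <<sample::little-signed-integer-16 <- pcm>>, into: <<>> do
      <<(sample / 32_768)::little-float-32>>
    end
  end

  # Duration in seconds of an f32 PCM binary (4 bytes per sample).
  def duration_seconds(f32_pcm, sample_rate \\ 16_000)
      when is_binary(f32_pcm) and rem(byte_size(f32_pcm), 4) == 0 do
    byte_size(f32_pcm) / 4 / sample_rate
  end
end
```

The converted binary can then be wrapped as `{:pcm_f32, PcmConvert.s16le_to_f32le(raw)}`. Resampling and downmixing still have to happen upstream.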
### Batched transcribe and word timestamps
```elixir
# Diarization-driven workflow: one master decode upstream, many short
# splices fed in as PCM byte ranges.
samples = File.read!("call.pcm")
turns =
  [
    WhisperCt2.Pcm.slice(samples, 16_000, 0.0, 3.2),
    WhisperCt2.Pcm.slice(samples, 16_000, 3.2, 4.5)
    # ...
  ]
  |> Enum.map(fn {:ok, bin} -> {:pcm_f32, bin} end)

{:ok, transcriptions} =
  WhisperCt2.transcribe_batch(model, turns, language: "en", word_timestamps: true)
```
`transcribe_batch/3` stacks every chunk of every input into one encoder
forward pass. `:word_timestamps` adds one batched DTW alignment pass and
attaches `%Word{}` entries to each segment.
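To make the "PCM byte ranges" idea concrete, here is a minimal sketch of time-based slicing over an f32 buffer. It is an illustration only and not the implementation of `WhisperCt2.Pcm.slice/4`: `PcmSliceSketch` is a hypothetical module, it assumes the last two arguments are start and end times in seconds, and its error atoms are invented for the example.

```elixir
defmodule PcmSliceSketch do
  # Illustration only: NOT the implementation of WhisperCt2.Pcm.slice/4.
  # Assumes arguments are (pcm, sample_rate, start_seconds, end_seconds).
  @bytes_per_sample 4

  def slice(pcm, sample_rate, start_s, end_s)
      when is_binary(pcm) and start_s >= 0 and end_s > start_s do
    # Convert times to whole-sample byte offsets so cuts never split a sample.
    first = round(start_s * sample_rate) * @bytes_per_sample
    last = min(round(end_s * sample_rate) * @bytes_per_sample, byte_size(pcm))

    if first < last do
      {:ok, binary_part(pcm, first, last - first)}
    else
      {:error, :out_of_range}
    end
  end

  def slice(_pcm, _sample_rate, _start_s, _end_s), do: {:error, :invalid_range}
end
```

Because `binary_part/3` returns a sub-binary that references the original buffer, slicing many turns out of one decoded recording does not copy the audio.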
### Decoding biases
```elixir
WhisperCt2.transcribe(model, {:pcm_f32, talk_pcm},
  language: "en",
  initial_prompt: "Discussion of CTranslate2, BEAM, and Whisper internals.",
  prefix: "Welcome back to the show."
)
```
`:initial_prompt` prepends free-text context (via `<|startofprev|>`) so the
decoder is biased toward your domain vocabulary or speaker style;
`:prefix` forces the start of the generated transcript.
## Options
`transcribe/3` and `transcribe_batch/3` accept any subset of:
| Option | Type | Notes |
| ------------------------------ | ------------------- | ------------------------------------------------------ |
| `:language` | `String.t \| nil` | ISO code (`"en"`). `nil` auto-detects on multilingual. |
| `:initial_prompt` | `String.t \| nil` | Free-text context prepended via `<\|startofprev\|>`. |
| `:prefix` | `String.t \| nil` | Forced text the generation must start with. |
| `:word_timestamps` | `boolean` | Attach per-word timing via a batched DTW alignment. |
| `:with_timestamps` | `boolean` | Emit `<\|t_..\|>` segment timestamps (default `true`). `false` for fine-tunes that emit plain text. |
| `:beam_size` | `pos_integer` | Beam-search width. |
| `:patience` | `float` | Beam-search patience. |
| `:length_penalty` | `float` | Decoding length penalty. |
| `:repetition_penalty` | `float` | Decoding repetition penalty. |
| `:no_repeat_ngram_size` | `non_neg_integer` | Disallow repeated n-grams of this size. |
| `:sampling_temperature` | `float` | Sampling temperature. |
| `:sampling_topk` | `pos_integer` | Top-k sampling. |
| `:suppress_blank` | `boolean` | Suppress the initial blank token. |
| `:suppress_tokens` | `[integer]` | Suppress these token IDs. |
| `:max_length` | `pos_integer` | Max tokens per chunk. |
| `:num_hypotheses` | `pos_integer` | Number of decoded hypotheses. |
| `:max_initial_timestamp_index` | `non_neg_integer` | Cap the first timestamp token. |

Unset values use the CTranslate2 defaults. `no_speech_prob` and
`avg_logprob` are always populated on each segment; no option is needed to
enable them.

Unknown option keys and out-of-range values return
`{:error, %WhisperCt2.Error{reason: :invalid_request}}` before reaching the
NIF.
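Applications that expose only a few of these options to their own callers may want a thin layer that merges defaults and rejects typos early. The sketch below uses `Keyword.validate/2` (Elixir 1.13+). `TranscribeOpts` is a hypothetical module, and both the allowed keys and the default values are application-level assumptions, not library defaults.

```elixir
defmodule TranscribeOpts do
  # Hypothetical app-level wrapper; the library performs its own validation.
  # These defaults are assumptions made for this example only.
  @defaults [language: "en", beam_size: 5, word_timestamps: false]

  # Merges overrides into the defaults. Keys outside @defaults are rejected.
  def build(overrides) when is_list(overrides) do
    with {:ok, opts} <- Keyword.validate(overrides, @defaults),
         :ok <- check_beam_size(opts[:beam_size]) do
      {:ok, opts}
    else
      {:error, unknown} when is_list(unknown) -> {:error, {:unknown_options, unknown}}
      {:error, _reason} = error -> error
    end
  end

  defp check_beam_size(n) when is_integer(n) and n > 0, do: :ok
  defp check_beam_size(n), do: {:error, {:invalid_beam_size, n}}
end
```

A successful `{:ok, opts}` can be passed straight to `transcribe/3`; a misspelled key such as `:beem_size` comes back as `{:error, {:unknown_options, [:beem_size]}}`.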
## Errors
All failures return `{:error, %WhisperCt2.Error{}}`. `reason` is one of
`:invalid_request`, `:load_error`, `:inference_error`, `:runtime_error`,
`:nif_panic`, or `:native_error`. The struct also implements `Exception`, so
`raise/1` works.
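Since the error struct is an exception, a "bang"-style helper is straightforward to write in application code. The sketch below is a hypothetical helper (`ResultSketch` is not part of `whisper_ct2`) that unwraps `{:ok, value}` and raises any exception struct found in an error tuple.

```elixir
defmodule ResultSketch do
  # Hypothetical helper, not part of whisper_ct2.

  # Returns the value from {:ok, value}; raises the wrapped exception otherwise.
  def unwrap!({:ok, value}), do: value

  # Matches any exception struct, which includes %WhisperCt2.Error{}
  # given that it implements Exception.
  def unwrap!({:error, %{__exception__: true} = error}), do: raise(error)
end
```

For example, `model |> WhisperCt2.transcribe({:pcm_f32, pcm}, language: "en") |> ResultSketch.unwrap!()` returns the transcription or raises, which suits scripts and supervised workers that should crash on failure.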
## Testing
Unit tests run with no external dependencies:
```bash
mix test
```
The end-to-end transcription test downloads the `faster-whisper-tiny.en` model
(~75 MB) and the `jfk.wav` clip from the whisper.cpp samples:
```bash
mix test --include integration
```
Both downloads are cached under `test/fixtures/`. Set `WHISPER_CT2_REFRESH=1`
to redownload them.
## License
MIT. CTranslate2 itself is MIT-licensed. The bundled `ct2rs` crate links
CTranslate2 statically by default.