CHANGELOG.md

Select File
# Changelog

All notable changes to `whisper_cpp` will be documented in this file. The
format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/);
this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.0] - 2026-06-11

### Added
- Built-in voice activity detection: pass `:vad_model_path` (a silero GGML
  model from `huggingface.co/ggml-org/whisper-vad`) to strip silence before
  the encoder, with `:vad_threshold`, `:vad_min_speech_ms`,
  `:vad_min_silence_ms`, and `:vad_speech_pad_ms` tuning options. Audio
  with no detected speech returns an empty transcription. The NIF runs the
  VAD itself and remaps all timestamps back to the original timeline -
  whisper.cpp's own VAD hook is dead code on the state-based API whisper-rs
  uses.

### Changed
- Native whisper.cpp/GGML logging is filtered to warnings and errors;
  the dozens of info lines per model load no longer reach stderr.
  `WHISPER_CPP_NATIVE_LOG` accepts `none`, `error`, `warn` (default),
  `info`, and `debug`. VAD contexts stay per-call: loading the silero
  model costs about a millisecond, and a shared context would serialise
  detection across concurrent transcribes.
- Integer options are bounded to `u32` and the VAD millisecond knobs to
  two minutes, returning `:invalid_request` instead of raising or
  overflowing inside the detector. Option validation now also runs
  before the sub-0.3 s `transcribe_slice` short-circuit, and an abort
  raised during the VAD pass is honoured before the encoder starts.
- `:duration_ms` must be at least 1. `0` previously meant "whole audio"
  without VAD but "empty window" with it; the ambiguity is rejected as
  `:invalid_request`.
- Passing `:vad_threshold`, `:vad_min_speech_ms`, `:vad_min_silence_ms`,
  or `:vad_speech_pad_ms` without `:vad_model_path` returns
  `:invalid_request` instead of being silently ignored.
- Buffers above `i32::MAX` samples (about 37 hours) are rejected instead
  of silently truncating at the FFI boundary.

### Fixed
- Multi-segment transcriptions no longer contain doubled spaces in
  `Transcription.text` (whisper segments carry their own leading space;
  the join added another). Space-free scripts no longer gain spurious
  spaces.
- `:temperature` is validated to `0.0..1.0` (above 1.0 whisper.cpp's
  retry ladder is empty and the decoder state undefined), `:n_threads`
  to GGML's 512-thread abort threshold, and `:beam_size`/`:best_of` to
  whisper.cpp's 8-decoder limit - all returning `:invalid_request`
  instead of native crashes or opaque inference errors.
- `:best_of` defaults to 5, matching whisper.cpp, and now also applies
  to temperature-fallback passes in beam-search mode.
- Sub-0.3 s `transcribe_slice` windows validate options, buffer bounds,
  and alignment before returning the documented empty transcription, and
  a window of exactly 0.3 s transcribes instead of being dropped by
  float subtraction error. The empty result keeps the pinned language.
- `translate: true` on English-only models returns `:invalid_request`
  instead of being silently ignored; `use_gpu: false` wins over a
  conflicting `:device`; invalid UTF-8 string options and non-keyword
  option lists return `:invalid_request` instead of raising.
- Native error messages no longer leak the internal "kind=..." routing
  tag; results with no decoded segments echo the requested language
  instead of fabricating "en"; progress percentages are clamped to the
  documented 0..100.
- `Pcm.slice/4` rounds sample positions instead of truncating, so
  millisecond-precise windows keep their last sample.
- Builds with two GPU features fail at compile time instead of silently
  picking one; unknown `WHISPER_CPP_VARIANT` values fail the build
  instead of falling back to the CPU artefact.
- `:abort_handle` and `:progress_pid` callbacks no longer leak memory per
  call: the vendored whisper-rs (branch `vendor/whisper-rs-0.16.0-patched`)
  fixes the abort-trampoline type confusion and the callback closure leak
  at the source (upstream issues 277/271, fix PR 278), replacing the
  downstream pre-boxing and sentinel workarounds. The progress sender
  thread now exits via natural channel close. The same vendor patch stops
  `set_language`, `set_initial_prompt`, and the VAD path from leaking one
  `CString` per call.

## [0.3.1] - 2026-06-11

### Changed
- Vendored whisper.cpp 1.8.3 -> 1.8.6. whisper-rs has no release vendoring
  anything newer, so `whisper-rs-sys` is patched via `[patch.crates-io]` to
  this repo's `vendor/whisper-rs-sys-1.8.6` branch - the published
  whisper-rs-sys 0.15.0 with only its whisper.cpp submodule bumped. The
  patch applies to source builds and the precompiled NIF artefacts alike,
  and is dropped as soon as upstream re-vendors (see issue #18).
- CI: `sccache-action` v0.0.9 -> v0.0.10 (Node 24; GitHub retires the
  Node 20 runtime on 2026-06-16).

## [0.3.0] - 2026-06-11

### Changed
- rustler 0.37 → 0.38 (Rust crate and optional Hex package). Additive
  upstream release; no NIF API changes needed. The vendored whisper.cpp
  stays at 1.8.3 until whisper-rs ships a release vendoring something
  newer - upstream's latest (0.16.0, 2026-03-12) predates whisper.cpp
  1.8.4.
- `language: nil` now actually auto-detects on multilingual models, as the
  docs always claimed. Previously `nil` silently fell through to
  whisper.cpp's forced-`"en"` default, decoding non-English audio as
  English. English-only models resolve `nil`/`"auto"` to `"en"`.
- `:language` is validated against whisper.cpp's language table. Unknown
  codes - including BCP 47 tags such as `"de-CH"` - return
  `:invalid_request` instead of silently corrupting the decoder prompt
  with an invalid language token. Passing a non-English language to an
  English-only model is rejected the same way instead of being silently
  ignored.
- The `:beam_size` and `:best_of` docs state the real defaults: greedy
  decoding with `best_of: 1`. The docs previously claimed a beam-search
  default of 5 that no code path produced.

### Fixed
- `:abort_handle` cancellation works now. The abort callback is passed to
  whisper-rs as a boxed trait object so the trampoline polls the real flag;
  the bare closure was reinterpreted memory (out-of-bounds reads) and the
  flag was never consulted, so cancellation silently did nothing.
- `:progress_pid` no longer leaks one OS thread per call. The progress
  sender thread is shut down explicitly after inference; the previous
  design waited for a channel close that whisper-rs's leaked callback
  closure could never trigger.
- `:word_timestamps` no longer corrupts multibyte UTF-8. Token bytes are
  accumulated per word and converted once, so characters split across BPE
  tokens (umlauts and most non-Latin scripts) survive instead of turning
  into replacement characters.
- Dropping the last reference to a loaded model frees the whisper context
  on a detached thread instead of the garbage-collecting BEAM scheduler,
  which a multi-gigabyte free would stall.
- `{:pcm_f32, _}` buffers containing NaN or infinity samples are rejected
  with `:invalid_request` instead of being fed to inference.

## [0.2.0] - 2026-05-20

### Added
- `WhisperCpp.load_model/2`: GGML/GGUF model loading with `:cpu`, `:cuda`,
  `:hipblas`, `:vulkan`, `:metal`, `:coreml`, `:intel_sycl`, and `:auto`
  device selection.
- `WhisperCpp.transcribe/3`: full whisper.cpp transcription on
  `{:pcm_f32, binary}` buffers (little-endian f32 mono at 16 kHz) with
  segment, token, and optional per-word output.
- `WhisperCpp.transcribe_slice/4`: time-shifted per-slice transcription that
  reuses one decoded PCM buffer.
- `WhisperCpp.AbortHandle`: cooperative cancellation. Pass an `%AbortHandle{}`
  via `:abort_handle` and call `AbortHandle.abort/1` from another process to
  stop in-flight inference; the partial transcription produced before the
  abort is returned.
- `:progress_pid` transcribe option: receive `{:whisper_progress, pct}`
  messages as work advances; duplicate percentages are coalesced.
- `:word_timestamps` option for per-word timing.
- `WhisperCpp.available_devices/0`: backend introspection for the loaded NIF
  artefact.
- `WhisperCpp.Pcm`: PCM slicing helpers. Audio file decoding is intentionally
  out of scope; callers decode upstream (ffmpeg, Bumblebee, ...) and share one
  decoded PCM buffer across stages.
- Rustler NIF built on `whisper-rs`, with cargo features for `cuda`,
  `hipblas`, `vulkan`, `metal`, `coreml`, `intel-sycl`, `openblas`, and
  `openmp`. Inference does not serialise across processes sharing one loaded
  model.
- Precompiled NIF artefacts via `rustler_precompiled` for x86_64 / aarch64
  Linux (CPU, CUDA, hipBLAS variants) and aarch64 macOS (Metal).