lib/bumblebee/audio.ex

defmodule Bumblebee.Audio do
  @moduledoc """
  High-level tasks related to audio processing.
  """

  @typedoc """
  A term representing audio.

  Can be either of:

    * a 1-dimensional `Nx.Tensor` with audio samples

    * `{:file, path}` with path to an audio file (note that this
      requires `ffmpeg` installed)

  """
  @type speech_to_text_input :: Nx.t() | {:file, String.t()}
  @type speech_to_text_output :: %{results: list(speech_to_text_result())}
  @type speech_to_text_result :: %{text: String.t()}

  @doc """
  Builds serving for speech-to-text generation.

  The serving accepts `t:speech_to_text_input/0` and returns
  `t:speech_to_text_output/0`. A list of inputs is also supported.

  Note that either `:max_new_tokens` or `:max_length` must be specified.
  The generation should generally finish based on the audio input,
  however you still need to specify the upper limit.

  ## Options

    * `:max_new_tokens` - the maximum number of tokens to be generated,
      ignoring the number of tokens in the prompt

    * `:compile` - compiles all computations for predefined input shapes
      during serving initialization. Should be a keyword list with the
      following keys:

        * `:batch_size` - the maximum batch size of the input. Inputs
          are optionally padded to always match this batch size

      It is advised to set this option in production and also configure
      a defn compiler using `:defn_options` to maximally reduce inference
      time.

    * `:defn_options` - the options for JIT compilation. Defaults to `[]`

  Also accepts all the other options of `Bumblebee.Text.Generation.build_generate/3`.

  ## Examples

      {:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
      {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
      {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})

      serving =
        Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
          max_new_tokens: 100,
          defn_options: [compiler: EXLA]
        )

      Nx.Serving.run(serving, {:file, "/path/to/audio.wav"})
      #=> %{results: [%{text: "There is a cat outside the window."}]}

  """
  @spec speech_to_text(
          Bumblebee.model_info(),
          Bumblebee.Featurizer.t(),
          Bumblebee.Tokenizer.t(),
          keyword()
        ) :: Nx.Serving.t()
  defdelegate speech_to_text(model_info, featurizer, tokenizer, opts \\ []),
    to: Bumblebee.Audio.SpeechToText
end