# Google Cloud Speech gRPC API client
[![Hex.pm](https://img.shields.io/hexpm/v/ex_google_stt.svg)](https://hex.pm/packages/ex_google_stt)
Elixir client for Google Speech-to-Text V2 streaming API using gRPC
## Installation
The package can be installed by adding `:ex_google_stt` to your list of dependencies in `mix.exs`:
```elixir
def deps do
[
{:ex_google_stt, "~> 0.4.5"}
]
end
```
## Configuration
This library uses [`Goth`](https://github.com/peburrows/goth) to obtain authentication tokens. It requires Google Cloud credendials to be configured. See [Goth's README](https://github.com/peburrows/goth#installation) for details.
Using Google's V2 API requires that you set a recognizer to use for your requests (see [here](https://cloud.google.com/speech-to-text/v2/docs/reference/rest/v2/projects.locations.recognizers#Recognizer])). It is a string like the following:
`projects/{project}/locations/{location}/recognizers/{recognizer}`
You can either set this in the config or send it as a configuration when starting the `TranscriptionServer`.
In the config:
```elixir
config :ex_google_stt, recognizer: "projects/{project}/locations/{location}/recognizers/_"
```
## Usage
### Introduction
The library is designed to abstract most of the GRPC logic, so I'll provide the most basic use of it here.
- In summary, we use a `TranscriptionServer` than handles the GRPC streams to Google.
- That `TranscriptionServer` is responsible for monitoring/opening the streams and to parse the responses.
- Send audio data (as binary) using `TranscriptionServer.process_audio(server_pid, audio_data)`
- The `TranscriptionServer` will then send the responses to the target pid, set when creating the server.
- The caller should define a `handle_info` that will receive the transcripts and handle eventual errors.
### Initial Configurations
When starting the `TranscriptionServer`, you can define a few configs:
```
- target - a pid to send the results to, defaults to self()
- language_codes - a list of language codes to use for recognition, defaults to ["en-US"]
- enable_automatic_punctuation - a boolean to enable automatic punctuation, defaults to true
- interim_results - a boolean to enable interim results, defaults to false
- recognizer - a string representing the recognizer to use, defaults to use the recognizer from the config
- model - a string representing the model to use, defaults to "latest_long". Be careful, changing to 'short' may have unintended consequences
- explicit_decoding_config - a struct with audio decoding parameters
```
Note that apart from the `interim_results` these configurations are better off set-up in the reconizer directly, so that you can control it without deploying any code.
See here for details: https://cloud.google.com/speech-to-text/v2/docs/recognizers
Basically, create a recognizer in GCP then add a system_env with the recognizer string on it.
### Example
```elixir
defmodule MyModule.Transcribing do
use GenServer
alias ExGoogleSTT.{Error, SpeechEvent, Transcript, TranscriptionServer}
...
def init(_opts) do
{:ok, transcription_server} = TranscriptionServer.start_link(target: self(), interim_results: true)
end
def handle_info({:got_new_speech, speech_binary}, state) do
TranscriptionServer.process_audio(state.server_pid, speech_binary)
end
def handle_info({:stt_event, %{Transcript{} = transcript}}, state) do
# Do whatever you need with the transcription
end
def handle_info({:stt_event, %SpeechEvent{event: :SPEECH_ACTIVITY_BEGIN}}, state) do
# You probably want to ignore these
end
def handle_info({:stt_event, :stream_timeout}, state) do
# You probably want to ignore these as well. This is only a simple GRPC timeout, when nothing is coming.
end
def handle_info({:response, %Error{status: some_status, message: message}}, state) do
# You might want to to log these, as they are real errors.
end
end
```
### Other usages
The library allows you define other response handling functions and even ditch the `GenServer` part of `TranscriptionServer` altogether.
## Notes
### Decoding
If you are not relying on auto decoding, you can specify the custom encoding
parameters of your audio stream.
```elixir
defmodule MyModule.Transcribing do
use GenServer
alias ExGoogleSTT.TranscriptionServer
alias Google.Cloud.Speech.V2.ExplicitDecodingConfig
def init(_opts) do
{:ok, transcription_server} =
TranscriptionServer.start_link(
target: self(),
interim_results: true,
explicit_decoding_config: %ExplicitDecodingConfig{
encoding: :LINEAR16,
sample_rate_hertz: 16000,
audio_channel_count: 1
}
)
end
```
### Infinite stream
Google's STT V2 knows when a sentence finishes, as long as there's some silence after it. When that happens, it'll return the transcription without ending the stream.
Therefore, as long as we keep the stream open, we can keep transcribing realtime speech.
A few points to notice though.
- The `model` must be `long` or `latest_long`. `short` will result in ending the stream after the first utterance.
- One must end the stream to ensure the transcription stops.
### Auto-generated modules
This library uses [`protobuf-elixir`](https://github.com/tony612/protobuf-elixir) and its `protoc-gen-elixir` plugin to generate Elixir modules from `*.proto` files for Google's Speech gRPC API. The documentation for the types defined in `*.proto` files can be found [here](https://cloud.google.com/speech-to-text/docs/reference/rpc/google.cloud.speech.v1)
### Tests
ALL the tests require communication with google, so you must have a google credentials configured to run them in this repo.
Tests with tag `:load_test` are excluded by default, since they can be a bit expensive to run, use `mix test --include load_test` to run them.
#### Fixture
A recording fragment in `test/fixtures` comes from an audiobook
"The adventures of Sherlock Holmes (version 2)" available on [LibriVox](https://librivox.org/the-adventures-of-sherlock-holmes-by-sir-arthur-conan-doyle/)
## Status
Current version of library supports only Streaming API and not tested in production. Treat this as experimental.
## License
Portions of this project are modifications based on work created by [Sofware Mansion](https://swmansion.com/) and used according to terms described in the Apache License 2.0. See [here](https://github.com/software-mansion-labs/elixir-gcloud-speech-grpc) for the original repository.
The work it is not endorsed by or affiliated with the original authors or their organizations.
The modifications are also licensed under Apache License 2.0.