Speech recognition
You can recognize live speech during an active call or conference.
Voximplant offers two modes of speech recognition: Phrase hint and Freeform. Phrase hint mode recognizes user input from a predefined list of phrases, making it suitable for IVRs and voice-interactive dialogs. On the other hand, Freeform mode recognizes all user input and transcribes it into text, which is useful for transcribing human conversations into text files.
You can use various speech recognition engines, such as Google, Amazon, Microsoft, Yandex, or Tinkoff. You can find the profile names in the API reference.
The Phrase hint mode is supported by the Google profile only. Google does not limit the results to the specified list, but the words from the list have a higher chance of being selected.
Speech recognition usage
- Import the ASR module into your scenario via the require method:
require(Modules.ASR)
- Use the VoxEngine.createASR method to create an ASR object.
- Subscribe to the ASR object's events, such as ASREvents.Result.
- Send media from a call object to the ASR object via the sendMediaTo method.
- Receive the results via the subscribed events (see the sketch after this list).
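Here is a minimal sketch of these steps. The profile name (ASRProfileList.Google.en_US) is an assumption; pick any supported profile from the API reference:

```js
require(Modules.ASR);

VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
  const call = e.call;
  call.answer();
  call.addEventListener(CallEvents.Connected, () => {
    // Create an ASR object with one of the supported profiles
    const asr = VoxEngine.createASR({
      profile: ASRProfileList.Google.en_US, // assumed example value
    });
    // Receive the results via the Result event
    asr.addEventListener(ASREvents.Result, (ev) => {
      Logger.write('Recognized: ' + ev.text);
    });
    // Send the call's audio to the ASR object
    call.sendMediaTo(asr);
  });
  call.addEventListener(CallEvents.Disconnected, () => VoxEngine.terminate());
});
```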
To use the Phrase hint mode for speech recognition, specify the phraseHints parameter in addition to the speech recognition profile. Remember that the Phrase hint mode is supported by the Google profile only. Refer to the following code example to understand how it works.
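A sketch of the Phrase hint mode, assuming a Google profile; the hint phrases are illustrative:

```js
require(Modules.ASR);

// An ASR object that favors the listed phrases during recognition
const asr = VoxEngine.createASR({
  profile: ASRProfileList.Google.en_US, // assumed profile
  phraseHints: ['sales', 'support', 'billing'], // illustrative IVR options
});
asr.addEventListener(ASREvents.Result, (ev) => {
  Logger.write('Recognized: ' + ev.text);
});
call.sendMediaTo(asr); // "call" is a connected call object from your scenario
```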
If you use the Freeform mode, the Result event triggers every time an utterance is recognized. However, there is always a delay between capturing the audio and recognizing it, so plan user interaction accordingly.
To understand how this mode works, refer to the following code example.
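A sketch of the Freeform mode under the same assumptions; the singleUtterance flag is also an assumption, included to keep recognition running across utterances:

```js
require(Modules.ASR);

// Freeform recognition: no phrase hints, every utterance is transcribed
const asr = VoxEngine.createASR({
  profile: ASRProfileList.Google.en_US, // assumed profile
  singleUtterance: false, // assumed flag: keep recognizing until stopped
});
asr.addEventListener(ASREvents.Result, (ev) => {
  // The event arrives with a delay after the audio is captured
  Logger.write('Utterance: ' + ev.text);
});
call.sendMediaTo(asr); // "call" is a connected call object from your scenario
```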
Transcribing a call
Use the record method to transcribe a call or a conference into a text file. Set the transcribe parameter to true and the language parameter to one of the supported languages.
See the following code example to understand how it works:
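A sketch, assuming the ASRLanguage.ENGLISH_US language value; use any supported language:

```js
VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
  const call = e.call;
  call.answer();
  call.addEventListener(CallEvents.Connected, () => {
    // Record the call and transcribe it into a text file
    call.record({
      transcribe: true,
      language: ASRLanguage.ENGLISH_US, // assumed language value
    });
  });
  call.addEventListener(CallEvents.Disconnected, () => VoxEngine.terminate());
});
```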
Unlike audio and video recordings, transcription results are not available immediately after a call ends. Therefore, you must retrieve them using the GetCallHistory method of the HTTP API. To use this method, specify the with_records=true parameter.
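A hypothetical request; the account_id, api_key, and date range values are placeholders:

```
https://api.voximplant.com/platform_api/GetCallHistory?account_id=100000&api_key=YOUR_API_KEY&from_date=2024-01-01%2000:00:00&to_date=2024-01-02%2000:00:00&with_records=true
```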
The records in the response JSON contain the transcription_url field. Following this URL returns the transcription as plain text:
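An illustrative fragment of such a file (the exact layout may differ depending on the provider):

```
Left: Hello, I would like to check my order status.
Right: Sure, could you tell me your order number, please?
```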
Each line in the transcription file is prefixed with “Left” to indicate an audio stream originating from a call endpoint and directed to the Voximplant cloud. Conversely, “Right” is used to denote an audio stream originating from the Voximplant cloud and directed to a call endpoint. This naming convention follows the same logic as used for the left and right audio channels in stereo recording.
"Left" and "Right" names can be changed via the labels parameter. The dict parameter allows you to specify an array of words that the transcriber tries to match in case of recognition problems. Specifying domain-specific words can improve transcription results a lot.
You can find the complete list of supported transcribing providers and the list of available languages in the API Reference.
Google's beta STT features usage
Google provides access to its Speech API v1p1beta1 features, and Voximplant supports them as well.
Currently, Voximplant speech recognition supports the following features:
- enableSeparateRecognitionPerChannel
- alternativeLanguageCodes
- enableWordTimeOffsets
- enableWordConfidence
- enableAutomaticPunctuation
- diarizationConfig
- metadata
To use these features, set the beta parameter to true when creating an ASR instance.
Refer to the API reference to learn about the parameters.
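A sketch of enabling the beta features; the profile name is an assumption, and the beta parameter comes from this article:

```js
require(Modules.ASR);

// Enable Google's Speech API v1p1beta1 features
const asr = VoxEngine.createASR({
  profile: ASRProfileList.Google.en_US, // assumed profile
  beta: true,
});
asr.addEventListener(ASREvents.Result, (ev) => {
  // With beta features enabled, the event carries extra properties
  Logger.write(ev.text + ' / ' + ev.resultEndTime + ' / ' + ev.channelTag + ' / ' + ev.languageCode);
});
call.sendMediaTo(asr); // "call" is a connected call object from your scenario
```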
Here is what the request's result looks like:
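The exact shape depends on the enabled features; an illustrative result with made-up values:

```
{
  // other parameters
  "text": "check one two three",
  "resultEndTime": "2.100s",
  "channelTag": 1,
  "languageCode": "en-us"
}
```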
With Google beta features, the ASREvents.Result event updates with the resultEndTime, channelTag, and languageCode properties. You can see them in the session logs.
Passing parameters directly to the provider
There are two ways to pass speech recognition parameters to your provider. You can fill in the ASRParameters parameters on the Voximplant side, as explained in this article, so the platform converts them to the provider's format and sends them to your provider. Alternatively, you can pass the parameters directly to the provider in the request parameter.
Please ensure that you specify the parameters in the format that your provider accepts. Different providers use different formats, so refer to your provider’s API reference for more information.
Here are examples of the request parameter for the most common providers:
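The sketches below cover two providers. The profile names and the provider-native field names are assumptions; check each provider's API reference for the exact format:

```js
require(Modules.ASR);

// Google: provider-native recognition parameters in the request parameter
const googleAsr = VoxEngine.createASR({
  profile: ASRProfileList.Google.en_US, // assumed profile
  request: {
    enableAutomaticPunctuation: true, // Google STT field (assumed placement)
    model: 'phone_call',              // Google STT field (assumed placement)
  },
});

// Tinkoff: provider-native parameters use snake_case
const tinkoffAsr = VoxEngine.createASR({
  profile: ASRProfileList.Tinkoff.ru_RU, // assumed profile
  request: {
    enable_automatic_punctuation: true, // Tinkoff STT field (assumed placement)
  },
});
```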
Emotions and gender recognition
Some voice recognition providers offer additional features like emotion and gender recognition, but these features vary depending on the provider and their specific implementation. For more information on whether a particular feature is supported by your provider, refer to their API documentation.
For example, Tinkoff ASR supports gender recognition, and SaluteSpeech ASR supports emotion recognition.
To receive this information, pass the request to the ASR in the request parameter, as described in the Passing parameters directly to the provider section of this article.
Let us request emotion recognition from SaluteSpeech. Pass the emotions_result parameter in the request parameter as shown below:
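A sketch, assuming a SaluteSpeech profile name; the emotions_result parameter comes from this article:

```js
require(Modules.ASR);

// SaluteSpeech: request emotion recognition via a provider-native parameter
const asr = VoxEngine.createASR({
  profile: ASRProfileList.SaluteSpeech.ru_RU, // assumed profile
  request: {
    emotions_result: true,
  },
});
```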
In return, SaluteSpeech provides the emotion recognition result in addition to the speech recognition result:
```
{
  // other parameters
  "response": {
    "emotionsResult": {
      "negative": 0.003373496,
      "neutral": 0.996082962,
      "positive": 0.000543531496
    }
  },
  "text": "check one two three"
}
```
Let us request gender recognition from Tinkoff in the same way. Pass the enable_gender_identification parameter in the request parameter:
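A sketch, assuming a Tinkoff profile name; the enable_gender_identification parameter comes from this article:

```js
require(Modules.ASR);

// Tinkoff: request gender identification via a provider-native parameter
const asr = VoxEngine.createASR({
  profile: ASRProfileList.Tinkoff.ru_RU, // assumed profile
  request: {
    enable_gender_identification: true,
  },
});
```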
In return, Tinkoff provides gender identification probabilities in addition to the speech recognition result:
```
{
  // other parameters
  "response": {
    "results": [
      {
        "recognitionResult": {
          "genderIdentificationResult": {
            "femaleProba": 0.831116796,
            "maleProba": 0.168883175
          }
        }
      }
    ]
  },
  "text": "check one two three"
}
```