Voximplant has released an option for enabling Google Cloud Speech-to-Text (STT) Beta features. With this capability, Voximplant customers can leverage enhanced models and speech adaptation boost for improved transcription, detect the spoken language for multilingual applications, separate multiple speakers on the same channel in transcripts with diarization, and use word-level confidence information for improved response handling. This feature is available today to all customers as part of Voximplant’s ASR (Automatic Speech Recognition) module. These features are part of our standard ASR integration and can be accessed in minutes with only a couple of lines of code.

Beta features and Voximplant Google Cloud STT support

Google provides advanced features under its v1p1beta1 Speech API version that are not included in the standard API. Although Google does not provide a specific SLA or committed technical support for Beta features, many Voximplant customers do use these capabilities in production. These features often move into Google’s generally available (GA) product after a few months, so the Beta API serves as a way to leverage the latest capabilities or preview what’s coming.

Voximplant supports the v1 Speech API and v1p1beta1 Speech API options as described in the table below:

| Feature | Beta Required | What it does | Uses |
|---|---|---|---|
| alternativeLanguageCodes | | Multilingual input and identification of up to 3 languages in addition to the one specified in profile | Allow IVR users to speak in their language of choice without specifying "press 1 for English"; detect the language in speech analytics applications |
| beta | | Use to enable Beta features | |
| diarizationConfig | Yes | Separate speakers by an ID in transcription results, even if there are multiple speakers on the same call leg | Improved transcription understanding, such as when a call is transferred, the phone is handed to someone else, or the call comes from a conference room |
| enableAutomaticPunctuation | | Adds punctuation to transcription results | Improved transcription readability |
| enableWordConfidence | Yes | Show transcription confidence on a per-word basis | Act when a specific keyword exceeds a certain confidence threshold in IVR or speech analytics alerts |
| enableWordTimeOffsets | | Show when each transcribed word begins and ends | Map transcriptions to specific points in a recording; analyze pauses and fast talking in speech analytics applications |
| interimResults | | Get partial results before a speech input has ended | Respond quickly or interrupt automatically based on words or phrases in IVR applications |
| metadata | | Share extra speech information with Google for use in their transcription engine | |
| model | For enhanced models | Improve transcription results for specific use cases; enhanced models use Google's most accurate transcription technology | Use command_and_search_enhanced for IVR applications and phone_call_enhanced for phone call transcription use cases |
| phraseHints | | Old method of specifying a custom dictionary | Improve STT results for industry jargon, names, and other frequently used language that is often transcribed incorrectly |
| profanityFilter | | Shows a censored version of profane words | Obscure profanity when displaying transcription output to users |
| profile | | Specify the language code to use (required) | |
| singleUtterance | | Enables/disables single utterance mode for shorter phrases | IVR applications that take shorter inputs should enable singleUtterance; transcription and speech analytics applications generally disable it |
| speechContexts | For Boost features | New method of specifying a custom dictionary | Improve STT results for industry jargon, names, and other frequently used language that is often transcribed incorrectly; Boost allows weighting sets of words for more precise transcription improvement |

Note that speaker separation by audio channel is not supported today, but a similar result can be achieved by invoking multiple ASR instances and attaching an individual call leg to each.
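As a minimal sketch of that workaround, assuming two already-connected call legs here named inbound and outbound (the variable names are illustrative only):

// Sketch: one ASR instance per call leg so each speaker is transcribed separately.
// `inbound` and `outbound` are assumed to be two connected Call objects.
const callerAsr = VoxEngine.createASR({ profile: ASRProfileList.Google.en_US });
const agentAsr = VoxEngine.createASR({ profile: ASRProfileList.Google.en_US });
inbound.sendMediaTo(callerAsr);    // only the caller's audio
outbound.sendMediaTo(agentAsr);    // only the other party's audio
callerAsr.addEventListener(ASREvents.Result, (ev) => Logger.write(`CALLER: ${ev.text}`));
agentAsr.addEventListener(ASREvents.Result, (ev) => Logger.write(`OTHER PARTY: ${ev.text}`));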

Examples

Leveraging Google’s Speech Beta features from VoxEngine is easy. Let’s look at a few examples below. 

All of these examples follow this basic VoxEngine template:

require(Modules.ASR);

VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
   let call = e.call;
   call.answer();
   call.say('The initial prompt to say after answering goes here');
   call.addEventListener(CallEvents.PlaybackFinished, () => {
       call.sendMediaTo(betaAsr);
   });

   /* *** UPDATE THE CONFIGURATION BELOW *** */
   const betaAsr = VoxEngine.createASR({
       alternativeLanguageCodes: [],
       beta: true,
       diarizationConfig: {},
       enableAutomaticPunctuation: false,
       enableWordConfidence: false,
       enableWordTimeOffsets: true,
       interimResults: false,
       metadata: {
           audioTopic: ""
       },
       model: ASRModelList.Google.phone_call_enhanced,
       // phraseHints: [], //Old method; use speechContexts instead
       profanityFilter: false,
       profile: ASRProfileList.Google.en_US,
       singleUtterance: true,
       speechContexts: [],
   });

   betaAsr.addEventListener(ASREvents.SpeechCaptured, () => {
       call.stopMediaTo(betaAsr);
   });

   betaAsr.addEventListener(ASREvents.Result, e => {
       Logger.write(`DEBUG:: RAW ASR Result: ${JSON.stringify(e)}`);
       /* *** UPDATE THE RESPONSE BELOW *** */
       let response = `Response to demonstrate handling the data goes here`;
       response = e.confidence > 0 ? response : "Sorry, I didn't understand that";
       call.say(`<speak>${response}</speak>`, VoiceList.Default);
       call.addEventListener(CallEvents.PlaybackFinished, () => VoxEngine.terminate())
   });
})

You can see the full set of createASR options in the documentation for Voximplant’s Google Speech-to-Text integration.

The examples below will focus on:

  • updating the call.say prompts to guide the user,
  • updating the betaAsr configuration object, and
  • handling the returned ASR results in the betaAsr.addEventListener(ASREvents.Result, ...) block.

See Voximplant’s Getting Started Guide and How to use ASR for more information on setting up VoxEngine.

Enabling Beta and Enhanced Models

Note the beta: true parameter in the VoxEngine.createASR configuration. This is required to enable Beta features. Selecting the primary language to recognize is mandatory and is done in the `profile` field. In addition, in the model field you can choose default_enhanced, phone_call_enhanced, or command_and_search_enhanced for improved transcription results on your calls.
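For example, a minimal createASR configuration that turns on Beta features with an enhanced model (using the same fields as the template above) could look like this:

const betaAsr = VoxEngine.createASR({
    beta: true,                                      // required to enable Beta features
    model: ASRModelList.Google.phone_call_enhanced,  // enhanced model for phone audio
    profile: ASRProfileList.Google.en_US             // primary recognition language (required)
});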

Multilingual input

In many places, callers may speak more than one language. Rather than forcing customers into a single language via an IVR tree (e.g. “press 1 for English”), the multilingual input option allows recognition of up to 4 languages: the primary language plus up to 3 alternatives.

To demonstrate this, let’s start by replacing the betaAsr configuration with the following:

   const betaAsr = VoxEngine.createASR({
       alternativeLanguageCodes: ['es-ES', 'ru-RU', 'de-DE'],
       beta: true,
       // enhanced mode not avail for multilingual
       model: ASRModelList.Google.command_and_search,
       profile: ASRProfileList.Google.en_US,
       singleUtterance: true
   });

Note Google’s alternativeLanguageCodes option does not work with the enhanced speech recognition models.

Replace the result section with:

betaAsr.addEventListener(ASREvents.Result, e => {
       Logger.write(`DEBUG:: RAW ASR Result: ${JSON.stringify(e)}`);
       const langMap = {'en-us': "english", 'es-es': "spanish", 'de-de': "german", 'ru-ru': "russian" };
       let response = `<s>You said <lang xml:lang="${e.languageCode}">${e.text}</lang> in ${langMap[e.languageCode] || 'unknown language'}, confidence is ${e.confidence}</s>`;
       response = e.confidence > 0 ? response : "Sorry, I didn't understand that";
       call.say(`<speak>${response}</speak>`, VoiceList.Default);
       call.addEventListener(CallEvents.PlaybackFinished, () => VoxEngine.terminate())
   });

Then if you inspect the raw result in Voximplant’s call history logs, you will see something like this with the transcription in e.text and detected language in e.languageCode:

{
  "id": "d4jZPxAJSkugYH6PIrxNDWffrmM-dkE8mYngGW9mns4",
  "name": "ASR.Result",
  "requestId": "",
  "source": "6597EEA43D952034.1599695085.409633",
  "asr": {},
  "eventSourceField": "asr",
  "text": "was ist das",
  "confidence": 90,
  "isFinal": true,
  "languageCode": "de-de",
  "resultEndTime": "2.720s"
}
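From there, an IVR can act on the detected language, for example by continuing the dialog in that language. Below is a sketch of that idea for the ASREvents.Result handler; the prompt strings are purely illustrative:

// Illustrative follow-up prompts keyed by the detected language code
const prompts = {
    'en-us': 'How can I help you?',
    'es-es': '¿En qué puedo ayudarle?',
    'de-de': 'Wie kann ich Ihnen helfen?',
    'ru-ru': 'Чем я могу вам помочь?'
};
// Fall back to English if the detected language is not in the map
const langCode = prompts[e.languageCode] ? e.languageCode : 'en-us';
call.say(`<speak><lang xml:lang="${langCode}">${prompts[langCode]}</lang></speak>`, VoiceList.Default);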

Speech Adaptation and Boost

Voximplant’s Google Cloud Speech-to-Text integration has supported a `phraseHints` array of strings that could be used to improve the recognition of specific words. This is useful for improving the transcription of words not found in a language’s general vocabulary, such as company names (e.g. “Voximplant”), product names, industry jargon, and other commonly said phrases that might otherwise be mistranscribed. Google Speech-to-Text now uses a `speechContexts` object for this, with the ability to specify the language. In addition, when Beta is enabled, Google Cloud offers a speech adaptation boost feature for prioritizing certain words and phrases. Simply give a phrase array a relative weight between 0 and 20 to increase that word set’s adaptation priority.

Below is an example of a VoxEngine.createASR configuration:

const betaAsr = VoxEngine.createASR({
   beta: true,
   model: ASRModelList.Google.phone_call_enhanced,
   profile: ASRProfileList.Google.en_US,
   // OLD METHOD
   // phraseHints: ["yellow", "green", "red", "blue", "white", "black", "gelb", "grun", "rot", "blau", "weiss", "orange", "schwarz", "lila"],
   speechContexts: [{
       "languageCode": "en-US",
       "phrases": ["yellow", "green", "red", "blue", "white", "orange", "black", "purple"],
       "boost": 20
    }, {
        "languageCode": "de-DE",
        "phrases": ["gelb", "grun", "rot", "blau", "weiss", "orange", "schwarz", "lila"],
        "boost": 10
    }]
});

Word-level Timings and Confidence

The Google Cloud Speech-to-Text engine provides transcription confidence at the utterance level, where the Google engine determines what constitutes an utterance. The Google STT API also returns a resultEndTime value. For developers who need more precision, Voximplant offers enableWordConfidence for word-level transcription confidence and enableWordTimeOffsets to provide timing information for each word.

The VoxEngine.createASR configuration to enable both of these features looks like:

const betaAsr = VoxEngine.createASR({
   beta: true,
   enableWordConfidence: true,
   enableWordTimeOffsets: true,
   model: ASRModelList.Google.phone_call_enhanced,
   profile: ASRProfileList.Google.en_US
});

This returns an array of word objects that looks like:

[
  {
    "confidence": 0.912838578,
    "endTime": "2.100s",
    "startTime": "1.800s",
    "word": "voximplant"
  },
  {
    "confidence": 0.912838578,
    "endTime": "3.300s",
    "startTime": "2.100s",
    "word": "Google"
  },
  {
    "confidence": 0.912838578,
    "endTime": "3.700s",
    "startTime": "3.300s",
    "word": "speech"
  },
  {
    "confidence": 0.668478847,
    "endTime": "4.100s",
    "startTime": "3.700s",
    "word": "beta"
  }
]

You can access these in the ASREvents.Result event object:

betaAsr.addEventListener(ASREvents.Result, e => {
    Logger.write(`DEBUG:: RAW ASR Result: ${JSON.stringify(e)}`);
    let response = `<s>You said</s>`;
    e.words.forEach(word => {
        response += `<s>${word.word} at ${word.startTime} with confidence ${word.confidence}</s>`;
    });
    response = e.confidence > 0 ? response : "Sorry, I didn't understand that";
    call.say(`<speak>${response}</speak>`, VoiceList.Default);
    call.addEventListener(CallEvents.PlaybackFinished, () => VoxEngine.terminate());
});
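As noted in the feature table, word-level confidence is useful for acting only when a keyword is recognized with enough certainty. Inside the same ASREvents.Result handler, a sketch of that idea might look like the following; the keyword list and the 0.8 threshold are arbitrary examples, not part of the Voximplant API:

// Flag keywords only when they are recognized with high confidence
const keywords = ['cancel', 'refund'];   // arbitrary example keywords
const threshold = 0.8;                   // arbitrary confidence cutoff
const hits = (e.words || []).filter(w =>
    keywords.includes(w.word.toLowerCase()) && w.confidence >= threshold);
if (hits.length > 0) {
    Logger.write(`High-confidence keyword(s) detected: ${hits.map(w => w.word).join(', ')}`);
}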

Pricing

Voximplant provides speech recognition from Google at $0.0085 per 15-second interval for the standard (default) models. The enhanced models are $0.0127 per 15-second interval. For example, a 36-second enhanced transcription would cost $0.0381: three 15-second intervals (two full intervals and one partial) multiplied by $0.0127.
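As a quick sketch, the same interval arithmetic in code (assuming any partial interval is billed as a full 15-second interval, as in the example above):

// Cost estimate for an enhanced-model transcription; rates from the paragraph above
const ENHANCED_RATE = 0.0127;                       // USD per 15-second interval
const durationSeconds = 36;
const intervals = Math.ceil(durationSeconds / 15);  // 3 intervals (2 full + 1 partial)
Logger.write(`Estimated cost: $${(intervals * ENHANCED_RATE).toFixed(4)}`); // $0.0381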

See https://voximplant.com/pricing for more pricing details.

Learn More or Try It

Voximplant is constantly adding to its speech integrations and other products, so stay tuned! Check out our documentation here, sign up for an account here, or contact us now to discuss your needs.