Speech-to-text: ASR

Unlike transcription, which is performed after a call ends, ASR (automatic speech recognition) works during a call and lets you either recognize a word from a set of given variants or perform "freeform" recognition of arbitrary speech.

ASR is provided via the ASR module, which should be added to a scenario via the require syntax. The module is used in the following way:

  1. Create an ASR object by calling the VoxEngine.createASR method.
  2. Subscribe to ASR object events such as ASREvents.Result.
  3. Send media from a call object to the ASR object via the sendMediaTo method.
  4. Receive recognized text via events (see the sketch after this list).

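Put together, a minimal scenario covering these four steps might look like the sketch below. Note that this is an illustrative assumption rather than a snippet from this article: it assumes an inbound call handled via AppEvents.CallAlerting and simply logs every recognition result.

require(Modules.ASR);

VoxEngine.addEventListener(AppEvents.CallAlerting, e => {
  const call = e.call;
  call.answer();
  call.addEventListener(CallEvents.Connected, () => {
    // 1. Create an ASR object
    const asr = VoxEngine.createASR(ASRLanguage.ENGLISH_US);
    // 2. Subscribe to ASR object events
    asr.addEventListener(ASREvents.Result, ev => {
      // 4. Recognized text arrives via the Result event
      Logger.write(`Recognized: ${ev.text}`);
    });
    // 3. Send media from the call object to the ASR object
    call.sendMediaTo(asr);
  });
});
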
During a normal ASR workflow, it fires two types of events. The SpeechCaptured event is fired after a voice followed by a pause is recorded and sent for recognition. That event is handy if you want to recognize a single answer like "yes" or "no" and stop recognition right after it via the call object's stopMediaTo method. To recognize an answer among known alternatives, like "yes" or "no", it's handy to provide ASR with a word list, which works the same way as for transcription:

require(Modules.ASR);
// Record two-way audio (stereo is optional) and transcribe it after
// recording is stopped
call.record({
  language: ASRLanguage.ENGLISH_US,
  transcribe: true,
  stereo: true
});

Use the following code if you want to build an IVR with real-time recognition of words or phrases from a specified array:

require(Modules.ASR);
//..
call.say("Choose your color", Language.US_ENGLISH_FEMALE);
call.addEventListener(CallEvents.PlaybackFinished, () => {
  call.sendMediaTo(asr);
});
//...
const asr = VoxEngine.createASR(
  ASRLanguage.ENGLISH_US,
  ["Yellow", "Green", "Red", "Blue", "White", "Black"]);
asr.addEventListener(ASREvents.Result, e => {
  if (e.confidence > 0) {
    call.say(`You have chosen ${e.text} color, confidence is ${e.confidence}`,
      Language.US_ENGLISH_FEMALE);
  }
  else {
    call.say("Couldn't recognize your answer", Language.US_ENGLISH_FEMALE);
  }
});
asr.addEventListener(ASREvents.SpeechCaptured, () => {
  call.stopMediaTo(asr);
});
//...

The Result event is fired after voice is recognized. There is always a delay between capture and recognition, so plan user interaction accordingly. The following example shows how to use the Result event for streaming recognition of arbitrary text:

require(Modules.ASR);
let full_result = "";
let ts = null;
//..
call.say("Please start saying someting", Language.US_ENGLISH_FEMALE);
call.addEventListener(CallEvents.PlaybackFinished, () => {
  call.sendMediaTo(asr);
});
//...
// Removing the dictionary to use freeform recognition
const asr = VoxEngine.createASR(ASRLanguage.ENGLISH_US);
asr.addEventListener(ASREvents.Result, e => {
  // Recognition results arrive here
  full_result += (e.text + " ");
  // If CaptureStarted isn't fired within 5 seconds, stop recognition
  ts = setTimeout(() => asr.stop(), 5000);
});
asr.addEventListener(ASREvents.SpeechCaptured, () => {
  // After speech has been captured, don't stop sending media to the ASR:
  // call.stopMediaTo(asr);
});
asr.addEventListener(ASREvents.CaptureStarted, () => {
  // Clear timeout if CaptureStarted has been fired
  clearTimeout(ts);
});
//...

The CaptureStarted event can fire due to background noise. Voximplant's own VAD (voice activity detection) can be used to mitigate that:

call.handleMicStatus(true);
call.addEventListener(CallEvents.MicStatusChange, e => {
  if (e.active) {
    // speech started
  } else {
    // speech ended
  }
});
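For example, one possible way to combine the two (a sketch under assumptions, not code from this article) is to stream audio to the ASR only while VAD reports active speech:

require(Modules.ASR);
// Sketch: recognize speech only while Voximplant's VAD reports
// that the caller is actually talking, to reduce spurious
// CaptureStarted events caused by background noise
const asr = VoxEngine.createASR(ASRLanguage.ENGLISH_US);
asr.addEventListener(ASREvents.Result, e => {
  // Handle recognized text here
  Logger.write(`Recognized: ${e.text}`);
});
call.handleMicStatus(true);
call.addEventListener(CallEvents.MicStatusChange, e => {
  if (e.active) {
    // Speech started: begin sending the caller's audio to the ASR
    call.sendMediaTo(asr);
  } else {
    // Speech ended: stop streaming until the caller speaks again
    call.stopMediaTo(asr);
  }
});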