Speech-to-text: ASR

Unlike transcription, which is performed after a call ends, ASR (automatic speech recognition) works during a call. It either recognizes a word from a given list of variants or performs "freeform" recognition of arbitrary speech.

ASR is provided by the ASR module, which must be loaded into a scenario with the require syntax. The module is used as follows:

  1. Create an ASR object by calling the VoxEngine.createASR method
  2. Subscribe to ASR object events such as ASREvents.Result
  3. Send media from a call object to the ASR object via the sendMediaTo method
  4. Receive recognized text via events

During a normal workflow, ASR triggers two types of events. The SpeechCaptured event is triggered after speech followed by a pause has been recorded and sent for recognition. This event is handy if you want to recognize a single answer like "yes" or "no" and stop recognition afterwards via the stopMediaTo method. To recognize an answer among known alternatives, like "yes" or "no", it's handy to provide ASR with a word list, which works the same way as for transcribing:

require(Modules.ASR);
// Record two-way audio (stereo is optional) and transcribe it after
// recording is stopped
call.record({
  language: ASRProfileList.Google.en_US,
  transcribe: true,
  stereo: true
});

Use the following code if you want to build an IVR with real-time recognition of words or phrases from a specified array:

require(Modules.ASR);

VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
    const call = e.call;
    const voice = { "language": VoiceList.Amazon.en_US_Joanna };

    // Limit recognition to the expected answers
    const asr = VoxEngine.createASR({
        profile: ASRProfileList.Google.en_US,
        phraseHints: ["Yellow", "Green", "Red", "Blue", "White", "Black"]
    });

    // Start sending audio to ASR once the prompt has finished playing.
    // The listener removes itself so that it doesn't fire again after the reply.
    const startRecognition = () => {
        call.removeEventListener(CallEvents.PlaybackFinished, startRecognition);
        call.sendMediaTo(asr);
    };

    call.answer();
    call.say("Choose your color", voice);
    call.addEventListener(CallEvents.PlaybackFinished, startRecognition);

    // One answer is enough: stop sending audio after the first captured phrase
    asr.addEventListener(ASREvents.SpeechCaptured, () => {
        call.stopMediaTo(asr);
    });

    asr.addEventListener(ASREvents.Result, (e) => {
        if (e.confidence > 0) {
            call.say(`You have chosen ${e.text} color, confidence is ${e.confidence}`, voice);
        } else {
            call.say("Couldn't recognize your answer", voice);
        }
        // End the session after the reply has been played
        call.addEventListener(CallEvents.PlaybackFinished, () => VoxEngine.terminate());
    });
});
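
The Result handler above accepts any text with a non-zero confidence, even if it is not one of the offered colors. As a minimal sketch, that check could live in a small pure function. The name matchChoice, the minConfidence parameter, and its default of 50 are illustrative assumptions and not part of the Voximplant API; the confidence scale may differ between ASR profiles, so choose the threshold accordingly.

```javascript
// Hypothetical helper: maps recognized text to one of the expected answers.
// Returns the matching choice (as written in `choices`) or null.
function matchChoice(text, confidence, choices, minConfidence = 50) {
  if (typeof text !== "string" || confidence < minConfidence) {
    return null;
  }
  const normalized = text.trim().toLowerCase();
  const found = choices.find((choice) => choice.toLowerCase() === normalized);
  return found === undefined ? null : found;
}

const colors = ["Yellow", "Green", "Red", "Blue", "White", "Black"];
console.log(matchChoice(" green ", 87, colors)); // Green
console.log(matchChoice("purple", 95, colors));  // null
console.log(matchChoice("red", 10, colors));     // null
```

Inside the Result handler, the scenario could call matchChoice(e.text, e.confidence, colors) and repeat the prompt when it returns null.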

The Result event is triggered after the speech has been recognized. There is always a delay between capture and recognition, so plan user interaction accordingly. The following code structure shows how to use the Result event for streaming recognition of arbitrary speech:

require(Modules.ASR);
let full_result = "";
let ts = null;
//..
call.say("Please start saying something", { "language": VoiceList.Amazon.en_US_Joanna });
call.addEventListener(CallEvents.PlaybackFinished, () => {
  call.sendMediaTo(asr);
});
//...
// No word list is passed, so ASR performs freeform recognition
const asr = VoxEngine.createASR({ profile: ASRProfileList.Google.en_US });
asr.addEventListener(ASREvents.Result, e => {
  // Recognition results arrive here
  full_result += (e.text + " ");
  // If CaptureStarted isn't triggered within 5 seconds, stop recognition.
  // Clear any pending timer first so that only one is active at a time.
  clearTimeout(ts);
  ts = setTimeout(() => asr.stop(), 5000);
});
asr.addEventListener(ASREvents.SpeechCaptured, () => {
  // After speech has been captured - don't stop sending media to ASR
  // call.stopMediaTo(asr);
});
asr.addEventListener(ASREvents.CaptureStarted, () => {
  // Clear timeout if CaptureStarted has been triggered
  clearTimeout(ts);
});
//...
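
The timeout logic above can be isolated from the scenario, which makes it easier to reason about. The following is a minimal sketch rather than Voximplant API: createTranscriptCollector and its option names are hypothetical, and the timer functions are injectable only so that the logic can be exercised outside VoxEngine.

```javascript
// Hypothetical helper: accumulates recognized phrases and reports the full
// text once no new speech has started for `silenceMs` after the last result.
function createTranscriptCollector({
  silenceMs,
  onDone,
  setTimer = setTimeout,
  clearTimer = clearTimeout
}) {
  let fullResult = "";
  let timer = null;

  const cancel = () => {
    if (timer !== null) {
      clearTimer(timer);
      timer = null;
    }
  };

  return {
    // Call from the ASREvents.Result handler
    onResult(text) {
      fullResult += text + " ";
      cancel(); // never keep more than one pending timer
      timer = setTimer(() => {
        timer = null;
        onDone(fullResult.trim());
      }, silenceMs);
    },
    // Call from the ASREvents.CaptureStarted handler
    onCaptureStarted() {
      cancel();
    }
  };
}

// Usage with a short timeout so the example finishes quickly
const collector = createTranscriptCollector({
  silenceMs: 50,
  onDone: (text) => console.log(`Final text: ${text}`)
});
collector.onResult("hello");
collector.onCaptureStarted(); // the caller keeps talking: timer is cancelled
collector.onResult("world");  // prints "Final text: hello world" after 50 ms
```

In a scenario, onDone would typically call asr.stop() and hand the collected text to the next step of the dialog.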

The CaptureStarted event can also be triggered by background noise. Voximplant VAD (voice activity detection) can be used to mitigate that:

call.handleMicStatus(true);
call.addEventListener(CallEvents.MicStatusChange, e => {
  if (e.active) {
    // speech started
  } else {
    // speech ended
  }
});
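
One way to combine the two signals, sketched under assumptions: remember the latest VAD state and treat a CaptureStarted event as real speech only while VAD reports voice activity. createSpeechGate and its method names are hypothetical helpers, not Voximplant API. The relative order of MicStatusChange and CaptureStarted is not stated here, so treat this as a starting point to verify in your own scenario.

```javascript
// Hypothetical helper: tracks the VAD state reported by MicStatusChange
function createSpeechGate() {
  let micActive = false;
  return {
    // Call from the CallEvents.MicStatusChange handler with e.active
    onMicStatus(active) {
      micActive = Boolean(active);
    },
    // Call from the ASREvents.CaptureStarted handler; true means "real speech"
    isSpeech() {
      return micActive;
    }
  };
}

const gate = createSpeechGate();
console.log(gate.isSpeech()); // false: no voice activity reported yet
gate.onMicStatus(true);
console.log(gate.isSpeech()); // true
gate.onMicStatus(false);
console.log(gate.isSpeech()); // false
```

In the streaming example, the CaptureStarted handler would then call clearTimeout(ts) only when gate.isSpeech() returns true.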