Voximplant. Blog

Speech-to-text: ASR

Unlike transcription, which is performed after a call ends, ASR (automatic speech recognition) works during a call. It can either recognize a word from a given list of variants or perform “freeform” recognition of arbitrary speech.

ASR is provided by the ASR module, which must be added to a scenario with the require syntax. The module is used as follows:

  1. Create an ASR object by calling the VoxEngine.createASR method.
  2. Subscribe to ASR object events such as ASREvents.Result.
  3. Send media from a call object to the ASR object with the sendMediaTo method.
  4. Receive the recognized text via events.

During a normal ASR workflow, two types of events are fired. The SpeechCaptured event is fired after speech followed by a pause has been recorded and sent for recognition. This event is handy if you want to recognize a single answer such as “yes” or “no” and then stop recognition by calling the call object's stopMediaTo method. To recognize an answer among known alternatives, such as “yes” or “no”, it is handy to provide ASR with a word list, which works the same way as for transcription:

    require(Modules.ASR);
    // Record two-way audio (stereo is optional) and transcribe it after recording is stopped
    call.record({language: ASRLanguage.ENGLISH_US, transcribe: true, stereo: true});

Use the following code if you want to build an IVR with real-time recognition of words or phrases from a specified array:

    require(Modules.ASR);
    //..
    mycall.say("Choose your color", Language.US_ENGLISH_FEMALE);
    mycall.addEventListener(CallEvents.PlaybackFinished, function (e) {
        mycall.sendMediaTo(myasr);
    });
    //...
    var myasr = VoxEngine.createASR(
        ASRLanguage.ENGLISH_US,
        ["Yellow", "Green", "Red", "Blue", "White", "Black"]);
    myasr.addEventListener(ASREvents.Result, function (e) {
        if (e.confidence > 0) mycall.say("You have chosen " + e.text + " color, confidence is " + e.confidence, Language.US_ENGLISH_FEMALE);
        else mycall.say("Couldn't recognize your answer", Language.US_ENGLISH_FEMALE);
    });
    myasr.addEventListener(ASREvents.SpeechCaptured, function (e) {
        mycall.stopMediaTo(myasr);
    });
    //...
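
The handler above accepts any result whose confidence is above zero. As a minimal sketch of a stricter check, the helper below maps the recognized text back to one of the dictionary entries and rejects low-confidence results. The name `matchChoice` and the default threshold of 50 are hypothetical and not part of the Voximplant API; the scale of `e.confidence` should be checked against the ASR documentation before choosing a real value.

```javascript
// Hypothetical helper, not part of the Voximplant API.
// Returns the matching dictionary entry, or null if the result
// is missing, below the threshold, or not in the dictionary.
function matchChoice(text, confidence, choices, minConfidence) {
    // Assumed default threshold; tune it for your own scenario.
    if (minConfidence === undefined) minConfidence = 50;
    if (typeof text !== "string" || !(confidence >= minConfidence)) return null;
    // Compare case-insensitively and ignore surrounding whitespace.
    var normalized = text.trim().toLowerCase();
    for (var i = 0; i < choices.length; i++) {
        if (choices[i].toLowerCase() === normalized) return choices[i];
    }
    return null;
}
```

Inside the Result handler, such a helper could be called as `matchChoice(e.text, e.confidence, ["Yellow", "Green", "Red"])`, with the prompt repeated whenever it returns null.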

The Result event is fired after speech is recognized. There is always a delay between capture and recognition, so plan user interaction accordingly. The following example shows how to use the Result event for streaming recognition of arbitrary text:

    require(Modules.ASR);
    var full_result = "", ts;
    //..
    mycall.say("Please start saying something", Language.US_ENGLISH_FEMALE);
    mycall.addEventListener(CallEvents.PlaybackFinished, function (e) {
        mycall.sendMediaTo(myasr);
    });
    //...
    // Omit the dictionary to use freeform recognition
    var myasr = VoxEngine.createASR(ASRLanguage.ENGLISH_US);
    myasr.addEventListener(ASREvents.Result, function (e) {
        // Recognition results arrive here
        full_result += e.text + " ";
        // Drop any timer left over from a previous result so that timers do not stack
        clearTimeout(ts);
        // If CaptureStarted is not fired within 5 seconds, stop recognition
        ts = setTimeout(recognitionEnded, 5000);
    });
    myasr.addEventListener(ASREvents.SpeechCaptured, function (e) {
        // After speech has been captured, keep sending media to ASR
        /*mycall.stopMediaTo(myasr);*/
    });
    myasr.addEventListener(ASREvents.CaptureStarted, function () {
        // Clear the timeout if CaptureStarted has been fired
        clearTimeout(ts);
    });
    function recognitionEnded() {
        // Stop recognition
        myasr.stop();
    }
    //...
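
The example above manages the silence timeout through a bare timer handle. A minimal sketch of the same idea wrapped in a small object is shown below, so that re-arming always replaces the pending timer. The name `createSilenceWatchdog` and its `arm`/`cancel` methods are hypothetical, and the optional `timers` argument exists only so that the logic can be exercised without real delays.

```javascript
// Hypothetical helper, not part of the Voximplant API.
// Calls onSilence once if arm() is not followed by cancel() or
// another arm() within delayMs milliseconds.
function createSilenceWatchdog(onSilence, delayMs, timers) {
    // Default to the global timer functions; a fake pair can be injected.
    timers = timers || {
        setTimeout: function (fn, ms) { return setTimeout(fn, ms); },
        clearTimeout: function (handle) { clearTimeout(handle); }
    };
    var handle = null;

    function cancel() {
        if (handle !== null) {
            timers.clearTimeout(handle);
            handle = null;
        }
    }

    function arm() {
        cancel(); // never keep more than one pending timer
        handle = timers.setTimeout(function () {
            handle = null;
            onSilence();
        }, delayMs);
    }

    return { arm: arm, cancel: cancel };
}
```

In the scenario, one could create `var watchdog = createSilenceWatchdog(recognitionEnded, 5000);`, call `watchdog.arm()` in the Result handler, and call `watchdog.cancel()` in the CaptureStarted handler.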

The CaptureStarted event can be triggered by background noise. Voximplant's own VAD (voice activity detection) can be used to mitigate that:

    mycall.handleMicStatus(true);
    mycall.addEventListener(CallEvents.MicStatusChange, function (e) {
        if (e.active) {
            // speech started
        } else {
            // speech ended
        }
    });
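
One possible way to combine the two signals is sketched below, under the assumption that a CaptureStarted event should be trusted only while the VAD reports active speech. The `createSpeechGate` helper and its method names are hypothetical. The text above does not prescribe how the VAD state should be used, and the relative timing of the two events should be verified in a real scenario.

```javascript
// Hypothetical helper, not part of the Voximplant API.
// Remembers the latest VAD state so that other handlers can consult it.
function createSpeechGate() {
    var micActive = false;
    return {
        // Call from the MicStatusChange handler with e.active.
        onMicStatus: function (active) {
            micActive = !!active;
        },
        // Call from the CaptureStarted handler; true means the VAD
        // currently reports speech, so the capture is likely not noise.
        isSpeech: function () {
            return micActive;
        }
    };
}
```

With `var gate = createSpeechGate();`, the MicStatusChange handler could call `gate.onMicStatus(e.active)`, and the CaptureStarted handler could run `clearTimeout(ts)` only when `gate.isSpeech()` returns true.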
