What is Automatic Speech Recognition?

With automatic speech recognition, you can convert voice inputs into text. Here, we review how automatic speech recognition systems work.

Automatic speech recognition, or ASR, is a technology that converts speech to text, often in real time. ASR is also called speech-to-text (STT) or simply transcription. You've used ASR if you've ever talked to a virtual assistant such as Siri or Alexa. The technology also powers automated subtitling, smart home devices, and in-car systems.

How Does It Work?

ASR systems usually rely on three major components: the lexicon, the acoustic model, and the language model. Together they decode the audio signal and produce the most likely transcription.


Lexicon

Some words can be pronounced in multiple ways. For instance, the word 'read' is pronounced differently depending on whether it is in the present or the past tense. The lexicon maps each word to all of its possible pronunciations, written as sequences of phonemes.

ASR systems use phonetic sets customized for each language. One of the most widespread is ARPABET, which represents the phonemes and allophones of General American English.
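To make the idea concrete, here is a minimal sketch of a lexicon as a plain Python dictionary. The three entries and the helper names (`pronunciations`, `words_for`) are illustrative assumptions, not part of any particular ASR product; the phoneme symbols follow ARPABET without stress markers.

```python
# Toy pronunciation lexicon: each word maps to one or more ARPABET phoneme sequences.
# The entries below are hand-written for illustration only.
LEXICON = {
    "read": [["R", "IY", "D"], ["R", "EH", "D"]],  # present tense vs. past tense
    "red": [["R", "EH", "D"]],
    "reed": [["R", "IY", "D"]],
}


def pronunciations(word):
    """Return all known phoneme sequences for a word (empty list if unknown)."""
    return LEXICON.get(word.lower(), [])


def words_for(phonemes):
    """Return every word that has a pronunciation matching the phoneme sequence."""
    target = list(phonemes)
    return sorted(word for word, prons in LEXICON.items() if target in prons)


print(pronunciations("read"))        # two pronunciations for one spelling
print(words_for(["R", "EH", "D"]))   # ['read', 'red']
```

The second call shows why the lexicon alone is not enough: the same sounds can map to several words, so the system needs context to pick between them.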

Acoustic Model

The next stage splits the audio signal into short frames, each 25 ms long. The acoustic model analyzes each frame and estimates the probability of each phoneme being spoken in it. In other words, acoustic models aim to predict which sound is spoken in each frame.
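The framing step can be sketched in a few lines of Python. The 25 ms frame length comes from the description above; the 16 kHz sample rate and the 10 ms step between frame starts are assumptions chosen for this example (overlapping frames are common in practice, but actual values depend on the system). The function name `split_into_frames` is hypothetical.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Split a list of audio samples into fixed-length, overlapping frames.

    frame_ms is the frame length; hop_ms is the assumed step between
    the start of one frame and the start of the next.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    hop_len = sample_rate * hop_ms // 1000      # samples between frame starts
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


sample_rate = 16000                   # assumed sample rate: 16 kHz
one_second = [0.0] * sample_rate      # one second of silence as placeholder audio
frames = split_into_frames(one_second, sample_rate)
print(len(frames), len(frames[0]))    # 98 400
```

Each of these frames would then be passed to the acoustic model, which returns a probability for every phoneme in the phonetic set.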

Different people pronounce the same phrase in different ways. Factors like the speaker's gender, accent, and background noise change how it sounds. Acoustic models use deep neural networks trained on many hours of varied audio recordings and their transcripts to learn the relationship between audio frames and phonemes.

Language Model

Language models use the context of spoken phrases to compose word sequences. Traditionally, N-gram language models (an N-gram is a sequence of N consecutive words) predict the next word from the N-1 words that precede it.
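As a rough illustration, the simplest useful case is a bigram model (N = 2), which predicts the next word from the single previous word. The sketch below counts word pairs in a tiny made-up corpus; the corpus, the `<s>` and `</s>` boundary markers, and the function name `next_word_prob` are assumptions for this example, and a real model would be trained on far more text and would smooth unseen pairs.

```python
from collections import Counter, defaultdict

# Tiny illustrative training corpus (invented for this example).
corpus = [
    "i read a book",
    "i read a letter",
    "i wrote a letter",
]

# bigram_counts[prev][nxt] = how often `nxt` directly followed `prev`.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]  # sentence boundary markers
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1


def next_word_prob(prev, nxt):
    """Estimate P(nxt | prev) as a relative frequency; 0.0 if `prev` was never seen."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0


print(next_word_prob("i", "read"))    # "read" follows "i" in 2 of 3 sentences
print(next_word_prob("i", "book"))    # never observed, so 0.0
```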

Language models can also operate in a way similar to acoustic models: they use deep neural networks trained on text data to estimate the probability of the next word.

Together, the three components let an ASR system score candidate words and sentences for the audio input. The system then returns the most likely candidate as the transcription.
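A simplified way to picture this final choice is to add up, for each candidate transcription, a score from the acoustic model and a score from the language model, and keep the highest total. The sketch below assumes invented probabilities for two candidate phrases and a hypothetical `lm_weight` parameter for balancing the two models; real decoders search over a very large number of candidates rather than two.

```python
import math

# Hypothetical scores for two competing transcriptions of the same audio.
# "acoustic": how well the phrase's phonemes match the audio frames.
# "language": how plausible the word sequence is according to the language model.
candidates = {
    "recognize speech": {"acoustic": 0.40, "language": 0.020},
    "wreck a nice beach": {"acoustic": 0.45, "language": 0.0001},
}


def total_log_score(scores, lm_weight=1.0):
    """Combine both model scores in log space; lm_weight is an assumed tuning knob."""
    return math.log(scores["acoustic"]) + lm_weight * math.log(scores["language"])


best = max(candidates, key=lambda text: total_log_score(candidates[text]))
print(best)  # recognize speech
```

Here the second phrase matches the audio slightly better, but the language model considers it far less likely, so the first phrase wins overall.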

How Does Voximplant Leverage ASR?

Voximplant provides developers with ASR that captures the voice, transcribes it, and returns text during or after a call. You can also connect external recognition services via WebSocket.

Here are some areas where ASR is used in apps:

  • Smart IVR. Build apps that greet callers and route them to the right agents and departments using voice input instead of DTMF
  • Voice bots. Conduct automated survey research: let voice bots ask pre-recorded questions and analyze the responses in text form
  • Real-time transcription. Convert conversations between agents and customers to text to measure contact center performance and identify trouble areas
