Voximplant now has speech synthesis support for Microsoft Azure Text-to-Speech (TTS). With these new additions, Voximplant developers can now choose from more than 425 voice options covering 40 languages and 60 dialects from 6 distinct speech synthesis engine providers. Like Voximplant’s other speech options, you can add these voices to telephony apps in seconds for automated prompts, IVR, virtual agent applications, and more.

The new Azure TTS engine adds 116 new voice options covering 35 languages and 49 unique dialects. These include 36 lifelike neural voices based on the latest deep learning technology, several of which offer more advanced functionality through Microsoft’s proprietary speaking styles.

The variety of choices, with SSML support, gives Voximplant developers many options for choosing a voice that matches their brand and feature needs. Existing Azure Text-to-Speech users can quickly move their standard and neural TTS implementations to VoxEngine, Voximplant’s serverless platform for enabling a wide variety of telephony features.

Use the new Microsoft voices in seconds

Convenient access in VoxEngine

You can find the Microsoft Azure TTS voices under the VoiceList.Microsoft reference map with Voximplant’s other supported voices inside VoxEngine. Neural voices can be found under VoiceList.Microsoft.Neural. See here for the full list with their corresponding language codes. To use a voice, set it as the language property of the sayOptions parameter object in call.say, or of the ttsPlayerOptions parameter object in the VoxEngine.createTTSPlayer method.

Using the new voices in your VoxEngine scripts is as simple as:

VoxEngine.createTTSPlayer("Hello, you have created a TTSPlayer, have fun!", {"language": VoiceList.Microsoft.Neural.en_GB_LibbyNeural});
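The call.say route works the same way: the voice constant goes in the language field of the options object. The sketch below wraps that in a small helper. Note that sayWithVoice is a hypothetical name, not part of VoxEngine, and it takes the call and the voice as arguments only so that it can be read and exercised outside the platform.

```javascript
// Hypothetical helper (not part of VoxEngine): speak `text` on `call`
// using the given VoiceList entry.
function sayWithVoice(call, text, voice) {
  // call.say takes the phrase plus an options object with a "language" field,
  // mirroring the createTTSPlayer example above.
  call.say(text, { language: voice });
}

// Inside a VoxEngine scenario this would be called as, for example:
// sayWithVoice(e.call, "Hello from a neural voice",
//              VoiceList.Microsoft.Neural.en_GB_LibbyNeural);
```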

See our TTS documentation for more details.

Voximplant Kit

Microsoft Speech Synthesis is coming soon to Voximplant Kit. Keep an eye out for Microsoft options in the Synth Language dropdown in the Text to Speech, Interactive Menu, and Dialogflow Connector blocks.


Pricing

Microsoft TTS is priced at $5 USD per one million characters for standard voices and $20 USD per one million characters for neural voices. The character count of each phrase is rounded up to the next 10-character increment.

As an example, let’s use the phrase “let's do some tests with Microsoft Text-to-Speech” spoken in a neural voice. This phrase contains 49 characters. This is rounded up to the next 10 character increment, so 50 characters are charged. Neural voices are priced at $20/1M chars, or $0.00002 per character, resulting in a total charge of $0.001 (50 * $0.00002).

All characters sent to the speech engine are included in the cost calculation, including unspoken SSML tags. Chinese, Japanese, and Korean characters count as two characters.
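The billing rules above can be sketched as a small estimator. This is an illustration rather than Voximplant's actual billing code, and it rests on two assumptions: "Chinese, Japanese, and Korean characters" is taken to mean code points in the Han, Hiragana, Katakana, or Hangul scripts, and a phrase whose length is already a multiple of 10 is not rounded up further. The function and constant names are made up for this example.

```javascript
// Prices in USD per one million characters, taken from the text above.
const PRICE_PER_MILLION_USD = { standard: 5, neural: 20 };

// Assumption: CJK means the Han, Hiragana, Katakana, and Hangul scripts.
const CJK = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

// Count characters as billed: each code point counts once, CJK code points
// count twice, and the total is rounded up to a multiple of 10.
// Pass the full string sent to the engine, SSML tags included.
function billableChars(text) {
  let count = 0;
  for (const ch of text) {
    count += CJK.test(ch) ? 2 : 1;
  }
  return Math.ceil(count / 10) * 10;
}

// Estimate the charge for one phrase; voiceType is "standard" or "neural".
function estimateCostUsd(text, voiceType) {
  return (billableChars(text) * PRICE_PER_MILLION_USD[voiceType]) / 1e6;
}

const phrase = "let's do some tests with Microsoft Text-to-Speech";
console.log(billableChars(phrase), estimateCostUsd(phrase, "neural"));
// prints: 50 0.001
```

For the worked example above, the 49-character phrase is billed as 50 characters, which matches the $0.001 charge.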

See our pricing page for more pricing information.

Advanced Microsoft Speech Synthesis Options

SSML Support with Multi-voice capabilities

Speech Synthesis Markup Language (SSML) is supported on all Microsoft voices. Microsoft TTS also supports switching voices, even with different languages, in the same SSML statement. This is helpful for multilingual responses and for simulating multiple virtual agents using the same command.
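As a rough illustration of voice switching, the sketch below assembles one SSML statement from several voice and text pairs. The helper names (escapeXml, multiVoiceSsml) are hypothetical, and the short voice names are the ones listed in the speaking styles table; the Speech Studio export in the VoxEngine example uses a longer form of the voice name, so treat the exact naming as an assumption to verify.

```javascript
// Escape characters that are special in XML text and attribute values.
function escapeXml(s) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&apos;" };
  return s.replace(/[&<>"']/g, c => map[c]);
}

// Hypothetical helper: build one SSML statement from [{ voice, text }, ...].
function multiVoiceSsml(segments, lang = "en-US") {
  const body = segments
    .map(({ voice, text }) => `<voice name="${escapeXml(voice)}">${escapeXml(text)}</voice>`)
    .join("");
  return `<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="${lang}">${body}</speak>`;
}

// One statement, two voices, two languages.
const ssml = multiVoiceSsml([
  { voice: "en-US-AriaNeural", text: "Your order has shipped." },
  { voice: "zh-CN-XiaoxiaoNeural", text: "您的订单已发货。" },
]);

// In a VoxEngine scenario the result would be passed as the phrase, e.g.:
// call.say(ssml, { language: VoiceList.Microsoft.Neural.en_US_AriaNeural });
```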

Adjust speaking styles

In addition, the following neural voices support distinct speaking styles, giving you a quick way to adjust how the voice sounds in specific circumstances, such as customer service calls or situations that call for empathy:

| Voice | Style | Description |
|---|---|---|
| en-US-AriaNeural | style="newscast-formal" | A formal, confident and authoritative tone for news delivery |
| en-US-AriaNeural | style="newscast-casual" | A versatile and casual tone for general news delivery |
| en-US-AriaNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support |
| en-US-AriaNeural | style="chat" | Expresses a casual and relaxed tone |
| en-US-AriaNeural | style="cheerful" | Expresses a positive and happy tone |
| en-US-AriaNeural | style="empathetic" | Expresses a sense of caring and understanding |
| zh-CN-XiaoxiaoNeural | style="newscast" | Expresses a formal and professional tone for narrating news |
| zh-CN-XiaoxiaoNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support |
| zh-CN-XiaoxiaoNeural | style="assistant" | Expresses a warm and relaxed tone for digital assistants |
| zh-CN-XiaoxiaoNeural | style="lyrical" | Expresses emotions in a melodic and sentimental way |
| zh-CN-YunyangNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support |

See Voximplant’s SSML HowTo and Microsoft’s Azure Speech services SSML support page for full usage details.
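A minimal sketch of applying a speaking style follows. It checks the requested style against the voice and style pairs listed above and wraps the text in Microsoft's mstts:express-as element. The helper name withStyle is hypothetical. The attribute is written as style to match the list above; the Speech Studio export in the VoxEngine example later in this post uses type instead, so check which form your own export produces.

```javascript
// Voice and style pairs copied from the list above.
const STYLES = {
  "en-US-AriaNeural": ["newscast-formal", "newscast-casual", "customerservice", "chat", "cheerful", "empathetic"],
  "zh-CN-XiaoxiaoNeural": ["newscast", "customerservice", "assistant", "lyrical"],
  "zh-CN-YunyangNeural": ["customerservice"],
};

// Hypothetical helper: wrap `text` in an express-as element for `voice`.
// `text` must already be XML-safe (no raw &, <, or >).
function withStyle(voice, style, text, lang = "en-US") {
  if (!(STYLES[voice] || []).includes(style)) {
    throw new Error(`Style "${style}" is not listed for voice "${voice}"`);
  }
  return (
    `<speak xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="${lang}">` +
    `<voice name="${voice}">` +
    `<mstts:express-as style="${style}">${text}</mstts:express-as>` +
    `</voice></speak>`
  );
}

const empathetic = withStyle("en-US-AriaNeural", "empathetic", "I am sorry to hear that.");
```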


SSML Creation Tool

Microsoft Speech Studio is a visual tool for adjusting voice parameters and outputting SSML, including tools for setting the speaking style and adjusting intonation. Just copy and paste the <speak>..</speak> section of the SSML output as your TTS phrase in VoxEngine.
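If you would rather pull that section out programmatically than by hand, a simple pattern match is usually enough. The sketch below assumes the exported text contains a single <speak> element; extractSpeak is a hypothetical helper name.

```javascript
// Return the first <speak>...</speak> element found in `output`,
// or null if there is none.
function extractSpeak(output) {
  const match = output.match(/<speak\b[\s\S]*?<\/speak>/);
  return match ? match[0] : null;
}

// Example: surrounding text is dropped, the SSML element is kept.
const exported = 'Export:\n<speak version="1.0" xml:lang="en-US">Hello</speak>\nDone';
const phraseSsml = extractSpeak(exported);
```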

Learn more about this tool on Microsoft’s Speech Studio page.

VoxEngine Example

The following is an example VoxEngine script showing different Microsoft voices. The SSML in phrases[3] was exported from Speech Studio. Note the use of different voices and speaking styles in the same SSML statement:

// Microsoft TTS Example
function onCallConnected(e) {
   let phrases = [];
   phrases[0] = "let's do some tests with Microsoft Text-to-Speech";
   phrases[1] = "Let's start without SSML";
   phrases[2] = "This is the Microsoft Azure Text-to-Speech service that has been added to Voximplant. Just select one of the Microsoft voices from VoiceList to use these. Remember SSML is supported.";
   // Exported from Microsoft Speech Studio
   phrases[3] =        
      `<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="Microsoft Server Speech Text to Speech Voice (en-US, GuyNeural)">Now let's test Microsoft's SSML support.</voice><voice name="Microsoft Server Speech Text to Speech Voice (en-US, AriaNeural)"><mstts:express-as type="cheerful">This is the Microsoft Azure <prosody contour="(53%, -59%)">Text-to-Speech</prosody> service that has been added to Voximplant. </mstts:express-as><mstts:express-as type="newscast-formal">Just select one of the Microsoft voices from VoiceList to use these. </mstts:express-as><mstts:express-as type="customerservice">Remember <say-as interpret-as="spell">SSML</say-as> is supported.</mstts:express-as></voice></speak>`;
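   // Speak the first phrase, then play the rest one at a time
   call.say(phrases[0], {"language": VoiceList.Microsoft.Neural.en_US_GuyNeural});
   let i = 1;
   call.addEventListener(CallEvents.PlaybackFinished, () => {
      // Hang up once every phrase has been played
      if (i === phrases.length) {
         call.hangup();
         return;
      }
      call.say(phrases[i], {"language": VoiceList.Microsoft.Neural.en_US_AriaNeural});
      i++;
   });
}

let call;
VoxEngine.addEventListener(AppEvents.CallAlerting, e => {
   call = e.call;
   // Answer the incoming call so that the Connected event fires
   call.answer();
   call.addEventListener(CallEvents.Connected, onCallConnected);
   call.addEventListener(CallEvents.Disconnected, VoxEngine.terminate);
});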

Learn More

Ready to add speech synthesis to your telephony application? Check out our documentation here, sign up for an account here, or contact us now to discuss your needs.