The industry is exploding with cloud communications services you can use to build innovative applications. Speech-to-text, text-to-speech, real time sentiment analysis and other forms of speech analytics are just a few of the many specialized cloud services available to enrich your communications app. The challenge developers face is how to integrate their preferred real time service without introducing so much latency into their application that the user experience suffers.
Some CPaaS platforms make it easy to knit these services together with direct integrations (e.g., Voximplant has 6 speech synthesis providers, 4 transcription providers, direct Dialogflow integration, etc.). But with a rapidly developing market and multiple regional providers available, it’s impossible for any CPaaS to cover all the options. Developers need a flexible, open web protocol they can use to integrate their app with any cloud communications service.
Fortunately, the IETF created the WebSocket protocol to provide a low latency, high performance web communications channel. The WebSocket protocol (commonly referred to as “WebSockets”) enables real time interactive data exchanges between clients and servers, and application-to-application. It is ideal for connecting communications applications to services offered by specialized cloud providers.
This post describes how the WebSocket protocol evolved, its popular use cases, and the support available on the Voximplant platform.
HTTP Doesn’t Support Real Time Communications
HTTP is the request/response protocol that powers most of today's web services. It is fantastic for serving rich text, graphics and images to web browsers, but it is a poor fit for real time communications applications. This is because HTTP doesn’t support truly bidirectional communications and its overhead produces latency that degrades user experiences.
HTTP is an application layer protocol designed to enable web clients to request resources that are hosted on web servers. It relies on the Transmission Control Protocol (TCP) to reliably connect clients with servers. It is considered a half-duplex protocol because the server transfers data only at the request of the client; the server cannot initiate a data transfer nor request data from the client.
This is a fundamental problem for real time communications applications. Almost any valuable communications service requires simultaneous, bidirectional data transmissions. In the case of multiparty conferencing and messaging services, multidirectional data transmissions are required.
In addition, the HTTP request and response structure introduces substantial overhead. Initially, the protocol was designed to open and close an independent TCP connection for each request/response pair. This means each request incurs the overhead of the TCP 3-way handshake, which produces significant latency and degrades the experience for even non-real time applications.
Improvements were made in HTTP/1.1, including persistent (keep-alive) connections that allow a single connection to remain open for multiple requests. This provides a noticeable improvement for non-real time applications. However, the latency in the request/response protocol continues to impair most real time applications.
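To put rough numbers on this overhead, the sketch below counts round trips under a deliberately simplified model: one round trip for the TCP handshake and one per request/response pair, ignoring TLS setup, TCP slow start and server processing time. The function name, the model and the 50 ms round trip time are illustrative assumptions, not measurements.

```javascript
// Simplified model (an assumption for illustration):
//   - opening a TCP connection costs 1 round trip (the 3-way handshake)
//   - each request/response pair costs 1 round trip
function totalRoundTrips(requestCount, keepAlive) {
  if (requestCount === 0) return 0;
  // Without keep-alive every request pays for its own handshake.
  const handshakes = keepAlive ? 1 : requestCount;
  return handshakes + requestCount;
}

const rttMs = 50; // hypothetical network round trip time

// Ten requests, a new connection for each one.
console.log(totalRoundTrips(10, false) * rttMs);
// Ten requests over one persistent connection.
console.log(totalRoundTrips(10, true) * rttMs);
```

Under this model a persistent connection removes the repeated handshakes, but every exchange still costs a full request/response round trip, which is the part that hurts real time applications.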
Polling Hacks Reduce Latency for Real Time Apps
Innovative developers, anxious to bring real-time features to their apps, have built workarounds for HTTP’s limitations. The first is short polling, where the client pings the server at regular intervals to check for new events or data. This may be a good hack for retrieving stock market quotes or updating a chat session. But it’s not useful for most voice and video communications because each update has to wait for the next poll, which can delay media arrival by hundreds of milliseconds. In addition, the constant stream of request/response messages is a waste of server resources.
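The delay that short polling adds can be estimated with a small sketch. It assumes a fixed polling interval and ignores network transit time, so each event simply waits for the next poll tick; the function name and the sample numbers are illustrative.

```javascript
// For each event time (ms since start), compute how long the event waits
// until the next poll fires. Polls are assumed to fire at 0, interval,
// 2 * interval, and so on.
function shortPollDelays(eventTimesMs, intervalMs) {
  return eventTimesMs.map((t) => Math.ceil(t / intervalMs) * intervalMs - t);
}

// Hypothetical example: events at 50 ms, 450 ms and 1000 ms, polled every 500 ms.
console.log(shortPollDelays([50, 450, 1000], 500)); // [ 450, 50, 0 ]
```

An event that occurs just after a poll waits almost a full interval. Shortening the interval reduces that wait but multiplies the number of mostly empty requests the server has to answer.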
Long polling reduces the request/response overhead inherent in short polling: the server holds each request open and delays its response until it has an event or new data to send to the client. This technique is part of the Comet application programming model introduced over a decade ago, which enables servers to push data to clients. When the client receives a response, it immediately issues a new request. If no new event or data becomes available, the server may instead respond with a timeout message.
Long polling is not ideal for real time applications, however. There are multiple issues:
- Latency can be as high as three network transits (one-way trips), or more, for events or data that arise immediately after the server sends a response (the time for that response to reach the client, plus transmission of the next request, plus the next response carrying the new data).
- Long polling is not scalable to large applications because each outstanding HTTP request consumes server resources until a response is issued.
- Request timeouts need to be carefully managed to ensure they do not exceed the TCP connection timeout interval for the network infrastructure connecting the server and clients.
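The long polling cycle described above, including the timeout case, can be sketched as a client loop. To keep the sketch runnable without a server, the request function is injected; `sendRequest`, `scriptedServer`, the response shape and the sample data are all hypothetical, and a real client would loop until stopped rather than for a fixed number of requests.

```javascript
// Long polling client loop. `sendRequest` is a hypothetical async function
// that resolves with { type: "event", data } when the server has something
// to deliver, or { type: "timeout" } when the server gave up waiting.
async function longPoll(sendRequest, onEvent, maxRequests) {
  let sent = 0;
  while (sent < maxRequests) {
    sent += 1;
    const response = await sendRequest();
    if (response.type === "event") {
      onEvent(response.data);
    }
    // Event or timeout, loop around and immediately issue the next
    // request so the server always has one outstanding.
  }
  return sent;
}

// Stand-in for a real server: replies with a scripted sequence of responses.
function scriptedServer(responses) {
  let i = 0;
  return async () => responses[i++] ?? { type: "timeout" };
}

const received = [];
const done = longPoll(
  scriptedServer([
    { type: "event", data: "quote: 101.5" },
    { type: "timeout" },
    { type: "event", data: "quote: 101.7" },
  ]),
  (data) => received.push(data),
  3
);
done.then((sent) => console.log(sent, received));
```

In a real client, `sendRequest` would wrap an HTTP call whose timeout is shorter than the idle timeout of the network path, for the reason given in the last bullet.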
Ultimately, IETF engineers recognized the limitations in the HTTP protocol and set about developing a flexible, efficient solution that enables real time web applications.
WebSockets: Purpose-built for Bidirectional, Full Duplex Communications
The WebSocket protocol allows clients to establish a persistent server connection that supports simultaneous data transfers in both directions. It is referred to as a full duplex transport because both client and server elements can asynchronously transmit data - without a prior request. In addition, it supports transmission of a wide range of data types, including binary.
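A minimal sketch of what this looks like from the client side, using the event handlers of the standard browser WebSocket API (`onopen`, `onmessage`, `send`, `binaryType`). To keep the sketch runnable without a server, the handlers are attached by a helper that accepts any socket-like object; the helper name, the greeting text and the URL in the comment are illustrative assumptions.

```javascript
// Attach handlers to a WebSocket-like object. Once the connection is open,
// either side may send at any time; no request is needed to receive data.
function attachHandlers(socket, log) {
  socket.binaryType = "arraybuffer"; // deliver binary frames as ArrayBuffer
  socket.onopen = () => {
    socket.send("hello from client"); // client-initiated message
  };
  socket.onmessage = (event) => {
    // Server-initiated message: text arrives as a string, binary as ArrayBuffer.
    const kind = typeof event.data === "string" ? "text" : "binary";
    log.push(kind);
  };
  return socket;
}

// In a browser this would be used roughly as:
//   attachHandlers(new WebSocket("wss://example.com/socket"), []);
```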
The WebSocket specification was completed by the IETF in 2011 and it is supported by all popular browsers. Most web development libraries also include WebSocket support. The WebSocket protocol is stable, mature and widely used in a range of real time applications, including gaming, instant messaging, chat, voice/video media exchange and more.
WebSockets Under the Hood
WebSockets gives applications a persistent, message-oriented channel over a single TCP connection between the client and server. It is compatible with HTTP infrastructure and uses the same default ports, 80 and 443. Clients request a WebSocket connection by including HTTP Upgrade headers in an ordinary GET request, which introduces a modest amount of initial overhead and latency. Once the server accepts the upgrade request with a 101 Switching Protocols response, data can be exchanged at will in both directions.
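For illustration, the opening handshake is an ordinary HTTP GET with a few extra headers, as defined in RFC 6455. The sketch below builds such a request as text; the host, path and helper name are placeholders, and the key shown is the sample value from the RFC rather than a real random nonce.

```javascript
// Build the text of a WebSocket opening handshake request.
// Header lines are separated by CRLF and the request ends with a blank line.
function buildUpgradeRequest(host, path, key) {
  return [
    `GET ${path} HTTP/1.1`,
    `Host: ${host}`,
    "Upgrade: websocket",
    "Connection: Upgrade",
    `Sec-WebSocket-Key: ${key}`,
    "Sec-WebSocket-Version: 13",
    "",
    "",
  ].join("\r\n");
}

console.log(buildUpgradeRequest("example.com", "/chat", "dGhlIHNhbXBsZSBub25jZQ=="));
```

A server that accepts the upgrade replies with status 101 Switching Protocols, after which both sides exchange WebSocket frames over the same TCP connection.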
WebSockets provides a thin layer of abstraction on top of TCP to ensure compatibility with web infrastructure. Endpoints are addressed with web URIs rather than raw IP addresses and ports, using the ws scheme in place of http and wss in place of https.
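The scheme mapping is mechanical, as the following sketch shows. The helper name and the example URLs are illustrative; it only handles lowercase http and https URLs.

```javascript
// Map an http(s) URL to its WebSocket equivalent: http -> ws, https -> wss.
function toWebSocketUrl(httpUrl) {
  if (!/^https?:\/\//.test(httpUrl)) {
    throw new Error("expected an http:// or https:// URL");
  }
  // Replacing the leading "http" turns "http" into "ws" and "https" into "wss".
  return httpUrl.replace(/^http/, "ws");
}

console.log(toWebSocketUrl("https://example.com/media")); // wss://example.com/media
console.log(toWebSocketUrl("http://example.com:8080/x")); // ws://example.com:8080/x
```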
WebSockets includes an optional subprotocol header (Sec-WebSocket-Protocol) that names the application-level protocol to be used throughout the life of the connection. The client may offer multiple subprotocols; the server selects at most one that both sides support and returns it in its response. Examples of subprotocols carried over WebSockets include SOAP and XMPP, and the messages themselves can be text (such as JSON) or binary.
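The server side of this negotiation can be sketched as a simple selection. Honoring the client's preference order is one reasonable policy, not something the protocol mandates; the function name and the offered lists are illustrative.

```javascript
// Return the first subprotocol offered by the client that the server also
// supports, or null if there is no overlap.
function selectSubprotocol(offered, supported) {
  for (const name of offered) {
    if (supported.includes(name)) return name;
  }
  return null;
}

// A browser client offers subprotocols via the second constructor argument:
//   new WebSocket("wss://example.com/socket", ["soap", "xmpp"]);
console.log(selectSubprotocol(["soap", "xmpp"], ["xmpp"])); // xmpp
```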
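The protocol includes features to protect communications privacy and guard against misuse. Clients can request a TLS-protected connection by specifying a wss URI, which provides encryption and server authentication and helps prevent man-in-the-middle attacks. Separately, the client includes a random key header (Sec-WebSocket-Key) in the connection upgrade request, and the server responds with the Base64-encoded SHA-1 hash of that key concatenated with a fixed GUID defined by the specification. This confirms that the server actually understood the WebSocket handshake; it is not a substitute for TLS.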
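The key exchange in the handshake can be reproduced in a few lines. Per RFC 6455, the server concatenates the client's Sec-WebSocket-Key with a fixed GUID, hashes the result with SHA-1, and returns it Base64-encoded in the Sec-WebSocket-Accept header. The sketch assumes Node.js and its built-in crypto module; the function and constant names are illustrative.

```javascript
const { createHash } = require("crypto");

// Fixed GUID defined by RFC 6455 for the opening handshake.
const WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

// Compute the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key.
function computeAcceptValue(secWebSocketKey) {
  return createHash("sha1")
    .update(secWebSocketKey + WEBSOCKET_GUID)
    .digest("base64");
}

// Sample key from RFC 6455.
console.log(computeAcceptValue("dGhlIHNhbXBsZSBub25jZQ=="));
// prints "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```

The client performs the same computation and rejects the connection if the values differ, which catches servers or intermediaries that answered without understanding the WebSocket upgrade.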
Advantages for Real Time Communications Applications
The WebSocket protocol provides all the attributes needed to build scalable real time communications applications using web infrastructure. It replaces the HTTP request/response protocol, which is difficult to adapt to the demands of two-way communication, with a streamlined bidirectional communication channel between client and server. Key advantages include:
- Full duplex - Data can be simultaneously transmitted by server and client, providing a truly bidirectional communication channel suitable for real time communications.
- Low latency - While a modest amount of latency is incurred when opening a socket, thereafter data can be exchanged with only a few bytes of framing overhead per message.
- Efficient and scalable - A single persistent connection replaces repeated request/response cycles, which supports large scale communications applications.
- Flexible - Supports exchange of a wide range of data types, including binary for voice and video media transmission, and XML for messaging applications.
- Secure - Optionally runs over TLS, which provides privacy.
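To make the low latency and efficiency points concrete: after the handshake, each message carries only a small frame header. The sketch below computes that header size from the framing rules in RFC 6455; the function name and the audio example are illustrative assumptions.

```javascript
// Size in bytes of a WebSocket frame header (RFC 6455 framing):
//   2 bytes base,
//   plus 2 bytes of extended length for payloads of 126 to 65535 bytes,
//   or 8 bytes of extended length for larger payloads,
//   plus a 4-byte masking key on client-to-server frames.
function frameHeaderBytes(payloadLength, masked) {
  let size = 2;
  if (payloadLength > 65535) size += 8;
  else if (payloadLength > 125) size += 2;
  if (masked) size += 4;
  return size;
}

// Hypothetical example: 20 ms of 8 kHz u-law audio is 160 bytes of payload.
console.log(frameHeaderBytes(160, true));  // client-to-server frame
console.log(frameHeaderBytes(160, false)); // server-to-client frame
```

For a 160-byte audio payload the header is 8 bytes when masked and 4 bytes otherwise, whereas an HTTP request or response typically repeats a full set of text headers on every exchange.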
WebSockets Use Cases
What can you build with WebSockets? Interactive, real time data exchange capabilities open up a boatload of possibilities for adding media processing and synthesis to any communications application. You can knit together exciting services available from leading specialized cloud providers to create innovative applications:
- Leverage speech-to-text services to add captions to your conferencing application, add visual voicemail to your VoIP service, or provide transcription for your contact center agents and supervisors.
- Integrate natural language processing into your contact center app to build a chatbot, voicebot, or IVR bot. You can use NLP to add a voice assistant to your VoIP service.
- Integrate voice sentiment analysis into your contact center service to provide real time coaching to agents.
- Use speech analytics to evaluate the quality of your contact center interactions.
The WebSocket protocol’s low latency characteristics aren’t just for handling communications media; it also has many applications where text is in the payload. For example, communications engineers often use it to exchange signaling information across a communications infrastructure, including WebRTC-based voice and video communications sessions. While this channel isn’t accessible to app developers, it’s a good example of bidirectional, low latency text exchange between client and server. You can similarly use a WebSocket to exchange log files and other types of text information.
Flexible Voximplant Connectivity with Media Providers
Voximplant is leading the CPaaS industry with multiple mechanisms available for developers to leverage third party media processing services. We offer support for the WebSocket protocol that enables developers to connect with virtually any compatible cloud service. In addition, we have developed tight integrations with six speech synthesis providers and four transcription services.
Voximplant Supports WebSockets
The Voximplant serverless architecture offers unique support for the WebSocket protocol that is more scalable than alternative CPaaS platforms. Our WebSocket API is in the VoxEngine JavaScript cloud application environment. When your app needs to call a 3rd party service, it can open a WebSocket inside the Voximplant cloud and send media directly from the VoxEngine cloud.
Similarly, data received from a 3rd party service can be forwarded directly to VoxEngine where it can be relayed to a user or processed in our serverless environment. This provides the highest performance and lowest latency because your application is co-resident with our CPaaS infrastructure - no additional media conversion is required.
Using WebSockets, you can pipe media received from a call directly to a natural language processing service, for example. The service may return text, or other media via the same socket for use in your application.
Our serverless architecture eliminates the need to provision servers or monitor capacity utilization. You can dynamically scale your application with confidence that performance and service quality are always optimized.
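// Load the VoxEngine WebSocket module.
require(Modules.WebSocket);

VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
  // Answer the incoming call. Declared with const to avoid an implicit global.
  const call = e.call;
  call.answer();

  // Open an outbound WebSocket; replace the placeholder URL with your server.
  const webSocket = VoxEngine.createWebSocket(/*url*/ "wss://your_link/");

  // Handle text messages arriving from the WebSocket server.
  webSocket.addEventListener(WebSocketEvents.MESSAGE, (wsEvent) => {
    Logger.write("LOG OUTGOING: WebSocketEvents.MESSAGE: " + wsEvent.text);
    call.sendMessage("LOG OUTGOING: WebSocketEvents.MESSAGE: " + wsEvent.text);

    // Once the server sends its greeting, start streaming the call audio to it.
    if (wsEvent.text === "Hi there, I am a WebSocket server") {
      call.sendMediaTo(webSocket, {
        encoding: WebSocketAudioEncoding.ULAW,
        tag: "MyAudioStream",
        customParameters: {
          param1: "12345"
        }
      });
    }
  });
});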
Example: opening an outbound WebSocket connection from a Voximplant scenario and streaming call audio to it.
Exchange Real Time Data
With the WebSocket protocol, the web has the ability to support a range of innovative real time applications. Full duplex, low latency data transmission capabilities enable developers to easily knit together specialized cloud services and create sophisticated communications applications, while maintaining compatibility with web semantics and infrastructure.
Voximplant has been at the forefront with efficient WebSockets support that provides the foundation for sharing media with third party cloud services.
Footnotes
The WebSocket Protocol, IETF RFC 6455
WebSocket: Simultaneous Bi-Directional Client-Server Communication, Gabbie Piraino, Medium
Known Issues and Best Practices for the Use of Long Polling and Streaming in Bi-directional HTTP, IETF RFC 6202