0. Advice from Twilio
For the last few weeks, we have been trying to figure out how to integrate a VoIP API with a TTS module. However, after getting some advice from a Twilio support engineer, we realized that the TTS feature Twilio provides is not the best solution for a real-time TTS conversation.
1. Case Studies
One day, we watched the above video and wondered what makes it possible for Google Duplex to carry out a real-time STT (speech-to-text) and TTS (text-to-speech) call. Fortunately, we found a few attempts at building an AI agent just like Google Duplex. Through two case studies, we learned more about telephony systems and found that we will need more types of software packages to handle the calls.
Case Study 1: Building your own Duplex AI agent using Rasa and Twilio
Knowledge of Telephony system
Sending Voice over IP (VoIP) requires two protocols: SIP and RTP.
SIP (Session Initiation Protocol): A signaling protocol used for establishing a session (call) for real-time media communications. IP address and port information are exchanged during this setup.
RTP (Real-time Transport Protocol): After SIP establishes a session, this protocol is used for exchanging voice packets.
PBX (Private Branch eXchange): A private telephone network used within a company or organization. Users can communicate over different channels, such as VoIP.
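To make the "voice packets" part more concrete, here is a minimal sketch of the 12-byte fixed RTP header that precedes each chunk of audio. The function names are our own for illustration, and the sketch ignores optional CSRC entries and header extensions. Payload type 0 is the standard static type for G.711 mu-law audio; whether a given trunk actually uses that codec is an assumption.

```python
import struct

RTP_VERSION = 2  # the version field is always 2 for current RTP


def build_rtp_header(payload_type, sequence, timestamp, ssrc, marker=False):
    """Pack the 12-byte fixed RTP header (no CSRC list, no extension)."""
    byte0 = RTP_VERSION << 6  # version=2, padding=0, extension=0, CSRC count=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack(
        "!BBHII",  # network byte order: 1 + 1 + 2 + 4 + 4 = 12 bytes
        byte0,
        byte1,
        sequence & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )


def parse_rtp_header(packet):
    """Read the fixed header fields back out of a received packet."""
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

A receiver uses the sequence number to detect lost or reordered packets and the timestamp to play the audio back at the right pace, which is why both fields increase with every packet in a call.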
Software packages to consider
Twilio offers Elastic SIP Trunking. Twilio maintains hardware connected to the physical telephone network, and when a call comes in or goes out, it initiates a VoIP call over the internet to our system.
Kamailio provides an open source SIP server. It is used by large Internet Service Providers (ISPs) to provide public telephony service. In this case study, it serves as the VoIP load balancer and router. It sits outside our system's firewall, where Twilio connects to it, and then routes the call inside the firewall to Asterisk.
Asterisk is an open source private branch exchange (PBX). It allows telephones interfaced with a variety of hardware technologies to make calls to one another, and to connect to telephony services such as the public switched telephone network (PSTN) and Voice over IP services. In this case study, it handles call control (answer, hang up, etc.) and bridges audio to and from the speech subsystems.
Procedure possibly taken to our application
Twilio receives a call from the PSTN (phone network) and starts a VoIP call to the Kamailio server -> Asterisk auto-answers the call via Kamailio and routes the incoming audio stream -> The TTS engine synthesizes audio from the text typed by a caller and sends it to Asterisk -> Asterisk injects the audio stream from the TTS engine into the VoIP call -> Twilio bridges the audio from Asterisk and the TTS engine to the physical phone network.
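The TTS-to-call part of this procedure can be sketched roughly as follows. Everything here is a hypothetical stand-in: `fake_tts` only produces silence, `inject_into_call` just collects frames, and we assume 8 kHz, 8-bit G.711 mu-law audio sent in 20 ms frames (160 bytes each), which the case study does not state. It is only meant to show how synthesized audio would be cut into fixed-size frames before a PBX such as Asterisk writes it into the call.

```python
FRAME_BYTES = 160  # assumption: 20 ms of 8 kHz, 8-bit G.711 audio
SILENCE = b"\xff"  # the byte that encodes silence in G.711 mu-law


def fake_tts(text):
    """Hypothetical TTS engine: returns 80 ms of silence per character."""
    return SILENCE * (len(text) * 640)


def frame_audio(audio, frame_bytes=FRAME_BYTES):
    """Split audio into fixed-size frames, padding the last one with silence."""
    for start in range(0, len(audio), frame_bytes):
        frame = audio[start:start + frame_bytes]
        yield frame.ljust(frame_bytes, SILENCE)


def inject_into_call(frames):
    """Hypothetical PBX step: collect the frames that would go into the call."""
    return list(frames)


# Text typed by the caller -> synthesized audio -> frames injected into the call
sent = inject_into_call(frame_audio(fake_tts("Hi")))
```

In a real deployment each frame would be wrapped in an RTP packet and paced at one frame every 20 ms rather than collected in a list.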
Case Study 2: Making IVRs not Suck with Dialogflow
Knowledge of Telephony system
RPC (Remote Procedure Call): a protocol that one program can use to request a service from a program located on another computer on a network, without having to understand the network's details.
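As a small illustration of the RPC idea, the sketch below uses XML-RPC from the Python standard library. This is not gRPC and not what the case study used; it is swapped in only because it runs without extra packages. The function `count_characters` and the helper names are our own. The client calls `proxy.count_characters(...)` as if it were a local function, while the request actually travels over HTTP to the server.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def count_characters(text):
    """The 'service' that lives on the server side."""
    return len(text)


def start_server():
    """Start an XML-RPC server on a free local port in a background thread."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(count_characters)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def remote_count(text):
    """Call count_characters through the network instead of directly."""
    server = start_server()
    try:
        host, port = server.server_address
        with ServerProxy(f"http://{host}:{port}") as proxy:
            # Looks like a local call, but is serialized and sent to the server.
            return proxy.count_characters(text)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `remote_count("hello")` goes through the full request and response cycle and returns the length computed by the server.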
Software packages to consider
Voximplant is a Communications Platform as a Service (CPaaS). In this case study, they integrated Google's Cloud Speech API into Voximplant, using their own libraries, for automatic speech recognition.
gRPC is an open source RPC framework that Google uses for communication in internal production, on Google Cloud Platform, and in public-facing APIs. In this case study, it is used for real-time audio streaming.
2. Analysis of TTS Engines
Google Cloud Text-to-Speech API has more than 30 voices available in different languages. The voices are very clear, and the speech sounds fluent. Integrating it is quite easy, but it doesn't come for free. It is billed monthly, based on the number of characters sent to the service to be synthesized into audio. This shows the steps of how to integrate Google Cloud Text-to-Speech API into an iOS application.
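Because the price depends on the number of characters, a rough monthly estimate is easy to sketch. The function below is our own illustration, and the rate and free allowance in the example are placeholder numbers, not Google's actual prices; check the current pricing page before relying on any figure.

```python
def estimate_monthly_cost(characters, price_per_million, free_characters=0):
    """Estimate a character-based monthly bill.

    characters: total characters sent for synthesis in the month
    price_per_million: price per one million billable characters (assumed value)
    free_characters: monthly free allowance, if any (assumed value)
    """
    billable = max(0, characters - free_characters)
    return billable / 1_000_000 * price_per_million


# Placeholder numbers: 3M characters, 4.00 per million, 1M characters free
example_cost = estimate_monthly_cost(3_000_000, 4.0, free_characters=1_000_000)
```

For an application like ours, the character count would be roughly the average message length multiplied by the number of messages spoken per month.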
Open Source TTS engines:
1- Mimic (https://mycroft.ai/documentation/mimic/) is a fast, lightweight text-to-speech engine developed by Mycroft.AI and VocaliD. It speaks the given text aloud using the chosen voice.
2- eSpeak (http://espeak.sourceforge.net) includes several languages and voices, so native speakers of these languages can use it easily. Speech can be produced at different speeds. The voice is clear, but it is not as natural or smooth as larger synthesizers that are based on human speech recordings. It is available as a command line program (Linux and Windows) that can speak text from a file.
3- Merlin (https://github.com/CSTR-Edinburgh/merlin) is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor and a vocoder. It is free software, distributed under the Apache License, Version 2.0, allowing unrestricted commercial and non-commercial use alike.
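As an example of how one of these engines could be driven from our own code, here is a minimal sketch that calls the eSpeak command line program on a text file. The helper names are our own. The options used are eSpeak's `-v` (voice), `-s` (speed in words per minute), `-f` (text file to speak), and `-w` (write a WAV file instead of playing). The sketch assumes the `espeak` binary is installed and on the PATH, and does nothing if it is not.

```python
import shutil
import subprocess


def build_espeak_command(text_file, voice="en", words_per_minute=150, wav_path=None):
    """Build the argument list for speaking a text file with eSpeak."""
    command = ["espeak", "-v", voice, "-s", str(words_per_minute), "-f", text_file]
    if wav_path is not None:
        command += ["-w", wav_path]  # write audio to a WAV file instead of playing it
    return command


def speak_file(text_file, **options):
    """Run eSpeak if it is installed; return False when it is not available."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(build_espeak_command(text_file, **options), check=True)
    return True
```

Writing to a WAV file is the variant that matters for telephony, since the audio has to be handed to the call rather than played on the local speakers.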
Text To Speech (realistic voice and emotion)
Most synthesized voices do not carry tone or expression. To bring the interaction to life, we are planning to add emotion and natural expression to our application, so that the tone changes when the text contains a question mark or other punctuation. That way, the receiver will feel like they are talking with a person rather than a machine.
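One possible way to approach the punctuation-driven tone change is to turn the typed text into SSML (Speech Synthesis Markup Language) before sending it to a TTS engine. The sketch below is only an idea, not something we have built: the function name and the pitch and volume values are our own choices, and whether a particular engine honors the `<prosody>` element is an assumption to verify per engine.

```python
import re
from xml.sax.saxutils import escape


def text_to_ssml(text):
    """Wrap each sentence in SSML, adjusting prosody based on its punctuation."""
    # A sentence is a run of non-terminator characters plus its trailing . ! or ?
    sentences = re.findall(r"[^.!?]+[.!?]*", text)
    parts = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        body = escape(sentence)  # escape &, < and > so the SSML stays valid XML
        if sentence.endswith("?"):
            body = f'<prosody pitch="+10%">{body}</prosody>'  # raise pitch for questions
        elif sentence.endswith("!"):
            body = f'<prosody volume="loud">{body}</prosody>'  # emphasize exclamations
        parts.append(f"<s>{body}</s>")
    return "<speak>" + "".join(parts) + "</speak>"
```

For example, `text_to_ssml("Hello. Are you there?")` leaves the first sentence plain and wraps the second one in a raised-pitch `<prosody>` element.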
There are some websites which provide realistic voices, such as:
1- Text to speech, which supports multiple languages and has male and female voices.
2- IBM, which has only a female voice. It changes the tonality and expression of the voice, which we did not find on the other TTS websites.
3- Oddcast, which supports many different languages and English accents. It has both male and female voices.