The main feature of our application is text-to-speech (TTS) synthesis. It lets users type text and have it converted into a human-sounding voice during a phone call. To apply TTS to a phone call in real time, we have chosen to use the Twilio API.
Twilio is a cloud communications platform for programmatically making and receiving voice calls in a web or mobile application. Over the last two weeks, we have tried several functions of the Twilio API to get a sense of how to place and receive voice calls and how to use the text-to-speech module during a call. We have focused on trying these functions in a web-based environment first, rather than diving directly into a mobile application, so that we know the API well beforehand.
TwiML
TwiML (Twilio Markup Language) is a set of instructions that tells Twilio what to do during a phone call. A TwiML document is an XML document with special tags defined by Twilio, and the commands in it are what we use to build our voice application.
The elements of TwiML are divided into three groups: the root <Response> element, verbs, and nouns. In any TwiML response to a Twilio request, all verb elements have to be nested within <Response>, the root element of Twilio’s XML markup. TwiML verbs tell Twilio what actions to take on a given call. A TwiML noun describes the phone numbers and API resources to take action on. With the <Say> verb, the given text is converted into speech in real time during the call. The following figure shows an example of TwiML:
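As a rough illustration of the three element groups, the sketch below builds a small TwiML document with Python's standard XML library: <Response> is the root, <Say> and <Dial> are verbs, and <Number> is a noun nested inside <Dial>. The function name, the spoken text, and the phone number are placeholders chosen for this example, not values from our application.

```python
import xml.etree.ElementTree as ET


def build_example_twiml() -> str:
    """Build a TwiML document with one root, two verbs, and one noun."""
    response = ET.Element("Response")        # root element
    say = ET.SubElement(response, "Say")     # verb: speak text to the caller
    say.text = "Connecting your call."
    dial = ET.SubElement(response, "Dial")   # verb: connect to another party
    number = ET.SubElement(dial, "Number")   # noun: the number to dial
    number.text = "+15550100"                # placeholder phone number
    return ET.tostring(response, encoding="unicode")


print(build_example_twiml())
```

Twilio runs the verbs in document order, so in this sketch the caller would hear the sentence first and then be connected.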
Inbound / Outbound call
After activating an account, we obtained a virtual phone number from Twilio and tried inbound and outbound calls.
The process of an Inbound call is as follows:
When someone calls a virtual number provided by Twilio, Twilio looks up the URL associated with that phone number and sends an HTTP request to it. Twilio then reads the TwiML instructions returned from that URL to determine what to do. In the figure above, the HTTP request therefore contains the phone numbers of the caller and the receiver, and the HTTP response contains the returned TwiML.
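The inbound flow described above can be sketched as a tiny webhook built only on the Python standard library. Twilio sends the call details as form-encoded parameters such as From and To, and the handler answers with TwiML. The names build_twiml_for_call, VoiceWebhook, and serve, the greeting text, and port 8000 are assumptions made for this example; a real deployment would more likely use a web framework and should validate that the request really comes from Twilio.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs
import xml.etree.ElementTree as ET


def build_twiml_for_call(form_body: str) -> str:
    """Turn the form-encoded body of Twilio's request into a TwiML reply."""
    params = parse_qs(form_body)
    caller = params.get("From", ["unknown"])[0]  # caller's phone number
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = f"Hello. You are calling from {caller}."
    return ET.tostring(response, encoding="unicode")


class VoiceWebhook(BaseHTTPRequestHandler):
    """Answers Twilio's HTTP POST with the TwiML for the call."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        twiml = build_twiml_for_call(body).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.send_header("Content-Length", str(len(twiml)))
        self.end_headers()
        self.wfile.write(twiml)


def serve(port: int = 8000) -> None:
    """Run the webhook locally; the port is an arbitrary choice."""
    HTTPServer(("", port), VoiceWebhook).serve_forever()
```

Calling serve() starts the server. For Twilio to reach it, the machine must be reachable from the internet and its URL must be set as the voice URL of the virtual number.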
The process of an Outbound call is as follows:
When we initiate an outbound call with the Twilio API, Twilio requests our TwiML to learn how to handle the call.
Twilio executes just one TwiML document for a call at a time, but many TwiML documents can be linked together to build complex interactive voice applications.
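To make the outbound flow more concrete, the sketch below prepares (but does not send) the HTTP request that starts a call through Twilio's REST API: a POST to the account's Calls resource with the To, From, and Url parameters, authenticated with the account SID and auth token. The function name build_call_request and all credentials, numbers, and URLs are placeholders. In practice the official twilio helper library would normally be used instead of hand-built requests.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_call_request(account_sid, auth_token, to_number, from_number, twiml_url):
    """Prepare the POST request that asks Twilio to place an outbound call."""
    endpoint = (
        f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"
    )
    # Url tells Twilio where to fetch the TwiML once the call is answered.
    form = urlencode(
        {"To": to_number, "From": from_number, "Url": twiml_url}
    ).encode("ascii")
    # HTTP Basic authentication with the account SID and auth token.
    credentials = base64.b64encode(
        f"{account_sid}:{auth_token}".encode("ascii")
    ).decode("ascii")
    request = Request(endpoint, data=form, method="POST")
    request.add_header("Authorization", f"Basic {credentials}")
    return request


# Placeholder values; nothing is sent until the request is passed to
# urllib.request.urlopen().
request = build_call_request(
    "ACXXXXXXXX", "auth_token", "+15550100", "+15550111",
    "https://example.com/voice",
)
print(request.full_url)
```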
The <Say> verb makes it easy to synthesize speech. When we provide the text, Twilio synthesizes speech in real time and plays back the audio. When using <Say>, we can choose between the man, woman, and alice voices or the Amazon Polly voices. For the next two weeks, we plan to apply the <Say> verb to a TTS call between the web application and a personal phone, and to find a specific method for the mobile application.
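As a small sketch of how user-typed text could become a <Say> instruction, the helper below wraps the text in TwiML and sets the voice attribute. The helper name say_twiml and the default voice are choices made for this example. The XML library escapes special characters, so text typed by a user cannot break the document.

```python
import xml.etree.ElementTree as ET


def say_twiml(text: str, voice: str = "alice") -> str:
    """Wrap user text in a <Say> verb using the requested voice."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", {"voice": voice})
    say.text = text  # special characters such as & are escaped on output
    return ET.tostring(response, encoding="unicode")


print(say_twiml("Hello, this message was typed."))
print(say_twiml("Tom & Jerry", voice="woman"))
```

An Amazon Polly voice would be selected the same way, by passing its name as the voice value (for example Polly.Joanna).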
<References>
https://www.twilio.com/docs/voice/twiml
https://www.twilio.com/docs/voice/tutorials/how-to-make-outbound-phone-calls-python
https://www.twilio.com/docs/voice/twiml/say/text-speech
https://www.nexmo.com/blog/2017/10/20/text-to-speech-voice-calls-with-php-dr/