Week 15 – Final Project Presentation and Demo

We have finally completed the real-time TTS call application for people with speech impairments. It was a long road to get to this point, but we now have a PC application for real-time TTS calls with emotional features and an Android application for TTS with emotional features. Here are our final presentation and demo videos.




This is what we’ve implemented for Android. A complete real-time TTS call system could be built by integrating a telephony endpoint system, consisting of SIP, RTP, and a PBX (represented by Kamailio and Asterisk), between the Twilio API and the TTS system.


With the knowledge gained from the Android trial, we were finally able to implement a complete real-time TTS call system for PC. We improved usability by adding a Send button that informs the person on the other side that this is a TTS call, letting the user press Enter to generate speech, and providing a user guidebook.


This is a possible scenario a user might face while using our real-time TTS call system. It suits unavoidable phone-call situations and meets two objectives of our project: a free TTS call, and no extra tool needed on the receiver’s side.


Week 13 – Workaround for the complex real-time telephony system

0. So far, we had been having a hard time finding a proper way to build a real-time TTS call for people who cannot speak. The system seemed too complicated to implement, because a telephony system needs many interconnected software packages to send information back and forth. One day, we watched a YouTube video about sending call output through a Google Voice phone call using Linux and PulseAudio. While watching, we noticed that the system shown in the video could capture and send the internal sound of a Windows PC. That gave us an idea, and we searched for software that can route sound the same way on macOS: Soundflower.

1. Soundflower is an open-source kernel extension for macOS, designed to create a virtual audio output device that can also act as an input. In other words, Soundflower lets us route external sound into internal sound and vice versa. After installing it, we set the sound input from Internal Microphone to Soundflower, both on the MacBook and in Google Hangouts, to see whether a sample mp3 file playing as internal sound could be heard by the person on the other side of the call. It worked well.


2. A prototype for PC calls using gTTS

We then implemented a prototype that generates TTS audio and creates an mp3 file at the same time, using tkinter (for the GUI) and gTTS (the Google TTS engine).
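The core of the prototype can be sketched roughly as follows. This is a minimal sketch, not our exact code: the file-naming helper and widget layout are illustrative, and gTTS needs network access to Google’s TTS service.

```python
# Minimal sketch of the gTTS + tkinter prototype (illustrative, not our
# exact code). gTTS needs network access; the GUI part is optional here.
import re

try:
    from gtts import gTTS  # third-party: pip install gTTS
except ImportError:
    gTTS = None

def mp3_name(text: str) -> str:
    """Derive a safe mp3 file name from the typed text."""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", text.strip()).strip("_") or "tts"
    return stem[:40] + ".mp3"

def synthesize(text: str, lang: str = "en") -> str:
    """Generate speech with gTTS and save it as an mp3 file."""
    if gTTS is None:
        raise RuntimeError("gTTS is not installed")
    path = mp3_name(text)
    gTTS(text=text, lang=lang).save(path)  # contacts Google's TTS service
    return path

def run_gui():
    """tkinter front end: an entry box and a button that calls synthesize()."""
    import tkinter as tk
    root = tk.Tk()
    root.title("TTS prototype")
    entry = tk.Entry(root, width=50)
    entry.pack()
    tk.Button(root, text="Speak",
              command=lambda: synthesize(entry.get())).pack()
    root.mainloop()
```

With Soundflower selected as the system output, the generated mp3 can then be played into the call as internal sound.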



3. TTS for android

We developed a TTS application for Android with speed and pitch controls to modify the voice tone. We are also trying to add a gender option (female and male voices); however, we had a hard time switching the default Android TTS voice from female to male, so we realized we need another engine such as iSpeech or Nuance, both of which support male and female voices well. We are also working on adding emotional features to the voice. Android Studio proved to be a very nice tool, producing accurate and attractive designs at no additional cost.

Currently, we are in the process of getting familiar with iSpeech and Nuance to find out which one is the most applicable to our project.





Week 11 – Real-time TTS call case studies and TTS engines

0. Advice from Twilio

For the last few weeks, we have been trying to figure out how to integrate a VoIP API and a TTS module. However, after getting advice from a Twilio support engineer, we realized that the TTS feature Twilio provides is not the best solution for a real-time TTS conversation.


1. Case Studies


One day, we watched the above video and wondered what makes it possible for Google Duplex to conduct real-time STT (speech-to-text) and TTS (text-to-speech) calls. Fortunately, we found a few attempts at building a Duplex-like AI agent. Through two case studies, we learned about telephony systems and found that several additional software packages are needed for handling the calls.

Case Study 1: Building your own Duplex AI agent using Rasa and Twilio


Knowledge of Telephony system

Sending Voice over IP (VoIP) requires two protocols: SIP and RTP.

SIP (Session Initiation Protocol): a signaling protocol used for establishing a session (call) for real-time media communications; IP address and port information are exchanged.

RTP (Real-time Transport Protocol): after SIP establishes a session, this protocol is used for exchanging voice packets.

PBX (Private Branch eXchange): a private telephone network used within a company or organization. Users can communicate over different channels, such as VoIP.

Software packages to consider

Twilio offers Elastic SIP Trunking. Twilio maintains hardware connected to the physical telephone network, and when a call comes in or out, it initiates a VoIP call over the internet to our system.

Kamailio provides an open-source SIP server. It is used by large Internet Service Providers (ISPs) to provide public telephony services. In this case study, it serves as a VoIP load balancer and router: it sits outside our system’s firewall, is the point where Twilio connects, and routes the call inside the firewall to Asterisk.

Asterisk is an open-source private branch exchange (PBX). It allows telephones interfaced with a variety of hardware technologies to call one another, and to connect to telephony services such as the public switched telephone network and VoIP services. In this case study, it handles call control (answer, hang up, etc.) and bridges audio to and from the speech subsystems.

Procedure possibly taken to our application

1. Twilio receives a call from the PSTN (phone network) and starts a VoIP call to the Kamailio server.
2. Asterisk auto-answers the call via Kamailio and routes the incoming audio stream.
3. The TTS engine synthesizes audio from the text typed by the caller and sends it to Asterisk.
4. Asterisk injects the audio stream from the TTS engine into the VoIP call.
5. Twilio bridges the audio from Asterisk/the TTS engine to the physical phone network.

Case Study 2: Making IVRs not Suck with Dialogflow


Knowledge of Telephony system

RPC (Remote Procedure Call): a protocol that one program can use to request a service from a program located on another computer on a network, without having to understand the network’s details.

Software packages to consider

VoxImplant is a Communications Platform as a Service (CPaaS). In this case study, they integrated Google’s Cloud Speech API into Voximplant for automatic speech recognition with their libraries.

gRPC is being used for communication in internal production, on Google Cloud Platform, and in public-facing APIs. In this case study, it is used for real-time audio streaming.



2. Analysis of TTS Engines 

Google Cloud Text-to-Speech API has more than 30 voices available in different languages. The voices are very clear and the speech is fluent. Integrating it is quite easy, but it does not come for free: billing is monthly, based on the number of characters sent to the service to be synthesized into audio. There are also guides showing how to integrate the Google Cloud Text-to-Speech API into an iOS application.

Open Source TTS engines:

1- Mimic ( https://mycroft.ai/documentation/mimic/ ) is a fast, lightweight text-to-speech engine developed by Mycroft.AI and VocaliD. It speaks text aloud using the chosen voice.

2- eSpeak ( http://espeak.sourceforge.net ) includes several languages and voices, making it easy for native speakers of these languages to use. The speech can be played at different speeds; the voice is clear, but not as natural or smooth as larger synthesizers based on human speech recordings. It is available as a command-line program (Linux and Windows) that speaks text from a file.

3- Merlin (https://github.com/CSTR-Edinburgh/merlin) is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It must be used in combination with a front-end text processor and a vocoder. It is free software, distributed under an Apache License Version 2.0, allowing unrestricted commercial and non-commercial use alike.


Text To Speech (realistic voice and emotion)

Most synthesized voices lack tone and expression. To bring the interaction to life, we plan to add emotion and natural expression to our application: the tone can change when the text contains a question mark or other punctuation, so the receiver feels like they are talking with a person, not a machine.
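As a rough illustration, punctuation-driven prosody could be sketched like this. The parameter names and values below are hypothetical placeholders, not tuned settings from our application.

```python
# Hypothetical sketch: map sentence-final punctuation to prosody settings
# (pitch shift in semitones, speaking-rate multiplier). The values are
# placeholders, not parameters from our actual application.
def prosody_for(sentence: str) -> dict:
    s = sentence.strip()
    if s.endswith("?"):
        return {"pitch": +2.0, "rate": 1.0}   # rising, question-like tone
    if s.endswith("!"):
        return {"pitch": +1.0, "rate": 1.1}   # brighter, excited tone
    return {"pitch": 0.0, "rate": 1.0}        # neutral statement
```

A mapping like this could feed whatever pitch and rate controls the chosen TTS engine exposes.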

There are some websites which provide realistic voices, such as: 

1- Text to Speech, which supports multiple languages and has male and female voices.

2- IBM, which has only a female voice. It can change the tonality and expression of the voice, which we have not seen on other TTS websites.

3- Oddcast, which supports many different languages and English accents, with both male and female voices.





Week 9 – Getting familiar with Voice API

The main feature of our application is converting text to speech, known as TTS. It allows users to write text and have it converted into a human voice during a phone call. To apply TTS to a phone call in real time, we have chosen the Twilio API.

Twilio is a cloud communications platform for programmatically making and receiving voice calls in a web or mobile application. For the last two weeks, we have tried some functions of the Twilio API to get a sense of placing and receiving voice calls, as well as using the text-to-speech module during a call. We have focused on trying these functions in a web-based environment rather than diving directly into a mobile application, so that we get to know the API well beforehand.


TwiML (Twilio Markup Language) is a set of instructions that tells Twilio what to do during a phone call. It is an XML document with special tags defined by Twilio for building voice applications.

The elements of TwiML are divided into three groups: the root <Response> element, nouns, and verbs. In any TwiML response to a Twilio request, all verb elements must be nested within <Response>, the root element of Twilio’s XML markup. A TwiML noun describes the phone numbers and API resources to take action on, while TwiML verbs tell Twilio what actions to take on a given call. With the <Say> verb, text is converted into speech during a real-time call.
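A minimal TwiML response using <Say> might look like this (the spoken text is just a placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>Hello, this message was typed and converted to speech.</Say>
</Response>
```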


Inbound / Outbound call


After activating an account, we got a virtual number from Twilio and tried inbound and outbound calls.

The process of an inbound call is as follows:

When someone calls a virtual number offered by Twilio, Twilio looks up the URL associated with that phone number and sends it a request. Twilio then reads the TwiML instructions hosted at that URL to determine what to do. In the image above, the HTTP request contains the sender’s and receiver’s phone numbers, and the HTTP response is the content of the returned TwiML.
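Such a webhook can be sketched with only the standard library. This is a minimal sketch under assumptions: the host, port, and spoken text are placeholders, and a real deployment would use a proper web framework behind a public URL that Twilio is configured to request.

```python
# Minimal sketch of a TwiML webhook (hypothetical host/port and text).
# Twilio POSTs call details (From, To, CallSid, ...) to the configured
# URL and expects a TwiML document back.
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_twiml(text: str) -> str:
    """Return a TwiML document that speaks `text` to the caller."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Say>{}</Say></Response>".format(text)
    )

class VoiceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Answer every incoming call with a spoken message.
        body = build_twiml("Hello, this is a text to speech call.").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run locally (blocks until interrupted):
# HTTPServer(("", 8080), VoiceHandler).serve_forever()
```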

The process of an outbound call is as follows:

When we initiate an outbound call with the Twilio API, Twilio requests our TwiML to learn how to handle the call.

Twilio executes only one TwiML document for the caller at a time, but many TwiML documents can be linked together to build complex interactive voice applications.

The <Say> verb makes it easy to synthesize speech: we provide the text, and Twilio synthesizes speech in real time and plays back the audio. With <Say>, we can choose between the Man, Woman, and Alice voices, or Amazon Polly voices. Over the next two weeks, we plan to apply the <Say> verb to a TTS call between the web application and a personal phone, and to find a concrete method for the mobile application.
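For example, a specific voice can be selected via the voice attribute of <Say>; the Polly voice named below is just one of the available options.

```xml
<Response>
  <Say voice="Polly.Joanna">Hello, this is a text to speech call.</Say>
</Response>
```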









For our project, we plan to develop an application specifically designed for speech-impaired people, given the lack of such applications in the current market. The application will allow the user to make phone calls and interact directly by typing messages, which are converted into speech for the receiver. The voice will express the user’s emotions by varying vocal attributes during speech generation, including volume, tone, and speed.