Sushil Bastola

Developing a Speech Unit Framework

Metropolia University of Applied Sciences

Bachelor of Engineering

Software Engineering

Bachelor’s Thesis

30 August 2018

Author Sushil Bastola

Title Developing a Speech Unit Framework

Degree Bachelor of Engineering

Number of Pages xx pages + x appendices

Date 30 August 2018

Degree Programme Information Technology

Professional Major Software Engineering

Instructors Janne Salonen, Principal Lecturer

This project aims to build a speech unit framework for Kone that automates the process of generating announcements, translating them into multiple languages and exporting them to the desired audio formats for different device types, such as elevators, doors and gates, using speech synthesis.

Traditionally, Kone has relied on human speakers to create announcements for the different device types. Specialized personnel speak into recording devices, and the recorded voices are later installed as announcements on the devices. The process is repeated for each language using native speakers from the corresponding countries. The audio files are then saved and installed on the different device types. In the long run this process is expensive, inconsistent and laborious, since the speech must be translated into many different languages.

This project built a system that uses speech synthesizers from AWS to automate the process of creating announcements. The system can generate audio in selected languages with custom settings and filters. The audio can be exported as a zip file in a specific format and then installed on the corresponding device type.

To conclude, the project succeeded in automating the generation of announcements using speech synthesizers. The system minimizes the traditional problem of inconsistency and offers a faster, more reliable and cheaper solution.

Keywords AWS, TTS, Polly, Synthesizers, Micro Service, Architecture, Speech

Contents

List of Abbreviations

1 Introduction

2 Overview of TTS
2.1 Brief History
2.2 Implementation Techniques
2.3 Tools and Technologies
2.3.1 Speech synthesis software and APIs
2.4 Application of Speech Synthesis
2.4.1 Applications for the Blind
2.4.2 Applications for the Deafened and Vocally Handicapped
2.4.3 Educational Applications
2.4.4 Applications for Telecommunications and Multimedia

3 Kone
3.1 Brief History
3.2 Main Expertise
3.2.1 Elevators, escalators and automatic building doors solutions
3.2.2 Maintenance and modernization
3.2.3 Advanced people flow solutions

4 Addressing the problem

5 Implementation
5.1 Technologies
5.1.1 Docker
5.1.2 React
5.1.3 Redux
5.1.4 Koa
5.1.5 Sequelize
5.1.6 JWT
5.1.7 Postgres
5.1.8 Swagger
5.2 Application Architecture
5.2.1 Microservices
5.3 Amazon Polly
5.4 Execution
5.4.1 Requirement Analysis
5.4.2 System Design
5.4.3 Execution
5.4.4 Integration and Deployment
5.5 Outcome and observations

6 Conclusion

7 References

Appendices

List of Abbreviations

TTS Text to Speech

NTTS Neural Text to Speech

AWS Amazon Web Services

API Application Programming Interface

DOM Document Object Model

MVC Model View Controller

ORM Object Relational Mapping

SQL Structured Query Language

JS JavaScript

OS Operating System

MVCC Multi-Version Concurrency Control

JWT JSON Web Token

JSON JavaScript Object Notation

HMAC Hash-based Message Authentication Code

RSA Rivest, Shamir and Adleman

RFC Request for Comments

ECDSA Elliptic Curve Digital Signature Algorithm

SSML Speech Synthesis Markup Language

MP3 MPEG-1 Audio Layer III

MVP Minimum Viable Product

CRUD Create Read Update Delete

OAS OpenAPI Specification

YAML YAML Ain’t Markup Language

UI User Interface

PAT Parametric Artificial Talker

VODER Voice Operating Demonstrator

IoT Internet of Things

UWP Universal Windows Platform

PHP PHP: Hypertext Preprocessor

ASP Active Server Pages

HTML HyperText Markup Language

ES6 ECMAScript 6

1 Introduction

Speech synthesis is the artificial production of human speech, usually by computer. Software that produces artificial speech is called a speech synthesizer. A TTS system converts written language text into speech, while other systems render symbolic linguistic representations into speech. [1]

The project aimed to replace the traditional way of generating announcements in many different languages, which relies on human speakers, with speech synthesis technologies. With speech synthesizers, the process can be automated using online cloud services that provide TTS conversion in real time. Since speech synthesis technology has evolved drastically in the past few years, these services are trustworthy and resilient. The approach is therefore cheaper, faster, more reliable and more consistent than the traditional method. The main goal of the project was to generate announcements with speech synthesis rather than with human speakers.

2 Overview of TTS

2.1 Brief History

The earliest attempts to create artificial speech date back over two hundred years. They relied on mechanical devices, since electronic signal processing was not yet available. The Danish scientist Christian Kratzenstein, working in St. Petersburg, used a mechanical device to model the human vocal tract and produce artificially synthesized vowels. He first built acoustic resonators resembling the human vocal tract and then activated the resonators with vibrating reeds. The outline of the device is shown in figure 2. [1]

This invention was followed by an improved version created by Wolfgang von Kempelen, which added models of the lips and tongue and could produce consonants as well as vowels. Other scientists continued to refine this version over the next few decades. [1]

Later, in the 1930s, Bell Labs developed the VODER (Voice Operating Demonstrator), the first electronic speech synthesizer. The device was a keyboard-operated electronic speech synthesizer. In 1953, PAT (Parametric Artificial Talker) was invented; it consisted of three formant resonators connected in parallel. Following the invention of the relatively cheap TMS-5100 chip, Texas Instruments released a product called Speak & Spell in 1978. The device was designed to help children with reading. [1]

Speech synthesis technology has become more complex and sophisticated in the modern era. Techniques such as hidden Markov models (HMM) and neural networks are now used, which are more accurate and resilient. Since the technology has matured to the point where it is trustworthy, it is used in numerous fields, including health care, education, entertainment and telecommunications. [1]

2.2 Implementation Techniques

Speech synthesis can be implemented in several ways, which can be classified into three types. The first technique is articulatory synthesis. In this technique, the synthesizer tries to model the human vocal organs and vocal cords. The articulators and vocal cords are modeled with a set of area functions between the glottis and the mouth. When a person speaks, the vocal tract muscles contract, causing the articulators to move and change the shape of the vocal tract, which in turn produces different sounds. This technique can produce high-quality synthetic speech, but it is difficult to model because a large amount of data must be processed. [1]

The second technique is formant synthesis, which is based on the source-filter model of speech. There are two basic structures, cascade and parallel, and a combination of the two is often used to obtain better-quality output. Formant synthesis allows an unlimited number of sound combinations, which makes it more flexible than some other techniques. A cascade formant synthesizer has band-pass resonators connected in series, so that the output of each resonator is supplied as input to the next. This structure is simpler to implement because it needs only the formant frequencies as control information, and it has been found useful for non-nasal voiced sounds. [1]

Figure 3: Simple layout of Cascade formant synthesizer.

Figure 4: Simple layout of Parallel formant synthesizer

In a parallel formant synthesizer, the resonators are connected in parallel. The excitation signal is supplied to all formants simultaneously and their outputs are summed. This structure requires more control information, since the bandwidth and gain of each formant are controlled individually. This type of synthesizer has been found better for nasals, fricatives and stop consonants. [1]
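
The two structures can be illustrated with a small sketch. The code below is not part of the project; it is a minimal Python illustration that assumes a standard two-pole digital resonator for each formant, a crude impulse-train excitation, and made-up formant frequencies, bandwidths and gains. The function names are hypothetical.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voiced excitation: one unit pulse per pitch period."""
    period = round(sample_rate / f0_hz)
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def cascade_synth(excitation, formants, sample_rate):
    """Resonators in series: each output feeds the next resonator."""
    signal = excitation
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


def parallel_synth(excitation, formants, gains, sample_rate):
    """Resonators side by side: outputs are weighted per formant and summed."""
    branches = [resonator(excitation, f, bw, sample_rate) for f, bw in formants]
    return [sum(g * branch[i] for g, branch in zip(gains, branches))
            for i in range(len(excitation))]


SAMPLE_RATE = 16000
# Illustrative (frequency, bandwidth) pairs in Hz; assumed values, not measured data.
FORMANTS = [(700, 60), (1220, 70), (2600, 110)]
excitation = impulse_train(f0_hz=120, duration_s=0.05, sample_rate=SAMPLE_RATE)
cascade_out = cascade_synth(excitation, FORMANTS, SAMPLE_RATE)
parallel_out = parallel_synth(excitation, FORMANTS, [1.0, 0.5, 0.25], SAMPLE_RATE)
```

Note that the cascade version takes only formant frequencies and bandwidths, while the parallel version also needs one gain per formant, which mirrors the difference in control information described above.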

The third technique is concatenative synthesis. It uses a collection of recorded speech large enough to cover the sounds of the language. The speech units are selected, modified and joined as necessary. The method depends heavily on runtime selection and editing of the speech units available in the database. Selecting and storing the speech units is demanding and requires a lot of memory. In addition, concatenative synthesizers are usually limited to one speaker and one voice. [1]

2.3 Tools and Technologies

2.3.1 Speech synthesis software and APIs

Several large technology companies offer TTS APIs to their customers to make the development of TTS applications faster, easier and more convenient. Companies like Amazon, Google and Microsoft have played a big role in the rapid development of the TTS field in recent years. Companies like IVONA, Neospeech and ReadSpeaker have also been in this business for a while.

The following is a list of some of the companies and projects working with speech synthesis.

Acapela

Acapela provides TTS software and services. It offers SDK solutions for Windows, Mac OS X, Windows Server, Linux server, UWP, iOS, Android, embedded Linux and Windows Mobile. Most of its solutions are cloud based. Formed through the merger of three companies, Acapela supports TTS in more than 30 languages, with coverage extending to accents, dialects and local voices. [1]

Amazon Polly

Amazon Polly is a widely used cloud-based TTS service developed by Amazon. It provides a simple API and straightforward integration. It offers control over many voice parameters, including frequency, sampling rate and output format. It supports a wide range of programming languages and is available in over 25 languages. [1]

eSpeak

eSpeak uses formant synthesis, and its voices are therefore not natural sounding. It is an open source project available for Windows and Linux. The speech produced is clear and can be used at high speeds but lacks a natural accent. It can write speech output to a WAV file and supports SSML and HTML. It supports more than 45 languages. [2]

Festival

Festival provides a general framework for building speech synthesis systems. It is free software distributed under an X11-type license that permits unrestricted use, including for commercial purposes. [1]

Google

Google is undoubtedly one of the biggest players in TTS services. Since Google handles the distribution of the Android operating system, its solution was primarily developed for mobile devices. The services have since been extended to support various platforms and systems. One of its most common TTS services is Google Translate. Google has also implemented TTS in the Chrome web browser to read any text in the browser. It offers about 100 voices and supports more than 20 languages. [1]

iSpeech

iSpeech is a California-based company that develops TTS services and solutions for various platforms. Its services are free to use on mobile platforms and carry a fee for use on the web. It provides API access so that developers can integrate the service. It supports more than 20 languages. [1]

Microsoft

Like Google, Microsoft is one of the biggest players in the TTS business. Its Speech Application Programming Interface (SAPI) provides speech recognition and speech synthesis to Windows applications. The SDK itself is integrated into the OS. [1]

Nuance

Nuance is mainly known for powering the Apple Siri personal assistant. Its solutions are available for a wide range of platforms, including cloud, embedded and mobile. It offers over 119 voices in 53 languages. [1]

ReadSpeaker

ReadSpeaker enables web-based applications to add speech to websites and mobile platforms using TTS technology. It supports more than 35 languages, and its API is available to developers in many programming languages and platforms, including Java, Objective C, PHP, ASP and Flash. [1]

2.4 Application of Speech Synthesis

2.4.1 Applications for the Blind

This is probably the most useful and important application of speech synthesis. Synthesized speech can be a great aid in reading and communication for blind people. Before speech synthesis, audio books were used, in which the content of a book was read onto audio tape. Such recordings take a lot of time to produce and are expensive. With special keyboards that support Braille characters, it is easier to get the needed information from computers. Speech technologies are now integrated into web pages and other forms of media and can be used on ordinary computers. [1]

2.4.2 Applications for the Deafened and Vocally Handicapped

People who are born deaf usually struggle to learn to speak properly because they cannot hear any spoken input. People with hearing difficulties often have difficulties in speaking as well. Synthesized speech helps these people communicate with others who do not understand sign language. Tools such as HAMLET (Helpful Automatic Machine for Language and Emotional Talk) have been developed to help people express their feelings. This tool is designed to run on a PC with a high-quality speech synthesizer. [1]

2.4.3 Educational Applications

Synthesized speech can be used in education in many ways. A synthesizer can teach throughout the year and can be designed to teach languages, including spelling and pronunciation. Children with dyslexia can also use the technology to learn on their own, since some children feel embarrassed when they must be helped by a teacher or fellow students. [1]

2.4.4 Applications for Telecommunications and Multimedia

This is one of the newest applications of the technology. Speech synthesis is now integrated into most telephone inquiry systems, including customer care. The integration process is easy and can be done in a short time, which has brought the technology into ordinary customers’ daily lives. There are also services that read email aloud so that users do not have to go through each message manually. Such a service is useful when a large number of emails must be processed at one time. [1]

3 Kone

Kone Corporation, founded in 1910, is a company based in Helsinki, Finland. It is a world leader in the design and manufacture of escalators, elevators, automatic doors, gates and other mobility devices. It has over 55,000 employees in 60 countries. Its business serves local manufacturers, designers, building owners and architects. Millions of people around the world use its services and products every day. [2]

3.1 Brief History

Kone started in 1910, when it was registered as a machine repair shop named Tarmo in Helsinki and incorporated as Kone, the Finnish word for machine. The company began by selling and refurbishing used Strömberg motors, and soon started importing and installing elevators from Sweden. In the 1930s Kone came to dominate the Finnish elevator market, but the domestic market was very small. In 1933 it started making cranes to compensate for the limited elevator market. Kone’s breakthrough came in 1968, when it bought ASEA’s elevator business, which combined a Swedish business unit with Norwegian and Danish subsidiaries. In one leap Kone became the market leader in Northern Europe and began selling its products in 9 countries. In 1974 it took another big step by buying Westinghouse’s European elevator business, which was a leader in both France and Belgium. [2]

By 1980, Kone was one of the top three companies in the elevator and escalator business. In 1994, it bought the fourth largest elevator business in the United States. By 2015, over one billion people used Kone elevators every day. The company operates in over 60 countries and has over 400,000 customers. [2]

3.2 Main Expertise

Kone has been expanding its fields of expertise since it was founded. It started by selling elevators and soon expanded into cranes, and has since entered numerous other sectors. In 2005 the parent company demerged into two companies, Kone Corporation and Cargotec Corporation. Kone itself now focuses mainly on elevator-related business. Some of Kone’s areas of expertise are listed below. [2]

3.2.1 Elevators, escalators and automatic building doors solutions

Kone delivers innovative and eco-efficient elevator solutions for all types of buildings, from ordinary residential houses to the tallest skyscrapers. It has cost-efficient and energy-efficient solutions for everything from low-demand to very high-demand buildings. Recently, it has tested a new technology that uses carbon nanoparticles to construct a rope in place of the traditional steel rope used in elevators. This innovation is highly efficient since it is cheap, eco-friendly, strong and durable. In recent years, Kone has started to integrate IoT into its elevators, allowing users to control different variables inside the elevators. [2]

Likewise, Kone escalators and autowalks have set new industry standards for safety, eco-efficiency and visual design. They are offered for different environment types, including indoor, outdoor, business areas and transport stations. Automated doors are installed in buildings and other designated environments, making the flow of people and goods efficient. [2]

3.2.2 Maintenance and modernization

Since Kone started as a repair shop, it has been involved in maintenance since its early days. Kone now offers maintenance service around the clock. It has a large number of experts working to provide maintenance and better solutions for every inconvenience. [2]

Modernization has been a big part of Kone’s culture. Over time, the company has evolved its technologies according to the needs and demands of the market, and it has pioneered some of the inventions in the elevator industry. [2]

3.2.3 Advanced people flow solutions

Kone is replacing traditional building solutions based on doors and keys with new automated solutions. It has recently started a service that lets users open the door of their home or workplace automatically and reach their floor without touching a single button. Likewise, with Kone Visit, users can welcome visitors, delivery and service people by unlocking the door from a smartphone at any distance. The smartphone serves as a central hub for information about access grants, maintenance notices and building updates. These services are now available in selected countries, including Finland, Germany, the Netherlands and the United Kingdom. [2]

4 Addressing the problem

Devices like elevators and escalators have announcements installed on them for different scenarios. Usually these are general floor information, building updates or emergency instructions. It is very important that the audio is properly installed on the devices, since it plays a vital role in the user experience and in emergency situations. The sound quality should be high and free of any disturbance.

For a good daily customer experience, the audio should be consistent in speech quality, voice type, noise filtering and volume. These parameters are hard to control when the audio is produced by humans. The human voice is inconsistent and varies with the condition of the body. The availability of the person responsible for the recordings can also be a problem. It leads to changes in the designated personnel, which in turn changes various parameters in new recordings compared to old ones. The process itself is also time consuming. Audio must be recorded manually, and since the parameters of a human voice vary from one recording to the next, several retakes are often needed. The announcements must also be translated into many different languages, which requires a native speaker from each country to repeat the process manually. In the long run this process is expensive, inconsistent, unreliable and error prone for the company.

Likewise, it is time consuming to amend recorded speech if the company wants to change its parameters. Usually the company must find the person who made the latest recording and repeat all the steps, and there is no guarantee that this person is available. This makes the process even more expensive and time consuming.

5 Implementation

5.1 Technologies

The application was designed to run on the web platform and uses common web technologies. The web client, server and database run in separate Docker containers. The web client is written with React as the component library and Redux for application state management. The server side is written in Koa, a modular framework for NodeJS. Postgres is used as the database, with Sequelize as the ORM. Swagger is used for developing, debugging and managing API specifications.

5.1.1 Docker

Docker is a container-based development and deployment platform. Docker aims to make it easier to create, run and deploy applications with a high degree of flexibility. All applications developed with Docker run in containers. A container is a standard unit of software that bundles the code and all application dependencies, so that the software runs easily, smoothly and reliably from one environment to another. A container image is a lightweight, standalone, executable package of software that contains everything needed to run an application. [5]

Before containerization, the popular way of isolating and organizing applications was to place each application on its own virtual machine, so that multiple virtual machines ran multiple applications on the same hardware. This concept is called virtualization. Virtualization has several drawbacks in efficiency and cost. Virtual machines are bulky and resource consuming. Because each one installs an extra OS on the same physical hardware, it requires far more resources than the application needs, including all the libraries and packages for running the OS itself. There were continuous problems related to performance stability, portability, software updates, and continuous integration and delivery. Containerization aims to solve those problems. Since containerization is a type of virtualization at the operating system level, there is no need to install a guest OS. Containerization abstracts at the operating system level, whereas virtualization abstracts at the hardware level. This removes the bulky kernel libraries that come with installing a guest OS, saving a lot of resources and boosting performance. [5]

Figure 5: Virtual Machine and Containers

Docker containers run on the Docker engine, which runs on the host operating system. As Figure 5 shows, each application runs in its own container with its own set of libraries and binary dependencies. This ensures that containers run independently of each other, assuring developers that their applications do not interfere with one another during development. This approach consumes fewer resources than virtual machines, since there is no extra guest OS layer. In Docker, the applications share relevant resources and libraries when needed. Performance improves considerably, since the application binaries and libraries run directly on the host OS. [5]

5.1.2 React

React is an open source JavaScript library that is typically used for building scalable single page applications. React handles the view layer for mobile and web applications by offering a fast way of creating and updating components. It can serve as the view layer alongside other MVC frameworks and libraries. It is also useful for creating reusable UI components. Components can maintain and track their state throughout their lifecycle. [8]

Fig 6: Difference in render: DOM and Virtual DOM

React uses a virtual DOM to track the state of the application, which makes it faster. In the browser, the DOM is represented as a tree. When any child element of the DOM is updated, the DOM tree is updated with the selected child, but all the following children are also re-rendered or re-painted. This re-rendering makes the process slow and expensive. React tackles the problem with a virtual DOM, which is essentially an object representation of the state of the DOM tree. Whenever the state of the application changes, the virtual DOM is updated with the required changes. React does not work directly with the DOM but updates the virtual DOM instead. Once the update to the virtual DOM is complete, React looks for the best way to update the real DOM so that the fewest possible operations are performed on it. This makes the process cheaper and reduces the performance cost. [8]
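
The idea of comparing two virtual trees and deriving a small set of real-DOM operations can be sketched as follows. This is a simplified Python illustration, not React's actual reconciliation algorithm (which also relies on component types and keys); the `VNode` class, the `diff` function and the patch names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VNode:
    """A node of the in-memory (virtual) tree."""
    tag: str
    text: str = ""
    children: List["VNode"] = field(default_factory=list)


def diff(old: Optional[VNode], new: Optional[VNode], path: str = "root"):
    """Return the list of (operation, path, payload) patches turning old into new."""
    if old is None:
        return [("create", path, new.tag)]
    if new is None:
        return [("remove", path, None)]
    if old.tag != new.tag:
        # Different element type: replace the whole subtree.
        return [("replace", path, new.tag)]
    patches = []
    if old.text != new.text:
        patches.append(("set_text", path, new.text))
    # Compare children position by position.
    for i in range(max(len(old.children), len(new.children))):
        old_child = old.children[i] if i < len(old.children) else None
        new_child = new.children[i] if i < len(new.children) else None
        patches.extend(diff(old_child, new_child, f"{path}/{i}"))
    return patches


old_tree = VNode("ul", children=[VNode("li", "English"), VNode("li", "Finnish")])
new_tree = VNode("ul", children=[VNode("li", "English"), VNode("li", "Swedish"),
                                 VNode("li", "German")])
patches = diff(old_tree, new_tree)
```

Only the changed list item and the newly added one produce patches; the unchanged first item is left alone, which is the saving the virtual DOM is meant to provide.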

5.1.3 Redux

Redux is a JavaScript library for application state management. Redux helps developers write applications that behave consistently across different environments. It keeps the application state in a centralized store, which enables features such as undo and redo and state persistence. Redux operates on three principles. The first is that the state of the entire application is stored in an object tree within a single store. The second is that the state is read-only: the only way to modify it is to dispatch an action that describes what to change. The third is that changes are made only with pure functions. When an action is dispatched, it is passed to a reducer, a pure function that specifies how the state is modified. [10]
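
The three principles can be sketched in a few lines. The following is a minimal Python illustration of the pattern, not the Redux library itself and not the project's code; the `Store` class, the action names and the sample state are assumptions made for the example.

```python
from types import MappingProxyType

# Principle 1: the whole application state lives in one object.
INITIAL_STATE = {"language": "en", "announcements": ()}


def reducer(state, action):
    """Principle 3: a pure function that builds a new state, never mutating the old one."""
    if action["type"] == "SET_LANGUAGE":
        return {**state, "language": action["language"]}
    if action["type"] == "ADD_ANNOUNCEMENT":
        return {**state, "announcements": state["announcements"] + (action["text"],)}
    return state  # unknown actions leave the state unchanged


class Store:
    def __init__(self, reducer, initial_state):
        self._reducer = reducer
        self._state = initial_state
        self._listeners = []

    def get_state(self):
        # Principle 2: callers get a read-only view and cannot assign into it.
        return MappingProxyType(self._state)

    def subscribe(self, listener):
        self._listeners.append(listener)

    def dispatch(self, action):
        # The only way to change state is to run the reducer on an action.
        self._state = self._reducer(self._state, action)
        for listener in self._listeners:
            listener(self.get_state())


store = Store(reducer, INITIAL_STATE)
store.dispatch({"type": "SET_LANGUAGE", "language": "fi"})
store.dispatch({"type": "ADD_ANNOUNCEMENT", "text": "Ovet sulkeutuvat"})
```

Because every change produces a new state object and the previous one is left untouched, features such as undo and redo reduce to keeping a list of earlier states.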

5.1.4 Koa

Koa is a NodeJS framework developed by the team behind Express, one of the most popular NodeJS frameworks in the developer community. A Koa application is an object containing an array of middleware functions for different operations, which are run in a stack-like manner. NodeJS traditionally handles every request with callbacks. This can be hard to read and debug when the calls are nested, and it quickly leads to so-called ‘Callback Hell’, the nesting of multiple callbacks from different operations. Koa uses the async and await syntax of modern ECMAScript to remove callbacks and offers a cascading way of writing applications. This makes the code more readable and easier to debug. [12]

Koa offers the bare minimum of services for writing applications. It is quite modular compared to other NodeJS frameworks such as Express and Connect. Unlike other NodeJS frameworks, which expose the raw request and response objects of Node, Koa wraps them in its own ctx.request and ctx.response objects. Koa does not offer routing by default, but routing can be added from third-party libraries approved by the Koa community. Koa has a promise-based flow and therefore supports better error handling than other NodeJS frameworks through try and catch. The difference in code flow between traditional NodeJS and Koa is shown in Fig 7. [12]

Fig 7: Traditional NodeJS vs Koa
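
The cascading middleware flow described above can be sketched without Koa itself. The following is a hedged Python illustration using asyncio: it mimics the pattern in which each middleware awaits the next one and resumes afterwards, and in which errors are handled with an ordinary try and except. The `compose` helper, the middleware functions and the dictionary used as context are hypothetical stand-ins, not Koa's API.

```python
import asyncio


def compose(middleware):
    """Chain middleware so that each one can await the rest of the stack."""
    async def run(ctx):
        async def dispatch(i):
            if i == len(middleware):
                return
            await middleware[i](ctx, lambda: dispatch(i + 1))
        await dispatch(0)
    return run


async def logger(ctx, next_):
    ctx["trace"].append("logger: before")
    await next_()  # control flows downstream ...
    ctx["trace"].append("logger: after")  # ... and back upstream


async def error_handler(ctx, next_):
    try:
        await next_()
    except ValueError as exc:
        # Errors from downstream middleware surface here as normal exceptions.
        ctx["status"] = 400
        ctx["body"] = str(exc)


async def synthesize(ctx, next_):
    ctx["trace"].append("handler")
    if not ctx.get("text"):
        raise ValueError("text is required")
    ctx["status"] = 200
    ctx["body"] = "queued: " + ctx["text"]


app = compose([logger, error_handler, synthesize])

ok_ctx = {"trace": [], "text": "Doors closing"}
asyncio.run(app(ok_ctx))

bad_ctx = {"trace": []}
asyncio.run(app(bad_ctx))
```

The trace recorded in `ok_ctx` shows the stack-like order: the logger runs first, the handler runs in the middle, and the logger resumes last.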

5.1.5 Sequelize

Sequelize is a promise-based ORM for NodeJS that supports databases such as Postgres, MySQL and SQLite. It is one of the most readable ORMs in the developer community. Using an ORM like Sequelize saves the time otherwise spent writing raw queries, which can be complex and hard to read. It also helps prevent attacks such as SQL injection. Like other ORMs, Sequelize offers features such as robust transaction support, relation mapping and read replication. It also supports eager and lazy loading, raw queries, database synchronization, database migrations, model validation, data seeding and scopes. [13]
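
The protection against SQL injection comes from passing user input as bound parameters instead of pasting it into the query string. The sketch below illustrates that mechanism with Python's built-in sqlite3 module as a stand-in; it is not Sequelize code, and the table, the sample rows and the function names are made up for the example.

```python
import sqlite3


def make_demo_db():
    """In-memory database with two sample announcements (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE announcements (language TEXT, text TEXT)")
    conn.executemany(
        "INSERT INTO announcements VALUES (?, ?)",
        [("en", "Doors closing"), ("fi", "Ovet sulkeutuvat")],
    )
    return conn


def find_by_language_unsafe(conn, language):
    # Vulnerable: the input becomes part of the SQL text itself.
    query = f"SELECT text FROM announcements WHERE language = '{language}'"
    return [row[0] for row in conn.execute(query)]


def find_by_language_safe(conn, language):
    # Safe: the input is bound as a value and is never parsed as SQL.
    query = "SELECT text FROM announcements WHERE language = ?"
    return [row[0] for row in conn.execute(query, (language,))]


conn = make_demo_db()
malicious = "x' OR '1'='1"
leaked = find_by_language_unsafe(conn, malicious)   # the condition is bypassed
blocked = find_by_language_safe(conn, malicious)    # treated as a plain string
```

An ORM builds its queries in the second style by default, which is why handwritten string concatenation is the case to avoid.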

5.1.6 JWT

JWT is an open standard (RFC 7519) that defines a compact and self-contained way of exchanging information between two parties as a JSON object. The information can be trusted because the object is digitally signed and can be verified. A JWT can be signed either with a secret using the HMAC algorithm or with a public/private key pair using RSA or ECDSA. JWT can be used for authorization and for exchanging information between parties. For authorization, once the user has logged in, a JWT is issued to the user, and every subsequent request to a protected service must include the JWT in the request header to gain access. Different permissions can also be assigned to different JWTs, allowing the developer to customize the permission level for different resources. [14]
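
How signing makes the token verifiable can be shown with a short sketch. The code below is a minimal Python illustration of an HMAC-SHA256 (HS256) signed token built with the standard library only; it is not the project's authentication code, it omits expiry and header checks that a real service needs, and the secret, the claims and the function names are assumptions.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url encoding without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify_jwt(token: str, secret: bytes):
    """Return the payload if the signature matches, otherwise None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        return None
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(expected, signature_b64):
        return None
    return json.loads(b64url_decode(payload_b64))


SECRET = b"demo-secret"  # placeholder value for the example only
token = sign_jwt({"sub": "user-1", "role": "editor"}, SECRET)
claims = verify_jwt(token, SECRET)
```

A client that alters the payload, for example to raise its own role, cannot produce a matching signature without the secret, so the server rejects the token.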

5.1.7 Postgres

Postgres is arguably the most advanced free and open source database system and has been popular in the developer community for a very long time. PostgreSQL is a general-purpose, object-relational database management system developed at the Berkeley Computer Science Department of the University of California. Its core platform has been under active development for more than 30 years. Postgres has earned its reputation for robustness, data integrity, well-structured architecture, reliability, a strong feature set and its commitment to open source projects and community. It was initially designed to run on UNIX-like platforms, but portability was added later so that it can run on different operating systems, including Mac OS X, Solaris and Windows. It is designed to handle loads ranging from a single machine to huge data warehouses and web services with a vast number of concurrent users. Postgres has a large, active developer community that tracks existing bugs and issues and offers help to developers. [9]

Postgres offers many advanced features compared to other database systems. These include user-defined types, asynchronous replication, views, rules, subqueries, table inheritance, nested transactions, a sophisticated locking mechanism, multi-version concurrency control (MVCC), tablespaces and point-in-time recovery. According to [9], PostgreSQL implemented multi-version concurrency control even before Oracle, which later implemented a similar feature known as snapshot isolation. Postgres also allows custom functions to be added to the system using languages such as C/C++ and Java. Postgres is very extensible as well, with support for custom data types, index types and functional languages. If a developer is not satisfied with some part of the system, a custom plugin can be written to meet the requirements. Because of its high stability, Postgres also has very low maintenance cost and effort. Choosing Postgres over other database management systems therefore means a lower total cost of ownership. [9]

5.1.8 Swagger

Swagger is an open-source project that provides a large ecosystem of tools to help developers design, document, build, test, deploy and consume RESTful web services. In other words, Swagger is a set of rules for a format that describes REST APIs. This format can be used to set a standard for technical documentation with common behaviors, patterns and consistency. [11]

API design can be a very error-prone process, and it is sometimes extremely troublesome to keep track of the work and debug a faulty API. Swagger Editor was the first editor built for the OpenAPI Specification (OAS), and it has since contributed largely to meeting the needs of developers building APIs with OAS. This YAML editor validates the API design and checks OAS compliance in real time, and it gives feedback as the user goes on developing. It provides tools that help to design the API for an application from scratch, as well as tools that support well-organized documentation of the product throughout the software development process. It offers tools to test how the documentation will look and behave for end consumers, so that developers can verify that their work is valid and that the documentation corresponds correctly to the developed service. The documentation is easily readable and available to any new developer who joins the project. It can be shared among the developers, testers and other people engaged in the product development. It can also be used to automate some of the processes in API development. [11]

Figure 8: Swagger online editor with meta data, API specification in YAML format

Figure 9: Swagger API Testing Tool for each created API

5.2 Application Architecture

Application architecture is the process of defining the application structure against the business requirements. It aims to define the application landscape and optimize the architecture to fit the business requirements. It describes how the application components interact with each other, and with users, at a high level.

Figure 10: Application diagram

The application designed for this project was a web-based application consisting of a browser client and a server. The system was developed on a Docker-based development and deployment system, so every component of the application runs in a Docker container. There are three Docker containers running different components of the system: client, server and database. These containers run independently of one another and are also deployed independently. As seen in the picture above, the web client communicates with the server through a REST API. The user needs to be logged in to the system to use the services. Authentication is based on JWT (JSON Web Token): once the user has logged in, the server validates the token sent with each subsequent request to verify the user. Additionally, there is a microservice for the Amazon Polly service. This service is a standalone application that is responsible for processing the parameters fed through the REST API and returning synthesized audio with the required filters.
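The token validation step can be illustrated with a minimal sketch. It assumes an HS256-signed token and uses only the Python standard library; the function names `issue_token` and `verify_token`, the choice of the `sub` and `exp` claims and the secret are illustrative assumptions and do not reproduce the project's actual implementation.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url after restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(user_id: str, secret: bytes, ttl_seconds: int = 3600) -> str:
    """Create a signed token at login (hypothetical helper)."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": user_id, "exp": int(time.time()) + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes) -> Optional[dict]:
    """Return the payload if the signature is valid and the token is not expired."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        provided = _b64url_decode(signature_b64)
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        return None  # malformed token
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(provided, expected):
        return None  # signature does not match the server's secret
    if not isinstance(payload, dict) or payload.get("exp", 0) < time.time():
        return None  # missing or expired expiry claim
    return payload
```

On each subsequent request the server would call `verify_token` on the token sent by the client and reject the request when it returns `None`. A production system would normally rely on an established JWT library rather than hand-written code like this.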

5.2.1 Microservices

Microservices is an architectural style that structures an application as a collection of services. In a system with microservices, a large application is split into sensible components based on the requirements, and each component is served as an independent microservice. Each microservice hosts its own database and everything else it needs to run smoothly, and it communicates with other services through an API such as a REST API. These services are individual standalone applications that can run independently of other services. The failure of one service does not affect the running of any other service. [6]

Data consistency is a very important consideration when it comes to microservices. As each service has its own database, it is important to maintain data consistency across the different services. Some business transactions are not limited to one database and need to propagate across multiple services. A common way to achieve this is the saga pattern, in which a business transaction is carried out as a sequence of local transactions, and compensating transactions undo the earlier steps if a later one fails. There are two ways of coordinating a saga: choreography and orchestration. In the choreography pattern, each local transaction publishes domain events that trigger local transactions in other services. In the orchestration pattern, an orchestrator object tells the participants which local transactions need to be executed. [6]

Figure 11: Choreography pattern

Figure 12: Orchestration pattern
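The orchestration variant can be sketched in a few lines. The following is a minimal in-process illustration, assuming each step wraps a call to one service; the names `SagaStep` and `run_saga` are hypothetical, and a real orchestrator would talk to the services over the network and persist its progress.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    """One local transaction plus the action that undoes it (hypothetical type)."""
    name: str
    action: Callable[[], None]        # local transaction in one service
    compensation: Callable[[], None]  # compensating transaction for that service


def run_saga(steps: List[SagaStep]) -> bool:
    """Run the steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Roll back only the steps that actually finished.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True
```

For example, if one step stores an announcement in the main server and the next step fails while synthesizing its audio, `run_saga` calls the first step's compensation so that no announcement is left without audio.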

Microservice architecture has numerous advantages over a traditional monolithic system. Each service is loosely coupled and contains only the logic needed for the component itself to operate. This makes the system highly maintainable and testable. It allows rapid development of each component of a system, since a small team can focus on one service and is only responsible for maintaining that service. New developers also have an easier time understanding the system, because each service is independent of the others and its logic concerns only the service itself. Likewise, it reduces the long-term commitment of a project to one technology, as each service can be written in any technology stack. Hence, microservice architecture offers improved maintainability, testability and deployability. [6]

5.3 Amazon Polly

Amazon Polly is a TTS cloud service by Amazon Web Services that uses advanced deep learning technologies to synthesize text into speech that sounds like a human voice. The service was launched in November 2016 and now supports 47 voices and 24 different languages. Besides the standard TTS service, Amazon Polly offers neural TTS (NTTS), which uses advanced machine learning techniques to provide large improvements in speech quality and some of the most human-like speech on the market right now. [4]

Amazon Polly offers natural-sounding voices. It supports many different languages with male and female voices, and various filters can be applied to those voices. It allows unlimited replays of generated speech without any additional cost. It supports generation in standard formats like MP3 and OGG, and the audio can be served from the cloud or stored locally in different apps and devices for offline playback. It also provides real-time playback of the text the user wants to synthesize: the speech is delivered as a stream, so the user can listen to the audio immediately while customizing it. It supports lexicons and SSML tags, which make it possible to customize aspects of the generated speech such as pronunciation, pitch, volume and speaking rate. Its pay-as-you-go pricing with unlimited playback is cost-effective for a voice application. [4]
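As an illustration of the SSML customization, the sketch below wraps a text in a `<prosody>` tag and collects the parameters for a synthesis request. The helper names `build_ssml` and `build_polly_request`, the example voice and the default values are assumptions made for this example; the dictionary keys follow the parameters of the `synthesize_speech` call in the AWS SDK for Python (boto3), but no request is sent here.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, rate: str = "medium", pitch: str = "medium",
               volume: str = "medium") -> str:
    """Wrap plain text in SSML with speaking rate, pitch and volume settings."""
    prosody = (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} "
        f"volume={quoteattr(volume)}>"
    )
    # escape() protects characters such as & and < inside the spoken text.
    return f"<speak>{prosody}{escape(text)}</prosody></speak>"


def build_polly_request(text: str, voice_id: str, output_format: str = "mp3",
                        **prosody: str) -> dict:
    """Collect keyword arguments for a synthesis request (hypothetical helper)."""
    return {
        "Text": build_ssml(text, **prosody),
        "TextType": "ssml",
        "VoiceId": voice_id,
        "OutputFormat": output_format,
    }


request = build_polly_request("Doors opening", voice_id="Joanna", rate="slow")
# With AWS credentials configured, the request could be sent roughly as:
#   boto3.client("polly").synthesize_speech(**request)
# and the audio read from the "AudioStream" field of the response.
```

Keeping the SSML construction separate from the network call makes the filter logic easy to test without access to AWS.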

5.4 Execution

Standard software development procedures were followed while developing this project. The project was split into a timeline that described the features and modules to be delivered by a certain date. The project was structured following the agile method of software development. A minimum viable product (MVP) was developed initially to validate the concept, and additional features were later added to the MVP to complete the product.

5.4.1 Requirement Analysis

The initial phase of the project began by gathering the requirements from the client and putting them together to form a high-level technical understanding of the project. The requirements were discussed with the client in meetings to come up with a solution and agree on a procedure. The actions and actors of the project were collected, and diagrams were made based on user actions. There was regular communication with the client to update and discuss new features and specifications and to resolve conflicts and ambiguities in the requirements. The requirements were refined over time as the client's needs changed, and the diagrams were updated accordingly.

5.4.2 System Design

The next phase of the project was system design. This required numerous iterations and discussions with developers to determine the optimal way of implementing each feature or module. Frequent meetings were held among the developers and the project manager to discuss the design of the application.

The application was mainly based on generating speech from text. An individual generated speech that is saved to the system is termed an Audio. Generated speeches were structured and combined with other speeches into a single speech, which was termed an Announcement. Each part of an announcement that contains a specific audio was termed a Segment. A Segment contains information like its name, the translations available for the specific audio, tags and an example. A Translation contains information like its segment, its language, the audios available for that translation and an example. The announcements were mapped to a Sheet. A Sheet is a container for announcements that can be related to a project. It contains references to all the announcements that are part of the sheet. A Sheet can also be global, in which case its announcements can be exported directly, independently of any project. A Sheet has a reference to a device type, and the announcements in the sheet are designed for that specific device type.
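The relationships between these concepts can be sketched with plain data classes. This is only an illustration of the terminology: the field names, the `text` field on a translation and the helper methods are assumptions, and the actual database schema of the project is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Audio:
    """A single generated speech file saved to the system."""
    voice: str
    file_name: str


@dataclass
class Translation:
    """One language version of a segment and its generated audios."""
    language: str
    text: str  # assumed field: the sentence to be synthesized
    audios: List[Audio] = field(default_factory=list)


@dataclass
class Segment:
    """A reusable part of an announcement."""
    name: str
    tags: List[str] = field(default_factory=list)
    translations: List[Translation] = field(default_factory=list)

    def translation_for(self, language: str) -> Optional[Translation]:
        return next((t for t in self.translations if t.language == language), None)


@dataclass
class Announcement:
    """An ordered combination of segments."""
    name: str
    segments: List[Segment] = field(default_factory=list)

    def text_in(self, language: str) -> Optional[str]:
        """Join the segment texts, or return None if any translation is missing."""
        parts = []
        for segment in self.segments:
            translation = segment.translation_for(language)
            if translation is None:
                return None
            parts.append(translation.text)
        return " ".join(parts)


@dataclass
class Sheet:
    """A container of announcements for one device type."""
    name: str
    device_type: str
    project: Optional[str] = None  # None marks a global sheet
    announcements: List[Announcement] = field(default_factory=list)

    @property
    def is_global(self) -> bool:
        return self.project is None
```

Modelling a missing translation as `None` makes it explicit when an announcement cannot yet be exported in a given language.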

The first phase of system design was to design the database. Considering concepts like Segments, Announcements, Sheet, Project, Device, Audio, Voice, Translations, Languages and many others, an initial proposal for the database structure was made. This structure was iterated on in the following weeks to improve the design and fit the requirements of the client. The process was delicate and time-consuming, and it required close attention because the whole structure of the application depends on how the relationships are defined. An in-depth database diagram was made with all the required tables. The fields of each table were defined, and the relationships between the tables were outlined. After numerous iterations, the layout of the database structure was settled.

The second step of the development was to create the application design. The goal of this process was to separate the application into sensible modules and services in order to support important considerations in application development such as scalability, security and robustness. The application was separated into two parts. A traditional server was designed to handle all the local transactions and API calls. This part was mainly responsible for receiving requests from users that are not related to voice or speech generation, performing operations on the database and serving the required information. Requests such as CRUD operations on segments, translations and announcements, and other calls not related to voice generation, were therefore handled by this server.

Likewise, for generating audio based on the filters provided by the user, a microservice was designed that takes requests from users through a REST API and performs the requested operations. This service communicates with the main server to perform transactions that span more than one service. This architecture separates the core concept of the application, which is a TTS service, from the other operations that add features, customization and control to the generated audio. The microservice is solely responsible for performing CRUD operations on the audios and providing the data back to the users through an API.

5.4.3 Execution

The next step in application development was to write the code. To maintain good communication and keep track of the work, issues were created in Jira. These issues contained details like the description, duration and severity of the tasks. The tasks were then assigned to sprints depending on the priority requested by the customer. They were divided among the developers, with the end of the sprint as the deadline. A Slack channel was set up at the start for better communication between the developers and for keeping track of progress.

The development began by writing the API in the back-end. The endpoints were written as requested by the developers working on the front-end, supplying the front-end with the needed data. These endpoints initially returned hard-coded data that acted as database seeds. Meanwhile, in the front-end, the work began by separating the user interface into React components and writing those components. These components were responsible for displaying the data sent by the back-end in JSON format. Gradually, the hard-coded seeds were replaced by a real database with dummy data. Regular meetings were held at the end of each sprint to track the progress and discuss the efficiency and learnings of the last sprint. If developers ran across a bug, an issue was filed in Jira and assigned to the developer who was responsible for the task. These bugs were added to the backlog so that they could be solved in the final bug-fixing sprints of the project.

As a microservice can be implemented separately by a small team, a few developers were allocated to complete the service. The service was independent of and loosely coupled to the main system. This enabled the developers to do their tasks with full control and independently of the main timeline. This service was responsible for generating audios and editing existing ones using the Amazon Polly synthesizer service.

Within a few months, an MVP was created as requested by the customer. This MVP contained the minimum features needed to validate the concept. The MVP was sent to the customer for testing. After receiving comments and feedback from the customer, and permission to move forward with adding features and completing the MVP, another timeline for the project was made. Tasks were added to Jira for the new features. The same procedures as in the MVP creation were followed to deliver the final product in the following months.

5.4.4 Integration and Deployment

The project uses Bamboo, a continuous integration and continuous deployment server developed by Atlassian, as its integration server. Like other integration tools, Bamboo offers concepts such as plans, stages, jobs and tasks that allow developers to continuously monitor and debug their application. Bamboo also has built-in Git branching workflows and deployment projects.

For the deployment of the project, build plans are created for specific branches. A build can either fail or succeed depending on various factors like code quality, conflicts, and the versions and dependencies used. Upon a successful build, a release is made and the code is deployed.

5.5 Outcome and observations

The project was completed and delivered as required by the client. All of the requested features were added according to the timeline and planning. Some of the observations made from the project were:

1. Speech synthesis technologies are getting better than ever.

While working with Amazon Polly and its speech synthesis process, it became quite apparent that it is one of the strongest competitors in the market. The speech provided by its synthesizers supports extensive customization, with many filters to make it sound more like a real human voice. Synthesizers these days support numerous languages with different voices. The quality of the speech keeps getting better as the technology advances.

2. When it comes to designing, developing and testing APIs, Swagger is one of the best products for developers.

Swagger was very helpful while designing the API. The communication between front-end and back-end developers was more effective since there was well-organized and standardized API documentation. Debugging and testing the API were easy and straightforward since Swagger provided real-time validation, feedback and a testing tool for the API.

3. Speech synthesis technology can replace human resources in the voice generation field in the near future.

Speech synthesis technology has been developing rapidly in the past decades. With big technology companies like Amazon, Microsoft and Apple investing in it, it can be predicted that the technology will be refined even more in the coming years. This implies that the technology can become widely applicable in different sectors as it grows more trustworthy and affordable. This could be a huge benefit to industries that use human resources to generate voices for their business purposes. Industries like the elevator industry could completely remove the need for human speakers when generating and translating announcements for different devices.

6 Conclusion

In the present technology market, many speech technology companies offer products. Aside from these companies, a great deal of research in the field of TTS is carried out by educational institutions and companies. This has improved the technology vastly, providing very human-like and intelligent voice technology that is very different from what it was in its early days. On the other hand, some issues still make the synthesized voice different from a human voice. In particular, the generated voice lacks emotional expressiveness.

The main aim of the project was to build a system that generates speech from a given text and provides features like translating the generated audios into multiple languages, changing the format and bit rate, and exporting audio files. The project was delivered successfully with the requested features, helping the client to automate speech generation through modern speech synthesizers. The system eliminated the client's human resource expenses for recording announcements and provides consistent and reliable voice generation and translation that does not depend on human speakers.

With services like Amazon Polly providing high-quality TTS technology, it can be predicted that the technology will keep evolving, making it applicable in different areas of development and helping businesses to minimize their expenses and provide better service to their customers.

7 References

1. Sami Lemmety. Review of Speech Synthesis Technology [online]. Available at: research.spa.aalto.fi/publications/theses/lemmetty_mst/ [Accessed 14 June 2019]

2. Kone.com, Kone Corporation (2019) [Online] Available at: www.kone.com [Accessed 17 June 2019]

3. K.R Aida-Zade, C.Adril and A.M Sharifova. World Academy of Science. The main principle of speech to text synthesis system 2013 [online] Available at: publications.waset.org/8303/pdf [Accessed 19 June 2019]

4. docs.aws.amazon.com. Amazon Web Services, Inc. (2019) Amazon Polly [online] Available at: https://aws.amazon.com/polly/ [Accessed 4 June 2019]

5. www.docker.com (2019). What are Docker Containers. Docker Inc. [Online] Available at: www.docker.com/resources/what-container [Accessed 25 June 2019]

6. Chris Richardson, Microservices.io (2019). Pattern: Saga [Online] Available at: https://microservices.io/patterns/data/saga.html [Accessed 15 July 2019]

7. Chris Richardson, Microservices.io (2019). Pattern: Microservice Architecture [Online] Available at: https://microservices.io/patterns/microservices.html [Accessed 25 July 2019]

8. Reactjs.org, Facebook Inc (2019). [Online] Available at: https://reactjs.org/docs/getting-started.html [Accessed 2 August 2019]

9. Postgresql.org, The PostgreSQL Global Development Group, PostgreSQL: The world’s most advanced open source relational database. [Online] Available at: https://www.postgresql.org/ [Accessed 15 August 2019]

10. Redux.js.org, Dan Abramov (2019). A predictable state container for JavaScript apps. [online]. Available at https://redux.js.org/ [Accessed 25 August]

11. Swagger.io, SmartBear Software (2019). API development for Everyone [online] Available at: https://swagger.io/docs/ [Accessed 30 August 2019]

12. Koajs.com, Nodejs Foundation (2019). Next generation framework for NodeJS Developers. [Online] Available at: https://koajs.com/#introduction [Accessed 5 September 2019]

13. Sascha Depold, Sequelize.org. Sequelize ORM (2019) [Online] Available at https://sequelize.org/master/ [Accessed 15 September 2019]

14. Jwt.io, Auth0 Inc (2019). JSON Web Token [Online] Available at: https://jwt.io/introduction/ [Accessed 27 September 2019]