Developing a Speech Unit Framework

Sushil Bastola
Metropolia University of Applied Sciences
Bachelor’s Thesis
30 August 2018

Number of Pages: xx pages + x appendices
Degree: Bachelor of Engineering
Degree Programme: Information Technology
Professional Major: Software Engineering
Instructors: Janne Salonen, Principal Lecturer

This project aims to build a speech unit framework for Kone that uses speech synthesis to automate the process of generating announcements, translating them into multiple languages and exporting them in the audio formats required by different device types such as elevators, doors and gates.

Traditionally, Kone has relied on human speakers to create announcements for the different device types. Specialized personnel speak into recording devices, and the recorded voices are later installed as announcements on the devices. The process is repeated for each language using native speakers from the corresponding countries. The audio files are then saved and installed on the different device types. In the long run this process can be expensive, inconsistent and laborious, since the speech must be translated into many different languages.

This project built a system that uses speech synthesizers from AWS to automate the process of creating announcements. The system can generate audio in selected languages with custom settings and filters. The audio can be exported as a zip file in a specific format and then installed on the corresponding device type.

To conclude, the project succeeded in automating the process of generating announcements using speech synthesizers. The system reduces the inconsistency of the traditional process and offers a faster, more reliable and cheaper solution to the problem.
Keywords: AWS, TTS, Polly, Synthesizers, Micro Service, Architecture, Speech

Contents

List of Abbreviations
1 Introduction
2 Overview of TTS
2.1 Brief History
2.2 Implementation Techniques
2.3 Tools and Technologies
2.3.1 Speech synthesis software and APIs
2.4 Application of Speech Synthesis
2.4.1 Applications for the Blind
2.4.2 Applications for the Deafened and Vocally Handicapped
2.4.3 Educational Applications
2.4.4 Applications for Telecommunications and Multimedia
3 Kone
3.1 Brief History
3.2 Main Expertise
3.2.1 Elevators, escalators and automatic building doors solutions
3.2.2 Maintenance and modernization
3.2.3 Advanced people flow solutions
4 Addressing the problem
5 Implementation
5.1 Technologies
5.1.1 Docker
5.1.2 React
5.1.3 Redux
5.1.4 Koa
5.1.5 Sequelize
5.1.6 JWT
5.1.7 Postgres
5.1.8 Swagger
5.2 Application Architecture
5.2.1 Microservices
5.3 Amazon Polly
5.4 Execution
5.4.1 Requirement Analysis
5.4.2 System Design
5.4.3 Execution
5.4.4 Integration and Deployment
5.5 Outcome and observations
6 Conclusion
7 References
Appendices

List of Abbreviations

TTS  Text to Speech
NTTS  Neural Text to Speech
AWS  Amazon Web Services
API  Application Programming Interface
DOM  Document Object Model
MVC  Model View Controller
ORM  Object Relational Mapping
SQL  Structured Query Language
JS  JavaScript
OS  Operating System
MVCC  Multi-Version Concurrency Control
JWT  JSON Web Token
JSON  JavaScript Object Notation
HMAC  Hash-based Message Authentication Code
RSA  Rivest, Shamir, and Adleman
RFC  Request for Comments
ECDSA  Elliptic Curve Digital Signature Algorithm
SSML  Speech Synthesis Markup Language
MP3  MPEG Audio Layer III
MVP  Minimum Viable Product
CRUD  Create Read Update Delete
OAS  OpenAPI Specification
YAML  YAML Ain’t Markup Language
UI  User Interface
PAT  Parametric Artificial Talker
VODER  Voice Operating Demonstrator
IoT  Internet of Things
UWP  Universal Windows Platform
PHP  PHP: Hypertext Preprocessor
ASP  Active Server Pages
HTML  HyperText Markup Language
ES6  ECMAScript 6

1 Introduction

Speech synthesis is the process of artificially producing human speech, usually with computers. The software that produces the artificial speech is called a speech synthesizer. A TTS system converts ordinary written text into speech, while other systems render symbolic linguistic representations into speech. [1]

The project aimed to replace the traditional way of generating announcements in many different languages, which relies on human speakers, with speech synthesis technologies. With speech synthesizers, the process can be automated using online cloud services that provide TTS conversion in real time. Since speech synthesis technology has evolved drastically in the past few years, these services are trustworthy and resilient. Using speech synthesis is therefore cheaper, faster, more reliable and more consistent than the traditional method.

The main goal of the project was to use speech synthesis, rather than human speakers, to generate announcements.

2 Overview of TTS

2.1 Brief History

The earliest attempts to create artificial speech date back over two hundred years. They relied on mechanical devices, since electrical means of producing sound were not yet available. The Danish scientist Christian Kratzenstein, working in St. Petersburg, used a mechanical device that modeled the human vocal tract and could produce artificially synthesized vowels. He first built acoustic resonators resembling the human vocal tract and then activated the resonators with vibrating reeds. The outline of the device is shown in figure 2. [1]

This invention was followed by an improved version created by Wolfgang von Kempelen, which added models of the lips and tongue and could produce consonants as well as vowels. This version was further developed by other scientists over the next few decades.
[1]

Later, in the 1930s, Bell Labs developed the VODER (Voice Operating Demonstrator), the first electronic speech synthesizer. The device was operated with a keyboard. In 1953, PAT (Parametric Artificial Talker) was invented. It consisted of three formant resonators connected in parallel. Following the invention of the relatively cheap TMS-5100 chip, Texas Instruments released a product called Speak-n-Spell in 1978. The device was designed to help children with reading. [1]

Speech synthesis technology has become more complex and sophisticated in the modern era. Techniques such as hidden Markov models (HMM) and neural networks are now used, which are more accurate and resilient. Since the technology has matured to the point where it is trustworthy, it is used in numerous fields including health care, education, entertainment and telecommunication. [1]

2.2 Implementation Techniques

Speech synthesis can be implemented in several different ways. The techniques can be classified into three types.

The first technique is articulatory synthesis, which tries to model the human vocal organs and vocal cords. The articulators and vocal cords are modeled with a set of area functions between the glottis and the mouth. When a person speaks, the vocal tract muscles contract, causing the articulators to move and change the shape of the vocal tract, which in turn produces different sounds. This technique can produce high-quality synthetic speech, but it is difficult to model because a large amount of data must be processed. [1]

The second technique is formant synthesis, which is based on the source-filter model of speech. There are two basic structures, cascade and parallel, and some combination of the two is usually used to get better output quality.
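The cascade and parallel structures named above, which the following paragraphs describe in more detail, can be illustrated with a minimal sketch. It is not taken from the thesis or from any particular synthesizer: the function names, the impulse-train source, the formant frequencies, bandwidths and gains are all illustrative assumptions, and the resonator uses a commonly cited second-order difference equation of the kind found in Klatt-style formant synthesizers.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    The coefficients are chosen so that the gain at 0 Hz is 1
    (assumed Klatt-style formulation, not from the thesis).
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_s, sample_rate):
    """Very rough stand-in for a voiced excitation source: one impulse per pitch period."""
    period = int(sample_rate / pitch_hz)
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def cascade(source, formants, sample_rate):
    """Resonators in series: the output of each one feeds the next."""
    signal = source
    for freq, bw in formants:
        signal = resonator(signal, freq, bw, sample_rate)
    return signal


def parallel(source, formants, gains, sample_rate):
    """Resonators side by side: same excitation to each, outputs weighted and summed."""
    branches = [resonator(source, f, bw, sample_rate) for f, bw in formants]
    return [
        sum(g * branch[i] for g, branch in zip(gains, branches))
        for i in range(len(source))
    ]


SAMPLE_RATE = 16000
# (frequency, bandwidth) pairs in Hz; illustrative values only.
FORMANTS = [(700.0, 90.0), (1200.0, 110.0), (2600.0, 170.0)]

source = impulse_train(100.0, 0.05, SAMPLE_RATE)
cascade_out = cascade(source, FORMANTS, SAMPLE_RATE)
parallel_out = parallel(source, FORMANTS, [1.0, 0.5, 0.25], SAMPLE_RATE)
```

In this sketch the cascade version is controlled only by the formant frequencies and bandwidths, while the parallel version additionally takes one gain per formant, which mirrors the difference in control information between the two structures.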
Formant synthesis allows an unlimited number of sound combinations, which makes it more flexible than some other techniques. A cascade formant synthesizer has band-pass resonators connected in series, and the output of each resonator is supplied as input to the next one. This structure is simpler to implement, since it uses only the formant frequencies as control information, and it has been found useful for non-nasal voiced sounds. [1]

Figure 3: Simple layout of Cascade formant synthesizer.

Figure 4: Simple layout of Parallel formant synthesizer.

In a parallel formant synthesizer, the resonators are connected in parallel. The excitation signal is supplied to all formants simultaneously, and the outputs of the formants are summed. This structure needs more control information, since the bandwidth and gain of each formant can be controlled individually. This type of synthesizer has been found better for nasals, fricatives and stop consonants. [1]

The third technique is concatenative synthesis. It uses a large collection of recorded speech that is sufficient to cover the language. The units of speech data are modified and combined as necessary. The method depends heavily on runtime selection and editing of the speech units available in the database. Selecting and storing the speech units is usually computationally heavy and requires a lot of memory. In addition, concatenative synthesizers are usually limited to one speaker and one voice. [1]

2.3 Tools and Technologies

2.3.1 Speech synthesis software and APIs

Several tech giants offer TTS APIs to their customers to make the development of TTS applications faster, easier and more convenient. Companies like Amazon, Google and Microsoft have played a big role in the rapid development of the TTS field in recent years. Companies like IVONA, Neospeech and Readspeaker have also been in this business for a while.
The following are some of the companies that work with speech synthesis.

Acapela

Acapela provides TTS software and services. It offers SDK solutions for Windows, Mac OS X, Windows Server, Linux server, UWP, iOS, Android, embedded Linux and Windows Mobile. Most of its solutions are cloud based. Formed by the merger of three companies, Acapela supports TTS in more than 30 languages, with support extending to accents, dialects and local voices.
