A Text-To-Speech System Based on Deep Neural Networks

Karlsruhe Institute of Technology A Text-to-Speech system based on Deep Neural Networks Bachelor’s Thesis of Arsenii Dunaev KIT Department of Informatics Institute for Anthropomatics and Robotics (IAR) Interactive Systems Labs (ISL) Reviewers: Prof. Dr. Alexander Waibel Prof. Dr. Tamim Asfour Advisors: M.Sc. Stefan Constantin Duration: July 12th, 2019 – November 11th, 2019 KIT — The Research University in the Helmholtz Association www.kit.edu Erklärung: Ich versichere hiermit, dass ich die Arbeit selbstständig verfasst habe, keine anderen als die angegebe- nen Quellen und Hilfsmittel benutzt habe, die wörtlich oder inhaltlich übernommenen Stellen als solche kenntlich gemacht habe und die Satzung des Karlsruher Instituts für Technologie zur Sicherung guter wissenschaftlicher Praxis beachtet habe. Karlsruhe, den 11. November 2019 Arsenii Dunaev Abstract: Based on the previous success in the Text-to-Speech field using Deep Neural Networks (Shen et al., 2018), this work aims at training and evaluation of an attention-based sequence-to-sequence neural voice generation system. Chosen fully-convolutional architecture matches state-of-the-art techniques in speech synthesis while training an order of magnitude faster than analogous recurrent architectures (Ping et al., 2018). This thesis also considers a great number of open-source speech datasets for both English and German languages, determining the best one and its key characteristics for training of the Text-to-Speech system. To achieve a high quality of synthesized speech various training techniques are used and described in detail during this work. Such methods as concatenation of separate audio-files in a larger ones with uniform distribution of lengths (between 0 and 20 seconds) and trimming of silences with modern tools allow the model to produce more intelligible speech utterances of any duration as well as more natural speech utterances without clipping in the end, which is considered to be an issue for many Text-to- Speech systems. This work also addresses the training of Text-to-Speech models for other languages by specifying new frontend (e.g. for German model). Several models are evaluated through user studies by measuring three criteria: intelligibility, comprehensibility, and naturalness. The most successful English and German models achieve the Word Error Rate of 23.48% and 14.21% respectively, and a higher than average degree of comprehensibility and naturalness. In conclusion, suggestions and recommendations on training of high-performance Text-to- Speech language models for future work are given. Kurzzusammenfassung: Basierend auf früheren Erfolgen im Bereich Text-to-Speech mit Deep Neural Networks (Shen et al., 2018), stellt diese Arbeit als Ziel das Training und die Evaluierung einer Attention-basierten Sequenz-zu- Sequenz neuronalen System zur Erzeugung der menschlichen Stimme. Die ausgewählte fully-convolutional Architektur entspricht dem Stand der Technik in Sprachsynthese, während das System um eine Größenord- nung schneller trainiert wird als bei analogen recurrent Architekturen (Ping et al., 2018). Diese Thesis betrachtet auch eine große Anzahl von Open-Source-Sprachdatensätzen für englische und deutsche Sprachen, um den besten Datensatz und seine wichtigsten Merkmale für das Training des Text-to-Speech Systems zu ermitteln. Um eine hohe Qualität der synthetisierten Sprache zu erreichen, werden verschiedene Trainingstech- niken in dieser Arbeit verwendet und detailliert beschrieben. Solche Methoden wie die Konkatenation separater Audiodateien in größere Dateien mit gleichmäßiger Verteilung der Tondauern (zwischen 0 und 20 Sekunden) und das Trimmen von Pausen mit modernen Werkzeugen ermöglichen es dem Modell, ver- ständlichere Sprachausgaben von beliebiger Dauer beziehungsweise natürlichere Sprachausgaben ohne Abschneiden am Ende der Aussage zu erzeugen, was für viele Text-to-Speech Systeme ein Problem darstellt. Diese Arbeit befasst sich auch mit dem Training von Text-to-Speech Modellen für andere Sprachen, indem ein neues Frontend (z. B. für das deutsche Modell) spezifiziert wird. Mehrere Modelle werden durch Benutzerstudien anhand von drei Kriterien bewertet: Verständlichkeit, Nachvollziehbarkeit und Natürlichkeit. Die erfolgreichsten englischen und deutschen Modelle erreichen eine Wortfehlerrate von 23,48% bzw. 14,21% und einen überdurchschnittlichen Grad an Nachvol- lziehbarkeit und Natürlichkeit. Im Ausblick werden abschließende Vorschläge und Empfehlungen zum Training hochperformanter Text-to-Speech Modelle gegeben. Contents Page 1 Contents 1. Introduction 3 1.1. Motivation . .3 1.2. Goal of this work . .3 2. Fundamentals 5 2.1. Perceptron . .5 2.2. Multi-Layer Perceptron . .6 2.2.1. General structure . .7 2.2.2. Training of an Artificial Neural Network . .7 2.3. Convolutional Neural Network . .8 2.4. Encoder-Decoder Network . 10 2.5. Vocoder . 12 3. Methods and algorithms of speech synthesis 13 3.1. Traditional TTS technologies . 13 3.1.1. Concatenative systems . 13 3.1.2. Source-filter systems . 14 3.2. HMM-based synthesis . 15 3.3. Speech synthesis using Deep Neural Networks . 15 4. Text-to-Speech system 16 4.1. General structure . 16 4.2. Datasets comparison and training . 17 4.2.1. English model . 17 4.2.2. German model . 22 4.3. Synthesis . 25 5. Evaluation and results 27 5.1. User evaluation . 27 5.1.1. User studies criteria for TTS systems . 27 5.1.2. Conducting of user testing . 28 5.2. Results . 30 5.2.1. English model . 30 5.2.2. German model . 34 6. Conclusion and future work 36 6.1. Conclusion . 36 6.2. Future work . 36 Appendices 39 A. Deep Voice 3 architecture 40 B. Speech datasets 41 A Text-to-Speech system based on Deep Neural Networks Page 2 Contents C. Evaluation sentences 46 C.1. English model . 46 C.1.1. Stage 1 . 46 C.1.2. Stage 2 . 46 C.1.3. Stage 3 . 47 C.2. German model . 48 C.2.1. Stage 1 . 48 C.2.2. Stage 2 . 48 Acronyms 49 Glossary 50 A Text-to-Speech system based on Deep Neural Networks Page 3 1. Introduction 1.1. Motivation Natural Language Processing (NLP) is one of the widest parts of the development of ’human-like’ machines (such as robots) among computer vision and artificial intelligence itself. One of the largest and most sophisticated tasks of NLP is speech synthesis. Inspired by speaking machines in science fiction, scientists and engineers have been fascinating and studying the ways of establishing speech by machines for many years. Started with "acoustic-mechanical speech machines" (Wolfgang von Kempelen’s speaking machine in 1769-1791 (von Kempelen, 1791)) and continued with the first computer-based speech synthesis systems in the late 1950s (Bell Labs: synthesizing speech on IBM 704 mentioned in Mullennix and Stern (2010)), the technology advanced fast with Deep Neural Networks (DNNs) in recent years. Speaking about the speech synthesis mostly refers to the Text-to-Speech (TTS) systems: ability of the computer to read text aloud. The conventional methods (such as concatenative, source-filter or Hidden Markov Model (HMM)-based synthesis) achieved good quality in synthesizing comprehensible and natural human speech, but they still possess drawbacks and limitations (e.g. long development time, high complexity of algorithms or large storage capacity). Neural networks, already extremely popular in other artificial intelligence branches, offer a chance to escape some limitations and produce a natural ’human’ voice when a huge amount of sample data is available. With the help of DNNs many nowadays compa- nies have made a breakthrough in the development of voice assistants products like Apple’s Siri, Google Assistant or Amazon Alexa, which are approaching the quality of the human voice. Speech synthesis is still one of the significant vital assistive technology tools, which allows environ- mental barriers to be removed for people with a wide range of disabilities such as blindness or loss of speech. The best modern TTS systems are usually commercial (i.e. its architecture or the dataset, on which the system is trained, are closed-source). This work will make the first steps towards training and evaluation of the TTS system, which is based on DNNs and is completely open-source. 1.2. Goal of this work This work aims at training and evaluating of the single end-to-end multi-modal neural network that: • produces comprehensible and natural male voice (since most of the previous open-source TTS systems are trained on the commonly known female voice dataset prepared by Ito (2017), which are usually better than analogous trained on male speech datasets); • has a low latency of synthesizing (<2 seconds); • has a low operating power of synthesizing the speech (use of two Central Processing Unit (CPU)- threads completely without the Graphics Processing Unit (GPU)). The training will be conducted based on an implementation (Yamamoto et al., 2018) of the Deep Voice 3 TTS system by Baidu Research (Ping et al., 2018). To successfully train the TTS system a large natural human speech dataset have to be found. This work will cover a vast amount of freely available speech datasets and compare them to each other in order to determine: 1. the most suitable one for the given problem; 2. the optimal size of the dataset needed to suffice the high quality of synthesized speech. A Text-to-Speech system based on Deep Neural Networks Page 4 Chapter 1. Introduction This way, the quality of the speech must correspond to the high quality of modern TTS systems such as Tacotron 2 by Google (Shen et al., 2018). In addition, the trained model should produce speech promptly without considerable loss in naturalness or intelligibility. To conclude, this thesis aims at exploring the use of neural networks in modern TTS systems and possible future work in this field. A Text-to-Speech system based on Deep Neural Networks Page 5 2. Fundamentals The following chapter describes useful basics in the field of Artificial Neural Networks (ANNs), espe- cially the relevant techniques for this work. 2.1.

Load more