PhD Dissertation

Unsupervised Learning for Expressive Synthesis

Author: Igor Jauk

Advisor: Antonio Bonafonte

TALP Research Centre
Department of Signal Theory and Communications
Universitat Politècnica de Catalunya

Barcelona, June 2017

Alea iacta est. IVLIVS CAESAR

Abstract

Nowadays, especially with the upswing of neural networks, speech synthesis is almost entirely data driven. The goal of this thesis is to provide methods for automatic, unsupervised training from data for expressive speech synthesis. In comparison to “ordinary” synthesis systems, it is more difficult to find reliable expressive training data, despite the huge availability of sources such as the Internet. The main difficulty lies in the highly speaker- and situation-dependent nature of expressiveness, which causes many acoustically substantial variations. The consequences are, first, that it is very difficult to define labels which reliably identify expressive speech with all its nuances. The typical definition of six basic emotions, or the like, is a simplification which has inexcusable consequences when dealing with data outside the lab. Second, even if a label set is defined, apart from the enormous manual effort, it is difficult to obtain sufficient training data for models that respect all the nuances and variations.

The goal of this thesis is to study automatic training methods for expressive speech synthesis that avoid labeling, and to develop applications from these proposals. The focus lies on the acoustic and the semantic domains. In the acoustic domain, the goal is to find suitable acoustic features to represent expressive speech, especially in the multi-speaker domain, as a way of getting closer to real-life uncontrolled data. For this, the perspective shifts away from traditional, mainly prosody-based, features towards features obtained with factor analysis, trying to identify the principal components of the expressiveness, namely using i-vectors. Results show that a combination of traditional and i-vector based features performs better in unsupervised clustering of expressive speech than traditional features alone, and even better than large state-of-the-art feature sets in the multi-speaker domain. Once the feature set is defined, it is used for unsupervised clustering of an audiobook, where from each cluster a synthetic voice is trained. The method is then evaluated in an audiobook-editing application, where users can use the synthetic voices to create their own dialogues. The obtained results validate the proposal.

In this editing application users choose synthetic voices and assign them to the sentences considering the speaking characters and the expressiveness. Involving the semantic domain, this assignment can be achieved automatically, at least partly. Words and sentences are represented numerically in trainable semantic vector spaces, called embeddings, and these can be used to predict the expressiveness to some extent. This method not only permits fully automatic reading of larger text passages, considering the local context, but can also be used as a semantic search engine for training data. Both applications are evaluated in a

perceptual test showing the potential of the proposed method.

Finally, accounting for the new tendencies in the speech synthesis world, deep neural network based expressive speech synthesis is designed and tested. Emotionally motivated semantic representations of text, sentiment embeddings, trained on the positiveness and the negativeness of movie reviews, are used as an additional input to the system. The neural network now learns not only from segmental and contextual information, but also from the sentiment embeddings, which especially affect prosody. The system is evaluated in two perceptual experiments which show preferences for the inclusion of sentiment embeddings as an additional input.

Acknowledgements

First of all I would like to thank Antonio Bonafonte for his help, guidance and patience, and for the opportunity to work on and develop this thesis in his group. Next, I would like to mention the FPU grant (Formación de Profesorado Universitario) from the Spanish Ministry of Science and Innovation (MICINN), which made possible the research documented in this thesis as well as the short-term stay at the University of Texas at El Paso, where part of this work, related to the chapter on Semantics-to-Acoustics Mapping, was designed and implemented. At the same time I would like to thank Prof. Nigel Ward from the University of Texas at El Paso for his friendly reception and his advice on that topic. Further, I would also like to mention the NII International Internship Program, which made possible the short-term stay at the National Institute of Informatics (NII) in Tokyo, and to thank Prof. Junichi Yamagishi for hosting me and supervising the research related to the chapter on NN-based expressive speech synthesis with sentiment embeddings. I would like to give thanks for all the additional help received from many persons on the way to the finish line, among them Dani, Carlos, Santi, Sergi, Jaime, Lauri, Xin, Shinji, Paula, Gustav, all the participants who suffered through my listening tests, and everybody else who helped and encouraged me, as well as anybody whom I should mention and forgot.

Contents

Abstract

Acknowledgements

1 Introduction
1.1 Thesis Goals
1.2 Thesis Overview

2 Speech synthesis review
2.1 General notions of TTS systems
2.1.1 Text Analysis
2.1.2 Prosody Prediction
2.1.3 Corpus preparation
2.1.4 Waveform Generation
2.2 Concatenative Speech Synthesis
2.3 Statistical Speech Synthesis
2.3.1 Speaker Adaptation
2.4 Deep Learning
2.4.1 Introduction to deep learning
2.4.2 Neural Network Based Speech Synthesis
2.5 Expressive Speech Synthesis
2.6 Discussion

3 Feature Selection
3.1 Acoustic features: Overview
3.1.1 Spectral Features
3.1.2 Prosodic Features
3.1.3 I-vectors
3.1.4 OpenSMILE


3.2 Experiments
3.2.1 Experimental framework
3.2.2 Experiment 1: MFCC i-vectors and a small corpus
3.2.3 Experiment 2: Prosodic i-vectors and single- vs multi-speaker
3.2.4 Experiment 3: Comparison to OpenSMILE
3.3 Discussion

4 Semantics-to-Acoustics Mapping
4.1 Semantic representation
4.1.1 Distance Measures
4.1.2 Bag-of-words Representations
4.1.3 Latent Semantic Indexing
4.1.4 Continuous Semantic Embeddings with Neural Networks
4.2 Predicting Acoustics from Semantics
4.3 Experiments
4.3.1 Experimental framework
4.3.2 Predicting Acoustic Feature Vectors from Semantic Vectors: an Analysis
4.3.3 Automatic Expressive Reading of Text
4.3.4 Creating ad hoc Expressive Voices
4.4 Discussion

5 NN-based expressive TTS with sentiment
5.1 System architecture
5.2 Objective test
5.3 Preliminary experiment
5.3.1 Perceptual results
5.4 Main listening test for the DNN-sentiment evaluation
5.4.1 Perceptual results
5.5 Discussion

6 Discussion
6.1 Summary
6.2 Conclusions and future work
6.3 Published contributions

List of Tables

3.1 Low-level audio features by OpenSMILE
3.2 Perplexities (PP) for silence rate, rate, mean F0, F1-F3 for /e/, F3 for /o/ and i-vectors for Expressions (Ex) and Characters (Ch) in comparison to the .
3.3 Paragraph sentences of the first subjective experiment
3.4 Relative preferences for the voices v0-v9 over the whole paragraph for the two characters (Ch2 and Ch3) and the narrator (Narr)
3.5 Perplexities for different feature combinations and for the three databases. The female part of the emotional studio corpus (C1), the male part of the same corpus (C2), and the audiobook database (Al) for expressions (E) and for characters (Ch) are shown
3.6 Paragraph sentences of the second subjective experiment
3.7 Relative preferences for the voices v0-v9 over the whole paragraph for the narrator (Narr) and the two present characters (Ch2 and Ch3)
3.8 Perplexities for different feature combinations, including openSMILE, and for the three databases. The female part of the emotional studio corpus (C1), the male part of the same corpus (C2), and the audiobook database (Al) for expressions (E) and for characters (Ch) are shown

4.1 Co-occurrence matrix. Columns are the documents, rows are the terms
4.2 Co-occurrence matrix
4.3 Distance results. Means and variances of distances to the original acoustic feature vectors
4.4 ANOVA results between the four conditions and random. α = 0.05, critical F = 3.8861. Values marked with * have a p value above 0.0025
4.5 Prediction method preferences by users for the first two tasks. DNN method, nearest neighbor (NN) method, neutral voice
4.6 Task 3. Voice preference by users for each .


5.1 Number of sentences for each sentiment category for the neutral corpus c1
5.2 Pitch statistics as a function of sentiment for the neutral speech corpus c1. Listed are mean, variance, range, range-mean and range-variance
5.3 Number of sentences for each sentiment category for the audiobook corpus c2
5.4 Pitch statistics as a function of sentiment for the audiobook corpus c2
5.5 Word categories and probabilities for the neutral sentence
5.6 Word categories and probabilities for the negative sentence
5.7 Word categories and probabilities for the positive sentence
5.8 Word categories and probabilities for the very positive sentence
5.9 Synthesized sentences for the preliminary experiment
5.10 System preferences. sl: sentence level, swcd: sentence level, word context and tree distance, wcd: word context and tree distance, wl: word level, ws: without sentiment, v wcd: vector word context and tree distance
5.11 System preferences for positive and negative sentences. sl: sentence level, swcd: sentence level, word context and tree distance, wcd: word context and tree distance, wl: word level, ws: without sentiment, v wcd: vector word context and tree distance
5.12 System rank ranges in parts per one. sl: sentence level, swcd: sentence level, word context and tree distance, wcd: word context and tree distance, wl: word level, ws: without sentiment, v wcd: vector word context and tree distance
5.13 Synthesized sentences for the main experiment
5.14 System preferences. ws: without sentiment, wcd: word context and tree distance, wl: word level
5.15 One- and two-tailed t-test results, P-values. ws: without sentiment, wcd: word context and tree distance, wl: word level, α = 0.05
5.16 System preferences for positive, negative and neutral sentences. ws: without sentiment, wcd: word context and tree distance, wl: word level
5.17 One- and two-tailed t-test results for positive, negative and neutral sentences, P-values. ws: without sentiment, wcd: word context and tree distance, wl: word level, α = 0.05
5.18 System preferences between developer participants, user participants, and participants without experience with speech technology. ws: without sentiment, wcd: word context and tree distance, wl: word level
5.19 One- and two-tailed t-test results for developer participants, P-values. ws: without sentiment, wcd: word context and tree distance, wl: word level, α = 0.05

5.20 One- and two-tailed t-test results for non-expert participants, P-values. ws: without sentiment, wcd: word context and tree distance, wl: word level, α = 0.05

List of Figures

2.1 Text-to-speech system
2.2 Text-to-speech system divided in three modules
2.3 Binary syntactic tree structure. S: sentence, NP: nominal phrase, VP: verbal phrase, N: noun, V: verb, Det: determinant, Adj: adjective
2.4 Training of a regression model for pitch prediction
2.5 Pitch prediction with a regression model
2.6 Basic source-filter speech production model with a voiced and an unvoiced component
2.7 Unit selection system architecture
2.8 HMM based system architecture
2.9 Artificial neuron
2.10 Neural network with 1 layer of parallel neurons
2.11 Neural network with 1 hidden layer
2.12 DNN based system architecture

3.1 Standing waves of wavelength λ in a vocal tract of length l (from Vary et al. (1998))
3.2 Waveform and spectrogram for the realization of [hari]. are indicated by red dotted lines
3.3 Clustering and evaluation framework
3.4 Clustering and synthesis framework
3.5 Subjective experiment web interface

4.1 Singular Value Decomposition illustration
4.2 SVD Example
4.3 LSI Expression plots
4.4 Neural Network based language model, from Bengio et al. (2003)
4.5 CBOW and Skip-Gram architectures, from Mikolov et al. (2013)
4.6 Example binary tree of the sentiment parser


4.7 Vector-to-vector mapping
4.8 Framework of the proposed training approach
4.9 Framework of the proposed acoustic feature vector prediction
4.10 CART network design
4.11 DNN framework
4.12 Euclidean distance plot of the predicted to the original distances for the 106 utterances

5.1 Training from data: SAT vs DNN
5.2 Proposed DNN system architecture using sentiment embeddings
5.3 Pitch visualization for the neutral sentence. For each figure: sent, purple line, is the predicted sentiment category on word level; prob, blue line, is the probability of that category; f0, green line, is the predicted F0 curve with the sentiment; and cmpf0, red dashed line, is the predicted F0 curve without sentiment
5.4 Pitch visualization for the negative sentence. For each figure: sent, purple line, is the predicted sentiment category on word level; prob, blue line, is the probability of that category; f0, green line, is the predicted F0 curve with the sentiment; and cmpf0, red dashed line, is the predicted F0 curve without sentiment
5.5 Pitch visualization for the positive sentence. For each figure: sent, purple line, is the predicted sentiment category on word level; prob, blue line, is the probability of that category; f0, green line, is the predicted F0 curve with the sentiment; and cmpf0, red dashed line, is the predicted F0 curve without sentiment
5.6 Pitch visualization for the very positive sentence. For each figure: sent, purple line, is the predicted sentiment category on word level; prob, blue line, is the probability of that category; f0, green line, is the predicted F0 curve with the sentiment; and cmpf0, red dashed line, is the predicted F0 curve without sentiment

Chapter 1

Introduction

Speech synthesis is an old, almost romantic, idea of machines, computers, and robots who talk and express themselves as human beings do. In futuristic science fiction movies and literature, there is almost no way around a talking computer or robot. Sometimes the talking computer, despite its vast artificial intelligence capabilities, is identifiable as such, talking in a robotic and monotonous voice and being nothing more than an aid to humans; for instance, in the Star Trek movies the “Computer” is an important tool for the space ship crew. On other occasions, computers act very human-like, imitating emotions and free will: in 2001: A Space Odyssey the computer tries to kill the crew, and the very same type of computer tries to seduce Marge and to kill Homer in The Simpsons; in Her a personal assistant basically substitutes a life-partner. Sometimes robots even mean to lead humans astray by passing for human beings, for a variety of reasons: killing them, as in The Matrix or Terminator; living among them and being accepted as intelligent beings, as in I, Robot and the book series by Asimov; serving as life-partner or child substitutes, as in Ex Machina or A.I.; or just being an integral part of society. In some of the stories, machines are just part of everyday life, while in others they are the central aspect. In all cases it seems that the interpretation of intelligence in computers goes hand in hand with the capability to speak like human beings, i.e. emotionally and expressively. This aspect is ironic because superior intelligence in human beings is often characterized as cold and emotionless, like Mr. Spock in Star Trek, or Sheldon Cooper in The Big Bang Theory, who is eager to imitate Spock. Looking at reality, talking machines are not yet able to pass for humans, even leaving aside the aspect of (artificial) intelligence, but it has not been left unattempted. The first known experiments with talking machines date back as far as the 18th century, when mechanical vocal tract models were built to produce speech-like sounds. In the 20th and 21st centuries, computerization and digitalization opened totally new possibilities in research, and computer-based speech synthesis was invented. When computers gained performance and power, statistical learning became of great importance, and speech synthesis, among a vast number of other research areas, experienced an incredible boom, leading to astonishing results.


Nevertheless, although speech synthesis has already achieved very high intelligibility and naturalness, sometimes hardly distinguishable from natural human speech, a talking computer still would not be able to pass for a human, leaving artificial intelligence aside. Why? Because naturalness and intelligibility alone are not enough.
At this point, one could ask: Why actually should we want a computer to sound like a human? Why not let it sound like a computer and forget all the troubles and science fiction fairy tales? Well, let us take a very natural sounding synthetic voice of a state-of-the-art synthesis system and make it read an e-mail or a short message. Probably, it will yield satisfactory results. However, taking the same voice and making it read a book or dub a film is a totally different task. Probably, after some minutes of listening, the voice will sound boring and monotonous. Basically, “natural-sounding” is not enough. Expressiveness which is adapted to each possible situation in an oration is indispensable. Actually, even a human reader needs to be expressive in such a task. And this is exactly why we do want synthetic voices to sound expressive: to substitute humans in tasks like book reading and film dubbing, and maybe one day, when machines become intelligent, to aid humans in space odysseys and accompany them in everyday life.
Expanding on the idea of expressiveness: just as a very human-sounding but monotonous voice is inappropriate for book reading or film dubbing, an expressive voice which uses expressiveness in a wrong way will not satisfy the listener either. If, for instance, a book character should sound sad but sounds happy, it destroys the impression and transmits a wrong message. Of course, there might be a person who applies the correct voice to each situation, but wouldn't it take us a step closer to the science fiction world if the machine could do it by itself?
Now, how can we make a speech synthesis system sound expressive? To answer this question, we first need to know what “sounding expressive” actually is. It is not enough just to have a happy or a sad voice; we also need to know when to use it. And we would need many more voices beside the sad and the happy one. Of course, we could also add an angry one, and others like bored, surprised, disgusted, and so on. But what if, in a book, a giant gets angry? Will he sound the same as an old lady, apart from the fact that he should have a different voice? Probably not. Also, a surprise can be positive or negative; it can be sad or happy, angry, cold, hysterical, and so on. Basically, all emotions, expressions and speaking styles are speaker and situation dependent, are influenced by intention and attitude, vary gradually, and can be mixed with each other. This means that a few predefined emotional voices are basically not enough for a realistic book reading or film dubbing.
Given the practically unlimited number of all possible expressions, how can we account for all of them with the means available in speech synthesis? To answer this question, we need to dive to a deeper level and ask: how is expressiveness represented acoustically and textually? When we hear an expressive voice, how do we know that it is sad or happy? Certain acoustic characteristics must carry this information such that it is audible. On the other hand, situational pragmatics are very important in order to fully understand the underlying expressiveness.
In the speech synthesis task, the input to the system is generally plain text, with no further information about how it has to be said.

How can we deduce information about expressiveness from plain text and use it effectively as input to the system, in order to modify the acoustic characteristics of the synthetic voice such that it sounds not only expressive, but appropriate for the input text?
A different issue in building expressive speech synthesis is the workload. Most of the work goes into data acquisition and preparation, such as recordings, labeling, and so on. It is rather difficult to find annotated sources of expressive speech, and even if they can be found, the number of expressions is generally very limited. On the other hand, labeling a rich source of expressive speech such as audiobooks, films, dialogues, etc., is not only very costly; the labels are also difficult to define due to the almost infinite number of possible expressions.

1.1 Thesis Goals

Emanating from this reasoning we can proceed to formulate the main thesis hypotheses and the overall goals of the thesis, together with some key ideas of how to achieve them.

1. It is possible to define expressive voices from clusters of data in the acoustic domain, applying unsupervised methods to build the clusters, i.e. no labels based on human interpretation are permitted to define the voices or the data in the clusters.

2. It is possible to improve the expressiveness of a synthetic voice by using, in the training process, semantic features which encode some sort of expressive information and are obtained fully automatically.

These two main hypotheses clearly aim at the two domains which are the focus of this investigation: the acoustic domain and the semantic domain. The main research questions derived from them, regarding both domains, are:

1. How is expressiveness represented in the acoustic and the semantic domains?

2. Is it possible to define reliable features for each of the domains which reflect the expressiveness?

3. If relevant features can be defined, how can they be used to prove the hypotheses in each case?

In the course of the thesis, arguments and experiments will be presented which are oriented towards answering these questions and confirming or refuting the hypotheses. For now, these relatively open hypotheses and questions will be substantiated in a series of concrete thesis goals and key ideas.

1. We want an expressive speech synthesis system, i.e. a system that has a repertoire of expressions, speaking styles, and emotions which can be recognized in the output of the system.

• We need a state-of-the-art synthesis method which can be trained flexibly and is able to produce reasonably good speech quality. We can train the system such that it synthesizes speech with different expressions.

2. The number of all possible expressions, emotions or speaking styles must not be limited, in order to cope with the “unlimited” number of real-life expressions, emotions or speaking styles. The consequence is that, basically, hard classification and labeling must be avoided. Also, the system must be adaptable to new databases and new requirements in voice and expression repertoire.

• We need a method to automatically define training data for expressions, based on acoustic features or on the semantic content of the training data.
• The synthesis system must be flexible enough to change its expressive state according to the current requirements, or to update its expressive repertoire with respect to current requirements. In this case, traditional labels would be substituted by automatically derived numerical representations.

3. The system must be usable in an appropriate way in each situation, i.e. use the right expression at the right time, in a manual and, if possible, in an automatic way.

• First, the system's expressive repertoire must contain an adequate number of expressions, or be able to acquire them, so that they can be chosen manually. Second, methods must be developed to derive expressiveness from text, such that the system could automatically determine which expression to use, or, if necessary, automatically choose training data to acquire a new expression.

4. The workload to acquire data and train the system must be minimal, and as many processes as possible must be automatized.

• Training data must be selected automatically. For this, unsupervised methods must be available to cluster data, based on acoustic or on semantic features. Alternatively, the system must intrinsically learn from data which acoustic characteristics the output must have, given some expressiveness-related input characteristics.

Intuitively, the most appropriate technical methodology for these goals must be related to statistical modeling, in order to learn the required properties from data and allow a significant reduction of the workload. In state-of-the-art systems, statistical learning is a key feature for at least some of the tasks a synthesis system must perform. As will be shown in the course of the present work, these methods can be successfully applied to achieve the formulated goals.

1.2 Thesis Overview

The thesis has the following structure, regarding all the different aspects named above. First, in order to understand the key aspects of expressive speech synthesis, it is necessary to study and comprehend how speech synthesis actually works, and how its composing elements can be used for expressive speech. Chapter 2 gives an overview of the integral elements of a speech synthesis system and how they can be used for expressive speech. It discusses the main synthesis techniques, from the traditional ones to the newest state-of-the-art methods, including an introduction to deep learning. Finally, it discusses the term “expressive speech” and gives a state-of-the-art overview of the expressive speech synthesis developed so far.
As stated, numerical representation is of great importance in order to analyze data and create training corpora. Chapter 3 treats the problem of the acoustic feature representation of expressive speech. Traditionally, the focus in the expressive field lies on prosodic features. However, various studies have shown that prosodic features alone are not enough to represent all types of expressions. Also, the task gets more difficult in multi-speaker databases like audiobooks. For these reasons, different feature sets have been proposed, including prosodic and spectral features, musical notes, voice quality features, and more. In speaker recognition, a new type of discriminative feature has been proposed, the i-vector. I-vectors represent speakers in the form of vectors, where the vectors of the same speaker are close to each other in the vector space, while those of different speakers are far away. This technique is proposed here also for expressive speech analysis in multi-speaker databases. Chapter 3 compares i-vector based feature sets to traditional features and to other state-of-the-art sets used in emotional speech analysis, showing that i-vectors outperform all the other feature sets in the multi-speaker domain.
Continuing the examination of expressive speech, Chapter 4 discusses the possible text-based, semantic representation of expressiveness. Here, the focus lies on vector representations of text semantics. Similarly to i-vectors, semantic vectors represent textual units, e.g. words or sentences, in a vector space, where semantically similar units are close to each other, while semantically distant units are far away from each other. Different methods of representation are discussed, starting from singular value decomposition and ending with representations generated by neural networks. These latter ones can be adjusted to specific needs by manipulating the training criterion, opening many ways to obtain efficient representations. The vector representations obtained from text are used to predict acoustic representations of expressiveness, allowing for automatic data clustering of expressive speech based on the semantics of the underlying text. This ability is explored to create emotional voices in a semi-supervised manner, or to read book paragraphs with expressive voices in a fully automatic way, where for each sentence an individual synthetic voice is trained. Finally, experiments and results on the subject are presented.
Clustering of expressive speech is a powerful method to create training data; however, no matter whether it happens on the acoustic or on the semantic level, it has to rely on the underlying extracted features.
Since feature extraction will always contain an error, there will always be lost and added data in clusters. Using neural network based synthesis, data clustering is unnecessary, since the networks adjust their parameters on the basis of the complete dataset, with respect to any given input feature. Chapter 5 presents a study where semantic vector representations of the input text, trained on the sentiment of sentences (i.e. positiveness or negativeness), are used as an additional input to a neural network based speech synthesis system. The system learns effectively from the vectors and adapts the output depending on the sentiment. Experiments show that especially prosodic values, particularly pitch, are strongly affected by the sentiment vectors.
Finally, a conclusion will be drawn regarding the studies carried out in the present thesis, the difficulties and the challenges of the task, as well as possible future work regarding the panorama of newly emerged synthesis techniques based on neural networks.

Chapter 2

Speech synthesis: Review and State-of-the-Art

Speech synthesis is a computational technique for producing synthetic, human-like speech by computer. Normally, the input to a speech synthesis system is plain or marked-up text, hence the name text-to-speech synthesis (TTS) (see for instance Sotscheck et al. (1996)), which is then analyzed and converted to speech. The analysis process involves the deduction of a series of characteristics which are used to derive the waveform, as illustrated in figure 2.1. TTS systems are in common use, for example for visually impaired persons, speech impaired persons, in automatic translation applications (for instance translator), e-mail reading, personal assistants, etc. In a TTS system, the user has direct control over what is being said, but rarely over how it is to be said. How something is said involves prosodic aspects like speech melody, speaking rate, etc., but also more pragmatic high-level facets like expressiveness, speaking styles and emotions.

Figure 2.1: Text-to-speech system.

There are several underlying conceptual and technical problems in TTS systems which need to be discussed and taken into account. In order to understand them it is necessary to understand how human speech production works and what it is good for. Speech is a means of communication by which some sort

7 8 CHAPTER 2. SPEECH SYNTHESIS REVIEW of information is transmitted between speakers. As for instance Taylor(2009) argues, there are three basic types of communication: affective, iconic, and symbolic communication. The affective communication is the most basic type, where emotions and in- stincts are communicated, like for instance yelling as expression of pain or laughing as expression of joy. Iconic communication is a communication form where forms are defined, which resemble the meaning of the intended communicational goal. For example, a road sign where slippery roads are indicated by a drawing of a skidding car. Also visual art like painting could be seen as an iconic form of communication. Symbolic communication is the most complex one since it works with abstract symbols, such as letters or , which form is generally not related to meaning. These symbols can be combined to larger units, like words, sentences, etc., so an arbitrary number of meanings can be transmitted. Human lan- guages, in spoken and in written form, are generally symbolic communication systems. Some writing systems have acquired a totally abstract form, such as semi-phonetic alphabets as the Roman or the Cyrillic alphabets. Other writing systems clearly conserve an iconic origin, such as Chinese. Language, as a communication system, is an organized and intended way of transmitting information. Historically, the primary form is the spoken form. Many neurological and physiological processes are involved in order to create a speech signal (see for instance Geschwind(1972); Schade(1999)). First, an idea or a concept originates in the brain, involving processes like memory activation, conceptualization, then, nerves for muscle control are activated, etc., all this being a highly complex process by itself. On the physiological side, in order to produce a signal, the air flow, which originates in lungs, passes through the glottis, where, for voiced parts of the signal, the glottal folds close and are brought to vibration by the passing air flow, as postulated by the myoelastic and the aerodynamic theories, as explained for instance by Van den Berg(1958). Then, the resulting signal is filtered by the vocal tract according to the acoustic theory of speech production, as in Fant(1970); Ungeheuer(1962). Different articulatory positions of speech organs cause different physical conditions for air flow, changing resonance frequencies and allowing for forming of phonemes. The physical imperfection of the vocal tract causes lateral effects, such as energy loss, perturbations, and so on, modifying the signal, and actually making it sound “natural”. In comparison to the biological process, speech synthesis is intrinsically different. First, in a TTS system, text is primary since it constitutes the input to the system. Second, since machines have no articulation organs, there is need for a mathematical model to generate voice-like sounds, also taking into account the mimicking of those “imperfections” of the signal which make human speech sound “natural”. Finally, there is need for a model which connects both parts, the text and the sound. This is tricky because of the totally distinct nature of both signals. Text is visual (except for , which is tactile), discrete and permanent, while speech is a continuous, non-permanent, acoustic wave. In essence, TTS is used for symbolic communication. However, expressive TTS is actually a crossing of symbolic and of affective communication. 
Information needs to be transmitted as in symbolic communication, but it needs to be shaped by an underlying affective form which is interpretable by humans. This affective form is natural in inter-human communication, and between animals in general; however, why does a computer need to convey happiness, sadness, confusion, or any other type of affective information? The answer to this question is the same as in the general case of speech synthesis: to substitute humans in certain applications. For instance, book reading for children needs to be expressive; if not, they get bored and won't listen. Or dubbing, or translation: basically, all types of speech that go beyond the simple reading of an e-mail or message.
In order to understand how we can teach expressiveness to a machine, it is first necessary to understand how we can teach a machine to speak at all. This chapter focuses on introducing how speech synthesis works, and how these common techniques can be used to introduce the “affective” component into the system. Section 2.1 briefly presents the general aspects of a text-to-speech synthesis system. It focuses on text analysis, prosody prediction, and early waveform generation methods. Text analysis extracts from text information useful to deduce acoustic properties, which are also important for expressiveness. Prosody, as will be discussed later, is traditionally considered the main correlate of expressiveness in the acoustic domain, so it is most important to understand how prosody prediction works and how it can be applied to expressive speech synthesis. Early waveform generation methods help to understand the basic ideas of how TTS systems work, and it will be shown that even the most primitive (computer-based) mechanisms were used to create expressive speech. Section 2.2 introduces unit selection TTS systems. These systems are still considered a reference and are still widely used in expressive speech synthesis. Section 2.3 describes statistical, HMM-based systems. These systems provide great flexibility and robust speech quality. A large part of the present work uses HMM-based systems to synthesize expressive speech. Section 2.3.1 introduces speaker adaptation techniques which allow for voice training with small amounts of data in statistical synthesis. These techniques are very important in expressive speech synthesis because of the sparsity-of-data problem: the number of all possible expressive styles is practically unlimited, which makes it impossible to obtain enough training data for all the expressions. Speaker adaptation techniques help to compensate for this problem. Section 2.4 introduces the deep learning paradigm, and in Section 2.4.2 state-of-the-art neural network based systems are presented and discussed. Neural network (NN) based speech synthesis is relatively novel and in the last years has experienced a “boom” in the TTS community, outperforming other techniques. However, due to the novelty of the approach, little work has been invested in expressive speech synthesis with neural networks. This work will introduce some of the first experiments on expressive speech with neural network based speech synthesis systems, as will be presented in Chapter 5. Finally, Section 2.5 presents a review of expressive TTS systems and the techniques which have been used to create them.

2.1 General notions of TTS systems

Figure 2.2 shows the three basic modules of a TTS system. What happens in a TTS system? First, written text is introduced as system input.

Figure 2.2: Text-to-speech system divided in three modules.

Written text generally encodes information about what is said, with very limited information about how it is said. What is said refers to the segmental information which constitutes the semantic encoding. The main difficulty about it is the arbitrariness and ambiguity of textual units: e.g. “a” is pronounced as /A:/ in car, as /æ/ in drag, and is not pronounced at all in read. Further, there may be differences in regional realizations of the same unit, e.g. dance as /dæns/ in American English and /dA:ns/ in British English, according to the Cambridge dictionary. Additionally, there are writing systems where no pronunciation can be deduced from the writing, like Chinese or Japanese. For instance, in Japanese each Kanji can have dozens of different readings, with no clear rules about which one to choose in which situation. For example, according to Rōmaji Desu among others, 行 (to go) can be read as /i.ku/1, /ju.ku/, /juki/, /ju.ki/, /iki/, /i.ki/, /okona.u/, /oko.nau/, /ko:/, /gjo:/, /an/, and the reading also varies in exceptional cases, like for example the pronunciation of 行る (to send), which is /ja.ru/.
How it is said refers to prosodic and extra-/paralinguistic information such as sentence accent, focus, speaking style, and so on. There are few cues in text about this type of knowledge. Punctuation marks give hints about sentence modus and breaks. Everything else must be deduced from text semantics, context, or world knowledge. In practice, most expressive speech synthesis systems leave it to the user to choose which expressive style to synthesize. Sometimes the user can simply choose the emotion or the expressive voice; the input text can also be marked up such that additional information can be interpreted automatically by the system. Also, as will be seen further on, automatic methods exist which can be used to deduce expressiveness from plain text.
The final step is the actual waveform generation. Different techniques have been applied to this problem, including knowledge-based techniques, corpus-driven concatenation approaches, and statistical and neural network based approaches. All of them have been used for expressive speech synthesis, with their particular advantages and disadvantages.
No matter which synthesis method is applied, there is always a need for data.

In the case of rule-based synthesis and similar techniques, data can be used to deduce rules; in the case of concatenation techniques, it is the source of concatenation units; in the case of statistical learning techniques, it is the source of information for the models. Corpus preparation and feature extraction is an important first step and often the most laborious one.
This section introduces text analysis techniques which are used to obtain information from plain text. Afterwards, prosody prediction techniques are addressed; generally, since prosody needs to be predicted from text, information gained in text analysis is used for prosody prediction. Then, corpus preparation is discussed briefly. Corpus preparation often combines text analysis and feature extraction techniques, applied a priori and offline on large amounts of data. Next, waveform generation techniques are discussed, with mentions of corresponding expressive speech synthesis systems. A special focus is put on concatenation and unit selection methods in Section 2.2, since these systems still provide state-of-the-art and reference quality for current TTS. Then, in Section 2.3, statistical techniques are introduced and discussed. In Section 2.4, deep learning synthesis methods are addressed. Finally, Section 2.5 discusses expressive speech synthesis systems and their methods.

1 The point “.” in the IPA transcription means that the actual Kanji is pronounced according to the left side of the transcription separated by the point; the right side must be added as an additional syllable, being the pronunciation of the Kanji part of this concrete combination, and not transferable to combinations with other Kanji.

2.1.1 Text Analysis

Everything related to gaining information from text is referred to as text analysis or text (pre-)processing. Text analysis is a crucial first step in all TTS systems, since text is almost always the only input a TTS system has. So, basically, all information, segmental, prosodic and expressive, must be gained from text analysis, unless it is provided by the user in some other way. The goal is to create a representation of the input text such that the prosody can be predicted, the expressiveness can be deduced, and the segmental models get the necessary information to generate speech. The following is an (incomplete) list of possible text analysis steps.

• Text format: Plain text, tables, lists, XML, dialogue, etc. It is essential to take into account the format of the input text and convert it to the internal text format, if necessary. The text can be enriched with significant information about the pronunciation, such as marking the focus, speaking style, emotion, etc. A well known markup language for speech synthesis is the Speech Synthesis Markup Language (SSML) proposed by W3C (b). It allows for the encoding of additional information about the text, like prosodic information (F0, duration, etc.), structural information (breaks, focus, etc.), and also meta information (speaker, gender, etc.), in an XML format. There is also a specially developed markup language for emotional speech, the Emotion Markup Language (EmotionML), also proposed by W3C (a). It allows for additional marking of emotion-related information, like type of emotion, attitude, affect state, action tendency, emotional vocabulary, etc., as well as more low-level marks like durations, timings, and other value attributes. (A small markup sketch is given after this list.)

• Word normalization: Apart from the intrinsic difficulty of determining the pronunciation of textual units, as described above, there are a number

of cases where the text needs to be normalized. For example, 2/3 could be pronounced as two thirds, but also as the second of March. The most common cases where normalization is needed are (a small normalization sketch follows after this list):
– Numbers: Dates, numbers, currency, time, ordinal numbers, decimal numbers, fractions, Roman numerals, etc.
– Abbreviations and Acronyms: Mr., UN, tel., USA, GB, NATO, etc. Some need to be expanded, some are read as words, some need to be spelled out, etc.
– Reading differences: For instance read /ri:d/ versus /rE:d/.
– Irregularities: Especially in real-life emotional speech like dialogues, many irregularities occur: very long sounds, fillers, breaks, irregular words, etc. These segments are difficult to treat in synthesis; often they are excluded from system building or are corrected with an orthography corrector.
– Foreign words and proper names: These generally need a dictionary entry with the correct pronunciation.
Word normalization usually depends on dictionaries and rule-based approaches such as regular expressions, where words are analyzed and expanded if needed. There are also automatic approaches, like for instance the one by Sproat et al. (2001), where non-standard words are treated systematically on a language-model basis, learning the pronunciation of tokens.

• Grapheme-to-phoneme conversion (G2P): Since, even in languages with phonetically oriented alphabets, the graphemes (≈ letters) do not completely correspond to phonemes, which are intended to describe phonologically meaningful sounds while avoiding ambiguity, a grapheme-to-phoneme conversion must be performed. The standard phoneme set was proposed by the International Phonetic Association (IPA) and is called the International Phonetic Alphabet (IPA). The alphabet provides a number of standardized symbols to describe consonants by type, voicing and place of articulation, and vowels by degree of mouth aperture, lip roundness and position in the mouth (front to back). It also provides a set of symbols to describe special sounds like clicks and implosives, and has a limited set of symbols to describe some prosodic phenomena, such as accents. Additionally, it offers diacritics to describe phonetic modifications of sounds, like voiced or unvoiced pronunciation, length, tone, and so on. IPA is widely used for human-made transcriptions; however, for automatic computational processing it was impractical because of its codification. For this purpose the 7-bit ASCII based Speech Assessment Methods Phonetic Alphabet (SAMPA) was developed, as in Wells (1997). SAMPA is a computer-readable IPA counterpart, and in practice it is often changed and adapted to particular system and language needs.
Automatic grapheme-to-phoneme conversion usually involves a transcription dictionary and an automatic method for unknown words, for instance finite state transducer (FST) or classification and regression tree (CART) based, like for instance in Breuer and Hess (2010). Polyàkova (2015) explores in her work the adaptability of grapheme-to-phoneme conversion with a focus on multilingual domains and the pronunciation of foreign words.

• Part-of-speech (POS) tagging: Part of speech is the class a word belongs to, such as verb, substantive, and so on. POS is useful in pronunciation disambiguation, prosody prediction, and others. Nowadays, POS taggers are generally based on statistical prediction methods.

• Structural analysis: Often, structural and/or syntactic analysis is performed on the text. Structural analysis involves the position of phonemes, syllables and words with respect to the sentence beginning and end, accented syllables, and so on. A well known example of relatively exhaustive structural information is the label format used in HTS (for more details on HTS see Section 2.3). Syntactic analysis identifies groups of words which belong together syntactically, like nominal and verbal phrases, and others. Such analysis often yields tree-like structures where sentence parts are organized in a hierarchical manner, as in figure 2.3. Structural and syntactic analysis are usually involved in prosody prediction.

• Semantic vector representation: Although semantic vector representation of text is not an analysis form in the strict sense, it is a powerful method of codifying text which can be used to identify speaking styles or expressiveness. Many speaking and expressive styles have semantic correlates. This is exploited in semantic vector space models, since semantically similar content tends to appear agglomerated in the semantic vector space; from the speech perspective, this property can be used to identify speaking and expressive styles (see the similarity sketch after this list). Semantic vector representation will be addressed in detail in Chapter 4.
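To make the markup idea from the Text format item concrete, the following is a minimal sketch that builds an SSML-style fragment with Python's standard xml library. The element and attribute names (speak, prosody, break, emphasis) follow the SSML specification, but which of them a given synthesizer actually honors depends on the engine, so this is an illustration rather than a guaranteed interface.

```python
import xml.etree.ElementTree as ET

# Root element of an SSML document.
speak = ET.Element("speak")

# A prosody element requesting a slower rate and lower pitch for one phrase.
sad = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "low"})
sad.text = "I am afraid I cannot do that."

# An explicit pause between the two phrases.
ET.SubElement(speak, "break", {"time": "500ms"})

# An emphasized phrase, e.g. for a focused word.
strong = ET.SubElement(speak, "emphasis", {"level": "strong"})
strong.text = "Please do not ask again."

print(ET.tostring(speak, encoding="unicode"))
```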
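As a toy illustration of rule-based word normalization as described under Word normalization above, the sketch below expands a few abbreviations and small integers with a dictionary and regular expressions. The abbreviation table and the number range are placeholder choices; a real front-end would use far larger dictionaries, language models for ambiguous tokens, and locale-aware rules.

```python
import re

# Placeholder expansion tables; a real system uses much larger, language-specific resources.
ABBREVIATIONS = {"Mr.": "Mister", "Dr.": "Doctor", "etc.": "et cetera"}
UNITS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]

def expand_number(match: re.Match) -> str:
    """Spell out small integers; larger values are left for a fuller number grammar."""
    value = int(match.group(0))
    return UNITS[value] if value <= 10 else match.group(0)

def normalize(text: str) -> str:
    # Expand known abbreviations first, then isolated small numbers.
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    return re.sub(r"\b\d+\b", expand_number, text)

print(normalize("Mr. Smith bought 3 books, 2 pens, etc."))
# -> "Mister Smith bought three books, two pens, et cetera"
```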
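The agglomeration property mentioned in the last item is usually measured with cosine similarity between vectors. The sketch below computes it with NumPy for three made-up 4-dimensional vectors; the vectors and the dimensionality are purely illustrative, since real embeddings are learned from data and typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: two "happy" words pointing in a similar direction, one unrelated word.
joyful = np.array([0.9, 0.1, 0.3, 0.0])
cheerful = np.array([0.8, 0.2, 0.4, 0.1])
invoice = np.array([0.0, 0.9, 0.0, 0.8])

print(cosine_similarity(joyful, cheerful))  # high value: semantically close
print(cosine_similarity(joyful, invoice))   # low value: semantically distant
```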

2.1.2 Prosody Prediction

Once all the necessary information has been extracted from the text, it can be used to predict the prosody. Prosody is a term which generally describes three suprasegmental aspects of speech: intonation, rhythm and intensity. Intonation is the “melody” of speech. It is one of the most functional features in speech, since it is directly responsible for, or indirectly involved in, almost all phenomena on the suprasegmental level, and sometimes also on the segmental level, like for instance accents and tones. Rhythm includes the duration and frequency of syllables (or other segments), as well as the duration and position of breaks, and structural aspects on a higher level, such as phrase accents, focus, and so on. Intensity is the “loudness” of speech. It is a suprasegmental feature, but it also has a certain meaning on the segmental level, for instance for accents.
As stated, prosodic features are considered to be suprasegmental features, i.e. they do not directly depend on phone identity or the semantic meaning of words. Rather, aspects like sentence mode, phrase accents, stress, focus, expressiveness, speaking style, emotions, speaker identity, and so on, affect prosodic features. Good and natural prosody is crucial for naturally sounding synthetic speech and can compensate for flaws on the segmental level. More details on prosodic features can be found in Chapter 3, Section 3.1.2.
In most TTS systems the prosodic components are predicted “apart” from the segmental components, which can be problematic because two highly interconnected elements are treated separately. In expressive speech synthesis,

Figure 2.3: Binary syntactic tree structure. S: sentence, NP : nominal phrase, VP : verbal phrase, N: noun, V : verb, Det: determinant, Adj: adjective.

prosody is very important. In fact, as will be discussed later, it is often regarded as the most important or even the only important feature. Regardless of which method is used for expressive speech synthesis, prosody needs to be modeled carefully and with respect to the expressiveness.

Intonation Prediction

Intonation; other related terms: melody, pitch, F0, fundamental frequency. Acoustically, intonation is the fundamental frequency (F0) of the signal. Perceptually, it is the speech melody. Articulatorily, it is the effect of the vibration of the vocal folds. Well known variations of the intonation contour are: the overall declination, often attributed to the decline of sub-glottal pressure, as for instance according to Taylor (2009); stress marks realized as pitch movements; and the descent at the end of a phrase, except if the phrase is an echo question with a rising contour, or an intermediate phrase boundary, also with rises, but weaker ones.
Many theories in phonology, acoustics, speech perception and production try to describe or model intonation. To name a few of them: the Dutch Model, by Hart and Cohen (1973), intends to stylize the contour by replacing it with a series of straight lines, such that the original contour is matched as closely as possible; contour declination is also taken into account, enclosing the contour between globally declining straight lines. The Kiel Intonation Model (KIM), by Kohler (1991), see also Möbius (1993); Wollermann (2012), describes prosody on different levels, like lexical stress, sentence stress, prosodic boundaries in terms of pause duration, phrase-final segmental lengthening, F0 end points, speech rate, intonation in terms of “peaks” and “valleys” and their synchronization with syllables, pitch category, and downstep (a concept opposed to global declination, where pitch lowers in steps between phrase accents). The Fujisaki model, by Fujisaki and Kawai (1982), is a superpositional model whose input is constituted by phrase and accent components which are mixed using second order filters. The Tilt model, by Taylor (1992), defines two types of events, accent and boundary, and calculates three Tilt parameters: amplitude, duration, and a Tilt parameter which defines the general shape of the event. Bezier polynomial coefficients, by Escudero et al. (2002), see also Agüero and Bonafonte (2004a), define five attraction points around the contour within an accent group, i.e. the period between two accented syllables, where the first and the last ones define the beginning and the end, and the rest act like magnets attracting the curve towards them. Finally, Tones and Break Indices (ToBI), by Pierrehumbert (1980), is one of the most widely used intonation models, derived from the autosegmental-metrical model of intonation, as in Liberman (1975). ToBI describes the intonation contour as a combination of high and low tones, denoted H and L, with additional symbols marking rises and falls, pitch accents, boundary tones, etc. For instance, L+H* marks a rising peak accent, H+!H* is a step down onto an accented syllable from a high pitch, L- and H- are intermediate phrase boundary tones, and H% and L% are final boundary tones.
The models described above provide formal representations of pitch. The missing step is to predict such a representation for a given text. This is done by learning these representations from information derived from text, as in figure 2.4.

Figure 2.4: Training of a regression model for pitch prediction.

Once the model is trained, it predicts the pitch contour (or a different representation) from the same type of textual features, as in figure 2.5. Typical regression models used for pitch prediction are, for instance, CARTs or DNNs. It is important that the features derived from the text are meaningful for pitch prediction, for instance syllable/word position in the phrase, distance to the phrase boundary, sentence modus, accent, focus, POS, etc. In HMM- and NN-based systems, intonation is modeled more intrinsically, using the same models for spectral and intonation prediction. More details on intonation modeling in HMM- and NN-based systems can be found in Sections 2.3 and 2.4.2, respectively.

Figure 2.5: Pitch prediction with a regression model.
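As a minimal sketch of the regression scheme in figures 2.4 and 2.5, the snippet below fits a CART to predict a per-syllable mean F0 from a handful of textual features. The features (relative position in phrase, lexical stress, a coarse POS class id, distance to the next boundary) and the training data are invented placeholders; an actual system would extract such features from an annotated corpus and usually predict a richer contour representation than a single mean value.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Placeholder training data: one row per syllable with
# [relative position in phrase, stress flag, POS class id, syllables to next boundary].
X_train = rng.random((500, 4))
X_train[:, 1] = rng.integers(0, 2, 500)    # stress is binary
X_train[:, 2] = rng.integers(0, 10, 500)   # coarse POS class
X_train[:, 3] = rng.integers(0, 8, 500)    # distance to boundary

# Toy target: stressed and phrase-initial syllables get a higher mean F0 (Hz), plus noise.
y_train = 120 + 40 * X_train[:, 1] - 20 * X_train[:, 0] + rng.normal(0, 5, 500)

model = DecisionTreeRegressor(max_depth=6, min_samples_leaf=10)
model.fit(X_train, y_train)

# Predict mean F0 for two new syllables: a stressed phrase-initial one and an unstressed final one.
X_new = np.array([[0.05, 1, 3, 6],
                  [0.95, 0, 7, 0]])
print(model.predict(X_new))
```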

In expressive speech, intonation can be problematic, since it shows higher variation and a tendency to outliers, for instance high F0 values in angry or happy speech, or voicelessness in suspended speech. This can cause problems for the waveform generation methods and also lead to unnatural effects in the synthesized speech. On the other hand, many expressive styles use intonation in a very exhaustive way, such that it becomes an indispensable feature for expressive speech. For instance, in Chapter 3, experiments are presented where the predictive power of pitch, in comparison to other acoustic features, is very clear, at least under certain conditions. Therefore, intonation needs to be modeled very carefully, taking into account expressiveness, not only general aspects like phrase breaks.

Duration Prediction

Segmental duration is a variable factor which depends, on the one hand, on the phoneme type and identity, and on the other hand on the position and the function of the phoneme in the syllable, word, sentence, etc. For example, stressed vowels tend to have a longer duration than unstressed ones. The vowel of the last syllable in a phrase can be several times longer in comparison to the others, which is called final lengthening. On the other hand, vowels in function words tend to be shorter and more assimilated, i.e. articulatorily and phonetically reduced. In general, vowels are more susceptible to significant duration variations than consonants, and fricatives and liquids tend to be more variable than plosives. Finally, idiomatic factors influence the duration variability; for instance, if duration is a distinctive feature, then it is less variable. The variability of phoneme duration is also called elasticity, and the elasticity hypothesis states that phone duration in a syllable varies due to a constant factor and is normalized by the phone class, as for instance in Taylor (2009).
In speech synthesis, duration is crucial. Segment duration needs to be predicted exactly to achieve natural sounding speech. In expressive speech synthesis, duration is as important as intonation. Depending on the speaking style, duration can vary significantly, for instance shorter durations in angry or happy speech, longer durations in sad or suspended speech. Here too, outliers must be considered, especially in highly aroused expressive styles like anger. In shouting or exclamations duration can vary extremely. In expressive TTS the duration must be modeled carefully with regard to the expressive style and taking outliers into account. The features which can be used to predict the duration of phonemes are, for instance, the phoneme identity or type, the presence of stress, the syllable position in the sentence and with respect to breaks, the type of the preceding and following phonemes, the type and the position of the phoneme in the syllable, and more.
Some historical reference models for duration prediction are the Klatt model, by Klatt (1973), which is a deterministic model derived from corpus measurements, where duration is calculated by applying a set of factors determined by aspects like phone position, context, etc.; the sums-of-products model, by van Santen (1994), which is a sort of trainable Klatt model where duration is modeled as a regression (in a study by Febrer et al. (1998), the sums-of-products model outperforms the Klatt model); and finally the Campbell model, by Campbell (1992), which models duration on the syllable level using neural networks trained on a set of linguistic features; the phone durations are then calculated according to the elasticity hypothesis, changing the durations proportionally in order to match the predicted syllable duration.
Also for duration prediction, regression techniques have been used in a similar way as for pitch prediction. Instead of pitch representations, the regression model learns segment durations from features derived from text. Duration is normally predicted on the phone level and expanded to the syllable and word levels, unlike in the Campbell model. Here too, common regression models are decision trees and neural networks. For example, a decision tree based approach is implemented by Riley (1992) or Breuer (2009). Neural networks have been used for instance by Córdoba et al. (1999); Fackrell et al. (1999) (aside from the Campbell model).
Although duration is part of prosody, which is considered suprasegmental, it is mostly modeled on the segmental level; more global phenomena like rhythm are rarely taken into account. For instance, Wagner (2008) discusses the importance of rhythm in speech; Jauk (2010) implements a network of oscillators which interprets speech rhythm as a set of pulses, where each pulse represents a rhythmic unit (syllable). In expressive speech, specific speaking styles can be marked by rhythm and segment durations, like clerical speech, different types of news, etc. In HMM-based synthesis, duration is modeled as a Gaussian distribution for each state of the HMMs. In NN-based synthesis, duration can be modeled with the same models, or with a separate network. More details on duration modeling in HMM- and in NN-based systems can be found in Sections 2.3 and 2.4.2, respectively.

Break Prediction

Breaks and pauses are very important in speech production since they structure the general information flow, according to Wagner (2008). Partly, breaks occur for respiratory reasons, as humans need to breathe in when they lack air in the lungs. But breaks and timing also provide a structure which can be crucial for understanding. Therefore, some break positions are perceived as natural, while others do not sound natural. Stuttering is a well known pathology which yields irregular and arrhythmic structuring of speech and affects understanding considerably. To give an example of appropriate and inappropriate breaking, a sentence like “The big red elephant did not like mouse toys because they scared him.” could be uttered in different ways:

• The big red elephant did not like mouse toys
  because they scared him.
• The big red elephant
  did not like mouse toys
  because they scared him.

Other break combinations could significantly affect the understanding of the sentence though:

• The
  big red elephant did
  not
  like mouse toys because they scared him.
• The big red elephant did
  not like mouse
  toys because they scared him.

Break duration is an important factor, since overly long breaks can disturb the information flow. Many breaks are not silent, but are filled with hesitation sounds or quasi-words, or are simply formed by an intonation reset. Break position and duration depend on the word position in the sentence, on the length of the sentence, on part-of-speech, on the distance from preceding pauses, on the general syntactic structure, or on physiological conditions of the speaker.

Generally speaking, break duration and frequency, syllable duration and word rate are part of the rhythmic structure of speech. According to Wagner (2008) and Jauk (2010), rhythm is necessary for information perception and memorizing. Also, the duration and position of rhythmical units, i.e. syllables, words, phrases, breaks, and so on, are directly related to the human perception of time. For expressive speech, breaks, like intonation and duration, are crucial for the naturalness of the synthesis. Also according to Wagner (2008), there are some speaking styles which directly depend on speech rhythm, duration and breaks, for instance clerical, political or news speech.

Punctuation marks give hints for break positions and duration; however, realistic break prediction is more complex. Several approaches have been used for break prediction, including rule-based approaches, regression-based approaches, finite state transducers, and so on. Agüero and Bonafonte (2004b) use a finite state transducer for break prediction. Pascual and Bonafonte (2016b) implemented an RNN to predict breaks. Agüero and Bonafonte (2003) give a review of different break prediction methods. In natural speech, breaks are not always silences. Often, syllables are lengthened or the pitch is reset. Sometimes fillers are used, e.g. “hmmm”, “aaah”, “eeeh”; in other cases noises occur, like laughter, sighs or breathing. Adell et al. (2012) exploit the possibility of using filled pauses to augment the naturalness of synthetic speech.
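As a toy illustration of data-driven break prediction, the sketch below treats each word boundary as a binary classification problem over simple text-derived features; the features, data and use of scikit-learn are illustrative assumptions and far simpler than the transducer and RNN approaches cited above.

    # Minimal sketch: break prediction as binary classification per word boundary.
    # Assumes scikit-learn; features and labels are made-up placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical features per word boundary:
    # [punctuation_follows, words_since_last_break, is_function_word, sentence_length]
    X = np.array([
        [1, 7, 0, 12],   # comma after a long stretch -> break
        [0, 2, 1, 12],   # short stretch before a function word -> no break
        [1, 9, 0, 15],
        [0, 4, 0, 15],
    ])
    y = np.array([1, 0, 1, 0])  # 1 = insert a break, 0 = continue

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[0, 8, 0, 14]]))  # long stretch without punctuation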

2.1.3 Corpus preparation

Corpus preparation for speech synthesis is a very time-consuming and often finicky task, especially if the corpus is real-world audio data and not studio-recorded material. Normally, a raw corpus consists of audio data and the corresponding text. A series of procedures must be undertaken in order to use the corpus for a TTS system, either to train models or as a unit database.

Often, as a first step, a review is necessary. For the audio part, it is very important to review the corpus for acoustically unclean data, like background noise, outliers (for instance very loud or exaggerated speech), acoustic artifacts and so on. Strongly assimilated speech may need to be removed. This is especially important when dealing with expressive speech since here, even in laboratory conditions, outliers can easily occur. If unclean data is left in the corpus, it will appear in one form or another in the synthesized speech. Audio data needs to be cut into smaller samples, normally of sentence length. Also, audio data needs to have the appropriate encoding, sampling rate, etc. The text part, too, needs to be reviewed for unclean data. Here, unclean data can be comments, formatting marks (like HTML tags or other mark-up), misspellings and so on. It must be checked whether the text corresponds to what is being said in the audio. Text normalization needs to be carried out, such that numbers, acronyms, and so on are properly spelled out. Finally, here as well, the text encoding needs to be adapted to the system requirements. All these steps are crucial when dealing with real-world data. In the end, all text, and all further information which is added, needs to be in the appropriate system format.

Next, a grapheme-to-phoneme conversion takes place. Ideally, a corpus-specific lexicon is created, where specific words like proper names, foreign terms or exceptions are included in order to assure a correct pronunciation. Also, structural and syntactic analysis takes place, like the number of syllables/words in a sentence, the relative position of phones/syllables/words with respect to accents, phrase boundaries, etc., part-of-speech tagging, sentence analysis and other types of text analysis.

For the acoustic part, forced segmentation and labeling is performed, where speech is segmented and each segment is labeled with the corresponding phoneme (or other unit). In the past, this process was manual and very time consuming, hence only relatively small corpora were used in synthesis. Nowadays, the process is generally automatic with occasional manual revision. In order to perform automatic segmentation, a speech recognition system is used; however, in comparison to the speech recognition task, in forced segmentation it is known what is being said. The task is to find out when exactly each unit is produced, with some allowed variations like silences after words or alternative pronunciations. This process often involves several steps, where different levels of precision are achieved. Also, acoustic and prosodic low-level features are extracted, like MFCCs, pitch, segmental and pause durations, and so on.

Sometimes, further labeling is required, depending on the database. For instance, in multi-speaker databases, speaker labels can identify the speakers; for emotional speech, emotion labels can be added, and so on. Generally, everything concerning manual labeling is very time consuming, and one of the goals of this thesis is to reduce manual effort in corpus preparation.
More information about general aspects of corpus preparation can be found for instance in Breuer (2009). Especially challenging is the use of audiobooks as corpora. Audiobooks are rich in expressive speech, but are recorded under different conditions than corpora intended for TTS applications. They involve character acting, imitations, exaggerations, assimilations, outliers, and so on. Some studies, for instance those conducted by Chalamandaris et al. (2006) or Szekely et al. (2012), explore the problems and the solutions in dealing with audiobooks as corpora for expressive speech synthesis.
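Much of the acoustic side of this preparation can be automated. The following is a minimal sketch, assuming the librosa library, of resampling a recording, trimming leading and trailing silence, and extracting MFCCs and F0; the file name and all parameter values are placeholders.

    # Minimal corpus-preparation sketch: resample, trim edge silence,
    # extract MFCCs and F0. Assumes librosa; path and settings are placeholders.
    import librosa

    y, sr = librosa.load("audiobook_chapter01.wav", sr=16000)   # resample to 16 kHz
    y_trim, _ = librosa.effects.trim(y, top_db=30)              # strip edge silence

    mfcc = librosa.feature.mfcc(y=y_trim, sr=sr, n_mfcc=13)     # segmental features
    f0, voiced, _ = librosa.pyin(y_trim, fmin=60, fmax=400, sr=sr)  # prosodic features

    print(mfcc.shape, f0.shape)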

2.1.4 Waveform Generation

Waveform generation is the step where, using the information gained in the previous steps, the actual speech signal waveform is generated. There are many waveform generation techniques and, generally, the kind of technique applied determines the architecture of the whole system. This section provides a brief overview of early waveform generation methods and of speech synthesis systems applying these methods, partly based on a historical review offered by Klatt (1987). Sample sounds for many of the systems mentioned can be found in Klatt’s ’History of Speech Synthesis’ Archive. State-of-the-art methods are described in Sections 2.2, 2.3 and 2.4.

The reason why a historical overview of early methods is useful at this point is that in expressive synthesis these methods are still in use. Why? Mainly because of their simplicity and controllability. Expressiveness in speech synthesis is very variable and suprasegmental, i.e. it can be applied to all segments in general, and the acoustic effects change a lot for different expressions. For example, if we wanted to train an expressive system with 6 emotions for 6 voices, we would need much more training material for each voice. In more basic systems, however, like formant synthesis or diphone synthesis, all these modifications would be a matter of adjustment or signal manipulation, which is far easier than obtaining and preparing training data. Therefore, especially in experimental setups for fundamental research, older generative methods are still in use and will be addressed in this section. On the other hand, there are in general not many expressive speech synthesizers built from scratch; in many cases other system types were adapted to synthesize expressive speech. Therefore, a historical perspective on mechanisms applied for expressive speech is interesting.

The earliest successful speech synthesis attempt found is documented, among others, by Macquarie University, and goes back to 1779 in Russia, where Kratzenstein constructed a mechanical model of a vocal tract capable of synthesizing isolated vowels. The first attempt at connected speech is reported to have been conducted by Kempelen in 1791 with a pneumatic machine, which was driven with an air flow by a whistle or an attached leather bag which functioned as a pump. As Klatt (1987) reports, the first electronic synthesizer was built by Stewart in 1922. Dudley (1939) proposed a device, called the Vocoder, which analyzed speech in terms of varying acoustic parameters and was then able to reconstruct the original waveform. A speech synthesizer based on that technique, the Voder, was also published. In 1951, at the Haskins Laboratories, the “Pattern Playback” synthesizer was able to convert broadband spectrogram patterns back to sound, as in Cooper et al. (1951).

An important milestone in speech synthesis was the development of the acoustic theory of speech production by Fant (1970) and Ungeheuer (1962), and the source-filter perspective. The source-filter perspective basically postulates that the signal generated at a source (lungs/glottis) is modified by a filter (vocal tract), as illustrated in figure 2.6. The figure shows a source part with a pulse generator and a noise generator, and a filter part with a filter corresponding to the vocal tract. The pulse generator represents the voiced part of the source signal, and the noise generator represents the unvoiced part. Both have their specific amplitudes. The filter part of the system is where the formants are generated.
In a mixed excitation, both the pulse generator and the noise generator can be connected at the same time.

Figure 2.6: Basic source-filter speech production model with a voiced and an unvoiced component.
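To make the source-filter model of figure 2.6 concrete, the sketch below excites an all-pole “vocal tract” filter with an impulse train (voiced source) plus low-level noise (unvoiced source); the formant frequencies, bandwidths and gains are illustrative values, not taken from any real synthesizer.

    # Crude source-filter sketch: pulse train + noise excitation passed through
    # an all-pole filter. Assumes numpy/scipy; all parameter values are illustrative.
    import numpy as np
    from scipy.signal import lfilter

    fs, f0, dur = 16000, 120, 0.5
    n = int(fs * dur)

    source = np.zeros(n)
    source[::fs // f0] = 1.0                  # voiced source: impulse train at F0
    source += 0.01 * np.random.randn(n)       # unvoiced source: noise (mixed excitation)

    a = np.array([1.0])                       # all-pole filter built from two resonances
    for freq, bw in [(700, 130), (1200, 150)]:    # rough formants of an /a/-like vowel
        r = np.exp(-np.pi * bw / fs)
        a = np.convolve(a, [1.0, -2 * r * np.cos(2 * np.pi * freq / fs), r * r])

    speech = lfilter([1.0], a, source)        # "vocal tract" filtering of the excitation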

Based on this model, parametric formant synthesizers were developed. The first reported formant synthesizers were the Parametric Artificial Talker (PAT) by Lawrence (1953) and the Orator Verbis Electris (OVE) by Fant (1953). Another interesting example of formant synthesis is proposed by Klatt (1972), where the synthesizer permitted mimicking of nasalization, mixed excitation, and parallel formants for the synthesis of obstruents, i.e. plosives (e.g. /p/), fricatives (e.g. /f/), and affricates (e.g. /pf/). The excitation and formant values are determined by hand as model parameters, which then drive a vocoder to generate the actual waveform. The advantage of the technique is its relative simplicity in terms of the number of model parameters. These can be modified by hand to produce different voices, including expressive speech. Systems developed by Cahn (1989); Murray and Arnott (1993, 1995); Burckhardt and Sendelmeier (2000) use this technique for expressive speech synthesis, exploiting the advantage of this simplicity of adjustment. The drawback of formant synthesizers is their low quality: synthetic voices produced by formant synthesizers sound “buzzy” and robotic.

Also parametric, but physiologically motivated, is articulatory speech synthesis. Here, a geometrical model of the vocal tract is emulated and an emulated source signal is modified by the emulated vocal tract according to the acoustic theory of articulation. The modifiable model parameters are the geometric parameters of the vocal tract, which are translated to filter parameters so that a sound wave can be generated. The first models of this kind were proposed by Dunn (1950) and Stevens et al. (1953). These first models had to be adjusted by hand for each section. A newer articulatory speech synthesis system is VocalTractLab, developed by Birkholz (2005). This system provides a three-dimensional visual vocal tract model which can be modified in order to simulate articulation. It has been used for several synthesis-based experiments and studies on speech production, also in expressive speech, as in Birkholz et al. (2015), regarding voice quality, for instance, in Birkholz et al. (2011), or for singing voice synthesis by Kröger and Birkholz (2009). In a very recent project, exact physical and physiological models of sound generation and of the vocal tract are used to simulate, rather than synthesize, real speech. Until now, the technique is computationally very expensive and only vowels can be synthesized. In an example of their work, Arnela et al. (2016) study the effects of simplifying the vocal tract models on the sound production.

Generally, traditional parametric speech synthesis has always suffered from low quality, due to too simplistic modeling, as in formant synthesis, or due to the non-optimal “classic” vocoder filters. Alternative approaches were sought. One of the most important alternatives were systems based on the concatenation of prerecorded sound samples, which promised much better results since the generated sound was actually prerecorded human voice, hence “perfectly” natural. Details on concatenative speech synthesis are discussed in Section 2.2. In the late nineties, a statistical approach, concretely Hidden Markov Model (HMM) based, was developed, where speech synthesis parameters were not preconfigured but rather learned from data, as in Masuko et al. (1996). These systems have marked an important milestone in (expressive) speech synthesis, opening new possibilities.
Also, new vocoder types were developed which yielded much more natural synthetic quality, renewing the importance of parametric systems as opposed to concatenative ones. These systems will be discussed in more detail in Section 2.3 in general, and in Section 2.5 regarding expressive speech synthesis. Modern state-of-the-art parametric models are based on neural networks (NNs) and will be discussed in Sections 2.4 and 2.4.2. These models provide a further improvement in synthesis quality and open up further possibilities.

2.2 Concatenative Speech Synthesis

The idea of concatenative speech synthesis was to connect pieces of prerecorded speech such that variable content could be synthesized. In the simplest case, “carrier” sentences were prerecorded, where certain words or word combinations were variable and could be substituted. Such systems were popular, and are often still used, for example in train station announcements, where sentences like “Next station is:” are filled with station names. This type of system is good for very restricted domains, but impractical in open domains. When, for example, names or any other variable content change and there is no recording for the new word, the person who recorded the original content is often no longer available, so carrier sentences and filled-in words often have different voices. Another important problem is fluent concatenation and prosody. If the difference between the carrier sentence and the filled-in words is too big, it can affect intelligibility.

A more ambitious idea was to record all possible phonemes of a language and combine them in order to obtain words. The problem here was coarticulation. Coarticulation describes the acoustic variability in the transition between phones which is due to the movement of the articulatory organs. As for instance Menzerath and de Lacerda (1933) state, the articulation of a phone needs a temporal synchronization of distinct articulatory organs, where each has a different mass and needs to cover a different distance in order to achieve the desired position in the mouth. The consequence is that the organs do not simply “jump” to certain positions in order to articulate a phone, but constantly move from one position to another, starting the movements at distinct times and with different accelerations and velocities. The main acoustic consequence is that, when changing positions for consonant articulation, these movements affect the resonance frequencies, i.e. the formants of the vowels, such that, depending on the articulatory organ and the direction of movement, the formants change according to the acoustic theory of articulation. Delattre (1968) conducted extensive studies on this topic. Besides synchronization, there is the principle of economy, as described for instance by Vary et al. (1998), which basically states that the articulatory organs try to spend as little energy as possible in order to achieve their goals. This same principle explains assimilations and reductions of phones, stating that as long as the communication goal, i.e. to transmit some information, is achieved, the articulation does not need to be perfect, i.e. the phone can be reduced.

In speech synthesis, the coarticulation issue could be resolved via signal processing techniques, or by using better concatenation units and concatenation criteria. With the idea that the transition between two units is acoustically less stable, more difficult to model, and more important for naturalness, diphone based systems were developed. Diphone units did not represent a single phoneme, but rather the transition between two phonemes. For a diphone system an exhaustive corpus had to be recorded where each diphone was recorded at least once. Acoustic defects were generally corrected by applying signal processing techniques. To determine the amount to be corrected, diphones were compared on the acoustic, durational, and pitch levels.
Since formants are variable and, for many units, difficult to determine even by hand, linear prediction coefficients were used to measure unit borders. The basic idea of linear predictive coding (LPC) is that the current point in the signal can be predicted from the n past points, as in Vary et al. (1998):

\[ \hat{x}(k) = \sum_{i=1}^{n} a_i \, x(k-i) \tag{2.1} \]

where x̂(k) is the estimate of the current sample, a_i is the i-th model coefficient, and x(k−i) is the i-th sample before the current one. LPCs can be used to measure the continuity between two units, and LPC filters can be interpolated in order to correct discontinuities.

For the prosodic part, intonation and durations needed to be modified according to the current needs. One of the most famous techniques for pitch and duration modification is time-domain pitch-synchronous overlap and add (TD-PSOLA), developed by Charpentier and Stella (1986). The idea in PSOLA was that duration can be manipulated by copying or deleting (invariable) pitch periods, such that the phone sounds longer or shorter. Analogously, pitch is modified by adding or removing pitch periods within the same time span, such that when more periods are produced in the same time frame, the pitch is higher. This technique was very simple and very effective, though only within limits: if the changes were too big, acoustic artifacts appeared due to inconsistencies between the time and the spectral domain. Carefully designed diphone systems achieved reasonably good intelligibility and acceptable naturalness. Diphone systems were, and still are, also used in expressive speech synthesis, as described in Section 2.5.

No matter how well designed the diphone systems were, the transitions between two phones were not enough to achieve real naturalness and perfect intelligibility. Applying signal processing did not help enough, creating acoustic artifacts, although the results were better than formant synthesis (but less flexible). Also, with cheaper memory, larger corpora could be stored and used for synthesis. Hence, a new generation of concatenative systems appeared, called unit selection. Unit selection systems use large databases where an algorithm tries to find the best sequence of units for each individual case, according to context, prosodic and acoustic criteria, etc. Signal processing is avoided as much as possible. For this reason, corpora for unit selection systems must be designed very carefully and exhaustively in order to assure the availability of suitable spectral and prosodic contexts.

Figure 2.7 shows the general architecture of a unit selection system. First, prosodic and spectral parameters are extracted from a speech corpus; they are used to train prosodic models and to generate the unit database and the cost functions. On the synthesis side, the input text is analyzed and the prosody is predicted. Then, a unit selection algorithm selects from the unit database the best units satisfying the predicted and other criteria, and the waveform is generated by concatenation. Additionally, signal processing methods can be applied in order to reduce small discontinuities or prosodic mismatches.

Different unit types have been used for unit selection. There are systems which use triphones as concatenation units. Triphones are phones in a specific phonetic context. For example, [gal] is an [a] in the context of [g] and [l], and [con] is an [o] in the context of [c] and [n]. Despite everything, even in very large corpora the coverage of triphones is generally not exhaustive, so similar contexts must be taken into account if a particular combination is missing. For instance, if the triphone [gab] is missing, it could be substituted by the triphone [gap], since [p] and [b] are articulated in the same place and have a similar effect on [a]. A crucial point in unit selection systems is the concatenation algorithm.
The algorithm is usually based on cost functions which decide which units to use in each case.

Figure 2.7: Unit selection system architecture.

Two different types of costs are distinguished: target costs and concatenation costs. Target costs are related to the unit itself, e.g. identity, type, voicing, phonetic context, stress, etc. Concatenation costs are related to acoustic discontinuities between two units, e.g. F0 and MFCC discontinuities at the concatenation point. Both types of costs must be well balanced in order to achieve good synthesis quality.

Since unfortunate concatenations yield acoustic artifacts, and since, especially in the early days of unit selection systems, memory was expensive and corpus recording and preparation required a large effort, the idea came up that using larger units would yield better quality by reducing the number of spectral and prosodic discontinuities. However, it is impossible to obtain an exhaustive corpus using only large units. So when large units, like words, were missing, the system could fall back to lower levels and fill the gap with smaller units, like syllables and phones. The problem with this type of system is its instability and the differences in quality between passages with long and short units. A special type of unit for (non-uniform) unit selection was proposed by Breuer (2009). The units are called phone extensions for synthesis (phoxsy). The main idea is that the coarticulation effects are not equal in all possible combinations, so it makes sense to define a non-uniform unit type where phones which are strongly affected by coarticulation always remain in their context, while others, which are less affected, can be used more flexibly. In the Ogmios unit-selection synthesis, by Bonafonte et al. (2006), diphone units were used; however, cost functions were applied prioritizing diphone combinations which belonged together, forming larger units. This way the non-uniformity was intrinsic.

Diphone based systems are almost out of date and are probably of historic importance only. Unit selection, on the contrary, is still in use, often as a reference system. In the Blizzard Challenge 2017, a hybrid system with a unit-selection mechanism at the backend won using an expressive speech corpus (please refer to Section 2.5 for more information). In expressive speech synthesis, in order to produce expressive speech or voices, there are two options for concatenative systems: one, to record a new voice with the desired expressive speaking style or emotion; two, to manipulate the synthesized signal. In Section 2.5 some examples of concatenative and unit-selection expressive speech synthesis systems are given.
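As a sketch of the cost-based selection described above, the code below runs a small Viterbi-style search over candidate units, combining a target cost (mismatch with the desired specification) and a concatenation cost (acoustic distance at the join); the unit representation and the cost weights are purely illustrative.

    # Minimal Viterbi-style unit selection over target + concatenation costs.
    # Units are toy acoustic vectors; cost weights are illustrative placeholders.
    import numpy as np

    def select_units(targets, candidates, w_t=1.0, w_c=0.5):
        """targets: desired feature vector per position;
           candidates: list of candidate unit vectors per position."""
        n = len(targets)
        cost = [[w_t * np.linalg.norm(c - targets[0]) for c in candidates[0]]]
        back = [[None] * len(candidates[0])]
        for i in range(1, n):
            row, ptr = [], []
            for c in candidates[i]:
                tc = w_t * np.linalg.norm(c - targets[i])              # target cost
                joins = [cost[i - 1][j] + w_c * np.linalg.norm(c - p)  # concatenation cost
                         for j, p in enumerate(candidates[i - 1])]
                best = int(np.argmin(joins))
                row.append(tc + joins[best])
                ptr.append(best)
            cost.append(row)
            back.append(ptr)
        path = [int(np.argmin(cost[-1]))]      # backtrack the cheapest unit sequence
        for i in range(n - 1, 0, -1):
            path.append(back[i][path[-1]])
        return path[::-1]

    # Two target positions, two candidate units each (1-D "acoustic" vectors).
    print(select_units([np.array([1.0]), np.array([2.0])],
                       [[np.array([0.9]), np.array([1.8])],
                        [np.array([2.1]), np.array([0.5])]]))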

2.3 Statistical Speech Synthesis

In this work, statistical speech synthesis refers to Hidden Markov Model (HMM) based speech synthesis. The model is used to learn from extracted speech characteristics and to reproduce their averages and variations for given phonemes and contexts. Once the model reproduces the learned speech features, a filter is applied to convert them into a waveform. A Hidden Markov Model has N hidden states. The model begins in state i with an initial probability π_i. Then, it changes to another state (or stays in the same one) with a transition probability a_ij. From each state, the model emits a symbol with an emission probability b_jk. If A is the matrix of all transition probabilities, B is the matrix of all emission probabilities, and π is the vector of all initial probabilities, then an HMM is defined as:

λ = (A, B, π) (2.2)

Usually in speech, unidirectional (left-to-right) three- or five-state phoneme HMMs are used, where the states represent the beginning, the middle, and the end of each phoneme. Hidden Markov Models have a long tradition in speech recognition. The first systems used discrete density HMMs, as by Baker (1975), and later continuous density HMMs, as by Juang (1984). An extensive introduction to speech recognition and the applications of HMMs in it can be found for instance in Rabiner and Juang (1993). There were early attempts to use HMMs in speech synthesis, such as by Ljolje and Fallside (1986), where HMMs were used for pitch contour generation, or by Giustiniani and Pierucci (1991), where the most probable acoustic feature vector is generated from a phonetic symbol using a phonetic and an acoustic HMM. However, the first successful system was developed by Masuko et al. (1996), introducing dynamic features into HMM based synthesis. The basic structure of the model is: perform MFCC analysis and prosodic analysis on a speech database and train phoneme based HMMs using the MFCCs and F0, and the derived dynamic features. In the synthesis part, an input text is converted into a phoneme sequence, and for each phoneme the corresponding HMM generates an MFCC and an F0 sequence. This sequence is then used for waveform generation with the Mel Log Spectral Approximation (MLSA) filter, as in Imai (1983).

Figure 2.8 shows the general architecture of an HMM based synthesis system.

Figure 2.8: HMM based system architecture.

The phoneme HMMs are first pretrained as monophone HMMs, and then re-estimated as triphone models for all triphones available in the corpus. The state durations are modeled by aligning the training data to the models via the Viterbi algorithm, obtaining state duration densities, each as a single Gaussian distribution. At synthesis time, q is the state sequence and o is the output parameter sequence, which is obtained by maximizing P(q, o|λ, T), where λ is the HMM and T is the number of frames. If no dynamic features are taken into account, P(q, o|λ, T) is maximized when o = µ, generating discontinuities especially in the transition parts of the phones. Including dynamic features, the means of the dynamic features are close to 0 in the static part but high in the transition part of the phones, generating smooth parameters. To model coarticulation for triphones that do not exist in the database, HMM states are clustered depending on the context. The system presented by Masuko et al. (1996) was implemented as the toolkit for HMM-based speech synthesis (HTS) by Tokuda et al. (2008).

In comparison to concatenative speech synthesis, HMM-based speech synthesis is much more flexible and has low memory requirements. The disadvantage is the same as in other parametric speech synthesizers: low naturalness of the synthetic speech. A partial answer to this problem is the usage of high quality vocoders such as STRAIGHT, which decomposes the speech signal into source and spectral parameters and applies refinements to each parameter extraction, by Kawahara et al. (1999), or Ahocoder, based on the harmonic plus noise model (HNM), by Erro et al. (2011b,a).

In this work, Ahocoder is used for synthesis in the experiments in Chapters 3 and 4, and STRAIGHT in Chapter 5. The HNM model splits the spectrum into two parts, a harmonic (voiced) lower band and a noise (unvoiced) upper band, as in:

\[ s(t) = s_n(t) + s_h(t) \tag{2.3} \]

where the harmonic part s_h(t) is defined as:

\[ s_h(t) = \sum_{k=1}^{K} A_k(t) \cos\!\big( 2\pi k f_0 t + \phi_k(t) \big) \tag{2.4} \]

where A_k(t) is the amplitude of the k-th harmonic at time t, and φ_k(t) is the phase of the k-th harmonic at time t. The unvoiced part is modeled as filtered stochastic noise. Erro et al. (2011a) also propose the usage of the maximum voiced frequency (MVF) as an additional parameter in HNM-based vocoders. The MVF is the frequency which splits the spectrum into the harmonic and the noise part.

The flexibility of HMM-based speech synthesis has opened many possibilities for expressive speech synthesis. Similar to concatenative synthesis, a system can be trained using an expressive or emotional corpus. Also, model parameters can be manipulated so as to control expressiveness in several ways. Some HMM-based expressive speech synthesis systems are presented in Section 2.5. With all the advantages of HMM-based synthesis, there is still one problem to solve, namely data insufficiency. Especially in expressive speech synthesis, where expressiveness can vary in unlimited ways, it is difficult to find enough training data to create robust voices with statistical models. This problem is partly solved using speaker adaptive training (SAT), described in Section 2.3.1.
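A direct reading of equations (2.3)–(2.4) above can be turned into a toy harmonic-plus-noise synthesizer: the sketch below sums harmonics of a constant F0 below an assumed maximum voiced frequency and adds high-pass noise for the upper band. All values are illustrative; real vocoders such as Ahocoder or STRAIGHT estimate amplitudes, phases and F0 frame by frame.

    # Toy harmonic-plus-noise (HNM) synthesis following eq. (2.3)-(2.4):
    # harmonics below the maximum voiced frequency plus noise above it.
    # Constant F0 and 1/k amplitudes; all parameter values are illustrative.
    import numpy as np
    from scipy.signal import butter, lfilter

    fs, f0, mvf, dur = 16000, 150, 4000, 0.5
    t = np.arange(int(fs * dur)) / fs

    K = int(mvf // f0)                                           # harmonics below the MVF
    s_h = sum((1.0 / k) * np.cos(2 * np.pi * k * f0 * t) for k in range(1, K + 1))

    b, a = butter(4, mvf / (fs / 2), btype="high")
    s_n = 0.05 * lfilter(b, a, np.random.randn(len(t)))          # noise band above the MVF

    s = s_h + s_n                                                # eq. (2.3): s(t) = s_h(t) + s_n(t)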

2.3.1 Speaker Adaptation

Speaker adaptation comes from speech recognition and refers to modifying model parameters or feature vectors in order to recognize speech spoken by new speakers, according to Ortega and Miguel (2014). Model parameter adaptation modifies previously trained models in order to match the test data. Some example techniques are: maximum a posteriori (MAP), where the posterior probability of the HMM parameters is maximized; maximum likelihood linear regression (MLLR), where linear transformations of mean vectors and, possibly, covariance matrices are performed to match the test data; and Vector Taylor Series, where the nonlinearity between speech, noise and convolutional degradation is approximated with a Taylor series. Feature vector based adaptation modifies the feature vectors of the test data in order to match the trained models. According to Ortega and Miguel (2014), some feature vector based techniques are: mean and variance normalization and multi-environment model-based histogram normalization (MEMLIN), where the input feature vectors are normalized towards the training data; and vocal tract length normalization and the augmented state space acoustic decoder, which try to model the differences between the trained models and the testing data, focusing on the effects of the physiological differences between speakers.

Adaptation techniques are used in speech recognition in order to match test data of new speakers when little training data is available for these speakers. In speech synthesis, speaker adaptation is performed for the same reason, but with a different objective. The idea is to adapt a well trained speaker model with the sparsely available data of new speakers in order to obtain new voices. In practice, the model to be adapted is trained from multiple speakers such that a lot of variation is included in the training data, making it easier to adapt the average voice model (AVM). In expressive speech synthesis, speaker adaptation techniques are very practical since in “real-life” corpora, like audiobooks, radio broadcasts, etc., most expressive styles are very sparse and there is not enough data to train all desirable voices. The most used adaptation techniques in speech synthesis are maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR), which are described below.

Maximum a Posteriori (MAP)

Maximum a posteriori (MAP) adaptation maximizes the posterior probability of the HMM parameters given the adaptation data. The parameters are updated by maximizing:

\[ \hat{\Theta} = \operatorname*{arg\,max}_{\Theta} P(\Theta|O, H) = \operatorname*{arg\,max}_{\Theta} P(O|H, \Theta)\, P(\Theta) \tag{2.5} \]

where P(Θ) is estimated from the existing parameters, O is the new observation vector, H is the observation transcription and Θ̂ are the new parameters. Usually, only the mean vectors are re-estimated; however, covariances and weights can also be adapted. The estimation of the mean vectors is defined as:

\[ \hat{\mu}_{k,c_k} = \frac{\tau\, \tilde{\mu}_{k,c_k} + \sum_{t=1}^{T} \gamma_{k,c_k}(t)\, o(t)}{\tau + \sum_{t=1}^{T} \gamma_{k,c_k}(t)} \tag{2.6} \]

where μ̂_{k,c_k} is the re-estimated mean, μ̃_{k,c_k} is the prior mean, and γ_{k,c_k}(t) is the posterior occupancy probability of component c_k in the k-th state. o(t) is the new data point at time t. τ is a meta-parameter which is used to balance the influence of the new and the prior data, usually set heuristically between 2 and 20.

The problem with MAP adaptation is that only model parameters of phonemes seen in the adaptation data will change. However, there are usually unobserved parameters since the amount of adaptation data is limited. Regression based model prediction (RMP) and structural MAP are MAP extensions which try to deal with this problem, according to Ortega and Miguel (2014). RMP searches for correlations between parameters and tries to adapt unobserved parameters using the linearly correlated observed ones. Structural MAP organizes the Gaussians of the acoustic model in a tree structure, where similar parameters are grouped on the same level.
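Equation (2.6) above is straightforward to state in code: the sketch below re-estimates a single Gaussian mean from adaptation frames and their occupancy probabilities, interpolating with the prior mean through τ; the data are random placeholders.

    # MAP re-estimation of one Gaussian mean, following eq. (2.6).
    # Adaptation frames, occupancies and the prior mean are placeholders.
    import numpy as np

    def map_update_mean(mu_prior, frames, gamma, tau=10.0):
        """mu_prior: prior mean (D,); frames: adaptation data (T, D);
           gamma: occupancy probabilities (T,); tau: prior weight."""
        num = tau * mu_prior + (gamma[:, None] * frames).sum(axis=0)
        den = tau + gamma.sum()
        return num / den

    mu_prior = np.zeros(3)
    frames = np.random.randn(50, 3) + 1.0     # adaptation data pulled towards 1
    gamma = np.random.rand(50)                # state/component occupancy probabilities
    print(map_update_mean(mu_prior, frames, gamma))

With a large τ the prior mean dominates; with plenty of adaptation data the estimate moves towards the observed frames.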

Maximum Likelihood Linear Regression (MLLR) and Speaker Adaptive Training (SAT)

Maximum likelihood linear regression creates clusters of similar parameters and performs a linear transformation using regression matrices. The clustering is performed statistically or using prior knowledge such as the phoneme's articulation type. The individual speaker characteristics are modeled by projecting the speaker-independent mean vector μ_j onto the speaker-dependent mean vector μ_j^(r), where r is the speaker, as in:

\[ \mu_j^{(r)} = A^{(r)} \mu_j + \beta^{(r)} \tag{2.7} \]

where A^(r) is the transformation matrix and β^(r) is an additional vector with speaker specific information. Anastasakos et al. (1996) propose speaker adaptive training (SAT), used with MLLR, a technique which maximizes the likelihood of the training data given the MLLR adapted models. This technique separates the phonetic variations from the speaker specific variations using two components: a speaker independent component, and a speaker dependent component which acts like a filter. In normal speaker adaptation (non-SAT), the optimal speaker independent HMM parameter set λ̃ is estimated as

\[ \tilde{\lambda} = \operatorname*{arg\,max}_{\lambda} \prod_{r=1}^{R} L(O^{(r)}|\lambda) \tag{2.8} \]

In SAT, the set of speaker-specific transformations G̃ = {G̃^(1), ..., G̃^(R)}, with G^(r) = {A^(r), β^(r)}, is added, and the HMM parameter set λ̃ is jointly estimated with G̃ by maximizing the likelihood of the training data as in:

\[ (\tilde{\lambda}, \tilde{G}) = \operatorname*{arg\,max}_{(\lambda, G)} \prod_{r=1}^{R} L(O^{(r)}; G^{(r)}, \lambda) \tag{2.9} \]

where L is the likelihood, O^(r) is the observation sequence of speaker r, and R is the total number of speakers. λ̃ models only the phonetic variations, not the speaker specific ones. The means are re-estimated as follows:

\[ \tilde{\mu}_k = \Bigg\{ \sum_{r,t}^{R, T_r} \gamma_k^{(r)}(t)\, \tilde{A}^{(r)T} \Sigma_k^{-1} \tilde{A}^{(r)} \Bigg\}^{-1} \Bigg\{ \sum_{r,t}^{R, T_r} \gamma_k^{(r)}(t)\, \tilde{A}^{(r)T} \Sigma_k^{-1} \big( o^{(r)}(t) - \tilde{\beta}^{(r)} \big) \Bigg\} \tag{2.10} \]

where μ̃_k is the re-estimated mean of the k-th Gaussian density, Σ_k is the covariance matrix, o^(r)(t) is the t-th observation generated by the r-th speaker, and γ_k^(r)(t) is the posterior probability that o^(r)(t) belongs to the k-th Gaussian density.

Here as well, the mean vectors are adapted, assuming that the variability is already trained into the average voice model (AVM). Additionally, however, covariances can be re-estimated with the techniques variance MLLR and constrained MLLR (in the latter the same matrix is used to adapt means and variances in order to reduce the computational cost). In experiments carried out in this work, it was found that the re-estimation of covariance matrices has a limited effect on the resulting synthetic speech: sometimes it affected the voice significantly, but more often it had little effect.

As mentioned above, speaker adaptation is performed on a pre-trained average voice model (AVM). The average voice model can be trained treating all speakers as one; however, a better approach is proposed by Yamagishi (2012). Here, first the context and speaker dependent models are trained separately. Then, a shared decision tree is constructed from the speaker dependent models. The Gaussian probability distribution function for the average model is calculated by combining all Gaussians of the nodes of the tree. Finally, the AVM parameters are re-estimated with the SAT technique using the training data of all speakers, and state durations are obtained. The clustering procedure is also applied to the state duration distributions.

An example of a practical application of SAT is ZureTTS, which was extended in the framework of eNTERFACE'14 at the beginning of this thesis and gave an impulse to use speaker adaptation techniques for the present proposal. ZureTTS is a speech synthesizer which can be used to create personalized voices, aiming at persons with speech impairments. Available for several languages, users are asked to read a set of sentences, which are recorded and used for adaptive training. Further details can be found in Erro et al. (2014, 2015). Experiments on expressive speech synthesis with adaptation techniques are described in Chapters 3 and 4, where expressive voices are built via adaptation from unsupervised clusters of expressive speech.
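Returning to equation (2.7), the sketch below illustrates the MLLR idea with a single global transform [A, β] estimated by least squares from paired average-voice and target-speaker mean vectors and then applied to unseen means; a real implementation estimates the transform from occupancy-weighted statistics, usually per regression class.

    # Toy global MLLR-style transform, eq. (2.7): estimate A and beta by least
    # squares from paired mean vectors. Real MLLR uses occupancy-weighted
    # statistics and regression classes; all data here are synthetic placeholders.
    import numpy as np

    D = 4
    rng = np.random.default_rng(0)
    mu_avm = rng.standard_normal((100, D))                  # average-voice means
    A_true, b_true = 1.1 * np.eye(D), 0.3 * np.ones(D)
    mu_spk = mu_avm @ A_true.T + b_true + 0.01 * rng.standard_normal((100, D))

    X = np.hstack([mu_avm, np.ones((100, 1))])              # solve mu_spk ~ [mu_avm, 1] W
    W, *_ = np.linalg.lstsq(X, mu_spk, rcond=None)
    A_hat, b_hat = W[:D].T, W[D]

    mu_adapted = A_hat @ mu_avm[0] + b_hat                  # adapt an unseen mean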

2.4 Deep Learning

Deep learning refers to a specific kind of machine learning which uses artificial neural networks (ANNs) to learn complex data patterns. Artificial neural networks, or simply neural networks (NNs), were developed in the late 1980s and, as the name suggests, are inspired by biological neural networks. ANNs have been applied in many fields; however, only recently has computing power advanced enough to train large and deep networks in reasonable time, specifically using GPUs. This development naturally led to the application of NNs in speech synthesis, with great success within about one or two years, almost instantly rendering other synthesis techniques obsolete. Nevertheless, due to the recency of the “NN boom”, there have not been many applications to expressive speech synthesis yet, a fact that will surely change soon. When this thesis began, NNs were practically never applied in speech synthesis, at least not for the synthesis itself. The experiments presented in Chapter 5 are thus among the pioneering experiments applying DNN-based synthesis with semantic sentiment vectors to expressive speech.

2.4.1 Introduction to deep learning

Given the novelty of neural networks in expressive speech synthesis, a rather detailed introduction to neural networks themselves will be given, following the exposition in Goodfellow et al. (2016), where more information can be found.

The basic unit of a neural network is the neuron, also called perceptron, and it is designed as illustrated in figure 2.9.

Figure 2.9: Artificial neuron.

Each value x_i of an input vector is multiplied by a weight w_i, and the results are summed together with a bias term b, as in:

\[ a = \sum_i w_i x_i + b \tag{2.11} \]

then, an activation function is applied, as in:

y = f(a) (2.12)

The activation function is usually a differentiable threshold function, such as the sigmoid or hyperbolic tangent functions. The geometrical interpretation of the neuron is that of a hyperplane, where the weights control its rotation and the bias controls its position with regard to the origin. The activation function performs a sort of binary classification. A single hyperplane is not enough to work with complex patterns, so the model is first extended by introducing sets of parallel neurons. If the output y is interpreted as a probability P(y = 1|x), then the output activation for the parallel neuron set can be defined as:

\[ P(y=k|x) = \frac{e^{x^T w_k}}{\sum_{n=1}^{N} e^{x^T w_n}} \tag{2.13} \]

where k is the class and N is the total number of neurons. This type of architecture makes it possible to perform an N-to-M mapping. It is visualized in figure 2.10. Conventionally, the input values are considered to be the first layer, i.e. the input layer. The output layer is composed of M neurons, each of them “firing” an output y_i.

Figure 2.10: Neural network with 1 layer of parallel neurons.

Figure 2.11: Neural network with 1 hidden layer.
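The single-layer model of figure 2.10 and equations (2.11)–(2.13) amounts to a linear map followed by a softmax; the sketch below implements that forward pass directly, with random placeholder weights and input.

    # Forward pass of one layer of parallel neurons with a softmax output,
    # following eq. (2.11)-(2.13). Weights and input are random placeholders.
    import numpy as np

    def layer_forward(x, W, b):
        a = W @ x + b                    # eq. (2.11): weighted sums plus bias
        e = np.exp(a - a.max())          # subtract the max for numerical stability
        return e / e.sum()               # eq. (2.13): softmax over the output neurons

    x = np.random.randn(5)               # input vector (5 inputs)
    W = np.random.randn(3, 5)            # one weight row per output neuron (3 outputs)
    b = np.zeros(3)
    print(layer_forward(x, W, b))        # class probabilities summing to 1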

To address more complex problems, one or more additional layers of neurons are introduced, called hidden layers, as illustrated in figure 2.11. Here, the input layer is connected to the hidden layer, and the hidden layer's output is the input to the output layer, which, again, fires M outputs y. More hidden layers can be added to create deeper networks, where the output of each layer becomes the input of the next one. This forward directed architecture is called a feed-forward architecture. The weight vector w can be rewritten as a weight matrix W_ij, whose weights link the i-th and the j-th layer. Networks with many layers are called deep neural networks (DNNs).

The standard training procedure for neural networks is called back-propagation and is generally defined in two steps. First, the error of the network output is calculated and back-propagated from the last to the first layer. Second, the weights of each neuron are updated. In order to define the back-propagation algorithm, the activation a is defined as:

\[ a_j = \sum_i w_{ji} z_i \tag{2.14} \]

where z_i is the output of the neurons in the previous layer, which is the input to the current one, and w_{ji} is the weight linking unit i of the previous layer to unit j of the current one.

\[ z_j = f(a_j) \tag{2.15} \]

The total error over all input/output training pairs t is defined as:

\[ E = \sum_t E^t \tag{2.16} \]

where E^t is a differentiable function of the network outputs:

\[ E^t = E^t(y_1, \dots, y_N) \tag{2.17} \]

Then, two error terms are defined. The error term δ_n for the network output value y_n and the t-th training example is defined as:

\[ \delta_n = \frac{\partial E^t}{\partial a_n} = g'(a_n)\, \frac{\partial E^t}{\partial y_n} \tag{2.18} \]

where g denotes the activation function f of the output layer and g' is its first derivative. The error term δ_j for the hidden units of layer j is defined as:

\[ \delta_j = f'(a_j) \sum_n w_{nj} \delta_n \tag{2.19} \]

Each error term δ_j is obtained from the error terms δ_n of the following layer, the error term for the output layer being computed first. This way the error back-propagates towards the first layers. The next part is the weight update. A common method is so-called fixed-step learning, where a small amount is subtracted from the weights depending on the error term, as in:

\[ w_{ji}^{(k+1)} = w_{ji}^{(k)} - \Delta w_{ji}^{(k)} = w_{ji}^{(k)} - \eta_0 \sum_t \delta_j^t z_i^t \tag{2.20} \]

where (k) is the current iteration and η_0 is the learning rate. The learning rate is an important parameter which has to be set heuristically beforehand. If the learning rate is too high, learning will diverge and the error will increase. If it is too low, the network can get “stuck” in local minima. Other variants of gradient descent are batch gradient descent, where the gradient of the cost function is computed over the whole training set; stochastic gradient descent (SGD), where a parameter update is performed for each training example; and mini-batch gradient descent, where an update is performed for every n training examples.
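Putting equations (2.14)–(2.20) together, the following sketch trains a one-hidden-layer network on a toy regression task with plain stochastic gradient descent; the network size, learning rate and data are illustrative.

    # One-hidden-layer network trained with back-propagation and SGD,
    # following eq. (2.14)-(2.20). Toy regression data; sizes are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))
    y = X.sum(axis=1, keepdims=True) ** 2             # toy target function

    W1, b1 = 0.1 * rng.standard_normal((8, 3)), np.zeros((8, 1))
    W2, b2 = 0.1 * rng.standard_normal((1, 8)), np.zeros((1, 1))
    eta = 0.01

    for epoch in range(200):
        for x_t, y_t in zip(X, y):
            x_t, y_t = x_t[:, None], y_t[:, None]
            a1 = W1 @ x_t + b1                        # eq. (2.14)
            z1 = np.tanh(a1)                          # eq. (2.15)
            out = W2 @ z1 + b2                        # linear output layer
            d_out = out - y_t                         # eq. (2.18) for a squared error
            d_hid = (1 - z1 ** 2) * (W2.T @ d_out)    # eq. (2.19), tanh derivative
            W2 -= eta * d_out @ z1.T; b2 -= eta * d_out    # eq. (2.20)
            W1 -= eta * d_hid @ x_t.T; b1 -= eta * d_hid

    pred = W2 @ np.tanh(W1 @ X.T + b1) + b2
    print(float(np.mean((pred.T - y) ** 2)))          # final training error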

The gradient descent techniques sometimes suffer from problems with choosing the correct learning rate, with applying the same learning rate to parameters of different frequency, with different steepness in different dimensions, etc. Therefore, gradient optimization techniques exist which try to solve these problems. Some of the most important ones are listed below. Momentum, by Qian (1999), tries to prevent SGD from oscillating due to different steepness in different dimensions and brings it faster to the local optimum; with the Nesterov accelerated gradient, by Nesterov (1983), the gradient is calculated not with respect to the current parameters, but with respect to the approximated future position of the parameters; the adaptive gradient algorithm (Adagrad), by Duchi and Hazan (2011), adapts the learning rate to the parameters by storing past squared gradients, such that less frequent parameters receive larger updates and more frequent parameters smaller ones; Adadelta, by Zeiler (2012), restricts the squared-gradient storage to a window, as does the similar method RMSProp, by Tielemann and Hinton (2012); adaptive moment estimation (Adam), by Kingma and Ba (2014), computes per-parameter learning rates from estimates of the first (mean) and the second (uncentered variance) moments of the gradients.

Another problem in learning is overfitting. Overfitting means that the network adapts to the noise in the training data and fails to generalize. In order to avoid it, a common technique is dropout. Dropout means that the connections of individual units are temporarily deactivated in order to prevent them from adapting too much to the data in each iteration.

In order to process temporal sequences it is useful to introduce a “memory” feature into the networks, as in the recurrent neural network (RNN). The main characteristic of RNNs is that each neuron not only receives the input at time t, but also a memory state h_{t−1} from time t−1. These networks model short time dependencies well; however, for longer time windows they face problems with vanishing or exploding gradients, since these are multiplied at each time step. An improvement over RNNs are, for instance, long short-term memory (LSTM) cells. These units introduce so-called gates which decide which information from the input or the past state enters, leaves or is deleted (input, output and forget gates). Networks using LSTM cells are capable of modeling long-term time dependencies very well; however, they have a large number of parameters compared to plain RNNs, so the training is much more complex. Other alternative approaches to the LSTM are, for instance, the phased LSTM, gated recurrent units (GRUs), and other versions with modified gates or algorithms.
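As an example of these adaptive update rules, the sketch below implements an Adam-style update for a single parameter vector and uses it to minimize a simple quadratic; the hyperparameters follow common defaults and the objective is a placeholder.

    # Adam-style parameter update driven by a stream of gradients.
    # Objective (||theta||^2) and hyperparameters are illustrative.
    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad              # first-moment (mean) estimate
        v = b2 * v + (1 - b2) * grad ** 2         # second-moment (variance) estimate
        m_hat = m / (1 - b1 ** t)                 # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    theta = np.array([2.0, -3.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, 2001):                      # minimize ||theta||^2, gradient 2*theta
        theta, m, v = adam_step(theta, 2 * theta, m, v, t)
    print(theta)                                  # ends up close to the minimum at 0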

2.4.2 Neural Network Based Speech Synthesis

Zen et al. (2013) proposed a parametric synthesis system based on DNNs. The system's architecture is derived in a relatively straightforward manner from the architecture of HMM-based synthesis, practically replacing the HMMs with DNNs, as shown in figure 2.12. The advantage of DNN-based synthesis lies in the fact that the network weights are trained using all the data, i.e. there is no clustering and no information loss. Also, DNNs are more capable of learning complex relationships in the input data than HMMs and decision trees. Trajectory smoothing is also learned by the DNNs. The results achieved by Zen et al. showed that four-layer DNN-based systems significantly outperformed HMM-based systems using the same waveform generation method.

Figure 2.12: DNN based system architecture.

Among others, DNN-based approaches are used by Qian et al. (2014); Hu et al. (2015); Valentini-Botinhao et al. (2015); Zen and Senior (2014), or Lu et al. (2013), the latter adding semantic vector representations as a linguistic input feature. Other types of NN-based architectures have been proposed; for instance, Wu et al. (2015b) apply a bottleneck DNN design, i.e. a hidden layer with considerably fewer neurons than the preceding and following layers, yielding a compact representation and achieving a better modeling of frame interdependencies. RNNs and LSTMs have been used in speech synthesis by, for instance, Chen et al. (1998); Fernandez et al. (2014); Achanta et al. (2015); Zen and Sak (2015); Wu and King (2016), among others. Using LSTMs, the temporal dependencies are captured intrinsically and there is no need to apply smoothing algorithms. Speaker adaptation has also been performed with neural networks; for instance, Wu et al. (2015a) compare different approaches, i.e. adding identity information to the input features, output feature space transformations, and Learning Hidden Unit Contributions. Pascual and Bonafonte (2016a) perform interpolation between speakers in an RNN-LSTM framework. A similar approach, called cluster adaptive training (CAT), is presented by Tan et al. (2016).

A new direction in synthesis with neural networks is the generation of raw waveforms instead of speech parameters. The reason is, first, that speech parameters, no matter how good they are, will not perfectly represent the waveform with all its nuances. Second, speech parameters need a filter model which converts them back to the waveform. Basically, the parameterization implies an unavoidable information loss which contributes to lower quality. WaveNet, by van den Oord et al. (2016), is probably the most famous system so far which generates the raw waveform sample by sample, though it is computationally very expensive, taking minutes to hours to synthesize a few seconds of speech. Arik et al. (2017) presented the Deep Voice system, achieving real-time synthesis. Finally, Wang et al. (2017) have presented the system Tacotron, which is an almost end-to-end synthesis system, where audio is learned directly from <text, audio> pairs, achieving very natural voice quality. The system learns from text characters; however, the output consists of spectral frames, which are then converted into a waveform.

Generally speaking, NN-based speech synthesis develops and changes incredibly fast. Little more than half a year ago WaveNet appeared, considered to outperform (almost) everything else in quality, and only a few months later two systems were developed which outperform WaveNet, at least at the engineering level. This review will probably lose its currency within a few months, and expressive speech synthesis will change completely as well.
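To make the DNN-based pipeline of figure 2.12 concrete, the sketch below defines a small feed-forward network mapping frame-level linguistic features to acoustic parameters (e.g. MFCCs plus log-F0), assuming PyTorch; the feature dimensionalities and the random training batch are placeholders, and a real system adds duration modeling and a vocoder for waveform generation.

    # Minimal DNN acoustic model: frame-level linguistic features -> acoustic
    # features. Assumes PyTorch; dimensionalities and data are placeholders.
    import torch
    import torch.nn as nn

    lin_dim, ac_dim = 300, 41        # e.g. binary/positional answers -> MFCC + log-F0

    model = nn.Sequential(
        nn.Linear(lin_dim, 512), nn.Tanh(),
        nn.Linear(512, 512), nn.Tanh(),
        nn.Linear(512, ac_dim),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    x = torch.randn(64, lin_dim)     # placeholder batch of aligned frames
    y = torch.randn(64, ac_dim)

    for step in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()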

2.5 Expressive Speech Synthesis

After the introduction to common speech synthesis methods and architectures, and before continuing with expressive speech synthesis, a very general question will first be addressed: what is expressive speech? The Oxford dictionary defines the word “expressive” as something that is “effectively conveying thought or feeling”. According to this definition, “feeling” is part of expressive speech. In fact, on numerous occasions in the literature and in the speech technology community, expressive speech is treated as equal to emotional speech. However, the conveying of “thought” is also part of the definition, which means that other types of information are part of expressiveness as well. For instance, one could argue that pragmatic stress, focus, or speaking styles, like political speech or news reading, are all expressive, but not necessarily emotional.

Regarding emotion, Ekman (1972) defines six emotional categories which have been used in a standardized way in the speech sciences: disgust, anger, fear, joy, sadness and surprise. This taxonomy is not unproblematic. First, it is not clear whether it is complete. Second, it does not reflect gradation of emotions or mixed emotions. For instance, one can be very angry, a little angry, and so on. Also, although the label angry can be put on an emotional realization of anger, different people will realize it in very different manners. Applied to speech, this has a direct effect on the acoustics. On the other hand, surprise, for instance, can be mixed with other emotions, yielding very different expressions, like {positive, negative, angry, happy, fearful} surprise. Schlossberg (1954) proposes a different approach to classifying emotions, describing them in three dimensions: activation, evaluation, and power. This system has been further developed by Kehrein (2002), who proposes the three dimensions activation, valence, and dominance. Activation is the excitation of an emotion, valence is the positiveness, and dominance is the strength. These three primitives are related to prosodic features of speech, such as speech rate, F0 range and curves, etc. In some works, such as by Schuller (2000); Lopez-Otero et al. (2014), the three-dimensional model has been used in emotion recognition. Actually, it is not even really clear what emotions are. Many perspectives define them differently: philosophical, psychological, evolutionary, cultural, etc. A more complete review on what emotion is can be found for instance in Cowie et al. (2001).

Generally, as opposed to “expressive speech”, the term “neutral” is in common use. However, the definition of “neutral” is not very clear, and sometimes might not even make sense. Often the question is asked whether the synthesized speech is expressive or not (neutral). Here, “neutral” or “not expressive” speech is often everything which is not “happy”, “sad”, “angry”, and so on, i.e. basically something which does not belong to a limited number of categories and is somehow in the middle of everything. It could be seen as a kind of origin in an expressive space, while the expressions would represent points more or less distant from that origin. In this continuous case, it is not really clear up to which point neutral speech is neutral and when it starts being something else. Contrary to these “classifying” tendencies, what is proposed in this work is that, when dealing with expressive speech, basically everything is expressive, even “neutral” speech.
What matters instead is whether the expressiveness of the speech is adequate for the current situation. Therefore, the definition of expressive speech which this work follows is rather that of adequate speech, in the sense of speaking style, stress, focus, emotion, and so on. Related to this definition is the question of the acoustic realization of expressiveness. Generally, speech is represented in terms of meaningful parameters which, in speech synthesis, are analyzed and transformed back to speech. If this is true, then expressiveness, being part of speech, should also be describable in terms of parameters or features. This problem is not trivial and will be addressed in more detail in Chapter 3.

Transporting the above definition to the speech synthesis problem, there are two main problems which need to be dealt with. The first problem which needs to be solved is to provide the machine with the capability to generate expressive speech. The information about the expressive style can be fed into the synthesis system as an additional input feature indicating which expression to synthesize. How this information is processed depends very much on the system architecture, but also on the intended approach. Basically, there are three major ways of synthesizing emotional speech. Govind and Mahadeva Prasanna (2013) call them “explicit control”, “playback control”, and “implicit control”.

Explicit control means modifying neutral synthesized speech by applying mainly prosodic rules, modifying synthesis parameters or post-processing with algorithms like PSOLA. This type of approach reflects the tradition of assuming that prosodic features are, if not exhaustive, at least sufficient to synthesize emotional speech, as stated for instance by Vroomen et al. (1993) or Schröder (2009), the latter also addressing voice quality. The explicit control paradigm has been applied to many types of TTS. For instance, to formant synthesizers by Cahn (1989); Murray and Arnott (1993, 1995); Burckhardt and Sendelmeier (2000), since it is fairly easy to modify the parameters of formant synthesizers, as argued for instance by Schröder (2001). In diphone synthesis, it was applied for instance by

In diphone synthesis, explicit control has been applied, for instance, by Vroomen et al. (1993) and Montero et al. (1999), the first applying PSOLA and the second so-called copy-synthesis, where individual diphones were copied from different emotional databases according to prosodic requirements. In the latter work, Montero et al. (1999) also found that some emotions, such as cold anger, cannot be modeled by prosodic means exclusively. Cabral and Oliveira (2006) propose a system which performs prosodic transformations and changes to glottal source parameters, and which can be applied to synthetic and natural speech. Playback control refers to speech synthesized directly from a corpus, by means of unit selection or trained statistical models. On the unit-selection side, notable systems are those developed by Iida et al. (2000) and Campbell (2006), and the IBM expressive speech synthesis system by Hamza et al. (2004), the latter designed for reading positive and negative news and technically combining a purely corpus-driven approach with explicit modeling of prosody. On the HMM side, Yamagishi et al. (2003) modeled several emotions and speaking styles individually. Yamagishi et al. (2005) compared this approach to another one, where only one model for all speaking styles was trained and the speaking styles were treated as context, with similar results. In general, the problem of playback control is the need for a large amount of data, especially for unit selection. Implicit control basically means interpolation between two or more models. Here, statistical systems are especially relevant, since techniques like SAT or voice transformation allow different expressive styles to be trained with relatively small amounts of data. Some examples of HMM-based expressive synthesis were developed by Tachibana et al. (2005, 2006) and Nose et al. (2005, 2009). Also, speaking style transplantation, i.e. the transfer of one speaking style to a different person's voice, has been investigated, for instance, by Lorenzo-Trueba et al. (2013, 2014) and Lorenzo Trueba (2016). In singing voice synthesis, which could be understood as a special case of expressive speech synthesis, HMMs have also been used, for example by Saino et al. (2006). On the neural network side, less has been done in relation to expressive speech due to the novelty of these models in speech synthesis. An early approach is proposed by Sato and Morishima (1996), where an emotional space is modeled and used to modify timing, pitch and intensity of synthesized speech. Generally, NN-based speech synthesis can be used for expressive synthesis in a similar way as the HMM-based one. In Chapter 5 some experiments related to NN-based synthesis will be presented and discussed. At http://emosamples.syntheticspeech.de/ many synthetic samples from different expressive speech synthesis systems are presented and can be compared. Much effort on creating synthetic expressive speech has also been invested in the Blizzard Challenge. The Blizzard Challenge was first organized in 2005 as a TTS challenge which initially focused on naturalness and intelligibility. As TTS systems improved, additional aspects came into focus, such as specific domain tasks and expressive speech. In 2007 a paper was presented with a TTS system for an online role-playing game, requiring specific concatenative expressive speech synthesis, as in Rozak (2007). In 2008, for the first time, the English corpus used in the challenge contained expressive speech, although it was not the focus of the evaluation.
In 2011, a methodology for audiobook reading evaluation was proposed by Hinterleitner et al. (2011) and several German voices were evaluated. Their proposed methodology was further used in subsequent challenges. The first time an expressive speech related task was explicitly included in the evaluation was in 2012, providing as training material four Librivox audiobooks recorded by a US speaker. However, only one participant took part in the evaluation of the audiobook reading. In 2013 and in 2016, audiobook databases were used for the challenges and the tasks aimed at book reading. The system with the most wins in the Blizzard Challenge is the USTC system, described by Chen et al. (2016): it won from 2006 to 2012, was best in several categories in 2015, and won again in 2016. The first version of their system was a parametric HMM-based system, which was then converted into a hybrid system that is HMM-based but uses concatenation for waveform generation. HMMs were used to drive the system until 2016, when they were replaced by LSTM networks, now used for everything except the waveform generation itself, which is still done via unit selection. In 2013, another hybrid HMM/unit-selection system, SHRC-Ginkgo, also won the challenge, competing with USTC. The 2014 challenge focused on Indian languages and the winning system was a unit-selection system called ILSP/INNOETICS TTS, developed by Chalamandaris et al. (2014). The 2015 challenge was also aimed at Indian languages and, besides the USTC system, another notable system was a unit-selection system developed by Rallabandi et al. (2015). The second problem with expressive synthesis, and maybe the one of higher importance for this work, is to know when to use which type of expressive speech. Humans do that naturally, conveying thoughts, feelings, and so on in a pragmatically appropriate way. The most straightforward way to deal with this problem is therefore to let humans choose how to synthesize something, just as they choose how to say something. This works, but when the number of expressive styles rises, in applications like book reading, the choice becomes very complex and possibly unnatural, since the number of all possible expressive styles is practically infinite. Also, expressiveness is speaker dependent, so each expression might sound very different when produced by different speakers. Another option is to let the machine deduce expressiveness automatically from text. In order to do that, an advanced text analysis is needed, not only with the methods described in section 2.1.1, but one which captures semantic, pragmatic, and contextual relations in a text. Despite the variety of methods and systems presented above, most of them share one drawback: they require manual model adjustment, labeling, or recording of training data, i.e. manual work. Nowadays, huge amounts of data are available, especially on the Internet, and it is impractical to deal with them using manual approaches. Also, in order to account for the practically infinite number of expressive styles, automated techniques must be applied. The aforementioned problem of predicting expressiveness from text is addressed, for instance, in Chen et al. (2014), Jauk et al. (2016), Jauk and Bonafonte (2016a) and Lorenzo Trueba (2016). The methods proposed in these works can also be used for data clustering in order to obtain training data, as done for example by Watts (2012).
Also, unsupervised methods on the acoustic level can be applied to generate data clusters which can be used for training, as done for instance by Eyben et al. (2012), Jauk et al. (2015) and Jauk and Bonafonte (2016b). Chapter 3 will treat this topic, among others. NN-based approaches offer automated possibilities to deal with data and/or derive expressiveness from text, avoiding clustering. However, some sort of control mechanism is needed which tells the system which expressive style is due at each moment. Some first ideas and experiments in this direction are presented in Chapters 4 and 5.

2.6 Discussion

This chapter has provided a compact introduction to speech synthesis systems and a state-of-the-art review of expressive speech synthesis. General aspects and basic procedures of TTS systems, such as text analysis, prosody prediction and waveform generation, and system architectures based on unit selection, HMMs and NNs, were presented and discussed. Also, a short introduction to deep learning was provided. In the state-of-the-art review of expressive speech synthesis the focus was put on the different ways in which expressive speech synthesis can actually be implemented, and examples of real systems were presented for each of the approaches. An important factor in TTS development over the past two or three years has been neural networks, which have experienced a large boom especially in the last few months, from the 9th Speech Synthesis Workshop on, where WaveNet was presented. The usage of neural networks has created an impulse towards fully NN-based end-to-end systems, where synthesis is performed by directly generating waveforms from text, something that only a few years ago was still considered practically impossible. This new paradigm will likely change the TTS panorama completely within a short time, rendering other approaches obsolete. An interesting reflection of the state of TTS systems is the Blizzard Challenge, where until now unit selection has always been involved in the winning systems; the coming Blizzard Challenges will probably reflect the paradigm change. It is worth stressing again that synthesis techniques which are considered obsolete for general TTS systems are often still in use in expressive TTS. The reasons for this are the simplicity and controllability of the simpler techniques, and therefore their usability in fundamental rather than technological research, as well as the lack of usable expressive data. Especially for state-of-the-art neural network based TTS, large amounts of data are crucial for training, and obtaining large amounts of controlled expressive data, given the high variability, is still very difficult. In expressive speech synthesis (but not only there), data handling is one of the main problems, especially regarding modern ways of communication such as the Internet. Large amounts of usable data are available; however, they are often not controlled by the developer, and there is a lack of automatic unsupervised methods for classifying these data such that they can be used to efficiently train speech synthesis models. Also, the information about expressiveness which can be gained from plain text is another important area of research which needs to be exploited for future applications. This thesis addresses mainly these two problems, from the perspective of unsupervised and automated learning and data treatment. This focus away from the actual synthesis method, which is moving towards completely NN-based systems, remains up to date and can equally be applied to NN-based systems. Insights from this work can certainly help to develop state-of-the-art NN-based expressive speech synthesis.

The unsupervised data clustering in the acoustic domain and acoustic features for the representation of expressive speech will be addressed in Chapter 3. The prediction of expressiveness from semantics is the main topic of Chapter 4. Neural network based expressive speech synthesis is handled in Chapter 5.

Chapter 3

Feature Selection for Expressive Speech Synthesis

Expressive or emotional speech synthesis has generally used manually annotated data for model training. Labeling or recording expressive speech is, however, very laborious and expensive work, impracticable for large amounts of data. Moreover, emotional labels are very difficult to define, despite the traditional references to emotions like anger, joy, sadness, etc. This is because the acoustic realization of emotions and other expression types is highly dependent on the speaker, the situation, the gradation of expressiveness and combinations of expressions. Further details on this are discussed in section 3.2. Nowadays, the Internet is a huge source of data, including expressive and emotional speech data. However, in order to be used efficiently, these data need to be organized. Since manual labeling cannot be applied efficiently, automatic, unsupervised methods are needed. Clustering of data is one such method. Clustering relies on features which meaningfully represent the data, here in terms of expressiveness or emotions. Features on different levels can be taken into account in expressive speech, such as acoustic, prosodic, and possibly related linguistic features. All of them are meant to represent the speech signal such that useful information can be derived and processed, preferably automatically. However, all of them are multi-functional: different phonetic, linguistic and extra-linguistic functions, like phone identity, accent, modus, focus, language identification, speaker identification, expressiveness, etc., are carried by multiple features, and at the same time each feature is always related to multiple functions. Which features are good for representing expressive speech is not very clear, and different options have been proposed so far, as discussed in section 3.1. To understand how these features can be related to expressiveness, it is necessary to understand what exactly they represent and how they are derived. Sections 3.1.1 and 3.1.2 discuss the most commonly used features related to expressive speech. Section 3.1.3 explains i-vectors as the proposed feature for multi-speaker expressive corpora, and section 3.1.4 introduces the reference feature set openSMILE, by Eyben et al. (2013), widely used in current emotional analysis. The objective is to find an optimal feature set for expressive speech data suitable for automatic, unsupervised processing. For this purpose, section 3.2 presents several experiments which evaluate different features and feature sets and their corresponding results. Finally, the overall results are discussed in section 3.3.

3.1 Acoustic Features in Expressive Speech: An Overview

Acoustic speech signals are described in terms of a set of acoustic features, where each feature fulfills several functions, i.e. it is part of a subset of features which describe a certain phonetic phenomenon, such that K ⊆ N, where N is the whole set of acoustic features and K is the subset which describes a certain phenomenon. The problem with this definition is that, generally, acoustic features are not totally independent, i.e. they are not orthogonal. So, in most phenomena all features are involved, but with different degrees of importance. In the case of phonetic segments, it is relatively easy to find combinations of acoustic features which allow a reasonable identification of the segments. Expressiveness is clearly a suprasegmental phenomenon; however, it surely also has segmental manifestations. For example, the configuration of the vocal tract for screaming as an expression of anger will affect spectral features at the segmental level through a wider opening of the mouth. A different problem is expressiveness itself, or rather its acoustic realization. Continuing with the example of anger, the expression of anger can vary significantly between the "type" of anger and the speaker. For instance, it can be expressed as aggressive screaming, controlled cold anger, or hysterical, surprised, scared anger, and so on. All these expressions are anger, but they are acoustically very distinct. The differences can range from small variations to completely different expressions. Anger is only one example here; the same problem applies to practically all expressions. Traditionally, prosodic features have been used for the representation of expressive speech. For instance, Kehrein (2002) used fundamental frequency, intensity and duration in his experiments, and Szekely et al. (2011) used glottal source parameters to perform clustering of expressive speech styles in audiobooks. Pérez and Bonafonte (2005) and Schuller et al. (2005) used a set of mainly prosody-based and some spectral-based features for emotion recognition. Eyben et al. (2012) used prosodic features, i.e. F0, voicing probability, local jitter and shimmer, and logarithmic HNR, for audiobook clustering and subsequent synthetic voice training. In fact, Eyben et al. (2012) state that spectral features are considered to be poorly related to expressiveness, which is a common opinion on this topic (compare Vroomen et al. (1993); Schröder (2009)). However, some approaches have shown that spectral features are also important for the discrimination of expressiveness. For instance, Barra-Chicote et al. (2010) suggest that different expressions are better characterized by different features: anger is rather characterized by spectral parameters, while happiness and disgust are better represented by both prosodic and spectral features. Also, Montero et al. (1999) find that some expressions (emotions), e.g. cold anger, are not well represented by prosodic features. When dealing with real conversational speech, many factors, such as multiple speakers, multiple distinct realizations of the same expression, background noise, etc., must probably be taken into account. Recently, in speaker verification, i-vectors have proved to be an efficient feature (e.g. Reynolds et al. (2000); Dehak et al. (2011)). In recent experiments by Lopez-Otero et al. (2014), Jauk et al. (2015) and Jauk and Bonafonte (2016b), i-vectors have also been used in emotion prediction and expressive speech clustering.
In fact, feature sets which contained i-vectors achieved the best results on multi-speaker databases. On the other hand, traditional features performed better on single-speaker laboratory speech databases. Eyben et al. (2013) propose a feature extractor, openSMILE, which is specialized in extracting features for expressive speech, including thousands of possible features. Feature combinations based on this toolkit have been widely used in many expressive/emotional speech related tasks, for instance in the Blizzard Challenge. The following sections introduce spectral and prosodic features for expressive speech, as well as i-vectors and the openSMILE toolkit.

3.1.1 Spectral Features

The speech signal is produced through modifications of an excitation signal which can be stochastic, in the form of noise, or periodic, produced through the vibration of the glottis (vocal folds). The excitation signal is then modified in the vocal tract according to the acoustic theory of articulation (an introduction can be found in Vary et al. (1998)). Different articulatory configurations of the vocal tract cause variations in different frequency bands and allow the discrimination of phonetically meaningful speech realizations. Spectral features are obtained by converting the speech signal from the time domain to the frequency domain, where the energy in the different frequency bands becomes observable. The frequencies with the highest energy portions are the resonant frequencies and are called formants. Formants are frequencies whose wavelength is such that, at the ends of the vocal tract, the pressure p(t) = 0 or rather the particle velocity v(t) = 0, forming a so-called standing wave, as described for instance by Vary et al. (1998) (figure 3.1). In reality, the values do not reach 0 and other variations occur due to the imperfections of the vocal tract shape, material, energy loss, etc. These imperfections and variations are actually what makes the human voice sound "human". Due to narrowings and/or openings at particular positions in the articulatory process, formant frequencies change, resulting in the acoustic characteristics of a certain phonetic realization. In particular, vocal tract changes at other points with p = 0 or v = 0 yield acoustic changes and cause co-articulation effects. In phonetic segmental analysis, mainly the first two or sometimes three formants are used. The wavelength of higher formants is so short that vocal tract modifications do not have much effect on them, so they are considered to be generally invariable and speaker specific. Figure 3.2 shows the waveform and the spectrogram of a realization of the word Harry spoken in Spanish, [hari] in IPA, with formants drawn as red dotted lines. The formants are located at the frequency levels with the highest energy. In the non-periodic regions of the realization of [h] no clear formants can be observed; however, it does show a specific noise distribution. One of the drawbacks of formants is that they are not clearly defined for non-voiced sounds, such as fricatives. For this reason they are not robust enough for automatic extraction.
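For illustration, a common way to obtain rough formant estimates is from the roots of a linear prediction (LPC) polynomial. The minimal Python sketch below follows that standard recipe; it is not the procedure used in this thesis (where Praat is employed for formant extraction), and the input file and frame position are placeholders.

```python
import numpy as np
import librosa

def rough_formants(frame, sr, order=12, n_formants=3):
    # LPC fit of a short voiced frame; resonances appear as complex pole pairs
    a = librosa.lpc(frame * np.hamming(len(frame)), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]   # keep one root per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))  # pole angles -> frequencies in Hz
    return freqs[freqs > 90][:n_formants]                # crude F1-F3 estimates

y, sr = librosa.load("vowel_a.wav", sr=16000)   # hypothetical recording of /a/
frame = y[2000:2000 + 400]                      # a 25 ms voiced frame
print(rough_formants(frame, sr))
```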

Figure 3.1: Standing waves of wavelength λ in a vocal tract of length l (from Vary et al.(1998)).

Figure 3.2: Waveform and spectrogram for the realization of [hari]. Formants are indicated by red dotted lines.

Instead, Mel Frequency Cepstral Coefficients (MFCCs) have been very popular in all types of speech technology. MFCCs are a representation which is partly based on human perception, namely on the so-called Mel scale, as introduced by Stevens et al. (1937). The Mel scale is a scale of pitches which are perceived as equidistant. The MFCCs are calculated by, first, applying the Fourier transform to a windowed signal; second, applying a Mel filterbank with M filters to the spectrum; third, calculating the logarithm of the resulting Mel frequency powers; and finally, performing the Discrete Cosine Transform (DCT) on the Mel log powers. The resulting coefficients are the Mel Frequency Cepstral Coefficients. Generally, spectral features are considered to account more for segmental speech aspects than for suprasegmental ones. The specific vocal tract configurations yield the resonance frequencies which determine the phone identity. Expressiveness is considered to be suprasegmental; nevertheless, certain emotions or expressions can affect the vocal tract configuration in such a substantial way that resonance frequencies change due to the expressiveness, not to the segment. Such findings have been made, for instance, by Barra-Chicote et al. (2010) and Montero et al. (1999).
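As a concrete illustration of the four processing steps just described, the following minimal Python sketch computes MFCCs from a waveform. It assumes NumPy, SciPy and librosa are available; the window length, hop size and filter count are arbitrary example values, not the settings used in this thesis.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc(signal, sr, n_fft=1024, hop=256, n_mels=26, n_ceps=13):
    # 1) Fourier transform of the windowed signal (power spectrogram)
    spec = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop,
                               window="hamming")) ** 2
    # 2) Mel filterbank with M = n_mels triangular filters
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_power = mel_fb @ spec
    # 3) Logarithm of the Mel frequency powers
    log_mel = np.log(mel_power + 1e-10)
    # 4) Discrete Cosine Transform, keeping the first n_ceps coefficients
    return dct(log_mel, axis=0, norm="ortho")[:n_ceps]

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
coeffs = mfcc(y, sr)                             # shape: (n_ceps, number of frames)
```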

3.1.2 Prosodic Features

In linguistics, prosody covers three concepts, as argued for instance by Vary et al. (1998): quantity, intensity and intonation. Prosodic features are widely considered to be suprasegmental and mainly responsible for the acoustic sound impression. In written language, there are only few clues as to how to interpret a text prosodically, and these can basically be reduced to punctuation. The three aforementioned concepts can be explained as follows:

• Quantity is everything related to the time structure of speech, such as speech rhythm and tempo, segment durations, pauses, etc. The acoustic counterpart is the (segment) duration.

• Intensity is especially related to the loudness, but also to accentuations and stress, although not exclusively of course. The acoustic counterpart is the amplitude.

• Intonation is everything related to the melody. Intonation is probably the functionally most loaded feature, since it is related to almost all other prosodic (and partly also non-prosodic) aspects of speech. It is directly involved in stressing, phrasing, focus, modus, etc., and is also related to extralinguistic information such as emotions, speaker identity, etc. The acoustic counterpart of intonation is the fundamental frequency (F0).

The acoustic counterparts mentioned above are not always perfectly distinct: the linguistic concepts and their acoustic counterparts are cross-related with different degrees of influence. Also, the functionality of the linguistic concepts and their counterparts, such as accounting for stress, emotions, etc., is normally not exclusive, i.e. various concepts and their acoustic counterparts are involved in almost all functions, again with different degrees of influence. Often the linguistic functions are language dependent, so the functionality of these features is changed or limited. For instance, in German or Italian, duration is a distinctive feature, i.e. the duration of phonetic segments can determine the meaning of a word¹. For this reason the usage of duration for stressing is relatively limited, since it can lead to ambiguities in understanding, so mainly pitch is used for stress. On the other hand, Mandarin is a tonal language and pitch is a distinctive feature, so its function as a stress carrier is limited in Mandarin, as well as in other tonal languages, according to for instance Vary et al. (1998). Furthermore, two additional measurements are very popular in prosodic analysis, especially in clinical analysis and pathology detection: jitter and shimmer. Jitter is the average difference between the durations of consecutive periods of the fundamental frequency. The absolute jitter is calculated as:

\[
\mathrm{jitt}_{\mathrm{abs}} = \frac{1}{N-1}\sum_{i=1}^{N-1} \left| T_i - T_{i+1} \right| \qquad (3.1)
\]
where T_i is the period length of the F0 and N is the total number of periods used for averaging. Other versions of jitter include the relative jitter, where the absolute jitter is divided by the average period length, and the relative average perturbation, where the difference is calculated between the current period and its four closest neighbors, among others. Shimmer is the variability between the amplitudes of two consecutive periods of the fundamental frequency. The absolute shimmer (in decibels) is calculated as:

\[
\mathrm{shimm}_{\mathrm{abs}} = \frac{1}{N-1}\sum_{i=1}^{N-1} \left| 20 \log_{10}\!\left( \frac{A_{i+1}}{A_i} \right) \right| \qquad (3.2)
\]
where A_i is the maximum amplitude in a period and N is the total number of periods. Here too, different versions of shimmer exist, such as the relative shimmer (normalized by the average amplitude difference), or ddp, i.e. the relative difference between two consecutive differences, etc. Jitter and shimmer are commonly used in pathological speech analysis, proving very robust for the detection of certain pathologies. They have also been introduced in speaker recognition, for instance by Farrús et al. (2007), and in speaker diarization by Zewoudie et al. (2014). Reflecting aspects of voice quality, they are also considered to be useful for expressive speech, being part of the feature sets extracted by openSMILE, which will be discussed in Sections 3.1.4 and 3.2.4. Experiments with feature sets including jitter and shimmer will be presented in Sections 3.2.3 and 3.2.4. In general, as argued above, prosodic features are considered to be crucial for emotional speech. Along with quantity, intonation and intensity, rhythm is also reported to be an important feature in the description of certain speaking styles. Rhythm is a durational feature; however, it does not refer to the bare duration of segments or pauses, but rather to the temporal organization and structure of speech. Wagner (2008) reports that rhythm is crucial for speech understanding and information processing by humans, and that it is related to the human perception of time. For expressive speech, speaking styles like political, news, or clerical speech are clearly determined by specific rhythms.

¹In German, the duration of the vowels is distinctive in a phonological and prescriptive sense. In phonetic reality, the tension of the vowels is more distinctive; traditionally it is considered that tense vowels are also lengthened. A discussion of this topic can be found, for instance, in Wiese (1988).
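As a small illustration of Eqs. (3.1) and (3.2), the following Python sketch computes absolute jitter and shimmer from arrays of period durations and per-period peak amplitudes. How those arrays are obtained (e.g. from a pitch-mark extractor) is left open, and the sample values are made up for the example.

```python
import numpy as np

def absolute_jitter(periods):
    """Mean absolute difference between consecutive F0 period lengths (Eq. 3.1)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods)))

def absolute_shimmer_db(amplitudes):
    """Mean absolute dB ratio between consecutive period peak amplitudes (Eq. 3.2)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    ratios = amplitudes[1:] / amplitudes[:-1]
    return np.mean(np.abs(20.0 * np.log10(ratios)))

# toy example: period lengths in seconds and peak amplitudes per period
T = [0.0100, 0.0102, 0.0099, 0.0101]
A = [0.80, 0.78, 0.82, 0.79]
print(absolute_jitter(T), absolute_shimmer_db(A))
```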

3.1.3 I-vectors

Identity vectors (i-vectors) were introduced for speaker recognition and constitute speaker-specific feature vectors. The idea is derived from joint factor analysis (JFA), presented for instance by Kenny (2005), where a speaker supervector is decomposed into a speaker-independent, a speaker-dependent, a channel-dependent, and a residual component. Speaker supervectors are concatenated Gaussian Mixture Model (GMM) mean values trained on acoustic feature vectors, typically on MFCCs. The Universal Background Model (UBM) is a speaker-independent global GMM. If s is the speaker supervector, m is the speaker-independent supervector (from the UBM), V is the eigenvoice matrix, y are the speaker factors, U is the eigenchannel matrix, x are the channel factors, D is the residual matrix, and z are the speaker-specific residual factors, then:

\[
s = m + Vy + Ux + Dz \qquad (3.3)
\]

The individual components focus on the respective principal dimensions; the training procedure is extensive and will not be discussed here. More details can be found in Kenny (2005). I-vectors derive from this factor analysis; however, in comparison to the JFA approach, the speaker and the channel variability are combined, since it has been reported by Dehak (2009) and Dehak et al. (2011) that the channel component of the JFA still contains speaker-specific information. Therefore, as suggested by Dehak et al. (2011), the decomposition is rewritten as follows:

\[
s \approx m + Tw \qquad (3.4)
\]
where s is the supervector, m is the speaker-independent supervector (from the Universal Background Model (UBM), as in Reynolds et al. (2000)), T is the total-variability matrix, as in Kenny et al. (2005), and w is the i-vector. The difference between the training of the V matrix in JFA and the T matrix in the i-vector approach is that the T matrix is trained treating all supervectors s as belonging to different speakers, as in Dehak et al. (2011). This means that all utterances are considered to belong to different speakers, and in this concrete case to different expressions. The similarity of concrete speakers/expressions can be derived by calculating the distance between the vectors. This allows us to avoid labeling, since we are interested in the acoustic similarity of the data, not in concrete predefined labels. Also, with respect to expressiveness, as repeatedly mentioned throughout this thesis, it is very difficult to define consistent and reliable labels for databases where the expressiveness is not explicitly controlled, as for instance in audiobooks or interviews, as opposed to laboratory recorded corpora. In this sense, each supervector, i.e. each utterance, should be homogeneous with respect to speaker, channel, expressiveness, etc., i.e. belong to only one speaker, channel, expression, etc.

Usually, in speaker verification, i-vectors are compared by calculating the cosine distance. If they point in the same direction, i.e. cos dist = 1, they are considered to belong perfectly to the same speaker. A threshold value would be used to determine the minimum cosine distance required for the final decision.
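A minimal sketch of this comparison, assuming two i-vectors are already available as NumPy arrays; the 0.5 threshold below is an arbitrary illustration, not a value used in this work.

```python
import numpy as np

def cosine_score(w1, w2):
    # cosine of the angle between two i-vectors: 1.0 means identical direction
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

w_a = np.random.randn(600)   # stand-ins for extracted i-vectors
w_b = np.random.randn(600)
same_speaker = cosine_score(w_a, w_b) > 0.5   # example decision threshold
```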

The usage of i-vectors in expressive speech is recent. They were used for the first time by Lopez-Otero et al. (2014) in emotion recognition, where they represented emotional input waveforms which were classified, for evaluation, on a continuous scale as suggested by Kehrein (2002) (refer to Section 2.5 for more details). They were then used for unsupervised data clustering of a multi-speaker database (an audiobook) by Jauk et al. (2015). The main motivation for using them in expressive speech is their suitability for multi-speaker databases, where not only the expressiveness but also the different speakers have to be taken into account. The resulting data clusters have been used for expressive voice training, as presented in Section 3.2.2. The audiobook is not a multi-speaker database per se, since the book characters are acted by the same speaker. However, it can be considered an approximation of a multi-speaker environment, and there i-vectors outperformed all other feature combinations. Jauk and Bonafonte (2016b) expanded this paradigm in a study with different feature combinations, building i-vectors on pitch, intensity and syllable duration, as presented in Section 3.2.3. Also, multi-speaker and mono-speaker databases have been compared, yielding results which support the usefulness of i-vectors in multi-speaker environments. Finally, the objective results of these experiments are compared to state-of-the-art feature sets extracted with openSMILE in Section 3.2.4, in order to assess their suitability for the representation of expressive speech in multi-speaker (or approximately multi-speaker) environments.

3.1.4 OpenSMILE

The Munich open-Source Media Interpretation by Large feature-space Extraction toolkit (openSMILE), by Eyben et al. (2013), is a feature extraction toolkit used for different tasks in speech representation, and often used to extract features for emotional/expressive speech. The interesting aspect of this feature extractor is the exhaustive number of different features, and statistics over them, which can be extracted. Feature sets extracted with this toolkit have been used in many expressive and emotional speech related tasks, such as emotional speech challenges (see Section 3.2.4 for more details). Some of these feature sets are considered state of the art for emotional speech and are therefore a good comparison baseline for the i-vectors and the other feature sets proposed in this work. A list of features which can be computed by openSMILE is provided in Table 3.1, as in Eyben (2016).

In the experiment described in Section 3.2.4, certain specific feature combinations are used for comparison and will be referenced where they appear. These feature combinations have been used in emotion recognition challenges and similar tasks, being state-of-the-art features for the representation of emotional speech.

Table 3.1: Low-level audio features provided by openSMILE.

• Frame energy
• Frame intensity/loudness (approximation)
• Critical band spectra (Mel/Bark/Octave, triangular masking filters)
• MFCC
• Auditory spectra
• Loudness approximated from auditory spectra
• Perceptual linear predictive coefficients and cepstral coefficients
• LPC
• Fundamental frequency
• Voicing probability
• Voice quality: jitter and shimmer
• Formant frequencies and bandwidths
• Zero- and mean-crossing rate
• Other spectral features (arbitrary band energies, roll-off points, centroid, entropy, maxpos, minpos, variance, skewness, kurtosis, slope)
• Psychoacoustic sharpness, spectral harmonicity
• Octave warped semitone spectra (CHROMA) and energy normalized and smoothed CHROMA (CENS)
• CHROMA derived features for chord and key recognition
• F0 harmonics ratio
• Statistical and other functional features such as means, variances, extremes, ranges, regression, durations, peaks, onsets/offsets, etc.
• Classifiers such as voice activity detection (VAD), GMM and NN based classifiers, pre-trained emotion recognition models (openEAR), etc.
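For reference, extracting one of these standard feature sets programmatically might look as follows. This sketch assumes the present-day opensmile Python wrapper (which post-dates the original C++ SMILExtract tool) and the eGeMAPS functional set, both of which are illustrative assumptions rather than the configuration used in this thesis.

```python
import opensmile

# eGeMAPS functionals: one fixed-length feature vector per utterance
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")  # pandas DataFrame, one row
print(features.shape)
```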

3.2 Experiments

The role of acoustic features in emotion recognition and similar tasks is straightforward, but how can they be useful in speech synthesis? Large (multi-speaker) databases, like audiobooks, TV shows, interviews, radio, etc., are a rich source of expressive speech. However, the larger the database is, the more difficult it is to label it in order to identify the needed data. Often it is not even very clear how exactly to label such "uncontrolled", "real-life" data, which frequently contain speech of multiple speakers, many nuances of many expressions, speaking styles, etc. The apparent solution to this problem is automatic processing of the data. The idea is to automatically cluster the data into homogeneous speaking styles/expressions/emotions, avoiding labeling altogether. In order to do this in the acoustic domain, robust features are needed which effectively represent expressiveness. This section aims at comparing the suitability of different feature sets for this task. Among others, the proposed features for the multi-speaker domain are the i-vectors, which will be compared to more traditional features. The databases used in the experiments are actually not really "real-life" data, but an audiobook, rich in expressive speech and imitations of different speakers, and a studio-recorded emotion corpus spoken by two actors, one male and one female. This last corpus is used for comparison, in order to identify how different feature sets work for different types of data: the emotion corpus is mono-speaker, while the audiobook is an approximated multi-speaker corpus (acted characters).

3.2.1 Experimental framework

The basic idea of the experiments is to calculate features from audio and use them to form data clusters. Of course, a classification could be carried out, where some classifier would learn how to predict expression classes from feature sets. However, as argued on different occasions in this thesis, it is very difficult to define a reliable label set for a database where the number of expressions, speaking styles, etc. is very large. Also, the manual effort to label such a corpus would be enormous. On the other hand, to train a statistical classifier, enough data is needed for each expressive class, which is again difficult for such a database. So, the goal is to provide an unsupervised method, where no labels are needed, to create data clusters of expressively homogeneous speech. Before continuing with the experimental framework, a corpus description will be provided in order to facilitate the understanding of the framework and the evaluation process. The first corpus is a juvenile narrative audiobook recorded in European Spanish, with a total of 7900 sentences and a duration of 8.8 hours. It is spoken by one speaker who acts out and imitates different characters, expressions, etc. Bad utterances were identified partly by automatic tools and partly by manual revision, and removed. The second corpus is a laboratory recorded emotional speech corpus in European Spanish, recorded by two professional actors, one male and one female, with approx. 350 sentences each, recorded 7 times by each speaker, each time with a different emotion, for a total of 6.4 hours of speech, as described by Hozjan et al. (2002). The emotions recorded in the database are angry, disgust, fear, joy, neutral, sadness and surprise. The two corpora are used differently in the experiments, and the details will be given for each experiment in the respective sections. In the first step, for a given corpus, a feature set is calculated for each utterance and then fed into an algorithm for unsupervised clustering, in this case k-means, concretely VQ, as by Gray (1984). The implementation used in this work is the VQ/LBG tool from the SPTK toolkit. Once the clusters are formed, they are evaluated by measuring the cluster entropy. If a cluster is expressively homogeneous, its information content, i.e. its entropy, will be low. On the contrary, if a cluster contains many different expressions, it will contain more information and the entropy value will be high. How is the information level measured if there are no labels for the data in the cluster? For this purpose, and only for the evaluation, a small part of the audiobook was labeled manually with expression labels and with character labels. The labeling of expressions was not aimed at classification; it was carried out in as much detail as possible, trying to capture nuances like expression combinations, intensity where it was especially notable, etc. Some examples of the labels are surprise-anger vs surprise-joy, excited-happy, cold-angry, etc. The character labels identify the speaking character, or the narrator, where applicable. In all evaluations, the expressive labels and the character labels are treated independently, i.e. no cases are considered where, for instance, a particular character uses different expressions. The reason for this exclusion is that, basically, there is not enough labeled data to reasonably cover particular expressions for a particular character, i.e.
there are only few cases where each character is, for instance, sad, happy, angry, etc. In this sense, since character labels are much easier to obtain than expressive labels, they were assigned in order to be used as an additional test criterion, showing how the clustering behaves with different features when looking at the characters. In total, a set of 248 different expressiveness labels and 18 character labels was obtained. The labeled part contains 1200 sentences, with a total duration of approx. 1.5 hours. Since approximately half of the sentences were labeled as neutral, these sentences were removed in order to provide a certain balance, reducing the total number of used sentences to 600, of approximately 1 hour of duration. With respect to the mono-speaker corpus, since it was recorded in the studio, the labels are intrinsic. The measured entropy, as proposed by Shannon (1948) and applied to clustering e.g. by Zhao and Karypis (2004), is defined as:

\[
H(X_c) = - \sum_{i=1}^{q} \frac{n_c^i}{n_c} \log_2\!\left( \frac{n_c^i}{n_c} \right) \qquad (3.5)
\]
where X_c represents the whole set of speech segments in a cluster c, q is the number of labels, n_c^i is the number of elements in cluster c labeled with i, and n_c is the number of elements assigned to the c-th cluster. Then, the weighted entropy is calculated as:

\[
\bar{H}(X) = \frac{\sum_{c=1}^{C} n_c \cdot H(X_c)}{N} \qquad (3.6)
\]
where N is the number of all elements in the database, C is the number of clusters generated by VQ, H(X_c) is the entropy of the c-th cluster and n_c is the number of elements assigned to the c-th cluster. Additionally, to facilitate the reading, perplexity was calculated as an alternative measurement criterion. Perplexity is a transformation of the entropy frequently used in language modeling. It is defined as:

\[
PP = 2^{\bar{H}(X)} \qquad (3.7)
\]
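A compact Python sketch of this evaluation pipeline is shown below: features are clustered with k-means (scikit-learn's KMeans is used here as a stand-in for the SPTK VQ/LBG implementation mentioned above), and the label-based weighted entropy and perplexity of Eqs. (3.5)-(3.7) are computed. The feature matrix and labels are placeholders.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def weighted_entropy(cluster_ids, labels):
    """Weighted cluster entropy (Eqs. 3.5-3.6) from cluster assignments and reference labels."""
    N = len(labels)
    total = 0.0
    for c in np.unique(cluster_ids):
        in_c = [labels[i] for i in range(N) if cluster_ids[i] == c]
        n_c = len(in_c)
        counts = Counter(in_c)
        h_c = -sum((k / n_c) * np.log2(k / n_c) for k in counts.values())
        total += n_c * h_c
    return total / N

# placeholder data: one feature vector and one expression label per utterance
X = np.random.randn(600, 626)                   # e.g. 626-dimensional feature vectors
labels = np.random.choice(["angry", "sad", "happy"], size=600)

clusters = KMeans(n_clusters=64, n_init=10, random_state=0).fit_predict(X)
H = weighted_entropy(clusters, labels)
print("weighted entropy:", H, "perplexity:", 2 ** H)   # Eq. 3.7
```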

Figure 3.3 gives an overview of the clustering and evaluation process. First, a set of features is extracted from a database (the features named in the figure are rather symbolic; not always exactly these types of features are extracted). Then a clustering algorithm is applied, VQ in this case, and a number of data clusters is formed. For the evaluation, for each utterance of the database an expression label, and where applicable (audiobook) a character label, is provided. The labels are then used to calculate the entropy of the clusters, indicating the "goodness" of each cluster. Since neither the corpora nor the clustering method change, the variable parameter is the feature set. Better feature sets will yield better clustering results, showing a higher suitability for the given task. This evaluation method is referred to below as the objective evaluation. The objective evaluation is performed in three experiments, in a chronological and evolutionary sense.

Figure 3.3: Clustering and evaluation framework.

1. I-vectors trained on MFCCs (=spectral i-vectors) are compared to other, “more traditional”, features for the labeled part of the audiobook only.

2. Based on the satisfactory results of the first experiment, in addition to the spectral i-vectors, i-vectors are trained on prosodic and power features. Also, other features and feature combinations are compared with each other. This time, both databases are involved: the audiobook, acting as an approximated multi-speaker database, and the emotional mono-speaker corpus with both speakers.

3. Since i-vector-related feature sets have performed very well for the multi-speaker corpus, an additional evaluation is performed comparing the feature sets from the previous experiment to feature sets extracted with openSMILE.

Additionally, in a side study motivated by the different feature combinations tested, a weighted Euclidean distance was compared to the standard Euclidean distance in clustering. If the Euclidean distance is defined as:

\[
d(X,Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (3.8)
\]
where n is the number of dimensions of the space, then the weighted Euclidean distance can be defined as:

\[
d(X,Y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}, \qquad w_i \geq 0 \qquad (3.9)
\]
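A minimal sketch of the two distances of Eqs. (3.8) and (3.9); the weight vector here is a made-up example, not one of the weight combinations explored in the side study described next.

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))           # Eq. 3.8

def weighted_euclidean(x, y, w):
    assert np.all(w >= 0), "weights must be non-negative"
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))       # Eq. 3.9

x, y = np.array([1.0, 2.0, 0.5]), np.array([0.5, 1.5, 1.0])
w = np.array([2.0, 1.0, 0.25])   # example per-feature weights
print(euclidean(x, y), weighted_euclidean(x, y, w))
```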

For this test, an algorithm systematically changed the weights of each feature in the feature vector used for the distance calculation. Clustering was then performed using the weighted distance, and the entropy was calculated for the resulting clusters. However, the clustering for the best performing weight combination did not perform significantly better than the clustering based on the unweighted Euclidean distance. For this reason, the weighted Euclidean distance was not pursued further. Returning to the motivation for the study of features for the representation of expressive speech, the results are applied to speech synthesis. The idea is that, after having formed clusters of homogeneous speech data automatically, speech synthesis models can be trained using the data in these clusters.

Figure 3.4: Clustering and synthesis framework.

Figure 3.4 shows the framework for synthesis from clusters. As for the objective evaluation, a feature set is extracted for each utterance of the database. Here, in contrast, only one feature set is extracted, the best one from the objective evaluation, i.e. the one with the lowest entropy, to ensure the highest possible cluster homogeneity. The clustering process is applied to the features as in the objective evaluation. In parallel, an average voice model (AVM) is trained using the database. Then, using the data from each of the clusters and the AVM, speaker adaptation is performed, such that for each of the clusters an individual synthetic voice is trained which can be used to synthesize speech. Speaker adaptation is used rather than standard HMM training mainly because of the small amount of training data. This framework is evaluated in two subjective listening tests as described below. The subjective tests formed part of the first two experiments, which were carried out chronologically; the details and differences will be explained in the corresponding sections. In the third experiment, no subjective test was conducted, since its goal was rather the objective comparison of the proposed feature sets to the state-of-the-art feature sets extracted with openSMILE. For the evaluation of the synthetic voices, a web based framework was designed in which the subjects were asked to edit a paragraph of an audiobook by choosing the most appropriate synthetic voice for each sentence of the paragraph, paying attention to the expressiveness and to the character who is speaking. The interface is shown in figure 3.5.

Figure 3.5: Subjective experiment web interface.

First of all, an introduction is shown, which explains the experiment and gives some background information on the paragraph in order to provide the users with some context for the situation handled in the paragraph. This is important especially if the participants do not know the book's story. Below, the names of the participating book characters are given. In the first experiment, the participants could play the original voices by clicking on the buttons with their respective names; in the second one, there were no original examples. Below that, a sample sentence with neutral content could be played for each of the synthetic voices. In the main body of the interface, the paragraph sentences are listed on the left side. On the right side the participant could choose one of the synthetic voices to read the corresponding sentence. In total, 10 voices were available. Below the paragraph (not visible in the image) there is a button which allows the reproduction of the complete paragraph with the chosen synthetic voices. Once the participants are satisfied with their choices, the choices can be saved and a comment can be left. The idea is that, if any voices are particularly suitable for concrete characters in certain situations, there will be a clear preference for these voices among the participants. Further details will be given in the corresponding experiment sections.

3.2.2 Experiment 1: MFCC i-vectors and a small corpus

The corpus used in this experiment is the labeled excerpt of the audiobook described in the section above. It corresponds to the first four chapters of the book, with a duration of approximately 1 hour. Neutral-labeled speech has been removed. A total of 247 expression labels (248 minus neutral) and 18 character labels remained in the corpus. The corpus is segmented using the Ogmios speech synthesis tools, as by Bonafonte et al. (2006). A segment is defined as an orthographic sentence. In cases where the expressiveness varies throughout the sentence, or where the sentence is a combination of direct and indirect speech, the segments are only as long as the expressive style remains unchanged. 600 segments were obtained in total.

Features

The feature definition is based on prosodic and spectral criteria, both of which are considered relevant in order to account for different speaking styles, characters and expressions. The prosodic criterion includes rhythmic features, accounting for rhythm-driven speech styles such as news reading or rhythmically marked characters. Also, MFCC based i-vectors are used, motivated by the results achieved by Lopez-Otero et al. (2014). The following features were taken into account:

• Prosodic features:

– Intonation: means, range and Bézier polynomial coefficients of the pitch, as proposed by Escudero et al. (2002), computed using the Ogmios tools (Bonafonte et al., 2006).
– Rhythm: silence and syllable rates (#/sec), duration means and variation, computed from the segmentation.

• Spectral features:

– Formant F1-F3 means for the Spanish vowels /a/, /o/, /u/, /i/, /e/. These features were extracted using Praat (Boersma and Weenink, 2015).

• I-vectors of dimension 600 were extracted using a UBM with 512 Gaussian components. The i-vector dimension might seem high, but experimentally it was found to perform best. As acoustic features, 40 Mel-frequency cepstral coefficients were extracted using the AHOCoder, as proposed by Erro et al. (2011a). Before extracting the i-vectors, a Universal Background Model (UBM) and the total variability matrix are trained as described in Reynolds et al. (2000) and Kenny et al. (2005), respectively (a schematic sketch of the UBM training step is given below this list). Half of the labeled chapters of the audiobook corpus is used for each of these trainings. The i-vectors are calculated using Kaldi, as by Povey et al. (2011). The training data was automatically divided into segments using a voice activity detector (VAD) discarding silence intervals, such that all features were extracted disregarding silence (except for the silence frequency and duration features). When combining all features, 626-dimensional vectors were obtained.
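The UBM step above can be illustrated with the following minimal sketch, which uses scikit-learn's GaussianMixture as a simple stand-in for the Kaldi UBM training recipe actually used; the MFCC matrix is a placeholder, and training a 512-component model this way is feasible but slow.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# placeholder: frame-level MFCC features pooled over many utterances,
# shape (num_frames, 40), matching the 40 coefficients mentioned above
mfcc_frames = np.random.randn(100_000, 40)

# diagonal-covariance GMM with 512 components as a UBM stand-in
ubm = GaussianMixture(n_components=512, covariance_type="diag",
                      max_iter=50, verbose=1)
ubm.fit(mfcc_frames)

# the concatenated component means span the supervector space (Eq. 3.4)
supervector_dim = ubm.means_.size   # 512 * 40 = 20480
print(supervector_dim)
```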

Objective results

A total of 64 clusters were obtained with the k-means algorithm. Different experiments were carried out with different numbers of clusters; this concrete number offered a good balance between the data distribution throughout the clusters and the data density within each cluster: when the number of clusters was too low, the clusters were too heterogeneous, and when it was too high, there was too little data in each. Table 3.2 gives an overview of the perplexities calculated for the clusters. The cluster perplexities are compared to the perplexity of the whole database, which is considered the worst case scenario.

Table 3.2: Perplexities (PP) for silence rate, syllable rate, mean F0, F1-F3 of /e/, F3 of /o/ and i-vectors, for expressions (Ex) and characters (Ch), in comparison to the database.

Feature        PP/Ex   PP/Ch
DB             140.4   8.3
silRate         10.6   4.7
sylRate          9.4   4.0
meanF0           9.8   4.2
F1-F3 (/e/)     10.1   4.0
F3 (/o/)         7.1   3.7
i-vectors        9.0   3.5
all             10.5   3.6

As shown in Table 3.2, the perplexity of the created clusters is significantly below the perplexity of the original database. The table also shows a clear improvement when i-vectors are used for the discrimination of expressiveness and, especially, characters. An interesting behavior is observed when using the third formant of the vowel /o/, in contrast to the other formants and vowels: the expression perplexity falls to 7.1 while the character perplexity falls to 3.7. The F3 of the other vowels does not show such low perplexity values. This effect might be speaker or database dependent. The results obtained with the individual formants of the other vowels are similar to the results of the three formants of /e/, which is shown in the table as an example. Also important is the result achieved with the syllable rate: it is not better than the i-vector result, but it is better than the F0 result. In further experiments in the multi-speaker domain, the good performance of this rhythmic feature will be confirmed, especially in combination with other features.

Details on the first subjective experiment

As introduced above, the cluster data can be used to train voice models. For this purpose, speaker adaptation is performed on the data from the 64 clusters which were formed with the best feature set from the objective evaluation, i.e. the i-vectors, although in this case it might better be called "expression adaptation". Speaker adaptation is preferred over standard training not only because of the small corpus size, but also because each cluster contains even less training data. The average voice model was trained using the neutral-labeled speech data, which was excluded from the clustering process and contains approximately half an hour of speech; here, a more appropriate name might be "neutral voice model". The adaptation was then performed for each of the 64 clusters. The resulting synthetic voices were judged by three developers. They manually chose four voices, among all clusters, which in their view best fit the conversational situation and the characters who appear in the test dialogue. The other six voices were chosen randomly. The dialogue was presented in the web interface as described above and contained the 18 sentences/phrases listed in Table 3.3. For each sentence one of the 10 synthetic voices could be chosen, and the resulting paragraph saved.

Table 3.3: Paragraph sentences of the first subjective experiment.

(1) La profesora McGonagall sacó un pañuelo con puntilla y se lo pasó por los ojos, por detrás de las gafas.
(2) Dumbledore resopló mientras sacaba un reloj de oro del bolsillo y lo examinaba.
(3) -Hagrid se retrasa.
(4) Imagino que fue él quien le dijo que yo estaría aquí.
(5) -Sí
(6) -dijo la profesora McGonagall-.
(7) Y yo me imagino que usted no me va a decir por qué, entre tantos lugares, tenía que venir precisamente aquí.
(8) -He venido a entregar a Harry a su tía y su tío.
(9) Son la única familia que le queda ahora.
(10) -¿Quiere decir...?
(11) ¡No puede referirse a la gente que vive aquí!
(12) -gritó la profesora, poniéndose de pie de un salto y señalando al número 4-.
(13) Dumbledore... no puede.
(14) Los he estado observando todo el día.
(15) No podría encontrar a gente más distinta de nosotros.
(16) Y ese hijo que tienen...
(17) Lo vi dando patadas a su madre mientras subían por la escalera, pidiendo caramelos a gritos.
(18) ¡Harry Potter no puede vivir ahí!

Subjective results

A total of 19 persons participated in the experiment, most of them experienced with speech technology. Table 3.4 gives an overview of the results of the perceptual experiment.

The results suggest clear voice preferences for the three characters: voices v0 and v7 for the narrator Narr, voice v1 for character Ch2, and voices v4 and v6 for character Ch3. Characters Ch2 and Ch3 have more distributed values; this might be due to an issue mentioned by many participants, namely that it was sometimes rather difficult to identify from the text which sentence was spoken by which character. One of the participants always preferred the neutral voices for the whole passage.

Table 3.4: Relative preferences for the voices v0-v9 over the whole paragraph for the two characters (Ch2 and Ch3) and the narrator (Narr).

        v0    v1    v2    v3    v4    v5    v6    v7    v8    v9
Narr   0.55  0.00  0.03  0.00  0.01  0.01  0.01  0.37  0.00  0.00
Ch2    0.01  0.60  0.07  0.07  0.01  0.00  0.00  0.11  0.04  0.12
Ch3    0.03  0.10  0.05  0.20  0.29  0.08  0.17  0.04  0.03  0.02

3.2.3 Experiment 2: Prosodic i-vectors and single- vs multi-speaker

In the second experiment two databases are used. The first one is the laboratory recorded emotional speech corpus in European Spanish described above; in the results, the female part will be referred to as C1 and the male part as C2. The second database is exactly the same audiobook as in the first experiment and will be referred to as Al. All other conditions are the same as in the first experiment.

Features

Motivated by the positive results obtained with i-vectors in the previous experiment, the feature sets used in this experiment are expanded. First, among the "traditional" features, jitter and shimmer, as well as power (intensity), are added to the feature sets. Regarding i-vectors, these are now trained not only on MFCCs, but also on F0, syllable durations and power. The following features are used in this experiment:

• i-vectors calculated on:

– 40 Mel-frequency cepstral coefficients, extracted using the AHOCoder (Erro et al., 2011a). i-vector dimension: 600
– Fundamental frequency of voiced segments, extracted using AHOCoder, where only the voiced parts are used for the i-vector extraction. i-vector dimension: 12
– Power. i-vector dimension: 16
– Syllable durations, calculated using forced alignment with Ogmios (Bonafonte et al., 2006). i-vector dimension: 12

• F0 means, variance and range between the minimum and the maximum values. Extracted using AHOCoder.

• Syllable frequency and durations, means, variance and medians. Extracted from a forced alignment using Ogmios.

• Silence frequency and durations, means, variance and medians. Extracted from a forced alignment using Ogmios.

• Local Jitter and Shimmer. Extracted using Praat Boersma and Weenink (2015).

• Power, mean and variance.

Additionally, different feature combinations are tested; some similar features are combined and abbreviated as follows:

• Pitch: F0 means, variance and range.

• Rhythm: Silence and syllable frequency and durations, means, variances and range.

• JShimm: Local jitter and shimmer.

• iVecC: F0- and MFCC-based i-vectors.

As before, for all acoustic features, including power and i-vectors, silences were removed with a VAD, so that only speech is measured. The different i-vector dimensions were chosen experimentally as the best performing in each category.

Objective results

Table 3.5: Perplexities for different feature combinations and for the three databases. The female part of the emotional studio corpus (C1), the male part of the same corpus (C2), and the audiobook database (Al) for expressions (E) and for characters (Ch) are shown.

                            C1    C2    Al(E)   Al(Ch)
DB                          7.0   7.0   140.4   8.3
F0 means, variance          3.0   2.7   9.6     3.8
Pitch                       2.9   2.9   9.4     3.9
Power                       5.0   5.1   11.5    4.8
Pitch - Power               4.9   5.1   13.4    4.9
JShimm                      5.9   5.6   10.6    4.5
Rhythm                      4.3   4.6   9.2     3.7
Rhythm - Pitch              3.2   3.0   8.7     3.4
Rhythm - Pitch - JShimm     3.2   3.1   8.6     3.4
MFCCiVec                    6.4   6.3   9.0     3.5
F0iVec                      4.3   4.2   11.2    4.2
PoweriVec                   6.2   6.3   11.6    4.4
sylDuriVec                  4.7   7.0   27.9    4.9
iVecC                       6.2   6.0   8.8     3.8
Rhythm - iVecC              4.5   4.9   8.2     3.3
Rhythm - JShimm - iVecC     5.2   4.5   8.5     3.5

Table 3.5 shows the results of the objective cluster evaluation. There are many differences between the features and between the corpora; the bold-marked values are the best results obtained. For the laboratory-recorded corpora, the best results are obtained using just F0 or the Pitch combination, while the best results for the audiobook are obtained using the combination of Rhythm and i-vectors. Rhythm seems to be more important for the audiobook than for the laboratory corpora. In fact, the Rhythm-Pitch and the Rhythm-Pitch-JShimm combinations achieve almost the same results as the Rhythm and i-vector combination. On the other hand, Rhythm alone performs worse than i-vectors alone, although better than Pitch and JShimm alone.

In contrast, i-vectors do not perform well when applied to the laboratory corpora. This seems to be due to the fact that the laboratory corpora have only one speaker each, while the audiobook has many speakers (although only approximated by imitation). The best results here were obtained using the Pitch parameters. Among the i-vector based sets in the single-speaker corpora, the best results were obtained with the F0-based i-vectors.

An interesting observation can be made by closely examining the individual cluster results for the female laboratory speaker using the i-vectors based on syllable durations. Several clusters of approximately 30 to 40 utterances appear to be totally homogeneous, i.e. all labels in these clusters belong to the same emotion (entropy = 0), e.g. angry, surprise, disgust, etc. This distribution suggests that the female laboratory speaker uses rhythm as an important tool to communicate emotions. However, this is not true for the male laboratory speaker nor for the audiobook reader. These observations are supported by the perplexity values for the i-vectors based on syllable durations, which differ clearly between the speakers. Many other clusters of the female corpus formed with the syllable duration i-vectors consist of emotions which acoustically could belong together, such as joy, anger and surprise, or sadness and fear. Some emotions, especially fear and surprise, often co-appeared with other emotions. This is not surprising, since these emotions can easily combine with others: for instance, one can be surprised positively, i.e. joyfully, or negatively, with fear or anger. Also, fear can be more aggressive, i.e. angry, or more neutral, or close to sadness.

Details on the second subjective experiment

The experimental design for the second subjective test is generally the same as for the first one. The same interface is used; however, the original voice examples for the characters have been removed, such that the participants had no voice reference for the characters, making the task more difficult. Here, the choice is surely influenced by whether a participant knows the book, and also by her/his imagination of how a certain character should sound. Also, because a very small training corpus was used for the first experiment, the overall voice quality was reported to be relatively poor. Therefore, for the second experiment, the complete audiobook was used for training. The AVM was trained on the whole audiobook, and the clusters were also formed from the data of the whole audiobook. So, for each of the 7900 audiobook utterances a feature vector was calculated, using the best performing feature set from the objective evaluation, i.e. the combination of Rhythm and F0- and MFCC-based i-vectors, with a total of 620 dimensions.

Table 3.6: Paragraph sentences of the second subjective experiment.
(1) N: Entonces él lo sabía. (2) N: La idea hizo que de pronto las piernas de Harry se tambalearan. (3) V: -No seas tonto. (4) N: se burló el rostro. (5) V: -Mejor que salves tu propia vida y te unas a mí... o tendrás el mismo final que tus padres... (6) V: -Murieron pidiéndome misericordia. (7) P: -¡MENTIRA! (8) N: gritó de pronto Harry. (9) N: Quirrell andaba hacia atrás, para que Voldemort pudiera mirarlo. (10) N: La cara maligna sonreía. (11) V: -Qué conmovedor. (12) N: dijo. (13) V: -Siempre consideré la valentía... (14) V: -Sí, muchacho, tus padres eran valientes... (15) V: -Maté primero a tu padre y luchó con valor... (16) V: -Pero tu madre no tenía que morir... ella trataba de protegerte... (17) V: -Ahora, dame esa piedra, a menos que quieras que tu madre haya muerto en vano. (18) P: -¡NUNCA!

These features were used for clustering, also forming 64 clusters with the whole audiobook data. Different from the first experiment, the four voices most suitable for the paragraph sentences were chosen automatically. Basically, the acoustic distance was measured between the acoustic feature vectors of the original sentences and the 64 cluster centroids. The closest clusters were chosen, and the voices built on the data from these clusters were considered the most suitable to synthesize the paragraph sentences. The other six voices were chosen randomly, as in the first experiment. The 18 sentences/phrases used in this experiment are shown in table 3.6. Also here, for each sentence one of the 10 synthetic voices could be chosen, and the resulting paragraph saved. Also, responding to the difficulty reported in the first experiment of identifying which sentence was spoken by which character, initial letters were provided in front of each sentence to facilitate the identification.

Subjective results

Table 3.7 shows the results for the perceptual experiment. Each number represents the percentage of how often a voice is chosen to represent a book character. Bold numbers indicate the highest preferences. A total of 11 subjects participated in the experiment, 8 of them not familiar with speech technology. Although there were no clues about how the original characters sound, certain voices are systematically preferred for certain characters, and also for different parts of the dialogue.

Table 3.7: Relative preferences for the voices v0-v9 over the whole paragraph for the narrator (Narr) and the two present characters (Ch2 and Ch3).

        v0    v1    v2    v3    v4    v5    v6    v7    v8    v9
Narr    0.42  0.06  0.00  0.03  0.04  0.23  0.04  0.10  0.06  0.01
Ch2     0.13  0.16  0.14  0.23  0.03  0.09  0.05  0.03  0.10  0.03
Ch3     0.18  0.13  0.13  0.31  0.00  0.18  0.00  0.00  0.00  0.05

So, for instance, the narrator voice is chosen differently for the beginning of the dialogue and for the middle part, where tension rises. The characters are interpreted more freely, especially the second one. Although 6 of the 10 synthetic voices had been chosen randomly, this does not mean that they cannot represent the characters adequately, and participants could decide to use them to compose their paragraphs. In general, the first four voices (v0-v3) and the sixth voice (v5) were mostly preferred for the interpretation. The voices v0-v3 are the ones which had been chosen by the distance calculation between the original sentences and the cluster centroids. None of the participants selected the neutral voice (v4) for all sentences, although it had higher segmental quality. The results show that, first, different voices are obtained from different data clusters, and second, the voices are suitable for different characters and different situations. Surely, some clusters will yield voices which might be considered unsuitable for anything; however, in general, the results show that the proposed method works.

3.2.4 Experiment 3: Comparison to OpenSMILE

This experiment seeks to validate the usage of i-vectors, especially for multi-speaker databases, by comparing them with state-of-the-art feature sets extracted with openSMILE and used in tasks like emotion recognition challenges. The databases used for this experiment are the same as those used in experiment 2, as are the objective test criteria and the proposed feature combinations. The comparison is done with the feature sets used in the Interspeech 2009 Emotion Challenge and the Interspeech 2010 Paralinguistic Challenge, the openSMILE emobase2010 reference set (which is an improved version of the old emobase baseline and is based on the Interspeech 2010 set), and the Large openSMILE emotion feature set. To summarize the experimental conditions: the databases, i.e. the labeled part of an audiobook and the two emotional mono-speaker databases, are segmented at sentence level, and for the utterances of each sentence a feature vector is extracted. Then, k-means clustering is applied on the feature space, forming a total of 64 homogeneous clusters. The clusters are evaluated by calculating their entropy; better clusters have lower entropy, and more suitable feature sets will yield better clusters. The following sections describe the openSMILE feature sets and the experimental results.
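The clustering and its entropy-based evaluation can be sketched as follows (a minimal illustration assuming NumPy arrays features and integer-encoded labels; the weighted-entropy-to-perplexity conversion shown here is one plausible instantiation, not necessarily the exact definition used in the thesis):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_perplexity(features, labels, n_clusters=64, seed=0):
    """k-means clustering of feature vectors, then label perplexity of the clustering.

    labels must be integer-encoded expressive labels; lower perplexity
    corresponds to more label-homogeneous clusters.
    """
    assignment = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(features)
    entropies, sizes = [], []
    for c in np.unique(assignment):
        counts = np.bincount(labels[assignment == c])
        p = counts[counts > 0] / counts.sum()
        entropies.append(-np.sum(p * np.log2(p)))   # entropy of one cluster
        sizes.append(np.sum(assignment == c))
    mean_entropy = np.average(entropies, weights=sizes)
    return 2.0 ** mean_entropy                      # perplexity = 2^entropy
```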

OpenSMILE extracted feature sets

The following feature sets have been extracted using the openSMILE extractor. For further details, such as the number of coefficients, please refer to the openSMILE Book, by Eyben(2016), to the reference articles cited for each feature set, and to the configuration files provided in the openSMILE toolkit.

• The Interspeech 2009 Emotion Challenge feature set (is09): This set contains 16 low-level descriptors and statistics applied to them, resulting in a total of 384 features. The low-level descriptors are: root-mean-square signal frame energy, MFCCs, zero-crossing rate, voicing probability, F0. The functionals are: mean, standard deviation, maximum and minimum contour values, range, absolute positions of the maximum and the minimum values, slope and offset of a linear approximation of the contour, quadratic error of the difference between the linear approximation and the actual contour, skewness, kurtosis. A more detailed description of the feature set and the challenge can be found in Schuller et al.(2009).

• The Interspeech 2010 Paralinguistic Challenge feature set (is10): This set contains 34 low-level descriptors with 34 corresponding delta coefficients, and 21 statistical functionals applied to them. 19 additional statistical functionals are applied to the four pitch-based low-level descriptors and their delta coefficients. In total, the set contains 1582 features. The low-level descriptors are: loudness, MFCCs, logarithmic power of Mel-frequency bands, line spectral pair frequencies computed from LPC coefficients, smoothed F0 envelope, voicing probability. The functionals are: mean, standard deviation, absolute positions of the maximum and minimum values, slope and offset of a linear approximation of the contour, quadratic error of the difference between the linear approximation and the actual contour, skewness, kurtosis, the first, second and third quartiles, inter-quartile ranges, outlier-robust minimum and maximum values of the contour, outlier-robust signal range, and the percentage of time the signal is above (75% ∗ range + min) and (90% ∗ range + min). The additional pitch-related features are: smoothed F0 contour, local jitter, differential frame-to-frame jitter (the jitter of the jitter), local shimmer. The functionals mentioned above are all applied to the additional pitch-related features, except for the outlier-robust minimum values and the range. More detailed information on the feature set and the challenge can be found in Schuller et al.(2010).

• The openSMILE emobase2010 reference feature set (emobase): This set is a further development of the "emobase reference set" and is based on the Interspeech 2010 set, including changes and improvements of some features. It also contains 1582 features and is the recommended reference set. More details can be found in Eyben(2016) and in the corresponding configuration files of the toolkit.

• The large openSMILE emotion feature set (emolarge): This exhaustive set includes a large number of low-level descriptors, the corresponding deltas, and functionals, a total of 6552 features. More details can be found in Eyben(2016) and in the corresponding configuration files of the toolkit.

The feature sets described here comprise what are considered state-of-the-art feature sets in emotion recognition and related tasks. The emolarge feature set contains basically all available descriptors and is interesting to compare to the smaller, but more carefully designed, sets.

Experimental results

Table 3.8: Perplexities for different feature combinations, including openSMILE, and for the three databases. The female part of the emotional studio corpus (C1), the male part of the same corpus (C2), and the audiobook database (Al) for expressions (E) and for characters (Ch) are shown.

                            C1    C2    Al(E)   Al(Ch)
is09                        3.8   3.9   10.5    3.9
is10                        3.2   3.3   10.8    4.0
emobase                     3.2   3.2   10.5    4.0
emolarge                    4.5   4.7   8.8     3.7
DB                          7.0   7.0   140.4   8.3
F0 means, variance          3.0   2.7   9.6     3.8
Pitch                       2.9   2.9   9.4     3.9
Power                       5.0   5.1   11.5    4.8
Pitch - Power               4.9   5.1   13.4    4.9
JShimm                      5.9   5.6   10.6    4.5
Rhythm                      4.3   4.6   9.2     3.7
Rhythm - Pitch              3.2   3.0   8.7     3.4
Rhythm - Pitch - JShimm     3.2   3.1   8.6     3.4
MFCCiVec                    6.4   6.3   9.0     3.5
F0iVec                      4.3   4.2   11.2    4.2
PoweriVec                   6.2   6.3   11.6    4.4
sylDuriVec                  4.7   7.0   27.9    4.9
iVecC                       6.2   6.0   8.8     3.8
Rhythm - iVecC              4.5   4.9   8.2     3.3
Rhythm - JShimm - iVecC     5.2   4.5   8.5     3.5

Table 3.8 shows the perplexity results for the clusters created with the different feature sets: those proposed in experiment 2 and those extracted with openSMILE, including is09, is10, emobase and emolarge. As a reminder, smaller perplexity means better separation of speakers and characters, and therefore more homogeneous (=better) clusters. For the mono-speaker databases: the openSMILE-extracted feature sets is09, is10 and emobase performed better than the i-vector based feature sets (lower part of the table). The emolarge set had a comparable performance. In comparison to the other, "more traditional" features and feature combinations, the same three openSMILE sets performed better than the sets Power, Pitch-Power, JShimm and Rhythm; however, they could not outperform the combinations

Rhythm-Pitch and Rhythm-Pitch-JShimm, and especially F0 means and variance, and Pitch (F0 means, variance and range). For the audiobook, expressions: the openSMILE sets stayed approximately in the middle range in comparison to the other features; here, in contrast to the mono-speaker databases, the emolarge set achieved the best results among the openSMILE sets. In comparison to the traditional features, the emolarge set came close to the best performing combinations Rhythm-Pitch and Rhythm-Pitch-JShimm. In comparison to the i-vector based sets, the openSMILE sets could not outperform the best performing sets. For the audiobook, characters: the panorama is similar to that for the expressions, with emolarge being the best set among the openSMILE combinations, performing similarly to the rhythm combinations among the traditional sets, and not reaching the performance of the best i-vector based sets.

3.3 Discussion

The goal of this chapter was to study how suitable different acoustic features are for representing expressive speech and how they can be used for synthesis. Using different feature combinations, unsupervised clustering is performed on expressive speech corpora, and the data from the resulting clusters is used to train synthesis models with speaker adaptation techniques, allowing the necessary amount of training data to be decreased substantially. For the evaluation of the proposed methodology, a small excerpt of an audiobook was labeled, clusters were formed from the labeled audiobook data, and the labels were used to evaluate the clusters by calculating their entropy. Lower entropy means homogeneous clusters suitable for model training. With this paradigm, different features and feature combinations, including i-vectors and openSMILE-extracted feature sets, were compared, also using different databases.

The objective results allow several conclusions regarding the suitability of different features. First, for the speech corpora recorded in a controlled laboratory environment, the more traditional features, especially the pitch-related ones, worked best. Surprisingly, the simple pitch combinations even outperformed the sophisticated feature sets extracted with openSMILE and used for emotion challenges and similar tasks. This is not true, though, for the audiobook, which was also recorded in a studio environment; however, the reader interpreted not only emotions and expressions but also characters, which turns the audiobook into an approximation of a multi-speaker database. In this case, i-vectors seem to be more effective, probably as a consequence of the presence of different imitated speakers (book characters). Here, no feature set could outperform the best i-vector based combinations. Probably, in a real multi-speaker database, where speakers are not imitated, i-vectors could be even more effective. For the single-speaker domain, though, the best feature combination possibly needs to be found for each individual case by labeling a small portion and testing. Of course, it has to be taken into account that the present results have been obtained in an unsupervised clustering framework which relies on simple Euclidean distance calculations. Trained models would probably yield a different panorama.

Using the data clusters from the audiobook formed with the best feature combination, subjective experiments were carried out. In the experiments, participants could edit small paragraphs of an audiobook using synthetic voices trained on the data clusters. For this, 10 synthetic voices were made available for each paragraph sentence, from which participants could choose in order to design the dialogues, taking into account characters (=speakers) and expressiveness. The experimental results show that the proposed method permits the definition of expressive voices which can be chosen manually to synthesize expressive text in applications like audiobook editing. Although practically useful, this method does not permit automatic expressive voice assignment, since it is based on acoustic features only and there is no connection to the expressiveness in the text, except by human intervention. The next chapter analyzes methods of semantic vector representation of text which can be used to predict acoustics, providing a completely automatic method for expressive synthesis directly from plain text.

Chapter 4

Semantics-to-Acoustics Mapping

In the previous chapter, unsupervised clustering was performed in the acoustic domain on different feature sets extracted from corpora in order to gain expressively homogeneous training data automatically. This chapter deals with a similar task in the textual or semantic domain. The advantage of using text instead of acoustics to find expressive data is that text is much more easily available and processable than acoustics. Furthermore, the task is not bare clustering of training data by itself: the goal is to predict expressiveness from text. This can yield interesting applications such as automatic reading of expressive text like books, or finding expressive or emotional data in large databases without acoustic feature extraction. In order to do that, numerical representations of text as a function of semantics, or even sentiment, will be derived and used to predict acoustic information, or used directly as additional input features to a TTS system (please refer to Chapter 5 for further details on this task). This chapter describes these semantic representations, how they are related to expressiveness, and how acoustic information can be derived from them.

Before continuing with semantic representations, the question What is semantics? needs to be discussed. As Klabunde et al.(2004) state, semantics is a part of linguistics and treats the meaning of textual units, such as words, phrases, whole texts, etc. In semantics, the meaning of these units is usually identified as true or false, and the meaning of complex units is derived from logical relations between smaller units. Clearly, the truth value is not the only information codified in text. There is also information derived from the context, world knowledge, information about the speakers, the relationships between speakers, etc. All that type of information, not only the semantic, allows humans to read texts, such as books, in an expressive and emotional way. Certainly, the manner of interpretation in reading might be very personal; however, we as listeners can very well intuit whether the emotion transmitted by the reader is in accordance with the semantic meaning of the read text. The same could be claimed for spontaneous speech. Where this relation fails, i.e. the perceived semantic and emotional information is not in accordance, the emotional meaning might change, creating expressive states like irony. According to this reasoning, it can be assumed that,

at least to some extent, some information about the expressiveness of the text can be deduced from semantics.

The goal is to find out how expressive information can be derived from semantics and how it can be translated to acoustics. There are key words or clues that indicate how an utterance might be pronounced. For example, the sentence Happy birthday! might be pronounced with a happy, or at least positive sounding voice, while My grandmother has died. will probably be pronounced with a sad voice. Of course there are cases where this type of clue does not work. These cases might depend on context, be associated with irony or sarcasm, or basically be exceptions with no clear explanation. In this work, the expressiveness which is intrinsically codified in the semantics of such clues is referred to as default expressiveness, and in the case of exceptions it is referred to as pragmatic expressiveness. Given a text where some words are clues from which expressiveness can be derived, what about all the other words which do not contain information about expressiveness? The idea, which will be pursued in the course of the proposed approach, is that even if a word or a word combination does not contain information about expressiveness, but co-occurs with a word or word combination which does contain it, in many cases the semantically non-expressive words will be articulated with a similar expressiveness as the expressive ones. The key word here is context.

It is relatively easy to identify expressive speech using key words, assuming that it is pronounced the default way. However, it is much more difficult to determine the pragmatic expressiveness of a text, since it requires some codification of the relations between different contexts, or persons, or world knowledge. It might be, for instance, that the sentence The house is green. is pronounced with disgust, because the person who says it finds the color green disgusting. However, the sentence by itself does not convey that type of information, so in order to predict the correct expressiveness for this sentence the system would need to know that the speaker hates green. That fact could possibly be deduced from the context, or be part of the world knowledge. How far that knowledge can be made available to the machine is a very difficult issue.

Several problems arise when working with structures like keywords. First, their potential number is infinite. Second, they need to be identified reliably in the text and interpreted correctly by the system in order to deduce information about expressiveness. This might be realizable for a limited domain, but it is impossible for open domains such as books. Rather, an efficient numerical representation of text is needed, such that terms can be organized automatically in a meaningful semantic way. Then, assuming that acoustic features related to expressiveness can be deduced from the semantic representation, regression models can be trained in order to predict acoustics from semantics.

A useful semantic representation is the vector representation, or term embeddings. Here terms, i.e. letters, words, n-grams, phrases, sentences, etc., are codified as multidimensional vectors. The relative position of the vectors reflects the semantic relations between the terms they represent. So, if two terms have exactly the same semantic meaning, they would be represented by the same vector, or by two vectors which are located very close to each other.
Semantic vector representations have been used in speech synthesis mainly in two manners: first, as an additional linguistic feature, especially in neural network based speech synthesis (see Section 2.4.2 and Chapter 5); second, to cluster or classify training corpora. For instance, Watts(2012); Alías Pujol(2006); Lorenzo Trueba(2016) used semantic vector representations for unsupervised clustering of training or adaptation corpora for speech synthesis. Eyben et al.(2012) also use semantic vectors to cluster training data and to classify input text for a TTS.

This chapter is organized as follows: Section 4.1 introduces semantic representations, focusing on numerical representations. First, bag-of-words models are introduced and shortly discussed. Distance measures are also discussed, since these are used for basic classifications in vector space, where the distance between the data points determines their similarity. Then, in Section 4.1.3, latent semantic indexing (LSI) is introduced, explained and visualized on a toy example. Further, a preliminary analysis of a labeled portion of an audiobook is conducted in Section 4.1.3, applying LSI and visualizing the data distribution in terms of expressive labels. Section 4.1.4 introduces the Skip-gram embeddings derived from neural networks, while Section 4.1.4 introduces embeddings calculated from the sentiment (positiveness/negativeness) of a sentence, also using neural networks. Section 4.2 proposes a method of how to use semantic embeddings to predict expressiveness (acoustics), and Section 4.3 describes the experiments based on the proposed approach, with the corresponding results. Finally, in Section 4.4, the studied approaches and the results are discussed.

4.1 Semantic representation

Semantic analysis of text is a difficult task which, in the past, required a lot of manual effort. Words and concepts needed to be decoded semantically, in the form of logical formulas or ontology-like models, where at least key words had to be identified and made interpretable. A famous example of such a network of words is WordNet, by Princeton(2010), and a well-known tool for creating one's own ontologies is Protégé, by Musen(2015). Such tools or networks provide an advanced semantic dictionary. To create such a dictionary, expert knowledge is needed, and this means a lot of manual work.

How could semantics be analyzed automatically? As a first step, it needs to be codified numerically. Numerical codification of text comes from information retrieval, where large amounts of text need to be processed automatically. The first idea is a basic bag-of-words representation, where a corpus is understood as a set of documents D = {d1, d2, ..., dK}, and each document is characterized by the set of words W = {w1, w2, ..., wN} which the respective document contains. A document can be of arbitrary size, i.e. it can be, for instance, a sentence, a paragraph, etc. Words are also called units or terms. On the other side, a unit or term can also be larger than a word, for instance a bigram or a trigram. In this way, each document is represented by a vector of term occurrences, and the similarity between terms and documents is derived from their proximity, i.e. it is assumed that similar documents will contain similar terms. A straightforward way to calculate the proximity is to use distance measures; some of those used in this work are described in section 4.1.1. These rather simple models are further developed into Latent Semantic Indexing (LSI), which is based on Singular Value Decomposition (SVD) and is described in section 4.1.3. Bellegarda(2012) proposes a similar method called latent semantic mapping (LSM). LSI has been used for some preliminary experiments, codifying a labeled portion of an audiobook; the results are presented in the same section.

Most recently, in many language and speech processing areas, neural networks (see Chapter 2) are used for a variety of tasks. Neural network based techniques have also been developed for numerical semantic representations. Sections 4.1.4 and 4.1.4 describe semantic embeddings, i.e. semantic vectors, which are derived from neural network layers. The neural networks are estimated with different training criteria, such that, for example, the Skip-gram model, by Mikolov et al.(2013), is trained to predict the probability of the context of a word, while the Stanford sentiment model, by Socher et al.(2013), predicts the probability of the positiveness or negativeness of a sentence. The vectors are then extracted from intermediate neural network layers.

4.1.1 Distance Measures

Generally, if a term is represented in the form of a vector, one of the most basic tools to calculate its similarity to other terms is to measure the distance to the other terms in the vector space. Usually, in semantic vector spaces, the cosine distance is computed from the dot product, as in:

d(X, Y) = \cos(\Theta) = \frac{\sum_{i=1}^{n} a_i \, b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \; \sqrt{\sum_{i=1}^{n} b_i^2}}    (4.1)

where n is the number of dimensions of the space. In this work, in a side study, different distance measures were compared: the cosine distance, the Euclidean distance, the Canberra distance, by Lance and Williams(1967), and the Chebyshev distance, as by Deza and Deza(2009). The motivation for this comparison lay in the heterogeneous nature of the different features used for the experiments. The distance measures performed relatively similarly; however, due to its simplicity and its compatibility with unsupervised clustering operations, the Euclidean distance is used for the experiments presented in this chapter.
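As an illustration, all of the compared distance measures are available in SciPy; a minimal sketch (the vector values are arbitrary examples):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([0.476, 0.310])    # e.g. a term vector
b = np.array([0.786, -0.172])   # e.g. a document vector

print(distance.cosine(a, b))     # 1 - cos(theta), i.e. cosine distance
print(distance.euclidean(a, b))
print(distance.canberra(a, b))
print(distance.chebyshev(a, b))
```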

4.1.2 Bag-of-words Representations

The simplest technique for the numerical representation of text is the bag-of-words representation. Bag-of-words is a representation of documents in terms of the words which constitute them. So, each document, which is a text of arbitrary length and composition, contains words. Each word, also called a term or unit, is counted. Each document is then represented as a vector of word (co-)occurrences, and the set of documents as a T × D matrix, as in table 4.1, where T are the terms and D the documents. An alternative is the inverted-index representation, where each word is indexed with the number of documents in which it appears, although the underlying principle remains the same.

Table 4.1: Co-occurrence matrix. Columns are the documents, rows are the terms.

      d1  d2  d3
t1     1   1   0
t2     1   1   0
t3     2   0   1
t4     1   0   5
t5     1   0   0

In the basic version, neither the order of the words nor the structure is considered. In some cases this can be a problem: for instance, the utterances The big dog is in the house. and The dog is in the big house. are semantically distinct, but would be represented identically in a simple bag-of-words model.
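A term-document count matrix like the one in table 4.1 can be built in a few lines of Python; a minimal sketch using the example corpus of section 4.1.3 (scikit-learn's CountVectorizer is used here for convenience, assuming scikit-learn ≥ 1.0; it is not the tooling used in the thesis):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Harry Potter lives in Privet Drive number four",
    "Harry Potter has survived the attack of Voldemort",
    "My grandmother lives in Privet Drive",
]

vectorizer = CountVectorizer()
dt = vectorizer.fit_transform(docs)         # document-term sparse matrix
td = dt.T.toarray()                         # transpose to the T x D layout of table 4.1
print(vectorizer.get_feature_names_out())   # the vocabulary (terms)
print(td)
```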

Sparse Matrices and Term Weighting

Such matrices are called sparse matrices because they usually contain a lot of zero values, since many terms will not appear in all documents. On the other hand, there will be some terms which appear many times in all documents, like stop words, i.e. articles, conjunctions, etc. This makes the classification slower and possibly less reliable, since many terms do not contribute useful information. There are several techniques that deal with this problem, as described for instance by Klabunde et al.(2004):

• Text preprocessing: In a first preprocessing step, all terms which are considered useless, like stop words, are removed. Additionally, words are normalized by eliminating inflections, often reducing words to their stems.

• Term selection: Only those terms which are considered to contribute the most information about the semantics are chosen.

• Reducing the search space: There are different methods of search space reduction, such as clustering, where the classification is achieved using cluster centroids. A different technique is latent semantic analysis (LSA), which is based on Singular Value Decomposition (SVD), see section 4.1.3. This technique significantly reduces the dimensionality of the search space.

After all eliminations and reductions, there will still be terms which are semantically more important than others. The most common way to deal with this issue is to weight the terms, according to Klabunde et al.(2004). A popular way of doing so is the TF/IDF technique, where TF stands for term frequency and IDF for inverse document frequency. Basically, it is assumed that the importance of a term is somehow reflected in how often it appears in a document. Additionally, the relation between the number of documents that contain a specific term and

the total number of documents is taken into account. Let t_{i,j} be the number of appearances of term j in document i, f_j the number of documents that contain term j, and N the total number of documents; then the weight is calculated as:

w_{i,j} = t_{i,j} \cdot \log(N / f_j)    (4.2)

The interpretation of the weighting is as follows: if a term occurs in all documents, its weight is equal to 0, since it practically lacks semantic information, which happens for instance with articles, pronouns, etc. If a term occurs only in a few documents, its importance depends on how often it occurs there, giving higher weights to higher occurrence counts. Another weighting technique, the boolean technique, only considers whether a term occurs, without taking the frequency into account. Other methods use the entropy of a term, others its probability, according to Klabunde et al.(2004).

When dealing with expressiveness, it is not fully clear whether text preprocessing or term weighting contributes to the classification or not. Studies conducted by Pennebaker(2011) report that especially stop words contain a lot of expressive information. In a preliminary study, different vector space realizations have been compared, and those that contained stop words achieved better results (see section 4.1.3). Word stemming, on the other hand, has always been applied in the experiments presented in this chapter. Regarding term weighting, it is unclear how to apply it for expressive classification. Intuitively, term frequency is rather unimportant: although there are terms which can easily be associated with certain expressions, there are many more which cannot. Key words could be weighted, though this again involves manual intervention.
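A minimal sketch of this weighting scheme, implementing equation 4.2 directly rather than any library's TF-IDF variant (the example documents are illustrative):

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    N = len(documents)
    # f_j: number of documents containing term j
    df = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)  # t_{i,j}: term counts in this document
        weighted.append({term: count * math.log(N / df[term])
                         for term, count in tf.items()})
    return weighted

docs = [["harry", "potter", "lives", "in", "privet", "drive"],
        ["harry", "potter", "survived", "the", "attack"],
        ["my", "grandmother", "lives", "in", "privet", "drive"]]
print(tf_idf(docs)[0])   # a term occurring in all documents gets weight 0
```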

4.1.3 Latent Semantic Indexing

Latent Semantic Indexing (LSI) is a vector-based technique of text representation built on Singular Value Decomposition. If C is an m × n matrix, where m > n, then C can be decomposed as follows (as for example in Kuttler(2007)):

C = U D V^T    (4.3)

where U is the eigenvector matrix of C C^T, V is the eigenvector matrix of C^T C and V^T is its conjugate transpose, and D is the diagonal singular value matrix of C C^T and C^T C. Figure 4.1 illustrates the SVD with the matrix dimensions. Applied to text, first an m × n co-occurrence matrix C is built, as in table 4.2, where rows are terms and columns are documents, using the following example corpus:

• (d1): Harry Potter lives in Privet Drive number four.

• (d2): Harry Potter has survived the attack of Voldemort.

Figure 4.1: Singular Value Decomposition illustration.

Table 4.2: Co-occurrence matrix

                d1  d2  d3
harry            1   1   0
potter           1   1   0
live             1   0   1
privet drive     1   0   1
number four      1   0   0
survive          0   1   0
attack           0   1   0
grandmother      0   0   1

• (d3): My grandmother lives in Privet Drive.

Each sentence, i.e. each document, is represented by a column, and each term by a row. If term i occurs a times in document j, then C[i, j] = a. Now the singular value decomposition is applied as in equation 4.3, obtaining the matrices in equations 4.4, 4.5 and 4.6.

0.476 0.310 0.144  0.476 0.310 0.144    0.433 −0.427 −0.077   0.433 −0.427 −0.077 U =   (4.4) 0.293 −0.091 0.530    0.183 0.402 −0.386   0.183 0.402 −0.386 0.140 −0.335 −0.607

 0.786 0.491 0.374  T V = −0.172 0.756 −0.631 (4.5) 0.593 −0.432 −0.679

D = \begin{pmatrix} 2.684 & 0.000 & 0.000 \\ 0.000 & 1.883 & 0.000 \\ 0.000 & 0.000 & 1.120 \end{pmatrix}    (4.6)

The terms are represented in the new space by the row vectors of the matrix U, and the documents by the column vectors of the matrix V^T. The singular values of the matrix D represent the dimensions of the vector space. The dimensionality can be reduced by setting some singular values to zero. For visualization purposes the dimensionality is reduced to k = 2, obtaining:

0.476 0.310  0.476 0.310    0.433 −0.427   0.433 −0.427 U2 =   (4.7) 0.293 −0.091   0.183 0.402    0.183 0.402  0.140 −0.335

V_2^T = \begin{pmatrix} 0.786 & 0.491 & 0.374 \\ -0.172 & 0.756 & -0.631 \end{pmatrix}    (4.8)

D_2 = \begin{pmatrix} 2.684 & 0.000 \\ 0.000 & 1.883 \end{pmatrix}    (4.9)

To obtain the vector space coordinates, the U_k and V_k^T matrices are multiplied by the singular values in D_k. For instance, to obtain the coordinates for the term harry, its vector in the U_2 matrix, which corresponds to the first row (since it is the first term), i.e. ⟨0.476, 0.310⟩, is multiplied by the singular value diagonal matrix, as in:

\begin{pmatrix} 0.476 & 0.310 \end{pmatrix} \begin{pmatrix} 2.684 & 0.000 \\ 0.000 & 1.883 \end{pmatrix} = \begin{pmatrix} 1.278 & 0.584 \end{pmatrix}    (4.10)
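The toy example above can be reproduced with a short NumPy sketch (a minimal illustration of the decomposition and the coordinate computation; the sign conventions of the SVD routine may differ from the values printed in the text):

```python
import numpy as np

# term-document co-occurrence matrix from table 4.2 (8 terms x 3 documents)
C = np.array([
    [1, 1, 0],  # harry
    [1, 1, 0],  # potter
    [1, 0, 1],  # live
    [1, 0, 1],  # privet drive
    [1, 0, 0],  # number four
    [0, 1, 0],  # survive
    [0, 1, 0],  # attack
    [0, 0, 1],  # grandmother
])

U, s, Vt = np.linalg.svd(C, full_matrices=False)   # C = U diag(s) Vt
k = 2                                              # keep the two largest singular values
term_coords = U[:, :k] * s[:k]                     # term coordinates, e.g. 'harry' = row 0
doc_coords = Vt[:k, :].T * s[:k]                   # document coordinates
print(term_coords[0])                              # roughly (1.278, 0.584), up to sign
```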

Figure 4.2: SVD Example

Figure 4.2 visualizes terms (circles) and documents (squares) in the vector space. w1 corresponds to the first term (harry), with the coordinates calculated above, w2 to the second, which has the same coordinates since it has exactly the same vector in the co-occurrence matrix, etc. The coordinates for the documents are calculated in the same way, but multiplying the columns of the V_2^T matrix by the singular values. It can be observed that, for instance, the first two terms (harry, potter) are located between the documents d1 and d2, where they can actually be encountered in the corpus. The terms w3 and w4 (live, privet drive) are located between the documents d1 and d3, also in correspondence with the corpus. Terms w6 and w7 (survive, attack) are closest to d2. The next sections describe a preliminary experiment and its results, where a labeled part of an audiobook is projected into an LSI space and visualized. The labels represent the expressiveness of the text.

Preliminary experiments with LSI

Given the assumption that some information about expressiveness is codified in text and can be deduced from semantics, semantic vector representations can 4.1. SEMANTIC REPRESENTATION 77


be a useful tool to automatically cluster data for unsupervised learning or to classify input text. To analyze this, an SVD tool has been implemented using the GNU GSL library. The same labeled portion of the audiobook as in the experiments in Chapter 3 is used for this analysis. The first four chapters are labeled with expressive labels, where sentences have been split at direct and indirect speech breaks, each of them representing a document. The labels are designed to describe the expressiveness as well as possible, with labels like surprise-happy or excited-angry, trying to capture all the nuances of expressiveness. This fine-grained labeling results in very unbalanced classes, where the largest one, labeled as neutral, contains almost half of the documents, the next classes contain only a few dozen documents each, and many classes are (almost) "hapax legomenon" classes. In order to make the statistics more robust, many classes are merged, reducing the total number to about 200. However, it must be noted that this summarization is a rather difficult task given the great variety of expressiveness and characters in the audiobook. A total of 1 079 documents, containing 25 742 terms (words, bigrams and trigrams), are used to train the vector space. As done before in the toy example, the dimensions of the vector space are reduced to 2 in order to visualize it. Some example visualizations are shown in figure 4.3.

Figure 4.3: LSI expression plots. (a) All units; (b) documents neutral vs all; (c) documents suspense vs all (no neutral); (d) documents exclamation vs all (no neutral).

Plot 4.3a shows the distribution of all units, where the color indicates the label of the document the unit belongs to. The distribution is highly nonlinear: it has a point of concentration, which contains the most general terms, and spiral-like arms. It can be assumed that the farther away a unit is from the center, the more specific its meaning. Plot 4.3b shows the distribution of documents with the label neutral versus the rest of the labels. There is a relatively clear separation of these two types of documents. Plots 4.3c and d show the same type of separation for suspense and exclamation labeled documents, respectively. In c and d, neutral labeled documents have been excluded in order to balance the visualization, because almost 50 per cent of the documents are labeled as neutral. Spatial separations can also be observed here, though a little less clearly than for the neutral labeled documents. In all plots there is a center of concentration, and the rest of the documents are distributed in a kind of arrow-like shape towards this center.

Since some separation is observable in the plots, a small preliminary experiment is conducted in which a classification of documents (=sentences) is attempted. For this, a random test set of 121 sentences is excluded from the training of the vector space and is used for the classification task. The classification is performed by calculating the distance of each test sentence to its N nearest neighbors (documents). The test sentence is assigned to the class K to which most of the nearest neighbors within a radius R belong. The radius was set to 0.1, though experiments with different radii were conducted, introducing variations especially to better account for documents which are farther away from the center, where there is more space between the neighbors. This classification task is a preliminary experiment; in order to train a more sophisticated model, more data is needed.

As to the results of the preliminary experiment, in the best cases, where the test sentences are part of a "large" class, the correct classification rate is as high as 70 per cent, i.e. 70 per cent of the samples belonging to a large class were classified correctly. Large classes are, for instance, the classes labeled as neutral and suspense. However, the classification of "smaller" classes is not successful, with values going down to only about 10 per cent correct classification. Since the test set is random, half of the test samples belong to neutral labeled classes, and regarding the next bigger classes, about 70 per cent of the data belongs to only 5 classes. When the neutral class is removed, the overall classification accuracy is not higher than 10 per cent, so a closer look is unnecessary. It seems that semantic vectors do contain information which can be interpreted as expressive, though with limitations. On the one hand, only test sentences belonging to big classes could be assigned more or less reliably. On the other hand, to achieve a good class separation, probably much more data is needed to train the semantic model, together with a more sophisticated classifier.
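A minimal sketch of this radius-restricted nearest-neighbour classification (illustrative only; the variable names and the majority-vote handling are assumptions, not the original implementation):

```python
import numpy as np
from collections import Counter

def classify(test_vec, train_vecs, train_labels, radius=0.1, n_neighbors=10):
    """Assign the majority label among the nearest training documents
    that lie within the given radius of the test sentence vector."""
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)
    order = np.argsort(dists)[:n_neighbors]
    inside = [i for i in order if dists[i] <= radius]
    if not inside:                       # no neighbor within the radius
        return None
    votes = Counter(train_labels[i] for i in inside)
    return votes.most_common(1)[0][0]
```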

4.1.4 Continuous Semantic Embeddings with Neural Networks

An alternative approach to LSI for semantic vector representations is to use neural networks. The basic idea is to train a neural network with a certain training criterion and to extract the term vector representation, called an embedding, from the output or an intermediate layer, as for example proposed by Bengio et al.(2003). A term in this case is not limited to words, bigrams, trigrams, etc.; the embedding can also be trained for arbitrary units such as letters, words, phrases, sentences, etc. The following sections explain how these representations can be improved. A popular model was proposed by Bengio et al.(2003) for jointly learning word vector representations and a statistical language model. The proposed network is a feed-forward architecture and consists of an input layer, a linear projection layer, a hyperbolic tangent hidden layer and an output softmax layer, where the training criterion is to predict the word probability as in:

\hat{P}(w_i \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{e^{y_{w_i}}}{\sum_{j=1}^{n} e^{y_{w_j}}}    (4.11)

where y_i are the unnormalized log-probabilities for each output word i. The input is N words with a 1-of-V encoding, where V is the vocabulary size. The projection matrix C is shared for all words. The input is the index of each word i. The architecture is illustrated in figure 4.4.

Figure 4.4: Neural Network based language model, from Bengio et al.(2003).

Mikolov et al.(2013) propose a derived architecture which significantly reduces the computational complexity. Their proposed architecture is similar to the one by Bengio et al., but they remove the non-linear hidden layer and share the projection layer among all words, not only the projection matrix. Concretely, they train two models, the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-Gram Model. The CBOW model predicts the current word w_t from the context of that word within a context window of length c. In contrast, the Skip-Gram model predicts the context words of the current word w_t by maximizing the average log probability as in:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)    (4.12)

Increasing the range of the context improves the quality of the resulting word vectors, but also increases the computational complexity. Also, words in the context window can be "skipped" when the prediction probabilities are too low. Figure 4.5 shows both system architectures in comparison. It has to be noted that the context words are predicted individually, not all at once, i.e. the output is always one context word. In contrast, for CBOW the input is constituted by all context words. Mikolov et al.(2013) improve the Skip-Gram model by adding a method for creating embeddings for phrases, such that for instance combinations like "New York" are considered to be one term and not a combination of the isolated terms "new" and "York". An alternative method which achieves good results is global vectors for word representation (GloVe), proposed by Pennington et al.(2014). This method uses word co-occurrence statistics to train the models and results in word vectors similar to those of Skip-Gram.

Figure 4.5: CBOW and Skip-Gram architectures, from Mikolov et al.(2013).

Section 4.2 proposes a method to predict acoustics (expressiveness) from semantics, and section 4.3 presents three experiments and results based on that method. In all three experiments, the Skip-Gram model is used to train semantic word vectors and to create phrase and sentence embeddings, since it is considered to handle rare words better, which can be an advantage for expressive speech. In preliminary experiments, it was compared to CBOW and achieved better results.
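Training such Skip-Gram word vectors can be sketched with the gensim library (an illustrative sketch assuming gensim ≥ 4.0 and an already tokenized corpus; the thesis itself uses the original word2vec implementation, and the window size below is an assumed value):

```python
from gensim.models import Word2Vec

# sentences: an iterable of token lists, e.g. from the Spanish Wikicorpus
sentences = [["harry", "potter", "vive", "en", "privet", "drive"],
             ["mi", "abuela", "vive", "en", "privet", "drive"]]

model = Word2Vec(
    sentences,
    vector_size=600,   # 600-dimensional vectors, as used for the sentence representations here
    sg=1,              # 1 = Skip-Gram, 0 = CBOW
    window=5,          # context window size (illustrative value)
    min_count=1,
    workers=4,
)
vec = model.wv["potter"]   # a 600-dimensional word embedding
```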

Stanford

Socher et al.(2013) propose a recursive neural tensor network to create embeddings and to predict sentiment probabilities of terms. Sentiment here is the valence, i.e. the positivity of the term. A term can be anything from word to sentence level. The network is trained on the labeled Stanford Sentiment Treebank, which is built from a movie review database. The sentences are labeled as positive or negative, reflecting the intention of the review publisher. Furthermore, the reviews have been split into subphrases and annotated on a sentiment scale using Amazon Mechanical Turk. All sentences are parsed with the Stanford Parser, as by Klein and Manning(2003), and stored as binary trees. The word vectors are initialized randomly and an RNN computes the parent vectors in the following hierarchy:

p_1 = f(b, c), \qquad p_2 = f(a, p_1)    (4.13)

where a, b, c are word vectors, f = tanh is a standard element-wise nonlinearity and p_i is the parent vector i. The hierarchy is illustrated as an example binary tree in Figure 4.6. The input to the sentiment parser is a sentence; the output can be the sentiment value (positive, negative, neutral), the probability, and the vector embedding of the sentiment for each binary node of the tree structure, from the top node

Figure 4.6: Example binary tree of the sentiment parser.

down to the word level. Chapter 5 presents a neural network based TTS which uses sentiment embeddings for expressive speech synthesis.

4.2 Predicting Acoustics from Semantics

Given a meaningful semantic representation of text, and the assumption that acoustics related to expressiveness can be derived from these semantic representations, a classification or regression model can be trained in order to predict acoustic classes or features. The method proposed here focuses on regression and on a vector-to-vector mapping from semantic vectors to acoustic feature vectors, as in figure 4.7.

Figure 4.7: Vector-to-vector mapping.

The mapping is a prediction task, where a statistical regression model learns to predict one vector from another. Many linear and nonlinear predictor models exist, such as:

• Linear Regression, where the dependent variable is predicted from one or more independent variables defining a hyperplane which best fits a cloud of points of a dataset, normally such that the sum of the squared errors of the distances between the data points and the hyperplane is minimal, as in:

y_k = w_k \cdot x_k + b    (4.14a)

\arg\min_{w} \; \| y_k - (w_k \cdot x_k + b) \|^2    (4.14b)

where x_k is the independent variable, y_k is the dependent variable, w_k are the model parameters and b is an error term.

• Classification and Regression Trees (CART), where at each node a question concerning the independent variables is asked, splitting the dataset. At the leaves, the values of the dependent variable are stored.

• Bayes-based approaches, where the probability of a dependent variable C given an independent variable F is given as:

C_n = \arg\max_{C_n} P(C_n \mid F_k) = \arg\max_{C_n} P(C_n) P(F_k \mid C_n)    (4.15)

A specific case of a Bayes-based approach is the neural network, as introduced in Section 2.4.

The relation between the semantics of the text and the acoustics of the expressiveness is, intuitively, highly nonlinear. The predictor model has to be sophisticated and capable of dealing with this nonlinearity. Neural networks are considered to be the most suitable approach. The concrete implementation for semantics-to-acoustics mapping with neural networks is described in section 4.3.2.

4.3 Experiments

This section describes the implementation of the prediction of acoustic features from semantics. The basic idea, as described above, is to use a regression model to predict the acoustic feature vector from a semantic one. The acoustic feature vector can then be used as a centroid, and the surrounding acoustic data can be used to form a cluster and to train speech synthesis models, similar to the approach in Chapter 3. The main difference is that the cluster centroid is predicted from semantics, which allows a series of applications. For instance, a text can be read with expressive voices where the expressiveness is adapted to each sentence as a function of its semantic vector. Another application could be a search engine for emotional speech, where a semantic vector is calculated for an emotional key word; the predicted acoustic vector can then be used to form a data cluster and train an emotional voice. The next section introduces the proposed system framework, from which these applications can be derived.

Figure 4.8: Framework of the proposed training approach.

4.3.1 Experimental framework

First, the semantic vector space is trained using the Skip-gram implementation in word2vec, as by Mikolov et al.(2013), on the Spanish portion of the Wikicorpus, as by Reese et al.(2010), with 120 million words. Then, a book is projected into the semantic vector space, such that for each book sentence a semantic vector is calculated. The book is also available as an audiobook, concretely exactly the same one as used in the experiments in Chapter 3, of approximately 8.8 hours of duration and containing 7903 sentences. The semantic vector for each sentence is calculated as the middle point between the vectors of the words which constitute the sentence. As the book is also available as an audiobook, for the utterances of each sentence an acoustic feature vector is extracted, namely the best performing feature vector from the experiments in Chapter 3. The acoustic feature vectors are composed of 600-dimensional i-vectors trained from MFCCs, 12-dimensional i-vectors trained from F0, and 8-dimensional vectors with syllable and silence statistics, 620 dimensions in total. The MFCC and F0 features are extracted using AHOCoder, as by Erro et al.(2011a), the syllable and silence durations with the Ogmios speech analysis tools, as by Bonafonte et al.(2006), and the i-vectors using the Kaldi software, as by Povey et al.(2011). A classifier is trained to predict the acoustic feature vectors from the semantic feature vectors. The whole process is illustrated in figure 4.8.¹ The classifiers differ between the experiments and will be specified in the respective sections.

¹Part of the proposed framework was developed at the University of Texas at El Paso under the supervision of Prof. Nigel Ward.

For the prediction of the acoustic feature vector, an input text is projected into the semantic vector space and the semantic vector is calculated for the input text. Next, it is fed into the classifier, which predicts the acoustic feature vector, as illustrated in figure 4.9. When the acoustic vector is predicted, it can be used as a centroid to form a data cluster and use it for voice training. In this work, speaker adaptation is performed on the cluster data to train synthetic expressive voices. The AVM is trained using the whole audiobook corpus. Further details will be explained in the respective experimental sections.

Figure 4.9: Framework of the proposed acoustic feature vector prediction.
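The sentence-level semantic vector described above (the middle point of the word vectors) can be sketched as follows (illustrative only, assuming a trained gensim word-vector model; the out-of-vocabulary handling is an assumption):

```python
import numpy as np

def sentence_vector(tokens, wv, dim=600):
    """Mean of the word vectors of the tokens found in the vocabulary."""
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:                   # no known word: fall back to the origin
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# usage with the Word2Vec model trained earlier:
# sent_vec = sentence_vector("harry potter vive en privet drive".split(), model.wv)
```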

4.3.2 Predicting Acoustic Feature Vectors from Semantic Vectors: an Analysis

For the first experiment, as published by Jauk et al.(2016), the corresponding models were trained first. The linguistic input feature vector is composed of three parts: the utterance in question and the left and right contexts are projected into the SVSM and the coordinates are extracted, 1800 in total, since the utterance and the two context vectors have a length of 600 each. The left and right contexts are composed of the three preceding and the three following words, respectively. The number of words to take into account was determined experimentally, with the performance declining from the fourth word on. The reason might be that the context becomes too specific and moves away in the semantic space, pushed by the words farther away from the sentence in question. Additionally, models were trained which predict the acoustic features not only from the semantic vectors, but also taking into account the acoustic feature vectors of the previous utterances, i.e. the acoustic context. In the prediction, the previously predicted feature vector is used for the next utterance, except for the first utterance, which uses the acoustic corpus center as a starting point.

Two predictor models are tested. The first one is a CART network. The CART network is designed such that each tree predicts one value of each dimension of the output vector. The input vector, in this case the semantic vector, is shared for all dimensions of the output vector, in this case the acoustic feature vector, as illustrated in Figure 4.10.

Figure 4.10: CART network design.

also relatively large (1024 in the case of only semantic coordinates, 1500 for the semantic and acoustic combination). The next layer shrinks down to 256 neurons. There are several hidden layers with this number of neurons, which is then increased to 512, and to 620 in the output layer. In the case of the semantic prediction best results were achieved with 10 hidden layers. In the case of the semantic and acoustic combination the number of hidden layers is 5.

Figure 4.11: DNN framework. 4.3. EXPERIMENTS 87
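The bottleneck topology described above can be sketched as follows, assuming PyTorch. The layer sizes follow the text (a wide first hidden layer, a stack of 256-unit layers, an expansion to 512, and a 620-dimensional tanh output with dropout of 0.5 in between); the ReLU hidden activations and the exact number of 256-unit layers are assumptions, since the thesis does not specify them.

```python
import torch
import torch.nn as nn

def bottleneck_dnn(in_dim=1800, first=1024, n_bottleneck=7, out_dim=620, p_drop=0.5):
    """Bottleneck regressor: wide layer -> several 256-unit layers -> 512 -> tanh output.
    One possible layout yielding the 10 hidden layers reported for the semantics-only case."""
    layers = [nn.Linear(in_dim, first), nn.ReLU(), nn.Dropout(p_drop),
              nn.Linear(first, 256), nn.ReLU(), nn.Dropout(p_drop)]
    for _ in range(n_bottleneck):                       # 256-unit bottleneck block
        layers += [nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p_drop)]
    layers += [nn.Linear(256, 512), nn.ReLU(), nn.Dropout(p_drop),
               nn.Linear(512, out_dim), nn.Tanh()]      # outputs normalized to [-1, 1]
    return nn.Sequential(*layers)

model = bottleneck_dnn()
semantic_input = torch.randn(8, 1800)                   # batch of semantic coordinates
predicted_acoustics = model(semantic_input)             # (8, 620) acoustic feature vectors
```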

Table 4.3: Distance results. Means and variances of distances to the original acoustic feature vectors.

       CART.sem  CART.acu  DNN.sem  DNN.acu  rand
MEAN   2.44      2.42      1.89     1.89     2.69
VAR    0.51      0.37      0.38     0.38     0.51

Four excerpts from the audiobook, a total of 106 utterances, were selected for the evaluation. The test set was excluded from training. All test utterances and their contexts were projected into the SVSM, obtaining the semantic coordinates. Then, acoustic coordinates were predicted from the semantics for each of the four experimental conditions: (1) CART with semantics only; (2) CART combined with acoustics; (3) DNNs with semantics only; and (4) DNNs combined with acoustics. The predicted acoustic feature vectors were compared to the original feature vectors extracted from the corpus for the same utterances, measuring the Euclidean distance. As a reference for the distance measure, the same test set was randomized and the distances between the original and the shuffled vectors were computed. Also, a closer analysis of the distances between the original and the predicted vectors is conducted.
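A minimal NumPy sketch of this evaluation is given below, assuming the predicted and original 620-dimensional vectors are available as arrays; the data are synthetic placeholders, and only the procedure mirrors the description above.

```python
import numpy as np

def eval_distances(predicted, original, seed=0):
    """Per-utterance Euclidean distances of predictions vs. a shuffled-reference baseline."""
    pred_dist = np.linalg.norm(predicted - original, axis=1)      # prediction error
    rng = np.random.default_rng(seed)
    shuffled = original[rng.permutation(len(original))]           # randomized reference
    rand_dist = np.linalg.norm(original - shuffled, axis=1)
    return pred_dist.mean(), pred_dist.var(), rand_dist.mean(), rand_dist.var()

# Toy example: 106 test utterances with 620-dimensional acoustic feature vectors.
orig = np.random.randn(106, 620)
pred = orig + 0.1 * np.random.randn(106, 620)
print(eval_distances(pred, orig))
```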

Results

Table 4.3 shows the results for the distance measures between the predicted feature vectors and the original feature vectors, in comparison to the distance of the original vectors to randomized original vectors. ANOVA is used to test the significance of the difference between these distances. Table 4.4 shows the ANOVA F-values for the distances. The DNNs produced predictions that significantly differ from random. CART produced a slightly better prediction when the left acoustic context was included in the predictor vectors, reflected in the distance variance. However, looking at the ANOVA F-values, although they are higher than the critical F-value, the p-values for the CART predictions are 0.003 and 0.006, for the predictions with only semantic vectors and with the acoustics included, respectively. So, the CART prediction is probably not significantly better than the shuffled data. Between CARTs and DNNs, there is a significant difference in performance. However, combining semantic and acoustic features for the prediction did not result in any significant improvement.

Figure 4.12 shows the distances between the original acoustic vectors and the predicted vectors, for the four conditions and the 106 utterances. The lower the line, the better the prediction. The DNN predictions with semantics alone and with the combination with the acoustics are so similar that the lines practically overlap.

It can be observed that for some utterances the prediction is worse than for others. There are some peaks of larger distances, especially around utterances 8, 36, 63 and in the area between 66 and 80. Utterance 8 is just a "yes", so there is not much expressive information encoded; utterance 36 is a proper name, also difficult to relate to prominent expressiveness, at least without taking into account a larger context or world knowledge. Utterance 63 just says "exclaims", also with very little context ("perfect" on the left and "to bring your" on the right). The area between 66 and 80 is a conversation in very general terms, including phrases like "yes, please", "hello", "he said" and some more.

Table 4.4: ANOVA results between the four conditions and random. α = 0.05, critical F = 3.8861. Values marked with * have a p value above 0.0025.

          CART.sem  CART.acu  DNN.sem  DNN.acu
rand      7.852*    8.900*    76.897   76.908
CART.sem  -         0.044     43.551   43.561
CART.acu  -         -         40.520   40.530
DNN.sem   -         -         -        0.000
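For reference, a comparison of this kind can be reproduced with a one-way ANOVA in SciPy, assuming the per-utterance distances of two conditions are available as arrays; the values below are synthetic.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical per-utterance Euclidean distances for two conditions (e.g. DNN.sem vs. rand).
rng = np.random.default_rng(0)
dist_dnn_sem = 1.89 + 0.6 * rng.standard_normal(106)
dist_random  = 2.69 + 0.7 * rng.standard_normal(106)

f_value, p_value = f_oneway(dist_dnn_sem, dist_random)   # one-way ANOVA between the two groups
print(f"F = {f_value:.3f}, p = {p_value:.5f}")
```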

Figure 4.12: Euclidean distances between the predicted and the original acoustic feature vectors for the 106 utterances.

Possibly, longer utterances encode more expressive information than short ones. In any case, it is clear that a reasonable prediction is possible for a substantial portion of the utterances, and that the deep neural networks show better performance for the given task.

4.3.3 Automatic Expressive Reading of Text

The second experiment evaluates the usability of the proposed prediction approach for expressive speech synthesis. Two tasks are designed for the evaluation. The first task consists of reading a paragraph of the book which was used to train the voices. This task serves as a topline (proof of concept) since the audiobook is known. Two classifiers are used for the prediction: a DNN predictor and a nearest-neighbor selection method. The paragraph was excluded from the training of the DNN predictor (not applicable to the nearest-neighbor method). The paragraph is a dialogue between two book characters and narrator comments; it contains 16 sentences. Each sentence is projected into the semantic vector space and its coordinates are extracted. In the DNN prediction, acoustic coordinates are predicted for each sentence, which are then used as centroids to select 50 nearest neighbors in the acoustic feature space. In the alternative nearest-neighbor method, 50 nearest neighbors are selected directly from the acoustic space using the predicted acoustic feature vector as a centroid. In both cases, an individual voice is trained for each sentence of the input text. For a baseline comparison, the paragraph is synthesized using a neutral voice, trained from approximately 10 hours of studio-recorded read speech. The participants are asked to rank the three voices by best expressive performance and by quality.

The second task is similar to the first task, but the paragraph to synthesize is extracted from a new book that has neither been projected into the semantic vector space nor used for the DNN training. However, to maintain the context, this book is the continuation of the first book. The DNN prediction is identical to the one in the first task: acoustic coordinates are predicted for each sentence, which are then used as centroids to select 50 nearest neighbors in the acoustic feature space. Here, in the alternative method, the semantic vector calculated for each sentence is used as a centroid in the semantic vector space (as opposed to the first task, where the predicted acoustic vector was used as a centroid in the acoustic vector space). Then, 50 nearest neighbors are selected in the semantic vector space, and the corresponding acoustic feature vectors of the selected sentences are used to train the voices. Again, with both methods, an individual expressive voice is trained for each sentence. And also here, as a baseline, the paragraph is synthesized using the neutral voice.

Since whole paragraphs are synthesized for both tasks, context can be used to achieve a better prediction. So, each predictor vector is composed of the semantic vector for the sentence in question and of two additional vectors, for the left and the right contexts respectively. These context vectors are calculated using the three closest words on the left and the three closest words on the right of the sentence in question, as in the first experiment.
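The nearest-neighbor cluster selection step shared by both methods can be sketched as follows, using scikit-learn; `corpus` stands for the 620-dimensional acoustic feature vectors of the audiobook utterances, and all names and data are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_adaptation_cluster(centroid, corpus_vectors, k=50):
    """Return the indices of the k corpus utterances closest to a (predicted) centroid."""
    nn = NearestNeighbors(n_neighbors=k).fit(corpus_vectors)   # Euclidean by default
    _, idx = nn.kneighbors(centroid.reshape(1, -1))
    return idx[0]                       # utterance indices used for speaker adaptation

# Toy data standing in for the 620-dim acoustic feature vectors of the audiobook.
corpus = np.random.randn(7903, 620)
predicted_centroid = np.random.randn(620)
cluster_ids = select_adaptation_cluster(predicted_centroid, corpus)
print(cluster_ids.shape)                # (50,)
```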

Results

A total of 21 subjects participated in the experiment, some of them experienced with speech technology (either development or usage), others not. Table 4.5 presents the results for the first two tasks. Participants had the option to prefer two of the synthesized paragraphs, or all of them, if they sounded equal, again regarding expressiveness and quality. In both tasks, regarding expressiveness, the results show a clear preference for both synthesis methods over the neutral voice. In the first task, the nearest-neighbor method was chosen as better than, or at least equal to, the DNN-based method. In the second task, almost half of the subjects rated the nearest-neighbor method and the DNN-based method as equally good; otherwise, the DNN method was slightly preferred. In synthesis quality, there was no significant preference for any of the voices.

Table 4.5: Prediction method preferences by users for the first two tasks. DNN method, nearest neighbor (NN) method, neutral voice.

        DNN   NN    neutral  DNN=NN  NN=neutral
Task 1  0.19  0.43  0.0      0.38    0.0
Task 2  0.29  0.14  0.04     0.48    0.05

4.3.4 Creating adhoc Expressive Voices

This task evaluates the system as a search engine for emotional training data. Key words or phrases which semantically represent the emotion of the desired synthetic voice are introduced into the system. Three emotional voices have been trained: happy, from the key word "happy"; angry, from the key phrase "I don't want!"; and suspense, from the key phrase "Mysterious secret in silent obscurity." The keywords are projected into the semantic vector space, obtaining an embedding for each of them. Twenty sentences which are nearest neighbors to these embeddings in the semantic vector space are selected from the audiobook. Since this task is a data search task with the goal of training emotional voices, rather than automatic reading, minimal manual intervention was allowed: the best sentence is selected manually from the 20 sentences proposed by the system. The criterion for the selection is the individual impression of how suitable the sentence is to represent the desired emotion. Then, the acoustic feature vector extracted from the selected candidate is used to form a data cluster of 50 nearest neighbors, which is used for voice adaptation. Although a manual selection is involved, the effort is negligible in comparison to labeling a corpus. Using the three expressive voices and the neutral voice, seven sentences are synthesized. The sentences are designed to semantically reflect the emotion of the voices. The sentences are listed below, translated from Spanish.

• Happy1: We have won the paella competition.

• Happy2: Finally, the holidays begin.

• Angry1: You are an idiot. Never speak to me again.

• Angry2: You are a goof-off. If you don’t push yourself we won’t win anything.

• Suspense: In the middle of the night, a silent shadow moved along the corridor.

• Sad: We haven’t won the paella competition.

• Neutral: In many civilizations seven-days weeks are in use.

As seen in the sentence list, there is a sad sentence, but there is no sad voice. This is because it is hypothesized that the suspense voice can also be used for sad or even neutral content. The same might also be true for the neutral voice.

Results

Table 4.6: Task 3. Voice preference by users for each sentence.

          happy  angry  suspense  neutral
Happy1    0.29   0.38   0.24      0.10
Happy2    0.52   0.24   0.10      0.14
Angry1    0.14   0.48   0.24      0.14
Angry2    0.38   0.43   0.14      0.05
Suspense  0.0    0.05   0.81      0.14
Sadness   0.19   0.05   0.43      0.43
Neutral   0.10   0.05   0.43      0.43

A total of 21 subjects participated in the listening test, some of them experienced with speech technology (either development or usage), others not. The task was to choose the most suitable of the four voices, happy, angry, suspense and neutral, for each of the seven sentences presented above. Table 4.6 shows the test preferences for the ad hoc voices. The first happy sentence is divided between the three expressive voices, with a slight preference for the angry voice. For the second happy sentence, there is a clear preference for the happy voice. A possible explanation for the distribution of the first happy sentence is that it might be ambiguous to the listeners. Moreover, the happy and the angry voices sound similar and can be appropriate for both types of sentences. In fact, the angry voice does not sound really angry; it is more of a "book" angry, meant for children. For the first angry sentence there is a clear preference for the angry voice, with the happy voice in second place. The second angry sentence is divided between the happy and the angry voice. For the suspense sentence there is a very clear preference for the suspense voice. The sad and neutral sentences are divided between the suspense and the neutral voice. There is no explicit sad voice; however, the suspense voice does a good job imitating sadness and, as the results show, so does the neutral voice.

4.4 Discussion

This chapter has summarized the prediction of relevant acoustic information from semantic vector embeddings extracted from plain text. First, different methods of calculating the embeddings have been presented, ranging from simple bag-of-words models to vectors extracted with neural networks. Then, it has been explained how these embeddings can be used as input to regression models in order to predict acoustic feature vectors. Finally, experiments with different applications have been described and the results have been presented.

Vector embeddings are an efficient tool to describe semantic (and syntactic) phenomena. As has been shown, they can also be used to predict expressiveness, or rather expressiveness-related acoustic features, from plain text. The quality of the prediction rises with sentence length. Also, Socher et al. (2013), discussing their sentiment training database, mention that longer sentences tend to be positive or negative, while shorter ones tend to be neutral. However, in real dialogue situations expressions are communicated in all types of utterances, where short utterances can be very expressive (for instance, a very determined "Yes!" or "No!"). Pennebaker (2011) states that in speech, expressiveness is often communicated in semantically rather secondary words such as pronouns.

Regarding automatic paragraph reading, although the tasks were completed successfully, there is still a lot of room for improvement. Context is surely very important for the prediction, as is the way it is handled. On the other hand, the embeddings themselves should be designed to represent expressiveness rather than only semantics. For the emotional search and voice training, the proposed approach turned out to be very effective. It would be interesting to evaluate its effectiveness on larger and more "real-life" data, maybe interviews, and also to expand it to other types of expressiveness, not only emotions.

Normally, the effectiveness of genuine semantic vectors relies on the meaning which can be extracted from word structure, word combinations, word contexts, etc. However, expressiveness can only partly be extracted from pure semantics; here the pragmatic, i.e. situational, meaning is much more important. The Stanford sentiment parser does predict pragmatically relevant embeddings which are strongly correlated with the prosody and expressiveness of the sentence, as will be discussed in Chapter 5. However, the training data for the sentiment parser is manually labeled and not automatically derived from text. Automatic pragmatic prediction from text remains an open issue in expressive speech synthesis. Additionally, the Stanford parser is trained on written material only, such that there is no guaranteed connection between the sentiment of a written sentence and the expressiveness of the same sentence if it were actually spoken.

Another discussion topic is the HMM-based synthesis method, which produces speech of only moderate quality. One problem with HMM-based synthesis, among others, is the clustering of speech. Since the clustering is performed using syntactic and acoustic criteria, it does not guarantee optimal clusters of expressive speech, including data which should be outside of the cluster and excluding data which should be inside. This rough and non-optimal separation contributes to artifacts and lowers synthetic speech quality.
A more effective method is the use of neural networks, which use all the training data to optimize the synthetic output. A first approach to expressive speech synthesis using neural networks is presented in the next chapter.

Chapter 5

NN-based expressive speech synthesis with sentiment embeddings

In the previous study, semantic vector representations of text were used to perform a look-up in the training corpus for expressive speech data matching the textual input, such that, relying on semantic information, data clusters were used to train expressive voices via speaker adaptation. A logical evolution of this study is to use embeddings which are more dedicated to the expressiveness in text. The earlier introduced Stanford sentiment parser is such a tool: it provides vector embeddings reflecting the sentiment, i.e. the positiveness or negativeness, of the text. For more details refer to Section 4.1.4. The Stanford parser is trained on labeled movie reviews, originally collected and published by Pang and Lee (2005). The input to the Stanford parser is a textual unit, a single word or more. First, the input is parsed and converted into a binary tree structure. Then, for each node the system predicts a sentiment. The output format can be a single value (positive, negative or neutral), a probability of belonging to one of the five categories very positive, positive, very negative, negative or neutral, or a vector embedding in a sentiment vector space.

A further improvement is the migration from HMM-based synthesis to DNN-based synthesis. A main drawback of HMM-based synthesis is that the training data is clustered. This is a disadvantage because the clustering relies on extracted features, in this case features representing expressiveness; however good the features are, there will always be an error. As a consequence, certain data points which should belong to the training cluster end up outside it, and certain others, which do not belong to the training cluster, end up inside. DNN-based synthesis, in a certain manner, avoids this problem because the network sees the complete data set, and the neurons "decide", according to the training criterion, which output data (speech) corresponds to which input data (in this case, embeddings). In this sense, there is a kind of abstract internal clustering optimized according to the training criterion. Both concepts are shown in Figure 5.1.


Figure 5.1: Training from data, SAT vs. DNN. (a) SAT training from data clusters. (b) DNN training from the whole data.

In previous work, neural network based systems have already been combined with semantic vector input, though not for expressive speech. To name a few, Wang et al. (2015) use word embeddings to substitute ToBI and POS tags in RNN-based synthesis, achieving a significant system improvement. Wang et al. (2016) enhance the input to NN-based systems with continuous word embeddings, and also try to substitute the conventional linguistic input with the word embeddings. They do not achieve a performance improvement; however, when they use phrase embeddings combined with phonetic context, they do achieve a significant improvement in a DNN-based system. Wang et al. (2016) also enhance word vectors with prosodic information, i.e. update them, achieving significant improvements.

In comparison to these systems, the system proposed here uses sentiment embeddings, i.e. embeddings with an expressive meaning. Some speech synthesis systems have already used sentiment information. For instance, Trilla and Alias (2013) used sentiment analysis at the sentence level for an expressive TTS. Vanmassenhove et al. (2016) used sentiment combined with emotion labels for an HMM-based system. Sudhakar and Bensraj (2014) implemented a TTS in Matlab which used sentiment information trained with fuzzy neural networks, evaluated in a news domain. Differently from these systems, the proposed system uses sentiment for a DNN-based TTS in an audiobook domain, which is considerably open and rich in expressive speech.

Section 5.1 describes the system architecture of the DNN-based synthesis with sentiment embeddings. Section 5.2 presents objective statistics measured on pitch for sentiment-analyzed corpora. Section 5.3 presents a preliminary experiment, and Section 5.4 presents the main experiment with the corresponding results, followed by a discussion in Section 5.5.

5.1 System architecture

The proposed architecture is basically an extension of the system presented in Figure 2.12, where the DNN receives an additional input, the sentiment vectors, as shown in Figure 5.2.

Figure 5.2: Proposed DNN system architecture using sentiment embeddings.

The underlying DNN system has the following specifications, as by Takaki and Yamagishi (2016). For each utterance, 60-dimensional MFCC vectors, log F0 and 25-dimensional band aperiodicity measures are extracted, together with their dynamic and acceleration features. The log F0 is linearly interpolated, and voiced/unvoiced marks are used as parameters. Combilex, by Richmond et al. (2009), is used to create the context label files. First, an HMM-based training is performed, estimating the phoneme boundaries. Then, the deep neural network is trained. The DNN is implemented with 5 hidden layers of 1024 neurons each, a minibatch size of 256, and Adagrad gradient optimization. The STRAIGHT vocoder, as by Kawahara et al. (1999), is used to generate the waveform.

As proposed, an additional linguistic input is introduced: the sentiment predicted by the Stanford sentiment parser. Different input combinations are tested. Probabilities and embeddings are used alternatively in the following configurations:

• Sentence level (sl): Probabilities and embeddings on sentence level are added to the input.

• Word level (wl): Word level probabilities and embeddings are used.

1 The system was developed at the National Institute of Informatics (NII) in Tokyo and the proposed implementation was realized in cooperation with NII under the supervision of Prof. Junichi Yamagishi, in the framework of the NII International Internship Program.

• Word context and tree distance (wcd): Word context includes word level embeddings with two word embeddings on the left and on the right of the current word. It also includes the hierarchical tree distance for each word, i.e. the distance measured in number of tree nodes which separate two words.

• Sentence level, word context and tree distance (swcd): Here, the above condition is combined with the sentence level embedding with the aim to stabilize the overall utterance prosody.

To illustrate the input vectors, the probability vectors are composed as follows:

P = [p_{vneg}, p_{neg}, p_{neu}, p_{pos}, p_{vpos}]   (5.1)

where p_{vneg} is the probability of the category very negative, p_{neg} the probability of the category negative, etc. The probability vectors are provided on sentence level (sl) and on word level (wl), in the respective cases. When word context was taken into account, the probability vector of the word in question and the probability vectors of the two words on the left and the two words on the right were used. Also the tree distance, which is the hierarchical distance counted in the number of binary tree nodes between words, is added, such that the input vector for each word for the (wcd) system is composed as follows:

P = {P_{l2}, P_{l1}, P_{c}, P_{r1}, P_{r2}, D_t}   (5.2)

where P_{c} is the probability vector for the current word, P_{l2} is the probability vector for the second word on the left, P_{l1} for the first word on the left, P_{r1} for the first word on the right and P_{r2} for the second word on the right, each of the probability vectors as defined in equation 5.1. D_t is the hierarchical tree distance. In the case where, instead of the probabilities, the semantic embeddings from the sentiment vector space are used directly (v wcd), the composition is identical to equation 5.2 except that instead of the probability vectors, the vector embeddings are used, each of dimension 25.

On the technical side, the vectors were always inserted at frame level. So, for instance, when sentence level probabilities were used, an additional vector of five probabilities was added for each frame, equal for every frame of the sentence. When using word level probabilities, the additional vector changed at word boundaries. In order to better understand the influence of the sentiment information, objective measures on prosody are performed. The results are presented in Section 5.2.
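The following sketch illustrates how such frame-level inputs could be assembled with NumPy for the sentence-level and word-level cases; the array shapes and names are assumptions for illustration, not the actual implementation.

```python
import numpy as np

def add_sentence_sentiment(linguistic_frames, sentiment_probs):
    """Append a sentence-level 5-dim sentiment probability vector to every frame."""
    n_frames = linguistic_frames.shape[0]
    tiled = np.tile(sentiment_probs, (n_frames, 1))            # same vector for all frames
    return np.hstack([linguistic_frames, tiled])

def add_word_sentiment(linguistic_frames, word_probs, frames_per_word):
    """Append word-level sentiment vectors, changing at word boundaries."""
    per_frame = np.repeat(word_probs, frames_per_word, axis=0)  # expand to frame level
    return np.hstack([linguistic_frames, per_frame])

frames = np.random.randn(300, 400)                  # toy frame-level linguistic features
p_sent = np.array([0.01, 0.02, 0.89, 0.07, 0.01])   # [vneg, neg, neu, pos, vpos]
print(add_sentence_sentiment(frames, p_sent).shape)   # (300, 405)

word_p = np.tile(p_sent, (5, 1))                    # toy: 5 words, same distribution
fpw = np.array([60, 60, 60, 60, 60])                # frames per word (sums to 300)
print(add_word_sentiment(frames, word_p, fpw).shape)  # (300, 405)
```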

5.2 Objective test

The objective test consists of two parts. First, pitch statistics are calculated for two corpora as a function of the sentiment category predicted by the Stanford parser for each of the sentences of the corpora. The pitch is extracted using the pitch tool from the SPTK toolkit. Second, the DNN system is trained using different sentiment inputs, among the combinations described above, and the pitch predicted by the systems trained with sentiment vectors is compared to the pitch predicted without sentiment vectors, showing by example how the DNN system learns from the sentiment input.

For the first part, two corpora are used: (1) a corpus of neutral speech (c1), containing 1000 sentences with a total duration of approximately 1 hour, read by a professional female speaker of American English; (2) a clean portion of an audiobook (c2), containing 5039 sentences with a total duration of approximately 5 hours, read by a semiprofessional male speaker of American English. For each sentence of both corpora, the sentiment category is calculated using the Stanford parser. From the corresponding utterance, F0 statistics are calculated, including means, variance, range, range means and range variance. The numbers of sentences belonging to each sentiment category for the neutral corpus (c1) are listed in Table 5.1. The extreme categories very positive and very negative are strongly underrepresented. Since this is a neutral corpus, the speaker was probably instructed not to reflect the content of the sentences expressively.
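A sketch of how per-category pitch statistics of this kind could be computed is given below. The thesis does not spell out the exact definitions of the reported statistics, so the sketch computes one plausible variant (per-sentence F0 means and ranges, aggregated per category); all data are toy placeholders.

```python
import numpy as np
from collections import defaultdict

def pitch_stats_by_category(f0_tracks, categories):
    """Aggregate per-sentence F0 means and ranges by predicted sentiment category."""
    grouped = defaultdict(list)
    for f0, cat in zip(f0_tracks, categories):
        voiced = f0[f0 > 0]                       # drop unvoiced frames (F0 == 0)
        if voiced.size:
            grouped[cat].append((voiced.mean(), voiced.max() - voiced.min()))
    stats = {}
    for cat, vals in grouped.items():
        means, ranges = np.array(vals).T
        stats[cat] = dict(mean=means.mean(), var=means.var(),
                          range=ranges.max(), r_mean=ranges.mean(), r_var=ranges.var())
    return stats

# Toy example: three sentences with random "F0 tracks" and parser categories.
tracks = [np.abs(np.random.randn(200)) * 30 + 120 for _ in range(3)]
cats = ["negative", "neutral", "positive"]
print(pitch_stats_by_category(tracks, cats))
```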

Table 5.1: Number of sentences for each sentiment category for the neutral speech corpus c1.

category        #sentences
very negative   2
negative        517
neutral         279
positive        198
very positive   4

Table 5.2: Pitch statistics as a function of sentiment for the neutral speech corpus c1. Listed are mean, variance, range, range-mean and range-variance.

                mean  var  range  r-mean  r-var
very negative   261   67   432    252     20
negative        271   55   429    228     34
neutral         264   66   502    248     36
positive        266   65   490    240     38
very positive   264   66   539    244     33

Table 5.2 shows the statistics for the neutral corpus c1. In general, the values do not show significant changes, except for the pitch range, which seems to be higher for the neutral and positive categories in comparison to the negative ones. It has to be taken into account that the two extreme categories, very negative/positive, are strongly underrepresented and thus cannot be relevant for the statistics.

The numbers of sentences belonging to each sentiment category for the audiobook corpus (c2) are listed in Table 5.3. Also here, the extreme categories are underrepresented. It has to be noted that the audiobook speaker is in general not very expressive. On the other hand, there is no reflection of the acoustic expressiveness in the sentiment values, since these are predicted from text only. Examining the sentences, the prediction is not always perfect, yielding many "erroneous" sentiment categories. This means that often the category predicted by the parser is not the category one would intuitively assign to a sentence. For instance, the sentence "Jim shook his head and said:" is labeled as positive, though intuitively it is rather neutral, as is the reader's interpretation. Another example, "Nothing less than a great magnificent inspiration!", is labeled as negative, though intuitively it is rather positive, as is the reader's interpretation. These are only two examples; there are many more of this kind in the audiobook.

Table 5.3: Number of sentences for each sentiment category for the audiobook corpus c2.

category        #sentences
very negative   7
negative        1901
neutral         2389
positive        721
very positive   21

Table 5.4: Pitch statistics as a function of sentiment for the audiobook corpus c2.

                mean  var  range  r-mean  r-var
very negative   105   24   224    113     33
negative        120   29   431    125     45
neutral         121   28   734    108     57
positive        122   30   399    122     46
very positive   135   36   355    135     51

Table 5.4 shows the statistics for the audiobook corpus c2. Also here, the extreme categories are poorly represented, although better than in the neutral corpus. The neutral category has the largest range but the smallest mean range. Apart from this, there is a slight tendency of F0 to increase for the positive categories in comparison to the negative ones.

In order to better understand how the neural networks learn with the additional sentiment information, some example sentences are visualized. The graphics show the sentiment category (purple line), the probability of that category (blue line), the predicted F0 curve (green line), and the predicted F0 curve without sentiment information (red dashed line), for each word in the sentence. The categories are plotted with index numbers between 0 (= very negative) and 4 (= very positive).

The first visualization is for the sentence "My house is green with a big yellow door.", categorized as neutral at the sentence level, with a probability of 0.8946. The individual categories and probabilities for the words are listed in Table 5.5. All words are categorized as neutral with very high probabilities.

Table 5.5: Word categories and probabilities for the neutral sentence.

word     category  probability
my       neutral   0.9974
house    neutral   0.9848
is       neutral   0.9894
green    neutral   0.9689
with     neutral   0.9917
a        neutral   0.9902
big      neutral   0.9947
yellow   neutral   0.9552
door     neutral   0.9812

As can be seen in Figure 5.3, the general F0 shape for the neutral sentence is similar across the models, though there are slight differences in accentuation. In all versions, there is a stronger accent at the beginning, on the word "house". For the word-level system, the strongest sentence accent, on the word "green", is lower, and there is a stronger accent on the first syllable of the word "yellow". For the system with sentence level, word context and tree distance, there is a stronger accent on the word "door".
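Plots of this kind can be produced with matplotlib along the following lines; the per-word values below are invented for illustration, and plotting the (scaled) F0 on the same axis as the category index is a simplification of the original figures.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative per-word values; a real plot would use forced-aligned word boundaries.
words     = ["my", "house", "is", "green", "with", "a", "big", "yellow", "door"]
sent_cat  = [2] * len(words)                         # 0 = very negative ... 4 = very positive
prob      = [0.99, 0.98, 0.99, 0.97, 0.99, 0.99, 0.99, 0.96, 0.98]
f0_sent   = np.array([180, 210, 170, 200, 160, 150, 175, 190, 165])   # with sentiment
f0_nosent = np.array([182, 205, 172, 205, 158, 152, 170, 180, 168])   # without sentiment

x = np.arange(len(words))
plt.plot(x, sent_cat, color="purple", label="sent")
plt.plot(x, prob, color="blue", label="prob")
plt.plot(x, f0_sent / 50.0, color="green", label="f0 (scaled)")
plt.plot(x, f0_nosent / 50.0, "r--", label="cmpf0 (scaled)")
plt.xticks(x, words, rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
```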

Table 5.6: Word categories and probabilities for the negative sentence.

word        category  probability
the         neutral   0.9933
awful       neutral   0.6307
soundtrack  neutral   0.9915
was         neutral   0.9960
disgusting  negative  0.5070
and         neutral   0.9960
made        neutral   0.9992
me          positive  0.9462
puke        neutral   0.7497

The next sentence is "The awful soundtrack was disgusting and made me puke.", categorized as negative at the sentence level, with a probability of 0.6254. The individual categories and probabilities for the words are listed in Table 5.6. Most of the words are categorized as neutral with high probabilities, surprisingly also the words "awful" and "puke", although with lower probabilities. Intuitively, these two words would be negative. Also unexpectedly, the word "me" is categorized as positive. The only word which is categorized as negative is "disgusting", and it apparently determines the rest of the classification.

Figure 5.4 shows the plots for the negative sentence. Also here, the general shape of the F0 curve is similar, but there are slight differences. For instance, the systems with word level and with word context and tree distance information make a stronger accent on the word "disgusting". The word level system also accentuates the word "made". The two systems with word context and tree distance make a stronger accent at the end, on the word "puke".

Figure 5.3: Pitch visualization for the neutral sentence. Panels: (a) sentence level; (b) word level; (c) word context and tree distance; (d) sentence level, word context and tree distance. For each panel: sent, purple line, is the predicted sentiment category on word level; prob, blue line, is the probability of that category; f0, green line, is the predicted F0 curve with the sentiment; and cmpf0, red dashed line, is the predicted F0 curve without sentiment.

The next sentence is "A woman's hair is wonderful.", categorized as positive at the sentence level, with a probability of 0.5699. The individual categories and probabilities for the words are listed in Table 5.7. "Wonderful" is the only word categorized as positive; the other words are neutral. Also, "'s" in "woman's" is interpreted as a word. Figure 5.5 shows the plots for the positive sentence. Here, the sentence level and especially the word level systems have accentuated the word "wonderful", which is the positive one. On the contrary, the word context and tree distance systems have accentuated the word "hair".

Figure 5.4: Pitch visualization for the negative sentence. Panels: (a) sentence level; (b) word level; (c) word context and tree distance; (d) sentence level, word context and tree distance. For each panel: sent, purple line, is the predicted sentiment category on word level; prob, blue line, is the probability of that category; f0, green line, is the predicted F0 curve with the sentiment; and cmpf0, red dashed line, is the predicted F0 curve without sentiment.

Table 5.7: Word categories and probabilities for the positive sentence.

word       category  probability
a          neutral   0.9951
woman      neutral   0.9957
's         neutral   0.9945
hair       neutral   0.9939
is         neutral   0.9894
wonderful  positive  0.8968

The last sentence is "I was extremely happy with the movie.", categorized as very positive at the sentence level, with a probability of 0.6221. The individual categories and probabilities for the words are listed in Table 5.8. The word "happy" is categorized as very positive, the others as neutral.

Figure 5.6 shows the plots for the very positive sentence. Here, the most positive word, "happy", is not accentuated; the strongest accent lies on "extremely", which is categorized as neutral. The sentence and word level systems put a stronger accent on "I" at the beginning and flatten the accent on "extremely". The other two systems reproduce the F0 shape without sentiment.

Figure 5.5: Pitch visualization for the positive sentence. Panels: (a) sentence level; (b) word level; (c) word context and tree distance; (d) sentence level, word context and tree distance. For each panel: sent, purple line, is the predicted sentiment category on word level; prob, blue line, is the probability of that category; f0, green line, is the predicted F0 curve with the sentiment; and cmpf0, red dashed line, is the predicted F0 curve without sentiment.

These examples show that there is a clear change in pitch prediction caused by the sentiment probabilities which are provided as an additional input to the DNN. Also, depending on the configuration of this additional input, the changes are different. It is also clear that the sentiment parser is not always perfect and its interpretation is not always the intuitive one. For instance, it was never possible to obtain the categorization "very negative" among the test sentences.

Table 5.8: Word categories and probabilities for the very positive sentence.

word       category       probability
I          neutral        0.9962
was        neutral        0.9960
extremely  neutral        0.9787
happy      very positive  0.7602
with       neutral        0.9917
the        neutral        0.9941
movie      neutral        0.9984

Figure 5.6: Pitch visualization for the very positive sentence. Panels: (a) sentence level; (b) word level; (c) word context and tree distance; (d) sentence level, word context and tree distance. For each panel: sent, purple line, is the predicted sentiment category on word level; prob, blue line, is the probability of that category; f0, green line, is the predicted F0 curve with the sentiment; and cmpf0, red dashed line, is the predicted F0 curve without sentiment.

In order to verify the effect of the sentiment on the DNN, some input probabilities were modified manually, introducing very high values, for instance 5, for positiveness. This yielded a very high overall pitch of the voice, making it sound unnatural as the probability rose. On the other hand, raising the probability for the negative sentiment had the contrary effect, driving the pitch down and at some point making it sound very aspirated.
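Assuming the five sentiment probabilities occupy the last columns of the frame-level input matrix (as in the earlier assembly sketch), this kind of manual probing could look as follows; it is a purely illustrative sketch, not the procedure actually scripted for the thesis.

```python
import numpy as np

def exaggerate_sentiment(frame_inputs, category=3, value=5.0):
    """Overwrite the last five columns (sentiment probabilities) of the frame-level
    input with an exaggerated one-hot value for one category (3 = 'positive')."""
    probed = frame_inputs.copy()
    probs = np.zeros(5)
    probs[category] = value               # deliberately outside the valid [0, 1] range
    probed[:, -5:] = probs                # broadcast over all frames
    return probed

frame_inputs = np.random.randn(300, 405)                   # toy frame-level input matrix
probed_pos = exaggerate_sentiment(frame_inputs, 3, 5.0)    # exaggeratedly "positive"
probed_neg = exaggerate_sentiment(frame_inputs, 1, 5.0)    # exaggeratedly "negative"
```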

5.3 Preliminary experiment for choosing the best DNN-sentiment architecture

The first experiment has a preliminary nature and aims at selecting the best input configurations for the DNN system. For this, different types of outputs from the sentiment parser were introduced as additional input to the DNN system. The possible outputs of the parser were the sentiment category, the sentiment probability, and the sentiment vector embedding in the sentiment vector space. Some of these sentiment outputs were combined and extended for use as input to the DNN system, as follows. Using sentiment probabilities, the following inputs have been introduced to the DNN system: sentence level (sl), word level (wl), word context and tree distance (wcd), and sentence level plus word context and tree distance (swcd). Using sentiment vector embeddings instead of probabilities, only word context and tree distance (v wcd) has been tested. Also, a model without sentiment information (ws) has been trained.

The system, with the architecture shown in Figure 5.2 and the specifications stated in Section 5.1, was trained on a clean portion of an audiobook corpus read by a semiprofessional male reader of American English. The audiobook portion contains 5039 sentences and is approximately 5 hours long. Apart from the features extracted for the DNN system, as stated in Section 5.1, for each sentence a sentiment embedding and a probability vector were calculated using the Stanford sentiment parser and added, in the combinations described above, as an additional input to the system at frame level (except for the case without sentiment).

Six test sentences, which were excluded from training, were synthesized using each of the six models, resulting in a total of 36 test samples. The synthesized sentences are listed in Table 5.9. It was indicated to the participants whether a sentence was positive or negative to facilitate the task. No neutral sentences were included. The sentiment of each sentence was determined by the Stanford parser. The participants had to rate each system between 1 and 6, where 1 was very good and 6 very bad. The participants could also give the same rank to different systems if they considered that they performed equally.

Table 5.9: Synthesized sentences for the preliminary experiment.

s1  There was the very chance for another big killing. (neg)
s2  For several minutes it had been lying quite motionless. (neg)
s3  He too was 17 years old when he died. (neg)
s4  He had fought because he was born a fighter. (pos)
s5  I think Mr Duncan Smith has a very good record. (pos)
s6  I found them to be a really lovely family. (pos)

5.3.1 Perceptual results

Since this experiment is designed as a preliminary test to preselect the best systems, only 5 persons participated in total, 4 of them experts in speech technology and 1 not. None of them were native English speakers. Table 5.10 shows the means and variances of the rankings for each of the systems. The systems which performed best are the word level system and the word context and tree distance system using probabilities, although the word level system has a higher variance. The systems with sentence-level input, the system using vector embeddings with word context and tree distance, and the system without sentiment performed badly; the latter never obtained rank 1. No statistical tests are performed, as their significance would be doubtful given the small number of participants; furthermore, the test is considered preliminary.

Table 5.10: System preferences. sl: sentence level, swcd: sentence level, word context and tree distance, wcd: word context and tree distance, wl: word level, ws: without sentiment, v wcd: vector word context and tree distance

          sl    swcd  wcd   wl    ws    v wcd
mean      3.57  4.07  2.77  2.8   3.97  3.5
variance  3.22  3.17  2.05  3.68  2.17  2.12

Table 5.11 shows the preferences for the categories positive and negative. Here it can be seen that the systems do not perform equally well for positive and negative sentences. The generally best performing systems are good at positive sentences, but not so good at negative ones. The word level system has such a high overall variance because it has a high variance for negative sentences. This can be due to the fact that the word level system is the one with the most F0 variance, which is understandable because the word probabilities drive it. As could also be seen in Section 5.2, this means that the accentuations are stronger and more varied, which can be perceived as good or bad for negative sentences. In other words, negative sentences can sometimes sound better with stronger accentuations, and sometimes without. For instance, "This was disgusting!" would probably sound better with a stronger accentuation, but "My grandmother died." would probably be rather plain. This point was also commented on by some of the participants. The best performing system for negative sentences is the sentence level system, which makes sense since it probably has a more equilibrating effect.

Table 5.12 shows, for each system, the percentages of obtained rankings of 2 and above, 3 and above, between 3 and 4, 4 and below, and 5 and below. Here, the system which obtained the highest percentage of best rankings is the word level system, while the word context and tree distance system is the one with the lowest percentage of bad rankings. In general, the usage of sentiment probabilities seems to be advantageous if well applied.

Table 5.11: System preferences for positive and negative sentences. sl: sentence level, swcd: sentence level, word context and tree distance, wcd: word context and tree distance, wl: word level, ws: without sentiment, v wcd: vector word context and tree distance

                   sl    swcd  wcd   wl    ws    v wcd
positive mean      4.33  4.53  2.33  2.33  3.67  3.4
positive variance  2.24  2.55  2.38  2.24  2.67  1.54
negative mean      2.8   3.6   3.2   3.27  4.27  3.6
negative variance  3.17  3.54  1.46  4.92  1.64  1.64

Table 5.12: System rank ranges in parts per one. sl: sentence level, swcd: sentence level, word context and tree distance, wcd: word context and tree distance, wl: word level, ws: without sentiment, v wcd: vector word context and tree distance

         sl    swcd  wcd   wl    ws    v wcd
above2   0.37  0.2   0.47  0.6   0.23  0.2
above3   0.43  0.47  0.6   0.63  0.4   0.6
3−4      0.3   0.3   0.47  0.17  0.37  0.5
below4   0.57  0.53  0.4   0.37  0.6   0.4
below5   0.33  0.5   0.07  0.23  0.4   0.3

5.4 Main listening test for the DNN-sentiment evaluation

For the second experiment, the two best performing systems, word level and word context and tree distance, were chosen to synthesize 12 sentences in comparison to the system without sentiment, a total of 36 samples. The systems are the same as in the previous experiment, without any modifications. The synthesized sentences are listed in Table 5.13. Also here, the sentiment was determined by the Stanford parser. The participants had no information on whether a sentence was supposed to be positive or negative; they had to intuit it from the semantics. The task, as before, was to rate the systems, between 1 and 3, where 1 was the best option and 3 the worst. The participants could rate the systems equally if they considered them to be equally good or bad. They also had the option to disqualify a system if they thought that it was not adequate for a sentence at all.

5.4.1 Perceptual results

A total of 20 persons participated in the experiment: 12 of them reported being experts in speech technology development, two have experience as users of speech technology, and the others do not have experience with speech technology; one of them was a native US-English speaker. Table 5.14 shows the average rankings and variances for the systems. As can be seen, the best performing system is the word level system, however with a high variance. The system without sentiment was disqualified 1 time, the word level system 3 times, and the word context and tree distance system 0 times. Table 5.15 shows the P-values for one- and two-tailed t-tests with α = 0.05. The tests show that there is a significant difference between the system without sentiment and the word level system, but no significant difference between the system without sentiment and the word context and tree distance system (although it is close), nor between the word level and the word context and tree distance systems.

Table 5.16 shows the preferences divided by sentiment. For positive and negative sentences, the word level system performed best, although for negative sentences with a high variance. For neutral sentences, the word context and tree distance system performed best, possibly because it has a more equilibrating effect. Table 5.17 shows the P-values for the t-tests for negative, neutral and positive sentences. For negative sentences, there is a significant difference between the system without sentiment and the word level system, and no significant difference for the other systems. For neutral sentences, there is a significant difference between the system without sentiment and the word context and tree distance system, but not for the other systems. For positive sentences, there is only a significant difference in the one-tailed t-test between the system without sentiment and the word level system.

Among the comments of the participants, several stated that in some cases it was difficult to decide which system was better. Looking at the results of the only native speaker, he prefers the word level system with an average rank of 1.64, and he mostly discards the word context and tree distance system, with an average rank of 2.42. Only for neutral sentences does he prefer the system without sentiment, with an average rank of 1.50.

Table 5.18 shows preference results for users with different experience levels. Developer participants generally follow the tendency of the overall results, rating the systems with sentiment better than the one without. The P-values of the t-tests for the developer participants are listed in Table 5.19.

Table 5.13: Synthesized sentences for the main experiment.

s1   And if you fail I will kill you. (neg)
s2   I indicated that dreadful lee shore. (neg)
s3   I exclaimed startled out of myself by the picture. (neg)
s4   The awful soundtrack was disgusting and made me puke. (neg)
s5   My house is green with a big yellow door. (neu)
s6   The movie is there and I am here. (neu)
s7   It is the first day of June. (neu)
s8   Each glass bottle has been paid for each metal can. (neu)
s9   A woman's hair is wonderful. (pos)
s10  The mate's strength was amazing. (pos)
s11  Ellie was an inspiration to her friends and family. (pos)
s12  I was extremely happy with the movie. (pos)

Table 5.14: System preferences. ws: without sentiment, wcd: word context and tree distance, wl: word level

          ws    wcd   wl
mean      1.97  1.88  1.84
variance  0.59  0.68  0.86

Table 5.15: One- and two-tailed t-test results, P-values. ws: without sentiment, wcd: word context and tree distance, wl: word level, α = 0.05

         one-tailed P  two-tailed P
ws/wcd   0.06          0.12
ws/wl    0.01          0.01
wl/wcd   0.28          0.55

Table 5.16: System preferences for positive, negative and neutral sentences. ws: without sentiment, wcd: word context and tree distance, wl: word level

                   ws    wcd   wl
positive mean      1.84  1.85  1.71
positive variance  0.54  0.76  0.54
negative mean      2.06  1.96  1.84
negative variance  0.52  0.67  1.1
neutral mean       2     1.83  1.96
neutral variance   0.71  0.6   0.95

Table 5.17: One- and two-tailed t-test results for positive, negative and neu- tral sentences, P-values. ws: without sentiment, wcd: word context and tree distance, wl: word level, α = 0.05

         Neg:1-t.  Neg:2-t.  Neu:1-t.  Neu:2-t.  Pos:1-t.  Pos:2-t.
ws/wcd   0.12      0.24      0.01      0.02      0.46      0.92
ws/wl    0.01      0.02      0.36      0.72      0.04      0.08
wl/wcd   0.17      0.34      0.15      0.3       0.08      0.15

There are significant differences between both systems with sentiment and the system without sentiment, but no significant differences between the two systems with sentiment. The user participants, on the contrary, prefer the system without sentiment. However, the general tendency of the user participants is a rather good ranking of all systems, i.e. they more often considered several systems to be equally good. In any case, only two persons reported being experienced users, with no further details on how far this experience goes, which makes the group statistically irrelevant; therefore no t-test is performed for the user participants.
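The reported one- and two-tailed comparisons can be reproduced along these lines with SciPy, here assuming paired ratings per participant-sentence pair (the thesis does not state whether paired or independent tests were used); the rank arrays below are invented.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired rankings (one entry per participant-sentence pair).
ranks_ws = np.array([2, 3, 2, 1, 2, 3, 2, 2, 3, 1, 2, 2])   # without sentiment
ranks_wl = np.array([1, 2, 2, 1, 1, 3, 2, 1, 2, 1, 2, 1])   # word-level sentiment

t, p_two = ttest_rel(ranks_ws, ranks_wl)          # paired two-tailed t-test
p_one = p_two / 2 if t > 0 else 1 - p_two / 2     # one-tailed: ws ranked worse than wl
print(f"t = {t:.2f}, one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```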

Table 5.18: System preferences between developer participants, user partici- pants, and participants without experience with speech technology. ws: without sentiment, wcd: word context and tree distance, wl: word level

                    ws    wcd   wl
developer mean      2.01  1.79  1.77
developer variance  0.67  0.6   0.74
user mean           1.75  1.79  1.88
user variance       0.46  0.69  0.72
unexpert mean       1.94  2.08  1.96
unexpert variance   0.48  0.78  1.17

Table 5.19: One- and two-tailed t-test results for developer participants, P- values. ws: without sentiment, wcd: word context and tree distance, wl: word level, α = 0.05

         one-tailed P  two-tailed P
ws/wcd   0.00          0.00
ws/wl    0.00          0.00
wl/wcd   0.22          0.44

Table 5.20: One- and two-tailed t-test results for no-expert participants, P- values. ws: without sentiment, wcd: word context and tree distance, wl: word level, α = 0.05

         one-tailed P  two-tailed P
ws/wcd   0.02          0.03
ws/wl    0.44          0.87
wl/wcd   0.06          0.12

The participants without experience preferred the system without sentiment and the system with word level sentiment, and largely discarded the system with word context and tree distance. The t-test results for the no-expert participants are listed in Table 5.20. The results show that there is a significant difference between the system without sentiment and the word context and tree distance system, but no significant difference for the other combinations. However, although the difference between the word level and the word context and tree distance systems is not significant, it is much bigger than the difference between the system without sentiment and the word level system. In general, and especially for participants without experience, the word level system has the highest variance.

5.5 Discussion

This chapter was dedicated to expressive speech synthesis with deep neural networks. For this, a DNN-based speech synthesis system was trained on an audiobook, where additionally, sentiment input predicted by the Stanford sentiment parser was added to train the system. Five different configurations were tested, among them the sentiment probability on sentence level and on word level, the inclusion of word context and hierarchical tree distance, also in combination with the sentence level, and the use of vector embeddings instead of probabilities.

A pitch analysis of two corpora was conducted, one neutral speech corpus and the audiobook used in the perceptual experiments, in order to evaluate whether there are differences in pitch depending on the sentence sentiment. Also, individual examples of pitch predictions with and without sentiment were visualized and studied. These results show that there is a clear influence of sentiment on pitch. However, the performance of the Stanford parser is not always optimal. Also, the sentence accent does not always lie on the most positive or negative word, which sometimes yields an odd accentuation by the DNN.

Two perceptual experiments were conducted with test sentences synthesized using the different sentiment input configurations. The preliminary experiment aimed at preselecting the best systems, showing differences in performance for positive and negative sentences. The second experiment compared only two sentiment systems with the system without sentiment. The overall results show that the systems with sentiment are better. Also, there are differences between positive, negative and neutral sentences. However, when the results are separated by the participants' experience with speech technology, there are important differences between the groups. The developers confirm and accentuate the overall results. The participants without any experience often preferred the system without sentiment features. Those with user experience had a different tendency, although there were only two of them, making the interpretation of their results statistically irrelevant. The best performing system, the word level sentiment system, also has the highest variance. This is probably due to the fact that this system yields the strongest and most varied accentuations, since it is driven by word-level sentiment, which can be perceived sometimes as good and sometimes as bad.

The results obtained in the experiments show the general potential of neural network based synthesis in combination with expressive information derived from text. The results show a general preference for the best performing system using this information. However, they also show that different designs of the input yield very different results in system performance, which probably means that there is a lot more room for improvement. Furthermore, the sentiment parser is trained on movie reviews, and the acoustic model on an audiobook. The consequence is that many sentences which are positive or negative in one domain are different in the other domain. Also, movie reviews are usually written, and even if spoken, they are often spoken with a neutral voice. This discrepancy probably lowers the quality of the prediction, of the training, and of the synthesis. On the other hand, the original audiobook by itself is not very expressive.

Future work should aim, first, at improving these conditions, i.e. the database and the sentiment parser.
After that, the way the sentiment information is used in the system should be studied and improved. One of the main points here is that there should be a connection between the sentiment (or other) sentence embeddings and the actual acoustics. A good line of investigation would be to train the sentiment analysis in such a way that the sentiment output is adjusted not only to the labels on the text level, but also to the acoustics. Features like i-vectors or the i-vector based combinations proposed in Chapter 3 could be used instead of labels, automating the process. This technique could also work for other semantic embeddings, adjusting them to the acoustics and improving them with regard to expressiveness.

Chapter 6

Discussion

Getting to the end of this work, a few last points are presented in this discussion. Section 6.1 recalls the thesis goals and gives an overview of the whole work, Section 6.2 draws final conclusions and discusses future work, and Section 6.3 lists the published contributions related to this thesis.

6.1 Summary

The subject of the present thesis is expressive speech synthesis. The difference between expressive speech synthesis and "general" speech synthesis is that it can read text with different expressions. Many emotional or expressive speech synthesis systems have already explored this capability, creating emotional voices like "happy" or "sad", reading positive or negative news, etc. However, the focus of this thesis lies on unsupervised, automatic and labelless learning from expressive uncontrolled resources, like audiobooks, in order to gain training data; and on deriving expressiveness from text, which can also be used to gain training data or to automatically read expressive texts like books, adapting the voice to the passages. Recalling the main hypotheses:

1. It is possible to define expressive voices from clusters of data in the acoustic domain, applying unsupervised methods to build the clusters, i.e. no labels of human interpretation are permitted to define the voices or the data in the clusters.

2. It is possible to improve the expressiveness of a synthetic voice using in the training process semantic features which codify some sort of expressive information and are obtained fully automatically.

and the main questions:

1. How is expressiveness represented in the acoustic and the semantic domains?

2. Is it possible to define reliable features for each of the domains which reflect the expressiveness?


3. If relevant features can be defined, how can they be used to prove the hypotheses in each case?

we can summarize the work as follows. First, the acoustic domain has been studied in order to find acoustic features which best represent expressiveness in speech. Second, semantic vector representations, especially those derived from neural networks, have been used to predict expressiveness from text. On the other hand, two distinct TTS paradigms have been explored for the task: the well established HMM-based synthesis, which provides enough flexibility to implement expressive voices, especially using speaker adaptive training; and the novel approach based on neural networks, where expressive speech is a less explored issue so far.

Regarding the acoustic domain, different features have been tested to represent expressive speech in connection with speech synthesis. For this, different feature combinations were extracted for the utterances of two corpora, an unsupervised clustering was performed on them, and the clusters were evaluated by calculating their entropy. Then, an HMM-based speech synthesis system was trained on the clusters using speaker adaptive training to create voices. These voices were evaluated in a novel evaluation form in which participants were asked to edit an audiobook paragraph using the synthetic voices, showing clear preferences of certain voices for specific characters. The main novelty in this part is the usage of i-vectors to represent expressive speech in multi-speaker domains. I-vectors come from speaker recognition and turned out to be very suitable for the given task, outperforming other traditional features and sophisticated state-of-the-art feature sets used in emotional challenges and similar tasks. An interesting observation was made that, at least for the given task and evaluation technique, traditional features based on pitch outperformed the state-of-the-art feature sets, questioning how well designed they actually are.

Regarding the semantic domain, numerical semantic representations in vector space have been used to represent the text and automatically derive expressiveness from it. One of the most important goals was to avoid labeling. First, labeling is very costly in terms of manual work. Second, it is often not really clear how to define labels that account for all the possible expressions, emotions, speaking styles, etc., especially for open multi-speaker domains. A preliminary experiment with vector representations was performed, showing that, at least to some extent, expressiveness is encoded in text and can be predicted. Then, more sophisticated embeddings, derived from neural networks, were used to perform a prediction from semantics to acoustics, also here yielding results which show that at least some information can be derived. Two perceptual evaluations were designed to test the idea. First, for each sentence of two book paragraphs, acoustic feature vectors were predicted from semantic representations and used as cluster centroids in an acoustic feature space, selecting a nearest-neighbor cluster to train an HMM-based TTS via speaker adaptive training, creating an individual voice for each of the paragraphs. The test results showed a clear preference for the expressive reading in comparison to the reading with the neutral voice. A minimal sketch of these two steps, unsupervised clustering with entropy evaluation and nearest-cluster selection from a semantically predicted acoustic vector, is given below.
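The following sketch uses random stand-in data and a random projection in place of the trained semantic-to-acoustic regressor; the dimensions, the number of clusters and the entropy measure are simplified assumptions, not the exact experimental setup. It clusters utterance-level acoustic vectors without labels, evaluates the clusters with the entropy of style labels held out for evaluation only, and then selects the cluster nearest to an acoustic vector predicted from a sentence embedding.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Hypothetical data: one 30-dimensional vector per utterance, e.g. an i-vector
    # concatenated with pitch statistics, plus style labels used ONLY for evaluation.
    n_utts, dim, n_styles = 500, 30, 4
    X = rng.normal(size=(n_utts, dim))
    styles = rng.integers(0, n_styles, size=n_utts)

    # Unsupervised clustering of the utterances (no labels involved).
    kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

    def cluster_entropy(assignments, labels):
        """Weighted per-cluster entropy of the reference labels (lower = purer)."""
        total = 0.0
        for c in np.unique(assignments):
            in_c = labels[assignments == c]
            p = np.bincount(in_c, minlength=labels.max() + 1) / len(in_c)
            p = p[p > 0]
            total += len(in_c) / len(labels) * -(p * np.log2(p)).sum()
        return total

    print("weighted cluster entropy:", cluster_entropy(kmeans.labels_, styles))

    # Semantic-to-acoustic selection: a sentence embedding is mapped by some trained
    # regressor (here a random projection as a stand-in) into the acoustic space,
    # and the nearest cluster centroid picks the adaptation data for that sentence.
    sentence_embedding = rng.normal(size=300)      # e.g. an averaged word vector
    W = rng.normal(size=(dim, 300)) * 0.01         # stand-in for the regressor
    predicted_acoustic = W @ sentence_embedding
    nearest = np.argmin(np.linalg.norm(kmeans.cluster_centers_ - predicted_acoustic, axis=1))
    adaptation_utts = np.where(kmeans.labels_ == nearest)[0]
    print("selected cluster:", nearest, "with", len(adaptation_utts), "utterances")

The utterances of the selected cluster would then serve as adaptation data for speaker adaptive training of the corresponding voice.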
In the second task, for emotional key words, acoustic feature vectors were predicted from the semantics, and again, these were used as cluster centroids to select data for speaker adaptation. This approach is basically a search engine for expressive/emotional speech based on semantics (a retrieval-style sketch is given at the end of this section). The emotional voices were successfully evaluated in a preference test.

Following the newly emerged paradigm of neural network based speech synthesis, a DNN-based speech synthesis system was implemented using sentiment information derived from text. The sentiment information is predicted by the Stanford sentiment parser, where probabilities and semantic vector embeddings are calculated for each unit in the input text. Two corpora were studied for the effects of sentiment on pitch, showing that there are changes in pitch depending on the sentiment value of the sentences, especially for the audiobook corpus, which is an expressive speech corpus. Individual examples were studied showing how sentiment influences the DNN output for pitch, changing accents and accentuation strength in comparison to the model without sentiment. Further, two perceptual experiments were conducted using samples synthesized by the DNN-based TTS with different configurations. The first experiment aimed at choosing the best system configurations, revealing that the system performance is not equal for positive and negative sentences. The second experiment involved more participants, especially participants not experienced with speech technology, showing important differences in the preferences of experts and non-experts. The general results show that sentiment information can be very useful to improve the synthesis, but it needs to be carefully designed. Also, some discrepancies regarding the sentiment prediction itself should be addressed, like the fact that it is trained using movie reviews while the acoustic models are trained using an audiobook; these different domains yield important semantic differences for the sentiment prediction.

Regarding the hypotheses, both can be confirmed. It is possible to define expressive voices from clusters of data in the acoustic domain, applying unsupervised methods to build the clusters, i.e. without labels of human interpretation to define the voices or the data in the clusters. It is also possible to improve the expressiveness of a synthetic voice by using in the training process semantic features which codify some sort of expressive information and are obtained fully automatically. In both cases, though, there are limitations, at least within the focus of the presented work. In order to overcome these limitations it is necessary to substantiate and redefine several aspects, especially regarding the new possibilities opened up by newly emerged technologies. The next section expands on the conclusions and the future work.
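To close this summary, the “search engine” idea can be sketched as a simple retrieval step. The mapping matrix, the corpus size and the function search_expressive_data below are illustrative assumptions standing in for the trained semantic-to-acoustic regressor and the real corpus.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical setup: 5000 corpus utterances described by 30-dimensional
    # acoustic vectors (e.g. i-vector based), and a regressor, here a random
    # stand-in, mapping a 300-dimensional word embedding of an emotional key word
    # into that acoustic space.
    utt_acoustic = rng.normal(size=(5000, 30))
    W = rng.normal(size=(30, 300)) * 0.01

    def search_expressive_data(keyword_embedding, top_k=200):
        """Return indices of the utterances acoustically closest to the acoustic
        vector predicted from the key word's embedding."""
        query = W @ keyword_embedding
        dists = np.linalg.norm(utt_acoustic - query, axis=1)
        return np.argsort(dists)[:top_k]

    # The selected utterances would then feed speaker adaptive training of a voice.
    angry_embedding = rng.normal(size=300)   # stand-in for the embedding of "angry"
    selection = search_expressive_data(angry_embedding)
    print(len(selection), "utterances selected for adaptation")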

6.2 Conclusions and future work

As seen in the given tasks, progress has been made towards the proposed goals, and new issues have also been opened. For the acoustic part, i-vectors have proved to be the most suitable feature for the given task, outperforming other state-of-the-art and traditional features. However, this is a generalization, and in individual cases things might look different. For instance, for the mono-speaker databases the feature issue is not really resolved, and, at least in the present studies, the most simple and traditional features could not be outperformed. There are also speaker-specific differences, where among different speakers different features yield different performance. It is difficult to determine when which feature will perform better, which is also consistent with other studies where opinions about the best features diverge. What can be concluded is that i-vectors did perform best in the multi-speaker domain, and that perhaps the best way to continue research in the acoustic domain is to leave acoustic low-level features and concentrate on techniques which try to separate the information codified in the low-level features, similar to i-vectors.

For the semantic domain, it has also become clear that expressiveness is somehow encoded in the semantics, and that with modern techniques a semantic-acoustic mapping is possible. However, there is a lot of room for improvement. First, there is a lack of a systematic formulation of how expressiveness can be derived from text, which is not trivial, since the exact relation between the semantics and the acoustics is not totally clear, except for key words. Models like deep neural networks have shown that they can learn from semantics in order to predict acoustics in the sense of expressiveness, but the relations they learn still seem to be rather superficial and sometimes not intuitive. On the other hand, the role of context and world knowledge is not totally clear either. Intuitively it seems very important for predicting expressiveness; however, where some experiments showed that it was crucial for improving the prediction, others showed no improvement. Also here, a possible conclusion is that a more systematic formulation of this knowledge is needed, on the one hand, and a well designed approach of how to introduce it into the speech synthesis system, on the other.

The semantic representations per se should also be reconsidered and remodeled for a better representation of expressiveness. For instance, the sentiment embeddings have high potential, but need to be trained on corresponding domains in order to make accurate predictions. Possibly, sentiment being one of the emotional dimensions formulated by Kehrein (2002), other dimensions of his model could be derived from text. Another viable option, as described in Section 5.5, is to train sentiment (or other) embeddings using acoustic criteria of the underlying speech, such that the embeddings not only reflect whatever is interpretable from text, but also the actual acoustics.

A final comment about the new trends in speech synthesis in general: when this thesis started, the focus lay on statistical training methods; meanwhile, a real “boom” of neural network based synthesis has changed the synthesis paradigm in general.
While it is true that many findings of this thesis are more general and rather independent of the underlying system architecture, the way this knowledge is implemented in the system changes radically when applied to NN-based synthesis. This, in turn, allows and requires reconsidering the architectures and the system design, at least when dealing with expressive speech, and opens many new ways and possibilities for future work. Apart from the joint training of vector embeddings with respect to acoustics, mentioned above, possible approaches could for instance exploit larger databases; different types of architectures, like parallel architectures which perform different tasks at once; transfer learning; and so on.

In German there is an idiom, “Eierlegende Wollmilchsau”, which translated means something like “egg-laying wool-milk-pig”. It describes a farm animal which has only advantages, i.e. lays eggs, produces wool and milk, and can also be eaten. This figure is a metaphor for something which is fully advantageous and accounts for all needs. The last conclusion of the thesis is that there is no “egg-laying wool-milk-pig” in expressive speech synthesis so far, and there is still a long way to go to breed one.

6.3 Published contributions

I. Jauk and A. Bonafonte. Direct expressive voice training based on semantic selection. In Proceedings of Interspeech, pages 3181–3185, San Francisco (USA), 2016.

I. Jauk and A. Bonafonte. Prosodic and spectral ivectors for expressive speech synthesis. In Proceedings of Speech Synthesis Workshop 9, pages 59–63, 2016.

I. Jauk, A. Bonafonte, P. López-Otero, and L. Docio-Fernandez. Creating expressive synthetic voices by unsupervised clustering of audiobooks. In Proceedings of Interspeech, pages 3380–3384, Dresden (Germany), 2015.

I. Jauk, A. Bonafonte, and S. Pascual. Acoustic feature prediction from semantic features for expressive speech using deep neural networks. In Proceedings of EUSIPCO, pages 2320–2324, Budapest (Hungary), 2016.

Bibliography

S. Achanta, T. Godambe, and S.V. Gangashetty. An investigation of recurrent neural network architectures for statistical parametric speech synthesis. Proceedings of Interspeech, pages 859–863, 2015.

J. Adell, D. Escudero, and A. Bonafonte. Production of filled pauses in concatenative speech synthesis based on the underlying fluent sentence. Speech Communication, 54:459–776, 2012.

P.D. Agüero and A. Bonafonte. Phrase break prediction: a comparative study. In XIX Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural, 2003.

P.D. Agüero and A. Bonafonte. Intonation modeling for tts using a joint extraction and prediction approach. In Proceedings of Speech Synthesis Workshop 6 (SSW6), pages 67–72, 2004a.

P.D. Agüero and A. Bonafonte. Phrase break prediction using a finite state transducer. In Proceedings of Advanced Speech Technologies, 2004b.

F. Alías Pujol. Conversión de texto en habla multidominio basada en selección de unidades con ajuste subjetivo de pesos y marcado robusto de pitch. PhD thesis, Universitat Ramon Llull, 2006.

T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul. A compact model for speaker-adaptive training. In Proceedings of ICSLP, pages 1137–1140, 1996.

S.O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi. Deep voice: Real-time neural text-to-speech. CoRR, abs/1702.07825, 2017. URL http://arxiv.org/abs/1702.07825.

M. Arnela, S. Dabbaghchian, R. Blandin, O. Guasch, O. Engwall, A. Van Hirtum, and X. Pelorson. Influence of vocal tract geometry simplifications on the numerical simulation of vowel sounds. The Journal of the Acoustic Society of America, 140(3):1707–1718, 2016.

J. Baker. The dragon system–an overview. IEEE Transactions on Acoustics, Speech and Signal Processing, 23(1):24–29, 1975.

R. Barra-Chicote, J. Yamagishi, S. King, J. Montero, and J. Macias-Guarasa. Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech. Speech Communication, 52:394–404, 2010.


J.R. Bellegarda. Unsupervised document clustering using latent semantic density analysis, January 12 2012. URL https://www.google.ch/patents/US20120011124. US Patent App. 12/831,909.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

P. Birkholz. 3D-Artikulatorische Sprachsynthese. PhD thesis, Universität Rostock, 2005.

P. Birkholz, B.J. Kröger, and C. Neuschaefer-Rube. of words in six voice qualities using a modified two-mass model of the vocal folds. In Proceedings of the First International Workshop on Performative Speech and Singing Synthesis, 2011.

P. Birkholz, L. Martin, K. Willmes, and B.J. Kröger. The contribution of phonation type to the perception of vocal emotions in German: an articulatory synthesis study. Journal of the Acoustic Society of America, 137(3):1503–1512, 2015.

P. Boersma and D Weenink. Praat: doing phonetics by computer (version 5.4.07), 2015. URL http://www.praat.org/.

T. Bonafonte, P. Aguero, J. Adell, J. Perez, and A. Moreno. Ogmios: the UPC text-to-speech synthesis system for spoken translation. In Proceedings of TC-STAR Workshop on Speech-to-Speech Translation, pages 199–204, 2006.

S. Breuer. Multifunktionale und multilinguale Unit-Selection-Sprachsynthese. PhD thesis, Universität Bonn, 2009.

S. Breuer and W. Hess. The Bonn Open Synthesis System 3. International Journal of Speech Technology, 13(2):75–84, 2010.

F. Burckhardt and W.F. Sendelmeier. Verification of acoustical correlates of emotional speech using formant synthesis. In Proceedings of ISCA Workshop on Speech and Emotion, pages 151–156, 2000.

J.E. Cahn. Generation of affect in synthesized speech. In Proceedings of Amer- ican Voice I/O Society, pages 251–256, 1989.

N. Campbell. Syllable-based segmental duration. Talking Machines: Theories, Models and Designs, pages 211–224, 1992.

N. Campbell. Conversational speech synthesis and the need for some laughter. IEEE Transactions on Audio, Speech, and Language Processing, 17(4):1171– 1179, 2006.

J.P. Carbal and L.C. Oliveira. Emo voice: a system to generate emotions in speech. In Proceedings of Interspeech, pages 1798–1801, 2006.

A. Chalamandaris, P. Tsiakoulis, S. Karabetsos, and S. Raptis. Using audio books for training a text-to-speech system. In Proceedings of LREC, pages 3076–3080, 2006.

A. Chalamandaris, P. Tsiakoulis, S. Karabetsos, and S. Raptis. The ilsp/innoetics text-to-speech system for the blizzard challenge 2014. In Proceedings of Blizzard Challenge, 2014.

Blizzard Challenge. URL http://www.festvox.org/index.html.

F. Charpentier and M. Stella. Diphone synthesis using an overlap-add technique for speech waveforms concatenation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2015–2018, 1986.

L. Chen, M.J.F. Gales, N. Braunschweiler, M. Akamine, and K. Knill. Integrated expression prediction and speech synthesis from text. IEEE Journal of Selected Topics in Signal Processing, 8(2):323–335, 2014.

L.-H. Chen, Y. Jiang, M. Zhou, Z.-H. Ling, and L.-R. Dai. The ustc system for blizzard challenge. In Proceedings of Blizzard Challenge, 2016.

S.-H. Chen, Sh.-H. Hwang, and Y.-R. Wang. An rnn-based prosodic information synthesizer for mandarin text-to-speech. IEEE Transactions on Speech and Audio Processing, pages 226–239, 1998.

F.S. Cooper, A.M. Liberman, and J.M. Borst. The interconversion of audible and visible patterns as a basis for research in the perception of speech. In Proceedings of the National Academy of Sciences of the United States of America, pages 318–325, 1951.

R. Córdoba, J.A. Vallejo, J.M. Montero, J. Gutierrez-Arriola, M.A. López, and J.M. Pardo. Automatic modeling of duration in a spanish text-to-speech system using neural networks. In Proceedings of Eurospeech, 1999.

R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, and J.G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, pages 32–80, 2001.

N. Dehak. Discriminative and Generative Approaches for Long- and Short-term Speaker Characteristics Modeling: Application to Speaker Verification. PhD thesis, Ecole de Technologie Superieure (Canada), 2009. AAINR50490.

N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 19(4):788–798, 2011.

P. Delattre. From acoustic cues to distinctive features. Phonetica, pages 198–230, 1968.

Romaji Desu. URL http://www.romajidesu.com/kanji/.

M.M. Deza and E. Deza. Encyclopedia of distances. Springer, 2009.

Cambridge dictionary, a. URL http://dictionary.cambridge.org/.

Oxford dictionary, b. URL https://www.oxforddictionaries.com/.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

H. Dudley. The vocoder. Bell Labs Rec., 17:122–126, 1939.

H.K. Dunn. The calculation of vowel resonances, and an electrical vocal tract. Journal of the Acoustic Society of America, 6(22):740–753, 1950.

P. Ekman. Universals and cultural differences in facial expressions of emotion. Nebraska Symposium on Motivation, 19:207–282, 1972.

D. Erro, I. Sainz, E. Navas, and I. Hernaez. Improved HNM-based vocoder for statistical synthesizers. In Proceedings of Interspeech, pages 1809–1812, 2011a.

D. Erro, I. Sainz, E. Navas, and I. Hernaez. HNM-based mfcc+f0 extractor applied to statistical speech synthesis. In Proceedings of ICASSP, pages 4728– 4731, 2011b.

D. Erro, I. Hernaez, E. Navas, A. Alonso, H. Arzelus, I. Jauk, N.Q. Hy, C. Magariños, R. Perez-Ramon, M. Sulir, X. Tian, X. Wang, and J. Ye. Zuretts: Online platform for obtaining personalized synthetic voices. In Proceedings of eNTERFACE, pages 17–25, Bilbao (Spain), 2014.

D. Erro, I. Hernaez, A. Alonso, D. Garcia-Lorenzo, E. Navas, J. Ye, H. Arzelus, I. Jauk, N.Q. Hy, C. Magariños, R. Perez-Ramon, M. Sulir, X. Tian, and X. Wang. Personalized synthetic voices for speaking impaired: Website and app. In Proceedings of Interspeech, pages 1251–1254, Dresden (Germany), 2015.

D. Escudero, V. Cardeñoso, and A. Bonafonte. Corpus based extraction of quantitative prosodic parameters of stress groups in spanish. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference, volume 1, pages I–481–I–484, 2002.

Erro D. et al. Zure tts. URL http://aholab.ehu.es/zuretts/.

F. Eyben. The opensmile book, 2016. URL http://opensmile.audeering.com/.

F. Eyben, S. Buchholz, N. Braunschweiler, J. Latorre, V. Wan, M. Gales, and K. Knill. Unsupervised clustering of emotion and voice styles for expressive TTS. In Proceedings of ICASSP, pages 4009–4012, 2012.

F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceed- ings of ACM Multimedia (MM), pages 835–838, October 2013.

J.W.A. Fackrell, H. Vereecken, J.P. Martens, and B.V. Coile. Multilingual prosody modelling using cascades of regression trees and neural networks. In Proceedings of Eurospeech, 1999.

G. Fant. Speech communication research. Ing. Vetenskaps of Royal Swedish Academy of Engineering Sciences, 24:331–337, 1953.

G. Fant. Acoustic Theory of Speech Production. Mouton, 2 edition, 1970.

M. Farrús, J. Hernando, and P. Ejarque. Jitter and shimmer measurements for speaker recognition. In 8th Annual Conference of the International Speech Communication Association, pages 778–781, 2007.

A. Febrer, J. Padrell, and A. Bonafonte. Modeling phone duration: application to catalan tts. In Proceedings of Speech Synthesis Workshop 3 (SSW3), pages 43–46, 1998.

R. Fernandez, B. Rendel, A. Ramabhadran, and R. Hoory. Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. Proceedings of Interspeech, pages 2268–2272, 2014.

H. Fujisaki and H. Kawai. Modeling the dynamic characteristics of voice fun- damental frequency with applications to analysis and synthesis of intonation. In Working Group on Intonation, 13th International Congress of Linguists, pages 57–70, 1982.

N. Geschwind. Language and the brain. Scientific American, 226:76–83, 1972.

M Giustiniani and P. Pierucci. Phonetic ergodic hmm for speech synthesis. In Proceedings of EUROSPEECH, pages 349–352, 1991.

GNU. GSL - GNU scientific library. URL http://www.gnu.org/software/gsl/.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.

D. Govind and S.R. Mahadeva Prasanna. Expressive speech synthesis: a review. International Journal of Speech Technology, 16(2):237–260, 2013.

R.M. Gray. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984.

W. Hamza, R. Bakis, E. Eide, M. Picheny, and J. Pitrelli. The IBM expressive speech synthesis system. In Proceedings of ICSLP, pages 2577–2580, 2004.

J. Hart and A. Cohen. Intonation by rule: A perceptual quest. Journal of Phonetics, 1:309–327, 1973.

F. Hinterleitner, G. Neitzel, S. M¨oller,and Ch. Norrenbrock. An evaluation protocol for the subjective assessment of text-to-speech in audiobook reading tests. In Proceedings of Blizzard Challenge, 2011.

V. Hozjan, Z. Kacic, A. Moreno, A. Bonafonte, and A. Nogueiras. Interface databases: Design and collection of a multilingual emotional speech database. In LREC, pages 2024–2028, 2002.

Q. Hu, Z. Wu, K. Richmond, J. Yamagishi, Y. Stylianou, and R. Maia. Fusion of multiple parameterisations for dnn-based sinusoidal speech synthesis with multi-task learning. Proceedings of Interspeech, pages 854–858, 2015.

A. Iida, N. Campbell, S. Iga, G. Higuchi, and M. Yasumura. A speech synthesis system for assisting communications. In Proceedings of ISCA Workshop on Speech and Emotion, pages 167–172, 2000.

S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Acoustics, Speech, and Signal Processing (ICASSP), pages 93–96, 1983.

International Phonetic Association (IPA). IPA Handbook. Cambridge University Press, 8 edition, 2007.

I. Jauk. Rhythmusmodellierung im Sprechgesang. Magister thesis, Universität Bonn, 2010.

I. Jauk and A. Bonafonte. Direct expressive voice training based on semantic selection. In Proceedings of Interspeech, pages 3181–3185, 2016a.

I. Jauk and A. Bonafonte. Prosodic and spectral ivectors for expressive speech synthesis. In Proceedings of Speech Synthesis Workshop 9, pages 59–63, 2016b.

I. Jauk, A. Bonafonte, P. López-Otero, and L. Docio-Fernandez. Creating expressive synthetic voices by unsupervised clustering of audiobooks. In Interspeech 2015, pages 3380–3384, 2015.

I. Jauk, A. Bonafonte, and S. Pascual. Acoustic feature prediction from semantic features for expressive speech using deep neural networks. In Proceedings of EUSIPCO, pages 2320–2324, 2016.

B.-H. Juang. On the hidden markov model and dynamic time warping for speech recognition - a unified view. AT and T Technical Journal, 63(7):1213–1243, 1984. H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3-4):187–207, 1999. R. Kehrein. The prosody of authentic emotions. In Proceedings of Speech Prosody, pages 423–426, 2002.

P. Kenny. Joint factor analysis of speaker and session variability: Theory and algorithms. Technical report CRIM-06/08-13 Montreal, 2005. URL http://www.crim.ca/perso/patrick.kenny/.

P. Kenny, G. Boulianne, and P. Dumouchel. Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing, 13(3):345–354, 2005.

D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

R. Klabunde, K.U. Carstensen, C. Ebert, C. Endriss, S. Jekat, and H. Langer. Computerlinguistik und Sprachtechnologie. München: Spektrum Akademischer Verlag, 2004.

D. Klatt. Dennis Klatt's 'history of speech synthesis' archive. URL http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html.

Dennis H Klatt. Review of text-to-speech conversion for english. The Journal of the Acoustical Society of America, 82(3):737–793, 1987.

D.H. Klatt. Acoustic theory of terminal analog speech synthesis. In Acoustics, Speech, and Signal Processing (ICASSP), pages 131–135, 1972.

D.H. Klatt. Interaction between two factors that influence vowel duration. Journal of the Acoustic Society of America, 5:1102–1104, 1973.

D. Klein and C.D. Manning. Accurate unlexicalized parsing. ACL, pages 423–430, 2003.

K.J. Kohler. A model of german intonation. In Studies in German Intonation, pages 295–368. Institut für Phonetik und digitale Sprachverarbeitung der Universität Kiel, 1991.

B.J. Kröger and P. Birkholz. Articulatory synthesis of speech and singing: State of the art and suggestions for future research. Multimodal Signals: Cognitive and Algorithmic Issues, pages 306–319, 2009.

K. Kuttler. An introduction to linear algebra. Brigham Young University, 2007. URL http://www.math.byu.edu/~klkuttle/Linearalgebra.pdf.

G.N. Lance and W.T. Williams. Mixed-data classificatory programs i - agglomerative systems. Australian Computer Journal, pages 15–20, 1967.

W. Lawrence. The synthesis of speech from signals which have a low information rate. Communication Theory, pages 460–469, 1953. M. Liberman. The Intonational System of English. PhD thesis, MIT, 1975. A. Ljolje and F. Fallside. Synthesis of natural sounding pitch contours in isolated utterances using hidden markov models. IEEE Transactions on Audio, Speech and Language Processing, ASSP-34:1074–1080, 1986. P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. iVectors for con- tinuous emotion recognition. In Proceedings of Iberspeech 2014, pages 31–40, 2014.

J. Lorenzo Trueba. Design and Evaluation of Statistical Parametric Techniques in Expressive Text-To-Speech: Emotion and Speaking Styles Transplantation. PhD thesis, E.T.S.I. Telecomunicación (UPM), 2016.

J. Lorenzo-Trueba, R. Barra-Chicote, J. Yamagishi, O. Watts, and J.M. Montero. Towards speaking style transplantation in speech synthesis. In Proceedings of 8th ISCA Speech Synthesis Workshop, pages 159–163, 2013.

J. Lorenzo-Trueba, R. Barra-Chicote, J. Yamagishi, and J.M. Montero. Towards cross-lingual emotion transplantation. In Proceedings of Iberspeech 2014, pages 199–208, 2014.

H. Lu, S. King, and O. Watts. Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis. Proceedings of Speech Synthesis Workshop (SSW8), pages 281–285, 2013.

T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai. Speech synthesis from hmms using dynamic features. In Acoustics, Speech, and Signal Processing (ICASSP), pages 389–392, 1996.

P. Menzerath and A. de Lacerda. Koartikulation und Steuerung. Dümmler (Bonn), 1933.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR, 2013.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Repre- sentations of Words and Phrases and their Compositionality. ArXiv e-prints, October 2013.

J.M. Montero, J. Gutierrez-Arriola, J. Colas, E. Enriquez, and J.M. Pardo. Analysis and modeling of emotional speech in spanish. In Proceedings of ICPhS, pages 671–674, 1999.

I.R. Murray and J.L. Arnott. Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. The Journal of the Acoustic Society of America, pages 1097–1108, 1993.

I.R. Murray and J.L. Arnott. Implementation and testing of a system for produc- ing emotion by rule in synthetic speech. Speech Communication, 16:369–390, 1995.

M.A. Musen. The protégé project: A look back and a look forward. AI Matters. Association of Computing Machinery Special Interest Group in Artificial Intelligence, 1(4), 2015. DOI: 10.1145/2557001.25757003.

B. Möbius. Ein quantitatives Modell der deutschen Intonation. Analyse und Synthese von Grundfrequenzverläufen. Niemeyer, Tübingen, 1993.

Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). Doklady ANSSSR (translated as Soviet Math. Dokl.), 269:543–547, 1983.

T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control tech- nique for HMM-based expressive speech synthesis. IEICE Transactions on Information and Systems, E90-D(9):1406–1413, 2005.

T. Nose, M. Tachibana, and T. Kobayashi. Hmm-based style control for expres- sive speech synthesis with arbitrary speaker’s voice using model adaptation. IEICE Transactions on Information and Systems, E92-D(3):489–497, 2009.

A. Ortega and A. Miguel. Adaptation and normalization techniques (especially) for automatic speech recognition. Applications of speech technologies, pages 16–71, 2014.

B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. ACL, pages 115–124, 2005.

S. Pascual and A. Bonafonte. Multi-output rnn-lstm for multiple speaker speech synthesis with α-interpolation model. Proceedings of Speech Synthesis Workshop (SSW9), pages 119–124, 2016a.

S. Pascual and A. Bonafonte. Prosodic break prediction with rnns. In Proceedings of Iberspeech, pages 64–72, 2016b.

James W Pennebaker. The secret life of pronouns. New Scientist, 211(2828): 42–45, 2011.

J. Pennington, R. Socher, and C.D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.

J.B. Pierrehumbert. The Phonology and Phonetics of English Intonation. PhD thesis, MIT, 1980.

T.V. Polyàkova. Grapheme-to-phoneme conversion in the era of globalization. PhD thesis, Universitat Politècnica de Catalunya, 2015.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Han- nemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.

Princeton. About WordNet, 2010. URL http://wordnet.princeton.edu.

J. Pérez and A. Bonafonte. Automatic voice-source parameterization of natural speech. In Proceedings of 9th European Conference on Speech Communication and Technology Interspeech 2005, pages 1065–1068, Lisboa, Portugal, September 2005.

N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(98)00116-6. URL http://www.sciencedirect.com/science/article/pii/S0893608098001166.

Y. Qian, Y. Fan, W. Hu, and F.K. Soong. On the training aspects of deep neural networks (dnn) for parametric tts synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3829–3833, 2014.

L. Rabiner and B.-H. Juang. Fundamentals of speech recognition. PTR Prentice- Hall, Inc., 1993.

S.K. Rallabandi, A. Vadapalli, S. Achanta, and S.V. Gangashetty. IIIT Hyderabad's submission to the blizzard challenge 2015. In Proceedings of Blizzard Challenge, 2015.

S. Reese, G. Boleda, L. Cuadros, M. Padró, and G. Rigau. Wikicorpus: A word-sense disambiguated multilingual corpus. In Proceedings of 7th Language Resources and Evaluation Conference (LREC10), pages 1418–1421, 2010.

D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19–41, 2000.

K. Richmond, R.A. Clark, and S. Fitt. Robust lts rules with the combilex speech technology lexicon. Proceedings of Interspeech, pages 1295–1298, 2009.

M. Riley. Tree-based modeling in segmental duration. Talking Machines: The- ories, Models and Designs, pages 265–273, 1992.

M. Rozak. Text-to-speech designed for a massively multiplayer online role- playing game (mmorpg). In Proceedings of Blizzard Challenge, 2007.

K. Saino, H. Zen, Y. Hankaku, A. Lee, and K. Tokuda. Hmm-based singing voice synthesis system. In Proceedings of Interspeech, pages 1141–1144, 2006.

Expressive Synthetic Speech Samples. URL http://emosamples.syntheticspeech.de/.

J. Sato and S. Morishima. Emotion modeling in speech production using emo- tion space. IEEE International Workshop on Robot and Human Communi- cation, pages 1171–1179, 1996.

U. Schade. Konnektionistische sprachproduktion. In Psycholinguistische Stu- dien, pages 183–196, 1999.

H. Schlossberg. Three dimensions of emotions. Psychological Review, 61(2): 81–88, 1954.

M. Schröder. Expressive speech synthesis: past, present, and possible futures. In Affective Information Processing, chapter 7, pages 111–126. Springer, 2009.

M. Schröder. Emotional speech synthesis - a review. In Proceedings of EUROSPEECH, pages 561–564, 2001.

B. Schuller. Recognizing affect from linguistic information in 3d continuous space. IEEE Transactions on affective computing, 2(4):192–205, 2000.

B. Schuller, R. Müller, M. Lang, and G. Rigoll. Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles. In Proceedings of Interspeech, pages 805–808, 2005.

B. Schuller, A. Steidl, and A. Batliner. The interspeech 2009 emotion challenge. In Proceedings of Interspeech, pages 312–315, 2009.

B. Schuller, A. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and Sh. Narayanan. The interspeech 2010 paralinguistic challenge. In Proceedings of Interspeech, pages 2794–2797, 2010.

C.E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.

R. Socher, A. Perelygin, J.Y. Wu, J. Chuang, C.D. Manning, A.Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sen- timent treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

J. Sotscheck, W. Endres, W. Hess, R. Hoffmann, M. Krause, A. Lacroix, H. Mangold, E. Paulus, and H.E. Wolf. ITG Empfehlung 4.3.1-01 Terminologie der Sprachakustik. The Journal of the Acoustical Society of America, 1996.

R. Sproat, A.W. Black, S. Chen, Sh. Kumar, M. Ostendorf, and Ch. Richards. Normalization of non-standard words. Computer Speech and Language, 15(3):287–333, 2001.

SPTK. Speech signal processing toolkit ver. 3.6. URL http://sp-tk.sourceforge.net/.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Kenneth N Stevens, Stanley Kasowski, and C Gunnar M Fant. An electrical analog of the vocal tract. The Journal of the Acoustical Society of America, 25(4):734–742, 1953.

S.S. Stevens, J. Volkman, and E.B. Newman. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8(3):185–190, January 1937.

J.Q. Stewart. An electrical analogue of the vocal organs. Nature, 110:311–312, 1922.

B. Sudhakar and R. Bensraj. An efficient sentence-based sentiment analysis for expressive text-to-speech using fuzzy neural network. Research Journal of Applied Sciences, Engineering and Technology, 8(3):378–386, 2014.

E. Szekely, J. Cabral, P. Cahill, and J. Carson-Berndsen. Clustering expressive speech styles in audiobooks using glottal source parameters. In Proceedings of Interspeech, pages 2409–2412, 2011.

E. Szekely, T.G. Csapó, B. Tóth, P. Mihajlik, and J. Carson-Berndsen. Synthesizing expressive speech from amateur audiobook recordings. In Proceedings of Spoken Language Technology Workshop (SLT), pages 2409–2412, 2012.

M. Tachibana, J. Yamagishi, T. Masuko, and T. Kobayashi. Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing. IEICE Transactions on Information and Systems, E88-D(11):2484–2491, 2005.

M. Tachibana, J. Yamagishi, T. Masuko, and T. Kobayashi. A style adaptation technique for speech synthesis using hsmm and suprasegmental features. IEICE Transactions on Information and Systems, E89-D(3):1092–1009, 2006.

S. Takaki and J. Yamagishi. Constructing a deep neural network based spectral model for statistical speech synthesis. Recent Advances in Nonlinear Speech Processing, 48:117–125, 2016.

T. Tan, Y. Qian, and K. Yu. Cluster adaptive training for deep neural network based acoustic model. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24/3:459–468, 2016.

P. Taylor. Text-to-Speech Synthesis. Cambridge University Press, 2009.

P.A. Taylor. A Phonetic Model of English Intonation. PhD thesis, University of Edinburgh, 1992.

T. Tieleman and G. Hinton. Lecture 6.5 - rmsprop, coursera: Neural networks for machine learning. Technical Report, 2012.

K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, A.W. Black, and T. Nose. The HMM-based speech synthesis system (HTS), 2008. URL http://hts.ics.nitech.ac.jp.

A. Trilla and F. Alias. Sentence based sentiment analysis for expressive text- to-speech. IEEE Transactions on Audio, Speech and Language Processing, 21(2):223–233, 2013.

G. Ungeheuer. Elemente einer akustischen Theorie der Vokalartikulation. Springer, 1962.

Macquarie University. A brief historical introduction to speech synthesis: A macquarie perspective. URL http://clas.mq.edu.au/speech/synthesis/history_synthesis/.

C. Valentini-Botinhao, Z. Wu, and S. King. Towards minimum perceptual error training for dnn-based speech synthesis. Proceedings of Interspeech, pages 869–873, 2015.

J. Van den Berg. Myoelastic-aerodynamic theory of voice production. Journal of Speech, Language, and Hearing Research, 1:227–244, 1958.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio, 2016. URL https://deepmind.com/blog/wavenet-generative-model-raw-audio/.

J. van Santen. Assignment of segmental duration in text-to-speech synthesis. Computer Speech and Language, 8:95–128, 1994.

E. Vanmassenhove, J. Cabral, and F. Haider. Prediction of emotions from text using sentiment analysis for expressive speech synthesis. Proceedings of Speech Synthesis Workshop (SSW9), pages 119–124, 2016.

P. Vary, U. Heute, and W. Hess. Digitale Sprachsignalverarbeitung. B. G. Teubner Stuttgart, 1998.

J. Vroomen, R. Collier, and S.J.L. Mozziconacci. Duration and intonation in emotional speech. In Proceedings of EUROSPEECH, pages 577–580, 1993.

W3C. Emotion markup language (emotionml), a. URL https://www.w3.org/TR/emotionml/.

W3C. Speech synthesis markup language (ssml), b. URL https://www.w3.org/TR/speech-synthesis/.

P. Wagner. The rhythm of language and speech: constraining factors, models, metrics and applications. 2008. URL http://www.uni-bielefeld.de/lili/personen/pwagner/pubs.html.

P. Wang, Y. Qian, F.K. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based tts synthesis. In Proceedings of International conference on acoustics, speech and signal processing (ICASSP), pages 4879–4883, 2015.

X. Wang, S. Takaki, and J. Yamagishi. Investigating of using continuous rep- resentation of various linguistic units in neural network based text-to-speech synthesis. IEICE Transactions on Information and Systems, E99-D(10):2471– 2480, 2016.

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R.J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q.V. Le, Y. Agiomyrgiannakis, R. Clark, and R.A. Saurous. Tacotron: A fully end-to-end text-to-speech synthesis model. CoRR, abs/1703.10135, 2017. URL http://arxiv.org/abs/1703.10135.

O. Watts. Unsupervised Learning for Text-to-Speech Synthesis. PhD thesis, University of Edinburgh, 2012.

J.C. Wells. Sampa computer readable phonetic alphabet. Handbook of Standards and Resources for Spoken Language Systems, 4 (B), 1997.

R. Wiese. Silbische und lexikalische Phonologie: Studien zum Chinesischen und Deutschen. De Gruyter, 1988.

C. Wollermann. Prosodie, nonverbale Signale, Unsicherheit und Kontext - Studien zur pragmatischen Fokusinterpretation. PhD thesis, Universität Duisburg-Essen, 2012.

word2vec: Tool for computing continuous distributed representations of words.

Z. Wu and S. King. Investigating gated recurrent networks for speech synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5140–5144, 2016.

Z. Wu, P. Swietojanski, C. Veaux, S. Renals, and S. King. A study of speaker adaptation for dnn-based speech synthesis. Proceedings of Interspeech, 2015a.

Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King. Deep neural networks employing multi-task learning and stacked bottleneck features for speech syn- thesis. IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), pages 4460–4464, 2015b.

J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Insti- tute of Technology, 2012.

J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi. Modeling of various speaking styles and emotions for hmm-based speech synthesis. In Proceedings of Eurospeech, pages 2461–2464, 2003.

J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi. Acoustic modeling of speaking styles and emotional expressions in hmm-based speech synthesis. IEICE Transactions on Information and Systems, E88-D(3):502–509, 2005.

M.D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012. URL http://arxiv.org/abs/1212.5701. H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural net- work with recurrent output layer for low-latency speech synthesis. IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4470–4474, 2015. H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3844–3848, 2014. H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7962–7966, 2013.

A.W. Zewoudie, J. Luque, and J. Hernando. Jitter and shimmer measurements for speaker diarization. In Proceedings of Iberspeech, pages 21–30, 2014. Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311– 331, 2004.