Unsupervised Learning for Text-To-Speech Synthesis
Unsupervised Learning for Text-to-Speech Synthesis

Oliver Watts

Thesis submitted for the degree of Doctor of Philosophy
The University of Edinburgh
2012

Declaration

I have composed this thesis. The work in it is my own unless stated.

Oliver Watts

Acknowledgements

I would like to thank the following people:

• Simon King and Junichi Yamagishi, for all of their supervision, guidance and advice
• All those who provided useful feedback after earlier presentations of this work, in particular Steve Renals and Rob Clark
• Everybody at the Centre for Speech Technology Research, for creating such a friendly and stimulating place to work
• Karl Isaac for helping set up an experiment using Mechanical Turk, and Cássia Valentini-Botinhão for helping to recruit local listeners for experiments
• Adriana Stan at the University of Cluj-Napoca for providing the raw version of the news-text data used in Chapter 9 (and indeed for having made the whole of her Romanian speech corpus generally available), and for recruiting participants for the evaluation of Romanian systems
• Martti Vainio and Antti Suni at the University of Helsinki for making the Finnish database available for the experiments described in Chapter 8
• Antti Suni (again) and Heini Kallio at the University of Helsinki and Tuomo Raitio at Aalto University for recruiting participants for the evaluation of Finnish systems.

This work was made possible by a Doctoral Training Award from the Engineering and Physical Sciences Research Council (EPSRC: http://www.epsrc.ac.uk), and has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF: http://www.ecdf.ed.ac.uk). The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk).

Abstract

This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems.
Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources exist. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages.

The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant.

The method is applied to three levels of textual analysis: the characterisation of sub-syllabic units, word units and utterances. Entire systems for three languages (English, Finnish and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented.

Contents

1 Introduction
  1.1 Goals and Approach
  1.2 Organisation of this thesis

2 Background
  2.1 Conventional Systems
    2.1.1 Synthesiser Front-end: Linguistic modelling
    2.1.2 Synthesiser Back-end: Acoustic Modelling
    2.1.3 Training Recipes Used
  2.2 Types of Alternative Approach
    2.2.1 Conventional Approach
    2.2.2 Naive Approach
    2.2.3 Speech-Driven Approach
    2.2.4 Text-Driven Approach
    2.2.5 Approach Taken in this Thesis
  2.3 Naivety, language independence, and language typology

3 Benchmark Systems
  3.1 Introduction
  3.2 Experiment 1: Contribution of High-Level Features
    3.2.1 Overview
    3.2.2 Data Used
    3.2.3 Annotation Used
    3.2.4 Subjective Evaluation
    3.2.5 Systems' Use of Linguistic Features
  3.3 Experiment 2: Contribution of Phonemic Transcription
    3.3.1 Letter-Based Speech Synthesis
    3.3.2 Phonemes: Modelling vs. Generalisation
    3.3.3 Induced Compound Features
    3.3.4 Evaluation
    3.3.5 Results
  3.4 Experiment 3: Contribution of Phonetic Categories
    3.4.1 Overview
    3.4.2 Systems Trained
    3.4.3 Evaluation
  3.5 Conclusions

4 Vector Space Models
  4.1 Vector Space Modelling: an Example
  4.2 Vector Space Modelling Applied to TTS
    4.2.1 Serial Tree Building: Shortcomings
    4.2.2 Distributional–Acoustic Modelling: Vector Space Models and Decision Trees
    4.2.3 Strengths of the proposed approach
    4.2.4 The Vector Space Models Used in this Thesis
    4.2.5 Design Choices
  4.3 Previous Work on Vector Space Models of Language
    4.3.1 Salton's Vector Space Model
    4.3.2 Dimension-Reduced Vector Space Model: Latent Semantic Indexing
    4.3.3 Reduced Vector Space Model: Other Projections
    4.3.4 Probabilistic topic models
  4.4 Other Techniques for the Induction of Linguistic Representations
    4.4.1 Hierarchical Clustering Approaches
    4.4.2 HMM-based Approaches
    4.4.3 Graph Partitioning Approaches
    4.4.4 Connectionist Approaches

5 Letter Space
  5.1 Introduction
  5.2 A Vector Space Model of Letter Types
  5.3 A Vector Space Model of Phoneme Types
    5.3.1 Experiment
  5.4 Conclusions

6 Word Space
  6.1 Introduction
  6.2 A Vector Space Model for Word Types
  6.3 Experiment 1: Word VSM for Phrase-Break Prediction
    6.3.1 Background: Phrase-break Prediction
    6.3.2 Experiment
    6.3.3 Results
  6.4 Experiment 2: Word VSM for State-Tying
    6.4.1 Experiment
  6.5 Conclusions and Open Problems
7 Refinements and Extensions
  7.1 Word Space: Corpus Size and Parameter Settings
    7.1.1 Data and Vector Space Models
    7.1.2 Results
  7.2 Feature Selection for HMM TTS
    7.2.1 Ensembles of Trees
    7.2.2 Variable Importance Measures
    7.2.3 Feature Selection for the Distributional–Acoustic Method
    7.2.4 Experiment: Word VSM for State-Tying, Revisited
    7.2.5 Conclusion
  7.3 An Utterance Space Model
    7.3.1 A Vector Space Model of Utterances
    7.3.2 Experiment: Separability of Questions and Non-Questions

8 Evaluation of Entire Systems
  8.1 Introduction
  8.2 Language Characteristics
  8.3 Initial Hypotheses and Systems Built
  8.4 System Construction and Features Used
    8.4.1 Feature-set A
    8.4.2 Feature-sets B, C & D
    8.4.3 Feature-set T: Topline Features
    8.4.4 Feature Selection: Systems *-*-2
  8.5 Synthesis
  8.6 Objective Evaluation
    8.6.1 Method
    8.6.2 Results
  8.7 Subjective Evaluation
    8.7.1 Selected Hypotheses
    8.7.2 Procedure
    8.7.3 Results
  8.8 Conclusions

9 Conclusions and Future Work
  9.1 Contributions
  9.2 Future work

List of Figures

2.1 Splitting a toy dataset for letter-to-sound conversion
3.1 Results of AB test for naturalness in Experiment 1
3.2 Questions asked at test time in Experiment 1
3.3 Serial tree building: a toy example
3.4 Results of Experiment 2
3.5 Serial tree building: another toy example
3.6 Results of Experiment 3
3.7 Leaf node sizes in three systems
4.1 Induction of word representations: toy example
4.2 Two dimensions of a space of possible vector space models
5.1 Two dimensions of a letter space
5.2 Two dimensions of a phoneme space
5.3 Results of phoneme-space experiment
6.1 Results of phrase-break experiment
6.2 Sizes of trees built for phrase-break experiment
6.3 Portions of two trees built for phrase-break experiment
6.4 Results of state-tying experiment
7.1 The effect of varying method for handling unseen words in a word space
7.2 The effect of varying context vocabulary size in a word space
7.3 Results on phrase-break task, compared with earlier results
7.4 Results of state-tying experiment
7.5 Two dimensions of an utterance space (1)
7.6 Two dimensions of an utterance space (2)
8.1 Morphological richness of three target languages
8.2 Construction of naive systems
8.3 Objective evaluation results for 15 English, Finnish and Romanian voices
    (a) Bark cepstral distortion (English)
    (b) Root Mean Square Error of F0 (English)
    (c) Bark cepstral distortion (Finnish)
    (d) Root Mean Square Error of F0 (Finnish)
    (e) Bark cepstral distortion (Romanian)
    (f) Root Mean Square Error of F0 (Romanian)
8.4 Objective evaluation results for 30 English, Finnish and Romanian voices
    (a) Bark cepstral distortion (English)
    (b) Root Mean Square Error of F0 (English)
    (c) Bark cepstral distortion (Finnish)
    (d) Root Mean Square Error of F0 (Finnish)
    (e) Bark cepstral distortion (Romanian)