Diphthong Synthesis Using the Three-Dimensional Dynamic Digital Waveguide Mesh
Total Page:16
File Type:pdf, Size:1020Kb
Diphthong Synthesis using the Three-Dimensional Dynamic Digital Waveguide Mesh Amelia Jane Gully PhD University of York Electronic Engineering September 2017 Abstract The human voice is a complex and nuanced instrument, and despite many years of research, no system is yet capable of producing natural-sounding synthetic speech. This affects intelligibility for some groups of listeners, in applications such as automated announcements and screen readers. Further- more, those who require a computer to speak|due to surgery or a degener- ative disease|are limited to unnatural-sounding voices that lack expressive control and may not match the user's gender, age or accent. It is evident that natural, personalised and controllable synthetic speech systems are required. A three-dimensional digital waveguide model of the vocal tract, based on magnetic resonance imaging data, is proposed here in order to address these issues. The model uses a heterogeneous digital waveguide mesh method to represent the vocal tract airway and surrounding tissues, facilitating dy- namic movement and hence speech output. The accuracy of the method is validated by comparison with audio recordings of natural speech, and per- ceptual tests are performed which confirm that the proposed model sounds significantly more natural than simpler digital waveguide mesh vocal tract models. Control of such a model is also considered, and a proof-of-concept study is presented using a deep neural network to control the parameters of a two-dimensional vocal tract model, resulting in intelligible speech output and paving the way for extension of the control system to the proposed three- dimensional vocal tract model. Future improvements to the system are also discussed in detail. This project considers both the naturalness and control issues associated with synthetic speech and therefore represents a significant step towards improved synthetic speech for use across society. 2 Contents Abstract 2 List of Figures 10 List of Tables 17 Acknowledgements 20 Declaration 21 I Introduction 22 1 Introduction 23 1.1 Hypothesis . 25 1.1.1 Hypothesis Statement . 25 1.1.2 Description of Hypothesis . 25 1.2 Novel Contributions . 26 1.3 Statement of Ethics . 27 1.4 Thesis Layout . 27 3 CONTENTS 4 II Literature Review 30 2 Acoustics of Speech 31 2.1 Acoustic Quantities . 31 2.2 The Wave Equation . 32 2.3 The Acoustic Duct . 34 2.4 The Vocal Tract Transfer Function . 37 2.5 Vocal Tract Anatomy . 38 2.5.1 Articulation . 38 2.5.2 Determining Vocal Tract Shape . 40 2.6 Source-Filter Model of Speech . 42 2.6.1 The Voice Source . 42 2.6.2 The Voice Filter . 45 2.6.3 Shortcomings of the Source-Filter Model . 46 2.7 Vowels . 47 2.7.1 The Vowel Quadrilateral . 48 2.7.2 Static Vowels . 49 2.7.3 Dynamic Vowels . 51 2.8 Consonants . 52 2.8.1 Voice-Manner-Place . 52 2.8.2 Static Consonants . 53 2.8.3 Dynamic Consonants . 56 2.9 Running Speech . 58 2.9.1 Coarticulation and Assimilation . 58 2.9.2 Reduction and Casual Speech . 59 2.9.3 Prosody . 60 2.10 Conclusion . 61 CONTENTS 5 3 Text-to-Speech (TTS) Synthesis 62 3.1 Components of TTS Systems . 63 3.2 Waveform Generation Techniques . 65 3.2.1 Formant Synthesis . 66 3.2.2 Concatenative Synthesis . 67 3.2.3 Statistical Parametric Synthesis . 70 3.2.4 Articulatory Synthesis . 78 3.3 TTS for Assistive Technology . 82 3.3.1 Augmentative and Alternative Communication (AAC) 82 3.3.2 Other Applications . 84 3.4 TTS for Commercial Applications . 85 3.5 Evaluating Synthetic Speech . 86 3.5.1 Naturalness . 86 3.5.2 Other Evaluation Criteria . 91 3.6 Alternatives to TTS . 91 3.7 Conclusion . 93 4 Physical Vocal Tract Models 94 4.1 Physical Modelling Techniques . 95 4.1.1 Transmission Lines . 95 4.1.2 Reflection Lines . 96 4.1.3 Finite-Difference Time-Domain (FDTD) Models . 97 4.1.4 Digital Waveguide Mesh (DWM) Models . 99 4.1.5 Transmission Line Matrix (TLM) Models . 100 4.1.6 Finite Element Method (FEM) Models . 101 4.1.7 Boundary Element Method (BEM) Models . 102 4.1.8 Choosing a Modelling Approach . 102 CONTENTS 6 4.2 One-Dimensional Vocal Tract Models . 103 4.2.1 Transmission-Line Vocal Tract Models . 103 4.2.2 Reflection-Line Vocal Tract Models . 105 4.2.3 Shortcomings of 1D Vocal Tract Models . 107 4.3 Two-Dimensional Vocal Tract Models . 108 4.4 Three-Dimensional Vocal Tract Models . 110 4.5 Evaluating Physical Vocal Tract Models . 113 4.6 Challenges for 3D Vocal Tract Modelling . 115 4.7 Conclusion . 116 III Original Research 117 5 Dynamic 3D DWM Vocal Tract 118 5.1 Homogeneous DWM Vocal Tract Models . 119 5.1.1 Homogeneous 2D DWM Vocal Tract Model . 119 5.1.2 Homogeneous 3D DWM Vocal Tract Model . 121 5.1.3 A note about MRI data . 121 5.2 Heterogeneous DWM Vocal Tract Models . 122 5.2.1 Heterogeneous 2D DWM Vocal Tract Model . 123 5.3 Boundaries in the DWM . 125 5.4 Proposed Model . 127 5.4.1 Data Acquisition and Pre-Processing . 127 5.4.2 Admittance Map Construction . 132 5.5 Model Refinement . 135 5.5.1 Mesh Alignment and Extent . 135 5.5.2 Source and Receiver Positions . 139 5.6 Monophthong Synthesis . 142 CONTENTS 7 5.6.1 Procedure . 142 5.6.2 Results and Discussion . 143 5.7 Diphthong Synthesis . 158 5.7.1 Procedure . 158 5.7.2 Results and Discussion . 158 5.8 Implementation . 167 5.9 Conclusion . 168 6 Perceptual Testing 170 6.1 Subjective Testing Methods . 171 6.1.1 Test Format . 171 6.1.2 Test Delivery . 174 6.1.3 Test Materials . 175 6.1.4 Summary . 176 6.2 Pilot Test: Dimensionality Increase . 177 6.2.1 Method . 177 6.2.2 Results and Discussion . 180 6.2.3 Summary . 184 6.3 Pilot Test: Sampling Frequency . 184 6.3.1 Method . 185 6.3.2 Results . 186 6.3.3 Discussion . 188 6.3.4 Summary . 190 6.4 Perceived Naturalness of 3DD-DWM Vocal Tract Model . 190 6.4.1 Method . 191 6.4.2 Comparison with Recordings . 192 6.4.3 Comparison with 3D FEM Model . 194 CONTENTS 8 6.4.4 Listener Comments . 197 6.4.5 Demographic Effects . 197 6.4.6 Summary . 199 6.5 Conclusion . 200 7 Combining DWM and Statistical Approaches 202 7.1 The DNN-driven TTS Synthesiser . 203 7.2 Reformulating the 2D DWM . 207 7.2.1 The K-DWM . 207 7.2.2 Simple DWM Boundaries . 210 7.3 The DNN-driven 2D DWM Synthesiser . 210 7.3.1 Mesh Construction . 212 7.3.2 Optimisation Method . 216 7.4 Pilot Study . 218 7.4.1 Method . 218 7.4.2 Results . 219 7.4.3 Summary . 230 7.5 Discussion and Extensions . 231 7.6 Conclusion . 233 8 Conclusions and Further Work 235 8.1 Thesis Summary . 235 8.2 Novel Contributions . 238 8.3 Hypothesis . 239 8.4 Future Work . 241 8.5 Closing Remarks . 245 Appendix A Index of Accompanying Multimedia Files 246 CONTENTS 9 Appendix B Listener Comments from Perceptual Tests 248 B.1 First Pilot . 248 B.2 Second Pilot . 251 B.3 Final Test . 252 Appendix C List of Acronyms 257 Appendix D List of Symbols 259 References 262 List of Figures 2.1 The travelling-wave solution to the wave equation. An initial input at t = 0 propagates through a 1D domain via travelling wave components p− and p+ which propagate left and right, respectively, throughout the domain as time goes on. Variable p is the sum of the travelling wave components. 33 2.2 Scattering at an impedance discontinuity, after [16, p. 561]. Z1 and Z2 represent the characteristic acoustic impedances in tube 1 and 2 respectively. Right-going pressure wave compo- + nent p1 is incident on the boundary, and some is transmitted + into tube 2 as right-going pressure component p2 , whereas some is.