Diphthong Synthesis Using the Three-Dimensional Dynamic Digital Waveguide Mesh

Amelia Jane Gully
PhD
University of York
Electronic Engineering
September 2017

Abstract

The human voice is a complex and nuanced instrument, and despite many years of research, no system is yet capable of producing natural-sounding synthetic speech. This affects intelligibility for some groups of listeners, in applications such as automated announcements and screen readers. Furthermore, those who require a computer to speak, due to surgery or a degenerative disease, are limited to unnatural-sounding voices that lack expressive control and may not match the user's gender, age or accent. It is evident that natural, personalised and controllable synthetic speech systems are required.

A three-dimensional digital waveguide model of the vocal tract, based on magnetic resonance imaging data, is proposed here in order to address these issues. The model uses a heterogeneous digital waveguide mesh method to represent the vocal tract airway and surrounding tissues, facilitating dynamic movement and hence speech output. The accuracy of the method is validated by comparison with audio recordings of natural speech, and perceptual tests are performed which confirm that the proposed model sounds significantly more natural than simpler digital waveguide mesh vocal tract models.

Control of such a model is also considered, and a proof-of-concept study is presented using a deep neural network to control the parameters of a two-dimensional vocal tract model, resulting in intelligible speech output and paving the way for extension of the control system to the proposed three-dimensional vocal tract model. Future improvements to the system are also discussed in detail. This project considers both the naturalness and control issues associated with synthetic speech and therefore represents a significant step towards improved synthetic speech for use across society.
Contents

Abstract
List of Figures
List of Tables
Acknowledgements
Declaration

I Introduction

1 Introduction
  1.1 Hypothesis
    1.1.1 Hypothesis Statement
    1.1.2 Description of Hypothesis
  1.2 Novel Contributions
  1.3 Statement of Ethics
  1.4 Thesis Layout

II Literature Review

2 Acoustics of Speech
  2.1 Acoustic Quantities
  2.2 The Wave Equation
  2.3 The Acoustic Duct
  2.4 The Vocal Tract Transfer Function
  2.5 Vocal Tract Anatomy
    2.5.1 Articulation
    2.5.2 Determining Vocal Tract Shape
  2.6 Source-Filter Model of Speech
    2.6.1 The Voice Source
    2.6.2 The Voice Filter
    2.6.3 Shortcomings of the Source-Filter Model
  2.7 Vowels
    2.7.1 The Vowel Quadrilateral
    2.7.2 Static Vowels
    2.7.3 Dynamic Vowels
  2.8 Consonants
    2.8.1 Voice-Manner-Place
    2.8.2 Static Consonants
    2.8.3 Dynamic Consonants
  2.9 Running Speech
    2.9.1 Coarticulation and Assimilation
    2.9.2 Reduction and Casual Speech
    2.9.3 Prosody
  2.10 Conclusion

3 Text-to-Speech (TTS) Synthesis
  3.1 Components of TTS Systems
  3.2 Waveform Generation Techniques
    3.2.1 Formant Synthesis
    3.2.2 Concatenative Synthesis
    3.2.3 Statistical Parametric Synthesis
    3.2.4 Articulatory Synthesis
  3.3 TTS for Assistive Technology
    3.3.1 Augmentative and Alternative Communication (AAC)
    3.3.2 Other Applications
  3.4 TTS for Commercial Applications
  3.5 Evaluating Synthetic Speech
    3.5.1 Naturalness
    3.5.2 Other Evaluation Criteria
  3.6 Alternatives to TTS
  3.7 Conclusion

4 Physical Vocal Tract Models
  4.1 Physical Modelling Techniques
    4.1.1 Transmission Lines
    4.1.2 Reflection Lines
    4.1.3 Finite-Difference Time-Domain (FDTD) Models
    4.1.4 Digital Waveguide Mesh (DWM) Models
    4.1.5 Transmission Line Matrix (TLM) Models
    4.1.6 Finite Element Method (FEM) Models
    4.1.7 Boundary Element Method (BEM) Models
    4.1.8 Choosing a Modelling Approach
  4.2 One-Dimensional Vocal Tract Models
    4.2.1 Transmission-Line Vocal Tract Models
    4.2.2 Reflection-Line Vocal Tract Models
    4.2.3 Shortcomings of 1D Vocal Tract Models
  4.3 Two-Dimensional Vocal Tract Models
  4.4 Three-Dimensional Vocal Tract Models
  4.5 Evaluating Physical Vocal Tract Models
  4.6 Challenges for 3D Vocal Tract Modelling
  4.7 Conclusion

III Original Research

5 Dynamic 3D DWM Vocal Tract
  5.1 Homogeneous DWM Vocal Tract Models
    5.1.1 Homogeneous 2D DWM Vocal Tract Model
    5.1.2 Homogeneous 3D DWM Vocal Tract Model
    5.1.3 A note about MRI data
  5.2 Heterogeneous DWM Vocal Tract Models
    5.2.1 Heterogeneous 2D DWM Vocal Tract Model
  5.3 Boundaries in the DWM
  5.4 Proposed Model
    5.4.1 Data Acquisition and Pre-Processing
    5.4.2 Admittance Map Construction
  5.5 Model Refinement
    5.5.1 Mesh Alignment and Extent
    5.5.2 Source and Receiver Positions
  5.6 Monophthong Synthesis
    5.6.1 Procedure
    5.6.2 Results and Discussion
  5.7 Diphthong Synthesis
    5.7.1 Procedure
    5.7.2 Results and Discussion
  5.8 Implementation
  5.9 Conclusion

6 Perceptual Testing
  6.1 Subjective Testing Methods
    6.1.1 Test Format
    6.1.2 Test Delivery
    6.1.3 Test Materials
    6.1.4 Summary
  6.2 Pilot Test: Dimensionality Increase
    6.2.1 Method
    6.2.2 Results and Discussion
    6.2.3 Summary
  6.3 Pilot Test: Sampling Frequency
    6.3.1 Method
    6.3.2 Results
    6.3.3 Discussion
    6.3.4 Summary
  6.4 Perceived Naturalness of 3DD-DWM Vocal Tract Model
    6.4.1 Method
    6.4.2 Comparison with Recordings
    6.4.3 Comparison with 3D FEM Model
    6.4.4 Listener Comments
    6.4.5 Demographic Effects
    6.4.6 Summary
  6.5 Conclusion

7 Combining DWM and Statistical Approaches
  7.1 The DNN-driven TTS Synthesiser
  7.2 Reformulating the 2D DWM
    7.2.1 The K-DWM
    7.2.2 Simple DWM Boundaries
  7.3 The DNN-driven 2D DWM Synthesiser
    7.3.1 Mesh Construction
    7.3.2 Optimisation Method
  7.4 Pilot Study
    7.4.1 Method
    7.4.2 Results
    7.4.3 Summary
  7.5 Discussion and Extensions
  7.6 Conclusion

8 Conclusions and Further Work
  8.1 Thesis Summary
  8.2 Novel Contributions
  8.3 Hypothesis
  8.4 Future Work
  8.5 Closing Remarks

Appendix A Index of Accompanying Multimedia Files
Appendix B Listener Comments from Perceptual Tests
  B.1 First Pilot
  B.2 Second Pilot
  B.3 Final Test
Appendix C List of Acronyms
Appendix D List of Symbols
References

List of Figures

2.1 The travelling-wave solution to the wave equation. An initial input at t = 0 propagates through a 1D domain via travelling wave components p− and p+ which propagate left and right, respectively, throughout the domain as time goes on. Variable p is the sum of the travelling wave components.

2.2 Scattering at an impedance discontinuity, after [16, p. 561]. Z1 and Z2 represent the characteristic acoustic impedances in tube 1 and 2 respectively. Right-going pressure wave component p1+ is incident on the boundary, and some is transmitted into tube 2 as right-going pressure component p2+, whereas some is.

