Personalising Synthetic Voices for Individuals with Severe Speech Impairment
Total Page:16
File Type:pdf, Size:1020Kb
PERSONALISING SYNTHETIC VOICES FOR INDIVIDUALS WITH SEVERE SPEECH IMPAIRMENT Sarah M. Creer DOCTOR OF PHILOSOPHY AT DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF SHEFFIELD SHEFFIELD, UK AUGUST 2009 IMAGING SERVICES NORTH Boston Spa, Wetherby West Yorkshire, lS23 7BQ www.bl.uk Contains CD Table of Contents Table of Contents ii List of Figures vi Abbreviations x 1 Introduction 1 1.1 Population affected by speech loss and impairment 2 1.1.1 Dysarthria .......... 2 1.2 Voice Output Communication Aids . 3 1.3 Scope of the thesis . 5 1.4 Thesis overview . 6 1.4.1 Chapter 2: VOCAs and social interaction 6 1.4.2 Chapter 3: Speech synthesis methods and evaluation. 7 1.4.3 Chapter 4: HTS - HMM-based synthesis. 7 1.4.4 Chapter 5: Voice banking using HMM-based synthesis for data pre- deterioration .. 7 1.4.5 Chapter 6: Building voices using dysarthric data 8 1.4.6 Chapter 7: Conclusions and further work 8 2 VOCAs and social interaction 9 2.1 Introduction ...................... 9 2.2 Acceptability of AAC: social interaction perspective 9 2.3 Acceptability of VOCAs . 10 2.3.1 Speed of access . 11 2.3.2 Aspects related to the output voice . 15 2.4 Dysarthric speech. 21 2.4.1 Types of dysarthria ........ 22 2.4.2 Acoustic description of dysarthria . 23 2.4.3 Theory of dysarthric speech production 26 2.5 Requirements of aVOCA . .. ... 27 ii 2.6 Conclusions................. 27 3 Speech synthesis methods and evaluation 29 3.1 Introduction ..... 29 3.2 Evaluation...... 29 3.2.1 Intelligibility 30 3.2.2 Naturalness. 31 3.2.3 Similarity to target speaker 32 3.3 Methods of synthesis . 33 3.3.1 Articulatory synthesis . 33 3.3.2 Parametric synthesis . 36 3.3.3 Concatenative synthesis 43 3.3.4 Model-based synthesis 54 3.3.5 Voice conversion 58 3.4 Conclusions.......... 62 4 HTS - HMM-based synthesis 65 4.1 Introduction ......... 65 4.2 HTS overview . 65 4.2.1 Speaker dependent system . 66 4.2.2 Speaker adaptation system 67 4.3 Feature vectors .......... 68 4.4 Model details: contexts and parameter sharing 69 4.5 Model construction . 70 4.5.1 Speaker dependent models. 70 4.5.2 Speaker adaptation models 70 4.6 Synthesis ............. 70 4.6.1 Global variance ...... 71 4.7 HMM-based synthesis and dysarthric speech. 71 4.7.1 Misalignment between data and labels 71 4.7.2 Proposed approach. 72 4.8 Conclusion .................. 79 5 Voice banking using HMM-based synthesis for speech data pre-deterioration 81 5.1 Introduction. 81 5.2 Background.. 81 5.3 Evaluation... 82 5.3.1 Stimuli 82 5.3.2 Participants. 85 5.3.3 Procedure. 85 5.3.4 Results ... 86 iii 5.3.5 Discussion ................ 89 5.4 Acoustic features affecting listener judgements 91 5.4.1 Introduction ..... 91 5.4.2 Multi-layer perceptrons 93 5.4.3 Feature extraction ... 93 5.4.4 'fraining......... 95 5.4.5 Results: speaker-dependent MLPs 96 5.4.6 Results: speaker-independent MLPs 97 5.4.7 Discussion. 99 5.5 Conclusions............... 103 6 Building voices using dysarthric data 104 6.1 Introduction .. 104 6.2 Evaluation... 104 6.2.1 Method 105 6.3 Results..... 113 6.3.1 Speaker 3 113 6.3.2 Speaker 4 116 6.3.3 Speaker 5 119 6.4 Discussion.... 121 6.4.1 Limitations of the evaluation 121 6.4.2 Capturing speaker characteristics 122 6.4.3 Voice output quality . 123 6.4.4 Manipulation of prosodic features . 125 6.4.5 Speech reconstruction 126 6.4.6 Speaker acceptability. 126 6.5 Conclusion ......... 127 7 Conclusions and further work 129 7.1 Introduction. 129 7.2 Conclusions 129 7.3 Contributions 131 7.4 Further work 131 7.4.1 Value assessment of the application. 132 7.4.2 Automation of the procedure .. 133 7.4.3 Specification of target population. 136 7.5 Summary ....... 138 A Speech production theory 139 A.l Classical phonetics and phonology 139 A.I.1 International phonetic alphabet . 139 iv A.1.2 Limitations .. 140 A.2 Coarticulation theory 140 A.2.1 Target theory. 141 A.3 Action theory . 143 A.4 Articulatory phonology 143 A.5 Task dynamics .. 144 A.6 Cognitive phonetics. 144 B HMM-based speech synthesis: HTS further details 146 B.l Hidden Markov Models .... 146 B.2 HMM-based synthesis features 148 B.2.1 Spectral features 148 B.2.2 Log FO ... .. 149 B.2.3 Aperiodicity... 149 B.3 Context-dependent modelling for HMM-based synthesis 150 B.4 Label file format . 153 B.5 Average voice building and speaker adaptation 154 B.5.1 Adaptation from the average voice 155 C Test set sentences 157 D Protocol for data selection process 159 E Test set for evaluation of voices built with dysarthric data 160 v List of Figures 3.1 Source-filter model ........ 37 3.2 Basic cascade formant synthesiser . 38 3.3 Basic parallel formant synthesiser configuration 39 3.4 Decreasing pitch using PSOLA ........ 50 4.1 Overview of the basic HTS HMM-based speech synthesis speaker-dependent system. .. 66 4.2 Overview of HTS HMM-based synthesis speaker-independent adaptation sys- tern 67 4.3 The phrase "a cargo back as well" as spoken by a dysarthric speaker, labelled for use as adaptation data . .. 75 4.4 Possible component feature selection from the average voice model and the participant speaker model to produce an output speaker model . .. 77 5.1 Boxplot of results for the listener experiment for speaker 1 . 87 5.2 Boxplot of results for the listener experiment for speaker 2 . 88 5.3 Overall results of the listener experiment for speaker 1 88 5.4 Overall results of the listener experiment for speaker 2 89 5.5 Multi-layer percept ron . 94 5.6 Results of the speaker-dependent MLP experiment for speaker 1 97 5.7 Results of the speaker-dependent MLP experiment for speaker 2 98 5.8 Results of the listener experiments (equal contribution from each speaker) 100 5.9 Results of the speaker-independent MLP experiments ........... 101 vi 6.1 Amount of data accepted by the system in the case of unedited original recorded data and data edited for intelligibility ............... 108 A.1 Alternative models of speech production .. 142 B.1 Hidden Markov Model . 147 B.2 Hidden Semi-Markov Model (HSMM) 147 B.3 Aperiodicity component extraction . 150 B.4 Conversion of utterance level orthographic transcription to phonetic and prosodic labelling . 152 B.5 Component parts and structure of CSMAPLR 155 vii Abstract Speech technology can help individuals with speech disorders to interact more easily. Many individuals with severe speech impairment, due to conditions such as Parkinson's disease or motor neurone disease, use voice output communication aids (VOCAs), which have synthesised or pre recorded voice output. This voice output effectively becomes the voice of the individual and should therefore represent the user accurately. Currently available personalisation of speech synthesis techniques require a large amount of data input, which is difficult to produce for individuals with severe speech impairment. These techniques also do not provide a solution for those individuals whose voices have begun to show the effects of dysarthria. The thesis shows that Hidden Markov Model (HMM)-based speech synthesis is a promising approach for 'voice banking' for individuals before their condition causes deterioration of the speech and once deterioration has begun. Data input requirements for building personalised voices with this technique using human listener judgement evaluation is investigated. It shows that 100 sentences is the minimum required to build a significantly different voice from an average voice model and show some resemblance to the target speaker. This amount depends on the speaker and the average model used. A neural network analysis trained on extracted acoustic features revealed that spectral features had the most influence for predicting human listener judgements of similarity of synthesised speech to a target speaker. Accuracy of prediction significantly improves if other acoustic features are introduced and combined non-linearly. These results were used to inform the reconstruction of personalised synthetic voices for speakers whose voices had begun to show the effects of their conditions. Using HMM-based synthesis, per sonalised synthetic voices were built using dysarthric speech showing similarity to target speakers without recreating the impairment in the synthesised speech output. viii Acknowledgements Many thanks must go to my supervisors, Phil Green and Stuart Cunningham, whose experience, knowledge and direction was always appreciated and gratefully received. This research was sponsored by the Engineering and Physical Sciences Research Council. This work would not have been possible without the contribution from Junichi Yamagishi whose willingness to share his expertise and ideas has been greatly appreciated. Thanks must also go to others at CSTR who made my visit there so useful and enjoyable. Thanks to Margaret Freeman in the Department of Human Communication Sciences for the advice and assistance in finding participants. I am extremely grateful to all the participants in this study whose time and contributions have been essential and much appreciated. Thanks to all those in the VIVOCA project and the CAST and RAT groups for their interest and involvement in this work and general support. A thank you goes to SpandH, past and present for all the assistance, discussions, coffee times and the willingness to tolerate the quizzes and sweepstakes. Thanks also to those in ICOSS floor 3 for the tea and the solidarity throughout all weather conditions. It is good to have the opportunity to thank my friends and family for their great and continuing support.