"SOME INVESTIGATIONS FOR SEGMENTATION IN BY CONCATENATION FOR MORE NATURALNESS WITH APPLICATION TO TEXT TO SPEECH (TTS) FOR MARATHI "

A THESIS SUBMITTED TO

BHARATI VIDYAPEETH UNIVERSITY, PUNE

FOR THE AWARD OF THE DEGREE OF

DOCTOR OF PHILOSOPHY

IN ELECTRONICS ENGINEERING

UNDER THE FACULTY OF

ENGINEERING AND TECHNOLOGY

SUBMITTED BY

MRS. SMITA P. KAWACHALE

UNDER THE GUIDANCE OF

DR. J. S. CHITODE

RESEARCH CENTRE BHARATI VIDYAPEETH DEEMED UNIVERSITY COLLEGE OF ENGINEERING, PUNE - 411043

JUNE, 2015

CERTIFICATE

This is to certify that the work incorporated in the thesis entitled "Some investigations for segmentation in speech synthesis by concatenation for more naturalness with application to text to speech (TTS) for Marathi language" for the degree of 'Doctor of Philosophy' in the subject of Electronics Engineering under the faculty of Engineering and Technology has been carried out by Mrs. Smita P. Kawachale in the Department of Electronics Engineering at Bharati Vidyapeeth Deemed University, College of Engineering, Pune during the period from August 2010 to October 2014 under the guidance of Dr. J. S. Chitode.

Principal, College of Engineering, Bharati Vidyapeeth University, Pune

Place: Date:


DECLARATION BY THE CANDIDATE

I declare that the thesis entitled "Some investigations for segmentation in speech synthesis by concatenation for more naturalness with application to text to speech (TTS) for Marathi language" submitted by me for the degree of 'Doctor of Philosophy' is the record of work carried out by me during the period from August 2010 to October 2014 under the guidance of Dr. J. S. Chitode and has not formed the basis for the award of any degree, diploma, associateship, fellowship or titles in this or any other university or other institution of higher learning.

I further declare that the material obtained from other sources has been duly acknowledged in the thesis.

Signature of the candidate (Mrs. Smita P. Kawachale)

Place: Date:


CERTIFICATE OF GUIDE

This is to certify that the work incorporated in the thesis entitled "Some investigations for segmentation in speech synthesis by concatenation for more naturalness with application to text to speech (TTS) for Marathi language" submitted by Mrs. Smita P. Kawachale for the degree of 'Doctor of Philosophy' in the subject of Electronics Engineering under the faculty of Engineering and Technology has been carried out in the Bharati Vidyapeeth University's College of Engineering, Pune during the period from August 2010 to October 2014 under my direct guidance.

Dr. J. S. Chitode (Guide)

Place: Date:


ACKNOWLEDGEMENT

There are a number of people without whom this thesis might not have been written, and to whom I am greatly indebted.

In the first place, I would like to record my gratitude to the honorable Prof. Dr. A. R. Bhalerao for his supervision, advice and guidance from the very early stage of this research, as well as for giving me extraordinary experiences throughout the work. Above all, and most needed, he provided me with unflinching encouragement and support in various ways. His true scientific and engineering intuition has made him a constant oasis of ideas and passion, which has exceptionally inspired and enriched my growth as a student, a researcher and the scientist I want to be. I am indebted to him more than he knows.

I would like to thank my guide, Professor Dr. J. S. Chitode, for providing me the opportunity to work with him. I am so deeply grateful for his help, professionalism and valuable guidance throughout this work and throughout my entire program of study that I do not have enough words to express my deep and sincere appreciation.

To my parents, who have been sources of encouragement and inspiration to me throughout my life. To my brother and brother-in-law, who have been very helpful and supportive in the completion of this thesis. To my father-in-law and mother-in-law, both of whom have supported my work on this thesis and encouraged me to redefine and recreate my ability.

To my dear husband, a very special thank you for your practical and emotional support as I added the roles of wife and then mother to the competing demands of work, study and personal development.

To my daughter, special thanks for being patient and helpful in my daily work routine while I completed this thesis.


I would also like to thank the experts who were involved in the validation and evaluation of this research work: Dr. S. R. Gengage, Dr. G. S. Mani and Dr. Mrs K. S. Jog. Without their participation and input, the validation and evaluation of this work could not have been successfully conducted.

Many thanks go in particular to the M.I.T. College team, my HOD Dr. G. N. Mulay, and to all my colleagues and students. I am much indebted to Professor Allwyn Anuse for his valuable advice, support, guidance and help in carrying out the NN based segmentation part of my work. Special thanks to all my students for their great help and support: Pritam, Vrushali, Nilesh, Anand, Rahul, Ankit, Rushikesh, Nishit, Khushboo, Nikhil, Jaydeep, Pratap, Kuldeep, Nitish, Gaurav, Rohit, Saurabh, Chaitnya, Nihar and Subhash.

I would also like to acknowledge all my team at Bharati Vidyapeeth, Prof. Vaidya Sir, Mrs. Raut madam, Prof. Dawande, Prof. Chimate, Prof. Kurkute, Mrs. Sampada Dhole madam, Prof. Prachi Mule, Prof. Mrs. Paygude and Prof. Mrs. Vandana Gaikwad, for their advice and for sharing their bright thoughts with me, which were very fruitful in shaping my ideas and research.

Special thanks to Mrs. Mangal Patil madam for her consistent help, support and the time she has given to my work over these last five years. Special thanks to Mr. Salunke of the RD cell.

Thanks to my entire team of linguists, Prof. Smita Bondre, Prof. H. G. Mate, Prof. Birajdaar and Mrs. Swati Kulkarni, for their language guidance and the time they have given for contextual analysis, which helped in database optimization and syllabification.

I am very grateful to Mr. Suryawanshi of Bharati Bhavan for his timely help and support, and for all the help related to university submissions, formats and procedures.


Finally, I would like to thank everybody who was important to the successful realization of this thesis, and I express my apology that I could not mention everyone personally one by one.

Mrs. Smita Kawachale


CONTENTS

1 List of tables XVIII

2 List of figures XIX-XXVIII
3 Abbreviations XXIX
4 Abstract XXX-XXXV

No. Title of the chapter & contents Page no.
1. System block diagram 1-3
Introduction to system block diagram 2

2. Objectives of the research 4-9
2.1 Objectives 6
2.2 Sub-objectives of the research work 7
2.3 Organization of the thesis 7

3. Theory of TTS 10-34
3.1 Introduction 11
3.2 Sound elements for speech synthesis 15
3.2.1 Classification of speech 16
3.2.2 Elements of a language 18
3.3 Methods and approaches to speech synthesis 18
3.4 Language study 25
3.4.1 The consonants 26
3.4.2 Vowels 27
3.4.3 Consonant conjuncts 28
3.5 Present scenario of TTS systems 29
3.5.1 DECTalk 29
3.5.2 Bell labs text-to-speech 30
3.5.3 Laureate 30
3.5.4 SoftVoice 31
3.5.5 CNET PSOLA 31


3.5.6 ORATOR 32
3.5.7 Eurovocs 32
3.5.8 Lernout & Hauspie's 33
3.5.9 Apple plain talk 33
3.5.10 Silpa 34

4. Literature review 35-63
4.1 Introduction 36
4.2 Review of the literature 37
4.2.1 Review of context based speech synthesis papers 37
4.2.2 Review of evaluation of speech synthesis with spectral smoothing methods 42
4.2.3 Review of concatenative speech synthesis and segmentation 48
4.3 Review of recent papers on performance improvement of TTS systems 51
4.4 Summary of literature review 61

5. Contextual analysis and classification of syllables for Marathi language 64-88
5.1 Review of the related literature 65
5.2 Language study 65
5.3 CV structure 67
5.4 Block diagram of contextual analysis 67
5.5 Implementation of contextual analysis system 76
5.5.1 Input text 76
5.5.2 Text encoding 76
5.5.3 CV structure formation 77
5.5.4 Performance evaluation and result discussion of contextual analysis 77
5.6 Conclusion of contextual analysis 88

6. Position based syllabification using neural and non-neural techniques 89-185


No. Title of the chapter & contents Page no. 6.1 Factors of voice quality variation for database creation 94 6.2 Neural network for segmentation 95 6.3 Neural network and its types 96 Basic block diagram of neural network for 6.3.1 98 segmentation 6.4 Why automatic segmentation? 103 6.4.1 Syllable as unit 105 6.4.2 Why not dictionary? 108 6.4.3 Energy, extracted feature for NN 109 Why energy is used as parameter for soft- 6.4.4 110 cutting? Basic algorithm of neural network segmentation 6.5 111 system 6.5.1 Purpose of algorithm 111 6.5.2 Methodology used 111 6.5.3 Result/outcome 112 6.5.4 Basic algorithm of segmentation system 112 6.5.5 Flowchart 112 6.6 Energy calculation 113 6.6.1 Actual formula used 114 6.6.2 Algorithm for energy calculation 115 6.6.2.1 Purpose of energy algorithm 115 6.6.2.2 Methodology used 115 6.6.2.3 Results/outcome of the algorithm 115 6.6.3 Flowchart 115 6.7 Post-processing of energy plot 117 6.7.1 Normalization 117 6.7.1.1 Purpose of normalization 118 6.7.1.2 Methodology used 118 6.7.1.3 Results/outcome of the algorithm 118 6.7.2 Normalization algorithm 118 6.7.3 Smoothing energy plot 119 6.8 Block diagram of basic TTS system and segmentation 120


No. Title of the chapter & contents Page no. approaches 6.8.1 Syllabification flowchart 122 6.9 Neural network for automatic segmentation 123 6.9.1 MAXNET 124 6.9.2 Maxnet architecture 125 6.10 Implementation of Maxnet 127 6.10.1 Maxnet algorithm 127 6.10.1.1 Purpose of algorithm 127 Methodology/ mathematical 6.10.1.2 128 function Step by step Maxnet 6.10.1.3 128 implementation 6.10.1.4 Results/outcome of the algorithm 129 6.10.2 Maxnet flowchart 129 Testing of Maxnet algorithm and 6.10.3 130 authentication of results 6.10.4 Conclusion of Maxnet 139 6.11 Back-propagation 139 6.11.1 Architecture of back-propagation 141 6.12 Implementation of back-propagation 143 6.12.1 Input nodes 143 6.12.2 Hidden nodes 143 6.12.3 Output nodes 144 6.12.4 Initialization of weights 144 6.12.5 Choice of learning rate 휂 144 6.13 Back-propagation algorithm implementation 144 6.13.1 Purpose of algorithm 146 6.13.2 Methodology used 146 6.13.3 Results/outcome of the algorithm 146 6.13.4 Basic back-propagation algorithm 146 6.13.5 Back-propagation flowchart 148 6.14 Supervised and un-supervised NN for segmentation 149 6.14.1 Supervised neural network learning 150


No. Title of the chapter & contents Page no. 6.14.2 The training data set 150 6.14.3 The basic training procedure 150 6.14.4 Unsupervised learning in neural networks 151 6.14.5 The delta rule 152 Results of cascaded combination of 6.14.6 154 supervised and unsupervised NN Neural-classification approach (K-means with 6.15 158 Maxnet) for syllabification 6.15.1 Analysis of K-means algorithm 158 Step by step implementation of K-means for 6.15.2 159 segmentation 6.15.2.1 Purpose of algorithm 160 6.15.2.2 Methodology used 160 6.15.2.3 Results/ outcome of algorithm 162 Authentication and testing of 6.15.2.4 162 results 6.15.3 Results of K-means algorithm 162 6.16 Non-neural methods for syllable segmentation 167 6.16.1 Simulated annealing 167 6.16.1.1 Purpose of algorithm 167 6.16.1.2 Methodology used 167 6.16.1.3 Results/outcome of the algorithm 167 Testing and authentication of 6.16.1.4 168 results 6.16.2 Results of simulated annealing 168 6.16.3 Slope detection algorithm 172 6.16.3.1 Purpose of algorithm 172 6.16.3.2 Methodology used 173 6.16.3.3 Results/outcome of the algorithm 173 6.16.4 Results of slope detection 174 Summary of segmentation results for simulated 6.17 annealing, slope detection and K-means Maxnet 178 combination 6.17.1 Results for 2-syllable words 179 6.17.2 Results for 3-syllable words 180


No. Title of the chapter & contents Page no. 6.17.3 Results for 4-syllable words 181 Statistical comparison of different 6.17.4 182 segmentation techniques 6.18 GUI 182 6.18.1 GUI results for K-means 183 6.18.2 GUI results for slope detection 183 6.18.3 GUI results for simulated annealing 184 6.19 Conclusion of syllabification 185

7. Removal of spectral mismatch using PSOLA for speech improvement 186-209
7.1 Introduction 187
7.2 Removal of spectral mismatch using PSOLA 187
7.2.1 Why choose PSOLA? 188
7.2.2 Block diagram of PSOLA for spectral reduction 188
7.2.3 System flowchart 189
7.2.4 Pre-processing of PSOLA 189
7.2.5 Pitch estimation steps 190
7.2.6 Pitch estimation details 191
7.2.7 Pitch marking 193
7.2.8 Block diagram of pitch marking 193
7.3 TD-PSOLA 195
7.3.1 Purpose of TD-PSOLA algorithm 196
7.3.2 Methodology used 196
7.3.3 Results/outcome of the algorithm 196
7.3.4 Step by step explanation of PSOLA block diagram 196
7.4 TD-PSOLA implementation 198
7.4.1 Detail algorithm of TD-PSOLA 199
7.4.2 Pseudo code of PSOLA 200
7.5 Results and discussion of TD-PSOLA 204
7.5.1 Comparison of original and PSOLA-processed output 205


7.5.2 GUI used for testing 206
7.6 Performance evaluation of PSOLA 207
7.6.1 Subjective evaluation 207
7.6.2 Objective evaluation 208
7.7 Conclusion 209

8. Evaluation of spectral mismatch and development 210-319 of necessary objective measures 8.1 Introduction 213 8.2 Speech synthesis 214 8.3 Generation of speech database 216 8.4 Scope 216 8.5 Fourier transform 217 8.6 Short time Fourier transform 218 8.7 Wavelet transform 220 8.7.1 Multi-resolution analysis (MRA) 220 8.8 Continuous wavelet transform 221 8.9 Discrete wavelet transform 223 8.9.1 Multi-resolution filter bank 223 8.9.2 Selection of wavelet 225 8.10 Dynamic time warping 225 8.11 Description of speech signal 227 8.12 System block diagram for spectral smoothing 228 8.13 Finding discontinuities and smoothing 230 8.14 Actual implementation requirements 231 8.15 System design for spectral smoothing 231 8.15.1 Flowchart for spectral smoothing 231 8.15.2 Spectral smoothing algorithm 232 8.15.2.1 Purpose of algorithm 232 Methodology used and step by 8.15.2.2 232 step implementation 8.15.2.3 Results/outcome of the algorithm 233


No. Title of the chapter & contents Page no. 8.16 Concatenation of syllables 233 Results of correlation of energies of concatenated 8.17 234 words Detailed block diagram of distance and energy 8.18 237 calculation and its explanation 8.19 Prerequisites of distance calculation algorithm 239 8.19.1 Normalization of sound file 239 8.19.2 Framing 239 Correlation of frames and application of 8.19.3 239 DTW 8.20 Plotting maximum peak values 242 8.20.1 Research findings of peak calculation 242 8.21 Distance calculation algorithm 245 8.21.1 Purpose of algorithm 245 Methodology used and step by step 8.21.2 245 implementation 8.21.3 Results/outcome of the algorithm 246 Finding Euclidean distance and energy between auto- 8.22 246 correlated and cross-correlated words 8.22.1 Euclidean distance 246 8.22.2 Definition 246 Explanation of algorithm for energy 8.22.3 248 calculation Energy calculation of correlated words for spectral 8.23 250 distortion reduction Analysis of results of distance and energy calculated 8.24 252 in original, PC and IC words 8.24.1 Energy results 254 8.25 Other-algorithms of spectral smoothing 256 8.26 Identification of mismatch of formants using PSD 256 8.26.1 PSD Algorithm 256 8.26.2 Purpose of algorithm 256 Methodology used and step by step 8.26.3 257 implementation 8.26.4 Results/outcome of the algorithm 257 8.26.4.1 Research findings 258


No. Title of the chapter & contents Page no. 8.26.5 Peak determination 259 8.26.6 Formant extraction 259 8.27 Slope calculation for formant noise detection 260 Calculation and plotting of normalized 8.27.1 261 values 8.27.2 Implementation of PSD and formant plots 262 8.27.2.1 Purpose of algorithm 262 8.27.2.2 Methodology used 262 8.27.2.3 Results/outcome of the algorithm 263 8.27.3 Word selection 263 8.27.4 PSD plots 264 8.28 Analysis of formants to evaluate spectral mismatch 269 8.29 Comprehensive analysis of results obtained by PSD 279 8.29.1 PSD results in frame by frame basis 280 8.29.2 Results for 2-syllable words 281 8.29.3 Numerical results of PSD for 2 syllable word 284 8.29.4 Results for 3 syllable words 286 8.29.5 Numerical results of PSD for 3 syllable word 289 8.29.6 Results for 4 syllable words 291 8.29.7 Numerical results of PSD for 4 syllable word 295 Detection of spectral mismatch and its reduction using 8.30 296 multi-resolution wavelet analysis 8.30.1 Purpose of algorithm 296 Methodology used and step by step 8.30.2 297 implementation 8.30.3 Results/ outcome of the algorithm 298 8.30.4 Wavelet results 298 Detection of spectral mismatch and its reduction using 8.31 308 DTW 8.31.1 Purpose of algorithm 308 Methodology used and step by step 8.31.2 309 implementation 8.31.3 Results/ outcome of the algorithm 309 8.31.4 Correlation results of DTW 309


No. Title of the chapter & contents Page no. 8.32 GUI to perform DTW, wavelet and PSD 315 8.32.1 List box 316 8.32.2 Radio buttons 316 8.32.3 Push buttons 316 8.33 Subjective analysis 318 8.34 Conclusion of spectral mismatch reduction 319

9. Linguistic analysis 320-355
9.1 Introduction 321
9.1.1 Vowels 322
9.1.2 Consonants 322
9.1.3 Consonant vowel (CV) structure 323
9.1.4 Suffix processing in text analysis 323
9.1.5 Structure formation 324
9.2 Implementation of suggestions from linguistic experts 324
9.2.1 Sandhi and related rules 326
9.3 Linguistic expert's analysis and new rules 330
9.4 Database preparation with words and syllables 332
9.5 Testing and results 332
9.5.1 Testing of paragraphs 332
9.5.2 Comparison of naturalness obtained by various methods 350
9.6 Conclusion of linguistic analysis 355

10. Outcome of research 356-360
Outcome of research 357

11. Future scope, improvements and concluding remarks 361-368
11.1 Concluding remarks 362
11.2 Improvements and future scope 364
11.3 Applications 366


Appendix-1 CV structure breaking rules and WAVE file details i-xvii
Appendix-2 Author's publications xviii
Bibliography i-vii


LIST OF TABLES
No. Title of the Table Page No.
5.1 Marathi words and their code strings 69
5.2 CV structure of input words 70
5.3 Examples of CV structure and its breaking 71
5.4 Textual database snapshot 73
6.1 Comparative results of segmentation for 2 syllable words 179
6.2 Comparative results of segmentation for 3 syllable words 180
6.3 Comparative results of segmentation for 4 syllable words 181
6.4 Statistical comparison of different segmentation techniques 182
7.1 MOS results of PSOLA 208
7.2 Speech output of PSOLA with percentage error calculation 209
8.1 Average energy results for cross-correlation of original, PC and IC 237
8.2 Result table of Euclidian distance calculation 253
8.3 Result table of energy calculation 254
8.4 Numerical results for word {eH ma (shikar) 284

8.5 Numerical results for word H m_YmaUm (kamdharana) 289
8.6 Numerical results for word X¡Z§{XZs (dainandini) 295
8.7 Numerical results of wavelet and Back propagation for approximate coefficients 307
8.8 Numerical results of wavelet and Back-propagation for detail coefficients 308
8.9 Numerical results of correlation 314
8.10 Subjective analysis for DTW, PSD and wavelet 318
9.1 Concatenation details of nw. b. Xoenm§So paragraph 333
9.2 Concatenation details of India pledge 337
9.3 Concatenation details of d. nw. H mio paragraph 341
9.4 Concatenation details of amOy JmoîQ paragraph 346
9.5 MOS test for linguistic analysis implemented for Marathi TTS 351
9.6 Comparison of different TTS systems 352


LIST OF FIGURES No. Title of the Figure Page No. 1.1 General framework of research work 2 3.1 Speech synthesis 11 3.2 Consonants in Marathi language 27 3.3 Vowels in Marathi language 27 Block diagram of contextual analysis and 5.1 68 classification of syllables 5.2 Devnagari keyboard 68 5.3 Snapshot of textual database 72 5.4 GUI snapshot of the system 76 5.5 Entering text using keyboard 78 5.6 Formation of code 78 5.7 Concatenating the word 79 5.8 Snapshot of input text encoding 79 5.9 Displaying CV structure of word 80 5.10 CV Broken code list and syllables 80 5.11 Adding syllables to database 81 5.12 Entering syllable to search 81 5.13 Syllable to search 82 5.14 Display of words 82 5.15 Total words of syllable _mZ (maan) 83 5.16 Energy plot of _¨Va, A¨Va (mantar, antar) 85 a,b 5.17 Energy plot of A¨ew_mZ, A¨YH ma (anshumaan, andhakar) 85 a,b 5.18 Energy plot of g_m¨Va, dV©_mZ (samantar, vartamaan) 86 a,b 5.19 PSD plot of _¨Va (mantar) 87 5.20 PSD plot of A¨Va (antar) 87 6.1 Simple neural network 97 6.2 Neural network with hidden layer 97 Basic block diagram of TTS with neural network for 6.3 99 segmentation


No. Title of the Figure Page No. 6.4 Block diagram for feature extraction 100 6.5 Block diagram of ‗NN for segmentation‘ 101 6.6 Formation of new word from pre-recorded words 103 Formation of new words from old words with proper 6.7 103 syllable positions 6.8 Sound file of AmZ§X (Anand) 104 6.9 Audio file of Marathi word NmoQ ses (chotishi) 105 6.10 Energy plot of NmoQ ses (chotishi) 106 6.11 Sound file of nam^d (parabhav) 107 6.12 Sound file of AWH (athak) 107 6.13 Sound file of nWH (pathak) 107 6.14 Sound contents of word Am^mi (abhal) 109 6.15 Energy plot of Marathi word Amgmdas (Aasavari) 111 Flowchart for basic neural network based 6.16 113 segmentation 6.17 Flowchart for energy calculation 116 6.18 Simple energy plot of word AmpXVs (Aditi) 116 6.19 Simple energy plot of word Jm¡ad (Gaurav) 117 6.20 Normalized energy plot of word AmpXVs (Aditi) 118 6.21 Normalized energy plot of word Jm¡ad (Gaurav) 119 6.22 Smooth energy plot of word AmpXVs (Aditi) 120 6.23 Smooth energy plot of word Jm¡ad (Gaurav) 120 Basic block diagram of TTS and segmentation 6.24 121 approaches 6.25 Syllabification flowchart 122 6.26 Maxnet network architecture 126 6.27 Flowchart for Maxnet 130 6.28 GUI snapshot for Maxnet algorithm testing 131 6.29 Maxnet output for Marathi word AO© (arja) 132 6.30 Energy plot for word AO© (arja) 133 6.31 Sound plot of three syllable word ~o~§Y (bebandh) 134


No. Title of the Figure Page No. 6.32 Energy plot of three syllable word ~o~§Y (bebandh) 134 6.33 Importing word from its location 135 6.34 Working for browsed word Am^mi (abhal) 135 System output for four syllable word NoXZp~§Xy 6.35 136 (chedanbindu) System output for five syllable word M_H XmanZo 6.36 136 (chamakdarpane) 6.37 System output with monosyllabic word ~mU (baan) 137 6.38 Sound and energy plot for word AmJmgs (Agasi) 138 6.39 Generalized neural network architecture 140 6.40 The structure of a neuron 141 6.41 Back-propagation architecture 142 6.42 Flowchart for Back-propagation 149 6.43 Adaline network 153 6.44 Segmentation for two syllable word AmXoe (Adesh) 155 6.45 Segmentation for two syllable word Z¨Va (nantar) 155 6.46 Segmentation for three syllable word ~amo~a (barobar) 156 Segmentation for three syllable word AÜ``Z 6.47 156 (adhyayan) Segmentation of four syllable word AnmaXe©H 6.48 157 (apardarshak) Segmentation of four syllable word M_H Umam 6.49 157 (chamaknara) 6.50 Clustering using K-means 161 6.51 K-means output for A§pH V (Ankit) 162 6.52 K-means output for AãXþc (Abdul) 163 6.53 K-means output for AmJ_Z (agaman) 164 6.54 K-means output for pZðm (nishtha) 164 6.55 K-means output for A_mÝç (amanya) 165 6.56 K-means output for gX¡d (sadaiv) 165 6.57 K-means output for ApídZr (Ashwini) 166 6.58 K-means output for ApXVr (Aditi) 166


No. Title of the Figure Page No. 6.59 Simulated annealing output for A§pH V (Ankit) 168 6.60 Simulated annealing output for pZð m (nishtha) 169 6.61 Simulated annealing output for AmJ_Z (agaman) 169 6.62 Simulated annealing output for AãXþc (Abdul) 170 6.63 Simulated annealing output for ApXVr (Aditi) 170 6.64 Simulated annealing output for ApídZr (Ashwini) 171 6.65 Simulated annealing output for A_mÝç (amanya) 171 6.66 Simulated annealing output for gX¡d (sadaiv) 172 6.67 Slope detection output for A§pH V (Ankit) 174 6.68 Slope detection output for AmJ_Z (agaman) 175 6.69 Slope detection output for AãXþc (Abdul) 175 6.70 Slope detection output for pZð m (nishtha) 176 6.71 Slope detection output for A_mÝç (amanya) 176 6.72 Slope detection output for ApídZr (Ashwini) 177 6.73 Slope detection output for ApXVr (Aditi) 177 6.74 Slope detection output for gX¡d (sadaiv) 178 6.75 GUI snapshot for K-means 183 6.76 GUI snapshot for slope algorithm 184 6.77 GUI snapshot for simulated annealing 184 Block diagram of PSOLA method for spectral 7.1 188 reduction System block diagram for reduction of concatenation 7.2 189 cost using PSOLA TD- PSOLA system with pitch estimation and 7.3 190 marking 7.4 Block diagram of pitch estimation 190 7.5 Pitch estimation and marking illustration 191 Speech signal and its corresponding ACF of Matlab 7.6 193 program 7.7 Block diagram of pitch marking 194 7.8 Pitch marking of speech signal 195


No. Title of the Figure Page No. 7.9 TDSOLA block diagram 198 7.10 PSOLA speech waveform for AmYw{ZH (adhunik) 204 7.11 PSOLA speech waveform for ^maVs` (Bharatiya) 204 PSOLA and original waveform for word AmYw{ZH 7.12 205 (adhunik) PSOLA and original waveform for word Xoe~m§Yd 7.13 205 (deshbandhav) PSOLA and original waveform for word H mQH ga 7.14 206 (katkasar) PSOLA and original waveform for word AmYw{ZH 7.15 206 (adhunik) using GUI Input and output waveform with pitch scale = 0.8, 7.16 207 time scale = 0.13 for _hm_mJ© (mahamarg) Input and output waveform with pitch scale = 0.13, 7.17 207 time scale = 0.7 for _hm_mJ© (mahamarg) 8.1 Constant resolution time-frequency plane 220 8.2 Multi-resolution time-frequency plane 221 8.3 Analysis filter bank 224 8.4 Synthesis filter bank 224 8.5 Daubechies wavelet 225 8.6 Description of speech signal 227 8.7 Block diagram of spectral smoothing and TTS 228 8.8 Proper concatenated word ‗dmaH as‘ (varkari) 229 8.9 Improper concatenated word ‗dmaH as‘ (varkari) 230 8.10 Flowchart for spectral smoothing 232 8.11 Original word H m_Jma (kamgar) in time domain 233 Proper-concatenated word H m_Jma (kamgar) in time 8.12 234 domain Improper-concatenated word H m_Jma (kamgar) in time 8.13 234 domain 8.14 Auto-correlation of original word H m_Jma (kamgar) 235 Cross-correlation of original H m_Jma (kamgar) and PC H 8.15 236 m_Jma (kamgar) Cross-correlation of original H m_Jma (kamgar) and IC H 8.16 236 m_Jma (kamgar)


No. Title of the Figure Page No. 8.17 Block diagram of quantification of mismatch 238 8.18 100 Hz sine wave correlated with 100 Hz sine wave 240 8.19 100 Hz sine wave correlated with 200 Hz sine wave 241 8.20 100 Hz sine wave correlated with 300Hz sine wave 241 Peak values obtained after correlating the frames of 8.21 243 original and PC and applying DTW Peak values obtained after correlating the frames of 8.22 244 original and IC and applying DTW The peak values obtained after correlating the frames 8.23 245 of original and original Distance between original-original auto-correlated and 8.24 247 IC-original correlated output Distance between original-original auto-correlated and 8.25 248 PC-original correlated output 8.26 Correlation graph 249 8.27 Auto-Correlation of original vs. original 250 Original vs. original correlation of word H m_Jma 8.28 251 (kamgar) Original vs. PC word correlation of word H m_Jma 8.29 251 (kamgar) 8.30 Original vs. IC correlation of word H m_Jma (kamgar) 252 8.31 PSD plot of original XodJS (Devgad) 258 8.32 Concept of peak determination 259 8.33 Formant plot of original XodJS (Devgad) 260

8.34 Normalized plot of ‗f0‘ 261 8.35 Normalized plot of ‗f0‘ 262 8.36 PSD plot of original word XodJS (Devgad) 264 8.37 PSD plot of properly concatenated XodJS (Devgad) 264 8.38 PSD plot of improperly concatenated XodJS (Devgad) 265 8.39 PSD plot of original word pdXoe (videsh) 266 8.40 PSD plot of properly concatenated word pdXoe (videsh) 266 PSD plot of improperly concatenated word pdXoe 8.41 267 (videsh) 8.42 Sound files of original, PC and IC XodJS (Devgad) 268


No. Title of the Figure Page No. 8.43 Formant plot of original XodJS (Devgad) 269 8.44 Formant plot of properly concatenated XodJS (Devgad) 269 Formant plot of improperly concatenated XodJS 8.45 270 (Devgad) 8.46 Formants ‗f0‘ for original, PC and IC XodJS (Devgad) 271

8.47 Formant ‗f1‘ of original, PC and IC amOJS (Rajgad) 272

8.48 Formant f2 of original, PC and IC word pdXoe (videsh) 273

Slope plot ‗f2‘ of original, PC and IC for word amOJS 8.49 274 (Rajgad) 8.50 Slope ‗f0‘ plot of XodJS (Devgad) 275 8.51 PSD plot of original word ~mJdmZ (bagwan) 276 8.52 PSD plot of proper concatenated word ~mJdmZ (bagwan) 276 PSD plot of improper concatenated word ~mJdmZ 8.53 277 (bagwan) 8.54 Formant plot of original word ~mJdmZ (bagwan) 277 Formant plot of proper concatenated word ~mJdmZ 8.55 278 (bagwan) Formant plot of improper concatenated word ~mJdmZ 8.56 278 (bagwan) 8.57 Snapshot of PSD GUI 279 8.58 PSD plot of original word pdXoe (videsh) in GUI 280 PSD plot of original {eH ma (shikar) and properly 8.59 281 concatenated {eH ma (shikar) PSD plot of original {eH ma (shikar) and improperly 8.60 281 concatenated {eH ma (shikar) PSD plot of improper {eH ma (shikar) and modified 8.61 282 improper {eH ma (shikar) PSD plot of original _maoH as (marekari) and properly 8.62 282 concatenated _maoH as (marekari) PSD plot of original _maoH as (marekari) and improperly 8.63 283 concatenated _maoH as (marekari) PSD plot of original _maoH as (marekari) and improperly 8.64 283 concatenated _maoH as (marekari) after modification 8.65 PSD plot of original A\ dm (afava) and properly 283


No. Title of the Figure Page No. concatenated A\ dm (afava) PSD plot of original A\ dm (afava) and improperly 8.66 284 concatenated A\ dm (afava) PSD plot of original A\ dm (afava) and improperly 8.67 284 concatenated A\ dm (afava) after modification PSD plot of original (chamkidar) and properly 8.68 M_H sXma 286 concatenated M_H sXma (chamkidar) PSD plot of original (chamkidar) and 8.69 M_H sXma 286 improperly concatenated M_H sXma (chamkidar) PSD plot of improper M_H sXma (chamkidar) and 8.70 improperly concatenated M_H sXma (chamkidar) after 287 modification PSD plot of original (kamdharana) and 8.71 H m_YmaUm 287 properly concatenated H m_YmaUm (kamdharana) PSD plot of original (kamdharana) and 8.72 H m_YmaUm 288 improperly concatenated H m_YmaUm (kamdharana) PSD plot of original H m_YmaUm (kamdharana) and 8.73 improperly concatenated H m_YmaUm (kamdharana) after 288 modification PSD plot of original XemdVma (dashavtar) and properly 8.74 288 concatenated XemdVma (dashavtar) PSD plot of original XemdVma (dashavtar) and 8.75 289 improperly concatenated XemdVma (dashavtar) PSD plot of original XemdVma (dashavtar) and 8.76 improperly concatenated XemdVma (dashavtar) after 289 modification PSD plot of original A{d^mÁ` (avibhajya) and properly 8.77 291 concatenated A{d^mÁ` (avibhajya) PSD plot of original A{d^mÁ` (avibhajya) and 8.78 292 improperly concatenated A{d^mÁ` (avibhajya) PSD plot of original A{d^mÁ` (avibhajya) and 8.79 improperly concatenated A{d^mÁ` (avibhajya) after 292 modification PSD plot of original X¡Z§{XZs (dainandini) and properly 8.80 293 concatenated X¡Z§{XZs (dainandini) PSD plot of original X¡Z§{XZs (dainandini) and 8.81 293 improperly concatenated X¡Z§{XZs (dainandini) 8.82 PSD plot of original X¡Z§{XZs (dainandini) and 293


No. Title of the Figure Page No. improperly concatenated X¡Z§{XZs (dainandini) after modification PSD plot of original XoUoKoUoo (deneghene) and properly 8.83 294 concatenated XoUoKoUo (deneghene) PSD plot of original XoUoKoUo (deneghene) and 8.84 294 improperly concatenated XoUoKoUo (deneghene) PSD plot of original XoUoKoUo (deneghene) and 8.85 improperly concatenated XoUoKoUo (deneghene) after 294 modification Approximate wavelet coefficients of AMb (achal) 8.86 298 from level 1 to 5 Detail wavelet coefficients of AMb (achal) from level 8.87 299 1 to 5 Approximate wavelet coefficients of ‗a{dZm‘ (Ravina) 8.88 299 from level 1 to 5 Detail wavelet coefficients of ‗a{dZm‘ (Ravina) from 8.89 300 level 1 to 5 Approximate wavelet coefficients of ‗{eH ma‘ (shikar) 8.90 300 from level 1 to 5 Detail wavelet coefficients of ‗{eH ma‘ (shikar) from 8.91 301 level 1 to 5 Approximate wavelet coefficients of ‗dmaH as‘ (varkari) 8.92 301 from level 1 to 5 Detail wavelet coefficients of ‗dmaH as‘ (varkari) from 8.93 302 level 1 to 5 Approximate wavelet coefficients of original, 8.94 302 improper and modified improper AMb (achal) Detail wavelet coefficients of original, improper and 8.95 303 modified improper AMb (achal) Approximate wavelet coefficients of original, 8.96 304 improper and modified improper ‗a{dZm' (Ravina) Detail wavelet coefficients of original, improper and 8.97 304 modified improper ‗a{dZm' (Ravina) Approximate wavelet coefficients of original, 8.98 305 improper and modified improper ‗{eH ma' (shikar) Detail wavelet coefficients of original, improper and 8.99 305 modified improper ‗{eH ma' (shikar) 8.100 Approximate wavelet coefficients of original, 306


No. Title of the Figure Page No. improper and modified improper ‗dmaH as' (varkari) Detail wavelet coefficients of original, improper and 8.101 306 modified improper ‗dmaH as ' (varkari) Correlation results for ‗AMb‘ (achal) original, ‗AMb‘ 8.102 310 (achal) proper before and after applying DTW Correlation results for ‗AMb‘ (achal) original, 8.103 310 ‗AMb‘(achal) improper before and after applying DTW Correlation results for ‗am_Xod‘ (Ramdev) original, ‗am_Xod‘ 8.104 311 (Ramdev) proper before and after applying DTW Correlation results for ‗am_Xod‘ (Ramdev) original, ‗am_Xod‘ 8.105 311 (Ramdev) improper before and after applying DTW Correlation results for ‗nmdSa‘ (pavdar) original, ‗nmdSa‘ 8.106 312 (pavdar) proper before and after applying DTW Correlation results for ‗nmdSa‘ (pavdar) original, ‗nmdSa‘ 8.107 312 (pavdar) improper before and after applying DTW Correlation results for ‗{eH ma‘ (shikar) original, ‗{eH ma‘ 8.108 313 (shikar) proper before and after applying DTW Correlation results for ‗{eH ma‘ (shikar) original ‗{eH ma‘ 8.109 313 (shikar) improper before and after applying DTW 8.110 GUI snapshot for spectral smoothing algorithms 316 8.111 PSD plot of original and improper concatenated word 317 8.112 DTW results in GUI 317 9.1 Classification of sound units in Indian language 322 9.2 Paragraph with different types of contexts 330


ABBREVIATIONS
TTS : Text to speech
PC : Properly concatenated word
IC : Improperly concatenated word
NN : Neural network
ANN : Artificial neural network
IMF : Initial, middle and final position of syllable
DTW : Dynamic time warping
PSD : Power spectral density
CV : Consonant vowel structure or rules
ED : Euclidian distance
LTS : Letter to sound rules
FT : Fourier transform
IFT : Inverse Fourier transform
MRA : Multi resolution analysis
WT : Wavelet transform
STFT : Short time Fourier transform
ICWT : Inverse continuous wavelet transform
DWT : Discrete wavelet transform
BP : Back propagation


ABSTRACT

Speech synthesis is a process in which human sound is produced artificially. Text-to-speech (TTS) is a type of speech synthesis application that creates a spoken version of written text. There is great demand for TTS for Indian languages. Development of a TTS that sounds similar to natural human speech has been attempted over the years, but has still not been achieved by even the best of the presently available TTS systems. Most of these still sound robotic, unless human speech itself is present in them. However, such human speech necessitates creation of a large database of each and every word of the language, which is quite an onerous task.

During this research work, four innovative naturalness improvement techniques for TTS output have been designed, developed, integrated and tested, covering position based syllabification and spectral mismatch calculation and reduction. The stability, repeatability and accuracy of these methods are good. This research work also reduces the database size by using a hybrid syllabic approach.

This research work illustrates a new approach and methodology that helps to reduce database size by using "syllabic based concatenative speech synthesis". The work also describes new approaches for improving the naturalness of TTS output. Here new words are 'created' by joining existing words and syllables from a small database of the most frequently used words. This is referred to as the hybrid syllabic approach. The naturalness of these 'created' words is further improved by: 'position based syllabification', where syllable positions are considered while forming syllables; 'spectral noise calculation and reduction', where the spectral distortion present at the joint is reduced with different time and frequency domain methods that also give objective results (numerical figures); and implementation of linguistic rules for improving context, which results in more natural output. A combination of neural-classification and non-neural methods is used for syllabification. This results in proper cutting of syllables from words.

These syllables are used for 'creation' of new words. After new words are 'created', the spectral distortion present at the joints is reduced with spectral estimation and reduction methods in the time and frequency domains. These methods give numerical figures for the spectral distortion, from which one can estimate and reduce it. If the distortion present at the joints is reduced, the resulting speech sounds more natural. This work is carried out for text to speech conversion of the Marathi language.

Dennis Klatt first developed a rule-based formant synthesizer for the English language. Later on, many researchers explored speech synthesis for many other languages with different techniques like formant synthesis. However, these techniques do not yield satisfactory results for Indian languages. This is due to the fact that most of the Indian languages are phonetic and have a unique sound for each alphabet. These languages are more suitable for concatenative synthesis. Concatenation takes place using allophones, diphones, triphones, demi-syllables, syllables and words. These speech units are stored in the database and used for concatenation during synthesis. The audio database size, selection of units, position of units and naturalness of the synthetic output are closely related. It is observed that the syllable as the basic unit provides fairly better naturalness.

A lot of TTS systems have been developed, but a major drawback of most of them is their large database. Reducing the database and producing more natural sounding speech is the aim of this work. A syllable based speech synthesizer using an energy calculation technique is implemented and gives good results. A text to speech (TTS) synthesizer must be capable of automatically producing speech by storing small segments of speech. It generates a larger number of words based on a very small database. Different syllables can form new words, hence the original database need not be large. While storing small segments, syllable cutting becomes an essential part of concatenative speech synthesis. Soft cutting of syllables gives the 'from' and 'to' sample numbers of the syllables, and these locations can then be used in the database.

For proper cutting of syllables, the location of vowels plays an important role. Vowel detection can be done by calculating the energy of the sound file. Vowels have more energy as compared to consonants, and hence syllables can be cut very easily by making use of this property. In this way the database becomes more efficient and can generate a larger number of words through concatenation of the newly formed syllables. Among the different methods of speech synthesis, concatenative speech synthesis (CSS) is used due to its naturalness and lower signal processing requirement. But concatenative TTS has problems such as the requirement of a large database and the resulting spectral mismatch in the output speech. Both of these problems are addressed in this research work with the help of proper syllabification and reduction of spectral distortion at the joints.
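The short-time energy idea described above can be sketched in a few lines of MATLAB (the simulation platform used in this work). This is only a minimal illustration: the file name, the 10 ms frame length, the smoothing span and the simple local-minimum search are assumptions for the example, not the exact settings of the implemented system.

% Minimal sketch of energy-based soft cutting, assuming a mono WAV file.
[x, fs] = audioread('aditi.wav');        % hypothetical file; wavread in older MATLAB
x = x / max(abs(x));                     % amplitude normalization
N = round(0.010 * fs);                   % 10 ms frame length
nFrames = floor(length(x) / N);
E = zeros(nFrames, 1);
for k = 1:nFrames
    frame = x((k-1)*N + 1 : k*N);
    E(k) = sum(frame .^ 2);              % short-time energy of frame k
end
E = E / max(E);                          % normalized energy plot
Es = conv(E, ones(5,1)/5, 'same');       % 5-point moving-average smoothing
plot(Es); xlabel('Frame number'); ylabel('Normalized energy');
% Vowels appear as energy peaks; the valleys between successive peaks are
% candidate syllable boundaries (soft cut points).
cuts = [];
for k = 2:nFrames-1
    if Es(k) < Es(k-1) && Es(k) < Es(k+1)
        cuts(end+1) = k * N;             % approximate cut location in samples
    end
end
disp(cuts)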

This work focuses on improving naturalness of TTS using: 1) Contextual analysis and classification of syllables. 2) Position based syllabification. 3) Spectral noise calculation and reduction. 4) Implementation of linguistic rules for improving speech output.

Every alphabet in the Marathi language has been assigned a pre-defined code. Based on this code, words are converted into their consonant-vowel (CV) structure. These CV structure rules are obtained from contextual analysis and classification of syllables. The CV structure rules help to store the syllables in the database. Different syllables are used to create new words, hence the database size does not increase. The frequency of occurrence of each syllable in the database is calculated. This generates a list of the most commonly used syllables. The program further generates a list of words in which these syllables are present at all positions (initial, middle and final). These words are evaluated with respect to parameters like short time energy, power spectral density (PSD) etc. Syllables having a large energy peak, less distortion and a good PSD are stored in the textual database. These parameters help to validate the results of contextual analysis and the classification of syllables based on it.
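As an illustration of the text-analysis step just described, the sketch below maps a transliterated word onto its CV string using a lookup table. The character set, codes and the example word are hypothetical; the actual per-alphabet codes and CV breaking rules used in this work are listed in Appendix-1.

% Minimal sketch of CV structure formation from a (hypothetical) code table.
units   = {'a', 'aa', 'i', 'k', 'm', 'n', 't', 'r'};   % transliterated sound units
classes = {'V', 'V',  'V', 'C', 'C', 'C', 'C', 'C'};    % consonant / vowel class codes
cvMap   = containers.Map(units, classes);
word = {'m', 'aa', 'n'};                                % the word 'maan' as unit tokens
cv = '';
for k = 1:numel(word)
    cv = [cv, cvMap(word{k})];                          % builds the CV string, here 'CVC'
end
disp(cv)                                                % the CV structure drives syllable breaking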


Context based segmentation is based on syllable positions (I-initial, M-middle, F-final). Various speech segmentation methods were tested, but position based syllabification resulted in good segmentation accuracy. Concatenation of position dependent syllables results in less spectral mismatch (concatenation cost) and gives more natural sounding audio output.

A neural network and classification model is presented for segmentation of Devnagari script. Syllable formation is the first step in the development of speech synthesizers. The syllable serves as an important interface between the phonetic and phonological tiers and the morphological and lexical representational tiers of language. It has been demonstrated that reliable segmentation of spontaneous speech into syllabic entities is useful for speech synthesizers. Soft cutting of syllables gives the exact frame numbers of the syllables, and these locations can then be used in the textual database. As manual segmentation is very time consuming, supervised and unsupervised neural network methods are used for the development of proper syllables. A neural network model for automatic speech segmentation into syllables for Devnagari script is developed here. There is great interest in the development of automatic speech segmentation algorithms. The Maxnet algorithm from neural networks is used for automatic segmentation. Maxnet is a one-layer network that conducts a competition to determine which node has the highest initial input value. Maxnet provides about 70% accuracy in syllable segmentation. Further improvement in speech quality can be achieved using IMF and context based combinations of syllables. The energy of a syllable differs with its position in the word, so a syllable can be roughly categorized according to its position in the word as initial (I), middle (M) or final (F).
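A minimal MATLAB sketch of the Maxnet competition mentioned above is given below. The initial activations (for example, candidate energy values) and the inhibition weight are illustrative assumptions; the network simply lets every node inhibit the others until only the node with the largest initial value remains active.

% Minimal Maxnet sketch: winner-take-all competition among m nodes.
a = [0.2 0.7 0.5 0.9 0.4];            % illustrative initial activations
m = numel(a);
epsilon = 1 / (2 * m);                % mutual inhibition weight, 0 < epsilon < 1/m
while sum(a > 0) > 1
    a = max(a - epsilon * (sum(a) - a), 0);   % each node subtracts the others' activity
end
[~, winner] = max(a);                 % index of the node with the highest initial value
disp(winner)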

Two approaches, neural-classification and non-neural, are presented for segmentation of syllables in speech synthesis of Marathi TTS. This work demonstrates a comparative analysis of neural and non-neural approaches for segmentation. A non-neural approach, slope detection, has given good syllabification accuracy (about 80%).

K-means, a neural-classification approach, resulted in better segmentation accuracy with fewer syllabification errors. The research work explains how neural network and classification approaches like K-means and Maxnet outperform traditional non-neural approaches like slope detection and simulated annealing. An accuracy of 90% is achieved with the neural network-classification (K-means with Maxnet) model for syllable segmentation, which resulted in naturalness improvement of the Marathi TTS.
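The following MATLAB sketch indicates how K-means clustering can be applied to the frame-level energy contour to obtain candidate syllable regions, with the cluster boundaries then acting as cut points. The feature choice, the number of clusters and the use of the Statistics Toolbox function kmeans are assumptions for illustration; the thesis combines this clustering step with Maxnet as described above.

% Minimal sketch: K-means over frame position and energy, assuming the
% normalized energy vector E from the earlier energy sketch is available.
k = 3;                                         % expected number of syllables (assumed known)
frames   = (1:numel(E))';
features = [frames / max(frames), E(:)];       % frame position and energy as features
idx = kmeans(features, k, 'Replicates', 5);    % cluster label for every frame
cutFrames = find(diff(idx) ~= 0);              % frames where the cluster label changes
disp(cutFrames')                               % candidate syllable boundaries (frame numbers)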

In concatenative TTS, the position of a syllable plays a very important role while carrying out segmentation. If the position of the syllable is considered while forming new words from existing syllables (PC - properly concatenated), the resulting spectral mismatch is less. If the position of the syllable is not considered during concatenation of speech units (IC - improperly concatenated), the resulting synthesis ends up with more spectral mismatch. Different techniques like PSD, wavelet and DTW are used to find the spectral mismatch in concatenated segments. A method based on power spectral density (PSD) is presented to estimate position dependent spectral mismatch. This is done by plotting the power spectral density of 10 millisecond samples of original, properly concatenated (PC) and improperly concatenated (IC) words. These samples are then made noise free to neglect their low amplitude peaks. Analysis of the PSD leads to locating formants in the given samples. Formants for the original, properly and improperly concatenated words are plotted. It is observed that formant plots for original and properly concatenated words are very similar for all formants, but formant plots for improper concatenation show extra peaks. These extra peaks can be considered as an estimate of the spectral mismatch. Of the three techniques (PSD, wavelet and DTW), the PSD results are the most effective, as they show the spectral mismatch clearly in graphical form. With formant modification the spectral mismatch can be reduced, and it helps to smooth some of the frames, which reduces the glitch type of sound at the concatenation point. The DTW (dynamic time warping) algorithm shows the spectral noise at concatenation. Results show a noticeable difference in correlation before and after applying DTW. The value of correlation, that is, the similarity of the improper and proper words with the original word, increases after applying DTW.

Wavelet based audio results show more naturalness compared to the other two methods (MOS score of 3.875). In this work the discontinuities at the cutting point are smoothed by changing the spectral characteristics before and after the cutting point, so that the spectral mismatch is equally distributed over a number of adjacent frames. This work throws light on how spectral mismatch calculation and reduction increase the naturalness of concatenative Marathi TTS. Among these three methods of spectral distortion reduction, wavelet has given the best audio results.
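A sketch of the PSD comparison described above is given below. The file names and the analysed 10 ms frame are hypothetical, and pwelch (Signal Processing Toolbox) is used here simply as one convenient PSD estimator; the point is only to show how the original and the improperly concatenated word are compared frame by frame.

% Minimal sketch of frame-wise PSD comparison of an original word and its
% improperly concatenated (IC) version.
[orig, fs] = audioread('devgad_original.wav');   % hypothetical file names
[ic, ~]    = audioread('devgad_improper.wav');
N  = round(0.010 * fs);                          % 10 ms analysis frame
f1 = orig(1:N);
f2 = ic(1:N);                                    % the same frame index in both words
[P1, w] = pwelch(f1, [], [], [], fs);
[P2, ~] = pwelch(f2, [], [], [], fs);
plot(w, 10*log10(P1), 'b', w, 10*log10(P2), 'r');
xlabel('Frequency (Hz)'); ylabel('PSD (dB/Hz)');
legend('original', 'improper concatenation');
% Extra or shifted peaks in the IC curve relative to the original indicate
% formant mismatch introduced at the concatenation joint.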

For further improvement in optimization of the database and the resulting naturalness of the Marathi output, linguists' guidance is considered and certain suggestions are implemented. This approach to naturalness improvement is based on implementation of some language rules suggested by linguists from a contextual study of the Marathi language. After implementing a few rules, it has been observed that the speech quality improves; on the MOS scale this approach has given a score of 3.625.

The system is further tested on larger samples, including paragraphs. By implementing the four approaches (contextual analysis and classification of syllables, position based syllabification, spectral noise calculation and reduction after concatenation, and linguistic rules), it is observed that the resulting TTS output sounds fairly natural and intelligible. These approaches are explained in detail in their respective chapters. Results are validated using this Marathi TTS system for different text paragraphs. Paragraphs with different contexts are used to evaluate the performance of the TTS system.


Chapter- 1

Introduction to system block diagram


CHAPTER- 1

INTRODUCTION TO SYSTEM BLOCK DIAGRAM

No. Title of the Contents Page No.
Introduction to system block diagram 2


This research work focuses on naturalness improvement of speech output of the Marathi TTS system. This system is divided into small modules and then integrated after development and testing of these modules.

Figure 1.1, shown below, outlines the general framework of this research work; blocks with different colors indicate the major thrust areas:

[Figure 1.1 is a block diagram. Its main blocks are: Marathi text input; text analysis and search using CV rules; syllable formation / segmentation and syllable search with region list (IMF); calculation / reduction of spectral mismatch using time and frequency domain features; concatenation and play; evaluation of naturalness; and audio output.]

Fig. 1.1: General framework of research work


The scope of this work is outlined as shown below:

 This work is simulated on computer using C/VB/MATLAB platform.

 The input to the system is through GUI keyboard or text file.

 The database vocabulary is about 3000 entries, including words and syllables.

 This TTS system is used for Marathi language.

 As shown in the diagram four blocks with different colors indicate main working areas.

 The first block shows syllable segmentation with IMF. Here IMF stands for initial, medial and final positions of syllable. For natural sounding segmentation, syllable position (context) plays a very important role.

 Second block shows syllable search and region list. After syllable segmentation, region list is formed based on soft cutting of syllables from existing most frequently used words. These syllables are stored in textual database. Only the most frequently used words and resulting syllables are stored in the audio database.

 Contextual syllabification has always resulted in improved naturalness. After concatenation, resulting spectral distortion is estimated and reduced with the help of different techniques. The third block shows spectral distortion estimation and reduction. Different time and frequency domain methods are used for spectral smoothing.

 Finally both segmentation accuracy and spectral mismatch reduction are evaluated for resulting naturalness of speech synthesis. The fourth block shows evaluation and validation methods for naturalness improvement of TTS system.


Chapter-2

Objectives of the research


CHAPTER- 2

OBJECTIVES OF THE RESEARCH

No. Title of the contents Page no.
2.1 Objectives 6
2.2 Sub-objectives of the research work 7
2.3 Organization of the thesis 7


Speech synthesis is the artificial production of human speech, usually by means of computers, in systems often called text to speech (TTS) systems. Speech recognition and synthesis find applications in automatic messaging systems, announcing systems, security systems, remote reading, email reading, speaker identification, newspaper reading, password recognition, speaker analysis, reading out manuscripts for collation, entertainment, computer aided learning, talking calculators, weather reports, talking toys, accessing textual information over the telephone, telephone enquiry services etc.

 The main objective of this research is to devise ways and techniques to improve the naturalness of a TTS system, so that its applicability and commercial viability are vastly improved in all the above mentioned applications.

 Recent advances in technology have enabled the analysis, storage and transmission of digital speech in a far more economical and convenient way. As speech is produced by human beings, its analysis and synthesis are required for these digital applications. Efficient analysis of speech has become possible because of high speed digital signal processors [1].

 The linguistic structure of each language is different, hence the synthesis rules are different for each language. There are more than 3000 languages spoken all over the world. In India alone there are more than 400 languages used in different states. The grammar of each of these languages is different. Therefore, a global text to speech synthesis system that can speak any language has not yet been developed. This research work aims to develop synthesis rules that are applicable to multiple Indian languages using Devnagari script.

 This work aims to reduce the database size and improve the naturalness of synthesized words. It also aims to develop both subjective and objective evaluation techniques.


2.1 OBJECTIVES:

 The present TTS systems are based on a single type of speech unit, such as diphones, syllables or words, for database generation. This research work aims to use a combination of two speech units, viz. words and syllables, for database generation to provide better synthetic sound in terms of naturalness and intelligibility.

 This research aims to implement a hybrid approach (using both words and syllables as speech units) for reduction of database size. It aims to prepare a database of the most frequent words and syllables so that they can be reused for synthesizing new words.

 Concatenation of position dependent syllables results in less spectral mismatch and gives more natural sounding audio output. This research work focuses on improving the naturalness of speech synthesis using context based segmentation. Context based segmentation is based on syllable position (I-initial, M-medial, F-final). This work aims to devise a technique of position dependent (I/M/F) speech segmentation for reduction of the spectral distortion at the concatenation point.

 When a required word is not found in the database, it is formed from segments of other words and syllables. When new words are formed by concatenating existing words or syllables, spectral distortion results at the concatenation point. This research work aims to develop new spectral distortion calculation and reduction methods in the time and frequency domains to increase the naturalness of the synthetic speech output.

 For improvement in naturalness, the input text is designed with contextual analysis. Linguists' feedback on contextual analysis helps to improve optimization of the database and results in more accurate syllable formation. This work aims to investigate a new approach that considers different contexts and positions of syllables while synthesizing a new word. It aims to find new language rules from linguists' guidance for improvement in context and stored speech units.

 This research aims to evaluate the performance of the Marathi TTS using subjective and objective validation methods. A subjective evaluation test, MOS, is carried out to check the performance of this TTS system. For objective evaluation, this research work aims to find spectral distance, correlation, percentage error, graphical display of distortion etc. for naturalness improvement.

2.2 SUB-OBJECTIVES OF THE RESEARCH WORK:

This research work is divided into following sub-objectives:

a) To design a framework for speech synthesis-by-concatenation.
b) To prepare a database of words and syllables (hybrid approach).
c) To improve segmentation using position based syllabification.
d) To improve spectral smoothing using spectral distortion calculation and reduction methods.
e) To use context based speech synthesis based on linguists' guidance using syllables.
f) To implement linguistic rules of the Marathi language to select the most frequent words and syllables for the database.
g) To evaluate the naturalness of the speech output of the TTS system using subjective and objective evaluation methods.

2.3 ORGANIZATION OF THE THESIS:

The thesis is composed of eleven chapters.


Chapter 1 – This chapter discusses the system block diagram and its explanation. It shows major thrust areas or modules of research work. System block diagram shows research modules with different colors.

Chapter 2 – This chapter discusses briefly the aim and objectives of the research work. It describes the significance of position based syllabification and spectral distortion reduction methods for improving naturalness of the TTS. It shows how these methods improve the overall performance of TTS system under consideration.

Chapter 3 – This chapter explains theory of TTS systems. It covers sound elements of TTS system, language study and present scenario of TTS systems.

Chapter 4 – This chapter explains a review of literature for syllabification, spectral smoothing and recent work in other TTS systems. It explains summary of literature survey for present research work.

Chapter 5 – This chapter explains contextual analysis and classification of syllables. This is a very important part of research work for optimization of database and implementation of hybrid TTS for naturalness improvement.

Chapter 6 – This chapter explains position based syllabification in detail. For naturalness improvement of Marathi TTS, accurate syllable formation or segmentation is very important. This chapter explains all non-neural, neural and classification network approaches for syllable segmentation. If accuracy of segmentation is high, it gives more natural output. This chapter explains how NN has improved syllabification for TTS system under consideration.

Chapter 7 – This chapter explains the removal of spectral mismatch using PSOLA for speech improvement. Because PSOLA is a very basic method of analyzing and improving the speech signal, it was not extended further or finalized for speech naturalness improvement.

Chapter 8 – This chapter explains spectral mismatch estimation and reduction after concatenation of syllables. Different spectral distortion calculation and reduction methods, such as distance calculations, energy calculation, PSD, wavelet analysis and DTW, are explained for spectral smoothing. The resulting speech output is more natural due to the spectral mismatch reduction.

Chapter 9 – This chapter explains the linguistic analysis. The linguist's suggestions for the implementation of different contexts and rules are discussed. CV structure formation, language rules for different contexts and the subjective analysis are explained in detail in this chapter.

Chapter 10 – This chapter explains the outcome of the research. It summarizes how the different methods, approaches and algorithms are implemented for naturalness improvement of the TTS system, and describes the actual contributions of the research work and the modules implemented for them.

Chapter 11 – This chapter presents the concluding remarks of the research work carried out. It explains how the different objectives and sub-objectives are achieved by implementing different algorithms, and shows how the naturalness improvement of the TTS output is realized. Along with the concluding remarks, this chapter suggests some improvements and points of future scope for this work.


Chapter-3

Theory of TTS


CHAPTER- 3

THEORY OF TTS

Title of the contents:
3.1 Introduction
3.2 Sound elements for speech synthesis
3.2.1 Classification of speech
3.2.2 Elements of a language
3.3 Methods and approaches to speech synthesis
3.4 Language study
3.4.1 The consonants
3.4.2 Vowels
3.4.3 Consonant conjuncts
3.5 Present scenario of TTS systems
3.5.1 DECTalk
3.5.2 Bell labs text-to-speech
3.5.3 Laureate
3.5.4 SoftVoice
3.5.5 CNET PSOLA
3.5.6 ORATOR
3.5.7 Eurovocs
3.5.8 Lernout & Hauspie's
3.5.9 Apple plain talk
3.5.10 Silpa


3.1 INTRODUCTION:

 Speech synthesis is the artificial production of human speech usually produced by means of computers. Speech synthesis systems are often called text to speech (TTS) systems in reference to their ability to convert text into speech. The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech. Intelligibility is the ease with which the output is understood.

 The ideal speech synthesizer is both natural and intelligible, and speech synthesis systems usually try to maximize both characteristics. The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has its own strengths and weaknesses, and the intended use of a synthesis system typically determines which approach is used. Figure 3.1 shows the speech synthesis technologies and their sub-types.

[Figure 3.1 shows a tree of speech synthesis technologies: concatenative synthesis (with the sub-types word synthesis, syllabic synthesis, hybrid synthesis and unit selection synthesis), formant synthesis, articulatory synthesis and HMM-based synthesis.]

Fig. 3.1: Speech synthesis technologies

11

Theory of TTS

Different methods used to produce synthesized speech can be classified into three main groups: i) Articulatory synthesis: This method attempts to model the human speech production system by controlling the speech articulators (e.g. jaw, tongue, lips). Articulatory synthesis is based on physical models of the human speech production system. Due to limited knowledge of the complex human articulation organs, articulatory synthesis has not led to high-quality speech synthesis.

ii) Formant synthesis: This method models the pole frequencies of the speech signal, i.e. the transfer function of the vocal tract, based on the source-filter model. Formant speech synthesis is based on rules which describe the resonant frequencies of the vocal tract. The formant method uses the source-filter model of speech production, where speech is modeled by the parameters of the source and the vocal tract filter. The output tends to sound unnatural, since it is difficult to estimate the vocal tract model and source parameters accurately.

iii) Concatenative synthesis: This method uses prerecorded samples of different lengths derived from natural speech. The technique uses stored basic speech units (segments), which are concatenated into word sequences according to a pronunciation dictionary; special signal processing techniques are used to smooth the unit transitions and to model the intonation (a minimal sketch follows below). This method needs a large database and produces speech distortion at the concatenation points. However, concatenative synthesis produces more natural speech output than the other methods, and the distortion at the concatenation joints can be reduced with spectral distortion reduction methods [2].
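A minimal sketch of the concatenative idea is given below: stored units are taken from a small inventory and joined with a short linear cross-fade to reduce the discontinuity at the joint. The unit names, the inventory and the fade length are hypothetical illustrations, not the method used in this thesis.

import numpy as np

def crossfade_concatenate(units, fade_samples=80):
    # Concatenate recorded speech units with a linear cross-fade at each joint.
    out = units[0].astype(float)
    fade_out = np.linspace(1.0, 0.0, fade_samples)
    fade_in = 1.0 - fade_out
    for unit in units[1:]:
        unit = unit.astype(float)
        # Overlap the tail of the output with the head of the next unit
        joint = out[-fade_samples:] * fade_out + unit[:fade_samples] * fade_in
        out = np.concatenate([out[:-fade_samples], joint, unit[fade_samples:]])
    return out

# Hypothetical inventory: a word built from two stored "syllables" (here plain tones)
fs = 16000
inventory = {
    "pa": 0.2 * np.sin(2 * np.pi * 220 * np.arange(int(0.15 * fs)) / fs),
    "ni": 0.2 * np.sin(2 * np.pi * 240 * np.arange(int(0.15 * fs)) / fs),
}
word = crossfade_concatenate([inventory["pa"], inventory["ni"]])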

 Speech synthesis is the artificial production of human speech, usually by means of computers. Speech synthesis is used in: spoken dialog systems, applications for blind and visually-impaired persons, applications in telecommunication, eyes-free and hands-free applications, speech enhancement, train/flight announcement systems, speaker recognition, speaker validation, text to speech systems, blind source separation, voice activity detection and voiced/unvoiced decoding.

12

Theory of TTS

 The simple speech interface, which plays an important role in human-computer interaction, meets a considerable social need. The main directions of speech interaction are recognition and synthesis technologies: speech recognition is regarded as the input from human to machine, whereas speech synthesis is the machine's output technology.

 Most of the information in the digital world is accessible only to those who can read or understand a particular language. Language technologies can provide solutions in the form of natural interfaces so that digital content can reach the masses and facilitate the exchange of information between people speaking different languages.

 These technologies play a crucial role in multi-lingual societies such as India, which has about 1652 dialects/native languages. While Hindi, written in the Devanagari script, is the official language, the other 17 languages recognized by the constitution of India are: 1) Assamese 2) Tamil 3) Malayalam 4) Gujarati 5) Telugu 6) Oriya 7) Urdu 8) Bengali 9) Sanskrit 10) Kashmiri 11) Sindhi 12) Punjabi 13) Konkani 14) Marathi 15) Manipuri 16) Kannada and 17) Nepali.

 Seamless integration of speech recognition, machine translation and speech synthesis systems could facilitate the exchange of information between two people speaking two different languages. The basic units of the writing system in Indian languages are characters which are an orthographic representation of the speech sounds.

 A character in an Indian language script is close to a syllable and can typically be of the following forms: C, V, CV, VC, CCV and CVC, where C is a consonant and V is a vowel. All Indian language scripts have a common phonetic base; a universal phone-set consists of about 35 consonants and about 18 vowels. A small illustrative sketch of such C/V patterns is given below.
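The sketch below maps a transliterated word to its consonant/vowel pattern and greedily splits it into units of the listed forms. The single-letter Latin transliteration and the greedy rule are assumptions made only for this example.

VOWELS = set("aeiou")

def cv_pattern(word):
    # Consonant/vowel (C/V) pattern of a transliterated word
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

def split_cv_units(word):
    # Greedily split a word into units of the form C, V, CV, VC, CVC or CCV
    allowed = {"CVC", "CCV", "CV", "VC", "V", "C"}
    units, i = [], 0
    while i < len(word):
        for size in (3, 2, 1):                 # try the longest unit first
            chunk = word[i:i + size]
            if len(chunk) == size and cv_pattern(chunk) in allowed:
                units.append(chunk)
                i += size
                break
    return units

print(cv_pattern("antar"))      # VCCVC
print(split_cv_units("antar"))  # ['an', 'tar'] with this greedy rule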

13

Theory of TTS

 The scripts of Indian languages are phonetic in nature: there is more or less a one to one correspondence between what is written and what is spoken. However, in Hindi and Marathi the inherent vowel (short /a/) associated with a consonant is, depending on the context, not pronounced. This is referred to as inherent vowel suppression (IVS) or schwa deletion [3].

 The main goal in processing the speech signal is to obtain a more convenient or more useful representation of the information it contains. Time domain processing methods deal directly with the waveform of the speech signal. A speech signal can be represented in terms of time domain measurements such as the average zero-crossing rate, the energy and the auto-correlation function.

 These representations make digital processing simpler. In particular, the amplitude of an unvoiced segment is generally much lower than that of a voiced segment. Thus simple time-domain processing techniques are capable of providing useful representations of signal features such as intensity, excitation mode, pitch and possibly even vocal tract parameters such as formant frequencies (see the sketch below).
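A minimal sketch of these time-domain measurements is given below: short-term energy, zero-crossing rate and the autocorrelation of a frame. The frame length and hop size (25 ms / 10 ms at 16 kHz) are illustrative assumptions.

import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Split a signal into overlapping frames
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_term_energy(frames):
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def frame_autocorrelation(frame, max_lag=200):
    # Autocorrelation of one frame, useful for pitch estimation
    frame = frame - np.mean(frame)
    full = np.correlate(frame, frame, mode="full")
    mid = len(frame) - 1
    return full[mid:mid + max_lag + 1]

# Example: a voiced-like 120 Hz tone sampled at 16 kHz
fs = 16000
x = np.sin(2 * np.pi * 120 * np.arange(fs) / fs)
frames = frame_signal(x)
print(short_term_energy(frames)[:3], zero_crossing_rate(frames)[:3])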

 State-of-the-art speech synthesis systems demand a high overall quality; however, synthesized speech still lacks naturalness. To achieve more natural sounding speech synthesis, not only the construction of a rich database but also precise alignment is vital. This research work increases the naturalness of concatenative TTS systems. To improve the overall performance of the TTS system, this work focuses on:

1) More naturalness in the current Marathi TTS through context based speech synthesis using IMF (initial, middle, final) syllable positions.
2) A hybrid model of synthesis (words and syllables are used in database preparation) to improve the overall performance of the TTS system.


3) Linguist guidance for contextual analysis and database preparation. Optimization of the database and consideration of proper contexts result in improved natural speech synthesis output.
4) Position based syllabification with the help of neural and non-neural syllable formation approaches.
5) Spectral mismatch calculation and reduction for improving the joint speech quality of the hybrid TTS system.
6) Paragraphs based on contextual analysis and language rules as good testing models for this Marathi TTS system.

3.2 SOUND ELEMENTS FOR SPEECH SYNTHESIS:

Existing speech synthesis systems use different sound elements. The most common are:
 phones
 diphones
 phone clusters
 syllables

1) The phone is the smallest sound element, which cannot be segmented. It represents the typical kind of sound or sound nuance. Sounds (phones), which are phonetically similar, belong to the same phoneme.

2) A diphone begins at the second half of a phone (stationary region) and ends at the first half of the next phone (stationary region). Thus, a diphone always contains a sound transition. Diphones are more suitable sound elements for speech synthesis: compared with phones, segmentation is simpler, the time duration of diphones is longer and the segment boundaries are easier to detect.

3) Phone clusters are sequences of vowels or consonants. According to the position of the sound sequence, phone clusters are split into initial, medial and final clusters. A medial cluster can very often be divided into an initial and a final cluster.

4) A syllable is a phonetic-phonological basic unit of a word. It consists of a syllable start, syllable core and syllable end. Syllables are not influenced by neighboring sound elements, and the segmentation of syllables is relatively easy. A disadvantage for synthesis purposes is that if only syllables are used, a huge number of syllables is required. Hence, in the present Marathi TTS system a hybrid approach (most common words and syllables) is used [4].

5) A synthesis-by-concatenation text-to-speech (TTS) synthesizer must be capable of automatically producing speech. It uses the most commonly used words in the audio database. The audio database consists of already recorded words, and the textual database consists of the details of these recorded words.

6) The syllable database is in textual form. Using these syllables, a large number of new words can be synthesized by concatenation. These syllables and new words are not stored in audio form but are synthesized as required at runtime. An audio database needs more memory than a textual database. Hybrid concatenative synthesis, using the most commonly used words and syllables, reduces memory usage and increases naturalness [5].

3.2.1 Classification of speech: According to the mode of excitation, speech sounds can be divided into three broad classes: voiced, unvoiced and plosive sounds. A minimal frame-level sketch of the voiced/unvoiced decision is given at the end of this classification.

a) Voiced sounds
1) Voiced sounds have a source due to periodic glottal excitation, which can be approximated by an impulse train in the time domain and by harmonics in the frequency domain.
2) Voiced sounds are generated in the throat.
3) They are produced when air from the lungs is forced through the vocal cords.
4) The vocal cords vibrate periodically and generate pulses called glottal pulses.
5) They are characterized by:
   i) high energy levels
   ii) very distinct resonant and formant frequencies
6) The rate at which the vocal cords vibrate determines the pitch.
7) Glottal pulses pass through the vocal tract, where some frequencies resonate.
8) The rate at which the vocal cords vibrate depends on the air pressure in the lungs and the tension in the vocal cords, both of which can be controlled by the speaker to vary the pitch of the sound being produced.
9) The range of pitch for an adult male is from about 50 Hz to about 250 Hz, with an average value of about 120 Hz. For an adult female the upper limit of the range is much higher, perhaps as high as 500 Hz.

b) Unvoiced sounds
1) Unvoiced sounds are non-periodic in nature. Examples of sub-types of unvoiced sounds are given below:
   i) Fricatives: consonants produced by forcing air through a narrow channel made by placing two articulators close together, as in "thin".
   ii) Plosives: in phonetics, a plosive consonant (also known as an oral stop) is made by blocking a part of the mouth so that no air can pass through; the pressure increases behind the blockage and the sound is created when the air is allowed to pass through again, as in "top".
   iii) Whispered: speech produced with soft, hushed sounds, using the breath and lips but with no vibration of the vocal cords, as in "he".
2) They are produced by turbulent air flow through the vocal tract.
3) Unvoiced sounds are produced in the mouth.
4) The vocal cords are open.
5) Pitch information is unimportant.
6) They are characterized by:
   i) higher frequencies than voiced sounds
   ii) lower energy than voiced sounds
7) They can be modeled as a random sequence.
8) Nasals and "s" sounds are unvoiced sounds.
9) In the production of unvoiced sounds the vocal cords do not vibrate.

c) Plosive sounds
Plosive sounds, for example the 'puh' sound at the beginning of the word 'pin' or the 'duh' sound at the beginning of 'daf', are produced by creating yet another type of excitation. For this class of sound, the vocal tract is closed at some point; the air pressure is allowed to build up and is then suddenly released. The rapid release of this pressure provides a transient excitation of the vocal tract. The transient excitation may occur with or without vocal cord vibration, producing voiced (such as 'daf') or unvoiced (such as 'pin') plosive sounds.
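Following this classification, the sketch below makes a frame-level voiced/unvoiced/silence decision from the energy and zero-crossing rate of a frame. The thresholds are illustrative assumptions and are not values used in the thesis.

import numpy as np

def classify_frame(frame, energy_thresh=0.01, zcr_thresh=0.25):
    # Voiced frames: high energy, low zero-crossing rate.
    # Unvoiced frames: lower energy, high zero-crossing rate.
    frame = np.asarray(frame, dtype=float)
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    if energy < energy_thresh:
        return "silence"
    return "voiced" if zcr < zcr_thresh else "unvoiced"

fs = 16000
t = np.arange(400) / fs
print(classify_frame(0.5 * np.sin(2 * np.pi * 120 * t)))   # voiced-like tone
print(classify_frame(0.2 * np.random.randn(400)))          # noise-like, unvoiced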

3.2.2 Elements of a language:
1) Phoneme: Phonemes differentiate the words of a language.
2) Syllables: One or more phonemes form a syllable.
3) Words: One or more syllables are combined to form words.
4) Phrases, sentences: One or more words are combined to form phrases or sentences.
5) Study of the arrangement of speech sounds according to the rules of a language.

3.3 METHODS AND APPROACHES TO SPEECH SYNTHESIS:

Synthesized speech can be produced by several different methods, all of which have some benefits and deficiencies. A detailed description of the different methods of speech synthesis is given in the following paragraphs.

1) Articulatory synthesis:  Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer and Paul Mermelstein.

 This synthesis, known as ASY, was based on vocal tract models developed at Bell laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker and colleagues.

 Articulatory synthesis tries to model the human vocal organs as accurately as possible, so it is potentially the most satisfying method to produce high-quality synthetic speech. On the other hand, it is also one of the most difficult methods to implement and the computational load is also considerably high.

 The first articulatory model was based on a table of vocal tract area functions from larynx to lips for each phonetic segment. For rule-based synthesis the articulatory control parameters may be for example lip aperture, lip protrusion, tongue tip height, tongue tip position, tongue height, tongue position and velic aperture.

 Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium sound research, a spin- off company of the University of Calgary, where much of the original research was conducted.

 Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU (general public license), with work continuing as GNU speech. The system, first tested in 1994, provides full articulator-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts.

 When speaking, the vocal tract muscles cause the articulators to move and change the shape of the vocal tract, which produces different sounds. The data for the articulatory model is usually derived from x-ray analysis of natural speech. However, this data is usually 2-D, whereas the real vocal tract is naturally 3-D, so rule-based articulatory synthesis is very difficult to optimize due to the unavailability of sufficient data on the motions of the articulators during speech.

 An advantage of articulatory synthesis is that the vocal tract models allow accurate modeling of transients due to abrupt area changes, whereas formant synthesis models only the spectral characteristics.

2) Formant synthesis:  Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model. Parameters such as fundamental frequency, voicing and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rule-based synthesis.

 Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system. Formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems [6].

 High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesis needs less memory than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited.

 Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy 'Speak & Spell' and for early-1980s Sega arcade machines. Creating proper intonation for these projects was painstaking and the results have yet to be matched by real-time text-to-speech interfaces. Probably the most widely used synthesis method during the last few decades has been formant synthesis, which is based on the source-filter model of speech.

 There are two basic structures in general, parallel and cascade, but for better performance some kind of combination of these is usually used. At least three formants are generally required to produce intelligible speech and up to five formants to produce high quality speech.

 Each formant is usually modeled with a two-pole resonator, which enables both the formant frequency (pole-pair frequency) and its bandwidth to be specified (Donovan 1996). Rule–based formant synthesis is based on a set of rules used to determine the parameters necessary to synthesize a desired utterance using a formant synthesizer [7].

 A cascade formant synthesizer consists of band-pass resonators connected in series, and the output of each formant resonator is applied to the input of the following one. The cascade structure needs only the formant frequencies as control information. The main advantage of the cascade structure is that the relative formant amplitudes for vowels do not need individual controls. The cascade structure has been found better for non-nasal voiced sounds because it needs less control information than the parallel structure.

 A parallel formant synthesizer consists of resonators connected in parallel; sometimes extra resonators for nasals are used. The excitation signal is applied to all formants simultaneously and their outputs are summed. Adjacent outputs of the formant resonators must be summed in opposite phase to avoid unwanted zeros or anti-resonances in the frequency response (O'Shaughnessy 1987). The parallel structure enables the bandwidth and gain of each formant to be controlled individually and thus needs more control information. It has been found to be better for nasals, fricatives and stop-consonants, but some vowels cannot be modeled as well with a parallel formant synthesizer as with the cascade one. A minimal resonator sketch is given below.
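The sketch below builds a cascade formant synthesizer from two-pole resonators, each specified by a formant frequency and bandwidth, excited by a simple impulse train. The formant values and the use of SciPy's lfilter are assumptions made for this illustration only.

import numpy as np
from scipy.signal import lfilter

def two_pole_resonator(freq, bandwidth, fs):
    # Digital two-pole resonator with unity gain at 0 Hz
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r ** 2]
    b = [1 - 2 * r * np.cos(theta) + r ** 2]
    return b, a

def cascade_formant_synth(f0, formants, bandwidths, fs=16000, dur=0.5):
    # Excite a cascade of formant resonators with a glottal impulse train
    n = int(dur * fs)
    source = np.zeros(n)
    source[::int(fs / f0)] = 1.0
    y = source
    for freq, bw in zip(formants, bandwidths):
        b, a = two_pole_resonator(freq, bw, fs)
        y = lfilter(b, a, y)
    return y

# Roughly vowel-like formants (illustrative values only)
speech = cascade_formant_synth(f0=120, formants=[700, 1200, 2500],
                               bandwidths=[80, 90, 120])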

3) Concatenative synthesis:  Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between the natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. With voice conversion techniques, concatenative synthesis can produce different voices, and with database reduction mechanisms the storage problem can be controlled.

 Connecting prerecorded natural utterances is probably the easiest way to produce intelligible and natural sounding synthetic speech. However, concatenative synthesizers are usually limited to one speaker and one voice: because preparation of the database is time consuming, only one speaker records the database and hence the speech output is of one voice. As this approach makes use of concatenation of natural speech segments, storing these segments needs more memory than other approaches to speech synthesis.

 One of the most important aspects in concatenative synthesis is to find the correct unit length. The selection is usually a trade-off between longer and shorter units. With longer units, high naturalness, fewer concatenation points and good control of coarticulation are achieved, but the number of required units and the memory usage increase. With shorter units, less memory is needed, but the sample collecting and labeling procedure becomes more difficult and complex. In present systems the units used are usually words, syllables, demi-syllables, phonemes, diphones and sometimes even triphones [8].

 Concatenative synthesis can be divided into three types depending on the concatenation unit used:

1. Word synthesis: prerecorded words are concatenated to produce the output speech.
2. Syllabic synthesis: syllables are concatenated to form words and the output speech is produced.
3. Hybrid syllabic synthesis: both words and syllables are used to produce the output speech.

The hybrid approach is followed in this research work.

4) HMM based synthesis:  HMM-based synthesis is a synthesis method based on Hidden Markov models, also called statistical parametric synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source) and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion.

 An HMM-based TTS system consists of two stages, the training stage and the synthesis stage. In the training stage, phoneme HMMs are trained using a speech database. The spectrum and f0 are modeled by multi-stream HMMs in which the output distributions for the spectral and f0 parts are modeled using a continuous probability distribution and a multi-space probability distribution (MSD) respectively.

 To model the variations of the spectrum and f0, phonetic, prosodic and linguistic contextual factors such as phoneme identity factors, stress related factors and location factors are taken into account. Then a decision tree based context clustering technique is applied separately to the spectral and f0 parts of the context dependent phoneme HMMs. Finally, state durations are modeled by multi-dimensional Gaussian distributions and the state clustering technique is applied to the duration models.


 In the synthesis stage, an arbitrarily given text is first transformed into a context dependent phoneme label sequence. According to the label sequence, a sentence HMM is constructed by concatenating context dependent phoneme HMMs, and phoneme durations are determined using the state duration distributions. Then the spectral and f0 parameter sequences are obtained from the sentence HMM based on the ML criterion. Finally, speech is synthesized from the generated Mel-cepstral and f0 parameter sequences using the MLSA filter.
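A drastically simplified sketch of this two-stage idea is given below: each context dependent state carries a Gaussian duration model and a mean parameter vector, and a trajectory is generated by sampling state durations and repeating the state means. The maximum-likelihood parameter generation with dynamic features used in real HMM synthesis is omitted, and all model values are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical states: (duration mean, duration std, mean mel-cepstral vector)
states = [
    (8, 1.5, np.array([1.0, 0.2, -0.1])),
    (12, 2.0, np.array([0.8, 0.5, 0.1])),
    (6, 1.0, np.array([0.3, -0.2, 0.4])),
]

def generate_trajectory(states):
    # Sample a duration per state and repeat the state mean for that many frames
    frames = []
    for dur_mean, dur_std, mean_vec in states:
        duration = max(1, int(round(rng.normal(dur_mean, dur_std))))
        frames.extend([mean_vec] * duration)
    return np.vstack(frames)

print(generate_trajectory(states).shape)   # (total_frames, 3)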

Advantages of HMM synthesis
1) HMM synthesis provides a means to automatically train the specification-to-parameter module, thus bypassing the problems associated with hand-written rules.
2) The trained models can produce high quality synthesis and have the advantages of being compact and amenable to modification for voice transformation and other purposes.

Disadvantages of HMM synthesis
1) The speech has to be generated by a parametric model, so no matter how naturally the models generate parameters, the final quality is very much dependent on the parameter-to-speech technique used.
2) Even with the dynamic constraints, the models generate somewhat 'safe' observations and fail to reproduce some of the more interesting and delicate phenomena in speech.

5) Sine wave synthesis:  Sine wave synthesis is based on the well-known assumption that the speech signal can be represented as a sum of sine waves with time varying amplitudes and frequencies. Sine wave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles (a minimal sketch is given at the end of this section).

 The first sine wave synthesis program (SWS) for the automatic creation of stimuli for perceptual experiments was developed by Philip Rubin at Haskins Laboratories in the 1970s. This program was subsequently used by Robert Remez, Philip Rubin, David Pisoni and other colleagues to show that listeners can perceive continuous speech without traditional speech cues. This work paved the way for the view of speech as a dynamic pattern of trajectories through articulatory-acoustic space.
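A minimal sketch of the sum-of-sinusoids idea is given below: three time-varying 'formant' tracks are replaced by pure tones whose frequencies are integrated to phase and summed. The frequency and amplitude tracks are illustrative assumptions.

import numpy as np

def sine_wave_synthesis(freq_tracks, amp_tracks, fs=16000, hop=160):
    # freq_tracks / amp_tracks: arrays (n_tones, n_frames) of per-frame Hz and amplitude
    n_tones, n_frames = freq_tracks.shape
    n_samples = n_frames * hop
    t_frames = np.arange(n_frames)
    t_samples = np.linspace(0, n_frames - 1, n_samples)
    out = np.zeros(n_samples)
    for k in range(n_tones):
        freq = np.interp(t_samples, t_frames, freq_tracks[k])
        amp = np.interp(t_samples, t_frames, amp_tracks[k])
        phase = 2 * np.pi * np.cumsum(freq) / fs   # integrate frequency to phase
        out += amp * np.sin(phase)
    return out

# Three illustrative "formant" tones gliding over 100 frames (1 second at 16 kHz)
frames = 100
f = np.vstack([np.linspace(700, 500, frames),
               np.linspace(1200, 1500, frames),
               np.linspace(2500, 2400, frames)])
audio = sine_wave_synthesis(f, np.full_like(f, 0.2))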

3.4 LANGUAGE STUDY:

 There are 15 officially recognized Indian scripts. These scripts are broadly divided into two categories, namely Brahmi scripts and Perso-Arabic Scripts. The Brahmi scripts consist of Devanagari, Punjabi, Gujarati, Oriya, Bengali, Assamese, Telugu, Kannada, Malayalam and Tamil.

Hindi, Marathi and Sanskrit languages use Devanagari script.

 The spelling of languages written in Devanagari is partly phonetic in the sense that a word written in it can only be pronounced in one way, but not all possible pronunciations can be written perfectly. Devanagari has 34 consonants (vyanjan) and 12 vowels (svar). A syllable (akshar) is formed by the combination of zero or one consonant and one vowel. All these scripts mentioned above are written in a nonlinear fashion. The division between consonant and vowel is applied for all the Indian scripts.

 Marathi is an Indo-Aryan language (a branch of the Indo-Iranian group of the Indo-European family) spoken by about 96.75 million people, mainly in Maharashtra and its neighboring states. Marathi is also spoken in Israel and Mauritius. Marathi is thought to be a descendant of Maharashtri, one of the Prakrit languages, which developed from Sanskrit. The vowel signs attached to a consonant are not written in a single direction: they can be placed above or below the consonant, or to its left or right. The following example shows such vowel symbols (suffixes) used with consonants.


e.g. ा, ि, ु, े etc.

 Since the origin of all Indian scripts is the same, they share a common phonetic structure. In all of the scripts the basic consonants, vowels and their phonetic representation are the same. Typically the alphabet is divided into the following categories:

3.4.1 The consonants:  A consonant is a sound in spoken language (or the letter of the alphabet denoting such a sound) that has no sounding voice (vocal sound) of its own, but must rely on a nearby vowel with which it can sound (sonant). Each consonant has the specialty of carrying an inherent vowel, generally the short vowel अ (a). So the consonant letter व (va) represents not just the bare consonant व् (v) but also the inherent vowel अ (a).

 There are 34 consonants in Marathi as shown below. However, in the presence of a dependent vowel, the inherent vowel associated with a consonant is overridden by the dependent vowel. Consonants may also be rendered as half forms, which are presentation forms used to depict the initial consonant in the cluster. These half forms do not have an inherent vowel.

 Some Marathi characters have alternate presentation forms whose choice depends on the neighboring consonants. This variability is especially notable for र (ra), which has numerous presentation forms; for example, र (ra) is written with different signs in words such as 'prapta', 'kruti', 'darja' and 'kharya'. The consonants of the Marathi language are shown in figure 3.2.


Fig. 3.2: Consonants in Marathi language

3.4.2 Vowels:  A vowel is a sound in spoken language that has a sounding voice (vocal sound) of its own; it is produced by comparatively open configuration of the vocal tract. Unlike a consonant (a non-vowel), a vowel can be sounded on its own. A single vowel sound forms the basis of a syllable. Vowels in Marathi language are shown in the following figure 3.3:


Fig. 3.3: Vowels in Marathi language

There are mainly two representations for vowels.

1. Independent vowel
The writing system tends to treat independent vowels as orthographic CV syllables in which the consonant is null. These vowels are placed either at the beginning of a word or after a consonant, and each of them is pronounced separately. The independent vowels are used to write words that start with a vowel, e.g. अंतर (antar).


The independent vowels are: अ (a), आ (aa), इ (i), ई (ee), उ (u), ऊ (oo), ए (e), ऐ (ai), ओ (o), औ (au), अं (am), अः (aha).

2. Dependent vowel
The dependent vowels serve as the common manner of writing non-inherent vowels. They do not stand alone; rather, they are depicted in combination with a base letterform. The explicit appearance of a dependent vowel in a syllable overrides the inherent vowel of a single consonant. Marathi has a collection of non-spacing dependent vowel signs that may occur above or below a consonant, as well as spacing dependent vowel signs that may occur to the left or right of a consonant. In Marathi there is only one spacing dependent vowel that occurs to the left of the consonant, i.e. ि.

Usage of dependent vowels: त, ता, ति, ती, तु, तू, ते, तै, तो, तौ, तं, तः (ta, taa, ti, tee, tu, too, te, tai, to, tau, tam, taha)

3. Halant
A halant sign '्', known as virama or the vowel omission sign, serves to cancel the inherent vowel of the consonant to which it is applied. Such a consonant is known as a dead consonant. The halant is bound to a dead consonant as a combining mark, e.g. त (consonant) + ् (halant) = त्.

3.4.3 Consonant conjuncts: Consonant conjuncts serve as an orthographic abbreviation of two or more adjacent letterforms. This abbreviation takes place only in the context of a consonant cluster. An orthographic consonant cluster is defined as a sequence of characters that represents one or more dead consonants followed by a normal live consonant or an independent vowel, e.g. त् (ta) + य (ya) = त्य (tya). The small sketch below illustrates these combining rules with Unicode code points.
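In the sketch below, a dependent vowel sign attaches to a consonant, and the halant (virama, U+094D) turns a consonant into a dead consonant that joins the following consonant as a conjunct. The specific characters are examples only.

# Devanagari code points (Unicode block U+0900 to U+097F)
TA = "\u0924"             # त  (consonant "ta", carries the inherent vowel)
YA = "\u092F"             # य  (consonant "ya")
HALANT = "\u094D"         # ्  (virama / vowel omission sign)
VOWEL_SIGN_I = "\u093F"   # ि  (dependent vowel "i", rendered to the left)

print(TA + VOWEL_SIGN_I)   # ति : dependent vowel overrides the inherent vowel
print(TA + HALANT)         # त् : halant cancels the inherent vowel (dead consonant)
print(TA + HALANT + YA)    # त्य : dead consonant + live consonant = conjunct (tya)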


3.5 PRESENT SCENARIO OF TTS SYSTEMS:

 The first commercial speech synthesis systems were mostly hardware based, and the development process was very time-consuming and expensive. Since computers have become more and more powerful, most synthesis today is software based. Software based systems are easy to configure and update and they are also much less expensive than hardware systems. However, a standalone hardware device may still be the best solution when a portable system is needed.

 The speech synthesis process can be divided into high-level and low-level synthesis. The low-level synthesizer is the actual device which generates the output sound from information provided by the high-level part in some format, for example a phonetic representation. The high-level synthesizer is responsible for generating the input data to the low-level device, including correct text pre-processing, pronunciation and prosodic information. Most synthesizers contain both high and low level parts, but due to specific problems with the methods they are sometimes developed separately.

3.5.1 DECTalk:  Digital Equipment Corporation (DEC) has long traditions in speech synthesis. The DECtalk system is originally descended from MITalk and Klattalk. The present system is available for American English, German and Spanish. It offers nine different voice personalities: four male, four female and one child.

 The system is capable of saying most proper names, e-mail and URL addresses, and supports a customized pronunciation dictionary. It also has punctuation control for pauses, pitch and stress, and voice control commands may be inserted in a text file for use by DECtalk software applications. The sound of this product is, however, still robotic. The software version has three special modes: speech-to-wave mode, log-file mode and text-to-memory mode.


3.5.2 Bell labs text-to-speech:  AT&T Bell Laboratories (Lucent Technologies) has very long traditions in speech synthesis, since the demonstration of the VODER in 1939. The first full TTS system was demonstrated in Boston in 1972 and released in 1973; it was based on an articulatory model developed by Cecil Coker (Klatt 1987). The development of the present concatenative synthesis system was started by Joseph Olive in the mid-1970s (Bell Labs 1997). The present system is based on concatenation of diphones, context-sensitive allophonic units or even triphones. Because of this type of segmentation unit, there are more concatenation joints and the resulting speech naturalness is affected.

 The current system is available for English, French, Spanish, Italian, German, Russian, Romanian, Chinese and Japanese. The architecture of the current system is entirely modular (Möbius et al. 1996). It is designed as a pipeline where each of 13 modules handles one particular step of the process. A change in one of the 13 blocks does not affect the other blocks, but to implement even a single change the interfaces between the blocks need to be modified each time, and this integration must always be smooth; this is the biggest disadvantage of the system.

3.5.3 Laureate:  Laureate is a speech synthesis system developed over the last two decades at BT Laboratories (British Telecom). To achieve good platform independence, Laureate is written in standard ANSI C and has a modular architecture (Gaved 1993, Morton 1987). The Laureate system is optimized for telephony applications, so much attention has been paid to text normalization and pronunciation. The system supports multi-channel capabilities and other features needed in telecommunication applications.

 The current version of Laureate is available only for British and American English with several different accents. Prototype versions for French and Spanish also exist, and several other European languages are under development. A talking head for the system has recently been introduced (Breen et al. 1996). More information, including several pre-generated sound examples and an interactive demo, is available on the Laureate home page (BT Laboratory 1998). This system is oriented towards one kind of application, telephony, and cannot easily be extended to other applications as the emphasis is on front-end processing.

3.5.4 SoftVoice:  SoftVoice Inc. has over 25 years of experience in speech synthesis. The latest version of SVTTS, the fifth generation multilingual TTS system for Windows, is available for English and Spanish with 20 preset voices including males, females, children, robots and aliens. Languages and parameters may be changed dynamically during speech. More languages are under development and the user may also create an unlimited number of custom voices. The input text may contain over 30 different control commands for speech features. The speech rate is adjustable between 20 and 800 words per minute and the fundamental frequency (pitch) between 10 and 2000 Hz. Pitch modulation effects such as vibrato, perturbation and excursion are also included. This system concentrates more on the emotional aspects of synthesis than on naturalness or quality improvement of the speech output.

 Vocal quality may be set as normal, breathy or whispering, and singing is also supported. The output speech may be listened to in either word-by-word or letter-by-letter mode. The system can return mouth shape data for animation and is capable of sending synchronization data to other user applications. The basic architecture of the present system is based on formant synthesis.

3.5.5 CNET PSOLA:  The latest commercial product is available from Elan Informatique as the ProVerbe TTS system. The concatenation unit used is the diphone, sampled at an 8 kHz rate. The ProVerbe speech unit is a serially (RS232 or RS458) connected external device (150x187x37 mm) optimized for telecommunication applications such as e-mail reading via telephone.


 The system is available for American and British English, French, German and Spanish. The pitch and speaking rate are adjustable, and the system contains a complete telephone interface allowing direct connection to the public network. ProVerbe also has an ISA-connected internal device which is capable of multichannel operation; the internal device is available for the Russian language and has the same features as the serial unit. This system has limited applications and is available only for a limited set of languages; it has not yet been extended to other languages.

3.5.6 ORATOR:  ORATOR is a TTS system developed by Bellcore (Bell Communications Research). The synthesis is based on demi-syllable concatenation (Santen 1997, Macchi et al. 1993, Spiegel 1993). The latest ORATOR version provides probably some of the most natural sounding speech available today. Special attention is given to text processing and the pronunciation of proper names for American English, so the system is well suited to telephone applications. The current version of ORATOR is available only for American English and supports several platforms, such as Windows NT, Sun and DEC stations. The system is thus limited to a few languages and platforms, and it is developed more from the point of view of front-end processing. Demi-syllables result in a larger number of concatenation points, and hence the performance of this system is not as good as the latest systems.

3.5.7 Eurovocs:  Eurovocs is a text-to-speech synthesizer developed by Technologies & Revalidate (T&R) in Belgium. It is a small (200 x 110 x 50 mm, 600 g) external device with a built-in speaker, and it can be connected to any system or computer capable of sending ASCII over the standard RS232 serial interface. No additional software on the computer is needed.

 The Eurovocs system uses the text-to-speech technology of Lernout & Hauspie speech products, described in the following section, and it is available for Dutch, French, German, Italian and American English. One Eurovocs device can be programmed with two languages. The system also supports personal dictionaries. A recently introduced improved version adds Spanish and some improvements in speech quality. Only two languages at a time can be used with this type of product; it is available for few languages, and it is an external device which needs to be connected to a computer.

3.5.8 Lernout & Hauspie's:  Lernout & Hauspie (L&H) has several TTS products with different features depending on the markets in which they are used. Different products are available and optimized for application fields such as computers and multimedia (TTS2000/M), telecommunications (TTS2000/T), automotive electronics (TTS3000/A) and consumer electronics (TTS3000/C).

 All versions are available for American English and the first two also for German, Dutch, Spanish, Italian and Korean (Lernout & Hauspie 1997). Several other languages such as Japanese, Arabic and Chinese are under development. The products have a customizable vocabulary tool that permits the user to add special pronunciations for words which are not handled correctly by the normal pronunciation rules. With a special transplanted prosody tool it is possible to copy duration and intonation values from recorded speech for commonly used sentences, which may be used for example in information and announcement systems.

 Different versions are available for different applications; a single product cannot be used across applications. All versions are available for American English, while other languages are still under development.

3.5.9 Apple plain talk:  Apple has developed three different speech synthesis systems for its Macintosh personal computers. The systems have different levels of quality for different requirements. The PlainTalk products are available for Macintosh computers only and can be downloaded free from the Apple homepage.


 MacinTalk2 is a wavetable synthesizer with ten built-in voices. It uses only 150 kilobytes of memory and runs on almost every Macintosh system, but it also has the lowest quality of the PlainTalk family.

 MacinTalk3 is a formant synthesizer with 19 different voices and considerably better speech quality than MacinTalk2. It supports singing voices and some special effects. The system requires at least a Macintosh with a 68030 processor and about 300 KB of memory. MacinTalk3 has the largest set of different sounds.

 MacinTalkPro is the highest quality product of the family, based on concatenative synthesis. The system requirements are considerably higher than for the other versions, but it has three adjustable quality levels for slower machines. The Pro version requires a 68040 or PowerPC processor with operating system version 7.0 and uses about 1.5 MB of memory. The pronunciations are derived from a dictionary of about 65,000 words and 5,000 common names.

 The database of this system is very large and the processor requirement is high. Although the system provides different voices, the processor capacity and memory requirements are considerable.

3.5.10 Silpa:  Silpa stands for the Swathanthra Indian language computing project. It is a web platform for hosting free software language processing applications easily. It is a web framework and a set of applications for processing Indian languages in many ways; in other words, it is a platform for porting existing and upcoming language processing applications to the web. Silpa can be used as a Python library or as a web service from other applications.

 This is a web platform for language processing applications; it is not an independent synthesis system.

The product range of text-to-speech synthesis is very wide and it is quite unreasonable to present all possible products or systems available out there.


Chapter-4

Literature review


CHAPTER- 4

LITERATURE REVIEW

Title of the contents:
4.1 Introduction
4.2 Review of the literature
4.2.1 Review of context based speech synthesis papers
4.2.2 Review of evaluation of speech synthesis with spectral smoothing methods
4.2.3 Review of concatenative speech synthesis and segmentation
4.3 Review of recent papers on performance improvement of TTS systems
4.4 Summary of literature review


4.1 INTRODUCTION:

Speech is the most natural and efficient form of communication and provides a vehicle with which to transmit our thoughts and the ability to impart large amounts of information in a compact form. "Communication" is the creation of understanding, and speech offers an efficient and concise way of doing so. Naturalness in human speech depends on a number of factors, and the extent to which a text to speech synthesis system can account for these factors in its model is the measure of its success in the marketplace.

Many researchers have published their work on speech synthesis and applications of text to speech. Papers on Arabic and Telugu synthesis, corpus based synthesis and unit selection synthesis are closely related to this work. A syllable is used as the basic unit in such research works; some researchers have used the diphone, demi-syllable or word as the basic unit. The linguistic structure of the language under consideration is important for the selection of the unit. Indian languages are phonetic in nature and words can be formed from different अक्षर (aksharas), which are close to syllables.

Various speech synthesis methods are available to improve the speech quality and naturalness of TTS systems, and various techniques are under development for overall improvement of TTS performance. A few of them are reviewed here. This chapter presents a brief summary of the work done by other researchers on speech synthesis and its performance improvement; the work published up to the time of writing this thesis is considered. A detailed literature review was conducted for new approaches to text to speech systems.


4.2 REVIEW OF THE LITERATURE:

4.2.1 Review of context based speech synthesis papers:

Context based segmentation and concatenation take into consideration the syllable position and context (adjacent letters, kriyapada, pratyaya, etc.). There are many different contexts present in a particular language. Speech output can be improved by an empirical study of the language and the development of rules under linguists' guidance. Context based segmentation can be used for automatic syllable formation. The following sections review related papers.

Speech in Indian languages is based on basic sound units which are inherently syllable units made from C, V, CCV, VC and CVC combinations, where C is a consonant and V is a vowel. The syllable-like unit is defined in terms of the short-term-energy (STE) function by M. Nageshwara Rao et al. [9]. In terms of the STE function, the syllable-like unit can be viewed as an energy peak in the nucleus (vowel) region; it tapers off at both ends of the nucleus where a consonant is present, and the valleys at both ends of the nucleus can be considered as the boundaries of a syllable-like unit (a minimal sketch of this view is given below). To create the syllable-like unit repository, a group delay based segmentation algorithm is used, which allows segmentation to be carried out automatically; for detailed analysis a spectrogram is used in this paper. The FestVox cluster-unit-selection based speech synthesizer is the reference synthesizer for this work. The biggest disadvantage of this method is that the evaluation is based merely on subjective analysis: MOS scores are recorded for Tamil sentences synthesized using different speech units (monosyllable, diphone, bi-syllable, tri-syllable, etc.), and a second perceptual test was conducted to find which order of syllable-like units was most acceptable for synthesis (combinations of monosyllable-bisyllable, bisyllable-trisyllable, monosyllable-trisyllable, etc. at the beginning and end of a word were implemented). The Marathi TTS, in contrast, uses a neural-classification network combination for syllable segmentation, which gives 90% accuracy in syllabification, and the evaluation of speech quality is carried out using both subjective and objective validation methods.
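A minimal sketch of this STE-based view is given below: syllable nuclei appear as peaks of the short-term-energy contour and candidate syllable boundaries lie in the valleys between them. The smoothing window, threshold and valley-picking rule are illustrative assumptions; they are not the group delay algorithm of the cited paper or the method used for the Marathi TTS.

import numpy as np

def ste_contour(x, frame_len=400, hop=160):
    # Short-term energy per frame (25 ms window, 10 ms hop at 16 kHz)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def syllable_boundaries(x, fs=16000, hop=160):
    # Candidate boundary times (s): midpoints of low-energy valleys in the STE contour
    smooth = np.convolve(ste_contour(x), np.ones(5) / 5, mode="same")
    low = smooth < 0.1 * smooth.max()
    boundaries, in_valley, start = [], False, 0
    for i, flag in enumerate(low):
        if flag and not in_valley:
            in_valley, start = True, i
        elif not flag and in_valley:
            in_valley = False
            boundaries.append(((start + i) // 2) * hop / fs)
    return boundaries

# Two "syllable-like" bursts separated by a low-energy gap of the same length
fs = 16000
t = np.arange(int(0.2 * fs)) / fs
burst = np.sin(2 * np.pi * 200 * t)
print(syllable_boundaries(np.concatenate([burst, 0.01 * burst, burst])))
# roughly one boundary near the middle of the low-energy gap (about 0.3 s)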


Segmentation of the acoustic signal into syllabic units is an important stage in speech synthesis. Although the STE (short term energy) function contains useful information about syllable segmentation boundaries, it has to be processed before segment boundaries can be extracted. T. Nagarajan et al. [10] present a sub-band based group delay approach to segment spontaneous speech into syllable-like units. This technique exploits the additive property of the Fourier transform phase and the de-convolution property of the cepstrum to smooth the STE function of the speech signal and make it suitable for syllable boundary detection. Although an automatic segmentation technique like group delay is used, fixing syllable-like units in continuous and spontaneous speech remains a time-consuming procedure. Spectral changes are uniform across syllable boundaries, but the evaluation of these changes is based on spectrogram reading, which is an old method and often lacks clarity. In the present Marathi TTS, neural-classification network based automatic segmentation is carried out, which gives 90% accuracy. Spectral behavior is judged by PSD, DTW and wavelet coefficients; these methods give an objective (numerical) evaluation of spectral distortion, and the numerical figures are used to reduce the spectral noise. The graphical plots of all these methods are clearer and make it easier to locate spectral noise, and the numerical results for distance and correlation give a proper evaluation of spectral distortion.

Lijuan Wang et al. [11] present a post-refining method with fine contextual-dependent GMMs for the auto-segmentation task. A GMM trained with a super feature vector, extracted from multiple evenly spaced frames near the boundary, is suggested to describe the waveform across a boundary. An accuracy of 90% is achieved when only 250 manually labeled sentences are provided to train the refining models. It has been observed that if boundaries are classified into groups by their phonemic context, such as vowels, nasals and liquids, and a refining model is trained for each group, more precise auto-segmented boundaries can be obtained. The paper presents a CART based method that clusters segmental boundaries automatically according to their similarity in acoustic features. The best performance (91.9% accuracy) is achieved when 30000 samples (about 1500 sentences) are provided. In addition, the method is quite generic and makes no inherent assumptions about the language or speaker type; however, it is presently applied only to English, and the whole system would need to be revised for extension to another language. The Marathi TTS system needs a database of about 3000 words and syllables and can synthesize most of the sentences provided to it. It uses automatic segmentation based on the NN-classification approach and reduces spectral distortion using different time and frequency domain smoothing methods. This TTS system can easily be extended to a couple of other languages, such as Hindi and Sanskrit, apart from Marathi, as it uses the Devanagari script.

S. P. Kishore et al. [12] describe a data-driven synthesizer for Indian languages using syllables as the basic units for concatenation. The unit selection algorithm of this method exploits the advantages of a prosodic matching function, which is capable of implicitly selecting a larger sequence such as a word, phrase or even a sentence. This method generates high quality speech for Indian languages. The approach minimizes coarticulation effects and prosodic mismatch between two adjacent units; the selection of units is based on the phonetic context and the prosodic parameters of the adjacent units. The hypothesis is that a procedure of selecting a unit satisfying local phonetic and prosodic constraints through the prosodic matching function implicitly selects a larger available sequence, which leads to high quality synthesis. The performance of this approach is better than that of other data-driven techniques. Letter to sound rules and syllabification rules are used for preparation of the database. Being a unit selection method, it takes a long time to search for the required unit in a huge database, and the database preparation of this system is time consuming. The Marathi TTS system uses a moderate database and a concatenative hybrid (words + syllables) synthesis approach; its database search time is much smaller than that of any unit selection method. CV structure rules are used for optimization of the database and for breaking words into syllables. The CV structure rules are uniform and prepared with the linguist's guidance; there are more than 470 such rules available for the Marathi language, and they can be extended very easily to other languages, whereas the data-driven approach uses only a couple of standard letter to sound (LTS) rules based on the context study of one particular language.

Concatenated waveform text to speech synthesis systems require an inventory of stored waveforms from which units of speech can be extracted for subsequent rearrangement and concatenation as needed. In most speech synthesis systems the syllable is the preferred unit for natural sounding speech. The mark-up of stored waveforms for segmentation into syllables must be precise; it can be done manually or automatically. Eric Lewis et al. [13] describe how most of the segmentation can be done automatically, leaving only those waveforms which would be prone to error to be segmented manually. Rules for automatic segmentation of words into syllables have been derived based on morphemic decomposition, and an algorithm has been produced for the implementation of these rules, which utilizes the orthography-to-phoneme dictionary together with an accurate markup of pitch periods. Although not capable of completely automatic segmentation of all words, the authors believe that about 90% of waveform markup can now be automated, allowing much faster derivation of synthetic voices. Using a limited domain database (MeteoSPRUCE), this system can modify syllables from one context to another, but it performs all necessary mark-ups by manual editing of the waveform. The provision of a phonetically defined syllable boundary together with precise knowledge of the voiced and unvoiced sections of the waveform makes the task of automatically marking the syllable boundaries in the waveform considerably easier. However, the phonetically defined syllable boundary and the knowledge of voiced-unvoiced sections depend on the user; if the user does not have this knowledge, the system may produce poor results. The research work on the Marathi TTS does not make use of any phonetic study. It uses strong methods of automatic segmentation such as slope detection, simulated annealing, Maxnet and K-means, and these algorithms give good segmentation results. The K-means-Maxnet (classification-neural) combination results in 90% syllabification accuracy in automatic segmentation.

Goh Kawai et al. [14] suggest a method for detecting syllabic nuclei in English utterances on a frame-by-frame basis using band-pass filtered acoustic energy measurements. In the training phase, phones in English utterances read by a female speaker were assigned rank-ordered sonority values. These sonority values were then predicted using multiple linear regression, where the predictor variables were band-pass filtered acoustic energy values at the phone's central region. Results show that (1) syllabic nuclei are identified at over 60 percent accuracy, and (2) speech rate, defined as syllabic nuclei per unit time, is estimated at over 80 percent accuracy. This method is based on sonority value prediction using multiple linear regression, a statistical method for examining the relationship between a dependent variable and independent variables. The dependent variable is strongly affected by the independent variables, and this is the biggest disadvantage of multiple regression models; due to this limitation, the sonority value prediction is not very accurate. In terms of accuracy, this method gives a maximum of 60% accuracy for syllabic nuclei identification, although the speech rate accuracy is over 80%. The Marathi TTS gives a syllable segmentation accuracy of 90% by implementing a classification-neural network combination.
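The regression step in [14] can be pictured with a small sketch (Python, scikit-learn); the band energies, sonority ranks and number of bands below are placeholders, not Kawai et al.'s actual data or configuration.

# Minimal sketch of the regression idea: predict a rank-ordered sonority value
# for each frame from band-pass filtered energy measurements.
import numpy as np
from sklearn.linear_model import LinearRegression

# X: one row per frame, columns = log energy in a few hypothetical bands
X_train = np.array([[2.1, 3.5, 1.2],    # vowel-like frame
                    [1.8, 3.1, 1.0],
                    [0.4, 0.6, 0.3],    # obstruent-like frame
                    [0.5, 0.7, 0.2]])
y_train = np.array([9, 8, 2, 1])        # assigned sonority ranks

model = LinearRegression().fit(X_train, y_train)
sonority = model.predict(X_train)       # frames with high predicted sonority
nuclei = sonority > sonority.mean()     # are treated as syllabic-nucleus candidates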

Stas Tiomkin et al. [15] propose a combination of concatenative TTS (CTTS) and statistical TTS (STTS). This hybrid system (HTTS) is implemented in order to gain the advantages of both approaches. Each utterance representation in HTTS is constructed from natural segments and model-generated segments in an interweaved fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of this approach. The designed HTTS combines the advantageous characteristics of STTS (smooth transitions between adjacent segments) with those of CTTS (naturalness of natural segments). The HTTS interweaves natural segments with statistical models, where the positions of the statistical models are defined by the hybrid dynamic path algorithm. According to MOS listening tests, hybrid generated speech sounds more natural than the corresponding purely statistically generated speech. It is important to note that the improvement of the CTTS system by the STTS system was successful in this work because both systems use the same speech models and operate in the same speech feature space. The HTTS is advantageous at an intermediate working point of about 7 MB (for the baseline CTTS footprint). However, determining the working point is currently not based on an optimality criterion but rather on subjective evaluations, which is the biggest limitation of this system; there is no objective evaluation method in this research work. Marathi TTS evaluation is carried out both subjectively and objectively, and it gives more than 80% results in both validation approaches.

Shinnousuke Takamichi et al. [16] show the use of rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian mixture models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments. The speech parameter sequence is then iteratively generated from GMMs using a parameter generation method based on the maximum likelihood criterion. The experimental results demonstrated that i) the use of approximation with a single Gaussian component sequence yields better synthetic speech quality than the use of the EM algorithm in the proposed parameter generation method, ii) state-based model selection yields quality improvements at the same level as frame-based model selection, iii) the use of initial parameters generated from the over-fitted speech probability distributions is very effective in further improving speech quality, and iv) the proposed methods for the spectral and F0 components yield significant improvements in synthetic speech quality compared with traditional HMM-based speech synthesis. This method is a good combination of the statistical parametric approach, which models acoustic features flexibly, and unit selection synthesis, which produces spectral improvement in the resulting speech quality. Due to the hybrid approach, this algorithm enjoys the advantages of both parametric and unit selection synthesis. But the parameters generated using the over-fitted speech probability distributions are responsible for speech quality; if the over-fitting is not properly carried out, the resulting speech output may be affected. Due to the unit selection approach, this research work needs a large database covering all types of rich contexts. The rich context models need to be interleaved between parametric units, and this procedure may not always work correctly for a given input text. Marathi TTS uses the concatenative synthesis approach and a hybrid (words and syllables) database, which results in accurate segmentation, spectral mismatch reduction and more natural speech output.

4.2.2 Review of evaluation of speech synthesis with spectral smoothing methods:

Spectral distortion or glitches present at the concatenation points of synthesized words need to be reduced. Reducing them removes the spectral mismatch present before and after the joint frames and hence aligns the synthesized word more correctly in terms of spectral behavior. The following sections review some papers based on spectral smoothing or spectral distortion reduction.

There are many scenarios in both speech synthesis and coding in which adjacent time-frames of speech are spectrally discontinuous. David T. Chappell et al. [17] address the topic of improving concatenative speech synthesis with a limited database by proposing methods to smooth, adjust or interpolate the spectral transitions between speech segments. The objective is to produce natural sounding speech via segment concatenation when formants and other spectral features do not align properly. Several methods for adjusting the spectra at the boundaries between waveform segments are considered. Techniques examined include optimal coupling, waveform interpolation, linear predictive parameter interpolation and psychoacoustic closure. Several of these algorithms have been previously developed for either coding or synthesis, while others are enhanced. The authors also consider the connection between speech science and articulation in determining the type of smoothing appropriate for given phoneme-phoneme transitions. Moreover, this work incorporates a recently proposed auditory-neural based distance measure (ANBM), which employs a computational model of the auditory system to assess perceived spectral discontinuities. The authors demonstrate how ANBM scores can be used to help determine the need for smoothing. In addition, formal evaluation of four smoothing methods using the ANBM and extensive listener tests reveals that smoothing can distinctly improve the quality of speech, but when applied inappropriately it can also degrade the quality. It is shown that after proper spectral smoothing or spectral interpolation the final synthesized speech sounds more natural and has a more continuous spectral structure. This research work has a good evaluation technique in the ANBM: the distance calculation gives a numeric figure for spectral smoothness apart from the extensive listener test. The techniques used for adjusting the spectra (optimal coupling, waveform interpolation, linear predictive parameter interpolation and psychoacoustic closure) are good methods, but more promising results can be obtained with spectral smoothing methods in which a few frames before and after the concatenation point are averaged and the smoothing is carried out on that basis. Direct formant processing degrades the performance of the speech output. The Marathi TTS averages the spectra (PSD) before and after the concatenation point and modifies the spectral characteristics with this average. Other spectral mismatch reduction techniques such as DTW and wavelet coefficient modification give more promising results in terms of the naturalness of the resulting speech output (a percentage error improvement of 20%).
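The frame-averaging idea used in the present work can be sketched as follows (Python with SciPy; the number of frames, frame length and the function name smoothed_joint_psd are illustrative assumptions rather than the exact implementation): the power spectral densities of a few frames on either side of the joint are averaged to form a smoothing target.

# Hedged sketch of frame averaging around a concatenation joint.
import numpy as np
from scipy.signal import welch

def smoothed_joint_psd(wave, joint_sample, sr, frame=512, n_frames=3):
    # frames immediately before and after the concatenation point
    before = [wave[joint_sample - (k + 1) * frame: joint_sample - k * frame]
              for k in range(n_frames)]
    after = [wave[joint_sample + k * frame: joint_sample + (k + 1) * frame]
             for k in range(n_frames)]
    psds = [welch(f, fs=sr, nperseg=frame)[1] for f in before + after]
    return np.mean(psds, axis=0)   # averaged PSD used as the smoothing target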

The quality of concatenative speech synthesis depends on the cost function employed for unit selection. Effective cost functions for spectral continuity have proven difficult to define, and standard measures do not accurately reflect human perception of spectral discontinuity in concatenated speech. Previous studies on the join cost have focused predominantly on static spectral measures extracted from the unit boundary. Barry Kirkpatrick et al. [18] suggest investigating spectral dynamic behavior as a source of discontinuity in concatenated speech. A number of measures representing spectral dynamics are tested for the task of detecting discontinuities, and a strategy to effectively combine dynamic and static measures is proposed using principal component analysis (PCA). Selection of the best unit sequence depends on the target cost plus the join cost, and an ideal join cost should accurately reflect human perception of discontinuity. The spectral dynamic measures tested contain information correlating with human perception of discontinuities, suggesting that spectral dynamics are indeed a source of discontinuity in concatenated speech. The objective of this paper is to identify the role of spectral dynamics in the perception of discontinuities in concatenative speech synthesis and to investigate the proposed measures for use in defining perceptually salient join cost functions for spectral continuity. LPC, MFCC, LSF and log harmonic amplitude are tested as different measures. As with previous studies, the gain in performance from adding the spectral dynamic measure is low and can even cause a decrease in performance in some cases. PCA is found to give a marginal improvement in some cases (LSF and AH) and no impact on the result in other cases (formants and MFCCs). The delta features computed with the same window length as the static spectral features were found to contain almost zero discriminating information; such a measure would effectively introduce a noise component into the join cost for unit selection. The Marathi TTS implements different features for reducing joint distortion, such as PSD, DTW and wavelet domain analysis. These features have not affected the joint distortion adversely; they have reduced joint spectral noise and improved audio quality. The best audio quality is achieved through the wavelet algorithm, and none of these algorithms has an adverse effect on the joint distortion.
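A minimal sketch of the PCA combination strategy is given below (Python, scikit-learn); the two-column feature matrix of static and delta distances is synthetic, and in practice the sign and scaling of the projected score would be calibrated against perceptual data before being used as a join cost.

# Sketch: combine static and dynamic spectral distances into one score via PCA.
import numpy as np
from sklearn.decomposition import PCA

# rows: candidate joins; columns: [static spectral distance, delta (dynamic) distance]
measures = np.array([[0.8, 0.3],
                     [0.2, 0.1],
                     [0.9, 0.7],
                     [0.1, 0.2]])
pca = PCA(n_components=1).fit(measures)
combined = pca.transform(measures).ravel()
# the sign/scale of this projection is arbitrary; it would be calibrated
# against listener judgements before use as a join cost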

In TTS, spectral smoothing is often employed to reduce artifacts at unit-joining points. A context adaptive smoothing method is proposed by Ki-seung Lee et al. [19], where the amount of smoothing is determined according to context information. Discontinuities at unit boundaries are predicted by a regression tree, and smoothing factors are computed using the predicted discontinuities and the real discontinuities at the unit boundaries. A discontinuity measure at unit-joining points is defined; among the several measures available for representing discontinuities, a simple time-domain measure is employed here. Wl and Wr represent the last frame of the left unit and the first frame of the right unit respectively. The frames are pitch-synchronous, where one frame is two pitch periods long. The amount of discontinuity is measured by calculating the Euclidean distance between these two frames. Since the pitch period of the left frame very often differs from that of the right frame, linear interpolation is applied to the shorter frame. After smoothing, the discontinuity is changed by some scale amount. Before smoothing, the amount of discontinuity of the current synthesized speech signal is computed and the desired amount of discontinuity is predicted from the context information. The coefficient of the smoothing filter is defined by a standard equation and the resulting synthesized waveform is passed through this filter, so that the discontinuities of the resulting speech signal are close to the predicted ones. This scheme differs from previous methods in that the filter coefficient of the smoothing filter is determined according to the amount of discontinuity of the synthesized speech signal and the context information. The naturalness of this method depends on the prediction of the context information and of the amount of discontinuity of the synthesized speech. The predicted filter coefficients may not always be up to the mark; adjusting the scaling factor of the filter according to context is not always successful. It might work for some contexts, but for the remaining contexts the prediction might not be very accurate. The Marathi TTS does not depend on any prediction or estimation as far as spectral smoothing is concerned. In the PSD method, the actual formant difference between the original and the synthesized word is taken, and with the help of this difference the formants of improperly concatenated words are restored toward the original word so that the resulting distortion is reduced. The other methods, DTW and wavelet analysis, have resulted in more natural speech after reducing the spectral distortion.
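The time-domain discontinuity measure described above can be sketched as follows (Python/NumPy; the function name and frame lengths are illustrative): the shorter pitch-synchronous frame is linearly interpolated to the length of the longer one and the Euclidean distance between the two frames is returned.

# Sketch of the time-domain joint discontinuity measure.
import numpy as np

def joint_discontinuity(w_left, w_right):
    n = max(len(w_left), len(w_right))
    def stretch(x):
        # linear interpolation of the shorter frame to the common length
        return np.interp(np.linspace(0, len(x) - 1, n), np.arange(len(x)), x)
    return np.linalg.norm(stretch(w_left) - stretch(w_right))

# example with two pitch-synchronous frames of different lengths
d = joint_discontinuity(np.random.randn(160), np.random.randn(180))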

In unit selection-based concatenative speech synthesis, the join cost (also known as the concatenation cost), which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from the inventory. Usually, some form of local parameter smoothing is also needed to disguise the remaining discontinuities. Jithendra Vepa et al. [20] present a subjective evaluation of three join cost functions and three smoothing methods, and report listeners' preferences for each join cost in combination with each smoothing method. The three join cost functions were taken from a previous study, where the authors proposed join cost functions derived from spectral distances which correlate well with perceptual scores obtained for a range of concatenation discontinuities. This evaluation allows them to further validate the ability of these functions to predict concatenation discontinuities. The units for the synthesis stimuli are obtained from a state-of-the-art unit selection text-to-speech system: rVoice from Rhetorical Systems Ltd. Three join costs are calculated: the LSF join cost, the MCA join cost and the Kalman join cost. For comparative analysis three smoothing approaches are presented: a) no smoothing, b) linear smoothing with LSF interpolation and c) Kalman filter based smoothing. A listening test was designed to evaluate the three join costs and the smoothing methods and to compare the smoothed LSFs obtained from the Kalman filter and from linear smoothing on LSFs. Paired t-test statistics are calculated for the analysis of these results. The results of the listening test indicate that the LSF join cost is preferred over the MCA join cost and the Kalman join cost. Although very detailed subjective listening tests and detailed statistical analysis are carried out to validate the results of smoothing and join cost, this method relies on the perceptual ability of human beings; there is no objective evaluation technique implemented, which is the biggest limitation of this method. For the statistical analysis, many listening tests for different combinations need to be carried out with time gaps of days or months, which is an expensive and time consuming task. Marathi TTS evaluation is based both on subjective listening tests and on objective validation methods of spectral noise reduction. Due to both subjective and objective testing, the Marathi TTS results are more credible.

The quality of unit selection based concatenative speech synthesis mainly depends on how well two successive units can be joined together to minimize the audible discontinuities. The objective measure of discontinuity used when selecting units is known as the join cost. The ideal join cost would measure perceived discontinuity based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural sounding synthetic speech. Jithendra Vepa et al. [21] describe a perceptual experiment conducted to measure the correlation between subjective human perception and various objective spectral-based measures proposed in the literature. For the objective distance measures, MFCC, LSF and MCA coefficients are used. A distance measure between two vectors of such parameters can use various metrics: Euclidean, absolute, Kullback-Leibler or Mahalanobis. These are standard, readily available distance calculation techniques; both the coefficients and the distances used here already exist in the literature, but a detailed comparative analysis is carried out very concisely. Correlation is calculated between the subjective perceptual listening test and these objective measures, and an increase in correlation shows a successful implementation of the perceptual test. The present Marathi TTS is based on new objective measures such as PSD, DTW and wavelet analysis. All these methods give detailed graphical and numerical results, and quantification of the joint mismatch is carried out accurately.
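For reference, the standard distance metrics mentioned above can be computed directly with SciPy, as in the following sketch; the feature vectors and the covariance estimate are placeholders (in practice the covariance would be estimated from the unit inventory).

# Sketch: standard distances between two feature vectors (e.g. MFCC or LSF) at a join.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, mahalanobis

a = np.array([1.0, 2.0, 0.5])
b = np.array([0.8, 2.4, 0.7])
# inverse covariance for the Mahalanobis distance (placeholder data)
cov_inv = np.linalg.inv(np.cov(np.random.randn(100, 3), rowvar=False))

d_euclidean = euclidean(a, b)
d_absolute = cityblock(a, b)            # sum of absolute differences
d_mahalanobis = mahalanobis(a, b, cov_inv)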

Phuay Hui Low et al. [22] explore how estimated formant tracks can be used for formant smoothing in TTS synthesis. A pole analysis procedure is used to estimate the formant frequencies, bandwidths, spectrum shape and dynamics. The obtained formant tracks are then used for formant smoothing in TTS synthesis. Three methods of spectral and formant smoothing are explained. The first method achieves spectral smoothing at segment boundaries by interpolating the LP autocorrelation vectors. The second and third formant smoothing methods involve direct modification of the formant frequencies. In the second method, smoothing is achieved by substituting the formants of the synthesized speech with those of natural speech. Finally, the last method achieves formant smoothing by moving the formants of successive segments closer to an average value. Here, direct formant modification is used, but most of the literature shows that the performance of spectral smoothing is reduced with direct formant modification. Instead, the Marathi TTS system uses strong signal processing methods such as PSD, DTW and wavelet analysis for spectral smoothing; all these methods use correlation, distance calculation and similar measures to quantify and reduce the spectral distortion.

Youcef Tabet et al. [23] demonstrate how a speech synthesis system can resolve mismatches by combining the unit selection method with the harmonic plus noise model (HNM). This model represents the speech signal as a sum of a harmonic part and a noise part; the decomposition of the speech signal into these two parts enables more natural sounding modifications of the signal. Finally, hidden Markov model (HMM) synthesis combined with an HNM model is introduced in order to obtain a text-to-speech system that requires smaller development time and cost. This is a fairly basic method available in the literature. The output of this TTS sounds more natural compared with mere HMM or HNM synthesis, since the advantages of both HMM and HNM are combined, and the resulting development time is small because both methods are available in fully developed form. But these methods cannot give very natural speech output because HMM synthesis has the drawback of muffled sound. Marathi TTS combines proper syllabification and spectral smoothing techniques to achieve more naturalness; as care has been taken in breaking and concatenating the speech segments, the resulting output is natural.

4.2.3 Review of concatenative speech synthesis and segmentation:

There are various methods or approaches for speech synthesis, such as formant, articulatory, HMM based and concatenative synthesis. Among all these approaches the concatenative method of synthesis results in more naturalness, as it is based on storage of actual speech segments and their concatenation at the time of synthesis. The following section reviews some papers based on concatenative speech synthesis.

Duration is one of the prosodic features of speech, the other two being stress and intonation. S. R. Rajesh Kumar et al. [24] demonstrate the significance of durational knowledge in speech synthesis for the Indian language Hindi. Here, speech synthesis is carried out based on a parameter concatenation (linear prediction) model. The various durational effects in Hindi have been identified and categorized; some of these effects have been studied and a set of durational rules formulated for use in a speech synthesis system for Hindi. A linear prediction model is used for synthesis and more than 12 rules are developed for different duration patterns. This method is based on a few rules, but in continuous speech there are many duration patterns, and to cover all of them more rules would need to be developed; the synthesis method itself is also quite old. Marathi TTS uses CV structure rules for studying different contexts of the language and for breaking words and sentences into syllables according to these rules. More than 470 such rules are already available for the Marathi language.

Speech synthesis systems based on concatenation of segments derived from natural speech are very intelligible and achieve a high overall quality. Even though listeners often complain about wrong or missing temporal structure and timing in synthetic speech, concatenative synthesis gives more natural results with duration control. Matthias Eichner et al. [25] suggest a new approach for duration control in speech synthesis that uses the probability of a word in its context to control the local speaking rate within the utterance. The idea is based on the observation that words that are very likely to occur in a given context are pronounced less accurately and faster than improbable ones. An algorithm that implements duration control using a multi-gram model is explained with experimental results. As a language model is used for duration control, it improves the naturalness of the resulting speech. Time and intonation in speech are linked closely and should be generated using a unified model. This approach is oriented more toward expressive speech, whereas Marathi TTS emphasizes naturalness improvement using core signal processing techniques, with different contexts of the language implemented as linguistic rules.

Edmilson Morais et al. [26] present some preliminary methods to apply time frequency interpolation (TFI) techniques to concatenative speech synthesis. The TFI technique described here is a pitch-synchronous time-frequency approach to the well-known prototype waveform interpolation (PWI) technique. The basic concepts of representing the speech signal in the time-frequency domain as well as techniques to perform time-scale and pitch-scale modifications are described. Using the flexibility of the TFI technique to perform spectral smoothing, a method was developed to minimize the spectral mismatch at the boundaries of the synthesis units (SUs). The proposed system was evaluated using SUs (diphones) and prosodic modifications generated by the Festival system. An informal subjective test was performed between the proposed TFI system and the standard TD-PSOLA system; this test highlights the superior quality of the proposed system in comparison with TD-PSOLA. The system evaluation is performed using a subjective test between standard TD-PSOLA and this system: 72% of the people who participated in the 14-sentence listening test preferred this system over TD-PSOLA. But this system does not have any objective evaluation technique. Spectral smoothing using TFI gives good results, but as the speech unit used is the diphone there are many concatenation points, which generally degrades the performance of a synthesis system. Marathi TTS uses the syllable as the speech unit and hence fewer concatenation points are present; it uses signal processing based spectral smoothing algorithms and results in more natural speech output.

K. Szklanny et al. [27] report progress in the process of improving automatic segmentation. To achieve natural sounding speech synthesis, not only the construction of a rich database is important, but precise alignment must also be conducted. The process of automatic segmentation often produces errors that need to be corrected. A script for finding outliers was constructed and only the necessary manual correction was proposed. A Praat script was then written which allowed the most important errors in the automatic segmentation to be detected and removed, and a zero crossing process was applied to each phoneme. Additionally, a comparison in a prototype speech synthesizer was performed. The results show that the quality of the generated speech has significantly increased. Automatic segmentation turned out to be necessary while working with so many recordings. It was shown that additional automation was possible and, moreover, that it made it possible to eliminate the biggest irregularities and errors in a relatively short time. The segmentation of an acoustic database is a very complex process involving many problems, and achieving a satisfactory result requires a lot of work; however, this labor is smaller than in the case of manual segmentation of a large quantity of recordings, which is simply impractical. Despite its greater accuracy and the smaller number of overlooked errors, manual segmentation of such a large acoustic database is unachievable in a reasonable time. The main advantage of the automatically generated alignment is significant consistency in setting boundaries, which is practically impossible to obtain with manual segmentation. This paper basically addresses automatic segmentation for unit selection; to avoid errors resulting from automatic segmentation, the Praat script is used, and words which are not suitable for automatic segmentation are separated out and segmented manually. This method is intended for unit selection synthesis, which has a large footprint. A unit selection system needs more time for finding the correct match and needs manual segmentation for many words due to the large size of the database, which makes the whole process more time consuming. Marathi TTS uses NN based automatic segmentation, but the database size is moderate due to the hybrid synthesis (words plus syllables) and the accurate syllabification; existing syllables are reused for the formation of new words.

With the growing popularity of corpus-based methods for concatenative speech synthesis, a large amount of interest has been placed on borrowing techniques from the ASR community. Ivan Bulyko et al. [28] explore the application of Buried Markov models (BMMs) to speech synthesis. The paper shows that BMMs are more efficient than HMMs as a synthesis model and focuses on using BMM dependencies for computing splicing costs. It is shown how the computational complexity of the dynamic search can be significantly reduced by constraining the splicing points with a negligible loss in synthesis quality. This approach uses the prediction accuracy of the BMM models as an indicator of how suitable a given boundary is for making a splice. The prediction accuracy is measured at the boundary frames, but one could also assess the prediction over a range of frames within the unit, which may give a more accurate estimate of the effects of coarticulation. Starting with a unit database consisting of half phones, one can obtain a heterogeneous database of units ranging from half phones to phones, diphones or longer units, with the unit boundaries chosen automatically according to their suitability for making a splice. A statistical BMM approach is used here, and for this type of approach parameter generation and reconstruction become tedious. Marathi TTS uses spectral mismatch reduction techniques such as PSD, DTW and wavelet analysis, which give more promising results and naturalness. As unit boundaries are chosen according to splice suitability, this system has to store all types of units, and preparation of such a multi-unit database is time consuming and complicated. The Marathi TTS system stores the most frequently used words and syllables, which needs less time than BMM based synthesis.

4.3 REVIEW OF RECENT PAPERS ON PERFORMANCE IMPROVEMENT OF TTS SYSTEMS:

The following section reviews some recent papers related to speech synthesis. These represent new developments in TTS systems, and all of these new approaches are compared with the present work.

Rohit Kumar et al. [29] describe the use of a genetic algorithm (GA) for the unit selection problem, which is essentially a search/optimization problem. The various operators for the GA are defined and a comparison with the optimization reached by hill climbing approaches is presented. An evolutionary unit selection algorithm is proposed and various details of its implementation are given. Several parameters of the GA are experimented with and the termination criteria are justified. In comparison with local optimization, it is observed that the GA scores lower but is more consistent. Further, there is scope for improvement in GA based unit selection during the creation of the initial population so as to cover a wider search space. GA based unit selection can offer faster results as the size of unit selection databases grows. Although the GA offers many advantages for the overall improvement of unit selection synthesis and database search, some drawbacks still persist in this approach, as in the hill climbing one; the GA score is poor for local optimization. Marathi TTS uses a database of very moderate size, hence the search method is not very complicated. In unit selection almost all instances of every unit are present, which makes any powerful search approach time consuming. In Marathi TTS existing words and syllables are used, hence the database is very limited and produces more natural output.
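To make the GA formulation concrete, the following toy sketch (Python; the population size, mutation rate and cost function are arbitrary illustrative choices, not those of Kumar et al.) evolves a sequence of unit indices, one per target slot, so that a combined target-plus-join cost is minimized.

# Toy genetic algorithm for unit selection: each individual is a list of unit
# indices, one per target slot; fitness is the negated total cost.
import random

def ga_unit_selection(n_slots, n_candidates, cost, generations=50,
                      pop_size=30, mutation=0.1):
    def fitness(seq):
        return -cost(seq)                       # GA maximises fitness
    pop = [[random.randrange(n_candidates) for _ in range(n_slots)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]           # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_slots)
            child = a[:cut] + b[cut:]           # one-point crossover
            if random.random() < mutation:
                child[random.randrange(n_slots)] = random.randrange(n_candidates)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# toy cost: per-unit target costs plus a small join penalty for distant indices
table = [[random.random() for _ in range(5)] for _ in range(8)]
cost = lambda seq: sum(table[i][u] for i, u in enumerate(seq)) + \
                   sum(abs(seq[i] - seq[i + 1]) * 0.05 for i in range(len(seq) - 1))
best = ga_unit_selection(8, 5, cost)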

Tuomo Raitio et al. [30] describe a hidden Markov model (HMM) based speech synthesizer that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In this method speech is first decomposed into the glottal source signal and the model of the vocal tract filter through glottal inverse filtering, and is thus parameterized into excitation and spectral features. The source and filter features are modeled individually in the HMM framework and generated at the synthesis stage according to the input text. A new HMM-based text-to-speech system utilizing glottal inverse filtering is described. The study presents a method to extract and model individual parameters for the voice source and the vocal tract, and a method to reconstruct a realistic voice source from the parameters using real glottal flow pulses. These novel procedures enable the generation of high-quality synthetic speech. Subjective listening tests showed that this method is able to generate highly natural synthetic speech and that the quality of the system is considerably better than other commonly used HMM-based speech synthesizers. Although this system uses a new approach, the evaluation is based only on subjective listening tests. Extraction of parameters for the voice source and vocal tract is relatively easy, but modeling the individual parameters for efficient and successful reconstruction of a realistic voice source using glottal flow pulses is complicated, and implementing this method needs a lot of time. The Marathi TTS implementation time for segmentation and spectral smoothing is smaller; basic database preparation of the most frequently used words takes less time, and due to the reuse of existing words and syllables the size of the database is moderate.

Zhen-Hua Ling et al. [31] present an investigation into ways of integrating articulatory features into hidden Markov model (HMM) based parametric speech synthesis. In broad terms this may be achieved by estimating the joint distribution of acoustic and articulatory features during training. Performance is evaluated using the RMS error of the generated acoustic parameters as well as formal listening tests. Results show that the accuracy of acoustic parameter prediction and the naturalness of the synthesized speech can be improved when shared clustering and asynchronous-state model structures are adopted for the combined acoustic and articulatory features. Most significantly, the experiments demonstrate that modeling the dependency between these two feature streams can make speech synthesis systems more flexible: the characteristics of the synthetic speech can be easily controlled by modifying the generated articulatory features as part of the process of producing the acoustic synthesis parameters. Three factors that influence the model structure are explored in this paper: model clustering, synchronous-state modeling and dependent-feature modeling. The evaluation results show that the accuracy of acoustic parameter prediction and the correlated naturalness of synthesized speech can be improved significantly by modeling acoustic and articulatory features together in a shared-clustering, asynchronous-state system. The parameter generation process becomes more flexible through the introduction of articulatory control, which offers the potential to manipulate both the global characteristics of the synthetic speech and the quality of specific phones, such as vowels. It is conceivable that a system using shared clustering, an asynchronous state sequence and a dependent-feature model structure may combine all of these advantages. This system is tested with results, but actual implementation and evaluation are still pending. Integration of articulatory features for further improvement of speech synthesis is a complicated and time consuming procedure; it takes more time to model the acoustic parameters, and the evaluation is based only on subjective listening tests. Marathi TTS offers accurate syllable segmentation and spectral mismatch reduction methods for performance improvement of the synthesized speech, and it is evaluated with both subjective and objective validation methods.

Zeynep Orhan et al. [32] show the framework of a Turkish text to speech system and its evaluation techniques, namely MOS, DRT and CT. Naturalness and intelligibility of the Turkish system are tested by MOS and CT-DRT. Although the system uses simple techniques, it provides promising results for Turkish TTS, since the selected concatenative method is very well suited to the Turkish language structure. The system can be improved by increasing the quality of the recorded speech files. Sound files of news, films and so on can be explored for extracting the recurrent sound units in Turkish instead of recording the diphones one by one. For improvement of the TTS speech quality, the recorded sound files need to be improved. The system also uses a database of single or double letter diphones, and systems using diphones face serious degradation due to the larger number of concatenation joints. Marathi TTS uses the syllable as the basic unit of synthesis, which results in fewer concatenation joints. This system provides more naturalness and is validated using both subjective and objective tests.

Anupam Basu et al. [33] developed an audio Qwerty editor and KGP-Talk, a talking web browser, for the visually impaired community of India, which helps them develop their educational infrastructure and consequently join mainstream life through employment opportunities. In view of this need, an audio Qwerty editor and the talking web browser KGP-Talk have been developed, supported by a text to speech synthesis system for Indian languages. The paper shows how a text to speech system for Indian languages can be integrated with other systems such as editors and web browsers to make a speech enabled user interface that can be used by visually impaired people for communication. Presently the TTS supports the Indian languages Bengali and Hindi, and it can be extended to other Indian languages. At present the greatest emphasis should be given to further refinement of the TTS engine, as it forms the backbone of the audio user interface. Marathi TTS can be used for all languages which use the Devnagari script. Development of a full audio interface needs refinement of all modules, such as the audio editor, the talking web browser and the TTS engine. The present system is ready with the audio editor and the talking web browser, but the TTS engine needs to be refined further. Integration of all these modules can produce a good audio interface, but after integration the performance of the independent modules should not be affected. The Marathi TTS system has a TTS engine with full accuracy of segmentation and reduction of concatenation joint artifacts, and it can be used for book reading.

Xian-Jun Xia et al. [34] show a new speech synthesis system constructed using an extended speech corpus. A synthetic error detector based on an SVM classifier is built using natural and unnatural synthetic speech. At synthesis time, the input text is synthesized using the baseline system and the extended system simultaneously; the two unit selection results are evaluated by the trained synthetic error detector to determine the optimal one. Experimental results prove the effectiveness of the authors' method in improving the naturalness of synthetic speech on a task of synthesizing place names. In this paper, the authors introduce an approach that utilizes manually evaluated speech data to improve the performance of a unit selection speech synthesis system. Natural synthetic speech is added to the original corpus and a synthetic error detector is built for the purpose of integrating human perceptual knowledge into the system construction. The experimental results prove that this method is effective in improving the quality of synthetic speech. Re-estimating the acoustic models during the extended system construction and making use of the N-best unit selection results of the dynamic programming search still need to be implemented. Preparation of the extended speech corpus is a time consuming and complicated procedure, and the required memory space is very high. In Marathi TTS, database preparation is easier due to automatic segmentation and contextual analysis of syllables, and the database size is limited due to the reuse of existing words and syllables for the formation of new words.

Synthesis of natural sounding speech is the greatest challenge in a text-to-speech (TTS) synthesis system. In natural speech, duration, intensity and pitch are varied dynamically, which manifests as the rhythm or prosody of speech; if these variations are not recreated, the synthesized speech sounds robotic. Synthesis of good quality speech depends on how well the duration and intonation patterns are imposed on the speech segments. The best way to improve naturalness in speech is to mimic the way the human brain imposes rhythm. Humans speak in a particular style by varying the duration of the speech segments in words and phrases according to certain specific duration patterns, and the brain may retrieve the corresponding patterns at the time of speaking to generate a discourse in a particular style (news reading, bible reading, storytelling etc.). Sreelekshmi K. S. et al. [35] demonstrate the existence of duration patterns in natural speech using cluster analysis. Speech uttered in Malayalam, an Indian language, is taken for analysis. Cluster analysis is carried out on isolated words as well as on words and phrases in continuous speech. The results of the cluster analysis can be observed using a silhouette plot, which showed the existence of duration patterns in speech. When cluster analysis was performed on the individual sets of words and phrases, it was observed that each set of similar words/phrases fell into the same clusters, and similar sounding words were grouped automatically. From this one can understand that it is not the textual meaning that matters for forming clusters but rather the style of pronunciation, the order of succession of phonemes/syllables and the length of the words/phrases. The result is evidence for the existence of a finite number of clusters of duration patterns in natural speech. Naturalness can be obtained in synthesized speech by storing duration patterns of words, phrases and sentences of the required style, which can be retrieved at the time of speech synthesis for varying the time duration of the speech units. Finding, analyzing and storing different duration patterns for different styles of speaking is a difficult and very lengthy procedure. Marathi TTS uses position based syllabification and spectral noise reduction using different time-frequency domain methods for naturalness improvement.
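The clustering analysis in [35] can be illustrated with a small sketch (Python, scikit-learn); the duration vectors, the number of clusters and the equal-length representation are synthetic assumptions used only to show how a silhouette score assesses cluster separation.

# Sketch: cluster duration patterns and check separation with the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# each row: syllable durations (in seconds) of one word/phrase, padded to equal length
durations = np.array([[0.12, 0.20, 0.15],
                      [0.11, 0.22, 0.16],
                      [0.30, 0.10, 0.25],
                      [0.28, 0.12, 0.24],
                      [0.18, 0.18, 0.18],
                      [0.19, 0.17, 0.19]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(durations)
print(silhouette_score(durations, labels))   # closer to 1 => well separated clusters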

N. P. Narendra et al. [36] show the design and development of an unrestricted text to speech synthesis (TTS) system for the Bengali language. An unrestricted TTS system is capable of synthesizing good quality speech in different domains. In this work, syllables are used as the basic units for synthesis and the Festival framework has been used for building the TTS system. Speech collected from a female artist is used as the speech corpus. Initially speech from five speakers is collected and a prototype TTS is built for each of the five speakers; the final speaker (among the five) is selected through subjective and objective evaluation of natural and synthesized waveforms. Development of the unrestricted TTS is then carried out by addressing the issues involved at each stage to produce a good quality synthesizer. Evaluation is carried out in four stages by conducting objective and subjective listening tests on the synthesized speech; at the first stage, the TTS system is built with the basic Festival framework. The subjective and objective measures indicate that the proposed features and methods improve the quality of the synthesized speech. In this work, a prototype Bengali TTS using the syllable as the basic unit was developed using the Festival framework. The best speaker, who has uniform characteristics of speaking rate, pitch and energy dynamics, is selected, and the speech corpus collected from this speaker is used to build the unrestricted TTS. The text corpus is collected from various domains. Letter-to-sound (LTS) rules are derived for grapheme to phoneme conversion in Bengali. Clustering of units is done based on syllable specific positional, contextual and phonological features, and an unrestricted TTS in the Bengali language is developed. In addition to the existing LTS rules, a statistical model can be developed to predict the pronunciation of unknown words, and more insightful research can still be done on concatenation cost and target cost calculation based on syllable specific characteristics. The overall system takes many aspects into consideration for the development of the different modules. First, this system needs the Festival framework for building the TTS system. Selection of the best speaker and preparation of the corpus for unrestricted TTS is a time consuming procedure. Development of LTS rules for grapheme to phoneme conversion in Bengali needs a detailed study of the language and a lot of time. Clustering and statistical models are used for incorporating additional features such as syllable classification and prediction of pronunciation. Development of both subjective and objective evaluation methods for this overall complicated system needs more time, and the system needs more work on improvement of the concatenation and target costs. Marathi TTS is simpler and its development time is much smaller; the database is moderate in size, about 3000 units (the most frequent words and syllables), and due to accurate segmentation it needs less time for development.

T. Nagarajan et al. [37] present a sub-band based group delay approach to segment spontaneous speech into syllable-like units. This technique exploits the additive property of the Fourier transform phase and the deconvolution property of the cepstrum to smooth the short-term energy (STE) function of the speech signal and make it suitable for syllable boundary detection. By treating the STE function as the magnitude spectrum of an arbitrary signal, a minimum phase group delay function is derived, and this group delay function is found to be a better representative of the STE function for syllable boundary detection. Although the group delay function derived from the STE function of the speech signal contains segment boundaries, the boundaries are difficult to determine in the context of long silences, semi-vowels and fricatives. In this paper these issues are specifically addressed and algorithms are developed to improve the segmentation performance. The speech signal is first passed through a bank of three filters corresponding to three different spectral bands, and the STE functions of these signals are computed. Using these three STE functions, three minimum phase group delay functions are derived, and by combining the evidence from these group delay functions the syllable boundaries are detected. Further, a multi-resolution based technique is presented to overcome the problem of shifts in segment boundaries during smoothing. Experiments carried out on the Switchboard and OGI-MLTS corpora show that the error in segmentation is at most 25 ms for 67% and 20 ms for 76.6% of the syllable segments respectively. A novel approach for segmenting the speech signal into syllable-like units is thus presented, and several refinements are suggested for improving the segmentation performance. When compared with the performance of the baseline system, there is a considerable reduction in segmentation errors and in the number of insertions and deletions. The advantage of segmentation prior to labeling is that it can be independent of the task: simple isolated syllable models can be built from the segmented data, and once syllable sequences are available, appropriate post-processing can be done to build systems for specific tasks. This method concentrates on improving syllable segmentation to improve the performance of speech synthesis, and its syllable boundary detection accuracy is appreciable. But in most TTS systems, improving overall performance requires both segmentation and reduction of concatenation artifacts. In Marathi TTS both these techniques are implemented successfully and the evaluation is carried out using objective and subjective evaluation tests.
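A simplified, single-band reading of the group-delay idea is sketched below (Python/NumPy): the STE contour is treated as the magnitude spectrum of a minimum-phase signal, its group delay is derived through the cepstrum, and the valleys of the resulting smoothed contour can be taken as candidate boundaries. This is an illustrative reconstruction, not the paper's exact multi-band, multi-resolution algorithm.

# Sketch: minimum-phase group delay of a short-term energy contour.
import numpy as np

def group_delay_of_ste(ste, eps=1e-8):
    logmag = np.log(np.asarray(ste, dtype=float) + eps)
    sym = np.concatenate([logmag, logmag[-2:0:-1]])   # even extension of the "spectrum"
    M = len(sym)
    c = np.fft.ifft(sym).real                         # real cepstrum of the contour
    fold = np.zeros(M)                                # fold to the causal (minimum-phase) part
    fold[0] = 1.0
    fold[1:M // 2] = 2.0
    fold[M // 2] = 1.0
    c_min = c * fold
    gd = np.fft.fft(np.arange(M) * c_min).real        # group delay of the minimum-phase signal
    return gd[:len(ste)]                              # smoothed contour; valleys ~ boundaries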

Marian Macchi et al. [38] discuss different issues of text to speech in their research paper. The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. Originally, synthesis systems were architected around a system of rules and models based on research into human language and speech production and perception processes. The quality of speech produced by such systems is inherently limited by the quality of the rules and the models, and given that our knowledge of human speech processes is still incomplete, the quality of text-to-speech is far from natural-sounding. Hence, today's interest in high quality speech for applications, in combination with advances in computing resources, has caused the focus to shift from rule and model based methods to corpus-based methods that presumably bypass rules and models. For example, many systems now rely on large word pronunciation dictionaries instead of letter-to-phoneme rules, and on large prerecorded sound inventories instead of rules predicting the acoustic correlates of phonemes. Because of the need to analyze large amounts of data, this approach relies on automated techniques such as those used in automatic speech recognition. Today, text-to-speech for unrestricted text is still far from entirely natural. Although there are a few sub-problems, such as name pronunciation, where text-to-speech may perform better than humans, in general text-to-speech systems still only inadequately approach human speech and language competence. Improvements in speech quality have come most recently from the incorporation of rules based on the analysis of large amounts of data or from storing large tables, dictionaries and sound inventories. This work discusses the step by step procedure for development of a TTS, covering issues related to both the front end and the back end of TTS systems. More time needs to be given to front end processing and development, and implementation of naturalness improvement is still a pending issue in this work. The paper uses statistical clustering to separate different types of speech units; phoneme-size speech units are selected and stored in the database. The Marathi TTS system uses the concatenative synthesis approach and implements different efficient syllabification techniques and spectral smoothing methods for naturalness improvement of the speech output, and it has achieved notable naturalness in its speech output.

Mei Kobayashi et al. [39] describe the use of wavelet analysis in the development of a Japanese text-to-speech (TTS) system for personal computers. The quality of synthesized speech is one of the most important features of any TTS system. Synthesis methods which are based on manipulation of the speech signal spectrum (e.g., linear predictive coding synthesis and formant synthesis) produce comprehensible but unnatural sounding output; the lack of naturalness commonly associated with these methods results from the use of oversimplified speech models, small synthesis unit inventories and poor handling of text parsing for prosody control. The authors developed four new technologies to overcome these difficulties and improve the quality of output from TTS systems: accurate pitch mark determination by wavelet analysis, speech waveform generation using a modified time domain pitch synchronous overlap-add method, speech synthesis unit selection using a context dependent clustering method, and efficient prosody control using a three-phrase parser. These four technologies are used to synthesize natural sounding speech in ProTALKER, a Japanese text-to-speech system, and each of them improved the quality of the synthesized speech. The modified time domain pitch synchronous overlap-add method makes use of pitch marks determined by the new wavelet analysis technique: glottal closure instants are used to determine pitch marks and the maximum point of each segment is used to align the center of a segmentation window. The wavelet technique stably and accurately determines pitch marks to minimize rumbling during speech synthesis. The quality of synthesized speech also depends on the dictionary of synthesis units; the context dependent clustering algorithm serves as a useful tool in selecting the set of units to install in ProTALKER. The fourth technology, a three-phrase parser, controls the prosody: it determines the placement of accents, pauses and F0 resetting of the synthesized speech by examining three consecutive phrases. The Marathi TTS system uses the concatenative synthesis method for more naturalness; wavelet domain analysis is used for spectral distortion calculation and reduction, and due to the reduction of spectral noise at the joint of the synthesized word, the resulting speech output sounds more natural. Simple signal processing methods and a neural-classification combination are used to improve the synthesis quality of the speech. The Marathi TTS system is simpler than the Japanese TTS system, as it concentrates more on naturalness improvement than on simultaneous improvement of speech, database and prosody features such as pauses, accents and F0 settings.

B. Sudhakar et al. [40] address the problem of improving the intelligibility of the synthesized speech in a Tamil TTS synthesis system. In speech synthesis, human speech is generated artificially; normal language text is automatically converted into speech using a text-to-speech (TTS) system. This work deals with a corpus-driven Tamil TTS system based on the concatenative synthesis approach. Concatenative speech synthesis involves the concatenation of basic units to synthesize intelligible, natural sounding speech. In this paper, syllables are the basic unit of the speech synthesis database and the syllable pitch is modified by time scale modification. The speech units are annotated with associated prosodic information about each unit, manually or automatically, based on an algorithm. The annotated speech corpus utilizes a clustering technique that provides a way to select the suitable unit for concatenation, depending on the minimum total join cost of the speech unit. The entered text file is analyzed first, syllabication is performed based on linguistic rules and the syllables are stored separately. Then the speech file corresponding to each syllable is concatenated and the silence present in the concatenated speech is removed. After that, discontinuities are minimized at the syllable boundaries without degrading the quality. Smoothing at the concatenated syllable boundary is performed by changing the syllable pitches through time scale modification. A speech synthesis system has thus been designed and implemented for the Tamil language. A database has been created from words and syllables of various domains. Syllable pitch modification is performed based on time scale modification. The speech files present in the corpus are recorded and stored in PCM format in order to retain the naturalness of the synthesized speech. The given text is analyzed and syllabication is performed based on the specified rules. The desired speech is produced by the concatenative speech synthesis approach such that spectral discontinuities are minimized at the unit boundaries. It is inferred that the produced synthesized speech preserves naturalness and good quality based on the subjective quality test results. Although this system works on syllabification and spectral smoothing, its evaluation is based only on subjective listening tests, while Marathi TTS evaluation is based on both subjective listening tests and objective numerical figures and parameters. Grapheme to syllable conversion rules for the Tamil language are used; these rules are language dependent and need to be changed for different languages. Marathi TTS uses CV structure breaking rules to form syllables from words, and these rules can be used for any language which uses the Devnagari script.

4.4 SUMMARY OF LITERATURE REVIEW:

In this chapter more than thirty-five papers related to this research work have been reviewed. This research work concentrates on improving the naturalness of the speech output of the Marathi TTS system. In concatenative TTS, different speech units have been used by different researchers. The diphone was a very popular speech unit a few decades ago, but over the last two decades the syllable has become the commonly used speech unit. However, if only syllables are used as speech units, a huge database is needed for successful implementation of concatenative speech synthesis.

In this research work, a hybrid syllabic speech synthesis is implemented, in which both the most common words and syllables are used as speech units. With this approach, the TTS system first tries to find the input word; if it is not available, it uses syllables to synthesize the input word. This results in a moderate database size and reduces the time required for search. Compared with the various complicated database reduction methods in the literature, this approach is simple and easier to implement. A linguist's guidance and a contextual study of the language have improved the performance of this Marathi TTS system in terms of naturalness of the speech output. In most of the reviewed papers, either some signal processing method for improving the synthesis of the input word is implemented, or the researcher has concentrated on a purely linguistic approach of text processing, normalization or front end development. The present research work has improved naturalness with the help of both the linguistic approach and the implementation of different signal processing methods.

Apart from the contextual study and the implementation of some linguistic rules for database improvement, this work has implemented two core signal processing approaches. The first is accurate and precise segmentation for developing syllables from the most frequent words. Here different techniques, namely neural, non-neural and classification networks, are used for syllable segmentation. Non-neural approaches such as slope detection and simulated annealing provided segmentation accuracy below 80%. Neural approaches such as Maxnet and the Back-propagation-Maxnet combination (supervised-unsupervised) gave segmentation accuracy of about 85%. The best results were obtained with the neural-classification combination of K-means and Maxnet, for which a syllable segmentation accuracy of 90% was obtained; more than 500 words were tested with this combination. In the literature, different segmentation approaches are implemented by different researchers: some have implemented automatic segmentation, some have used manual approaches and some have implemented a combination of automatic and manual segmentation. In this research work automatic segmentation is implemented. Compared with other research work, this method gives more promising results and proper segmentation of syllables from the most common words.

The second approach implemented for naturalness improvement is spectral distortion reduction at the concatenation joint. This approach reduces the spectral distortion, or noise, present at the joint after concatenation of syllables. Spectral mismatch is responsible for the glitch-like sound heard in most concatenative speech synthesizers. This research work concentrates on reducing spectral mismatch at the concatenation joint using different time and frequency domain methods, including PSOLA, PSD, DTW and wavelet analysis. Among these methods, PSD successfully exposes the spectral distortion in graphical form; this distortion can be reduced by plotting formant tracks and reducing the distance between the original and the improperly concatenated word. This approach shows that syllable position plays an important role in the spectral alignment of syllables: while synthesizing a new word from existing words or syllables, if the position of the syllable is considered, the resulting new word has less spectral mismatch. PSOLA gives very low accuracy for spectral distortion reduction and hence is not selected as the spectral smoothing method. DTW gives good graphical results for spectral mismatch, but the audio quality is not very promising. The best results were obtained from wavelet analysis. Multi-resolution wavelet decomposition is used to calculate the approximation and detail coefficients of the original, properly concatenated and improperly concatenated words. A wavelet with Back-propagation combination is then used to reduce the spectral distortion present in the improperly concatenated word: the coefficients of the original word are given to the Back-propagation algorithm as the known good output, and the coefficients of the improper word are provided at the input. The Back-propagation algorithm reduces this mismatch, which is validated with a percentage error calculation; a mean square error threshold of 0.1 is used. This method gives natural output for synthesized words. Results are validated with subjective analysis along with objective numerical figures. In the literature, most synthesis outputs are evaluated only on the basis of subjective analysis, whereas in this research work both subjective and objective evaluation methods are implemented.
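A minimal sketch of this kind of coefficient comparison is given below; it only illustrates the idea (multi-resolution wavelet decomposition of the original and of the improperly concatenated word, followed by a mismatch measure), not the actual Back-propagation network of this work. The file names, the 'db4' wavelet and the decomposition level are assumptions.

# Illustrative sketch only: compares multi-resolution wavelet coefficients of an
# original word recording and an improperly concatenated version of the same
# word, and reports the mismatch as a percentage error.
import numpy as np
import pywt
from scipy.io import wavfile

def wavelet_coeffs(path, wavelet="db4", level=4):
    rate, samples = wavfile.read(path)          # assumes a mono recording
    samples = samples.astype(np.float64)
    # wavedec returns [approximation, detail_level, ..., detail_1]
    coeffs = pywt.wavedec(samples, wavelet, level=level)
    return np.concatenate(coeffs)

original = wavelet_coeffs("original_word.wav")            # hypothetical file
concatenated = wavelet_coeffs("improper_concatenation.wav")

# Truncate to a common length before comparing the coefficient vectors.
n = min(len(original), len(concatenated))
diff = original[:n] - concatenated[:n]

mse = np.mean(diff ** 2)
percent_error = 100.0 * np.linalg.norm(diff) / np.linalg.norm(original[:n])
print(f"MSE = {mse:.4f}, percent error = {percent_error:.2f}%")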

Some linguistic rules based on contextual analysis are also implemented. There are various research papers on front-end or text processing in the literature; this work is based on the study of some important contexts and their implementation as language rules for naturalness improvement. This Marathi TTS is unique in combining signal processing with linguistically guided naturalness improvement. In the literature, most methods test their performance with either subjective or objective evaluation techniques; the evaluation in this research work is carried out with both a subjective perceptual test (MOS) and signal-processing-based parameters, expressing the naturalness improvement in objective numerical figures.


Chapter-5

Contextual analysis and classification of syllables for Marathi language


CHAPTER- 5

CONTEXTUAL ANALYSIS AND CLASSIFICATION OF SYLLABLES FOR MARATHI LANGUAGE

No.     Title of the Contents                                                    Page No.
5.1     Review of the related literature                                         65
5.2     Language study                                                           65
5.3     CV structure                                                             67
5.4     Block diagram of contextual analysis                                     67
5.5     Implementation of contextual analysis system                             76
5.5.1   Input text                                                               76
5.5.2   Text encoding                                                            76
5.5.3   CV structure formation                                                   77
5.5.4   Performance evaluation and result discussion of contextual analysis      77
5.6     Conclusion of contextual analysis                                        88


CHAPTER - V

CONTEXTUAL ANALYSIS AND CLASSIFICATION OF SYLLABLES FOR MARATHI LANGUAGE

5.1 REVIEW OF THE RELATED LITERATURE:

Indian languages are phonetic, that is, they have a unique sound for each letter. These languages are therefore well suited to concatenative synthesis. Concatenation can use phones, diphones, syllables or words; these are called concatenation units. The syllable, as a basic unit, provides fairly good naturalness. Since speech synthesis rules depend on the grammar of the language, an algorithm developed for one language does not suit another language.

Marathi is an Indo-Aryan language (a branch of the Indo-Iranian group of the Indo-European family) spoken by about 96.75 million people, mainly in Maharashtra and its neighboring states; it is also spoken in Israel and Mauritius. Marathi is thought to be a descendant of Maharashtri, one of the Prakrit languages that developed from Sanskrit. The vowel signs attached to a consonant are not written in a single direction: they can be placed on the top or the bottom of the consonant, and also to its left or right.

5.2 LANGUAGE STUDY:

This research work uses the Devnagari script as input. The Devnagari script is used by the Hindi, Marathi, Nepali and Sanskrit languages. Devnagari has 34 consonants (vyanjan) and 12 vowels (svar). A syllable (akshar) is formed by the combination of zero or one consonant and one vowel. The vowel signs attached to a consonant are not written in a single direction, so they can be placed on the top or the bottom of the consonant [41]. The following symbols are used with consonants; these are the vowel suffixes (signs) shown below:


E.g.: $z , x , o , etc. Typically the alphabets get divided into following categories:

1) Consonants: A consonant is a sound in spoken language that has no voicing of its own and must rely on a nearby vowel with which it can be sounded. Each consonant has the peculiarity of carrying an inherent vowel, generally the short vowel 'A'; therefore the consonant d represents not just d but also A. Some Marathi characters have alternate presentation forms whose choice depends on the neighboring consonants. This variability is especially notable for a, which has numerous forms such as ©, — , ¥ , Œ. Examples of a represented using different signs appear in the following words: à&ßV (prapta), H¥ Vr (kruti), XOm© (darja)

2) Vowels: A vowel is a sound in spoken language that has a sounding voice (vocal sound) of its own; it is produced by comparatively open configuration of the vocal tract. There are mainly two representations for vowels.

i) Independent vowel These vowels are placed on the consonants either in the beginning or after the consonant. Each of these vowels is pronounced separately. The independent vowels are used to write words that start with a vowel.

Example: A§Va

Independent vowels: A Am B B© C D E Eo Amo Am¡ A¨ A:

ii) Dependent vowel The dependent vowels serve as the common manner for writing non inherent vowels. They do not stand alone; rather they are depicted in combination with base letter form.

Usage of dependent vowels : V Vm {V Vr Vw Vy Vo V¡ Vmo Vm¡ V¨ V:


Syllable: In this research system syllable is used as a basic concatenation unit. Syllable as a basic unit provides good results for Indian languages due to their phonetic structure. The naturalness is also satisfactory for syllabic synthesis. A syllable is formed by the combination of zero or one consonants and one vowel [42].

5.3 CV STRUCTURE:

A character in Indian language scripts is close to a syllable and can typically take one of the following forms: C, V, CV, VC, CCV and CVC, where C is a consonant and V is a vowel. The CV structure of a word is formed according to basic rules, and further breaking of this CV structure is done through CV breaking rules; the word is split into its constituent syllables using these rules. The CV structure rules and their breaking are given in the appendix. For the Marathi language there are more than 471 CV structure rules, based on an empirical, contextual study of the language. This information is based on the work reported in the PhD thesis cited as reference [i].

5.4 BLOCK DIAGRAM OF CONTEXTUAL ANALYSIS:

The following figure shows the block diagram of contextual analysis and classification of syllables.


[Block diagram stages: text input, text encoding, syllable segmentation, CV structure rules, acquiring syllables from the database, most frequently occurring words, calculation of performance parameters.]

Fig. 5.1: Block diagram of contextual analysis and classification of syllables

Block diagram description:

1) Input text A Devnagari keyboard is provided in the GUI as shown below for the input text. It is designed using VB (front end) and it operates on mouse click. Input text can also be entered through an input text file. The text is displayed in the text box.

Fig. 5.2: Devnagari keyboard


2) Text encoding: The text encoding stage assigns each Devnagari character a predefined code. Different codes are assigned to each character keeping in mind the purpose of the software, and separate ranges of codes are assigned for consonants and vowels. The words in the sentence are isolated on the basis of spaces. Each word is then converted into its encoded string, with the codes separated by a "|". The entire text is thus converted into a string of code values; this stage is implemented in C. The code assigned to each vowel, consonant, half consonant and symbol is unique. Vowels come implicitly with consonants, but if a consonant is the last letter of a word then it is not pronounced fully and is treated as a half consonant. The following table shows examples of Marathi words and their encoded strings.

Table 5.1: Marathi words and their code strings

Word Code string

JaO 74|97|152|

H$ï 72|174|81|

~¨Oa 93|163|79|170|

Separate ranges of codes are assigned for consonants and vowels as shown below:
 Full consonants: 72 to 105
 Half consonants: 145 to 178
 Vowels: 65 to 77 and 117 to 129

Example: the code of H (full consonant) is 72, while the code of H² (half consonant) is 145. Two ranges of vowel codes separate the actual vowels from the suffixes (signs) used for vowels. The range 65 to 77 is given to the vowels A Am B B© C D E Eo Amo Am¡ A¨ A: and the range 117 to 129 is given to the suffixes used for the vowels: m , [ , s , w , y , o, ¡ , ¨
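The following sketch illustrates how such an encoding step could be written. Only the code values (72, 145 for the full and half forms of H, and 74, 97, 152 from Table 5.1) are taken from the thesis; the character keys and their transliterations are assumptions made for readability.

# Minimal sketch of the text-encoding stage with a hand-made, partial mapping
# table.  The real system assigns a unique code to every Devnagari character
# (full consonants 72-105, half consonants 145-178, vowels 65-77, vowel signs
# 117-129); the transliterated keys below are placeholders for illustration.
CODE_TABLE = {
    "ka":  72,    # full consonant (code given in the thesis)
    "ka_": 145,   # corresponding half consonant
    "ga":  74,    # assumed transliteration; code taken from Table 5.1
    "ra":  97,
    "ja_": 152,
}

def encode_word(characters):
    """Convert a sequence of characters into a '|'-separated code string."""
    return "|".join(str(CODE_TABLE[ch]) for ch in characters) + "|"

# A word is first isolated on spaces, then each character is encoded.
print(encode_word(["ga", "ra", "ja_"]))   # reproduces the first code string of Table 5.1: 74|97|152|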


3) Syllable segmentation: The code string is converted into the appropriate CV (consonant-vowel) structure. The input to this block is the code string from the previous block, and its task is to find the CV structure of the word from its codes. CV structure formation is done using some basic rules. The consonant-vowel (CV) structure represents the positions of the vowels in the word; these vowels normally mark the syllables, so the syllables in the word can be easily located. The CV structure representation is based on the grammatical construction of the language, and the positions of the various dependent vowels decide the CV structure. The codes are assigned in such a way that implementation of the algorithm becomes easier; the processing of syllables as well as words takes place with the help of these codes.

 Full consonant: 'CV'
 Half consonant: 'C'
 Vowel: 'V'

The last code of the code string is assigned "C" only if the previous code is a full consonant. The syllable segmentation block is implemented in C. The code string is converted into the appropriate CV (consonant-vowel) structure, and the block then searches for the word in the database. If the word is not found in the database, the algorithm proceeds to the next stage of breaking the word into syllables. Table 5.2 shows the CV structure of input words along with their encoded strings.

Table 5.2: CV structure of input words

Word Code string CV structure

JaO 74|97|152| CVCVC

H$ï 72|174|81| CVCCV

~¨Oa 93|163|79|170| CVCCVC


4) CV breaking rules: The CV structure breaking rules are the heart of syllable based speech synthesis. These rules were developed previously, and the separation of syllables is carried out based on them. They were developed after a detailed study of the consonant-vowel structure and a contextual analysis of the Marathi language, under the assumption that every syllable contains a minimum of one vowel. The syllables that are not present in the database are cut using the CV breaking rules. There are around 471 rules for breaking words into syllables from their CV format; please refer to thesis [i] in the references for more details. Care has been taken that almost all words are covered by these CV breaking rules. It is also observed that more than one word can have the same CV structure.

E.g.: [dbsZ (vilin) and gmjmV (sakshat) have the same CV structure i.e. CVCVC.

These rules cover CV structures of all words in Marathi. All such CV formats and their broken structures are stored in the database in tabular form. The following table 5.3 shows CV structure and their breaking. CV structures and their breaking rules are present in Appendix 1 at the end of this thesis.

Table 5.3: Examples of CV structure and its breaking

Word        CV structure    CV structure after breaking
A&M&`©      VCVCCV          V+CVC+CV
~wX²Yr      CVCCV           CVC+CV
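A minimal sketch of CV breaking as a table lookup is shown below. Only the two rules from Table 5.3 are included, and the assumption that every 'CV' pair in a broken structure corresponds to one full-consonant code (and every lone 'C' or 'V' to one code) is made for illustration; it is not the full rule set of the thesis.

# Sketch of CV breaking as a lookup from a word's CV structure to its broken
# form (the thesis uses about 471 such rules; two examples are shown here).
CV_BREAK_RULES = {
    "VCVCCV": "V+CVC+CV",
    "CVCCV":  "CVC+CV",
}

def break_syllables(codes, cv_structure):
    """Split an encoded word (list of character codes) into syllable code lists
    according to the breaking rule for its CV structure."""
    rule = CV_BREAK_RULES[cv_structure]          # e.g. "CVC+CV"
    syllables, idx = [], 0
    for part in rule.split("+"):
        # each 'CV' pair comes from one full-consonant code; every remaining
        # lone 'C' or 'V' comes from one half-consonant or vowel code
        rest = part.replace("CV", "")
        n_codes = part.count("CV") + rest.count("C") + rest.count("V")
        syllables.append(codes[idx:idx + n_codes])
        idx += n_codes
    return syllables

# H$ï from Table 5.2: codes 72|174|81|, CV structure CVCCV
print(break_syllables([72, 174, 81], "CVCCV"))   # -> [[72, 174], [81]]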

5) Searching syllables in database The syllables to be searched are looked into the database. The entered syllable is converted into its encoded string. This string is searched in the field wherein the encoded string of the database is stored. If the syllable is present in the database then its corresponding words are displayed on the screen. Here is a snapshot of the textual database, with words and their code strings.


Fig. 5.3: Snapshot of textual database

6) Acquiring list of syllables from the database: The words from the database are entered in the text editor of the system. These words are broken into its CV structure and the syllables are acquired for whole database. A list is maintained containing these syllables, which is used for future reference.

7) Most frequently occurring words: From this list of syllables, the syllable to be searched is entered by the user, and the list of words containing this particular syllable is displayed. A separate text file is maintained for these words only if the count of words is greater than a predefined value (here the predefined value is 35). This word list is treated as the most frequently occurring words. Please refer to the PhD thesis cited as reference [i] for details of this work.

8) Database: Two databases are maintained viz. audio database that stores the audio files and textual database that stores the text file corresponding to audio files in the audio database [43]. The textual database is required to search the required word in the audio database.


Textual database: It consists of two entries:

 Marathi word
 Corresponding encoded string

A snapshot of textual database is shown below:

Table 5.4: Textual database snapshot

FIELD1                   FIELD2
_hmOZ (mahajan)          95|103|116|79|163|
_hmZJa (mahanagar)       95|103|116|90|74|170|

Audio database: Audio database contains the audio files of the words present in the textual database. These audio files are further used for the calculation of the performance parameters.

Database search:

 Word is present in the database: The output from the text encoding stage is a string of code values separated by a "|". This string is searched in the textual database. If a match is found, the corresponding word present in field1 of the database is returned.

 Word is not present in the database: If the word is not present in the database, the word is split into syllables (CV break) and a search is carried out for its syllables along with their frequency of occurrence in the database. The CV break module is executed when the entered word is not present in the textual database. In the Marathi TTS system, the syllable is used as the basic concatenation unit.
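The following sketch illustrates this two-level search; the table entries, file names and sample spans are placeholders, and the CV-break step is stubbed out rather than implementing the real rule set.

# Sketch of the two-level database search: look up the encoded word first,
# otherwise fall back to its syllables after CV breaking.
TEXT_DB = {
    "74|97|152|": {"word": "garaj", "wav": "db_words.wav", "span": (0, 8200)},
}
SYLLABLE_DB = {
    "74|97|": ("db_syllables.wav", (0, 3300)),
    "152|":   ("db_syllables.wav", (3300, 6200)),
}

def cv_break(code_string):
    """Placeholder for the CV breaking step; a real implementation would use
    the 471 CV structure breaking rules."""
    return ["74|97|", "152|"]

def lookup(code_string):
    """Return playable units for a word: the word itself if it is present in
    the textual database, otherwise its syllables obtained by CV breaking."""
    if code_string in TEXT_DB:
        entry = TEXT_DB[code_string]
        return [(entry["wav"], entry["span"])]
    return [SYLLABLE_DB[s] for s in cv_break(code_string) if s in SYLLABLE_DB]

print(lookup("74|97|152|"))   # word found: one unit returned
print(lookup("74|97|150|"))   # word missing: its syllables are searched instead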


When the most frequent words and syllables are stored in the audio database, they need to be evaluated for their accuracy. The audio database is limited, but it should contain all the most frequent words and syllables segmented properly from sentences or words. For testing the accuracy of these words or syllables, some time or frequency domain parameters are used. In this research work, energy (a time domain parameter) and PSD (a frequency domain parameter) are used for assessing the accuracy and adequacy of the selected most frequent words and syllables. The following section shows how these parameters are used to decide which most frequent words and syllables need to be stored in the audio database. Based on the output of these evaluation parameters, the decision is made whether or not to store a particular word or syllable in the audio database.

9) Calculation of performance parameters: After a syllable is searched in the database, the words in which that particular syllable occurs are displayed on the screen. All the words that occur most frequently in the database are then analyzed using different performance parameters such as the energy plot and PSD. Through these parameters the naturalness of the syllable is assessed, and the word in which the syllable has a good peak with less distortion is stored. In this way the database is optimized and the naturalness of the speech is increased. Contextual analysis and classification of syllables for the Marathi language is a basic part of the TTS synthesizer. A lot of work has been done on TTS, but the major flaw of most of these systems is that they require a huge database [44]. Therefore the main aim of this work is to concentrate on reducing the database while increasing the naturalness of the speech.

Before most frequently occurring words and syllables are stored in the audio database, they are analyzed with some performance evaluation parameters. Two parameters energy, time domain feature and PSD, a frequency domain parameter are tested here.

Energy graph: Energy is plotted for the audio files of the words containing the syllable being searched. With the help of this energy plot, the amplitude of the different syllables occurring in the words is observed. The peaks are then compared with the peaks of the same syllable occurring in different words. The word in which the particular syllable has a good energy peak with less distortion is considered to contain a good syllable, and that word is stored in the database. In this way the database is optimized and the naturalness of the speech is increased.

Power spectral density: The power spectral density (PSD) describes how the power of a signal or time series is distributed over frequency. Here, power can be taken as the actual physical power, defined as the squared value of the signal; that is, the actual power if the signal were a voltage applied to a 1-ohm load. The instantaneous power (whose mean or expected value is the average power) is then given by:

P = s(t)²  ............................................... (5.1)

Since a signal with nonzero average power is not square integrable, its Fourier transform does not exist in this case. The PSD is then defined as the Fourier transform of the autocorrelation function R(τ) of the signal, provided the signal can be treated as a wide-sense stationary random process.

This results in the formula,

S(f) = ∫_{-∞}^{+∞} R(τ) e^{-2πifτ} dτ  ............................................... (5.2)

The power of the signal in a given frequency band can be calculated by integrating over positive and negative frequencies.

P = ∫_{f1}^{f2} S(f) df + ∫_{-f2}^{-f1} S(f) df  ............................................... (5.3)

PSD is used to observe the syllable distortion and spectral alignment. Syllables having less distortion with good spectral alignment are stored in the database.
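As an illustration of how these two parameters could be computed for a candidate word, the sketch below uses 10 msec frame energy and a Welch PSD estimate; the file name, frame length and PSD settings are assumptions, not the exact procedure of this work.

# Sketch of the two evaluation parameters used to pick the "best" occurrence of
# a syllable: short-time energy (time domain) and a PSD estimate (frequency
# domain).  The candidate word file is a placeholder name.
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

rate, samples = wavfile.read("antar.wav")       # candidate word containing the syllable
samples = samples.astype(np.float64)

# Short-time energy over 10 msec frames, as used for the energy plots.
frame_len = int(0.010 * rate)
n_frames = len(samples) // frame_len
energy = np.array([np.sum(samples[i*frame_len:(i+1)*frame_len] ** 2)
                   for i in range(n_frames)])
energy /= energy.max()                          # normalized energy plot

# Welch estimate of the power spectral density of the whole word.
freqs, psd = welch(samples, fs=rate, nperseg=frame_len)

print("frame with the energy peak:", int(np.argmax(energy)))
print("frequency of the PSD peak (Hz):", freqs[np.argmax(psd)])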


5.5 IMPLEMENTATION OF CONTEXTUAL ANALYSIS SYSTEM:

In this section the different algorithms used, namely text encoding, CV structure formation and CV structure breaking, are described. These algorithms are based on the work 'Text to speech synthesis for Hindi language using hybrid syllabic approach', a PhD thesis; please refer to reference [i] for more details.

5.5.1 Input text: To enable the user to input Marathi text, a keyboard is provided in the GUI using Subak-1 font. It is designed using VB and it operates on mouse click. The GUI snapshot of the system is shown below in the figure 5.4

Fig. 5.4: GUI snapshot of the system

5.5.2 Text encoding: The text encoder is implemented in C. The text encoder module assigns a unique code to each character of the Marathi text entered through the keyboard. The anuswar is replaced by the appropriate half consonant: if the previous character is a full consonant then it is made half and the appropriate code is assigned. The code string, obtained by proper assignment of codes for the various rafars, kanas, matras and anuswars, is written to the output file of the text encoder after entering different words such as MmcH (chalak), AW© (artha), ~mOy (baju) and YSo (dhade).


Output is as shown below: 77|116|98|145 65|170|87| 93|116|79|139| 89|83|122|

5.5.3 CV structure formation: This stage is implemented in C. The input to this module is the final code string of the words in the input text. The code string is of the form 72|116|168|. The character array c[i] extracts each code individually as c[0] = 72, c[1] = 116, c[2] = 168. Each code is then checked against the consonant and vowel ranges: full consonants are assigned CV and vowels (65 to 77) are assigned V, with C assigned to half consonants (145 to 178). Here 72 → CV, 116 → CV, 168 → C.
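A minimal sketch of this mapping, assuming the code ranges listed in section 5.4 (full consonants 72-105, half consonants 145-178, vowels 65-77 and vowel signs 117-129), is shown below; codes that fall outside these ranges, such as the 116 in the example above, would need the special-case handling described in the text.

# Sketch of CV-structure formation from a '|'-separated code string, using the
# code ranges quoted above.  The two test strings are taken from Table 5.2.
def cv_structure(code_string):
    structure = ""
    for code in (int(c) for c in code_string.strip("|").split("|")):
        if 72 <= code <= 105:                         # full consonant carries its vowel
            structure += "CV"
        elif 145 <= code <= 178:                      # half consonant
            structure += "C"
        elif 65 <= code <= 77 or 117 <= code <= 129:  # vowel or vowel sign
            structure += "V"
    return structure

print(cv_structure("74|97|152|"))       # -> CVCVC
print(cv_structure("93|163|79|170|"))   # -> CVCCVC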

5.5.4 Performance evaluation and result discussion of contextual analysis: The contextual analysis system is tested with the 500 most common words. Out of these 500 words, 400 words gave accurate syllable positions. The output of this system is compared with the Sound Forge region list for the same words. The system gave 80% accuracy for contextual analysis and classification of syllables.

Two parameters, energy and PSD, are used as testing methods for the performance evaluation of this system. The 500 most common words are tested with the energy and PSD algorithms, and the results are recorded and compared with the Sound Forge output for the same words. For the energy algorithm, 390 out of 500 words gave correct segmentation values (78%), and for the PSD algorithm, 410 out of 500 words gave the same segmentation points as the Sound Forge region list (82%). This system resulted in a total of 1000 correct syllables, which are stored in the textual database.

The GUI snapshot of the system is shown in the above figure 5.4


Text can be entered through the keyboard generated in the GUI. The font used is SUBAK-1. The following figure 5.5 shows the entered text.

Fig. 5.5: Entering text using keyboard

When the input is provided in the text box, the play button is pressed. The code formation takes places. When the code formation is done a message box appears saying code formation complete. This is shown in the figure 5.6 below:

Fig. 5.6: Formation of code


When "ok" is clicked, a progress bar showing the concatenation process appears on the screen, as seen in figure 5.7 below.

Fig. 5.7: Concatenating the word

When the concatenation process is finalized, text encoding is complete. The encoded string of the words in text box is shown below in the figure 5.8

Fig. 5.8: Snapshot of input text encoding


When the text encoding is done, CV structure formation takes place. In order to see CV structure in the list box, one has to select CV structure from the menu bar. Along with the CV Structure, broken CV structure also appears in the CV Broken list box as shown below.

Fig. 5.9: Displaying CV structure of word

Now the encoded string of this word needs to be broken according to its CV broken structure. As soon as the user clicks View Syllables, the syllables corresponding to its broken part are displayed. This is how the word is broken and syllables are obtained. This is shown in the figure 5.10 below.

Fig. 5.10: CV Broken code list and syllables


The Add Syllable command can be clicked to add the syllables to the database, after which a message box displays "syllable successfully added in the database". If the syllable is already present in the database, the message box displays "syllable already present in the database". This is how the list of syllables is obtained and saved in the textual database.

Fig. 5.11: Adding syllables to database

Now the syllable to be searched is entered by the user by pressing the command 'Enter the syllable'. A message box is displayed as shown below in figure 5.12.

Fig. 5.12: Entering syllable to search


Now the user has to enter the syllable and the same process of its encoded string formation takes place as explained earlier.

Fig. 5.13: Syllable to search

The reverse process of detecting all the words from the database containing this particular syllable takes place. For this purpose, the user needs to click on the command display. After clicking this command all the words containing this syllable are obtained as shown below.

Fig. 5.14: Display of words


If the syllable searched has got more than 35 words then these words are saved in the different file. A word list with its syllables is shown below in the figure 5.15

Fig. 5.15: Total words of syllable _mZ (maan)

In this research work, the search is made using the count 35; this number is used as a threshold to separate the most common words. Files for the syllables _mZ², _Z², AZ², da², H a², Hma², H Z², nZ², Va² (maan, man, aan, var, kar, kaar, kan, pan, tar) are obtained. Among these files, a search is made for the words which occur more frequently. These words are then plotted and analyzed using the performance parameters.

Suppose the words to be analyzed are: _¨Va, A¨Va, dV©_mZ, g_m¨Va, A¨ew_mZ, A¨YH ma (mantar, antar, vartaman, samantar, anshumaan, andhakar)

 It can be clearly seen from the energy plot of two words _¨Va (mantar) and A¨Va (antar) that the word A¨Va (antar) (fig 5.16a) contains better peak with less distortion for the syllable Va² (tar) than the word _¨Va (mantar) (fig 5.16b). This means that when the syllable Va² (tar) occurs it is better to store it from the word A¨Va (antar). In all these figures, x axis are time scale and y axis are amplitude/energy of samples. The x axis is normalized to keep the time scale limited for respective word.


 Same is the case for the syllable AZ² (aan) in the words A¨ew_mZ (anshuman) and A¨YH ma (andhakar). The syllable AZ² (aan) has got better peaks with less distortion for the word A¨ew_mZ (anshuman) (fig 5.17a) than A¨YH ma (andhakar) (fig 5.17b).

 Similarly it can be observed for the syllable _mZ² (maan) occurring in the words dV©_mZ (vartaman) and g_m¨Va (samantar) the peak of the syllable _mZ² (maan) is better with less distortion in the word g_m¨Va (samantar) (fig 5.18a) compared to dV©_mZ (vartaman) (fig 5.18b).

Thus whenever the syllables AZ², _mZ², Va² (aan, maan, tar) occur, they should be stored from the words A¨ew_mZ, g_m¨Va, A¨Va (anshumaan, samantar, antar) respectively, as better peaks with less distortion are observed in their energy plots.


Fig. 5.16 a, b: Energy plot of _¨Va, A¨Va (mantar, antar)

Fig. 5.17 a, b: Energy plot of A¨ew_mZ, A¨YH ma (anshumaan, andhakar)


Fig. 5.18 a, b: Energy plot of g_m¨Va, dV©_mZ (samantar, vartamaan)

The power spectral density (PSD), a frequency domain feature, is selected as the other performance evaluation parameter. The PSD has been plotted for the words _¨Va (mantar) and A¨Va (antar). The length of the wave file of the word _¨Va (mantar) is 745 msec. This wave file is broken into small segments of 10 msec using the Sound Forge software; the frequency content of the human voice changes roughly every 10 msec, so the wave files are broken into 10 msec intervals in order to obtain nearly uniform frequency content within each segment for the PSD plot. The PSD plot of the word _¨Va (mantar), obtained by plotting a total of 75 segments of its wave file, is shown below in figure 5.19.


Fig. 5.19: PSD plot of _¨Va (mantar)

Similarly PSD plot has been acquired for the word A¨Va (antar). The length of the wave file for the word A¨Va (antar) is 626 msec. Dividing this length by 10 msec as explained in the above section a total of 63 samples are acquired. Figure 5.20 below shows PSD plot of A¨Va (antar).

Fig. 5.20: PSD plot of A¨Va (antar)

By analyzing these two PSD plots it can be concluded that the peak for the syllable Va² (tar) is more accurate and natural, with less distortion, in the word A¨Va (antar) than in the word _¨Va (mantar). These results are the same as the results obtained from the energy plot for the same input words. Similar PSD results are obtained for other syllable and word combinations. In total, 500 words are tested with these two parameters for the performance evaluation of this system.

5.6 CONCLUSION OF CONTEXTUAL ANALYSIS:

Contextual analysis and classification of syllables plays an important role in the optimization of the database; it improves the database quality and the resulting speech output. The database size is reduced due to accurate selection of the most frequent words and storage of common syllables. The present contextual analysis system gave 400 correct syllable locations out of the 500 words tested, i.e. 80% accuracy for the identification of syllables, and a total of 1000 syllables were obtained correctly from this system. Hence, with this system, new words can be formed from the most frequently used words and the syllables present in the audio and textual databases respectively.


Chapter-6

Position based syllabification using neural and non-neural techniques


CHAPTER- 6

POSITION BASED SYLLABIFICATION USING NEURAL AND NON-NEURAL TECHNIQUES

No.       Title of the contents                                                              Page no.
6.1       Factors of voice quality variation for database creation                           94
6.2       Neural network for segmentation                                                    95
6.3       Neural network and its types                                                       96
6.3.1     Basic block diagram of neural network for segmentation                             98
6.4       Why automatic segmentation?                                                        103
6.4.1     Syllable as unit                                                                   105
6.4.2     Why not dictionary?                                                                108
6.4.3     Energy, extracted feature for NN                                                   109
6.4.4     Why energy is used as parameter for soft-cutting?                                  110
6.5       Basic algorithm of neural network segmentation system                              111
6.5.1     Purpose of algorithm                                                               111
6.5.2     Methodology used                                                                   111
6.5.3     Result/outcome                                                                     112
6.5.4     Basic algorithm of segmentation system                                             112
6.5.5     Flowchart                                                                          112
6.6       Energy calculation                                                                 113
6.6.1     Actual formula used                                                                114
6.6.2     Algorithm for energy calculation                                                   115
6.6.2.1   Purpose of energy algorithm                                                        115
6.6.2.2   Methodology used                                                                   115
6.6.2.3   Results/outcome of the algorithm                                                   115
6.6.3     Flowchart                                                                          115
6.7       Post-processing of energy plot                                                     117
6.7.1     Normalization                                                                      117
6.7.1.1   Purpose of normalization                                                           118
6.7.1.2   Methodology used                                                                   118
6.7.1.3   Results/outcome of the algorithm                                                   118
6.7.2     Normalization algorithm                                                            118
6.7.3     Smoothing energy plot                                                              119
6.8       Block diagram of basic TTS system and segmentation approaches                      120
6.8.1     Syllabification flowchart                                                          122
6.9       Neural network for automatic segmentation                                          123
6.9.1     MAXNET                                                                             124
6.9.2     Maxnet architecture                                                                125
6.10      Implementation of Maxnet                                                           127
6.10.1    Maxnet algorithm                                                                   127
6.10.1.1  Purpose of algorithm                                                               127
6.10.1.2  Methodology/mathematical function                                                  128
6.10.1.3  Step by step Maxnet implementation                                                 128
6.10.1.4  Results/outcome of the algorithm                                                   129
6.10.2    Maxnet flowchart                                                                   129
6.10.3    Testing of Maxnet algorithm and authentication of results                          130
6.10.4    Conclusion of Maxnet                                                               139
6.11      Back-propagation                                                                   139
6.11.1    Architecture of back-propagation                                                   141
6.12      Implementation of back-propagation                                                 143
6.12.1    Input nodes                                                                        143
6.12.2    Hidden nodes                                                                       143
6.12.3    Output nodes                                                                       144
6.12.4    Initialization of weights                                                          144
6.12.5    Choice of learning rate η                                                          144
6.13      Back-propagation algorithm implementation                                          144
6.13.1    Purpose of algorithm                                                               146
6.13.2    Methodology used                                                                   146
6.13.3    Results/outcome of the algorithm                                                   146
6.13.4    Basic back-propagation algorithm                                                   146
6.13.5    Back-propagation flowchart                                                         148
6.14      Supervised and un-supervised NN for segmentation                                   149
6.14.1    Supervised neural network learning                                                 150
6.14.2    The training data set                                                              150
6.14.3    The basic training procedure                                                       150
6.14.4    Unsupervised learning in neural networks                                           151
6.14.5    The delta rule                                                                     152
6.14.6    Results of cascaded combination of supervised and unsupervised NN                  154
6.15      Neural-classification approach (K-means with Maxnet) for syllabification           158
6.15.1    Analysis of K-means algorithm                                                      158
6.15.2    Step by step implementation of K-means for segmentation                            159
6.15.2.1  Purpose of algorithm                                                               160
6.15.2.2  Methodology used                                                                   160
6.15.2.3  Results/outcome of algorithm                                                       162
6.15.2.4  Authentication and testing of results                                              162
6.15.3    Results of K-means algorithm                                                       162
6.16      Non-neural methods for syllable segmentation                                       167
6.16.1    Simulated annealing                                                                167
6.16.1.1  Purpose of algorithm                                                               167
6.16.1.2  Methodology used                                                                   167
6.16.1.3  Results/outcome of the algorithm                                                   167
6.16.1.4  Testing and authentication of results                                              168
6.16.2    Results of simulated annealing                                                     168
6.16.3    Slope detection algorithm                                                          172
6.16.3.1  Purpose of algorithm                                                               172
6.16.3.2  Methodology used                                                                   173
6.16.3.3  Results/outcome of the algorithm                                                   173
6.16.4    Results of slope detection                                                         174
6.17      Summary of segmentation results for simulated annealing, slope detection and K-means Maxnet combination   178
6.17.1    Results for 2-syllable words                                                       179
6.17.2    Results for 3-syllable words                                                       180
6.17.3    Results for 4-syllable words                                                       181
6.17.4    Statistical comparison of different segmentation techniques                        182
6.18      GUI                                                                                182
6.18.1    GUI results for K-means                                                            183
6.18.2    GUI results for slope detection                                                    183
6.18.3    GUI results for simulated annealing                                                184
6.19      Conclusion of syllabification                                                      185


CHAPTER- 6

POSITION BASED SYLLABIFICATION USING NEURAL AND NON-NEURAL TECHNIQUES

Syllabification is the formation of syllables from words. Different words can be made up of two, three, four or more syllables. The audio database is prepared for the most frequent words; these words are cut into syllables, and the syllables are stored in the textual database. When a new word needs to be synthesized (concatenated), these syllables are used to form it. While forming a new word from the syllables existing in the database, if the position of each syllable is considered, the resulting word sounds more natural. Different syllable positions, namely initial, middle or final (IMF), are considered for the generation of a new word. It is observed that, due to the consideration of syllable position, the resulting speech has less spectral distortion at the concatenation point; position based syllabification therefore results in more natural speech output. This chapter explains how different syllable segmentation techniques are implemented to obtain more naturalness in the resulting speech output.

In this research work, neural and classification network techniques are used to improve context based segmentation. Neural network based algorithms can be divided into two types: 1) supervised and 2) unsupervised. Both types of algorithms are used for segmentation. The resulting segment accuracy is compared, and the type of algorithm (neural, classification or non-neural) is finalized for accurate syllabification. If proper segmentation of the given input word is carried out, the resulting speech units are more natural. These speech units (here syllables) can be stored in the database and used for synthesizing new words.


6.1 FACTORS OF VOICE QUALITY VARIATION FOR DATABASE CREATION:

For accurate segmentation of syllables from words, proper recording of original words in the database is very important. If recording itself contains lot of noise, then the resulting speech segments (here syllables) formed from recorded words will be distorted. If these speech segments are used for synthesizing new word then the resulting output will have more spectral distortion and noise. Therefore preparation of database is a vital step in speech synthesis. There are various parameters which affect the speech voice quality as explained in the following section.

Factors that can cause variation in the voice quality include:
1) Settings of the audio equipment
2) The layout of the recording studio
3) Microphone settings
4) Speech speed
5) Physical conditions of the speaker that may affect the vocal organs
6) Mental conditions of the speaker that may affect the vocal effort

Identical audio equipment was used for all recording sessions, and the volume of the microphone amplifier was kept at the same setting, so it is unlikely that the first factor in the above list contributes to the voice quality variation. The layout of the recording studio was also kept unchanged over the whole recording period, so it is equally unlikely that the second factor contributes to the variation.

Since the distance between the microphone and speaker‘s mouth was not always strictly controlled, the frequency response of the microphone may have varied over the recording period. But it is seen that the amplitude variation for the distance range of +/-10cm is about 3 dB, which may be hardly noticeable for listeners. Therefore factor three in the above list can be ignored.

The last three items of the list can be regarded as the main cause of the voice quality variation. However, it is difficult to quantitatively determine these factors.


6.2 NEURAL NETWORK FOR SEGMENTATION:

 The neural network of an animal is part of its nervous system, containing a large number of interconnected neurons (nerve cells). Artificial neural networks (ANN) are computing systems whose central theme is borrowed from analogy of biological neural networks. ANN is a relatively new computational tool that has found extensive utilization in solving complex real world problems.

 Neural networks have been applied in speech synthesis for about twenty years, and the latest results are quite promising. Automatic segmentation is greatly needed because manual segmentation is extremely time consuming [45] and is further limited by human restrictions. The field of text to speech synthesis has been developing rapidly, which is clear from the fact that IEEE has given due recognition to research conducted in this area.

 There is a great demand for text to speech synthesis for Indian languages. The objective of this work is to develop small segments (syllables) from different words and use them in the formation of new words. This enables TTS to speak the words which do not exist in the database. This helps in reducing the database and thereby maintaining naturalness. It uses concatenative synthesis, which is based on the concatenation of segments of recorded speech. Audio database is maintained to store audio files. The textual database is required to search the index of the required word in the audio database. Syllables are cut automatically from words using neural network. This cutting is done on the basis of energy calculation.

 A neural network (NN) is a computer software (and possibly hardware) that simulates a simple model of neural cells in humans. The purpose of this simulation is to acquire the intelligent features of these cells. The importance of neural network in syllabification is:


1) Its ability to generalize and capture the functional relationship between input and output pattern pairs. 2) Its ability to predict after an appropriate learning phase, even patterns not presented before. 3) Its ability to tolerate certain amount of fault in input.

 In recent years artificial neural networks have played an important role in various aspects of speech processing. One of the most important applications of neural networks is solving speech synthesis problems. Text to speech (TTS) synthesis includes the whole process permitting a system to convert a written text (grapheme) into a vocal message (phoneme).

 This conversion is necessary for applications of speech synthesis such as learning machines, reading aids for blind persons and mail reading. The two traditional methods of grapheme-to-phoneme conversion, synthesis by rule and synthesis by concatenation, have some disadvantages: they need a huge database and the quality of the synthesized speech is not always satisfactory [46]. Therefore several new techniques that improve text-to-speech synthesis by implementing speech segmentation using neural networks have been developed.

6.3 NEURAL NETWORK AND ITS TYPES:

 An artificial neural network is an information-processing system that has certain performance characteristics in common with biological neural networks.  Information processing occurs at many simple elements called neurons. Signals are passed between neurons over connection links.  Each connection link has an associated weight which in a typical neural net multiplies the signal transmitted. The weights represent information being used by the net to solve a problem.


 Each neuron applies an activation function (usually nonlinear) to its net input (sum of weighted input signals) to determine its output signal.

The following figures 6.1 and 6.2 show types of neural networks.

Fig. 6.1: Simple neural network

 The net input, y_in, to neuron Y is the sum of the weighted signals from neurons X1, X2 and X3:

y_in = w₁X₁ + w₂X₂ + w₃X₃ ……………………………. (6.1)

 Equation (6.1) gives the net input y_in for the three input signals from input neurons X1, X2 and X3. Each input signal gets multiplied by the respective weight on its link, here w1, w2 and w3. Here the activation depends only on the inputs and their respective weights.

[Figure 6.2 shows input units X1, X2, X3 connected through a hidden unit Y to output units Z1 and Z2.]

Fig. 6.2: Neural network with hidden layer


 The activation y of neuron Y is given by some function of its net input, y = f(y_in), where

f(x) = 1/(1 + exp(-x)) ………………………………… (6.2)

If the output function were the identity (output = activation), the neuron would be linear, but linear neurons have severe limitations. The most common output function is the sigmoidal function given in equation (6.2). The sigmoidal function is very close to one for large positive numbers, 0.5 at zero, and very close to zero for large negative numbers; this allows a smooth transition between the low and high output of the neuron. From this equation it is clear that the output depends only on the activation, which in turn depends on the values of the inputs and their respective weights. In figure 6.2, Z1 and Z2 are the output nodes where the neural network produces its final output. Networks with a hidden layer are used in data classification.
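The following small numerical example illustrates equations (6.1) and (6.2); the input values and weights are arbitrary.

# Tiny numerical illustration of equations (6.1) and (6.2): the net input of
# neuron Y is the weighted sum of its inputs, and its activation is the
# sigmoidal function of that net input.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.8, 0.2, 0.5])   # inputs X1, X2, X3
w = np.array([0.4, 0.3, 0.6])   # weights w1, w2, w3

y_in = np.dot(w, x)             # equation (6.1)
y = sigmoid(y_in)               # equation (6.2)
print(f"y_in = {y_in:.3f}, y = {y:.3f}")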

6.3.1 Basic block diagram of neural network for segmentation The following figure 6.3 shows basic block diagram of TTS for neural network (NN) based segmentation. All the blocks of TTS system are shown in the diagram. Two important blocks of working area are shown with different color. After carrying out isolation and encoding of input Devnagari text, it is searched in the existing database. If the respective audio files are present they are provided to concatenation unit for playing that input text file. If the input text or words are not present in the database then as shown in the diagram they are segmented using neural network. Neural network needs some extracted feature of speech signal having speech contents. There are certain time or frequency domain features commonly used for speech signal processing in NN. Here in this research work, time domain parameter energy is used. After segmentation of some existing words, segmented syllables are stored in textual database. These syllables are used for generation of new input word which is not present in the database. Their audio files are concatenated during synthesis. Concatenation block plays all the files and generates the speech output.


[Figure 6.3 blocks: Devnagari text, isolation and encoding, search in database, feature extraction (energy), neural network for segmentation, concatenation and play, speech output, with the database connected to the search block.]

Fig. 6.3: Basic block diagram of TTS with neural network for segmentation

Highlighted blocks in the figure 6.3 are main blocks of segmentation.

Block diagram description: 1) Isolation: In an isolation stage all spaces, tabs, commas and full stops etc. are removed and words are isolated from each other.

2) Encoding: The encoding stage encodes inputted text into code string. Predefined code is assigned to each Devnagari character. There is a unique code for each vowel, consonant, half consonant, symbol and number. This block also breaks the word into syllables using CV structure breaking rules.

3) Searching: Encoded word is searched in the database. Two databases need to be maintained viz. audio database and textual database. - Audio database: Audio database stores recorded sound files. Sound files are stored in WAV format. Limited numbers of most common words are recorded and new words are formed with existing words and syllables. This reduces the database size.


- Textual database: Textual database stores code string, WAV filename, from and to positions of the word within that audio file. ‗From‘ and ‗to‘ positions are sample number ranges in audio database for the given input. The textual database is required to search the index of the required word in the audio database. If the word is not found in the database then it is searched in syllable database. Also suffixes and prefixes are searched for synthesizing the missing word.

4) Feature extraction:  This is a very important block of this research work. The process of reducing dimensionality of the input is called feature extraction. Input speech file contains large number of samples. If we provide this file as input to neural network, the number of inputs will increase by a large number. Therefore feature extraction is a very vital part of this work. Different features can be considered in time and frequency domain.

 Time domain features like energy and zero crossings, and frequency domain features like the DFT (Discrete Fourier Transform) and the spectrogram, can be used as the extracted feature provided to the NN or to other syllabification approaches. The feature used in this research work is energy, a time domain parameter; it is a simple and reliable feature for use with a neural network. Figure 6.4 below shows a detailed block diagram of the 'feature extraction' block used in figure 6.3. Please refer to the block diagram shown above in section 6.3.1 for the basic block diagram of the TTS and the neural network used for segmentation.

[Figure 6.4 blocks: sound file, 10 msec frames, energy calculation, energy plot, normalizing energy, smoothing energy plot, feature extracted: energy.]

Fig. 6.4: Block diagram for feature extraction


The above block diagram shows how the sound file is converted into the energy feature used as input to the neural network block. First, the sound file is divided into 10 msec frames containing 110 samples each. Equation (6.5) in the following section gives the detailed calculation of the energy of each frame of 110 samples. The energy plot is obtained by plotting the frame index versus the energy of each frame. After normalization and smoothing, this energy is used as the extracted feature, which is given as input to the neural network as explained in detail in section 6.6 of this chapter.
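A sketch of this feature-extraction pipeline is given below; the file name, the 11 kHz sampling rate implied by 110 samples per 10 msec frame, and the smoothing window length are assumptions.

# Sketch of the feature-extraction pipeline of figure 6.4: frame the word into
# 10 msec frames, compute frame energy, then normalise and smooth it.
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("anand.wav")           # hypothetical recorded word
samples = samples.astype(np.float64)

frame_len = int(0.010 * rate)                       # 110 samples at 11 kHz
n_frames = len(samples) // frame_len
energy = np.array([np.sum(samples[i*frame_len:(i+1)*frame_len] ** 2)
                   for i in range(n_frames)])

energy /= energy.max()                              # normalisation to [0, 1]
smooth = np.convolve(energy, np.ones(5) / 5, mode="same")   # moving-average smoothing

# 'smooth' is the extracted feature handed to the neural network, whose task is
# to locate the energy minima used as candidate syllable boundaries.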

5) Neural network for segmentation: Manual segmentation is a very tedious and time consuming job. There is great need for automatic speech segmentation algorithm. Neural network algorithms viz. Maxnet and Back-propagation are used here for automatic segmentation [47]. These algorithms are explained in detail below. Figure 6.5 shows block diagram of neural network for segmentation.

[Figure 6.5 blocks: the smooth energy plot and training samples are given to the neural network, which outputs the segment positions.]

Fig. 6.5: Block diagram of 'NN for segmentation'

The above block diagram shows that neural network takes energy and training samples as input. Based on these two inputs it gives segment locations or minima positions for given input word.

6) Concatenation:  The aim of this research work is to reduce the audio database. This is achieved by using concatenation algorithm for formation of different words from existing most frequent words and syllables [48]. Concatenation system concatenates all the WAV files for the different words in the text and makes a new WAV file. The Marathi word which is not present in the audio database is broken into syllables.


 These different syllables are then put together using concatenation algorithm. ‗From‘ and ‗to‘ positions are obtained from the text database which are stored in target file. Once the complete sentence is decoded the speech signal is obtained as output. Some examples of formation of new words with syllables of existing words are shown below.

Example 1:
Ap^_mZ (Abhiman) = A (a) + p^ (bhi) + _mZ (maan)
JwéOZm§Mm (gurujanancha) = Jw (gu) + é (ru) + OZm§ (janan) + Mm (cha)
Jwé_mZ (guruman) = Jw (gu) + é (ru) + _mZ (maan)

Example 2:
XodYa (Devdhar) = Xod (Dev) + Ya (Dhar)
M¨XsJS (Chandigarh) = M¨Xs (Chandi) + JS (Gad)
XodJS (Devgad) = Xod (Dev) + JS (Gad)

Example 3:
H m_Jma (kamgar) = H m_ (kam) + Jma (gar)
dmaH as (varkari) = dma (var) + H as (kari)
H m_H as (kamkari) = H m_ (kam) + H as (kari)

Example 4:
Jm¡ad (Gaurav) = Jm¡ (Gau) + ad (rav)
gm¡a^ (Saurabh) = gm¡ (Sau) + a^ (rabh)
gm¡ad (Saurav) = gm¡ (Sau) + ad (rav)

Different syllables from different words are concatenated to form a new word. Figures 6.6 and 6.7 show how a new word can be formed from pre-recorded words and their syllables.


Fig. 6.6: Formation of new word from pre-recorded words

Fig. 6.7: Formation of new words from old words with proper syllable positions

This shows that if the word is not present in the database it can be created with the help of syllable segmentation.
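The sketch below illustrates the concatenation step itself: syllable segments are cut from recorded words using 'from' and 'to' sample positions and joined into a new word. The file names and sample ranges are illustrative only and do not correspond to the actual database.

# Sketch of syllable concatenation: cut segments out of recorded words using
# the 'from'/'to' sample positions kept in the textual database, then join
# them into a new WAV file.  All recordings are assumed to share one rate.
import numpy as np
from scipy.io import wavfile

def cut(path, start, end):
    rate, samples = wavfile.read(path)
    return rate, samples[start:end]

# e.g. guruman = gu + ru (taken from one recorded word) + maan (from another)
rate, gu_ru = cut("gurujanancha.wav", 0, 6600)       # hypothetical sample span
_,    maan  = cut("abhiman.wav", 9900, 16500)        # hypothetical sample span

new_word = np.concatenate([gu_ru, maan])
wavfile.write("guruman.wav", rate, new_word.astype(np.int16))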

6.4 WHY AUTOMATIC SEGMENTATION?

 The main aim of this work is to segment different words into syllables automatically with the help of user-friendly software based on a classification-neural network. In this research work the classification-neural network combination (K-means and Maxnet) is implemented for speech segmentation. In addition, Maxnet, a cascaded combination of supervised and unsupervised networks (Back-propagation-Maxnet), and other non-neural methods like simulated annealing and slope detection are used, for the first time, for speech unit segmentation (syllable formation). The resulting database size is moderate and the synthetic speech output is more natural due to the high segmentation accuracy. With conventional manual segmentation techniques, separating syllables from a word was a time consuming and tedious job.

 The user had to know the code of the syllable and had to find out the ‗from‘ and ‗to‘ positions of each syllable from the sound file. The user had to listen carefully to the word every time for finding those ‗from‘ and ‗to‘ positions. In textual database everything need to be entered manually. This was a tiresome and time consuming job. Therefore there is great interest in developing automatic speech segmentation algorithms.

Figure 6.8 shows sound file contents for the word AmZ§X (Anand)

Fig. 6.8: Sound file of AmZ§X (Anand)

From figure 6.8 it is very clear that segmenting a syllable at the proper position is not easy. If the user segments syllables manually, he/she has to do so by trial and error, listening to the file again and again to find the correct breaking position. Therefore an automatic segmentation algorithm is strongly needed.

6.4.1 Syllable as unit: Various studies have proved that syllable is better than other units of speech [49]. The following points clear why syllable is better as compared to other units of segmentation. (e.g. diphone, phone, demi-syllable, trisyllable etc)

1) Better durational stability: Syllable has better durational stability as compared to others. Audio file of Marathi word NmoQ ses (chotishi) is shown in the following figure:

Fig. 6.9: Audio file of Marathi word NmoQ ses (chotishi)

In the above figure, approximately the initial 3000 samples belong to the first syllable Nmo (cho); this duration is around 0.275 seconds. If a phone or diphone were selected as the speech unit for synthesis instead of the syllable, it would be shorter in duration, and being very small in duration, the stability of the phone or diphone is not as good.

2) Better representational stability: Syllable also has better representational stability than other units. The energy plot of syllable can be represented in a better way as compared to other features. Figure 6.10 explains this representation.


Fig. 6.10: Energy plot of NmoQ ses (chotishi)

In figure 6.10; initial 27 frames form the first syllable Nmo (cho). Therefore, in the energy plot, it can be represented quite effortlessly. A single consonant or vowel is very difficult to present in this form.

3) Effective minimal unit in time domain: As explained earlier, syllable is large enough (compared to smaller units) to process and represent. Also it is small enough to segment and concatenate to produce natural synthetic speech. Syllable is small in size hence database remains limited and resulting memory requirement is less.

4) Reduction in database size: New words can be formed from syllables of pre-recorded words. Therefore it is not always necessary that the particular word should be present in audio database. Though the word is not present in the database it can be produced at the output. Here, the word nWH (pathak) is formed from two syllables n (pa) and WH (thak) as shown in the figures below. n (pa) is taken from word nam^d (parabhav) and syllable WH (thak) is taken from word AWH (athak). See the following sound files in the figures 6.11, 6.12 and 6.13 for this example.


Fig. 6.11: Sound file of nam^d (parabhav)

Fig. 6.12: Sound file of AWH (athak)

Fig. 6.13: Sound file of nWH (pathak)


From the above figures, it can be seen that a concatenated word is very easily formed without many fluctuations. If any other unit is considered instead of syllable, the new word would not have been soft. Above all reasons are adequate to choose syllable as best unit for automatic segmentation process.

6.4.2 Why not dictionary? Two basic methods of speech synthesis are: 1) Rule based synthesis 2) Dictionary based synthesis

Rule based speech synthesis uses the rules of a particular language to generate synthetic speech, while dictionary based speech synthesis stores the most commonly used words in an audio database. Both methods have downsides: rule based synthesis suffers from reduced naturalness, and dictionary based synthesis requires a large database. Hence a hybrid word and syllable based synthesis is used here.

The syllable database is in textual form. Using these syllables, a large number of new words can be synthesized through concatenation. These syllables or new words are not stored in audio form but are synthesized as required at runtime. An audio database needs more memory than a textual database, so only the most common words and suffixes are stored in the audio database, and a separate textual database of syllables is formed.

Syllable based synthesis reduces memory usage and increases naturalness. Additional improvement in speech quality is obtained using IMF (I = initial position, M = middle position, F = final position, i.e. position based syllabification) and context based combinations of syllables. In this way the database becomes more efficient and can generate a larger number of words through concatenation of the newly formed syllables. As the position of the syllable is considered, the spectral alignment of the synthesized word is more accurate, which results in improved naturalness of the output speech.


6.4.3 Energy, extracted feature for NN: The properties of the speech signal change with time. For example, the excitation changes between voiced and unvoiced speech, and there is significant variation in the peak amplitude of the signal. There is also considerable variation of the fundamental frequency within voiced regions. The amplitude of the speech signal varies appreciably with time; in particular, the amplitude of an unvoiced segment is generally much lower than the amplitude of a voiced segment. The amplitude of speech segments is used to calculate the energy, as shown in section 6.6 below.

1) Importance of feature extraction: The process of reducing the dimensionality of the input is called feature extraction. The basic input in this work is a sound file. The sound file content of the simple Marathi word Am^mi (abhal) is shown in figure 6.14 below:

Fig. 6.14: Sound contents of word Am^mi (abhal)

From figure 6.14, it is observed that the sound file has more than 5500 samples. Applying this many samples to a neural network is a time-consuming job. Therefore, feature extraction is strongly needed.

2) Energy as feature: Along with the reduction in dimensionality, another important goal in processing the speech signal is to obtain a more suitable or more useful representation of the information


carried by the speech signal. Such a representation can be used to make a two-way classification of a section of the signal as voiced or unvoiced speech. In such cases, a representation which discards irrelevant information and places the desired features clearly in evidence is preferred over a more detailed representation that retains all the information.

Time domain processing methods deal directly with the waveform of the speech signal. Examples of representing the speech signal in terms of time domain measurements are the average zero crossing rate, the energy and the auto-correlation function. These representations make digital processing simple, and they provide a useful basis for estimating important features of the speech signal.

Generally, the energy of voiced data is much higher than the energy of silence, while the energy of unvoiced data is usually lower than that of voiced sounds but higher than that of silence. In this research, energy, a simple time domain parameter, is used as the feature extracted from the speech signal for neural network segmentation.

6.4.4 Why energy is used as parameter for soft-cutting? For proper breaking of a word into syllables there should be one vowel in each syllable. Therefore, to separate syllables from a word, one has to separate the vowels. Energy is one parameter which can clearly differentiate between a consonant and a vowel: the energy of a vowel is greater than that of a consonant.

The WAV file used is recorded with a sampling rate of 11025 Hz. The energy can be calculated for frames of 10 ms, which gives 110 samples per frame. The energy of a frame is obtained by adding the 110 data samples of the frame using the formula given in equation (6.5). The speech data representing vowels like A (Aa), B (Ei), Am (Aaa) carries about 95% of the energy [50]. On the other hand, the speech data representing consonants like H (Ka), _ (Ma), n (Pa) carries about 5% of the energy. This means that in the word H a (Kar), the pronunciation of A (Aa) after the letter H (Ka) shows maximum energy, and a peak is observed in the energy waveform at the occurrence of a vowel [51]. Therefore, if one is able to separate two peaks in the energy plot, one can separate two vowels in the word. When the position of the minima is located between


two peaks accurately, one can separate the two vowels present in the word. Once two vowels are separated, the syllables get segmented automatically, as every syllable contains at least one vowel. Figure 6.15 shows the energy plot of the Marathi word Amgmdas (Aasavari).

Fig. 6.15: Energy plot of Marathi word Amgmdas (Aasavari)

From figure 6.15, it can be seen that vowels have higher energy as compared to consonants. As shown in the figure, four syllables can be segmented from this word.
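The energy-valley idea described above can be illustrated with a small sketch. The following Python fragment is only an illustration of locating minima between vowel peaks in a smoothed energy contour; it is not the neural network approach developed in this chapter, and the function name and the min_separation parameter are assumptions made for the example.

```python
import numpy as np
from scipy.signal import find_peaks

def candidate_cut_frames(smoothed_energy, min_separation=5):
    """Frames where the smoothed energy has a local minimum between two vowel peaks."""
    # A minimum of the energy contour is a peak of the negated contour.
    minima, _ = find_peaks(-np.asarray(smoothed_energy, dtype=float),
                           distance=min_separation)
    return minima.tolist()   # candidate syllable cutting points (frame numbers)
```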

6.5 BASIC ALGORITHM OF NEURAL NETWORK SEGMENTATION SYSTEM:

Basic algorithm of speech segmentation using NN system is explained below:

6.5.1 Purpose of algorithm To obtain segmented syllables from a given input speech word. The input word is marked with the minima (cutting point) locations.

6.5.2 Methodology used Energy is used as the extracted feature and given as input to the neural network algorithm. Different neural network methods such as single-layer, multilayer and classification networks are used.


6.5.3 Results/outcome of the algorithm The output of this algorithm should be the segmented syllables, or the soft-cut locations of the syllables, for the given input word.

6.5.4 Basic algorithm of segmentation system The basic algorithm of the neural network segmentation system is as follows:
1) Enter the word whose segmentation is to be done, or browse it from its location.
2) Read the WAV file.
3) Calculate the total number of frames for the total number of samples.
4) Calculate the energy of every frame.
5) Plot the calculated energy with the frame number on the x-axis and the frame amplitude on the y-axis.
6) Normalize the energy plot so that the frame amplitude lies between -1 and 1.
7) Smooth the normalized energy plot.
8) Give this energy plot as input to the neural network.
9) Get the syllable positions from the neural network and convert them into 'from' and 'to' positions of the sound file (to get the 'from' and 'to' positions, the frame number at which the minima is present is multiplied by the number of samples per frame; a small sketch of this conversion follows this list).
10) Give the output of the system as the 'from' and 'to' positions of the syllables in the sound file.
11) Play the segmented syllables and check the output.
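As a small illustration of the conversion in step 9, the following Python sketch multiplies frame numbers by the samples-per-frame value. This is an assumed illustration of this write-up, not the original Matlab implementation; the function name and the example numbers are hypothetical.

```python
SAMPLES_PER_FRAME = 110   # one 10 ms frame at 11025 Hz

def syllable_spans(minima_frames, total_samples):
    """Convert minima frame numbers into 'from'/'to' sample positions of syllables."""
    cuts = [f * SAMPLES_PER_FRAME for f in sorted(minima_frames)]
    edges = [0] + cuts + [total_samples]
    return list(zip(edges[:-1], edges[1:]))   # one (from, to) pair per syllable

# Hypothetical example: minima at frames 25 and 51 in a word of 8800 samples
# syllable_spans([25, 51], 8800) -> [(0, 2750), (2750, 5610), (5610, 8800)]
```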

The following figure 6.16 shows the flowchart of basic NN based segmentation system. It explains the flow of the system through different stages from input to output.

6.5.5 Flowchart: The following figure shows all the steps for segmentation of syllables from a given input word. First the entered sound file (WAV format) is read. After taking the absolute value of all samples, frames of 110 samples are formed. The energy is calculated for every 110 samples as given in equation (6.5) in the following section. Next, the energy waveform is plotted against the frame number and its amplitude. The energy plot is improved further by normalization and smoothing, and is then given as input to the neural network algorithm. The network gives minima positions, which are converted into sound file positions; these are the syllable locations (segments), which can be stored in the textual database. The syllable segmentation accuracy can be checked by playing the syllables. The flow of the segmentation procedure is shown below in detail.

[Flowchart: Start → Enter the word whose segmentation is to be done → Read sound file (WAV format) → Take absolute value of each and every sample → Calculate total number of frames having 110 samples per frame → Sum the absolute values of the 110 samples of each frame to find its energy → Plot frame number vs. frame amplitude → Normalize the energy plot to the range -1 to 1 and smooth it → Give this energy plot to the neural network as input → Convert the minima positions returned by the neural network into 'from' and 'to' positions of the sound file → Play syllables → Stop]

Fig. 6.16: Flowchart for basic neural network based segmentation

6.6 ENERGY CALCULATION:

Normally the energy of the signal is defined as


E = \sum_{m=0}^{\infty} x(m) \cdot x(m) ……………………………… (6.3)

where x(m) = input signal, and m varies from zero to plus infinity.

6.6.1 Actual formula used: The above formula has little meaning for speech, since it gives little information about the time dependent properties of the speech signal. So the short time energy at sample n is defined as

E_n = \sum_{m=n-N+1}^{n} x(m) \cdot x(m) ……………………………… (6.4)

where x(m) = input signal, N = total number of samples in one frame, and n = sample number.

Here m varies from n − N + 1 to n, N is the total number of samples in one frame and n is a sample number. The difficulty with the above formula is that it is very sensitive to large signal levels (since they enter the computation as a square), thereby emphasizing large sample-to-sample variations in x(m). A simple way to lessen this problem is to use the average magnitude function, also called pseudo energy, for calculating the energy; it does not emphasize large signal levels [133]. The formula for this is given as

E_n = \sum_{m=n-N+1}^{n} |x(m)| ……………………………… (6.5)

Here m varies from n − N + 1 to n. This function is called the average magnitude (pseudo energy), but in this work it is simply referred to as energy, and m varies from 1 to n. Every frame is formed from 10 ms of the sound file; this follows from the limitation of the human vocal system [52], which cannot produce higher rates of variation. The recording frequency used was 11025 Hz, so a 10 ms frame of the sound file contains 110 samples.
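A minimal sketch of this frame-wise average-magnitude calculation is given below in Python. The thesis implementation is in Matlab, so this is only an assumed illustration of equation (6.5); the function name and the file name are hypothetical.

```python
import numpy as np
from scipy.io import wavfile

def frame_energy(wav_path, frame_len=110):
    """Average-magnitude (pseudo) energy per 10 ms frame, as in equation (6.5)."""
    rate, samples = wavfile.read(wav_path)            # 11025 Hz mono recording assumed
    samples = np.abs(samples.astype(np.float64))      # absolute value of every sample
    n_frames = len(samples) // frame_len              # complete frames of 110 samples
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    return frames.sum(axis=1)                         # E = sum of |x(m)| over one frame

# energy = frame_energy('anand.wav')   # hypothetical file name
```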


6.6.2 Algorithm for energy calculation:
• Input: audio file in WAV format.
• Output: simple energy plot.

6.6.2.1 Purpose of energy algorithm This algorithm reads the samples of the input speech file and calculates its energy, which serves as the extracted feature for the neural network algorithm.

6.6.2.2 Methodology used The WAV file is read and the absolute values of the samples are summed frame by frame using equation (6.5).

6.6.2.3 Results/outcome of the algorithm A smooth energy output which can be used as a proper extracted feature for the NN algorithm.

ALGORITHM:
1) Read the sound file (WAV format).
2) Take the absolute value of each and every sample.
3) Calculate the total number of frames, each having 110 samples.
4) Sum the absolute values of the 110 samples of a particular frame [as shown in equation (6.5) above].
5) Such a calculation for every frame gives the energy plot.
6) The energy plot is plotted as frame number vs. frame amplitude.

6.6.3 Flowchart: The flowchart for the energy calculation algorithm is shown in figure 6.17 below. It shows how the energy is calculated for a given input sound file. A sound file in WAV format is used; it is first read to extract the audio data. All the audio samples are converted into their absolute values. The total number of frames of 110 samples each is calculated, and the samples of each frame are summed to calculate the energy of the given sound file. The energy graph is obtained by plotting the frame number against its amplitude.


[Flowchart: Start → Read sound file (WAV format) → Take absolute value of each and every sample → Calculate total number of frames having 110 samples per frame → Sum the absolute values of the 110 samples of each frame to find its energy → Plot frame number vs. frame amplitude → Stop]

Fig. 6.17: Flowchart for energy calculation

Figure 6.17 shows the flowchart for the energy calculation. It explains the various stages of the calculation, starting from the sound file and ending with the energy plot. The energy plots calculated with this algorithm are shown in figures 6.18 and 6.19 below for two words.

Fig. 6.18: Simple energy plot of word AmpXVs (Aditi)


Fig. 6.19: Simple energy plot of word Jm¡ad (Gaurav)

In figure 6.18 the energy amplitude exceeds 7, and in figure 6.19 it exceeds 4. As the neural network accepts inputs in the range -1 to 1, the energy plot has to be normalized. There are also many small variations in the plot which may disturb segmentation; therefore post-processing (averaging, smoothing, normalization, etc.) is required to modify the energy plot. The energy plot is further smoothed with a moving average algorithm, which uses the average of a fixed number of frames; the number of frames used for averaging depends on the smoothness required. This removes the small rises and falls of the energy plot and makes it smoother, which is suitable for segmentation.

6.7 POST-PROCESSING OF ENERGY PLOT:

6.7.1 Normalization: Normalization is needed to convert available energy into the range -1 to 1. The following algorithm is used for this task.


6.7.1.1 Purpose of normalization This algorithm transforms the energy amplitudes, which take various values, into a specific range. Normalization brings the values into this range so that they can be used in the next step, smoothing.

6.7.1.2 Methodology used The standard normalization procedure is carried out to bring the input energy amplitudes into the range -1 to 1.

6.7.1.3 Results/outcome of the algorithm Normalized output of speech energy in the range -1 to 1 is obtained.

6.7.2 Normalization algorithm:
1) Take the calculated energy as input.
2) Take the maximum of the frame amplitudes.
3) Divide the amplitude of every frame by this maximum amplitude.
4) Subtract 0.5 from the amplitude of every frame.
5) Multiply the amplitude of every frame by 2.
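The following Python sketch mirrors the five steps above. It is a minimal illustration assumed for this write-up, not the original Matlab code.

```python
import numpy as np

def normalize_energy(energy):
    """Scale frame amplitudes into the range -1 to 1 (steps 2-5 of the algorithm)."""
    energy = np.asarray(energy, dtype=np.float64)
    energy = energy / energy.max()    # divide every frame amplitude by the maximum
    energy = energy - 0.5             # subtract 0.5 from every frame amplitude
    return energy * 2.0               # multiply by 2: result lies between -1 and +1
```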

The output of this algorithm is an energy plot in the range -1 to 1. Figure 6.20 shows the normalized energy plot for the same word Am[XVs (Aditi).

Fig. 6.20: Normalized energy plot of word AmpXVs (Aditi)


Figure 6.21 shows normalized energy plot for word Jm¡ad (Gaurav)

Fig. 6.21: Normalized energy plot of word Jm¡ad (Gaurav)

6.7.3 Smoothing energy plot: From figures 6.20 and 6.21 above, it is seen that there are many small variations, so the energy plot needs to be smoothed. A moving average filter of length 12 is used for this purpose: the average of 12 consecutive frames is used for smoothing the energy waveform (the number of frames used for averaging can vary depending on the smoothness required). This averaging provides a smooth energy plot, as shown in figures 6.22 and 6.23. The following equation shows the moving average used for smoothing.

i11 Smooth i   energy( i ) /12 …………………………………….. (6.6) i

Where i = frame number of input energy signal.
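A minimal sketch of this 12-frame moving average in Python is shown below (an assumed illustration of equation (6.6), not the original Matlab code).

```python
import numpy as np

def smooth_energy(energy, window=12):
    """Moving average of `window` consecutive frames, as in equation (6.6)."""
    energy = np.asarray(energy, dtype=np.float64)
    kernel = np.ones(window) / window
    # 'valid' keeps only positions where the full 12-frame window fits
    return np.convolve(energy, kernel, mode='valid')
```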


Fig. 6.22: Smooth energy plot of word AmpXVs (Aditi)

Fig. 6.23: Smooth energy plot of word Jm¡ad (Gaurav)

The difference between figure 6.20 and figure 6.22 clearly shows the need for post-processing.

6.8 BLOCK DIAGRAM OF BASIC TTS SYSTEM AND SEGMENTATION APPROACHES:

The following figure 6.24 shows the block diagram of the basic TTS system and the segmentation approaches. Neural, non-neural and classification networks are used for syllabification [53]. These syllabification approaches are used to check the segmentation accuracy so that one approach can be finalized for syllable database preparation. The important blocks are shown in different colors in this block diagram. The region list is used for the textual database preparation of syllables; syllables from this database are used for the generation of new words.

[Block diagram: text processing and encoding → database search → syllable cutting using neural, classification and non-neural approaches → region list and syllable concatenation (with text and audio databases) → speech output]

Fig. 6.24: Basic block diagram of TTS and segmentation approaches

Explanation of block diagram:
1) The input to the text to speech synthesizer is Devnagari (Marathi) text.
2) In text processing, the encoding stage encodes the input text into a code string. This block breaks the word into syllables using CV rules. Two databases need to be maintained, viz. an audio database and a textual database.
3) The CV structure and the CV structure breaking rules help in determining the number of syllables.
4) The number of syllables thus obtained is then used for the segmentation approaches using neural, classification and non-neural methods. These syllables are then stored in the textual database and concatenated as required during synthesis. Both neural and non-neural segmentation techniques are implemented and the segmentation accuracy is checked.
5) The concatenated or synthesized speech with segmentation details is the output of the segmentation system.


The following figure 6.25 shows syllabification flowchart with different segmentation approaches.

6.8.1 Syllabification flowchart:

[Flowchart: Speech input → Framing → Energy calculation → Smoothing → Silent period removal → Approach (neural / classification / non-neural) → Comparison]

Fig. 6.25: Syllabification flowchart


Explanation of syllabification flowchart:
• The speech signal is dynamic in nature. Therefore, it is divided into frames of 110 samples. The energy of each of these frames is calculated with the formula shown in equation (6.5).
• The energy varies in each frame depending upon the speech signal. It is therefore averaged to give a smooth curve. Smoothing is done using a moving average filter; an average of twelve consecutive frames is taken. This number can be different depending on the smoothness required.
• The recorded speech signal may contain noise and some silence. The silence period is removed after smoothing. Syllabification is then carried out using neural and non-neural approaches.

Speech segmentation by the new neural network approaches of this work, namely Back-propagation, a combination of supervised and unsupervised algorithms (Back-propagation–Maxnet), and a classification method (K-means with Maxnet), is shown in the following sections.

6.9 NEURAL NETWORK FOR AUTOMATIC SEGMENTATION:

• It is important to remember that with a NN solution one does not have to understand the solution at all. With a NN one simply shows it: "this is the correct output given this input".

• With an adequate amount of training, the network mimics the function that is presented for demonstration. Furthermore, with a NN it is acceptable to apply some inputs that turn out to be irrelevant to the solution; during the training process, the network learns to ignore inputs that do not contribute to the output. Conversely, if the user omits some critical inputs, this is easily detectable because the network fails to converge on a solution and remains in a saturated state.

• Neural networks are known for their ability to generalize and capture the functional relationship between input-output pattern pairs. Neural networks have the ability, after an appropriate learning phase, to predict even patterns not presented before. Therefore a neural network is selected for automatic segmentation [54].

6.9.1 MAXNET:
• Maxnet was proposed by Lippmann in 1987.
• Among simple neural networks, the competitive networks are composed of two networks: the Hamming net and the Maxnet.
• The Maxnet finds the perceptron with the maximum value. Maxnet is a recurrent one-layer network used to determine which node has the highest initial activation.
• The Maxnet is a fully connected network, with each node connecting to every other node, including itself. The basic idea is that the nodes compete against each other by sending out inhibiting signals to each other.
• This is done by setting the weights of the connections between different nodes to be negative and applying the following rules:
a) The diagonal elements of the weight matrix are always 1 and all other elements are −ε.
b) The parameter ε is a constant in (0, 1/m).
c) The inputs (initial states) are real numbers in the range [0, 1].
• To find the output of the Maxnet, all neurons are updated in synchronous mode. The effective input of the i-th neuron is calculated by

net_i = x_i^{old} − ε \sum_{j \ne i} x_j^{old} ……………………………… (6.7)

and the final output is given by

o_i = f(net_i) = 0 for net_i < 0; o_i = net_i for net_i ≥ 0 ……………………………… (6.8)

• For any input vector, the Maxnet gradually suppresses all but the neuron with the largest initial input. There is no training algorithm for Maxnet as its weights are fixed.


• It is an unsupervised learning network. The training samples contain only input patterns; no desired output is given (teacher-less learning).
• It learns to form classes/clusters of sample patterns according to the similarities among them.
• One way to realize competition in a NN is lateral inhibition, where the output of each node feeds the others through inhibitory connections (with negative weights).
• In Maxnet, lateral inhibition between competitors is expressed through the weights:

w_ij = θ if i = j; −ε otherwise

• Competition is an iterative process that continues until the net stabilizes (at most one node retains a positive activation), with 0 < ε < 1/m, where m is the number of competitors. If ε is too small, the NN takes too long to converge; if ε is too big, it may suppress the entire network (no winner).
• Maxnet is an implementation of a maximum-finding function. With each iteration, the neurons' activations decrease until only one neuron remains active. The 'winner' is the neuron that had the greatest output. Maxnet is a neural net based on competition; it picks the node whose input is largest, and thus it can act as a subnet.

6.9.2 Maxnet architecture:

A Maxnet is a one-layer network that conducts a competition to determine which node has the highest initial input value. The architecture of the Maxnet is shown in the following figure. The architecture shows a subnet having m nodes, which are completely interconnected. These interconnections have symmetric weights ε, and all of them are inhibitory. The weights in this case are fixed; hence there is no need to update the weights or to train the network. Because of this characteristic, it may be called a fixed-weight network [131].

Figure 6.26 shows Maxnet network architecture.


[Architecture: m fully interconnected nodes X1 … Xm, with self-connection weight θ and mutual inhibitory weight ε]

Fig. 6.26: Maxnet network architecture

Topology: nodes with self-arcs, where all self-arcs have a small positive (excitatory) weight and all other arcs have a small negative (inhibitory) weight. The primary mechanism is an iterative process in which each node receives inhibitory inputs from all other nodes via 'lateral' (intra-layer) connections. The weights can be chosen so that the single node whose value is initially maximum eventually prevails as the only active or 'winner' node, while the activations of all other nodes subside to zero. The simplest neuron model that accomplishes this is a lower-bounded variation of the identity function, with

f(net) = max(0, net) ……………………………… (6.9)

For the network shown in figure 6.26, this task can be accomplished by choosing the self-excitation weight θ = 1 and the mutual inhibition magnitude ε ≤ 1/(number of nodes).

It is assumed that all nodes update their output simultaneously. The weighted input from current node is treated just like the weighted inputs from other nodes. The function at each node takes into account both self weighted signal and other signals from other nodes. In this way the node function examines all the weighted inputs approaching to it.


net = \sum_{i=1}^{n} w_i x_i ……………………………… (6.10)

where x_i = input signal and w_i = associated weight of the respective input signal.

One might ask why a Maxnet neural network should be used for the straightforward task of choosing the maximum of n numbers. The reason is that Maxnet allows greater parallelism in execution than an algebraic solution, since every computation is local to each node rather than being controlled by a central coordinator.

6.10 IMPLEMENTATION OF MAXNET:

Maxnet is used as a subnet to pick the node whose input is largest. Thirty input nodes are used for this Maxnet implementation. After testing the 500 most common words in the Marathi language, it was found that the first minima (syllable cutting point or segmentation point) is present most of the time within the first 30 frames (this number was arrived at by trial and error). Thirty frames of the input speech signal are therefore provided to the 30 inputs of the Maxnet network. This algorithm is used here for the first time for speech segmentation in Marathi TTS.

Number of samples/input nodes: 30
Learning rate used: 0.5
Number of iterations/epoch: 30
Success rate of this algorithm: counted in terms of segmentation accuracy, which is 75% for this algorithm.

6.10.1 Maxnet Algorithm: Maxnet is implemented as shown in the following algorithm.

6.10.1.1 Purpose of algorithm This algorithm gives frame number at which minima is present.


6.10.1.2 Methodology/ mathematical function: The following steps show how Maxnet is implemented.

The activation function for Maxnet is f(x) = x if x > 0, and 0 otherwise, where x is the input signal.

6.10.1.3 Step by step Maxnet implementation: Maxnet is initialized with weights as shown below. If the first and second node indices are the same, the signal is travelling to the same node and the weight is initialized to 1. In all other cases the signal is travelling to some other node and hence the weight is initialized with an inhibitory value, here a very small value indicated as ε. As Maxnet is a fixed-weight network, this initialization is constant for most applications.

1) Initialize activations and weights (set 0 < ε ≤ 1/30):

w_ij = 1 if i = j; −ε if i ≠ j

where w_ij = weight of the single layer, i = node i (first node), j = node j (second node) and ε = inhibitory weight. The updating of node activations is carried out for all nodes in every iteration until only one winner node remains: the Maxnet algorithm continues with its next iteration as long as more than one node has a non-zero activation; otherwise it stops.

2) While stopping condition (if only one node has non-zero activation) is false, do steps 3 to 5 as shown below.

After all nodes have been updated with their current weights and their activations have been recalculated, the new activation value is saved at each node. The following equation shows this step.


3) Update the activation of each node, for j = 1, 2, …, m:

a_j(new) = f[a_j(old) − ε \sum_{k \ne j} a_k(old)] ……………………………… (6.11)

where a_j and a_k are the activations of nodes j and k. All the nodes are updated with their new activation values, which are saved for the next iteration.

4) Save activation for use in next iteration:

a_j(old) = a_j(new),  j = 1, 2, …, m

In the present work, there are in total 30 input nodes in the Maxnet and hence in total 30 iterations. After the first iteration, the Maxnet algorithm gives the node having the maximum activation value. This node is discarded, as the minimum-value node is needed to locate the segmentation point. After discarding 29 node activations, the algorithm ends up with the one node activation having the lowest value; this node is the first segmentation point for the given input word. After the first minima location is found, the Maxnet is provided with the next 30 frame values following the segmentation location.

5) Test stopping condition: If more than one node has non-zero activation, continue; otherwise, stop.

6.10.1.4 Results/outcome of the algorithm Maxnet gives the position of the peak in energy plot as it finds maximum of all provided inputs. But for segmentation of speech signal, the minima position (segmentation location) is required. A maximum received in the first iteration of Maxnet algorithm is discarded and input signal is applied again. After thirty iterations, minima position (segmentation location) should be obtained.

6.10.2 Maxnet flowchart: The flowchart of the Maxnet algorithm is shown in figure 6.27 below.


[Flowchart: Start → Initialize weights of all nodes → Calculate output of each node → if more than one node has a non-zero output, update and recalculate; otherwise that node is the winner → Stop]

Fig. 6.27: Flowchart for Maxnet

Figure 6.27 above shows the flowchart of the Maxnet algorithm. The algorithm starts with initialization of the weights, after which the output of each node is calculated. If the resulting node is the only node with a non-zero output, it is the winner, and after displaying the winner the algorithm stops. Otherwise, as shown in the flowchart, the node outputs are updated and the algorithm continues by recalculating the output of the net.

6.10.3 Testing of Maxnet algorithm and authentication of results: A Maxnet is a one-layer network that conducts a competition to determine which node has the highest initial input value [55]. For validation of the Maxnet algorithm results, more than 500 words were tested for segmentation accuracy. The minima locations produced by the Maxnet algorithm and the manual cutting points for all 500 words were compared, and the Maxnet results that match the manual segmentation results are counted as correct. In total, 375 words out of 500 have segmentation values similar to the manual segmentation results, which gives the Maxnet algorithm a segmentation accuracy of 75%.


The following figure 6.28 shows the GUI developed in Matlab, which is used for testing the Maxnet results.

Fig. 6.28: GUI snapshot for Maxnet algorithm testing

An audio file in WAV format is used as input to the system. After a word is entered, its sound file or energy plot can be viewed. When the 'Show syllable positions' button is pressed, the positions at which the word is segmented into syllables are shown. The 'Play 1st syllable', 'Play 2nd syllable' and similar buttons are used for playing the segmented syllables; testing is carried out by listening to these syllables.

In all these segmentation results of different algorithms, syllable segments or minima locations are shown as frame numbers.

Figure 6.29 shows the output of the system for Marathi word AO© (arja).


Fig. 6.29: Maxnet output for Marathi word AO© (arja)

When a word is entered in the 'edit box', the sound contents of the file are shown, and the total number of samples in the file is shown above the plot. 'Energy Plot' can be selected from the 'Select Plot' panel to view the energy output of the input sound file. Figure 6.30 shows the energy plot of the same Marathi word AO© (arja); the energy plot is displayed in place of the sound file.


Fig. 6.30: Energy plot for word AO© (arja)

The total number of frames in the energy plot is shown at the top of the figure. When the 'Show syllable positions' button is pressed, the system shows that the syllables are at positions 25 and 51 in the energy plot. It also shows the 'from' and 'to' positions of the syllables in the sound file. The 'Play 1st syllable' button plays the first syllable; similarly, the second syllable can be played by pressing the 'Play 2nd syllable' button.

To get the 'from' and 'to' positions of a syllable in the sound file, the syllable positions are multiplied by the number of samples per frame (110 in this case).

Figure 6.31 shows system output for three syllable word ~o~§Y (bebandh).


Fig. 6.31: Sound plot of three syllable word ~o~§Y (bebandh)

Figure 6.32 demonstrates output of same word with energy plot.

Fig. 6.32: Energy plot of three syllable word ~o~§Y (bebandh)


A word can also be browsed directly from its location. The following figure 6.33 illustrates this operation: when the 'Browse' button is clicked, a new window appears from which the required WAV file can be imported directly.

Fig. 6.33: Importing word from its location

Fig. 6.34: Working for browsed word Am^mi (abhal)


Figure 6.35 explains an example of a four syllable word. Marathi word NoXZp~§Xy (chedanbindu) has four syllables as No (che), XZ (dan), p~Z² (bin) and Xy (du)

Fig. 6.35: System output for four syllable word NoXZp~§Xy (chedanbindu)

Figure 6.36 demonstrates example of word having five syllables. Marathi word M_HXmanZo (chamakdarpane) has five syllables as M (cha), _H (mak), Xma (dar), n (pa) and Zo (ne)

Fig. 6.36: System output for five syllable word M_H XmanZo (chamakdarpane)


Figure 6.37 explains an example of a word having a single syllable. Actually there is no need to apply a monosyllabic word to this system, but if a monosyllabic word is presented, the system should work properly.

Fig. 6.37: System output with monosyllabic word ~mU (baan)

The system also displays an error message if the user tries to play a syllable which is not present. As shown in figure 6.37, when the user tries to play the '2nd syllable', an error message is displayed at the bottom-left corner of the GUI as 'syllables are less than that'.

The following figure 6.38 explains sound and energy plot for word AmJmgs (Agasi)


Fig. 6.38: Sound and energy plot for word AmJmgs (Agasi)


6.10.4 Conclusion of Maxnet An input word can be segmented into syllables easily using the Maxnet system. It works properly for various 2, 3 and 4 syllable words in the database, and the results are checked for the most common Marathi words. The segmentation accuracy of the Maxnet algorithm is, however, limited to 75%.

6.11 BACK-PROPAGATION:

• Feed-forward networks have the following characteristics:
1. Perceptrons are arranged in layers, with the first layer taking inputs and the last layer producing outputs. The middle layers have no connection with the external world and hence are called hidden layers.
2. Each perceptron in one layer is connected to every perceptron in the next layer, so information is constantly fed forward from one layer to the next; this explains why these networks are called feed-forward networks.
3. There is no connection among perceptrons in the same layer.

• This is a type of network in which every layer is connected or linked to every previous layer.

• Training: In NN training, the training data is used for adjusting the weight values. These weights decide what the NN will produce at the output. The Back-propagation algorithm uses the gradient descent technique to adjust the connection weights between neurons.

ΔW_t = −ε \frac{\partial E}{\partial W} + α ΔW_{t−1} ……………………………… (6.12)

where ε = learning rate, ΔW = weight change vector, and α = momentum.

The momentum α decides the amount of change of weights taking place in each iteration/epoch.


• Back-propagation is a common method of teaching artificial neural networks how to perform a given task. It is an implementation of the delta rule; it modifies the network weights to minimize the mean square error between the desired and actual outputs of the network.

• Back-propagation propagates inputs forward in the usual way, i.e. all the outputs are computed using sigmoid thresholding of the inner product of the corresponding weight and input vectors. All the outputs at stage n are connected to all the inputs at stage n+1.

• Here the desired output is known first, and the error is then calculated accordingly for the particular input. In the training phase this error is back-propagated and the weights are changed accordingly. Each neuron receives a signal from the neurons in the previous layer, and each of those signals is multiplied by a separate weight value.

• The weighted inputs are summed and passed through a limiting function which scales the output to a fixed range of values. The output of the limiter is then broadcast to all of the neurons in the next layer. To use the network to solve a problem, one applies the inputs to the first layer, allows the signals to propagate through the network, and reads the output values. Figure 6.39 shows the generalized architecture of a neural network with one hidden layer.

[Architecture: inputs → hidden layer → outputs]

Fig. 6.39: Generalized neural network architecture


Stimulation is applied to the inputs of the first layer and signals propagate through the middle (hidden) layer(s) to the output layer. Each link between neurons has a unique weighting value. The following figure shows structure of a neuron.

[Neuron: weighted inputs → SUM → limiter (sigmoid function) → output to other neurons]

Fig. 6.40: The structure of a neuron

• Inputs from one or more previous neurons are individually weighted and then summed. The result is non-linearly scaled between -1 and +1, and the output value is passed on to the neurons in the next layer.

• Since the real uniqueness or 'intelligence' of the network lies in the values of the weights between neurons, one needs a method of adjusting the weights to solve a particular problem. For this type of network, the most common learning algorithm is called Back-propagation (BP). A BP network learns by example: one must provide a learning set that consists of input examples and the known-correct output for each case. These input-output examples show the network what type of behavior is expected, and the BP algorithm allows the network to adapt.

6.11.1 Architecture of Back-propagation: Back-propagation architecture is shown in the following figure.


[Architecture: input layer (in_0 … in_m) → hidden layer → output layer, with weights on every connection between successive layers]

Figure 6.41: Back-propagation architecture

• This is a type of network in which every layer is connected or linked to every previous layer. The BP learning process works in small iterative steps: one of the example cases is applied to the network and the network produces some output based on the current state of its synaptic weights (initially the output will be random). This output is compared to the known-good output and a mean-squared error signal is calculated [132].

• This error value is then propagated backwards through the network and small changes are made to the weights in each layer. The weight changes are calculated to reduce the error signal for the case in question. The whole process is repeated for each of the example cases, then back to the first case again, and so on. The cycle is repeated until the overall error value drops below some pre-determined threshold; in this research, mse (mean square error) = 0.1 is used. At this point one can say that the network has learned the problem "well enough". The network will never exactly learn the ideal function, but rather it will asymptotically approach it.


6.12 IMPLEMENTATION OF BACK-PROPAGATION:

A multilayer neural network with one layer of hidden units is used for the Back-propagation algorithm. The present implementation of the Back-propagation algorithm has the following specifications:

No. of input nodes: 30 input frames are provided to 30 input nodes of the BP algorithm.
No. of hidden nodes: 20 are used; this number is fixed, and only one hidden layer is used.
No. of output nodes: 30 output nodes are present. The output node at which the minima (segment location) is present is given as the algorithm output.
In this research, the Back-propagation algorithm is used for the first time for segmentation of words into syllables.

6.12.1 Input nodes: 30 input nodes are taken, and 30 frames from the energy plot are applied to the neural network at a time. It is observed from different energy plots that a syllable length is generally not more than 30 frames. About 300 words were tested and the syllables checked over different frames; the first syllable is generally found in the first 30 frames. This number was arrived at after testing the most common words. The next 30 frames, counted from the segmented syllable position, are then applied to the Back-propagation neural network for finding the next syllable segmentation point.

6.12.2 Hidden nodes: 20 nodes are taken in a single hidden layer; only one hidden layer is used. Twenty nodes in the hidden layer provide good results. Hidden layers with more or fewer than 20 nodes were tried, but they resulted in long training times and segmentation accuracy below 50%. Different hidden layer sizes such as 12, 15, 18, 22 and 30 nodes were tried; for each of them, the output of the Back-propagation algorithm was compared with the manual segmentation results. This experiment was carried out for 200 words. The hidden layer with 20 nodes gave more than 80% segmentation accuracy, while all other node counts gave less than 50% segmentation accuracy.


6.12.3 Output nodes: 30 output nodes are used. Every output node corresponds to the respective frame number (i.e. 15th output node tells whether syllable segmentation is present on 15th frame position or not).

6.12.4 Initialization of weights: The initial weights are chosen randomly in the range 0 to 1. Typically the chosen weights are small, because large weights may drive the output of the input layer to saturation, requiring a large amount of training time to emerge from the saturated state.

6.12.5 Choice of learning rate (η): A large value of η leads to rapid learning, but the weights may then oscillate, while low values imply slow learning. Values between 0.1 and 0.9 have been used in many applications. Here η = 0.5 is used; this value was decided after trying different values between 0.1 and 0.9 on more than 300 of the most common words.

6.13 BACK-PROPAGATION ALGORITHM IMPLEMENTATION:

Back-propagation, an abbreviation for 'backward propagation of errors', is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses it to update the weights in an attempt to minimize the loss function. Back-propagation requires a known desired output for each input value in order to calculate the loss function gradient; it is therefore usually considered a supervised learning method. The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to their correct outputs. The Back-propagation algorithm works in two phases.

Phase 1: Propagation. Every propagation involves the following steps:


1) Forward propagation of the training pattern's input through the neural network in order to generate the output activations.
2) Backward propagation of the output activations through the neural network using the training pattern's target, in order to generate the deltas (the differences between the target and actual output values) of all output and hidden neurons.

Phase 2: Weight update. For each weight-synapse, follow these steps:

1) Multiply its output delta and input activation to get the gradient of the weight. 2) Subtract a ratio (percentage) of the gradient from the weight.

The ratio (percentage) influences the speed and quality of learning and is called the learning rate. The greater the ratio, the faster the neuron trains; the lower the ratio, the more accurate the training is.

Repeat phase 1 and 2 until the performance of the network is satisfactory.

The Back-propagation NN must have at least one input layer and one output layer, and it can have zero or more hidden layers. The number of hidden layers, and how many neurons should be present in each hidden layer, cannot be well defined in advance and can change with the network configuration and the type of data. One can start a network configuration with a single hidden layer, and add more hidden layers if the network does not perform as required.

In the present Back-propagation implementation, 30 input neurons, 20 hidden neurons and 30 output neurons are used. The basic use of Back-propagation is to indicate, among the 30 frames of speech, where the minima (segmentation point) may be present. This output of Back-propagation is given as input to Maxnet, which selects the exact frame of speech where segmentation can be carried out.


6.13.1 Purpose of algorithm Back-propagation is used to reduce the error between the desired and actual outputs of the multilayer network [56]. It is used to find the exact frame number of the minima location.

6.13.2 Methodology used The error is calculated as the difference between the output of the hidden/output layer and the known good output. If the mean square error reaches 0.1, the algorithm stops; otherwise it keeps updating the weights to improve the error.

6.13.3 Results/outcome of the algorithm The output of this algorithm is an appropriate set of weights that produces an improved output at the output layer for a given input at the input layer. The number of hidden layers depends on the error behaviour at the input layer and the required output-layer accuracy or function implemented. In this research, one hidden layer with 20 nodes is used.

6.13.4 Basic Back-propagation algorithm:
Number of input nodes: 30 input frames applied to 30 input nodes
Learning rate: 0.5
Number of epochs: 10000
mse (mean square error): 0.1

Basic algorithm of Back-propagation is as follows:

1) Initialize weights and learning rate.

2) While stopping condition is false, do steps 3 to 10. (Here stopping condition is mean square error should be 0.1)

3) For each training pair, do steps 4 to 9.

4) Each input unit x_i (i = 1, 2, …, n) receives the input signal and broadcasts it to all units in the layer above, where x_i = input signal.

5) Each hidden unit z_j (j = 1, 2, …, p) sums its weighted input signals:


z_{in_j} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij} ……………………………… (6.13)

where z_{in_j} = net input of hidden unit j, x_i = input signal, and v_{ij} = associated weight of the input signal. The activation function is then applied to z_{in_j}:

z_j = f(z_{in_j}) ……………………………… (6.14)

and this signal is sent to all units in the layer above, where z_j = hidden layer output after application of the activation function.

6) Each output unit y_k (k = 1, 2, …, m) gets the signal from the previous layer:

y_{in_k} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk} ……………………………… (6.15)

where y_{in_k} = net input to output unit k (computed from the outputs of the previous layer) and w_{jk} = weight between the hidden and output layers. Its output is then

y_k = f(y_{in_k}) ……………………………… (6.16)

where y_k = output of the output unit.

Back-propagation of error: The following steps show the calculation of the error.

7) Each output unit computes its error term:

δ_k = (t_k − y_k) f'(y_{in_k}) ……………………………… (6.17)

Δw_{jk} = η δ_k z_j ,  Δw_{0k} = η δ_k

where δ_k = delta (error) term of the output layer, t_k = target output, y_{in_k} = net input to output unit k, η = learning rate, w_{jk} = weight from the hidden to the output layer, and w_{0k} = bias weight of output unit k.

This step calculates the error as the difference between the actual output and the target (known good) output. This delta output (error) is back-propagated to the hidden layer for updating of the weights.

8) Each hidden unit z_j (j = 1, 2, …, p) sums its delta inputs:

δ_{in_j} = \sum_{k=1}^{m} δ_k w_{jk} ……………………………… (6.18)

and finds δ_j = δ_{in_j} f'(z_{in_j}). Then calculate Δv_{ij} = η δ_j x_i and Δv_{0j} = η δ_j,

where δ_{in_j} = delta input to the hidden layer, w_{jk} = weight between the hidden and output layers, v_{ij} = weight of the input signal, and v_{0j} = bias weight of hidden unit j.

9) Update weights

w_{jk} = w_{jk} + Δw_{jk} ……………………………… (6.19)

v_{ij} = v_{ij} + Δv_{ij} ……………………………… (6.20)

10) Test the stopping condition (the stopping condition is that the mean square error should reach 0.1).
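The training loop of steps 1-10 can be sketched compactly in Python/NumPy. This is only an assumed illustration under the stated settings (30 input nodes, 20 hidden nodes, 30 output nodes, η = 0.5, mse goal 0.1); the batch-style updates and the sigmoid derivative y(1 − y) are choices of the sketch, not details fixed by the thesis.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, n_hidden=20, eta=0.5, mse_goal=0.1, max_epochs=10000, seed=0):
    """One-hidden-layer back-propagation roughly following equations (6.13)-(6.20).

    X : (patterns, 30) energy frames, T : (patterns, 30) target segmentation vectors.
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    V = rng.random((n_in, n_hidden))      # input-to-hidden weights, random in [0, 1)
    v0 = rng.random(n_hidden)             # hidden biases
    W = rng.random((n_hidden, n_out))     # hidden-to-output weights
    w0 = rng.random(n_out)                # output biases

    for _ in range(max_epochs):
        # Forward pass, equations (6.13)-(6.16)
        Z = sigmoid(X @ V + v0)
        Y = sigmoid(Z @ W + w0)
        err = T - Y
        if np.mean(err ** 2) <= mse_goal:          # stopping condition: mse <= 0.1
            break
        # Backward pass, equations (6.17)-(6.18); sigmoid'(y) = y * (1 - y)
        delta_k = err * Y * (1.0 - Y)
        delta_j = (delta_k @ W.T) * Z * (1.0 - Z)
        # Weight updates, equations (6.19)-(6.20)
        W += eta * Z.T @ delta_k
        w0 += eta * delta_k.sum(axis=0)
        V += eta * X.T @ delta_j
        v0 += eta * delta_j.sum(axis=0)
    return V, v0, W, w0
```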

The flowchart of Back-propagation algorithm is shown in the figure 6.42 below. It explains the flow of the Back-propagation algorithm.

6.13.5 Back-propagation flowchart: The flowchart for this algorithm is shown in figure 6.42. The Back-propagation algorithm starts with initialization of the weights. The input to the hidden layer is applied and its output is calculated; this output then acts as the input to the output layer, whose output is calculated in turn. The error is calculated between the output of the output layer and the known good output, which in this research is the manual segmentation location. If the mean square error reaches 0.1, the algorithm stops, as shown in the flowchart; otherwise the weights of the hidden and output layers are updated. The algorithm keeps updating the weights until the resulting error is acceptable; the acceptable error (mean square error) is 0.1 in this algorithm.

[Flowchart: Start → Initialize weights → Calculate input to hidden layer → Calculate output of hidden layer → Calculate input to output layer → Calculate output of output layer → Calculate error → if the error is acceptably small, Stop; otherwise update the weights and repeat]

Fig. 6.42: Flowchart for Back-propagation

6.14 SUPERVISED AND UN-SUPERVISED NN FOR SEGMENTATION:

In this section, a combination of supervised and unsupervised neural networks is explained for speech segmentation. Neural networks need to be trained to implement a given algorithm or function. There are two forms of training:
• Supervised
• Unsupervised


6.14.1 Supervised neural network learning: This method is similar to rote learning, or learning with a teacher. Here the output of the network is compared with the known good output. The difference (error) is corrected and the network output is checked again. This procedure continues until the output of the network matches the known good output.

6.14.2 The training data set: For a supervised learning data set, two components are required. The first component is a set of input vectors; these are usually floating point numbers but could also be bits. The other component is the set of correct outputs corresponding to each input vector.

6.14.3 The basic training procedure:  The basic idea is simple. Initially the weights are set randomly often between 0.0 and 1.0, a threshold is set for the artificial neurons. The first vector from the training data is applied to the inputs and the corresponding output vector is calculated.

• The output is compared to the expected correct output vector and the error is calculated. If the error is satisfactorily small, no action is taken and the next training vector is applied to the input of the NN. If the error is not small, the weights are changed appropriately and the first input vector is applied again. Hopefully, the measured error is smaller this time; if it is satisfactorily small, the second vector is applied to the inputs.

• The second vector produces an output vector which is compared with the correct output for that input case. If the error is too large, the weights are adjusted. If this adjustment produces an error measure which is not satisfactory, the second input vector is re-applied with the new weights.

• Now there could be another problem. By changing the weights to get a satisfactory result when the second input vector is applied, the resulting weights may mess up the results for the first input vector. This must be checked out. If the first vector is correct then one can continue and apply the


third input vector. If the first vector is not correct it must be applied again and the weights need to be changed one more time. This process continues for some time. Training a net can be slow [57].

• Eventually, a set of weights is found which produces a satisfactorily small error measure for all the input vectors. In the present research, this small error (mean square error) is defined as 0.1, and the weights for this error are then finalized. This small error value could be any value between -0.5 and 0.5 and is chosen depending on the network performance. The net is then said to be trained; it has learned how to recognize the patterns in the input data set.

• Another way of training involves applying all the input vectors using the original set of weights (batch training), storing the resulting error vectors and then applying all the corrections to the weights at once. This completes one cycle. Then the input vectors are applied again with the new weights, new errors are calculated and applied to the weights. The process continues until the overall error measure becomes sufficiently small, here a mean square value of 0.1.

6.14.4 Unsupervised learning in neural networks:  Being able to train an artificial neural network to recognize patterns or features in data sets using supervised learning algorithms is impressive. Creating networks which find patterns in data sets without supervision is amazing.

• Feed-forward networks using training algorithms such as Back-propagation learn by rote. The teacher uses a data set which includes the correct output for each input. In other words, the teacher gives the net 'problems' and sees if the net produces the 'right answers'. If the answer is not correct, the net is corrected by changing its weights and it is then retested. This is much like the old fashioned rote learning once extensively used in schools.

• Unsupervised networks learn on their own, as a kind of self-study. A data set is presented to such networks and they learn to recognize patterns in the data set. That is, they categorize the input data into groups. The net does not understand


the meaning of the groups. It is up to human users to interpret or label the groups in some meaningful way.

• For example, an unsupervised network might be presented with a data set of mortgage requests. The set could contain such things as the requester's credit rating, the size of the mortgage, the interest rate, etc. The net might discover that the data set divides into two categories, but only the human could give meaning to the categories.

• If the data set consisted of actual records of an experienced bank manager over a long period, one could look at what the banker did. Were the mortgages granted or rejected? This knowledge would enable the human to label the two categories discovered by the net: one label would be 'accept' and the other would be 'reject'. Once trained, the net could be presented with new mortgage requests. It would make decisions based on the skill of the original human banker [58].

6.14.5 The delta rule: Around 1970, Minsky and Papert showed that no two-layer neural network could be trained to solve non-linearly separable problems. They also speculated that there was no training algorithm capable of finding the correct weights attached to the so-called 'hidden' layer in networks of three or more layers in a reasonable length of time. Furthermore, if a two-layer network could not solve non-linearly separable problems like XOR, the computational power of such networks was limited. However, with the discovery of the back-propagation training algorithm, neural networks returned in the 1980s.

One of the simplest and earliest algorithms for neural network training is called the delta rule. In machine learning, the delta rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network. It is a special case of the more general back-propagation algorithm. Consider the Adaline network shown in figure 6.43. The diagram shows how all inputs along with their weights are applied to the network. Weights are updated


based on the difference between the desired answer and the actual answer. The Adaline is an early type of neural network.

Fig. 6.43: Adaline network (inputs X1 and X2 with weights w1 and w2; the actual answer is compared with the desired answer to generate weight update information)

The delta rule is,

$w_{new} = w_{old} + \dfrac{\beta E x}{\|x\|^{2}}$ ……………………………………… (6.21)

where
w = weight vector of the network
x = input signal vector
E = error (difference between correct output and actual output)
β = learning rate

Here 'w' and 'x' are vectors. The error 'E' is the difference between the correct output and the actual output for the training vector. Beta (β) is the learning rate, an adjustable parameter which determines how much change to make in each cycle. During each epoch the input and output node values change, depending on the signal they carry from one node to another and on the associated weights; the learning rate β scales this change in every cycle or epoch. Choosing the correct beta requires careful tuning.

The threshold works like this,


The net input is,

$I = \sum_{i=1}^{n} x_i w_i$ ……………………….……….. (6.22)

where I = net input, $x_i$ = input signal and $w_i$ = weight of the input signal.

The output O is then O = f(I), where O is the output of the net.

Where f (I) is the threshold function, often the sigmoid function or sometimes the step function, in which case the output would be 0 or 1.
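As an illustrative aid only (not part of the thesis implementation, which uses Matlab), the following minimal Python/NumPy sketch shows one delta-rule update following equations (6.21) and (6.22), using the sigmoid as the threshold function f(I). The variable names and the toy training pairs are assumptions made for the example.

import numpy as np

def sigmoid(i):
    # threshold function f(I) from equation (6.22)
    return 1.0 / (1.0 + np.exp(-i))

def delta_rule_step(w, x, target, beta=0.1):
    # one update of equation (6.21): w_new = w_old + beta * E * x / ||x||^2
    i = np.dot(w, x)              # net input I = sum(x_i * w_i)
    error = target - sigmoid(i)   # E = correct output - actual output
    return w + beta * error * x / np.dot(x, x)

# toy usage (assumed example): one weight per input plus a bias input fixed at 1
rng = np.random.default_rng(0)
w = rng.normal(size=3)
training = [(np.array([1.0, 0.0, 1.0]), 1.0),
            (np.array([0.0, 1.0, 1.0]), 0.0)]
for epoch in range(100):
    for x, t in training:
        w = delta_rule_step(w, x, t)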

6.14.6 Results of cascaded combination of supervised and unsupervised NN:  There are advantages and disadvantages to both supervised and unsupervised networks. Hence, a cascaded combination of a supervised network (back-propagation) and an unsupervised network (Maxnet) is tested for syllable segmentation. It is observed that this cascaded combination improves syllable cutting accuracy, resulting in 82% segmentation accuracy.

 Back-propagation gives the output frames in which a minimum is present. The output node values of this group of frames are given as input to the Maxnet (unsupervised) algorithm, which gives the exact frame number of the minimum location. This combination of supervised (back-propagation) and unsupervised (Maxnet) networks is used for the first time for speech segmentation of syllables from words in TTS. As compared to the standalone unsupervised NN approach (Maxnet), this combination results in higher segmentation accuracy (82%) and hence more natural sounding speech synthesis output. The percentage of segmentation accuracy is calculated with reference to manual segmentation. A total of 500 words are tested for segmentation location with manual and cascaded (back-propagation-Maxnet) segmentation and their results compared for accurate minima locations of speech syllables. After comparison of their segmentation points, it is observed that out of 500 words the cascaded


combination has given correct locations for 410 words. The following figures, 6.44 to 6.49, show some examples of segmentation with this combination.

Fig. 6.44: Segmentation for two syllable word AmXoe (Adesh)

Fig. 6.45: Segmentation for two syllable word Z¨Va (nantar)


Fig. 6.46: Segmentation for three syllable word ~amo~a (barobar)

Fig. 6.47: Segmentation for three syllable word AÜ``Z (adhyayan)


Fig. 6.48: Segmentation of four syllable word AnmaXe©H (apardarshak)

Fig. 6.49: Segmentation of four syllable word M_H Umam (chamaknara)


Apart from the purely neural network approach, a classification network combined with a neural network (the K-means and Maxnet combination) is used for syllable segmentation, as shown in the following section.

6.15 NEURAL-CLASSIFICATION APPROACH (K-MEANS WITH MAXNET) FOR SYLLABIFICATION:

6.15.1 Analysis of K-means algorithm: K-means is a clustering algorithm. Clustering is the process of partitioning a given set of patterns into disjoint clusters, so that patterns in the same cluster are alike while patterns in different clusters are different. In this algorithm, k average (mean) values are determined about which the data can be clustered. K-means is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point the k centroids are re-calculated as barycenters of the clusters resulting from the previous step. With these k new centroids, a new binding is done between the same data set points and the nearest new center. A loop is thus generated: the k centers change their location step by step until no more changes occur, that is, the centers do not move any more. Finally, this algorithm aims at minimizing an objective function known as the squared error function, given by:

$J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \| x_i - v_j \| \right)^{2}$ ….……………………………(6.23)

Where,

||xi - vj|| is the Euclidean distance between xi and vj.

158

Position based syllabification using neural and non-neural techniques

(The distance between a data point and the cluster center vj indicates how far the n data points lie from their respective cluster centers.)

ci is the number of data points in the ith cluster and c is the number of cluster centers.

The general algorithm steps are given below:
1) Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2) Assign each object to the group that has the closest centroid.
3) When all objects have been assigned, recalculate the positions of the K centroids.
4) Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
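A minimal Python/NumPy sketch of steps 1) to 4) is given below. It is an illustration only, under the assumption that each point is a (frame number, energy amplitude) pair; the function names and the toy energy values are not from the thesis code.

import numpy as np

def kmeans(points, centroids, max_iter=100):
    # steps 1-4 above: assign to the nearest centroid, recompute centroids,
    # stop when the centroids no longer move
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Euclidean distance of every point to every centroid (eq. 6.24)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  if np.any(labels == k) else centroids[k]
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# assumed usage for a 2-syllable word: points are (frame number, energy) pairs
# taken from the smoothed energy plot; k = 2 centroids are chosen
energy = np.array([0.9, 0.8, 0.5, 0.2, 0.3, 0.7, 0.85, 0.9])
points = np.column_stack([np.arange(len(energy)), energy])
labels, centres = kmeans(points, centroids=[[1.0, 0.9], [5.0, 0.8]])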

This algorithm explains the general steps implemented in k-mean clustering. The actual step by step implementation for segmentation is given in the following section.

6.15.2 Step by step implementation of K-means for segmentation:
 In syllable cutting using K-means, the energy plot of the input word (the word which is to be segmented into syllables) is used for clustering.
 The main aim is to find the minima (low energy points) of the energy plot for the provided input word.
 While carrying out clustering, intra-cluster distances are minimized and inter-cluster distances are maximized. There are two types of clustering: a) partitioned clustering and b) hierarchical clustering.
 Partitioned clustering is used here: the data objects are divided into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
 In partitioned clustering, each cluster is associated with a centroid (center point). Each point is assigned to the cluster with the closest centroid.


 The number of clusters, k, must be specified. In syllable segmentation using the K-means algorithm, k depends on the number of syllables in the given input word: for a 2 syllable word k = 2, for a 3 syllable word k = 3, and so on.
 The centroids are selected so that the minima lie in the 2nd cluster of the given 2 or 3 syllable word. Then, using Maxnet, the minima are located in the respective cluster.
 Each centroid has two parameters: frame number and amplitude of the energy plot. The distance of each sample is calculated from each centroid.
 After calculating the distance of the sample (the point to be clustered) with respect to each centroid, the minimum distance is used as the classifier value. If the distance of the sample to the first centroid is smaller, the sample belongs to the first cluster, otherwise it belongs to the second cluster. In this way all samples (in the energy plot of the given input word) are classified.
 Once the cluster in which the minimum is present is known with the help of the K-means algorithm, it is given as input to the Maxnet algorithm, which gives the exact frame number of the minimum position (a minimal Maxnet sketch follows this list).
 This combination of a classification network (K-means) and a uni-layer NN (Maxnet) is used for the first time for speech segmentation of syllables from the given input words or text.
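The thesis does not list the Maxnet equations at this point; the following minimal Python/NumPy sketch of a standard Maxnet winner-take-all iteration is included only as an illustration of how the exact frame can be picked from the candidate scores handed over by K-means (or by back-propagation in section 6.14.6). The function name, the inhibition weight and the toy scores are assumptions.

import numpy as np

def maxnet(scores, epsilon=None, max_iter=1000):
    # standard Maxnet competition: every node inhibits the others until only
    # one activation stays positive; the index of that node is the winner
    a = np.asarray(scores, dtype=float).copy()
    if epsilon is None:
        epsilon = 1.0 / len(a)            # inhibition weight, 0 < epsilon <= 1/m
    for _ in range(max_iter):
        a_new = np.maximum(0.0, a - epsilon * (a.sum() - a))
        if np.count_nonzero(a_new > 0.0) <= 1:
            a = a_new
            break
        a = a_new
    return int(np.argmax(a))

# assumed usage: 'frame_scores' are the per-frame scores handed over by K-means
# (or back-propagation); the surviving node marks the exact segmentation frame
frame_scores = [0.2, 0.9, 0.4, 0.1]
exact_frame = maxnet(frame_scores)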

6.15.2.1 Purpose of algorithm This algorithm is used to find or classify the minima locations. Then the Maxnet algorithm is used to fix the exact location of each minimum. This combination of a classification network and a neural network gives better results than the other segmentation methods. The segmentation accuracy is 90%, the highest among all segmentation approaches.

6.15.2.2 Methodology used
 Centroids are selected according to the number of syllables in the word.
 Each centroid contains two parameters: 1) frame number and 2) amplitude of the energy plot.


 The cluster which is selected by the K-means algorithm is passed to Maxnet for selecting the minimum. Figure 6.50 shows clustering carried out using K-means.
 In K-means, distance is calculated as

$D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ ………………….……………….(6.24)

where x1 and x2 are frame numbers and y1 and y2 are amplitudes. Point (x1, y1) is the cluster point (centroid) and point (x2, y2) is the point to be clustered. D is the distance between the cluster point and the point to be clustered.

Fig. 6.50: Clustering using K-means

For example, let centroid1 = (5, 2) and centroid2 = (10, 7), where the first parameter is the frame number and the second parameter is the amplitude. Suppose a point (6, 3) is to be clustered; the two distances with respect to the two centroids are
$D_1 = \sqrt{(6 - 5)^2 + (3 - 2)^2} = \sqrt{2}$
$D_2 = \sqrt{(6 - 10)^2 + (3 - 7)^2} = \sqrt{32}$
Since D1 < D2, the point (6, 3) belongs to cluster 1.


6.15.2.3 Results/outcome of the algorithm The output is the set of accurately segmented syllable locations for 2, 3, 4 and 5 syllable words; these locations should match the manually determined reference points.

6.15.2.4 Authentication and testing of results The 500 most common Marathi words are tested with the K-means algorithm. Segmentation locations are calculated for all these words and then compared with the manual segmentation values. Once the comparison is carried out for all 500 words, the segmentation accuracy is calculated as a percentage. About 455 out of 500 words have given the same segmentation locations as the manual segmentation values. This resulted in 90% segmentation accuracy.

6.15.3 Results of K-means algorithm: Some results for 2, 3 syllable words are shown below:

K-means algorithm: (2-syllable words) Minima locations of respective words are shown at the bottom of every figure. These values indicate respective frame numbers of energy plot for given input word.

A§pH V (Ankit)

Fig. 6.51: K-means output for A§pH V (Ankit)


Minima point: 18 as shown in the figure 6.51 above

AãXþc (Abdul)

Fig. 6.52: K-means output for AãXþc (Abdul)

Minima point: 10 as shown in the figure 6.52 above

For K-means the centroids were selected so that the minimum lies in the 2nd cluster for most of the 2 and 3 syllable words. Then, using Maxnet, the minimum can be located, as shown in the first example in figure 6.51 at point 18. The left diagram is the energy plot while the right diagram is the energy plot after smoothing. Some more results are shown below:


AmJ_Z (agaman)

Fig. 6.53: K-means output for AmJ_Z (agaman)

Minima point: 17 as shown in the figure 6.53 above

pZðm (nishtha)

Fig. 6.54: K-means output for pZðm (nishtha)

Minima point: 17 as shown in the figure 6.54 above.

K-means algorithm: 3-syllable words


A_mÝç (amanya)

Fig. 6.55: K-means output for A_mÝç (amanya)

Minima points at frame number 8 and 33 as shown in the figure 6.55 above

gX¡d (sadaiv)

Fig. 6.56: K-means output for gX¡d (sadaiv)

Minima points at frame number 24 and 39 as shown in the figure 6.56 above


ApídZr (Ashwini)

Fig. 6.57: K-means output for ApídZr (Ashwini)

Minima points at frame number 8 and 33 as shown in the figure 6.57 above

ApXVr (Aditi)

Fig. 6.58: K-means output for ApXVr (Aditi)

Minima points at frame no 12 and 33 as shown in the figure 6.58 above


More than 500 words have been tested and the resulting accuracy for minima location is 90%. Manual segmentation is used as the reference for the segmentation accuracy of this algorithm. Out of the 500 words tested, 450 words show correct segmentation locations.

6.16 NON-NEURAL METHODS FOR SYLLABLE SEGMENTATION:

6.16.1 Simulated annealing: The following section explains this algorithm.

6.16.1.1 Purpose of algorithm This non-neural method is used to find the concatenation point i.e. joint of syllable for given input word.

6.16.1.2 Methodology used
 Simulated annealing is a generic probabilistic approach to global optimization: it locates a good approximation to the global optimum of a given function in a large search space.
 In the SA algorithm, random frame numbers are selected.
 If the energy of the present frame number is lower than that of the previous one, this frame is selected.
 A temperature variable (T) is used to avoid locking of the algorithm into a local minimum.
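A minimal Python sketch of such a simulated annealing search over the frame numbers of an energy plot is given below. It is illustrative only; the cooling schedule, parameter values and names are assumptions rather than the thesis implementation.

import math
import random

def sa_minimum_frame(energy, t0=1.0, cooling=0.95, steps=500, seed=0):
    # random frame numbers are tried; a lower-energy frame is always accepted,
    # a higher-energy frame is accepted with a temperature-dependent probability
    rng = random.Random(seed)
    current = rng.randrange(len(energy))
    best = current
    t = t0
    for _ in range(steps):
        candidate = rng.randrange(len(energy))
        delta = energy[candidate] - energy[current]
        if delta < 0 or rng.random() < math.exp(-delta / max(t, 1e-9)):
            current = candidate
        if energy[current] < energy[best]:
            best = current
        t *= cooling        # the temperature T is gradually reduced
    return best

# assumed usage: 'energy' is the smoothed energy plot of the input word and the
# returned frame index is taken as the candidate segmentation (minima) point
energy = [0.9, 0.7, 0.4, 0.2, 0.35, 0.8, 0.95]
minima_frame = sa_minimum_frame(energy)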

6.16.1.3 Results/outcome of the algorithm The minima locations are found with this algorithm. Although the accuracy is limited, this algorithm is simple to implement and provides moderate accuracy [59]. When the 500 most common Marathi words are tested for segmentation accuracy, this algorithm gives 75% accuracy.


6.16.1.4 Testing and authentication of results To calculate the segmentation accuracy of this method, the 500 most common Marathi words are tested. The segmentation locations of all the words are compared with manual segmentation. Out of 500 words, 375 words have given the same segmentation locations as the manual segmentation values. This accuracy is limited compared to the neural network based methods of speech segmentation.

6.16.2 Results of simulated annealing:

2-syllable words

In all the following figures, the segmentation point or minima value is given at the bottom. All these minima values are the respective frame numbers of the energy plot for the given input word.

A§pH V (Ankit)

Fig. 6.59: Simulated annealing output for A§pH V (Ankit)

Minima point at frame number 8 as shown in the figure 6.59 above


In simulated annealing, random frame numbers are selected and compared with the previous frame amplitudes to locate the minimum. The red spots indicate the randomly chosen frame numbers and finally the minimum is located at point number 8.

pZð m (nishtha)

Fig. 6.60: Simulated annealing output for pZð m (nishtha)

Minima point at frame number 7 as shown in the figure 6.60 above

AmJ_Z (agaman)

Fig. 6.61: Simulated annealing output for AmJ_Z (agaman)


Minima point at frame number 7 as shown in the figure 6.61 above

AãXþc (Abdul)

Fig. 6.62: Simulated annealing output for AãXþc (Abdul)

Minima point at frame number 1 as shown in the figure 6.62 above. Here the minimum is not located properly.

Simulated annealing: 3 syllable words

ApXVr (Aditi)

Fig. 6.63: Simulated annealing output for ApXVr (Aditi)


Minima points at frame number 4 and 19 as shown in the figure 6.63 above.

ApídZr (Ashwini)

Fig. 6.64: Simulated annealing output for ApídZr (Ashwini)

Minima points at frame number 15 and 30 as shown in the figure 6.64 above.

A_mÝç (amanya)

Fig. 6.65: Simulated annealing output for A_mÝç (amanya)


Minima points at frame number 23 and 30 as shown in the figure 6.65 above.

gX¡d (sadaiv)

Fig. 6.66: Simulated annealing output for gX¡d (sadaiv)

Minima points at frame number 4 and 24 as shown in the figure 6.66 above.

Thus simulated annealing gives moderate segmentation accuracy. About 500 words are tested to find the minima/segmentation locations, but the accuracy remains limited to 75%. After comparison with manual segmentation, it is found that out of 500 words, only 375 words show correct segmentation locations.

One more non-neural algorithm, slope detection, is tested; it is explained in the following section.

6.16.3 Slope detection algorithm: The following section explains this algorithm.

6.16.3.1 Purpose of algorithm This is a simple arithmetic method for finding minima locations. This non-neural approach is used to locate the segmentation points of the provided input word.


6.16.3.2 Methodology used The procedure for finding minima locations needs the normalized energy graph of the given speech input. The following steps show how the minima locations are obtained from the energy waveform (a minimal sketch follows this list).
 The energy plot obtained after the normalization and smoothing procedure is used for syllable cutting in the non-neural approach. The successive difference between frames is calculated by subtracting the current frame from the next frame, as shown below.
 Slope = energy_amplitude (x+1) - energy_amplitude (x)
 The objective is to locate the points of inflection on the energy plot where the slope changes from negative to positive. A frame where the slope changes from negative to positive is a minima location, i.e. a segmentation point.
 By locating all such points, syllable cutting is carried out.
 Consonants have lower energy as compared to vowels and appear as valleys on the energy graph; vowels have more energy and appear as peaks.
 The aim is to find minima points on the energy graph for syllabification (cutting of syllables from words).
 Syllabification is carried out at the points where the slope turns from negative to positive.
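The sketch referred to above is given here in Python; it only illustrates the negative-to-positive slope test on a toy energy plot, and the function name and values are assumptions, not the thesis code.

def slope_detection_minima(energy):
    # a frame is a minima/segmentation point when the slope of the energy plot
    # changes from negative to positive around it
    minima = []
    for x in range(1, len(energy) - 1):
        prev_slope = energy[x] - energy[x - 1]
        next_slope = energy[x + 1] - energy[x]
        if prev_slope < 0 and next_slope > 0:
            minima.append(x)
    return minima

# assumed usage on a smoothed, normalised energy plot of a 3-syllable word
energy = [0.9, 0.6, 0.3, 0.5, 0.8, 0.4, 0.2, 0.6, 0.9]
print(slope_detection_minima(energy))    # prints [2, 6]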

6.16.3.3 Results/outcome of the algorithm Slope detection gives the segmentation output, i.e. the minima locations. The resulting accuracy of the segmented syllables is acceptable and the resulting speech output sounds natural. A segmentation accuracy of 77% is obtained after implementing this algorithm. The 500 most common Marathi words are tested with this algorithm and the segmentation locations are compared with manual segmentation. The words which give the same locations as the manual segmentation are noted: in total, 385 words out of 500 give correct segmentation results. Hence this algorithm results in 77% segmentation accuracy, which is acceptable for naturalness. However, neural network based segmentation gives higher segmentation accuracy (90%) than the non-neural methods.


6.16.4 Results of slope detection: In all the following result diagrams, the segmentation location (minimum) is shown at the bottom of the figure. This minima location (segmentation point) is the respective frame number of the energy plot for the given input word.

2-syllable words

A§pH V (Ankit)

Fig. 6.67: Slope detection output for A§pH V (Ankit)

Minima at frame number 18 as shown in the figure 6.67 above.

In this case, as observed at point 18, the slope of the energy plot changes from negative to positive. The left diagram is the energy plot while the right diagram is a smoothened energy plot. Examples of some more words are given below:

AmJ_Z (agaman)


Fig. 6.68: Slope detection output for AmJ_Z (agaman)

Minima point at frame number 17 for the figure 6.68 above

AãXþc (Abdul)

Fig. 6.69: Slope detection output for AãXþc (Abdul)

Minima point at frame number: 10 as shown in the figure 6.69 above.

pZð m (nishtha)

Fig. 6.70: Slope detection output for pZð m (nishtha)

Minima point at frame number: 16 as shown in the figure 6.70 above.

Slope detection algorithm: 3 syllable words

A_mÝç (amanya)

Fig. 6.71: Slope detection output for A_mÝç (amanya)


Minima points at frame number 8 and 32 as shown in the figure 6.71 above.

ApídZr (Ashwini)

Fig. 6.72: Slope detection output for ApídZr (Ashwini)

Minima points at frame number 24 and 39 as shown in the figure 6.72 above

ApXVr (Aditi)

Fig. 6.73: Slope detection output for ApXVr (Aditi)


Minima points at frame number 12 and 29 as shown in the figure 6.73 above.

gX¡d (sadaiv)

Figure 6.74: Slope detection output for gX¡d (sadaiv)

Minima points at frame number 12 and 34 as shown in the figure 6.74 above.

The resulting accuracy of the segmented syllables is acceptable and the resulting speech output sounds natural. A segmentation accuracy of 77% is obtained from this algorithm. The results are compared with manual segmentation locations and, out of 500 words, 385 words give accurate minima locations. From this authentication of results, it is found that the segmentation accuracy of the neural network based methods is higher than that of the non-neural methods.

6.17 SUMMARY OF SEGMENTATION RESULTS FOR SIMULATED ANNEALING, SLOPE DETECTION AND K-MEANS MAXNET COMBINATION:

All the segmentation results for different words (2 syllables, 3 syllables and 4 syllables etc) are collectively presented in tabular form below. Red color results in the


table indicate wrong values with respect to the reference values. Here the reference values are the manual segmentation results. A comparison of the three algorithms (simulated annealing, slope detection and K-means with Maxnet) is shown below. All these results show segmentation locations as frame numbers. As can be observed from the tables, the accuracy of K-means is better and more consistent than that of the other two algorithms. Maxnet results are not included as the segmentation accuracy of that algorithm is limited to 70% [58].

6.17.1 Results for 2-syllable words:

Table 6.1: Comparative results of segmentation for 2 syllable words

Word | Simulated annealing | Slope algorithm | K-means
Amam_ (aram) | 19 | 19 | 19
A¨Hw a (ankur) | 27 | 30 | 31
AmË_m (atma) | 22 | 18 | 19
emoYH (shodak) | 30 | 32 | 29
OdmZ (jawan) | 19 | 15 | 16
A[à` (apriya) | 18 | 16 | 15
X³pdS (Dravid) | 15 | 11 | 12
AãXwb (Abdul) | 5 | 8 | 9
CKS (ughad) | 9 | 8 | 8
A¨pH V (Ankit) | 14 | 15 | 16
M§Ö (chandra) | 28 | 23, 28 | 28
Midi (chalwal) | 16 | 21 | 20
qM~ (chimb) | 19 | 18 | 19
M¨w~H (chumbak) | 20 | 19 | 21
Mî_m (chashma) | 23 | 8, 23 | 23
nmRH (Pathak) | 8 | 13 | 13
MooVZ (Chetan) | 20 | 19 | 20
AmOmar (ajari) | 13 | 12 | 13
XOm© (darja) | 21 | 15 | 21
Xm¡am (daura) | 17 | 17 | 18
pXem (Disha) | 9 | 14 | 14
XwJm© (Durga) | 15 | 16 | 17

6.17.2 Results for 3-syllable words:

Table 6.2: Comparative results of segmentation for 3 syllable words

Word | Simulated annealing | Slope algorithm | K-means
A[^foH (Abhishek) | 13, 29 | 12, 28 | 13, 28
Ap^_mZ (Abhiman) | 5, 29 | 10, 33 | 11, 35
A[YH ma (adhikar) | 14, 32 | 15, 30 | 15, 30
AmMm`© (acharya) | 19, 48 | 14, 40 | 20, 48
Ampef (Ashish) | 12, 42 | 12, 42 | 13, 41
A[XVr (Aditi) | 10, 26 | 11, 26 | 10, 27
A[ídZr (Ashwini) | 22, 38 | 23, 38 | 23, 39
AnUm© (Aparna) | 5, 26 | 5, 26 | 25, 26
C¨~aRm (umbartha) | 13, 30 | 12, 31 | 13, 31
A_¥V (Amrut) | 14, 25 | 15, 25, 35 | 15, 25
A_mÝ` (amanya) | 8, 31 | 9, 31 | 8, 30
pdXyfr (Vidushi) | 10, 26 | 10, 27, 41 | 11, 27
M_H Xma (chamkadar) | 10, 27 | 11, 27 | 10, 28
M[dï (chavishta) | 13, 36 | 13, 35 | 12, 35
CXmgrZ (udasin) | 8, 29 | 9, 30 | 9, 29
M§XZm (Chandana) | 13, 19 | 19, 34 | 19, 34
X[jU_wIr (dakshinmukhi) | 11, 45 | 11, 45 | 11, 45
MwUMwUrV (chunchunit) | 15, 30 | 15, 31 | 15, 30
dñVwpñWVr (vastusthiti) | 13, 13 | 13, 29 | 13, 29
CX²KmQ Z (udghatan) | 9, 24 | 5, 24 | 24, 26
MmbT H b (chaldhakal) | 29, 38 | 29, 38 | 25, 38
XrK©H mi (dirghakaal) | 13, 29 | 14, 30 | 14, 29
XodXËV (Devdatta) | 24, 37 | 25, 36 | 24, 36
XmVoar (dateri) | 13, 31 | 13, 32 | 14, 32
~ZmdQ (banavat) | 12, 31 | 13, 32 | 12, 32
AgmdoV (asavet) | 9, 29 | 10, 29 | 9, 28
MmH moar (chakori) | 14, 35 | 14, 35 | 14, 35

6.17.3 Results for 4-syllable words:

Table 6.3: Comparative results of segmentation for 4 syllable words

Word | Simulated annealing | Slope algorithm | K-means
MmbypJar (chalugiri) | 23, 43, 61 | 24, 44, 62 | 24, 44, 61
MhyH SyZ (chahukadun) | 12, 29, 44 | 11, 30, 44 | 11, 29, 44
~¨JmbgmRr (Bangalsathi) | 22, 44, 61 | 21, 43, 61 | 21, 44, 61
Apd^mÁ` (avibhajya) | 19, 29, 47 | 18, 30, 47 | 18, 29, 47
Apdídmg (avishvas) | 18, 29, 43 | 18, 29, 43 | 18, 29, 43
X¡Z¨pXZr (dainandini) | 16, 40, 51 | 17, 41, 51 | 17, 40, 51
XamoSoImoa (darodekhor) | 11, 30, 46 | 12, 31, 46 | 12, 30, 46
Xm¡è`mgmRr (dauryasathi) | 19, 35, 50 | 21, 37, 50 | 20, 36, 50
XoUoKoUo (deneghene) | 38, 50, 56 | 18, 39, 56 | 18, 38, 56

After testing different words (2, 3 and 4 syllables), it is observed that the different segmentation approaches give good results. The most consistent results, with the highest segmentation accuracy, are obtained with the K-means algorithm [60]. More than 500 of the most frequent words in the Marathi language are tested with these different algorithms and the results are collected in tabular form.

6.17.4 Statistical comparison of different segmentation techniques: The following table 6.4 shows a statistical comparison of the different segmentation techniques discussed in this thesis. The last column shows the resulting segmentation accuracy of each method. From this comparison it is clear that the segmentation accuracy of the K-means and Maxnet combination is the highest among all the methods, which shows that the neural network based approach gives more accurate results.

Table 6.4: Statistical comparison of different segmentation techniques

Sr. No. | Name of segmentation technique | Number of words tested | Number of words with correct segmentation | Segmentation accuracy (%)
1 | Maxnet | 500 | 350 | 70
2 | Simulated annealing | 500 | 375 | 75
3 | Slope detection | 500 | 385 | 77
4 | Back-propagation-Maxnet | 500 | 410 | 82
5 | K-means with Maxnet | 500 | 450 | 90

6.18 GUI:

 A graphical user interface (GUI) is a type of user interface that allows people to interact with programs in a friendlier way, using images rather than text commands, e.g. on computers, MP3 players, portable media players, gaming devices, household appliances and office equipment.

 A GUI offers graphical icons and visual indicators to represent the information and actions available to a user. The actions are usually performed through direct manipulation of the graphical elements. A GUI has been implemented for the word XoemVë`m (deshatlya). Results for the


three approaches of segmentation have been shown [61]. The GUI shows the point of syllabification (minima location) and the number of syllables for given input word.

6.18.1 GUI results for K-means: The following figures show GUI results for the different segmentation algorithms and how the GUI displays the different results. There are three radio buttons for selecting the syllabification approach. Once a syllabification approach is selected, the 'plot graph' push button needs to be pressed; it displays the graph of the given input word for the selected syllabification approach. At the bottom of the graph, the number of syllables and the minima locations are displayed.

Fig. 6.75: GUI snapshot for K-means

6.18.2 GUI results for slope detection:


Fig. 6.76: GUI snapshot for slope algorithm

6.18.3 GUI Results for simulated annealing:

Fig. 6.77: GUI snapshot for simulated annealing


6.19 CONCLUSION OF SYLLABIFICATION:

These segmentation results show that accurate syllabification is very important for improving synthetic speech quality. While forming syllables, considering the syllable position improves the contextual flow and spectral alignment and hence produces more natural sounding speech. Among the different syllabification approaches, it has been observed that the neural-classification combination (K-means with Maxnet) gives more accurate and proper segmentation (90% segmentation accuracy) than the other neural and non-neural methods.


Chapter-7

Removal of spectral mismatch using PSOLA for speech improvement


CHAPTER- 7

REMOVAL OF SPECTRAL MISMATCH USING PSOLA FOR SPEECH IMPROVEMENT

No. | Title of the contents | Page no.
7.1 | Introduction | 187
7.2 | Removal of spectral mismatch using PSOLA | 187
7.2.1 | Why choose PSOLA? | 188
7.2.2 | Block diagram of PSOLA for spectral reduction | 188
7.2.3 | System flowchart | 189
7.2.4 | Pre-processing of PSOLA | 189
7.2.5 | Pitch estimation steps | 190
7.2.6 | Pitch estimation details | 191
7.2.7 | Pitch marking | 193
7.2.8 | Block diagram of pitch marking | 193
7.3 | TD-PSOLA | 195
7.3.1 | Purpose of TD-PSOLA algorithm | 196
7.3.2 | Methodology used | 196
7.3.3 | Results/outcome of the algorithm | 196
7.3.4 | Step by step explanation of PSOLA block diagram | 196
7.4 | TD-PSOLA implementation | 198
7.4.1 | Detail algorithm of TD-PSOLA | 199
7.4.2 | Pseudo code of PSOLA | 200
7.5 | Results and discussion of TD-PSOLA | 204
7.5.1 | Comparison of original and PSOLA-processed output | 205
7.5.2 | GUI used for testing | 206
7.6 | Performance evaluation of PSOLA | 207
7.6.1 | Subjective evaluation | 207
7.6.2 | Objective evaluation | 208
7.7 | Conclusion | 209


CHAPTER- 7

REMOVAL OF SPECTRAL MISMATCH USING PSOLA FOR SPEECH IMPROVEMENT

7.1 INTRODUCTION:

It is extremely difficult to make a machine that sounds identical to a human. Hence even the best text to speech (TTS) algorithm sounds robotic unless human speech itself is involved in it. But it is not possible to create a database of every word in any language. Syllable based concatenative speech synthesis (CSS) allows the formation of new words from the existing words and syllables in the database. The most important qualities of a speech synthesis system are naturalness and intelligibility. The present Marathi TTS system automatically produces speech by storing small segments of speech and concatenating them when required.

Among the different methods of speech improvement, pitch modification is one of the simplest. If the pitch is modified at the concatenation joint, the glitch or spectral mismatch can be reduced [62]. It is modified for the frame before the joint and the frame after the joint. A simple time domain technique, time-domain pitch synchronous overlap and add (TD-PSOLA), is used to improve the speech synthesis output. In this chapter, PSOLA analysis, its implementation and the speech synthesis results are discussed for spectral distortion reduction at the concatenation joint.

7.2 REMOVAL OF SPECTRAL MISMATCH USING PSOLA:

PSOLA refers to a family of signal processing techniques that are used to perform time-scale and pitch-scale modification of speech. The basis of all PSOLA techniques is to isolate pitch periods in the original signal, perform the required modification and re-synthesize the final waveform through an overlap-add operation. The aim of this work is to increase the naturalness of text to speech synthesis by comparing pre-recorded speech (original speech) with the PSOLA processed output and improving the

speech quality. The system produces synthetic speech with the desired pitch scale and time scale modification for an arbitrary input speech signal so that there is no spectral mismatch at the point of concatenation. The similarity between the original word and the output obtained after PSOLA processing is evaluated subjectively using the MOS test and objectively using our new algorithm of percentage error calculation.

7.2.1 Why choose PSOLA? PSOLA is an effective method for speech signal analysis and synthesis. It is a simple and computationally economical method to implement, and it can be used without compromising the high synthetic quality [63]. PSOLA in its time domain form (TD-PSOLA) is implemented here for spectral mismatch reduction at the concatenation joint.

7.2.2 Block diagram of PSOLA for spectral reduction:

Fig. 7.1: Block diagram of the PSOLA method for spectral reduction (text input → search in audio database → cut into syllables → concatenation of syllables → PSOLA → speech output)

As shown in figure 7.1, the two important blocks are 'Concatenation of syllables' and 'PSOLA'. The diagram shows how the PSOLA method can be used to reduce the speech distortion at the concatenation point. If the input word is not in the audio database, the word is formed from syllables. After synthesizing the required input word, the spectral distortion present at the concatenation joint is reduced by applying the PSOLA algorithm. The following sections explain how the resulting PSOLA speech output reduces the glitch type of distortion present at the concatenation joint of the synthesized word.


7.2.3 System flowchart: The following figure 7.2 shows the system block diagram for reducing the spectral mismatch present at the joint. It shows how the PSOLA technique is used for reducing spectral distortion. Before the actual PSOLA algorithm is used, two pre-processing steps need to be carried out: pitch estimation and pitch marking. These two steps are explained in detail in the following sections.

Fig. 7.2: System block diagram for reduction of concatenation cost using PSOLA (segmentation and cutting of syllables → concatenation → selection of method with pitch estimation and pitch marking → PSOLA)

7.2.4 Pre-processing of PSOLA: The following figure 7.3 shows the block diagram of the PSOLA system with the two pre-processing steps of pitch estimation and pitch marking. The synthesized speech after concatenation is given as input to the pitch estimation algorithm. After pitch estimation, pitch marking is carried out. Both the pitch scale and the time scale are given as input to the TD-PSOLA block. This algorithm improves the synthesized speech and reduces the spectral distortion present at the concatenation joint.


Fig. 7.3: TD-PSOLA system with pitch estimation and marking (synthesized speech → pitch estimation → pitch marking → TD-PSOLA with pitch scale and time scale → improved speech)

Pitch estimation is the first step for getting synthetic speech from TD-PSOLA. The following section shows pitch estimation in detail.

7.2.5 Pitch estimation steps:
 Input: Speech signal
 Output: Pitch contour
 Method: Autocorrelation
The following figure 7.4 shows the block diagram of pitch estimation.

Fig. 7.4: Block diagram of pitch estimation (input speech → pitch estimation → adaptable filter → autocorrelation → pitch periods)


The following flowchart 7.5 shows pitch estimation and marking steps in detail.

Fig. 7.5: Pitch estimation and marking illustration (speech → low pass filter → speech frames → center clipping → pitch calculation with voiced/unvoiced decision → frame pitch → median filter → pitch contour)

For each voiced speech segment, the global maximum location t_m is the first pitch mark. The first search region is defined using the pitch period value T_0 at t_m:

$SR = [\,t_m + f \cdot T_0,\; t_m + (2 - f) \cdot T_0\,]$ ……………………..(7.1)

The next search region is formed by applying formula (7.1) to the new t_m, chosen as the maximum sample of the current search region. The value of f is usually set to 0.7.
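As an illustration of formula (7.1), the following small Python sketch generates successive search regions. For simplicity it anchors each new region at the centre of the previous one, whereas the actual method uses the maximum sample of the current search region; all names and values are assumptions.

def pitch_mark_search_regions(t_m, T0, n_regions, f=0.7):
    # successive applications of eq. (7.1); the anchor update is a simplification
    regions = []
    anchor = t_m
    for _ in range(n_regions):
        start = anchor + f * T0
        end = anchor + (2.0 - f) * T0
        regions.append((start, end))
        anchor = 0.5 * (start + end)
    return regions

# assumed usage with a first pitch mark at sample 100 and a pitch period of 80 samples
print(pitch_mark_search_regions(t_m=100, T0=80, n_regions=3))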

7.2.6 Pitch estimation details:
 Autocorrelation: correlate a window of speech with a previous window and find the best match.


 Peak and center clipping: this step reduces the false peaks. Clip at the top/bottom of the signal and center the remainder around 0.
There are many pitch estimation methods for speech signals. Most of them belong to one of two major approaches: 1) time-domain and 2) frequency-domain. Many popular and reliable methods use the autocorrelation function (ACF) measured in the time-domain. The ACF is defined as

$r(\tau) = \sum_{i=1}^{W-\tau} x_i \, x_{i+\tau}$ …………………..……….….(7.2)

where W is the size of the rectangular window and τ is the lag (candidate pitch period in samples).

The ACF is defined in the time-domain but can be computed efficiently in the frequency-domain via the fast Fourier transform (FFT) and its inverse transform:

$r(n) = \mathrm{iFFT}\{\, | \mathrm{FFT}[x(k)] |^{2} \,\}$ ………………….…….………(7.3)

The ACF emphasizes spectrum peaks, so it is robust to noise but sensitive to formants. However, this weakness can be overcome by a 'spectral whitening' method; center clipping is one such method. Center clipping is defined as

$C[x(n)] = \begin{cases} x(n) - C_L, & x(n) > C_L \\ 0, & |x(n)| \le C_L \\ x(n) + C_L, & x(n) < -C_L \end{cases}$

Here C_L is the clipping threshold, generally set to 30% of the maximum magnitude of the signal. The ACF is a smooth transition between time-domain and frequency-domain methods. It is important to understand both single pitch and multi-pitch estimation methods, because the ACF is often borrowed to explain other periodicity measurement functions, and 'auditory models based on autocorrelation are currently one of the more popular ways to explain pitch perception'.

A speech signal and its corresponding ACF are shown in the following figure 7.6, which is the output of our Matlab program. If the input signal is periodic, the ACF is

periodic too and shows peaks at multiples of the pitch period. A small windowed speech signal after normalization is given to the ACF algorithm, as shown in the first window of the following figure. The output of the autocorrelation function shows the spectrum peaks in the second window. The difference between two peaks gives the pitch period/frequency, from which the pitch of the given input signal can be estimated.

Fig. 7.6: Speech signal and its corresponding ACF of Matlab program

The speech signal has a pitch range from 60 Hz to 500 Hz; a low pass filter is used to cut off frequencies above 1000 Hz. This pre-processing step reduces noise and hence improves the accuracy. Center clipping is applied to each frame before calculation of the ACF. The frame pitch is estimated from the ACF of each frame, and the voiced/unvoiced decision is also based on the ACF. After that, a median filter is applied as a post-processing step to reduce the errors of this stage. A minimal sketch of this per-frame procedure is given below.
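The following minimal Python/NumPy sketch illustrates the per-frame procedure (centre clipping followed by the FFT-based ACF of equation (7.3) and a peak search in the 60-500 Hz lag range). It is an illustration only and is not the thesis's Matlab code; the function name, parameters and the synthetic test signal are assumptions.

import numpy as np

def frame_pitch_acf(frame, fs, fmin=60.0, fmax=500.0, clip_ratio=0.3):
    # centre-clip at 30% of the maximum magnitude, compute the ACF via the FFT
    # (eq. 7.3) and pick the strongest peak in the 60-500 Hz lag range
    x = np.asarray(frame, dtype=float)
    cl = clip_ratio * np.max(np.abs(x))
    clipped = np.where(x > cl, x - cl, np.where(x < -cl, x + cl, 0.0))
    spec = np.fft.rfft(clipped, 2 * len(clipped))
    acf = np.fft.irfft(np.abs(spec) ** 2)[:len(clipped)]
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max]))
    return fs / lag          # estimated frame pitch in Hz

# assumed usage on a synthetic 200 Hz frame sampled at 11000 Hz
fs = 11000
t = np.arange(int(0.03 * fs)) / fs
print(round(frame_pitch_acf(np.sin(2 * np.pi * 200 * t), fs)))   # approximately 200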

7.2.7 Pitch marking: The purpose of pitch marking is to determine the beginning and ending of a pitch cycle when the signal repeats itself using pitch contour estimated in the previous section. Pitch contour is a function that tracks the perceived pitch of the sound over time.

7.2.8 Block diagram of pitch marking: The following figure 7.7 shows the block diagram of pitch marking. The pitch estimation algorithm gives the pitch contour, i.e. the region where pitch marking has to be carried out. In pitch marking, this pitch contour is divided into search regions whose peaks or valleys form the pitch mark candidates; this division is based on the estimated pitch contour. In the second phase of

pitch marking, state transition probabilities are generally defined, and a dynamic programming algorithm is used to find the most likely pitch marks. The following diagram shows the flow of pitch marking.

Fig. 7.7: Block diagram of pitch marking (pitch contour → search region division → peak candidate selection → dynamic programming → fine-tuning → pitch marks)

After all search regions in a voiced segment are known, three peak candidates for pitch marks are selected from each search region. For each search region, only one peak candidate is chosen as the pitch mark. A minimum distance between two consecutive candidates is defined to ensure that no two candidates are too close together. The best pitch mark sequence is found by dynamic programming, which maximizes a probabilistic optimality criterion. There are two probabilities: the first relates to the peak candidate height, preferring higher amplitude peaks; the second relates to the distance between two pitch mark candidates in two consecutive search regions. Dynamic programming is applied to obtain the optimal pitch mark sequence, with k as the current search region, that maximizes the likelihood:

$P(k, j) = \max_i \big[ P(k-1, i) + \log t_k(i, j) \big] + \log s_k(j)$ ………………(7.4)

where s_k(j) is the score of peak candidate j in search region k, derived from its height, t_k(i, j) is the transition score relating candidate i of region k-1 to candidate j of region k, derived from their distance, and P(k, j) is the best cumulative score ending at candidate j of region k.
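The following minimal Python/NumPy sketch illustrates how equation (7.4) can be evaluated with dynamic programming (a Viterbi-style recursion). The score arrays, their construction and all names are assumptions made for the example, not the thesis implementation.

import numpy as np

def best_pitch_marks(log_s, log_t):
    # log_s[k, j] scores candidate j of region k (peak height); log_t[k, i, j]
    # scores the spacing between candidate i of region k-1 and candidate j of region k
    K, J = log_s.shape
    P = np.full((K, J), -np.inf)
    back = np.zeros((K, J), dtype=int)
    P[0] = log_s[0]
    for k in range(1, K):
        trans = P[k - 1][:, None] + log_t[k]     # P(k-1, i) + log t_k(i, j)
        back[k] = trans.argmax(axis=0)
        P[k] = trans.max(axis=0) + log_s[k]      # ... + log s_k(j)
    path = [int(P[-1].argmax())]                 # trace the optimal sequence back
    for k in range(K - 1, 0, -1):
        path.append(int(back[k, path[-1]]))
    return path[::-1]

# assumed usage with 3 search regions and 3 candidates per region (toy scores)
rng = np.random.default_rng(1)
print(best_pitch_marks(rng.normal(size=(3, 3)), rng.normal(size=(3, 3, 3))))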

After the dynamic programming step, fine tuning is carried out with a filter to reduce the distortion/noise in the resulting pitch marks. This step increases the accuracy of the pitch marks. In the following figure 7.8, the vertical lines are the pitch marks, the horizontal line is the pitch contour, and the * shapes are the pitch mark candidates of each search region.


Fig. 7.8: Pitch marking of speech signal

7.3 TD-PSOLA:

PSOLA is a digital signal processing technique used for speech processing and more specifically for speech synthesis. It can be used to modify the pitch and duration of a speech signal. PSOLA works by dividing the speech waveform into small overlapping segments, which are then combined using an overlap-add technique. PSOLA can be implemented in both the time and the frequency domain. TD-PSOLA (time domain PSOLA) is easier to implement and gives good results. TD-PSOLA works pitch-synchronously, which means there is one analysis window per pitch period. A prerequisite for this is that we need to be able to identify the epochs in the speech signal. For PSOLA, it is vital that epochs are determined with great accuracy. Epochs may be the instants of glottal closure or any other instant, as long as it lies in the same relative position for every frame. The signal is then separated with a Hanning window, generally extending over two pitch periods. These windowed frames can then be recombined by placing their centers at the original epoch positions and adding the overlapping regions. Though the result is not exactly the same, the resulting speech waveform is perceptually indistinguishable from the original one. For unvoiced segments, a default window length of 10 ms is commonly used. PSOLA is implemented with the following steps of text to speech conversion for reduction of spectral mismatch at the concatenation joint:

1) Accept the text input in Marathi and encode this input.
2) Search for the word/syllable in the database. Concatenation of the syllables is carried out if the input word is not present in the database.


3) Remove/reduce the spectral mismatch/discontinuities from the concatenated word using PSOLA, giving speech output with less spectral mismatch.

7.3.1 Purpose of TD-PSOLA algorithm

The purpose is to remove the spectral mismatch at the point of concatenation by implementing the desired pitch and time scale modification at the concatenation joint of the speech signal.

7.3.2 Methodology used
 Input: Speech signal, pitch marks, pitch scale, time scale etc.
 Results/output: Improved synthetic speech
 Method:
- Target pitch mark calculation
- Unit waveform extraction
- Synthetic unit waveform allocation (based on new pitch marks)

7.3.3 Results/outcome of the algorithm The outcome is improved synthetic speech with less spectral noise after the overlap-and-add step of the algorithm. The final synthesis WAV file should have the same spectral characteristics as the input WAV file but with a different pitch and/or duration. The general implementation steps of the PSOLA method are shown in figure 7.9 in the following section.

7.3.4 Step by step explanation of PSOLA block diagram:  There are three steps in the PSOLA implementation. First, calculate the synthesis epochs from the analysis epochs. Second, use the Hanning window for mapping the analysis signal to the synthesis signal. Lastly, apply waveform mapping using the new time and pitch scale to obtain smoother speech.

 The following diagram (figure 7.9) shows how the analysis speech is used to extract unit waveforms and to reconstruct the spectrum. Time scale and pitch scale pre-processing is used for pitch estimation and pitch marking; these steps improve the quality of unvoiced sounds, voiced fricatives and over-stretched


sounds. After pitch marking, the generated synthesis unit waveforms are given as input to overlap and add algorithm.

 TD-PSOLA works pitch-synchronously, which means there is one analysis window per pitch period. The signal is then separated with a Hanning window, generally extending over two pitch periods. Pitch-scale modification is performed by recombining frames on epochs which are set at different distances apart from the original ones. The general steps of PSOLA are as shown below:
1) Find the pitch points for the given input speech signal.
2) Apply a Hanning window centered on each pitch point and extending to the next and previous pitch points (generally a 20 to 30 ms duration is used for this extension).
3) Add the windowed waves back together.

 We are familiar with speeding up or slowing down signals, which changes not only the duration but also the pitch. Lengthening is achieved by duplicating the frames. For a given set of frames, certain frames are duplicated, inserted back into the sequence, and then overlap-added. The result is a longer or slower speech waveform. Shortening is achieved by removing the frames. For a given set of frames, certain frames are removed, and the remaining ones are overlap-added. The result is a shorter or speeded speech waveform.

 The overlap-and-add step reduces the spectral misalignment and produces improved synthetic speech; a minimal overlap-add sketch is given after figure 7.9. Thus this simple time domain technique is used for spectral distortion reduction. The following block diagram (fig. 7.9) shows how spectrum reconstruction is carried out with the help of unit waveform extraction. Synthesis unit waveforms are prepared from the pitch mark calculation, for which the time scale and pitch scale inputs are provided. The synthesis unit waveforms are then provided as input to the 'overlap and add' block.

 Pitch and time scale modification of a speech signal has many applications, such as prosody modification in speech synthesis, voice morphing, singing voice correction and speech watermarking. In this work, a complete pitch and time scale modification system using the time-domain pitch synchronous overlap and


add method (TD-PSOLA) is implemented for spectral smoothening. TD-PSOLA is used to compensate for the phase difference of the syllables present in the words.

 The output of this block is synthetic speech with a new pitch and time scale adjustment for reduction of the spectral distortion present at the concatenation joint of the synthesized word.

Fig. 7.9: TD-PSOLA block diagram (analysis speech → unit waveform extraction and spectrum reconstruction; pitch marks with pitch scale and time scale → synthesis pitch mark calculation → synthesis unit waveforms → overlap and add → synthesis speech)
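To make the overlap-add idea concrete, the following minimal Python/NumPy sketch windows two-pitch-period frames at the analysis epochs and adds them back at shifted synthesis epochs. It is an illustration only, with a fixed pitch period and toy epoch spacing as assumptions; it is not the thesis's TD-PSOLA code.

import numpy as np

def overlap_add(signal, analysis_epochs, synthesis_epochs, period):
    # a two-pitch-period Hanning-windowed frame is taken around each analysis
    # epoch and added back centred on the corresponding synthesis epoch
    out = np.zeros(int(max(synthesis_epochs)) + 2 * period)
    win = np.hanning(2 * period)
    for a, s in zip(analysis_epochs, synthesis_epochs):
        if a - period < 0 or a + period > len(signal):
            continue                                   # skip frames outside the signal
        frame = signal[a - period:a + period] * win
        out[s - period:s + period] += frame            # overlap-add at the new epoch
    return out

# assumed usage: moving the epochs closer together raises the pitch of a voiced tone
fs, period = 11000, 55                                 # roughly a 200 Hz pitch
x = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
analysis = list(range(period, len(x) - period, period))
synthesis = [period + int(0.8 * (a - period)) for a in analysis]
y = overlap_add(x, analysis, synthesis, period)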

7.4 TD-PSOLA IMPLEMENTATION:

Generally TD-PSOLA is used for prosody modification, but we have used it for the first time for spectral mismatch reduction. The output of PSOLA is evaluated using a new algorithm of percentage error calculation, which shows that the PSOLA results are closer to the original, as shown in the following sections. Table 7.1 shows the subjective evaluation with the standard MOS test. Based on the pitch scale and time scale

information, new synthetic pitch marks are calculated. Unit waveforms are placed at the new synthetic pitch marks to produce an output signal with the desired pitch and duration. In this work, the time-domain technique called time domain pitch synchronous overlap and add (TD-PSOLA) is chosen because it is an effective and computationally economical technique that still provides high synthetic voice quality.

7.4.1 Detail algorithm of TD-PSOLA
1) Isolate the unit waveforms in the original speech signal and read it for processing.
2) Pitch estimation is carried out with the following steps:
a) Low pass filter the given input speech signal.
b) Center clipping is applied to the filtered signal to reduce the noise present in the speech frames.
c) Pitch calculation is carried out (single pitch) using the autocorrelation function (ACF). The ACF emphasizes spectrum peaks and correlates adjacent windows to find the best pitch peak. This process reduces the false peak candidates.
d) For every analysis window there must be one peak pitch candidate, hence the pitch is calculated on a frame by frame basis.
e) A median filter is used for post-processing after center clipping and autocorrelation. It helps to obtain continuity in the pitch contour of the best pitch peak candidates.
f) Locate all the best pitch peak candidates and form an accurate pitch contour with the ACF. The best pitch peaks are selected after searching all voiced speech segments to the left and right of the best pitch mark.
g) Interpolate the pitch contour after selecting all the best pitch peak candidates.
3) Pitch marking is carried out with the following steps:
a) The input pitch contour is divided into search regions.
b) For each search region only one peak candidate is chosen as the pitch mark. A minimum distance between two consecutive candidates is defined to ensure that no two candidates are too close together. For each voiced speech segment, the global maximum location t_m is the first pitch mark.
c) The best pitch mark sequence is found by dynamic programming, which maximizes a probabilistic optimality criterion.
d) After the dynamic programming step, fine tuning is carried out with a filter to reduce the distortion/noise in the resulting pitch marks. This step increases the accuracy of the pitch marks.
4) TD-PSOLA is carried out with the following steps:
a) Unit waveforms are extracted based on the synthesis epoch details of the given speech spectrum with the input pitch and time values.
b) The above procedure results in the analysis speech, which is used for spectrum reconstruction of the synthesis unit waveforms.
c) For PSOLA, new (modified) time and pitch scale values are provided for the synthesis pitch mark calculation.
d) With these new time and pitch scale values, the pitch marks are found as explained above in the two pre-processing steps (pitch estimation and pitch marking).
e) These pitch marks are given as input to the synthesis unit waveforms.
f) With the new pitch marks determined by the new time and pitch scale, the synthesis unit waveforms are overlapped and added.
g) After adding all synthesis unit waveforms with the new values of time and pitch scaling, it is observed that the resulting speech waveform is smoother in terms of spectral alignment.

7.4.2 Pseudo code of PSOLA #include

% Accept speech file

#include ―input.wav‖

% Pass band and Stop band edge frequencies wp = 900Hz, ws = 1000Hz

% Filter the signal using Matlab function ―makefilter‖

[num, den] = makefilter (‗lowpass‘, ‗butter‘, -2, -20, wp, ws)

% Here -2 is gain of pass band (Gp) and -20 gain of stop band (Gs) which are

% in decibels. Perform filtering using MATLAB

200

Removal of spectral mismatch using PSOLA for speech improvement

Runfilter (num, den, ‗input.wav‘) x(n) = input speech signal (% input wave file input.wav after low pass filter)

%C[x(n)] = input speech signal after applying center clipping for reducing noise

CL= 0.3*max[x(n)] if x(n) >CL

C[x(n)] = x(n)-CL else if |x(n)| ≤ CL ; C[x(n)] = 0 else if x(n) < -CL; C[x(n)] = x(n) + CL

Else C[x(n)] = 0

% Define best overlap offset seeking window for 15ms %SEEK-WINDOW 15ms W = 15ms T = time period of current frame r(T) = Auto-correlated signal Xi = New window

Xi+1 = Previous window = C[x(n)] for i= 1 to i= W-T r(T) = Xi – Xi+1 pt = r(T)= pitch period %Here r(T) gives best matching window for best pitch mark r(T) = x1 % Apply median filter for reducing the noise for k = 1:len y(k, 1) = median (x1 (k: k+L-1)); end

%Locate maximum pitch and find out global maximum of waveform

%First pitch mark m = Max pitch (y(k,1)) % first pitch mark at ―t‖

Locate other pitch marks in [tm + f*T0 <=t<= tm + (2-f)* T0]

Locate other pitch marks in [tm - f*T0<=t<= tm - (2-f)* T0]

% The above two steps locates pitch marks on both left and right side of

201

Removal of spectral mismatch using PSOLA for speech improvement

% maximum pitch mark m

Function pitch_marks = p(x) = find_pmarks (m)

%INTERPOLATE pitch contour

Function v = interp [x(n), p(x), m]

%INPUT pitch contour; SR= search region of speech segments of input waveform

SR = [tm + f. T0; tm + (2-f). T0]

SR = v [x(n), p(x), m]

% Get optimal pitch mark sequence

P(k, j)  max[P(k −1,i)  log tk (i, j)]  log vk ( j)] %Read output speech file ―output_WAV‖ and define time scale factor

TIME_SCALE 1

%Define processing sequence for 11000Hz sample rate

SEQUENCE 25ms

%Define overlapping size

OVERLAP 20ms

%Define best overlap offset seeking window

SEEK_WINDOW 15ms

%Use optimal peak mark sequence

Inpt_prev = p(k,j)

% Use cross correlation function to find best overlapping offset where input_prev and %input_new best match with each other

Int_seek_best_overlap (const SAMPLE*input_prev, const SAMPLE*input_new)

%Pre-calculate overlapping slopes with input_prev for i=0 to i

202

Removal of spectral mismatch using PSOLA for speech improvement

% Find the best overlap offset within (0 : SEEK_WINDOW)
for i = 0 to SEEK_WINDOW - 1
    crosscorr += (float) input_new[i + j] * temp[i]
end

% Overlap input_prev with input_new by sliding the amplitude during the OVERLAP
% samples. Store the result to output.
output[i] = (input_prev[i] * (OVERLAP - i) + input_new[i] * i) / OVERLAP

% Perform time scaling for the sample data given in input, write the result to
% output and return the number of output samples. Copy the flat mid-sequence
% from the current processing sequence to output.
int num_out_samples = 0
memcpy(output, seq_offset, FLAT_DURATION * sizeof(SAMPLE))

% Calculate a pointer to the theoretical beginning of the next processing sequence
input += SEQUENCE_SKIP - OVERLAP

% Seek the actual best matching offset using cross-correlation
seq_offset = input + seek_best_overlap(prev_offset, input)

% Do the overlapping between the previous and new sequence, copy the result to output
overlap(output + FLAT_DURATION, prev_offset, seq_offset)

% Update the input and sequence pointers by the overlap amount
seq_offset += OVERLAP; input += OVERLAP

% Update the output pointer and the sample counters
output += SEQUENCE - OVERLAP; num_out_samples += SEQUENCE - OVERLAP
num_in_samples -= SEQUENCE_SKIP
return num_out_samples
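For clarity, the linear cross-fade used in the overlap step above can be expressed in MATLAB as below. This is an interpretation of the listing, not the exact thesis implementation; input_prev and input_new are assumed to be column vectors holding the previous and new processing sequences, and fs is the sample rate.

% Minimal sketch of the 20 ms cross-fade (OVERLAP) between two sequences
ovl  = round(0.020 * fs);                         % overlap length in samples
w    = (0:ovl-1)' / ovl;                          % rising ramp, 0 ... (ovl-1)/ovl
olap = input_prev(end-ovl+1:end) .* (1 - w) ...   % fade out the previous sequence
     + input_new(1:ovl)          .* w;            % fade in the new sequence
out  = [input_prev(1:end-ovl); olap; input_new(ovl+1:end)];  % cross-faded output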


7.5 RESULTS AND DISCUSSION OF TD-PSOLA:

Results of the TD-PSOLA implementation for a few Marathi words are shown below.

Input word: AmYw{ZH (adhunik)

Fig. 7.10: PSOLA speech waveform for AmYw{ZH (adhunik)

Input word: ^maVs` (Bharatiya)

Fig. 7.11: PSOLA speech waveform for ^maVs` (Bharatiya)

The above figures 7.10 and 7.11 show a graphical comparison of the PSOLA-processed speech output with the original speech waveform. In both figures the first window shows the original time domain speech waveform for the given input word, and the second window shows the speech waveform after time and pitch scale modification using TD-PSOLA. These results are validated with subjective analysis (Table 7.1) and with the new objective method of percentage error calculation (Table 7.2). These evaluations show a percentage error improvement of 65% over 300 PSOLA-tested words.

7.5.1 Comparison of original and PSOLA-processed output: After applying PSOLA and changing the pitch, the processed speech is displayed along with the original word. The Marathi word AmYw{ZH (adhunik) is shown below for comparison.

Fig. 7.12: PSOLA and original waveform for word AmYw{ZH (adhunik)

From figure 7.12 it is clear that the PSOLA-processed waveform is close to the original waveform. The following figures show results for the words 1) Xoe~m§Yd (deshbandhav) and 2) H mQ H ga (katkasar). In these figures the first window shows the original waveform for the given input word and the second window shows the PSOLA-processed waveform.

Fig. 7.13: PSOLA and original waveform for word Xoe~m§Yd (deshbandhav)


Fig. 7.14: PSOLA and original waveform for word H mQH ga (katkasar)

7.5.2 GUI used for testing: The GUI contains a list box, two or more radio buttons and some push buttons. The list box provides the list of words; the user can select a word from the list to observe the PSOLA results. A push button is provided for testing the PSOLA algorithm. A snapshot of the GUI used for testing PSOLA is shown in the following figure. The input word used for testing is AmYw{ZH (adhunik); other words can also be tested with the provided GUI.

Fig. 7.15: PSOLA and original waveform for word AmYw{ZH (adhunik) using GUI


The following figures show time and pitch scale modifications with different values after applying the PSOLA algorithm to the word _hm_mJ© (mahamarg).

Fig. 7.16: Input and output waveform with pitch scale = 0.8, time scale = 0.13 for _hm_mJ© (mahamarg)

Fig. 7.17: Input and output waveform with pitch scale = 0.13, time scale = 0.7 for _hm_mJ© (mahamarg)

7.6 PERFORMANCE EVALUATION OF PSOLA:

7.6.1 Subjective evaluation: The subjective evaluation of PSOLA is conducted with MOS, where four subjects participated in the audio quality evaluation of these results. Table 7.1 below shows the MOS analysis for the PSOLA results. The MOS rating values are assigned as below:
1 = Poor audio quality: unacceptable
2 = Acceptable but very noisy and not clear
3 = Good audio quality with good intelligibility and naturalness
4 = Better speech quality with excellent intelligibility and good naturalness
5 = Best speech quality with excellent intelligibility and human-like naturalness

Table 7.1: MOS results of PSOLA

Subjects     MOS = 1   MOS = 2   MOS = 3   MOS = 4   MOS = 5
Subject 1    1
Subject 2    1
Subject 3    1
Subject 4    1
Average      0         1         3         0         0

From table 7.1 it is clear that the average MOS value is 3, which corresponds to acceptable, good audio quality.

7.6.2 Objective evaluation: The percentage error before and after applying PSOLA is calculated and shown in the following table for a few Marathi words. Percentage error calculation compares an approximate value to an exact value. Table 7.2 shows the percentage error in terms of the phase angle of the synthesized speech (leading, lagging or differing with respect to the original waveform) and in terms of the power of the PSOLA-synthesized speech, for four Marathi words. It can be observed that the percentage error has reduced considerably after application of PSOLA. A percentage error improvement of 65% is achieved after testing 300 different words, and the resulting synthetic speech is improved in terms of naturalness.
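The percentage error figures above compare the synthesized waveform against the original; a minimal MATLAB sketch of such a calculation on signal power is given below (orig and synth are assumed to be equal-length speech vectors; the thesis also reports an analogous error on phase angle).

% Minimal sketch of a percentage-error calculation on signal power
P_orig  = sum(orig.^2)  / numel(orig);            % average power of the original word
P_synth = sum(synth.^2) / numel(synth);           % average power of the synthesized word
pct_err = abs(P_synth - P_orig) / P_orig * 100;   % percentage error w.r.t. the original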


Table 7.2: Speech output of PSOLA with percentage error calculation

Words                        Percentage error (before PSOLA)   Percentage error (PSOLA)
AmYw{ZH (adhunik)            18.3883                           1.6117
^maVs` (Bhartiya)            14.7055                           5.294
Xoe~m§Yd (deshbandhav)       16.8315                           3.1685
H mQ H ga (katkasar)         11.5259                           8.4741

7.7 CONCLUSION:

The purpose of this research work is to improve the naturalness and intelligibility of the synthesized speech. The work concentrates more on improving the naturalness of the speech output than on implementing prosody in the system. Here, pitch scale and time scale modification of the speech signal is implemented using PSOLA, and the glitches or spectral mismatch occurring during the concatenation of words or syllables were reduced. The results obtained after implementing the PSOLA algorithm confirm that this purpose was achieved. The resulting synthetic speech is improved in terms of naturalness, with a percentage error improvement of 65% achieved after testing 300 different words. PSOLA is evaluated using our new method of percentage error calculation and shows improvement in the speech output; hence PSOLA is successfully implemented for spectral distortion reduction. For evaluation, subjective analysis (MOS) is also carried out. After validating with both subjective and objective analysis, it is shown that time domain PSOLA can be used for spectral mismatch reduction at the concatenation joint of synthesized speech. As the frequency and wavelet domain audio results are more promising, those methods (PSD, DTW and multi-resolution based wavelet) are used for spectral smoothing and are discussed in the following chapters of this thesis.


Chapter-8

Evaluation of spectral mismatch and development of necessary objective measures


CHAPTER- 8

EVALUATION OF SPECTRAL MISMATCH AND DEVELOPMENT OF NECESSARY OBJECTIVE MEASURES

No.        Title of the Contents                                                        Page No.
8.1        Introduction                                                                 213
8.2        Speech synthesis                                                             214
8.3        Generation of speech database                                                216
8.4        Scope                                                                        216
8.5        Fourier transform                                                            217
8.6        Short time Fourier transform                                                 218
8.7        Wavelet transform                                                            220
8.7.1      Multi-resolution analysis (MRA)                                              220
8.8        Continuous wavelet transform                                                 221
8.9        Discrete wavelet transform                                                   223
8.9.1      Multi-resolution filter bank                                                 223
8.9.2      Selection of wavelet                                                         225
8.10       Dynamic time warping                                                         225
8.11       Description of speech signal                                                 227
8.12       System block diagram for spectral smoothing                                  228
8.13       Finding discontinuities and smoothing                                        230
8.14       Actual implementation requirements                                           231
8.15       System design for spectral smoothing                                         231
8.15.1     Flowchart for spectral smoothing                                             231
8.15.2     Spectral smoothing algorithm                                                 232
8.15.2.1   Purpose of algorithm                                                         232
8.15.2.2   Methodology used and step by step implementation                             232
8.15.2.3   Results/outcome of the algorithm                                             233
8.16       Concatenation of syllables                                                   233
8.17       Results of correlation of energies of concatenated words                     234
8.18       Detailed block diagram of distance and energy calculation and its explanation  237


8.19       Prerequisites of distance calculation algorithm                              239
8.19.1     Normalization of sound file                                                  239
8.19.2     Framing                                                                      239
8.19.3     Correlation of frames and application of DTW                                 239
8.20       Plotting maximum peak values                                                 242
8.20.1     Research findings of peak calculation                                        242
8.21       Distance calculation algorithm                                               245
8.21.1     Purpose of algorithm                                                         245
8.21.2     Methodology used and step by step implementation                             245
8.21.3     Results/outcome of the algorithm                                             246
8.22       Finding Euclidean distance and energy between auto-correlated and cross-correlated words  246
8.22.1     Euclidean distance                                                           246
8.22.2     Definition                                                                   246
8.22.3     Explanation of algorithm for energy calculation                              248
8.23       Energy calculation of correlated words for spectral distortion reduction     250
8.24       Analysis of results of distance and energy calculated in original, PC and IC words  252
8.24.1     Energy results                                                               254
8.25       Other algorithms of spectral smoothing                                       256
8.26       Identification of mismatch of formants using PSD                             256
8.26.1     PSD algorithm                                                                256
8.26.2     Purpose of algorithm                                                         256
8.26.3     Methodology used and step by step implementation                             257
8.26.4     Results/outcome of the algorithm                                             257
8.26.4.1   Research findings                                                            258
8.26.5     Peak determination                                                           259
8.26.6     Formant extraction                                                           259
8.27       Slope calculation for formant noise detection                                260
8.27.1     Calculation and plotting of normalized values                                261
8.27.2     Implementation of PSD and formant plots                                      262
8.27.2.1   Purpose of algorithm                                                         262


8.27.2.2   Methodology used                                                             262
8.27.2.3   Results/outcome of the algorithm                                             263
8.27.3     Word selection                                                               263
8.27.4     PSD plots                                                                    264
8.28       Analysis of formants to evaluate spectral mismatch                           269
8.29       Comprehensive analysis of results obtained by PSD                            279
8.29.1     PSD results on frame by frame basis                                          280
8.29.2     Results for 2-syllable words                                                 281
8.29.3     Numerical results of PSD for 2-syllable word                                 284
8.29.4     Results for 3-syllable words                                                 286
8.29.5     Numerical results of PSD for 3-syllable word                                 289
8.29.6     Results for 4-syllable words                                                 291
8.29.7     Numerical results of PSD for 4-syllable word                                 295
8.30       Detection of spectral mismatch and its reduction using multi-resolution wavelet analysis  296
8.30.1     Purpose of algorithm                                                         296
8.30.2     Methodology used and step by step implementation                             297
8.30.3     Results/outcome of the algorithm                                             298
8.30.4     Wavelet results                                                              298
8.31       Detection of spectral mismatch and its reduction using DTW                   308
8.31.1     Purpose of algorithm                                                         308
8.31.2     Methodology used and step by step implementation                             309
8.31.3     Results/outcome of the algorithm                                             309
8.31.4     Correlation results of DTW                                                   309
8.32       GUI to perform DTW, wavelet and PSD                                          315
8.32.1     List box                                                                     316
8.32.2     Radio buttons                                                                316
8.32.3     Push buttons                                                                 316
8.33       Subjective analysis                                                          318
8.34       Conclusion of spectral mismatch reduction                                    319


CHAPTER- 8

EVALUATION OF SPECTRAL MISMATCH AND DEVELOPMENT OF NECESSARY OBJECTIVE MEASURES

8.1 INTRODUCTION:

 Evaluation of spectral mismatch and development of necessary objective measures aims to increase the naturalness of text to speech synthesis. It deals with segment concatenation in which formants and other spectral features do not align properly. Spectral smoothing methods are implemented to remove the discontinuities at the syllable boundaries.

 It is also shown that a properly concatenated word (PC) is more natural, and hence more similar to the original word, than an improperly concatenated word (IC). If the position of a syllable is considered while concatenating a new word, the result is called a properly concatenated word; if concatenation is carried out irrespective of position, it is called an improperly concatenated word.

 For removing the discontinuities at the syllable boundaries, different time and frequency domain techniques are used, such as PSD (power spectral density), WT (wavelet transform) and DTW (dynamic time warping). The back-propagation algorithm is used to minimize the error between the spectral characteristics of the original and the improperly concatenated word.

 This research work is implemented using MATLAB software for programming. The aim is to remove the discontinuities that occur during concatenation of syllables. In this work the discontinuities at the cutting point are smoothed by changing the spectral characteristics before and after the cutting point.


 The spectral features are extracted from PSD and wavelet algorithm. They are compared to find the spectral mismatch. These spectral features are modified to smooth the discontinuities in the concatenated output. After spectral smoothing the final synthesized speech better approximates the desired speech characteristics and is continuous in both time and frequency domain.

 Other spectral mismatch reduction methods like DTW, multi-resolution based wavelet give similar results for improving speech output.

8.2 SPEECH SYNTHESIS:

 Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech.

 Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.

 The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer.

 In its simplest form, computer speech can be produced by playing out a series of digitally stored segments of natural speech. The segments are coded for storage efficiency. The most important quality of a speech synthesis system is its naturalness. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is natural as well as intelligible. Speech synthesis systems usually try to maximize both characteristics.

 Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes results in audible glitches in the output.

 The database can store a complete word such as dmaH ar (varkari) or just the syllables such as dma (var) and H ar (kari). If a complete word is stored in the database then it makes the database lengthier and hence the database requires more memory but on the other hand, it makes the synthesizer more natural. Whereas if only the syllables are stored then these syllables can be combined to form a word which leads to lesser naturalness but at the same time decreases the database size. In this research work both, most frequent words as well as syllables (Hybrid) are stored. e.g. dma (var) can be used in dmaUo (varane) and H ar (kari) is used in _maoH as (marekari).

 In case of concatenative synthesis, the position of syllables is an important factor in the naturalness of synthesizer. When a new word is formed, the syllables required for making this word are taken from other words stored in the database. These words are segmented and the required syllables are extracted.

 This work proves that the extracted syllable should be present in the same position (initial, medial or final) as required by the new word as this makes the synthesizer sound more natural. This work gives numerical values through which it is proved that if one interchanges the syllable position then the naturalness of the synthesizer is reduced. If syllables are used in their correct position the naturalness of resulting speech increases due to proper spectral alignment [64].


8.3 GENERATION OF SPEECH DATABASE:

The quality of concatenative speech synthesis depends on the audio database from which the units are selected. The different words in Marathi language are recorded in a quiet room with the same setting of the instruments. The following care has been taken while recording the database.

 Constant volume.  Non-emotional way of speaking.  Medium speech speed.  Clear pronunciation.  Use of identical audio equipment throughout the recording period.  Recording of the speech material within one session.

8.4 SCOPE:

 Today the applications of TTS are in automated telecom services (e.g., name and address rendering), as a part of a network voice server for e-mail (e-mail by phone), in directory assistance, as an aid in providing up-to-the-minute information to a telephone user (e.g., business locator services, banking services, help lines), in computer games and in aids to the handicapped.

 This work entitled ‗Evaluation of spectral mismatch and development of necessary objective measures‘ is capable of automatically producing speech by storing small segments of speech, concatenating them when required and removing spectral joint distortion. Both subjective and objective evaluation is used for spectral smoothing results. It helps in reducing the database size and thereby maintaining naturalness. Reduction in the database size reduces the time complexities for searching the words in the database and thus the time needed for the whole TTS system gets reduced.


 It uses concatenative synthesis, which is based on the concatenation of segments of recorded speech [65]. Audio database is maintained to store audio files. Textual database is required to store the text file corresponding to audio file in audio database. In the textual database ‗from‘ and ‗to‘ positions of the syllables are stored along with the WAV file name.

 The TTS systems which are already available for other languages have not used concatenation with hybrid approach where words and syllables are both used. This system performs in a better way in terms of the naturalness as it uses existing words and syllables for concatenation or synthesizing new words. Thus, the aim of the present system is to increase the naturalness of the concatenated speech by spectral smoothing techniques.

 Different time domain and frequency domain parameters are used for smoothing. With the help of subjective and objective evaluation, this work proves that after applying spectral smoothing techniques the naturalness of the system is increased.

8.5 FOURIER TRANSFORM:

 The Fourier transform (FT) is probably the most widely used signal analysis method. Fourier showed that a periodic function can be represented by an infinite sum of complex exponentials; many years later this idea was extended to non-periodic functions and then to discrete time signals.

 The Fourier transform retrieves the global information of the frequency content of a signal. For stationary and pseudo-stationary signals this transform gives a good description. However, for highly non-stationary signals some limitations occur.

 These limitations are overcome by the short time Fourier transform (STFT). The STFT is a time-frequency analysis method which is able to retrieve the local frequency information of a signal. The Fourier transform decomposes a signal into orthogonal trigonometric basis functions.

 The Fourier transform of a continuous signal x(t) is defined in equation (8.1) below. The Fourier transformed signal X(f) gives the global frequency distribution of the time signal x(t). The original signal can be reconstructed using the inverse Fourier transform (IFT) as defined in equation (8.2) below.

X(f) = ∫_{−∞}^{∞} x(t) e^{−j2πft} dt ………………………… (8.1)

x(t) = ∫_{−∞}^{∞} X(f) e^{+j2πft} df ………………………… (8.2)

where X(f) = Fourier transform of the input signal, x(t) = input signal, f = frequency and t = time.

8.6 SHORT TIME FOURIER TRANSFORM:

 The limitation of the Fourier transform is that it gives only the global frequency content of signal. It is overcome by the short time Fourier transform (STFT). The STFT is able to retrieve both frequency and time information from a signal.

 The STFT calculates the Fourier transform of a windowed part of the original signal, where the window shifts along the time axis. In other words, the signal x(t) is windowed and the FT of the windowed segment is taken, giving the frequency content of the signal in that time interval. The short time Fourier transform of the signal x(t) is defined in equation (8.3) below.

X_STFT(τ, f) = ∫_{−∞}^{∞} x(t) g(t − τ) e^{−j2πft} dt ………………………… (8.3)

where x(t) = input signal, g(t − τ) = window shifted by τ, X_STFT(τ, f) = short time Fourier transformed signal, f = frequency and t = time.

 The performance of the STFT analysis depends critically on the chosen window g(t). A short window gives a good time resolution, but different frequencies are not identified very well. A long window gives an inferior time resolution, but a better frequency resolution.

 It is not possible to get both good time resolution and good frequency resolution. This is known as the Heisenberg inequality. The Heisenberg inequality states that the product of time resolution ∆t and frequency resolution ∆f (bandwidth-time product) is constant, i.e. the time-frequency plane is divided into blocks with an equal area. The bandwidth-time product is lower bounded (minimum block size) as

∆t · ∆f ≥ 1/(4π) ………………………………. (8.4)

where ∆t = time resolution and ∆f = frequency (bandwidth) resolution.

 Finding a proper window length is critical for the quality of the STFT. The STFT uses a fixed window length, so ∆t and ∆f are constant. With a constant ∆t and ∆f, the time-frequency plane is divided into blocks of equal size as shown in the figure 8.1 below:


Fig. 8.1: Constant resolution time-frequency plane (frequency vs. time)

This resolution is not satisfactory. Low frequency components often last a long period of time therefore a high frequency resolution is required. High frequency components often appear as short bursts invoking the need for a higher time resolution.
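To illustrate the window-length trade-off described above, the following MATLAB sketch computes two STFTs of the same signal with a short and a long analysis window. The file name, the 5 ms and 40 ms window lengths and the Hamming window are illustrative assumptions; spectrogram is a Signal Processing Toolbox function.

% Minimal STFT sketch: short window = good time resolution, long window = good frequency resolution
[x, fs]  = audioread('word.wav');                 % hypothetical input word
x = x(:, 1);                                      % use the first channel
win_shrt = hamming(round(0.005 * fs));            % 5 ms analysis window
win_long = hamming(round(0.040 * fs));            % 40 ms analysis window
[S1, F1, T1] = spectrogram(x, win_shrt, round(numel(win_shrt)/2), [], fs);
[S2, F2, T2] = spectrogram(x, win_long, round(numel(win_long)/2), [], fs);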

8.7 WAVELET TRANSFORM:

 The analysis of a non-stationary signal using FT or STFT does not give satisfactory results [72]. Better results can be obtained using a wavelet analysis. One advantage of wavelet analysis is the ability to perform local analysis.

 Wavelet analysis is able to reveal signal aspects that other analysis techniques miss, such as trends, breakdown points, discontinuities, etc. In comparison to the STFT the wavelet analysis makes it possible to perform a multi-resolution analysis.

8.7.1 Multi-resolution analysis (MRA):  The time-frequency resolution problem is caused by the Heisenberg uncertainty principle and exists regardless of the used analysis technique. For the STFT a fixed time-frequency resolution is used.


 By using an approach called multi-resolution analysis (MRA) it is possible to analyze a signal at different frequencies with different resolutions. The change in resolution is schematically displayed in the following figure 8.2:

Fig. 8.2: Multi-resolution time-frequency plane (frequency vs. time)

 For the resolution of the figure 8.2 it is assumed that low frequencies last for the entire duration of the signal, whereas high frequencies appear from time to time as a short burst. This is often the case in practical applications.

 The wavelet analysis calculates the correlation between the signal under consideration and a wavelet function 휑(t). The similarity between the signal and the analyzing wavelet function is computed separately for different time intervals, resulting in a two dimensional representation. The analyzing wavelet function 휑 (t) is also referred as the mother wavelet.

8.8 CONTINUOUS WAVELET TRANSFORM:

 The continuous wavelet transform is defined by

X_WT(τ, s) = (1/√s) ∫_{−∞}^{∞} x(t) φ((t − τ)/s) dt …………………… (8.5)

where X_WT(τ, s) = continuous wavelet transformed signal, τ = translation parameter, s = scale parameter, φ = mother wavelet, x(t) = input signal and t = time.

 The transformed signal X_WT(τ, s) is a function of the translation parameter τ and the scale parameter s. The mother wavelet is denoted by φ. The signal energy is normalized at every scale by dividing the wavelet coefficients by √s, which ensures that the wavelets have the same energy at every scale. The mother wavelet is contracted and dilated by changing the scale parameter s.

 The variation in scale ‗s‘ changes not only the central frequency ‗fc‘ of the wavelet, but also the window length. Therefore, the scale ‗s‘ is used instead of the frequency for representing the results of the wavelet analysis. The translation parameter ′휏′ specifies the location of the wavelet in time, by changing 휏 the wavelet can be shifted over the signal.

 For constant scale ‗s‘ and varying translation ′휏′ the rows of the time-scale plane are filled, varying the scale ‗s‘ and keeping the translation ′휏′ constant fills the columns of the time-scale plane. The elements in XWT (휏, s) are called wavelet coefficients, each wavelet coefficient is associated to a scale (frequency) and a point in the time domain.

 The WT also has an inverse transformation, as was the case for the FT and the STFT. The inverse continuous wavelet transformation (ICWT) is defined by the equation (8.6) below.

x(t) = (1/C_φ) ∫_{−∞}^{∞} ∫ X_WT(τ, s) (1/s²) φ((t − τ)/s) dτ ds ……………… (8.6)

where x(t) = reconstructed (inverse transformed) signal, X_WT(τ, s) = wavelet transformed signal, τ = translation parameter, s = scale parameter, φ = mother wavelet, t = time and C_φ = proportionality (admissibility) constant.

 A wavelet function has its own central frequency ‗fc‘ at each scale; the scale ‗s‘ is inversely proportional to that frequency. A large scale corresponds to a low frequency, giving global information of the signal. Small scales correspond to high frequencies providing detail signal information.

 For the WT, the Heisenberg inequality still holds, the bandwidth-time product ∆t. ∆f is constant and lower bounded. Decreasing the scale ‗s‘ i.e. a shorter window, will increase the time resolution ∆t, resulting in a decreasing frequency resolution ∆f. This implies that the frequency resolution ∆f is proportional to the frequency f, i.e. wavelet analysis has a constant relative frequency resolution.

8.9 DISCRETE WAVELET TRANSFORM:

The discrete wavelet transform (DWT) uses filter banks for the construction of the multi-resolution time-frequency plane. The DWT uses multi-resolution filter banks and special wavelet filters for the analysis and reconstruction of signals.

8.9.1 Multi-resolution filter bank:  A filter bank consists of filters which separate a signal into frequency bands. The low-pass L(z) and high-pass H(z) filtering branches of the filter bank retrieve respectively the approximations and details of the signal x(k). In the figure 8.3, a three level filter bank is shown.

 The filter bank can be expanded to an arbitrary level, depending on the desired resolution. The coefficients cl(k) represent the lowest half of the frequencies in x(k); down-sampling doubles the frequency resolution while the time resolution is halved, i.e. only half the number of samples are present in cl(k).

 In the second level the outputs of L(z) and H(z) double the time resolution and decrease the frequency content, i.e. the width of the window is increased. After each level, the output of the high-pass filter represents the highest half of the frequency content of the low-pass filter of the previous level, this leads to a pass-band.

 The time-frequency resolution of the analysis bank of the figure 8.3 is similar to the resolution shown in the figure 8.2. For a special set of filters L(z) and H(z) this structure is called the DWT, the filters are called wavelet filters.

Fig. 8.3: Analysis filter bank (cascaded low-pass L(z) and high-pass H(z) branches with down-sampling by 2, producing ch(k), clh(k), cllh(k) and clll(k) from x(k))

Fig. 8.4: Synthesis filter bank (reconstruction filters L'(z) and H'(z) with up-sampling by 2, producing y(k))

Both the analysis and synthesis filter banks have three levels and are used in the wavelet filter applications. In this research work the discrete wavelet transform is used, which decomposes the signal into low frequency (approximation) coefficients and high frequency (detail) coefficients. These wavelet coefficients are modified to remove spectral mismatch.

8.9.2 Selection of wavelet:  In comparison to the Fourier transform, the analyzing function of the wavelet transform can be chosen with more freedom without the need of using sine- forms. A wavelet function 휑(t) is a small wave, which must be oscillatory in some way to discriminate between different frequencies.

 The wavelet contains both the analyzing shape and the window. The choice of wavelet depends on the application in which it is being used. The optimum wavelet is selected based on the energy conservation properties in the approximation part of the wavelet coefficients. The wavelet is selected such that it matches with the required features of the signal. In this work Daubechies wavelet is used. Figure 8.5 shows the Daubechies wavelet signal.

Fig. 8.5: Daubechies wavelet
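A minimal MATLAB sketch of such a decomposition is given below (Wavelet Toolbox assumed; the specific wavelet 'db4', the three decomposition levels and the file name are illustrative choices, as the thesis only states that a Daubechies wavelet is used).

% Minimal DWT sketch: 3-level decomposition with a Daubechies wavelet
[x, fs] = audioread('word.wav');       % hypothetical concatenated word
x = x(:, 1);                           % use the first channel
[c, l] = wavedec(x, 3, 'db4');         % wavelet coefficients and bookkeeping vector
a3 = appcoef(c, l, 'db4', 3);          % approximation (low-frequency) coefficients
d1 = detcoef(c, l, 1);                 % level-1 detail (high-frequency) coefficients
x_rec = waverec(c, l, 'db4');          % reconstruction from the (possibly modified) coefficients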

8.10 DYNAMIC TIME WARPING:

What happens when people vary their rate of speech during a phrase? How can a speaker verification system with a password of "Project" accept the user when he says "Prrroooject"? A simple linear squeezing of this longer password will not match the key signal because the user slowed down the first syllable while he kept the normal speed for the "ject" second syllable.


 One needs to non-linearly time-scale the input signal to the key signal so that appropriate sections of the signals line up (i.e. one can compare "Prrrooo" to "Pro" and "ject" to "ject"). The solution to this problem is a technique known as "dynamic time warping" (DTW). This procedure computes a non-linear mapping of one signal onto another by minimizing the distances between the two.

 In order to get an idea of how to minimize the distances between two signals, let's define two matrices: K(n), n = 1,2,...,N, and I(m), m = 1,2,...,M. One can develop a local distance matrix (LDM) which contains the differences between each point of one signal and all the points of the other signal. Basically, if n = 1,2,...,N defines the columns of the matrix and m = 1,2,...,M defines the rows, the values in the local distance matrix LDM(m, n) would be the absolute value of (I(m) - K(n)).

 The purpose of the DTW is to produce a warping function that minimizes the total distance between the respective points of the signals. Let‘s introduce the concept of an accumulated distance matrix (ADM). The ADM contains the respective value in the local distance matrix plus the smallest neighboring accumulated distance. One can use this matrix to develop a mapping path which travels through the cells with the smallest accumulated distances, thereby minimizing the total distance between the two signals.

 There are some constraints that one would want to place on a mapping path. One would not want this path to wander too far from a linear mapping. One should also say that the path can only map points that haven't been covered yet. For example, one can't map point 6, point 5 and then point 6 again. For boundary considerations one wants the endpoints of the signals to map to each other (I (1) map to K (1) and I (M) maps to K (N)).

 With these constraints in mind one can develop the concept of an accumulated distance matrix. As discussed previously, ADM values are the LDM values at that location plus the smallest neighboring accumulated distance. To keep from backtracking over points that have already been covered, one uses the Itakura method to define what these "neighboring" points are.

 The Itakura method says that neighboring points of the point at (m, n) are (m, n-1), (m-1, n-1), and (m-2, n-1). Defining ADM (1, 1) to be LDM (1, 1), one can write the equation for the other cells in the accumulated distance matrix:

ADM(m, n)=LDM(m, n)+min{ADM(m,n-1), ADM(m-1,n-1), ADM(m-2,n-1)} …(8.7)

 With this matrix developed, one can create the path by starting at the value in ADM(M,N) and working way backwards to ADM(1,1), travelling through the cells with the lowest accumulated distances subject to the Itakura constraints.
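A minimal MATLAB sketch of equation (8.7) is given below; it builds the accumulated distance matrix for two one-dimensional sequences using the Itakura neighbours, after which the path is obtained by backtracking from ADM(M, N). The sequences K and I are assumed inputs (for example, frame-wise feature values).

% Minimal sketch of the accumulated distance matrix of equation (8.7)
function ADM = accumulate_distance(K, I)
    N = numel(K);  M = numel(I);
    LDM = abs(I(:) - K(:)');           % local distance matrix, M x N
    ADM = inf(M, N);
    ADM(1, 1) = LDM(1, 1);             % boundary condition: I(1) maps to K(1)
    for n = 2:N
        for m = 1:M
            best = ADM(m, n-1);                               % neighbour (m, n-1)
            if m >= 2, best = min(best, ADM(m-1, n-1)); end   % neighbour (m-1, n-1)
            if m >= 3, best = min(best, ADM(m-2, n-1)); end   % neighbour (m-2, n-1)
            ADM(m, n) = LDM(m, n) + best;
        end
    end
end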

8.11 DESCRIPTION OF SPEECH SIGNAL:

Fig. 8.6: Description of speech signal

The speech signal can be represented as shown in figure 8.6 above:
1) The entire word or data can be represented as a unit of time. Representing speech as a time signal helps in the analysis of speech. The first plot shows this time signal.
2) The second plot gives the energy plot of the word. This helps in the proper concatenation of different words and syllables; energy matching during concatenation is very important.
3) The third plot represents the spectrogram of the word. The spectrogram convention of PSD is used in the study of original, proper and improper concatenated words.
4) The fourth plot shows the variation in pitch of the speaker. Pitch varies from person to person and may be divided into low, medium and high pitch.

8.12 SYSTEM BLOCK DIAGRAM FOR SPECTRAL SMOOTHING:

The following figure 8.7 shows block diagram of spectral smoothing in Marathi TTS. Important blocks are shown in different colors.

Fig. 8.7: Block diagram of spectral smoothing and TTS (text input → search in text and audio database → syllable search → concatenation of syllables → finding discontinuities and smoothing → speech output)

Explanation of block diagram: The following subsections explain each part of the diagram.

Text processing and audio encoding: In text processing, all spaces, tabs, commas, full stops etc are removed and words are isolated from each other. The encoding stage encodes input text into code string (CV structure). This block breaks the word into syllables using CV structure breaking rules.


Search for word: Input text is searched in the text and audio database. Audio database stores recorded sound files. Limited numbers of most common words are recorded. Textual database stores code string, wave filename, from and to positions of the word within that audio file. If the word is found then it is given to the output unit. If the word is not found then it is given to the next stage.

Search for syllable: Next stage involves selection of syllables from the database. If the word is not present in the database, after finding syllables they are given to the next stage for concatenation.

Concatenation of syllables: Concatenation of words is carried out to generate proper-concatenated and improper- concatenated words. Then the original word is compared with the concatenated word to find out the difference [66].

Proper concatenation of syllables: The following figures 8.8 and 8.9 show the formation of proper and improper concatenated words. Figure 8.8 below shows a properly concatenated word, where the correct position of each syllable is considered.

Fig. 8.8: Proper concatenated word ‘dmaH as’ (varkari), formed by taking the syllables dma and H as from their matching positions in the source words dmagm and H m‘H as


Improper concatenation (IC): In improper concatenation the syllable present in the initial position is taken from a different position available either at middle or final position. The final position syllable is taken from initial or middle position. This concatenation does not consider position of syllables while forming new words. The following figure 8.9 shows formation of improper concatenated word.

Fig. 8.9: Improper concatenated word ‘dmaH as’ (varkari), formed by taking the syllables dma and H as from non-matching positions in the source words H madma and H asZm

Speech output: The output speech block consists of a low-pass filter to reduce the residual noise present in the synthesized word. The synthesized speech output should be natural and intelligible. The speech output of the present Marathi TTS is evaluated with subjective (MOS) and objective evaluation methods. The numerical figures of the objective evaluation are implemented for the first time for spectral distortion reduction.

8.13 FINDING DISCONTINUITIES AND SMOOTHING:

This work emphasizes the difference between proper and improper concatenated words in text to speech synthesis. Different methods are used to find discontinuities at the concatenation joint and to smooth this spectral distortion [67]. The methods used for spectral smoothing are: 1) power spectral density (PSD), 2) multi-resolution based wavelet transform with back-propagation and 3) dynamic time warping (DTW).


8.14 ACTUAL IMPLEMENTATION REQUIREMENTS:

 This work requires use of a noise free database with no dc offset or any external noise. The reason being that a person has to be a very patient speaker with a clear voice as well as he/she should have a proper accent. A person knowing Marathi language pronunciation should record most frequent words of the database.

 As it is a very precise study, even a small amount of noise or disturbance during recording can make a huge amount of difference. The microphone used in recording should not be very sensitive. All recording session should be recorded in properly designed sound proof studio. One has to make database of the Marathi words and form proper and improper words for testing.

 During plotting of the PSD plots the problem of multiple peaks arises, for which smoothing is required. Choosing the correct filter order for different ranges is a big problem, as the amplitude level and energy level of every word vary a lot; in the case of a low amplitude plot, if a higher order filter is applied the peaks are lost. One has to perform a lot of testing to get a good PSD plot.

 Similarly for other spectral smoothing methods like DTW, wavelet transform with Back-propagation care is taken for obtaining proper results.

8.15 SYSTEM DESIGN FOR SPECTRAL SMOOTHING:

System for spectral smoothing is designed and implemented as shown in the following subsections.

8.15.1 Flowchart for spectral smoothing: After segmentation of syllables is carried out, proper and improper words (PC and IC) are prepared with the help of the 'sound forge' software by considering syllable positions. The original, PC and IC words are given as input to different spectral smoothing algorithms, namely PSD, DTW and the wavelet transform [68]. The different spectral distortion reduction methods are compared for their accuracy and performance; the flowchart for spectral distortion reduction is shown below in figure 8.10.

Fig. 8.10: Flowchart for spectral smoothing (segmentation of syllable → cutting of syllables → concatenation block forming the proper or improper word → selection-of-method block → PSD, DTW or wavelet transform with back-propagation)

8.15.2 Spectral smoothing algorithm: This algorithm is explained below with the help of implementation steps.

8.15.2.1 Purpose of algorithm This algorithm gives smooth output after concatenation block and any of the three methods (PSD, wavelet and DTW) can be used to reduce concatenation noise/distortion.

8.15.2.2 Methodology used and step by step implementation The following steps explain how this algorithm is implemented. Different smoothing techniques are used to remove the glitches at the concatenation point and reduce the noise in the output. This helps to improve naturalness of speech quality.


1) The input to the text to speech synthesis is Devnagari (Marathi) text.
2) The input word is searched in the database. If the word is found it is given to the output file.
3) If the word is not found, it is broken into syllables using CV structure breaking rules. The syllables are then searched in the database and given to the concatenation unit.
4) Proper and improper concatenated words are formed for testing of the different smoothing algorithms.
5) The discontinuities present in the concatenated word are reduced by applying these smoothing techniques.
6) After smoothing, the word is given to the output file.

8.15.2.3 Results/outcome of the algorithm This algorithm should give smooth output at the concatenation point. This algorithm and all sub modules are implemented one by one as explained in the following paragraphs.

8.16 CONCATENATION OF SYLLABLES:

Concatenation is carried out to generate proper-concatenated and improper- concatenated words, so as to compare the original word with the concatenated word and find out the difference for quantification of mismatch.

Fig. 8.11: Original word H m_Jma (kamgar) in time domain.


Fig. 8.12: Proper-concatenated word H m_Jma (kamgar) in time domain.

Fig. 8.13: Improper-concatenated word H m_Jma (kamgar) in time domain.

The above figures show the speech waveforms of the original word H m_Jma (kamgar), the properly concatenated (considering syllable position) word H m_Jma (kamgar) and the improperly concatenated (not considering syllable position) word H m_Jma (kamgar). The first figure looks continuous in the time domain as it is the originally recorded input word. The second figure shows the properly concatenated word, where the energy (amplitude) of both syllables is similar to the original word. The third figure shows the improperly concatenated word, where the energy is high for both syllables. All of the above figures show the audio content of the input words; the x-axis is time and the y-axis is the normalized amplitude of each sample.

8.17 RESULTS OF CORRELATION OF ENERGIES OF CONCATENATED WORDS:

 The work emphasizes the difference between proper and improper concatenated words in text to speech synthesis. This is done by calculating the energy of the cross-correlation between original-proper concatenated (PC) and original-improper concatenated (IC) words.
 The aim is to quantify the mismatch between original-proper and original-improper concatenated words and compare them. This work brings out the difference between properly concatenated (PC) and improperly concatenated (IC) words; by reducing this difference, the text to speech system can produce more natural speech output. This method of energy calculation of correlated original, proper and improper words is used for the first time for spectral mismatch estimation.

The following figure 8.14 shows auto correlation of original word H m_Jma (kamgar)

Fig. 8.14: Auto-correlation of original word H m_Jma (kamgar)

This figure 8.14 shows auto-correlation of original word H m_Jma (kamgar) having main lobe and side lobes. This figure is considered as standard reference for comparing other concatenated words.

The following figures 8.15 and 8.16 show cross-correlation of original and PC-IC words.


Fig. 8.15: Cross-correlation of original H m_Jma (kamgar) and PC H m_Jma (kamgar)

Fig. 8.16: Cross-correlation of original H m_Jma (kamgar) and IC H m_Jma (kamgar)

The above figures show the cross-correlation of the original word with PC and IC. Figure 8.15 shows a clear distinction between the main lobe and the side lobes, whereas figure 8.16 shows no such lobes and has a lot of distortion. This has resulted from the improper selection of syllables for concatenation.


In the following table 8.1, the energy calculation is shown for three lobes: the first side lobe, the main lobe and the second side lobe. From these results one can conclude that the energy of the original and PC word is very close, whereas the energy of the IC word is very different: for all three lobes of the cross-correlation there is a large difference between the IC word energies and the original word energies. In the above figures one can observe that the main lobe and side lobes are present both for 1) original vs. original (fig. 8.14) and 2) original vs. PC cross-correlation (fig. 8.15). In the case of the IC cross-correlation (fig. 8.16), the three lobes show a lot of distortion and there is no difference in their amplitudes; due to the spectral distortion, the three lobes cannot be defined properly. Compared to the IC word, the PC word shows all three lobes (main lobe and side lobes) clearly, with less distortion. This shows that the spectral distortion is larger for the improperly concatenated word [69]. The following results show the average energy calculation for the selected window of the cross-correlation. This algorithm of energy calculation of cross-correlated words is implemented for the first time for spectral distortion measurement.

Table 8.1: Average energy results for cross-correlation of original, PC and IC
(each group of three values: energy of side lobe 1 / main lobe / side lobe 2)

Name of input word      Original vs. improper        Original vs. proper          Original vs. original
H m_Jma (kamgar)        11.37 / 21.29 / 16.67        8.76 / 121.13 / 6.05         5.77 / 124.85 / 5.82
JwéXod (gurudev)        39.90 / 110.68 / 103.89      162.97 / 79.98 / 162.20      165.28 / 82.21 / 165.28
a§J^y_s (rangbhumi)     51.18 / 110.16 / 70.08       129.59 / 181.68 / 128.61     139.63 / 184.31 / 139.63

The above results show the average energy of original vs. original, original vs. proper and original vs. improper cross-correlated words. Three values in each set are the average energy values of main lobe and two side lobes. Three values are obtained for each cross-correlated words based upon the window chosen.

8.18 DETAILED BLOCK DIAGRAM OF DISTANCE AND ENERGY CALCULATION AND ITS EXPLANATION:


The following block diagram 8.17 shows distance calculation between auto and cross correlated words. It shows energy calculation for finding the difference between original-proper and original-improper words. The first four blocks show the procedure of selecting the word for concatenation and synthesizing them to form proper and improper concatenated words. The second column of blocks shows the procedure for finding the distance between auto and cross-correlated words. The third column shows the energy calculation of input and concatenated word after correlation with selected window. Both distance and energy are calculated and compared for 500 words. This gives numerical figures for mismatch or spectral distortion. This quantification of mismatch is one of the objective methods implemented first time for evaluation of naturalness of Marathi TTS system. The following block diagram shows important blocks in different colors.

Fig. 8.17: Block diagram of quantification of mismatch (first column: select word → select word with syllables in different positions → form concatenated words → normalization; second column: framing → correlate frames and apply DTW → find peak value for every correlation → plot peak values → find distance between auto- and cross-correlated words → compare values; third column: correlate words → selection of window for energy calculation → calculate energy)


8.19 PREREQUISITES OF DISTANCE CALCULATION ALGORITHM:

8.19.1 Normalization of sound file: The loudness of the speaker's voice can increase or decrease while recording, which can adversely affect the correlation output. To compensate for this, the peak values of the sound signal should be kept within the range -1 to +1. Hence normalization is done, so that the peak values of all sound files lie in the same range.
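A minimal MATLAB sketch of this peak normalization is given below (the file name is a hypothetical example).

% Minimal sketch: peak-normalize a recorded word to the range -1 to +1
[x, fs] = audioread('word.wav');   % hypothetical recorded word
x = x(:, 1);                       % use the first channel
x = x / max(abs(x));               % peak values now lie in [-1, +1]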

8.19.2 Framing: The selected frame size is 10 ms. The logic behind this duration is that when a human speaks, the position of the glottis remains constant for a minimum duration of about 10 ms, during which the frequency components remain the same. This makes it easier to understand the behavior of each word and syllable as it is spoken. Since the entire analysis is carried out in the frequency domain, the 10 ms duration is very important for this work.
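The framing step can be sketched in MATLAB as follows (non-overlapping 10 ms frames are assumed; 11000 Hz is the sample rate used elsewhere in this work).

% Minimal framing sketch: split the normalized signal x into 10 ms frames
fs = 11000;                                   % assumed sample rate
frame_len  = round(0.010 * fs);               % 10 ms frame length in samples
num_frames = floor(numel(x) / frame_len);
frames = reshape(x(1:num_frames*frame_len), frame_len, num_frames);  % one frame per column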

8.19.3 Correlation of frames and application of DTW:  In signal processing, cross-correlation is a measure of similarity of two waveforms as a function of a time-lag applied to one of them. This is also known as a ‗sliding dot product’ or ‗inner-product’. It is commonly used to search a long duration signal for a shorter known feature. Auto-correlation is cross-correlation of a signal with itself.

Cross-correlation:

R₁₂(n) = Σ_{m=−∞}^{∞} x₁(m) · x₂(m − n) ………………………… (8.8)

where R₁₂(n) = cross-correlation of signal 1 with signal 2, x₁(m) = input signal 1 and x₂(m − n) = input signal 2 delayed by n samples.

Auto-correlation:

R(n) = Σ_{m=−∞}^{∞} x₁(m) · x₁(m − n) ………………………… (8.9)

where x₁(m) = input signal 1.
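The sine-wave examples discussed next can be reproduced with the built-in xcorr function (Signal Processing Toolbox); the 1000 Hz sampling rate and 1 s duration below are assumptions chosen only for illustration.

% Minimal sketch of equations (8.8) and (8.9) on two sine waves
fs = 1000;  t = (0:1/fs:1-1/fs)';     % 1 s of samples at an assumed 1000 Hz rate
s100 = sin(2*pi*100*t);               % 100 Hz sine wave
s200 = sin(2*pi*200*t);               % 200 Hz sine wave
r_auto  = xcorr(s100, s100);          % auto-correlation of the 100 Hz sine
r_cross = xcorr(s100, s200);          % cross-correlation of the 100 Hz and 200 Hz sines
peak_auto  = max(r_auto);             % large peak, since the signals are identical
peak_cross = max(abs(r_cross));       % much smaller peak for different frequencies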

Some standard examples of cross and autocorrelation are given below.

The following figure 8.18 shows correlation of 100Hz sine wave with itself.

Fig. 8.18: 100 Hz sine wave correlated with 100 Hz sine wave.

In the above diagram the peak value is 500.


Fig. 8.19: 100 Hz sine wave correlated with 200 Hz sine wave.

 The above diagram 8.19 shows the correlation of a 100Hz sine wave with a 200 Hz sine wave. The peak value is 15. This value is much less than the peak value obtained in the previous figure as the frequency of the second sine wave (200 Hz) is much higher than the first sine wave (100 Hz).

Fig. 8.20: 100 Hz sine wave correlated with 300Hz sine wave.

 The above diagram 8.20 shows the correlation of a 100 Hz sine wave with a 300 Hz sine wave. The peak value is 8. This value is much less than the peak values obtained in both of the previous diagrams, as the frequency of the second sine wave (300 Hz) is the highest and hence the value of the correlation is the least. In this research work, the frames of the original sound files are correlated with the frames of the PC and IC words.

 Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another he or she was walking more quickly or if there were accelerations and decelerations during the course of one observation. In general, DTW is a method that allows a computer to find an optimal match between two given sequence (e.g. time series) with certain restrictions.

 In DTW one begins with a set of template streams, each labeled with a class. Given an unlabeled sample word, the minimum distance between the sample word and each template is computed and the class is attributed to the nearest template. DTW is applied after correlation to remove difference between different concatenation frames.

The following section shows results of application of DTW after correlation.

8.20 PLOTTING MAXIMUM PEAK VALUES:

8.20.1 Research findings of peak calculation  The peak values of the original, PC and IC correlations are found out and the maximum peak value is plotted. The following diagram shows the peak values obtained after correlating the frames of original and PC and applying DTW to the result. Figure 8.21 is correlation obtained with original and PC. Figure 8.22 is correlation obtained with original and IC. Since auto-correlation of the word is the reference waveform for the conclusion, figure 8.23 shows the auto-correlation of original word.


 The shape of the peaks in figure 8.21 and figure 8.23 is the same. Hence from these figures it can be concluded that PC is closer to the original word than IC. In figure 8.22 all three peaks are of the same amplitude because of the mismatch in the correlation of original and IC.

All of the following figures show the correlation results of the original, PC and IC words after applying DTW.

Fig. 8.21: Peak values obtained after correlating the frames of original and PC and applying DTW


Fig. 8.22: Peak values obtained after correlating the frames of original and IC and applying DTW

The above figure shows more distortion than the other figures. Since it is an improperly concatenated word, the correlation results in a lot of distortion after application of DTW; there are no clear main and side lobes as there are in the other two figures.

The following figure 8.23 shows the correlation of the original signal, or input word, with itself. It clearly shows a main lobe and two side lobes, with the energy of the main lobe high compared to the side lobes. This shows strong correlation, as the original word is correlated with itself.


Fig. 8.23: The peak values obtained after correlating the frames of original and original.

8.21 DISTANCE CALCULATION ALGORITHM:

Distance calculation algorithm in general form is explained in the following section.

8.21.1 Purpose of algorithm This algorithm is used for the formation of concatenated word and finding distance between original and concatenated word. Different parameters like auto-correlation, cross-correlation, Euclidean distance and energy calculation are used.

8.21.2 Methodology used and step by step implementation The following section explains how different parameters can be used to find spectral distortion at the joint of syllables for given input word or concatenated word. 1) Selection of words for concatenation. 2) Cutting of syllables from words for concatenation. 3) Concatenation is carried out manually using ‗sound forge‘ software. 4) Properly concatenated word and improperly concatenated words are obtained. 5) Manual correction of the words using time domain analysis.

245

Evaluation of spectral mismatch and development of necessary objective measures

6) The cross correlation of original signal is taken with proper concatenated word and improper concatenated word. 7) Comparison is carried out for concatenated and original word on the basis of different parameters.

8.21.3 Results/outcome of the algorithm This algorithm is used for calculating distortion present at the joint of syllables in a word. The word can be an input word or a concatenated word (proper/improper). The results for different modules are explained in the following sections.

The accuracy of distance and energy calculation after correlation is limited to 70%. The spectral mismatch reduction results of DTW, PSD and wavelet are more promising and hence they are implemented for spectral distortion reduction.

These results are shown here for reference and comparison with other methods.

8.22 FINDING EUCLIDEAN DISTANCE AND ENERGY BETWEEN AUTO-CORRELATED AND CROSS-CORRELATED WORDS:

8.22.1 Euclidean distance: The Euclidean distance (ED) or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler and is given by the Pythagorean formula. By using this formula as distance, Euclidean space (or even any inner product space) becomes a metric space.

8.22.2 Definition:
The Euclidean distance between points 'P' and 'Q' is the length of the line segment PQ. In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance from p to q is given by:

d(p, q) = √[(p1 − q1)² + (p2 − q2)² + … + (pn − qn)²] = √[ Σ i=1..n (pi − qi)² ]   …..(8.10)


The auto-correlation of the original word serves as the reference for the distance calculation of the original-proper and original-improper correlated outputs. The Euclidean distance is calculated between the original-original (auto-correlated) output and the original-PC (cross-correlated) output. The Euclidean distance is also calculated between the original-original (auto-correlated) output and the original-IC (cross-correlated) output.
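A minimal sketch of this distance computation is given below (Python/NumPy). It applies eq. (8.10) directly to the auto-correlated and cross-correlated sequences; whether the per-point differences are combined exactly this way in the original implementation is an assumption.

```python
import numpy as np

def euclidean_distance(auto_corr, cross_corr):
    """Eq. (8.10) between the original-original (auto-correlated) output and an
    original-PC or original-IC (cross-correlated) output."""
    n = min(len(auto_corr), len(cross_corr))
    diff = np.asarray(auto_corr[:n], dtype=float) - np.asarray(cross_corr[:n], dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))
```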

Fig. 8.24: Distance between original-original auto-correlated and IC-original correlated output.

The above figure 8.24 shows the Euclidean distance between the original-original and original-IC correlated outputs. The final distance value is calculated by adding up each point in the figure. The peak distance value obtained here (40) is much larger than the peak value (18) obtained in figure 8.25 below.

In figure 8.25 the distance is calculated between the auto-correlated output and the PC-original correlated output. This gives a clear indication that the distance of the original-PC correlated output is smaller than that of the original-IC correlated output.


Fig. 8.25: Distance between original-original auto-correlated and PC-original correlated output.

The above example shows that the Euclidean distance is smaller for the PC-original correlated output and larger for the IC-original correlated output. It also shows that syllable position plays an important role: a properly concatenated word (using proper syllable positions) gives less distortion than an improperly concatenated word (not using proper syllable positions).

8.22.3 Explanation of algorithm for energy calculation:
When two signals are correlated, the maximum correlation is obtained when the signals overlap completely. Hence the energy is calculated in this region (the region of maximum correlation). Due to delays in the position of syllables, it is necessary to calculate the energy in a window located in the area of maximum correlation.


Fig. 8.26: Correlation graph

In the above diagram 8.26, maximum correlation is achieved at a delay of 3.

Here the second series is slid past the first; at each shift, the sum of the products of the newly aligned terms of the two series is computed. This sum is large when the shift (delay) is such that similar structure lines up. This is essentially the same as convolution, except for the normalization terms in the denominator.

A window is chosen and the energy of the correlated words is calculated within it. The window can be selected by looking at the major lobe of the auto-correlation of the original word and choosing an 'upper limit' and a 'lower limit'. The window can also be selected automatically by taking 50 or 100 samples around the peak value.


Fig. 8.27: Auto-correlation of original vs. original

The above figure 8.27 shows the auto-correlation of the original word. The graph shows strong correlation in the middle lobe with very low correlation near the side lobes. It is a double-sided graph, as the sound file contains both positive and negative peaks.

8.23 ENERGY CALCULATION OF CORRELATED WORDS FOR SPECTRAL DISTORTION REDUCTION:

The average energy is calculated by summing the squares of the samples of the correlated word and dividing by the length of the window.

Average energy = Σ (amplitude)² / (no. of samples) ……………………………………….(8.11)
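A sketch of this windowed energy measure is shown below (Python/NumPy). The window is taken as a fixed number of samples around the correlation peak, as described in the previous section; the half-width of 100 samples is one of the two options mentioned there.

```python
import numpy as np

def windowed_average_energy(xcorr, half_width=100):
    """Average energy (eq. 8.11) of a correlated word inside a window
    centred on the region of maximum correlation."""
    peak = int(np.argmax(np.abs(xcorr)))                          # point of maximum correlation
    lo, hi = max(0, peak - half_width), min(len(xcorr), peak + half_width + 1)
    window = np.asarray(xcorr[lo:hi], dtype=float)
    return float(np.sum(window ** 2) / len(window))               # sum of squares / no. of samples
```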


Fig. 8.28: Original vs. original correlation of word H m_Jma (kamgar)

The above figure 8.28 shows the absolute value of the auto-correlation of the original word H m_Jma (kamgar). For this two-syllable word, the lobes can be seen as shown in the figure. The side lobes have low energy and the middle lobe has very high energy. The middle lobe represents the complete overlap of the original word with itself, whereas the side lobes represent the overlap of Jma (gar) with H m_ (kam) and of H m_ (kam) with Jma (gar).

Fig. 8.29: Original vs. PC word correlation of word H m_Jma (kamgar)


The above figure 8.29 shows the absolute value of the correlation of the original word and the properly concatenated word. A main lobe and side lobes are formed, similar to the auto-correlation of the original word.

Fig. 8.30: Original vs. IC correlation of word H m_Jma (kamgar)

The above figure 8.30 shows the absolute value of the cross-correlation of the original and improperly concatenated word. The energy is distributed throughout the correlation output and there are no distinct lobes (main lobe or side lobes), which shows the distortion present in the IC word. From these graphical results of the energy of cross-correlated words, it is clear that the energy of the original and properly concatenated words is close, while the energy of the improperly concatenated word differs. The graphical results of the improperly concatenated word show more energy distortion.

8.24 ANALYSIS OF RESULTS OF DISTANCE AND ENERGY CALCULATED IN ORIGINAL, PC AND IC WORDS:

The Euclidean distance and the energy of cross-correlation are calculated for PC and IC words, and a comparison is carried out for each parameter between the correlated words. Testing of 500 words is carried out for ED and energy of cross-correlation; a few results are displayed in tabular form below.

Table 8.2: Result table of Euclidean distance calculation

Sr. No. | Word | PC distance from original | IC distance from original
1 | {H emoa (Kishor) | 0.6878 | 5.40
2 | ~m~a (babar) | 1.35 | 3.5
3 | ^yXmZ (bhudan) | 2.52 | 3.31
4 | JwéXod (gurudev) | 6.64 | 9.75
5 | {Jas (giri) | 1.86 | 4.24
6 | H maUs (karani) | 2.27 | 5.93
7 | H madsa (karveer) | 3.7 | 5.15
8 | _hmdsa (mahaveer) | 3.25 | 6.28
9 | _m\ s (maphi) | 1.97 | 3.87
10 | _mes (mashi) | 1.44 | 2.44
11 | _wH m (muka) | 0.92 | 1.26
12 | ZmJZmW (Nagnath) | 2.19 | 3.08
13 | ZH ma (nakar) | 0.91 | 1.15
14 | nwÌ nŒo_ (putraprem) | 0.87 | 2.09
15 | am_~mJ (Rambaug) | 0.71 | 1.60
16 | a§J^y_s (rangbhumi) | 3.94 | 7.78
17 | gmW (saath) | 0.82 | 2.25
18 | pe\ m (shipha) | 0.53 | 1.3
19 | eyadsa (shurveer) | 1.97 | 7.51
20 | gmo_ag (somras) | 2.34 | 4.95
21 | gy\ s (sufi) | 0.98 | 2.75
22 | R mH U (thakan) | 0.66 | 1.16
23 | pVaH g (tirkas) | 1.00 | 2.58
24 | ~wY (budh) | 1.21 | 5.14
25 | Jdma (gawar) | 1.55 | 4.99

In the above table 8.2, the 'PC distance from original' column gives the distance between the original and PC words, and the 'IC distance from original' column gives the distance between the original and IC words. The distance between the original and PC is less than the distance between the original and IC, which shows the close spectral alignment of the original and PC words.

This shows that the original word is more similar to PC than to IC. For a better explanation, consider the word {H emoa (Kishor) in the above table. For this word, the Euclidean distance between the original and PC is 0.68, while that between the original and IC is 5.4. Therefore, PC is closer to the original than IC. The same holds for most of the words in the table; 500 words were tested to obtain this data.

8.24.1 Energy results:
Energy results for the cross-correlation window are shown in the following table. Correlation shows the match between two signals; it is a unit-less parameter.

Table 8.3: Result table of energy calculation

Sr. No. | Word | IP word | PC word | Original (auto-correlation)
1 | Am_d¥j (aamvruksha) | 0.345 | 0.3740 | 18.2297
2 | ~m~wb (babul) | 0.386238 | 1.1208 | 7.1568
3 | ^yXmZ (bhudan) | 0.3930 | 40.184 | 67.680
4 | \ mes (phashi) | 0.0021 | 0.0690 | 17.6240
5 | {Jas (giri) | 1.4370 | 10.7900 | 457.550
6 | JwéXod (gurudev) | 32.6800 | 533.210 | 8803.02
7 | Om_IoS (jamkhed) | 0.0001810 | 0.002164 | 32.9300
8 | H m~yb (kabul) | 0.4000 | 1.6200 | 25.8300
9 | H m\ s (kafi) | 0.3787 | 1.5940 | 30.7989
10 | H madsa (karveer) | 54.24 | 221.31 | 5188.49
11 | ImZXmZ (khandan) | 0.4200 | 2.7900 | 57.7700
12 | _YwH a (Madhukar) | 3.1010 | 23.3300 | 487.05
13 | _hmdsa (Mahaveer) | 78.8800 | 242.21 | 2068.42
14 | _m\ s (mafi) | 0.3463 | 7.8910 | 100.291
15 | _mes (mashi) | 0.0943 | 14.50 | 35.030
16 | _mPm (maza) | 3.440 | 7.770 | 85.490
17 | n³o_amO (Premraj) | 0.7956 | 3.237 | 81.970
18 | amO (Raj) | 0.4239 | 4.306 | 16.05
19 | am_~mJ (Rambaug) | 0.6410 | 3.7860 | 16.050
20 | am_nya (Rampur) | 0.4760 | 2.172 | 42.270
21 | a§J^y_s (rangbhumi) | 42.24 | 814.64 | 9878.780
22 | gmW (sath) | 0.07070 | 4.470 | 18.340
23 | eyadsa (shurveer) | 8.340 | 432.080 | 1176.41
24 | gw\ s (suphi) | 4.280 | 33.79 | 172.38
25 | gyH a (sukar) | 2.910 | 14.68 | 61.990

The above table 8.3 shows the energy values of the correlated outputs in the selected window. The energy of the original-PC correlated word is higher than the energy of the original-IC correlated word; a higher value indicates higher correlation.

Hence the energy calculation algorithm shows that the PC word resembles the original word more closely than the IC word does. As an example, consider the word ~m~wb (babul) in the above table: the energy of the original-PC correlated word (1.1208) is higher than the energy of the original-IC correlated word (0.386238), so the PC word is closer to the original word than the IC word.


8.25 OTHER ALGORITHMS OF SPECTRAL SMOOTHING:

The main aim of this work is to increase the naturalness of the text to speech system by applying different smoothing algorithms for proper and improper concatenated words. The following methods are tested for reduction of spectral distortion [70].

The parameters used for smoothing are:
1) Power spectral density
2) Wavelet transform
3) Dynamic time warping

8.26 IDENTIFICATION OF MISMATCH OF FORMANTS USING PSD:

The power spectral density (PSD) Sx(w) of a signal is a measure of its power distribution as a function of frequency. PSD is a very useful tool if one wants to identify oscillatory signals in time series data and to know their amplitude. In this research work, PSD is used to find discontinuities in frequency and to smooth these discontinuities.

Sx(w) = |X(f)|² ……………………………. (8.12)

where Sx(w) = power spectral density of the input signal and |X(f)| = absolute value of the FFT of the signal.

8.26.1 PSD Algorithm: This algorithm is explained in the following section. PSD shows graphical results for spectral distortion.

8.26.2 Purpose of algorithm
This algorithm is used to find the distortion at the concatenation point graphically using the PSD plot. The square of the absolute value of the FFT gives the PSD value. PSD is used to find the concatenation distortion at syllable joints in a word.


8.26.3 Methodology used and step by step implementation
The following steps show how PSD is implemented to find the spectral distortion of a synthesized word. They also show how the position of a syllable is important for the reduction of spectral noise.

1) Divide the speech signal into frames of 110 samples.
2) Calculate the FFT (fast Fourier transform) of each frame.
3) Take the absolute value of the FFT and square it to get the PSD.
4) Normalize the PSD by dividing it by the length of the window.
5) Plot the PSD of each frame for the original, proper and improper concatenated words.
6) Calculate three peak values, i.e. three formants, for each frame.
7) Find the frame at which the concatenation is carried out for the original, proper and improper word.
8) Calculate the difference between the peak values of original-proper and original-improper words for all three formants.
9) From the difference values, identify the mismatch in the proper and improper word.
10) The formants of the improper word are modified according to the difference values.
11) The distorted values of the formants are replaced by the modified values.
12) The inverse Fourier transform is taken and the signal is reconstructed.
13) The word is played back.
14) The formant values of the improper word are given as input and the formant values of the original word are given as output to the back-propagation algorithm for training the neural network.
15) This back-propagation algorithm reduces the spectral distortion of improperly concatenated frames for the given input word.
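A compact Python/NumPy sketch of steps 1)-6) is given below. Only the frame length of 110 samples and the three frequency ranges come from the text; the sampling frequency `fs`, the frame handling and the simple in-band peak picking are assumptions.

```python
import numpy as np

FRAME_LEN = 110                                                      # samples per frame (step 1)
BANDS = {"f0": (300, 700), "f1": (700, 1500), "f2": (1500, 3500)}    # Hz (section 8.26.4.1)

def frame_psd(frame, fs):
    """Normalised PSD of one frame: |FFT|^2 / window length (steps 2-4)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2 / len(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs, spec

def formant_peaks(frame, fs):
    """Peak PSD value in each formant band for one frame (step 6)."""
    freqs, spec = frame_psd(frame, fs)
    peaks = {}
    for name, (lo, hi) in BANDS.items():
        band = spec[(freqs >= lo) & (freqs < hi)]
        peaks[name] = float(band.max()) if band.size else 0.0
    return peaks

def word_formants(signal, fs):
    """Frame the word (step 1) and collect the band peaks frame by frame."""
    frames = [signal[i:i + FRAME_LEN]
              for i in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN)]
    return [formant_peaks(np.asarray(f, dtype=float), fs) for f in frames]
```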

8.26.4 Results/outcome of the algorithm
This algorithm gives a clear picture of the distortion at the concatenation point of the synthesized word in the form of a graphical plot. PSD results are good indicators for spectral distortion estimation. With modification of the formant plots, this distortion is reduced using a neural network (back-propagation).


The power spectral density is then plotted. Human speech largely varies in the range of 300 Hz to 3500 Hz, so the PSD is plotted for this range to understand the speech frequency behaviour. In the PSD plots, the frames are placed one after the other, with amplitude (sample power) on the x-axis and frequency on the y-axis, as shown in figure 8.31 below.

8.26.4.1 Research findings
In all the PSD plots below, the x-axis shows the respective sample power. To locate the different formants properly, frequency is plotted on the y-axis.

This makes it easier to see peaks at particular frequencies in three different ranges. Three ranges are used because there are three formants for which the study of the speech signal is carried out. The frequency ranges taken in this study are 300-700 Hz for 'f0', 700-1500 Hz for 'f1' and 1500-3500 Hz for 'f2'. The first three formants are important for speech signals.

The following figure 8.31 shows PSD plot of original word XodJS (Devgad)

(In the figure, one highlighted portion represents the spectrum of the syllable Xod (Dev) and the other represents the spectrum of JS (Gad) in XodJS (Devgad).)

Fig. 8.31: PSD plot of original XodJS (Devgad)


The above figure shows the PSD plot of the word XodJS (Devgad). The two syllables of this word are easily located in two different regions of the PSD plot.

8.26.5 Peak determination:
It is observed that multiple peaks are obtained in a single range, so smoothing of the graph is required. For this, a moving average filter is used. Figure 8.32 shows the difference between the multiple-peak graph and the smoothed single-peak graph.

(Left: graph with multiple peaks; right: smoothed graph with a single peak.)

Fig. 8.32: Concept of peak determination
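A minimal sketch of this smoothing step is shown below (Python/NumPy); the window width of 5 points is an assumed value, not taken from the text.

```python
import numpy as np

def moving_average(values, width=5):
    """Moving-average filter used to merge several nearby peaks into a single smooth peak."""
    kernel = np.ones(width) / width
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="same")
```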

8.26.6 Formant extraction:
All three formants are plotted on the same graph for each word (the selected original word, the properly concatenated word and the improperly concatenated word). For this plot, frequency is kept on the y-axis and the number of samples with their amplitude on the x-axis.

In case these plots are not very smooth, different filters can be applied. The idea behind this plot is to see the difference and the similarity in the same formants of two different words, for example:
1) Between the original and properly concatenated word.
2) Between the original and improperly concatenated word.
3) Between the properly and improperly concatenated word.

It is observed that at the place of the concatenation joint there is a change in the nature of the formant plot. When the same procedure is applied to the properly and improperly concatenated words, it is observed that the change in the nature of the graph is much greater in the case of improper concatenation.

The following figure 8.33 shows formant plot of XodJS (Devgad)

Fig. 8.33: Formant plot of original XodJS (Devgad), showing the curves for f0, f1 and f2

In the above figure 8.33, three formants of given input word are shown.

8.27 SLOPE CALCULATION FOR FORMANT NOISE DETECTION:

From the formant plots, more important information is extracted by calculating the gradient of each formant. If the slope of the entire formant plot is taken at once, the result obtained does not make much sense; therefore the gradient is calculated over ranges of 5 samples. It is observed that the difference in gradient is much greater in the case of improper concatenation. With this study of slopes, changes can be made in the concatenated word to make its behaviour as close as possible to that of the original word, which has very low concatenation cost.
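A sketch of the 5-sample gradient calculation is given below (Python/NumPy); `formant_track` stands for the sequence of peak values of one formant across frames, which is an assumed representation.

```python
import numpy as np

def piecewise_slopes(formant_track, step=5):
    """Gradient of a formant track taken over successive ranges of `step` samples (section 8.27)."""
    track = np.asarray(formant_track, dtype=float)
    slopes = []
    for start in range(0, len(track) - step + 1, step):
        seg = track[start:start + step]
        slopes.append((seg[-1] - seg[0]) / (step - 1))   # rise over the 5-sample range
    return np.array(slopes)
```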


8.27.1 Calculation and plotting of normalized values:
To study the difference in a more precise manner, the normalized values of all the formants are taken and plotted. They are plotted such that the formants of different words can be studied and compared in the same graph. Normalization restricts the range in which a formant varies to the small limit of 0 to 1, making it easier to analyze the nature of any type of concatenation.
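A minimal sketch of this 0-to-1 scaling is shown below (Python/NumPy).

```python
import numpy as np

def normalise(track):
    """Scale a formant track to the 0-1 range so different words can share one plot (section 8.27.1)."""
    track = np.asarray(track, dtype=float)
    span = track.max() - track.min()
    return (track - track.min()) / span if span > 0 else np.zeros_like(track)
```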

If the formant 'f0' is considered, it can be observed that the plain formant plot shows high variation in the values, so it is difficult to conclude anything. If the normalized plot is used instead, the analysis becomes much easier and the graph is much more effective. The following figures show the formant plots for formant f0 in normalized form for PC and IC.

Fig. 8.34: Normalized plot of ‘f0’ (proper concatenation)


Fig. 8.35: Normalized plot of ‘f0’ (improper concatenation)

By observing figures 8.34 and 8.35, it can be concluded that the normalized plot for proper concatenation is more similar to the original plot than that for improper concatenation. There are more peaks in figure 8.35, the improper plot, which indicates spectral distortion.

8.27.2 Implementation of PSD and formant plots:

8.27.2.1 Purpose of algorithm
This algorithm is used for calculating formants from the PSD plot. Formant plots help to estimate the presence of spectral distortion in the synthesized word and to find the difference between the original and synthesized speech output.

8.27.2.2 Methodology used
To implement this algorithm, the following steps are followed.

Formant plot calculation algorithm:
1) Selection of the input word.
2) Selection of words for concatenation.
3) Cutting of syllables from words for concatenation.


4) Concatenation is carried out manually using 'Sound Forge' software or automatically using the concatenation program.
5) Manual correction of the words using time domain analysis.
6) All the words are cut into samples of 10 millisecond duration.
7) The PSD of each word is plotted.
8) Peaks are extracted from every PSD plot for plotting the formants.
9) The slope of each formant plot is calculated over a range of 50 milliseconds.
10) The normalized value graph is plotted for each formant for all three combinations.
11) The slopes of all the formants are compared for all three combinations.
12) The values of the slopes are plotted for a more precise comparison.
13) The cross-correlation of the original signal is taken with the properly concatenated word and the improperly concatenated word.
14) A comparison is carried out between the concatenated and original words on the basis of the different parameters.

8.27.2.3 Results/outcome of the algorithm
This algorithm gives the formant/slope plots, which help to find spectral distortion; by calculating the difference between the original and synthesized words, the distorted (improperly synthesized) word can be improved in terms of spectral noise.

8.27.3 Word selection:
The following examples show how concatenated words are formed with the help of the 'Sound Forge' software.

Original word: XodJS (Devgad)
Concatenated word:
Proper concatenation: XodYa (Devdhar) + M¨XsJS (Chandigadh) = XodJS (Devgad)
Improper concatenation: am_Xod (Ramdev) + JSJSVmZm (gadgadtana) = XodJS (Devgad)

Original word: [dXoe (videsh)
Concatenated word:
Proper concatenation: [d_mZ (viman) + ñdXoe (swadesh) = [dXoe (videsh)
Improper concatenation: d¡^ds (Vaibhavi) + Xoe^ŠVs (deshbhakti) = [dXoe (videsh)

8.27.4 PSD plots:
In all these PSD figures, the x-axis shows the energy/amplitude or power of the respective sample and the y-axis shows the respective frequency.

Fig. 8.36: PSD plot of original word XodJS (Devgad)

The above figure 8.36 represents PSD plot of original input word XodJS (Devgad)

Fig. 8.37: PSD plot of properly concatenated XodJS (Devgad)


Figure 8.37 shows the PSD plot of the properly concatenated word. There is a difference in energy where the concatenation of the word takes place. In the figures above and below, the highlighted areas are marked with the letters (a) and (b) and are explained below.

Fig. 8.38: PSD plot of improperly concatenated XodJS (Devgad)

a) In the above figure, in the highlighted area of the plot of the improperly concatenated word XodJS (Devgad), it can easily be seen that, compared to the original word and the properly concatenated word, the energy in the later part of the graph is low. This is because in the improperly concatenated XodJS (Devgad) the second syllable JS (gad) has been taken from the word JSJSVmZm (gadgadtana). As it is taken from the initial position, the energy of the syllable JS (gad) is not as required. This leads to spectral mismatch between the original and the improperly concatenated word.
b) The highlighted area in the plot of the properly concatenated word XodJS (Devgad) (fig. 8.37) shows higher energy during the speaking of the syllable JS (gad). This happens because of the proper selection of the word for concatenation. The energy of the word should be the same as that of the original word to achieve a natural voice even after concatenation.


c) During proper concatenation of this word, the JS (gad) syllable is taken from M§XsJS (Chandigad). While speaking the word M§XsJS (Chandigad), the user can feel that the syllable JS (gad) sounds similar to the JS (gad) in the word XodJS (Devgad). Thus the PSD method reveals the spectral mismatch between the original and concatenated word, and it is used for spectral mismatch estimation of the synthesized word.

The following figures show PSD results for one more word pdXoe (videsh)

Fig. 8.39: PSD plot of original word pdXoe (videsh)

Fig. 8.40: PSD plot of properly concatenated word pdXoe (videsh)


From the PSD plots of the original and properly concatenated pdXoe (videsh), it can be observed that the concatenation takes place at around 0.19 to 0.20 on the x-axis (normalised plot), that is, around 190 milliseconds. There is a change in the nature of the waveform at this point, and this change is because of the concatenation.

Fig. 8.41: PSD plot of improperly concatenated word pdXoe (videsh)

In the improper concatenation plot of the word pdXoe (videsh), high energy is seen in the range of formant 'f2' in figure 8.41. Comparing this graph with the other two plots of the same word (original and properly concatenated), it can be observed that this happens because of the improper concatenation. As the word in improper concatenation is formed from a syllable position that does not match the requirement of the synthesized word, the resulting distortion is greater. This is an important parameter that reveals the spectral mismatch between proper and improper concatenation of words.

The difference and similarity between all three words (original, PC and IC) can also be observed in terms of the sound files. To compare the sound content of these three words, the lengths of all the words should be the same and they should be properly aligned. The following figure 8.42 shows the sound files of the original, PC and IC words for XodJS (Devgad); these wave files are adjusted to the same length.


Fig. 8.42: Sound files of original, PC and IC XodJS (Devgad)

The above figure is divided into three subparts, as labeled on the left side of the diagram:
a) Original XodJS (Devgad)
b) Properly concatenated XodJS (Devgad)
c) Improperly concatenated XodJS (Devgad)

Figure 8.42 displays the time domain nature of the three different words. It is very important to align the words properly, otherwise the results obtained will not provide accurate information. The different scenarios are as follows:

a) If, during the placement of the second syllable, the gap is too large, then the entire second syllable is shifted and the observations in figure 8.42 show a great deal of mismatch between the original and concatenated word. In reality that is not a mismatch but a manual error made during concatenation.
b) Another important point during concatenation is that the lengths of all three words should be the same, otherwise the number of samples may vary between the words.

To avoid misalignment during the concatenation of the proper and improper words, the Sound Forge software is used. This software takes care of the proper alignment of syllables to form the synthesized word.


8.28 ANALYSIS OF FORMANTS TO EVALUATE SPECTRAL MISMATCH:

The following figures show formant plots for original, PC and IC word for XodJS (Devgad)

Fig. 8.43: Formant plot of original XodJS (Devgad)

Fig. 8.44: Formant plot of properly concatenated XodJS (Devgad)


Fig. 8.45: Formant plot of improperly concatenated XodJS (Devgad)

The above figures (8.43, 8.44 and 8.45) show that the variation in f2 is greater in all three plots (original, proper and improper). This shows that the variation in the higher frequency range 'f2' is greater than in the lower frequency range 'f0'. The base frequency of the speaker remains the same for different words, but the higher frequencies vary a lot with the word and the energy with which it is spoken. While analyzing the spectral mismatch results, it is observed that the formant 'f0' results are much better than those for 'f2'.


Fig. 8.46: Formants ‘f0’ for original, PC and IC XodJS (Devgad)

In the above figure 8.46, the formant 'f0' of the three different versions (original, proper and improper) of the word XodJS (Devgad) is shown. From this figure it can be observed that the base frequency of all the words remains the same, but after concatenation there is a change in the nature of the improperly concatenated word.

The original and proper XodJS (Devgad) show a peak at the same place, whereas the improper concatenation shows multiple peaks around the area of concatenation, which leads to spectral mismatch. There is a difference in energy between the improperly concatenated word XodJS (Devgad) and the original; this difference is because of the wrong selection of the syllable position during concatenation.


Fig. 8.47: Formant ‘f1’ of original, PC and IC amOJS (Rajgad)

Figure 8.47 is the formant 'f1' plot of amOJS (Rajgad) for the original, proper and improper versions. From this figure it can be observed that the original and properly concatenated words show the same nature throughout the plot, but there is a complete mismatch between the improper plot and the other two (original and proper). The improper plot shows peaks at the same places, but at the point of concatenation there is a huge difference in energy between the words.


Fig. 8.48: Formant f2 of original, PC and IC word pdXoe (videsh)

Figure 8.48 represents the formant 'f2' plot of pdXoe (videsh) for the original, PC and IC versions. In this figure, the natures of the original and PC plots are almost similar, but the nature of the improperly concatenated word is very different from the original word. At the place of concatenation, around the 19th and 20th samples, the improperly concatenated word gives peaks at very high frequency; this happens because pd (vi) of pdXoe (videsh) is taken from the ending portion of d¡^ds (Vaibhavi), which has different energy.


Fig. 8.49: Slope plot ‘f2’ of original, PC and IC for word amOJS (Rajgad)

Figure 8.49 is a slope plot of formant 'f2' for the word amOJS (Rajgad). From this figure it can be observed that the nature of the original and PC slopes is similar throughout the figure, i.e. they show positive and negative slopes at the same positions, while the slope of the improperly concatenated word (IC) is very different in nature. The extra peaks of the IC curve show the spectral distortion present at the concatenation point.


Fig. 8.50: Slope ‘f0’ plot of XodJS (Devgad)

Figure 8.50 is a slope plot of formant 'f0' for the word XodJS (Devgad). From this figure it is observed that the slope of the properly concatenated word is much closer to the original than that of the improperly concatenated word. Around the 5th and 6th samples the improper slope is descending, whereas the other two curves are ascending; this is the place where the new syllable comes into the picture after concatenation. There is a lot of distortion near the concatenation point of the IC word. The variation of sample energy is less than in the f1 and f2 plots, as this is a low-frequency graph.

PSD results for one more word are shown below, along with formant plots.


Fig. 8.51: PSD plot of original word ~mJdmZ (bagwan)

Fig. 8.52: PSD plot of proper concatenated word ~mJdmZ (bagwan)


Fig. 8.53: PSD plot of improper concatenated word ~mJdmZ (bagwan)

Fig. 8.54: Formant plot of original word ~mJdmZ (bagwan)


Fig. 8.55: Formant plot of proper concatenated word ~mJdmZ (bagwan)

Fig. 8.56: Formant plot of improper concatenated word ~mJdmZ (bagwan)

From all the above PSD and formant plots, it is clear that the higher formants f1 and f2 show more variation than f0. These plots show the spectral distortion present at the concatenation point of the synthesized word and help to estimate the presence of spectral noise. This distortion is greater for the improperly concatenated word than for the properly concatenated word [71].


8.29 COMPREHENSIVE ANALYSIS OF RESULTS OBTAINED BY PSD:

To examine the spectral mismatch and to analyse the concatenation based on different parameters, a GUI is developed. The user runs the GUI code in Matlab, after which the following window appears. With the help of the GUI, testing of the algorithm becomes easier.

Fig. 8.57: Snapshot of PSD GUI

The above window has two options:
a) To select the word for which the analysis has to be done.
b) To select the parameter for which results are given as output.


Fig. 8.58: PSD plot of original word pdXoe (videsh) in GUI

As an example the above figure 8.58 shows the power spectral density plot for the original word pdXoe (videsh)

8.29.1 PSD results on a frame-by-frame basis:
PSD plots become clearer and easier to understand in STFT (short-time Fourier transform) form. The PSD plots of the original, proper and improper words are plotted, and the cutting point is also shown; the red frame marks the cutting point in the following figures. The x-axis is time and the y-axis is frequency, i.e. a time-frequency (STFT) representation. Results of the PSD algorithm in STFT format are given below for a few words.


8.29.2 Results for 2-syllable words:

Fig. 8.59: PSD plot of original {eH ma (shikar) and properly concatenated {eH ma (shikar)

Fig. 8.60: PSD plot of original {eH ma (shikar) and improperly concatenated {eH ma (shikar)

The part of the plot before the red line, i.e. the cutting point, is the syllable {e (shi), and the part after the red line is the syllable H ma (kar). It can be seen from the plots that there is a large difference between the formants of the improper word and the original word, which leads to spectral mismatch, whereas the formants of the proper word are similar to the formants of the original word. Thus a properly concatenated word is more similar to the original than an improperly concatenated word. The formants of the improper word are modified by calculating the difference between the original and improper formants; this difference is used to replace the distorted formants with the original formants. This results in an improvement in the naturalness of the speech output.


Fig. 8.61: PSD plot of improper {eH ma (shikar) and modified improper {eH ma (shikar)

In the modified plot, the formants of the improper {eH ma (shikar) are modified such that they are made similar to the formants of the original {eH ma (shikar).

A few more results for two-syllable words are shown below:

Fig. 8.62: PSD plot of original _maoH as (marekari) and properly concatenated _maoH as (marekari)


Fig. 8.63: PSD plot of original _maoH as (marekari) and improperly concatenated _maoH as (marekari)

Fig.8.64: PSD plot of original _maoH as (marekari) and improperly concatenated _maoH as (marekari) after modification

Fig. 8.65: PSD plot of original A\ dm (afava) and properly concatenated A\ dm (afava)


Fig. 8.66: PSD plot of original A\ dm (afava) and improperly concatenated A\ dm (afava)

Fig 8.67: PSD plot of original A\ dm (afava) and improperly concatenated A\ dm (afava) after modification

8.29.3 Numerical results of PSD for 2-syllable words:
The difference between the PSD of the original-proper and original-improper words is calculated. The difference is taken for 15 frames before and 15 frames after the concatenation point. In the following table, the marked row corresponds to the concatenation point (shown in red in the original). All the values are in watt/Hz.

Table 8.4: Numerical results for word {eH ma (shikar)

First formant (watt/Hz) | Second formant (watt/Hz) | Third formant (watt/Hz)
Orig-proper | Orig-improper | Orig-proper | Orig-improper | Orig-proper | Orig-improper
4.372123 | 6.045172 | 1.604309 | 2.436136 | 1.251421 | 2.815694
3.620833 | 4.506938 | 0.460083 | 2.374748 | 1.989462 | 4.631546
2.40446 | 4.511822 | 3.265818 | 5.677004 | 0.489891 | 1.615047
0.638449 | 5.801731 | 0.370185 | 2.26516 | 0.922886 | 1.495002
2.088958 | 3.81154 | 2.561768 | 3.441863 | 5.143503 | 6.438593
5.265664 | 6.476127 | 0.368604 | 0.560566 | 6.485193 | 6.678317
2.165232 | 7.76862 | 0.395312 | 0.702921 | 9.625612 | 10.922336
2.045654 | 6.626132 | 4.291382 | 5.852739 | 3.669644 | 3.865797
2.234591 | 3.023234 | 3.792063 | 3.839177 | 4.369755 | 5.829948
0.382335 | 0.431937 | 2.800457 | 4.684991 | 2.901609 | 3.280343
1.376366 | 2.218391 | 0.101408 | 1.08962 | 1.307272 | 1.808039
1.168121 | 2.671595 | 1.472119 | 2.408066 | 1.01305 | 1.199581
0.049982 | 0.774933 | 1.834192 | 1.988952 | 0.431505 | 0.455103
0.088094 | 0.116415 | 0.43646 | 0.588045 | 0.174199 | 0.623503
0.439592 | 0.614891 | 0.097197 | 0.331646 | 0.508808 | 0.818166
0.109936 | 0.132355 | 0.522631 | 0.859446 | 0.106916 | 0.315197   (concatenation frame)
0.261019 | 0.294106 | 0.248495 | 0.340329 | 0.291089 | 0.543552
0.062745 | 0.92278 | 0.280008 | 0.70827 | 0.544073 | 0.665108
0.55765 | 0.623611 | 0.180804 | 0.252585 | 0.117792 | 0.257613
0.369961 | 0.42873 | 0.43877 | 0.514684 | 0.131256 | 0.200727
0.122514 | 0.437498 | 0.148938 | 0.26834 | 0.008189 | 0.013048
0.462915 | 0.650152 | 0.143341 | 0.471157 | 0.120646 | 0.301773
0.647162 | 0.687923 | 9.965161 | 10.562599 | 0.330939 | 0.396689
3.826413 | 0.487544 | 7.373661 | 8.856716 | 0.155745 | 0.414947
2.382338 | 3.217219 | 0.474792 | 0.55573 | 0.625918 | 0.763958
2.475961 | 3.181049 | 4.058562 | 5.22791 | 0.043244 | 0.819639
0.23043 | 0.383124 | 16.04277 | 26.20343 | 1.088882 | 1.804937
6.754812 | 7.818619 | 23.49401 | 25.50191 | 1.989208 | 2.735277
25.81782 | 31.370634 | 16.38138 | 18.394 | 2.010129 | 3.333212
33.51124 | 41.39123 | 25.02413 | 25.66102 | 2.161204 | 3.200574
57.2219 | 62.97374 | 4.307781 | 9.840589 | 7.41784 | 8.558939

From table 8.4 it can be seen that there is a large difference in the PSD values of the proper, improper and original words, which indicates the spectral mismatch in the proper and improper words. The mismatch is greater near the concatenation joint, and it is greater for the improperly concatenated word than for the properly concatenated word. Considering the first formant in the frame just before the concatenation frame, the original-proper value is 0.439592 watt/Hz whereas the original-improper value is 0.614891 watt/Hz. Averaged over all formants, a difference of 0.0366 watt/Hz is observed for PC words compared with 0.1186 watt/Hz for IC words. This shows that properly concatenated words are closer to the original words than improperly concatenated words.

8.29.4 Results for 3 syllable words:

Fig. 8.68: PSD plot of original M_H sXma (chamkidar) and properly concatenated M_H sXma (chamkidar)

Fig. 8.69: PSD plot of original M_H sXma (chamkidar) and improperly concatenated M_H sXma (chamkidar)

There are two cutting points in figure 8.68, as it is a three-syllable word. The part of the plot before the first cutting point is the syllable M_ (cham), the part after that is the syllable H s (ki), and the part after the second cutting point is the syllable Xma (dar). It can be seen from the plots that the difference between the PSD of the improper word and the original word is larger than that between the proper and original word, which leads to spectral mismatch in the improper word.

Fig. 8.70: PSD plot of improper M_H sXma (chamkidar) and improperly concatenated M_H sXma (chamkidar) after modification

In the modified plot, the PSD values of improper word near concatenation points are modified and made similar to the original word. The results for a few more words are shown below.

Fig. 8.71: PSD plot of original H m_YmaUm (kamdharana) and properly concatenated H m_YmaUm (kamdharana)


Fig. 8.72: PSD plot of original H m_YmaUm (kamdharana) and improperly concatenated H m_YmaUm (kamdharana)

Fig. 8.73: PSD plot of original H m_YmaUm (kamdharana) and improperly concatenated H m_YmaUm (kamdharana) after modification

Fig. 8.74: PSD plot of original XemdVma (dashavtar) and properly concatenated XemdVma (dashavtar)


Fig. 8.75: PSD plot of original XemdVma (dashavtar) and improperly concatenated XemdVma (dashavtar)

Fig. 8.76: PSD plot of original XemdVma (dashavtar) and improperly concatenated XemdVma (dashavtar) after modification

8.29.5 Numerical results of PSD for 3-syllable words:
The difference between the formants of the original-proper and original-improper words is calculated. The difference is taken for 10 frames before and 10 frames after both concatenation points, for all three formants. In the following table, the marked rows correspond to the concatenation frames (shown in red in the original).

Table 8.5: Numerical results for word H m_YmaUm (kamdharana)

First formant (watt/Hz) | Second formant (watt/Hz) | Third formant (watt/Hz)
Orig-proper | Orig-improper | Orig-proper | Orig-improper | Orig-proper | Orig-improper

First cutting point
36.27805 | 40.07811 | 2.891668 | 5.023969 | 3.145258 | 4.080231
33.66358 | 39.155186 | 3.055134 | 4.712101 | 3.943713 | 5.379192
13.00875 | 14.280157 | 4.292516 | 5.768871 | 7.061861 | 8.139648
4.06879 | 8.085539 | 0.125269 | 3.248089 | 5.745949 | 7.017595
7.865777 | 8.570448 | 0.568614 | 1.234643 | 5.765296 | 6.982857
10.7629 | 11.868121 | 0.580662 | 0.8266 | 5.405784 | 6.317275
2.993148 | 5.683566 | 1.335886 | 2.007553 | 0.932428 | 2.643646
16.99962 | 19.232243 | 16.70605 | 20.658192 | 3.917732 | 6.288512
18.59749 | 19.15083 | 21.18729 | 26.144337 | 9.186845 | 9.882986
18.33306 | 20.685015 | 23.47998 | 26.639247 | 4.563777 | 4.729203
26.60697 | 31.48352 | 17.76912 | 21.7026 | 3.930504 | 4.041851   (concatenation frame)
20.80796 | 20.978751 | 2.587397 | 3.2637 | 0.480476 | 0.510656
28.01041 | 29.455568 | 8.628993 | 9.205696 | 0.214801 | 1.320191
25.38261 | 26.974617 | 6.325874 | 8.910257 | 0.577276 | 1.449041
44.03956 | 45.375221 | 2.475307 | 3.349288 | 0.015699 | 1.798456
26.51792 | 28.527094 | 4.063266 | 5.982963 | 0.155259 | 1.737049
19.0186 | 24.482236 | 19.5535 | 20.255813 | 1.005329 | 2.028725
45.90944 | 46.55739 | 28.4668 | 29.632104 | 0.237315 | 1.486242
68.33064 | 71.23094 | 33.54166 | 40.37203 | 2.736582 | 3.474859
77.31558 | 79.417367 | 3.357287 | 5.52404 | 0.785367 | 2.158093
114.6353 | 16.12562 | 11.04742 | 33.74665 | 6.692223 | 6.804684

Second cutting point
33.92581 | 72.26906 | 9.371313 | 29.7966 | 13.60256 | 13.812138
14.59645 | 29.10085 | 20.90049 | 21.817026 | 11.3407 | 11.981676
35.08267 | 42.26339 | 27.2529 | 30.09064 | 3.551877 | 4.946423
43.86819 | 48.38172 | 38.74977 | 40.91644 | 0.410496 | 0.531085
14.89505 | 17.02978 | 44.2576 | 45.8463 | 4.350424 | 5.795476
3.774988 | 8.415386 | 37.15206 | 43.31233 | 2.710941 | 4.578499
15.44003 | 19.534411 | 10.76163 | 36.24272 | 0.108098 | 3.95355
9.843838 | 10.925325 | 54.55704 | 61.74958 | 2.023395 | 5.623964
23.05398 | 24.807055 | 34.40608 | 38.071655 | 1.286419 | 2.007893
9.398537 | 10.114244 | 20.93181 | 21.91756 | 6.549263 | 7.637051
0.60372 | 3.935164 | 5.579357 | 7.448092 | 5.986308 | 6.834373   (concatenation frame)
4.759237 | 6.782752 | 5.743766 | 8.791217 | 6.371411 | 7.493872
3.685446 | 5.129344 | 0.730931 | 2.246993 | 4.381086 | 5.357424
5.150725 | 7.780251 | 4.627136 | 4.776095 | 2.322764 | 3.003323
2.149388 | 3.149993 | 8.16507 | 9.919301 | 3.986215 | 4.184112
3.357263 | 4.749767 | 0.3144 | 12.36594 | 2.357354 | 3.083138
27.24548 | 30.838936 | 45.37261 | 66.76282 | 2.180648 | 3.80172
26.86963 | 34.85197 | 104.9637 | 106.2681 | 10.02977 | 11.307106
10.3362 | 14.38211 | 79.06561 | 66.42473 | 0.734541 | 1.094017
13.62374 | 30.22268 | 10.87184 | 17.3922 | 4.158423 | 9.545652
58.4716 | 59.32296 | 68.618 | 108.3607 | 13.67478 | 14.52841

From the table it can be seen that there is a large difference in the PSD values of the proper, improper and original words, which indicates the spectral mismatch. The mismatch is greater near the concatenation frames.

8.29.6 Results for 4 syllable words:

Fig. 8.77: PSD plot of original A{d^mÁ` (avibhajya) and properly concatenated A{d^mÁ` (avibhajya)


Fig. 8.78: PSD plot of original A{d^mÁ` (avibhajya) and improperly concatenated A{d^mÁ` (avibhajya)

In figure 8.77 there are three cutting points, as it is a four-syllable word. The part of the plot before the first cutting point is the syllable A (aa), the second part is the syllable {d (vi), the third part is the syllable ^mO² (bhaj) and the last part is the syllable ` (ya).

It can be seen from the plots that there is large difference in the PSD of improper word and original word which leads to spectral mismatch. The values of the proper word are similar to the values of the original word.

Fig. 8.79: PSD plot of original A{d^mÁ` (avibhajya) and improperly concatenated A{d^mÁ` (avibhajya) after modification

In the modified plot, the PSD values of the improper word near the concatenation points are modified and made similar to the PSD values of the original word. The results of a few more words are shown below.


Fig. 8.80: PSD plot of original X¡Z§{XZs (dainandini) and properly concatenated X¡Z§{XZs (dainandini)

Fig. 8.81: PSD plot of original X¡Z§{XZs (dainandini) and improperly concatenated X¡Z§{XZs (dainandini)

Fig. 8.82: PSD plot of original X¡Z§{XZs (dainandini) and improperly concatenated X¡Z§{XZs (dainandini) after modification


Fig. 8.83: PSD plot of original XoUoKoUo (deneghene) and properly concatenated XoUoKoUo (deneghene)

Fig. 8.84: PSD plot of original XoUoKoUo (deneghene) and improperly concatenated XoUoKoUo (deneghene)

Fig. 8.85: PSD plot of original XoUoKoUo (deneghene) and improperly concatenated XoUoKoUo (deneghene) after modification


8.29.7 Numerical results of PSD for 4-syllable words:
The difference between the PSD of the original-proper and original-improper words is calculated. The difference is taken for 5 frames before and 5 frames after all three concatenation points. In the following table, the marked rows correspond to the concatenation joints (shown in red in the original).

Table 8.6: Numerical results for word X¡Z§{XZs (dainandini)

First formant (watt/Hz) | Second formant (watt/Hz) | Third formant (watt/Hz)
Orig-proper | Orig-improper | Orig-proper | Orig-improper | Orig-proper | Orig-improper

First cutting point
13.94382 | 15.174104 | 5.794973 | 7.094022 | 6.036004 | 8.585142
1.234095 | 3.408347 | 5.548151 | 6.991546 | 7.15027 | 8.928464
12.36273 | 13.15407 | 6.55481 | 7.782428 | 9.41544 | 10.350509
7.341301 | 8.150824 | 1.976874 | 5.892173 | 12.35026 | 13.485173
4.091253 | 6.624903 | 4.746626 | 6.483166 | 18.16976 | 18.65776
27.65433 | 29.0168 | 1.969687 | 3.505509 | 21.96549 | 22.32044   (concatenation frame)
31.77165 | 32.2116 | 12.21881 | 13.411996 | 18.29208 | 24.768818
10.00293 | 12.351049 | 2.57219 | 6.699076 | 9.152967 | 10.54143
12.62334 | 15.77474 | 0.702009 | 4.716567 | 4.756187 | 18.40054
3.423433 | 28.57654 | 1.275958 | 11.89488 | 4.134532 | 5.89905
2.701193 | 3.739104 | 0.947142 | 10.292241 | 6.432128 | 8.115743

Second cutting point
52.27924 | 52.93076 | 29.3507 | 32.6362 | 10.20964 | 11.009482
18.16564 | 25.28398 | 54.68833 | 58.14098 | 6.775907 | 17.75748
21.47503 | 30.3718 | 20.568 | 23.70448 | 0.144511 | 11.50711
21.23111 | 23.69572 | 41.35609 | 44.19597 | 5.873263 | 16.44086
20.9595 | 34.16229 | 19.09652 | 27.55846 | 2.241797 | 15.14045
20.79091 | 21.42204 | 19.33336 | 18.4607 | 5.071511 | 6.401831   (concatenation frame)
1.822273 | 4.57195 | 21.38057 | 23.92201 | 2.171836 | 14.29043
4.479368 | 5.113232 | 2.377366 | 3.448695 | 0.164909 | 12.02444
4.255636 | 6.551273 | 1.195149 | 3.762329 | 4.377598 | 12.1642
5.374431 | 6.050691 | 6.037985 | 5.63535 | 3.132824 | 7.396122
2.913151 | 3.095241 | 2.492723 | 3.319162 | 4.456017 | 4.525134

Third cutting point
0.322359 | 0.418873 | 0.297771 | 1.212094 | 3.357376 | 4.021707
2.15596 | 3.551527 | 0.207554 | 0.637521 | 2.904297 | 3.469888
1.59428 | 2.798606 | 0.700594 | 0.83263 | 0.287616 | 6.422057
0.015686 | 0.773199 | 5.462834 | 6.323969 | 1.996526 | 2.549593
12.95731 | 12.881086 | 4.832411 | 7.758227 | 0.659423 | 2.522593
6.310255 | 11.54653 | 0.944758 | 1.195255 | 0.9541 | 4.41681   (concatenation frame)
0.38814 | 1.850475 | 2.402954 | 3.037006 | 5.371718 | 7.225558
4.133116 | 6.327146 | 0.159852 | 1.761721 | 0.555079 | 3.362784
1.779213 | 2.072475 | 2.668055 | 2.751417 | 0.810232 | 18.21998
2.122311 | 3.398649 | 0.234272 | 0.935893 | 5.089676 | 12.85679
2.918918 | 9.591844 | 1.589419 | 2.604154 | 6.323217 | 12.77288

Research findings: From these tables it can be seen that there is a large difference in the formant values of the proper, improper and original words, which indicates the spectral mismatch in the proper and improper words. The mismatch near the concatenation joint is greater for the IC word than for the PC word. The PSD method has successfully estimated and reduced the spectral distortion for 445 words out of 500.

8.30 DETECTION OF SPECTRAL MISMATCH AND ITS REDUCTION USING MULTI-RESOLUTION WAVELET ANALYSIS:

This method is used to reduce spectral distortion present at the concatenation point. The following sections explain the implementation details of this method.

8.30.1 Purpose of algorithm
This algorithm is used to find the approximate and detail coefficients of the multi-resolution wavelet transform. The wavelet domain shows the discontinuities present at the concatenation point due to spectral noise. These discontinuities are given as input to a neural-network-based back-propagation algorithm, which removes this distortion. It modifies the wavelet coefficients with reference to the original waveform and produces modified wavelet coefficients, thus reducing the spectral distortion.

A multilayer neural network with one hidden layer is used for the back-propagation algorithm. The present implementation of the back-propagation algorithm has the following specifications:

No. of input nodes: 30 input frames are provided to the 30 input nodes of the BP algorithm.
No. of hidden nodes: 20 hidden nodes are used; this number is fixed, and only one hidden layer is used.
No. of output nodes: 30 output nodes are present.

Here the wavelet coefficients of the manually segmented word are used as the known good (target) output. The input to the back-propagation algorithm is the wavelet coefficients of the properly and improperly concatenated words.
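The node counts above suggest a small fully connected network. Below is a minimal Python/NumPy sketch of one way such a 30-20-30 network with back-propagation could be realised; the sigmoid hidden activation, linear output layer, learning rate and weight initialisation are assumptions, since they are not specified in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BackpropNet:
    """Minimal 30-20-30 multilayer perceptron trained by back-propagation."""

    def __init__(self, n_in=30, n_hidden=20, n_out=30, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.h = sigmoid(x @ self.W1 + self.b1)    # hidden layer (20 nodes)
        self.y = self.h @ self.W2 + self.b2        # linear output layer (30 nodes)
        return self.y

    def train_step(self, x, target):
        """One gradient-descent update on the squared error for a single frame pair."""
        self.forward(x)
        dy = self.y - target                        # output-layer error
        dh = (dy @ self.W2.T) * self.h * (1.0 - self.h)
        self.W2 -= self.lr * np.outer(self.h, dy)
        self.b2 -= self.lr * dy
        self.W1 -= self.lr * np.outer(x, dh)
        self.b1 -= self.lr * dh
        return 0.5 * float(dy @ dy)
```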

8.30.2 Methodology used and step by step implementation
The following steps are implemented for this algorithm. After the approximate and detail coefficients are extracted from the wavelet transform, they are given as input to the back-propagation algorithm. The distortion present at the concatenation joint is reduced using the modified wavelet coefficients obtained at the output of the NN [72].

1) The lengths of the original, proper and improper words are made the same.
2) The original, proper and improper words are decomposed up to level 5 with the help of the Daubechies wavelet.
3) The approximate and detail coefficients are extracted from the wavelet transform.
4) The coefficients are given as input to the neural network, which is trained with the back-propagation algorithm.
5) The output of the neural network gives the modified wavelet coefficients.
6) The improperly concatenated word is reconstructed from the modified wavelet coefficients.
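A sketch of steps 2), 3) and 6) using the PyWavelets package is given below; the Daubechies order ('db4') is an assumption, since the text does not state which member of the family is used.

```python
import numpy as np
import pywt   # PyWavelets

def decompose(word, wavelet="db4", level=5):
    """Level-5 Daubechies decomposition; returns [cA5, cD5, cD4, ..., cD1] (steps 2-3)."""
    return pywt.wavedec(np.asarray(word, dtype=float), wavelet, level=level)

def reconstruct(coeffs, wavelet="db4"):
    """Rebuild the waveform from (possibly modified) coefficients (step 6)."""
    return pywt.waverec(coeffs, wavelet)
```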


8.30.3 Results/outcome of the algorithm:
This algorithm gives modified wavelet coefficients, which help to reduce the distortion present at the concatenation joint. It produces a more natural audio output and reduces the audible glitch present at the concatenation point.

8.30.4 Wavelet results:
The original, proper and improper words are decomposed up to level 5 using the Daubechies wavelet, and the wavelet coefficients at each level are plotted for all words. The x-axis shows the coefficient number and the y-axis shows the energy of the wavelet coefficient.

Fig. 8.86: Approximate wavelet coefficients of AMb (achal) from level 1 to 5


Fig. 8.87: Detail wavelet coefficients of AMb (achal) from level 1 to 5

It can be seen from the figures 8.86 and 8.87 that there is mismatch in the wavelet coefficients of proper and improper word. Some more examples of wavelet results are shown below:

Fig. 8.88: Approximate wavelet coefficients of ‘a{dZm’ (Ravina) from level 1 to 5


Fig. 8.89: Detail wavelet coefficients of ‘a{dZm’ (Ravina) from level 1 to 5

Fig. 8.90: Approximate wavelet coefficients of ‘{eH ma’ (shikar) from level 1 to 5


Fig. 8.91: Detail wavelet coefficients of ‘{eH ma’ (shikar) from level 1 to 5

Fig. 8.92: Approximate wavelet coefficients of ‘dmaH as’ (varkari) from level 1 to 5


Fig. 8.93: Detail wavelet coefficients of ‘dmaH as’ (varkari) from level 1 to 5

These coefficients are given as input to the neural network, which is trained with the back-propagation algorithm; the aim is to reduce the spectral mismatch between the proper, improper and original wavelet coefficients.

Fig. 8.94: Approximate wavelet coefficients of original, improper and modified improper AMb (achal)


It can be seen from figure 8.94 that the approximate wavelet coefficients of improper AMb (achal) are modified. The neural network has reduced the mismatch between the original and improper wavelet coefficients. This spectral mismatch reduction is quantified as a percentage error, as shown in tables 8.7 and 8.8 below.

Fig. 8.95: Detail wavelet coefficients of original, improper and modified improper AMb (achal)

It can be seen from figure 8.95 that the detail wavelet coefficients of improper AMb (achal) are modified. Some more examples of the wavelet-neural network combination are shown below. Figures 8.96 and 8.97 below show the approximate and detail coefficients of 'a{dZm' (Ravina).


Fig. 8.96: Approximate wavelet coefficients of original, improper and modified improper ‘a{dZm' (Ravina)

Fig. 8.97: Detail wavelet coefficients of original, improper and modified improper ‘a{dZm' (Ravina)


Fig. 8.98: Approximate wavelet coefficients of original, improper and modified improper ‘{eH ma' (shikar)

Fig. 8.99: Detail wavelet coefficients of original, improper and modified improper ‘{eH ma' (shikar)


Fig. 8.100: Approximate wavelet coefficients of original, improper and modified improper ‘dmaH as' (varkari)

Fig. 8.101: Detail wavelet coefficients of original, improper and modified improper ‘dmaH as ' (varkari)


The mismatch is greater at level 5 than at the other levels for both the approximate and detail coefficients; refer to tables 8.7 and 8.8 below. The corresponding frequency band at level 5 is 0 to 172 Hz for the approximate coefficients and 172 Hz to 344 Hz for the detail coefficients. The level of the wavelet refers to the stage of decomposition of the signal into detail and approximate coefficients. The neural network has reduced the mismatch between the original and improper wavelet coefficients. The difference between the wavelet coefficients of original-improper and original-modified improper (the output of the neural network) is calculated; after taking the difference for each coefficient, the average is taken, and from these difference values the percentage error is calculated. The following table 8.7 shows that after applying the neural network the percentage error is reduced. The percentage error shows the improvement obtained using the wavelet-back-propagation combination; an improvement of about 20% is obtained for the approximate and detail coefficients after using the wavelet-NN combination. In total, 500 words are tested for this combination. Being a ratio, the percentage error is unit-less. The wavelet-back-propagation combination is implemented here for the first time for calculating the percentage error between the original, improper and modified improper words.
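The exact normalisation behind the percentage error is not spelled out in the text; the sketch below (Python/NumPy) shows one plausible reading, in which the average absolute coefficient difference is expressed relative to the average coefficient magnitude of the reference word.

```python
import numpy as np

def percentage_error(reference_coeffs, test_coeffs):
    """Average absolute coefficient difference as a percentage of the reference level
    (one possible definition; the thesis does not state the normalisation explicitly)."""
    ref = np.asarray(reference_coeffs, dtype=float)
    test = np.asarray(test_coeffs, dtype=float)[:len(ref)]
    return 100.0 * float(np.mean(np.abs(ref - test)) / (np.mean(np.abs(ref)) + 1e-12))
```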

Table 8.7: Numerical results of wavelet and Back propagation for approximate coefficients

Word | Percentage error, original-improper | Percentage error, original-modified improper
AmMb (anchal) | 38.41 | 13.86
JsVH ma (geetkar) | 49.83 | 28.65
H m_Jma (kamgar) | 75.42 | 34.26
am_Xod (Ramdev) | 44.68 | 14.66
_maoH as (marekari) | 39.24 | 16.35
_m`Xod (Maydev) | 63.43 | 25.26
gmdYmZ (savdhan) | 34.88 | 16.06
dmaH as (varkari) | 51.96 | 29.57
Cnm` (upay) | 24.89 | 11.84
{dZm`H (Vinayak) | 51.96 | 13.86


The decrease in the percentage error shows that the neural network has reduced the mismatch in the improper word. The following table shows the results for the detail coefficients.

Table 8.8: Numerical results of wavelet and Back-propagation for detail coefficients

Word                   Percentage error               Percentage error
                       (original-improper)            (original-modified improper)
AmMb (anchal)          55.43                          32.89
JsVH ma (geetkar)      36.48                          22.37
H m_Jma (kamgar)       45.22                          63.36
am_Xod (Ramdev)        68.0                           18.51
_maoH as (marekari)    54.82                          34.72
_m`Xod (Maydev)        33.0                           16.19
gmdYmZ (savdhan)       40.79                          11.12
dmaH as (varkari)      37.99                          17.90
Cnm` (upay)            51.36                          34.22
{dZm`H (Vinayak)       52.74                          10.62

8.31 DETECTION OF SPECTRAL MISMATCH AND ITS REDUCTION USING DTW:

8.31.1 Purpose of algorithm
This algorithm is a further approach to reducing the spectral distortion present at the concatenation joint. Using the FFT and an accumulated distance matrix, it applies the Itakura method to find the distance (distortion) at the joint and reduces it so that the speech output sounds more natural. In the previous section, DTW was used for distance and energy calculation of cross-correlated words; here it is used together with the FFT for spectral distortion reduction.


8.31.2 Methodology used and step-by-step implementation
The following steps describe the implementation of this approach. DTW is a simple time-domain technique used here to reduce spectral mismatch and produce natural-sounding speech.

1) Take the original and the proper concatenated word.
2) Divide the speech signal into frames of 110 samples.
3) Calculate the FFT of each frame.
4) Make the lengths of the two matrices the same, i.e. one matrix for the original word and the other for the proper concatenated word.
5) Form a distance matrix 'd' by calculating the difference between each sample value of the original and the proper concatenated word.
6) Form an accumulated distance matrix 'D' using the Itakura method:
   D(m, n) = d(m, n) + min{D(m, n-1), D(m-1, n-1), D(m-2, n-1)} ......... (8.13)
7) Find the shortest path from the 'D' matrix.
8) Adjust the frames of the proper concatenated word according to this shortest path. This step modifies the proper concatenated word to make it more natural and to remove the spectral distortion.
9) Remove the phase shift introduced during the adjustment.
10) Take the inverse FFT.
11) Resize the proper concatenated word to its original length.
12) Repeat the same procedure for the original and improper concatenated words.
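A minimal sketch of steps 2-6 is given below, written as illustrative Python rather than the actual implementation; it assumes a Euclidean distance between FFT magnitude frames, and the frame adjustment, phase-shift removal and inverse FFT of steps 7-11 are omitted. All function names are illustrative.

import numpy as np

def frame_fft(signal, frame_len=110):
    # Steps 2-3: split the word into frames of 110 samples and take the FFT
    # magnitude of each frame.
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    return np.abs(np.fft.rfft(frames, axis=1))

def distance_matrix(a_feat, b_feat):
    # Step 5: d(m, n) is the distance between frame m of one word and frame n
    # of the other (Euclidean distance between FFT magnitude frames assumed).
    return np.linalg.norm(a_feat[:, None, :] - b_feat[None, :, :], axis=2)

def accumulated_distance(d):
    # Step 6, Eq. (8.13): D(m, n) = d(m, n) + min{D(m, n-1), D(m-1, n-1), D(m-2, n-1)}
    M, N = d.shape
    D = np.full((M, N), np.inf)
    D[0, 0] = d[0, 0]
    for n in range(1, N):
        for m in range(M):
            prev = [D[m, n - 1]]
            if m >= 1:
                prev.append(D[m - 1, n - 1])
            if m >= 2:
                prev.append(D[m - 2, n - 1])
            best = min(prev)
            if np.isfinite(best):
                D[m, n] = d[m, n] + best
    return D

# d = distance_matrix(frame_fft(original), frame_fft(proper_concatenated))
# D = accumulated_distance(d)   # the warping path of step 7 is traced back from D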

8.31.3 Results/outcome of the algorithm
This algorithm gives a speech output with reduced spectral distortion. The reduction can be quantified through the correlation of the original-proper and original-improper concatenated words: the correlation increases after applying the DTW algorithm, which indicates a reduction in spectral mismatch.
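As a sketch of how such a correlation score can be computed, the following illustrative Python fragment takes the peak of the cross-correlation of two recordings; the variable names are hypothetical and this is not the thesis code.

import numpy as np

def correlation_score(x, y):
    # Peak of the cross-correlation of two recordings; a larger value means a
    # closer match to the original word.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.max(np.correlate(x, y, mode="full"))

# before = correlation_score(original, proper_concatenated)
# after  = correlation_score(original, proper_after_dtw)
# An increase from 'before' to 'after' corresponds to the behaviour in Table 8.9.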

8.31.4 Correlation results of DTW
The cross-correlation of the original-proper and original-improper concatenated words is computed before and after applying DTW. The autocorrelation of the original word is also computed for comparison. The value of the correlation increases after applying DTW. The following figures show the correlation results.


Fig. 8.102: Correlation results for ‘AMb’ (achal) original, ‘AMb’ (achal) proper before and after applying DTW

Fig. 8.103: Correlation results for ‘AMb’ (achal) original, ‘AMb’(achal) improper before and after applying DTW

Figures 8.102 and 8.103 above show the correlation results for 'AMb' (achal) original with 'AMb' (achal) proper, and with 'AMb' (achal) improper, before and after applying DTW.


Fig. 8.104: Correlation results for ‘am_Xod’ (Ramdev) original, ‘am_Xod’ (Ramdev) proper before and after applying DTW

Fig. 8.105: Correlation results for ‘am_Xod’ (Ramdev) original, ‘am_Xod’ (Ramdev) improper before and after applying DTW

Figures 8.104 and 8.105 above show the correlation results for 'am_Xod' (Ramdev) original with 'am_Xod' (Ramdev) proper, and with 'am_Xod' (Ramdev) improper, before and after applying DTW.


Fig. 8.106: Correlation results for ‘nmdSa’ (pavdar) original, ‘nmdSa’ (pavdar) proper before and after applying DTW

Fig. 8.107: Correlation results for ‘nmdSa’ (pavdar) original, ‘nmdSa’ (pavdar) improper before and after applying DTW

Figures 8.106 and 8.107 above show the correlation results for 'nmdSa' (pavdar) original with 'nmdSa' (pavdar) proper, and with 'nmdSa' (pavdar) improper, before and after applying DTW.


Fig. 8.108: Correlation results for ‘{eH ma’ (shikar) original, ‘{eH ma’ (shikar) proper before and after applying DTW

Fig. 8.109: Correlation results for ‘{eH ma’ (shikar) original ‘{eH ma’ (shikar) improper before and after applying DTW

Figures 8.108 and 8.109 above show the correlation results for '{eH ma' (shikar) original with '{eH ma' (shikar) proper, and with '{eH ma' (shikar) improper, before and after applying DTW.


The following table 8.9 shows the values of the correlation for the proper and improper words before and after applying DTW. The correlation of the original and PC (proper concatenated) word improves from 4.78 to 5.18 after using DTW, and the correlation of the original and IC (improper concatenated) word improves from 4.51 to 4.94. Correlation indicates the match between two signals and is a unit-less parameter.

Table 8.9: Numerical results of correlation

                       Cross-correlation of original       Cross-correlation of original
                       and proper word                     and improper word
Word                   Before DTW      After DTW           Before DTW      After DTW
AMb (achal)            2.3056×10^4     2.4586×10^4         2.2525×10^4     2.2587×10^4
JsVH ma (geetkar)      5.6632×10^3     5.869×10^3          4.8921×10^3     5.8026×10^3
dmaH as (varkari)      7.0598×10^3     8.6418×10^3         9.7344×10^3     9.2664×10^3
A\ dm (afava)          1.3279×10^4     1.6421×10^4         1.2571×10^4     1.4139×10^4
XodJS (Devgad)         1.2314×10^4     1.3443×10^4         1.081×10^4      1.2417×10^4
_maoH as (marekari)    7.4595×10^3     9.5448×10^3         8.0797×10^3     8.9536×10^3
nmdSa (pavdar)         1.7639×10^4     2.1324×10^4         1.3613×10^4     1.3823×10^4
{dO` (Vijay)           9.8615×10^3     1.1435×10^3         8.5772×10^3     8.9659×10^3
Cnm` (upay)            7.8091×10^3     9.1944×10^3         6.4715×10^3     7.0319×10^3

From Table 8.9 it can be seen that the value of the correlation, i.e. the similarity of the improper and proper words with the original word, increases after applying DTW. This shows that DTW has reduced the spectral distortion present at the concatenation joints of the proper and improper concatenated words. The correlation of the PC word is higher than that of the IC word, which indicates that the PC word is closer to the original than the IC word. For example, for the word nmdSa (pavdar), the correlation of original-PC after applying DTW is 2.1324×10^4 and the correlation of original-IC is 1.3823×10^4, which shows that the PC word is closer to the original than the IC word.


8.32 GUI TO PERFORM DTW, WAVELET AND PSD:

 A graphical user interface (GUI) is a type of user interface that allows people to interact with programs through images rather than typed text commands, on devices such as computers, hand-held devices, MP3 players, portable media players, gaming devices, household appliances and office equipment.

 A GUI offers graphical icons and visual indicators, as opposed to text-based interfaces, typed command labels or text navigation to fully represent the information and actions available to a user. The actions are usually performed through direct manipulation of the graphical elements.

 Care has to be taken to guide the user through the sequence of steps designed by the programmer while still allowing a fair degree of freedom. Care is also needed to keep the GUI responsive, so that the algorithms and calculations do not stall the interface and leave the user believing that the GUI has frozen or crashed.

 Proper dialogue boxes, warnings and error messages should also be shown to the user as and when necessary, to make the user aware of his actions and their impact. To see the numerical comparison between the concatenated words and to analyse the results of the different algorithms, the user runs the GUI code in MATLAB, which opens the following window.


Fig. 8.110: GUI snapshot for spectral smoothing algorithms

The GUI contains a list box, two radio buttons and three push buttons.

8.32.1 List box: The list box provides the list of words. Select any word from this list along with one of the radio buttons, i.e. audio or objective.

8.32.2 Radio buttons: When the audio radio button is selected, the modified improper word followed by the original and improper word can be heard. When the objective radio button is selected, the graphical results are displayed.

8.32.3 Push buttons: Three push buttons are provided for running the desired algorithm on the selected word. When a push button is pressed, the respective algorithm runs at the back end and displays its results.
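The fragment below is an illustrative analogue, written in Python/tkinter rather than the MATLAB environment actually used, of the layout just described (a list box of words, two radio buttons and one push button per algorithm). The word list and the callback are placeholders, not the actual GUI code.

import tkinter as tk

WORDS = ["achal", "afava", "pavdar"]       # placeholder word list
ALGORITHMS = ["PSD", "DTW", "Wavelet"]     # one push button per algorithm

def run_algorithm(name, word_box, mode):
    # Placeholder callback: the real tool runs the chosen algorithm at the back
    # end and either plays the audio or plots the objective (graphical) results.
    word = word_box.get(tk.ACTIVE)
    print(f"Run {name} on '{word}' in {mode.get()} mode")

root = tk.Tk()
root.title("Spectral smoothing algorithms")

word_box = tk.Listbox(root, height=6)                      # list box of words
for w in WORDS:
    word_box.insert(tk.END, w)
word_box.pack(side=tk.LEFT, padx=5, pady=5)

mode = tk.StringVar(value="audio")                         # audio / objective
tk.Radiobutton(root, text="Audio", variable=mode, value="audio").pack(anchor="w")
tk.Radiobutton(root, text="Objective", variable=mode, value="objective").pack(anchor="w")

for name in ALGORITHMS:                                    # three push buttons
    tk.Button(root, text=name,
              command=lambda n=name: run_algorithm(n, word_box, mode)).pack(fill="x", padx=5)

root.mainloop()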

1) PSD results in GUI


Fig. 8.111: PSD plot of original and improper concatenated word

2) DTW results in GUI: The results show 1) the original and improper word A\ dm (afava) and 2) the improper and modified improper word A\ dm (afava).

Fig. 8.112: DTW results in GUI


8.33 SUBJECTIVE ANALYSIS:

The results of all these algorithms, PSD, wavelet with back-propagation, and DTW, can also be judged through subjective analysis. All the graphical and tabular results shown above are objective results for spectral mismatch reduction; the different techniques show how to estimate and reduce the spectral distortion. The following table 8.10 shows the subjective results for PSD, wavelet and DTW, obtained through an MOS test.

Table 8.10: Subjective analysis for DTW, PSD and wavelet

MOS analysis

Sr. No.          PSD       DTW       Multi-resolution based wavelet
Subject 1        3         4         3
Subject 2        3         4         4
Subject 3        2         3         4
Subject 4        2         3         4
Subject 5        4         4         4
Subject 6        3         4         5
Subject 7        4         3         4
Subject 8        4         3         3
Total / average  3.125     3.5       3.875

This analysis shows that DTW and the multi-resolution wavelet algorithm produced more natural results. From the MOS analysis in the above table, the wavelet method's score of 3.875 is higher than that of the other methods; the wavelet algorithm with multi-resolution analysis gives the highest subjective score, and DTW gives the next highest MOS score of 3.5. The subjective analysis therefore supports the conclusion that the multi-resolution wavelet method with back-propagation and the DTW method provide more natural audio results, while PSD produces effective objective (graphical) results. Both the objective and subjective analyses support the conclusion that contextual analysis with proper segmentation and spectral mismatch reduction (spectral smoothing) improves the quality of the resulting speech output. These results can be extended to other languages that use the Devnagari script [73].


8.34 CONCLUSION OF SPECTRAL MISMATCH REDUCTION:

The various algorithms in this chapter show the spectral mismatch in the proper and improper concatenated words graphically and numerically. These results show that after applying the different smoothing techniques, PSD, DTW and the multi-resolution wavelet method, the spectral mismatch is reduced, producing more natural speech output. The wavelet algorithm with multi-resolution analysis and back-propagation has given an MOS score of 3.875, whereas DTW has given an MOS score of 3.50. The PSD graphical results are more promising for the estimation of spectral distortion, but its audio output is not as natural as that of the other methods; PSD has given an MOS score of 3.125, which is lower than the other two. As the naturalness of the speech output depends on proper concatenation and on spectral mismatch reduction at the concatenation joint, the wavelet method has given more natural audio compared to the other methods.



CHAPTER- 9

LINGUISTIC ANALYSIS

No.     Title of the Contents                                      Page No.
9.1     Introduction                                               321
9.1.1   Vowels                                                     322
9.1.2   Consonants                                                 322
9.1.3   Consonant vowel (CV) structure                             323
9.1.4   Suffix processing in text analysis                         323
9.1.5   Structure formation                                        324
9.2     Implementation of suggestions from linguistic experts      324
9.2.1   Sandhi and related rules                                   326
9.3     Linguistic expert's analysis and new rules                 330
9.4     Database preparation with words and syllables              332
9.5     Testing and results                                        332
9.5.1   Testing of paragraphs                                      332
9.5.2   Comparison of naturalness obtained by various methods      350
9.6     Conclusion of linguistic analysis                          355


9.1 INTRODUCTION:

 Most of the Indian languages are syllable-centered. In any language, phonemes are the smallest units of sound that distinguish one word from another. The scripts of Indian languages originated from the ancient Brahmi script. The basic units of the writing system are characters, which are orthographic representations of speech sounds.

 A character in Indian language scripts is close to a syllable and is typically of one of the following forms: C, V, CV, CCV or CVC, where C is a consonant and V is a vowel. There are about 35 consonants and about 18 vowels in Indian languages. An important feature of Indian language scripts is their phonetic nature, which leads to a writing system that represents sounds through unique symbols [74].

 Each language has its own representation for the sounds and thus its own script, though some languages may use a common script. Speech generated from the speech production system consists of a sequence of basic sound units of a particular language.


The following figure 9.1 shows classification of sound units in Indian languages.

Fig. 9.1: Classification of sound units in Indian languages (vowels: short vowels, long vowels, diphthongs; consonants: stop consonants, nasals, semivowels, fricatives, affricates)

9.1.1 Vowels: Vowels are sounds produced with no obstruction to the air stream as it passes through the vocal tract. Resonance is decisive in vowel production, and all vowel sounds are therefore voiced. Vowels are further classified into: 1) Short vowels: the duration of production is shorter. 2) Long vowels: the duration of production is longer, nearly double that of short vowels.

9.1.2 Consonants: A consonant is a speech sound produced as a result of a partial or complete obstruction to the airstream somewhere in the vocal tract. Voicing does not count as obstruction. Consonants are further classified as: 1) Stop consonants: during the production of these consonants the vocal tract is completely closed at some point along its length and then suddenly released.


2) Nasals: Nasal sounds are similar to vowels but have lower formant energy. They are produced with the help of air flow through the nasal cavity. Examples of nasal sound units found in Indian languages are /ng/, /nj/, /n/, /N/ and /m/. 3) Semivowels: The semivowels are weakly periodic compared to the vowels and have lower energy. The set of semivowels in Indian languages includes /y/, /r/, /l/ and /v/. 4) Fricatives: The fricatives are consonants produced by a narrow constriction somewhere along the length of the vocal tract. The basic difference between fricatives and stop consonants is that the closure is partial and narrow for fricatives and complete for stop consonants. Depending on the place of the narrow constriction, we get different fricatives; most Indian languages have /s/, /sh/, /shh/ and /h/ as fricatives. 5) Affricates: The affricates are consonants whose production combines stop and fricative articulation. Initially, the vocal tract is completely closed somewhere along its length to create a total constriction; the constriction is then partially released to create fricative excitation. Most of the Indian languages have /ch/, /chh/, /j/ and /jh/ as the affricate consonants [75].

9.1.3 Consonant vowel (CV) structure: The CV structure of a given word is formed using standard structure-formation rules. CV structure breaking rules have been developed, and the word is split into its constituent syllables using these rules. For the Marathi language there are more than 470 distinct CV structures. Details of the CV structures and their breaking rules are given in tabular form in the appendix.
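As an illustration of CV-structure labelling, the following simplified Python sketch works on Roman transliterations with a hypothetical vowel set; the actual system operates on Devanagari characters and uses the full set of more than 470 CV structures and breaking rules listed in the appendix.

VOWELS = set("aeiou")                      # simplified, hypothetical vowel set

def cv_structure(word):
    # Label each letter as C (consonant) or V (vowel), e.g. "ravina" -> "CVCVCV".
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

def split_syllables(word):
    # Greedy split into C*V syllables (one simplified breaking rule).
    syllables, current = [], ""
    for ch in word.lower():
        current += ch
        if ch in VOWELS:                   # a syllable closes at its vowel
            syllables.append(current)
            current = ""
    if current:                            # trailing consonants join the last syllable
        if syllables:
            syllables[-1] += current
        else:
            syllables.append(current)
    return syllables

print(cv_structure("ravina"))              # CVCVCV
print(split_syllables("ravina"))           # ['ra', 'vi', 'na']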

9.1.4 Suffix processing in text analysis: If the word does not exist in the database, the text analyzer checks whether the word ends with some suffix. If it does, the analyzer separates the word from the suffix and searches for the stem and the suffix separately in the textual database. For example, for the word Amnë`m (aplya), the algorithm separates Amn (aap) and ë`m (lya) and searches for them separately.


The most commonly used suffixes are _Yo (madhe), _YyZ (madhun), `mV (yaat), bm (la), br (li), bo (le), ë`m (lya), Zm (naa), Zr (ni), Zo (ne), Mm (cha), Mr (chi), Mo (che) and À`m (chya).
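The suffix processing of Section 9.1.4 can be sketched as follows, using Roman transliterations of the suffixes listed above and a toy in-memory database; this is illustrative only, since the real analyzer works on the Devanagari forms stored in the textual database.

SUFFIXES = ["madhe", "madhun", "yaat", "chya", "lya", "cha", "chi", "che",
            "naa", "la", "li", "le", "ni", "ne"]      # Roman forms of the list above
DATABASE = {"aap", "lya"}                             # toy textual-database entries

def analyze(word):
    # Section 9.1.4: if the word is not stored, try to split off a known suffix
    # and look up the stem and the suffix separately.
    if word in DATABASE:
        return [word]
    for suffix in sorted(SUFFIXES, key=len, reverse=True):   # longest match first
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if stem in DATABASE:
                return [stem, suffix]
    return None                                       # fall back to syllable synthesis

print(analyze("aaplya"))                              # ['aap', 'lya'], cf. Amnë`m (aplya)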

9.1.5 Structure formation:  If the last character is a joint (conjunct) character, the last consonant is written as "CV" and the preceding consonant is written as "C". A character carrying rafar is written as "CV", and its preceding consonant is written as "C". Marathi is an Indo-Aryan language (of the Indo-European family, Indo-Iranian branch) spoken by about 73 million people (2001). As per 2008 survey results, 96.75 million people, mainly in Maharashtra and its neighbouring states, speak Marathi. It is also spoken in Israel and Mauritius.

 Marathi is thought to be a descendant of Maharashtri, one of the Prakrit languages, which developed from Sanskrit. The vowel signs attached to a consonant are not written in a single direction: they can be placed above or below the consonant, as well as to its left or right.

e.g. $z , x , o , ¡ etc.

 Since all Indian scripts share the same origin, they share a common phonetic structure. The alphabets and the graphical shapes may vary slightly, but in all of the scripts the basic consonants, vowels and their phonetic representation are the same [76].

9.2 IMPLEMENTATION OF SUGGESTIONS FROM LINGUISTIC EXPERTS:

 In the Marathi language, vowels, consonants, the formation of different contexts, nouns and the singular-plural distinction need to be considered while forming new words or breaking up original words. There are many words in Marathi which start with prefixes, CngJ© (upasarga), and other words which end with suffixes, àË`` (pratyay).

 Apart from these, there are two more types of word formation: g_mg (samas) and words formed with Aä`ñV (abhyasta). Different contexts in the Marathi language can be evaluated with these four types. Some examples are given below:

1) Examples of prefixes, CngJ© (upasarga) A (aa) + ~moc (bol) = A~moc (abol) gw (su) + J§Y (gandh) = gwJ§Y (sugandha)

 In Marathi there are three kinds of CngJ©K{QV (upasargaghatit) words: words formed from g§ñH¥ V (Sanskrit), _amR s (Marathi) and \ mags-Aao~s (Farasi-Arebi) upasarga.

2) Examples of suffixes, àË`` (pratyay) ag (ras) + BH (ik) = a[gH (rasik)

 In Marathi there are two types of àË`` (pratyay) 1) YmVwgmpYVo (dhatusadhite) and 2) eãXgmpYVo(shabdasadhite)

 CngJ© (upasarga) is always at the start of word and àË`` (pratyay) is at the end of word.

3) Examples of Aä`ñV (abhyasta) Ka (ghar) + Ka (ghar) = KamoKa (gharoghar)

 There are three types of Aä`ñV (abhyasta) words 1) nyUm©ä`ñV (purnabhyasta) 2) A¨emä`ñV (anshabhyasta) and 3) AZwH aUdmMH (anukaranwachak)


 If a new word is formed from two different small words, resulting in one OmoSeãX (jodashabda), it is called a gm_mpgH eãX (samasik shabda).

4) Examples of gm_mpgH eãX (samasik shabda) Xod (dev) + Ka (ghar) = XodKa (devghar)

 In Marathi, the linguists have suggested the rule that proper nouns cannot be cut into syllables and reused in the formation of new words. This rule has been implemented for the first time in this Marathi TTS. Consider the following examples, where the proper nouns pñ‘Vm (Smita) and aoIm (Rekha) cannot be cut into syllables.

pñ‘Vm (Smita) = pñ_ (Smi) + Vm (Ta) aoIm (Rekha) = ao (Re) + Im (Kha)

 Now consider the following examples, where some proper nouns can be broken; these are exceptional cases.

ph_mc` (Himalaya) = ph_² (Him) + Amc` (Alaya) g§J_oída (Sangameshwar) = g§J_² (Sangam)+ B©ída (Ishwar)

 In English and other languages, mainly the singular-plural distinction affects the context of words and sentences, but in Marathi, apart from this, the subject and its type, pronouns, proper nouns etc. also affect the flow of the context.

9.2.1 Sandhi and related rules:  In Marathi there are many rules by which vowels and consonants are concatenated together, as well as rules for joining two consonants or other special combinations. The following paragraphs show some examples of these rules, provided by the linguists, who suggested considering these types of contexts while forming the textual paragraphs of the input text.

 In Marathi there are three types of g§Ys (sandhi) which are shown below:


 eãXg§Ys (shabdasandhi) XmoZ eãXm§Mm g§`moJ hmoDZ Ë`m§Mm EH eãX V`ma hmoUo åhUOo Ë`m eãXm§Mm g§Ys hmoUo. Ë`mMo VsZ àH ma AmhoV. (Don shabdancha sanyog houn tyancha ek shabda tayar hone mhanaje tya shabdancha sandhi hone. Tyache teen prakar aahet.)

 ñdag¨Ys (swarsandhi) ñdamnwTo ñda `odyZ hmoUmas g§Ys. (Swarapudhe swar yevun honari sandhi.) A (Aa) + A (Aa) = Am (Aaa) am_ (Ram) + A`Z (Ayan) = am_m`U (Ramayan) A (Aa) + Am (Aaa) = Am (Aaa) ped (Shiva) + Amb` (Alaya) = pedmb` (Shivalaya) Am (Aaa) + Am (Aaa) = Am (Aaa) _hm (Maha) + AmË_m (Atma) = _hmË_m (Mahatma) B (E) + B (E) = B© (Eee) apd (Ravi) + BÝX (Indra)³ = ads§X(Ravindra) B (E) + B© (Eee) = B© (Eee) Xods (Devi) + BÀN m (Echha) = XodsÀN m (Devichha) C (U) + C (U) = D (Uoo) dYy (Vadhu) + D Ëgd (Utsav) = dYyËgd (Vadhustav) A (Aa) + Eo (Aai) = Eo (Aai) jU (Kshan) + EoH (Aaik) = jU¡H (Kshanaik) Am (Aaa) + Amo (O) = Am¡ (Aau) J§Jm (Ganga) + AmoK (Aogh)= J§Jm¡K (Gangaugh) A (Aa) + Am¡ (Aau) = Am¡ (Aau) dZ (Van)+ Am¡fYs (Aushadhi) = dZm¡fYs (Vanaushadhi)

 ì`§OZ g§Ys (vyanjansandhi) 1) XmoZ ì`§OZo, ñda + ì`§OZ AWdm ì`§OZ + ñda, EH mnwTo EH Amë`mZo hmoUmas g§Ys. (Don vyanjane, swar va vyanjan athava vyanjan va swar, ekapudhe ek alyane honari sandhi.)


2) g§Ys hmoVmZm ì`§OZmÀ`m n{hë`m nmM dJm©Vsb AZwZmpgH mpedm` BVa H moUË`hs ì`§OZmnwTo H R moa ì`§OZ Amë`mg AmYsÀ`m ì`§OZm~X²Xb Ë`mM ì`§OZmÀ`m dJm©Vsb nphbo ì`§OZ `odyZ g§Ys hmoVmo. (Sandhi hotana vyanjanachya pahilya paach vargateel anunasikashivay itar kontyahi vyanjanapudhe kathor vyanjan alyas, aadhichya vyanjanabaddal tyach vargateel pahile vyanjan yevun sandhi hoto.)

dmJ² (vaag)+ àMma (prachar) = dmŠàMma (vaagprachar) fQ² (shat)+ H moZ (kon) = fQ² H moZ (shatkon) eaX² (sharad)+ H mb (kal) = eaËH mb (sharatkal)

3) g§Ys hmoVmZm ì`§OZmÀ`m n{hë`m nmM dJm©Vsb H moUË`hs ì`§OZmnwTo AZwZmpgH Amë`mg Ë`mEodOs Ë`mM dJm©Vsb AZwZmpgH `odyZ g§Ys hmoVmo. (Sandhi hotana vyanjanachya pahilya paach vargateel kontyahi vyanjanapudhe anunasik alyas tya aaivaji tyach vargateel anunasik yevun sandhi hoto.)

dmH² (vak) + _` (may) = dmL_` (vangmay) OJV² (jagat)+ ZmW (nath) = OJÞmW (Jaggannath)

4) g§Ys hmoVmZm V² `m ì`§OZmnwTo M² pH¨dm N² ho ì`§OZ Ambo Va Ë`m EodOs M² ho ì`§OZ `odyZ g§Ys hmoVmo. (Sandhi hotana ta ya vyanjanapudhe cha kinwa chha he vyanjan aale tar tya aaivaji cha he yevun sandhi hoto.)

gV² (sat) + MpaÌ (charitra) = gÀMpaÌ (satcharitra) eaV² (sharat) + M¨X³ (chandra) = eaÀM¨X³ (sharatchandra)

5) g§Ys hmoVmZm V² `m ì`§OZmnwTo O ho ì`§OZ Amë`mg V EodOs O ho ì`§OZ `odyZ g§Ys hmoVmo. (Sandhi hotana ta ya vyanjanapudhe ja he vyanjan alyas ta aaivaji ja he vyanjan yvun sandhi hoto.) gV² (sat) + OZ (jan) = g‚mZ (sajjan) OJV² (jagat) + OZZs (janani) = OJ‚mZZs (Jagatjanani)


6) V² nwTo e² Amë`mg V EodOs M , e EodOs N ho ì`§OZ `odyZ g§Ys hmoVmo. (Ta pudhe sha alyas ta aaivaji cha, sha aaivaji chha he vyanjan yevun sandhi hoto.)

gV² (sat) + peî` (shishya) = gpÀNî` (sachishya) CV² (uta) + ídmg (shwas) = CÀN²dmg (utchhavas)

7) V² `m ì`§OZmnwTo c ho ì`§OZ Amë`mg V EodOs c ho ì`§OZ `odyZ g§Ys hmoVmo. (Ta ya vyanjanapudhe la he vyanjan yevun sandhi hoto.)

VV² (tat) + csZ (lin) = V„sZ (tallin) CV² (uta) + coI (lekh) = C„oI (ullekha)

8) V² nwTo h ho ì`§OZ Amë`mg X hmoVmo. (Ta pudhe ha he vyanjan alyas da hoto.) CV² (uta) + hma (har) = CX²Yma (udhar)

 pdgJ©g§Ys (visargasandhi) a) pdgJªmÀ`mnwTo ñda pH§ dm ì`§OZ Amë`mda hmoUmas g§Ys. (Visargachyapudhe swar kinwa vyanjan alyavar honari sandhi.)

pZ… (nihi) + VoO (tej) = pZñVoO (nistej) _Z… (manha) + Vmn (taap) = _ZñVmn (manastap)

b) BH mamÝV CH mamÝV eãXmÀ`m A§Vs pdgJ© Amcm d nwTo H , I, n, \ ì`§OZ Amco Va pdgJm©EodOs f `odyZ g§Ys hmoVmo. (Ekarant ukarant shabdachya anti visarga aala va pudhe ka, kha, pa, pha vyanjan aale tar visargachyaaivaji sha yevun sandhi hoto.)

pZ… (nihi) + H nQ (kapat) = pZîH nQ (nishkapat) Apd… (avi) + H ma (kar) = ApdîH ma (avishkar)

c) AH mamÝV eãXmÀ`m A§Vs pdgJ© Amcm d nwTo H² , n² ì`§OZ Amco Va pdgJ© H m`_ amhyZ g§Ys hmoVmo. (Akarant shabdachya anti visarga aala va pudhe ka, pa vyanjan aale tar visarga kayam rahun sandhi hoto.)


n³mV… (prataha) + H mb (kaal) = n³mV…H mb (pratahakal) AY… (adhaha) + nmV (paat) = AY…nmV (adhahapaat)
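As a simplified illustration of how such sandhi rules can be applied mechanically, the following Python sketch encodes a few of the vowel-sandhi (swarsandhi) rules above as a lookup table over approximate Roman transliterations; the rule set and the transliterations are illustrative only and do not cover the consonant and visarga sandhi cases.

VOWEL_SANDHI = {
    ("a", "a"):   "aa",     # e.g. Ram + Ayan    -> Ramayan
    ("a", "aa"):  "aa",     # e.g. Shiv + Alaya  -> Shivalaya
    ("aa", "aa"): "aa",     # e.g. Maha + Atma   -> Mahatma
    ("i", "i"):   "ii",     # e.g. Ravi + Indra  -> Ravindra
    ("u", "u"):   "uu",     # e.g. Vadhu + Utsav -> Vadhutsav
}

def join_with_sandhi(first, second):
    # Merge the final vowel of the first word with the initial vowel of the
    # second according to the (simplified) table; otherwise just concatenate.
    for (v1, v2), merged in VOWEL_SANDHI.items():
        if first.endswith(v1) and second.startswith(v2):
            return first[:-len(v1)] + merged + second[len(v2):]
    return first + second

print(join_with_sandhi("ravi", "indra"))   # 'raviindra' (i + i -> long ii, i.e. Ravindra)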

If we consider all these types of contexts in Marathi and prepare a text or paragraph from them, it is a great challenge for the synthesizer. The following paragraph shows different types of contexts. It was prepared as per the linguistic experts' suggestions for the Marathi TTS as a book-reading application.

Fig. 9.2: Paragraph with different types of contexts

Apart from all these rules, there are exceptions for handling some special characters such as G (Ru), k (Dnya), l (Shra) etc.

9.3 LINGUISTIC EXPERT'S ANALYSIS AND NEW RULES:

1) As per the new guidance received from the linguistics professors, in the preparation of the contextual database and the audio result paragraphs, all proper nouns are stored in their original form. If a new proper noun is formed, it is formed either from available syllables or with ~mamISs (barakhadi). While preparing the syllable database, care is taken that the stored syllables are not formed from any proper noun. All syllables stored in the syllable database are obtained from words with the help of the CV structure rules. The contextual database is prepared with direct storage of proper nouns. See the following sentence, where ^maV (Bharat) is stored in its original form.

e.g. ^maV _mPm> Xoe Amho. (Bharat maza desh aahe.)

Here ^maV (Bharat) is a proper noun and is stored directly in the database.

2) In this TTS system, many suffixes, àË`` (pratyay), are already stored in the database. Some new àË`` (pratyay) are added so that all àË``-KpQV (pratyay ghatit) words in the Marathi language are covered. Apart from this, there are many prefix-based, CngJ©K{QV (upasargaghatit), words in Marathi. Some CngJ© (upasarga) are stored separately, but most of them are taken from the basic ~mamISs (barakhadi) stored in the syllable database. See the following examples of CngJ©K{QV (upasargaghatit) words; the new contextual database is modified accordingly. Some examples are shown below:

e.g AOmV (ajat) = A (aa) + OmV (jaaat) Ag_W© (asamartha) = A (aa) + g_W© (samartha) n«^md (prabhav) = nŒ (pra) + ^md (bhav) pdZ` (Vinay) = pd (Vi) + Z` (Naya) pZdmg (nivas) = pZ (ni) + dmg (vaas) ApVgma (atisaar) = ApV (ati) + gma (saar) AZwamYm (Anuradha) = AZw (Anu) + amYm (Radha) Ap^_mZ (abhiman) = Ap^ (abhi) + _mZ (maan) CnH ma (upakar) = Cn (upa) + H ma (kaar)

In all these examples, CngJ© (upasarga) can be stored separately or formed with ~mamISs (barakhadi). This suggestion is implemented with some CngJ© (upasarga) stored in the contextual database or formed with ~mamISs (barakhadi). There are three types of CngJ© (upasarga): 1) Sanskrit upasarga, 2) Marathi upasarga and 3) Arabic/Farasi upasarga. All of these need to be considered to form a rich contextual database. If all prefixes, CngJ© (upasarga), and suffixes, àË`` (pratyay), are taken into consideration, new words can be formed from syllables, and many new and different words can be generated. This helps to keep the database at a moderate size and increases naturalness.

9.4 DATABASE PREPARATION WITH WORDS AND SYLLABLES:

 While preparing the database, very frequently used words and syllables need to be stored, and syllables should be stored with their proper positions. For this, contextual analysis is carried out; based on the contextual analysis and the classification of syllable types, the database is designed. Very precise selection of words and storage of syllables is required for an efficient database. This Marathi TTS is hybrid, as both words and syllables are stored, and hence the resulting synthesis is more natural.

 The database size is moderate, but many new words can be generated from the stored most-frequent words and syllables. A moderate-sized database of about 3000 words is prepared by considering all these factors. Syllables are not stored separately; they are soft-cut from their original words, and their details are stored in the textual database. They are used on the fly while forming new words, taken from their original locations with the help of the 'from' and 'to' positions present in the textual database.
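A minimal sketch of this soft-cut mechanism is shown below, assuming the units are read from .wav files with SciPy and that the 'from'/'to' sample positions come from the textual database (the example positions are taken from the JUnVsbm row of Table 9.1). This is illustrative Python, not the thesis code.

import numpy as np
from scipy.io import wavfile

def extract_unit(path, start, end):
    # Soft-cut a word/syllable from its original recording using the 'from' and
    # 'to' sample positions stored in the textual database.
    rate, samples = wavfile.read(path)
    return rate, samples[start:end]

def concatenate_units(units):
    # Join the extracted units into one waveform (a common sampling rate is assumed).
    rate = units[0][0]
    return rate, np.concatenate([u[1] for u in units])

# e.g. the JUnVsbm (ganapatila) entry of Table 9.1: JUnVs + bm pratyaya
# rate, word = concatenate_units([extract_unit("ganapati.wav", 40, 5536),
#                                 extract_unit("la.wav", 324, 4080)])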

9.5 TESTING AND RESULTS:

9.5.1 Testing of paragraphs: As suggested by the experts at the third-stage evaluation of this PhD work, the linguists' suggestions for contextual preparation of paragraphs have been implemented. A total of four paragraphs are prepared, of which the second paragraph is the India pledge, which is the first page of most Marathi school textbooks in Maharashtra. This paragraph is representative of a book-reading application; with the present Marathi text-to-speech synthesis, this practical application (book reading or similar applications) can be implemented. The resulting speech output is intelligible and natural compared to other language synthesizers. The four paragraphs used for testing are shown below:

 Paragraph 1 _s `§Xm JUnVsbm nmë`m©bm Jobmo. _mPo Ka AmpU _mÂ`m AmOmo~m§Mo Ka eoOmas eoOmas AmhoV. _w»` añË`mda AmV J„sV dië`mda pOWo Vs J„s g§nVo {VWoM g_moa _mÂ`m AmOmo~m§Mm NmoQ m ~§Jbm Amho. Ë`m ~§Jë`mVë`m bhmZem {XdmZImÊ`mVë`m {ISH sVyZ g_moaMs AmV OmUmas J„s {XgVo AmpU J„sÀ`m ZmŠ`mda Amboë`mbm {ISH sVbm Bg_ ghO {Xgy eH Vmo. (Mi yanda gangpatila parlyala gelo. Maze ghar ani mazya aajobanche ghar shejari shejari aahet. Mukhya rastyavar aat gallit valayavar jithe ti galli sampate tithech samor mazya aajobancha chota bangala aahe. Tya banglyatalya lahansha divankhanyatalya khidakitun samorachi aat janari galli disate ani gallichya nakyavar aalelyala khidakitala isam sahaj disu shakato.)

All the words in this paragraph and their concatenation details are shown in the following table. The 'from' and 'to' locations are given in samples and indicate the number of samples required for the respective word or syllable; being a sample count, the value is unit-less.

Table 9.1: Concatenation details of nw. b. Xoenm§So paragraph Syllable/ From-to Used Names of Name of Used direct / pratyaya Name of locations linguistic syllables/ the word synthesized extracted sound file (samples on rules pratyaya from time axis) bm as Separately ganapati.wav, 40 to 5536 and JUnVsbm JUnVs direct Yes prtyaya recorded la.wav 324 to 4080 Subject Separately _s Direct Yes mi.wav 1840 to 5376 pratyay recorded Subject Separately _mPo Direct Yes maze2.wav 848 to 5800 pratyay recorded Subject Separately _mÂ`m Direct Yes mazya2.wav 120 to 5320 pratyay recorded Subject Separately Vs Direct Yes ti.wav 34 to 1626 pratyay recorded Ë`m Direct Yes Subject Separately tya.wav 92 to 2940


Syllable/ From-to Used Names of Name of Used direct / pratyaya Name of locations linguistic syllables/ the word synthesized extracted sound file (samples on rules pratyaya from time axis) pratyay recorded Pratyay 1 to 4144 and AmOmo~m§Mm Direct No Mm pratyay ajobancha.wav recorded 0 to 2052 Synthesized 64 to 3400 `§ and Xm with yan.wav and `§Xm with No barakhadi and 16 to barakhadi da.wav barakhadi 3312 añË`mVyZ 80 to 5808 da as recorded rastyatun.wav añË`mda Synthesized Yes and 484 to pratyay and da as and var.wav 3360 pratyay 64 to 3388 Direct and V V as Pratyay ta.wav and J„sV Yes and 296 to as pratyay pratyay recorded galli.wav 2780 ~§ from Barakhadi Two 98 to 2288 barakhadi ban.wav and ~§Jbm and from No syllables and 2528 to and Jbm changala.wav Mm§Jcm ~§ and Jbm 5376 from Mm§Jcm ~§J from 472 to 3612, ban.wav, From ~mamISs ~§J, ë`m and barakhadi, 24 to 2568 ~§Jë`mVë`m Yes lya.wav and and pratyaya Vë`m and and 48 to ë`m Vë`m tlya.wav as pratyay 4436 {X, dmZ, Im {XdmZImZm Direct and diwankhana.w as direct recorded 0 to 5384 and {XdmZImÊ`mVë`m Ê`mVë`m as Yes av and and Ê`mVë`m and Ê`mVë`m 0 to 6320 pratyaya nyatlya.wav as pratyay as pratyaya {ISH s Direct and {I, SH s and recorded khidki.wav 60 to 4576 {ISH sVyZ VyZ as Yes VyZ as and VyZ as and tun.wav and 0 to 2932 pratyaya pratyaya pratyaya J„s direct J„s direct 64 to 3388 Direct and recorded galli.wav and J„sÀ`m Yes and À`m as and 176 to pratyay and À`m as chya.wav pratyay 2188 pratyay ZmŠ`m from ZmŠ`m from ZmŠ`mH SyZ 56 to 3584 ZmŠ`mH SyZ ZmŠ`mH SyZ and nakyakadun.w ZmŠ`mda direct No and 484 to and as as av and var.wav recorded da da 3360 pratyay pratyay {ISH s direct recorded {ISH s direct {I, SH s and and as 60 to 4576 Vbm khidaki.wav {ISH sVbm and as Yes as and 76 to Vbm Vbm pratyay as and tala.wav pratyay pratyay per 5012 linguist‘s suggestions amIÊ`mMs amIUo as pH«`mnX Yes amI, Ê`mMs. amI rakhane.wav 156 to 5012


Syllable/ From-to Used Names of Name of Used direct / pratyaya Name of locations linguistic syllables/ the word synthesized extracted sound file (samples on rules pratyaya from time axis) direct and extracted and and 0 to 4288 Ê`mMs as from pH«`mnX nyachi.wav pratyay. amIUo. Ê`mMs as pratyaya di from di from pH«`mnX diUo direct, valane.wav, 0 to 1952, 24 pH«`mnX diUo direct dië`mda ë`m, da as Yes lya.wav and to 2568 and diUo. ë`m, da recorded. pratyay var.wav 484 to 3360 pratyaya ë`m, da pratyaya pH«`mnX direct Ambo from Ambo pH«`mnX recorded 28 to 3296 pH«`mnX and ale.wav, Amboë`mbm and ë`mbm as Yes and ë`mbm as and 44 to ë`mbm lyala.wav pratyaya. one 5012 pratyay. pratyaya Prepared 92 to 3260, Synthesized from ja2.wav, Om ,Um, as 512 to 3216 OmUmas with No barakhadi na.wav and with ~mamISs and 36 to ~mamISs and as can ree.wav 2736 be pratyay Direct Common Ka No Ka ghar.wav 216 to 3976 recorded noun Direct Common eoOmas No eoOmas shejari.wav 0 to 4976 recorded noun Direct Common Bg_ No Bg_ isam.wav 40 to 4440 recorded noun Direct Common J„s No J„s galli.wav 64 to 3388 recorded noun Direct pH«`mnX Direct Amho Yes Amho ahe.wav 48 to 3348 recorded recorded Direct pH«`mnX Direct Jobmo Yes Jobmo gelo.wav 52 to 3584 recorded recorded Direct pH«`mnX Direct g§nVo Yes g§nVo sampte.wav 48 to 6142 recorded recorded Direct pH«`mnX Direct {XgVo Yes {XgVo disate.wav 0 to 4580 recorded recorded Direct pH«`mnX Direct {Xgy Yes {Xgy disu.wav 156 to 2784 recorded recorded Direct pH«`mnX Direct eH Vmo Yes eH Vmo shakto.wav 12 to 4572 recorded recorded mukhya.wav, 0 to 4192, 40 _w»`, AmV, _w»`, AmV, pdeofU aat.wav, to 3168, 120 Direct pOWo, Yes pOWo, Direct jithe.wav, to 4864, 40 to recorded pVWo, N moQ m pVWo, N moQ m recorded tithe.wav and 5808 and 0 to chota.wav 4336


Syllable/ From-to Used Names of Name of Used direct / pratyaya Name of locations linguistic syllables/ the word synthesized extracted sound file (samples on rules pratyaya from time axis) Direct Direct AmpU, bm as recorded ani.wav and 0 to 6144 and AmpU, bm Yes recorded pratyay common la.wav 0 to 3078 pratyaya pdeofU bhmZ as Direct 12 to 4432 pdeofU pdeofU and lahan.wav and bhmZem Yes recorded and 36 to recorded em from sha.wav and em from 2932 ~mamISs ~mamISs g,_mo ,a ,Ms all formed ~mamISs used from for sa.wav, 0 to 2440, 0 Synthesized barakhadi synthesizin mo.wav, to 3168, 0 to g_moaMs Yes from ~mamISs when g out of ra2.wav, 4352, 0 to syllables database chi.wav 3384 are not words. available. gh, O from direct ghO from ghH ma and O sahkar.wav 0 to 2276 and ghO Yes recorded ghH ma and O from ~mamISs and ja3.wav 148 to 3440 ghH ma and from ~mamISs ~mamISs {VWo and M {VWo recorded {VWo, M as both tithe.wav and 42 to 5808 {VWoM and M as Yes common directly cha2.wav and 0 to 2324 Pratyay syllables. recorded. nmë`m©, bm syllables formed nmë`m©HSo from bm directly parlyakade.wa 16 to 3640 nmë`m©bm recorded and Yes original recorded. v and la.wav and 0 to 3078 as pratyay bm word and bm as pratyay

 Paragraph 2 ^maV _mPm Xoe Amho. gmao ^maVs` _mPo ~m§Yd AmhoV. _mÂ`m Xoemda _mPo n«o_ Amho. _mÂ`m XoemVë`m g_¥Õ Am[U pdpdYVoZo ZQboë`m na¨nam¨Mm _cm A[^_mZ Amho. Ë`m na¨nam¨Mm nmB©H hmoÊ`mMs nmÌVm _mÂ`m A§Js `mds åhUyZ _s gX¡d n«`ËZ H asZ. _s _mÂ`m nmbH m§Mm JwéOZm§Mm Am[U dpSbYmè`m _mUgm§Mm _mZ RodsZ Am[U n«Ë`oH mes gm¡OÝ`mZo dmJoZ. _mPm Xoe Am[U _mPo Xoe~m§Yd `m¨À`mes pZðm amIÊ`mMs _s n«pVkm H asV Amho. Ë`m¨Mo H ë`mU Am[U Ë`m¨Ms g_¥Õs ô`m§VM _mPo gm¡»` gm_mdco Amho. (Bharat maza desh aahe. Sare Bharatiya maze bandhav aahet. Mazya deshavar maze prem aahe. Mazya deshatlya samrudha ani vividhatene natlelya paramparancha

mala abhimaan aahe. Tya paramparancha paik honyachi patrata mazya angi yavi mhanun mi sadaiv prayatna kareen. Mi mazya palakancha gurujanancha ani vadildharya manasancha maan thevin ani pratyekashi saujanyane vagen. Maza desh ani maze deshbandhav yanchyashi nishta rakhanyachi mi pratidnya karit aahe. Tyanche kalian ani tyanchi samrudhi hyatach maze saukhya samavale aahe.)

All the words in this paragraph and their concatenation details are shown in the following table. The 'from' and 'to' locations are given in samples and indicate the number of samples required for the respective word or syllable; being a sample count, the value is unit-less.

Table 9.2: Concatenation details of India pledge From-to Syllable/ Used Names of locations Name of Used direct / pratyaya Name of sound linguisti syllables/ (samples the word synthesized extracted file c rules pratyaya on time from axis)

bharat2.wav, 0 to 4400, ^maV, ^maV, ^maVs` Original bhartiya.wav, 0 to 10169, ^maVs`, n«o_, Direct Yes directly proper prem.wav, 0 to 4474, n«pVkm recorded nouns pratidnya.wav 0 to 6536

maza4.wav, 0 to 6048, sare.wav, 0 to 4031, maze4.wav, 0 to 4280, _mPm, gmao, All subject mazya2.wav, 0 to 5440, _mPo, _mÂ`m, _mPm, gmao pronouns mala.wav, 0 to 5288, _cm, Ë`m, _s, Direct Yes subject and related tya.wav, 0 to 1962, `m¨À`mes, pronoun words are mi.wav, 0 to 3644, recorded. Ë`m¨Mo, Ë`m¨Ms yanchyashi.wav, 0 to 8816, tyanche.wav, 0 to 64880 tyanchi.wav to 6112

Xoe, ~m§Yd, _mZ Common desh3.wav, 0 to 5648, Xoe, ~m§Yd, Direct Yes directly nouns are bandhav.wav, 0 to 5464, _mZ recorded recorded maan3.wav 0 to 2971

ahe.wav, 0 to 3840, All ahet.wav, 0 to 6552, Amho, AmhoV, pH«`mnX important vagen.wav, 0 to 6387, H asZ, RodsZ, Direct Yes directly pH«`mnX karin.wav, 0 to 3804, dmJoZ, H asV recorded recorded. thevin.wav, 0 to 4080, karit.wav 0 to 5968

Synthesized Yes Xoemda, Xoem from XoemH All deshavar.wav, 0 to 8115,


From-to Syllable/ Used Names of locations Name of Used direct / pratyaya Name of sound linguisti syllables/ (samples the word synthesized extracted file c rules pratyaya on time from axis)

XoemVë`m SyZ and da, important deshatlya.wav 0 to 7812 Vë`m as pratyayas pratyay /suffixes recorded

g_¥Õ, pZðm, H samrudhha.wav, 0 to 9328, g_¥Õ, pZðm, pdeofU ë`mU, g_¥Õs nishtha.wav, 0 to 6832, H ë`mU, Direct Yes recorded separately kalian.wav, 0 to 6360, g_¥Õs separately recorded samruddhi.wav 0 to 8192

Am[U, åhUyZ, d Common aani5.wav, 0 to 4576, Am[U, åhUyZ, these Direct No word, n«Ë`` mhanun.wav, 0 to 5520, d suffixes separately va.wav 0 to 2980 recorded

Synthesized from pdpdY, vividh.wav, 0 to 4216, Synthesized tene.wav, 0 to 4952, pdpdYVoZo, ZQ, na¨nam word and with pdeofU nat.wav, 0 to 3360, ZQboë`m, Synthesized Yes VoZo, and lelya.wav, 0 to 4752, na¨nam¨Mm boë`m, Mm pratyay/ barakhadi parampara.wav, 0 to 6160, cha4.wav 0 to 2736 barakhadi

Ap^ from _mZ abhi2.wav, 0 to 2976, A[^_mZ Synthesized Yes and separately Ap^Zd _mZ maan3.wav 0 to 3152 recorded recorded

nmB©H pdeofU Most of the nmB©H Direct Yes recorded pdeofU paik.wav 0 to 4164 directly recorded

Some pH«`mnX with pratyay hmo from and barakhadi barakhadi ho.wav, 0 to 3432, hmoÊ`mMs Synthesized Yes and Ê`mMs as are formed nyachi.wav 0 to 4612 pratyay if not recorded directly.

nmÌ directly patra.wav, 0 to 4376, nmÌVm Synthesized Yes Vm as pratyay recorded taa.wav 0 to 2069

A§Js from Some words A§Js Synthesized Yes angi.wav 0 to 3928 word A§JsH ma or syllables are


From-to Syllable/ Used Names of locations Name of Used direct / pratyaya Name of sound linguisti syllables/ (samples the word synthesized extracted file c rules pratyaya on time from axis)

extracted/

formed from big words or with barakhadi

If not pH«`mnX available `mds Direct Yes directly can be yavi.wav 0 to 5554 recorded. formed with barakhadi

g from barakhadi X¡d, g two sa.wav, 0 to 882, 0 gX¡d Synthesized Yes and X¡d from syllables daiv.wav to 4728 XwX£d

some n«`ËZ [H«`mpdeofU [H«`mpdeofU prayatna.wav, 0 to 5128, n«`ËZ, gm¡»` Direct No and pdeofU recorded saukhya.wav 0 to 6944 recorded directly directly

Common nmbH m§Mm, palakancha.wav, nmbH m§Mm, noun JwéOZm§Mm 0 to 8824, Direct Yes gurujanancha.w JwéOZm§Mm directly directly 0 to 9792 av recorded recorded

dpSb word Ym from recorded vadil.wav, 0 to 3424, barakhadi dpSbYmè`m Synthesized Yes Ymè`m from dha.wav, 0 to 3332, and è`m as barakhadi rya.wav 0 to 2535 pratyay and pratyay

_mU from gm§Mm pratyay _mZpgH and mansik.wav, 0 to 3204, _mUgm§Mm Synthesized No directly as sancha.wav 0 to 5800 gm§Mm recorded. pratyay.

n«Ë`oH m from n«Ë`oH mZo pratyekane.wav, 0 to 5920, n«Ë`oH mes Synthesized Yes and directly n«Ë`oH mZo es shi.wav 0 to 4048 as pratyay. recorded.

Direct No gm¡OÝ`mZo saujnyane.wav 0 to 9712 gm¡OÝ`mZo Directly directly


From-to Syllable/ Used Names of locations Name of Used direct / pratyaya Name of sound linguisti syllables/ (samples the word synthesized extracted file c rules pratyaya on time from axis)

recorded. recorded.

Xoe original ~m¨ and Yd recorded syllables desh.wav, 0 to 4300, Xoe~m§Yd Synthesized Yes word, ~m§ taken from bangadi.wav, 0 to 2600, from ~m§JSs, words ~m§JS s dhawal.wav 0 to 2772 Yd from Ydb and Ydb

I, am syllables ra.wav, 0 to 3272, from Ê`mMs directly amIÊ`mMs Synthesized Yes kha.wav, 0 to 4272, barakhadi recorded. nyachi.wav 0 to 4472 and Ê`mMs as pratyay.

From subject `mVM, name with yatach.wav, 0 to 5240, Direct Yes barakhadi ô`m§VM n«Ë`` and Hyantach.wav 0 to 7040 and pratyay ~mamISs

pH«`mnX gm_m from prepared 0 to 4128, gm_mpOH and samajeek.wav, gm_mdco Synthesized Yes from word 0 to 2980, from va.wav, le.wav dco and 0 to 3896 barakhadi barakhadi

 Paragraph 3 d. nw. H mio åhUVmV, AmH memV Ooìhm EImXm H¥ pÌ_ J¥h gmoSVmV Voìhm JwêËdmH f©UmÀ`m gs_o~mhoa Ë`mcm pnQ miyZ cmdon`ªVM gJim g§Kf© AgVmo. Ë`mZo EH Xm ñdV…Ms JVs KoVcs, H s Cacocm n«dmg AmnmoAmn hmoVmo. _mUgmM§ AgM§ Amho g_mOmV pdpeð C¨Ms JmRon`ªV O~a g§Kf© AgVmo nU EH Xm AnopjV C¨Msda nmohMcmV H s Am`wî`mVë`m AZoH g_ñ`m Vs C¨MsM gmoS dVo. (Va. Pu. Kale mhanatat, akashat jenvha ekhada krutrim graham sodtat tenvha gurutvakarshnachya seemebaher tyala pitalun laveparyantacha sagla sangharsha asato. Tyane ekda swatahachi gati ghetali ki urlela pravas apoaap hoto. Manasacha asacha aahe samajat vishisht unchi gatheparyant jabar sangharsha asato pan ekada apekshit unchivar pohochlat ki ayishyatlya anek samasya ti unchich sodavate.)

All the words in this paragraph and their concatenation details are shown in the following table. The 'from' and 'to' locations are given in samples and indicate the number of samples required for the respective word or syllable; being a sample count, the value is unit-less.

Table 9.3: Concatenation details of d. nw. H mio paragraph Used Syllable/ From-to Names of Name of Used direct / lingui pratyaya Name of locations syllables/ the word synthesized stic extracted sound file (samples on pratyaya rules from time axis)

Original from va.wav, 0 to 3664, 0 barakhadi d., nw., H mio Direct Yes d, nw, H m, io pu.wav, to 1252, 88 to and kale.wav 2912 Proper noun

åhUUo as pH«`mnX åhU direct and åhU, VmV as extracted mhanane.wav 96 to 6024, åhUVmV Yes VmV as pratyay from pH«`mnX and tat.wav 160 to 6264 pratyay. åhUUo

AmH m from AmH medmUs emV as akashwani.wa 64 to 3388, 0 AmH memV Synthesized Yes pratyay pratyay v and shat.wav to 2680 separately recorded

All subject pronouns Ooìhm Ooìhm Direct Yes and related separately jenvha.wav 20 to 4864 words recorded recorded.

0 to 5536 E, Im and Xm (whole word) separately ea.wav, Used EImXm Synthesized Yes taken kha.wav and barakhadi 0 to 1408, 0 from daa.wav to 3096, 0 to barakhadi 3312

pdeofU Separately H¥ pÌ_ Direct Yes recorded krutrim.wav 104 to 5352 recorded separately

J¥ from Common J¥hñV and h J¥h Synthesized No graha.wav 120 to 5884 noun from barakhadi

gmoSVmV Synthesized No Synthesized sodtat.wav 0 to 8952 gmoS from with from pHŒ`mnX


Used Syllable/ From-to Names of Name of Used direct / lingui pratyaya Name of locations syllables/ the word synthesized stic extracted sound file (samples on pratyaya rules from time axis)

kriyapada gmoSdVo gmoSdVo and and VmV from barakhadi barakhadi

All subject pronouns Voìhm Voìhm Direct Yes and related separately tenvha.wav 64 to 4000 words recorded recorded.

JwêËdmH f© from gurutwakarsh. UmÀ`m as 0 to 8256 and JwêËdmH f©UmÀ`m Synthesized Yes JwêËdmH f©U wav and pratyay 0 to 4496 and UmÀ`m nachya.wav as pratyay

gs_o from ~mhoa as and gs_oda seeme.wav and 4 to 3224 and gs_o~mhoa Synthesized No common ~mhoa baher.wav 0 to 3684 noun direcly recorded

All subject pronouns Ë`mcm, Ë`mZo tyala.wav and 0 to 3304 and Ë`mcm, Ë`mZo Direct Yes and related directly tyane.wav 0 t0 3060 words recorded recorded.

pnQ m from iyZ as pHŒ`mnX pnQ miUo pita.wav and 0 to 5312 and pnQ miyZ Synthesized Yes pratyay or and iyZ as llun.wav 0 to 2548 suffix pratyay

M as pratyay. la.wav, ve.wav cmdo from 0 to 4034,0 to cmdofrom and cmdon`ªVM Synthesized No barakhadi. 3340 and 0 to barakhadi paryantacha.w directly 4724 n`ªV av recorded.

Direct gJim, EH Xm recorded sagala.wav, 0 to 3740 and gJim, EH Xm Direct No directly common ekada.wav o to 3504 recorded word

Directly g§Kf© directly sangharsha.wa g§Kf© Direct Yes recorded 0 to 5248 recorded v pdeofZm_


Used Syllable/ From-to Names of Name of Used direct / lingui pratyaya Name of locations syllables/ the word synthesized stic extracted sound file (samples on pratyaya rules from time axis)

All important pH«`mnX direct ahe.wav and 0 to 3352 and AgVmo, Amho Direct Yes pH«`mnX recorded asato.wav 0 to 5336 directly recorded.

Ms pratyay nŒË``KpQ V swataha.wav 0 to 4744 and ñdV…Ms Synthesized Yes separately word and chi.wav 0 to 1280 recorded

J, Vs Word recorded JVs Synthesized No formed from gati.wav 0 to 3432 from barakhadi barakhadi

KoVcs Direct pH«`mnX direct KoVcs Yes directly ghetali.wav 0 to 7276 recorded recorded recorded.

Direct From H s, Vs from ki.wav and 0 to 2392 and H s, Vs No recorded barakhadi barakhadi ti.wav 0 to 1630

Ca from CaUo pH«`mnX pH«`mnX CaUo ur2.wav and 36 to 2888 Cacocm Synthesized Yes recorded and as lela.wav and 0 to 7800 directly cocm pratyay

n«dmg Common n«dmg Direct No recorded pravas.wav 0 to 4080 noun directly

Formed Am, n, nmo apoaapcon.wa AmnmoAmn Synthesized No from from 0 to 15504 v barakhadi barakhadi

pH«`mnX Direct hmoVmo direct hmoVmo Yes direct hoto.wav 0 to 4272 recorded recorded recorded

_mUgm from word _mUgmH _mUgm, M as manasa.wav 0 to 5080 and _mUgmM Synthesized Yes SyZ and M as pratyay and cha.wav 0 to 2324 pratyay

Ag from AgVmo direct asato.wav and 0 to 1292 and AgM Synthesized Yes and AgVmo M recorded cha.wav 0 to 2324 as pratyay and M as


Used Syllable/ From-to Names of Name of Used direct / lingui pratyaya Name of locations syllables/ the word synthesized stic extracted sound file (samples on pratyaya rules from time axis)

pratyay

Concatena tion of g_m samaj.wav and 0 to 2960 and g_mOmV Synthesized No g_m from g_mO syllable jaat.wav 0 to 2216 and OmV word

Most of pdpeð directly the pdeofU pdpeð Direct Yes vishishta.wav 0 to 5336 recorded directly recorded

pdeofU Direct C¨Ms directly C¨Ms Yes direct unchi.wav 0 to 3584 recorded recorded recorded

JmRo from n`ªV barakhadi recorded gathe.wav and 0 to 4900 and JmRon`ªV Synthesized Yes and being a n`ªV paryant.wav 0 to 3808 directly common recorded word.

pdeofU Oã~a pdeofU Oã~a Synthesized Yes direct jabbar.wav 0 to 3844 recorded recorded

All suffix or nU direct nU Synthesized Yes pratyay pan.wav 0 to 4048 recorded recorded

pdeofU AnopjV pdeofU AnopjV Direct Yes direct apekshit.wav 0 to 5472 recorded recorded

M, da unchi.wav, 0 to 3584, 0 C¨Ms directly pratyay C¨Msda, C¨MsM Synthesized Yes cha.wav, to 2324, 0 to recorded. directly var.wav 3052 recorded

nmohMcm V pratyay pohochala.wav 0 to 4144 and nmohMcmV Synthesized Yes directly directly and ta.wav 0 to 3208 recorded. recorded

from ayushyamadhy Am`wî`m Vë`m direct 0 to 3808 and Am`wî`mVë`m Synthesized Yes e.wav and Am`wî`m_Ü`o and recorded 0 to 3502 tlya.wav Vë`m as


Used Syllable/ From-to Names of Name of Used direct / lingui pratyaya Name of locations syllables/ the word synthesized stic extracted sound file (samples on pratyaya rules from time axis)

pratyay

Most AZoH , g_ñ`m common anek.wav and 0 to 2968 and AZoH , g_ñ`m Direct No common nouns samsya.wav 0 to 6032 noun direct recorded

Most gmoSdVo pH«`mnX common gmoSdVo Direct Yes recorded pH«`mnX sodawate.wav 0 to 5456 directly recorded directly

 Paragraph 4 EH m bhmZem IoSoJmdmV amOy Am[U Ë`mMs åhmVmas AmB© ahmV hmoVo. amOy hm AË`§V Amies hmoVm. Vmo pXdg^a PmonyZ ls_¨V ìhm`Ms ñdßZ¨ ~KV Ago. Ë`mMs AmB© XmoZ doioÀ`m OodUmgmR s am̧pXdg H m~mSH îQ H asV Ago. EHo pXdes amOyZo ñdßZmV PmSmbm n¡go bmJbobo nmphbo. dm, H m` PmS Amho? Oa Ago PmS _bm p_imbo, Va _s nU Iyn ls_¨V hmoB©Z. Vm~SVmo~ Vmo Ka gmoSyZ n¡go bmJUmè`m PmS mÀ`m emoYmV pZKmcm. pXdg^a MmcVM amphcm. MmcVm MmcVm EHo pXdes Vmo EH m PmS mImcs Pmoncm Am[U Zoh_sn«_mUo Kmoê cmJcm. AÝZ nmÊ`mpdZm amÌgwÕm Kmcpdcs. AMmZH EH m åhmVmè`m _mUgmZo Ë`mbm OmJo Ho co Am[U ZOaocm ZOa p^S pdcs. (Eka lahanashya khedegavat Raju ani tyachi mhatari aai rahat hote. Raju ha atyant aalashi hota. To divasbhar zopun srimant vhayachi swapna baghat ase. Tyachi aai don velechya jewanasathi ratrandivas kabadkashta karit ase. Eke divashi Rajune swapnat zadala paise lagalele pahile. Va, kay zaad aahe? Jar ase zaad mala milale tar mi pan khup srimant hoin. Tabadtob to ghar sodun paise lagnarya zadacya shodhat nighala. Divasbhar chalatach rahila. Chalata chalata eke disvashi to eka zadakhali zopala ani nehamipramane ghoru lagala. Anna panyavina ratrasudhha ghalavili. Achanak eka mhatarya manasane tyala jage kele and najarela najar bhidavili.)

All the words in this paragraph and their concatenation details are shown in the following table. The 'from' and 'to' locations are given in samples and indicate the number of samples required for the respective word or syllable; being a sample count, the value is unit-less.


Table 9.4: Concatenation details of amOy JmoîQ paragraph Used Syllable/ From to Names of Name of Used direct / linguis pratyaya Name of locations syllables/ the word synthesized tic extracted sound file (samples on pratyaya rules from time axis)

eka.wav, 0 to 4756, 0 tyachi.wav, to 5408, 0 to All to.wav, 2144, 0 to subject EH m, Ë`mMs, Vmo, Subject don.wav, 3028, 0 to pronouns XmoZ, EHo , Oa, pronoun eke.wav, 5440, 0 to Direct Yes and _bm, Va, _s, directly jar.wav, 4448, 0 to related recorded mala.wav, 4872, 0 to Ë`mbm words are tar.wav, 2656, 0 to recorded. mi.wav, 3456, 0 to tyala.wav 5272

Some pratyay Proper noun raju.wav, 0 to 4828, 0 like Zo are amOy, amOyZo Direct Yes directly rajunecon.wa to 7376, 0 to used with recorded v, ne.wav 2096 proper nouns

0 to 5712, 0 aai.wav, to 3880, 0 to n¡go, PmS,Ka,H m`, Common aalashi.wav, AmB©, Amies, 5472, 0 to AmB©, Amies, nouns are swapna.wav, ñdßZ¨, n¡go, PmS, Direct Yes 5064, 0 to ñdßZ¨ directly directly zaad.wav, Ka, H m` 4320, 0 to recorded ghar.wav, recorded 3956, 0 to kaay.wav 2652

hote.wav, 0 to 4056, 0 hota.wav, to 3692, 0 to hmoVo, hmoVm, Ago, All ase.wav, 3224, 0 to nmphbo, Amho, pH«`mnX important pahile.wav, 4088, 0 to p_imbo, amphcm, Direct Yes directly pH«`mnX aahe.wav, 3888, 0 to Pmoncm, cmJcm, recorded recorded milale.wav, 5584, 0 to Ho co directly rahila.wav, 4472, 0 to zopla.wav, 7056, 0 to lagla.wav 4224

All gava.wav, 0 to 2716, 0 Jmdm, pXd, ñdßZm, important ta.wav, to 3112, 0 to PmSm, bmJ, Ago JmdmV, pXdes, pratyayas gava.wav, 2716, 0 to syllables ñdßZmV, PmSmbm, /suffixes shi.wav, 1486, 0 to Synthesized Yes taken from V, es, bm, swapna.wav, 4328, 0 to bmJbobo, AgoM, some bobo, M, À`m zadamadhye. 4384, 0 to PmSmÀ`m original recorded wav, laa.wav, 4144, 0 to words or and used lag.wav, 3704, 0 to formed with with some lele.wav, 3536, 0 to


Used Syllable/ From to Names of Name of Used direct / linguis pratyaya Name of locations syllables/ the word synthesized tic extracted sound file (samples on pratyaya rules from time axis)

barakhadi words or ase.wav, 3224, 0 to barakhadi cha.wav, 2144, 0 to chya.wav 2920

pdeofU Separately mhatari.wav, 0 to 6464, 0 åhmVmas, AË`§V Direct Yes recorded recorded atyant.wav, to 7816 separately

Am[U, hm, dm, Common aani.wav, 0 to 4352, 0 nU these Am[U, hm, dm, words, ha.wav, to 3336, 0 to Direct Yes n«Ë`` nU suffixes vaa.wav, 3200, 0 to recorded separately pan.wav 3716 recorded

Synthesized mhata.wav, 0 to 4416, 0 from åhmVmas, Synthesiz rya.wav, to 2536, 0 to åhmVmè`m, è`m, pXdg, ^a, ed with divas.wav, 3608, 0 to pXdg^a, _mU, gmZo, chmZ, words, bhar.wav, 3136, 0 to Synthesized Yes _mUgmZo, em pratyay maann.wav, 3216, 0 to chmZem words and sane.wav, 6144, 0 to pratyay/ barakhadi lahan.wav, 3536, 0 to sha.wav 3400 barakhadi

velemule.wav 0 to 4176, 0 doio_þio, À`m, À`m, Um, gmR s , chya.wav, to 2920, 0 to doioÀ`m, Synthesized Yes Ood, Um, gmR s separately jev.wav, 3016, 0 to OodUmgmR s recorded recorded na.wav, 3216, 0 to sathi.wav 4328

pdeofU Most of Iyn Direct Yes recorded the pdeofU khup.wav 0 to 2784 directly recorded

raha.wav, 0 to 3808, 0 ahm, ~K, H a, ta.wav, to 3112, 0 to Some bmJ, Mmc, Kmc, zo.wav, 1976, 0 to pH«`mnX with ahmV, PmonyZ, p^S from ahmUo, poo.wav, 3600, 0 to ìhm`Ms, ~KV, H pratyay na.wav, 3216, 0 to ~KUo, H aUo, and aV, hmoB©Z, gmoSyZ, vhay.wav, 3088, 0 to bmJUo, MmcUo, barakhadi bmJUmè`m, Synthesized Yes chi.wav, 2424, 0 to KmcUo, p^SUo are MmcVM, MmcVm, baghane.wav, 2824, 0 to formed if Kmcpdcs, Remaining ta.wav, 3112, 0 to not from pratyay karane.wav, 2494, 0 to p^Spdcs recorded and ta.wav, 3112, 0 to directly. barakhadi ho.wav, 3696, 0 to ein.wav, 5040, 0 to so.wav, 2439, 0 to


Used Syllable/ From to Names of Name of Used direct / linguis pratyaya Name of locations syllables/ the word synthesized tic extracted sound file (samples on pratyaya rules from time axis)

dun.wav, 2624, 0 to lagane.wav, 3744, 0 to narya.wav, 3412, 0 to chal.wav, 2604, 0 to taa.wav, 2069, 0 to tacha.wav, 5424, 0 to ghalne.wav, 3244, 0 to bhidne.wav, 2692, 0 to vili.wav 2580

Some am̧ recorded words directly. pXdg recorded is common directly divas.wav, 0 to 3608, 0 am̧pXdg Direct Yes noun and and joined ratran.wav to 4760 joined with to form am̧ new words.

Some It is a big words or joint word syllables are formed from extracted/ kabad.wav, 0 to 4332, 0 H m~mSH îQ Synthesized Yes common kashta.wav to 1920 noun H m~mS formed from big and H îQ words or word. with barakhadi

If not available sho.wav, 0 to 2158, 0 pH«`mnX can be dha.wav, to 3568, 0 to emoYmV, pZKmcm Direct Yes directly formed ta.wav, 3112, 0 to recorded. with nighala.wav 6152 barakhadi

pXdg, ^a, IoSo, Jmdm, V Some words these divas.wav, 0 to 3608, 0 are formed words, bhar.wav, to 3136, 0 to pXdg^a, from small pratyay, Synthesized Yes khede.wav, 4200, 0 to IoSoJmdmV words, subwords gava.wav, 2716, 0 to syllables and and ta.wav 3112 barakhadi syllables are formed.

348

Linguistic analysis

Used Syllable/ From to Names of Name of Used direct / linguis pratyaya Name of locations syllables/ the word synthesized tic extracted sound file (samples on pratyaya rules from time axis)

some srimant.wav, 0 to 7872, 0 [H«`mpdeofU [H«`mpdeofU ls_¨V, AMmZH , achanak.wav, to 6160, 0 to Direct Yes recorded and pdeofU Kmoê , OmJo ghoru.wav, 5552, 0 to directly recorded jage.wav 5936 directly

Vm~S, Vmo~ Aä`ñV words both Vm~SVmo~ Direct Yes recorded words are tabadtob.wav 0 to 6880 directly used together.

Big and new PmSm from words are zadajawal.wa 0 to 3612, 0 PmSmImcs, PmSmOdi and formed v, khali.wav, to 3508, 0 to Synthesized No amÌgwÕm Imcs as from ratra.wav, 5552, 0 to pratyay syllables sudhha.wav 5008 and pratyay

Common AÝZ directly words AÝZ Direct No anna.wav 0 to 5120 recorded. directly recorded.

Big and new Zoh_s and n«_mUo words are nehami.wav, 0 to 4068, 0 Zoh_sn«_mUo Synthesized Yes are joined formed pramane.wav to 5672 together from two small words.

Big and new nmÊ`m from words are nmÊ`mgmR s and formed panyasathi.w 0 to 4416, 0 nmÊ`mpdZm Synthesized Yes pdZm as from av, vina.wav to 3580 pratyay syllables and pratyay.

Common ZOa directly words are ZOa Direct Yes najar.wav 0 to 5264 recorded. recorded directly

349

Linguistic analysis

Used Syllable/ From to Names of Name of Used direct / linguis pratyaya Name of locations syllables/ the word synthesized tic extracted sound file (samples on pratyaya rules from time axis)

Words formed ZO from ZOa from naja.wav, 0 to 3708, 0 ZOaocm Synthesized Yes and aocm from syllables re.wav, to 3480, 0 to barakhadi of existing laa.wav 4144 words and barakhadi

These tabular details show the concatenation details and the use of original recorded words for synthesis. All audio results are intelligible and more natural compared with other TTS results.

9.5.2 Comparison of naturalness obtained by various methods: To analyze the naturalness of the resulting speech output, some sentences and paragraphs produced by this Marathi TTS are tested with a group of native Marathi listeners and a subjective analysis is carried out. MOS (mean opinion score) is a common type of subjective analysis. All subjects are asked to judge the reference audio output along with the proposed system output.

This subjective analysis helps to show that if the guidance provided by the linguist is implemented properly in this TTS system, the naturalness of the speech output increases. In MOS, all subjects are asked to rank the output on a 1 to 5 scale, where 1 = poor audio, 2 = audio not acceptable for naturalness, 3 = acceptable and somewhat natural, 4 = natural and good for implementation, and 5 = excellent natural audio output.

The following table shows the subjective analysis for the Marathi TTS system along with the other naturalness improvement techniques based on signal processing.

350

Linguistic analysis

Table 9.5: MOS test for linguistic analysis implemented for Marathi TTS

S. No.       PSD     DTW     Multi-resolution     Linguist's rules
                             based wavelet        based TTS
Subject 1    3       4       3                    3
Subject 2    3       4       4                    3
Subject 3    2       3       4                    4
Subject 4    2       3       4                    4
Subject 5    4       4       4                    3
Subject 6    3       4       5                    4
Subject 7    4       3       4                    4
Subject 8    4       3       3                    4
MOS (mean)   3.125   3.5     3.875                3.625
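The per-method MOS values in the last row of Table 9.5 are simply the averages of the eight subject ratings. A minimal Matlab sketch of this computation is given below (the ratings matrix is copied from the table; the variable names are illustrative):

% Rows = subjects 1..8; columns = PSD, DTW, multi-resolution wavelet,
% linguist's-rules-based TTS (ratings copied from Table 9.5).
ratings = [3 4 3 3;
           3 4 4 3;
           2 3 4 4;
           2 3 4 4;
           4 4 4 3;
           3 4 5 4;
           4 3 4 4;
           4 3 3 4];

mos = mean(ratings, 1);              % per-method mean opinion score
% mos = [3.125  3.5  3.875  3.625]

naturalnessPercent = 100 * mos / 5;  % expressed on the 5-point scale, e.g. 3.625 -> 72.5 %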

This analysis shows that after considering the linguist's suggestions the resulting audio sounded natural and acceptable for implementation. Comparing the MOS of the linguist's rule based TTS (3.625) with that of the multi-resolution wavelet method (3.875) shows that the wavelet technique is slightly better, but the two scores are very close. This indicates that, apart from the core DSP algorithms, contextual and linguistic analysis is equally important for improving the naturalness of the TTS output. Together, the subjective perceptual test and the objective validation methods show an effective improvement in the naturalness of the Marathi TTS system.

After testing the paragraph results and comparing the naturalness of the various methods using MOS, the available TTS systems are compared. The Marathi TTS gives more natural speech output with a moderate database and outperforms the other methods. A comparison of different Indian language TTS systems is shown in the following table.

351

Linguistic analysis

Table 9.6: Comparison of different TTS systems

1. Concatenative TTS (2002-2014), Tamil.
   Technique: syllable based concatenation; syllabification units are marked using DCT-SAF (spectral autocorrelation function). Advantage: the system is expandable to any language just by changing the language rules.
   Grammatical structure / processing: normal text normalization is performed, with special care taken for normalizing non-standard words; text splitting is then carried out; the total number of units is around 3000; speech files are stored in PCM format for naturalness; syllabification is based on rules; only subjective analysis.
   Naturalness: good quality naturalness.
   Results: a database of words and syllables for various domains; syllable pitch modification is performed based on time scale modification.

2. Concatenative / corpus based TTS (2010-2013), Telugu.
   Technique: syllable based concatenation; contextual analysis, which minimizes the co-articulation effect; a matching function based on contextual analysis produces natural and prosodic output.
   Grammatical structure: diphone or syllable as the basic unit; common syllables and barakhadi are used in the database; letter to sound rules are used for text processing.
   Naturalness: the quality of the synthesized speech is reasonably natural. Advantage: a matching function technique assigns weights to syllables and the maximum-weight syllable is selected.
   Results: reasonable naturalness; the matching function technique needs experience and context analysis; efficiency and performance are better than the earlier corpus based counterpart.

3. Concatenation / durational TTS in the Festival framework (1993-2014), Hindi.
   Technique: unit selection with durational modeling and an intonation component with 'f0' contours; Gayatri is among the new developments.
   Grammatical structure: due to Festival, any arbitrary language can be used; knowledge of the phone set, syllabification rules and stress marking of the language is required.
   Naturalness: analysis of a limited domain shows that intelligibility of speech is more than 98% but naturalness is very low, around 40%.
   Results: unit selection is good for producing the sound of any generalized text; although highly intelligible, naturalness is limited to just 40%; the pre-processor and prosody need to be considered for improvement.

4. Syllable based TTS (2002-2011), Bengali.
   Technique: corpus based concatenative synthesis with ESNOLA; the basic unit is the syllable; corpus design includes text collection, optimal text selection, deriving letter to sound rules, labeling of the corpus, building a prosody model and clustering of units.
   Naturalness: fairly clear speech with a degree of naturalness; unit selection is based on syllable position, context and phonological features.
   Results: reads the online AnandBazar Patrika; an accurate prosody model can be developed with ANN and SVM in future; changes in unit selection are made by defining target cost and concatenation cost; a phrase break prediction module is included; evaluation uses both subjective and objective measures.

5. KTTS system (2010-2012), Kannada.
   Technique: unit selection based concatenative synthesizer.
   Grammatical structure: the combination of a consonant phoneme and a vowel phoneme produces a syllable; each phoneme is extracted from a word or phrase; the duration of a phoneme changes with its position.
   Naturalness: the quality of the synthesized speech is reasonably natural.
   Results: the proposed approach minimizes the co-articulation effect; the generated speech shows distortion at the concatenation point; a speech enhancement technique is used for smoother speech.

6. TTS for a voice response system (2011), Malayalam.
   Technique: corpus based concatenative voice response system.
   Grammatical structure: text normalization with sentence splitting; 2500 unique phonemes are used; database preparation with a Java program is time consuming due to the complexity of the language.
   Naturalness: speech files are stored in PCM format for improved naturalness.
   Results: only subjective evaluation; a mapping file for Malayalam words and syllables; the output resembles a natural human voice with high intelligibility.

354

Linguistic analysis

9.6 CONCLUSION OF LINGUISTIC ANALYSIS:

The linguist's suggestions helped to improve the context based analysis of the database. The implementation of certain language rules improved the overall performance of the speech output of the Marathi TTS system. Different context patterns are studied and optimization of the database is carried out, which resulted in a reduction of the database size and an improvement in the naturalness of the TTS output. The MOS subjective evaluation gave a score of 3.625 for the linguistic analysis, which is comparable to core DSP techniques such as wavelet, DTW and PSD. With this new approach of incorporating linguistic rules for syllable formation, the naturalness reaches about 73% (MOS score of 3.625 on a 5 point scale). If the linguist's suggestions are implemented along with the wavelet method (MOS score 3.875), the resulting speech output can be improved further.


355

Chapter-10

Outcome of research


CHAPTER- 10

OUTCOME OF RESEARCH

No.    Title of the Contents                         Page No.
       Outcome of research                           357

356

Outcome of research

CHAPTER- 10

OUTCOME OF RESEARCH

This research work contributes to the speech synthesis field by improving the naturalness of speech in a Marathi language TTS. The work demonstrates the development of four naturalness improvement modules, as below:

1) Contextual analysis and classification of syllables for Marathi TTS
2) Position based syllabification
3) Spectral mismatch calculation and reduction
4) Implementation of linguist's suggestions for improving speech naturalness.

Software code for these four modules is written in C-VB and Matlab on the Windows platform. The first module is developed in a C-VB combination of about 1000 lines of code, and the remaining modules are implemented in Matlab using functions from the signal processing toolbox. All the modules are tested for words, sentences and paragraphs, and the system is found to give very natural and intelligible speech output.

This text to speech system uses a small database of about 3000 words and syllables. This hybrid syllabic approach (words and syllables) has resulted in more natural speech output with a memory requirement of only about 6 MB for the database.
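As a rough consistency check (assuming the units are stored as raw 8-bit, 11 kHz mono PCM, the recording format described in Appendix 1, which may differ from the actual storage layout), 6 MB corresponds to roughly nine and a half minutes of audio, i.e. on the order of 190 ms per unit on average for 3000 units:

% Back-of-envelope check of the database size (assumptions: raw PCM,
% 8 bits/sample, 11025 Hz mono, as per the recording format in Appendix 1).
fs        = 11025;               % samples per second
bytesPerS = fs * 1;              % 8-bit mono -> 1 byte per sample
dbBytes   = 6 * 1024^2;          % ~6 MB database

totalSeconds   = dbBytes / bytesPerS;   % ~570 s of audio in total
avgUnitSeconds = totalSeconds / 3000;   % ~0.19 s per word/syllable unit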

Innovative algorithms for improving the naturalness of the speech output are developed and implemented successfully in both the time and frequency domains. Objective measures of naturalness (numerical figures) and a subjective listening test (MOS) provide essential support for the premise of speech synthesis that the integrity of syllable position and the resulting spectral distortion are correlated.

357

Outcome of research

About 500 words, 100 sentences and 4 paragraphs are tested while developing this text to speech synthesis system for the Marathi language.

The outcome of each of the four modules is described in more detail below:

1. The first module, 'contextual analysis and classification of syllables', is implemented in C-VB on the Windows platform. The code of about 1000 lines for assigning code strings, CV structure formation and concatenation has been developed in C. The front end of the Marathi synthesis system is developed in VB, where input text can be entered with a mouse-click keyboard. Text from newspapers, magazines and books is collected and provided as input. This module gives a list of the most frequent words and their syllables; these words are stored in the audio database. A modular database of about 3000 words and syllables is developed with this hybrid synthesis method and is used for synthesizing Marathi input text.
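The frequency analysis in this module amounts to tokenizing the collected text and counting word occurrences. The sketch below illustrates the idea in Matlab (the actual module is written in C-VB; the corpus file name and the whitespace tokenization rule are illustrative assumptions):

% Count word frequencies in a collected text corpus and list the most
% frequent entries (illustrative sketch; corpus.txt is a placeholder name).
txt    = fileread('corpus.txt');
words  = regexp(lower(txt), '\S+', 'match');    % split on whitespace
[uniq, ~, idx] = unique(words);
counts = accumarray(idx(:), 1);                 % occurrences per unique word

[counts, order] = sort(counts, 'descend');
uniq = uniq(order);

topN = min(50, numel(uniq));
for k = 1:topN
    fprintf('%6d  %s\n', counts(k), uniq{k});   % candidates for the audio database
end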

2. In the second module, 'position based syllabification', different methods of segmentation are designed and developed. Two families of techniques, neural and non-neural segmentation, are implemented.

In the neural approach, the Maxnet single layer NN method has been implemented with a segmentation accuracy of 70%. A total of 500 words with 2, 3 and 4 syllables are tested.

To improve the segmentation accuracy further, a supervised-unsupervised NN combination of Back-propagation and Maxnet is implemented, giving a segmentation accuracy of 82%.

Non-neural methods such as slope detection and simulated annealing are designed and developed; after testing 500 words with 2, 3, 4 and 5 syllables, they provide a maximum segmentation accuracy of 77%.

An innovative segmentation method combining classification and neural networks is developed. This K-means-Maxnet method achieves a segmentation accuracy of 90%. Here 500 words are tested, including 5-syllable words; about 455 out of 500 words give the same segmentation locations as the manual segmentation values. Due to its highest syllabification accuracy, this method is finalized for segmentation.
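The segmentation accuracies quoted above are essentially the fraction of words whose automatically detected boundary locations agree with the manually marked ones. A minimal Matlab sketch of such a comparison is given below (the tolerance parameter is an assumption for illustration, not a value taken from the thesis):

% Compare automatically detected syllable boundaries with manual ones.
% autoB and manB are cell arrays (one entry per word) of boundary sample
% indices; tolSamples is an assumed matching tolerance.
function acc = segmentationAccuracy(autoB, manB, tolSamples)
    nWords  = numel(manB);
    correct = 0;
    for w = 1:nWords
        a = autoB{w};  m = manB{w};
        if numel(a) == numel(m) && all(abs(a(:) - m(:)) <= tolSamples)
            correct = correct + 1;      % every boundary of this word matches
        end
    end
    acc = 100 * correct / nWords;       % percentage of correctly segmented words
end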

3. The third module, 'spectral mismatch calculation and reduction', reduces the spectral distortion after concatenation. Three different signal processing techniques, PSD, DTW and multi-resolution wavelet analysis, are implemented for this purpose.

Power spectral density gives accurate graphical plots of the spectral distortion at the concatenation joint; formant and slope plots display the distortion more clearly. The PSD plots show that the spectral distortion of improperly concatenated (IC) words is larger than that of properly concatenated (PC) words. PSD successfully estimated and reduced the spectral distortion for 445 words out of 500. The differences between the formants on either side of the concatenation point are taken and averaged to reduce this spectral mismatch: an average formant difference of 0.0366 watts/Hz is observed for PC words, compared with 0.1186 watts/Hz for IC words, showing that properly concatenated words are closer to the original words than improperly concatenated ones. Please refer to section 8.29.3 of chapter 8 for details of these results.

To improve the spectral smoothing further, DTW is developed and implemented. DTW reduces the distance between IC word formants for spectral smoothing. The correlation between the original and the PC word improves from 4.78 to 5.18 after using DTW, and the correlation between the original and the IC word improves from 4.51 to 4.94. This method is successfully tested for 500 words; for details, please refer to section 8.31.4 of chapter 8.
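The thesis compares formant values extracted from the PSD on either side of the joint; the Matlab sketch below illustrates the same idea in a simplified form by comparing raw Welch PSD estimates of short segments before and after the joint and computing a DTW distance between them (the file name, joint index, analysis span and window settings are illustrative assumptions, not the settings of chapter 8):

% Estimate spectral mismatch around a concatenation joint (illustrative).
[y, fs] = audioread('concatenated_word.wav');   % placeholder file name
joint   = 5000;                                 % joint sample index (example)
span    = round(0.020 * fs);                    % ~20 ms on either side (assumed)

left  = y(joint-span+1 : joint);
right = y(joint+1 : joint+span);

[pL, f] = pwelch(left,  hamming(128), 64, 256, fs);   % PSD before the joint
pR      = pwelch(right, hamming(128), 64, 256, fs);   % PSD after the joint

mismatch = mean(abs(pL - pR));                  % average PSD difference (watts/Hz)

% Dynamic time warping distance between the two spectra (smaller = closer):
dtwDist = dtw(10*log10(pL), 10*log10(pR));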

Multi-resolution wavelet analysis is used to further reduce the spectral noise at the concatenation joint. The multi-resolution wavelet algorithm gives more natural speech output by modifying the wavelet coefficients of the improperly concatenated words and thereby changing their percentage error. For spectral mismatch reduction, the wavelet-Back-propagation combination gives the best audio quality for the TTS output. This module implements the wavelet with energy levels from 1 to 5 for both approximation and detail coefficients. The IC word coefficients are given to the NN (Back-propagation algorithm), which reduces the distance and returns improved coefficients closer to the original. A percentage error improvement of about 20% is obtained for the approximation and detail coefficients after using the wavelet-NN combination; please refer to section 8.30.4 of chapter 8 for details of these results. A total of 500 words are tested for this combination. Among all these methods, the wavelet-back-propagation combination gives the best audio results: the MOS score (subjective listening test) of the multi-resolution wavelet algorithm is the highest (3.875) compared with the other methods, PSD (3.125) and DTW (3.5), on a maximum scale of 5.
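A minimal Matlab (Wavelet Toolbox) sketch of the decomposition step is given below: the original and the improperly concatenated word are decomposed to five levels and a percentage error between their approximation and detail coefficients is computed. The wavelet name and the error measure are assumptions for illustration; the back-propagation network that corrects the IC coefficients is not reproduced here.

% Multi-resolution wavelet comparison of an original word and an
% improperly concatenated (IC) version (illustrative sketch).
[orig, fs] = audioread('original_word.wav');     % placeholder file names
[ic,   ~ ] = audioread('ic_word.wav');
n = min(numel(orig), numel(ic));
orig = orig(1:n);  ic = ic(1:n);

level = 5;  wname = 'db4';                       % assumed wavelet and depth
[Co, Lo] = wavedec(orig, level, wname);          % 5-level decomposition
[Ci, Li] = wavedec(ic,   level, wname);

aO = appcoef(Co, Lo, wname, level);              % approximation coefficients
aI = appcoef(Ci, Li, wname, level);

dO = detcoef(Co, Lo, level);                     % detail coefficients at level 5
dI = detcoef(Ci, Li, level);

errApprox = 100 * norm(aO - aI) / norm(aO);      % percentage error measures
errDetail = 100 * norm(dO - dI) / norm(dO);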

4. The fourth module implements the linguist's suggestions for contextual analysis and improvement of speech naturalness. In this research work, rules suggested by linguistic experts are implemented for contextual analysis and for improving the database preparation, which increases the naturalness of the speech output.

Two important language rules are developed: a) proper names need to be stored in their original form in the database and cannot be reused for syllabification, and b) all suffixes (pratyayas) and prefixes (upasargas) need to be stored in the database to increase the naturalness of the speech output. Storing these suffixes and prefixes in the database improves the naturalness; a small illustrative sketch of applying the suffix rule is given at the end of this module description.

To test this implementation, a subjective listening test is carried out for both the core DSP methods (PSD, DTW and wavelet) and the linguist's rule based synthesis. On the MOS scale, the implementation of the linguist's rules scores 3.625, which is close to the 3.875 of the core DSP technique (wavelet-backpropagation) and comparable to the other objective approaches, PSD, DTW and multi-resolution wavelet. This shows that apart from core DSP algorithms, contextual and linguistic analysis is equally important for improving the naturalness of the TTS output.
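A minimal Matlab sketch of applying the suffix rule during synthesis is given below: if an input word ends with one of the stored pratyayas, it is split into stem + suffix so that the stored suffix recording can be reused. The suffix list and the example word are illustrative placeholders, not actual database entries.

% Apply the linguist's suffix rule: reuse stored pratyaya recordings.
% The suffix list and input word are illustrative placeholders.
suffixes = {'madhye', 'sathi', 'chya', 'la', 'at'};   % stored pratyayas (example)
word     = 'zadamadhye';                              % input word (example)

unitSeq = {word};                                     % default: use a direct recording
for k = 1:numel(suffixes)
    s = suffixes{k};
    if numel(word) > numel(s) && strcmp(word(end-numel(s)+1:end), s)
        stem    = word(1:end-numel(s));
        unitSeq = {stem, s};          % concatenate stem unit + stored suffix unit
        break;
    end
end
% unitSeq now lists the database units to concatenate, e.g. {'zada','madhye'}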

360

Chapter-11

Future scope, improvements and concluding remarks


CHAPTER- 11

FUTURE SCOPE, IMPROVEMENTS AND CONCLUDING REMARKS

No.     Title of the Contents                         Page No.
11.1    Concluding remarks                            362
11.2    Improvements and future scope                 364
11.3    Applications                                  366

361

Future scope, improvements and concluding remarks

CHAPTER- 11

FUTURE SCOPE, IMPROVEMENTS AND CONCLUDING REMARKS

11.1 CONCLUDING REMARKS:

This research work has provided numerical data which serve as essential support for the premise that the integrity of the position of syllables should be maintained during concatenation in speech synthesis. The naturalness of a TTS system depends on how properly the segmentation units are formed and, after synthesizing different words, how carefully the distortion at the concatenation joint is reduced.

In this research work, different segmentation techniques are implemented for syllable formation. This Marathi TTS uses different syllabification algorithms to prepare a hybrid database. The database stores the most common words of the language and the syllables of those words; new words are formed from the existing words or syllables in the database, so the size of the database remains moderate. Syllables are stored along with their position, which is based on contextual analysis of the database.

After synthesizing new words from existing words or syllables, the concatenation joint has a spectral mismatch. The algorithms in this research work show the spectral mismatch of properly and improperly concatenated words graphically and numerically; the mismatch is larger for improperly concatenated words, where position is not considered. After applying different smoothing techniques the spectral mismatch is reduced, producing more natural speech output.

The implementation of context based (position aware) segmentation, spectral estimation and reduction, and linguistic guidance for efficient, moderate database preparation resulted in natural sounding output for this Marathi TTS. Four audio sample paragraphs demonstrate the working of this synthesis system, and the Marathi TTS is used for a book reading application. The following section summarizes the approaches implemented for naturalness improvement in this research work.
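The basic operation described above, forming a new word by joining recorded units, reduces to reading the unit files and writing their concatenation, optionally with a short cross-fade to soften the joint. A minimal Matlab sketch is given below (the file names and the 5 ms cross-fade are illustrative assumptions; the thesis applies the smoothing techniques of chapter 8 rather than a plain cross-fade):

% Concatenate two recorded units with a short linear cross-fade (sketch).
[u1, fs] = audioread('unit1.wav');      % placeholder file names
[u2, ~ ] = audioread('unit2.wav');

nFade   = round(0.005 * fs);            % ~5 ms overlap at the joint (assumed)
fadeOut = linspace(1, 0, nFade).';      % column vectors to match the audio data
fadeIn  = linspace(0, 1, nFade).';

joint = u1(end-nFade+1:end) .* fadeOut + u2(1:nFade) .* fadeIn;
y = [u1(1:end-nFade); joint; u2(nFade+1:end)];

audiowrite('new_word.wav', y, fs);      % synthesized word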

362

Future scope, improvements and concluding remarks

Contextual analysis helps to produce more natural speech synthesis output. Different contexts present in the Marathi language are studied and, after a detailed analysis, some contexts are implemented in terms of language rules. With this analysis the most frequently used words of the language can be found and the database can be optimized. Database size reduction and improvement in the TTS output quality are the biggest advantages of contextual analysis. For the contextual analysis, CV structure breaking rules are used, which are based on an empirical study of the Marathi language.

For naturalness improvement of synthesized speech two approaches are discussed.

The first approach is position based segmentation of words into syllables. Neural, classification and non-neural techniques have been implemented for this purpose. The syllable is considered a better segmentation unit because of its durational and segmental stability. A comparative analysis shows that the classification-neural approach gives the best results for natural sounding segmentation, with a segmentation accuracy of 90%. The naturalness of the synthesized speech output depends on the breaking and joining of the synthesis units: if segmentation is carried out with fine accuracy, the synthesized speech shows less spectral distortion after concatenation.

In the second approach, spectral analysis is carried out to reduce the spectral distortion after concatenation. Different time and frequency domain methods such as PSD, DTW and wavelet analysis are tested for reducing the noise at the concatenation joint; these methods help to estimate and reduce the spectral noise. It is observed that if the syllable position is not considered while concatenating two speech synthesis units, the resulting spectral distortion is very high, whereas considering the syllable position leads to less spectral noise after concatenation. When the spectral noise present at the concatenation is further reduced by the spectral smoothing methods, the accuracy and naturalness of the synthesized speech output become very high. Thus, naturalness improvement is achieved with the help of accurate syllable segmentation and spectral mismatch reduction methods.

363

Future scope, improvements and concluding remarks

After the analysis, PSD gives the clearest graphical picture of the distortion, but wavelet analysis results in the most natural sounding speech output; the DTW graphical results are promising but the audio quality is only acceptable. Wavelet domain analysis for spectral distortion reduction gives more accurate output and a clear increase in naturalness compared with the other methods, with a percentage error improvement of about 20% for both the approximation and detail wavelet coefficients. A database of about 3000 words and syllables is used for testing the different algorithms. Purely NN based segmentation is very time consuming, hence the classification-neural combination is used for proper segmentation of the speech units. Contextual analysis of the Marathi language enables position based spectral distortion reduction: since the position of a syllable is considered while preparing the database, the proper syllable units are used when forming new words. The limited database size is the biggest advantage of this work: the most frequent existing words and their syllables are used, so the database and the memory requirement of the system remain limited.

Linguistic analysis is carried out to formulate context rules and to develop a more natural sounding TTS system. The linguist's suggestions, implemented as a couple of rules, result in an improvement in the synthesized speech output (MOS score of 3.625 on a 5 point scale). This shows that apart from core DSP techniques, linguistic study is also very important for finding and implementing different contexts as rules for database optimization and CV structure improvement.

11.2 IMPROVEMENTS AND FUTURE SCOPE:

This work can be extended further in future developments of the TTS system. Some aspects of the future scope are formulated in the following points:
1) The work can be extended with additional CV rules for language to language conversion. Once CV structures are developed for all contexts of one language, they can be modified and adapted for other languages with some mapping rules.

364

Future scope, improvements and concluding remarks

2) If the present TTS is enhanced with prosody, it will sound more natural and can carry different emotions. A TTS with emotions, better intelligibility and naturalness would help physically challenged people convey their feelings to others.
3) With the addition of emotions and musical aspects, the system can be developed for poetic text conversion. Poetic text has been part of every language's literature for a long time; if emotions and musical aspects are implemented in TTS systems, people can enjoy this capability along with plain text.
4) The system can be extended to other commercial applications such as an e-mail reader. Commercial applications and their usage are increasing with computerization, and an e-mail reader is important because most communication now happens through computers. More such commercial applications would improve convenience for common people.
5) The TTS system can be further analyzed with respect to more detailed language rules for implementing more context patterns. Linguistic analysis can yield additional language rules based on all types of contexts of a particular language, which will improve the TTS front end processing.
6) The system can be extended with more CV structure breaking rules for performance improvement. More detailed language analysis and empirical study will help to develop more CV structures for the provided scripts; these structures will improve the context handling and the overall text processing.
7) The system can be developed further for database reduction by paragraph selection. Database reduction will improve the search time of any TTS engine, and this new approach of database improvement by paragraph selection can reduce the memory requirement of TTS systems.
8) The system can be extended to find other evaluation measures for spectral mismatch using information content.
9) The system can be extended to evaluate spectral mismatch using prosodic aspects.
10) The system can be extended with further linguistic analysis for naturalness and intelligibility improvement.

365

Future scope, improvements and concluding remarks

11.3 APPLICATIONS:

Using text-to-speech for short phrases: An application should use text-to-speech only for short phrases or notifications, not for reading long passages of text, because listening to a synthesized voice for more than a few sentences requires more concentration and the user can become irritated. This is a limited domain application for short phrases with a limited database.

Applications for the blind: Probably the most important and useful application field of speech synthesis is reading and communication aids for the blind. Before synthesized speech, dedicated audio books were used in which the content of the book was read onto audio tape; making such a spoken copy of any large book takes several months and is very expensive. It is also easier to get information from a computer with speech than by using a special Bliss symbol keyboard or an interface for reading Braille characters.

Today the quality of reading machines has reached an acceptable level and prices have become affordable for individuals, so speech synthesizers are very helpful and common devices among visually impaired people. The current systems are mostly software based. Regardless of how fast reading and communication aids develop, there are always improvements to be made.

Multimedia phone application: Synthesized speech may also be used to speak out short text messages (SMS) on mobile phones. New phones such as iPhones, Android and Windows phones have multimedia text to speech software that enables the phone to read contacts, SMS and so on, but the sound quality of this software is robotic.

Telecommunication application: The PACER IVR and voice broadcasting system fully supports XML push client/server applications. Application servers can send XML protocol messages to the PACER phone system to automatically dial a number and play a recorded message. Text messages can be transmitted and converted to a voice message using this text to speech application software. These IVR and voice broadcasting systems give an organization a 24 by 7 capability, providing around-the-clock information to callers using text-to-speech software. Call centers in particular can become instantly more productive by letting the phone system provide the caller with information and by determining the best service representative to handle other requests. However, this application is very specific to telephone communication and cannot be extended to other applications.

Applications for the deafened and vocally handicapped: People who are born deaf cannot learn to speak properly, and people with hearing difficulties usually also have speaking difficulties. Synthesized speech gives the deafened and vocally handicapped an opportunity to communicate with people who do not understand sign language. With a talking head it is possible to improve the quality of the communication even more, because visual information is especially important for deaf users. A speech synthesis system may also be used for communication over the telephone line (Klatt et al. 1987). Adjustable voice characteristics are very important in order to achieve an individual sounding voice. Some tools such as HAMLET (helpful automatic machine for language and emotional talk) have been developed to help users express their feelings (Murray et al. 1991, Abedjieva et al. 1993). These systems perform well but need updating for new requirements and technologies.

Educational application: Synthesized speech can be used in many educational situations. A computer with speech synthesis can teach 24 hours a day and 365 days a year. It can be programmed for special tasks like spelling and pronunciation teaching for different languages, and it can also be used in interactive educational applications. It is almost impossible to learn writing and reading without spoken help; with proper computer software, unsupervised training for these problems is easy and inexpensive to arrange. A speech synthesizer connected to a word processor is also a helpful aid for proof reading: many users find it easier to detect grammatical and stylistic problems when listening than when reading, and ordinary misspellings are also easier to detect. Speech synthesizers need to be developed for the different languages before they can be used in such educational applications, which is a time consuming and costly affair.

367

Future scope, improvements and concluding remarks

Web reader text to speech: Web Reader HD is a web browser designed to read web pages aloud. This application is a good alternative to other applications that work only with input text. It allows the user to choose where in the web page it should start reading and can also be configured to start reading once the page is loaded. It comes with two voices (one male and one female) and includes playback speed settings. The application supports HTML, plain text, Microsoft Word and RTF files. The speed adjustment of this application is limited, which affects the naturalness of the speech output.

Other applications and future directions: In principle, speech synthesis may be used in all kinds of human-machine interaction. For example, in warning and alarm systems synthesized speech may be used to give more accurate information about the current situation; using speech instead of warning lights or buzzers makes it possible to receive the warning signal from a different room. Speech synthesis may also be used to announce desktop messages from a computer, such as printer activity or received e-mail.

In the future, if speech recognition techniques reach an adequate level, synthesized speech may also be used in language interpreters and several other communication systems, such as videophones, videoconferencing or talking mobile phones. If it is possible to recognize speech, transcribe it into an ASCII string and then re-synthesize it back to speech, a large amount of transmission capacity may be saved. Talking mobile phones can considerably increase usability, for example for visually impaired users or in situations where it is difficult or even dangerous to try to reach the visual information; it is obviously less dangerous to listen to the output of a mobile phone than to read it, for example while driving a car. During the last few decades communication aids have developed from talking calculators to modern three-dimensional audiovisual applications. The application field for speech synthesis is becoming wider all the time, which brings more funds into research and development.

368

Appendix-1

CV structure breaking rules and WAVE file details

CV structure breaking rules and WAVE file details

A.1 CV STRUCTURE BREAKING RULES

The CV structure breaking rules are the heart of the syllable based speech synthesis. These rules were developed previously, and the separation of syllables is carried out on the basis of these rules, which were derived after a detailed study of the consonants and vowels of the Marathi language. The following table A2.1 shows the more than 470 such rules developed for the Marathi language. Observing the structure of the language, it is found that each syllable contains at least one vowel, and the breaking of syllables is carried out based on this property. After developing these rules it is found that a single CV structure rule can break many words. More than 470 rules are available for the Marathi language. These rules are taken from the thesis 'Text to speech synthesis for Hindi language using hybrid syllabic approach' by Dr. J. S. Chitode, submitted to Bharati Vidyapeeth University, College of Engineering, Pune. See the following table of CV structure breaking rules; a small illustrative sketch of applying one of these rules follows the table.

Table A2.1: CV structure breaking rules Sr. CV structure CV structure after breaking No. 1. CVCVCVC CV+CV+CVC 2. CVCV CV+CV 3. CVCCVCVCCV CVC+CV+CVC+CV 4. CVCVCCVCCVCCV CV+CVC+CVC+CVC+CV 5. VCCVCVC VC+CV+CVC 6. VCV V+CV 7. CVCVCVCV CV+CV+CV+CV 8. CVCVCVCVCVC CV+CV+CV+CV+CVC 9. CVCCV CVC+CV 10. CVCVCVCCVCCVC CV+CV+CVC+CVC+CVC 11. CVCVCV CV+CV+CV 12. VCVCVCV V+CV+CV+CV 13. CVCCCVC CVC+CCVC 14. CCVCVC CCV+CVC 15. CVCVCCV CV+CVC+CV 16. CVCVC CV+CVC 17. CVCCCV CVC+CCV 18. CVCCCVCVCCV CVC+CCV+CVC+CV 19. VCCVCVCVC VC+CV+CV+CVC

i

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 20. CVCCVC CVC+CVC 21. VCVCVCVC V+CV+CV+CVC 22. CCVCCV CCVC+CV 23. CCVCVCCCCVCVCVC CCV+CVC+CCCV+CV+CVC 24. CVCVCVCVV CV+CV+CV+CV+V 25. CVCVCVCCVC CV+CV+CVC+CVC 26. VCVCCV V+CVC+CV 27. CVCVCVCVC CV+CV+CV+CVC 28. CVCCVCV CVC+CV+CV 29. CVCCVCVC CVC+CV+CVC 30. VCCV VC+CV 31. CVCVCVCVCV CV+CV+CV+CV+CV 32. VCCCV VCC+CV 33. CCVCCVCVCV CCVC+CV+CV+CV 34. VCCVCVCV VC+CV+CV+CV 35. CVVCV CV+V+CV 36. CVCVCCVCVC CV+CVC+CV+CVC 37. VCVCVCVV V+CV+CV+CV+V 38. VCCVC VC+CVC 39. CVCCVCVCV CVC+CV+CV+CV 40. CVCCCVCCVC CVC+CCVC+CVC 41. CVCCVCCVCVCVC CVC+CV+CCV+CV+CVC 42. CVCCVCCVCVC CVC+CVC+CV+CVC 43. VCVCCCVCCVC V+CVC+CCVC+CVC 44. VCCCVCCV VC+CCVC+CV 45. CVCVVCVCVC CV+CV+V+CV+CVC 46. CVCCCVCVC CVC+CCV+CVC 47. CVCCCVCVCV CVC+CCV+CV+CV 48. VCVCCVCVCVCVC V+CVC+CV+CV+CV+CVC 49. CVCCVCCVC CVC+CVC+CVC 50. CVCCCVCVCCVCVC CVC+CCV+CVC+CV+CVC 51. VCVC V+CVC 52. CCVCCVCVC CCVC+CV+CVC 53. VCVCCVC V+CVC+CVC 54. CCVCVCV CCV+CV+CV 55. CCVCV CCV+CV 56. CVCVV CV+CV+V

ii

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 57. VVC V+VC 58. CCVCVCVCVCV CCV+CV+CV+CV+CV 59. CCVCVCVCV CCV+CV+CV+CV 60. CCVCCVCCVCV CCVC+CVC+CV+CV 61. CVCCVCCV CVC+CVC+CV 62. CVCCCVCV CVC+CCV+CV 63. CCVCCVCCVC CCVC+CVC+CVC 64. CVCVCVCVCCVC CV+CV+CV+CVC+CVC 65. CVCCVCVCVCV CVC+CV+CV+CV+CV 66. CVCCVCVCVC CVC+CV+CV+CVC 67. CVCCVVCVC CVC+CV+V+CVC 68. CVCVCVCVCCV CV+CV+CV+CVC+CV 69. CVCCVCVCVCCV CVC+CV+CV+CVC+CV 70. CVCVCVCVCVCV CV+CV+CV+CV+CV+CV 71. CVCVCVCVCVCVC CV+CV+CV+CV+CV+CVC 72. VCCVCCVCVC VC+CVC+CV+CVC 73. VCCVCCV VC+CVC+CV 74. CCVCVCCV CCV+CVC+CV 75. CVCVCCVCVCCVCVC CV+CVC+CV+CVC+CV+CVC 76. CVCVCCVC CV+CVC+CVC 77. VCVCVC V+CV+CVC 78. CVCVCCVCVCVCCV CV+CVC+CV+CV+CVC+CV 79. CVCVCCVCV CV+CVC+CV+CV 80. CVVC CV+VC 81. CCCVCVC CCCV+CVC 82. CVCVCVCCV CV+CV+CVC+CV 83. CCVCCVC CCVC+CVC 84. CCVCCVCVCCVCVC CCVC+CV+CVC+CV+CVC 85. CVCVCVCCVCVC CV+CV+CVC+CV+CVC 86. CVCVCCVCCVCCVC CV+CVC+CVC+CVC+CVC 87. CCVCVCVC CCV+CV+CVC 88. VCVCVCCVC V+CV+CVC+CVC 89. CVCCVCVCVCCVC CVC+CV+CV+CVC+CVC 90. CVCCCVCVCCVCV CVC+CCV+CVC+CV+CV 91. VCVCCVCVCV V+CVC+CV+CV+CV 92. CVCVCCVCCV CV+CVC+CVC+CV 93. VCCVCVCVCVCVC VC+CV+CV+CV+CV+CVC

iii

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 94. CVCCVCVCCVC CVC+CV+CVC+CVC 95. CVCCVV CVC+CV+V 96. CVCCVCCCV CVC+CVC+CCV 97. CVCCCVCCVCV CVC+CCVC+CV+CV 98. CVCCVCVCCCCVC CVC+CV+CVC+CCCVC 99. CCVCVCCVC CCV+CVC+CVC 100. CVV CV+V 101. CVCVCVCCVCV CV+CV+CVC+CV+CV 102. CVCCVCVVCVCV CVC+CV+CV+V+CV+CV 103. VCVCCCV V+CVC+CCV 104. CVCCVCVCVCVCVCVC CVC+CV+CV+CV+CV+CV+CVC 105. VCVCVCCV V+CV+CVC+CV 106. VCCVCVCVCVCCCV VC+CV+CV+CV+CVC+CCV 107. CCVCCVCCV CCVC+CVC+CV 108. CCVCCCVC CCVC+CCVC 109. VCVCVCVCVCVCCVCVC V+CV+CV+CV+CV+CVC+CV+CVC 110. VCVCVCVCVCVCCVC V+CV+CV+CV+CV+CVC+CVC 111. CVCVCCCV CV+CVC+CCV 112. CVCVCVCCVCCVCCV CV+CV+CVC+CVC+CVC+CV 113. CVVCCVCVCV CV+VC+CV+CV+CV 114. CVCCVCVCCVCVCVCVC CVC+CV+CVC+CV+CV+CV+CVC 115. CCVCCVCV CCVC+CV+CV 116. CVCVCCVCCVC CV+CVC+CVC+CVC 117. CVCCCVCVCVC CVC+CCV+CV+CVC 118. CVCVCVCCCV CV+CV+CVC+CCV 119. CCVVCV CCV+V+CV 120. CVCCVCVCVCCVCVC CVC+CV+CV+CVC+CV+CVC 121. CVCVCVCVCVCVCV CV+CV+CV+CV+CV+CV+CV 122. VCCVCV VC+CV+CV 123. CVCCCVCVCVCVCVC CVC+CCV+CV+CV+CV+CVC 124. CVCVCCCVCV CV+CVC+CCV+CV 125. VCVCVCVVC V+CV+CV+CV+VC 126. VCVCCVCCV V+CVC+CVC+CV 127. CVCCVCVCVCVCV CVC+CV+CV+CV+CV+CV 128. VCCVCVCVCVC VC+CV+CV+CV+CVC 129. VCCVCVCVCV VC+CV+CV+CV+CV 130. VCVCCCVCCV V+CVC+CCVC+CV

iv

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 131. VCVCCVCV V+CVC+CV+CV 132. CVVCVVC CV+V+CV+VC 133. CVCCVCVCVCVCCV CVC+CV+CV+CV+CVC+CV 134. CVCVCCVVCVCVC CV+CVC+CV+V+CV+CVC 135. CVCCVCVCCCV CVC+CV+CVC+CCV 136. CVCCCCVCVC CVC+CCCV+CVC 137. CVVCVCVC CV+V+CV+CVC 138. CVCCVCVCVCVC CVC+CV+CV+CV+CVC 139. CVCCVCCVCV CVC+CVC+CV+CV 140. VCCVCVCCVCV VC+CV+CVC+CV+CV 141. CVCVVC CV+CV+VC 142. CVCCCVVC CVC+CCV+VC 143. CCVCVCVCCV CCV+CV+CVC+CV 144. CCVCVCCCCV CCV+CVC+CCCV 145. VCVCVCCVCCVC V+CV+CVC+CVC+CVC 146. VCVCVCVCVC V+CV+CV+CV+CVC 147. VCVCCVCVCVCV V+CVC+CV+CV+CV+CV 148. CVCVCCCVC CV+CVC+CCVC 149. VCCVCCVC VC+CVC+CVC 150. CVVCCVCV CV+VC+CV+CV 151. CVCVVCCCVC CV+CV+VC+CCVC 152. CVVCVC CV+V+CVC 153. VCVCVV V+CV+CV+V 154. CVCVCCVCVCVC CV+CVC+CV+CV+CVC 155. CVCVCVVC CV+CV+CV+VC 156. CCVCVCVCVC CCV+CV+CV+CVC 157. VCCCVCVCV VC+CCV+CV+CV 158. CVCVCVCCVCCV CV+CV+CVC+CVC+CV 159. CVCVVCVV CV+CV+V+CV+V 160. CVCCVCCCVC CVC+CVC+CCVC 161. CVCVCCVCVCCCV CV+CVC+CV+CVC+CCV 162. CVCVCVCVCVCCVCV CV+CV+CV+CV+CVC+CV+CV 163. CVCVCVCCVCVCV CV+CV+CVC+CV+CV+CV 164. VCCVCCVCVCVC VC+CVC+CV+CV+CVC 165. VCCVCVCVCVCV VC+CV+CV+CV+CV+CV 166. VCVCCVCCVCCVC V+CVC+CVC+CVC+CVC 167. CVCVVCVCCVCCV CV+CV+V+CVC+CVC+CV

v

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 168. CVCCCCVC CVC+CCCVC 169. CCVVC CCV+VC 170. CVVCVCCVC CV+V+CVC+CVC 171. CVCVCCVCVCV CV+CVC+CV+CV+CV 172. CVCCVCVCVCVCCCV CVC+CV+CV+CV+CVC+CCV 173. CVCVCVCVCVCCCV CV+CV+CV+CV+CVC+CCV 174. CVCVCVCVCVCCV CV+CV+CV+CV+CVC+CV 175. VCCVCVCCVCVC VC+CV+CVC+CV+CVC 176. CVCVCCVCCVCVC CV+CVC+CVC+CV+CVC 177. VCCVCCVCCV VC+CVC+CVC+CV 178. CVCCVCCCVCV CVC+CVC+CCV+CV 179. CVCCCVCVCVCVC CVC+CCV+CV+CV+CVC 180. CCVCVCCVCVC CCV+CVC+CV+CVC 181. CVCVVCVC CV+CV+V+CVC 182. CVCVCVV CV+CV+CVV 183. CCVCVCCVCV CCV+CVC+CV+CV 184. VCVVCVCV V+CV+V+CV+CV 185. CVCCVCVCCCVC CVC+CV+CVC+CCVC 186. VCVCCVCVC V+CVC+CV+CVC 187. CVCVVCV CV+CV+V+CV 188. VCVCVCVCVCCV V+CV+CV+CV+CVC+CV 189. VCCVCVCVCVCVCV VC+CV+CV+CV+CV+CV+CV 190. CCVCVCVCCCV CCV+CV+CVC+CCV 191. CVCCVCVCVCVCCVV CVC+CV+CV+CV+CVC+CVV 192. VCVCVCVCVCVC V+CV+CV+CV+CV+CVC 193. VCCVCVCCV VC+CV+CVC+CV 194. VCVCVCVCCVCV V+CV+CV+CVC+CV+CV 195. CCVCVCVCCVC CCV+CV+CVC+CVC 196. VCCVCVCVCCV VC+CV+CV+CVC+CV 197. VCCVCCCVCV VC+CVC+CCV+CV 198. CVVCVCV CV+V+CV+CV 199. VCVCVCCCV V+CV+CVC+CCV 200. CVCVVCVCVCV CV+CV+V+CV+CV+CV 201. CCVCVCVCCCVCVCV CCV+CV+CVC+CCV+CV+CV 202. VCVCVVCVCV V+CV+CV+V+CV+CV 203. CVCCVCVCCCCVCV CVC+CV+CVC+CCCV+CV 204. CVCCVCVCVCCVCCV CVC+CV+CV+CVC+CVC+CV

vi

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 205. VCCVCVCCVC VC+CV+CVC+CVC 206. CVCVCCCVCVC CV+CVC+CCV+CVC 207. VCVCVCVCCV V+CV+CV+CVC+CV 208. VCCVCVCVCVCCV VC+CV+CV+CV+CVC+CV 209. CVCVCVCVCCCV CV+CV+CV+CVC+CCV 210. CVVCVCVCVC CV+V+CV+CV+CVC 211. CVCCVCVCCVCV CVC+CV+CVC+CV+CV 212. CVCCCVCCV CVC+CCV+CCV 213. VCVCVCVCVCV V+CV+CV+CV+CV+CV 214. VCCCCVC VC+CCCVC 215. CVCVCVCVCVCCVC CV+CV+CV+CV+CVC+CVC 216. CVCCCVCVCCVC CVC+CCV+CVC+CVC 217. CVCCVCVCVCVCVC CVC+CV+CV+CV+CV+CVC 218. CVCCVCCVCVCCV CVC+CVC+CV+CVC+CV 219. CVCCVVC CVC+CV+VC 220. VCVCVCCVCV V+CV+CVC+CV+CV 221. VCCVCCCVC VC+CVC+CCVC 222. VCVCVCCVCCV V+CV+CVC+CVC+CV 223. VCVCVCCVCVCCCCVC V+CV+CVC+CV+CVC+CCCVC 224. VCCVCVCVCCCV VC+CV+CV+CVC+CCV 225. VCVCVCCVCVCCCV V+CV+CVC+CV+CVC+CCV 226. CVCVCCVCVCVCVC CV+CVC+CV+CV+CV+CVC 227. VCCVCCVCVCCV VC+CVC+CV+CVC+CV 228. CCCVCV CCCV+CV 229. CVCVCCVCVCVCCVCV CV+CVC+CV+CV+CVC+CV+CV 230. CVCVCVCCVCCVCV CV+CV+CVC+CVC+CV+CV 231. CVCVCCVCVCCVC CV+CVC+CV+CVC+CVC 232. VCCVCVCCVCCV VC+CV+CVC+CVC+CV 233. CVCVCVCCCVCVCCCCVC CV+CV+CVC+CCV+CVC+CCCVC 234. CCVCVCVCVCVC CCV+CV+CV+CV+CVC 235. CCVCCVCVCCVCV CCVC+CV+CVC+CV+CV 236. CCVCCVCVCCCV CCVC+CV+CVC+CCV 237. VCCVCVCCVCCVC VC+CV+CVC+CVC+CVC 238. CCVCCCVCVCVCVC CCVC+CCV+CV+CV+CVC 239. VCVCVCVCV V+CV+CV+CV+CV 240. CCVCVCVCVCCV CCV+CV+CV+CVC+CV 241. CVCVCCVCVCCV CV+CVC+CV+CVC+CV

vii

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 242. CCVCVCCCV CCV+CVC+CCV 243. CVCVCCVCCVCVCCV CV+CVC+CVC+CV+CVC+CV 244. CVCVCVCCCVVCVCVC CV+CV+CVC+CCV+V+CV+CVC 245. VCCVCCVCVCVCVC VC+CVC+CV+CV+CV+CVC 246. CVCCCVCCCV CVC+CCVC+CCV 247. CVVCVCCVCVCVCCCV CV+V+CVC+CV+CV+CVC+CCV 248. CCVCVCVCCCVC CCV+CV+CVC+CCVC 249. CCVCCVCVCCV CCVC+CV+CVC+CV 250. VCCVCVCCCV VC+CV+CVC+CCV 251. CCVCVCCVCCV CCV+CVC+CVC+CV 252. CVCCVCVCVCVCCVC CVC+CV+CV+CV+CVC+CVC 253. CCVCVCCVCCVC CCV+CVC+CVC+CVC 254. VCVVCVVC V+CV+V+CV+VC 255. CVCCVCVCVCCVCCVCVCV CVC+CV+CV+CVC+CVC+CV+CV+CV 256. CVCVCCVCCVCVCVC CV+CVC+CVC+CV+CV+CVC 257. VCVCVCVCCVCVCV V+CV+CV+CVC+CV+CV+CV 258. CVCCVCCVCVCCVCVCVCVCV CVC+CVC+CV+CVC+CV+CV+CV+CV+CV+C CVC VC 259. CCVCVCCVCVCCVCVC CCV+CVC+CV+CVC+CV+CVC 260. CVCVCCVCVCVCV CV+CVC+CV+CV+CV+CV 261. CVCCCVCCVCVC CVC+CCVC+CV+CVC 262. CCVCVCVCVCCVCVC CCV+CV+CV+CVC+CV+CVC 263. CVCVCVCVVCVCV CV+CV+CV+CV+V+CV+CV 264. VCVCCVCVCCVC V+CVC+CV+CVC+CVC 265. CCVCVCVCVCCVCCV CCV+CV+CV+CVC+CVC+CV 266. CVCCVCCVCCV CVC+CVC+CVC+CV 267. CVCVCVCCVV CV+CV+CVC+CVV 268. CVCCCVCVCVCV CVC+CCV+CV+CV+CV 269. CVCVVCVCV CV+CV+V+CV+CV 270. CVCVCVCCCVCVCCCV CV+CV+CVC+CCV+CVC+CCV 271. VCVCVCCVCVCVCVC V+CV+CVC+CV+CV+CV+CVC 272. CVCVVCCVCV CV+CV+V+CCV+CV 273. VCVCCVCVCVC V+CVC+CV+CV+CVC 274. CCVCVCVCCCVCVCVCV CCV+CV+CVC+CCV+CV+CV+CV 275. VCVCVCVCCVCCCVCVCVC V+CV+CV+CVC+CVC+CCV+CV+CVC 276. VCVCVCVCCVC V+CV+CV+CVC+CVC 277. VCVCVCCVCCVCV V+CV+CVC+CVC+CV+CV

viii

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 278. VCCVCVCCCVCVCCV VC+CV+CVC+CCV+CVC+CV 279. CCVCVCCCVCVCVC CCV+CVC+CCV+CV+CVC 280. CVVCVCCVCV CV+V+CVC+CV+CV 281. CVCVCCCCVC CV+CVC+CCCVC 282. VCVVCVVCVCV V+CV+V+CV+V+CV+CV 283. CVCVCVCVCVCVCVC CV+CV+CV+CV+CV+CV+CVC 284. VCVCCVCVCVCVCVCVCV V+CVC+CV+CV+CV+CV+CV+CV+CV 285. VCCCVCVCCV VC+CCV+CVC+CV 286. CVCCVCVCVCVCVCV CVC+CV+CV+CV+CV+CV+CV 287. VCCCVCVCCCVCVC VC+CCV+CVC+CCV+CVC 288. CVCCVCCVCVCVCCVC CVC+CVC+CV+CV+CVC+CVC 289. CVCVCCVVC CV+CVC+CV+VC 290. CVCCVCC CVC+CVCC 291. CVCVVCCV CV+CV+V+CCV 292. CCVCVCVCCCVCVCVC CCV+CV+CVC+CCV+CV+CVC 293. CCVCVCVCCVCV CCV+CV+CVC+CV+CV 294. VCCVCCVCV VC+CVC+CV+CV 295. VCCCVC VC+CCVC 296. VCVCCVCCVCCV V+CVC+CVC+CVC+CV 297. CVCVCCVCVCVCCVC CV+CVC+CV+CV+CVC+CVC 298. CVCVCCCVCVCCV CV+CVC+CCV+CVC+CV 299. CVCCVCVCCVCCVC CVC+CV+CVC+CVC+CVC 300. VCCCVCVCVC VC+CCV+CV+CVC 301. CVCVCVCCVVCVCVCV CV+CV+CVC+CV+V+CV+CV+CV 302. VCVVCCV V+CV+V+CCV 303. CVCVCVCVCCVCVCV CV+CV+CV+CVC+CV+CV+CV 304. CVCVCVVCVCV CV+CV+CV+V+CV+CV 305. CVCCCCVCV CVC+CCCV+CV 306. CCVCCVCVCVC CCVC+CV+CV+CVC 307. CVCCVVCV CVC+CV+V+CV 308. CVCVCCVVCV CV+CVC+CV+V+CV 309. CVCCVCVCCCVCVCVC CVC+CV+CVC+CCV+CV+CVC 310. CVCVCVCVCVCVCVCV CV+CV+CV+CV+CV+CV+CV+CV 311. CCVCCCV CCVC+CCV 312. CVCCVCVCVCVCVCCCV CVC+CV+CV+CV+CV+CVC+CCV 313. CCVCVCVCVCVCV CCV+CV+CV+CV+CV+CV 314. VCCVCCCVCCV VC+CVC+CCV+CCV

ix

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 315. CVCVCVCVCCCVCVC CV+CV+CV+CVC+CCV+CVC 316. VCVCVCVCVCVCCV V+CV+CV+CV+CV+CVC+CV 317. CVCCVCVCVCCVCV CVC+CV+CV+CVC+CV+CV 318. CVCCVCCVCVCV CVC+CVC+CV+CV+CV 319. CCVCVCCCVC CCV+CVC+CCVC 320. CVCCVCVCVCCCV CVC+CV+CV+CVC+CCV 321. CVCCVCVCCCVCVCV CVC+CV+CVC+CCV+CV+CV 322. CVCVCVCCCVC CV+CV+CVC+CCVC 323. CVCCVCCVCVCVCVCVC CVC+CVC+CV+CV+CV+CV+CVC 324. VCCCVCCVCCV VC+CCVC+CVC+CV 325. CCVCVCVCCVCCV CCV+CV+CVC+CVC+CV 326. VCCCVCVCCVC VC+CCV+CVC+CVC 327. CVCVCVCCVCVCCV CV+CV+CVC+CV+CVC+CV 328. CCVCVCCVCVCVCVCVC CCV+CVC+CV+CV+CV+CV+CVC 329. CVCCVCVCCVCVC CVC+CV+CVC+CV+CVC 330. VCCVCVCVCVCVCVC VC+CV+CV+CV+CV+CV+CVC 331. VCCVCCVCVCV VC+CVC+CV+CV+CV 332. CVCCCCV CVC+CCCV 333. VCVCVVC V+CV+CV+VC 334. VCVCCVCCVC V+CVC+CVC+CVC 335. VCVCVCVCCVCVC V+CV+CV+CVC+CV+CVC 336. VCVCCCVC V+CVC+CCVC 337. VCCCVCV VC+CCV+CV 338. VCVCCVCVCVCCV V+CVC+CV+CV+CVC+CV 339. CCVCCVCCVCVC CCVC+CVC+CV+CVC 340. CVCVCCVCVCVCVCVCVC CV+CVC+CV+CV+CV+CV+CV+CVC 341. CVCCVCVCCVCVCCV CVC+CV+CVC+CV+CVC+CV 342. VCVCVVCVC V+CV+CV+V+CVC 343. VCVCVVCCV V+CV+CV+V+CCV 344. VCVCVCVCVCVCV V+CV+CV+CV+CV+CV+CV 345. CVCVVCCVC CV+CV+V+CCVC 346. VCVCVCCVCVCCCVCVCV V+CV+CVC+CV+CVC+CCV+CV+CV 347. VCVCVCCVCVCCCVCVCVC V+CV+CVC+CV+CVC+CCV+CV+CVC 348. VVCCCVC VVC+CCVC 349. CVCVCVCVCVCVCVCCVCVC CV+CV+CV+CV+CV+CV+CVC+CV+CVC 350. CVCVCVCVCVCVCCV CV+CV+CV+CV+CV+CVC+CV 351. VCCCCVCV VC+CCCV+CV

x

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 352. CVCVCVCVCVCVCCVC CV+CV+CV+CV+CV+CVC+CVC 353. VCVCVCVCVCCCVC V+CV+CV+CV+CVC+CCVC 354. CCVCVCVCCVCVCV CCV+CV+CVC+CV+CV+CV 355. CCVCVCCVCVCVC CCV+CVC+CV+CV+CVC 356. CCVCVCVCVCVCCV CCV+CV+CV+CV+CVC+CV 357. CCVCVV CCV+CVV 358. VCVCVCVCVCVCCCVC V+CV+CV+CV+CV+CVC+CCVC 359. CVCVCCVV CV+CVC+CVV 360. CVCVCVVCV CV+CV+CV+V+CV 361. CVVCCVC CV+VC+CVC 362. CCVCCVCVCCVC CCVC+CV+CVC+CVC 363. CVVCCV CV+VC+CV 364. VCVCCVCCVCVC V+CVC+CVC+CV+CVC 365. CVCVCCVCCVCVCCVC CV+CVC+CVC+CV+CVC+CVC 366. CVCVVCVCCVCVCVC CV+CV+V+CVC+CV+CV+CVC 367. CCVCVCCCCVCVCVCVC CCV+CVC+CCCV+CV+CV+CVC 368. CVCCVCCVCVCCVCV CVC+CVC+CV+CVC+CV+CV 369. CVCCVCCVCVCVCVC CVC+CVC+CV+CV+CV+CVC 370. CVCVCVCCVCVCVC CV+CV+CVC+CV+CV+CVC 371. CVCVCVCVCCVCVC CV+CV+CV+CVC+CV+CVC 372. VCVCVCVCCCVCVC V+CV+CV+CVC+CCV+CVC 373. VCCVCVCCCVCVC VC+CV+CVC+CCV+CVC 374. VCVCVCCCVCV V+CV+CVC+CCV+CV 375. CVCVCCVCVCCVCVCVC CV+CVC+CV+CVC+CV+CV+CVC 376. VCCVCVCVCCVC VC+CV+CV+CVC+CVC 377. CVCCVCCVCCVCV CVC+CVC+CVC+CV+CV 378. CVCVCCVCVCVCVCVC CV+CVC+CV+CV+CV+CV+CVC 379. CVCVCVCCVCCVCVC CV+CV+CVC+CVC+CV+CVC 380. CVCCVCCVVC CVC+CVC+CV+VC 381. CCVCVCCCCVCVC CCV+CVC+CCCV+CVC 382. CVCCVCVCCCVCVC CVC+CV+CVC+CCV+CVC 383. CVCCCVCCCVC CVC+CCVC+CCVC 384. CVCCVCCVCVCVCCV CVC+CVC+CV+CV+CVC+CV 385. CVCVCVCVCCVCVCVC CV+CV+CV+CVC+CV+CVC 386. VCVCVCCVCCVCVC V+CV+CVC+CVC+CV+CVC 387. CVVCVCVCV CV+V+CV+CV+CV 388. CVCVCVCVCVCVCVV CV+CV+CV+CV+CV+CV+CVV

xi

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 389. CVCVVCVCVCVC CV+CV+V+CV+CV+CVC 390. CVCVCVCC CV+CV+CVCC 391. CVVCVCVCVCVC CV+V+CV+CV+CV+CVC 392. CVCVCVCVVCVC CV+CV+CV+CV+V+CVC 393. CCVCVCVCCVCVCCCV CCV+CV+CVC+CV+CVC+CCV 394. CVCCVCCCVCVCVC CVC+CVC+CCV+CV+CVC 395. CVCVCVCCVCCVCVCCCV CV+CV+CVC+CVC+CV+CVC+CCV 396. CCVCVCCVCCCV CCV+CVC+CVC+CCV 397. VCVCCVCCVCVCV V+CVC+CVC+CV+CV+CV 398. VCCCVVCV VC+CCV+V+CV 399. CVVCVCVVCV CV+V+CV+CV+V+CV 400. CVCCVCVCCVCCV CVC+CV+CVC+CVC+CV 401. CVCVCVCVCCVCV CV+CV+CV+CVC+CV+CV 402. VCVCVCVVCV V+CV+CV+CV+V+CV 403. CVCCVCVCCVCVCVC CVC+CV+CVC+CV+CV+CVC 404. CVCVCCVCC CV+CVC+CVCC 405. CVCCVCCCVCCV CVC+CVC+CCV+CCV 406. CVCCVCVV CVC+CV+CVV 407. CCVCVCCCCVCV CCV+CVC+CCCV+CV 408. CVCVCCVCVCCVCV CV+CVC+CV+CVC+CV+CV 409. CVCCCCVCCVCCV CVC+CCCVC+CVC+CV 410. CCVCCVCVCVCVC CCVC+CV+CV+CV+CVC 411. CCVCCVCVCVCV CCVC+CV+CV+CV+CV 412. CVCVCCCVCVCVCCV CV+CVC+CCV+CV+CVC+CV 413. CCVCCVCVCCCVC CCVC+CV+CVC+CCVC 414. CVCVCVCCVCVCVCV CV+CV+CVC+CV+CV+CV+CV 415. CCVCVCVCVCVCVC CCV+CV+CV+CV+CV+CVC 416. VCVCCCVCVCVCCV V+CVC+CCV+CVC+CV 417. VCCVCVCCVCVCV VC+CV+CVC+CV+CV+CV 418. CVCCVCCVCVCVCVCVCV CVC+CVC+CV+CV+CV+CV+CV+CV 419. CVCVCVCVCVCCVCVC CV+CV+CV+CV+CVC+CV+CVC 420. CVCVCVVCVC CV+CV+CV+V+CVC 421. CVCVCVCVCCCVC CV+CV+CV+CVC+CCVC 422. CCVCVCCVCVCCV CCV+CVC+CV+CVC+CV 423. CVCVCVCVCVVCVCVC CV+CV+CV+CV+CV+V+CV+CVC 424. CVCCVCVCVCCVCCVC CVC+CV+CV+CVC+CVC+CVC 425. VCVCVCCVCVC V+CV+CVC+CV+CVC

xii

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 426. CVCCCVCVCVCVCV CVC+CCV+CV+CV+CV+CV 427. CVCVCVCCVCVCVCCVCCVC CV+CV+CVC+CV+CV+CVC+CVC+CVC 428. VCCVCVVC VC+CV+CV+VC 429. CVVCCVCVCVC CV+V+CCV+CV+CVC 430. VCVVCVC V+CV+V+CVC 431. VCVVCVCCVCCVCCVC V+CV+V+CVC+CVC+CVC+CVC 432. VCVCCVCVCCCV V+CVC+CV+CVC+CCV 433. CCVCCVCVCVCVCV CCVC+CV+CV+CV+CV+CV 434. CCVCCVCCVCCV CCVC+CVC+CVC+CV 435. CVCVCVCVCVCVCVCVC CV+CV+CV+CV+CV+CV+CV+CVC 436. CVCCVCCVCCVC CVC+CVC+CVC+CVC 437. VCCVCVCCCCVC VC+CV+CVC+CCCVC 438. CVCCVVCVCCVC CVC+CV+V+CVC+CVC 439. CVCVCVCCVCVCVCVC CV+CV+CVC+CV+CV+CV+CVC 440. VCVCCCVCVC V+CVC+CCV+CVC 441. CVCVCVCCVCVCVCCV CV+CV+CVC+CV+CV+CVC+CV 442. CCVCCVCVCVCVCCV CCVC+CV+CV+CV+CVC+CV 443. CVCVCVCCCVCVCV CV+CV+CVC+CCV+CV+CV 444. CVCCVCVCVCVCVCCV CVC+CV+CV+CV+CV+CVC+CV 445. CCVCCCVCCV CCVC+CCVC+CV 446. VCCVCVV VC+CV+CVV 447. CVCCCVCVCVCCVCV CVC+CCV+CV+CVC+CV+CV 448. VCVCVCCCVCCV V+CV+CVC+CCVC+CV 449. VCVCVCCCVCCVC V+CV+CVC+CCVC+CVC 450. CVCVCCVCVVCCV CV+CVC+CV+CV+V+CCV 451. VCCVCCCV VC+CVC+CCV 452. CVCVCVCVCVCVCVCCCV CV+CV+CV+CV+CV+CV+CVC+CCV 453. CCCVCCV CCCVC+CV 454. CVCVCCVCVCCVCVCVCVC CV+CVC+CV+CVC+CV+CV+CV+CVC 455. CVCVVCVCCV CV+CV+V+CVC+CV 456. CVCCVCVCCCCVCCVC CVC+CV+CVC+CCCVC+CVC 457. CCVCVCVCCCCVC CCV+CV+CVC+CCCVC 458. CVCCVCCVCCVCVC CVC+CVC+CVC+CV+CVC 459. VCVCVCVCCVCVCVC V+CV+CV+CVC+CV+CV+CVC 460. CVCVCVCCCVCCVC CV+CV+CVC+CCVC+CVC 461. CVCCCVCVCCVCCVC CVC+CCV+CVC+CVC+CVC 462. VCCCVCCVCVCVC VC+CCVC+CV+CV+CVC

xiii

CV structure breaking rules and WAVE file details

Sr. CV structure CV structure after breaking No. 463. VCCCVVC VC+CCV+VC 464. CVVCVCVCCVC CV+V+CV+CVC+CVC 465. VCVCVVCV V+CV+CV+V+CV 466. CCVVCVC CCV+V+CVC 467. VCCCVCCVCVCVCV VC+CCVC+CV+CV+CV+CV 468. CVCCVCVCCCCV CVC+CV+CVC+CCCV 469. CVCVCVCVCVCCVCVCVC CV+CV+CV+CV+CVC+CV+CV+CVC 470. VCVCV V+CV+CV 471. CVVCVV CV+V+CV+V
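To make the use of these rules concrete, a minimal Matlab sketch is given below: the CV pattern of a word is derived by mapping each phone to C or V, the matching rule is looked up, and the word is split at the '+' positions of the rule. The phone list, vowel set and example word are illustrative placeholders; the actual system works on Devanagari text through the C-VB front end. The two rules shown are copied from Table A2.1.

% Apply a CV structure breaking rule to a phone sequence (illustrative).
phones  = {'n','a','m','a','s','k','a','r'};          % example word (placeholder)
vowels  = {'a','aa','i','ii','u','uu','e','ai','o','au'};

isV     = ismember(phones, vowels);
pattern = repmat('C', 1, numel(phones));
pattern(isV) = 'V';                                   % gives 'CVCVCCVC'

% Rule table (two entries copied from Table A2.1):
rules = containers.Map( ...
    {'CVCVCCVC', 'CVCVCVC'}, ...
    {'CV+CVC+CVC', 'CV+CV+CVC'});

broken  = rules(pattern);                             % 'CV+CVC+CVC'
sylLens = cellfun(@numel, strsplit(broken, '+'));     % [2 3 3]

idx = [0 cumsum(sylLens)];
for k = 1:numel(sylLens)
    syllable = [phones{idx(k)+1 : idx(k+1)}];         % 'na', 'mas', 'kar'
    fprintf('%s\n', syllable);
end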

A.2 TECHNICAL INTERPRETATION OF SPEECH:

In order to develop a text to speech conversion application or any other application related to speech, it is required to know how the computer deals with speech. This section tries to explain various technical aspects of speech.

Study of WAV file format: Before working with WAV files, one should know the details of the WAV file format, including the procedure to open and read WAV files.

- WAV (or WAVE), short for waveform audio format, is a Microsoft and IBM audio file format standard for storing audio on Windows PCs. It is a variant of the RIFF bit stream format, which stores data in "chunks".

- The WAV file format is a subset of Microsoft's RIFF specification for the storage of multimedia files.

- RIFF stands for resource interchange file format.

- WAV, like any other uncompressed format, encodes all sounds, whether they are complex sounds or absolute silence, with the same number of bits per unit of time. WAV files can also be encoded with a variety of codecs to reduce the file size (for example the GSM or MP3 codecs).

- The default byte ordering assumed for WAV data files is little-endian. Files written using the big-endian byte ordering scheme have the identifier RIFX instead of RIFF. 8-bit samples are stored as unsigned bytes, ranging from 0 to 255. 16-bit samples are stored as 2's-complement signed integers, ranging from -32768 to 32767.

- Though a WAV file can hold compressed audio, the most common WAV format contains uncompressed audio in the pulse code modulation (PCM) format. PCM audio is the standard audio file format for CDs, containing two channels of 44,100 samples per second, 16 bits per sample. Since PCM uses an uncompressed, lossless storage method, which keeps all the samples of an audio track, professional users or audio experts may use the WAV format for maximum audio quality. WAV audio can also be edited and manipulated with relative ease using software.

- Uncompressed WAV files are quite large in size (around 10 MB per minute of music) but relatively "pure", i.e. a lossless file type, suitable for retaining "first generation" archived files of high quality or for systems where high fidelity sound is required and disk space is not restricted. More frequently, the smaller file sizes of compressed but lossy formats such as MP3, ATRAC, AAC, (Ogg) Vorbis and WMA are used to store and transfer audio. Their small file sizes allow faster internet transmission, as well as lower consumption of space on the storage media.

- There are also more efficient lossless codecs available, such as FLAC, Shorten, Monkey's Audio, ATRAC Advanced Lossless, Apple Lossless, WMA Lossless, TTA and WavPack, but none of these is yet a ubiquitous standard for both professional and home audio.

- In spite of their large size, uncompressed WAV files (though that format can differ from the Microsoft WAV) are sometimes used by a few radio broadcasters, especially those that adopted the tapeless system "D-Cart", which was developed by the Australian broadcaster ABC. Non-compressed formats are used in this application to preserve sound quality and have become more economical as the cost of data storage has dropped. In the D-Cart system, the sampling rate of WAV files is usually 48 kHz, 16 bit stereo, which is identical to that of digital audio tape.

A WAV file is often just a RIFF file with a single "WAVE" chunk, which consists of two sub-chunks: a "fmt" chunk specifying the data format and a "data" chunk containing the actual sample data. This form of RIFF structure is called the "canonical form". There may be additional sub-chunks in a WAV data stream; if so, each will have a char[4] SubChunkID, an unsigned long SubChunkSize, and SubChunkSize bytes of data. This format is shown in figure A.1 below:

File offset   Field name        Field size (bytes)   Byte order

The "RIFF" chunk descriptor (the format of concern here is "WAVE", which requires two sub-chunks, "fmt" and "data"):
0             ChunkID           4                    big      ("RIFF")
4             ChunkSize         4                    little
8             Format            4                    big      ("WAVE")

The "fmt" sub-chunk (describes the format of the sound information in the data sub-chunk):
12            Subchunk1ID       4                    big      ("fmt")
16            Subchunk1Size     4                    little
20            AudioFormat       2                    little
22            NumChannels       2                    little
24            SampleRate        4                    little
28            ByteRate          4                    little
32            BlockAlign        2                    little
34            BitsPerSample     2                    little

The "data" sub-chunk (indicates the size of the sound information and contains the raw sound data):
36            Subchunk2ID       4                    big      ("data")
40            Subchunk2Size     4                    little
44            Data              Subchunk2Size        little

Fig. A.1: WAV file format

Opening of a sound file: The Microsoft WAV file format has to be studied in detail in order to open a WAV file and extract the data chunk. The WAV file format has no compression, hence raw data can be obtained from it and analyzed further. The WAV file stores data as hex data: it starts with a header followed by the data. The header begins with 'RIFF' (resource interchange file format) as the first 4 bytes, followed by all the relevant information about the file, such as the type of recording (mono or stereo), the number of samples per second (sampling frequency), the length of the data, the number of bits per sample, the number of bytes per sample and so on. The header is followed by the letters 'd a t a', after which the actual data starts.
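A minimal Matlab sketch of opening a canonical PCM WAV file and reading the header fields described above with low-level I/O is given below (in practice audioread or audioinfo do this directly; the sketch assumes the canonical layout of figure A.1 with no extra sub-chunks before the data chunk, and an 8-bit file as used in this work):

% Read the canonical WAV header fields (assumes the layout of Fig. A.1,
% i.e. a plain PCM file with no extra sub-chunks before 'data').
fid = fopen('sample.wav', 'r', 'ieee-le');        % WAV data is little-endian

chunkID       = fread(fid, 4, '*char')';          % 'RIFF'
chunkSize     = fread(fid, 1, 'uint32');
formatTag     = fread(fid, 4, '*char')';          % 'WAVE'

subchunk1ID   = fread(fid, 4, '*char')';          % 'fmt '
subchunk1Size = fread(fid, 1, 'uint32');
audioFormat   = fread(fid, 1, 'uint16');          % 1 = PCM
numChannels   = fread(fid, 1, 'uint16');          % 1 = mono
sampleRate    = fread(fid, 1, 'uint32');          % e.g. 11025
byteRate      = fread(fid, 1, 'uint32');
blockAlign    = fread(fid, 1, 'uint16');
bitsPerSample = fread(fid, 1, 'uint16');          % e.g. 8

subchunk2ID   = fread(fid, 4, '*char')';          % 'data'
subchunk2Size = fread(fid, 1, 'uint32');
data          = fread(fid, subchunk2Size, 'uint8');  % raw 8-bit samples (0..255)
fclose(fid);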

Recording of WAV files: The WAV file format is used for recording the sound files. WAV files are probably the simplest of the commonly used formats for storing audio samples. Unlike MPEG and other compressed formats, WAV files store the samples as raw data, so no pre-processing is required other than formatting of the data, as the sketch below illustrates.
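The following minimal sketch, using Python's standard wave module, writes raw 8-bit mono samples straight into a WAV file with the recording parameters described below (taken here as 11,025 Hz, the usual nominal "11 kHz" rate). The sine-tone content and the file name are assumptions used only for illustration.

    import math
    import wave

    SAMPLE_RATE = 11025        # assumed nominal "11 kHz" sampling rate
    FREQ_HZ = 440.0            # illustrative test tone

    # 8-bit PCM samples are unsigned bytes centred on 128.
    samples = bytes(
        int(128 + 100 * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE)        # one second of audio
    )

    with wave.open("tone.wav", "wb") as wf:
        wf.setnchannels(1)                 # mono
        wf.setsampwidth(1)                 # 1 byte per sample -> 8-bit
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(samples)            # the raw samples are written as-is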

The recorded WAV files have the following properties:

Fig. A.2: Format used for recording sound file.

The format used for recording is '11 kHz, 8-bit, mono', as shown in figure A.2. The maximum bandwidth that human speech attains is about 3.5 kHz; hence the 11 kHz format is sufficient for the analysis of speech. This follows from the Nyquist criterion, which states that the maximum bandwidth a sampled signal can represent is half the sampling frequency. Since the bandwidth of speech is at most about 3.5 kHz, as stated above, any sampling frequency above 7 kHz is sufficient to record speech data. The file attributes are likewise 8 bits per sample and a single (mono) channel. An 8-bit sample provides 256 levels (0 to 255), which is sufficient for the analysis of speech data, and stereo recording is not needed for this purpose.
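A small Python sketch of this check is given below: it opens a recorded file with the standard wave module, reports its parameters, and prints the usable bandwidth implied by the Nyquist criterion. The file name is an assumption for illustration (for instance, the file produced by the earlier sketch).

    import wave

    with wave.open("tone.wav", "rb") as wf:
        channels = wf.getnchannels()       # 1 -> mono
        sample_width = wf.getsampwidth()   # bytes per sample: 1 -> 8-bit
        sample_rate = wf.getframerate()    # e.g. 11025 for the nominal 11 kHz

    nyquist_bandwidth = sample_rate / 2.0  # highest representable frequency
    print(f"{sample_rate} Hz, {8 * sample_width}-bit, "
          f"{'mono' if channels == 1 else 'stereo'}")
    print(f"Usable bandwidth: {nyquist_bandwidth:.0f} Hz "
          f"(speech needs roughly 3500 Hz)")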


Appendix-2


AUTHOR'S PUBLICATIONS

a) Research Papers - International Conferences:

[1] "An optimized soft cutting approach to derive syllables from words in text to speech synthesizer", published in the conference on signal and image processing 2006, organized by IASTED (International Association of Science and Technology for Development), Honolulu, Hawaii, USA.
[2] "Automatic segmentation of syllables of Marathi words using Maxnet", presented at the international conference on electronic design and signal processing ICEDSP-2009, Manipal Institute of Technology, Manipal, India.
[3] "Spectral mismatch as the index of quality of naturalness in synthetic speech", published in the 2009 IEEE Pacific Rim conference on communications, computers and signal processing, University of Victoria, Victoria, B.C., Canada.
[4] "Spectral mismatch estimation in synthetic speech", presented at the international conference on electronic design and signal processing ICEDSP-2009, Manipal Institute of Technology, Manipal, India.
[5] "Syllable formation using supervised and unsupervised neural network approach for Marathi TTS", published in the 2010 Third international conference on education technology and training (ETT), Wuhan, China.
[6] "Comparative analysis of neural and non-neural approach of syllable segmentation in Marathi TTS", published in the 2011 international conference on electric and electronics (EEIC 2011), Nanchang, China.

b) Research Papers - National Conferences:

[1] "Identification of vowels in Devnagari script using energy calculation", presented at the national conference on emerging trends in electronics and communication, organized by Padre Conceicao College of Engineering, Verna, Goa, India, 2006.
[2] "Contextual analysis and classification of syllables for Hindi language", presented at the national conference on signal and image processing applications, organized by IET Mumbai and COEP, Pune, India, 2009.
[3] "Comparison between the performance of supervised and unsupervised methods for syllable formation", national conference on pervasive computing (NCPC-2010), organized by IEEE Pune subsection and Sinhgad College of Engineering, Pune.
[4] "Quantification of mismatch of syllable in text to speech synthesis", national conference on pervasive computing (NCPC-2010), organized by IEEE Pune subsection and Sinhgad College of Engineering, Pune.

c) Research Papers - International Journals:

[1] "Relative functional comparison of neural and non-neural approaches for syllable segmentation in Devnagari TTS system", IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 3, No. 2, May 2012, ISSN (Online): 1694-0814.
[2] "Estimation of spectral mismatch for joint cost evaluation in Marathi TTS", IJCA International Journal of Computer Applications (0975-8887), Vol. 65, No. 17, March 2013.
[3] "Context based segmentation and spectral mismatch reduction for more naturalness with application to text to speech (TTS) for Marathi language", IJARCCE, Vol. 2, Issue 11, November 2013.
[4] "Position based syllabification and objective spectral analysis in Marathi text to speech for naturalness", Springer International Journal of Speech Technology (IJST), published online 28 February 2015, DOI 10.1007/s10772-014-9263-3.

