HMM-based Vietnamese Text-To-Speech: Prosodic Phrasing Modeling, Corpus Design, System Design, and Evaluation
Thi Thu Trang Nguyen
To cite this version: Thi Thu Trang Nguyen. HMM-based Vietnamese Text-To-Speech: Prosodic Phrasing Modeling, Corpus Design, System Design, and Evaluation. Other [cs.OH]. Université Paris Sud - Paris XI; Institut Polytechnique (Hanoï), 2015. English. NNT: 2015PA112201. tel-01260884
HAL Id: tel-01260884
https://tel.archives-ouvertes.fr/tel-01260884
Submitted on 22 Jan 2016
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Université Paris-Sud
Ecole Doctorale 427: Informatique de Paris-Sud
Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur
Specialty: Computer Science
Doctor of Science
Defense on Thursday, 24 September 2015 by Thi Thu Trang NGUYEN
HMM-based Vietnamese Text-To-Speech: Prosodic Phrasing Modeling, Corpus Design, System Design, and Evaluation

Committee:
Advisors:
Christophe D’ALESSANDRO, Research Director, CNRS (LIMSI)
Do Dat TRAN, Professor (Institut Polytechnique de Hanoi, Vietnam)
Reviewers:
Philippe MARTIN, Professor Emeritus (Université Paris-Diderot 7)
Yannis STYLIANOU, Professor (University of Crete, Greece)
Examiners:
Laurent BESACIER, Professor (Université Joseph Fourier, Grenoble)
Sophie ROSSET, Research Director, CNRS (LIMSI)

Groupe Audio et Acoustique, LIMSI-CNRS, Rue John von Neumann - Campus Universitaire d’Orsay - Bât 508, F-91405 Orsay Cedex, France
ED 427 - Université Paris-Sud, UFR Sciences Orsay, Batiment 650 rue Noetzlin, 91405 Orsay Cedex, France

This dissertation is dedicated to my son Teddy, who was six months old when I started, and to my parents and my husband for their love, endless support, and encouragement.

Acknowledgements

Foremost, I would like to express my most sincere and deepest gratitude to my thesis advisors M. Christophe d’ALESSANDRO (Directeur de Recherche at LIMSI-CNRS, France), Prof. TRẦN Đỗ Đạt and Prof. PHẠM Thị Ngọc Yến (MICA-CNRS, Vietnam) for their continuous support and guidance during my PhD program, and for providing me with such a serious and inspiring research environment. I am really grateful to Christophe for his excellent mentorship, care, patience, and immense knowledge of Text-To-Speech (TTS). His advice helped me throughout the research and writing of this thesis. He also helped me greatly in completing the joint program administration, applying for the Excellence Eiffel scholarship, and obtaining funding for travel and conferences. I am very thankful to Prof. Đạt, M. Eric CASTELLI and Prof.
Yến for shaping my thesis at the beginning, for their support in applying for the Évariste Galois scholarship, and for their enthusiasm and encouragement. Prof. Đạt substantially facilitated my PhD research, especially when I was a newcomer to speech processing and TTS, with his valuable comments on Vietnamese TTS.

I am fortunate to have had the opportunity to work with Albert RILLIARD (LIMSI). He has brought me great joy and crucial encouragement during my PhD. He taught me essential knowledge in areas such as prosody, statistical analysis, and perceptual evaluation. This had a great impact on steering this thesis and led to considerable results for my work. I am very grateful to Albert for his care and his advice on research, writing, and presentation.

It is my pleasure to thank my thesis reviewers, Prof. Philippe MARTIN (Université Paris-Diderot 7) and Prof. Yannis STYLIANOU (Toshiba’s Cambridge Research Laboratory), for agreeing to review my thesis and for taking the time to read it and give valuable feedback. I would also like to thank Mme. Sophie ROSSET (LIMSI) and Prof. Laurent BESACIER (LIG) for agreeing to serve on my defense committee.

I would like to thank Prof. Jacqueline VAISSIÈRE (LPP) for her care and support during my first three-month internship in France as well as during my PhD. I highly appreciate the opportunity to know and work with M. Alexis MICHAUD (MICA). I am sincerely indebted to Alexis for his suggestions and valuable comments on linguistics and writing.

I take this opportunity to extend my heartfelt gratitude to my dear friends and colleagues: especially Marc for his constructive discussions and cooperation; Areti, Hảo, Chi, Hải Anh, and Thuỳ for their encouragement and enthusiasm in revising the English of my dissertation; and Olivier, David, Samuel, anh Cường, Khoa, Diệp, anh Sơn, and Xuân for their support and comments on my PhD, and for much fun and a friendly working atmosphere at LIMSI and MICA.
I wish to thank the students Lan, Thắng, and Tùng and the subjects for their efforts in conducting or participating in the perception tests at MICA; my Vietnamese friends in Paris, Khánh, Ngọc Anh, and Bình, for their enthusiastic support in recording sessions at LIMSI; and anh Bắc and Hiếu for their helpful suggestions.

The present research would not have been feasible without financial support from the French government through two scholarships: Évariste Galois and Excellence Eiffel. I would also like to acknowledge the funding from the Région Ile-de-France through the FUI ADN-TR project (2011-2014), and from the Vietnamese NAFOSTED fund for participating in conferences. I also take this opportunity to express my gratitude to Prof. Nicole BIDOIT, Director, and to Stéphanie DRUETTA, Assistant, of the Ecole Doctorale d’Informatique de Paris-Sud for their support during my research.

Last but not least, I would like to dedicate this moment to my son Teddy and my husband Chí, who have given me much courage to accomplish this thesis, and to my parents for their endless love and support throughout my PhD.

Contents

Notations and Abbreviations
List of Tables
List of Figures
Lists of Media files
1 Vietnamese Text-To-Speech: Current state and Issues
1.1 Introduction
1.2 Text-To-Speech (TTS)
1.2.1 Applications of speech synthesis
1.2.2 Basic architecture of TTS
1.2.3 Source/filter synthesizer
1.2.4 Concatenative synthesizer
1.3 Unit selection and statistical parametric synthesis
1.3.1 From concatenation to unit-selection synthesis
1.3.2 From vocoding to statistical parametric synthesis
1.3.3 Pros and cons
1.4 Vietnamese language
1.5 Current state of Vietnamese TTS
1.5.1 Unit selection Vietnamese TTS
1.5.2 HMM-based Vietnamese TTS
1.6 Main issues on Vietnamese TTS
1.6.1 Building phone and feature sets
1.6.2 Corpus availability and design
1.6.3 Building a complete TTS system
1.6.4 Prosodic phrasing modeling
1.6.5 Perceptual evaluations with respect to lexical tones
1.7 Proposition and structure of dissertation
2 Hanoi Vietnamese phonetics and phonology: Tonophone approach
2.1 Introduction
2.2 Vietnamese syllable structure
2.2.1 Syllable structure
2.2.2 Syllable types
2.3 Vietnamese phonological system
2.3.1 Initial consonants
2.3.2 Final consonants
2.3.3 Medials or Pre-tonal sounds
2.3.4 Vowels and diphthongs
2.4 Vietnamese lexical tones
2.4.1 Tone system
2.4.2 Phonetics and phonology of tone
2.4.3 Tonal coarticulation
2.5 Grapheme-to-phoneme rules
2.5.1 X-SAMPA representation
2.5.2 Rules for consonants
2.5.3 Rules for vowels/diphthongs
2.6 Tonophone set
2.6.1 Tonophone
2.6.2 Tonophone set
2.6.3 Acoustic-phonetic tonophone set
2.7 PRO-SYLDIC, a pronounceable syllable dictionary
2.7.1 Syllable-orthographic rules
2.7.2 Pronounceable rhymes
2.7.3 PRO-SYLDIC
2.8 Conclusion
3 Corpus design, recording and pre-processing
3.1 Introduction
3.2 Raw text
3.2.1 Rich and balanced corpus
3.2.2 Raw text from different sources
3.3 Text pre-processing
3.3.1 Main tasks
3.3.2 Sentence segmentation