Silent Communication: Whispered Speech-To-Clear Speech Conversion
Viet-Anh Tran
To cite this version: Viet-Anh Tran. Silent Communication: whispered speech-to-clear speech conversion. Computer Science [cs]. Institut National Polytechnique de Grenoble - INPG, 2010. English. tel-00614289.

HAL Id: tel-00614289
https://tel.archives-ouvertes.fr/tel-00614289
Submitted on 10 Aug 2011

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

INSTITUT POLYTECHNIQUE DE GRENOBLE
Library reference number: |__|__|__|__|__|__|__|__|__|__|

THESIS
submitted to obtain the degree of DOCTOR of INP Grenoble
Speciality: Signal, Image, Speech, Telecommunications
prepared at the Speech and Cognition Department of GIPSA-Lab
within the Doctoral School: Electronics, Electrical Engineering, Automatic Control and Signal Processing
presented and publicly defended by Viet Anh Tran on 28 January 2010

TITLE
Silent Communication: whispered speech-to-clear speech conversion
(Communication silencieuse : conversion de la parole chuchotée en parole claire)

THESIS SUPERVISOR: Hélène Lœvenbruck
CO-SUPERVISORS: Christian Jutten & Gérard Bailly

JURY
M. Eric Moulines, President
Mme. Régine André-Obrecht, Reviewer
M. Gérard Chollet, Reviewer
Mme. Hélène Lœvenbruck, Thesis supervisor
M. Christian Jutten, Co-supervisor
M. Gérard Bailly, Co-supervisor
M. Yves Laprie, Examiner

Acknowledgements

I would like to express my deepest appreciation to my advisors Dr. Hélène Lœvenbruck, Dr.
Gérard Bailly and Prof. Christian Jutten, whose constant guidance, encouragement and support from the initial to the final stage enabled me to develop an understanding of this research at GIPSA-Lab.

During my thesis, I had a great opportunity to work with Dr. Tomoki Toda, assistant professor at the Nara Institute of Science and Technology (NAIST), Japan. I would like to thank Dr. Tomoki Toda for letting me use the NAM-to-speech system developed at NAIST, and especially for his continuous support, valuable advice, discussions and collaboration throughout this research.

I would like to express my gratitude to the members of the jury. I would like to acknowledge Prof. Régine André-Obrecht and Dr. Gérard Chollet for having accepted to review my thesis, and for their helpful comments and suggestions on the writing of the thesis. Special thanks go to Prof. Eric Moulines for accepting to preside over the defense committee, as well as to Dr. Yves Laprie for his participation in the jury.

I am grateful to Coriandre Vilain, Christophe Savariaux, Lionel Granjon and Alain Arnal for their help in fixing hardware problems and in data acquisition. Thanks also go to Pierre Badin for valuable advice about the presentation of my thesis. I also wish to thank Dr. Panikos Heracleous, with whom I was fortunate to discuss and collaborate on the Non-Audible Murmur recognition problem.

Finally, I would like to acknowledge my colleagues: Nino, Nadine, Mathieu, Oliver, Xavier, Anahita, Noureddine, Stephan, Atef, Amélie, Keigo, Yamato and others for the many good times at GIPSA-Lab and at NAIST, and of course my family and my friends for their endless encouragement and support.

Contents

Contents
List of figures
List of tables
Abstract
Résumé
Introduction
    Motivation of research
    Scope of thesis
        Signal-to-signal mapping
        HMM-based recognition-synthesis
    Organization of thesis
    Related project
Chapter 1  Silent speech interfaces and silent-to-audible speech conversion
    1.1 Introduction
    1.2 Silent speech
    1.3 Silent speech interfaces (SSI)
        1.3.1 Electro-encephalographic (EEG) sensors
        1.3.2 Surface Electromyography (sEMG)
        1.3.3 Tongue displays
        1.3.4 Non-Audible Murmur (NAM) microphone
    1.4 Conversion of silent speech to audible speech
        1.4.1 Direct signal-to-signal mapping
        1.4.2 Combination of speech recognition and speech synthesis
    1.5 Summary
Chapter 2  Whispered speech
    2.1 Introduction
    2.2 The speech production system
    2.3 Whispered speech production
    2.4 Acoustic characteristics of whispered speech
    2.5 Perceptual characteristics of whisper
        2.5.1 Pitch-like perception
        2.5.2 Intelligibility
        2.5.3 Speaker identification
        2.5.4 Multimodal speech perception
    2.6 Summary
Chapter 3  Audio and audiovisual corpora
    3.1 Introduction
    3.2 French corpus
        3.2.1 Text material
        3.2.2 Recording protocol
    3.3 Audiovisual Japanese corpus
        3.3.1 Text material
        3.3.2 Materials and recording
    3.4 Summary
Chapter 4  Improvement to the GMM-based whisper-to-speech system
    4.1 Introduction
    4.2 Original NAIST NAM-to-Speech system
        4.2.1 Spectral estimation
        4.2.2 Source excitation estimation
    4.3 Improvement to the GMM-based whisper-to-speech conversion
        4.3.1 Aligning data
        4.3.2 Feature extraction
        4.3.3 Voiced/unvoiced detection
        4.3.4 F0 estimation
        4.3.5 Influence of spectral variation on the predicted accuracy
        4.3.6 Influence of visual information for the conversion
    4.4 Perceptual evaluations