Methods for More Efficient, Effective and Robust Speech Recognition

Michael Galler School of Computer Science McGill University, Montreal

December 21, 1999

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Ph.D.

© Michael Galler 1999


Abstract

In the last few years the field of computer speech recognition has come into its own as a practical technology. These advances have been due primarily to a steady, step-wise refinement of existing techniques, and to the availability of ever more powerful computers on which to implement them. However, many problems remain partially unsolved or entirely open. For example, current methods continue to fail in cases of small, highly confusable vocabularies. New acoustic modeling techniques and search methods are proposed in the area of robust, speaker-independent continuous speech recognition. Randomized search techniques are used to refine and improve the topologies and clustered training of allophone hidden Markov models. A new approach to word hypothesization and pronunciation modeling is proposed, based on pseudo-syllabic units. Algorithms are described for transforming a lattice of syllables into words, and for learning the phonotactics of syllables automatically in a statistical framework. Next, a multi-grammar method for generating alternative hypotheses is presented, together with the method's experimental evaluation in a telephone-based spelled word recognition system. A blueprint is given for building time- and memory-optimized speech decoders. Finally, analysis and recognition techniques are applied to the problem of enhancing the intelligibility of speech produced by alaryngeal speakers.

Abstract

In recent years, the growth of research in automatic speech recognition has led to the development of practical applications in this field. These advances are due essentially to the painstaking refinement of existing techniques, together with the arrival of more powerful computers on which these solutions could be implemented. Several large-vocabulary dictation products are now enjoying real commercial success. Nevertheless, many recognition problems remain poorly solved. This thesis studies, more specifically, difficult vocabularies in which confusion between words can be high, such as spelled English letters. From these letters a very large number of proper names can be composed, yet good recognition of such vocabularies is not achieved by the best-known dictation systems. Several problems in acoustic modeling and in the recognition of isolated and continuous speech are considered here from the point of view of graph search. Randomized search techniques are used to refine and improve the topology and training of allophone hidden Markov models. We also propose algorithms for transforming syllable lattices into words, and for learning the phonotactics of syllables automatically, all within a statistical framework. In addition, a multi-grammar search method capable of generating several alternative hypotheses is presented; we evaluated this method experimentally in a recognizer for words spelled over the telephone. A methodology for building decoding engines optimized in time and memory is proposed for applications of this type. In a parallel study, we also applied these analysis and recognition techniques to the problem of improving the intelligibility of speech produced by alaryngeal speakers.

Acknowledgements

This thesis is the outcome of several years of continuous involvement in the field of speech recognition, both in the university and in industry. It comprises a collection of ideas, techniques and applications, most of which have been published elsewhere, in conferences or journals, and worked out in collaboration with a number of excellent colleagues and mentors. Foremost in the latter category are Dr. Renato De Mori and Dr. Jean-Claude Junqua, who have given generously of their time, creativity and patience, without which I could not have accomplished this work.

The work on phoneme modeling (chapter 8), pronunciation modeling and syllable phonotactics was performed at McGill University's Centre for Intelligent Machines (CIM). At CIM I benefited from sharing software and having discussions with Dr. Charles Snow, Matteo Contolini, Fabio Brugnara, and Giulio Antoniol. Chapter 10, on building an efficient decoder, is the outgrowth of a series of discussions with a young colleague, Luca Rigazio, at Speech Technology Laboratory (STL). It began as a tutorial and ended in a collaboration. Many excellent ideas described here are his. The papers and related patents on which I based chapter 11, on robust spelled word recognition over the telephone, were co-authored with Dr. Junqua.

The work on esophageal speech enhancement, in chapter 12, is something of a departure. The problem was introduced to me by the linguist Dr. Hector Javkin, and he provided the data to which we applied techniques of speech recognition and analysis. I found this work particularly rewarding, going as it does beyond the usual goals of improved productivity. It exemplifies the application of computer science for humane purposes. Indeed, what is often overlooked in the whole area of computer speech recognition is that in addition to its many commercial applications, it will serve to significantly improve the lives of the hearing and speech impaired. The chapter is based on a paper and two patents co-authored with the linguists Dr. Javkin, Dr. Nancy Niedzielski, and Jim Reed (all formerly at the University of California - Santa Barbara), and Robert Boman, of Speech Technology Laboratory.

About the software packages and tools I used in my experimental work: I was able to take advantage of excellent work made available to me in the two speech laboratories where most of this research was performed. The signal acquisition software was in some cases my own; in others I used software written by Philippe Boucher at CIM, and by Dr. Philippe Morin and his colleagues at STL. The MFCC signal analysis program I used was written at CIM by Dr. Charles Snow. The PLP-domain signal analysis software was shared with me by my STL colleagues. One HMM training package I used was shared with CIM by our friends at the Italian research institute IRST. At STL we use HMM training software written in-house by Matteo Contolini and Dr. Roland Kuhn. All the decoders and search engines used in this thesis research, as well as many of the language building tools, were the result of my own development work.

Now to personal matters: I would like to thank the members of my family, to whom I owe so much: my father and my late mother, and my wife, Kayoko, whose patient support, tolerance, and encouragement allowed me to pursue and eventually complete this work.

Contents

I Introduction

1 Preface
  1.1 How this thesis is organized

2 Speech Recognition: A Survey
  2.1 Historical Review

3 Machine Learning and Pattern Recognition
  3.1 The Fitting Problem
    3.1.1 Regression Methods
    3.1.2 Pattern Classification
      Discriminant Analysis
      Hidden Markov Models
      Principal Component Analysis
      Perceptrons
      Multi-layer Neural Networks
  3.2 Optimization Methods
    3.2.1 Hill Climbing
    3.2.2 Randomized Search
    3.2.3 Heuristic Search
  3.3 Hand-Tuning of Machine Learning Models

II Fundamentals of Speech Processing

4 Speech Production and Acoustic-Phonetics
  4.1 Phonemes and Allophonic Variability
  4.2 Speech Production by Phoneme Group

5 Analysis of the Speech Signal
  5.1 The Acoustic Analyser Module
  5.2 Signals and Spectra
  5.3 Acoustic Properties of Phonemes
  5.4 Acoustic Feature Extraction
    5.4.1 Signal Preprocessing
    5.4.2 Spectral Analysis
    5.4.3 Auditory-Based Analysis
    5.4.4 Linear Predictive Coding Analysis
    5.4.5 Perceptual Linear Predictive Analysis
    5.4.6 Mel Filtered Cepstral Analysis
    5.4.7 Waveform Analysis

6 Training and Decoding with Acoustic Models
  6.1 Basic HMM Computations
  6.2 Training Algorithm: Maximum Likelihood Estimation
  6.3 Decoding Algorithm: Viterbi Search
  6.4 Beam Search

7 Classical Search Methods
  7.1 The Classical Word-Lattice Approach
  7.2 Sources of Error in Word Lattices
  7.3 Out-of-Vocabulary Events

III New Algorithms and Experimental Work

8 Search for Acoustic Models
  8.1 Introduction
  8.2 Search Strategies
    8.2.1 Simulated Annealing
    8.2.2 Knowledge-guided Search
  8.3 Applying Search to HMM Structure
    8.3.1 Basic Recognizer Architecture
    8.3.2 Topologies and Distribution Ties
    8.3.3 Allophone Context Clusters
  8.4 Derivation of Phoneme Models from Allophone Models
    8.4.1 Merging and Retraining
    8.4.2 Further Improvements
  8.5 Discussion

9 Adding Knowledge with Syllable Phonotactics
  9.1 Introduction
  9.2 Using Syllable Lattices or Sequences for Word Hypothesization
  9.3 The Phonotactic Model for Syllable-Mediated Search
    9.3.1 Initialization
    9.3.2 Training
  9.4 From Syllables to Words
    9.4.1 Synchronous Phonotactic Search
    9.4.2 Some Experiments
  9.5 Filtering the Syllables
    9.5.1 Heuristic and Reasoning Methods
    9.5.2 Rescoring the Syllable Segments with New Segmental Features
  9.6 OOV Event Detection
  9.7 Discussion

10 Design of a Fast Search Engine
  10.1 A General Speech Engine
    10.1.1 Static Data Structures
    10.1.2 Dynamic Data Structures
    10.1.3 The Search Algorithm
    10.1.4 Analyzing the Computational Cost
  10.2 Optimizations
    10.2.1 Static Data Structures
    10.2.2 Data Structures for Simplified Search
    10.2.3 The Search Algorithm
    10.2.4 Analyzing the Computational Cost
    10.2.5 Enhancements for Continuous Speech

11 Spelled Letters: Algorithms and Application
  11.1 Introduction
  11.2 Related Work
    11.2.1 Call Routing
    11.2.2 Robustness Issues
  11.3 Data Description
  11.4 Enhanced Robustness by Pause-handling
  11.5 Modeling of Extraneous Speech and Noise
    11.5.1 Improved Robustness with Letter-Spotting
    11.5.2 Filler Models and Networks
    11.5.3 Noise Modeling
  11.6 Experimental Results
  11.7 Conclusions

12 Enhancements of Speech Production in Esophageal Speakers
  12.1 Introduction
  12.2 The Proposed Device
  12.3 Detection of Noise by HMMs
    12.3.1 Feature Analysis
    12.3.2 Decoding
  12.4 Results of Detection of Noise by HMMs
  12.5 Detecting Esophageal Injection Noise by Morphological Filtering
  12.6 Discussion

IV Conclusion

13 Perspectives
  13.1 Trends in Speech Recognition Research
  13.2 Open Problems and Future Directions

List of Figures

3.1 A Markov chain
3.2 The state transition matrix
3.3 A perceptron
3.4 A multi-layer neural network to compute XOR
3.5 Tractability of optimization problems
3.6 Inefficiency of steepest descent. Dotted line shows desired minimization
3.7 Adapting perceptron architecture to problems
The vocal tract (from [DemoriEtAl90])
Transformation of information along the speech channel
Waveform and spectrogram of an acoustic sample: "yes ... or no?"
Hidden Markov model for a phoneme
A trellis structure for alpha/beta computations
A looped word-recognition model
The initial (top) and optimized topology for model "b"
Final topology
A syllable phonotactic model
10.1 The decoder's static data structures
10.2 The decoder's dynamic data structures
10.3 A minimal tree representation for the network
10.4 A simple feed-forward HMM topology
10.5 Simplified static data structures
10.6 The new search algorithm is driven by the active Hypothesis structure
11.1 Call recognition and error rates
11.2 Network for letter-spotting
12.1 Illustration of method for rejecting injection noise
12.2 Detecting the "gulp" with HMM-based word-spotting

12.3 256-point FFT from the center of an injection noise segment, and the result of passing the FFT through the morphological filter
12.4 256-point FFT from the center of a /d/ segment, and the result of passing the FFT through the morphological filter

List of Tables

3.1 The XOR operator
4.1 English phonemes
4.2 Vowel classification
4.3 Consonant classification
Comparison of initial and optimized topologies
Initial right-context clusters for plosives
Optimized right-context clusters
Initial left-context clusters for vowels
Optimized left-context clusters
Development of the recognizer with discrete HMMs (phone error rate includes insertion errors)
Results for continuous models (no language model)
Merging the discrete HMMs
Merged continuous context-independent models
10.1 Time/memory comparison of general and optimized decoders

11.1 Name retrieval experiments: results
11.2 Error rates of letter-spotting by filler and by noise models

12.1 HMM results for injection noise detection
12.2 Morphological filter experiment: results on data set 1
12.3 Morphological filter experiment: results on data set 2

Part I
Introduction

Chapter 1
Preface

The investigation described in this thesis took place precisely during the period when computer speech recognition moved from the science laboratory to the marketplace. Many of the key innovations and techniques that made this possible were already known 10 years ago. A steady stream of refinements on these methods by speech laboratories around the world, added to the brisk pace of hardware improvements in memory and computational power, has made possible the realization of real-time, large-vocabulary, speaker-independent recognizers. It has also resulted in the widespread deployment of extremely accurate small-vocabulary recognizers in telephone auto-attendants. This has made the technology visible and familiar to the general public. These developments have been a source of pride and excitement for us, the scientists and researchers, and also a keen spur to competition. With the advent of software dictation products such as IBM's Via Voice and Dragon's Naturally Speaking, it may seem that the basic research problems have been solved. The truth is that the field of automatic speech recognition (ASR) has emerged from its infancy, with many open problems remaining and many of the current techniques unsatisfactory.

Much remains to be learned about speech and speech recognition, including fundamental issues and practical ones. In 1994 Roger K. Moore observed that "in the end, statistics is just a sound mathematical approach for modeling uncertainty or ignorance... [When] speech is fully understood, there may be very little residual uncertainty remaining to be modeled and the stochastic approach will have both served and lost its purpose" [Mak84]. In the meantime our tools for speech recognition remain primarily statistical ones. Where progress has been made is in managing the size of the space we choose to model with statistics. Instead of trying to model the raw signal, we extract features such as the coefficients of the log magnitude spectrum, or the linear prediction model (chapter 5), and try to capture the statistics of this greatly reduced space. The still limited success of recognizers in decoding spontaneous speech (around 50% accuracy [BeaufaysEtAl99]) shows that these smaller feature sets do not correctly separate the irrelevant from the relevant information in the acoustic signal. And yet all the necessary information is there, packed in with a high degree of redundancy.

To illustrate the redundancy of information in the speech signal, consider the bandwidth of the recorded speech signal, and contrast it with the information content, measured in bits per second (bps), of the message encoded in the signal. If the signal is sampled at 16 kHz, in order to capture approximately the frequency range of the human auditory channel, 16,000 two-byte samples of linearly encoded speech are represented per second, for a bit rate of 256,000 bps. For an average speech rate of two English words per second, at eight letters per word, the actual information content of the speech signal is 128 bps. The ratio of the signal bit stream to the bit stream of the message means that an enormous amount of redundancy is employed by nature in order to ensure that the message can be decoded, despite numerous variabilities of speech production and transmission [Atal99].
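The arithmetic above can be checked in a few lines. This is a minimal sketch in Python, using exactly the figures assumed in the preceding paragraph; none of the names are from the thesis:

```python
# Redundancy of the speech signal: signal bit rate vs. message bit rate.
sample_rate_hz = 16_000      # samples per second
bytes_per_sample = 2         # 16-bit linear PCM
signal_bps = sample_rate_hz * bytes_per_sample * 8            # 256,000 bits/s

words_per_second = 2         # average English speaking rate
letters_per_word = 8
bits_per_letter = 8          # one byte per letter of the transcribed message
message_bps = words_per_second * letters_per_word * bits_per_letter  # 128 bits/s

redundancy_ratio = signal_bps / message_bps
print(f"signal: {signal_bps} bps, message: {message_bps} bps, ratio: {redundancy_ratio:.0f}:1")
# -> signal: 256000 bps, message: 128 bps, ratio: 2000:1
```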
These variabilities are due to the imprecision of vocal production in the individual, individual differences of the vocal tract, differences due to sex and age, differences of dialect and accent, the characteristics of the transmission channel (e.g. microphone, telephone circuit, room echo), environmental noise, and so on. To the extent that these variabilities are understood, and can be modeled directly, the quantity of ignorance which we model with statistics is diminished. Speech knowledge will eventually allow a better "clustering" in a large acoustic space than the current Gaussian clusters trained on spectrum-derived coefficients. This is because the new clusters will be perceptually based, i.e. they will represent units that are perceived as the same phoneme by a human listener. Despite ongoing, and incrementally successful, efforts to diminish the ignorance through improved speech knowledge, we do not yet know enough to properly sift the redundancies inherent to the speech signal. Thus in 1999, all ASR methods are still based largely on statistics. It can safely be said that only moderate qualitative knowledge has been gained about the fundamental nature of the speech signal that serves to reduce the uncertainty. Ignorance reaches from one endpoint of the problem space to the other: from the extraction of the most appropriate spectral (or time-domain) features that feed the speech model, to the domain of natural language understanding (NLU). Without NLU, speech can never be fully decoded in general, and the recognition problem is increasingly coupled to progress in NLU, as applications extend beyond speech-to-text dictation to computer programs that understand and act upon the meanings of natural speech inputs.

1.1 How this thesis is organized

This thesis is divided into four main parts. The first part consists of this preface and the following two chapters. It presents the context for the ASR problem, describing it in terms of machine learning and pattern recognition. Part II comprises a survey of the basic tools used for continuous speech recognition. Fundamental algorithms for acoustic analysis and modeling, decoding and search are described. The chapters of Part III contain original work on problems of acoustic modeling, and on the design of a fast search engine for speech decoding. The research described here does not focus on very large vocabularies of common words. The focus instead is on difficult vocabularies, such as spelled letters, with which a very large number of proper nouns can be composed. This is the major application explored in this thesis, a subject for which good solutions are not found in the most popular dictation systems. The first chapter of Part III describes experiments with acoustic models, their representations and training. The second chapter motivates the use of syllabic modeling and reasoning for search, and enhanced word hypothesization. The next chapter is a short manual on how to optimize the speed and space requirements of a basic speech decoding engine. This search engine is particularly well designed for small vocabularies. The following chapter describes some original algorithms and their resulting improvements to the robustness of a letter-based continuous spelled word recognizer. All four of these chapters are variations on the theme of search, particularly in the space of small and confusable vocabularies. The last chapter of Part III introduces a successful application of speech recognition to the medical issue of esophageal speech enhancement. The final part of the thesis is a concluding chapter describing the outstanding issues in the field of ASR, and the most current and widely pursued techniques to advance the technology. Also discussed are the weaknesses and strengths of current methods, and a perspective on where the most effort should be spent in order to achieve the next generation of automatic speech and natural-language processing.

Chapter 2
Speech Recognition: A Survey

Automatic Speech Recognition (ASR) is a problem in the transfer of a highly redundant information stream across a noisy interface. The signal originates as a mental representation of a sequence of linguistic symbols, i.e. words. The words are translated into sound by complex excitations of the organs of the vocal tract. This signal is received and converted by the auditory mechanisms of the listener's inner ear into a pattern of nerve-cell firings. Within the auditory, perceptual and language centres of the listener's brain, the signal is converted into a sequence of linguistic symbols. The listener then integrates and interprets these symbols as a sentence. Between ear and mind, or microphone and computer memory, lies an interface across which information is translated from analog to digital, from continuous to discrete, and from physical to symbolic. The challenge is to produce a machine which can reconstitute the original message as quickly and with as little apparent error as a human being. ASR is more difficult than many problems in pattern recognition because the sound wave encodes the signal in an inherently non-uniform way. Some sources of variability are:

1. Words are made up of phonemes, the basic alphabet of sounds constituting a particular language. But different speakers will use somewhat different sequences of phonemes for a given word, due to different abstract mental representations of the word.

2. Because of the complex way sounds are articulated, two individuals will not pronounce a given phoneme exactly the same way.

3. Similarly, the same person will not pronounce a given phoneme twice exactly the same way.

4. The duration of a phoneme is variable.

5. Phonemes are distorted by the contextual (or co-articulation) effects of neighbouring phonemes.

6. In general, the boundaries between the discrete symbols of interest (the words) are not distinguished by silences or other cues.

In spite of this variability, speech is intelligible. This is because the speech utterance is generally a highly redundant signal, and human beings use information at different levels - acoustic, syntactic, semantic and pragmatic - in order to interpret it. People have studied the acoustic-phonetic properties of phonemes for many years, and there is now a rich literature in the use of statistical methods and pattern classification techniques for modeling them. The most popular methods include hidden Markov models [Rabiner89] and artificial neural networks [WeibelEtAl89]. Other approaches include multi-level acoustic segmentation and stochastic segment models [KimballOstendorf92]. All these techniques have had similar success, and there is reason to suspect they are now nearly at the information-theoretic limit for accurate decoding of speech into sub-word acoustic symbols. At present more attention is being focused on the study of robust signal processing, acoustic model adaptation, lexical representation, language modeling, and interpretation methods. The goal is to understand better how words relate to articulated sounds, and how context relates to words, so that ASR systems can be designed to work in more general contexts and under more variable conditions than is possible with current techniques. Lexical access is a search problem: given a lexicon of words and their symbolic representations, how can the phonetic symbols detected in an acoustic signal best be mapped to a set of word hypotheses? Lexical search must be fast and admissible. Language modeling is based on a grammar which is in some cases a stochastic one. The language being recognized may be restricted to the rigid syntax of a finite-state grammar, or may be modestly constrained with sequence-based probabilities. In the second case probabilities are associated with the network transitions. In either case the generation of sentences is based on a network of linguistic symbols. The limits of computer speech recognition algorithms today are due to:

1. Limited understanding of the psycho-acoustical factors at work in both the production and the reception of speech sounds.

2. Signal processing algorithms which only guess at the important, information-bearing time- and spectral-domain features of speech.

3. An acoustic representation that operates at a fixed resolution of discrete intervals.

4. Language models that are not much more than crude filters.

5. Virtually no hard understanding of higher-order language processing.

2.1 Historical Review

Computer-based speech recognition can be seen to have evolved through several generations, roughly coinciding with the decades from the 1950s on. Early efforts, which began after the arrival of computers equipped with A/D converters, were based on hardware circuits. An input signal was passed through a bank of analog filters, and the harmonic resonances of the signal were used to spot vowels, or to distinguish short words like digits based on their vowels. The early work (1952, Bell Labs; 1956, RCA) showed that the frequency domain of the speech signal, the spectrum, could be used to tag invariant features of the phonemes for classification. In 1959 at University College, England, Fry and Denes introduced several ideas that would later prove important, including phoneme recognition, the use of a pattern matcher, and a primitive language model based on statistics of phone sequences.

The 1960s saw the introduction of methods for feature extraction, including zero-crossing analysis, and time normalization of the speech utterance. Vintsyuk, in the Soviet Union, introduced a technique, later popularized as dynamic time warping (DTW), which became a generic method for isolated word recognition. The 1960s also saw early work in connected word recognition, notably at Carnegie-Mellon University. During this period the first commercial companies were created to market special-purpose hardware for word recognition.

The following decade saw the application or innovation of basic methods that exist in systems of today. These include pattern classification, dynamic programming, frame-synchronous analysis, the distance method for linear predictive coding (LPC) coefficients (Itakura, 1975), network representations, beam search (Baker, 1975), and clustering algorithms for continuous speech (AT&T Bell Labs). Template matching techniques such as DTW were successfully applied to isolated-word, speaker-dependent applications. About this time, a more general and powerful class of statistical models was recognized by researchers to have the potential to manage continuous speech. The first description of hidden Markov models (HMMs) was given in [Baum72]. In [Baker75] their utility in the realm of speech recognition was demonstrated for the first time. Certain ongoing projects that saw their genesis in the 1970s became benchmark systems for the field. One such system, IBM's Tangora, was a speaker-adaptive isolated word recognition program for dictation of office memos. By 1985 the system was capable of recognizing 5,000 words, and became the basis of a commercial dictation product for personal computers. A comparative overview of ASR in the 1970s which includes work carried out in the non-English-speaking world can be found in [Demori79].

In the 1980s two things happened which significantly advanced the state of the art, and disseminated effective tools for the speech recognition problem to a wider community of researchers. The first was the arrival of powerful workstations that could perform the demanding numerical computations needed for speech recognition.
The second was the popularization of hidden Markov models as a basic methodology [Ferguson80], [Rabiner89]. Other important innovations of this period were methods of integrating phoneme spotting with word recognition for continuous speech (NEC's two-level dynamic programming, and Bell Labs' level building approach), cepstrum-based feature extraction (Shikano), the use of the Mel scale for transforming the cepstral coefficients [DavisMermelstein80], and the use of neural networks for phoneme recognition, e.g. (Lippman, 1987), (Weibel, 1989). In the 1980s great attention was paid to integrating more knowledge sources within HMM-based search engines, e.g. lexical and language models. By this time several outstanding research systems were well known to be capable of speaker-independent, continuous word recognition for medium-sized vocabularies with good accuracy (> 90% on read speech). Notable systems included CMU's SPHINX and BBN's BYBLOS, as well as projects of Lincoln Labs, MIT and Bell Labs. The 1990s saw the culmination of this work, leading to the widespread introduction, particularly in the last three years, of commercial systems for speaker-independent continuous speech dictation of very large (about 60,000 word) vocabularies, and spoken language understanding in limited domains. ASR technology today, though still imperfect, has acceptable performance for certain applications. Examples include medical and legal report dictation, data entry, and information retrieval by telephone. ASR applications for the automobile and telephone now face the problem of improving performance in spite of low-quality microphones or transducers, limited bandwidth, and environmental and channel noise. Some of the remaining technical challenges include decreasing cost, increasing robustness, capturing speech with far-talking microphones or microphone arrays (giving more freedom to the speaker), and improving models so that confusable vocabularies and natural speech are properly recognized. References: [Demori98], [Rabiner89], [Lee89], [O'Shaughnessy87].

Chapter 3
Machine Learning and Pattern Recognition

The field of Artificial Intelligence (AI) has been defined as "the study of ideas which enable computers to do the things that make people seem intelligent" [Winston77]. Two endowments in particular mark human activity as intelligent: the ability to solve difficult problems, and the ability to form abstract representations, or models of the world, models which serve as the basis for further problem-solving and new abstractions. These two activities, model-building and problem-solving, are intimately related. The formation of a new model is a kind of problem to be solved, called the learning problem. The models we build are useful mostly to the extent they afford new solutions to problems in some domain, solutions that would otherwise be harder to attain. In trying to automate learning and problem-solving, we find that different kinds of models emerge naturally from different domains. For example, there are rule-based models: symbolic models based on rules and facts. Expert systems are tools for building rule-based models of the world. These models are natural ways to simulate expertise in medical diagnostics or tax law [Tanimoto87], expertise that depends on basic knowledge and accumulated experience of how facts relate to one another. The rule-based model embodied in an expert system has two main components: a database of facts, and a set of rules of inference. The rules allow the expert system to consider combinations of facts, and from them, infer new facts. The system takes the facts at hand and attempts to generate appropriate answers to the questions being asked. This is accomplished by a separate component of the system called an inference engine. The inference engine makes problem-solving in a rule-based model automatic, but the learning is not. The model is built through a laborious preliminary stage of data acquisition, and often involves interviewing human domain experts. (Note that a distinction is being drawn here between learning, the gathering of an initial set of rules and facts, and problem-solving, the subsequent automatic generation of new facts. In an expert system this distinction can be arbitrary; the latter is also sometimes described as a stage of learning.) In contrast with rule-based models, there are functional models: numerical models based on functional relations of real-world variables. Scientists construct functional models in seeking to understand physical systems. Many human abilities such as vision and speech involve the processing of vast streams of physical data. These data are hard to describe adequately with symbols or facts, but can be sampled and quantified. Problems such as these lend themselves to functional models. In many cases there are automatic means for training functional models, called algorithms for machine learning. Let M = {X, Y} be a measurement of a set of variables, where Y is a vector of the variables we wish to consider dependent on X, the remaining variables. Training a functional model means taking a series of such observations, or measurements M_i, and finding the parameters P of the model f(X) = f(P, X) which produce the best fit of the model-predicted values (X_i, f(X_i)) to the actual measurements (X_i, Y_i). Outputs can be smoothly continuous functions, as in the case of regression models, or discrete elements of a finite set, as in pattern classification. Algorithms for fitting problems involve optimizing some measure of closeness of fit; thus the training of functional models relies on optimization methods.
In summary, AI deals in large part with building models and using them to solve problems automatically. Rule-based models are usually trained by hand, but functional models can be "machine-learned" using optimization algorithms. Problem-solving with rule-based models centres on the generation of inferences. Problem-solving with functional models varies with the kind of model, but in general involves computations on numerical inputs. The preceding was an attempt to illustrate the strong relation between machine learning, specifically the fitting of parametric models, and computer methods of optimization. The next sections survey the subjects of fitting and optimization in order to provide context for the following work. Chapter 6 will discuss in detail the parametric model to be used, hidden Markov models.

3.1 The Fitting Problem

3.1.1 Regression Methods

One approach to the fitting problem is to assume the data are perturbations of the values of a "true" model, and that the differences between the true values and the measured values, or the measurement errors, are drawn from a known probability distribution. Models assuming particular distributions are called parametric models in the statistical literature, because they require estimation of the parameters of the distributions. Stating the problem in statistical terms allows one to solve it by techniques of maximum likelihood estimation (MLE): finding the set of model parameters with the highest likelihood given the data. The classical techniques of linear regression apply MLE to the simplest fitting problem: given a set of n points p_i = (x_i, y_i) in the real-number plane, find the straight line y = A + Bx which best fits these points. The coefficients A and B are estimated from the p_i using the chi-square merit function

    chi^2(A, B) = sum_{i=1..n} [ (y_i - A - B x_i) / sigma_i ]^2

If the x_i are known exactly, the y_i are reported with normally distributed errors, and sigma_i is the standard deviation of the measurement error in y_i, then minimizing chi^2 gives the A and B with the highest likelihood. The minimum can be computed directly. The usefulness of this estimate depends on a negligible fixed component, or error due to model inaccuracy, in the statistical error. In most applications the linear model is only an approximation, and in many cases a linear model is inadequate even as an approximation. By modeling the data with a linear combination of m arbitrary functions X_1(x), ..., X_m(x), the regression approach can be extended to non-linearly-related data. In the general linear least squares method, the parameters of the model are a set of m coefficients A_k, and the merit function becomes

    chi^2(A) = sum_i [ (y_i - sum_k A_k X_k(x_i)) / sigma_i ]^2

The vector A which minimizes chi^2 is solved for by methods of LU or Singular Value decomposition [PressEtAl88].
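As a concrete illustration of the line-fitting case above, here is a minimal sketch (assuming NumPy; all function and variable names are illustrative, not from the thesis) that minimizes the chi-square merit function by solving the equivalent weighted least-squares problem:

```python
import numpy as np

def fit_line_chi_square(x, y, sigma):
    """Fit y = A + B*x by minimizing chi^2 = sum(((y_i - A - B*x_i)/sigma_i)^2)."""
    x, y, sigma = map(np.asarray, (x, y, sigma))
    # Dividing each row of the design matrix and target by sigma_i turns the
    # chi-square minimization into an ordinary least-squares problem.
    w = 1.0 / sigma
    design = np.column_stack([np.ones_like(x), x]) * w[:, None]
    coeffs, *_ = np.linalg.lstsq(design, y * w, rcond=None)
    A, B = coeffs
    chi2 = np.sum(((y - (A + B * x)) / sigma) ** 2)
    return A, B, chi2

# Noisy samples of a known line, y = 1 + 2x:
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
sigma = np.full_like(x, 0.5)
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)
print(fit_line_chi_square(x, y, sigma))
```

The same trick extends to the general linear least squares case: replace the two columns of the design matrix by the m basis functions X_k(x_i).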

3.1.2 Pattern Classification

In ordinary regression the dependent variables are continuous. Pattern classification is the fitting problem in which the values of a dependent variable range over a finite set of discrete elements. Consider the independent variables x_1, ..., x_n as features, and the values taken on by the dependent variable as the classes. This kind of problem involves learning a classifier function which separates the n-dimensional feature space into disjoint decision regions corresponding to the classes. The regions are described in terms of their boundaries, which divide clusters of point sets in feature space, each cluster corresponding to a class. If the clusters are linearly separable, the decision boundaries can be modeled as a set of hyperplanes, H_i = w_i1 x_1 + ... + w_in x_n, in the feature space. If, on the other hand, the cluster distributions are multimodal, the decision surfaces are non-linear and require non-linear models.

Discriminant Analysis

Bayesian discriminant analysis is the application of parametric MLE techniques to pattern classification. The goal is to derive a classifier with the smallest likelihood of error. This is any function which, for class C_i and feature vector x, maximizes the a posteriori probability P(C_i|x). One such function is argmax_i [ log p(x|C_i) + log P(C_i) ]. P(C_i), the a priori probability of class C_i, is estimated simply from cluster size. The difficulty is in properly estimating p(x|C_i), the conditional probability densities of the feature variables. Bayes classifiers generally assume uni- or multi-variate normal densities. Linear discriminant analysis assumes that the clusters all share the same feature-covariance matrix. In this case the classes are hyperplane-separable, and the decision-bounding hyperplanes are given by

    H_i(x) = mu_i^T Sigma^{-1} x - (1/2) mu_i^T Sigma^{-1} mu_i + log P(C_i)

where Sigma is the covariance matrix and mu_i is the mean feature vector of cluster i. Maximizing H_i yields the class with minimal likelihood of error [DudaHart73]. Dropping the assumption that each class shares the same covariance matrix leads to a slightly more complex, quadratic model. The decision boundaries become hyperquadrics, and can separate clusters with more complex distributions. The models' appropriateness still depends on how realistic the assumptions made for the feature densities are.
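A minimal sketch of this decision rule under the shared-covariance assumption (NumPy assumed; the names and the two-class example data are illustrative, not taken from the thesis):

```python
import numpy as np

def lda_scores(x, means, cov, priors):
    """Linear discriminants H_i(x) = mu_i^T S^-1 x - 0.5 mu_i^T S^-1 mu_i + log P(C_i)."""
    cov_inv = np.linalg.inv(cov)
    return np.array([m @ cov_inv @ x - 0.5 * m @ cov_inv @ m + np.log(p)
                     for m, p in zip(means, priors)])

# Two classes sharing one covariance matrix; the class with the largest score wins.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
cov = np.array([[1.0, 0.2],
                [0.2, 1.0]])
priors = [0.5, 0.5]
x = np.array([2.5, 2.8])
print(int(np.argmax(lda_scores(x, means, cov, priors))))   # -> 1 (closer to the second cluster)
```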

Hidden Markov Models

Hidden Markov models (HMMs) are the basic classification method used in this thesis, and are described in detail in Chapter 6. The following is a brief introduction, placing them in the context of the family of pattern matching techniques. A Markov chain is a stochastic process which models systems of events in sequence. It consists of a set of states, and the transitions between them. At any given discrete time t, the system is in one of a finite (or countable) set of states; at the next discrete time t+1, the system makes a transition to a new state. For each state the process has a set of probabilities associated with transitions to other states (figure 3.1). These probabilities must sum to 1. The entire system is described by a state transition matrix. A matrix corresponding to the Markov chain of figure 3.1, with real transition probabilities, is given in figure 3.2.

Figure 3.1: A Markov chain.

In the techniques of the previous sections each classification is statistically independent of previous classifications. A Markov chain is useful for modeling serial processes, in which prior events affect the likelihood of subsequent events.

For example, the likelihood of a system being in state X at time t_j, given that the system started in state Y at time t_0, can be computed with a set of operations on the state transition matrix (equivalent to multiplying the appropriate transition probabilities). Similarly, calculations on the transition matrix can determine the likelihood of observing a particular sequence of state transitions S_i -> S_j -> S_k.

Figure 3.2: The state transition matrix.

This process could be called an observable Markov model, since the outputs of the process are states corresponding to observable events in the system being modeled. A richer model is derived by adding a second stochastic process to the Markov source, such that the observable events of the system no longer correspond to the states, but rather are generated as a probabilistic function of the states. Since the states are not observable, the models are called hidden Markov models (HMMs). The state of the system at time t can be inferred through the stochastic process that produces the observable events. The latter processes are modeled with probability distribution functions, in practice Gaussians, whose parameters depend on the state. HMMs are trained for pattern classification by modeling sequences of feature vectors as observable events, and using them to estimate both the output distributions and the state-transition probabilities of the Markov chain. Here, too, the objective is MLE: find the parameters of the HMM with the maximum likelihood given the data. The number of states in an HMM and its transition structure are not functional parameters; they are selected empirically or on the basis of some intuition about the system being modeled. In speech the most commonly employed topology is a feed-forward structure (proposed by Bakis, 1976) with a beginning state on the left and a terminal state on the right. HMMs satisfy two assumptions about the stochastic processes being modeled. The state stochastic process assumes that when the model is in a given state at time t, the history before time t has no influence on future events - the so-called "first-order Markov hypothesis." The observation stochastic process assumes that neither past states nor past observations affect the present observation if the last two states are specified - the "output independence hypothesis." Because hidden Markov models produce sequences of real-valued observations, they are particularly well suited to model temporally evolving physical processes like speech.
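To make the transition-matrix computations concrete, here is a minimal sketch (NumPy assumed); the three-state chain and its probabilities are an arbitrary example, not the chain of figure 3.1:

```python
import numpy as np

# Transition matrix A: A[i, j] = P(next state = j | current state = i); rows sum to 1.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])

def sequence_probability(states, A, initial):
    """P(S_0, S_1, ..., S_T) for an observable Markov chain."""
    p = initial[states[0]]
    for prev, nxt in zip(states[:-1], states[1:]):
        p *= A[prev, nxt]
    return p

initial = np.array([1.0, 0.0, 0.0])              # start in state 0
print(sequence_probability([0, 1, 2, 2], A, initial))

# P(state = X at time t | started in state 0) comes from powers of the matrix:
print(np.linalg.matrix_power(A, 4)[0])           # distribution over states after 4 steps
```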

Principal Component Analysis

In addition to the parametric models described above, there are many non-parametric techniques for pattern classification, techniques which do not assume particular probability distributions for the features. Principal Component Analysis (PCA) is a parameter transformation technique which can be applied to certain pattern classification problems. PCA is a method of reducing n-dimensional vectors made up of statistically correlated features into smaller r-dimensional vectors of uncorrelated features. The Singular Value Decomposition (SVD) theorem tells us that the rows of a matrix of m feature vectors can be expressed as linear combinations of r orthogonal vectors (principal components in statistics), where r <= n [Stewart73]. The principal components are an alternative set of orthogonal coordinate axes, and in this new coordinate space two different variables have zero covariance. Projecting the data along the principal component with the largest variance is equivalent to finding the direction of greatest variance in the feature space. When this direction coincides with the least overlap between data clusters, PCA can perform pattern recognition. For example, PCA can classify an acoustic signal as speech or as noise, because speech samples have singular vectors with large variance relative to samples of noise [BakamidisEtAl90]. PCA differs from discriminant analysis, which attempts to project explicitly along the direction of maximum discrimination between classes.
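A minimal sketch of the reduction just described, computing the principal components by SVD (NumPy assumed; the synthetic data are illustrative only):

```python
import numpy as np

def pca_project(features, r):
    """Project an m x n feature matrix onto its first r principal components."""
    centered = features - features.mean(axis=0)
    # Rows of Vt are the orthogonal principal directions, ordered by singular value.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = s**2 / (len(features) - 1)
    return centered @ Vt[:r].T, explained_variance[:r]

rng = np.random.default_rng(1)
# Correlated 3-D data that really lives near a 1-D subspace:
t = rng.normal(size=(200, 1))
data = np.hstack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))
projected, var = pca_project(data, r=1)
print(projected.shape, var)
```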

Perceptrons

Perceptrons (also called neural networks or connectionist systems) are a family of powerful, non-parametric pattern recognition tools which have become the subject of intensive investigation in recent years. Perceptrons are distributed networks of simple processors, called artificial neurons in analogy with human brain cells (see figure 3.3). The neurons communicate via connections between them. Each unit has a state that is computed as a function of the inputs received along connections from other units. The combined activities of the units operating in parallel lead to complex behaviors.

Figure 3.3: A perceptron.

The kinds of functions governing activations, or state changes, in a neuron vary with the connectionist model. One typical such function is a weighted linear sum:

    x_j = -theta_j + sum_i y_i w_ji    (3.1)

where y_i is the state of unit i, w_ji is the weight on the connection from unit i to unit j, and theta_j is a threshold term which acts as a bias on the j-th unit's activation. In more complex networks, the new state of unit j will usually be a non-linear function of the linear input x_j. Neural networks have been applied to such diverse areas as associative memories, data compression, pattern classification, and speech recognition (for examples in the area of speech, see [RobinsonFallside90] and [BengioEtAl90]).

    1st Input   2nd Input   Output
        0           0          0
        0           1          1
        1           0          1
        1           1          0

Table 3.1: The XOR operator.

The units of a perceptron are assigned to specific groups or layers, including an input and an output layer. For pattern recognition problems, the units in the input layer have their states set according to the pattern of feature variables. The neurons then update their activation states accordingly, and the output-unit states are interpreted as a class representation. The connection weights are determined such that when all cells in the network have computed their new states as a function of their individual inputs, the activations of the output cells represent the correct classification of the input pattern. Finding a set of weights to enable the perceptron to classify effectively is another variant of the fitting problem. There are many training algorithms available for fitting the connection weights to the data. The perceptron convergence theorem showed that the weights could be automatically learned [Block62] by iteratively computing, for each input pattern, a change in w_ji according to the learning rule

    delta w_ji = epsilon (t_j - y_j) y_i

where t_j is the desired state of neuron j (found in the training data), y_j is the actual state, and epsilon is a constant representing the learning rate. Since this algorithm requires knowledge of the correct activations for all the units in order to fit the connection weights to the data, it can only be used to train two-layer perceptrons, i.e. networks containing only input and output cells. Unfortunately, as Minsky and Papert proved in [MinskyPapert69], certain functions such as topological connectedness and parity are not computable by two-layer networks. In fact, no set of weights can enable a two-layer network to model a non-linearly-separable function. A well-known example of such a function is the exclusive-or (XOR) operator (table 3.1).
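The learning rule above fits in a few lines of code. A minimal sketch (NumPy assumed; the AND problem is used only because it is linearly separable, unlike XOR):

```python
import numpy as np

def train_perceptron(X, targets, epochs=50, lr=0.1):
    """Two-layer perceptron: threshold of w.x - theta, weights updated by the delta rule."""
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for x, t in zip(X, targets):
            y = 1.0 if (w @ x - theta) > 0 else 0.0
            w += lr * (t - y) * x        # delta w_ji = eps * (t_j - y_j) * y_i
            theta -= lr * (t - y)        # the threshold adapts like a bias weight
    return w, theta

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
and_targets = np.array([0, 0, 0, 1], dtype=float)   # linearly separable, so the rule converges
w, theta = train_perceptron(X, and_targets)
print([1.0 if (w @ x - theta) > 0 else 0.0 for x in X])   # -> [0.0, 0.0, 0.0, 1.0]
```

Running the same loop with XOR targets never converges, which is exactly the limitation discussed above.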

Figure 3.4: A multi-layer neural network to compute XOR.

Multi-layer Neural Networks

Nonlinear problems can be solved by adding a layer of hidden units which allow the perceptron to build internal representations, i.e. to model higher-order relations in the feature data. Figure 3.4 gives a simple example of a network to solve the XOR problem (figure adapted from [RumelhartMcClelland86]). In the figure, the linear activation rule of equation 3.1 computes the XOR function of the states of the input units. The problem then becomes how to find the weights for hidden units, i.e. how to solve the fitting problem for multi-layer perceptrons. One solution is the error back-propagation algorithm. In this method an error estimate E is computed as the sum of squares of the output activation errors:

    E = sum_c sum_j (t_jc - y_jc)^2

where c is an index over the training patterns. This error is differentiated with respect to the weights on input connections, beginning with the output layer and working backwards. As the error propagates back through the network, it is used as a negative factor of proportionality in the weight modification rule delta w_ji. The weights are thus modified in the direction of minimal activation error. This is easily recognized as a form of gradient descent, which will be discussed in the section on optimization methods. Another way to derive the weights of a multi-layer perceptron is the Boltzmann Machine algorithm [AckleyEtAl85]. This is a stochastic method that modifies the connection weights in the direction of convergence between the conditional probability distribution exhibited by the output units of the perceptron and the desired conditional probability distribution.

The advantage of this non-parametric algorithm is that it computes the a posteriori probabilities of the outputs. Also, it is able to model the feature distributions without making any restrictive assumption about the types of distributions. However, the algorithm runs quite slowly.

Figure 3.5: Tractability of optimization problems.

3.2 Optimization Methods

As described above, machine learning problems are solved with optimization techniques, and the latter are as varied as the former. In general, the optimization problem is stated:

Maximize (minimize) the objective cost function

    f(X),  X = (x_1, ..., x_n)

subject to m constraint functions

    g_i(X) <= b_i,  i = 1, ..., m.

Optimization problems can be divided into two groups: linear/quadratic problems, for which large cases can be solved exactly; and general problems, for which the solution space must be exhaustively searched to find an optimum (and which consequently can only be solved approximately). The shaded area of figure 3.5 shows how the size of tractable optimization problems varies with the class of problem (graph adapted from [Fletcher69]). The linear problem, in which both the objective function and the constraint functions are linear combinations of the input vector X, is solved with the simplex algorithm, a subject addressed in detail by any elementary textbook of linear programming. Methods for the general category of problems include hill-climbing techniques and randomized search methods. Differentiable problems are solved with hill-climbing algorithms.

3.2.1 Hill Climbing

Nothing in general is known about the solution of the general problem in the presence of constraints. Hill-climbing methods for optimizing general problems without constraints fall into two basic categories: gradient techniques, and direct search methods. Hill-climbing is a way of incrementally improving the solution to f by exploring the solution surface and climbing (or descending) toward a maximal (minimal) solution. This approach suffers from the fact that in any sufficiently interesting problem the n-dimensional solution space has complex contours with many local extrema. Once a local solution is found, hill-climbing either terminates or must begin again from a new initial solution; worse, for continuous functions there is no general way to identify a globally optimal solution. (Of course, finite discrete problems can be solved optimally through exhaustive search, but only for cases sufficiently small to avoid combinatorial explosion.) Gradient methods apply to problems in which the gradient vector

    grad f = (df/dx_1, ..., df/dx_n)

can be computed.

Obviously this restricts the method to continuous, differentiable cost functions. Since the partial derivatives provide a measure of improvement in f as the solution X is perturbed in a particular direction, computing the gradient allows one to modify the solution at each step in the direction of fastest change toward a local extremum. Gradient techniques include conjugate gradient and quasi-Newton methods. Most weight-correction rules for perceptron training, including the back-propagation algorithm, are gradient-descent methods. So is the Baum-Welch algorithm for MLE fitting of Markov model parameters (chapter 6). Even with the advantage of a computable gradient, it can still be difficult to ensure these methods make efficient progress along the solution surface, even toward a merely local optimum. For example, in the method of steepest descent, each successive point is found by minimizing along the line from the previous point in the direction of the local downhill gradient. Steepest descent in a long, narrow, three-dimensional valley can lead to an inefficient search if the initial point does not happen to be on the plane normal to the short axis of the valley (figure 3.6, adapted from [PressEtAl88]).

Figure 3.6: Inefficiency of steepest descent. Dotted line shows desired minimization.

When no gradient is available, hill-climbing can only be guided by repeated direct evaluations of the cost function itself. These direct methods involve sectioning or bracketing an area of the solution space in which a local minimum can be found, and iteratively continuing the search within that section. Examples include golden section search and the downhill simplex method.
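A minimal sketch of gradient descent on a smooth cost function; the quadratic "long, narrow valley" is an arbitrary illustration of the behaviour just described, not an example taken from the thesis:

```python
import numpy as np

def gradient_descent(grad, x0, step=0.02, iters=200):
    """Follow the negative gradient from x0; only a local extremum is guaranteed."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x -= step * grad(x)
    return x

# f(x, y) = x^2 + 20*y^2: a long, narrow valley with its minimum at the origin.
grad_f = lambda v: np.array([2.0 * v[0], 40.0 * v[1]])
print(gradient_descent(grad_f, [5.0, 1.0]))   # -> close to [0, 0]
```

Note that the step size must be chosen against the steepest direction of the valley; too large a step makes the iteration diverge along the narrow axis, which is one face of the inefficiency illustrated in figure 3.6.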

3.2.2 Randomized Search

In contrast with hill-climbing, randomized search does not set out to thoroughly explore the vicinity of a local extremum of the cost function, but rather employs a stochastic solution generator to visit points all over the solution surface. One advantage of this approach is that it is easily applied to almost any optimization problem, finite or infinite, continuous or discrete. Clearly the chance that a randomly generated set of m solutions {(x_11, ..., x_1n), ..., (x_m1, ..., x_mn)} will contain the global or even a local extremum of f is negligible for non-trivial cases. However, there is a higher probability of yielding a solution S' whose cost lies within some acceptable margin of error of the optimum. In the case of finite, combinatorial optimization, if there are z possible solutions and e random solutions are generated with uniform probability 1/z, the probability that one of these will lie among the top l solutions is given by

    P(l; e) = 1 - (1 - l/z)^e.

The pure random search above is sometimes called the Monte Carlo method. Simulated annealing is a method of randomized optimization which is effectively a hybrid between Monte Carlo search and hill-climbing, beginning like the former and settling into the latter near the end of the search. The intriguing thing about simulated annealing is that it is the only known general optimization method that can provide global solutions.
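A minimal sketch of the hybrid behaviour described above: early on, at high temperature, almost any move is accepted; as the temperature falls, the process settles into hill-climbing. The cost function, neighbourhood move, and cooling schedule are arbitrary illustrations, not those used later in the thesis:

```python
import math
import random

def simulated_annealing(cost, neighbour, x0, t0=1.0, alpha=0.995, iters=5000):
    """Generic minimizer: uphill moves are accepted with probability exp(-delta/T)."""
    x, t, best = x0, t0, x0
    for _ in range(iters):
        cand = neighbour(x)
        delta = cost(cand) - cost(x)
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = cand
            if cost(x) < cost(best):
                best = x
        t *= alpha                      # geometric cooling schedule
    return best

# A bumpy one-dimensional cost with many local minima and a global minimum near the origin.
cost = lambda x: x * x + 10 * math.sin(3 * x)
neighbour = lambda x: x + random.uniform(-0.5, 0.5)
random.seed(0)
print(simulated_annealing(cost, neighbour, x0=8.0))
```

Plain hill-climbing started at x0 = 8.0 would stop in the nearest dip; the occasional uphill acceptance is what lets the annealed search escape it.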

3.2.3 Heuristic Search

Many discrete optimization problems can be formulated such that the solution space is a tree or graph to be expanded one node at a time. In this formulation, nodes are steps along the path to a solution. Heuristic search is an algorithm which uses domain (problem-specific) knowledge to direct the order in which nodes are visited. A useful heuristic is one that directs the search to the optimal solution before the search becomes exhaustive. The A* algorithm is a heuristic search procedure that employs an evaluation function e*(n) = g*(n) + h*(n), with h*(n) <= h(n), where g*(n) estimates the minimum cost of a solution path from the start node to node n, h*(n) estimates the minimum cost of a solution path from n to the solution, and h(n) is the true cost of the path from n to the solution. It is easy to show that if the descendants of visited nodes are queued in order of increasing e*(n), and nodes are expanded from the head of the queue, then the first completed solution path will represent an optimal solution. Graph search problems such as the traveling salesman problem naturally lend themselves to A* heuristic search. Heuristic search is obviously an improvement over exhaustive search. However, if the lower-bound estimate h* is very much lower than h(n), or the estimate is expensive to compute, the search time will be on the order of the size of the search space. The important point to note in this context is that discrete functions can be optimized using both randomized and knowledge-directed search.
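A minimal sketch of A* on a small explicit graph (Python); the toy graph and heuristic values are illustrative, and the heuristic is chosen to be admissible, i.e. it never overestimates the true remaining cost:

```python
import heapq

def a_star(graph, h, start, goal):
    """graph: node -> list of (neighbour, edge_cost); h: admissible heuristic h*(n) <= h(n)."""
    queue = [(h[start], 0, start, [start])]        # entries are (e* = g* + h*, g*, node, path)
    best_g = {start: 0}
    while queue:
        _, g, node, path = heapq.heappop(queue)
        if node == goal:
            return path, g                          # first completed path is optimal
        for nxt, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(queue, (g2 + h[nxt], g2, nxt, path + [nxt]))
    return None, float("inf")

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)], "C": [("D", 1)]}
h = {"A": 3, "B": 2, "C": 1, "D": 0}               # lower bounds on the cost of reaching D
print(a_star(graph, h, "A", "D"))                  # -> (['A', 'B', 'C', 'D'], 4)
```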

3.3 Hand-Tuning of Machine Learning Models

The parameters of machine-learning models are fitted to training data using appropriate methods of optimization. For example, the statistical distributions of hidden Markov models and the connection weights of perceptrons are both estimated by gradient descent algorithms. In both learning models, however, there are structural parameters which are set by hand, either empirically or because of some understanding of the specific learning problem being modeled. The designer of a neural network has to decide the number of neurons assigned to each layer, and the number and type of connections. Suppose the network is intended to map a set of 5 features to one of 3 classes: small, medium, and large. Since the solution to the learning problem will compute a transformation from the input 5-coordinate space to three values along a single dimension, one successful network topology might be the 4-layer perceptron in figure 3.7. Unit T represents the projection of the input space onto a single dimension. Markov models have topology: the number of states and the transition paths between them. These presumably reflect the hidden states and visible events of the process being modeled. Yet in many practical cases, including ASR, the underlying discrete states of the system are unknown, if they exist at all.


Figure 3.7: Adapting perceptron architecture to problems.

Part II Fundamentals of Speech Processing

Chapter 4 Speech Production and Acoustic-Phonetics

4.1 Phonemes and Allophonic Variability

The human vocal tract can produce an infinite variety of sounds. However, every language can be characterized by a basic set of abstract linguistic units called phonemes. The phonemes are the smallest set of sounds adequate to represent the phonology of the language. Typically there are 20-50 phonemes in a language, and they constitute an alphabet of sounds to uniquely describe words in the language. Table 4.1, borrowed from [LeeEtAl90], lists the phonemes of the English language with examples. The physical sound produced when a phoneme is pronounced is called a phone. Phonemes are discrete units; however, the vocal tract is not a discrete system. The same phoneme will be produced in a slightly different way by each speaker, and by the same speaker with each articulation. Thus, the phoneme is only an exemplar corresponding to an infinitely large class of phones. Another source of variability is context. The continuously spoken speech signal is not a concatenated sequence of discrete phones with a 1-1 correspondence to phonemes. Because of the smoothly changing nature of the articulatory process, a phone is affected by its proximity to the phones pronounced both before and after. This effect is called coarticulation. The term allophone describes a class of phones corresponding to a particular variant of a phoneme.

4.2 Speech Production by Phoneme Group

There are two sources of speech sounds: vocal cord vibration, and frication, or turbulent noise produced by forcing air past a constriction in the vocal tract. Phoneticians classify speech sounds according to manner of articulation, place of articulation, and voicing. Manner of articulation is concerned with airflow: the paths it takes, and the degree to which it is impeded by vocal tract constrictions. Place of articulation is the point of narrowest constriction. Voicing means there is a quasi-periodic vibration of the vocal cords as a phone is produced. Phonemes can be divided into phonetic groups and subgroups which share acoustic characteristics. These relations are summarized in tables 4.2 and 4.3.

Phoneme  Example    Phoneme  Example    Phoneme  Example
iy       beet       l        led        t        tot
ih       bit        r        red        k        kick
eh       bet        y        yet        z        zoo
ae       bat        w        wet        v        very
ix       roses      er       bird       f        fief
ax       the        en       mutton     th       thief
ah       but        m        mom        s        sis
uw       boot       n        non        sh       shoe
uh       book       ng       sing       hh       hay
oy       boy        d        dad        zh       measure
aw       bough      g        gag        dx       butter
ow       boat       p        pop        el       bottle
ao       bought     ch       church     sil      -
aa       cot        jh       judge      cl       -
ey       bait       dh       they       vcl      -
ay       bite       b        bob        epi      -

Table 4.1: English phonemes.

The two main groupings are vowels and consonants. Vowels are voiced phonemes produced by vibrating the open vocal tract. Consonants are produced with a relatively narrow constriction at one of eight regions in the vocal tract; thus they are closely associated with place of articulation. For example, English consonants comprise labials, produced by the lips; dental sounds; alveolar sounds, produced by the tongue near the alveolar ridge (figure 4.1 - 6); palatovelar sounds, produced by the tongue near the palate; and glottal sounds, produced by closed or constricted vocal folds. The consonants are usually grouped according to their manner of articulation. English consonants come in five such groups: plosives, fricatives, nasals, liquids, and affricates. These groups are acoustically dissimilar. Within the consonant groups, phonemes are distinguished by voicing and place of articulation. Plosives (also called stops) come in two groups: voiced (b, d, g) and unvoiced (p, t, k). Plosive sounds consist of a building up of pressure behind a total constriction, followed by a release. The point of constriction for a p or b is the lips; for a t or d, the tongue pressed to the hard palate; and for a k or g, the tongue pressed to the soft palate.

Figure 4.1: The vocal tract. (from [DemoriEtAl90])


Table 4.2: Vowel classification.

Fricatives also come in voiced and unvoiced groupings: v, dh, z, and zh are voiced; f, th, s, and sh are unvoiced. Voiced fricatives are produced by vocal cord vibration and excitation at the glottis (figure 4.1 - 14). The air flow becomes turbulent at the point of constriction; for example, z is articulated at the upper incisors. The unvoiced fricatives are produced by a steady flow of air from the lungs, past the open glottis, to the vocal tract constriction. The nasals m, n, ng, and en are always adjacent to a vowel. They are caused by a closing of the oral cavity and opening of the velum (figure 4.1 - 5) during the articulation of the preceding vowel, effectively nasalizing it. The vocal tract constricts at some point; the place of articulation distinguishes the nasal consonant. Liquids (also called semi-vowels or glides) are produced by a constriction in the vocal tract smaller than that of vowels, but still large enough to avoid frication. This group of consonants, which includes w, y, r, l, and el, is distinguished by a slower rate of articulatory movement. The affricates ts, ch, and jh are plosive/fricative pairs: t-s, t-sh, and d-zh respectively. The affricates are often modeled as single phonemes because of the relatively short duration of the trailing fricative. The vowels are voiced, unless whispered, and are louder than consonants. English contains about 15 different vowels. The characteristics of the different vowels are shaped by the location of the tongue, position of the jaw, and the degree of lip rounding. The vowels can be grouped in space according to tongue position: back, front, and central. They are also grouped as ordinary vowels, e.g. ih and ah, and diphthongs such as oy and ay. The diphthongs can be thought of as two different ordinary vowels spoken in sequence. Phonetically, they are realized as a changing vowel sound in which the tongue and lips move between two vowel positions.

Manner of Articulation    Phonemes
liquid                    w, y, r, l, el
nasal                     m, n, ng, en
fricative (voiced)        v, dh, z, zh
fricative (unvoiced)      f, th, s, sh
plosive (voiced)          b, d, g
plosive (unvoiced)        p, t, k
affricate                 ch, jh

Table 4.3: Consonant classification.

Chapter 5 Analysis of the Speech Signal

5.1 The Acoustic Analyser Module

The previous text mentions "features" or "parameters" used to train the machine learning model, without giving much indication what these are. This chapter is about the specific feature data we use to train statistical models of human speech. Figure 5.1 illustrates the flow of information along the speech channel. The intended utterance of the speaker becomes an acoustic signal s(t), generated by the articulatory organs of the vocal tract. This signal is received and converted by the auditory mechanisms of the listener's inner ear into a pattern of nerve-cell firings. Within the auditory perceptual and language centres of the listener's brain, this received signal r(t) is converted into a sequence of linguistic symbols S. These symbols might be words, or subword units (or non-verbal locutions). The listener then integrates and interprets these symbols as a sentence. The labels along the top of the figure describe the human modules of speech perception. The bottom labels describe the corresponding machine modules of an automatic speech recognition system. An acoustic signal is transformed into an electrical waveform by a microphone, and then into a sequence of numerical samples by an analog-to-digital converter. The sampled waveform must then be fed to an acoustic analyser in order that a pattern, P, of acoustic features sufficient for its interpretation be extracted. These features are passed to the final module, the linguistic symbol generator, which matches acoustic patterns to linguistic symbols. Chapter 6, on hidden Markov models, discusses our choice for a symbol generator. This chapter is about the acoustic analyser, or feature extractor. In both human and automatic speech recognition, the acoustic features used to discriminate among linguistic symbols are correlates of articulatory features. In Chapter 4's brief description of speech production, a range of relevant articulatory features were introduced. This chapter describes the processing of the acoustic signal into various compressed representations which capture articulatory features and are used to train speech models.

Figure 5.1: Transformation of information along the speech channel.


The kinds of analysis include measures of signal energy, observed in the time domain; linear predictive coding (LPC) based analysis; Mel-filtered cepstral coefficients (MFCC); and perceptual linear predictive (PLP) analysis. The latter three are computed in the frequency domain.

5.2 Signals and Spectra

Signals have many properties including amplitude, periodicity, duration, and energy. Energy is a measure of the signal intensity over an interval of time.
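A minimal sketch of such an energy measure (the quantity referred to later as equation 5.1, whose exact form is not reproduced here) might sum squared sample amplitudes over an analysis interval:

```python
# A minimal, assumed form of short-time signal energy: the sum of squared
# sample amplitudes over an analysis interval of the digitized waveform.
import numpy as np

def frame_energy(x, start, length):
    frame = np.asarray(x[start:start + length], dtype=float)
    return float(np.sum(frame ** 2))

fs = 16000
signal = np.sin(2 * np.pi * 440 * np.arange(0, 0.02, 1 / fs))  # 20 ms of a 440 Hz tone
print(frame_energy(signal, 0, 160))   # energy of the first 10 ms (160 samples)
```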

The time waveform of an acoustic signal is a plot of the signal amplitude versus time. Although the time waveform contains all signal data, the information is encoded in a form not easy to interpret. Two speech samples that sound the same to a human being may have time waveforms that look quite different. Spectral analysis uses the Fourier transform to represent signals in terms of frequencies. The production and perception of speech sounds are more effectively described in frequency terms, because phonetic features are more apparent in the frequency domain than in the time domain. The spectrogram or spectral display converts the two-dimensional waveform into a three-dimensional pattern: amplitude versus frequency versus time. Time is the horizontal axis, frequency is the vertical, and amplitude is denoted by the darkness of the display. Figure 5.2 contains the time waveform and spectrogram of a speech sample. The waveform shows the intensity, periodicity and duration of speech segments. The spectrogram shows the distribution of speech energy as a function of frequency.

Figure 5.2: Waveform and spectrogram of an acoustic sample: "yes..or no?".

Voiced sounds are produced when the vocal folds open and close rapidly, generating a quasi-periodic excitation of the vocal tract. The rate of vibration is called the fundamental frequency. The tract, acting as a filter, increases the energy at certain sound frequencies while attenuating others. These spectral energy peaks, called formants, show as darkened frequency bands across a spectral display. In the figure, the formants of the vowels e and o can be seen as 4 dark horizontal bands across a shifting range of frequencies. The distribution and movement (in time) of the formants and energy troughs, or resonances and anti-resonances, appear to be the primary acoustic cues to phonemes for the human ear [O'Shaughnessy87]. They also work well for the purposes of automatic speech recognition.

5.3 Acoustic Properties of Phonemes

What follows is a brief survey of the principal acoustic properties of the different phoneme groups. These properties appear to be the acoustic cues by which phones are recognized by the human ear. They may be described as a necessary but not sufficient set of cues for recognition: phonetic variability due to speaker, dialect, context, and random variation necessitates that higher-level knowledge be employed to map the signal to a phonemic string. Vowels are characterized by substantial energy in frequencies up to 3500 Hz. Energy is concentrated in spectral lines at multiples of the fundamental frequency. Vowels can be distinguished from one another by the locations of their first three formants. In general, when the tongue moves forward in the articulation of a vowel, the second formant rises. As the tongue moves higher, the first formant decreases. Lip rounding lowers the formants. The variances of formant frequency distributions about the means for each vowel are speaker-independent [DemoriEtAl90]. Nasals are manifested as a sharp change in the intensity and spectral features of a vowel, resulting from the entry of the air flow into the nasal cavity. They are marked as well by a low-frequency spectral peak at around 300 Hz. The liquids are distinguished from adjacent vowels by a shift in formants. If the formants are labelled in order as F1, F2, ..., the phoneme l will be marked by a lower F1 and F2 and a higher F3 than the adjacent vowel. Liquid formant transitions are slower than those of other consonants. The waveforms of fricatives and plosives are very different from the periodic, or sonorant, phonemes above. They are aperiodic, much less intense, and contain most of the energy at high frequencies. Voiced fricatives often have simultaneous noise and periodic sound with some low-frequency energy at the onset of frication. Unvoiced fricatives are shorter in duration. Plosives are acoustically characterized as a prolonged silence followed by an abrupt increase in amplitude at the moment of release. They are transient rather than steady-state phenomena; in this they differ from other phonemes. Release is accompanied by a frication burst. The interval between release and the onset of voicing for the following vowel is longer for unvoiced (30 to 60 ms) than for voiced (10 to 30 ms) plosives. As mentioned before, the vocal tract is not a discrete mechanism. The various organs of articulation move smoothly and slowly between positions for different phonemes. They often do not reach the target position due to the contextual effect of neighboring phones. This coarticulation effect means that allophones have different spectra from the target phonemes, which further complicates the recognition task.

5.4 Acoustic Feature Extraction

The speech signal can be considered a non-stationary stochastic process. Thus, its spectral analysis must take into account variability in both time and frequency. As discussed earlier, the signal is produced by articulatory organs moving from one position to another with intrinsic mechanical time constraints. Therefore, it is possible to define a stationarity interval within which the signal can be considered time-invariant. During such intervals, standard spectral analysis methods (see following sections) can be applied. The main goal of the feature extraction step is the computation of a sequence of feature vectors that provide a compact representation of the relevant information in the input signal [Demori98].

5.4.1 Signal Preprocessing

Prior to spectral analysis, there are steps of preprocessing which improve the effectiveness of the feature extraction. Preprocessing may be aimed at noise reduction (one such technique is called spectral subtraction), or at enhancement of formant visibility in the power spectrum. A characteristic of the power spectrum is that higher-frequency formants have lower energy. Preemphasis is a compensatory technique, realized by applying a fixed first-order FIR filter with the z-transfer function H(z) = 1 - a z^{-1}, where a is the preemphasis parameter (usually 0.95).
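The preemphasis filter H(z) = 1 - a z^{-1} can be applied directly to the sample sequence; the sketch below assumes a = 0.95 and a synthetic input signal.

```python
# A minimal sketch of first-order preemphasis: y[n] = x[n] - a*x[n-1].
import numpy as np

def preemphasize(x, a=0.95):
    x = np.asarray(x, dtype=float)
    y = np.copy(x)
    y[1:] = x[1:] - a * x[:-1]   # boost high-frequency content, attenuate low
    return y

x = np.sin(2 * np.pi * 100 * np.arange(0, 0.01, 1 / 16000))  # synthetic 100 Hz tone
print(preemphasize(x)[:5])
```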

5.4.2 Spectral Analysis

True periodic signals are called sinusoids. Speech sounds are analyzed in terms of their sinusoidal components via Fourier series. A quasi-periodic signal can be expressed as a linear combination of weighted sinusoids:

x_a(t) = Σ_{l=-∞}^{∞} c_l exp(j2πlt/T),

where T is the signal period, and c_l is a Fourier series of coefficients:

c_l = (1/T) ∫_{T_0}^{T_0+T} x_a(t) exp(-j2πlt/T) dt.

[j is the imaginary unit; T_0 is any constant. exp(j2πθ) = cos(2πθ) + j sin(2πθ).] As well, aperiodic signals can be modeled as sums of weighted sinusoids:

x_a(t) = ∫_{-∞}^{∞} X(f) exp(j2πft) df,

where X(f) is the Fourier transform of x_a(t). The complex frequency function X(f) determines the frequency content of the signal x_a(t). |X(f)| is called the spectrum. X(f) is defined

X(f) = ∫_{-∞}^{∞} x_a(t) exp(-j2πft) dt.        (5.5)

On a computer, the signal x_a(t) is represented as a finite, discrete time-sequence of digital samples, x(n) = x_a(nT), where T is the period of the sampling rate. This leads to a discrete analog of equation 5.5, the N-point windowed Discrete Fourier Transform (DFT):

X̂(f) = Σ_{n=0}^{N-1} x(n) exp(-j2πfnT),

where N is the size of an analysis segment, or window. Clearly, the sampling rate and window size will affect the accuracy of the approximation X̂(f). If N is infinite, the approximation approaches the Fourier transform as the sampling period T becomes infinitely small. The Nyquist sampling theorem states that the sampling rate need only be twice the highest frequency contained in the signal. In other words, if X(f) = 0 for |f| ≥ B, then the sampling rate must satisfy

1/T ≥ 2B

(B is called the signal bandwidth). The evaluation of X̂(f) requires another discrete sampling, this time of the real variable f. Because X̂(f) is a periodic function, only values from 0 to the period 1/T need be sampled. It turns out that setting the frequency sampling interval to 1/(TN) allows for a fast computation of the transform. This means that the number of frequency samples computed is equal to the window size N. Computation of the DFT is performed by means of a well-known divide and conquer algorithm called the Fast Fourier Transform (FFT). N is thus a crucial variable in spectral analysis of the speech signal. Small values for N, meaning short windows and DFTs using few points, give poor frequency resolution. On the other hand, they give good time resolution. A window of samples is examined for evidence of some "current" aspect of the speech process. It must be short enough so that the acoustic properties of interest do not vary significantly within the window. While a longer window will allow spectral features to be evaluated more accurately, those features will be averaged over a longer evolution of the signal, and rapid spectral changes that are key for recognition may be smoothed away. Most speech analysis uses a fixed window duration in the range of 10-25 ms [O'Shaughnessy87]. To avoid undesirable edge effects, the window's samples are multiplied by a rounded window, with tapered edges, such as the Hamming window. This has the effect of de-emphasizing the importance of the samples near the window's edges on the spectral computation. Other limitations of estimation methods based on the FFT include:

- The position of the analysis window falls randomly within the pitch period of voiced sounds. This can cause fluctuations in the spectral estimate, especially for short windows.

- The sample spectrum is a biased estimator of the power spectral density due to the finiteness of the window length.

Although spectrographic analysis suffers these drawbacks, it is widely used for speech analysis because the time-frequency patterns it produces have proved useful [Demori98].
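The framing and windowing procedure described above can be sketched as follows; the frame length (20 ms), frame shift (5 ms) and FFT size are illustrative choices, not necessarily those used in the thesis experiments.

```python
# A minimal sketch of windowed short-time analysis: the signal is cut into
# overlapping frames, each frame is tapered by a Hamming window, and an FFT
# yields one magnitude spectrum per frame.
import numpy as np

def short_time_spectra(x, frame_len=320, frame_shift=80, n_fft=512):
    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(x) - frame_len + 1, frame_shift):
        frame = x[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame, n_fft)))
    return np.array(spectra)           # shape: (number of frames, n_fft//2 + 1)

fs = 16000
t = np.arange(0, 0.1, 1 / fs)
x = np.sin(2 * np.pi * 500 * t)        # synthetic test signal
print(short_time_spectra(x).shape)
```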

5.4.3 Auditory-Based Analysis

Auditory or ear-model analysis attempts to model the perceptual processes of the inner ear through a frequency-selective linear model implemented with a bank of filters. These filters are spaced along the frequency axis based on the idea of critical bands, i.e. ranges of sound frequency roughly corresponding to the tuning curves of auditory neurons. A critical band is a range for which experiments have shown auditory perception will abruptly change as a sound stimulus is modified to have frequency components beyond the band. The most commonly used critical band scales are the Mel scale and the Bark scale. These scales space filters linearly up to about 1 kHz, and use a logarithmic spacing above 1 kHz.

5.4.4 Linear Predictive Coding Analysis

The basic idea behind the linear predictive coding (LPC) model for speech is that a given sample of speech s_i can be approximated as a linear combination of the preceding n samples, as follows:

s_i = Σ_{k=1}^{n} c_k s_{i-k} + G x_i,        (5.7)

where x_i is an excitation term that compensates for the error of the approximation, and G is the gain of the excitation. The coefficients c_1, ..., c_n are assumed to remain constant over the frame of analysis. Formulation 5.7 leads to a speech model which is both computationally and analytically tractable, and has been widely put to use for many years in speech coding, synthesis and recognition. The LPC model is useful because it provides a good model of the speech signal, particularly during the near steady-state regions of voiced phonemes like vowels. The all-pole model of LPC is a good approximation of the vocal tract "spectral envelope" in these regions, and it permits an analytical separation of the source and vocal tract models, as follows. By applying the z transform to relation 5.7 we derive

S(z) = Σ_{k=1}^{n} c_k z^{-k} S(z) + G X(z),

and a transfer function, relating input and output:

H(z) = S(z) / (G X(z)) = 1 / (1 - Σ_{k=1}^{n} c_k z^{-k}).        (5.9)

Equation 5.9 has a straightforward interpretation. An excitation source, x, scaled by the gain G, serves as input to the all-pole system H(z) which, acting as a digital filter, produces the speech signal s. This model is used for speech synthesis by feeding either a periodic train of pulses (for voiced segments) or a random noise sequence (for unvoiced sounds) to the filter. The coefficients of the filter model the characteristics of the vocal tract. The prediction error e(i) is the difference between the true signal and the linear combination of preceding speech samples:

e(i) = s_i - Σ_{k=1}^{n} c_k s_{i-k}.

LPC analysis is performed on successive frames of speech, usually 5-20 msec long. The basic problem is to determine the set of predictor coefficients that minimizes the mean-squared prediction error over a given speech frame. For the purpose of speech recognition, the LPC predictor coefficients do not themselves constitute a useful feature set. They are usually transformed, for example by the autocorrelation method, into equivalent representations that are effective. One such set of features is called the reflection coefficients [RabinerJuang93]. Another useful set are the LPC-derived cepstral coefficients. The experiments in this thesis are based on cepstral coefficients derived from the logarithm of the signal magnitude spectrum (section 5.4.6), or from perceptual linear prediction analysis, described next. These have been shown in many studies to be superior to LPC for speech recognition.
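For illustration, the autocorrelation method mentioned above can be sketched with the Levinson-Durbin recursion, which also yields the reflection coefficients as a by-product; the frame content and the order (8) are assumptions for the example, not values taken from the thesis.

```python
# A minimal sketch of LPC analysis by the autocorrelation method: compute the
# frame autocorrelation, then solve for the predictor coefficients c_1..c_n
# that minimize the mean-squared prediction error (Levinson-Durbin recursion).
import numpy as np

def autocorr(frame, order):
    frame = np.asarray(frame, dtype=float)
    return np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient at step i
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)                # residual prediction-error energy
    # predictor coefficients c_k such that s[i] ~ sum_k c_k * s[i-k]
    return -a[1:], err

frame = np.hamming(320) * np.sin(2 * np.pi * 300 * np.arange(320) / 16000)
coeffs, residual_energy = levinson_durbin(autocorr(frame, 8), 8)
print(coeffs)
```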

5.4.5 Perceptual Linear Predictive Analysis

The all-pole model of the speech spectrum estimated by LPC analysis can be viewed as a means of obtaining the smoothed spectral envelope of the speech signal. In [Hermansky90] it is argued that the disadvantage of this model is that the all-pole model approximates the short-term power spectrum equally well at all frequency bands. This violates important known properties of auditory perception:

- Hearing is most sensitive in the middle frequency range of the audible spectrum.

- The spectral resolution of human hearing begins to attenuate above 800 Hz.

LP analysis has other problems. The estimated LP pole frequency is often shifted in the direction of the nearest harmonic peak. The inconsistency of estimated LP poles limits the use of LP formant extraction. This is particularly true for female voices, in which formants are often observed to merge [HermanskyEtAl85]. Perceptual Linear Predictive (PLP) analysis modifies the spectrum to mimic characteristics of the human auditory system. This modified spectrum is then approximated by an all-pole model (the linear prediction polynomial) using an autocorrelation method, as in standard LPC analysis. The method consists of the following steps:

1. The speech signal is blocked into 10 ms frames and Hamming-windowed, using a 20 ms window.

2. A 256-point FFT is performed for each window, and the power spectrum is computed.

3. The spectrum is warped along its frequency axis using the Bark (critical band) scale. This involves convolution of the spectrum with the simulated critical band masking curve. The result is then sampled in 1-Bark intervals.

4. The sampled spectra are preemphasized with a simulated equal-loudness curve, to approximate the disparate sensitivity of human hearing at different frequencies.

5. As an approximation to the power law of hearing, a cubic root amplitude compression is computed.

This simulates the non-linear relation between the intensity of sound and its perceived loudness.

6. Finally, the spectrum is approximated by an all-pole model using the autocorrelation method of LPC analysis. 8 LPC coefficients are derived.

In the experiments of chapters 10 and 11, the resulting autoregressive coefficients are further transformed to produce cepstral coefficients of the all-pole model. This signal analysis is called PLP-Cepstra. The derivatives of the PLP-Cepstral coefficients, calculated using 8-sample linear regression, are added to the feature set, as well as the signal energy and its derivative, forming 18-dimension feature vectors. Another transformation employs time-filtering of the cepstral trajectories to obtain a representation that is particularly robust over the telephone, and in other situations where the training and test environments vary. This feature set is called RASTA-PLP [HermanskyEtAl91].

5.4.6 Mel Filtered Cepstral Analysis

The features we employed for the purposes of training some of our HMM speech models were based not directly on the spectrum, but rather on the inverse transform of the logarithm of the speech spectrum |X(f)|, called the real cepstrum. In terms of the FFT computation, the cepstrum is defined as

c(n) = (1/N) Σ_{f=0}^{N-1} log|X(f)| exp(j2πfn/N),

where N is the size of the FFT. The cepstral coefficients are the coefficients of the Fourier transform representation of the log magnitude spectrum. These have been shown to be a better feature set for speech recognition than the LPC coefficients. The Mel filters are a set of logarithmically spaced, triangular filters designed to bias the feature set toward perceptually important critical bands. More specifically, we use the first 12 Mel-filtered cepstral coefficients (MFCC). These are given by

MFCC(n) = Σ_{k=1}^{K} log(E_k) cos(πn(k - 1/2)/K),   n = 1, ..., 12,

where E_k is the output energy of the kth Mel filter and K is the number of filters.

The following spectral analysis was performed for extraction of features from speech signals in our research. The speech data was digitally sampled at 16 kHz, and preemphasized with a factor of 0.95. Every 5 ms a 256-point FFT computation was performed (the window duration was 20 ms). The 12 Mel coefficients were calculated; these formed the first part of the feature vector for a speech segment. The spectral information was supplemented by additional information about the rate of change of spectral features, as represented by the difference cepstrum, computed as the difference between the cepstral coefficients of neighboring analysis windows.

All together, 24 Mel-based cepstral coefficients were extracted from each window of the speech signal.
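A rough sketch of this Mel-cepstral front end is given below; the number of Mel filters, the filter-bank frequency range and other details not stated in the text are assumptions, and the frames are random placeholders rather than real speech.

```python
# A minimal sketch of MFCC extraction: 256-point FFT power spectrum, a bank of
# triangular Mel-spaced filters, log filter-bank energies, a DCT giving 12
# cepstral coefficients, and simple frame differences as delta coefficients.
import numpy as np

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_inv(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=256, fs=16000):
    edges = mel_inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)     # rising edge of the triangle
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)     # falling edge
    return fb

def mfcc(frame, fb, n_ceps=12):
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), 256)) ** 2
    log_e = np.log(fb @ power + 1e-10)             # log Mel filter-bank energies
    n = np.arange(n_ceps).reshape(-1, 1)
    k = np.arange(len(log_e)) + 0.5
    dct = np.cos(np.pi * n * k / len(log_e))       # DCT basis
    return dct @ log_e

fb = mel_filterbank()
frames = [np.random.randn(256) for _ in range(5)]  # placeholder 16 ms frames
ceps = np.array([mfcc(f, fb) for f in frames])
deltas = np.vstack([ceps[1:] - ceps[:-1], np.zeros((1, ceps.shape[1]))])
features = np.hstack([ceps, deltas])               # 24 Mel-based coefficients per frame
print(features.shape)
```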

5.4.7 Waveform Analysis

Time waveform analysis is used to supplement the cepstral analysis of the previous section. (Indeed, it is simpler and more intuitive to examine the untransformed signal for features relevant to discrimination of speech segments. Much early work involved deriving measures directly from the time waveform, e.g. zero-crossing rates.) The frequency-based features are typically supplemented with a measure of energy and its derivatives, as given by equation 5.1 and the approximate energy rate-of-change (ΔE) below:

ΔE(t) = R[E(t - Δt), ..., E(t + Δt)]        (5.14)

R is a linear regression of m (in our case 9) successive samples. These two measures, added to the 24 Mel coefficients, make up the complete observation vector. Thus, the acoustic analyzer created a 26-entry observation vector to represent the features of each speech segment. This constitutes the acoustic analysis used in the training and testing of speech models for the experiments to be described in chapters 8 and 12.

Chapter 6 Training and Decoding with Acoustic Models

A first-order Markov chain is a stochastic process which generates sequences of discrete symbols. It consists of a set of states, and the transitions between them. The system makes transitions from one state to the next in the manner of a non-deterministic finite state automaton, but the state transitions are governed by statistics instead of rules. For each state the process has a set of probabilities associated with transitions to other states. A Markov chain is useful for modeling serial processes, in which prior events affect the likelihood of subsequent events. For example, the likelihood of a system being in state x at time t_i, given that the system started in state y at time t_0, can be computed with a set of operations on the state transition matrix (in effect multiplying the initial and subsequent transition probabilities). Other simple calculations can determine the likelihood of observing a particular sequence of transitions. The Markov chain could be called an observable Markov model, since the outputs are state symbols corresponding to observable events in the system being modeled. A richer, second-order Markov model is derived by embedding another statistical model in the Markov chain, such that the observable events of the system do not directly correspond to the states, but are generated instead as a probabilistic function of the states. Since the states are not observable, the second-order models are called hidden Markov models (HMMs). The state of an HMM system at (non-initial) time t is not known in general, but can be estimated statistically from the observed chain of events. This doubly-embedded statistical structure renders HMMs capable of modeling non-linear relations between the features and the feature-space classes. HMMs are trained for automatic speech recognition by casting acoustic feature vectors as observable events, and using them to estimate both the output distributions and the state-transition probabilities of the Markov source. Because HMMs implicitly assume that events near in time are statistically dependent, they are well suited for modeling the acoustic-phonetic patterns involved in ASR. HMMs are also useful because they have a convenient framework for dealing with duration variability.

Figure 6.1: Hidden Markov model for a phoneme

Figure 6.1 is a typical HMM topology for modeling phonemes. Each transition has an associated probability, and an output distribution function for generating observation vectors. A transition from one state to another causes the production of an observation vector. The HMM's feed-forward structure models the stages of evolution of the speech unit. The skip transitions between non-adjacent states model the durational variability of the phone. The number of parameters in the overall speech model depends on the number of distinct statistical distributions. Since the accuracy with which the model parameters are trained depends on the amount of training data, which is usually limited, it is useful in practice to reduce the dimensionality of the fitting problem by sharing distributions between related transitions. The outgoing transition probabilities for a given state S must sum to 1. The observation vectors may be composed of discrete or continuous variables. The distribution d(X) associated with a transition t must satisfy

Σ_X d(X) = 1, if X is discrete;   ∫ d(X) dX = 1, if X is continuous.

A discrete distribution function may take the form

d(X) = p(X|d),   X ∈ {0, 1, ..., K - 1}.

X is a discrete scalar which takes on one of K different values. A more robust discrete model employs multiple codebooks to reduce quantization error:

d(X) = Π_{c=1}^{N} p(x_c | d),

where N is the number of codebooks, x_c is the cth component of X, and x_c ∈ {0, 1, ..., K_c}, where K_c is the size of the cth codebook. Continuous-distribution HMMs use N-dimensional multivariate Gaussians:

d(X) = (2π)^{-N/2} |C|^{-1/2} exp(-(X - μ)' C^{-1} (X - μ) / 2),

where N is the length of vector X, μ is the mean vector of distribution d, and C is its covariance matrix. Unfortunately, acoustic feature distributions are not unimodal and can't be accurately modeled by Gaussians. Better approximations are possible by use of finite mixtures of Gaussian densities [NeyNoll88]. A mixture distribution is a weighted sum of K distributions:

d(X) = Σ_{k=1}^{K} w_k d_k(X),

where Σ_{k=1}^{K} w_k = 1. Mixture distributions are easily implemented by having several parallel transitions between two states, where each transition is assigned one Gaussian probability distribution. There are also hybrid models called semi-continuous HMMs [HuaJac89], which combine discrete probabilities with continuous densities, in an effort to combine the efficiency of the former with the accuracy of the latter.
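The mixture output distribution can be evaluated as a weighted sum of Gaussian densities; the sketch below uses diagonal covariances and made-up parameter values.

```python
# A minimal sketch of a mixture-of-Gaussians output density: a weighted sum of
# K multivariate normal densities with diagonal covariances (weights sum to 1).
import numpy as np

def gaussian(x, mean, var):
    d = len(x)
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.prod(var))
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var))

def mixture_density(x, weights, means, variances):
    return sum(w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances))

x = np.array([0.2, -0.1])
weights = [0.7, 0.3]
means = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
print(mixture_density(x, weights, means, variances))
```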

6.1 Basic HMM Computations

Given an HMM parameter set H, a model m, and a length-τ observation vector sequence y_τ = y_1, ..., y_τ, the quantity evaluated most often is P(y_τ|m), the probability of observations y_τ being generated by m. P(y_τ|m) can be computed as

P(y_τ|m) = Σ_{paths} Π_{i=1}^{L_path} q_{t_{path,i}} b_{t_{path,i}}(y_i),        (6.1)

where L_path is the length of the transition sequence path, t_{path,i} is the ith transition in the path, q_t is the probability of transition t, and b_t is the distribution belonging to transition t. Equation 6.1 states that the probability of the model generating y_τ is obtained by summing the conditional probability of y_τ over every possible path in the model multiplied by the a priori probability of that path. (We assume only paths long enough to generate the observation sequence are considered.) The computation of Equation 6.1 is exponential in τ. Fortunately, however, there is an efficient recursive procedure to estimate the desired probabilities. To illustrate HMM procedures, we define the following quantities:

α_l(i) = P(y_1, ..., y_l, and the path ends in state S_i at time l | m)

β_l(i) = P(y_{l+1}, ..., y_τ | the path is in state S_i at time l, m)

where S_i is the state producing the observation y_l.

Figure 6.2: A trellis structure for alpha/beta computations.

α_l(i) is the probability that the model m generates the observations y_1, ..., y_l using a path ending in state i. β_l(i) is the probability that the model m generates the observations y_{l+1}, ..., y_τ using a path beginning in state i. Thus

P(y_τ|m) = Σ_i α_l(i) β_l(i)   for any column l.

The computation of α_l(i) and β_l(i) involves an implied data structure called a trellis (figure 6.2). Column l of the trellis corresponds to time l and observation y_l. Row i corresponds to the ith state in the HMM. α_l(i) and β_l(i) are computed recursively, column by column, α_l(i) starting in column 0, and β_l(i) starting in column τ. Some transitions of an HMM may have no distribution associated with them, because they model changes of state which produce no observation. These are called empty transitions. α_l(i) and β_l(i) must be computed in two parts: a sum over empty transitions and a sum over full ones. For α_l(i), each entry in the trellis is computed, in increasing order of column numbers and state numbers, as follows:

a) Initialization:

α_0(i) = 1 if i = 0; 0 otherwise.

b) Recursion:

α_l(i) = Σ_{t full, r_t = i} α_{l-1}(l_t) q_t b_t(y_l) + Σ_{t empty, r_t = i} α_l(l_t) q_t        (6.6)

where l_t and r_t are the origin and destination states, respectively, of transition t. For β_l(i), each entry in the trellis is computed, in decreasing order of column numbers and state numbers, as follows:

a) Initialization:

β_τ(i) = 1 if i = S_F (the final state); 0 otherwise.

b) Recursion:

β_l(i) = Σ_{t full, l_t = i} q_t b_t(y_{l+1}) β_{l+1}(r_t) + Σ_{t empty, l_t = i} q_t β_l(r_t)        (6.8)

Both recursions 6.6 and 6.8 are linear in the sequence length τ. Define S(t, y|m) as the probability that observation sequence y was generated by model m, using a path in which transition t was taken at time l. A transition at time l is a transition that reaches a node in the lth trellis column. S(t, y|m) may be efficiently computed as

S(t, y|m) = α_{l-1}(l_t) q_t b_t(y_l) β_l(r_t).

This quantity is central to most HMM computations.
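The recursive procedure can be illustrated with a small forward (alpha) computation; unlike the formulation above, which attaches distributions to transitions and allows empty transitions, this sketch uses the more common state-emission form, with made-up toy parameters.

```python
# A minimal sketch of the alpha (forward) recursion over a trellis for a
# discrete-observation, state-emission HMM without empty transitions:
# alpha[i] is the probability of generating the observations seen so far
# along a path ending in state i.
import numpy as np

A = np.array([[0.7, 0.3],      # transition probabilities between two states
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],      # P(symbol | state) for symbols {0, 1}
              [0.3, 0.7]])
pi = np.array([0.9, 0.1])      # initial state probabilities

def forward(obs):
    alpha = pi * B[:, obs[0]]               # first trellis column
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]       # one trellis column at a time
    return alpha.sum()                      # P(observations | model)

print(forward([0, 0, 1, 1]))
```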

6.2 Training Algorithm: Maximum Likelihood Estimation

Determining the HMM probabilities is a problem in maximum likelihood estimation (MLE): finding the HMM parameters with the maximum likelihood given the statistics of the training corpus. Maximum likelihood estimation by the Baum-Welch re-estimation method [Lip82] is the most commonly used training procedure for estimating the parameters of hidden Markov models. Other methods have been used as well, including maximum mutual information (MMI) estimation [Bahl86] [Chow90], and minimum discrimination information (MDI) [EphDem89]. Simulated annealing has also been proposed for this problem [Paul85]. Here we discuss MLE. We need the HMM parameter set H which maximizes the probability of generating the data in the training set

,Ytrnt" X {cl, ...y &}. We assume that XLmXnis made up of sequences of observation vectors y,, r = 1,2, ..., and we search for H such that

where m_τ is the correct model sequence corresponding to observation vector sequence y_τ. In practice it is not possible to find the optimal parameters. A gradient-descent procedure is used to iteratively converge to a local optimum in the parameter space; that is, to derive a sequence of progressively better solutions H_1, H_2, .... Take the case of discrete distributions. An HMM model is described by two parameter matrices: A, the state transition matrix of probabilities a_ij (the probability of a transition from state i to state j), and B, the matrix of multivariate probability distributions. Since each matrix row must sum to 1, training is a problem in constrained optimization. In each iteration of the Baum-Welch algorithm, parameter a_ij is re-estimated as

â_ij = (expected number of transitions from state i to state j) / (expected number of transitions out of state i),

with the expectations computed over the training data under the current parameter values, using the quantity S(t, y|m) of section 6.1

[O'Shaughnessy87]. The same re-estimation is applied to the entries of B. Theoretically the descent through parameter space is performed until P_H no longer improves between iterations; in practice only a few iterations are necessary.

6.3 Decoding Algorithm: Viterbi Search

Once the HMM parameters have been estimated, we need an algorithm which can use the models to decode the sequences. The decoding, or recognition, problem is the search for the HMM unit-model sequence m_τ which maximizes the conditional probability P(m_τ|y_τ). The Viterbi algorithm provides an efficient approximation. The Viterbi algorithm [Vite67] was introduced in 1967 for maximum likelihood decoding. It is a dynamic programming algorithm to find the lowest-cost path in a trellis, where the cost of a path at a given trellis node n_j can be computed as the sum of the cost at the previous node n_{j-1} and the cost incurred to get from n_{j-1} to n_j. Define the cost C(t|m) of a path t through an HMM as -1 times the a posteriori log-likelihood of the path: C(t|m) = -log P(t|m)P(y_τ|t). Let C_l(i) be the cost of the lowest-cost path ending in state i at time l, and c_l(t, i) be the cost of going from state l_t to state i at time l, using transition t. If t is empty, then it comes from time l and c_l(t, i) = -log q_t. If t is full, then it comes from time l-1 and c_l(t, i) = -log q_t b_t(y_l). C_l(i) can be computed as

C_l(i) = min( min_{t empty} [C_l(l_t) + c_l(t, i)],  min_{t full} [C_{l-1}(l_t) + c_l(t, i)] )        (6.10)

The lowest-cost path is the one with the highest probability. The Viterbi algorithm finds this cost by computing

min_t C(t|m) = min_t (-log P(t|m)P(y_τ|t)) = C_τ(F),        (6.11)

where C_0(0) = 0 and F is the final state. A trivial modification to the algorithm that computes C_l(i) can provide the actual sequence of state transitions in the optimal path. All through the recursion, a back-pointer B_l(i) to the transition that resulted in the best path "so far" is saved. At the end, the pointers are used to trace the optimal path from C_τ(F) back to C_0(0). The optimal state-path through the HMMs is often used as an alternative to Equation 6.1 (for P(y|m)), because it is faster to compute. Although the optimal state-sequence is the most likely path through the models, the sequence of unit-models corresponding to this path may not be the most likely models. This is because the probability of a model sequence must be summed over all the paths in the sequence, and not only the most likely path. Usually, however, the best path provides a close, efficient approximation. Consider creating a general model m_g by looping the unit models together, so that the end state of the last unit model in each word has an empty transition to the start state of the first unit model in each word, including itself. Figure 6.3 shows a model for connected word recognition. In this case the most likely path through m_g corresponds to a series of paths through individual word models. The Viterbi algorithm effectively provides a word sequence corresponding to an acoustic observation sequence y. Simultaneously, it segments the observation vector into subsequences corresponding to specifically recognized words. These words are the output of the recognition procedure.

Figure 6.3: A looped word-recognition model.
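Viterbi decoding with additive -log costs and back-pointers can be sketched as follows, again in the simpler state-emission form with invented parameters rather than the transition-based formulation of this chapter.

```python
# A minimal sketch of Viterbi decoding: costs are -log probabilities, the
# cheapest predecessor of each state is recorded per frame, and back-pointers
# recover the lowest-cost (most likely) state path.
import numpy as np

A = np.array([[0.7, 0.3],
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi = np.array([0.9, 0.1])

def viterbi(obs):
    cost = -np.log(pi) - np.log(B[:, obs[0]])    # first trellis column
    back = []
    for o in obs[1:]:
        trans = cost[:, None] - np.log(A)        # cost of reaching state j from each i
        best_prev = trans.argmin(axis=0)
        cost = trans.min(axis=0) - np.log(B[:, o])
        back.append(best_prev)
    path = [int(cost.argmin())]
    for bp in reversed(back):                    # trace back-pointers
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(cost.min())

print(viterbi([0, 0, 1, 1]))
```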
6.4 Beam Search

In Viterbi decoding, the quantity C_l of equation 6.10 must be computed at each time l for every node (i.e. state) in the network which constitutes the language model m. The end result is the likelihood of the optimal path with respect to the exponentially large set of paths through the state-space. In the jargon of heuristic search algorithms, Viterbi produces an admissible (provably optimal) search. The complexity is O(|m| × τ), where τ is the number of frames and |m| is the number of states in the network. Although the algorithm is efficient in terms of finding the solution in time polynomially bounded by the size of the language model, in practical terms the algorithm fails because it spends too much time estimating the many low-likelihood paths through the network. Consider that in a large vocabulary ASR system, there may be thousands of words, each modeled by several phoneme HMMs, each of these containing several states. A practical language model may construct the network using n-gram statistics, inserting the same word in many parts of the graph depending on the conditional transition probabilities with respect to other words. As a result |m| may be on the order of millions. For example, the AT&T system described in [LjoljeEtAl95] used a 60,000 word vocabulary with 34 million 1- to 5-grams. Since speech recognition must be performed on-line and in real time to be practical, the computational price of admissibility is too high. The solution is to add a threshold computation to the Viterbi algorithm. For each frame l of the signal, the maximum state-likelihood is tracked as equation 6.10 is computed. Each state has a boolean flag indicating whether the state is active, initialized to false before the current frame is calculated. When the state calculations are done, each state s is set active if and only if

C_l(s) - min_i C_l(i) ≤ θ        (6.12)

where θ is a parameter controlling the exhaustiveness of the search. In frame l + 1 only transitions from active states have their likelihoods computed. The difference in equation 6.12 describes a beam through the search space containing the most likely state-paths; this threshold-mediated Viterbi algorithm is called beam search. The effect is to prune from the search space all paths whose cumulative likelihood is significantly less than the running optimum. This can be justified by observing that if the initial segment of a path is very unlikely with respect to the data so far, it is almost certainly the wrong path. Exhaustive search will compute the observation likelihood P(y|i) of state i even if the path probability P(t|m) leading to i is nearly zero. For example, states close to the initial node of the network would be considered all through the procedure, even though they could almost certainly not have produced the features observed near the end of the signal. Beam search is more efficient, and the benefit is cumulative, since early pruning of low-likelihood states prevents the unnecessary overhead of extending these paths in subsequent frames. The complexity of beam search is O(m × τ), where m is the mean number of states entering the beam per frame. This mean depends on the beam threshold and is independent of the complexity of the language model.
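The beam criterion of equation 6.12 amounts to keeping, after each frame, only those states whose cumulative cost lies within a threshold of the current best; a minimal sketch with made-up scores:

```python
# A minimal sketch of beam pruning: after a frame is computed, only states
# whose cumulative -log cost is within theta of the best state stay active.
def active_states(frame_costs, theta=50.0):
    best = min(frame_costs.values())
    return {s for s, c in frame_costs.items() if c - best <= theta}

frame_costs = {"s1": 120.4, "s2": 131.9, "s3": 210.0, "s4": 124.3}
print(active_states(frame_costs))   # s3 falls outside the beam and is pruned
```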

Chapter 7 Classical Search Methods

This chapter discusses typical solutions to the search problem in automatic speech recognition, based on the underlying tools of the previous chapters. The speech programs of BBN Systems and Technologies may be considered one representative example. The BBN research system is called BYBLOS. It is a speaker-adaptive continuous speech recognizer based on context-dependent, discrete-distribution hidden Markov models. The first published results, in 1987, were 93% word accuracy on a 1,000 word task of read speech. By November 1994 the system was reported to have achieved 96% accuracy on the 2500 word ATIS task, a corpus of natural-language queries to an airline travel information database. On the Wall Street Journal corpus, a collection of read speech containing 20,000 words, the system was reported to have achieved 89% word accuracy. Unfortunately the performance of a recognition system on a carefully recorded test set composed of read speech does not predict how it will perform under realistic conditions. Spontaneous speech is filled with hesitations, non-verbal utterances, non-grammatical locutions, and spurious environmental noise. Thus, performance badly deteriorated when the system described above was tested on the Switchboard corpus, a database of 143 hours of recorded spontaneous speech containing 22,000 words. (Weighted by frequency, 5,000 words cover 97% of the data.) Despite using an extensively trained statistical language model based on two million word-pairs, BYBLOS achieved a mere 50% word accuracy on the task. BBN developed a commercial product based on BYBLOS which it called HARK. This real-time system was fully software-based; its only specialized hardware requirement was a 16 kHz linear-sampled audio acquisition channel. The product had a basic module that processed continuous speech from a 2,000 word active vocabulary. By the spring of 1995 it was able to do dictation from a 40,000 word active lexicon. The system switched between different active vocabularies and grammars specific to different tasks. Although speaker-independent to a degree, BBN emphasized the importance of adapting the acoustic models to its customers' voices in order to maintain a target accuracy of approximately 3-4% word error rate. After adaptation, it had the ability to switch the active acoustic model set based on the target speaker using the system. The systems described above are still representative of current speaker-independent ASR technology for continuous speech. They also illustrate the continuing weaknesses of current methods. Key features are:

- Phoneme (triphone) HMMs. These are supplemented with word models for short and common words.

- A network representation. The language recognized is represented as a finite state network of phonemes, in which legal paths correspond to words in the lexicon.

- A pronunciation dictionary. The phoneme sequence for each word is drawn from a pre-compiled dictionary.

- A statistical grammar. The network is a combined phrase grammar and statistical grammar based on word-pair probabilities (bigrams). The speech signal is fitted frame-synchronously, left-to-right, to the network to produce the optimal sequence.

- Closed vocabulary. The success of the system depends on being able to model speech utterances a priori. If a specific syntax is imposed, certain illegal word sequences may be generated and then rejected. However, there is no integrated way to spot and reject out-of-vocabulary words. The highest-scoring hypothesis wins.

- Speaker adaptation. The system performs on-line speaker adaptation. For best results the initial models require adaptation to the user's voice.

- Task dependence. The system works better with a medium-size vocabulary and a grammar depending on the current assigned task. This reduces the scope of the search and enriches the context, improving recognition. Performance seriously degrades with a larger vocabulary and less context.

7.1 The Classical Word-Lattice Approach

Another good example of ASR methodology is the SRI Decipher system [MurveitEtAl93]. This type of system used a multi-pass algorithm in which the first-stage module is an acoustically-driven recognition engine. The acoustic module employs HMMs. Generally, a first-pass recognition module will have models for phonemes, often supplemented by whole-word models for short or common words. This module attempts to translate the signal into an accurate lattice, or time-segmented graph, of word hypotheses. The word lattice is generated in one of two ways: the network method and the lexical access method. In both methods, the lexicon of valid words is generated from a dictionary in which the legal pronunciations of each word are represented by phoneme sequences. In the network method, the lexicon is converted into a graph in which the arcs represent the transitions between phonemes. Each word in the lexicon corresponds to a path in the network. The language model consists of n-gram probabilities attached to the arcs between words. In this way, for example, the conditional probability of a common word-pair (or bigram) is attached to the arc connecting the two words in the network. If the language-corpus derived probability of a word-pair is zero (as would be the case for many word pairs that are semantically possible but poorly represented in the training corpus) a transition between the two words is still possible through a special arc representing the backoff probability estimate. The phoneme HMMs are linked together in the shape of the network. This large automaton then represents the language of the recognizer with all the acoustic, lexical, statistical, and grammatical elements integrated. A beam search is used to align the signal with the model and generate the lattice of hypothesized words with corresponding time denotations. A consequence of using one automaton to recognize the whole language is the size of the resulting program. Even with a moderate vocabulary of 5,000 words, the nodes in the network can number in the millions. The machine memory and CPU time required to process this kind of network is considerable, and any useful recognition system must be made to work in real time. Consequently, researchers have looked for ways to compress the graph by eliminating as much redundancy as possible [AntoniolEtAl94]. Even compressed, the network model is too large for fast recognition if detailed acoustic models are used. For this reason the phonetic network is not used as a sentence hypothesizer in large (> 10,000 words) vocabulary ASR systems. It is used instead to generate a word lattice in a first pass. The lattice is given as input to a second-stage module which performs a more detailed analysis. The system described in [NeyAubert94] used this approach, generating a lattice with an average of 10 hypothesized words per spoken word, and then applying a trigram language model to this vastly reduced language space. The second-stage recognizer may use more detailed acoustic models in fitting the acoustic signal to the word hypotheses in the lattice. For example, in [BocchieriEtAl95] the first pass used context-independent models with small mixtures. The word lattice was converted into a time-annotated reduced network, and bigram probabilities were embedded in the reduced graph. The second pass employed a much larger set of allophone HMMs with large numbers of distributions per mixture.
These more accurate models were used to re-estimate the maximum likelihood word string from the reduced network with respect to the already computed acoustic feature vectors. Alternatively, the second pass may avoid any further acoustic analysis, and treat the word lattice as symbolic data on which to apply higher-level language constraints. These constraints can be the syntactic/semantic knowledge of a parser, or they can be language model statistics. In contrast with the network method, the lexical access method uses the acoustic models to produce an accurate phoneme string, or alternatively, a lattice of phonemes. The word hypotheses are then generated symbolically, using the phonemes to access the lexicon and produce the most likely set of candidate words. Since the phoneme space is much smaller than the search space in the network method, much more detailed and accurate acoustic models can be used in the initial stage of the lexical method. ASR systems built with a multi-stage process incorporating LPC or Mel coefficients, a language network or phoneme network, bigram word probabilities, and word lattices have been well established for years, and may be referred to as the classical ASR approach.

7.2 Sources of Error in Word Lattices

The classical method produces a word lattice in a first-pass search, and it is therefore crucial that the lattice be as accurate as possible. If the correct word is not present in the word lattice, the second pass, however accurate, cannot produce the right string. No system, however, could produce a word lattice with perfect accuracy, unless the search beam is widened to the point where nearly everything is hypothesized everywhere. First, the acoustic analysis is imperfect. The feature vectors are produced by applying the Mel-scale transform to the signal spectrum. The Mel filters are logarithmically spaced triangular filters that have been derived to correspond to critical bands in the auditory model of speech perception. They are used because they usefully project the spectrum into a small-dimensional, perceptually significant feature space, and because they have been shown to be more effective than alternatives (when there is a perfect match between the training and test conditions). They are imperfect because of the limitations of the auditory model on which they are based, and because the frame-based FFT computation that generates them must trade off the accuracy with which long- and short-duration changes are captured. A second, inevitable source of error is the pronunciation modeling problem. In continuously spoken English, the words are often distorted from their dictionary, or canonical, phoneme representations as a result of speaker tendencies (such as accents), contextual articulatory events, as well as random articulatory variability. ASR systems address this variability by using clustered context-dependent models, and by incorporating multiple lexical "spellings" for each word in the dictionary. There are two basic reasons for modeling phonemes rather than words at the acoustic level. First, the number of acoustic models to train is fewer, and independent of the size of the vocabulary being modeled. This reduces the complexity of the acoustic module, and allows users to modify or expand the system's dictionary without retraining the acoustic models. Second, it is possible to collect adequate amounts of training samples from the available speech corpora because the basic units are all well represented in many contexts. But the problem of varying word pronunciations reasserts itself here at the lexical level: it is difficult to acquire a priori knowledge of how a given word may be pronounced by different speakers in different contexts. For any large vocabulary there will be many words whose pronunciation is poorly represented in the training corpora, and as a result only the canonical representation is present in the dictionary, perhaps with one or two possible distortions. Another intrinsic limitation of the classical approach is particularly evident in the lexical access method. The acoustic analysis is performed in the first pass, and the second pass accesses a string or lattice of phonemes as symbolic data. As stated above, there will always be instances in which the correct phonemes have been missed, and are simply not represented in the lattice. In this case there is no way for the recognizer to recover the correct hypothesis because the acoustic information has already been discarded. This problem exists for the network method as well. Although the latter reprocesses the acoustic data, it uses the same Mel (or PLP or LPC) based features as before and suffers the same limitations.
Furthermore, the second pass involves a maximum-likelihood search of the reduced network produced by the first pass, and whatever was missed the first time cannot be re-introduced here. These problems are inherent to the systems described. What is desirable is some method of rescoring the signal on the basis of both knowledge gained in the first pass and additional acoustic tests, in order to improve the accuracy of the reduced network by adding hypotheses which were missed the first time.

7.3 Out-of-Vocabulary Events

Another limitation of the classical approach is the out-of-vocabulary (OOV) event. In a live ASR system, whether dealing with dictation or dialogue, the spoken input will reflect the real nature of human speech in containing numerous utterances which do not correspond to lexical items in the dictionary. These OOV events include non-dictionary words, hesitations and pauses, false starts and restarts, and spontaneous speech utterances. Spontaneous speech utterances are the non-verbal noises, the "ahs" and "ers", with which all real speech is liberally interjected.

The difficulties of modeling OOV events are plainly evident. The classical method relies on a best-fit strategy in which the likelihood of dictionary items occurring in some sequence is estimated a priori by the language model. Spotting OOV events requires a strategy of rejection, or the system will inevitably produce an incorrect word sequence. But in order to spot the OOV event one has to model it accurately, or the chance increases of rejecting actual dictionary items with non-ideal pronunciations. It is difficult to model the OOV event accurately because of its inherent variability. Further, it is difficult to estimate the conditional probability of such events occurring in the context of other lexical items, i.e. it is hard to incorporate the OOV event in the language model.

In the classical method, rejection of spontaneous speech utterances is attempted through the use of garbage models, anti-models, and noise models [WilponEtAl90]. The garbage model can be an HMM trained on spontaneous utterances, or on the average spectral events contained in non-silent portions of the training signals. The noise model would be trained on non-verbal utterances only. OOV words can be spotted using a looped network of phoneme HMMs. If the log-likelihood ratio comparing the winning hypothesis with the garbage hypothesis is beneath a certain heuristic threshold, the corresponding segment of the signal is denoted as an OOV segment, and rejected. Pauses and hesitations can be modeled with silence models, and these have to be incorporated in the language model network effectively, so that the pause can be "consumed" by the automaton wherever it might reasonably occur in the utterance.

False starts and restarts are common events in which some word or phrase is partially completed, the ending distorted or broken off into a slight pause or non-verbal utterance, and the utterance repeated whole from the beginning. This kind of event, simple to describe, is very difficult to model in the classical system. The lexically accurate portion of the signal is not rejected, but the complete utterance contains a repetition of word(s) which cannot be pre-estimated by any language model because of its inherent randomness. A grammar or parser would be able to reject a phrase that contains a restart. But this approach is ultimately undesirable for two reasons: the utterance that contains a restart is often intelligible and should not be rejected; also, the eventual goal of ASR is to reliably transcribe speech that is ungrammatical, as this characterizes most utterances actually spoken by human beings.

Chapter 9 introduces a multi-pass search method based on syllable-like segments which is intended to compensate for some of the problems discussed in this and the previous section.

Part III
New Algorithms and Experimental Work

Chapter 8
Search for Acoustic Models

This chapter describes the development of a speaker-independent automatic speech recognition system using hidden Markov models (HMMs). Simulated annealing and randomized search are used to optimize discrete features of the system, including topologies, parameter tying, allophone context clusters, and the sizes of mixture densities. Domain knowledge is used to initialize and to constrain the search, which optimizes recognition performance while reducing the number of model parameters. System performance results for new types of discrete and continuous HMMs measured on the TIMIT corpus are reported. The small set of context-independent phoneme HMMs produced is competitive with much larger systems of context-dependent models.

8.1 Introduction

An HMM is a probabilistic finite state automaton which can model a stochastic information source with a good compromise between simplicity and generality (Chapter 6). The list of states, connecting arcs, and the mapping which assigns to them a probability density function are collectively referred to as the model "topology". While rigorous mathematical methods have been developed for estimation of acoustic parameters, the choices made for the topology, for tying distributions among different arcs or states, and for the phonetic or phonemic events to be represented by an HMM, have been arbitrary.

It is well established by now that having different HMMs for allophones of a given phoneme substantially improves the performance of Automatic Speech Recognition (ASR) systems [LeeEtAl90A], [KimballOstendorf92]. It is also well known that training a large number of allophone models requires a very large training corpus containing many samples of each allophone model to be trained.

Various training methods have been proposed to reduce the imprecision of parameter estimation if a limited amount of data is available for certain phones. One is to train models at different levels of detail (isolated phonemes, and phonemes in the left and/or right context) and to interpolate the statistical parameters of these models in order to obtain the parameters of allophone HMMs [ChowEtAl86]. Various ways of tying the statistical distributions associated to different elements of HMMs have been studied [Young92]. Another method consists of clustering contexts, and performing a sort of generalization using classification and decision trees [Hon92], such that enough training samples are available for each cluster, and clustering respects some mathematically defined criteria about the "impurity" introduced when merging allophones. The decision tree method of deriving allophone states and their corresponding density functions has become perhaps the most popular tool for training acoustic models at the present time [BahlEtAl91], [Chou97], [BeulenNey98], [ChouReichl99].

This chapter proposes a methodology for a mathematically sound solution of a group of problems, some of whose solutions until now have usually been arbitrary. The basic concept is that choosing a set of allophones, HMM topologies, and tying of distributions is a search problem. There are various methods and measures of success in directing a search toward some goal. In Section 8.2, simulated annealing is proposed as a search method for the above-mentioned problems, guided by the recognition performance on a subset of the experimental corpus disjoint from the test set. In general, search complexity is prohibitive, so suitable heuristics have to be used in order to constrain it. Section 8.3 describes how search is used to derive topologies and distribution ties. As well, heuristics based on speech knowledge are proposed to constrain the choice of phonemic and phonetic contexts which characterize a cluster.

The chapter reports experimental results for phoneme recognition on the TIMIT corpus using the allophone models obtained with the above-described search procedure. The recognition language model for our experiments is a loop of allophone models similar to those described in the literature [LeeHon89]. The experiments show small improvements obtained with topology optimization, and substantial improvements with allophone models.
It is well known that the effectiveness of HMM parameter estimation depends on the initial values assumed for the parameters. In principle it should be possible to use the improved topologies and allophones as starting conditions for the design of new and better phoneme models. Section 8.4 describes how allophone models corresponding to the same phoneme are merged into a single phoneme model. The performance of the new phoneme model is close to that of the allophone models, suggesting that the method proposed here allows one to build a small and effective set of phoneme HMMs that are competitive with a larger set of allophone HMMs.

Further improvements are obtained by conceiving simple HMMs, one for each phoneme, with mixtures having a large number of Gaussian distributions only at the beginning and at the end of the model. This choice is supported by the conjecture that coarticulation effects produce large parameter variability at the boundaries between a phoneme and its neighbours. The initial values of the parameters of the Gaussian distributions are the ones of the trained distributions in corresponding initial and final arcs of the allophone models. As well, improvements can be obtained by eliminating similar Gaussian distributions and retraining the models with a reduced number of mixtures.

Section 8.4 also shows how performance can be improved by introducing simple acoustic parameters describing time and broad-band features not well characterized by Mel-scaled cepstral coefficients and their derivatives.

(The experimental work in this chapter has been published by the author and R. Demori in similar form in volume 9, number 2 of the journal "Computer Speech & Language".)

8.2 Search Strategies

8.2.1 Simulated Annealing

In contrast with hill-climbing or gradient-descent optimization methods, randomized search does not set out to thoroughly explore the vicinity of a local extremum of the cost function, but employs instead a random solution generator to visit points all over the solution surface. One advantage of this approach is that it is easily applied to almost any optimization problem, continuous or discrete, regardless of non-linearities or discontinuities. A randomly generated set of solutions Si is unlikely to contain the global or even a local extremum of a non-trivial cost function f(S). However, there is a higher probability of yielding a solution S' such that

    f(S') - min_S f(S) ≤ m

where m is some acceptable margin for error in the solution. The goal here is to find improved values for certain discrete parameters of an HMM speech recognition system. These parameters include the topology of the unit models. The cost measure to be optimized is a measure of the recognition performance of the system.

As an example, consider the problem space represented by an HMM topology restricted to topologies with no more than 7 states and 7 output distributions. The state-transition matrix has 49 entries, each of which can contain 9 values (7 values for the seven possible distributions, plus an extra 2 values representing either no transition, or a lambda, i.e. non-consuming, transition). This means there are 9^49, or about 5.7 x 10^46, distinct solutions. Clearly exhaustive search is infeasible.

Although the computational cost of generating and testing new solutions prohibits more than a cursory search of the problem space, a pure Monte Carlo search can nevertheless have practical utility. Unless the initial solution is a good local optimum, any series of randomly generated candidates which are perturbations of the initial solution is likely to contain some candidates which improve the performance measure. Care must be taken with this approach, however, since the goal is to derive a speech model which generalizes well to new data.

A more directed kind of search is feasible, in which a method exists to escape from local minima. For the last fifteen years researchers have applied the technique of simulated annealing [KirkpatrickEtAl83] to many areas where gradient and other hill-climbing methods were unavailable or inadequate. Simulated annealing is a method of randomized optimization which acts like a hybrid between Monte Carlo search and hill-climbing, beginning like the former and settling into the latter near the end of the search.

In Boltzmann annealing, the solutions to a given cost function are assumed to be distributed with the Boltzmann probability factor P(Si) = exp(-E(Si)/kT), where E(Si) is the cost-function value for solution Si, k is Boltzmann's constant, and T is a system parameter called temperature. A new solution is generated randomly, and accepted if the cost improves (i.e. ΔE < 0). Otherwise, the new solution is accepted according to the probability ratio

    P(accept) = exp(-ΔE/kT)

This conditional acceptance of non-improvements allows the search procedure to escape local minima. As T is reduced, the probabilities given by the Boltzmann distribution vanish for all but the lowest-cost solutions. It can be shown that, given the above assumptions, if the temperature is lowered in stages and enough solutions are sampled at each temperature, Boltzmann annealing will converge to a globally optimal solution. One "cooling" schedule that guarantees optimality is the logarithmic schedule

    T_n = T_0 ln(n_0) / ln(n)

where n is the temperature iteration and {T_0, n_0} are some reasonable starting values. In practice faster schedules are used which, while sacrificing the theoretic property of convergence, tend to provide useful optimization results. The latter approach is called simulated quenching.

In this research simulated annealing is used to optimize the structure of HMMs for English phonemes. Because of the prohibitive size of the search space, a variety of non-optimal schedules, including linear ones, were employed. During the annealing, new solutions were generated from old by randomly permuting some discrete value, such as the value of an entry in the matrix representing the HMM topology. The measured recognition accuracy of the HMMs provided the value for cost function E.

8.2.2 Knowledge-guided Search

The solutions searched for are refinements to the structure and organization of hidden Markov models for speech, things that are usually set by hand. The models arrived at are in turn fitted to training data in the form of acoustic speech samples. To ensure the things learned about HMMs are generalized improvements, the objective function that drives the annealing process must be evaluated accurately. This means re-training and re-evaluating the models on thousands of acoustic samples every time a new model structure is generated. The computational cost of this procedure effectively precludes exploring the search space very well. In topology experiments, for example, only a few hundred solutions were tested at each temperature. However, any measurable improvement on existing solutions is desirable.

Another interesting approach would be to let the amount of data used to train and evaluate the models serve as the system "temperature". This would allow more solutions to be tested during the early part of the annealing, with the higher error in the evaluation function serving as a probabilistic factor affecting the acceptance of new solutions. Then as the system "cooled", increased training would provide a more and more accurate measure of solutions, and direct the system more toward a better solution (in this case lower recognition error).

In any case, since the problem space is large and the ability to explore it limited, the best way to solve the problem is to use the methodology in tandem with knowledge of the problem domain. Search begins with knowledge-guided and empirically proven solutions. Randomized search is used to develop new and better solutions. These are evaluated in the light of knowledge about the speech modeling problem, and adjustments are made. If necessary, the entire procedure can be repeated. In the following experiments, this approach of knowledge-guided random search has proved to be an effective compromise, and provided better models for speech.
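The following sketch illustrates annealing of this kind over a discrete transition-matrix encoding. The evaluate argument stands in for the expensive train-and-test cycle that supplies the cost (for example, phone error rate after re-training the models); the encoding, schedule, and all constants are illustrative assumptions, and the geometric schedule corresponds to the quenching variant mentioned above rather than the logarithmic one.

```python
import math
import random

def perturb(matrix, n_values=9):
    """Randomly change one entry of the state-transition matrix. Each entry
    encodes either "no transition", a lambda (non-consuming) transition, or
    one of the tied output distributions (illustrative 9-value encoding)."""
    new = [row[:] for row in matrix]
    i = random.randrange(len(new))
    j = random.randrange(len(new[0]))
    new[i][j] = random.randrange(n_values)
    return new

def anneal(initial, evaluate, t0=1.0, cooling=0.9,
           steps_per_temp=100, n_temps=20):
    """Simulated annealing over a discrete HMM topology encoding.

    `evaluate` maps a candidate topology to a cost; in this chapter that is
    a full re-training and recognition test, which dominates the run time
    and limits how many candidates can be tried at each temperature."""
    current = best = initial
    e_current = e_best = evaluate(initial)
    t = t0
    for _ in range(n_temps):
        for _ in range(steps_per_temp):
            candidate = perturb(current)
            e_cand = evaluate(candidate)
            delta = e_cand - e_current
            # Metropolis criterion: always accept improvements, sometimes
            # accept worse solutions so the search can escape local minima.
            if delta < 0 or random.random() < math.exp(-delta / t):
                current, e_current = candidate, e_cand
                if e_current < e_best:
                    best, e_best = current, e_current
        t *= cooling  # geometric "quenching" rather than a logarithmic schedule
    return best, e_best
```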

8.3 Applying Search to HMM Structure

8.3.1 Basic Recognizer Architecture

In each experiment, the original models are first evaluated by training them on a set of sentences and testing their recognition performance on another set. The train/test suite is later used to measure the optimized speech system, to see if recognition performance improved. The sentences used to score the models during search belong to a third, separate test suite. All acoustic data were drawn from the TIMIT acoustic-phonetic speech corpus. 3679 sentences were used for training, and the core test set of 192 sentences containing 7,333 labelled phonemes was used for final evaluation.

Figure 8.1: The initial (top) and optimized topology for model "b".

Both discrete and continuous HMMs were trained and evaluated. For the continuous models, the HMM output distributions of 12 Mel-scaled cepstral coefficients, 12 difference cepstrum coefficients, and energy and energy rate-of-change were estimated from the training data using continuous mixture densities composed of Gaussians. The covariance matrix is assumed to be diagonal. For the discrete models, three codebooks, one for each feature set, were introduced (256 entries were used for each of the first two sets and 64 for the energy features). To speed up search, continuous models were abandoned in favor of discrete ones during optimization experiments. The HMMs were trained by Baum-Welch re-estimation. Recognition was performed using Viterbi maximum-likelihood decoding over a looped, phonemic finite state network.
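The text does not detail how the discrete codebooks were built; a k-means vector quantizer is a common choice and is assumed in the following sketch, which constructs one codebook from training feature vectors and maps new frames to the discrete symbols consumed by the discrete HMMs. All sizes and names are illustrative.

```python
import numpy as np

def build_codebook(features, n_entries, n_iters=20, seed=0):
    """Vector-quantization codebook via plain k-means.

    features: (n_frames, dim) float array of, e.g., cepstral vectors.
    Returns the (n_entries, dim) codebook."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_entries, replace=False)]
    for _ in range(n_iters):
        # Assign each frame to its nearest codeword
        d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Re-estimate codewords as cluster means
        for k in range(n_entries):
            members = features[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(features, codebook):
    """Map feature frames to discrete symbols for the discrete HMMs."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)
```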

8.3.2 Topologies and Distribution Ties

As mentioned in the introduction, the topology and tying of distributions in hidden Markov models for speech are usually determined a priori, or based on statistical measures [Young92]. Some researchers have made attempts to evolve the topologies algorithmically [Casacuberta90EtAl], [SanchisCasacuberta91], [JouvetEtAl91], [TakamiSagayama92]. In the method described here, the topology and distribution assignments of an HMM are represented as an integer array, and simulated annealing is used to optimize this data structure. Each perturbed solution is applied to the basic English phoneme models, which are trained on one target set of sentences from the TIMIT database, using discrete parameters. The models are then tested on a separate test suite, with the measured unit accuracy used as the objective cost function to be maximized. After a set of topologies adapted to particular phoneme classes have been derived, they are evaluated on a third data set of 192 sentences. Figure 8.1 shows one such topology. A mix of the best performing topologies is selected (based on individual phoneme accuracy), and finally evaluated by training a model set with continuous mixture densities.

The searches are initialized with a proven, left-to-right model, and limit the growth of the model to 7 states. Using domain knowledge in this way, the search is constrained enough to achieve a reasonable solution in reasonable time. Since the different phoneme classes are acoustically dissimilar, it is conjectured that different topologies should be developed for specific broad phoneme classes.

    continuous models    no. distr.   plosive errors   vowel errors   phone error rate
    7-state topology     1440         402              924            44.06
    mixed topologies     1362         385              890            43.62

Table 8.1: Comparison of initial and optimized topologies.

Search began with the initial topology represented by the top of Figure 8.1. The optimization proceeded by repeatedly perturbing values in the transition matrix, and evaluating the new topology represented by these perturbed values. Since the values represented both transitions and the distributions tied to them, the method optimizes simultaneously over both topology structure and parameter tying.

48 or 53 units are modeled, depending on the experiment suite. Following [LeeHon89], the original unit models are mapped onto a simpler set of 39 recognition units for classification. All performance results in this chapter are for these 39 units. Recognition performance is given by the phone error rate (PER):

    PER = 100 x (#insertions + #deletions + #substitutions) / #units

Topology optimization was performance-based, using 512 sentences to train and 96 sentences to evaluate each perturbation of the topology set. The success or failure of the experiments was determined by testing models on another set of 96 sentences. After many stages of optimization driven by large-sample performance, a best set of class-specific topologies was derived. This set contained seven new topologies for the phonetic classes silence, closure, fricative, nasal, liquid, vowel, and plosive. The models were trained as many iterations as possible until performance failed to improve, and then tested on a new data set. A unit-by-unit study was made of how the new topology performed relative to the 7-state topology, counting the number of errors E associated in each case with a given phoneme p.
E has four components: insertions, deletions, the number of times a p is misrecognized as something else, and the number of times something else is misrecognized as a p. Using the measure E, a new model set was constructed, a mixture containing the better-performing topology for each phoneme. This mixture showed an improved performance with a reduced number of Gaussian distributions (Table 8.1). The mixture showed a 2% improvement in the discrete case, and 1% in the continuous. The mixture was even better when considered on a per-class basis, significantly reducing the number of errors for vowels and hard-to-distinguish plosives.

8.3.3 Allophone Context Clusters

The same performance-driven optimization approach was applied to the problem of efficiently clustering contexts for allophone models. Since the problem space is large and expensive to search, acoustic-phonetic reasoning was used to determine a sensible initial grouping of contexts, and then simulated annealing perturbed these clusters so as to improve the recognition accuracy of models trained on these contexts. The output of the search procedure was also manually adjusted to conform to speech knowledge.

This approach is somewhat different from the tree-clustering algorithm of [BahlEtAl91], or the state-splitting approach of [TakamiSagayama92]. In this case, search is driven by performance and knowledge. In previous sections HMMs model the context-independent phonemes. Because of allophonic variability, better results are possible when phonemes are modeled in context [SchwartzEtAl85], [LeeHon89]. However, there are 48 possible left-contexts and 48 possible right-contexts for each phoneme. If each context of each phone were modeled separately, it would be necessary to train 48^3 = 110,592 different models. In fact, the task would be difficult even with large speech corpora, since most of these contexts occur too rarely to be well represented. The solution is to combine phonemic contexts into clusters which have similar contextual effects on the preceding or following phonemes. To model left-context units using any of 10 clustered contexts, the allophones may be represented with fewer than 500 models, a manageable number for which there are likely to be adequate training samples. The problem is to choose the clusters appropriately.

One approach would be to choose as fine-grained a context as can be well trained from the available data. The more varied samples there are of a given speech unit, the more specialized models can be made for that unit. [KimballOstendorf92] suggest a distribution can be adequately estimated when the number of samples is about six times the dimension of the observation vector. If the vectors contain 24 cepstral-derived coefficients (Mel and delta-Mel), at least 144 samples per model are needed. Depending on one's point of view, a rigid threshold like this may result in too many or too few models. (In practice, good results are achieved with some units trained on fewer samples.) In [LeeEtAl91] another data-driven approach is used to select clusters. Following a unit reduction rule based on the number of samples available for a unit in the training data, they build a set of models containing 47 context-independent phones, 134 diphones, and 1101 triphones. While this approach guarantees the trainability of the units, it may produce a large number of models with similar distributions, in effect adding parameters without reducing system entropy.
In [LeeEtAl90A] an initial set of context-dependent models is trained, and these models are merged into clusters, called generalized allophones, according to some algorithm. The first method, agglomerative clustering, is based on an entropy distance measure applied to the allophones. The second method is a heuristic decision tree, in which the root consists of the complete set of allophones corresponding to a particular phoneme, and the leaves contain generalized allophones. At each node, the allophones are recursively divided into two sub-clusters based on the answer to a question provided by an expert linguist, a question designed to capture contextual effects. The recursion ends according to a metric of the distance between a parent cluster and its subclusters. In both of the above methods, the metric used to merge or split clusters is the "entropy increase" or information loss in the output distributions when two models are merged. Clustering is chosen so as to minimize entropy gain [Hon92]. A context-decision tree is also constructed in [BahlEtAl91], but their subtrees are divided according to a Poisson-model likelihood of the possible splits measured against the training data. The generation of subtrees is statistically dependent on not just the adjacent (left or right) contexts, but on the several preceding and following phones as well. In [RabinerEtAlSS] two methods are suggested which proceed from the opposite direction: beginning with a generalized word model, they iteratively generate more HMMs to model that word, in which each HMM is re-estimated from different subsets of the training samples for that word. Although they don't address context, the same methods could in effect derive context clusters automatically, without appealing to acoustic-phonetic reasoning. However, the resultant context-sets would probably overlap. The state-splitting algorithm of [TakamiSagayama92] optimizes automatically along both topological and contextual axes, based on measured likelihood on the training data. A trivial Markov model is iteratively grown into a more complex model in which contexts are clustered and integrated. Jouvet [JouvetEtAl91] avoids the clustering problem, instead reducing the parameter space by integrating all left and right contexts into the topological structure of the allophone model. This can be seen equivalently as tying the distributions of the internal states of the allophone models.

In the experiments of this chapter, acoustic-phonetic reasoning is combined with the performance-driven randomized search described earlier in order to optimize the context clustering with respect to the available training data. In contrast with [TakamiSagayama92], recognition accuracy rather than training-set likelihood serves as the objective function. We call this approach performance and knowledge-guided search.

Optimization was first applied to the problem of efficiently clustering right-contexts for plosive allophones. Speech science suggests that the right-hand event, or succeeding phone, mostly affects the plosive burst and transitions of relevant acoustic parameters. Intuition suggested an initial grouping of vowel right-contexts as shown in Table 8.2, in which some phonemes starting with the same symbol were grouped together. Simulated annealing perturbed these clusters so as to improve the recognition accuracy of models trained in these contexts. The output of the search procedure showed a tendency toward grouping phonemes by place of articulation.
Table 8.2: Initial right-context clusters for plosives.

Table 8.3: Optimized right-context clusters.

Minor corrections to the clustering were manually performed to finalize the trend, resulting in the grouping shown in Table 8.3; some consonant clusters were also manually added. Performance was used to drive the annealing. Since plosives were the model of interest, the number of errors on the plosive data alone served as the performance measure (i.e. how many plosive events were deleted or misrecognized). Almost all plosives in the training data are found in the contexts of Table 8.3.

Optimization was next applied to the problem of efficiently clustering left-contexts for vowels. The initial grouping (Table 8.4) puts all the unsonorant consonants having similar place of articulation in the same class. The clusters produced after search are presented in Table 8.5. The number of contexts was reduced from 23 to 13.

     1. p b f v m                    13. ax
     2. s z zh th dh jh t d n ch     14. ix
     3. ng hh g k                    15. ih
     4. l                            16. ae
     5. w                            17. ah
     6. ao                           18. uh
     7. aa                           19. oy
     8. uw                           20. iy
     9. er                           21. ow
    10. ay                           22. eh
    11. ey                           23. sil epi bcl dcl gcl pcl tcl kcl
    12. aw

Table 8.4: Initial left-context clusters for vowels.

     1. p b f v m
     2. s z zh th dh jh t d n ch
     3. ng hh g k sil epi bcl dcl gcl pcl tcl kcl qcl
     4. l uh aw ow uw
     5. w
     6. eh er
     7. ao
     8. aa
     9. ay ey ih oy
    10. ax ix
    11. ae
    12. ah
    13. iy

Table 8.5: Optimized left-context clusters.

A summary of results for discrete models using various types of context-dependent allophones is shown in Table 8.6. The solutions from all the searches were combined, and tested on a new set of 96 sentences using discrete-codebook HMMs. The left contexts were expanded to include other consonant classes, using the context clusters of Table 8.5. Next, the left-context vowel and consonant models were combined with the right-context plosives. As hoped for, higher accuracy by class averaged out to a higher overall unit accuracy. Finally, these same context sets were used to train models with topologies based on the earlier optimization experiments. The results are in the last row of Table 8.6. Results for the continuous-parameter models are summarized in Table 8.7.

    discrete models                     plosive errors   vowel errors   misrec. rate   phone error rate
    48 phonemes                         199              508            42.88          47.06
    359 models, left contexts           202              461            40.25          44.51
    364 models, left & right contexts,  192              465            39.63          44.35
      7-state topology

Table 8.6: Development of the recognizer with discrete HMMs. Phone error rate includes insertion errors.

    continuous models                        num. parameters   misrec. rate   phone error rate
    48 phonemes                              74,880            38.36          44.06
    364 allophones, left & right contexts    473,304           34.32          40.76

Table 8.7: Results for continuous models (no language model).

8.4 Derivation of Phoneme Models from Allophone Models

8.4.1 Merging and Retraining

To this point, the result of the search process was a set of unit model topologies and output distributions well trained for units in left or right context. The final experiment attempted to simplify the speech recognition system by merging the distributions of the allophones into phoneme units with parallel transitions. This was done for both discrete and continuous-distribution models.

In the discrete case, the n allophones for unit U were merged in a straightforward way. All allophones for U had the same topology T. A new topology was created with n parallel transitions for each single transition in T. Each of these parallel transitions was tied to the corresponding distribution in one of the allophones for U. Once these merged-model phonemes were built, they were re-trained for four iterations. These models had a better phone error rate than the best allophone models (Table 8.8).

    discrete models              plosive errors   vowel errors   misrec. rate   phone error rate
    53 phonemes, merged dists    173              489            38.90          43.37
    53 phonemes, simplified      157              464            38.56          43.43

Table 8.8: Merging the discrete HMMs.

    continuous models                        num. parameters   misrec. rate   phone error rate
    53 phonemes, simplified                  356,304           34.8           40.0
    53 phonemes, with additional features    397,416           33.8           39.7
    53 phonemes, fig. 8.2 topology           301,252           32.7           38.2
    same, with additional features                             30.8           35.3

Table 8.9: Merged continuous context-independent models.

In order to simplify the models and further reduce the number of parameters, all the parallel transitions but one of the internal states of the merged models were removed. We hypothesized that extra distributions are only useful for modeling the contextual effects at the left or right end of the units. In fact, the simplified models showed improvements, after four training iterations, with respect to their predecessors.

The end result of the search process was to construct a set of discrete models with internal topological structure complex enough to model contextual effects of neighboring units significant to the particular unit. Results are summarized in Table 8.8. It appears that a substantial performance improvement can be obtained in context-independent models by just adding context-dependent distributions on the initial and final transitions of phoneme HMMs.

The same merge/simplify procedure was applied to the continuous models. Continuous-distribution HMMs already employ density mixtures. In this case the process of combining allophone distributions essentially means selecting mixture sizes for context-independent models and initializing the corresponding distributions with well-trained values. Results are summarized in Table 8.9.

8.4.2 Further Improvements

A constant avenue of research is the hunt for new acoustic parameters which are likely to contain information not well represented by MFCC or other established feature sets. In the experiments described in this chapter simple new measures in the time domain and in broad frequency bands were investigated. Following [DemoriEtAl76] the signal energy can be described in terms of peaks and valleys. Let an "event" be a peak or a valley, and let tb_j be the beginning time of the jth event e(j). A temporal feature is computed from tb_j, where e(j) is the event that covers a time interval including t. A value p(t) is then computed as follows:

where p0 is a constant chosen a priori. The feature p(t) suggests the position of the t-th frame in the suprasegmental acoustic event to which t belongs. Two other features are obtained by introducing e_l(t), the energy of the highest spectral value in the 100-900 Hz band at time t, and the ratio

    R_hl(t) = e_h(t) / e_l(t)

where e_h(t) is the highest spectral value at time t in the 3-5 kHz band. Performances were further improved by adding the new acoustic features, as shown in Table 8.9. Simple broad-band acoustic parameters and temporal features have the predicted significant positive impact on recognition.

A final experiment was performed using the same model topology, shown in Figure 8.2, for all complex phoneme models (vowels, liquids, nasals, and plosives). In this topology, the transitions represented by thick lines are modeled with a moderately large number of Gaussian distributions, while the transitions represented by thin lines are modeled with small mixtures. The relative sizes of these mixtures reflect the number of contexts in which the allophones described earlier were trained. The initial distributions were taken from the well-trained merged models described in section 8.4.1. These distributions were duplicated or reduced in number, so that each transition from the first state of the new topology could be tied to a mixture of 39 densities, each mixture tied to the fourth state could have 30 densities, and all the internal "thin" distributions could be mixtures of 6 probability density functions. The new models were then re-trained for five iterations, and low-probability transitions were pruned. The results (third row in Table 8.9) confirm the importance of good initialization of the parameters before estimation.
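A minimal sketch of broad-band measures of this kind, computing the peak spectral energies in a low (100-900 Hz) and a high (3-5 kHz) band of one frame and their ratio R_hl; the exact definitions used in the experiments may differ, and the frame length, window, and sampling rate are assumptions.

```python
import numpy as np

def broad_band_features(frame, sample_rate=16000):
    """Peak spectral energy in a low (100-900 Hz) and a high (3-5 kHz) band,
    plus their ratio, for one signal frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    def band_peak(lo, hi):
        band = spectrum[(freqs >= lo) & (freqs <= hi)]
        return band.max() if band.size else 0.0

    e_low = band_peak(100.0, 900.0)     # e_l(t): strong for vowels / sonorants
    e_high = band_peak(3000.0, 5000.0)  # e_h(t): strong for fricatives
    ratio = e_high / (e_low + 1e-10)    # high R_hl suggests frication
    return e_low, e_high, ratio
```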

Figure 8.2: Large mixtures modelling contextual effects; small mixtures modelling the internal states.

8.5 Discussion

Taking into account the comparatively modest feature set employed here, and the context-independence of the final models, these results compare favorably with those reported in the literature. The only systems reporting better phone-recognition rates on TIMIT around the time this research was published ([Robinson91], [YoungWoodland94]) employed 2nd-order derivatives and various kinds of context-dependency learning, e.g. allophone HMMs, or recurrent neural networks. The results described here indicate that a simple phoneme model containing rich mixtures of Gaussian distributions is a serious candidate architecture for certain applications; for example, the fast-match first pass of a large vocabulary ASR system. (Having a large number of Gaussians per mixture does not imply a high computation time if the number of mixtures is limited.)

In this research allophone models were initialized with the distributions of context-independent models, and vice-versa. From a mathematical standpoint it is better to perform an interpolation of new specialized models with the more general, better trained phoneme model from which they are derived. This should improve performance.

Chapter 9
Adding Knowledge with Syllable Phonotactics

A search technique incorporating the automatic modeling of lexical variability is introduced for speaker-independent speech recognition. Current state-of-the-art systems depend on being able to model the entire language based on acoustic features and the constraints of syntax or inter-word probabilities. These methods often fail in the presence of multiple speakers, new vocabulary, noise, and spontaneous speech phenomena. A new approach for word hypothesization is proposed, based on an acoustic-phonetic unit called the pseudo-syllable segment. An algorithm is described for transforming a sequence of syllables into words. Techniques are suggested for controlling the accuracy of the syllabic hypothesis set, and learning the phonotactics of syllables automatically in a statistical framework.

9.1 Introduction

The output of an Automatic Speech Recognition (ASR) system fed by a speech signal is a sequence representing the words uttered by the speaker. The output of a Speech Understanding System (SUS) is an abstract conceptual representation of the meaning of the words uttered by the speaker. In both cases, words are hypothesized by a search process that produces the most plausible output according to a given scoring method for competing hypotheses.

The most plausible output depends on the Knowledge Sources (KS) used for driving search and may not correspond to the information conveyed by the signal. A search algorithm that always finds the best hypothesis is said to be admissible. Errors in the best hypotheses are due to imprecision of the KSs. In theory and in practice, it is important to minimize the difference between the output of the ASR or the SUS and the speaker intention. With this perspective, search algorithms which are non-admissible because they use sophisticated KSs may have better performance, in terms of recognition or interpretation, than admissible ones using simpler and less reliable KSs.

Admissible algorithms may be impractical with certain KSs due to time and memory constraints. An SUS or an ASR has to run in close to real-time and, more important, has to use an amount of memory within the limits of real technology. Actual non-admissible search algorithms use pruning thresholds during search. Candidate partial hypotheses with a score below a threshold are discarded before completion. Some of those hypotheses may raise their relative score and even become the best ones if allowed to reach completion. Reducing the selectivity of this process results in an increased amount of computation per spoken sentence and in an increase of central memory required. Technology is making available faster processors and larger memories, but since large vocabulary systems continue to increase the number and sizes of knowledge sources coupled to the search process, technology alone will not obviate the need to achieve real-time performance by trading off admissibility for speed. Furthermore, there are algorithms based on long-history language models (LMs), such as island-driven parsers, which have to work on already generated word hypotheses in order to avoid intractable theoretical complexity [CorrazzaEtAl91]. Generation of word hypotheses is an intermediate search phase in which hard decisions are made, ruling out candidates that a successive parsing stage could rescore as most promising. This approach adds motivation for the use of non-admissible search algorithms.

As the objective of ASR or SUS search is not admissibility, but performance, it is worth investigating non-admissible search algorithms, based on a cascade of processes with each process applying one or more specific KSs in a multi-step search. Most of the popular approaches so far use a language model (LM) to generate lattices or sub-vocabularies. LMs are obtained by analysis of large corpora of text, and the results are biased by the topics covered in the corpora and the preferences of the writers. In general the LM of a speaker is not the one derived by the analysis of available corpora. Even the lexicon is often not completely known. For these reasons it appears useful to consider a way of generating a sub-vocabulary of words that can be present in a spoken sentence, starting from units that are not topic- nor even vocabulary-dependent, but are a complete set of basic units of a given language.
Good unit candidates are phonemes and syllables. It is possible to characterize a complete set of syllables in a language and drive the generation of syllabic hypotheses with a KS made of a network of all the possible syllables in the language. This network is complete because it represents all possible sequences of what can be said in a language, and it has a size that is tractable in terms of computation time and memory requirements. Furthermore, it introduces more constraints on the search space than a pure network of phonemes.

With good syllable models, it is possible to generate a lattice of syllable hypotheses about a spoken sentence. This lattice can be pruned with additional segmental acoustic analysis and be used to generate word hypotheses. It is also possible to detect from the lattice hesitations, corrections, and other spontaneous speech phenomena, and to find which alternate pronunciation of a word is likely to have been used by the speaker. Word hypotheses may belong to a sub-vocabulary and be used to predict other sub-vocabularies using analysis of word associations. Words from the new sub-vocabularies can then be hypothesized from the syllable lattice. In practice, the value of syllable lattices for sub-vocabulary generation can be evaluated based on:

1. Word density, the ratio of the sub-vocabulary size to the number of words of the spoken sentence;

2. Word error rate (WER), the probability that a pronounced word is not in the sub-vocabulary;

3. Recognition accuracy, in terms of word error rate of the complete search process.
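A minimal sketch of how the first two figures of merit might be computed for one utterance, assuming the sub-vocabulary is given as the set of hypothesized words and the reference transcript as a word list; the function name and data layout are illustrative.

```python
def lattice_figures_of_merit(hypothesized_words, reference_words):
    """Word density and word error rate of a sub-vocabulary/lattice
    with respect to the reference transcript of one utterance."""
    sub_vocabulary = set(hypothesized_words)
    word_density = len(sub_vocabulary) / max(len(reference_words), 1)
    missed = sum(1 for w in reference_words if w not in sub_vocabulary)
    lattice_wer = missed / max(len(reference_words), 1)
    return word_density, lattice_wer

# Example with a toy utterance
density, wer = lattice_figures_of_merit(
    ["show", "me", "flights", "lights", "to", "two", "boston"],
    ["show", "me", "flights", "to", "boston"],
)
```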

9.2 Using Syllable Lattices or Sequences for Word Hypothesization

Another application of the syllable approach is to dictionaries. Current techniques for managing speaker variability include averaging the acoustic parameters over large training corpora, and introducing multiple lexical representations into the system [LjoljeEtAl95], [DemoriSnowGaller95]. Inspection of the causes of errors in classical, left-to-right word-fitting algorithms reveals that often the correct word is not introduced into the maximum likelihood sequence, or into the word lattice, because certain phonemes used to represent the word in the dictionary have not been articulated. Any attempt to re-estimate the acoustic parameters for the canonical phonemes on this data leads to a corruption of the model.

Adaptation at the lexical level is the correct way to compensate for this variability. New pronunciations may be learned through observation, or developed by linguists, but this may be impractically slow work for the development of new ASR applications. Some method is needed to generalize about the kinds of distortions typically produced with respect to canonical pronunciations, and generate new pronunciations automatically. This chapter proposes such a method. In contrast with [DemoriSnowGaller95], this method hypothesizes different pronunciations at run-time, as a knowledge source integrated with the search procedure.

(The algorithms described here have been previously published by R. Demori and the author in the 1996 ICASSP proceedings.)

The proposed algorithm uses a Viterbi decoder to generate a lattice of syllable-like units as a first stage preceding the generation of words. Alternatively, the n best sequences of syllabic segments can be generated. These segments consist of consonant clusters terminating in a vowel, and are generated in the standard way using a syllabic grammar with syllable bigrams, and backoff probabilities for uncommon sequences.

The pseudo-syllable is a good intermediate unit from which to carry out word hypothesization. Although phonemes may be deleted as an effect of speaker habit in continuous speech, some form of a syllable, however shortened or distorted, is likely to be detected if the underlying word has been articulated in an intelligible way. A study of the Switchboard corpus, for which the phonemes were transcribed by hand, revealed a 12% deletion rate for phones compared with less than 1% for syllables [GanapathEtAl97]. Also, with syllable data or smaller subword units, one can reason about the underlying utterance in a way not easily performed when the only information is a list or lattice of possible words. In the latter case, there is little chance for recovery if the correct word has been deleted from the list of candidates. Finally, the longer acoustic context of the syllable makes it more suitable than smaller units for exploiting parameter trajectories and other longer-duration temporal and spectral variations.

Word hypothesization from syllables has been explored by some researchers. In [DentonTaylor92] the approach is discarded because syllabification errors cause a performance drop when the system is scaled up to larger vocabularies. In contrast, the system described in [LjoljeEtAl95] uses a syllable-graph generator in a multi-stage system designed to constrain the search space for very large (60,000 word) tasks. Both methods lacked explicit ways for managing syllabification in the presence of variable pronunciation.

The method introduced here is based on fitting the set of approximately 10,000 English syllables to the signal in order to produce a lattice of hypotheses segmented by time. It is necessary to first define the syllables, and then produce a reliable method of translating words of the task vocabulary into sequences of these syllables. For simplicity we define a unit called the pseudo-syllable segment (PSS) [DemoriEtAl95]. It consists of zero or more consonants followed by a vowel. The PSS must be a legal sequence in the modeled vocabulary. For completeness of the word-PSS translation algorithm, word fragments are introduced to model consonant cluster word-endings.

The choice of these units is motivated by some particular advantages it offers to the segmentation behavior of the resulting lattice. A segment ending in a vowel will be characterized by an internal rise and steep descent in total signal energy. The corresponding portion of the lattice can be re-scored by an acoustic measure specifically defined to disambiguate vowels. A lattice segment can be pruned of probable errors involving plosives or fricatives. The plosive is easily characterized by a near-zero drop in signal energy, corresponding to the glottal stop, followed by a sharp explosion of energy characterizing the release. The fricative is characterized by a distinctly high ratio of high-frequency to low-frequency spectral energy.
These measures can be used to post-process the PSS lattice by removing probable errors introduced by the Mel or PLP features. They can also re-introduce phonemes that the HMMs missed. This algorithm for improving on the weaknesses of the first pass is a departure from, and a potential improvement on, the classical approach.

A lattice segment containing a word fragment can be used to spot word endings. The lattice-to-word mapping of the second pass should be constrained so that the phonemes of these fragments are matched to word endings alone. Rescoring can be applied to fragments as well, in order to increase lattice accuracy, and eventually the completeness of the word network generated in the second pass.
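As an illustration of the pseudo-syllable segmentation just described (zero or more consonants followed by a vowel, with a trailing consonant cluster emitted as a word fragment), the sketch below decomposes a canonical phoneme string into PSS units; the vowel inventory and the example pronunciation are illustrative assumptions.

```python
# Illustrative vowel set (ARPAbet-style); the full inventory used in the
# experiments differs in detail.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er",
          "ey", "ih", "ix", "iy", "ow", "oy", "uh", "uw"}

def word_to_pss(phonemes):
    """Split a canonical phoneme sequence into pseudo-syllable segments:
    each PSS is zero or more consonants followed by one vowel; a trailing
    consonant cluster becomes a word-fragment unit."""
    segments, current = [], []
    for p in phonemes:
        current.append(p)
        if p in VOWELS:           # a vowel closes the segment
            segments.append(tuple(current))
            current = []
    if current:                    # leftover consonants -> word fragment
        segments.append(tuple(current) + ("<frag>",))
    return segments

# Example: a hypothetical pronunciation of "baskets"
print(word_to_pss(["b", "ae", "s", "k", "ax", "t", "s"]))
# [('b', 'ae'), ('s', 'k', 'ax'), ('t', 's', '<frag>')]
```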

9.3 The Phonotactic Model for Syllable-Mediated Search

Word generation is performed with a kind of model called the syllable phonotactic model. Figure 9.1 gives an example of a three-phoneme syllable model. The phonotactic model is a hidden Markov model which contains statistics about phoneme symbols only, i.e. lexical, not acoustic, information. The statistics S(p), d(p) model confusions of phoneme p within the context of a given syllable. S(p) is a discrete probability distribution modeling the substitution probabilities for phoneme p. d(p) is the scalar probability of the symbol being deleted during enunciation of the syllable. This model assumes that the beginning or ending phone of the syllable may be missed, but not both.

The model can be enriched by adding an extra parallel state-transition path for each of the constituent phonemes, tied to the probability distribution of phoneme insertions in that stage of the syllable. d(p_i) constitutes the transition probability for the associated transition. Path transition probabilities are normalized by associating to transitions with a symbol distribution function S(p_i) the transition probability 1 - d(p_i).

Figure 9.1: A syllable phonotactic model.

Probabilities of substitutions, insertions and deletions are initially estimated in a context-independent way based on frequency counts from a labeled corpus such as TIMIT. A beam search is performed on the data set in order to generate a phoneme lattice for each signal and compare it with the correct labels. For each symbol i a frequency count f_ij is computed each time the symbol j is found in the lattice in a segment of the signal corresponding to i. Use of the phoneme lattice rather than maximum likelihood sequences provides a richer set of substitution counts, reflecting both weaknesses of the acoustic models and pronunciation variability. The phonotactic model for each syllable in the lexicon is initialized using these global statistics as follows:

where f_ij is the normalized substitution frequency count obtained above for phone p_j with respect to ideal phone p_i. A is a heuristic factor designed to give initially greater probability s_i(p_i) to the canonical phoneme than to deletions or context-independent substitutions. e is a small quantum to allow substitutions not observed in the training data.
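A minimal sketch of one plausible initialization consistent with the description above (canonical phoneme boosted by the factor A, observed confusions weighted by their normalized counts f_ij, a floor e for unseen substitutions, and a fixed deletion probability d); the exact functional form of equation 9.1 may differ, and all constants here are illustrative.

```python
def init_phonotactic_state(canonical, confusion_counts, A=5.0, e=1e-4, d=0.05):
    """Initial substitution distribution s_i and deletion probability d(p_i)
    for one phoneme position of a syllable phonotactic model.

    canonical: the ideal phoneme p_i for this position.
    confusion_counts: dict mapping observed phoneme -> count f_ij gathered
    from the lattice/label comparison on the training corpus."""
    total = sum(confusion_counts.values()) or 1.0
    s = {p: confusion_counts[p] / total + e for p in confusion_counts}
    s[canonical] = s.get(canonical, e) * A      # favour the canonical phoneme
    norm = sum(s.values())
    s = {p: v / norm for p, v in s.items()}     # normalize to a distribution
    return s, d                                  # transition probability is 1 - d
```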

9.3.2 Training

The syllable models may be trained using the following simple technique. As training data are made available, syllable-dependent statistics are averaged with context-independent ones, weighted by the relative proportions of the data. First, the word transcription of a training sentence is converted to an idealized (canonical) syllable transcription. Then, the phonotactic models corresponding to this sequence are fitted to the syllable output of the phoneme-based decoder by computing the optimal Viterbi alignment of the models to the observed syllables. By using the ideal syllable transcription, rather than the syllables actually pronounced according to TIMIT or some other hand-labeled corpus, the phonotactic model is constrained to our purpose: to learn about lexical distortion of syllables in speech, rather than to just model acoustic-phonetic variability at a different level. If the syllable observations consist of a lattice, this lattice is converted to a network, and the models are aligned with the network. If no fit is possible (for example, the length of all n best sequences is different than the target sequence) the sample is discarded.

For each observed syllable, and each phoneme in the correct phonotactic model, a counter is incremented for the inclusion, substitution, or deletion of the canonical phoneme according to the observation. After all data have been observed, these new frequency counts f_k are normalized and used to modify the values of equation 9.1 as follows:

where c(n) is a count of how many instances of syllable n appeared in the training set, and the update weight α depends on the number of training instances. Then the procedure is repeated until no further changes in the alignments are seen.

9.4 From Syllables to Words

9.4.1 Synchronous Phonotactic Search

The phonotactic models are applied to the syllable sequences in order to produce scored word hypotheses. The method used is the standard Viterbi beam search. The output likelihood of this procedure is a measure P(W_jk | W_j) of the degree of distortion exhibited with respect to the canonical representation of the scored word W_j. The global search procedure associates to word hypotheses the score P(W_jk | W_j) Pr(W_j), that is, a combination of the language model and word distortion probabilities. It finds at the same time the most plausible pronunciation.

Once the word lattice is generated as described, interpretation can take place from the lattice using a scoring system that scores concepts and relative acoustic evidence for associated words. Dictation is performed by finding the maximum likelihood sequence among words that survive the first stages of search. The best sequence is based on a combined measure incorporating n-gram statistics, the acoustic scores for the words in the lattice with the hypothesized pronunciations, and the lexical score produced by the phonotactic models. This is done in a pipeline of three processes: syllables produced using simple acoustic models, words hypothesized with phonotactic models, and the best sequence realized using detailed (triphone) acoustic models. An advantage of the new approach is that, unlike fast-match algorithms, it uses an essential computation (the lexical distortion measure) to prune the initial search space.
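A compact dynamic-programming sketch of the kind of lexical-distortion scoring performed at this stage: an observed phoneme sequence is aligned against one syllable's phonotactic model (per-position substitution distributions and deletion probabilities, as in Section 9.3) to obtain a log-domain distortion score. Insertions are omitted for brevity, and the data layout is an assumption rather than the actual implementation.

```python
import math

def syllable_distortion_score(observed, model):
    """Log-probability of the observed phonemes under one syllable's
    phonotactic model.

    observed: list of observed phoneme symbols.
    model: list of (s_i, d_i) pairs, one per canonical phoneme position,
    where s_i maps an observed phoneme to its substitution probability and
    d_i is the deletion probability for that position."""
    NEG_INF = float("-inf")
    n, m = len(model), len(observed)
    # dp[i][j] = best log-prob of matching the first i model positions
    # against the first j observed phonemes
    dp = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        s_i, d_i = model[i - 1]
        for j in range(m + 1):
            # canonical position deleted in the pronunciation
            if dp[i - 1][j] > NEG_INF and d_i > 0:
                dp[i][j] = max(dp[i][j], dp[i - 1][j] + math.log(d_i))
            # canonical position realized as observed[j-1]
            if j > 0 and dp[i - 1][j - 1] > NEG_INF:
                p = (1.0 - d_i) * s_i.get(observed[j - 1], 0.0)
                if p > 0:
                    dp[i][j] = max(dp[i][j], dp[i - 1][j - 1] + math.log(p))
    return dp[n][m]
```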

9.4.2 Some Experiments

Experiments were conducted to measure the usefulness of word lattices produced from syllable hypotheses in the framework of the previous section. A standard word lattice was generated by a single-pass decoder, using context-independent phoneme HMMs (described in [DemoriGaller95]), word bigrams and beam search. Different size beams were tested in order to provide an ROC curve describing the tradeoff between lattice density (mean number of lattice entries per actual word) and global error rate. A similar curve was then produced using the syllable-phonotactic approach. This two-pass method employed the same phoneme acoustic models to generate a sequence of syllables under the control of a syllable bigram network, and then applied the symbol-generating phonotactic models to the syllables in order to generate a lattice of words. The latter pass was mediated with the same word bigrams as in the standard method. Results indicated that the density and WER were higher for syllables than for the standard lattice. For better results more work is needed to improve the training of the syllable phonotactic models.

9.5 Filtering the Syllables

A major weakness of the standard algorithms is the inability to hypothesize the correct word sequence if a word is missing from the lattice. In these systems the only way to prevent the problem is to increase the lattice density to ensure the correct words are nearly always generated [LjoljeEtAl95]. This increases the computational cost of the final-pass search.

The method proposed here offers an advantage because it is possible to reason about the syllable data and re-introduce missing hypotheses. The signal can be rescored with new segmental parameters before word hypotheses are generated. If this is well done, a better WER-density figure of merit can be expected.

9.5.1 Heuristic and Reasoning Methods

Subword phonetic or syllable data provide an opportunity for reasoning about the signal in a way that is not possible when the only information is a set of word candidates. Heuristic rules can be employed to hypothesize word sequences that would have been ruled out by the combination of acoustic knowledge with task-dependent n-grams. In the following sentence fragment (from ATIS2)

. . . (pause) as early in as (uh) in the morning as

the hesitation word "uh" occurs as a gesture signaling the speaker's intention to restart a flawed locution. This kind of spontaneous speech phenomenon will stymie a word network algorithm in two ways: the garbage word may be poorly modeled or confused with a real word, and the sequence of ungrammatical words will be pruned by low word-pair probabilities. Two simple heuristic rules were conceived for application to the syllable hypothesis set:

1. A sequence described as

occurring within a 1-2 second window is tagged as a possible restart event. Words extending through syll_i are re-inserted into this segment of the word lattice produced by the syllable-to-word algorithm.

2. Similarly, the sequence syll_i, pause, syll_i occurring within a specific window is tagged as a hesitation. The same applies to repeated syllable pairs within a larger window.

These rules can be tested experimentally by means of a simple cache scheme in which syllables and syllable pairs are saved within the prescribed window.
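As a concrete illustration, such a cache can be realized with a small fixed-size buffer of frame-stamped syllable hypotheses. The following C sketch is only illustrative: the window length, buffer size, and function names are assumptions rather than part of the system described here, and syllable pairs would be cached analogously.

#define WINDOW_FRAMES 150   /* assumed ~1.5 s at 100 frames per second */
#define CACHE_SIZE    64

typedef struct {
    int syllable_id;        /* identity of the hypothesized syllable */
    int end_frame;          /* frame at which the hypothesis ended   */
} CacheEntry;

static CacheEntry cache[CACHE_SIZE];
static int cache_count = 0;

/* Record a new syllable hypothesis and report whether the same syllable
   was already seen within the window, i.e. a possible restart/hesitation. */
static int record_and_check(int syllable_id, int frame)
{
    int i, kept = 0, repeated = 0;

    /* Drop entries that have fallen outside the window. */
    for (i = 0; i < cache_count; i++) {
        if (frame - cache[i].end_frame <= WINDOW_FRAMES)
            cache[kept++] = cache[i];
    }
    cache_count = kept;

    /* A repeated syllable within the window tags a possible restart event. */
    for (i = 0; i < cache_count; i++)
        if (cache[i].syllable_id == syllable_id)
            repeated = 1;

    if (cache_count < CACHE_SIZE) {
        cache[cache_count].syllable_id = syllable_id;
        cache[cache_count].end_frame   = frame;
        cache_count++;
    }
    return repeated;
}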

9.5.2 Rescoring the Syllable Segments with New Segmental Features

By reasoning about PSS duration and parameter evolution, non-speech events and hesitations can be detected and used to filter and drive word hypothesis generation. Proposed segmental features include: duration, spectral slopes in frequency and in time, a description of a curve in the space of Mel coefficients [Thomson95], and spectral peaks in the maxima of energy. The spectral features introduced in section 8.4.2 would also be useful. High RJL is strong evidence for fricatives. An energy peak/valley pattern indicates the vowel termination within a pseudo-syllable, and a long duration segment without this pattern is likely a non-speech event. Syllable hypotheses within such segments are safely removed from the lattice.

9.6 OOV Event Detection

Another potential application of the syllable method is the detection of out-of-vocabulary (OOV) events. The idea is to perform two searches in parallel: the first, a standard LM approach using a well-modeled subvocabulary and n-grams, and the second, a syllable grammar. Words are annotated with the likelihood α(t_end) - α(t_start) produced by the forward algorithm, and this value is compared with the likelihood of the maximum likelihood syllable sequence, appropriately weighted to compensate for the differently constrained search space. If the syllable sequence probability exceeds that of the competing word hypothesis by a value exceeding a certain threshold, this is marked as an OOV event.
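A minimal C sketch of this comparison follows, assuming both searches deliver forward log-likelihoods over the same time span; the weight and threshold are illustrative placeholders rather than values from the experiments.

/* Forward log-likelihood of a word hypothesis over [t_start, t_end):
   alpha(t_end) - alpha(t_start) in the log domain. */
static double word_segment_loglik(double alpha_end, double alpha_start)
{
    return alpha_end - alpha_start;
}

/* Flag an OOV event when the (weighted) syllable-sequence likelihood beats
   the competing word hypothesis by more than a threshold. */
static int is_oov(double word_loglik, double syllable_loglik,
                  double weight, double threshold)
{
    /* 'weight' compensates for the less constrained syllable search space. */
    return (weight * syllable_loglik - word_loglik) > threshold;
}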

9.7 Discussion

The difficulties inherent to large vocabulary ASR systems are slowly being overcome by the application of many knowledge sources in parallel. The syllable model offers several advantages to system builders trying to integrate disparate knowledge sources. It is general enough to model a vocabulary in a domain-independent way. It provides enough contextual data to allow sophisticated reasoning and filtering algorithms to attack various difficult sub-problems. And it can be used to develop additional knowledge about pronunciation variability in a data-driven way. The phonotactic models can be used to hypothesize multiple pronunciations for frequently misrecognized words. This would constitute an automatic method for the development of optimized lexical dictionaries.

The use of syllable context is being explored by a number of researchers, who have found more evidence that the syllable approach is promising. In [CookRobinson98] a neural network is used to detect syllable onsets with 92% effectiveness. This information is used to preselect one of two sets of phone models for the decoder to employ: an ordinary context-independent phone model, or a syllable-onset version of the model. The introduction of this test decreased the word error rate 8.6% on the Broadcast News corpus. At ICSI (the International Computer Science Institute) a hybrid HMM/neural-net recognition system was designed with syllable recognition units, and compared with a baseline system using phoneme units. Although the syllable system was less accurate, a combined system that merged N-best results derived from each was able to achieve a 19% relative error reduction over the baseline recognizer [WuEtAl98].

Chapter 10

Design of a Fast Search Engine

It is now possible to perform, within certain parameters, large vocabulary speech recognition in real time on a personal computer, and achieve an acceptable word error rate. It still remains a challenge, however, to make a speech engine as efficient as possible, in terms of both computation time and memory usage. If anything, efficiency has become more relevant than ever. In the past the goal of research was to demonstrate that computer speech recognition was a solvable problem. If it could be achieved in the laboratory at N times real-time, then real-time would become possible when N times faster hardware arrived, even without further algorithm improvement. Today speech recognition is a commercial technology finding its way into the marketplace. In the future many, perhaps most, applications may be in common appliances and consumer devices. Cost will become the important factor determining whether a product is sold with it or without it. This makes it critical to build a speech engine that can add value to an existing product without the extra cost of a fast microprocessor or expensive memory chips. A speech interface that can be implemented using an inexpensive DSP board confers a commercial advantage on the manufacturer who licenses or owns it. Not surprisingly, detailed implementation descriptions for speech engines do not often appear in the technical literature.

This chapter is a sort of "how-to" for building a fast and compact decoder for a simple one-pass recognition task. The requirements of the front-end module are omitted. (The front-end computes a small vector of features, frame-synchronously, in better than real time. In some of the experiments discussed in this thesis the vector consists of 18 PLP-based floating point coefficients (section 5.4.5); in others 38 MFCC coefficients (section 5.4.6). The feature vector computation has a cost roughly on the order of computing an FFT. After the vector is handed off to the decoding module, the front-end can reuse its data structures. Neither the computational burden nor the memory usage associated with the front-end is significant compared to the requirements of the decoder.) Beginning with a general scheme based on the algorithms of chapter 6, we trade off generality for efficiency, removing data structures and computational steps that are unnecessary, and design a system tailored to the search domain which employs a minimal amount of storage and wastes very few CPU clock cycles. At the end of the chapter a figure illustrates the absolute improvement achieved when the two schemes are implemented and verified.

10.1 A General Speech Engine

Figures 10.1 and 10.2 present the data structures required in a speech engine under the assumptions of

• continuous mixture density models, using Gaussian probability density functions, which are tied to the HMM transitions (not the states),

• a general HMM representation supporting transitions between any two states, including backward or skip transitions,

• a directed graph, or finite state network, language representation, supporting transitions between any two of the nodes, which represent speech units (HMMs),

• a recognition grammar complex enough to require beam search,

• and a statically embedded language model, with probabilities attached to arcs of the graph.

This language representation allows cycles, and therefore can generate sequences of speech units of unbounded length. In practice, either the vocabulary size must be small or the language model must be simple. A bigram language model for n words would require O(n^2) transitions; trigrams would require O(n^3) transitions, and so on. In this sense the first decoder described here is not general. It could not be used for large vocabularies in continuous speech without adding efficient mechanisms for the application of long-range contexts at the levels of both the acoustic model and the language model. It also does not provide for the modeling of acoustic and word contexts that were not seen in the training data, e.g. unseen triphones. It is nevertheless a straightforward implementation of a Viterbi decoder that could be used for isolated word recognition of any vocabulary size, or continuous speech recognition with limited vocabulary size, i.e. a lexicon that can be processed in one pass of recognition in real time. This framework is chosen for its simplicity to illustrate the development of a truly efficient decoder. At the end of the discussion we will reintroduce the issues of large vocabulary continuous speech recognition, and indicate how they can be dealt with using the very fast and small core speech engine we will develop in this chapter as a basis for a more complex system.


Figure 10.1: The decoder's static data structures.

10.1.1 Static Data Structures

The data structures in figure 10.1 are static, or permanent. They are used to describe the acoustic and language models.

The first two structures represent the continuous Gaussian mixture densities. The number of Gaussian components to a multivariate probability density function (pdf) is fixed, since a vector with the same dimension (i.e. the same number of features) is computed for each frame of speech by the front-end. Each pdf has d Gaussians, where d is the dimension of the feature vector. For each Gaussian there is a mean value and a variance, stored as floating point values, and these are stored in the first data structure, the distribution array. Each mixture contains s densities, where s can vary from model-state to model-state. This provides for different size mixtures. The components of a mixture are stored contiguously, so that the mixture can be represented with a starting index and an ending index into this array. The second structure is the mixture table. Each mixture is represented by three values: the start index pointing to the distribution array, the mixture size, and the mixture weight.

The third structure, the model, represents an HMM as an array of transitions, and contains the source state from, the destination state to, the transition probability, and an index to the mixture table.

The fourth and fifth structures are the model list and the word list (i.e. dictionary). The size of each hidden Markov model, in number of states, is a one-byte value stored with each entry in the model list. It is still realistic to assume that there are no more than 64k (about 65,000) HMM units and 64k words in the dictionary. This allows the other structures to use two-byte values to identify the models and words. Assume for convenience of memory estimation that all words in the dictionary are stored in arrays of 32 bytes or less. (Of course, a prefix tree representation of the dictionary would use memory more efficiently.) Similarly, the HMM unit names pointed to by name-ptr in the model list are ASCII strings stored in arrays of not more than 32 bytes.

The sixth structure is the finite state network, and represents the recognition grammar and language model. It is an array of transitions corresponding to arcs of the graph. This array contains the source node from, the destination node to, and the transition probability. In the case of an intra-word transition the probability is 1; an inter-word transition has a probability taken from the language model. A transition entry also contains an index to the model list, and an index to the word list. The word index defaults to -1 if the HMM is not a word-ending unit.

This set of structures is one of many ways the static data can be described. Pointers can be replaced with array indices and vice-versa, depending on how substructures are allocated. HMM transition probabilities and the corresponding mixture identifiers can be stored in a matrix, or separate matrices, referenced by HMM state. This would eliminate the need for from and to fields, but for typical feed-forward acoustic HMM topologies the matrix representation would be sparse. Let g be the number of Gaussian distributions, m be the number of mixtures, h be the number of HMM units, w be the number of words, t_h be the number of transitions per HMM, and t_n be the number of network transitions.
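One possible C rendering of these static structures is sketched below. The field names and integer widths are illustrative choices consistent with the restrictions above (two-byte unit and word identifiers, one-byte state counts), not a prescribed layout.

#include <stdint.h>

/* Distribution array: one mean/variance pair per Gaussian per dimension. */
typedef struct {
    float mean;
    float variance;
} Gaussian;

/* Mixture table: a contiguous slice of the distribution array plus a weight. */
typedef struct {
    uint32_t start_index;   /* first Gaussian of the mixture       */
    uint8_t  size;          /* number of densities in the mixture  */
    float    weight;
} Mixture;

/* Model: an HMM stored as an array of transitions. */
typedef struct {
    uint8_t  from;          /* source state (< 256)                 */
    uint8_t  to;            /* destination state (< 256)            */
    float    prob;          /* transition probability               */
    uint16_t mixture_id;    /* index into the mixture table         */
} ModelTransition;

/* Model list entry (the word list is analogous, holding the dictionary). */
typedef struct {
    char    *name_ptr;      /* ASCII unit name, at most 32 bytes    */
    uint8_t  size;          /* number of states in the HMM          */
} ModelEntry;

#define NO_WORD 0xFFFFu     /* sentinel for a non-word-ending unit  */

/* Network: arcs of the recognition grammar / language model graph. */
typedef struct {
    uint16_t from;          /* source node                          */
    uint16_t to;            /* destination node                     */
    float    prob;          /* 1 for intra-word, LM prob otherwise  */
    uint16_t unit_id;       /* index into the model list            */
    uint16_t word_id;       /* index into the word list, or NO_WORD */
} NetworkTransition;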
We have already restricted the number of HMM states to less than 256, so that a byte is sufficient for the From and To values in the model structure. A similar restriction can be imposed on mixture size. It follows that the data structures described above would require

bytes of memory. Note that m <= g and in most cases t_h << 256. The memory cost is therefore of order O(g + h + t_n), i.e. it depends on the number of distributions, HMM models, and network transitions. In large vocabulary systems, the number of distributions is constrained with clustered training algorithms to limit the amount of computation that must be performed at run-time. In these systems, the network size tends to have the dominant memory requirement, followed by the model space, with the distributions having the least impact.

10.1.2 Dynamic Data Structures

The data structures in figure 10.2 are used for run-time computation. They are used to explore the search space as each frame of a given signal is handed off to the decoder. This process can be viewed as expanding the next column of the Viterbi trellis (fig. 6.2). The dynamic data structures are reinitialized (or reallocated) after each frame, each block of frames, or each signal is processed, depending on the structure.

The first structure, the TColumn, is a place holder for the maximum incoming and outgoing likelihood for each node of the search grammar. The input field is used to initialize the first state of each HMM that begins at a given node of the network. This value will correspond to the maximum likelihood hypothesis that reached this node in the previous frame. The output field is used to store the maximum likelihood among all HMM final states connected to this node. This value will correspond to the maximum likelihood hypothesis to reach the node in the current frame.

The second structure is the Hypothesis array. This structure contains a list of active hypotheses (recall that we assume beam search). Each active hypothesis corresponds to an HMM unit somewhere in the network in which at least one state survived the beam applied in the previous frame. An HMM may be simultaneously active in many different places in the network. For every one of these there is an entry in the array, representing a different sequence of models, or hypothesis, that has survived the pruning process this far during the forward pass of the Viterbi algorithm.

The Hypothesis structure actually contains pointers to two parallel arrays: an in array and an out array which are used, much like TColumn, for carrying values between consecutive frames. The arrays are indexed by network transition IDs (equivalent to the indices for these transitions in the static Network array). Each entry of the in/out arrays contains space for a pointer. When the HMM for the network transition is made active, a subarray is allocated, and its pointer is stored in the corresponding entry in the Hypothesis in array. A matching substructure is allocated for the out array. Otherwise the pointers are NULL. The subarray is indexed by HMM-state number. It contains an accumulated likelihood place holder prob and an accumulated time value elapsed for each state of the HMM.

The frame computation is as follows. For each active model-state, the in probability is added to the emission probability for each outgoing HMM-transition, and the result is stored in the out probability for the destination state of the HMM-transition. (Addition is performed because we are computing in the log-likelihood domain.) It is stored there only if its value is greater than the value that is there already. At the end of the frame computation for this unit the maximum likelihood among incoming HMM-transitions to each destination state is stored there.
Elapsed is the total duration, in frames, of the hypothesis. Since one frame has been consumed by the model, when the likelihood is carried forward, the value of elapsed is incremented by 1, and carried forward as well. The likelihood and elapsed values computed in the previous frame are inputs to the current frame computation. The values computed in the current frame are stored in the Out substructure for the next frame. The "classical" implementation swaps the Out structure with the In structure at the end of the current frame computation, in order to prepare the new inputs. Then the new Out structure is reinitialized. (Actually, only the probabilities must be zeroed out.)

Figure 10.2: The decoder's dynamic data structures.

The last structure, the BackPointer trellis, stores the trellis information needed to trace the winning path backwards when the last frame, either of the signal or of a block of frames, has been computed. These data must be preserved until the last frame computation. They consist of a network transition identifier and a value elapsed-time for each winning hypothesis at each network node in each frame. During the back-track procedure, the BackPointer structure is dereferenced by frame-time and by network-node. Network-trans identifies the preceding HMM unit in the winning sequence, and the network-node the unit came from. Elapsed-time indicates the time the sequence entered that HMM. When the back-track uncovers frame-time T = 0, the entire sequence has been recovered.

Let f be the maximum number of frames, t_n be the number of network transitions, N_n be the number of network nodes, s be the number of states per HMM, and A be the average number of hypotheses to survive the beam each frame. As before, we restrict the number of HMM states to be a byte value. The frame counts can be reasonably restricted to a two-byte value. The reusable data structures then require

bytes of memory. Note that by dynamically allocating the HMM computation subarrays as needed, we make the third term of sum 10.2 dependent on the beam width, instead of the total network size. For any graph it is the case that O(N_n) <= O(t_n). For smaller networks a large beam will be used, so that A ≈ t_n. In large vocabulary search, we use a beam such that A << t_n. In either case the dynamic memory cost is of order O(t_n · f), i.e. it depends on network size (measured in total number of transitions) and maximum signal (or block) length. As vocabulary size increases, the memory cost of the network representation becomes prohibitively large, indicating where much of the optimization will come from in the next section.
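A corresponding C sketch of the run-time structures follows; the names are again illustrative, and the per-state subarrays are allocated only when an HMM instance becomes active, as described above.

#include <stdint.h>
#include <stdlib.h>

/* Per-state accumulator for one active HMM instance. */
typedef struct {
    float    prob;           /* accumulated log-likelihood          */
    uint16_t elapsed;        /* accumulated duration in frames      */
} StateCell;

/* TColumn: best incoming/outgoing likelihood per network node. */
typedef struct {
    float *input;            /* one entry per network node          */
    float *output;
} TColumn;

/* Hypothesis: parallel in/out arrays indexed by network transition ID.
   A NULL pointer means the transition is not active in this frame.   */
typedef struct {
    StateCell **in;          /* in[t]  -> subarray indexed by HMM state */
    StateCell **out;         /* out[t] -> subarray for the next frame   */
} Hypothesis;

/* Back-pointer cell: enough to trace the winning path after the last frame. */
typedef struct {
    uint32_t network_trans;  /* transition that won at this node/frame */
    uint16_t elapsed;        /* time spent in the winning unit         */
} BackPointer;

/* Activate a transition: allocate its per-state subarrays on demand. */
static void activate(Hypothesis *h, int trans, int n_states)
{
    if (h->in[trans] == NULL) {
        h->in[trans]  = calloc((size_t)n_states, sizeof(StateCell));
        h->out[trans] = calloc((size_t)n_states, sizeof(StateCell));
    }
}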

10.1.3 The Search Algorithm

The following pseudo-code will perform a Viterbi beam search using the data structures described above:

/* initialize */
FRAME <-- 0
TCOLUMN.IN[0] <-- 1
HYPOTHESIS.IN[0][0].ELAPSED <-- 0

/* process frames one by one */
repeat
  for each active transition T in NETWORK
    HYPOTHESIS.IN[T][0].PROB <-- TCOLUMN.IN[T->FROM]
    MODEL <-- NETWORK[T].UNIT
    for each transition TRANS in MODEL
      SOURCE <-- MODEL[TRANS].FROM
      DEST <-- MODEL[TRANS].TO
      CONTRIBUTION <-- COMPUTE_MIXTURE(MODEL, TRANS)
      TEMP <-- HYPOTHESIS.IN[T][SOURCE].PROB + CONTRIBUTION
      if TEMP is within the beam
         and TEMP > HYPOTHESIS.OUT[T][DEST].PROB
        HYPOTHESIS.OUT[T][DEST].PROB <-- TEMP
        HYPOTHESIS.OUT[T][DEST].ELAPSED <--
            HYPOTHESIS.IN[T][SOURCE].ELAPSED + 1
        if DEST is final state in MODEL
           and TEMP > TCOLUMN.OUT[T->TO]
          TCOLUMN.OUT[T->TO] <-- TEMP
          BACKPTR.OUT[FRAME][T->TO].TRANSITION <-- T
          BACKPTR.OUT[FRAME][T->TO].ELAPSED <--
              HYPOTHESIS.OUT[T][DEST].ELAPSED
        endif
      endif
    endfor
  endfor
  swap TCOLUMN.IN, TCOLUMN.OUT
  swap HYPOTHESIS.IN, HYPOTHESIS.OUT
  FRAME <-- FRAME + 1
until last frame received

/* find winning node in last frame */
MAX <-- -MAXFLOAT
for each NODE in network
  if TCOLUMN.OUT[NODE] > MAX
    MAX <-- TCOLUMN.OUT[NODE]
    WINNER <-- NODE
  endif
endfor

/* follow back-pointers to recover recognition sequence */
NODE <-- WINNER
while FRAME > 0
  T <-- BACKPTR[FRAME][NODE].TRANSITION
  output(NETWORK[T].UNIT)
  FRAME <-- FRAME - BACKPTR[FRAME][NODE].ELAPSED
  NODE <-- NETWORK[T].FROM
endwhile

10.1.4 Analyzing the Computational Cost

The algorithm can be analyzed in terms of number of operations, given f frames of signal, t_h transitions per HMM, and A active hypotheses on average per frame. Details of the beam computation, while not insignificant, are omitted. In the analysis, the following abbreviations apply:

• ASN: an assignment, or data store, operation

• CMP: a comparison operation

• INT: an integer math operation

• FLT: a floating-point math operation

The mixture density size (number of Gaussians) times the dimension of the pdf (the size of a feature vector) is dd. The first block of pseudo-code, "initialize", consists of 3 ASN operations. The second block, "process frames", consists of approximately

f[t_n(INT + CMP) + 7ASN + INT] + fA(5ASN + 2CMP) + fA·t_h[4ASN + 4CMP + 1.5INT + (3 + 4dd)FLT]    (10.3)

operations. The final block of pseudo-code, "follow back-pointers", consists of f(4ASN + INT) operations. Clearly, the computation for each frame depends on three quantities: the size of the network, the number of Gaussians, and the size of the search beam. For a small network with large mixtures, and a wide search beam, the computation is dominated by the floating point operations of the mixture density computation in the third term of approximation 10.3.

10.2 Optimizations

The decoder described in the previous section can be optimized in many ways, by trading off its generality for improvements in memory and time. Some of these tradeoffs are obvious, others less so. We will restrict the recognition problem to the following:

1. Isolated word recognition.

2. Fixed word-sequence continuous speech recognition. Examples: spelled name recognition for call routing; phrase recognition.

Figure 10.3: A minimal tree representation for the network.

These kinds of tasks can have a different network representation: trees instead of graphs. From this simplification we can derive many new efficiencies. The network, as we have seen, dominates the memory space of the decoder. By eliminating cycles we can use a minimal tree representation for the network (figure 10.3), which reduces the memory requirements of the static data structure, and eliminates the need for some of the dynamic structures.

Figure 10.3 describes the topology of the illustrated tree with a compact array containing a breadth-first representation. Each slot in the array represents a node N_i. It contains the array index of the first child node attached to N_i. The remaining child nodes are those array slots that sequentially follow the first child. The last child of N_i is denoted by setting the most significant bit of the value contained in the slot. (In the figure the set bit is illustrated by a black dot.) If a node has no children, the corresponding array slot contains a distinguishing special value (denoted null in the figure), which functions as a null pointer.

Figure 10.4: A simple feed-forward HMM topology.
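Traversing the breadth-first child array takes only a few lines of C. In the sketch below the most significant bit of a slot marks a last child and a sentinel value stands for "no children"; the particular constants and the 16-bit slot width are assumptions for illustration.

#include <stdint.h>
#include <stdio.h>

#define LAST_CHILD_BIT 0x8000u     /* MSB of a 16-bit slot         */
#define NO_CHILDREN    0x7FFFu     /* distinguished "null" value   */

/* tree[i] holds the index of node i's first child (or NO_CHILDREN);
   the remaining children follow sequentially, and the child whose own
   slot has its MSB set is the last child of its parent.             */
static void visit_children(const uint16_t *tree, int node)
{
    uint16_t slot = (uint16_t)(tree[node] & ~LAST_CHILD_BIT);
    if (slot == NO_CHILDREN)
        return;                              /* leaf node                 */

    int child = slot;
    for (;;) {
        printf("child of %d: %d\n", node, child);
        if (tree[child] & LAST_CHILD_BIT)    /* was this the last child?  */
            break;
        child++;                             /* siblings are contiguous   */
    }
}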

A tree contains a single path from leaf to root. This means each leaf corresponds to, and identifies, exactly one speech sequence. No back-track is required to recover the recognized string, only a pointer from the leaf node to the lexicon of the decoder. Eliminating the back-track procedure removes the need for the expensive BackPointer structure, which, as seen in sum 10.2, requires an allocation of memory proportional to network size and signal length.

Another restriction we introduce is a restriction of HMM topologies to a simple feed-forward structure, as in figure 10.4. The HMM topology is completely described by the number of states. The transitions are implicit. Every state emits a self-transition and a forward transition. The same mixture is tied to the incoming transition and the self-transition of each state. In the figure, there is a dotted self-transition attached to the first and final states. This is because these two implied transitions are not computed; in practice they do not exist. Similarly, there is no mixture tied to the incoming transition to the final state, because this transition is non-emitting, and does not consume a frame.

The loss of generality is less serious than it might seem. In practice these simple topologies are the ones most often used for acoustic modeling, and as seen in chapter 8, the improvements gained by deviating from them are small. As we shall see, the added efficiency gained from a regular topology speeds up search enough to more than offset whatever might be lost in accuracy.

10.2.1 Static Data Structures

The computational inefficiency of the network and HMM representation in section 10.1 derives in large part from the need to endlessly traverse the language and model graph structures during the decoding process. Simplification and regularization of these structures eliminates much of this repetitive lookup process, and it eliminates extra data elements that support the lookup process. The simplified static data structures are illustrated in figure 10.5. The Distribution and Mixture tables are unchanged. The Network structure no longer needs to store the From data, since the tree is always traversed downwards. The Model structure does not need From or To information. The Model topology is completely defined by the number of states, which is stored in the one-byte variable size in the Model List (figure 10.1).

Figure 10.5: Simplified static data structures.

Let g be the number of Gaussian distributions, m be the number of mixtures, h be the number of HMM models, s_h be the number of states per HMM, and t_n be the number of network transitions. It follows that the static data structures now require the number of bytes of memory given by sum 10.4. Compare sum 10.4 to 10.1. The network cost is reduced 25%. Since s_h is roughly half the number of transitions for a given unit's topology, the memory cost of the HMM representations is reduced as well. Even greater space saving is accomplished with respect to the dynamic structures.

10.2.2 Data Structures for Simplified Search

The modifications to the dynamic data structures are motivated by a simple objective: we wish to store the dynamic data with the hypotheses that accumulate during search. In doing so we will allocate an amount of search-related memory directly proportional to the width of the search beam (i.e. the number of active hypotheses) and not to the size of the language-network model.

A second advantage emerges from the restrictions on the network and model topologies. Because there are no backward transitions, it is possible to compute the likelihoods of network-nodes and model-states in order and in place. There is no need for an In and Out place holder, and no need for these structures to be swapped at the end of a frame computation. In contrast, when computing likelihoods over a graph with cycles, you must preserve all the input likelihoods until the end of the frame computation. As long as there are still transitions left to process, some of these may depend on previously encountered input values.

The dynamic data structures are modified in the following ways:

• There is no TColumn structure. An input value for the next column of the trellis is stored with the Hypothesis data that depend on it.

• There is no BackPointer structure. When a leaf node is reached, it points directly to the recognized sequence.

• Each hypothesis is stored with just one subarray which holds the state likelihoods of the corresponding HMM unit. This subarray is processed in reverse order, from the last state to the first, so that an input state likelihood is overwritten only after the destination state is computed.

• A hypothesis is stored with an identifier corresponding to a node in the Network array. This node is a destination node of a transition in the network tree.

Figure 10.6: The new search algorithm is driven by the active Hypothesis structure.

Since the network is a tree, each node has exactly one input transition. Thus the destination node is sufficient to denote the transition. The search algorithm references this node in Network in order to identify the HMM unit that feeds into it, and the child nodes connected to it. This lookup step is the only time during the forward processing that the network array is checked. The new Hypothesis structure is illustrated in figure 10.6. It contains fields for the network node identifier, and the input likelihood values for each state of the corresponding model. State 0 stores the incoming likelihood, computed in the previous frame, for the network node. There is also a value bestscore used for the beam threshold computation. The data structure requires bytes of memory, where s is the number of HMM states and A is the average number of active hypotheses. Contrast this with the memory estimate 10.2.
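In C, the per-hypothesis record and the in-place, last-to-first state update might look as follows. This is a sketch under the stated topology assumptions, with illustrative field names and an arbitrary bound on the number of states.

#include <stdint.h>

#define MAX_STATES 8    /* illustrative bound on states per HMM */

typedef struct {
    uint16_t node;                 /* destination node in the network tree  */
    float    bestscore;            /* best state likelihood, for the beam   */
    float    prob[MAX_STATES];     /* prob[0] holds the incoming likelihood */
} ActiveHyp;

/* One frame of in-place Viterbi update for a feed-forward HMM.
   emission[s] is the log mixture likelihood tied to state s;
   trans[s] and self[s] are the log transition probabilities.   */
static void update_hypothesis(ActiveHyp *h, int n_states,
                              const float *emission,
                              const float *trans, const float *self)
{
    h->bestscore = -1e30f;
    /* Process states from last to first so that prob[s-1] still holds the
       previous frame's value when state s is computed.                    */
    for (int s = n_states - 1; s >= 1; s--) {
        float from_prev = h->prob[s - 1] + trans[s];
        float from_self = h->prob[s]     + self[s];
        float best = (from_prev > from_self ? from_prev : from_self)
                     + emission[s];
        h->prob[s] = best;
        if (best > h->bestscore)
            h->bestscore = best;
    }
}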

10.2.3 The Search Algorithm

The search algorithm works by processing a list of hypotheses, allocating the memory space for the dynamic data within this list, and reusing the allocated space for a hypothesized network transition from one frame to the next. Both time and memory are directly proportional to the quantity of active hypotheses. That is why the resulting decoder is much more efficient.

The algorithm processes the list of active hypotheses twice. The first pass computes the new output likelihoods, and finds the best one for the purpose of computing a beam threshold. The second pass compares each final-state likelihood to the beam threshold. If it falls within the beam, the network structure is used to determine which units are connected to the output of this hypothesis model. Each of these is added to the list of hypotheses, if it is not there already. Then the likelihood is copied to the initial-state placeholder for each of the connected units. The second pass also takes care of removing hypotheses from the active list, as follows. The best non-final state likelihood for a hypothesis model is compared to the beam threshold, and if it does not lie within the beam, then this hypothesis is removed from the list. If it does lie within the beam, then this hypothesis remains in the list so it can be processed in the following frame. The pseudo-code for the new search procedure is as follows:

/* initialize */
NUM_HYPOTHESES <-- 1
HYPOTHESIS[0].NODE <-- 1
HYPOTHESIS[0].PROB[0] <-- 1

/* process frames one by one */
repeat
  for each hypothesis I
    HYPOTHESIS[I].BESTSCORE <-- -MAXFLOAT
    MODEL <-- NETWORK[HYPOTHESIS[I].NODE].UNIT
    for each state STATE in MODEL from last to second
      CONTRIB <-- COMPUTE_MIXTURE(MODEL, STATE)
      R1 <-- HYPOTHESIS[I].PROB[STATE-1] + MODEL[STATE].TRANS + CONTRIB
      R2 <-- HYPOTHESIS[I].PROB[STATE] + MODEL[STATE].SELF + CONTRIB
      TEMP <-- MAX(R1, R2)
      HYPOTHESIS[I].PROB[STATE] <-- TEMP
      /* The following keeps a maximum likelihood over the states of this
         hypothesis model. Although not shown here, an optimal value over
         all states computed for the frame is also preserved. */
      HYPOTHESIS[I].BESTSCORE <-- MAX(HYPOTHESIS[I].BESTSCORE, TEMP)
    endfor
  endfor
  for each hypothesis I
    if HYPOTHESIS[I].BESTSCORE is not within the beam
      REMOVE(I)
    else
      NODE <-- HYPOTHESIS[I].NODE
      LAST <-- index of last state of NETWORK[NODE].UNIT
      if HYPOTHESIS[I].PROB[LAST] is within the beam
        for each node CHILD in NETWORK[NODE].TO linked list
          if CHILD is active in some hypothesis J
            HYPOTHESIS[J].PROB[0] <-- HYPOTHESIS[I].PROB[LAST]
          else
            NEWHYP <-- ADD(CHILD)
            HYPOTHESIS[NEWHYP].PROB[0] <-- HYPOTHESIS[I].PROB[LAST]
          endif
        endfor
      endif
    endif
  endfor
until last frame received

/* find winning node in last frame */
MAX <-- -MAXFLOAT
for each hypothesis I in list
  NODE <-- HYPOTHESIS[I].NODE
  LAST <-- index of last state in NETWORK[NODE].UNIT
  if HYPOTHESIS[I].PROB[LAST] > MAX
    MAX <-- HYPOTHESIS[I].PROB[LAST]
    WINNER <-- NODE
  endif
endfor

/* output answer */
output NETWORK[WINNER].WORD

The ADD() and REMOVE() list-modifying functions should employ binary lookup of an ordered hypothesis list, to achieve an O(A log A) rather than an O(A^2) computational overhead. This can be achieved by use of a red-black tree or a B-tree structure for the Hypothesis data. This adds 8 bytes of pointer data for each Hypothesis entry. We tolerate these space and time costs when A << N, where N is the size of the Network. If this assumption does not hold for a particular recognition task, we can trade off memory in the Network structure for more efficient (linear-time) processing of the active Hypothesis list: a pointer can be stored with each network node to the corresponding Hypothesis entry. In this case insertions and update operations would be applied by adding the appropriate hypotheses to a second list, a simple array to be processed in the following frame. Hypotheses pruned from the active list would be simply deallocated.

10.2.4 Analyzing the Computational Cost

In contrast to the general decoder sketched out in the first part of this chapter, the new algorithm has no fixed network cost. All three blocks of pseudo-code have either a constant number of operations or a computational cost dependent on the number of active hypotheses. As before, the algorithm can be analyzed in terms of number of operations. Let f be the number of frames of signal, s_h the number of states per HMM, and A the active hypotheses on average per frame. The mixture density size times the number of coefficients is dd, and the average branching factor of the network tree is b_f. The first block of pseudo-code, "initialize", consists of 3 ASN operations. The second block, "process frames", consists of approximately

fA[2ASN + 2FLT + 2CMP + s_h(10FLT + 5ASN + 2CMP + 4dd·FLT) + b_f(INT + 2CMP + 2ASN + 2(log A)(INT + CMP + ASN))]

operations. The final block of pseudo-code, which retrieves the winning node, consists of A(4ASN + FLT + CMP) operations. Clearly, the computation for each frame depends strongly on only two quantities: the number of Gaussians, and the size of the search beam. The ordering of the hypotheses in a list necessitates O(log A) operations to find, add or remove them. If we disregard constants we find the algorithm has complexity O(A log A); i.e. it depends entirely on the search beam width and not on the size of the language.

The efficiency of this algorithm compared to the initial, general implementation is made clear in table 10.1. The two decoders were tested on grammars of up to 30,000 spelled street names. The tests ran on a Pentium-II workstation, with a 450 MHz clock and 236 megabytes of memory. The test data consisted of 160.1 seconds of speech. For the large test (30,000 names) the search algorithm achieved a recognition accuracy of 94.6%. The optimized engine performed the same search in one eighth the time, while reducing memory requirements from 210 to less than 9 megabytes.

Problem Size    Decoder Version    Memory Usage    Speed (x Real-Time)
3,000 words     General            12,860k         0.12
3,000 words     Tree-based         1,152k          0.067
30,000 words    General            210M            0.97
30,000 words    Tree-based         8,708k          0.12

Table 10.1: Time/memory comparison of general and optimized decoders.

10.2.5 Enhancements for Continuous Speech

The stripped-down speech engine of this chapter is easily modified for continuous recognition tasks, while preserving most of its inherent efficiencies. The case of continuously spoken utterances from a language of fixed sequences is exemplified by spelled word recognition. This language can be implemented as a tree with one modification: each node of the tree has an implicit looped silence model between it and its successors. The regularity of this language allows us to compute it without the considerable overhead of inserting all these extra nodes in the Network structure. Whenever a node enters the active hypothesis list, a corresponding looped silence model is added to the list as well. The identity of the silence hypothesis is unique in (Network Node id, Unit id).

For general language recognition, we restore the data and computational overhead associated with the BackPointer structure. What is required is a lattice of hypothesized word endings in signal-time from which to trace the best or N-best sequences. Also the initializing of hypothesis model-state zero is more complicated; the very best incoming likelihood and its history must be selected among all incoming transitions. But we continue to benefit from the efficiencies of the simple HMM topology, the in-place computation of state likelihoods, and the elimination of costly, repetitive processing of network transition arrays and model transition arrays.

Finally, the reader should note that very large vocabulary continuous speech recognition requires more sophisticated recognition algorithms, often based on multiple passes of search. Each such pass will require its own set of optimizations. However, a form of the decoder described here can serve as an efficient first-pass search engine for a fast match, generating a phone lattice or word graph from subword HMMs.

Chapter 11

Spelled Letters: Algorithms and Application

This chapter is about robustness improvements that were made to a real, working continuous speech recognition system. A speaker-independent speech recognizer for continuously spelled names, implemented for a switchboard call-routing task, was analyzed for sources of error. The results of a field trial indicated that most errors were due to (1) extraneous speech, and (2) end-point detection errors. To attack the second problem, a strategy is proposed for improving the system's tolerance for speech with pauses. To deal with the first, and more intractable, problem, a new algorithm is introduced for spotting spelled-word sequences in a signal containing extraneous speech. Experimental results in the laboratory show that with the letter-spotting algorithm, the name retrieval error rate is reduced by 60.7% on signals with extraneous speech, from an absolute error rate of 75.8% to 29.8%. On clean speech the absolute error rate increases from 4.5% to 5.5%. On data collected during a follow-up trial of the working system, name retrieval error decreases by 54.1%, from 23.3% to 10.7%, an improvement solely due to the new letter-spotting algorithm.

11.1 Introduction

The initial system studied in this chapter is a multi-pass HMM-based decoder for spelled-name recognition over the telephone, described in [JunquaEtAl95] (see also [Junqua97]). This system is called SmarTspell(TM). The first pass employs a bigram language model and a Viterbi beam search to produce the N-best (typically 20) hypothesized letter sequences. In the second pass, the sequences are compared to a dictionary and a dynamic-programming match is used to select the N-best names in the name directory, based on statistics for letter confusion. A third pass of recognition is then performed in which the acoustic parameters of the signal are fitted to the HMM sequences representing the N-best names from the directory. In this final pass a full Viterbi search is performed on the reduced set of candidate names. In this way an accurate, efficient search is produced by filtering the search space with different constraints through various stages.

The recognizer was integrated with the telephone switch at a corporate business office. Several months of field tests were logged in which callers used the system to direct their calls to company staff members. Callers were asked to confirm the correctness of the recognition output with a yes/no response. Calls were recorded, logged, and later transcribed to determine the source of errors. Figure 11.1 presents the collected performance statistics. According to the experimental data, the two most common sources of error were pauses interposed in the letter sequence, causing a premature detection of end-of-speech, and extraneous speech, usually at the beginning, e.g. "Smith [pause] s-m-i-t-h". Other major sources of error include low signal energy collected through speaker-phones, and other effects of channel mismatch between training and test conditions.

1 The system served as a telephone autoattendant at Speech Technology Laboratory - Panasonic Technologies Limited, in Santa Barbara, California. It was active from about 1994 to 1997. By 1998 it had been replaced by a whole-name recognizer. In the new system spelled name recognition served as a fall-back procedure.

2 This algorithm is the subject of a patent filing entitled "Speech Recognition System Employing Multiple Grammar Networks," by Michael Galler and Jean-Claude Junqua, filed April 16, 1997 (H07-1222) in the U.S., and filed in Japan, Taiwan and China (H07-1222).
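The dictionary match of the second pass is essentially a weighted edit distance between a hypothesized letter sequence and each dictionary name. A minimal sketch in C is given below; the confusion-cost table, insertion and deletion costs, and the assumption of lower-case letters are illustrative placeholders, not the system's trained statistics.

#include <string.h>
#include <stdlib.h>

#define ALPHA 26

/* confusion[a][b]: cost of matching hypothesized letter a to dictionary
   letter b (0 when a == b, larger when acoustically dissimilar).
   Both strings are assumed to contain lower-case letters 'a'..'z'.      */
static double align_cost(const char *hyp, const char *dict,
                         const double confusion[ALPHA][ALPHA],
                         double ins_cost, double del_cost)
{
    size_t n = strlen(hyp), m = strlen(dict);
    double *prev = malloc((m + 1) * sizeof(double));
    double *curr = malloc((m + 1) * sizeof(double));

    for (size_t j = 0; j <= m; j++)
        prev[j] = (double)j * ins_cost;

    for (size_t i = 1; i <= n; i++) {
        curr[0] = (double)i * del_cost;
        for (size_t j = 1; j <= m; j++) {
            double sub = prev[j - 1]
                       + confusion[hyp[i - 1] - 'a'][dict[j - 1] - 'a'];
            double del = prev[j] + del_cost;
            double ins = curr[j - 1] + ins_cost;
            double best = sub < del ? sub : del;
            curr[j] = best < ins ? best : ins;
        }
        /* swap rows for the next iteration */
        double *tmp = prev; prev = curr; curr = tmp;
    }
    double result = prev[m];
    free(prev);
    free(curr);
    return result;
}

In the system described above, each hypothesized letter sequence would be aligned against every directory name in this fashion, and the names with the lowest costs retained as candidates for the final acoustic pass.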

11.2 Related Work

11.2.1 Call Routing

The work described here involves a number of different issues, including call-routing by name recognition, spelled-word recognition, word-spotting techniques, rejection of noise and extraneous speech, and the design of user interfaces for telephone-based speech recognition systems. Speech-operated auto-attendants are now being deployed by many institutions. A number of systems have been introduced experimentally for call-routing that involve speaker-independent, large-directory, whole-name recognition [SmithBates93], [YamamotoEtAl94], [BilliEtAl95]. Some systems have employed hybrid schemes that attempt to manage errors through dialogue, prompts, and user-feedback [JohnstonEtAl96], [KellnerEtAl96], and a mixture of whole-name and spelled-letter fallback [JohnstonEtAl96]. In contrast with these systems, [FraserEtAl96] attempts to perform call-routing for small directories on small, inexpensive hardware, based on whole-name recognition.

Figure 11.1: Call recognition and error rates.

The system described here can be seen as a competitor to these systems because it manages name recognition using a small set of acoustic models that run quickly on simple hardware. Also, because of the extra information exploited in the lexical comparison to the dictionary in this multi-pass algorithm, the recognition accuracy is better than for conventional systems which put all the spelled names into a network [Junqua97]. It can also be seen as a complementary technology, or fall-back procedure, that can be employed by the larger, whole-word systems.

3 The implementation of SmarTspell(TM), together with its design as an autoattendant, was the subject of U.S. Patent #5,799,065 (H07-1221), entitled "Call Routing Device Employing Continuous Speech", by Jean-Claude Junqua and Michael Galler, approved August 25, 1998.

11.2.2 Robustness Issues

A key robustness issue addressed here is the problem of Out-Of-Vocabulary (OOV) rejection. Although the user instructions are very simple (paraphrased): "Please spell the name of the person you want to call", users have a tendency to pronounce the whole name anyway. Well known techniques exist to deal with this problem, based on the use of filler models and multiple grammars [RoseEtAl95], [RahimEtAl%]. However, in this work the integration of those techniques with the multi-pass spelled name search is original. Other techniques recently introduced for OOV rejection and keyword-spotting are based on on-line garbage modeling [BourlardEtAl94]. These techniques, while not explored in this thesis, are strongly complementary to the algorithms introduced here.

4 The work described here has been previously published by the author, with co-author Junqua, in the 1997 ICASSP proceedings.

11.3 Data Description

The database used in these experiments is a subset of the telephone speech corpus collected at the Oregon Graduate Institute (OGI) [ColeEtAl91]. Over four thousand people telephoned in response to public requests. They were prompted by a recorded voice to spell their first and last names, with and without pauses, together with other information. 60 pre-labeled repetitions of the alphabet (which did not belong to the test set as defined in the OGI CD-ROM) and more than 1200 different calls were used for training the HMMs. 558 calls were used for the validation set, and 491 calls for the basic test suite. The purpose of the validation data was to optimally tune the system's parameters before running it on the test set. As every speaker belongs only to one set (training, validation or test) the experiments conducted are speaker-independent. All data in these three sets were last names spoken without pauses, and none of them contained extraneous speech, line noise or speech-related effects such as lipsmack or breath noises (as transcribed in the database). These three sets were subsets of the corresponding training, test and validation sets defined in the OGI CD-ROM. They were used in the design of the system, whose evaluation led to the further enhancements of this chapter.

In order to test the multiple-grammar, letter-spotting algorithm introduced here, two new data sets were assembled from the OGI corpus. The first consists of 521 signals containing extraneous speech, in which the pronounced name precedes the spelled name. In addition, a set of 159 spelled names containing non-speech utterances and noises was used for a laboratory evaluation of the improved system. During the first field trial 403 calls were recorded over the telephone. Roughly half were in-house transfers using the digital PBX, and half were made from outside the company through the telephone network. These data were also used to evaluate the initial recognition system.

11.4 Enhanced Robustness by Pause-handling

Many errors were caused by the premature detection of end-of-speech by the Voice Activity Detection (VAD) algorithm. These events were triggered when the speaker hesitated either during the spelling of a name, or between some introductory utterance and the subsequent utterance of spelled letters. The VAD algorithm is implemented as a state machine with 4 states: non-speech, speech-in-progress, end-of-speech, and false-alarm (a non-speech event). State changes are triggered by changes in signal energy, computed adaptively at run-time. Frames of speech data detected by the VAD are passed from the front-end to the recognition module, which computes the forward probabilities frame-synchronously. The backtrack phase of the Viterbi recognition process is triggered by a VAD-detected end-of-speech. In real calls valid speech segments are often interspersed with pauses which, in the original system, caused the VAD to trigger the recognition process prematurely.

The spelled-name recognition algorithm was modified to allow signals containing pauses to be recognized. In the modified system, the VAD continues to classify and segment the raw signal as before; however, the recognition module employs a timeout of its own to decide whether an input has terminated. The new algorithm allows up to two seconds of silence between letters before determining the final endpoint was reached. During this interval, it proceeds through the stages of recognition, preparing a tentative output. If the VAD determines the speaker has resumed speaking before the timeout, the second or third pass of recognition aborts and the forward algorithm of the first pass resumes until the next pause is detected. When two seconds of non-speech have elapsed, the tentative response is confirmed and delivered, and recording stops. In the original field test [JunquaGaller96], 7.4% of calls were interrupted due to the problem of slow or interrupted speech. With the modified system described above, less than 2% of calls are subject to this kind of error.
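The pause-tolerant endpointing can be sketched as a small per-frame decision routine layered on top of the VAD output. The constants and names below are illustrative (the two-second figure comes from the description above, assuming 100 frames per second); the deployed implementation is not reproduced here.

typedef enum {
    VAD_NON_SPEECH,
    VAD_SPEECH,
    VAD_END_OF_SPEECH,
    VAD_FALSE_ALARM
} VadState;

#define TIMEOUT_FRAMES 200   /* assumed ~2 s of silence at 100 frames/s */

typedef struct {
    int silence_frames;      /* consecutive non-speech frames seen so far */
    int tentative_ready;     /* a tentative recognition output exists     */
} Endpointer;

/* Called once per frame with the VAD decision.
   Returns 1 when the tentative answer should be confirmed and delivered. */
static int endpoint_step(Endpointer *e, VadState vad)
{
    if (vad == VAD_SPEECH) {
        /* Speaker resumed: abort any second/third recognition pass and
           continue the first-pass forward algorithm.                    */
        e->silence_frames = 0;
        e->tentative_ready = 0;
        return 0;
    }
    e->silence_frames++;
    if (!e->tentative_ready && vad == VAD_END_OF_SPEECH) {
        /* VAD saw an endpoint: run the later passes now and hold the
           result as a tentative output.                                 */
        e->tentative_ready = 1;
    }
    return e->tentative_ready && e->silence_frames >= TIMEOUT_FRAMES;
}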

11.5 Modeling of Extraneous Speech and Noise

11.5.1 Improved Robustness with Letter-Spotting

A modified recognition strategy was proposed for dealing with the other major source of error. Extraneous speech is managed with a word-spotting strategy (figure 11.2). An initial experiment employed a network with filler to produce all the N-best sequences of letters for the dictionary alignment. However, for input signals which do not contain extraneous speech this algorithm increased the instances in which the alignment procedure failed to produce the right name in its N-best list. This was caused by the increased number of unit errors present in the input sequences, due to the additional complexity of the language model.

A more successful strategy was to fit the signal to two different networks. The first network consists of a filler HMM, followed by silence, followed by the letter models. This network tries to spot the best letter sequence following an initial extraneous noise, word, or phrase. The second network assumes only letters are spoken, and tries to fit the whole signal to the best letter/pause sequence. In both cases the transitions between letter HMMs are weighted by bigram probabilities. Each of these two recognition passes produces an N-best list of sequences. In the next stage of recognition, the sequences are aligned against the name dictionary to retrieve the most likely names with respect to the confusion statistics on letters.

Figure 11.2: Network for letter-spotting.

Two alignments are made to compare letter sequences to the names in the dictionary. Since the network without a filler model is used, the input to one alignment procedure is guaranteed to include the best sequences that the letter models can produce for a signal not containing extraneous speech. If the signal does contain extraneous speech, the network containing filler is more likely to produce a sequence acoustically similar to the actual spelled name, and the input to the second alignment procedure will contain that sequence. In the new algorithm, the output of each alignment produces 10 hypothesized names, and these are combined into a list of 20 candidates. In the final pass a dynamic grammar is generated consisting of the best candidates from the dictionary, with an optional initial filler model. A Viterbi acoustic recognition pass, using detailed wide-beam search, is performed with this grammar to select the most likely name.

11.5.2 Filler Models and Networks

Different kinds of filler models were constructed and tested. One consisted of a single speech unit trained on all non-silent segments of the NTIMIT 8kHz training corpus, followed by a silence unit. The filler unit was a left-to-right model containing 8 states, with 8 Gaussian distributions per mixture density. The grammar containing this sequence segments the input utterance into an initial, extraneous burst of speech, followed by a silence, and ending with a sequence of letters. This network is motivated by analysis of the error data, in which a spelled name was often preceded by an introductory word or phrase, and then by a pause of some duration.

The second kind of filler tried was a looped-phoneme network trained on NTIMIT. This network modestly increased the number of correct outputs from the dictionary but at a cost of increased computation. Because of the success of the simple filler model as seen below, this approach was dropped.

Data Set                    Size   Original System Error   Letter-spotting System Error
names only                  491    4.5%                    5.5%
names with extr. speech     521    75.8%                   29.8%
names with noise            159    28.3%                   11.3%
field trial data            403    23.3%                   10.7%

Table 11.1: Name retrieval experiments - Results.

11.5.3 Noise Modeling

A final variation on the filler network was tried, in which the filler models described above were replaced with a small set of noise models, lip-smack, breath-noise, and line-noise, trained on the OGI database. These 4-state, 16-distribution models were added to the letter-recognition network, and tested for accuracy in detection of extraneous noise or speech in the spelled-name data. The performance of the filler-based letter-spotting was compared to this method of using models trained for specific noises.

11.6 Experimental Results

The results for the original and modified name recognition algorithms are summarized in table 11.1. The results are presented as name-retrieval error, or the percentage of recordings for which the wrong name was selected at the end of the final pass. The original system was shown to achieve a 4.5% retrieval error rate (on 491 OGI test signals) for a 3,388 word dictionary. When tested on a separate subset of 521 OGI signals consisting of extraneous-speech plus spelled-name data, the same system had an error rate of 75.8%. When the letter-spotting algorithm was tested on the spelled names, recognition error increased from 4.5% to 5.5%. However, on the extraneous speech data, the error rate decreased by 60.7%, from 75.8% to 29.8%. On the noisy signals, there was a similar 60.1% error reduction.

Data Set                    Letter-spotting with Filler Model   Letter-spotting with Noise Models
names with extr. speech     29.8%                               56.0%
names with noise            11.3%                               11.3%

Table 11.2: Error rates of letter-spotting by filler and by noise models.

Of particular interest was the comparison between the filler model and the noise model performance on data containing noise, as seen in table 11.2. As should be expected, the filler performed much better than the noise models at detecting extraneous speech. But the filler model performed equally well in name retrieval as the noise-modeling network in detecting speech with noise. However, better performance may be achievable with more precise noise models.

11.7 Conclusions

Through the careful examination of the performance of a speech recognition service on live data, it is possible to identify the most serious issues of robustness. Often these problems can be greatly alleviated by small adjustments to the user interface. In other cases new and better algorithms are needed. The multiple grammar-in-parallel approach is effective at improving the robustness of a recognition system when extraneous utterances will appear in a small but significant percentage of voice inputs to the system. It is better than a standard keyword-spotting algorithm, because it generates multiple N-best lists in parallel: one list that assumes canonical voice inputs, and one or more lists for degenerate cases assumed to contain extraneous speech or ill-formed utterances. In this way a combined set of candidates is produced which will contain the best hypotheses under either assumption. The final pass of search can then find the right candidate with a high probability of success.

Further improvements to the spelled-name recognizer of this chapter are described in [RigazioJunquaGaller98]. The goal of this later work was to improve the discriminative power of the letter models, particularly among the confusable E-set of the alphabet. These remain a challenge for accurate discrimination, particularly over the bandwidth-limited telephone channel. Techniques of discriminative training were developed to simultaneously optimize for discriminative power the HMM state weights for the letter models, the statistical language model on the letters, and the heuristic weighting of the language model.

"rther improvements were the the subject of a U.S. Patent entitled %Iethod and Ap paratus Using Probabilistic Language Model Based on Confusable Sets for Speech Recog- nition", by Luca Rigazio, Junqua and Galler, fled March 23, 1998 (H09-0921). Chapter 12 Enhancements of Speech Production in Esophageal Speakers

This chapter describes an experimental medical application of speech recognition technology. Esophageal speakers are patients who, generally because of cancer, have had part or most of their larynx surgically removed. Lacking glottal tissue, esophageal speakers are trained during post-operative rehabilitation to produce a voice source by bringing about a vibration of the esophageal superior sphincter. They must fill the esophagus with an injection of air before every utterance, thus creating an air reservoir to drive the vibration. The resulting gulping noise is disconcerting for both the speakers and listeners. This chapter describes a method for the automatic recognition, and suppression by electronic means, of the injection noise which occurs in esophageal speech.

12.1 Introduction

People who have had laryngectomies can be retrained to speak with a variety of techniques, some involving prosthetic devices. The artificial larynx, a hand-held device which introduces a source vibration into the vocal tract by vibrating the throat externally, is the easiest for patients to master, but does not produce airflow. Without that flow the intelligibility of consonants is diminished. Tracheo-esophageal speech uses a prosthesis to divert outgoing lung air into the esophagus, bringing about a vibration of the esophageal superior sphincter. This method does produce airflow for consonants and permits utterances of normal duration. However, it requires a surgically produced connection between the esophagus and the trachea, and is not suitable for some patients.

Esophageal speech, which requires speakers to insufflate, or inject air into the esophagus [WeinbergBosna70], limits the possible duration between air injection gestures, and is associated with an undesired audible injection noise, sometimes referred to as an "injection gulp". The effect of this noise is magnified because esophageal speakers (like tracheo-esophageal speakers) evidence low vocal intensity [RobbinsEtAl84], and frequently need amplification. This noise is undesirable for two reasons: (1) listeners and speakers find it objectionable and (2) in some speakers it can be mistaken for a speech segment, diminishing intelligibility. This research reports on work to detect the injection noise, with the aim of eliminating amplification during its production.

The algorithms proposed here are original solutions for a problem that has not been previously addressed, although considerable work has been undertaken to enhance other aspects of esophageal speech. (For examples, see [Qi90], [QiEtAl95], [MatsuiHara99].)

12.2 The Proposed Device

Air injection is required prior to the start of every utterance, and typically occurs again after every pause before an utterance continues. Consider a device which is used to amplify the esophageal speaker's speech. It is possible to switch amplification on after injection noise has occurred and subsided, so that the following utterance alone is transmitted. We can then switch amplification off after a period of silence has occurred, in anticipation of the next injection noise. A gain control is set to either one or zero depending on whether injection noise has been detected with an associated silence. This device (figure 12.1), which could be in a speaker's external prosthesis, or integrated with a telephone, acts to automatically remove undesired injection noise, and transmit actual speech without interruption.
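As a rough illustration of this gain-control logic (not the actual device), the following sketch assumes a frame-level detector that labels each 10 ms frame as gulp, speech or silence; the label names and the silence duration are illustrative.

    # Sketch of the on/off amplification logic described above.
    def gain_sequence(frame_labels, frame_ms=10, min_silence_ms=300):
        """Gain is zero during and before a gulp, switched on once the gulp
        has subsided and speech resumes, and switched off again after a
        long enough silence, in anticipation of the next injection gulp."""
        gains, amplify, silence_run = [], False, 0
        for label in frame_labels:
            if label == "gulp":
                amplify = False                  # never transmit injection noise
                silence_run = 0
            elif label == "silence":
                silence_run += frame_ms
                if amplify and silence_run >= min_silence_ms:
                    amplify = False              # wait for the next gulp
            else:                                # "speech"
                amplify = True                   # gulp has subsided; transmit
                silence_run = 0
            gains.append(1.0 if amplify else 0.0)
        return gains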

12.3 Detection of Noise by HMMs

12.3.1 Feature analysis

We first proposed relatively straightforward speech recognition and word-spotting methods for detection of injection noise. We treated the injection noise as a word to be spotted within an utterance spoken by an esophageal speaker. The basic scheme is shown in figure 12.2. The signal is digitally sampled at 20 kHz. One copy of the signal is pre-emphasized and used for processing, while a second copy is switched on or off depending on the analysis.

1This research emerged as a collaboration between the author and the linguist Dr. Hector Javkin, then of the University of California, Santa Barbara. Dr. Javkin introduced me to the problem and acquired the data for the study. We jointly proposed ASR methods, and I implemented them and evaluated the results. Together with the linguist Dr. Nancy Niedzielski, who hand-labeled the data in the study, we published our experimental results in the 1997 ICASSP conference proceedings.

Figure 12.1: Illustration of method for rejecting injection noise.


Figure 12.2: Detecting the "gulp" with HMM-based word-spotting.

Every 10 ms, a 256-point FFT computation is performed on a 20 ms window of speech samples. The first 12 Mel-frequency cepstral coefficients (MFCC) are calculated; these form the first part of the feature vector for a speech frame. This spectral information is supplemented by additional information about the rate of change of spectral features, consisting of the derivatives (i.e. difference cepstra). All together, 24 Mel-based cepstral coefficients are extracted from each window of the speech signal.

Time waveform analysis supplements the cepstral analysis. Specifically, a measure of signal energy is computed, along with the energy rate-of-change, based on a linear regression of 9 successive samples. The speech vector is further augmented with two extra feature points based on some special characteristics of the injection noise. When a voiced speech signal begins, it produces a negative pressure pulse. The injection noise, on the other hand, begins with a positive pressure pulse. The difference between the initial negative pressure pulse of speech and the initial positive pulse of the injection gulp is used as an additional cue for detecting the injection gulp. A combination of a microphone, amplifiers and an analog converter is used to provide a non-inverted signal. This is done either by utilizing an even number of inverting amplifiers or by testing for an inverted signal and adding an inverting amplifier if necessary.

One of the features used to detect the polarity difference between injection noise and speech we call amplitude summation (AS). Amplitude summation, computed once per 10 ms speech window, is a way to detect the initial deviation from zero in the speaker's signal. The digitized waveform is summed over intervals ranging from 1 to 20 milliseconds, depending on an adjustment for individual speakers. The probability that an injection gulp has occurred is greater when a positive value over a given threshold occurs in the summed signal. This threshold can be adjusted, depending on the associated microphones and amplifiers used to record the signal. A second measure for detecting the polarity is obtained by differencing the center-clipped signal. To remove low-amplitude ambient noise, the signal is center-clipped. The remaining signal is then differenced, to obtain the first derivative, which is then smoothed with a running average. A positive value in the result, immediately following a zero value, tends to indicate the presence of injection noise, while a negative value tends to indicate the presence of speech.

The three measures of signal energy, energy rate of change, and amplitude summation are added to the 24 Mel coefficients, to make up the complete observation vector. Thus, the acoustic front-end program creates a 27-component observation vector to represent the features of each speech frame.
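The amplitude-summation and energy features can be sketched as follows, assuming numpy arrays, a 20 kHz sampling rate and a 10 ms hop; the function names and the choice of summation interval are illustrative rather than the exact front end used here.

    import numpy as np

    def amplitude_summation(frame, fs=20000, interval_ms=10):
        # Sum of the raw (non-inverted) waveform over a short, speaker-tuned
        # interval; a positive value above threshold points to the positive
        # pressure pulse of an injection gulp.
        n = int(fs * interval_ms / 1000)
        return float(np.sum(frame[:n]))

    def energy_slope(recent_energies):
        # Energy rate of change as the slope of a linear regression over
        # successive frame energies (the 9-sample regression in the text).
        x = np.arange(len(recent_energies))
        return float(np.polyfit(x, recent_energies, 1)[0])

    def observation_vector(mfcc12, dmfcc12, frame, recent_energies):
        # 12 MFCC + 12 delta-MFCC + energy + energy slope + amplitude
        # summation = 27 components, as described in the text.
        energy = float(np.log(np.sum(frame ** 2) + 1e-10))
        extras = [energy, energy_slope(recent_energies), amplitude_summation(frame)]
        return np.concatenate([mfcc12, dmfcc12, extras])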
12.3.2 Decoding

A hidden Markov model (HMM)-based speech decoder is used to find the optimal alignment of the speech signal with a set of speech tokens. Two methods are described. In method one, five speech tokens are used: silence, gulp, noise1, noise2 and speech. (The two 'noise' tokens were not fully explained, but appear to be artifacts of esophageal speech.) In method two, the speech token is replaced by a set of units representing the basic phonemes of the language. This method has more discriminative power for increased accuracy, but requires more computation.

Each token is modeled with an HMM. The number of nodes in the HMM units varies from 3, in the case of simple models such as silence, to as many as seven for certain phonemes. The number of Gaussian densities per mixture may be varied from 6 to 18 or more, depending on the limits placed on computation time by the envisioned application. In the first implementation, five continuous mixture-density speaker-dependent HMMs were trained on a subset of a corpus of esophageal speech data, segmented and pre-labeled by hand. The HMMs contained from 3 to 7 states, with 8 Gaussian densities per mixture. The training procedure was initialized by training two models on an 8 kHz database of normal speakers: a speech model and a silence model. The distributions of these HMMs were then used to initialize the three other units. The five HMMs were then re-trained on the training half of the esophageal speech signals for a given speaker, 42 recordings in all, using Baum-Welch re-estimation. This stage of speaker-dependent training consisted of two iterations of isolated segment training and two iterations of embedded training.

The HMM decoder program decodes the speech signal frame-synchronously, with a 10 ms frame rate. Each signal is processed by a front-end program into a vector of speech frames, as described in section 12.3.1. The Viterbi algorithm is used to estimate the conditional probabilities of the feature vectors, given each of the speech token models. Those segments for which the injection noise (gulp) token has been labeled as most likely present are classified as gulps within the speech signal. The remaining esophageal speech segments are transmitted, with a short processing delay, and amplified. When an injection gulp is detected, amplification is set to zero, so that it is suppressed.
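The labelling step can be illustrated with a much-simplified token-loop Viterbi; the real decoder uses full HMM state topologies and Baum-Welch-trained mixtures, so this sketch only shows how per-frame token log-likelihoods are turned into a frame-by-frame labelling from which gulp segments are read off. The switch penalty is an illustrative stand-in for transition probabilities.

    import numpy as np

    def viterbi_token_labels(loglik, switch_penalty=-5.0):
        """loglik: T x K array of per-frame log-likelihoods for the K tokens
        (e.g. silence, gulp, noise1, noise2, speech).  A simple token loop
        with a fixed switch penalty stands in for the full HMM topology."""
        T, K = loglik.shape
        delta = loglik[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            trans = delta[:, None] + switch_penalty * (1 - np.eye(K))
            back[t] = np.argmax(trans, axis=0)
            delta = trans[back[t], np.arange(K)] + loglik[t]
        labels = np.empty(T, dtype=int)
        labels[-1] = int(np.argmax(delta))
        for t in range(T - 2, -1, -1):
            labels[t] = back[t + 1, labels[t + 1]]
        return labels   # frames labelled with the gulp token are then suppressed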

*This procedure is the subject of a U.S. patent (pending) entitled "Enhancement of Esophageal Speech by Injection Noise Rejection" by Hector Javkin, Michael Galler and Nancy Niedzielski.

Table 12.1: HMM results for injection noise detection.

12.4 Results of Detection of Noise by HMMs

The injection noise method was applied to a test set of utterances on which the HMMs were not trained, but from the same speaker. The results are reported in table 12.1. Two thirds of injection noise, or gulp, events were detected successfully on the speaker. Of valid speech segments, 5.4% were at least partially incorrectly aligned with the gulp token (speech misclassification error). These results were obtained on 40 test sentences of one speaker. Although it is likely that these results can be improved by the use of more training data and further tuning of the recognition algorithm, some of the spectral characteristics of the injection noise led to the exploration of a different approach.

Morphological Filtering

A different method for injection noise detection has been developed, based on the observation that the noise, which is produced by a gesture with a closed vocal tract, has a strong, low-frequency emphasis. This characteristic appears to be due to a double closure in the vocal tract of at least some speakers, which strongly attenuates high frequencies. This algorithm uses a faster computation, and has a higher injection-noise detection rate on the limited data available. Through fine-tuning, it may be further improved. The data is sampled at 8 kHz. A 256-point FFT is computed every 10 ms and smoothed by a morphological filter ([PitasVenetsanopoulos90], [Hansen94]) with a 10-point sliding window, removing all but the gross features of the spectral curve. Figure 12.3 shows the magnitude spectrum from the center of an injection noise segment, and the result of the morphological filter applied to the spectrum. Figure 12.4 shows the magnitude spectrum from the center of the consonant /d/ (the segment spectrally closest to an injection noise segment) and the output of the morphological filter (MF). The mean and the derivative of the filtered spectrum are computed. The location and value of the two largest peaks are identified. A signal segment is identified as injection noise if the following criteria are met:

1. The largest peak is lower in frequency than the second largest peak.

Figure 12.3: 256-point FFT from the center of an injection noise segment, and the result of passing the FFT through the morphological filter.


Figure 12.4: 256-point FFT from the center of a /d/ segment and the result of passing the FFT through the morphological filter.

Table 12.2: Morphological filter experiment: results on data set 1.

Table 12.3: Morphological filter experiment: results on data set 2.

2. All points above 2 kHz are less than the mean.

The second method was tested on both the training and test data sets used for the HMMs in the first method. Again, results were encouraging, although the speech misclassification error was unacceptably high.
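The two criteria above can be sketched as follows, assuming a 256-point magnitude spectrum and its bin frequencies in Hz; the sliding-window opening and closing stand in for the morphological filter of [PitasVenetsanopoulos90], and the window size and peak picking are illustrative rather than the tuned settings used in the experiments.

    import numpy as np

    def _slide(x, size, fn):
        pad = size // 2
        xp = np.pad(x, pad, mode="edge")
        return np.array([fn(xp[i:i + size]) for i in range(len(x))])

    def morph_smooth(spectrum, size=10):
        # Grey-scale opening followed by closing with a 10-point window keeps
        # only the gross shape of the magnitude spectrum.
        opened = _slide(_slide(spectrum, size, np.min), size, np.max)
        return _slide(_slide(opened, size, np.max), size, np.min)

    def looks_like_gulp(spectrum, freqs):
        s = morph_smooth(np.asarray(spectrum, dtype=float))
        d = np.diff(s)
        # crude local-maximum picking on the smoothed curve
        peaks = np.where((np.hstack([[0.0], d]) > 0) & (np.hstack([d, [0.0]]) <= 0))[0]
        if len(peaks) < 2:
            return False
        top2 = peaks[np.argsort(s[peaks])[::-1][:2]]
        crit1 = freqs[top2[0]] < freqs[top2[1]]          # largest peak lowest in frequency
        crit2 = np.all(s[freqs > 2000.0] < s.mean())     # everything above 2 kHz below the mean
        return bool(crit1 and crit2)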

Discussion

The results obtained thus far were achieved with relatively untuned algorithms. Furthermore, no attempt has been made at present to combine some of the features used in the HMM-based method with those used in the method based on morphological filtering. Such a combination would likely reduce the error. On the basis of these results and the likelihood that they can be improved, injection gulp rejection could work in a way totally transparent to the user, by means of an electronic switch that would only turn amplification on after a gulp and a following short silence have occurred. Whenever a speaker paused, the amplification would be turned off, waiting for the injection gulp before turning it on again. It could, in theory, work without any delay in the output signal.

Adjustments have to be made for speakers who use multiple gulps in order to sufficiently insufflate the esophagus. If a speaker consistently used double or triple gulps, the method could be tuned to reject them. However, speech with varying numbers of gulps could be problematic.

The methods and application described here deal with only one aspect of esophageal speech. Research has continued in ways to improve other aspects as well. For example, Japanese researchers have attempted to improve the overall quality of the speech by resynthesizing the voicing source of esophageal speech using formant-based analysis and re-synthesis. Subjective ratings by speech therapists scored the synthesized speech higher than the original [MatsuiHara99].

3The subject of another U.S. patent filing, docket no. MATLA244, filed April 16, 1997, entitled "Esophageal Speech Injection Noise Detection and Rejection" by Hector Javkin, Michael Galler, Nancy Niedzielski, and Robert Boman.

Part IV
Conclusion

Chapter 13
Perspectives

13.1 Trends in Speech Recognition Research

To know the state of speech science in 1999, and the steady ferment of activity within the speech research community, we need only look at the recent published work of a relatively small number of research groups that have long made contributions to the state of the art. These groups include AT&T Labs; Germany's Aachen University of Technology; the Cambridge, Massachusetts firm BBN Technologies; P.C. Woodland's and Tony Robinson's research groups at Cambridge University; Carnegie Mellon University's Interactive Systems Laboratories; the Newton, Massachusetts firm of Dragon Systems; IBM's Thomas Watson Research Center; the French LIMSI-CNRS Spoken Language Processing group; MIT's Spoken Language Systems Group; and the firm SRI International of Menlo Park, California. This list is not exhaustive. The Oregon Graduate Institute, ICSI at Berkeley, and several government-funded research institutes, including ATR in Japan, IRST in Italy, and CRIM in Quebec, also belong in the list of groups making important contributions to the field over a long period of time. In addition, the electronics giants Philips, Motorola, Matsushita, and Sony, the Belgian firm Lernout and Hauspie, Finland's Nokia, the U.S. firm Nuance Communications and the Microsoft Research Group are notable for their speech recognition and synthesis technology, although for business reasons most of these companies are not prolific publishers of original research work. By examining the recent work of these groups it is possible to generalize about the techniques and areas of study which are being most actively pursued in 1999. What follows is a broad, if not exhaustive, survey of current themes and subjects of interest in speech recognition research.

Managing robustness issues. Researchers have gained a great deal of hard experience in dealing with the mismatch between training conditions and test conditions. Every practitioner has had the experience of transferring a successful recognizer from its development environment to try it in a new room, with new speakers, and perhaps a different microphone, with results often much worse than we hope. This is the "robustness" issue. We now have a toolkit of reliable techniques for channel normalization and adaptation.

Systems that employ cepstral feature analysis invariably apply the technique of Cepstrum Mean Normalization (CMN). An average MFCC vector (section 5.4.6) is computed over a whole input sentence or segment, and subtracted from each of the frames in the segment. This serves to reduce the effects of constant channel characteristics. The same normalization can be applied in perceptual linear predictive (PLP) analysis [Hermansky90]. PLP analysis convolves a set of critical band filters with the speech spectrum. These modify the spectrum according to perceptual measurements of the auditory system and lead to PLP-cepstra feature sets. In [WoodlandEtAl97] and [BakisEtAl97] there is evidence that PLP alone is more robust to environmental mismatch than MFCC. Cepstral-filtered PLP or Mel-filtered PLP is more robust than either. CMN can be applied to both these feature sets. Cepstral variance normalization has also been tried with some success on conversational speech [HainEtAl99]. Cepstral Mean Normalization can reduce the word error rate on ATIS by an absolute 1.1% [BocchieriEtAl95]. If a constant bias is observed in the training or test recordings, another small improvement can be achieved by removing this bias before speech analysis.

Channel variation in telephone speech is effectively managed with the use of RASTA filtering. The MFCC feature vector is not robust in the presence of channel distortions and background noise. RASTA filtering is an additional front-end operation which, combined with MFCC or PLP analysis, actively reduces channel effects and distortion due to background noise [HermanskyEtAl95]. In the log power spectral domain channel distortions are additive. However most channels are stationary, or very slowly varying, and impart a near-constant offset on the time series of short-time log power spectral vectors. Applying a sharp cut-off highpass filter to each of the spectral bins, across time, removes this slowly varying offset and suppresses the channel distortion. The RASTA filter is also able to partially mask the effects of background noise. The assumption is that the speech portion of a signal is relatively stationary compared to the noise. By lowpass filtering the time series of log spectral vectors the effects of noise are reduced. To combine both the highpass and lowpass advantages of filtering the vector stream, RASTA is implemented as an IIR (infinite-duration impulse response) bandpass filter. The designers of a 1264-word isolated word recognition task, a telephone auto-attendant, achieved a 7.2% improvement in recognition by adding the RASTA filter to the front-end analysis [AzzopardiEtAl98].

All the techniques above attack the robustness problem by reducing the effects of noise during the front-end signal analysis. In addition to normalization methods and noise-robust feature extraction, there is also a vigorous literature exploring how to compensate within the recognition unit for mismatched conditions.
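A minimal sketch of CMN, and of the cepstral variance normalization mentioned above, assuming the utterance is available as a T x D array of cepstral vectors; the array layout is an assumption.

    import numpy as np

    def cepstral_mean_normalize(frames, normalize_variance=False):
        """frames: T x D array of MFCC (or PLP-cepstral) vectors for one
        utterance or segment.  Subtracting the segment mean removes constant
        channel characteristics; dividing by the standard deviation is the
        cepstral variance normalization mentioned above."""
        out = frames - frames.mean(axis=0)
        if normalize_variance:
            out /= frames.std(axis=0) + 1e-8
        return out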
The same techniques that have been developed for speaker adaptation are applied to environmental adaptation. Good results have also been achieved with Gales' method called Parallel Model Combination (PMC) [GalesYoung93], in which additive and convolutional noise are explicitly modeled and used to transform the initially clean HMM model space.
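As a sketch of the kind of combination PMC performs, the following uses the standard log-normal approximation applied directly in the log-spectral domain; cepstral-domain models would first be rotated into the log-spectral domain with an inverse DCT, which is omitted here, and the variable names are illustrative.

    import numpy as np

    def pmc_lognormal(mu_clean, var_clean, mu_noise, var_noise, gain=1.0):
        """Combine a clean-speech Gaussian and a noise Gaussian (log-spectral
        domain, diagonal covariances) into a 'corrupted-speech' Gaussian by
        mapping to the linear domain, adding, and mapping back."""
        def to_linear(mu, var):
            m = np.exp(mu + var / 2.0)
            v = m ** 2 * (np.exp(var) - 1.0)
            return m, v
        def to_log(m, v):
            var = np.log(v / m ** 2 + 1.0)
            return np.log(m) - var / 2.0, var
        ms, vs = to_linear(mu_clean + np.log(gain), var_clean)   # gain-scaled speech
        mn, vn = to_linear(mu_noise, var_noise)                  # additive noise
        return to_log(ms + mn, vs + vn)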

The innovation of sub-band based speech recognition. Studies of auditory perception have long suggested that human beings decode information from different frequency sub-bands independently [Allen94]. This suggests that a correct linguistic interpretation is possible if some sub-bands contain correct information, even if other sub-bands are corrupted by noise. To put it differently, the human auditory mechanism may be capable of de-emphasizing unreliable sub-bands, or performing "missing feature" compensation. In the last three years a number of researchers have been exploring sub-band based recognition in hybrid HMM/ANN (artificial neural network) systems. In this work, the full auditory frequency band is divided into an arbitrary number of sub-bands, e.g. 2, 4 or 7. Each of these bands is independently subjected to feature analysis, and phoneme probabilities are computed for each. The results are then recombined at some segmental level, using dynamic programming or a neural net, in order to produce a single hypothesis. The evidence is that sub-band recognition is at least as effective as full-band recognition on clean speech. In a test on clean speech from the Switchboard corpus, the multi-band recognizer reduced the error rate from 63.6% to 61.4% [BourlardDupont97]. Admittedly, the baseline system performs badly enough that it is hard to draw conclusions. But other studies also found that the sub-band recognizer is at least as good as its full-band counterpart on clean speech; see [TibrewalaHermansky97]. Additionally, the multi-band recognizer degrades more gracefully when noise is localized in parts of the spectrum. However, when noise is present across all the sub-bands, the sub-band decoder may underperform a baseline system [TibrewalaHermansky97], [OkawaEtAl98]. Intriguingly, multi-band recognition also offers the potential of combining information at multiple time resolutions.
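A sketch of the recombination step, assuming that per-band classifiers (not shown) already produce frame-level phoneme posteriors; a weighted log-linear merge stands in for the dynamic programming or neural-net recombination used in the cited systems, and the weights are illustrative.

    import numpy as np

    def recombine_subbands(band_posteriors, band_weights=None):
        """band_posteriors: list of T x P arrays, one per sub-band, each giving
        per-frame phoneme posteriors from that band's classifier.  Down-weighting
        a band is one crude way of de-emphasizing an unreliable sub-band."""
        B = len(band_posteriors)
        w = np.ones(B) / B if band_weights is None else np.asarray(band_weights)
        logp = sum(wi * np.log(p + 1e-10) for wi, p in zip(w, band_posteriors))
        logp -= logp.max(axis=1, keepdims=True)        # renormalize per frame
        post = np.exp(logp)
        return post / post.sum(axis=1, keepdims=True)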

The introduction of vocal tract length normalization (VTLN). Differences in the vocal tract length of speakers result in a warping of the frequency axis, observed as formant energy displacements. These differences contribute to speaker feature variability, and are obviously correlated with gender and age (children vs. adults). A common, if brute-force, approach to managing this variability is to train gender-dependent models and, when training data permits, age-dependent models. VTLN tries to compensate directly for these differences by warping the frequency axis such that formant locations remain stationary across speakers. VTLN has been shown to achieve a 1.4% absolute reduction in word error rate on a gender-balanced subset of the Switchboard corpus [BillaEtAl99]. Dragon Systems did a study to contrast the benefits of VTLN with conventional gender-dependent training on the Broadcast News corpus. They found an identical 2.2% absolute reduction in the baseline 39.0% word error rate [WegmannEtAl99]. Combining the two techniques provides a further small improvement, but this improvement vanishes when unsupervised rapid adaptation is applied to models trained with either technique alone.
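A sketch of a piecewise-linear VTLN warp and of the usual grid search for the warp factor; the break-point fraction, the grid of factors and the scoring function are assumptions, not the settings of any of the cited systems.

    import numpy as np

    def warp_frequencies(freqs, alpha, f_max=4000.0, f_break=0.875):
        """Piecewise-linear warp: scale frequencies by alpha below a break
        point, then compress or expand the remainder so f_max maps to itself."""
        fb = f_break * f_max
        return np.where(
            freqs <= fb,
            alpha * freqs,
            alpha * fb + (f_max - alpha * fb) * (freqs - fb) / (f_max - fb))

    def pick_warp_factor(score_fn, alphas=np.arange(0.88, 1.13, 0.02)):
        """score_fn(alpha) is assumed to return the likelihood of the speaker's
        data under the current models when the front end uses warp factor
        alpha; the best-scoring factor is kept for that speaker."""
        return max(alphas, key=score_fn)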

The importance of including fast unsupervised speaker adaptation in the final recognition pass. The evolution of speech recognition has traced an arc from initial systems, which were all speaker-dependent, to the speaker-independent systems that predominated until recently. At present the circle has closed; a strong component of speaker adaptation is added to every prominent research system. The motivation for this new emphasis springs from two realizations. First, the training of speaker-independent models mixes many kinds of variability in an undifferentiated way, as if they represented the same dimension in feature space. For example, two speakers may characteristically pronounce a phoneme differently within a given phonetic context, but the mixtures of a single triphone are required to represent both variations. Second, adding more samples from new training data will not solve this problem. Rather than better separate the cluster from its neighbors in model-state space, more training may only further smear the cluster through the space.

We have discussed techniques for channel normalization, for both constant channel characteristics and slowly varying ones, as well as normalization techniques for background noise, for the speaker's sex, and for vocal tract length. It follows that, however brief the segment to be recognized, a good speaker-independent system will have the capacity to normalize for the current speaker's vocal characteristics. What is needed is a method of rapid unsupervised speaker adaptation before the final output hypothesis is formulated.

The list of important large-vocabulary projects incorporating rapid unsupervised speaker adaptation in some recognition pass is inclusive: BBN's Byblos, Dragon's Broadcast News transcriber, Cambridge University's HTK, Carnegie Mellon's Janus speech engine, IBM's LVCSR, and LIMSI's large vocabulary recognizer all make use of it. The BBN system is a good case study [BillaEtAl99]. Recognition on data from the Switchboard and English CallHome corpora is carried out in five stages:

1. A speaker's gender and VTL parameter are estimated by standard modeling techniques.
2. Transcriptions are generated with speaker-independent models.
3. Maximum Likelihood Linear Regression (MLLR) is used to adapt the means and variances based on these imperfect transcriptions.
4. New N-best transcriptions are generated with the speaker-adapted models.
5. The results of stage 4 are rescored with more detailed language models to find the single best transcription.

A typical improvement due to unsupervised adaptation, as measured on the Broadcast News corpus, reduces the word error rate from 39.9% to 37.7% [WegmannEtAl99]. Similar results are achieved on telephone speech [PeskinEtAl99].
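The adaptation loop can be summarized in a short sketch; the affine mean transform is the core of MLLR, while the surrounding function names (decode, estimate_mllr) are hypothetical placeholders for a real system's components, not an actual API.

    import numpy as np

    def apply_mllr_means(means, W):
        """means: N x D Gaussian means of one regression class; W: D x (D+1)
        affine transform [A | b], estimated elsewhere to maximize the
        likelihood of the adaptation data.  Adapted mean = A*mu + b."""
        ext = np.hstack([means, np.ones((means.shape[0], 1))])   # extended means
        return ext @ W.T

    def unsupervised_adapt(decode, estimate_mllr, apply_transform, models, speech):
        # First-pass transcription with unadapted models, MLLR estimation from
        # those imperfect transcriptions, then a second, speaker-adapted pass.
        hyp = decode(models, speech)
        W = estimate_mllr(models, speech, hyp)
        return decode(apply_transform(models, W), speech)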

Fast likelihood computation. A small speech decoder for embedded applications, based on the design principles described in Chapter 10, distills the necessary work down to little more than a repetitive series of emission probability calculations. The speed of the decoder will thus depend on the Gaussian computations. At the other end of the scale, a large research system will target vocabularies on the order of 20,000 to 60,000 words, and run up to 500 times real time to achieve maximum accuracy. In spite of earlier successes with such tasks as read speech and dictation, researchers continue to grapple with a heavy computational burden as the scope and ambition of the recognition problem expands to include real conversational speech, typified by the Switchboard and CallHome corpora. These research systems may employ on the order of 110,000 Gaussian or Laplacian densities [NeyEtAl98] to achieve accurate recognition. BBN reports that in their BYBLOS decoder, when a wide search beam is employed for maximum accuracy, 93% of the work in the second pass is Gaussian computation. When a tight beam is employed for efficiency, 76% of the first pass and 94% of the second pass is busy with calculating likelihoods [DavenportEtAl99].

Early efforts at speeding up this step were based on vector quantization (VQ) of the speech feature space. The Gaussian means were grouped into clusters, using K-means or similar algorithms. The centroids of the clusters formed a codebook, and for each frame of speech, a distance computation was performed between the input feature vector and each of the codewords. This would identify the cluster to which the vector belonged, and likelihood computations were only performed for distributions belonging to the cluster. My own early experiments showed that the accuracy/speedup tradeoff does not favour this simple VQ approach. Similarly, others have shown that applying Linear Discriminant Analysis (LDA) to find the most significant components of the feature vector for selecting the most relevant distributions had a poor error/speed tradeoff. However, the VQ method can provide a "coarse preselection of the prototype vectors of the densities". In [NeyEtAl98] the system computes only those prototype vectors located inside a hypercube centered at the input feature vector, and achieves a 75% reduction in recognition time for a 20,000 word vocabulary with little loss in accuracy.

In 1997 researchers at the IBM T.J. Watson Center proposed a decision-tree procedure to quantize the feature space. The space is hierarchically divided into disjoint regions separated by the intersection of hyperplanes. In each of these regions resides a subset of the allophone set. In training the decision tree, the data at each node is repartitioned into two child nodes such that the average entropy of the allophone distributions at the new nodes is minimized. Traversal of the decision tree is efficient, requiring an inner product calculation of the hyperplane with the feature vector at each node. Execution time is reduced by 95% with insignificant loss of accuracy [PadmanabhanEtAl97]. A simpler and nearly as robust variation of the decision-tree method repartitions the nodes with binary clustering [DavenportEtAl99].
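A sketch of Gaussian shortlisting with a VQ codebook, in the spirit of the coarse preselection described above; the mixture layout, the codebook construction and the number of retained clusters are all assumptions for illustration.

    import numpy as np

    def diag_gauss_loglik(x, means, variances, log_weights):
        """Log-likelihood of frame x under a diagonal-covariance Gaussian
        mixture (means, variances, log_weights all N x D or N)."""
        const = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
        return np.logaddexp.reduce(log_weights + const + expo)

    def shortlisted_loglik(x, means, variances, log_weights,
                           codebook, assignments, n_clusters=4):
        """VQ preselection: score only the densities whose means fall in the
        clusters closest to x (codebook and assignments built offline with
        k-means); the rest are ignored, trading a little accuracy for speed."""
        dists = np.sum((codebook - x) ** 2, axis=1)
        keep_clusters = np.argsort(dists)[:n_clusters]
        keep = np.isin(assignments, keep_clusters)
        return diag_gauss_loglik(x, means[keep], variances[keep], log_weights[keep])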

13.2 Open Problems and Future Directions

We who toil in the field of computer speech recognition are full of reasons why the full problem is so very difficult to solve. Speech and reason are the two faculties which make us human; they are that fundamental. It is not even clear that they can exist separately; that a thought can exist apart from the language, however personal, ephemeral and silent, that the mind uses to conjure it. How can anything be more difficult for us than the processing of language?

On a prosaic level, we list all the sources of variability that preclude a simple algorithm for correctly mapping recorded speech to text. This familiar list includes the imprecision of human vocal production, the variability due to differences in vocal tract, in dialect, in emotional context, in speaking style, in characteristics of the transmission channel. There are the natural disfluencies of conversational speech, with its hesitations, restarts, and abundance of non-speech utterances. Then there are the effects of noise, both speech and non-speech, which may overlap or obscure the signal of interest. People can focus in on one particular speech signal amid many. There are huge and open-ended vocabularies to deal with, inconsistencies in pronunciation, both valid and invalid, and an infinity of plausible utterances. Finally, when a fragment of speech is correctly decoded into a sequence of phonemes, there may not be a unique mapping to text. It may require syntactic, semantic and contextual linguistic information to make the translation to the right sequence of words.

Robustness in communication is achieved by a large expansion in bandwidth. Speech is an example: the underlying sound patterns have a bandwidth of perhaps 30 Hz, but the vocal tract encodes them in an 8 kHz bandwidth [Atal99]. This large redundancy compensates for the imprecision of speech production and the distortions introduced by the communication channel. Clearly the human auditory/mental apparatus is marvelously robust in making sense of a speech signal. Humans can accurately recognize speech that is degraded by environmental acoustics and noise, speech reproduced through bandwidth-limited and noisy transmission channels, channels with erratic linear frequency response, as well as speech that is high-pass and/or low-pass filtered [LippmannCarlson97].

So we might turn the question around like this: why is human speech recognition so easy? How do we process so effectively the systematic variabilities encoded in that highly redundant signal? Martin Russell argued in a 1997 IEEE Workshop that data-driven modeling of speech with HMMs attempts to characterize the surface structure of speech and only superficially models its underlying mechanisms.

Variability, which might be explicable with reference to this deeper level, manifests itself as an intricate and complex change in the surface structure. Faced with this surface complexity and no framework for modeling the underlying causality, the simplest solution, which is adopted in HMMs, is to assume that this surface variation is random.

I believe this point of view is not controversial in the speech fraternity. In spite of all the successes, computer speech recognition is in a primitive state and we know this. The systems that work today may perform impressive tasks, but that competence has been bought with simple mechanisms for learning to classify sounds based on massive (and painstakingly assembled) collections of labeled data.
Future breakthroughs will involve the discovery of intermediate levels at which to model descriptions of the speech process. They will integrate multi-resolutional trajectories of speech parameters, some of which are as yet unknown. A larger proportion of parameters in the speech model will be explicitly estimated with the help of new underlying models of speech production and perception; fewer parameters will be empirically estimated. The future speech engine may still incorporate large amounts of linguistic information, but the core decoder will be engineered more simply and compactly. And it may at last achieve the same generality as that exhibited by a hearing child. It will be able to recognize all sorts of speech, from heterogeneous sources, filtered in a variety of ways, without the need for extra training or adaptation.

Bibliography

[AbrashEtAl96] Abrash V., Sankar A., Franco H., and Cohen M., "Acoustic Adaptation Using Nonlinear Transformations of HMM Parameters", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 729-732.
[AceroHuang96] Acero A., and Huang X., "Speaker and Gender Normalization for Continuous-Density Hidden Markov Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 312-345.
[AckleyEtAl85] Ackley D.H., Hinton G.E., and Sejnowski T.J., "A Learning Algorithm for Boltzmann Machines", Cognitive Science, 1985, 9, pp. 147-169.

[Adda-DeckerEtAl99] Adda-Decker M., Adda G., Gauvain J.L., and Lamel L., "Large Vocabulary Recognition in French", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 45-48.
[AliEtAl98] Abdelatty Ali A.M., Van der Spiegel J., and Mueller P., "An Acoustic-phonetic Feature-based System for the Automatic Recognition of Fricative Consonants", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 961-964.
[Alkhairy99] Alkhairy A., "An Algorithm for Glottal Volume Velocity Estimation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 233-236.
[Allen94] Allen J.B., "How do humans process and recognize speech?", IEEE Transactions on Speech and Audio Processing, 1994, Vol. 2, pp. 567-577.
[Alleva97] Alleva F., "Search Organization in the Whisper Continuous Speech Recognition System", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 295-302.
[AllevaEtAl96] Alleva F., Huang X., and Hwang M-Y., "Improvements on the Pronunciation Prefix Tree Search Organization", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 133-136.
[AnastasakosEtAl97] Anastasakos T., McDonough J., and Makhoul J., "Speaker Adaptive Training: A Maximum Likelihood Approach to Speaker Normalization", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1043-1046.
[AntoniolEtAl94] Antoniol G., Brugnara F., Cettolo M., and Federico M., "Language Model Estimations and Representations for Real-Time Continuous Speech Recognition", Proceedings of the International Conference on Spoken Language Processing, September 1994.
[AsadiEtAl95] Asadi A., "Combining Speech Algorithms into a "Natural" Application of Speech Technology for Telephone Network Services", Eurospeech 95, pp. 273-276.
[Atal99] Atal B.S., "Automatic Speech Recognition: A Communication Perspective", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 457-460.
[AzzopardiEtAl98] Azzopardi D., Milner B., and Semnani S., "Improving the Accuracy of Telephony-based Speaker Independent Speech Recognition", The International Conference on Signal Processing Applications and Technology, 1998.
[BacchianiEtAl96] Bacchiani M., Ostendorf M., Sagisaka Y., and Paliwal K., "Design of a Speech Recognition System Based on Acoustically Derived Segmental Units", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 443-446.
[Bahl86] Bahl L.R., Brown P.F., de Souza P.V., Nahamoo D., and Mercer R.L., "Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1986, pp. 49-52.
[BahlEtAl91] Bahl L.R., de Souza P.V., Gopalakrishnan P.S., Nahamoo D., and Picheny M.A., "Decision Trees for Phonological Rules in Continuous Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991, pp. 185-188.
[BahlEtAl96] Bahl L.R., Padmanabhan M., Nahamoo D., and Gopalakrishnan P.S., "Discriminative Training of Gaussian Mixture Models for Large Vocabulary Speech Recognition Systems", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 613-616.
[BakamidisEtAl90] Bakamidis S., Dendrinos M., and Carayannis G., "SVD Analysis by Synthesis of Harmonic Signals", IEEE Transactions on Signal Processing, 1990, Vol. 39, No. 2, pp. 472-476.
[Baker75] Baker J.K., "Statistical Modeling as a Means of Automatic Speech Recognition", Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, 1975.
[BakisEtAl97] Bakis R., Chen S., Gopalakrishnan P., Gopinath R., Maes S., and Polymenakos L., "Transcription of Broadcast News - System Robustness Issues and Adaptation Techniques", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 711-714.
[BasuEtAl99] Basu S., Micchelli C.A., and Olsen P.A., "Maximum Likelihood Estimates for Exponential Type Density Distributions", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 361-364.
[Baum72] Baum L.E., "An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes", Inequalities, 1972, Vol. 3, pp. 1-8.
[BeaufaysEtAl99] Beaufays F., Weintraub M., and Konig Y., "Discriminative Mixture Weight Estimation for Large Gaussian Mixture Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 337-340.

[BeigiEtAl98] Beigi H.S.M., Maes S.H., and Sorenson J.S., "A Distance Measure between Collections of Distributions and its Application to Speaker Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 753-756.
[Bellegarda98] Bellegarda J.R., "Exploiting Both Local and Global Constraints for Multi-span Statistical Language Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 677-680.
[Bellegarda99] Bellegarda J.R., "Speech Recognition Experiments Using Multi-Span Statistical Language Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 717-720.
[BengioEtAl90] Bengio Y., Cardin R., De Mori R., and Normandin Y., "A Hybrid Coder for Hidden Markov Models Using a Recurrent Neural Network", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 537-540.
[BergerMiller98] Berger A., and Miller R., "Just-in-Time Language Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 705-708.
[BeulenEtAl99] Beulen K., Ortmanns S., and Elting C., "Dynamic Programming Search Techniques for Across-Word Modeling in Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 609-612.
[BeulenNey98] Beulen K., and Ney H., "Automatic Question Generation for Decision Tree Based State Tying", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 805-808.
[Beyerlein97] Beyerlein P., "Discriminative Model Combination", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 238-245.
[Beyerlein98] Beyerlein P., "Discriminative Model Combination", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 481-484.
[Billa97] Billa J., "Dual-channel Auditory Spectrum Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 995-998.
[BillaEtAl99] Billa J., Colthurst T., El-Jaroudi A., Iyer R., Ma K., Matsoukas S., Quillen C., Richardson F., Siu M., Zavaliagkos G., and Gish H., "Recent Experiments in Large Vocabulary Conversational Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 41-44.
[BilliEtAl95] Billi R., "Interactive Voice Technology at Work: The CSELT Experience", Speech Communication, 17, pp. 263-271, November 1995.
[Blasig99] Blasig R., "Combination of Words and Word Categories in Varigram Histories", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 529-532.
[Block62] Block H.D., "The Perceptron: a Model for Brain Functioning", Reviews of Modern Physics, 1962, 34, pp. 123-135.
[Bocchieri93] Bocchieri E., "A Study of the Beam-Search Algorithm for Large Vocabulary Continuous Speech Recognition and Methods for Improved Efficiency", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1993, pp. 1521-1524.
[BocchieriEtAl95] Bocchieri E., Riccardi G., and Anantharaman J., "The 1994 AT&T CHRONUS Recognizer", Proceedings of the ARPA Human Language Technology Workshop, 1995, pp. 265-268.
[BocchieriEtAl99] Bocchieri E., Digalakis V., Corduneanu A., and Boulis C., "Correlation Modeling of MLLR Transform Biases for Rapid HMM Adaptation to New Speakers", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 773-776.
[BourlardDupont97] Bourlard H., and Dupont S., "Subband Based Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1251-1254.
[BourlardEtAl94] Bourlard H., D'hoore B., and Boite J.-M., "Optimizing Recognition and Rejection Performance in Wordspotting Systems", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994, pp. 373-376.
[Brill98] Brill E., "Machine Learning and Automatic Linguistic Analysis: The Next Step", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 1033-1036.
[Buntine] Buntine W.L., "Theory Refinement of Bayesian Networks", Seventh Conference on Uncertainty in Artificial Intelligence, Anaheim CA.
[ByrneEtAl97] Byrne B., Finke M., Khudanpur S., McDonough J., Nock H., Riley M., Saraclar M., Wooters C., and Zavaliagkos G., "Pronunciation Modeling for Conversational Speech Recognition: A Status Report from WS97", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 26-33.
[ByrneEtAl98] Byrne W., Finke M., Khudanpur S., McDonough J., Nock H., Riley M., Saraclar M., Wooters C., and Zavaliagkos G., "Pronunciation Modeling Using a Hand-labelled Corpus for Conversational Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 313-316.
[Casacuberta90EtAl] Casacuberta F., Vidal E., Mas B., and Rulot H., "Learning the Structure of HMMs through Grammatical Inference Techniques", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 717-720.
[ChangLippmann96] Chang E.I., and Lippmann R.P., "Improving Wordspotting Performance with Artificially Generated Data", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 526-529.
[ChenEtAl98] Chen S.F., Seymore K., and Rosenfeld R., "Topic Adaptation for Language Modeling Using Unnormalized Exponential Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 681-684.
[ChenEtAl99] Chen S.S., Eide E.M., Gales M.J.F., Gopinath R.A., Kanevsky D., and Olsen P., "Recent Improvements to IBM's Speech Recognition System for Automatic Transcription of Broadcast News", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 37-40.
[ChenRosenfeld99] Chen S.F., and Rosenfeld R., "Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 549-552.
[Chengalvarayan99] Chengalvarayan R., "Hierarchical Subband Linear Predictive Cepstral (HSLPC) Features for HMM-based Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 409-412.
[ChongEtAl98] Chong N.R., Burnett I.S., Chicharo J.F., and Thomson M.M., "Use of the Pitch Synchronous Wavelet Transform as a New Decomposition Method for WI", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 513-516.
[Chou97] Chou W., "Minimum Error Rate Training for Designing Tree-Structured Probability Density Function", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1507-1510.
[ChouReichl99] Chou W., and Reichl W., "Decision Tree State Tying Based on Penalized Bayesian Information Criterion", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 345-348.
[Chow90] Chow Y.L., "Maximum Mutual Information Estimation of HMM Parameters for Continuous Speech Recognition using the N-Best Algorithm", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 701-704.
[ChowEtAl86] Chow Y.L., Schwartz R., Roucos S., Kimball O., Price P., Kubala F., Dunham M., Krasner M., and Makhoul J., "The Role of Word-Dependent Coarticulatory Effects in a Phoneme-Based Speech Recognition System", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986, pp. 1593-1596.
[ClarksonMoreno99] Clarkson P., and Moreno P.J., "On the Use of Support Vector Machines for Phonetic Classification", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 585-588.
[ClarksonRobinson97] Clarkson P.R., and Robinson A.J., "Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 799-802.
[CohenEtAl95] Cohen M., Rivlin Z., and Bratt H., "Speech Recognition in the ATIS Domain Using Multiple Knowledge Sources", Proceedings of the ARPA Human Language Technology Workshop, 1995, pp. 257-260.
[CohenEtAl97] Cohen P., Dharanipragada S., Gros J., Monkowski M., Neti C., Roukos S., and Ward T., "Towards a Universal Speech Recognizer for Multiple Languages", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 591-598.
[ColeEtAl91] Cole R., Roginski K., and Fanty M., "English Alphabet Recognition with Telephone Speech", Eurospeech 91, pp. 479-482.
[ColeEtAl96] Cole R.A., Yan Y., Mak B., Fanty M., and Bailey T., "The Contribution of Consonants Versus Vowels to Word Recognition in Fluent Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 853-856.
[CookEtAl96] Cook G.D., Christie J.D., Clarkson P.R., Hochberg M.M., Logan B.T., and Robinson A.J., "Real-Time Recognition of Broadcast Radio Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 141-144.
[CookRobinson98] Cook G., and Robinson T., "Transcribing Broadcast News with the 1997 ABBOT System", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 917-920.
[CookeEtAl93] Cooke M., Beet S., and Crawford M., editors, Visual Representations of Speech Signals, John Wiley & Sons, New York NY, 1993.
[CookeEtAl97] Cooke M., Morris A., and Green P., "Missing Data Techniques for Robust Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 863-866.
[CorazzaEtAl91] Corazza A., De Mori R., Gretter R., and Satta G., "Optimal Probabilistic Evaluation Functions for Search Controlled by Stochastic Context-free Grammars", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991.
[Corredor-ArdoyEtAl98] Corredor-Ardoy C., Lamel L., Adda-Decker M., and Gauvain J.L., "Multilingual Phone Recognition of Spontaneous Telephone Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 413-416.
[CoxRose96] Cox S., and Rose R.C., "Confidence Measures for the Switchboard Database", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 511-514.
[DahlEtAl95] Dahl D.A., Norton L.M., Weir C.E., and Linebarger M.C., "Weakly Supervised Training for Spoken Language Understanding Systems", Proceedings of the ARPA Human Language Technology Workshop, 1995, pp. 272-275.
[DavenportEtAl99] Davenport J., Schwartz R., and Nguyen L., "Towards a Robust Real-Time Decoder", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 645-648.
[DavisMermelstein80] Davis S., and Mermelstein P., "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980, pp. 357-366.
[Demori79] De Mori R., "Recent Advances in Automatic Speech Recognition", Signal Processing, 1979, 1(2), pp. 95-124.
[Demori98] De Mori R., editor, Spoken Dialogues with Computers, Academic Press, San Diego CA, 1998.
[DemoriEtAl76] De Mori R., Laface P., and Piccolo E., "Automatic Detection and Description of Syllabic Features in Continuous Speech", IEEE Transactions on Acoustics, Speech, and Signal Processing, 1976, Vol. 24, pp. 365-379.
[DemoriEtAl90] De Mori R., Palakal M.J., and Cosi P., "Perceptual Models for Automatic Speech Recognition Systems", Advances in Computers, Vol. 31, pp. 99-173.

[DemoriEtAl95] De Mori R., Snow C., and Galler M., "On the Use of Stochastic Inference Networks for Representing Multiple Word Pronunciations", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995, pp. 701-704.
[DemoriGaller95] De Mori R., Galler M., and Brugnara F., "Search and Learning Strategies for Improving Hidden Markov Models", Computer Speech and Language, 9, pp. 107-121, 1995.
[DemoriGaller96] De Mori R., and Galler M., "The Use of Syllable Phonotactics for Word Hypothesization", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 877-880.
[DemoriSnowGaller95] De Mori R., Snow C., and Galler M., "On the Use of Stochastic Inference Networks for Representing Multiple Word Pronunciations", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.
[DentonTaylor92] Denton J.S., and Taylor C.R., editors, "Final Report on Speech Recognition Research: December 1984 to April 1990", Technical Report for DARPA, School of Computer Science, Carnegie Mellon University, July 1992.
[Deng97] Deng L., "Integrated Multilingual Speech Recognition Using Universal Phonological Features in a Functional Speech Production Model", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1007-1010.

[Deng97a] Deng L., "A Dynamic, Feature-based Approach to Speech Modeling and Recognition", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 107-114.
[DeshmukhEtAl96] Deshmukh N., Weber M., and Picone J., "Automated Generation of N-Best Pronunciations of Proper Nouns", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 283-286.
[DeshmukhEtAl97] Deshmukh N., Ngan J., Hamaker J., and Picone J., "An Advanced System to Generate Pronunciations of Proper Nouns", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1467-1470.
[DharanRoukos98] Dharanipragada S., and Roukos S., "A Fast Vocabulary Independent Algorithm for Spotting Words in Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 233-236.
[DharaniRoukos97] Dharanipragada S., and Roukos S., "New Word Detection in Audio-Indexing", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 551-557.
[DigalakisEtAl98] Digalakis V., Neumeyer L., and Perakakis M., "Quantization of Cepstral Parameters for Speech Recognition over the World Wide Web", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 989-992.
[DigalakisEtAl99] Digalakis V., Berkowitz S., Bocchieri E., Boulis C., Byrne W., Collier H., Corduneanu A., Kannan A., Khudanpur S., and Sankar A., "Rapid Speech Recognizer Adaptation to New Speakers", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 765-768.
[DudaHart73] Duda R.O., and Hart P.E., Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973.
[DupontEtAl97] Dupont S., Bourlard H., Deroo O., Fontaine V., and Boite J.-M., "Hybrid HMM/ANN Systems for Training Independent Tasks: Experiments on Phonebook and Related Improvements", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1767-1770.
[EideGish96] Eide E., and Gish H., "A Parametric Approach to Vocal Tract Length Normalization", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 346-349.
[ElMelianiOShaugh97] El Meliani R., and O'Shaughnessy D., "Accurate Keyword Spotting Using Strictly Lexical Filters", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 907-910.
[EphDem89] Ephraim Y., Dembo A., and Rabiner L.R., "A Minimum Discrimination Information Approach for Hidden Markov Modeling", IEEE Transactions on Information Theory, 35, No. 5, September 1989, pp. 1001-1013.
[Ephraim96] Ephraim Y., "Gain-Adapted Hidden Markov Models for Recognition of Clean and Noisy Speech", IEEE Transactions on Signal Processing, Vol. 40, No. 6, 1992, pp. 1303-1316.
[Ferguson80] Ferguson J., Ed., Hidden Markov Models for Speech, IDA, Princeton NJ, 1980.
[FetterEtAl96] Fetter P., Kaltenmeier A., Kuhn T., and Regel-Brietzmann P., "Improved Modeling of OOV Words in Spontaneous Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 534-537.
[FinkeRogina97] Finke M., and Rogina I., "Wide Context Acoustic Modeling in Read Vs. Spontaneous Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1743-1746.
[FischerStahl99] Fischer A., and Stahl V., "Database and Online Adaptation for Improved Speech Recognition in Car Environments", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 445-448.
[Fletcher69] Fletcher R., Optimization, Berkeley Square House, London, 1969.
[Forney73] Forney G.D., "The Viterbi Algorithm", IEEE Proceedings, 1973, 61, pp. 268-278.
[FraserEtAl96] Fraser N.M., Salmon B., and Thomas T., "Call Routing by Name Recognition: Field Trial Results for the Operetta System", 1996 IEEE Third Workshop - Interactive Voice Technology for Telecommunications Applications, pp. 101-104.
[FritschEtAl97] Fritsch J., Finke M., and Waibel A., "Context-Dependent Hybrid HME/HMM Speech Recognition Using Polyphone Clustering Decision Trees", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1759-1762.
[FritschRogina96] Fritsch J., and Rogina I., "The Bucket Box Intersection (BBI) Algorithm for Fast Approximative Evaluation of Diagonal Mixture Gaussians", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 837-840.
[FukadaSagisaka98] Fukada T., and Sagisaka Y., "Speaker Normalized Acoustic Modeling Based on 3-D Viterbi Decoding", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 437-440.
[Gales98] Gales M.J.F., "Semi-tied Covariance Matrices", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 657-660.
[GalesYoung93] Gales M.J.F., and Young S., "Cepstral Parameter Compensation for HMM Recognition", Speech Communication, 1993, Vol. 12, No. 3, pp. 231-239.
[GallerJunqua97] Galler M., and Junqua J-C., "Robustness Improvements in Continuously Spelled Names over the Telephone", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1539-1542.
[GanapathEtAl97] Ganapathiraju A., Goel V., Picone J., Corrada A., Doddington G., Kirchhoff K., Ordowski M., and Wheatley B., "Syllable - A Promising Recognition Unit for LVCSR", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 207-214.
[GaoEtAl99] Gao Y., Jan L.E., Padmanabhan M., and Picheny M., "HMM Training Based on Quality Measurement", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 129-132.
[GarciaLee97] Garcia-Mateo C., and Lee C-H., "A Study on Subword Modeling for Utterance Verification in Mexican Spanish", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 614-621.
[GauvainEtAl96] Gauvain J.L., Lamel L., Adda G., and Matrouf D., "Developments in Continuous Speech Dictation Using the 1995 ARPA NAB News Task", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 73-76.
[GauvainEtAl97] Gauvain J.L., Adda G., Lamel L., and Adda-Decker M., "Transcribing Broadcast News Shows", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 715-718.
[GeutnerEtAl98] Geutner P., Finke M., and Scheytt P., "Adaptive Vocabularies for Transcribing Multilingual Broadcast News", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 925-928.
[GeutnerEtAl99] Geutner P., Finke M., and Waibel A., "Selection Criteria for Hypothesis Driven Lexical Adaptation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 617-620.
[GillickEtAl97] Gillick L., Ito Y., and Young J., "A Probabilistic Approach to Confidence Estimation and Evaluation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 879-882.
[GlassEtAl95] Glass J., Goddeau D., Hetherington L., McCandless M., Pao C., Phillips M., Polifroni J., Seneff S., and Zue V., "The MIT ATIS System: December 1994 Progress Report", Proceedings of the ARPA Human Language Technology Workshop, 1995, pp. 232-256.
[GlassEtAl99] Glass J.R., Hazen T.J., and Hetherington I.L., "Real-Time Telephone-Based Speech Recognition in the Jupiter Domain", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 61-64.
[Gokcen97] Gokcen S., and Gokcen J.M., "A Multilingual Phoneme and Model Set: Toward a Universal Base for Automatic Speech Recognition", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 599-605.
[GoldbergerBurshtein98] Goldberger J., and Burshtein D., "Scaled Random Segmental Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 809-812.
[Gopinath98] Gopinath R.A., "Maximum Likelihood Modeling with Gaussian Distributions for Classification", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 661-664.
[Greenberg98] Greenberg S., "Recognition in a New Key - Towards a Science of Spoken Language", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 1033-1036.
[GreenbergKingsbury97] Greenberg S., and Kingsbury B.E.D., "The Modulation Spectrum: In Pursuit of an Invariant Representation of Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1647-1650.
[GuptaEtAl96] Gupta S.K., Soong F., and Haimi-Cohen R., "High-Accuracy Connected Digit Recognition for Mobile Applications", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 57-60.
[Haeb-Umbach99] Haeb-Umbach R., "Investigations on Inter-Speaker Variability in the Feature Space", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 397-400.
[HainEtAl99] Hain T., Woodland P.C., Niesler T.R., and Whittaker E.W.D., "The 1998 HTK System for Transcription of Conversational Telephone Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 57-60.
[HamakerEtAl98] Hamaker J., Ganapathiraju A., Picone J., and Godfrey J.J., "Advances in Alphadigit Recognition Using Syllables", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 421-424.
[HanazawaEtAl97] Hanazawa K., Minami Y., and Furui S., "An Efficient Search Method for Large-Vocabulary Continuous-Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1787-1790.
[Hansen94] Hansen J.H.L., "Morphological Constrained Feature Enhancement with Adaptive Cepstral Compensation (MCE-ACC) for Speech Recognition in Noise and Lombard Effect", IEEE Transactions on Speech and Audio Processing, 1994, vol. 2, pp. 598-614.
[HataokaEtAl98] Hataoka N., Kokubo H., Obuchi Y., and Amano A., "Development of Robust Speech Recognition Middleware on Microprocessor", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 837-840.
[Hauenstein97] Hauenstein M., "A Computationally Efficient Algorithm for Calculating Loudness Patterns of Narrowband Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1311-1314.
[HermanSukkar97] Herman S.M., and Sukkar R.A., "Variable Threshold Vector Quantization for Reduced Continuous Density Likelihood Computation in Speech Recognition", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 331-338.
[Hermansky90] Hermansky H., "Perceptual Linear Predictive (PLP) Analysis of Speech", The Journal of the Acoustical Society of America, 1990, Vol. 87, pp. 1738-1752.
[Hermansky97] Hermansky H., "The Modulation Spectrum in the Automatic Recognition of Speech", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 140-147.
[HermanskyEtAl85] Hermansky H., Hanson B.A., and Wakita H., "Low-Dimensional Representation of Vowels Based on All-Pole Modeling in the Psychophysical Domain", Speech Communication, 1985, pp. 181-187.
[HermanskyEtAl91] Hermansky H., Morgan N., Bayya A., and Kohn P., "Compensation for the Effect of the Communication Channel in Auditory-like Analysis of Speech (RASTA-PLP)", Proceedings of the European Conference on Speech Communication and Technology, 1991, pp. 1367-1371.
[HermanskyEtAl95] Hermansky H., Morgan N., and Hirsch H-G., "Recognition of Speech in Additive and Convolutive Noise Based on RASTA Spectral Processing", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995, pp. 83-86.
[HommaEtAl97] Homma S., Aikawa K., and Sagayama S., "Improved Estimation of Supervision in Unsupervised Speaker Adaptation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1023-1026.
[Hon92] Hon H.W., Vocabulary-Independent Speech Recognition: The VOCIND System, Ph.D. Thesis, CMU-CS-92-108, School of Computer Science, Carnegie Mellon University, 1992.
[HuBarnard97] Hu Z., and Barnard E., "Smoothness Analysis for Trajectory Features", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 979-982.
[HuaJac89] Huang X.D., and Jack M.A., "Semi-Continuous Hidden Markov Models for Speech Signals", Readings in Speech Recognition, Academic Press, 1989.
[HuangEtAl90] Huang X.D., Ariki Y., and Jack M.A., Hidden Markov Models for Speech Recognition, Edinburgh University Press, Edinburgh, 1990.
[HuangEtAl96] Huang X.D., Hwang M-Y., Jiang L., and Mahajan M., "Deleted Interpolation and Density Sharing for Continuous Hidden Markov Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 885-888.

[HwangHuang99] Hwang M-Y., and Huang X., "Dynamically Configurable Acoustic Models for Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 669-672.
[HwangWang97] Hwang J-N., and Wang C-J., "Joint Model and Feature Space Optimization for Robust Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 855-858.
[IllinaGong97] Illina I., and Gong Y., "Elimination of Trajectory Folding Phenomenon: HMM, Trajectory Mixture HMM and Mixture Stochastic Trajectory Model", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1395-1398.
[IyerEtAl97] Iyer R., Ostendorf M., and Meteer M., "Analyzing and Predicting Language Model Improvements", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 254-261.
[JavkinGaller97] Javkin H., Galler M., and Niedzielski N., "Enhancement of Esophageal Speech by Injection Noise Rejection", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1207-1210.
[JinEtAl98] Jin H., Matsoukas S., Schwartz R., and Kubala F., "Fast Robust Inverse Transform Speaker Adapted Training Using Diagonal Transformations", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 785-788.
[JitsuhiroEtAl98] Jitsuhiro T., Takahashi S., and Aikawa K., "Rejection of Out-of-Vocabulary Words Using Phoneme Confidence Likelihood", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 217-220.
[JohnsonEtAl99] Johnson S.E., Jourlin P., Moore G.L., Sparck Jones K., and Woodland P.C., "The Cambridge University Spoken Document Retrieval System", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 49-52.
[JohnstonEtAl96] Johnston D., Whittaker S.J., and Attwater D.J., "An Overview of Speech Technology for Telecom Services in the United Kingdom", 1996 IEEE Third Workshop - Interactive Voice Technology for Telecommunications Applications, 1996, pp. 12-15.
[JonesEtAl96] Jones G.J.F., Foote J.T., Jones K.S., and Young S.J., "Robust Talker-Independent Audio Document Retrieval", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 311-314.
[JouvetEtAl91] Jouvet D., Mauuary L., and Monne J., "Automatic Adjustments of the Structure of Markov Models for Speech Recognition Applications", Eurospeech, Sept. 1991, pp. 927-930.
[JouvetEtAl99] Jouvet D., Bartkova K., and Mercier G., "Hypothesis Dependent Threshold Setting for Improved Out-of-Vocabulary Data Rejection", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 709-712.
[JunquaEtAl95] Junqua J.C., "An N-best Strategy, Dynamic Grammars and Selectively Trained Neural Networks for Real-Time Recognition of Continuously Spelled Names Over the Telephone", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.
[JunquaGaller96] Junqua J-C., and Galler M., "Performance Evaluation of SmarTspelL: A Continuously Spelled Name Recognizer over the Telephone", 1996 IEEE Third Workshop - Interactive Voice Technology for Telecommunications Applications, pp. 143-146.
[Junqua97] Junqua J.C., "SmarTspelL: A Multi-pass Recognition System for Name Retrieval over the Telephone", IEEE Transactions on Speech and Audio Processing, March 1997.
[KalaiEtAl99] Kalai A., Chen S., Blum A., and Rosenfeld R., "On-Line Algorithms for Combining Language Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 745-748.
[KanederaEtAl98] Kanedera N., Hermansky H., and Arai T., "On Properties of Modulation Spectrum for Robust Automatic Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 613-616.
[KannanKhudanpur99] Kannan A., and Khudanpur S., "Tree-Structured Models of Parameter Dependence for Rapid Adaptation in Large Vocabulary Conversational Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 769-772.
[KannanOstendorf97] Kannan A., and Ostendorf M., "Adaptation of Polynomial Trajectory Segment Models for Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1411-1414.
[Kao98] Kao Y-H., "Minimization of Search Network in Speech Recognition", The International Conference on Signal Processing Applications and Technology, 1998.
[KarnZahorian99] Karnjanadecha M., and Zahorian S.A., "Signal Modeling for Isolated Word Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 293-296.
[KarrayMauuary97] Karray L., and Mauuary L., "Improving Speech Detection Robustness for Wireless Speech Recognition", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 428-435.
[Kawahara97] Kawahara H., "Speech Representation and Transformation Using Adaptive Interpolation of Weighted Spectrum: Vocoder Revisited", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1303-1306.
[KawaharaEtAl97] Kawahara T., Lee C-H., and Juang B-H., "Combining Key-Phrase Detection and Subword-based Verification for Flexible Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1159-1162.
[Keller] Keller E., editor, Fundamentals of Speech Synthesis and Speech Recognition, John Wiley & Sons, New York NY, 1994.
[KellnerEtAl96] Kellner A., Rueber B., and Seide F., "A Voice-Controlled Automatic Telephone Switchboard and Directory Information System", 1996 IEEE Third Workshop - Interactive Voice Technology for Telecommunications Applications, 1996, pp. 117-120.
[KellnerEtAl97] Kellner A., Seide F., and Rueber B., "With a Little Help from the Database - Developing Voice-controlled Directory Information Systems", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, 1997, pp. 566-574.
[KempJusek96] Kemp T., and Jusek A., "Modeling Unknown Words in Spontaneous Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 530-533.
[KennyEtAl98] Kenny O.P., Nelson S.J., Bodenschatz M., and McMonagle H.A., "Separation of Non-spontaneous and Spontaneous Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 573-576.
[KimEtAl99] Kim J., Haimi-Cohen R., and Soong F., "Hidden Markov Models with Divergence Based Vector Quantized Variances", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 125-128.
[KimballOstendorf92] Kimball O., Ostendorf M., and Bechwati I., "Context Modeling with the Stochastic Segment Model", IEEE Transactions on Signal Processing, 40, no. 6, 1992, pp. 1584-1587.
[KirchhoffBilmes99] Kirchhoff K., and Bilmes J.A., "Dynamic Classifier Combination in Hybrid Speech Recognition Systems Using Utterance-Level Confidence Values", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 693-696.
[KirkpatrickEtAl83] Kirkpatrick S., Gelatt C.D., and Vecchi M.P., "Optimization by Simulated Annealing", Science, 220, 1983, pp. 671-680.
[Klakow98] Klakow D., "Language Model Optimization by Mapping of Corpora", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 701-704.
[KneserPeters97] Kneser R., and Peters J., "Semantic Clustering for Adaptive Language Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 779-782.
[KnillYoung96] Knill K.M., and Young S.J., "Fast Implementation Methods for Viterbi-based Word-spotting", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 522-525.
[KooEtAl97] Koo M-W., Lee C-H., and Juang B-H., "A New Hybrid Decoding Algorithm for Speech Recognition and Utterance Verification", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 303-310.
[KooEtAl98] Koo M-W., Lee C-H., and Juang B-H., "A New Decoder Based on a Generalized Confidence Score", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 213-216.
[KorkmazskiyEtAl97] Korkmazskiy F., and Juang B-H., "Discriminative Training of the Pronunciation Networks", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 223-229.
[KuhnEtAl97] Kuhn R., Nowell P., and Drouin C., "Approaches to Phoneme-based Topic Spotting: An Experimental Comparison", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1811-1814.
[KuhnEtAl99] Kuhn R., Nguyen P., Boman R., Niedzielski N., Fincke S., Field K., and Contolini M., "Fast Speaker Adaptation Using A Priori Knowledge", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 749-752.
[LamKas86] Lamel L.F., Kassel R.H., and Seneff S., "Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus", Proceedings of the Speech Recognition Workshop (DARPA), 1986, pp. 100-109.
[Laurila97] Laurila K., "Noise Robust Speech Recognition with State Duration Constraints", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 871-874.
[LaurilaEtAl98] Laurila K., Vasilache M., and Viikki O., "A Combination of Discriminative and Maximum Likelihood Techniques for Noise Robust Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 85-88.
[Lee89] Lee K.F., Automatic Speech Recognition: The Development of the SPHINX System, Kluwer, Boston MA, 1989.
[Lee97] Lee C-H., "Adaptive Compensation for Robust Speech Recognition", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 357-364.
[LeeEtAl90A] Lee K.F., Hayamizu S., Hon H.W., Huang C., Swartz J., and Weide R., "Allophone Clustering for Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 749-752.
[LeeEtAl90] Lee K.F., Hon H.W., and Reddy R., "An Overview of the SPHINX Speech Recognition System", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 38, No. 1, January 1990, pp. 35-44.
[LeeEtAl91] Lee C-H., Giachin E., Rabiner L.R., Pieraccini R., and Rosenberg A.E., "Improved Acoustic Modeling for Speaker Independent Large Vocabulary Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991, pp. 161-164.
[LeeEtAl96] Lee C-H., Soong F.K., and Paliwal K.K., editors, Automatic Speech and Speaker Recognition - Advanced Topics, Kluwer Academic Publishers, Norwell MA, 1996.
[LeeHon89] Lee K.F., and Hon H.W., "Speaker-Independent Phone Recognition Using Hidden Markov Models", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 11, November 1989, pp. 1641-1646.
[LeeOShaugh97] Lee C.Z., and O'Shaughnessy D., "Techniques to Achieve Fast Lexical Access", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 303-310.
[LeeRose96] Lee L., and Rose R.C., "Speaker Normalization Using Efficient Frequency Warping Procedures", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 353-356.
[LevLib89] Levinson S.E., Liberman M.Y., Ljolje A., and Miller L.G., "Speaker Independent Phonetic Transcription of Fluent Speech For Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989, pp. 441-444.
[LevinPieraccini95] Levin E., and Pieraccini R., "CHRONUS, The Next Generation", Proceedings of the ARPA Human Language Technology Workshop, 1995, pp. 269-271.
[LiOShaugnessy96] Li Z., and O'Shaughnessy D., "Using a Transcription Graph for Large Vocabulary Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 121-124.
[Lip82] Liporace L.A., "Maximum Likelihood Estimation for Multivariate Observations of Markov Sources", IEEE Transactions on Information Theory, Vol. IT-28, no. 5, September 1982, pp. 729-734.
[LippmannCarlson97] Lippmann R.P., and Carlson B.A., "Robust Speech Recognition with Time-varying Filtering, Interruptions, and Noise", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 365-372.
[LjoljeEtAl95] Ljolje A., Riley M., Hindle D., and Pereira F., "The AT&T 60,000 Word Speech-To-Text System", Eurospeech, 1995.
[LleidaRose96] Lleida E., and Rose R.C., "Efficient Decoding and Training Procedures for Utterance Verification in Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 507-510.
[LoganMoreno98] Logan B., and Moreno P., "Factorial HMMs for Acoustic Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 813-816.
[LuoJelinek99] Luo X., and Jelinek F., "Probabilistic Classification of HMM States for Large Vocabulary Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 293-296.
[MaEtAl98] Ma K.W., Zavaliagkos G., and Meteer M., "Sub-Sentence Discourse Models for Conversational Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 693-696.
[MahajanEtAl99] Mahajan M., Beeferman D., and Huang X.D., "Improved Topic-Dependent Language Modeling Using Information Retrieval Techniques", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 541-544.
[Mak84] Makhoul J., and Schwartz R., "Ignorance Based Modeling", in Invariance and Variability in Speech Processing, Perkell J., and Klatt D.H., editors, Erlbaum, 1984.
[MakBocchieri98] Mak B., and Bocchieri E., "Training of Subspace Distribution Clustering Hidden Markov Model", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 673-676.
[MakEtAl97] Mak B., Bocchieri E., and Barnard E., "Stream Derivation and Clustering Scheme for Subspace Distribution Clustering Hidden Markov Model", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 339-346.
[MatsuiHara99] Matsui K., and Hara N., "Enhancement of Esophageal Speech Using Formant Synthesis", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 81-84.
[McCourtEtAl98] McCourt P., Vaseghi S., and Harte N., "Multi-Resolution Cepstral Features for Phoneme Recognition Across Speech Sub-Bands", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 557-560.
[McDonoughByrne99] McDonough J., and Byrne W., "Speaker Adaptation with All-Pass Transforms", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 757-760.
[McDonoughEtAl96] McDonough J., Zavaliagkos G., and Gish H., "An Approach to Speaker Adaptation Based on Analytic Functions", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 721-724.
[McDonoughEtAl97] McDonough J., Anastasakos T., Zavaliagkos G., and Gish H., "Speaker-Adapted Training on the Switchboard Corpus", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1059-1062.
[McKinleyWhipple97] McKinley B.L., and Whipple G.H., "Model Based Speech Pause Detection", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1179-1182.
[MingSmith98] Ming J., and Smith F.J., "Improved Phone Recognition Using Bayesian Triphone Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 409-412.
[MinskyPapert69] Minsky M., and Papert S., Perceptrons, MIT Press, Cambridge MA, 1969.
[MirghaforiMorgan98] Mirghafori N., and Morgan N., "Transmissions and Transitions: A Study of Two Common Assumptions in Multi-Band ASR", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 713-716.
[MitchellSetlur99] Mitchell C.D., and Setlur A.R., "Improved Spelling Recognition Using a Tree-Based Fast Lexical Match", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 597-600.
[MohriEtAl98] Mohri M., Riley M., Hindle D., Ljolje A., and Pereira F., "Full Expansion of Context-Dependent Networks in Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 665-668.
[MooreEtAl95] Moore R., Appelt D., Dowding J., Gawron J.M., and Moran D., "Combining Linguistic and Statistical Knowledge Sources in Natural-Language Processing for ATIS", Proceedings of the ARPA Human Language Technology Workshop, 1995, pp. 261-264.
[MurveitEtAl93] Murveit H., Butzberger J., Digalakis V., and Weintraub M., "Large-Vocabulary Dictation Using SRI's Decipher Speech Recognition System: Progressive Search Techniques", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1993.
[NageshaGillick97] Nagesha V., and Gillick L., "Studies in Transformation-based Adaptation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1031-1034.
[NaitoEtAl98] Naito M., Deng L., and Sagisaka Y., "Speaker Clustering for Speech Recognition Using the Parameters Characterizing Vocal-tract Dimensions", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 981-984.
[NetiEtAl97] Neti C.V., Roukos S., and Eide E., "Word-based Confidence Measures as a Guide for Stack Search in Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 883-886.
[NetiRoukos97] Neti C., and Roukos S., "Phone-context Specific Gender-dependent Acoustic Models for Continuous Speech Recognition", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 192-198.
[NeyAubert94] Ney H., and Aubert X., "A Word Graph Algorithm for Large Vocabulary, Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994.
[NeyEtAl97] Ney H., Ortmanns S., and Lindam I., "Extensions to the Word Graph Method for Large Vocabulary Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1791-1794.
[NeyEtAl98] Ney H., Welling L., Ortmanns S., Beulen K., and Wessel F., "The RWTH Large Vocabulary Continuous Speech Recognition System", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 853-856.
[NeyNoll88] Ney H., and Noll A., "Phoneme Modeling Using Continuous Mixture Densities", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1988, pp. 437-440.
[NeyOrtmanns97] Ney H., and Ortmanns S., "Progress in Dynamic Programming Search for LVCSR", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 287-294.
[NguyenEtAl99] Nguyen P., Gelin P., Junqua J.-C., and Chien J.T., "N-Best Based Supervised and Unsupervised Adaptation for Native and Non-native Speakers in Cars", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 173-176.
[NguyenSchwartz99] Nguyen L., and Schwartz R., "Single-Tree Method for Grammar-Directed Search", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 613-616.
[NieslerEtAl98] Niesler T.R., Whittaker E.W.D., and Woodland P.C., "Comparison of Part-of-Speech and Automatically Derived Category-Based Language Models for Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 177-180.
[NieslerWoodland96] Niesler T.R., and Woodland P.C., "A Variable-length Category-based N-gram Language Model", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 164-167.
[NieslerWoodland97] Niesler T.R., and Woodland P.C., "Modeling Word-pair Relations in a Category-based Language Model", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 795-798.
[NiyogiEtAl99] Niyogi P., Burges C., and Ramesh P., "Distinctive Feature Detection Using Support Vector Machines", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 425-428.
[NiyogiRamesh98] Niyogi P., and Ramesh P., "Incorporating Voice Onset Time to Improve Letter Recognition Accuracies", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 13-16.
[Norm91] Normandin Y., Hidden Markov Models, Maximum Mutual Information Estimation, and the Speech Recognition Problem, Ph.D. Thesis, Department of Electrical Engineering, McGill University, 1991.
[NothEtAl96] Noth E., De Mori R., Fischer J., Gebhard A., Harbeck S., Kompe R., Kuhn R., Niemann H., and Mast M., "An Integrated Model of Acoustics and Language Using Semantic Classification Trees", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 419-422.
[O'ShaughnessyTolba99] O'Shaughnessy D., and Tolba H., "Towards a Robust/Fast Continuous Speech Recognition System Using a Voiced-Unvoiced Decision", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 413-416.
[O'Shaughnessy87] O'Shaughnessy D., Speech Communication, Addison-Wesley, Reading MA, 1987.
[ObuchiEtAl97] Obuchi Y., Amano A., and Hataoka N., "A Novel Speaker Adaptation Algorithm and its Implementation on a RISC Microprocessor", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 442-449.
[OkawaEtAl98] Okawa S., Bocchieri E., and Potamianos A., "Multi-Band Speech Recognition in Noisy Environments", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 641-644.
[OrtmannsEtAl97] Ortmanns S., Eiden A., Ney H., and Coenen N., "Look-Ahead Techniques for Fast Beam Search", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1783-1786.
[OrtmannsEtAl98] Ortmanns S., Eiden A., and Ney H., "Improved Lexical Tree Search for Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 817-820.
[PadmanabhanEtAl96] Padmanabhan M., Bahl L.R., Nahamoo D., and Picheny M.A., "Speaker Clustering and Transformation for Speaker Adaptation for Large Vocabulary Speech Recognition Systems", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 701-704.
[PadmanabhanEtAl97] Padmanabhan M., Jan E.E., Bahl L.R., and Picheny M., "Decision-tree based Feature-Space Quantization for Fast Gaussian Computation", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 325-330.
[PadmanabhanEtAl98] Padmanabhan M., Eide E., Ramabhadran B., Ramaswamy G., and Bahl L.R., "Speech Recognition Performance on a Voicemail Transcription Task", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 913-916.
[Paul85] Paul D.B., "Training of HMM Recognizers by Simulated Annealing", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1985, pp. 13-16.
[Paul97] Paul D.B., "Extensions to Phone-State Decision-Tree Clustering: Single Tree and Tagged Clustering", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1487-1490.
[PeskinEtAl96] Peskin B., Connolly S., Gillick L., Loew S., McAllaster D., Nagesha V., van Mulbregt P., and Wegmann S., "Improvements in Switchboard Recognition and Topic Identification", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 303-306.
[PeskinEtAl97] Peskin B., Gillick L., Liberman N., Newman M., van Mulbregt P., and Wegmann S., "Progress in Recognizing Conversational Telephone Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1811-1814.
[PeskinEtAl99] Peskin B., Newman M., McAllaster D., Nagesha V., Richards K., Wegmann S., Hunt M., and Gillick L., "Improvements in Recognition of Conversational Telephone Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 53-56.
[PitasVenetsanopoulos90] Pitas I., and Venetsanopoulos A.N., Nonlinear Digital Filters, Kluwer Academic Publishers, Boston, 1990.
[Pols97] Pols L.C.W., "Flexible Human Speech Recognition", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 273-283.
[PolymenakosEtAl98] Polymenakos L., Olsen P., Kanevsky D., Gopinath R.A., Gopalakrishnan P.S., and Chen S., "Transcription of Broadcast News - Some Recent Improvements to IBM's LVCSR System", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 901-904.
[PotamianosRose97] Potamianos A., and Rose R.C., "On Combining Frequency Warping and Spectral Shaping in HMM Based Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1275-1278.
[PoveyWoodland99] Povey D., and Woodland P.C., "Frame Discrimination Training of HMMs for Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 333-336.
[PressEtAl88] Press W.H., Flannery B.P., Teukolsky S.A., and Vetterling W.T., Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge, 1988.
[PyeWoodland97] Pye D., and Woodland P.C., "Experiments in Speaker Normalization and Adaptation for Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1047-1050.
[Qi90] Qi Y., "Replacing Tracheoesophageal Voicing Sources Using LPC Synthesis", Journal of the Acoustical Society of America, 1990, Vol. 88, pp. 1228-1235.
[QiEtAl95] Qi Y., Weinberg B., and Bi N., "Enhancement of Female Esophageal and Tracheoesophageal Speech", Journal of the Acoustical Society of America, 1995, Vol. 98, pp. 2461-2465.
[Rabiner89] Rabiner L.R., "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 77, no. 2, February 1989, pp. 257-285.
[Rabiner97] Rabiner L.R., "Applications of Speech Recognition in the Area of Telecommunications", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 501-510.
[RabinerEtAl89] Rabiner L.R., Lee C.H., Juang B.H., and Wilpon J.G., "HMM Clustering for Connected Word Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989, pp. 405-408.
[RabinerJuang93] Rabiner L., and Juang B.H., Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs NJ, 1993.
[Rahim99] Rahim M., "Recognizing Connected Digits in a Natural Spoken Dialog", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 153-156.
[RahimEtAl95] Rahim M.G., Lee C.H., and Juang B.H., "Robust Utterance Verification for Connected Digit Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995, pp. 285-288.
[RahimSaul97] Rahim M., and Saul L., "Minimum Classification Error Factor Analysis (MCEFA) for Automatic Speech Recognition", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 172-178.
[RamalingamEtAl97] Ramalingam C.S., Netsch L., and Kao Y-H., "Speaker-Independent Name Dialing with Out-of-Vocabulary Rejection", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1475-1478.
[RamalingamEtAl99] Ramalingam C.S., Gong Y., Netsch L.P., Anderson W.W., Godfrey J.J., and Kao Y.-H., "Speaker-Dependent Name Dialing in a Car Environment with Out-of-Vocabulary Rejection", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 165-168.
[RamanRamanujam97] Raman V., and Ramanujam V., "Robustness Issues and Solutions in Speech Recognition Based Telephony Services", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1523-1526.
[RamaswamyGopa98] Ramaswamy G.N., and Gopalakrishnan P.S., "Compression of Acoustic Features for Speech Recognition in Network Environments", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 977-980.
[RaoEtAl98] Rao A., Rose K., and Gersho A., "Deterministically Annealed Design of Speech Recognizers and its Performance on Isolated Letters", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 461-464.
[ReichlChou97] Reichl W., and Chou W., "A Fast Segmental Clustering Approach to Decision Tree Tying Based Acoustic Modeling", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 185-191.
[ReichlChou98] Reichl W., and Chou W., "Decision Tree State Tying based on Segmental Clustering for Acoustic Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 801-804.
[ReichlChou99] Reichl W., and Chou W., "A Unified Approach of Incorporating General Features in Decision Tree Based Acoustic Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 573-576.
[ReinhardNiranjan98] Reinhard K., and Niranjan M., "Parametric Subspace Modeling of Speech Transitions", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 1105-1108.
[ReinhardNiranjan99] Reinhard K., and Niranjan M., "Diphone Multi-Trajectory Subspace Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 1001-1004.
[RenalsHochberg96] Renals S., and Hochberg M., "Efficient Evaluation of the LVCSR Search Space Using the NOWAY Decoder", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 149-152.
[RiccardiEtAl97] Riccardi G., Gorin A.L., Ljolje A., and Riley M., "A Spoken Language System for Automated Call Routing", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1143-1146.
[RichardsEtAl97] Richards H.B., Bridle J.S., Hunt M.J., and Mason J.S., "Vocal Tract Shape Trajectory Estimation Using MLP Analysis-by-Synthesis", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1287-1290.
[RigazioJunquaGaller98] Rigazio L., Junqua J-C., and Galler M., "Multilevel Discriminative Training for Spelled Word Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 489-492.
[Riis98] Riis S.K., "Hidden Neural Networks: Applications to Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 1117-1120.
[Riley91] Riley M., "A Statistical Model for Generating Pronunciation Networks", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991.
[RivlinEtAl96] Rivlin Z., Cohen M., Abrash V., and Chung T., "A Phone-dependent Confidence Measure for Utterance Verification", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 515-517.
[RobbinsEtAl81] Robbins J., Fisher H.B., Blom E.C., and Singer M.I., "A Comparative Acoustic Study of Normal, Esophageal, and Tracheoesophageal Speech Production", Journal of Speech and Hearing Research, 1984, 49, pp. 202-210.
[Robinson91] Robinson T., "Several Improvements to a Recurrent Error Propagation Phone Recognition System", Technical Report CUED/F-INFENG/TR.82, Cambridge University Engineering Department, 1991.
[RobinsonChristie98] Robinson T., and Christie J., "Time-first Search for Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 829-832.
[RobinsonFallside90] Robinson T., and Fallside F., "The Cambridge Recurrent Error Propagation Network Speech Recognition System", Computer Speech and Language, 1990.
[RoeWilpon] Roe D.B., and Wilpon J., editors, Voice Communication between Humans and Machines, National Academy Press, Washington D.C., 1994.
[Rose96] Rose R.C., "Keyword detection in conversational utterances using hidden Markov model based continuous speech recognition", Computer Speech and Language, 1995, 9, pp. 309-333.
[RoseEtAl95] Rose R.C., Juang B.H., and Lee C.H., "A Training Procedure for Verifying String Hypotheses in Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995, pp. 281-284.
[RoseEtAl97] Rose R., Yao H., Riccardi G., and Wright J., "Integrating Multiple Knowledge Sources for Utterance Verification in a Large Vocabulary Speech Understanding System", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 215-222.
[RoseEtAl98] Rose R.C., Yao H., Riccardi G., and Wright J., "Integration of Utterance Verification with Statistical Language Modeling and Spoken Language Understanding", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 237-240.
[RoseLleida97] Rose R.C., and Lleida E., "Speech Recognition Using Automatically Derived Acoustic Baseforms", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1271-1274.
[RoseRiccardi99] Rose R.C., and Riccardi G., "Modeling Disfluency and Background Events in ASR for a Natural Language Understanding Task", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 341-344.
[Rosenfeld97] Rosenfeld R., "A Whole Sentence Maximum Entropy Language Model", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 230-237.
[RumelhartMcClelland86] Rumelhart D.E., and McClelland J.L., Parallel Distributed Processing, MIT Press, Cambridge MA, 1986.
[Russell97] Russell M., "Progress towards Speech Models that Model Speech", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 115-123.
[SagayamaEtAl97] Sagayama S., Yamaguchi Y., Takahashi S., and Takahashi J., "Jacobian Approach to Fast Acoustic Model Adaptation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 835-838.
[SagayamaEtAl97a] Sagayama S., Yamaguchi Y., and Takahashi S., "Jacobian Adaptation of Noisy Speech Models", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 396-403.
[SamuelssonReichl99] Samuelsson C., and Reichl W., "A Class-Based Language Model for Large Vocabulary Speech Recognition Extracted from Part-of-Speech Statistics", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 537-540.
[SanchisCasacuberta91] Sanchis E., and Casacuberta F., "Learning Structural Models of Subword Units Through Grammatical Inference Techniques", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991, pp. 189-192.
[SankarEtAl96] Sankar A., Neumeyer L., and Weintraub M., "An Experimental Study of Acoustic Adaptation Algorithms", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 713-716.
[SchaafKemp97] Schaaf T., and Kemp T., "Confidence Measures for Spontaneous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 875-878.
[ScheyttEtAl98] Scheytt P., Geutner P., and Waibel A., "Serbo-Croatian LVCSR on the Dictation and Broadcast News Domain", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 897-900.
[SchluterMacherey98] Schluter R., and Macherey W., "Comparison of Discriminative Training Criteria", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 493-496.
[SchmidBarnard97] Schmid P., and Barnard E., "Explicit, N-Best Formant Features for Vowel Classification", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 991-994.
[SchwartzEtAl85] Schwartz R., Chow Y., Kimball O., Roucos S., Krasner M., and Makhoul J., "Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1985, pp. 1205-1208.
[Schwenk99] Schwenk H., "Using Boosting to Improve a Hybrid HMM/Neural Network Speech Recognizer", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 1009-1012.
[SenZue88] Seneff S., and Zue V.W., "Transcription and Alignment of the TIMIT Database", distributed with the DARPA TIMIT CD-ROM by the National Bureau of Standards, April 9, 1988.
[ShinodaLee97] Shinoda K., and Lee C-H., "Structural MAP Speaker Adaptation Using Hierarchical Priors", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 381-387.
[ShinodaLee98] Shinoda K., and Lee C-H., "Unsupervised Adaptation Using Structural Bayes Approach", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 793-796.
[SingerOstendorf96] Singer H., and Ostendorf M., "Maximum Likelihood Successive State Splitting", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 601-604.
[SinghEtAl99] Singh R., Raj B., and Stern R.M., "Automatic Clustering and Generation of Contextual Questions for Tied States in Hidden Markov Models", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 117-120.
[SiohanGong96] Siohan O., and Gong Y., "A Semi-continuous Stochastic Trajectory Model for Phoneme-based Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 471-474.
[SiuEtAl99] Siu M., Jonas M., and Gish H., "Using a Large Vocabulary Continuous Speech Recognizer for a Constrained Domain with Limited Training", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 105-108.
[SixtusOrtmanns99] Sixtus A., and Ortmanns S., "High Quality Word Graphs Using Forward-Backward Pruning", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 593-596.
[SmithBates93] Smith G.W., and Bates M., "Voice Activated Automated Telephone Call Routing", Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, March 1993.
[SolomonoffEtAl98] Solomonoff A., Mielke A., Schmidt M., and Gish H., "Clustering Speakers by their Voices", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 753-756.
[Stewart73] Stewart G.W., Introduction to Matrix Computations, Academic Press, New York, 1973.
[StolckeOmohundro] Stolcke A., and Omohundro S.M., "Best-first Model Merging for Hidden Markov Model Induction", Technical Report 94-003, Berkeley.
[StolckeShriberg96] Stolcke A., and Shriberg E., "Statistical Language Modeling for Speech Disfluencies", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 405-408.
[StropeAlwan98] Strope B., and Alwan A., "Robust Word Recognition Using Threaded Spectral Peaks", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 625-628.
[Sukkar98] Sukkar R.A., "Subword-based Minimum Verification Error (SB-MVE) Training for Task Independent Utterance Verification", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 229-232.
[SukkarEtAl96] Sukkar R.A., Setlur A.R., Rahim M.G., and Lee C.H., "Utterance Verification of Keyword Strings Using Word-based Minimum Verification Error (WB-MVE) Training", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 518-521.
[SukkarLee96] Sukkar R.A., and Lee C-H., "Vocabulary Independent Discriminative Utterance Verification for Nonkeyword Rejection in Subword Based Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 6, 1996, pp. 420-429.
[Sun97] Sun D.X., "Statistical Modeling of Co-Articulation in Continuous Speech Based on Data Driven Interpolation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1751-1754.
[SuzukiEtAl97] Suzuki Y., Fukumoto F., and Sekiguchi Y., "Domain Identification and Keyword Extraction of Radio News Using Term Weighting", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 543-550.
[TakamiSagayama92] Takami J., and Sagayama S., "A Successive State Splitting Algorithm for Efficient Allophone Modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992, pp. 573-576.
[TakaraEtAl97] Takara T., Higa K., and Nagayama I., "Isolated Word Recognition Using the HMM Structure Selected by the Genetic Algorithm", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 967-970.
[Tanaka98] Tanaka K., "Next Major Application Systems and Key Techniques in Speech Recognition Technology", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 1057-1060.
[Tanimoto87] Tanimoto S.L., The Elements of Artificial Intelligence, Computer Science Press, Rockville MD, 1987.
[ThelenEtAl97] Thelen E., Aubert X., and Beyerlein P., "Speaker Adaptation in the Philips System for Large Vocabulary Continuous Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1035-1038.
[Thomson95] Thomson M., "Statistical Modeling of Speech Vector Trajectories", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.

[Thomson97] Thomson D.L., "Ten Case Studies of the Effect of Field Conditions on Speech Recognition Errors", 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 511-518.
[ThomsonChenga98] Thomson D.L., and Chengalvarayan R., "Use of Periodicity and Jitter as Speech Recognition Features", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 21-24.
[TibrewalaHermansky97] Tibrewala S., and Hermansky H., "Subband Based Recognition of Noisy Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1255-1258.
[TorreEtAl96] de la Torre C., Hernandez-Gomez L., Caminero-Gil F.J., and del Alamo C.M., "On-line Garbage Modeling for Word and Utterance Verification in Natural Numbers Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 845-848.
[TsakalidisEtAl99] Tsakalidis S., Digalakis V., and Neumeyer L., "Efficient Speech Recognition Using Subvector Quantization and Discrete Mixture HMMs", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 569-572.
[TsugeEtAl99] Tsuge S., Fukada T., and Singer H., "Speaker Normalized Spectral Subband Parameters for Noise Robust Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 285-288.
[UmeshEtAl99] Umesh S., Cohen L., and Nelson D., "Fitting the Mel Scale", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 217-220.
[ValtchevEtAl96] Valtchev V., Odell J.J., Woodland P.C., and Young S.J., "Lattice-Based Discriminative Training for Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 605-608.
[VaseghiEtAl97] Vaseghi S., Harte N., and Milner B., "Multi-Resolution Phonetic/Segmental Features and Models for HMM-based Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1263-1266.
[Vergin98] Vergin R., "An Algorithm for Robust Signal Modeling in Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 969-972.
[ViikkiEtAl98] Viikki O., Bye D., and Laurila K., "A Recursive Feature Vector Normalization Approach for Robust Speech Recognition in Noise", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 733-736.
[Vite67] Viterbi A.J., "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm", IEEE Transactions on Information Theory, April 1967, vol. 13, no. 2.
[WardIssar95] Ward W., and Issar S., "The CMU ATIS System", Proceedings of the ARPA Human Language Technology Workshop, 1995, pp. 249-251.
[WardIssar96] Ward W., and Issar S., "A Class Based Language Model for Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 416-418.
[WegmannEtAl96] Wegmann S., McAllaster D., Orloff J., and Peskin B., "Speaker Normalization on Conversational Telephone Speech", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 339-341.
[WegmannEtAl99] Wegmann S., Zhan P., and Gillick L., "Progress in Broadcast News Transcription at Dragon Systems", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 33-36.
[WeibelEtAl89] Waibel A., Hanazawa T., Hinton G., Shikano K., and Lang K., "Phoneme Recognition Using Time-Delay Neural Networks", IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989, vol. 37, pp. 393-404.
[WeinbergBosma70] Weinberg B., and Bosma J.F., "Similarities between Glossopharyngeal Breathing and Injection Methods of Air Intake for Esophageal Speech", Speech and Hearing Disorders, 1970, no. 35, pp. 25-32.
[WeintraubEtAl97] Weintraub M., Beaufays F., Rivlin Z., Konig Y., and Stolcke A., "Neural-Network Based Measures of Confidence for Word Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 887-890.
[WellingEtAl99] Welling L., Kanthak S., and Ney H., "Improved Methods for Vocal Tract Normalization", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 761-764.
[WendemuthEtAl99] Wendemuth A., Rose G., and Dolfing J.G.A., "Advances in Confidence Measures for Large Vocabulary", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 705-708.
[WilponEtAl90] Wilpon J.G., Rabiner L.R., Lee C.H., and Goldman E.R., "Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models", IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, No. 11, 1990, pp. 1870-1877.
[Winston77] Winston P.H., Artificial Intelligence, Addison-Wesley, Reading MA, 1977.
[Wokurek97] Wokurek W., "Time-Frequency Analysis of the Glottal Opening", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1435-1438.
[WoodlandEtAl96] Woodland P.C., Gales M.J.F., and Pye D., "Improving Environmental Robustness in Large Vocabulary Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 65-68.
[WoodlandEtAl97] Woodland P.C., Gales M.J.F., Pye D., and Young S.J., "Broadcast News Transcription Using HTK", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 719-722.
[WoodlandEtAl98] Woodland P.C., Hain T., Johnson S.E., Niesler T.R., Tuerk A., and Young S.J., "Experiments in Broadcast News Transcription", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 909-912.
[WoszcynaFinke96] Woszczyna M., and Finke M., "Minimizing Search Errors Due to Delayed Bigrams in Real-Time Speech Recognition Systems", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 137-140.
[WrightEtAl95] Wright J.H., Carey M.J., and Parris E.S., "Topic discrimination using higher-order statistical models of spotted keywords", Computer Speech and Language, 1995, no. 9, pp. 381-405.
[WuEtAl97] Wu S-L., Shire M.L., Greenberg S., and Morgan N., "Integrating Syllable Boundary Information into Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 987-990.
[WuEtAl98] Wu S-L., Kingsbury B.E.D., Morgan N., and Greenberg S., "Incorporating Information from Syllable-length Time Scales into Automatic Speech Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 721-724.
[YamamotoEtAl94] Yamamoto S., Takeda K., Inoue N., Kuroiwa S., and Naito M., "A Voice-activated Telephone Exchange System and its Field Trial", 1994 IEEE Second Workshop - Interactive Voice Technology for Telecommunications Applications, 1994, pp. 21-26.
[YamamotoSagisaka99] Yamamoto H., and Sagisaka Y., "Multi-Class Composite N-Gram Based on Connection Direction", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 533-536.
[Young89] Young R.K., Wavelet Theory and its Applications, Kluwer, Boston MA, 1989.
[Young92] Young S.J., "The General Use of Tying in Phoneme-based HMM Speech Recognizers", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992, pp. 569-572.
[YoungWoodland94] Young S.J., and Woodland P.C., "State Clustering in Hidden Markov Model-based Continuous Speech Recognition", Computer Speech and Language, 1994, Vol. 8, No. 4, pp. 369-383.
[YuEtAl98] Yu H., Clark C., Malkin R., and Waibel A., "Experiments in Automatic Meeting Transcription Using JRTK", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 921-924.
[ZavaliagkosEtAl96] Zavaliagkos G., Schwartz R., and McDonough J., "Maximum A Posteriori Adaptation for Large Scale HMM Recognizers", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 725-728.
[ZavaliagkosEtAl98] Zavaliagkos G., McDonough J., Miller R., El-Jaroudi A., Billa J., Richardson F., Ma K., Siu M., and Gish H., "The BBN BYBLOS 1997 Large Vocabulary Conversational Speech Recognition System", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 905-908.
[Zeljkovic96] Zeljkovic I., "Decoding Optimal State Sequence with Smooth State Likelihoods", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996, pp. 129-132.
[ZeppenfeldEtAl97] Zeppenfeld T., Finke M., Ries K., Westphal M., and Waibel A., "Recognition of Conversational Telephone Speech Using the JANUS Speech Engine", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1811-1814.
[ZhanWestphal97] Zhan P., and Westphal M., "Speaker Normalization Based on Frequency Warping", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 1039-1042.