Semi-Continuous Hidden Markov Models for Speech Recognition
SEMI-CONTINUOUS HIDDEN MARKOV MODELS FOR SPEECH RECOGNITION

Xuedong Huang

Thesis submitted for the degree of Doctor of Philosophy
University of Edinburgh
1989

DECLARATION

This thesis, unless explicitly noted, was composed entirely by myself, and reports original work of my own execution.

Xuedong Huang, B.Sc.¹, M.Sc.²

¹ B.Sc. in Computer Science, Hunan University, 1982
² M.Sc. in Computer Science and Technology, Tsinghua University, 1984

ABSTRACT

Hidden Markov models, which can be based on either discrete output probability distributions or continuous mixture output density functions, have been demonstrated to be among the most powerful statistical tools available for automatic speech recognition. In this thesis, a semi-continuous hidden Markov model is proposed: a very general model that includes both the discrete and the continuous mixture hidden Markov model as special forms. It is a model in which vector quantisation, the discrete hidden Markov model, and the continuous mixture hidden Markov model are unified. Based on the assumption that each vector quantisation codeword can be represented by a continuous probability density function, the semi-continuous output probability is a combination of discrete, model-dependent weighting coefficients with these continuous codebook probability density functions. In comparison to the conventional continuous mixture hidden Markov model, the semi-continuous hidden Markov model can maintain the modelling ability of large-mixture probability density functions. In addition, the number of free parameters and the computational complexity can be reduced, because all of the probability density functions are tied together in the codebook. The semi-continuous hidden Markov model thus provides a good solution to the conflict between detailed acoustic modelling and insufficient training data.
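The combination described above can be sketched in illustrative notation (the symbols here are not necessarily those used in the thesis itself): for a state j, observation vector x, and a codebook of L codewords v_1, ..., v_L,

```latex
% Semi-continuous output probability: discrete, state-dependent
% weights b_j(v_k) combined with the shared codebook densities
% f(x | v_k), which are tied across all models and states.
b_j(\mathbf{x}) \;=\; \sum_{k=1}^{L} f(\mathbf{x} \mid v_k)\, b_j(v_k)
```

Because the densities f(x | v_k) belong to the codebook rather than to any individual model, only the discrete weights b_j(v_k) are model-specific, which is what keeps the number of free parameters small relative to a conventional continuous mixture model.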
In comparison to the conventional discrete hidden Markov model, robustness can be enhanced by using multiple codewords in deriving the semi-continuous output probability, and the vector quantisation codebook itself can be optimised together with the hidden Markov model parameters under the maximum likelihood criterion. Such unified modelling can substantially reduce the information lost in conventional vector quantisation. The semi-continuous hidden Markov model was evaluated in a range of speech recognition experiments, and the results clearly demonstrate that it offers improved speech recognition accuracy in comparison to both the discrete hidden Markov model and the continuous mixture hidden Markov model. It is concluded that the unified modelling theory is indeed powerful for modelling non-stationary stochastic processes with multi-modal probabilistic functions of Markov chains, and as such is very useful for automatic speech recognition.

ACKNOWLEDGEMENTS*

First of all, I would like to acknowledge with deep gratitude the support, patience, wisdom, and encouragement of Professor Mervyn Jack, my supervisor at Edinburgh University. His guidance made this work possible, his encouragement sustained it through the difficult stages, and his unending quest for excellence stimulated me to achieve many wonderful results. I would also like to thank Professor John Laver and Professor James Hieronymus for their interest in my work.

Part of the research in this thesis was carried out in the School of Computer Science, Carnegie Mellon University, USA. I would like to thank Dr. Kai-Fu Lee and Professor Raj Reddy, who made my visit to Carnegie Mellon University possible. In particular, Kai-Fu gave me invaluable advice in my research, and helped shape many aspects of this thesis.
I would like to extend my thanks to many other colleagues at Edinburgh University, Carnegie Mellon University, and Tsinghua University. I must thank in particular Dr. Yasuo Ariki, Dr. Fergus McInnes, and Mr. Hsiao-Wuen Hon, who helped me considerably in this research. I am grateful to Dr. George Duncan, Dr. Andrew Sutherland, Mr. Alan Crowe, Mr. Art Blokland, and Ms. Mei-Yuh Hwang for their help. I am also deeply indebted to Professor Ditang Fang and Professor Tong Chang, Tsinghua University, China, and Dr. Frank Soong, Bell Laboratories, USA, whose continuous support and encouragement will never be forgotten.

Finally, I wish to thank my family and friends for their support. I am especially grateful to my parents for all the sacrifices they have made over the years, and who, by their teaching and example, have taught me to appreciate the importance and honour of scholarly work. My most loving gratitude is reserved for my wife, Yingzhi. She encouraged and cared for me during the long days and nights of my research. Although I am indebted to the above people for their aid in this endeavour, I accept full responsibility for the final version of this thesis.

* This work was supported by an Edinburgh University Studentship and ORS Award.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1. Thesis Organisation
2. FUNDAMENTAL TECHNIQUES FOR SPEECH RECOGNITION
   2.1. Signal Processing for Speech Recognition
        2.1.1. Short-time Fourier analysis of speech signals
        2.1.2. LPC analysis of speech signals
        2.1.3. Cepstral analysis of speech signals
        2.1.4. Distortion measures
        2.1.5. Vector quantisation
   2.2. Acoustic Pattern Matching
        2.2.1. Dynamic time warping algorithm
        2.2.2. Hidden Markov modelling
        2.2.3. Neural networks
   2.3. Language Modelling
        2.3.1. Language models for speech recognition
        2.3.2. Complexity measures of language
   2.4. Summary
3. VECTOR QUANTISATION AND MIXTURE DENSITIES
   3.1. Conventional Vector Quantisation
        3.1.1. Codebook design
   3.2. Modelling the VQ Codebook as Mixture Densities
        3.2.1. Estimation of mixture densities
        3.2.2. The EM algorithm
   3.3. Summary
4. HIDDEN MARKOV MODELS AND BASIC ALGORITHMS
   4.1. Markov Processes
   4.2. Definition of the Hidden Markov Model
   4.3. Basic Algorithms for HMMs
        4.3.1. Forward-Backward algorithm
        4.3.2. Viterbi algorithm
        4.3.3. Baum-Welch reestimation algorithm
   4.4. Proof of the Reestimation Algorithm
   4.5. Time Duration Modelling
   4.6. Isolated vs. Continuous Speech Recognition
   4.7. Summary
5. CONTINUOUS HIDDEN MARKOV MODELS
   5.1. Continuous Parameter Reestimation
        5.1.1. General case
        5.1.2. Applications to single Gaussian density function
   5.2. Mixture Density Functions
   5.3. Continuous Mixture Hidden Markov Model
   5.4. Summary
6. UNIFIED THEORY WITH SEMI-CONTINUOUS MODELS
   6.1. Discrete HMM vs. Continuous HMM
   6.2. Semi-Continuous Hidden Markov Model
   6.3. Reestimation Formulas
   6.4. Semi-Continuous Decoder
   6.5. Proof of the Reestimation Formulas