Probabilistic Sequence Models with Speech and Language Applications
Thesis for the degree of Doctor of Philosophy

Probabilistic Sequence Models with Speech and Language Applications

Gustav Eje Henter

Communication Theory
School of Electrical Engineering
KTH – Royal Institute of Technology
Stockholm 2013

Copyright © 2013 Gustav Eje Henter except where otherwise stated. All rights reserved.

ISBN 978-91-7501-932-1
TRITA-EE 2013:042
ISSN 1653-5146

Communication Theory
School of Electrical Engineering
KTH – Royal Institute of Technology
SE-100 44 Stockholm, Sweden

Abstract

Series data, sequences of measured values, are ubiquitous. Whenever observations are made along a path in space or time, a data sequence results. To comprehend nature and shape it to our will, or to make informed decisions based on what we know, we need methods to make sense of such data. Of particular interest are probabilistic descriptions, which enable us to represent uncertainty and random variation inherent to the world around us.

This thesis presents and expands upon some tools for creating probabilistic models of sequences, with an eye towards applications involving speech and language. Modelling speech and language is not only of use for creating listening, reading, talking, and writing machines (for instance allowing human-friendly interfaces to future computational intelligences and smart devices of today), but probabilistic models may also ultimately tell us something about ourselves and the world we occupy. The central theme of the thesis is the creation of new or improved models more appropriate for our intended applications, by weakening limiting and questionable assumptions made by standard modelling techniques.

One contribution of this thesis examines causal-state splitting reconstruction (CSSR), an algorithm for learning discrete-valued sequence models whose states are minimal sufficient statistics for prediction.
Unlike many traditional techniques, CSSR does not require the number of process states to be specified a priori, but builds a pattern vocabulary from data alone, making it applicable for language acquisition and the identification of stochastic grammars. A paper in the thesis shows that CSSR handles noise and errors expected in natural data poorly, but that the learner can be extended in a simple manner to yield more robust and stable results also in the presence of corruptions.

Even when the complexities of language are put aside, challenges remain. The seemingly simple task of accurately describing human speech signals, so that natural synthetic speech can be generated, has proved difficult, as humans are highly attuned to what speech should sound like. A pair of papers in the thesis therefore study nonparametric techniques suitable for improved acoustic modelling of speech for synthesis applications. Each of the two papers targets a known-incorrect assumption of established methods, based on the hypothesis that nonparametric techniques can better represent and recreate essential characteristics of natural speech.

In the first paper of the pair, Gaussian process dynamical models (GPDMs), nonlinear, continuous state-space dynamical models based on Gaussian processes, are shown to better replicate voiced speech, without traditional dynamical features or assumptions that cepstral parameters follow linear autoregressive processes. Additional dimensions of the state-space are able to represent other salient signal aspects such as prosodic variation. The second paper, meanwhile, introduces KDE-HMMs, asymptotically-consistent Markov models for continuous-valued data based on kernel density estimation, that additionally have been extended with a fixed-cardinality discrete hidden state. This construction is shown to provide improved probabilistic descriptions of nonlinear time series, compared to reference models from different paradigms.
The hidden state can be used to control process output, making KDE-HMMs compelling as a probabilistic alternative to hybrid speech-synthesis approaches.

A final paper of the thesis discusses how models can be improved even when one is restricted to a fundamentally imperfect model class. Minimum entropy rate simplification (MERS), an information-theoretic scheme for postprocessing models for generative applications involving both speech and text, is introduced. MERS reduces the entropy rate of a model while remaining as close as possible to the starting model. This is shown to produce simplified models that concentrate on the most common and characteristic behaviours, and provides a continuum of simplifications between the original model and zero-entropy, completely predictable output. As the tails of fitted distributions may be inflated by noise or empirical variability that a model has failed to capture, MERS’s ability to concentrate on high-probability output is also demonstrated to be useful for denoising models trained on disturbed data.

Keywords: Time series, acoustic modelling, speech synthesis, stochastic processes, causal-state splitting reconstruction, robust causal states, pattern discovery, Markov models, HMMs, nonparametric models, Gaussian processes, Gaussian process dynamical models, nonlinear Kalman filters, information theory, minimum entropy rate simplification, kernel density estimation, time-series bootstrap.

List of Papers

The thesis is based on the following papers:

[A] G. E. Henter, M. R. Frean, and W. B. Kleijn, “Gaussian Process Dynamical Models for Nonparametric Speech Representation and Synthesis,” in Proc. ICASSP, 2012, pp. 4505–4508.
[B] G. E. Henter and W. B. Kleijn, “Picking Up the Pieces: Causal States in Noisy Data, and How to Recover Them,” Pattern Recogn. Lett., vol. 34, no. 5, pp. 587–594, 2013.
[C] G. E. Henter and W. B. Kleijn, “Minimum Entropy Rate Simplification of Stochastic Processes,” IEEE T. Pattern Anal., submitted.
[D] G. E. Henter, A. Leijon, and W. B. Kleijn, “Kernel Density Estimation-Based Markov Models with Hidden State,” manuscript in preparation.

In addition to papers A–D, the following papers have also been produced in part by the author of the thesis:

[E] G. E. Henter and W. B. Kleijn, “Simplified Probability Models for Generative Tasks: A Rate-Distortion Approach,” in Proc. EUSIPCO, vol. 18, 2010, pp. 1159–1163.
[F] G. E. Henter and W. B. Kleijn, “Intermediate-State HMMs to Capture Continuously-Changing Signal Features,” in Proc. Interspeech, vol. 12, 2011, pp. 1817–1820.
[G] P. N. Petkov, W. B. Kleijn, and G. E. Henter, “Enhancing Subjective Speech Intelligibility Using a Statistical Model of Speech,” in Proc. Interspeech, vol. 13, 2012, pp. 166–169.
[H] P. N. Petkov, G. E. Henter, and W. B. Kleijn, “Maximizing Phoneme Recognition Accuracy for Enhanced Speech Intelligibility in Noise,” IEEE T. Audio Speech, vol. 21, no. 5, pp. 1035–1045, 2013.

Acknowledgements

While this thesis may have just one name printed on the front page, it owes its existence to the input and support of many people, and I am grateful for each and every one.

First and foremost, I must give thanks to my two scientific guiding stars at SIP who offered me the opportunity to pursue a PhD, and first among them my principal supervisor, supermind Professor Bastiaan Kleijn: Your scientific creativity is immense and inspiring, your considerable experience is dispensed in lucid and compact pearls of wisdom, your formidable skill in debate has helped hone mine, and your editing capabilities and sense for greatness in scientific presentation are second to none, with our papers being much stronger as a result. Thank you for your immeasurable teachings, and for developing me from a student into a researcher. I hope you can see hints of yourself reflected back in this thesis.
Next, I am deeply grateful to my second supervisor, Professor Arne Leijon: Your abilities as a scientist and engineer, your didactic skills, your approachability, and your dependability are as great as your humility. Thank you for turning me from a student into a teacher. It has been a delight to work with you on the Pattern Recognition course, and I value your opinion immensely, whether in science or outside of it.

Besides my two supervisors, I am also indebted to many other seniors in the department, including Professor Markus Flierl, Doctor Saikat Chatterjee, and Professor Mikael Skoglund, not only directly but also for their work in keeping electrical engineering at KTH a thriving research environment. On that note, I also must thank Dora Söderberg for her happy assistance with administrative matters. Additionally, I am grateful to Professor Ragnar Thobaben for helpfully performing quality review of the thesis.

Next I must thank my coworkers at SIP, and lately Communication Theory, who have made my years as a PhD student rewarding both scientifically and socially. Aside from Ermin, Konrad, Pravin, Zhanyu, and Sebastian, who I’ve had the pleasure of sharing an office with, special thanks go to all who have worked on their sound and imaging PhDs alongside me, including, alphabetically, Anders, Chris, David Z, Du, Guoqiang, Haopeng, Jalil, Jan, Janusz, Minyue, Nasser, Obada, Petko (also a scientific collaborator), and Svante. Among the post-docs, Cees, Hannes, and Timo also come to mind. Our many interesting discussions at lunch, around the fika table, at seminars, and on many other occasions will remain a fond memory of mine.

Speaking of interesting discussions, I would be remiss if I did not mention my MSc student Andreas and the many inquisitive, ambitious,