Robust Structured Voice Extraction for Flexible Expressive Resynthesis
A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Pamornpol Jinachitra
June 2007

© Copyright by Pamornpol Jinachitra 2007. All Rights Reserved.

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Julius O. Smith, III (Principal Adviser)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Robert M. Gray

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jonathan S. Abel

Approved for the University Committee on Graduate Studies.

Abstract

Parametric representation of audio allows a reduction in the amount of data needed to represent a sound. If chosen carefully, these parameters can capture the expressiveness of the sound while reflecting the production mechanism of its source, and thus allow intuitive control for modifying the original sound in desirable ways. To achieve such a parametric encoding, algorithms are needed that can robustly identify the model parameters even from noisy recordings. As a result, not only do we obtain an expressive and flexible coding system, we also obtain a model-based speech enhancement that reconstructs speech embedded in noise cleanly, free of the musical noise usually associated with filter-based approaches. In this thesis, a combination of analysis algorithms that achieve automatic encoding of a human voice recorded in noise is described.
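To make the abstract's notion of a production-based parametric representation concrete, the following toy sketch synthesizes a voiced sound by driving an all-pole "vocal tract" filter with an impulse-train source. It is a minimal illustration of the source-filter idea only; the function name, formant values, and pulse-train source are hypothetical and much simpler than the glottal source models developed in the thesis.

```python
import numpy as np

def synthesize_vowel(f0_hz, formants, fs=8000, dur_s=0.25):
    """Drive an all-pole 'vocal tract' filter with an impulse-train
    'glottal' source. Formants are (center_freq_hz, bandwidth_hz)
    pairs; all values here are illustrative, not from the thesis."""
    n = int(fs * dur_s)
    period = max(1, int(round(fs / f0_hz)))
    source = np.zeros(n)
    source[::period] = 1.0  # crude glottal pulse train

    # Cascade one two-pole resonator per formant into a single
    # all-pole denominator polynomial a(z).
    a = np.array([1.0])
    for fc, bw in formants:
        r = np.exp(-np.pi * bw / fs)       # pole radius from bandwidth
        theta = 2.0 * np.pi * fc / fs      # pole angle from center frequency
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])

    # Direct-form all-pole filtering: y[t] = x[t] - sum_k a[k] * y[t-k].
    y = np.zeros(n)
    for t in range(n):
        acc = source[t]
        for k in range(1, min(len(a), t + 1)):
            acc -= a[k] * y[t - k]
        y[t] = acc
    return y

# Example: a rough /a/-like vowel at 120 Hz (hypothetical formant values).
wave = synthesize_vowel(120.0, [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)])
```

The parameters (pitch, formant frequencies, bandwidths) are few in number and physically interpretable, which is exactly the kind of compact, controllable description the thesis aims to estimate robustly from noisy recordings.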
The source-filter model is employed for parameterization of speech, especially voiced speech, and an iterative joint estimation of the glottal source and vocal tract parameters, based on Kalman filtering and the expectation-maximization (EM) algorithm, is presented. To find the right production model for each speech segment, speech segmentation is required, which is especially challenging in noise. A switching state-space model is adopted to represent the underlying speech production mechanism, relating the smoothly varying hidden variables to the observed speech. A technique called the unscented transform is incorporated into the algorithm to improve segmentation performance in noise. In addition, during voiced periods, the choice of glottal source model requires the detection of glottal closure instants; a dynamic programming-based algorithm with a flexible parametric model of the source is proposed for this task. Each algorithm is evaluated against recently published methods from the literature. The combined system demonstrates the possibility of parametric extraction of speech from a clean or moderately noisy recording, further providing the option of modifying the reconstruction to implement various desirable effects.

Acknowledgements

I would like to thank Professor Julius Smith, my principal advisor, who gave me all kinds of support throughout the course of developing the work in this dissertation. His encouragement and inspiration made a difference in propelling me to do better work and keep improving myself. His enthusiasm for sound source modeling inspired me to pursue research in this area, to think differently, and to take chances. His teachings in classes, seminars, and the meetings we had now and then have greatly influenced the way I approach problems.
His advice these last five years will surely guide me through my future career and life, just as it has guided me to the finishing of this thesis.

I would like to thank my associate advisor, Professor Robert M. Gray, who agreed so willingly to be on my reading committee and provided me with many useful comments that greatly expanded my horizons. Without a doubt, his expertise has contributed to the improvement of this dissertation.

I would like to thank Dr. Jonathan Abel, who is also on my reading committee. He has contributed to the CCRMA community, and to my work here in particular, beyond his duty. His enthusiasm, encouragement, and technical insights have proved a great help to this work.

The CCRMA community has provided me with a great learning atmosphere in a multi-disciplinary area, which I always appreciate. I am particularly grateful to my colleagues and friends in the DSP group: Kyogu Lee, Ryan Cassidy, Patty Huang, Hiroko Terasawa, Ed Berdahl, David Yeh, Nelson Lee, Gautham Mysore, Matt Wright, and Greg Sell. The group's alumni, mostly my predecessors (Yi-Wen Liu, Aaron Master, Arvindh Krishnaswamy, Rodrigo Segnini, and Harvey Thornburg), also helped me along the way by both showing and sharing. We supported each other technically as well as morally, by letting each other know that we are not alone in this long and arduous process. My thanks extend to the regular participants of the CCRMA DSP seminar, who contributed bits and pieces along the way.

My gratitude also goes to my Master's thesis advisor, Professor Jonathon Chambers, who got me interested in statistical signal processing and perhaps set the path for the rest of my academic career.

I would like to also thank the people at Toyota InfoTechnology Center, U.S., the sole provider of my financial support for most of the last three years. Special thanks go to my manager, Ben Reaves, who looked after my welfare and professional development, not only as a manager but also as a friend.
I would like to thank Ramon Prieto, my colleague and friend there, who brought me in and shared a lot of his experience. I would like to thank the executives at the ITC for their continuous support, and friends there who made the experience all the more enjoyable.

Last but not least, I would like to express my eternal gratitude toward my beloved parents, who kept investing in my education and gave me the freedom, along with the moral and financial support, to pursue whatever I wanted. I would like to extend my gratitude to my sister, who has been a great big sister throughout and was a great help to the family in affording me the freedom and this wonderful experience abroad. Within my family, I would also like to thank Papinya for her support and companionship, which certainly made the hard part of this experience more bearable and the better part of it even more joyful.

Preface

The research in parametric representation of audio, especially in physical modeling, has been a dominant theme at CCRMA, Stanford University. The idea of structured audio coding, where a parametric description of the sound source is sent for resynthesis at the receiving end, originated at CCRMA [1, 2]. Since then, structured audio coding standards and tools have emerged in the musical domain. Arguably, though, structured speech coding has been the holy grail since Homer Dudley's day at Bell Laboratories, in the form of an articulatory model. Yet the most popular speech synthesis is achieved by concatenation of samples, and high-quality speech coding is achieved only with the help of codebooks.

This dissertation demonstrates the possibility of a well-structured parametric audio representation which can be obtained from real-world recordings where noise may be present. It is based heavily on prior research and inspired by the ideas developed at CCRMA.
The aspect of modification flexibility is largely inspired by the current difficulty of generating emotional speech from neutral speech, and by the evident desire for an expressive computer-generated singing voice that is also easy to control. The inclusion of noise is motivated by the lack of research in this area and by the experience of seeing many technologically superior techniques fail when it comes to real-world deployment. Speech enhancement by reconstruction also seems to be a deserted avenue of research that may not deserve to be abandoned just yet.

From the applications point of view, the dissertation is an attempt to bring together coding and synthesis while achieving speech enhancement as a by-product. It tries to achieve all the desirable properties of sound source analysis from natural recordings: high compression or data reduction; faithful and noise-free reconstruction; and physically meaningful controls that are flexible to manipulate or modify, all achieved with robustness against noise. Given the current inadequate knowledge of the true physical model for voice production, most algorithms presented in this thesis only try to stay as close as possible to some physical interpretation, while retaining the other properties. With more powerful techniques becoming available in recent years, an investigation into these "upgrades" to see if they can give such coding effects is an interesting endeavor for both academic and industrial purposes. Hopefully, the contributions here will encourage more attempts in this direction, toward even better well-structured parametric models that may allow us to attain the aforementioned properties.

Contents

Abstract  v
Acknowledgements  vii
Preface  ix
1 Introduction  1
  1.1 Review of Parametric Speech Representation  5
    1.1.1 Template Model  5
    1.1.2 Sinusoid Model  7
    1.1.3 The Vocoder  8
    1.1.4 Source-filter Model  9
    1.1.5 Formant Synthesizer  12
    1.1.6 Digital Waveguide Model  13
    1.1.7 Articulatory Model  15
    1.1.8 Fluid Mechanics Computational Models  17
  1.2 Singing Versus Speaking  18
  1.3 Parameter Identification and Speech Enhancement  19
  1.4 Summary  22
2 System Overview  23
  2.1 Segmentation Front-end  23
  2.2 Voice Model  24
  2.3 Synthesis System