Computational Auditory Scene Induction
NORTHWESTERN UNIVERSITY

Computational Auditory Scene Induction

A DISSERTATION
SUBMITTED TO THE GRADUATE SCHOOL AND THE DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE OF NORTHWESTERN UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
for the degree
DOCTOR OF PHILOSOPHY

Field of Computer Science

By Jinyu Han

EVANSTON, ILLINOIS
August 2012

© Copyright by Jinyu Han 2012
All Rights Reserved

ABSTRACT

Computational Auditory Scene Induction

Jinyu Han

Real-world sound is a mixture of different sources. The sound scene of a busy coffeehouse, for example, usually consists of several conversations, music playing, laughter and maybe a baby crying, a door being slammed, different machines operating in the background and more. When humans are confronted with these sounds, they rapidly and automatically orient themselves in this complex sound environment, paying attention to the sound source of interest. This ability is known in psychoacoustics as Auditory Scene Analysis (ASA).

The counterpart to ASA in machine listening is called Computational Auditory Scene Analysis (CASA): the effort to build computer models that perform auditory scene analysis. Research on CASA has led to great advances in machine systems capable of analyzing complex sound scenes, for tasks such as audio source separation and multiple pitch estimation. Such systems often fail in the presence of corrupted or incomplete sound scenes. In a real-world sound scene, different sounds overlap in time and frequency, interfering with and canceling each other. Sometimes critical information about the sound of interest is missing entirely, for example in an old recording from a scratched CD or in a band-limited telephone speech signal.
In a real world filled with incomplete sounds, the human auditory system has the ability, known as Auditory Scene Induction (ASI), to estimate the missing parts of a continuous auditory scene briefly covered by noise or other interference, and to perceptually resynthesize them. Since humans are able to infer the missing elements in an auditory scene, it is important for machine systems to have the same capability. However, there have been very few efforts in computer audition to realize this ability computationally.

This thesis focuses on the computational realization of auditory scene induction, which I call Computational Auditory Scene Induction (CASI). More specifically, the goal of my research is to build computer models that are capable of resynthesizing the missing information of an audio scene. Building upon existing statistical models (NMF, PLCA, HMM and N-HMM) for audio representation, I formulate this ability as a model-based spectrogram analysis and inference problem under the expectation-maximization (EM) framework with missing data in the observation. Various sources of information, including the spectral and temporal structure of audio and top-down knowledge about speech, are incorporated into the proposed models to produce accurate reconstruction of the missing information in an audio scene.

The effectiveness of the proposed machine systems is demonstrated on three audio signal processing tasks: singing melody extraction, audio imputation and audio bandwidth expansion. Each system is assessed through experiments on real-world audio data and compared to the state of the art. Although far from perfect, the proposed systems show many advantages and significant improvement over the existing systems. In addition, this thesis shows that different applications related to missing audio data can be considered under the unified framework of CASI. This opens a new avenue of research in the Computer Audition community.
Acknowledgements

First and foremost, I would like to thank my advisor, Professor Bryan Pardo, for creating the group in which I was able to do this work, for inviting me to join his little ensemble back in 2007, and for supporting me since then. Bryan opened the door for me to a whole new world of knowledge and practice. Without his unabated trust and unwavering commitment to providing me a creative and protected environment, this work would not have been accomplished. His passion for scientific exploration and his philosophy of research will continue to inspire me in the future.

I owe an immense amount of gratitude to Gautham J. Mysore, who has been an excellent mentor and collaborator over the last year. He sets an example as a scholar and taught me the qualities a researcher should possess, for which I am particularly grateful. He has taught me a great deal about research, from general approaches to problem solving to specifics about machine learning and signal processing.

Special thanks go to my thesis readers, Jorge Nocedal and Thrasyvoulos N. Pappas, for serving on my dissertation committee and for providing valuable feedback on this dissertation. Their insightful reading of my original proposal and their suggestions on it have greatly improved the final work. I thank Professor Thrasyvoulos N. Pappas for his enjoyable class on Digital Signal Processing, which built the foundations of my thesis work. I thank Professor Jorge Nocedal for his excellent lectures, from which I learned a great deal about optimization and machine learning. I am also grateful to Professor Doug Downey for participating in my PhD qualifying exam.

I would like to thank all of the members of the Media Technology Lab at Gracenote. I am extremely grateful to Markus Cremer and Bob Coovor for their inspiration and encouragement in research and in my personal life. Special thanks go to Ching-Wei Chen, with whom collaborating has been a great joy.
I would like to thank my wonderful former and present labmates who make the Interactive Audio (IA) Lab a pleasant place to work. Particular honors go to Zhiyao Duan, Zafar Rafii, Mark Cartwright, David Little, and Michael Skalak, with whom I have had particularly enlightening discussions and fruitful collaborations. Without John Woodruff's foundational work, my research would have been much more difficult. Many thanks also go to Arefin Huq, Rui Jiang, Sara Laupp, Anda Bereczky and Dominik Kaeser for making my time at the IA Lab particularly enjoyable.

I would like to thank Prof. Yuan Dong for giving me my first opportunity to conduct research and encouraging me to pursue graduate study. It was at his lab at the Orange Labs (Beijing), France Telecom, that I discovered my love and passion for audio-related research.

I would also like to acknowledge the financial support provided to me through two NSF grants (IIS-0812314 and IIS-0643752).

I dedicate this thesis to Jiayi Han, Feng Li and Jin Xu

Table of Contents

ABSTRACT
Acknowledgements
List of Tables
List of Figures
Chapter 1. Introduction
  1.1. Contribution
  1.2. Outline
  1.3. Structure in Audio
  1.4. Auditory Scene Analysis and Induction
  1.5. Motivation
Chapter 2. Singing Melody Extraction
  2.1. Related work
  2.2. Modeling of Audio
  2.3. System description
  2.4. Illustrative example
  2.5. Experiment
  2.6. Contributions and Conclusion
Chapter 3. Audio Imputation
  3.1. Related work
  3.2. Non-negative Hidden Markov Model
  3.3. Audio Imputation by Non-negative Spectrogram Factorization
  3.4. System description
  3.5. Experiment
  3.6. Contribution and Conclusion
Chapter 4. Language Informed Audio Bandwidth Expansion
  4.1. Related work
  4.2. System Overview
  4.3. Word Models
  4.4. Speaker Level Model
  4.5. Estimation of incomplete data
  4.6. Experimental results
  4.7. Contribution and Conclusion
Chapter 5. Conclusion and Future Research
  5.1. Future Directions
References

List of Tables

2.1 The expectation-maximization (EM) algorithm of PLCA learning
2.2 Performance comparison of the proposed algorithm against DHP and LW, averaged across 9 songs of 270 seconds from the MIREX melody extraction dataset.
3.1 The parameters of the Non-negative Hidden Markov Model. These parameters can be estimated using the Expectation-Maximization algorithm. q and z range over the sets of spectral component indices and dictionary indices respectively. f ranges over the set of analysis frequencies in the FFT.
3.2 The generative process of an audio spectrogram using N-HMM.
3.3 The EM process of N-HMM Learning
3.4 Algorithm I for Audio Imputation
3.5 Algorithm II for Audio Bandwidth Expansion
3.6 Audio excerpts dataset used for Evaluations
3.7 Performance of the Audio Imputation results by the proposed Algorithm I and PLCA. There is no statistically significant difference between the two methods at the 0.05 significance level (p-value 0.76).
3.8 Performance of the Audio Bandwidth Expansion results by the proposed Algorithm II and PLCA. There is a statistically significant difference between the two methods at the 0.05 significance level (p-value 0.01).
4.1 Algorithm III for Language Informed Speech Bandwidth Expansion
4.2 Scale of Mean Opinion Score used by the objective measure OVRL.
4.3 Performance of audio BWE results by the proposed method and PLCA in Con-A. Numbers in bold font indicate that the difference between the proposed method and PLCA is statistically significant by a Student's t-test at the 5% significance level.
4.4 Performance of audio BWE results by the proposed method and PLCA in Con-B. Numbers in bold font indicate that the difference between the proposed method and PLCA is statistically significant by a Student's t-test at the 5% significance level.

List of Figures

1.1 Illustration of the (a) waveform and (b) spectrogram of an audio clip of a male speaker saying, "She had your dark suit in greasy wash water all year". The level of the signal at a given time-frequency bin is indicated by a color value as explained in the (c) colorbar.