Sound-Source Recognition: A Theory and Computational Model

by Keith Dana Martin

B.S. (with distinction), Electrical Engineering (1993), Cornell University
S.M., Electrical Engineering (1995), Massachusetts Institute of Technology

Submitted to the Department of Electrical Engineering and Computer Science on May 17, 1999, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 1999.

© Massachusetts Institute of Technology, 1999. All Rights Reserved.

Author: Department of Electrical Engineering and Computer Science, May 17, 1999
Certified by: Barry L. Vercoe, Professor of Media Arts and Sciences, Thesis Supervisor
Accepted by: Professor Arthur C. Smith, Chair, Department Committee on Graduate Students

Abstract

The ability of a normal human listener to recognize objects in the environment from only the sounds they produce is extraordinarily robust with regard to characteristics of the acoustic environment and of other competing sound sources.
In contrast, computer systems designed to recognize sound sources function precariously, breaking down whenever the target sound is degraded by reverberation, noise, or competing sounds. Robust listening requires extensive contextual knowledge, but the potential contribution of sound-source recognition to the process of auditory scene analysis has largely been neglected by researchers building computational models of the scene analysis process.

This thesis proposes a theory of sound-source recognition, casting recognition as a process of gathering information to enable the listener to make inferences about objects in the environment or to predict their behavior. In order to explore the process, attention is restricted to isolated sounds produced by a small class of sound sources, the non-percussive orchestral musical instruments. Previous research on the perception and production of orchestral instrument sounds is reviewed from a vantage point based on the excitation and resonance structure of the sound-production process, revealing a set of perceptually salient acoustic features.

A computer model of the recognition process is developed that is capable of “listening” to a recording of a musical instrument and classifying the instrument as one of 25 possibilities. The model is based on current models of signal processing in the human auditory system. It explicitly extracts salient acoustic features and uses a novel improvisational taxonomic architecture (based on simple statistical pattern-recognition techniques) to classify the sound source. The performance of the model is compared directly to that of skilled human listeners, using both isolated musical tones and excerpts from compact disc recordings as test stimuli.
The computer model’s performance is robust with regard to the variations of reverberation and ambient noise (although not with regard to competing sound sources) in commercial compact disc recordings, and the system performs better than three out of fourteen skilled human listeners on a forced-choice classification task. This work has implications for research in musical timbre, automatic media annotation, human talker identification, and computational auditory scene analysis.

Thesis supervisor: Barry L. Vercoe
Title: Professor of Media Arts and Sciences

Acknowledgments

I am grateful for the fantastic level of support I have enjoyed in my time at MIT. First and foremost, I thank Barry Vercoe for bringing me into his rather unique research group (known by a wide range of names over the years, including The Machine Listening Group, Synthetic Listeners and Performers, Music and Cognition, and “The Bad Hair Group”). I could not have dreamed of a place with a broader sense of intellectual freedom, or of an environment with a more brilliant group of colleagues. Credit for both of these aspects of the Machine Listening Group is due entirely to Barry. I am thankful also for his patient encouragement (and indulgence) over the past six years.

I am grateful to Marvin Minsky and Eric Grimson for agreeing to serve as members of my doctoral committee. I have drawn much inspiration from reading their work, contemplating their ideas, and adapting their innovations for my own use.

Two members of the Machine Listening Group deserve special accolades. On a level of day-to-day interaction, I thank Eric Scheirer for serving as my intellectual touchstone. Our daily conversations over morning coffee have improved my clarity of thought immensely (not to mention expanded my taste in literature and film). Eric has been a faithful proofreader of my work, and my writing has improved significantly as a result of his feedback.
Many of my ideas are variations of things I learned from Dan Ellis, and I count him among my most influential mentors. Although my dissertation could be viewed as a criticism of his work, it is only because of the strengths of his research that mine makes sense at all.

I would also like to thank the other members of the Machine Listening Group, past and present. To my officemates, Youngmoo Kim, Bill Gardner, Adam Lindsay, and Nicolas Saint-Arnaud, thanks for putting up with my music and making the daily grind less of one. Also, thanks to Paris Smaragdis, Jonathan Feldman, Nyssim Lefford, Joe Pompei, Mike Casey, Matt Krom, and Kathryn Vaughn.

For helping me assemble a database of recordings, I am grateful to Forrest Larson of the MIT Music Library, Juergen Herre, Janet Marques, Anand Sarwate, and to the student performers who consented to being recorded, including Petra Chong, Joe Davis, Jennifer Grucza, Emily Hui, Danny Jochelson, Joe Kanapka, Teresa Marrin, Ole Nielsen, Bernd Schoner, Stephanie Thomas, and Steve Tistaert.

I would like to thank the students in the Media Lab’s pattern-recognition course who worked with some preliminary feature data and provided valuable feedback. Special thanks are due to Youngmoo Kim, who collaborated with me on my first musical instrument recognition system (Computer experiment #1 in Chapter 6).

Connie Van Rheenen, Betty Lou McClanahan, Judy Bornstein, Dennis Irving, Greg Tucker, and Linda Peterson provided essential support within the Media Lab, and I am thankful for their assistance. Other Media Labbers owed thanks include Bill Butera, Jocelyn Scheirer, and Tom Minka.

Over my many years of schooling, I have been strongly influenced or assisted by many teachers. At MIT, I have particularly benefited from interacting with Roz Picard, Whit Richards, and Bill Peake. I am grateful to two professors at Cornell University, Carol Krumhansl and Clif Pollock, both for being great teachers and for helping me get to MIT.
Of course, my most influential teachers are my parents, and I owe an immense debt of gratitude to them for their support (of all kinds) over the years. Thank you for helping me get through my many paroxysms of self-doubt.¹

Finally, I wish to thank my wife and best friend, Lisa, whose emotional support during the ups and downs of graduate school made it possible for me to keep going. I promise not to drag you through this again!

This thesis is dedicated to the memory of my grandfather, Dana West Martin.

1. A phrase gleefully stolen from David Foster Wallace, who, along with Kurt Vonnegut, Andy Partridge, and the members of Two Ton Shoe, deserves a little bit of credit for helping me maintain some level of sanity during the last year or so.

Table of Contents

1 Introduction
  1.1 Motivation and approach
  1.2 A theory of sound-source recognition
  1.3 Applications
  1.4 Overview and scope
2 Recognizing sound sources
  2.1 Understanding auditory scenes
    2.1.1 Exploiting environmental constraints
    2.1.2 The importance of knowledge
    2.1.3 Computational auditory scene analysis
  2.2 Evaluating sound-source recognition systems
  2.3 Human sound-source recognition
  2.4 Machine sound-source recognition
    2.4.1 Recognition within micro-domains
    2.4.2 Recognition of broad sound classes
    2.4.3 Recognition