Music-Listening Systems
Eric D. Scheirer

B.A. Linguistics, Cornell University, 1993
B.A. Computer Science, Cornell University, 1993 (cum laude)
S.M. Media Arts and Sciences, Massachusetts Institute of Technology, 1995

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements of the degree of Doctor of Philosophy at the Massachusetts Institute of Technology

June, 2000

Copyright © 2000, Massachusetts Institute of Technology. All rights reserved.

Author
Program in Media Arts and Sciences
April 28, 2000

Certified By
Barry L. Vercoe
Professor of Media Arts and Sciences
Massachusetts Institute of Technology

Accepted By
Stephen A. Benton
Chair, Departmental Committee on Graduate Students
Program in Media Arts and Sciences
Massachusetts Institute of Technology

Music-Listening Systems
Eric D. Scheirer

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on April 28, 2000, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Abstract

When human listeners are confronted with musical sounds, they rapidly and automatically orient themselves in the music. Even musically untrained listeners have an exceptional ability to make rapid judgments about music from very short examples, such as determining the music’s style, performer, beat, complexity, and emotional impact.
However, there are presently no theories of music perception that can explain this behavior, and it has proven very difficult to build computer music-analysis tools with similar capabilities.

This dissertation examines the psychoacoustic origins of the early stages of music listening in humans, using both experimental and computer-modeling approaches. The results of this research enable the construction of automatic machine-listening systems that can make human-like judgments about short musical stimuli.

New models are presented that explain the perception of musical tempo, the perceived segmentation of sound scenes into multiple auditory images, and the extraction of musical features from complex musical sounds. These models are implemented as signal-processing and pattern-recognition computer programs, using the principle of understanding without separation. Two experiments with human listeners study the rapid assignment of high-level judgments to musical stimuli, and it is demonstrated that many of the experimental results can be explained with a multiple-regression model on the extracted musical features.

From a theoretical standpoint, the thesis shows how theories of music perception can be grounded in a principled way upon psychoacoustic models in a computational-auditory-scene-analysis framework. Further, the perceptual theory presented is more relevant to everyday listeners and situations than are previous cognitive-structuralist approaches to music perception and cognition. From a practical standpoint, the various models form a set of computer signal-processing and pattern-recognition tools that can mimic human perceptual abilities on a variety of musical tasks such as tapping along with the beat, parsing music into sections, making semantic judgments about musical examples, and estimating the similarity of two pieces of music.

Thesis Supervisor: Barry L. Vercoe, D.M.A.
Professor of Media Arts and Sciences

This research was performed at the MIT Media Laboratory. Primary support was provided by the Digital Life consortium of the Media Laboratory, with additional support from Interval Research Corporation. The views expressed within do not necessarily reflect the views of supporting sponsors.

Doctoral Dissertation Committee

Thesis Advisor
Barry L. Vercoe
Professor of Media Arts and Sciences
Massachusetts Institute of Technology

Thesis Reader
Rosalind W. Picard
Associate Professor of Media Arts and Sciences
Massachusetts Institute of Technology

Thesis Reader
Perry R. Cook
Assistant Professor of Computer Science
Assistant Professor of Music
Princeton University

Thesis Reader
Malcolm Slaney
Member of the Technical Staff
Interval Research Corporation
Palo Alto, CA

Acknowledgments

It goes without saying, or it should, that the process of doing the research and writing a dissertation is really a collaborative effort. While there may be only one name on the title page, that name stands in for a huge network of colleagues and confidants, without whom the work would be impossible or unbearable. So I would forthwith like to thank everyone who has made this thesis possible, bearable, or both.

I have been lucky to reside in the Machine Listening Group of the Media Laboratory over a time period that allowed me to collaborate with so many brilliant young researchers. Dan Ellis and Keith Martin deserve special mention for their role in my research life. Dan took control of the former “Music and Cognition” group and showed us how we could lead the way into the bright future of Machine Listening. Keith and I arrived at the Lab at the same time, and have maintained a valuable dialogue even though he escaped a year earlier than I. I feel Dan’s influence most strongly in the depth of background that he encouraged me, by example, to develop.
Many of the new ideas here were developed in collaboration with Keith, and the stamp of his critical thinking is on every page.

The rest of the Machine Listening Group students past and present were no less critical to my research, whether through discussion, dissension, or distraction. In alphabetical order, I acknowledge Jeff Bilmes, Michael Casey, Wei Chai, Jonathan Feldman, Ricardo García, Bill Gardner, Adam Lindsay, Youngmoo Kim, Nyssim Lefford, Joe Pompei, Nicolas Saint-Arnaud, and Paris Smaragdis.

During the final year of my stay here, I had the good fortune to be exiled from the main Machine Listening office space; this is good fortune because I landed in an office with Push Singh. Push’s remarkable insight into the human cognitive system, and his willingness to speculate, discuss, and blue-sky with me, have enriched both my academic life and my thesis.

Barry Vercoe, of course, deserves the credit for organizing and managing this wonderful group of people. Further, he has given us the greatest gift that an advisor can give his students: the freedom to pursue our own interests and ideas. Barry’s willingness to shelter us from the cold winds of Media Lab demo pressure has been the catalyst for the group’s academic development.

Connie van Rheenen deserves special credit for the three years that she has been our administrator, resource, and den mother. I don’t recall how it was that we managed our lives before Connie joined us.

The other members of my committee, Roz Picard, Perry Cook, and Malcolm Slaney, have been exemplary in their continuous support and encouragement. From the earliest stages of my proposal to the last comments on this thesis, they have been a bountiful source of inspiration, encouragement, and suggestions.

I would also like to specially mention Prof. Carol Krumhansl of Cornell University. My first exposure to music psychology and the other auditory sciences came in her class.
The two years that I spent as her undergraduate assistant impressed on me the difficulty and reward of a long-term experimental research program, and her analytic focus is unparalleled. Carol also encouraged me to apply to the Media Lab, a decision that I have never regretted for a second.

My parents have been uniquely unflagging in their ability to make me feel free to do exactly as I pleased, while still demanding that I live up to my ability. From the earliest time I can remember, they have instilled in me a love of learning that has only grown over the years.

Finally, there is no way to express the love and gratitude that I feel for my wife Jocelyn every moment of every day. Of all the people I have ever met, she is the most caring, most supportive, and most loving. I owe everything I have done and will do to her.

Contents

CHAPTER 1 INTRODUCTION
1.1. ORGANIZATION
CHAPTER 2 BACKGROUND
2.1. PSYCHOACOUSTICS
2.1.1. Pitch theory and models
2.1.2. Computational auditory scene analysis
2.1.3. Spectral-temporal pattern analysis
2.2. MUSIC PSYCHOLOGY