Music Content Analysis Through Models of Audition
Keith D. Martin, Eric D. Scheirer, and Barry L. Vercoe
MIT Media Laboratory Machine Listening Group
Cambridge MA USA
{kdm, eds, bv}@media.mit.edu

ABSTRACT

The direct application of ideas from music theory and music signal processing has not yet led to successful musical multimedia systems. We present a research framework that addresses the limitations of conventional approaches by questioning their (often tacit) underlying principles. We discuss several case studies from our own research on the extraction of musical rhythm, timbre, harmony, and structure from complex audio signals; these projects have demonstrated the power of an approach based on a realistic view of human listening abilities. Continuing research in this direction is necessary for the construction of robust systems for music content analysis.

INTRODUCTION

Most attempts to build music-analysis systems have tried hard to respect the conventional wisdom about the structure of music. However, this model of music (based on notes grouped into rhythms, chords, and harmonic progressions) is really only applicable to a restricted class of listeners; there is strong evidence that non-musicians do not hear music in these terms. As a result, attempts to directly apply ideas from music theory and statistical signal processing have not yet led to successful musical multimedia systems. Today's computer systems are not capable of understanding music at the level of an average five-year-old; they cannot recognize a melody in a polyphonic recording or understand a song on a children's television program. We believe that to build robust and broadly useful musical systems, we must discard entrenched ideas about what it means to listen to music and start again. The goal of this paper is to present an unconventional view of what human listeners are able and, especially, unable to do when listening to music.
We provide a conceptual framework that acknowledges these perceptual limitations and even exploits them for the purpose of building artificial listening systems. We hope to convince other researchers to think deeply about the limitations of conventional approaches and to consider alternatives to the direct application of research results from structuralist music psychology to the construction of music-analysis systems. Many of these ideas are rooted in traditional music theory and have questionable relevance to the practical issues of building real computational models. In contrast, we will present evidence from our own modeling work, which advocates a more directly psychoacoustic perspective.

The paper has three main sections. First, we present a collection of broad research goals in musical content analysis. Second, we describe a research framework that attempts to address the limitations of conventional approaches. Third, we present several case studies from our own research, demonstrating the power of an approach based on a realistic view of the abilities of human listeners.

BROAD RESEARCH GOALS

Current research in the Machine Listening Group at the MIT Media Lab addresses two broad goals simultaneously. The first is the scientific goal of building computer models in order to understand the properties of human perceptual processes. The second is the practical goal of engineering computer systems that can understand musical sound and building useful applications around them. Although the framework discussed here is wholly compatible with the broader project of general sound understanding, this paper addresses only music, and only from an application-centered perspective. For the case studies we give as examples here, there are direct parallels in our research into non-musical sound and the scientific study of auditory perception in general.
Presented at ACM Multimedia '98 Workshop on Content Processing of Music for Multimedia Applications, Bristol UK, 12 Sept 1998

Computer systems with human-like music-listening skills would enable many useful applications:

- Interaction with large databases of musical multimedia could be made simpler by annotating audio data with information that is useful for search and retrieval, by labeling salient events, and by identifying sound sources in the recordings (Wold et al. 1996; Foote in press). The computer might understand naturally expressed queries by a user and use automatic segmentation to return only those parts of the database that are of interest to the user.

- The first generation of partly-automated structured-audio coding tools could be built. As structured-audio techniques (Vercoe, Gardner and Scheirer 1998) for very-low-bitrate transmission of interactive soundtracks become widely available, it will be desirable to develop content for them rapidly.

- Robust automated musical collaboration systems could be constructed. Such systems could act as teachers and accompanists, and also as "mediators" in real-time human collaboration over large distances, where communication delay would otherwise make musical interaction impossible. The more powerful and humanistic the machine listener becomes, the more sensitive and musical we can make the synthetic performer (Vercoe 1984).

- More intelligent tools could be constructed for composers, whether experienced or novice. By allowing rapid access to in-progress compositions, and control over databases of timbre, samples, and other material, a "composers' workbench" incorporating musically intelligent systems would become a powerful tool (Chafe, Mont-Reynaud and Rush 1982).

In order to build effective musical multimedia systems, it is important to understand how people are able to communicate about music and to understand the musical behavior of others.
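The search-and-retrieval scenario above can be made concrete with a toy sketch of feature-based audio indexing in the spirit of Wold et al.: compute a perceptual feature for each recording (here, spectral centroid, a rough correlate of brightness), then answer a query by nearest-neighbor search over the feature values. This is our own minimal illustration, not the published system; all function names, signal choices, and parameters are illustrative assumptions.

```python
import math

def spectral_centroid(frame, sr):
    """Brightness correlate: magnitude-weighted mean frequency of one frame."""
    n = len(frame)
    num = den = 0.0
    for k in range(n // 2 + 1):  # naive DFT; fine for a toy example
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = math.hypot(re, im)
        num += (k * sr / n) * mag
        den += mag
    return num / den if den else 0.0

def tone(freq, sr, n):
    """Synthetic stand-in for a recorded sound: a pure sinusoid."""
    return [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]

def nearest(query_feat, database):
    """database: list of (name, feature); return the name closest to the query."""
    return min(database, key=lambda item: abs(item[1] - query_feat))[0]

SR, N = 8000, 256
db = [(name, spectral_centroid(tone(f, SR, N), SR))
      for name, f in [("low_tone", 375.0), ("high_tone", 2000.0)]]
query = spectral_centroid(tone(500.0, SR, N), SR)
print(nearest(query, db))  # the 500 Hz query is closer in brightness to the 375 Hz sound
```

A real system of this kind would use a vector of several features (loudness, pitch, bandwidth, and so on) computed over many frames, but the principle, indexing by perceptual features rather than by notes, is the same.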
To do so requires understanding the human representations of musical sound and building computer systems with compatible representations. The goal should be to make the interface natural, allowing human users to interact in the same way they do with other humans: perhaps by humming a few bars, articulating similarity judgments, or performing a "beat-box" interpretation of a song. The best way to understand what a human can hear in music (and thus what musical "information" humans have access to) is to study the listening process itself.

UNDERLYING PRINCIPLES

Every research agenda is based on a set of principles regarding the topic under study. Our principles differ greatly from those typically employed in musical content analysis, and we believe that our approach will yield more useful computer-music systems in both the near and far term. The following summary is not exhaustive, but it is intended to convey a sense of how our approach differs from the usual one.

The abilities of non-musicians are of primary interest.

Much of music psychology relies on experimental results from "expert" listeners. Indeed, advanced music students at the university level are often used as experimental subjects. This creates confounds in the experiments, as they are often intended to examine exactly those aspects of music (key sense, harmonic perception, pitch equivalence) that the subjects have been taught for many years. A review of psychological research using non-musician listeners (Smith 1997) indicates that many of the musical abilities assumed, and unsurprisingly detected, in musicians are not present in non-musicians. Non-musicians cannot recognize intervals, do not make categorical decisions about pitch, do not understand the functional properties that theoreticians impute to chords and tones, do not recognize common musical structures, and might not even maintain octave similarity.
This is not to dismiss non-expert listeners as musically worthless; rather, it is to say that unless we want music systems to be relevant only to the skills and abilities of listeners who are graduate-level musicians, we must be cautious about the assumptions we follow. Naïve listeners can extract a great deal of information from musical sounds, and they are likely to be the primary users of many of the applications we would like to build. For example, non-musicians are generally able to identify the beat in music performed with arbitrary timbres, find and recognize the melody in a recording of a complex arrangement, determine the identity of the sound sources in a recording, divide a piece of music into sections (e.g., verse, chorus, bridge), articulate whether a piece of music is "complex" or "simple," identify the genre of a piece of music, and so forth. At present, computers cannot perform any of these tasks as well as even the least musically sophisticated human listener.

We believe that much of the process of listening to and understanding music is simply the application of general hearing processes to a musical signal. In that regard, there is no reason to consider musicians as special; they only have stronger cognitive schemata that allow them to make different use of the same perceptual data. We assume that the abilities of a non-musician are prerequisite to, and to some extent separable from, the abilities of a musician.

"Notes" are not important.

To a non-expert listener, the acoustic structure or "surface" of a piece of music is more important than its written form. Although many music-psychological models require transcription (whether explicit or implicit) into a note-level representation prior to analysis, this transcriptive metaphor is simply an incorrect description of human perception (Scheirer 1996). For most kinds of music, there are no notes in the brain.
We hear groups of notes (chords) as single objects in many circumstances, and it can be quite difficult to "hear out" whether or not a particular pitch has been heard in a chord. In the next section, we will show, as one example, that beat tracking of arbitrarily complex music does not require transcription (or even explicit "onset events"). Analysis of music into "notes" is likewise unnecessary for classification of music by genre, identification of musical instruments by their timbre, or segmentation of music into sectional divisions.
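To illustrate how periodicity can be found with no note-level events at all, here is a minimal sketch in the spirit of Scheirer's comb-filter approach to beat tracking: an amplitude envelope is fed through a bank of comb-filter resonators, one per candidate tempo, and the resonator that accumulates the most output energy names the tempo. The envelope here is synthetic, and the feedback gain, envelope rate, and candidate range are our own illustrative assumptions rather than the published system's parameters.

```python
def comb_energy(env, delay, alpha=0.9):
    """Comb-filter resonator y[n] = alpha*y[n-delay] + (1-alpha)*x[n].
    Output energy is large when `delay` matches a periodicity in env."""
    y = [0.0] * len(env)
    for n in range(len(env)):
        fb = y[n - delay] if n >= delay else 0.0
        y[n] = alpha * fb + (1.0 - alpha) * env[n]
    return sum(v * v for v in y)

def estimate_tempo(env, env_sr, bpm_candidates):
    """Return the candidate tempo whose resonator delay resonates most strongly."""
    best_bpm, best_e = None, -1.0
    for bpm in bpm_candidates:
        delay = round(env_sr * 60.0 / bpm)  # resonator period in envelope samples
        e = comb_energy(env, delay)
        if e > best_e:
            best_bpm, best_e = bpm, e
    return best_bpm

# Synthetic amplitude envelope: a pulse every 0.5 s (120 BPM), 100 Hz envelope rate.
ENV_SR = 100
env = [0.0] * (10 * ENV_SR)
for n in range(0, len(env), ENV_SR // 2):
    env[n] = 1.0

print(estimate_tempo(env, ENV_SR, range(60, 181, 2)))  # -> 120
```

Note that nothing here looks like a note or a discrete onset event: the resonators operate directly on a continuous amplitude envelope, which in a full system would be derived from a bank of bandpass filters applied to the audio signal itself.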