Extracting Information from Music Audio

Information includes individual notes, tempo, beat, and other musical properties, along with listener preferences based on how the listener experiences music.

By Daniel P.W. Ellis
COMMUNICATIONS OF THE ACM, August 2006/Vol. 49, No. 8

Music audio contains a great deal of information and emotional significance for human listeners. Machine systems able to identify the listener-salient detail in music, or to predict listener judgments, would be useful in many domains, from music theory to data compression to e-commerce. Here, I consider the kinds of information available in the music signal, reviewing current work in automatic music signal analysis, from the detection of individual notes to the prediction of listeners' music preferences. While this is a vibrant research area, including an annual international evaluation, researchers are still far from a working model of how listeners experience music.

Music is arguably the richest and most carefully constructed of all acoustic signals; several highly trained performers might work for hours to get the precise, desired effect in a particular recording. We can reasonably conclude that the amount of information carried by the musical waveform is greater than in any other sound, although this conclusion also takes us into the problematic territory of trying to define exactly what information it is that music carries, why it exists, and why so many people spend so much time creating and enjoying it.

Putting aside these philosophical points (they're beyond my scope here), we can name many objective aspects of a music recording (such as beat, melody, and lyrics) a listener might extract. As with other perceptual feats, we can hope to build computer-based systems to mimic these abilities. It will be interesting to see how well it can be done and to consider the applications in which these systems might be used.

Music and computers have been linked since the earliest days of electronic computation, including the synthesis in 1967 by Max Matthews (then a researcher at Bell Labs) of "Daisy Daisy" on an IBM 7094 mainframe. Computer music synthesis soon led to the idea of computer music analysis, with the first attempt at automatic transcription in 1977 [9]. However, it was clear that, as with other attempts at machine perception, the seemingly effortless analysis performed by the human senses was very difficult to duplicate on a machine. We are only now on the verge of having the algorithms, computational power, and data sets needed to produce systems that approach useful, general music transcription, along with various other musically relevant judgments. Meanwhile, technological developments have also presented urgent challenges in navigating large online and portable music collections that cry out for a "listening machine" able to hear, remember, and retrieve in listener-relevant terms.

Here, I look at a range of problems in extracting information from music recordings, starting with the most detailed (such as the individual notes played by a performer) and moving to high-level properties (such as musical genre applying to entire pieces or collections of recordings). The unifying theme is that abstract, symbolic information is extracted from raw audio waveforms. Thus, I do not include the significant body of work on making high-level musical inferences directly from score representations (such as machine-readable note-event descriptions like the Musical Instrument Digital Interface, or MIDI), though that work has influenced more recent audio-based work.

EVENT-SCALE INFORMATION

The information carried by music occurs at multiple levels, or timescales, each useful to automatic analysis systems for a variety of purposes. At the shortest timescale are the individual musical note events (such as individual strikes on a piano keyboard).
A musical score comprises a formal notation for these events, suitable for enabling a performer to play a piece. Music transcription is the process of recovering the musical score describing the individual notes played in a recording; we know it is possible because music students (after appropriate training) often do it very well. Transcription is valuable for searching for a particular melody within a database of recordings (needed for query by humming); high-quality transcripts would also make possible a range of analysis-resynthesis applications, including analyzing, modifying, and cleaning up famous archival recordings. A commercial example is Zenph Studios (www.zenph.com), a four-year-old startup that recreates damaged or noisy recordings of piano masterpieces by extracting the precise performance details, then re-rendering them on a robotic piano.

Musical pitch arises from local, regular repetition (periodicity) in the sound waveform, which in turn gives rise to regularly spaced sinusoid harmonics at integer multiples of the fundamental frequency in a spectral, or Fourier, analysis. Transcription could thus be a relatively simple search for a set of fundamental-frequency Fourier components. However, such a search may be compromised for two main reasons:

Indistinctness. Noise, limited dynamic range, and the trade-off between time and frequency resolution make identifying discrete harmonics in Fourier transforms unreliable and ambiguous; and

Interference. Simultaneous sinusoids of identical or close frequencies are difficult to separate, and conventional harmony guarantees that multiple-voice music is full of such collisions; even if their frequencies match, their relative phase may result in reinforcement or cancellation.

Nonetheless, many note transcription systems are based on fitting harmonic models to the signals and have steadily increased the detail extracted, ranging from one or two voices to higher-order polyphony. The range of acoustic conditions in which they can be applied has also increased, from small numbers of specific instruments to instrument-independent systems. Systems that transcribe notes from music audio include those described in [5, 6].

The Laboratory for the Recognition and Organization of Speech and Audio (LabROSA) at Columbia University has taken a more "ignorance-based" approach. There, my colleagues and I train general-purpose support-vector machine classifiers to recognize spectral slices (from the short-time Fourier transform magnitude, or spectrogram) containing particular notes, based on labeled training data [3]. This data may be obtained from multitrack music recordings (each instrument in a separate channel) by extracting the pitch of the main vocal line, then using the pitch values as labels for training features extracted from the full mixdown. This approach compares well to more traditional techniques, finishing third out of 10 systems in a 2005 formal evaluation of systems that identify the melody in popular music recordings. In this evaluation, conducted as part of the Music Information Retrieval Evaluation eXchange (MIREX-05) evaluations of music information retrieval technologies [2], our system correctly transcribed approximately 70% of melody notes (on average). In many cases, transcribed melodies were clearly recognizable, implying transcripts are useful (such as for retrieval). But a significant number of excerpts had accuracies below 50% and barely recognizable transcripts. At LabROSA, our use of the classifier approach for detecting multiple simultaneous and overlapping notes in piano music has also worked well.

Individual note events may not be the most salient way to describe a piece of music, since it is often the overall effect of the notes that matters most to a listener. Simultaneous notes give rise to chords, and musical traditions typically develop rich structures and conventions based on chords and similar harmonies. Chords could be identified by transcribing notes, then deciding what chord they constitute, but it is easier and more robust to take the direct path of recognizing chords from the audio. The identity of a chord (such as C major or E minor 7th) does not change if its notes move by multiples of one octave, so chord-recognition systems typically use so-called "chroma" features instead of normal spectra. Where a spectrogram slice describes the energy in every distinct frequency band (10Hz–20Hz, 20Hz–30Hz, 30Hz–40Hz, and so on), a chroma feature collects all the spectral energy associated with a particular semitone in the musical scale (such as A) by summing the energy from all the octave transpositions of that note over some range (such as 110Hz–117Hz, 220Hz–233Hz, and 440Hz–466Hz). Other chroma bins sum the energy from interleaved frequency combs.

Since the combination of notes in a chord can produce a fairly complex pattern, chord-recognition systems almost always rely on trained classifiers; LabROSA borrows heavily from speech recognition technology, using the well-known expectation-maximization (EM) algorithm to construct hidden Markov models (HMMs). Each model describes a chord family, and the process of model estimation simultaneously estimates the alignment between a known chord sequence and an audio recording while avoiding the time-consuming step of manually marking the training data.

Figure 1. Example transcription for a fragment of "Let It Be" by the Beatles. Below a conventional narrowband spectrogram are automatically generated estimates of downbeat, melody note probability, and piano part.
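The link between pitch and periodicity can be illustrated with a minimal monophonic pitch estimator: find the strongest peak in the signal's autocorrelation within a plausible range of periods. This is only a sketch of the principle, not any of the published transcription systems; the function name and parameter defaults are my own.

```python
import numpy as np

def estimate_f0(x, sr, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency of a monophonic signal.

    Musical pitch corresponds to waveform periodicity, so the
    autocorrelation of a pitched sound peaks at lags equal to
    the period (and its multiples).  We search for the strongest
    peak among lags corresponding to frequencies in [fmin, fmax].
    """
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    lo, hi = int(sr / fmax), int(sr / fmin)            # candidate period range
    period = lo + np.argmax(ac[lo:hi])
    return sr / period
```

On a clean 220Hz tone sampled at 8kHz this lands within about 1% of the true pitch (the period is quantized to a whole number of samples). The indistinctness and interference problems listed above are exactly what defeats such a simple estimator on real, polyphonic recordings.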
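The chroma computation described above can be sketched directly: fold each spectral slice into 12 pitch-class bins by summing the energy in semitone-wide bands at every octave transposition (so bin A collects roughly 110Hz–117Hz, 220Hz–233Hz, and 440Hz–466Hz). A minimal sketch; the function name, bin indexing (A = 0), and default range are my own choices, not those of any published system.

```python
import numpy as np

def chroma_from_spectrum(mag, sr, n_fft, fmin=110.0, n_octaves=3):
    """Fold one spectrogram slice into 12 chroma (pitch-class) bins.

    mag is the magnitude of an n_fft-point FFT of audio sampled at sr.
    Each chroma bin sums the spectral energy in the semitone-wide band
    at every octave transposition of that pitch class.
    """
    freqs = np.arange(len(mag)) * sr / n_fft   # frequency of each FFT bin
    chroma = np.zeros(12)
    for pc in range(12):                       # 12 pitch classes, pc 0 = A
        for octave in range(n_octaves):
            lo = fmin * 2 ** (octave + pc / 12.0)  # lower semitone edge
            hi = lo * 2 ** (1 / 12.0)              # one semitone higher
            band = (freqs >= lo) & (freqs < hi)
            chroma[pc] += np.sum(mag[band] ** 2)
    return chroma
```

Note that the bands widen geometrically with frequency, matching the musical scale rather than the fixed 10Hz spacing of an ordinary spectrogram slice.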
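To see why chroma features suit chord labeling, consider the simplest baseline that predates trained classifiers: score each candidate chord by the chroma energy at its constituent pitch classes and pick the best match. This template-matching sketch is a common textbook baseline, not the EM-trained HMM system described in the text, and the four-chord dictionary is purely illustrative.

```python
import numpy as np

# Minimal chord dictionary over chroma bins, indexed with pitch class A = 0
# (A=0, A#=1, B=2, C=3, C#=4, D=5, ...).  A real system would cover at
# least all 24 major/minor chords and smooth decisions over time.
TEMPLATES = {
    "A major": [0, 4, 7],   # A, C#, E
    "A minor": [0, 3, 7],   # A, C, E
    "D major": [5, 9, 0],   # D, F#, A
    "E major": [7, 11, 2],  # E, G#, B
}

def label_chord(chroma):
    """Return the template whose pitch classes best match the chroma energy."""
    c = chroma / (np.linalg.norm(chroma) + 1e-9)   # normalize overall level away
    scores = {name: sum(c[pc] for pc in pcs) for name, pcs in TEMPLATES.items()}
    return max(scores, key=scores.get)
```

Because chords in practice produce messier chroma patterns than these clean triads (harmonics of each note spill into other bins), trained classifiers with temporal smoothing, as in the HMM approach above, are the usual choice.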
