
A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures

Adam Berenzweig, LabROSA, Columbia University, New York NY, U.S.A. ([email protected])
Beth Logan, HP Labs, One Cambridge Center, Cambridge MA, U.S.A. ([email protected])
Daniel P.W. Ellis, LabROSA, Columbia University, New York NY, U.S.A. ([email protected])
Brian Whitman, Music Mind & Machine Group, MIT Media Lab, Cambridge MA, U.S.A. ([email protected])

Abstract

Subjective similarity between musical pieces and artists is an elusive concept, but one that must be pursued in support of applications to provide automatic organization of large music collections. In this paper, we examine both acoustic and subjective approaches for calculating similarity between artists, comparing their performance on a common database of 400 popular artists. Specifically, we evaluate acoustic techniques based on Mel-frequency cepstral coefficients and an intermediate ‘anchor space’ of genre classification, and subjective techniques which use data from The All Music Guide, from a survey, from playlists and personal collections, and from web-text mining. We find the following: (1) Acoustic-based measures can achieve agreement with ground truth data that is at least comparable to the internal agreement between different subjective sources. However, we observe significant differences between superficially similar distribution modeling and comparison techniques. (2) Subjective measures from diverse sources show reasonable agreement, with the measure derived from co-occurrence in personal music collections being the most reliable overall. (3) Our methodology for large-scale cross-site music similarity evaluations is practical and convenient, yielding directly comparable numbers for different approaches. In particular, we hope that our information-retrieval-based approach to scoring similarity measures, our paradigm of sharing common feature representations, and even our particular dataset of features for 400 artists, will be useful to other researchers.

Keywords: Music similarity, acoustic measures, evaluation, ground truth.

1 Introduction

Techniques to automatically determine music similarity have attracted much attention in recent years (Ghias et al., 1995; Foote, 1997; Tzanetakis, 2002; Logan and Salomon, 2001; Aucouturier and Pachet, 2002; Ellis et al., 2002). Similarity is at the core of the classification and ranking algorithms needed to organize and recommend music. Such algorithms will be used in future systems to index vast audio repositories, and thus must rely on automatic analysis.

However, for the researcher or system builder looking to use similarity techniques, it is difficult to decide which is best suited for the task at hand. Few authors perform comparisons across multiple techniques, not least because there is no agreed-upon database for the community. Furthermore, even if a common database were available, it would still be a challenge to establish an associated ground truth, given the intrinsically subjective nature of music similarity.

The work reported in this paper started with a simple question: How do two existing audio-based music-similarity measures compare? This led us in several directions. Firstly, there are multiple aspects of each acoustic measure: the basic features used, the way that feature distributions are modeled, and the methods for calculating similarity between distribution models. In this paper, we investigate the influence of each of these factors.

To do that, however, we needed to be able to calculate a meaningful performance score for each possible variant. This basic question of evaluation brings us back to our earlier question of where to get ground truth (Ellis et al., 2002), and then how to use this ground truth to score a specific acoustic measure. Here, we consider five different sources of ground truth, all collected via the Web one way or another, and look at several different ways to score measures against them. We also compare them with one another in an effort to identify which measure is ‘best’ in the sense of approaching a consensus.
A final aspect of this work touches the question of sharing common evaluation standards, and computing comparable measures across different sites. Although common in fields such as speech recognition, we believe this is one of the first and largest cross-site evaluations in music information retrieval. Our work was conducted in two independent labs (LabROSA at Columbia, and HP Labs in Cambridge), yet by carefully specifying our evaluation metrics, and by sharing evaluation data in the form of derived features (which presents little threat to copyright holders), we were able to make fine distinctions between algorithms running at each site. We see this as a powerful paradigm that we would like to encourage other researchers to use.

This paper is organized as follows. First we review prior work in music similarity. We then describe the various algorithms and data sources used in this paper. Next we describe our database and evaluation methodologies in detail. In Section 6 we discuss our experiments and results. Finally we present conclusions and suggestions for future directions.

2 Prior Work

Prior work in music similarity has focused on one of three areas: symbolic representations, acoustic properties, and subjective or ‘cultural’ information. We describe each of these below, noting in particular their suitability for automatic systems.

Many researchers have studied the music similarity problem by analyzing symbolic representations such as MIDI music data, musical scores, and the like. A related technique is to use pitch-tracking to find a ‘melody contour’ for each piece of music. String matching techniques are then used to compare the transcriptions for each song, e.g. (Ghias et al., 1995). However, techniques based on MIDI or scores are limited to music for which this data exists in electronic form, since only limited success has been achieved for pitch-tracking of arbitrary polyphonic music.

Acoustic approaches analyze the music content directly and thus can be applied to any music for which one has the audio. Blum et al. present an indexing system based on matching features such as pitch, loudness or Mel-frequency cepstral coefficients (MFCCs) (Blum et al., 1999). Foote has designed a music indexing system based on histograms of MFCC features derived from a discriminatively trained vector quantizer (Foote, 1997). Tzanetakis (2002) extracts a variety of features representing the spectrum, rhythm and chord changes and concatenates them into a single vector to determine similarity. Logan and Salomon (2001) and Aucouturier and Pachet (2002) model songs using local clustering of MFCC features, determining similarity by comparing the models. Berenzweig et al. (2003) use a suite of pattern classifiers to map MFCCs into an “anchor space”, in which probability models are fit and compared.

With the growth of the Web, techniques based on publicly-available data have emerged (Cohen and Fan, 2000; Ellis et al., 2002). These use text analysis and collaborative filtering techniques to combine data from many users to determine similarity. Since they are based on human opinion, these approaches capture many cultural and other intangible factors that are unlikely to be obtained from audio. The disadvantage of these techniques, however, is that they are only applicable to music for which a reasonable amount of reliable Web data is available. For new or undiscovered artists, an audio-based technique may be more suitable.
3 Acoustic Similarity

To determine similarity based solely on the audio content of the music, we use our previous techniques which fit a parametric probability model to points in an audio-derived input space (Logan and Salomon, 2001; Berenzweig et al., 2003). We then compute similarity using a measure that compares the models for two artists. The results of each measure are summarized in a similarity matrix, a square matrix where each entry gives the similarity between a particular pair of artists. The leading diagonal is, by definition, 1, which is the largest value.

The techniques studied are characterized by the features, models and distance measures used.

3.1 Feature Spaces

The feature space should compactly represent the audio, distilling musically important information and throwing away irrelevant noise. Although many features have been proposed, in this paper we concentrate on features derived from Mel-frequency cepstral coefficients (MFCCs). These features have been shown to give good performance for a variety of audio classification tasks and are favored by a number of groups working on audio similarity (Blum et al., 1999; Foote, 1997; Tzanetakis, 2002; Logan, 2000; Logan and Salomon, 2001; Aucouturier and Pachet, 2002; Berenzweig et al., 2003).

Mel-cepstra capture the short-time spectral shape, which carries important information about the music’s instrumentation and its timbres, the quality of a singer’s voice, and production effects. However, as a purely local feature calculated over a window of tens of milliseconds, they do not capture information about melody, rhythm or long-term song structure.

We also examine features in an ‘anchor space’ derived from MFCC features. The anchor space technique is inspired by a folk-wisdom approach to music similarity in which people describe artists by statements such as, “Jeff Buckley sounds like Van Morrison meets Led Zeppelin, but more folky”. Here, musically-meaningful categories and well-known anchor artists serve as convenient reference points for describing salient features of the music. This approach is mirrored in the anchor space technique with classifiers trained to recognize musically-meaningful categories. Music is “described” in terms of these categories by running the audio through each classifier, with the outputs forming the activation or likelihood of the category.

For this paper, we used neural networks as anchor model pattern classifiers. Specifically, we trained a 12-class network to discriminate between 12 genres, and two two-class networks to recognize these supplemental classes: Male/Female (gender of the vocalist), and Lo/Hi fidelity. Further details about the choice of anchors and the training technique are available in (Berenzweig et al., 2003). An important point to note is that the input to the classifiers is a large vector consisting of 5 frames of MFCC vectors plus deltas. This gives the network some time-dependent information from which it can learn about rhythm and tempo, at least on the scale of a few hundred milliseconds.
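As an illustration of this input layout, the sketch below stacks five consecutive MFCC frames plus their deltas into a single classifier input vector. The trained anchor networks themselves are described in (Berenzweig et al., 2003) and are not reproduced here; `stack_context` and `anchor_net` are hypothetical names used only for this example.

```python
import numpy as np

def stack_context(mfcc, deltas, context=5):
    """Stack `context` consecutive frames of MFCCs plus their deltas into one
    long input vector per position, as described for the anchor classifiers.

    mfcc, deltas: arrays of shape (n_frames, n_coeffs)
    returns: array of shape (n_frames - context + 1, context * 2 * n_coeffs)
    """
    feats = np.hstack([mfcc, deltas])                # (n_frames, 2 * n_coeffs)
    windows = [feats[i:i + context].ravel()          # concatenate `context` frames
               for i in range(len(feats) - context + 1)]
    return np.vstack(windows)

# Hypothetical usage: `anchor_net` stands in for the trained 12-genre and
# two 2-class networks; its per-frame outputs would form the 14-d anchor space.
# X = stack_context(mfcc, deltas)                  # e.g. (T, 5 * 2 * 20) for 20-d MFCCs
# anchor_features = anchor_net.predict_proba(X)    # (T, 14)
```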
3.2 Modeling and Comparing Distributions

Because feature vectors are computed from short segments of audio, an entire song induces a cloud of points in feature space. The cloud can be thought of as samples from a distribution that characterizes the song, and we can model that distribution using statistical techniques. Extending this idea, we can conceive of a distribution in feature space that characterizes the entire repertoire of each artist.

We use Gaussian Mixture Models (GMMs) to model these distributions, similar to previous work (Logan and Salomon, 2001). Two methods of training the models were used: (1) simple K-means clustering of the data points to form clusters that were then each fit with a Gaussian component, to make a Gaussian mixture model (GMM), and (2) standard Expectation-Maximization (EM) re-estimation of the GMM parameters initialized from the K-means clustering. Although unconventional, the use of K-means to train GMMs without a subsequent stage of EM re-estimation was discovered to be both efficient and useful for song-level similarity measurement in previous work (Logan and Salomon, 2001).

The parameters for these models are the mean, covariance and weight of each cluster. In some experiments, we used a single covariance to describe all the clusters. This is sometimes referred to as a “pooled” covariance in the field of speech recognition; an “independent” covariance model estimates separate covariance matrices for each cluster, allowing each to take on an individual ‘shape’ in feature space, but requiring many more parameters to be estimated from the data.

Having fit models to the data, we calculate similarity by comparing the models. The Kullback-Leibler divergence or relative entropy is the natural way to define distance between probability distributions. However, for GMMs, no closed form for the KL-divergence is known. We explore several alternatives and approximations: the “centroid distance” (Euclidean distance between the overall means), the Earth-Mover’s distance (EMD) (Rubner et al., 1998), which calculates the cost of ‘moving’ probability mass between mixture components to make them equivalent, and the Asymptotic Likelihood Approximation (ALA) to the KL-divergence between GMMs (Vasconcelos, 2001), which segments feature space and assumes only one Gaussian component dominates in each region. Another possibility would be to compute the likelihood of one model given points sampled from the second (Aucouturier and Pachet, 2002), but as this is very computationally expensive for large datasets it was not attempted. Computationally, the centroid distance is the cheapest of our methods and the EMD the most expensive.
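The following sketch illustrates the K-means-only training and the cheapest comparison (centroid distance), assuming scikit-learn's KMeans and diagonal covariances. It is an illustrative reconstruction rather than the authors' original implementation, and the EMD and ALA comparisons are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans_gmm(X, n_mix=16, pooled=True):
    """Fit a GMM by plain K-means with no EM re-estimation: each cluster
    becomes one Gaussian component with a diagonal covariance, either
    shared across clusters ('pooled') or estimated per cluster ('independent')."""
    km = KMeans(n_clusters=n_mix, n_init=4, random_state=0).fit(X)
    labels = km.labels_
    weights = np.bincount(labels, minlength=n_mix) / len(X)
    means = km.cluster_centers_
    if pooled:
        resid = X - means[labels]                       # one shared diagonal covariance
        covars = np.tile(resid.var(axis=0), (n_mix, 1))
    else:
        covars = np.vstack([X[labels == k].var(axis=0)  # per-cluster diagonal covariance
                            for k in range(n_mix)])
    return weights, means, covars

def centroid_distance(model_a, model_b):
    """Cheapest comparison: Euclidean distance between the overall
    (weight-averaged) means of two models returned by fit_kmeans_gmm."""
    (wa, ma, _), (wb, mb, _) = model_a, model_b
    return float(np.linalg.norm(wa @ ma - wb @ mb))
```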
We extracted tance between the overall means), the Earth-Mover’s distance the “similar artists” lists from the All Music Guide for the 400 (EMD) (Rubner et al., 1998) (which calculates the cost of ‘mov- artists in our set, discarding any artists from outside the set, re- ing’ probability mass between mixture components to make sulting in an average of 5.4 similar artists per list (so 1.35% of them equivalent), and the Asymptotic Likelihood Approxima- artist pairs had direct links). 26 of our artists had no neighbors tion (ALA) to the KL-divergence between GMMs (Vasconce- from within the set. los, 2001) (which segments feature space and assumes only As in (Ellis et al., 2002) we convert these descriptions of the one Gaussian component dominates in each region). Another immediate neighborhood of each artist into a similarity matrix possibility would be tocompute the likelihood of one model by computing the path length between each artist in the graph given points sampled from the second (Aucouturier and Pachet, where nodes are artists and there is an edge between two artists 2002), but as this is very computationally expensive for large if the All Music editors consider them similar. Our construction datasets it was not attempted. Computationally, the centroid is symmetric, since links between artists were treated as nondi- distance is the cheapest of our methods and the EMD the most rectional. We call this the Erd¬os measure, after the technique expensive. used among mathematicians to gauge their relationship to Paul Erd¬os . This extends the similarity measure to cover 87.4% of 4 Subjective similarity measures artist pairs. An alternative approach to music similarity is to use sources of 4.3 Playlist Co-occurrence human opinion, for instance by mining the Web. Although these methods cannot easily be used on new music because they re- Another source of human opinion about music similarity is quire observations of humans interacting with the music, they human-authored playlists. We assume that such playlists con- can uncover subtle relationships that may be difficult to detect tain similar music, which, though crude, proves useful. We from the audio signal. Subjective measures are also valuable as gathered around 29,000 playlists from “The Art of the Mix” ground truth against which to evaluate acoustic measures—even (www.artofthemix.org),a website that serves as a repository and asparse ground truth can help validate a more comprehensive community center for playlist hobbyists. acoustic measure. Like the acoustic measures, subjective simi- To convert this data into a similarity matrix, we start with the larity information can also be represented as a similarity matrix, normalized playlist co-occurrence matrix, where entry (i, j) where the values in each row give the relative similarity between represents the joint probability that artist ai and aj occur in the every artist and one target. same playlist. However, this probability is influenced by overall artist popularity which should not affect a similarity measure. 4.1 Survey Therefore, we use a normalized conditional probability matrix The most straightforward way to gather human similarity judg- instead: Entry (i, j) of the normalized conditional probability mentsistoexplicitly ask for it in a survey. We have previously matrix C is the conditional probability p(ai|aj) divided by the constructed a website, musicseer.com, to conduct such a survey prior probability p(ai).Since (Ellis et al., 2002). 
4.4 User Collections

Similar to user-authored playlists, individual music collections are another source of music similarity often available on the Web. Mirroring the ideas that underlie collaborative filtering, we assume that artists co-occurring in someone’s collection have a better-than-average chance of being similar, which increases with the number of co-occurrences observed.

We retrieved user collection data from OpenNap, a popular music-sharing service, although we were careful not to download any audio files. After discarding artists not in our data set, we were left with about 176,000 user-to-artist relations from about 3,200 user collections. To turn this data into a similarity matrix, we use the same normalized conditional probability technique as for playlists, described above. This returned a similarity matrix with nonzero values for 95.6% of the artist pairs.

4.5 Webtext

A rich source of information resides in text documents that describe or discuss music. Using techniques from the Information Retrieval (IR) community, we derive artist similarity measures from documents returned from Web searches (Whitman and Lawrence, 2002). The best-performing similarity matrix from that study, derived from bigram phrases, is used here. This matrix has essentially full coverage.

5 Evaluation Methods

In this section, we describe our evaluation methodology, which relies on some kind of ground truth against which to compare candidate measures; we expect the subjective data described above to be a good source of ground truth since it is derived from human choices. We present several ways to use this data to evaluate our acoustic-based models, although the techniques can be used to evaluate any measure expressed as a similarity matrix. The first technique is a general method by which one can use one similarity matrix as a reference to evaluate any other, whereas the other techniques are specific to our survey data.

5.1 Evaluation against a reference similarity matrix

If we can establish one similarity metric as ground truth, how can we calculate the agreement achieved by other similarity matrices? We use an approach inspired by practice in text information retrieval (Breese et al., 1998): each matrix row is sorted into decreasing similarity and treated as the results of a query for the corresponding target artist. The top N ‘hits’ from the reference matrix define the ground truth, with exponentially-decaying weights so that the top hit has weight 1, the second hit has weight αr, the next αr², etc. (We consider only N hits to minimize issues arising from similarity information sparsity.) The candidate matrix ‘query’ is scored by summing the weights of the hits, scaled by another exponentially-decaying factor so that a ground-truth hit placed at rank r is scaled by αc^r. Thus the “top-N ranking agreement score” si for row i is

    $s_i = \sum_{r=1}^{N} \alpha_r^{\,r}\, \alpha_c^{\,k_r},$    (2)

where kr is the ranking, according to the candidate measure, of the r-th-ranked hit under the ground truth. αc and αr govern how sensitive the metric is to ordering under the candidate and reference measures respectively. With N = 10, αr = 0.5^(1/3) and αc = αr² (the values we used, biased to emphasize when the top few ground-truth hits appear somewhere near the top of the candidate response), the best possible score of 0.999 is achieved when the top 10 ground truth hits are returned in the same order by the candidate matrix. Finally, the overall score for the experimental similarity measure is the average of the normalized row scores, $S = \frac{1}{N}\sum_i s_i / s_{\max}$, where smax is the best possible score. Thus a larger rank agreement score is better, with 1.0 indicating perfect agreement.

One issue with this measure arises from the handling of ties. Because much of the subjective information is based on counts, ranking ties are not uncommon (an extreme case being the 26 ‘disconnected’ artists in the “expert” measure, who must be treated as uniformly dissimilar to all artists). We handle this by calculating an average score over multiple random permutations of the equivalently-ranked entities; because of the interaction with the top-N selection, a closed-form solution has eluded us. The number of repetitions was based on empirical observations of the variation in successive estimates, in order to obtain a stable estimate of the underlying mean.
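The sketch below implements equation (2) and the overall score S under the stated parameter values (N = 10, αr = 0.5^(1/3), αc = αr²). It breaks ties arbitrarily and ignores the self-similarity diagonal, so it simplifies the permutation averaging described above; the names are illustrative.

```python
import numpy as np

ALPHA_R = 0.5 ** (1 / 3)       # reference-rank decay
ALPHA_C = ALPHA_R ** 2         # candidate-rank decay

def topn_row_score(ref_row, cand_row, N=10):
    """Top-N ranking agreement score s_i of equation (2) for one matrix row.
    Ties are broken arbitrarily here rather than averaged over permutations."""
    ref_top = np.argsort(-ref_row)[:N]                                  # top-N ground-truth hits
    cand_rank = {a: r for r, a in enumerate(np.argsort(-cand_row), start=1)}
    return sum(ALPHA_R ** r * ALPHA_C ** cand_rank[hit]
               for r, hit in enumerate(ref_top, start=1))

def topn_agreement(ref, cand, N=10):
    """Overall score S: mean of row scores, normalized by the best possible s_max."""
    s_max = sum((ALPHA_R ** 3) ** r for r in range(1, N + 1))           # roughly 0.999
    scores = [topn_row_score(ref[i], cand[i], N) for i in range(len(ref))]
    return float(np.mean(scores) / s_max)
```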
5.2 Evaluating against survey data

The similarity data collected using our Web-based survey can be argued to be a good independent measure of ground truth artist similarity, since users were explicitly asked to indicate similarity. However, the coverage of the similarity matrix derived from the survey data is only around 1.7%, which makes it suspect for use as a ground truth reference as described in section 5.1 above. Instead, we can compare the individual user judgments from the survey directly to the metric that we wish to evaluate. That is, we ask the similarity metric the same questions that we asked the users and compute an average agreement score.

We used two variants of this idea. The first, “average response rank”, determines the average rank of the artists chosen from the list of 10 presented in the survey, according to the experimental metric. For example, if the experimental metric agrees perfectly with the human subject, then the ranking of the chosen artist will be 1 in every case, while a random ordering of the artists would produce an average ranking of 5.5. In practice, the ideal score of 1.0 is not possible because survey subjects did not always agree about artist similarity; therefore, a ceiling exists corresponding to the single, consistent metric that optimally matches the survey data. For our data, this was estimated to give a score of 2.13.

The second approach is simply to count how many times the similarity measure agrees with the user about the first-place (most similar) artist from the list. This “first place agreement” proportion has the advantage that it can be viewed as the average of a set of independent binomial (binary-valued) trials, meaning that we can use a standard statistical significance test to confirm that certain variations in values for this measure arise from genuine differences in performance, rather than random variations in the measure. Our estimate of the best possible first place agreement with the survey data was 53.5%.
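A minimal sketch of both survey-based metrics follows, assuming each survey response is stored as a (target, chosen, presented) triple of artist indices and that `sim` is the artist-by-artist similarity matrix under evaluation; the data layout and names are hypothetical.

```python
import numpy as np

def survey_agreement(trials, sim):
    """Score a similarity matrix against individual survey judgments.
    Each trial is a (target, chosen, presented) triple, where `presented`
    is the list of 10 artist indices shown to the subject."""
    ranks, first_place = [], []
    for target, chosen, presented in trials:
        order = sorted(presented, key=lambda a: -sim[target, a])  # measure's ranking of the list
        ranks.append(order.index(chosen) + 1)                     # 1 = perfect agreement
        first_place.append(order[0] == chosen)
    # average response rank (lower is better) and first place agreement proportion
    return float(np.mean(ranks)), float(np.mean(first_place))
```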
5.3 Evaluation database

In order to conduct experiments we have compiled a large dataset from audio and Web sources. The dataset covers 400 artists chosen to have the maximal overlap of the user collection (OpenNap) and playlist (Art of the Mix) data. We had previously purchased audio corresponding to the most popular OpenNap artists and had also used these artists to construct the survey data. For each artist, our database contains audio, survey responses, expert opinions from the All Music Guide, playlist information, OpenNap collection data, and webtext data.

The audio data consists of 707 albums and 8772 songs, for an average of 22 songs per artist. Because our audio experiments were conducted at two sites, a level of discipline was required when setting up the data. We shared MFCC features rather than raw audio, both to save bandwidth and to avoid copyright problems. This had the added advantage of ensuring both sites started with the same features when conducting experiments. We believe that this technique of establishing common feature calculation tools, then sharing common feature sets, could be useful for future cross-group collaborations and should be seriously considered by those proposing audio music evaluations, and we would be interested in sharing our derived features. Duplicated tests on a small subset of the data were used to verify the equivalence of our processing and scoring schemes.

The specific track listings for this database, which we refer to as “uspop2002”, are available at http://www.ee.columbia.edu/~dpwe/research/musicsim/.

6 Experiments and Results

A number of experiments were conducted to answer the following questions about acoustic- and subjective-based similarity measures:

1. Is anchor space better for measuring similarity than MFCC space?

2. Which method of modeling and comparing feature distributions is best?

3. Which subjective similarity measure provides the best ground truth, e.g. in terms of agreeing best, on average, with the other measures?

Although it risks circularity to define the best ground truth as the measure which agrees best with the others, we argue that since the various measures are constructed from diverse data sources and methods, any correlation between them should reflect a true underlying consensus among the people who generated the data. A measure consistent with all these sources must reflect the ‘real’ ground truth.

6.1 Acoustic similarity measures

We first compare the acoustic-based similarity measures, examining artist models trained on MFCC and anchor space features. Each model is trained using features calculated from the available audio for that artist. Our MFCC features are 20-dimensional and are computed using 32 ms frames overlapped by 16 ms. The anchor space features have 14 dimensions, where each dimension represents the posterior probability of a pre-learned acoustic class given the observed audio, as described in Section 3.1.

In a preliminary experiment, we performed dimensionality reduction on the MFCC space by taking the first 14 dimensions of a PCA analysis and compared results with the original 20-dimensional MFCC space. There was no appreciable difference in results, confirming that any difference between the anchor-based and MFCC-based models is not due to the difference in dimensionality.
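For reference, a feature extraction with this configuration (20 MFCCs, 32 ms frames, 16 ms hop) might look like the following sketch using the librosa library; the original features were computed with the authors' own tools, so this is only an approximate reconstruction of the setup.

```python
import librosa

def mfcc_features(path, sr=22050, n_mfcc=20):
    """Compute 20-dimensional MFCCs over 32 ms frames with a 16 ms hop."""
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(0.032 * sr)                    # 32 ms analysis window
    hop_length = int(0.016 * sr)               # 16 ms hop, i.e. 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return mfcc.T                              # (n_frames, 20); drop column 0 to exclude c0
```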
Table 1 shows results for similarity measures based on MFCC space, in which we compare the effect of varying the distribution models and the distribution similarity method. For the GMM distribution models, we vary the number of mixtures, use pooled or independent variance models, and train using either plain K-means or K-means followed by EM re-estimation. Distributions are compared using centroid distance, ALA or EMD (as described in section 3.2). We also compare the effect of including or excluding the first cepstral coefficient, c0, which measures the overall intensity of the signal. Table 1 shows the average response rank and first place agreement percentage for each approach.

From this table, we see that the different training techniques for GMMs give comparable performance and that more mixture components help up to a point. Pooling the data to train the covariance matrices is useful, as has been shown in speech recognition, since it allows for more robust covariance parameter estimates. Omitting the first cepstral coefficient gives better results, possibly because similarity is more related to spectral shape than overall signal energy, although this improvement is less pronounced when pooled covariances are used. The best system is one which uses pooled covariances and ignores c0. Models trained with the simpler K-means procedure appear to suffer no loss, and thus are preferred.

A similar table was constructed for anchor-space-based methods, which revealed that full, independent covariance using all 14 dimensions was the best-performing method. Curiously, while the ALA distance measure performed poorly on MFCC-based models, it performed competitively with EMD on anchor space models. We are still investigating the cause; perhaps it is because the assumptions behind the asymptotic likelihood approximation do not hold in MFCC space.

The comparison of the best-performing MFCC and anchor space models is shown in Table 2. We see that both have similar performance under these metrics, despite the prior information encoded in the anchors.

6.2 Comparing ground truth measures

Now we turn to a comparison of the acoustic and subjective measures. We take the best-performing approaches in each feature space class (MFCC and anchor space, limiting both to 16 GMM components for parity) and evaluate them against each of the subjective measures. At the same time, we evaluate each of the subjective measures against each other. The results are presented in Table 3. Rows represent similarity measures being evaluated, and the columns give results treating each of our five subjective similarity metrics as ground truth. Top-N ranking agreement scores are computed as described in section 5.1.

The mean down each column, excluding the self-reference diagonal, is also shown (denoted “mean*”). The column means can be taken as a measure of how well each measure approaches ground truth by agreeing with all the data. By this standard, the survey-derived similarity matrix is best, but its very sparse coverage makes it less useful. The user collection (OpenNap) data has the second-highest “mean*”, including particularly high agreement with the survey metric, as can be seen when the top-N ranking agreements are plotted as an image in figure 1. Thus, we consider the user collections as the best source of a ground truth similarity matrix based on this evidence, with the survey (and hence the first place agreement metric) also providing reliable data. (Interestingly, the collection data does less well agreeing with the survey data when measured by the first place agreement percentage; we infer that it is doing better at matching further down the rankings.)

We mentioned that a key advantage of the first place agreement measure was that it allowed the use of established statistical significance tests. Using a one-tailed test under a binomial assumption, first place agreements differing by more than about 1% are significant at the 5% level for this data (10,884 trials). Thus all the subjective measures are showing significantly different results, although differences among the variants in modeling schemes from tables 1 and 2 are at the edge of significance.
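As a rough check of that 1% figure, the following sketch applies a one-tailed two-proportion test with a normal approximation to the binomial; it illustrates the kind of test described above, not necessarily the authors' exact procedure.

```python
from math import erf, sqrt

def first_place_significance(p1, p2, n=10884):
    """One-tailed two-proportion z-test (normal approximation to the binomial)
    for whether first-place agreement p1 genuinely exceeds p2 over n trials."""
    p_pool = (p1 + p2) / 2.0
    se = sqrt(2.0 * p_pool * (1.0 - p_pool) / n)       # std. error of the difference
    z = (p1 - p2) / se
    p_value = 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))   # one-tailed p-value
    return z, p_value

# e.g. a 1% difference around 22% agreement over 10,884 trials:
# z, p = first_place_significance(0.23, 0.22)   # p comes out just under 0.05
```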
7 Conclusions and Future Work

Returning to the three questions posed in the previous section, based on the results just shown, we conclude:

1. MFCC and anchor space achieve comparable results on the survey data.

2. K-means training is comparable to EM training. Using pooled, diagonal covariance matrices is beneficial for MFCC space, but in general the best modeling scheme and comparison method depend on the feature space being modeled.

3. The measure derived from co-occurrence in personal music collections is the most useful ground truth, although some way of combining the information from different sources warrants investigation, since they provide different information.

The work covered by this paper suggests many directions for future research. Although the acoustic measures achieved respectable performance, there is still much room for improvement. One glaring weakness of our current features is their failure to capture any temporal structure information, although it is interesting to see how far we can get based on this limited representation.

Based on our cross-site experience, we feel that this work points the way to practical music similarity system evaluations that can even be carried out on the same database, and that the serious obstacles to sharing or distributing large music collections can be avoided by transferring only derived features (which should also reduce bandwidth requirements). To this end, we have set up a web site giving full details of our ground truth and evaluation data, http://www.ee.columbia.edu/~dpwe/research/musicsim/. We will also share the MFCC features for the 8772 tracks we used in this work by burning DVDs to send to interested researchers. We are also interested in proposals for other features that it would be valuable to calculate for this data set.

Acknowledgments

We are grateful for support for this work received from NEC Laboratories America, Inc. We also thank the anonymous reviewers for their useful comments.

Much of the content of this paper also appears in our white paper presented at the Workshop on the Evaluation of Music Information Retrieval (MIR) Systems at SIGIR-03, Toronto, August 2003.
References

Aucouturier, J.-J. and Pachet, F. (2002). Music similarity measures: What's the use? In International Symposium on Music Information Retrieval.

Berenzweig, A., Ellis, D. P. W., and Lawrence, S. (2003). Anchor space for classification and similarity measurement of music. In ICME 2003.

Blum, T. L., Keislar, D. F., Wheaton, J. A., and Wold, E. H. (1999). Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information. U.S. Patent 5,918,223.

Breese, J. S., Heckerman, D., and Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Fourteenth Annual Conference on Uncertainty in Artificial Intelligence, pages 43–52.

Cohen, W. W. and Fan, W. (2000). Web-collaborative filtering: recommending music by crawling the web. WWW9 / Computer Networks, 33(1-6):685–698.

Ellis, D. P., Whitman, B., Berenzweig, A., and Lawrence, S. (2002). The quest for ground truth in musical artist similarity.

Foote, J. T. (1997). Content-based retrieval of music and audio. In SPIE, pages 138–147.

Ghias, A., Logan, J., Chamberlin, D., and Smith, B. (1995). Query by humming. In ACM Multimedia.

Logan, B. (2000). Mel frequency cepstral coefficients for music modeling. In International Symposium on Music Information Retrieval.

Logan, B. and Salomon, A. (2001). A music similarity function based on signal analysis. In ICME 2001, Tokyo, Japan.

Rubner, Y., Tomasi, C., and Guibas, L. (1998). A metric for distributions with applications to image databases. In Proc. ICCV.

Tzanetakis, G. (2002). Manipulation, Analysis, and Retrieval Systems for Audio Signals. PhD thesis, Princeton University.

Vasconcelos, N. (2001). On the complexity of probabilistic image retrieval. In ICCV'01, Vancouver.

Whitman, B. and Lawrence, S. (2002). Inferring descriptions and similarity for music from community metadata. In Proceedings of the 2002 International Computer Music Conference, Sweden.

                           Independent                  Pooled
Training   #mix  c0?     ALA           EMD            ALA           Cntrd         EMD
EM           8    y      4.76 / 16%    4.46 / 20%     4.72 / 17%    4.66 / 20%    4.30 / 21%
EM           8    n      -             4.37 / 22%     -             -             4.23 / 22%
EM          16    n      -             4.37 / 22%     -             -             4.21 / 21%
K-means      8    y      -             4.64 / 18%     -             -             4.30 / 22%
K-means      8    n      4.70 / 16%    4.30 / 22%     4.76 / 17%    4.37 / 20%    4.28 / 21%
K-means     16    y      -             4.75 / 18%     -             -             4.25 / 22%
K-means     16    n      4.58 / 18%    4.25 / 22%     4.75 / 17%    4.37 / 20%    4.20 / 22%
K-means     32    n      -             -              4.73 / 17%    4.37 / 20%    4.15 / 23%
K-means     64    n      -             -              4.73 / 17%    4.37 / 20%    4.14 / 23%
Optimal                  2.13 / 53.5%
Random                   5.50 / 11.4%

Table 1: Average response rank / first place agreement percentage for various similarity schemes based on MFCC features. Lower values are better for average response rank, and larger percentages are better for first place agreement.

         MFCC (EMD)       Anchor (ALA)
#mix
  8      4.28 / 21.3%     4.25 / 20.2%
 16      4.20 / 22.2%     4.20 / 19.8%

Table 2: Best-in-class comparison of anchor vs. MFCC-based measures (average response rank / first place agreement percentage). MFCC system uses K-means training, pooled diagonal covariance matrices, and excludes c0. Anchor space system uses EM training, independent full covariance matrices, and includes c0.

              1st place   survey   expert   playlist   collection   webtext
Random          11.8%     0.015    0.020    0.015      0.017        0.012
Anchor          19.8%     0.092    0.095    0.117      0.097        0.041
MFCC            22.2%     0.112    0.099    0.142      0.116        0.046
Survey          53.5%     0.874    0.249    0.204      0.331        0.121
Expert          27.9%     0.267    0.710    0.193      0.182        0.077
Playlist        26.5%     0.222    0.186    0.985      0.226        0.075
Collection      23.2%     0.355    0.179    0.224      0.993        0.083
Webtext         18.5%     0.131    0.082    0.077      0.087        0.997
mean*                     0.197    0.148    0.160      0.173        0.074

Table 3: First place agreement percentages (with survey data) and top-N ranking agreement scores (against each candidate ground truth) for acoustic and subjective similarity measures. “mean*” is the mean of each ground-truth column, excluding the shaded “cheating” diagonal and the “random” row.

[Figure 1: grayscale image (values roughly 0.2 to 0.8); rows (measures evaluated): rnd, ank, mfc, srv, exp, ply, clc, wtx, mn*; columns (ground truth): srv, exp, ply, clc, wtx.]

Figure 1: Top-N ranking agreement scores from Table 3 plotted as a grayscale image.