
A Large-Scale Evaluation of Acoustic and Subjective Music-Similarity Measures

Adam Berenzweig,* Beth Logan,† Daniel P. W. Ellis,* and Brian Whitman†‡

*LabROSA, Columbia University, New York, New York 10027 USA
[email protected], [email protected]
†HP Labs, One Cambridge Center, Cambridge, Massachusetts 02142-1612 USA
[email protected]
‡Music, Mind & Machine Group, MIT Media Lab, Cambridge, Massachusetts 02139-4307 USA
[email protected]

Computer Music Journal, 28:2, pp. 63-76, Summer 2004. © 2004 Massachusetts Institute of Technology.

A valuable goal in the field of Music Information Retrieval (MIR) is to devise an automatic measure of the similarity between two musical recordings based only on an analysis of their audio content. Such a tool—a quantitative measure of similarity—can be used to build classification, retrieval, browsing, and recommendation systems. To develop such a measure, however, presupposes some ground truth, a single underlying similarity that constitutes the desired output of the measure. Music similarity is an elusive concept—wholly subjective, multifaceted, and a moving target—but one that must be pursued in support of applications to provide automatic organization of large music collections.

In this article, we explore music-similarity measures in several ways, motivated by different types of questions. We are first motivated by the desire to improve automatic, acoustic-based similarity measures. Researchers from several groups have recently tried many variations of a few basic ideas, but it remains unclear which are best-suited for a given application. Few authors perform comparisons across multiple techniques, and it is impossible to compare results from different authors, because they do not share the required common ground: a common database and a common evaluation method.

Of course, to improve any measure, we need an evaluation methodology, a scientific way of determining whether one variant is better than another. Otherwise, we are left to intuition, and nothing is gained. In our previous work (Ellis et al. 2002), we have examined several sources of human opinion about music similarity, with the impetus that human opinion must be the final arbiter of music similarity, because it is a subjective concept. However, as expected, there are as many opinions about music similarity as there are people to be asked, and so the second question is how to unify the various sources of opinion into a single ground truth. As we shall see, it turns out that perhaps this is the wrong way to look at things, and so we develop the concept of a "consensus truth" rather than a single ground truth.

Finally, armed with these evaluation techniques, we provide an example of a cross-site evaluation of several acoustic- and subjective-based similarity measures. We address several main research questions. Regarding the acoustic measures, which feature spaces and which modeling and comparison methods are best? Regarding the subjective measures, which provides the best single ground truth, that is, which agrees best on average with the other sources?

In the process of answering these questions, we address some of the logistical difficulties peculiar to our field, such as the legal obstacles to sharing music between research sites. We believe this is one of the first and largest cross-site evaluations in MIR. Our work was conducted in three independent labs (LabROSA at Columbia, MIT, and HP Labs in Cambridge), yet by carefully specifying our evaluation metrics, and by sharing data in the form of derived features (which presents little threat to copyright holders), we were able to make fine distinctions between algorithms running at each site. We see this as a powerful paradigm that we would like to encourage other researchers to use.

Finally, a note about the terminology used in this article. To date, we have worked primarily with popular music, and our vocabulary is thus slanted. Unless noted otherwise, when we refer to "artists" or "musicians" we are referring to the performer, not the composer (which frequently are the same anyway). Also, when we refer to a "song," we mean a single recording of a performance of a piece of music, not an abstract composition, and also not necessarily vocal music.

This article is organized as follows. First we examine the concept of music similarity and review prior work. We then describe the various algorithms and data sources used in this article. Next, we describe our evaluation methodologies in detail and discuss issues with performing a multi-site evaluation. Then we discuss our experiments and results. Finally, we present conclusions and suggestions for future directions.

Music Similarity

The concept of similarity has been studied many times in fields including psychology, information retrieval, and epistemology. Perhaps the most famous similarity researcher is Amos Tversky, a cognitive psychologist who formalized and studied similarity, perception, and categorization. Tversky was quick to note that human judgments of similarity do not satisfy the definition of a Euclidean metric, as discussed below (Tversky 1977). He also studied the context-dependent nature of similarity and noted the interplay between similarity and categorization. Other notable work includes Goldstone, Medin, and Gentner (1991) and the music psychology literature—e.g., Deutsch (1999) and the study of melodic similarity in Cambouropoulos (2001).

In this article, we are essentially trying to pin a single, quantitative measure to a concept that fundamentally resists such definition. Later, we partly justify this approach with the idea of a consensus truth, but in reality we are forced into the situation out of necessity to build useful applications using current techniques. Before proceeding, however, it is worthwhile to examine in more detail some of the problems that beset the idea of a coherent quantitative measure of music similarity.

Individual Variation

That people have individual tastes and preferences is central to the very idea of music and humanity. By the same token, subjective judgments of the similarity between specific pairs of artists are not consistent between listeners and may vary with an individual's mood or evolve over time. In particular, music that holds no interest for a given subject very frequently "sounds the same."

Multiple Dimensions

The question of the similarity between two artists can be answered from multiple perspectives. Music may be similar or distinct in terms of genre, melody, rhythm, tempo, geographical origin, instrumentation, lyric content, historical timeframe—virtually any property that can be used to describe music. Although these dimensions are not independent, it is clear that different emphases will result in different artists being judged similar. The fact that both Paul Anka and Alanis Morissette are from Canada might be of paramount significance to a Canadian cultural nationalist, although another person might not find their music at all similar.

Not a Metric

As discussed in Tversky (1977) and elsewhere, subjective similarity often violates the definition of a metric, in particular the properties of symmetry and the triangle inequality. For example, we might say that the 1990s Los Angeles pop musician Jason Falkner is similar to the Beatles, but we would be less likely to say that the Beatles are similar to Jason Falkner, because the more celebrated band serves as a prototype against which to measure.

The triangle inequality can be violated because of the multifaceted nature of similarity: for example, Michael Jackson is similar to the Jackson Five, his Motown roots, and also to Madonna. Both are huge pop stars of the 1980s, but Madonna and the Jackson Five do not otherwise have much in common.

Variability and Span

Few artists are truly a single "point" in any imaginable stylistic space but undergo changes throughout their careers and may consciously span multiple styles within a single album, or even a single song. Trying to define a single distance between any artist and widely ranging, long-lived musicians such as David Bowie or Sting seems unlikely to yield satisfactory results.

Despite all of these difficulties, techniques to automatically determine music similarity have attracted much attention in recent years (Ghias et al. 1995; Foote 1997; Tzanetakis 2002; Logan and Salomon 2001; Aucouturier and Pachet 2002; Ellis et al. 2002). Similarity lies at the core of the classification and ranking algorithms needed to organize and recommend music. Such algorithms could be used in future systems to index vast audio repositories, and thus they must rely on automatic analysis.

Prior Work

Prior work in music similarity has focused on one of three areas: symbolic representations, acoustic properties, and subjective or "cultural" information. We describe each of these below, noting in particular their suitability for automatic systems.

Many researchers have studied the music-similarity problem by analyzing symbolic representations such as MIDI music data, musical scores, and the like. A related technique is to use pitch tracking to find a melodic contour for each piece of music. String-matching techniques are then used to compare the transcriptions for each song (e.g., Ghias et al. 1995). However, techniques based on MIDI or scores are limited to music for which this data exists in electronic form, since only limited success has been achieved for pitch tracking of arbitrary polyphonic music.

Acoustic approaches analyze the music content directly and thus can be applied to any music for which one has the audio. Blum et al. (1999) present an indexing system based on matching features such as pitch, loudness, or Mel-Frequency Cepstral Coefficients (MFCCs; these are a compact representation of the frequency spectrum, typically computed over short time windows). Foote (1997) has designed a music indexing system based on histograms of MFCC features derived from a discriminatively trained vector quantizer. Tzanetakis (2002) extracts a variety of features representing the spectrum, rhythm, and chord changes and concatenates them into a single vector to determine similarity. Logan and Salomon (2001) and Aucouturier and Pachet (2002) model songs using local clustering of MFCC features, then determine similarity by comparing the models. Berenzweig, Ellis, and Lawrence (2003) use a suite of pattern classifiers to map MFCCs into an anchor space, in which probability models are fit and compared.

With the growth of the World Wide Web, several techniques have emerged that are based on public data derived from subjective human opinion (Cohen and Fan 2000; Ellis et al. 2002). These use text analysis or collaborative filtering techniques to combine data from many users to determine similarity. Because they are based on human opinion, these approaches capture many cultural and other intangible factors that are unlikely to be obtained from audio. The disadvantage of these techniques is that they are only applicable to music for which a reasonable amount of reliable online data is available. For new or undiscovered artists, an audio-based technique may be more suitable.

Acoustic Similarity

In this section, we describe our acoustic-based similarity measures. These are techniques for computing similarity based solely on the audio content, as opposed to subjective measures, which involve human judgments. Our techniques fall into a class of methods commonly used in MIR that can be described as probabilistic feature modeling and comparison. Essentially, the music is transformed from raw audio samples into a time series of feature vectors, each of which captures the essential characteristics of the sound over a short time interval. The time dimension is then ignored, and the series of feature vectors are considered to be random samples drawn from a probability distribution that represents the piece of music. Probability distributions are much easier to handle when represented with a parameterized class of distributions, and so for each piece of music, the parameters of the chosen probability model are fit to the observed samples that have been extracted from the music. Finally, pieces of music can be compared by comparing the parametric models that have been fit to the audio data.

In fact, this process can operate on the artist level, the album level, or the sub-song level, in addition to the song level. In this article, we use distributions that model an entire artist's work to get an artist-similarity metric, rather than a song-similarity metric. The results of each measure are summarized in a similarity matrix, a square matrix wherein each entry gives the similarity between a particular pair of artists. The leading diagonal is, by definition, unity, which is the largest value. Sometimes a distance matrix is more convenient, in which entries measure dissimilarity.

In this article, we examine several acoustic-based similarity measures that use the statistical paradigm described above. The techniques are further described in Logan and Salomon (2001) and Berenzweig, Ellis, and Lawrence (2003) and are characterized by the features, models, and distance measures used. The next few subsections describe the variants in more detail: first, the feature spaces (MFCC space and anchor space), followed by the modeling techniques and the distance measures (Centroid Distance, Earth Mover's Distance, and the Asymptotic Likelihood Approximation).

Feature Spaces

The first step of audio analysis is to transform the raw audio into a feature space, a numerical representation in which dimensions measure different properties of the input. A good feature space compactly represents the audio, distilling important information and throwing away irrelevant noise. Although many features have been proposed for music analysis, such as spectral centroid, bandwidth, loudness, and sharpness (McKinney and Breebaart 2003), in this article we concentrate on features derived from Mel-Frequency Cepstral Coefficients (MFCCs). These features, originally developed for speech-recognition systems, have been shown to give good performance for a variety of audio classification tasks and are favored by a number of groups working on audio similarity (Blum et al. 1999; Foote 1997; Tzanetakis 2002; Logan 2000; Logan and Salomon 2001; Aucouturier and Pachet 2002; Berenzweig, Ellis, and Lawrence 2003).

The Mel-Cepstrum captures the overall spectral shape, which carries important information about the instrumentation and its timbres, the quality of a singer's voice, and production effects. However, as a purely local feature calculated over a window of tens of milliseconds, it does not capture information about melody, rhythm, or long-term song structure.

We also examine features in an anchor space derived from MFCC features. The anchor space technique is inspired by a folk-wisdom approach to music similarity in which people describe artists by statements such as, "Jeff Buckley sounds like Van Morrison meets Led Zeppelin, but more folksy." Here, musically meaningful categories and well-known anchor artists serve as convenient reference points for describing the music. This idea inspires the anchor space technique, wherein classifiers are trained to recognize musically meaningful categories, and music is subsequently "described" in terms of these categories. Once the classifiers are trained, the audio is presented to each classifier, and the outputs, representing the activation or likelihood of the categories, position the music in the new space.
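To make the front end of this paradigm concrete, the following is a minimal sketch (not the authors' original code) of computing a time series of MFCC vectors for one recording, assuming the librosa library. The frame and hop durations mirror those reported later in the article; the filterbank and other coefficient settings are illustrative assumptions.

```python
# Sketch: turn one audio file into an (n_frames, n_mfcc) time series of MFCC vectors.
import librosa

def mfcc_frames(path, n_mfcc=20, frame_sec=0.032, hop_sec=0.016):
    """Return an (n_frames, n_mfcc) array of MFCC vectors for one recording."""
    y, sr = librosa.load(path, sr=None, mono=True)
    n_fft = int(round(frame_sec * sr))
    hop = int(round(hop_sec * sr))
    # librosa returns (n_mfcc, n_frames); transpose so each row is one frame.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop).T
```

Pooling the rows of such arrays across all of an artist's tracks yields the "cloud" of feature vectors that the distribution models below are fit to.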

For this article, we used neural networks as anchor model classifiers, and we used musical genres as the anchor categories, augmented with two supplemental categories. Specifically, we trained a twelve-class network to discriminate between twelve genres: grunge, college rock, country, dance rock, electronica, metal and punk, new wave, rap, R&B/soul, singer/songwriter, soft rock, and trad rock. Additionally, there were two separate neural nets to recognize the supplemental classes: male/female (sex of the vocalist) and low/high fidelity. Further details about the choice of anchors and the training technique are available in Berenzweig, Ellis, and Lawrence (2003). A system that uses anchor space in a music-browsing application is available online at www.playola.org.

An important point to note is that the input to the classifiers is a large vector consisting of five frames of MFCC vectors plus deltas. This gives the network some time-dependent information from which it can learn about rhythm and tempo, at least on the scale of a few hundred milliseconds.

Modeling and Comparing Distributions

Because feature vectors are computed from short segments of audio, an entire song induces a cloud of points in feature space. The cloud can be thought of as samples from a distribution that characterizes the song, and we can model that distribution using statistical techniques. Extending this idea, we can conceive of a distribution in feature space that characterizes the entire repertoire of each artist.

We use Gaussian mixture models (GMMs) to model these distributions, similar to the technique presented in Logan and Salomon (2001). GMMs are a class of probability models that are often used to model distributions that have more than one mode, or "hump." The classic bell-shaped curve of a single Gaussian is clearly not suited to fitting a distribution with several peaks, and therefore a weighted mixture of Gaussians is a more powerful modeling tool.

Training mixture models—in other words, fitting the parameters to the observed data—is not as simple as training a single Gaussian, which only entails computing the mean and variance of the data. In fact, iterative procedures must be used to converge to a solution that maximizes the likelihood of the observed data. Several such procedures are commonly used. K-means clustering assigns data points to the nearest cluster, recomputes the cluster centers, and reiterates. The expectation maximization (EM) algorithm is more powerful than K-means, but similar, except data points are given soft (partial) assignments to the clusters.

In this work, two methods of training the Gaussian mixture models were used: simple K-means clustering of the data points to form clusters that were then each fit with a Gaussian component, and standard expectation-maximization (EM) re-estimation of the GMM parameters initialized from the K-means clustering. Although unconventional, the use of K-means to train GMMs without a subsequent stage of EM re-estimation was discovered to be both efficient and useful for song-level similarity measurement in previous work (Logan and Salomon 2001).

The parameters for these models are the mean, covariance, and weight of each cluster. In some experiments, we used a single covariance to describe all the clusters. This is sometimes referred to as a pooled covariance in the field of speech recognition; in contrast, an independent covariance model estimates separate covariance matrices for each cluster, allowing each to take on an individual shape in feature space, but requiring many more parameters to be estimated from the data.

Having fit models to the data, we calculate similarity by comparing the models. The Kullback-Leibler (KL) divergence, or relative entropy, is the natural way to define distance between probability distributions. However, for GMMs, no closed form for the KL-divergence is known. We explore several alternatives and approximations: the centroid distance (Euclidean distance between the overall means); the earth-mover's distance (EMD; see Rubner, Tomasi, and Guibas 1998), which calculates the cost of moving probability mass between mixture components to make them equivalent; and the asymptotic likelihood approximation (ALA) to the KL-divergence between GMMs (Vasconcelos 2001), which segments feature space and assumes only one Gaussian component dominates in each region. Another possibility would be to compute the likelihood of one model given points sampled from the second (Aucouturier and Pachet 2002), but as this is very computationally expensive for large data sets it was not attempted.

Computationally, the centroid distance is the cheapest of our methods and the EMD the most expensive.

Subjective Similarity Measures

Whereas acoustic-based techniques can be fully automated, subjective music-similarity measures are derived from sources of human opinion, for instance by mining the World Wide Web. Although these methods cannot always be used on new music because they require observations of human interaction with the music, they can uncover subtle relationships that may be difficult to detect from the audio signal (for example, bands that represent the same subculture, are influenced by one another, or even physically look alike).

Subjective measures are also valuable as a "sanity check" against which to evaluate acoustic-based measures; even a sparse ground truth can help validate a more comprehensive acoustic measure. Like the acoustic measures, subjective similarity information can also be represented as a similarity matrix, where the values in each row give the relative similarity between every artist and one target. This section describes several sources of human opinion about music similarity and how to convert them into a useful similarity measure.

Survey

The most straightforward way to gather human similarity judgments is to explicitly ask for them in a survey. We have previously constructed a Web site, musicseer.com, to conduct such a survey (Ellis et al. 2002). We defined a set of some 400 popular artists (described in a subsequent section), then presented subjects with a list of 10 artists (a_1, ..., a_10) and a single target artist a_t, and asked "Which of these artists is most similar to the target artist?" We interpret each response to mean that the chosen artist a_c is more similar to the target artist a_t than any of the other artists in the list, only if those artists are known to the subject, which we can infer by seeing if the subject has ever selected the artists in any context.

Ideally, the survey would provide enough data to derive a full similarity matrix, for example by counting how many times users selected artist a_i as being most similar to artist a_j. However, even with the 22,000 responses collected, the coverage of our modest artist set is relatively sparse: only around 7.5% of all our artist pairs were directly compared, and only 1.7% of artist pairs were ever chosen as most similar. We constructed this sparse similarity matrix by populating each row with the number of times a given artist was chosen as most similar to a target as a proportion of the trials in which it could have been chosen. This heuristic worked quite well for our data.

Expert Opinion

Rather than surveying the masses, we can ask a few experts. Several music-related online services contain music taxonomies and articles containing similarity data. The All Music Guide (www.allmusic.com) is such a service, in which professional editors write brief descriptions of a large number of popular musical artists, often including a list of similar artists. We extracted the "similar artists" lists from the All Music Guide for the 400 artists in our set, discarding any artists from outside the set, resulting in an average of 5.4 similar artists per list (so 1.35% of artist pairs had direct links). Twenty-six of our artists had no neighbors from within the set.

As in Ellis et al. (2002), we convert these descriptions of the immediate neighborhood of each artist into a similarity matrix by computing the path length between each artist in the graph where nodes are artists and there is an edge between two artists if the All Music Guide editors consider them similar. Our construction is symmetric, because links between artists were treated as non-directional. We call this the Erdős measure, after the technique used among mathematicians to gauge their relationship to Paul Erdős. This extends the similarity measure to cover 87.4% of artist pairs.
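The row-population heuristic just described for the survey data can be sketched as follows. This is our own reconstruction for illustration, assuming each response is stored as a (target, chosen, shown-list) triple; the authors' exact bookkeeping may differ.

```python
import numpy as np

def survey_similarity(responses, artists):
    """responses: iterable of (target, chosen, shown) triples; artists: list of artist names."""
    idx = {a: i for i, a in enumerate(artists)}
    chosen_counts = np.zeros((len(artists), len(artists)))
    shown_counts = np.zeros((len(artists), len(artists)))
    for target, chosen, shown in responses:
        t = idx[target]
        for a in shown:
            shown_counts[t, idx[a]] += 1      # trials in which a could have been chosen for target t
        chosen_counts[t, idx[chosen]] += 1    # times a was actually chosen as most similar to t
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.where(shown_counts > 0, chosen_counts / shown_counts, 0.0)
    return sim  # row = target artist; entries are choice proportions (sparse in practice)
```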

Playlist Co-Occurrence

Another source of human opinion about music similarity is human-authored playlists, such as those selected for a mixed tape or compilation CD. Our assumption is that songs co-occurring in the same playlist will, on average, be more similar than two randomly chosen songs. This assumption is suspect for many types of playlists, but as we will see it proves useful. The Web is a rich source for such playlists. In particular, we gathered around 29,000 playlists from The Art of the Mix (www.artofthemix.org), a Web site that serves as a repository and community center for playlist hobbyists.

To convert this data into a similarity matrix, we begin with the normalized playlist co-occurrence matrix, where entry (i, j) represents the joint probability that artists a_i and a_j occur in the same playlist. However, this probability is influenced by overall artist popularity, which should not affect a similarity measure. Therefore, we use a normalized conditional probability matrix instead: entry (i, j) of the normalized conditional probability matrix C is the conditional probability p(a_i | a_j) divided by the prior probability p(a_i). Because

c_{ij} = \frac{p(a_i \mid a_j)}{p(a_i)} = \frac{p(a_i, a_j)}{p(a_i)\,p(a_j)}

this is an appropriate normalization of the joint probability. Note that the expected logarithm of this measure is the mutual information I(a_i; a_j) between artists a_i and a_j.

Using the playlists gathered from Art of the Mix, we constructed a similarity matrix with 51.4% coverage for our artist set (i.e., more than half of the matrix cells were nonzero).

User Collections

Similar to user-authored playlists, individual music collections are another source of music similarity often available on the Internet. Mirroring the ideas underlying collaborative filtering, we assume that artists co-occurring in someone's collection have a better-than-average chance of being similar, which increases with the number of co-occurrences observed.

We retrieved user collection data from OpenNap, a popular music-sharing service, although we were careful not to download any audio files. After discarding artists not in our data set, we were left with about 176,000 user-to-artist relations from about 3,200 user collections. To turn this data into a similarity matrix, we used the same normalized conditional probability technique as described above for playlists. This returned a similarity matrix with nonzero values for 95.6% of the artist pairs.
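A minimal sketch of this normalized conditional probability construction follows; it is our reconstruction from the definitions above (not the authors' code) and applies equally to playlists or user collections treated as sets of artists.

```python
import numpy as np

def cooccurrence_similarity(groups, artists):
    """groups: iterable of playlists/collections (each a list of artist names)."""
    idx = {a: i for i, a in enumerate(artists)}
    n = len(artists)
    joint = np.zeros((n, n))   # groups containing both a_i and a_j
    single = np.zeros(n)       # groups containing a_i
    for g in groups:
        members = sorted({idx[a] for a in g if a in idx})
        for i in members:
            single[i] += 1
            for j in members:
                if j != i:
                    joint[i, j] += 1
    total = float(len(groups))
    p_joint, p_single = joint / total, single / total
    with np.errstate(divide="ignore", invalid="ignore"):
        c = p_joint / np.outer(p_single, p_single)   # c_ij = p(a_i, a_j) / (p(a_i) p(a_j))
    return np.nan_to_num(c)
```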

"Webtext"

A rich source of information resides in text documents that describe or discuss music. Using techniques from the IR community, we derived artist-similarity measures from documents returned from Web searches (Whitman and Lawrence 2002). The best-performing similarity matrix from that study, which measures document similarity based on frequency bigram phrases, is used here. This matrix has essentially full coverage.

Evaluation Methods

In this section, we describe our evaluation methodology. First, a caveat: any evaluation system inherently assumes some idea of ground truth against which the candidate is evaluated. Although similarity is inherently subjective, and thus there is no authoritative ground truth, we can tentatively treat the subjective data described above as if it were ground truth. This approach is partly justified because the data are derived from human choices, but more importantly, we later leverage the diversity of sources to examine how well the sources agree with each other.

We present several techniques for evaluating a similarity measure. The first technique is a general method for evaluating one similarity matrix given another as a reference ground truth. Then we present two techniques specifically designed for using the survey data as ground truth.

Evaluation Against a Reference Similarity Matrix

If we are given one similarity metric as ground truth, how can we calculate the agreement achieved by other similarity matrices? We use an approach inspired by practice in text information retrieval (Breese, Heckerman, and Kadie 1998). Each matrix row is sorted into decreasing similarity and treated as the results of a query for the corresponding target artist. The top N "hits" from the reference matrix define the ground truth, with exponentially decaying weights so that the top hit has weight 1, the second hit has weight α_r, the next (α_r)^2, etc. (We consider only N hits to minimize issues arising from similarity information sparsity.) The candidate matrix "query" is scored by summing the weights of the hits, scaled by another exponentially decaying factor, so that a ground-truth hit placed at rank r is scaled by (α_c)^r. Formally, we define the Top-N Ranking Agreement Score for row i as

s_i = \sum_{r=1}^{N} (\alpha_r)^r (\alpha_c)^{k_r}

where k_r is the ranking according to the candidate measure of the r-th-ranked hit under the ground truth. The parameters α_c and α_r govern how sensitive the metric is to ordering under the candidate and reference measures, respectively. We used the values α_r = 0.5 and α_c = (α_r)^{1/3} to emphasize the position of the top few ground-truth hits. With N = 10 and these values of α_r and α_c, the optimal score—achieved when the top ten ground-truth hits are the same as, and in the same order as, the top ten from the candidate matrix—is 0.999. Finally, the overall score for the experimental similarity measure is the average of the normalized row scores,

S = \frac{1}{N} \sum_i s_i / s_{\max}

where s_max is the optimal score. Thus a larger rank agreement score is better, with 1.0 indicating perfect agreement.

One issue with this measure arises from the handling of ties. Because much of the subjective information is based on counts, ranking ties are not uncommon (an extreme case being the 26 "disconnected" artists in the expert measure, who must be treated as uniformly dissimilar to all artists). We handle this by calculating an average score over multiple random permutations of the equivalently-ranked entities; owing to the interaction with the top-N selection, a closed-form solution has eluded us. The number of repetitions was based on empirical observations of the variation in successive estimates, to obtain a stable estimate of the underlying mean.
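The following is an illustrative reconstruction of this score directly from the definitions above; it is not the authors' code, and the random-permutation handling of ties is omitted (ties are broken by array order).

```python
import numpy as np

def topn_agreement(candidate, reference, N=10, alpha_r=0.5, alpha_c=0.5 ** (1.0 / 3.0)):
    """candidate, reference: square similarity matrices over the same artist set."""
    n = candidate.shape[0]
    # normalizer: score obtained when the candidate ranks the reference hits perfectly (k_r = r)
    s_max = sum(alpha_r ** r * alpha_c ** r for r in range(1, N + 1))
    row_scores = []
    for i in range(n):
        ref_hits = [a for a in np.argsort(-reference[i]) if a != i][:N]
        cand_order = [a for a in np.argsort(-candidate[i]) if a != i]
        cand_rank = {a: r for r, a in enumerate(cand_order, start=1)}
        s_i = sum(alpha_r ** r * alpha_c ** cand_rank[hit]
                  for r, hit in enumerate(ref_hits, start=1))
        row_scores.append(s_i / s_max)
    return float(np.mean(row_scores))
```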

Evaluation Against Survey Data

The similarity data collected using our Web-based survey can be argued to be a good independent measure of ground-truth artist similarity, because users were explicitly asked to indicate similarity. However, the coverage of the similarity matrix derived from the survey data is only about 1.7%, which makes it undesirable to use as a ground-truth reference as described in the previous section. Instead, we can compare the individual user judgments from the survey directly to the metric we wish to evaluate. That is, we ask the similarity metric the same questions that we asked the users and compute an average agreement score.

We used two variants of this idea. The first, average response rank, determines the average rank of the artists chosen from the list of ten presented in the survey according to the experimental metric. For example, if the experimental metric agrees perfectly with the human subject, then the ranking of the chosen artist will be first in every case, whereas a random ordering of the artists would produce an average ranking of 5.5. In practice, the ideal score of 1.0 is not possible, because survey subjects did not always agree on artist similarity; therefore, a ceiling exists corresponding to the single, consistent metric that optimally matches the survey data. For our data, this was estimated to give a score of 2.13.

The second approach is simply to count how many times the similarity measure agrees with the user about the first-place (most similar) artist from the list. This proportion, called first-place agreement, has the advantage that it can be viewed as the average of a set of independent binomial (binary-valued) trials, meaning that we can use a standard statistical significance test to confirm that certain variations in values for this measure arise from genuine differences in performance, rather than random variations in the measure. Our estimate of the best possible first-place agreement with the survey data was 53.5%.
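Both survey-based scores can be sketched as follows; this is our reconstruction of the procedure described, assuming a candidate similarity matrix `sim` and an artist-to-index mapping `idx` (names chosen here for illustration only).

```python
import numpy as np

def survey_scores(sim, idx, trials):
    """trials: iterable of (target, chosen, shown) survey responses."""
    ranks, first_place = [], []
    for target, chosen, shown in trials:
        t = idx[target]
        # rank the presented artists by candidate similarity to the target (1 = most similar)
        order = sorted(shown, key=lambda a: -sim[t, idx[a]])
        ranks.append(order.index(chosen) + 1)
        first_place.append(order[0] == chosen)
    return float(np.mean(ranks)), float(np.mean(first_place))  # average response rank, first-place agreement
```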

Multi-Site Evaluation Procedures

To compare results between research sites, it is necessary to have a common database and a common evaluation method. Using the evaluation techniques described above, we had to share data between centers. However, we encounter legal restrictions when attempting to share copyrighted music. Although efforts are underway to procure a database of music that is free for use in the music information retrieval community (Downie 2002), negotiations can be slow, and for certain types of research it is not necessary to have the full audio. For our purposes, it suffices to share the MFCC features derived from the audio, since all of the acoustic-based similarity measures we use begin with computing the MFCCs. Similarly, other researchers wanting to experiment with different techniques need only share the front-end features.

Because our audio experiments were conducted at two sites, a level of discipline was required when setting up the data. We shared MFCC features rather than raw audio, both to save bandwidth and to avoid copyright problems, as mentioned. This had the added advantage of ensuring both sites started with the same features when conducting experiments. Duplicated tests on a small subset of the data were used to verify the equivalence of our processing and scoring schemes.

We believe that this technique of establishing common feature-calculation tools, then sharing common feature sets, could be useful for future cross-group collaborations and should be seriously considered by those proposing evaluations, and we would be interested in sharing our derived features.

We have compiled a relatively large data set from audio and online sources. The data set covers 400 artists chosen to have the maximal overlap of the user collection (OpenNap) and playlist (The Art of the Mix) data. We had previously purchased audio corresponding to the most popular OpenNap artists and had also used these artists to construct the survey data. For each artist, our database contains audio, survey responses, expert opinions from All Music Guide, playlist information, OpenNap collection data, and Webtext data.

The audio data consists of 707 albums and 8,772 songs, for an average of 22 songs per artist. The specific track listings for this database, which we refer to as "uspop2002," are available online at www.ee.columbia.edu/~dpwe/research/musicsim.

Experiments and Results

A number of experiments were conducted to answer the following questions about acoustic- and subjective-based similarity measures. First, is anchor space better for measuring similarity than MFCC space? Second, which method of modeling and comparing feature distributions is best? Third, which subjective similarity measure provides the best ground truth, e.g., in terms of agreeing best, on average, with the other measures?

Although it risks circularity to define the best ground truth as the measure that agrees best with the others, we argue that because the various measures are constructed from diverse data sources and methods, any correlation between them should reflect a true underlying consensus among the people who generated the data. A measure consistent with all these sources must approach a "consensus truth," even if no absolute ground truth actually exists.

Acoustic Similarity Measures

We first compare the acoustic-based similarity measures, examining artist models trained on MFCC and anchor space features. Each model is trained using features calculated from the available audio for that artist. Our MFCC features are 20-dimensional and are computed using 32-msec frames overlapped by 16 msec. The anchor space features have 14 dimensions, where each dimension represents the posterior probability of a pre-learned acoustic class given the observed audio, as described in the section "Acoustic Similarity" above.

In a preliminary experiment, we performed dimensionality reduction on the MFCC space by taking the first 14 dimensions of a PCA analysis and compared results with the original 20-dimensional MFCC space. There was no appreciable difference in results, confirming that any difference between the anchor-based and MFCC-based models is not owing to the difference in dimensionality.

Table 1 shows results for similarity measures based on MFCC space, in which we compare the effect of varying the distribution models and the distribution similarity method. For the GMM distribution models, we vary the number of mixtures, use pooled or independent variance models, and train using either plain K-means, or K-means followed by EM re-estimation. Distributions are compared using centroid distance, ALA, or EMD (as described in the section "Modeling and Comparing Distributions"). We also compare the effect of including or excluding the first cepstral coefficient, c0, which measures the overall intensity of a signal.

Table 1 shows the average response rank and first-place agreement percentage for each approach. From this table, we see that the different training techniques for GMMs give comparable performance and that more mixture components help up to a point. Pooling the data to train the covariance matrices is useful, as has been shown in speech recognition, because it allows for more robust covariance parameter estimates. Omitting the first cepstral coefficient gives better results, possibly because similarity is more related to spectral shape than overall signal energy, although this improvement is less pronounced when pooled covariances are used. The best system is one that uses pooled covariances and ignores c0. Models trained with the simpler K-means procedure appear to perform as well as GMMs and thus are preferred.

A similar table was constructed for anchor-space-based methods, which revealed that full, independent covariance using all 14 dimensions was the best-performing method. Curiously, while the ALA distance measure performed poorly on MFCC-based models, it performed competitively with EMD on anchor-space models. We are still investigating the cause; perhaps it is because the assumptions behind the asymptotic likelihood approximation do not hold in MFCC space.
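For readers who want to experiment, the anchor-space posterior features compared here can be roughly sketched as follows. This is only a stand-in using scikit-learn's MLPClassifier: the authors used neural networks whose architecture is not reproduced, and the two supplemental classifiers and the delta features are omitted.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_context(mfcc, width=5):
    """Concatenate `width` consecutive MFCC frames into one classifier input vector."""
    n, d = mfcc.shape
    return np.hstack([mfcc[i:n - width + 1 + i] for i in range(width)])

# genre_clf: a classifier trained elsewhere on labeled, stacked frames, e.g.
#   genre_clf = MLPClassifier(hidden_layer_sizes=(50,)).fit(X_train, y_train)
def to_anchor_space(mfcc, genre_clf, width=5):
    X = stack_context(mfcc, width)
    return genre_clf.predict_proba(X)   # (n_frames, n_genres) posterior feature vectors
```

The resulting per-frame posteriors are then modeled with GMMs exactly as the MFCC frames are.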

Table 1. Average response rank and first-place agreement percentages for various similarity schemes based on MFCC features. Lower values are better for average response rank, and larger percentages are better for first-place agreement.

                          Independent                   Pooled
Training  #mix  c0?   ALA           EMD           ALA           Cntrd         EMD
EM          8   y     4.76 / 16%    4.46 / 20%    4.72 / 17%    4.66 / 20%    4.30 / 21%
EM          8   n     –             4.37 / 22%    –             –             4.23 / 22%
EM         16   n     –             4.37 / 22%    –             –             4.21 / 21%
K-means     8   y     –             4.64 / 18%    –             –             4.30 / 22%
K-means     8   n     4.70 / 16%    4.30 / 22%    4.76 / 17%    4.37 / 20%    4.28 / 21%
K-means    16   y     –             4.75 / 18%    –             –             4.25 / 22%
K-means    16   n     4.58 / 18%    4.25 / 22%    4.75 / 17%    4.37 / 20%    4.20 / 22%
K-means    32   n     –             –             4.73 / 17%    4.37 / 20%    4.15 / 23%
K-means    64   n     –             –             4.73 / 17%    4.37 / 20%    4.14 / 23%
Optimal               2.13 / 53.5%
Random                5.50 / 11.4%

The comparison of the best-performing MFCC and anchor-space models is shown in Table 2. We see that both have similar performance under these metrics, despite the prior information encoded in the anchors.

Table 2. Best-in-class comparison of anchor versus MFCC-based measures (average response rank / first-place agreement percentage)

#mix    MFCC (EMD)       Anchor (ALA)
8       4.28 / 21.3%     4.25 / 20.2%
16      4.20 / 22.2%     4.20 / 19.8%

The MFCC system uses K-means training, pooled diagonal covariance matrices, and excludes c0. The anchor-space system uses EM training, independent full covariance matrices, and includes c0.

Comparing Ground Truth Measures

Now we turn to a comparison of the acoustic and subjective measures. We take the best-performing approaches in each feature-space class (MFCC and anchor space, limiting both to 16 GMM components for parity) and evaluate them against each of the subjective measures. At the same time, we evaluate each of the subjective measures against each other. The results are presented in Table 3. Rows represent similarity measures being evaluated, and the columns give results treating each of our five subjective similarity metrics as ground truth. Top-N ranking agreement scores are computed as described in the section "Evaluation Against a Reference Similarity Matrix."

The means down each column, excluding the self-reference diagonal, are also shown (denoted "mean*"). The column means can be taken as a measure of how well each measure approaches ground truth by agreeing with all the data. By this standard, the survey-derived similarity matrix is best, but its very sparse coverage makes it less useful. The user collection ("opennap") data has the second-highest mean*, including particularly high agreement with the survey metric, as can be seen when the Top-N ranking agreements are plotted as an image in Figure 1. Thus, we consider the user collections as the best single source of a ground truth based on this evidence, with the survey (and hence the first-place agreement metric) providing reliable data also. (Interestingly, the collection data does less well agreeing with the survey data when measured by the first-place agreement percentage; we infer that it is doing better at matching further down the rankings.)

Table 3. First-place agreement percentages (with survey data) and Top-N ranking agreement scores (against each candidate's ground truth) for acoustic and subjective similarity measures

             1st place   Survey   Expert   Playlist   Collection   Webtext
Random       11.8%       0.015    0.020    0.015      0.017        0.012
Anchor       19.8%       0.092    0.095    0.117      0.097        0.041
MFCC         22.2%       0.112    0.099    0.142      0.116        0.046
Survey       53.5%       0.874    0.249    0.204      0.331        0.121
Expert       27.9%       0.267    0.710    0.193      0.182        0.077
Playlist     26.5%       0.222    0.186    0.985      0.226        0.075
Collection   23.2%       0.355    0.179    0.224      0.993        0.083
Webtext      18.5%       0.131    0.082    0.077      0.087        0.997
mean*                    0.197    0.148    0.160      0.173        0.074

The mean* row gives the mean of each ground-truth column, excluding the "cheating" self-reference diagonal and the "random" row.

[Figure 1. Top-N ranking agreement scores from Table 3 plotted as a grayscale image. Row labels: rnd, ank, mfc, srv, exp, ply, clc, wtx, mn*; column labels: srv, exp, ply, clc, wtx.]

The natural asymmetry of Table 3 exists because α_c ≠ α_r, and the diagonal is less than one because of the randomized tiebreakers necessary owing to the sparsity of the sources. If we calculate a symmetric agreement score for each pair of sources by averaging the two asymmetric numbers, some interesting results emerge. The best-agreeing pair is the survey and the collection data, which is somewhat surprising given the very different nature of the data sources: explicit user judgments in the survey and co-occurrence of artists in user collections. Less surprising is the agreement between the survey and expert sources, which both come from explicit judgments by humans, and between the collection and the playlist sources, which both are derived from co-occurrence data.

We mentioned that a key advantage of the first-place agreement measure was that it allowed the use of established statistical significance tests. Using a one-tailed test under a binomial assumption, first-place agreements differing by more than about 1% are significant at the 5% level for this data (10,884 trials). Thus, all the subjective measures show significantly different results, although differences among the variants in modeling schemes from Tables 1 and 2 are at the edge of significance.
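One reasonable reading of such a test is a one-tailed two-proportion comparison under a normal approximation to the binomial; the sketch below follows that reading and is not necessarily the authors' exact procedure.

```python
from math import sqrt
from scipy.stats import norm

def first_place_difference_pvalue(p1, p2, n_trials):
    """One-tailed p-value for the hypothesis that measure 1 beats measure 2."""
    p_pool = 0.5 * (p1 + p2)
    se = sqrt(p_pool * (1.0 - p_pool) * 2.0 / n_trials)
    z = (p1 - p2) / se
    return 1.0 - norm.cdf(z)

# With 10,884 trials, a difference of about one percentage point at these
# agreement levels falls just inside the 5% significance level under this approximation.
print(first_place_difference_pvalue(0.232, 0.222, 10884))  # roughly 0.04
```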

Conclusions and Future Work

Returning to the three questions posed in the previous section, based on the results shown above, we draw several conclusions. First, MFCC and anchor space achieve comparable results on the survey data. Second, K-means training is comparable to EM training. Using pooled, diagonal covariance matrices is beneficial for MFCC space, but in general the best modeling scheme and comparison method depend on the feature space being modeled. Third, the measure derived from co-occurrence in personal music collections is the most useful ground truth, although some way of combining the information from different sources warrants investigation, since they are providing different information.

The work covered by this article suggests many directions for future research. Although the acoustic measures achieved respectable performance, there is still much room for improvement. One glaring weakness of our current features is their failure to capture any temporal structure information, although it is interesting to see what can be achieved based on this limited representation.

Based on our cross-site experience, we feel that this work points the way to practical music-similarity system evaluations that can even be carried out on the same database, and that the serious obstacles to sharing or distributing large music collections can be avoided by transferring only derived features (which should also reduce bandwidth requirements). To this end, we have set up a website giving full details of our ground truth and evaluation data (www.ee.columbia.edu/~dpwe/research/musicsim). We will also share the MFCC features for the 8,772 tracks we used in this work by burning DVDs to send to interested researchers. We are also interested in proposals for other features that it would be valuable to calculate for this data set.

Acknowledgments

We are grateful for support for this work received from NEC Laboratories America, Inc. We also thank the anonymous reviewers for their useful comments.

Much of the content of this article also appears in our white paper presented at the Workshop on the Evaluation of Music Information Retrieval (MIR) Systems at SIGIR-03, Ottawa, August 2003.

References

Aucouturier, J.-J., and F. Pachet. 2002. "Music-Similarity Measures: What's the Use?" Proceedings of the Third International Symposium on Music Information Retrieval. Paris: IRCAM, pp. 157-163.

Berenzweig, A., D. P. W. Ellis, and S. Lawrence. 2003. "Anchor Space for Classification and Similarity Measurement of Music." In Proceedings of the 2003 IEEE International Conference on Multimedia and Expo. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers.

Blum, T. L., et al. 1999. "Method and Article of Manufacture for Content-Based Analysis, Storage, Retrieval, and Segmentation of Audio Information." U.S. Patent No. 5,918,223.

Breese, J. S., D. Heckerman, and C. Kadie. 1998. "Empirical Analysis of Predictive Algorithms for Collaborative Filtering." Fourteenth Annual Conference on Uncertainty in Artificial Intelligence. San Francisco, California: Morgan Kaufmann, pp. 43-52.

Cambouropoulos, E. 2001. "Melodic Cue Abstraction, Similarity, and Category Formation: A Computational Approach." Music Perception 18(3):347-370.

Cohen, W. W., and W. Fan. 2000. "Web-Collaborative Filtering: Recommending Music by Crawling the Web." WWW9/Computer Networks 33(1-6):685-698.

Deutsch, D., ed. 1999. The Psychology of Music, 2nd ed. New York: Academic Press.

Downie, J. S., ed. 2002. The MIR/MDL Evaluation Project White Paper Collection, 2nd ed. Champaign, Illinois: GSLIS.

Ellis, D. P. W., et al. 2002. "The Quest for Ground Truth in Musical Artist Similarity." Proceedings of the Third International Symposium on Music Information Retrieval. Paris: IRCAM, pp. 170-177.

Foote, J. 1997. "Content-Based Retrieval of Music and Audio." In C. J. Kuo, S. Chang, and V. N. Gudivada, eds. SPIE Vol. 3229: Multimedia Storage and Archiving Systems II. Bellingham, Washington: SPIE Press, pp. 138-147.

Ghias, A., et al. 1995. "Query by Humming." ACM Multimedia 95:231-236.

Goldstone, R. L., D. Medin, and D. Gentner. 1991. "Relational Similarity and the Nonindependence of Features in Similarity Judgments." Cognitive Psychology 23:222-262.

Logan, B. 2000. "Mel-Frequency Cepstral Coefficients for Music Modeling." Proceedings of the First International Symposium on Music Information Retrieval 2000. Amherst: University of Massachusetts at Amherst, n.p.

Logan, B., and A. Salomon. 2001. "A Music-Similarity Function Based on Signal Analysis." Paper presented at the 2001 International Conference on Multimedia and Expo, Tokyo, Japan, 25 August.

McKinney, M. F., and J. Breebaart. 2003. "Features for Audio and Music Classification." Proceedings of the Third International Symposium on Music Information Retrieval. Paris: IRCAM, pp. 151-158.

Rubner, Y., C. Tomasi, and L. Guibas. 1998. "A Metric for Distributions with Applications to Image Databases." Proceedings of the Sixth International Conference on Computer Vision. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, p. 59.

Tversky, A. 1977. "Features of Similarity." Psychological Review 84(4):327-352.

Tzanetakis, G. 2002. "Manipulation, Analysis, and Retrieval Systems for Audio Signals." Ph.D. Thesis, Princeton University.

Vasconcelos, N. 2001. "On the Complexity of Probabilistic Image Retrieval." Proceedings of the Eighth International Conference on Computer Vision, vol. 2. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, pp. 400-407.

Whitman, B., and S. Lawrence. 2002. "Inferring Descriptions and Similarity for Music from Community Metadata." Proceedings of the 2002 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 591-598.
