
D3.4.1 Similarity Report

Abstract: The goal of Work Package 3 is to take the features and metadata provided by Work Package 2 and provide the technology needed for the intelligent structuring, presentation, and use of large music collections. This deliverable is about audio and web-based similarity measures, novelty detection (which proves to be a useful tool to combine with similarity), and first outcomes of applying mid-level WP2 descriptors from D2.1.1 and preliminary versions of D2.1.2. We present improvements of the similarity measures presented in D3.1.1 and D3.2.1. The outcomes will serve as the foundation for D3.5.1, D3.6.2, and the prototypes, in particular for the recommender and the organizer.

Version 1.0
Date: May 2005
Editor: E. Pampalk
Contributors: A. Flexer, E. Pampalk, M. Schedl, J. Bello, and C. Harte
Reviewers: G. Widmer and P. Herrera

Contents

1 Introduction

2 Audio-Based Similarity
  2.1 Data
  2.2 Hidden Markov Models for Spectral Similarity
      2.2.1 Methods
      2.2.2 Results
      2.2.3 Comparing log-likelihoods directly
      2.2.4 Genre Classification
      2.2.5 Discussion
  2.3 Spectral Similarity Combined with Complementary Information
      2.3.1 Spectral Similarity
      2.3.2 Fluctuation Patterns
      2.3.3 Combination
      2.3.4 Genre Classification
      2.3.5 Conclusions
  2.4 Summary & Recommendations

3 Web-Based Similarity
  3.1 Web Mining by Co-occurrence Analysis
  3.2 Experiments and Evaluation
      3.2.1 Intra-/Intergroup-Similarities
      3.2.2 Classification with k-Nearest Neighbors
  3.3 Conclusions & Recommendations

4 Novelty Detection and Similarity
  4.1 Data
  4.2 Methods
      4.2.1 Music Similarity
      4.2.2 Algorithms for novelty detection
  4.3 Results
  4.4 Discussion

5 Chroma-Complexity Similarity
  5.1 Chromagram Calculation
  5.2 Chromagram Tuning
  5.3 Chromagram Processing
  5.4 Chroma Complexity
      5.4.1 ChromaVisu Tool
  5.5 Results
      5.5.1 (Dave Brubeck Quartet)
      5.5.2 Classic Orchestra
      5.5.3 Classic Piano
      5.5.4
      5.5.5 Hip Hop
      5.5.6 Pop
  5.6 Discussion & Conclusions

6 Conclusions and Future Work

1. Introduction

Overall goal of Workpackage 3: The goal of Workpackage 3 (WP3) is to take the features and meta-data provided by Workpackage 2 (WP2) and provide the technology needed for the intelligent structuring, presentation, and use (query processing and retrieval) of large music collections. This general goal can be broken down into two major task groups: the automatic structuring and organisation of large collections of digital music, and intelligent music retrieval in such structured "music spaces".

Role of Similarity within SIMAC: Similarity measures are a key technology in SIMAC. They are the foundation of the deliverables D3.5.1 (music collection structuring and navigation module) and D3.6.2 (module for retrieval by similarity and semantic descriptors). They enable core functionalities of the organizer and recommender prototypes. Without similarity, functions such as playlist generation, organization and visualization, hierarchical structuring, retrieval, and recommendations cannot be implemented. In fact, the importance of similarity measures goes beyond their role in the prototypes. A similarity measure can be licensed as is, and can easily find its way into online music stores or mobile audio players. Thus, it is highly recommended to continue the development and improvement of similarity measures throughout the SIMAC project beyond D3.4.1.

Definition of Similarity in D3.4.1: There are many aspects of similarity (timbre, harmony, etc.), and there are different sources from which these can be computed (audio, web pages, lyrics, etc.). Most of all, similarity is a perception which depends on the listener's point of view and context. Within SIMAC any important dimension of perceived similarity is useful. However, the main parts of this deliverable define similarity as the concept which pieces within a genre (or subgenre) have in common. As already pointed out in D3.1.1 and D3.2.1, the reason for this is that it allows highly efficient (i.e. fast and cheap) evaluations of the similarity measures (since genre labels for artists are readily available).

Evaluation Procedures: In this deliverable we primarily use nearest neighbor classifiers (and genre classification) to evaluate the similarity measures. The idea is that pieces within the same genre should be very close to each other. In addition, we use inter- and intra-group distances as described in Section 3.2.1. These are particularly useful in understanding how well each group is modeled by the similarity measure.


In Section 2.2.3 we compute the log-likelihood of the self-similarity within a song and use it to evaluate similarity measures. In particular, the first half of each song is compared to the second half. The idea is that a good similarity measure should recognize these to be highly similar. In Section 4.3 we use receiver operating characteristic (ROC) curves to measure the tradeoff between sensitivity and specificity. Throughout this deliverable (and in particular in Chapter 5) we use illustrations to demonstrate characteristics of the similarity measures.
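To make the nearest-neighbor evaluation concrete, the following sketch shows how a leave-one-out nearest-neighbor genre classification accuracy can be computed from a precomputed distance matrix. It is purely illustrative (the experiments in this deliverable were run with the MA Toolbox for Matlab), and the function and variable names are ours, not part of the deliverable.

    import numpy as np

    def leave_one_out_accuracy(D, genres):
        """D: (n, n) symmetric distance matrix; genres: length-n list of labels.
        Each piece is classified with the genre of its nearest neighbor
        (excluding itself); the accuracy is the fraction classified correctly."""
        n = len(genres)
        correct = 0
        for i in range(n):
            d = D[i].astype(float).copy()
            d[i] = np.inf                      # exclude the piece itself
            nearest = int(np.argmin(d))
            correct += (genres[nearest] == genres[i])
        return correct / n

    # Example with a toy 3x3 distance matrix (hypothetical values):
    D = np.array([[0.0, 0.2, 0.9],
                  [0.2, 0.0, 0.8],
                  [0.9, 0.8, 0.0]])
    print(leave_one_out_accuracy(D, ["jazz", "jazz", "pop"]))  # -> 0.666...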

Specific Context of D3.4.1 in SIMAC: D3.4.1 is built on the code and ideas from WP2 (i.e. D2.1.1 and preliminary versions of D2.1.2). D3.4.1 uses findings of D3.1.1 and, most of all, of D3.2.1. In particular, a large part of D3.2.1 covers the similarity measures (including their implementation) used in this deliverable. The recommendations of D3.4.1 are the foundation for D3.5.1, D3.6.2, and the organizer and recommender prototypes.

Relationship to D3.1.1 and D3.2.1: In these previous deliverables of WP3 we presented a literature review of similarity measures, implementations (MA toolbox), and extensive evaluations thereof based on genre classification. In this deliverable we present improvements of these results and recommendations for the implementations of the prototypes. Topics covered in detail in these previous deliverables are only repeated if they are necessary to define the context. Thus, this deliverable is not self-contained but rather an add-on to D3.1.1 and D3.2.1.

Outcomes of D3.4.1: There are five main outcomes of this deliverable. The following five chapters are structured accordingly.

A. Recommendations for audio-based similarity: We report on findings using HMMs and on combinations of different similarity measures. We show that a combination of different approaches can improve genre classification performance by 14% on average (measured on four different collections).

B. Recommendations for web-based similarity: We report a simpler alternative to the approach presented in D3.2.1 and D3.1.1, based on co-occurrences and the number of pages retrieved by Google. We show that, depending on the size of the music collection, different approaches are preferable.

C. We demonstrate how novelty detection can be combined with similarity measures to improve the performance. We show that using simple techniques, genre classification or playlist generation can be improved.

D. We report first results using outcomes from D2.1.1 and D2.1.2 for similarity. In particular, we present a general approach to using musical complexity for similarity computations, applied here to chroma patterns. Our preliminary results demonstrate possible applications of the mid-level descriptors developed in WP2.

E. We give an outlook on topics to pursue in the remainder of the SIMAC project and the necessary next steps.

2. Audio-Based Similarity

Advantages and Limitations of Audio-Based Similarity: Audio-based similarity is cheap and fast. Computing the similarity of two pieces can be done within seconds. The similarity can be computed between artists, songs, or even below the song level (e.g. between segments of different pieces). The main limitation is the quality of the computed similarity. For example, the presence and the expressiveness of a singing voice, or instruments such as electric guitars, are not modeled appropriately. In general, the meaning (or the message) of a piece of music (including, for example, emotions) will, as far as we can foresee, remain incomprehensible to the computer. Furthermore, the audio signal does not contain cultural information. Both the history and the social context of the piece of music are not accessible. In fact, as we will discuss in the next chapter, these are better extracted through analysis of content on the Internet.

The remainder of this chapter is organized as follows:

A. Description of the data sets which we used for evaluation, one of which was used as the training set for the ISMIR'05 genre classification contest.

B. Results on modeling temporal information for spectral similarity. Our results show that temporal information improves the performance of the similarity measure within a song. However, this improvement does not appear to be significant when measured in a genre classification task.

C. Results on combining different approaches. In particular, we combine the spectral similarity (which we have shown to outperform other approaches in D3.1.1 in various tasks) with information gathered from fluctuation patterns. On average (using the four collections) the improvement is about 14% for genre classification.

D. We summarize our findings and make recommendations for the prototypes.

2.1 Data

For our experiments we use four music collections with a total of almost 6000 pieces. Details are given in Tables 2.1 and 2.2. For the evaluation (especially to avoid overfitting) it is important that the collections are structured differently and have different types of content.


                           Genres   Artists   Tracks    Artists/Genre    Tracks/Genre
                                                          Min     Max     Min     Max
In-House Small (DB-S)        16        66       100        2       7       4       8
In-House Large (DB-L)        22       103      2522        3       6      45     259
Magnatune Small (DB-MS)       6       128       729        5      40      26     320
Magnatune Large (DB-ML)      10       147      3248        2      40      22    1277

Table 2.1: Statistics of the four collections.

DB-S   alternative, , classic orchestra, classic piano, dance, , happy sound, hard pop, , mystera, pop, , rock, rock & roll, romantic dinner, talk
DB-L   a cappella, , blues, , celtic, , DnB, , electronic, euro-dance, folk-rock, German hip hop, hard core rap, heavy metal/thrash, Italian, jazz, , melodic metal, punk, , trance, trance2
DB-MS  classical, electronic, jazz/blues, metal/punk, pop/rock, world
DB-ML  ambient, classical, electronic, jazz, metal, , pop, punk, rock, world

Table 2.2: Genres for each collection.

DB-S

The smallest collection consists of 100 pieces. We have previously used it in [26]. However, we removed all classes consisting of one artist only. The categories are not strictly genres (e.g. one of them is romantic dinner music). Furthermore, the collection also includes one non-music category, namely speech (German cabaret). This collection has a very good (i.e. low) ratio of tracks per artist. However, due to its small size the results need to be treated with caution.

DB-L

The second largest collection has mainly been organized according to genre/artist/album. Thus, all pieces from an artist (and album) are assigned to the same genre, which is a questionable but common practice. Only two pieces overlap between DB-L and DB-S, namely Take Five and Rondo by the Dave Brubeck Quartet. The genres are user defined and inconsistent. In particular, there are two different definitions of trance. Furthermore, there are overlaps, for example, jazz and jazz guitar, heavy metal and death metal, etc.

DB-MS

This collection is a subset of DB-ML which has been used as the training set for the ISMIR 2004 genre classification contest. The music originates from Magnatune (http://www.magnatune.com) and is available via Creative Commons. UPF/MTG arranged with Magnatune a free use for research purposes. Although we have a larger set from the same source, we use this subset to compare our results to those of ISMIR'04. The genre labels are given on the Magnatune website. The collection is very unbalanced: most pieces belong to the genre classical, and a large number of pieces in world sound like classical music. Some of the original Magnatune classes were merged by UPF/MTG due to ambiguities and the small number of tracks in some of the genres.

DB-ML

This is the largest set in our experiments. DB-MS is a subset of this collection. The genres are also very unbalanced. The number of artists is not much higher than in DB-MS. The number of tracks per artist is very high. The genres which were merged for the ISMIR contest are separated.

2.2 Hidden Markov Models for Spectral Similarity

This section deals with modeling temporal aspects to improve spectral similarity. The work presented in this section has been submitted to a conference [12]. As shown in D3.1.1, the following approach to music similarity based on spectral similarity, pioneered by [20] and [1] (and later refined in [2]), outperformed all other alternatives. In the following we will refer to it as AP. For a given music collection of S songs, each belonging to one of G music genres, it consists of the following basic steps (a minimal code sketch follows the list):

• for each song, divide the raw data into overlapping frames of short duration (around 25ms)

• compute Mel Frequency Cepstrum Coefficients (MFCCs) for each frame (up to 20)

• train a Gaussian Mixture Model (GMM, number of mixtures up to 50) for each of the songs

• compute a similarity matrix between all songs using the likelihood of a song given a GMM

• based on the genre information, do k-nearest neighbor classification using the similarity matrix
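A minimal sketch of these steps is given below. It is illustrative only: the deliverable's implementation uses the MA Toolbox and Netlab for Matlab, whereas librosa and scikit-learn are assumed here, and the symmetrized log-likelihood used as similarity is a simplification of the cluster-model comparison detailed in Section 2.3.1.

    import numpy as np
    import librosa                               # assumed, for MFCC extraction
    from sklearn.mixture import GaussianMixture  # assumed, for the GMMs

    def song_model(path, n_mfcc=20, n_mix=30):
        """Fit a GMM to the MFCC frames of one song (steps 1-3)."""
        y, sr = librosa.load(path, sr=22050, mono=True)
        frames = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                      n_fft=512, hop_length=256).T  # (n_frames, n_mfcc)
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(frames)
        return gmm, frames

    def similarity(a, b):
        """Symmetrized average log-likelihood of each song under the other's
        model (step 4); higher values mean more similar."""
        (gmm_a, frames_a), (gmm_b, frames_b) = a, b
        return 0.5 * (gmm_a.score(frames_b) + gmm_b.score(frames_a))

    def predict_genre(test_song, train_songs, train_genres):
        """Step 5: the genre of the most similar training song is the prediction."""
        scores = [similarity(test_song, s) for s in train_songs]
        return train_genres[int(np.argmax(scores))]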

The last step of genre classification can be seen as a form of evaluation. Since usually no ground truth with respect to music similarity exists, each song is labeled as belonging to a genre using e.g. music expert advice. High genre classification results indicate good similarity measures.

This approach based on GMMs disregards the temporal order of the frames, i.e. to the algorithm it makes no difference whether the frames in a song are ordered in time or whether this order is completely reversed or scrambled. Research on the perception of musical timbre of single musical instruments clearly shows that temporal aspects of the audio signals play a crucial role (see e.g. [15]). Aspects like spectral fluctuation, attack or decay of an event cannot be modelled without respecting the temporal order of the audio signals. A natural way to incorporate temporal context into the above described framework is the usage of Hidden Markov Models (HMMs) instead of GMMs. HMMs trained on MFCCs have already been used for music summarization ([19; 3; 30]) and genre classification [2], but with rather limited success.

For the experiments reported in this section we use the DB-MS collection. We divide the raw audio data into overlapping frames of short duration and use Mel Frequency Cepstrum Coefficients (MFCCs) to represent the spectrum of each frame. The frame size for the computation of MFCCs in our experiments was 23.2ms (512 samples), with a hop size of 11.6ms (256 samples) for the overlap of frames. Although improved results have been reported with numbers of MFCCs of up to 20 [2], we used only the first 8 MFCCs for all our experiments to limit the computational burden. In order to allow modeling of a bigger temporal context we also used so-called texture windows [36]: we computed means and variances of the MFCCs across the following numbers of frames and used them as alternative input to the models: 22 frames, hop size 11 (510.4ms, 255.2ms); 10 frames, hop size 5 (232ms, 116ms); 10 frames, hop size 2 (232ms, 46.4ms). This means that if a texture window is used, after preprocessing a single data point x_t is a 16-dimensional vector (8 mean MFCCs plus 8 variances across MFCCs) instead of an 8-dimensional vector if no texture window is used.
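The texture-window preprocessing can be sketched as follows (illustrative Python; the frame counts and hop sizes follow the text, the function name is ours):

    import numpy as np

    def texture_windows(mfcc, size=22, hop=11):
        """mfcc: (n_frames, 8) array of MFCC frames.
        Returns one 16-dimensional vector (8 means + 8 variances) per texture
        window of `size` frames, advanced by `hop` frames (e.g. 22/11, 10/5, 10/2)."""
        out = []
        for start in range(0, len(mfcc) - size + 1, hop):
            window = mfcc[start:start + size]
            out.append(np.concatenate([window.mean(axis=0), window.var(axis=0)]))
        return np.array(out)   # shape: (n_windows, 16)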

2.2.1 Methods

A Gaussian Mixture Model (GMM) models the density of the input data by a mixture model of the form

    p^{GMM}(x) = \sum_{m=1}^{M} P_m \, \mathcal{N}(x; \mu_m, U_m)    (2.1)

where P_m is the mixture coefficient for the m-th mixture, \mathcal{N} is the normal density, and µ_m and U_m are the mean vector and covariance matrix of the m-th mixture. The log-likelihood function is given by

    L^{GMM} = \frac{1}{T} \sum_{t=1}^{T} \log p^{GMM}(x_t)    (2.2)

for a data set containing T data points. This function is maximized both with respect to the mixing coefficients P_m and with respect to the parameters of the Gaussian basis functions using Expectation-Maximization (see e.g. [8]). Hidden Markov Models (HMMs) [32] allow analysis of non-stationary multivariate time series by modeling both the probability density functions of locally stationary multivariate data and the transition probabilities between these stable states. If the probability density functions are modelled with mixtures of Gaussians, HMMs can be seen as GMMs plus transition probabilities. An HMM can be characterized as having a finite number N of states Q:

    Q = \{q_1, q_2, \ldots, q_N\}    (2.3)

A new state q_j is entered based upon a transition probability distribution A which depends on the previous state (the Markovian property):

    A = \{a_{ij}\}, \quad a_{ij} = P(q_j(t) \mid q_i(t-1))    (2.4)

where t = 1, ..., T is a time index with T being the length of the observation sequence. After each transition an observation output symbol is produced according to a probability distribution B which depends on the current state. Although the classical HMM uses a set of discrete symbols as observation output, [32] already discuss the extension to continuous observation symbols. We use a Gaussian Observation Hidden Markov Model (GOHMM) where the observation symbol probability distribution for state j is given by a mixture of Gaussians:

    B = \{b_j(x)\}, \quad b_j(x) = p_j^{GMM}(x)    (2.5)

where p_j^{GMM}(x) is the density as defined for a mixture of Gaussians in Equ. 2.1. The Expectation-Maximization (EM) algorithm is used to train the GOHMM, thereby estimating the parameter sets A and B. The log-likelihood function is given by

    L^{HMM} = \frac{1}{T} \sum_{t=1}^{T} \left[ \log(b_{q_t}(x_t)) + \log(a_{q_{t-1}, q_t}) \right]    (2.6)

for an observation sequence of length t = 1, ..., T, with q_1, ..., q_T being the most likely state sequence and q_0 a start state. The forward algorithm is used to identify the most likely state sequences corresponding to a particular time series and enables the computation of the log-likelihoods. Full details of the algorithms can be found in [32].
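For illustration, a GOHMM with this kind of topology can be trained and scored along the following lines. The hmmlearn library is an assumption made only for this sketch (the experiments themselves were run in Matlab); its score() returns the total log-likelihood, so we divide by the number of frames to obtain a per-frame value in the spirit of Equ. 2.6.

    from hmmlearn.hmm import GMMHMM   # assumed library, not used in the deliverable

    def train_gohmm(frames, n_states=3, n_mix=3):
        """frames: (T, d) MFCC (or texture-window) vectors of one song.
        Fully connected (ergodic) HMM whose emission densities are
        mixtures of Gaussians with diagonal covariances."""
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type='diag', n_iter=20)
        model.fit(frames)
        return model

    def per_frame_loglik(model, frames):
        """Approximates L^HMM of Equ. 2.6: log-likelihood normalized by T."""
        return model.score(frames) / len(frames)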

[Figure: duration probability density p(d) on the y-axis (0 to 0.01) against duration d on the x-axis (0 to 120 sec), with four curves labeled 1-4.]

Figure 2.1: Duration probability densities p(d) (y-axis) for durations d (x-axis) in seconds for different combinations of window and hop sizes: line (1) win 23.2ms, hop 11.6ms, line (2) win 232ms, hop 46.4ms, line (3) win 232ms, hop 116ms, line (4) win 510.4ms, hop 255.2ms.

It is informative to have a closer look at how the transition probabilities influence the state sequence characteristics. The inherent duration probability density p_i(d) associated with state q_i, with self-transition coefficient a_{ii}, is of the form

    p_i(d) = (a_{ii})^{d-1} (1 - a_{ii})    (2.7)

This is the probability of d consecutive observations in state q_i, i.e. the duration probability of staying d times in one of the locally stationary states modeled with a mixture of Gaussians. As [31] noted, this exponential state duration density is not optimal for a lot of physical signals. The duration of a single data point in our case is dependent on the window length win of the frame used for computing the MFCCs or the size of the texture window, as well as the hop size hop. The length l of staying in the same state, expressed in msec, is then:

    l = (d - 1) \cdot hop + win    (2.8)

with hop and win given in msec. Fig. 2.1 gives duration probability densities for all different combinations of hop and win used for preprocessing, with a_{ii} set to 0.99 (which is a reasonable choice for audio data). One can see that whereas for hop = 11.6 and win = 23.2 the duration probability at five seconds is already almost zero, there is still a small probability for durations up to 120 seconds for hop = 255.2 and win = 510.4. Our choice of different frame sizes and texture windows seems to guarantee a range of different duration probabilities. The shorter the state durations in HMMs are, the more often the state sequence will switch from state to state and the less clear the boundaries between the mixtures of Gaussians of the individual states will be. Therefore, with shorter state durations the HMMs will be more akin to GMMs in their modeling behavior.

An important open issue is the model topology of the HMM. Looking again at the work by [32] on speech analysis, we can see that the standard model for isolated word recognition is a left-to-right HMM. No transitions are allowed to states whose indices are lower than the current state, i.e. as time increases the state index increases. This has been found to account well for the modeling of words, which rarely have repeating vowels or sounds. For songs, a fully connected so-called ergodic HMM seems to be more suitable for modeling than the constrained left-to-right model. After all, repeating patterns seem to be an integral part of music. Therefore it makes sense to allow states to be entered more than once and hence use ergodic HMMs.

There is a small number of papers describing applications of HMMs to the modeling of some form of spectral similarity. [19] compare HMMs and static clustering for music summarization. Fully ergodic HMMs with five to twelve states of single Gaussians are trained on the first 13 MFCCs (computed from 25.6ms overlapping windows). Key phrases are chosen based on state frequencies and evaluated in a user study. Clustering performs best and HMMs do not even surpass the performance of a random algorithm. [3] use fully ergodic three-state HMMs with single Gaussians per state trained on the first ten MFCCs (computed from 30ms overlapping windows) for segmentation of songs into chorus, verse, etc. The authors found little improvement over using static k-means clustering for the problem. The same approach is used as part of a bigger system for audio thumbnailing in [4]. [30] also compare HMMs and k-means clustering for music audio summary generation. The authors report achieving smoother state jumps using HMMs. [2] report genre classification experiments using HMMs with numbers of states ranging from 3 to 30, where the states are mixtures of four Gaussians. For their genre classification task the best HMM is the one with 12 states. Its performance is slightly worse than that of a GMM with a mixture of 50. The authors do not give any detail about the topology of the HMM, i.e. whether it is a fully ergodic one or one with left-to-right topology. It is also unclear whether they use full covariance matrices for the mixtures of Gaussians. From the graph in their paper (Figure 6) it is evident that HMMs with numbers of states ranging from 4 to 25 perform at a very comparable level in terms of genre classification accuracy. HMMs have also been used successfully for audio fingerprinting (see e.g. [5]). There, HMMs with tailor-made topologies trained on MFCCs are used to fully represent each detail of a song in a huge database. The emphasis is on exact identification of a specific song and not on generalization to songs with similar characteristics.
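Before turning to the results, the state-duration densities of Equations 2.7 and 2.8 can be made concrete with a few lines of code; the parameter values follow the text (a_ii = 0.99 and the window/hop combinations of Figure 2.1), everything else is our own illustration.

    import numpy as np

    def duration_density(d, a_ii=0.99):
        """Equ. 2.7: probability of staying exactly d consecutive frames in a state."""
        return a_ii ** (d - 1) * (1.0 - a_ii)

    def frames_for_duration(ms, hop, win):
        """Invert Equ. 2.8: number of frames d needed to stay `ms` milliseconds."""
        return int(np.ceil((ms - win) / hop)) + 1

    # Density at a 5-second state duration for two preprocessing settings of Fig. 2.1:
    for win, hop in [(23.2, 11.6), (510.4, 255.2)]:
        d = frames_for_duration(5000, hop, win)
        print(win, hop, d, duration_density(d))
    # The short frames need d ~ 430, where p(d) has decayed to ~1e-4, while the
    # 510.4/255.2 texture windows need only d ~ 19, where p(d) is still ~8e-3;
    # this reproduces the spread of curves described for Figure 2.1.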

2.2.2 Results

Table 2.3: Overview of all types of models used and results achieved: index of model nr, model type model, number of states states, size of mixture mix, window size win, hop size hop, texture window tex, degrees of freedom df, mean log-likelihood likeli, number of HMM based log-likelihoods bigger than GMM based log-likelihoods H > G, z-statistic z, mean accuracy acc, standard deviation stddev, t-statistic t.

nr  model  states  mix    win     hop   tex    df   likeli   H > G       z     acc   stddev      t
 1  HMM        10    1   23.2    11.6    n    180   -31.10      22   -24.43  74.20     5.43  -0.26
 2  GMM         -   10   23.2    11.6    n     80   -29.89                    76.54     3.64
 3  HMM         3    3   23.2    11.6    n     81   -29.26     698    24.76  77.08     4.73   0.36
 4  GMM         -    9   23.2    11.6    n     72   -29.91                    73.38     5.00
 5  HMM         6    5   23.2    11.6    n    276   -28.95     706    25.46  78.18     4.59   0.00
 6  GMM         -   30   23.2    11.6    n    240   -29.93                    78.19     3.32
 7  HMM         3    3  510.4   255.2    y    153   -29.31     692    24.26  74.20     4.85  -0.05
 8  GMM         -    9  510.4   255.2    y    144   -29.92                    74.62     3.67
 9  HMM         3    3  232.0   116.0    y    153   -29.30     690    24.11  76.67     2.22   0.08
10  GMM         -    9  232.0   116.0    y    144   -29.90                    76.26     3.13
11  HMM         3    3  232.0    46.4    y    153   -29.34     677    23.13  73.79     4.81  -0.04
12  GMM         -    9  232.0    46.4    y    144   -29.89                    74.20     3.27

For our experiments with GMMs and HMMs we used the following parameters (abbreviations correspond to those used in Table 2.3):

• preprocessing: we used combinations of window (win) and hop sizes (hop) and texture windows (tex set to yes ('y') or no ('n'))

• topology: 3, 6, and 10 state ergodic (fully connected) HMMs with mixtures of 1, 3, or 5 Gaussians per state, and GMMs with mixtures of 9, 10, or 30 Gaussians (see states and mix in Table 2.3 for the combinations used); Gaussians use diagonal covariance matrices for both HMMs and GMMs

• computation of similarity: similarity is computed using Equ. 2.6 for HMMs and Equ. 2.2 for GMMs

The combinations of parameters states, mix, win, hop and tex used for this study yielded twelve different model classes: six types of HMMs and six types of GMMs. We made sure to employ comparable types of GMMs and HMMs by having comparable degrees of freedom for pairs of model classes: HMM (states 10, mix 1) vs. GMM (mix 10), HMM (states 3, mix 3) vs. GMM (mix 9), HMM (states 6, mix 5) vs. GMM (mix 30). The degrees of freedom (number of free parameters) for HMMs and GMMs are

    df^{GMM} = mix \times \dim(x)    (2.9)

    df^{HMM} = states \times mix \times \dim(x) + states^2    (2.10)

with dim(x) being the dimensionality of the input vectors. Column df in Table 2.3 gives the degrees of freedom for all types of models. With the first column nr indexing the different models, odd numbered models are always HMMs and the next even numbered model is always the associated GMM. The difference in degrees of freedom between two associated types of GMMs and HMMs is always the number of transition probabilities (states^2).
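A few lines suffice to reproduce the df column of Table 2.3 (illustrative only; dim(x) is 8 without and 16 with texture windows):

    def df_gmm(mix, dim):
        return mix * dim                         # Equ. 2.9

    def df_hmm(states, mix, dim):
        return states * mix * dim + states ** 2  # Equ. 2.10

    # A few entries of Table 2.3:
    assert df_hmm(10, 1, 8) == 180   # model 1
    assert df_gmm(10, 8) == 80       # model 2
    assert df_gmm(30, 8) == 240      # model 6
    assert df_hmm(3, 3, 16) == 153   # model 7 (texture window, 16-dimensional input)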

2.2.3 Comparing log-likelihoods directly

The first line of experiments compares goodness-of-fit criteria (log-likelihoods) between songs and models in order to explore which type of model best describes the data. Out-of-sample log-likelihoods were computed in the following way:

• train HMMs and GMMs for each of the twelve model types for each of the songs in the training set, using only the first half of each song

• use the second half of each song to compute log-likelihoods L^{HMM} and L^{GMM}

This yielded S = 729 log-likelihoods for each of the twelve model types. Average log-likelihoods per model type are given in column likeli in Table 2.3. Since the absolute values of log-likelihoods very much depend on the type of songs used, it is much more informative to compare log-likelihoods on a song-by-song basis. In Fig. 2.2, histogram plots of the differences of log-likelihoods L_i - L_{i+1} between associated model types are shown:

    L_i - L_{i+1} = L^{HMM(i)} - L^{GMM(i+1)}    (2.11)

with HMM(i) being an HMM of model type index nr = i and GMM(i+1) being the associated GMM of model type index nr = i + 1, for i = 1, 3, 5, 7, 9, 11.

The differences L_i - L_{i+1} are computed for all the S = 729 songs before doing the histogram plots. As can be seen in Fig. 2.2, except for one histogram plot the majority of HMM models show a better goodness-of-fit of the data than their associated GMMs (i.e. their log-likelihoods are higher for most of the songs). The only exception is the comparison of model types 1 and 2 (HMM (states 10, mix 1) vs. GMM (mix 10)), which is interesting because in this case the HMMs have the biggest advantage in terms of degrees of freedom (180 vs. 80) over the GMMs of all the comparisons. This is due to the fact that this type of HMM model has the highest number of states with states = 10. But it also has only a single Gaussian per state to model the probability density functions. Experiments on isolated word recognition in speech analysis [32] have shown that small sizes of the mixtures of Gaussians used in HMMs do not catch the full detail of the emission probabilities, which often are not Gaussian at all. Mixtures of five Gaussians with diagonal covariances per state have been found to be a good choice.

[Figure: six histogram panels, one per pair of associated models (Lik_1 - Lik_2, Lik_3 - Lik_4, Lik_5 - Lik_6, Lik_7 - Lik_8, Lik_9 - Lik_10, Lik_11 - Lik_12); x-axis: difference in log-likelihood, y-axis: number of songs.]

Figure 2.2: Histogram plots of differences in log-likelihood between associated models.

Finding a correct statistical test for comparing likelihoods of so-called non-nested models is far from trivial (see e.g. [23] or [14]). HMMs and GMMs are non-nested models because one is not just a subset of the other, as would e.g. be the case with a mixture of five Gaussians compared to a mixture of six Gaussians. What makes the models non-nested is the fact that it is not clear how to weigh the parameter of a transition probability a_{ij} against, say, a mean µ_m of a Gaussian. Nevertheless, it is correct to compare the log-likelihoods since we use out-of-sample estimates, which automatically punish overfitting due to excessive free parameters. It is just the distribution characteristics of the log-likelihoods which are hard to describe. Therefore we resorted to the distribution-free sign test, which relies only on the rank of results (see e.g. [34]). Let C_I be the score under condition I and C_{II} the score under condition II; the null hypothesis tested by the sign test is

    H_0: \; p(C_I > C_{II}) = p(C_I < C_{II}) = \tfrac{1}{2}    (2.12)

If c is the number of cases in which C_I > C_{II} and the number of matched pairs N is greater than 25, then the sampling distribution is the normal distribution with

    z = \frac{c - \tfrac{1}{2} N}{\tfrac{1}{2} \sqrt{N}}    (2.13)

Column H > G in Table 2.3 gives the count c of HMM-based log-likelihoods being bigger than GMM-based log-likelihoods for all pairs of associated model types. Column z gives the corresponding z-values obtained using Equ. 2.13. All z-values are highly significant at the 99% error level since all |z| > z_{99} = 2.58. Therefore HMMs always describe the data better than their associated GMMs, with the exception of the comparison of model types 1 and 2 (HMM (states 10, mix 1) vs. GMM (mix 10)). To counter the argument that the superior performance of the HMMs is due to their extra degrees of freedom (i.e. the number of transition probabilities, see column df in Table 2.3), we also compared the smallest type of HMM (model nr 3: HMM (states 3, mix 3), df = 81) with the biggest type of GMM (model nr 6: GMM (mix 30), df = 240). This comparison yielded a count c (H > G) of 635 and a z-value of z = 20.14 > z_{99} = 2.58, again highly significant. We conclude that it is not the sheer number of degrees of freedom in the models but the quality of the free parameters which decides which type of model better fits the data. After all, the degrees of freedom of the HMMs in our last comparison are outnumbered three times by those of the GMMs.
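The z-statistic of Equ. 2.13 is easily reproduced in a few lines (a sketch; small deviations from the values in Table 2.3 can occur, e.g. if ties were handled differently):

    import numpy as np

    def sign_test_z(c, n):
        """Equ. 2.13: z-statistic of the sign test for c 'wins' out of n matched
        pairs (normal approximation, valid for n > 25)."""
        return (c - 0.5 * n) / (0.5 * np.sqrt(n))

    print(round(sign_test_z(692, 729), 2))  # models 7 vs. 8: 24.26, as in Table 2.3
    print(round(sign_test_z(635, 729), 2))  # HMM (3,3) vs. GMM (30): ~20, cf. z = 20.14
    # Both values far exceed z_99 = 2.58, i.e. they are significant at the 99% level.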

2.2.4 Genre Classification

The second line of experiments compares genre classification results. In a 10-fold cross validation we did the following:

• train HMMs and GMMs for each of the twelve model types for each of the songs in the training set (the nine training folds), this time using the complete songs

• for each of the model types, compute a similarity matrix between all songs using the log-likelihood of a song given an HMM or a GMM (L^{HMM} and L^{GMM})

• based on the genre information, do one-nearest neighbor classification for all songs in the test fold using the similarity matrices

Average accuracies and standard deviations across the ten folds of the cross validation are given in columns acc and stddev in Table 2.3. Looking at the results one can see that the achieved accuracies range from around 73% to around 78%, with standard deviations of up to 5%. We compared accuracy results of associated model types in a series of paired t-tests (model nr 1 vs. nr 2, ..., nr 11 vs. nr 12). The resulting t-values are given in column t in Table 2.3. None of the t-values is significant at the 99% error level since all |t| < t_{(99, df=9)} = 3.25 (the same holds true at the 95% error level). Even the biggest difference in accuracy (between model type nr 4, GMM (mix 9), acc = 73.38, and model type nr 6, GMM (mix 30), acc = 78.19) is not significant (t = 0.43).

2.2.5 Discussion

There are two main results: (i) HMMs describe the spectral similarity of songs better than the standard technique of GMMs. Comparison of log-likelihoods clearly shows that HMMs allow for a better fit of the data. This holds not only when looking at competing models with comparable numbers of degrees of freedom, but also for GMMs with numbers of parameters that are much larger than those of the HMMs. The only outlier in this respect is model type 1 (HMM (states 10, mix 1)). But as discussed in the previous section, this is probably due to the poor choice of single Gaussians for modeling the emission probabilities. (ii) HMMs perform at the same level as GMMs when used for spectral-similarity-based genre classification. There is no significant gain in terms of classification accuracy. Genre classification is of course a rather indirect way of measuring differences between alternative similarity models. The human error in classifying some of the songs gives rise to a certain percentage of misclassification already. Inter-rater reliability between a number of music experts is far from perfect for genre classification.

Although we believe this is the most comprehensive study on using HMMs for spectral similarity of songs so far, there is of course a lot still to be done. Two possible routes for further improvements come to mind: the topology of the HMMs and the handling of the state duration. Choosing a topology for an HMM is still more of an art than a science (see e.g. [10] for a discussion). Our limited set of examined combinations of numbers of states and sizes of mixtures could be extended. One should however notice that too large numbers for these parameters quickly lead to numerical problems due to insufficient training data. We also have not yet tried out left-to-right models. With our choice of different frame sizes and texture windows we tried to explore a range of different state duration densities. There are of course a number of alternative and possibly more principled ways of doing this. The usage of so-called explicit state duration modeling could be explored: a duration parameter d per HMM state is added.

Upon entering a state q_i, a duration d_i is chosen according to a state duration density p(d_i). Formulas are given in [32]. Another idea is to use an array of n states with identical self-transition probabilities where it is enforced to pass each state at least once. This gives rise to more flexible so-called Erlang duration density distributions (see [10]). An altogether different approach of representing the dynamical nature of audio signals is the computation of dynamic features, by substituting the MFCCs with features that already code some temporal information (e.g. autocorrelation or reflection coefficients). Examples can be found in [32]. Some of these ideas might be able to further improve the modeling of songs by HMMs, but it is not clear whether this would also help the genre classification performance.

2.3 Spectral Similarity Combined with Complementary Information

In this section we demonstrate how the performance of the AP spectral similarity can be improved. In particular, we combine it with complementary information taken from fluctuation patterns (which describe loudness fluctuations over time) and two new descriptors derived from them. The work presented in this section has been submitted to a conference [27]. To evaluate the results we use the four music collections described previously. Compared to the winning algorithm of the ISMIR'04 genre classification contest, our findings show improvements of up to 41% (12 percentage points) on one of the collections, while the results on the contest training set (using the same evaluation procedure as in the contest) increased by merely 2 percentage points. One of our main observations is that evaluating on only one music collection (instead of several with different structures and contents) can lead to overfitting. Another observation is the need to distinguish between artist identification and genre classification. Furthermore, our findings confirm those of Aucouturier and Pachet [2], who suggest the existence of a glass ceiling which cannot be surpassed without taking higher level cognitive processing into account.

2.3.1 Spectral Similarity

We use the same spectral similarity as described in the previous section on HMMs. We used the implementations in the MA Toolbox [26] and the Netlab Toolbox (http://www.ncrg.aston.ac.uk/netlab) for Matlab. From the 22050Hz mono audio signals, two minutes from the center of each piece are used for further analysis. The signal is chopped into frames with a length of 512 samples (about 23ms) with 50% overlap. The average energy of each frame's spectrum is subtracted. The 40 Mel frequency bands (in the range of 20Hz to 16kHz) are represented by the first 20 MFCCs. For clustering we use a Gaussian Mixture Model with 30 clusters, trained using expectation maximization (after k-means initialization). The cluster model similarity is computed with Monte Carlo sampling and a sample size of 2000. The classifier in the experiments described below computes the distances of each piece in the test set to all pieces in the training set. The genre of the closest neighbor in the training set is used as the prediction (nearest neighbor classifier).
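The cluster-model comparison by Monte Carlo sampling can be sketched as follows. The exact distance formulation is not spelled out in this section; the sketch uses one common variant of the Aucouturier-Pachet distance and scikit-learn's GaussianMixture as a stand-in for the Matlab implementation, so it should be read as illustrative rather than as the MA Toolbox code.

    from sklearn.mixture import GaussianMixture  # assumed stand-in for the Matlab GMMs

    def cluster_model_distance(gmm_a, gmm_b, n_samples=2000):
        """Monte Carlo comparison of two cluster models (e.g. 30-component GMMs).
        Draw n_samples points from each model and compare how well each model
        explains its own samples versus the other model's samples.  This is one
        common formulation of the distance; smaller values mean more similar."""
        sa, _ = gmm_a.sample(n_samples)
        sb, _ = gmm_b.sample(n_samples)
        return (gmm_a.score(sa) + gmm_b.score(sb)
                - gmm_a.score(sb) - gmm_b.score(sa))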

2.3.2 Fluctuation Patterns

Fluctuation Patterns (FPs) describe loudness fluctuations in 20 frequency bands [25; 29]. They describe characteristics of the audio signal which are not described by the spectral similarity measure. First, the audio signal is cut into 6-second sequences. We use the center 2 minutes from each piece of music and cut it into non-overlapping sequences. For each of these sequences a psychoacoustic spectrogram, namely the Sonogram, is computed. For the loudness curve in each frequency band an FFT is applied to describe the amplitude modulation of the loudness. From the FPs we extract two new descriptors. The first one, which we call Focus, describes how distinctive the fluctuations at specific frequencies are. The second one, which we call Gravity, is related to the overall perceived tempo.

Sone

Each 6-second sequence is cut into overlapping frames with a length of 46ms. For each frame the FFT is computed. The frequency bins are weighted according to a model of the outer and middle-ear to emphasize frequencies around 3-4kHz and suppress very low or high frequencies. The FFT frequency bins are grouped into frequency bands according to the critical-band rate scale with the unit Bark [40]. A model for spectral masking is applied to smooth the spectrum. Finally, the loudness is computed with a non-linear function. We normalize the loudness of each piece such that the peak loudness is 1.


Fluctuation Patterns

Given a 6-second Sonogram, we compute the amplitude modulation of the loudness in each of the 20 frequency bands using an FFT. The amplitude modulation coefficients are weighted based on the psychoacoustic model of the fluctuation strength [11]. This modulation has different effects on our hearing sensation depending on the frequency. The sensation of "fluctuation strength" is most intense around 4Hz and gradually decreases up to a modulation frequency of 15Hz. The FPs analyze modulations up to 10Hz. To emphasize certain patterns, a gradient filter (over the modulation frequencies) and a Gaussian filter (over the frequency bands and the modulation frequencies) are applied. Finally, for each piece the median of all FPs representing a 6-second sequence is computed. This final FP is a matrix with 20 rows (frequency bands) and 60 columns (modulation frequencies). Two pieces are compared by interpreting their FP matrices as 1200-dimensional vectors and computing the Euclidean distance. An implementation of the FPs is available in the MA Toolbox [26]. Figure 2.3 shows some examples of FPs. The vertical lines indicate reoccurring periodic beats. The song Spider by Flex, which is a typical example of the genre eurodance, has the strongest vertical lines.
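A simplified sketch of the FP computation is given below. The psychoacoustic fluctuation-strength weighting and the gradient/Gaussian filters are only indicated by comments, since the full model is implemented in the MA Toolbox; the shapes (20 bands x 60 modulation bins, 6-second segments) follow the text, and the mapping of FFT bins to modulation frequencies depends on the assumed Sonogram frame rate.

    import numpy as np

    def fluctuation_pattern(sonogram_segments, n_mod_bins=60):
        """sonogram_segments: list of (20, n_frames) loudness arrays (all of equal
        length), one per 6-second sequence.  Returns the song's FP as a (20, 60)
        matrix."""
        fps = []
        for seg in sonogram_segments:
            # amplitude modulation of the loudness curve in each frequency band;
            # the first 60 non-DC bins are taken as the modulation frequencies
            mod = np.abs(np.fft.rfft(seg, axis=1))[:, 1:n_mod_bins + 1]
            # here the psychoacoustic fluctuation-strength weighting and the
            # gradient/Gaussian filters of the full model would be applied
            fps.append(mod)
        return np.median(np.array(fps), axis=0)   # median over all 6-second FPs

    def fp_distance(fp_a, fp_b):
        """Two pieces are compared as 1200-dimensional vectors (Euclidean distance)."""
        return np.linalg.norm(fp_a.ravel() - fp_b.ravel())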

Focus

The Focus (FP.F) describes the distribution of the energy in the FP. In particular, FP.F is low if the energy is focused in small regions of the FP, and high if the energy is spread out over the whole FP. FP.F is computed as the mean value of all values in the FP matrix, after normalizing the FP such that the maximum value equals 1. The distance between two pieces of music is computed as the absolute difference between their FP.F values. Figure 2.3 shows five example histograms of the values in the FPs and the mean thereof (as a vertical line). Black Jesus by Everlast (belonging to the genre alternative) has the highest FP.F value (0.42). The song has a strong focus on guitar chords and vocals, while the drums are hardly noticeable. The song Spider by Flex (belonging to eurodance) has the lowest FP.F value (0.16). Most of the song's energy is in the strong periodic beats. Figure 2.4 shows the distribution of FP.F over different genres. The values have a large deviation and the overlap between quite different genres is significant. Electronic has the lowest values while punk/metal has the highest. The amount of overlap is an important factor for the quality of the descriptor. As we will see later, in the optimal combination of all similarity sources, FP.F has the smallest contribution.
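The Focus descriptor itself is only a few lines; a sketch following the description above (function names are ours):

    import numpy as np

    def fp_focus(fp):
        """FP.F: mean of all FP values after scaling the maximum to 1.
        Low values = energy concentrated in small regions, high values = spread out."""
        fp = fp / fp.max()
        return fp.mean()

    def fp_focus_distance(fp_a, fp_b):
        """Distance between two pieces: absolute difference of their FP.F values."""
        return abs(fp_focus(fp_a) - fp_focus(fp_b))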

[Figure 2.3 shows, for five example songs, the cluster model (CM), the FP, the FP.F histogram, and the FP.G curve. FP.F/FP.G values: Take Five (Brubeck et al.) 0.28/-6.4, Spider (Flex) 0.16/-5.0, Surfin' USA (Beach Boys) 0.23/-6.4, Crazy (Spears) 0.32/-5.9, Black Jesus (Everlast) 0.42/-5.8.]

Figure 2.3: Visualization of the features. On the y-axis of the cluster model (CM) is the loudness (dB-SPL), on the x-axis are the Mel frequency bands. The plots show the 30 centers and their variances on top of each other. On the y-axis of the FP are the Bark frequency bands, the x-axis is the modulation frequency (in the range from 0-10Hz). The y-axis on the FP.F histogram plots are the counts, on the x-axis are the values of the FP (from 0 to 1). The y-axis of the FP.G is the sum of values per FP column, the x-axis is the modulation frequency (from 0-10Hz).

Gravity

The Gravity (FP.G) describes the center of gravity (CoG) of the FP on the modulation frequency axis. Given 60 modulation frequency-bins (linearly spaced in the range of 0-10Hz) the CoG usually lies between the 20th and the 30th bin, and is computed as

    \mathrm{CoG} = \frac{\sum_j \sum_i j \, \mathrm{FP}_{ij}}{\sum_{ij} \mathrm{FP}_{ij}}    (2.14)

where FP is a 20 x 60 matrix, i is the index of the frequency band, and j the index of the modulation frequency. We compute FP.G by subtracting the theoretical mean of the fluctuation model (which is around the 31st bin) from the CoG. Low values indicate that the piece might be perceived as slow. However, FP.G is not intended to model the perception of tempo. Effects such as vibrato or tremolo are also reflected in the FP. The distance between two pieces of music is computed as the absolute difference between their FP.G values. Figure 2.3 shows the sum of the values in the FP over the frequency bands (i.e. the sum over the rows in the FP matrix) and the CoGs marked with a vertical line. Spider by Flex has the highest value (-5.0), while the lowest value (-6.4) is computed for Take Five by the Dave Brubeck Quartet and Surfin' USA by the Beach Boys. Figure 2.4 shows the distribution of FP.G over different genres. The values have a smaller deviation compared to FP.F and there is less overlap between different genres. Classical and a cappella have the lowest values, while electronic, metal, and punk have the highest values.

[Figure 2.4: two panels of boxplots, (a) DB-MS (classical, electronic, jazz/blues, metal/punk, pop/rock, world) and (b) DB-L (a cappella, death metal, electronic, jazz, jazz guitar, punk), each showing the distribution of FP.F (left) and FP.G (right) per genre.]

Figure 2.4: Boxplots showing the distribution of the descriptors per genre on two music collections. A description of the collections can be found in Section 2.1. The boxes have lines at the lower quartile, median, and upper quartile values. The whiskers show the extent of the rest of the data (the maximum length is 1.5 of the inter-quartile range). Data beyond the ends of the whiskers are marked with plus-signs.
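A sketch of the Gravity computation (Equ. 2.14) follows; the code is illustrative (the actual implementation is in the MA Toolbox), and the offset of the theoretical mean is kept as a parameter since the text only locates it around the 31st bin.

    import numpy as np

    def fp_gravity(fp, theoretical_mean=31.0):
        """FP.G: center of gravity of the FP along the modulation-frequency axis
        (Equ. 2.14), minus the theoretical mean of the fluctuation model."""
        j = np.arange(1, fp.shape[1] + 1)          # modulation-frequency bin indices
        cog = (j * fp.sum(axis=0)).sum() / fp.sum()
        return cog - theoretical_mean

    def fp_gravity_distance(fp_a, fp_b):
        """Distance: absolute difference of the FP.G values of two pieces."""
        return abs(fp_gravity(fp_a) - fp_gravity(fp_b))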

2.3.3 Combination

To combine the distance matrices obtained with the four approaches described above, we use a linear combination similar to the idea used for the aligned Self-Organizing Maps (SOMs) [28]. Before combining the distances, we normalize the four distances such that, for each of them, the standard deviation of all pairwise distances within a music collection equals 1. In contrast to the aligned-SOMs we do not rely on the user to set the optimum weights for the linear combination; instead, we automatically optimize the weights for genre classification.
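A sketch of the combination step; the normalization and the weighted sum follow the description above, and the example weights are taken from the top-ranked combination in Table 2.4 (the weights are otherwise chosen by the grid search of Section 2.3.4).

    import numpy as np

    def normalize(D):
        """Scale a distance matrix so that the standard deviation of all pairwise
        distances within the collection equals 1."""
        return D / D[np.triu_indices_from(D, k=1)].std()

    def combine(distances, weights):
        """Linear combination of normalized distance matrices, e.g.
        combine({'AP': D_ap, 'FP': D_fp, 'FP.F': D_f, 'FP.G': D_g},
                {'AP': 0.65, 'FP': 0.15, 'FP.F': 0.05, 'FP.G': 0.15})."""
        return sum(weights[name] * normalize(D) for name, D in distances.items())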

2.3.4 Genre Classification

We evaluate the genre classification performance on four music collections to find the optimum weights for the combination of the different similarity sources. We use a nearest neighbor classifier and leave-one-out cross validation for the evaluation. The accuracies are computed as the ratio of correctly classified tracks to the total number of tracks (without normalizing the accuracies with respect to the different class probabilities). Genre classification is not the best choice to evaluate the performance of a similarity measure. However, unlike listening tests, it is very fast and cheap.

In contrast to the ISMIR 2004 genre contest we apply an artist filter. In particular, we ensure that all pieces of an artist are either in the training set or in the test set. Otherwise we would be measuring artist identification performance, since all pieces of an artist are in the same genre (in all of the collections we use). The resulting performance is significantly worse. For example, on the ISMIR 2004 genre classification training set (using the same algorithm we submitted last year) we get 79% accuracy without the artist filter and only 64% with the artist filter. The difference is even bigger on a large in-house collection where (using the same algorithm) we get 71% without the artist filter and only 27% with the filter. In the results described below we always use an artist filter if not stated otherwise.

In the remainder of this section, first the four music collections we use are described. Second, results using only one similarity source are presented. Third, pairwise combinations with spectral similarity (AP) are evaluated. Fourth, all four sources are combined. Finally, the performance on all collections is evaluated to avoid overfitting.
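The artist filter amounts to excluding, for every test piece, all candidates by the same artist when searching for the nearest neighbor. A sketch of the leave-one-out evaluation with this filter (illustrative code, names are ours):

    import numpy as np

    def loo_accuracy_with_artist_filter(D, genres, artists):
        """Leave-one-out nearest-neighbor genre classification where all pieces by
        the test piece's artist are removed from the candidates.
        D: (n, n) float distance matrix; genres, artists: length-n label lists."""
        n, correct = len(genres), 0
        for i in range(n):
            d = D[i].astype(float).copy()
            same_artist = [j for j in range(n) if artists[j] == artists[i]]
            d[same_artist] = np.inf            # also excludes the piece itself
            correct += (genres[int(np.argmin(d))] == genres[i])
        return correct / n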

Individual Performance

The performances using one similarity source are given in Figure 2.5 in the first (only spectral similarity, AP) and last columns (only the respective similarity source). AP clearly performs best, followed by FP. The performance of FP.F is extremely poor on DB-S while it is equal to FP.G on DB-L. For DB-MS without the artist filter we obtain 79% using only AP (this is the same performance also obtained on the ISMIR’04 genre contest test set, which indicates that there was no overfitting on the data). Using only FP we obtain 66% accuracy which is very close to the 67% Kris West’s submission achieved. The accuracy for FP.F is 30% and 43% for FP.G. Always guessing that a piece is classical gives 44% accuracy. Thus, the performance of FP.F is significantly below the random guessing baseline. 2. Audio-Based Similarity 24

          0   10   20   30   40   50   60   70   80   90  100
FP       29   30   32   33   30   27   26   25   23   18   17
FP.F     29   28   28   25   20   19   17   17   14    6    1
FP.G     29   31   35   36   37   35   31   29   25   21   15

(a) DB-S

          0   10   20   30   40   50   60   70   80   90  100
FP       27   30   30   29   30   30   29   28   26   25   23
FP.F     27   27   27   25   24   23   23   22   20   18    8
FP.G     27   30   29   28   27   26   26   25   24   22    8

(b) DB-L

          0   10   20   30   40   50   60   70   80   90  100
FP       64   63   64   65   65   65   63   63   62   61   58
FP.F     64   66   64   63   63   61   59   58   58   54   28
FP.G     64   64   64   64   63   61   61   61   60   57   42

(c) DB-MS

          0   10   20   30   40   50   60   70   80   90  100
FP       56   57   57   58   58   57   56   55   55   52   49
FP.F     56   56   56   54   54   53   53   52   52   50   25
FP.G     56   57   56   56   55   54   54   54   53   52   32

(d) DB-ML

Figure 2.5: Results for combining AP with one of the other sources. All values are given in percent. The values on the x-axis are the mixing coefficients. For example, the fourth column in the second row is the accuracy for combining 70% AP with 30% of FP.F.

Combining Two

The results for combining AP with one of the other sources are given in Figure 2.5. The main findings are that combining AP with FP or FP.G performs better than combining AP with FP.F (except for 10% FP.F and 90% AP on DB-MS). For all collections a combination can be found which improves the performance. However, the improvements on the Magnatune collection are marginal. The smooth changes of the accuracy with respect to the mixing coefficient are an indicator that the approach is relatively robust (within each collection).

         100   95   90   85   80   75   70   65   60   55   50
AP        29   30   33   34   39   38   38   39   39   41   41
FP        41   41   38   39   39   36   35   35   32   31   27
FP.F      39   39   41   41   41   38   36   35   29   21   19
FP.G      35   36   37   39   40   41   41   41   41   37   35
           0    5   10   15   20   25   30   35   40   45   50

(a) DB-S

         100   95   90   85   80   75   70   65   60   55   50
AP        27   30   31   32   32   32   32   32   31   32   31
FP        30   32   32   32   32   31   31   32   31   31   30
FP.F      31   32   32   32   31   31   30   29   28   26   23
FP.G      32   32   32   32   31   30   29   29   29   28   26
           0    5   10   15   20   25   30   35   40   45   50

(b) DB-L

         100   95   90   85   80   75   70   65   60   55   50
AP        64   67   68   67   67   67   67   67   67   67   67
FP        68   67   67   67   67   67   66   67   67   65   65
FP.F      66   68   67   67   66   66   65   65   64   64   61
FP.G      67   68   67   67   67   66   65   65   65   64   61
           0    5   10   15   20   25   30   35   40   45   50

(c) DB-MS

         100   95   90   85   80   75   70   65   60   55   50
AP        56   57   57   58   58   58   58   58   58   58   57
FP        57   58   58   58   58   58   58   58   58   57   57
FP.F      58   58   58   58   57   57   56   56   55   55   53
FP.G      58   58   58   58   58   57   57   57   56   56   54
           0    5   10   15   20   25   30   35   40   45   50

(d) DB-ML

Figure 2.6: Results for combining all similarity sources. A total of 270 combinations are summarized in each table. All values are given in percent. The mixing coefficients for AP (the first row) are given above the table, for all other rows below. For each entry in the table of all possible combinations the highest accuracy is given. For example, the second row, third column depicts the highest accuracy obtained from all possible combinations with 10% FP. The not specified 90% can have any combination of mixing coefficients, e.g. 90% AP, or 80% AP and 10% FP.G etc. 2. Audio-Based Similarity 26

            Weights                  Classification Accuracy
Rank     AP   FP  FP.F  FP.G      DB-S  DB-L  DB-MS  DB-ML     Score
  1      65   15    5    15        38    32    67     58        1.14
  2      65   10   10    15        38    31    67     57        1.14
  3      70   10    5    15        38    31    67     58        1.14
  4      55   20    5    20        39    31    65     57        1.14
  5      60   15   10    15        38    31    66     57        1.14
  6      60   15    5    20        39    31    66     57        1.13
  7      75   10    5    10        37    31    67     58        1.13
  8      75    5    5    15        38    31    66     58        1.13
  9      65   10    5    20        38    30    66     58        1.13
 10      55    5   10    30        41    29    65     56        1.13
248     100    0    0     0        29    27    64     56        1.00
270      50    0   50     0        19    23    61     53        0.85

Table 2.4: Overall performance on all collections. The weights (AP, FP, FP.F, FP.G) are the mixing coefficients in percent. The classification accuracies (DB-S, DB-L, DB-MS, DB-ML) are rounded and given in percent.

Combining All

Figure 2.6 shows the accuracies obtained when all similarity sources are combined. There are a total of 270 possible combinations, using a step size of 5 percentage points and limiting AP to a mixing coefficient between 100-50% and the other sources to 0-50%. Analogously to the previous results, FP.F has the weakest performance and the improvements for the Magnatune collection are not very exciting. As in Figure 2.5, the smooth changes of the accuracy with respect to the mixing coefficient are an indicator for the robustness of the approach (within each collection). Without the artist filter the combinations on DB-MS reach a maximum of 81% (compared to 79% using only AP). It is clearly noticeable that the results on the collections are quite different. For example, for DB-S using as little AP as possible (highest values around 45-50%) and a lot of FP.G (highest values around 25-40%) gives the best results. On the other hand, for the DB-MS collection the best results are obtained using 90% AP and only 5% FP.G. These deviations indicate overfitting, thus we analyze the performances across collections in the next section.

Overall Performance

To study overfitting we compute the relative performance gain compared to the AP baseline (i.e. using only AP). We compute the score (which we want to maximize) as the average of these gains over the four collections. The results are given in Table 2.4.

[Figure: five panels (Average, DB-S, DB-L, DB-MS, DB-ML) showing the relative performance of each combination (y-axis) against its rank by score (x-axis, 1 to 270).]

Figure 2.7: Individual relative performance ranked (x-axis) by score (y-axis).

The worst combination (using 50% AP and 50% FP.F) yields a score of 0.85. (That is, on average, the accuracy using this combination is 15% lower compared to using 100% AP.) There are a total of 247 combinations which perform better than the AP baseline. Almost all of the 22 combinations that fall below AP have a large contribution of FP.F. The best score is 14% above the baseline. The ranges of the top 10 ranked combinations are 55-75% AP, 5-20% FP, 5-10% FP.F, and 10-30% FP.G.

Without the artist filter, for DB-MS the top three ranked combinations from Table 2.4 have the accuracies 1: 79%, 2: 78%, 3: 79% (the AP baseline is 79%, the best possible combination yields 81%). For the DB-S collection without the artist filter, the AP baseline is 52% and the top three ranked combinations have the accuracies 1: 63%, 2: 61%, 3: 62% (the best possible score achieved through combination is 64%). This is another indication that genre classification and artist identification are not the same type of problem. Thus, it is necessary to ensure that all pieces from an artist (if all pieces from an artist belong to the same genre) are either in the training or the test set.

Figure 2.7 shows the relative performance of all combinations ranked by their score. As can be seen, there are significant deviations. In several cases a combination performs well on one collection and poorly on another. This indicates that there is a large potential for overfitting if the necessary precautions are not taken (such as using several different music collections). However, another observation is that although there is a high variance, the performance stays above the baseline for most of the combinations and there is a common trend. Truly reliable results would require further testing on additional collections.

2.3.5 Conclusions

In this section we have presented an approach to improve audio-based music similarity and genre classification. We have combined spectral similarity with three additional information sources based on Fluctuation Patterns. In particular, we have presented two new descriptors and a series of experiments evaluating the combinations. Although we obtained an average performance increase of 14%, our findings confirm the glass ceiling observed in [2]. Preliminary results with a larger number of descriptors indicate that the performance per collection can only be further improved by up to 1-2 percentage points. However, the danger of overfitting is imminent. Our results show that there is a significant difference in the overall performance if pieces from the same artist are in both the test and training sets. We believe this shows the necessity of using an artist filter to evaluate genre classification performance (if all pieces from an artist are assigned to the same genre). Furthermore, the deviations between the collections suggest that it is necessary to use different collections to avoid overfitting. One possible future direction is to focus on developing similarity measures for specific music collections (analogously to developing specialized classifiers able to distinguish only two genres). However, combining audio-based approaches with information from different sources (such as the web), or modeling the cognitive process of music listening, are more likely to help us get beyond the glass ceiling.

2.4 Summary & Recommendations

In this chapter we have followed two paths. The motivation for following the first one is that spectral similarity, as we use it, does not capture many aspects of the audio signal which are very important for the perception of timbre (such as the attack or decay). Although we were able to show that using HMMs allows us to better model a song, we do not recommend their use in the SIMAC prototypes, primarily because of the drastic increase in computation time. Furthermore, in terms of genre classification (which is not the best choice for evaluation) the performance does not improve significantly. However, applying HMMs to model temporal aspects for spectral similarity appears to be an interesting direction for future research.

The second path we followed in this chapter was to combine what we knew works best with other approaches. As a result we have found a combination which significantly improves the results on some of the collections we used for evaluation. We recommend the use of this combination, as described in detail above, for D3.5.1 and the prototypes. The implementation of the fluctuation patterns and the spectral similarity is available in the MA toolbox for Matlab.

3. Web-Based Similarity

In this chapter we propose an alternative, which we have published in [33], to the web-based similarity measure described in detail in D3.2.1. The similarity measure operates on artist names based on the results of Google queries. Co-occurrences of artist names on web pages are analyzed to measure how often two artists are mentioned together on the same web page. We estimate conditional probabilities using the extracted page counts. These conditional probabilities give a similarity measure which is evaluated using a data set containing 224 artists from 14 genres. For evaluation, we use two different methods, intra-/intergroup-similarities and k-Nearest Neighbors classification. Furthermore, a confidence filter and combinations of the results gained from three different query settings are tested. It is shown that these enhancements can raise the performance of the web-based similarity measure. Comparing the results to those of similar approaches shows that our approach, though being quite simple, performs well and can be used as a similarity measure that incorporates “social knowledge”.

The approach is similar to the approach presented in [38]. The main difference is that we calculate the complete distance matrix. This offers additional information since we can also predict which artists are not similar. Such information is necessary, for example, when it comes to creating playlists that incorporate a broad variety of different music styles. Moreover, in [38], artists are extracted from “Listmania!”, which uses the database of the Amazon web shop. The number of artists in this database is obviously smaller than the number of artist-related web pages indexed by Google. For example, most local artists or artists without a record deal are not contained. Thus, the approach of [38] cannot be used for such artists.

A shortcoming of the co-occurrence approach is that creating a complete distance matrix has quadratic computational complexity in the number of artists. Despite this fact, the approach is quite fast for small- and medium-sized collections with some hundreds of artists since it is very simple and does not rely on extracting and weighting hundreds of thousands of words like the tf·idf approach of [18]. Moreover, using heuristics could reduce the computational complexity.


3.1 Web Mining by Co-occurrence Analysis

Since our similarity measure is based on artist co-occurrences, we need to count how often artist names are mentioned together on the same web page. To obtain these page counts, the search engine Google was used. Google has been chosen for the experiments because it is the most popular search engine at the moment. Furthermore, investigations of different search engines showed that Google yields the best results for musical web crawling [18].

Given a list of artist names, we use Google to estimate the number of web pages containing each artist and each pair of artists. Since we are not interested in the content of the found web pages, but only in their number, the search is restricted to display only the top-ranked page. In fact, the only information we use is the page count that is returned by Google. This raises performance and limits web traffic. The outcome of this procedure is a symmetric matrix C, where element c_{ij} gives the number of web pages containing the artist with index i together with the one indexed by j. The values of the diagonal elements c_{ii} show the total number of web pages containing artist i.

Based on the page count matrix C, we then use relative frequencies to calculate a conditional probability matrix P as follows. Given two events a_i (artist with index i is mentioned on a web page) and a_j (artist with index j is mentioned on a web page), we estimate the conditional probability p_{ij} (the probability for artist j to be found on a web page that is known to contain artist i) as shown in Formula 3.1.

\[ p_{ij} = p(a_i \wedge a_j \mid a_i) = \frac{c_{ij}}{c_{ii}} \qquad (3.1) \]

Obviously, P is not symmetric. Since we need a symmetric similarity function in order to use k-NN, we compute a symmetric equivalent P_s by simply calculating the arithmetic mean of p_{ij} and p_{ji} for every pair of artists i and j.

Addressing the problem of finding only music-related web pages, we used three different query settings.

• “artist1” “artist2” music
• “artist1” “artist2” music review
• allintitle: “artist1” “artist2”

The first one, in the following abbreviated as M, searches only for web pages containing the two artist names as exact phrases and the word “music”. The second one, which has already been used in [37], restricts the search to pages containing the additional terms “music” and “review”. This setting, abbreviated as MR, was used to compare our results to those of [18]. The third setting (allintitle) only takes into consideration web pages containing the two artists in their title. It is the most limiting setting, and the resulting page count matrices are quite sparse. However, our evaluation showed that this setting performs quite well on the k-NN classification task and can be used successfully in combination with M or MR.
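To illustrate how the page counts are turned into the similarity matrix P_s, the following Python sketch builds the matrices C, P, and P_s for a list of artist names using the M query setting. The function page_count is a hypothetical placeholder for whatever mechanism returns Google's page count for a query string; the actual retrieval is not shown here, and the variable names are ours.

import numpy as np

def page_count(query):
    """Hypothetical placeholder: return the number of pages Google
    reports for the given query string (retrieval not shown)."""
    raise NotImplementedError

def cooccurrence_similarity(artists):
    n = len(artists)
    C = np.zeros((n, n))
    # Diagonal: pages containing a single artist plus the word "music".
    for i, a in enumerate(artists):
        C[i, i] = page_count('"%s" music' % a)
    # Off-diagonal: pages containing both artists (the M query setting).
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = C[j, i] = page_count('"%s" "%s" music' % (artists[i], artists[j]))
    # Conditional probabilities p_ij = c_ij / c_ii (Formula 3.1); assumes c_ii > 0.
    P = C / C.diagonal()[:, np.newaxis]
    # Symmetric version: arithmetic mean of p_ij and p_ji.
    Ps = 0.5 * (P + P.T)
    return C, P, Ps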

3.2 Experiments and Evaluation

We conducted our experiments on the data set already used in [18]. It comprises 14 quite general and well-known genres with 16 assigned artists each. A complete list can be found on the Internet¹. Two different evaluation methods were used: ratios between intra- and intergroup-similarities and hold-out experiments using k-NN classification.

3.2.1 Intra-/Intergroup-Similarities

This evaluation method is used to estimate how well the given genres are distinguished by our similarity measure P. For each genre, the ratio between the average intragroup-probability and the average intergroup-probability is calculated. The higher this ratio, the better the differentiation of the respective genre. The average intragroup-probability for a genre g is the probability that two arbitrarily chosen artists a and b from genre g co-occur on a web page that is known to contain either artist a or b. The average intergroup-probability for a genre g is the probability that two arbitrarily chosen artists a (from genre g) and b (from any other genre) co-occur on a web page that is known to contain either artist a or b. Thus, the average intragroup-probability gives the probability that two artists from the same genre co-occur. The average intergroup-probability gives the probability that an artist from genre g co-occurs with an artist not from genre g.

Let A be the set of all artists and A_g the set of artists assigned to genre g. Formally, the average intra- and intergroup-probabilities are given by Equations 3.2 and 3.3, where |A_g| is the cardinality of A_g and A \setminus A_g is the set A without the elements contained in the set A_g.

\[ \mathrm{intra}_g = \frac{\sum_{a_1 \in A_g} \sum_{a_2 \in A_g,\, a_2 \neq a_1} p_{a_1 a_2}}{|A_g|^2 - |A_g|} \qquad (3.2) \]

\[ \mathrm{inter}_g = \frac{\sum_{a_1 \in A_g} \sum_{a_2 \in A \setminus A_g} p_{a_1 a_2}}{|A \setminus A_g| \cdot |A_g|} \qquad (3.3) \]

¹ http://www.cp.jku.at/people/schedl/music/artist_list_224.pdf

Obviously, the ratio intra_g / inter_g should be greater than 1.0 if the similarity measure is to be of any use.
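Both quantities are straightforward to compute from the probability matrix. The Python sketch below does so, given the matrix P from above and a genre label per artist; the function name and the numpy index handling are ours.

import numpy as np

def intra_inter_ratio(P, genres, g):
    """Intra-/intergroup-probabilities and their ratio for genre g
    (Equations 3.2 and 3.3)."""
    genres = np.asarray(genres)
    idx = np.where(genres == g)[0]        # artists in genre g
    rest = np.where(genres != g)[0]       # all other artists
    block = P[np.ix_(idx, idx)]
    # Exclude the diagonal (a2 != a1) from the intragroup average.
    intra = (block.sum() - np.trace(block)) / (len(idx) ** 2 - len(idx))
    inter = P[np.ix_(idx, rest)].sum() / (len(idx) * len(rest))
    return intra, inter, intra / inter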

Results and Discussion

Table 3.1 shows the results of evaluating our co-occurrence approach with this first evaluation method. It can be seen that the allintitle-setting yields the best results, as the average intergroup-similarities are very low. Hence, nearly no artists from different genres occur together in the title of the same web page. Especially for the genres “Jazz” and “Classical”, the results are excellent. However, for “Alternative Rock/Indie” and one other genre the ratios are quite low. This can be explained by the low average intragroup-similarities for these genres. Thus, artists belonging to these genres are seldom mentioned together in titles. Analyzing the page count matrices revealed that the allintitle-setting yields good results if web pages containing artists from the same genre in their title are found. If not, the results are obviously quite bad. This observation motivated us to conduct experiments with confidence filters and combinations of the allintitle-setting with M and MR. These experiments are described in detail in the next section. Moreover, Table 3.1 shows that, aside from “Classical”, “Blues” is distinguished quite well. Also remarkable is the very bad result for “Folk” music in the MR-setting. This may be explained by intersections with other genres, e.g. “Country”.

The approach presented in [39] was tested on the list of artists already used in [18]. The results, which are visualized in Table 3.2, are slightly worse than the results using our approach on the same data set. An explanation for this is that we use an asymmetric similarity measure that, for each pair of artists (artist1 and artist2), incorporates probability estimations for artist1 being mentioned on web pages containing artist2 and for artist2 appearing on web pages of artist1. This additional information is lost when using the normalization method proposed in [39].

In Table 3.3, the evaluation results for the approach of [18], again using exactly the same list of artists, are depicted. To obtain them, the distances between the feature vectors gained from the tf·idf calculations are computed for every pair of artists. This gives a complete similarity matrix. Since most of the query settings used in [18] differ from ours, we can only compare the results of the MR-setting. Taking a closer look at the results shows that tf·idf performs better for eight genres, while our approach performs better for six genres. However, the mean of the ratios is better for our approach because of the high value for the genre “Classical”. A possible explanation is that web pages concerning classical artists often also contain words which are used on pages of other genres’ artists. In contrast, classical artist names seem to be mentioned only together with other artists belonging to the same genre, which is reflected by the very high ratios of our approach for this genre.


Figure 3.1: Accuracies in percent for single and combined similarity measures using 9-NN t15-validation and the confidence filter. The combined results are depicted as dotted lines. It is remarkable that the high values for the allintitle-accuracies come along with up to 18% of unpredictable artists. All other measures (single and combined) leave no data items unpredicted.

3.2.2 Classification with k-Nearest Neighbors

The second set of evaluation experiments was conducted to show how well our similarity measure works for classifying artists into genres. For this purpose, the widely used technique of k-Nearest Neighbors was chosen. This technique simply uses for prediction the k data items that have minimal distance to the item that is to be classified. The most frequent class among these k data items is predicted for the unclassified data item. As for the partitioning of the complete data set into training set and test set, we used different settings, referred to as tx, where x is the number of data items from each genre that are assigned to the training set. In a t15-setting, for example, 15 artists from each genre are used for training and one remains for testing. For measuring the distances between two data items, we use the similarities given by the symmetric probability matrix P_s.
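As a concrete illustration of the tx hold-out procedure, the Python sketch below draws x training artists per genre at random and classifies the remaining artists with 9-NN on the symmetric similarity matrix P_s; repeating this many times and averaging gives accuracies of the kind reported below. The helper names are ours, and the majority vote breaks ties arbitrarily.

import numpy as np
from collections import Counter

def tx_split(genres, x, rng):
    """Boolean masks for a tx training/test split (x artists per genre for training)."""
    genres = np.asarray(genres)
    train = np.zeros(len(genres), dtype=bool)
    for g in np.unique(genres):
        members = np.where(genres == g)[0]
        train[rng.choice(members, size=x, replace=False)] = True
    return train, ~train

def knn_accuracy(Ps, genres, x=8, k=9, seed=0):
    """Single tx run: k-NN genre classification accuracy using similarity matrix Ps."""
    rng = np.random.default_rng(seed)
    genres = np.asarray(genres)
    train, test = tx_split(genres, x, rng)
    train_idx = np.where(train)[0]
    correct = 0
    for i in np.where(test)[0]:
        nearest = train_idx[np.argsort(-Ps[i, train_idx])[:k]]
        predicted = Counter(genres[nearest]).most_common(1)[0][0]
        correct += (predicted == genres[i])
    return correct / test.sum()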


Figure 3.2: Accuracies in percent for different combinations of the three settings (allintitle, M, MR) and different training set sizes. 9-NN classification was used.

We ran all experiments 1,000 times to minimize the influence of statistical outliers on the overall results. The accuracy, in the following used for measuring performance, is defined as the percentage of correctly classified data items over all classified data items in the test set. Since the usage of confidence filters may result in unclassified data items, we introduce the prediction rate, which we define as the percentage of classified data items in the complete test set.

In a first test with setting t8, k-NN with k = 9 performed best, so we simply used 9-NN for classification in the subsequent experiments. It is not surprising that values around 8 perform best in a t8-setting, because in this case the number of data items from the training set that are used for prediction equals the number of data items chosen from each class to represent the class in the training set. The t8-setting without any confidence filter gives accuracies of about 69% for M, about 59% for MR, and about 74% for allintitle. Using setting t15, these results can be improved for M (≈75% using 9-NN) and for allintitle (≈80% using 6-NN). For MR, no remarkable improvement could be achieved.


Figure 3.3: Accuracy plotted against prediction rate for different training set sizes and 9-NN classification. Only the uncombined allintitle-setting was used for this plot.

In the case that no confidence filter is used, as in the first tests described above, a random genre is predicted for the artist to be classified if his/her similarity to all artists in the training set is zero. Due to the sparseness of its similarity matrix, this problem mainly concerns the allintitle-measure. To overcome the problem and benefit from the good performance of the allintitle-measure while also addressing the sparseness of the respective similarity matrix, we tried out some confidence filters to combine the similarity measures that use the three different query settings. The basic idea is to use the allintitle-measure if the confidence in its results is high enough. If not, the M- or MR-measure is used to classify an unknown data item. We experimented with confidence filters using mathematical properties of the distances between the unclassified data item and its nearest neighbors. The best results, however, were achieved with a very simple approach based on counting the number of elements with a probability/similarity of zero in the set of the nearest neighbors. If this number exceeds a given threshold, the respective data item is not classified with the allintitle-measure, but the M- or MR-measure is used instead.

Figure 3.4: Confusion matrix for the averaged results of 1,000 runs using 9-NN t15-validation. The confidence filter was applied to the allintitle-setting. The values are the average accuracies in percent.

Using this method, only artists that co-occur at least with some others in the title of some web pages are classified with allintitle. On the other hand, if not enough information for a certain artist is available in the allintitle-results, MR or M is used instead. These two measures usually give enough information for prediction. Indeed, their prediction rates equal 100% for the data set used for our evaluations. This is also manifested in Figure 3.1, which shows that the accuracies for MR and M are independent of the threshold for the confidence filter.
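The confidence-filtered classification can be summarized in a few lines. The Python sketch below assumes the symmetric similarity matrices for the allintitle and M settings (as computed above), a set of training indices with genre labels, and classifies one test artist; the function name and the majority-vote helper are ours.

import numpy as np
from collections import Counter

def classify_with_confidence_filter(Ps_allintitle, Ps_M, genres, train_idx,
                                    test_i, k=9, threshold=2):
    """9-NN prediction with allintitle similarities; fall back to the M setting
    when more than `threshold` of the k nearest neighbors have zero similarity."""
    sims = Ps_allintitle[test_i, train_idx]
    nearest = np.argsort(-sims)[:k]              # positions within train_idx
    if np.sum(sims[nearest] == 0) > threshold:   # not enough allintitle information
        sims = Ps_M[test_i, train_idx]
        nearest = np.argsort(-sims)[:k]
    labels = np.asarray(genres)[train_idx][nearest]
    return Counter(labels).most_common(1)[0][0]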

Results and Discussion

We already mentioned the classification accuracies of up to 80% for uncombined measures. Since we wanted to analyze to what extent the performance can be improved when using combinations, we conducted t15-validations using either a single measure or combinations of allintitle with MR and M. The results are shown in Figure 3.1. Along the abscissa, the influence of different thresholds for the confidence filter can be seen. The falling accuracies for allintitle with rising threshold values confirm our assumption that the performance of the allintitle-measure depends strongly on the availability of enough information. It is important to note that the uncombined allintitle-measure does not always make a prediction when using the confidence filter, cf. also Figure 3.3. Remarkable are the very high accuracies (fraction of correctly classified artists among classifiable artists) of up to 89.5% for allintitle with a threshold value of 2. However, in this setting, 14% of the artists cannot be classified. Taking a closer look at the MR- and M-settings shows that they reach accuracies of about 54% and 75% respectively and that these results are independent of the threshold for the confidence filter. In fact, MR and M, at least for the used data set, always provide enough information for prediction. Combining the measures by taking allintitle as the primary one and, if no prediction with it is possible, MR or M as fallback also combines the advantages of high accuracies and high prediction rates. Indeed, using the combination allintitle+M gives accuracies of 85% at 100% prediction rate. Since the accuracies for M are much higher than for MR, the combination of allintitle with M yields better results than with MR. Compared to the k-NN results of [18], these accuracies are at least equal although the co-occurrence approach is much simpler than the tf·idf approach. However, the single MR-setting performs quite poorly with our approach. This can be explained by the fact that web pages containing music reviews seldom mention other artists, but usually compare new artists' albums to more recent ones by the same artist.

In addition, we were interested in the number of artists needed to define a genre adequately. For this reason, we ran some experiments using different training set sizes. In Figure 3.2, the results of these experiments for 9-NN classification using the combinations allintitle+M and allintitle+MR are depicted. It was observed that t15 and t8 again provide very high accuracies of up to 85% and 78% respectively. Examining the results of the t4- and t2-settings reveals much lower accuracies. These results are remarkably worse than those of [18] for the same settings (61% for t4 with our approach using 9-NN vs. 76% with the tf·idf approach using 7-NN and the additional search keywords “music genre style”; 35% for t2 with our approach vs. 43% with the tf·idf approach using 7-NN and the same additional keywords). In these two settings, the additional information used by the tf·idf approach seems to be highly valuable. As a final remark on Figure 3.2, we want to point out that the prediction rate for all depicted experiments is 100%.

As already mentioned, the uncombined allintitle-setting using the confidence filter does not always yield a prediction. To analyze the trade-off between accuracy and prediction rate, we plotted these properties for the allintitle-setting in Figure 3.3.

This figure shows that, in general, an increase in accuracy goes along with a decrease in prediction rate. However, an increase in prediction rate accompanied by a slight increase in accuracy, which yields the maximum accuracy values, can be seen at the beginning of each plot. The highest accuracies obtained for the different settings are 89% for t15 (86% prediction rate), 84% for t8 (59% prediction rate), 64% for t4 (34% prediction rate), and 35% for t2 (10% prediction rate). These maximum accuracy values are usually achieved with a threshold of 1 or 2 for the confidence filter. It seems that restricting the number of allowed zero-distance-elements in the set of the nearest neighbors to 0 is counterproductive since it decreases the prediction rate without increasing the accuracy.

Finally, to investigate which genres are likely to be confused with others, we calculated a confusion matrix, cf. Figure 3.4. It can be seen that the genres “Jazz”, “Blues”, “Reggae”, and “Classical” are perfectly distinguished. “Heavy Metal/Hard Rock”, “Electronica”, and “Rock ’n’ Roll” also show very high accuracies of about 95%. For “Country”, “Folk”, “RnB/Soul”, “Punk”, “Rap”, and “Pop”, accuracies between 83% and 89% are achieved. In comparison with the results of [18], where “Pop” achieved only 80%, we reach 88% for this genre. In contrast, our results for the genre “Alternative Rock/Indie” are very bad (≈50%). A more precise analysis reveals that this genre is often confused with “Electronica”, which may be explained by some artists producing music of different styles over time. “Depeche Mode”, for example, was a pioneer of “Synth-Pop” in the 1980s.

3.3 Conclusions & Recommendations

In this section we presented an artist similarity measure based on co-occurrences of artist names on web pages. We used three different query settings (M, MR, and allintitle) to retrieve page counts from the search engine Google. Experiments showed that the allintitle-setting provides high accuracies with k-Nearest Neighbors classification. High prediction rates, however, are achieved with the M-setting. In order to exploit the advantages of both settings, the two measures were combined using a simple threshold-based confidence filter. We showed that this combination gives accuracies of up to 85% at 100% prediction rate (no unclassified artists). These results are at least equal to those presented in [18] when using a sufficient number of training samples from each genre. In [18], however, a much more complex approach, tf·idf, is used. For scenarios with only very few artists available to define a genre, the tf·idf approach performs better due to its extensive use of additional information. In contrast, less information is used in the approach presented in [39]. Our approach differs from that of Zadel and Fujinaga, among other things, in that they use a symmetric similarity measure and a different normalization method. As a result, their approach performs slightly worse than ours.

Further research may focus on the combination of web-based and signal-based data to raise the performance of similarity measures, or to enrich signal-based approaches with cultural metadata from the Internet. Since the data set used for evaluation contains quite general genres and well-known artists, it would be interesting to test our approach on a more specific data set with a more fine-grained genre taxonomy. Finally, heuristics that reduce the computational complexity of our approach should be tested. This would enable us to also process large artist lists.

For the SIMAC prototypes we recommend, depending on the number of artists in the collection, the tf·idf approach if the number of artists is beyond 100, and the simpler co-occurrence approach if it is below. If it is not clear whether the number of artists will increase at a later point in time, preference should be given to tf·idf.
Table 3.1: Results of the evaluation of intra-/intergroup-similarities using our co-occurrence measure. On the left, the results for the queries with the additional keywords music and review are shown. The middle columns show the results for the queries with the additional keyword music. The rightmost columns show the results for the allintitle queries, only taking into account web pages with the artists in their title. For each genre, the average intragroup-probability, the average intergroup-probability, and the ratio between these two probabilities are depicted. The higher the ratio, the better the differentiation of the respective genre.

Table 3.2: Results of the evaluation based on intra-/intergroup-similarities using relatednesses according to [39].

keywords: music, review

genre                       intra avg   inter avg   ratio
Country                     0.118       0.049       2.425
Folk                        0.064       0.043       1.480
Jazz                        0.131       0.048       2.722
Blues                       0.134       0.047       2.875
RnB/Soul                    0.109       0.060       1.812
Heavy Metal/Hard Rock       0.080       0.049       1.618
Alternative Rock/Indie      0.075       0.049       1.521
Punk                        0.098       0.053       1.848
Rap/Hip-Hop                 0.129       0.050       2.545
Electronica                 0.077       0.039       1.985
Reggae                      0.135       0.045       3.025
Rock ’n’ Roll               0.105       0.050       2.099
Pop                         0.081       0.052       1.577
Classical                   0.230       0.025       9.164
mean                                                2.621

Table 3.3: Results of the evaluation based on intra-/intergroup-similarities using the tf·idf approach according to [18].

4. Novelty Detection and Similarity

This chapter presents novelty detection as a tool in MIR to improve the performance of similarity measures. The work presented in this chapter has been submitted to a conference [13]. Novelty detection is the identification of new or unknown data that a machine learning system is not aware of during training (see [22] for a review). It is a fundamental requirement for every good machine learning system to automatically identify data from regions not covered by the training data, since in this case no reasonable decision can be made. In the field of music information retrieval the problem of novelty detection has so far been ignored.

For music information retrieval, the notion of central importance is musical similarity. Proper modeling of similarity enables automatic structuring and organization of large collections of digital music, and intelligent music retrieval in such structured “music spaces”. This can be utilized for numerous different applications: genre classification, playlist generation, music recommendation, etc. What all these different systems lack so far is the ability to decide when a new piece of data is too dissimilar for making a decision. Let us assume, for example, the following user scenario: a user has on her hard drive a collection of songs classified into the three genres ’hip hop’, ’punk’ and ’death metal’; given a new song from a genre not yet covered by the collection (say, a ’reggae’ song), the system should mark this song as ’novel’, therefore needing manual processing, instead of automatically and falsely classifying it into one of the three already existing genres (e.g. ’hip hop’). Another example is the automatic exclusion of songs from playlists because they do not fit the overall flavor of the majority of the list. Novelty detection could also be utilized to recommend new types of music different from a given collection if users are longing for a change.

4.1 Data

For the experiments presented in this chapter we used the DB-ML collection as described in Section 2.1. From the 22050 Hz mono audio signals, two minutes from the center of each song are used for further analysis. We divide the raw audio data into overlapping frames of short duration and use Mel Frequency Cepstrum Coefficients (MFCCs) to represent the spectrum of each frame. The frame size for computation of MFCCs for our

experiments was 23.2ms (512 samples), with a hop-size of 11.6ms (256 samples) for the overlap of frames. The average energy of each frame’s spectrum was subtracted. We used the first 20 MFCCs for all our experiments.
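For readers who want to reproduce this feature extraction step, the following Python sketch computes frame-wise MFCCs with librosa using the same frame and hop sizes. The center-segment extraction is included, but the subtraction of each frame's average spectral energy is omitted, so this is an illustrative sketch rather than the exact WP2 implementation; the function name and the use of librosa are our choices.

import librosa
import numpy as np

def song_mfccs(path, sr=22050, minutes=2.0):
    """Load a song, take `minutes` from its center, and compute 20 MFCCs
    per 23.2 ms frame (512 samples) with an 11.6 ms hop (256 samples)."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    half = int(minutes * 60 * sr / 2)
    center = len(y) // 2
    segment = y[max(0, center - half):center + half]
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=20,
                                n_fft=512, hop_length=256)
    return mfcc  # shape: (20, number_of_frames)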

4.2 Methods

4.2.1 Music Similarity

The approach presented in this chapter can be applied to any type of music similarity. For the experiments presented here we use the spectral similarity as described in Section 2.3.1.

4.2.2 Algorithms for novelty detection

Ratio-reject: The first reject rule is based on density information about the training data captured in the similarity matrix. An indication of the local densities can be gained from comparing the distance between a test object X and its nearest neighbor in the training set, NN^{tr}(X), with the distance between this NN^{tr}(X) and its nearest neighbor in the training set, NN^{tr}(NN^{tr}(X)) [35]. The object is regarded as novel if the first distance is much larger than the second distance. Using the following ratio

\[ \rho(X) = \frac{\| d(X, NN^{tr}(X)) \|}{\| d(NN^{tr}(X), NN^{tr}(NN^{tr}(X))) \|} \qquad (4.1) \]

we reject X if:

\[ \rho(X) > \varepsilon[\rho(X^{tr})] + s \cdot \mathrm{std}(\rho(X^{tr})) \qquad (4.2) \]

with ε[ρ(X^{tr})] being the mean of all quotients ρ(X^{tr}) inside the training set and std(ρ(X^{tr})) the corresponding standard deviation (i.e. we assume that the ρ(X^{tr}) have a normal distribution). Parameter s can be used to change the probability threshold for rejection. Setting s = 3 means that we reject a new object X if its ratio ρ(X) is larger than the mean ρ within the training set plus three times the corresponding standard deviation. In this case a new object is rejected because the probability of its distance ratio ρ(X) is less than 1% when compared to the distribution of ρ(X^{tr}). Setting s = 2 rejects objects less probable than 5%, s = 1 less than 32%, etc.

Knn-reject: It is possible to directly use nearest neighbor classification to reject new data with a higher risk of being misclassified [17]: reject X if not:

\[ g(NN_1^{tr}(X)) = g(NN_2^{tr}(X)) = \ldots = g(NN_k^{tr}(X)) \qquad (4.3) \]

with NN_i^{tr}(X) being the i-th nearest neighbor of X in the training set, g() a function which gives the genre information for a song, and i = 1,...,k. A new object X is rejected if the k nearest neighbors do not agree on its classification. This approach will work for novelty detection if new objects X induce high confusion in the classifier. The higher the value for k, the more objects will be rejected.
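A compact way to see how the two reject rules operate on a precomputed distance matrix is sketched below in Python. The code assumes a full training-set distance matrix, a vector of distances from a test song to all training songs, and (for Knn-reject) genre labels; the function names and the numpy formulation are ours, not the exact implementation used for the experiments, and the handling of ties within the training set is deliberately simple.

import numpy as np

def _rho(d_to_train, D_train):
    """Distance ratio of Equation 4.1 for one object, given its distances
    to the training set and the training-set distance matrix."""
    nn = np.argmin(d_to_train)                  # nearest training song
    d1 = d_to_train[nn]
    row = D_train[nn].copy()
    row[nn] = np.inf                            # exclude the song itself
    d2 = row[np.argmin(row)]                    # distance to its own nearest neighbor
    return d1 / d2

def ratio_reject(d_to_train, D_train, s=2.0):
    """Reject if rho exceeds mean + s * std of the ratios inside the training set
    (Equation 4.2)."""
    n = D_train.shape[0]
    rhos = np.empty(n)
    for i in range(n):
        row = D_train[i].copy()
        row[i] = np.inf                         # leave-one-out within the training set
        rhos[i] = _rho(row, D_train)
    threshold = rhos.mean() + s * rhos.std()
    return _rho(d_to_train, D_train) > threshold

def knn_reject(d_to_train, train_genres, k=3):
    """Reject if the k nearest training songs do not agree on the genre
    (Equation 4.3)."""
    nearest = np.argsort(d_to_train)[:k]
    labels = np.asarray(train_genres)[nearest]
    return len(set(labels.tolist())) > 1

For a new song, ratio_reject and knn_reject would be called with the song's distances to the training collection; rejected songs are then flagged as novel instead of being classified.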

4.3 Results

To evaluate the two novelty detection approaches described in Sec. 4.2.2 we use the following approach, shown as pseudo-code in Table 4.1. First we set aside all songs belonging to a genre g as new songs ([new,data] = separate(alldata,g)), which yields data sets new and data (all songs not belonging to genre g). Then we do a ten-fold crossvalidation using data and new: we randomly split data into train and test folds ([train,test] = split(data,c)), with train always consisting of 90% and test of 10% of data. We compute the percentage of new songs which are rejected as being novel (novel_reject(g,c) = novel(new)) and do the same for the test songs (test_reject(g,c) = novel(test)). Last, we compute the accuracy of the nearest neighbor classification on test data that has not been rejected as being novel (accuracy(g,c) = classify(test(not test_reject))). The evaluation procedure gives G × C (22 × 10) matrices of novel_reject, test_reject and accuracy for each parameterization of the novelty detection approaches.

Table 4.1: Outline of Evaluation Procedure

for g = 1 : G
  [new,data] = separate(alldata,g)
  for c = 1 : 10
    [train,test] = split(data,c)
    novel_reject(g,c) = novel(new)
    test_reject(g,c) = novel(test)
    accuracy(g,c) = classify(test(not test_reject))
  end
end

The results for novelty detection based on the Ratio-reject and the Knn-reject rule are given in Figs. 4.1 and 4.2 as Receiver Operating Characteristic (ROC) curves [24]. To obtain an ROC curve, the fraction of false positives (object is not novel but it is rejected, in our case test_reject) is plotted versus the fraction of true positives (object is novel and correctly rejected, in our case novel_reject). An ROC curve shows the tradeoff between how sensitive and how specific a method is. Any increase in sensitivity will be accompanied by a decrease in specificity. If a method becomes more sensitive towards novel objects it will reject more of them, but at the same time it will also become less specific and falsely reject more non-novel objects. Consequently, the closer a curve follows the left-hand border and then the top border of the ROC space, the more accurate the method is. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the method. We plot the mean test_reject versus the mean novel_reject for falling values of s (Ratio-reject) and growing values of k (Knn-reject). In addition, the mean accuracy for each of the different values of s and k is depicted as a separate curve. All means are computed across all 22 × 10 corresponding values. The accuracy without any rejection due to novelty detection is 70%.


Figure 4.1: Ratio-reject ROC, mean test reject vs. novel reject (circles, solid line) and accuracy (diamonds, broken line) for ’no rejection’, s=5,3,2,1,0.

Ratio-reject: The results for novelty detection based on the Ratio-reject rule are given in Fig. 4.1. With the probability threshold for rejection set to s = 2 (rejection because data is less probable than 5%), the accuracy rises up to 79%, while 19% of the test songs are falsely rejected as being novel (and therefore not classified at all) and 42% of the new songs are rejected correctly. If one is willing to lower the threshold to s = 0 (rejection because data is less probable than 50%), the accuracy is at 92%, with already 49% of the test songs rejected erroneously and 84% of the new songs rejected correctly.


Figure 4.2: Knn-reject ROC, mean test reject vs. novel reject (circles, solid line) and accuracy (diamonds, broken line) for k=1 (no rejection) and k=2,3,4,5,6,7,8,9,10,20.

Knn-reject: The results for novelty detection based on the Knn-reject rule are given in Fig. 4.2. If k is set to 2, the accuracy rises up to 89%, while 35% of the test songs are wrongly rejected as being novel (and therefore not classified at all) and 65% of the new songs are rejected correctly. With k = 3 the accuracy values start to saturate at 95%, with already 49% of the test songs rejected erroneously and 81% of the new songs rejected correctly.

4.4 Discussion

We have presented two approaches to novelty detection, where the first (Ratio-reject) is based directly on the distance matrix and does not, contrary to Knn-reject, need the genre labels. When comparing the two ROC curves given in Figs. 4.1 and 4.2 it can be seen that both approaches work approximately equally well. For example, the performance of the Ratio-reject rule with s = 1 resembles that of the Knn-reject rule with k = 2. The same holds for s = 0 and k = 3. The increase in accuracy is also comparable for both methods. Depending on how much specificity one is willing to sacrifice, the accuracy can be increased from 70% to well above 90%. Looking at both ROC curves, we would say that they indicate quite fair accuracy for both novelty detection methods. When judging genre classification results, it is important to remember that the human error in classifying some of the songs already gives rise to a certain percentage of misclassification. Inter-rater reliability between a number of music experts is usually far from perfect for genre classification. Given that the genres for our data set are user and not expert defined, and therefore even more problematic, it is not surprising that there is a considerable decrease in specificity for both methods.

Of course there is still room for improvement in novelty detection for music similarity. The two presented methods are a first attempt to tackle the problem and could probably be improved themselves. One could change the Knn-reject rule given in Eq. 4.3 by introducing a weighting scheme which puts more emphasis on closer than on distant neighbors. Then there is a whole range of alternative methods which could be explored: probabilistic approaches (see e.g. [7]), Bayesian methods [21] and neural network based techniques (see [22] for an overview).

Finally, we would like to point out that whereas the Knn-reject rule is bound to the genre classification framework, Ratio-reject is not. Knn-reject is probably the method of choice if classification is the main interest. Any algorithm that is able to find a range of nearest neighbors in a database of songs can be used together with the Knn-reject rule. Ratio-reject, on the other hand, has an even wider applicability. It is a general method to detect novel songs given a similarity matrix of songs. Since it does not need genre information, it could be used for anything from playlist generation and music recommendation to music organization and visualization.

5. Chroma-Complexity Similarity

In this chapter we use the chromagram implementation of D2.1.1 developed by Chris Harte at QMUL to compute descriptors for similarity measures which could be useful for playlist generation and related tasks. We briefly review the chromagram implementation based on the constant Q transform and how we use it to compute a measure of chroma complexity. (Note that chroma complexity is closely related to chord complexity.) We discuss possibilities for further development and how the prototypes can benefit from the descriptors developed in WP2. The general approach we apply can be applied to any similar mid-level representation and thus opens the way for further integration of WP2 results in WP3.

The following description of the chromagram calculation and the chromagram tuning has been copied from a paper [6] which was submitted to a conference and will be part of D2.1.2. The remaining work was part of the collaboration between Chris Harte and Juan Bello from QMUL and Elias Pampalk from OFAI visiting QMUL.

5.1 Chromagram Calculation

A standard approach to modeling pitch perception is as a function of two attributes: height and chroma. Height relates to the perceived pitch increase that occurs as the frequency of a sound increases. Chroma, on the other hand, relates to the perceived circularity of pitched sounds from one octave to the other. The musical intuitiveness of chroma makes it an ideal feature representation for note events in music signals. A temporal sequence of chroma vectors results in a time-frequency representation of the signal known as the chromagram.

A common method for chromagram generation is the constant Q transform [9]. It is a spectral analysis where frequency domain channels are not linearly spaced, as in DFT-based analysis, but logarithmically spaced, thus closely resembling the frequency resolution of the human ear. The constant Q transform X_{cq} of a temporal signal x(n) can be calculated as:

\[ X_{cq}(k) = \sum_{n=0}^{N(k)-1} w(n,k)\, x(n)\, e^{-j 2\pi f_k n} \qquad (5.1) \]

where both the analysis window w(n,k) and its length N(k) are functions of the bin position k. The center frequency f_k of the k-th bin is defined according to the frequencies

of the equal-tempered scale such that:

\[ f_k = 2^{k/\beta} f_{min} \qquad (5.2) \]

where β is the number of bins per octave, thus defining the resolution of the analysis, and f_{min} defines the starting point of the analysis in frequency. From the constant Q spectrum X_{cq}, the chroma for a given frame can then be calculated as:

\[ \mathrm{Chroma}(b) = \sum_{m=0}^{M} |X_{cq}(b + m\beta)| \qquad (5.3) \]

where b ∈ [1, β] is the chroma bin number, and M is the total number of octaves in the constant Q spectrum. We downsample the signal to 11025Hz, use β = 36, and the analysis is performed between f_{min} = 98Hz and f_{max} = 5250Hz. The resulting window length and hop size are 8192 and 1024 samples respectively.
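To make the octave folding of Equation 5.3 concrete, the Python sketch below computes a 36-bin chromagram from a constant-Q magnitude spectrogram. Librosa's constant-Q transform is used as a stand-in for the implementation of [9], so the window handling differs in detail from the description above, and the function name is ours.

import numpy as np
import librosa

def chromagram(path, sr=11025, fmin=98.0, fmax=5250.0, bins_per_octave=36, hop=1024):
    """36-bin chromagram obtained by folding a constant-Q spectrum over octaves."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    n_bins = int(np.floor(bins_per_octave * np.log2(fmax / fmin)))
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop, fmin=fmin,
                           n_bins=n_bins, bins_per_octave=bins_per_octave))
    # Fold all octaves onto one: Chroma(b) = sum_m |X_cq(b + m*beta)|  (Eq. 5.3)
    chroma = np.zeros((bins_per_octave, C.shape[1]))
    for b in range(bins_per_octave):
        chroma[b] = C[b::bins_per_octave].sum(axis=0)
    return chroma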

5.2 Chromagram Tuning

Real-world recordings are often not perfectly tuned, and slight differences between the tuning of a piece and the expected position of energy peaks in the chroma representation can have an important influence on the estimation of chords. The 36-bin per octave resolution is intended to clearly map spectral components to a particular semitone regardless of the tuning of the recording. Each note in the octave is mapped to 3 bins in the chroma, such that bias towards a particular bin (i.e. sharpening or flattening of notes in the recording) can be spotted and corrected. To do this we use a simpler version of the tuning algorithm proposed in [16]. The algorithm starts by picking all peaks in the chromagram. The resulting peak positions are quadratically interpolated and mapped to the [1.5, 3.5] range. A histogram is generated from this data, such that skewness in the distribution is indicative of a particular tuning. A corrective factor is calculated from the distribution and applied to the chromagram by means of a circular shift. Finally, the tuned chromagram is low-pass filtered to eliminate sharp edges.

5.3 Chromagram Processing

To emphasize certain patterns in the chromagram and to remove temporal variations we use several filters. (1) We use a Gaussian filter over time. This window is very large and removes variations within 50ms. This helps reduce the impact of, for example, the broad spectrum of sharp attacks. (2) We use a loudness normalization to remove the impact of the changing loudness levels in different sections of the piece. (3) We use gradient filters to emphasize horizontal lines in the chromagram. (4) We smooth the chromagram over the tone scale to a resolution of about one semitone (i.e. 12 bins instead of 36). This smoothing is done circularly, in accordance with the distance between semitones. The results are the chromagrams displayed in the figures of this chapter.
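A rough version of this post-processing chain is sketched below in Python; the gradient filtering step (3) is omitted, the filter widths are placeholders rather than the values used for the figures, and the semitone smoothing simply sums the three bins around each semitone, so this should be read as an approximation of the described pipeline rather than the actual implementation.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def postprocess_chromagram(chroma, sigma_frames=4.0):
    """chroma: array of shape (36, n_frames) as returned by chromagram()."""
    # (1) Gaussian smoothing over time to suppress short-term variations.
    smoothed = gaussian_filter1d(chroma, sigma=sigma_frames, axis=1)
    # (2) Loudness normalization: scale each frame to unit maximum.
    peaks = smoothed.max(axis=0, keepdims=True)
    normalized = smoothed / np.maximum(peaks, 1e-12)
    # (4) Circular smoothing over the tone scale down to semitone resolution
    #     (12 bins instead of 36): sum each bin with its circular neighbors
    #     and keep every third bin.
    circular = (np.roll(normalized, 1, axis=0) + normalized
                + np.roll(normalized, -1, axis=0))
    semitone = circular[::3]
    return semitone  # shape: (12, n_frames)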

5.4 Chroma Complexity

Depending on the number of chords and their similarity, the patterns which appear in a chromagram might be very complex or very simple. To measure this we use clustering. In particular, the chromagram is clustered with k-means, finding groups of similar chroma patterns. The clustering algorithm starts with 8 clusters. If two clusters are very similar, the groups are merged. This is repeated until convergence or until only 2 groups are left. The similarity is measured using a heuristic for the perceptual similarity of two patterns. To avoid getting stuck in local minima, the clustering is repeated several times (with different initializations). (The time resolution of the chromagram is very low, thus the computation time for clustering is negligible.) This is not the optimal choice for several reasons; alternatives include using, for example, the Bayes information criterion, or avoiding quantization in the first place. In the following we give some examples to illustrate the chroma complexity and to show that there are general tendencies per genre which make the approach interesting for tasks such as genre classification or playlist generation.
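The cluster-and-merge procedure can be approximated with a few lines of Python; here the perceptual-similarity heuristic is replaced by a plain Euclidean distance between cluster centroids with an arbitrary merge threshold, and scikit-learn's k-means is used, so the returned complexity values will not match the ones reported below exactly.

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist

def chroma_complexity(semitone_chroma, max_clusters=8, min_clusters=2,
                      merge_threshold=0.2, n_init=10):
    """Number of distinct chroma patterns (2..8) in a (12, n_frames) chromagram."""
    frames = semitone_chroma.T                  # one 12-dimensional pattern per frame
    k = max_clusters
    while k > min_clusters:
        km = KMeans(n_clusters=k, n_init=n_init).fit(frames)
        # If the two closest centroids are very similar, merge, i.e. retry with k-1.
        if pdist(km.cluster_centers_).min() < merge_threshold:
            k -= 1
        else:
            break
    return k

In the ChromaVisu tool described next, the cluster assignments themselves (which pattern is active when) are visualized in addition to this count.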

5.4.1 ChromaVisu Tool

To study the chromagram patterns we developed a Matlab tool to visualize patterns while listening to the corresponding music. A screenshot is shown in Figure 5.1. The six main components are:

A. The large circle in the upper left is the current chroma-pattern (i.e. the pattern associated with the part of the song just playing).

B. To the right is the mean over all patterns which for many types of music is a useful indicator of the key.

C. To the right are up to eight different chroma patterns which occur frequently. The number above each pattern indicates how often it occurs (percentage). The number of different patterns is determined automatically as described above and is the measure for the chroma complexity.

Figure 5.1: ChromaVisu: a tool to study chroma pattern complexity.

D. Beneath is a fuzzy segmentation (cluster assignment) indicating when which of the eight patterns is active. Each line represents one cluster (with the most frequent cluster in the first row). White means that the cluster is a good match; black is a very poor match. If none of the clusters is a good match, the last line (which does not represent a cluster) is white. Modeling the repetitions and the structure which immediately become apparent is the primary target for further work on chroma complexity.

E. Just below the cluster assignment is a slider which helps track the current position of the song.

F. Below it is the chromagram in a flat representation. The first five rows and the last five rows are repetitions to help recognize patterns on the boundaries.

5.5 Results

In this section we illustrate the chroma complexity for several pieces from different styles. The main characteristic is the number of different chroma patterns. In the current implementation this number is limited to the range 2-8. As can be seen, for some genres this number is relatively high, while others tend to have lower numbers. However, the examples also demonstrate that there are always exceptions; thus chroma complexity by itself is not a suitable similarity measure and needs to be combined with additional information (such as patterns) for applications such as playlist generation. The songs used for the following illustrations are from the DB-S collection (see Section 2.1). We chose typical examples from the categories: jazz, classic piano, classic orchestra, dance, hip hop, and pop. From each piece we analyze the center minute.

5.5.1 Jazz (Dave Brubeck Quartet)

The chroma complexity usually ranges from 7-8. However, there are exceptions such as “Take Five”, shown in Figure 5.3. In the analyzed section from this piece the drummer plays interesting and complex patterns while the other instruments (including the piano) are in a loop. This illustrates the need for additional descriptors. Alternatively, it would also make sense to analyze the whole piece instead of only one minute. In general the patterns in the cluster assignment are rather complex. This suggests that a descriptor based on the texture might be a useful supplement to the number of patterns.

5.5.2 Classic Orchestra

The chroma complexity usually ranges from 6-7. The values are generally high, but not as high as for jazz. This is also reflected in the complexity of the cluster assignment, which seems to be structured more clearly.

5.5.3 Classic Piano

The chroma complexity usually ranges from 7-8. The values are comparable to those of jazz. Some of the patterns in the cluster assignment are as complex as those of the jazz pieces, others are similar to those of classical orchestra.

5.5.4 Dance

The chroma complexity usually ranges from 2-4. The patterns in the cluster assignment are often very simple repetitions. A frequent observation is that the same chroma pattern represents a larger part of the piece, as can be seen in Figure 5.6.

5.5.5 Hip Hop

The chroma complexity for hip hop is usually around 2. The variations in this style of music are based on the lyrics, rhythm, and beats, but seldom on the harmonic structure. In Figure 5.7 the two chroma patterns found are very similar. The main difference is that in the second pattern (which occurs 29% of the time) D is more strongly pronounced. If the system were not limited to finding at least 2 patterns, it is possible that the two would be summarized by one.

5.5.6 Pop

The chroma complexity for pop is usually around 2-4. The patterns in the cluster assignment are often simple repetitions. For example, in Figure 5.8 the cluster assignment reveals that the sequence 1, 2, 3 is repeated frequently.

5.6 Discussion & Conclusions

As shown by the examples above, the distinction between pieces with low complexity and pieces with high complexity is meaningful. Compared to the MFCCs used for spectral similarity, the chromagram is a high-level representation of a song. Using chromagram-based information such as the chroma complexity, we might be able to reach beyond the glass ceiling pointed out in D3.1.1 and D3.2.1.

Despite very promising results, the work presented in this chapter is still preliminary. Next steps include higher level analysis of the chroma patterns and their relationship to each other (e.g. using a Notennetz). As shown with the Take Five example, it is necessary to combine chroma complexity with additional (complementary) similarity measures, including spectral similarity. This combination could be done as demonstrated in Section 2.3.

The prototypes could benefit from similarity measures reflecting higher level musical concepts. These might make the prototypes more interesting for people with specific interests in music. For example, pieces with an exceptionally strong deviation in chroma complexity could be removed from a playlist. Another example would be to organize a music collection according to chroma complexity, for instance by grouping all pieces with a very high complexity.


Figure 5.2: Jazz, Dave Brubeck Quartet, Strange Meadow Lark.


Figure 5.3: Jazz, Dave Brubeck Quartet, Take Five.


Figure 5.4: Classic Orchestra, Ravel Maurice, .


Figure 5.5: Classic Piano, Chopin, Etude, Op 25, No 7.


Figure 5.6: Dance, DJs at Work, Time to Wonder.


Figure 5.7: Hip Hop, Nelly, EII.


Figure 5.8: Pop, Emma Bunton, What took you so long?

6. Conclusions and Future Work

In this deliverable we have made several recommendations for the prototypes, in particular for audio-based similarity and web-based similarity. In both cases we have presented improved versions of the results in D3.1.1 and D3.2.1. The combination of audio- and web-based similarity is ongoing work. First results have been presented in D3.1.1 and D3.2.1. However, detailed analysis will require the compilation of a large database with a large (>> 100) number of artists.

We have demonstrated the use of novelty detection as an interesting tool to combine with similarity measures. For example, a genre classifier can benefit by recognizing if pieces (which need to be classified) are very different from all other pieces in the collection.

Furthermore, we have demonstrated an approach to using mid-level descriptors developed in WP2 for WP3 tasks. Specifically, we have used the chromagram to develop a measure for chroma complexity (which is closely related to chord complexity). Although the results are preliminary, they indicate that the glass ceiling pointed out in D3.1.1 and D3.2.1 might be higher than originally anticipated.

Due to its importance, the development of similarity measures will continue in the SIMAC project. A combined effort between WP2 and WP3 seems promising. Interesting directions for similarity include rhythm patterns, complexity in general, harmonic progression, and musical structure.

Bibliography

[1] J.-J. Aucouturier and F. Pachet. Music similarity measures: What’s the use? In Proceedings of the Third International Conference on Music Information Retrieval (ISMIR 2002), pages 157–163, 2002.

[2] J.-J. Aucouturier and F. Pachet. Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1), 2004.

[3] J.-J. Aucouturier and M. Sandler. Segmentation of musical signals using hidden markov models. In Proceedings of the Audio Engineering Society 110th Convention, Amsterdam, May 12-15 2001.

[4] J.-J. Aucouturier and M. Sandler. Finding repeating patterns in acoustic musical signals. In Proceedings of the AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, 2002.

[5] E. Batlle, J. Masip, and P. Cano. System analysis and performance tuning for broadcast audio fingerprinting. In Proceedings of the 6th International Conference on Digital Audio Effects (DAFX-03), London, UK, September 8-11 2003.

[6] J. Bello and J. Pickens. A robust mid-level representation for harmonic content in music signals. 2005. submitted.

[7] C. Bishop. Novelty detection and neural network validation. In Proceedings of the IEE Conference on Vision and Image Signal Processing, pages 217–222, 1994.

[8] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.

[9] J.C. Brown. Calculation of a constant q spectral transform. Journal of the Acoustical Society of America, 89(1):425–434, 1991.

[10] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge UK, 1998.

[11] H. Fastl. Fluctuation strength and temporal masking patterns of amplitude-modulated broad-band noise. Hearing Research, 8:59–69, 1982.


[12] A. Flexer, E. Pampalk, and G. Widmer. Hidden markov models for spectral similarity of songs. 2005. submitted.

[13] A. Flexer, E. Pampalk, and G. Widmer. Novelty detection based on spectral simi- larity of songs. 2005. submitted.

[14] R.M. Golden. Statistical tests for comparing possibly misspecified and nonnested models. Journal of Mathematical Psychology, 44:153–170, 2000.

[15] J.M. Grey. Multidimensional perceptual scaling of musical timbres. Journal of the Acoustical Society of America, 61:1270–1277, 1977.

[16] C. A. Harte and M. B. Sandler. Automatic chord identification using a quantised chromagram. In Proceedings of the 118th Convention of the Audio Engineering Society, Barcelona, Spain, May 28-31 2005.

[17] M.E. Hellman. The nearest neighbour classification with a reject option. IEEE Transactions on Systems Science and Cybernetics, 6(3):179–185, 1970.

[18] P. Knees, E. Pampalk, and G. Widmer. Artist Classification with Web-based Data. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR'04), pages 517–524, Barcelona, Spain, October 2004.

[19] B. Logan and S. Chu. Music summarization using key phrases. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume II, pages 749–752, 2000.

[20] B. Logan and A. Salomon. A music similarity function based on signal analysis. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'01), Tokyo, Japan, 2001.

[21] D.J.C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4:720–736, 1992.

[22] M. Markou and S. Singh. Novelty detection: a review, part 2: neural network based approaches. Signal Processing, 83(12):2499–2521, 2003.

[23] M. McAleer. The significance of testing empirical non-nested models. Journal of Econometrics, 67:149–171, 1995.

[24] C.E. Metz. Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8(4):283–298, 1978.

[25] E. Pampalk. Islands of music: Analysis, organization, and visualization of music archives. Master’s thesis, Vienna University of Technology, Department of Software Technology and Interactive Systems, 2001.

[26] E. Pampalk. A Matlab toolbox to compute music similarity from audio. In Proceedings of the Fifth International Conference on Music Information Retrieval (ISMIR'04), Barcelona, Spain, October 10-14 2004.

[27] E. Pampalk, A. Flexer, and G. Widmer. Improvements of audio-based music similarity and genre classification. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2005. Submitted.

[28] E. Pampalk, W. Goebl, and G. Widmer. Visualizing changes in the structure of data for exploratory feature selection. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington DC, August 24-27 2003. ACM.

[29] E. Pampalk, A. Rauber, and D. Merkl. Content-based organization and visualization of music archives. In Proceedings of ACM Multimedia, pages 570–579, Juan les Pins, France, December 1-6 2002. ACM.

[30] G. Peeters, A. La Burthe, and X. Rodet. Toward automatic music audio summary generation from signal analysis. In Proceedings of the Third International Conference on Music Information Retrieval (ISMIR'02), pages 157–163. IRCAM, 2002.

[31] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[32] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.

[33] M. Schedl, P. Knees, and G. Widmer. A web-based approach to assessing artist similarity using co-occurrences. In Proceedings of the Workshop on Content-Based Multimedia Indexing (CBMI'05), 2005. Submitted.

[34] S. Siegel. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 1956.

[35] D.M.J. Tax and R.P.W. Duin. Outlier detection using classifier instability. In Proceedings of the Joint IAPR International Workshop SSPR'98 and SPR'98, pages 593–601, 1998.

[36] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.

[37] B. Whitman and S. Lawrence. Inferring descriptions and similarity for music from community metadata. In Proceedings of the 2002 International Computer Music Conference (ICMC), pages 591–598, Göteborg, Sweden, September 2002.

[38] M. Zadel and I. Fujinaga. Web services for music information retrieval. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR'04), Barcelona, Spain, 2004.

[39] M. Zadel and I. Fujinaga. Web Services for Music Information Retrieval. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR'04), Barcelona, Spain, October 2004.

[40] E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer, 2nd edition, 1999.