
Computer Science Clinic

Final Report for Auditude

Music Similarity and Recommendation

May 13, 2003

Team Members: Paul Ruvolo (Team Leader), Brad Poon, Elizabeth Schoof, Nicholas Taylor

Advisor: Melissa O’Neill

Liaison: Nicholas Seet ’99

Abstract

The Auditude clinic team has investigated content-based similarity relationships between recorded musical performances. If a computer system can automatically determine whether two recordings are similar, it can assist in managing a music collection and make recommendations for possible new acquisitions. Similarity is a complex concept involving many judgments; our team has focused our attention on a combination of rhythm, timbre, and apparent loudness. We have developed software that extracts these features from a recording and uses them to organize music. We have not used any metadata in our process, though we have designed it so that metadata-based or other similarities could readily be integrated. The system can be used to categorize music, to make recommendations, and to generate playlists that arrange music in a sequence with smooth transitions between songs. The project as a whole takes a significant step towards making the experience of discovering and listening to new digital music effortless.

Contents

Abstract

1 Introduction
  1.1 Auditude
  1.2 Problem Statement
  1.3 Overview of Our Solution
  1.4 Deliverables

2 Feature Extraction
  2.1 Psychoacoustics and Loudness Sensation
  2.2 Beat Spectrum
  2.3 Timbre

3 Distance Metrics
  3.1 Euclidean
  3.2 Mahalanobis
  3.3 Cosine
  3.4 Dimensionality Reduction
  3.5 Combining Feature Vectors
  3.6 EMD and Clustering

4 Generating Maps
  4.1 The Self-Organizing Map
  4.2 The Growing Hierarchical Self-Organizing Map
  4.3 Our Application of SOMs
  4.4 Edge Weights
  4.5 Motivation for Maps and Hierarchical Maps
  4.6 Expanding a GHSOM
  4.7 Multiple Trees and Multiple Features
  4.8 Map Quality

5 Similarity Lookups
  5.1 Similarity Lookups Defined
  5.2 Topological Distance
  5.3 Complexity
  5.4 Multiple Features
  5.5 Algorithms for Lookups

6 Playlist Generation
  6.1 Complexity of Generating an Optimal Playlist
  6.2 A Greedy Algorithm
  6.3 Approximation Algorithms
  6.4 Genetic Algorithms

7 User Testing
  7.1 First User Test
  7.2 Second User Test
  7.3 Professor O’Neill’s Playlist
  7.4 Presentation Days Demo

8 Conclusions and Future Work

A Manuals and End to End Description
  A.1 Fluctuation Strength
  A.2 Calculating the Beat Spectrum
  A.3 Timbre Similarity
  A.4 Generating Maps
  A.5 Playlist Generation Using the Greedy Algorithm
  A.6 Playlist Generation Using the Genetic Algorithm
  A.7 Performing Music Recommendation

B Unexplored Possibilities
  B.1 Representing Trees in a Database
  B.2 More Efficient Dimensionality Reduction
  B.3 Other Algorithms for Playlist Generation

C Saga of Project (with Pictures)
  C.1 Cast of Characters
  C.2 Ground Zero
  C.3 La Mèrde de Paris
  C.4 Forget Paris... Please
  C.5 Casey at the Bat
  C.6 The Ghetto Girl Makes Good
  C.7 Wait, We Have to do a Presentation?
  C.8 The Day of Reckoning
  C.9 There and Back Again, a Project Manager’s Story

List of Figures

1.1 An overview of our solution

2.1 Spectra of three popular songs and a classical piece
2.2 Critical band spectra of three popular songs and a classical piece
2.3 Critical band spectra, modified by the spreading function
2.4 Phon values for the example songs
2.5 Sone values for the example songs
2.6 Modulation amplitude values for the example songs
2.7 Fluctuation strength values for the four example songs
2.8 Similarity matrix of Vivaldi’s Spring
2.9 The beat spectrum
2.10 The FFT of the beat spectrum
2.11 The waveform representation
2.12 MFCC data for three songs

3.1 A sample covariance matrix
3.2 Songs projected into the x-y plane
3.3 Contribution of additional dimensions

4.1 A view of a hierarchical SOM

6.1 Envisioning playlist generation as a shortest path problem

7.1 Comparing machine and human similarity judgments

A.1 The GHSOM code in action

B.1 Two different views of the same tree
B.2 A numbered tree
B.3 A sample split

C.1 Adventures in Paris
C.2 Projects Day, 2003

Chapter 1

Introduction

Music similarity judgments are valuable to both the music industry and consumers. Stores can use this information to organize their merchandise and recommend similar artists or albums. Individual listeners can use similarity to arrange playlists and select new music to extend their collections. Although people make these judgments fairly quickly even without formal music training, it is prohibitively time-consuming for people to analyze a meaningful fraction of all the recorded music that exists today. This motivates the quest for a computer-generated music similarity judgment useful for arranging playlists and recommending new recordings.

1.1 Auditude

Auditude has a very accurate music recognition system which is currently employed by several service providers as well as in consumer applications. The problem with this recognition system is, in a sense, that it is too accurate. It has been intentionally tuned to prevent similar-sounding songs from being mistaken for each other. Though their system can recognize a particular recording even if it has been corrupted by radio static and cell phone compression, it cannot calculate the less precise matching required for similarity and recommendation.

1.2 Problem Statement

For this project, we have explored music similarity, feature extraction, and comparison metrics. We have also developed frameworks which use these metrics to group songs hierarchically and to create pleasantly ordered playlists. These frameworks can also easily be extended to use new metrics as they are developed.

Figure 1.1: An overview of our solution. (The original flowchart shows two stages: database creation, done one time, in which audio is run through feature extraction — sone/loudness, MFCC/timbre, and beat spectrum — and a GHSOM neural network to build a saved tree of relationships; and database lookup, done many times, in which a query song is identified with existing Auditude song-ID technology and looked up in the tree to return similar songs, optionally ordered.)

1.3 Overview of Our Solution

Our system attempts to determine similarity relationships from a database of audio files. To do this, it first determines a ‘feature vector,’ or a series of numbers that somehow represent the way the song sounds, from each audio file. These feature vectors are then used as input to a neural network, which finds similarity relationships and arranges the songs into a hierarchy. An overview of this process is shown in the top half of Figure 1.1. Details of the procedures used may be found in Chapters 2, 3, and 4. Once the hierarchy of relationships has been generated, it can be used many times for similarity lookups from either content-based queries or Auditude-generated song IDs. Once the song name has been obtained, the song in question is located in the tree of relationships, and nearby songs are found. This procedure is shown in the bottom half of Figure 1.1. The specifics of this process may be found in Chapter 5. Our system can also order the results of a lookup, or any other set of songs from the database, in a manner that will reduce the overall dissimilarity between adjacent songs. The methods used for sequencing are discussed in Chapter 6.

1.4 Deliverables

In the first semester, we gave Auditude a proposal describing what we hoped to accomplish this year and a mid-year report detailing our progress through November. Now, at the end of the project, we are delivering the code we wrote and a final report. This document describes the possible approaches we researched, the algorithms we implemented, and the results we obtained. It also includes a user’s manual for the programs we developed and some suggestions for future development.

Chapter 2

Feature Extraction

Feature vectors lie at the core of our technique for discerning similarity. A feature vector represents the coordinates of a particular song in a high-dimensional vector space. It is an abstraction of the original raw audio data, focusing on some feature(s) of interest. The advantage of this abstract representation is that it allows straightforward comparisons between different songs. Additionally, feature vectors are ideal input for our mapping techniques, which further abstract music relationships. In this chapter, we will examine three different features we have used in feature extraction, and methods for combining these features into a single feature vector.

2.1 Psychoacoustics and Loudness Sensation

Psychoacoustics is the study of how people perceive and process sound (Zwicker & Fastl, 1999). It is a very broad and interdisciplinary field, drawing from medicine, physics, engineering, music, and a host of other areas of study. For this project, however, we are focusing on two areas of psychoacoustics that were used successfully in Rauber et al. (2002) and Pampalk et al. (2002): loudness sensation and fluctuation strength.

2.1.1 Theory

Loudness sensation is a measure of how loud people perceive a sound to be, relative to a fixed reference sound. The loudness sensation of a sound is highly dependent on frequency, and does not scale linearly with sound pressure level; thus the results of comparisons between loudness sensation levels can be quite different from simply comparing the spectra of two signals.

Loudness sensation has some of the same disadvantages as comparison of spectra, namely that it is not time invariant; the spectrum of a signal may change completely if the starting point is shifted by a small amount. Therefore, a different measure is needed, and following the approach taken by Rauber et al. (2002) we have chosen to use fluctuation strength. Fluctuation strength is a measure of the rise and fall of loudness sensation over time, and, since such patterns are generally constant for a short period of time, it is relatively time invariant.

2.1.2 Method

Determining loudness sensation is a two-step process. The first step is to convert the signal from PCM audio into a physical property of sound, pressure as a function of time. This conversion is done by scaling the sound sample values, which are real numbers in the range [−1, 1], by a factor that will make the loudest sound in the recording the desired decibel level (75 dB); the scaling factor is determined from the relationship

p = p_0 · 10^(L/20)    (2.1)

where L is the sound pressure level in dB, p is the sound pressure in Pascals, and p_0 is the threshold of hearing, 10^−5 Pa. The samples are all then squared, converting them from pressure to power, or sound pressure level, I in W/m² (Zwicker & Fastl, 1999, p. 1). The next step is the computation of the spectrum of the signal using the fast Fourier transform (Stearns & David, 1996, ch. 3); we break the signal up into short, overlapping windows of about 23 ms, so we have a data point roughly every 12 ms. For each data point, there are

0.0232 seconds × 11025 samples/second = 256 samples    (2.2)

and since the Fourier transform produces the same number of coefficients as it takes in data points, there are 256 coefficients (though only at 128 distinct frequencies). Thus we have transformed the data, but have not significantly reduced the amount of it. Figure 2.1 shows the spectra for three popular music songs, “Eight Days a Week” and “Day Tripper” by The Beatles and “Material Girl” by Madonna, as well as a classical piece for baritone and orchestra, “Wenn mein Schatz Hochzeit macht” from Gustav Mahler’s Songs of a Wayfarer. By using another product of psychoacoustical research we can simplify the data without losing any meaningful information. The Bark scale (Zwicker & Fastl, 1999, ch. 6) breaks the frequency range of the human ear up into 24 regions, known as critical bands, that correspond to the regions of the frequency spectrum that

Figure 2.1: Spectra of three popular songs and a classical piece. (Panels: (a) “Eight Days a Week”, (b) “Day Tripper”, (c) “Material Girl”, (d) “Songs of a Wayfarer”; each plots frequency in Hz against time in seconds, with spectrum level in dB.)

humans distinguish. Zwicker & Fastl (1999, p. 159) list the frequency bands that make up the Bark scale. Figure 2.2 shows the spectra from Figure 2.1 condensed into the critical bands. Here we modify the critical band spectrum to account for the asymmetric spreading effects between critical bands, using the method of Rauber et al. (2002), which is based on the equations of Schroeder et al. (1979). We do this by convolving the spreading function

10 log10 B(z) = 15.81 + 7.5(z + 0.474) − 17.5 · sqrt(1 + (z + 0.474)²) dB    (2.3)

with the spectrum in each time window. The results of applying this transformation to the songs in Figure 2.2 can be seen in Figure 2.3. The excitation levels for each critical band are then converted to sound pressure level using the relationship

L = 10 log10(I / I_0)    (2.4)

where I_0 = 10^−12 W/m², again the threshold of hearing. From there, we can begin to take into account the level at which people perceive sound. This is done by taking experimental data (Fletcher & Munson, 1933) of the levels of a sound necessary to maintain a constant volume at different frequencies; Allen & Neely (1997) use polynomial regression on the raw data to provide the cubic polynomials used in our work. The loudness level, in the unit phon, of a tone is defined to be the sound pressure level of a tone at 1 kHz that sounds as loud as that tone. In general, higher sound pressure levels are needed at high and low frequencies to maintain the loudness sensation felt at mid-range frequencies between 1 and 5 kHz (Zwicker & Fastl, 1999, p. 204). Figure 2.4 shows the results of applying this transformation to the data from Figure 2.2. If the center of the critical band did not fall directly on one of the 11 loudness curves from Allen & Neely (1997), we used a weighted average of the two closest curves. From here we must take into account the perceived loudness of different phon levels, or, put another way, how much louder or softer one sound seems than another. The standard way to do this is by conversion from phon to the unit of loudness, sone; one sone is the loudness of a 1 kHz tone at 40 dB. We used the conversion formulas from Bladon & Lindblom (1981). Since the human ear perceives quiet sounds differently from loud sounds, there are two different formulas; L signifies loudness level in phon, and S signifies loudness in sone. For L ≥ 40 phon, loudness is given by

S = 2^((L − 40)/10)    (2.5)

Figure 2.2: Critical band spectra of three popular songs and a classical piece. (Panels: (a) “Eight Days a Week”, (b) “Day Tripper”, (c) “Material Girl”, (d) “Songs of a Wayfarer”; each plots critical band in Bark against time in seconds, with power in W/m².)

Figure 2.3: Critical band spectra, modified by the spreading function. (Same four example songs as Figure 2.2.)

Figure 2.4: Phon values for the example songs. (Panels: (a) “Eight Days a Week”, (b) “Day Tripper”, (c) “Material Girl”, (d) “Songs of a Wayfarer”.)

while for L < 40 phon, loudness is given by

S = (L/40)^2.642    (2.6)

The results of applying this transformation to the loudness levels in Figure 2.4 are shown in Figure 2.5. The results shown in Figure 2.5 differ greatly between the songs, and the plots are certainly more distinct than any of the previous measures of loudness; however, there is still so much data that direct comparison is difficult. As mentioned earlier, these results are not time invariant, and shifting the start or end of the sampled region would also shift all of the values. This feature is therefore not ideal for a similarity engine, since it is dependent on the exact choice of sampling window, and also provides a rather restrictive definition of similarity. However, if one examines the short-term change in loudness at different frequencies, interesting patterns begin to emerge. Given that changes in loudness are known, the process for computing fluctuation strength is quite simple. One begins by taking the Fourier transform of the loudness for each critical band over reasonably short windows of time; following Rauber et al. (2002), we use six-second windows. The lower-order coefficients are retained, since the human ear is most sensitive to fluctuations at modulation frequencies of between 1 and 10 Hz (Zwicker & Fastl, 1999, ch. 10); they are grouped by frequency to reduce the amount of data being stored. Graphs of fluctuation strength from the example songs are shown in Figure 2.6. The final step in the calculation of fluctuation strength is to weight the groups of coefficients by how well humans perceive fluctuations at that modulation frequency; we use the method of Zwicker & Fastl (1999, ch. 10), which gives the following relationship for fluctuation strength F as a function of modulation amplitude ∆L and modulation frequency f_mod:

F ∼ ∆L / ((f_mod / 4 Hz) + (4 Hz / f_mod))    (2.7)

The equation clearly shows that humans are most sensitive to fluctuations at a modulation frequency of around 4 Hz.
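As a concrete illustration, the phon-to-sone conversion (Eqs. 2.5 and 2.6) and the perceptual weighting of Eq. 2.7 amount to only a few lines. The following is a minimal numpy sketch of the formulas alone; the function names are ours, not the team's code:

```python
import numpy as np

def phon_to_sone(L):
    """Eqs. 2.5 and 2.6: loudness S in sone from loudness level L in phon."""
    L = np.asarray(L, dtype=float)
    return np.where(L >= 40.0, 2.0 ** ((L - 40.0) / 10.0), (L / 40.0) ** 2.642)

def fluctuation_weight(f_mod):
    """Eq. 2.7 without the Delta-L factor: relative sensitivity to a
    modulation frequency f_mod in Hz, peaking at 4 Hz."""
    return 1.0 / (f_mod / 4.0 + 4.0 / f_mod)

# One sone is a 1 kHz tone at 40 dB; each 10 phon above that doubles loudness.
print(phon_to_sone([40.0, 50.0, 60.0]))  # -> [1. 2. 4.]
```

Note how `fluctuation_weight` is largest at exactly 4 Hz, matching the sensitivity peak described above.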
After applying this weighting function to the data from Figure 2.6, the relationships between the songs become clearer, as the uninteresting lower-modulation-frequency fluctuations are weighted down. The result is shown in Figure 2.7. To generate a single vector from all of a song’s fluctuation strength matrices, we take the element-wise median at each point in the matrix; this gives a reasonably good approximation of general patterns throughout the song, and is a technique

Figure 2.5: Sone values for the example songs. (Panels: (a) “Eight Days a Week”, (b) “Day Tripper”, (c) “Material Girl”, (d) “Songs of a Wayfarer”.)

Figure 2.6: Modulation amplitude values for the example songs. (Panels: (a) “Eight Days a Week”, (b) “Day Tripper”, (c) “Material Girl”, (d) “Songs of a Wayfarer”; each plots critical band in Bark against modulation frequency in Hz.)

Figure 2.7: Fluctuation strength values for the four example songs. (Same layout as Figure 2.6.)

successfully used previously (Rauber et al., 2002). These 20 × 20 matrices are then flattened into a 400-dimensional vector for the purposes of comparison; the order of dimensions is unimportant for the naïve feature vector comparisons we use.
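The collapse into a single 400-dimensional vector described above might look as follows (a hedged numpy sketch, assuming the per-window matrices arrive as a list of 20 × 20 arrays):

```python
import numpy as np

def song_feature_vector(fs_matrices):
    """Element-wise median over a song's per-window 20x20
    fluctuation-strength matrices, flattened to 400 dimensions."""
    stacked = np.stack(fs_matrices)            # shape (n_windows, 20, 20)
    return np.median(stacked, axis=0).ravel()  # shape (400,)
```

The median (rather than the mean) keeps a brief loud or quiet passage from dominating the song-level summary.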

2.2 Beat Spectrum

The beat spectrum is a useful similarity metric that compactly captures the rhythm and tempo features of a song. The beat spectrum represents periodicities in audio: repetitive music, for example, will have strong beat spectrum peaks at the repetition times. The beat spectrum will give us a measure for comparing rhythmic similarity, which is helpful in ordering songs based on their rhythmic properties. Our algorithm for calculating the beat spectrum was developed by Jonathan Foote (Foote et al., 2002; Foote & Cooper, 2002). We calculate the beat spectrum in three steps, which we discuss in detail below. First, we parameterize the audio to obtain its representative feature vectors. Second, we embed the feature vectors into a two-dimensional matrix by calculating the cosine distance between the vectors. Finally, we create the beat spectrum by performing autocorrelation on the similarity matrix; this method yields a more accurate approximation of the beat spectrum than simply summing along the diagonals.

2.2.1 Parameterizing the Audio

In order to parameterize the audio, we first made Hamming windows from the raw audio. Hamming windows are small sampling windows which contain the amplitudes of the discrete frequency components of a sound signal (Pradeep & Gupta, 2003). In testing, we used 256-sample frames overlapped by 128 samples, although other frame sizes can certainly be used. Computing the fast Fourier transform on each frame results in a series of feature vectors that characterizes the spectral content. In addition to Hamming windows, Foote has used MFCCs (see Section 2.3) to parameterize the audio, but we did not implement that method of parameterization.
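This parameterization step can be sketched in a few lines of numpy (frame and hop sizes taken from the text; the function name is ours):

```python
import numpy as np

def parameterize(audio, frame=256, hop=128):
    """Split the signal into Hamming-windowed frames of `frame` samples,
    overlapped by `hop` samples, and return each frame's FFT magnitudes
    as its spectral feature vector."""
    n = 1 + (len(audio) - frame) // hop
    window = np.hamming(frame)
    frames = np.stack([audio[i * hop : i * hop + frame] * window
                       for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1))  # one row per frame
```

With a 256-sample frame, the real FFT yields 129 magnitude bins per frame, so the result has one 129-dimensional feature vector per 128-sample hop.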

2.2.2 Creating the Similarity Matrix

Once the audio has been parameterized, we embed the feature vectors into a two-dimensional matrix by computing the scalar (dot) product of all pairwise combinations of frames i and j from the set of feature vectors v. Normalizing the product reduces dependency on magnitude, and thus energy, which is useful for songs with

a significant portion of silence:

D(i, j) = (v_i · v_j) / (‖v_i‖ ‖v_j‖)    (2.8)

Figure 2.8: Similarity matrix of Vivaldi’s Spring. (Time in seconds, 0 to 30, on both axes.)
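In numpy, the normalized pairwise comparison of Eq. 2.8 reduces to one matrix product (a sketch; `V` holds one feature vector per row, and the epsilon guard for silent all-zero frames is our own addition):

```python
import numpy as np

def similarity_matrix(V, eps=1e-12):
    """Cosine similarity (Eq. 2.8) between all pairs of frame feature
    vectors; S[i, j] is near 1 for similar frames, near 0 for dissimilar."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    U = V / np.maximum(norms, eps)  # unit-normalize each frame vector
    return U @ U.T
```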

The similarity measure is performed on all pairwise combinations of the feature vectors to create the similarity matrix S. Higher numbers correspond to higher measures of self-similarity, while lower numbers correspond to dissimilarity. Figure 2.8 is an example of a self-similarity matrix computed in Matlab for 30 seconds of Vivaldi’s Spring. Regions of high self-similarity appear as bright squares along the main diagonal. In the figure, four main squares are visible along the diagonal simply by observing the differences in brightness. These four squares are highlighted in red. The first two represent the opening, which repeats again at 8 seconds but softer. The next square, at 15 seconds, transitions into the next main theme of the song, which again repeats at around 23 seconds with a softer amplitude. Notice that smaller bright squares are visible within the four main squares themselves. These smaller squares are representative of the repetitive beat within each repeat of the chorus.

The entire similarity matrix is not necessary for computing the beat spectrum. We are only concerned with similarity within a few seconds of each frame, because the beat spectrum captures the rhythmic properties of the entire similarity matrix within a small lag time. For example, a high beat spectrum peak at a lag time of 1 second says that there is a repetitive beat every second. We define the lag domain L(i, j) to be the range of similarity values within the lag time l = j − i. Calculating frame similarity within a small lag domain reduces the number of computations needed to obtain the beat spectrum, reducing the algorithmic complexity from O(n²) for a full similarity matrix to O(n); in practice, the beat spectrum may be computed several times faster than real-time.

2.2.3 Deriving the Beat Spectrum

The periodicity of the audio can be calculated from the similarity matrix as the beat spectrum. Self-similarity as a function of the lag time is called the beat spectrum (Foote et al., 2002). We can estimate the beat spectrum through two methods. First, by simply summing along the diagonals, we find that

B(l) ≈ Σ_{k ⊂ R} S(k, k + l)    (2.9)

Second, by performing autocorrelation on the matrix, we find that (Foote et al., 2002)

B(l) = Σ_{i,k} S(i, k) · S(i + l, k + l)    (2.10)

Foote found that the autocorrelation of the similarity matrix works well across a range of musical genres, tempos, and rhythmic structures. In practice, we used autocorrelation to calculate the beat spectrum, although the diagonal sum method is significantly faster. Figures 2.9(a) and (b) are two examples of beat spectra that show a similar beat. We are mainly concerned with the peaks, which tell us at what point in time a repetitive beat occurs. Note that the first peaks for both “Eight Days a Week” and “The Good Life” occur at around 0.80 seconds, even though these beats are produced by different instruments!
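The faster diagonal-sum estimate (Eq. 2.9) amounts to collapsing each superdiagonal of S to one number. A hedged numpy sketch follows; we average rather than sum so that shorter diagonals at larger lags are not penalized, which is our own normalization choice:

```python
import numpy as np

def beat_spectrum(S, max_lag):
    """Estimate B(l) for l = 0 .. max_lag-1 by averaging the similarity
    values along each superdiagonal of S (the diagonal-sum method)."""
    return np.array([np.diagonal(S, offset=l).mean() for l in range(max_lag)])
```

A strong beat every l frames shows up as a large value of `B[l]`, since frames l apart then tend to be similar.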

2.2.4 Processing the Beat Spectrum

The beat spectrum vectors are processed in the following manner. First, we truncate all the vectors to the same lag time. A lag time of about 5 seconds will suffice and is long enough to encompass most distinct beats in a song. Next, we subtract the mean and normalize the vector by dividing by the maximum value so that the

Figure 2.9: The beat spectrum. (Panels: (a) Beatles, “Eight Days a Week”; (b) Weezer, “The Good Life”; each plots power against lag time in seconds, from 0 to 5 seconds.)

peak magnitude has a value of 1. Finally, since all beat spectra are similar at zero lag, we truncate the first 0.120 seconds. This is an arbitrarily chosen number, but Foote also uses it in his method (Foote et al., 2002).
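The three processing steps above can be sketched as follows (numpy; `rate` is the number of beat-spectrum samples per second of lag, an assumed parameter):

```python
import numpy as np

def process_beat_spectrum(b, rate, max_lag_s=5.0, skip_s=0.120):
    """Truncate to a common 5 s lag range, zero the mean, scale the peak
    magnitude to 1, then drop the first 0.120 s near zero lag."""
    b = np.asarray(b, dtype=float)[: int(max_lag_s * rate)]
    b = b - b.mean()
    b = b / np.max(np.abs(b))
    return b[int(skip_s * rate):]
```

After this step, every song's beat spectrum covers the same lag range and shares the same scale, so the vectors can be compared directly.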

2.2.5 Facilitating Beat Comparison using FFT

Although we can see parallels between the two beat spectra shown in Figure 2.9, a straightforward comparison of feature-vector coefficients may not reflect this similarity. Both songs have their main beats about once a second, but they do not have the exact same beat: while the beats are fairly closely aligned on the left-hand side of the graph, they drift apart on the right-hand side. Thus, we should avoid directly comparing beat spectra and instead use a better metric. One approach is to take the FFT of the beat spectrum to obtain beat frequencies. Figure 2.10 shows the results of taking the FFT of our two example songs. The green line shows the results from blurring the spectrum slightly to spread the peaks.
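A sketch of this comparison metric, with an optional moving-average blur (the blur width here is our own illustrative choice, not a value from the report):

```python
import numpy as np

def beat_frequencies(b, blur_width=0):
    """Magnitude FFT of a processed beat spectrum; optionally smooth with
    a short moving average to spread the peaks before comparison."""
    mag = np.abs(np.fft.rfft(b))
    if blur_width > 1:
        kernel = np.ones(blur_width) / blur_width
        mag = np.convolve(mag, kernel, mode="same")
    return mag
```

In this representation, two songs whose beats drift out of phase still produce peaks at nearly the same beat frequency, which is exactly the robustness motivated above.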

2.3 Timbre

Another characteristic of music that people notice is the mixture of instruments used to perform the piece. The characteristic voice of an instrument is its timbre. The timbre of a piece of music may be represented as the first several Mel Frequency Cepstral Coefficients (MFCCs) of the recording (Logan & Salomon, 2001b). MFCCs may be calculated with a modified Fourier transform which imitates the way human ears modify sound as they process it. Several researchers have explored this, and disagree on how many coefficients should be used in order to capture the timbre of the sound without the pitch (Aucouturier & Pachet, 2002; Logan & Salomon, 2001b; Rabiner & Juang, 1993).

2.3.1 Theory

Mel Frequency Cepstral Coefficients (MFCCs) are a standard transformation of sound in speech recognition, but are also effective for musical sound analysis. This feature extraction method builds upon the work of psychoacoustic modeling (Rabiner & Juang, 1993). Our algorithm takes audio files, as in Figure 2.11, and extracts their MFCCs. The first 10 MFCCs of the song from Figure 2.11 are shown in Figure 2.12(a).

Figure 2.10: The FFT of the beat spectrum. (Panels: (a) Beatles, “Eight Days a Week”; (b) Weezer, “The Good Life”; each plots power against beat frequency in Hz, from 0 to 25 Hz.)

Figure 2.11: The waveform representation of the Beatles’ song “Eight Days a Week”. (Amplitude against time, over 30 seconds.)

Acoustical Modeling: Mel-Frequency Scale

Since our target audience is the average music listener, and not musicologists or professional classical pianists, it makes sense for us to base our metrics on what people hear, rather than on what is played. Even if we had the score for every piece of music, and could use it in our metrics, it would not be as helpful as accurate data about what an average human actually hears when that piece is performed. Ears filter and transform sound waves in complex ways as they travel through the ear canal. The mel scale is a filter for digital signal processing which approximates the filtering effects of the human ear.

Cepstrum of Mel Frequency

The cepstrum of this psycho-acoustic model of the sound provides valuable information about the timbre and pitch of the original sound. As a song’s timbre changes over time, so do the MFCCs of that song; thus MFCCs must be calculated at multiple points throughout the song. (In our work we have used overlapping windows of 25 ms.) The nth cepstral coefficient, c̃_n, is given by the equation (Aucouturier & Pachet, 2002):

\[
\tilde{c}_n = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \log\!\bigl(S(e^{j\omega})\bigr)\, e^{j\omega n}\, d\omega \tag{2.11}
\]

where $\omega$ is the frequency and $S(e^{j\omega})$ is the frequency response. The first $K$ coefficients of this frequency response are denoted $\tilde{S}_k$. These are used when we combine the cepstrum and the mel frequency as follows (Rabiner & Juang, 1993):

\[
\tilde{c}_n = \sum_{k=1}^{K} \bigl(\log \tilde{S}_k\bigr) \cos\!\left[ n \left( k - \tfrac{1}{2} \right) \frac{\pi}{K} \right] \tag{2.12}
\]
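Equation 2.12 is a discrete cosine transform of the log mel filterbank energies. As an illustrative sketch only (this is not the code we used), the final step might look like the following, assuming the mel filterbank energies $\tilde{S}_k$ have already been computed:

```python
import math

def cepstral_coefficients(log_mel_energies, n_coeffs):
    """Compute cepstral coefficients from log mel filterbank energies
    via Equation 2.12: c_n = sum_k (log S_k) cos[n (k - 1/2) pi / K].
    Here log_mel_energies[k] holds log(S_{k+1}) for k = 0..K-1."""
    K = len(log_mel_energies)
    return [
        sum(log_mel_energies[k] * math.cos(n * (k + 0.5) * math.pi / K)
            for k in range(K))
        for n in range(n_coeffs)
    ]
```

For a flat spectrum all coefficients above the zeroth vanish, which is a quick sanity check on the transform.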

2.3.2 Comparing MFCCs

Our algorithm currently generates ten floating point numbers per time window, or 29,970 floats for a 30-second song sample. Even a coarse, human evaluation of this data, as in Figure 2.12, reveals some timbral distinctions. We have explored a variety of ways to compare the MFCCs of songs objectively.

In order to facilitate fast live comparisons, we have considered a number of methods of reducing the volume of data, hoping to extract the essence of each song while sacrificing the less helpful data. Using the raw data as a feature vector is impractical for large-scale implementations: the space required to store these vectors would be excessive. While the space issue could be solved by simply averaging the 2997 vectors, averaging tends to mask the variation within a piece of music. With averaging, a piece in which violins and flutes play together would look very much the same as a piece in which the two instruments take turns in a call-and-response manner. Clustering the data (as described in Section 3.6), on the other hand, maintains the distinction between the two timbres in the latter example and finds no equivalent distinction in the former. These two considerations led us to cluster the data.

We also tried simpler methods of reducing the data. The method currently used in our tree code (Chapter 5) is a simple Fourier transform, reducing the size of a song’s feature vector to only 10 × 15 floats.¹ We have also found some success using the first several coefficients of the Fourier transform of the full 10 by 2997 matrix.

1. The Sandia clinic team, which has been working on clustering using only the first 2 of these 10 averaged coefficients, achieved fairly good clusterings. Unfortunately, this work was based on a different corpus of songs with different clustering algorithms. We have not had the opportunity to duplicate their results.

Figure 2.12: MFCC data for three songs: (a) “Eight Days a Week”, (b) “Come Out and Play”, (c) “I Left My Heart in San Francisco”, and (d) average MFCC values for the Beatles, the Offspring, and Tony Bennett. In these graphs, the first three coefficients illuminate the distinction between a guitar and drums combo (Beatles and Offspring) and mellow vocals with piano (Tony Bennett).

Chapter 3

Distance Metrics

Once we had feature vectors for different songs, it was necessary to find some way to compare them. The result of this comparison is a distance that defines how similar the two vectors are to each other. We experimented with a number of different distance metrics, both simple and complicated, and compared their performance and quality of results.

3.1 Euclidean

The Euclidean distance metric is probably the simplest to understand. It is defined to be the linear distance between the ends of two vectors in $\mathbb{R}^n$. From the Pythagorean Theorem, this can be determined to be, for two $n$-dimensional vectors $x$ and $y$,

\[
d_E(x, y) = |x - y| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{3.1}
\]

We can also express this in linear algebra terms, assuming $x$ and $y$ are column vectors:

\[
d_E(x, y) = \sqrt{(x - y)^T (x - y)} \tag{3.2}
\]

This metric is commonly used to compare feature vectors, due to its simplicity (Hand et al., 2001, ch. 2). In addition, it was the default in the map generating code described below.
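Equation 3.1 transcribes directly into code (a sketch, not the project’s actual implementation):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors (Eq. 3.1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
```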

3.2 Mahalanobis

In an attempt to refine the somewhat blunt nature of the Euclidean distance, we studied a more sophisticated metric known as Mahalanobis distance. It adjusts

Euclidean distance to give less weight to dimensions that vary together and more to those that vary independently. We had hoped that this would enhance patterns in the data. The first step in the Mahalanobis distance is to compute a covariance matrix of all the feature vectors. For two dimensions $x$ and $y$ with $m$ observations of each, the covariance is

\[
\mathrm{Cov}(x, y) = \frac{1}{m} \sum_{i=1}^{m} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr) \tag{3.3}
\]

where $x(i)$ denotes the $i$th observation of dimension $x$; this number is positive if the two dimensions vary together (i.e. one is high when the other is high), negative if the two vary oppositely (i.e. one is low when the other is high), and close to zero if the two vary independently. To create the covariance matrix $\Sigma$, we simply let $\Sigma_{i,j}$ be the covariance between the $i$th and the $j$th dimensions. A sample covariance matrix is shown in Figure 3.1. To add this weighting to the Euclidean distance, we insert the inverse of the covariance matrix $\Sigma$ between the vectors in Equation 3.2, giving the Mahalanobis distance between two column vectors $x$ and $y$ as

\[
d_{MH}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)} \tag{3.4}
\]

This distance metric has a strong statistical basis, since it follows from the definition of the $n$-dimensional normal distribution (Hand et al., 2001, ch. 2, 9).

Note that the addition of the matrix multiplication to the distance metric increases the complexity of determining the distance between two $n$-dimensional feature vectors from $O(n)$ to $O(n^2)$. In our tests, this drastically increased the amount of time needed for map generation. Since the weighting did not seem to improve the quality of the results, we did not pursue this metric beyond initial testing.
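Given a precomputed inverse covariance matrix, Equation 3.4 can be sketched as follows (illustrative only; in practice the inverse would come from a linear algebra library):

```python
import math

def mahalanobis_distance(x, y, sigma_inv):
    """Mahalanobis distance (Eq. 3.4); sigma_inv is the inverse covariance
    matrix, given as a list of rows."""
    d = [xi - yi for xi, yi in zip(x, y)]
    # quadratic form (x - y)^T Sigma^{-1} (x - y)
    quad = sum(d[i] * sigma_inv[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(quad)
```

With $\Sigma^{-1}$ equal to the identity this reduces to the Euclidean distance; a dimension with large variance gets a small entry in $\Sigma^{-1}$ and so contributes less to the total.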

3.3 Cosine

Cosine distance is another relatively simple distance metric. It leverages the fact that the dot product of two vectors $x$ and $y$ in $\mathbb{R}^n$ is the cosine of the angle between them multiplied by the product of their lengths. Thus it is computationally efficient to extract the cosine of the angle between them, which gives an indication of whether the two vectors point in the same direction, regardless of their relative lengths; it ranges from $-1$ to $1$, with $1$ indicating parallel, $0$ indicating perpendicular, and $-1$ indicating anti-parallel. From the formula for the dot product,

\[
x \cdot y = |x|\,|y| \cos\theta \tag{3.5}
\]

Figure 3.1: A sample covariance matrix for flattened fluctuation strength vectors. The repeating pattern every 20 dimensions shows that pixels adjacent in the 20 by 20 matrix tend to vary together. Notice that all entries are greater than or equal to zero, showing that no dimensions tended to vary oppositely.

we can extract the cosine distance between $x$ and $y$ as

\[
d_c(x, y) = \cos\theta = \frac{x \cdot y}{|x|\,|y|} \tag{3.6}
\]
\[
= \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{3.7}
\]

This distance measure has been used effectively in many information retrieval ap- plications (Hand et al., 2001, ch. 14).
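Equations 3.6 and 3.7 translate directly into code (sketch):

```python
import math

def cosine_distance(x, y):
    """Cosine of the angle between two vectors (Eqs. 3.6-3.7);
    ranges from -1 (anti-parallel) to 1 (parallel)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                  math.sqrt(sum(yi * yi for yi in y)))
```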

3.4 Dimensionality Reduction

To reduce the amount of data, and therefore to increase the speed of processing, we experimented with multidimensional scaling (Hand et al., 2001, ch. 3), a method for reducing the dimensionality of multidimensional data. This technique projects the distances between feature vectors in a high-dimensional space into a lower-dimensional space, attempting to put the maximum variation between vectors in the lower-order dimensions.

We found that this technique gave relatively good results. Figure 3.2 shows a subset of our test songs displayed in the x-y plane using scaled fluctuation strength data. The groupings appear reasonably good. Figure 3.3 shows the amount of variance that remains after doing the dimensionality reduction.

Doing the dimensionality reduction is relatively computationally expensive. Since multidimensional scaling requires the construction of a pairwise distance matrix, it is inherently Ω(m²), where m is the number of data points. Even for our small test set of around 160 songs, the dimensionality reduction took orders of magnitude longer than the map generation. Since map generation was already much faster than feature extraction, we decided not to pursue this technique further.
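To give a concrete flavor of the technique (this is not the implementation we used), classical multidimensional scaling double-centers the squared distance matrix and reads coordinates off its leading eigenvectors. A one-dimensional sketch using power iteration:

```python
import math

def classical_mds_1d(D, iters=200):
    """Project points with pairwise distance matrix D onto one dimension,
    preserving as much variance as possible (classical MDS sketch)."""
    m = len(D)
    D2 = [[d * d for d in row] for row in D]
    row_mean = [sum(row) / m for row in D2]
    grand = sum(row_mean) / m
    # double-centered Gram matrix: B = -1/2 * J D^2 J
    B = [[-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(m)] for i in range(m)]
    # power iteration for the leading eigenpair of B
    v = [float(i + 1) for i in range(m)]
    lam = 0.0
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(m)) for i in range(m)]
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return [vi * math.sqrt(max(lam, 0.0)) for vi in v]
```

For points that actually lie on a line, the embedding recovers their pairwise distances exactly; for higher-dimensional data it keeps only the direction of maximum variance.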

3.5 Combining Feature Vectors

Figure 3.2: A subset of our test songs projected into the x-y plane, using the fluctuation strength data. Notice some logical groupings; for example, the four songs from Mahler’s “Songs of a Wayfarer” are close to each other.

Figure 3.3: Contribution of additional dimensions. (a) Relative contribution of the first fifty dimensions; (b) amount of variance captured using a subset of dimensions. These graphs show the contribution of each dimension in the scaled output to the overall variance of the data. The amount of variance can be thought of as corresponding roughly to the amount of information.

When combining feature vectors from different sources into one über-feature vector, we had to normalize the data so that the choice of units does not affect the overall distance. For the distance metrics discussed above, if one feature vector had values that were on average 1000 times higher than those in another feature vector, differences in values from the first feature vector would outweigh any differences in values from the second when comparing two combined feature vectors. Using a common strategy to compensate for this (Hand et al., 2001, ch. 2), we divide each dimension in the combined feature vector by its standard deviation. To emphasize differences between values, as opposed to the actual magnitude of values, we also subtract off the mean, yielding a statistical measure known as a t-score. This gives an overall transformation for each dimension x of

\[
x'(i) = \frac{x(i) - \bar{x}}{s_x} \tag{3.8}
\]

where $x(i)$ is an individual measurement of dimension $x$, $\bar{x}$ is the mean of the observed measurements of dimension $x$, and $s_x$ is the standard deviation of the measurements of dimension $x$.
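Equation 3.8 applied to one dimension of the combined feature vectors (a sketch; the population standard deviation is used here, matching the $1/m$ convention of Equation 3.3):

```python
import math

def t_score(values):
    """Normalize the observations of one dimension (Eq. 3.8):
    subtract the mean, then divide by the standard deviation."""
    m = len(values)
    mean = sum(values) / m
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / m)
    return [(v - mean) / sd for v in values]
```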

3.6 EMD and Clustering

We used earth mover’s distance (EMD) as a distance metric for comparing timbre feature vectors (Logan & Salomon, 2001a). The technique could potentially prove useful for many other types of features, however. The techniques we used for making timbral comparisons with earth mover’s distance were drawn from Beth Logan’s work in this area (2001a).

3.6.1 Clustering Method

EMD requires clustered data. We chose K-means as our method of clustering the MFCC data for EMD, largely because it was already implemented in Matlab and seems to produce fairly good results in similarity applications (Fung, 2001). Since K-means is sensitive to its normally random initial conditions, we selected the initial conditions deterministically. To avoid bias towards any part of the song, we took data from windows equally spaced throughout the sample. Gaussian mixture models would probably also work well for clustering the MFCCs.
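A sketch of K-means with the deterministic, equally spaced initialization described above (our real clustering ran in Matlab; this version is illustrative only):

```python
def kmeans(points, k, iters=50):
    """K-means with initial centers drawn from points equally spaced
    through the sample, so the result is deterministic."""
    step = len(points) // k
    centers = [list(points[i * step]) for i in range(k)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        for c in range(k):  # recompute each centroid
            if clusters[c]:
                n = len(clusters[c])
                centers[c] = [sum(dim) / n for dim in zip(*clusters[c])]
    return centers, clusters
```

The cluster sizes `len(clusters[c])` then serve as the weights that EMD requires.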

3.6.2 Earth Mover’s Distance

The EMD metric calculates the distance between two sets of clusters. Each set may contain any number of clusters. For each cluster, a centroid (a vector in n-dimensional space) and a weight (a count of how many points are in the cluster) are required. Given a metric for the distance between any two points in this space, EMD calculates how much “work” is necessary to transform one set of clusters into the other. In this context, work is simply the quantity moved times the distance it is moved. The name for this algorithm comes from one conceptual explanation of it. One set of clusters is a set of piles of dirt. The locations and sizes of the piles are represented by the centroids and weights. The

other set of clusters is like a set of holes. The algorithm calculates how to fill all the holes with all the dirt, minimizing the amount of work (dirt moving) that must be done.
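In general EMD is solved as a transportation problem, but in one dimension it reduces to the area between the two normalized cumulative weight curves, which permits a compact sketch (illustrative only; our actual clusters live in a higher-dimensional MFCC space):

```python
def emd_1d(pos_a, w_a, pos_b, w_b):
    """Earth mover's distance between two weighted point sets on a line.
    Weights are normalized so both sides carry equal total mass."""
    total_a, total_b = sum(w_a), sum(w_b)

    def cdf(positions, weights, total, x):
        # fraction of mass at or to the left of x
        return sum(w for p, w in zip(positions, weights) if p <= x) / total

    pts = sorted(set(pos_a) | set(pos_b))
    work = 0.0
    for x0, x1 in zip(pts, pts[1:]):
        # mass imbalance to the left of x0 must cross the gap to x1
        work += abs(cdf(pos_a, w_a, total_a, x0) -
                    cdf(pos_b, w_b, total_b, x0)) * (x1 - x0)
    return work
```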

3.6.3 Symmetric K-L Distance

The symmetric Kullback-Leibler distance is one of many metrics EMD could use to calculate the work needed to move one unit of dirt from one place to another.¹ This distance metric uses the means and covariances of two clusters to calculate the distance between them. This accounts for the variation in cluster sizes: some are densely packed, others are quite sparse. The distance between two distributions $A$ and $B$ with means $\mu_a$ and $\mu_b$ and variances $\sigma_a^2$ and $\sigma_b^2$ may be calculated as

\[
KL(A; B) = \frac{1}{4} \sum_i \left[ \frac{\sigma_{a,i}^2}{\sigma_{b,i}^2} + \frac{\sigma_{b,i}^2}{\sigma_{a,i}^2} + (\mu_{a,i} - \mu_{b,i})^2 \left( \frac{1}{\sigma_{a,i}^2} + \frac{1}{\sigma_{b,i}^2} \right) - 2 \right] \tag{3.9}
\]
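For diagonal-covariance clusters, Equation 3.9 can be transcribed directly (sketch):

```python
def symmetric_kl(mu_a, var_a, mu_b, var_b):
    """Symmetric Kullback-Leibler distance between two Gaussian clusters
    with diagonal covariances (Eq. 3.9)."""
    return 0.25 * sum(
        va / vb + vb / va + (ma - mb) ** 2 * (1.0 / va + 1.0 / vb) - 2.0
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b)
    )
```

Identical distributions yield a distance of zero, as a metric-like quantity should.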

1. Logan, Beth. Personal communication. 7 Mar. 2003.

Chapter 4

Generating Maps

An ideal but inefficient way to generate similarity judgments would be to directly compare feature vectors. In such a scheme, finding ten songs similar to Pearl Jam’s “Better Man” would entail comparing it with all songs in the library and choosing the ten most similar. This approach does not scale well with an increasing corpus size. A more compact and abstract representation of song relationships is needed. There are many neural networking techniques that are well suited to creating these sorts of abstract relationships. One such technique is the self-organizing map. One can think of a self-organizing map as a translation between a high-dimensional feature space and a low-dimensional topology (e.g. a tree or a grid). We used feature vectors as input to these various mapping techniques; our goal was to create a more compact representation of relationships between songs. This compact representation would be better suited to our lookup algorithms than raw feature vectors.

4.1 The Self-Organizing Map

The self-organizing map (SOM) is a widely used neural networking technique for clustering high dimensional vectors onto a two dimensional grid. Each cell in this grid is a map unit. The connections between the various map units define the topology of the map. The most common topologies are rectangular and hexagonal grids. Each map unit contains a model vector (or neuron) that represents the consensus of the input vectors represented by that map unit. The basic idea is to project high dimensional vectors onto a representative model neuron (Kohonen, 1982). These maps are created using an iterative training procedure. In the simplest form there is a fixed set of neurons. As the training progresses, each input vector is presented to the neural network. The neural network matches each input with the

most similar neuron. In this way each input vector is mapped into the topology of the SOM. The model vectors are gradually updated during this procedure to mold the map to fit the data. The traditional SOM has the disadvantage of requiring the user to know how many neurons the map should contain prior to training. The update rule for training the SOM is as follows, where $m_i$ represents the $i$th model vector in our set of model vectors $m$:

\[
m_i(t+1) =
\begin{cases}
m_i(t) + \alpha(t)\,[x(t) - m_i(t)], & i \in N_c(t) \\
m_i(t), & i \notin N_c(t)
\end{cases}
\]
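One training step of the simplest fixed-size SOM can be sketched as follows, with the neighborhood $N_c$ taken to be the winning unit and its neighbors within a given radius on a one-dimensional chain (real implementations use 2D grids and a decaying learning rate $\alpha$):

```python
def som_train_step(models, x, alpha, radius=1):
    """Present one input vector x to a 1-D chain of model vectors and
    apply the SOM update rule to the winner's neighborhood."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # find the best matching unit (most similar neuron)
    winner = min(range(len(models)), key=lambda i: sq_dist(models[i], x))
    for i in range(len(models)):
        if abs(i - winner) <= radius:  # i is in the neighborhood N_c
            models[i] = [m + alpha * (xj - m) for m, xj in zip(models[i], x)]
    return winner
```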

An extension to the traditional SOM is the growing neural gas (sometimes also referred to as a growing self-organizing map, GSOM). This type of map overcomes one of the principal limitations of the SOM in that new neurons are added on demand as the map is being trained (Fritzke, 1995). The algorithm uses thresholding to decide if the current set of model vectors is inadequately representing the input data. If this is the case, a new neuron is added to represent the data. In this way the topology of the map evolves over the training process. The final map is designed to map similar input vectors to the same neuron. This is a sort of clustering of our input data. One can begin to see the performance benefits of this sort of representation. Given a seed song, we can find the neuron that contains it in our map. If we simply select other songs that are in the same node (neuron) as this seed song, we can say that these songs are similar without ever having to explicitly compare any feature vectors.

4.2 The Growing Hierarchical Self-Organizing Map

While a SOM lets us drastically reduce or possibly eliminate any direct feature comparisons we would have to perform during similarity lookup, we can do better. As the number of songs begins to reach the millions, a flat 2D mapping becomes increasingly inadequate. The problem is similar to the reason we do not use direct feature comparisons: a 2D map with a large number of map units requires too many explicit comparisons to be useful on a large corpus. Hierarchical structures solve this problem and are traditionally used to group large amounts of data; an example would be the cataloging of books in a library (grouped by subject hierarchy). A hierarchical structure allows us to index a huge number of songs and perform similarity lookups on them quickly. To accomplish this task we extend the idea of a GSOM by adding hierarchy to form a growing hierarchical self-organizing map (GHSOM).

Figure 4.1: A view of a hierarchical SOM, showing layers 0 through 3.

This algorithm is the original SOM algorithm with the added idea of sub-maps. Each map unit may contain child maps that descend further in the hierarchy. Each node in the map still contains a model vector; however, a node now has the added ability to point to sub-maps that classify the data at greater levels of detail. Figure 4.1 gives a graphical representation of the relationship between a map and its sub-maps. One can also think of this as the parent-child relationship between nodes in a tree. The training process is similar to that of the GSOM, except that when the map is no longer adequately representing the input data, instead of always adding a new map unit, sometimes a whole new sub-map is generated. A new sub-map is created if the input vectors attached to a given map unit are not sufficiently similar. This algorithm has several nice properties. The map can represent our input data to an arbitrary degree of detail (as more and more map levels are generated). Also, the technique is completely unsupervised, in that it needs no prior knowledge of how many levels or how many map units the GHSOM should contain.

4.3 Our Application of SOMs

Self-organizing maps provide a perfect way to abstract the similarity relationships between songs. One can think of this as the third abstraction in our general architecture. The least abstract layer contains raw song data. The next level of abstraction is feature vectors. Next we use a GHSOM to create abstract relationships between feature vectors.

Our application of the GHSOM creates a partitioning of our corpus of songs. Once this partition is created, each part is repartitioned recursively. The topmost layer of the map can be thought of, by analogy, as dividing the songs into genres. However, the GHSOM does not place semantic meaning on its categories (e.g. genres) but instead creates categories based entirely on patterns in the input vectors. As you descend the hierarchy of the map, the distinctions between the vectors in the map units become finer.

This map creates a tree topology of songs. The leaves of the tree are collections of songs that are similar to each other. The non-leaf nodes do not contain songs, but rather contain links to the lower layers of the hierarchy. This hierarchical topology allows us to perform similarity lookups efficiently.

We use a GHSOM implementation from the Vienna University of Technology (Rauber et al., 2002). This implementation was specifically designed to cluster song feature vectors, although it will work equally well with any vectors. The implementation allows for the tuning of various facets of map creation, including the propensity for adding map units at a given map layer (controlling the width of the map) and the propensity for adding new sub-maps (controlling the depth of the map).

We made numerous enhancements to this implementation, including tools for visualizing the structure of the map, automatically linking song titles to the corresponding MP3s, and allowing alternate distance metrics to be plugged into the GHSOM. One enhancement was adding the notion of edge weights. Edge weights play a role in the lookup algorithms described in Chapter 5.

4.4 Edge Weights

Edge weights allow us to extend our tree-based topology of songs to a weighted tree topology. The weight of an edge in our tree is defined to be the distance between the model vectors of the child and parent map units. The distance metric is the same one used to generate the map. This gives us a notion of which child units at a given parent node are most similar to the parent. This added information allows us to make slightly more sophisticated similarity judgments.

4.5 Motivation for Maps and Hierarchical Maps

The tree-based topology of the GHSOM was a necessary abstraction over our raw song feature vectors. Given the Auditude architecture of many clients to one server, the server must be able to generate similarity lookups quickly. Direct feature vector comparisons between the seed song and the millions of songs in the database would take an inordinate amount of time. It is true that generating the hierarchical map uses many direct feature vector comparisons; however, this procedure need only be performed once. After the map is generated, a representation of it is cached on disk for use with similarity lookups. This hugely reduces the complexity of performing a lookup, and thus increases the number of clients an Auditude server could handle.

4.6 Expanding a GHSOM

Given that generating our song maps is costly in terms of time, it is important to determine how new songs can be added to the map without having to regenerate it. Algorithms for this on-the-fly map expansion are not specified by traditional GHSOM algorithms. However, we have come up with a heuristic that we feel will give good results.

1. Extract feature vector of new song

2. Begin at the root of the tree. Compare the feature vector to each of the model vectors of the map units, and follow the edge with the least distance.

3. Traverse the hierarchy of the tree until the song reaches a leaf.

4. Use the same thresholding mechanism of GHSOM to decide whether or not the leaf needs to be split into sub-maps.

This is only a temporary solution; if many songs are added after a tree is gener- ated, it will no longer give very accurate results.
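Steps 1-3 of the heuristic above amount to a greedy descent of the tree. A sketch, using a hypothetical node structure with model vectors and children (the threshold test of step 4 is omitted):

```python
class MapUnit:
    """Hypothetical GHSOM node: a model vector plus child sub-maps."""
    def __init__(self, model, children=None):
        self.model = model
        self.children = children or []

def find_leaf(root, feature):
    """Descend the hierarchy, at each level following the child whose
    model vector is nearest the new song's feature vector (steps 2-3)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    node = root
    while node.children:
        node = min(node.children, key=lambda c: sq_dist(c.model, feature))
    return node
```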

4.7 Multiple Trees and Multiple Features

Given our three different feature vectors, there are a large number of possible maps that we could generate. We generated maps of each of the features by themselves, as well as maps using combinations of various features. There are issues with this combination approach: given the various dimensionalities of the features, one feature has a tendency to outweigh the others when the feature vectors are compared.

An alternative to this combination approach is to generate many separate maps and use them in tandem when performing lookups. This allows us not only to weight each feature differently in the lookup procedure, but also to use a different distance metric with each feature. This is particularly useful with our timbre feature vectors, which can be compared with the earth mover’s distance metric (see Section 3.6), generating a much more precise genre match than conventional Euclidean distance. Once a good set of weights is determined for the individual tree approach, these weights could be used to build a combined feature map. A combined feature vector could be created that is simply the concatenation of the feature vectors from the individual features. The appropriate components of this combined feature vector could then be scaled using a good set of feature weightings.

4.8 Map Quality

The maps we generated seemed to produce mostly meaningful results. Typically, songs from the same artist would end up very near each other in the generated map. However, there were a number of anomalies in which music from completely opposite genres (e.g. rap and classical) was grouped together. Evaluating these maps and their groupings is still an open issue. User testing would be helpful in evaluating the quality of the maps we produced with the GHSOM.

Chapter 5

Similarity Lookups

The ultimate goal of our clinic project is to generate similarity judgments about songs. The first three steps towards achieving this goal have been outlined in Chapters 2, 3 and 4. Now that we have a representation of both the feature vectors of a song and a topology of song relations on the corpus of songs, we can begin to perform similarity lookups. The general lookup framework is to use the topology of our cached tree to narrow the search space, and then use direct feature vector comparisons to generate the final ranked list of songs. The ordering of this list is generated entirely by direct feature vector comparisons.

5.1 Similarity Lookups Defined

Your friend has just given you a copy of a great new song (legally, of course). You fall in love with the song and want to determine other songs that you should purchase. Traditionally, you would ask your friend what other songs he or she thought were similar to this new song. Our system provides a completely automated way of doing this. Given a seed song, our system generates a ranked list of songs that it deems to be similar to this seed song. This knowledge is intended either to be displayed directly to the user or to be fed into our playlist generation algorithm to generate a pleasant sequence for this list of similar songs.

5.2 Topological Distance

Topological distance is encoded entirely by the structure of our generated hierarchical map. The idea behind generating a map was to be able to narrow down the search space without having to do direct feature vector comparisons. Topological distance is what allows us to do this. Songs that are in the same leaf of the tree are said to have a topological distance of 0. The topological distance between any other two songs is defined to be the sum of the edge weights along the path between the leaves containing each of these songs (see Section 4.4).
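A sketch of this definition, using a hypothetical node type that stores a parent pointer and the weight of the edge to its parent:

```python
class TreeNode:
    """Hypothetical map-tree node with a weighted edge to its parent."""
    def __init__(self, parent=None, edge_weight=0.0):
        self.parent = parent
        self.edge_weight = edge_weight

def topological_distance(a, b):
    """Sum of edge weights on the path between leaves a and b."""
    # cumulative weight from a up to each of its ancestors
    up_from_a, node, total = {}, a, 0.0
    while node is not None:
        up_from_a[id(node)] = total
        total += node.edge_weight
        node = node.parent
    # climb from b until we meet an ancestor of a
    node, total = b, 0.0
    while id(node) not in up_from_a:
        total += node.edge_weight
        node = node.parent
    return total + up_from_a[id(node)]
```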

5.3 Complexity

The complexity of performing a similarity lookup is

\[
O\bigl(\log n + r \times d_{mc}(L)\bigr) \tag{5.1}
\]

where $n$ is the number of songs in the corpus, $r$ is the number of songs in the result set, $L$ is the dimensionality of the feature vectors, and $d_{mc}$ is the complexity of the distance metric being used to compare the feature vectors.

5.4 Multiple Features

In Section 4.7 we discussed the tradeoffs between generating one tree using multiple concatenated feature vectors and generating multiple trees and combining them in the lookup algorithms. Our lookup algorithms support a multiple tree, multiple feature set model. We accomplish this by giving each feature a relative weight. This weight defines how much the similarity lookups will be influenced by each of the various features. For instance, if the user wished to bias his or her similarity judgments towards choosing songs in the same genre, he or she could increase the weight of the MFCC feature, since it is good at distinguishing musical genres.

5.5 Algorithms for Lookups

There are two cases for our similarity lookups. The first is the user supplying a seed song and requesting recommendations from the entire Auditude library. The second is the user supplying a seed song and requesting recommendations from the collection of songs on his or her own computer. This second algorithm was not implemented in our final code; however, we have included both algorithms for completeness.

5.5.1 Definitions

While outlining our lookup algorithm it makes sense to define some notational shortcuts. These notations cover aspects of the topology tree and distances between feature vectors.

Topological distance ($top(s_1, s_2)$) is the sum of the edge weights along the path from $s_1$ to $s_2$. For example, two songs whose paths branch off near the bottom of the tree are more likely to be closer than ones that split off at the root.

Feature distance ($feature(s_1, s_2)$) is the distance between $s_1$ and $s_2$ as defined by direct feature vector comparisons. The distance is the sum of the distances over all features in the system, weighted by the relative feature weights. The specific distance metric used to calculate the distance for a specific feature may be unique to that feature. For example, the sone feature vectors might be compared with Euclidean distance, while the MFCC vectors might be compared with earth mover’s distance.

It is helpful to define some notational conventions for our features.

F = the set of all features
fw(f) = the relative feature weight of feature f
Node(q, f) = the node N that contains song q for feature f

5.5.2 Recommendation from Entire Library The user wishes to find r songs that are similar to a query song q from the entire database of music.

1. Determine the collection that contains the query song

\[
\forall f \in F, \quad N_f = \mathrm{Node}(q, f) \tag{5.2}
\]

2. Determine the number of needed songs for each feature f. The algorithm narrows the search space down to approximately 2r songs (where r is the number of desired results). This decision was made to allow the topological filtering to pull in more varied songs.

\[
\mathrm{target}_f = 2r \, \frac{fw(f)}{\sum_i fw(f_i)} \tag{5.3}
\]

3. For each feature f in the set F, add all songs that are stored in the collection at node N_f to the results. If this number is greater than target_f, stop and proceed to the next step. If not, add in all songs that are descendants of the parent of N_f, let N_f become its parent, and repeat this step.

4. Reduce the size of the result set by retaining the r songs that are closest to q.
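For a single feature, steps 1-4 can be sketched as follows, assuming hypothetical lookup tables for the tree structure (leaves holding songs, nodes knowing their parents) and a feature-vector store:

```python
def recommend(q, r, leaf_of, parent_of, songs_under, feature, dist):
    """Sketch of the entire-library lookup for one feature.
    leaf_of[s]     -> leaf node containing song s
    parent_of[n]   -> parent of node n (None at the root)
    songs_under[n] -> all songs in the subtree rooted at n
    feature[s]     -> feature vector of song s
    dist(u, v)     -> feature distance metric
    """
    target = 2 * r                        # step 2 (single feature)
    node = leaf_of[q]                     # step 1
    while len(songs_under[node]) < target and parent_of[node] is not None:
        node = parent_of[node]            # step 3: widen the search space
    candidates = [s for s in songs_under[node] if s != q]
    candidates.sort(key=lambda s: dist(feature[q], feature[s]))
    return candidates[:r]                 # step 4: keep the r closest
```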

5.5.3 Recommendation from User’s Collection

There are two ways that a recommendation could be generated from a user’s collection. The first is to perform feature extraction and map generation on the user’s machine. With this information, the algorithm for performing a lookup on the user’s set of songs reduces to the algorithm outlined in Section 5.5.2. However, for both performance and intellectual property reasons, Auditude prefers approaches where the majority of the computation is performed on a central server (see Section 1.3). Instead, the lookup will be performed on the server in the same manner as for the entire-library case, the only difference being that the search space is initially restricted to songs from the user’s collection. The user will request this type of search mainly when he or she is making a playlist, whereas in the entire-library case the user is most likely looking for similar music to purchase.

5.5.4 Performance

By exploiting the topological structure of the tree, we can drastically reduce the number of explicit feature vector comparisons that need to be made. This use of hierarchical structure dramatically reduces the complexity of our lookup algorithms.

Additionally, preprocessing could be used to calculate and cache feature distances between songs that are topologically close in the tree, since such songs are very likely to have direct feature comparisons performed on them. We leave an exploration of this preprocessing as future work.

Chapter 6

Playlist Generation

In this chapter, we will show how our similarity metrics can be used to create playlists that maximize inter-song similarity. Such playlists can demonstrate that our similarity metrics are valid and show one of the ways in which similarity information can be used. There are many ideas of what a 'good' playlist might be. Some people like the transitions between songs to be smooth. Others might like loudness or beat to be similar, or similar genres grouped together. Still others despise it when two songs by the same artist are adjacent on the playlist. By comparing the feature vectors of songs using either cosine or Euclidean distances, we are able to obtain a measure of similarity. Even then, however, our different feature vectors will find different areas of similarity between songs. For example, the sone feature vectors will order songs of similar loudness while the beat spectrum will order songs of similar beat, and these orderings might be completely different. Depending on what type of playlist we wish to generate, the choice and combination of feature vectors matters.
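The two distance computations mentioned here are straightforward; a short sketch follows. The feature vectors are made up for illustration, and note that the report's 'cosine distance' is a similarity: values near one indicate similar songs.

```python
import math

def euclidean_distance(u, v):
    # Straight-line distance between two feature vectors; smaller = more similar.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Cosine of the angle between the vectors: 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical sone (loudness) feature vectors for two songs.
song_a = [0.8, 0.4, 0.1]
song_b = [0.7, 0.5, 0.2]
similarity = cosine_similarity(song_a, song_b)   # close to 1.0 here
```

Because cosine similarity ignores overall vector magnitude, it compares the shape of a loudness or beat profile rather than its absolute level, which is often what we want after volume normalization.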

6.1 Complexity of Generating an Optimal Playlist

It turns out that playlist generation is a more difficult problem than originally anticipated, which becomes clear when we recognize it as an instance of a known problem in computer science: shortest Hamiltonian path (SHP). The ordinary Hamiltonian path problem attempts to find a path that includes each vertex of an arbitrary graph exactly once, whereas the SHP problem involves finding the shortest such path through all vertices in a fully connected graph. Our playlist problem, in which we want a sequence of songs that maximizes inter-song similarity, corresponds exactly to this problem: each song is a vertex in the graph, and the distances between vertices correspond to the similarity between songs. Furthermore, if we wish to 'loop' the playlist such that the last song in the playlist returns to the first song, then playlist generation reduces to the traveling salesman problem (TSP). Both SHP and TSP are NP-complete, so generating the optimal playlist is an NP-complete problem.

[Figure 6.1: Envisioning playlist generation as a shortest path problem. The panels show six songs A-F and the distances between them: (a) songs & distances; (b) shortest Hamiltonian path (SHP); (c) shortest tour (TSP); (d) Hamiltonian path using shortest edges; (e) tour with shortest edges.]

Figure 6.1(a) shows an example of six songs that have been embedded in a two-dimensional space; the Euclidean distance between two songs indicates how similar they are to each other. Figure 6.1(b) shows the shortest Hamiltonian path, whereas Figure 6.1(c) shows the best tour solving the TSP. With only six songs, we are able to calculate the optimal path quickly, but the computational time required to find the optimal path grows exponentially as we try to order more songs.

Towards the end of the project, we realized that minimizing the total distance between all songs in the playlist may not necessarily produce the best listening experience. It is possible that the variations of TSP and SHP shown in Figures 6.1(d) and 6.1(e), which minimize the longest single edge used rather than the total length, may produce more pleasing results. The TSP variant is known as the bottleneck traveling salesman problem.
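For a handful of songs, the optimal ordering can indeed be found by exhaustive search over all permutations; the factorial blow-up is exactly what makes this infeasible for longer playlists. A sketch with hypothetical pairwise distances (songs placed on a line):

```python
from itertools import permutations

def path_length(order, dist):
    # Sum of transition distances along a playlist ordering.
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def optimal_playlist(songs, dist):
    # Shortest Hamiltonian path by brute force: examines all n! orderings.
    return min(permutations(songs), key=lambda p: path_length(p, dist))

# Six hypothetical songs embedded on a line; distance = |position difference|.
pos = {"A": 0, "B": 1, "C": 3, "D": 6, "E": 10, "F": 11}
dist = {a: {b: abs(pos[a] - pos[b]) for b in pos} for a in pos}
best = optimal_playlist(list(pos), dist)
```

With six songs this is 720 orderings; with twenty songs it is already over 10^18, which is why the approximation and heuristic methods below are needed.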

6.2 A Greedy Algorithm

For a small number of songs, an approximation of the optimal ordering might still yield a pleasing playlist, and thus a greedy approach will suffice. We pick one starting song (either specified by the user or chosen at random) and repeatedly move to the closest song not yet in the playlist, until every song has been placed. A better playlist might be found by running this algorithm with every song as the starting point, and then picking the playlist that yields the smallest total distance. The pitfalls of this approach are apparent: the first few transitions will sound very similar, but the last few might sound awful. Better algorithms must be used if we wish to generate a near-optimal playlist.
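The greedy procedure, including the try-every-start refinement, can be sketched as follows. This is an illustrative reconstruction rather than the project's code; the line-embedded songs are hypothetical.

```python
def greedy_playlist(start, songs, dist):
    """Nearest-neighbor ordering: from the current song, always move to the
    closest song not yet in the playlist."""
    playlist = [start]
    remaining = set(songs) - {start}
    while remaining:
        nxt = min(remaining, key=lambda s: dist[playlist[-1]][s])
        playlist.append(nxt)
        remaining.remove(nxt)
    return playlist

def best_greedy_playlist(songs, dist):
    # Try every song as the starting point and keep the shortest result.
    def length(p):
        return sum(dist[a][b] for a, b in zip(p, p[1:]))
    return min((greedy_playlist(s, songs, dist) for s in songs), key=length)

# Hypothetical songs on a line; distance = |position difference|.
pos = {"A": 0, "B": 1, "C": 3, "D": 6, "E": 10, "F": 11}
dist = {a: {b: abs(pos[a] - pos[b]) for b in pos} for a in pos}
playlist = best_greedy_playlist(list(pos), dist)
```

On this particular example the greedy approach happens to find the optimum, but in general nothing forces it to: the last remaining songs can be arbitrarily far from the current one.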

6.3 Approximation Algorithms

There are several polynomial-time approximation algorithms for the traveling salesman problem. It is important to note that these algorithms only guarantee a solution for a 'loop' path, so that the beginning song of a playlist must also be the last song. Furthermore, the approximation algorithms require that the edge weights satisfy the triangle inequality (which our feature-vector distances do). We will briefly

describe two approximation algorithms, detailed in Chen's notes on approximation algorithms (Chen, 2003).

6.3.1 x2 Algorithm

The x2 algorithm returns a tour which is no worse than twice as long as the optimal tour. The algorithm is as follows:

1. Select a vertex r to be the ‘root’

2. Find a minimum spanning tree T from root r using Prim’s algorithm

3. Let L be the list of vertices visited in a preorder tree walk of T (a preorder tree walk visits the root first, then the left child, then the right child)

4. Return the Hamiltonian cycle H that visits the vertices in the order L

The total running time of this algorithm is O(n²), due to the complexity of Prim's algorithm. For more information on this algorithm, see Cormen et al. (2001, sec. 35.2).
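The four steps above can be sketched as follows (an illustrative Python version, not the project's code; the example distance matrix is hypothetical). Note this simple Prim implementation scans all candidate edges each iteration, so the sketch is O(n³); the O(n²) bound requires the standard key-array form of Prim's algorithm.

```python
def approx_tour(vertices, dist, root):
    """2-approximation TSP tour: build a minimum spanning tree from the root
    with Prim's algorithm, then visit vertices in preorder of that tree."""
    # Prim's algorithm: repeatedly add the cheapest edge leaving the tree.
    in_tree = {root}
    parent = {}
    while len(in_tree) < len(vertices):
        u, v = min(((u, v) for u in in_tree for v in vertices if v not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        parent[v] = u
        in_tree.add(v)
    # Preorder walk of the spanning tree.
    children = {v: [] for v in vertices}
    for v, u in parent.items():
        children[u].append(v)
    order = []
    def walk(u):
        order.append(u)
        for c in children[u]:
            walk(c)
    walk(root)
    return order + [root]   # close the cycle back to the root

# Hypothetical songs on a line; distance = |position difference|.
pos = {"A": 0, "B": 1, "C": 3, "D": 6, "E": 10, "F": 11}
dist = {a: {b: abs(pos[a] - pos[b]) for b in pos} for a in pos}
tour = approx_tour(list(pos), dist, "A")
```

The 2x guarantee comes from the triangle inequality: the preorder walk shortcuts a double traversal of the MST, and the MST weighs no more than the optimal tour.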

6.3.2 x1.5 Algorithm (Christofides' Algorithm)

This algorithm was developed by Nicos Christofides, and guarantees a tour which is no worse than 1.5 times as long as the optimal tour. The algorithm is slightly different from the x2 algorithm:

1. Select a vertex r to be the ’root’.

2. Find a minimum spanning tree T1 from root r.

3. Let S be the set of vertices with an odd degree.

4. Find a minimum weight matching M on S.

5. Find an Euler tour T2 from new graph T1 + M

6. Construct the Hamiltonian path by ’short-cutting’ T2 (visit the vertices in order of T2 only if they have not been encountered yet)

The running time of this algorithm is O(n³).

6.4 Genetic Algorithms

Genetic algorithms provide a heuristic approach to solving the traveling salesman problem and related problems, and have been widely studied since the early 1970s (Larrañaga et al., 1999). In general, a genetic algorithm simulates the process of evolution by following the rules of natural selection. This technique does not guarantee the optimal solution, but yields a good approximation to an NP-complete problem in a reasonable amount of time. This is the method we ended up implementing for playlist generation.

A genetic algorithm consists of three basic processes: mutation, crossover, and selection. Mutation occurs when a gene site is randomly replaced by another gene. Crossover involves taking two parent gene strings and combining them such that the genes of the child are determined by the genes of both parents. Finally, selection determines whether a new gene string is 'fit' to survive compared to other gene strings. We can easily apply these concepts to playlist generation: in our case, every song is a gene, a gene string is a particular ordering of the songs, and the fitness criterion is the total similarity distance of that ordering.

There are many different ways to solve the traveling salesman problem using a genetic algorithm, depending on which crossover methods and fitness functions are used. Several of these methods are described by Larrañaga et al. (1999). In our genetic algorithm implementation, we used an order-based crossover method (Syswerda, 1991). This crossover method selects several random gene positions from one parent and imposes these positions on the other parent, resulting in a new child. One of the attractive features of the genetic-algorithm approach is that the fitness function provides a good deal of flexibility. Our current code can seek solutions for either TSP or SHP and can also attempt to find the opposite, 'longest path', solutions (i.e., playlists that maximize contrast between songs).
Although our code does not currently address the bottleneck versions of these problems, a trivial change to the fitness function would allow these solutions to be found.
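A minimal sketch of this approach follows. It is not the team's actual implementation; the crossover is order-based in the spirit of Syswerda's operator, the fitness is the SHP-style total transition distance, and the line-embedded example songs are hypothetical.

```python
import random

def fitness(order, dist):
    # Total transition distance along the playlist; smaller is better.
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def crossover(p1, p2):
    """Order-based crossover: copy several randomly chosen positions from p1,
    then fill the gaps with the remaining songs in the order they occur in p2."""
    n = len(p1)
    keep = set(random.sample(range(n), n // 2))
    kept_songs = {p1[i] for i in keep}
    fill = (s for s in p2 if s not in kept_songs)
    return [p1[i] if i in keep else next(fill) for i in range(n)]

def mutate(order):
    # Swap two random positions (a song is a gene; an ordering is a gene string).
    i, j = random.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]

def evolve(songs, dist, pop_size=30, generations=200, mutation_rate=0.2):
    pop = [random.sample(songs, len(songs)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda o: fitness(o, dist))
        pop = pop[:pop_size // 2]                      # selection: fittest half
        while len(pop) < pop_size:
            a, b = random.sample(pop[:pop_size // 2], 2)
            child = crossover(a, b)
            if random.random() < mutation_rate:
                mutate(child)
            pop.append(child)
    return min(pop, key=lambda o: fitness(o, dist))

# Hypothetical songs on a line; distance = |position difference|.
pos = {"A": 0, "B": 1, "C": 3, "D": 6, "E": 10, "F": 11}
dist = {a: {b: abs(pos[a] - pos[b]) for b in pos} for a in pos}
best = evolve(list(pos), dist)
```

Seeking a maximally contrasting 'longest path' playlist is just a matter of negating (or inverting the comparison on) the fitness function, which illustrates the flexibility noted above.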

Chapter 7

User Testing

There is no guaranteed correlation between what simple numbers deem similar and what actual people deem similar when they hear songs. For this reason, user testing becomes critical as a means to determine whether similarity recommendations are sound. While our user testing was by no means comprehensive, it served as a useful 'reality check'.

7.1 First User Test

Our first user test was simple, and was intended to find out which feature vectors lent themselves better to various qualities of playlists. We picked twelve songs from our test corpus and had users rate four different orderings of the songs. Using the genetic algorithm, we generated four different playlists based on beat transition, beat, sone, and the combination of the previous three. Users were then asked to rate, on a scale of 1 to 5 (1 being worst, 5 being best), the following qualities:

Song Transition: Does the end of one song flow 'smoothly' into the next?
Beat: Do consecutive songs have similar overall beat/tempo?
Loudness: Do consecutive songs have similar overall loudness?
Genre: Does the playlist group songs into their specific genres well?
Overall: Is this a good ordering for a playlist overall?

The Test Corpus

The user test included the following 12 songs:

Abba  "Dancing Queen"
ACDC  "Highway to Hell"
Beach Boys  "California Girls"

Beatles  "Day Tripper"
Beatles  "Here Comes the Sun"
Bee Gees  "Stayin' Alive"
Chordettes  "Mr. Sandman"
Cindy Lauper  "Girls Just Wanna Have Fun"
Jimi Hendrix  "Day Tripper"
Madonna  "Material Girl"
The Doors  "Break on Through"
The Doors  "People are Strange"

7.1.1 Results

We tested 3 users, who each rated the four playlists. The averages of the rated qualities are as follows:

Playlist  Description  Song Trans.  Beat  Loudness  Genre  Overall
1         Beat Trans.  3.0          2.67  2.33      3.0    3.67
2         Beat         3.33         3.0   2.33      2.33   2.67
3         Sone         4.0          3.33  2.67      3.33   3.67
4         All          4.0          4.67  3.67      2.67   3.67

7.1.2 Analysis

Although the data we obtained is simply not enough to draw any substantial conclusion, there are some trends to be noted. While playlists generated by individual feature vectors received lower averages, Playlist 4, which combined all three feature vectors, received higher averages. This somewhat confirmed our hypothesis that we could generate a better playlist by using more feature vectors at once for comparison. In retrospect, we realized that the 'loudness' quality of a playlist was somewhat of a misunderstood judgment: some songs were relatively louder than others when users listened to them on their computers, even though we normalized the volume before calculating sone. Furthermore, all three of our users complained about the selection of songs as a test corpus, noting the extreme differences in genre (e.g., the Chordettes' "Mr. Sandman" and ACDC's "Highway to Hell"). While we originally envisioned an excellent ordering of the corpus as grouping the songs into three main genres (80's, Easy Rock, and Rock), we realized that in practice, ordinary users normally would not place these songs in the same playlist. The end result was playlists that users thought contained 'weird' and 'awful' transitions.

It is worth noting that while Playlist 4 ordered Jimi Hendrix's and The Beatles' versions of "Day Tripper" next to each other, one user simply did not like it when two versions of the same song were played consecutively in a playlist. In this case, a 'maximally similar' pairing of two songs was considered a 'bad' sequence, showing that numerical similarity does not always correspond to people's conception of a 'pleasing' ordering.

7.2 Second User Test

Our second user test was designed to gauge how well the users' conception of similarity between songs compared with the feature vectors' measure of similarity. For our test songs, we chose 50 songs from our corpus and extracted a 10-second clip from the 30-40 second region of each song. Again using the genetic algorithm, we generated four playlists: two finding the 'best' and 'worst' orderings of the beat and sone vectors combined, one finding the 'best' ordering of just the sone feature vectors, and one randomly generated playlist. We asked users to rate each 'pairing' of songs on a scale of 1 to 5. There were 50 songs, hence 49 pairings to rate. Since these were clips of songs, we asked users not to rate the pairing based on transition, but rather on how 'similar' they judged the pairing to be, as if they were listening to the entire songs. It is important to note that our clinic team hand-picked the test corpus for this user test; these were not songs selected by our similarity lookup engine. While our method guaranteed that some songs would be different, the latter method may have picked a test corpus more suitable for playlist generation.

7.2.1 Results

After obtaining data on how each user perceived similarity between each pairing of songs, we averaged the values for each pairing. We then plotted those values against the cosine-distance similarity values from our feature vectors to see if there was any correlation between the users' and feature vectors' judgments of similarity. Figure 7.1(a) shows data from playlist 1 (the 'worst' ordering using the beat and sone feature vectors). We can see that for some of the pairings there is indeed some correlation between judgments of similarity, while for other pairings there is absolutely no correlation. We also noticed that users' judgments tended to vary erratically. Figure 7.1(b) shows data from playlist 2 (the 'best' ordering using the beat and sone feature vectors). Again, users' judgments varied quite a bit between pairings. There is reasonably good correlation over song pairings 1-10, but in song pairings 10-35, where the feature vectors say the pairings are good, users did not see any consistent pattern.

[Figure 7.1: Comparing machine and human similarity judgments. (a) Judgments for the 'bad' playlist; (b) judgments for the 'good' playlist. Each panel plots normalized similarity judgment (0.2-1.1) against song pairing (0-50), with one curve for similarity based on feature vectors and one for the users' judgment of similarity.]

7.2.2 Analysis

We noticed that when the feature vectors saw good similarity between pairings of songs, users tended to agree sometimes, but not as often as we had hoped. Because of the erratic ratings users gave the pairings, we believe that our test corpus was not similar enough to begin with. Many users complained about the system's inability to pair rap songs with other, non-rap songs. We believe that a playlist generated from a similarity lookup on a very large corpus would yield better results.
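One way to quantify the agreement described here is a Pearson correlation between the users' averaged ratings and the feature vectors' similarity values per pairing. The report does not state that this statistic was computed, so the sketch below is a suggestion, and the ratings in it are entirely made up.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: users' average rating per consecutive pairing (rescaled
# to 0-1) and the feature vectors' cosine similarity for the same pairings.
user_ratings = [0.8, 0.6, 0.9, 0.3, 0.5]
machine_sims = [0.95, 0.70, 0.97, 0.55, 0.65]
r = pearson(user_ratings, machine_sims)   # values near +1 indicate agreement
```

A single number like this would complement the per-pairing plots, summarizing over all 49 pairings how closely the two judgments track each other.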

7.3 Professor O’Neill’s Playlist

Out of curiosity, our clinic advisor Professor O'Neill wanted to see how our playlist generation method compared to her personal ordering of a small collection of songs and to handmade orderings of the same songs by members of the clinic team. The test corpus included songs from the electronica, rock, and new-age genres. Professor O'Neill's original playlist was grouped first by artist and then in order of decreasing "aggressiveness".

The Chemical Brothers  Come with Us
New Order  Crystal
Lloyd Cole  Too Much E
Lloyd Cole  Butterfly
Lloyd Cole  Rattlesnakes
Lloyd Cole  Like Lovers Do (Stephen Street Mix)
Kings of Convenience  Failure
Kings of Convenience  Leaning against the Wall
OP8  If I Think of Love
Belle & Sebastian  The Stars of Track and Field
Belle & Sebastian  Like Dylan in the Movies
Dead Can Dance  American Dreaming
Dead Can Dance  Nierika
Moby  Natural Blues
Moby  Everloving
David van Tieghem  Deep Sky
David van Tieghem  A Wing and a Prayer
Patrick O'Hearn  Sacrifice

7.3.1 Random Playlists The participants in the test, both human and machine, did not have the advantage of knowing Professor O’Neill’s preferred order, nor did they have much familiarity with the songs themselves. Before showing the human- and machine-determined results, let us examine two random orderings of the songs:

David van Tieghem  Deep Sky
Patrick O'Hearn  Sacrifice
Lloyd Cole  Like Lovers Do (Stephen Street Mix)
Dead Can Dance  Nierika
Kings of Convenience  Failure
David van Tieghem  A Wing and a Prayer
Lloyd Cole  Too Much E
Moby  Natural Blues
Belle & Sebastian  The Stars of Track and Field
Kings of Convenience  Leaning against the Wall
OP8  If I Think of Love
Lloyd Cole  Rattlesnakes
New Order  Crystal
Lloyd Cole  Butterfly
Belle & Sebastian  Like Dylan in the Movies
Moby  Everloving
The Chemical Brothers  Come with Us
Dead Can Dance  American Dreaming

Lloyd Cole  Like Lovers Do (Stephen Street Mix)
New Order  Crystal
OP8  If I Think of Love
Moby  Natural Blues
The Chemical Brothers  Come with Us
Kings of Convenience  Leaning against the Wall
Patrick O'Hearn  Sacrifice
Lloyd Cole  Butterfly
Lloyd Cole  Too Much E
Kings of Convenience  Failure
Dead Can Dance  Nierika
Dead Can Dance  American Dreaming
Belle & Sebastian  The Stars of Track and Field
Lloyd Cole  Rattlesnakes

David van Tieghem  Deep Sky
Moby  Everloving
David van Tieghem  A Wing and a Prayer
Belle & Sebastian  Like Dylan in the Movies

One thing we can observe from these random playlists is that it is not uncom- mon for two songs by the same artist to be grouped together. Random playlists contain both bad and good transitions, but human ingenuity can often rationalize even the strangest of transitions.

7.3.2 Nick's Playlist

Nick's ordering of the songs most closely matched Professor O'Neill's playlist. There are some differences, but Nick's list generally places "rock" music first and then transitions into calmer music.

The Chemical Brothers  Come with Us
New Order  Crystal
Lloyd Cole  Too Much E
Moby  Natural Blues
Dead Can Dance  Nierika
Moby  Everloving
Lloyd Cole  Rattlesnakes
Lloyd Cole  Butterfly
Lloyd Cole  Like Lovers Do (Stephen Street Mix)
OP8  If I Think of Love
Dead Can Dance  American Dreaming
Kings of Convenience  Failure
Belle & Sebastian  Like Dylan in the Movies
Belle & Sebastian  The Stars of Track and Field
Kings of Convenience  Leaning against the Wall
David van Tieghem  Deep Sky
David van Tieghem  A Wing and a Prayer
Patrick O'Hearn  Sacrifice

7.3.3 Liz's Playlist

Liz's playlist reflects different choices than Nick and Professor O'Neill made.

David van Tieghem  Deep Sky
The Chemical Brothers  Come with Us

Dead Can Dance  Nierika
Lloyd Cole  Too Much E
Kings of Convenience  Failure
Lloyd Cole  Like Lovers Do (Stephen Street Mix)
Lloyd Cole  Rattlesnakes
Kings of Convenience  Leaning against the Wall
Moby  Natural Blues
Belle & Sebastian  The Stars of Track and Field
David van Tieghem  A Wing and a Prayer
Moby  Everloving
Patrick O'Hearn  Sacrifice
New Order  Crystal
Lloyd Cole  Butterfly
Belle & Sebastian  Like Dylan in the Movies
OP8  If I Think of Love
Dead Can Dance  American Dreaming

7.3.4 Brad's Playlist

Brad's playlist was different again, although we can see common themes with the other playlists.

Moby  Natural Blues
The Chemical Brothers  Come with Us
New Order  Crystal
Dead Can Dance  American Dreaming
Kings of Convenience  Failure
Lloyd Cole  Butterfly
OP8  If I Think of Love
Moby  Everloving
Kings of Convenience  Leaning against the Wall
Belle & Sebastian  Like Dylan in the Movies
Belle & Sebastian  The Stars of Track and Field
Lloyd Cole  Too Much E
Lloyd Cole  Like Lovers Do (Stephen Street Mix)
Lloyd Cole  Rattlesnakes
Dead Can Dance  Nierika
David van Tieghem  A Wing and a Prayer
Patrick O'Hearn  Sacrifice
David van Tieghem  Deep Sky

What we can learn from looking at the orderings of these songs by four different people is that there is no single "correct" playlist. Different people will weigh different aspects of the songs in different ways. We should, therefore, expect to see similar findings in machine-generated playlists.

7.3.5 Loudness-Based Playlist

A playlist based on loudness sensation seems promising. Feature-vector comparisons are based on cosine distance, with the actual distances shown in the playlist below; values closer to one indicate better matches. Notice that several songs by the same artist have been grouped together, more than we would expect by chance alone. Some of the songs by the same artist that are not grouped together match reasonable expectations; the two Moby songs, for example, are quite different and were also split up by Brad, Liz, and Nick in their playlists.

The Chemical Brothers  Come with Us  0.936
Moby  Natural Blues  0.959
OP8  If I Think of Love  0.953
Belle & Sebastian  Like Dylan in the Movies  0.948
Belle & Sebastian  The Stars of Track and Field  0.954
Dead Can Dance  American Dreaming  0.956
Patrick O'Hearn  Sacrifice  0.921
Kings of Convenience  Leaning against the Wall  0.921
Lloyd Cole  Too Much E  0.954
Lloyd Cole  Rattlesnakes  0.951
Lloyd Cole  Butterfly  0.941
New Order  Crystal  0.968
Lloyd Cole  Like Lovers Do (Stephen Street Mix)  0.944
Kings of Convenience  Failure  0.951
Dead Can Dance  Nierika  0.931
David van Tieghem  Deep Sky  0.965
David van Tieghem  A Wing and a Prayer  0.921
Moby  Everloving

7.3.6 Beat-Based Playlist

Beat also appears to give a good playlist. In this case, we have used the fft of the beat spectrum for playlist generation, again with cosine distance for comparison. We can also see that this playlist shares commonalities with the previous playlist: three Lloyd Cole songs are grouped together (interestingly, not the same three),

and it has done well at grouping "new age" songs together. One of the strangest transitions, from Dead Can Dance's "Nierika" to Lloyd Cole's "Too Much E", is clearly marked as a poor transition, with a cosine distance of only 0.630; it was apparently the best transition our algorithm could find.

Dead Can Dance  Nierika  0.630
Lloyd Cole  Too Much E  0.768
Lloyd Cole  Rattlesnakes  0.976
Lloyd Cole  Like Lovers Do (Stephen Street Mix)  0.962
OP8  If I Think of Love  0.981
Kings of Convenience  Leaning against the Wall  0.990
New Order  Crystal  0.990
David van Tieghem  A Wing and a Prayer  0.999
David van Tieghem  Deep Sky  0.988
Patrick O'Hearn  Sacrifice  0.981
Dead Can Dance  American Dreaming  0.967
Moby  Everloving  0.946
Belle & Sebastian  Like Dylan in the Movies  0.934
Kings of Convenience  Failure  0.901
Belle & Sebastian  The Stars of Track and Field  0.913
Lloyd Cole  Butterfly  0.902
Moby  Natural Blues  0.815
The Chemical Brothers  Come with Us

7.3.7 Timbre-Based Playlist

A playlist based on timbre also yields relatively good results. In the playlist below, we have used the fft of the mfccs with cosine distance as the metric. Again the Lloyd Cole songs are grouped together; this time all four songs end up together, presumably because of the singer's distinctive voice and the predominance of the guitar.

David van Tieghem  A Wing and a Prayer  0.963
The Chemical Brothers  Come with Us  0.988
Dead Can Dance  American Dreaming  0.971
New Order  Crystal  0.979
Lloyd Cole  Rattlesnakes  0.972
Lloyd Cole  Too Much E  0.959
Lloyd Cole  Like Lovers Do (Stephen Street Mix)  0.992
Lloyd Cole  Butterfly  0.987
Moby  Natural Blues  0.990

OP8  If I Think of Love  0.987
Belle & Sebastian  Like Dylan in the Movies  0.981
Belle & Sebastian  The Stars of Track and Field  0.984
Kings of Convenience  Failure  0.988
Moby  Everloving  0.994
Kings of Convenience  Leaning against the Wall  0.976
Dead Can Dance  Nierika  0.972
David van Tieghem  Deep Sky  0.967
Patrick O'Hearn  Sacrifice

Below is another timbre-based playlist; in this case the song comparisons were made using the earth-mover's distance on clustered mfcc data. As you can see, we obtained substantially similar results to those given above. The fact that these two playlists are almost identical helps to confirm that taking the fft of the mfcc data is a valid approximation. (In the listing below, the inter-song distances are earth-mover's distances, where lower numbers indicate greater similarity.)

David van Tieghem  A Wing and a Prayer  1.504
The Chemical Brothers  Come with Us  0.822
Dead Can Dance  American Dreaming  1.208
New Order  Crystal  0.801
Lloyd Cole  Rattlesnakes  1.011
Lloyd Cole  Too Much E  1.379
Lloyd Cole  Like Lovers Do (Stephen Street Mix)  0.568
Lloyd Cole  Butterfly  0.685
OP8  If I Think of Love  0.722
Moby  Natural Blues  0.842
Belle & Sebastian  Like Dylan in the Movies  1.105
David van Tieghem  Deep Sky  1.027
Belle & Sebastian  The Stars of Track and Field  0.868
Kings of Convenience  Failure  0.949
Moby  Everloving  0.964
Kings of Convenience  Leaning against the Wall  1.330
Dead Can Dance  Nierika  2.994
Patrick O'Hearn  Sacrifice

7.3.8 Combined-Features Playlist

Our final playlist uses a combination of all three distance metrics, with equal weighting given to each feature. It is also a good playlist, although in this case the overall order is reversed compared with the other three machine-generated playlists. (When the distance between each pair of songs is symmetric, the sum of the distances is identical to the sum of the distances when the entire playlist is reversed.)

Dead Can Dance  Nierika  0.825
Kings of Convenience  Failure  0.937
Belle & Sebastian  The Stars of Track and Field  0.944
Belle & Sebastian  Like Dylan in the Movies  0.929
Moby  Everloving  0.948
Kings of Convenience  Leaning against the Wall  0.931
Patrick O'Hearn  Sacrifice  0.954
David van Tieghem  Deep Sky  0.967
David van Tieghem  A Wing and a Prayer  0.954
New Order  Crystal  0.958
Dead Can Dance  American Dreaming  0.943
Lloyd Cole  Butterfly  0.939
Lloyd Cole  Rattlesnakes  0.970
Lloyd Cole  Like Lovers Do (Stephen Street Mix)  0.963
OP8  If I Think of Love  0.953
Moby  Natural Blues  0.904
The Chemical Brothers  Come with Us  0.850
Lloyd Cole  Too Much E

Overall, we can see that our techniques can produce a meaningful playlist. It isn’t exactly the same playlist as our sample humans created, but it is nevertheless quite reasonable.

7.4 Presentation Days Demo

During Harvey Mudd's Presentation Days, we had an opportunity to demonstrate our system to a live audience. Members of the audience were given a list of all the songs in our test corpus and could suggest seed songs for the system to use for similarity lookups. We made several observations during our demonstrations. In our second presentation, the audience chose a rock song for a lookup and was skeptical about the returned results. This discrepancy in our system can be explained by an inconsistent addition of songs to our corpus: the day before the presentation, we had increased the test corpus by around 600 songs and run our feature-extraction code on the files.

These new files were downsampled using a different method than our original 150-song corpus. Oddly enough, during lookups the system seemed to group songs from only one of the two sets: songs came either from our original corpus or from the newly added one, but not both. Despite this, our system returned a Metallica song using The Beatles' "Eight Days a Week" as the seed song, a recommendation that many audience members found to be odd; the audience seemed unwilling to acknowledge that the two songs do, in fact, share some striking similarities. In our third presentation, the audience was pleased by a lookup using an Enya song as the seed song, which returned all four Enya songs in the corpus as recommendations. The lookup also returned a number of similar new-age songs. We noticed a significant improvement in our lookups when we used a larger corpus of similar genres (i.e., Professor O'Neill's newly added collection). Our first corpus of 150 songs contained a mix of classic rock, alternative rock, 1980s pop, and rap, and returned very odd recommendations. To be honest, we were not entirely convinced that our system was performing good similarity recommendations until we had increased the corpus size significantly.

Chapter 8

Conclusions and Future Work

This year, the Auditude clinic team has taken a broad problem statement and focused it into a more manageable, but still ambitious, endeavor. We have studied the work of many researchers in the field of music information retrieval and data mining. We have taken ideas, algorithms, and even fragments of code from several generous experts, combined them in new ways, thrown in a few of our own ideas, and developed the basic skeleton of a music similarity engine. We are eager to see Auditude develop this into a finished product, one which we look forward to using ourselves. In the course of our project, we explored several avenues which, in the end, were not part of our main work, but which Auditude or others may find useful. We have included these in Appendix B.

Appendix A

Manuals and End to End Description

The system we have outlined in the preceding chapters was converted from theory into code. What follows is an end-to-end description of how to run the code we generated. The process covers feature extraction, map generation, and finally similarity lookups.

A.1 Fluctuation Strength

The text of this section, the Matlab source files, and x86 Linux executables for the fluctuation strength feature extraction may be found in a directory called sone on the CD-ROM.

A.1.1 How to Run

To perform feature extraction, run the fluct_files function with a list of .wav files as arguments. It will process each of them in turn and write a .sone.mat file for each. This .sone.mat file contains a vector of matrices called fluct; each of these matrices is the fluctuation strength matrix for a window of the song (6-second windows in this case). To get useful information out, run the process_flucts function with a list of .sone.mat files as arguments. It will read in each of these files and for each of them output a .sone.beg, a .sone.end, and a .sone.all file. Each of these files is a feature vector (i.e., it has been flattened column-wise). The .beg file contains the first frame of the song, the .end file the last, and the .all file the median matrix over the entire song. These files are in text format, i.e., a series of ASCII real numbers separated by spaces.

A.1.2 How to Build

Since it is easier to integrate with other components as a stand-alone binary, the fluct_files and process_flucts functions are written so that they can be turned into such binaries by the Matlab compiler. To do so, run the commands

mcc -m fluct_files
mcc -m process_flucts

from within Matlab. This will build binaries in the current working directory. You may need to fiddle with dynamic linker options to get the binaries to run; under Linux on x86 it was necessary to set $LD_LIBRARY_PATH to

$MATLAB/bin/glnx86:$MATLAB/extern/lib/glnx86:$MATLAB/sys/os/glnx86

where $MATLAB is the root directory of the Matlab installation.
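For example, in a Bourne-style shell (the /usr/local/matlab install location here is only an assumption; substitute your own Matlab root):

```shell
# Hypothetical Matlab install root; adjust to your installation.
MATLAB=/usr/local/matlab
export LD_LIBRARY_PATH="$MATLAB/bin/glnx86:$MATLAB/extern/lib/glnx86:$MATLAB/sys/os/glnx86"
```

After setting this, the compiled fluct_files and process_flucts binaries should be able to locate the Matlab runtime libraries at startup.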

A.1.3 Implementation Notes

Below we examine the implementation of fluct_files and process_flucts in detail.

fluct_files

This script transforms a raw audio file into a series of fluctuation strength matrices. This transformation takes a number of steps. First, the audio samples are scaled so that the loudest sound is 75 dB; this compensates for variations in the recording level of different songs. At this point, the sample values are pressure in Pa. The sound samples are then transformed from the time domain into the frequency domain using a fast Fourier transform (FFT); the transform is done over windows of 256 samples. The pressure in each frequency band is then converted to intensity (in W/m²). We then break the spectrum up into the bands of the bark scale. Each window is then processed to account for spreading between bands caused by the human ear. These pressure values are then converted to dB. The dB sound-pressure levels are then converted into sone values using the method of Bladon and Lindblom (1981); the sone values are converted to phon using the method of Allen (1997). At this point we have phon values for each frequency band in each window. To remove time-dependence, we then convert this data to modulation amplitude. This involves taking windows of 6 seconds and using another FFT to determine periodic trends in the loudness sensations in each frequency band. We can weight the resulting matrices by how sensitive humans are to these changes at different modulation frequencies; this gives fluctuation strength. The data for weighting comes from Zwicker & Fastl (1999, Section 10.2).

process_flucts

It was necessary to write a simple function, print_mat, to print out a matrix in ASCII to a text file. For some reason, compiled Matlab functions can only write binary .mat files. print_mat uses the fopen, fprintf, and fclose primitive I/O operators to do the actual printing. The script reads all feature vectors into a giant matrix, and then prints out the median, beginning, and ending vector for each song. This isn't strictly necessary currently, since we don't do any processing on the feature vectors; however, a couple of methods we tried out for more sophisticated feature vector comparison, dimensionality reduction and covariance-matrix-weighted Euclidean distance (a.k.a. Mahalanobis distance), required all of the data to be in memory at the same time so it could be processed. This code is commented out in the current version of process_flucts, but is left in to show how one could do similar manipulations in the future.
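To make the fluct_files pipeline concrete, here is a minimal sketch of three of its framing stages: scaling so the loudest sample corresponds to 75 dB SPL, windowed FFTs of 256 samples, and the second FFT across windows that exposes periodic loudness trends. All function names here are our own, and the bark-band grouping, spreading, and sone/phon conversions are deliberately omitted; this is an illustration, not the actual Matlab code.

```python
import numpy as np

P_REF = 2e-5  # reference pressure for dB SPL, in Pa (20 micropascals)

def scale_to_75db(samples):
    """Scale samples (arbitrary units) so the loudest sample is 75 dB SPL, in Pa."""
    target = P_REF * 10 ** (75.0 / 20.0)
    return samples / np.max(np.abs(samples)) * target

def power_frames(pressure, win=256):
    """Non-overlapping windows of `win` samples -> per-bin power (proportional to intensity)."""
    n = len(pressure) // win
    frames = pressure[: n * win].reshape(n, win)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def modulation_amplitude(band_levels):
    """FFT over time in each band: the magnitudes are modulation amplitudes,
    which fluct_files then weights by modulation-frequency sensitivity."""
    return np.abs(np.fft.rfft(band_levels, axis=0))
```

In the real pipeline the input to the modulation step would be phon values per bark band over a 6-second window, not raw spectral power.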

A.2 Calculating the Beat Spectrum

This section describes how to generate the beat spectrum for .wav files. There are two Matlab scripts for calculating the beat spectrum: the first processes a single audio file, and the second processes multiple files.

A.2.1 beatspec_single.m

This is a simple script that allows you to view the similarity matrix and beat spectrum in graphical form. It can only be run on a single wav file, and it must be done in Matlab. To run it, type in Matlab:

>> beatspec_single('filename.wav')

where filename.wav is the file you want to run the beat spectrum on. The script will then calculate the beat spectrum and generate two graphs: a grayscale matrix that represents the similarity matrix using cosine distance, and a plot of the beat spectrum using auto-correlation. No data is saved when using this script.

A.2.2 beatspec.m

This is a more powerful script that gives you several options for beat spectrum calculation. This is the script to run if you want to generate beat spectrum files for song comparison. The Matlab code is designed to be compiled into C code by typing in Matlab:

>> mcc -m beatspec.m

This will generate an executable which takes two arguments:

• the files to run the beat spectrum on

• the output directory

If no output directory is specified, it will save the files in the current directory. The executable is run by calling:

./beatspec input-files output-directory

You will then be asked if you want to run the beat spectrum using the default configuration. Enter 'y' or simply push enter to run the beat spectrum using default values; otherwise enter 'n'. The following are the options you can change:

1. Use Auto-correlation (DEFAULT) or Diagonal sum to calculate beat spectrum? Auto-correlation produces more accurate and robust results, but takes longer to calculate. Diagonal sum is faster, but less robust.

2. Desired window size? (DEFAULT: 256 frames) How many frames each “window” will have when we parameterize the audio.

3. Desired start time to run beat spectrum? (DEFAULT: 40 secs)

4. Desired end time to run beat spectrum? (DEFAULT: 70 secs) You can also enter 0 to run beat spectrum on the entire song. This may freeze the program if the song is too long, though.

5. Desired clip time for beginning/end beat spectrum clips? (DEFAULT: 10 secs) For example, if the clip time is x seconds, then the program will calculate the beat spectrum for the first and last x seconds of the song. This is useful for smooth transitions in playlist generation. If you do not want to calculate the beat spectrum for these clip times, enter 0.

6. Desired lag-time? (DEFAULT: 5 secs) The beat spectrum is a function of self-similarity versus lag time. Lag time affects how much data is ultimately included in the final beat spectrum calculation. Usually a lag time of 4–8 seconds is sufficient. It is important to remember that you can only compare vectors of the same length, and thus only beat spectra with the same lag-time (and sampling rate).

Once the options have been set, beatspec will analyze the .wav file(s), calculate the beat spectra, and generate .beat.all files, which contain a list of floating point numbers, in the directory you specified. Also, if you specified to run the beat spectrum on beginning/end clips, it will generate .beat.beg and .beat.end files respectively.

A.2.3 Implementation Notes

In Section 2.2, we discussed the theory behind the beat spectrum. In this section, we discuss how the theory applies to the actual implementation, in case you would like to tweak or change the way it is calculated.

Audio Parameterization

Before we parameterize the audio, we must first extract the specific portions of the audio we will run the beat spectrum on, as specified by the options above. First we find how many frames per second (FPS) there will be based on the song's sampling rate. FPS is calculated by dividing the sampling rate by half the window size. Once we have the FPS, we can find which frame corresponds to a given time in the song, and we can split the audio data into the 'chunks' that we will analyze. For example, with the default options we will want the first and last 10 seconds of the audio, as well as the 40–70 second chunk. Once we have the specified audio, we must parameterize it into its spectral representation. This is typically done by 'windowing' the audio. A helper function spectrum.m is used to do this.
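The frame arithmetic above can be sketched in a few lines (helper names are ours, not code from beatspec.m):

```python
def frames_per_second(sample_rate, window_size=256):
    """FPS = sampling rate divided by half the window size
    (windows overlap by half, per the description above)."""
    return sample_rate / (window_size / 2)

def time_to_frame(seconds, sample_rate, window_size=256):
    """Index of the frame that represents a given time in the song."""
    return int(seconds * frames_per_second(sample_rate, window_size))
```

With the defaults (44.1 kHz audio, 256-sample windows) this gives 344.53125 frames per second, so the 40–70 second chunk spans roughly frames 13781 to 24117.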

Similarity Matrix

In Section 2.2.2, we mentioned that calculating the entire similarity matrix was not necessary for calculating the beat spectrum. Unfortunately, because of Matlab's inefficiencies with for loops, the auto-correlation requires that the entire similarity matrix be present if we are to obtain an accurate representation of the beat spectrum. Therefore, we must calculate the entire similarity matrix for each audio chunk, regardless of the specified lag-time. Ideally, if you are just using diagonal sum to calculate the beat spectrum, you would only need to calculate the similarity matrix for the values in between the main diagonal and the lag-time. However, since we only used auto-correlation in practice, we did not implement a method to truncate the similarity matrix accordingly.

Beat Spectrum Calculation

As you may recall from Section 2.2.3, there are two ways of extracting the beat spectrum from the similarity matrix. The two methods, auto-correlation and diagonal sum, are implemented in autocor.m and diagsum.m respectively. Once the calculation is complete, the appropriate files are saved to the specified output directory.
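The two extraction methods can be sketched as follows. This is a simplified stand-in for autocor.m and diagsum.m, assuming a frames-by-bins spectral parameterization; the actual Matlab code may differ in normalization details.

```python
import numpy as np

def cosine_similarity_matrix(frames):
    """Pairwise cosine similarity between spectral frames (cf. simcosi.m)."""
    unit = frames / np.maximum(np.linalg.norm(frames, axis=1, keepdims=True), 1e-12)
    return unit @ unit.T

def beat_spectrum_diagsum(S, max_lag):
    """Diagonal sum: average each superdiagonal; peaks mark repetition periods."""
    return np.array([np.diagonal(S, k).mean() for k in range(max_lag)])

def beat_spectrum_autocor(S, max_lag):
    """Auto-correlation of the similarity matrix along the lag axis
    (slower than the diagonal sum, but more robust)."""
    n = S.shape[0]
    B = np.array([np.sum(S[:, : n - k] * S[:, k:]) for k in range(max_lag)])
    return B / B[0]
```

A frame pattern that repeats every four windows produces peaks at lags 0, 4, 8, ... in both variants.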

Standardizing Vectors

If you have a sizeable number of feature vectors in your corpus, you may wish to 'standardize' the vectors so that they are weighted equally. This can be done with the normalize.m file, or, once it has been compiled to C code, by typing:

./normalize input-vectors output-directory

This function will standardize the vector set by subtracting the mean and dividing by the standard deviation for each data point in the set. For example, if each song is a row with each column representing a data point, it will subtract the mean and divide by the standard deviation for each column. It will then output the results in the specified output directory.
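The column-wise standardization just described amounts to a z-score per column; a minimal sketch (the real normalize.m may handle edge cases differently):

```python
import numpy as np

def standardize(vectors):
    """Z-score each column: subtract the column mean, divide by the column
    standard deviation. Rows are songs, columns are data points."""
    mean = vectors.mean(axis=0)
    std = vectors.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # guard: constant columns pass through
    return (vectors - mean) / std
```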

A.2.4 Description of Files

The following is a list of files and helper functions included in the Beat Spectrum package:

autocor.m Takes in an n × n matrix and returns the auto-correlation

beatspec.m Calculates the beat spectrum given .wav files

beatspec_single.m Displays the similarity matrix and beat spectrum given a single .wav file

diagsum.m Takes in an n × n matrix and returns the sum of all the super diagonals

get_file.m Strips the path from an input string and returns just the file- name.

normalize.m Normalizes a set of vectors

print_mat.m Saves output to a file

simcosi.m Calculates cosine distance on two matrices

spectrum.m Parameterizes audio into spectral components

A.3 Timbre Similarity

The timbre feature-extraction process works well for recognizing artists' voices and coarse genre classification. The MFCCs which are extracted are too big to use directly. They may be reduced using either EMD or FFTs.

A.3.1 Relevant Code Files

The following source files comprise timbre feature extraction:

mfcc.m taken from http://rvl4.ecn.purdue.edu/~malcolm/interval/1998-010/, calculates mel frequency cepstral coefficients from pulse code modulation encodings of sound (commonly wav files)

emd.c taken from http://robotics.stanford.edu/~rubner/emd/default.htm, calculates the distance between two distributions. Please see this website for more information about EMD and this particular implementation of it.

emd.h taken from http://robotics.stanford.edu/~rubner/emd/default.htm, but significantly modified to match our features and ground distance, as suggested by the webpage.

timbre_mfcc.m takes a wav file, returns a 10 × 2997 matrix of floats: 10 coefficients for each of 2997 overlapping windows in a 30-second sample. If the input wav is at least seventy seconds, it takes the 40–70 second window. Otherwise, the function uses the first 30 seconds.

save_mfcc.m saves the data from cluster_mfcc to a .mfcc file.

emd_cmp.c a standalone program that takes a file with clustered data for two songs and uses the symmetric K-L distance to calculate the Earth Mover's Distance between them.

ghsom_emd.c defines emd_kl_wrapper, a function which takes clustered mfcc data and calculates the distance between them with EMD and K-L.

mfcc_avg.m takes the output of timbre_mfcc (a 10 × 2997 matrix of floats) and saves the averages of each row to a .smfcc file.

mfcc_generator.m Generates MFCCs for all the .wav files in a directory and saves them as .mfcc files. Changing the value of cluster determines the number of clusters that will be generated by the K-means algorithm.

fft_generator.m Generates FFTs of the MFCCs for all the .wav files in a directory and saves them as .fft files. Changing the value of fftcount changes the number of FFT coefficients that will be stored in the .fft file.
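One plausible reading of the reduction fft_generator performs is an FFT of each coefficient's trajectory over time, keeping the first fftcount magnitudes. This is our own sketch; the exact orientation and coefficient count in the real script are assumptions.

```python
import numpy as np

def fft_reduce(mfcc, fftcount=50):
    """Summarize a (coeffs x windows) MFCC matrix: FFT each coefficient's
    trajectory over time and keep the first `fftcount` magnitudes."""
    return np.abs(np.fft.rfft(mfcc, axis=1))[:, :fftcount]
```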

A.3.2 File Formats

The following file extensions are used by the timbre feature-extraction code:

.mfcc clustered mfcc data—2 matrices of COEFFS × CLUSTERS floats each, one for means and one for variances, then COEFFS weights.

.smfcc COEFFS coeffs, the averages of the raw mfcc data

.fft a matrix of FFTCOUNT × COEFFS floats.

A.3.3 Calculating MFCC

To generate the MFCC of a song (in Matlab):

>> beatlesMFCC = timbre_mfcc('Beatles - Eight days a week.wav');

To generate the MFCCs of all the songs in a directory, from that directory:

>> mfcc_generator;

The files will be saved with the same names as the originals, except with the .wav changed to .mfcc. To generate FFTs of all the songs in a directory, from that directory:

>> fft_generator;

A.3.4 Comparing MFCCs

To compare two MFCC feature vectors using EMD as the distance metric (using K-L distance to compare clusters), run

./emd_cmp file1.mfcc file2.mfcc

The EMD distance is printed to standard output. This utility is mostly provided for quick sanity checking of MFCC data.
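The ground distance EMD uses here is the symmetric K-L distance between clusters. Assuming each cluster is a diagonal Gaussian (a mean and a variance per coefficient, as in the .mfcc format), that ground distance could look like this sketch; the C implementation in emd.h may differ in detail:

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL divergence between univariate Gaussians N(m1, v1) and N(m2, v2)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def symmetric_kl(means1, vars1, means2, vars2):
    """Symmetrized K-L between diagonal Gaussians, summed over dimensions."""
    return sum(
        kl_gauss(m1, v1, m2, v2) + kl_gauss(m2, v2, m1, v1)
        for m1, v1, m2, v2 in zip(means1, vars1, means2, vars2)
    )
```

The symmetrization matters because plain K-L is not a metric: KL(p‖q) generally differs from KL(q‖p).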

A.4 Generating Maps

In order to generate our hierarchical maps we began with a basic GHSOM implementation from the Vienna University of Technology. A number of enhancements and additional options have been added to this base implementation. We have also created a number of helper tools to allow for easy data transformation from feature vector files to the input form required by the GHSOM code. This section is intended as a supplement to the full GHSOM manual available from the Vienna University of Technology: http://www.ifs.tuwien.ac.at/~andi/ghsom/download/ghsom_guide.html

A.4.1 Transforming Data

In order to generate a hierarchical map, a number of files must be created.

1. Prop file: defines map-specific tunable parameters (e.g. thresholds for map splitting). Also, includes paths to other input files.

2. Template file: defines a feature vector for this map (includes number of di- mensions and number of input vectors).

3. Data file: includes the input vectors

One can find a detailed description of how to create these files by hand in the full GHSOM manual. However, we have created a number of data transformation tools to aid the creation of these input files.

The first tool is designed to take a directory of feature vector files (most likely generated through our feature-extraction scripts) and set up all the files necessary for a run of the GHSOM code. The script is named tools/data_transform/transform.pl. It has the following variables defined at the beginning of the Perl file.

$dataDir This is the input feature vector data. Each file should be the name of the song from which the features were extracted. The extension of the file will be stripped off.

$outputDir The directory to write the generated GHSOM files to.

$corpusUrl A URL containing MP3 files for the songs in the corpus. Each MP3 file must have the same label as its corresponding feature vector file. However, any spaces that occur in the name of the song must be changed to underscores. This is a result of a limitation of the GHSOM implementation.

$type Either “GHSOM” or “SOM”. Specifies the type of map to generate. SOM is a flat map; GHSOM is a hierarchical map.

$makePairWiseDistance 0 or 1. Controls whether to make a pairwise distance file for use with Sandia clustering software. This feature is essentially useless except to people with access to Sandia's proprietary clustering software; however, it has been included in this guide for the sake of completeness.

Once the parameters have been set inside the script, the script must be run. This will generate all the files needed for a GHSOM run.

A second script, located at tools/data_transform/glue.pl, concatenates feature vectors. It has the following variables defined at the beginning of the Perl file.

$dataDir1 The first input directory for feature vectors. Each file should be the name of the song from which the features were extracted. The extension of the file will be stripped off.

$dataDir2 The second input directory. The format is the same as for the first dataDir. The song labels must be the same in each directory so the appropriate feature vectors can be concatenated.

$outputDir The place to store the concatenated feature vectors.

Once these input files are generated they can be fed directly into the GHSOM executable.

A.4.2 Running the Map Code

GHSOM is run in the following manner:

./ghsom prop-file

This will generate a map and save the results to the outputDirectory (this is defined in the prop file, but it is set to ./output by our data transform script). The results can be viewed through a web server by making the output directory web accessible. The GHSOM code will generate an HTML version of the map.

A.4.3 Enhancements to the Map Code

A number of enhancements were made to the map code. These enhancements were designed to tune the GHSOM implementation to our specific needs.

New Properties

corpusURL (defined in the .prop file) This should be set to point to a web server that has copies of MP3 files of the input songs. The names of the files must have all spaces removed.

saveAsTree This should be set to true if the map generated by the GHSOM is to be used with the similarity lookup code. This property will save a representation of the map to a .tree file.

useEMD This should be set to true if EMD is to be the default distance metric for building the map. This feature has been found to be largely useless for reasons explained in our distance metrics section; however, it has been included for completeness.

coVarianceMatrixFile Defines a file containing a covariance matrix. The covariance matrix should contain tab-delimited values.

Visualization Graphics

These are created to show a graphical representation of the model and song vectors. They are automatically generated and displayed on the GHSOM output HTML.

Edge Weights

Edge weights are now calculated and displayed on the output HTML.

A.4.4 Sample Map

The map in Figure A.1(a) was generated using only sone data. The screenshot presented is from the first level of hierarchy of the map. The greyscale grid images are used as aids to visualize the vectors. The images next to the songs represent input vectors, and the ones next to “QE” represent model vectors. Figure A.1(b) shows an expanded view of a low-level map unit. This map unit requires no additional division in order to adequately classify the songs.

A.5 Playlist Generation Using the Greedy Algorithm

This section details how to generate playlists from feature vectors using the greedy approach. It will address the playlist.m and playlist2.m Matlab scripts. These files are designed to be compiled into C code by typing in Matlab:

>> mcc -m playlist.m

or

>> mcc -m playlist2.m

A.5.1 playlist.m

This is the simpler version of playlist generation, which takes in a single set of feature vectors and generates a playlist using a brute force method. The size of the vectors does not matter, as long as they are all the same size. It can be run by typing:

./playlist n { -bt } feature-vectors

Specifying the Seed Song

n specifies the song to begin playlist generation. The default value is 0, which selects a random seed song. Note: If you do not specify a seed song when you run the program, you will receive a syntax error, but this will not affect the program. This is due to the fact that it is trying to convert a non-numerical string into a number.

Beat Transition (-bt)

If you want to generate a playlist based on beat transition, then you need to specify it with the -bt option. If you select this option, then you must have two sets of feature vectors, .beg and .end, present in the same directory. You only have to specify the location of the .beg files and the program will automatically search for the .end files in the same directory.

(a) The top-level map

(b) A final partition

Figure A.1: The GHSOM code in action

The program will finally generate an ordered list of songs by comparing the given feature vectors. The algorithm for determining this ordering is detailed in chapter 6. No data is saved when this program is run.

A.5.2 playlist2.m

This is the more powerful version of playlist generation. It can take several sets of feature vectors and will automatically concatenate them. For example, you may want to generate a playlist based on both the beat spectrum and sone vectors, and weight one at 70% and the other at 30%. This function also writes the distance matrix to the file songs.matrix, which is useful if you want to run the genetic algorithm or do testing. The size of the vectors does not matter and they will be weighted as you specify, but each set MUST have the same sized vectors within the set. Also, if multiple sets are specified, it is important that all of the files for the various features of the same song have the same root; otherwise an error may occur. The program is run by typing:

./playlist2 n { -bt beginning-feature-vectors } feature-vectors

Specifying the Seed Song

The n argument specifies the song to begin playlist generation. The default value is 0, which will use a random seed song. If you enter -1, the algorithm will generate a playlist using every song as the seed song, and then return the one with the shortest total distance. If you enter -2, the algorithm will use dimensionality reduction to find the optimal solution. However, this method has not been tested and probably does not return very accurate results.

Beat Transition Like playlist.m,thisprogramalso supports comparing transition vectors, but is only limited to a single set. In other words, I cannot compare both the beat spectrum and sone transition vectors. I have to choose one or the other. If you specify the -bt option, you must first specify the .beg files as the first vector set.

Generating an m3u File

Once you run the program, you will be asked whether you would like to generate an m3u playlist. The default answer is yes. If you choose yes, the program will generate a playlist and output the ordering into an m3u file called song_playlist.m3u in the current directory. This file can be loaded into an mp3-compatible player, such as Winamp. The mp3 files must be in the same directory as the m3u file.

Generating the Distance Matrix

The program will automatically generate the distance matrix, calculated by applying the cosine distance metric to the feature vectors, into a file called songs.matrix. The file itself just contains a matrix of numbers, but the genetic algorithm code requires the matrix. As the program runs, it will count the number of songs based on the feature vector files it is given and display the final count. If the count does not seem correct, it may be because your feature vector files are named inconsistently. Next, it will analyze and glue the vector sets specified. It will output a count of the number of feature vectors it was given. If there is more than one set of vectors, you are given the option to give each set a weight from 0 to 1. You may simply push enter without entering a weight and the program will weight each vector set equally. For example, if you have 4 vector sets, it will give each set a weight of 0.25. Finally, the program will output an ordered list of songs, and generate an m3u playlist if specified. It will also output the distance matrix.

A.5.3 The Algorithm

The playlist generation algorithm is a greedy approach to the next-shortest-distance problem. The distances are determined by calculating the cosine distance between the vectors. First it generates a matrix by calculating cosine distance, similar to the way the similarity matrix is created in the beat spectrum. If there are multiple feature vectors, then it will calculate a distance matrix for each vector set, multiply it by the specified weight, and add it to the previously calculated matrix. The result is a single n × n distance matrix (where n is the number of songs) that represents each song's similarity to every other song. If we are doing beat transition, then the matrix is calculated slightly differently, because we need to compare a song's end data with the next song's beginning data for a 'smooth' transition. Finally, we set the diagonal of the matrix to −1, because we do not care whether or not a song is similar to itself. Once we have the matrix, the algorithm starts at the seed song and finds the next closest song (i.e. the maximum value in that song's row, since with the cosine measure larger values mean greater similarity) that is not already in the playlist being generated. If the song is already in the playlist, it sets that value to −1 and finds the next maximum value. It then adds the song's index to the playlist, and finds the next closest song from the new index. It follows this path until all songs have been added to the playlist.
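The greedy loop just described can be sketched as follows. This is our own simplification of the algorithm in playlist.m; the real code also handles beat transitions and the weighted combination of multiple matrices.

```python
import numpy as np

def greedy_playlist(similarity, seed=0):
    """Order songs greedily over an n-by-n cosine matrix: start at `seed`
    and repeatedly jump to the most similar song not yet in the playlist."""
    S = np.array(similarity, dtype=float)
    np.fill_diagonal(S, -1.0)          # a song is never its own neighbor
    order = [seed]
    S[:, seed] = -1.0                  # used songs can never be picked again
    for _ in range(S.shape[0] - 1):
        nxt = int(np.argmax(S[order[-1]]))
        S[:, nxt] = -1.0
        order.append(nxt)
    return order
```

Zeroing out entire columns as songs are used is a cheap way of guaranteeing that argmax only ever sees unvisited songs.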

A.5.4 Description of Files

The following is a list of files and helper functions included in the Playlist Generation package:

get_file.m Strips the path from an input string and returns just the file- name

playlist.m The simple version of playlist generation

playlist2.m The more powerful version of playlist generation

playlist2m3u.m Generates an m3u file given a list of song names

print_mat.m Saves output to a file

simcosi.m Calculates cosine distance on two matrices

A.6 Playlist Generation Using the Genetic Algorithm

This section briefly describes how to generate a playlist using the genetic algorithm code. The code is written in C++ and is therefore somewhat independent of the Matlab code described above. It requires a distance matrix file of floating-point numbers and a list of song labels given as input. However, this genetic-algorithm code is integrated into the Tree Lookup code, and thus does not require stand-alone usage.

A.6.1 Running the Code The algorithm is called by typing:

./natselplay { option-switches } distance-matrix song-labels

Option Switches Switches can be used to modify the program behavior as follows:

-d Produce debugging output, showing the best individual and the gen- eration number each time a better individual is found.

-g n Specify the number of generations to run for (default 100).

-m n Specify the probability of genetic mutation (default 0.001). Each time a new organism is generated, it is mutated with a probability equal to this value. If the organism mutates, two random positions in the sequence are swapped.

-p n Specify the size of the population of organisms (default 1000). This is the number of organisms that compete to survive into the next generation.

-s n Specify the size of the selection pool (default 500). This is the number of organisms that survive into the next generation. Note that surviving gives an organism a chance to become a parent, but does not guarantee it. The difference between the -p value and the -s value is the number of new organisms produced at the beginning of each generation.

-S n Specify a seed for the random-number generator. If no random seed is given, one is derived from the time of day.
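The mutation and survival rules these switches control can be sketched as follows. All names are ours, and crossover and the actual fitness function are omitted; this illustrates only swap mutation and the population/pool relationship described above.

```python
import random

def mutate(playlist, rate=0.001, rng=random):
    """With probability `rate`, swap two random positions in the sequence."""
    child = list(playlist)
    if rng.random() < rate:
        i = rng.randrange(len(child))
        j = rng.randrange(len(child))
        child[i], child[j] = child[j], child[i]
    return child

def next_generation(population, fitness, pop_size=1000, pool_size=500,
                    rate=0.001, rng=random):
    """Keep the `pool_size` fittest organisms (lower fitness = shorter
    playlist), then breed `pop_size - pool_size` mutated offspring from
    randomly chosen survivors."""
    survivors = sorted(population, key=fitness)[:pool_size]
    offspring = [mutate(rng.choice(survivors), rate, rng)
                 for _ in range(pop_size - pool_size)]
    return survivors + offspring
```

Swap mutation is convenient here because it always maps a valid playlist (a permutation of the songs) to another valid playlist.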

A.7 Performing Music Recommendation

We have created an application that unifies all the major features of our clinic work. This application is located in the tree_code directory. It allows for the following functionality:

1. Generate similarity judgments

2. Create playlists

3. Play playlists and individual songs

A.7.1 Description of Files 1. Makefile - compiles the tree code

2. Genetic Algorithm Files

colony.cpp implementation file for a colony of playlists
colony.hpp header file for a colony of playlists
natselenv.hpp header file for the natural selection environment
natselenv.cpp implementation file for the natural selection environment
natselplay.cpp main driver for running playlist generation using genetic algorithms
natselplay.h wrapper function for playlist generation via genetic algorithms
organism.hpp header file for an organism (which is really a playlist)
organism.cpp implementation for the organism class
random.hpp header file for a random number generator
random.cpp implementation of the random number generator

3. Distance Metric files

distancemetric.h definition of the abstract distance metric class
cosine.cc implementation of the cosine distance metric
cosine.h header file for the cosine distance metric
euclid.cc implementation of the Euclidean distance metric
euclid.h header file for the Euclidean distance metric
emd_distance.cc implementation of the earth mover's distance metric (this calls an external wrapper function)
emd_distance.h header file for the EMD metric

4. EMD implementation and wrapper

emd/emd.c main implementation of EMD
emd/emd.h predeclarations for the implementation of EMD
emd/tree_emd.h wrapper function to call from the distance metric file
emd/tree_emd.c implementation of the wrapper. Allows input of a raw data vector instead of a matrix.

5. General Files

lookuptree.cc unifies all features, trees, and feature vectors. Provides functionality to perform similarity lookups and playlist generation.
lookuptree.h header file for lookuptree.cc
playlist.h a class used to define all the parameters to generate a playlist. The results of the playlist are also stored in this object.
playlist.cc implementation of the playlist class
query.h a class that defines all information about a similarity lookup. The results of the lookup are also stored in this object.
query.cc implementation of the query class
sample_lookup.cc the interactive shell for doing lookups and playlist generation
treetag.h a class that defines a tree tag. These tags are located in the .tree files generated by the GHSOM.
treetag.cc implementation of the treetag class
treetokenizer.h takes a string of characters and chops it into tree tag tokens
treetokenizer.cc implementation of the treetokenizer class
weightedtree.h encapsulates the weighted trees that represent our hierarchical maps
weightedtree.cc implementation of the weightedtree class

A.7.2 Input Property Files

In an effort to be consistent with the GHSOM code, we have borrowed some conventions. The property files that we use as input to the tree code are formatted as follows. Note that the dataFile can be automatically constructed using the data transformation script, and the treeFile is produced as a result of running the GHSOM code with the saveAsTree flag set to true.

feature feature-id
weight relative-feature-weight
metric euclid | emd | cosine
dataFile path-to-datafile
treeFile path-to-treeFile

Each of these entries defines a single feature. There can be as many features defined in the input file as needed. The last feature must be followed by a single line containing the keyword done. Also, note that the treeFile is an optional part of the feature definition; this means that a feature need not have a corresponding hierarchical map. The treeFile will be automatically generated by the GHSOM code if the map is run with the saveAsTree flag set to true. Also, the dataFile is in the same format as the one used to run the GHSOM.

A sample property file:

feature sone
weight 1.0
metric euclid
datafile ./sample_trees/sone_ghsom_corpus2.data

treefile ./sample_trees/sone_ghsom_corpus2_77_1.tree

feature beat
weight 1.0
metric cosine
datafile ./sample_trees/beat_ghsom_corpus2.data
treefile ./sample_trees/beat_ghsom_corpus2_77_1.tree

feature timbre
weight 1.0
metric emd
datafile ./sample_trees/ghsom_mfcc.data
treefile ./sample_trees/fft_main_corpus_77_1.tree

done
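A minimal parser sketch for the format above (our own helper, not part of the tree code; it treats keys case-insensitively, since the sample file uses datafile/treefile while the description uses dataFile/treeFile):

```python
def parse_prop(text):
    """Parse feature definitions (terminated by 'done') into a list of dicts."""
    features, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        key = key.lower()
        if key == "done":
            break
        if key == "feature":
            current = {"feature": value}
            features.append(current)
        elif current is not None:
            current[key] = value       # weight, metric, datafile, treefile
    return features
```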

A.7.3 Running the Tree Code

The tree code is run in the following manner:

./sample_lookup path-to-prop-file

A.7.4 Commands Available in the Tree Application

Once the tree code is run, a shell prompt will be displayed. The system is now ready for interaction. There are a number of commands available to the user.

1. help Displays a list of all available commands.

2. quit Quits the program.

3. lookup result-set Perform a similarity lookup. All parameters for the lookup must have been set via the appropriate shell commands. result-set specifies a name under which to store the results of the lookup. The name should be a single word. It can be referenced in other contexts inside the shell.

4. setTopologyWeight topology-weight Set the topology weight of the next lookup. This basically states how much the lookup should be based on topology versus direct feature comparisons. Default value is 0.5.

5. getTopologyWeight Displays the current topology weights

6. setCyclical { true | false } Determines if the next playlist generated will be cyclical (i.e. ideal for loop- ing). Default value is false.

7. getCyclical Displays whether the next playlist will be cyclical.

8. genPlaylist result-set Generate a playlist and store the results in result-set.

9. setSeedSong song-name | song-id Set the seed song for the next similarity lookup. The argument to setSeedSong is either the full name of the song or simply the song id number (this can be determined by running the list command).

10. getSeedSong Displays the current seed song.

11. setNumResults number-of-results Set the number of results desired for the next lookup. Default value is 20.

12. getNumResults Displays the number of results for the next lookup.

13. setPlaylistSongs result-set Determine which songs should be made into a playlist the next time genPlaylist is called. result-set must be an existing result set.

14. createResultSet result-set index … done Allows the user to create a custom result set. This is useful for testing the playlist generator using arbitrary input songs. The indices are simply song id numbers separated by white space.

15. list Displays the songs in the corpus.

16. listResultSet result-set Output the songs in a given result set. 86 Manuals and End to End Description

17. playSong song-name | song-id Plays a given song using the default MP3 player.

18. playList result-set Plays a playlist containing the songs in result-set.

A.7.5 Sample Music Recommendation Session

In this session a user performs a similarity lookup on Van Halen’s “Right Now” (song 147). Then, the user creates a playlist from these songs.

50 > ./sample_lookup ./sample_trees/beat_sone.prop
> setSeedSong 147
> getSeedSong
Seed Song = Van_Halen_-_Right_now
> setTopologyWeight .4
> setNumResults 20
> lookup results
Executing Query on Songs -
Parameters -
numResults = 20
topologyWeight = 0.4
Van_Halen_-_Right_now
Results -
1. Tony_Bennett_-_Fly_Me_to_the_Moon
2. The_Who_-_Baba_O’Riley
3. Mahler_-_Songs_of_a_Wayfarer_1
4. Mahler_-_Songs_of_a_Wayfarer_4
5. Bee_Gees_-_Stayin_alive
6. Guns_&_Roses_-_Knockin_on_heavens_door
7. Wagner_-_Ride_of_the_Valkyries
8. Pearl_Jam_-_Yellow_ledbetter
9. Black_Sabbath_-_Iron_man
10. Tony_Bennett_-_I_Left_My_Heart_In_San_Francisco
11. Metallica_-_Master_Of_Puppets
12. Abba_-_Dancing_queen
13. Metallica_-_Nothing_else_matter
14. Smashing_Pumpkins_-_1979
15. Billy_Joel_-_Piano_man
16. Sublime_-_40oz_to_freedom
17. Big_Bad_Voodoo_Daddy_-_Maddest_Kind_of_Love
18. Ccr_-_Long_as_i_can_see_the_light
19. Tony_Bennett_-_Rags_To_Riches
20. Goo_Goo_Dolls_-_Iris
> setPlaylistSongs results
> genPlaylist playlist
Generating a playlist -
Results -
1. Metallica_-_Master_Of_Puppets
2. Ccr_-_Long_as_i_can_see_the_light
3. Mahler_-_Songs_of_a_Wayfarer_4
4. Big_Bad_Voodoo_Daddy_-_Maddest_Kind_of_Love
5. Wagner_-_Ride_of_the_Valkyries
6. Abba_-_Dancing_queen
7. Tony_Bennett_-_I_Left_My_Heart_In_San_Francisco
8. Sublime_-_40oz_to_freedom
9. Mahler_-_Songs_of_a_Wayfarer_1
10. Goo_Goo_Dolls_-_Iris
11. Smashing_Pumpkins_-_1979
12. The_Who_-_Baba_O’Riley
13. Guns_&_Roses_-_Knockin_on_heavens_door
14. Van_Halen_-_Right_now
15. Black_Sabbath_-_Iron_man
16. Metallica_-_Nothing_else_matter
17. Bee_Gees_-_Stayin_alive
18. Billy_Joel_-_Piano_man
19. Pearl_Jam_-_Yellow_ledbetter
20. Tony_Bennett_-_Fly_Me_to_the_Moon
21. Tony_Bennett_-_Rags_To_Riches
> quit

Appendix B

Unexplored Possibilities

During the course of the year, the team researched techniques that initially seemed applicable to the project. Mostly due to time constraints, however, some of these ideas were never fully explored or implemented.

B.1 Representing Trees in a Database

Originally, one of the goals of our clinic was to implement all of our lookup procedures in a database of some type. As the project progressed, however, it became increasingly clear that the major contribution of our clinic would be to deal with feature extraction and mapping issues rather than scalability issues. Nevertheless, we did a good deal of planning for the eventual necessity of using our system with a database.

B.1.1 Requirements for Our Database

In order to evaluate database types in a uniform manner, we formalized some requirements that the chosen database must meet.

1. The database must be able to incorporate new songs without having to regenerate the existing stored data. Auditude’s library of songs is immense, and creating the database from this library will be a very time-consuming operation. Auditude updates their library of music monthly, and should be able to add these updates midstream into our database.

2. The database must be able to perform similarity lookups quickly.

This requirement follows from the idea that Auditude’s central servers will perform recommendation functions for a large number of distributed client applications.

3. The database must be free or reasonably priced. We do not have the funds necessary to purchase an extremely expensive proprietary database; an open-source, completely free database is the most desirable. Auditude may swap our specific database implementation for a more expensive option at a later date. Given the modularity of our code, upgrading the database should not be a particularly difficult operation.

4. The database must be able to efficiently store and manipulate a tree structure. At the heart of our storage and lookup procedures is a tree, created using GHSOMs (Growing Hierarchical Self-Organizing Maps). The database we choose must be particularly quick and flexible when it comes to encoding this structure.

B.1.2 Choosing a Database Type

For this design decision we looked at two basic types of databases: relational and directory. We evaluated each type against the requirements set forth in Section B.1.1.

Directory based database

These databases store data in an explicitly hierarchical structure. The most common type is built on the X.500 DAP (Directory Access Protocol). Recently, however, a newer technology, LDAP (Lightweight Directory Access Protocol), has become the preferred type of directory database on the Internet (Howes, 1995). These databases are lightweight because they leave out the list operation in favor of the search operation; this shift in available operations makes LDAP databases optimized for reading rather than writing (Howes, 1995). When we evaluated these types of databases against our requirements, a number of problems became evident. Storing a tree structure in an explicit directory hierarchy has a number of undesirable properties, which will become evident in Section B.1.3 when we outline the tree model that we have decided to use. The issue of cost also becomes a factor: the fastest directory-based databases are not open-source and are in fact extremely expensive, while the free alternatives simply do not perform at a high enough level. The simplest type of directory database one could imagine is simply storing information on the host filesystem. However, this would be completely platform dependent, space inefficient, non-standard, and extremely slow.

Relational database

These databases use tables to store information. Relationships between data objects are encoded with keys that refer to data in other tables and rows, and all tables reside in a flat namespace. At first glance, this seems inadequate for storing a hierarchical structure such as a tree. However, we were able to find a model for storing trees in a relational database that was not only adequate but in some ways better than the explicit hierarchy of a directory database. This model allows for easy insertions and removals of songs from the database. Using various query optimization techniques, these databases have managed to give very good performance for a wide range of applications. One specific implementation, MySQL, is designed to offer the bare-bones features of a relational SQL database with an emphasis on performance. The web journal Databasics (Harkins & Reid, 2001) gives a nice overview of the principal benefits of this implementation:

1. It’s inexpensive.
2. It’s highly optimized.
3. It provides flexible interfaces to many different databases.

For these reasons we decided that a relational database was the way to go, and specifically that MySQL was the implementation that we would use.

B.1.3 Representing Trees in a Relational Database

Using explicit links to create a hierarchical structure is something that relational databases, and SQL in particular, do not do very well. What SQL excels at is representing set relationships (Celko, 1996). Consider storing explicit links between the nodes of a tree in an SQL table: in order to navigate from the root to a node, the table would have to be queried many times to follow the links down to the desired data. We found a better solution, the Nested-Set Model (Celko, 1996), for representing trees. The idea behind the nested-set model is, instead of viewing a node in a tree as consisting of links to its parent node, to think of a node as representing the set of all nodes that fall below it in the tree. This representation is attractive to us because SQL can very easily encode the nested-set view of a tree. To do so, we need to number the nodes in an appropriate fashion.

Figure B.1: Two different views of the same tree. A is a graph, B is a nested-set.

Joe Celko (1996) presents a model for representing trees in SQL. Each node is given two numbers, which one can think of as a left and a right index. In order to define the properties of this model, we need to define some symbols.

L(N) = the left index of node N
R(N) = the right index of node N
T = the set of all nodes in the tree
D(N) = the set of all nodes that are direct or indirect descendants of node N

The choice of left and right indices must satisfy the following invariant.

∀N ∈ T, ∀c ∈ D(N): L(c) > L(N) ∧ R(c) < R(N)

As long as we can come up with a numbering of the nodes that satisfies this invariant, extracting relationships from the tree will be very easy. It is trivial to determine the root-to-node path in the tree, and finding all descendants of a node is extremely easy as well. These two tasks form the backbone of our similarity lookups.

Figure B.2: A numbered tree

The numbering of the nodes is determined by a depth-first search. Initialize a counting variable i = 1. When a node is visited for the first time, its left index is set to i, and then i is incremented. When a node is visited for the last time, its right index is set to i, and then i is incremented. Perform this numbering by initiating a depth-first search starting at the root of the tree.

B.1.4 Applying the Nested-Set Model

Now that we have a method for representing trees in SQL, we need a framework for deciding what data we will store in this tree and the procedures by which we will store it. The requirements fall out of the algorithm we use to grow the tree. Our tree generation groups songs at the leaves of the tree, and multiple songs can be stored at each leaf, so our tree model must handle this type of structure. Each leaf of our tree is called a collection. A collection contains multiple songs, but can be manipulated as a single unit inside the tree. Each intermediate node in the tree contains no data except its left and right index. Next, we need to define a number of procedures that we will perform on this tree structure. The primary procedures are building the tree and performing lookups on it.

B.1.5 Growing Our Tree

Our algorithm for growing the tree progressively splits the library of songs into finer and finer divisions. If the songs at a given node are determined to be too dissimilar, then new sub-nodes are split off to hold portions of the old node. Fortunately, the nested-set tree model handles this procedure easily.

Start with all songs in one collection at the root of the tree, and give the tree an initial numbering. If the insertion of a song does not require the splitting of a collection, then no modification to our tree structure is needed at all; we simply add that song to the collection. When a split in a collection is needed, the mapping in the database between collection and song must be modified, as must the structure of the tree. Specifically, we must preserve the left- and right-index properties of the tree. Since we are only dealing with insertions at the leaves of our tree, this renumbering is easy. We can formalize the procedure for splitting node N into c child nodes:

∀n ∈ T:
    if L(n) > L(N) then L(n) = L(n) + 2c
    if R(n) > L(N) then R(n) = R(n) + 2c
for i = 1, 3, 5, …, 2c − 1:
    insert a node j into T with L(j) = L(N) + i ∧ R(j) = L(N) + i + 1

Collapsing multiple collections that share a common ancestor is accomplished by reversing this procedure.
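The split procedure above can be sketched directly in code. This is an illustrative implementation, not our delivered system; the dictionary representation and the child-naming scheme are assumptions made for the example.

```python
def split_node(indices, parent, c):
    """Split `parent` into c child collections while preserving the
    nested-set invariant. `indices` maps each node to (left, right)."""
    L = indices[parent][0]
    # Shift every index greater than L(parent) up by 2c to make room.
    for node, (l, r) in list(indices.items()):
        indices[node] = (l + 2 * c if l > L else l,
                         r + 2 * c if r > L else r)
    # Insert the c children as empty leaves just inside the parent,
    # at (L + i, L + i + 1) for i = 1, 3, ..., 2c - 1.
    for k in range(c):
        i = 2 * k + 1
        indices[f"{parent}.child{k}"] = (L + i, L + i + 1)
    return indices

# Splitting a lone root (1, 2) into two children renumbers the root to
# (1, 6) and inserts children at (2, 3) and (4, 5).
idx = split_node({"root": (1, 2)}, "root", 2)
print(idx)
```

Because only indices greater than L(parent) move, every untouched subtree keeps its relative ordering, so the invariant holds without a full renumbering.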

B.2 More Efficient Dimensionality Reduction

Figure B.3: A sample split. All red numbers have been increased by four (2 × 2 new nodes)

There exist other, more efficient algorithms for dimensionality reduction. We did not choose to pursue them because map creation was fast enough that further optimization seemed unnecessary. Had we chosen to pursue the idea further, we probably would have tried Principal Component Analysis (PCA) next. PCA has the advantage of running in time O(np² + p³), where n is the number of data points and p is the dimensionality of each data point. Since the method we used, multidimensional scaling, runs in time O(n²), PCA would clearly be preferable for large data sets. For more information, see Hand et al. (2001, ch. 3).
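To make the cost structure concrete, here is a minimal PCA sketch using NumPy. It is not part of our delivered code; the data is random and purely illustrative.

```python
import numpy as np

def pca(X, k):
    """Reduce n points of dimension p to k dimensions with PCA.
    The cost is dominated by forming the p-by-p covariance matrix
    (O(n p^2)) and its eigendecomposition (O(p^3))."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / (len(X) - 1)           # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues, ascending order
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k components
    return Xc @ top_k

# 200 hypothetical feature vectors of dimension 10, projected to 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = pca(X, 2)
print(Y.shape)   # (200, 2)
```

Note that nothing here scales with n² — the only n-dependent step is the covariance accumulation — which is why PCA would be preferable to multidimensional scaling for large corpora.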

B.3 Other Algorithms for Playlist Generation

Though the Traveling Salesman problem described in Chapter 6 is NP-complete, as mentioned before, relatively simple polynomial-time approximation algorithms do exist for special cases of it. We had hoped to implement one such algorithm, based on minimum spanning trees, which could have generated cyclic playlists that were within a factor of two of optimal. However, time constraints did not permit this. It would be interesting to see how this approach compares to the genetic algorithm approach used in our solution. For more details on this algorithm, see Cormen et al. (2001, sec. 35.2). In addition, it would be interesting to compare playlists like those described above with playlists that minimize the maximum discrepancy between adjacent songs. Further research would have to be done to determine if efficient algorithms exist for this problem; the genetic algorithm we used could be modified simply by changing the fitness function.
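The MST-based approximation we had hoped to implement can be sketched as follows. This is a hypothetical illustration: the song representation and distance function are stand-ins, and the factor-of-two bound only holds when the distance is a metric.

```python
def mst_playlist(songs, dist):
    """Sketch of the MST-based 2-approximation (Cormen et al. 2001,
    sec. 35.2): grow a minimum spanning tree with Prim's algorithm,
    then order the songs by a preorder walk of that tree."""
    start = songs[0]
    in_tree, children = {start}, {s: [] for s in songs}
    while len(in_tree) < len(songs):
        # Cheapest edge from the tree to a song not yet in it.
        u, v = min(((u, v) for u in in_tree
                    for v in songs if v not in in_tree),
                   key=lambda e: dist(e[0], e[1]))
        children[u].append(v)
        in_tree.add(v)
    order = []
    def walk(u):            # preorder traversal yields the playlist
        order.append(u)
        for v in children[u]:
            walk(v)
    walk(start)
    return order

# Stand-in example: songs as scalar loudness values, distance = |a - b|.
print(mst_playlist([0, 10, 2, 8], lambda a, b: abs(a - b)))   # [0, 2, 8, 10]
```

A minimax variant (minimizing the largest adjacent-song discrepancy, as discussed above) would require a different objective; in our genetic-algorithm framework that amounts to swapping the fitness function.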

Appendix C

Saga of Project (with Pictures)

The following is the greatest story ever told. Any similarity to actual events is not a coincidence, since it is a true story.

C.1 Cast of Characters

The winds of fate threw together five people, their destinies tenuously linked...

• Professor O’Neill — An idealistic young professor with a dream. A dream of what we do not know, but a dream nonetheless.

• Paul Ruvolo — An idealistic young college student. He had a certain spring in his step and gleam in his eye. These would only last the first semester. He was the project manager for this wily bunch. He was also devastatingly good looking.

• Nick Taylor — An idealistic young... why the hell are they all idealistic? Nick was not idealistic, he was, uhhh, anti-idealistic. He had a certain flair about him, maybe it was due to his intense love of jazz, particularly Glenn Miller’s “Take Five.”

• Liz Schoof — A ghetto fabulous girl who brought her knowledge of the streets to the project.

• Brad Poon — A real maverick who plays by his own rules. He doesn’t take no for an answer. That’s right establishment, you best be watching out when the G-funk is in da house. No really, he’s this Asian dude from Orange County.

C.2 Ground Zero

Our cast of characters began with a simple task: to generate a performance-independent polyphonic music recommendation engine. Unfortunately most of them didn’t have the faintest idea what that meant. Which was probably good, because they would later learn that it is a diabolical problem that has foiled brilliant minds around the world. After looking over a dictionary for an hour or so, our fearless project manager realized the severity of the task at hand. Essentially the group, to be known heretofore as the A-Team, was supposed to create a system that, given a database of song performances, could match a new song performance to the appropriate entry in the database. The system must do this without any knowledge other than the waveform of the new song performance. Liz was stunned. She thought the project was to lay down some funky beats and some mad fresh lyrics to create a hit rap track. However, she decided to lend her best to the noble quest.

C.3 La Mèrde de Paris

The team began by researching the state of the art in polyphonic music recognition. Unfortunately the state of the art was not very advanced. However, the team dutifully examined several key strategies commonly used to tackle the problem. During this research phase the team learned of a magical faraway land. A land known only as Europia, and a city dubbed Paris. There was a special squadron of the true and just convening at this location in only a month. Could the team possibly participate in the festivities on such short notice? “The proposal is due before the conference even starts!” Paul cried in protest. However, the team was able to push back this deadline in order to attend the conference in the hopes of gaining new insight into what had been discovered to be a very difficult project. The plan was to send one or two members of the team; however, good fortune smiled upon the A-Team. A beleaguered Professor Hodas offered to send the entire team to Paris. Paul was slightly upset when he discovered that there was an actual city of Paris and that the conference was not being held at the Paris hotel in Las Vegas. The cries of envy were quick to come from the rest of the students at Harvey Mudd College. The A-Team responded with dignity and humility, as is the mark of a classy unit.

The conference the team attended in Paris was dubbed Ismir (International Conference on Music Information Retrieval). The team set off for Paris with high hopes of solving the task that had been bestowed upon them. The team had many exploits set against the lush backdrop of the city of life. The team gained new direction in their quest as well. The original problem bestowed upon the team proved to be much beyond the current reach of MIR technology. The team instead decided to focus on the area of music recommendation and similarity. The A-Team would develop an application capable of generating similarity judgments on a corpus of songs and composing them into playlists. The shift in focus was inspired by a number of outstanding individuals encountered at the conference. The most notable were Jonathan Foote of Xerox PARC and Beth Logan of Hewlett-Packard. Jonathan Foote had developed a similarity metric for extracting beat and rhythm information from a piece of music. The team immediately realized that this could be potentially useful for determining similarity between pieces of music. Beth Logan’s work was focused in the realm of timbre. She used MFCCs (Mel Frequency Cepstrum Coefficients) to examine the timbre of pieces of music. This was another direction the team wanted to go in.

C.4 Forget Paris... Please

As the team returned to Claremont they realized that they were well behind schedule on their new project. However, the new direction gained from Ismir proved invaluable. The team reformulated a development plan for the rest of the semester: the team was to prototype three feature vectors in Matlab. The first semester concluded with some beginning-stage feature extraction code, a huge step for the previously lost A-Team.

C.5 Casey at the Bat

Second semester began in earnest. The team was hitting lead-off in the presentation order. They did not hesitate to hit a home run. To say it was an epic presentation of masterful skill would be an understatement. To say that the team is modest would be an overstatement. Development went ahead as planned. Paul began to develop plans for performing lookup algorithms while the other three maggots... err team members... continued work on feature extraction. Liz worked on extracting timbre information, Brad on beat, and Nick on loudness sensation. Nick’s loudness sensation was the first to be completed. Brad’s beat spectrum was close but there were still bugs. Liz’s MFCCs were generated, but there was the

(a) Paris, City of Cafés (b) Paris, City of Tourism
(c) Ircam, Home to Ismir (d) The Pompidou Center
(e) Brad starts to “Go Native” (f) Liz wants to “Take Five”

Figure C.1: Adventures in Paris.

(a) The Obligatory Poster Pose
(b) Brad Gets into the Beat (c) Paul Leads Everyone in Prayer

Figure C.2: Projects Day, 2003.

problem of how to compare them. These delays were well accounted for in the project schedule. As the features became more mature, Paul began to experiment with map code that he stole... ummm downloaded... from the website of a participant of Ismir. This code was up and running and generating maps in a week, an enormous gain of time, since the team had planned on implementing this technique themselves.

C.6 The Ghetto Girl Makes Good

Brad got his beat spectrum working after a quick e-mail to Jonathan Foote. With this task complete he began working on generating playlists, basically on schedule. Liz took a page from Beth Logan’s playbook and decided to use Earth Mover’s Distance to compare the MFCCs. Implementing this proved to be very challenging. In the meantime, our hero Paul was working on an application to unify all features and functionality to allow users to perform similarity lookups. This application took a few weeks to come together, but it was complete before the frosty chill of code freeze. Liz persevered and after a few hiccoughs produced code to allow comparison of her MFCCs. All was well for the A-Team. Brad talked to the masses and got some feedback on our system through user testing. It remains to be seen whether he was in collusion with the good people at Big Bowl Cafe, as he used their tasty grease-laden Thai food to bribe users into participating. All threads of the project came together in the final week. The team now had a system that performed all the major functions they had outlined in their proposal, and had also solicited feedback from the populace. A job well done. Or was it... Well, if anything, it was a job done. Or was it...

C.7 Wait, We Have to do a Presentation?

Late Monday night Paul was browsing through his e-mail from first semester. Apparently there was something called Presentation Days, and the A-Team was supposed to be one of the leading contributors. “No wayyyyyy.... We ain’t be doing dat shizzzneeet,” cried Brad. However, there was no avoiding it. The team began preparing an extra-large corpus in order to demo their system to the adoring masses. Much to the team’s delight, the expanded corpus performed extremely well.

C.8 The Day of Reckoning

The air was thick with anticipation as the team awoke from their all too brief collective slumber. Nick was manning the poster as the team made their way down to the battlefield (Jacobs room 134). The team was in the zone. Every word flowed like molasses. Every analogy left the audience dumbfounded. Every second was a masterpiece. The presentation went off without a hitch, the demo impressed, and the team’s work was well received by most.

C.9 There and Back Again, a Project Manager’s Story

As I sit here writing the final text of my year in clinic the memories come flooding back. However, I have to go so you don’t get to read about any of them. Clinic is finished as of the completion of this sentence... this sentence that I am typing right now... PERIOD.

Bibliography

Allen, Jont B., & Neely, Stephen T. 1997. Modeling the relation between the intensity of just-noticeable difference and loudness for pure tones and wideband noise. Journal of the Acoustical Society of America, 102(6), 3628–3645.

Aucouturier, Jean-Julien, & Pachet, François. 2002. Music similarity measures: What’s the use? In: Fingerhut (2002).

Bladon, R. A. W., & Lindblom, Björn. 1981. Modeling the judgement of vowel quality differences. Journal of the Acoustical Society of America, 69(5), 1414–1422.

Celko, Joe. 1996. A look at SQL trees. DBMS online.

Chen, Hui. 2003. Approximation algorithms for TSP. http://www.msci.memphis.edu/~giri/7713/f00/HuiChen/HuiChen2.htm.

Cormen, Thomas H., Leiserson, Charles E., Rivest, Ronald L., & Stein, Clifford. 2001. Introduction to algorithms. Second edn. Cambridge: MIT Press.

Fingerhut, Michael (ed). 2002. ISMIR 2002 conference proceedings. Paris: IRCAM, for ISMIR.

Fletcher, H., & Munson, W. 1933. Loudness, its definition, measurement, and calculation. Journal of the Acoustical Society of America, 5, 82–108.

Foote, Jonathan, & Cooper, Matthew. 2002. Automatic music summarization via similarity analysis. In: Fingerhut (2002).

Foote, Jonathan, Cooper, Mathew, & Nam, Unjung. 2002. Audio retrieval by rhyth- mic similarity. In: Fingerhut (2002).

Fritzke, Bernd. 1995. A growing neural gas network learns topologies. Advances in neural information processing systems.

Fung, Glenn. 2001 (June). A comprehensive overview of basic clustering algorithms. www.cs.wisc.edu/~gfung/clustering.pdf.

Hand, David, Mannila, Heikki, & Smyth, Padhraic. 2001. Principles of data mining. Adaptive Computation and Machine Learning. Cambridge: Bradford.

Harkins, Susan Sales, & Reid, Martin W.P. 2001. Many web developers prefer MySQL. Databasics.

Howes, Timothy A. 1995. The lightweight directory access protocol: X.500 lite. CTI technical report 95-8.

Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biological cybernetics.

Larrañaga, P., Kuijpers, C.M.H., Murga, R.H., Inza, I., & Dizdarevic, S. 1999. Genetic algorithms for the travelling salesman problem: A review of representations and operators. Artificial Intelligence Review.

Logan, Beth, & Salomon, Ariel. 2001a (June). A content-based music similarity function. Tech. rept. CRL-2001-2. Cambridge Research Laboratory.

Logan, Beth, & Salomon, Ariel. 2001b. A music similarity function based on signal analysis. ICME.

Pampalk, E., Rauber, A., & Merkl, D. 2002. Content-based organization and visualization of music archives. In: Proceedings of the ACM Multimedia. Juan les Pins, France: ACM.

Pradeep, Gatram, & Gupta, Shalabh. 2003. Extraction of frequency from words. http://www.cse.iitk.ac.in/~amit/courses/768/00/gatram/freq.html.

Rabiner, Lawrence, & Juang, Biing-Hwang. 1993. Fundamentals of speech recogni- tion. Englewood Cliffs: Prentice Hall.

Rauber, Andreas, Pampalk, Elias, & Merkl, Dieter. 2002. Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by sound similarity. In: Fingerhut (2002).

Schroeder, M. R., Atal, B. S., & Hall, J. L. 1979. Optimizing digital speech coders by exploiting masking properties of the human ear. Journal of the Acoustical Society of America, 66(6), 1647–1652.

Stearns, Samuel D., & David, Rath A. 1996. Signal processing algorithms in Matlab. Prentice Hall Signal Processing Series. Saddle River: Prentice Hall.

Syswerda, G. 1991. Schedule optimization using genetic algorithms. In: Larrañaga et al. (1999).

Zwicker, E., & Fastl, H. 1999. Psychoacoustics. Second edn. Springer Series in Information Sciences, no. 22. Berlin: Springer.