A Dynamic Programming Approach to Segmenting DJ Mixed Music Shows

Tim Scarfe, Wouter M. Koolen and Yuri Kalnishkan

Computer Learning Research Centre and Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, United Kingdom {tim,wouter,yura}@cs.rhul.ac.uk

Abstract. In this paper we describe a method to infer the time indexes of song transitions in recorded audio. We apply our method to segment DJ-mixed electronic dance music podcasts and radio shows. These shows are seamlessly mixed by a DJ, so the track boundaries are intentionally obfuscated. The driving factor is that audio streams often contain little or no event metadata to describe their contents, much like the table of contents a purchased CD would provide. Our technique involves transforming the podcast into a sequence of feature vectors, each representing a small interval of time, and defining a matrix showing the cost of a potential track starting at a given time with a given length. We consider both Fourier and time-domain autocorrelation features. We use a dynamic programming approach to find the segmentation with the lowest total cost. We tested the approach on a dataset of 15 shows of 2 hours each (called Global DJ Broadcast by Markus Schulz).

1 Introduction

DJ mixes are an essential part of electronic dance music. Whether you listen to your favourite DJ on the radio, live or downloaded from the internet, the individual tracks will be mixed together. More recently, podcasts (through Apple iTunes) have become the de facto standard empowering people to download radio shows automatically for offline listening. Users can play the mixes on their computers and/or mobile devices. Many shows are also broadcast on internet radio stations such as Digitally Imported and syndicated worldwide on terrestrial radio.

Podcasts on iTunes already have an "event metadata" concept called chapters built in. Very few radio show producers (two that we know of) use the chapters feature in their podcasts. Adding this metadata relies on a proprietary Apple extension to their media files, and the ones we looked at had highly inaccurate indexes. This is not surprising given that adding this metadata to the files is a laborious manual process.

One of the interesting features of audio is that you cannot scrub through it and get a contextual picture in the same way you could with video. Audio must be assimilated and experienced in real time to understand the context. Many DJs intentionally do not provide too much information about their mixes because they want to maintain a certain mystique and have their set seen as one distinct work of art rather than a set of tracks simply mixed together. We will not speculate about the reasons for this, but it is clearly beneficial to have the metadata because it allows us to skip the tracks we do not like and become intimately familiar with the show's content and the respective authors of the tracks.

To mix tracks, DJs always match the speed/BPM (beats per minute) of each track during a transition and align their percussion in the time domain. To make things sound even more seamless they often also apply EQ or harmonically mix tracks to stop any dissonance between instruments in adjacent tracks. DJs will know which "key" each track is in, and have access to a chart showing which other keys are harmonically compatible. This ensures that when both tracks are playing, the tracks are harmonious and each critical band of hearing in the human ear is only stimulated by one peak in the frequency domain, avoiding a characteristic muddy sound. Many real instruments have a harmonic series of peaks in the frequency domain, usually with each harmonic at an integer multiple of the fundamental frequency. Synthesized instruments tend to mimic this spectral pattern (which is called timbre in music). This is the reason why instruments are pleasurable for humans to listen to.

The technique we use in this paper is to segment the audio file into discrete time intervals (called tiles), and perform a transformation on those tiles into a domain where two tiles from the same track are distinguishable by their cosine similarity. We extracted two core sets of features: Fourier based and autocorrelation based. Intuitively speaking, the Fourier features can be thought of as useful for distinguishing instruments (after filtering to remove noise and accentuate relevant features) and the autocorrelation features as useful for distinguishing rhythmic patterns. Our features are quite generic in the sense that we look at the spectrum directly. Most papers in the literature extract features on top of the spectrum, such as centroid and bandwidth. Most of these papers are concerned with supervised classification or structural analysis.
We work directly on the spectrum and match like-for-like instruments with increased levels of clarity and specificity. We perform direct comparisons rather than classification on new data. Because the features were noisy, we filtered them using a one dimensional convolution with a first-derivative Gaussian filter. In the following sections we describe the data set (Section 2), the evaluation criteria (Section 3), the test set (Section 4), the feature extraction (Section 5), the cost matrix we generate from the features (Section 6), computing the best-cost segmentation from the cost matrix (Section 7) and finally our methodology and results (Section 8).

1.1 Related Work

A similar problem is addressed using dynamic programming in [10] and [9]. The novelty of our approach is that it can be used in the unsupervised setting to find an entirely distinct sequence of clusters; it assumes that a cluster will not appear again later on. All related work in this area is concerned with summarisation, by which we mean inferring some kind of structure of repeating elements from the audio. Foote et al. ([11], [12]) have done a significant amount of work in this area. They have made use of similarity matrices for the purpose of structural analysis; their approach is based on statistical segment clustering. A distinguishing feature of our approach is that we evaluate how well we are doing relative to humans on the same task: we compare our error to the relative error of cue sheets created by human domain experts. Another important feature of our work is that we focus directly on DJ-mixed electronic dance music. We do not know of any previous work in this area.

2 Data Set

We have decided to use the radio show Global DJ Broadcast by Markus Schulz. This show goes out weekly at 5PM EST on Digitally Imported. Its contents are mainly Progressive Trance and Tech Trance in equal measure. Progressive trance is characterised by a house riff (bass line) with melodic instrumental elements over the top. Tech trance has less or no riff and a more rhythmic focus achieved with percussion and instruments, and less or no melody. Both subgenres have a very distinctive sound sharing many instruments and effects, usually with gratuitous Gaussian noise and reverb (or similar) layered on top.

The first track usually starts at around 36 seconds, after an introduction. Markus usually talks over the beginning of the first or second track and many more after that. The music remains constant throughout after the introduction. Occasionally Markus has guest mixes during his show. The show is usually played at around 130 BPM, although this is not constant throughout the show and can change abruptly on a transition (more likely) or slowly over an interval of time. The shows come as 44100 samples per second, 16 bit stereo MP3 files at 192 kbps or, more recently, 256 kbps. We resample these to 4000 Hz 16 bit mono (left channel) WAV files to allow us to process them faster.

A cue sheet is a metadata file which describes the track names and indexes of a media file. Cue sheets are stored as plain text files and commonly have a ".cue" file name extension. Most popular media players support cue sheets, allowing the user to listen to the separate components of one large audio file. These shows are 2 hours long, and we have used 15 of them in our dataset. We discarded any shows whose cue sheets completely disagree with each other or are entirely erroneous (for example the cue sheet we inspected for 17th March 2011). The average discrepancy is the total absolute difference between pairs of track indexes, summed up and averaged over the number of tracks.

3 Evaluation Criteria

We do not feel that it is important to get these indexes exactly correct. The cue sheet authors are using their own recordings of the radio show, which may differ in alignment from others, and the indexes are subject to some debate due to the relatively large length of the overlap between any pair of adjacent tracks. We feel that blatant mistakes (see Definition 1) should be avoided as the primary concern, although we cannot easily measure them objectively. We will use visual inspection later to demonstrate the efficacy of our approach, as this measure is a more practical concern. In Section 4 we include a statement from one of the cue sheet authors regarding his process for finding the track indexes.

Definition 1. A blatant mistake is a qualitative term: interpreting (in effect) a transition as a distinct track, misconceptualising one distinct element of a track as being a track in its own right, or making otherwise significantly inaccurate boundary recommendations. We allow more tolerance for an index that starts late into track B than for one early during a transition, which forces the listener to hear the end of track A before track B has properly come in.

For quantitative analysis in this paper more objective and nuanced evaluation criteria are required. Therefore, we measured the number of matched tracks within different intervals of time {60, 30, 20, 10, 5, 3, 1} (in seconds) as a margin around any of the track indexes we have in the cue sheets. This can be thought of as a lowest-cost assignment in a bipartite graph, with the caveat that we sometimes have more than one (but no more than 4) sets of indexes for each radio show. In this case we only use the index that fits best and effectively discard the ones given by the other cue sheets. We only allow one of the cue sheet indexes for any given track to be used. All of the cue sheets are complete in the sense that they have an index present for each respective track in the shows. For this reason we could potentially calculate the evaluation in linear time without using a lowest-cost bipartite graph matching algorithm (such as the Hungarian algorithm [7]).
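As a minimal sketch of this evaluation (assuming, for simplicity, a single reference cue sheet; with several sheets the best fitting index per track is kept, as described above), the matching can be computed with an off-the-shelf assignment solver. The function name matched_fraction is ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_fraction(predicted, reference, tolerance):
    """Fraction of reference indexes matched one-to-one by a predicted
    index within `tolerance` seconds."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Pairwise absolute time differences form the assignment cost.
    cost = np.abs(predicted[:, None] - reference[None, :])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm [7]
    hits = np.sum(cost[rows, cols] <= tolerance)
    return hits / len(reference)

# Example: 3 of 4 reference indexes matched within 5 seconds -> 0.75.
print(matched_fraction([35, 320, 610, 1200], [37, 318, 640, 1201], 5))
```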

4 Test Set

There is already a large community of people very interested in getting track metadata for DJ sets. CueNation ([1]) is an example of this: a website allowing people to submit cue sheets for popular DJ mixes and radio shows. We have used this site to download human-identified track metadata for each of the shows in our dataset. In one case only a single cue sheet existed for a given show, but in other cases we have as many as 4 cue sheets. Downloading these radio shows is of dubious legality because of copyright, so we recorded them ourselves from Digitally Imported (a net radio station). Unfortunately this leaves open the possibility of our recordings being out of sync with the cue sheets we have. We have therefore aligned the cue sheets to the starting point of the show (a clear swoosh after around 37 seconds, or, if we missed that point in our recording, several tracks of unanimous agreement).

What is very interesting to us is the consistency of the human-identified track indexes, given that the indexes are so ambiguous on analysis. Much of the time the humans agree on an index within a one second tolerance. However, there is often a massive discrepancy on track indexes. It is not that one is right and one is wrong, but more a case of subjectivity. We contacted the creator of CueNation (Mikael Lindgren [2]), who is likely the author of many of the cue sheets in our dataset, and asked him about his process for creating the indexes:

  "Figuring out the index is not an absolute science, so there is no universally correct point when a track changes. Furthermore, it depends on how the DJ mixes the tracks. Markus usually does a very clean job. If you look at the waveform for a track you notice that there often is a clear intro phase that turns into the first proper part of the track. Sometimes this transition is gradual, so there are 2-4 points that could be used for the transition point. Similarly tracks have an outro phase, and there can also be 2-4 valid transition points. If there is only one point where both tracks have a transition point then it is quite clear where the track changes. Often there are 2-3 to choose from. When there are 3, the middle one is the best choice. I prefer to start a track so that it is 8 or 16 bars (ca. 30 or 60 seconds) (1 bar = 8 beats, ca. 4 seconds) before the first breakdown, and choose the transition point accordingly. Call it a trademark of mine if you want to. But valid transition points take precedence, and therefore I sometimes use 4 or 12 bars (ca. 15 or 45 seconds)."

For every show in our dataset we compare each respective cue sheet against every other using the evaluation criteria discussed in Section 3 to show the number of matched tracks within different time thresholds (refer to Table 1). The worst performing match in each threshold is shown in the table to emphasise the extent of human disagreement. To show that the disagreement between the worst pair is genuine, and not simply a misalignment on our part, we calculated the least squares alignment error between them, which is defined as follows. Let $A = (A_1, \dots, A_n)$ and $B = (B_1, \dots, B_n)$ be the vectors of track boundary times of two cue sheets. We define the least squares alignment error as

\[ \min_t \sqrt{\sum_{i=1}^{n} (A_i - B_i + t)^2}. \]

This can be interpreted as the empirical standard deviation of the pairwise differences $A - B$. Intuitively, this measure is high when the disagreement between cue sheets is more fundamental than a mere shift. In the last column of Table 1 we report for each show the maximal least squares alignment error among pairs of cue sheets. A good example is show 03/02/11, where the worst disagreeing cue sheet pair aligned 72% of indexes within 3 seconds.
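The minimising shift $t$ has a closed form (the mean of $B - A$), so the alignment error reduces to the norm of the centred differences. A minimal sketch, assuming NumPy:

```python
import numpy as np

def alignment_error(A, B):
    """min over t of sqrt(sum_i (A_i - B_i + t)^2); the optimum is
    attained at t = mean(B - A), leaving the centred residuals."""
    d = np.asarray(A, dtype=float) - np.asarray(B, dtype=float)
    return np.sqrt(np.sum((d - d.mean()) ** 2))

# Two cue sheets that differ by a pure shift have zero alignment error:
print(alignment_error([10, 100, 200], [12, 102, 202]))  # 0.0
```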

5 Feature Extraction

We used Winamp [4] to downsample the shows to 8000 Hz. We are not particularly interested in frequencies above around 2000 Hz in the spectrum because there are not many instrument harmonics that go that high, and they are not

Table 1. Dataset and Cue Sheet Discrepancy. Note that the threshold columns are computed on manually aligned cue sheets (not least squares aligned). A high value in the last column indicates that several indexes significantly disagree even after the best possible least squares alignment using the found t.

Date      Note          Cues   60s   30s   20s   10s    5s    3s    1s  Human
                           #    (%)   (%)   (%)   (%)   (%)   (%)   (%)  Error (s)
03/02/11  World Tour      2     96    88    84    72    72    72     0   52.8
10/02/11                  3    100    95    81    81    81    52     0   12.6
17/02/11                  3    100    95    76    71    71    71    10   23.0
24/02/11                  3    100    91    91    91    91    91    50   14.3
03/03/11  World Tour      3    100    96    85    78    78    78     0   12.2
10/03/11  WMC Special     2    100   100   100   100   100   100    61    0.9
24/03/11                  4     96    91    74    74    74    74    39   19.1
31/03/11                  3    100   100   100   100   100   100    50    1.2
14/04/11                  1    n/a   n/a   n/a   n/a   n/a   n/a   n/a    n/a
21/04/11                  3    100   100   100   100   100   100    45    1.2
05/05/11  World Tour      2    100    96    84    84    44     0     0   14.4
12/05/11                  2    100   100   100   100   100   100    96    0.2
19/05/11                  2    100   100    95    95    95    95    71    6.4
26/05/11                  2    100   100   100   100   100   100    19    1.7
02/06/11  World Tour      2    100   100   100    96    96    96    44    2.8
Average                        99.4  96.6  90.7  88.8  85.9  80.7  34.7

well defined even when they do. The Nyquist theorem ([5]) states that the highest representable frequency is half the sampling rate, so we downsampled again manually to 4000 Hz by binning the data into bins of size two and averaging their contents. We will refer to the sample rate as $R$. Let $L$ be the length of the show in samples. We would rather have used Winamp to downsample all the way to 4000 Hz, but its lower limit is 8000 Hz, and arguably Winamp would do a better job of downsampling.

Next we binned/segmented the data again into tiles of equal size. Samples at the end of the show that do not fill a complete tile are discarded. We denote the tile width by $T_{sec}$ (in seconds); this will become one of the key parameters later on. We stack the tiles on top of each other to create the matrix $V$ of size $K \times N$, where $K = \lfloor L / (R\,T_{sec}) \rfloor$ is the number of tiles and $N$ is the size of the Fourier vector (see the next paragraph). Note that $T_{sec} = T_{samp}/R$, where $T_{samp}$ is the tile width in samples ($\lfloor L/K \rfloor$). As we are not always interested in the entire range of the spectrum, we use $l$ to denote a low pass filter (in Hz) and $h$ a high pass filter (in Hz).

Fourier analysis allows one to represent a time domain process as a set of integer oscillations of trigonometric functions. We used the discrete Fourier transform (DFT) to transform the tiles into the frequency domain. The DFT, given as

\[ F(x)_k = X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi \frac{k}{N} n}, \]

transforms a sequence of complex numbers $x_0, \dots, x_{N-1}$ into another sequence of complex numbers $X_0, \dots, X_{N-1}$, where the $e^{-i 2\pi \frac{k}{N} n}$ are points on the complex unit circle. Note that the FFTW algorithm that we used to perform this computation (see [6] and [8]) operates significantly faster when $N$ is a power of 2, so we zero pad the input to make that the case. Let the matrix $U$ of size $K \times (S_l - S_h)$ represent the band limited Fourier transforms of the tiles, where $S_h = \lfloor h(N/R) \rfloor + 1$ and $S_l = \lfloor l(N/R) \rfloor + 1$.

We compute $D_i = |F(V_t)_i|$ for $i \in \{1, \dots, N\}$ and each tile $t$, and then insert $D_{[S_h : S_l]}$ into the row of $U$ for the corresponding tile. When we visualised some tiles in $U$ we noticed that they are rather noisy and that the peaks in the frequency space are slightly out of alignment on adjacent tiles in the same song. To compensate for this we convolved the frequency space of each tile with a one dimensional Gaussian first derivative filter (Figure 1), given as

\[ g(G) = -\frac{2G}{B^2} \, e^{-\frac{G^2}{B^2}}, \qquad G \in \{-\lfloor 2B \rfloor, -\lfloor 2B \rfloor + 1, \dots, \lfloor 2B \rfloor\}, \qquad B = b\,\frac{N}{R}, \]

where $b$ is the bandwidth of the filter in Hz; this will be used as one of our main parameters. The formula for one dimensional convolution is given as

\[ (f * g)[n] \stackrel{\mathrm{def}}{=} \sum_{m=-\infty}^{\infty} f[m] \, g[n - m]. \]

This filtering technique is critical for reliably distinguishing pairs of tiles from the same song by their cosine. An advantage of convolution filtering over binning is that you do not lose any resolution in the features (see Figure 2) while still achieving the desired effect. The Gaussian has the effect of accentuating large adjacent groups of peaks within the bandwidth space and smoothing over smaller ones, which are assumed to be outliers. We have $K$ Fourier feature tiles, each of length $S_l - S_h$, and normalise them all by $f_{norm} = f / |f|$, where $f$ is one feature tile. Then we can compute a $K \times K$ dissimilarity matrix $S(i, j)$ by taking one minus the dot products of all pairs of normalised features (one minus because we will be finding the minimum cost assignment in Section 7). See Figure 3 for a partial annotated illustration of this matrix on one of the shows.
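The following is a compact sketch of this Fourier feature pipeline under the definitions above (tiling, zero padding, selection of the band between h and l, Gaussian first-derivative smoothing, one-minus-cosine dissimilarity). The function name and default values are ours, chosen to match the parameters shown in Figure 3:

```python
import numpy as np

def fourier_dissimilarity(x, R=4000, T_sec=20, l=1900, h=110, b=3):
    """x: mono signal at sample rate R. Returns the K x K matrix S."""
    x = np.asarray(x, dtype=float)
    T_samp = int(T_sec * R)
    K = len(x) // T_samp                      # number of complete tiles
    V = x[:K * T_samp].reshape(K, T_samp)     # discard the partial tile
    N = 1 << (T_samp - 1).bit_length()        # zero pad to a power of 2
    D = np.abs(np.fft.rfft(V, n=N, axis=1))   # magnitude spectra
    S_h, S_l = int(h * N / R) + 1, int(l * N / R) + 1
    U = D[:, S_h:S_l]                         # keep the h..l Hz band
    B = b * N / R                             # filter bandwidth in bins
    G = np.arange(-int(2 * B), int(2 * B) + 1)
    g = -(2 * G / B ** 2) * np.exp(-G ** 2 / B ** 2)
    U = np.apply_along_axis(np.convolve, 1, U, g, mode='same')
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit length rows
    return 1.0 - U @ U.T                      # one minus pairwise cosines
```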

Fig. 1. Illustration of the Gaussian first derivative filter.

Fig. 2. Illustration of how filtering works on feature vectors. The large peaks correspond to instruments. These peaks might be slightly misaligned on different tiles in the same song. Without the filtering, the cosine matrix would not convey much information about the track indexes.

We wanted to have an additional feature set and dissimilarity matrix that would be useful for distinguishing tracks based on rhythmic features. For this we tile the audio in the same way as before, but rather than take a Fourier transform we autocorrelate it. Autocorrelation is a special case of cross correlation in which the signal is cross correlated against itself. The autocorrelation, given as

\[ R_{xx}(j) = \sum_{n} x_n \, x_{n-j}, \]

looks at the similarity of the signal's components as a function of the time separation between them, where $j$ is the lag and $x_n$ is the signal. For a sequence of length $n$ the returned $R$ has length $2n - 1$. The second half of $R$ is a mirror image of the first half, so we just take the first half $\{R_1, R_2, \dots, R_{T_{samp}}\}$. We apply the same Gaussian convolution filter to these features, normalise the vectors, and generate a one-minus-dot-product dissimilarity matrix as before. We also generate a third dissimilarity matrix, which is the average $(\text{FFT} + \text{XCORR})/2$ of the FFT and XCORR dissimilarity matrices.

Fig. 3. Markus Schulz presents Global DJ Broadcast WMC Special (10 March 2011). Tile Width = 20 s, Filter Bandwidth = 3 Hz, Low Pass Filter = 1900 Hz, High Pass Filter = 110 Hz. Vertical lines indicate the cue sheet indexes, and the track numbers are labelled.
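A sketch of the XCORR features under the same conventions, reusing the tile matrix V and the filter g from the Fourier sketch above (np.correlate is quadratic per tile; an FFT-based correlation such as scipy.signal.correlate would be faster in practice):

```python
import numpy as np

def xcorr_dissimilarity(V, g):
    """V: K x T_samp matrix of time-domain tiles; g: the Gaussian
    first-derivative filter from the Fourier sketch."""
    T = V.shape[1]
    # Full autocorrelation has length 2T - 1; keep the non-mirrored half.
    A = np.array([np.correlate(t, t, mode='full')[T - 1:] for t in V])
    A = np.apply_along_axis(np.convolve, 1, A, g, mode='same')  # smooth
    A /= np.linalg.norm(A, axis=1, keepdims=True)               # normalise
    return 1.0 - A @ A.T

# The third matrix is the average: S_avg = (S_fft + S_xcorr) / 2
```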

6 Cost Matrix

We now have three dissimilarity matrices $S(i, j)$, as described in Section 5. The algorithms described here and in Section 7 apply to all three of these dissimilarity matrices. We will refer to these as separate algorithms: FFT, AVG and XCORR. In the results section these algorithms can be thought of as another parameter. Based on our experience, it is highly unlikely to have songs shorter than 90 seconds or longer than 13 minutes. For this reason we have decided to bound the minimum and maximum song length with $w$ and $W$, equal to 90 seconds and 13 × 60 seconds respectively. Intuitively, tiles within tracks are reasonably similar on the whole, while pairs of tiles that do not belong to the same track are significantly more dissimilar. We define $C(f, t)$, the cost of a candidate track from tile $f$ through tile $t$, to be the sum of the dissimilarities between all pairs of tiles inside it:

\[ C(f, t) = \sum_{i=f}^{t} \sum_{j=f}^{t} S(i, j). \]

We obtain the cost of a full segmentation by summing the costs of its tracks. The goal is now to efficiently compute the segmentation of least cost. As a first step, we pre-compute $C(f, t)$ for each $1 \le f \le t \le T$. Direct calculation using the definition takes $O(TW^3)$ time. However, we can compute the full cost matrix in $O(WT)$ time using the following recursion (for $f + 1 \le t - 1$):

\[ C(f, t) = C(f + 1, t) + C(f, t - 1) - C(f + 1, t - 1) + S(f, t) + S(t, f). \]
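A sketch of this pre-computation in NumPy, with track lengths measured in tiles (so the W passed here is the 13 minute bound divided by the tile width); spans longer than W are never needed and are left at zero:

```python
import numpy as np

def cost_matrix(S, W):
    """S: K x K dissimilarity matrix; W: maximum track length in tiles.
    Returns C with C[f, t] = sum of S over the block f..t (inclusive)."""
    K = S.shape[0]
    C = np.zeros((K, K))
    for f in range(K):
        C[f, f] = S[f, f]                     # single-tile tracks
    for length in range(2, min(W, K) + 1):    # grow one tile at a time
        for f in range(K - length + 1):
            t = f + length - 1
            inner = C[f + 1, t - 1] if f + 1 <= t - 1 else 0.0
            # Inclusion-exclusion recursion from the text.
            C[f, t] = C[f + 1, t] + C[f, t - 1] - inner + S[f, t] + S[t, f]
    return C
```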

7 Computing Best Segmentation of Cost Matrix

A sequence $t = (t_1, \dots, t_{m+1})$ is called an $m/T$-segmentation if

\[ 1 = t_1 < t_2 < \dots < t_m < t_{m+1} = T + 1. \]

We use the interpretation that track $i \in \{1, \dots, m\}$ comprises times $\{t_i, \dots, t_{i+1} - 1\}$. Let $S_m^T$ be the set of all $m/T$-segmentations. Note that there is a very large number of possible segmentations:

\[ \left|S_m^T\right| = \binom{T-1}{m-1} = \frac{(T-1)!}{(m-1)!\,(T-m)!} = \frac{(T-1)(T-2)\cdots(T-m+1)}{(m-1)!} \ge \left(\frac{T}{m}\right)^{m-1}. \]

For large values of $T$, considering all possible segmentations by brute force is infeasible. For example, a two hour long show with 25 tracks would have more than $\left(\frac{60^2 \times 2}{25}\right)^{24} \approx 1.06 \times 10^{59}$ possible segmentations! We can reduce this number by imposing upper and lower bounds on the song length. Let $N(T, W, w, m)$ be the number of segmentations, where $T$ is the total time (in seconds), $W$ the upper bound (in seconds) on the song length, $w$ the lower bound (in seconds) on the song length and $m$ the number of tracks. We can write the recursive relation $N(T, W, w, m) = \sum_{t_m} N(t_m - 1, W, w, m - 1)$, where the sum is taken over $t_m$ such that

\[ t_m \le T - w + 1, \qquad t_m \ge T - W + 1, \qquad t_m \ge (m-1)w + 1, \qquad t_m \le (m-1)W + 1. \]

The first two inequalities mean that the length of the last track is within the acceptable bounds $w$ and $W$. The last two inequalities mean that the lengths of the first $m - 1$ tracks are within the same bounds. We calculated the value of $N(7000, 60 \times 15, 190, 25)$ and got $5.20 \times 10^{56}$, which is still infeasible for brute force. The recursion itself, however, is cheap to evaluate, as sketched below.
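A memoised sketch of this counting recursion (Python integers are exact, so the huge counts do not overflow); the function name is ours:

```python
from functools import lru_cache

def count_segmentations(T, W, w, m):
    """Number of m-track segmentations of T seconds with every track
    length in [w, W], via the recursion and inequalities above."""
    @lru_cache(maxsize=None)
    def N(T, m):
        if m == 1:
            return 1 if w <= T <= W else 0
        lo = max(T - W + 1, (m - 1) * w + 1)   # bounds on t_m
        hi = min(T - w + 1, (m - 1) * W + 1)
        return sum(N(t_m - 1, m - 1) for t_m in range(lo, hi + 1))
    return N(T, m)
```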

Our solution to this problem is to find a dynamic programming recursion. The loss of an $m/T$-segmentation $t$ is

\[ \ell(t) = \sum_{i=1}^{m} C(t_i, t_{i+1} - 1). \]

We want to compute

\[ V_m^T = \min_{t \in S_m^T} \ell(t). \]

To this end, we write the recurrence

\[ V_1^t = C(1, t), \]

and for $i \ge 2$

\[
\begin{aligned}
V_i^t &= \min_{t \in S_i^t} \ell(t) \\
      &= \min_{t_i} \Bigl( \min_{t' \in S_{i-1}^{t_i - 1}} \ell(t') + C(t_i, t) \Bigr) \\
      &= \min_{t_i} \Bigl( C(t_i, t) + \min_{t' \in S_{i-1}^{t_i - 1}} \ell(t') \Bigr) \\
      &= \min_{t_i} \Bigl( C(t_i, t) + V_{i-1}^{t_i - 1} \Bigr).
\end{aligned}
\]

In this formula $t_i$ ranges from $(i-1)w + 1$ to $(i-1)W + 1$. We have $T \times m$ values $V_i^t$ to compute, and calculating each takes at most $O(T)$ steps, so the total time complexity is $O(T^2 m)$.
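A sketch of the resulting dynamic program with back-pointers, working in tiles: V[i][t] is the best cost of splitting the first t tiles into i tracks, and w, W are the track length bounds converted to tiles. The function name is ours:

```python
import numpy as np

def best_segmentation(C, m, w, W):
    """C: cost matrix from cost_matrix above. Returns the 0-based start
    tile of each of the m tracks in the least-cost segmentation."""
    T = C.shape[0]
    V = np.full((m + 1, T + 1), np.inf)
    back = np.zeros((m + 1, T + 1), dtype=int)
    V[0, 0] = 0.0
    for i in range(1, m + 1):
        for t in range(i * w, min(i * W, T) + 1):
            # The i-th track covers tiles f..t-1, so w <= t - f <= W.
            for f in range(max(t - W, (i - 1) * w), t - w + 1):
                cand = V[i - 1, f] + C[f, t - 1]
                if cand < V[i, t]:
                    V[i, t], back[i, t] = cand, f
    starts, t = [], T                 # recover the argmin by backtracking
    for i in range(m, 0, -1):
        t = back[i, t]
        starts.append(t)
    return starts[::-1]
```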

8 Methodology and Results

Table 3. Parameter Values

Parameter                   Values                                    Count
Tolerance Thresholds (s)    60, 30, 20, 10, 5, 3, 1                       7
Gaussian Filter Size (Hz)   0.1, 0.5, 1, 2, 3, 5, 6, 8, 10, 12, 15       11
High Pass Filter (Hz)       75, 100, 110, 125, 150, 175, 200              7
Tile Width (s)              1, 2, 5, 10, 15, 20                           6
Algorithm                   FFT, AVG, XCORR                               3

We selected a set of values for each parameter (see Table 3). The algorithm is run on every show in the dataset with all combinations of parameters, generating a set of results. The number of results is 7 × 11 × 7 × 6 × 3 = 9702, and each result has a performance score. We can use these global scores to show the best performing parameter set and average performance (over every show in the dataset) for each tolerance (see Table 5).

These results give us indications of which parameters work well together, but not a fair representation of how well the methods are likely to work on completely new data. For this we use a cross validation approach. We partitioned the dataset 15 times. In each partition we exclude one show for testing (a different one every time) and use the others for finding the best parameters. We then run on the excluded show with the selected parameters and algorithm, and average the performance scores. See Table 6 for the cross validated results.

We feel that for most practical purposes this approach produces reasonable results. To demonstrate this we use visual inspection to count the percentage of qualitatively acceptable indexes, by which we mean those that are not blatant mistakes (see Definition 1). We ran the best performing global algorithm and parameters for a tolerance of 1 second and computed $1 - \frac{B}{I}$, where $B$ is the number of blatant or unacceptable mistakes and $I$ is the number of tracks in the respective show. See Table 4 for these results, and Figure 5 for an illustration of blatant vs. not blatant. Note that these results are just our opinion on the matter. In some cases it is clear from the dissimilarity matrix, and then from listening to the show, that our algorithm is correct where the human captured indexes are wrong; Figure 4 illustrates this concept. Those interested in a practical application of this method should look more closely at the qualitative results, as in many cases we believe our track boundaries are even better than the humans'.

In Table 6 we also compare our results to the average human discrepancy. The human discrepancy was obtained by averaging the worst performing pair of cue sheets created by humans from Table 1. Our accuracy is not dramatically below the average discrepancy of the most divergent human indexes. As the threshold goes down, both accuracies decrease.

Please refer to the best global results in Table 5. The results are very reasonable for accuracy tolerances up to 30 seconds. At 60 seconds, we see that the parameters shift significantly. It may be the case that the best performing set of features has been selected because of an inclination to evenly space out the indexes. As tracks have a similar length on average, this can produce deceiving performance estimates, considering that the ±60 second margins cover $\frac{60 \times 2 \times 25}{2 \times 60^2} \approx 40\%$ of the entire show, where 25 is the average number of tracks, 60 the threshold in seconds, and $2 \times 60^2$ the length of the show in seconds. Considering that most tracks are of similar length, we conclude that our evaluation criteria do not work effectively at this tolerance: there is a high likelihood of success simply by evenly spacing out the indexes. A high value for $h$ has been selected, discarding more of the spectrum; additionally a high filter size has been selected, which has the effect of blurring the feature space somewhat and hence losing some time resolution.

9 Conclusion and Further Work

We feel that our results are reasonable for most practical applications. Even disregarding the 60 second tolerance (for the reason discussed in the previous section), the 84.3% figure at the 30 second tolerance (global best parameters; 77.7% cross validated) is acceptable.

One promising avenue of further research is refining the features. In particular the current XCORR dissimilarity matrices are dominated by a feature which is shared by several groups of adjacent tracks (their BPM). We think this could potentially be filtered out, revealing more of the distinguishing rhythmic structure of individual tracks. However, that requires further analysis.

Table 4. Qualitative Results

Name       Qualitatively Acceptable (%)
03/02/11    91.7
10/02/11    95.0
17/02/11   100.0
24/02/11    95.2
03/03/11    92.3
10/03/11    90.9
24/03/11   100.0
31/03/11    91.3
14/04/11   100.0
21/04/11   100.0
05/05/11   100.0
12/05/11    95.7
19/05/11   100.0
26/05/11    95.0
02/06/11    92.3
Average     95.96

Table 5. Global Results (Best Performing Parameter Set For Each Threshold)

Threshold (s)  High Pass Filter (Hz)  Filter Size (Hz)  Tile Size (s)  Algorithm  Global Performance
60             200                    8                 15             FFT        96.6%
30             110                    3                 20             FFT        84.3%
20              75                    5                 20             FFT        76.7%
10              75                    5                 10             AVG        59.4%
 5              75                    2                  5             FFT        44.4%
 3              75                    3                  2             FFT        38.7%
 1              75                    3                  2             AVG        23.7%

Table 6. Cross Validated Results

                         60s (%)  30s (%)  20s (%)  10s (%)  5s (%)  3s (%)  1s (%)
Cross Validated           95.2     77.7     74.5     58.1     38.8    38.7    18.9
Global Best Parameters    96.6     84.3     76.7     59.4     44.5    38.7    23.7
Human Discrepancy         99.4     96.6     90.7     88.8     85.9    80.7    34.7

Fig. 4. An example of a cue index that is wrong in the cue sheets. These instances make our quantitative performance results look worse than they really are. From Markus Schulz presents Global DJ Broadcast (21 April 2011).

Fig. 5. An example of a blatant error vs. a not blatant error. From Markus Schulz presents Global DJ Broadcast World Tour (3 March 2011).

We have made our dataset available online ([3]). One can observe from Table 5 that at the lower thresholds we are hitting the lower limit of $h$: it wants to go lower, especially when the filter size is low. More experiments would find the optimal $h$, although we observed in testing that there was nothing particularly useful below 65 Hz. If we ask for higher precision then the smaller tile sizes come out as the best parameters, because we need the higher resolution. We think we could refine the evaluation criteria using a metric such as a minimum cost alignment so that we can better compare our error to the error of humans, and avoid the resolution issues at the higher thresholds. The song cost matrix could be improved using machine learning or heuristic approaches utilising domain knowledge.

References

1. CueNation, http://www.cuenation.com. Community website created by Mikael Lindgren for people to share cue sheets.
2. Mikael Lindgren, personal communications.
3. http://www.developer-x.com/papers/trackindexes. The dataset for this paper is available to download here.

4. Winamp, the media player, http://www.winamp.com
5. H. Nyquist. Certain Topics in Telegraph Transmission Theory. Transactions of the AIEE, vol. 47, pp. 617-644, 1928.
6. FFTW algorithm ("the fastest Fourier transform in the west"), http://www.fftw.org
7. Harold W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2:83-97, 1955.
8. Matteo Frigo and Steven G. Johnson. The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2):216-231, 2005.
9. Michael M. Goodwin and Jean Laroche. A Dynamic Programming Approach to Audio Segmentation and Speech/Music Discrimination. IEEE International Conference on Acoustics, Speech and Signal Processing, 2004, vol. 4, pp. 309-312.
10. Y. Shiu, H. Jeong and C.-C. J. Kuo. Similarity Matrix Processing for Music Structure Analysis. AMCMM '06, October 27, 2006.
11. Matthew Cooper and Jonathan Foote. Automatic Music Summarization via Similarity Analysis. Proceedings of ISMIR, 2002.
12. Jonathan T. Foote and Matthew L. Cooper. Media Segmentation using Self-Similarity Decomposition. Proceedings of SPIE, 2003.