A Dynamic Programming Approach to Segmenting DJ Mixed Music Shows
Tim Scarfe, Wouter M. Koolen and Yuri Kalnishkan
Computer Learning Research Centre and Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, United Kingdom
{tim,wouter,yura}@cs.rhul.ac.uk

Abstract. In this paper we describe a method to infer the time indexes of track/song transitions in recorded audio. We apply our method to segment DJ-mixed electronic dance music podcasts/radio shows. These shows are seamlessly mixed by a DJ, so the track boundaries are intentionally obfuscated. The driving factor is that audio streams often contain little or no event metadata describing their contents, unlike the table of contents provided with a purchased compact disc. Our technique involves transforming the podcast into a sequence of feature vectors, each representing a small interval of time, and defining a matrix that gives the cost of a potential track starting at a given time with a given length. We consider both Fourier and time-domain autocorrelation features. We use a dynamic programming approach to find the segmentation with the lowest total cost. We tested the approach on a dataset of 15 shows of 2 hours each (Global DJ Broadcast by Markus Schulz).

1 Introduction

DJ mixes are an essential part of electronic dance music. Whether you listen to your favourite DJ on the radio, live, or downloaded from the internet, the individual tracks will be mixed together. More recently, podcasts (through Apple iTunes) have become the de facto standard enabling people to download radio shows automatically for offline listening. Users can play the mixes on their computers and/or mobile devices. Many shows are also broadcast on internet radio stations such as Digitally Imported and syndicated worldwide on terrestrial radio.

Podcasts on iTunes already have an "event metadata" concept, called chapters, built in. Very few radio show producers (two that we know of) use the chapters feature in their podcasts. Adding this metadata relies on a proprietary Apple extension to their media files, and the chapter indexes we inspected were highly inaccurate. This is not surprising, given that adding the metadata to the files is a laborious manual process.

One of the interesting features of audio is that you cannot scrub through it and get a contextual picture in the way you can with video. Audio must be assimilated and experienced in real time to understand its context. Many DJs intentionally do not provide much information about their mixes because they want to maintain a certain mystique and have their set seen as one distinct work of art rather than a set of tracks simply mixed together. We will not speculate further about the reasons, but it is clearly beneficial to have the metadata: it allows us to skip the tracks we do not like and to become intimately familiar with the show's content and the respective authors of the tracks.

To mix tracks, DJs always match the speed/BPM (beats per minute) of each track during a transition and align their percussion in the time domain. To make things sound even more seamless, they often also apply EQ or harmonically mix tracks to prevent dissonance between instruments in adjacent tracks. DJs will know which "key" each track is in and have access to a chart showing which other keys are harmonically compatible. This ensures that while both tracks are playing together, they are harmonious and each critical band of hearing in the human ear is stimulated by only one peak in the frequency domain, avoiding a characteristic muddy sound. Many real instruments have a harmonic series of peaks in the frequency domain, usually with each harmonic at an integer multiple of the fundamental frequency. Synthesized instruments tend to mimic this spectral pattern (which is part of what is called timbre in music). This is why instruments are pleasurable for humans to listen to.

The technique we use in this paper is to segment the audio file into discrete time intervals (called tiles) and to transform those tiles into a domain where two tiles from the same track are distinguishable by their cosine similarity. We extracted two core sets of features: Fourier based and autocorrelation based. Intuitively speaking, the Fourier features can be thought of as useful for distinguishing instruments (after filtering to remove noise and accentuate relevant features), and the autocorrelation features as useful for distinguishing rhythmic patterns. Our features are quite generic in the sense that we look at the spectrum directly. Most papers in the literature extract features on top of the spectrum, such as centroid and bandwidth, and are mostly concerned with supervised classification or structural analysis. We work directly on the spectrum and match like-for-like instruments with increased clarity/specificity; we perform direct comparisons rather than classification on new data. Because the features were noisy, we filtered them with a one-dimensional convolution using a first-derivative-of-Gaussian kernel.
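To make the tiling and feature extraction concrete, the following is a minimal sketch (our own illustration, not the paper's code). The 4000 Hz rate follows Section 2; the tile length is an assumed value here, since the paper defers that choice to Section 5.

```python
import numpy as np

SAMPLE_RATE = 4000   # Hz, after resampling (Section 2)
TILE_SECONDS = 8     # assumed tile length; the paper's choice appears in Section 5

def tile_features(signal):
    """Split a mono signal into fixed-length tiles and compute both feature sets."""
    n = SAMPLE_RATE * TILE_SECONDS
    tiles = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    fourier, autocorr = [], []
    for t in tiles:
        # Fourier features: the magnitude spectrum, aimed at distinguishing instruments.
        fourier.append(np.abs(np.fft.rfft(t)))
        # Autocorrelation features: time-domain self-similarity, aimed at rhythm.
        ac = np.correlate(t, t, mode="full")[n - 1:]
        autocorr.append(ac / (ac[0] + 1e-12))  # normalise by the zero-lag energy
    return np.array(fourier), np.array(autocorr)

def cosine(u, v):
    """Cosine similarity between two tile feature vectors; tiles drawn from
    the same track should score close to 1."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```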
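The derivative-of-Gaussian smoothing step just mentioned might look like the sketch below; scipy's gaussian_filter1d with order=1 convolves with exactly such a kernel. The value of sigma and the choice of axis are our assumptions.

```python
from scipy.ndimage import gaussian_filter1d

def smooth_features(features, sigma=2.0):
    """Convolve the features with a first-derivative-of-Gaussian kernel.

    features: array of shape (num_tiles, num_bins). order=1 requests the
    first derivative of the Gaussian; axis=1 filters along the spectral
    axis (an assumption; the paper does not state the axis at this point).
    """
    return gaussian_filter1d(features, sigma=sigma, order=1, axis=1)
```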
In the following sections we describe the data set (Section 2), the evaluation criteria (Section 3), the test set (Section 4), the feature extraction (Section 5), the cost matrix we generate from the features (Section 6), computing the lowest-cost segmentation from the cost matrix (Section 7) and, finally, our methodology and results (Section 8).
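Ahead of Sections 6 and 7, the dynamic program summarised in the abstract can be previewed with a minimal sketch. We assume a matrix in which entry (t, l - 1) holds the cost of a hypothesised track starting at tile t and lasting l tiles; the recursion over suffixes and all names are ours, not the paper's exact formulation.

```python
import numpy as np

def best_segmentation(cost):
    """cost: array of shape (T, L); cost[t, l - 1] is the cost of a track
    starting at tile t with length l tiles (1 <= l <= L).
    Returns the track start tiles and the lowest total cost."""
    T, L = cost.shape
    best = np.full(T + 1, np.inf)   # best[t]: cheapest segmentation of tiles t..T-1
    best[T] = 0.0
    choice = np.ones(T, dtype=int)  # optimal track length at each start tile
    for t in range(T - 1, -1, -1):
        for l in range(1, min(L, T - t) + 1):
            c = cost[t, l - 1] + best[t + l]
            if c < best[t]:
                best[t], choice[t] = c, l
    starts, t = [], 0
    while t < T:                    # walk forward to read off the boundaries
        starts.append(t)
        t += choice[t]
    return starts, float(best[0])
```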
1.1 Related Work

A similar problem is addressed using dynamic programming in [10] and [9]. The novelty of our approach is that it can be used in the unsupervised setting to find an entirely distinct sequence of clusters; it assumes that a cluster will not appear again later on. All related work in this area is concerned with summarisation, by which we mean inferring from the audio some kind of structure built around repeating elements. Foote et al. ([11], [12]) have done a significant amount of work in this area, making use of similarity matrices for the purpose of structural analysis; their approach is based on statistical segment clustering. A distinguishing feature of our approach is that we evaluate how well we do relative to humans on the same task: we compare our error to the relative error of cue sheets created by human domain experts. Another important feature of our work is that we focus directly on DJ-mixed electronic dance music; we do not know of any previous work in this area.

2 Data Set

We have decided to use the radio show Global DJ Broadcast by Markus Schulz. The show goes out weekly at 5PM EST on Digitally Imported. Its contents are mainly Progressive Trance and Tech Trance in equal measure. Progressive trance is characterised by a house riff (bass line) with melodic instrumental elements over the top. Tech trance has less or no riff and a more rhythmic focus, achieved with percussion and instruments, and less or no melody. Both sub-genres have a very distinctive sound and share many instruments and effects, usually with gratuitous Gaussian noise and reverb layered on top (or similar). The first track usually starts at around 36 seconds, after an introduction. Markus usually talks over the beginning of the first or second track, and over many more afterwards. The music remains constant after the introduction. Occasionally Markus has guest mixes during his show. The show is usually played at around 130 BPM, although this is not constant throughout the show and can change abruptly on a transition (more likely) or slowly over an interval of time.

The shows come as 44100 samples per second, 16-bit stereo MP3 files at 192 kbps or, more recently, 256 kbps. We resample these to 4000 Hz, 16-bit mono (left channel) WAV files to allow us to process them faster.

A cue sheet is a metadata file which describes the track names and indexes of a media file. Cue sheets are stored as plain text files and commonly have a ".cue" file name extension. Most popular media players support cue sheets, allowing the user to listen to the separate components of one large audio file. The shows are 2 hours long, and we have used 15 of them in our dataset. We discarded any shows whose cue sheets completely disagree with each other or are entirely erroneous (for example, the cue sheet we inspected for 17th March 2011). The average discrepancy is the total absolute difference between corresponding pairs of track indexes, summed over the tracks and averaged by the number of tracks.

3 Evaluation Criteria

We do not feel that it is important to get these indexes exactly correct. The cue sheet authors use their own recordings of the radio show, which may differ in alignment from others, and the indexes are subject to some debate owing to the relatively large overlap between any pair of adjacent tracks. We feel that blatant mistakes (see Definition 1) should be avoided as the primary concern, although we cannot easily measure them objectively. We will use visual inspection later to demonstrate the efficacy of our approach, as this is a more practical concern. In Section 4 we include a statement from one of the cue sheet authors regarding his process for finding the track indexes.

Definition 1. A Blatant Mistake is a qualitative term: interpreting (in effect) a transition as a distinct track, misconceptualising one distinct element of a track as a track in its own right, or making otherwise significantly inaccurate boundary recommendations.

We allow more tolerance for an index that starts late into track B than for one that starts early during a transition, forcing the listener to hear the end of track A before track B is properly in.
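For concreteness, the average discrepancy between two cue sheets (Section 2) can be computed as below. This is our own sketch; the start times in the usage comment are invented.

```python
def average_discrepancy(indexes_a, indexes_b):
    """Mean absolute difference between corresponding track start times
    (e.g. in seconds) from two cue sheets for the same show."""
    pairs = list(zip(indexes_a, indexes_b))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Hypothetical example: two cue sheets for the same show, four tracks each.
# average_discrepancy([36, 410, 790, 1205], [38, 405, 800, 1199]) == 5.75
```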