Methods and Datasets for DJ-Mix Reverse Engineering

Diemo Schwarz (Ircam Lab, CNRS, Sorbonne Université, Ministère de la Culture, Paris, France)
Dominique Fourer (IBISC, Université d’Évry-Val-d’Essonne/Paris-Saclay, Évry, France)
Collaboration Context: The ABC_DJ EU Project
ABC_DJ: Artist to Business to Business to Consumer Audio Branding System (http://abcdj.eu)
MIR tools for audio branding: automatic, DJ-like playback of playlists in stores, with project partner HearDis! GmbH
The ABC_DJ project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 688122.

Scientific Context: Understanding DJ Culture & Practices
Important part of popular music culture
Enables:
- musicological research in popular music
- studies on DJ culture
- computer support of DJing
- automation of DJ mixing
Qualitative accounts exist, but…

Problem: Lack of Annotated Databases of DJ Mixes or DJ Sets
DJ mixes are available at very large scale (millions), often with tracklists, e.g. on http://www.mixcloud.com, YouTube, and podcasts, but very few annotated databases exist.
Existing research in studio multi-track mixing and unmixing in DAWs
Existing work on DJ production tools, but no information retrieval from recorded mixes

Needed Components

Inputs: DJ mixes, audio tracks, cultural data (social tags attached to the music)

Content Analysis:
- Identification: identify the contained tracks (fingerprinting) → playlist; get track start and end in the mix
- Alignment: determine tempo changes
- Time-Scaling: beat-aligned mixing
- Unmixing (suggested here): estimate fade curves for volume, bass/treble, and parameters of other effects (compression, echo, etc.)

Context Analysis: derive genre and social tags attached to the music

Mix Data: informs about the choices a DJ makes when creating a mix → downstream research enabled by DJ mix annotation

Proposed Method for DJ Mix Reverse Engineering
Input:
• recorded DJ mix
• playlist (list of tracks in the mix in correct order)
• audio files of the original tracks
Five steps:
1. rough alignment
2. sample alignment
3. verification by track removal
4. estimation of gain curves
5. estimation of cue regions

Step 1: Rough Alignment by DTW
Dynamic Time Warping alignment of the concatenated MFCCs of the tracks with the MFCCs of the mix:
→ relative positioning of the tracks in the mix (intersections of the alignment path)
→ speed factor (slopes of the path)

[Figure: DTW alignment path, mix frames vs. frames of the concatenated tracks' MFCCs]
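The DTW rough alignment of step 1 can be sketched as follows. This is a minimal, self-contained toy (plain numpy, 1-D features instead of MFCCs, and a trivial "mix" that simply concatenates the tracks); `dtw_path` is our own illustrative implementation, not the code used for the poster's results.

```python
import numpy as np

def dtw_path(A, B):
    """Plain dynamic time warping between feature sequences A (n, d) and
    B (m, d); returns the optimal warping path as (i, j) index pairs."""
    n, m = len(A), len(B)
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path = [(n - 1, m - 1)]                     # backtrack from the end
    i, j = n, m
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda s: D[s])
        path.append((i - 1, j - 1))
    return path[::-1]

# Toy "tracks" and a trivial "mix" that concatenates them back to back
track1 = np.sin(np.linspace(0, 6, 60))[:, None]
track2 = np.cos(np.linspace(0, 6, 60))[:, None]
concat = np.vstack([track1, track2])            # concatenated reference tracks
mix = np.vstack([track1, track2])               # identical here, so the path
path = dtw_path(concat, mix)                    # is the main diagonal
speed = (path[-1][1] - path[0][1]) / (path[-1][0] - path[0][0])  # slope ≈ speed factor
```

With identical sequences the optimal path is the diagonal, whose slope of 1 corresponds to no speed change; in a real mix, the slope of each track's segment of the path gives its speed factor.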
Step 2: Sample Alignment
Refine the alignment to sample precision:
1. time-scale the source track according to the estimated speed factor
2. search for the best sample shift around the rough (frame-level) alignment by maximising the cross-correlation between mix and track

Step 3: Verification by Track Removal
The success of the sample alignment can be verified by subtracting the aligned and time-scaled track from the mix → drop in RMS energy.
/!\ The method is applicable even when ground truth is unknown or inexact!

[Figure: difference of RMS energy over time for mix set123mix3-none-bass-04]
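Steps 2 and 3 can be sketched together in plain numpy. This toy assumes a speed factor of 1 (the time-scaling of step 2 is omitted) and synthetic signals; `best_shift` and `rms_db` are illustrative helper names, not the poster's code.

```python
import numpy as np

def best_shift(mix, track, center, radius):
    """Step 2: search the sample shift around the rough DTW estimate `center`
    that maximises the cross-correlation between mix and track."""
    best, best_corr = center, -np.inf
    for shift in range(center - radius, center + radius + 1):
        seg = mix[shift:shift + len(track)]
        if len(seg) < len(track):
            continue
        corr = float(np.dot(seg, track))
        if corr > best_corr:
            best, best_corr = shift, corr
    return best

def rms_db(x):
    """RMS energy in dB (small floor avoids log of zero)."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

rng = np.random.default_rng(0)
track = rng.standard_normal(4000)
mix = np.concatenate([np.zeros(1000), track, np.zeros(1000)])  # track starts at 1000

shift = best_shift(mix, track, center=990, radius=32)   # rough estimate off by 10
residual = mix.copy()
residual[shift:shift + len(track)] -= track             # step 3: remove the track
drop = rms_db(mix) - rms_db(residual)                   # large drop => alignment verified
```

A drop in RMS energy well above the noise floor indicates that the track was correctly located, even without ground truth.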
Step 4: Volume Curve Estimation
Estimate the volume curves $\hat{a}_i$ applied to each track to obtain the mix. Novel method based on the time-frequency representations $X$ (mix) and $S_i$ (track):

1. Compute the short-time Fourier transforms (STFT) of $x$ and $s_i$, denoted $X(n,m)$ and $S_i(n,m)$ ($n$ and $m$ being respectively the time and frequency indices).
2. Estimate the volume curve $\hat{a}_i$ at each instant $n$ by computing the median of the mix/track ratio along the frequencies $m' \in M$, where $M$ is the set of frequency indices where $|S_i(n,m')|^2 > 0$:
$$\hat{a}_i(n) = \begin{cases} \operatorname*{median}_{m' \in M} \left( \dfrac{|X(n,m')|}{|S_i(n,m')|} \right) & \text{if } \exists\, m' \text{ s.t. } |S_i(n,m')|^2 > 0 \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
3. Optionally post-process $\hat{a}_i(n)$ to obtain a smooth curve by removing outliers with a second median filter, for which a kernel size of 20 gives good results in practice.

The resulting volume curve can then be used to estimate the cue points (the time instants when a fading effect begins or stops) in the next step. An illustration of the resulting process is presented in Fig. 3.
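A minimal numpy sketch of the volume curve estimation of step 4. The STFT is a bare framed rFFT and, deviating from eq. (2)'s strict $|S_i|^2 > 0$ condition, the median is restricted to bins with significant track energy to avoid numerical noise in this toy; the optional second median filter is omitted.

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    """Magnitude STFT via a bare framed rFFT with a Hann window."""
    w = np.hanning(win)
    n = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * w for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1))

def volume_curve(mix, track):
    """Median over frequencies of |X(n,m)| / |S_i(n,m)| per frame (eq. 2),
    here restricted to bins with significant track energy."""
    X, S = stft_mag(mix), stft_mag(track)
    thresh = 0.01 * S.max()
    a = np.zeros(len(S))
    for n in range(len(S)):
        m = S[n] > thresh
        if m.any():
            a[n] = np.median(X[n, m] / S[n, m])
    return a

# Toy check: a sine "track" with a linear fade-in applied in the "mix"
sr = 8000
t = np.arange(sr) / sr
track = np.sin(2 * np.pi * 440 * t)
mix = np.linspace(0.0, 1.0, sr) * track
a = volume_curve(mix, track)        # rises roughly linearly from ~0 to ~1
```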
Step 5: Cue Point Estimation
Cue points are the start and end points of fades. To estimate the DJ cue points, we apply a linear regression of $\hat{a}_i$ at the time instants located at the beginning and at the end of the resulting volume curve (where $\hat{a}_i(n) < \tau$, $\tau$ being a threshold defined arbitrarily as $\tau = 0.7 \max(\hat{a}_i)$). Assuming that a linear fading effect was applied, the cue points can easily be deduced from the two affine equations resulting from the linear regression. The four estimated cue points correspond respectively to:

1. $n_1$, the time instant when the fade-in curve is equal to 0
2. $n_2$, the time instant when the fade-in curve is equal to $\max(\hat{a}_i)$
3. $n_3$, the time instant when the fade-out curve is equal to $\max(\hat{a}_i)$
4. $n_4$, the time instant when the fade-out curve is equal to 0.

To illustrate the efficiency of the entire method (steps 4 and 5), Fig. 3 presents the results obtained on a real-world DJ mix extracted from our proposed dataset.
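The fade-in half of step 5 can be sketched as follows (the fade-out cue points $n_3$ and $n_4$ are obtained symmetrically from the end of the curve); `cue_points` is an illustrative name and the toy curve is synthetic.

```python
import numpy as np

def cue_points(a, tau_ratio=0.7):
    """Step 5, fade-in side: fit a line to the leading part of the volume
    curve a where a < tau = tau_ratio * max(a), then solve the resulting
    affine equation for the instants where the fade reaches 0 and max(a)."""
    amax = a.max()
    k = int(np.argmax(a >= tau_ratio * amax))   # first frame above threshold
    slope, intercept = np.polyfit(np.arange(k), a[:k], 1)  # least-squares line
    n1 = -intercept / slope                     # fade-in curve equals 0
    n2 = (amax - intercept) / slope             # fade-in curve equals max(a)
    return n1, n2

# Toy volume curve: linear fade-in over the first 100 frames, then a plateau
a = np.clip(np.arange(200) / 100.0, 0.0, 1.0)
n1, n2 = cue_points(a)                          # expect n1 ≈ 0, n2 ≈ 100
```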
Fig. 3: Estimated volume curve (black), linear fades (blue), ground truth fades (red)

http://zenodo.org/record/1422385
The UnmixDB Open DJ-Mix Dataset
• Automatically generated “ecologically valid” beat-synchronous mixes based on CC-licensed, freely available music tracks from the net label http://www.mixotic.net, curated by Sonnleitner, Arzt & Widmer (2016)
• Each mix combines 3 track excerpts of ~40 s (each excerpt starts cutting into the end of the previous one on a downbeat)
• Precise ground truth about the placement of tracks in a mix, fade curves, and speed
• Mixes generated in 12 variants:
  – 4 effects: no effect, bass boost, dynamics compression, distortion
  – 3 time-scaling algorithms: none, resample, time stretch
• 6 sets of tracks and mixes, 500 MB – 1 GB each, 4 GB total
• Python source code for mix generation at https://github.com/Ircam-RnD/unmixdb-creation
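To illustrate the kind of ground truth UnmixDB provides, here is a toy linear-crossfade mix generator. This is not the actual unmixdb-creation code: beat alignment, time-scaling, and effects are all omitted, and `crossfade_mix` is an illustrative name. The returned start sample of the incoming track is the kind of annotation the dataset ships.

```python
import numpy as np

def crossfade_mix(a, b, fade_len):
    """Overlap two excerpts with complementary linear fades: a fades out
    while b fades in over fade_len samples.  Returns the mix and the
    ground-truth start sample of b."""
    a, b = a.copy(), b.copy()
    fade = np.linspace(0.0, 1.0, fade_len)
    a[-fade_len:] *= fade[::-1]                 # fade-out of the outgoing track
    b[:fade_len] *= fade                        # fade-in of the incoming track
    out = np.zeros(len(a) + len(b) - fade_len)
    out[:len(a)] += a
    out[len(a) - fade_len:] += b                # overlap sums the two fades
    return out, len(a) - fade_len

# Two constant "excerpts": the complementary fades sum to 1 in the overlap
out, start = crossfade_mix(np.ones(1000), np.ones(800), fade_len=200)
```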
Evaluation Measures and Results: Alignment

• frame error: absolute error between ground truth and track start time from DTW rough alignment (step 1) [s]
• sample error: absolute error between ground truth and track start time from sample alignment (step 2) [s]

[Figure: box plots of log frame error and log sample error for each mix variant (time-scaling: none/resample/stretch × effect: none/bass/compression/distortion)]

Evaluation Measures and Results: Speed Ratio

• speed ratio: ratio between ground truth and the speed factor estimated by DTW alignment (step 1, ideal value is 1)

[Figure: box plots of speed ratio for each mix variant]
Reinjecting Ground Truth Speed

• High sensitivity of sample alignment and track removal to the accuracy of the speed estimation from DTW
• Judge its influence by reinjecting the ground truth speed for time-scaling
• top: sample alignment error; bottom: sample alignment error with ground truth speed reinjected for time-scaling

[Figure: box plots of log sample error, with and without reinjected ground truth speed, for each mix variant]

Evaluation Measures and Results: Suppression

• suppression ratio: ratio of track time with >15 dB of track removal (step 3, bigger is better)
• top: with DTW-estimated speed; bottom: with ground truth speed for time-scaling

[Figure: RMS energy difference for mix set123mix3-none-bass-04; box plots of suppression ratio for each mix variant]
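The suppression ratio can be computed from the mix and the residual after track removal as sketched below (windowed RMS in dB; `suppression_ratio` is an illustrative name and the window size is an assumption).

```python
import numpy as np

def suppression_ratio(mix, residual, thresh_db=15.0, win=1024):
    """Step 3 metric: fraction of windows where removing the track lowered
    the RMS energy by more than thresh_db (bigger is better)."""
    def rms_db(x):
        return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)
    n = len(mix) // win
    drops = np.array([rms_db(mix[i * win:(i + 1) * win])
                      - rms_db(residual[i * win:(i + 1) * win]) for i in range(n)])
    return float(np.mean(drops > thresh_db))

# Toy: near-perfect removal (-40 dB) in the first half, none in the second
mix = np.ones(8192)
residual = mix.copy()
residual[:4096] *= 0.01
ratio = suppression_ratio(mix, residual)        # → 0.5
```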
Evaluation Measures and Results: Fade Error

• fade error: total difference between ground truth and estimated fade curves (steps 4 and 5) [dB/s]

[Figure: box plots of fade error for each mix variant]

Conclusion and Future Work
• Our DJ-mix reverse engineering method is validated on the artificial open UnmixDB dataset → retrieval of rich data from existing real DJ mixes
• With some refinements, our method could become robust and precise enough to allow the inversion of EQ and other processing (e.g. compression, echo)
• Extend to mixes with non-constant tempo curves and more effects
• The close link between alignment, time-scaling, and unmixing hints at a joint and possibly iterative estimation algorithm
• Other signal representations (MFCC, spectrum, chroma, scattering transform)?
beware: IANADJ!