Methods and Datasets for DJ-Mix Reverse Engineering

Diemo Schwarz (Ircam Lab, CNRS, Sorbonne Université, Ministère de la Culture, Paris, France)
Dominique Fourer (IBISC, Université d’Évry-Val-d’Essonne/Paris-Saclay, Évry, France)
Collaboration Context: The ABC_DJ EU Project
ABC_DJ: Artist to Business to Business to Consumer Audio Branding System (http://abcdj.eu)
MIR tools for audio branding: automatic, DJ-like playback of playlists in stores, with project partner HearDis! GmbH
The ABC_DJ project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 688122.

Scientific Context: Understanding DJ Culture & Practices
Important part of popular music culture
Enables:
- musicological research in popular music
- studies on DJ culture
- computer support of DJing
- automation of DJ mixing
Qualitative accounts exist, but…

Problem: Lack of Annotated Databases of DJ Mixes or DJ Sets
DJ mixes are available at very large scale (millions), often with tracklists, e.g. on http://www.mixcloud.com, YouTube, and podcasts, but very few annotated databases exist.
Existing research in studio multi-track mixing and unmixing in DAWs
Existing work on DJ production tools, but no information retrieval from recorded mixes

Needed Components

Inputs: DJ mixes, audio tracks, cultural data (social tags attached to the music)

Content Analysis:
- Identification: identify the contained tracks (fingerprinting) → playlist; get track start and end in the mix
- Alignment: determine tempo changes
- Time-Scaling: beat-aligned mixing
- Unmixing (suggested here): estimate fade curves for volume, bass/treble, and parameters of other effects (compression, echo, etc.)

Context Analysis: derive genre and social tags attached to the music

Mix Data: informs about the choices a DJ makes when creating a mix → downstream research enabled by DJ mix annotation

Proposed Method for DJ Mix Reverse Engineering
Input:
• recorded DJ mix
• playlist (list of tracks in the mix in correct order)
• audio files of the original tracks
Five steps:
1. rough alignment
2. sample alignment
3. verification by track removal
4. estimation of gain curves
5. estimation of cue regions

Step 1: Rough Alignment by DTW
Dynamic Time Warping alignment of the concatenated MFCCs of the tracks with the MFCCs of the mix:
→ relative positioning of the tracks in the mix (intersections of the alignment path)
→ speed factor (slopes of the path)

[Figure: DTW alignment path, mix frames vs. frames of the concatenated tracks' MFCCs]
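The DTW rough alignment of step 1 can be sketched as follows. This is a minimal, self-contained toy (plain numpy, 1-D features instead of MFCCs, and a trivial "mix" that simply concatenates the tracks); `dtw_path` is our own illustrative implementation, not the code used for the poster's results.

```python
import numpy as np

def dtw_path(A, B):
    """Plain dynamic time warping between feature sequences A (n, d) and
    B (m, d); returns the optimal warping path as (i, j) index pairs."""
    n, m = len(A), len(B)
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path = [(n - 1, m - 1)]                     # backtrack from the end
    i, j = n, m
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda s: D[s])
        path.append((i - 1, j - 1))
    return path[::-1]

# Toy "tracks" and a trivial "mix" that concatenates them back to back
track1 = np.sin(np.linspace(0, 6, 60))[:, None]
track2 = np.cos(np.linspace(0, 6, 60))[:, None]
concat = np.vstack([track1, track2])            # concatenated reference tracks
mix = np.vstack([track1, track2])               # identical here, so the path
path = dtw_path(concat, mix)                    # is the main diagonal
speed = (path[-1][1] - path[0][1]) / (path[-1][0] - path[0][0])  # slope ≈ speed factor
```

With identical sequences the optimal path is the diagonal, whose slope of 1 corresponds to no speed change; in a real mix, the slope of each track's segment of the path gives its speed factor.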
Step 2: Sample Alignment
Refine the alignment to sample precision:
1. time-scale the source track according to the estimated speed factor
2. search for the best sample shift around the rough (frame-level) alignment by maximising the cross-correlation between mix and track

Step 3: Verification by Track Removal
The success of the sample alignment can be verified by subtracting the aligned and time-scaled track from the mix → drop in RMS energy.
/!\ The method is applicable even when ground truth is unknown or inexact!

[Figure: difference of RMS energy over time for mix set123mix3-none-bass-04]
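Steps 2 and 3 can be sketched together in plain numpy. This toy assumes a speed factor of 1 (the time-scaling of step 2 is omitted) and synthetic signals; `best_shift` and `rms_db` are illustrative helper names, not the poster's code.

```python
import numpy as np

def best_shift(mix, track, center, radius):
    """Step 2: search the sample shift around the rough DTW estimate `center`
    that maximises the cross-correlation between mix and track."""
    best, best_corr = center, -np.inf
    for shift in range(center - radius, center + radius + 1):
        seg = mix[shift:shift + len(track)]
        if len(seg) < len(track):
            continue
        corr = float(np.dot(seg, track))
        if corr > best_corr:
            best, best_corr = shift, corr
    return best

def rms_db(x):
    """RMS energy in dB (small floor avoids log of zero)."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

rng = np.random.default_rng(0)
track = rng.standard_normal(4000)
mix = np.concatenate([np.zeros(1000), track, np.zeros(1000)])  # track starts at 1000

shift = best_shift(mix, track, center=990, radius=32)   # rough estimate off by 10
residual = mix.copy()
residual[shift:shift + len(track)] -= track             # step 3: remove the track
drop = rms_db(mix) - rms_db(residual)                   # large drop => alignment verified
```

A drop in RMS energy well above the noise floor indicates that the track was correctly located, even without ground truth.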
Step 4: Volume Curve Estimation
Estimate the volume curves $\hat{a}_i$ applied to each track to obtain the mix. Novel method based on the time-frequency representations $X$ (mix) and $S_i$ (track):

1. Compute the short-time Fourier transforms (STFT) of $x$ and $s_i$, denoted $X(n,m)$ and $S_i(n,m)$ ($n$ and $m$ being respectively the time and frequency indices).
2. Estimate the volume curve $\hat{a}_i$ at each instant $n$ by computing the median of the mix/track ratio along the frequencies $m' \in M$, where $M$ is the set of frequency indices where $|S_i(n,m')|^2 > 0$:
$$\hat{a}_i(n) = \begin{cases} \operatorname*{median}_{m' \in M} \left( \dfrac{|X(n,m')|}{|S_i(n,m')|} \right) & \text{if } \exists\, m' \text{ s.t. } |S_i(n,m')|^2 > 0 \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
3. Optionally post-process $\hat{a}_i(n)$ to obtain a smooth curve by removing outliers with a second median filter, for which a kernel size of 20 gives good results in practice.

The resulting volume curve can then be used to estimate the cue points (the time instants when a fading effect begins or stops) in the next step. An illustration of the resulting process is presented in Fig. 3.
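A minimal numpy sketch of the volume curve estimation of step 4. The STFT is a bare framed rFFT and, deviating from eq. (2)'s strict $|S_i|^2 > 0$ condition, the median is restricted to bins with significant track energy to avoid numerical noise in this toy; the optional second median filter is omitted.

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    """Magnitude STFT via a bare framed rFFT with a Hann window."""
    w = np.hanning(win)
    n = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * w for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1))

def volume_curve(mix, track):
    """Median over frequencies of |X(n,m)| / |S_i(n,m)| per frame (eq. 2),
    here restricted to bins with significant track energy."""
    X, S = stft_mag(mix), stft_mag(track)
    thresh = 0.01 * S.max()
    a = np.zeros(len(S))
    for n in range(len(S)):
        m = S[n] > thresh
        if m.any():
            a[n] = np.median(X[n, m] / S[n, m])
    return a

# Toy check: a sine "track" with a linear fade-in applied in the "mix"
sr = 8000
t = np.arange(sr) / sr
track = np.sin(2 * np.pi * 440 * t)
mix = np.linspace(0.0, 1.0, sr) * track
a = volume_curve(mix, track)        # rises roughly linearly from ~0 to ~1
```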
Step 5: Cue Point Estimation
Cue points are the start and end points of fades. To estimate the DJ cue points, we apply a linear regression of $\hat{a}_i$ at the time instants located at the beginning and at the end of the resulting volume curve (where $\hat{a}_i(n) < \tau$, $\tau$ being a threshold defined arbitrarily as $\tau = 0.7 \max(\hat{a}_i)$). Assuming that a linear fading effect was applied, the cue points can easily be deduced from the two affine equations resulting from the linear regression. The four estimated cue points correspond respectively to:

1. $n_1$, the time instant when the fade-in curve is equal to 0
2. $n_2$, the time instant when the fade-in curve is equal to $\max(\hat{a}_i)$
3. $n_3$, the time instant when the fade-out curve is equal to $\max(\hat{a}_i)$
4. $n_4$, the time instant when the fade-out curve is equal to 0.

To illustrate the efficiency of the entire method (steps 4 and 5), Fig. 3 presents the results obtained on a real-world DJ mix extracted from our proposed dataset.
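The fade-in half of step 5 can be sketched as follows (the fade-out cue points $n_3$ and $n_4$ are obtained symmetrically from the end of the curve); `cue_points` is an illustrative name and the toy curve is synthetic.

```python
import numpy as np

def cue_points(a, tau_ratio=0.7):
    """Step 5, fade-in side: fit a line to the leading part of the volume
    curve a where a < tau = tau_ratio * max(a), then solve the resulting
    affine equation for the instants where the fade reaches 0 and max(a)."""
    amax = a.max()
    k = int(np.argmax(a >= tau_ratio * amax))   # first frame above threshold
    slope, intercept = np.polyfit(np.arange(k), a[:k], 1)  # least-squares line
    n1 = -intercept / slope                     # fade-in curve equals 0
    n2 = (amax - intercept) / slope             # fade-in curve equals max(a)
    return n1, n2

# Toy volume curve: linear fade-in over the first 100 frames, then a plateau
a = np.clip(np.arange(200) / 100.0, 0.0, 1.0)
n1, n2 = cue_points(a)                          # expect n1 ≈ 0, n2 ≈ 100
```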
Fig. 3: Estimated volume curve (black), linear fades (blue), ground truth fades (red)

http://zenodo.org/record/1422385
The UnmixDB Open DJ-Mix Dataset
• Automatically generated “ecologically valid” beat-synchronous mixes based on CC-licensed, freely available music tracks from the net label http://www.mixotic.net, curated by Sonnleitner, Arzt & Widmer (2016)
• Each mix combines 3 track excerpts of ~40 s (each excerpt starts cutting into the end of the previous one on a downbeat)
• Precise ground truth about the placement of tracks in a mix, fade curves, and speed
• Mixes generated in 12 variants:
  – 4 effects: no effect, bass boost, dynamics compression, distortion
  – 3 time-scaling algorithms: none, resample, time stretch
• 6 sets of tracks and mixes, 500 MB – 1 GB each, 4 GB total
• Python source code for mix generation at https://github.com/Ircam-RnD/unmixdb-creation
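To illustrate the kind of ground truth UnmixDB provides, here is a toy linear-crossfade mix generator. This is not the actual unmixdb-creation code: beat alignment, time-scaling, and effects are all omitted, and `crossfade_mix` is an illustrative name. The returned start sample of the incoming track is the kind of annotation the dataset ships.

```python
import numpy as np

def crossfade_mix(a, b, fade_len):
    """Overlap two excerpts with complementary linear fades: a fades out
    while b fades in over fade_len samples.  Returns the mix and the
    ground-truth start sample of b."""
    a, b = a.copy(), b.copy()
    fade = np.linspace(0.0, 1.0, fade_len)
    a[-fade_len:] *= fade[::-1]                 # fade-out of the outgoing track
    b[:fade_len] *= fade                        # fade-in of the incoming track
    out = np.zeros(len(a) + len(b) - fade_len)
    out[:len(a)] += a
    out[len(a) - fade_len:] += b                # overlap sums the two fades
    return out, len(a) - fade_len

# Two constant "excerpts": the complementary fades sum to 1 in the overlap
out, start = crossfade_mix(np.ones(1000), np.ones(800), fade_len=200)
```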
Evaluation Measures and Results: Alignment

• frame error: absolute error between ground truth and track start time from DTW rough alignment (step 1) [s]
• sample error: absolute error between ground truth and track start time from sample alignment (step 2) [s]

[Figure: box plots of log frame error and log sample error for each mix variant (time-scaling: none/resample/stretch × effect: none/bass/compression/distortion)]

Evaluation Measures and Results: Speed Ratio

• speed ratio: ratio between ground truth and the speed factor estimated by DTW alignment (step 1, ideal value is 1)

[Figure: box plots of speed ratio for each mix variant]
Reinjecting Ground Truth Speed

• High sensitivity of sample alignment and track removal to the accuracy of the speed estimation from DTW
• Judge its influence by reinjecting the ground truth speed for time-scaling
• top: sample alignment error; bottom: sample alignment error with ground truth speed reinjected for time-scaling

[Figure: box plots of log sample error, with and without reinjected ground truth speed, for each mix variant]

Evaluation Measures and Results: Suppression

• suppression ratio: ratio of track time with >15 dB of track removal (step 3, bigger is better)
• top: with DTW-estimated speed; bottom: with ground truth speed for time-scaling

[Figure: RMS energy difference for mix set123mix3-none-bass-04; box plots of suppression ratio for each mix variant]
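The suppression ratio can be computed from the mix and the residual after track removal as sketched below (windowed RMS in dB; `suppression_ratio` is an illustrative name and the window size is an assumption).

```python
import numpy as np

def suppression_ratio(mix, residual, thresh_db=15.0, win=1024):
    """Step 3 metric: fraction of windows where removing the track lowered
    the RMS energy by more than thresh_db (bigger is better)."""
    def rms_db(x):
        return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)
    n = len(mix) // win
    drops = np.array([rms_db(mix[i * win:(i + 1) * win])
                      - rms_db(residual[i * win:(i + 1) * win]) for i in range(n)])
    return float(np.mean(drops > thresh_db))

# Toy: near-perfect removal (-40 dB) in the first half, none in the second
mix = np.ones(8192)
residual = mix.copy()
residual[:4096] *= 0.01
ratio = suppression_ratio(mix, residual)        # → 0.5
```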
Evaluation Measures and Results: Fade Error

• fade error: total difference between ground truth and estimated fade curves (steps 4 and 5) [dB/s]

[Figure: box plots of fade error for each mix variant]

Conclusion and Future Work
• Our DJ-mix reverse engineering method is validated on the artificial open UnmixDB dataset → retrieval of rich data from existing real DJ mixes
• With some refinements, our method could become robust and precise enough to allow the inversion of EQ and other processing (e.g. compression, echo)
• Extend to mixes with non-constant tempo curves and more effects
• The close link between alignment, time-scaling, and unmixing hints at a joint and possibly iterative estimation algorithm
• Other signal representations (MFCC, spectrum, chroma, scattering transform)?
beware: IANADJ!