
A CROWDSOURCED EXPERIMENT FOR TEMPO ESTIMATION OF ELECTRONIC DANCE MUSIC

Hendrik Schreiber, tagtraum industries incorporated, [email protected]
Meinard Müller, International Audio Laboratories Erlangen, [email protected]

ABSTRACT

Relative to other datasets, state-of-the-art tempo estimation algorithms perform poorly on the GiantSteps Tempo dataset for electronic dance music (EDM). In order to investigate why, we conducted a large-scale, crowdsourced experiment involving 266 participants from two distinct groups. The quality of the collected data was evaluated with regard to the participants' input devices and background. In the data itself we observed significant tempo ambiguities, which we attribute to annotator subjectivity and tempo instability. As a further contribution, we then constructed new annotations consisting of tempo distributions for each track. Using these annotations, we re-evaluated two recent state-of-the-art tempo estimation systems and achieved significantly improved results. The main conclusions of this investigation are that current tempo estimation systems perform better than previously thought and that evaluation quality needs to be improved. The new crowdsourced annotations will be released for evaluation purposes.

© Hendrik Schreiber, Meinard Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Hendrik Schreiber, Meinard Müller, "A Crowdsourced Experiment for Tempo Estimation of Electronic Dance Music", 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018.

1. INTRODUCTION

Estimation of a music piece's global tempo is a classic music information retrieval (MIR) task. It is often defined as estimating the frequency with which humans tap along to the beat. A necessary precondition for successful global tempo estimation is the existence of a stable tempo, as it often occurs in rock, pop, or dance music. To evaluate a tempo estimation system one needs the system itself, a dataset with suitable tempo annotations, and one or more metrics. One such dataset, named GiantSteps Tempo, was released by Knees et al. in 2015 [6]. It was created by scraping a forum that let listeners discuss Beatport tracks (http://www.beatport.com/, an online music store) with wrong tempo labels. Scraping was done via a script, and 15% of the labels were manually verified. All 664 tracks in the dataset belong to the umbrella genre electronic dance music (EDM) with its subgenres trance, drum-and-bass, techno, etc.

Since its release, several academic and commercial tempo estimation systems have been tested against the dataset (e.g. [12]). As is common for datasets annotated with only a single tempo per track, the two metrics Accuracy1 and Accuracy2 were used. Accuracy1 is defined as the fraction of correct estimates while allowing a tolerance of 4%. Accuracy2 additionally allows estimates to be wrong by a factor of 2, 3, 1/2 or 1/3 (so-called octave errors). The highest results reported for the GiantSteps dataset are 77.0% Accuracy1 by the application NI Traktor Pro 2 (with octave bias 88–175, https://www.native-instruments.com/en/products/traktor/dj-software/traktor-pro-2/) and 90.2% Accuracy2 by CrossDJ (with octave bias 75–150, http://www.mixvibes.com/cross-dj-software-mac-pc/); more benchmark results are available at http://www.cp.jku.at/datasets/giantsteps/. These results are surprisingly low: the highest reported Accuracy2 values for other commonly used datasets like ACM Mirum [10], Ballroom [4], and GTzan [13] are greater than 95% [1]. Since EDM is associated with repeating rhythmic patterns and steady tempi [2, 7], it should be comparatively easy to estimate the tempo for this genre. We hypothesize that the relatively low accuracy values have multiple possible reasons. Since the annotations were scraped off a forum for disputed tempo labels, the dataset may contain many tracks that are especially difficult for humans to annotate. And even if not difficult for humans to annotate, it is conceivable that the tracks are particularly hard for algorithms to analyze. Lastly, if neither humans nor algorithms fail, perhaps some of the scraped annotations are simply wrong.

In this paper we investigate why tempo estimation systems perform so poorly for GiantSteps Tempo. To this end, we conducted a large, crowdsourced experiment to collect new tempo data for GiantSteps Tempo from human participants. The experiment is described in detail in Section 2. The data is analyzed in Section 3 and used to create a new ground-truth. This ground-truth is then compared to the original ground-truth and used to evaluate two recent algorithms. The results are discussed in Section 4. Finally, in Section 5, we summarize our findings and draw conclusions.

2. EXPERIMENT

In order to generate a new ground-truth for the GiantSteps Tempo dataset, we set up a web-based experiment in which we asked participants to tap along to audio excerpts using their keyboard or touchscreen. The user interface for this experiment is depicted in Figure 1. Since most tracks from the dataset are 2 min long and tapping for the full duration is difficult, we split each track into half-overlapping 30 s segments. Out of the 664 tracks we created 4,640 such segments (in most cases 7 per track). To measure tempo, it is not important for tap and beat to occur at the same time. In contrast to experiments for beat tracking, phase shifts, input method latencies, or anticipatory early tapping, known as negative mean asynchrony (NMA), are irrelevant, as long as they stay constant (see [11] for an overview of tapping and [3, 5] for beat tracking).

Figure 1: Illustration of the web-based interface used in our experimental user study.

Therefore participants were asked to tap along to randomly chosen segments as steadily as possible, over the entire duration of 30 s, without skipping beats. To encourage steady tapping, the user interface gave immediate feedback in the form of the mean tempo µ in BPM, the median tempo med in BPM, and the standard deviation of the inter-tap intervals (ITI) σ in milliseconds, as well as textual messages and emojis (Figure 1). When calculating the standard deviation, the first three taps were ignored, as those are typically of low quality (users have to find their way into the groove). When the standard deviation σ stayed very low, smilies, thumbs up, and textual praise were shown. When σ climbed above a certain threshold, the user was shown sad faces and messages like "Did you miss a beat? Try to tap more steadily." To prevent low quality submissions, users were only allowed to proceed to the next track once four conditions were met:

1. 20 or more taps
2. Taps cover at least 15 s
3. ITI standard deviation: σ < 50 ms
4. Median tempo: 50 ≤ med ≤ 210 BPM
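
For illustration, the following minimal Python sketch checks a single tap submission against these four conditions. It assumes taps are given as timestamps in seconds; the function and parameter names are illustrative and not taken from the actual experiment code.

    import statistics

    def submission_ok(taps, min_taps=20, min_span=15.0, max_iti_std=0.050,
                      min_bpm=50.0, max_bpm=210.0):
        """Check one tap submission (timestamps in seconds) against the four conditions."""
        if len(taps) < min_taps:                           # 1) 20 or more taps
            return False
        taps = sorted(taps)
        if taps[-1] - taps[0] < min_span:                  # 2) taps cover at least 15 s
            return False
        # inter-tap intervals; the first three taps are ignored, as in the feedback display
        itis = [b - a for a, b in zip(taps[3:], taps[4:])]
        if statistics.stdev(itis) >= max_iti_std:          # 3) ITI standard deviation below 50 ms
            return False
        med_bpm = 60.0 / statistics.median(itis)           # 4) median tempo between 50 and 210 BPM
        return min_bpm <= med_bpm <= max_bpm
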

While the first three conditions were not explicitly communicated, the instructions made participants aware that the target tempo lies between 50 and 210 BPM. Once all four conditions were met, a large red bar turned green and the Next button became enabled. For situations in which the user was not able to fulfill all conditions, the user interface offered a No Beat checkbox. Once checked, it allowed users to bypass the quality check and proceed to the next segment. It must be noted that there is a tradeoff between encouraging participants to tap well (i.e. steadily) and a bias towards stable tempi. We opted for this design for two reasons: 1) tempo in EDM is usually very steady [2, 7]; 2) the bias is limited to individual tapping sessions at the segment level, i.e. we can still detect tempo stability problems on the track level by aggregating segment level annotations.

Participants were recruited from two distinct groups: academics and people interested in the consumer-level music library management system beaTunes (https://www.beatunes.com/). We refer to the former group as academics and the latter as beaTunes. While members of the academics group were asked to help in this experiment via relevant mailing lists without offering any benefits, members of the beaTunes group were incentivized by promising a reward license for the beaTunes software, if they submitted 110 valid annotations. While it was not explicitly specified what a "valid annotation" is, we attempted to steer people in the right direction using the instructions and the instant feedback mechanisms described above (Figure 1).

3. DATA ANALYSIS

Over a period of 2 1/2 months, 266 persons participated in the experiment, 217 (81.6%) belonging to beaTunes and 49 (18.4%) to academics. Together they submitted 18,684 segment annotations (avg = 4.03 per segment). We made sure that all segments were annotated at least once. Since some segments are harder to annotate than others, we monitored submissions and ensured that segments annotated by participants as very different from the original ground-truth (exceeding a tolerance of 4%) were presented to participants more often than others. The vast majority of annotations was submitted by the beaTunes group (95.1%). Overall, 7.5% of all submissions were marked with No Beat. With 7.6%, the No Beat rate was slightly higher among members of the beaTunes group. Members of academics checked No Beat for only 5.2% of their submissions. Since the experiment was run in the participant's web browser, the browser's user agent for each submission was logged by the web server. Among other information, the user agent contains the name of the participant's operating system. 17,012 (91.1%) of the submissions were sent from desktop operating systems that are typically connected to a physical keyboard; 1,672 (8.9%) were from mobile operating systems that are usually associated with touchscreens. Participants interested in a reward license also had to enter their name and email address. Both datapoints have been removed from the collected data to ensure anonymity.
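
As an illustration of the device split used above, a rough classification of submissions by operating system might look as follows; the token lists and the function name are hypothetical and not part of the actual experiment code.

    # hypothetical OS tokens; real user-agent parsing is messier
    MOBILE_OS_TOKENS = ("iPhone", "iPad", "Android")
    DESKTOP_OS_TOKENS = ("Windows NT", "Macintosh", "X11", "CrOS")

    def input_device_class(user_agent):
        """Rough split into touchscreen (mobile OS) vs. keyboard (desktop OS) submissions."""
        if any(token in user_agent for token in MOBILE_OS_TOKENS):
            return "touchscreen"
        if any(token in user_agent for token in DESKTOP_OS_TOKENS):
            return "keyboard"
        return "unknown"
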

We analyzed the submitted data to find out whether we can find quality differences between submissions from different participant groups (Section 3.1). Section 3.2 introduces metrics for ambiguity and stability. In Section 3.3, we measure to which extent participants agree on one or multiple tempi for the same segment. Then, in Section 3.4, we take a look at segment annotations aggregated on the track level. Finally, in Section 3.5, we investigate whether tempo ambiguity is a genre-dependent phenomenon.

3.1 Submission Quality

We wondered how steadily participants tapped and whether some groups of participants tapped more steadily than others. Specifically, are the beaTunes submissions as good as the academics submissions? We can use the coefficient of variation cv = σ/µ of each submission's ITIs as a normalized indicator for how steadily a participant tapped. To remove tapping outliers within a segment, we sort each submission's ITIs and only keep the central 10 before calculating the cv. This has the effect of reducing cv for all submissions. The average cv for all submissions is cv = 0.0089. Assuming a normal distribution, this means that on average 99.7% of all central 10 ITIs lie within ±2.67% (≡ 3σ) of their submission's mean value. Using cv as a measure for the submission quality of different dataset splits, we found that members of academics tapped significantly more steadily (cv = 0.0074) than members of beaTunes (cv = 0.0090) (Table 1). To test for significance we used Welch's t-test.
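
A minimal sketch of this quality measure, assuming each submission's taps are timestamps in seconds; the helper names are ours, and the group comparison simply delegates to SciPy's Welch's t-test.

    import numpy as np
    from scipy import stats

    def central_itis(taps, keep=10):
        """Sort a submission's inter-tap intervals by value and keep the central `keep` of them."""
        itis = np.sort(np.diff(np.sort(taps)))
        drop = max(0, (len(itis) - keep) // 2)
        return itis[drop:drop + keep]

    def coefficient_of_variation(taps):
        """cv = sigma / mu of the central ITIs, a unit-free measure of tapping steadiness."""
        itis = central_itis(taps)
        return float(np.std(itis) / np.mean(itis))

    def compare_groups(cvs_a, cvs_b):
        """Welch's t-test (unequal variances) between two lists of cv values."""
        return stats.ttest_ind(cvs_a, cvs_b, equal_var=False)
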
Also, submissions from desktop operating systems that are typically installed on devices connected to a physical keyboard (i.e., no touchscreen) are of significantly higher quality (cv = 0.0088) than submissions from devices using iOS or Android as operating system (cv = 0.0095). Despite the differences, we found that even the ITIs from the group with the highest cv, i.e., beaTunes without keyboard, still lie within only ±2.97% (≡ 3σ) of their mean value 99.7% of the time, again assuming a normal distribution. This is well below the tolerance of 4% allowed by Accuracy1.

We conclude that the data submitted by academics with keyboard is of the highest quality with regard to tempo stability, but find that the data submitted by members of beaTunes without keyboard is still acceptable, because the difference in cv is not very large. This may be a direct result of the experiment's design, which did not permit participants to submit highly irregular taps.

Table 1: Average coefficients of variation cv for dataset splits: academics or not, keyboard or not, and keyboard or not within either beaTunes or academics. The low p-values indicate a significant difference between the dataset splits.

    Dataset split          cv (+)    cv (−)    p-value
    ±academics             0.0074    0.0090    3.11e−29
    ±keyboard              0.0088    0.0095    9.74e−7
    beaTunes ±keyboard     0.0089    0.0099    6.71e−10
    academics ±keyboard    0.0074    0.0073    8.68e−1

3.2 Tempo Distribution Metrics

How steadily participants tapped does not say anything about whether they tapped along to the true tempo. But since the purpose of the experiment is to create a new ground-truth, we cannot easily verify submissions for correctness. What we can do, though, is to measure annotator (dis)agreement both for a segment and for all segments belonging to the same track. To this end, we define some metrics based on tapped tempo distributions. To create such a tapped tempo distribution for a segment, we combine the 10 central ITIs from each of its submissions in a histogram T with a bin width of 1 BPM and then normalize so that the T(x_i) sum to 1, with n as the number of bins and x_i as the corresponding BPM values. For T we define local peaks as the highest non-zero T(x_i) for all intervals [x_i − 5, x_i + 5]. This may include very small peaks. We interpret the BPM values x_i of the histogram's local peaks as the perceptually strongest tempi and their heights as equivalent to their saliences. Per-track tempo distributions are created simply by averaging the 7 segment histograms belonging to a given track. For an example, please see Figure 2.

Figure 2: Tempo salience distribution for segment 6 of the track 'Neoteric D&B Mix' by Polex (Beatport id 4397469). Measured values are: P(Ttrack) = 4, P(Tseg6) = 2, A(Ttrack) = 0.30, A(Tseg6) = 0.40, and JSD = 0.24.

As a first, very simple indicator for annotator disagreement, we define P(T) as the number of histogram peaks we find in a given tempo distribution T. A high peak count for a single segment P(Tseg) indicates annotator disagreement for that segment. This is not necessarily true for the peak count for a track P(Ttrack), since it may also be a sign of tempo instability, i.e., tempo changes or no-beat sections. Because the peak count P does not say anything about the peaks' height or salience, it is a relatively crude measure. Therefore we define as a second metric the salience ratio between the most salient and the second most salient peak as a measure for ambiguity. More formally, if s_1 is the salience of the highest peak and s_2 the salience of the second highest peak, then the ambiguity A(T) is defined as:

    A(T) := \begin{cases}
        1 & \text{for } P(T) = 0 \\
        0 & \text{for } P(T) = 1 \\
        s_2 / s_1 & \text{for } P(T) > 1
    \end{cases}                                        (1)

A value close to 0 indicates low and a value close to 1 high ambiguity. This definition is inspired by McKinney et al.'s [8] approach to ambiguity, but not identical. Just like P, we can use A for both segment and track tempo distributions. Again, for tracks we cannot be sure of the ambiguity's source.
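
The following sketch shows how such a segment histogram, its local peaks, and the ambiguity A(T) could be computed. The bin range and function names are our own assumptions; the 1 BPM bin width, the ±5 BPM peak window, and the three cases of A(T) follow the definitions above.

    import numpy as np

    def tempo_histogram(itis_per_submission, min_bpm=1, max_bpm=250):
        """Normalized tempo histogram T (1 BPM bins) from the central ITIs (in seconds)
        of all submissions for one segment."""
        bpms = [60.0 / iti for itis in itis_per_submission for iti in itis]
        edges = np.arange(min_bpm, max_bpm + 2) - 0.5
        hist, _ = np.histogram(bpms, bins=edges)
        return hist / hist.sum()

    def local_peaks(T):
        """Indices i where T[i] is non-zero and maximal within the window [i-5, i+5]."""
        peaks = []
        for i, v in enumerate(T):
            if v > 0 and v == T[max(0, i - 5):i + 6].max():
                peaks.append(i)
        return peaks

    def ambiguity(T):
        """A(T): 1 if no peak, 0 if exactly one peak, otherwise the salience ratio s2/s1."""
        saliences = sorted((T[i] for i in local_peaks(T)), reverse=True)
        if not saliences:
            return 1.0
        if len(saliences) == 1:
            return 0.0
        return saliences[1] / saliences[0]

A track-level distribution is then simply the mean of the 7 segment histograms, and the same peak and ambiguity measures can be applied to it.
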

Finally, we introduce a third metric that focuses more on tempo instability within tracks. Obvious indicators for instabilities are large differences between the tempo distributions of segments belonging to one track. Since we create tapped tempo distributions for each segment in a way that lets us interpret them as probability distributions, we can use the Jensen-Shannon divergence (JSD) for this purpose, which is based on the Shannon entropy H. With the JSD we measure the difference between the entropy of the tempo distribution for the whole track and the average of the individual segment tempo distributions' entropies.

    H(T) := -\sum_{i=1}^{n} T(x_i) \log_b T(x_i)        (2)

    JSD(T_1, \ldots, T_m) := H\!\left( \frac{1}{m} \sum_{j=1}^{m} T_j \right) - \frac{1}{m} \sum_{j=1}^{m} H(T_j)        (3)

To allow an easy interpretation of JSD values, we choose an unusual base for the entropy's logarithm. By setting b = n in (2), we ensure that 0 ≤ JSD ≤ 1. This means that a JSD value near 0 indicates a small difference between the tempo distributions for a track's segments. Correspondingly, a JSD value closer to 1 means that the tempo distributions of a track's segments are very different. To avoid detecting small tempo changes due to annotator disagreements, we convert the segment tempo distributions T to a bin width of 10 BPM before calculating the JSD.
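
A small sketch of this stability measure, assuming the segment histograms come from tempo_histogram above (1 BPM bins, all of equal length); the re-binning strategy and names are ours, while the base-n logarithm and the difference of entropies follow (2) and (3).

    import numpy as np

    def entropy(T, base):
        """Shannon entropy with logarithm base `base`; empty bins contribute nothing."""
        p = np.asarray(T, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p) / np.log(base)))

    def track_jsd(segment_histograms, coarse=10):
        """Jensen-Shannon divergence between a track's segment tempo distributions,
        after re-binning the 1 BPM histograms to `coarse` BPM wide bins."""
        rebinned = []
        for T in segment_histograms:
            T = np.asarray(T, dtype=float)
            T = np.pad(T, (0, (-len(T)) % coarse))       # pad so the length is divisible
            rebinned.append(T.reshape(-1, coarse).sum(axis=1))
        n = len(rebinned[0])                              # number of bins, used as log base
        mixture = np.mean(rebinned, axis=0)               # tempo distribution of the whole track
        return entropy(mixture, n) - np.mean([entropy(T, n) for T in rebinned])
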

3.3 Segment Annotator Agreement

How much do participants agree on a tempo for a given segment? Recall that we have 4,640 segments (and 18,684 annotations for these segments) coming from 664 tracks. As depicted in Figure 3 (top), the submissions for more than half of the segments (2,500 or 53.9%) have just one peak, i.e., P(Tseg) = 1. For 1,514 or 32.6% of all segments we were able to find two peaks, indicating some ambiguity. For 432 segments (9.3%) we found 3 peaks, and for 184 segments (4.0%) 4 peaks or more. 10 segments have no peak at all, because they have been marked as No Beat in all their submissions. When interpreting these numbers one has to keep in mind that some segments have been annotated by very few participants (Figure 3 bottom). To give an example, while the segments annotated with one peak are based on 3.64 submissions on average, the segments annotated with 6 peaks are based on 9.42 submissions per segment. This reflects the fact that we presented difficult segments to participants more often, but could also be caused by increased variability introduced by a higher number of submissions. Because submissions marked as No Beat do not show up in this overview unless all submissions for a segment were No Beat, we counted the segments for which a majority of submissions were marked with No Beat. That was the case for 118 segments (2.5%).

Figure 3: (top) Segments per peak count. (bottom) Average number of submissions per segment by peak count.

As mentioned in Section 3.2, the peak count does not say anything about the peaks' height or salience and is therefore a relatively crude measure. We found that the average ambiguity for all segments is A(Tseg) = 0.25 (with standard deviation σ = 0.32), meaning that on average the highest peak is 4 times more salient than the second highest peak. In other words, we can often observe a peak that is much more salient than others. At the same time, there may also be a second peak with considerable salience.

3.4 Track Annotator Agreement

Just like for the segments, we looked at the number of tracks per peak count. We found only 81 tracks (12.2%) with one peak and 582 tracks (87.8%) with two or more peaks (Figure 4 top). The largest group among the multi-peak tracks are tracks with two peaks (178 or 26.8%). These numbers are much more reliable than the segment peak counts as they are based on at least 25 submissions per track (Figure 4 bottom). Compared to the segments' peak counts we see a larger proportion of tracks with more than one peak. But this does not necessarily mean that the ambiguity A is much higher than for the segments, because peak counts do not account for salience and even small local peaks are counted. In fact, we measured an average ambiguity of A(Ttrack) = 0.26 (with standard deviation σ = 0.27), almost the same average as for the segments. Therefore we attribute the shift towards more peaks to the much higher number of submissions per item and to possible tempo instabilities in the tracks themselves. By tempo instability we mean, for example, a tempo change in the middle of the track, a quiet section, or no beat at all. Any of these cases inherently leads to more peaks. A typical example for a track with a tempo change is shown in Figure 5.

Figure 4: (top) Tracks per peak count. (bottom) Average number of submissions per track by peak count.

In an attempt to quantify tempo instabilities in the submissions we calculated the JSD introduced in Section 3.2. The histogram in Figure 6 shows the distribution of tracks per JSD interval with a bin width of 0.05. The average divergence for the whole dataset is µJSD = 0.15, the standard deviation is σJSD = 0.11. To test whether a high JSD correlates with tempo instabilities, we considered all tracks with JSD > µJSD + 2σJSD = 0.375, resulting in 39 tracks. Performing an informal listening test on these tracks revealed that 3 had no beat, 10 contained a tempo change (e.g. Figure 5), 7 had sections that felt half as fast as other sections (metrical ambiguity), 8 contained larger sections with no discernible beat, 9 were difficult to tap, and 2 had a stable tempo throughout the whole track. From this result one may conclude that a high JSD is connected to tempo instabilities, but it may also just indicate that a track is difficult to tap. Nevertheless, using the JSD helped us find tracks in the GiantSteps Tempo dataset that exhibit tempo stability issues.

Figure 5: Tempo salience distributions for segments of the track 'Rude Boy feat. Omar LinX Union Vocal Mix' (Beatport id 1728723). The track's tempo changes in segment 4, leading to four distinct peaks. With JSD = 0.44 its Jensen-Shannon divergence is high.

Figure 6: Distribution of tracks in the dataset per JSD interval with a bin width of 0.05. The lines show µJSD and µJSD + 2σJSD (red).

Since 2.5% of the segments were annotated most often with No Beat, we wondered whether any tracks have a majority of segments that have predominantly been annotated with No Beat, hinting at the absence of not just a local beat (e.g., during a sound effect or a silent section), but the lack of a global beat. This is true for 6 tracks, i.e., 0.9% of the dataset. All 6 of them are among the 39 tracks with very high JSD and either have no beat, are very difficult to tap, or contain large sections without a beat.

3.5 Ambiguity by Genre

We wondered whether we can confirm findings by McKinney and Moelants [9] that the amount of tempo ambiguity depends on the genre or musical style. To ensure meaningful results, we considered only the 5 most often occurring genres in the dataset with 54 or more tracks each. We found that the genres techno and trance do not seem to be very affected by ambiguity: more than 65% of their segments are annotated with just one peak. In contrast to that, fewer than 38% of all segments in the genres drum-and-bass, dubstep, and electronica are annotated with just one peak (Figure 7 top). A similar picture presents itself when looking at the average segment ambiguity A(Tseg). As shown in Table 2, it is 0.12 for techno segments and thus much lower than the overall average of 0.25. The same is true for trance (0.17). Contrary to that, the ambiguity values for drum-and-bass (0.37), electronica (0.36), and dubstep (0.35) are all well above the average. We found similar relations for the peak counts on the track level (Figure 7 bottom) and the average track ambiguity A(Ttrack) (Table 2). This strongly supports McKinney and Moelants' finding that tapped tempo ambiguity is genre-dependent. Perhaps it is even an inherent genre property.

Table 2: Average ambiguity for the top 5 genres.

    Genre            A(Tseg)    A(Ttrack)
    all              0.25       0.26
    techno           0.12       0.10
    trance           0.17       0.12
    drum-and-bass    0.37       0.39
    electronica      0.36       0.38
    dubstep          0.35       0.43

Figure 7: Percentage of segments (top) and tracks (bottom) with a given number of peaks by genre. Drum-and-bass, dubstep, and electronica suffer much more from tapped tempo ambiguity than techno and trance.

4. EVALUATION

The tempo histograms for tracks can easily be turned into single tempo per track or two tempi plus salience labels. This provides us with the opportunity to evaluate the original ground-truth for the GiantSteps Tempo dataset by treating it like an algorithm. Since the original annotations are single tempo per track only, we are using Accuracy1 and Accuracy2 as metrics. To obtain one tempo value per track from a distribution, we use just the tempo value with the highest salience. The three tracks without a beat have been removed. We refer to these new annotations as GSNew and to the original ones as GSOrig.
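
A sketch of how such labels and the two metrics could be computed; the function names and the assumption that the track histogram uses 1 BPM bins starting at min_bpm are ours, while the 4% tolerance and the octave factors follow the definitions in Section 1.

    import numpy as np

    def single_tempo(track_hist, min_bpm=1):
        """Reduce a track's tempo distribution to its most salient tempo in BPM."""
        return min_bpm + int(np.argmax(track_hist))

    def accuracy1(estimates, references, tol=0.04):
        """Fraction of estimates within ±4% of the reference tempo."""
        hits = [abs(e - r) <= tol * r for e, r in zip(estimates, references)]
        return sum(hits) / len(hits)

    def accuracy2(estimates, references, tol=0.04, factors=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
        """Like Accuracy1, but estimates off by a factor of 2, 3, 1/2, or 1/3 also count."""
        hits = [any(abs(e - f * r) <= tol * f * r for f in factors)
                for e, r in zip(estimates, references)]
        return sum(hits) / len(hits)
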

Figure 8 shows the accuracy results for the comparison of GSOrig with GSNew and reveals a large discrepancy between the two. Only 81.5% of the labels match when using Accuracy1, and only 91.1% match when using Accuracy2.

Figure 8: Accuracies measured when comparing GSNew with GSOrig (Accuracy1: 81.5%, Accuracy2: 91.1%).

Coming back to the original motivation for this paper, the poor performance of tempo estimation systems for GiantSteps Tempo, we evaluated the two state-of-the-art algorithms schreiber [12] and böck [1] with both the old and the new annotations. The algorithms were chosen for their proven performance and conceptual dissimilarity. While schreiber implements a conventional onset detection approach followed by an error correction procedure, böck's core consists of a bidirectional long short-term memory (BLSTM) recurrent neural network. Despite their conceptual differences, both algorithms reach considerably higher accuracy values when tested against GSNew (Figure 9). Accuracy1 increases for böck by 5.9 pp (58.9% to 64.8%) and for schreiber by 7.1 pp (63.1% to 70.2%). Accuracy2 shows similar increases: 7.6 pp (86.4% to 94.0%) for böck and 6.5 pp (88.7% to 95.2%) for schreiber. Remarkably, both böck and schreiber reach higher Accuracy2 values for GSNew than the original annotations GSOrig reached when compared with GSNew. The increased results for GSNew are much more in line with values reported for other tempo datasets. We therefore believe that this increase and the discrepancy between GSOrig and GSNew are hardly coincidences, but strong indicators for incorrect annotations in GSOrig.

Figure 9: Accuracies for the algorithms böck and schreiber measured against both GSOrig and GSNew.

5. DISCUSSION AND CONCLUSIONS

In this paper we described a crowdsourced experiment for tempo estimation. We collected 18,684 tapped annotations from 266 participants for the electronic dance music (EDM) tracks contained in the GiantSteps Tempo dataset. To analyze the data, we used multiple metrics and found that half of the annotated segments and more than half of the tracks exhibit some degree of tempo ambiguity, which may stem either from annotator disagreement or from intra-track tempo instability. This refutes the assumption that it is always easy to determine a single global tempo for EDM. We were able to identify tracks with no tempo at all, no-beat sections, or tempo changes, which raises questions about the suitability of parts of the dataset for the global tempo estimation task. Furthermore, we provided additional evidence for genre-dependent tempo ambiguity. Based on the user-submitted data we derived the new annotations GSNew. The relatively low agreement with the original annotations GSOrig indicates that one of the two ground-truths contains incorrect annotations for up to 8.9% of the tracks (ignoring octave errors). We re-evaluated two recent tempo estimation algorithms against both ground-truths and measured considerably higher accuracies when testing against GSNew. This leads us to the following conclusions: GSOrig contains incorrectly annotated tracks as well as tracks that are not suitable for the global tempo estimation task. The accuracy of state-of-the-art tempo estimation systems is considerably higher than previously thought. And last but not least, as a community, we have to get better at evaluating tempo algorithms, in the sense that we need verified, high-quality datasets that represent reality with tempo distributions instead of single-value annotations.
If we cannot accurately measure progress, we have no way of knowing when the task is done.

Datasets

All data is available at http://www.tagtraum.com/tempo_estimation.html.

Acknowledgments

The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS. Meinard Müller is supported by the German Research Foundation (DFG MU 2686/11-1).

6. REFERENCES

[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

[2] Mark J. Butler. Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music. Profiles in Popular Music. Indiana University Press, 2006.

[3] Olmo Cornelis, Joren Six, Andre Holzapfel, and Marc Leman. Evaluation and recommendation of pulse and tempo annotation in ethnic music. Journal of New Music Research, 42(2):131–149, 2013.

[4] Fabien Gouyon, Anssi P. Klapuri, Simon Dixon, Miguel Alonso, George Tzanetakis, Christian Uhle, and Pedro Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, 2006.

[5] Andre Holzapfel, Matthew E. P. Davies, José R. Zapata, João Lobato Oliveira, and Fabien Gouyon. Selective sampling for beat tracking evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 20(9):2539–2548, 2012.

[6] Peter Knees, Ángel Faraldo, Perfecto Herrera, Richard Vogl, Sebastian Böck, Florian Hörschläger, and Mickael Le Goff. Two data sets for tempo estimation and key detection in electronic dance music annotated from user corrections. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 364–370, Málaga, Spain, October 2015.

[7] Michaelangelo Matos. Electronic dance music. Encyclopædia Britannica, https://www.britannica.com/art/electronic-dance-music, last checked 6/9/2018, August 2015.

[8] Martin F. McKinney and Dirk Moelants. Deviations from the resonance theory of tempo induction. In Proceedings of the Conference on Interdisciplinary Musicology, Graz, Austria, 2004.

[9] Martin F. McKinney and Dirk Moelants. Ambiguity in tempo perception: What draws listeners to different metrical levels? Music Perception: An Interdisciplinary Journal, 24(2):155–166, 2006.

[10] Geoffroy Peeters and Joachim Flocon-Cholet. Perceptual tempo estimation using GMM-regression. In Proceedings of the Second International ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies (MIRUM), pages 45–50, New York, NY, USA, 2012. ACM.

[11] Bruno H. Repp. Sensorimotor synchronization: A review of the tapping literature. Psychonomic Bulletin & Review, 12(6):969–992, 2005.

[12] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

[13] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.