
A CROWDSOURCED EXPERIMENT FOR TEMPO ESTIMATION OF ELECTRONIC DANCE MUSIC

Hendrik Schreiber, tagtraum industries incorporated, [email protected]
Meinard Müller, International Audio Laboratories Erlangen, [email protected]

ABSTRACT

Relative to other datasets, state-of-the-art tempo estimation algorithms perform poorly on the GiantSteps Tempo dataset for electronic dance music (EDM). In order to investigate why, we conducted a large-scale, crowdsourced experiment involving 266 participants from two distinct groups. The quality of the collected data was evaluated with regard to the participants' input devices and background. In the data itself we observed significant tempo ambiguities, which we attribute to annotator subjectivity and tempo instability. As a further contribution, we then constructed new annotations consisting of tempo distributions for each track. Using these annotations, we re-evaluated two recent state-of-the-art tempo estimation systems and achieved significantly improved results. The main conclusions of this investigation are that current tempo estimation systems perform better than previously thought and that evaluation quality needs to be improved. The new crowdsourced annotations will be released for evaluation purposes.

© Hendrik Schreiber, Meinard Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Hendrik Schreiber, Meinard Müller, "A Crowdsourced Experiment for Tempo Estimation of Electronic Dance Music", 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018.

1. INTRODUCTION

Estimation of a music piece's global tempo is a classic music information retrieval (MIR) task. It is often defined as estimating the frequency with which humans tap along to the beat. A necessary precondition for successful global tempo estimation is the existence of a stable tempo, as it often occurs in rock, pop, or dance music. To evaluate a tempo estimation system one needs the system itself, a dataset with suitable tempo annotations, and one or more metrics. One such dataset, named GiantSteps Tempo, was released by Knees et al. in 2015 [6]. It was created by scraping a forum that let listeners discuss Beatport tracks (http://www.beatport.com/, an online music store) with wrong tempo labels. Scraping was done via a script, and 15% of the labels were manually verified. All 664 tracks in the dataset belong to the umbrella genre electronic dance music (EDM) with its subgenres trance, drum-and-bass, techno, etc.

Since its release, several academic and commercial tempo estimation systems have been tested against the dataset (e.g. [12]). As is common for datasets annotated with only a single tempo per track, the two metrics Accuracy1 and Accuracy2 were used. Accuracy1 is defined as the fraction of correct estimates while allowing a tolerance of 4%. Accuracy2 additionally allows estimates to be wrong by a factor of 2, 3, 1/2 or 1/3 (so-called octave errors). The highest results reported for the GiantSteps dataset are 77.0% Accuracy1 by the application NI Traktor Pro 2 (with octave bias 88–175, https://www.native-instruments.com/en/products/traktor/dj-software/traktor-pro-2/) and 90.2% Accuracy2 by CrossDJ (with octave bias 75–150, http://www.mixvibes.com/cross-dj-software-mac-pc/); more benchmark results are available at http://www.cp.jku.at/datasets/giantsteps/. These results are surprisingly low: the highest reported Accuracy2 values for other commonly used datasets like ACM Mirum [10], Ballroom [4], and GTzan [13] are greater than 95% [1]. Since EDM is associated with repeating rhythmic patterns and steady tempi [2, 7], it should be comparatively easy to estimate the tempo for this genre. We hypothesize that the relatively low accuracy values have multiple possible reasons. Since the annotations were scraped off a forum for disputed tempo labels, the dataset may contain many tracks that are especially difficult for humans to annotate. And even if not difficult for humans to annotate, it is conceivable that the tracks are particularly hard for algorithms to analyze. Lastly, if neither humans nor algorithms fail, perhaps some of the scraped annotations are simply wrong.

In this paper we investigate why tempo estimation systems perform so poorly for GiantSteps Tempo. To this end, we conducted a large, crowdsourced experiment to collect new tempo data for GiantSteps Tempo from human participants. The experiment is described in detail in Section 2. The data is analyzed in Section 3 and used to create a new ground-truth. This ground-truth is then compared to the original ground-truth and used to evaluate two recent algorithms. The results are discussed in Section 4. Finally, in Section 5, we summarize our findings and draw conclusions.

2. EXPERIMENT

In order to generate a new ground-truth for the GiantSteps Tempo dataset, we set up a web-based experiment in which we asked participants to tap along to audio excerpts using their keyboard or touchscreen. The user interface for this experiment is depicted in Figure 1. Since most tracks from the dataset are 2 min long and tapping for the full duration is difficult, we split each track into half-overlapping 30 s segments. Out of the 664 tracks we created 4,640 such segments (in most cases 7 per track). To measure tempo, it is not important for tap and beat to occur at the same time. In contrast to experiments for beat tracking, phase shifts, input method latencies, or anticipatory early tapping, known as negative mean asynchrony (NMA), are irrelevant, as long as they stay constant (see [11] for an overview of tapping and [3, 5] for beat tracking).

Figure 1: Illustration of the web-based interface used in our experimental user study.

Therefore participants were asked to tap along to randomly chosen segments as steadily as possible, over the entire duration of 30 s, without skipping beats. To encourage steady tapping, the user interface gave immediate feedback in the form of the mean tempo µ in BPM, the median tempo med in BPM, and the standard deviation of the inter-tap intervals (ITI) σ in milliseconds, as well as textual messages and emojis (Figure 1). When calculating the standard deviation, the first three taps were ignored, as those are typically of low quality (users have to find their way into the groove). When the standard deviation σ stayed very low, smilies, thumbs up, and textual praise were shown. When σ climbed above a certain threshold, the user was shown sad faces and messages like "Did you miss a beat? Try to tap more steadily." To prevent low quality submissions, users were only allowed to proceed to the next track once four conditions were met:

1. 20 or more taps
2. Taps cover at least 15 s
3. ITI standard deviation: σ < 50 ms
4. Median tempo: 50 ≤ med ≤ 210 BPM
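
For illustration, the following minimal Python sketch checks a single tap submission against these four conditions. It assumes taps are given as timestamps in seconds; the function and parameter names are illustrative and not taken from the actual experiment code.

    import statistics

    def submission_ok(taps, min_taps=20, min_span=15.0, max_iti_std=0.050,
                      min_bpm=50.0, max_bpm=210.0):
        """Check one tap submission (timestamps in seconds) against the four conditions."""
        if len(taps) < min_taps:                           # 1) 20 or more taps
            return False
        taps = sorted(taps)
        if taps[-1] - taps[0] < min_span:                  # 2) taps cover at least 15 s
            return False
        # inter-tap intervals; the first three taps are ignored, as in the feedback display
        itis = [b - a for a, b in zip(taps[3:], taps[4:])]
        if statistics.stdev(itis) >= max_iti_std:          # 3) ITI standard deviation below 50 ms
            return False
        med_bpm = 60.0 / statistics.median(itis)           # 4) median tempo between 50 and 210 BPM
        return min_bpm <= med_bpm <= max_bpm
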

While the first three conditions were not explicitly communicated, the instructions made participants aware that the target tempo lies between 50 and 210 BPM. Once all four conditions were met, a large red bar turned green and the Next button became enabled. For situations in which the user was not able to fulfill all conditions, the user interface offered a No Beat checkbox. Once checked, it allowed users to bypass the quality check and proceed to the next segment. It must be noted that there is a tradeoff between encouraging participants to tap well (i.e. steadily) and a bias towards stable tempi. We opted for this design for two reasons: 1) tempo in EDM is usually very steady [2, 7]; 2) the bias is limited to individual tapping sessions at the segment level, i.e. we can still detect tempo stability problems on the track level by aggregating segment level annotations.

Participants were recruited from two distinct groups: academics and people interested in the consumer-level music library management system beaTunes (https://www.beatunes.com/). We refer to the former group as academics and the latter as beaTunes. While members of the academics group were asked to help in this experiment via relevant mailing lists without offering any benefits, members of the beaTunes group were incentivized by promising a reward license for the beaTunes software, if they submitted 110 valid annotations. While it was not explicitly specified what a "valid annotation" is, we attempted to steer people in the right direction using the instructions and the instant feedback mechanisms described above (Figure 1).

3. DATA ANALYSIS

Over a period of 2 1/2 months, 266 persons participated in the experiment, 217 (81.6%) belonging to beaTunes and 49 (18.4%) to academics. Together they submitted 18,684 segment annotations (avg = 4.03 per segment). We made sure that all segments were annotated at least once. Since some segments are harder to annotate than others, we monitored submissions and ensured that segments annotated by participants as very different from the original ground-truth (exceeding a tolerance of 4%) were presented to participants more often than others. The vast majority of annotations was submitted by the beaTunes group (95.1%). Overall, 7.5% of all submissions were marked with No Beat. With 7.6%, the No Beat rate was slightly higher among members of the beaTunes group. Members of academics checked No Beat for only 5.2% of their submissions. Since the experiment was run in the participant's web browser, the browser's user agent for each submission was logged by the web server. Among other information, the user agent contains the name of the participant's operating system. 17,012 (91.1%) of the submissions were sent from desktop operating systems that are typically connected to a physical keyboard; 1,672 (8.9%) were from mobile operating systems that are usually associated with touchscreens. Participants interested in a reward license also had to enter their name and email address. Both datapoints have been removed from the collected data to ensure anonymity.
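
As an illustration of the device split used above, a rough classification of submissions by operating system might look as follows; the token lists and the function name are hypothetical and not part of the actual experiment code.

    # hypothetical OS tokens; real user-agent parsing is messier
    MOBILE_OS_TOKENS = ("iPhone", "iPad", "Android")
    DESKTOP_OS_TOKENS = ("Windows NT", "Macintosh", "X11", "CrOS")

    def input_device_class(user_agent):
        """Rough split into touchscreen (mobile OS) vs. keyboard (desktop OS) submissions."""
        if any(token in user_agent for token in MOBILE_OS_TOKENS):
            return "touchscreen"
        if any(token in user_agent for token in DESKTOP_OS_TOKENS):
            return "keyboard"
        return "unknown"
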

We analyzed the submitted data to find out whether we can find quality differences between submissions from different participant groups (Section 3.1). Section 3.2 introduces metrics for ambiguity and stability. In Section 3.3, we measure to which extent participants agree on one or multiple tempi for the same segment. Then, in Section 3.4, we take a look at segment annotations aggregated on the track level. Finally, in Section 3.5, we investigate whether tempo ambiguity is a genre-dependent phenomenon.

3.1 Submission Quality

We wondered how steadily participants tapped and whether some groups of participants tapped more steadily than others. Specifically, are the beaTunes submissions as good as the academics submissions? We can use the coefficient of variation cv = σ/µ of each submission's ITIs as a normalized indicator for how steadily a participant tapped. To remove tapping outliers within a segment, we sort each submission's ITIs and only keep the central 10 before calculating the cv. This has the effect of reducing cv for all submissions. The average cv for all submissions is cv = 0.0089. Assuming a normal distribution, this means that on average 99.7% of all central 10 ITIs lie within ±2.67% (≡ 3σ) of their submission's mean value. Using cv as a measure for the submission quality of different dataset splits, we found that members of academics tapped significantly more steadily (cv = 0.0074) than members of beaTunes (cv = 0.0090) (Table 1). To test for significance we used Welch's t-test.
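
A minimal sketch of this quality measure, assuming each submission's taps are timestamps in seconds; the helper names are ours, and the group comparison simply delegates to SciPy's Welch's t-test.

    import numpy as np
    from scipy import stats

    def central_itis(taps, keep=10):
        """Sort a submission's inter-tap intervals by value and keep the central `keep` of them."""
        itis = np.sort(np.diff(np.sort(taps)))
        drop = max(0, (len(itis) - keep) // 2)
        return itis[drop:drop + keep]

    def coefficient_of_variation(taps):
        """cv = sigma / mu of the central ITIs, a unit-free measure of tapping steadiness."""
        itis = central_itis(taps)
        return float(np.std(itis) / np.mean(itis))

    def compare_groups(cvs_a, cvs_b):
        """Welch's t-test (unequal variances) between two lists of cv values."""
        return stats.ttest_ind(cvs_a, cvs_b, equal_var=False)
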
Also, submissions from desktop operating systems that are typically installed on devices connected to a physical keyboard (i.e., no touchscreen) are of significantly higher quality (cv = 0.0088) than submissions from devices using iOS or Android as operating system (cv = 0.0095). Despite the differences, we found that even the ITIs from the group with the highest cv, i.e., beaTunes without keyboard, still lie within only ±2.97% (≡ 3σ) of their mean value 99.7% of the time, again assuming a normal distribution. This is well below the tolerance of 4% allowed by Accuracy1.

We conclude that the data submitted by academics with keyboard is of the highest quality with regard to tempo stability, but find that the data submitted by members of beaTunes without keyboard is still acceptable, because the difference in cv is not very large. This may be a direct result of the experiment's design, which did not permit participants to submit highly irregular taps.

Table 1: Average coefficients of variation cv for dataset splits: academics or not, keyboard or not, and keyboard or not within either beaTunes or academics. The low p-values indicate a significant difference between the dataset splits.

    Dataset split          cv (+)    cv (−)    p-value
    ±academics             0.0074    0.0090    3.11e−29
    ±keyboard              0.0088    0.0095    9.74e−7
    beaTunes ±keyboard     0.0089    0.0099    6.71e−10
    academics ±keyboard    0.0074    0.0073    8.68e−1

3.2 Tempo Distribution Metrics

How steadily participants tapped does not say anything about whether they tapped along to the true tempo. But since the purpose of the experiment is to create a new ground-truth, we cannot easily verify submissions for correctness. What we can do, though, is to measure annotator (dis)agreement both for a segment and for all segments belonging to the same track. To this end, we define some metrics based on tapped tempo distributions. To create such a tapped tempo distribution for a segment, we combine the 10 central ITIs from each of its submissions in a histogram T with a bin width of 1 BPM and then normalize so that the T(x_i) sum to 1, with n as the number of bins and x_i as the corresponding BPM values. For T we define local peaks as the highest non-zero T(x_i) for all intervals [x_i − 5, x_i + 5]. This may include very small peaks. We interpret the BPM values x_i of the histogram's local peaks as the perceptually strongest tempi and their heights as equivalent to their saliences. Per-track tempo distributions are created simply by averaging the 7 segment histograms belonging to a given track. For an example, please see Figure 2.

Figure 2: Tempo salience distribution for segment 6 of the track 'Neoteric D&B Mix' by Polex (Beatport id 4397469). Measured values are: P(Ttrack) = 4, P(Tseg6) = 2, A(Ttrack) = 0.30, A(Tseg6) = 0.40, and JSD = 0.24.

As a first, very simple indicator for annotator disagreement, we define P(T) as the number of histogram peaks we find in a given tempo distribution T. A high peak count for a single segment P(Tseg) indicates annotator disagreement for that segment. This is not necessarily true for the peak count for a track P(Ttrack), since it may also be a sign of tempo instability, i.e., tempo changes or no-beat sections. Because the peak count P does not say anything about the peaks' height or salience, it is a relatively crude measure. Therefore we define as a second metric the salience ratio between the most salient and the second most salient peak as a measure for ambiguity. More formally, if s_1 is the salience of the highest peak and s_2 the salience of the second highest peak, then the ambiguity A(T) is defined as:

    A(T) := \begin{cases}
        1 & \text{for } P(T) = 0 \\
        0 & \text{for } P(T) = 1 \\
        s_2 / s_1 & \text{for } P(T) > 1
    \end{cases}                                        (1)

A value close to 0 indicates low and a value close to 1 high ambiguity. This definition is inspired by McKinney et al.'s [8] approach to ambiguity, but not identical. Just like P, we can use A for both segment and track tempo distributions. Again, for tracks we cannot be sure of the ambiguity's source.
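
The following sketch shows how such a segment histogram, its local peaks, and the ambiguity A(T) could be computed. The bin range and function names are our own assumptions; the 1 BPM bin width, the ±5 BPM peak window, and the three cases of A(T) follow the definitions above.

    import numpy as np

    def tempo_histogram(itis_per_submission, min_bpm=1, max_bpm=250):
        """Normalized tempo histogram T (1 BPM bins) from the central ITIs (in seconds)
        of all submissions for one segment."""
        bpms = [60.0 / iti for itis in itis_per_submission for iti in itis]
        edges = np.arange(min_bpm, max_bpm + 2) - 0.5
        hist, _ = np.histogram(bpms, bins=edges)
        return hist / hist.sum()

    def local_peaks(T):
        """Indices i where T[i] is non-zero and maximal within the window [i-5, i+5]."""
        peaks = []
        for i, v in enumerate(T):
            if v > 0 and v == T[max(0, i - 5):i + 6].max():
                peaks.append(i)
        return peaks

    def ambiguity(T):
        """A(T): 1 if no peak, 0 if exactly one peak, otherwise the salience ratio s2/s1."""
        saliences = sorted((T[i] for i in local_peaks(T)), reverse=True)
        if not saliences:
            return 1.0
        if len(saliences) == 1:
            return 0.0
        return saliences[1] / saliences[0]

A track-level distribution is then simply the mean of the 7 segment histograms, and the same peak and ambiguity measures can be applied to it.
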

Finally, we introduce a third metric that focuses more on tempo instability within tracks. Obvious indicators for instabilities are large differences between the tempo distributions of segments belonging to one track. Since we create tapped tempo distributions for each segment in a way that lets us interpret them as probability distributions, we can use the Jensen-Shannon divergence (JSD) for this purpose, which is based on the Shannon entropy H. With the JSD we measure the difference between the entropy of the tempo distribution for the whole track and the average of the individual segment tempo distributions' entropies.

    H(T) := -\sum_{i=1}^{n} T(x_i) \log_b T(x_i)        (2)

    JSD(T_1, \ldots, T_m) := H\!\left( \frac{1}{m} \sum_{j=1}^{m} T_j \right) - \frac{1}{m} \sum_{j=1}^{m} H(T_j)        (3)

To allow an easy interpretation of JSD values, we choose an unusual base for the entropy's logarithm. By setting b = n in (2), we ensure that 0 ≤ JSD ≤ 1. This means that a JSD value near 0 indicates a small difference between the tempo distributions for a track's segments. Correspondingly, a JSD value closer to 1 means that the tempo distributions of a track's segments are very different. To avoid detecting small tempo changes due to annotator disagreements, we convert the segment tempo distributions T to a bin width of 10 BPM before calculating the JSD.
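
A small sketch of this stability measure, assuming the segment histograms come from tempo_histogram above (1 BPM bins, all of equal length); the re-binning strategy and names are ours, while the base-n logarithm and the difference of entropies follow (2) and (3).

    import numpy as np

    def entropy(T, base):
        """Shannon entropy with logarithm base `base`; empty bins contribute nothing."""
        p = np.asarray(T, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p) / np.log(base)))

    def track_jsd(segment_histograms, coarse=10):
        """Jensen-Shannon divergence between a track's segment tempo distributions,
        after re-binning the 1 BPM histograms to `coarse` BPM wide bins."""
        rebinned = []
        for T in segment_histograms:
            T = np.asarray(T, dtype=float)
            T = np.pad(T, (0, (-len(T)) % coarse))       # pad so the length is divisible
            rebinned.append(T.reshape(-1, coarse).sum(axis=1))
        n = len(rebinned[0])                              # number of bins, used as log base
        mixture = np.mean(rebinned, axis=0)               # tempo distribution of the whole track
        return entropy(mixture, n) - np.mean([entropy(T, n) for T in rebinned])
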

3.3 Segment Annotator Agreement

How much do participants agree on a tempo for a given segment? Recall that we have 4,640 segments (and 18,684 annotations for these segments) coming from 664 tracks. As depicted in Figure 3 (top), the submissions for more than half of the segments (2,500 or 53.9%) have just one peak, i.e., P(Tseg) = 1. For 1,514 or 32.6% of all segments we were able to find two peaks, indicating some ambiguity. For 432 segments (9.3%) we found 3 peaks, and for 184 segments (4.0%) 4 peaks or more. 10 segments have no peak at all, because they have been marked as No Beat in all their submissions. When interpreting these numbers one has to keep in mind that some segments have been annotated by very few participants (Figure 3 bottom). To give an example, while the segments annotated with one peak are based on 3.64 submissions on average, the segments annotated with 6 peaks are based on 9.42 submissions per segment. This reflects the fact that we presented difficult segments to participants more often, but could also be caused by increased variability introduced by a higher number of submissions. Because submissions marked as No Beat do not show up in this overview unless all submissions for a segment were No Beat, we counted the segments for which a majority of submissions were marked with No Beat. That was the case for 118 segments (2.5%).

Figure 3: (top) Segments per peak count. (bottom) Average number of submissions per segment by peak count.

As mentioned in Section 3.2, the peak count does not say anything about the peaks' height or salience and is therefore a relatively crude measure. We found that the average ambiguity for all segments is A(Tseg) = 0.25 (with standard deviation σ = 0.32), meaning that on average the highest peak is 4 times more salient than the second highest peak. In other words, we can often observe a peak that is much more salient than others. At the same time, there may also be a second peak with considerable salience.

3.4 Track Annotator Agreement

Just like for the segments, we looked at the number of tracks per peak count. We found only 81 tracks (12.2%) with one peak and 582 tracks (87.8%) with two or more peaks (Figure 4 top). The largest group among the multi-peak tracks are tracks with two peaks (178 or 26.8%). These numbers are much more reliable than the segment peak counts as they are based on at least 25 submissions per track (Figure 4 bottom). Compared to the segments' peak counts we see a larger proportion of tracks with more than one peak. But this does not necessarily mean that the ambiguity A is much higher than for the segments, because peak counts do not account for salience and even small local peaks are counted. In fact, we measured an average ambiguity of A(Ttrack) = 0.26 (with standard deviation σ = 0.27), almost the same average as for the segments. Therefore we attribute the shift towards more peaks to the much higher number of submissions per item and to possible tempo instabilities in the tracks themselves. By tempo instability we mean, for example, a tempo change in the middle of the track, a quiet section, or no beat at all. Any of these cases inherently leads to more peaks. A typical example for a track with a tempo change is shown in Figure 5.

Figure 4: (top) Tracks per peak count. (bottom) Average number of submissions per track by peak count.

In an attempt to quantify tempo instabilities in the submissions we calculated the JSD introduced in Section 3.2. The histogram in Figure 6 shows the distribution of tracks per JSD interval with a bin width of 0.05. The average divergence for the whole dataset is µJSD = 0.15, the standard deviation is σJSD = 0.11. To test whether a high JSD correlates with tempo instabilities, we considered all tracks with JSD > µJSD + 2σJSD = 0.375, resulting in 39 tracks. Performing an informal listening test on these tracks revealed that 3 had no beat, 10 contained a tempo change (e.g. Figure 5), 7 had sections that felt half as fast as other sections (metrical ambiguity), 8 contained larger sections with no discernible beat, 9 were difficult to tap, and 2 had a stable tempo throughout the whole track. From this result one may conclude that a high JSD is connected to tempo instabilities, but it may also just indicate that a track is difficult to tap. Nevertheless, using the JSD helped us find tracks in the GiantSteps Tempo dataset that exhibit tempo stability issues.

Figure 5: Tempo salience distributions for segments of the track 'Rude Boy feat. Omar LinX Union Vocal Mix' (Beatport id 1728723). The track's tempo changes in segment 4, leading to four distinct peaks. With JSD = 0.44 its Jensen-Shannon divergence is high.

Figure 6: Distribution of tracks in the dataset per JSD interval with a bin width of 0.05. The lines show µJSD and µJSD + 2σJSD (red).

Since 2.5% of the segments were annotated most often with No Beat, we wondered whether any tracks have a majority of segments that have predominantly been annotated with No Beat, hinting at the absence of not just a local beat (e.g., during a sound effect or a silent section), but the lack of a global beat. This is true for 6 tracks, i.e., 0.9% of the dataset. All 6 of them are among the 39 tracks with very high JSD and either have no beat, are very difficult to tap, or contain large sections without a beat.

3.5 Ambiguity by Genre

We wondered whether we can confirm findings by McKinney and Moelants [9] that the amount of tempo ambiguity depends on the genre or musical style. To ensure meaningful results, we considered only the 5 most often occurring genres in the dataset with 54 or more tracks each. We found that the genres techno and trance do not seem to be very affected by ambiguity: more than 65% of their segments are annotated with just one peak. In contrast to that, fewer than 38% of all segments in the genres drum-and-bass, dubstep, and electronica are annotated with just one peak (Figure 7 top). A similar picture presents itself when looking at the average segment ambiguity A(Tseg). As shown in Table 2, it is 0.12 for techno segments and thus much lower than the overall average of 0.25. The same is true for trance (0.17). Contrary to that, the ambiguity values for drum-and-bass (0.37), electronica (0.36), and dubstep (0.35) are all well above the average. We found similar relations for the peak counts on the track level (Figure 7 bottom) and the average track ambiguity A(Ttrack) (Table 2). This strongly supports McKinney and Moelants' finding that tapped tempo ambiguity is genre-dependent. Perhaps it is even an inherent genre property.

Table 2: Average ambiguity for the top 5 genres.

    Genre            A(Tseg)    A(Ttrack)
    all              0.25       0.26
    techno           0.12       0.10
    trance           0.17       0.12
    drum-and-bass    0.37       0.39
    electronica      0.36       0.38
    dubstep          0.35       0.43

Figure 7: Percentage of segments (top) and tracks (bottom) with a given number of peaks by genre. Drum-and-bass, dubstep, and electronica suffer much more from tapped tempo ambiguity than techno and trance.

4. EVALUATION

The tempo histograms for tracks can easily be turned into single tempo per track or two tempi plus salience labels. This provides us with the opportunity to evaluate the original ground-truth for the GiantSteps Tempo dataset by treating it like an algorithm. Since the original annotations are single tempo per track only, we are using Accuracy1 and Accuracy2 as metrics. To obtain one tempo value per track from a distribution, we use just the tempo value with the highest salience. The three tracks without a beat have been removed. We refer to these new annotations as GSNew and to the original ones as GSOrig.
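
A sketch of how such labels and the two metrics could be computed; the function names and the assumption that the track histogram uses 1 BPM bins starting at min_bpm are ours, while the 4% tolerance and the octave factors follow the definitions in Section 1.

    import numpy as np

    def single_tempo(track_hist, min_bpm=1):
        """Reduce a track's tempo distribution to its most salient tempo in BPM."""
        return min_bpm + int(np.argmax(track_hist))

    def accuracy1(estimates, references, tol=0.04):
        """Fraction of estimates within ±4% of the reference tempo."""
        hits = [abs(e - r) <= tol * r for e, r in zip(estimates, references)]
        return sum(hits) / len(hits)

    def accuracy2(estimates, references, tol=0.04, factors=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
        """Like Accuracy1, but estimates off by a factor of 2, 3, 1/2, or 1/3 also count."""
        hits = [any(abs(e - f * r) <= tol * f * r for f in factors)
                for e, r in zip(estimates, references)]
        return sum(hits) / len(hits)
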

Figure 8 shows the accuracy results for the comparison of GSOrig with GSNew and reveals a large discrepancy between the two. Only 81.5% of the labels match when using Accuracy1, and only 91.1% match when using Accuracy2.

Figure 8: Accuracies measured when comparing GSNew with GSOrig (Accuracy1: 81.5%, Accuracy2: 91.1%).

Coming back to the original motivation for this paper, the poor performance of tempo estimation systems for GiantSteps Tempo, we evaluated the two state-of-the-art algorithms schreiber [12] and böck [1] with both the old and the new annotations. The algorithms were chosen for their proven performance and conceptual dissimilarity. While schreiber implements a conventional onset detection approach followed by an error correction procedure, böck's core consists of a bidirectional long short-term memory (BLSTM) recurrent neural network. Despite their conceptual differences, both algorithms reach considerably higher accuracy values when tested against GSNew (Figure 9). Accuracy1 increases for böck by 5.9 pp (58.9% to 64.8%) and for schreiber by 7.1 pp (63.1% to 70.2%). Accuracy2 shows similar increases: 7.6 pp (86.4% to 94.0%) for böck and 6.5 pp (88.7% to 95.2%) for schreiber. Remarkably, both böck and schreiber reach higher Accuracy2 values for GSNew than the original annotations GSOrig reached when compared with GSNew. The increased results for GSNew are much more in line with values reported for other tempo datasets. We therefore believe that this increase and the discrepancy between GSOrig and GSNew are hardly coincidences, but strong indicators for incorrect annotations in GSOrig.

Figure 9: Accuracies for the algorithms böck and schreiber measured against both GSOrig and GSNew.

5. DISCUSSION AND CONCLUSIONS

In this paper we described a crowdsourced experiment for tempo estimation. We collected 18,684 tapped annotations from 266 participants for the electronic dance music (EDM) tracks contained in the GiantSteps Tempo dataset. To analyze the data, we used multiple metrics and found that half of the annotated segments and more than half of the tracks exhibit some degree of tempo ambiguity, which may stem either from annotator disagreement or from intra-track tempo instability. This refutes the assumption that it is always easy to determine a single global tempo for EDM. We were able to identify tracks with no tempo at all, no-beat sections, or tempo changes, which raises questions about the suitability of parts of the dataset for the global tempo estimation task. Furthermore, we provided additional evidence for genre-dependent tempo ambiguity. Based on the user-submitted data we derived the new annotations GSNew. The relatively low agreement with the original annotations GSOrig indicates that one of the two ground-truths contains incorrect annotations for up to 8.9% of the tracks (ignoring octave errors). We re-evaluated two recent tempo estimation algorithms against both ground-truths and measured considerably higher accuracies when testing against GSNew. This leads us to the following conclusions: GSOrig contains incorrectly annotated tracks as well as tracks that are not suitable for the global tempo estimation task. The accuracy of state-of-the-art tempo estimation systems is considerably higher than previously thought. And last but not least, as a community, we have to get better at evaluating tempo algorithms, in the sense that we need verified, high-quality datasets that represent reality with tempo distributions instead of single-value annotations.
If we cannot accurately measure progress, we have no way of knowing when the task is done.

Datasets

All data is available at http://www.tagtraum.com/tempo_estimation.html.

Acknowledgments

The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS. Meinard Müller is supported by the German Research Foundation (DFG MU 2686/11-1).

6. REFERENCES

[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 625–631, Málaga, Spain, 2015.

[2] Mark J. Butler. Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music. Profiles in Popular Music. Indiana University Press, 2006.

[3] Olmo Cornelis, Joren Six, Andre Holzapfel, and Marc Leman. Evaluation and recommendation of pulse and tempo annotation in ethnic music. Journal of New Music Research, 42(2):131–149, 2013.

[4] Fabien Gouyon, Anssi P. Klapuri, Simon Dixon, Miguel Alonso, George Tzanetakis, Christian Uhle, and Pedro Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, 2006.

[5] Andre Holzapfel, Matthew E. P. Davies, José R. Zapata, João Lobato Oliveira, and Fabien Gouyon. Selective sampling for beat tracking evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 20(9):2539–2548, 2012.

[6] Peter Knees, Ángel Faraldo, Perfecto Herrera, Richard Vogl, Sebastian Böck, Florian Hörschläger, and Mickael Le Goff. Two data sets for tempo estimation and key detection in electronic dance music annotated from user corrections. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 364–370, Málaga, Spain, October 2015.

[7] Michaelangelo Matos. Electronic dance music. Encyclopædia Britannica, https://www.britannica.com/art/electronic-dance-music, last checked 6/9/2018, August 2015.

[8] Martin F. McKinney and Dirk Moelants. Deviations from the resonance theory of tempo induction. In Proceedings of the Conference on Interdisciplinary Musicology, Graz, Austria, 2004.

[9] Martin F. McKinney and Dirk Moelants. Ambiguity in tempo perception: What draws listeners to different metrical levels? Music Perception: An Interdisciplinary Journal, 24(2):155–166, 2006.

[10] Geoffroy Peeters and Joachim Flocon-Cholet. Perceptual tempo estimation using GMM-regression. In Proceedings of the Second International ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies (MIRUM), pages 45–50, New York, NY, USA, 2012. ACM.

[11] Bruno H. Repp. Sensorimotor synchronization: A review of the tapping literature. Psychonomic Bulletin & Review, 12(6):969–992, 2005.

[12] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 235–242, Suzhou, China, October 2017.

[13] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.