
Addressing Tempo Estimation Octave Errors in Electronic Music by Incorporating Style Information Extracted from Wikipedia

Florian Hörschläger, Richard Vogl, Sebastian Böck, Peter Knees
Dept. of Computational Perception, Johannes Kepler University Linz, Austria
[email protected], [email protected], [email protected], [email protected]

ABSTRACT

A frequently occurring problem of state-of-the-art tempo estimation algorithms is that the predicted tempo for a piece of music is a whole-number multiple or fraction of the tempo as perceived by humans (tempo octave errors). While this is often simply caused by shortcomings of the algorithms used, in certain cases the problem can be attributed to the fact that the actual number of beats per minute (BPM) within a piece is not a listener's only criterion for considering it "fast" or "slow". Indeed, it can be argued that the perceived style of music sets an expectation of tempo and therefore influences its perception. In this paper, we address the issue of tempo octave errors in the context of electronic music styles. We propose to incorporate stylistic information by means of probability density functions that represent tempo expectations for the individual music styles. In combination with a style classifier, those probability density functions are used to choose the most probable BPM estimate for a sample. Our evaluation shows a considerable improvement of tempo estimation accuracy on the test dataset.

1. INTRODUCTION

A well-known problem of tempo estimation algorithms is the so-called tempo octave error, i.e., the tempo predicted by the algorithm is a whole-number multiple or fraction of the actual tempo as perceived by humans. Since these errors on the metrical level are not always clearly agreed on by humans, evaluations performed in the literature discount octave tempo errors by introducing secondary accuracy values (i.e., accuracy2) which also consider double, triple, half, and a third of the ground truth tempo as a correct prediction. On average, these values exceed the primary accuracy values based only on exact matches by about 20 percentage points, cf. [1,2]. While for tasks such as automatic tempo alignment for DJ mixes this can be a tolerable mistake, for making accurate predictions of the semantic category of musical "speed," i.e., whether a piece of music is considered "fast" or "slow," this discrepancy shows that there is still a need for improvement. To this end, several approaches have directly addressed the octave error problem, either by incorporating a-priori knowledge of tempo distributions [3], spectral and rhythmic similarity [1,2], source separation [4,5], or classification into speed categories based on audio [6,7] and user-generated meta-data [8,9].

The importance of stylistic context for the task of beat tracking has been stated before [10]. Similarly, in this work, we argue that there is a connection between the style of the music and its perceived tempo (which is strongly related to the perception of the beat). More precisely, we assume that human listeners take not only rhythmic information (onsets, percussive elements) but also stylistic cues (such as instrumentation or loudness) into account when estimating tempo.¹ Therefore, when including knowledge of the style of the music, tempo estimation accuracy should improve. Consider this simple example: if we knew that an audio sample is a drum and bass track, it would be unreasonable to estimate a tempo below 160 BPM. Yet our findings show that uninformed estimators can produce such an output. We therefore propose to incorporate stylistic information into the tempo estimation process by means of a music style classifier trained on audio data. In addition to predicting multiple hypotheses on the tempo of a piece of music using a state-of-the-art tempo estimation algorithm, we determine its style using the classifier and choose the tempo hypothesis that is most likely in the context of the determined style. For this, we utilize probability density functions (PDFs) constructed from data extracted from Wikipedia articles (i.e., BPM ranges or values as well as tempo relationships).

The remainder of this paper is organized as follows. Section 2 covers a representative selection of related work. In section 3 the proposed system is presented. This includes our strategy to extract style information from Wikipedia, in particular information on style-specific tempo ranges. In section 4 we evaluate our approach using a new data set for tempo estimation in electronic music. The paper concludes with a short discussion and ideas for future work in section 5.

¹ This assumption is supported by psychological evidence that identification of pieces as well as recognition of styles and emotions can be performed by humans within 400 msecs [11]. This information can therefore prime the assessment of rhythm and tempo, which requires more context.

Copyright: © 2015 Florian Hörschläger, Richard Vogl, Sebastian Böck, Peter Knees et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

2. RELATED WORK

Gouyon et al. compare and discuss 11 tempo estimation algorithms submitted to the ISMIR'04 tempo induction contest [12]. Their paper shows that all submitted algorithms perform much better if tempo octave errors are considered as correctly estimated tempos. By ignoring this kind of error it was already possible to reach accuracies beyond 80%. A more recent comparison of state-of-the-art tempo estimation algorithms is given by Zapata and Gómez [13]. Again, the 11 algorithms compared in [12] are discussed, along with 12 new approaches. In this comparison, again, the algorithm presented by Klapuri et al. [3] performs best if tempo octave errors are ignored.

The tempo estimation algorithm described in [3] uses a bank of comb filters, similarly to the approach by Scheirer [14]. One important difference is that while Scheirer uses only five frequency subbands to calculate the input signal for the comb filters, Klapuri et al. use 36 frequency subbands which are combined into 4 so-called "accent bands". This was done with the goal that changes in narrower frequency bands are also detected, while maintaining the ability to detect more global changes, which was already the case in [14].

The two approaches presented by Seyerlehner et al. [1] are based on two periodicity-sensitive features (the autocorrelation function and fluctuation patterns), which are each used to train a k-Nearest-Neighbour classifier. The results obtained with this algorithm are at least comparable to the best results found in [12].

Peeters [2] uses a frequency-domain analysis of an onset-energy function to extract so-called spectral templates. During training, reference spectral patterns are created, which are used in two different approaches: first, in an unsupervised approach where clustering of similar spectral templates is done via a fuzzy k-means algorithm, and second, in a supervised variant where the 8 genres of the training dataset (ballroom) are used. Viterbi decoding is then used to determine the two hidden variables (tempo and rhythmical pattern) of the spectral templates to estimate the tempo of an audio track.

Gkiokas et al. [7] and Eronen and Klapuri [6] use machine learning approaches to further improve tempo estimation results […]ing user tags of "fast" and "slow" [9]. Independent of a specific method for tempo estimation, Moelants and McKinney investigate the factors of a piece being perceived as fast, slow, or temporally ambiguous [16]. In this work, we focus on predicting the correct beats per minute (BPM) for a music piece rather than directly classifying music into speed categories.

3. METHOD

Our approach consists of a two-stage tempo estimation process (visualized in figure 1). First, the Tempo Estimator generates n = 10 tempo estimates using a state-of-the-art tempo estimation approach. Second, the Style Estimator classifies the audio file into a style. Finally, the Tempo Ranker chooses the most probable tempo in the context of the classified style. In the following, we describe the tempo induction approach used (which also serves as a reference baseline for our evaluations), the construction of the style classifier, and the strategy for picking the most probable tempo estimate. In the context of this work we also present an approach for extracting music-style-specific tempo information from Wikipedia articles and show how it is used within the described scheme.

3.1 Baseline Tempo Estimator

In the tempo estimation stage, any state-of-the-art tempo induction algorithm that can provide more than one tempo hypothesis can be used. In this work, we make use of the beat detection method introduced by Böck in [17]. The algorithm is based on bidirectional long short-term memory (BLSTM) recurrent neural networks. As network input, six variations of short-time Fourier transform (STFT) spectrograms transformed to the Mel scale (20 bands) are used. The six variations consist of three spectrograms calculated using window sizes of 1024, 2048, and 4096 samples. For this process, audio data with a sampling rate of 44.1 kHz is used, which results in window lengths of 23.2 ms, 46.4 ms, and 92.8 ms, respectively. In addition to these three spectrograms, the positive first-order difference to the median of the last 0.41 s of every spectrogram is used as input. The neural networks are randomly initial-
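The extra input feature described in section 3.1 can be sketched as follows. This is our own illustrative reconstruction, not the authors' code: for each Mel spectrogram, every frame is compared to the per-band median of the preceding 0.41 s (inclusive of the current frame), and only the positive part of the difference is kept. The frame rate and function name are assumptions.

```python
import numpy as np

def positive_diff_to_median(spec, fps=100, context_s=0.41):
    """spec: (n_frames, n_bands) magnitude spectrogram.
    For every frame, subtract the per-band median over the last
    `context_s` seconds and keep only the positive part."""
    context = max(int(round(context_s * fps)), 1)
    diff = np.zeros_like(spec, dtype=float)
    for t in range(spec.shape[0]):
        start = max(0, t - context)
        med = np.median(spec[start:t + 1], axis=0)  # median over recent frames
        diff[t] = np.maximum(spec[t] - med, 0.0)    # positive part only
    return diff
```

Keeping only the positive part emphasizes energy increases (onsets) while suppressing decaying energy, which is what makes this a useful complement to the raw spectrograms.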
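The Tempo Ranker step of section 3 can be illustrated with a minimal sketch. The Gaussian shape of the style PDF, the function names, and the example BPM ranges are our assumptions; the paper derives its PDFs from Wikipedia BPM ranges, values, and tempo relationships, which may yield a different shape.

```python
import math

def style_pdf_from_range(bpm_min, bpm_max):
    """Build a Gaussian tempo PDF centered on a style's BPM range
    (illustrative assumption; the paper's PDFs come from Wikipedia data)."""
    mean = (bpm_min + bpm_max) / 2.0
    std = max((bpm_max - bpm_min) / 4.0, 1.0)  # spread roughly covering the range
    def pdf(bpm):
        return math.exp(-0.5 * ((bpm - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))
    return pdf

def rank_tempi(hypotheses, pdf):
    """Choose the tempo hypothesis that is most probable under the style PDF."""
    return max(hypotheses, key=pdf)

# drum and bass is commonly quoted around 160-180 BPM
dnb_pdf = style_pdf_from_range(160, 180)
# an uninformed estimator might rank the half-tempo octave 85 BPM first;
# conditioning on the style recovers the correct octave
best = rank_tempi([85.0, 170.0, 340.0], dnb_pdf)  # -> 170.0
```

Because the candidate tempi are typically octave multiples of each other, even a coarse style prior is enough to disambiguate them.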
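The accuracy1/accuracy2 evaluation scheme mentioned in the introduction can be made concrete with a short sketch. The 4% tolerance is a common choice in the tempo estimation literature, not a value stated in this excerpt.

```python
def is_correct(predicted_bpm, true_bpm, factors=(1.0,), tol=0.04):
    """True if predicted_bpm matches true_bpm * f for any factor f,
    within a relative tolerance."""
    return any(abs(predicted_bpm - true_bpm * f) <= tol * true_bpm * f
               for f in factors)

def accuracy1(pred, truth, tol=0.04):
    """Only exact (tolerance-window) matches count."""
    return is_correct(pred, truth, factors=(1.0,), tol=tol)

def accuracy2(pred, truth, tol=0.04):
    """Also accept double, triple, half, and a third of the ground truth."""
    return is_correct(pred, truth,
                      factors=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0), tol=tol)
```

Under accuracy2, a 60 BPM prediction for a 120 BPM track counts as correct, which is exactly the leniency that masks the octave errors this paper targets.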