17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009

SINGING VOICE DETECTION IN MONOPHONIC AND POLYPHONIC CONTEXTS

Hélène Lachambre, Régine André-Obrecht, Julien Pinquier
IRIT - Université de Toulouse
118 route de Narbonne, 31062 Toulouse Cedex 9, France
{lachambre, obrecht, pinquier}@irit.fr

ABSTRACT

In this article, we present an improvement of a previous singing voice detector. The new detector works in two steps.

First, we distinguish monophonies from polyphonies. This distinction is based on the fact that the pitch estimated in a monophony is more reliable than the one estimated in a polyphony. We study the short term mean and variance of a confidence indicator; their joint repartition is modelled with bivariate Weibull distributions. We present a new method to estimate the parameters of these distributions with the moment method.

Then, we detect the presence of singing voice. This is done by looking for the presence of vibrato, an oscillation of the fundamental frequency between 4 and 8 Hz. In a monophonic context, we look for vibrato on the pitch. In a polyphonic context, we first make a frequency tracking on the whole spectrogram, and then look for vibrato on each frequency track.

Results are promising: from a global error rate of 29.7 % (previous method), we fall to a global error rate of 25 %. This means that taking into account the context (monophonic or polyphonic) leads to a relative gain of more than 16 %.

1. INTRODUCTION

Our work takes place in the general context of music description and indexation. In this process, many steps are necessary, including extraction, instrument, genre, artist, or singer identification. For all these tasks, it can be useful to have precise information about the presence or absence of singing voice.

Singing voice detection has been a research subject for around 10 years, so it is a relatively recent topic. Recent work has been conducted to find the best features to describe singing voice [1, 2, 3]. Other works have addressed more specific music style contents [4], or have been interested in the detection of accompanied singing voice [5].

In a previous work [6], we presented a singing voice detector based on the search for vibrato on the harmonics of the sound: we tracked the harmonics present in the signal, and we looked for the presence of vibrato on each harmonic track. In this work, we propose to first separate monophonies from polyphonies. Since this classification is very efficient, we aim at taking advantage of this knowledge to improve the singing voice detection, which is still based on the search for vibrato. The whole process is summarized in figure 1.

The monophony/polyphony classifier is based on the fact that the estimated pitch is more reliable in the case of a monophony than in the case of a polyphony. We analyse a confidence indicator issued from the YIN pitch estimator [7]. The classification process uses bivariate Weibull models; we present a new method to estimate their parameters. Then the singing voice detection is differentiated, depending on the result of the first step. We still look for vibrato on the frequency tracking in polyphonic context but, in monophonic context, we look for vibrato on the pitch.

In parts 2 and 3, we describe respectively the monophony/polyphony classifier and the previous singing voice detector. In part 4, we present the adaptation of the singing voice detector to each case: monophonic and polyphonic. Then in part 5, we present our corpus, experiments and results. Finally we conclude and give some perspectives in part 6.

2. MONOPHONY/POLYPHONY CLASSIFICATION SYSTEM

2.1 Parameters

In [7], de Cheveigné and Kawahara present a pitch estimator named YIN. This estimator is based on the computation of the difference function d_t(τ) over each signal frame t:

$$d_t(\tau) = \sum_{k=1}^{N} (x_k - x_{k+\tau})^2 \qquad (1)$$

with x the signal, N the window size and τ the shift time.

For a periodic signal, its period T should be given by the first zero of d_t(τ). This is not always possible, notably due to imperfect periodicity [7]. The authors propose to use instead the Cumulative Mean Normalised Difference:

$$d'_t(\tau) = \begin{cases} 1 & \text{if } \tau = 0 \\[2pt] d_t(\tau) \Big/ \left[ \dfrac{1}{\tau} \displaystyle\sum_{k=1}^{\tau} d_t(k) \right] & \text{otherwise} \end{cases} \qquad (2)$$

The pitch is given by the index T of the minimum of d'_t(τ). The authors point out that the lower cmnd(t) = d'_t(T) is, the more confident the estimation of T is. So we consider cmnd(t) as a confidence indicator.

In the case of a monophony, the estimated pitch is confident, so cmnd(t) is low and does not vary much. On the contrary, in the case of a polyphony, the estimated pitch is not reliable: cmnd(t) is higher and varies much more. These considerations lead us to use the two following parameters: the short term mean and variance of cmnd(t), noted cmnd_mean(t) and cmnd_var(t), computed over 10 frames centred on frame t.
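As an illustration, the following minimal sketch shows how the confidence indicator and the two short term features could be computed. The function and variable names (yin_difference, cmnd, short_term_features) are ours, and the window sizes are only indicative; it is a sketch of equations (1) and (2), not the reference YIN implementation.

```python
import numpy as np

def yin_difference(frame, tau_max):
    """Difference function d_t(tau) of equation (1) for one signal frame."""
    n = len(frame) - tau_max
    d = np.zeros(tau_max)
    for tau in range(1, tau_max):
        diff = frame[:n] - frame[tau:tau + n]
        d[tau] = np.sum(diff ** 2)
    return d

def cmnd(frame, tau_max):
    """Cumulative Mean Normalised Difference of equation (2); returns the
    candidate period T (in samples) and the confidence cmnd(t) = d'_t(T)."""
    d = yin_difference(frame, tau_max)
    d_prime = np.ones(tau_max)
    cumsum = np.cumsum(d[1:])
    d_prime[1:] = d[1:] * np.arange(1, tau_max) / np.maximum(cumsum, 1e-12)
    T = np.argmin(d_prime[1:]) + 1
    return T, d_prime[T]

def short_term_features(cmnd_values, width=10):
    """Short term mean and variance of cmnd(t) over `width` frames centred on t."""
    half = width // 2
    means, variances = [], []
    for t in range(len(cmnd_values)):
        window = cmnd_values[max(0, t - half):t + half]
        means.append(np.mean(window))
        variances.append(np.var(window))
    return np.array(means), np.array(variances)
```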

[Figure 1 block diagram — Monophony/polyphony separation: YIN → cmnd(t) → short term mean and variance (cmnd_m(t), cmnd_v(t)) → likelihood against the reference Weibull models → "Monophony?" decision. Singing voice detection: if YES, pitch (F0) → vibrato; if NO, sinusoidal segments → pseudo-temporal segments → extended vibrato → decision.]

Figure 1: Global scheme of the system.

2.2 Modelling

The bivariate repartition of (cmnd_mean, cmnd_var) is modelled with the bivariate Weibull distributions proposed in [8]:

$$F(x,y) = 1 - \exp\left\{-\left[\left(\frac{x}{\theta_1}\right)^{\beta_1/\delta} + \left(\frac{y}{\theta_2}\right)^{\beta_2/\delta}\right]^{\delta}\right\} \qquad (3)$$

for (x, y) ∈ R⁺ × R⁺, with (θ₁, θ₂) ∈ R⁺ × R⁺ the scale parameters, (β₁, β₂) ∈ R⁺ × R⁺ the shape parameters and δ ∈ ]0, 1] the correlation parameter.

To estimate the five parameters (θ₁, θ₂, β₁, β₂, δ) of each bivariate distribution, we use the moment method. The moments are given by Lu and Bhattacharyya in [9]:

$$E[X] = \theta_1\,\Gamma(1/\beta_1 + 1) \qquad (4)$$

$$E[Y] = \theta_2\,\Gamma(1/\beta_2 + 1) \qquad (5)$$

$$\mathrm{Var}(X) = \theta_1^2\left[\Gamma(2/\beta_1 + 1) - \Gamma^2(1/\beta_1 + 1)\right] \qquad (6)$$

$$\mathrm{Var}(Y) = \theta_2^2\left[\Gamma(2/\beta_2 + 1) - \Gamma^2(1/\beta_2 + 1)\right] \qquad (7)$$

$$\mathrm{Cov}(X,Y) = \theta_1\theta_2\,\frac{\Gamma(\delta/\beta_1 + 1)\,\Gamma(\delta/\beta_2 + 1)\,\Gamma(1/\beta_1 + 1/\beta_2 + 1) - \Gamma(1/\beta_1 + 1)\,\Gamma(1/\beta_2 + 1)\,\Gamma(\delta/\beta_1 + \delta/\beta_2 + 1)}{\Gamma(\delta/\beta_1 + \delta/\beta_2 + 1)} \qquad (8)$$

with Γ(x) the gamma function.

From equations (4), (5), (6) and (7), we extract θ₁, θ₂, β₁ and β₂, the parameters of the two marginal distributions. We have shown [10] that equation (8) is equivalent to the following equation:

$$f(\delta) = \delta\,B(\delta/\beta_1, \delta/\beta_2) = C \qquad (9)$$

with B(a, b) = Γ(a)Γ(b)/Γ(a + b) the Beta function and C a constant depending on θ₁, θ₂, β₁, β₂ and Cov(X,Y), which are known. From equation (9), finding δ is equivalent to finding the zero of the expression f(δ) − C. As we have shown that f(δ) is a strictly decreasing function [10], we can easily find its unique zero by dichotomy.
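The following Python sketch illustrates this moment-based estimation. The marginal parameters are obtained by matching the squared coefficient of variation, and δ is found by dichotomy; for simplicity the dichotomy is applied here directly to equation (8) (rather than through the constant C of equation (9)), relying on the fact that the model covariance decreases with δ. Function names and tolerances are ours.

```python
import numpy as np
from math import gamma

def weibull_marginal_from_moments(mean, var):
    """Equations (4)-(7): recover (theta, beta) of one Weibull marginal
    from its sample mean and variance, by dichotomy on beta."""
    target = var / mean ** 2                                   # squared coefficient of variation
    cv2 = lambda b: gamma(1 + 2 / b) / gamma(1 + 1 / b) ** 2 - 1   # decreasing in beta
    lo, hi = 0.05, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cv2(mid) > target:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    theta = mean / gamma(1 + 1 / beta)
    return theta, beta

def model_cov(delta, theta1, beta1, theta2, beta2):
    """Right-hand side of equation (8) for a given delta."""
    num = (gamma(delta / beta1 + 1) * gamma(delta / beta2 + 1) * gamma(1 / beta1 + 1 / beta2 + 1)
           - gamma(1 / beta1 + 1) * gamma(1 / beta2 + 1) * gamma(delta / beta1 + delta / beta2 + 1))
    return theta1 * theta2 * num / gamma(delta / beta1 + delta / beta2 + 1)

def fit_bivariate_weibull(x, y):
    """Moment-method estimation of (theta1, theta2, beta1, beta2, delta)."""
    theta1, beta1 = weibull_marginal_from_moments(np.mean(x), np.var(x))
    theta2, beta2 = weibull_marginal_from_moments(np.mean(y), np.var(y))
    sample_cov = np.cov(x, y)[0, 1]
    # The model covariance is 0 at delta = 1 and grows as delta decreases,
    # so the unique zero of model_cov(delta) - sample_cov is found by dichotomy.
    lo, hi = 1e-3, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if model_cov(mid, theta1, beta1, theta2, beta2) > sample_cov:
            lo = mid
        else:
            hi = mid
    delta = 0.5 * (lo + hi)
    return theta1, theta2, beta1, beta2, delta
```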

2.3 Classification and results

A bivariate Weibull model is learned for each class on the training corpus (described in part 5.1). These models are thereafter named the reference models.

The classification is done every second. Since cmnd_mean(t) and cmnd_var(t) are computed every 10 ms, we have 100 two-dimensional vectors every second. The decision is taken by computing the likelihood of 100 consecutive vectors (1 second) under each reference model. The assigned class is the one which maximizes the likelihood.

Results given by this method are very good: we obtain a global error rate of 6.3 % on the corpus presented in part 5.1. This is why we use this method as a preprocessing stage before looking for the presence of singing voice.
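A compact sketch of this decision rule is given below. The callables logpdf_mono and logpdf_poly stand for the log-densities of the two reference bivariate Weibull models and are simply assumed to be available; their exact form is not detailed here.

```python
def classify_second(features, logpdf_mono, logpdf_poly):
    """Classify one second of signal, i.e. 100 (cmnd_mean, cmnd_var) vectors,
    as the reference model with the highest total log-likelihood."""
    ll_mono = sum(logpdf_mono(x, y) for x, y in features)
    ll_poly = sum(logpdf_poly(x, y) for x, y in features)
    return "monophony" if ll_mono > ll_poly else "polyphony"
```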

3. SINGING VOICE DETECTION

3.1 Vibrato

Vibrato is a well-known [11, 12] property of the human singing voice. In general, vibrato is defined as a periodic oscillation of the fundamental frequency. In the specific case of the singing voice, this oscillation is at a rate of 4 to 8 Hz. So, on a given frequency tracking vector F, the presence of vibrato is confirmed if there is a maximum between 4 and 8 Hz in the Fourier Transform of F.

In our work, we consider monophonic and polyphonic extracts, so the search for vibrato on the fundamental frequency is not always possible. However, we note that if vibrato is present on the fundamental frequency, it is also present on its harmonics. This is why we track all the harmonics present in the signal (see section 3.2). We then look for the presence of vibrato on these harmonics.
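As an illustration, a minimal sketch of this test could look as follows. The frame rate of the frequency track (100 values per second, i.e. one value every 10 ms) and the function name has_vibrato are assumptions on our side.

```python
import numpy as np

def has_vibrato(freq_track, frame_rate=100.0, f_lo=4.0, f_hi=8.0):
    """Return True if the Fourier transform of the frequency track F
    has its maximum (DC excluded) between 4 and 8 Hz."""
    track = np.asarray(freq_track, dtype=float)
    track = track - np.mean(track)               # remove the DC component
    spectrum = np.abs(np.fft.rfft(track))
    freqs = np.fft.rfftfreq(len(track), d=1.0 / frame_rate)
    peak = freqs[np.argmax(spectrum[1:]) + 1]    # frequency of the largest bin
    return f_lo <= peak <= f_hi
```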

3.2 Sinusoidal and Pseudo-temporal segmentations

3.2.1 Sinusoidal segmentation

The tracking of the harmonics (see figure 2), thereafter named "sinusoidal segments", is done with the method described in [13]. The algorithm is the following one:
• compute the spectrum every 10 ms, with a 20 ms Hamming window,
• convert the frequency in cents (100 cents = 1/2 tone), and smooth the spectrum with a 17 cent window,
• detect the maxima of the spectrum: the frequencies (f_t^i, i = 1, ..., I) and their log amplitudes (p_t^i, i = 1, ..., I),
• compute the distance between each pair of consecutive maxima (at the instants t and t − 1):

$$d_{i_1,i_2}(t) = \sqrt{\left(\frac{f_t^{i_1} - f_{t-1}^{i_2}}{C_f}\right)^2 + \left(\frac{p_t^{i_1} - p_{t-1}^{i_2}}{C_p}\right)^2} \qquad (10)$$

• two points (t, f_t^{i_1}) and (t + 1, f_{t+1}^{i_2}) are connected if d_{i_1,i_2}(t) < d_th.

C_f, C_p and d_th are found experimentally: C_f = 100 (1/2 tone), C_p = 3 (power divided by 2) and d_th = 5 (our experiments have confirmed the values given by [13]).

A set of consecutive connected points forms a sinusoidal segment.
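The frame-to-frame connection step can be sketched as follows. The code assumes that the spectral peaks have already been extracted for every frame, as lists of (frequency in cents, log amplitude) pairs, and the greedy pairing strategy is our simplification of the method of [13].

```python
def connect_peaks(peaks_per_frame, c_f=100.0, c_p=3.0, d_th=5.0):
    """Group spectral peaks of consecutive frames into sinusoidal segments.

    peaks_per_frame: one list of (frequency_in_cents, log_amplitude) pairs per frame.
    Returns a list of segments, each a list of (frame_index, frequency, log_amplitude).
    """
    segments = []
    ongoing = {}                                   # previous-frame peak index -> segment
    for t, peaks in enumerate(peaks_per_frame):
        new_ongoing = {}
        for i1, (f1, p1) in enumerate(peaks):
            seg = None
            if t > 0 and peaks_per_frame[t - 1]:
                # nearest peak of the previous frame according to equation (10)
                d, i2 = min(
                    ((((f1 - f2) / c_f) ** 2 + ((p1 - p2) / c_p) ** 2) ** 0.5, i2)
                    for i2, (f2, p2) in enumerate(peaks_per_frame[t - 1]))
                if d < d_th and i2 in ongoing:
                    seg = ongoing.pop(i2)          # extend the matched segment
            if seg is None:                        # otherwise start a new segment
                seg = []
                segments.append(seg)
            seg.append((t, f1, p1))
            new_ongoing[i1] = seg
        ongoing = new_ongoing
    return segments
```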

Figure 2: Examples of sinusoidal segmentations for a 23 s extract of a monophonic song a cappella (a, solo singer) and a 5 s polyphonic song a cappella (b, polyphonic): each curve is a sinusoidal segment (axes: frequency in Hz against time in s).

As we want to work on all the harmonics, we have introduced a pseudo-temporal segmentation, which aims at grouping the temporally related sinusoidal segments.

3.2.2 Pseudo-temporal segmentation

We assume that in music, for a given note, the fundamental frequency and its harmonics begin and end approximately at the same time. Therefore, we analyse the temporal relations between the beginnings and the ends of sinusoidal segments [6].

A limit of a segment is then placed at instant t if the two following conditions are respected (see the sketch below):
• there are at least 2 extremities of sinusoidal segments at the instant t,
• there are at least 3 beginnings or 3 ends between instants t and t + 1.

The resulting segmentation is presented in figure 3.

Figure 3: Temporal segmentation of the extract of figure 2 a): the vertical lines are the temporal limits of the segments.
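A minimal sketch of this rule is given below, assuming the sinusoidal segments are given as (start_frame, end_frame) pairs. The function name and the representation are ours, and we read "between instants t and t + 1" as "at instant t or t + 1".

```python
def pseudo_temporal_limits(segments, n_frames):
    """Place pseudo-temporal segment limits according to the two conditions above.

    segments: list of (start_frame, end_frame) pairs of sinusoidal segments.
    Returns the frame indices where a limit is placed.
    """
    starts = [0] * n_frames
    ends = [0] * n_frames
    for start, end in segments:
        starts[start] += 1
        ends[end] += 1

    limits = []
    for t in range(n_frames - 1):
        extremities_at_t = starts[t] + ends[t]
        begins_or_ends = max(starts[t] + starts[t + 1], ends[t] + ends[t + 1])
        # condition 1: at least 2 segment extremities at instant t
        # condition 2: at least 3 beginnings or 3 ends between instants t and t+1
        if extremities_at_t >= 2 and begins_or_ends >= 3:
            limits.append(t)
    return limits
```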

3.3 Extended vibrato

In our previous work [6], we defined the extended vibrato, vibr, as follows:

$$vibr = \frac{\sum_{s \in \Gamma} l(s)}{\sum_{s \in \Omega} l(s)} \qquad (11)$$

with:
Ω the set of sinusoidal segments present in the current temporal segment,
Γ the set of sinusoidal segments with vibrato, i.e. with a maximum between 4 and 8 Hz,
l(s) the duration of the sinusoidal segment s.

vibr characterises each pseudo-temporal segment. For too short pseudo-temporal segments, vibr = 0: it is not possible, considering the length of the concerned sinusoidal segments, to determine whether there is vibrato or not. The same value vibr is attributed to each frame of the pseudo-temporal segment.

In [6], vibr was averaged over one second. The final decision was taken by thresholding the 1-second-averaged vibr: a high value (> 0.3) meant the presence of singing voice, while a low value indicated its absence.
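Equation (11) translates directly into code. The sketch below reuses the has_vibrato test from section 3.1 and assumes each sinusoidal segment of the pseudo-temporal segment is given as a (frequency_track, duration) pair; this representation is our assumption.

```python
def extended_vibrato(segments, frame_rate=100.0):
    """Extended vibrato of equation (11) for one pseudo-temporal segment.

    segments: list of (frequency_track, duration) pairs, the frequency track
    being sampled at frame_rate.
    """
    total = sum(duration for _, duration in segments)
    if total == 0:
        return 0.0
    with_vibrato = sum(duration for track, duration in segments
                       if has_vibrato(track, frame_rate))
    return with_vibrato / total
```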

4. FUSION

A global remark that can be made on the singing voice detection is that a singer does not sing all the time: there are very short pauses (less than 1/2 second) for breathing. And in the case of a polyphony, there are also long pauses (up to 1 minute or more) for instrumental parts, and short pauses (1 to 3 seconds) for instrumental transitions.

This makes us change our decision strategy (in both the monophonic and polyphonic cases): the decision is still taken every second, but instead of thresholding a value computed over one second, the detection of singing voice is now based on the fact that at one (or more) instant of the second, we detect vibrato.

4.1 Monophonic case

In the monophonic case, the estimated pitch is reliable (see section 2). Its value is even more reliable than the sinusoidal segments: the pitch found is either the fundamental frequency or one of its harmonics, so if there is vibrato on the real pitch, it is also present on the estimated pitch. On the other hand, the sinusoidal segments mostly correspond to harmonics of the fundamental frequency, but there are some intruders which are not harmonics. The presence of vibrato on these intruders is not linked to the presence of vibrato on the pitch. This is why our search for vibrato is made on the pitch in the monophonic case.

We first segment the pitch into notes: a transition between two consecutive notes is found if the pitch jumps by more than 1/2 tone. Then, on each note, we look for the presence of vibrato. A second is classified as containing singing voice if there is vibrato on at least 10 % of the frames.

4.2 Polyphonic case

In the polyphonic case, there is a strong overlap between the frequencies of the different instruments/singers performing. The estimated pitch is not reliable: it does not correspond to any instrument or singer present. However, we can assume that each sinusoidal segment corresponds to the partial of one instrument. We also assume that the different sinusoidal segments at a given time can come from different instruments.

The value vibr is short-term averaged over 500 ms (corresponding to 50 values of vibr), giving vibr_average. The decision process is the following: there is singing voice if, during 1 second, there is at least one instant for which vibr_average > 0.15.

Note that the preprocessing step (separating monophonies) allows us to have a different threshold than in [6], which is more robust and adapted to the polyphonic case.
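The two per-second decision rules can be sketched as follows. The helper has_vibrato comes from the sketch of section 3.1, and the frame rate of 100 values per second, the note-segmentation code and the function names are assumptions consistent with the description above.

```python
import numpy as np

def decide_monophonic(pitch_track, frame_rate=100):
    """One-second decision, monophonic case: split the pitch into notes at jumps
    larger than 1/2 tone, test each note for vibrato, and declare singing voice
    if at least 10 % of the frames carry vibrato."""
    pitch = np.asarray(pitch_track, dtype=float)
    half_tone = 2 ** (100.0 / 1200.0)          # 1/2 tone = 100 cents, as a ratio
    boundaries = [0]
    for t in range(1, len(pitch)):
        lo, hi = sorted((pitch[t - 1], pitch[t]))
        if lo > 0 and hi / lo > half_tone:     # note transition
            boundaries.append(t)
    boundaries.append(len(pitch))
    vibrato_frames = 0
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        if end - start > 1 and has_vibrato(pitch[start:end], frame_rate):
            vibrato_frames += end - start
    return vibrato_frames >= 0.10 * len(pitch)

def decide_polyphonic(vibr_per_frame, frame_rate=100):
    """One-second decision, polyphonic case: average vibr over 500 ms (50 frames)
    and declare singing voice if the average exceeds 0.15 at least once."""
    vibr = np.asarray(vibr_per_frame, dtype=float)
    window = frame_rate // 2
    averages = np.convolve(vibr, np.ones(window) / window, mode="valid")
    return bool(np.any(averages > 0.15))
```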

5. EXPERIMENTS AND RESULTS

5.1 Corpus

Our corpus is home made, and contains monophonic and polyphonic music, with and without singing voice. We have either single instruments or single singers performing for the monophonies, and either multiple instruments, or singers with instrumental background, performing for the polyphonies.

The corpus contains extracts from approximately 50 musical recordings, so we consider that the results are not dependent on the recording conditions. The corpus also contains various styles (rock, opera, renaissance, jazz, country...), various instruments (piano, violin, recorder, cello, guitar, brass...), and 12 different singers. Some styles, instruments, and singers are present in the test set but not in the training set. This allows us to consider also that the results are not dependent on the music type, instrument, and singer.

The composition of the corpus is summarized in table 1 in terms of duration and of number of tests (seconds) for each class.

Table 1: Corpus repartition (training and test sets).
Class                     Train duration   Test duration   Nb. of tests
Single instrument         25 s             2 min 57 s      177
Single singer             25 s             4 min 38 s      278
Monophony                 50 s             7 min 35 s      455
Multiple instruments      25 s             3 min 23 s      203
Singers and instruments   25 s             3 min 10 s      190
Polyphony                 50 s             6 min 33 s      393
Total                     2 min 5 s        18 min 41 s     1121

5.2 Results with a handmade monophony/polyphony classification

The previous method, in which we did not distinguish monophonies from polyphonies, gave a global error rate of 29.7 %.

First, we propose to consider a handmade monophonic/polyphonic classification, in order to evaluate our proposal to differentiate the decision process and to obtain an upper limit of the performances.

The results of this experiment are presented in table 2 for the monophonic case, and in table 3 for the polyphonic case. The global error rate is 21.7 %.

Table 2: Confusion matrix - Monophonic context.
                    Singing voice   No singing voice
Single singer       0.83 %          0.17 %
Single instrument   0.20 %          0.80 %

Table 3: Confusion matrix - Polyphonic context.
                          Singing voice   No singing voice
Singers and instruments   0.66 %          0.34 %
Multiple instruments      0.16 %          0.84 %

We first note that having knowledge of the number of sources performing (one or more) improves the singing voice detection in every case. The global error rate is improved by more than 8.5 %.

As we could have predicted, it is more difficult to detect the presence of singing voice in a polyphonic context. The missed occurrences are due to the singing voice being low compared to the accompanying instruments.

In the monophonic context, the false alarms are due to wind instruments, on which the performer voluntarily produces a vibrato as a musical effect.

5.3 Results with an automatic monophony/polyphony classification

We conduct a second experiment to evaluate the whole system: the monophonic/polyphonic decision is taken with the method presented in part 2. Then, depending on the output of the first system, the singing voice detection is run with either one or the other method.

The results are presented in table 4. The global error rate is now 25 %. Even with the imprecision introduced by the monophonic/polyphonic detector, the singing voice detection is still better than in the previous system, since it is improved by more than 5 %.

Table 4: Confusion matrix - Whole system.
                          Singing voice   No singing voice
Single singer             0.79 %          0.21 %
Single instrument         0.26 %          0.74 %
Singers and instruments   0.65 %          0.35 %
Multiple instruments      0.18 %          0.82 %

We see that the singing voice is always more difficult to detect in a polyphonic context. Most errors are due to the same causes as in the previous experiment, with the imprecision of the preprocessing step added.

6. CONCLUSION AND PERSPECTIVES

In this article, we presented an improvement of our singing voice detector, based on a differentiated strategy depending on the monophonic/polyphonic character of the music. We first presented the monophonic/polyphonic classifier, then the two strategies we adopt to detect the presence of singing voice. The singing voice is detected by the presence of vibrato either on the pitch (monophonic context) or on the frequency trackings (polyphonic context).

The method we propose needs a training phase which is done once and for all, to learn the parameters of the models and the classification thresholds. All these parameters are then considered universal. This assumption is supported by the very small size of the training set: 50 seconds of monophonic signal and 50 seconds of polyphonic signal.

The results are very promising, since the global error rate falls from 29.7 % to 21.7 % with a handmade separation between monophonic and polyphonic contexts. When the whole system is used, the global error rate is still improved, since it falls to 25 %.

Our next work will now be to improve the detection of the singing voice in the polyphonic context, which is at the moment the more critical case.

This system could be used in many applications. It can for example be added to speech/music separation systems. In these systems, the singing voice is the cause of many errors, which are mainly solo singing voice recognised as speech. Another application is music structuring: some parts of a song, such as the chorus, almost always contain singing voice, whereas others, such as transitions or interludes, do not.

REFERENCES

[1] M. Rocamora and P. Herrera, "Comparing audio descriptors for singing voice detection in music audio files," in Brazilian Symposium on Computer Music, 11th, San Pablo, Brazil, 2007.
[2] A. Mesaros and S. Moldovan, "Method for singing voice identification using energy coefficients as features," in IEEE International Conference on Automation, Quality and Testing, Robotics, 2006, pp. 161-166.
[3] N. C. Maddage, Kongwah Wan, Changsheng Xu, and Ye Wang, "Singing voice detection using twice-iterated composite Fourier transform," in Multimedia and Expo, 2004. ICME '04. 2004 IEEE International Conference on, 2004, vol. 2, pp. 1347-1350.
[4] S. Z. K. Khine, T. L. Nwe, and H. Li, "Singing voice detection in pop using co-training algorithm," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 1629-1632.
[5] S. Santosh, S. Ramakrishnan, Vishweshwara Rao, and Preeti Rao, "Improving singing voice detection in presence of pitched," in Proc. of the National Conference on Communications (NCC), 2009.
[6] H. Lachambre, R. André-Obrecht, and J. Pinquier, "Singing voice characterization for audio indexing," in 15th European Signal Processing Conference (EUSIPCO), 2007, pp. 1563-1540.
[7] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, April 2002.
[8] P. Hougaard, "A class of multivariate failure time distributions," Biometrika, vol. 73, no. 3, pp. 671-678, 1986.
[9] J. C. Lu and G. K. Bhattacharyya, "Some new constructions of bivariate Weibull models," Annals of Institute of Statistical Mathematics, vol. 42, no. 3, pp. 543-559, 1990.
[10] H. Lachambre, "Estimation of Weibull bivariate distribution parameters via the moment method," Tech. Rep., IRIT - SAMoVA, Dec 2008.
[11] I. Arroabarren and A. Carlosena, "Voice production mechanisms of vocal vibrato in male singers," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 1, pp. 320-332, Jan 2007.
[12] R. Timmers and P. Desain, "Vibrato: questions and answers from musicians and science," in Proc. Int. Conf. on Music Perception and Cognition, 2000.
[13] Toru Taniguchi, Akishige Adachi, Shigeki Okawa, Masaaki Honda, and Katsuhiko Shirai, "Discrimination of Speech, Musical Instruments and Singing Voices Using the Temporal Patterns of Sinusoidal Segments in Audio Signals," in Interspeech - European Conference on Speech Communication and Technology, ISCA, Sept. 2005.
