Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle Valentin Emiya, Roland Badeau, Bertrand David

To cite this version: Valentin Emiya, Roland Badeau, Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2010, 18 (6), pp. 1643-1654. DOI: 10.1109/TASL.2009.2038819.

HAL Id: inria-00510392 https://hal.inria.fr/inria-00510392 Submitted on 18 Aug 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Abstract—A new method for the estimation of multiple concurrent pitches in piano recordings is presented. It addresses the issue of overlapping overtones by modeling the spectral envelope of the overtones of each note with a smooth autoregressive model. For the background noise, a moving-average model is used, and the combination of both tends to eliminate harmonic and sub-harmonic erroneous pitch estimations. This leads to a complete generative spectral model for simultaneous piano notes, which also explicitly includes the typical deviation from exact harmonicity in a piano overtone series. The pitch set maximizing an approximate likelihood is selected from among a restricted number of possible pitch combinations. Tests have been conducted on a large homemade database called MAPS, composed of piano recordings from a real upright piano and from high-quality samples.

Index Terms—Acoustic signal analysis, audio processing, multipitch estimation, piano, transcription, spectral smoothness.

I. INTRODUCTION

The issue of monopitch estimation has been addressed for decades by different approaches such as retrieving a periodic pattern in a waveform [1; 2], matching a regularly spaced pattern to an observed spectrum [3–5], or even combining both spectral and temporal cues [6; 7]. Conversely, the multipitch estimation (MPE) problem has become a rather active research area in the last decade [8–11] and is mostly handled by processing spectral or time-frequency representations. This MPE task has also become a central tool in musical scene analysis [11], particularly when the targeted application is the automatic transcription of music (ATM) [12–16]. While recent works consider pitch and time dimensions jointly to perform this task, MPE on single frames has historically been used as a prior processing stage for pitch tracking over time and musical note detection. Indeed, from a signal processing perspective, a period or a harmonic series can be estimated from a short signal snapshot. In terms of perception, only a few cycles are needed to identify a pitched note [17].

In this polyphonic context, two issues have proved difficult and interesting for computational MPE: the overlap between the overtones of different notes and the unknown number of such notes occurring simultaneously. As the superposition of two sounds in octave relationship leads to a spectral ambiguity, and thus to an ill-posed problem, some knowledge is often used for the further spectral modeling of the underlying sources. For instance, when iterative spectral subtraction is employed, several authors have adopted a smoothness assumption for the spectral envelope of the notes [8–10].

Several preceding works [12–14; 16] specifically address the case of the piano. One of the interests in studying a single instrument is to incorporate in the algorithms some results established from physical considerations [18; 19], in the hope that a more specific model will lead to a better performance, with the obvious drawback of narrowing the scope of the application. Indeed, while a large number of musical pieces are composed for piano solo, the performance of ATM systems is relatively poor for this instrument in comparison to others [6].

Two deviations from a simple harmonic spectral model are specific to freely vibrating string instruments and are particularly salient for the piano: a small departure from exact harmonicity (the overtone series is slightly stretched) and the beatings between very close frequency components (for a single string, two different vibrating polarizations occur, and the different strings in a doublet or triplet are slightly detuned on purpose [20]).

In this work, we propose a new spectral model in which the inharmonic distribution is taken into account and adjusted for each possible note. The spectral envelope of the overtones is modeled by a smooth autoregressive (AR) model, the smoothness resulting from a low model order. A smooth spectral model is similarly introduced for the residual noise by using a low-order moving-average (MA) process, which is particularly efficient against residual sinusoids. The proposed method follows a preliminary work [21] in which the AR/MA approach was already used and MPE was addressed as an extension of a monopitch estimation approach. In addition to the opportunity to deal with higher polyphony levels, the current paper introduces a new signal model for simultaneous piano notes and a corresponding estimation scheme. The major advance from the previous study is thus to model and estimate the spectral overlap between note spectra. An iterative estimation framework is proposed which leads to a maximum likelihood estimation for each possible set of F0s.

This paper is structured as follows. In section II, the overall MPE approach is described. The sound model is first detailed, followed by the principle of the algorithm: statistical background, adaptive choice of the search space and detection function. The estimation of the model parameters is then explained in section III. After specifying the implementation details, the algorithm is tested in section IV using a database of piano sounds and the results are analyzed and compared with those obtained with some state-of-the-art approaches.

V. Emiya is with the Metiss team at INRIA, Centre Inria Rennes - Bretagne Atlantique, Rennes, France, and, together with R. Badeau and B. David, with Institut Télécom; Télécom ParisTech; CNRS LTCI, Paris, France. The research leading to this paper was supported by the French GIP ANR under contract ANR-06-JCJC-0027-01, Décomposition en Éléments Sonores et Applications Musicales - DESAM, and by the European Commission under contract FP6-027026-K-SPACE. The authors would also like to thank Peter Weyer-Brown from Telecom ParisTech and Nathalie Berthet for their useful comments to improve the English usage.

Conclusions are finally drawn in section V.

Note that in the following, * denotes the complex conjugate, T the transpose operator, † the conjugate transpose, ⌊.⌋ the floor function, and |.| the number of elements in a set, the absolute value of a real number or the modulus of a complex number. In addition, the term polyphony will be applied to any possible mixture of notes, including silence (polyphony 0) and single notes (polyphony 1).

II. MULTIPITCH ESTIMATION ALGORITHM

A. Generative sound model

At the frame level, a mixture of piano sounds is modeled as a sum of sinusoids and background noise. The amplitude of each sinusoid is considered as a random variable. The spectral envelope for the overtone series of a note is introduced as second-order statistical properties of this random variable, making it possible to adjust its smoothness. The noise is considered as a moving-average (MA) process.

More precisely, a mixture of P ∈ N simultaneous piano notes, observed in a discrete-time N-length frame, is modeled as a sum x(t) ≜ Σ_{p=1}^{P} x_p(t) + x_b(t) of the note signals x_p and of noise x_b. The sinusoidal model for x_p is

    x_p(t) ≜ Σ_{h=1}^{H_p} ( α_{hp} e^{2iπ f_{hp} t} + α*_{hp} e^{−2iπ f_{hp} t} )    (1)

where α_{hp} are the complex amplitudes and f_{hp} the frequencies of the H_p overtones (here, H_p is set to the maximum number of overtones below the Nyquist frequency). Note p is parameterized by C_p = (f_{0p}, β_p), f_{0p} being the fundamental frequency (F0) and β_p(f_{0p}) being the so-called inharmonicity coefficient of the piano note [20], such that

    f_{hp} ≜ h f_{0p} √(1 + β_p(f_{0p}) h²)    (2)

We introduce a spectral envelope for note p as an autoregressive (AR) model of order Q_p, parameterized by θ_p ≜ (σ_p², A_p(z)), where σ_p² is the power and 1/A_p(z) is the transfer function of the related AR filter. Order Q_p should be low and proportional to H_p, in order to obtain a smooth envelope and to avoid overfitting issues for high pitches. The amplitude α_{hp} of overtone h of note p is the outcome of a zero-mean complex Gaussian random variable¹, with variance equal to the spectral envelope power density at the frequency f_{hp} of the overtone:

    α_{hp} ∼ N( 0, σ_p² / |A_p(e^{2iπ f_{hp}})|² )    (3)

In order to model a smooth noise spectral envelope without residual peaks, the noise x_b is modeled as an MA process with a low order Q_b, parameterized by θ_b ≜ (σ_b², B(z)), σ_b² being the power and B(z) the finite impulse response (FIR) filter of the process.

Furthermore, high note powers and low noise powers are favored by choosing an inverse gamma prior for σ_p²: σ_p² ∼ IG(k_{σ_p²}, E_{σ_p²}), and a gamma prior for σ_b²: σ_b² ∼ Γ(k_{σ_b²}, E_{σ_b²}). A non-informative prior² is assumed for A_p(z) and B(z).

With a user-defined weighting window w, we can equivalently consider the Discrete Time Fourier Transforms (DTFT) X_p(f) of x_p(t)w(t), X_b(f) of x_b(t)w(t) and X(f) ≜ Σ_{p=1}^{P} X_p(f) + X_b(f) of x(t)w(t). W being the DTFT of w, we thus have

    X_p(f) = Σ_{h=1}^{H_p} ( α_{hp} W(f − f_{hp}) + α*_{hp} W(f + f_{hp}) )    (4)

In addition, the following synthetic notations will be used: a mixture of notes is denoted by C ≜ (C_1, ..., C_P); the set of parameters for spectral envelope models by θ ≜ (θ_1, ..., θ_P); the set of amplitudes of overtones of note p by α_p ≜ (α_{1p}, ..., α_{H_p p}); and the set of amplitudes of overtones of all notes by α ≜ (α_1, ..., α_P).

Fig. 1. Multipitch estimation block diagram (preprocessing of the frame, selection of mixture candidates, iterative estimation of note models with AR, amplitude and MA parameter estimation for each possible mixture C ∈ C, weighted log-likelihood L̃_x(C), and decision by maximization over C).

B. Principle of the algorithm

The algorithm is illustrated in Fig. 1. The main principles are described in the current section, except for the model parameter estimation, which is detailed in Section III.
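The generative model of eqs. (1)–(3) can be sketched in a few lines of numpy. This is an illustrative sketch only: the function names, the toy parameter values and the direct evaluation of A_p on the unit circle are assumptions, not the paper's implementation.

```python
import numpy as np

def overtone_freqs(f0, beta, fs):
    """Inharmonic overtone frequencies f_h = h*f0*sqrt(1 + beta*h^2) (eq. (2)),
    kept below the Nyquist frequency fs/2."""
    freqs = []
    h = 1
    while True:
        f = h * f0 * np.sqrt(1.0 + beta * h ** 2)
        if f >= fs / 2:
            break
        freqs.append(f)
        h += 1
    return np.array(freqs)

def sample_note(f0, beta, fs, N, sigma2, ar_coefs, rng):
    """Draw one note signal x_p(t) (eq. (1)), with complex amplitudes drawn
    from the AR spectral envelope prior of eq. (3)."""
    f = overtone_freqs(f0, beta, fs) / fs              # normalized frequencies
    q = np.arange(len(ar_coefs))
    # A_p(e^{2i*pi*f}) evaluated at each overtone frequency
    A = np.array([np.sum(ar_coefs * np.exp(-2j * np.pi * fk * q)) for fk in f])
    var = sigma2 / np.abs(A) ** 2                      # envelope power density
    alpha = (rng.standard_normal(len(f)) + 1j * rng.standard_normal(len(f))) \
            * np.sqrt(var / 2)
    t = np.arange(N)
    return np.sum(alpha[:, None] * np.exp(2j * np.pi * f[:, None] * t)
                  + np.conj(alpha)[:, None] * np.exp(-2j * np.pi * f[:, None] * t),
                  axis=0).real
```

For instance, with f0 = 440 Hz and β = 4·10⁻⁴, the first overtone is slightly above 440 Hz and the second lies above the exact octave, reproducing the stretched series of eq. (2).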

¹Using these centered Gaussian variables, x_p is thus a harmonic process in the statistical sense.
²e.g. an improper, constant prior density, which will not affect the detection function.

1) Statistical background: Let us consider the observed frame x and the set C of all the possible mixtures of piano notes. Ideally, the multipitch detection function can be expressed as the maximum a posteriori (MAP) estimator Ĉ, which is equivalent to the ML estimator if no a priori information on mixtures is specified:

    Ĉ = argmax_{C∈C} p(C|x) = argmax_{C∈C} p(x|C) p(C) / p(x) = argmax_{C∈C} p(x|C)    (5)

2) Search space: The number of combinations among Q possible notes being (Q choose P) for polyphony P, the size of the search set C is Σ_{P=0}^{Q} (Q choose P) = 2^Q, i.e. around 3·10^26 for a typical piano with Q = 88 keys. Even when the polyphony is limited to P_max = 6 and the number of possible notes to Q = 60 (e.g. by ignoring the lowest and highest-pitched keys), the number of combinations reaches Σ_{P=0}^{P_max} (Q choose P) ≈ 56·10^6, which remains too huge a search set for a realistic implementation.

In order to reduce the size of this set, we propose to select a given number N_c of note candidates. We use the normalized product spectrum function defined in dB as

    Π_X(f_0, β) ≜ (1 / H(f_0, β)^ν) · 10 log Π_{h=1}^{H(f_0,β)} |X(f_h)|²    (6)

where X is the observed spectrum, H(f_0, β) is the number of overtones for the note with fundamental frequency f_0 and inharmonicity β, f_h ≜ h f_0 √(1 + β h²), and ν is a parameter adjusted to balance the values of the function between bass and treble notes. As the true notes generate local peaks in Π_X, selecting an oversized set of the N_c greatest peaks is an efficient way of adaptively reducing the number of note candidates. In addition, for each candidate n_c ∈ ⟦1; N_c⟧, accurate values of its fundamental frequency f_{0n_c} and inharmonicity β_{n_c} are found by a local two-dimensional maximization of Π_X(f_{0n_c}, β_{n_c})³. They are used in the subsequent estimation stages to locate the frequencies of the overtones. In the current implementation, the number of note candidates is set to N_c = 9. This choice results from the balance between increasing the number of candidates and limiting the computational time. The size of the set C of possible mixtures thus equals Σ_{P=0}^{P_max} (N_c choose P) = 466, for P_max = 6.

3) Multipitch detection function: The multipitch detection function aims at finding the correct mixture among all the possible candidates. For each candidate C ∈ C, the parameters α, θ and θ_b of the related model are estimated as explained in section III. The detection function is then computed, and the multipitch estimation is defined as the mixture with the greatest value.

Since equation (5) is intractable, we define an alternate detection function which involves the following terms:
• L_p(θ_p) ≜ ln p(α_p|θ_p, C_p), the log-likelihood of the amplitudes α_p of the overtones of note p, related to the spectral envelope model θ_p;
• L_b(θ_b) ≜ ln p(x|α, θ_b, C), equal to the log-likelihood ln p_{x_b} related to the noise model θ_b⁴;
• the priors ln p(θ_p) on the spectral envelope of note p and ln p(θ_b) on the noise parameters.

We empirically define the detection function as a weighted sum of these log-densities, computed with the estimated parameters (see Appendix A for a discussion):

    L̃_x(C) ≜ w_1 Σ_{p=1}^{P} L̃_p(θ̂_p)/P + w_2 L̃_b(θ̂_b) + w_3 Σ_{p=1}^{P} ln p(σ̂_p²)/P + w_4 ln p(σ̂_b²) − μ_pol P    (7)

where
• L̃_p(θ̂_p) ≜ (1/H_p) L_p(σ̂_p², Â_p) − μ_env H_p;
• L̃_b(θ̂_b) ≜ (1/|F_b|) L_b(σ̂_b², B̂) − μ_b |F_b|;
• w_1, ..., w_4, μ_pol, μ_env, μ_b are user-defined coefficients;
• |F_b| is the number of noisy bins (see eq. (14)).

III. MODEL PARAMETER ESTIMATION

A. Iterative estimation of spectral envelope parameters and amplitudes of notes

We consider a possible mixture C = (C_1, ..., C_P). The estimation of the unknown spectral envelope parameters θ and of the unknown amplitudes of the overtones α is performed by iteratively estimating the former and the latter, as described in this section.

1) Spectral envelope parameter estimation: let us assume the amplitudes α are known in order to estimate the spectral envelope parameters θ in the maximum likelihood (ML) sense. Given that note models are independent, the likelihood of α is p(α|θ, C) = Π_{p=1}^{P} p(α_p|θ_p, C_p). The optimization w.r.t. θ thus consists in maximizing the log-likelihood L_p(σ_p², A_p) ≜ ln p(α_p|σ_p², A_p, C_p) w.r.t. θ_p = (σ_p², A_p), independently for each note p. As proved in Appendix B, the maximization w.r.t. σ_p² leads to the expression

    L_p(σ̂_p², A_p) = c + (H_p/2) ln ρ(A_p)    (8)

with

    c ≜ −(H_p/2) ln(2πe) − (1/2) Σ_{h=1}^{H_p} ln |α_{hp}|²    (9)

    ρ(A_p) ≜ [ Π_{h=1}^{H_p} |α_{hp}|² |A_p(e^{2iπ f_{hp}})|² ]^{1/H_p} / [ (1/H_p) Σ_{h=1}^{H_p} |α_{hp}|² |A_p(e^{2iπ f_{hp}})|² ]    (10)

    σ̂_p² = (1/H_p) Σ_{h=1}^{H_p} |α_{hp}|² |A_p(e^{2iπ f_{hp}})|²    (11)

³The fminsearch Matlab function was used for this optimization.
⁴By using the substitution x ↦ x − 2Re(Σ_{p=1}^{P} Σ_{h=1}^{H_p} α_{hp} e^{2iπ f_{hp} t}), t being the time instants related to the frame, this term is equal to ln p_{x_b}(x − 2Re(Σ_{p=1}^{P} Σ_{h=1}^{H_p} α_{hp} e^{2iπ f_{hp} t}) | θ_b).
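The candidate-selection step of eq. (6) can be sketched as follows on a DFT magnitude grid. This is a simplified illustration: the function names, the coarse grid search and the additive floor inside the logarithm are assumptions, and the local two-dimensional refinement with fminsearch described above is replaced by a crude argmax.

```python
import numpy as np

def product_spectrum(X_mag, fs, f0, beta, nu=0.38):
    """Normalized product spectrum (eq. (6)), evaluated on DFT bins:
    Pi_X(f0, beta) = (1/H^nu) * 10*log10( prod_h |X(f_h)|^2 ),
    computed as a sum of logs, H being the number of overtones below Nyquist."""
    N = len(X_mag)
    log_prod, h = 0.0, 1
    while True:
        fh = h * f0 * np.sqrt(1.0 + beta * h ** 2)
        if fh >= fs / 2:
            break
        k = int(round(fh / fs * N))                  # nearest DFT bin
        log_prod += 10.0 * np.log10(X_mag[k] ** 2 + 1e-12)
        h += 1
    H = h - 1
    return log_prod / H ** nu if H > 0 else -np.inf

def select_candidates(X_mag, fs, f0_grid, beta, nu=0.38, n_c=9):
    """Keep the n_c fundamental-frequency candidates with the largest
    product-spectrum values (a stand-in for the peak picking and local
    2-D refinement described in the text)."""
    scores = [product_spectrum(X_mag, fs, f0, beta, nu) for f0 in f0_grid]
    order = np.argsort(scores)[::-1][:n_c]
    return [f0_grid[i] for i in order]
```

On a synthetic spectrum with peaks at the overtones of a single note, the true F0 scores well above unrelated candidates, which is the property the selection stage relies on.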

Fig. 2. Estimation of spectral envelopes: the AR estimation performs a better fitting of the amplitudes for the correct model (left) than for the sub-octave model (right). (a) True note: A♭4, selected model: A♭4 (ln ρ(Â_1) = −0.47). (b) True note: A♭4, selected model: A♭3 (ln ρ(Â_1) = −2.7). Each panel shows |X(f)|²/N, the envelope θ̂_1 and the amplitudes |α̂_{h1}|² as functions of f (Hz).

As c is a constant w.r.t. A_p, the optimization consists in maximizing ρ(A_p), which measures the spectral flatness (i.e. the ratio between the geometrical and arithmetical means) of {|α_{hp}|² |A_p(e^{2iπ f_{hp}})|²}_{1≤h≤H_p}. This quantity lies in [0; 1] and is maximum when the latter coefficients are constant, i.e. when the filter A_p perfectly whitens the amplitudes α_p. The estimation of A_p from the discrete data α_p thanks to the Discrete All-Pole (DAP) method [22] leads to a solution that actually maximizes ρ(A_p), as proved in [23]. It is applied here to the amplitudes α_p, for each note p.

The estimation of the spectral envelope of a note is illustrated in Fig. 2. A note is analyzed by considering two possible models: the model related to the true note (Fig. 2(a)) and the model related to its sub-octave (Fig. 2(b)). In the former case, the estimate of the spectral envelope is close to the amplitudes of the overtones, whereas in the latter case, the AR spectral envelope model is not adapted to amplitudes that are alternately high and low: the low values obtained for the spectral flatness ρ (and, consequently, for the likelihood) are here a good criterion to reject wrong models like the sub-octave (see Fig. 6(a) for the whole spectral flatness curve).

2) Estimation of the amplitudes of the overtones: let us now assume that the spectral envelope parameters θ are known, that the frequencies of the overtones may overlap, and that the amplitudes α are unknown. In all that follows, we assume that at a given overtone frequency f_{hp}, the power spectrum of the noise is not significant in comparison with the power spectrum of the amplitudes of the overtone.

When the overtone h of note p is not overlapping with any other overtone, the amplitude α_{hp} is directly given by the spectrum value X(f_{hp}). In the alternative case of overlapping overtones, the observed spectrum results from the contribution of overtones with close frequencies. We propose an estimate of the hidden random variable α, given the observation X and the parameters θ that control the second-order statistics of α.

As defined by eq. (3), the amplitude α_{hp} is a random variable such that α_{hp} ∼ N(0, v_{hp}²) with v_{hp}² ≜ σ_p² / |A_p(e^{2iπ f_{hp}})|². One can rewrite the sound model x as a sum of K sinusoids and noise x(t) = Σ_{k=1}^{K} α_k e^{2iπ f_k t} + x_b(t) with α_k ∼ N(0, v_k²) and K = 2 Σ_{p=1}^{P} H_p. In this formula, the couples of indexes (h, p) related to overtones and notes have been replaced by a single index k. In the spectral domain, the DFT X of the N-length frame x using a weighting window w of DFT W is X(f) = Σ_{k=1}^{K} α_k W(f − f_k), the noise spectrum being removed since we are only interested in the values of X at the frequencies f_{hp}, where the noise term is insignificant.

In these conditions, for 1 ≤ k_0 ≤ K, the optimal linear estimator α̂_{k_0} of α_{k_0} as a function of X(f_{k_0}), obtained by minimizing the mean squared error ε_{k_0} = E[|α_{k_0} − α̂_{k_0}|²], is

    α̂_{k_0} = ( W*(0) v_{k_0}² / Σ_{k=1}^{K} |W(f_{k_0} − f_k)|² v_k² ) X(f_{k_0})    (12)

The proof of eq. (12) is given in Appendix B. By ignoring non-overlapping overtones, the expression of the estimator of α_{hp} can be simplified as

    α̂_{hp} ≜ ( W*(0) v_{hp}² / Σ_{|f_{h′p′} − f_{hp}| < Δ_w} |W(f_{h′p′} − f_{hp})|² v_{h′p′}² ) X(f_{hp})    (13)

where Δ_w is the width of the main lobe of W.

The amplitude estimation is illustrated in Fig. 3 on a synthetic signal composed of two octave-related notes, i.e. with overlapping spectra. As expected, the amplitudes are perfectly estimated when there is no overlap (see odd-order peaks of note 1). When the overtones overlap, predominant amplitudes are well estimated (see note 1 for f = 0.04 or note 2 for f = 0.4), as well as amplitudes with the same order of magnitude (see amplitudes at f = 0.2). The estimation may be less accurate when an amplitude is much weaker than the other (see note 1 at f = 0.36).

3) Iterative algorithm for the estimation of note parameters: in the last two parts, we have successively seen how to estimate the spectral envelope models from known amplitudes and the amplitudes from known spectral envelope models. None of these quantities are actually known, and both must be estimated. We propose to use the two methods above in Algorithm 1 to iteratively estimate both the spectral envelope and the amplitudes of the overtones. It alternates the estimation of the former and of the latter, with an additional thresholding of the weak amplitudes in order to avoid the amplitude estimates becoming much lower than the observed
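The Wiener-style estimator of eq. (12) can be sketched directly, here with a periodic Hann window whose DTFT is evaluated from its definition. The function names and toy setup are assumptions; in the two-sinusoid example below the components are well separated, so the estimates essentially reduce to X(f_k)/W(0), as the text predicts for non-overlapping overtones.

```python
import numpy as np

def hann_dtft(f, N):
    """DTFT of a length-N (periodic) Hann window at normalized frequency f,
    evaluated directly as sum_t w(t) e^{-2i*pi*f*t}."""
    t = np.arange(N)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * t / N)
    return np.sum(w * np.exp(-2j * np.pi * f * t))

def mmse_amplitudes(X_at_fk, freqs, variances, N):
    """Optimal linear estimates of the sinusoid amplitudes from the windowed
    spectrum values X(f_k) (eq. (12)): each estimate is
    W*(0) v_k^2 / sum_j |W(f_k - f_j)|^2 v_j^2, times X(f_k)."""
    W0 = hann_dtft(0.0, N)
    est = []
    for k, fk in enumerate(freqs):
        denom = sum(abs(hann_dtft(fk - fj, N)) ** 2 * variances[j]
                    for j, fj in enumerate(freqs))
        est.append(np.conj(W0) * variances[k] / denom * X_at_fk[k])
    return np.array(est)
```

Restricting the inner sum to frequencies within the main-lobe width Δ_w, as in eq. (13), only changes which terms enter `denom`.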

Fig. 3. Estimation of amplitudes from overlapping spectra: example on a synthetic mixture of 2 notes. The F0s f_{0,1} = 0.02 and f_{0,2} = 0.04 of the two notes are in an octave relation and the estimation is performed from an N = 2048-length observation. For each note, the true amplitudes (squares) have been generated from a spectral envelope AR model (black line) using eq. (3). The amplitudes are estimated (crosses) from the observation of the mixture (grey line) using the AR envelope information (eq. (12)). (a) Note 1: periodogram of the mixture, AR envelope model of note 1, |α_{h1}|² (true amplitudes), |α̂_{h1}|² (estimation). (b) Note 2: periodogram of the mixture, AR envelope model of note 2, |α_{h2}|² (true amplitudes), |α̂_{h2}|² (estimation). Both panels are plotted against normalized frequency.

Fig. 4. Iterative estimation of amplitudes α̂_p and spectral envelopes θ̂_p: example of a fifth and of an octave. (a) True notes: G5+D6, selected model: G5+D6. (b) True notes: F#4+F#5, selected model: F#4+F#5. Each panel shows |X|²/N, the envelopes θ̂_1, θ̂_2 and the amplitudes α̂_{h1}, α̂_{h2} as functions of f (Hz).

spectral coefficients⁵. For p ∈ ⟦1; P⟧ and h ∈ ⟦1; H_p⟧, the estimates α̂_{hp} of α_{hp} and θ̂_p of θ_p are respectively the values of α̂_{hp}^{(i)} and θ̂_p^{(i)} after a number of iterations. Note that the ML estimates of the envelopes are directly related to the MAP estimation of the multipitch contents, while the amplitudes are estimated in the least mean square sense, which is a different objective function. Consequently, the convergence of the iterative algorithm is not proved or guaranteed, but it is observed after about N_it ≜ 20 iterations.

Algorithm 1 Iterative estimation of note parameters
Require: spectrum X, notes C_1, ..., C_P.
  Initialize the amplitudes α̂_{hp}^{(0)} of the overtones to the spectral values X(f_{hp}) for p ∈ ⟦1; P⟧ and h ∈ ⟦1; H_p⟧.
  for each iteration i do
    for each note p do
      Estimate θ̂_p^{(i)} from α̂_p^{(i−1)}. {AR estimation}
    end for
    Jointly estimate all α̂_{hp}^{(i)} from X(f_{hp}) and θ̂_p^{(i)}. {eq. (13)}
    Threshold α̂_{hp}^{(i)} at a minimum value.
  end for
Ensure: estimates of α_{hp} and θ_p for p ∈ ⟦1; P⟧ and h ∈ ⟦1; H_p⟧.

The estimation is illustrated in Fig. 4 with two typical cases. While the energy is split between the overlapping overtones, the non-overlapping overtones help estimating the spectral envelopes, resulting in a good estimation in the case of a fifth (Fig. 4(a)), the octave being a more difficult – but successful – case (Fig. 4(b)).

⁵In our implementation, this threshold was set to the minimum observed amplitude min_f |X(f)|/√N.
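The alternation structure of Algorithm 1 can be sketched as follows. This is only a structural sketch under strong assumptions: the DAP-based AR fit is replaced by a log-domain moving-average smoother, and the amplitude update of eq. (13) is reduced to its single-note, no-overlap form; all names and the regularizer are hypothetical.

```python
import numpy as np

def smooth_envelope(mag2, width=3):
    """Stand-in for the AR (DAP) envelope fit: a moving average of the
    log-powers. Only the alternation structure matters here."""
    logm = np.log(mag2 + 1e-12)
    kernel = np.ones(width) / width
    pad = np.pad(logm, width // 2, mode="edge")
    return np.exp(np.convolve(pad, kernel, mode="valid"))

def iterate_note_parameters(X_at_overtones, n_iter=20, floor=1e-6):
    """Alternate envelope estimation and amplitude re-estimation, with the
    thresholding step of Algorithm 1. In this single-note, no-overlap toy
    case the amplitude update collapses to a shrinkage of the observed
    spectrum values by the envelope."""
    alpha = X_at_overtones.copy()                       # initialization step
    for _ in range(n_iter):
        env = smooth_envelope(np.abs(alpha) ** 2)       # envelope from amplitudes
        alpha = env / (env + floor) * X_at_overtones    # simplified eq. (13)
        alpha = np.where(np.abs(alpha) < floor, floor, alpha)  # thresholding
    return alpha, env
```

With well-separated, strong overtones the shrinkage is negligible and the estimates stay at the observed spectral values, as expected from the initialization of Algorithm 1.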

B. Estimation of noise parameters

We assume that the noise signal results from the circular filtering⁶ of a white centered Gaussian noise with variance σ_b² by a Q_b-order FIR filter parameterized by its transfer function B(z) = Σ_{k=0}^{Q_b} b_k z^{−k}, with b_0 = 1.

As we assumed that in a bin close to the frequency of an overtone, the spectral coefficients of the noise are negligible w.r.t. the spectral coefficients of the notes, the information related to the noise is mainly observable in the remaining bins, i.e. on the frequency support F_b defined in eq. (14).

⁶The circularity assumption, which is commonly used and asymptotically valid, leads to a simplified ML solution.

Fig. 5. Noise parameter estimation when selecting the true model (left) and the octave model (right). (a) True note: E♭4, selected model: E♭4 (ln ρ_b(B̂) = −1.3). (b) True note: E♭4, selected model: E♭5 (ln ρ_b(B̂) = −3.6). Each panel shows X(F̄_b), X(F_b) and the MA estimation θ̂_b as functions of f (Hz); F̄_b denotes the complement of F_b.

    F_b ≜ { k/N_fft : ∀p ∈ ⟦1; P⟧, ∀h ∈ ⟦1; H_p⟧, |k/N_fft − f_{hp}| > Δ_w/2 }    (14)

where Δ_w denotes the width of the main lobe of w (Δ_w = 4/N for a Hann window) and N_fft is the size of the DFT. The noise parameter estimation is then performed using this subset of bins, which gives satisfying results asymptotically (i.e. when N → +∞, the number of removed bins being much smaller than the total number of bins). As proved in Appendix B, it results in the maximization, w.r.t. B, of the expression

    L_b(σ̂_b², B) = c_b + (|F_b|/2) ln ρ_b(B)    (15)

where

    c_b ≜ −(|F_b|/2) ln 2πe − (1/2) Σ_{f∈F_b} ln ( |X_b(f)|² / N )    (16)

    ρ_b(B) ≜ [ Π_{f∈F_b} |X_b(f)/B(e^{2iπf})|² ]^{1/|F_b|} / [ (1/|F_b|) Σ_{f∈F_b} |X_b(f)/B(e^{2iπf})|² ]    (17)

    σ̂_b² = (1/|F_b|) Σ_{f∈F_b} (1/N) |X_b(f)/B(e^{2iπf})|²    (18)

As for the spectral envelope AR parameters (eqs. (11) and (8)), the ML estimation of the noise parameters consists in the maximization of a spectral flatness ρ_b(B). It can be achieved thanks to the MA estimation approach proposed in [23]. However, in order to speed up the computations, we use the Algorithm 3 described in Appendix C, which is simpler and gives satisfying results for the current application.

The noise parameter estimation is illustrated in Fig. 5. When the true model is selected (Fig. 5(a)), most of the primary lobes of the sinusoidal components are removed from F_b and the resulting spectral coefficients are well fitted by an MA envelope. In the case a wrong model is selected and a lot of F_b bins are related to sinusoids (Fig. 5(b)), the MA model is not adapted to the observations, which are not well fitted. The resulting ρ_b value and likelihood will be low, so that the wrong model will be rejected.

As shown in Fig. 6, the spectral flatness criteria for the spectral envelope and noise estimations are complementary cues for rejecting wrong models and selecting the right one. The criterion on spectral envelopes (Fig. 6(a)) shows many high values but also large minima for sub-harmonic errors (e.g. notes D♭3, A♭3 and D♭3 in this example). Conversely, the criterion for the noise model (Fig. 6(b)) has a few high peaks located at the sub-harmonic F0s of the true note – i.e. for every model where all the sinusoids have been removed from the noise observation – but is efficient for discriminating the other errors. Thus, if taken separately, these criteria are not good pitch estimators, but they efficiently combine into the detection function.

Fig. 6. ln ρ(Â_1) (top) and ln ρ_b(B̂) (bottom) as a function of a single-note model (true note: A♭4), plotted against the note (MIDI) index; the true note and its ±octave are marked.
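The two building blocks of this noise criterion, the bin selection of eq. (14) and the flatness ratio of eqs. (10) and (17), can be sketched compactly. This is an illustrative sketch: the function names are assumptions, and the whitening by B is omitted (i.e. B = 1), so `spectral_flatness` is applied directly to the selected power coefficients.

```python
import numpy as np

def noise_bins(n_fft, overtone_freqs, mainlobe_width):
    """Indices of the DFT bins kept for noise estimation (eq. (14)): bins
    farther than half the window main-lobe width from every overtone
    (normalized frequencies, positive half of the spectrum only)."""
    k = np.arange(n_fft // 2)
    f = k / n_fft
    keep = np.ones_like(k, dtype=bool)
    for fh in overtone_freqs:
        keep &= np.abs(f - fh) > mainlobe_width / 2
    return k[keep]

def spectral_flatness(power):
    """Geometric over arithmetic mean, the ratio maximized in eqs. (10) and
    (17); equals 1 for a flat spectrum and is smaller otherwise."""
    power = np.asarray(power, dtype=float)
    geo = np.exp(np.mean(np.log(power + 1e-300)))
    return geo / np.mean(power)
```

A flat power sequence gives a flatness of 1, while a sequence dominated by a few residual sinusoidal peaks gives a much smaller value, which is exactly how a wrong note model is penalized.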

IV. EXPERIMENTAL RESULTS either recorded by using an upright Disklavier piano or A. Implementation and tuning generated by several virtual piano software products based on sampled sounds. The development set and the test set The whole method is summarized by Algorithm 2. The MPE are disjointed. Two pianos are used in the former while the algorithm is implemented in Matlab and C. It is designed to latter comprises sounds from other five pianos. In total, two analyze a 93ms frame and to estimate its polyphonic content. upright pianos and five grand pianos were used. Recording A preprocessing stage aims at reducing the spectral dynamics conditions and tunings vary from one instrument to the other. and consists in flattening the global spectral decrease by Polyphony levels lie between 1 and 6 and notes are uniformly means of a median filtering of the spectrum. The values of distributed between C (65Hz) and B (1976Hz). One part the parameters of the MPE algorithm are given in Table I. 2 6 of the polyphonic mixtures is composed of randomly related The order Q of the spectral envelope filter A (z) of a note p p pitches whereas the other part comprises usual chords from p is set to Q = H /2 in order to be large enough to p p western music (major, minor, etc.). For each sound, a single fit the data and small enough to obtain a smooth spectral 7 93ms-frame located 10ms after the onset time is extracted and envelope . Parameters kσ2 , Eσ2 , k 2 , E 2 , µenv, µb, µpol p p σb σb analyzed. Two additional algorithms have been tested on the and (w1,...,w4) have³ been estimated´ ³ on´ a development same database for comparison purposes. The first one is Tolo- database composed of about 380 mixtures from two different nen’s multipitch estimator [24] for which the implementation pianos, with polyphony levels from 1 to 6. They were jointly from the MIR Toolbox [25] was used. 
As the performance of adjusted by optimizing the F-measure using a grid search, the the algorithm decreases when F0s are greater than 500Hz (C5), bounds of the grid being set by hand. the system was additionally tested in restricted conditions – denoted Tolonen-500 – by selecting from the database the Algorithm 2 Multipitch estimation. subset of sounds composed of notes between C2 and B4 only. Preprocessing The second one is Klapuri’s system [26]. The code of the latter 1 2 Computation of periodogram N |X| was provided by its author. Selection of Nc note candidates Our algorithm was implemented in Matlab and C. The com- for each possible note combination C do putational cost on a recent PC is about 150×real time. It thus Iterative estimation of the amplitudes of overtones and of requires more computations than other algorithms like [26] spectral envelope of notes (algorithm 1) but the proposed algorithm is computationally tractable, par- For each note p, derivation of Lp (eq. (8)) and of prior ticularly when comparing it to the greedy and intractable 2 p σp joint-estimation approach where no note candidate selection Noise³ ´ parameter estimation (algorithm 3) is performed, as discussed in section II-B. c 2 General results are presented in Fig. 7. Relevant items Computation of Lb (eq. (15)) and of prior p σb are defined as correct notes after rounding each F0 to the Derivation of the detection function related³ to the´ note combination (eq. (7)) c nearest half-tone. Typical metrics are used: the recall is the end for ratio between the number of relevant items and of original Maximization of the detection function items; the precision is the ratio between the number of relevant items and of detected items; and the F-measure is the harmonic mean between the precision and the recall. 
TABLE I
PARAMETERS OF THE ALGORITHM

  P_max = 6            (k_{σ_p²}, E_{σ_p²}) = (1, 10⁻⁴)
  N_c = 9              (k_{σ_b²}, E_{σ_b²}) = (2, 10⁻³)
  ν = 0.38             µ_env = −8.9·10⁻³
  f_s = 22050 Hz       µ_b = −2.2·10⁻⁴
  N = 2048             µ_pol = 25
  w = Hann             w_1 = 8.1·10⁻¹
  Q_p = H_p/2          w_2 = 1.4·10⁴
  Q_b = 20             w_3 = 6.2·10²
  N_it = 20            w_4 = 5.8

In this context, our system achieves the best results for polyphony 1 and 2: 93% F-measure versus 85% (polyphony 1) and 91% (polyphony 2) for Klapuri's system. The trend then reverses between Klapuri's system and the proposed one, the F-measure being respectively 91% and 88% for polyphony 3, and 72% and 63% for polyphony 6. Moreover, the precision is high for all polyphony levels, whereas the recall decreases when the polyphony increases. Results from Tolonen's system are weaker for all the tests, even with the restricted F0-range configuration.

The ability of Klapuri's system and ours to detect polyphony levels (independently of the pitches) is presented in Table II. For polyphony levels from 1 to 5, both systems detect the correct polyphony level more often than any other level. It should also be noted that the proposed system was tested in more difficult conditions, since polyphony 0 (i.e. silence) may be detected, which is the case for a few sounds⁹. The proposed system tends to underestimate the polyphony level.

⁷ Note that the number of degrees of freedom of the AR models is H_p/4 = Q_p/2, since half the poles are assigned to the positive frequencies, the remaining poles being their complex conjugates.
⁸ MAPS stands for MIDI Aligned Piano Sounds and is available on request.
⁹ Note that silence detection could be improved by using an activity detector as a preprocessing stage, since silence is a very specific kind of signal that can be detected using more specific methods than the proposed one.
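Algorithm 2 loops over every combination of at most P_max = 6 notes drawn from the N_c = 9 selected candidates before maximizing the detection function. The enumeration it implies can be sketched as below; `detection_score` is a hypothetical stand-in for the weighted log-likelihood of eq. (7), used only to make the example runnable.

```python
from itertools import combinations

P_MAX = 6  # maximum polyphony (Table I)

def best_combination(candidates, detection_score):
    """Score every subset of 1..P_MAX candidate notes and return the
    maximizer of the detection function, plus the number of subsets tested."""
    best, best_score, n_tested = None, float("-inf"), 0
    for p in range(1, P_MAX + 1):
        for combo in combinations(candidates, p):
            n_tested += 1
            s = detection_score(combo)
            if s > best_score:
                best, best_score = combo, s
    return best, n_tested

# Hypothetical score rewarding the notes of a C major chord (MIDI 60, 64, 67):
target = {60, 64, 67}
score = lambda combo: len(target & set(combo)) - len(set(combo) - target)
cands = [55, 57, 59, 60, 62, 64, 65, 67, 69]  # N_c = 9 candidate notes
best, n_tested = best_combination(cands, score)
```

With N_c = 9 and P_max = 6, already 465 combinations must be scored per frame, which illustrates why the candidate selection stage matters.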

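The preprocessing stage of Algorithm 2 flattens the global spectral decrease by median filtering. A minimal sketch of that idea follows; the window length and the division-based flattening are illustrative assumptions, not the paper's exact settings.

```python
from math import exp
from statistics import median

def flatten_spectrum(mag_spectrum, half_width=30):
    """Divide the magnitude spectrum by a running median: the median tracks
    the smooth global decrease while mostly ignoring narrow sinusoidal peaks,
    so the division reduces the spectral dynamics."""
    n = len(mag_spectrum)
    flat = []
    for k in range(n):
        window = mag_spectrum[max(0, k - half_width):min(n, k + half_width + 1)]
        background = max(median(window), 1e-12)  # avoid division by zero
        flat.append(mag_spectrum[k] / background)
    return flat

# Decaying spectral envelope with one partial at bin 100:
spec = [exp(-k / 200.0) for k in range(512)]
spec[100] += 5.0
flat = flatten_spectrum(spec)
```

After flattening, the smooth background sits near unit level while the partial still stands out clearly.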
[Fig. 7. Multipitch estimation results with unknown polyphony: for each algorithm (Tolonen, Tolonen-500, Klapuri, proposed method), (a) the F-measure, (b) the recall and (c) the precision are plotted as functions of the true polyphony (1 to 6). For the proposed method, the performance on the training set is plotted using a dashed line. Figure not reproduced here.]

TABLE II
POLYPHONY ESTIMATION: DETECTION RATE (%) W.R.T. THE NUMBER OF SOUNDS WITH TRUE POLYPHONY P, AS A FUNCTION OF THE TRUE POLYPHONY P AND OF THE ESTIMATED POLYPHONY P_est

  Proposed method                       Klapuri's method
  P_est\P   1   2   3   4   5   6       P_est\P   1   2   3   4   5   6
  0         7   2   4   6   9   0       0         0   0   0   0   0   0
  1        90  14   8   8   8  12       1        92  21   5   2   1   3
  2         3  83  18  12   8  19       2         1  74  18   7   4   7
  3         0   1  68  27  16  25       3         1   3  65  24  10  12
  4         0   0   2  47  27  32       4         0   1   9  49  24  27
  5         0   0   0   1  31  11       5         1   0   1  14  36  23
  6         0   0   0   0   1   1       6         5   1   1   4  25  28

The polyphony underestimation noted above comes from the parameter tuning, which consists in optimizing the F-measure on the development set. This objective function could have been changed to take the polyphony level balance into account, which would reduce the underestimation trend; however, the overall F-measure would then decrease. From a perceptual point of view, it has been shown that a missing note is generally less annoying than an added note when listening to a resynthesized transcription [27]. Thus, underestimating the polyphony may be preferred to overestimating it. Still, this trend turns out to be the main shortcoming of the proposed method, and should be fixed in the future to efficiently address sounds with polyphony higher than 5 notes.

The case of octave detection has been analyzed and the results are presented in Fig. 8. Octave detection is one of the most difficult cases of multipitch estimation, and the results are lower than the global results obtained for polyphony 2 (see Fig. 7). The proposed method reaches the highest results here, the F-measure being 81%, versus 77% for Klapuri's system and 77%/66% for Tolonen's (depending on the two F0 ranges). This suggests that the proposed models for spectral envelopes and overlapping spectra are significantly efficient.

[Fig. 8. Octave detection: recall, precision, F-measure and octave detection rate for the 97 octave sounds extracted from the database (45 sounds for the Tolonen-500 system). Figure not reproduced here.]

Finally, we report additional results which are not depicted. First, if the polyphony is known, the performance of our system reaches the second rank after Klapuri's system, with correct detection rates from 95% (polyphony 1) down to 55% (polyphony 6). Second, F-measure values are between 5% and 10% better for usual chords than for random-pitch chords; this has been observed with all the tested methods for polyphony levels higher than two. While the algorithms face more harmonically related notes in usual chords – i.e. more spectral overlap – it seems that simultaneous notes with widely spread F0s in random chords are a bigger difficulty. Third, the results obtained on the development set and on the test set are comparable, with only a few percent deviation. This shows that the parameter learning does not overfit and that the system generalizes well to other piano models and different recording conditions. Fourth, the whole database includes sounds from seven different pianos. The results are comparable from one piano to another, with about a 3% standard deviation. Results do not significantly depend on the upright/grand piano differences, on the recording conditions, or on whether a real piano or a software-based one is used, which suggests robustness to varying production conditions. Fifth, the proposed method uses a candidate selection stage which may fail, causing the subsequent multipitch estimation to fail. The performance of this stage has thus been checked: satisfying results are obtained for low to medium polyphony levels, 99% of the true notes being selected for polyphony 1 and 2, 94% for polyphony 3 and 86% for polyphony 4. Performance slightly decreases for polyphony 5 and 6, the scores being 78% and 71%, respectively. Along the original-pitch dimension, these errors are mainly located below A2 (110 Hz) and above A6 (1760 Hz). Selecting a set of likely F0s is a challenging issue for future works, since many MPE algorithms would benefit from it. Indeed, they often consist in optimizing an objective criterion that presents many local optima along the pitch dimension. Some basic solutions like exhaustive or grid search [27], or more elaborate ones like Markov chain Monte Carlo methods [28], have already been proposed for searching the global optimum, and an efficient candidate selection stage would be very helpful to reach the result more quickly.

C. Integration in an automatic transcription system

The proposed MPE method has been introduced for a single-frame analysis. Beyond the multipitch issue, the automatic transcription task is addressed by integrating this method in a system that performs pitch tracking over time and outputs not only pitches but note events. In the current case, the MPE method can be used in the transcription framework proposed in [19]. The automatic transcription of piano music is thus obtained by:

1) using an onset detector to localize new events;
2) detecting a set of pitch candidates after each onset;
3) between two consecutive onsets, decoding the multipitch contents thanks to a frame-based Hidden Markov Model (HMM), the states being the possible combinations of simultaneous pitches, the likelihood being given by the proposed MPE method (eq. (7));
4) postprocessing the decoded pitches to detect repetitions or continuations of pitches when an onset occurs.

This transcription system was presented at the Multiple Fundamental Frequency Estimation & Tracking task of the MIREX¹⁰ contest in 2008 (system called EBD2). For the piano-only test set, the evaluation is performed in two different ways, depending on how a correct note is defined. In the first, more constraining evaluation – where a correct note implies a correct onset (up to a 50-ms deviation), a correct offset (up to a 20%-duration or 50-ms deviation) and a correct pitch (up to a quarter-tone deviation) – the proposed system reaches the 3rd rank with a 33.1% F-measure, out of 13 systems scoring from 6.1% to 36.8%. For that evaluation, the duration estimation is assessed by the average overlap between correct estimations and reference notes: the proposed method reaches the 7th position with 80%, the values lying in a narrow interval ([77.4%; 84%]). In the second evaluation, where correctness only requires the onset and pitch criteria above, the proposed system reaches the 7th rank with a 56.9% F-measure, out of 13 systems scoring from 24.5% to 75.7%. The best rank is obtained for the average overlap, with a 61% value, the lowest value being 40.1%.

Thus, when the proposed MPE method is integrated in a full transcription system for piano music, state-of-the-art results are reached in terms of global performance. In addition, good average overlap scores show that the proposed method is suitable for pitch tracking over time and note extraction. Note that one can wonder whether the reported performance differences are significant when comparing MPE and ATM quantitative results (with roughly a 20% F-measure deviation, whatever the method). The answer actually depends not only on the database, but also on the different testing protocols. While isolated frames composed of one chord are used in the proposed MPE evaluation, ATM evaluation implies other difficulties, such as having asynchronous notes overlapping in time, detecting onsets, estimating the end of decaying notes, dealing with reverberation tails, and so on. Moreover, polyphony levels in musical recordings are often high and are not uniformly distributed at all: a 4.5 average polyphony with a 3.1 standard deviation is reported for a number of classical music pieces [29, p. 114]. Hence, F-measures for MPE and ATM should be evaluated separately.

¹⁰ http://www.music-ir.org/mirex/2008/

V. CONCLUSIONS

In this paper, a sound model and a method were proposed for the multipitch estimation problem in the case of piano sounds. The approach was based on an adaptive scheme, including the adaptive matching of the inharmonic distribution of the frequencies of the overtones and an autoregressive spectral envelope. We have shown the advantage of using a moving-average model for the residual noise. The estimation of the parameters was performed by taking the possible overlap between the spectra of the notes into account. Finally, a weighted likelihood criterion was introduced to determine the estimated mixture of notes in the analyzed frame among a set of possible mixtures.

The performance of the method was measured on a large database of piano sounds and compared to the state of the art. The proposed method provides satisfying results when the polyphony is unknown, and reaches particularly good scores for mixtures of harmonically related notes.

This approach has been successfully integrated in a full transcription system [19], and alternative investigations [30] are ongoing. Future works may deal with the use of different spectral envelopes in a similar modeling and estimation framework. Indeed, while the proposed AR envelope model may be too generic to characterize the spectral envelope smoothness of musical instruments, this information can easily be replaced, like the variances of the Gaussian amplitude models, by any parametric envelope model. The method may also be extended to other instruments or to the voice. Keeping the proposed sound model, it could be useful to investigate other estimation strategies in order to reduce the computational cost; this may be achieved by means of a hybrid approach with an iterative selection of the notes and a joint estimation criterion. Furthermore, the log-likelihood function related to the model for spectral envelopes is not very selective; hence, investigations on more appropriate functions could improve the performance of the decision step. Finally, one could investigate how to prune the set of all possible note combinations more efficiently, for instance with iterative candidate selection or MCMC methods.

APPENDIX A
DISCUSSION ON THE DETECTION FUNCTION

According to eq. (5), the multipitch estimation is theoretically obtained by maximizing, w.r.t. C, the likelihood p(x|C), which is expressed as a function of the parameters of the proposed model as:

p(x|C) = \iint p(x, \alpha, \theta_b \mid C) \, d\alpha \, d\theta_b   (19)
       = \iint p(x \mid \alpha, \theta_b, C) \, p(\alpha \mid C) \, p(\theta_b) \, d\alpha \, d\theta_b   (20)
       = \iiint p(x \mid \alpha, \theta_b, C) \, p(\alpha \mid \theta, C) \, p(\theta) \, p(\theta_b) \, d\theta \, d\alpha \, d\theta_b   (21)

Since the derivation of eq. (21) is not achievable, a simpler detection function must be built. The proposed function can be interpreted as a weighted log-likelihood and is obtained from the integrand of eq. (21) by:

• considering the estimated parameters θ̂_p, α̂_p, θ̂_b;
• taking the logarithm, thus turning the product into a sum;
• normalizing the sums over the number of notes by P, and the log-likelihoods L_p and L_b by the sizes H_p and |F_b| of their respective random variables;
• introducing penalization terms using the C-dependent sizes P, H_p, |F_b| with coefficients µ_pol, µ_env, µ_b;
• weighting each quantity by the coefficients w_1, ..., w_4.

The normalization operation aims at obtaining data that can be compared from one mixture C to another, since the dimension of the random variables depends on the mixture. The penalization is inspired by model selection approaches [31], in which an integral like (21) is replaced by a likelihood penalized by a linear function of the model order. The weighting by w_1, ..., w_4 is finally used to find an optimal balance between the various probability density functions. Note that in eq. (7), the priors related to the filters A_p and B have been removed: since they are non-informative, they can be considered as constant terms.

APPENDIX B
PROOFS

Proof of eq. (8): Using eq. (3), the log-likelihood to be maximized is

L_p(\sigma_p^2, A_p) = -\frac{H_p}{2} \ln(2\pi\sigma_p^2) + \frac{1}{2} \sum_{h=1}^{H_p} \ln\left|A_p\left(e^{2i\pi f_{h,p}}\right)\right|^2 - \frac{1}{2\sigma_p^2} \sum_{h=1}^{H_p} |\alpha_{h,p}|^2 \left|A_p\left(e^{2i\pi f_{h,p}}\right)\right|^2   (22)

The maximization w.r.t. σ_p² leads to the estimate σ̂_p² given by eq. (11), which is used in eq. (22) to obtain eq. (8).

Proof of eqs. (12) and (13): The expected linear estimator is expressed as α̂_{k0} ≜ η X(f_{k0}) with η ∈ ℂ. The mean squared error is then ε_{k0}(η) = E[|α_{k0} − η X(f_{k0})|²]. The optimal value η̂ is such that dε_{k0}/dη(η̂) = 0, which is equivalent to the decorrelation between the error (α_{k0} − η̂ X(f_{k0})) and the data X(f_{k0}):

0 = E\left[\left(\alpha_{k_0} - \hat{\eta}\, X(f_{k_0})\right)^* X(f_{k_0})\right] = E\left[\alpha_{k_0}^* X(f_{k_0})\right] - \hat{\eta}^*\, E\left[|X(f_{k_0})|^2\right]   (23)

which leads to η̂ = E[α_{k0} X*(f_{k0})] / E[|X(f_{k0})|²], where

E\left[\alpha_{k_0} X^*(f_{k_0})\right] = \sum_{k=1}^{K} W^*(f_{k_0} - f_k)\, E\left[\alpha_{k_0} \alpha_k^*\right] = W^*(0)\, v_{k_0}   (24)

and

E\left[|X(f_{k_0})|^2\right] = \sum_{k=1}^{K} \sum_{l=1}^{K} W(f_{k_0} - f_k)\, W^*(f_{k_0} - f_l)\, E\left[\alpha_k \alpha_l^*\right] = \sum_{k=1}^{K} |W(f_{k_0} - f_k)|^2\, v_k   (25)

The related error is

\epsilon_{k_0}(\hat{\eta}) = E\left[|\alpha_{k_0}|^2\right] + |\hat{\eta}|^2 E\left[|X(f_{k_0})|^2\right] - \hat{\eta}\, E\left[\alpha_{k_0}^* X(f_{k_0})\right] - \hat{\eta}^*\, E\left[\alpha_{k_0} X^*(f_{k_0})\right]
                          = E\left[|\alpha_{k_0}|^2\right] - \frac{\left|E\left[\alpha_{k_0} X^*(f_{k_0})\right]\right|^2}{E\left[|X(f_{k_0})|^2\right]}
                          = \left(1 - \frac{|W(0)|^2\, v_{k_0}}{\sum_{k=1}^{K} |W(f_{k_0} - f_k)|^2\, v_k}\right) v_{k_0}   (26)

Note that this approach has some connections with the Wiener filtering technique [32] widely used in the field of audio source separation. In both cases, the aim is to estimate a hidden variable thanks to its second-order statistics and to the observations. In the considered short-term analysis context, the proposed approach explicitly models the overlap phenomenon and takes the resulting frequency leakage into account. As expected, the estimation error is all the larger as the estimated overtone is masked by another component, which justifies the approximation of eq. (12) by eq. (13).

Proof of eq. (15): In matrix form, the filtering process is expressed as x_b = B_circ w_b, where x_b is an N-length frame of the noise process, w_b ∼ N(0, I_N σ_b²) and B_circ is the N × N circulant matrix with first column (b_0, ..., b_{Q_b}, 0, ..., 0)ᵀ. Thus, x_b ∼ N(0, B_circ B_circ† σ_b²). Since B_circ is circulant, det(B_circ B_circ†) = \prod_{k=0}^{N-1} |B(e^{2i\pi k/N})|^2. The likelihood of x_b is then

p(x_b) = \frac{\exp\left(-\frac{1}{2}\, x_b^{\dagger} \left(B_{circ} B_{circ}^{\dagger} \sigma_b^2\right)^{-1} x_b\right)}{\sqrt{(2\pi)^N \det\left(B_{circ} B_{circ}^{\dagger} \sigma_b^2\right)}}   (27)
       = \frac{\exp\left(-\frac{\left\|B_{circ}^{-1} x_b\right\|^2}{2\sigma_b^2}\right)}{\sqrt{(2\pi\sigma_b^2)^N \prod_{k=0}^{N-1} \left|B\left(e^{2i\pi \frac{k}{N}}\right)\right|^2}}   (28)

In the spectral domain, the log-likelihood of the noise signal is thus obtained using the Parseval identity:

L_b(\sigma_b^2, B) = -\frac{N}{2} \ln(2\pi\sigma_b^2) - \frac{1}{2} \sum_{k=0}^{N-1} \ln\left|B\left(e^{2i\pi \frac{k}{N}}\right)\right|^2 - \frac{1}{2\sigma_b^2} \frac{1}{N} \sum_{k=0}^{N-1} \frac{\left|X_b\left(\frac{k}{N}\right)\right|^2}{\left|B\left(e^{2i\pi \frac{k}{N}}\right)\right|^2}   (29)

When considering only the observations on the spectral bins F_b, the log-likelihood (29) becomes

L_b(\sigma_b^2, B) = -\frac{|F_b|}{2} \ln(2\pi\sigma_b^2) - \frac{1}{2} \sum_{f \in F_b} \ln\left|B\left(e^{2i\pi f}\right)\right|^2 - \frac{1}{2\sigma_b^2} \frac{1}{N} \sum_{f \in F_b} \frac{|X_b(f)|^2}{\left|B\left(e^{2i\pi f}\right)\right|^2}   (30)

The maximization w.r.t. σ_b² leads to the estimate given by eq. (18), which is used in eq. (30) to obtain eq. (15).

APPENDIX C
MA ESTIMATION

The algorithm is based on the decomposition r_b = B b of the first (Q_b + 1) terms r_b of the autocorrelation function of the process, as a product of the coefficient vector b ≜ (b_0, ..., b_{Q_b})ᵀ by the matrix

B \triangleq \begin{pmatrix} b_0 & b_1 & \cdots & b_{Q_b} \\ 0 & b_0 & \cdots & b_{Q_b - 1} \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & b_0 \end{pmatrix}   (31)

The estimate b̂ of b is obtained using Algorithm 3. As B̂ is upper triangular, the estimation of b̂ in r̂_b = B̂ b̂ is quickly performed by back substitution, instead of inverting a full matrix. The convergence of the algorithm was observed after about 20 iterations.

Algorithm 3 Iterative estimation of the noise parameters.
  Estimate the autocorrelation vector r_b by the empirical correlation coefficients r̂_b obtained from the spectral observations {X(f)}_{f ∈ F_b}
  Initialize b̂ ← (1, 0, ..., 0)ᵀ
  for each iteration do
    Update the estimate B̂ of B from b̂
    Re-estimate b̂ by solving r̂_b = B̂ b̂
    Normalize b̂ by its first coefficient
  end for

REFERENCES

[1] L. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 1, pp. 24–33, 1977.
[2] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acous. Soc. Amer., vol. 111, no. 4, pp. 1917–1930, 2002.
[3] M. R. Schroeder, "Period histogram and product spectrum: New methods for fundamental-frequency measurement," J. Acous. Soc. Amer., vol. 43, no. 4, pp. 829–834, 1968.
[4] J. C. Brown, "Musical fundamental frequency tracking using a pattern recognition method," J. Acous. Soc. Amer., vol. 92, no. 3, pp. 1394–1402, 1992.
[5] B. Doval and X. Rodet, "Fundamental frequency estimation and tracking using maximum likelihood harmonic matching and HMMs," in Proc. Int. Conf. Audio Speech and Sig. Proces. (ICASSP), vol. 1, Apr. 1993, pp. 221–224.
[6] G. Peeters, "Music pitch representation by periodicity measures based on combined temporal and spectral representations," in Proc. Int. Conf. Audio Speech and Sig. Proces. (ICASSP), Toulouse, France, May 2006, pp. 53–56.
[7] V. Emiya, B. David, and R. Badeau, "A parametric method for pitch estimation of piano tones," in Proc. Int. Conf. Audio Speech and Sig. Proces. (ICASSP), Honolulu, HI, USA, Apr. 2007, pp. 249–252.
[8] A. Klapuri, "Multiple fundamental frequency estimation based on harmonicity and spectral smoothness," IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 804–816, 2003.
[9] C. Yeh, A. Robel, and X. Rodet, "Multiple fundamental frequency estimation of polyphonic music signals," in Proc. Int. Conf. Audio Speech and Sig. Proces. (ICASSP), Philadelphia, PA, USA, Mar. 2005, pp. 225–228.
[10] H. Kameoka, T. Nishimoto, and S. Sagayama, "A multipitch analyzer based on harmonic temporal structured clustering," IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 3, pp. 982–994, 2007.
[11] M. Christensen and A. Jakobsson, Multi-Pitch Estimation, ser. Synthesis Lectures on Speech and Audio Processing, B. Juang, Ed. Morgan and Claypool Publishers, 2009.
[12] C. Raphael, "Automatic transcription of piano music," in Proc. Int. Conf. Music Information Retrieval (ISMIR), Paris, France, Oct. 2002, pp. 15–19.
[13] M. Marolt, "A connectionist approach to automatic transcription of polyphonic piano music," IEEE Trans. Multimedia, vol. 6, no. 3, pp. 439–449, 2004.
[14] J. Bello, L. Daudet, and M. Sandler, "Automatic piano transcription using frequency and time-domain information," IEEE Trans. Audio, Speech and Language Processing, vol. 14, no. 6, pp. 2242–2251, Nov. 2006.
[15] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription. Springer, 2006.
[16] G. Poliner and D. Ellis, "A discriminative model for polyphonic piano transcription," EURASIP Journal on Advances in Signal Processing, vol. 8, pp. 1–9, 2007.
[17] K. Robinson and R. Patterson, "The duration required to identify the instrument, the octave, or the pitch chroma of a musical note," Music Perception, vol. 13, no. 1, pp. 1–15, 1995.
[18] L. I. Ortiz-Berenguer, F. J. Casajús-Quirós, and M. Torres-Guijarro, "Non-linear effects modeling for polyphonic piano transcription," in Proc. Int. Conf. Digital Audio Effects (DAFx), London, UK, Sep. 2003.
[19] V. Emiya, R. Badeau, and B. David, "Automatic transcription of piano music based on HMM tracking of jointly-estimated pitches," in Proc. Eur. Conf. Sig. Proces. (EUSIPCO), Lausanne, Switzerland, Aug. 2008.
[20] N. H. Fletcher and T. D. Rossing, The Physics of Musical Instruments. Springer, 1998.
[21] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of inharmonic sounds in colored noise," in Proc. Int. Conf. Digital Audio Effects (DAFx), Bordeaux, France, Sep. 2007, pp. 93–98.
[22] A. El-Jaroudi and J. Makhoul, "Discrete all-pole modeling," IEEE Trans. Signal Process., vol. 39, no. 2, pp. 411–423, 1991.
[23] R. Badeau and B. David, "Weighted maximum likelihood autoregressive and moving average spectrum modeling," in Proc. Int. Conf. Audio Speech and Sig. Proces. (ICASSP), Las Vegas, NV, USA, Mar.–Apr. 2008, pp. 3761–3764.
[24] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. Speech Audio Process., vol. 8, no. 6, pp. 708–716, 2000.
[25] O. Lartillot and P. Toiviainen, "A Matlab toolbox for musical feature extraction from audio," in Proc. Int. Conf. Digital Audio Effects (DAFx), Bordeaux, France, Sep. 2007, pp. 237–244.
[26] A. Klapuri, "Multiple fundamental frequency estimation by summing harmonic amplitudes," in Proc. Int. Conf. Music Information Retrieval (ISMIR), Victoria, Canada, Oct. 2006, pp. 216–221.
[27] A. Cemgil, H. Kappen, and D. Barber, "A generative model for music transcription," IEEE Trans. Audio, Speech and Language Processing, vol. 14, no. 2, pp. 679–694, 2006.
[28] P. Walmsley, S. Godsill, and P. Rayner, "Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters," in Proc. IEEE Work. Appli. Sig. Proces. Audio and Acous. (WASPAA), New Paltz, NY, USA, Oct. 1999, pp. 119–122.
[29] V. Emiya, "Transcription automatique de la musique de piano," Ph.D. dissertation, École Nationale Supérieure des Télécommunications, France, 2008.
[30] R. Badeau, V. Emiya, and B. David, "Expectation-maximization algorithm for multi-pitch estimation and separation of overlapping harmonic spectra," in Proc. Int. Conf. Audio Speech and Sig. Proces. (ICASSP), Taipei, Taiwan, Apr. 2009.
[31] P. Stoica and Y. Selen, "Model-order selection: a review of information criterion rules," IEEE Signal Process. Mag., vol. 21, no. 4, pp. 36–47, 2004.
[32] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series. The MIT Press, 1964.

Valentin Emiya was born in France in 1979. He graduated from Telecom Bretagne, Brest, France, in 2003 and received the M.Sc. degree in Acoustics, Signal Processing and Computer Science Applied to Music (ATIAM) at Ircam, France, in 2004. He received his Ph.D. degree in Signal Processing in 2008 at Telecom ParisTech, Paris, France. Since November 2008, he has been a post-doctoral researcher with the METISS group at INRIA, Centre Inria Rennes - Bretagne Atlantique, Rennes, France. His research interests focus on audio signal processing and include sound modeling and indexing, source separation, quality assessment and applications to music and speech.

Roland Badeau (M'02) was born in Marseille, France, on August 28, 1976. He received the State Engineering degree from the École Polytechnique, Palaiseau, France, in 1999, the State Engineering degree from the École Nationale Supérieure des Télécommunications (ENST), Paris, France, in 2001, the M.Sc. degree in applied mathematics from the École Normale Supérieure (ENS), Cachan, France, in 2001, and the Ph.D. degree from the ENST in 2005, in the field of signal processing. In 2001, he joined the Department of Signal and Image Processing, TELECOM ParisTech (ENST), as an Assistant Professor, where he became Associate Professor in 2005. His research interests include high resolution methods, adaptive subspace algorithms, audio signal processing, and music information retrieval.

Bertrand David (M'06) was born on March 12, 1967 in Paris, France. He received the M.Sc. degree from the University of Paris-Sud in 1991, and the Agrégation, a competitive French examination for the recruitment of teachers in the field of applied physics, from the École Normale Supérieure (ENS), Cachan. He received the Ph.D. degree from the University of Paris 6 in 1999, in the fields of musical acoustics and signal processing of musical signals. He formerly taught in a graduate school in electrical engineering, computer science and communication, and carried out industrial projects aiming at embedding a low-complexity sound synthesizer. Since September 2001, he has worked as an Associate Professor with the Signal and Image Processing Department, GET-Télécom Paris (ENST). His research interests include parametric methods for the analysis/synthesis of musical and mechanical signals, spectral parametrization and factorization, music information retrieval, and musical acoustics.