Multipitch Estimation of Piano Sounds Using a New Probabilistic Spectral Smoothness Principle Valentin Emiya, Roland Badeau, Bertrand David
Total Page:16
File Type:pdf, Size:1020Kb
Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle Valentin Emiya, Roland Badeau, Bertrand David To cite this version: Valentin Emiya, Roland Badeau, Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Au- dio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2010, 18 (6), pp.1643-1654. <http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=5356234>. <10.1109/TASL.2009.2038819>. <inria-00510392> HAL Id: inria-00510392 https://hal.inria.fr/inria-00510392 Submitted on 18 Aug 2010 HAL is a multi-disciplinary open access L'archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destin´eeau d´ep^otet `ala diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publi´esou non, lished or not. The documents may come from ´emanant des ´etablissements d'enseignement et de teaching and research institutions in France or recherche fran¸caisou ´etrangers,des laboratoires abroad, or from public or private research centers. publics ou priv´es. IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING 1 Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle Valentin Emiya, Member, IEEE, Roland Badeau, Member, IEEE, Bertrand David, Member, IEEE Abstract—A new method for the estimation of multiple con- and thus to an ill-posed problem, some knowledge is often current pitches in piano recordings is presented. It addresses the used for the further spectral modeling of the underlying issue of overlapping overtones by modeling the spectral envelope sources. For instance, when iterative spectral subtractionis em- of the overtones of each note with a smooth autoregressive model. For the background noise, a moving-average model is ployed, several authors have adopted a smoothness assumption used and the combination of both tends to eliminate harmonic for the spectral envelope of the notes [8–10]. and sub-harmonic erroneous pitch estimations. This leads to a Several preceding works [12–14; 16] specifically address complete generative spectral model for simultaneous piano notes, the case of the piano. One of the interests in studying a single which also explicitly includes the typical deviation from exact instrument is to incorporate in the algorithms some results harmonicity in a piano overtone series. The pitch set which maximizes an approximate likelihood is selected from among established from physical considerations [18; 19], in the hope a restricted number of possible pitch combinations as the one. that a more specific model will lead to a better performance, Tests have been conducted on a large homemade database called with the obvious drawback of narrowing the scope of the MAPS, composed of piano recordings from a real upright piano application. Indeed, while a large number of musical pieces are and from high-quality samples. composed for piano solo, the performance of ATM systems is Index Terms—Acoustic signal analysis, audio processing, mul- relatively poor for this instrument in comparison to others [6]. tipitch estimation, piano, transcription, spectral smoothness. Two deviations from a simple harmonic spectral model are specific to free vibrating string instruments and are particularly I. INTRODUCTION salient for the piano: a small departure from exact harmonicity The issue of monopitch estimation has been addressed for (the overtone series is slightly stretched) and the beatings be- decades by different approaches such as retrieving a periodic tween very close frequency components (for a single string two pattern in a waveform [1; 2] or matching a regularly spaced different vibrating polarizations occur and the different strings pattern to an observed spectrum [3–5], or even by combining in a doublet or triplet are slightly detuned on purpose [20]). both spectral and temporal cues [6; 7]. Conversely the mul- In this work, we propose a new spectral model in which tipitch estimation (MPE) problem has become a rather active the inharmonic distribution is taken into account and adjusted research area in the last decade [8–11] and is mostly handled for each possible note. The spectral envelope of the overtones by processing spectral or time-frequency representations. This is modeled by a smooth autoregressive (AR) model, the MPE task has also become a central tool in musical scene smoothness resulting from a low model order. A smooth analysis [11], particularly when the targeted application is the spectral model is similarly introduced for the residual noise automatic transcription of music (ATM) [12–16]. While recent by using a low-order moving-average (MA) process, which works consider pitch and time dimensions jointly to perform is particular efficient against residual sinusoids. The proposed this task, MPE on single frames has been historically used method follows a preliminary work [21] in which the AR/MA as a prior processing to pitch tracking over time and musical approach was already used and MPE was addressed as an note detection. Indeed, in a signal processing perspective, a extension of a monopitch estimation approach. In addition period or a harmonic series can be estimated from a short to the opportunity to deal with higher polyphony levels, the signal snapshot. In terms of perception, only a few cycles are current paper introduces a new signal model for simultaneous needed to identify a pitched note [17]. piano notes and a corresponding estimation scheme. The In this polyphonic context, two issues have proved difficult major advance from the previous study is thus to model and and interesting for computational MPE: the overlap between estimate the spectral overlap between note spectra. An iterative the overtones of different notes and the unknown number of estimation framework is proposed which leads to a maximum such notes occurring simultaneously. As the superposition of likelihood estimation for each possible set of F0s. two sounds in octave relationship leads to a spectral ambiguity This paper is structured as follows. In section II, the overall MPE approach is described. The sound model is first V. Emiya is with the Metiss team at INRIA, Centre Inria Rennes - Bretagne detailed, followed by the principle of the algorithm: statistical Atlantique, Rennes, France, and, together with R. Badeau and B. David, with background, adaptive choice of the search space and detection Institut Télécom; Télécom ParisTech; CNRS LTCI, Paris, France. The research leading to this paper was supported by the French GIP ANR function. The estimation of the model parameters is then under contract ANR-06-JCJC-0027-01, Décomposition en Éléments Sonores explained in section III. After specifying the implementation et Applications Musicales - DESAM and by the European Commission under details, the algorithm is tested in section IV using a database contract FP6-027026-K-SPACE. The authors would also like to thanks Peter Weyer-Brown from Telecom ParisTech and Nathalie Berthet for their useful of piano sounds and the results are analyzed and compared comments to improve the English usage. with those obtained with some state-of-the-art approaches. 2 IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING Frame: (x (0) ,...,x (N − 1))T Conclusions are finally drawn in section V. ∗ Note that in the following, denotes the complex conjugate, Preprocessing T the transpose operator, † the conjugate transpose, ⌊.⌋ the floor function and |.| the number of elements in a set, the absolute Selection of mixture candidates value of a real number or the modulus of a complex number. Preprocessing In addition, the term polyphony will be applied to any possible mixture of notes, including silence (polyphony 0) and single For each possible mixture C notes (polyphony 1). II. MULTIPITCH ESTIMATION ALGORITHM Iterative estimation of note models A. Generative sound model At the frame level, a mixture of piano sounds is modeled as Noise Spectral MA parameter a sum of sinusoids and background noise. The amplitude of envelope Amplitude estimation each sinusoid is considered as a random variable. The spectral AR parameter estimation envelope for the overtone series of a note is introduced as estimation second order statistical properties of this random variable, p ∈ J1; P K making it possible to adjust its smoothness. The noise is Parameter estimation considered as a moving-average (MA) process. ˜ More precisely, a mixture of P ∈ N simultaneous piano Weighted log-likelihood Lx (C) notes, observed in a discrete-time N-length frame, is modeled C ∈ C as a sum x (t) , P x (t) + x (t) of the note p=1 p b e signals xp and of noise xb. The sinusoidal model for xp is arg max L˜ (C) P C∈C x e Hp , 2iπfhpt ∗ −2iπfhpt Decision xp (t) αhpe + αhpe (1) Mixture estimation: C h=1 X ¡ ¢ Fig. 1. Multipitch estimation block diagram. b where αhp are the complex amplitudes and fhp the frequen- cies of the Hp overtones (here, Hp is set to the maximum number of overtones below the Nyquist frequency). Note p is 2 2 2 parameterized by C = (f , β ), f being the fundamental 2 2 and a gamma prior for : p 0p p 0p σp ∼ IG kσp , Eσp σb σb ∼ frequency (F0) and βp (f0p) being the so-called inharmonicity 2 Γ kσ2 , Eσ2 ³. A non-informative´ prior is assumed for Ap(z) coefficient of the piano note [20], such that b b and³ B(z). ´ 2 fhp , hf0p 1+ βp (f0p) h (2) With a user-defined weighting window w, we can equiv- alently consider the Discrete Time Fourier Transforms We introduce a spectralq envelope for note p as an au- (DTFT) Xp(f) of xp (t) w (t), Xb (f) of xb (t) w (t) and toregressive (AR) model of order Qp, parameterized by , P 2 2 1 X (f) p=1 Xp (f) + Xb (f) of x (t) w (t). W being θp , σ , Ap(z) , where σ is the power and is p p Ap(z) the DTFT of w, we thus have the transfer function of the related AR filter.