
arXiv:2003.10767v2 [eess.SP] 30 Oct 2020

Defining Fundamental Frequency for Almost Harmonic Signals

Filip Elvander and Andreas Jakobsson

Abstract: In this work, we consider the modeling of signals that are almost, but not quite, harmonic, i.e., composed of sinusoids whose frequencies are close to being integer multiples of a common frequency. Typically, in applications, such signals are treated as perfectly harmonic, allowing for the estimation of their fundamental frequency, despite the signals not actually being periodic. Herein, we provide three different definitions of a concept of fundamental frequency for such inharmonic signals and study the implications of the different choices for modeling and estimation. We show that one of the definitions corresponds to a misspecified modeling scenario, and provides a theoretical benchmark for analyzing the behavior of estimators derived under a perfectly harmonic assumption. The second definition stems from optimal mass transport theory and yields a robust and easily interpretable concept of fundamental frequency based on the signals' spectral properties. The third definition interprets the inharmonic signal as an observation of a randomly perturbed harmonic signal. This allows for computing a hybrid information theoretical bound on estimation performance, as well as for finding an estimator attaining the bound. The theoretical findings are illustrated using numerical examples.

Keywords: Inharmonicity, fundamental frequency estimation, misspecified models, optimal mass transport.

Parts of the material herein have been published in the proceedings of ICASSP 2020.
F. Elvander is with the Stadius Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, 3001 Leuven, Belgium (email: fi[email protected]).
A. Jakobsson is with the Department of Mathematical Statistics, Center for Mathematical Sciences, Lund University, SE-22100 Lund, Sweden (email: fi[email protected]).

I. INTRODUCTION

Signals that may be well modeled as superpositions of harmonically related sinusoids appear in a wide variety of fields. Such signals, often referred to as pitches, are, for instance, used in speech processing for modeling voiced human speech [1], in music information retrieval for the extraction of musical melodies [2], for monitoring and fault detection of industrial machinery [3], or as part of assessing the state of human diseases [4]. In such applications, the most commonly employed signal feature is the fundamental frequency, or pitch, corresponding to the reciprocal of the signal period. Due to this, considerable effort has been directed towards developing estimators that are efficient, statistically as well as computationally, with more recent contributions extending the estimation problem to signals containing multiple harmonic structures (see, e.g., [5], [6]). However, in some cases, the assumed harmonic structure is only approximate, i.e., there exists no fundamental frequency such that all of the sinusoidal frequencies are integer multiples of it, or, equivalently, the signal is not periodic.

Such signals are referred to as inharmonic and may be found, e.g., in signals produced by stringed musical instruments [7] and, to some extent, the human voice [8]. In the former case, there exist parametric models for some instruments based on their physical properties, describing the precise deviation from a perfect harmonic structure of the signal [7]. This allows for the formulation of efficient estimators [9], [10], as well as for deriving information theoretical bounds on the estimation performance for the signal parameters, such as the Cramér-Rao lower bound (CRLB) [11]. However, when no apparent structure of the inharmonicity is known, it is less clear how to efficiently estimate the frequency content of such signals. Intuitively, one should be able to achieve better estimation performance than what is possible under the assumption of an unstructured model, i.e., when there is no relation between the frequencies of the sinusoidal components. Although one may resort to robust methods [9] or simply use estimators derived for the perfectly harmonic case, it is not clear what the relevant bounds on estimation performance are. In fact, it is seldom stated what quantity one actually aims to estimate when applying a harmonic estimator to an inharmonic signal. This is the question this work aims to answer: what does the concept of a pitch mean for inharmonic signals? We provide three possible definitions, emanating from three different views of the inharmonic signal, and study the implications of the different choices.

Firstly, we propose to define the pitch of inharmonic signals via the best $\ell_2$ approximation. We show that for the case of additive Gaussian noise, this corresponds to the best approximation in the Kullback-Leibler sense, allowing for interpreting the pitch definition as a pseudo-true fundamental frequency and for using the framework of misspecified estimation (see, e.g., [12]). Furthermore, this allows us to derive the misspecified CRLB (MCRLB) [13] as a bound on estimation performance for the inharmonic model. This bound is asymptotically tight and attained by the misspecified maximum likelihood estimator (MLE). Thus, this definition formalizes the inherent assumptions behind applying harmonic estimators to inharmonic signals. However, although practically useful, we show that this definition is problematic for signals with a long time duration, and that it depends on parameters that may be considered nuisance.

As an alternative, we consider approximating the spectral representation of infinite-length versions of the inharmonic signal. Building on the concept of optimal mass transport (OMT) (see, e.g., [14]), we consider the harmonic spectrum closest in the OMT sense, yielding a definition of the fundamental frequency. We show that this definition has some attractive properties, such as stability to small perturbations, as well as an intuitive appeal. Furthermore, when using the definition as a basis for estimating the pitch, the resulting estimator allows for a closed-form expression for the asymptotic variance, and, in the case of perfect harmonicity and white Gaussian noise, has the same asymptotic performance as the MLE. In this case, the resulting estimator corresponds asymptotically to an estimator formed using the extended invariance principle (EXIP) [15], fitting a set of unstructured frequency estimates to a perfectly harmonic structure. The EXIP concept extends the invariance principle for ML estimation to the case when the variable transformation is not bijective and has earlier been successfully applied in, e.g., array processing problems in the presence of calibration errors [16], as well as for pitch estimation [17] under the assumption of perfect harmonicity.

Lastly, we consider modeling inharmonic signals within a stochastic framework, wherein the noiseless signal is regarded as an observation of a stochastic process. Specifically, we model deviations from the perfectly harmonic model as zero-mean random variables, allowing for interpreting the pitch as an expectation. As the resulting model contains both deterministic and random parameters, we derive the hybrid CRLB (HCRLB) as a lower bound on estimation mean squared error (MSE). We also derive an easily implementable hybrid ML/maximum a posteriori (MAP) estimator that asymptotically attains the resulting HCRLB.

We compare and contrast the three definitions with each other, highlighting their relative merits and applicability, and provide numerical illustrations of the derived estimators and proposed bounds.

II. ALMOST HARMONIC SIGNALS

Consider the noiseless signal model¹

$$x_t = \sum_{k=1}^{K} \tilde{r}_k e^{i\tilde{\phi}_k + i\tilde{\omega}_k t}, \tag{1}$$

for $t = 0, 1, \ldots, N-1$, and $N \in \mathbb{N}$, where $\tilde{r}_k > 0$, $\tilde{\phi}_k \in [-\pi, \pi)$, and $\tilde{\omega}_k \in [-\pi, \pi)$ denote the amplitude, initial phase, and frequency, respectively, for the k:th signal component. Then, if there exists $\omega_0 \in [-\pi, \pi)$ such that $\tilde{\omega}_k = k\omega_0$, for $k = 1, \ldots, K$, the signal in (1) is referred to as being harmonic, with fundamental frequency, or pitch, $\omega_0$. Note that this is the case also if some of the components are missing, i.e., $\tilde{r}_k = 0$ for some $k$, as the signal period is still $2\pi/\omega_0$. The model in (1) appears in many signal processing applications, not least in audio processing, and considerable effort has been directed towards deriving estimators for the parameters $\omega_0$ and $(\tilde{r}_k, \tilde{\phi}_k)$, for $k = 1, \ldots, K$ (see, e.g., [6] for an overview). However, for inharmonic signals, the integer relationship between the component frequencies is only approximate [8], [19]. That is,

$$\tilde{\omega}_k = k\omega_0 + \Delta_k, \quad k = 1, \ldots, K, \tag{2}$$

where $\Delta_k$ are called inharmonicity parameters. To distinguish this from the case with completely unrelated sinusoidal components, it is here assumed that the inharmonicity parameters are small in the sense that $|\Delta_k| \ll \omega_0$. Thus, assuming that the component frequencies satisfy (2) constitutes a type of middle ground between the highly structured harmonic model, where $\Delta_k = 0$, for all $k$, and the unstructured sinusoidal model, where there is no relation between the frequency components. In some special cases, parametric models for the inharmonicity exist. For example, a common model for vibrating strings is

$$\tilde{\omega}_k = k\omega_0\sqrt{1 + k^2\beta} \tag{3}$$

where $\beta > 0$ is a parameter related to the stiffness of the string [7]. Thus, for this model, the inharmonicity parameters are given by $\Delta_k = k\omega_0\left(\sqrt{1 + k^2\beta} - 1\right) > 0$, for all $k$. However, such a structured model may not be assumed in the general case, and, in this work, we therefore do not assume any particular structure of $\{\Delta_k\}_{k=1}^{K}$, or indeed that any useful such structure exists. Rather, we aim to put the intuitive concept of inharmonic, i.e., almost harmonic, signals on a more solid foundation by offering three conceptually different definitions of the meaning of a fundamental frequency for a non-periodic signal. Common for the three definitions will be the existence of a perfectly harmonic waveform,

$$\mu_t = \sum_{\ell=1}^{L} r_\ell e^{i\phi_\ell + i\omega_0 \ell t}, \tag{4}$$

with $L$ not necessarily being equal to $K$, that is a best approximation of $x_t$, with the definition of optimality differing between the definitions. In this framework, $\omega_0$ will be the definition of the pitch for an inharmonic signal $x_t$. Furthermore, in order to find bounds on estimation performance, as well as to formulate estimators, we assume that the measured signal is well modeled as

$$y_t = x_t + e_t, \tag{5}$$

where $e_t$ is a circularly symmetric white Gaussian noise with variance $\tilde{\sigma}^2$. However, most results presented herein may be readily extended to non-white noise processes. Throughout, let

$$\theta = \begin{bmatrix} \omega_0 & \phi_1 & \ldots & \phi_L & r_1 & \ldots & r_L \end{bmatrix}^T,$$

denote the parameter vector defining the approximating signal, i.e., $\mu_t \equiv \mu_t(\theta)$, and let $y = \begin{bmatrix} y_0 & \ldots & y_{N-1} \end{bmatrix}^T$ be the vector of available signal samples. For ease of reference, we recall that the probability density function (pdf) of $y$ is thus

$$p_Y(y) = \frac{1}{(\pi\tilde{\sigma}^2)^N} \exp\left(-\frac{1}{\tilde{\sigma}^2}\sum_{t=0}^{N-1} |y_t - x_t|^2\right).$$

¹ In the interest of generality, we here consider complex-valued signals. For real-valued signals, a corresponding complex version may easily be formed as the discrete-time analytical signal [18].

III. $\ell_2$ OPTIMALITY AND MISSPECIFIED MODELS

We initially consider approximating the inharmonic signal in (1) in the $\ell_2$ sense, i.e.,

$$\theta_0 = \arg\min_{\theta} \frac{1}{N}\sum_{t=0}^{N-1} |x_t - \mu_t(\theta)|^2. \tag{6}$$
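As an illustration of the model in (1), the string model in (3), and the criterion in (6), the following is a minimal NumPy sketch. It rests on an observation that is not spelled out in the text: for a fixed candidate $\omega_0$, the waveform (4) is linear in the complex amplitudes $r_\ell e^{i\phi_\ell}$, so the inner minimization reduces to linear least squares, leaving a one-dimensional search over $\omega_0$. The function names, the grid search, and the parameter values in the usage note are assumptions made for illustration only.

```python
import numpy as np

def string_model_frequencies(omega0, beta, K):
    """Component frequencies of the stiff-string model (3)."""
    k = np.arange(1, K + 1)
    return k * omega0 * np.sqrt(1.0 + k**2 * beta)

def sinusoidal_signal(r, phi, omega, N):
    """Noiseless signal (1): sum of K complex sinusoids for t = 0, ..., N-1."""
    t = np.arange(N)
    return np.sum(r[:, None] * np.exp(1j * (phi[:, None] + omega[:, None] * t)), axis=0)

def l2_cost(x, omega0, L):
    """Criterion (6) for a fixed omega0, minimized over amplitudes and phases."""
    t = np.arange(len(x))[:, None]
    Z = np.exp(1j * omega0 * t * np.arange(1, L + 1))  # N x L harmonic basis
    a, *_ = np.linalg.lstsq(Z, x, rcond=None)          # a_l = r_l * exp(i * phi_l)
    return np.mean(np.abs(x - Z @ a) ** 2)

def l2_pitch(x, L, grid):
    """Grid search (an illustrative choice) for the omega0 minimizing (6)."""
    costs = np.array([l2_cost(x, w, L) for w in grid])
    return grid[np.argmin(costs)], costs
```

For example, calling `l2_pitch(sinusoidal_signal(r, phi, string_model_frequencies(np.pi / 10, 1e-3, 5), 500), 5, grid)` with hypothetical amplitudes `r`, phases `phi`, and a frequency grid traces out a cost curve over candidate pitches. A grid search is used here only because the cost is multimodal in $\omega_0$; it is not claimed to be the procedure used by the authors.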

That is, the approximation in (6) is the harmonic waveform yielding the least squared deviation from the inharmonic signal. For notational brevity, $L = K$ for the $\ell_2$ approximation. Approximations such as (6) have earlier been applied in speech coding applications for decreasing the data rate in speech analysis/synthesis systems [20]. In addition to the intuitive appeal of this choice, as well as the tractability of computing $\theta_0$, this approximation may be interpreted as a so-called pseudo-true parameter within the framework of misspecified models. Specifically, consider a scenario in which it is (erroneously) believed that the signal samples $y_t$ are perfectly harmonic, i.e., generated as $y_t = \mu_t + w_t$, where $w_t$ is a circularly symmetric white Gaussian noise with (unknown) variance $\sigma^2$. It may here be noted that the noise processes $e_t$ and $w_t$ are not necessarily equal in distribution as it is allowed that $\sigma^2 \neq \tilde{\sigma}^2$. With this, one may consider the following definition.

Definition 1 (Pseudo-true parameter [12]). Consider a signal sample $y$. For a pdf $p$, parametrized by the parameter vector $\theta$, the pseudo-true parameter, $\theta_0$, is defined as

$$\theta_0 = \arg\min_{\theta} \; -\mathbb{E}_y\left(\log p(y; \theta)\right), \tag{7}$$

where $\mathbb{E}_y$ denotes expectation with respect to the pdf of $y$.

As may be noted, the minimization criterion in (7) is, up to an additive constant not depending on the parameter $\theta$, equal to the Kullback-Leibler divergence between the pdf of the assumed model and the actual pdf of the signal sample. That is, the pseudo-true parameter $\theta_0$ realizes the best Kullback-Leibler approximation of the pdf of the signal within the parametric family $p$. In our case, the following proposition holds.

Proposition 1 (Pseudo-true parameter). Under the Gaussian assumption, the pseudo-true parameter $\theta_0$ for the harmonic model is given by (6), and the pseudo-true variance parameter for the additive noise is

$$\sigma^2 = \tilde{\sigma}^2 + \frac{1}{N}\sum_{t=0}^{N-1} |\xi_t(\theta_0)|^2,$$

where $\xi_t(\theta) \triangleq \mu_t(\theta) - x_t$.

Proof. As both the assumed and true distributions are circularly symmetric white Gaussian, the result follows directly.

From this, we may conclude that approximating the inharmonic signal in $\ell_2$ may be interpreted as finding the Gaussian pdf with mean identical to a periodic waveform that best approximates the true signal pdf in the Kullback-Leibler sense. It may here be noted that the variance $\sigma^2$ of the noise $w_t$ in the harmonic model is potentially greater than that of the inharmonic model due to the imperfect fit of the waveforms. However, as may be readily verified, the value of the pseudo-true parameter $\theta_0$ does not depend on either $\sigma^2$ or $\tilde{\sigma}^2$. Furthermore, it can be shown that the misspecified MLE (MMLE), i.e., the MLE derived under an assumed model different from that of the actual measurements, asymptotically tends to the pseudo-true parameter, i.e., $\hat{\theta}_{\mathrm{MMLE}} \to \theta_0$ almost surely as the signal to noise ratio (SNR), or the number of signal samples, $N$, tends to infinity [12]. This leads to a very practical consequence: the harmonic waveform $\mu_t$ resulting from the $\ell_2$ approximation in (6) corresponds to the expected result when applying a harmonic MLE, or approximations thereof [6], to inharmonic measurements. We summarize this in the following definition.

Definition 2. Let $x_t$, for $t = 0, \ldots, N-1$, be an inharmonic waveform. Then, the best harmonic approximation in $\ell_2$ is given by $\mu_t(\theta_0)$, where $\theta_0$ solves (6).

The harmonic signal in Def. 2 may be seen as the quantity being (tacitly) estimated when applying estimators derived under a harmonic assumption to inharmonic signals. Furthermore, by considering the interpretation as the pseudo-true parameter, one may find a bound on the performance of any unbiased estimator of $\theta_0$. Such a family of bounds is the misspecified CRLB (MCRLB) [13]. Specifically, considering estimators that satisfy the MLE unbiasedness conditions, the following proposition, adapted from [13], yields a bound on estimator variance.

Proposition 2. Let $\hat{\theta}$ be an estimator of $\theta_0$ that is unbiased under the signal pdf. Then,

$$\mathbb{E}_y\left((\hat{\theta} - \theta_0)(\hat{\theta} - \theta_0)^T\right) \succeq A(\theta_0)^{-1} F(\theta_0) A(\theta_0)^{-1} \tag{8}$$

where $A(\theta) = -\frac{\sigma^2}{\tilde{\sigma}^2} F(\theta) - \tilde{F}(\theta)$ and

$$F(\theta) = \frac{2\tilde{\sigma}^2}{(\sigma^2)^2}\sum_{t=0}^{N-1}\left[\nabla_\theta \mu_t^R(\theta)\nabla_\theta \mu_t^R(\theta)^T + \nabla_\theta \mu_t^I(\theta)\nabla_\theta \mu_t^I(\theta)^T\right],$$

$$\tilde{F}(\theta) = \frac{2}{\sigma^2}\sum_{t=0}^{N-1}\left[\xi_t^R(\theta)\nabla_\theta^2 \mu_t^R(\theta) + \xi_t^I(\theta)\nabla_\theta^2 \mu_t^I(\theta)\right].$$

Here, $(\cdot)^R = \mathrm{Re}(\cdot)$ and $(\cdot)^I = \mathrm{Im}(\cdot)$ denote the real and imaginary parts, respectively.

Proof. The result follows directly from [13].

The MCRLB is given by the diagonal of the right-hand side of (8) and thus provides a lower bound on the variance of any estimator of $\theta_0$ that is unbiased under the pdf of the measured signal. It may be noted that when $\Delta_k = 0$, for all $k$, i.e., when the signal $x_t$ is perfectly harmonic, $\tilde{F}(\theta_0) = 0$, and $F(\theta_0)$ is the standard Fisher information matrix (FIM). In this case, the MCRLB coincides with the CRLB of a harmonic signal. Thus, the MCRLB provides a means of assessing the performance of any estimator derived under the harmonic assumption, even when the observed signal is inharmonic. Further, as $N \to \infty$, one may express the MCRLB corresponding to the pseudo-true fundamental frequency in closed form, as detailed below.

Proposition 3 (Asymptotic MCRLB). Let the pseudo-true parameter be

$$\theta_0 = \begin{bmatrix} \omega_0 & \phi_1 & \ldots & \phi_K & r_1 & \ldots & r_K \end{bmatrix}^T.$$

Then, as $N \to \infty$, the asymptotic MCRLB for the pseudo-true fundamental frequency $\omega_0$ is given by

$$\mathrm{MCRLB}(\omega_0) = \tilde{\sigma}^2\frac{C + E}{(C - E + Z + D)^2} + O(N^{-4}), \tag{9}$$

[Figures 1 and 2: plots of the $\ell_2$ cost against frequency (rad), shown over 0.1 to 0.35 rad, two panels per figure.]

Fig. 1. The $\ell_2$ cost function used in Def. 2 when applied to a perfectly harmonic signal with fundamental frequency $\omega_0 = \pi/10$ for two different sample lengths. Top panel: $N = 500$. Bottom panel: $N = 3000$.

Fig. 2. The $\ell_2$ cost function used in Def. 2 when applied to an inharmonic signal generated using (3), with $\omega_0 = \pi/10$ and $\beta = 10^{-3}$ for two different sample lengths. Top panel: $N = 500$. Bottom panel: $N = 3000$.

where $C = \frac{N(N^2-1)}{6}\sum_{k=1}^{K} k^2 r_k^2$, and

$$Z = -2\,\frac{N(N-1)(2N-1)}{6}\sum_{k=1}^{K} k^2 r_k^2 + 2\sum_{k=1}^{K}\sum_{t=0}^{N-1} k^2 r_k \tilde{r}_k t^2 \cos(\breve{\phi}_k + \breve{\omega}_k t),$$

$$D = 2(N-1)\left[\frac{N(N-1)}{2}\sum_{k=1}^{K} k^2 r_k^2 - \sum_{k=1}^{K}\sum_{t=0}^{N-1} k^2 r_k \tilde{r}_k t \cos(\breve{\phi}_k + \breve{\omega}_k t)\right],$$

$$E = \frac{2}{N}\sum_{k=1}^{K} k^2 \tilde{r}_k^2\left(\sum_{t=0}^{N-1} t \sin(\breve{\phi}_k + \breve{\omega}_k t)\right)^2 + \frac{2}{N}\sum_{k=1}^{K} k^2\left(\sum_{t=0}^{N-1} \tilde{r}_k t \cos(\breve{\phi}_k + \breve{\omega}_k t) - r_k\frac{N(N-1)}{2}\right)^2,$$

where $\breve{\phi}_k = \phi_k - \tilde{\phi}_k$ and $\breve{\omega}_k = k\omega_0 - \tilde{\omega}_k$, for $k = 1, \ldots, K$. Here, $O(N^{-4})$ denotes the order of the error term resulting from the asymptotic approximation.

Proof. See the appendix.

It may be noted that in the perfectly harmonic case, the terms $E$, $Z$, and $D$ are all equal to zero as the pseudo-true parameter $\theta_0$ then coincides with the actual signal parameter. It is worth noting that $\tilde{\sigma}^2/C$ is equal to the asymptotic CRLB for the perfectly harmonic model [21].

Remark 1. As one may compute the MCRLB for an estimate of the pseudo-true fundamental frequency $\omega_0$, it is also possible to construct a bound for the expected MSE for misspecified estimators. That is, if one considers an estimate of $\omega_0$ to be a misspecified estimate of $\tilde{\omega}_1$, the theoretical MSE is given by

$$\mathbb{E}_y\left((\hat{\omega}_0 - \tilde{\omega}_1)^2\right) = \mathrm{MCRLB}(\omega_0) + (\omega_0 - \tilde{\omega}_1)^2. \tag{10}$$

In fact, as will be illustrated in the numerical section, the MSE for this misspecified estimate may, for moderate values of SNR and sample lengths $N$, be lower than the CRLB for an unstructured sinusoidal model, due to the MCRLB often being considerably smaller than the CRLB.

Despite its appeal as formalizing tacit assumptions behind using harmonic estimators, as well as resulting in easily computed performance bounds, Def. 2 becomes unsatisfactory if one allows for very long signals. Although the normalization by $N$ allows for considering limits of the criterion (6), as it is guaranteed to be finite, the definition of pitch becomes ambiguous. To see this, it may be noted that the correlation of any two sinusoids with distinct frequencies tends to zero as $N \to \infty$. Thus, (6) is minimized by setting a multiple of the pseudo-true $\omega_0$ equal to the frequency $\tilde{\omega}_k$ corresponding to the largest amplitude $\tilde{r}_k$. However, any integer multiple is an equally valid choice. Def. 2 allows for selecting $\omega_0$ as $\omega_0 = \tilde{\omega}_k/\ell$, where $k = \arg\max_m \tilde{r}_m$, for any $\ell \in \{1, \ldots, K\}$. Thus, $K$ indistinguishable candidates for $\omega_0$ exist if $N$ is large enough. It may also be noted that $\omega_0$ depends on the initial phases $\tilde{\phi}_k$ of the inharmonic signal, parameters that may be considered nuisance.

The issue of ambiguity for large $N$ is illustrated in Figures 1 and 2. Specifically, Figure 1 displays the $\ell_2$ cost function for a perfectly harmonic signal with $\omega_0 = \pi/10$ consisting of five sinusoidal components, where $\tilde{r}_2 = \tilde{r}_3$ are the largest amplitudes, with the top and bottom panels of Figure 1 showing the cost for $N = 500$ and $N = 3000$, respectively. As can be seen, for both cases, $\omega_0$ corresponds to the unique global minimum, with a cost of exactly zero. In contrast, Figure 2 displays the same scenario, with the difference being that the perfectly harmonic structure has been replaced with the string model in (3) with $\beta = 10^{-3}$. For $N = 500$, the global minimum is still unique, and the definition thus unambiguous. However, for $N = 3000$, the cost function value at the local minimum at $\tilde{\omega}_2/4$ approaches that of the global minimum. If one lets $N \to \infty$, these cost function values will become identical, in addition to several other isolated global minima appearing. It is worth noting that, as the sample length

becomes longer, the error incurred by the harmonic approximation grows, as the sinusoidal components of the approximating harmonic signal become increasingly orthogonal to the components of the inharmonic signal. Thus, as the sample length approaches infinity, the squared $\ell_2$-difference will approach the sum of the (squared) norms of any sinusoidal component of the inharmonic signal whose frequency is not matched perfectly by a component of the harmonic approximation. The issue of ambiguity may be addressed by instead considering the spectral properties of the signal $x_t$, allowing for defining a harmonic approximation of the more abstract signal, i.e., when no particular sample length $N$ has been specified. This can be achieved through the use of OMT, as described next.

IV. AN OMT-BASED DEFINITION OF PITCH

As an alternative to defining the harmonic counterpart of an inharmonic signal by means of waveform approximation, such as in Def. 2, one may instead consider the signal's spectral representation. To that end, assume that the phase parameters $\tilde{\phi}_k$ in (1) are independent random variables uniformly distributed on $[-\pi, \pi)$, implying that the signal in (1) is a wide-sense stationary process. Then, the spectrum of the signal in (1) is given by

$$\Phi_x(\omega) = 2\pi\sum_{k=1}^{K} \tilde{r}_k^2\,\delta(\omega - \tilde{\omega}_k), \tag{11}$$

where $\omega \in [-\pi, \pi)$, and $\delta(\cdot)$ denotes the Dirac delta function. This constitutes a more abstract representation of $x_t$ as it is not related to any particular sample length $N$. Furthermore, any harmonic spectrum may be represented as

$$\Phi_\mu(\omega) = 2\pi\sum_{\ell=1}^{L} r_\ell^2\,\delta(\omega - \ell\omega_0), \tag{12}$$

where we let $L$ be potentially different from $K$. Thus, one may define the fundamental frequency of an inharmonic signal as the $\omega_0$ corresponding to an approximating harmonic spectrum $\Phi_\mu$. Note, however, that considering the approximation in $L_p$, for some $p \geq 1$, leads to the same problem as was encountered in the limiting case for the $\ell_2$ approximation of the waveform, namely that selecting $\omega_0$ as $\omega_0 = \tilde{\omega}_k/\ell$, where $k = \arg\max_m \tilde{r}_m$, yields equally accurate approximations for any $\ell \in \{1, \ldots, L\}$. However, a measure of distance not related to point-wise comparison of $\Phi_x$ and $\Phi_\mu$ may be obtained by considering the framework of OMT. Within this framework, the best approximation corresponds to the one requiring the least costly perturbation in order to shift the observed signal to that of a perfectly harmonic model, with perturbations being realized by moving the whole distribution, as opposed to the point-wise changes implied by $L_p$ norms.

To formulate this, let $\mathbb{T} = [-\pi, \pi)$ and let $\mathcal{M}_+(\mathbb{T})$ denote the set of non-negative, generalized integrable functions on $\mathbb{T}$. Elements of $\mathcal{M}_+(\mathbb{T})$ may be interpreted as distributions of mass on $\mathbb{T}$, and, in particular, $\Phi_x \in \mathcal{M}_+(\mathbb{T})$ and $\Phi_\mu \in \mathcal{M}_+(\mathbb{T})$. Then, for any $\Phi_0, \Phi_1 \in \mathcal{M}_+(\mathbb{T})$ with the same total mass, i.e., $\int_{\mathbb{T}} \Phi_0(\omega)\,d\omega = \int_{\mathbb{T}} \Phi_1(\omega)\,d\omega$, one may define a notion of distance, $S : \mathcal{M}_+(\mathbb{T}) \times \mathcal{M}_+(\mathbb{T}) \to \mathbb{R}$, between them by the Monge-Kantorovich problem of OMT [14],

$$S(\Phi_0, \Phi_1) = \min_{M \in \mathcal{M}_+(\mathbb{T} \times \mathbb{T})} \int_{\mathbb{T} \times \mathbb{T}} M(\omega_1, \omega_2)\,c(\omega_1, \omega_2)\,d\omega_1\,d\omega_2$$

$$\text{s.t.} \quad \int_{\mathbb{T}} M(\cdot, \omega)\,d\omega = \Phi_0, \quad \int_{\mathbb{T}} M(\omega, \cdot)\,d\omega = \Phi_1,$$

where $c : \mathbb{T} \times \mathbb{T} \to \mathbb{R}_+$ is a cost function defining the cost of moving a unit mass. Here, $M$ is referred to as a transport plan as it may be interpreted as describing how mass is moved from $\Phi_0$ to $\Phi_1$. As the constraints ensure that $M$ transports all available mass, and no other, between $\Phi_0$ and $\Phi_1$, the objective corresponds to the total cost of transport. The idea of using OMT as a measure of distance between spectra has earlier been considered in [22], wherein it was shown that $S$, for certain choices of $c$, may be used for defining a metric on $\mathcal{M}_+(\mathbb{T})$ (see also [23], [24] for corresponding covariance based signal distances). Herein, we will use $c(\omega_1, \omega_2) = (\omega_1 - \omega_2)^2$, i.e., the cost of transport between two frequencies is equal to their squared Euclidean distance. With this, one may find the best harmonic approximation of $\Phi_x$ in the OMT sense by minimizing $S(\cdot, \Phi_x)$ over the set of harmonic spectra. Note that, in contrast to $L_p$ minimization, this corresponds to finding the most efficient way of perturbing the spectral peaks of $\Phi_x$ in frequency so that the resulting spectrum is harmonic, with the cost of moving a peak being proportional to its power. Formally,

$$\Phi_\mu = \arg\min_{\Phi \in \Omega_L} S(\Phi, \Phi_x) \tag{13}$$

where $\Omega_L$ is the set of harmonic spectra, i.e.,

$$\Omega_L = \left\{\Phi \in \mathcal{M}_+(\mathbb{T}) \;\middle|\; \Phi(\omega) = 2\pi\sum_{\ell=1}^{L} r_\ell^2\,\delta(\omega - \ell\omega_0),\; r_\ell \geq 0,\; \omega_0 \in \mathbb{T}\right\}.$$

To see that (13) may be solved efficiently, note that for any candidate $\omega_0$, corresponding to a subset of $\Omega_L$, all power at the frequency $\tilde{\omega}_k$ in $\Phi_x$ will be transported to the nearest integer multiple of $\omega_0$ when evaluating $S$. That is, letting $\Omega_{L,\omega_0}$ denote such a subset, i.e., all harmonic spectra with fundamental frequency $\omega_0$, we get

$$\min_{\Phi \in \Omega_{L,\omega_0}} S(\Phi, \Phi_x) = 2\pi\sum_{k=1}^{K} \tilde{r}_k^2 \min_{\ell \in \{1, 2, \ldots, L\}} (\ell\omega_0 - \tilde{\omega}_k)^2.$$

Thus, solving (13) is equivalent to solving

$$\underset{\omega_0}{\text{minimize}} \;\; 2\pi\sum_{k=1}^{K} \tilde{r}_k^2 \min_{\ell \in \{1, 2, \ldots, L\}} (\ell\omega_0 - \tilde{\omega}_k)^2. \tag{14}$$

Furthermore, at least one harmonic spectrum attaining the minimal cost exists, and is given by

$$\Phi_\mu(\omega) = 2\pi\sum_{\ell=1}^{L}\left(\sum_{k \in I_\ell} \tilde{r}_k^2\right)\delta(\omega - \ell\omega_0),$$

where $\omega_0$ solves (14), and with

$$I_\ell = \left\{k \;\middle|\; \ell = \arg\min_m\,(m\omega_0 - \tilde{\omega}_k)^2\right\}. \tag{15}$$
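The reduction of (13) to the one-dimensional problem (14), together with the index sets in (15), can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the harmonic order `L` is passed in as a user-chosen argument, and the coarse grid search followed by a single closed-form refinement (the weighted least squares minimizer of (14) for a fixed assignment of components to harmonics) is a choice made here for simplicity.

```python
import numpy as np

def omt_cost(omega0, r, omega, L):
    """Objective of (14): each peak moves to the nearest of the first L multiples of omega0."""
    ell = np.arange(1, L + 1)
    d2 = (ell[None, :] * omega0 - omega[:, None]) ** 2  # K x L squared transport distances
    return 2 * np.pi * np.sum(r**2 * d2.min(axis=1))

def closest_harmonic_spectrum(r, omega, L, grid):
    """Approximate minimizer of (14) and the resulting harmonic powers.

    Assumption of this sketch: a grid search locates the basin of the
    minimizer, and one closed-form step refines it for the assignment
    of components to harmonics found on the grid.
    """
    costs = np.array([omt_cost(w, r, omega, L) for w in grid])
    w0 = grid[np.argmin(costs)]
    ell = np.arange(1, L + 1)
    # Harmonic index receiving each component, i.e., membership in the sets (15)
    assign = ell[np.argmin((ell[None, :] * w0 - omega[:, None]) ** 2, axis=1)]
    # For a fixed assignment, (14) is quadratic in omega0 with this minimizer
    w0 = np.sum(r**2 * assign * omega) / np.sum(r**2 * assign**2)
    power = np.array([np.sum(r[assign == l] ** 2) for l in ell])
    return w0, assign, power
```

The returned `power` vector collects, for each harmonic, the summed squared amplitudes of the components assigned to it, so the total power of the spectrum is preserved by construction.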

[Figure 3: plot of the OMT cost (logarithmic scale) against frequency (rad), 0.1 to 0.5 rad, with curves for the harmonic signal and the inharmonic signal.]

[Figure 4: plots of power against frequency (rad), 0 to 1.6 rad. Top panel: signal and $\ell_2$ approx. Bottom panel: signal and CHS approx.]

Fig. 3. The OMT cost function, $q_L$, for a perfectly harmonic signal with fundamental frequency $\omega_0 = \pi/10$, as well as for an inharmonic signal generated from the string model in (3) with $\omega_0 = \pi/10$ and $\beta = 10^{-3}$.

Fig. 4. Spectrum of inharmonic signal, as well as spectra of $\ell_2$ and CHS approximations corresponding to Defs. 2 and 4, respectively. Top panel: signal and $\ell_2$ approximation. Bottom panel: signal and CHS approximation.

In (15), $I_\ell$ denotes the set of indices that are transported to harmonic $\ell$, for $\ell = 1, \ldots, L$. As may be noted, the maximal harmonic order, $L$, may not be equal to $K$ (indeed, $K$ may be unknown, or may not be a suitable choice for $L$). Although $L$ could be left as a user defined parameter, we here offer a data dependent choice with both intuitive appeal and beneficial practical consequences.

Definition 3 (Maximal harmonic order). For a set $\{\tilde{\omega}_k\}_{k=1}^{K}$ such that $\tilde{\omega}_{k+1} > \tilde{\omega}_k$ for $k = 1, \ldots, K-1$, let $d = \min\{\tilde{\omega}_1, \tilde{\omega}_2 - \tilde{\omega}_1, \tilde{\omega}_3 - \tilde{\omega}_2, \ldots, \tilde{\omega}_K - \tilde{\omega}_{K-1}\}$, i.e., the minimum distance between two consecutive sinusoidal components. Then, the harmonic order for the set $\{\tilde{\omega}_k\}_{k=1}^{K}$ is defined as $L = \min\{\ell \in \mathbb{N} \mid \ell d \geq \tilde{\omega}_K\}$.

With this, the definition of the OMT harmonic spectrum is:

Definition 4 (Closest harmonic spectrum). Let $\Phi_x(\omega) = 2\pi\sum_{k=1}^{K} \tilde{r}_k^2\,\delta(\omega - \tilde{\omega}_k)$ be a (possibly) inharmonic spectrum, and let $L$ be the maximal harmonic order as defined in Def. 3. Then, the closest harmonic spectrum (CHS) is defined as

$$\Phi_\mu(\omega) = 2\pi\sum_{\ell=1}^{L}\left(\sum_{k \in I_\ell} \tilde{r}_k^2\right)\delta(\omega - \ell\omega_0),$$

where $\omega_0$ solves (14) and $I_\ell$ is given by (15).

It may here be noted that selecting $L$ according to Def. 3 acts as a safeguard against the so-called sub-octave problem, i.e., associating the spectral lines with $\omega_0/2^P$, for $P \geq 1$, which is bound to happen if $L$ is chosen excessively large. The following two propositions verify that Def. 4 behaves in a stable and predictable way. To simplify the proofs, define

$$q_L(\omega_0) \triangleq 2\pi\sum_{k=1}^{K} \tilde{r}_k^2 \min_{\ell \in \{1, 2, \ldots, L\}} (\ell\omega_0 - \tilde{\omega}_k)^2, \tag{16}$$

i.e., the minimal cost of transporting $\Phi_x$ to a harmonic spectrum with fundamental frequency $\omega_0$ and $L$ harmonics. Thus, the fundamental frequency of the CHS minimizes $q_L$. Furthermore, it may be noted that the following propositions hold also if one, instead of using Def. 3, selects $L = K$.

Proposition 4. Let $\Phi_x(\omega) = 2\pi\sum_{k=1}^{K} \tilde{r}_k^2\,\delta(\omega - k\tilde{\omega}_0)$ be a harmonic spectrum with fundamental frequency $\tilde{\omega}_0$. Then, the CHS is identical to $\Phi_x$, and, in particular, $\omega_0 = \tilde{\omega}_0$.

Proof. Clearly, the maximal harmonic order is $L = K$. Further, $q_K(\tilde{\omega}_0) = 0$, and $q_K(\omega') > 0$ for any $\omega' \neq \tilde{\omega}_0$.

This proposition ensures that for a perfectly harmonic spectrum, the CHS is the spectrum itself. As may be seen from (16), the function $q_L$ is non-convex, with several local minima. Thus, for arbitrarily large inharmonicity parameters $\Delta_k$ and arbitrary choices of amplitudes $\tilde{r}_k$, the fundamental frequency cannot be guaranteed to be found in a certain region (it may be noted that this is also the case for the $\ell_2$ approximation). However, for small harmonic perturbations, the following proposition holds.

Proposition 5. Let $\Phi_x(\omega) = 2\pi\sum_{k=1}^{K} \tilde{r}_k^2\,\delta(\omega - \tilde{\omega}_k)$, where $\tilde{\omega}_k = k\tilde{\omega}_0 + \Delta_k$, with $\|\Delta\|_\infty \triangleq \max_k |\Delta_k| < \tilde{\omega}_0/(2K+3)$. Then, $L \in \{K, K+1\}$, and the transport cost $q_L$ has exactly one local minimum $\omega_0$ on the interval $[\tilde{\omega}_0 - \|\Delta\|_\infty, \tilde{\omega}_0 + \|\Delta\|_\infty]$. Furthermore,

$$|\omega_0 - \tilde{\omega}_0| \leq \frac{\sum_{k=1}^{K} \tilde{r}_k^2 k}{\sum_{k=1}^{K} \tilde{r}_k^2 k^2}\,\|\Delta\|_\infty.$$

Proof. See the appendix.

For non-extreme choices² of the amplitudes $\tilde{r}_k$, the local minimum in Prop. 5 is expected to also be the global minimum, i.e., small inharmonic perturbations are expected to yield a CHS whose pitch is close to that of the unperturbed perfectly harmonic signal. With this, one may conclude that Def. 4 provides a reasonable and well-behaved definition of pitch for inharmonic signals with some advantages over the $\ell_2$ approximation. Firstly, as the approximation is performed in the spectral domain, there is no dependency on parameters related to the observation of an instance of the signal, such as sample length or initial phases. Secondly, it codifies the intuitive idea that a signal is almost harmonic if only a slight perturbation of its spectrum is needed to obtain a harmonic structure. To illustrate the intuitive appeal of Def. 4, Figure 3 considers the same example as in Figures 1 and 2, with the difference being that the approximation is performed in the spectral domain and in the OMT sense instead of in the temporal domain and $\ell_2$. As can be seen, using Def. 4 causes only a small perturbation of the $\omega_0$ defining the approximating spectrum as compared to the perfectly harmonic case. In fact, the harmonic spectrum of Def. 4 is constructed by slightly shifting the frequency locations of the peaks of the actual signal spectrum. This is illustrated in Figure 4, displaying the spectrum of the inharmonic signal, as well as the spectra of the approximations from Defs. 2 and 4. As may be noted, the total power is retained for Def. 4, whereas amplitudes are underestimated or components altogether missing due to orthogonality for Def. 2. As an aside, it may also be noted that the cost function is quadratic in a neighborhood of this fundamental frequency.

As the $\ell_2$ approximation is closely related to the MLE derived under the perfectly harmonic assumption, Def. 2 is readily applicable to actual estimation problems. However, Def. 4 may also be used as a plug-in estimator by replacing the quantities $(\tilde{r}_k, \tilde{\omega}_k)$ by finite-sample estimates $(\hat{\tilde{r}}_k, \hat{\tilde{\omega}}_k)$ and considering the obtained fundamental frequency to be an estimate of $\omega_0$. The following proposition details the asymptotic behavior of such an estimator.

Proposition 6 (CHS estimate). Let $(\hat{\tilde{r}}_k, \hat{\tilde{\omega}}_k)$ be MLEs of $(\tilde{r}_k, \tilde{\omega}_k)$, for all $k$, obtained from an unstructured sinusoidal model with Gaussian noise. Then, under the assumptions of Prop. 5, the plug-in estimate of the CHS pitch, $\hat{\omega}_0$, is an asymptotically, i.e., as $N \to \infty$, consistent estimator of the CHS $\omega_0$ with asymptotic variance given by

$$\mathrm{Var}(\hat{\omega}_0) = \frac{6\tilde{\sigma}^2}{N(N^2-1)\sum_{k=1}^{K} k^2\tilde{r}_k^2} + \frac{2\tilde{\sigma}^2}{N\left(\sum_{k=1}^{K} k^2\tilde{r}_k^2\right)^4}\sum_{k=1}^{K} k^2\tilde{r}_k^2\left(\sum_{\ell=1}^{K} \ell\,\tilde{r}_\ell^2\,(\ell\tilde{\omega}_k - k\tilde{\omega}_\ell)\right)^2. \tag{17}$$

Proof. See the appendix.

Remark 2. It may be noted that the first term of (17) is identical to the asymptotic CRLB for the pitch in a perfectly harmonic model [21], whereas the second term is related to deviations of the frequencies from perfect harmonicity. Further, in the perfectly harmonic case, i.e., $\tilde{\omega}_k = k\omega_0$, for all $k$, the second term of (17) is equal to zero, i.e., the CHS plug-in estimate has the same asymptotic performance as the MLE for the perfectly harmonic model. This is not surprising; in the perfectly harmonic case and under the assumption of additive Gaussian white noise, the criterion $q_L$ minimized by the CHS is asymptotically equivalent to the EXIP cost function corresponding to finding an estimate of the pitch from a set of unstructured frequency estimates [17]. Furthermore, the EXIP estimate has been shown to asymptotically have the same performance as the MLE [15].

² Adversarial examples with combinations of very large and small amplitudes can always be constructed so as to move the global minimum.

V. STOCHASTIC REPRESENTATION

As may be noted, the parameters of the inharmonic signal in (1) have in Defs. 2 and 4 been assumed to be deterministic (although the spectral representation allows for, e.g., the initial phases to be random), and, as a consequence, the corresponding harmonic structure has been defined in terms of approximations. In the third and last definition presented herein, we take an alternative approach and view (1) as a realization of a stochastic process in which the frequency parameters are random. Specifically, for the frequencies $\tilde{\omega}_k$ in (2), we let $\omega_0$ be a deterministic parameter, whereas $\Delta_k$ are independent zero-mean random variables so that

$$\mathbb{E}(\tilde{\omega}_k) = \mathbb{E}(k\omega_0 + \Delta_k) = k\omega_0, \quad \text{for } k = 1, \ldots, K.$$

With this, $\omega_0$ may be interpreted as the fundamental frequency in an average sense⁴. In order to arrive at a model allowing for manipulations, we will herein further assume that the inharmonicity parameters may be well modeled as Gaussian random variables, i.e., $\Delta_k \in \mathcal{N}(0, \sigma_\Delta^2)$, where $\sigma_\Delta^2$ denotes the common variance⁵. It may be noted that this representation constitutes a weaker, less structured assumption on the nature of the inharmonic deviations than that provided by, e.g., the stiff string model in (3), and may therefore be used to model inharmonic signals for which there is no known parametric description. We summarize this in the following definition.

Definition 5 (Expected harmonic signal). The signal in (1) is a realization of the stochastic process

$$x_t = \sum_{k=1}^{K} \tilde{r}_k e^{i\tilde{\phi}_k + i(k\omega_0 + \Delta_k)t}, \tag{18}$$

where the pdf of $\Delta = \begin{bmatrix} \Delta_1 & \ldots & \Delta_K \end{bmatrix}^T$ is given by

$$p(\Delta) = \frac{1}{(2\pi\sigma_\Delta^2)^{K/2}}\exp\left(-\frac{1}{2\sigma_\Delta^2}\|\Delta\|_2^2\right).$$

The corresponding harmonic signal is given by (4).

It may be noted that there is an interesting relation between Defs. 4 and 5. Specifically, for a given realization of $x_t$, corresponding to an observation of $\tilde{\omega}_k$, the waveform may be explained perfectly by matching amplitudes, initial phases, and by making any choice of $\omega_0$ and $\Delta$ such that

$$k\omega_0 + \Delta_k = \tilde{\omega}_k, \quad \text{for } k = 1, \ldots, K.$$

However, the choice that is most likely, in the sense of maximizing $p(\Delta)$, is given by

$$\omega_0 = \arg\min_{\omega}\sum_{k=1}^{K} (k\omega - \tilde{\omega}_k)^2 = \frac{\sum_{k=1}^{K} k\tilde{\omega}_k}{\sum_{k=1}^{K} k^2},$$

⁴ Note, however, that the expectation of $x_t$ is not a periodic waveform.
⁵ This choice is made to simplify the exposition.
However, all results 3As the spectra are all singular, the pointmasses are here represented by herein may be readily extended to allowing the variance to differ among scaled arrows. the inharmonicity parameters. 8

and $\Delta_k = \tilde{\omega}_k - k\omega_0$, for all $k$. That is, the estimate of $\omega_0$ corresponds to a variation of Def. 4 in which the cost of transport is not related to the component power, i.e., as if $\tilde{r}_k = 1$, for all $k$. It may also be noted that this model is identifiable for all finite $\sigma_\Delta^2$, i.e., one is not forced to identify one of the frequencies $\tilde{\omega}_k$ with an integer multiple of $\omega_0$ in order for the model to be well-defined.

Considering the noisy observation model in (5), we may proceed by asking what types of performance bounds and estimators are relevant for the signal in Def. 5. To this end, let $\breve{\theta} = [\theta^T \; \Delta^T]^T$ denote the concatenation of the deterministic vector $\theta$ and the stochastic vector $\Delta$. Furthermore, let $x(\breve{\theta})$ denote the vector consisting of the $N$ signal samples in (18), parametrized by $\breve{\theta}$. Then, assuming that the measurement noise is independent of $\Delta$, the joint pdf of the measurement $y$ and the inharmonicity $\Delta$ is given by
$$p(y, \Delta; \theta) = p(y \mid \Delta; \theta)\, p(\Delta) \qquad (19)$$
where
$$p(y \mid \Delta; \theta) = \frac{1}{(\pi\tilde{\sigma}^2)^N} \exp\Big( -\frac{1}{\tilde{\sigma}^2} \|y - x(\theta, \Delta)\|_2^2 \Big)$$
is the conditional pdf of the measurement. From this, it may be noted that it is not trivial to compute the CRLB for $\theta$, nor to derive the MLE, as the marginal density $p(y; \theta)$ is not available; this requires computing a $K$-dimensional integral with a non-linear integrand. Also, the Bayesian CRLB is not applicable in this case as prior distributions are only available for a subset of the parameters. However, it is possible to find a performance bound for the model in (19) by means of the HCRLB [25], [26]. Furthermore, this bound may, as we will see in the numerical section, be asymptotically attained by a hybrid ML/MAP estimator [27]. The following result adapted from [25] holds.

Proposition 7 (Hybrid Cramér-Rao lower bound [25]). Let $\hat{\breve{\theta}}$ be an unbiased estimator of $\breve{\theta}$ in the sense that $E_{y,\Delta}(\hat{\theta}) = \theta$ and $E_{y,\Delta}(\hat{\Delta}) = E_\Delta(\Delta)$, for any $\theta$. Then,
$$E_{y,\Delta}\big( (\hat{\breve{\theta}} - \breve{\theta})(\hat{\breve{\theta}} - \breve{\theta})^T \big) \succeq \breve{F}^{-1} \qquad (20)$$
where
$$\breve{F} = E_{y,\Delta}\big( \nabla_{\breve{\theta}} \log p(y, \Delta; \theta)\, \nabla_{\breve{\theta}} \log p(y, \Delta; \theta)^T \big). \qquad (21)$$
Here, $E_{y,\Delta}$ and $E_\Delta$ denote expectation with respect to the joint pdf of $y$ and $\Delta$, and the marginal pdf of $\Delta$, respectively.

For non-linear measurement models, such as the one considered herein, the HCRLB is in general only tight asymptotically [26]. However, it can be shown that if the bound is tight, then it is attained by the hybrid ML/MAP estimator [26]. This property makes the HCRLB attractive, especially for the case considered herein, where it may be computed easily, as detailed in the following proposition. Furthermore, the ML/MAP allows for straightforward implementation.

Proposition 8. For the inharmonic pitch model in (19), the matrix $\breve{F}$ in (21) defining the HCRLB is given by
$$\breve{F} = E_\Delta\big( F(\breve{\theta}) \big) + \begin{bmatrix} 0 & 0 \\ 0 & \frac{1}{\sigma_\Delta^2} I \end{bmatrix} \qquad (22)$$
where
$$F(\breve{\theta}) = \frac{2}{\tilde{\sigma}^2} \sum_{t=0}^{N-1} \Big( \nabla_{\breve{\theta}} \mu_t^R(\breve{\theta}) \nabla_{\breve{\theta}} \mu_t^R(\breve{\theta})^T + \nabla_{\breve{\theta}} \mu_t^I(\breve{\theta}) \nabla_{\breve{\theta}} \mu_t^I(\breve{\theta})^T \Big)$$
and $I$ denotes the identity matrix of size $K \times K$. A detailed expression for $E_\Delta(F(\breve{\theta}))$ may be found in the appendix.

Proof. See the appendix.

Remark 3. Partitioning $\breve{F}$ as
$$\breve{F} = \begin{bmatrix} F_{\theta,\theta} & F_{\theta,\Delta}^T \\ F_{\theta,\Delta} & F_{\Delta,\Delta} \end{bmatrix}$$
where $F_{\theta,\theta}$ is the $(2K+1) \times (2K+1)$ block corresponding to the non-random parameters and $F_{\Delta,\Delta}$ corresponds to the $K$ inharmonicity parameters, it may be noted that $F_{\theta,\theta}$ converges to the FIM corresponding to a perfectly harmonic model when letting $\sigma_\Delta^2 \to 0$. Also, $\frac{1}{N} F_{\Delta,\Delta} \approx \frac{1}{N} \frac{1}{\sigma_\Delta^2} I$ when $\sigma_\Delta^2 \to 0$ and $N$ is reasonably large. Noting that the Schur complement of $F_{\Delta,\Delta}$ in $\breve{F}$ is $F_{\theta,\theta} - F_{\theta,\Delta}^T F_{\Delta,\Delta}^{-1} F_{\theta,\Delta}$, and noting that $F_{\theta,\Delta}$ converges to a finite matrix as $\sigma_\Delta^2 \to 0$, the top $(2K+1) \times (2K+1)$ block of $\breve{F}^{-1}$ converges to $F_{\theta,\theta}^{-1}$, as $F_{\theta,\Delta}^T F_{\Delta,\Delta}^{-1} F_{\theta,\Delta} \to \sigma_\Delta^2 F_{\theta,\Delta}^T F_{\theta,\Delta} \to 0$, when $\sigma_\Delta^2 \to 0$. That is, the HCRLB converges to the CRLB of the perfectly harmonic model as $\sigma_\Delta^2 \to 0$.

Furthermore, it can be shown that, given some regularity conditions on the pdf $p$ [26], the HCRLB is asymptotically tight and asymptotically attained by the hybrid ML/MAP estimator, i.e., by $\arg\max_{\theta,\Delta} p(y, \Delta; \theta)$. Thus, the bound in Prop. 8 constitutes a useful predictor of estimation performance. Next, we present the ML/MAP estimator for the inharmonic pitch model and show that it lends itself to straightforward implementation. From (19), it may be noted that the ML/MAP estimate of $(\theta, \tilde{\sigma}^2, \Delta)$ maximizes the function
$$\mathcal{L} = -N \log \tilde{\sigma}^2 - \frac{1}{\tilde{\sigma}^2} \big\| y - x(\breve{\theta}) \big\|_2^2 - \frac{1}{2\sigma_\Delta^2} \|\Delta\|_2^2,$$
which is the log-likelihood of (19), excluding constant terms. It may be noted that we here are required to estimate also the noise variance $\tilde{\sigma}^2$. In order to formulate the ML/MAP estimator, define the dictionary function $A : \mathbb{R}^K \to \mathbb{C}^{N \times K}$,
$$A(\omega) = [\, a(\omega_1) \; \ldots \; a(\omega_K) \,]$$
where $\omega = [\omega_1 \; \ldots \; \omega_K]^T$ and $a : \mathbb{R} \to \mathbb{C}^N$ is the Fourier vector.

Proposition 9 (ML/MAP estimator). Let $\omega$ be the set of frequencies maximizing the function
$$\psi_{\mathrm{ML/MAP}}(\omega) = -N \log \Sigma(\omega) - \frac{1}{2\sigma_\Delta^2} \nu(\omega),$$
where
$$\Sigma(\omega) = \frac{1}{N} \Big\| y - A(\omega) \big( A(\omega)^H A(\omega) \big)^{-1} A(\omega)^H y \Big\|_2^2,$$
$$\nu(\omega) = \sum_{k=1}^{K} \Big( \omega_k - k \frac{\sum_{\ell=1}^{K} \ell\omega_\ell}{\sum_{\ell=1}^{K} \ell^2} \Big)^2.$$

[Figure 5: expectation (rad) versus β. Curves: pseudo-true fundamental, CHS fundamental, MMLE, ANLS, CHS estimator.]

Fig. 5. The pseudo-true fundamental frequency in Def. 2 as well as the CHS fundamental in Def. 4 for signals generated from the string model in (3) for varying β. Also plotted are the estimated expectations of the MMLE, ANLS, and CHS estimators.

[Figure 6: MSE (rad²) versus β. Curves: asymp. MCRLB, MCRLB, CHS asymp. variance, harmonic CRLB, sinusoidal model CRLB, MMLE, ANLS, CHS estimator.]

Fig. 6. The MCRLB, the asymptotic MCRLB, and the asymptotic CHS estimator variance for estimating the pseudo-true $\omega_0$ in Def. 2 and the CHS pitch in Def. 4, respectively, when the signal is generated from the string model in (3) for varying β. Also plotted are the corresponding MSE achieved by the MMLE, ANLS, and CHS estimators.

Then, the ML/MAP estimates of the fundamental frequency and inharmonicity parameters are given by
$$\omega_0 = \frac{\sum_{k=1}^{K} k\omega_k}{\sum_{k=1}^{K} k^2}$$
and $\Delta_\ell = \omega_\ell - \ell\omega_0$, for $\ell = 1, \ldots, K$, respectively.

Remark 4. It may be noted that the ML/MAP estimates of the model parameters are found by maximizing $\psi_{\mathrm{ML/MAP}}$ over a set of $K$ unconstrained frequencies, which may be realized by a non-linear search. As $\psi_{\mathrm{ML/MAP}}$ is non-convex, such a search requires a good initial point. In favorable conditions, such an initial point may be obtained by simple peak-picking in the periodogram.

As can be seen from the criterion $\psi_{\mathrm{ML/MAP}}$, one may obtain two extreme cases by letting $\sigma_\Delta^2 \to \infty$ and $\sigma_\Delta^2 \to 0$, respectively. In the former case, maximizing $\psi_{\mathrm{ML/MAP}}$ becomes equivalent to minimizing $\Sigma$ with respect to a set of unconstrained frequencies $\omega$, i.e., one obtains the MLE for a model with $K$ unrelated sinusoids. In the latter case, one in the limit arrives at the problem
$$\underset{\omega}{\text{maximize}} \; -\Sigma(\omega), \quad \text{s.t. } \nu(\omega) = 0,$$
or, equivalently,
$$\underset{\omega,\, \omega_0}{\text{maximize}} \; -\Sigma(\omega), \quad \text{s.t. } \omega_k = k\omega_0, \text{ for } k = 1, \ldots, K,$$
i.e., the misspecified MLE corresponding to a perfectly harmonic approximation.

VI. NUMERICAL EXAMPLES

In this section, we provide numerical examples illustrating the derived theoretical results.

A. Deterministic waveform

To illustrate the behavior of Def. 2, i.e., the best harmonic approximation in the $\ell_2$ sense, and the implied MMLE for varying degrees of inharmonicity, we consider signals generated from the string model in (3). It may here be noted that the MMLE is given by the non-linear least squares (NLS) estimator from [28]. In particular, let the signal consist of $K = 5$ components with amplitudes $\tilde{r}_k = e^{-\rho(k - K/2)^2}$ with $\rho = 0.2$, and let $\omega_0 = \pi/10$. The initial phases $\tilde{\phi}_k$ are chosen uniformly at random on $[-\pi, \pi)$. Furthermore, let the signal be observed at $N = 500$ time instances, and let the SNR, defined as $\mathrm{SNR} = 10 \log_{10} \sum_k \tilde{r}_k^2 / \tilde{\sigma}^2$, be 10 dB. For this setting, Figure 5 presents the pseudo-true pitch, i.e., the $\omega_0$ defined in Def. 2, when varying the string stiffness parameter β on $[0, 2 \times 10^{-3}]$. Also presented is the OMT-based CHS $\omega_0$ in Def. 4. As may be noted, both definitions correspond to small perturbations of the pitch, with the CHS definition displaying a more linear behavior for a larger range of β. Figure 5 also displays the estimated expected values for the MMLE and CHS estimators, obtained from 2000 Monte Carlo simulations for each value of β. The CHS pitches are computed based on unstructured ML estimates of the component amplitudes and frequencies in accordance with Prop. 6. As can be seen, the sample averages correspond well to their theoretical values. Also presented is the average estimate obtained using the approximate NLS (ANLS) estimator [28], which is an asymptotic approximation of the MLE for a harmonic model (as $N \to \infty$), corresponding to harmonic summation from a periodogram estimate. It may here be noted that the expectation of the ANLS estimator coincides with the pseudo-true $\omega_0$.

Furthermore, Figure 6 presents the exact and asymptotic MCRLB, as well as the asymptotic variance of the CHS estimator, together with the MSE of the MMLE, ANLS, and CHS estimators. The MSE is here computed using the pseudo-true $\omega_0$ as reference for the MMLE and ANLS, as it corresponds to

[Figure 7: expectation (rad) versus N. Curves: pseudo-true fundamental, CHS fundamental, MMLE, ANLS, CHS estimator.]

[Figure 8: MSE (rad²) versus N. Curves: asymp. MCRLB, MCRLB, CHS asymp. variance, harmonic CRLB, sinusoidal model CRLB, MMLE, ANLS, CHS estimator.]

Fig. 7. The pseudo-true fundamental frequency in Def. 2 as well as the CHS fundamental in Def. 4 for signals generated from the string model in (3) for varying N. Also plotted are the estimated expectations of the MMLE, ANLS, and CHS estimators.

Fig. 8. The MCRLB, the asymptotic MCRLB, and the asymptotic CHS estimator variance for estimating the pseudo-true $\omega_0$ in Def. 2 and the CHS pitch in Def. 4, respectively, when the signal is generated from the string model in (3) for varying N. Also plotted are the corresponding MSE achieved by the MMLE, ANLS, and CHS estimators.

their expected and asymptotic expected values, respectively, whereas the reference for the CHS estimator is the CHS $\omega_0$. As may be noted, the bounds provide accurate predictions of the behaviors of the three estimators, with the asymptotic MCRLB coinciding fairly well with its exact counterpart. As reference, the CRLB for the corresponding perfectly harmonic model, as well as for the lowest-frequency component in an unstructured sinusoidal model with exactly the same spectral content, are also provided. Figures 7 and 8 display corresponding quantities, i.e., theoretical and estimated expected values and theoretical and estimated variances, respectively, when fixing $\beta = 5 \times 10^{-4}$ and varying the number of samples, $N$, between 300 and 2000. As can be seen in Figure 7, the CHS $\omega_0$ does not depend on the sample length, as expected. In contrast, the pseudo-true $\omega_0$ depends on $N$. As $\tilde{r}_2 = \tilde{r}_3$ dominate the amplitudes of the other components, we expect the $\omega_0$ from Def. 2 to fluctuate close to the interval $[\omega_0\sqrt{1 + \beta 2^2}, \omega_0\sqrt{1 + \beta 3^2}] \approx [0.3145, 0.3149]$ for large but moderate values of $N$ (recall that Def. 2 becomes ambiguous as $N \to \infty$). Furthermore, as may be seen from Figure 8, the MCRLB is far from being linear in the log of $N$, in contrast to the CHS asymptotic variance. It may here be noted that the slopes for the perfectly harmonic CRLB and the CHS asymptotic variance are different as the second term of (17) dominates for large $N$.

B. Stochastic waveform

We proceed by extending the simulation study to the stochastic model in (18). Specifically, we let all signal parameters remain the same, with the difference being that each component frequency is perturbed by a Gaussian random variable. Fixing $N = 500$ and SNR = 10 dB, we compute the HCRLB for $\omega_0$, as well as for the (random) frequency of the first sinusoidal component, i.e., $\tilde{\omega}_1 = \omega_0 + \Delta_1$, for varying values of the inharmonicity variance $\sigma_\Delta^2$. As comparison, we compute the theoretical MSE for the estimators implied by Defs. 2 and 4. This is performed by sampling $\Delta_k \in \mathcal{N}(0, \sigma_\Delta^2)$, computing the MSE for that particular set of $\{\Delta_k\}$, and averaging over realizations. Specifically, the MSE for a given $\{\Delta_k\}$ is computed by adding the squared bias, with references $\omega_0$ and $\omega_0 + \Delta_1$, to the covariance bound, i.e., the MCRLB and the CHS asymptotic variance, respectively. For each value of $\sigma_\Delta^2$, the corresponding MSEs are computed by averaging over 1000 Monte Carlo simulations. Furthermore, for each such simulation, noise is added to the signal waveform according to (5), and the signal parameters are estimated using MAP/MLE, MMLE, ANLS, and the CHS estimators. The results are displayed in Figures 9 and 10, with Figure 9 showing the MSE for the estimate of $\omega_0$ and Figure 10 the MSE for $\tilde{\omega}_1 = \omega_0 + \Delta_1$. As may be seen from Figure 9, the MAP/MLE estimator is able to attain the HCRLB, which is strictly smaller than the bounds for the estimators derived for the deterministic models. For reference, Figure 9 provides also the MCRLB, the theoretical CHS asymptotic variance, as well as the obtained MSEs for estimates of the $\omega_0$ of Defs. 2 and 4. From this, it may be concluded that the variances of the estimators are dominated by their squared bias when estimating $\omega_0$ from the model in (18). Furthermore, it may be noted from Figure 10 that when considering the estimate of the frequency of the first sinusoidal component, the HCRLB, as well as the MSE of the MAP/MLE, tends to the CRLB corresponding to an unstructured sinusoidal model as $\sigma_\Delta^2$ grows, which is due to negative correlation between the estimates of $\omega_0$ and $\Delta_1$. In contrast, for large enough inharmonicity, the MSEs of the MMLE, ANLS, and CHS estimators exceed the CRLB of the sinusoidal model.

VII. DISCUSSION

The three different definitions of the pitch for inharmonic signals all have their merits, and one cannot in a strictly objective sense say that one is better than the others. Instead, their relative usefulness depends on the considered application, as well as on one's aim with approximating close-to-harmonic signals with perfectly harmonic counterparts.

[Figure 9: MSE (rad²) versus $\sigma_\Delta^2$. Curves: HCRLB, MCRLB + bias², CHS asymp. variance + bias², harmonic CRLB, sinusoidal model CRLB, MAP/MLE, MMLE, ANLS, CHS estimator, MCRLB, CHS asymp. variance.]

Fig. 9. The MSE obtained when estimating $\omega_0$ from the model in (18) for varying values of the inharmonicity parameter variance $\sigma_\Delta^2$. Also given are MSEs for the estimates of the fundamental frequencies in Defs. 2 and 4.

[Figure 10: MSE (rad²) versus $\sigma_\Delta^2$. Curves: HCRLB, MCRLB + bias², CHS asymp. variance + bias², harmonic CRLB, sinusoidal model CRLB, MAP/MLE, MMLE, ANLS, CHS estimator.]

Fig. 10. The MSE obtained when estimating the frequency $\tilde{\omega}_1 = \omega_0 + \Delta_1$ of the first sinusoidal component from the model in (18) for varying values of the inharmonicity parameter variance $\sigma_\Delta^2$.
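As a rough illustration of the stochastic-waveform experiment summarized in Figs. 9 and 10, the following Python sketch draws one noisy realization of the model in (18). The function name `simulate_expected_harmonic_signal` and its arguments are hypothetical, and the noise is assumed to be circularly symmetric white Gaussian with variance $\tilde{\sigma}^2$ set from the SNR definition $10 \log_{10} \sum_k \tilde{r}_k^2 / \tilde{\sigma}^2$; the numerical settings ($K = 5$, $\rho = 0.2$, $\omega_0 = \pi/10$, $N = 500$, SNR = 10 dB) are those quoted in the numerical section.

```python
import numpy as np

def simulate_expected_harmonic_signal(omega0, amps, sigma_delta, n_samples, snr_db, rng):
    """Draw one noisy realization of the stochastic model (18) (illustrative sketch)."""
    amps = np.asarray(amps, dtype=float)
    K = amps.size
    k = np.arange(1, K + 1)
    # Independent zero-mean Gaussian inharmonicity parameters Delta_k.
    delta = rng.normal(0.0, sigma_delta, size=K)
    # Initial phases drawn uniformly on [-pi, pi).
    phases = rng.uniform(-np.pi, np.pi, size=K)
    freqs = k * omega0 + delta
    t = np.arange(n_samples)
    # Sum of K complex sinusoids, one row per component before summation.
    x = (amps[:, None] * np.exp(1j * (phases[:, None] + np.outer(freqs, t)))).sum(axis=0)
    # Noise variance from SNR = 10 log10(sum_k r_k^2 / sigma^2).
    noise_var = np.sum(amps**2) / 10 ** (snr_db / 10)
    noise = np.sqrt(noise_var / 2) * (
        rng.standard_normal(n_samples) + 1j * rng.standard_normal(n_samples)
    )
    return x + noise, freqs, delta

# Settings quoted in the numerical section; sigma_delta is an arbitrary example value.
k = np.arange(1, 6)
amps = np.exp(-0.2 * (k - 5 / 2) ** 2)
y, freqs, delta = simulate_expected_harmonic_signal(
    np.pi / 10, amps, 1e-4, 500, 10.0, np.random.default_rng(0)
)
```

Repeating the draw over many seeds and values of `sigma_delta`, and applying an estimator of choice to `y`, would give empirical MSE curves of the kind described above; the estimators themselves are not part of this sketch.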

As noted, by defining pitch using an approximation in $\ell_2$, as in Def. 2, one models the scenario of erroneously assuming that the observed waveform is periodic, for which Def. 2 provides means for analyzing the average behavior of the misspecified MLE, as well as a bound on the estimation performance. Thus, Def. 2 is well suited for being used as a benchmark when analyzing the behavior of pitch estimators when applied to inharmonic signals; the definition makes it possible to compute the bias and MSE, at least empirically, for any such estimator. However, the behavior of Def. 2 is, as demonstrated in the numerical examples, non-linear with respect to both the inharmonicity and sample length, and is in addition ambiguous for very long signals.

In this respect, the OMT-based approximation of Def. 4 has the appealing quality of not depending on the actual signal, but only on its spectral properties. As shown both in the theoretical and numerical results, the behavior of Def. 4 is locally linear with respect to the signal inharmonicity, and corresponds well to the intuitive idea that inharmonicity corresponds to perturbations in frequency of spectral peaks. In addition to this, when used for estimation, Def. 4 displays linear behavior in terms of estimator variance, and coincides with the MLE for the perfectly harmonic case. With this in mind, Def. 4 is more satisfactory than Def. 2 when considered a tool for understanding inharmonic signals. Furthermore, its connection to the EXIP framework, i.e., the idea of fitting given parameter estimates to a certain structure, makes it relevant as a benchmark also in practical estimation scenarios.

In contrast to Defs. 2 and 4, which constitute approximations, Def. 5 views the inharmonic signal as a waveform resulting from perturbing the frequencies of the sinusoidal components by zero-mean random variables, causing them to deviate from perfect integer multiples of the nominal pitch. In this respect, Def. 5 offers an explanation of why the observed signal waveform is not periodic. However, as any given observation of the signal has been generated by this random procedure, one by definition cannot compute the $\omega_0$ in Def. 5 from such an observation. However, adopting this view of inharmonic signals allows for computing a performance bound constituting a smooth interpolant between perfectly harmonic models and completely unstructured sinusoidal models, as well as for finding an easily implemented estimator attaining the bound. In practical terms, Def. 5 offers tools for processing inharmonic signals for which one suspects that there may be no deterministic description of the inharmonic deviations. For example, such a class of signals could be the voiced part of human speech, for which one may observe frequency-dependent inharmonicity with no particular structure (this may very well change over time for a given person due to, e.g., infections). As the inharmonicity pattern thus could change from day to day for any given fundamental frequency, Def. 5 likely provides a pragmatic view.
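To make the comparison between Defs. 4 and 5 concrete, the following Python sketch evaluates the two closed-form fundamentals that appear in the text: the power-weighted stationary point of $q_L$ from the proof of Prop. 5, and the unweighted least-squares fit that maximizes $p(\Delta)$ in Def. 5. The function names are hypothetical. The first function assumes that the conditions of Prop. 5 hold, so that component $k$ is assigned to harmonic order $k$; it does not implement the general minimization over $\ell$ or the selection of $L$ in Def. 3. The stiff-string frequencies $\tilde{\omega}_k = k\omega_0\sqrt{1 + \beta k^2}$ used in the example are an assumption, chosen to be consistent with the interval quoted in the numerical section.

```python
import numpy as np

def chs_fundamental(freqs, amps):
    """Power-weighted fit: sum(r_k^2 k w_k) / sum(r_k^2 k^2).

    Valid as the CHS pitch only when component k maps to harmonic k
    (the small-perturbation setting of Prop. 5).
    """
    freqs = np.asarray(freqs, dtype=float)
    amps = np.asarray(amps, dtype=float)
    k = np.arange(1, freqs.size + 1)
    weights = amps**2 * k
    return np.sum(weights * freqs) / np.sum(weights * k)

def most_likely_fundamental(freqs):
    """Unweighted fit sum(k w_k) / sum(k^2) and the implied Delta_k = w_k - k w_0."""
    freqs = np.asarray(freqs, dtype=float)
    k = np.arange(1, freqs.size + 1)
    omega0 = np.sum(k * freqs) / np.sum(k**2)
    return omega0, freqs - k * omega0

# Example with the amplitudes of the numerical section and an assumed stiff-string model.
k = np.arange(1, 6)
amps = np.exp(-0.2 * (k - 5 / 2) ** 2)
omega0_nominal, beta = np.pi / 10, 5e-4
stiff_freqs = k * omega0_nominal * np.sqrt(1 + beta * k**2)

omega_chs = chs_fundamental(stiff_freqs, amps)
omega_ml, delta_ml = most_likely_fundamental(stiff_freqs)
```

The two fits differ only in the weights: setting all amplitudes to one in `chs_fundamental` reproduces `most_likely_fundamental`, which is the relation between Defs. 4 and 5 noted in the text.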

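APPENDIX

A. Information matrix for the HCRLB

As $E_{\Delta_k}\big( e^{i\Delta_k t} \big) = e^{-\frac{1}{2}\sigma_\Delta^2 t^2}$, one may write $E_\Delta\big( F(\breve{\theta}) \big) = \frac{2}{\tilde{\sigma}^2} \sum_{t=0}^{N-1} \Lambda^{(t)}$, where the elements of the matrices
$$\Lambda^{(t)} \triangleq E_\Delta\Big( \nabla_{\breve{\theta}} \mu_t^R(\breve{\theta}) \nabla_{\breve{\theta}} \mu_t^R(\breve{\theta})^T + \nabla_{\breve{\theta}} \mu_t^I(\breve{\theta}) \nabla_{\breve{\theta}} \mu_t^I(\breve{\theta})^T \Big),$$
for $t = 0, \ldots, N-1$, may be expressed as
$$\Lambda^{(t)}_{\omega_0,\omega_0} = t^2 \sum_{k} k^2 \tilde{r}_k^2 + t^2 e^{-\sigma_\Delta^2 t^2} \sum_{(k,\ell): k \neq \ell} k\ell \tilde{r}_k \tilde{r}_\ell \cos\big( \omega_0(k-\ell)t + \tilde{\phi}_k - \tilde{\phi}_\ell \big)$$
$$\Lambda^{(t)}_{\omega_0,r_k} = t e^{-\sigma_\Delta^2 t^2} \sum_{\ell: \ell \neq k} \ell \tilde{r}_\ell \sin\big( \omega_0(k-\ell)t + \tilde{\phi}_k - \tilde{\phi}_\ell \big)$$
$$\Lambda^{(t)}_{\omega_0,\phi_k} = t k \tilde{r}_k^2 + t e^{-\sigma_\Delta^2 t^2} \sum_{\ell: \ell \neq k} \ell \tilde{r}_k \tilde{r}_\ell \cos\big( \omega_0(k-\ell)t + \tilde{\phi}_k - \tilde{\phi}_\ell \big)$$
$$\Lambda^{(t)}_{r_k,r_\ell} = \begin{cases} 1, & \ell = k \\ e^{-\sigma_\Delta^2 t^2} \cos\big( \omega_0(k-\ell)t + \tilde{\phi}_k - \tilde{\phi}_\ell \big), & \ell \neq k \end{cases}$$
$$\Lambda^{(t)}_{r_k,\phi_\ell} = \begin{cases} 0, & \ell = k \\ e^{-\sigma_\Delta^2 t^2} \tilde{r}_\ell \sin\big( \omega_0(k-\ell)t + \tilde{\phi}_k - \tilde{\phi}_\ell \big), & \ell \neq k \end{cases}$$
$$\Lambda^{(t)}_{\phi_k,\phi_\ell} = \begin{cases} \tilde{r}_k^2, & \ell = k \\ e^{-\sigma_\Delta^2 t^2} \tilde{r}_k \tilde{r}_\ell \cos\big( \omega_0(k-\ell)t + \tilde{\phi}_k - \tilde{\phi}_\ell \big), & \ell \neq k \end{cases}$$
and $\Lambda^{(t)}_{\eta,\Delta_k} = t \Lambda^{(t)}_{\eta,\phi_k}$, for $\eta \in \{\omega_0, r_k, \phi_k\}$, and $\Lambda^{(t)}_{\Delta_k,\Delta_\ell} = t^2 \Lambda^{(t)}_{\phi_k,\phi_\ell}$.

B. Proof of Proposition 3

By Lemma 1 below, $\frac{1}{N} F(\theta_0)$ and $\frac{1}{N} A(\theta_0)$ converge to arrowhead matrices. As shorthands, let $F = F(\theta_0)$ and $A = A(\theta_0)$. Then,
$$A = -\frac{1}{\sigma^2} \begin{bmatrix} \eta & z^T \\ z & \mathrm{diag}(d) \end{bmatrix} + O(1),$$
where $O(1)$ denotes a bounded matrix and where $\mathrm{diag}(d)$ denotes the diagonal matrix with $d$ as its main diagonal, $\eta = \eta_\mu + \eta_\xi$ and $z = z_\mu + z_\xi$ with
$$\eta_\mu = 2 \sum_{t=0}^{N-1} \Big| \frac{\partial \mu_t}{\partial \omega_0} \Big|^2, \quad \eta_\xi = 2 \sum_{t=0}^{N-1} \mathrm{Re}\Big( \xi_t^{*} \frac{\partial^2 \mu_t}{\partial \omega_0^2} \Big),$$
$$d = 2 \sum_{t=0}^{N-1} \Big( \nabla_\alpha \mu_t^R \odot \nabla_\alpha \mu_t^R + \nabla_\alpha \mu_t^I \odot \nabla_\alpha \mu_t^I \Big),$$
$$z_\mu = 2 \sum_{t=0}^{N-1} \mathrm{Re}\Big( \Big( \frac{\partial \mu_t}{\partial \omega_0} \Big)^{*} \nabla_\alpha \mu_t \Big), \quad z_\xi = 2 \sum_{t=0}^{N-1} \mathrm{Re}\Big( \xi_t^{*} \frac{\partial}{\partial \omega_0} \nabla_\alpha \mu_t \Big),$$
where $\odot$ denotes the Hadamard product, and
$$\alpha = [\, \phi_1 \; \ldots \; \phi_K \; r_1 \; \ldots \; r_K \,]^T,$$
with all derivatives being evaluated at $\theta = \theta_0$. Then, by the Sherman-Morrison-Woodbury formula [29], $A^{-1}$ may be written as
$$A^{-1} = -\sigma^2 \begin{bmatrix} 0 & 0 \\ 0 & \mathrm{diag}(d)^{-1} \end{bmatrix} - \frac{\sigma^2}{\rho} u u^T + E,$$
where $u = u_\mu + u_\xi$, with $\rho = \eta - z^T (z ./ d)$ and
$$u_\mu = [\, -1 \;\; (z_\mu ./ d)^T \,]^T, \quad u_\xi = [\, 0 \;\; (z_\xi ./ d)^T \,]^T,$$
where $./$ denotes elementwise division, and where the error matrix $E$ is structured as
$$E = \begin{bmatrix} E_1 & E_2^T \\ E_2 & E_3 \end{bmatrix}$$
where $E_1$ is a scalar on the order $O(N^{-4})$, $E_2$ is a $2K$ vector on the order $O(N^{-3})$, and $E_3$ is a $2K \times 2K$ matrix on the order $O(N^{-2})$. Thus, as $F$ is given by
$$F = \frac{\tilde{\sigma}^2}{(\sigma^2)^2} \begin{bmatrix} \eta_\mu & z_\mu^T \\ z_\mu & \mathrm{diag}(d) \end{bmatrix} + O(1),$$
straightforward calculations yield that the error incurred by neglecting the bounded terms of $A$ and $F$ may be bounded from above by a matrix tending to zero faster than $A^{-1} F A^{-1}$ by a factor $1/N$. Then, the MCRLB corresponding to $\omega_0$ is given by the first diagonal element of $A^{-1} F A^{-1}$, which for large $N$ is given by
$$\frac{(\sigma^2)^2}{\rho^2} u^T F u = (\sigma^2)^2 \frac{1}{\rho^2} \big( u_\mu^T F u_\mu + u_\xi^T F u_\xi \big) = \tilde{\sigma}^2 \frac{C + E}{(C - E + Z + D)^2},$$
where it is used that $u_\mu^T F u_\xi = 0$, and
$$C = \eta_\mu - z_\mu^T (z_\mu ./ d), \quad E = z_\xi^T (z_\xi ./ d), \quad D = -2 z_\mu^T (z_\xi ./ d), \quad Z = \eta_\xi.$$
Assuming that the pseudo-true fundamental frequency is not too close to zero, the correlation between signal components corresponding to different harmonic orders tends to zero as $N \to \infty$. The asymptotic expressions for $C$, $E$, $D$, and $Z$ stated in the proposition follow directly, together with the approximation error on the order $O(N^{-4})$ for large $N$. ∎

Lemma 1. As $N \to \infty$, $\frac{1}{N} F(\theta_0)$ and $\frac{1}{N} A(\theta_0)$ converge to arrowhead matrices.

Proof. Firstly, it may be noted that as $\theta_0$ solves the LS criterion in (6), it directly follows that
$$\sum_{t=0}^{N-1} \Big( \xi_t^R \nabla_\theta \mu_t^R + \xi_t^I \nabla_\theta \mu_t^I \Big) = 0.$$
Then, as any second derivative of $\mu_t^R$ and $\mu_t^I$ not involving differentiation with respect to $\omega_0$ is equal to a common constant real scaling of elements of $\nabla_\theta \mu_t^R$ and $\nabla_\theta \mu_t^I$, respectively, only the first column and first row of $\tilde{F}(\theta_0)$ are non-zero. It is straightforward to show that off-diagonal elements of $F(\theta_0)$ not related to partial derivatives of $\omega_0$ converge linearly to zero when scaled by $1/N$, whereas the diagonal is bounded from below by positive values. Thus, $\frac{1}{N} F(\theta_0)$, and thereby $\frac{1}{N} A(\theta_0)$, converges to an arrowhead matrix as $N \to \infty$. ∎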
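The arrowhead inverse used in the proof of Proposition 3 can be checked numerically. The Python sketch below, with the hypothetical function name `arrowhead_inverse`, builds the inverse of the matrix with first row $[\eta \;\; z^T]$ and lower-right block $\mathrm{diag}(d)$ from $\rho = \eta - z^T(z ./ d)$ and $u = [-1 \;\; (z ./ d)^T]^T$, and compares it with a direct inverse. It covers only the leading arrowhead part, i.e., it leaves out the scaling by $-1/\sigma^2$ and the bounded $O(1)$ remainder that the proof collects in the error matrix.

```python
import numpy as np

def arrowhead_inverse(eta, z, d):
    """Inverse of [[eta, z^T], [z, diag(d)]] via the rank-one (Schur complement) form."""
    z = np.asarray(z, dtype=float)
    d = np.asarray(d, dtype=float)
    rho = eta - z @ (z / d)                 # Schur complement of diag(d)
    u = np.concatenate(([-1.0], z / d))     # u = [-1, (z ./ d)^T]^T
    base = np.zeros((d.size + 1, d.size + 1))
    base[1:, 1:] = np.diag(1.0 / d)         # block-diagonal part [[0, 0], [0, diag(d)^-1]]
    return base + np.outer(u, u) / rho

# Small random example (values are arbitrary and chosen to keep rho positive).
rng = np.random.default_rng(1)
d = rng.uniform(1.0, 2.0, size=4)
z = rng.uniform(-0.5, 0.5, size=4)
eta = 5.0
M = np.zeros((5, 5))
M[0, 0], M[0, 1:], M[1:, 0] = eta, z, z
M[1:, 1:] = np.diag(d)
max_error = np.max(np.abs(arrowhead_inverse(eta, z, d) - np.linalg.inv(M)))
```

In particular, the first diagonal element of the inverse equals $1/\rho$, which is why $\rho$ appears squared in the denominator of the MCRLB expression for $\omega_0$.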

C. Proof of Proposition 5

Clearly, if there are at least two consecutive sinusoids with non-zero amplitude, $d$ in Def. 3 satisfies $d \in [\tilde{\omega}_0 - 2\|\Delta\|_\infty, \tilde{\omega}_0 + 2\|\Delta\|_\infty]$. Then,
$$(K+1)d \geq \tilde{\omega}_K, \quad (K-1)d < \tilde{\omega}_K,$$
if $\|\Delta\|_\infty < \tilde{\omega}_0/(2K+3)$. Thus, for $\|\Delta\|_\infty < \tilde{\omega}_0/(2K+3)$, we have that $L \in \{K, K+1\}$. Furthermore, for any $\omega \in [\tilde{\omega}_0 - \|\Delta\|_\infty, \tilde{\omega}_0 + \|\Delta\|_\infty]$,
$$|k\omega - \tilde{\omega}_k| \leq (k+1)\|\Delta\|_\infty < \tilde{\omega}_0 \frac{k+1}{2K+3},$$
whereas $|(k \pm 1)\omega - \tilde{\omega}_k| > \tilde{\omega}_0 (2K+1-k)/(2K+3)$. Thus, $\arg\min_\ell (\ell\omega - \tilde{\omega}_k)^2 = k$, for $k = 1, \ldots, K$, implying
$$q_L(\omega) = 2\pi \sum_{k=1}^{K} \tilde{r}_k^2 (k\omega - \tilde{\omega}_k)^2 = 2\pi \sum_{k=1}^{K} \tilde{r}_k^2 \big( k(\omega - \tilde{\omega}_0) - \Delta_k \big)^2$$
for any $\omega \in [\tilde{\omega}_0 - \|\Delta\|_\infty, \tilde{\omega}_0 + \|\Delta\|_\infty]$. This quadratic function has the unique stationary point
$$\omega_0 = \frac{\sum_{k=1}^{K} \tilde{r}_k^2 k \tilde{\omega}_k}{\sum_{k=1}^{K} \tilde{r}_k^2 k^2} = \tilde{\omega}_0 + \frac{\sum_{k=1}^{K} \tilde{r}_k^2 k \Delta_k}{\sum_{k=1}^{K} \tilde{r}_k^2 k^2},$$
where it may be noted that
$$\Big| \frac{\sum_{k=1}^{K} \tilde{r}_k^2 k \Delta_k}{\sum_{k=1}^{K} \tilde{r}_k^2 k^2} \Big| \leq \frac{\sum_{k=1}^{K} \tilde{r}_k^2 k}{\sum_{k=1}^{K} \tilde{r}_k^2 k^2} \|\Delta\|_\infty \leq \|\Delta\|_\infty. \;\; ∎$$

D. Proof of Proposition 6

We here assume that the inharmonic perturbations are small so that, asymptotically, i.e., when $N \to \infty$ or the SNR tends to infinity, the assumptions of Prop. 5 hold in the sense that there exists $\omega$ such that $|k\omega - \hat{\tilde{\omega}}_k| < \omega/(2K+3)$, for $k = 1, \ldots, K$, almost surely. Noting that the covariance matrix of the vector $\hat{\theta} = [\, \hat{\tilde{r}}_1 \; \ldots \; \hat{\tilde{r}}_K \; \hat{\tilde{\omega}}_1 \; \ldots \; \hat{\tilde{\omega}}_K \,]^T$ is asymptotically given by (see, e.g., [30])
$$\mathrm{Cov}(\hat{\theta}) = \frac{\tilde{\sigma}^2}{2N} \begin{bmatrix} I & 0 \\ 0 & C_2 \end{bmatrix}, \quad C_2 = \frac{12}{N^2 - 1} \mathrm{diag}\big( 1/\tilde{r}_1^2 \; \ldots \; 1/\tilde{r}_K^2 \big),$$
and that the estimate of the CHS fundamental frequency is given by $\hat{\omega}_0 = f(\hat{\theta})$, where
$$f(\theta) = \frac{\sum_{k=1}^{K} \tilde{r}_k^2 k \tilde{\omega}_k}{\sum_{k=1}^{K} \tilde{r}_k^2 k^2},$$
the expression in (17) is then obtained from a first-order Taylor expansion of $f$ at $\theta$. As can be seen from $\mathrm{Cov}(\hat{\theta})$, the estimates of $\tilde{\omega}_k$ are asymptotically uncorrelated with estimates of the amplitudes, from which it directly follows that $\hat{\omega}_0$ is an asymptotically unbiased estimate of $\omega_0$. As $\hat{\theta}$ is the MLE of $\theta$, consistency follows, with the asymptotic variance being
$$E\big( (\omega_0 - \hat{\omega}_0)^2 \big) = \nabla_\theta f(\theta)^T \mathrm{Cov}(\hat{\theta}) \nabla_\theta f(\theta).$$
After some simplification, the expression in (17) follows. ∎

E. Proof of Proposition 8

We have
$$\nabla_{\breve{\theta}} \log p(y, \Delta; \theta) = \nabla_{\breve{\theta}} \log p(y \mid \Delta; \theta) + \nabla_{\breve{\theta}} \log p(\Delta),$$
where $\nabla_{\breve{\theta}} \log p(\Delta) = [\, 0^T \;\; -\Delta^T/\sigma_\Delta^2 \,]^T$, where $0$ is a zero vector of length $2K+1$. Further, as
$$E_{y|\Delta}\big( \nabla_{\breve{\theta}} \log p(y \mid \Delta; \theta) \big) = 0,$$
and $\Delta$ is independent of the measurement noise,
$$E_{y|\Delta}\big( \nabla_{\breve{\theta}} \log p \, \nabla_{\breve{\theta}} \log p^T \big) = F(\breve{\theta}) + \begin{bmatrix} 0 & 0 \\ 0 & \frac{1}{(\sigma_\Delta^2)^2} \Delta\Delta^T \end{bmatrix},$$
where we use the shorthand $p = p(y, \Delta; \theta)$, and
$$F(\breve{\theta}) = E_{y|\Delta}\big( \nabla_{\breve{\theta}} \log p(y \mid \Delta; \theta) \nabla_{\breve{\theta}} \log p(y \mid \Delta; \theta)^T \big) = \frac{2}{\tilde{\sigma}^2} \sum_{t=0}^{N-1} \Big( \nabla_{\breve{\theta}} \mu_t^R(\breve{\theta}) \nabla_{\breve{\theta}} \mu_t^R(\breve{\theta})^T + \nabla_{\breve{\theta}} \mu_t^I(\breve{\theta}) \nabla_{\breve{\theta}} \mu_t^I(\breve{\theta})^T \Big)$$
as the measurement noise is circularly symmetric white Gaussian. As $E_\Delta(\Delta\Delta^T) = \sigma_\Delta^2 I$, the expression for $\breve{F}$ follows directly. It may also be readily verified that
$$E_{y|\Delta}\Big( \nabla_{\breve{\theta}} \log p(y \mid \Delta; \theta) \frac{\partial}{\partial \tilde{\sigma}^2} \log p(y \mid \Delta; \theta) \Big) = 0,$$
i.e., the HCRLB for $\breve{\theta}$ does not depend on whether $\tilde{\sigma}^2$ is known or not, implying that no partial derivatives with respect to $\tilde{\sigma}^2$ need to be considered to compute the HCRLB of $\breve{\theta}$. ∎

REFERENCES

[1] S. M. Nørholm, J. R. Jensen, and M. G. Christensen, “Instantaneous Fundamental Frequency Estimation With Optimal Segmentation for Nonstationary Voiced Speech,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 12, pp. 2354–2367, Dec 2016.
[2] M. Müller, D. P. W. Ellis, A. Klapuri, and G. Richard, “Signal Processing for Music Analysis,” IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1088–1110, 2011.
[3] R. B. Randall, Vibration-Based Condition Monitoring: Industrial, Aerospace and Automotive Applications, John Wiley & Sons, Chichester, UK, 2011.
[4] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman, and L. O. Ramig, “Suitability of Dysphonia Measurements for Telemonitoring of Parkinson’s disease,” IEEE Trans. Biomed. Eng., vol. 56, no. 4, pp. 1015–1022, April 2009.
[5] J. K. Nielsen, T. L. Jensen, J. R. Jensen, M. G. Christensen, and S. H. Jensen, “Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient,” Elsevier Signal Processing, vol. 135, pp. 188–197, Jan 2017.
[6] M. Christensen and A. Jakobsson, Multi-Pitch Estimation, Morgan & Claypool, San Rafael, Calif., 2009.
[7] N. H. Fletcher and T. D. Rossing, The Physics of Musical Instruments, Springer-Verlag, New York, NY, 1988.
[8] E. B. George and M. J. T. Smith, “Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model,” IEEE Trans. Speech Audio Process., vol. 5, no. 5, pp. 389–406, Sep 1997.
[9] J. X. Zhang, M. G. Christensen, S. H. Jensen, and M. Moonen, “A Robust and Computationally Efficient Subspace-Based Fundamental Frequency Estimator,” IEEE Trans. Audio, Speech, Language Process, vol. 18, no. 3, pp. 487–497, March 2010.
[10] N. R. Butt, S. I. Adalbjörnsson, S. D. Somasundaram, and A. Jakobsson, “Robust Fundamental Frequency Estimation in the Presence of Inharmonicities,” in 38th IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Vancouver, May 26–31, 2013.
[11] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory, Prentice-Hall, Englewood Cliffs, N.J., 1993.

[12] S. Fortunati, F. Gini, M. S. Greco, and C. D. Richmond, “Performance bounds for parameter estimation under misspecified models: Fundamental findings and applications,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 142–157, Nov 2017.
[13] C. D. Richmond and L. L. Horowitz, “Parameter Bounds on Estimation Accuracy Under Model Misspecification,” IEEE Trans. Signal. Process, vol. 63, no. 9, pp. 2263–2278, 2015.
[14] C. Villani, Optimal transport: old and new, Springer Science & Business Media, 2008.
[15] P. Stoica and T. Söderström, “On Reparametrization of Loss Functions Used in Estimation and the Invariance Principle,” Signal Processing, vol. 17, pp. 383–387, August 1989.
[16] A. Swindlehurst and P. Stoica, “Maximum Likelihood Methods in Radar Array Signal Processing,” IEEE Proc., vol. 86, no. 2, pp. 421–441, February 1998.
[17] H. Li, P. Stoica, and J. Li, “Computationally Efficient Parameter Estimation for Harmonic Sinusoidal Signals,” Signal Processing, vol. 80, pp. 1937–1944, 2000.
[18] S. L. Marple, “Computing the discrete-time “analytic” signal via FFT,” IEEE Trans. Signal Process., vol. 47, no. 9, pp. 2600–2603, September 1999.
[19] A. Klapuri, “Multiple fundamental frequency estimation based on harmonicity and spectral smoothness,” IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 804–816, 2003.
[20] R. J. McAulay and T. F. Quatieri, “Pitch estimation and voicing detection based on a sinusoidal speech model,” in Proc. 1990 Int. Conf. on Acoustics, Speech, and Signal Processing, Albuquerque, NM, USA, 1990, pp. 249–252.
[21] M. G. Christensen, A. Jakobsson, and S. H. Jensen, “Joint High-Resolution Fundamental Frequency and Order Estimation,” IEEE Trans. Audio, Speech, Language Process, vol. 15, no. 5, pp. 1635–1644, July 2007.
[22] T. T. Georgiou, J. Karlsson, and M. S. Takyar, “Metrics for power spectra: an axiomatic approach,” IEEE Trans. Signal Process., vol. 57, no. 3, pp. 859–867, Mar. 2009.
[23] F. Elvander, A. Jakobsson, and J. Karlsson, “Interpolation and Extrapolation of Toeplitz Matrices via Optimal Mass Transport,” IEEE Trans. Signal. Process, vol. 66, no. 20, pp. 5285–5298, Oct. 2018.
[24] F. Elvander, I. Haasler, A. Jakobsson, and J. Karlsson, “Multi-marginal optimal transport using partial information with applications in robust localization and sensor fusion,” Signal Process., vol. 171, June 2020, Art. no. 107474.
[25] H. Messer, “The Hybrid Cramer-Rao Lower Bound - From Practice to Theory,” in Fourth IEEE Workshop on Sensor Array and Multichannel Processing, 12–14 July 2006, pp. 304–307.
[26] Y. Naim and H. Messer, “Notes on the Tightness of the Hybrid Cramér-Rao Lower Bound,” IEEE Trans. Signal Process., vol. 57, no. 6, pp. 2074–2084, Jun. 2009.
[27] A. Yeredor, “The Joint MAP-ML Criterion and its Relation to ML and to Extended Least-Squares,” IEEE Trans. Signal Process., vol. 48, no. 12, pp. 3484–3492, Dec. 2000.
[28] M. G. Christensen, P. Stoica, A. Jakobsson, and S. H. Jensen, “Multi-pitch estimation,” Signal Processing, vol. 88, no. 4, pp. 972–983, April 2008.
[29] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 3rd edition, 1996.
[30] P. Stoica, A. Jakobsson, and J. Li, “Cisoid Parameter Estimation in the Colored Noise Case: Asymptotic Cramér-Rao Bound, Maximum Likelihood and Nonlinear Least-Squares,” IEEE Trans. Signal Process., vol. 45, pp. 2048–2059, August 1997.