
2017 25th European Signal Processing Conference (EUSIPCO)

On the Efficiency of Maximum-Likelihood Estimators of Misspecified Models

Mahamadou Lamine Diong, Eric Chaumette and François Vincent
University of Toulouse - ISAE-Supaero, Toulouse, France
Emails: [email protected], [email protected], [email protected]

Abstract—The key results on maximum-likelihood (ML) estimation of misspecified models have been introduced by statisticians (P.J. Huber, H. Akaike, H. White, Q. H. Vuong) resorting to a general probabilistic formalism somewhat difficult to rephrase into the formalism widespread in the signal processing literature. In particular, Vuong proposed two misspecified Cramér-Rao bounds (CRBs) to address, respectively, the situation where the true parametric probability model is known, or not known. In this communication, derivations of the existing results on the accuracy of ML estimation of misspecified models are outlined in an easily comprehensible manner. Simple alternative derivations of these two misspecified CRBs based on the seminal work of Barankin (which underlies all the lower bounds introduced in deterministic estimation) are provided. Since two distinct CRBs exist when the true parametric probability model is known, a quasi-efficiency denomination is introduced.

I. INTRODUCTION

Since its introduction by R.A. Fisher in deterministic estimation [1][2], the method of maximum likelihood (ML) estimation has become one of the most widely used methods of estimation. The ongoing success of ML estimators (MLEs) originates from the fact that, under reasonably general conditions on the probabilistic observation model [1][2], the MLEs are, in the limit of large sample support, Gaussian distributed and consistent. Additionally, if the observation model is Gaussian, some additional asymptotic regions of operation yielding, for a subset of MLEs, Gaussian distributed and consistent estimates have also been identified at finite sample support [3][4]. However, a fundamental assumption underlying the above classical results on the properties of MLEs is that the true probability distribution which determines the behavior of the observations is known to lie within a specified parametric family of probability distributions (the probability model). In other words, the probability model is assumed to be "correctly specified".

Actually, in many (if not most) circumstances, a certain amount of mismatch between the true probability distribution of the observations and the probability model that we assume is present. As a consequence, it is natural to investigate what happens to the properties of MLEs if the probability model is misspecified, i.e. not correctly specified. Huber [5] explored in detail the performance of MLEs under very general assumptions on misspecification, proved consistency and asymptotic normality, and derived the MLEs' asymptotic covariance, which is often referred to as the Huber's "sandwich" in the literature. However, Huber did not explicitly discuss the information-theoretic interpretation of this limit. This interpretation has been emphasized by Akaike [6], who observed that when the true distribution is unknown, the MLE is a natural estimator for the parameters which minimize the Kullback-Leibler information criterion (KLIC) between the true and the assumed probability model. Then White [7] provided simple conditions under which the MLE is a strongly consistent estimator for the parameter vector which minimizes the KLIC. While not as general as Huber's conditions, White's conditions are however sufficiently general to have broad applicability. Lastly, Q. H. Vuong [8] proposed two misspecified Cramér-Rao bounds (CRBs) to address, respectively, the situation where the true parametric probability model is known, or not known, under a general probabilistic formalism involving regular and semi-regular parametric models.

Therefore, the purpose of this communication is twofold. Firstly, in order to foster the understanding of the works of Huber, Akaike, and White on misspecified MLEs [5][6][7], derivations of the key results are outlined in an easily comprehensible manner. Secondly, following the lead of Barankin's seminal work in deterministic estimation [9], simple alternative derivations of the two misspecified CRBs introduced by Vuong are put forward. As a by-product, the misspecified CRB proposed by Vuong [8, Theorem 4.1] when the true parametric probability model is unknown is a least-upper CRB in the Barankin sense, which coincides with the Huber's "sandwich" covariance, also called the misspecified CRB under ML constraints in [10, (42)] or the misspecified CRB for misspecified-unbiased estimators in [11, (5)]. Last, since two distinct misspecified CRBs exist when the true parametric probability model is known, a quasi-efficiency denomination is introduced.

This work has been partially supported by the DGA/MRIS (2015.60.0090.00.470.75.01).

A. Notations and assumptions

Let x_l be an M-dimensional complex random vector representing the outcome of a random experiment (i.e., the observation vector) whose probability density function (p.d.f.) is known to belong to a family P. A structure S is a set of hypotheses which implies a unique p.d.f. in P for x_l. Such a p.d.f. is denoted p(x_l; S). The set of all the a priori possible structures is called a probability model [2]. We assume that the p.d.f. of the random vector x_l has a parametric representation, i.e., we assume that every structure S is parameterized by a

ISBN 978-0-9928626-7-1 © EURASIP 2017

P-dimensional vector ψ, that is, p(x_l; ψ) ≜ p(x_l; S(ψ)), and that the model is described by a compact subspace U ⊂ R^P. In the following, we consider L i.i.d. observations {x_l}_{l=1}^{L}, for which the true parametric p.d.f., denoted p_δ(x_l) ≜ p(x_l; δ), δ ∈ U_δ ⊂ R^{P_δ}, and the assumed parametric p.d.f., denoted f_θ(x_l) ≜ f(x_l; θ), θ ∈ U_θ ⊂ R^{P_θ}, belong to two (generally different) families of p.d.f.'s, P_δ and F_θ. Let us denote: E_θ[g(x)] ≜ E_{f_θ}[g(x)] = ∫ g(x) f_θ(x) dx, E_δ[g(x)] ≜ E_{p_δ}[g(x)] = ∫ g(x) p_δ(x) dx, where x = (x_1^T, ..., x_L^T)^T, f_θ(x) = ∏_{l=1}^{L} f_θ(x_l) and p_δ(x) = ∏_{l=1}^{L} p_δ(x_l). If the true model is unknown or not needed, i.e., if we do not have or do not need prior information on the particular parameterization of the true distribution, we refer to p_δ(x) and P_δ only as p(x) = ∏_{l=1}^{L} p(x_l) and P, respectively, and we denote E_p[g(x)] = ∫ g(x) p(x) dx.

II. ML ESTIMATION OF MISSPECIFIED MODELS

As mentioned in the introduction, several authors [5][6][7] have contributed to show that, under mild regularity conditions given in [7] (and summarized in [11, Section II.A]), the misspecified MLE (MMLE) defined as:

θ̂(x) = arg max_θ { f_θ(x) } = arg max_θ { ln f_θ(x)/L },   (1a)

is, in the limit of large sample support (L → ∞), a strongly consistent estimator of the parameter vector θ_f which minimizes the KLIC¹:

θ_f = arg min_θ { D(p‖f_θ) },   D(p‖f_θ) ≜ E_p[ ln( p(x_l)/f_θ(x_l) ) ].   (1b)

Indeed, as noticed in [6], since ln f_θ(x)/L = Σ_{l=1}^{L} ln f_θ(x_l)/L → E_p[ln f_θ(x_l)] almost surely (strong law of large numbers), θ̂(x) is in general a natural estimator of:

θ_f = arg max_θ { E_p[ln f_θ(x_l)] } = arg min_θ { E_p[ ln( p(x_l)/f_θ(x_l) ) ] },

which can be proved to be strongly consistent under the mild regularity conditions given in [7]. Therefore the gradient of the ML objective function (1a) can be well approximated via a first-order Taylor series expansion about θ_f:

θ̂(x) → θ_f − [ ∂² ln f(x; θ_f)/(L ∂θ∂θ^T) ]^{-1} ∂ ln f(x; θ_f)/(L ∂θ)   (a.s.),   (2a)

which, in the limit of large sample support, yields:

θ̂(x) → θ_f − W(θ_f)^{-1} ∂ ln f(x; θ_f)/(L ∂θ)   (a.s.),   W(θ) = E_p[ ∂² ln f(x_l; θ)/∂θ∂θ^T ],   (2b)

following from an argument similar to the one given by Cramér [1, pp. 500-503][7]. Hence θ̂(x) is asymptotically normal [7, Th. 3.2]:

θ̂(x) ~_A N( m_θ̂, C_θ̂ ),   (3a)

m_θ̂ → θ_f − W(θ_f)^{-1} E_p[ ∂ ln f(x_l; θ_f)/∂θ ],   (3b)

C_θ̂ → C(θ_f) = W(θ_f)^{-1} C_ζ(θ_f) W(θ_f)^{-1},   (3c)

where C_θ̂ = E_p[ (θ̂(x) − m_θ̂)(θ̂(x) − m_θ̂)^T ], m_θ̂ = E_p[θ̂(x)], ζ ≜ ζ(x; θ) = ∂ ln f(x; θ)/(L ∂θ) and:

L C_ζ(θ) = E_p[ ∂ ln f(x_l; θ)/∂θ ∂ ln f(x_l; θ)/∂θ^T ] − E_p[ ∂ ln f(x_l; θ)/∂θ ] E_p[ ∂ ln f(x_l; θ)/∂θ^T ].   (3d)

In the particular case of the MMLE (1a-1b) and under the regularity conditions summarized in [11, Section II.A], θ_f is an interior point of U_θ, i.e. a local minimum of D(p‖f_θ), which satisfies:

θ_f = arg min_θ { D(p‖f_θ) } = arg max_θ { E_p[ln f_θ(x_l)] } = arg_θ { E_p[ ∂ ln f(x_l; θ)/∂θ ] = 0 }.   (4a)

Then:

E_p[ ∂ ln f(x_l; θ_f)/∂θ ] = 0,   m_θ̂ = E_p[θ̂(x)] = θ_f,   C_θ̂ = E_p[ (θ̂(x) − θ_f)(θ̂(x) − θ_f)^T ],   (4b)

and the asymptotic covariance matrix (3c) can be further simplified and reduces to the Huber's "sandwich" covariance:

C_H(θ_f) = ( W(θ_f)^{-1}/L ) E_p[ ∂ ln f(x_l; θ_f)/∂θ ∂ ln f(x_l; θ_f)/∂θ^T ] W(θ_f)^{-1}.   (4c)

Last, since any covariance matrix satisfies the covariance inequality [2], that is, ∀η(x):

C_θ̂ ≥ E_p[ (θ̂(x) − θ_f) η(x)^T ] E_p[ η(x) η(x)^T ]^{-1} E_p[ η(x) (θ̂(x) − θ_f)^T ],   (5a)

consequently, according to (3c), almost surely ∀η(x):

C_H(θ_f) ≥ E_p[ (θ̂(x) − θ_f) η(x)^T ] E_p[ η(x) η(x)^T ]^{-1} E_p[ η(x) (θ̂(x) − θ_f)^T ],   (5b)

which is referred to as the Huber's "sandwich" (covariance) inequality in the literature on misspecified models.

¹The value θ_f is called the pseudo-true parameter of θ for the model f_θ(x_l) when p(x_l) is the true p.d.f.
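As a numerical illustration of (1a)-(1b) and (4a) (not part of the original paper), the following Python sketch uses the scalar Gaussian example detailed later in Section V: the true model is N(m_x, θ), the assumed model is N(m, θ) with a misspecified mean m ≠ m_x, and all numerical values are illustrative choices. The MMLE of the variance converges to the pseudo-true parameter θ_f = θ + (m_x − m)², not to the true θ.

```python
import numpy as np

rng = np.random.default_rng(0)

# True model: x_l ~ N(m_x, theta); assumed model: f(x_l; t) = N(m, t), m != m_x
m_x, theta = 1.0, 2.0          # true mean and variance (illustrative values)
m = 0.5                        # misspecified assumed mean

L = 200_000                    # large sample support
x = rng.normal(m_x, np.sqrt(theta), L)

# MMLE (1a): for this model, arg max_t ln f_t(x)/L has the closed form below
theta_hat = np.mean((x - m) ** 2)

# Pseudo-true parameter (1b)/(4a): the KLIC minimizer
theta_f = theta + (m_x - m) ** 2

print(f"theta_hat = {theta_hat:.3f}, theta_f = {theta_f:.3f}, theta = {theta}")
```

Increasing L drives θ̂(x) toward θ_f while the gap to the true θ persists, which illustrates the strong consistency toward the KLIC minimizer rather than toward the true parameter.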

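The sandwich covariance (4c) can also be checked by Monte Carlo for the same scalar Gaussian example (an added sketch, not from the paper; the closed forms below are hand-derived for f(x_l; t) = N(m, t): W(θ_f) = −1/(2θ_f²) and E_p[(∂ ln f(x_l; θ_f)/∂θ)²] = (2θ² + 4θ(m_x − m)²)/(4θ_f⁴), so that C_H(θ_f) = (2θ² + 4θ(m_x − m)²)/L).

```python
import numpy as np

rng = np.random.default_rng(1)

m_x, theta, m = 1.0, 2.0, 0.5      # illustrative values
d2 = (m_x - m) ** 2
theta_f = theta + d2               # pseudo-true parameter (4a)

L, n_mc = 500, 10_000              # sample support and Monte Carlo trials
x = rng.normal(m_x, np.sqrt(theta), (n_mc, L))
theta_hat = np.mean((x - m) ** 2, axis=1)   # one MMLE per trial

mc_var = np.var(theta_hat)                  # empirical variance of the MMLE

# Huber "sandwich" (4c), closed form for this scalar model (hand-derived):
# C_H = (W^{-1}/L) * E_p[score^2] * W^{-1} = (2 theta^2 + 4 theta d2)/L
C_H = (2 * theta**2 + 4 * theta * d2) / L

print(f"Monte Carlo var = {mc_var:.5f}, sandwich C_H = {C_H:.5f}")
```

The empirical variance of θ̂ matches the sandwich formula up to Monte Carlo error, in line with (3a)-(3c) and (4c).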

III. CRB FOR MISSPECIFIED MODELS IN THE BARANKIN SENSE

When the true parametric model p(x; δ) is known and assumed to be correctly specified, all the lower bounds introduced in deterministic estimation [12] have been derived from the seminal work of Barankin. In [9], Barankin established the general form of the greatest lower bound on any sth absolute central moment (s > 1) of a uniformly unbiased estimator with respect to p(x; δ), generalizing the earlier works of Cramér, Rao and Bhattacharyya on locally unbiased estimators. Barankin showed [9, Section 6], among other things, that the definition of the CRB can be generalized to any absolute moment as the limiting form of the Hammersley-Chapman-Robbins bound (HaChRB). The general results introduced by Barankin require not only the knowledge of the parameterization of the true distribution p(x; δ) but also the formulation of a uniform unbiasedness constraint of the form:

E_δ[ĝ(x)] = g(δ),   ∀δ ∈ U_δ,   (6a)

if one wants to derive lower bounds on the MSE of an unbiased estimate ĝ(x) of the vector g(δ) of functions of δ. Since the MMLE θ̂(x) (1a-1b) is, in the limit of large sample support, an unbiased estimate of θ_f (4b), and since there exists an implicit relationship between θ_f and δ (4a):

θ_f(δ) = arg_θ { E_δ[ ∂ ln f(x_l; θ)/∂θ ] = 0 },   (6b)

the form of (6a) of interest is therefore:

E_δ[θ̂(x)] = θ_f(δ),   ∀δ ∈ U_δ.   (6c)

Then, according to [9] and in the particular case of the MSE (absolute moment of order 2), the misspecified CRB (MCRB) for unbiased estimates of the pseudo-true parameter θ_f is defined, for any selected value δ⁰, as the limiting form of the HaChRB obtained from the covariance inequality (5a) where [13]:

η(x)^T = ( p(x; δ⁰ + u_1 dδ)/p(x; δ⁰) − 1, ..., p(x; δ⁰ + u_{P_δ} dδ)/p(x; δ⁰) − 1 ),

E_{δ⁰}[ (θ̂(x) − θ_f⁰) η(x)^T ] = ( θ_f(δ⁰ + u_1 dδ) − θ_f⁰, ..., θ_f(δ⁰ + u_{P_δ} dδ) − θ_f⁰ ),

θ_f⁰ = θ_f(δ⁰), u_p is the pth column of the identity matrix I_{P_δ}, and dδ → 0, leading to C_θ̂ ≥ MCRB_{δ⁰} where:

MCRB_{δ⁰} = ( ∂θ_f(δ⁰)/∂δ^T ) E_{δ⁰}[ ∂ ln p(x; δ⁰)/∂δ ∂ ln p(x; δ⁰)/∂δ^T ]^{-1} ( ∂θ_f(δ⁰)/∂δ^T )^T.   (7)

∂θ_f(δ)/∂δ^T can be easily obtained using the following implicit function theorem [14, Theorem 9.28]. Let h(θ, δ) = (h_1(θ, δ), ..., h_{P_θ}(θ, δ))^T be a function of R^{P_θ} × R^{P_δ} → R^{P_θ}. Let us assume the following: A1) h_p(θ, δ), for p = 1, ..., P_θ, are differentiable functions on a neighborhood of the point (θ⁰, δ⁰) in R^{P_θ} × R^{P_δ}; A2) h(θ⁰, δ⁰) = 0; A3) the P_θ × P_θ Jacobian matrix of h(θ, δ) with respect to θ is nonsingular at (θ⁰, δ⁰). Then there is a neighborhood Δ of the point δ⁰ in R^{P_δ}, there is a neighborhood Θ of the point θ⁰ in R^{P_θ}, and there is a unique mapping φ : Δ → Θ such that θ⁰ = φ(δ⁰) and h(φ(δ), δ) = 0 for all δ in Δ. Furthermore, φ(δ) is differentiable at δ⁰ and satisfies:

φ(δ) − φ(δ⁰) = − ( ∂h(θ⁰, δ⁰)/∂θ^T )^{-1} ( ∂h(θ⁰, δ⁰)/∂δ^T )( δ − δ⁰ ) + o( ‖δ − δ⁰‖ ).

In the case addressed, (6b) yields h(θ, δ) = E_δ[ ∂ ln f(x_l; θ)/∂θ ] and φ(δ) = θ_f(δ). Then:

∂θ_f(δ⁰)/∂δ^T = − W(θ_f⁰)^{-1} E_{δ⁰}[ ∂ ln f(x_l; θ_f⁰)/∂θ ∂ ln p(x_l; δ⁰)/∂δ^T ],

and (7) becomes:

MCRB_{δ⁰} = W(θ_f⁰)^{-1} E_{δ⁰}[ ∂ ln f(x_l; θ_f⁰)/∂θ ∂ ln p(x_l; δ⁰)/∂δ^T ] × (1/L) E_{δ⁰}[ ∂ ln p(x_l; δ⁰)/∂δ ∂ ln p(x_l; δ⁰)/∂δ^T ]^{-1} × E_{δ⁰}[ ∂ ln p(x_l; δ⁰)/∂δ ∂ ln f(x_l; θ_f⁰)/∂θ^T ] W(θ_f⁰)^{-1}.   (8)

Clearly the Barankin approach provides a simpler alternative derivation of (8) than the one earlier proposed by Vuong [8, Theorem 3.1] under a general probabilistic formalism involving regular and semi-regular parametric models, which is somewhat difficult to follow. Moreover, by the covariance inequality:

E_{δ⁰}[ ∂ ln f(x_l; θ_f⁰)/∂θ ∂ ln f(x_l; θ_f⁰)/∂θ^T ] ≥ E_{δ⁰}[ ∂ ln f(x_l; θ_f⁰)/∂θ ∂ ln p(x_l; δ⁰)/∂δ^T ] × E_{δ⁰}[ ∂ ln p(x_l; δ⁰)/∂δ ∂ ln p(x_l; δ⁰)/∂δ^T ]^{-1} × E_{δ⁰}[ ∂ ln p(x_l; δ⁰)/∂δ ∂ ln f(x_l; θ_f⁰)/∂θ^T ],   (9a)
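The implicit-function-theorem expression for ∂θ_f(δ⁰)/∂δ^T can be checked numerically (an added sketch, not from the paper). For the scalar Gaussian example of Section V with δ = (m_x, θ)^T, one has θ_f(δ) = θ + (m_x − m)², hence the analytic gradient ∂θ_f/∂δ^T = (2(m_x − m), 1); the scores and W below are hand-derived closed forms for that model.

```python
import numpy as np

rng = np.random.default_rng(2)

m_x, theta, m = 1.0, 2.0, 0.5      # illustrative values; delta0 = (m_x, theta)
d = m_x - m
theta_f = theta + d**2

n = 2_000_000
x = rng.normal(m_x, np.sqrt(theta), n)

# Per-sample scores at (theta_f, delta0), hand-derived for Gaussian p.d.f.s
s_f = ((x - m) ** 2 - theta_f) / (2 * theta_f**2)     # d ln f(x_l; theta_f)/d theta
s_m = (x - m_x) / theta                               # d ln p(x_l; delta0)/d m_x
s_t = ((x - m_x) ** 2 - theta) / (2 * theta**2)       # d ln p(x_l; delta0)/d theta

W = -1.0 / (2 * theta_f**2)       # E_p[d^2 ln f/d theta^2] at theta_f

# -W^{-1} E_{delta0}[ s_f (d ln p/d delta)^T ], estimated by sample means
grad = -np.array([np.mean(s_f * s_m), np.mean(s_f * s_t)]) / W

print(f"IFT gradient ~ {grad}, analytic = [{2 * d}, 1.0]")
```

Both components of the Monte Carlo estimate match the analytic gradient (2(m_x − m), 1), as predicted by the implicit function theorem applied to (6b).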


with equality iff [2]:

∂ ln f(x_l; θ_f⁰)/∂θ = E_{δ⁰}[ ∂ ln f(x_l; θ_f⁰)/∂θ ∂ ln p(x_l; δ⁰)/∂δ^T ] × E_{δ⁰}[ ∂ ln p(x_l; δ⁰)/∂δ ∂ ln p(x_l; δ⁰)/∂δ^T ]^{-1} ∂ ln p(x_l; δ⁰)/∂δ.   (9b)

Therefore, when p(x; δ) is known, one can assert that in most cases:

C_θ̂ → C_H(θ_f⁰) > MCRB_{δ⁰};   (10)

in other words, in most cases, θ̂ is no longer an efficient estimator of θ_f (in comparison with the correctly specified case [1]). Furthermore, if p(x; ·) and f(x; ·) share the same parameterization, i.e. p(x; θ) and f(x; θ), then the MSE of θ̂ is:

MSE_θ̂(θ⁰) = E_{p_{θ⁰}}[ (θ̂(x) − θ⁰)(θ̂(x) − θ⁰)^T ],   (11a)

and, in the limit of large sample support, is given by:

MSE_θ̂(θ⁰) → C_H(θ_f⁰) + (θ_f⁰ − θ⁰)(θ_f⁰ − θ⁰)^T.   (11b)

Therefore, when the true parametric model is known, one can assert that in most cases the MMLE is not a consistent estimator of θ, and whenever it is consistent, it is not an efficient estimator of θ, which contrasts with the behavior of MLEs, since, if the MLEs are consistent, then they are also asymptotically efficient [3].

IV. LEAST-UPPER CRB FOR MISSPECIFIED MODELS IN THE BARANKIN SENSE

If the true model is unknown, i.e., if we do not have prior information on the particular parameterization of the true distribution p(x_l), the formulation of uniform unbiasedness (6c) is no longer possible. However, the Barankin approach can still be used by building on Vuong's work, where it is shown [8, Theorem 4.1] that, under mild regularity conditions summarized in [11, Section II.A], the following surrogate parametric model:

p̃_θ(x_l) ≜ p̃(x_l; θ) = ( p(x_l)/c(θ) )( 1 + exp( 1 − f(x_l; θ)/f(x_l; θ_f) ) ),   (12)

where c(θ) is a normalizing constant, is a locally least favorable true parametric model in the MSE sense. Indeed, the minimization of the KLIC D(p̃_θ‖f_θ) (1b) in the vicinity of θ_f yields a locally unbiased estimator of θ_f, allowing for the derivation of the MCRB (8) associated with p̃(x_l; θ), which satisfies (9b) at θ_f since [8, (A.62)]: ∂ ln f(x_l; θ_f)/∂θ = α ∂ ln p̃(x_l; θ_f)/∂θ, α = −2. Thus the MCRB associated with p̃(x_l; θ) coincides at θ_f with the Huber's "sandwich" covariance C_H(θ_f) (4c). Therefore, by reference to (5b), the Huber's "sandwich" covariance appears to be the least-upper MCRB (LUMCRB) for locally unbiased estimates of the pseudo-true parameter θ_f:

LUMCRB(θ_f) = C_H(θ_f),   (13)

both in Vuong and Barankin senses. Another noteworthy point, stressed in [10, Section VII.C], is that, since in the limit of large sample support the MMLE satisfies [10, (57)]:

E_p[ (θ̂(x) − θ_f) ( ∂ ln f(x; θ_f)/(L∂θ) )^T ] = − ( W(θ_f)^{-1}/L ) E_p[ ∂ ln f(x_l; θ_f)/∂θ ∂ ln f(x_l; θ_f)/∂θ^T ],

C_H(θ_f) (4c) is also obtained from (5b) where the score function η(x) is defined as η(x) ≜ ∂ ln f(x; θ_f)/∂θ. Finally, the so-called MCRB under ML constraints [10, (42)] and the so-called MCRB for misspecified-unbiased estimators [8, Theorem 4.1][11, (5)] appear to be, in the Barankin sense, the LUMCRB (13).

A. Quasi-efficiency

Interestingly enough, if the true parametric model p(x_l; θ) is known, the parametric model (12) can still be defined as:

p̃_θ(x_l) = ( p(x_l; δ⁰)/c(θ) )( 1 + exp( 1 − f(x_l; θ)/f(x_l; θ_f⁰) ) ),

where θ_f⁰ = θ_f(δ⁰), and all the results mentioned above still hold as well. Then the covariance matrix of a locally unbiased estimator of θ_f may be either equal to the MCRB or to the LUMCRB. In the Barankin sense, the former case defines an efficient estimator. Hence the need to introduce a new denomination for the latter case: we propose to call such an estimator a quasi-efficient estimator. Finally, one can assert that in most cases the MMLE is not a consistent estimator of θ, and whenever it is consistent, it is only a quasi-efficient estimator of θ. As expected, both the MCRB and the LUMCRB reduce to the usual CRB when the model is known to be correctly specified, i.e. f(x_l; θ) ≜ p(x_l; θ), since then θ_f = θ and p̃(x_l; θ) = p(x_l; θ), leading to:

MCRB_θ = LUMCRB(θ) = (1/L) E_θ[ ∂ ln p(x_l; θ)/∂θ ∂ ln p(x_l; θ)/∂θ^T ]^{-1} = CRB_θ,

and, in the limit of large sample support, any quasi-efficient estimator becomes an efficient estimator. Last but not least, since the MMLE's asymptotic covariance matrix C_H(θ_f) is available (4c), the derivation of additional lower bounds via the Huber's sandwich inequality (5b) may seem questionable. It is probably the reason why misspecified lower bounds have received little consideration in the literature [8][15], except very recently [10][11].

V. AN ILLUSTRATIVE EXAMPLE

We revisit the problem of the estimation of the variance of Gaussian observations in the presence of a misspecified mean value, proposed in [11, Section III]. Let us assume we have a set of L i.i.d. scalar observations x = (x_1, ..., x_L)^T, distributed according to a Gaussian p.d.f. with a known mean value m_x and an unknown variance σ_x² ≜ θ, i.e. p(x; θ) =


p_N(x; m_x 1_L, θ I_L). Suppose now that the assumed Gaussian p.d.f. is f(x; θ) = p_N(x; m 1_L, θ I_L), so that we misspecify the mean value. Then (1a-1b) become [11, Section III]:

θ̂(x) = (1/L) Σ_{l=1}^{L} (x_l − m)²,   θ_f = θ + (m_x − m)²,

where E_p[θ̂(x)] = θ_f, and the LUMCRB is given by [11, (22)]:

LUMCRB(θ_f) = 2θ²/L + 4θ(m_x − m)²/L + (m_x − m)⁴.   (14)

According to (7), since ∂θ_f(θ)/∂θ = 1, the MCRB is simply:

MCRB_θ = E_θ[ ( ∂ ln p(x; θ)/∂θ )² ]^{-1} = CRB_θ = 2θ²/L,   (15)

which exemplifies that the MMLE is only a quasi-efficient estimator of θ, as predicted in most cases when the true parametric model is known (10).
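The example above can be checked numerically (an added sketch, not from the paper; values are illustrative). The Monte Carlo MSE of θ̂ about θ_f matches the sandwich part C_H(θ_f) = 2θ²/L + 4θ(m_x − m)²/L and stays strictly above MCRB_θ = CRB_θ = 2θ²/L, while, consistently with (11b), the MSE about the true θ adds the squared offset (θ_f − θ)² = (m_x − m)⁴.

```python
import numpy as np

rng = np.random.default_rng(3)

m_x, theta, m = 1.0, 2.0, 0.5      # illustrative values
d2 = (m_x - m) ** 2
theta_f = theta + d2

L, n_mc = 500, 10_000
x = rng.normal(m_x, np.sqrt(theta), (n_mc, L))
theta_hat = np.mean((x - m) ** 2, axis=1)   # MMLE, one per trial

mse_about_theta_f = np.mean((theta_hat - theta_f) ** 2)
mse_about_theta = np.mean((theta_hat - theta) ** 2)

C_H = 2 * theta**2 / L + 4 * theta * d2 / L   # sandwich covariance (4c)
mcrb = 2 * theta**2 / L                        # = CRB_theta, eq. (15)

print(f"MSE about theta_f = {mse_about_theta_f:.5f} (C_H = {C_H:.5f})")
print(f"MSE about theta   = {mse_about_theta:.5f} (C_H + bias^2 = {C_H + d2**2:.5f})")
print(f"MCRB = {mcrb:.5f} < C_H: the MMLE is only quasi-efficient")
```

The gap between C_H and the MCRB is exactly the quasi-efficiency phenomenon of Section IV.A: the MMLE attains the LUMCRB but not the MCRB.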

REFERENCES

[1] H. Cramér, Mathematical Methods of Statistics. Princeton, NJ: Princeton Univ. Press, 1946.
[2] E. L. Lehmann and G. Casella, Theory of Point Estimation (2nd ed.). Springer, 1998.
[3] S. Haykin, J. Litva, and T. J. Shepherd, Radar Array Processing, Chapter 4. Springer-Verlag, 1993.
[4] A. Renaux, P. Forster, E. Chaumette, and P. Larzabal, "On the high SNR CML estimator full statistical characterization", IEEE Trans. on SP, 54(12): 4840-4843, 2006.
[5] P. J. Huber, "The behavior of maximum likelihood estimates under non-standard conditions", in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[6] H. Akaike, "Information theory and an extension of the maximum likelihood principle", in Proceedings of IEEE ISIT, 1973.
[7] H. White, "Maximum likelihood estimation of misspecified models", Econometrica, vol. 50, pp. 1-25, Jan. 1982.
[8] Q. H. Vuong, "Cramér-Rao bounds for misspecified models", working paper 652, Div. of the Humanities and Social Sci., Caltech, Pasadena, USA, 1986.
[9] E. W. Barankin, "Locally best unbiased estimates", Ann. Math. Stat., 20(4): 477-501, 1949.
[10] C. D. Richmond and L. L. Horowitz, "Parameter bounds on estimation accuracy under model misspecification", IEEE Trans. on SP, 63(9): 2263-2278, 2015.
[11] S. Fortunati, F. Gini, and M. S. Greco, "The misspecified CRB and its application to the scatter matrix estimation in complex elliptically symmetric distributions", IEEE Trans. on SP, 64(9): 2387-2399, 2016.
[12] K. Todros and J. Tabrikian, "General classes of performance lower bounds for parameter estimation - Part I: Non-Bayesian bounds for unbiased estimators", IEEE Trans. on IT, 56(10): 5064-5082, 2010.
[13] R. McAulay and E. M. Hofstetter, "Barankin bounds on parameter estimation", IEEE Trans. on IT, 17(6): 669-676, 1971.
[14] W. Rudin, Principles of Mathematical Analysis. New York: McGraw-Hill, 1976.
[15] T. B. Fomby and R. C. Hill, Maximum-Likelihood Estimation of Misspecified Models: Twenty Years Later. Oxford, U.K.: Elsevier, Ltd., 2003.
