MAXIMUM LIKELIHOOD ESTIMATION
Maximum likelihood is by far the most popular general method of estimation. Its widespread acceptance is seen on the one hand in the very large body of research dealing with its theoretical properties, and on the other in the almost unlimited list of applications.

To give a reasonably general definition of maximum likelihood estimates, let X = (X_1, ..., X_n) be a random vector of observations whose joint distribution is described by a density f_n(x|θ) over the n-dimensional Euclidean space R^n. The unknown parameter vector θ is contained in the parameter space Θ ⊂ R^s. For fixed x, define the likelihood function of x as L(θ) = L_x(θ) = f_n(x|θ), considered as a function of θ ∈ Θ.

Definition 1. Any θ̂ = θ̂(x) ∈ Θ which maximizes L(θ) over Θ is called a maximum likelihood estimate (MLE) of the unknown true parameter θ.

Often it is computationally advantageous to derive MLEs by maximizing log L(θ) in place of L(θ).

Example 1. Let X be the number of successes in n independent Bernoulli trials with success probability p ∈ [0, 1]; then

$$L_x(p) = f(x \mid p) = P(X = x \mid p) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n.$$

Solving

$$\frac{\partial}{\partial p} \log L_x(p) = \frac{x}{p} - \frac{n-x}{1-p} = 0$$

for p, one finds that log L_x(p), and hence L_x(p), has a maximum at

$$\hat{p} = \hat{p}(x) = x/n.$$

This example illustrates the considerable intuitive appeal of the MLE as that value of p for which the probability of the observed value x is the largest.
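To make Example 1 concrete, here is a minimal Python sketch (not part of the original article; the sample size n = 20, the observed count x = 13, and the use of numpy are illustrative assumptions). It compares the closed-form MLE p̂ = x/n with a direct numerical maximization of log L_x(p) over a fine grid.

import numpy as np

# Illustration of Example 1 (a sketch; the values n = 20 and x = 13 are arbitrary).
n, x = 20, 13

def log_likelihood(p):
    # log L_x(p) up to the additive constant log C(n, x),
    # which does not depend on p and so does not affect the maximizer.
    return x * np.log(p) + (n - x) * np.log(1.0 - p)

# Closed-form MLE from the likelihood equation x/p - (n - x)/(1 - p) = 0.
p_hat = x / n

# Crude numerical check: maximize log L_x(p) over a fine grid in (0, 1).
grid = np.linspace(1e-6, 1.0 - 1e-6, 100001)
p_numeric = grid[np.argmax(log_likelihood(grid))]

print(p_hat, p_numeric)   # both are approximately 0.65

Maximizing the log-likelihood rather than the likelihood itself, as noted above, avoids numerical underflow for larger samples while leaving the maximizer unchanged.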
It should be pointed out that MLEs do not always exist, as illustrated in the following natural mixture example; see Kiefer and Wolfowitz [32].

Example 2. Let X_1, ..., X_n be independent and identically distributed (i.i.d.) with density

$$f(x \mid \mu, \nu, \sigma, \tau, p) = \frac{p}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] + \frac{1-p}{\sqrt{2\pi}\,\tau} \exp\left[-\frac{1}{2}\left(\frac{x-\nu}{\tau}\right)^{2}\right],$$

where 0 ≤ p ≤ 1, µ, ν ∈ R, and σ, τ > 0. The likelihood function of the observed sample x_1, ..., x_n, although finite for any permissible choice of the five parameters, approaches infinity as, for example, µ = x_1, p > 0, and σ → 0. Thus the MLEs of the five unknown parameters do not exist.

Further, if an MLE exists, it is not necessarily unique, as illustrated in the following example.

Example 3. Let X_1, ..., X_n be i.i.d. with density f(x|α) = (1/2) exp(−|x − α|). Maximizing f_n(x_1, ..., x_n|α) is equivalent to minimizing Σ_{i=1}^n |x_i − α| over α. For n = 2m one finds that any α̂ ∈ [x_(m), x_(m+1)] serves as an MLE of α, where x_(i) is the ith order statistic of the sample.

The method of maximum likelihood estimation is generally credited to Fisher∗ [17–20], although its roots date back as far as Lambert∗, Daniel Bernoulli∗, and Lagrange in the eighteenth century; see Edwards [12] for a historical account. Fisher introduces the method in [17] as an alternative to the method of moments∗ and the method of least squares∗. The former method Fisher criticizes for its arbitrariness in the choice of moment equations and the latter for not being invariant under scale changes in the variables. The term likelihood∗, as distinguished from (inverse) probability, appears for the first time in [18]. Introducing the measure of information named after him (see FISHER INFORMATION), Fisher [18–20] offers several proofs for the efficiency of MLEs, namely that the asymptotic variance of asymptotically normal estimates cannot fall below the reciprocal of the information contained in the sample and, furthermore, that the MLE achieves this lower bound. Fisher's proofs, obscured by the fact that assumptions are not always clearly stated, cannot be considered completely rigorous by today's standards and should be understood in the context of his time. To some extent his work on maximum likelihood estimation was anticipated by Edgeworth [11], whose contributions are discussed by Savage [51] and Pratt [45]. However, it was Fisher's insight and advocacy that led to the prominence of maximum likelihood estimation as we know it today.

For a discussion and an extension of Definition 1 to richer (nonparametric) statistical models which preclude a model description through densities (i.e., likelihoods will be missing), see Scholz [52]. At times the primary concern is the estimation of some function g of θ. It is then customary to treat g(θ̂) as an "MLE" of g(θ), although strictly speaking Definition 1 only justifies this when g is a one-to-one function. For arguments toward a general justification of g(θ̂) as MLE of g(θ), see Zehna [58] and Berk [7].

CONSISTENCY

Much of maximum likelihood theory deals with the large sample (asymptotic) properties of MLEs, i.e., with the case in which it is assumed that X_1, ..., X_n are independent and identically distributed with density f(·|θ) (i.e., X_1, ..., X_n i.i.d. ∼ f(·|θ)). The joint density of X = (X_1, ..., X_n) is then

$$f_n(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta).$$

It further is assumed that the distributions P_θ corresponding to f(·|θ) are identifiable; i.e., θ ≠ θ′ with θ, θ′ ∈ Θ implies P_θ ≠ P_θ′. For future reference we state the following assumptions:

A0: X_1, ..., X_n i.i.d. ∼ f(·|θ), θ ∈ Θ;

A1: the distributions P_θ, θ ∈ Θ, are identifiable.

The following simple result further supports the intuitive appeal of the MLE; see Bahadur [3]:

Theorem 1. Under A0 and A1,

$$P_\theta\left[f_n(X \mid \theta) > f_n(X \mid \theta')\right] \to 1$$

as n → ∞ for any θ, θ′ ∈ Θ with θ ≠ θ′. If, in addition, Θ is finite, then the MLE θ̂_n exists and is consistent.
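Theorem 1 can be illustrated by simulation. The following Python sketch is not from the original article; it assumes, purely for concreteness, i.i.d. N(θ, 1) observations with true value θ = 0 and a fixed alternative θ′ = 0.5, and it tracks the relative frequency of the event f_n(X|θ) > f_n(X|θ′) as n grows.

import numpy as np

# Illustration of Theorem 1 (a sketch under an assumed N(theta, 1) model).
rng = np.random.default_rng(0)
theta, theta_prime = 0.0, 0.5   # true value and an arbitrary fixed alternative

def log_lik(sample, mu):
    # Log of the joint N(mu, 1) density of the sample, up to an additive constant.
    return -0.5 * np.sum((sample - mu) ** 2)

reps = 2000
for n in (5, 20, 100, 500):
    wins = 0
    for _ in range(reps):
        sample = rng.normal(theta, 1.0, size=n)
        if log_lik(sample, theta) > log_lik(sample, theta_prime):
            wins += 1
    # The observed frequency of f_n(X|theta) > f_n(X|theta') increases toward 1.
    print(n, wins / reps)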
The content of Theorem 1 is a cornerstone in Wald's [56] consistency proof of the MLE for the general case. Wald assumes that Θ is compact, which by a familiar compactness argument reduces the problem to the case in which Θ contains only finitely many elements. Aside from the compactness assumption on Θ, which often is not satisfied in practice, Wald's uniform integrability conditions (imposed on log f(·|θ)) often are not satisfied in typical examples.

Many improvements in Wald's approach toward MLE consistency were made by later researchers. For a discussion and further references, see Perlman [42]. Instead of Wald's theorem or any of its refinements, we present another theorem, due to Rao [47], which shows under what simple conditions MLE consistency may be established in a specific situation.

Theorem 2. Let A0 and A1 be satisfied and let f(·|θ) describe a multinomial experiment with cell probabilities π(θ) = (π_1(θ), ..., π_k(θ)). If the map θ → π(θ), θ ∈ Θ, has a continuous inverse (the inverse existing because of A1), then the MLE θ̂_n, if it exists, is a consistent estimator of θ.

For a counterexample to Theorem 2 when the inverse continuity assumption is not satisfied, see Kraft and LeCam [33].

A completely different approach toward proving consistency of MLEs was given by Cramér [9]. His proof is based on a Taylor expansion of log L(θ) and thus, in contrast to Wald's proof, assumes a certain amount of smoothness in f(·|θ) as a function of θ. Cramér gave the consistency proof only for Θ ⊂ R. Presented here are his conditions generalized to the multiparameter case, Θ ⊂ R^s:

C1: The distributions P_θ have common support for all θ ∈ Θ; i.e., {x : f(x|θ) > 0} does not change with θ ∈ Θ.

C2: There exists an open subset ω of Θ containing the true parameter point θ_0 such that for almost all x the density f(x|θ) admits all third derivatives

$$\frac{\partial^3}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k} f(x \mid \theta) \qquad \text{for all } \theta \in \omega.$$

C3: The expectations

$$E_\theta\left[\frac{\partial}{\partial\theta_j} \log f(X \mid \theta)\right] = 0, \qquad j = 1, \ldots, s,$$

and

$$I_{jk}(\theta) := E_\theta\left[\frac{\partial}{\partial\theta_j} \log f(X \mid \theta)\,\frac{\partial}{\partial\theta_k} \log f(X \mid \theta)\right] = -E_\theta\left[\frac{\partial^2}{\partial\theta_j\,\partial\theta_k} \log f(X \mid \theta)\right]$$

exist and are finite for j, k = 1, ..., s and all θ ∈ ω.

C4: The Fisher information matrix I(θ) = (I_jk(θ))_{j,k=1,...,s} is positive definite for all θ ∈ ω.

C5: There exist functions M_ijk(x), independent of θ, such that for all i, j, k = 1, ..., s,

$$\left|\frac{\partial^3}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k} \log f(x \mid \theta)\right| \le M_{ijk}(x) \qquad \text{for all } \theta \in \omega,$$

where E_{θ_0}[M_ijk(X)] < ∞.

For a proof see Lehmann [37, Sect. 6.4]. The theorem needs several comments for clarification:

(a) If the likelihood function L(θ) attains its maximum at an interior point of Θ, then the MLE is a solution to the likelihood equations. If in addition the likelihood equations have only one root, then Theorem 3 proves the consistency of the MLE (θ̂_n = θ̃_n).

(b) Theorem 3 does not state how to identify the consistent root among possibly many roots of the likelihood equations. One could take the root θ̃_n which is closest to θ_0, but then θ̃_n is no longer an estimator, since its construction assumes knowledge of the unknown value of θ_0. This problem may be overcome by taking that root which is closest to a (known) consistent estimator of θ_0. The utility of this approach becomes clear in the section on efficiency.

(c) The MLE does not necessarily coincide with the consistent root guaranteed by Theorem 3. Kraft and LeCam [33] give an example in which Cramér's conditions are satisfied and the MLE exists, is unique, and satisfies the likelihood equations, yet is not consistent.

In view of these comments, it is advantageous to establish the uniqueness of the likelihood equation roots whenever possible. For example, if f(x|θ) is of nondegenerate multiparameter exponential family type, then