
MAXIMUM LIKELIHOOD ESTIMATION

Maximum likelihood is by far the most popular general method of estimation. Its widespread acceptance is seen on the one hand in the very large body of research dealing with its theoretical properties, and on the other in the almost unlimited list of applications.

To give a reasonably general definition of maximum likelihood estimates, let X = (X₁, ..., Xₙ) be a random vector of observations whose joint distribution is described by a density fₙ(x|θ) over the n-dimensional Euclidean space ℝⁿ. The unknown parameter vector θ is contained in the parameter space Θ ⊂ ℝˢ. For fixed x define the likelihood function of x as L(θ) = Lₓ(θ) = fₙ(x|θ), considered as a function of θ ∈ Θ.

Definition 1. Any θ̂ = θ̂(x) ∈ Θ which maximizes L(θ) over Θ is called a maximum likelihood estimate (MLE) of the unknown true parameter θ.

Often it is computationally advantageous to derive MLEs by maximizing log L(θ) in place of L(θ).

Example 1. Let X be the number of successes in n independent Bernoulli trials with success probability p ∈ [0, 1]; then

Lₓ(p) = f(x|p) = P(X = x|p) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ, x = 0, 1, ..., n,

where C(n, x) denotes the binomial coefficient. Solving

(∂/∂p) log Lₓ(p) = x/p − (n − x)/(1 − p) = 0

for p, one finds that log Lₓ(p), and hence Lₓ(p), has a maximum at

p̂ = p̂(x) = x/n.

This example illustrates the considerable intuitive appeal of the MLE as that value of p for which the probability of the observed value x is the largest.

It should be pointed out that MLEs do not always exist, as illustrated in the following natural mixture example; see Kiefer and Wolfowitz [32].

Example 2. Let X₁, ..., Xₙ be independent and identically distributed (i.i.d.) with density

f(x|μ, ν, σ, τ, p) = [p/(√(2π)σ)] exp{−½[(x − μ)/σ]²} + [(1 − p)/(√(2π)τ)] exp{−½[(x − ν)/τ]²},

where 0 ≤ p ≤ 1, μ, ν ∈ ℝ, and σ, τ > 0. The likelihood function of the observed sample x₁, ..., xₙ, although finite for any permissible choice of the five parameters, approaches infinity as, for example, μ = x₁, p > 0, and σ → 0. Thus the MLEs of the five unknown parameters do not exist.

Further, if an MLE exists, it is not necessarily unique, as is illustrated in the following example.

Example 3. Let X₁, ..., Xₙ be i.i.d. with density f(x|α) = ½ exp(−|x − α|). Maximizing fₙ(x₁, ..., xₙ|α) is equivalent to minimizing Σᵢ₌₁ⁿ |xᵢ − α| over α. For n = 2m one finds that any α̂ ∈ [x₍ₘ₎, x₍ₘ₊₁₎] serves as MLE of α, where x₍ᵢ₎ is the ith order statistic of the sample.

The method of maximum likelihood estimation is generally credited to Fisher∗ [17–20], although its roots date back as far as Lambert∗, Daniel Bernoulli∗, and Lagrange in the eighteenth century; see Edwards [12] for a historical account. Fisher introduces the method in [17] as an alternative to the method of moments∗ and the method of least squares∗. The former method Fisher criticizes for its arbitrariness in the choice of moment equations and the latter for not being invariant under scale changes in the variables. The term likelihood∗, as distinguished from (inverse) probability, appears for the first time in [18]. Introducing the measure of information named after him (see FISHER INFORMATION), Fisher [18–20] offers several proofs for the efficiency of MLEs, namely that the asymptotic variance of asymptotically normal estimates cannot fall below the reciprocal of the information contained in the sample and, furthermore, that the MLE achieves this lower bound. Fisher's proofs, obscured by the fact that assumptions are not always clearly stated, cannot be considered completely rigorous by today's standards and should be understood in the context of his time. To some extent his work on maximum likelihood estimation was anticipated by Edgeworth [11], whose contributions are discussed by Savage [51] and Pratt [45]. However, it was Fisher's insight and advocacy that led to the prominence of maximum likelihood estimation as we know it today.
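The divergence in Example 2 is easy to exhibit numerically. The following sketch (added here for illustration and not part of the original text; the sample, seed, and parameter path are arbitrary choices, and only numpy is assumed) pins μ at the first observation and lets σ → 0; the mixture log-likelihood then grows roughly like log(1/σ), so no maximizer exists:

```python
import numpy as np

def mixture_loglik(x, mu, nu, sigma, tau, p):
    """Log-likelihood of the two-component normal mixture in Example 2."""
    c1 = p * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    c2 = (1 - p) * np.exp(-0.5 * ((x - nu) / tau) ** 2) / (np.sqrt(2 * np.pi) * tau)
    return np.log(c1 + c2).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=20)                    # any sample works
for sigma in [1.0, 0.1, 1e-2, 1e-3, 1e-4]:
    # pin the first component's mean at an observation and shrink sigma
    print(sigma, mixture_loglik(x, mu=x[0], nu=0.0, sigma=sigma, tau=1.0, p=0.5))
```

The second component keeps every term of the likelihood positive, while the spike of the first component at x₁ contributes an unbounded factor.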

For a discussion and an extension of Definition 1 to richer (nonparametric) statistical models which preclude a model description through densities (i.e., likelihoods will be missing), see Scholz [52]. At times the primary concern is the estimation of some function g of θ. It is then customary to treat g(θ̂) as an "MLE" of g(θ), although strictly speaking, Definition 1 justifies this only when g is a one-to-one function. For arguments toward a general justification of g(θ̂) as MLE of g(θ), see Zehna [58] and Berk [7].

CONSISTENCY

Much of maximum likelihood theory deals with the large sample (asymptotic) properties of MLEs, i.e., with the case in which it is assumed that X₁, ..., Xₙ are independent and identically distributed with density f(·|θ) (i.e., X₁, ..., Xₙ i.i.d. ∼ f(·|θ)). The joint density of X = (X₁, ..., Xₙ) is then fₙ(x|θ) = ∏ᵢ₌₁ⁿ f(xᵢ|θ). It further is assumed that the distributions P_θ corresponding to f(·|θ) are identifiable; i.e., θ ≠ θ′, θ, θ′ ∈ Θ implies P_θ ≠ P_θ′. For future reference we state the following assumptions:

A0: X₁, ..., Xₙ i.i.d. ∼ f(·|θ), θ ∈ Θ;

A1: the distributions P_θ, θ ∈ Θ, are identifiable.

The following simple result further supports the intuitive appeal of the MLE; see Bahadur [3]:

Theorem 1. Under A0 and A1,

P_θ[fₙ(X|θ) > fₙ(X|θ′)] → 1

as n → ∞ for any θ, θ′ ∈ Θ with θ′ ≠ θ. If, in addition, Θ is finite, then the MLE θ̂ₙ exists and is consistent.
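The convergence in Theorem 1 can be watched numerically. The following minimal simulation (added for illustration; the N(θ, 1) model, the particular θ, θ′, and the replication count are arbitrary choices) estimates P_θ[fₙ(X|θ) > fₙ(X|θ′)] for growing n:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, theta1, reps = 0.0, 0.5, 5000

for n in [5, 20, 100, 500]:
    x = rng.normal(theta0, 1.0, size=(reps, n))
    # log f_n(x|theta) for N(theta, 1) is -0.5 * sum (x_i - theta)^2 + const
    ll0 = -0.5 * ((x - theta0) ** 2).sum(axis=1)
    ll1 = -0.5 * ((x - theta1) ** 2).sum(axis=1)
    print(n, np.mean(ll0 > ll1))   # estimates P[f_n(X|theta0) > f_n(X|theta1)]
```

The printed proportions climb toward one, as the theorem asserts.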

The content of Theorem 1 is a cornerstone in Wald's [56] consistency proof of the MLE for the general case. Wald assumes that Θ is compact, which by a familiar compactness argument reduces the problem to the case in which Θ contains only finitely many elements. Aside from the compactness assumption on Θ, which often is not satisfied in practice, Wald's uniform integrability conditions (imposed on log f(·|θ)) often are not satisfied in typical examples.

Many improvements in Wald's approach toward MLE consistency were made by later researchers. For a discussion and further references, see Perlman [42]. Instead of Wald's theorem or any of its refinements, we present another theorem, due to Rao [47], which shows under what simple conditions MLE consistency may be established in a certain specific situation.

Theorem 2. Let A0 and A1 be satisfied and let f(·|θ) describe a multinomial experiment with cell probabilities π(θ) = (π₁(θ), ..., πₖ(θ)). If the map θ → π(θ), θ ∈ Θ, has a continuous inverse (the inverse existing because of A1), then the MLE θ̂ₙ, if it exists, is a consistent estimator of θ.

For a counterexample to Theorem 2 when the inverse continuity assumption is not satisfied, see Kraft and LeCam [33].

A completely different approach toward proving consistency of MLEs was given by Cramér [9]. His proof is based on a Taylor expansion of log L(θ) and thus, in contrast to Wald's proof, assumes a certain amount of smoothness in f(·|θ) as a function of θ. Cramér gave the consistency proof only for Θ ⊂ ℝ. Presented here are his conditions generalized to the multiparameter case, Θ ⊂ ℝˢ:

C1: The distributions P_θ have common support for all θ ∈ Θ; i.e., {x : f(x|θ) > 0} does not change with θ ∈ Θ.

C2: There exists an open subset ω of Θ containing the true parameter point θ₀ such that for almost all x the density f(x|θ) admits all third derivatives ∂³f(x|θ)/∂θᵢ∂θⱼ∂θₖ for all θ ∈ ω.

C3: The expectations

E_θ[(∂/∂θⱼ) log f(X|θ)] = 0, j = 1, ..., s,

and

Iⱼₖ(θ) := E_θ[((∂/∂θⱼ) log f(X|θ))((∂/∂θₖ) log f(X|θ))] = −E_θ[(∂²/∂θⱼ∂θₖ) log f(X|θ)]

exist and are finite for j, k = 1, ..., s and all θ ∈ ω.

C4: The matrix I(θ) = (Iⱼₖ(θ)), j, k = 1, ..., s, is positive definite for all θ ∈ ω.

C5: There exist functions Mᵢⱼₖ(x), independent of θ, such that for all i, j, k = 1, ..., s,

|(∂³/∂θᵢ∂θⱼ∂θₖ) log f(x|θ)| ≤ Mᵢⱼₖ(x) for all θ ∈ ω,

where E_θ₀[Mᵢⱼₖ(X)] < ∞.

We can now state Cramér's consistency theorem.

Theorem 3. Assume A0, A1, and C1–C5. Then with probability tending to one as n → ∞ there exist solutions θ̃ₙ = θ̃ₙ(X₁, ..., Xₙ) of the likelihood equations

(∂/∂θⱼ) log fₙ(X|θ) = Σᵢ₌₁ⁿ (∂/∂θⱼ) log f(Xᵢ|θ) = 0, j = 1, ..., s,

such that θ̃ₙ converges to θ₀ in probability; i.e., θ̃ₙ is consistent.

For a proof see Lehmann [37, Sect. 6.4]. The theorem needs several comments for clarification:

(a) If the likelihood function L(θ) attains its maximum at an interior point of Θ, then the MLE is a solution to the likelihood equations. If in addition the likelihood equations have only one root, then Theorem 3 proves the consistency of the MLE (θ̂ₙ = θ̃ₙ).

(b) Theorem 3 does not state how to identify the consistent root among possibly many roots of the likelihood equations. One could take the root θ̃ₙ which is closest to θ₀, but then θ̃ₙ is no longer an estimator, since its construction assumes knowledge of the unknown value of θ₀. This problem may be overcome by taking that root which is closest to a (known) consistent estimator of θ₀. The utility of this approach becomes clear in the section on efficiency.

(c) The MLE does not necessarily coincide with the consistent root guaranteed by Theorem 3. Kraft and LeCam [33] give an example in which Cramér's conditions are satisfied and the MLE exists, is unique, and satisfies the likelihood equations, yet is not consistent.

In view of these comments, it is advantageous to establish the uniqueness of the likelihood equation roots whenever possible. For example, if f(x|θ) is of nondegenerate multiparameter exponential family∗ type, then log L(θ) is strictly concave, so the likelihood equations have at most one solution. Sufficient conditions for the existence of such a solution may be found in Barndorff-Nielsen [4]. In a more general context, Mäkeläinen et al. [38] give sufficient conditions for the existence and uniqueness of roots of the likelihood equations.

EFFICIENCY

The main reason for presenting Cramér's consistency theorem rather than Wald's is the following theorem, which specifically addresses consistent roots of the likelihood equations and not necessarily MLEs.

Theorem 4. Assume A0, A1, and C1–C5. If θ̃ₙ is a consistent sequence of roots of the likelihood equations, then as n → ∞,

√n(θ̃ₙ − θ₀) →L Nₛ(0, I(θ₀)⁻¹);

i.e., in large samples the distribution of θ̃ₙ is approximately s-variate normal with mean vector θ₀ and covariance matrix I(θ₀)⁻¹/n. This theorem is due to Cramér [9], who gave a proof for s = 1. A proof for s ≥ 1 may be found in Lehmann [37].

Because of the form, I(θ)⁻¹, of the asymptotic covariance matrix of √n(θ̃ₙ − θ), one generally regards θ̃ₙ as an efficient estimator of θ. The reasons for this are now discussed. Under regularity conditions (weaker than those of Theorem 4) the Cramér–Rao∗ lower bound (CRLB) states that

var(Tⱼₙ) ≥ (I(θ)⁻¹)ⱼⱼ / n

for any unbiased estimator Tⱼₙ of θⱼ which is based on n observations. Here (I(θ)⁻¹)ⱼⱼ refers to the jth diagonal element of I(θ)⁻¹. Note, however, that the CRLB refers to the actual variance of an (unbiased) estimator and not to the asymptotic variance of such an estimator. The relationship between these two variance concepts is clarified by the following inequality. If as n → ∞

√n(Tⱼₙ − θⱼ) →L N(0, υⱼ(θ)),   (1)

then

lim n var(Tⱼₙ) ≥ υⱼ(θ) as n → ∞,

where equality need not hold; see Lehmann [37].

Thus for unbiased estimators Tⱼₙ which are asymptotically normal, i.e., satisfy (1), and for which

lim n var(Tⱼₙ) = υⱼ(θ) as n → ∞,

the CRLB implies that υⱼ(θ) ≥ (I(θ)⁻¹)ⱼⱼ. It was therefore thought that (I(θ)⁻¹)ⱼⱼ is a lower bound for the asymptotic variance of any asymptotically normal unbiased estimate of θⱼ. Since the estimators θ̃ⱼₙ of Theorem 4 have asymptotic variances equal to this lower bound, they were called efficient. In fact, Lehmann [36] refers to θ̃ₙ as an efficient likelihood estimator (ELE) in contrast to the MLE, although the two often will coincide. For a discussion of the usage of ELE versus MLE refer to his paper. As it turns out, (I(θ)⁻¹)ⱼⱼ will not serve as a true lower bound on the asymptotic variance of asymptotically normal estimators of θⱼ unless one places some restrictions on the behavior of such estimators. Without such restrictions Hodges (see LeCam [35]) was able to construct so-called superefficient estimators (see SUPEREFFICIENCY, HODGES for a simple example). It was shown by LeCam [35] (see also Bahadur [2]) that the set of superefficiency points must have Lebesgue measure zero for any particular sequence of estimators. LeCam (see also Hájek [28]) further showed that falling below the lower bound at a value θ₀ entails certain unpleasant properties for the mean squared error∗ (or other risk functions) of such superefficient estimators in the vicinity of θ₀. Thus it appears not advisable to use superefficient estimators.

Unpleasant as such superefficient estimators are, their existence led to a reassessment of large sample properties of estimators. In particular, a case was made to require a certain amount of regularity not only in the distributions but also in the estimators. For example, the simple requirement that the asymptotic variance υⱼ(θ) in (1) be a continuous function of θ would preclude such an estimator from being superefficient, since, as remarked above, such a phenomenon may occur only on sets of Lebesgue measure zero. For estimators satisfying (1) with continuous υⱼ(θ) one thus has υⱼ(θ) ≥ (I(θ)⁻¹)ⱼⱼ. Rao [49] requires the weak convergence in (1) to be uniform on compact subsets of Θ, which under mild assumptions on f(·|θ) implies the continuity of the asymptotic variance υⱼ(θ). Hájek [27] proved a very general theorem which gives a succinct description of the asymptotic distribution of regular estimators. His theorem will be described in a somewhat less general form below.
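The superefficiency phenomenon just discussed is easy to reproduce by simulation. The sketch below (added here for illustration; it uses the textbook Hodges construction for a normal mean, with arbitrary sample sizes and replication count) shrinks the sample mean to 0 whenever |X̄| ≤ n^(−1/4). At θ = 0 the normalized risk n·E[(Tₙ − θ)²] drops below the CRLB value 1, while at the nearby point θ = 0.1 it is badly inflated, which is the unpleasant behavior in the vicinity of θ₀ referred to above:

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 20000

for theta in [0.0, 0.1]:
    for n in [100, 1000, 10000]:
        # Xbar is N(theta, 1/n), so it can be simulated directly
        xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
        # Hodges' estimator: keep Xbar unless it is within n^(-1/4) of 0
        t = np.where(np.abs(xbar) > n ** (-0.25), xbar, 0.0)
        print(theta, n, n * np.mean((t - theta) ** 2))   # CRLB benchmark: 1.0
```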
Let θ₍ₙ₎ = θ₀ + h/√n, h ∈ ℝˢ, and denote by →L(n) convergence in law when θ₍ₙ₎ is the true parameter.

Definition 2. An estimator sequence {Tₙ} is called regular if for all h ∈ ℝˢ

√n(Tₙ − θ₍ₙ₎) →L(n) T

as n → ∞, where the distribution of the random vector T is independent of h.

The regularity conditions on f(·|θ) are formulated as follows.

Definition 3. f(·|θ) is called locally asymptotically normal (LAN) if for all h ∈ ℝˢ

log[fₙ(X|θ₍ₙ₎)/fₙ(X|θ₀)] = h′Δₙ(θ₀) − ½h′I(θ₀)h + Zₙ(h, θ₀)

with, under θ₀,

Δₙ(θ₀) →L Nₛ(0, I(θ₀)) and Zₙ(h, θ₀) →P 0 as n → ∞.

Comment. Under the conditions A0, A1, and C1–C5, one may show that f(·|θ) is LAN, and in that case

Δₙ(θ₀) = ((∂/∂θ₁) log fₙ(X|θ), ..., (∂/∂θₛ) log fₙ(X|θ))′/√n,

evaluated at θ = θ₀.

Theorem 5. (Hájek). If {Tₙ} is regular and f(·|θ) is LAN, then T = Y + W, where Y ∼ Nₛ(0, I(θ₀)⁻¹) and W is a random vector independent of Y. The distribution of W is determined by the estimator sequence.

Comment. Estimators for which P_θ₀(W = 0) = 1 are most concentrated around θ₀ (see Hájek [27]) and may thus be considered efficient among regular estimators. Note that this optimality claim is possible because competing estimators are required to be regular; on the other hand, it is no longer required that the asymptotic distribution of the estimator be normal.

As remarked in the comments to Theorem 3, the ELE θ̃ₙ of Theorem 4 may be chosen by taking that root of the likelihood equations which is closest to some known consistent estimate of θ. The latter estimate need not be efficient. In this context we note that the consistent sequence of roots generated by Theorem 3 is essentially unique. For a more precise statement of this result see Huzurbazar [31] and Perlman [43]. Roots of the likelihood equations may be found by one of various iterative procedures offered in several statistical computer packages; see ITERATED MAXIMUM LIKELIHOOD ESTIMATES. Alternatively, one may just take a one-step iteration estimator in place of a root. Such one-step estimators use √n-consistent estimators (not necessarily efficient) as starting points and are themselves efficient. A precise statement is given in Theorem 6. First note the following definition.

Definition 4. An estimator Tₙ is √n-consistent for estimating the true θ₀ if for every ε > 0 there is a K(ε) and an N(ε) such that

P_θ₀(√n‖Tₙ − θ₀‖ ≤ K(ε)) ≥ 1 − ε

for all n ≥ N(ε), where ‖·‖ denotes the Euclidean norm in ℝˢ.

Theorem 6. Suppose the assumptions of Theorem 4 hold and that θ*ₙ is a √n-consistent estimator of θ. Let δₙ = (δ₁ₙ, ..., δₛₙ)′ be the solution of the linear equations

Σₖ₌₁ˢ (δₖₙ − θ*ₖₙ)Rⱼₖ(θ*ₙ) = −Rⱼ(θ*ₙ), j = 1, ..., s,

where

Rⱼ(θ) = (∂/∂θⱼ) log L(θ) and Rⱼₖ(θ) = (∂²/∂θⱼ∂θₖ) log L(θ).

Then

√n(δₙ − θ₀) →L Nₛ(0, I(θ₀)⁻¹)

as n → ∞; i.e., δₙ is asymptotically efficient.

For a proof of Theorem 6 see Lehmann [37]. The application of Theorem 6 is particularly useful in that it does not require the solution of the likelihood equations. The two estimators θ̃ₙ (Theorem 4) and δₙ (Theorem 6) are asymptotically efficient. Among other estimators that share this property are the Bayes estimators.
Because of this multitude of asymptotically efficient estimators, one has tried to discriminate between them by considering higher-order terms in the asymptotic analysis. Several measures of second-order efficiency∗ have been examined (see EFFICIENCY, SECOND-ORDER), and it appears that, with some qualifications, MLEs are second-order efficient, provided the MLE is first corrected for bias of order 1/n. Without such bias correction one seems to be faced with a problem similar to the nonexistence of estimators with uniformly smallest mean squared error in the finite sample case (see ESTIMATION, CLASSICAL). For investigations along these lines see Rao [48], Ghosh and Subramanyam [23], Efron [13], Pfanzagl and Wefelmeyer [44], and Ghosh and Sinha [22]. For a lively discussion of the issues involved see also Berkson [8].

Other types of optimality results for MLEs, such as the local asymptotic admissibility∗ and minimax property∗, were developed by LeCam and Hájek. For an exposition of this work and some perspective on the work of others, see Hájek [28]. For a simplified introduction to these results, see also Lehmann [37].

The following example (see Lehmann [37]) illustrates a different kind of behavior of the MLE when the regularity condition C1 is not satisfied.

Example 4. Let X₁, ..., Xₙ be i.i.d. U(0, θ) (uniform on (0, θ)). Then the MLE is θ̂ₙ = max(X₁, ..., Xₙ), and its large sample behavior is described by

n(θ − θ̂ₙ) →L θE

as n → ∞, where E is an exponential random variable with mean 1. Note that the normalizing factor is n and not √n. Also, the asymptotic distribution θE is not normal and is not centered at zero. Considering instead δₙ = (n + 1)θ̂ₙ/n, one finds as n → ∞

n(θ − δₙ) →L θ(E − 1).

Further,

E[n(θ̂ₙ − θ)]² → 2θ²

and

E[n(δₙ − θ)]² → θ².

Hence the MLE, although consistent, may no longer be asymptotically optimal.

For a different approach which covers regular as well as nonregular problems, see Weiss and Wolfowitz [57] on maximum probability estimators∗.
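Example 4's nonstandard rate and limits are easy to check by simulation. In the sketch below (added for illustration; θ, n, the seed, and the replication count are arbitrary choices), n(θ − θ̂ₙ) has mean near θ = 1, E[n(θ̂ₙ − θ)]² is near 2θ², and E[n(δₙ − θ)]² is near θ²:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 1.0, 1000, 5000

mle = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)   # theta_hat_n
delta = (n + 1) * mle / n                                    # delta_n

print(np.mean(n * (theta - mle)))            # ~ theta, the mean of theta * Exp(1)
print(np.mean((n * (mle - theta)) ** 2))     # ~ 2 * theta**2
print(np.mean((n * (delta - theta)) ** 2))   # ~ theta**2
```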

MLES IN MORE GENERAL MODELS

The results concerning MLEs or ELEs discussed so far all assumed A0. In many statistical applications the model structure gives rise to independent observations which are not identically distributed. For example, one may be sampling several different populations, or with each observation one or more covariates may be recorded. For the former situation Theorems 3 and 4 are easily extended; see Lehmann [37]. In the case of independent but not identically distributed observations, results along the lines of Theorems 3 and 4 were given by Hoadley [29] and Nordberg [40]. Nordberg deals specifically with exponential family∗ models, presenting the binary logit model and the log-linear Poisson model as examples. See also Haberman [26], who treats maximum likelihood theory for diverse parametric structures in multinomial and Poisson counting models.

The maximum likelihood theory for incomplete data∗ from an exponential family is treated in Sundberg [55]. Incomplete data models include situations with grouped∗, censored∗, or missing data and finite mixtures. See Sundberg [55] and Dempster et al. [10] for more examples. The latter authors present the EM algorithm for iterative computation of the MLE from incomplete data.
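The mechanics of EM are easy to exhibit in the simplest incomplete-data setting. The sketch below (an added illustration, not from the article; the data values and the exponential lifetime model are hypothetical choices) alternates an E-step, which replaces each right-censored lifetime by its conditional expectation c + 1/λ given survival past c, with an M-step, which computes the complete-data MLE of the rate; the iteration converges to the closed-form censored-data MLE, the number of observed failures divided by total exposure:

```python
import numpy as np

# Hypothetical data: exponential lifetimes, some right-censored
t = np.array([0.3, 1.2, 0.7, 2.5, 0.9])    # observed failure times
c = np.array([2.0, 2.0, 3.0])              # censoring times (failures unobserved)

lam = 1.0                                   # any positive starting value
for _ in range(200):
    # E-step: by lack of memory, E[T | T > c] = c + 1/lam for censored units
    imputed = c + 1.0 / lam
    # M-step: complete-data MLE of the rate
    lam = (len(t) + len(c)) / (t.sum() + imputed.sum())

print(lam, len(t) / (t.sum() + c.sum()))    # EM limit vs. closed-form MLE
```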
Relaxing the independence assumption on the observations opens the way to stochastic process applications. Inference problems concerning stochastic processes∗ have only recently been treated with vigor. The maximum likelihood approach to parameter estimation here plays a prominent role. Consistency and asymptotic normality∗ of MLEs (or ELEs) may again be established by using appropriate martingale∗ limit theorems. Care has to be taken to define the concept of likelihood function when the observations consist of a continuous time stochastic process. As a start into the growing literature on this subject see Feigin [16] and Basawa and Prakasa Rao [5,6].

Another assumption made so far is that the parameter space Θ has dimension s, where s remains fixed as the sample size n increases. For the problems one encounters as s grows with n, or when Θ is so rich that it cannot be embedded into any finite-dimensional Euclidean space, see Kiefer and Wolfowitz [32] and Grenander [25], respectively. Huber [30] examines the behavior of MLEs derived from a particular parametric model when the sampled distribution is different from this model. These results are useful in studying the robustness properties of MLEs (see ROBUST ESTIMATION).

MISCELLANEOUS REMARKS

Not much can be said about the small sample properties of MLEs. If the MLE exists, it is generally a function of the minimal sufficient statistic∗, namely the likelihood function L(·). However, the MLE by itself is not necessarily sufficient. Thus in small samples some information may be lost by considering the MLE by itself. Fisher [19] proposed the use of ancillary statistics∗ to recover this information loss, by viewing the distribution of the MLE conditional on the ancillary statistics. For a recent debate on this subject, see Efron and Hinkley [15], who make an argument for using the observed and not the expected Fisher information∗ in assessing the accuracy of MLEs. Note, however, the comments to their paper. See also Sprott [53,54], who suggests using parameter transformations to achieve better results in small samples when appealing to large sample maximum likelihood theory. Efron [14], in discussing the relationship between maximum likelihood and decision theory∗, highlights the role of maximum likelihood as a summary principle in contrast to its role as an estimation principle.

Although MLEs are asymptotically unbiased in the regular case, this will not generally be the case in finite samples. It is not necessarily clear whether the removal of bias from an MLE will result in a better estimator, as the following example shows (see also UNBIASEDNESS).

Example 5. Let X₁, ..., Xₙ be i.i.d. N(μ, σ²); then the MLE of σ² is

σ̂² = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/n,

which has mean value σ²(n − 1)/n; i.e., σ̂² is biased. Taking instead σ̃² = σ̂²n/(n − 1), we have an unbiased estimator of σ² which has uniformly smallest variance among all unbiased estimators; however,

E(σ̂² − σ²)² < E(σ̃² − σ²)²

for all σ².

The question of bias removal should therefore be decided on a case-by-case basis, with consideration of the estimation objective. See, however, the argument for bias correction of order 1/n in the context of second-order efficiency of MLEs as given in Rao [48] and Ghosh and Subramanyam [23].
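A short simulation (added here; μ, σ², n, and the replication count are arbitrary choices) confirms both claims of Example 5: σ̂² is biased downward while σ̃² is unbiased, yet σ̂² has the smaller mean squared error:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2_mle = x.var(axis=1, ddof=0)     # divide by n: the MLE, biased
s2_unb = x.var(axis=1, ddof=1)     # divide by n - 1: unbiased

print(s2_mle.mean(), s2_unb.mean())            # ~ 3.6 versus ~ 4.0
print(np.mean((s2_mle - sigma2) ** 2),         # ~ 3.04, smaller ...
      np.mean((s2_unb - sigma2) ** 2))         # ... than ~ 3.56
```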
Gong and Samaniego [24] consider the concept of pseudo maximum likelihood estimation, which consists of replacing all nuisance parameters∗ in a multiparameter model by suitable estimates and then solving a reduced system of likelihood equations for the remaining structural parameters. They present consistency and asymptotic normality results and illustrate them by example.

AN EXAMPLE

As an illustration of some of the issues and problems encountered, consider the Weibull∗ model. Aside from offering much flexibility in its shape, there are theoretical extreme value type arguments (Galambos [21]) recommending the Weibull distribution as an appropriate model for the breaking strength of certain materials.

Let X₁, ..., Xₙ be i.i.d. W(t, α, β), the Weibull distribution with density

f(x|t, α, β) = (β/α)[(x − t)/α]^(β−1) exp{−[(x − t)/α]^β}, x > t, t ∈ ℝ, α, β > 0.

When the threshold parameter t is known, say t = 0 (otherwise subtract t from the Xᵢ), the likelihood equations for the scale and shape parameters α and β have a unique solution. In fact, the likelihood equations may be rewritten as

α̂ = [(1/n) Σᵢ₌₁ⁿ xᵢ^β̂]^(1/β̂),   (2)

[Σᵢ₌₁ⁿ xᵢ^β̂ log xᵢ][Σᵢ₌₁ⁿ xᵢ^β̂]⁻¹ − 1/β̂ − (1/n) Σᵢ₌₁ⁿ log xᵢ = 0;   (3)

i.e., α̂ is given explicitly in terms of β̂, which in turn can be found from the second equation by an iterative numerical procedure such as Newton's method. The regularity conditions of Theorems 3 and 4 are satisfied. Thus we conclude that

√n[(α̂, β̂)′ − (α, β)′] →L N₂(0, I⁻¹(α, β)),

where

I⁻¹(α, β) = [ 1.109α²/β²   0.257α
              0.257α       0.608β² ].

From these asymptotic results it follows that, as n → ∞,

√n(β̂ − β)/β →L N(0, 0.608),

√n β̂ log(α̂/α) →L N(0, 1.109),

from which large sample confidence intervals∗ for β and α may be obtained. However, the large sample approximations are good only for very large samples. Even for n = 100 the approximations leave much to be desired. For small to medium sample sizes a large collection of tables is available (see Bain [1]) to facilitate various types of inference. These tables are based on extensive Monte Carlo∗ investigations. For example, instead of appealing to the asymptotic N(0, 0.608) distribution of the pivot √n(β̂ − β)/β, the distribution of this pivot was simulated for various sample sizes and the percentage points of the simulated distributions were tabulated. Another approach to finite sample size inference is offered by Lawless [34]. His method is based on the conditional distribution of the MLEs given certain ancillary statistics. It turns out that this conditional distribution is analytically manageable; however, computer programs are ultimately required for the implementation of this method.
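The two-step scheme behind equations (2) and (3) translates directly into code. The sketch below (added for illustration, not part of the original text; the simulated sample, seed, and starting value are arbitrary, and a production solver would safeguard the Newton iteration) solves (3) for β̂ by Newton's method and then recovers α̂ from (2):

```python
import numpy as np

def weibull_mle(x, beta=1.0, tol=1e-10, max_iter=100):
    """Solve eq. (3) for the shape by Newton's method, then get the scale from eq. (2)."""
    lx = np.log(x)
    for _ in range(max_iter):
        xb = x ** beta
        g = (xb * lx).sum() / xb.sum() - 1.0 / beta - lx.mean()      # eq. (3)
        # g'(beta); positive by the Cauchy-Schwarz inequality
        dg = ((xb * lx ** 2).sum() * xb.sum() - (xb * lx).sum() ** 2) / xb.sum() ** 2 \
             + 1.0 / beta ** 2
        step = g / dg
        beta -= step
        if abs(step) < tol:
            break
    alpha = np.mean(x ** beta) ** (1.0 / beta)                       # eq. (2)
    return alpha, beta

rng = np.random.default_rng(7)
x = 1.5 * rng.weibull(2.0, size=500)       # true alpha = 1.5, beta = 2
print(weibull_mle(x))
```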
Returning to the three-parameter Weibull problem, so that the threshold t is also unknown, we find that the likelihood function tends to infinity as t → T = min(X₁, ..., Xₙ) and β < 1; i.e., the MLEs of t, α, and β do not exist. It has been suggested that the parameter β be restricted a priori to β ≥ 1 so that the likelihood function remains bounded. In that case the MLEs will always exist, but with positive probability will not be a solution to the likelihood equations. It is not clear what the large sample properties of the MLEs are in this case.

Appealing to Theorems 3 and 4, one may attempt to find efficient roots of the likelihood equations or appeal to Theorem 6, since √n-consistent estimates are easily found (e.g., by the method of moments or the method of quantiles∗). However, the stated regularity conditions, notably C1, are not satisfied. In addition, the likelihood equations will, with positive probability, have no solution at all, and if they have a solution they have at least two, one yielding a saddle point and one yielding a local maximum of the likelihood function (Rockette et al. [50]). A further problem in the three-parameter Weibull model concerns identifiability for large shape parameters β. Namely, if X ∼ W(t, α, β), then uniformly in α and t, as β → ∞,

(β/α)(X − t − α) →L E₀,

where E₀ is a standard extreme value random variable with distribution function F(y) = 1 − exp(−exp(y)). Hence for large β, approximately in law,

X ≈ t + α + (α/β)E₀ = u + bE₀,

and it is clear that (t, α, β) cannot be recovered from u and b. This phenomenon is similar to the one experienced for the generalized gamma distribution∗; see Prentice [46]. In the Weibull problem the identifiability problem may be remedied by proper reparametrization. For example, instead of (t, α, β) one can easily use three quantiles. However, because of the one-to-one relationship between these two sets of parameters, the abovementioned problems concerning maximum likelihood and likelihood equation estimators still persist.

It is conceivable that the conditions of Theorem 6 may be weakened to accommodate the three-parameter Weibull distribution. Alternatively, one may use the approach of Gong and Samaniego [24] described above, in that one estimates the threshold t by other means. Mann and Fertig [39] offer one such estimate; see WEIBULL DISTRIBUTION, MANN–FERTIG TEST STATISTIC FOR. Neither of these two approaches seems to have been explored rigorously so far. For an extensive account of maximum likelihood estimation as well as other methods for complete and censored samples from a Weibull distribution, see Bain [1] and Lawless [34]. Both authors treat maximum likelihood estimation in the context of many other models.

For the very reason that applications employing the method of maximum likelihood are so numerous, no attempt is made here to list them beyond the few references and examples given above. For a guide to the literature see the survey article on MLEs by Norden [41]. Also see Lehmann [37] for a rich selection of interesting examples and for a more thorough treatment.

Acknowledgments
In writing this article I benefited greatly from pertinent chapters of Erich Lehmann's book Theory of Point Estimation [37]. I would like to thank him for this privilege and for his numerous helpful comments. I would also like to thank Jon Wellner for the use of his notes on maximum likelihood estimation, and Michael Perlman and Paul Sampson for helpful comments. This work was supported in part by The Boeing Company.

REFERENCES

1. Bain, L. J. (1978). Statistical Analysis of Reliability and Life-Testing Models. Dekker, New York.
2. Bahadur, R. R. (1964). Ann. Math. Statist., 35, 1545–1552.
3. Bahadur, R. R. (1971). Some Limit Theorems in Statistics. SIAM, Philadelphia.
4. Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
5. Basawa, I. V. and Prakasa Rao, B. L. S. (1980). Stoch. Proc. Appl., 10, 221–254.
6. Basawa, I. V. and Prakasa Rao, B. L. S. (1980). Statistical Inference for Stochastic Processes. Academic Press, London.
7. Berk, R. H. (1967). Math. Rev., 33, No. 1922.
8. Berkson, J. (1980). Ann. Statist., 8, 457–469.
9. Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton, N.J.
10. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). J. R. Statist. Soc. B, 39, 1–22.
11. Edgeworth, F. Y. (1908/09). J. R. Statist. Soc., 71, 381–397, 499–512; 72, 81–90.
12. Edwards, A. W. F. (1974). Internat. Statist. Rev., 42, 4–15.
13. Efron, B. (1975). Ann. Statist., 3, 1189–1242.
14. Efron, B. (1982). Ann. Statist., 10, 340–356.
15. Efron, B. and Hinkley, D. V. (1978). Biometrika, 65, 457–487.
16. Feigin, P. D. (1976). Adv. Appl. Prob., 8, 712–736.
17. Fisher, R. A. (1912). Messenger of Mathematics, 41, 155–160.
18. Fisher, R. A. (1922). Philos. Trans. R. Soc. London A, 222, 309–368.
19. Fisher, R. A. (1925). Proc. Camb. Phil. Soc., 22, 700–725.
20. Fisher, R. A. (1935). J. R. Statist. Soc., 98, 39–54.
21. Galambos, J. (1978). The Asymptotic Theory of Extreme Order Statistics. Wiley, New York.
22. Ghosh, J. K. and Sinha, B. K. (1981). Ann. Statist., 9, 1334–1338.
23. Ghosh, J. K. and Subramanyam, K. (1974). Sankhya (A), 36, 325–358.
24. Gong, G. and Samaniego, F. J. (1981). Ann. Statist., 9, 861–869.
25. Grenander, U. V. (1980). Abstract Inference. Wiley, New York.
26. Haberman, S. J. (1974). The Analysis of Frequency Data. The University of Chicago Press, Chicago.
27. Hájek, J. (1970). Zeit. Wahrscheinlichkeitsth. Verw. Geb., 14, 323–330.
28. Hájek, J. (1972). Proc. Sixth Berkeley Symp. Math. Statist. Prob., 1, 175–194.
29. Hoadley, B. (1971). Ann. Math. Statist., 42, 1977–1991.
30. Huber, P. J. (1967). Proc. Fifth Berkeley Symp. Math. Statist. Prob., 1, 221–233.
31. Huzurbazar, V. S. (1948). Ann. Eugen., 14, 185–200.
32. Kiefer, J. and Wolfowitz, J. (1956). Ann. Math. Statist., 27, 887–906.
33. Kraft, C. and LeCam, L. (1956). Ann. Math. Statist., 27, 1174–1177.
34. Lawless, J. F. (1982). Statistical Models and Methods for Lifetime Data. Wiley, New York.
35. LeCam, L. (1953). Univ. of Calif. Publ. Statist., 1, 277–330.
36. Lehmann, E. L. (1980). Amer. Statist., 34, 233–235.
37. Lehmann, E. L. (1983). Theory of Point Estimation. Wiley, New York, Chaps. 5 and 6.
38. Mäkeläinen, T., Schmidt, K., and Styan, G. (1981). Ann. Statist., 9, 758–767.
39. Mann, N. R. and Fertig, K. W. (1975). Technometrics, 17, 237–245.
40. Nordberg, L. (1980). Scand. J. Statist., 7, 27–32.
41. Norden, R. H. (1972/73). Internat. Statist. Rev., 40, 329–354; 41, 39–58.
42. Perlman, M. (1972). Proc. Sixth Berkeley Symp. Math. Statist. Prob., 1, 263–281.
43. Perlman, M. D. (1983). In Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His 60th Birthday, M. H. Rizvi, J. S. Rustagi, and D. Siegmund, eds. Academic Press, New York, pp. 339–370.
44. Pfanzagl, J. and Wefelmeyer, W. (1978/79). J. Multivariate Anal., 8, 1–29; 9, 179–182.
45. Pratt, J. W. (1976). Ann. Statist., 4, 501–514.
46. Prentice, R. L. (1973). Biometrika, 60, 279–288.
47. Rao, C. R. (1957). Sankhya, 18, 139–148.
48. Rao, C. R. (1961). Proc. Fourth Berkeley Symp. Math. Statist. Prob., 1, 531–546.
49. Rao, C. R. (1963). Sankhya, 25, 189–206.
50. Rockette, H., Antle, C., and Klimko, L. (1974). J. Amer. Statist. Ass., 69, 246–249.
51. Savage, L. J. (1976). Ann. Statist., 4, 441–500.
52. Scholz, F.-W. (1980). Canad. J. Statist., 8, 193–203.
53. Sprott, D. A. (1973). Biometrika, 60, 457–465.
54. Sprott, D. A. (1980). Biometrika, 67, 515–523.
55. Sundberg, R. (1974). Scand. J. Statist., 1, 49–58.
56. Wald, A. (1949). Ann. Math. Statist., 20, 595–601.
57. Weiss, L. and Wolfowitz, J. (1974). Maximum Probability Estimators and Related Topics. Springer-Verlag, New York. (Lect. Notes in Math., No. 424.)
58. Zehna, P. W. (1966). Ann. Math. Statist., 37, 744.
BIBLIOGRAPHY

Akahira, M. and Takeuchi, K. (1981). Asymptotic Efficiency of Statistical Estimators: Concepts of Higher Order Asymptotic Efficiency. Springer-Verlag, New York. Lecture Notes in Statistics 7. (Technical monograph on higher order efficiency with an approach different from the references cited in the text.)

Barndorff-Nielsen, O. (1983). Biometrika, 70, 343–365. (Discusses a simple approximation formula for the conditional density of the maximum likelihood estimator given a maximal ancillary statistic. The formula is generally accurate (in relative error) to order O(n⁻¹) or even O(n⁻³ᐟ²), and for many important models, including arbitrary transformation models, it is in fact exact. The level of the paper is quite mathematical. With its many references it should serve as a good entry point into an area of research of much current interest, although its roots date back to R. A. Fisher.)

Fienberg, S. E. and Hinkley, D. V. (eds.) (1980). R. A. Fisher: An Appreciation. Springer-Verlag, New York. (Lecture Notes in Statistics 1. A collection of articles by different authors highlighting Fisher's contributions in statistics.)

Ibragimov, I. A. and Has'minskii, R. Z. (1981). Statistical Estimation, Asymptotic Theory. Springer-Verlag, New York. (Technical monograph on the asymptotic behavior of estimators (MLEs and Bayes estimators) for regular as well as irregular problems, i.i.d. and non-i.i.d. cases.)

LeCam, L. (1970). Ann. Math. Statist., 41, 802–828. (Highly mathematical; weakens Cramér's third-order differentiability conditions to first-order differentiability in quadratic mean.)

LeCam, L. (1979). Maximum Likelihood, an Introduction. Lecture Notes No. 18, Statistics Branch, Department of Mathematics, University of Maryland. (A readable and humorous account of the pitfalls of MLEs illustrated by numerous examples.)

McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models. Chapman and Hall, London. (This monograph deals with a class of statistical models that generalizes classical linear models in two ways: (i) the response variable may be of exponential family type (not just normal), and (ii) a monotonic smooth transform of the mean response is a linear function in the predictor variables. The parameters are estimated by the method of maximum likelihood through an iterated weighted least-squares algorithm. The monograph emphasizes applications over theoretical concerns and represents a rich illustration of the versatility of maximum likelihood methods.)

Rao, C. R. (1962). Sankhya A, 24, 73–101. (A readable survey and discussion of problems encountered in maximum likelihood estimation.)

See also ANCILLARY STATISTICS—I; ASYMPTOTIC NORMALITY; CRAMÉR–RAO LOWER BOUND; EFFICIENCY, SECOND-ORDER; EFFICIENT SCORE; ESTIMATION, THEORY OF; ESTIMATING FUNCTIONS; ESTIMATION, CLASSICAL; ESTIMATION: METHOD OF SIEVES; FISHER INFORMATION; FULL-INFORMATION ESTIMATORS; GENERALIZED MAXIMUM LIKELIHOOD ESTIMATION; INFERENCE, STATISTICAL; ITERATED MAXIMUM LIKELIHOOD ESTIMATES; LARGE-SAMPLE THEORY; LIKELIHOOD; LIKELIHOOD PRINCIPLE; LIKELIHOOD RATIO TESTS; M-ESTIMATORS; MINIMUM CHI-SQUARE; MAXIMUM PENALIZED LIKELIHOOD ESTIMATION; PARTIAL LIKELIHOOD; PSEUDO-LIKELIHOOD; RESTRICTED MAXIMUM LIKELIHOOD (REML); SUPEREFFICIENCY, HODGES; and UNBIASEDNESS.

F. W. SCHOLZ