A Survey of Maximum Likelihood Estimation. Author: R. H. Norden. Source: International Statistical Review / Revue Internationale de Statistique, Vol. 40, No. 3 (Dec., 1972), pp. 329-354. Published by: International Statistical Institute (ISI). Stable URL: http://www.jstor.org/stable/1402471

A Survey of Maximum Likelihood Estimation: Part 2. Author: R. H. Norden. Source: International Statistical Review / Revue Internationale de Statistique, Vol. 41, No. 1 (Apr., 1973), pp. 39-58. Published by: International Statistical Institute (ISI). Stable URL: http://www.jstor.org/stable/1402786


R. H. Norden, 1972. A Survey of Maximum Likelihood Estimation, Parts 1-2 Combined. Int. Stat. Rev., Vol. 40, No. 3, 1972, pp. 329-354 / Longman Group Ltd / Printed in Great Britain

A Survey of Maximum Likelihood Estimation

R. H. Norden University of Bath, England

Contents

Abstract
1. Introduction
2. Historical note
3. Some fundamental properties
4. Consistency
5. Efficiency
6.* Consistency and efficiency continued
7.* Comparison of MLE's with other estimators
8.* Multiparameter situations
9.* Theoretical difficulties
10.* Summary
11.* Appendix of important definitions
12. Note on arrangement of bibliography
13. Bibliography

(* Part 2 following in next issue of the Review)

Abstract

This survey, which is in two parts, is expository in nature and gives an account of the development of the theory of Maximum Likelihood Estimation (MLE) since its introduction in the papers of Fisher (1922, 1925) up to the present day, where original work in this field still continues. After a short introduction there follows a historical account in which, in particular, it is shown that Fisher may indeed be rightfully regarded as the originator of this method of estimation, since all previous apparently similar methods depended on the method of inverse probability. There then follows a summary of a number of fundamental definitions and results which originate from the papers of Fisher himself and the later contributions of Cramér and Rao. Consistency and efficiency in relation to maximum likelihood estimators (MLE's) are then considered in detail, and an account of Cramér's important (1946) theorem is given, though it is remarked that, with regard to consistency, Wald's (1949) proof has greater generality. Comments are then made on a number of subsequent related contributions. The paper then continues with a discussion of the problem of comparing MLE's with other estimators. Special mention is made of Rao's measure of "second-order efficiency", since this does provide a means of discriminating between the various best asymptotically Normal (BAN) estimators, of which maximum likelihood estimators form a subclass. Subsequently there is a survey of work in the multiparameter field which begins with a brief account of the classical Cramér-Rao theory. Finally there is a section on the theoretical difficulties which arise by reason of the existence of inconsistent maximum likelihood estimates and of estimators which are super-efficient. Particular reference is made to Rao's (1962) paper, in which these anomalies were considered and at least partially resolved by means of alternative criteria of consistency and efficiency which nevertheless are rooted in Fisher's original ideas.

1. Introduction

The aim of this paper and its sequel is to give an account of the theoretical aspect of Maximum Likelihood Estimation. To this end we shall consider a number of contributions in detail in which the main themes are consistency and efficiency, both with regard to single- and multi-parameter situations.

The scope of these papers will also include a discussion of various theoretical difficulties, with special reference to examples of inconsistency of maximum likelihood estimators and super-efficiency of other estimators. The related question of to what extent Maximum Likelihood Estimation has an advantage over other methods of estimation will also be considered.

An extensive bibliography of nearly 400 titles, divided into four main groups, is given at the end of the first paper, in which it is hoped that all, or at least almost all, contributions which are in some way related to this subject are included. In the text we refer to papers mainly in the first two groups.

2. Historical Note

In two renowned papers R. A. Fisher (1921, 1925) introduced into statistical theory the concepts of consistency, efficiency and sufficiency, and he also advocated the use of Maximum Likelihood as an estimation procedure. His 1921 definition of consistency, although important from the theoretical standpoint, cannot easily be applied to actual situations, and he gave a further definition in his 1925 paper: "A statistic is said to be a consistent estimate of any parameter if when calculated from an indefinitely large sample it tends to be accurately equal to that parameter." This is the definition of consistency which is now in general use, and mathematically it may be stated thus. The estimator Tₙ of a parameter θ, based on a sample of n observations, is consistent for θ if, for an arbitrary ε > 0,

P[|Tₙ − θ| > ε] → 0 as n → ∞.

We shall refer to Fisher's earlier definition in Section 9, however, where we consider examples of inconsistency of MLE's.

With regard to efficiency, he says: "The criterion of efficiency is satisfied by those statistics which when derived from large samples tend to a normal distribution with the least possible standard deviation". Later, 1925, he says: "We shall prove that when an efficient statistic exists, it may be found by the method of Maximum Likelihood Estimation". Evidently, therefore, he regarded efficiency as an essentially asymptotic property. Nowadays we would say that an estimator is efficient, whatever the sample size, if its variance attains the Cramér-Rao lower bound, and Fisher's property would be described as asymptotic efficiency.

His definition of sufficiency is as follows: "A statistic satisfies the criterion of sufficiency when no other sample statistic provides any additional information as to the value of the parameter to be estimated". It is well known, of course, that there is a close relation between sufficient statistics and Maximum Likelihood Estimation.
Further, he defines the likelihood as follows: "The likelihood that any parameter (or set of parameters) should have any assigned value (or set of values) is proportional to the probability that if this were so, the totality of observations should be that observed". The method of Maximum Likelihood as expounded by Fisher consists of regarding the likelihood as a function of the unknown parameter and obtaining the value of this parameter which makes the likelihood greatest. Such a value is then said to be the Maximum Likelihood Estimate of the parameter.

A formal definition of the method would now therefore be as follows. Let X be a random variable with probability density function f = f(x; θ), where the form of f is known but the parameter θ, which may be a vector, is unknown. Suppose x₁, ..., xₙ are the realized values of X in a sample of n independent observations. Define the Likelihood Function (LF) by

L = f(x₁; θ) ... f(xₙ; θ).

Then if θ̂ is such that

sup_{θ ∈ Θ} L = L(θ̂),

where Θ is the parameter space, i.e. the set of possible values of θ, then we say that θ̂ is the MLE¹ of θ. If there is no unique MLE then θ̂ will be understood to be any value of θ at which L attains its supremum.

The question as to whether Fisher may be rightly regarded as the originator of the method of MLE has arisen from time to time, so that although most statisticians would now attribute the method to him there does not appear to have been absolute agreement on this matter. There is, however, a full discussion of the historical aspect by C. R. Rao (1962), where he surveys the work of F. Y. Edgeworth (1908a and b), and also of Gauss, Laplace and Pearson in so far as they are relevant to this question. According to Rao, all arguments prior to those of Fisher which appeared to be of a ML type were in fact dependent on the method of inverse probability, i.e. the unknown parameter is estimated by maximizing the "a posteriori" probability derived from Bayes' law. They could not therefore be regarded as MLE as such, for as Rao remarks, "If θ̂ is a maximum likelihood estimate of θ, then φ(θ̂) is a maximum likelihood estimate of any one-to-one function φ(θ) of θ, while such a property is not true of estimates obtained by the inverse probability argument".
After further discussion he concludes by saying: "We, therefore, do not have any literature supporting prior claims to the method of maximum likelihood estimation as a principle capable of wide application, and justifying its use on reasonable criteria (such as sufficiency in a sense wider than that used by Edgeworth, and consistency) and not on inverse probability argument, before the fundamental contributions by Fisher in 1922 and 1925".

This agrees with the view put forward by Le Cam (1953), who also gives an outline account of developments of MLE since 1925, particularly with regard to the various attempts that were made to prove rigorously that MLE's are both consistent and efficient. Fisher himself, of course, provided a proof of efficiency, but there was no explicit statement of the restrictions on the type of estimation situation to which his conclusions would apply. He also gave no separate proof of consistency, although this could be regarded as implied by his proof of efficiency.

Subsequently these matters were considered by a number of writers. Hotelling (1931) attempted to prove the consistency and asymptotic normality of MLE's, and his work was extended by J. L. Doob (1934). The question of efficiency was considered separately by D. Dugué (1936-1937) in a series of notes, but according to Le Cam all these papers are in some degree erroneous, even though they contain a number of important ideas. He does say, however, that a paper by S. S. Wilks (1938) does contain "the essential elements of a proof" that within a certain class of estimates MLE's are efficient. A major advance was made by H. Cramér (1946), who provided a proof of the consistency and asymptotic normality of MLE's under certain regularity conditions. The consistency aspect of his proof is not, however, complete, for he only established that some root of the likelihood equation, and not necessarily the absolute maximum of the Likelihood Function, L,

¹ In the sequel we shall now use the abbreviation ML for Maximum Likelihood and MLE for Maximum Likelihood Estimate, or Estimation, etc., in accordance with the context.
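As a small numerical aside, the formal definition above can be carried out directly on a computer: approximate Θ by a finite grid and take the value of θ at which L (equivalently l = log L) is greatest. The sketch below is my own illustration, not an example from the paper; it uses the exponential density f(x; θ) = θe^(−θx), whose MLE has the closed form θ̂ = 1/x̄, which the grid maximization should reproduce up to the grid spacing.

```python
import math
import random

def log_likelihood(theta, xs):
    # l(theta) = sum_i log f(x_i; theta) for f(x; theta) = theta * exp(-theta * x)
    return sum(math.log(theta) - theta * x for x in xs)

random.seed(0)
true_theta = 2.0
xs = [random.expovariate(true_theta) for _ in range(1000)]

# Approximate the parameter space by a finite grid and take the maximizer,
# exactly as in the definition: theta_hat is where L attains its supremum.
grid = [0.01 * k for k in range(1, 1001)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, xs))

closed_form = len(xs) / sum(xs)  # analytic MLE: 1 / sample mean
print(round(theta_hat, 2), round(closed_form, 3))
```

Because the log-likelihood here is strictly concave in θ, the grid maximizer lands within one grid step of the closed-form estimate.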

is consistent. This result was improved by Huzurbazar (1948), but it was not until A. Wald's (1949) paper appeared on the scene that a rigorous proof of the consistency of the MLE of a parameter within a wide class of estimation situations existed. By this time there could, of course, be no question of MLE's being generally consistent. In 1948, J. Neyman and E. L. Scott showed that in the general case, where the number of parameters increases with the number of observations, the MLE's would be inconsistent.

The concept of efficiency suffered a similar fate, for in 1951 J. L. Hodges constructed an estimator, not the MLE, which was asymptotically normal and yet for some values of the unknown parameter had asymptotically a variance that was definitely less than that of the MLE. Thus efficiency, in Fisher's sense, could not be regarded as a property that would apply to MLE's generally, but only to a limited class of estimators. By the end of the 1950s, examples of inconsistency and "super-efficiency", many of them highly ingenious, were fairly numerous, and it might now be supposed that MLE cannot be accorded any special importance relative to other methods of estimation.

This, however, is an oversimplification, for many of these examples are of a very theoretical type and do not have much practical value. In any case, a reformulation of the concepts of consistency and efficiency in operational rather than mathematical terms, by Rao (1961), showed that MLE's could still be regarded as having optimal asymptotic properties within a wide class of estimators, and to this extent, at least, Fisher's original assertions have been realized.
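The flavour of Hodges' construction can be imitated in a few lines. The sketch below is a standard textbook version of the idea, not Hodges' original presentation: estimating a normal mean θ from N(θ, 1) samples, it replaces the sample mean by 0 whenever |x̄| falls below n^(−1/4). At θ = 0 the resulting estimator has scaled risk tending to 0, beating the MLE's limiting value 1/I(θ) = 1, which is exactly the anomaly referred to above. The sample size and replication count are arbitrary choices.

```python
import random

random.seed(0)
n, reps = 400, 2000
cutoff = n ** -0.25          # the thresholding that produces super-efficiency at 0
mle_risk = hodges_risk = 0.0
for _ in range(reps):
    xbar = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n  # MLE of the mean
    t = xbar if abs(xbar) > cutoff else 0.0                   # Hodges-type estimator
    mle_risk += n * xbar * xbar / reps       # n * squared error of the MLE, averaged
    hodges_risk += n * t * t / reps          # n * squared error of the thresholded estimator
print(round(mle_risk, 2), round(hodges_risk, 2))
```

At the true value θ = 0 the scaled risk of the MLE comes out near 1, while that of the thresholded estimator is essentially 0; away from a neighbourhood of 0 the two estimators coincide with probability tending to one.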

3. Some Fundamental Properties

In this section we summarize a number of well-known results of fundamental importance. We have already given a formal definition of MLE, and we see that it is very general and implies neither the differentiability of L nor even the uniqueness of θ̂.

In the single-parameter case, where Θ is some subset of the set of real numbers, L is a function of a single variable θ. Clearly, if θ̂ exists and is not a terminal value, and ∂L/∂θ exists for all θ ∈ Θ, then θ̂ will be a root of ∂L/∂θ = 0. In many simple situations L is unimodal and θ̂ is the unique root of ∂L/∂θ = 0, whose solution is usually straightforward. However, l = log L is in general a simpler function to work with, and as the maxima of l and L occur at the same values of θ, it is usual to consider the equation

∂l/∂θ = 0,

which is referred to as the Likelihood Equation (LE). This is perhaps unfortunate since in some instances no root of the LE is the MLE. As an illustration, consider the estimation of θ in the Type III distribution

f(x; θ) = (θ^λ / Γ(λ)) x^(λ−1) e^(−θx)   (x > 0, θ > 0),

where λ is a known positive constant. For a sample of n independent observations x₁, x₂, ..., xₙ, it is easily shown that

l = nλ log θ − n log Γ(λ) + (λ − 1) Σᵢ log xᵢ − θ Σᵢ xᵢ,

which is a continuous differentiable function of θ for all θ > 0.


Clearly l has an absolute maximum at θ = nλ / Σᵢ xᵢ, which is therefore θ̂, since

∂l/∂θ = nλ/θ − Σᵢ xᵢ

is zero at this value of θ and nowhere else, and ∂²l/∂θ² = −nλ/θ² < 0 for all θ > 0. Thus θ̂ is unique in this case, and there is no problem of the LE having multiple roots.

It is also not difficult to establish that, in this case, θ̂ is not unbiased. There are in fact many instances of biased MLE's, but this is not, of course, a serious objection to the method provided, as is usual, the bias tends to zero as n tends to infinity. This is in fact the case here, for it is not difficult to show that

E(θ̂) − θ = θ/(nλ − 1) = O(n⁻¹).

It is in any case often possible to correct for bias in small samples.¹

Difficulties will, however, very often arise when the LE has more than one root, or L is not differentiable everywhere, or when θ̂ is a terminal value. In the multiparameter case the situation may become extremely involved, and it may not be possible to obtain the absolute maximum of L merely by setting the partial derivatives of L to zero. We might for example get a "saddle" point by this method,² so that the solution of the LE would not in fact be the MLE in accordance with the definition. Even if this procedure does, in principle, give us θ̂, we may still be confronted with serious computational difficulties. The LE's can be extremely complicated, and a closed-form solution will often be out of the question. Some form of iterative procedure will then be necessary, and the computer-time aspect may then be a limiting factor. We might then have to abandon MLE altogether and employ some other estimation procedure, such as the method of moments, which even if less efficient would lead to more straightforward calculations. However, these problems clearly form a separate study in themselves, and as such come outside the scope of this paper, which is primarily concerned with the properties of θ̂ as an estimator, rather than with the question of how it may be calculated.
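The bias just quoted can be checked by simulation. The sketch below is my own illustration: it draws repeated samples from the Type III (gamma) density with λ = 2, θ = 3 and n = 5 (arbitrary illustrative values, small n so the bias is visible) and compares the average of θ̂ = nλ/Σxᵢ with θ + θ/(nλ − 1).

```python
import random

random.seed(0)
lam, theta, n, reps = 2.0, 3.0, 5, 100000
total = 0.0
for _ in range(reps):
    # random.gammavariate takes (shape, scale); rate theta means scale 1/theta
    s = sum(random.gammavariate(lam, 1.0 / theta) for _ in range(n))  # sum of the x_i
    total += n * lam / s                      # theta_hat = n*lambda / sum(x_i)
mean_hat = total / reps
expected = theta + theta / (n * lam - 1)      # theta + bias = 3 + 3/9
print(round(mean_hat, 2), round(expected, 2))
```

The Monte Carlo average of θ̂ should agree with the theoretical value 10/3 to within sampling error, illustrating both the bias and its O(n⁻¹) size.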
We pass on therefore to consider again consistency, efficiency and sufficiency. The existence of a sufficient statistic has, in particular, a close bearing on the MLE, for in this case L can be factored as (in the usual notation)

L(x | θ) = g(t | θ) h(x)   (3.1)

where t is sufficient for θ, and x = (x₁, x₂, ..., xₙ). Thus any solution of the LE is also a solution of

∂g(t | θ)/∂θ = 0,

which implies that θ̂ will be a function of t. Thus, if a sufficient statistic exists, then θ̂ will be a function of it.

¹ The problem of determining the magnitude, up to various orders of 1/n, of the bias in certain MLE's has been considered in a number of papers, e.g. Haldane (1953), Haldane and Smith (1956), Bowman and Shenton (1965, 1968, 1969) and Shenton and Bowman (1963, 1967). There is also a recent paper by Box (1971) which deals with the problem from a more general standpoint.
² See, for example, Solari (1968). Further comments about this type of difficulty will be made in Section 9.

It is also known that, under the regularity conditions that ∂L/∂θ and ∂²L/∂θ² exist for all θ and all x, a necessary and sufficient condition for there to exist an estimator t of θ whose variance attains the Minimum Variance Bound (MVB) of Cramér-Rao,

V₀ = 1 / E[(∂l/∂θ)²],

say, is that ∂l/∂θ can be written in the form

∂l/∂θ = A(θ)(t − θ)   (3.2)

where A(θ) is a function of θ only. Further, if this condition is satisfied then no function of θ, apart from θ itself, can have an estimator with this property.

If the above identity (3.2) holds, then clearly the LE has the unique solution θ̂ = t, and since at this value of θ, ∂²l/∂θ² is negative, we have that l, and hence L, has an absolute maximum at θ = t, which is therefore θ̂. We thus have the important result that in a regular estimation situation, if there is an MVB estimator, it can be obtained by the ML method. A consideration of (3.1) and (3.2) will show further that in a regular situation the MVB cannot be attained unless a sufficient statistic exists. The converse of this is not, however, necessarily true, as is illustrated by our example earlier in this section. Here θ̂ is indeed sufficient for θ, but ∂l/∂θ is not of the form A(θ)(θ̂ − θ). In fact, the efficiency of θ̂, i.e. V₀ / V(θ̂), can be shown to be less than unity for any given n. (See Cramér (1946), Chapter 33.) On the other hand, as this ratio tends to one as n tends to infinity, we can say that θ̂ is asymptotically efficient. To show also that it is efficient in Fisher's sense we would have to establish that, asymptotically, θ̂ has a normal distribution, which is in fact the case.

In a limited number of cases where the MVB is not attained the MLE can be used in conjunction with the Rao-Blackwell process to obtain the MVUE. Thus, for example, in the case of the binomial distribution with parameters n and θ, if r is the number of "successes" realised then θ̂ = r/n.
As this estimator is unbiased and sufficient for θ, and the distribution is complete, i.e. E_θ[f(r)] = 0 for all θ implies f(r) ≡ 0, we have immediately that any function of θ̂ which is unbiased for the function of θ which it estimates is the MVUE.² Thus, as

E[r(n − r)/(n − 1)] = nθ(1 − θ),

then r(n − r)/(n − 1) is the MVUE of nθ(1 − θ), which is the variance.

Similarly, for example, r(r − 1)/(n(n − 1)) is the MVUE of θ².

¹ Since θ̂ is biased, V₀ = [1 + b′(θ)]² / E[−∂²l/∂θ²], where b′(θ) = db(θ)/dθ and b(θ) = E(θ̂) − θ.
² Note that as θ̂ = r/n is an MVB estimator, then in view of our earlier remarks no other function of θ̂ can be MVB for the function of θ it estimates.
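Both of these MVUE identities are finite sums over the binomial distribution and can be checked exactly, with no simulation. The helper below is my own (the function name and the particular n, θ are arbitrary); it computes E_θ[f(r)] for r ~ Binomial(n, θ) term by term and confirms E[r(n − r)/(n − 1)] = nθ(1 − θ) and E[r(r − 1)/(n(n − 1))] = θ².

```python
from math import comb

def binom_expectation(n, theta, f):
    # exact E_theta[f(r)] for r ~ Binomial(n, theta), summed over the pmf
    return sum(f(r) * comb(n, r) * theta**r * (1 - theta) ** (n - r)
               for r in range(n + 1))

n, theta = 10, 0.3
var_mvue = binom_expectation(n, theta, lambda r: r * (n - r) / (n - 1))
sq_mvue = binom_expectation(n, theta, lambda r: r * (r - 1) / (n * (n - 1)))
print(round(var_mvue, 10), round(sq_mvue, 10))  # n*theta*(1-theta) = 2.1 and theta**2 = 0.09
```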

It does not necessarily follow, however, that we must always first obtain the MLE in order to begin the Rao-Blackwell process. All that is required initially is a sufficient statistic and some unbiased estimator, which may be obtained in any way we like.

Consistency, on the other hand, is of its very nature an asymptotic property, and so the distinction between small and large samples does not arise. It is a property which holds under fairly wide conditions. Certainly in cases such as we have considered in this section, where Cramér's (1946) conditions hold and the LE has a unique solution, we know at once that θ̂ is consistent for θ. We shall consider this property in detail in the next section, but before passing on we conclude with a brief discussion of the property of invariance which MLE's possess. It can be shown that if φ(θ) is a 1-1 function of θ, i.e. both φ(θ) and φ⁻¹(θ) are single-valued, then the MLE of φ(θ) is φ(θ̂). This result can lead to considerable simplification of any necessary calculations when, for any reason, we need not only the estimate of a parameter but also an estimate of some function of it. So long as we are dealing with MLE's, therefore, no reparametrization will be necessary. Thus in the case of the normal density

f(x) = (1/√(2πv)) exp{−(x − μ)²/(2v)},

with μ known but v, the variance, unknown, it is easily shown that

v̂ = (1/n) Σᵢ (xᵢ − μ)²,

and we know therefore that the MLE of σ, the standard deviation, is

σ̂ = √{(1/n) Σᵢ (xᵢ − μ)²} = √v̂.¹
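The invariance property can be seen numerically: maximizing the likelihood directly over σ lands, up to grid resolution, on the square root of the variance MLE, so no separate maximization is ever needed. The data, seed and grid below are my own arbitrary illustrative choices.

```python
import math
import random

random.seed(1)
mu = 0.0
data = [random.gauss(mu, 2.0) for _ in range(500)]
n = len(data)
ss = sum((x - mu) ** 2 for x in data)        # sum of squared deviations from the known mean

def loglik_of_v(v):
    # log-likelihood of N(mu, v) with mu known, as a function of the variance v
    return -0.5 * n * math.log(2 * math.pi * v) - ss / (2 * v)

v_hat = ss / n                               # closed-form MLE of the variance
grid = [0.001 * k for k in range(1, 10000)]  # sigma grid over (0, 10)
sigma_hat = max(grid, key=lambda s: loglik_of_v(s * s))  # maximize over sigma directly
print(abs(sigma_hat - math.sqrt(v_hat)) < 0.002)  # True
```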

4. Consistency

We now give an outline of Cramér's (1946) proof that some root of the LE which is consistent for θ exists under the following regularity conditions. These are stated in full since they are now standard in statistical theory.

C.1. For almost all x, ∂ log f/∂θ, ∂² log f/∂θ² and ∂³ log f/∂θ³ exist for every θ ∈ Θ.²

C.2. For all θ ∈ Θ, |∂f/∂θ| < F₁(x), |∂²f/∂θ²| < F₂(x) and |∂³ log f/∂θ³| < H(x), F₁ and F₂ being integrable over (−∞, ∞) and ∫ H(x) f(x, θ) dx < M, where M is independent of θ.

C.3. For all θ ∈ Θ, E[(∂ log f/∂θ)²] = ∫ (∂ log f/∂θ)² f dx is finite and positive.

¹ In order to preserve the 1-1 aspect, we regard v and σ as related by σ = v^½ rather than by σ² = v.
² f is written for f(x; θ).

By Taylor's theorem we can therefore write

∂ log fᵢ/∂θ = (∂ log fᵢ/∂θ)₀ + (θ − θ₀)(∂² log fᵢ/∂θ²)₀ + ½λ(θ − θ₀)² H(xᵢ),

where fᵢ = f(xᵢ; θ), |λ| < 1, θ₀ is the true value of the parameter, and the partial derivatives with respect to θ are evaluated at θ = θ₀. Summing over i we obtain for the LE

(1/n) ∂l/∂θ = B₀ + B₁(θ − θ₀) + ½λB₂(θ − θ₀)²   (4.1)

where

B₀ = (1/n) Σᵢ (∂ log fᵢ/∂θ)₀,  B₁ = (1/n) Σᵢ (∂² log fᵢ/∂θ²)₀  and  B₂ = (1/n) Σᵢ H(xᵢ).

From the conditions C.1, C.2 and C.3 it follows that

E[∂ log f/∂θ]₀ = 0,

and that E[∂² log f/∂θ²]₀ is a strictly negative constant, −k² say. Thus by Khintchine's theorem¹ B₀ and B₁ converge in probability to zero and −k² respectively, while B₂ converges in probability to E[H(x)], which is positive and less than a fixed quantity independent of θ, by condition C.2. Therefore if we put θ = θ₀ + δ, so that

(1/n) ∂l/∂θ = B₀ + B₁δ + ½λB₂δ²,

then it follows that, for all sufficiently small δ, (1/n) ∂l/∂θ and −k²δ have the same sign in probability as n tends to infinity. Thus the probability that there is a root of the LE in the interval (θ₀ − δ, θ₀ + δ) tends to one as n tends to infinity. Since δ may be arbitrarily small, it follows that a consistent root of the LE exists.

V. S. Huzurbazar (1948) argues along similar lines, and from Cramér's conditions proves that if θ̂ is a consistent solution of the LE then

lim_{n→∞} P[(∂²L/∂θ²)_{θ = θ̂} < 0] = 1,

which implies that the probability that L has a local maximum at θ = θ̂ approaches unity as n approaches infinity. A straightforward application of Rolle's theorem then shows that there cannot be more than one consistent solution. Thus Cramér's and Huzurbazar's results together prove that under certain regularity conditions the LE has a unique consistent solution, and that at this value of θ the LF has a local maximum with probability tending to one. L. Weiss (1963, 1966) develops these ideas further. In the first of these two papers he proves Cramér's result on consistency, but under far less restrictive conditions. The LF is not even assumed continuous.

¹ The conditions of Khintchine's theorem, which is a form of the law of large numbers, require identically distributed variables with finite expectations.

This result is improved further in the second paper, where it is shown that there exists a sequence {δₙ}, converging to zero, such that the probability that the LF has a relative maximum in the open interval (θ₀ − δₙ, θ₀ + δₙ) tends to one as n tends to infinity. This result has an interesting application from the computational point of view. Suppose that

f(x; θ) = …   (−∞ < x < ∞),

where Ẑₙ is the median of the observed values x₁, x₂, ..., xₙ; then from Weiss' result it can be shown that θ̃ₙ is asymptotically identical to θ̂. Since, by definition, there is an explicit form for θ̃ₙ which can be easily computed for a given sample, it is therefore a relatively straightforward matter to obtain, for example, an "asymptotic confidence interval" for θ₀.

The results of Cramér and Huzurbazar together do not imply the consistency of θ̂ in the general case. It might, however, be inferred that this would follow if we restrict θ to a sufficiently small neighbourhood of θ₀. That this is not the case is shown by the following counter-example due to Kraft and Le Cam (1956). Here Cramér's conditions are satisfied and θ is identifiable. The LE has roots and, with probability tending to one as n tends to infinity, the MLE exists and is a unique root of the LE. Nevertheless, the MLE is inconsistent, while at the same time consistent estimates do exist.

Suppose Θ = ∪_{k=1}^∞ Aₖ, where Aₖ is the open interval (2k, 2k + 1), and that {αₖ} is some defined ordering of the rationals in (0, 1). Let p(θ) = αₖ if θ ∈ Aₖ, and let (Xᵢ, Yᵢ, Zᵢ) be multinomially distributed with probabilities p₁ = p(θ) cos² 2πθ, p₂ = p(θ) sin² 2πθ and p₃ = 1 − p(θ). For n independent observations we have

l = n₁ log p₁ + n₂ log p₂ + n₃ log p₃ + f(n₁, n₂, n₃),

where

n₁ = Σᵢ Xᵢ,  n₂ = Σᵢ Yᵢ,  n₃ = Σᵢ Zᵢ,

and f(n₁, n₂, n₃) is a function of n₁, n₂, n₃ independent of θ. The solutions of the LE can be shown to be of the form θ = m + α/(2π), where α is the acute angle tan⁻¹ √(n₂/n₁) and m is any integer. Since, however, l is maximized by taking p(θ) as near as possible to (n₁ + n₂)/n, only one of these solutions for θ can be θ̂. Also we must have lim P[1 − p(θ̂) = n₃/n] = 1, so that if θ̂ is consistent it must eventually remain in some fixed interval Aₖ, so that p(θ̂) = αₖ. The probability of this occurring, however, tends to zero as n tends to infinity.

We pass on therefore to consider Wald's proof of consistency, which is more satisfactory, for it does establish the consistency of θ̂ and not merely of some root of the LE. His approach is quite different, for his regularity conditions do not include any differentiability assumptions. We give only a brief outline of the proof, which is set in a multiparameter context. If the random variable u = log f(x, θ) − log f(x, θ₀) is considered, then it can be established by a generalized form of the theorem of means that when θ ≠ θ₀,

E_{θ₀}[log f(xᵢ, θ)] < E_{θ₀}[log f(xᵢ, θ₀)].

Summing over i and letting n tend to infinity, it follows from the strong law of large numbers that

lim_{n→∞} P[l(x, θ) < l(x, θ₀)] = 1.¹

On the other hand, by definition, l(x, θ̂) ≥ l(x, θ₀), so that we must have P(lim_{n→∞} θ̂ = θ₀) = 1.

There is an interesting comment by Wolfowitz (1949) which follows immediately afterwards. He makes the point that by appealing to the strong law of large numbers Wald actually proves strong convergence, which is more than consistency. He shows that in fact it is only necessary to appeal to the weak law of large numbers to establish the consistency of θ̂. This, according to Wolfowitz, extends the class of dependent random variables to which the result can be applied. Kraft (1955) also establishes a theorem giving conditions under which θ̂ will be consistent, strongly consistent and uniformly consistent. He observes that in fact strong convergence is of more than theoretical importance, as Wald (1951) has shown that there is a class of asymptotic minimax estimates whose construction depends on the existence of a strongly consistent estimate. Kraft's final theorem proves strong consistency for the case where the observations are independent and identically distributed, though he himself remarks that this can be deduced from a trivial generalization of Wald's (1949) results.
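The key inequality in Wald's argument, E_{θ₀}[log f(x, θ)] < E_{θ₀}[log f(x, θ₀)] for θ ≠ θ₀, can be checked in closed form for a simple family. The sketch below uses the exponential density f(x; θ) = θe^(−θx), my own illustrative choice rather than one of Wald's examples: there E_{θ₀}[log f(X; θ)] = log θ − θ/θ₀, and a grid search confirms that this expected log-likelihood is maximized at θ = θ₀.

```python
import math

theta0 = 2.0

def expected_loglik(theta):
    # E_{theta0}[log f(X; theta)] for f(x; theta) = theta * exp(-theta * x):
    # log(theta) - theta * E_{theta0}[X] = log(theta) - theta / theta0
    return math.log(theta) - theta / theta0

grid = [0.01 * k for k in range(1, 1000)]
best = max(grid, key=expected_loglik)
print(round(best, 6))  # 2.0 -- the maximizer of the expected log-likelihood is theta0 itself
```

This is the population-level fact that drives the proof: averaging the sample log-likelihood and applying the law of large numbers pulls its maximizer towards θ₀.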
We conclude by remarking that, since there are cases of inconsistent MLE's, some conditions must be imposed in order to prove consistency. Conditions such as Wald's form a sufficient set. A more difficult problem would be to establish a set of necessary and sufficient conditions, as, for example, Kolmogorov has done for a form of the law of large numbers. This problem appears not to have been considered by anyone.

5. Efficiency

We have already given a definition of efficiency in Section 3, which is now a standard measure of estimator performance. However, it is to some extent arbitrary, as will be explained subsequently, and we shall therefore begin by considering the possibility of more general criteria. The "natural" definition of optimality is the following. The estimator T of θ is optimum if

P[|T − θ| ≤ ε] ≥ P[|Y − θ| ≤ ε]   (A)

for any other estimator Y of θ and every ε > 0.

P[|N| ≥ ε/τₙ] = P[|Tₙ − g(θ)| ≥ ε]   (B)

where Tₙ is an estimator, based on n independent observations, of g(θ),² a real-valued function of an unknown parameter θ, and N is the standard normal variate; (B) defines the quantity τₙ = τₙ(Tₙ, ε, θ).

¹ l(x, θ) = log f(x₁, θ) f(x₂, θ) ... f(xₙ, θ), and similarly l(x, θ₀).
² g(θ) could, of course, be simply θ.

The motivation for definition (B) arises in the first place from a consideration of the approximation

P[|N| ≥ ε/σₙ] ≈ P[|Tₙ − g(θ)| ≥ ε | θ]   (5.1)

where σₙ² is the asymptotic variance of Tₙ. Such an approximation is unsatisfactory, however, since the error involved is unknown. Thus σₙ might differ considerably from τₙ, which is a true measure of the performance of Tₙ. It would in any case be permissible to use (5.1) only within the class of CAN estimators. Definition (B), however, is quite general and clearly leads to a criterion of optimality in agreement with (A), for if Uₙ is any other estimator of g(θ) then

P[|Tₙ − g(θ)| ≤ ε | θ] ≥ P[|Uₙ − g(θ)| ≤ ε | θ]

if and only if τₙ(Tₙ, ε, θ) ≤ τₙ(Uₙ, ε, θ). In fact, without restriction to the class of CAN estimators, Bahadur establishes the following results. If Tₙ is consistent for g(θ) then

lim_{ε→0} lim_{n→∞} {n τₙ²(Tₙ, ε, θ)} ≥ [g′(θ)]² / I(θ)   (5.2)

where I(θ) is the classical "information contained in x". This result is clearly analogous to Fisher's result

lim_{n→∞} {n σₙ²} ≥ [g′(θ)]² / I(θ)   (5.3)

for a consistent estimator. The appearance of I(θ) in Bahadur's result again bears witness to its general importance as a statistical measure. He also proves, under certain regularity conditions which are a combination of Wald's (1949) conditions and Cramér's (1946) conditions, plus certain other assumptions, that if Uₙ is the MLE of g(θ), based on a sample of n independent observations, then

lim_{ε→0} lim_{n→∞} {n τₙ²(Uₙ, ε, θ)} = [g′(θ)]² / I(θ).   (5.4)

Clearly (5.2) and (5.4) imply that MLE's are asymptotically efficient, i.e. "best", in the sense of definition (A) or (B). This is a very satisfactory result, although limited to a relatively small class of estimators, and it might now be hoped that it would be possible to go on, using the parameter τₙ, to establish the asymptotic efficiency of any estimator of whatever type. Thus if Tₙ is asymptotically efficient, the asymptotic efficiency of Tₙ′, some other estimator, would be measured by an expression such as

lim_{ε→0} lim_{n→∞} P[|Tₙ − g(θ)| > ε] / P[|Tₙ′ − g(θ)| > ε].

Unfortunately the analysis necessary to follow up such an approach appears to be of extreme difficulty.

Another contribution along similar lines is provided by Wolfowitz (1965), who is strongly critical of Fisher's conclusions about the efficiency of MLE's, partly because the argument does not take non-normal estimators into account, and partly because super-efficient estimators do exist. Wolfowitz's argument requires heavy regularity conditions on f(x, θ), which nevertheless he claims are operationally sensible, but no restrictions on a general estimator sequence {Tₙ}, except that the distribution of √n(Tₙ − θ) should approach some limiting distribution L(x, θ)

uniformly in both x and θ. There is therefore no requirement, as in the case of Fisher's argument, that Tₙ should be asymptotically normal. He then proves that if θ₀ is a "smooth" point for L, i.e. any point of Θ except possibly for an at most denumerable set of turning points, then for any positive b and c,

lim P[−c + w(θ₀) < …¹

Let X₁, X₂, ..., Xₙ be independent N(μ, 1) variates, so that

X̄ₙ = (1/n) Σᵢ Xᵢ  and  Sₙ = Σᵢ (Xᵢ − X̄ₙ)²

are independently distributed, and the distribution of Sₙ does not depend on μ. Let aₙ be such that

P(Sₙ > aₙ) = 1/n,

and define Hₙ by Hₙ = 0 if Sₙ ≤ aₙ, Hₙ = 1 if Sₙ > aₙ. Further, let tₙ = X̄ₙ(1 − Hₙ) + nHₙ and tₙ′ = X̄_{[√n]}, where, as usual, [x] denotes the largest integer not exceeding x. Then

√n(tₙ − μ) = √n(X̄ₙ − μ) + √n Hₙ(n − X̄ₙ),

and as P(Hₙ = 0) = 1 − 1/n tends to 1 as n tends to infinity, it follows that √n(tₙ − μ) converges in distribution to a standard normal variate. Thus the asymptotic variance of tₙ is 1/n, whereas the asymptotic variance of tₙ′ is clearly 1/[√n].

[l(θ), u(θ)] is referred to as the median interval. If there is a unique median then l(θ) = u(θ) = w(θ) = W(θ).

On the other hand, as X̄_n and H_n are independent, then for all n > μ + ε,

P[|t_n − μ| > ε] = P[|X̄_n − μ| > ε] P(H_n = 0) + P(H_n = 1) ≥ 1/n,

since P(H_n = 1) = 1/n, so that n P[|t_n − μ| > ε] does not tend to zero. Also P[|t_n′ − μ| > ε] = P[|X̄_[√n] − μ| > ε] tends to zero faster than any power of 1/n, since the tail probability of a normal mean decreases exponentially with the sample size.

Judged therefore by criterion (A), t_n′ is better, one might say infinitely better, than t_n, but on the basis of the usual definition of efficiency t_n is better than t_n′. Nevertheless, this type of situation is probably very exceptional, and indeed the CAN estimators form a large class. We can therefore continue our discussion of efficiency in the usual sense without undue restriction.

We have already remarked that Fisher never gave what we would now describe as regularity conditions in his proof of the asymptotic efficiency of MLE's. The first rigorous proof is due, in fact, to Cramér (1946). This forms part of the theorem to which reference has already been made in connection with consistency. Under the conditions stated there, it is proved that the solution of the LE which is consistent is also asymptotically efficient. The whole theorem can thus be summarized by saying that there is a root of the LE, θ̂, such that the variate √n(θ̂ − θ₀) converges in distribution to that of a N(0, 1/I(θ₀)) variate.

Cramér's proof of efficiency follows on immediately from the argument for consistency. In the notation of section 4 it is shown that if θ̂ is the solution of the LE now known to exist, then from the LE we obtain

k√n(θ̂ − θ₀) = [ (1/k√n) Σᵢ₌₁ⁿ (∂ log f/∂θ)_{θ=θ₀} ] / [ −B₁/k² − (B₂/2k²)(θ̂ − θ₀) ].

Since θ̂ has now been proved consistent for θ₀, it follows that the denominator of the right-hand side of the above converges in probability to 1. Further, (∂ log f/∂θ)_{θ=θ₀} is, for each i, a variate with zero mean and variance k², so that a straightforward application of the Lindeberg–Lévy theorem shows that Σᵢ₌₁ⁿ (∂ log f/∂θ)_{θ=θ₀} is asymptotically N(0, k²n), so that the numerator above is asymptotically N(0, 1). Thus k√n(θ̂ − θ₀) is asymptotically N(0, 1), or √n(θ̂ − θ₀) is asymptotically N(0, 1/I(θ₀)), where I(θ₀) = E[(∂ log f/∂θ)²_{θ=θ₀}] = k².
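The limit law √n(θ̂ − θ₀) → N(0, 1/I(θ₀)) can be checked numerically. The sketch below (added for illustration; the exponential density and all constants are arbitrary choices, not taken from the survey) uses f(x, θ) = θe^{−θx}, for which the LE gives θ̂ = 1/X̄ and I(θ) = 1/θ²:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, n, reps = 2.0, 500, 10000

# f(x, theta) = theta * exp(-theta * x): d(log f)/d(theta) = 1/theta - x,
# so I(theta) = var(x) = 1/theta^2, and the LE root is theta_hat = 1/xbar.
x = rng.exponential(scale=1/theta0, size=(reps, n))
theta_hat = 1 / x.mean(axis=1)

z = np.sqrt(n) * (theta_hat - theta0)
print(z.mean(), z.var())   # ≈ 0 and ≈ 1/I(theta0) = theta0**2 = 4
```

The empirical variance of √n(θ̂ − θ₀) settles near θ₀² = 1/I(θ₀), as the theorem predicts.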

We remark in passing that the quantity I(θ₀) forms part of the lower bound 1/nI(θ₀) for the asymptotic variance of any estimator of θ, when θ₀ is the true value of θ. Fisher himself showed in a non-rigorous way that θ̂ attains this bound, and may therefore be said to be efficient. He therefore described such an estimator as one which utilizes all the available information in the sample, in the sense that no other estimator based on the same sample could in probability give a more precise estimate of θ₀. For this reason I(θ₀) is often described as the information contained in a single observation, or simply as "Fisher's information". Similarly, the quantity E[−∂² log L/∂θ²]_{θ=θ₀}, which can be shown to be equal to nI(θ₀), is the information available in a sample of n independent observations.

Returning now to Cramér's proof of efficiency we remark that his conditions, as stated at the beginning of section 4, are sufficient rather than necessary, as is clearly shown by a consideration of the density f(x, θ) = ½e^{−|x−θ|}. Here θ̂ is the median, which asymptotically is normally distributed and is efficient. Yet ∂ log f(x, θ)/∂θ is discontinuous and ∂² log f(x, θ)/∂θ² is zero almost everywhere, hence contradicting condition 1. This example is due to H. E. Daniels (1961), who proves asymptotic efficiency under more general conditions which do not involve ∂² log f(x, θ)/∂θ².

In the case of finite samples we have already seen that if the MVB is attained then efficient estimators can be obtained by the method of ML. What happens in a regular estimation problem when the MVB is not attained? The following elementary example is enough to show that the MLE does not necessarily have the least possible variance in this case. Let f(x, θ) = θ²xe^{−θx}, (θ > 0); then for a sample of n independent observations x₁, x₂, ..., x_n it is easily shown that θ̂ = 2n / Σᵢ₌₁ⁿ xᵢ,

whereas an MVUE is θ* = (2n − 1) / Σᵢ₌₁ⁿ xᵢ, so that var(θ*)/var(θ̂) = [(2n − 1)/2n]² < 1. Nevertheless this is clearly a regular estimation situation, since

∂l/∂θ = 2n/θ − Σxᵢ and ∂²l/∂θ² = −2n/θ²

both exist for θ > 0 and all x. Another possibility might be to consider the variance bounds derived by Bhattacharyya (1946-7-8)¹, since they are a generalization of the Cramér-Rao MVB. A possible result for the MLE would be that in a regular estimation situation the MLE attains the kth order B-bound, if this is the minimum variance which is attained. Fend (1959), however, has shown that this is not necessarily true. This is done by considering

f(x, θ) = θ^{−c} e^{−xθ^{−c}}  (c > 0, θ > 0),

where c is supposed known and a single observation is taken by which to estimate θ. It is shown that if 1/c is an integer, k, then the kth B-bound is the minimum variance which is attained, and it is attained by the estimator x^k/k!, which is unbiased for θ. The MLE, on the other hand, is x^k, which clearly therefore does not attain this bound for k > 1.
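The variance comparison for the gamma example above can be verified by simulation (an added sketch; the values of θ and n are arbitrary). Since Σxᵢ has a Gamma(2n, θ) distribution, both the MLE 2n/Σxᵢ and the unbiased estimator (2n − 1)/Σxᵢ are constant multiples of 1/Σxᵢ, so their sample-variance ratio is exactly [(2n − 1)/2n]²:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.5, 10, 200_000

# f(x, theta) = theta^2 * x * exp(-theta * x) is the Gamma(shape 2, rate theta) density.
x = rng.gamma(shape=2.0, scale=1/theta, size=(reps, n))
s = x.sum(axis=1)

mle = 2*n / s           # theta_hat = 2n / sum(x_i), biased upward
mvue = (2*n - 1) / s    # unbiased, since E[1/sum(x_i)] = theta/(2n - 1)

print(mvue.mean())              # ≈ theta = 1.5
print(mvue.var() / mle.var())   # = ((2n-1)/2n)^2 = 0.9025 for n = 10
```

The ratio is below 1 for every n, so the MLE has strictly larger variance than the MVUE here despite the regularity of the problem.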

¹ A full account of these bounds is also given, for example, in The Advanced Theory of Statistics, Vol. 2, chapter 17, by Kendall and Stuart.

There does not, in fact, appear to be any particular bound, such as the MVB of Cramér-Rao, which the MLE must attain in such cases. In the non-regular case the MVB may not even be defined, so that we cannot talk of efficiency in this sense. Neither is it possible, as an alternative, to define efficiency relative to the MVUE, for Basu (1952) has shown that such estimators do not exist in a large number of situations. Some progress towards defining non-trivial variance bounds for certain well-defined classes of non-regular situations has, however, been made by, for example, Chapman and Robbins (1951) and Fraser and Guttman (1952). There is also a set of three papers by Blischke et al. (1965, 1968a and b) on the estimation of the parameters of the Pearson type III distribution

f(x) = [1/(β^α Γ(α))] (x − γ)^{α−1} e^{−(x−γ)/β},  x > γ  (α > 0, β > 0),

= 0 otherwise, which is a good example of a non-regular situation. They also derive a new bound which is applied to the Pearson type I and IV distributions, and to the Weibull distribution. There is, however, certainly nothing approaching a unified theory of non-regular estimation, and there are no simple measures such as the MVB to provide a yardstick of efficiency. Thus no general conclusions about the MLE, or indeed about any estimator, can be made for such situations, and most non-regular estimation problems must therefore be considered ad hoc. Neither is it necessarily true that MLE's have asymptotically the least variance for large samples in non-regular situations, as the following example due to Basu (1952) shows. The estimator T = (2g + l)/5 of θ, where g and l are the greatest and least values of a sample of n independent observations drawn from a population with a rectangular distribution on [θ, 2θ], has an asymptotic variance of θ²/5n². The MLE, θ̂ = g/2, on the other hand, has a variance of θ²/4n², so that v(T)/v(θ̂) = 0.8.
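Basu's 0.8 ratio is easily reproduced numerically (an added sketch with arbitrary constants). For a sample from the rectangular distribution on [θ, 2θ] the likelihood is θ^{−n} on g/2 ≤ θ ≤ l and is decreasing, so the MLE is g/2, while T = (2g + l)/5:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 1.0, 200, 20000

x = rng.uniform(theta, 2*theta, size=(reps, n))
g, l = x.max(axis=1), x.min(axis=1)

T = (2*g + l) / 5    # asymptotic variance theta^2 / (5 n^2)
mle = g / 2          # asymptotic variance theta^2 / (4 n^2)

print(T.var() / mle.var())   # ≈ 4/5 for large n
```

So in this non-regular problem the MLE is beaten, in the asymptotic-variance sense, by a simple function of the two extreme order statistics.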

Note on Arrangement of Bibliography

Those papers which are in the first group are of importance in estimation theory generally, and for various reasons come within the scope of this paper. Some of them contain results about MLE in particular. The papers in the second group are again of a theoretical type but are mainly or even exclusively concerned with MLE. In the third group we have papers containing particular applications of MLE. In the final group are those which consider simultaneously various estimators, among them MLE's, with regard to some particular problem of estimation. They therefore often provide useful comparisons of MLE's with other estimates. Papers marked with a single asterisk have some numerical content, and those with a double asterisk are almost wholly numerical or are concerned with the computational aspect of deriving MLE's.

Note. An additional forty entries were added at proof stage and are printed at the end of this Bibliography. (Editor.)

Bibliography

Group I

Aitken, A. C. and Silverstone, H. (1942). On the estimation of statistical parameters. Proc. Royal Soc. Edinburgh, A, 61, 186.
Barankin, E. W. and Gurland, J. (1951). On asymptotically normal and efficient estimates. Univ. California Pub. in Statist. 1, 89.
Bartlett, M. S. (1936). The information available in small samples. Proc. Cambridge Phil. Soc. 32, 560-566.


Basu, D. (1952). An example of non-existence of a minimum variance estimator. Sankhyā, 12, 43.
Basu, D. (1954). Choosing between two simple hypotheses and the criterion of consistency. From a Ph.D. thesis submitted to Calcutta University.
Basu, D. (1956). The concept of asymptotic efficiency. Sankhyā, 17, 193.
Basu, D. (1964). Recovery of ancillary information. Sankhyā, A, 26, 3-16.
Bhattacharyya, A. (1946-7-8). On some analogues of the amount of information and their use in statistical estimation. Sankhyā, 8, 1, 201, 315.
Blischke, W. R., Mundle, P. B. and Truelove, A. J. (1969). On non-regular estimation I. Variance bounds for estimation of location parameters. J. Amer. Statist. Assoc. 64, 1056-1072.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations (with discussion). J. R. Statist. Soc., B, 26, 211-252.
Chapman, D. G. and Robbins, H. E. (1951). Minimum variance estimation without regularity assumptions. Ann. Math. Statist. 22, 581-586.
Edgeworth, F. Y. (1908a). On the probable errors of frequency constants. J. R. Statist. Soc. LXXI, 381.
Edgeworth, F. Y. (1908b). On the probable errors of frequency constants. J. R. Statist. Soc. LXXI, 499.
Fend, A. V. (1959). On the attainment of Cramér-Rao and Bhattacharyya bounds for the variance of an estimate. Ann. Math. Statist. 30, 381-388.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, A, 222, 308-358.
Fisher, R. A. (1925). Theory of statistical estimation. Proc. Cambridge Phil. Soc. 22, 700-725.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc. Royal Soc., A, 144, 285.
Fraser, D. A. S. and Guttman, I. (1952). Bhattacharyya bounds without regularity assumptions. Ann. Math. Statist. 23, 629-632.
Fraser, D. A. S. (1964). On local inference and information. J. R. Statist. Soc., B, 26, 253-260.
Hodges, J. L. (Jr.) and Lehmann, E. L. (1950). Some applications of the Cramér-Rao inequality. Proc. 2nd Berkeley Symp. on Math. Statist. Prob. 13-22.
Kalbfleisch, J. D. and Sprott, D. A. (1969). Application of likelihood and fiducial probability to finite populations.New Developmentsin SurveySampling, 358-389. Kallianpur,G. and Rao, C. R. (1955). On Fisher'slower bound to asymptoticvariance of a constant estimate. Sankhyd,15, 331-342. Koopman, B. 0. (1936). On distributionsadmitting sufficient statistics. Trans.Ann. Math. Soc. 39, 399-409. Kraft, C. (1955). Some conditions for consistency and uniform consistency of statistical procedures. Univ. CaliforniaPub. in Statist. 2, 125. Kullback, S. and Liebler,R. A. (1951). On informationand sufficiency.Ann. Math. Statist. 22, 79. Le Cam, L. (1953). On the asymptotictheory of estimationand testing hypotheses.Proc. 3rd BerkeleySymp. on Math. Statist. Prob. 1, 129-156. Lehmann, E. and Scheff6, H. (1950). Completeness,similar regions and unbiased estimation. Sankhyd, 10, 305. Neyman, J. (1949). Contributionsto the theory of X2 test. Proc. 1st Berkeley Symp. on Math. Statist. Prob. 239-273. Pearson,K. (1896). Mathematicalcontributions to the theory of evolution IV. Regression,Heredity, Panmixia. Phil. Trans.Royal Soc. London,Ser., A, 187, 253-318. Pearson,K. and Filon, L. N. G. (1898). Mathematicalcontributions to the theory of evolution. On the probable errors of constants and on the influence of random selection on variation and correlation. Phil. Trans.Royal Soc. London,191, 229-311. Pfanzagl,J. (1969). On the measurabilityand consistencyof minimumcontrast estimates. Metrika, 14, 249-272. Pitman,E. J. C. (1938). The estimationlocation and scale parametersof a continuouspopulation of any given form. Biometrika,30, 391. Rao, C. R. (1945). Information and accuracy attainable in estimation of statistical parameters.Bull. Cal. Math. Soc. 37, 81-91. Rao, C. R. (1949). Sufficientstatistics and minimumvariance estimates. Proc. CambridgePhil. Soc. 45, 213-218. Rao, C. R. (1952a). Some theoremson minimumvariance estimations. Sankhyd, 12, 27-42. Rao, C. R. 
(1952b).Minimum variance estimations in distributionsadmitting ancillary statistics. Sankhyd, 12, 53-56. Rao, C. R. (1952c). On statisticswith uniformlyminimum variance. Science and Culture,17, 483-484. Rao, C. R. (1955). Theory of the method of estimation by minimumchi-square. Bull. Inter. Statist. Inst. 35, 25-32. Rao, C. R. (1961b).A study of large sample test criteriathrough propertiesof efficientestimates. Sankhyd, 23, 25-40. Rao, C. R. and Poti, S. J. (1946).On locally most powerfultests when alternativesare one-sided.Sankhya, 7, 439. Roy, Jogabratha,Mitra and Sujitkumar(1957). Unbiased minimumvariance estimation in a class of discrete distributions.Sankhya, 18, 371-378. Smith, J. H. (1947). Estimationof linear functions of cell proportions.Ann. Math. Statist. 18, 231-254. Wald, A. (1941). Asymptoticallymost powerful tests of statisticalhypotheses. Ann. Math. Statist. 12, 1, 1-19. Wald, A. (1943). Test of statisticalhypotheses concerning several parameters when the numberof observations is large. Trans.Amer. Math. Soc. 54, 426-482.


Wald, A. (1948). Estimation of a parameter when the number of unknown parameters increases indefinitely with the number of observations. Ann. Math. Statist. 19, 220-227.
Wald, A. (1951). Asymptotic minimax solutions of sequential point estimation problems. Proc. 2nd Berkeley Symp. on Math. Statist. Prob. 1-11.
Walker, A. M. (1963). A note on asymptotic efficiency of an asymptotically normal estimator sequence. J. R. Statist. Soc., B, 25, 195-200.

GroupII Bahadur,R. R. (1958). Examplesof inconsistencyof maximumlikelihood estimates.Sankhyd, 20, 207-210. Bahadur,R. R. (1960). On the asymptoticefficiency of tests and estimates.Sankhyd, 22, 229. Barnard, G. A. (1967). The use of the likelihood function in statistical practice.Proc. 5th Berkeley Symp. Math. Statist. Prob. 27-40. Barnard,G. A., Jenkins,G. M. and Winsten, C. B. (1962). Likelihood inferenceand .J. R. Statist. Soc., A, 125, 321-372. Bar-Shalom(1971). Asymptotic properties of maximumlikelihood estimates. J. R. Statist. Soc., B, Vol. 33, No. 1. Barton, D. E. (1968). The solution of stochastic integralrelations for stronglyconsistent estimatorsat an un- known distributionfunction from a sample subject to variablecensoring and truncation. Trab.Estadist. 19, III, 51-73. Basu, D. (1955). An inconsistencyof the method of maximumlikelihood. Ann. Math. Statist. 26, 144. Berkson, J. (1955). Estimationby and by maximumlikelihood. Proc. 3rd BerkeleySymp. Math. Statist. Prob. 1. Birnbaum,A. (1961). GeneralisedM.L. methodswith exactjustifications on two levels. Bull. Inter. Statist. Inst. 38, IV, 457-462. Bradley,R. A. and Grant,J. J. (1962).The asymptoticproperties of ML estimatorswhen samplingfor associated populations.Biometrika, 49, 205-214. Brillinger,D. R. (1962). A note on the rate of convergenceof a mean. Biometrika,49, 574-576. Chanda, K. C. (1954). A note on the consistencyand maxima of the roots of likelihood equations.Biometrika, 41, 56. Daniels, H. E. (1961). The asymptotic efficiency of the maximum likelihood estimator. Proc. 4th Berkeley Symp. Math. Statist. Prob. 1, 151-163. Dharmadhikari,S. W. (1967). A note on a theorem of H. Cramer.Skand. Aktuarietidskr.50, 153-154. Doob, J. L. (1934). Probabilityand statistics. Trans.Amer. Math. Soc. 36, 766-775. Doob, J. L. (1936). Statisticalestimation. Trans.Amer. Math. Soc. 39, 410. Doss, S. A. D. C. (1962). 
A note on consistency and asymptotic efficiencyof maximum likelihood estimates in multi-parametricproblems. Bull. Cal. Statist. Ass. 11, 85. Doss, S. A. D. C. (1963). On consistencyand asymptoticefficiency of maximumlikelihood estimates.J. Indian Soc. Agric. Statist. 15, 232-241. Dugu6, D. (1936a). Sur le maximum de precision des estimation gaussiennesa la limite. ComptesRendus, Paris, 202, 193-196. Dugu6, D. (1936b)Sur le maximumde pr6cisiondes limitesd'estimations. Comptes Rendus, Paris, 202, 452-454. Dugu6, D. (1937). Applicationdes propri6t6sde la limite au sens de calcul des probabilit6sa l'6tudede diverses questionsd'estimation. J. Ecole Polytechnique,3, 305-374. Dutta, M. (1966). On maximum(information-theoretic) entropy estimation. Sankhyd,A, 28, 319-328. Fraser,D. A. S. (1963). On sufficiencyand the exponentialfamily. J. R. Statist. Soc., B, 25, 115-123. Fraser,D. A. S. (1964). Local conditionalsufficiency. J. R. Statist. Soc., B, 26, 52-62. Geary, R. C. (1942). The estimationof many parameters.J. R. Statist. Soc. 105, 213-17. Godambe, V. P. (1960). An optimum propertyof regularmaximum likelihood estimation.Ann. Math. Statist. 31, 1208-1212. Gurland,J. (1954). On regularityconditions for M.L.E.'s Skand. Aktuarietidskrift,37, 71-76. Hanson, M. A. (1965). Irregularityconstrained maximum likelihood estimation. Ann. Inst. Statist. Math., Tokyo, 17, 311-321. Hartley,H. 0. and Rao, J. N. K. (1968). A new estimationtheory for sample surveys.Biometrika, 55, 547-557. Hotelling, M. (1930). The consistencyand ultimatedistribution of optimumstatistics. Trans. Amer. Math. Soc. 32, 847. Huber, P. J. (1967). The behaviourof maximumlikelihood estimatesunder non-standardconditions. Proc. 5th BerkeleySymp. Math. Statist. Prob. 1, 221-233. Huzurbazar,V. S. (1948). The likelihood equation, consistency and the maxima of the likelihood function. Annalsof Eugenics,14, 185-200. Huzurbazar,V. S. (1949). On a propertyof distributionsadmitting sufficient statistics. 
Biometrika, 36, 71. Kale, B. K. (1963). Some remarks on a method of maximum likelihood estimation proposed by Richards. J. R. Satist. Soc., B, 25, 209-212. Kale, B. K. (1966). Approximationsto the maximumlikelihood estimatorusing groupeddata. Biometrika,53, 282-285. Kallianpur,G. (1963). Von Mises functionalsand maximumlikelihood estimation.Sankhyad, A, 25, 148-149. Kaufman, S. (1966). Asymptotic efficiencyof the maximum likelihood estimator. Ann. Inst. Statist. Math., Tokyo,18, 155-178.


Kiefer,J. and Wolfowitz,J. (1956).Consistency of the maximumlikelihood estimator in the presenceof infinitely many incidentalparameters. Ann. Math. Statist. 27, 887-906. Konijn, H. S. (1963). Note on the non-existenceof a maximumlikelihood estimate.Aust. J. Statist. 5, 143-146. Kraft, C. and Le Cam, L. (1956). A remark on the roots of the maximum likelihood equation. Ann. Math. Statist. 27, 1174. Kriz, T. A. and Talacko, J. V. (1968).Equivalence of the M.L.E. to a minimumentropy estimator. Trb. Estadist. 19, I/lI, 55-65. Kulldorf, G. (1957). On the conditions for consistency and asymptotic efficiency of maximum likelihood estimates.Skand. Aktuarietidskrift, 40, 129. Le Cam, L. (1970). On the assumptionsused to prove asymptoticnormality of the M.L.E.'s.Ann. Math. Statist. 41 (3), 802-828. Linnik,Yu V. and Mitroafanova,N. M. (1965).Some asymptotic expansions for the distributionof the maximum likelihood estimate. Sankhyd,A, 27, 73-82. Neyman, J. and Scott, E. L. (1948). Consistent estimates based on partially consistent observations.Econo- metrica,16, 1. Pakshirajan,R. P. (1963).On the solution of the maximumlikelihood equation. J. IndianStatist. Ass. 1, 196-201. Raja Rao, B. (1960). A formula for the curvature of the likelihood surface of a sample drawn from a distributionadmitting sufficient statistics. Biometrika, 47, 203-207. Rao, C. R. (1947). Minimum variance and the estimation of several parameters.Proc. CambridgePhil. Soc. 43, 280-283. Rao, C. R. (1957). Maximumlikelihood estimationfor the multinomialdistribution. Sankhyd, 18, 139-148. Rao, C. R. (1958). Maximumlikelihood estimationfor the multinomial distributionwith an infinite number of cells. Sankhyd,20, 211-218. Rao, C. R. (1961a). Asymptotic efficiencyand limiting information.Proc. 4th BerkeleySymp. Math. Statist. Prob. 531-545. Rao, C. R. (1962a).Apparent anomalies and irregularitiesin maximumlikelihood estimation(with discussion). Sankhyd,A, 24, 73-101. Rao, C. R. 
(1962b).Efficient estimates and optimum inferenceprocedures in large samples.J. R. Statist. Soc., B, 24, 46-72. Rao, C. R. (1963). Criteriaof estimationin large samples. Sankhyd,25, 189-206. *Richards,F. S. G. (1961). A method of M.L.E. J. R. Statist. Soc., B, 23, 469-475. Roussas, G. G. (1965). Extension to Markov processes of a result by A. Wald about the consistency of the maximumlikelihood estimate.Zeit. Wahrscheinlichkeitsth.4, 69-73. Silvey,S. D. (1961).A note on M.L. in the case of dependentrandom variables. J. R. Statist. Soc., B, 23, 444-452. Sprott, D. A. and Kalbfleisch,J. D. (1965). Use of the likelihood function in inference.Psychological Bulletin, 64, 15-22. *Sprott,D. A. and Kalbfleisch,J. D. (1969). Examplesof likelihoodsand comparisonswith point estimatesand large sample approximations.J. Amer. Statist. Ass. 64, 468-484. Wald, A. (1949). Note on the consistencyof maximumlikelihood estimate.Ann. Math. Statist. 20, 595-601. Wedgeman,E. J. (1970). M.L.E. of a unimodal density function. Ann. Math. Statist. 41, (2), 457-471. Weiss, L. (1963). The relativemaxima of the likelihood function. Skand. Aktuartidskr.46, 162-166. Weiss, L. (1966). The relativemaxima of the likelihood function II. Skand. Aktuartidskr.49, 119-121. Weiss, L. (1968). Approximatingmaximum likelihood estimatorsbased on bounded random variables.Naval Res. Logist. Quart.15, 168-177. Weiss, L. and Wolfowitz, J. (1967). Estimationof a density function at a point. Zeit. Wahrscheinlichkeitsth.7, 327-335. Welch, B. L. (1965). On comparisonsbetween confidencepoint proceduresin the case of a single parameter. J. R. Statist. Soc., B, 27, 1-8. Wilks, S. S. (1938). Shortestaverage confidence intervals from large samples. Ann. Math. Statist. 9, 166-175. Wolfowitz, J. (1949). On Wald's proof of the consistency of the maximum likelihood function. Ann. Math. Statist. 20, 601-602. Wolfowitz, J. (1953). The method of maximum likelihood and the Wald theory of decision functions. Proc. Royal Dutch Acad. 
Sciences,Series A, 56, 114-119. Wolfowitz,J. (1965). Asymptoticefficiency of the maximumlikelihood estimator. Teor. VesoyatnostPrimen. 10, 267-281. Zehna, P. W. (1966). Invarianceof maximumlikelihood estimates.Ann. Math. Statist. 37, 744. GroupIII Alma, D. (1965). An efficiencycomparison between two differentestimators for the fractionp of a normal distributionfalling below a fixed limit (in Dutch). Statist. Neerlandica,19, 81-91. Anderson, T. W. (1957). Maximum likelihood estimates for a multivariatenormal distribution when some observationsare missing.J. Amer. Statist. Ass. 52, 200-203. Anscombe, F. J. (1964). Normal likelihood functions. Ann. Inst. Statist. Math. 26, 1-19. Arato, M. (1962). Some remarksto the notion of I-.Magy. Tud.Akad. III Oszt. Kozl. 12, 325-327. *Arato, M. (1968). Unbiased estimates of the complex stationary Gauss-Markovprocess. Approximatedis- tributionfunctions. Studia Sci. Math. Hung. 3, 153-158.


**Barnett,V. D. (1966). Evaluationof the maximumlikelihood estimator where the likelihood equation has multipleroots. Biometrika,53, 151-166. Barnett, V. D. (1967). A note on linear structuralrelationships when both residual are known. Biometrika,54, 670-672. Batholomew,D. J. (1957). A problem in life testing. J. Amer. Statist. Ass. 52, 350-357. Berger, Agnes and Gold, Ruth Z. (1967). On estimating recessive frequencies from truncated samples. Biometrics,23, 356-360. *Bhattacharya,N. (1959). An extension to Hold's table for the one-sided censored . Sankhyd,21, 377-380. Blischke, W. R., Glinski, A. M., Johns, M. V., Mundle, P. B. and Truelove, A. J. (1965). On non-regular estimation,minimum variance bounds and the Pearson Type III distribution.Arl. Tech. Rep. Aerospace Res. Lab. Wright-PattersonA.F. Base No. 65-177. *Blischke,W. R., Mundle, P. B., Johns, M. V. and Truelove,A. J. (1968a).Further results on estimation of the parametersof the Pearson Type III and Weibull distributionsin the non-regularcase. Arl. Tech. Rep. Aerospace Res. Lab. Wright-PattersonA.F. Base. *Bowman,K. O. and Shenton,L. R. (1970).Properties of M.L.E. for the parameterof the log seriesdistribution. RandomCount in Scientific Work,1, 127-150. Brillinger,D. R. (1964).The asymptoticbehaviour of Tukey'sgeneral method of setting approximateconfidence limits ("The Jacknife")when applied to M.L.E. Rev. Int. Statist. Inst. 32, 202-206. Brunk, H. D. (1955). Maximumlikelihood estimates of monotone parameters.Ann. Math. Statist. 26, 607-616. Brunk, H. D., Franck, W. E., Hanson, D. L. and Hogg, R. V. (1966). Maximumlikelihood estimationof the distributionof two stochasticallyordered random variables.J. Amer. Statist. Ass. 61, 1067-1080. Brunk, H. D., Ewiing, G. M. and Reid, W. T. (1954). The minimum of a certain definite integral suggested by the M.L.E. Bull. Amer.Math. Soc. 60, 535. Cannings,C. J. and Smith,C. A. B. 
(1966).A note on the estimationof gene frequenciesby maximumlikelihood in systems similarto Rhesus. Ann. Hum. Genet.29, 277-279. Chan, L. K. (1967). Remarkon the linearisedM.L.E. Ann. Math. Statist. 38, 1876-1881. Chapman,D. G. and Douglas, G. (1956). Estimatingthe parametersof a truncatedgamma distribution.Ann. Math. Statist. 27, 498-506. Chernoff,H. L. and Lehmann,E. L. (1954). The use of the maximumlikelihood in the X2 test for goodness of fit. Ann. Math. Statist. 25, 579-586. *Choi, S. C. and Wette, R. (1969). M.L.E. of the parametersof the gamma distribution and their bias. Technometrics,11, 683-690. Clutton-brock,M. (1965). Using the observationto estimate the prior distribution.J. R. Statist. Soc., B, 27, 17-27. Cohen, A. C. (1950a).Estimating the mean and varianceof normalpopulations from singlytruncated and doubly truncatedsamples. Ann. Math. Statist. 21, 557-569. Cohen, A. C. (1950b).Estimating parameters of PearsonType III populationsfrom truncatedsamples. J. Amer. Statist. Ass. 45, 411-423. Cohen, A. C. (1959). Simplifiedestimators for the normal distributionswhen samples are singly censored or truncated.Technometric, 1, 217-237. Cohen, A. C. (1960a). Estimatingthe parameterin a conditionalPoisson distribution.Biometrics, 16, 203-211. Cohen, A. C. (1960b). Estimatingthe parametersof a modified Poisson distribution.J. Amer. Statist. Assn. 55, 139-143. Cohen, A. C. (1960c). Estimationin the truncatedPoisson distributionwhen zeros and some ones are missing. J. Amer. Statist. Ass. 55, 342-348. Cohen, A. C. (1960d). Simplifiedestimators for the normal distributionwhen samples are singly censored or truncated.Bull. Int. Statist. Inst. 37, III, 251-269. Cohen, A. C. (1960e). Estimation in the Poisson distributionwhen sample values of C+ 1 are sometimes erroneouslyreported as C. Ann. Inst. Statist. Math. Tokyo, 11, 189-193. *Cohen, A. C. (1965). M.L.E. in the Weibull distributionbased on complete and on censored samples. Tech- nometrics,7, 579-588. Curnow,R. N. 
(1961).The estimationof repeatabilityand heritabilityfrom recordssubject to culling.Biometrics, 17, 553-566. *Das Gupta, P. (1964). On the estimation of the total number of events and of the probabilitiesof deleting an event from informationsupplied to severalagencies. Bull. CalcuttaStatist. Ass. 13, 89-100. **David, F. N. and Johnson, N. L. (1952). The truncatedPoisson. Biometrics,8, 175-185. *Davidson, R. R. and Bradley, R. A. (1969). Multivariatepaired comparisons:the extension of a univariate model and associatedestimation and test procedures.Biometrika, 46, 81-95. *Day, N. E. (1969). Estimatingthe componentsof a mixtureof normal populations.Biometrika, 56, 463-474. *Day, N. E. and Kerridge,D. F. (1967). A generalmaximum likelihood discriminant.Biometrics, 23, 313-325. *Dorfman, D. D. and Alf, E. Jr. (1968). Maximum likelihood estimation of parametersof . Psychometrika,33, 117-124. Dorogovtzev,A. I. (1959). Confidenceintervals in estimationof parameters.Dopov. Acad. Nank Ukrain.S.S.R. 355-358. Dubey, S. D. (1963). On some statisticalinferences for Weibull laws. J. Amer. Statist. Ass. 58, 549.


Dubey, S. D. (1967). Monte Carlo study of the and maximumlikelihood estimatorsof Weibull para- meters. Trab.Estadist. 18, II/III, 131-141. Edgett, G. (1956). Multiple regressionwith missing observationsamong the independentvariables. J. Amer. Statist. Ass. 51, 122-31. Feigl, Polly and Zelen, M. (1965).Estimation of exponentialsurvival with concomitantinformation. Biometrics,21, 826-838. Ferguson, T. (1958). A method of generatingBAN estimates with application to the estimation of bacterial densities.Ann. Math. Statist. 29, 1046-1062. **Fields,R. I., Kramer,C. Y. and Clunies-Ross,C. W. (1962).Joint estimationof the parametersof two normal populations.J. Amer. Statist. Ass. 57, 446-454. **Finney,D. J. (1949). The truncatedbinomial distribution.Ann. Eugen. 14, 319-328. **Finney, D. J. and Varley, G. C. (1955). An example of the truncatedPoisson distribution.Biometrics, 11, 387-394. Fisher, F. M. (1967). Approximatespecification and the choice of a k-class estimator.J. Amer. Statist. Ass. 62, 1265-1276. **Fisk,P. R. (1961).Estimation of location and scale parametersin a truncatedgrouped sech square distribution. J. Amer. Statist. Ass. 56, 692-702. Frome, E. L. and Beauchamp, J. J. (1968). Maximum likelihood estimation of survival curve parameters. Biometrics,24, 595-605. *Gajjar,A. V. and Khatri, C. G. (1969). Progressivelycensored samples from log-normal and logistic dis- tributions.Technometrics, 11, 793-803. Ghosh, J. K. and Rajinder Singh (1966). Unbiased estimationof location and scale parameters.Ann. Math. Statist. 37, 1671-1675. *Glaser, M. (1967). Exponentialsurvival with .J. Amer. Statist. Ass. 62, 561-68. *Gnadadesikan,R., Pinkham, R. S. and Hughes, Laura P. (1967). M.L.E. of the parametersof the beta dis- tributionfrom smallestorder statistics. Technometrics,9, 607-620. *Greenwood,J. and Durand,D. (1960).Aids for fittingthe gammadistribution by M.L. Technometrics,2, 55-65. Hald, A. (1949). M.L.E. 
of the parametersof a normal distributionwhich is truncated at a known point. Skand. Aktuariedskrift.32, 119-134. **Haldane,J. B. S. (1941). The fitting of the Binomial distribution.Ann. Eng. 11, 179-181. Halperin, M. (1952). M.L.E. in truncationsamples. Ann. Math. Statist. 23, 226-238. Hannan, J. (1960). Consistency of M.L.E. of discrete distributions.Contributions to Prob. and Statist., pp. 249-257. StanfordU.P. Hannan, J. F. and Tate, R. F. (1965). Estimationof the parametersof a multivariatenormal distributionwhen one variableis dichotomised.Biometrika, 52, 664-668. *Harter,H. L. (1966). Asymptoticvariances and covariancesof maximumlikelihood estimators,from censored samples, of the parametersof a four parametergeneralised gamma population. Arl. Tech.Rep. Aerospace Res. Lab. Wright-PattersonA.F. Base, No. 66-0158, iv+ 147 pages. *Harter, H. L. (1967). Maximum likelihood estimation of the parametersof a four parametergeneralised gamma populationfrom complete and censoredexamples. Technometrics, 9, 159-165. *Harter,H. L. and Moore, A. H. (1965). Maximum likelihood estimation of the parametersof gamma and Weibull populationsfrom complete and censoredsamples. Technometrics,7, 639-643. Harter,H. L. and Moore, A. H. (1966a).Iterative maximum likelihood estimationof the parametersof normal populationsfrom singly and doubly censoredsamples. Biometrika,53, 205-213. *Harter,H. L. and Moore, A. H. (1966b). Local maximum likelihood estimation of the parametersof three parameterlognormal populations from complete and censoredsamples. J. Amer.Statist. Ass. 61, 842-851. Harter,H. L. and Moore,A. H. (1967a).Asymptotic variances and of maximumlikelihood estimators, fromcensored samples, of the parametersof Weibulland gamma populations. Ann. Math. Statist. 38, 557-571. Harter, H. L. and Moore, A. H. (1967b). A note on estimation from a type I extreme-valuedistribution. Technometrics,9, 325-331. Harter, H. L. and Moore, A. H. (1967c). 
Maximum likelihood estimation, from censored samples, of the parameters of a logistic distribution. J. Amer. Statist. Ass. 62, 675-684.
*Harter, H. L. and Moore, A. H. (1968a). M.L.E. from doubly censored samples, of the parameters of the first asymptotic distribution of extreme values. J. Amer. Statist. Ass. 63, 889-901.
*Harter, H. L. and Moore, A. H. (1968b). Conditional M.L.E. from singly censored samples, of the scale parameters of type II extreme-value distributions. Technometrics, 10, 349-359.
Hartley, H. O. and Rao, J. N. K. (1967). M.L.E. for the mixed analysis of variance model. Biometrika, 54, 93-108.
Hasselblad, V. (1969). Estimation of finite mixtures of distributions from the exponential family. J. Amer. Statist. Ass. 64, 1459-1471.
Hildreth, C. (1969). Asymptotic distribution of M.L.E.'s in a linear model with autoregression disturbances. Ann. Math. Statist. 40 (2), 583-594.
Hill, B. M. (1963). The three parameter lognormal distribution and Bayesian analysis of a point source epidemic. J. Amer. Statist. Ass. 58, 72-84.
*Hocking, R. R. and Smith, W. B. (1968). Estimation of parameters in the multivariate normal distribution with missing observations. J. Amer. Statist. Ass. 63, 159-173.


*Holgate, P. (1964). Estimationfor the bivariatePoisson distribution.Biometrika, 51, 241-245. *Irwin, J. 0. (1959). On the estimation of the mean of a Poisson distributionfrom a sample with the zero class missing. Biometrics,15, 324-326. *Isida, M. and Tagami,S. (1959).The bias and precisionin the M.L.E. of the parametersof normalpopulations from singly truncatedsamples. Rep. Statist. Appl. Res. (JUSE), 6, 105-110. Iyer, P. V. K. and Sing, N. (1962). Estimation of parametersfrom generalisedcensored normal samples. J. IndianSoc. Agric. Statist. 14, 165-176. *Jacquez,J. A., Mather, F. J. and Crawford,C. R. (1968). Linear regressionwith non-constant, unknown errorvariances, sampling with least squares,weighted least squaresand maximumlikelihood estimators.Biometrics, 24, 607-626. Jogdeo, K. (1967). Monotone convergenceof binomialprobabilities with an applicationto maximumlikelihood estimation.Ann. Math. Statist. 38, 1583-1586. Jolly, G. M. (1963). Estimates of population parametersfrom multiple recapturedata with both death and dilution - deterministicmodel. Biometrika,50, 113-128. Kale, B. K. (1963). M.L.E. for a truncatedexponential family. J. IndianStatist. Ass. 1, 86-90. Katti, S. K. and Gurland,J. (1962). Efficiencyof certain methods of estimationfor the negative binomial and the Neyman type A distribution.Biometrika, 49, 215-226. Khan, R. A. (1969). M.L. estimationin sequentialexperiments. Sankhyd, A, 31, 49-56. **Khatri, C. G. (1962a). A simplifiedmethod of fitting the doubly and singly truncated negative binomial distribution.J. UniversityBaroda, 11, 35-38. **Khatri, C. G. (1962b). A method of estimating approximatelythe parametersof a certain class of dis- tributions.Ann. Inst. Statist. Math., Tokyo, 14, 57-62. *Khatri,C. G. (1963). Joint estimationof the parametersof multivariatenormal populations.J. IndianStatist. Ass. 1, 125-133. Khatri,C. G. and Jaiswall,M. C. (1963). Estimationof parametersof a truncatedbivariate normal distribution. J. Amer. Statist. 
Ass. 58, 519-526. Klotz, J. and Putter, J. (1969). M.L.E. of multivariatecovariance components for the balanced one-way lay-out. Ann. Math. Statist. 40 (3), 1100-1105. Kristof, W. (1963). The statisticaltheory of stepped-upreliability coefficients when a test has been divided into severalequivalent parts. Psychometrika,28, 221-238. Kulldorf, G. (1959). A problem of M.L.E. from a grouped sample. Metrika, 2, 94-99. Lee, T. C., Judge, G. G. and Zellner, A. (1968). M.L. and Bayesian estimation of transition probabilities. J. Amer. Statist. Ass. 63, 1162-1179. Lejeune,J. (1958). On an "a priori" solution of Haldane's "a posteriori"method. Biometrics,14, 513-520. Lellough,J. and Wambersie,A. (1966). Maximumlikelihood estimation of survivalcurves of irradiatedmicro- organisms(in French).Biometrics, 22, 673-683. Li, C. C. and Mantel,N. (1968). A simple method of estimatingthe segregationratio undercomplete ascertain- ment. Amer.J. Hum. Genet.20, 61-81. Lloyd, D. E. (1959). Note on a problemof estimation.Biometrika, 46, 231-235. Lord, F. M. (1955a). Estimationof parametersfrom incompletedata. J. Amer. Statist. Ass. 50, 870-876. Lord, F. M. (1955b).Equating test scores a M.L. solution. Psychometrika,20, 193-200. Malik, H. J. (1968). Estimationof the parametersof the power function population. Metron, 27, 196-203. Mantel, N. (1967). Adaption of Karber's method for estimating the exponential parameter from quantal , and its relationshipto birth, death and branchingprocesses. Biometrics, 23, 739-746. Maritz, J. S. (1966). Smooth empiricalBayes estimationfor one-parameterdiscrete distributions. Biometrika, 53, 417-429. Marshall, A. W. and Proschan, F. (1965). Maximum likelihood estimation for distributionswith monotone . Ann. Math. Statist. 36, 69-77. Matthai, A. (1951). Estimate of parametersfrom incomplete data with applications to design of sample surveys.Sankhyd, 11, 145-152. Mead, R. (1967). A mathematicalmodel for the estimationof interplantcompetition. 
Biometrics, 23, 189-206. *Mendenhall,W. and Hader, R. J. (1958). Estimationof parametersof mixed exponentiallydistributed failure- time distributionfrom censoredlife test data. Biometrika,45, 504-520. Mitchell, Ann F. S. (1968). Exponentialregression with correlatedobservations. Biometrika, 55, 149-162. Moore, A. H. and Harter,H. L. (1969). Conditional M.L.E. from singly censoredsamples, of the shape para- meters of Pareto and limited distributions.IEEE Trans.Rel. R-18, 76-78. Morgan, R. W. (1965). The estimationof parametersfrom the spread of a disease by consideringhouseholds of two. Biometrika,52, 271-274. Mukherji,V. (1967). A note on maximumlikelihood - a generalisation.Sankhyd, A, 29, 105-106. *Mustafi,C. K. (1963). Estimationof parametersof the extremevalue distributionwith limitedtype of primary probabilitydistribution. Bull. CalcuttaStatist. Ass. 12, 47-54. Nelson, A. C. Jr., Williams,J. S. and Fletcher,N. T. (1963). Estimationof the probabilityof defectionfailure from destructivetests. Technometrics,5, 459-468. Ohlsen, Sally (1964). On estimatingedpidemic parameters from household data. Biometrika,51, 511-512. Pal, L. and Kiddi Eva. (1965). On some statisticalproblems connected with the measuringof the total cross section of thermalneutrons. Pub. Res. Inst. Physics HungarianAcad. Sci. 11, 3-19.


*Parker,R. A. (1963). On the estimationof populationsize, mortalityand recruitment.Biometrics, 19, 318-328. *Patil, G. P. (1962). M.L.E. for generalisedpower series distributionand its applicationto truncatedbinomial distribution.Biometrika, 49, 227-238. Paulik, G. J. (1963). Estimatesof mortalityrates from tag recoveries.Biometrics, 19, 28-57. *Pielou, E. C. (1963). The distributionof diseased trees with respect to healthy ones in a patchily infected forest. Biometrics,19, 450-459. Pike, M. C. (1966). A suggested method of analysis of a certain class of experimentsin carcinogenesis.Bio- metrics,22, 142-161. PrakasaRao, B. L. S. (1968).Estimation of the location of the cusp of a continuousdensity. Ann. Math. Statist. 39, 76-87. PrakasaRao, B. L. S. (1969). Estimationof unimodal density. Sankhyd,A, 31, 23-36. Press, S. J. (1968). Estimatingfrom mis-classifieddata. J. Amer. Statist. Ass. 63, 123-133. Prince, B. M. and Tate, R. F. (1966). Accuracyof M.L.E.'s of correlationfor a biserialmodel. Psychometrika, 31, 85-92. Quandt, R. E. (1966). Old and new methods of estimation and the Pareto distribution.Metrika, 10, 55-82. Rao, C. R. (1948). Large sample tests of statisticalhypotheses concerning several parameters with applications to problemsof estimation.Proc. CambridgePhil. Soc. 44, 50-57. Remage, R. Jr. and Thompson,W. A. Jr. (1966). Maximumlikelihood paired comparison . Biometrika, 53, 143-149. Riffenburgh,R. H. (1966). On growth parameterestimation for early life stages. Biometrics,22, 162-178. *Rutemiller,H. C. (1967). Estimationof the probabilityof zero failuresin m binomial trials. J. Amer. Statist. Ass. 62, 272-277. Samuel, Ester (1968). Sequential maximum likelihood estimation of the size of a population. Ann. Math. Statist. 39, 1057-1068. Samuel, Ester (1969). Comparisonfor sequentialrules for estimation of the size of a population. Biometrics, 25, 517-527. *Saw, J. G. 
(1961).The bias of the M.L.E.'sof the location and scale parametersgiven a type II censoredsample. Biometrika,48, 448-451. *Saw, J. G. (1962). Efficientmoment estimators when the variables are dependent with special referenceto type II censored normal data. Biometrika,49, 155-161. Seber, G. A. F. and Le Cren, E. D. (1967). Estimatingpopulation parametersfrom catches large relative to the population.J. AnimalEcol. 36, 631-643. Shah, S. M. and Jaiswall,M. C. (1966). Estimationof parametersof doubly truncatednormal distribution from first four sample moments. Ann. Inst. Statist. Math., Tokyo, 18, 107-111. Shenton, L. R. (1958). Moment estimatorsand M.L. Biometrika,45, 411-420. Shenton, L. R. (1963). A note on bounds for the asymptotic sampling variance of the maximum likelihood estimatorof a parameterin the negativebinomial distribution. Ann. Inst. Statist. Math. Tokyo,15, 145-151. Shenton, L. R. and Wallington,P. A. (1962). The bias of momentestimators with an applicationto the negative binomial distribution.Biometrika, 49, 193-204. Sheps, M. C. (1966). Characteristicsof a ratio used to estimate failure rates: occurrenceper person year of exposure.Biometrics, 22, 310-321. Sibuya, M. (1963). Randomisedunbiased estimation of restrictedparameters. Ann. Inst. Statist. Math., Tokyo, 15, 61-66. Singh, S. N. (1963). A note on the inflated Poisson distribution.J. IndianStatist. Ass. 1, 140-144. Skibinsky,M. and Cote, L. (1963). On the inadmissabilityof some standardestimates in the presenceof prior information.Ann. Math. Statist. 34, 539-548. *Smith, B. W. (1966). Parameterestimation in the presenceof measurementnoise. Int. J. Control,3, 297-312. Smith, W. E. (1966). An "a posteriori"probability method for solving an over-determinedsystem of equations. Technometrics,8, 675-686. Sprott, D. A. (1965a). Statisticalestimation - some approachesand controversies.Statist. Hefte. 6, 97-111. Sprott, D. A. (1965b). A class of contagious distributions;and maximum likelihood estimation. 
Sankyd, A, 27, 369-382. Sverdrup,E. (1965). Estimatesand test proceduresin connection with stochasticmodels for deaths, recoveries and transfersbetween different states of health. Skand.Aktuarietidskr. 48, 184-211. *Swamy,P. S. (1960). Estimatingthe mean varianceof a normal distributionfrom singly and doubly truncated samples of grouped observations.Bull. CalcuttaStatist. Ass. 9, 145-156. *Swamy,P. S. and Doss, S. A. D. (1961). On a problemof Bartholomewin life testing.Sankhyd, A, 23, 225-230. *Swan,A. V. (1969).Computing M.L.E.'s for parametersof the normaldistribution from groupedand censored data. Appl. Statist. 18, 65-69. *Tallis, G. M. and Light, R. (1968). The use of fractionalmoments for estimatingthe parametersof a mixed exponentialdistribution. Technometrics, 10, 161-175. *Tallis, G. M. and Young, S. S. Y. (1962). M.L.E. of parametersof the normal, log normal, truncatednormal and bivariatenormal distributionfrom . Aust. J. Statist. 4, 49-54. Teicher, H. (1961). Maximumlikelihood characterisationof distributions.Ann. Math. Statist. 32, 1214-1222. Thomason,R. L. and Kapadia,C. H. (1968). On estimatingthe parameterof a truncatedgeometric distribution. Ann. Inst. Statist. Math., Tokyo, 20, 519-523.


*Thoni, H. (1969). A table for estimating the mean of a lognormal distribution.J. Amer. Statist. Ass. 64, 632-636. Tiku, M. L. (1967a). Estimatingthe mean and standarddeviation from a censorednormal sample.Biometrika, 54, 155-165. Tiku, M. L. (1967b). A note on estimatingthe location and scale parametersof the exponential distribution from a censoredsample. Aust. J. Statist. 9, 49-54. *Tiku, M. L. (1968). Estimatingthe mean and standarddeviation from progressivelycensored normal samples. J. IndianSoc. Agric. Statist. 20, 1, 20-25. Trawinski,Irene M. and Bargmann,R. E. (1964). Maximumlikelihood estimationwith incompletemultivariate data. Ann. Math. Statist. 35, 647-657. Villegas, C. (1961). Maximumlikelihood estimationof a linear functional relationship.Ann. Math. Statist. 32, 1048-1062. *Villegas,C. (1963). On the L.S.E. of a linearrelation. Publ. Inst. Mat. Estadist.Fac. Ingen.Agric., Montivideo, 3, 189-204. Wald, A. (1948).Asymptotic properties of the M.L.E. of an unknownparameter of a discretestochastic process. Ann. Math. Statist. 19, 40-46. Watson, G. S. (1964). A note on maximumlikelihood. Sankhyd,A, 26, 303-304. Weibull, C. (1963a). Maximum likelihood estimationfrom truncated,censored and grouped samples. Skand. Aktuartiedskr.46, 70-77. **Wilk, M. B., Gnanadesikan,R. and Hyett, Marilyn J. (1962). Estimation of parametersof the gamma distributionusing order statistics.Biometrika, 49, 525-545. Wilk, M. B., Gnanadesikan,R. and Hyett, Marilyn J. (1963). Separate M.L.E. of scale or shape parameters of the gamma distributionusing order statistics.Biometrika, 50, 217-221. Wilk, M. B., Gnanadesikan,R. and Laugh, Elizabeth (1966). estimation from the order statisticsof unequalgamma components.Ann. Math. Statist. 37, 152-176. Wilks, S. S. (1932). Moments and distributionsof estimating of population parametersfrom fragmentary samples. Ann. Math. Statist. 3, 163-195. Wyshak, G. and White, C. (1968). Estimationof parametersof a truncatedPoissonian binomial. 
Biometrics, 24, 377-388. Zacks, S. (1966). Sequentialestimation of the mean of a log-normaldistribution having a prescribedproportional closeness. Ann. Math. Statist. 37, 1688-1696. Zacks, S. and Even, M. (1966a).The efficienciesin small samplesof the maximumlikelihood and best unbiased estimationof reliabilityfunction. J. Amer. Statist. Ass. 61, 1033. Zacks, S. and Even, M. (1966b).Minimum variance unbiased and M.L.E.'s of reliabilityfunctions for systems in series and in parallel.J. Amer. Statist. Ass. 61, 1052. Zippin, C. and Armitage,P. (1966). Use of concomitant variablesand incompletesurvival information in the estimationof an exponentialsurvival parameter. Biometrics, 22, 665-672.

GroupIV *Amemiya, T. and Fuller, W. A. (1967). A comparativestudy of alternativeestimators in a distributedlog model. Econometrica,35, 509-529. *Arnold, B. C. (1968). Parameterestimation for a multivariateexponential distribution. J. Amer. Statist. Ass. 63, 848-852. *Bain, L. J. and Antle, C. E. (1967). Estimation of parametersin the Weibull distribution.Technometrics, 9, 621-627. Berkson,J. (1955). M.L. and minimumx2 estimatesof the logistic function.J. Amer.Statist. Ass. 50, 130-162. Berkson, J. (1960). Problemsrecently discussedregarding estimating the logistic curve. Bull. Int. Statist. Inst. 37, III, 207-211. Berkson, J. and Elveback,Lillian (1960). Computingexponential risks with particularreference to the study of smoking and lung cancer.J. Amer. Statist. Ass. 55, 415-428. Birnbaum,A. (1961). A unifiedtheory of estimation.Ann. Math. Statist. 32, 112-135. Birnbaum,A. (1962). Intrinsicconfidence methods. Bull. Int. Statist. Inst. 39, II, 376-383. *Blischke, W. R. (1964). Estimating the parametersof mixtures of binomial distributions.J. Amer. Statist. Ass. 59, 510-528. *Blischke,W. R., Mundle, P. B. and Johns, P. B. (1968b). Further results on estimation of the parameters of the Pearson type III and Weibull distributionsin the non-regularcase. Arl. Tech.Rep. Aerospace Res. Lab. Wright-PattersonA.F. Base, No. 68-0207. *Blumenthal,S. and Cohen, A. (1968). Estimationof the larger of two normal means. J. Amer. Statist. Ass. 63, 861-876. Chakraborty,P. N. (1963).On a method of estimatingbirth and deathrates from severalagencies. Bull. Calcutta Statist. Ass. 12, 106-112. *Cohen, A. (1967). Estimationin mixturesof two normal distributions.Technometrics, 9, 15-28. *Cornell,R. G. and Speckman,J. A. (1967).Estimation for a simpleexponential model. Biometrics,23, 717-737. *Cragg, J. G. (1967). On the relative small-sample properties of several structural-equationestimations. Econometrica,35, 89-110.


Doss, S. A. D. (1962a). On the efficiencyof BAN estimatesof the parametersof normalpopulations based on singly censoredsamples. Biometrika, 49, 570-573. Doss, S. A. D. (1962b). On uniquenessand maxima of the roots of likelihood equations under truncatedand censored samplingfrom normal populations.Sankhyd, A, 24, 355-362. Dubey, S. D. (1965a). Asymptotic propertiesof several estimators of Weibull parameters.Technometrics, 7, 423-434. Dubey, S. D. (1965b). M.L. and Bayes solutions for the true score. J. IndianStatist. Ass. 3, 65-81. Dubey, S. D. (1966). Comparativeperformance of several estimatorsof Weibull parameters.Ann. Tech. Conf. Trans.,Amer. Soc. Qual. Control,20, 723-735. Epstein, B. (1961). Estimatesof bounded relativeerror for the mean life of an exponentialdistribution. Tech- nometrics,1, 107-109. Fraser, D. A. S. (1964). On local unbiasedestimation. J. R. Statist. Soc., B, 26, 46-51. Ghirtis, G. C. (1967). Some problems of statisticalinference relating to the double-gammadistribution. Trab. Estadist.18, II/III, 67-87. Glasser, G. J. (1962). Minimum variance unbiased estimators for Poisson probabilities. Technometrics,4, 409-418. *Graybill, F. A. and Cornell, T. L. (1964). Sample size requiredto estimate the parameterin the uniform density within d units of the true value. J. Amer. Statist. Ass. 59, 550-556. Gumbell, E. J. (1965). A quick estimationof the parametersin Frechetsdistributions. Rev. Int. Statist. Inst. 33, 349-363. Haldane, J. B. S. (1951). A class of efficientestimates of a parameter.Bull. Int. Statist. Inst. 33, 231. *Haldane, J. B. S. and Smith, C. A. B. (1947). A new estimate of the linkage between the genes for colour- blindnessand haemophiliain man. Ann. of Eugenics,14, 10-31. Halperin, M. (1961). Almost linearly-optimumcombination of unbiased estimates.J. Amer. Statist. Ass. 56, 36-43. Harter, H. L. (1968). The use of order statisticsin estimation. Oper.Res. 16, 783-798. Huber, P. J. (1964). Robust estimation of a .Ann. Math. Statist. 
35, 73-101. Huber, P. J. (1968). Robust estimation. Math. CentreTracts. Selected StatisticalPapers, 27, 3-25. Jaiswal, M. C. and Khatri, C. G. (1967). Estimation of parametersfor the selected samples from bivariate normal populations.Metron, 26, 326-333. Jorgenson,D. W. (1961). Multiple regressionanalysis of a Poisson process.J. Amer.Statist. Ass. 56, 235-245. Katti, S. K. (1962). Use of some "a priori" knowledge in the estimation of means from double samples. Biometrics,18, 139-147. Katti, S. K. and Gurland, J. (1962). Some methods of estimation for the Poisson binomial distribution. Biometrics,18, 42-51. Kharshikar,A. V. (1968). On informationfrom folded distributions.J. IndianSoc. Agric. Statist. 20, I, 49-54. Lindley, D. V. and El-Sayyad, G. M. (1968). The Bayesian estimation of a linear functional relationship. J. R. Statist. Soc., B, 30, 190-202. Miller, R. G. (1960). Early failures in life testing. J. Amer. Statist. Ass. 55, 491-502. *Molenaar, W. (1965). Survey of estimation methods for a mixture of two normal distributions.Statist. Neerlandica,19, 249-263. Parzen,E. (1962). On estimationof a probabilitydensity function and . Ann.Math. Statist. 33, 1065-1076. Posner, E. C., Rodemich, E. R., Ashlock, J. C. and Lurie, Sandra(1969). Application of an estimatorof high efficiencyin bivariateextreme-value theory. J. Amer.Statist. Ass. 64, 1403-1414. Reiersol, 0. (1961). Linear and non-linearmultiple comparisons in logit analysis. Biometrika,48, 359-365. Roberts, H. V. (1967). Informationstopping rules and inferenceabout population size. J. Amer. Statist. Ass. 62, 763-775. Rothenberg,T. J., Fisher, F. M. and Tilanus, C. B. (1964). A note on estimationfrom a Cauchy sample. J. Amer. Statist. Ass. 59, 460-463. Saw, J. G. (1961). Estimationof the normal populationparameters given a type I censoredsample. Biometrika, 48, 367-377. Silverstone,H. (1957). Estimatingthe logistic curve. J. Amer. Statist. Ass. 52, 567. Smith, J. H. (1947). 
Estimationof linearfunctions of cell proportions.Ann. Math. Statist. 18, 231-254. Snoeck, N. (1961). The construction of asymptoticallyexhaustive estimators. C. R. Acad. Sci. Paris, 253, 777-779. *Soong, T. T. (1969). An extension of the moment method in statisticalestimation. Siam. J. Appl. Math. 17, 560-568. Stormer,H. (1961). On a test for the parametersof a life distribution.Metrika, 4, 63-77. Theil, H. and Schweitzer,A. (1961). The best quadraticestimator of the residualvariance in regressionanalysis. Statist. Neerlandica,15, 19-23. *Tiku, M. L. (1968a). Estimatingthe parametersof log-normaldistributions from censored samples.J. Amer. Statist. Ass. 63, 134-140. Tiku, M. L. (1968b). Estimating the parametersof normal and logistic distributionsfrom censored samples. Aust. J. Statist. 10, 64-74. Varde, S. D. (1969). Life testing and reliabilityestimation for the two parameterexponential distribution. J. Amer. Statist. Ass. 64, 621-631.


Books
Cramer, H. (1946). Mathematical Methods of Statistics. Princeton Univ. Press.
Feller, W. (1966). An Introduction to Probability Theory and its Applications, Volume 2. New York, Wiley.
Kendall, M. G. and Stuart, A. (1961). The Advanced Theory of Statistics, Volume 2. London, Griffin.

Additions to Bibliography

Group II
Anderson, E. B. (1970). Asymptotic properties of conditional maximum likelihood estimators. J. R. Statist. Soc., B, 32, 283-301.
Barton, D. E. (1956). A class of distributions for which the maximum likelihood estimator is unbiased and of maximum variance for all sample sizes. Biometrika, 43, 200-202.
Bodin, N. A. (1968). On the theory of grouped samples. Dokl. Akad. Nauk SSSR, 178, 17-20; Soviet Math. Dokl. 9, 10-13.
Box, M. J. (1971). Bias in nonlinear estimation. J. R. Statist. Soc., B, 33, 171-201.
Herman, R. J. and Patel, R. K. N. (1971). Maximum likelihood estimation for multi-risk model. Technometrics, 13, 385-396.
Hocking, R. R. and Oxspring, H. H. (1971). Maximum likelihood estimation with incomplete multinomial data. J. Amer. Statist. Ass. 66, 65-70.
Kalbfleisch, J. D. and Sprott, D. A. (1970). Application of likelihood methods to models involving large numbers of parameters. J. R. Statist. Soc., B, 32, 175-208.
Kale, B. K. (1970). Inadmissibility of the maximum likelihood estimator in the presence of prior information. Canad. Math. Bull. 13, 391-393.

Group III
Antle, C., Klimko, L. and Harkness, W. (1970). Confidence intervals for the parameters of the logistic distribution. Biometrika, 57, 397-402.
Armitage, P. (1970). The combination of assay results. Biometrika, 57, 665-666.
**Bairagi, R. (1969). Estimation of the covariance between two sets of values of a variable given at two periods. Bull. Inst. Statist. Res. Tv 3, 42-47.
Blight, B. J. N. (1970). Estimation from a censored sample for the exponential family. Biometrika, 57, 389-395.
Bowman, K. and Shenton, L. R. (1965). AEC Research and Development Report K1633, Union Carbide Corporation.
Bowman, K. and Shenton, L. R. (1968). Properties of estimates for the gamma distribution. Statist. Abstr. 10, 401.
Bowman, K. and Shenton, L. R. (1969). Remarks on maximum likelihood estimates for the gamma distribution. ICQC 69 - Tokyo, TS 10-05, 519-522.
Bowman, K. and Shenton, L. R. (1970). Small sample properties of estimators for the gamma distribution. Union Carbide Corporation, Oak Ridge, Tennessee, Report No. CTC-28.
Easterling, R. G. and Prairie, R. R. (1971). Combining component and system information. Technometrics, 13, 271-280.
Fryer, J. G. and Holt, D. (1970). On the robustness of the standard estimates of the exponential mean to contamination. Biometrika, 57, 641-648.
Garg, M. L., Lao, B. R. and Redmond, Carol K. (1970). Maximum likelihood estimation of the parameters of the Gompertz survival function. Appl. Statist. 19, 152-159.
Haldane, J. B. S. (1953). The estimation of two parameters from a sample. Sankhya, 12, 313-320.
Haldane, J. B. S. and Smith, S. M. (1956). The sampling distribution of a maximum likelihood estimate. Biometrika, 43, 96-103.
Hartley, H. O. and Rao, J. H. K. (1969). A new estimation theory for sample surveys, II. New Developments in Survey Sampling, 147-169.
Heiny, R. L. and Siddiqui, M. M. (1970). Estimation of the parameters of a normal distribution when the mean is restricted to an interval. Aust. J. Statist. 12, 112-117.
Lambert, J. A. (1970).
Estimationof parametersin the four-parameterlognormal distribution. Aust. J. Statist. 12, 33-43. Malik, M. J. (1970). Estimationof the parametersof the Pareto distribution.Metrika, 15, 126-132. Meyer, P. L. (1967). The maximumlikelihood estimate of the non-centralityparameters of a non-centralx2 variate.J. Amer. Statist. Ass. 62, 1258-1264. *Oberhofer,W. (1970). Maximumlikelihood estimationof the parametersfor the multisectormodels. Operat. Res. Verfahren,8, 211-219. Seber,G. A. F. and Whale,J. F. (1970).The removalmethod for two and threesamples. Biometrics, 26, 393-400. Shenton, L. R. and Bowman,K. (1963). Higher momentsof a maximumlikelihood estimate. J. R. Statist. Soc., B, 25, 305-317.


Shenton, L. R. and Bowman, K. (1967). Remarks on large sample estimators for some discrete distributions. Technometrics, 9, 587-598.
Shenton, L. R. and Bowman, K. (1969). Maximum likelihood estimator moments for the two parameter gamma distribution. Sankhya, B, 31, 379-396.
Ta-Chung, Lu and Breen, W. J. (1969). The of the limited information estimator and the identification test. Econometrica, 37, 222-227.

Group IV
Blischke, W. R., Brady, F. J. and Mundle, P. B. (1970). Further results on estimation of the parameters of the Pearson type III distribution in the regular and non-regular cases. ARL Tech. Rep. Aerospace Res. Labs., Wright-Patterson Air Force Base, No. 70-0017.
*Griffin, B. S. and Krutchkoff, R. G. (1971). Optimal linear estimates: an empirical Bayes version with application to the binomial distribution. Biometrika, 58, 195-201.
Hamdan, M. A. (1971). Estimation of class boundaries in fitting a normal distribution to a qualitative multinomial distribution. Biometrics, 27, 457-459.
John, S. (1970). On identifying the population of origin of each observation in a mixture of observations from two gamma populations. Technometrics, 12, 565-568.
*Mann, Nancy R. (1970). Estimation of location and scale parameters under various models of censoring and truncation. ARL Tech. Rep. Aerospace Res. Labs. Wright-Patterson Air Force Base No. 70-0026.
Oliver, F. R. (1970). Some asymptotic properties of Colquhoun's estimates of a rectangular hyperbola. Appl. Statist. 19, 269-273.
*Rowe, I. M. (1970). A bootstrap method for the statistical estimation of model parameters. Int. J. Control, 12, 721-738.
Rustagi, J. S. and Laitnen, R. (1970). Moment estimation in a Markov-dependent firing distribution. Operat. Res. 18, 918-923.

Book
Silvey, S. D. (1970). Statistical Inference. London, Penguin Books, especially chapter 2.

Résumé
This expository paper, in two parts, treats the development of the theory of maximum likelihood (ML) estimation from its introduction in the papers of Fisher (1922 and 1925) up to the present day, when original research in this field still continues. After a short preamble, the author gives a historical account in which he shows in particular that Fisher is rightly regarded as the originator of this method of estimation, since all earlier methods, although similar in appearance, in fact depended on the method of inverse probability. The paper then gives, in summary, some fundamental definitions and results originating in the papers of Fisher himself and in the contributions made later by Cramer and Rao. Consistency and efficiency in relation to ML estimation are then considered in detail. Cramer's important theorem (1946) is also mentioned, with the remark that, as far as consistency is concerned, Wald's (1949) proof is of more general application. There follow some observations on subsequent related papers. The paper continues with a discussion of the problem of comparing ML estimators with other estimators. The author attaches special importance to Rao's concept of "second-order efficiency", since it does provide a means of discriminating between the various best asymptotically normal (BAN) estimators, of which ML estimators form a subclass. The author goes on to summarize work in the multiparameter field, beginning with a brief account of the well-known Cramer-Rao theory. Finally there is a section on the theoretical difficulties raised by the existence of inconsistent ML estimators and of superefficient estimators. Particular reference is made to Rao's (1962) paper, in which these anomalies were considered and at least partially resolved by the use of alternative criteria of consistency and efficiency, which nevertheless have their source in Fisher's original ideas. The bibliography at the end of the paper lists some four hundred papers. Some important definitions are given separately in an appendix.


Int. Stat. Rev., Vol. 41, No. 1, 1973, pp. 39-58/Longman Group Ltd/Printed in Great Britain

A Survey of Maximum Likelihood Estimation: Part 2

R. H. Norden
University of Bath, England

Contents
Introduction page 39
6. Consistency and efficiency continued 39
7. Comparison of MLE's with other estimators 43
8. Multiparameter situations 44
9. Theoretical difficulties 47
10. Summary 55
11. Appendix 55

Introduction
In Part 2 of this survey the discussion is mainly confined to the problem of comparing maximum likelihood estimation with other estimation procedures and the related problems that arise by reason of the existence of inconsistent maximum likelihood estimates, and of other estimators which are superefficient, with regard to which special reference is made to the work of C. R. Rao (1961a and 1962a).
Initially, however, there is a further discussion of Cramer's conditions as they relate to consistency and efficiency, and there is also a section on the general multiparameter situation.
All references are contained within the bibliography at the end of Part 1 (Int. Statist. Rev., Vol. 40, No. 3, 1972, pp. 329-354), but a number of important definitions are given separately in the Appendix at the end of this paper.

6. Consistency and Efficiency continued
A very thorough analysis of Cramer's conditions has been carried out by G. Kulldorf (1957), who gives them in the following equivalent way:

C.2. $\dfrac{\partial^2 \log f}{\partial \theta^2}$ exists for every $\theta \in \Theta$ and for almost all $x$.

C.3. $\dfrac{\partial^3 \log f}{\partial \theta^3}$ exists for every $\theta \in \Theta$ and for almost all $x$.

C.4. $\displaystyle\int_{-\infty}^{\infty} \frac{\partial f}{\partial \theta}\, dx = 0$ for every $\theta \in \Theta$.

¹ $f$ is an abbreviation for $f(x, \theta)$.


C.5. $\displaystyle\int_{-\infty}^{\infty} \frac{\partial^2 f}{\partial \theta^2}\, dx = 0$ for every $\theta \in \Theta$.

C.6. $-\infty < \displaystyle\int_{-\infty}^{\infty} \frac{\partial^2 \log f}{\partial \theta^2}\, f\, dx < 0$ for every $\theta \in \Theta$.

C.7. There exists a function $H(x)$ such that for all $\theta \in \Theta$

$$\left| \frac{\partial^3 \log f}{\partial \theta^3} \right| < H(x) \quad \text{and} \quad \int_{-\infty}^{\infty} H(x)\, f(x, \theta)\, dx < \infty.$$
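Conditions C.4 and C.6 are easy to probe numerically for a concrete family. The sketch below (an illustration only, not a computation from Kulldorf's paper) checks them by Monte Carlo for the normal scale family $f(x, \theta) = (2\pi\theta)^{-1/2} e^{-x^2/2\theta}$, the family used in the counter-example discussed in this section; the function names `score` and `d2_logf` are of course my own.

```python
import math
import random

# Monte Carlo check of C.4 and C.6 for f(x, theta) = N(0, theta).
# Illustrative sketch only; not from Kulldorf's paper.

def score(x, theta):
    # d log f / d theta = -1/(2 theta) + x^2/(2 theta^2)
    return -1.0 / (2 * theta) + x * x / (2 * theta ** 2)

def d2_logf(x, theta):
    # d^2 log f / d theta^2 = 1/(2 theta^2) - x^2/theta^3
    return 1.0 / (2 * theta ** 2) - x * x / theta ** 3

random.seed(3)
theta = 1.5
xs = [random.gauss(0.0, math.sqrt(theta)) for _ in range(100_000)]

# C.4 is equivalent to E[score] = 0, since df/dtheta = f * dlogf/dtheta;
# C.6 requires -inf < E[d^2 log f/d theta^2] < 0, here exactly -1/(2 theta^2).
mean_score = sum(score(x, theta) for x in xs) / len(xs)
mean_d2 = sum(d2_logf(x, theta) for x in xs) / len(xs)
print(round(mean_score, 3), round(mean_d2, 2))  # near 0 and near -0.22
```

The simulated averages reproduce the two sign conditions: the score integrates to zero, and the mean second derivative is strictly negative (it is minus the Fisher information).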

He then shows that conditions 1 to 4, 6 and 7 are sufficientfor provingconsistency in the loose sense, which is Cramer'sresult, and that conditions1 to 7 implyasymptotic efficiency in the strictsense. He also gives an exampleof an MLEwhich is consistentin the loose sense,is asymptotically efficientin the strictsense, and yet, Cramer'sconditions are not satisfied.It is

$$f(x, \theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-x^2/2\theta},$$

for which

$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i^2,$$

but

$$\frac{\partial^3 \log f}{\partial \theta^3} = -\frac{1}{\theta^3} + \frac{3x^2}{\theta^4} \to \infty \quad \text{as } \theta \to 0^+,$$

and is therefore not bounded in the open interval 0 < θ < ∞. Thus condition 7 is not satisfied. A new condition, C.8, is then introduced. There exists a function g(θ) which is positive and twice differentiable for every θ ∈ Θ, and a function H(x) such that for every θ ∈ Θ

$$\left|\frac{\partial^2}{\partial \theta^2}\left\{ g(\theta)\, \frac{\partial \log f}{\partial \theta} \right\}\right| \le H(x), \qquad \int H\, f\, dx < \infty.$$

C.9. ∂log f/∂θ is a continuous function of θ for almost all x.

C.10. There exists a positive function h(θ) such that

$$\int_{-\infty}^{\infty} \left[ \frac{1}{\theta_1 - \theta_2} \left( \frac{\partial \log f}{\partial \theta}\bigg|_{\theta_1} - \frac{\partial \log f}{\partial \theta}\bigg|_{\theta_2} \right) \right]^2 f(x, \theta_2)\, dx < \infty$$

for every pair θ₁ and θ₂ belonging to Θ and satisfying the inequality 0 < |θ₁ - θ₂| < h(θ₂).

The same density as before,

$$f(x, \theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-x^2/2\theta},$$

then serves as an example for which conditions 9 and 10 are clearly satisfied. Also condition 11 is satisfied by taking g(θ) = θ², since then

$$g(\theta)\, \frac{\partial \log f}{\partial \theta} = \tfrac{1}{2}(x^2 - \theta).$$

Previously Gurland (1954) had attempted to establish similar results, but according to Kulldorff there are a number of errors which invalidate the proofs.

Huber (1967) has also established consistency and asymptotic normality of the MLE under fairly weak conditions. He does not assume that the true distribution belongs to the parametric family which defines the MLE, and the regularity conditions themselves do not involve the second and higher derivatives of f. This is of importance in relation to questions of robustness, where it is necessary to establish asymptotic normality under such conditions.

A further question of importance is to what extent any dependence between the observations affects the asymptotic properties of θ̂. This problem has been studied by Silvey (1961), by Hartley and Rao (1967) for a particular case, and very recently by Bar-Shalom (1971). Silvey's results, which are closely related to martingale theory, are not expressed as explicit conditions on the distribution of the observations, and therefore it is difficult to see how they can be applied directly to actual situations. More useful results, on the other hand, have been obtained by Bar-Shalom (1971), since he does derive a set of relatively straightforward conditions which imply weak consistency and asymptotic normality. In the first place this requires a reformulation of Cramér's conditions to take dependence into account. Thus the observations x₁, x₂, ..., xₙ are assumed to have a known joint probability density function p(x₁, x₂, ..., xₙ | θ) with respect to a σ-finite product measure μⁿ (n = 1, 2, 3, ...). θ₀, the unknown true value of θ, is estimated by θ̂ₙ = θ̂ₙ(xⁿ), obtained by maximizing

$$L_n = p(x^n \mid \theta) = \prod_{i=1}^{n} p_i,$$
where θ̂ₙ(xⁿ) = θ̂ₙ(x₁, x₂, ..., xₙ), p(xⁿ | θ) = p(x₁, ..., xₙ | θ) and

$$p_i = p(x_i \mid x^{i-1}, \theta) = \frac{L_i}{L_{i-1}},$$

i.e. pᵢ is the conditional probability density function of xᵢ, given x₁, x₂, ..., x_{i-1} and θ. It is then supposed that the following regularity conditions are satisfied for all i. They are quoted in full.

C.1. ∂ˢlog pᵢ/∂θˢ (s = 1, 2, 3) exist for all i and all θ ∈ Θ.

C.2. $$E\left[\frac{\partial \log p_i}{\partial \theta}\right] = 0.$$

C.3. $$I_i(\theta_0) = E\left[\left(\frac{\partial \log p_i}{\partial \theta}\right)^2\right]_{\theta = \theta_0} = \int p_i \left(\frac{\partial \log p_i}{\partial \theta}\right)^2_{\theta = \theta_0} \mu(dx_i) \le C_1 < \infty.$$

C.4. $$E\left[\frac{\partial^2 \log p_i}{\partial \theta^2}\right]_{\theta = \theta_0} = -I_i(\theta_0).$$

C.5. There exists a function Hᵢ(xⁱ) such that |∂³log pᵢ/∂θ³| ≤ Hᵢ(xⁱ) and, for every ε > 0, there exists a finite positive M such that P[Hᵢ > M] < ε, where M is independent of θ and i.

C.6. $$\lim_{|k-j| \to \infty} E\left[\frac{\partial \log p_k}{\partial \theta}\, \frac{\partial \log p_j}{\partial \theta}\right]_{\theta = \theta_0} = 0,$$ or

C.6′. $$E\left[\frac{\partial \log p_k}{\partial \theta}\, \frac{\partial \log p_j}{\partial \theta}\right]_{\theta = \theta_0} = 0 \quad \text{for all } j \ne k,$$

which is a stronger version of C.6, and

C.7. $$\operatorname{var}\left[\frac{\partial^2 \log p_k}{\partial \theta^2}\right]_{\theta = \theta_0} \le A < \infty \quad \text{and}$$

$$\lim_{|k-j| \to \infty} \operatorname{cov}\left(\frac{\partial^2 \log p_k}{\partial \theta^2},\, \frac{\partial^2 \log p_j}{\partial \theta^2}\right)_{\theta = \theta_0} = 0.$$

The form of the argument then follows that of Cramér, except that when we get to equation (4.1) it is not possible to apply Khintchine's form of the law of large numbers, since the observations are not independent. Instead a form of the weak law of large numbers for dependent random variables is applied. This states that if for all i the random variables {uᵢ} are such that

$$|E(u_i)| < \infty, \qquad \operatorname{var}(u_i) < \infty, \qquad \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} E(u_i) < \infty,$$

and also

$$\lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n}\sum_{j=1}^{n} \operatorname{cov}(u_i, u_j) = 0,$$

then

$$\frac{1}{n}\sum_{i=1}^{n} u_i - \frac{1}{n}\sum_{i=1}^{n} E(u_i) \to 0$$

in probability.

Clearly the final condition of this theorem motivates the "asymptotic uncorrelatedness" conditions C.6 and C.7. Bar-Shalom then proves that under the regularity conditions 1 to 7,

$$\lim_{n \to \infty} \hat{\theta}_n(x^n) = \theta_0 \quad \text{in probability,}$$

and if also C.6′ is satisfied, then

$$E(\hat{\theta}_n - \theta_0)^2 = \left[\sum_{i=1}^{n} I_i(\theta_0)\right]^{-1}$$

for sufficiently large n, i.e. θ̂ₙ is asymptotically efficient.
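Bar-Shalom's conditional factorization can be made concrete with a small simulation that is not in the survey: a Gaussian first-order autoregression, in which each pᵢ = p(xᵢ | x^{i-1}, θ) is a normal density depending on the past only through x_{i-1}. The model, seed and sample size below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dependent-observation model (not from the survey): a Gaussian
# AR(1), x_i = theta0 * x_{i-1} + e_i with e_i ~ N(0, 1).  Its joint density
# factors as p(x^n | theta) = prod_i p(x_i | x^{i-1}, theta), which is exactly
# the conditional decomposition used in Bar-Shalom's formulation.
theta0, n = 0.6, 20000
x = np.empty(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = theta0 * x[i - 1] + rng.normal()

# Maximizing sum_i log p_i (with x_1 held fixed) reduces to least squares on
# the lagged values, so the conditional MLE has a closed form.
theta_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
print("theta_hat =", round(float(theta_hat), 3))
```

With this factorization the estimate falls close to the true θ₀, as the weak consistency result leads one to expect.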

7. Comparison of MLE's with other Estimators
It is well known that J. Neyman (1949), in developing the theory of minimum chi-square estimators, formulated the class of Regular Best Asymptotically Normal estimates, RBAN, or simply BAN estimates. This idea was further developed by Barankin and Gurland (1951), who derived a general method for constructing such estimators. Although Neyman defined BAN estimates in terms of the multinomial distribution, they now have a completely general definition, which is that θ̂ₙ, an estimate from a sample of n observations, is BAN if √n(θ̂ₙ - θ) is asymptotically distributed as N(0, {I(θ)}⁻¹), where I(θ) is Fisher's information for a single observation. He also obtained a set of necessary and sufficient conditions for an estimator to be BAN and showed that minimum chi-square (MCS) as well as ML estimators belong to this class. The same is also true of modified MCS and other estimators, so that MLE's are certainly not alone in having optimal asymptotic properties. Thus in actual, and hence finite, samples, asymptotic efficiency as we have understood that term so far does not enable us to distinguish between these estimators. Some finite sample studies have, of course, been carried out, e.g. Berkson (1955a and b) and Silverstone (1957) for particular cases, but we do not have, and indeed it does not appear possible to have, any clear-cut conclusions so long as we keep to minimum variance as a yardstick of efficiency.

Nevertheless, a reformulation of efficiency by Rao (1961a) has provided us with a statistical measure which does distinguish between the various BAN estimators, even when the sample itself is not very large. His scheme begins with the idea that if a statistic T is efficient, in the old sense, then I_T, the information provided by T per observation, in the Fisherian sense, tends to I, the information for a single observation. This will be the case if

$$\lim_{n \to \infty} P\left[\sqrt{n}\,\left|\,T_n - \theta - \frac{1}{nI}\,\frac{\partial \log L}{\partial \theta}\,\right| > \varepsilon\right] = 0 \qquad \text{(A)}$$

for any ε > 0. Taking (A) as a definition of efficiency does not lead us immediately to any new conclusions, for since it is equivalent to the old definition we find as previously that ML, MCS and other estimators are all equally efficient. However, it can be extended to formulate a criterion of second-order efficiency, defined in the following way. Let E₂ be the variance of

$$n\left[(T_n - \theta_0) - \frac{1}{nI}\,\frac{\partial \log L}{\partial \theta}\bigg|_{\theta_0} - \lambda\,(T_n - \theta_0)^2\right],$$

minimized with respect to λ. Then Tₙ is said to have second-order efficiency if no other estimator has a smaller value of E₂.

This approach is in agreement with Rao's earlier contention (1960) that the efficiency of an estimator Tₙ should be judged by the extent to which ∂log L/∂θ can be approximated by a function of Tₙ. In this (1961) paper, Rao then goes on to consider various estimators based on a random sample from a multinomial distribution¹ where the cell probabilities are functions of a single parameter. It is shown that, in this case, E₂ for the MLE is lower than that for the MCS, the minimum discrepancy (Haldane, 1951), the minimum Hellinger distance, and the minimum Kullback-Leibler separator estimators. Thus within this class, at least, the MLE has second-order efficiency.

We can also approach the problem of comparing MLE with other estimation procedures by investigating whether it can be regarded as a special case of some more general process. In this direction we have, for example, a paper by Wolfowitz (1953), who shows in a non-rigorous way that the theory of decision functions can explain why the MLE is asymptotically efficient. The paper by Berkson (1955), to which we have just referred, is also of interest here. He defines an extended class of Least Squares Estimates to which belong, in particular, the MLE, minimum χ², minimum reduced χ², minimum transform χ² and two other estimators described by Smith (1947) as "ideal" least squares estimates. All these estimates are efficient since the extended class is itself BAN.

Such approaches, however, demand essentially the acceptance of a particular loss function and are to that extent arbitrary. Different loss functions can give different results, so that it is possible to have two estimators θ̂₁ and θ̂₂ of θ, and a function f(θ), such that, although

$$E[(\hat{\theta}_1 - \theta)^2] > E[(\hat{\theta}_2 - \theta)^2],$$

yet

$$E[\{f(\hat{\theta}_1) - f(\theta)\}^2] < E[\{f(\hat{\theta}_2) - f(\theta)\}^2].^2$$

Thus, for example, Silverstone (1957) points out that for the binomial distribution with probability θ, when n = 3 the estimator T of θ has a smaller mean square error for certain values of θ than the estimator T′ = r/n, yet clearly T′ is to be preferred.

It seems doubtful, therefore, whether such an approach can claim any greater generality than the method of MLE as an estimation procedure. The only way open to us, therefore, in order to assess the ML process generally, is by direct comparison of results, and in order to achieve this it is necessary to search continuously for measures, such as efficiency and second-order efficiency, which do indeed discriminate between the various estimators.

8. Multiparameter Situations
We have already seen that consistency for the multiparameter case has been established by Wald. It is, however, possible to generalize the work of Cramér (1946) and Huzurbazar (1948), and the first step in this direction appears to have been taken by Chanda (1954). He modifies the conditions of Cramér in an obvious way, so that condition 1, for example, now becomes that for θ = (θ₁, ..., θ_k)

$$\frac{\partial \log f}{\partial \theta_r}, \qquad \frac{\partial^2 \log f}{\partial \theta_r\, \partial \theta_s}, \qquad \frac{\partial^3 \log f}{\partial \theta_r\, \partial \theta_s\, \partial \theta_t} \qquad (r, s, t = 1, \ldots, k)$$

all exist for all θ ∈ Θ and almost all x, and similarly for the other conditions. The proof that there is a unique consistent root of the LE, and that with probability tending to one the matrix

$$\left\| \frac{\partial^2 L}{\partial \theta_r\, \partial \theta_s} \right\|$$

is negative definite, is analogous to the single parameter situation. This result, however, suffers from the defect of proving consistency at a local maximum and not necessarily at an absolute maximum, as in the case of Wald's proof.

When, however, we have a set of k jointly sufficient statistics t₁, ..., t_k for k parameters θ₁, ..., θ_k, then as in the single parameter case the MLE's will be functions of them. It was established by Huzurbazar (1949) that in this case the LE's have a unique solution at which the LF has a maximum.

The multiparameter form of the Cramér-Rao MVB (Rao, 1945), in which Fisher's information is replaced by the "information" matrix

$$\left\| -E\left[\frac{\partial^2 \log f}{\partial \theta_r\, \partial \theta_s}\right] \right\|,$$

enables us, in a regular estimation problem, to fix bounds for the variance of the estimators of several unknown parameters, whatever estimation process is used.

¹ Rao's papers (1957, 1958) have established a complete theory of MLE for the multinomial distribution.
² This would be the case, for example, if θ̂₁ is distributed as a normal variate with mean θ₀ and unit variance, and θ̂₂ = √(1 + θ̂₁), with f(θ) = θ².
Thus if we have k estimators t₁, ..., t_k for k parameters θ₁, ..., θ_k, then defining V_rr = V(t_r) we have simultaneously that

$$nV_{rr} \ge I^{rr} \qquad (r = 1, \ldots, k),$$

where I^{rr} is the (r, r) element of I⁻¹ and ‖I_rs‖ is the information matrix, I say, defined above. Since it can be shown that I^{rr} ≥ 1/I_rr, we then have an improved result relative to the normal Cramér-Rao MVB for the single parameter case. Thus, for example, in the estimation of π₁ and π₂ in the trinomial distribution

$$f(x, \theta) = \frac{n!}{x_1!\, x_2!\, x_3!}\, \pi_1^{x_1}\, \pi_2^{x_2}\, (1 - \pi_1 - \pi_2)^{x_3},$$

where x₁ + x₂ + x₃ = n, it is easily shown that

$$\hat{\pi}_1 = \frac{x_1}{n}, \qquad \hat{\pi}_2 = \frac{x_2}{n},$$

and that

$$I = \begin{pmatrix} \dfrac{1}{\pi_1} + \dfrac{1}{1 - \pi_1 - \pi_2} & \dfrac{1}{1 - \pi_1 - \pi_2} \\[3mm] \dfrac{1}{1 - \pi_1 - \pi_2} & \dfrac{1}{\pi_2} + \dfrac{1}{1 - \pi_1 - \pi_2} \end{pmatrix}$$

and that

$$I^{-1} = \begin{pmatrix} \pi_1(1 - \pi_1) & -\pi_1\pi_2 \\ -\pi_1\pi_2 & \pi_2(1 - \pi_2) \end{pmatrix}.$$

Since it can be shown independently that

$$V(\hat{\pi}_1) = \frac{\pi_1(1 - \pi_1)}{n} \qquad \text{and} \qquad V(\hat{\pi}_2) = \frac{\pi_2(1 - \pi_2)}{n},$$

it follows that in this case the MLE's of π₁ and π₂ do attain the lower bound. [See Silvey (1970), chapter 2.] We note also that if π₁ only is unknown, then the MVB for π̂₁ is

$$\frac{\pi_1(1 - \pi_1 - \pi_2)}{n(1 - \pi_2)},$$

which is less than π₁(1 - π₁)/n.
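The displayed matrices are easy to verify numerically; the following sketch (with arbitrary illustrative values π₁ = 0.2, π₂ = 0.3) checks that the stated inverse is exact and that the known-π₂ bound is the smaller one.

```python
import numpy as np

# Numerical check of the displayed trinomial matrices; the values
# pi1 = 0.2 and pi2 = 0.3 are arbitrary illustrations.
pi1, pi2 = 0.2, 0.3
pi3 = 1 - pi1 - pi2

# Single-observation information matrix I, as displayed above.
I = np.array([[1 / pi1 + 1 / pi3, 1 / pi3],
              [1 / pi3, 1 / pi2 + 1 / pi3]])

# The stated inverse: pi_r(1 - pi_r) on the diagonal, -pi1*pi2 off it.
I_inv = np.array([[pi1 * (1 - pi1), -pi1 * pi2],
                  [-pi1 * pi2, pi2 * (1 - pi2)]])

# I @ I_inv is the identity, so the MLE variances pi_r(1 - pi_r)/n attain
# the diagonal bound, and the known-pi2 MVB is indeed the smaller quantity.
print(np.allclose(I @ I_inv, np.eye(2)))
print(pi1 * pi3 / (1 - pi2) < pi1 * (1 - pi1))
```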

However, as in the single parameter case, there are many instances where these bounds are not attained in finite samples, and it is again to large samples that we must look for the optimal properties of MLE to become more generally apparent. An important paper in this respect is again by Rao (1947), who proves the following. If L is the dispersion matrix of θ̂₁, ..., θ̂_k, the MLE's of θ₁, ..., θ_k, and K that of any other set of estimators, then under certain conditions:

(i) lim nK = K₀ and lim nL = L₀ exist as n → ∞, say, and K₀ - L₀ is positive semi-definite;
(ii) L₀⁻¹ = I, the information matrix.

Actually part (i) of the theorem follows from an earlier result of Wald (1943), who by means of general limit theorems showed that the joint distribution of θ̂₁, ..., θ̂_k tends to that of the multivariate normal with mean (θ₁, ..., θ_k) and dispersion matrix n⁻¹I⁻¹.

An immediate consequence of Rao's result is that det K ≥ det L, which is Geary's (1942) result that MLE's have the least "generalized" variance. A second corollary of this theorem is that the diagonal elements of K₀ are not less than those of L₀, showing that each MLE has least possible variance in large samples.

There are also two important papers by Doss (1962, 1963), in the first of which he establishes the properties of consistency, uniqueness, asymptotic normality, and joint asymptotic efficiency of the MLE for the multiparameter case. The conditions there are weaker than those given by Chanda in the case of consistency, and Doss gives as an example the case of the normal distribution with mean α and variance β, which satisfies his conditions but not those of Chanda, yet the MLE's of α and β do have all the properties under consideration. In the second paper a different set of conditions is given to establish the same results. The two sets of conditions therefore widen the class of situations in which MLE's have the optimal asymptotic properties.
One might say that Doss investigated the conditions of Chanda in the same way as Kulldorff examined Cramér's conditions. All this points again to the fact that there are a large number of possible sets of sufficient conditions and that, as we have already remarked, a much more difficult problem would be to establish a set of necessary and sufficient conditions under which the desirable asymptotic properties would hold.

A different approach to the question of efficiency, similar to Wolfowitz's (1965) investigations in the single parameter case, is provided by Kaufman (1966). Wolfowitz's results, as we have already seen in section 5, imply that subject to certain conditions, which do not demand normality, among all estimators of θ₀, θ̂ is such that √n(θ̂ - θ₀) is most concentrated about its mean. Kaufman's main theorem is a symmetrized multiparameter version of this result. He proves, in fact, that if Θ is a closed subset of k-dimensional space R^k, and {f_θ; θ ∈ Θ} a family of probability densities over a σ-finite measure space (X, S, μ) such that a MLE sequence {θ̂ₙ} exists, that √n(θ̂ₙ - θ) is asymptotically joint normal N[0, I(θ)⁻¹] uniformly in R^k × C, where I(θ) is the information matrix, and that {θ̂ₙ} is asymptotically sufficient for θ in Θ₀, the interior of Θ, in the sense defined by Le Cam (1956), then if {Tₙ} is any estimator sequence, which need not be asymptotically normal, such that the distribution of n^{1/2}(Tₙ - θ) converges uniformly in R^k × C, and S is a convex symmetric set in R^k, we have

$$\lim_{n \to \infty} P[n^{1/2}(T_n - \theta) \in S] \le \lim_{n \to \infty} P[n^{1/2}(\hat{\theta}_n - \theta) \in S]$$

for all θ ∈ Θ₀. He then gives a counter-example to show that this result is not in general true for the unsymmetrized multidimensional form.

In conclusion, we comment briefly on a paper by Richards (1961), who devises a computational method which can greatly reduce the calculations necessary to derive a set of MLE's. It could also be of great value in regression problems.

From k unknown parameters θ₁, ..., θ_k choose r of them, θ₁, ..., θ_r, in some convenient way, and obtain conditional MLE's of θ_{r+1}, ..., θ_k given θ₁, ..., θ_r. Denoting these conditional estimates by θ̃_{r+1}, ..., θ̃_k, we have, by implication, that there exist functional relationships

$$\tilde{\theta}_{r+j} = \phi_j(\theta_1, \ldots, \theta_r), \qquad j = 1, 2, \ldots, k-r. \qquad (8.1)$$

Now substitute θ̃_{r+j} for θ_{r+j} in l(x | θ₁, ..., θ_k) to obtain l′(x | θ₁, ..., θ_r), say, and hence new modified LE's,

$$\frac{\partial l'}{\partial \theta_i} = 0, \qquad i = 1, 2, \ldots, r. \qquad (8.2)$$

If θ̂₁, ..., θ̂_r are the solutions of (8.2), then the remaining MLE's are given by

$$\hat{\theta}_{r+j} = \phi_j(\hat{\theta}_1, \ldots, \hat{\theta}_r), \qquad j = 1, 2, \ldots, k-r.$$

There are some theoretical gaps in this argument, but these have been filled in by Kale (1963), who establishes that the method is essentially valid provided correct use is made of the implicit function theorem.
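Richards' device can be sketched on the simplest case where every step is available in closed form: the normal distribution with k = 2, r = 1, θ₁ = μ and θ₂ = σ². The data and names below are illustrative and are not taken from Richards' paper.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=5000)   # illustrative sample; k = 2, r = 1
n = len(x)

# Step 1 (cf. eq. 8.1): for fixed mu, the conditional MLE of sigma^2 is
# explicit, giving the functional relationship sigma2 = phi(mu).
def phi(mu):
    return np.mean((x - mu) ** 2)

# Step 2: substituting phi(mu) into the log-likelihood yields the modified
# likelihood l'(x | mu), a function of mu alone.
def l_prime(mu):
    return -0.5 * n * (np.log(2 * np.pi * phi(mu)) + 1)

# Step 3 (cf. eq. 8.2): solve dl'/dmu = 0; here the solution is the sample
# mean, and the remaining MLE is recovered through phi.
mu_hat = x.mean()
sigma2_hat = phi(mu_hat)

# A crude grid search on l'(mu) agrees with the analytic solution.
grid = np.linspace(mu_hat - 0.5, mu_hat + 0.5, 201)
mu_grid = grid[np.argmax([l_prime(m) for m in grid])]
print(abs(mu_grid - mu_hat) < 0.01)
```

The point of the scheme is that the inner maximization collapses k-r parameters at once, leaving a lower-dimensional search.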

9. Theoretical Difficulties
We have already seen that MLE as a procedure appears to be deficient in the following ways:

(a) MLE's are not always consistent.
(b) There are other estimators which have a smaller asymptotic variance, at least for some values of the unknown parameter, and are therefore, in this sense, more efficient than the MLE.
(c) Nothing very much can be said about the small sample performance of MLE's.

With regard to (c), it can be said that as there is no complete theory of estimation for small samples, no simple conclusions capable of a wide application can be made about MLE's, or any other estimator. It cannot therefore be advanced as a reason for preferring another estimation procedure. Certainly Fisher's claims for MLE's cannot be criticized in this way, as he clearly regarded the optimal properties of MLE as essentially asymptotic ones. The rest of this discussion will therefore be confined to (a) and (b), and we begin with the well-known example of inconsistency of Neyman and Scott (1948), who also at the same time introduced the important concepts of structural and incidental parameters. Suppose that we have observations x_ij (i = 1, 2, ..., s; j = 1, ..., n) from s populations which are characterized by the following probability density functions,

$$p_i(x_{ij} \mid \mu_i, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x_{ij} - \mu_i)^2}{2\sigma^2}\right].$$

The parameter σ is common to each population, so that if s is indefinitely large then σ appears in the probability law of an infinite number of observations and therefore, in accordance with the definition given, is a structural parameter. On the other hand, each μᵢ relates only to a finite number of observations, however large s may be, and is therefore an incidental parameter.

If these parameters are estimated by the method of ML, then it is easily found that

$$\hat{\mu}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij} = \bar{x}_i, \quad \text{say,}$$

and

$$\hat{\sigma}^2 = \frac{1}{ns}\sum_{i=1}^{s}\sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2.$$

Since $E\left[\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)^2\right] = (n-1)\sigma^2$ for each i, it follows that σ̂² is the mean of s quantities each of which has an expected value $\frac{n-1}{n}\sigma^2$. Thus σ̂² converges in probability, as s tends to infinity, to $\frac{n-1}{n}\sigma^2$, so that it is not consistent for σ². It is, of course, possible to correct for this by taking $\frac{n}{n-1}\hat{\sigma}^2$ as our estimator of σ², but this is irrelevant since this new estimator would not then be the MLE. We note, too, that when n = 2, which is the form in which this example is most frequently quoted, σ̂² converges in probability to ½σ², so that there is a substantial error.

Kiefer and Wolfowitz (1956), on the other hand, have shown that when the μᵢ are independently and identically distributed random variables then the MLE σ̂² is strongly consistent. Evidently, therefore, this situation is more regular than the one in which the incidental parameters are unknown constants. In general, however, the problem of eliminating incidental or nuisance parameters, so as to be able to make statements of uncertainty about the parameters of interest, is now recognized in statistics as a major one. A "likelihood" as distinct from a Bayesian approach to this problem is provided in a very recent paper by Kalbfleisch and Sprott (1970). This is an important as well as a highly controversial paper. It is here that integrated likelihoods, marginal likelihoods, maximum relative likelihoods, conditional likelihoods and second-order likelihoods are introduced. Definitions of these terms are given in the Appendix.

An example of a different type is due to Basu (1952). Let X = 0 or 1 be a random variable whose distribution depends on a single unknown parameter θ, 0 < θ < 1, as follows:

$$P(X = 1) = \theta \ \text{ when } \theta \in A, \qquad = 1 - \theta \ \text{ when } \theta \in B,$$

where A is the set of rationals and B any countable set of irrationals in [0, 1]. If in a sample of n independent observations we obtain r 1's, then θ̂ = r/n, which converges in probability to θ or 1 - θ according as θ ∈ A or θ ∈ B, and it therefore cannot be consistent for θ.
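Returning to the Neyman-Scott example above, the limit σ̂² → ½σ² for n = 2 is easily checked by simulation; all numerical values below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Neyman-Scott layout with n = 2 observations per population: as s grows,
# the MLE of sigma^2 converges to (n - 1)/n * sigma^2 = sigma^2 / 2.
s, n, sigma = 200000, 2, 1.5
mu = rng.normal(0.0, 10.0, size=s)                  # incidental parameters mu_i
x = rng.normal(mu[:, None], sigma, size=(s, n))     # observations x_ij

xbar = x.mean(axis=1, keepdims=True)                # the MLE's xbar_i
sigma2_mle = np.sum((x - xbar) ** 2) / (n * s)      # the MLE of sigma^2

# The ratio settles near 1/2, not 1, exhibiting the inconsistency.
print(abs(sigma2_mle / sigma ** 2 - 0.5) < 0.01)
```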
Another more complicated example, due to Kraft and Le Cam (1956), was discussed earlier in section 4. Two further examples of inconsistency of a different type are provided by Bahadur (1958). Here we are concerned with the estimation of the population distribution function. Bahadur's viewpoint is that where "the (ML) estimated distribution converges to the actual distribution in the strong sense that, with probability one, the estimated probability of any event in the sample space of a single observation tends, uniformly in events, to the correct probability", it is unreasonable to speak of the MLE as being inconsistent. He claims, therefore, that the importance of his examples lies in the fact that, unlike earlier examples of inconsistency, the estimated distribution function does not converge to the true one. Actually, in the first of his examples, the MLE does not, in any way, indicate the true form

of the distribution function. In the second, which we now summarize, the results obtained by the ML process are misleading.

(i) The random variable X is a non-negative integer j = 0, 1, 2, ....
(ii) The set of possible distributions of X is a countable set {P₁, P₂, ...} in which lim_{k→∞} P_k exists, = P_∞ say.
(iii) For all sets of sample values (X₁, X₂, ..., Xₙ) the MLE exists.
(iv) Nevertheless, no matter what the actual distribution of X, any sequence of MLE distributions converges to P_∞. Thus if hₙ = hₙ(X₁, ..., Xₙ) is the least index h such that P_h is a MLE based on the first n observations, then

$$P_k^{(\infty)}\left[\lim_{n \to \infty} h_n = \infty\right] = 1$$

for each k = 1, 2, ..., where P_k^{(∞)} denotes the probability measure on the space of infinite sequences X₁, X₂, ..., where X is distributed as P_k.

At first sight, therefore, it would appear that we are confronted with serious theoretical difficulties with regard to MLE. The situation is, however, not as bad as it seems, for consistency as a property is artificial, and indeed there can be situations where it is quite valueless. For example, if μ̂ₙ is the mean of n independent observations from a normal population with unknown mean μ and known variance σ², then μ̂ₙ is the MLE and is consistent for μ. Now define Tₙ to be arbitrary for n < 10⁶ and equal to μ̂ₙ for n ≥ 10⁶. Then Tₙ is also a consistent estimator of μ, but in practical terms it is clearly useless.

Of greater intrinsic value in estimation theory is Fisher's first (1922) definition of consistency, now referred to as Fisher consistency (FC). It is defined in the following way: "A statistic satisfies the criterion of consistency if, when it is calculated from the whole population, it is equal to the required parameter" (page 309). Later in the paper he says "That when applied to the whole population the derived statistic should be equal to the parameter".
It is also apparent that these definitions and the examples given in the paper imply that he was considering the statistic as a function of the observed distribution function. We therefore have the following formal definition of FC, due to Rao (1962). A statistic is said to be FC for a parameter θ if

(1) it is an explicit function of the sample distribution function Sₙ (or the observed proportions p₁, ..., p_k in the case of the multinomial), and
(2) the value of the function reduces to θ identically when Sₙ is replaced by the true distribution function F(θ) (or the true proportions π₁(θ), ..., π_k(θ) in the case of the multinomial).

This definition places restrictions on the functions of observations which may be considered, whereas the definition of consistency used heretofore, probability consistency (PC), does not. In the case where the statistic is a continuous functional (see Appendix) of the distribution function, FC implies PC, and no doubt Fisher himself had in mind this kind of situation when instead he put forward PC as a definition of consistency.

With regard to the general problem of the consistency of MLE's, Rao (1962) then makes the following observations, by which the importance of FC, in particular, will readily become apparent. In the first place, it can easily be shown that the MLE is always FC. In addition, however, we must also require that when Sₙ is close to F(θ₀) then F(θ̂) should be close to F(θ₀). This is always the case for the multinomial distribution (Rao, 1957). It is also the case for the infinite multinomial distribution provided Σ πᵢ log πᵢ is convergent, where the πᵢ are the cell probabilities (Kiefer and Wolfowitz, 1956; Rao, 1958), and, under slightly more restrictive conditions, for any continuous distribution (Kraft, 1955). The example due to Bahadur, which we have just considered, is of a very special type.

Thus it can be said that the MLE of the distribution function is consistent under very non-restrictive conditions, and from this will follow the corresponding consistency of the MLE of a parameter provided we also have the continuity condition: F(θ̂) → F(θ) implies θ̂ → θ. This seems a reasonable requirement, and in fact it is not satisfied in the examples of Basu (1955) and Kraft and Le Cam (1956).

Rao then shows how it is possible to accommodate some, at least, of these anomalous cases, in the following way. Consider first the likelihood ratio P(T, θ₁)/P(T, θ₂), where P(T, θᵢ) is the probability density function of T when θ = θᵢ (i = 1, 2), so that the more T discriminates between the two possible values θ₁ and θ₂ of θ, the more this ratio will differ from unity. As a measure of discrimination over all samples, and hence over all possible values of T, the quantity¹

$$J_T(\theta_1, \theta_2) = E_1\left[\log\frac{P(T, \theta_1)}{P(T, \theta_2)}\right] + E_2\left[\log\frac{P(T, \theta_2)}{P(T, \theta_1)}\right]$$

is put forward. Obviously the maximum value of this quantity is obtained by replacing T by the sample itself, when we would obtain

$$J_s(\theta_1, \theta_2) = n\left\{ E_1\left[\log\frac{P(x, \theta_1)}{P(x, \theta_2)}\right] + E_2\left[\log\frac{P(x, \theta_2)}{P(x, \theta_1)}\right] \right\},$$

since the observations are independent. A measure of the effectiveness of T as a discriminator, therefore, would be the ratio J_T(θ₁, θ₂)/J_s(θ₁, θ₂), which would be unity if T is a sufficient statistic.

Now Basu (1954) has shown that as n tends to infinity J_s(θ₁, θ₂) → ∞, so that we have perfect discrimination between θ₁ and θ₂, as we should expect. We must also demand, therefore, that for T similarly J_T(θ₁, θ₂) → ∞ as n → ∞, in which case P(T, θ₁) and P(T, θ₂) would be said to be orthogonal in the limit. This will be the case if T converges in probability to a 1-1 function of θ, which is precisely Fisher's criterion of consistency.
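As a concrete instance (a Bernoulli(θ) sample, chosen purely for illustration), J₁ has a closed form, J_s = nJ₁ grows without bound with n, and the sufficient statistic T = Σxᵢ attains the ratio J_T/J_s = 1 exactly.

```python
import math

# Symmetrised Kullback-Leibler measure J for one Bernoulli(theta) observation;
# the distribution and the parameter values below are purely illustrative.
def J1(t1, t2):
    return (t1 - t2) * (math.log(t1 / t2) - math.log((1 - t1) / (1 - t2)))

# For the sufficient statistic T = sum(x_i) ~ Binomial(n, theta) the log-ratio
# is linear in T, the expectations reduce to E_i[T] = n * theta_i, and J_T
# equals J_s = n * J1 exactly, so the discrimination ratio J_T / J_s is unity.
def J_T(n, t1, t2):
    lr = math.log(t1 / t2)               # coefficient of T in the log-ratio
    lq = math.log((1 - t1) / (1 - t2))   # coefficient of n - T
    e1 = n * t1 * lr + n * (1 - t1) * lq
    e2 = -(n * t2 * lr + n * (1 - t2) * lq)
    return e1 + e2

n, t1, t2 = 50, 0.3, 0.5
print(math.isclose(J_T(n, t1, t2), n * J1(t1, t2)))   # ratio J_T / J_s = 1
```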
The problem of inconsistent MLE's can therefore be partly resolved by the following wider definition of consistency. A statistic T will be said to be consistent in the wide sense if the two distributions of T corresponding to the two possible values of θ are orthogonal in the limit. Such a definition does not necessarily demand the convergence of T to a unique limit, and in fact Basu's estimate is consistent in this sense. So also is the example due to Neyman and Scott. Bahadur's example, however, is not even consistent in the wide sense, though it would be reasonable to regard this as very exceptional, and only very weak conditions are necessary for its exclusion.

Examples of apparent inconsistency arise in some cases of estimation of parameters of functional and structural relationships. This is an extensive subject and reference will be made to only two simple examples. In the case of the functional relationship

$$Y = \alpha + \beta X,$$

where α and β are unknown constants, it is supposed that we observe x = X + u and y = Y + v, where u ~ N(0, σ_u²) and v ~ N(0, σ_v²). If we have n unreplicated observations, the likelihood equations may then be solved for β̂, σ̂_u², σ̂_v² and the X̂ᵢ.

¹ This originates from the work of Kullback and Leibler (1951) in connection with information theory.


From the LE's it follows that σ̂_v² = β̂²σ̂_u², which is an inconsistency in any sense, since no corresponding relationship between the true values of σ_u, σ_v and β necessarily exists. This anomaly, which was first discovered by Lindley¹ (1947), has, however, been resolved by Solari (1968), who has shown that the solutions of the likelihood equations for the three structural parameters β, σ_u, σ_v and the n incidental parameters Xᵢ do not give rise to even a local maximum of L, but in fact to a saddle point. Thus they are not MLE's, and so no inconsistency exists here. Solari goes on, however, to show further that the function

$$g = 2\log \sigma_u\sigma_v + \frac{1}{n\sigma_u^2}\sum_{i=1}^{n}(x_i - X_i)^2 + \frac{1}{n\sigma_v^2}\sum_{i=1}^{n}(y_i - \alpha - \beta X_i)^2,$$

which is such that

$$L = (2\pi)^{-n} e^{-ng/2},$$

so that L is a decreasing function of g, has an essential singularity at any points of the parameter space for which

$$\sigma_u = 0 \ \text{ and } \ \sum_{i=1}^{n}(x_i - X_i)^2 = 0,$$

and/or

$$\sigma_v = 0 \ \text{ and } \ \sum_{i=1}^{n}(y_i - \alpha - \beta X_i)^2 = 0.$$

Thus, in the neighbourhood of such points, g can take any value, so that the minimum of g obtained by first minimizing with respect to one parameter and then with respect to another, and so on, would depend on the order of the parameters chosen. Thus unless prior information was available to indicate a particular order, no unique minimum of g, and hence no MLE, could exist.

Apparently, therefore, we have an estimation problem (there are in fact several problems of this type) in which the method of MLE fails to give any results at all. One might ask, however, whether in the problem just described it is logical to admit points at which the error variances vanish, rather than to require σ_u² ≥ ε_u² > 0 and σ_v² ≥ ε_v² > 0, where ε_u² and ε_v² are small but fixed.² The function g would then be bounded below and continuous (and differentiable) throughout the entire parameter space and would therefore have a unique minimum in Θ, so that in fact the MLE would exist, even though we would know, as a result of Solari's analysis, that it could not be obtained from the likelihood equation.
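Solari's singularity can be exhibited numerically: along the path X̂ᵢ = xᵢ the first sum in g vanishes identically, and letting σ_u → 0 then drives g to −∞, so that L = (2π)⁻ⁿe^{−ng/2} is unbounded. The data below are synthetic and every numerical value is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny synthetic functional-relationship data; alpha, beta and the error
# standard deviations are arbitrary illustrations.
n, alpha, beta = 10, 1.0, 2.0
X = rng.normal(0.0, 1.0, n)
x = X + rng.normal(0.0, 0.5, n)
y = alpha + beta * X + rng.normal(0.0, 0.5, n)

def g(su, sv, a, b, Xhat):
    # g = 2 log(su*sv) + (1/(n su^2)) sum (x_i - Xhat_i)^2
    #                  + (1/(n sv^2)) sum (y_i - a - b Xhat_i)^2
    return (2 * np.log(su * sv)
            + np.sum((x - Xhat) ** 2) / (n * su ** 2)
            + np.sum((y - a - b * Xhat) ** 2) / (n * sv ** 2))

# Along the singular path Xhat_i = x_i the first sum vanishes identically,
# so g ~ 2 log(su) + const decreases without bound as su -> 0, and the
# likelihood L = (2 pi)^(-n) exp(-n g / 2) is unbounded.
vals = [g(su, 1.0, alpha, beta, x) for su in (1e-1, 1e-3, 1e-5)]
print(vals[0] > vals[1] > vals[2])
```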
The important results of Kiefer and Wolfowitz (1956), to which reference was made earlier in this section, would lead us to expect a more straightforward situation in the structural relationship case, since here the incidental parameters are themselves random variables. We have here the model

$$Y = \alpha + \beta X, \qquad x = X + u, \qquad y = Y + v \qquad (u \text{ and } v \text{ independent}),$$

¹ Lindley and El-Sayyad (1967) have however provided a Bayesian analysis of the problem of estimating the parameters of the functional relationship considered here.
² This is merely another way of saying that both variables are subject to error, which belief is implicit in the formulation of the model itself.

R.H. Norden, 1972 A Survey of Maximum Likelihood Estimation, parts 1-2 Combined 41 52 where = (u) = E(v) = 0, V (u) o , V (v)= a2, E(x) = E (X)= Px, say, and hence E (y) = E (Y) = +flpx. Also, if V (X) = o2, then S(x) = +a and V(y)= fl2a+ao. If it is furtherassumed that the distributionsinvolved are normal, then the joint distribution of x andy is bivariatenormal and is thereforedefined by parametersp,, u2, a2, a2 and p, say. In the case of unreplicatedobservations, therefore, it is impossibleto estimatethe six para- metersin the originalmodel. This is essentiallya problemof identifiability1 and in this case it can only be resolvedby our makingcertain assumptions about the errorvariances. Once these assumptionsare made, however,the MLE gives perfectlysensible results. Whenwe turnto the problemof super-efficiencyit is naturalto beginwith the now classical exampleof J. L. Hodges, since previousto its appearanceit had alwaysbeen thoughtthat in large samplesno estimatorcould have a smallervariance than that of the MLE.2 If p, is the meanof n independentobservations from a normalpopulation with an unknown mean p and a unit variance,then it is well known that pu,is the MLE of p and that it is 1 normallydistributed about p with a normalvariance -, for all n. Define T, by n T.= p, if /pt, n- = if I < cp, p. n-+. Clearlythe asymptotic distribution of T is N , wheny andN ) wheny = 0. 1) :0 p, Thus by takinga<1 we have whenp = 0 an asymptoticallynormal estimator with a variance less than that of the MLE. This examplewas quoted and then generalizedby Le Cam (1953). He showedthat it was possibleto constructan estimatorwhich would be super-efficientat a countableset of values of the unknownparameter 0. Further,it was shown that the set of all values of 0 at which an estimator could be super-efficienthas Lebesgue measure zero. 
Evidently, therefore, Hodges' example is not as drastic as at first sight might be supposed, though as it stands it does refute Fisher's assertion about the asymptotic variance of MLE's in general. One possible remedy is to impose regularity conditions so as to exclude such estimators. This has been done, for example, by A. M. Walker (1963). His conditions are as follows:

1. $\Theta$ is an open interval of the real line $R$, and $f(x, \theta)$ is a probability density for each $\theta \in \Theta$.

2. $\partial f(x, \theta)/\partial\theta$ exists for almost all $x$.

3. $\displaystyle\int \frac{\partial f(x, \theta)}{\partial\theta}\, d\mu = 0$.

4. $\displaystyle J(\theta) = E\left[\left(\frac{\partial \log f(x, \theta)}{\partial\theta}\right)^2\right] = \int \frac{1}{f}\left(\frac{\partial f}{\partial\theta}\right)^2 d\mu$ is finite and positive.

5. $E(T_n)$ exists for each $n$, where the expectation is taken with respect to the product measure $\mu^{(n)} = \mu \times \mu \times \cdots \times \mu$ in the Euclidean space $R^n$.

1 Identifiability of the structural parameters is one of the conditions of Kiefer and Wolfowitz's (1956) results.
2 Actually this view persisted for a long time afterwards.

Thus $\{f(x, \theta);\ \theta \in \Theta\}$ is said to be a regular family of probability densities if conditions 1 to 4 hold, and $(T_n)$ is a regular estimator sequence if condition 5 holds. His first theorem is then as follows.

Theorem 1. Let $\{f(x, \theta);\ \theta \in \Theta\}$ be a regular family of probability densities with respect to a $\sigma$-finite measure $\mu$, and let $X^{(n)} = (X_1, X_2, \ldots, X_n)$ be, for each $n$, a random sample of size $n$ from a population with probability density $f(x, \theta)$. Suppose also that $T_n(X^{(n)})$ is a regular estimator sequence with asymptotic variance $n^{-1}v_T(\theta)$. Then if (a) for each $\theta$

$$E_\theta\, \big|\sqrt{n}\,(T_n - \theta)\big|^{2+\delta} = A_{2+\delta,\,n}(\theta) < \infty$$

for each $n$, and the sequence $\{A_{2+\delta,\,n}(\theta)\}$ is bounded for some positive $\delta$, and

(b) $b_n'(\theta) = \dfrac{\partial}{\partial\theta}\, E_\theta(T_n - \theta) \to 0$ as $n \to \infty$,

then $v_T(\theta) \geq \{J(\theta)\}^{-1}$, or $e_T(\theta) \leq 1$, where $e_T(\theta) = \{J(\theta)\, v_T(\theta)\}^{-1}$ is the asymptotic efficiency of $T_n$.

His second theorem is merely a multiparameter generalization of this. It is claimed that the conditions on $T_n$ are not unduly restrictive and would not, in practice, exclude any estimator that a statistician would be likely to want to use. In fact, the estimators of Hodges and Le Cam are excluded by condition (b) of Theorem 1.

Alternatively we can reconsider other ways of defining efficiency. We have already seen that attempts to achieve this in terms of "concentration" are mathematically complicated, and no straightforward development of this idea appears to be possible. Thus, following Fisher, efficiency has been defined heretofore essentially in terms of a ratio of variances. We have remarked, too, that since Fisher himself regarded a statistic as efficient when it utilizes all the available information, and that this can be said to be the case when the asymptotic variance has the least possible value $1/(nI(\theta))$, the quantity $I(\theta)$ has come to be defined as the amount of information per observation in the sample. Fisher's proof that $1/(nI(\theta))$ does provide a lower bound for the asymptotic variance of any estimator does not, however, contain the necessary regularity conditions to exclude estimators such as those of Hodges and Le Cam.

Nevertheless, the appearance of $I(\theta)$ in many distinct statistical situations shows it to be of fundamental importance. We have already seen that it forms part of Bahadur's inequalities in connection with the "effective standard deviation". It is also implicit in the discrimination measures $J_l(\theta_1, \theta_2)$ and $J_T(\theta_1, \theta_2)$, which were described earlier, for it is not difficult to show that

$$J_l(\theta, \theta + \delta\theta) \simeq \tfrac{1}{2}(\delta\theta)^2\, nI(\theta)$$

and

$$J_T(\theta, \theta + \delta\theta) \simeq \tfrac{1}{2}(\delta\theta)^2\, nI_T(\theta),$$

where $I_T(\theta)$ is the corresponding information per observation derived from the statistic $T$. $T$ is not fully efficient, therefore, if $I(\theta) > I_T(\theta)$. In the multiparameter case, where $I(\theta)$ and $I_T(\theta)$ are matrices, it is well known that $I(\theta) - I_T(\theta)$ is positive semi-definite, Rao (1945).
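Walker's bound $v_T(\theta) \geq \{J(\theta)\}^{-1}$ can be illustrated with a standard comparison that is not taken from the survey: for samples from $N(\theta, 1)$ we have $J(\theta) = 1$; the sample mean attains the bound, while the sample median has asymptotic variance $\pi/2n$, so $e_T \approx 2/\pi \approx 0.637$. A minimal Monte Carlo sketch, with arbitrary sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# For samples from N(theta, 1) the information per observation is J(theta) = 1,
# so Walker's bound says v_T(theta) = lim n * Var(T_n) >= 1.
n, reps, theta = 500, 5000, 0.0
draws = rng.normal(theta, 1.0, size=(reps, n))

v_mean = n * np.var(draws.mean(axis=1))        # MLE: attains the bound (~1)
v_med = n * np.var(np.median(draws, axis=1))   # sample median: ~ pi/2 ~ 1.571

print(v_mean)          # asymptotic efficiency e_T = 1/v_mean ~ 1
print(v_med)           # asymptotic efficiency e_T = 1/v_med ~ 2/pi
```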

Efficiency here therefore could be measured by a scalar function of $I(\theta) - I_T(\theta)$ having the value zero when all lower bounds were simultaneously attained. Thus efficiency in the sense of perfect discrimination, or orthogonality as defined earlier, is equivalent to efficiency in the information sense, and hence to efficiency in the minimum variance sense. As a basis for all possible definitions of efficiency, therefore, is the criterion that $T$ is efficient if $I(\theta) - I_T(\theta) \to 0$ as $n \to \infty$.

This approach is capable of development to the first- and second-order efficiency discussed earlier, by which it is possible to discriminate between the various BAN estimators. It also leads to the following equivalent formulation proposed by Rao (1962): "A statistic is said to be efficient if its asymptotic correlation with the derivative of the log-likelihood ($l'$) is unity. The efficiency of any statistic may be measured by $\rho^2$, where $\rho$ is the asymptotic correlation with $Z_n$ ($= n^{-1/2}\, l'$)", and there is an analogous definition for multiparameter estimation.

The motivation for these definitions lies, again, in the search for effective discrimination. A paper by Rao and Poti (1946) establishes that a consequence of the Neyman-Pearson lemma is that a test which best discriminates small variations $\delta\theta$ from the true value of $\theta$, for any given $n$, is of the form: reject if and only if $Z_n \geq \lambda$ (a constant). The definition just given then follows naturally from the fact that a statistic which has an asymptotic correlation with $Z_n$ of 1 will be as good as $Z_n$ as a discriminator.

We now pass on to consider super-efficient estimators in the light of this new definition. In the first place it should be noted that such statistics are not FC, since they cannot be explicit functions of the sample distribution. In fact, Kallianpur and Rao (1955) have shown that $1/(nI(\theta))$ is a lower bound for the asymptotic variance of any estimator which is FC.
This again illustrates the importance, from the theoretical standpoint, of taking FC rather than PC as the criterion of consistency. They also show that in large samples, under suitable regularity conditions, any estimator whose variance does attain the lower bound $1/(nI(\theta))$ also has an asymptotic correlation of unity with $Z_n$, thus reconciling Fisher's and Rao's definitions of efficiency.

When FC and other regularity conditions do not hold, then the estimators of Hodges and Le Cam are not excluded, but on the basis of the new definition of efficiency they are at best as good as, but never better than, the MLE. They are only as effective for hypothesis testing, etc., when their asymptotic correlation with the MLE is unity. When it is less than unity they are definitely worse than the MLE. We can even have an estimator of the Hodges-Le Cam type which is super-efficient for some values of $\theta$ but is sub-efficient in the new sense.

Thus $T$ is asymptotically distributed as $N(0, a^2/n)$ when $\mu = 0$, and as $N(\mu, 1)$ when $\mu \neq 0$. Since, therefore, $a$ can be arbitrarily small, $T$ is super-efficient. It would be wrong,

however, to infer that $T$ should be used as a means of testing the hypothesis $\mu = 0$. For if $T$ is used, then our test statistic is essentially $x$, and we have a less powerful test than if we had used the MLE.

Returning now to Basu's example as summarized at the end of section 5, we see at once that if minimum variance is our criterion then we must reject the MLE in favour of the super-efficient estimator $T$. A more "operational" approach to this problem, however, shows that the choice of estimator is not so straightforward as might at first sight appear, since for small departures from the true value of $\theta$ the statistic $T$ does not provide as powerful a test of a hypothesis $\theta = \theta_0$ as does $T'$, say, the MLE.
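Rao's correlation definition can itself be checked by simulation. In the normal case the score is $l' = \sum(x_i - \theta)$, so $Z_n = \sqrt{n}\,(\bar{x} - \theta)$; the sketch below (an assumed setup, not from the survey) estimates $\rho^2$ for the sample median, which should be close to its classical efficiency $2/\pi$:

```python
import numpy as np

rng = np.random.default_rng(2)

# For N(theta, 1) the score is l' = sum(x_i - theta), so
# Z_n = n^(-1/2) l' = sqrt(n) * (xbar - theta).
theta, n, reps = 0.0, 500, 5000
draws = rng.normal(theta, 1.0, size=(reps, n))

z = np.sqrt(n) * (draws.mean(axis=1) - theta)
med = np.median(draws, axis=1)                 # a competing estimator of theta

rho = np.corrcoef(z, med)[0, 1]
print(rho ** 2)        # ~ 2/pi ~ 0.637: Rao's efficiency measure for the median
```

The squared correlation with the score reproduces the variance-ratio efficiency of the median, consistent with the reconciliation of the two definitions described above.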

10. Summary

It now remains to summarize the foregoing discussion. We have seen that there are a large number of papers concerned with establishing consistency and/or efficiency under conditions which are as unrestrictive as possible. The aim is therefore to obtain the widest possible class of estimation situations in which MLE's exist and have the optimal properties which Fisher claimed for them.

On the other hand, the difficulties that arise by reason of the existence of inconsistent MLE's and super-efficient estimators can be partly resolved not so much by imposing further conditions as by reformulating efficiency and consistency in operational as distinct from purely mathematical terms.

Finally, it is clear that MLE's do have certain advantages over other methods of estimation, which may be summarized as follows:
1. MLE's are functions of such sufficient statistics as exist, and hence of the minimal sufficient statistic. This important property does not appertain to other estimators.
2. Although there are other BAN estimators, yet only MLE's possess second-order efficiency.
To these could be added that the MLE has great intuitive appeal and that it can be applied to a wide class of situations.

Appendix

(A) The following definitions are due to Kalbfleisch and Sprott (1970).

1. Integrated Likelihoods and Integrated Relative Likelihoods

If $x_1, x_2, \ldots, x_n$ are the realized values of a sample of $n$ from a population whose probability density function is $f(x; \theta, \phi)$, then the joint distribution of the observations, assuming independence, is

$$f(x_1, x_2, \ldots, x_n; \theta, \phi) = \prod_{i=1}^{n} f(x_i; \theta, \phi).$$

If, further, we are interested in making inferences about $\theta$, and $\phi$ is merely a nuisance parameter with a known prior distribution $g(\phi)$, then we integrate out $\phi$ to obtain

$$I(\theta; x_1, \ldots, x_n) = \int_{\Phi} f(x_1, \ldots, x_n; \theta, \phi)\, g(\phi)\, d\phi$$

where $\Phi$ is the entire $\phi$-space. The quantity $I(\theta; x_1, \ldots, x_n)$ is thus known as the integrated likelihood function of $\theta$, and the standardized quantity

$$\frac{I(\theta; x_1, \ldots, x_n)}{\sup_{\theta \in \Theta} I(\theta; x_1, \ldots, x_n)}$$

the integrated relative likelihood of $\theta$.
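As a concrete illustration (not from the paper: the normal model with nuisance variance $\phi$, the Exp(1) prior, and the data are all arbitrary assumptions), the integral above can be evaluated on a grid:

```python
import numpy as np

x = np.array([1.2, 0.7, 1.9, 1.1, 0.4])          # made-up sample

phi = np.linspace(0.05, 10.0, 2000)               # grid over nuisance variance phi
dphi = phi[1] - phi[0]
g = np.exp(-phi)                                  # assumed known prior: Exp(1)

def norm_pdf(v, mean, var):
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def integrated_likelihood(theta):
    # I(theta; x) = integral of prod_i f(x_i; theta, phi) * g(phi) dphi,
    # approximated by a Riemann sum on the grid
    like = np.prod(norm_pdf(x[:, None], theta, phi), axis=0)
    return np.sum(like * g) * dphi

thetas = np.linspace(-1.0, 3.0, 201)
I = np.array([integrated_likelihood(t) for t in thetas])
R = I / I.max()                                   # integrated relative likelihood
print(thetas[np.argmax(R)])                       # maximized at the sample mean, 1.06
```

Whatever prior is used for $\phi$, the normal likelihood is symmetric and unimodal in $\theta$ about $\bar{x}$ for each fixed $\phi$, so the integrated relative likelihood peaks at the sample mean.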

2. Maximum Relative Likelihood

Let $\hat{\phi}(\theta)$ be the MLE of $\phi$, for a given $\theta$, based on a sample $x_1, \ldots, x_n$. Substituting this for $\phi$ in the joint density of the observations and standardizing appropriately, we obtain

$$R_M(\theta) = \frac{f[x_1, \ldots, x_n; \theta, \hat{\phi}(\theta)]}{\sup_{\theta \in \Theta} f[x_1, \ldots, x_n; \theta, \hat{\phi}(\theta)]},$$

a quantity which is called the maximum relative likelihood.
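For the normal model with nuisance variance $\phi$ the maximization over $\phi$ is available in closed form, $\hat{\phi}(\theta) = n^{-1}\sum(x_i - \theta)^2$, so $R_M(\theta)$ is easy to compute. A small sketch with made-up data (the model and sample are illustrative assumptions, not from the survey):

```python
import numpy as np

x = np.array([1.2, 0.7, 1.9, 1.1, 0.4])   # made-up sample
n = len(x)

def phi_hat(theta):
    # MLE of the nuisance variance phi for a fixed theta (normal model)
    return np.mean((x - theta) ** 2)

def log_lik(theta, phi):
    return -0.5 * n * np.log(2 * np.pi * phi) - np.sum((x - theta) ** 2) / (2 * phi)

theta_hat = x.mean()                       # overall MLE of theta

def R_M(theta):
    # maximum (profile) relative likelihood of theta
    return np.exp(log_lik(theta, phi_hat(theta)) - log_lik(theta_hat, phi_hat(theta_hat)))

print(R_M(theta_hat))    # 1.0 by construction
print(R_M(2.0))          # < 1: theta = 2 is less plausible than the MLE
```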

3. Marginal Likelihoods

The probability that the point $(x_1, \ldots, x_n)$ in the $n$-dimensional sample space falls in the elemental cuboid whose "edges" are of lengths $dx_1, \ldots, dx_n$ is $f(x_1, \ldots, x_n; \theta, \phi)\, dx_1 \cdots dx_n$. Suppose that the following conditions are satisfied in relation to this quantity.

(i) There is a non-singular transformation taking $x_1, \ldots, x_n$ into $y_1, \ldots, y_{n-r}, a_1, \ldots, a_r$ such that a factorization

$$f(x_1, \ldots, x_n; \theta, \phi)\, dx_1 \cdots dx_n = \{g(a_1, \ldots, a_r; \theta)\, da_1 \cdots da_r\} \times \{h(y_1, \ldots, y_{n-r}; \theta, \phi \mid a_1, \ldots, a_r)\, dy_1 \cdots dy_{n-r}\},$$

where $g$ and $h$ are also density functions, exists.

(ii) If $\phi$ is unknown, then $h$ contains "no available information" about $\theta$.

Then the marginal likelihood function of $\theta$ is some function of $\theta$ proportional to

$$M(\theta) = g(a_1, \ldots, a_r; \theta)$$

and the corresponding standardized function

$$M_R(\theta) = \frac{M(\theta)}{\sup_{\theta \in \Theta} M(\theta)}$$

the marginal relative likelihood function of $\theta$.
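A standard illustration, assuming the normal model $N(\mu, \sigma^2)$ with $\mu$ as the nuisance parameter (this worked example is not reproduced from the survey): transforming $(x_1, \ldots, x_n)$ to $(\bar{x}, a_1, \ldots, a_{n-1})$, where the $a_i$ describe the residual configuration, the density of the $a_i$ is free of $\mu$ and gives

```latex
% Marginal likelihood of sigma in the N(mu, sigma^2) model,
% with mu as the nuisance parameter:
\[
  M(\sigma) \;\propto\; \sigma^{-(n-1)}
  \exp\!\Bigl\{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\Bigr\},
\]
% so the marginal relative likelihood is
\[
  M_R(\sigma) \;=\; \frac{M(\sigma)}{\sup_{\sigma} M(\sigma)},
  \qquad
  \hat{\sigma}^{2} \;=\; \frac{1}{\,n-1\,}\sum_{i=1}^{n}(x_i-\bar{x})^{2},
\]
% where the maximizing value divides by n - 1 rather than n,
% in contrast with the ordinary MLE of sigma^2.
```

The divisor $n - 1$ rather than $n$ shows how the marginal likelihood avoids the degrees-of-freedom loss incurred when the nuisance mean is simply replaced by its MLE.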

4. Conditional Likelihoods

These are defined in a way which is wholly analogous to the above. It is supposed that the following two conditions hold.

(i) That there is a factorization of the form

$$f(x_1, \ldots, x_n; \theta, \phi)\, dx_1 \cdots dx_n = g(x_1, \ldots, x_n; \theta \mid T_1, \ldots, T_k)\, \frac{dx_1 \cdots dx_n}{dT_1 \cdots dT_k} \times h(T_1, \ldots, T_k; \theta, \phi)\, dT_1 \cdots dT_k$$

where the statistics $T_i$, which by implication are jointly sufficient for $\phi$, may or may not be functions of $\theta$.

(ii) The factor $h(T_1, \ldots, T_k; \theta, \phi)\, dT_1 \cdots dT_k$ contains no "available information" about $\theta$ if nothing is known about $\phi$.

The "conditional likelihood" of $\theta$ is then defined as a function of $\theta$ proportional to

$$c(\theta) = g(x_1, \ldots, x_n; \theta \mid T_1, \ldots, T_k)\, \frac{dx_1 \cdots dx_n}{dT_1 \cdots dT_k},$$

which when standardized as

$$C_R(\theta) = \frac{c(\theta)}{\sup_{\theta \in \Theta} c(\theta)}$$

is called the "conditional relative likelihood function" of $\theta$.
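A classical instance of this construction, not taken from the survey (the model and the observed counts are illustrative assumptions): if $x \sim \text{Poisson}(\phi)$ and $y \sim \text{Poisson}(\phi\theta)$ independently, then $T = x + y$ is sufficient for the nuisance $\phi$, and conditionally $y \mid T \sim \text{Binomial}(T, \theta/(1+\theta))$, which is free of $\phi$:

```python
import numpy as np
from math import comb

# Hypothetical data: x ~ Poisson(phi), y ~ Poisson(phi * theta).
x_obs, y_obs = 7, 13
T = x_obs + y_obs          # sufficient for the nuisance parameter phi

def conditional_likelihood(theta):
    # y | T ~ Binomial(T, theta / (1 + theta)), free of phi
    p = theta / (1 + theta)
    return comb(T, y_obs) * p ** y_obs * (1 - p) ** (T - y_obs)

thetas = np.linspace(0.1, 6.0, 600)
c = np.array([conditional_likelihood(t) for t in thetas])
CR = c / c.max()           # conditional relative likelihood
print(thetas[np.argmax(CR)])   # near y/x = 13/7 ~ 1.857
```

Here the conditional relative likelihood peaks at $\hat{\theta} = y/x$, and no estimate of $\phi$ is ever required.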

5. Second-order Likelihoods

This concept is introduced by considering the situation where we have pairs of observations $(x_i, y_i)$, $i = 1, \ldots, n$, such that $x_i$ and $y_i$ are independently distributed as $N(\xi_i, \sigma^2)$ and $N(U_i, \lambda\sigma^2)$ respectively, and a linear functional relationship $U_i = \theta\xi_i + \delta$ exists. Thus $\sigma$ is a structural parameter and the $\xi_i$ are incidental parameters. The parameters of interest here are therefore $\theta$ and $\delta$, since they define the functional relationship. Now if $\theta$ is known then it is shown that the marginal relative likelihood of $\delta$, assuming $\lambda$ also known, is

$$MR_1(\delta \mid \theta) = (1 + u^2)^{-\frac{1}{2}(n-1)}$$

where

$$u^2 = \frac{n(\theta\bar{x} - \bar{y} + \delta)^2}{\sum_i \{\theta(x_i - \bar{x}) - (y_i - \bar{y})\}^2}.$$

Since, however, $MR_1(\delta \mid \theta)$ is independent of $\lambda$, $\lambda$ has been eliminated without being estimated.¹ Thus although $\lambda$ is assumed known in the derivation of the result above, it is now apparent that if $\lambda$ is unknown we can, by means of $MR_1(\delta \mid \theta)$, make inferences about $\delta$ for a given $\theta$. A logical deduction from this is that we can therefore make inferences about an unknown $\theta$ for a given $\delta$. This is apparent by observing that if, for example, we know $\delta = 0$, then a low value of $MR_1(\theta \mid 0)$ arising from a value of $\theta$ under test would suggest that such a value is implausible, and similarly a large value of $MR_1(\theta \mid 0)$ would suggest it is a likely value. Viewed in this way, $MR_1(\theta \mid 0)$ is essentially a likelihood function, and since it bears the same relation to the likelihood as the likelihood does to the probability distribution, it is described as a second-order likelihood.

(B) The following "basic notions" are summarized in a paper by Kallianpur and Rao (1955).

(1) Any transformation $f(x)$ which has for its domain a subset $D$ of a normed linear space $E$ and the set of real numbers for its range is called a (real-valued) functional.

(2) A functional $f$ is continuous at $x_0 \in D$ if $\lim f(x) = f(x_0)$ as $\|x - x_0\| \to 0$.

(3) A functional $f$ having $E$ for its domain is said to be additive if for any two elements $x_1$ and $x_2$, $f(x_1 + x_2) = f(x_1) + f(x_2)$.

1 It is well known that if $\lambda$ is unknown then not all the parameters are identifiable. This means that a direct ML method cannot be applied. We have already alluded to this point in section 9.

(4) An additive, continuous functional $f$ (defined over $E$) is called linear.

(5) Let $f$ be any functional with domain $D \subseteq E$. Then $f$ is said to possess a Fréchet differential (F-differential) at $x_0$ if, for $h \in E$ such that $x_0 + h \in D$, there exists a functional $L(x_0, h)$, linear in $h$, such that

$$\lim_{\|h\| \to 0} \frac{f(x_0 + h) - f(x_0) - L(x_0, h)}{\|h\|} = 0.$$
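A simple worked example of definition (5), not drawn from the survey itself: take $E = R^n$ with the Euclidean norm and the functional $f(x) = \|x\|^2$. Then

```latex
% f(x) = \|x\|^2 = \sum_i x_i^2 on E = R^n. For any x_0 and h,
\[
  f(x_0 + h) - f(x_0) \;=\; 2\sum_i x_{0i}\, h_i \;+\; \|h\|^{2},
\]
% so L(x_0, h) = 2 \sum_i x_{0i} h_i is linear in h, and the remainder
\[
  \frac{f(x_0 + h) - f(x_0) - L(x_0, h)}{\|h\|}
  \;=\; \frac{\|h\|^{2}}{\|h\|} \;=\; \|h\| \;\longrightarrow\; 0
  \quad \text{as } \|h\| \to 0,
\]
% so f possesses the F-differential L(x_0, h) at every x_0.
```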
