arXiv:0804.2996v1 [stat.ME] 18 Apr 2008

Statistical Science
2007, Vol. 22, No. 4, 598–620
DOI: 10.1214/07-STS249
© Institute of Mathematical Statistics, 2007

The Epic Story of Maximum Likelihood
Stephen M. Stigler

Abstract. At a superficial level, the idea of maximum likelihood must be prehistoric: early hunters and gatherers may not have used the words “method of maximum likelihood” to describe their choice of where and how to hunt and gather, but it is hard to believe they would have been surprised if their method had been described in those terms. It seems a simple, even unassailable idea: Who would rise to argue in favor of a method of minimum likelihood, or even mediocre likelihood? And yet the mathematical history of the topic shows this “simple idea” is really anything but simple. Joseph Louis Lagrange, Daniel Bernoulli, Leonard Euler, Pierre Simon Laplace and Carl Friedrich Gauss are only some of those who explored the topic, not always in ways we would sanction today. In this article, that history is reviewed from back well before Fisher to the time of Lucien Le Cam’s dissertation. In the process Fisher’s unpublished 1930 characterization of conditions for the consistency and efficiency of maximum likelihood estimates is presented, and the mathematical basis of his three proofs discussed. In particular, Fisher’s derivation of the information inequality is seen to be derived from his work on the analysis of variance, and his later approach via estimating functions was derived from Euler’s Relation for homogeneous functions. The reaction to Fisher’s work is reviewed, and some lessons drawn.

Key words and phrases: R. A. Fisher, Karl Pearson, Jerzy Neyman, Harold Hotelling, Abraham Wald, maximum likelihood, sufficiency, efficiency, superefficiency, history of statistics.

Stephen M. Stigler is the Ernest DeWitt Burton Distinguished Service Professor, Department of Statistics, University of Chicago, Chicago, Illinois 60637, USA e-mail: [email protected].

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2007, Vol. 22, No. 4, 598–620. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

In the 1860s a small group of young English intellectuals formed what they called the X Club. The name was taken as the mathematical symbol for the unknown, and the plan was to meet for dinner once a month and let the conversation take them where chance would have it. The group included the Darwinian biologist Thomas Henry Huxley and the social philosopher-scientist Herbert Spencer. One evening about 1870 they met for dinner at the Athenaeum Club in London, and that evening included one exchange that so struck those present that it was repeated on several occasions. Francis Galton was not present at the dinner, but he heard separate accounts from three men who were, and he recorded it in his own memoirs. As Galton reported it, during a pause in the conversation Herbert Spencer said, “You would little think it, but I once wrote a tragedy.” Huxley answered promptly, “I know the catastrophe.” Spencer declared it was impossible, for he had never spoken about it before then. Huxley insisted. Spencer asked what it was. Huxley replied, “A beautiful theory, killed by a nasty, ugly little fact” (Galton, 1908, page 258).

Huxley’s description of a scientific tragedy is singularly appropriate for one telling of the history
of Maximum Likelihood. The theory of maximum likelihood is very beautiful indeed: a conceptually simple approach to an amazingly broad collection of problems. This theory provides a simple recipe that purports to lead to the optimum solution for all parametric problems and beyond, and not only promises an optimum estimate, but also a simple all-purpose assessment of its accuracy. And all this comes with no need for the specification of an a priori distribution, and no complicated derivation of distributions. Furthermore, it is capable of being automated in modern computers and extended to any number of dimensions.

But as in Huxley’s quip about Spencer’s unpublished tragedy, some would have it that this theory has been “killed by a nasty, ugly little fact,” most famously by Joseph Hodges’s elegant simple example in 1951, pointing to the existence of “superefficient” estimates (estimates with smaller asymptotic variance than the maximum likelihood estimate). See Figure 1. And then, just as with fatally wounded slaves in the Roman Colosseum, or fatally wounded bulls in a Spanish bullring, the theory was killed yet again, several times over by others, by ingenious examples of inconsistent maximum likelihood estimates.

Joe Hodges’s Nasty, Ugly Little Fact (1951)

$$T_n = \begin{cases} \bar{X}_n & \text{if } |\bar{X}_n| \ge \dfrac{1}{n^{1/4}}, \\[1ex] \alpha \bar{X}_n & \text{if } |\bar{X}_n| < \dfrac{1}{n^{1/4}}. \end{cases}$$

Then $\sqrt{n}\,(T_n - \theta)$ is asymptotically $N(0, 1)$ if $\theta \ne 0$, and asymptotically $N(0, \alpha^2)$ if $\theta = 0$. $T_n$ is then “super-efficient” for $\theta = 0$ if $\alpha^2 < 1$.

Fig. 1. The example of a superefficient estimate due to Joseph L. Hodges, Jr. The example was presented in lectures in 1951, but was first published in Le Cam (1953). Here $\bar{X}_n$ is the mean of a random sample of size $n$ from a $N(\theta, 1)$ population, with $n \operatorname{Var}(\bar{X}_n) = 1$ all $n$, all $\theta$ (Bahadur, 1983; van der Vaart, 1997).

The full story of maximum likelihood is more complicated and less tragic than this simple account would have it. The history of maximum likelihood is more in the spirit of a Homeric epic, with long periods of peace punctuated by some small attacks building to major battles; a mixture of triumph and tragedy, all of this dominated by a few characters of heroic stature if not heroic temperament. For all its turbulent past, maximum likelihood has survived numerous assaults and remains a beautiful, if increasingly complicated theory. I propose to review that history, with a sketch of the conceptual problems of the early years and then a closer look at the bold claims of the 1920s and 1930s, and at the early arguments, some unpublished, that were devised to support them.

2. THE EARLY HISTORY OF MAXIMUM LIKELIHOOD

By the mid-1700s it seems to have become a commonplace among natural philosophers that problems of observational error were susceptible to mathematical description. There was essential agreement upon some elements of that description: errors, for want of a better assumption, were supposed equally able to be positive and negative, and large errors were expected to be less frequently encountered than small. Indeed, it was generally accepted that their distribution followed a smooth symmetric curve. Even the goal of the observer was agreed upon: while the words employed varied, the observer sought the most probable position for the object of observation, be it a star declination or a geodetic location. But in the few serious attempts to treat this problem, the details varied in important ways. It was to prove quite difficult to arrive at a precise formulation that incorporated these elements, covered useful applications, and also permitted analysis.

There were early intelligent comments related to this problem already in the 1750s by Thomases Simpson and Bayes and by Johann Heinrich Lambert in 1760, but the first serious assault related to our topic was by Joseph Louis Lagrange in 1769 (Stigler, 1986, Chapter 2; 1999, Chapter 16; Sheynin, 1971; Hald, 1998, 2007). Lagrange postulated that observations varied about the desired mean according to a multinomial distribution, and in an analytical tour de force he showed that the probability of a set of observations was largest if the relative frequencies of the different possible values were used as the values of the probabilities. In modern terminology, he found that the maximum likelihood estimates of the multinomial probabilities are the sample relative frequencies. He concluded that the most probable value for the desired mean was then the mean value found from these probabilities, which is the arithmetic mean of the observations. It was only then, and contrary to modern practice, that Lagrange introduced the hypothesis that the multinomial probabilities followed a symmetric curve, and so he was left with only the problem of finding the probability distribution of the arithmetic mean when the error probabilities follow a curve. This he solved for several examples by introducing and using “Laplace Transforms.” By introducing restrictions in the form of the curve only after deriving the estimates of probabilities, Lagrange’s analysis had the curious consequence of always arriving at method of moment estimates, even though starting with maximum likelihood! (Lagrange, 1776; Stigler, 1999, Chapter 14; Hald, 1998, page 48.)

At about the same time, Daniel Bernoulli considered the problem in two successively very different ways. First, in 1769 he tried using the hypothesized curve as a weight function, in order to weight, then iteratively reweight and average the observations. This was very much like some modern robust M-estimates. Second, in 1778 (possibly after he had seen a 1774 memoir of Laplace’s with a Bayesian analytical formulation), Bernoulli changed his view dramatically and used the same curve as a density for single observations. He multiplied these densities together, and he sought as the true value for the observed quantity that value that made the product a maximum (Bernoulli, 1769, 1778; Stigler, 1999, Chapter 14; Laplace, 1774).

These and the other attempts of that time were primarily theoretical explorations, and did not attract many practical applications or further development. And while they all used phrases that could easily be translated into modern English as “Maximum Likelihood,” and in some cases even be defended as maximum likelihood, in no case was there a reasoned defense for them or their performance. The most that was to be found was the superficial invocation that the value derived was “most probable” because it made the only probability in sight (the probability of the observed data) as large as possible.

The philosophically most cogent of these early treatments was that of Gauss, in his first publication on least squares in 1809 (Gauss, 1809). Gauss, like Daniel Bernoulli in 1778, adopted Laplace’s analytical formulation, but unlike Bernoulli, Gauss explicitly invoked Laplace’s Bayesian perspective using a uniform prior distribution for the unknowns. Where Laplace had then sought (and found) the posterior median (which minimized the posterior expected error), Gauss chose the posterior mode. In accord with modern maximum likelihood with normally distributed errors, this led Gauss to the method of least squares. The simplicity and tractability of the analysis made this approach very popular over the nineteenth century. By the end of that century this was sometimes known as the Gaussian method, and the approach became the staple of many textbooks, often without the explicit invocation of a uniform prior that Gauss had seen as needed to justify the procedure.

3. KARL PEARSON AND L. N. G. FILON

Over the 19th century, the theory of estimation generally remained around the level Laplace and Gauss left it, albeit with frequent retreats to lower levels. With regard to maximum likelihood, the most important event after Gauss’s publication of 1809 occurred only on the eve of a new century, with a long memoir by Karl Pearson and Louis Napoleon George Filon, published in the Transactions of the Royal Society of London in 1898 (Pearson and Filon, 1898). The memoir has a place in history, more for what in the end it seemed to suggest than for what it accomplished. The two authors considered a very general setting for the estimation problem: a set of multivariate observations with a distribution depending upon a potentially large array of constants to be determined. They did not refer to the constants as parameters, but it would be hard for a modern reader to view them in any other light, even though a close reading of the memoir shows that it lacked the parametric view Fisher was to introduce more than 20 years later (Stigler, 2007).

The main result of Pearson and Filon (expressed in modern terminology) came from taking a likelihood ratio (a ratio of the frequency distribution of the observed data and the frequency distribution evaluated for the same data, but with the constants slightly perturbed), expanding its logarithm in a multivariate Taylor’s expansion, then approximating the coefficients by their expected values and claiming that the resulting expression gave the frequency distribution of the errors made in estimating the constants. They erred in taking the limit of the coefficients, in effect using a procedure that did not at all depend upon the method of estimation used and would at most be valid for maximum likelihood estimates, a fact they failed to recognize. Their last step employed an implicit Bayesian step in the manner of Gauss. When cubic and higher order terms were neglected, their formula would give a multivariate normal posterior distribution (extending results of Laplace a century earlier), although Pearson and
Filon cautioned against doing this with skewed frequency distributions. A modern reader would recognize their resulting distribution as the normal distribution sometimes used to approximate the distribution of maximum likelihood estimates, but Pearson and Filon made no such restriction in the choice of estimate and applied it heedlessly to all manner of estimates, particularly to method of moments estimates.

The result may in hindsight be seen to be a mess, not even applying to the examples presented, and the approach was soon to be abandoned by Pearson himself. But it led to some correct results for the bivariate normal correlation coefficient, and it was bold and surely highly suggestive to a reader like Ronald Fisher, to whom I now turn. I have recently published a detailed study (Stigler, 2005) of how Fisher was led to write his 1922 watershed work on “The Mathematical Foundations of Theoretical Statistics,” so I will only briefly review the main points leading to that memoir.

4. R. A. FISHER

At Cambridge Fisher had studied the theory of errors and even published in 1912 a short piece commending the virtues of the Gaussian approach to estimation, particularly of the standard deviation of a normally distributed sample. He had been so taken by the invariance of the estimates so derived, how (for example) the estimate of the square of a frequency constant was the square of the estimate of the constant, that he termed the criterion “absolute” (Fisher, 1912). But his approach at that time was superficial in most respects, tacitly endorsing the naïve Bayesian approach Gauss had used, without noticing the lurking inconsistency in even the example he considered, in that the estimate of the squared standard deviation based upon the distribution of the data, namely $\frac{1}{n}\sum (x_i - \bar{x})^2$, did not agree with that found applying the same principle to the distribution of $\frac{1}{n}\sum (x_i - \bar{x})^2$ alone.

Four years later, Fisher sent to Pearson for possible publication a short, equally superficial critique of a Biometrika article by Kirstine Smith advocating the minimum chi-square approach to estimation (Smith, 1916). Pearson’s thoughtful rejection letter to Fisher focused on the lack of a clear and convincing rationale for the method of choosing constants to maximize the frequency function, and Pearson even stated that he now thought the Pearson–Filon paper was remiss on the same count. He called particular attention to a perceptive footnote in Smith’s paper that argued the case against the Gaussian method: the probability being maximized was not a probability but rather a probability density, an infinitesimal probability, and of what force was such meager evidence in defense of a choice? At least the minimum chi-square method optimized with respect to an actual metric. Two more years passed, and in 1918 Fisher discovered sufficiency in the context of estimating the normal standard deviation (Fisher, 1920); he recalled Pearson’s challenge to produce a rationale for the method, and he was off to the races, quickly setting to work on the monumental paper on the theory of statistics that he read to the Royal Society in November 1921 and published in 1922.

5. FISHER’S FIRST PROOF

By my reconstruction, Fisher’s discovery of sufficiency was quickly followed by the development of a short argument that he gave in that great 1922 paper; indeed it was the first mathematical argument in the paper. The essence of the argument in modern notation is the following. Suppose you have two candidates as estimates for a parameter $\theta$, denoted by $S$ and $T$. Suppose that $T$ is a sufficient statistic for $\theta$. Since generally both $S$ and $T$ are approximately normal with large samples, let us (anticipating a species of argument Wald was to develop rigorously in 1943) follow Fisher in considering that $S$ and $T$ actually have a bivariate normal distribution, both with expectation $\theta$, and with standard deviations $\sigma_S$ and $\sigma_T$ and correlation $\rho$. Then the standard facts of the bivariate normal distribution tell us that $E(S \mid T = t) = \theta + \rho(\sigma_S/\sigma_T)(t - \theta)$. Since $T$ is sufficient, this cannot depend upon $\theta$, which is only possible if $\rho(\sigma_S/\sigma_T) = 1$, or if $\sigma_T = \rho\sigma_S \le \sigma_S$. Thus $T$ cannot have a larger mean squared error than any other such estimate $S$, and so must be optimum according to a clear metric criterion, expected squared error! In one stroke Fisher had (if one accepts the substitution of exact for approximate normality) the simple and powerful result:

Sufficiency implies optimality, at least when combined with consistency and asymptotic normality.

The question was, how general is this result? Neither Fisher nor much of posterity thought of consistency and asymptotic normality as major restrictions. After all, who would use an inconsistent estimate, and while there are noted exceptions, is not asymptotic normality the general rule? Indeed, Fisher clearly knew the result was stronger than this, that a sufficient estimate captured all the information in the data in even stronger senses; the argument was only to present the claim in terms of a specific criterion, minimum mean squared error. But what about sufficiency?

At this point Fisher appears to have made an interesting and highly productive mistake. He quickly explored a number of other parametric examples and came to the conclusion that maximizing the likelihood always led to an estimate that was a function of a sufficient statistic! When he read the paper to the Royal Society in November 1921, his abstract, as printed in Nature (November 24, 1921), emphatically stated, “Statistics obtained by the method of maximum likelihood are always sufficient statistics.” And from this it would follow, with the minor quibble that perhaps consistency and asymptotic normality may be needed, that maximum likelihood estimates are always optimum. A truly beautiful theory was born, after over a century and a half in gestation.

Even as the paper was being readied for press, doubts occurred to the one person best equipped to understand the theory, Fisher himself. The bold claim of the abstract does not appear in the published version; neither does its denial. He expressed himself in this way:

“For the solution of problems of estimation we require a method which for each particular problem will lead us automatically to the statistic by which the criterion of sufficiency is satisfied. Such a method is, I believe, provided by the Method of Maximum Likelihood, although I am not satisfied as to the mathematical rigour of any proof which I can put forward to that effect. Readers of the ensuing pages are invited to form their own opinion as to the possibility of the method of maximum likelihood leading in any case to an insufficient statistic. For my own part I should gladly have withheld publication until a rigourously complete proof could be formulated; but the number and variety of new results which the method discloses press for publication, and at the same time I am not insensible of the advantage which accrues to Applied Mathematics from the co-operation of the Pure Mathematician, and this co-operation is not infrequently called forth by the very imperfections of writers on Applied Mathematics” (Fisher, 1922, page 323).

The 1922 paper did present several related arguments in addition to the Waldian one I reported above. It stated less boldly a converse of the statement in the 1921 abstract that, “it appears that any statistic which fulfils the condition of sufficiency must be a solution obtained by the method of the optimum [e.g. maximum likelihood]” (page 331). But Fisher did not now claim that a sufficient statistic need always exist. Instead Fisher gave an improved non-Bayesian version of the Pearson–Filon argument for asymptotic normality, expanding the likelihood function about the true value and pointing out how and why the argument requires maximum likelihood estimates (and that it would not apply to moment estimates), and how it could be used to assess the accuracy of maximum likelihood estimates (pages 328–329). And there, in a long footnote, he called Karl Pearson to task for not earlier calling attention himself to the error in the 1898 paper. Fisher noted that in 1903 Pearson had published correct standard errors for moment estimates, even while citing the 1898 paper without noting that the standard errors given in 1898 for several examples were wrong. In the 1922 paper Fisher also pointedly included a section illustrating the use of maximum likelihood for Pearson’s Type-III distributions (gamma distributions), contrasting his results with the erroneous ones Pearson and Filon had given in 1898 for the same family.

6. THREE YEARS LATER

By 1925 Fisher’s earlier optimism had faded somewhat, and he prepared a revised version of his theory for presentation to the Cambridge Philosophical Society. At some point in the interim he had recognized that sufficient statistics of the same dimension as the parameter did not always exist. What led to this realization? Fisher did not say, although in a 1935 discussion he wrote, “I ought to mention that the theorem that if a sufficient statistic exists, then it is given by the method of maximum likelihood was proved in my paper of [1922].... It was this that led me to attach especial importance to this method. I did not at that time, however, appreciate the cases in which there is no sufficient statistic, or realize that other properties of the likelihood function, in addition to the position of its maximum, could supply what was lacking” (Fisher, 1935, page 82). I speculate that he learned this in considering a problem where no sufficient statistic exists, namely the problem that figured prominently in the 1925 paper, the estimation of a location parameter for a Cauchy distribution. In any event, in that 1925 paper Fisher did not dwell on this discovery of insufficiency; quite the contrary. The possibility that sufficient statistics need not exist was only casually noted as a fact 14 pages into the paper, and a reader of both the 1922 and 1925 papers might not even notice the subtle shift in emphasis that had taken place.

Where in 1922 Fisher started with consistency and sufficiency, in 1925 he began with efficiency. Writing of consistent and asymptotically normal estimates, he stated, “The criterion of efficiency requires that the fixed value to which the variance of a statistic (of the class of which we are speaking) multiplied by n, tends, shall be as small as possible. An efficient statistic is one for which this criterion is satisfied” (page 703). With this in mind, his main claim now was (page 707), “We shall see that the method of maximum likelihood will always provide a statistic which, if normally distributed in large samples with variance falling off inversely to the sample number, will be an efficient statistic.”

Thus in 1925 the theory said that if there is an efficient statistic, then the maximum likelihood estimate is efficient. When a sufficient and consistent estimate exists, it will also be maximum likelihood, but that is not necessary for efficiency. He granted that more than one efficient estimate could exist, but he repeated a proof he had already given in 1924 (Fisher, 1924a) that any two efficient estimates are correlated with correlation that approaches 1.0 as $n$ increases.

7. THE 1925 “ANOVA” PROOF

What did Fisher offer by way of proof of this new efficiency-based formulation? His 1922 treatment had leaned crucially on sufficiency, but that was no longer generally available. In its place he depended upon a new and limited but mathematically rather clever proof that I will call the “analysis of variance proof.” The proof was clearly based upon a probabilistic version of the analysis of variance breakdown of a sum of squares that Fisher was developing separately at about the same time for agricultural field trials. Fisher’s own 1925 presentation of the argument is fairly opaque and does not explain clearly its underlying logic; in 1935 he gave an improved presentation that helps some (Fisher, 1935, pages 42–44). The mathematical details of the proof have been clearly re-presented by Hinkley (1980) at some length. I will be content to offer only a sketch emphasizing the essence of the argument, what I believe to be the logical development Fisher had in mind. It will help the historical discussion to divide his 1925 argument into two parts, just as Fisher did in the 1935 version.

Let $f(x; \theta)$ be the density of a single observation, and let $\phi$ be the likelihood function for a sample of $n$ independent observations, so that $\log \phi = \sum \log f$. Following Fisher, let $X = \frac{1}{\phi}\frac{\partial \phi}{\partial \theta} = \frac{\partial}{\partial \theta} \log \phi$, what we now sometimes refer to as the score function. Fisher was only concerned here with situations where the maximum likelihood estimate could be found from solving the equation $X = 0$ for $\theta$. The first part of the argument was really more of a restatement of what he had shown in 1922: from expanding the score function in a Taylor series, he had that the score function was approximately a linear function of the maximum likelihood estimate; as he put it, $X = -nA(\theta - \hat{\theta})$ “if $\theta - \hat{\theta}$ is a small quantity of order $n^{-1/2}$,” where his $nA$ denoted what we now call the Fisher information in a sample, $I(\theta)$. Since under fairly general regularity conditions $E(X) = \int \frac{1}{\phi}\frac{\partial \phi}{\partial \theta}\,\phi = \int \frac{\partial \phi}{\partial \theta} = \frac{\partial}{\partial \theta}\int \phi = \frac{\partial}{\partial \theta} 1 = 0$, we also have $\operatorname{Var}(X) = I(\theta)$. As Fisher noted, $I(\theta)$ may be found from any of the alternative expressions

$$I(\theta) = -E\left[\frac{\partial^2 \log \phi}{\partial \theta^2}\right] = E\left[\left(\frac{\partial \log \phi}{\partial \theta}\right)^2\right] = -nE\left[\frac{\partial^2 \log f}{\partial \theta^2}\right] = nE\left[\left(\frac{\partial \log f}{\partial \theta}\right)^2\right].$$

Fisher did not discuss conditions under which the linear approximation would prove adequate; he was content to exploit it as a simple route to the asymptotic distribution of the maximum likelihood estimate, namely $N(\theta, 1/I(\theta))$. Thus far he had not gone beyond the 1922 argument.

The part of the argument that was novel in 1925, the “ANOVA proof,” then went as follows: Let $T$ be any estimate of $\theta$, assumed to be consistent and asymptotically normal $N(\theta, V)$. In the proof Fisher used this as the exact distribution of $T$, and further treated $V$ as not depending upon $\theta$, as would approximately be the case for “reasonable” estimates
$T$ in what we now call “regular” parametric problems. Fisher considered the score function $X$ as a function of the sample and looked at its variation over different samples in two ways. The first was to consider the total variation of $X$ over all samples, namely its variance $\operatorname{Var}(X) = I(\theta)$. And for the second, he evaluated $\operatorname{Var}(X \mid T)$, the conditional variation in $X$ given the value of $T$ for the sample (i.e., the variance of $X$ among all samples that give the same value for $T$). From this he computed $E[\operatorname{Var}(X \mid T)]$, which he found equal to $\operatorname{Var}(X) - 1/V$. Since $\operatorname{Var}(X) = E[\operatorname{Var}(X \mid T)] + \operatorname{Var}[E(X \mid T)]$ (this is the ANOVA-like breakdown I refer to), this would give $\operatorname{Var}[E(X \mid T)] = 1/V$. But $\operatorname{Var}(X \mid T) \ge 0$ always, which implies that necessarily $E[\operatorname{Var}(X \mid T)] \ge 0$, and so $\operatorname{Var}(X) - 1/V \ge 0$. This gave $\frac{1}{V} \le I(\theta)$, or $V \ge \frac{1}{I(\theta)}$ for any such $T$, with equality for efficient estimates: what we now refer to as the information inequality. Thus if the maximum likelihood estimate indeed has asymptotic variance $1/I(\theta)$, he had established efficiency.

The logic of the proof, and the likely route that led Fisher to it, seems clear. If there were a sufficient statistic $S$, then the factorization theorem (which Fisher had recognized in 1922, at least in part) would give $\phi = C \cdot h(S; \theta)$, where the proportionality factor $C$ may depend upon the sample but not on $\theta$. By sufficiency, $X$ would then depend upon the sample only through $S$, and so $\operatorname{Var}(X \mid S) = 0$ for all values of $S$, and consequently $E[\operatorname{Var}(X \mid S)] = 0$ also. Also, if $S$ is sufficient, the maximum likelihood estimate (found through solving $X = 0$ for $\theta$) is a function of $S$. The failure of $T$ to capture all of the information in the sample is then reflected through the variation in the values of $X$ given $T$, namely through $\operatorname{Var}(X \mid T)$ and thus $E[\operatorname{Var}(X \mid T)]$. This latter quantity plays the role of a residual sum of squares and measures the loss of efficiency of $T$ over $S$ (or at least over what would have been achievable had there been a sufficient statistic).

What is more, this interpretation gave Fisher a target to pursue in trying to measure the amount of lost information, or even to determine how one might recover it, just as in an analysis of variance one can advance the analysis by introducing factors that lead to a decrease in the residual sum of squares. In the remainder of the 1925 paper Fisher pursued just such courses. He introduced both the term and the concept of an ancillary statistic, in effect as a covariate designed to reduce the residual sum of squares toward its theoretical minimum achievable value. He gave particular attention to multinomial problems and focused on a study of the information loss when no sufficient estimate existed, and the loss in information in using an estimate that was efficient but not maximum likelihood (e.g., a minimum chi-square estimate). He found the latter difference tended to a finite limit, a measure of what C. R. Rao (1961, 1962) was later to term “second-order efficiency.”

By 1935 Fisher evidently had come to see the first part of the argument (the part establishing that the maximum likelihood estimate actually achieved the lower bound $1/I(\theta)$) as unsatisfactory, and he offered in its place a different argument to show the bound was achieved. That argument (Fisher, 1935, pages 45–46) was derived from what I will call his third proof; I shall comment on it later in that connection.

Fisher’s 1925 work was conceptually deep and has been the subject of much fruitful modern discussion, particularly by Efron (1975, 1978, 1982, 1998), Efron and Hinkley (1978) and Hinkley (1980).

8. AFTER 1925: CORRESPONDENCE WITH HOTELLING

Fisher’s beautiful theory had become more complicated but was still quite attractive. The proofs Fisher offered in 1925 were not such as would satisfy the Pure Mathematician he had referred to in 1922, nor would they withstand the challenges that would come a quarter century later. Were they all that he could offer? To answer this, it would help us to listen in on a dialogue between Fisher and a non-hostile, highly intelligent party. Many in the audience in England who were interested in this question had axes to wield, and Fisher’s transparent digs at Karl Pearson, even though they came in the form of legitimately pointing out major errors in Pearson’s previous work, just set those axes a-grinding. But there was one reader who approached Fisher’s level as a mathematician and was so distant both geographically (he was in California) and scientifically (he was working on crop estimating at that time) that he was able to engage in just such a dialogue. I refer to Harold Hotelling.

Hotelling received his Ph.D. from Princeton University in 1924, for a dissertation in point set topology. In that same year he joined the Food Research Institute at Stanford University, where he worked on agricultural problems. Soon after, he discovered
Fisher through Fisher’s 1925 book, Statistical Methods for Research Workers. Hotelling reviewed that book for JASA; in fact he reviewed each of the first seven editions, and the first three of these were volunteered reviews, not requested by the Editor (Hotelling, 1951). He started up a correspondence with Fisher, and tried unsuccessfully to get Fisher to visit Stanford in 1928 and 1929 (Stigler, 1999a). After several friendly exchanges of letters, on October 15, 1928, Fisher (who had had several requests from others for detailed mathematical proofs) wrote, asking Hotelling, “Now I want your considered opinion as to the [...] of collecting such scraps of theory as are needed to prove just what is wanted for my practical methods.” Hotelling replied December 8, strongly encouraging such a work as valuable for mathematics generally, and stated that “a knowledge of the grounds for belief in a theory helps to dispel the absurd notions which tend to cluster even about sound doctrines.” Fisher’s Christmas Eve 1928 reply proposed that they collaborate:

24 Dec ’28

Dear Prof. Hotelling

Your letter has arrived on Christmas Eve, and has given me plenty to think about for the holidays. You will not expect too much of my answer, as you see that I am writing first and thinking afterwards; but I can see already that I have a great deal to thank you for.

After a few hours consideration I believe my right course is to send you a draft contents, to be pulled to pieces or recast as much as you like, and to say I will do my best to fill the bill if you will be joint author and be responsible for the pure mathematics. If you consent to this and to taking the first decision, like an editor, as to inclusion or exclusion, on the clear understanding that either of us may throw it up as soon as we think it is not worth while, I will start sending stuff in. It will be mostly new as many of the proofs can be done much better than in my old publications.

Have you all my old stuff? I believe you have, but if not I will try to find anything still lacking.

It seems a monstrous lot of work, but I will not grumble if I need not think too much about arrangement.

Yours sincerely
R. A. Fisher
[Hotelling Papers Box 3]

Fisher’s draft table of contents is given as Appendix 1 below. That work was never to be completed. There was no apparent split between the two, but as the project went on, Fisher’s increasing focus on genetics as his 1930 book The Genetical Theory of Natural Selection went through the press, and Hotelling’s move in 1931 to the Department of Economics at Columbia University, were likely causes for the drop in [...]. By February 1930 Fisher was writing, “It is a grind getting anything serious done in the way of a text book; I hope you will stick to yours, though; as well as developing the purely mathematical developments.” Nonetheless, Hotelling spent nearly six months at Rothamsted over the last half of 1929 and saw quite a bit of Fisher over that time. Hotelling returned to the United States in late December in time to submit a paper to the American Mathematical Society (AMS) and present it at their meeting in Des Moines, December 31. That paper was entitled, “The consistency and ultimate distribution of optimum statistics”; that is, on the consistency and asymptotic normality of maximum likelihood estimates. It was published in the October 1930 issue of the Transactions of the AMS.

It is a reasonable guess that the approach taken in the paper reflected Fisher’s views to some degree, coming directly after the long visit with Fisher, although Fisher apparently played no direct role in the writing. At any rate, when Fisher wrote to Hotelling on the 7th of January in 1930 to thank him for a copy, Fisher’s only complaint was that the definition of “consistency” Hotelling gave was slightly different from Fisher’s. Fisher wrote,

“It is worth noting to avoid future confusion that you are using consistency in a somewhat different sense from mine. To me a statistic is inconsistent if it tends to the wrong limit as the sample is increased indefinitely. I do not think I have ever attempted to apply the distinction of consistency or inconsistency to statistics which tend to no limit, whereas you call them all inconsistent. Thus I should not call the mean of a sample from

$$\frac{1}{\pi}\,\frac{dx}{1 + (x - m)^2}$$

an inconsistent statistic, though you would. Congratulations on a very fine paper.”
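Fisher’s Cauchy example can be checked with a short characteristic-function calculation. The following is only a sketch in modern notation; the symbols $X_i$, $\bar X_n$ and $\varphi$ are introduced here for illustration and do not appear in the correspondence.

```latex
% Density in Fisher's letter: Cauchy with location m and unit scale.
\[
  f(x) = \frac{1}{\pi}\,\frac{1}{1+(x-m)^2},
  \qquad
  \varphi_X(t) = E\,e^{itX} = e^{\,imt-|t|}.
\]
% Characteristic function of the mean of n independent observations X_1,...,X_n:
\[
  \varphi_{\bar X_n}(t)
  = \bigl[\varphi_X(t/n)\bigr]^{n}
  = \bigl[e^{\,imt/n-|t|/n}\bigr]^{n}
  = e^{\,imt-|t|}
  = \varphi_X(t).
\]
```

On this calculation the sample mean has, for every $n$, exactly the distribution of a single observation. It therefore tends to no limit at all: not “inconsistent” in Fisher’s sense (it does not tend to a wrong limit), but “inconsistent” in Hotelling’s.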

Hotelling’s paper is little referred to today, which seems a shame. It is beautifully written, as was most of Hotelling’s work, and among other things he explained Fisher’s own work on this topic more clearly than Fisher ever did. He reviewed Fisher’s proof of asymptotic normality (the one based upon the Pearson–Filon approach), and he gently noted that “it is not clear what conditions, particularly of continuity, are necessary in order that the proofs which have been given shall be valid.” To repair this omission Hotelling offered two explicit proofs for the case of one continuous variable, stating overconfidently that “the extensions to any number of variables are perfectly obvious; and the corresponding theorems for discrete variables follow immediately....” The problem is, as Hotelling’s clear exposition makes apparent to a modern reader, the proof does not work. He simplified the problem by transforming the parameter to a finite interval (if necessary) by an arc tangent transformation, and discretized the observed variable by grouping in a finite number of small intervals, and did not realize that the two combined do not ensure the uniformity he would need to achieve the desired result for other than discrete distributions with bounded parameter sets. The error evidently came to Hotelling’s attention by 5 December 1931, when he circulated a list of 37 “Outstanding Problems in the Theory of Statistics.” Problem #16 on the list was, “Prove the validity of the double limiting process used in the proof of (Hotelling, 1930), for as general a situation as possible.”

9. THE GEOMETRIC SHADOW OF A NASTY LITTLE FACT

To this point there had been not even a hint of the future appearance of any nasty, ugly little fact that might sully the beautiful theory. But then, on November 15, 1930, Hotelling wrote to Fisher with some pointed questions. The letter reflected a geometric view of the inference problem that Hotelling seems to have found in Fisher’s work by 1926 and developed further after their conversations at Rothamsted. Hotelling gave one statement of the view in his 1930 paper (which must have been drafted at Rothamsted), and he restated it in his November letter in different but equivalent notation. The essence is captured by Figure 2, drawn to display what Hotelling conveyed in words and symbols.

Hotelling considered a parameterized multinomial problem with $m$ cells, where the observations are a vector of relative frequencies of counts $x = (x_1, \ldots, x_m)$ taking values in the $m$-dimensional simplex $\sum_{t=1}^{m} x_t = 1$, $x_t \ge 0$ all $t$. Let the probabilities of the cells $f(p) = (f_1(p), \ldots, f_m(p))$ depend upon a parameter $p$; this describes a curve in the simplex as $p$ varies. Let $p = p_0$ denote the true value of the parameter, let $\hat{p}$ be the maximum likelihood estimate of $p$, and let $f(p_0)$ and $f(\hat{p})$ be the points on the curve corresponding to these two values. In his 1930 paper, Hotelling stated further, “The likelihood $L$ is constant over a system of approximately spherical hypersurfaces about [$x$]. The point [$f(\hat{p})$] is the point of the curve which lies on the smallest of the approximate [spheres] meeting the curve, and is therefore approximately the nearest point on the curve to [$x$]” (Hotelling, 1930).

Here then is how Hotelling raised his question in correspondence, in the context of what must have been a shared frame of discourse they had adopted at Rothamsted.

Dear Dr. Fisher:

Thank you very much for your recent letter, with graph and data.

I have been examining various problems in Maximum Likelihood of late; I wonder if you can enlighten me as to the conditions under which your proof holds good regarding the minimum variance of statistics obtained by this method, or rather, as to the exact meaning of the theorem. One of several questions is whether the variance of a statistic or its mean square deviation from the true value should be used as a measure of accuracy.

Denoting by $\hat{p}$ the optimum estimate of a parameter $p$, whose true value is $p_0$, can it be said that the variance of $\hat{p}$, assuming $\hat{p}$ normally distributed, is less than that of any other function of the same observations? Obviously not without further qualification, since a function of the observations can be defined having an arbitrarily small variance. We must therefore restrict the comparison to a special class of functions suitable for estimating $p$, but the definition of this class must not involve $p_0$. How should the class be defined? As the class of consistent statistics? If so, the following difficulty must be faced.

Consider a distribution of frequency among a finite number $m$ of classes, involving a parameter $p$. In a sample of $n$, let $x_t$ be the number

Fig. 2. A reconstruction of Hotelling’s geometric view of the multinomial estimation problem, circa Fall 1929. Here x represents a multinomial observed relative frequency vector in the simplex, and the curve f(p) the potential values of the multinomial probability vector; the true value of the parameter p (p0) is shown, as is the MLE and a contour of the likelihood surface.
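The “approximately spherical” likelihood contours pictured in Figure 2 can be recovered from a second-order expansion of the multinomial log-likelihood. What follows is a sketch of one standard way to see this, not necessarily the route Hotelling or Fisher took; the symbols $c$ (the term not involving $p$) and $\delta_t$ are introduced here for illustration, and all $x_t$ are assumed positive.

```latex
% x_t: observed relative frequencies, f_t(p): cell probabilities, n: sample size.
\[
  \log L(p) = c + n\sum_{t=1}^{m} x_t \log f_t(p).
\]
% Put \delta_t = f_t(p) - x_t, so that \sum_t \delta_t = 0, and expand
% \log(1 + \delta_t/x_t) to second order:
\[
  \log L(p) - c - n\sum_{t} x_t \log x_t
  = n\sum_{t} x_t \log\Bigl(1 + \frac{\delta_t}{x_t}\Bigr)
  \approx -\frac{n}{2}\sum_{t} \frac{\bigl(f_t(p) - x_t\bigr)^2}{x_t}
  \approx -2n\sum_{t} \Bigl(\sqrt{f_t(p)} - \sqrt{x_t}\Bigr)^2 .
\]
```

Under this approximation the likelihood is constant on ellipsoids about the observed point $x$, which become spheres in the coordinates $\sqrt{x_t}$, so maximizing the likelihood along the curve amounts, for large samples, to choosing the point of the curve nearest to $x$ in that metric.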

[Hotelling evidently means relative frequency] falling in the $t$th class. Let $f_t(p)$ be the probability of an individual falling into this class. If we take $x_1, \ldots, x_m$ as coordinates in $m$-space, the equations

$$x_t = f_t(p) \qquad (t = 1, \ldots, m)$$

represent a curve with $p$ as parameter. The points corresponding to samples will form a “globular cluster” (as you so well put it in 1915)^1 about that point on the curve for which $p = p_0$. The method of maximum likelihood corresponds approximately, for large samples, to taking for $\hat{p}$ the parameter of the point of the curve nearest to that representing the sample; i.e., to projecting orthogonally. Now consider some other method of projecting sample points upon the curve; for example an orthogonal projection followed by an alternate stretching and contracting along the curve. Then if $p_0$ happens to give a point in one of the regions of condensation [i.e. high density], this method of estimation will, for sufficiently large samples, yield a statistic with smaller variance than that by the method of maximum likelihood. To be sure, its variance will be larger if the true value $p_0$ lies in a region of rarefaction [i.e. low density], and averaging for different possible values of $p_0$ might indicate a greater average variance than that of the optimum statistic. But such an averaging would seem to be of a piece with “Bayes’ Theorem,” in supposing equal a priori probabilities.

^1 Here Hotelling evidently refers to Fisher’s use in Fisher (1924, at page 101) of the evocative astronomical term “globular cluster” to describe a point cloud. Fisher (1924) used the term in summarizing the multiple dimensional space approach he had taken in Fisher (1915). Fisher did not use the term in Fisher (1915), although it would have been appropriate there also.

Hotelling went on to state that even in the particular case of symmetric beta densities, maximum likelihood failed to be optimum, but his derivation there was marred by a simple error in differentiation. Before he received Fisher’s reply of the 28th of November, 1930, Hotelling wrote again, on December 12, correcting his own error with regard to the beta estimation problem and enlarging on his other comment, to the point of rather clearly speculating on the possibility of superefficient estimates.

The general question of the exact circumstances in which optimum statistics have minimum variance. . . is extremely interesting. That the property is not perfectly general seems clear from a consideration of some of the distributions having discontinuities; and also from the fact that, if the true value were known, a system of estimation could be devised which would give it with arbitrarily small variance; and such a system of estimation might happen to be adopted even if the true value were unknown.

I have two students working on the optimum estimates of $m$ for the above curve and for the Type III case you treated. Failing to get anything of consequence for small samples by purely mathematical methods, they will probably soon resort to experiment.^2

Cordially yours,
Harold Hotelling

Hotelling’s letters posed a challenging question in a direct but nonconfrontational way. Clearly, Hotelling said, some more constraints on the class of estimates would be needed; the geometric view they had evidently shared at Rothamsted suggested that consistency alone was not enough. There is no obvious guarantee that the curve $f(p)$ and the contours of the likelihood are such that improvement over maximum likelihood is not possible. What would be needed to prevent this, or at least to convince a reader such as Hotelling that the worry was groundless? Hotelling’s hypothetical improvements were certainly vague. A modern reader might be tempted to see them as foreshadowing Hodges’s estimate or even shrinkage via Stein estimation, but even though they fall short of that, they presented a clear challenge to Fisher.

10. FISHER’S REPLY: A THIRD PROOF OF THE EFFICIENCY OF MAXIMUM LIKELIHOOD

By 1930 Fisher was no stranger to challenges by skeptical readers. His general reaction to one from a friendly source was to state clearly what he was prepared to say, while avoiding speaking directly to the point raised. Without addressing the criticism, much less admitting its validity, he would move directly to a new and improved position, often not giving any indication that it represented the strongest statement that could be made and perhaps even hinting otherwise or at least allowing the reader to speculate so. Such was the case here.

Fisher’s reply to the first of Hotelling’s letters was brief, but it included one enclosure (A) that outlined a new proof that illuminated Fisher’s views, as well as a second short note (B) correcting Hotelling’s error in differentiating the beta density.

28 November 1930

Dear Hotelling,

I enclose two notes A and B on the points you raise. The first brings in the general variances and [covariances] for the multinomial, and is done more prettily by replacing the multinomial by a multiple Poisson,^3 but the argument is probably clearer as it stands.

This is a very short note; the meat of my letter is in the enclosures. . . . .

Yours sincerely,
R. A. Fisher
[Hotelling Papers Box 45]

Fisher’s Enclosure A presented a sketch of a new, third proof of the efficiency of maximum likelihood, one that took a different point of attack. The argument, given in toto as Appendix 2 below, was elegant, geometric, and I believe also correct, or at least completable, under the tacit regularity conditions implied by the analysis. The geometric stance he took was that which he and Hotelling would have discussed at Rothamsted, restricted to the case of a parametric multinomial family of distributions. Without any discussion, but in evident reply to Hotelling’s request for further restrictions on the class of estimates allowed in order to rule out nasty little facts of the sort Hotelling had hinted at, Fisher did introduce a new restriction on the class of estimates $T$ for which the result was claimed. As Fisher put it, the estimates under consideration were now assumed to be homogeneous functions $T = \phi(x_1, \ldots, x_s)$ of degree zero, where $(x_1, \ldots, x_s)$ is the vector of counts. This, and the tacitly assumed smooth differentiability, gave him access to a number of simple relationships leading to a conclusion he summarized as follows:

“The criterion of consistency thus fixes the value of T at all points on the expectation line, while the criterion of efficiency in conjunction with it fixes the direction in which the equistatistical surface cuts that line. All statistics which are both consistent and efficient thus have surfaces which touch on that line. The surface for Maximum Likelihood has the plane surface of this type.”

A homogeneous function $\phi$ of degree $h$ is one where $\phi(cx, cy, \ldots) = c^{h}\phi(x, y, \ldots)$, and in the applied mathematics of Fisher’s day their principal advantage was

that if differentiable, they satisfied Euler’s Relation $x\phi_x + y\phi_y + z\phi_z + \cdots = h\,\phi(x, y, z, \ldots)$, where $\phi_x$ denotes the partial derivative of $\phi$ with respect to $x$ (see, e.g., Courant, 1936, Vol. 2, pages 108–109). In Fisher’s case, the homogeneous functions $\phi$ of degree zero would be functions of the sample relative frequencies only, and would not otherwise depend upon the sample size $n$. This might be considered a strong restriction upon the class of estimates (Fisher did not comment upon this), but with the assumed differentiability and Euler’s Relation with $h = 0$, and the exactly known covariances for the multinomial, Fisher had an easy expression for the asymptotic variance for all estimates $T$ in this class. He did not require recourse to the regularity assumptions implicit in substituting normal distributions for approximately normal distributions, or in assuming the variance of $T$ was approximately constant, as he had in his previous proof. It was then an easy step to use standard Lagrangian methods to minimize this asymptotic expression for the variance for consistent estimates within this class and show the resulting equations were those that also determined the maximum likelihood estimate.

^2 From comments elsewhere in the correspondence it is clear that by “experiment” Hotelling means simulation with dice or cards.

^3 An analytical trick he had introduced in Fisher (1922a), see also the revised footnote in the reprinting of this paper in Fisher (1974).

Fisher published this third proof later only in a disguised form, namely where he assumed that the estimate $T$ was to be found from an estimating equation restricted to be a linear function of the relative frequencies; that is, without stating where he had begun, he jumped directly to Euler’s Relation. In that guise, and without the geometric setting and intuition, it appeared in his 1935 paper (Fisher, 1935, pages 45–46) where it served to provide an improved version of the demonstration that the maximum likelihood estimate achieves the information lower bound. It also appeared in Fisher (1938, pages 30–32), a 45-page tract he put together for a visit to India at Mahalanobis’s invitation in January 1938. That tract was mostly cobbled together from Fisher’s papers, and it summarized his view to that time. And in his 1956 book, he gave (apparently only as an illustration) another, simplified version, restricted to estimates $T$ that were themselves linear in the relative frequencies (Fisher, 1956, pages 145–148).

Hotelling’s role in this was that of an important catalyst. He helped lead Fisher to reconsider the problem and provided a remarkably acute audience, but he himself did not contribute further to the theory of maximum likelihood. Hotelling did write one other related paper during his nearly six months at Rothamsted in 1929. It was an investigation of the differential geometry of parameter spaces, with what is sometimes called the Jeffreys information metric, after Jeffreys (1946) (Kass, 1989; Kass and Vos, 1997). The paper, entitled “Spaces of statistical parameters,” which included Type-III or gamma densities as one example, must also have been written by Hotelling at Rothamsted. On 27 December 1929, a summary of the paper was read by Oystein Ore in Hotelling’s absence to the Annual Meeting of the American Mathematical Society in Bethlehem, Pennsylvania. Only an abstract was ever published, but the summary Ore read survives (part of a thick folder of other, later notes by Hotelling), and it is printed here, together with the abstract (Hotelling, 1930a), as Appendix 3.

11. THE SITUATION TO 1950

In all, Fisher gave three proofs of the optimality of maximum likelihood. The first, in 1922, was based upon the erroneous belief that maximum likelihood estimates were always sufficient statistics, and it depended upon treating approximately normally distributed random variables as if they were in fact normally distributed. The second proof, in 1925, was what I called the ANOVA proof. It too required the same implicit appeal to regularity by using normality in place of approximate normality, as well as assuming that the asymptotic variances of the estimates were approximately constant, and that the likelihood was sufficiently regular to permit the evaluation and manipulation of various integrals. The third, in 1930 in correspondence (and later in print in 1935 in a version restated in terms of estimating functions that lost the geometric origin), placed more severe restrictions upon the distributions (assumed multinomial) and estimates (smoothly differentiable functions of the relative frequencies only, not varying with sample size), but it yielded a more satisfactory proof. Even if not all details were filled in, that task was fairly easy for the limited setting considered. Indeed, the third proof was immune to the ugly little facts that Hotelling hinted at in 1930 and Hodges produced explicitly in 1951, but at a cost in generality. Still, multinomial distributions are as general as one could hope for in the discrete case, and the intuition developed from the geometric setting of that third proof provided at least superficial promise that the result held much more generally, for continuous parametric families.
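The Lagrangian step behind the third proof can be sketched in a few lines. This is a reconstruction in modern notation under the restrictions just described, not a transcription of Fisher’s Enclosure A; the symbols $\phi_t$, $\lambda$, $\mu$ and $I$ are introduced here for illustration, with $f_t = f_t(p)$ the multinomial cell probabilities, $x_t$ the relative frequencies, and $\phi_t = \partial\phi/\partial x_t$ evaluated at $x = f(p)$.

```latex
% Consistency, \phi(f(p)) = p, differentiated in p; and Euler's Relation with h = 0 at x = f(p):
\[
  \sum_t \phi_t\, f_t'(p) = 1, \qquad \sum_t f_t\,\phi_t = 0 .
\]
% First-order (delta method) variance, using Cov(x_s, x_t) = (f_t\,\delta_{st} - f_s f_t)/n:
\[
  n\,\mathrm{Var}(T) \approx \sum_t f_t\,\phi_t^2 - \Bigl(\sum_t f_t\,\phi_t\Bigr)^{2}
  = \sum_t f_t\,\phi_t^2 .
\]
% Minimize subject to the two constraints, with multipliers \lambda and \mu:
\[
  2 f_t\,\phi_t = \lambda f_t' + \mu f_t
  \quad\Longrightarrow\quad
  \phi_t = \frac{\lambda}{2}\,\frac{f_t'}{f_t} + \frac{\mu}{2} .
\]
% Since \sum_t f_t' = 0, the Euler constraint forces \mu = 0; consistency then fixes \lambda/2 = 1/I:
\[
  \phi_t = \frac{1}{I}\,\frac{f_t'}{f_t},
  \qquad
  I = \sum_t \frac{(f_t')^2}{f_t},
  \qquad
  \min\, n\,\mathrm{Var}(T) \approx \frac{1}{I} .
\]
% Implicit differentiation of the likelihood equation \sum_t x_t f_t'(p)/f_t(p) = 0 at x = f(p):
\[
  \frac{\partial \hat p}{\partial x_t} = \frac{1}{I}\,\frac{f_t'}{f_t} .
\]
```

On this sketch the gradient that minimizes the asymptotic variance coincides with the gradient of the maximum likelihood estimate at the expectation point, which is the sense in which the minimizing equations are those that determine maximum likelihood; here $I$ is the information per observation, so the bound is $1/(nI)$ for a sample of $n$.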

Over the next few years, several mathematicians recognized the unsatisfactory extent of rigorous support given for such a broad theory and tried their hands at filling in the gaps Fisher had knowingly leapt over as well as some he had not even recognized. The major early efforts were by Joseph Doob (1934, 1936) and Abraham Wald (1943, 1949) in the United States, Daniel Dugué (1937) in France, and then Harald Cramér (1946, 1946a) writing during the war in isolation in Sweden (having read Fisher, Doob, Dugué, but apparently not Wald). Both Doob and Wald had strong connections with Hotelling; both pursued their studies of this topic on Carnegie Fellowships working with Hotelling at Columbia, Doob in 1934–1935, and Wald in 1938–1939. Doob left for the University of Illinois in 1935, but Wald stayed on at Columbia, replacing Hotelling in 1939–1940 while Hotelling was on leave, and again permanently when Hotelling moved to North Carolina in 1946.

Of these writers, Doob and Dugué fell into new difficulties (as Hotelling had in his 1930 paper); Doob was gently corrected by Wald, and Dugué’s slip was apparently first noticed a decade later, in the mid-1940s by Edith Mourier, who brought it to Darmois’s attention. The Wald and Cramér treatments were the most satisfactory; both raised the level of rigor to new heights, although both suffered from the complexity of the conditions assumed and the limitations imposed. Wald was already publishing on the theory of estimation by 1939, and his 1943 proof of the asymptotic sufficiency of the maximum likelihood estimates can be seen as a form of the completion of Fisher’s 1922 proof. Cramér also was firmly based on Fisher; indeed his development followed the structure of Fisher’s work closely, but with rigorous demonstrations and explicit statements of conditions. Much of what Cramér presented might be seen as a realization of the book Fisher and Hotelling might have written, albeit without the geometry.

While this reaction to Fisher’s theory (namely that it was not true, or at least not proven as stated) progressed in some quarters, another appeared, namely claims that the theory was not new. In this respect the reactions were like those in the seventeenth century to William Harvey’s 1628 demonstration of the circulation of blood, where denials of the truth of the claimed phenomenon coexisted with priority claims on behalf of Hippocrates, circa 400 BC (Stigler, 1999, pages 207ff). Karl Pearson to his death and some others in his camp considered Fisher’s maximum likelihood simply the Gaussian method, warmed over and served again without overt reference to any Bayesian underpinnings. That can be attributed to a lack of understanding of what Fisher was accomplishing, a phenomenon that afflicted even such first-class older statisticians as G. Udny Yule. Yule’s otherwise excellent 1911 textbook was frequently revised but never made more than the most superficial reference to Fisher (other than to his correction to Pearson on degrees of freedom), even to the 10th edition of 1936 (Yule, 1936). Another, more recent claimant’s name was added to Gauss’s in 1935 when Arthur Bowley, in moving a vote of thanks for Fisher (1935), called attention to work by Edgeworth in 1908–1909 that bore at least superficial similarity to some of Fisher’s work, namely the information inequality of the second of the three proofs.

Bowley clearly had only a dim understanding of this work of Fisher’s, and his remarks were mild compared to those 15 years later by Jerzy Neyman. Neyman’s difficulties with Fisher began in 1934 and involved both scientific and personal issues, in what would become a long-running feud. Generally the dispute simmered at a low level: Fisher would, after the initial split, mostly ignore Neyman except for occasional barbs (usually veiled, without mentioning Neyman by name), and Neyman would generally downplay the importance and originality of Fisher’s work, rising on occasion for a more detailed published blast (Zabell, 1992; Kruskal, 1980).

In 1937 Neyman had been content to attribute the simple idea of maximum likelihood to Karl Pearson, citing Pearson’s derivation of the product moment estimate of the normal correlation coefficient as the “most probable” value, using the Gaussian method Pearson later abandoned (Neyman, 1937, page 345; 1938, pages 132, 136; Pearson, 1896, pages 262–265). But in 1951 Neyman’s focus on Fisher reached a peak, and he latched on to the claim of priority for Edgeworth and deployed it as a rhetorical weapon in the feud. In a review of the collection of papers (Fisher, 1950), Neyman resurrected Bowley’s discovery, accusing Fisher of “an unjustified claim of priority” with respect to “the so-called property of efficiency of the maximum likelihood estimates” (Neyman, 1951). What is more, Neyman wrote, “Actually, the proofs of the efficiency of maximum likelihood estimates offered both by Edgeworth and by Fisher are inaccurate, and the assertion, taken at its full generality, is false.” This comes close to being an accusation of a false claim of priority for a false discovery of an untrue fact, which would be a rare triple-negative in the history of intellectual property disputes. Savage (1976) wrote of Fisher with this review in mind, “nor did he always emerge as the undisputed champion in bad manners.” On the other side, in 1938 Fisher had reviewed Neyman’s influential Lectures and Conferences on Mathematical Statistics (1938), a book which had only a few grudging references to Fisher. Fisher’s review consisted of only two sentences, the first innocuous and the second, “There is not enough original material to justify publication as a book, and too much that is really trivial” (Fisher, 1938–1939). In June 1951, Neyman also wrote to the editor of the Journal of the American Statistical Association, W. Allen Wallis, unsuccessfully requesting that the Journal reprint Edgeworth’s 1908–1909 papers (letter in Neyman papers, Bancroft Library).

But what of the basic question, did Edgeworth precede Fisher and did he in any way influence him if he did? My own view, which is in general accord with the conclusions Jimmie Savage (1976, pages 447–448) and particularly John Pratt (1976) came to from a detailed study of both Edgeworth and Fisher, is that while there was indeed merit to Edgeworth’s work on this, there was no merit to the 1951 accusation of “an unjustified claim of priority.” In the course of a long, obscure and rambling series of papers emphasizing the use of inverse probability in estimation, Edgeworth did include a treatment of what he called “the direct method free from the speculative character which attaches to inverse probability.” He made what can in retrospect be best interpreted as a statement that maximizing the likelihood within a very restricted class of estimates (basically M-estimates for location parameters) gives the estimate with smallest standard deviation. The proof he offered (suggested by Professor A. E. H. Love, an expert on the calculus of variations) was based explicitly upon Schwarz’s inequality, and bore no resemblance to any Fisher gave.

There is no indication that this work of Edgeworth’s ever had any influence upon Fisher or any other worker on this topic. And the obscurity of the prose (uncommonly dense, even by Edgeworthian standards) is such that it is hard to believe the result would have been recognized there by any contemporary reader other than Edgeworth himself. Even at a later time, its recognition required a reader with Fisher’s work in hand and either extensive experience with Edgeworth or a strong historical or personal motive. Bowley had studied Edgeworth’s work and mode of expression thoroughly in preparing an extended commemorative summary in 1928. Neyman had both historical and personal motives, as well as Bowley’s 1935 prompt. Even today anyone who tries to learn what Edgeworth accomplished from Bowley’s 1928 summary (Bowley, 1928, pages 26–28) would emerge completely at sea, no matter how long the text is puzzled over. This is not to deny that when one has dug through the thicket of the 1908–1909 original, there is a limited result and a hint of understanding that went beyond the limited result. Edgeworth was a statistical scientist with an uncommonly subtle and deep mind (Stigler, 1986, Chapter 9; 1999, Chapter 5), and his work here is further evidence of that. But, for all that, the work stands as an independent partial anticipation: a hint, not an instance, of what was to come.

Edgeworth died in 1926 without ever commenting on Fisher, and Fisher, as was his wont, dug in his heels and refused to seriously engage the issue in print. His frankest private statements were in two letters. The first was a 12 February 1940 letter to Maurice Fréchet, where he described Edgeworth’s statement as confusingly linked to inverse probability, even though the mathematics could be dissociated from that approach. In that letter Fisher summed up his view in these words: “The confusion of associating this method with Bayes’ theorem seems to have been originally due to Gauss, who certainly recognized its merits as a method of estimation, though I do not know whether he proved anything definite about it. I do not know of any explicit statement of the properties, consistency, efficiency and sufficiency, which may characterize estimates prior to my 1922 paper” (Bennett, 1990, page 125). The second letter, dated 2 July 1951, was to a Californian, Horace Gray, who had spent time with Fisher in London in 1935–1936 and had written to Fisher to call attention to Neyman’s review. Fisher replied,

“Neyman is, judging from my own experience, a malicious mischief-maker. . . . Edgeworth’s paper of 1908 has, of course, been long familiar to me, and to other English statisticians. No one could now read it without realizing that the author was profoundly confused. I should say, for my own

part, that he certainly had an inkling of what I later demonstrated. The view that, in any proper sense, he anticipated me is made difficult by a number of verifiable facts” (Bennett, 1990, pages 138–139).

The facts Fisher listed were that (i) Edgeworth based his investigation on inverse probability, (ii) he limited attention to location parameters, and (iii) the formula they shared in common, for the variance of efficient estimates, had been drawn from Pearson and Filon with no notice given to the major errors in that work. Fisher noted that by 1903 Sheppard’s works had shown that moment estimates had variances different from those given by Pearson and Filon, and this raised for him the questions: “Had Pearson and Filon’s variances any validity at all? Does any class of estimate actually have these variances? If so, how can such an estimate be obtained in general? But Edgeworth would have been far ahead of his time had he asked them.” Fisher would grant Edgeworth “an inkling,” but no more. Some might see more in Edgeworth than Fisher did, but they do so from a different historical perspective. I believe Fisher owed no intellectual debt to Edgeworth on this issue, and it was his own loss. Had he taken the time and trouble to learn from Edgeworth’s insight, he might have gone even further. Savage (1976) proffered as explanations for this neglect, that Fisher initially thought Edgeworth’s premises ridiculous, and later “because it is hard to seek diligently after the unwelcome.”

Neyman’s was not the only review to raise priority issues about Fisher’s work. In a tendentious 1930 review of the 3rd edition of Fisher’s Statistical Methods for Research Workers, Charles Grove seemed to claim that all in Fisher was to be found earlier in Scandinavian work by Thiele, Gram or Charlier. Grove (1930) did not focus on maximum likelihood, which he evidently thought was unsupported, but put forth instead the claim that Thiele had in 1889 anticipated Fisher on small sample inference and particularly on estimating with k-statistics, and Gram had done so on the use of orthogonal polynomials in regression. Fisher replied in the same publication, and more colorfully in a private letter to Grove’s colleague Arne Fisher (a Dane who seems to have been the instigator of Grove’s review). Fisher stated that Thiele “had no more glimmer than [Karl] Pearson of some of the ideas we now use” (Grove, 1930; Fisher, 1931; Bennett, 1990, page 313). A scrupulous recent translation of Thiele from the Danish (Lauritzen, 2002) with accompanying commentary allows a better assessment of his excellent work, which, however, did not include contributions to maximum likelihood estimation.

12. DOUBTS ABOUT MAXIMUM LIKELIHOOD

The possibility that maximum likelihood estimates could actually perform badly, or that they might be dramatically improved upon by another method, seems to have not been raised prior to Hotelling’s probing letters to Fisher of November 15 and December 12, 1930. Kirstine Smith and Karl Pearson had questioned the relative merits of the “Gaussian method” versus minimum chi-square in 1916, but any difference there was minor; both were later seen to be asymptotically efficient estimates. For the most part, the early reservations about Fisher’s maximum likelihood centered on questions of priority (was he preceded? was anything really new in the method?) and issues of practical usefulness (were the calculations too hard relative to the method of moments?). As maximum likelihood became more widely adopted in the 1930s, the increased attention to proofs of its effectiveness (could a rigorous general demonstration be devised?) led inevitably to questions of when it might break down. The earliest explicit example is perhaps due to Abraham Wald, in correspondence with Jerzy Neyman in 1938.

Wald immigrated to the United States from Vienna in Spring 1938, when shortly after Hitler’s annexation of Austria he accepted an offer to join the Cowles Commission for Research in Economics, then located in Colorado Springs. He remained with Cowles through the summer before joining Harold Hotelling at Columbia University in Fall 1938. On September 20, 1938, a week before he left for Columbia, Wald wrote to Neyman sending a promised manuscript on the Markov inequality, but also describing a different problem he had encountered. The problem described was a slight generalization of one he would treat in Wald (1940), namely estimating a straight line when n points on the line are observed but both coordinates are subject to independent errors. However, Wald’s letter to Neyman contained a statement that he omitted from the 1940 article: “I have shown that the method of maximum likelihood leads to false estimations of the parameters.... (i.e., leads to statistics of which the stochastic limits are unequal to the values of the respective parameters to be estimated). Hence the maximum likelihood method cannot be applied” (Neyman Papers, Box 14, Folder 28). Wald stated that he had solved this general estimation problem for the case of independent normally distributed errors with possibly unequal variances.[4]

[4] If the observed points’ means are modeled as a random sample, the parameters do not grow in number with the sample size and their maximum likelihood estimates are consistent under mild conditions; see Kiefer and Wolfowitz (1956).

Neyman replied on September 23 that he was quite interested in the new problem, “the more so as it is rather close to what I am trying to do myself.” Ten years later Neyman and Elizabeth Scott published, with a general citation to Wald (1940), a simplified version of Wald’s example as one of several with increasing numbers of parameters where maximum likelihood estimates are inconsistent. That version, in which the straight line is $y = x$ and the two coordinates’ error variances are equal, has come to be known as the Neyman–Scott example. It is usually expressed as follows: $X_{ij}$ are independent $N(\mu_j, \sigma^2)$, for $i = 1, 2$, and $j = 1, \ldots, n$, in which case the maximum likelihood estimate of $\sigma^2$ consistently estimates half the correct value (Neyman and Scott, 1948).

In June of 1951, just as Jerzy Neyman’s review of Fisher’s Collected Papers appeared, the Berkeley Statistical Laboratory convened for the summer under Neyman’s general direction. One of three research groups took as its charge “a complex of questions arising from considerations of superefficiency and identifiability.” The group concentrating on this topic was comprised of Joseph L. Hodges, Jr., Lucien Le Cam and Agnes Berger. It was presumably shortly before that time that Hodges, then an Assistant Professor at Berkeley, constructed his example; in any event the study was soon sufficiently advanced that a session on the topic “Efficiency and superefficiency of estimates” was arranged by Neyman to be held on Saturday, December 29, 1951, at the Boston meeting of the Institute of Mathematical Statistics. Four talks were presented in that session, by Jerzy Neyman (“On the problem of asymptotic efficiency of estimates”), Joe Hodges (“Local superefficiency”), Lucien Le Cam (“On sets of parameter points where it is possible to achieve superefficiency of estimates”) and Joseph Berkson (“Relative precision of least squares and maximum likelihood estimates of regression coefficients”) (Biometrics, 1951; Littauer and Mode, 1952). Neither Neyman’s nor Hodges’s talks were ever published; Le Cam’s was developed into his Ph.D. dissertation and published in 1953. That publication (Le Cam, 1953) included Hodges’s example (credited to Hodges), and Le Cam proved among other things that while superefficiency was clearly possible, the set of parameter points where it could be achieved had Lebesgue measure zero.

In the decade that followed, a number of other examples were discovered or devised. Of these, the least contrived was the problem of estimation for the five-parameter mixture of two normal distributions, where the likelihood function explodes to infinity when either mean parameter equals any observation. This and several other examples, including an important one by Bahadur, are reviewed in Le Cam (1990) and Cox (2006, Chapter 7). Le Cam speculates that the normal mixture example (known in the folklore of the 1950s but apparently not published then) was due to Jack Kiefer and Jacob Wolfowitz; Cox (2006, pages 134–135) considers it to some extent pathological.

These early examples created a flurry of excitement but are for the most part not seen today as debilitating to the theory. Hodges’s example made a substantial impact when it first became known, but it has, ever since Le Cam’s dissertation, come to be seen as an ingenious but minor technical achievement. Hodges (see Figure 1 above) showed you could improve locally on maximum likelihood, basically by shrinking the estimate toward zero, and as such it might also be viewed as an early hint of the 1955 shrinkage estimates of Charles Stein that in multiparameter problems can improve uniformly on maximum likelihood. But Hodges’s example itself was for finite samples inferior to maximum likelihood for parameter values not near zero, and it was not long seen as a serious threat. The Wald–Neyman–Scott example was of more practical import, and still serves as a warning of what might occur in modern highly parameterized problems, where the information in the data may be spread too thinly to achieve asymptotic consistency. The normal mixture example remains certainly of at least computational importance, as showing how in complex settings it may be necessary to seek local maxima or to constrain the parameter space. Fisher never commented on any of these examples.

There continued over this period to be a number of attempts to complete the theory, to give a rigorous description of conditions that approached necessary and sufficient, conditions describing situations in which maximum likelihood would not mislead. As work on the topic became more refined and more correct, the intrinsic difficulties of the topic also became more apparent. The lists of conditions needed to prove optimality by Wald and Cramér were already unwieldy and the basic logic of the solutions retreated from sight; indeed one problem was that achieving rigor sometimes led to the exclusion of basic examples, such as the estimation of the normal standard deviation, as in Wald (1943). The consequences can still be seen today, in the best textbook treatments, such as those by Bickel and Doksum (2001) and by van der Vaart (1998), where the elegance of the exposition comes from strategically restricting the scope of the coverage. Bahadur (1964) gave a succinct and elegant theorem that builds upon work of Le Cam, but only treated a one-dimensional parameter and was restricted to estimates that are asymptotically normal with variances that are continuous in the parameter.

13. OF ERRORS IN THEORY

At many junctures in this story we have encountered what might be judged theoretical errors committed by the workers involved. Perhaps Lagrange, by ignoring the curve his probabilities followed until the final stage, could be judged in error; it certainly left him with method of moment estimates that would be thought woefully inefficient by the Fisher generation. Perhaps Gauss’s use of a uniform prior, which rendered his solution susceptible to change by nonlinear transformations of the parameters, would be considered an error. Certainly Pearson and Filon erred in their promiscuous use of a naïve passage to a limit in ways where it gave wrong answers (Stigler, 2007). And certainly Fisher’s 1921 assumption that sufficient statistics always exist was an error, and Hotelling’s 1930 proof of the consistency and asymptotic normality of the maximum likelihood estimate cannot be counted correct for the generality claimed.

There are other errors I have not discussed. When Lambert (1760) in a sketchy presentation gave only one example, he got what was arguably the wrong answer there. Lambert’s only specific result was for $n = 2$, claiming the sample mean in that case always gave the most probable result, a claim that would fail for the Cauchy density. See Stigler (1999, Chapter 16). And there were later smaller and subtler lapses in rigor in attempts by Doob and by Dugué to themselves correct some of Fisher’s oversights. But I do not mean at all to suggest these pioneers had feet of clay. To the contrary. Without Lagrange’s error he might not have found the Laplace transform at that early date. Without Pearson and Filon, Fisher might not have started down the road he did. Without Fisher’s 1921 mistaken jump to a conclusion, he might not have rushed to complete his theory, which, even flawed and incomplete, was instrumental in launching twentieth century theoretical statistics. Great explorations in uncharted territory seem to require great boldness, and even mischance can lead to major advance.

14. CONCLUSION

Despite all these difficulties, maximum likelihood remains one of the most used and useful techniques of modern statistics. How can that be, in the face of the nasty little facts uncovered by the 1950s? For one thing, there is solid mathematical support in a wide class of problems. Fisher’s proofs can all be defended as correct, at least if one accepts as given the regularity conditions and assumptions that were clearly implicit, including the limitation in 1922 to sufficient estimates, and in 1925 to score functions linearly approximable by maximum likelihood estimates. Of course that defense flirts with tautology: any statement is true if all the conditions required for its truth are assumed; even the Pearson–Filon derivation of the “probable errors of frequency constants” might be so defended. But there is a big difference between the two cases. Fisher’s implicit assumptions are in part fairly clear (smooth differentiability, consistent estimates, e.g.) and were clearly evident to Fisher himself; if he made a false application of the theory, it is not known to me. On the other hand, with Pearson–Filon the case was different, as the inappropriate applications in the same paper make clear. Nonetheless, the intense mathematical investigations after 1938 and particularly in the 1950s revealed potential problems Fisher had not considered, with increasing numbers of parameters, unbounded likelihood functions and the possibility of local improvement over maximum likelihood. Fisher was surely aware of some of these problems, at least when they were published, if not before. The first of them he might have countered by noting that in such situations the amount of information in the data (the measurement of which was one of his pioneering advances) was being spread over a space of dimension increasing in proportion with the sample size, and so of course problems with consistency could be expected. But he did not. The third possibility, local improvement, had been brought early to his attention by Hotelling, but here too Fisher remained silent, as he did in the face of other examples as well. An explanation for this silence might, ironically, have been given by Fisher himself in a 14 January 1933 letter to Egon Pearson, commiserating with him on the difficulties he faced with his father, Karl: “Many original men are for that reason unreceptive, and this is a fault which age does nothing to cure” (Fisher papers).

Personalities played a role in this development. Fisher’s hostilities with Neyman surely increased his stubborn resistance to public discussion of areas where questions remained, and they surely contributed to the zeal with which Neyman pursued the discovery and public discussion of such problems. The latter might be viewed as a benefit of the feud: when peace reigned in the early 1930s and the only attention to the problem was by Fisher, Hotelling, and those Hotelling inspired to work on this (Doob, Wald) or a noncombatant (Dugué), the problems in the proofs and the limitations of the theory were not on public view. Indeed, there has been no published criticism that clearly identified the source of errors in the proofs of Hotelling, Doob or Dugué even to the present day; either the early works were ignored, were merely cited, or referred to with a polite allusion such as to the proofs being “not rigorous” (e.g., Doob, 1934; Le Cam, 1953). The reader got no sense of where and how real problems with the theory might arise. Hostility bred uncivil discourse; it also led to principled focus.

Yet despite these problems, time and again maximum likelihood has proved useful even in situations where no general theorem could be found to defend its use. Perhaps as Fisher’s powerful geometric intuition may have foreseen, the scope of useful application of maximum likelihood exceeds that of any reasonably achievable proof, even though this comes at the potential cost of inadvertently blundering into a region of inapplicability. We now understand the limitations of maximum likelihood better than Fisher did, but far from well enough to guarantee safety in its application in complex situations where it is most needed. Maximum likelihood remains a truly beautiful theory, even though tragedy may lurk around a corner.

APPENDIX 1: FISHER’S DECEMBER 1928 DRAFT TABLE OF CONTENTS [HOTELLING PAPERS, BOX 3]

I. Distributions
   Varieties and variables
   Types of distribution
      (a) Discontinuous, step like integrals
      (b) Continuous, differentiable integrals
      (c) General type, integral not differentiable but frequency not confined to zero measure
   Specification by moments
   Characteristic function, $\int e^{itx} f(x)\,dx$ or $\int e^{itx}\,dF(x)$
   Its logarithm, cumulative property
   Cumulative moment functions or seminvariants [sic]
   Illustrative cases, uniqueness of normal distribution, multinomial and multiple Poisson
II. Distributions derived from normal
   $\chi^2$ distribution is that of $S_1^n(x_p^2)$ when $x_p$ is distributed with unit variance about zero
   Transformation of $\xi_q = \sum_{p=1}^{n} c_{pq} x_p$, $\sum_{p=1}^{n} c_{pq}^2 = 1$, $\sum_{p=1}^{n} c_{pq} c_{pq'} = 0$
   Application of $\chi^2$ to frequencies
   Distribution of $t = \frac{n\bar{x}}{\chi^2}$ [sic]; application to regression coefficients; of $z = \frac{1}{2}\log\frac{n_2\chi_1^2}{n_1\chi_2^2}$.
III. Distribution of correlation coefficient, partial correlation, multiple correlation. Hyperspace treatment
IV. Moment estimates of seminvariants
   Simple and multiple distribution of such estimates
   Combinatorial method
   {paper in Lond. Math. Soc. They will not publish for a year if then}
V. Theory of estimation
   (Much as already done but more about Sufficient Statistics)
   Method of maximum likelihood
   Bayes’ theorem. Inverse probability and likelihood.
   Illustrate by inefficiency of moments with Pearsonian curves.
VI. Experimental design (not so agricultural as in Statistical Methods for Research Workers), more use of amount of information
VII. Statistical mechanics; argument put in a clear light without taking $x!$ as a when $x$ is small! NOT YET DONE!
   Analogous biological problems. Rather worth doing though.

Fowler has now done a great deal, but still the method of steepest descent seems very indirect, and obviously this limits statistical argument.

APPENDIX 2: FISHER’S ENCLOSURE A FROM THE NOV. 28, 1930 LETTER TO HOTELLING. THE LETTERS WERE TYPED BUT THE FORMULAS WERE WRITTEN IN BY HAND, AND THE APPARENT TYPOGRAPHICAL ERROR IN THE FOURTH FORMULA FROM THE BOTTOM ($\partial\theta/\partial X_1$ FOR $\partial\phi/\partial X_1$) IS AS WRITTEN IN THE ORIGINAL [HOTELLING PAPERS, BOX 45]

Expectation line $x = f(\theta)$

Equistatistical surface (or region) $T = \phi(x_1, \ldots, x_s)$

$\sum x \frac{\partial\phi}{\partial x} = 0$ if $\phi$ is homogeneous of zero degree.

For consistency $\theta = \phi(f_1, \ldots, f_s)$

For large samples, provided there is no bias of order as high as $n^{-1/2}$,

\[
V(T) = \text{Mean}\left(\sum \frac{\partial\phi}{\partial x}\,\delta x\right)^2
     = \sum f\left(1 - \frac{f}{n}\right)\left(\frac{\partial\phi}{\partial x}\right)^2
       - \sum\sum \frac{f f'}{n}\,\frac{\partial\phi}{\partial x}\,\frac{\partial\phi}{\partial x'}
\]

for multinomial, where differentials refer to the expectation point.

Differentiating the condition of consistency, $d\theta = \sum\left(\frac{\partial\phi}{\partial x}\frac{\partial f}{\partial\theta}\right) d\theta$, or $\sum \frac{\partial\phi}{\partial x}\frac{\partial f}{\partial\theta} = 1$.

Any values $\frac{\partial\phi}{\partial x}$ are admissible subject to this condition for consistency, we may therefore minimize the expression for the variance subject to this condition and obtain equations of the form

\[
f_1(\theta)\,\frac{\partial\theta}{\partial x_1} - \frac{f_1}{n}\sum f\,\frac{\partial\phi}{\partial x} = \lambda\,\frac{\partial f_1}{\partial\theta}
\]

Now if $\phi$ is homogeneous in $x$ of zero degree, $\sum f \frac{\partial\phi}{\partial x} = 0$, hence, for all classes

\[
\frac{\partial\phi}{\partial x} = \frac{\lambda}{f}\,\frac{\partial f}{\partial\theta}
\quad\text{or}\quad
\frac{\partial\phi}{\partial x} = \frac{1}{f}\,\frac{\partial f}{\partial\theta} \Big/ \sum \frac{1}{f}\left(\frac{\partial f}{\partial\theta}\right)^2 .
\]

The criterion of consistency thus fixes the value of $T$ at all points on the expectation line, while the criterion of efficiency in conjunction with it fixes the direction in which the equistatistical surface cuts that line. All statistics which are both consistent and efficient thus have surfaces which touch on that line. The surface for Maximum Likelihood has the plane surface of this type.

APPENDIX 3: HOTELLING ON PARAMETER SPACES

Hotelling briefly attended the American Mathematical Society’s Annual Meeting in Bethlehem, Pennsylvania, December 26–29, 1929, but he left before this paper was scheduled to be read on December 27. The paper was read in his absence by Professor Oystein Ore of Yale University, and Ore subsequently returned the manuscript to Hotelling; only the abstract was ever published (Hotelling, 1930a). Meanwhile, Hotelling traveled on to the AMS Regular Meeting December 30–31 in Des Moines, Iowa, where on December 31 he read his paper “The consistency and ultimate distribution of optimum statistics” (Hotelling, 1930). The summary that follows is the entire manuscript as read by Ore, from the Hotelling Papers at Columbia University (Box 44).

SPACES OF STATISTICAL PARAMETERS

By Harold Hotelling, Stanford University.

[Abstract]

For a space of $n$ dimensions representing the parameters $p_1, \ldots, p_n$ of a frequency distribution, a statistically significant metric is defined by means of the variances and co-variances of efficient estimates of these parameters. Such a space, for the ordinary types of distributions, is always curved. For the two parameters of the normal law the manifold may be represented in part as a surface of revolution of negative curvature, with a sharp circular edge. On this surface variation of the dispersion is represented by moving along a generator. For a Pearson Type III curve [i.e. gamma distributions] of any given shape the same surface occurs. For the unrestricted Type III curve there are three parameters; their space is investigated. Certain metrical properties which hold in general for spaces of statistical parameters are given.

SUMMARY OF “SPACES OF STATISTICAL PARAMETERS”

A “population” is specified by a function

\[
f(x, p_1, \ldots, p_k)
\]

such that $f\,dx$ is the probability of an observation falling in the range $dx$. In statistics we have given observations $x_1, \ldots, x_N$ and wish to estimate the values of the parameters $p_1, \ldots, p_k$. There is an infinity of possible methods of making these estimates; but one possessing certain peculiarly valuable properties is that of maximum likelihood. The likelihood is defined as

\[
\prod_{i=1}^{N} f(x_i, p_1, \ldots, p_k).
\]

Denote its logarithm by $L$. Let $\hat{p}_1, \ldots, \hat{p}_k$ be the values maximizing $L$. They have been called optimum statistics, or optimum estimates of the parameters, by R. A. Fisher. The errors of estimate $\hat{p}_\alpha - p_\alpha$ derived from samples of $N$ have a distribution which for large values of $N$ approaches the normal form

\[
K e^{-\frac{1}{2}T}\, d\hat{p}_1, \ldots, d\hat{p}_k,
\]

where

\[
T = \sum\sum g_{\alpha\beta}(\hat{p}_\alpha - p_\alpha)(\hat{p}_\beta - p_\beta).
\]

Here $g_{\alpha\beta}$ is the mathematical expectation of

\[
\frac{\partial^2 L}{\partial p_\alpha\,\partial p_\beta},
\]

and is a covariant tensor of second order under transformations $p'_\alpha = \phi_\alpha(p_1, \ldots, p_k)$, though of course the second derivative is not itself a tensor.

This tensor property suggests that

\[
g_{\alpha\beta}\, dp^\alpha\, dp^\beta
\]

be taken as distance element in a space of coordinates $p_1, \ldots, p_k$. Indeed a considerable amount of differential geometry carries over immediately to give novel statistical conclusions. It should be said at once that these spaces are not flat, but are curved in a manner depending on the initial population distributions.

Problems of “random migration” by short leaps in the k-space occur in various biological problems, when evolution is supposed to take place by small mutations. Such problems occur also in experimental work, as in the dilution method of counting soil bacteria developed by Cutter at Rothamsted. These problems, for short steps, are equivalent to problems regarding heat conduction and geodesics in the curved space.

If we are considering an initial distribution curve of any fixed shape, we have two parameters to estimate, giving the location and scale of the curve, for example the mean and standard deviation of a normal error curve. Our k-space is in such cases a surface of constant negative curvature. Representing the normal curve by means of a pseudosphere, variation of the standard deviation is represented by motion along a generator, variations of the mean by rotation about the axis. A greater variance means closer propinquity to the axis.

Since a geodesic on a pseudosphere between two points on the same meridian comes closer to the axis than the meridian, we have an interesting biological conclusion. If we have two related species having about the same variance but a difference in means, the most likely common ancestors had a greater variance than either existing species.

For a Pearson Type III curve the measures of position and scale vary, not along geodesic but along loxodromes.

Spaces of statistical parameters lend themselves to the treatment of a wide range of problems in which discrepancies between hypothesis and observation which involve two or more observations are to be tested. Thus if the hypothesis to be tested is that a species, in which the frequency distribution of some dimension has the normal form, has arisen by a succession of small mutations from another, and if we consider the difference of the variances along with that of the means, we are led to apply the distribution of $\chi^2$ for $n = 2$, just as in judging marksmanship we may combine vertical with horizontal deviations of a shot from the center of the target. But the fact that the surface, on which the mean and variance are coordinates, is a pseudosphere instead of a plane, shows that a correction must be applied to the probability of a greater deviation as calculated from $\chi^2$. Indeed, the area or circumference of a geodesic circle is greater than for one of the same radius on a plane. The excess of area measures the correction which must be applied to obtain the true probability of a greater discrepancy.

If about a point on the pseudosphere representing any population we describe a geodesic circle, the points on the circumference represent statistics, such as mean and variance, which might with equal likelihood have been obtained in a sample from this population. And inversely, if corresponding to a given sample, we fix upon a point as center of a geodesic circle, the points on the circumference represent populations which, on the evidence of this sample, are all equally likely.
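Hotelling's summary takes $g_{\alpha\beta}$ from the expected second derivative of the log likelihood and states that, for the mean and standard deviation of a normal curve, the resulting space has constant negative curvature. The sketch below checks the metric numerically for a single observation. It is an illustration under stated assumptions, not Hotelling's own computation: the function names, the finite-difference step and the integration grid are hypothetical choices, and the sign is taken as minus the expected second derivative so that the metric comes out positive definite.

```python
import math


def log_density(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma."""
    return (-math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))


def second_derivative(x, params, a, b, h=1e-4):
    """Central finite-difference estimate of d^2 log f / dp_a dp_b.

    params is (mu, sigma); a and b are 0 for mu and 1 for sigma.
    The step h is an assumed value, small relative to sigma.
    """
    def shifted(da, db):
        p = list(params)
        p[a] += da
        p[b] += db
        return log_density(x, p[0], p[1])

    return (shifted(h, h) - shifted(h, -h)
            - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)


def information_metric(mu, sigma, steps=4000, width=10.0):
    """Midpoint-rule estimate of g_ab = -E[d^2 log f / dp_a dp_b].

    Integrates over mu +/- width * sigma; the sign convention (minus the
    expectation) is an assumption made here to get a positive definite metric.
    """
    lower = mu - width * sigma
    dx = 2.0 * width * sigma / steps
    g = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(steps):
        x = lower + (i + 0.5) * dx
        weight = math.exp(log_density(x, mu, sigma)) * dx
        for a in range(2):
            for b in range(2):
                g[a][b] -= weight * second_derivative(x, (mu, sigma), a, b)
    return g


if __name__ == "__main__":
    sigma = 2.0
    g = information_metric(0.0, sigma)
    print("numerical metric:", g)
    print("closed-form diagonal:", 1.0 / sigma ** 2, 2.0 / sigma ** 2)
```

For the normal family the closed form is $g = \mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$ per observation, so the distance element is $(d\mu^2 + 2\,d\sigma^2)/\sigma^2$. This is a rescaled hyperbolic half-plane metric with Gaussian curvature $-1/2$, which agrees with the constant negative curvature described in the summary.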

ACKNOWLEDGMENTS

I am grateful to Henry Bennett for permission to consult and quote from the Fisher Papers in Adelaide, to Michael Ryan for permission to quote from the Harold Hotelling Papers, Rare Book and Manuscript Library, Columbia University, and to Susan Snyder for permission to quote from the Jerzy Neyman Papers (Call Number BANC MSS 84/30C, Box 14, Folder 28), Bancroft Library, University of California Berkeley. For comments during the course of this investigation, I thank Peter Bickel, Larry Brown, Bernard Bru, , Persi Diaconis, Anthony Edwards, Brad Efron, Tim Gregoire, Marc Hallin, Lucien Le Cam, Erich Lehmann, Peter McCullagh, Edith Mourier, . The paper is based upon material presented as the Lucien Le Cam Memorial Lecture at the IMS Annual Meeting in Rio de Janeiro, August 2, 2006.

REFERENCES

Norden (1972–1973) surveys the literature to 1972, with an extensive bibliography. Hald (1998) gives a detailed report in modern notation of Fisher’s published work in statistical inference, as well as work related to maximum likelihood by Pearson and Filon and by Edgeworth. Aldrich (1997) and Edwards (1997a) discuss Fisher’s earliest work on maximum likelihood; I also describe this work from a different perspective in Stigler (2005), and other aspects of Fisher’s work in Stigler (1973, 2001, 2007). Edwards (1974), Kendall (1961) and Hald (1998, 2007) are among those who describe the predecessors of maximum likelihood; see Stigler (1999, Chapter 16), for more detail on this and further references. Savage (1976), Pratt (1976) and Hald (1998, 2007) give accounts of the relationship of Fisher’s work to Edgeworth’s. For the best appreciation of the role of geometry in Fisher’s work on estimation see Efron’s Wald Lecture (1982), and for an elegant and insightful modern development of Fisher’s geometric approach with historical references, see Kass and Vos (1997). On Hotelling’s life and work see Arrow and Lehmann (2005), Darnell (1988), Hotelling (1990) and Smith (1978); on Fisher’s life see Box (1978); on Pearson’s life see Porter (2004). The Hotelling–Fisher correspondence is housed in the Hotelling Collection at Columbia University (Rare Book and Manuscript Library) and the Fisher Papers at the University of Adelaide; each has a more complete set of the received letters than the sent letters. Neyman’s papers are at the Bancroft Library of the University of California Berkeley.

Aldrich, J. (1997). R. A. Fisher and the making of maximum likelihood 1912–1922. Statist. Sci. 12 162–176. MR1617519
Arrow, K. J. and Lehmann, E. L. (2005). Harold Hotelling 1895–1973. Biographical Memoirs of the National Academy of Sciences 87 3–15.
Bahadur, R. R. (1964). On Fisher’s bound for asymptotic variances. Ann. Math. Statist. 35 1545–1552. MR0166867
Bahadur, R. R. (1983). Hodges superefficiency. In Encyclopedia of Statistical Sciences (S. Kotz and N. L. Johnson, eds.) 3 645–646.
Bennett, J. H., ed. (1990). Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher. Clarendon Press, Oxford. MR1076366
Bernoulli, D. (1769). Dijudicatio maxime probabilis plurium observationum discrepantium atque verisimillima inductio inde formanda. Manuscript; Bernoulli MSS f.299–305, University of Basel. English translation in Stigler (1997).
Bernoulli, D. (1778). Dijudicatio maxime probabilis plurium observationum discrepantium atque verisimillima inductio inde formanda. Acta Academiae Scientiarum Imperialis Petropolitanae for 1777, pars prior 3–23. Reprinted in Bernoulli (1982). English translation in Kendall (1961) 3–13, reprinted 1970 in Pearson, Egon S. and Kendall, M. G. (eds.), Studies in the History of Statistics and Probability, pp. 157–167. Charles Griffin, London.
Bernoulli, D. (1982). Die Werke von Daniel Bernoulli. Band 2. Analysis. Wahrscheinlichkeitsrechnung. Birkhäuser, Basel. MR0685593
Bickel, P. J. and Doksum, K. (2001). Mathematical Statistics. Basic Ideas and Selected Topics, 2nd ed. 1. Prentice Hall, Upper Saddle River, NJ. MR0443141
Biometrics (1951). News and Notes. Biometrics 7 449–450.
Bowley, A. L. (1928). F. Y. Edgeworth’s Contributions to Mathematical Statistics. Royal Statistical Society, London. (Reprinted 1972 by Augustus M. Kelley, Clifton, NJ.)
Box, J. F. (1978). R. A. Fisher. The Life of a Scientist. Wiley, New York. MR0500579
Courant, R. (1936). Differential and Integral Calculus. Nordeman, New York.
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge Univ. Press. MR2278763
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton Univ. Press. MR0016588
Cramér, H. (1946a). A contribution to the theory of statistical estimation. Skand. Aktuarietidskr. 29 85–94. Reprinted in H. Cramér, Collected Works 2 948–957. Springer, Berlin (1994). MR0017505
Darnell, A. C. (1988). Harold Hotelling 1895–1973. Statist. Sci. 3 57–62. MR0959716
Doob, J. L. (1934). Probability and statistics. Trans. Amer. Math. Soc. 36 759–775. MR1501765
Doob, J. L. (1936). Statistical estimation. Trans. Amer. Math. Soc. 39 410–421. MR1501855
Dugué, D. (1937). Application des propriétés de la limite au sens du calcul des probabilités à l’étude de diverse questions d’estimation. J. l’École Polytechnique 3e série (n. 4) 305–373.

Edwards, A. W. F. (1974). The history of likelihood. Internat. Statist. Rev. 42 9–15. MR0353514
Edwards, A. W. F. (1997). Three early papers on efficient parametric estimation. Statist. Sci. 12 35–47. MR1466429
Edwards, A. W. F. (1997a). What did Fisher mean by “inverse probability” in 1912–1922? Statist. Sci. 12 177–184. MR1617520
Efron, B. (1975). Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Statist. 3 1189–1242. MR0428531
Efron, B. (1978). The geometry of exponential families. Ann. Statist. 6 362–376. MR0471152
Efron, B. (1982). Maximum likelihood and decision theory (The 1981 Wald Memorial Lectures). Ann. Statist. 10 340–356. MR0653516
Efron, B. (1998). R. A. Fisher in the 21st century (with discussion). Statist. Sci. 13 95–122. MR1647499
Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika 65 457–482. MR0521817
Fienberg, S. E. and Hinkley, D. V. eds. (1980). R. A. Fisher: An Appreciation. Springer, New York. MR0578886
Fisher, R. A. (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathematics 41 155–160; reprinted as Paper 1 in Fisher (1974); reprinted in Edwards (1997).
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10 507–521; reprinted as Paper 4 in Fisher (1974).
Fisher, R. A. (1920). A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and by the mean square error. Mon. Notices Roy. Astron. Soc. 80 758–770; reprinted as Paper 12 in Fisher (1974).
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London Ser. A 222 309–368; reprinted as Paper 18 in Fisher (1974).
Fisher, R. A. (1922a). On the interpretation of χ² from contingency tables, and the calculation of P. J. Roy. Statist. Soc. 85 87–94; reprinted as Paper 19 in Fisher (1974).
Fisher, R. A. (1924). The Influence of Rainfall on the Yield of Wheat at Rothamsted. Philos. Trans. Roy. Soc. London Ser. B 213 89–142; reprinted as Paper 37 in Fisher (1974).
Fisher, R. A. (1924a). Conditions under which χ² measures the discrepancy between observation and hypothesis. J. Roy. Statist. Soc. 87 442–450; reprinted as Paper 34 in Fisher (1974).
Fisher, R. A. (1925). Theory of statistical estimation. Proc. Cambridge Philos. Soc. 22 700–725; reprinted as Paper 42 in Fisher (1974).
Fisher, R. A. (1931). Letter to the Editor. Amer. Math. Monthly 38 335–338. MR1522291
Fisher, R. A. (1935). The logic of inductive inference. J. Roy. Statist. Soc. 98 39–54; reprinted as Paper 124 in Fisher (1974).
Fisher, R. A. (1938). Statistical Theory of Estimation. Univ. Calcutta.
Fisher, R. A. (1938–1939). Review of “Lectures and Conferences on Mathematical Statistics” by J. Neyman. Science Progress 33 577.
Fisher, R. A. (1950). Contributions to Mathematical Statistics. Wiley, New York.
Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh.
Fisher, R. A. (1974). The Collected Papers of R. A. Fisher. U. of Adelaide Press. MR0505093
Galton, F. (1908). Memories of my Life. Methuen, London.
Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Perthes et Besser, Hamburg. Translated, 1857, as Theory of Motion of the Heavenly Bodies Moving about the Sun in Conic Sections, trans. C. H. Davis. Little, Brown; Boston. Reprinted, 1963, Dover, New York.
Grove, C. C. (1930). Review of “Statistical Methods for Research Workers.” Amer. Math. Monthly 37 547–550. MR1522136
Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930. Wiley, New York. MR1619032
Hald, A. (2007). A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713 to 1935. Springer, New York. MR2284212
Hinkley, D. V. (1980). Theory of statistical estimation: The 1925 paper. Pp. 85–94 in Fienberg and Hinkley (1980).
Hotelling, H. (1930). The consistency and ultimate distribution of optimum statistics. Trans. Amer. Math. Soc. 32 847–859. MR1501565
Hotelling, H. (1930a). Spaces of statistical parameters (Abstract). Bull. Amer. Math. Soc. 36 191.
Hotelling, H. (1951). The impact of R. A. Fisher on statistics. J. Amer. Statist. Assoc. 46 35–46.
Hotelling, H. (1990). The Collected Economic Articles of Harold Hotelling. Springer, New York. MR1030045
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. London Ser. A 186 453–461. MR0017504
Kass, R. E. (1989). The geometry of asymptotic inference. Statist. Sci. 4 188–219. MR1015274
Kass, R. E. and Vos, P. W. (1997). Geometrical Foundations of Asymptotic Inference. Wiley, New York. MR1461540
Kendall, M. G. (1961). Daniel Bernoulli on maximum likelihood. Biometrika 48 1–18. Reprinted in 1970 in Pearson, Egon S. and Kendall, M. G. (eds.), Studies in the History of Statistics and Probability. Charles Griffin, London, pages 155–172.
Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many parameters. Ann. Math. Statist. 27 887–906. MR0086464
Kruskal, W. H. (1980). The significance of Fisher: A review of “R. A. Fisher: The Life of a Scientist” by Joan Fisher Box. J. Amer. Statist. Assoc. 75 1019–1030.
Lagrange, J.-L. (1776). Mémoire sur l’utilité de la méthode de prendre le milieu entre les résultats de plusieurs observations; dans lequel on examine les avantages de cette méthode par le calcul des probabilités, & ou l’on resoud differens problèmes relatifs à cette matière. Miscellanea Taurinensia 5 167–232. Reprinted in Lagrange (1868) 2 173–236.
Lagrange, J.-L. (1868). Oeuvres de Lagrange, 2. Gauthier-Villars, Paris.
Lambert, J. H. (1760). Photometria, sive de Mensura et Gradibus Luminis, Colorum et Umbrae. Detleffsen, Augsburg. (French translation 1997, L’Harmattan, Paris; English translation 2001, by David L. DiLaura, for The Illuminating Engineering Society of North America).
Laplace, P. S. (1774). Mémoire sur la probabilité des causes par les évènemens. Mémoires de mathématique et de physique, presentés à l’Académie Royale des Sciences, par divers savans, & lû dans ses assemblées 6 621–656. Translated in Stigler (1986a).
Lauritzen, S. L. (2002). Thiele: Pioneer in Statistics. Oxford Univ. Press. MR2055773
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. University of California Publications in Statistics 1 277–330. MR0054913
Le Cam, L. (1990). Maximum likelihood: An introduction. Internat. Statist. Rev. 58 153–171 [Previously issued in 1979 by the Statistics Branch of the Department of Mathematics, University of Maryland, as Lecture Notes No. 18].
Littauer, S. B. and Mode, E. B. (1952). Report of the Boston Meeting of the Institute. Ann. Math. Statist. 23 155–159.
Neyman, J. (1937). Outline of a theory of statistical estimation based upon the classical theory of probability. Phil. Trans. Royal Soc. London Ser. A 236 333–380.
Neyman, J. (1938). Lectures and Conferences on Mathematical Statistics (edited by W. Edwards Deming). The Graduate School of the USDA, Washington DC.
Neyman, J. and Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica 16 1–32. MR0025113
Neyman, J. (1951). Review of R. A. Fisher “Contributions to Mathematical Statistics.” The Scientific Monthly 72 406–408.
Norden, R. H. (1972–1973). A survey of maximum likelihood estimation. Internat. Statist. Rev. 40 329–354, 41 39–58.
Pearson, K. (1896). Mathematical contributions to the theory of evolution, III: regression, heredity and panmixia. Philos. Trans. Roy. Soc. London Ser. A 187 253–318. Reprinted in Karl Pearson’s Early Statistical Papers, Cambridge: Cambridge University Press, 1956, pp. 113–178.
Pearson, K. and Filon, L. N. G. (1898). Mathematical contributions to the theory of evolution IV. On the probable errors of frequency constants and on the influence of random selection on variation and correlation. Philos. Trans. Roy. Soc. London Ser. A 191 229–311. Reprinted in Karl Pearson’s Early Statistical Papers, Cambridge: Cambridge University Press, 1956, pp. 179–261.
Porter, T. M. (2004). Karl Pearson: The Scientific Life in a Statistical Age. Princeton Univ. Press. MR2054951
Pratt, J. W. (1976). F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation. Ann. Statist. 4 501–514. MR0415867
Rao, C. R. (1961). Asymptotic efficiency and limiting information. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1 531–546. Univ. California Press, Berkeley. MR0133192
Rao, C. R. (1962). Efficient estimates and optimum inference procedures in large samples, with discussion. J. Roy. Statist. Soc. Ser. B 24 46–72. MR0293766
Savage, L. J. (1976). On rereading R. A. Fisher. Ann. Statist. 4 441–500. MR0403889
Sheynin, O. B. (1971). J. H. Lambert’s work on probability. Archive for History of Exact Sciences 7 244–256. MR1554145
Smith, K. (1916). On the ‘best’ values of the constants in frequency distributions. Biometrika 11 262–276.
Smith, W. L. (1978). Harold Hotelling 1895–1973. Ann. Statist. 6 1173–1183. MR0523758
Stigler, S. M. (1973). Laplace, Fisher, and the discovery of the concept of sufficiency. Biometrika 60 439–445. Reprinted in 1977 in Kendall, Maurice G. and Robin L. Plackett, eds., Studies in the History of Statistics and Probability, Vol. 2. Griffin, London, pp. 271–277. MR0326872
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard Univ. Press, Cambridge, MA. MR0852410
Stigler, S. M. (1986a). Laplace’s 1774 memoir on inverse probability. Statist. Sci. 1 359–378. MR0858515
Stigler, S. M. (1997). Daniel Bernoulli, Leonhard Euler, and Maximum Likelihood. In Festschrift for Lucien Le Cam (D. Pollard, E. Torgersen and G. L. Yang, eds.) 345–367. Springer, New York. Extensively revised and reprinted as Chapter 16 of Stigler (1999). MR1462957
Stigler, S. M. (1999). Statistics on the Table. Harvard Univ. Press, Cambridge, MA. MR1712969
Stigler, S. M. (1999a). The Foundations of Statistics at Stanford. Amer. Statist. 53 263–266. MR1711551
Stigler, S. M. (2001). Ancillary history. In State of the Art in Probability and Statistics (C. M. de Gunst, C. A. J. Klaassen and A. W. van der Vaart, eds.). IMS Lecture Notes Monogr. Ser. 36 555–567. IMS, Beachwood, OH. MR1836581
Stigler, S. M. (2005). Fisher in 1921. Statist. Sci. 20 32–49. MR2182986
Stigler, S. M. (2007). Karl Pearson’s theoretical errors and the advances they inspired. To appear.
van der Vaart, A. W. (1997). Superefficiency. In Festschrift for Lucien Le Cam (D. Pollard, E. Torgersen and G. L. Yang, eds.) 397–410. Springer, New York. MR1462961
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press. MR1652247
Wald, A. (1940). The fitting of straight lines if both variables are subject to error. Ann. Math. Statist. 11 284–300. [A summary of the main results of this article, as presented in a talk July 6, 1939, was published pp. 25–28 in Report of the Fifth Annual Research Conference on Economics and Statistics Held at Colorado Springs July 3 to 28, 1939, Cowles Commission, University of Chicago, 1939.] MR0002739
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Amer. Math. Soc. 54 426–482. MR0012401

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist. 20 595–601. MR0032169
Yule, G. U. (1936). An Introduction to the Theory of Statistics, 10th ed. Charles Griffin, London. [This was the last edition revised by Yule himself; subsequent revisions from 1937 by M. G. Kendall were not greatly changed in emphasis.]
Zabell, S. L. (1992). R. A. Fisher and the fiducial argument. Statist. Sci. 7 369–387. Reprinted in 2005 in S. L. Zabell, Symmetry and its Discontents: Essays on the History of Inductive Probability. Cambridge Univ. Press. MR1181418