THE EPIC STORY OF MAXIMUM LIKELIHOOD

Statistical Science 2007, Vol. 22, No. 4, 598–620
DOI: 10.1214/07-STS249
© Institute of Mathematical Statistics, 2007
arXiv:0804.2996v1 [stat.ME] 18 Apr 2008

Stephen M. Stigler

Abstract. At a superficial level, the idea of maximum likelihood must be prehistoric: early hunters and gatherers may not have used the words “method of maximum likelihood” to describe their choice of where and how to hunt and gather, but it is hard to believe they would have been surprised if their method had been described in those terms. It seems a simple, even unassailable idea: Who would rise to argue in favor of a method of minimum likelihood, or even mediocre likelihood? And yet the mathematical history of the topic shows this “simple idea” is really anything but simple. Joseph Louis Lagrange, Daniel Bernoulli, Leonard Euler, Pierre Simon Laplace and Carl Friedrich Gauss are only some of those who explored the topic, not always in ways we would sanction today. In this article, that history is reviewed from back well before Fisher to the time of Lucien Le Cam’s dissertation. In the process Fisher’s unpublished 1930 characterization of conditions for the consistency and efficiency of maximum likelihood estimates is presented, and the mathematical basis of his three proofs discussed. In particular, Fisher’s derivation of the information inequality is seen to be derived from his work on the analysis of variance, and his later approach via estimating functions was derived from Euler’s Relation for homogeneous functions. The reaction to Fisher’s work is reviewed, and some lessons drawn.

Key words and phrases: R. A. Fisher, Karl Pearson, Jerzy Neyman, Harold Hotelling, Abraham Wald, maximum likelihood, sufficiency, efficiency, superefficiency, history of statistics.

Stephen M. Stigler is the Ernest DeWitt Burton Distinguished Service Professor, Department of Statistics, University of Chicago, Chicago, Illinois 60637, USA e-mail: [email protected].

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2007, Vol. 22, No. 4, 598–620. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

In the 1860s a small group of young English intellectuals formed what they called the X Club. The name was taken as the mathematical symbol for the unknown, and the plan was to meet for dinner once a month and let the conversation take them where chance would have it. The group included the Darwinian biologist Thomas Henry Huxley and the social philosopher-scientist Herbert Spencer. One evening about 1870 they met for dinner at the Athenaeum Club in London, and that evening included one exchange that so struck those present that it was repeated on several occasions. Francis Galton was not present at the dinner, but he heard separate accounts from three men who were, and he recorded it in his own memoirs. As Galton reported it, during a pause in the conversation Herbert Spencer said, “You would little think it, but I once wrote a tragedy.” Huxley answered promptly, “I know the catastrophe.” Spencer declared it was impossible, for he had never spoken about it before then. Huxley insisted. Spencer asked what it was. Huxley replied, “A beautiful theory, killed by a nasty, ugly little fact” (Galton, 1908, page 258).

Huxley’s description of a scientific tragedy is singularly appropriate for one telling of the history of Maximum Likelihood. The theory of maximum likelihood is very beautiful indeed: a conceptually simple approach to an amazingly broad collection of problems. This theory provides a simple recipe that purports to lead to the optimum solution for all parametric problems and beyond, and not only promises an optimum estimate, but also a simple all-purpose assessment of its accuracy. And all this comes with no need for the specification of a priori probabilities, and no complicated derivation of distributions. Furthermore, it is capable of being automated in modern computers and extended to any number of dimensions. But as in Huxley’s quip about Spencer’s unpublished tragedy, some would have it that this theory has been “killed by a nasty, ugly little fact,” most famously by Joseph Hodges’s elegant simple example in 1951, pointing to the existence of “superefficient” estimates (estimates with smaller asymptotic variances than the maximum likelihood estimate). See Figure 1. And then, just as with fatally wounded slaves in the Roman Colosseum, or fatally wounded bulls in a Spanish bullring, the theory was killed yet again, several times over by others, by ingenious examples of inconsistent maximum likelihood estimates.

Joe Hodges’s Nasty, Ugly Little Fact (1951)

$$
T_n =
\begin{cases}
\bar{X}_n & \text{if } |\bar{X}_n| \ge \dfrac{1}{n^{1/4}}, \\[1ex]
\alpha \bar{X}_n & \text{if } |\bar{X}_n| < \dfrac{1}{n^{1/4}}.
\end{cases}
$$

Then $\sqrt{n}(T_n - \theta)$ is asymptotically $N(0, 1)$ if $\theta \ne 0$, and asymptotically $N(0, \alpha^2)$ if $\theta = 0$. $T_n$ is then “super-efficient” for $\theta = 0$ if $\alpha^2 < 1$.

Fig. 1. The example of a superefficient estimate due to Joseph L. Hodges, Jr. The example was presented in lectures in 1951, but was first published in Le Cam (1953). Here $\bar{X}_n$ is the sample mean of a random sample of size $n$ from a $N(\theta, 1)$ population, with $n \operatorname{Var}(\bar{X}_n) = 1$ all $n$, all $\theta$ (Bahadur, 1983; van der Vaart, 1997).

The full story of maximum likelihood is more complicated and less tragic than this simple account would have it. The history of maximum likelihood is more in the spirit of a Homeric epic, with long periods of peace punctuated by some small attacks building to major battles; a mixture of triumph and tragedy, all of this dominated by a few characters of heroic stature if not heroic temperament. For all its turbulent past, maximum likelihood has survived numerous assaults and remains a beautiful, if increasingly complicated theory. I propose to review that history, with a sketch of the conceptual problems of the early years and then a closer look at the bold claims of the 1920s and 1930s, and at the early arguments, some unpublished, that were devised to support them.

2. THE EARLY HISTORY OF MAXIMUM LIKELIHOOD

By the mid-1700s it seems to have become a commonplace among natural philosophers that problems of observational error were susceptible to mathematical description. There was essential agreement upon some elements of that description: errors, for want of a better assumption, were supposed equally able to be positive and negative, and large errors were expected to be less frequently encountered than small. Indeed, it was generally accepted that their frequency distribution followed a smooth symmetric curve. Even the goal of the observer was agreed upon: while the words employed varied, the observer sought the most probable position for the object of observation, be it a star declination or a geodetic location. But in the few serious attempts to treat this problem, the details varied in important ways. It was to prove quite difficult to arrive at a precise formulation that incorporated these elements, covered useful applications, and also permitted analysis.

There were early intelligent comments related to this problem already in the 1750s by Thomases Simpson and Bayes and by Johann Heinrich Lambert in 1760, but the first serious assault related to our topic was by Joseph Louis Lagrange in 1769 (Stigler, 1986, Chapter 2; 1999, Chapter 16; Sheynin, 1971; Hald, 1998, 2007). Lagrange postulated that observations varied about the desired mean according to a multinomial distribution, and in an analytical tour de force he showed that the probability of a set of observations was largest if the relative frequencies of the different possible values were used as the values of the probabilities. In modern terminology, he found that the maximum likelihood estimates of the multinomial probabilities are the sample relative frequencies. He concluded that the most probable value for the desired mean was then the mean value found from these probabilities, which is the arithmetic mean of the observations. It was only then, and contrary to modern practice, that Lagrange introduced the hypothesis that the multinomial probabilities followed a symmetric curve, and so he was left with only the problem of finding the probability distribution of the arithmetic mean when the error probabilities follow a curve. This he solved for several examples by introducing and using “Laplace Transforms.” By introducing restrictions in the form of the curve only after deriving the estimates of probabilities, Lagrange’s analysis had the curious consequence of always arriving at method of moment estimates, even though starting with maximum likelihood! (Lagrange, 1776; Stigler, 1999, Chapter 14; Hald, 1998, page 48.)

At about the same time, Daniel Bernoulli considered the problem in two successively very different ways. First, in 1769 he tried using the hypothesized curve as a weight function, in order to weight, then iteratively reweight and average the observations.

[...] the nineteenth century. By the end of that century this was sometimes known as the Gaussian method, and the approach became the staple of many textbooks, often without the explicit invocation of a uniform prior that Gauss had seen as needed to justify the procedure.

3. KARL PEARSON AND L. N. G. FILON

Over the 19th century, the theory of estimation generally remained around the level Laplace and Gauss left it, albeit with frequent retreats to lower levels. With regard to maximum likelihood, the most
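As a numerical aside on Lagrange's multinomial result from Section 2, the following minimal sketch checks, for one small data set, that the sample relative frequencies give a larger multinomial likelihood than randomly drawn alternative probability vectors, and that the mean computed from the estimated probabilities equals the arithmetic mean of the observations. The data values, the function names and the random-search comparison are illustrative assumptions and are not drawn from Lagrange's memoir or from this article.

```python
import math
import random
from collections import Counter


def relative_frequencies(observations):
    """Sample relative frequency of each distinct observed value."""
    counts = Counter(observations)
    n = len(observations)
    return {value: count / n for value, count in counts.items()}


def log_likelihood(observations, probs):
    """Multinomial log-likelihood up to an additive constant: sum of n_i * log(p_i)."""
    counts = Counter(observations)
    return sum(count * math.log(probs[value]) for value, count in counts.items())


def max_random_log_likelihood(observations, support, trials, seed):
    """Largest log-likelihood found among randomly drawn probability vectors."""
    rng = random.Random(seed)
    best = -math.inf
    for _ in range(trials):
        # Small offset keeps every probability strictly positive.
        weights = [rng.random() + 1e-9 for _ in support]
        total = sum(weights)
        candidate = {v: w / total for v, w in zip(support, weights)}
        best = max(best, log_likelihood(observations, candidate))
    return best


# Hypothetical measurements restricted to a few possible values.
observations = [-1, 0, 0, 1, 0, 2, 1, 0, -1, 0]

p_hat = relative_frequencies(observations)
best_ll = log_likelihood(observations, p_hat)
random_ll = max_random_log_likelihood(observations, sorted(p_hat), 1000, 0)

# Mean implied by the estimated probabilities versus the arithmetic mean.
mean_from_probs = sum(value * p for value, p in p_hat.items())
arithmetic_mean = sum(observations) / len(observations)
```

A random search of this kind is only a spot check on one data set, not a proof; the general statement follows from the standard fact that the sum of n_i log p_i over probability vectors is maximized at p_i = n_i / n.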
