
Statistical Science 1997, Vol. 12, No. 3, 162–176

R. A. Fisher and the Making of Maximum Likelihood 1912–1922

John Aldrich

Abstract. In 1922 R. A. Fisher introduced the method of maximum likelihood. He first presented the numerical procedure in 1912. This paper considers Fisher's changing justifications for the method, the concepts he developed around it (including likelihood, sufficiency, efficiency and information) and the approaches he discarded (including inverse probability).

Key words and phrases: Fisher, Pearson, Student, Bayes's postulate, inverse probability, maximum likelihood, sufficiency, efficiency, information.

INTRODUCTION

The making of maximum likelihood was one of the most important developments in 20th century statistics. It was the work of one man but it was no simple process: between 1912 and 1922 R. A. Fisher produced three justifications, and three names, for a procedure that existed in two forms. [Pratt (1976) shows that Edgeworth (1908) anticipated a good part of the 1922 version, but nobody noticed until a decade or so after Fisher had redone it.] The "absolute criterion" of 1912 was derived from the "principle of inverse probability." The "optimum" of 1921 was associated with the notion of "likelihood" as a quantity for appraising hypothetical quantities on the basis of given data. The "maximum likelihood" method of 1922 gave estimates satisfying the criteria of "sufficiency" and "efficiency." There were two forms, for sometimes Fisher based the likelihood on the distribution of the entire sample, sometimes on the distribution of a particular statistic.

Fisher did not finish with maximum likelihood in 1922 (more was to come in 1925 and 1934), but the method of Cramér and Wald, of countless applications and textbook expositions, had arrived. The 1921 version also has a continuing place in Fisher's writings and elsewhere, as likelihood analysis.

This paper follows the development of the ideas of Fisher alone but the syllabus extends beyond his writings. His problems came out of the work of Karl Pearson and "Student" (W. S. Gosset). Their writings belong in the syllabus. E. S. Pearson put them there and wrote very well about them; see, e.g., E. S. Pearson (1968, 1990). Box, Edwards, MacKenzie and Zabell have also discussed them. So does the teaching to which Fisher was exposed. This material presents a complicated and confusing scene. Pearson promoted at least four estimation methods, and Gosset found sampling distributions in the cause of Bayesian analysis while the least squares texts reproduced disconnected and mutilated bits of Gauss.
The classics discussed by Stigler (1986) are not part of the syllabus, for Fisher did not know them.

The making of maximum likelihood was the making of the world of ideas and nomenclature including "parameter," "statistic," "likelihood," "sufficiency," "consistency," "efficiency," "information", even "estimation." It was the unmaking of the method's inverse probability connection and (for Fisher) the unmaking of inverse probability as well as of the related "Bayes postulate."

John Aldrich is a member of the Department of Economics, University of Southampton, Southampton, SO17 1BJ, United Kingdom (e-mail: jca1@soton.ac.uk).

1. THE ABSOLUTE CRITERION

Fisher published "An absolute criterion for fitting frequency curves" (Fisher, 1912) as a third-year undergraduate. [For biographical information on Fisher see Box (1978).] It is a very implicit piece of writing and, to make any of it explicit, we have to read outside the paper and guess. We start with the usual questions: where is the author coming from; to whom is the paper addressed; what does it say? The author comes out of the theory of errors, a speciality of astronomers and surveyors. Fisher mentions Chauvenet's Spherical Astronomy and


T. L. Bennett's tract on the theory of errors. The only other reference is to his tutor, "to whose criticism and encouragement the present form of this note is due." F. J. M. Stratton, an astronomer, lectured on the theory of errors. He wrote on agricultural statistics and was in touch with Gosset. [See E. S. Pearson (1990, p. 47) for information on Stratton.] Brunt's textbook (Brunt, 1917), based on Stratton's lectures, shows what Fisher had to work with. [See Edwards (1974, 1994); Edwards brought this book into the Fisher syllabus.] It lacks theoretical finesse, and important constructions are mutilated or left out. Least squares is justified following Gauss's Theoria Motus (Gauss, 1809), but omitting the prior! By 1930 Fisher (1930, p. 531) knew the original and by 1934 he knew (Fisher, 1934a, p. 616) the "Gauss–Markov" justification of 1821; in 1912–1922 he seems to have known only bowdlerized Gauss. Pearson's knowledge was no better [see, e.g., Pearson (1920) and Section 17 below]. However, Edgeworth (1908) knew more.

Fisher's title refers to "frequency curves" but he begins by criticizing least squares and the method of moments as general methods of curve-fitting. Fisher (Fisher, 1912, p. 156) objects to least squares that "an arbitrariness arises in the scaling of the abscissa line." The problem is to choose a function f, from a family indexed by θ, to approximate a curve y. The method of least squares is to choose the value of θ to minimize

∫ (y(x) − f(x : θ))² dx.

Now suppose the abscissa line is rescaled with x = x(z). Then, as Fisher observes, the value of θ that minimizes

∫ (y(x(z)) − f(x(z) : θ))² dz

is not the same as that which minimizes the original objective. His reasoning seems to be that in rescaling account needs to be taken of the change of variable. Noninvariance is important in this paper and in Fisher 1912–1922 generally but it was not important in the statistical literature he knew.
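Fisher's noninvariance point is easy to check numerically. The sketch below is my illustration, not Fisher's: the curve y(x) = x², the one-parameter family of constants f(x; θ) = θ and the rescaling x = z² are all invented for the purpose, but they show the least squares minimizer moving when the abscissa is rescaled.

```python
import numpy as np

# Approximate the curve y(x) = x^2 on [0, 1] by a constant f(x; theta) = theta.
# Minimizing the integral of (y - theta)^2 makes theta the average of y with
# respect to whichever variable the integral is taken in.

grid = np.linspace(0.0, 1.0, 1_000_001)

theta_x = np.mean(grid**2)         # integrate in x: theta -> 1/3
theta_z = np.mean((grid**2) ** 2)  # substitute x = z^2 and integrate in z:
                                   # y(x(z)) = z^4, so theta -> 1/5

print(round(theta_x, 3), round(theta_z, 3))  # 0.333 0.2
```

The family and the curve are unchanged; only the variable of integration differs, yet the "best" θ moves from 1/3 to 1/5.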
This argument applies to approximating one curve by another. There was a further objection when fitting a curve to data: "if a finite number of observations are grouped about a series of ordinates, there is an additional arbitrariness in choosing the position of the ordinates and the distances between them in the case of grouping observations." These objections do not apply to the method of moments but there is the question of which equations to use: a choice is made "without theoretical justification."

There are no references but the text reads like a critique of Pearson's "On the systematic fitting of curves to observations and measurements" (Pearson, 1902) which treats these methods. [Brunt refers to this work. Edwards (1994, Appendix 2) reproduces Fisher's notes on it. These were written much later and do not shed any light on my conjecture that the piece was the target of Fisher's criticism.] Fisher already knew of Pearson and his biometric program. In his 1911 essay "Mendelism and biometry" (Bennett, 1983, p. 56) he described Pearson as the one "on whose mathematical work the whole science of biometry has been based."

Pearson thought that the method of moments was more easily applied than least squares; Fisher (Fisher, 1912, pp. 155–156) concurred. [Fisher's notes, in Edwards (1994, Appendix 2), are an interesting commentary on these claims and Fisher's earlier acceptance of them.] Pearson (1902, p. 267) judged that it gave "sensibly as good results" as least squares though admittedly the definition of best fit is "more or less arbitrary." He made a virtue out of the agreement between least squares and the method of moments where both could be used. Fisher (Fisher, 1912, p. 155) appealed to another standard:

This mutual approximation [of results], though convenient in practice . . . is harmful from a theoretical standpoint as tending to obscure the practical discrepancies, and the theoretical indefiniteness which actually exist.

Dismissing these methods, Fisher asserts "we may solve the real problem directly." His (Fisher, 1912, p. 156) complete argument is as follows. Let

p = f δx

be the chance of an observation falling within the range δx. Then P′ with

P′ = Π p = Π f δx

"is proportional to the chance of a given set of observations occurring." As the factors δx are independent of the theoretical curve, "the probability of any particular set of θ's is proportional to P, where

log P = Σ log f.

The most probable set of values for the θ's will make P a maximum."

This final sentence stating the absolute criterion appears at the head of the derivation of least squares in Chauvenet (1891, p. 481), Bennett (1908, p. 15) and Brunt (1917, p. 77): for example, Chauvenet writes "The most probable system of values of the unknown quantities . . . will be that which makes the probability P a maximum." Fisher's criterion is their method applied to all estimation problems.

Fisher works through an example. It is a simple version of the standard least squares problem based on a random sample of size n from the normal error curve

f = (h / √π) exp(−h²(x − m)²)

[h corresponds to 1/σ√2 in modern notation]. Differentiating log P, Fisher obtains the most probable values

m = x̄ and h² = n / (2Σv²),

where v is (x − x̄). The notation does not distinguish true from most probable values.

Fisher was following the textbooks when he maximized P with respect to m, but they used (n − 1) in the estimate of dispersion. [Gauss never gave a Bayesian analysis of mean and dispersion together. In 1809 he treated the mean (regression coefficients) and in 1816, dispersion.] Fisher dissects two arguments for (n − 1). The more significant dissection (of Bennett's argument) is treated in the next section. Chauvenet's (1891, p. 493) argument is a mangled version of Gauss's (1823) treatment of unbiased estimation of σ². Although the arguments are radically different, Brunt (1917, pp. 32–34) distinguishes the Gauss–Chauvenet argument from Bennett's as being from a "purely algebraic standpoint."

The paper's message to estimators seems to be: adopt the one valid method used by astronomers; shun the others as well as both the methods used by Pearson.

2. JUSTIFYING THE CRITERION

How did Fisher justify the absolute criterion? Two views are current: the argument is a rough draft of the likelihood position clearly articulated in 1921 [see E. S. Pearson (1968, p. 412), Edwards (1974, p. 14) or Box (1978, p. 79)]; the argument involves maximizing a posterior density obtained from a uniform prior [see MacKenzie (1981, p. 244), Geisser (1980, p. 62) or Conniffe (1992, p. 485)]. As Edwards (1994) indicates, there are difficulties with both accounts. I find them less plausible than Fisher's (1922, p. 326) own explanation:

I must indeed plead guilty in my original statement of the Method of Maximum Likelihood [i.e., in the 1912 paper] to having based my argument upon the principle of inverse probability . . .

He (Fisher, 1922, p. 326) explains the principle as follows: "if the same observed result A might be the consequence of one or other of two hypothetical conditions X and Y, it is assumed that the probabilities of X and Y are in the same ratio as the probabilities of A occurring on the two assumptions, X is true, Y is true." In the 1912 argument the principle appears as the proposition (given above): "the probability of any particular set of θ's is proportional to P."

The 1922 account is plausible (Fisher's retrospective statements are not always so) because it fits the record. In 1912 the proposition just quoted was a step in the derivation of the absolute criterion. Fisher did not signal that a significant principle is involved and name it, but its role in the argument is clear. In 1917 he said that the absolute criterion is based on the principle of inverse probability. In 1921 he denied angrily that he had assumed a uniform prior in his correlation work, and in 1922 he treated the principle separately from the postulate of a uniform prior. These episodes are examined in detail below.

If we accept this account, then we have to recognize that Fisher's beliefs and usage were idiosyncratic, and a source of confusion. Contemporaries using "inverse probability" [such as Edgeworth (1908) or Keynes (1911)] used it in the context of Bayes's theorem. [In his sketch of the history of inverse probability Edwards (1994) gives a formulation from De Morgan in 1838 that is the same as Fisher's.] Fisher's criterion was widely used for estimating regression coefficients but I cannot find anyone who based it upon his principle. Sophisticated authors like Keynes (1911, p. 324) follow early Gauss and use Bayes's theorem with a uniform prior. Conniffe (1992, p. 485) suggests Fisher was following Keynes but, like the textbooks, Fisher does not mention Bayes's theorem. Chauvenet, Bennett and Brunt go directly from maximizing the probability of observations to maximizing the probability of parameter values. For Keynes such proceedings assume a uniform prior; for Fisher they assume the "principle of inverse probability."

In the 1912 paper the phrase "inverse probability" appears twice: "the inverse probability system" meaning P as a function of the parameters and in a second context we consider below.

In 1922 Fisher (1922a, p. 326) insisted, "I emphasised the fact that such inverse probabilities were relative only." Fisher (1912, p. 160) did indeed emphasize that "fact," which proved a fixed point around which justifications would come and go:

We have now obtained an absolute criterion for finding the relative probabilities of different sets of values for the elements of a probability system of known form. . . . P is a relative probability only, suitable to compare point with point, but incapable of being interpreted as a probability distribution over a region, or giving any estimate of absolute probability.

The principle of inverse probability involves ratios and generates "relative probabilities." Fisher gave an explicit argument against integrating P to obtain an "expression for the probability that the true values of the elements should lie within any given range." If we could, we would obtain incompatible probability distributions depending on the parametrization chosen: "the probability that the true values lie within a region must be the same whether it is expressed in terms of θ or φ." While "the relative values of P would be unchanged by such a transformation," the probability that the true values will lie in a particular region will only be unchanged if the Jacobian is unity, "a condition that is manifestly not satisfied by the general transformation." His objection to least squares in curve fitting was based on the same piece of calculus. Presumably Fisher's is an "absolute" criterion because it does not depend upon the parametrization chosen or the scaling of the variables.

Fisher used this insight to object to Bennett's argument for using (n − 1) when estimating h². Bennett's procedure was to integrate m out of P,

∫ P dm = hⁿ exp(−h²Σ(x − x̄)²) · ∫ exp[−h²n(m − x̄)²] dm ∝ hⁿ⁻¹ exp(−h²Σ(x − x̄)²),

and maximize the resulting marginal density with respect to h to obtain

h² = (n − 1) / (2Σ(x − x̄)²) = (n − 1) / (2Σv²).

Fisher (1912, p. 159) objects, "the integration with respect to m is illegitimate and has no definite meaning with respect to inverse probability." Bennett is a well-chosen target but Fisher's argument could do much more damage. The most basic use for the "inverse probability system" as probability density function was for finding probable errors of parameters (see Sections 10 and 17). Chauvenet (1891, p. 492) obtains the probable error of m in this way. When he (Chauvenet, 1891, p. 500) finds the probable error of x̄, it has the same numerical value. Chauvenet treats them as the same thing.

Many of the ideas and attitudes of the 1912 paper were to persist in Fisher's work: attention to theoretical indefiniteness associated with scaling both observations and parameters; insistence on a single principle of estimation free from indefiniteness; belief that such a principle was available and had found restricted use in the theory of errors; identification of a link between this principle and inverse probability; recognition that inverse probability was subject to an important limitation. Some of these ideas were old and some of the new were not substantiated. However, if not contributions to knowledge, they were contributions to Fisher's intellectual make-up. The ideas varied in durability. Around 1920 Fisher realized his principle of inverse probability was unsatisfactory: it had the same weakness as the postulate of a uniform prior. When "Bayes's postulate" fell, the inverse principle fell with it.

3. THE SECOND CRITERION

Fisher published only one other application of the "absolute criterion" before 1922: to the correlation coefficient. However, this was not an application of the criterion expounded in Section 1; the formula Pearson gave in 1896 was that. It was an application of another criterion, hinted at in the 1912 paper.

Fisher closes his account of estimating h by stating: "we should expect the equation ∂b/∂h = 0 to give the most probable value of h." Here b is the frequency not of the sample but of the statistic μ₂. Fisher did not know b in 1912. As he showed later (Fisher, 1923), the density of the observations is proportional to the product of the densities of μ₂ (= s²) and x̄,

(s^(n−2) / σ^(n−1)) exp(−ns² / 2σ²) · (√n / σ) exp(−n(x̄ − m)² / 2σ²)

(omitting constants). Now the "most probable" value of h (= 1/σ√2) found by maximizing the first factor, the density of s², with respect to h is not the same as that found by maximizing the whole expression, the sample density, with respect to h and m.

Fisher did not pursue the point in the paper but he did in a letter to Gosset. Gosset reports how "with two foolscap pages covered with mathematics of the deepest dye . . . [Fisher] proved, by using n dimensions" that the divisor should be n − 1 after all. [Gosset's letter is reprinted in E. S. Pearson (1968, p. 446).] Fisher (see Fisher, 1915) obtained the density for s² "using n dimensions." Maximizing with respect to σ gives the estimate with divisor n − 1. Fisher never published this argument but he did (Fisher, 1922b, p. 599) for the analogous regression case. Gosset did the same maximization, in Bayesian fashion, assuming all values of σ are equally likely (see Section 5). In his (Student, 1908b) work on correlation the likelihood component was based on the distribution of the correlation coefficient not on that of the sample. However, Fisher does not seem to have known Gosset's work (Student, 1908a, or 1908b) at this time.

Fisher never discussed the relationship between the original and the second criterion, why he preferred the latter in 1912–1921 and changed his mind in 1922. He (Fisher, 1922a, p. 313) recognizes that there has been "some confusion" but does not clear it up beyond stating that the relevance of other parameters is relevant. He never even formulates the second criterion. It seems to be: (i) use the original criterion to find the relevant statistic; (ii) use the distribution of that statistic to find the best version of the statistic. For Savage (1976, p. 455, fn 20) to use the modified criterion is to "cheat a little bit." Such cheating makes honesty seem easy, for first Fisher had to get the distribution of μ₂ and of the correlation. He did. Perhaps this is why.

4. THE CORRELATION COEFFICIENT

Between 1912 and 1916 Fisher used ideas from his paper in estimating the correlation coefficient and criticizing minimum χ². Both applications led to disagreement with Pearson. The more consequential dispute concerned the basis of Fisher's calculation of the "most probable value" of the correlation coefficient. The ensuing precipitation of prior probabilities made Fisher differentiate his position from the Bayesian one.

Fisher sent copies of his paper to Pearson and Gosset. The only letters to survive are the recipients' comments to each other, giving their overall impressions. [Gosset's letter is reprinted with comments in E. S. Pearson (1968, p. 446) and an extract from Pearson's is reproduced in E. S. Pearson (1990, pp. 47–48).] Gosset described the criterion to Pearson: "A neat, but as far as I could understand it, quite unpractical and unserviceable way of looking at things. (I understood it when I read it but it's gone out of my head . . . )" Pearson wrote some time later, "I did not think it of any importance at the time & had some communication with [Fisher] on the subject."

Why did the paper fail to register with Gosset or impress Pearson? The method was the oldest in the book, illustrated by the easiest of cases; Pearson had even used it in his first work on correlation. Fisher paid no attention to the practicality of the method, the matter that most worried Pearson. The arguments against the method of moments, least squares and treating P as a probability distribution may be telling but so what? Pearson was interested in the first two methods but placed them in the realm of the practical.

Fisher's next project was the correlation coefficient. Correlation was central to the study of heredity and so central to Fisher's interests. There was also an outstanding problem: despite the efforts of Student (1908b) and Soper (1913) the exact distribution of the correlation coefficient was still open. Pearson produced some large sample theory for the correlation coefficient (see Section 17). Gosset worked with small samples. In his formulation (Student, 1908b) a prior for the correlation is combined with the distribution of the sample correlation. He conjectured the form of this distribution for the case of zero correlation. When Fisher (1915) found the form for the nonnull case, Gosset told him "there still remains the determination of the probability curve giving the probability of the real value (for infinite population) when a sample has given r" (i.e., the posterior distribution). [Reprinted in E. S. Pearson (1990, p. 25). Pearson and Welch (1958) discuss Student's Bayesianism.]

In a way parallel to Fisher's second criterion, the biometricians reacted to the distributions of r (and s²) by reconsidering whether these statistics were reasonable estimates. Soper (1913, p. 91) remarks on the implications of the "markedly skew character" of the distribution of r:

the value of r found from a single sample will most probably be neither the true r of the material nor the mean value of r as deduced from a large number of samples of the same size, but the modal value of r in the given frequency distribution of r for samples of this size.

In "Appendix I to Papers by 'Student' and R. A. Fisher" Pearson applies this reasoning to the standard deviation: s̃, the modal value of s, is given by

s̃ = σ √((n − 2) / n).

Pearson (1915, p. 528) argues "If we take the most probable value, s̃, as that which has most likely been observed, then the result [s] should be divided by [√((n − 2)/n)] to obtain the most reasonable value for σ." So he advocates the use of the divisor (n − 2). Against this Gosset proposed the maximum posterior estimate based on a uniform prior; this uses the divisor (n − 1).

We now turn to Fisher's paper, not to the distribution theory but to the implications Fisher saw for estimation. In the following passage (Fisher, 1915, p. 520) he criticized Soper's reasoning; r is the sample correlation, r̄ the mean of the sampling distribution of r, ρ the population correlation and t = r/√(1 − r²):

The fact that the mean value r̄ of the observed correlation is numerically less than ρ might have been interpreted as meaning that given a single observed value r, the true value of the correlation coefficient from which the sample is drawn is likely to be greater than r. This reasoning is altogether fallacious. The mean r̄ is not an intrinsic feature of the frequency distribution. It depends upon the choice of the particular variable r in terms of which the frequency distribution is represented. When we use t the situation is reversed.

This criticism recalls the noninvariance objection to least squares. The reasoning is "fallacious" because it depends on the scaling of the variable, not because there is no sense in making probability statements about ρ.

Fisher goes on to use the "absolute criterion," which is "independent of scaling," to obtain the "relation between an observed correlation of a sample and the most probable value of the correlation of the whole population." The most probable value, ρ̂, is obtained by the modified criterion, that is, by maximizing the sampling distribution of r. This is not the maximum likelihood estimator; r is that. Fisher (1915, p. 521) gives the approximate relationship

r = ρ̂(1 + (1 − ρ̂²) / 2n).

He writes the formula not with ρ̂ but with ρ, a symbol he uses both for the parameter and its most probable value. He concludes that "It is now apparent that the most likely value [ρ̂] of the correlation will in general be less than that observed [r]." "Most likely" has arrived. It means "most probable."

5. MINIMUM CHI-SQUARED VERSUS THE GAUSSIAN METHOD

There are further signs of immobility in Fisher's thinking on fundamentals in an abortive publication criticizing Smith (1916). She proposed minimum χ² as a criterion of "bestness" for the values of constants in frequency distributions. The methods she reviewed included the method of moments. The method of moments estimates of the normal mean and variance are the sample mean and variance. She comments (Smith, 1916, p. 262) that "if we deal with individual observations then the method of moments gives, with a somewhat arbitrary definition of what is to be a maximum, the 'best' values for σ and x̄ [the population mean]." She criticized (Smith, 1916, p. 263n) the "Gaussian" choice of maximand:

[P]robability is the frequency of recurrence in a repeated series of trials and this probability is in the [Gaussian] case supposed indefinitely small.

The "Gaussian" maximand was textbook (priorless) Gauss. In correspondence Gauss actually gave Smith's objection to maximizing a quantity that was still infinitely small as a reason for changing from maximizing probability to minimizing sampling variance (see Plackett, 1972, p. 247).

Fisher sent Pearson a draft of a note on Smith's paper. He did not respond to her criticism of the Gaussian method but he reacted to her use of "arbitrary":

There is nothing at all "arbitrary" in the use of the method of moments for the normal curve; as I have shown elsewhere it flows directly from the absolute criterion (Σ log f a maximum) derived from the Principle of Inverse Probability. There is on the other hand, something exceedingly arbitrary in a criterion which depends entirely upon the manner in which the data happen to be grouped.

[Fisher's draft note and covering letter are in E. S. Pearson (1968, pp. 454–455).] Fisher's objection to grouping follows his 1912 objection to least squares.

Pearson agreed with Smith, "I frankly confess I approved the Gaussian method in 1897 . . . I think it logically at fault now." He invited Fisher to write a defence of the method. When Fisher submitted a further piece on Smith's work, Pearson rejected it. [Pearson's letters are reprinted in E. S. Pearson (1968, pp. 455–456).] For Pearson's earlier approval, see Section 17 below.

Fisher's work 1912–1916 contains applications of his ideas on the absolute criterion, on invariance to rescaling of the data and the arbitrariness of grouping but no new basic ideas. "Appendix II to the Papers of 'Student' and R. A. Fisher" by Pearson and his team roused Fisher from his dogmatic slumber.

6. THE PRECIPITATION OF PRIOR PROBABILITIES

The Cooperative Study of Soper, Young, Cave, Lee and Pearson (Soper et al., 1917) has a section on the "'most likely' value of the correlation in the sampled population." The framework is Gosset's. Writing φ(ρ) dρ for "the law of distribution of ρ's," that is, the prior distribution of the population correlation, they state (Soper et al., 1917, p. 352) "we ought to make" the product of φ(ρ) and the density of r, the sample correlation, "a maximum with ρ."

They state (Soper et al., 1917, p. 353) "Fisher's equation . . . is deduced on the assumption that φ(ρ) is constant," with good reason! Fisher told Pearson that the absolute criterion was derived from the principle of inverse probability. It was reasonable to presume that prior probabilities are not mentioned because they are taken to be equal. Fisher's work fitted into Gosset's program: having derived the distribution of r he was using a uniform prior to make probability statements about ρ. Gosset was proposing the same for σ. The cooperators (Soper et al., 1917, p. 353n) expected clusterings of values and rejected the uniform prior for σ. The views that Pearson attacked were Gosset's. It was reasonable to suppose that Fisher held them too, though it turned out that he did not.

Pearson (1892, p. 175) argued that the "equal distribution of ignorance" could have a justification in "experience" if there is prior experience to the effect that all values of the parameter have been met with equal frequency. "Bayes's theorem" in his paper "On the influence of past experience on future expectation" (1907) gives the probability density of the chance of an event happening given p occurrences in n Bernoulli trials on the assumption of the equal distribution of ignorance. "Bayes's theorem" in the cooperative study is similarly based on a uniform prior for the parameter (the correlation).

The cooperators (Soper et al., 1917, p. 358) work through an example of data on 25 parent–child pairs. As to "equal distribution of ignorance" they argue that "there is no such experience in this case." So they judged that (Soper et al., 1917, p. 354) the fuller solution of Fisher's equation "appears to have academic rather than practical value." The cooperators close their account of Bayes by preaching:

Statistical workers cannot be too often reminded that there is no validity in a mathematical theory pure and simple. Bayes' theorem must be based on experience . . .

The cooperative study made Fisher angry. A letter from L. Darwin to Fisher (Bennett, 1983, p. 73) on a draft of Fisher (1921) shows how much: "you speak of someone's interpretation of your remarks being 'so erroneous etc., etc.' . . . Again, you imply that your opponents have criticized you without reading your paper. . . ." Fisher's intellectual response was to state that he and his method had been misrepresented and to redraft his method.

7. THE ABSOLUTE CRITERION REVISITED

Fisher distinguished his method from that of Bayes by articulating the notion of "likelihood." This was not merely a search for the mot juste as Edwards (1974, p. 14) implies. Genuine conceptual development was involved, though at first Fisher did not admit this. It spreads over two papers: the combative 1921 paper was written by a very injured party; the 1922 paper is almost serene. It even admits that the author makes mistakes (cf. Section 2).

The 1921 paper picks up two issues from Fisher's earlier correlation work. We will not be concerned with the (extensive) material on distribution theory. The material on estimation bristles with criticisms: the cooperators are criticized for ascribing to Fisher a position he had not taken; that position is criticized; finally the position Fisher had taken is criticized!

Fisher (1921, p. 16) objected to the cooperators' portrayal of his correlation work as based on a uniform prior:

a reader, who did not refer to my paper might imagine that I had used Boole's ironical phrase "equal distribution of ignorance," and that I had appealed to "Bayes' Theorem". I must therefore state that I did neither.

Against the possible charge that he had implicitly assumed a uniform prior for the correlation, Fisher begins by pointing out that he would have obtained the same value of the "optimum," his new name for the most probable or likely value, if he had used a transformation (z) of the correlation (r): "Does the value of the optimum therefore depend upon equal numbers of parental correlations having occurred in equal intervals dz? If so, it should be noted this is inconsistent with an equal distribution in the scale of r."
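Fisher's z-versus-r argument can be mimicked with a simpler stand-in. In the sketch below (my construction: a binomial chance p and its log-odds replace the correlation r and its transform z, and the data are invented) the position of the optimum is unchanged by the reparametrization, while a prior that is flat in one scale shifts the posterior mode in the other.

```python
import numpy as np

k, n = 7, 10                                   # invented data: 7 successes in 10

def loglik(p):
    return k * np.log(p) + (n - k) * np.log(1.0 - p)

# Maximize over a fine grid in the p-scale...
p_grid = np.linspace(0.001, 0.999, 200_001)
opt_p = p_grid[np.argmax(loglik(p_grid))]

# ...and over a fine grid in the log-odds scale, mapped back to p.
w_grid = np.linspace(-7.0, 7.0, 200_001)
p_from_w = 1.0 / (1.0 + np.exp(-w_grid))
opt_w = p_from_w[np.argmax(loglik(p_from_w))]

# Both searches land on the same optimum, p = 0.7: the criterion is invariant.

# A prior flat in the log-odds scale has density 1 / (p (1 - p)) in the
# p-scale, so the posterior mode moves to (k - 1) / (n - 2) = 0.75.
mode_flat_w = p_grid[np.argmax(loglik(p_grid) - np.log(p_grid * (1.0 - p_grid)))]
print(round(opt_p, 3), round(opt_w, 3), round(mode_flat_w, 3))  # 0.7 0.7 0.75
```

The optimum does not depend on which scale the search runs in, which is exactly why its value cannot testify to a uniform prior in any particular scale: "equal distribution" in the transformed scale contradicts "equal distribution" in the original one.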

Fisher gives a somewhat edited account of his 1912 and 1915 papers. He recalls his use of "most likely value" but not of "most probable value." He mentions (Fisher, 1921, p. 4) the determination of "inverse probabilities" as "resting" on Bayes's work but does not mention his own use of the notion. The account is also somewhat enhanced: the simple facts (that there was no mention of prior probabilities in 1912 and that the "criterion" delivers the "optimum") become (Fisher, 1921, p. 16):

    As a matter of fact, as I pointed out in 1912 . . . the optimum is obtained by a criterion which is absolutely independent of any assumption regarding the a priori probability of any particular value. It is therefore the correct value to use when we wish for the best value for the given data, unbiassed by any a priori presuppositions.

These were additions: he had not pointed out that the criterion was "independent of any assumption. . . ."

8. THE REJECTION OF BAYES AND INVERSE PROBABILITY

Fisher appended a "Note on the confusion between Bayes' Rule and my method of the evaluation of the optimum" to his 1921 paper. He recalled Bayes (Fisher, 1921, p. 24):

    Bayes (1763) attempted to find, by observing a sample, the actual probability that the population value [ρ] lay in any given range. . . . Such a problem is indeterminate without knowing the statistical mechanism under which different values of ρ come into existence; it cannot be solved from the data supplied by a sample, or any number of samples, of the population.

In 1922 he wrote (Fisher, 1922a, p. 326): "probability is a ratio of frequencies, and about the frequencies of such [parameter] values we can know nothing whatever."

In 1922 Fisher used the binomial distribution to illustrate maximum likelihood and to contrast it with Bayes's method. Against Bayes's "postulate," that is, that it is reasonable to assume that the a priori distribution of the parameter p is uniform, he argued (Fisher, 1922a, pp. 324-325) that "apart from evolving a vitally important piece of knowledge, that of the exact form of the distribution of p, out of an assumption of complete ignorance, it is not even a unique solution." This nonuniqueness charge was the final form of the 1912 noninvariance argument. He shows that a uniform prior for p and a uniform prior for θ, defined by

    sin θ = 2p − 1,

lead to inconsistent posterior distributions. The point was not new. For instance, Edgeworth (1908, p. 392) made it in the context of different parametrizations of dispersion for the normal distribution: a uniform prior for h is inconsistent with a uniform prior for c (= 1/h). However, Edgeworth did not think the objection fatal.

In 1921 Fisher quietly uncoupled the absolute criterion from the method of inverse probability. In 1922 he confronted the method and argued (Fisher, 1922a, p. 326) that it did not give a unique solution either: "irrelevant" distinctions between "conditions" alter relative probabilities. Although he begins by distinguishing the principle of inverse probability from the assumption of a uniform prior, he ends by concluding that inverse probability "amounts to assuming that it was known that our universe had been selected at random from an infinite population in which X was true in one half, and y true in one half." This sounds very like an admission that the absolute criterion is based on Bayes's postulate. From "Inverse probability" (Fisher, 1930) onwards Fisher treated the principle of inverse probability and Bayes's postulate as one.
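Fisher's binomial illustration is easy to check. In the sketch below the counts (14 successes in 20 trials) are invented: the likelihood has its maximum at k/n however the parameter is expressed, while the posterior mode depends on whether the "equal distribution of ignorance" is laid on p or on θ with sin θ = 2p − 1.

```python
import math

n, k = 20, 14   # hypothetical record: 14 successes in 20 trials

def log_lik(p):
    # Binomial log likelihood (constant term dropped).
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 100000 for i in range(1, 100000)]

# Maximum likelihood; also the posterior mode under a uniform prior on p.
p_hat = max(grid, key=log_lik)

# A uniform prior on theta, where sin(theta) = 2p - 1, induces on p the
# density proportional to 1/sqrt(p(1-p)), so the posterior mode shifts.
def log_post_theta_scale(p):
    return log_lik(p) - 0.5 * (math.log(p) + math.log(1 - p))

p_mode_theta = max(grid, key=log_post_theta_scale)

assert abs(p_hat - k / n) < 1e-3                       # MLE is k/n = 0.7
assert abs(p_mode_theta - (k - 0.5) / (n - 1)) < 1e-3  # shifted mode
```

A uniform prior on θ corresponds to the arcsine, Beta(1/2, 1/2), density on p, so the two "ignorance" priors disagree about the same data, which is the nonuniqueness Fisher pressed against Bayes's postulate.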
9. THE MAKING OF LIKELIHOOD

Fisher (1921, p. 4) tacitly criticizes himself when he states "two radically distinct concepts have been confused under the name of 'probability' and only by sharply distinguishing between these can we state accurately what information a sample does give us respecting the population from which it was drawn." The concepts are probability and likelihood (Fisher, 1921, p. 25):

    We may discuss the probability of occurrence of quantities which can be observed . . . in relation to any hypotheses which may be suggested to explain these observations. We can know nothing of the probability of hypotheses . . . [We] may ascertain the likelihood of hypotheses . . . by calculation from observations: . . . to speak of the likelihood . . . of an observable quantity has no meaning.

He develops the point with reference to correlation (Fisher, 1921, p. 24):

    What we can find from a sample is the likelihood of any particular value of ρ, if we define the likelihood as a quantity proportional to the probability that, from a population having the particular value of ρ, a sample having the observed value of r, should be obtained.

Fisher still worked with the distribution of r rather than with the distribution of the entire set of observations. In 1922 that changed (Fisher, 1922a, p. 310):

    The likelihood that any parameter (or set of parameters) should have any assigned value (or set of values) is proportional to the probability that if this were so, the totality of observations should be that observed.

Fisher (1921, p. 24; 1922a, p. 327) redrafted what he had written in 1912 about inverse probability, distinguishing between the mathematical operations that can be performed on probability densities and likelihoods: likelihood is not a "differential element," it cannot be integrated. Fisher had not been concerned with nonoptimal values of the likelihood since 1912. In 1921 he reemphasized the importance of the whole function. He points out (Fisher, 1921, p. 24) that "no transformation can alter the value of the optimum, or in any way affect the likelihood of any suggested value of ρ." In the note on Bayes he describes how probable errors encode likelihood information:

    "Probable errors" attached to hypothetical quantities should not be interpreted as giving any information as to the probability that the quantity lies within any particular limits. When the sampling curves are normal and equivariant the "quartiles" obtained by adding and subtracting the probable error, express in reality the limits within which the likelihood [relative to the maximum] exceeds 0.796542 . . .

This view of "reality" leads directly to the exposition of likelihood as a "form of quantitative inference" in Statistical Methods and Scientific Inference (1971, p. 75) or to works like Edwards's Likelihood.

The 1921 paper was not consistent likelihood analysis. In Examples II (Fisher, 1921, p. 11) and VIII (p. 23) probable errors are used to calculate prob-values, tail areas, not likelihood values. In Example II he confuses the two, concluding from prob-values of 0.142 for one hypothesis and 0.00014 for another that the latter is roughly 1,000 times "less likely" than the former.

Fisher (1922a, p. 327fn) wrote "in an important class of cases the likelihood may be held to measure the degree of our rational belief in a conclusion." He wrote (Fisher, 1925b, pp. 10-11) of likelihood as "the mathematical quantity which appears to be appropriate for measuring our order of preference among different possible populations." Despite such declarations there was no development of likelihood methods. Edwards (1972, p. 3) writes of Fisher's use of likelihood, "From 1921 until his death, Fisher . . . quietly and persistently espoused a . . . measure by which he claimed rival hypotheses could be weighed." The measure came in with a bang but the volume was not sustained. In 1922 the main case for maximum likelihood was based on its sampling properties and on its association with information and sufficiency.
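The constant 0.796542 in Fisher's note on Bayes can be recovered. For a normal sampling curve the log likelihood is −(θ − θ̂)²/2σ², so at the optimum plus or minus one probable error (about 0.67449σ, the quartile of the normal curve) the likelihood relative to its maximum is exp(−0.67449²/2):

```python
import math
from statistics import NormalDist

# Probable error = upper quartile of the normal curve, about 0.67449 sigma.
q = NormalDist().inv_cdf(0.75)

# Relative likelihood at theta-hat +/- one probable error under a normal
# log likelihood -(theta - theta_hat)**2 / (2 * sigma**2).
rel_lik = math.exp(-q * q / 2)

assert abs(rel_lik - 0.796542) < 1e-4   # matches Fisher's figure
```

The computed value agrees with the figure Fisher printed to about five decimal places.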
10. GAUSS AND THE SAMPLING PROPERTIES OF ESTIMATORS

We have seen nothing to suggest that an estimator's sampling properties could matter for Fisher. However, while he was developing likelihood he was working on what seems to have been an unrelated project, comparing the sampling properties of estimators. In 1919-1920 he compared different estimators of the standard deviation and the correlation coefficient. He did not say that the estimator given by the absolute criterion was best and that was that! Absolute criterion and optimum method do not figure here, only size of probable error.

Again Gauss (or textbook Gauss) is a presence. Earlier developments can be counted remote consequences of his 1809 justification of least squares; those we are about to treat can be linked to his 1816 paper on estimating the parameter h (see Section 1) of the normal distribution. Gauss gave a Bayesian analysis using a uniform prior for h (m is assumed known) and obtained a normal approximation to the posterior. He also compared the large-sample properties of estimators of h based on different powers of the errors and showed that the value 2 gives the estimator with least dispersion. He distinguished the two investigations but they are linked: the probable error of the posterior distribution of h is equal to the probable error of the sampling distribution of the squared deviation estimator of h.

Merriman's (1884, pp. 206-208) textbook reproduces the Bayesian argument (without the prior) and the sampling theory results, linking them with the phrase "a similar investigation." He reports Gauss's finding that with absolute values 114 observations "give the same uncertainty . . . as 100 observations" with squared errors. Merriman's book is significant as it seems to have been Pearson's reference for least squares. Pearson (1894, p. 21) uses data from Merriman, and Yule, Pearson's assistant, used it as his least squares reference (Yule, 1897, p. 818). In Section 17 we consider Pearson's use of Merriman-Gauss and its connection with Fisher's 1922 analysis.
11. FISHER ON COMPARING ESTIMATES

Fisher first compares sampling properties of estimators in his 1919 critique of Thorndike's work on twins. Thorndike used a measure of resemblance, 2xy/(x² + y²), where x and y are intraclass bivariate normal. Fisher obtained its distribution and mentioned (Fisher, 1919, p. 493) that data on resemblances might be used to estimate the population correlation but added at once that "The value of the correlation that best fits the data may be found with less probable error, from the product-moment correlations." This remark is not developed, nor its basis made clear, but in 1920 he devoted a paper to comparing estimates of the standard deviation of the normal distribution.

The 1920 paper was a reply to Eddington's (1914, p. 147) claim that, "contrary to the advice of most text-books," it is better to use σ1, based on absolute values of residuals, than σ2, based on squares of residuals. As distribution theory, Fisher's response was a continuation of his work on s². Correcting Eddington took much less than Fisher gave. Finding the large-sample probable errors of the two estimators was dispatched in the first third of the paper and demonstrating that 2 is the best choice for the power of the residuals took another page or so.

The rest of the paper is concerned with the joint distribution of the two estimators, σ1 and σ2. Fisher (1920, p. 763) writes that "Full knowledge of the effects of using one rather than another of two derivates can only be obtained from the frequency surface of pairs of values of the two derivates." It is not clear whether this principle guides the investigation or is a conclusion from it. Fisher's route to the marginal density of σ1 is through the joint density of σ1 and σ2. This accident of technique may have led him to the principle.

Fisher (1920, p. 758) wrote "The case is of interest in itself, and it is illuminating in all similar cases, where the same quantity may be ascertained by more than one statistical formula." This suggests a program of comparing different estimators on a case-by-case basis. The absolute criterion has dropped out and so have the universal pretensions associated with it. In 1922 all are back, thanks to an idea in the 1920 paper.

12. SUFFICIENCY IN 1920

The important new idea is sufficiency, though it was only named in 1922. Fisher (1920, p. 768) ends by discussing a "qualitative distinction, which reveals the unique character of σ2." This property had not been identified in earlier investigations of the merits of σ2 versus σ1. [Stigler (1973) discusses related work of Laplace.] "For a given value of σ2, the distribution of σ1 is independent of σ." Fisher added the gloss: "The whole of the information to be obtained from σ1 is included in that supplied by σ2." He went further and argued that σ1 could be replaced by any other statistic: "The whole of the information respecting σ, which a sample provides is summed up in the value of σ2."

Fisher does not state that the "qualitative distinction" underlies the superiority of σ2. Perhaps it was obvious. He finishes (Fisher, 1920, pp. 769-770) by investigating the curve related to σ1 "in the same way as the normal curve is related" to σ2. His treatment is, he admits, not very satisfactory. However, there is an implicit generalization: the sufficient statistic provides an "ideal measure" of the parameter.

13. MAXIMUM LIKELIHOOD IN 1922

The name "maximum likelihood" finally appears in the "Mathematical foundations of theoretical statistics" (Fisher, 1922a, p. 323). The method is the link between two separate investigations. The first relates maximum likelihood to sufficiency and efficiency. The second uses the large-sample standard error of maximum likelihood to compare error curves and assess the method of moments.

The new theoretical scheme is this: maximum likelihood produces sufficient estimates; sufficient estimates are efficient, besides being important in their own right. The theory is thin and underdeveloped. The only sufficient statistic presented is the standard deviation from 1920, though Fisher (1922a, p. 357) alludes to the case of the Poisson parameter and gives a demonstration in Fisher, Thornton and MacKenzie (1922, p. 334).

In other ways the paper is a grand finale. Besides inverse probability and Bayes (see above), the method of moments, minimum χ² and the effects of grouping are all reexamined. Before 1922 Fisher worked on only three estimation problems: the mean and variance of the univariate normal and the correlation coefficient of the bivariate normal. The new paper has a miniature treatise on the Pearson curves and the method of moments. The efficiency of the method of moments is assessed by comparing the large-sample variance of such estimators with that of maximum likelihood. Sufficiency plays no role here, nor is the computability of the superior estimator considered.
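Fisher's 1920 comparison of σ1 and σ2 can be replayed by simulation; the settings below (500 normal samples of size 400, unit σ) are arbitrary choices. The mean absolute residual is scaled by √(π/2) so that both statistics estimate σ; σ2 then shows the smaller sampling spread, σ1's large-sample efficiency being about 88 percent.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed, for reproducibility
sigma, n, reps = 1.0, 400, 500

s1_vals, s2_vals = [], []
for _ in range(reps):
    x = [random.gauss(0, sigma) for _ in range(n)]
    m = statistics.fmean(x)
    # sigma_1: scaled mean absolute residual; sigma_2: root mean square.
    s1_vals.append(math.sqrt(math.pi / 2)
                   * statistics.fmean(abs(v - m) for v in x))
    s2_vals.append(math.sqrt(statistics.fmean((v - m) ** 2 for v in x)))

se1, se2 = statistics.stdev(s1_vals), statistics.stdev(s2_vals)
eff = (se2 / se1) ** 2   # squared ratio of the probable errors

assert 0.6 < eff < 1.0   # sigma_2 wins, as Fisher showed against Eddington
```

The squared ratio of probable errors is the efficiency measure of the 1922 paper; the simulated value hovers around the asymptotic 1/(π/2 − 1) · 1/2 ≈ 0.876.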
14. EFFICIENCY AND SUFFICIENCY

Fisher (1922a, p. 316) presents three "Criteria of Estimation," two of them linked to maximum likelihood. He links the third, "consistency," to the method of moments, taking it for granted that maximum likelihood satisfies it.

The "Criterion of Efficiency" refers to large-sample behavior: "when the distributions of the statistics tend to normality, that statistic is to be chosen which has the least probable error." This criterion underlies Eddington's comparison and the 1920 paper. The efficiency of an estimator is given by the square of the ratio of its probable error to that of the most efficient statistic.

The "Criterion of Sufficiency" is that "the statistic chosen should summarize the whole of the relevant information supplied by the sample." The "mathematical interpretation" of sufficiency is the conditional distribution notion from the 1920 paper.

"Efficiency" is efficiency in summarizing information so, in the large-sample case, Fisher (1922a, p. 317) identifies sufficiency and full efficiency. Let θ be the parameter to be estimated and let θ1 be the sufficient statistic and θ2 any other statistic. In accordance with the scope of the criterion of efficiency, θ1 and θ2 are taken to be normal in large samples, indeed jointly normal with common mean θ. Using the condition that the distribution of θ2 given θ1 does not involve θ, Fisher obtains the relation

    r σ2 = σ1,

where r is the correlation between θ1 and θ2 and σ1 and σ2 are the large-sample standard errors of θ1 and θ2, respectively. This shows that σ1 is necessarily less than σ2 and that the efficiency of θ2 is measured by r². Although this analysis covers the 1920 efficiency comparison, the technique is different.
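The relation r σ2 = σ1 can be checked by simulation on a pair Fisher does not use here: θ1 the sample mean (efficient for a normal mean) and θ2 the sample median. The settings (3,000 samples of size 101) are arbitrary. In large samples the correlation r between the two estimators should satisfy r σ2 ≈ σ1, and r² should approach the median's efficiency 2/π ≈ 0.64.

```python
import math
import random
import statistics

random.seed(2)  # arbitrary seed, for reproducibility
n, reps = 101, 3000

t1, t2 = [], []   # theta_1 = sample mean, theta_2 = sample median
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    t1.append(statistics.fmean(x))
    t2.append(statistics.median(x))

s1, s2 = statistics.stdev(t1), statistics.stdev(t2)
m1, m2 = statistics.fmean(t1), statistics.fmean(t2)
cov = sum((a - m1) * (b - m2) for a, b in zip(t1, t2)) / (reps - 1)
r = cov / (s1 * s2)   # correlation between the two estimators

assert abs(r * s2 - s1) / s1 < 0.1        # Fisher's r*sigma_2 = sigma_1
assert abs(r * r - 2 / math.pi) < 0.08    # efficiency of the median
```

The simulation also shows σ1 < σ2 directly, in line with Fisher's conclusion that the sufficient (here, efficient) statistic cannot have the larger standard error.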

15. SUFFICIENCY AND INFORMATION

The move from efficiency to sufficiency was not simply from "large" to "all finite" samples. The objective seems to have changed: efficiency relates to estimation accuracy, sufficiency to "information." "Information" has arrived. The 1920 notion of information was a comparative one: information in one statistic is "included" in that supplied by another. There is no explanation of what information is. Perhaps a statistic contains relevant information if its distribution depends on the parameter of interest. In 1922 (Fisher, 1922a, p. 311) information is written into the statistician's task. The statistician constructs a hypothetical infinite population, of which the actual data are regarded as constituting a random sample.

The phrase "hypothetical infinite population" was new in 1922. Fisher later told Jeffreys (Bennett, 1990, pp. 172-173) that it may have been used by Venn, "who . . . was wrestling with the same idea." The idea, if not the phrase, was already commonplace in 1922. Galton (1889, p. 125) imagined an "exceedingly large Fraternity, far more numerous than is physiologically possible" from which random samples are taken, and Fisher (1918, p. 400n) describes an "infinite fraternity, . . . all the sons which a pair of parents might conceivably have produced." "Indefinitely large population" appears in the title of Fisher (1915). In his correlation paper Student (1908b, p. 35) wrote that starting from the actual sample of 21 years "one can image the population indefinitely increased and the 21 years to be a sample from this." The example is the same.

The parameters associated with the law of distribution of this population are "sufficient to describe it exhaustively in respect of all qualities under discussion." Fisher continues (Fisher, 1922a, p. 311):

    Any information given by the sample, which is of use in estimating the values of these parameters is relevant information . . . . It is the object of the statistical processes . . . to isolate the whole of the relevant information contained in the data.

The only "statistical process" discussed in the 1922 paper is "estimation." Isolating relevant information is the business of sufficient estimates. Efron (1982) suggests that Fisher was not primarily interested in estimation but in "summarization." In 1922 Fisher could see no need to choose between them.
16. SUFFICIENCY AND MAXIMUM LIKELIHOOD

The normal standard deviation seemed to exemplify a connection beyond efficiency and sufficiency, namely, between sufficiency and maximum likelihood. Fisher (1922a, p. 323) conjectured that maximum likelihood "will lead us automatically" to a sufficient statistic. He had much more confidence in this proposition than in his argument for it. He invited "pure mathematicians" to improve upon his proof. The proposition is a cornerstone of the paper but of course it is not true, as Geisser (1980, p. 64) notes in his brief commentary.

Fisher's (1922a, pp. 330-331) argument is obscure in structure and in detail; it is not even clear whether he is showing that "optimality" implies sufficiency or vice versa. He considers the joint density f(θ, θ̂, θ1) of the maximum likelihood estimator θ̂ and any other statistic θ1 and argues that the equation

    (∂/∂θ) log f(θ, θ̂, θ1) = 0

is satisfied irrespective of θ1 by the value θ = θ̂. He then shows correctly that this equation would be satisfied if the maximum likelihood estimator were sufficient. Fisher continued to work on the maximum likelihood-sufficiency connection, producing the factorization theorem in 1934. However, we now turn to maximum likelihood's other role in the 1922 paper: its use as an efficiency yardstick for other estimators, particularly for the method of moments.

17. PROBABLE ERRORS OF OPTIMUM STATISTICS

Fisher seemed to consider his work on the large-sample distribution of maximum likelihood as refining existing work. He noted (Fisher, 1922a, p. 329fn) how a "similar method" for calculating large-sample standard errors had been developed by Pearson and Filon (1898). In 1940 he saw their work differently: "If I remember right, they arrive at the formula by a very obscure argument involving inverse probability" (Bennett, 1990, p. 125). There is no evidence that Fisher began by developing their argument but it is not implausible that he did. In any case a comparison is instructive. Fisher's treatment is unambiguously frequentist while their method developed out of work that was as unambiguously Bayesian.

The Pearson-Filon treatment can be seen evolving in Pearson's lectures of 1894-1896. Pearson (1894-1896, vol. 1, p. 90) begins by treating the standard deviation along the lines of Merriman (see Section 10 above). His first new case is the correlation coefficient; he published this in 1896. Although he later referred to his use of the "Gaussian method" (see Section 5 above), Pearson's way of finding the "best" value and its probable error was Gauss without a prior.

Pearson (1896, pp. 264-266) considers a bivariate normal distribution with parameters σ1, σ2 and r; Gauss-Merriman had one parameter. Pearson expresses the joint density of the observations in terms of r, not by integrating out the nuisance parameters or maximizing them out, but by treating σ1² and σ2² as identical to (1/n)Σx² and (1/n)Σy², respectively. He states (Pearson, 1896, p. 265) that the "best" value of r is found by choosing the value for which "the observed result is the most probable." A quadratic Taylor approximation to the log of the function around its maximum value (given by the product-moment formula) yields a normal density. The probable errors are obtained from this. The value of r chosen appears to be "best" simply because of the way it is obtained. Pearson mentions the analogy of estimates of σ: "the error of mean square gives the theoretically best results." For Merriman the error of mean square was distinguished because it was given by the Gaussian method and because it had smaller probable error than other estimators. Pearson does not mention the second point here or in his lectures (Pearson, 1894-1896, vol. 1, 90-92; vol. 3, 15-22).

Pearson and Filon (1898) floated off the method of calculating probable errors from that of finding best values and applied the technique to "nonbest" estimates, in particular to the method of moments. Welch (1958, p. 780), MacKenzie (1981, pp. 241-243) and Stigler (1986, pp. 342-345) describe the method as implicitly Bayesian. I would put things differently. The method had Bayesian origins but there was no longer a Bayesian argument to be made explicit. Pearson had taken a method and changed it in a way that contradicted the basis of the method.

The Taylor expansion is now around the true value and the discrepancy is referred to as an "error," a usage that signals a concern with the probable error as a sampling distribution property. The new multiparameter treatment leads to a correction of the probable error formula for the correlation, but the more fundamental change is not registered. I will simplify their argument and consider a scalar variable x and parameter θ. Let P0 be the density of the observations x1, . . . , xn, and let PΔ be the density corresponding to the parameter value θ + Δθ. Consider P0/PΔ and its logarithm

    log(P0/PΔ) = Σ [log f(xi; θ) − log f(xi; θ + Δθ)].

Expanding using Taylor's theorem,

    log(P0/PΔ) = −Δθ Σ (d/dθ) log f(xi; θ) − (1/2)(Δθ)² Σ (d²/dθ²) log f(xi; θ) − · · · .

"Replacing sums by integrals" they obtain

    log(P0/PΔ) = −nA Δθ − (1/2)nB (Δθ)² − · · · ,

where A and B are defined by

    A = ∫ [(d/dθ) log f(x; θ)] f(x; θ) dx,
    B = ∫ [(d²/dθ²) log f(x; θ)] f(x; θ) dx.

Since A = 0, they write

    PΔ = P0 exp((1/2)nB (Δθ)²) · exp(cubic and higher-order terms).
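The quantities A = ∫ [(d/dθ) log f] f dx and B = ∫ [(d²/dθ²) log f] f dx in the Pearson-Filon expansion can be evaluated by crude quadrature for a concrete model. In this sketch f is a normal density with unknown mean (σ = 1 and μ = 0.3 are arbitrary choices): A vanishes and −B is the information 1/σ², so the large-sample standard error (−nB)^(−1/2) reduces to the textbook σ/√n.

```python
import math

# Assumed model: normal density f(x; mu) with known sigma = 1; theta = mu.
mu, sigma = 0.3, 1.0

def f(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) \
        / (sigma * math.sqrt(2 * math.pi))

def dlogf(x):   # d/d mu of log f
    return (x - mu) / sigma ** 2

def d2logf(x):  # d^2/d mu^2 of log f
    return -1 / sigma ** 2

# Crude quadrature on a wide uniform grid.
h = 0.001
xs = [i * h for i in range(-8000, 8001)]
A = sum(dlogf(x) * f(x) for x in xs) * h
B = sum(d2logf(x) * f(x) for x in xs) * h

assert abs(A) < 1e-6                    # the score integrates to zero
assert abs(B + 1 / sigma ** 2) < 1e-4   # B is minus the information

n = 100
se = (-n * B) ** -0.5                   # large-sample standard error
assert abs(se - sigma / math.sqrt(n)) < 1e-4
```

That A = 0 is what lets both Pearson-Filon and Fisher drop the linear term of the expansion; the quadratic term alone then yields the normal approximation and its standard error.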

They (Pearson and Filon, 1898, pp. 233-234) interpret PΔ as the density of Δθ, an "error"; that is, Δθ is the difference between the estimate and the true value of the frequency constant. They conclude that Δθ is approximately normally distributed with zero mean and standard deviation (−nB)^(−1/2).

Fisher's 1922 procedure resembles Gauss-Merriman-Pearson but the objective is not to approximate the posterior distribution but a frequency distribution, as in Pearson and Filon. Fisher (1922a, p. 328) expands the log likelihood around the value of a statistic θ1:

    log φ = C + (θ − θ1) Σa + (1/2)(θ − θ1)² Σb + · · · .

For "optimum" statistics, that is, when θ1 is maximum likelihood, Σa = 0. For sufficiently large samples Σb differs from nb, where b is the mean value of the coefficients, by a quantity of order √n σb. Eliminating all terms which converge to zero as n increases gives

    log φ = C + (nb/2)(θ − θ1)².

So

    φ ∝ exp((nb/2)(θ − θ1)²).

Fisher argues that the density of θ1 is proportional to this function and that the large-sample standard error of θ1 is given by (−nb)^(−1/2).

In 1922 Fisher made no fundamental objection to Pearson and Filon's development. He criticized the application of their formulae to methods other than maximum likelihood. The novel application in Pearson and Filon was the construction of probable error formulae for the method of moments applied to the Pearson curves. The large part of the 1922 paper on the (in)efficiency of this method can be taken as an ironical commentary on their results. The formulae had already been quietly withdrawn by Pearson; Fisher was reinstating them as appropriate to the efficient method of maximum likelihood.

18. GOODBYE TO ALL THAT

Fisher's 1922 view of the place of maximum likelihood in statistics did not last. In particular, he discovered that there may be no sufficient statistic for maximum likelihood to find. However, he did not revise his refutation of Bayes's postulate and inverse probability nor his general diagnosis of what had been wrong with statistics. We close with an examination of that diagnosis.

Fisher (1922a, p. 310) points to the "anomalous state of statistical science." There has been an "immense amount of fruitful labour" on applications, including such topics as Pearson curves, but the basic principles remain in "a state of obscurity." The state of affairs is due to the "survival to the present day of the fundamental paradox of inverse probability which like an impenetrable jungle arrests progress towards precision of scientific concepts" (Fisher, 1922a, p. 311).

Fisher (1922a, p. 326) refers to the "baseless character of the assumptions" made under the names inverse probability and Bayes's theorem and the "decisive criticism to which they have been exposed." The critics (Boole, Venn and Chrystal) came to appear regularly in Fisher's historical accounts. Yet when he detailed their criticisms, in 1956 (Fisher, 1971, p. 31), he reflected, "[Chrystal's] case as well as Venn's illustrates the truth that the best causes tend to attract to their support the worst arguments." Zabell (1989) was unimpressed when he examined their work. Fisher may have drawn comfort from the critics' existence; it is hard to believe he was influenced by them.

A major factor in the "survival" was a "purely verbal confusion" (Fisher, 1922a, p. 311): the use of the same name for true value and estimate. So Fisher introduced the terms "parameter" and "statistic." He wrote later (Bennett, 1990, p. 81) "I was quite deliberate in choosing unlike words for these ideas which it was important to distinguish as clearly as possible." The outraged and confused commonsense of the older generation was voiced by Pearson (1936, p. 49n): he objected "very strongly" to the "use of the word 'statistic' for a statistical parameter" and asked "are we also to introduce the words, a mathematic, a physic, . . . for parameters . . . of other branches of science?"

ACKNOWLEDGMENTS

I am grateful to librarians at University College and Adelaide for their help. My thanks to Janne Rayner, Denis Conniffe, Raymond O'Brien and Grant Hillier for discussion. On submitting an earlier draft of this paper I learned that A. W. F. Edwards already had a paper reexamining Fisher's use of "inverse probability"; a revision of his 1994 preprint is printed here. However, though we use the same "data," our interpretations are somewhat different. I am grateful to Anthony Edwards for helpful discussions as well as to the Editor and an Associate Editor for their suggestions.

REFERENCES

BENNETT, J. H., ed. (1971). Collected Papers of R. A. Fisher 1-5. Adelaide Univ. Press.
BENNETT, J. H., ed. (1983). Natural Selection, Heredity, and Eugenics. Oxford Univ. Press.
BENNETT, J. H., ed. (1990). Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher. Oxford Univ. Press.
BENNETT, T. L. (1908). Errors of observation. Technical Lecture 4, Cairo, Ministry of Finance, Survey Department Egypt.
BOX, J. F. (1978). R. A. Fisher: The Life of a Scientist. Wiley, New York.
BRUNT, D. (1917). The Combination of Observations. Cambridge Univ. Press.
CHAUVENET, W. (1891). A Manual of Spherical and Practical Astronomy 2, 5th ed. [Reprinted (1960) by Dover, New York.]
CONNIFFE, D. (1992). Keynes on probability and statistical inference and the links to Fisher. Cambridge Journal of Economics 16 475-489.
EDDINGTON, A. S. (1914). Stellar Movements and the Structure of the Universe. Macmillan, New York.
EDGEWORTH, F. Y. (1908). On the probable errors of frequency-constants. J. Roy. Statist. Soc. 71 381-397, 499-512, 651-678; 72 81-90.
EDWARDS, A. W. F. (1972). Likelihood. Cambridge Univ. Press.
EDWARDS, A. W. F. (1974). The history of likelihood. Internat. Statist. Rev. 42 9-15.
EDWARDS, A. W. F. (1994). What did Fisher mean by 'inverse probability' in 1912-22? In Three contributions to the history of statistics, Preprint 3, Inst. Mathematical Statistics, Univ. Copenhagen.
EFRON, B. (1982). Maximum likelihood and decision theory. Ann. Statist. 10 340-356.
FIENBERG, S. E. and HINKLEY, D. V., eds. (1980). R. A. Fisher: An Appreciation. Springer, New York.
FISHER, R. A. (1911). Mendelism and biometry. Unpublished manuscript. [Reproduced in Bennett (1983), pp. 51-58.]
FISHER, R. A. (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathematics 41 155-160. [CP1 in Bennett (1971), vol. 1.]
FISHER, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10 507-521. [CP4 in Bennett (1971), vol. 1.]
FISHER, R. A. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh 52 399-433. [CP9 in Bennett (1971), vol. 1.]
FISHER, R. A. (1919). The genesis of twins. Genetics 4 489-499. [CP11 in Bennett (1971), vol. 1.]
FISHER, R. A. (1920). A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and by the mean square error. Monthly Notices of the Royal Astronomical Society 80 758-770. [CP12 in Bennett (1971), vol. 1.]
FISHER, R. A. (1921). On the "probable error" of a coefficient of correlation deduced from a small sample. Metron 1 3-32. [CP14 in Bennett (1971), vol. 1.]
FISHER, R. A. (1922a). On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London Ser. A 222 309-368. [CP18 in Bennett (1971), vol. 1.]
FISHER, R. A. (1922b). The goodness of fit of regression formulae, and the distribution of regression coefficients. J. Roy. Statist. Soc. 85 597-612. [CP20 in Bennett (1971), vol. 1.]
FISHER, R. A. (1923). Note on Dr. Burnside's recent paper on errors of observation. Proc. Cambridge Philos. Soc. 21 655-658. [CP30 in Bennett (1971), vol. 1.]
FISHER, R. A. (1925a). Theory of statistical estimation. Proc. Cambridge Philos. Soc. 22 700-725. [CP42 in Bennett (1971), vol. 2.]
FISHER, R. A. (1925b). Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh.
FISHER, R. A. (1930). Inverse probability. Proc. Cambridge Philos. Soc. 26 528-535. [CP84 in Bennett (1971), vol. 2.]
FISHER, R. A. (1934a). Two new properties of mathematical likelihood. Proc. Roy. Soc. Ser. A 144 285-307. [CP108 in Bennett (1971), vol. 3.]
FISHER, R. A. (1934b). Discussion on "On the two different aspects of the representative method," by J. Neyman. J. Roy. Statist. Soc. 97 614-619.
FISHER, R. A. (1971). Statistical Methods and Scientific Inference, 3rd ed. Hafner, New York. (First edition published in 1956.)
FISHER, R. A., THORNTON, H. G. and MACKENZIE, W. A. (1922). The accuracy of the plating method of estimating the density of bacterial populations. Annals of Applied Biology 9 325-359. [CP22 in Bennett (1971), vol. 1.]
GALTON, F. (1889). Natural Inheritance. Macmillan, London.
GAUSS, C. F. (1809). Theoria Motus Corporum Coelestium. [Extract in Trotter (1957), pp. 127-147.]
GAUSS, C. F. (1816). Bestimmung der Genauigkeit der Beobachtungen. [See Trotter (1957), pp. 109-117.]
GAUSS, C. F. (1821). Theoria combinationis observationum erroribus minimum obnoxiae, first part. [See Trotter (1957), pp. 1-49.]
GEISSER, S. (1980). Basic theory of the 1922 mathematical statistics paper. In R. A. Fisher: An Appreciation (S. E. Fienberg and D. V. Hinkley, eds.) 59-66. Springer, New York.
KEYNES, J. M. (1911). The principal averages and the laws of error which lead to them. J. Roy. Statist. Soc. 74 322-331.
MACKENZIE, D. A. (1981). Statistics in Britain 1865-1930. Edinburgh Univ. Press.
MERRIMAN, M. (1884). A Textbook on the Method of Least Squares. Wiley, New York. (Eighth edition, 1911.)
NEYMAN, J. (1934). On the two different aspects of the representative method (with discussion). J. Roy. Statist. Soc. 97 587-651.
PEARSON, E. S. (1968). Some early correspondence between W. S. Gosset, R. A. Fisher and Karl Pearson, with notes and comments. Biometrika 55 445-457.
PEARSON, E. S. (1990). "Student," A Statistical Biography of William Sealy Gosset (edited and augmented by R. L. Plackett with the assistance of G. A. Barnard). Oxford Univ. Press.
PEARSON, K. (1892). The Grammar of Science. White, London.
PEARSON, K. (1894). Contribution to the mathematical theory of evolution. Philos. Trans. Roy. Soc. London Ser. A 185 71-110.
PEARSON, K. (1894-1896). Lectures on the Theory of Statistics 1-5. (Five volumes of notes taken by Yule. Item 84/2 in Pearson collection, University College, London.)
PEARSON, K. (1896). Mathematical contributions to the theory of evolution, III. Regression, heredity and panmixia. Philos. Trans. Roy. Soc. London Ser. A 187 253-318.
PEARSON, K. (1902). On the systematic fitting of curves to observations and measurements, parts I and II. Biometrika 1 265-303; 2 1-23.
PEARSON, K. (1907). On the influence of past experience on future expectation. Philosophical Magazine 13 365-378.
PEARSON, K. (1915). On the distribution of the standard deviations of small samples: appendix I to papers by "Student" and R. A. Fisher. Biometrika 10 522-529.

PEARSON, K. Ž1920.. Notes on the history of correlation. SOPER, H. E., YOUNG, A. W., CAVE, B. M., LEE, A. and PEARSON, Biometrika 13 25᎐45. K. Ž1917.. On the distribution of the correlation coefficient PEARSON, K. Ž1936.. Method of moments and method of maxi- in small samples. Appendix II to the papers of ‘‘Student’’ mum likelihood. Biometrika 28 34᎐59. and R. A. Fisher. A cooperative study. Biometrika 11 PEARSON, K. and FILON, L. N. G. Ž1898.. Mathematical contribu- 328᎐413. tions to the theory of evolution IV. On the probable errors of STIGLER, S. M. Ž1973.. Laplace, Fisher and the discovery of the frequency constants and on the influence of random selec- concept of sufficiency. Biometrika 60 439᎐445. tion on variation and correlation. Philos. Trans. Roy. Soc. STIGLER, S. M. Ž1986.. A . Belnap, Cam- Ser. A 191 229᎐311. bridge. PLACKETT, R. L. Ž1972.. The discovery of the method of least STUDENT Ž1908a.. The probable error of a mean. Biometrika 6 squares. Biometrika 67 239᎐251. 1᎐25. PLACKETT, R. L. Ž1989.. Discussion of ‘‘R. A. Fisher on the history STUDENT Ž1908b.. Probable error of a correlation coefficient. of inverse probability’’ by S. Zabell. Statist. Sci. 4 256᎐258. Biometrika 6 302᎐310. PRATT, J. W. Ž1976.. F. Y. Edgeworth and R. A. Fisher on the TROTTER, H. F. Ž1957.. Gauss’s work Ž1803᎐1826. on the theory efficiency of maximum likelihood estimation. Ann. Statist. 4 of least squares. Technical Report 5, Statistical Techniques 501᎐514. Research Group, Princeton Univ. SAVAGE, L. J. Ž1976.. On rereading R. A. Fisher Žwith discussion.. WELCH, B. L. Ž1958.. ‘‘Student’’ and small sample theory. J. Ann. Statist. 4 441᎐500. Amer. Statist. Assoc. 53 777᎐788. SMITH, K. Ž1916.. On the ‘‘Best’’ values of the constants in YULE, G. U. Ž1897.. On the theory of correlation. J. Roy. Statist. frequency distributions. Biometrika 11 262᎐276. Soc. 60 812᎐854. SOPER, H. E. Ž1913.. On the probable error of the correlation ZABELL, S. Ž1989.. R. A. 
Fisher on the history of inverse probabil- coeffcient to a second approximation. Biometrika 9 91᎐115. ity Žwith discussion.. Statist. Sci. 4 247᎐263.