
arXiv:0808.4032v1 [stat.ME] 29 Aug 2008

Statistical Science
2008, Vol. 23, No. 2, 261–271
DOI: 10.1214/08-STS256
© Institute of Mathematical Statistics, 2008

Karl Pearson's Theoretical Errors and the Advances They Inspired

Stephen M. Stigler

Abstract. Karl Pearson played an enormous role in determining the content and organization of statistical research in his day, through his research, his teaching, his establishment of laboratories, and his initiation of a vast publishing program. His technical contributions had initially, and continue to have today, a profound and not always adequately acknowledged impact upon the work of both applied and theoretical statisticians, partly through their influence upon Ronald A. Fisher. Particular attention is drawn to two of Pearson's major errors that nonetheless have left a positive and lasting impression upon the statistical world.

Key words and phrases: Karl Pearson, R. A. Fisher, Chi-square test, degrees of freedom, parametric inference, history of statistics.

Stephen M. Stigler is the Ernest DeWitt Burton Distinguished Service Professor, Department of Statistics, University of Chicago, 5734 University Avenue, Chicago, Illinois 60637, USA (e-mail: [email protected]). This paper is based upon a talk presented at the Royal Statistical Society in March 2007, at a symposium celebrating the 150th anniversary of Karl Pearson's birth.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2008, Vol. 23, No. 2, 261–271. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

Karl Pearson surely ranks among the more productive and intellectually energetic scholars in history. He cannot match prolific humanists, such as one of whom it has been said, "he has no unpublished thought," but in the domain of quantitative science Pearson has had no serious rival. Even the immensely prolific Leonhard Euler, whose collected works are still being published more than two centuries after his death, falls short of Pearson in sheer volume. A list of Pearson's works fills a book: that hardbound volume (Morant, 1939) lists 648 works and is still incomplete. My own moderate collection of his works—itself very far from complete—occupies 5 feet of shelf space.

Pearson's accomplishments were not merely voluminous; they could be luminously enlightening as well. Today the most famous of these are the Product Moment Correlation Coefficient and Pearson's Chi-square test, dating respectively from 1896 and 1900 (Pearson, 1896, 1900a, 1900b).
And his works were not casually constructed: when a student or a new co-worker would do the laborious calculations for some statistical analysis, Pearson would redo the work as a check, to greater accuracy. An American visiting in the early 1930s once asked Pearson how he found time to write and compute so much. Pearson replied, "You Americans would not understand, but I never answer a telephone or attend a committee meeting" (Stouffer, 1958). He was a driving force behind the founding of Biometrika, which he edited for 36 years and made into the first important journal in mathematical statistics. He also established another journal (the Annals of Eugenics), several additional serial publications, two research laboratories, and a school of statistical thought. Pearson pioneered the use of machine calculation, and he supervised the calculation of a series of mathematical tables that influenced statistical practice for decades. He made other discoveries as well, less commonly associated with his name. He was in 1897 the first publicly to name and identify the phenomenon of "spurious correlation," a powerful idea that made him and countless descendents more aware of the pitfalls expected in any serious statistical investigation of society (Pearson, 1897). And in a series of investigations of craniometry he introduced the idea of landmarks to the statistical study of shapes.

Pearson was at one time well known for the Pearson Family of Frequency Curves. That family is seldom referred to today, but there is a small, really striking discovery he found in its early development that I would call attention to. When we think of the normal approximation to the binomial, we usually think in terms of large samples. Pearson discovered that there is a sense in which the two distributions agree exactly for even the smallest number of trials. It is well known that the normal density is characterized by the differential equation

(d/dx) log(f(x)) = f′(x)/f(x) = −(x − µ)/σ².

Pearson discovered that p(k), the probability function for the symmetric binomial distribution (n independent trials, p = 0.5 each trial), satisfies the analogous difference equation exactly:

[p(k+1) − p(k)] / [(p(k+1) + p(k))/2] = −[(k + 1/2) − n/2] / [(n+1) · (1/2) · (1/2)],

or

[rate of change from p(k) to p(k+1)] / [average of p(k) and p(k+1)] = −[midpoint of (k, k+1) − µ_n] / σ²_{n+1},

for all n, k. The appearance of n+1 instead of n in the denominator might be considered a minor fudge, but the equation still demonstrates a really fundamental agreement in the shapes of the two distributions that does not rely upon asymptotics (Pearson, 1895, page 356).

All of these Pearsonian achievements are indeed substantial, and constitute ample reason to celebrate him 150 years after his birth. But if these are all we saluted, I would hold that Pearson is being underappreciated. To properly gauge his impact upon modern statistics, we must take a look at parts of two works of his that are typically not held in high regard. Indeed, they are usually mentioned in derision, as exhibiting two major errors that show Pearson's limitations and highlight the great gulf that lay between Pearson and the Fisherian era that was to follow. I wish to return now to these two works and reassess them. I intend to argue that these errors should count among the more influential of his works, and that they helped pave the way for the creation of modern mathematical statistics.

2. PEARSON'S FIRST MAJOR ERROR

Louis Napoleon George Filon was born in France, but his family moved to England when he was three years old (Jeffrey, 1938). He first encountered Karl Pearson as a student at University College London. After receiving a B.A. in 1896, Filon served as Pearson's Demonstrator until 1898, and together they wrote a monumental memoir on the "probable errors of frequency constants," a paper read to the Royal Society in 1897 and published in their Transactions in 1898. In 1912 Filon succeeded Pearson as Goldsmid Professor of Applied Mathematics and Mechanics. At Pearson's retirement banquet in 1934, Filon (who was by that time Vice-Chancellor of the University of London) explained the genesis of this, their only work together in statistics.

"K. P. lectured to us on the Mathematical Theory of Statistics, and on one occasion wrote down a certain integral as zero, which it should have been on accepted principle. Unfortunately I have always been one of those wrong-headed persons who refuse to accept the statements of Professors, unless I can justify them for myself. After much labour, I actually arrived at the value of the integral directly—and it was nothing like zero. I took this result to K. P., and then, if I may say so, the fun began. The battle lasted, I think, about a week, but in the end I succeeded in convincing Professor Pearson. It was typical of K. P. that, the moment he was really convinced, he saw the full consequences of the result, proceeded at once to build up a new theory (which involved scrapping some previously published results) and generously associated me with himself in the resulting paper" (Filon, 1934).

The term "probable error" was introduced early in the 19th century to mean what we would now call the median error of an estimate. Thus it is a value which, when divided by 0.6745, gives the standard deviation for an unbiased estimate with an approximately normal distribution.
My guess is that the lecture that Filon referred to involved the formula for the probable error of Pearson's product-moment estimate of the correlation coefficient for bivariate normal distributions. Pearson had given this incorrectly in 1896, and one of the signal achievements of the Pearson–Filon paper was to correct that error (Stigler, 1986, page 343). But the 1898 paper did much more: it purported to give the approximate distributions for the probable errors of all the estimated frequency constants, indeed their entire joint distribution, for virtually any statistical problem. The theory presented was relatively short; most of the paper was taken up with a large number of applications.

Unfortunately, quite a number of these applications proved to be in error. There are some indications Pearson may have realized this by 1903, but if he did sense trouble with the paper, he did not call it to public attention. In 1922 Ronald Fisher repaired Pearson's omission when he noted in particular that outside of the case of the normal distribution, nearly all of the applications in Pearson–Filon were erroneous. This included many method-of-moments estimates (the gold-standard method for the Pearsonian school). A significant Pearson achievement came to be labeled an error, one eventually overcome by a Fisher success, and in consequence, the 1898 paper has suffered a low reputation. But despite these problems, the paper had, arguably, a significant and largely underappreciated positive impact upon statistics.

At first glance the Pearson–Filon argument may appear strikingly modern, apparently expanding a log-likelihood ratio to derive an asymptotic, approximately multivariate normal distribution for the errors of estimation. The authors considered a multivariate set of m-dimensional measurements x₁, x₂, x₃, ..., x_m of "a complex of organs," and they stated the "frequency surface" should be given by

"z = f(x₁, x₂, x₃, ..., x_m; c₁, c₂, c₃, ..., c_p),

where c₁, c₂, c₃, ..., c_p are p frequency constants, which define the form as distinguished from the position of the frequency surface, and which will be functions of standard deviations, moments, skewnesses, coefficients of correlation, &c., &c., of individual organs, and of pairs of organs in the complex."

The "position" of the surface would be given in terms of the means h₁, h₂, ..., h_m of the x's, which are implicit in this notation, as they are stated to give the origin of the surface, and so there are m + p frequency constants to be determined from a set of n measurements, each m-dimensional.

To determine the "probable errors" of the frequency constants, Pearson and Filon looked at the ratio formed by dividing the product of n such surfaces (for the n vector measurements) into what a similar product would be if the frequency constants had been different values:

P_∆/P₀ = ∏ f(x₁ + ∆h₁, x₂ + ∆h₂, ..., x_m + ∆h_m; c₁ + ∆c₁, c₂ + ∆c₂, ..., c_p + ∆c_p) / ∏ f(x₁, x₂, ..., x_m; c₁, c₂, ..., c_p).

The logarithm of this ratio is then the difference of two sums. After being expanded "by Taylor's theorem," this yields a series with typical terms (writing S for summation) that they found to be

log(P_∆/P₀) = ∆h_r · S[d(log f)/dx_r] + (1/2)(∆h_r)² · S[d²(log f)/dx_r²] + ∆h_r ∆h_{r′} · S[d²(log f)/(dx_r dx_{r′})]
+ ∆c_s · S[d(log f)/dc_s] + (1/2)(∆c_s)² · S[d²(log f)/dc_s²] + ∆c_s ∆c_{s′} · S[d²(log f)/(dc_s dc_{s′})]
+ ∆h_r ∆c_s · S[d²(log f)/(dh_r dc_s)] + ··· + cubic terms in ∆h and ∆c + &c.,

"where f stands for f(x₁, x₂, x₃, ..., x_m; c₁, c₂, c₃, ..., c_p)."

Pearson and Filon then "replace sums by integrals," for example by replacing the second summation above, namely S[d²(log f)/dx_r²], by

−B_r = ∫∫∫···∫ f · [d²(log f)/dx_r²] dx₁ dx₂ ··· dx_m.

With their notation the frequency surface encompassed a volume = n (i.e., it was not a relative frequency surface), so if the f were taken as a density or relative frequency surface this would be tantamount to replacing (1/n) · S[d²(log f)/dx_r²] by its expectation E[d²(log f)/dx_r²].

That expectation would equal −B_r. The integrals were then evaluated, and higher-order terms discarded, to get

(1) P_∆ = P₀ · exp{−(1/2)[B_r(∆h_r)² − 2C_{rr′}∆h_r∆h_{r′} − 2G_{rs}∆h_r∆c_s + E_s(∆c_s)² − 2F_{ss′}∆c_s∆c_{s′} + &c. ···]},

where B_r, etc. are integrals given in terms of derivatives of log f. We are then told:

"This represents the probability of the observed unit, i.e. the individuals (x₁, x₂, x₃, ..., x_m, for all sets), occurring, on the assumption that the errors ∆h₁, ∆h₂, ..., ∆h_m, ∆c₁, ..., ∆c_p, have been made in the determination of the frequency constants. In other words, we have here the frequency distribution for errors in the values of the frequency constants."

Several of their steps, such as the cavalier substitution of integrals for sums or the discarding of remainder terms, may seem insufficiently defended, but the general drift is so similar to what we tend to see today that it would be easy for an uncritical reader to accept it, believing that it is probably essentially accurate, and that with some effort and additional regularity conditions all should be well. After all, such a reader might say, the replacement can be justified under reasonable regularity conditions, and even the last step would be sanctioned by a loose inverse probability argument such as was common at that time. That reader would be wrong.

3. THE SOURCE OF THE ERROR

The key to understanding what went wrong with the Pearson–Filon argument is at the very beginning, as a closer reading shows. Modern readers have understandably tended to take the frequency surface z = f(x₁, x₂, x₃, ..., x_m; c₁, c₂, c₃, ..., c_p) as a parametric model. But in fact, in their notation, z is the fitted surface in terms of estimated h's and c's. Pearson and Filon explained this implicitly in the first paragraph on page 231, when they write exclusively in terms of the group of n individuals and the "means" h_i (referring to the arithmetic means) as determining the origin of the surface, and explicitly in the short fourth paragraph, where they refer to the h + ∆h's and c + ∆c's of the numerator as being considered "instead of the observed values." This runs counter to modern statistical practice, which would focus on a specification in terms of parameters, not the estimates, with one value considered the true or population value, and the focus would be the deviations of the estimates from that value. The idea of this type of parametric modeling was, however, only to be introduced in 1922 by Fisher (Stigler, 2005), and Pearson's "frequency constants" were not parameters, even if they were sometimes employed in an equivalent fashion. This difference was, as we shall see, highly consequential; it was the source of the principal difficulties in the argument.

Because Pearson and Filon took the estimates as a starting point, the Taylor expansion they gave was about the estimated values. The expansion itself is fine, but when they came to substituting integrals for sums, they inadvertently encountered a problem. Consider the first sum that involves a general frequency constant c, namely S[d(log f)/dc_s]. (The earlier terms involve the h's but have been written in terms of derivatives with respect to the x's; the issue is more clearly addressed with the c's.) If we took f as a density, it might not be unreasonable to replace this sum (divided by n) by its limit in probability, the expectation E[d(log f)/dc_s]. But under what distribution should the expectation be computed? With Fisher it would be computed under the distribution with the true values of the parameters. But Pearson lacked that notion; for him there was no "true value," only a summary estimate in terms of observed values.

Someone—perhaps Fourier—has been quoted as saying that "Mathematics has no symbols for confused ideas." Anyone seeking a counterexample to this need look no further than Pearson–Filon. With their symbol f = f(x₁, x₂, x₃, ..., x_m; c₁, c₂, c₃, ..., c_p) submerging the role of estimated values, and elevating them in the process to surrogates for true values, the argument goes astray. All expectations are computed as if the estimated values were true values, and the result is a distribution for errors that does not in any way depend upon the method used to estimate. Pearson and Filon replaced S[d(log f)/dc_s] by the integral

D_s = ∫∫∫···∫ f · [d(log f)/dc_s] dx₁ dx₂ ··· dx_m,

which reduces identically to zero (if the two f's are identical) under fairly general regularity conditions. But the same would not be generally true for

∫∫∫···∫ f(x|θ̂) · [d log f(x|θ)/dc_s] dx₁ dx₂ ··· dx_m,

writing θ̂ for (h₁, h₂, h₃, ..., h_m; c₁, c₂, c₃, ..., c_p), and θ for the potential true value θ̂ is intended to estimate.
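The point about which distribution carries the expectation can be made concrete with a small numerical sketch of my own (not from the paper). For a normal density with unit variance, the score for the mean is d log f(x|µ)/dµ = x − µ, and ∫ f(x|µ̂)(x − µ) dx = µ̂ − µ: the integral Pearson and Filon set to zero vanishes only when the value in the density and the value in the score coincide.

```python
import math

def normal_pdf(x, mu):
    # N(mu, 1) density
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def score_integral(mu_hat, mu, lo=-12.0, hi=12.0, step=1e-3):
    # Riemann-sum approximation of  integral f(x | mu_hat) * (x - mu) dx,
    # where (x - mu) is the score d/dmu log f(x | mu) for the N(mu, 1) mean
    total, x = 0.0, lo
    while x < hi:
        total += normal_pdf(x, mu_hat) * (x - mu) * step
        x += step
    return total

print(round(score_integral(0.0, 0.0), 6))  # approximately 0: same value, integral vanishes
print(round(score_integral(0.7, 0.0), 6))  # approximately 0.7 = mu_hat - mu: not zero
```

The second call is the situation Fisher diagnosed: computing the expectation of a score under a density centered at the estimate, rather than at the true value, leaves a term that is not zero in general.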

The summation S[d(log f)/dc_s] is identically zero if maximum likelihood estimates are used, but it remained for Fisher to notice that this term is not in general negligible; in fact, it will contribute asymptotically to the variance term if the estimate θ̂ is inefficient, as would be the case for many of Pearson's moment-based estimates. Pearson and Filon wrote that the expansion represented the distribution "on the assumption that the errors ∆h₁, ∆h₂, ..., ∆h_m, ∆c₁, ..., ∆c_p have been made in the determination of the frequency constants." But that is not what they had done. With their notation of a simple f without arguments, and their fixation on the estimated values, they were led to mathematical misadventure.

If the substitution of limiting integrals for sums had been valid, the likelihood ratio P_∆/P₀ would then have been the ratio of the probability densities of the sample with estimated h's and c's (the denominator) to that for a hypothetical set of alternative values (the h + ∆h's and c + ∆c's, the numerator). In modern terminology, Pearson's final expression (1) was claimed to be an approximation for P_∆, the (conditional) density of the sample x given that the estimated values are in error by ∆h or ∆c, and their final claim ("In other words, we have here the frequency distribution for errors in the values of the frequency constants") was an assertion that formula (1) also gives the (conditional) density of the errors ∆h and ∆c given the data x. This last statement was not explained, but later, in 1916 correspondence with Fisher (quoted in Stigler, 2005), Pearson described it as the use of inverse probability, making it a naive Bayesian approach with a uniform prior, such as was practiced routinely over the 19th century and was sometimes referred to as the Gaussian method.

Pearson's contemporaries did not raise questions about the memoir. When Edgeworth discussed and extended it in 1908, he gave no indication he saw anything amiss (Edgeworth, 1908). Only in 1922 did Ronald Fisher criticize the approach of the paper, in a lengthy footnote (Fisher, 1922a, page 329), writing that "It is unfortunate that in this memoir no sufficient distinction is drawn between the population and the sample ...." Fisher went on to say that the results implicitly assume the estimates actually maximized the likelihood function, whereas they were applied in many cases where this was not the case. He wrote,

"It would appear that shortly before 1898 the process which leads to the correct value, of the probable errors of optimum statistics, was hit upon and found to agree with the probable errors of statistics found by the method of moments for normal curves and surfaces; without further enquiry it would appear to have been assumed that this process was valid in all cases, its directness and simplicity being peculiarly attractive. The mistake was at the time, perhaps, a natural one; but that it should have been discovered and corrected without revealing the inefficiency of the method of moments is a very remarkable circumstance" (Fisher, 1922a, page 329).

It is worth pointing out that the size of the correction Fisher noted was needed was not small. Fisher gave several examples of nonnormal members of Pearson's own family of curves where the lower bound of the efficiency of the moment-based estimates was zero. Since Fisher measured efficiency as a ratio of variances, this meant that the correction needed for Pearson's 1898 expressions for "probable errors" could be enormous—in fact arbitrarily large. The 1898 expressions were not larger than the actual probable errors, but there was little else that could be said. There was no finite limit to the amount by which they underestimated the actual probable errors.

The major error in the paper was due (as Fisher noted) to a conceptual confusion, a taking of the estimated frequency constants in part of the analysis in the place of the actual frequency constants. Pearson had run aground after encountering a need for a clear notion of a set of values for his frequency constants; he did not have a framework to encompass both estimates and targets of estimation. To some degree, then, the Pearson–Filon error can be seen to be due to the lack of the notion of parametric families. Pearson and Filon used notation in this memoir suggestive of parametric families, but the lack of conceptual clarity led to a confused and ultimately erroneous analysis. Pearson thought of "frequency constants" as quantities such as moments, derivable from arbitrary density curves, with the same meaning in all cases and with the sample moments as clearly leading to the best estimates.

Fisher's comments were apt; perhaps even too generous, although it is doubtful Pearson would have agreed with such an assessment. That the Pearson–Filon procedure can be shown to work for efficient estimates is a species of mathematical accident, albeit one that may have helped to deceive Pearson and probably produced overconfidence when it gave the results he knew should hold for the normal distribution case. The fact that the identities they claimed in general would work for many efficient estimates is mathematics that would have been foreign to Pearson and Filon, and unachievable without the full notion of parametric families.
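Fisher's efficiency point can be illustrated with a small simulation of my own (not an example from the paper). For the double-exponential (Laplace) distribution, the sample mean (the natural moment-based estimate of location) has roughly twice the sampling variance of the maximum likelihood estimate, the sample median; for other curves in Pearson's family the efficiency of the moment estimates can fall all the way to zero, which is what made the 1898 probable errors arbitrarily overoptimistic.

```python
import math
import random
import statistics

def laplace_sample(rng, n, b=1.0):
    # Inverse-CDF draws from a Laplace(0, b) ("double exponential") distribution
    xs = []
    for _ in range(n):
        u = max(rng.random(), 1e-300)  # guard against log(0) on a zero draw
        if u < 0.5:
            xs.append(b * math.log(2.0 * u))
        else:
            xs.append(-b * math.log(2.0 * (1.0 - u)))
    return xs

def estimator_variances(reps=2000, n=200, seed=7):
    # Sampling variances of the moment estimate (mean) and the MLE (median)
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = laplace_sample(rng, n)
        means.append(sum(x) / n)
        medians.append(statistics.median(x))
    return statistics.pvariance(means), statistics.pvariance(medians)

v_mean, v_median = estimator_variances()
print(v_median / v_mean)  # roughly 0.5: the mean is only about 50% efficient here
```

The asymptotic theory gives variance 2b²/n for the mean and about b²/n for the median, so the simulated ratio near one-half is exactly Fisher's "ratio of variances" measure of efficiency at work.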
mized likelihood, Fisher used the expansion to show From 1903 on, Pearson subtly distanced himself how the asymptotic variance could be found from from the paper without ever calling attention to the the second derivative of the log density. Pearson and errors, but he never repudiated it. In 1899 William Filon had sketched a solution to a problem that was F. Sheppard published a long study of “normal cor- not the one they had embarked upon. But it was relation” (Sheppard, 1899). Sheppard appears to have Fisher who recognized, with the conceptual appa- not seen the Pearson–Filon paper (at any rate he ratus of parametric families, that this sketch could lead to the solution of his own problem. Pearson the did not cite it), and a part of what he presented in- pioneer had laid a path that was insufficiently well- cluded probable errors for the frequency constants in lit for his own travel, but it provided a brightly lit the normal case, derived by methods different from highway for Fisher. Pearson–Filon. The methods he used were quite straightforward—writing estimates as linear func- 4. PEARSON’S SECOND MAJOR ERROR tions of frequency counts (using a Taylor expansion if necessary), and then finding moments from the Pearson introduced the Chi-square test in 1900, variances and covariances of the counts in ways that and it has been widely celebrated as a great achieve- remain standard today. ment in statistical methodology. In 1984 the editors The directness of Sheppard’s methods must have of a popular science magazine selected it as one of appealed to Pearson. In a sequence of articles, all twenty discoveries made during the twentieth cen- with the same title “On the probable errors of fre- tury that have changed our lives (Hacking, 1984). 
quency constants” (Pearson, 1903, 1913, 1920), he Yet for all this celebration, virtually no historical presented what he called “simple proofs of the main mention of the paper is made by statisticians with- propositions,” all the while with the Pearson–Filon out adding damning words to the effect that Pearson erred in claiming, as we would now put it, that no paper receding into the background. In 1903 he gave correction in degrees of freedom need be made when only a general reference to the 1898 treatment (as parameters are estimated under the null hypothe- well as to Sheppard); in 1913 he only referred to the sis. Worse for Pearson’s reputation, such accounts formulae in 1898 for the case of the normal corre- further note that the error stood uncorrected until lation coefficient (where they were correct); in 1920 it was sensed in 1915 by Greenwood and Yule and he did not cite the 1898 work at all. In the 1903 definitively corrected in 1922 and 1924 by Ronald paper he included formulae based upon Sheppard’s Fisher, thus seemingly turning Pearson’s landmark approach that were capable of being worked out for publication into Fisher’s triumph over ignorance. getting probable errors for estimates in five types Pearson has had some defenders in this matter; of curves within the Pearson family, but only for some have even suggested that Pearson was right methods of moments estimates. all along. For example, Karl’s son Egon and George The 1898 paper had a considerable impact upon Barnard have separately advanced tentative (and I statistical practice in making the use of probable er- think half-hearted) statistical cases that might be rors available for the entire span of the new method- made for proceeding as Pearson did (Pearson, 1938, ology including moment estimates. It could even page 30; Barnard, 1992). 
But a cold, clear-eyed look be argued that the wrong, generally overoptimistic at the original 1900 paper shows that such excuses probable errors were better than none at all. And cannot be reconciled with Pearson’s text. He did again, the paper had a significant impact upon Fisher. make an error, and a big, consequential one too. While preparing his 1922 memoir, Fisher clearly had The crucial passage from Pearson’s 1900 article Pearson and Filon before him, and his discussion of is on pages 165–166. Pearson considered a test of KP’S ERRORS 7

Pearson considered a test of fit based upon a total of N frequency counts from a sample independently distributed among n + 1 groups or categories, with

m = theoretical frequency [i.e., the expected frequency for the group in question],
m_s = theoretical frequency deduced from data for the sample [i.e., the expected frequency using the data to find the "best" value for the group],
m′ = observed frequency [for the group],

and with the total count N = Σm = Σm_s = Σm′. Pearson recognized that the estimated theoretical frequency m_s would typically differ from the theoretical frequency m, and he denoted that difference by µ; that is, µ = m − m_s. His analysis gave particular attention to the relative error, namely µ/m_s, "which," he told us, "will, as a rule, be small."

The gist of Pearson's argument was to show that the Chi-square statistic based upon the theoretical frequencies, χ² = Σ (m′ − m)²/m, is close to the Chi-square statistic based upon the estimated theoretical frequencies, χ²_s = Σ (m′ − m_s)²/m_s; so close, in fact, that the discrepancy could for all practical purposes be ignored.¹

Pearson had evidently expanded h(m) = (m′ − m)²/m = (m′ − m_s − µ)²/(m_s + µ) in a Taylor series about m_s, discarded the terms of higher order than (µ/m_s)², and then summed the results over the n + 1 groups. Proceeding in this way, he would have found

h′(m) = −(m′² − m²)/m²,
h′′(m) = 2m′²/m³,
h′′′(m) = −6m′²/m⁴, ....

And so, since µ = m − m_s,

h(m) = h(m_s) + µh′(m_s) + (µ²/2)h′′(m_s) + (µ³/6)h′′′(m_s) + ···,

that is,

(m′ − m)²/m = (m′ − m_s)²/m_s − (µ/m_s)·[(m′² − m_s²)/m_s] + (µ/m_s)²·(m′²/m_s) − (µ/m_s)³·(m′²/m_s) + ···
= (m′ − m_s)²/m_s − (µ/m_s)·[(m′² − m_s²)/m_s] + (µ/m_s)²·(m′²/m_s),

dropping terms of higher order than (µ/m_s)². Sum both sides over the n + 1 groups and this is the expression Pearson arrives at:

(2) χ² = χ²_s − Σ (µ/m_s)·[(m′² − m_s²)/m_s] + Σ (µ/m_s)²·(m′²/m_s), and hence,

(3) χ² − χ²_s = −Σ (µ/m_s)·[(m′² − m_s²)/m_s] + Σ (µ/m_s)²·(m′²/m_s).

The term −m_s² in the numerator of the first term on the right-hand side of (3) is superfluous when summed over the groups, since Σµ = Σm − Σm_s = 0, but it plays a role in Pearson's argument, which is no doubt why he left it in.

For future reference, I note that exactly the same result can be arrived at more simply by noting

χ² = Σ (m′² − 2mm′ + m²)/m = Σ (m′²/m) − 2N + N = Σ (m′²/m) − N,

and similarly χ²_s = Σ (m′²/m_s) − N; then

χ² − χ²_s = Σ (m′²/m) − Σ (m′²/m_s) = Σ m′²·(1/m − 1/m_s).

If we then expand m⁻¹ as a function of m about m_s (again neglecting third- or higher-order terms), we get

(4) χ² − χ²_s = −Σ (µ/m_s)·(m′²/m_s) + Σ (µ/m_s)²·(m′²/m_s).

This agrees exactly with Pearson's expansion (3) when the superfluous term "−m_s²" is dropped, as would have to be the case since the function being expanded (χ² − χ²_s) is the same in both cases.

¹ Pearson again employed S for summation, and his argument is made harder than necessary to understand by two clear typographical errors: an evident missing left parenthesis in the numerator of the second term on his first line of equations on page 165, and a missing m_s in the denominator of the second term of the second line of equations [it reappeared, correctly, when this term was repeated two lines later; that equation is our equation (3) below].

It is not hard to show reasonably generally under the hypothesis of fit that the terms dropped, even when summed, are indeed with high probability negligible when N is large [O_P(N^(−1/2))]. In order to see where Pearson was led astray, we must then look to the paragraph following his equations. Pearson's argument proceeded as follows: He recognized that the difference (3) between these two Chi-squares should be positive: the deviation of the observed counts from the theoretical counts should be greater than the same deviation if the theoretical counts are adjusted to fit the observed. He wished to argue that the difference (3) was not large. His argument was in two parts: (i) the first term on the right-hand side of (3) should be expected to be either negative (thus canceling out part of the second term) or at least very small; (ii) the second term was nonnegative of course, but it would be expected to be small in any case, because it involved for each summand the square of the relative error µ/m_s, which Pearson had stated (page 164) "will, as a rule, be small," and much smaller still when squared. He gave no citation for this "rule," but two years earlier he had explicitly cited Gauss, Laplace and Poisson, among others, as sanctioning the dropping of terms involving the squares of errors thought to be small (Pearson and Filon, 1898, page 246). Presumably in stating this he assumed good estimates and ample data. He granted that in some cases where the fit was bad the deviations would be quite large, but then both Chi-squares would be large and the discrepancy between them unimportant.

There are two points to make about Pearson's argument. The first is that his analysis (i) of the first term may seem dubious to modern eyes, but it is not the source of the error. He noted that the first term will be positive only if the two terms multiplied (µ = m − m_s and m′² − m_s²) are negatively correlated; that is, if there was a tendency for the m's to be ordered m′ > m_s > m or m′ < m_s < m […] false assumptions) off to one side. Although Pearson's argument on point (i) can be questioned, his conclusion is correct. As Fisher would observe later, the first term is in fact zero (or nearly so) if the estimated m's are chosen well (minimum Chi-square or maximum likelihood), due to the (near) orthogonality of m′ − m_s and m − m_s in those cases (much like that of X̄ and X_i − X̄ for normal distributions).

In any event, it is part (ii) of his argument that is crucial, and that argument fails, and fails dramatically. The second term on the right-hand side of (3) should not be expected to be small under either null or alternative hypothesis. At this distance in time it may seem surprising that Pearson did not realize this. Already in 1938 his son Egon registered this surprise in a biographical memoir of his father (Pearson, 1938, page 30), when he noted that for any multinomial distribution, if Chi-square is computed with no parameter restrictions (so each theoretical value is estimated by the corresponding observed count and m_s = m′), then the fit with the estimated values is perfect. We would thus have χ² − χ²_s = χ² − 0, while the right-hand side of (3) gives

−Σ (µ/m_s)·[(m′² − m_s²)/m_s] + Σ (µ/m_s)²·(m′²/m_s) = −0 + Σ (m − m′)²/m′.

In this extreme case the second term is asymptotically equivalent to the original Chi-square itself under the null hypothesis, and so it is certainly not negligible. The test of fit is not interesting here (we would say the degrees of freedom is zero), but it shows starkly the devastating effect estimated parameters can have upon the statistic, even when (as in Egon's example) the relative error itself (µ/m_s) would be small [O_P(N^(−1/2))]. Why, Egon seemed to ask, would Karl have not seen this? Egon offered his […]

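Fisher's eventual resolution, the loss of one degree of freedom per estimated parameter, can be checked with a short simulation. This is a modern sketch, not a computation from the sources: the Binomial(4, θ) family, the sample size, the replication count and the seed are illustrative choices. Since a chi-square variate has mean equal to its degrees of freedom, the average of Pearson's statistic over many replications reveals the correct reference distribution:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, theta, N, reps = 4, 0.4, 500, 2000   # illustrative values, not from the paper

def binom_pmf(th):
    # cell probabilities for Binomial(n, th), giving n + 1 = 5 cells
    return np.array([comb(n, i) * th**i * (1 - th)**(n - i) for i in range(n + 1)])

stats = []
for _ in range(reps):
    x = rng.binomial(n, theta, size=N)             # N raw observations
    counts = np.bincount(x, minlength=n + 1)       # observed cell counts
    th_hat = counts @ np.arange(n + 1) / (n * N)   # ML estimate of theta from the grouped counts
    expected = N * binom_pmf(th_hat)               # fitted theoretical counts
    stats.append(((counts - expected) ** 2 / expected).sum())

print(np.mean(stats))   # near (5 - 1) - 1 = 3, not 5 - 1 = 4
```

The simulated mean sits near 3, the number of cells minus one minus the single estimated parameter; the 1900 rule, with no allowance for estimation, would have referred the statistic to 4 degrees of freedom.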
Udny Yule initially accepted the 1900 rule, for example using 8 rather than 4 degrees of freedom for a Chi-square test of a 3 × 3 contingency table in Yule (1906, page 349). Only in 1915, after years of experience, did Greenwood and Yule (1915) bring the puzzle to wider notice, and even then neither they nor anyone else had a clear view of the source of the problem. And so it stood until Fisher.

Even with Fisher's work before us, we must marvel at how far Pearson had gone. He had lacked only one ingredient—parametric families—but what he had managed to do was to identify the issue and present it in such a clear way that when Fisher combined Pearson's 1900 development with the deceptively simple idea of parametric families, the solution must have sprung to mind nearly immediately. It took Fisher's genius to answer the question, but he would scarcely have been in a position to do so without the path-breaking formulation of Pearson the pioneer.
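The inferential stakes in examples like Yule's 3 × 3 table can be made concrete. For even degrees of freedom the chi-square upper tail has a closed form, so the comparison needs no statistical library; the statistic 9.49 (roughly the 5% point on 4 degrees of freedom) is an illustrative value of mine, not one taken from Yule:

```python
from math import exp, factorial

def chi2_sf_even(x, df):
    # upper-tail probability P(X > x) for a chi-square variate with even df
    k = df // 2
    return exp(-x / 2) * sum((x / 2) ** j / factorial(j) for j in range(k))

stat = 9.49                          # about the 5% critical value on 4 df
p_fisher = chi2_sf_even(stat, 4)     # (3 - 1) * (3 - 1) = 4 df, Fisher's correction
p_pearson = chi2_sf_even(stat, 8)    # 3 * 3 - 1 = 8 df, Pearson's 1900 rule
print(round(p_fisher, 3), round(p_pearson, 3))   # about 0.05 versus about 0.30
```

A table just significant at the 5% level on the correct 4 degrees of freedom would have looked entirely unremarkable (p near 0.30) on the 8 degrees of freedom the 1900 rule assigned.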
Pearson’s argument that no correction for estimat- It is not anachronistic to see Pearson as erring in ing expected values was needed (pages 80–81). Else- 1900. Even without the notion of parametric fami- where in the volume, Huntington wrote warmly of lies he could have seen a discrepancy without seeing the method of moments, and nowhere was Fisher’s magnum opus of 1922 referred to. Even in Eng- a resolution, just as Greenwood and Yule did, when land understanding was slow. By 1938 they found the Chi-square test for 2 × 2 tables gave had conceded the degrees of freedom issue, but he results inconsistent with a comparison of the two seemed to have not accepted Fisher’s point about columns as binomial counts. Pearson erred, but the Pearson and Filon (Pearson, 1938, pages 28–29). error led to Fisher’s discovery of degrees of freedom. Both Pearson and Fisher were giants in our his- Pearson had not only solved the great problem of tory; despite their lack of mutual appreciation we testing multinomial goodness of fit against all alter- cannot imagine modern statistics without both. Pear- natives, he had also isolated and formulated another son’s errors were substantial and not to be glossed great problem in terms that two decades later per- over, but they should not obscure the even greater mitted another genius, armed with his own major achievements they accompanied. Pearson had a gi- discovery, an easy solution. ant ambition and the energy to realize it. He sought to create a whole new statistical system, and for 6. CONCLUSION a time succeeded. He did not have a mathemati- The errors Pearson made did not go undetected cal mind equal to Fisher’s, and he became mired in because they were small; to the contrary, they were and never escaped from an incompletely developed large and of potentially large practical consequence. conceptual apparatus that was not equal to the full For example, in Pearson and Filon’s own numerical task at hand. 
Not only were the errors Pearson made not easily discovered; even after they were pointed out in 1922 they were not widely understood. In 1924, a Handbook of Mathematical Statistics was published, prepared under the auspices of the U.S. National Research Council (Rietz, 1924). The Editor-in-Chief was H. L. Rietz, and major contributions were made by Harvard University Professor E. V. Huntington and University of Michigan Professor H. C. Carver (later the founding editor of the Annals of Mathematical Statistics). Carver cited the Pearson–Filon paper without any indication he saw the problem with it (page 95). Fisher's (1922b) first correction to the degrees of freedom for contingency tables was briefly cited without comment by Rietz, but Rietz (with evident approval) also gave in more detail Pearson's argument that no correction for estimating expected values was needed (pages 80–81). Elsewhere in the volume, Huntington wrote warmly of the method of moments, and nowhere was Fisher's magnum opus of 1922 referred to. Even in England understanding was slow. By 1938 Egon had conceded the degrees of freedom issue, but he seemed not to have accepted Fisher's point about Pearson and Filon (Pearson, 1938, pages 28–29).

Both Pearson and Fisher were giants in our history; despite their lack of mutual appreciation we cannot imagine modern statistics without both. Pearson's errors were substantial and not to be glossed over, but they should not obscure the even greater achievements they accompanied. Pearson had a giant ambition and the energy to realize it. He sought to create a whole new statistical system, and for a time succeeded. He did not have a mathematical mind equal to Fisher's, and he became mired in and never escaped from an incompletely developed conceptual apparatus that was not equal to the full task at hand. But he took statistics to a higher level nonetheless. If Pearson could never come to admit some failures, it was surely due to a stubbornness that even he recognized in himself. In the Preface for the Second Edition of The Grammar of Science (1900b), Pearson wrote, "If I have not paid greater attention to my numerous critics, it is not that I have failed to study them; it is simply that I have remained—obstinately it may be—convinced that the views expressed are,
relatively to our present state of knowledge, substantially correct" (Pearson, 1900b, page ix). So it was with his statistical work as well.

Pearson's impact upon Fisher may in the end stand as one of his greater achievements. Pearson had no student more diligent than Fisher, despite their differences. When in 1945 Fisher wrote an ill-fated biographical account of Pearson for the Dictionary of National Biography (rejected by the Dictionary and not published until 1994, by A. W. F. Edwards), he wrote to the editor that he had made a "lifelong study of Pearson's writings." Fisher further stated, "I have during the last 35 years at various times had occasion to look at probably all of [Pearson's fundamental statistical memoirs] and at the immense output which was published in Biometrika." It was from reading Pearson's work and Pearson's journal that Fisher's interest in statistics developed in the way it did, and in the case of the two examples discussed here, the effect of the Pearsonian blueprint could scarcely be more evident. Fisher saw Pearson clearly, warts and all, and while he did not acknowledge the extent of his debt to Pearson, its extent is clear to other, less involved readers. As in Newton's famous statement, Fisher stood on the shoulders of a giant (Merton, 1965).

Porter's recent biography (2004) is illuminating on Pearson's pre-statistical life. Eisenhart (1974) remains the most complete discussion of K. P.'s statistical work. For other discussion relating to this early work see Aldrich (1997), Hald (1998) and Magnello (1996, 1998). On Chi-square see in particular Fienberg (1980), Hacking (1984), Plackett (1983) and Stigler (1999, Chapter 19). For other aspects of the Pearson–Fisher relationship see Stigler (2005, 2007a, 2007b). Pearson himself returned to the topic of Chi-square frequently, including Pearson (1915, 1922, 1923, 1932), most of these at the instigation of Fisher.

REFERENCES

Aldrich, J. (1997). R. A. Fisher and the making of maximum likelihood 1912–1922. Statist. Sci. 12 162–176. MR1617519
Barnard, G. A. (1992). Introduction to Pearson (1900). In Breakthroughs in Statistics II (S. Kotz and N. L. Johnson, eds.) 1–10. Springer, New York.
Edgeworth, F. Y. (1908). On the probable errors of frequency-constants (contd.). J. Roy. Statist. Soc. 71 499–512.
Edwards, A. W. F. (1994). R. A. Fisher on Karl Pearson. Notes and Records Roy. Soc. London 48 97–106. MR1272638
Eisenhart, C. (1974). Karl Pearson. In Dictionary of Scientific Biography 10 447–473. Charles Scribner's Sons, New York.
Fienberg, S. E. (1980). Fisher's contributions to the analysis of categorical data. In R. A. Fisher: An Appreciation (S. E. Fienberg and D. V. Hinkley, eds.) 75–84. Lecture Notes in Statist. 1. Springer, New York.
Filon, L. N. G. (1934). Remarks in Speeches Delivered at a Dinner in University College, London, in Honour of Professor Karl Pearson, 23 April 1934. The University Press, Cambridge.
Fisher, R. A. (1922a). On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London (A) 222 309–368; reprinted as Paper 18 in Fisher (1974).
Fisher, R. A. (1922b). On the interpretation of χ2 from contingency tables, and the calculation of P. J. Roy. Statist. Soc. 85 87–94; reprinted as Paper 19 in Fisher (1974).
Fisher, R. A. (1924). Conditions under which χ2 measures the discrepancy between observation and hypothesis. J. Roy. Statist. Soc. 87 442–450; reprinted as Paper 34 in Fisher (1974).
Fisher, R. A. (1974). The Collected Papers of R. A. Fisher (J. H. Bennett, ed.). Univ. Adelaide.
Greenwood, M. and Yule, G. U. (1915). The statistics of anti-typhoid and anti-cholera inoculations, and the interpretation of such statistics in general. Proc. Roy. Soc. Medicine, Section of Epidemiology and State Medicine 8 113–190; reprinted (1971) in Statistical Papers of George Udny Yule (A. Stuart and M. G. Kendall, eds.) 171–248. Griffin, London.
Hacking, I. (1984). Trial by number. Science 84 5 69–70.
Hald, A. (1998). A History of Mathematical Statistics From 1750 to 1930. Wiley, New York. MR1619032
Jeffrey, G. B. (1938). Louis Napoleon George Filon. J. London Math. Soc. 13 310–318.
Magnello, M. E. (1996). Karl Pearson's Gresham lectures: W. F. R. Weldon, speciation and the origins of Pearsonian statistics. British J. Hist. Sci. 29 43–63.
Magnello, M. E. (1998). Karl Pearson's mathematization of inheritance: From ancestral heredity to Mendelian genetics (1895–1909). Ann. Sci. 55 35–94. MR1605849
Merton, R. K. (1965). On the Shoulders of Giants: A Shandean Postscript. The Free Press, New York.
Morant, G. M. (1939). A Bibliography of the Statistical and Other Writings of Karl Pearson (Compiled with the Assistance of B. L. Welch). Cambridge Univ. Press.
Pearson, E. S. (1938). Karl Pearson: An Appreciation of Some Aspects of His Life and Work. Cambridge Univ. Press.
Pearson, K. (1895). Mathematical contributions to the theory of evolution, II: Skew variation in homogeneous material. Philos. Trans. Roy. Soc. London (A) 186 343–414. Reprinted in Karl Pearson's Early Statistical Papers (1956) 41–112. Cambridge Univ. Press.
Pearson, K. (1896). Mathematical contributions to the theory of evolution, III: Regression, heredity and panmixia. Philos. Trans. Roy. Soc. London (A) 187 253–318.
Reprinted in Karl Pearson's Early Statistical Papers (1956) 113–178. Cambridge Univ. Press.
Pearson, K. (1897). Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc. Roy. Soc. London 60 489–498.
Pearson, K. (1900a). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Magazine, 5th Series 50 157–175. Reprinted in Karl Pearson's Early Statistical Papers (1948) 339–357. Cambridge Univ. Press.
Pearson, K. (1900b). The Grammar of Science, 2nd ed. [First ed. 1892]. Adam and Charles Black, London.
Pearson, K. (1903, 1913, 1920). On the probable errors of frequency constants. Biometrika 2 273–281, 9 1–10, 13 113–132.
Pearson, K. (1915). On a brief proof of the fundamental formula for testing the goodness of fit of frequency distributions, and on the probable error of "P." Philosophical Magazine, 6th Series 31 369–378.
Pearson, K. (1922). On the χ2 test of goodness of fit. Biometrika 14 186–191.
Pearson, K. (1923). Further note on the χ2 test of goodness of fit. Biometrika 14 418.
Pearson, K. (1932). Experimental discussion of the (χ2, P) test for goodness of fit. Biometrika 24 351–381.
Pearson, K. and Filon, L. N. G. (1898). Mathematical contributions to the theory of evolution IV. On the probable errors of frequency constants and on the influence of random selection on variation and correlation. Philos. Trans. Roy. Soc. London (A) 191 229–311. Reprinted in Karl Pearson's Early Statistical Papers (1956) 179–261. Cambridge Univ. Press. Abstract in Proc. Roy. Soc. London (1897) 62 173–176.
Plackett, R. L. (1983). Karl Pearson and the chi-squared test. Internat. Statist. Rev. 51 59–72. MR0703306
Porter, T. M. (2004). Karl Pearson: The Scientific Life in a Statistical Age. Princeton Univ. Press. MR2054951
Rietz, H. L., ed. (1924). Handbook of Mathematical Statistics. Houghton Mifflin, Boston.
Sheppard, W. F. (1899). On the application of the theory of errors to cases of normal distribution and normal correlation. Philos. Trans. Roy. Soc. London (A) 192 101–167, 531.
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard Univ. Press, Cambridge, MA. MR0852410
Stigler, S. M. (1999). Statistics on the Table. Harvard Univ. Press, Cambridge, MA. MR1712969
Stigler, S. M. (2005). Fisher in 1921. Statist. Sci. 20 32–49. MR2182986
Stigler, S. M. (2007a). The pedigree of the International Biometric Society. Biometrics 63 317–321. MR2370789
Stigler, S. M. (2007b). The epic story of maximum likelihood. Statist. Sci. 22 598–620.
Stouffer, S. A. (1958). Karl Pearson—An appreciation on the 100th anniversary of his birth. J. Amer. Statist. Assoc. 53 23–27. MR0109122
Yule, G. U. (1906). The influence of bias and of personal equation in statistics of ill-defined qualities. J. Anthropological Institute of Great Britain and Ireland 36 325–381.