Linstat 2018

From Fisher to Big Data: 100 years of Statistical Research at Rothamsted

Gavin Ross, Retired Visiting Worker, Rothamsted Research, Harpenden AL5 2JQ, England.

Rothamsted Research, formerly Rothamsted Experimental Station, was founded in 1843 by John B. Lawes, owner of an estate in Harpenden, 40 km north of London. Lawes was interested in the application of chemistry to improvements in farming, and patented the manufacture of phosphates for use as a chemical fertiliser. To demonstrate the advantages of chemical fertilisers he laid out several experimental fields on his estate where fertilisers could be compared with traditional manures. The sale of phosphates became a very profitable business, and he decided to invest the profits in scientific research. The trials were repeated every year on the same plots, and the yields and weather data were carefully recorded. Over time more scientists were appointed with expertise in botany, soils, and plant pests and diseases. The institute was greatly expanded in the 20th century with new laboratories and departments.

R.A. Fisher appointed

By 1919 it was decided that the accumulated data could yield important conclusions, and that a statistician would be needed to perform the analysis. Ronald A. Fisher, who had already established a reputation for his work on genetics and evolution, and for the theoretical derivation of Student's t distribution and the distribution of the sample correlation coefficient, was recommended. Fisher accepted on condition that he was provided with a modern calculator, and Student (W.S. Gosset) recommended the Millionaire, a large, heavy machine that could perform multiplication in one operation. Division was more complicated, so Fisher preferred to multiply by reciprocals of integers to save time. His daughter, Joan Fisher Box, in her biography, 'R. A. Fisher: The Life of a Scientist', describes the machine as 'sounding like a threshing machine'.
The previous computing resources consisted of a 19th century spiral slide rule, fine for multiplication but useless for sums of squares.

Fisher's first task was to analyse the continuous wheat experiment, Broadbalk, with weather data starting in 1838 so that there were 70 years of yields and weather data for each plot. To eliminate long-term trends and to summarise annual rainfall patterns he adapted the methods of G.H. Hardy to provide orthogonal polynomials. By performing multiple regression of the adjusted yields on the orthogonal polynomial coefficients of rainfall for each year he was able to produce graphs of the effect of rainfall on the yields of each plot. In doing so he devised the modern technique of multiple regression, with an analysis of variance to determine how many polynomial terms to include. He had already introduced the term 'variance' for the squared standard deviation, noting that variance could be partitioned into additive components. He derived the distribution of the ratio of two independent chi-squared variables, each divided by its degrees of freedom, first in its logarithmic form, Z; George Snedecor later proposed that the ratio itself be tabulated and called the F-distribution, after Fisher.

Fisher also had to advise scientists about the design of annual or laboratory experiments. He argued that for proper statistical testing of the significance of treatment effects it would be necessary to introduce random allocation of treatments to plots, and to replicate treatments over several plots. To adjust for variations in soil fertility or other external constraints, various arrangements of equal numbers of treatments in blocks, or more complicated designs such as Latin Squares, were devised. Where there were several treatment factors, such as different mineral fertilisers, he advocated balanced sets of combinations, so that the results could be presented in terms of main effects and interactions.
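As an illustration of the rainfall regression described in this section, the following minimal Python sketch builds orthogonal polynomials and partitions the sum of squares term by term. It is not Fisher's own computing scheme: it assumes NumPy is available and uses a QR decomposition as a modern stand-in for hand calculation, and the function names and toy data are hypothetical.

```python
import numpy as np


def orthogonal_polynomials(x, degree):
    """Orthonormal polynomial columns of degree 0..degree evaluated at x.

    Assumption: a QR decomposition of the power basis stands in for the
    recurrence formulae that would be used in a hand calculation.
    """
    x = np.asarray(x, dtype=float)
    powers = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    q, _ = np.linalg.qr(powers)  # orthonormalise the columns in order
    return q


def sums_of_squares_by_term(y, q):
    """Coefficient and sum of squares for each polynomial term, plus residual."""
    y = np.asarray(y, dtype=float)
    coef = q.T @ y  # columns are orthonormal, so no matrix inversion is needed
    residual = y - q @ coef
    return coef, coef ** 2, float(residual @ residual)


# Hypothetical data: an exactly quadratic response over ten equally spaced years.
years = np.arange(10)
response = 2.0 + 3.0 * years + 0.5 * years ** 2
q = orthogonal_polynomials(years, degree=4)
coef, ss_terms, ss_residual = sums_of_squares_by_term(response, q)
```

Because the columns are orthogonal, each term's sum of squares can be assessed in turn without refitting the lower-order terms. In this toy example the cubic and quartic terms contribute essentially nothing, as expected for a quadratic response.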
The variety of sources of data from different departments showed him that reliance on least squares and the Normal distribution was not always appropriate, and he introduced the concept of Likelihood, and the method of Maximum Likelihood, combined with normalising transformations of variates from binomial, Poisson or long-tailed positive distributions. He disagreed with the prevailing orthodoxy of Karl Pearson that large samples were needed to determine the distribution of any set of observations, and proposed instead that the job of the statistician was to extract and summarise whatever information there was in a particular data set.

His review of the fundamental principles of statistical inference was a dramatic announcement of the new approach to statistics. This was followed by the first edition of 'Statistical Methods for Research Workers' in 1925, setting out his ideas on experimental design, likelihood, efficiency and sufficiency, and hypothesis testing. Further work on experimental design led to the publication in 1935 of 'The Design of Experiments'. Meanwhile he continued his work on evolutionary genetics, achieving the synthesis of Darwinism and Mendelian inheritance in his 1930 publication, 'The Genetical Theory of Natural Selection'.

Among other advances he showed how to use maximum likelihood to estimate the number of organisms in a dilution series experiment, to extract all the information in assays such as pesticide trials to determine the LD50 and slope of a pesticide effect, and to compare control and treatment groups. To analyse differences between species of plants he devised Discriminant Analysis. When shown data from catches of Lepidoptera he derived the Log Series Distribution and the concept of Species Diversity, widely adopted by ecologists.
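The dilution series estimate mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than Fisher's own calculation: it assumes organisms are Poisson-distributed in the sample, so that a tube receiving volume v from a suspension of density d is sterile with probability exp(-d v), and it solves the likelihood equation by bisection. The function names and the example counts are hypothetical.

```python
import math


def score(density, volumes, n_tubes, n_positive):
    """Derivative of the log-likelihood with respect to the density."""
    total = 0.0
    for v, n, p in zip(volumes, n_tubes, n_positive):
        total -= (n - p) * v  # contribution of the sterile tubes
        x = density * v
        if p > 0 and x < 700.0:  # beyond this the fertile term is negligible
            total += p * v / math.expm1(x)  # contribution of the fertile tubes
    return total


def estimate_density(volumes, n_tubes, n_positive):
    """Maximum likelihood estimate of organisms per unit volume."""
    if sum(n_positive) == 0 or sum(n_positive) == sum(n_tubes):
        raise ValueError("need at least one fertile and one sterile tube")
    lo, hi = 1e-12, 1.0
    while score(hi, volumes, n_tubes, n_positive) > 0:  # widen the bracket
        hi *= 2.0
    for _ in range(200):  # the score decreases with density, so bisect
        mid = 0.5 * (lo + hi)
        if score(mid, volumes, n_tubes, n_positive) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical tenfold dilution series: 5 tubes at each of three volumes.
estimate = estimate_density([1.0, 0.1, 0.01], [5, 5, 5], [5, 3, 1])
```

With a single dilution the estimate reduces to the closed form -log(proportion sterile) / volume; the iterative solution is only needed when several dilutions are combined.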
Fisher was often involved in controversial arguments with other statisticians, such as Pearson, Jerzy Neyman, Gosset ('Student'), and the Royal Statistical Society, and when his papers were rejected by Pearson for Biometrika he became editor of Annals of Eugenics and published there instead. His main failure was to provide a convincing argument for 'fiducial probability', the supposed distribution of parameters given the data, in contrast to the easier concept of a distribution of sample data statistics given the parameter values. He opposed the Bayesian argument as inappropriate in laboratory and field conditions.

Fisher left Rothamsted in 1933 to succeed Karl Pearson as Galton Professor of Eugenics at the Galton Laboratory, University College London, but continued to live in Harpenden and collaborate with Rothamsted scientists. In 1943 he went to Cambridge, and on retirement he was invited to Adelaide by E.A. Cornish (of the Cornish-Fisher expansion), where he died in 1962.

Several of Fisher's assistants and students made names for themselves and developed and promoted his ideas in other fields. L.H.C. Tippett was an early student, taking his ideas into industry, and Harold Hotelling spent a year at Rothamsted. J.O. Irwin and J. Wishart were appointed in 1927, and Frank Yates in 1931. George Snedecor enthusiastically promoted Fisher's ideas in the USA, and Fisher continued to correspond with Gosset over many years.

Frank Yates, Head of Statistics 1933-68

When Fisher moved to University College in 1933 the Statistics Department was relatively small, and Frank Yates was appointed as head of the department. Yates had previous experience as a surveyor in West Africa, with expertise in least squares methods, and an early interest in a wide range of mathematical ideas. He introduced Balanced Incomplete Blocks and Split Plots, and developed the ideas of confounding of higher order interactions to increase error degrees of freedom.
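The plus and minus method that Yates devised for 2^n factorial experiments, taken up in the text that follows, is now usually called Yates's algorithm. A minimal Python sketch is given here, assuming the treatment totals are supplied in standard order ((1), a, b, ab, c, ...); the function name and the example totals are hypothetical.

```python
def yates_contrasts(totals):
    """Return the grand total followed by the contrasts A, B, AB, C, ...

    `totals` must hold 2**n treatment totals in standard order:
    (1), a, b, ab, c, ac, bc, abc, ...
    """
    size = len(totals)
    n_factors = size.bit_length() - 1
    if size < 2 or 2 ** n_factors != size:
        raise ValueError("number of treatments must be a power of two")
    column = list(totals)
    for _ in range(n_factors):
        pairs = [(column[i], column[i + 1]) for i in range(0, size, 2)]
        sums = [first + second for first, second in pairs]   # the 'plus' half
        diffs = [second - first for first, second in pairs]  # the 'minus' half
        column = sums + diffs
    return column


# Hypothetical 2x2 experiment with totals for (1), a, b, ab.
total, a_contrast, b_contrast, ab_contrast = yates_contrasts([10, 14, 12, 20])
# Dividing each contrast by 2**(n - 1), and by the replication if the inputs
# are totals over replicates, gives the usual effect estimates.
```

Each of the n passes needs only additions and subtractions of adjacent pairs, which made the method well suited to desk calculators.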
For the analysis of 2^n factorial treatments he devised the plus and minus method for main effects and interactions. With Fisher he classified the 6x6 Latin Squares to ensure an equal probability of selecting a particular square. In 1938 they published the first edition of 'Statistical Tables', still widely used today.

By 1939 Yates was actively promoting sample survey methods, and in World War II he was recruited by the Royal Air Force to provide statistical advice, particularly in the interpretation of bombing raids and their accuracy, based on aerial photography. He also worked with Ernest Jones on the interpretation of V1 attacks on London in 1944, showing that they were randomly distributed and not targeted. Meanwhile at Rothamsted he organised surveys to help increase the home supply of food resources. The Survey of Fertiliser Practice was started in 1942, involving a two-stage sampling scheme of sample counties and sample farms within counties. After the war he wrote the classic work 'Sampling Methods for Censuses and Surveys'. In 1946 he joined the UN World Fertility Survey, collaborating with Mahalanobis in India. The analysis of surveys involved the formation of multiway tables, and the use of iterative methods known as 'fitting constants' to adjust for unbalanced counts in each cell, providing least squares estimates of marginal effects, with a maximum likelihood procedure for proportional data.

The departmental staff remained small, but several well-known statisticians began their careers under Yates. William G. Cochran joined in 1934 and stayed until leaving for the USA in 1939, where he collaborated with Gertrude Cox on the classic work on Experimental Designs, and with George Snedecor on Sampling Methods. David Finney began his career at Rothamsted in 1939, studied bioassay methods and conducted field surveys, leading to his later works on Probit Analysis and on Agricultural Statistics. He also developed Fractional Replication designs. Oscar Kempthorne