Linstat 2018

From Fisher to Big Data: 100 years of Statistical Research at Rothamsted.

Gavin Ross, Retired Visiting Worker, Harpenden AL5 2JQ.

Rothamsted Research, formerly Rothamsted Experimental Station, was founded in 1843 by John B. Lawes, owner of an estate in Harpenden, 40 km north of London. Lawes was interested in the application of chemistry to improvements in farming, and patented the manufacture of phosphates for use as a chemical fertiliser. To demonstrate the advantages of chemical fertilisers he laid out several experimental fields on his estate where fertilisers could be compared with traditional manures. The sale of phosphates became a very profitable business, and he decided to invest the proceeds in scientific research. The trials were repeated every year on the same plots, and the yields and weather data were carefully recorded. Over time more scientists were appointed with expertise in botany, soils, and plant pests and diseases. The institute was greatly expanded in the 20th century with new laboratories and departments.

R.A. Fisher appointed

By 1919 it was decided that the accumulated data could yield important conclusions, and that a statistician would be needed to perform the analysis. Ronald A. Fisher, who had already established a reputation for his work on genetics and evolution, and for the theoretical derivation of the distributions of Student's t and the sample correlation coefficient, was recommended. Fisher accepted on condition that he was provided with a modern calculator, and Student (W.S. Gosset) recommended the Millionaire, a large, heavy machine that could perform multiplication in one operation. Division was more complicated, so Fisher preferred to multiply by reciprocals of integers to save time. His daughter, Joan Fisher Box, in her biography 'R.A. Fisher, The Life of a Scientist', describes the machine as 'sounding like a threshing machine'. The previous computing resources consisted of a 19th century spiral slide rule, fine for multiplication but useless for sums of squares.

Fisher's first task was to analyse the continuous wheat experiment, Broadbalk, with weather data starting in 1838 so that there were 70 years of yields and weather data for each plot. To eliminate long term trends and to summarise annual rainfall patterns he adapted the methods of G.H. Hardy to provide orthogonal polynomials, and by performing multiple regression of the adjusted yields on the orthogonal polynomial coefficients of rainfall for each year he was able to produce graphs of the effect of rainfall on the yields of each plot. In doing so he devised the modern technique of multiple regression, with an analysis of variance to determine how many polynomial terms to include. He had already introduced the term 'variance' for the squared standard deviation, noting that variance could be partitioned into additive components, and he derived the distribution of the ratio of two independent chi-squared variables, first in its logarithmic form, z. George Snedecor later proposed that the ratio itself be tabulated and called the F-distribution, after Fisher.
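In modern notation the two forms of the variance ratio can be written as follows. The symbols are introduced here only for illustration: the two chi-squared variables are independent, with nu_1 and nu_2 degrees of freedom, and each is divided by its degrees of freedom before the ratio is taken.

```latex
F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2}, \qquad z = \tfrac{1}{2} \ln F
```

Tabulating F directly avoids the need to take logarithms when using the tables.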

Fisher also had to advise scientists about the design of annual or laboratory experiments, and argued that for proper statistical testing of the significance of treatment effects it would be necessary to introduce random allocation of treatments to plots, and to replicate treatments over several plots. To adjust for variations in soil fertility or other external constraints, various arrangements of equal numbers of treatments in blocks, or in more complicated designs such as Latin Squares, were devised. Where there were several treatment factors, such as different mineral fertilisers, he advocated balanced sets of combinations, so that the results could be presented in terms of main effects and interactions.

The variety of sources of data from different departments showed him that reliance on least squares and the Normal distribution was not always appropriate, and he introduced the concept of Likelihood, and the method of Maximum Likelihood, combined with normalising transformations of variates from binomial, Poisson or long-tailed positive distributions. He disagreed with the prevailing orthodoxy of Karl Pearson that large samples were needed to determine the distribution of any set of observations, and proposed instead that the job of the statistician was to extract and summarise whatever information there was in a particular data set. His review of the fundamental principles of was a dramatic announcement of the new approach to . This was followed by the first edition of 'Statistical Methods for Research Workers' in 1925, setting out his ideas on experimental design, likelihood, efficiency and sufficiency, and hypothesis testing. Further work on experimental design led to the publication in 1935 of 'The Design of Experiments'.

Meanwhile he continued his work on evolutionary genetics, achieving the synthesis of Darwinism and Mendelian inheritance in his 1930 publication, 'The Genetical Theory of Natural Selection'. Among other advances he showed how to use maximum likelihood to estimate the number of organisms in a dilution series experiment, to extract all the information in assays such as pesticide trials to determine the LD50 and slope of a pesticide effect, and to compare control and treatment groups. To analyse differences between species of plants he devised Discriminant Analysis. When shown data from catches of Lepidoptera he derived the Log Series Distribution and the concept of Species Diversity, widely adopted by ecologists.
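The dilution series calculation can be illustrated by a minimal sketch, not a reconstruction of Fisher's own working. It assumes that organisms are Poisson distributed, so that a tube receiving volume v of the suspension stays sterile with probability exp(-density * v). The function name, the search bounds and the example counts are all hypothetical.

```python
import math


def dilution_mle(volumes, tubes, positives, lo=1e-9, hi=1e6, iters=100):
    """Maximum likelihood estimate of organism density from a dilution series.

    volumes[i]   : volume of original suspension in each tube at dilution i
    tubes[i]     : number of tubes at dilution i
    positives[i] : number of those tubes showing growth
    Assumes (hypothetically) that the estimate lies between lo and hi.
    """
    if sum(positives) == 0 or sum(positives) == sum(tubes):
        raise ValueError("no finite positive estimate when all tubes agree")

    def score(density):
        # Derivative of the log-likelihood with respect to density.
        total = 0.0
        for v, n, x in zip(volumes, tubes, positives):
            p_sterile = math.exp(-density * v)
            p_growth = -math.expm1(-density * v)  # 1 - exp(-density * v)
            total += x * v * p_sterile / p_growth - (n - x) * v
        return total

    # The score decreases steadily with density, so bisect on a log scale.
    low, high = lo, hi
    for _ in range(iters):
        mid = math.sqrt(low * high)
        if score(mid) > 0:
            low = mid
        else:
            high = mid
    return math.sqrt(low * high)


# Hypothetical three-level series: 5 tubes at each of 1, 0.1 and 0.01 ml.
estimate = dilution_mle([1.0, 0.1, 0.01], [5, 5, 5], [5, 3, 1])
print(f"estimated organisms per ml: {estimate:.2f}")
```

With a single dilution the estimate reduces to -log(proportion sterile) / volume, which gives a simple check on the sketch.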

Fisher was often involved in controversial arguments with other statisticians, such as Pearson, , Gosset ('Student'), and the Royal Statistical Society, and when his papers were rejected by Pearson for Biometrika he became editor of Annals of Eugenics and published there instead. His main failure was to provide a convincing argument for 'fiducial probability', the supposed distribution of parameters given the data, in contrast to the easier concept of a distribution of sample data statistics given the parameter values. He opposed the Bayesian argument as inappropriate in laboratory and field conditions.

Fisher left Rothamsted in 1933 to succeed Karl Pearson as Galton Professor of Eugenics at University College London, but continued to live in Harpenden and collaborate with Rothamsted scientists. In 1943 he went to Cambridge, and on retirement he was invited to Adelaide by E.A. Cornish (of the Cornish-Fisher expansion), where he died in 1962.

Several of Fisher's assistants and students made names for themselves and developed and promoted his ideas in other fields. L.H.C. Tippett was an early student, taking his ideas into industry, and Harold Hotelling spent a year at Rothamsted. J.O. Irwin and J. Wishart were appointed in 1927, and in 1931. George Snedecor enthusiastically promoted Fisher's ideas in the USA, and Fisher continued to correspond with Gosset over many years.

Frank Yates, Head of Statistics 1933-68

When Fisher moved to University College in 1933 the Statistics Department was relatively small, and Frank Yates was appointed as head of the department. Yates had previous experience as a surveyor in West Africa, with expertise in least squares methods, and an early interest in a wide range of mathematical ideas. He introduced Balanced Incomplete Blocks and Split Plots, and developed the ideas of confounding of higher order interactions to increase error degrees of freedom. For the analysis of 2^n factorial treatments he devised the plus and minus method for main effects and interactions. With Fisher he classified the 6x6 Latin Squares to ensure an equal probability of selecting a particular square. In 1938 they published the first edition of Statistical Tables, still widely used today.
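The plus and minus method for a 2^n factorial, now usually called Yates's algorithm, can be sketched as below. This is an illustrative version, assuming that the treatment totals are supplied in standard order ((1), a, b, ab, c, ac, ...); the function name and the example figures are hypothetical.

```python
def yates(totals):
    """Sum-and-difference method for a 2^n factorial experiment.

    `totals` holds the treatment totals in standard order:
    (1), a, b, ab, c, ac, bc, abc, ...
    Returns the contrasts in the same order:
    grand total, A, B, AB, C, AC, BC, ABC, ...
    """
    size = len(totals)
    n = size.bit_length() - 1
    if size == 0 or 2 ** n != size:
        raise ValueError("number of totals must be a power of two")

    column = list(totals)
    for _ in range(n):
        # Take successive pairs: sums fill the top half of the next
        # column, differences (second minus first) fill the bottom half.
        pairs = [(column[i], column[i + 1]) for i in range(0, size, 2)]
        sums = [first + second for first, second in pairs]
        diffs = [second - first for first, second in pairs]
        column = sums + diffs
    return column


# Hypothetical totals for a 2x2 factorial in the order (1), a, b, ab.
contrasts = yates([1, 2, 3, 4])
print(contrasts)
```

After n passes the first entry is the grand total and the rest are the effect contrasts. Dividing a contrast by r * 2^(n-1), where r is the number of replicates, gives the usual estimate of the corresponding effect.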

By 1939 Yates was actively promoting sample survey methods, and in World War II he was recruited by the Royal Air Force to provide statistical advice, particularly in the interpretation of bombing raids and their accuracy, based on aerial photography. He also worked with Ernest Jones on the interpretation of V1 attacks on London in 1944, showing that they were randomly distributed and not targeted. Meanwhile at Rothamsted he organised surveys to help increase the home supply of food resources. The Survey of Fertiliser Practice was started in 1942, involving a two-stage scheme of sample counties and sample farms within counties. After the war he wrote the classic work 'Sampling Methods for Censuses and Surveys'. In 1946 he joined the UN World Fertility Survey, collaborating with Mahalanobis in India. The analysis of surveys involved the formation of multiway tables, and the use of iterative methods known as 'fitting constants' to adjust for unbalanced counts in each cell to provide least squares estimates of marginal effects, and for proportional data by a maximum likelihood procedure.

The departmental staff remained small, but several well-known statisticians began their careers under Yates. William G. Cochran joined in 1934 and stayed until leaving for the USA in 1939, where he collaborated with Gertrude Cox on the classic work on Experimental Designs, and with George Snedecor on Statistical Methods. David Finney began his career at Rothamsted in 1939, studying bioassay methods and conducting field surveys, which led to his later works on Probit Analysis and on Agricultural Statistics. He also developed Fractional Replication designs. Oscar Kempthorne was appointed in 1941 and developed experimental design theory before leaving for the USA in 1946. Maurice Quenouille was appointed in 1944 before moving to Southampton University, and also wrote a textbook on experimental design. was there from 1945-48 before moving to a university post.

After the war Yates was able to expand the department, establishing a Statistical Research Service for UK agricultural institutes and ministry farms, and for colonial and overseas experiments in developing countries. For surveys he acquired punch card equipment, but by 1953 the volume of demand was such that he saw the need to acquire an electronic computer. One early prototype machine was lying idle after its early development, and Rothamsted was able to lease, rather than purchase, the Elliott 401 computer. Using thermionic valves, this computer occupied a large room and generated much heat. While extremely slow by modern standards and with a comparatively small store capacity, it marked a new beginning in statistical analysis and enabled much more work to be done. Everything had to be learnt from scratch, sharing ideas with other users of prototype machines to provide computer routines for standard functions, sorting lists, organising multiway tables, linear algebra and solution of equations, and generating pseudo-random numbers for simulation studies. As the only computer of its kind, it could not be used elsewhere or its programs shared. Instead Rothamsted was allowed to accept problems from a wide variety of institutions anxious to use its software, with no payment expected from publicly funded organisations. There was little scope for text messages apart from copied headings, and assistants had to annotate the outputs by hand, calculating significance tests from tables. The input devices were paper tape or punched cards, and output at first was only to a modified typewriter, though later to paper tape fed into a teleprinter.

The main programmers apart from Yates himself were Michael Healy, working on multiple regression and multivariate analysis methods, John Gower working on experimental designs and multiway tables, and Howard Simpson on survey analysis. Mike Westmacott, better known as a member of the 1953 expedition to Mount Everest, programmed techniques for estimating missing values in experimental designs. John Gower was approached to write the first software for Cluster Analysis, using the whole machine for quite small jobs. Gavin Ross developed early optimisation algorithms to fit nonlinear models such as exponential and logistic curves, probit lines and frequency distributions. Yates then began to design a general program for experimental designs which allowed designs such as balanced incomplete blocks and lattice squares to be analysed without writing separate single programs. The Elliott 401 was finally closed down in 1965 and taken by the Science Museum in London, where it was later worked on by the Computer Conservation Society, concerned with the preservation of early machines.

The next computer, a Ferranti Orion, was a great advance, with more storage and a wide range of functions, allowing larger programs combining many analyses previously requiring separate programs. Programs were still written in machine code, or in an autocode, so they could only be run on machines of the same series. The age of widely exchangeable software came with the next series, when programs were written in Fortran. Programmers from other institutes came to Rothamsted to test and run their programs, and these included John Nelder and Roger Mead from the National Vegetable Research Station, who developed the Simplex algorithm for function optimisation.

During this period there were many short term visitors and students from various parts of the world, including Tadeusz Calinski from Poznan, who came in 1964, and again in 1969, developing methods for analysing groups of experiments.

John Nelder, Head of Department 1968-84

Yates retired in 1968, but remained at Rothamsted to continue his research. The department was split, with the computing staff forming a separate Computer Department. John Nelder had plans for a single software package for statistical analysis, combining his ideas on data structures with a general algorithm for the analysis of variance of balanced experimental designs developed by Graham Wilkinson, who joined the department from Adelaide. The original package, known as Genstat, had a fairly limited range of facilities, but it was not difficult to incorporate regression analysis and various facilities for data transformation. The original idea to extend the program to incorporate survey analysis, cluster analysis and nonlinear models proved more difficult, and separate program suites were created for these applications.

Until the next computer, the ICL 4-70, was installed and allowed Fortran programs to be run, the new programs had to be tested elsewhere, with a courier taking cases of punched cards to London each day. By 1974 the basic Genstat suite was available. At the same time Howard Simpson developed survey analysis as RGSP (Rothamsted General Survey Program), and Gavin Ross developed the Maximum Likelihood Program (MLP) and the Classification Program (CLASP). Gower promoted the publication of numerical algorithms in Applied Statistics.

Robert Wedderburn was appointed in 1969, and with Nelder developed the analysis of Generalised Linear Models, extending the use of the logit transformation with binomial data to the square root transformation with Poisson data, and the log transformation with Gamma data. The GLIM program package was promoted by the Royal Statistical Society, and Rothamsted packages were later distributed by the Numerical Algorithms Group in Oxford. Wedderburn tragically died young, but Nelder continued to develop GLM theory along with Peter McCullagh.

Contributions to theory and practice during this period included Gower's Principal Co-ordinate Analysis, providing graphical displays of data units, later developed as Biplots, displaying both units and variables in a single diagram. Ross developed the theory of Stable Parameters using transformations to aid fitting and interpreting nonlinear models. Roger Payne developed methods for providing keys to identify individuals given multivariate data, and produced a package, GENKEY, for this purpose. Wojtek Krzanowski made contributions to multivariate theory, and promoted the Multivariate Study Group of the Royal Statistical Society. Donald Preece and Rosemary Bailey made contributions to design theory and principles of orthogonality. Mike Kenward worked on longitudinal analysis. Janet Riley headed a section giving advice to developing countries, and published papers on intercropping and aquaculture. Rob Kempton developed ideas on species diversity, and later became head of the Scottish agricultural statistical service in Edinburgh.

Later Years

Nelder retired in 1984 and was succeeded by John Gower up to 1990. During this period the pattern of statistical advice and computing was changing, with the widespread use of PCs and networks instead of relying on the mainframe computers, and many scientists were able to use programs from other sources. Other institutes now had their own computers and relied less on Rothamsted. Vic Barnett became head of department from 1991 to 1995, but Rothamsted was no longer able to maintain such a large independent statistics department, and the department's function became more of a service to scientists, along with growing interest in modelling. Barnett promoted international conferences on Statistics in the Environment.

Barnett was succeeded by Robin Thompson, who maintained his interest in animal statistics, and the use of REML models, with two or more sources of error, and spatial statistics to analyse crop variation in two dimensions. In 2002 the programming section looking after GENSTAT moved to a separate office, marketing programs as a company known as VSN. Meanwhile data collection was changing with increased use of automatic devices such as drones, flying over crop fields to make many more observations than before. Rothamsted also became more involved in genetic mapping of DNA sequences, requiring different analytical techniques.

In 2018 Andrew Mead heads the Statistics section, but to some extent the problems arising in laboratory and field experiments are not very different from those posed to Fisher in 1919. The challenge of providing appropriate design and analysis of scientific data will continue for many years to come.

Retrospect

Rothamsted statisticians have always been part of a wider community, collaborating with others in many countries, and playing their part in statistical societies, journals and conferences. They have travelled extensively, and welcomed international visitors from all quarters. They have been open to ideas from elsewhere, and shared their experiences with all. The early importance of an institution with many different departments, and a pioneering computing role allowing access to data from many different disciplines and sources, contributed to its achievements. The support staff should not be forgotten, with teams of mainly women on desk calculators and data preparation.

Selected Book References

Fisher-Box, Joan (1978). R.A. Fisher, The Life of a Scientist. Wiley.

Fisher, R.A. (1925) and 11 further editions. Statistical Methods for Research Workers. Oliver and Boyd.

Fisher, R.A. (1930). The Genetical Theory of Natural Selection. Oliver and Boyd.

Fisher, R.A. (1935) and 5 further editions. The Design of Experiments. Oliver and Boyd.

Fisher, R.A. and Yates, F. (1938) and 6 further editions. Statistical Tables for Biological, Agricultural and Medical Research. Longman.

Yates, F. (1937). The Design and Analysis of Factorial Experiments. CAB Press.

Yates, F. (1949) and 4 further editions. Sampling Methods for Censuses and Surveys. Griffin.

Finney, D.J. (1947) and further editions. Probit Analysis. Cambridge University Press.

McCullagh, P. and Nelder, J.A. (1983). Generalised Linear Models. Chapman and Hall.

Gower, J.C. and Hand, D.J. (1996). Biplots. Chapman and Hall.

Ross, G.J.S. (1990). Nonlinear Estimation. Springer.