National Institute for Applied Statistics Research Australia

University of Wollongong, Australia

Working Paper

15-18

Mixed Models for Data Analysts

Brian Cullis, Alison Smith, Ari Verbyla, Robin Thompson, and Sue Welham

Copyright © 2021 by the National Institute for Applied Statistics Research Australia, UOW. Work in progress; no part of this paper may be reproduced without permission from the Institute.

National Institute for Applied Statistics Research Australia, University of Wollongong, Wollongong NSW 2522, Australia. Phone +61 2 4221 5435, Fax +61 2 4221 4998. Email: [email protected]

Mixed Models for Data Analysts - DRAFT 2018

Brian Cullis, Alison Smith National Institute for Applied Statistics Research Australia University of Wollongong

Ari Verbyla, Robin Thompson, Sue Welham
CSIRO

Contents

1 Introduction
  1.1 Applications of linear mixed models
  1.2 Model definition
  1.3 Setting the scene

2 An overview of linear models
  2.1 Example: Glasshouse experiment on the growth of diseased and healthy plants
  2.2 Model construction and assumptions
  2.3 Estimation in the linear model
  2.4 Non-full rank Design Matrix
  2.5 A glimpse at REML
  2.6 Factors and model specification
  2.7 Tests of hypotheses
  2.8 Analysis of plant growth example
  2.9 Summary

3 Analysis of designed experiments
  3.1 One-way classification
  3.2 Randomised complete blocks
  3.3 Split plot design
  3.4 Balanced incomplete blocks
  3.5 In search of efficient estimation for variance components
  3.6 Summary

4 The Linear Mixed Model
  4.1 The Model
  4.2 Variance structures for the errors: R-structures
  4.3 Variance structures for the random effects: G-structures
  4.4 Separability
  4.5 Variance models
  4.6 Identifiability of variance models
  4.7 Combining variance models
  4.8 Summary

5 Estimation
  5.1 Estimation of fixed effects and variance parameters
  5.2 Estimation of variance parameters
  5.3 Estimation of fixed and random effects
  5.4 An approach to prediction in linear mixed models - REML construct
  5.5 Summary

6 Inference
  6.1 General hypothesis tests for variance models
  6.2 Hypothesis testing in mixed models: fixed and random effects
  6.3 Summary

7 Prediction from linear mixed models
  7.1 Introduction
  7.2 The Prediction Model
  7.3 Computing Strategy
  7.4 An example of the prediction model
  7.5 Prediction in models not of full rank
  7.6 Issues of averaging
  7.7 Prediction of new observations

8 From ANOVA to variance components
  8.1 Introduction
  8.2 Navel Orange Trial
  8.3 Sensory experiment on frozen peas

9 Mixed models for Geostatistics
  9.1 Introduction
  9.2 Motivating Examples
  9.3 Geostatistical mixed model
  9.4 Covariance Models for Gaussian random fields
  9.5 Prediction
  9.6 Estimation
  9.7 Model building and diagnostics
  9.8 Analysis of examples
  9.9 Simulation Study

10 Population and Quantitative Genetics
  10.1 Introduction
  10.2 Mendel’s Laws
  10.3 Population genetics
  10.4 Quantitative genetics
  10.5 Theory
  10.6 Discussion

11 Mixed models for plant breeding
  11.1 Introduction
  11.2 Spatial analysis of field trials
  11.3 Analysis of multi-phase trials for quality traits

12 The analysis of quantitative trait loci
  12.1 Introduction
  12.2 Example
  12.3 Overview of Molecular Genetics
  12.4 Reproduction
  12.5 Genetic information
  12.6 Linkage analysis
  12.7 QTL analysis
  12.8 Interval mapping: The Regression Approach
  12.9 Whole genome interval mapping
  12.10 Conclusions

13 Mixed models for penalized models
  13.1 Introduction
  13.2 Hard-edge constraints
  13.3 Soft-edge constraints
  13.4 Penalized Regression splines
  13.5 P-splines
  13.6 Smoothing splines
  13.7 L-splines
  13.8 Variance modelling
  13.9 Analysis of high-resolution mixograph data
  13.10 Analysis of another example: still to come
  13.11 LASSO
  13.12 Discussion

Bibliography

A Iterative Schemes
  A.1 Introduction
  A.2 Gradient methods: Average Information algorithm
  A.3 EM Algorithm
  A.4 PX-EM - an improved EM algorithm
  A.5 Computational Implementation
  A.6 Summary

CHAPTER 1

Introduction

The linear model is a basic tool in statistical modelling with widespread use and application in data analysis and applied statistics. The expected value of the response variable is given by a linear combination of explanatory variables, often termed the linear predictor (McCullagh and Nelder, 1994). The stochastic nature of the response is modelled using a single random component, elements of which are assumed to be independent with constant variance.

The linear mixed model is a natural extension of the linear model. In the simplest case the linear mixed model is a linear model which has been extended to allow for a correlated error term or additional random components. The wide range of variance and correlation models makes the linear mixed model a very flexible tool for data analysts. The linear mixed model provides the basis for the analysis of many data-sets commonly arising in the agricultural, biological, medical and environmental sciences, as well as other areas.

This introductory chapter provides an overview of the book through examples. The flavour of the linear mixed model and the diversity of possible applications are presented. The examples also allow the development of models that contain both variates and factors, and indeed to define these two types of variable. The symbolic representation of linear models is introduced as it will be used throughout the book. The philosophical issues that arise in the analysis of data are also raised.

The authors have a background in statistics arising from experiments. Thus the models proposed in this book begin with the design or sampling structure of the study. These models are randomization based. Hence it is important to understand the origins of mixed models, which belong in the analysis of designed experiments. There are often further considerations in model building that suggest plausible variance models that are outside a randomization based approach. The diversity of applications and the complexity and flexibility of variance models result in the subjective choice of a variance model, and hence analysis, for any particular data-set. A consequence of the increased flexibility and complexity of variance modelling is the danger of fitting inappropriate variance models to small or highly unbalanced data-sets. The approach taken in this book is to highlight the difference between analyses that use a variance modelling approach and a design-based approach. The former relies heavily on the appropriateness of the fitted variance structure for the validity of inferences concerning both fixed and random effects. The importance of data-driven diagnostics is stressed throughout the book although there remain unresolved issues in this area.

1.1 Applications of linear mixed models

Typical applications covered in this book include the analysis of balanced and unbalanced designed experiments, the analysis of balanced and unbal- anced longitudinal data, repeated measures analysis, the analysis of regular or irregular spatial data, the combined analysis of several experiments (ie meta-analysis) and the analysis of both univariate and multivariate animal and plant genetics data. This chapter will be finished last!!!

1.2 Model definition

1.2.1 Defining the linear model

A linear model relates the expected value of a response variable (outcome) to a linear combination of a set of explanatory variables, termed the linear predictor. The stochastic properties of the response are determined by the addition of a single random component to the linear predictor. Explanatory variables are assumed to be measured without error, and may be design-based (values planned as part of the experiment) or observational. For example, we may wish to examine the effect of imposed dietary regimes (explanatory variable) on the live-weight of pigs (response variable). Let y_i represent the response measured at the end of the experiment for the ith individual or experimental unit (in the example the pig), i = 1, ..., n. These are combined to form the n × 1 data vector y. The linear model for y is then written as

y = Xτ + e        (1.2.1)

where τ is a p × 1 vector of (fixed) effects corresponding to the explanatory variables (in this example diet effects) and X is the associated n × p design matrix (chosen to be of full rank, see chapter 2). The columns of X may be dummy variables, that is columns of zeros and ones, which assign categorical variables or factors to units, or columns of continuous variables or covariates. These aspects will be covered in more detail in chapter 2. The residuals or errors, e_i, are assumed to be identically and independently distributed with zero mean and common variance, and to follow a Normal distribution. This is written as

e ∼ N(0, σ²I_n)

The parameters to be estimated are σ² and τ. Generally we wish to give measures of uncertainty or precision, and sometimes conduct tests of hypotheses on τ.

1.2.2 Defining the linear mixed model

The linear mixed model also relates the expected value of a response variable (outcome) to the linear predictor. In this case, however, the stochastic properties of the response are determined by one or several random effects which are added to the linear predictor. For example, groups of pigs may have been housed together in pens and the live-weight of the pigs may therefore be affected by both the dietary treatment and the pen to which the pigs were assigned. We may regard the dietary effects as fixed effects and the pen effects as random effects. In doing this we assume that in the absence of dietary effects there is a (positive) covariance between the live-weight of pigs that are in the same pen.

Another important extension of the linear model is to allow the elements of the residual vector e or, more generally, any random effects to be correlated. For example, in the analysis of longitudinal data, repeated measurements on the same individual may be correlated. This feature may be accounted for by relaxing the assumption of independence between the residuals within each individual, although residuals on different individuals may remain independent. We need a flexible framework to define the types of correlation models we may wish to fit. Further, the variance may change as time progresses. This is often the case in longitudinal data-sets involving the measurement of weight (Kenward, 1987). Hence we may also need to relax the assumption of constant variance, choosing either to model the variance as a function of time, or to use another parametric model which adequately reflects the variance properties of the data.

If y denotes the n × 1 vector of observations, the linear mixed model can be written as

y = Xτ + Zu + e        (1.2.2)

where τ is the p × 1 vector of fixed effects, X is the n × p design matrix (of full rank) that associates observations with the appropriate combination of fixed effects, u is the b × 1 vector of random effects, Z is the n × b design matrix that associates observations with the appropriate combination of random effects, and e is the vector of residual errors. It is assumed that

\[
\begin{bmatrix} u \\ e \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \sigma^2_H \begin{bmatrix} G(\gamma) & 0 \\ 0 & R(\phi) \end{bmatrix} \right) \qquad (1.2.3)
\]

where the matrices G and R are functions of parameters γ and φ, respectively. The parameter σ²_H is a variance parameter which we will refer to as the overall scale parameter.
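To make the definition concrete, the following is a minimal sketch of fitting a model of the form (1.2.2) to the pig example using Python's statsmodels library. The file name and columns (weight, diet, pen) are hypothetical stand-ins, since no pig data-set is given in the text.

```python
# A minimal sketch of fitting (1.2.2) to the pig example with Python's
# statsmodels; the file name and columns (weight, diet, pen) are
# hypothetical stand-ins, as no pig data-set is given in the text.
import pandas as pd
import statsmodels.formula.api as smf

pigs = pd.read_csv("pigs.csv")  # assumed columns: weight, diet, pen

# Fixed effects: diet. Random effects: a pen intercept, which induces a
# positive covariance between the live-weights of pigs in the same pen.
model = smf.mixedlm("weight ~ diet", data=pigs, groups=pigs["pen"])
fit = model.fit(reml=True)      # REML estimation, developed in chapter 5
print(fit.summary())
```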

1.2.3 Factors and covariates

Covariates and factors are used to define explanatory variables. A covariate is defined to be an explanatory variable which may take any value (within a given valid range). For example, in a study of the growth of weaner pigs, the initial live-weight of each pig may have been measured with the view of adjusting for its effect. The initial weight, denoted by initwt, is a covariate in the linear mixed model. When a covariate is included then, in general, a single effect, associated with the assumed regression of the response on the covariate, is included in either τ (for a fixed covariate) or u (for a random covariate).

A factor is defined as an explanatory variable which takes one of a set of discrete values (or categories). The set of discrete values are called levels. For example, in the study of the growth of pigs, an individual pig may be described by a factor denoting its sex. This factor, denoted by sex, can take one of two values, either male or female. Factors may be purely qualitative, ordinal (eg, with levels such as low, medium and high), or relate to some underlying quantitative scale. In the preceding example the factor sex is a qualitative factor, since its levels cannot be ordered in any special way. However, to assess the effect of phosphorus (P) fertilizer on the yield of wheat, an agronomist may conduct an experiment with several treatments which are rates of fertilizer, 0, 5, 10, 20 and 40 kg/ha P, say. Here the treatment, P, is a factor with 5 levels. It is a quantitative factor and it is usually of interest to model the effect of P on yield using some functional form.

In general, when a factor is included as a term in the linear mixed model the result is to include in τ or u an effect for each level of the factor. For quantitative factors there are often examples (see chapter ??) in which the factor may be included more than once in the linear model, in differing roles, say as a covariate and as a factor. As a covariate we are fitting the linear regression of the response on the levels of the factor, while in the second case an effect is included for each level of the factor. This latter setting may be useful to examine the goodness of fit of the assumed linear regression of the response on the levels of the factor.

Terms in the linear mixed model may relate directly to the factors and covariates in the set of explanatory variables or may be formed as a combination of factors and/or covariates. For example, the notion of interaction usually involves the examination of the effect of one factor or covariate in the presence of another factor or covariate. The form for the model terms as combinations of factors and covariates can be conveniently and succinctly written down using a syntax originally developed by Wilkinson and Rogers (1973), which is now briefly described. This syntax is widely adopted in many statistical packages such as GENSTAT 5 (Payne et al., 2001) and S-PLUS (Mathsoft, 2000).

1.2.4 Symbolic representation of model formulae

We describe the terms of the linear model using the syntax of Wilkinson and Rogers (1973). Factors and covariates may be present in model terms as either main effects or as a component of an interaction. In the example on pigs, if the male and female pigs were randomly assigned to one of three diets, we may wish to examine both the main effects of sex and diet and allow for a possible interaction. We denote the interaction between factors sex and diet using the dot operator, ie as sex.diet. In this example, we wish to fit both the main effects of diet and sex as well as the interaction. Thus the factors are said to be crossed, and this is succinctly written down by the single term sex*diet, where the * operator is the crossing operator. This single term expands to

sex * diet = sex + diet + sex.diet

where sex.diet denotes the 6 interaction effects of sex and diet.

Alternatively, factors may be nested. For example, in testing new wheat varieties in South Australia the wheat growing area of the state has been divided into 6 regions and within each region comparative yield trials are grown at several locations. These locations are said to be nested within regions, ie each location occurs in only one region. Locations may be coded from one to the number of locations within each region rather than by their unique name, to reflect the aim to fit a nested model, in which the main effect of location is not explicitly fitted. This nested form implies we fit the main effect of region and the interaction of region and location. A succinct representation of these two terms results from use of the / operator, which represents the nesting operation. For example,

region/location = region + region.location

Of course, if the location factor was recoded using the individual location names then the main effect of the recoded factor is equivalent to the interaction with the nesting factor. More complex operators and model term functions will be introduced as necessary.

The terms of the linear mixed model are classified according to whether the effects associated with the term are fixed or random. Hence we can represent a linear mixed model by fixed and random model formulae. In the above example of pig weight, if we denote the response by y and, in addition, pigs were housed in groups or pens within an animal house, then a convenient representation of the linear mixed model for these data is given by

y ∼ sex * diet + pen + units

where pen is a factor denoting the allocation of pigs to pens. The reserved word units represents the residual term, which could be regarded as a random factor with one level for each of the experimental units. As discussed in the preface, terms which involve random effects are bolded in the symbolic model formula.
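The Wilkinson and Rogers syntax is directly executable in several modern environments. As an illustration (not part of the original text), the Python library patsy implements a close dialect; the tiny data frame below is invented purely to show how the crossing operator expands.

```python
# Sketch of the Wilkinson-Rogers operators using the Python library
# patsy, which implements a close dialect of this syntax (patsy writes
# the dot operator as ':'). The data frame is invented for illustration.
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({
    "sex":  ["male", "male", "female", "female", "male", "female"],
    "diet": ["d1", "d2", "d3", "d1", "d3", "d2"],
})

# The crossing operator expands as in the text:
# sex * diet = sex + diet + sex:diet
X = dmatrix("sex * diet", df)
print(X.design_info.term_names)
# ['Intercept', 'sex', 'diet', 'sex:diet']

# Nesting via '/': region/location = region + region:location
```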

1.3 Setting the scene

As we have seen, the effects (and therefore the terms associated with these effects) in a linear mixed model are classified as either fixed or random. A random term models the (co)variation in the data, in the simplest case by the introduction of a component of variance. A fixed term contributes to the expectation of the response. It is difficult to construct a set of rules which determine the classification of effects (and hence terms) as fixed or random. Traditionally it has been suggested that if the "levels of the factor come from a probability distribution" then the factor may be considered as random (see Searle et al., 1992, chap. 1). However, a term may also be taken as random even though the levels of the factor which relate to the term (or a component of the term) cannot be assumed to have arisen from a probability distribution. Our approach is pragmatic and we allow the classification to be less rigid. There are general rules which can assist with the decision, but the final choice must depend on the aim of the analysis and the role of the term in the linear mixed model. In the following we present a range of examples which are discussed in the book and describe how fixed and random effects may arise in the context of the analysis of such examples.

Components of variance: The aim of the analysis of these data is typically concerned with determining and quantifying the major sources of variation. The data are often collected as part of a designed or observational study with many factors classifying the units, and the analysis is to determine the magnitude of variation associated with each factor or combination of factors. Examples of these types of factors include trees within an orchard, students within a school, doctors within a hospital, cows within a herd within a farm, days within a sampling period, batches within a process and so on. The levels of the factor can usually be thought of as having arisen from a probability distribution, and as the aim is to examine (co)variation associated with the factor (and terms derived from it) it is natural to classify the factor (and terms derived from it) as random. The analysis of an example of this type of data is described in chapters 2 and 3.

Designed experiments: The analysis of data arising from designed experiments is one of the earliest examples of the application of linear mixed models. Traditionally, factors are generally defined as either blocking or treatment factors (see, for example, GENSTAT 5 (Payne et al., 2001)). Blocking factors are concerned with the description and definition of the stratification of the experimental units. In this sense their role in the linear mixed model is to describe (co)variation in the data, and their levels can be assumed to have arisen from a probability distribution. Hence blocking factors are usually classified as random. We note that in the analysis of orthogonal designs, such as randomised complete blocks, estimation of treatment effects is identical regardless of whether blocking factors are classified as fixed or random. Treatment factors are generally classified as fixed.

Quantitative treatments: In chapter ?? we present the analysis of experiments with quantitative treatments in which the aim is usually to describe or quantify the (continuous) relationship between the response and the treatment levels. There is often more than one observation for each value (ie level) of the factor. Traditionally, the analysis proceeds by modelling the relationship using low order polynomials or non-linear models. More recently, semi-parametric methods such as smoothing splines have been suggested and used (Green and Silverman, 1994a; Verbyla et al., 1999a). Verbyla et al. (1999a) decompose the factor into a linear component and "smooth" non-linear deviations, where the linear component is fitted as a fixed effect and the smooth component is fitted as a random term. When there are replicate covariate values, the opportunity exists to partition the variation not explained by the modelled response into lack of fit and pure error components. It is often convenient to model lack of fit by inclusion of the associated treatment factor as a random term in the linear mixed model.

Repeated measures: These data arise in many applications where multiple measurements are taken on experimental units. The aim of the analysis of repeated measures from designed experiments may be to examine the overall effects of treatments and how these effects vary with time. In this context subjects or experimental units will usually be measured at common time points. The classification of treatment and blocking factors will usually still be fixed and random respectively. The main effects of time and interactions with treatments will usually be fixed terms, interactions with blocking factors will usually be random terms, and correlated error terms may also be added to the model (which are of course random). The aim of the analysis of repeated measures from observational studies may be to describe or quantify the response profiles for experimental units, groups of units or factors arising in the study. Low order polynomials with coefficients assumed to be fixed effects have often been suggested for describing the response profiles (Diggle et al., 1994). Random coefficient regression has also been widely advocated and used for the analysis of such data (Laird and Ware, 1982). Most recently, smoothing splines and other semi-parametric modelling approaches have also been suggested (Brumback and Rice, 1998a; Verbyla et al., 1999a). Terms modelling treatment response profiles are usually fitted as fixed (except for spline components). Terms used to model (co)variation due to structure between experimental units (for example blocking) or variation of individuals about treatment profiles are usually fitted as random. These random terms are often fitted to quantify the population variability, but they also model the covariance between the set of repeated measurements within each individual. The analysis of these data is described in chapters ?? and ??.

Multivariate analysis: Multivariate linear mixed models are used where several measurements (traits) have been made on a set of experimental units, and a linear mixed model is required for each trait, taking account of the correlation between the traits (within each unit). In this case, the overall constant term which is usually fitted in a univariate linear mixed model is replaced by a factor which fits a constant term for each trait. Similarly, the overall residual variance is replaced by a residual variance for each trait, with correlation between traits. Complex variance models are fitted to all random terms in the linear mixed model to account for the correlation between traits.

Spatial analysis: The analysis of spatial data usually involves modelling the (co)variation of the data. Experimental units are measured at a set of points which can be described by a coordinate system in 1 or 2 (and occasionally 3) dimensions. The data may be regularly or irregularly spaced. As an example of regularly spaced data, the analysis of field experiments has received much attention since the seminal paper of Wilkinson et al. (1983). Many covariance models have been proposed for the errors of field trials, including time-series models (Cullis and Gleeson, 1991) as well as those originating in geostatistics (Cressie, 1991). The analysis of these types of data is considered in chapter ??.

Analysis of a series of experiments: This is increasing in popularity as a method of summarising and integrating the findings of studies or experiments with a common set (or subset) of treatments or aims. These occur in many application areas, especially medical (see eg, Yusuf, 1985) and agricultural (Smith and Cullis, 2001). As well as including factors which are measured within studies, factors may also be defined at the study level, such as geographic location, date of study, type or size of clinic. The aim of the analysis is usually to determine the treatment factors affecting the response, and the influence of study-level factors, both as main effects and how these may interact with treatment factors. Terms associated with study-level factors may be classified as fixed or random, depending on the context. Terms associated with treatment factors are usually classified as fixed, but there are examples where these are classified as random. These issues are discussed more fully in ??.

The above examples illustrate that often, consideration of whether the levels of a factor arise from a probability distribution is not sufficient to determine the classification of the factor as fixed or random. A term may also be classified as random rather than fixed to:

1. quantify (co)variation between different factor levels via a variance model
2. reflect the structure of the data
3. achieve efficient selection, based on prediction of future performance
4. allow inference for a broader set of conditions

Conversely, there are examples where a term may be classified as fixed even though its levels could have arisen from a probability distribution. We note that if a term in the linear mixed model is classified as random then any other term which shares a common set of factors is also classified as random. For example, if variety and site.variety are terms in the linear mixed model and variety is classified as random, then site.variety must be classified as random.

CHAPTER 2

An overview of linear models

In this chapter we review some basic ideas and results for the linear model. For a more thorough treatment of linear models the reader is referred to Searle (1971), for example. The ideas will be covered in the context of a small example.

2.1 Example: Glasshouse experiment on the growth of diseased and healthy plants

The data for this example are presented in the GENSTAT 5 manual (Payne, 1993). An experiment was designed to examine the difference between the growth of diseased (MAV) and healthy (HC) plants. The heights of plants were measured at 1, 3, 5, 7 and 10 weeks after treatment. There were seven plants per treatment and the plants were arranged in a completely randomised design. For the purpose of illustration, we consider the height of each plant 10 weeks after treatment as our response variable. The data are given in table 2.1.

Table 2.1 Height (cm) of plants 10 weeks after treatment

  Plant number            Treatment
  within treatment      HC        MAV
  1                    57.0       55.0
  2                   123.5       67.6
  3                    66.0       61.5
  4                   130.0       58.0
  5                   114.0      104.0
  6                   107.5       62.0
  7                   110.5       75.9

We are interested in estimating the effect of the treatments on plant height and in examining whether there is a significant difference in mean plant height for the two treatments.

2.2 Model construction and assumptions

To allow inferences to be conducted, we propose the model

y_ij = τ_i + e_ij        (2.2.1)

where y_ij is the height of the jth plant (j = 1, ..., 7) for the ith treatment (i = 1, 2), τ_i is the mean effect for the ith treatment, and the e_ij are random "errors" that reflect plant to plant variability that is not related to the treatment. We assume the random errors are such that e_ij ∼ N(0, σ²) and that they are independent for all i and j. Let n = 14 denote the total number of observations.

In symbolic form (2.2.1) can be written as

y ∼ treatment + units

where the variable treatment is a factor taking two levels, with value i for observation y_ij. If i = 1 the treatment is HC, while if i = 2 the treatment is MAV. The term units is a factor that has levels 1 to n and represents the random errors.

Let

\[
y = \begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{17} \\ y_{21} \\ \vdots \\ y_{27} \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{bmatrix}, \qquad
\tau = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix} \qquad (2.2.2)
\]

and let e be the vector of e_ij in the same order as y. The matrix X defines the treatment term, and has two columns (corresponding to the two levels of the factor), each row having a zero and a one, with the one indicating which treatment level is appropriate for each experimental unit (plant). Using the above vectors and matrices, the model (2.2.1) can be written succinctly as

y = Xτ + e        (2.2.3)

and this is the vector-matrix form of the linear model. Note that the assumptions regarding e_ij imply that e ∼ N(0, σ²I_n), so that the distribution of y is given by

y ∼ N(Xτ, σ²I_n)        (2.2.4)

The unknown parameters, τ and σ², must be estimated from the data.
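As a sketch (not part of the original text), the data vector y and design matrix X of (2.2.2) can be built from the Table 2.1 heights with numpy:

```python
# Sketch: the data vector y and design matrix X of (2.2.2) built with
# numpy from the Table 2.1 heights.
import numpy as np

hc  = np.array([57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5])
mav = np.array([55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9])
y = np.concatenate([hc, mav])               # n = 14, ordered as in (2.2.2)

# One dummy column per treatment level: rows 1-7 pick out tau_1,
# rows 8-14 pick out tau_2.
X = np.kron(np.eye(2), np.ones((7, 1)))
print(X.shape)                              # (14, 2)
```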

2.3 Estimation in the linear model

We consider the linear model in a general setting and return to the example throughout this chapter.

We now denote the individual observations by y_i, i = 1, 2, ..., n, and let x_i′ (where ′ denotes the transpose) be the ith row of X. For example, the first row of X given in (2.2.2) is x_1′ = [1 0]. Then from (2.2.4), the individual observations y_i are statistically independent and have distribution

y_i ∼ N(x_i′τ, σ²)        (2.3.5)

Equation (2.3.5) is a convenient form for the estimation of the unknown parameters τ and σ². We use a likelihood based approach for estimation of the parameters. The likelihood function for independent observations is defined as

\[
L(\tau, \sigma^2; y) = \prod_{i=1}^{n} f(y_i; \tau, \sigma^2)
\]

where f(y_i; τ, σ²) is the probability density function for y_i. In our case y_i follows a normal distribution specified by (2.3.5). Hence

\[
L(\tau, \sigma^2; y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(y_i - x_i'\tau)^2 \right\}
\]

and the log-likelihood function, denoted by ℓ, is given by

\[
\ell = \ell(\tau, \sigma^2; y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i'\tau)^2 \qquad (2.3.6)
\]

Standard maximum likelihood estimation consists of differentiating the log-likelihood with respect to τ and σ² to form the vector of derivatives, which is called the score vector. The estimates are found by equating the score vector to zero. For the linear model it is possible to solve the resulting equations directly, but in general an iterative procedure is required. Here the derivatives of ℓ with respect to τ and σ² are

\[
\frac{\partial \ell}{\partial \tau} = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i(y_i - x_i'\tau) \qquad (2.3.7)
\]

and

\[
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - x_i'\tau)^2 \qquad (2.3.8)
\]

respectively. Noting

\[
\sum_{i=1}^{n} x_i x_i' = X'X, \qquad \sum_{i=1}^{n} x_i y_i = X'y
\]

the score vector is given by

\[
U = U(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}(X'y - X'X\tau) \\[4pt] -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\tau)'(y - X\tau) \end{bmatrix} \qquad (2.3.9)
\]

Equating the score vector to zero, we see that

X′Xτ̂ = X′y        (2.3.10)

The equations in (2.3.10) are called the normal equations. If X is of full column rank, that is the columns are linearly independent, X′X is non-singular and

τ̂ = (X′X)⁻¹X′y        (2.3.11)

is the maximum likelihood estimate of τ (also the least squares estimate). The case when X is not of full column rank is discussed below. Under model (2.3.5),

τ̂ ∼ N(τ, σ²(X′X)⁻¹)        (2.3.12)

and τ̂ is an unbiased estimator of τ. By the Gauss-Markov theorem, linear functions a′τ̂ are also the minimum variance unbiased estimators of a′τ.

The maximum likelihood estimate of σ² is

\[
\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i'\hat{\tau})^2 \\
&= \frac{1}{n}(y - X\hat{\tau})'(y - X\hat{\tau}) \\
&= \frac{1}{n}(y'y - \hat{\tau}'X'y) \qquad (2.3.13) \\
&= \frac{1}{n}R \qquad (2.3.14)
\end{aligned}
\]

where R is the residual sum of squares. This estimate is known to be biased (E σ̂² = ((n − p)/n)σ²), because it does not take account of the p degrees of freedom used in the estimation of τ. We return to this shortly. Note also that (2.3.13) shows that the sum of squares due to the linear model (2.2.3) is given by

SSQ(τ̂) = τ̂′X′y = τ̂′X′Xτ̂ = (Xτ̂)′(Xτ̂)        (2.3.15)

The second derivatives of the log-likelihood and their expected values are important for estimation and also for inference. The negative of the matrix of second derivatives is called the observed information matrix, while the expected value of this matrix is called the expected or Fisher information matrix.
The observed information matrix is

\[
J = J(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}X'X & \frac{1}{\sigma^4}(X'y - X'X\tau) \\[4pt] \frac{1}{\sigma^4}(y'X - \tau'X'X) & -\frac{n}{2\sigma^4} + \frac{1}{\sigma^6}(y - X\tau)'(y - X\tau) \end{bmatrix}
\]

while the expected information matrix is equal to

\[
I = I(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}X'X & 0 \\ 0 & \frac{n}{2\sigma^4} \end{bmatrix} \qquad (2.3.16)
\]

The fitted values for each observation are given by

Xτ̂ = X(X′X)⁻¹X′y = P_X y

where the n × n matrix P_X is called the projection matrix for X. It is also called the "hat matrix". The properties of P_X are simple but very important, namely

P_X′ = P_X,  P_X² = P_X,  P_X X = X

which imply that P_X is an orthogonal projection matrix onto the plane or space defined by the columns of X. P_X is a real symmetric matrix, and hence we can diagonalize it (see appendix ??) so that

P_X = KΛK′

where Λ is a diagonal matrix whose elements are the eigenvalues of P_X, and K is the matrix of orthonormal eigenvectors (K′K = I_n). It is easy to show that P_X has p unit eigenvalues and n − p zero eigenvalues. Thus we can partition Λ into two diagonal matrices, Λ₁ = I_p and Λ₂ = 0_{n−p} (a square matrix of zeros of size n − p), and K into two orthogonal components K₁ and K₂ (K₁′K₂ = 0), where the columns of K₁ are those eigenvectors corresponding to the unit eigenvalues, and the columns of K₂ are those eigenvectors corresponding to the zero eigenvalues. Thus we have

\[
P_X = \begin{bmatrix} K_1 & K_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} K_1' \\ K_2' \end{bmatrix} = K_1 K_1' \qquad (2.3.17)
\]

As K₁′K₁ = I_p,

P_X = X(X′X)⁻¹X′ = K₁(K₁′K₁)⁻¹K₁′ = K₁K₁′

are equivalent representations of the orthogonal projection onto the space defined by the columns of X.

The estimate of e, the residual vector, is given by

ẽ = y − Xτ̂
  = y − P_X y
  = (I_n − P_X)y

and (I_n − P_X) is also an orthogonal projection matrix. This projection is orthogonal to the plane or space defined by X. The eigenvectors of this matrix are identical to those of P_X, that is K, and the diagonal matrix of eigenvalues of I_n − P_X is equal to I_n − Λ. Thus we have an equivalent result to (2.3.17), namely

I_n − P_X = K₂K₂′        (2.3.18)

Notice also that as (I_n − P_X)X = 0, K₂′X = 0. The two representations (2.3.17) and (2.3.18) provide the basis of transformations discussed below and in chapter 3.

Lastly, notice that the notation ẽ suggests that this estimate is of a different type than the estimate τ̂. In fact, we have estimated a random vector, something that is called prediction rather than estimation; ẽ is the best linear unbiased predictor (BLUP) of e. This will be discussed later.

2.4 Non-full rank Design Matrix

We briefly discuss the case where X is not of full column rank and hence where X′X is singular. In this case the solution of the normal equations (2.3.10) is not unique. This case arises naturally in chapter 3.

If X is not of full column rank, one solution is to find a full rank version. This involves finding matrices X* and A such that X = X*A and X* is of full column rank. This is the standard method used for over-specified models and is discussed in section 2.6. We consider an alternative approach. Write the normal equations as

Cτ̂ = b        (2.4.19)

so that C = X′X and b = X′y. Then a (non-unique) solution is given by

τ̂ = C⁻b        (2.4.20)

where C⁻ is a generalized inverse of C, that is a matrix satisfying

CC⁻C = C        (2.4.21)

To show (2.4.20), note that premultiplying (2.4.19) by CC⁻ we find that CC⁻b = b, and on substituting (2.4.20) into (2.4.19) we see that indeed the former is a solution of the latter. Now, in terms of the original matrices,

τ̂ = (X′X)⁻X′y        (2.4.22)

A particularly nice generalised inverse is the so-called Moore-Penrose generalised inverse. This is written as C⁺ and, in addition to satisfying (2.4.21), also satisfies

C⁺CC⁺ = C⁺

In this case

var(τ̂) = σ²(X′X)⁺  and  P_X = X(X′X)⁺X′

are results that are similar to the full-rank case. Lastly, note that in the full rank case the projection matrix P_X = X(X′X)⁻¹X′ is a Moore-Penrose inverse of itself. This can be useful in the analysis of variance for multi-stratum experiments, discussed in chapter 3.

2.5 A glimpse at REML

It was indicated above that the maximum likelihood estimate σ̂² given by (2.3.14) did not allow for estimation of τ. We can obtain an estimate that allows for the estimation of τ as follows.

If K is the matrix of eigenvectors of P_X, we transform y to K′y (a non-singular and hence one-to-one transformation). Noting that K₂′X = 0, the distribution of the transformed data is

\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} K_1'y \\ K_2'y \end{bmatrix} \sim N\!\left( \begin{bmatrix} K_1'X\tau \\ 0 \end{bmatrix},\ \sigma^2 \begin{bmatrix} I_p & 0 \\ 0 & I_{n-p} \end{bmatrix} \right) \qquad (2.5.23)
\]

The two components of the transformation, y₁ and y₂, are independent and thus (2.5.23) provides two linear models. The first linear model has a design matrix K₁′X that is square and non-singular, and hence transforms τ to a new parameter vector τ*, say, of the same length. The vector y₁ provides the only information on τ* under the transformation and hence is the basis of estimation of τ. The parameter vector matches the data vector y₁ in length, so that once τ has been estimated, y₁ cannot be used for estimation of σ² as there is no further information available. In fact, using (2.3.11) with the design matrix X replaced by K₁′X, the independence of y₁ and y₂ implies that we can estimate τ by

\[
\begin{aligned}
\hat{\tau} &= (X'K_1K_1'X)^{-1}X'K_1K_1'y \\
&= (X'P_XX)^{-1}X'P_Xy \\
&= (X'X)^{-1}X'y
\end{aligned}
\]

which actually reproduces (2.3.11), as in hindsight it should.

The second linear model has a known (zero) mean and depends only on σ². Because y₁ cannot be used to estimate σ², we use the marginal distribution of y₂ for estimation of σ². The log-likelihood of y₂ is given by

\[
\ell(\sigma^2; y_2) = -\frac{n-p}{2}\log(2\pi) - \frac{n-p}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}y_2'y_2
\]

and hence the estimate of σ² based on this marginal likelihood is given by (having differentiated the marginal log-likelihood with respect to σ² and equated to zero)

\[
\begin{aligned}
s^2 &= \frac{1}{n-p}y_2'y_2 \\
&= \frac{1}{n-p}y'K_2K_2'y \\
&= \frac{1}{n-p}y'(I_n - P_X)y \\
&= \frac{1}{n-p}R \qquad (2.5.24)
\end{aligned}
\]

rather than the form in (2.3.14). The estimator in (2.5.24) is unbiased. This is the standard estimate corrected for estimation of τ using a simple degrees of freedom adjustment.

The above approach for estimating σ², based on the marginal distribution of y₂, is an example of residual maximum likelihood (REML) estimation, discussed originally by Patterson and Thompson (1971). The basic idea is to partition the likelihood into components, which in general may not be independent, with one component (here y₁) for estimation of τ and the other component (y₂), whose distribution does not depend on τ, for estimation of σ². A more complex decomposition is presented in chapter 4 and a thorough derivation of REML is given in chapter 5.
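The contrast between the two variance estimates can be checked numerically. The following sketch (not part of the original text) computes the ML estimate R/n and the REML estimate R/(n − p) for the Table 2.1 plant data, together with the projection matrix properties used above.

```python
# Sketch: the ML estimate R/n of (2.3.14) versus the REML estimate
# R/(n - p) of (2.5.24) for the plant data, plus a check of the
# projection matrix properties of section 2.3.
import numpy as np

hc  = np.array([57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5])
mav = np.array([55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9])
y = np.concatenate([hc, mav])
X = np.kron(np.eye(2), np.ones((7, 1)))     # full column rank, p = 2
n, p = X.shape

PX = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix P_X
assert np.allclose(PX @ PX, PX)             # idempotent
assert np.allclose(PX @ X, X)               # P_X X = X
assert np.isclose(np.trace(PX), p)          # p unit eigenvalues

R = y @ (np.eye(n) - PX) @ y                # residual sum of squares
print(R / n)        # ML estimate of sigma^2 (biased)
print(R / (n - p))  # REML estimate s^2 (unbiased), approx 542.2 here
```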

2.6 Factors and model specification

All linear models can initially be expressed in terms of means. For example, in (2.2.1) we have specified the expected value for the ith treatment by the mean τ_i. In this case the design matrix X might be called the replication matrix, as it simply indicates which mean value an experimental unit takes on. In many situations the mean value depends on several factors, and the model is reparameterised to reflect the factors of interest. For example, comparative inferences concerning the levels of a factor may be of prime interest (as is the case in our example). This subsequently leads to modelling the mean in terms of the factors, and in particular to introducing a transformation from the mean to a new set of parameters. In the linear models situation the transformation is linear, so that the replication matrix is then multiplied by the matrix of this linear transformation to provide a new design matrix. In our simple example, the mean may be reparameterised as

τ_i = µ + τ_i*        (2.6.25)

The τ_i* in (2.6.25) represent the deviations from µ. This specification contains redundancies or intrinsic aliasing of effects, because we have 3 parameters µ, τ_1*, τ_2* to model the means for the 2 treatments, namely τ_1 and τ_2. To overcome this redundancy a constraint must be applied. This over-parameterisation occurs whenever factors are present in the model (as fixed effects), so by convention constraints are applied to each factor term. These constraints are then applied to any term in the model which has this factor as an elemental component (for example in an interaction). The type of constraint that is applied affects the interpretation of the term µ.

Corner-point constraint: This is the standard constraint for linear models in GENSTAT 5. We set τ_1* = 0, and in terms of the parameterisation in (2.2.1) we have

τ_1 = µ + τ_1* = µ
τ_2 = µ + τ_2*

Thus µ is the mean for treatment 1, and τ_2* is the deviation of the mean of treatment 2 from the mean of treatment 1. This is easily generalised to more than 2 treatments, in which case the τ_i*, i = 2, ..., p, are deviations of treatment i from treatment 1. In addition, the set of effects in interaction terms have all effects corresponding to the first level of all component factors set to zero.

In terms of the vector of means, we have

\[
\begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_2^* \end{bmatrix}
\]

so that if T is the matrix of the transformation and τ* is the vector of µ and τ_2*, we have

τ = Tτ*        (2.6.26)

and the linear model becomes

y = XTτ* + e = X*τ* + e

which is a linear model with a new design matrix X* and non-redundant parameter vector τ*.

Zero-sum constraint: We set Σ_{i=1}^{2} τ_i* = 0. This ensures (for our simple case) that the intercept, µ, is the overall mean. This constraint has also been widely used in statistical packages to fit the linear model. The possible advantage of this constraint is that the intercept represents the overall mean. In the case of balanced data, the estimate of the intercept is the mean of all the data. However, this is not the case for unbalanced data. For this constraint, note τ_1* = −τ_2*, and the matrix T of (2.6.26) is

\[
T = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \qquad (2.6.27)
\]

It is also useful for the development in chapter 3 to provide an alternative approach given by Nelder (1965b). Thus we write

τ_i = τ̄. + (τ_i − τ̄.)

where τ̄. = ½ 1_2′τ is the mean of the τ_i. Notice that the zero-sum constraint is automatically incorporated for the terms (τ_i − τ̄.). In vector form,

\[
\begin{aligned}
\tau &= \mathbf{1}_2\bar{\tau}. + (\tau - \mathbf{1}_2\bar{\tau}.) \\
&= \tfrac{1}{2}\mathbf{1}_2\mathbf{1}_2'\tau + (I_2 - \tfrac{1}{2}\mathbf{1}_2\mathbf{1}_2')\tau \\
&= \mathbf{1}_2(\mathbf{1}_2'\mathbf{1}_2)^{-1}\mathbf{1}_2'\tau + (I_2 - \mathbf{1}_2(\mathbf{1}_2'\mathbf{1}_2)^{-1}\mathbf{1}_2')\tau \\
&= T_1\tau + T_2\tau \qquad (2.6.28)
\end{aligned}
\]

Notice that T₁ and T₂ are orthogonal projection matrices, T₁ onto the vector 1₂ and T₂ onto the vector orthogonal to 1₂, namely [−1 1]′, and that these two vectors make up the matrix T in (2.6.27). In addition T₁ + T₂ = I₂, so that the decomposition of τ is into complete components. Thus

\[
T_1\tau = \mathbf{1}_2\bar{\tau}. \equiv \mathbf{1}_2\mu, \qquad T_2\tau = \tfrac{1}{2}\begin{bmatrix} -1 \\ 1 \end{bmatrix}(\tau_2 - \tau_1) = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\tau_2^*
\]

This type of decomposition into orthogonal parts occurs not only in the mean structure but also in the covariance structures to be discussed in chapter 3, and it is the relationship between these mean and covariance structures that is important in analysis of variance.

Conventions in this book: In this book we assume that the design matrix X has full column rank. This often necessitates the imposition of constraints on the vector τ of fixed effects. The method of estimation of the model and subsequent prediction of effects of interest is not affected by the type of constraint used, but of course the interpretation of the parameters or effects will depend on the constraints applied.
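To close this section, a small numerical sketch (not from the original text) of the constraint matrices just described; the values of µ and τ₂* are illustrative only.

```python
# Sketch: the constraint matrices of section 2.6 for two treatments,
# checked with numpy; mu and tau2* take illustrative values only.
import numpy as np

T_corner = np.array([[1.0, 0.0],       # tau = T @ [mu, tau2*]'
                     [1.0, 1.0]])
T_zero   = np.array([[1.0, -1.0],      # columns 1_2 and [-1, 1]' of (2.6.27)
                     [1.0,  1.0]])

# T1 and T2 of (2.6.28): orthogonal projections onto 1_2 and its
# orthogonal complement.
one2 = np.ones((2, 1))
T1 = one2 @ one2.T / 2.0
T2 = np.eye(2) - T1
assert np.allclose(T1 @ T1, T1) and np.allclose(T2 @ T2, T2)
assert np.allclose(T1 @ T2, np.zeros((2, 2)))
assert np.allclose(T1 + T2, np.eye(2))

mu, tau2_star = 10.0, 2.0                        # illustrative values
tau = T_corner @ np.array([mu, tau2_star])       # treatment means
print(tau)                                       # [10., 12.]
```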

2.7 Tests of hypotheses

Before leaving the introduction to linear models we consider tests of hypotheses. There are basically three approaches (usually based on large sample theory) to deriving tests. The three methods are the likelihood ratio test, the Wald test and the score test (?).

We consider the hypothesis H₀ : L′τ = l for given L (p × r) and l (r × 1). In the example, the hypothesis of interest is simply the equality of treatment means, ie τ₁ = τ₂, or more appropriately for our general formulation, that τ₁ − τ₂ = 0, so that L′ = [1 −1] and l = 0.

Suppose M is a p × (p − r) matrix such that the matrix T = [M L] is non-singular. We can choose M such that L′M = 0. The linear model (2.2.3) can be written as

y = X(T′)⁻¹T′τ + e = X*τ* + e        (2.7.29)

Now

\[
T'\tau = \begin{bmatrix} M'\tau \\ L'\tau \end{bmatrix} = \begin{bmatrix} \tau_0^* \\ \tau_1^* \end{bmatrix}
\]

and our null hypothesis becomes H₀ : τ₁* = l. This demonstrates that if we partition τ into two components τ₀ and τ₁, without loss of generality we can consider the test of H₀ : τ₁ = l (and hence avoid messy notation). We therefore consider tests of this hypothesis.

Note firstly that under the partition of τ, X can be partitioned to yield the linear model

y = X₀τ₀ + X₁τ₁ + e        (2.7.30)

The estimation of the partitioned τ can be carried out using partitioned matrices and the normal equations (2.3.10). The solutions can be written in a number of ways, but the useful form for the derivation of the likelihood ratio test is

\[
\begin{aligned}
\hat{\tau}_0 &= (X_0'X_0)^{-1}X_0'(y - X_1\hat{\tau}_1) \qquad (2.7.31) \\
\hat{\tau}_1 &= \left[ X_1'(I_n - P_{X_0})X_1 \right]^{-1} X_1'(I_n - P_{X_0})y \\
\hat{\sigma}^2 &= \frac{1}{n}(y - X_1\hat{\tau}_1)'(I_n - P_{X_0})(y - X_1\hat{\tau}_1) = \frac{1}{n}R
\end{aligned}
\]

where P_{X₀} is the orthogonal projection matrix onto the column space of X₀.

Testing H₀ : τ₁ = l can be carried out by considering the sequence of models

y = X₀τ₀ + X₁l + e
y = X₀τ₀ + X₁τ₁ + e

where X₀ is n × (p − r) and the column space of X contains the column space of X₀ (Searle, 1971). We assume that both X and X₀ are of full rank. Under H₀, the linear model becomes

y ∼ N(X₀τ₀ + X₁l, σ₀²I_n)        (2.7.32)

and the maximum likelihood estimates are

\[
\begin{aligned}
\hat{\tau}_0^0 &= (X_0'X_0)^{-1}X_0'(y - X_1 l) \qquad (2.7.33) \\
\hat{\sigma}_0^2 &= \frac{1}{n}(y - X_0\hat{\tau}_0^0 - X_1 l)'(y - X_0\hat{\tau}_0^0 - X_1 l) \\
&= \frac{1}{n}(y - X_1 l)'(I_n - P_{X_0})(y - X_1 l) \qquad (2.7.34) \\
&= \frac{1}{n}R_0
\end{aligned}
\]

where R₀ is the residual sum of squares under H₀.

The generalized likelihood ratio test is a standard approach for tests of hypotheses for models fitted using likelihood methods (?). For our hypothesis the generalized likelihood ratio statistic is defined as

\[
\Lambda = \frac{L(\hat{\tau}_0^0, \hat{\sigma}_0^2; y)}{L(\hat{\tau}, \hat{\sigma}^2; y)}
\]

which is the maximized likelihood under H₀ divided by the maximized likelihood for the full model. Under the normality assumptions of section 2.2 the statistic is given by

\[
\begin{aligned}
\Lambda &= \left( \frac{\hat{\sigma}_0^2}{\hat{\sigma}^2} \right)^{-n/2} \\
&= \left( \frac{R_0}{R} \right)^{-n/2} \\
&= \left( \frac{R + (R_0 - R)}{R} \right)^{-n/2} \\
&= \left( 1 + \frac{R_0 - R}{R} \right)^{-n/2} \\
&= \left( 1 + \frac{r}{n-p}F \right)^{-n/2}
\end{aligned}
\]

where

\[
F = \frac{(R_0 - R)/r}{R/(n-p)}
\]

is the standard F-test for the above null hypothesis. It is easy to show that

\[
F = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{r s^2}
\]

This form will arise below. Several things need to be noted here. These are

• the asymptotic distribution of −2 log(Λ) is χ²(r) under H₀;
• using a likelihood ratio test we reject H₀ if −2 log(Λ) > χ²_{1−α}(r);
• the likelihood ratio statistic Λ is a monotone decreasing function of F;
• F ∼ F(r, n − p) under H₀, where the numerator and denominator d.f. are presented within the brackets;
• using an F-test we reject H₀ if F > F_{1−α}(r, n − p).

This shows how the standard F-test relates to the likelihood ratio test for fixed effects under full maximum likelihood. The F-test is preferred in this context since, if the normality assumption holds, the test statistic has an exact distribution, whereas the null distribution for the likelihood ratio test is asymptotic, that is, based on a large sample approximation. Likelihood ratio methods play an important role in mixed models and are discussed in detail in chapter 6.

Before we turn to the remaining methods, note that an analysis of variance table is implicit in the above formulation. In fact, table 2.2 provides the logical decomposition.

The Wald statistic (?) is based on the distribution of the estimator of τ, namely (2.3.12). As we are interested in a test involving τ₁, we require the distribution of τ̂₁, namely

\[
\hat{\tau}_1 \sim N\!\left( \tau_1,\ \sigma^2 \left[ X_1'(I_n - P_{X_0})X_1 \right]^{-1} \right)
\]

Table 2.2 ANOVA decomposition of sums of squares

  Term              d.f.           S.S.      M.S.          F-test
  Difference        r              R₀ − R    (R₀ − R)/r    F
  Full model        n − p          R         R/(n − p)
  Null hypothesis   n − (p − r)    R₀

Under H₀, τ₁ = l, and the Wald statistic is given by

\[
W = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{\hat{\sigma}^2} = \frac{nr}{n-p}F
\]

which is a simple monotone function of the F-statistic. Thus the Wald test is equivalent to the likelihood ratio test in this case.

The score test is given by

S = U₀′ I₀⁻¹ U₀

where U₀ and I₀ are the score vector and expected information matrix derived under the full model but evaluated under H₀. Thus, at τ̂₀⁰ and σ̂₀²,

\[
U_0 = \begin{bmatrix} \frac{1}{\hat{\sigma}_0^2}\left( X'y - X'X \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right) \\ 0 \end{bmatrix} \qquad (2.7.35)
\]

while the expected information matrix is given by (2.3.16) with σ² evaluated at σ̂₀². The score statistic is then

\[
\begin{aligned}
S &= \frac{1}{\hat{\sigma}_0^2}\left( X'y - X'X \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right)' (X'X)^{-1} \left( X'y - X'X \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right) \\
&= \frac{1}{\hat{\sigma}_0^2} \left( \begin{bmatrix} \hat{\tau}_0 \\ \hat{\tau}_1 \end{bmatrix} - \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right)' X'X \left( \begin{bmatrix} \hat{\tau}_0 \\ \hat{\tau}_1 \end{bmatrix} - \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right)
\end{aligned}
\]

Using (2.7.31) and (2.7.33) we have

\[
\begin{bmatrix} \hat{\tau}_0 \\ \hat{\tau}_1 \end{bmatrix} - \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} = \begin{bmatrix} -(X_0'X_0)^{-1}X_0'X_1 \\ I_r \end{bmatrix} (\hat{\tau}_1 - l)
\]

and replacing X′X by its partitioned form

\[
\begin{bmatrix} X_0'X_0 & X_0'X_1 \\ X_1'X_0 & X_1'X_1 \end{bmatrix}
\]

the score statistic can be shown to be

\[
S = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{\hat{\sigma}_0^2} = \frac{nrF}{n - p + rF}
\]

which is a monotone increasing function of the F-statistic. Thus, for the linear model and under H₀, the score statistic is equivalent to the F-statistic. For linear models, all three methods lead to the standard F-test for inference concerning the vector of fixed effects, τ.

For completeness we present the F-statistic for the original test of H₀ : L′τ = l. The statistic is given by

\[
F = \frac{(L'\hat{\tau} - l)'\left[ L'(X'X)^{-1}L \right]^{-1}(L'\hat{\tau} - l)}{r s^2}
\]

and under H₀, F ∼ F(r, n − p).

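Before moving to the example, a small numerical sketch (not from the original text) of the monotone relations just derived; n, p, r and F take illustrative values only.

```python
# Sketch: the monotone relations among -2 log(Lambda), the Wald
# statistic W and the score statistic S derived above, for the
# illustrative values n = 14, p = 2, r = 1 and a chosen F.
import numpy as np

n, p, r = 14, 2, 1
F = 6.64                                       # illustrative F value

neg2logLam = n * np.log(1 + r * F / (n - p))   # -2 log(Lambda)
W = n * r * F / (n - p)                        # Wald statistic
S = n * r * F / (n - p + r * F)                # score statistic

# Each statistic increases with F, so the three tests order the
# evidence identically; asymptotically each is chi-squared(r).
print(neg2logLam, W, S)
```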
2.8 Analysis of plant growth example

For this example the two models (in symbolic form) to be fitted are

y ∼ mu + units
y ∼ treatment + units

We note that the latter model could also be written as

y ∼ mu + treatment + units

where constraints are now necessary. Thus, in this example, excluding the treatment term from the linear model allows us to test the hypothesis that the treatment mean effects are equal.

To complete this chapter we present the analysis of variance table for this plant growth data-set. The results of the analysis are summarized in table 2.3. The total (mean corrected) sum of squares has been partitioned into a sum of squares due to treatments (R₀ − R) and a within treatment sum of squares (R). In matrix terms these are given by

R = y′(I_n − P_X)y
R₀ = y′(I_n − P_{X₀})y

Table 2.3 contains an entry for the between treatments and within treatments sources of variation as well as the total variation (about the mean), together with the decomposition in terms of R and R₀, with a column for the degrees of freedom (d.f.), the sum of squares (S.S.), the mean squares (M.S. = S.S./d.f.) and the F-test (the ratio of the M.S. for between treatments to the M.S. for within treatments). That is,

\[
F = \frac{(R_0 - R)/r}{R/(n-p)}
\]

where n = 14, p = 2 and r = 1. The F-test is significant (p < .05), hence we reject the hypothesis that the treatment means are equal. The least squares estimates of the fixed effects and their standard errors are presented in table 2.4. These are defined using the corner-point constraints.

Table 2.3 Decomposition of sums of squares for plant growth data

  Term                         d.f.   S.S.      M.S.     F-test
  Between treatments (R₀ − R)  1      3600.0    3600.0   6.640
  Within treatments (R)        12     6506.1    542.2
  Total (R₀)                   13     10106.1

Table 2.4 Summary of fixed effects for plant growth example

  Effect   Estimate   S.E.
  τ₁*      0.00
  τ₂*      -32.07     12.45
  µ        101.21     8.80
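As a check (not part of the original text), the following numpy/scipy sketch reproduces tables 2.3 and 2.4 directly from the Table 2.1 data.

```python
# Sketch: reproducing the ANOVA of table 2.3 and the corner-point
# estimates of table 2.4 from the Table 2.1 data with numpy/scipy.
import numpy as np
from scipy import stats

hc  = np.array([57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5])
mav = np.array([55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9])
y = np.concatenate([hc, mav])
n, p, r = 14, 2, 1

R0 = np.sum((y - y.mean()) ** 2)    # total (mean corrected) SS
R  = np.sum((hc - hc.mean()) ** 2) + np.sum((mav - mav.mean()) ** 2)

F = ((R0 - R) / r) / (R / (n - p))
print(R0 - R, R, F)                 # approx 3600.0, 6506.1, 6.64
print(stats.f.sf(F, r, n - p))      # p-value, approx 0.024

# Corner-point estimates: mu = mean(HC); tau2* = mean(MAV) - mean(HC).
s2 = R / (n - p)
print(hc.mean(), np.sqrt(s2 / 7))                   # approx 101.21, 8.80
print(mav.mean() - hc.mean(), np.sqrt(2 * s2 / 7))  # approx -32.07, 12.45
```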

2.9 Summary

In this chapter, results for the linear model have been presented to

• provide the vector-matrix setting for chapters to follow,
• revise basic results on estimation in the linear model,
• introduce Residual Maximum Likelihood (REML) for estimation of variance parameters,
• derive the likelihood ratio test, the Wald test and the score test, and show their equivalence to the standard F-test in the analysis of variance.

CHAPTER 3

Analysis of designed experiments

We begin the study of mixed models by considering the analysis of designed experiments. Designed experiments are highly structured, and this structure can be utilised to provide an approach to analysis and to set the foundations for more complex developments.

The fundamental principles underpinning analysis of variance, namely projections, strata and orthogonal block structures, are developed in this chapter in the manner of Nelder (1965a,b). There are three components in the development, namely the covariance structures generated by orthogonal block structures, the decomposition of treatment effects into components of interest (this decomposition usually being the aim of the experiment), and lastly the interplay between the block and treatment structures.

For the most part, the models to be discussed in this chapter have a simple random effects structure and the treatment effects appear in only one part of the analysis of variance table. Estimation of variances allied to these random effects is very simple and involves equating expected (residual) mean squares to their observed values in an analysis of variance table. These are Residual Maximum Likelihood (REML) estimates in these simple cases.

When the same treatment effects appear in several independent parts of the analysis of variance, as for example in incomplete block designs, the correspondence between analysis of variance and REML estimates of variances is broken. Efficient estimation of treatment effects and variance components in such situations is via REML, and this extends to the analysis of unbalanced data to be discussed later in the book. The use of analysis of variance tables is to be encouraged in unbalanced situations in order to define appropriate models. Thus the specialized nature of this chapter is a springboard to more complex situations.

3.1 One-way classification

3.1.1 Motivation

The data to be considered in this section come from a larger study conducted by Dr J Panozzo, in which the aim is to assess the malting quality of a number of barley varieties. An important trait in determining the malt quality of barley is the diastatic power (DP). Samples of barley grain are put through a malting process using a micro-malter and DP is measured on these samples. The micro-malter holds 80 canisters in a 16 × 5 array. Often there are more than 80 samples to be malted, so that sequential runs of the micro-malter must be undertaken. In the study 10 sequential malt runs were required. In order to assess variation between malt runs, 4 of the 80 canisters in each run were randomly assigned to a control barley. Each of these control canisters was filled with a subsample from a uniform batch of barley grain (from a single variety). The DP data for these control samples in each malt run are presented in table 3.1.

Table 3.1 Diastatic power (DP) for control samples in 10 malt runs

  Malt run   Diastatic power
  1          10.0    9.9   10.1   10.6
  2           9.1   10.3   10.0    9.0
  3          11.5   11.3   11.6   11.3
  4          10.0    9.6   10.6   10.8
  5          10.0    9.2   10.6    9.2
  6          10.0   10.9   10.9   10.1
  7           9.1    9.1    9.3    9.0
  8           9.0    8.3    9.9   10.0
  9          10.3    9.0    9.0    9.7
  10          9.1    9.1    8.9    9.0


Figure 3.1 Malt run data: dotplot of the control samples from each malt run

Our aim is to quantify the variation between and within malt runs using the control sample data, and in particular to see if the between malt run variation is "large". We begin with a preliminary look at the data using the dotplot (S-PLUS, Insightful Corp., 2000) given in Figure 3.1. The dotplot suggests that there is variation between malt runs, but in addition that within malt run variation may also be large.

A simple statistical model which allows for both between and within malt run variation is

y_ij = µ + u_i + e_ij        (3.1.1)

where y_ij is the observed DP, µ is the mean DP across all malt runs, u_i represents the ith malt run effect, i = 1, 2, ..., 10, and e_ij ∼ N(0, σ²), j = 1, 2, ..., 4. This has the same form as (2.2.1). However, there is a major difference in the aim of this analysis, namely that a measure of variation across malt runs is required. Rather than assume the malt run effects are fixed, and following the principles discussed in chapter 1, it is appropriate to assume the effect is random. Thus we assume u_i ∼ N(0, σ²_u), i = 1, ..., 10. In addition we assume the malt run effects (u_i) and the residual errors (e_ij) are statistically independent. The parameters to be estimated are now µ (the mean DP over all malt runs), σ²_u and σ². In contrast to (2.2.1), the malt run effects are random variables and therefore do not have fixed values which can be estimated. We return to this issue in chapter 5.

If y is the vector of observations (ordered as samples within malt run), n = 40 is the sample size, b = 10 is the number of malt runs, and r = 4 is the number of replicate controls per malt run, (3.1.1) can be written as

    y = 1_n µ + Z u + e    (3.1.2)

where 1_n is a vector of n ones, Z^{n×b} = I_b ⊗ 1_r is a design matrix, u ∼ N(0, σ_u^2 I_b) and e ∼ N(0, σ^2 I_n); ⊗ is the kronecker product operator (??). Note that the dimension of the vector of ones and of the identity matrix is given as a subscript. This is not to be confused with the convention of presenting the dimensionality of a general matrix or vector as a superscript. In the notation of chapter 2, we can also write this model symbolically as

    y ∼ mu + maltrun + units

where maltrun is a factor with 10 levels, and, following on from the argument above, it is a random factor and hence is presented in bold (by the convention established in chapter 1). Under these assumptions the marginal distribution of y is

    y ∼ N(1_n µ, σ_u^2 Z Z^T + σ^2 I_n)    (3.1.3)

The aim is to estimate σ_u^2 and σ^2 in order to gauge the relative size of between and within malt run variation. As σ_u^2 is a variance, we constrain it to be positive. In some applications discussed in this book, this non-negativity constraint can be relaxed.

The model given by (3.1.2) specifies a simple linear mixed model. It has a random component Zu in addition to the residual error random effect. The marginal distribution of y is given by (3.1.3) and has a structured variance matrix which we will denote by

    V = σ_u^2 Z Z^T + σ^2 I_n
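As a concrete sketch in R (the S language; the document uses S-PLUS for graphics), the design matrix and the marginal variance matrix for the malt run layout can be constructed directly. The variance component values below are illustrative only, not estimates from the data.

## Design matrix Z = I_b (x) 1_r and marginal variance for the malt run model (3.1.2).
b <- 10; r <- 4; n <- b * r
Z <- kronecker(diag(b), matrix(1, nrow = r, ncol = 1))   # 40 x 10 design matrix
sigma2u <- 0.5    # illustrative value only
sigma2  <- 0.25   # illustrative value only
V <- sigma2u * tcrossprod(Z) + sigma2 * diag(n)          # tcrossprod(Z) = Z %*% t(Z)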

3.1.2 Projections, strata and analysis of variance

The one-way classification is an example of a designed experiment which possesses strata of variation. The concept of strata has been considered primarily, for example by Nelder (1965a,b), for the analysis of designed experiments with orthogonal block structure. We shall present the formal definition of strata later in this section, but for the moment we introduce the concepts by considering the linear mixed model (and associated distributional assumptions for the random components) for the one-way classification as given by (3.1.2) and (3.1.3). If u was a fixed effect, that is a vector of parameters, we could proceed as in chapter 2, and partition malt run effects using the approach leading to (2.6.28). Thus malt run effects can be decomposed into a complete orthogonal set, defined by (the notation is changed from chapter 2 because of the change in the status of the factor from fixed to random)

    P_1 = Z (Z^T Z)^{-1} Z^T    (3.1.4)
    P_2 = I_n − Z (Z^T Z)^{-1} Z^T    (3.1.5)

and the matrices P_1 and P_2 are orthogonal projections onto the plane defined by Z and the plane orthogonal to Z respectively; these projections correspond to the between groups and the within groups effects respectively. Thus for i = 1, 2 and j ≠ i,

    P_i^T = P_i,   P_i^2 = P_i,   P_i P_j = 0,   P_1 + P_2 = I_n

The model in which u_i is random moves the design matrix Z to the variance structure. While group effects no longer appear explicitly, they are implied by the form of V. The decomposition using (3.1.4) and (3.1.5) can still be applied, but the impact is to separate the data into components, each of which can be modelled using a simple linear model, that is, a model with a single random term with constant variance. For the one-way classification this can be achieved as follows. Firstly note that Z^T Z = r I_b. Then

    V = r σ_u^2 Z (Z^T Z)^{-1} Z^T + σ^2 I_n
      = r σ_u^2 P_Z + σ^2 I_n
      = (σ^2 + r σ_u^2) P_Z + σ^2 (I_n − P_Z)

      = ξ_1 P_1 + ξ_2 P_2    (3.1.6)

Note that

    ξ_1 = r σ_u^2 + σ^2,    (3.1.7)
    ξ_2 = σ^2    (3.1.8)

Thus the variance matrix contains the decomposition into between and within group components, together with two weights or variances ξ_1 and ξ_2 which are functions of the original variances.

Since each P_i is a projection matrix we can proceed as in chapter 2 and write

    P_i = K_i K_i^T

where the K_i are matrices of size n × b and n × (n − b) for i = 1, 2 respectively. In addition, K_1^T K_2 = 0 and K_i^T K_i = I_{ν_i}, where ν_i is the rank of K_i. Since K_i is of full column rank, the rank of K_i equals the number of columns of K_i, that is b and n − b for i = 1, 2 respectively.

For the estimation of (µ, ξ_1, ξ_2), i.e. (µ, σ_u^2, σ^2), we partition the data into two parts that reflect the between and within malt run components. To do so consider the transformation of the data vector y to K^T y where K = [K_1 K_2] is a non-singular matrix.

Since P_2 1_n = 0, we have E(y_2) = 0_{n−b}. Further, for i = 1, 2,

    var(y_i) = K_i^T (ξ_1 P_1 + ξ_2 P_2) K_i
             = ξ_i K_i^T P_i K_i

             = ξ_i I_{ν_i}

and

    cov(y_1, y_2) = K_1^T (ξ_1 P_1 + ξ_2 P_2) K_2 = 0

Thus,

    K^T y = (y_1; y_2) ∼ N( (K_1^T 1_n µ; 0), diag(ξ_1 I_b, ξ_2 I_{n−b}) )    (3.1.9)

and we have two independent components, one depending on µ and ξ_1 and the other depending only on ξ_2. These two independent distributions define the two strata in our experiment, the between groups stratum specified by y_1 and the within groups stratum specified by y_2. The variance parameters ξ_i, i = 1, 2 are known as stratum variances. In essence we have reduced the linear mixed model to two independent linear models, namely

    y_1 ∼ N(K_1^T 1_n µ, ξ_1 I_b)

    y_2 ∼ N(0, ξ_2 I_{n−b})

Allied to this partition into strata is the decomposition of the total sum of squares as a sum of the stratum sums of squares. This implies no loss of information. To see this consider

    Σ_{i=1}^{2} y_i^T y_i = Σ_{i=1}^{2} y^T K_i K_i^T y
                          = y^T (Σ_{i=1}^{2} P_i) y
                          = y^T y

as required.
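These identities are easily checked numerically. The following R sketch enters the data of table 3.1 (ordered as samples within runs), builds P_1 and P_2, and confirms that the two stratum sums of squares recover the total sum of squares.

## Projections and strata for the malt run data (table 3.1).
dp <- c(10.0,  9.9, 10.1, 10.6,   9.1, 10.3, 10.0,  9.0,
        11.5, 11.3, 11.6, 11.3,  10.0,  9.6, 10.6, 10.8,
        10.0,  9.2, 10.6,  9.2,  10.0, 10.9, 10.9, 10.1,
         9.1,  9.1,  9.3,  9.0,   9.0,  8.3,  9.9, 10.0,
        10.3,  9.0,  9.0,  9.7,   9.1,  9.1,  8.9,  9.0)
b <- 10; r <- 4; n <- b * r
Z  <- kronecker(diag(b), matrix(1, r, 1))
P1 <- Z %*% solve(crossprod(Z)) %*% t(Z)   # projection onto the malt run space
P2 <- diag(n) - P1                         # orthogonal complement
all.equal(P1 %*% P1, P1)                   # idempotent
max(abs(P1 %*% P2))                        # orthogonal: effectively zero
## The two stratum sums of squares add to the total sum of squares:
c(between = drop(t(dp) %*% P1 %*% dp),
  within  = drop(t(dp) %*% P2 %*% dp),
  total   = sum(dp^2))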

It is clear that the overall mean can only be estimated from y_1 (i.e. in the between groups stratum). The least squares (and maximum likelihood) estimate using (2.3.11) is given by

    µ̂ = (1_n^T K_1 K_1^T 1_n)^{-1} 1_n^T K_1 K_1^T y = 1_n^T y / n = ȳ

The residual sum of squares for the between groups stratum is therefore given by

    y_1^T y_1 − y_1^T K_1^T 1_n (1_n^T P_1 1_n)^{-1} 1_n^T K_1 y_1 = y^T P_1 y − y^T 1_n (1_n^T 1_n)^{-1} 1_n^T y
                                                                  = y^T P_1 y − y^T P_0 y

where P_0 is the projection matrix for the overall mean. An analysis of variance (ANOVA) table, as given in table 3.2, can be constructed based on the two strata and the decomposition of the sums of squares in the between groups stratum into the sum of squares due to the overall mean and a residual. There is no decomposition of the within groups stratum because there are no fixed effects in the linear model for y_2. This is a simple example of a multi-stratum experiment, without treatment factors.

The ANOVA estimates of ξ_1 and ξ_2 can be obtained by equating the residual mean squares in table 3.2 to their expected values. Hence,

    ξ̂_1 = y^T (P_1 − P_0) y / (b − 1)
    ξ̂_2 = y^T P_2 y / (n − b)    (3.1.10)

and σ̂^2 and σ̂_u^2 can be found using equations (3.1.7) and (3.1.8), namely

    σ̂^2 = ξ̂_2
    σ̂_u^2 = (ξ̂_1 − ξ̂_2) / r

The ANOVA table for the malt run data is presented in table 3.3. The estimates of ξ_1 and ξ_2 are ξ̂_1 = 2.145 and ξ̂_2 = 0.261. Using the expressions above, or (3.1.7) and (3.1.8), the ANOVA estimates of the variance components are then σ̂_u^2 = 0.4708 and σ̂^2 = 0.2614 respectively. We see that the estimated between malt run variance is approximately 1.8 times the estimated residual variance, indicating the need for careful design protocols to account for between malt run variation.

Table 3.2 Analysis of variance for the one-way classification

Strata/Decomposition    d.f.     S.S.                 Expectation of M.S.
Between groups          b        y^T P_1 y            -
  Mean                  1        y^T P_0 y            n µ^2 + ξ_1
  Residual              b − 1    y^T (P_1 − P_0) y    ξ_1
Within groups           n − b    y^T P_2 y            -
  Residual              n − b    y^T P_2 y            ξ_2


Table 3.3 Analysis of variance for the malt run data

Strata/Decomposition    d.f.    S.S.        M.S.
Between groups          10
  Mean                   1      3888.784    3888.784
  Residual               9        19.296       2.145
Within groups           30
  Residual              30         7.840       0.261
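The calculations behind table 3.3 are easily reproduced. A minimal, self-contained R sketch computes the stratum residual sums of squares and the ANOVA estimates of the variance components for the data of table 3.1.

## ANOVA estimates of the variance components for the malt run data.
dp  <- c(10.0,  9.9, 10.1, 10.6,   9.1, 10.3, 10.0,  9.0,
         11.5, 11.3, 11.6, 11.3,  10.0,  9.6, 10.6, 10.8,
         10.0,  9.2, 10.6,  9.2,  10.0, 10.9, 10.9, 10.1,
          9.1,  9.1,  9.3,  9.0,   9.0,  8.3,  9.9, 10.0,
         10.3,  9.0,  9.0,  9.7,   9.1,  9.1,  8.9,  9.0)
run <- gl(10, 4)                            # 10 malt runs, 4 samples each
b <- 10; r <- 4; n <- b * r
run.means <- tapply(dp, run, mean)
s1 <- r * sum((run.means - mean(dp))^2)     # y'(P1 - P0)y, approx 19.30
s2 <- sum((dp - run.means[run])^2)          # y'P2 y,       approx  7.84
xi1 <- s1 / (b - 1)                         # approx 2.14
xi2 <- s2 / (n - b)                         # approx 0.26
c(sigma2 = xi2, sigma2u = (xi1 - xi2) / r)  # approx 0.261 and 0.471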

We now present a formal definition of strata.

Definition 3.1 A stratum is a maximal set of linear orthonormal functions of y which are independent and have equal variances.

When orthogonal strata exist, the total variation in the data can be partitioned accordingly and represented in an analysis of variance table, as presented above. We note that the strata presented above differ from the strata derived by a randomisation argument (as described in detail by Nelder (1965a), for example). The randomisation analysis for the one-way classification has three strata: the mean stratum, the between groups stratum and the within groups stratum. While our approach yields two strata, the between groups and within groups strata, we shall see that the mean stratum arises naturally with crossed classified random effects.

3.1.3 Another glimpse at REML

In the previous section the estimation of the variance parameters, namely the variance components (via the stratum variances), was achieved by equating stratum residual mean squares to their expectations. These estimates are the so-called ANOVA estimates of variance components (Searle et al., 1992).

We saw in chapter 2 that a residual likelihood could be found by eliminating the mean effects from the linear model. In the multi-stratum case, we have several linear models, and to develop an appropriate residual likelihood we need to eliminate the mean effects for each stratum.

In the one-way classification, these mean-free components are y_2 and that part of y_1 that has zero expectation. As y_1 follows a linear model, the arguments of section 2.5 apply. In particular, if X = K_1^T 1_n, we can define matrices K_11 and K_12 of full column rank such that K_12^T X = K_12^T K_1^T 1_n = 0_{b−1}. Let K_1^* = [K_11 K_12].

Now K_1 K_12 K_12^T K_1^T is an orthogonal projection matrix. In fact K_12 K_12^T projects orthogonally to X = K_1^T 1_n, and so equals (after some algebra)

    K_12 K_12^T = I_b − K_1^T P_0 K_1

where P_0 = 1_n (1_n^T 1_n)^{-1} 1_n^T. Thus

    K_1 K_12 K_12^T K_1^T = K_1 K_1^T − K_1 K_1^T P_0 K_1 K_1^T
                          = P_1 − P_1 P_0 P_1
                          = P_1 − P_0    (3.1.11)

This projection is merely the component of the between groups space orthogonal to the vector of ones (which specifies the unconditional mean of the linear mixed model). Thus the transformation of y to K_1^{*T} K_1^T y results in the complete decomposition

    (y_11; y_12; y_2) ∼ N( (K_11^T K_1^T 1_n µ; 0; 0), diag(ξ_1, ξ_1 I_{b−1}, ξ_2 I_{n−b}) )

It follows that the log-likelihood free of mean effects is based on the distribution of (y_12^T, y_2^T)^T and is given by

    ℓ_R = ℓ(ξ_1; y_12) + ℓ(ξ_2; y_2)
        = −(1/2) [ (b − 1) log ξ_1 + y_12^T y_12 / ξ_1 + (n − b) log ξ_2 + y_2^T y_2 / ξ_2 ]    (3.1.12)

Maximisation of (3.1.12) with respect to ξ_1 and ξ_2 leads to the unbiased ANOVA estimates as before. The log-likelihood in (3.1.12) is the so-called (log) residual likelihood, since it is the log-likelihood of a maximal set of contrasts which have zero expectation, that is, error or residual contrasts (Patterson and Thompson, 1971). If we define

    s_1 = y_12^T y_12 = y^T (P_1 − P_0) y
    s_2 = y_2^T y_2 = y^T P_2 y

then (3.1.12) can be written as (ignoring constants)

    ℓ_s = ℓ(ξ_1; s_1) + ℓ(ξ_2; s_2)
        = −(1/2) [ (b − 1) log ξ_1 + s_1 / ξ_1 + (n − b) log ξ_2 + s_2 / ξ_2 ]    (3.1.13)

Differentiation of (3.1.13) with respect to ξ_1 and ξ_2 gives

    ∂ℓ_s/∂ξ_1 = −(1/2) [ (b − 1)/ξ_1 − s_1/ξ_1^2 ]
    ∂ℓ_s/∂ξ_2 = −(1/2) [ (n − b)/ξ_2 − s_2/ξ_2^2 ]    (3.1.14)

and equating (3.1.14) to zero again gives the ANOVA estimates given in (3.1.10). The sums of squares s_1 and s_2 are independent and distributed as (scaled) chi-squared variates with degrees of freedom equal to the stratum residual degrees of freedom, namely b − 1 and n − b respectively. Thus "residuals" at various levels have been used to construct a likelihood free of fixed effects, and hence the name residual likelihood. The residual likelihood has therefore been constructed using the complete sufficient statistics for ξ_1 and ξ_2, namely s_1 and s_2.
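Equivalently, (3.1.13) can be maximised numerically. A small R sketch, using the stratum residual sums of squares and degrees of freedom from table 3.3, confirms that the maximiser coincides with the ANOVA estimates s_1/(b − 1) and s_2/(n − b).

## Numerical maximisation of the residual log-likelihood (3.1.13).
s1 <- 19.296; s2 <- 7.840      # stratum residual sums of squares (table 3.3)
df1 <- 9;     df2 <- 30        # stratum residual degrees of freedom
negll <- function(xi) 0.5 * (df1 * log(xi[1]) + s1 / xi[1] +
                             df2 * log(xi[2]) + s2 / xi[2])
fit <- optim(c(1, 1), negll, method = "L-BFGS-B", lower = 1e-6)
fit$par                        # approx (2.14, 0.261), i.e. (s1/df1, s2/df2)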

3.2 Randomised complete blocks

The data for this example has been kindly provided by Dr Maria Durban and involves a field experiment testing the yield performance of 272 barley varieties. The trial was laid out in 16 rows, each row consisting of 34 beds. There were two complete blocks: block one occupied rows 1 to 8, block two rows 9 to 16. Thus within each block there are 272 plots to which the varieties are allocated at random. The trial layout, in terms of the subdivision of the field into blocks, is presented in table 3.4. A simple model for the yield observed on block i = 1, 2, plot j = 1, ..., 272, is given by

    y_ij = τ_s(ij) + u_i + e_ij    (3.2.15)

where s(ij) represents the random assignment of a variety to plot j in block i, τ_k (k = 1, ..., 272) is the mean effect for the kth variety, and u_i is the effect for block i. Note that we have not included an overall mean or intercept term in (3.2.15). This avoids the unnecessary complication of placing constraints on the set of 272 variety effects, which for the present represent the variety mean levels. Here we assume the variety effects are fixed. Since the role of the blocks is to model the (co)variation in the data, we classify block effects as random. In this sense we are partitioning the total variance into σ_u^2 and σ^2, where σ_u^2 is the variance of the block effects and σ^2 is the residual variance. We assume further that the u_i and the e_ij are statistically independent.

In general, if there are t fixed (treatment) means and b blocks with a total

Table 3.4 Layout of the blocks in the field for the variety trial example

                            Row
Bed     1   2  ...  7   8   9  10  ...  15  16
 1      1   1  ...  1   1   2   2  ...   2   2
 2      1   1  ...  1   1   2   2  ...   2   2
...
33      1   1  ...  1   1   2   2  ...   2   2
34      1   1  ...  1   1   2   2  ...   2   2

of n = bt observations, then we can write (3.2.15) in matrix form as

    y = X τ + Z u + e    (3.2.16)

where X^{n×t} is the design matrix which assigns the treatment means to plots and Z^{n×b} is the design matrix for block effects. The data are assumed to be ordered by plot number within blocks, and so it follows that Z = I_b ⊗ 1_t. Hence

    V = var(y) = σ_u^2 Z Z^T + σ^2 I_n

The model is written symbolically as

    y ∼ variety + block + units

where variety and block are factors with 272 and 2 levels respectively.

3.2.1 Orthogonal projections and strata

The strata for this design are identical to those for the one-way classification, with blocks being equivalent to groups. The two strata are therefore the between blocks stratum and the within blocks stratum. As before, we transform the data vector y to K^T y where K = [K_1 K_2], and these matrices arise from P_1 and P_2. We have

    K^T y = (y_1; y_2) ∼ N( (K_1^T X τ; K_2^T X τ), diag(ξ_1 I_b, ξ_2 I_{b(t−1)}) )    (3.2.17)

where ξ_i, i = 1, 2 are the stratum variances, given by

    ξ_1 = t σ_u^2 + σ^2,    ξ_2 = σ^2

3.2.2 Estimation of treatment effects and analysis of variance

The variety effects are of interest in this example, and this introduces the first complication as far as estimation is concerned. The model (3.2.17) suggests that variety effects may be present in both strata or, in general, in several strata. The problem here, and in fact in general for more complicated designs, is to obtain efficient estimates of both τ and the ξ_i. Nelder (1965b) considers this problem and the following is largely (though not exactly) based on his development.

We begin by defining some matrices that will appear in the developments of this chapter. The matrices A_n and B_n are defined by

    A_n = 1_n (1_n^T 1_n)^{-1} 1_n^T
    B_n = I_n − 1_n (1_n^T 1_n)^{-1} 1_n^T = I_n − A_n

Both A_n and B_n are orthogonal projection matrices, they are orthogonal to each other, and their size (which will vary depending on the application) is given by the subscript (here n). A_n replaces each element of a vector by the mean of that vector, while B_n replaces each element by its deviation from the mean.

Equation (3.2.17) shows that the treatment effects may appear in both strata and hence in both linear models defined in that equation. We can therefore estimate part or all of τ in each stratum. Consider estimation of τ in the ith stratum. If τ̂_[i] denotes the estimate of τ using only the ith stratum, then using the normal equations (2.3.10) of chapter 2, we have

    X^T P_i X τ̂_[i] = X^T P_i y    (3.2.18)

The matrix X^T P_i X is the information matrix for the fixed effects in stratum i. It may not be of full rank, so that the solution is not unique, and hence obtaining a specific solution to (3.2.18) depends on finding an appropriate generalised inverse. We consider this below, but firstly we examine the form of the information matrices. As Z = I_b ⊗ 1_t, we have

    P_1 = I_b ⊗ A_t,    P_2 = I_b ⊗ B_t

An important property of the orthogonal projection matrices P_i is that they are invariant to permutations of units within blocks, because permuting unit values within blocks does not change P_i. This means that the rows of X can be reordered in the manipulations to follow. Thus we take a convenient form for X, namely X = 1_b ⊗ I_t. Using properties of kronecker products it is easy to show that

    X^T P_1 X = b A_t = b T_1,    X^T P_2 X = b B_t = b T_2    (3.2.19)

Note also that

    X^T P_1 = 1_b^T ⊗ A_t,    X^T P_2 = 1_b^T ⊗ B_t    (3.2.20)

In stratum 1, we therefore have the normal equations

    b A_t τ̂_[1] = (1_b^T ⊗ A_t) y

or

    A_t τ̂_[1] = (1/b)(1_b^T ⊗ A_t) y    (3.2.21)

Now

    A_t τ = τ̄· 1_t

so that the left-hand side of (3.2.21) shows that in stratum 1 we can only estimate the overall mean of the treatment effects. The right-hand side of (3.2.21) confirms this, as it equals

    ȳ·· 1_t

In stratum 2,

    b B_t τ̂_[2] = (1_b^T ⊗ B_t) y

and we can use properties of kronecker products to reduce the equations to

    B_t τ̂_[2] = (1/b)(1_b^T ⊗ B_t) y
              = B_t ((1/b) 1_b^T ⊗ I_t) y
              = B_t ȳ_t    (3.2.22)

where ȳ_t is the vector of treatment means calculated across the blocks.

    B_t τ = τ − τ̄· 1_t

so that the left-hand side of (3.2.22) shows that in stratum 2 we can only estimate the deviations of the treatment effects from their overall mean. The right-hand side of (3.2.22) equals

    ȳ_t − ȳ·· 1_t

the deviations of the treatment sample means about the overall sample mean.

Before we turn to important aspects of these results, note that we can solve the two sets of normal equations for strata 1 and 2 by using generalised inverses. Both A_t and B_t are generalised inverses of themselves, and hence using (2.4.22) we find

    τ̂_[1] = A_t (1/b)(1_b^T ⊗ A_t) y
          = (1/b)(1_b^T ⊗ A_t) y
          = ȳ·· 1_t

    τ̂_[2] = B_t B_t ȳ_t
          = B_t ȳ_t
          = ȳ_t − ȳ·· 1_t

which confirms the statements made regarding the effects that can be estimated from each stratum.

Now in chapter 2, we saw that a decomposition of the mean in a single

T 1 = At, T 2 = Bt T then it easy to see that T 1 + T 2 = It, T 1 T 2 = 0 and that T 1 and T 2 are orthogonal projection matrices. Thus the treatment effects have an or- thonormal decomposition in a similar manner to the variance matrix which was determined by the block structure. Based on (3.2.20), the estimates can be written as 1 T τˆ = XT P y (3.2.23) i [i] b i In this form the left-hand side represents the effects being estimated in the ith stratum, while the right hand side is a sum involving the data and a divisor b which is called the effective replication of the effect in stratum i. This example is a special case of an important concept for the estimation of fixed effects within strata, that is of generally balanced designs (Nelder, 1965b). A design is Generally balanced if the information matrix for the fixed effects in stratum i can be written as l T X X P iX = λijT j (3.2.24) j=1 This form differs to that of Nelder (1965b) but it can be shown the two definitions are equivalent. When this condition holds, there is no need to find an inverse for the left hand side of (3.2.18). If the λik corresponding to a T k is zero then it follows that there is no information in stratum i on the fixed effects T kτ .

Table 3.5 Effective replication for the randomised complete blocks example

Stratum           T_1 τ (mean)    T_2 τ (treatment)
Between blocks    λ_11 = b        λ_12 = 0
Within blocks     λ_21 = 0        λ_22 = b

For a generally balanced design, consider estimation of T_k τ, k = 1, ..., l, in stratum i, which is only possible if λ_ik ≠ 0. Pre-multiplying (3.2.18) by T_k gives

    T_k (Σ_{j=1}^{l} λ_ij T_j) τ̂_[i] = T_k X^T P_i y
    ⇒ λ_ik T_k τ̂_[i] = T_k X^T P_i y

    ⇒ T_k τ̂_[i] = (1/λ_ik) T_k X^T P_i y    (3.2.25)

At this point it is worth considering the form of (3.2.25) in more detail. Beginning from standard maximum likelihood estimation of the vector of fixed effects τ, it follows that for generally balanced designs there is a very simple form for the maximum likelihood estimate of T_k τ in stratum i. This estimate (T_k τ̂_[i]) is a simple function of P_i y: pre-multiplication by X^T forms totals for each treatment, T_k takes deviations, and λ_ik is a scaling factor. The scalar λ_ik is known as the effective replication of T_k τ in stratum i. We have seen that the RCB design is an example of this type: equation (3.2.24) holds, see (3.2.20), with the effective replication for each treatment term given for each stratum in table 3.5.

The sum of squares due to the fixed effects T_k τ in stratum i is then given by (see (2.3.15) and the derivation leading to that form)

    (T_k τ̂_[i])^T X^T K_i K_i^T y = τ̂_[i]^T T_k X^T P_i y
                                  = λ_ik (T_k τ̂_[i])^T (T_k τ̂_[i])    (3.2.26)

using (3.2.25) and the idempotency of T_k. This has degrees of freedom equal to the rank of T_k. As T_k is an orthogonal projection, its rank equals its trace.

If each set of fixed effects T_k τ, k = 1, ..., l, can only be estimated in one stratum, then the design is said to be orthogonal. This occurs if there is only one non-zero λ_ik for each k. The randomised complete block design is an orthogonal design, as can be seen from table 3.5: the mean is estimated in the between blocks stratum and the treatment effects (deviations from the mean) are estimated in the within blocks stratum.

The analysis of variance table for the RCB can be constructed by subdividing the total sum of squares for each stratum into a sum of squares due to the fixed effects estimated in that stratum and a residual sum of squares. The ANOVA estimates of the stratum variances are the stratum residual mean squares. The full analysis of variance decomposition for a randomised complete block design is given in table 3.6. Note that using (3.2.26) the sum of squares due to the mean is given by

    s_m = λ_11 (T_1 τ̂_[1])^T (T_1 τ̂_[1])

        = (1/λ_11) y^T P_1 X T_1 X^T P_1 y
        = y^T P_0 y

The residual sum of squares for the between blocks stratum is obtained by difference as s_1 = y^T P_1 y − s_m. Similarly, the sum of squares due to the treatment effects is given by

    s_t = λ_22 (T_2 τ̂_[2])^T (T_2 τ̂_[2])

        = (1/λ_22) y^T P_2 X T_2 X^T P_2 y

and the residual sum of squares for the within blocks stratum is obtained as s_2 = y^T P_2 y − s_t. Note also that the expectations of the mean squares are given in an abbreviated form, where the non-centrality parameters (see appendix ??) are written as µ(·) to indicate which treatment effects are involved.

The ANOVA table for the variety trial data is presented in table 3.7. Note firstly that the treatment effects are significantly different from zero, indicating that varietal differences exist. Secondly, the ANOVA estimates of the stratum variances are ξ̂_1 = 2.324 and ξ̂_2 = 0.1380, from which the ANOVA estimates of the variance components are σ̂^2 = 0.1380 and σ̂_u^2 = 0.00803. Thus the within blocks variance is considerably larger than the between blocks variance (which is based on only 1 degree of freedom).

Table 3.6 Analysis of variance for an RCB design

Strata/Decomposition    d.f.              S.S.         Expectation of M.S.
Between blocks          b                 y^T P_1 y    -
  Mean                  1                 s_m          ξ_1 + µ(T_1 τ_[1])
  Residual              b − 1             s_1          ξ_1
Within blocks           b(t − 1)          y^T P_2 y    -
  Treatment             t − 1             s_t          ξ_2 + µ(T_2 τ_[2])
  Residual              (b − 1)(t − 1)    s_2          ξ_2

Table 3.7 Analysis of variance for the variety trial data

Strata/Decomposition    d.f.    S.S.        M.S.        F-test
Between blocks          2       16544.90
  Mean                  1       16542.58    16542.58
  Residual              1           2.324       2.324
Within blocks           542       118.01
  Treatment             271        80.613       0.297   2.156
  Residual              271        37.400       0.138
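One standard computational route to this multi-stratum decomposition (not the only one) is the S/R function aov() with an Error() term for blocks; aov() uses the usual intercept parameterisation rather than the full set of variety means. Since the variety trial data are not listed here, the sketch below simulates a response with the same structure; all effect sizes are arbitrary, and the output simply displays the two strata of table 3.6.

## Multi-stratum ANOVA for an RCB design via aov() with an Error() term.
set.seed(42)
b <- 2; t <- 272
d <- data.frame(block   = gl(b, t),
                variety = factor(rep(1:t, times = b)))
d$yield <- 3 + 0.4 * rnorm(t)[d$variety] +   # variety effects (arbitrary)
           0.1 * rnorm(b)[d$block] +         # block effects (arbitrary)
           rnorm(b * t, sd = 0.4)            # plot errors
fit <- aov(yield ~ variety + Error(block), data = d)
summary(fit)   # between-blocks and within-blocks strata, as in table 3.6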

3.2.3 Another look at REML

ANOVA estimates of stratum variances, and hence variance components, are obtained by equating stratum residual mean squares to their expectations. In the spirit of the approach used in section 3.1, we consider the log-likelihood of the between blocks and within blocks residual sums of squares. These sums of squares are distributed as (scaled) chi-squared variates with degrees of freedom equal to the stratum residual degrees of freedom given in table 3.6. The log-likelihood is, ignoring constants,

    ℓ_s = ℓ(ξ_1; s_1) + ℓ(ξ_2; s_2)
        = −(1/2) [ (b − 1) log ξ_1 + s_1/ξ_1 + (b − 1)(t − 1) log ξ_2 + s_2/ξ_2 ]    (3.2.27)

Differentiation of (3.2.27) with respect to ξ_1 and ξ_2 leads to the ANOVA estimates.

It can be shown that the log-likelihood in (3.2.27) is equivalent, as far as estimation is concerned, to the log-likelihood of that part of y_1 and y_2 which has zero expectation. That is, if we let y_1^* = K_1^{*T} y and y_2^* = K_2^{*T} y, where the matrices K_1^{*n×(b−1)} and K_2^{*n×(b−1)(t−1)} are full column rank matrices such that

    K_1^* K_1^{*T} = P_1 − P_0
    K_2^* K_2^{*T} = P_2 − (1/λ_22) P_2 X T_2 X^T P_2
    K_1^{*T} K_2^* = 0

Thus if K^* = [K_1^* K_2^*] then,

    K^{*T} y = (y_1^*; y_2^*) ∼ N( 0, diag(ξ_1 I_{b−1}, ξ_2 I_{(b−1)(t−1)}) )

3.3 Split plot design

The last example we present on designs with orthogonal block and treatment structure is a split plot example. The data is again kindly provided by Dr Maria Durban and involves an experiment designed to investigate the effect on yield of controlling the fungus powdery mildew in barley. Seventy varieties of barley were grown with and without fungicide application. The field layout consisted of four blocks (labelled I, II, III, IV) with two whole-plots per block, each split into 70 sub-plots. The two fungicide treatments were randomly allocated to the two whole-plots within each block, while the 70 varieties were randomly assigned to the 70 sub-plots. The trial was laid out in 56 beds by 10 rows. Each block consisted of 14 beds by 10 rows, with block I occupying beds 1 to 14, block II beds 15 to 28, and so on. Each whole-plot within each block comprised 7 beds by 10 rows. A sub-plot consisted of a single row. The layout of the trial, indicating the allocation of fungicide treatments to whole-plots and the arrangement of blocks, is presented in table 3.8.

Table 3.8 Indicative trial layout for the split plot design

Beds           Block    Whole-plot    Fungicide
 1, ...,  7    I        1             −
 8, ..., 14    I        2             +
15, ..., 21    II       1             −
22, ..., 28    II       2             +
29, ..., 35    III      1             −
36, ..., 42    III      2             +
43, ..., 49    IV       1             +
50, ..., 56    IV       2             −

The statistical model for the yield observed on sub-plot k = 1, ..., 70, whole-plot j = 1, 2 and block i = 1, ..., 4 is

    y_ijk = τ_s(ijk) + b_i + w_ij + e_ijk    (3.3.28)

where s(ijk) represents the randomisation of treatments (fungicides and varieties) to experimental units and τ_l (l = 1, ..., 140) is the mean effect for treatment l. As in the randomised block example, we use τ to represent the 140 treatment mean effects, rather than partitioning at this stage into the main effects of fungicide and variety and their interaction. The standard analysis assumes that the terms b_i for blocks, w_ij for whole-plots within blocks, and e_ijk for sub-plots are all normally distributed, mutually independent (within and between) sets of random effects, with variances σ_b^2, σ_w^2 and σ^2 respectively; the b_i, w_ij and e_ijk are thus pairwise statistically independent.

In general we assume there are t treatment effects, b blocks, w whole-plots in each block and s sub-plots in each whole-plot, with n = bws total observations. Then we can write (3.3.28) in matrix form as

    y = X τ + Z_1 u_1 + Z_2 u_2 + e = X τ + Z u + e

where X^{n×t} is a design matrix which assigns the factorial combinations of the treatments to experimental units. We assume there are w treatments applied to the whole-plots and s treatments applied to the sub-plots, so that t = ws. The whole-plot treatment factor will be denoted in the following by Wtreat and the sub-plot treatment factor by Streat; for notational convenience these names are sometimes abbreviated to W and S. In the fungicide by variety example w = 2 and s = 70. The matrix Z_1^{n×b} is the design matrix for block effects and the matrix Z_2^{n×bw} is the design matrix for effects of whole-plots within blocks. The vectors u_1^{b×1} and u_2^{bw×1} represent the effects for blocks and whole-plots within blocks respectively. Finally we define Z = [Z_1 Z_2] and u = [u_1^T u_2^T]^T.

This is a mixed model with three random components. The marginal distribution of y is

    y ∼ N(X τ, σ_b^2 Z_1 Z_1^T + σ_w^2 Z_2 Z_2^T + σ^2 I_n)    (3.3.29)

The mixed model can also be written symbolically as

    y ∼ Wtreat ∗ Streat + block/wplot + units

where block and wplot are factors with b and w levels respectively, and wplot labels the whole-plots within blocks. Note that the fixed effects formulation reflects the decomposition of the treatment effects into the main effects of Wtreat and Streat and their interaction, to be considered in section 3.3.2.

3.3.1 Orthogonal projections and strata

Using a similar approach to section 3.2.1, the variance matrix can be expressed in terms of projections involving orthogonal components. Thus if

    var(y) = V = σ_b^2 Z_1 Z_1^T + σ_w^2 Z_2 Z_2^T + σ^2 I_n

then V can be written as

    V = Σ_{i=1}^{3} ξ_i P_i

where

    ξ_1 = ws σ_b^2 + s σ_w^2 + σ^2,    ξ_2 = s σ_w^2 + σ^2,    ξ_3 = σ^2

Simple expressions for the P_i can be obtained by noting the form of the random effects design matrices Z_1 and Z_2. The data are assumed ordered as sub-plots within whole-plots within blocks, so that

    Z_1 = I_b ⊗ 1_w ⊗ 1_s
    Z_2 = I_b ⊗ I_w ⊗ 1_s

Hence

    P_1 = Z_1 (Z_1^T Z_1)^{-1} Z_1^T = I_b ⊗ A_w ⊗ A_s
    P_2 = Z_2 (Z_2^T Z_2)^{-1} Z_2^T − Z_1 (Z_1^T Z_1)^{-1} Z_1^T = I_b ⊗ B_w ⊗ A_s
    P_3 = I_n − Z_2 (Z_2^T Z_2)^{-1} Z_2^T = I_b ⊗ I_w ⊗ B_s

We also define

    P_0 = 1_n (1_n^T 1_n)^{-1} 1_n^T = A_b ⊗ A_w ⊗ A_s

Recalling the properties of A_m and B_m, it follows that the P_i are orthogonal projection matrices summing to the identity matrix. The ranks of P_1, P_2, P_3 are b, b(w − 1) and bw(s − 1), since rank(B_m) = m − 1, rank(A_m) = 1 and rank(A ⊗ B) = rank(A) rank(B) (see section ??). Thus we have three strata of variation, and using a similar approach to section 3.2.1 we can transform to three independent linear models, each with homogeneous variation. Formally, we transform the data vector y to K^T y where K = [K_1 K_2 K_3] and K_i K_i^T = P_i, i = 1, 2, 3. Then

    K^T y = (y_1; y_2; y_3) ∼ N( (K_1^T X τ; K_2^T X τ; K_3^T X τ), diag(ξ_1 I_b, ξ_2 I_{b(w−1)}, ξ_3 I_{bw(s−1)}) )

These strata correspond to blocks, whole-plots within blocks and sub-plots within whole-plots. For notational convenience these strata will be referred to in the following by the labels blocks, blocks.wplots and blocks.wplots.splots.
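The triple kronecker form of the P_i makes them simple to construct. A minimal R sketch builds the three stratum projections for the fungicide by variety dimensions and checks their ranks (the trace of an orthogonal projection) and that they sum to the identity.

## Stratum projections for the split plot design via kronecker products.
b <- 4; w <- 2; s <- 70; n <- b * w * s
A <- function(m) matrix(1 / m, m, m)     # A_m: replaces elements by the mean
B <- function(m) diag(m) - A(m)          # B_m: deviations from the mean
P1 <- diag(b) %x% A(w) %x% A(s)          # blocks stratum
P2 <- diag(b) %x% B(w) %x% A(s)          # blocks.wplots stratum
P3 <- diag(b) %x% diag(w) %x% B(s)       # blocks.wplots.splots stratum
c(sum(diag(P1)), sum(diag(P2)), sum(diag(P3)))  # ranks: 4, 4, 552
all.equal(P1 + P2 + P3, diag(n))                # the P_i sum to I_n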

3.3.2 Estimation of treatment effects and analysis of variance

The model for the fixed effects τ can be written in a form similar to that used in section 3.2.2. That is,

    τ = Σ_{j=1}^{l} T_j τ    (3.3.30)

where l = 4 for the fungicide by variety example. The individual terms partition the treatment effects into the overall mean, the main effect of factor Wtreat, the main effect of factor Streat and the Wtreat.Streat interaction. The treatment projection matrices are given by

    T_1 = A_w ⊗ A_s
    T_2 = B_w ⊗ A_s
    T_3 = A_w ⊗ B_s
    T_4 = B_w ⊗ B_s    (3.3.31)

The T_j are a set of orthogonal projection matrices summing to the identity. The split plot design is a generally balanced design. This follows from the definition given in section 3.2.2, that is,

    X^T P_i X = Σ_{j=1}^{l} λ_ij T_j

for i = 1, 2, 3. In fact, reordering X as was done in section 3.2.2, it can be shown that

    X^T P_1 X = b A_w ⊗ A_s = b T_1
    X^T P_2 X = b B_w ⊗ A_s = b T_2
    X^T P_3 X = b I_w ⊗ B_s = b T_3 + b T_4

The effective replication λ_ij is given in table 3.9. This implies that the overall mean is estimated in the blocks stratum, the main effects of Wtreat are estimated in the blocks.wplots stratum, and the main effects of Streat and the interaction effects of Wtreat and Streat are estimated in the blocks.wplots.splots stratum. The design is therefore orthogonal, since each set of treatment effects in (3.3.30) is estimated in one stratum only.

Table 3.9 Effective replication for the split plot example

Stratum                 T_1 τ (mean)    T_2 τ (W)    T_3 τ (S)    T_4 τ (W.S)
Blocks                  λ_11 = b        λ_12 = 0     λ_13 = 0     λ_14 = 0
Blocks.wplots           λ_21 = 0        λ_22 = b     λ_23 = 0     λ_24 = 0
Blocks.wplots.splots    λ_31 = 0        λ_32 = 0     λ_33 = b     λ_34 = b

The analysis of variance can now be constructed. The total sum of squares in each stratum can be subdivided into treatment sum(s) of squares and a residual sum of squares. Table 3.10 presents the full analysis of variance table. The sums of squares due to the mean, the main effects of Wtreat and Streat, and their interaction are given by

    s_m = (1/λ_11) y^T P_1 X T_1 X^T P_1 y = y^T P_0 y
    s_w = (1/λ_22) y^T P_2 X T_2 X^T P_2 y
    s_s = (1/λ_33) y^T P_3 X T_3 X^T P_3 y
    s_ws = (1/λ_34) y^T P_3 X T_4 X^T P_3 y

The residual sum of squares for each stratum is obtained by difference; for example, in the blocks.wplots.splots stratum s_3 = y^T P_3 y − s_s − s_ws. The stratum variances ξ_1, ξ_2, ξ_3 are estimated by the residual mean squares in each stratum.

Table 3.11 presents the analysis of variance table for the fungicide by variety example. It contains the F-tests for the fixed effects. Since the fungicide effects are estimated in the blocks.wplots stratum, the appropriate mean square for testing fungicide effects is the residual in this stratum (see table 3.10). Similarly, since the variety and fungicide by variety effects are estimated in the blocks.wplots.splots stratum, the residual mean square in this stratum is the appropriate error for testing these effects (see table 3.10). There is a very large effect of fungicide treatment; however, there is no evidence of an interaction. We revisit this data-set in chapter ??.

Table 3.10 Analysis of variance for a split plot design

Strata/Decomposition      d.f.                S.S.         Expectation of M.S.
Blocks                    b                   y^T P_1 y    -
  Mean                    1                   s_m          ξ_1 + µ(T_1 τ_[1])
  Residual                b − 1               s_1          ξ_1
Blocks.wplots             b(w − 1)            y^T P_2 y    -
  Wtreat                  w − 1               s_w          ξ_2 + µ(T_2 τ_[2])
  Residual                (b − 1)(w − 1)      s_2          ξ_2
Blocks.wplots.splots      bw(s − 1)           y^T P_3 y    -
  Streat                  s − 1               s_s          ξ_3 + µ(T_3 τ_[3])
  W.S                     (w − 1)(s − 1)      s_ws         ξ_3 + µ(T_4 τ_[3])
  Residual                w(b − 1)(s − 1)     s_3          ξ_3

Table 3.11 Analysis of variance for the fungicide by variety example

Strata/Decomposition      d.f.    S.S.         M.S.        F-test
Blocks                    4       15389.808
  Mean                    1       15374.580    15374.58
  Residual                3          15.228        5.076
Blocks.wplots             4          45.149
  Fungicide               1          42.019       42.019   40.271
  Residual                3           3.130        1.043
Blocks.wplots.splots      552        77.107
  Variety                 69         39.284        0.569    7.201
  F.V                     69          5.090        0.074    0.933
  Residual                414        32.733        0.079
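The nested error structure of table 3.10 corresponds to Error(block/wplot) in the S/R aov() function. The fungicide by variety data are not reproduced here, so the following sketch simulates a response of the same shape (all effect sizes arbitrary) purely to display the three strata.

## Split plot ANOVA via aov(), with whole-plots nested within blocks.
set.seed(1)
b <- 4; w <- 2; s <- 70
d <- expand.grid(variety = factor(1:s), wplot = factor(1:w), block = factor(1:b))
d$fungicide <- d$wplot                       # one fungicide level per whole-plot
d$yield <- 10 + 0.8 * (d$fungicide == "2") +   # fungicide effect (arbitrary)
           0.3 * rnorm(s)[d$variety] +         # variety effects (arbitrary)
           rnorm(nrow(d), sd = 0.3)            # sub-plot errors
fit <- aov(yield ~ fungicide * variety + Error(block / wplot), data = d)
summary(fit)   # strata: block, block:wplot and Within, as in table 3.10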

3.3.3 REML for a split plot design

Estimation of the stratum variances (and thence the variance components) has been based on the stratum residual mean squares. As for the randomised complete block design in section 3.2.3, we could also consider the joint likelihood of the residual sums of squares of each stratum. Since the stratum residual sums of squares are independently distributed as (scaled) chi-squared variates, it can be shown that the log-likelihood of s_1, s_2 and s_3 is, ignoring constants,

    ℓ_s = ℓ(ξ_1; s_1) + ℓ(ξ_2; s_2) + ℓ(ξ_3; s_3)
        = −(1/2) [ (b − 1) log ξ_1 + s_1/ξ_1 + (b − 1)(w − 1) log ξ_2 + s_2/ξ_2
                   + (b − 1)w(s − 1) log ξ_3 + s_3/ξ_3 ]    (3.3.32)

It is interesting to compare (3.3.32) to the log-likelihood of the data y which, after replacing the fixed effects by their generalised least squares estimates, is given by

    −(1/2) [ log |V| + (y − X τ̂)^T V^{-1} (y − X τ̂) ]    (3.3.33)

First consider the log determinant in (3.3.33). The matrix K = [K_1 K_2 K_3] is orthogonal, so that

    |V| = |K^T| |V| |K| = |K^T V K| = Π_{i=1}^{3} ξ_i^{ν_i}

since K^T V K = diag(ξ_i I_{ν_i}), where ν_i = rank(K_i). Thus

    log |V| = Σ_{i=1}^{3} ν_i log ξ_i

Recall that V = Σ_{i=1}^{3} ξ_i P_i, so that

    V^{-1} = Σ_{i=1}^{3} ξ_i^{-1} P_i

and the quadratic form in the log-likelihood is then

    (y − X τ̂)^T V^{-1} (y − X τ̂) = Σ_{i=1}^{3} ξ_i^{-1} (y_i − K_i^T X τ̂)^T (y_i − K_i^T X τ̂)
                                  = s_1/ξ_1 + s_2/ξ_2 + s_3/ξ_3

Thus (3.3.33) is given by

    −(1/2) [ b log ξ_1 + s_1/ξ_1 + b(w − 1) log ξ_2 + s_2/ξ_2 + bw(s − 1) log ξ_3 + s_3/ξ_3 ]    (3.3.34)

The coefficient of log ξ_i in (3.3.34) is the stratum total degrees of freedom, while in (3.3.32) it is the stratum residual degrees of freedom, i.e. the total degrees of freedom minus the number of treatment effects estimated in that stratum. The other terms, which depend on the data, are the same.

Differentiation of (3.3.32) with respect to ξ_1, ξ_2 and ξ_3 leads to the ANOVA estimates. This is not the case for (3.3.34). Hence it seems natural, as far as likelihood estimation of the stratum variances is concerned, to use (3.3.32), as it takes account of the degrees of freedom used in the estimation of treatment effects; the consequent (residual) maximum likelihood estimates of stratum variances are unbiased and equal to the ANOVA estimates. Similarly, it can be shown (see chapter 5 for details) that the log-likelihood in (3.3.32) is equivalent, as far as estimation is concerned, to the log-likelihood of that part of y_1, y_2 and y_3 which has zero expectation. This is again the residual log-likelihood as defined by Patterson and Thompson (1971).

3.4 Balanced incomplete blocks

Before leaving the analysis of designed experiments to consider more general mixed models, it is useful to consider the analysis of a balanced incomplete block design. This is an example of a design with an orthogonal block structure in which treatments are estimated in more than one stratum. The data we use to illustrate the analysis are taken from a long term experiment conducted at the Horticultural Research Station, Dareton, NSW, kindly provided by Dr A. Grieve (State Forests, NSW) and Ms L. McFadyen (NSW Agriculture). The experiment involved examining the effects of irrigation frequency and volume on the growth and yield of sultana grapes. A total of 9 treatments was used, namely the factorial combinations of three irrigation amounts (low, medium and high) by three irrigation frequencies (based on soil moisture deficit levels). In the following we ignore the factorial structure, as this would unnecessarily complicate our development of the incomplete block analysis; furthermore, the scientists were primarily interested in the combined effects of amount and frequency. The experiment design consisted of two repeats of a balanced incomplete block design, resulting in 8 replicates. Within each replicate there were 3 incomplete blocks with 3 plots in each block. The actual field layout is presented in table 3.12; it comprises a rectangular array of 9 rows (indexed by blocks and plots within blocks) by 8 columns (squares and replicates).

The statistical model for the yield of grapes in replicate i = 1, ..., 8, block j = 1, 2, 3 within replicate i, and plot k = 1, 2, 3 within replicate i and block j is

    y_ijk = τ_s(ijk) + r_i + b_ij + e_ijk    (3.4.35)

where s(ijk) represents the randomisation of treatment combinations to experimental units and τ_l, l = 1, ..., 9, is the mean for treatment l. The standard analysis assumes the terms r_i for replicates, b_ij for blocks and e_ijk for plots are normally distributed, mutually independent sets of random effects with variances σ_r^2, σ_b^2 and σ^2 respectively.

In general we assume there are t treatments, r replicates, b (incomplete) blocks within replicates and p plots within blocks, with n = rbp total observations and t = bp. The grape example has n = 72, t = 9, r = 8, b = 3 and

Table 3.12 Field layout and treatment randomisation of the irrigation management trial

                        Replicate
Block    Plot    1   2   3   4   5   6   7   8
1        1       2   7   1   5   1   3   2   2
1        2       3   3   4   1   3   5   7   5
1        3       1   5   7   9   2   7   6   8
2        1       5   2   3   7   7   8   4   1
2        2       6   4   6   6   9   1   3   4
2        3       4   9   9   2   8   6   8   7
3        1       8   6   2   3   4   4   9   3
3        2       9   8   5   8   6   9   5   6
3        3       7   1   8   4   5   2   1   9

p = 3. Then we can write (3.4.35) in matrix form as

    y = X τ + Z_1 u_1 + Z_2 u_2 + e = X τ + Z u + e

where X^{n×t} is a design matrix which assigns the treatments to experimental units, Z_1^{n×r} is the design matrix for replicate effects and Z_2^{n×rb} is the design matrix for the effects of blocks within replicates. The vectors u_1 and u_2 represent the replicate and block within replicate effects respectively. Finally we define Z = [Z_1 Z_2] and u = [u_1^T u_2^T]^T. The marginal distribution of y is

    y ∼ N(X τ, σ_r^2 Z_1 Z_1^T + σ_b^2 Z_2 Z_2^T + σ^2 I_n)    (3.4.36)

The mixed model can also be written symbolically as

    y ∼ treatment + rep + rep.block

where treatment, rep and block are factors with t, r and b levels respectively.

3.4.1 Orthogonal projections and strata

Using a similar approach to section 3.2.1, the variance matrix can be expressed in terms of projections involving orthogonal components. Thus if

    var(y) = V = σ_r^2 Z_1 Z_1^T + σ_b^2 Z_2 Z_2^T + σ^2 I_n

then V can be written as

    V = Σ_{i=1}^{3} ξ_i P_i

where

    ξ_1 = bp σ_r^2 + p σ_b^2 + σ^2,    ξ_2 = p σ_b^2 + σ^2,    ξ_3 = σ^2

As before, simple expressions for the P_i can be obtained by noting the form of the random effects design matrices Z_1 and Z_2. Assuming the data are ordered as plots within blocks within replicates,

    Z_1 = I_r ⊗ 1_b ⊗ 1_p
    Z_2 = I_r ⊗ I_b ⊗ 1_p

Hence

    P_1 = Z_1 (Z_1^T Z_1)^{-1} Z_1^T = I_r ⊗ A_b ⊗ A_p
    P_2 = Z_2 (Z_2^T Z_2)^{-1} Z_2^T − Z_1 (Z_1^T Z_1)^{-1} Z_1^T = I_r ⊗ B_b ⊗ A_p
    P_3 = I_n − Z_2 (Z_2^T Z_2)^{-1} Z_2^T = I_r ⊗ I_b ⊗ B_p

and we also define

    P_0 = 1_n (1_n^T 1_n)^{-1} 1_n^T = A_r ⊗ A_b ⊗ A_p

The P_i are orthogonal projection matrices summing to the identity matrix. The ranks of P_1, P_2, P_3 are r, r(b − 1) and rb(p − 1). There are three strata, and these are the same, in terms of variance structure, as in the split plot example. Hence we transform to three independent linear models, each with homogeneous variation; the details are omitted. The strata will be labelled rep, rep.block and rep.block.plot.

3.4.2 Estimation of treatment effects

The model for the fixed effects τ can be written as

    τ = Σ_{j=1}^{l} T_j τ    (3.4.37)

where l = 2 for the grape example and the two terms represent the overall mean and the deviations of the treatment effects from the overall mean. The treatment projection matrices are given by

    T_1 = A_t  and  T_2 = B_t    (3.4.38)

It can be shown that the design is generally balanced, i.e.

    X^T P_i X = Σ_{j=1}^{l} λ_ij T_j

for i = 1, 2, 3. Using the properties of balanced incomplete block designs it can be shown that

    X^T P_1 X = r T_1
    X^T P_2 X = r(1 − E) T_2
    X^T P_3 X = r E T_2

where E is the efficiency factor of the design (see John and Williams, 1998), which for a BIB is given by E = {t(p − 1)}/{p(t − 1)}. Thus (3.2.24) holds, with effective replication given in table 3.13.

Table 3.13 Effective replication for the BIB example

Stratum           T_1 τ (mean)    T_2 τ (treatment)
Rep               λ_11 = r        λ_12 = 0
Rep.block         λ_21 = 0        λ_22 = r(1 − E)
Rep.block.plot    λ_31 = 0        λ_32 = rE

The effective replication of T_2 τ (i.e. the treatment effects) in the two strata where there is information is given by λ_22 = r(1 − E) and λ_32 = rE. In the grape example the effective replication of T_2 τ is 2 and 6 respectively, since E = 0.75. Hence if we consider estimation of T_2 τ in the rep.block stratum, then it follows that

    T_2 X^T P_2 X τ̂_[2] = T_2 X^T P_2 y
    ⇒ λ_22 T_2 τ̂_[2] = T_2 X^T P_2 y
    ⇒ T_2 τ̂_[2] = (1/λ_22) T_2 X^T P_2 y    (3.4.39)

It also follows that

    var(T_2 τ̂_[2]) = (1/λ_22^2) T_2 X^T P_2 var(y) P_2 X T_2
                   = (1/λ_22^2) T_2 X^T P_2 V P_2 X T_2
                   = (ξ_2/λ_22^2) T_2 X^T P_2 X T_2
                   = (ξ_2/λ_22) T_2

Similarly, estimation of T_2 τ in the rep.block.plot stratum gives

    T_2 τ̂_[3] = (1/λ_32) T_2 X^T P_3 y

with variance

    var(T_2 τ̂_[3]) = (ξ_3/λ_32) T_2

To obtain an efficient estimate of T_2 τ we therefore need to combine these estimates, weighting each by the inverse of its variance, i.e. by λ_i2/ξ_i, i = 2, 3. The explicit form of the combined estimate of T_2 τ can be derived using this approach. Alternatively, if we consider generalised least squares estimation of τ, then it follows that

    X^T V^{-1} X τ̂ = X^T V^{-1} y
    ⇒ X^T (Σ_{i=1}^{3} ξ_i^{-1} P_i) X τ̂ = X^T (Σ_{i=1}^{3} ξ_i^{-1} P_i) y
    ⇒ ( (λ_11/ξ_1) T_1 + (λ_22/ξ_2) T_2 + (λ_32/ξ_3) T_2 ) τ̂ = X^T (ξ_1^{-1} P_1 + ξ_2^{-1} P_2 + ξ_3^{-1} P_3) y

since X^T P_1 X = λ_11 T_1, X^T P_2 X = λ_22 T_2 and X^T P_3 X = λ_32 T_2. Pre-multiplying by T_2 gives

    (λ_22/ξ_2 + λ_32/ξ_3) T_2 τ̂ = T_2 X^T (ξ_2^{-1} P_2 + ξ_3^{-1} P_3) y

Hence

    T_2 τ̂ = (λ_22/ξ_2 + λ_32/ξ_3)^{-1} T_2 X^T (ξ_2^{-1} P_2 + ξ_3^{-1} P_3) y
          = (λ_22/ξ_2 + λ_32/ξ_3)^{-1} [ (λ_22/ξ_2) T_2 τ̂_[2] + (λ_32/ξ_3) T_2 τ̂_[3] ]    (3.4.40)

and

    var(T_2 τ̂) = (λ_22/ξ_2 + λ_32/ξ_3)^{-1} T_2

This estimate combines information across the two strata in which the treatment effects T_2 τ are estimated. The weights in (3.4.40) depend on the stratum variances, which are unknown. Yates (1940) gives an analysis of variance decomposition for this design, which is presented in table 3.14. The sums of squares due to the mean and the sums of squares due to treatments in the rep.block and rep.block.plot strata are given by

    s_m = y^T P_0 y
    s_t2 = (1/λ_22) y^T P_2 X T_2 X^T P_2 y
    s_t3 = (1/λ_32) y^T P_3 X T_2 X^T P_3 y

The residual sum of squares for each stratum is obtained by difference; for example, in the rep.block.plot stratum s_3 = y^T P_3 y − s_t3. Yates (1940) suggests estimating ξ_2 using the residual mean square for the rep.block stratum and ξ_3 using the residual mean square for the rep.block.plot stratum. These estimates are then used to form the combined estimate of T_2 τ. Although intuitively sensible, there are difficulties with this estimation approach, which suggest that the resulting estimates of the stratum variances may not be efficient.

Table 3.14 Analysis of variance for the BIB trial

Strata/Decomposition    d.f.                 S.S.         Expectation of M.S.
Rep                     r                    y^T P_1 y
  Mean                  1                    s_m          ξ_1 + µ(T_1 τ_[1])
  Residual              r − 1                s_1          ξ_1
Rep.block               r(b − 1)             y^T P_2 y
  Treatment             t − 1                s_t2         ξ_2 + µ(T_2 τ_[2])
  Residual              r(b − 1) − t + 1     s_2          ξ_2
Rep.block.plot          rb(p − 1)            y^T P_3 y
  Treatment             t − 1                s_t3         ξ_3 + µ(T_2 τ_[3])
  Residual              rb(p − 1) − t + 1    s_3          ξ_3

Nelder (1968) considered this issue and presented a fully efficient approach, achieved by iterating between estimation of the treatment effects and estimation of the stratum variances in the strata where treatment effects are estimated. REML estimation produces these fully efficient estimates of the stratum variances; the combined estimates of the treatment effects are produced as a by-product of the algorithm, as Empirical Generalised Least Squares (EGLS) estimates (see chapter 5).
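The key quantities of this section are easily computed. A short R sketch evaluates the efficiency factor and effective replications for the grape example, and forms the weights that (3.4.40) attaches to the two stratum estimates; the ξ values below are illustrative stand-ins for REML estimates.

## Efficiency factor and combining weights for the BIB grape example.
t <- 9; r <- 8; b <- 3; p <- 3
E <- t * (p - 1) / (p * (t - 1))   # efficiency factor: 0.75
lambda22 <- r * (1 - E)            # effective replication, rep.block stratum: 2
lambda32 <- r * E                  # effective replication, rep.block.plot stratum: 6
xi2 <- 0.9; xi3 <- 0.3             # illustrative stratum variances only
wts <- c(lambda22 / xi2, lambda32 / xi3)
wts / sum(wts)                     # relative weights on T2 tauhat[2] and T2 tauhat[3]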

3.5 In search of efficient estimation for variance components

The problem of obtaining efficient estimates of variance components in more general designs or settings is now examined. Of the examples presented, those with orthogonal treatment structures present no difficulties for simultaneous estimation of variance components and fixed effects. When there is information on treatment effects in more than one stratum, it is not as straightforward to obtain efficient estimates of variance components, although in the case of generally balanced designs Nelder (1968) presents an iterative approach.

In more complex examples, say in animal breeding, where blocks correspond to groups of related animals, it often happens that blocks will not be of equal size, and then the variance structure implied by the decomposition into an orthogonal block structure is not appropriate. For example, consider a one-way classification with b groups and r_i observations per group (i = 1, ..., b). The decomposition into orthogonal block structure generates

    V = (ξ_0 − ξ_1) 1_n (1_n^T 1_n)^{-1} 1_n^T + (ξ_1 − ξ_2) Z (Z^T Z)^{-1} Z^T + ξ_2 I_n

This implies that the covariance between observations in the same block is inversely proportional to the block size (since Z^T Z = diag(r_i)). It is more usual to assume that the covariances between observations in the same block are all the same, irrespective of block size.

For unbalanced or complex mixed models there is in general no decomposition into orthogonal strata, and hence it is not immediately obvious how to set up sums of squares of residuals for the more general mixed model. Patterson and Thompson (1971) suggest maximising the likelihood of error contrasts, i.e. contrasts with zero expectation and non-zero variance, to estimate the variance components. We have already indicated, in the previous examples for the one-way, randomised complete block and split plot designs, how the so-called Residual Maximum Likelihood (REML) estimates correspond to the ANOVA estimates. In chapter 5 we will present a full account of REML estimation for a wide class of mixed models, which includes the models in this chapter as special cases.
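For such unbalanced settings, REML estimation is available directly in mixed model software. As a hedged illustration (this assumes the lme4 package is installed, and the data are simulated since none are given here), an unbalanced one-way classification can be fitted as follows.

## REML for an unbalanced one-way classification, where no orthogonal
## strata exist and the ANOVA approach breaks down.
library(lme4)
set.seed(7)
ri    <- c(2, 3, 5, 4, 7, 3)                       # unequal group sizes
group <- factor(rep(seq_along(ri), times = ri))
y     <- 10 + 0.7 * rnorm(length(ri))[group] +     # group effects (arbitrary)
         rnorm(sum(ri), sd = 0.5)                  # within-group errors
fit <- lmer(y ~ 1 + (1 | group), REML = TRUE)
VarCorr(fit)   # REML estimates of the group and residual variance components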

3.6 Summary

In this chapter, results for designed experiments have been presented to

• demonstrate the use of orthogonal projections to create strata in balanced designs
• derive the analysis of variance table and thence present an approach for obtaining ANOVA estimates of variance components
• show the equivalence of the ANOVA estimates and the so-called REML estimates of variance components in designs with orthogonal block and treatment structures
• indicate how efficient estimation of variance components (and treatment effects) can be achieved for generally balanced designs
• suggest that the ANOVA and iterated ANOVA approaches to the estimation of variance components cannot be readily extended to more complex data-sets where the design or study is non-orthogonal or unbalanced.

CHAPTER 4

The Linear Mixed Model

To this point we have considered special cases of the linear mixed model. These special cases reduce to ordinary linear models via the transformation to strata, which is only available in situations where we have orthogonal block and treatment structures. In particular, this approach fails for unbalanced data and more general covariance structures.

In this chapter the general formulation of the linear mixed model is presented, and we thereby extend the class of models discussed in previous chapters to those allowing completely general covariance structures for both the random effects and the residual random errors.

4.1 The Model

If y^{n×1} denotes the vector of observations, the linear mixed model can be written as

    y = X τ + Z u + e    (4.1.1)

where τ^{p×1} is the vector of fixed effects, X^{n×p} is the design matrix (parameterised to be of full rank) that associates observations with the appropriate combination of fixed effects, u^{b×1} is the vector of random effects, Z^{n×b} is the design matrix which associates observations with the appropriate combination of random effects, and e^{n×1} is the vector of residual errors. The model (4.1.1) is called a linear mixed model or linear mixed-effects model. It is assumed that

    (u; e) ∼ N( 0, σ_H^2 diag(G, R) )    (4.1.2)

The parameter σ_H^2 is a scale parameter that plays an important role, to be discussed below. The variance models given by the matrices G and R are called G-structures and R-structures respectively. Under these assumptions we have

    y | u ∼ N(X τ + Z u, σ_H^2 R)    (4.1.3)
    u ∼ N(0, σ_H^2 G)    (4.1.4)

so that

    y ∼ N( X τ, σ_H^2 (R + Z G Z^T) )    (4.1.5)

We write

    V = σ_H^2 (R + Z G Z^T) = σ_H^2 H    (4.1.6)

so that

    y ∼ N(X τ, V)    (4.1.7)

Equation (4.1.6) explains the notation σ_H^2; this parameter multiplies the matrix H. Typically G and R are functions of parameters that need to be estimated; these parameters were variances or variance ratios in previous chapters. A general and consistent notation to be used throughout the book is

    G = G(γ)    (4.1.8)

and

    R = σ^2 Σ    (4.1.9)
    Σ = Σ(φ)    (4.1.10)

The vectors γ and φ are parameter vectors associated with the random effects (u) and the residuals (e) respectively. Their precise meaning is discussed below.

4.2 Variance structures for the errors: R-structures

In most cases the vector of residuals represents the errors from a single experiment or a single set of data. In chapter 3, R was a scaled identity matrix, that is R = σ^2 I, so that the errors were assumed independent and identically distributed. In some situations, for example in the analysis of multi-clinic trials, the analysis of animal breeding data across populations (Foulley and Quass, 1995) or the analysis of multi-environment variety trials (Smith et al., 2001a), the vector e will be a series of sub-vectors indexed by a factor or factors. The sub-vectors relate to sections of the data, which in the examples above may be a clinic, a population or a trial.

Thus in general we write e = [e_1^T, e_2^T, ..., e_s^T]^T, so that e_j represents the vector of errors of the jth section of the data. The variance matrix for each section may differ, but we assume that the errors from different sections are independent (if they are not, we can coalesce the dependent components into a single component and hence maintain the independence structure). In matrix terms this gives

    R = ⊕_{j=1}^{s} R_j = diag(R_1, R_2, ..., R_s)

where ⊕ is the direct sum operator. An example of such a structure is presented by Cullis et al. (1998) in the context of the spatial analysis of multi-environment trials. In this case the jth section has variance matrix given by

    R_j = R_j(φ_j) = σ_j^2 Σ_j(ρ_j) + ψ_j I_{n_j}

Each section represents a trial. The variance parameters allow for a different variance for each trial (σ_j^2), and hence heterogeneity, a different correlation structure for each trial (through Σ_j and ρ_j), and a different measurement error term (ψ_j).
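A minimal R sketch of such a direct-sum R-structure follows; the sections are small and the AR(1)-type correlation used for Σ_j is simply one common choice, with all parameter values illustrative.

## A direct-sum R-structure: independent sections, each with its own
## variance, correlation and measurement error parameters.
ar1 <- function(n, rho) rho^abs(outer(1:n, 1:n, "-"))   # one choice of Sigma_j
R1 <- 1.2 * ar1(4, 0.5) + 0.10 * diag(4)   # section 1: sigma_j^2, rho_j, psi_j
R2 <- 0.8 * ar1(3, 0.2) + 0.05 * diag(3)   # section 2 (all values illustrative)
R  <- matrix(0, 7, 7)                      # direct sum R = R1 (+) R2
R[1:4, 1:4] <- R1
R[5:7, 5:7] <- R2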

4.3 Variance structures for the random effects: G-structures

The b × 1 vector of random effects is often composed of q sub-vectors, u = [u_1^T u_2^T ... u_q^T]^T, where the sub-vectors u_i are of length b_i. These sub-vectors are assumed independent and normally distributed with variance matrices σ_H^2 G_i. Thus, as for R, we have

    G = ⊕_{i=1}^{q} G_i = diag(G_1, G_2, ..., G_q)

There is a corresponding partition of Z, namely Z = [Z_1 Z_2 ... Z_q].

4.4 Separability

Complex variance structures arise in many applications, including the analysis of longitudinal data, multivariate analysis and spatial analysis. In some cases, the component matrices R_j (or R itself if there is only one section), or G_i (or G if there is only one G-structure), are related to the underlying structure in the data. To illustrate this we begin with an example of balanced multivariate data.

Suppose we measure p traits or variables on each of nb units (nb > p). To put this in context, suppose the units represent n animals in each of b cattle breeds. If y_j, j = 1, ..., p, represents the data vector of the jth trait or variable, the model considered here is given by

    y_j = D τ_j + B u_j + e_j    (4.4.11)

where D is the fixed effects design matrix, τ_j is the vector of fixed effects for the jth trait, B^{nb×b} is the random effects design matrix, u_j^{b×1} is the vector of random breed effects for the jth trait, and e_j is the vector of residuals. The design matrices are the same for each trait. In this setting, the components of each e_j are assumed independent.

We consider the ith animal, i = 1, 2, ..., nb. If y_(i)^T denotes the row vector of observations on the p traits for this animal, we can write the model

    y_(i)^T = d_i^T [τ_1 τ_2 ... τ_p] + b_i^T [u_1 u_2 ... u_p] + [e_i1 e_i2 ... e_ip]
            = d_i^T T + b_i^T U + e_(i)^T    (4.4.12)

where d_i^T and b_i^T are the ith rows of the matrices D and B respectively, and T = [τ_1, τ_2, ..., τ_p] and U = [u_1, u_2, ..., u_p] are matrices of the fixed and random effects. As (4.4.12) contains observations on the same unit, we assume the random error vector e_(i) has components that are correlated (across the traits), with possibly heterogeneous variances. Thus

    e_(i) ∼ N(0, Σ_p)

where Σ_p is the covariance matrix of the p traits. Similarly, if u_(i) is the vector of random effects for the p traits for the ith breed, we assume

    u_(i) ∼ N(0, G_p)

Both Σ_p^{p×p} and G_p^{p×p} are symmetric, positive definite matrices, each with p(p + 1)/2 unique parameters. A matrix model for the complete data set which combines (4.4.11) and (4.4.12) is then given by

    Y = D T + B U + E    (4.4.13)

where Y^{nb×p} = [y_1, y_2, ..., y_p] and E = [e_1, e_2, ..., e_p]. If we define

    y = vec(Y),  τ = vec(T),  u = vec(U)  and  e = vec(E)

where vec(·) forms a vector by stacking the columns of the matrix argument (see section ??), then (4.4.13) can be written equivalently as

    y = X τ + Z u + e

where X = I_p ⊗ D and Z = I_p ⊗ B. Under the assumptions given above, the variance structures for u and e are therefore given by

    var(u) = G = G_p ⊗ I_b

    var(e) = R = Σ_p ⊗ I_{nb}
    cov(u, e) = 0

where the parameters σ_H^2 and σ^2 of the general formulation (4.1.1) are both set equal to one. The variance models for u and e are called separable because they can be represented by the kronecker product of two matrices. These separable structures arise quite naturally in this example and in essence correspond to underlying factors in the data structure. For the random effects the two factors are the trait and breed variables, while for the random errors the two factors are trait and the observational unit (the animal within each breed, or simply units). This type of separable decomposition arises in other applications and in more complex situations. The concept of separability was introduced by Martin (1979) in the context of lattice processes: Martin (1979) showed that the correlation matrix of a linear-by-linear process observed on an r × c rectangular lattice can be written as the kronecker product of two correlation matrices which relate to the rows and columns of the lattice.
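A small R sketch illustrates the separable forms (with illustrative p, n, b and assumed trait covariance values) and checks the kronecker inverse identity that makes them computationally convenient; the determinant and eigenvalue identities noted at the end of this section can be verified in the same way.

## Separable G- and R-structures for the multivariate breed example.
p <- 2; n <- 5; b <- 3                   # traits, animals per breed, breeds
Gp     <- matrix(c(0.6, 0.2,
                   0.2, 0.4), p, p)      # assumed breed (G) trait covariance
Sigmap <- matrix(c(1.0, 0.3,
                   0.3, 0.5), p, p)      # assumed residual trait covariance
G <- Gp %x% diag(b)                      # G = G_p (x) I_b
R <- Sigmap %x% diag(n * b)              # R = Sigma_p (x) I_nb
## The inverse of a kronecker product is the kronecker product of inverses:
all.equal(solve(R), solve(Sigmap) %x% diag(n * b))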

    y ∼ trait + trait.breed + trait.units

where trait is a factor with p levels which indexes the traits and breed is a factor with b levels which codes the breed for each animal. The residual term is constructed as the interaction between the trait factor and units. The symbolic representation of the R-structure is given by US(trait) x ID(units), where the model acronym US refers to an unstructured variance matrix (a fully parameterized variance model with p(p + 1)/2 parameters) relating to trait and ID refers to an identity variance model relating to units. This notation will be extended and widely used throughout the book. Similarly, the G-structure is given by US(trait) x ID(breed).

Separability is a very useful assumption regarding the form of the variance matrices R and G (or sub-matrices Rj and Gi). Formally, if var(e) = σ²_H R, then the matrix R (and the error process) is said to be separable with two components if

    R = R1 ⊗ R2        (4.4.14)

where Ri (ri × ri) is proportional to the variance matrix for the ith factor defining the data structure. The same definition applies to G-structures, and the definition extends in an obvious way to more than two components. The assumption of separability greatly reduces the computational load. Of particular use in fitting the linear mixed model are the following results (see section ??):

    R⁻¹ = R1⁻¹ ⊗ R2⁻¹ and |R| = |R1|^r2 |R2|^r1

and the eigenvalues of R are the r1 r2 products of the r1 eigenvalues of R1 with the r2 eigenvalues of R2. Separability allows a flexible framework for modelling variance structures in the linear mixed model. Many other examples will be considered in this book where the usual assumptions concerning the stochastic properties of the random effects in the linear mixed model lead naturally to a separable variance matrix.
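These identities are easily verified numerically. A minimal sketch, assuming numpy and using arbitrary illustrative component matrices (not taken from any example in this book):

    import numpy as np

    # Two small positive definite component matrices (illustrative choices)
    R1 = np.array([[1.0, 0.5], [0.5, 1.0]])            # r1 = 2
    R2 = np.array([[2.0, 0.3, 0.0],
                   [0.3, 1.0, 0.2],
                   [0.0, 0.2, 1.5]])                   # r2 = 3
    R = np.kron(R1, R2)
    r1, r2 = R1.shape[0], R2.shape[0]

    # Inverse of the kronecker product is the kronecker product of the inverses
    assert np.allclose(np.linalg.inv(R),
                       np.kron(np.linalg.inv(R1), np.linalg.inv(R2)))

    # |R| = |R1|^r2 * |R2|^r1
    assert np.isclose(np.linalg.det(R),
                      np.linalg.det(R1) ** r2 * np.linalg.det(R2) ** r1)

    # Eigenvalues of R are all r1*r2 products of the component eigenvalues
    ev = np.sort(np.outer(np.linalg.eigvalsh(R1), np.linalg.eigvalsh(R2)).ravel())
    assert np.allclose(np.sort(np.linalg.eigvalsh(R)), ev)

The computational saving is clear: the factorisations required involve matrices of orders r1 and r2 rather than r1 r2.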

4.5 Variance models

There are three types of variance model used for R- and G-structures in this book, namely, correlation models, homogeneous variance models and heterogeneous variance models. A complete list of the variance models used in this book is presented in appendix ??. This appendix also contains a reference to the first use of each variance model in the book.

4.5.1 Correlation models

In correlation models all diagonal elements are identically equal to 1. If C = {cij}, i, j = 1, . . . , n, denotes the n × n correlation matrix for a particular correlation model, then

    C = {cij} :  cii = 1, ∀i;  cij = cji;  |cij| < 1, i ≠ j.

The simplest correlation model is the identity model, for which the off-diagonal elements are identically equal to zero, that is, cij = 0, i ≠ j. Correlation models include those arising in time-series analysis, geostatistics and spatial statistics, as well as more general correlation models such as banded models or the completely general correlation model with p(p − 1)/2 parameters.

4.5.2 Homogeneous variance models

In homogeneous variance models the diagonal elements all have the same positive value, σ² say. If V = {vij}, i, j = 1, . . . , n, is an n × n homogeneous variance matrix, then

    V = {vij} :  vii = σ², ∀i;  vij = vji, i ≠ j.

Note that if V is the homogeneous variance model matrix corresponding to the correlation model matrix C, then

    V = σ²C

and V has just one more parameter than C. For example, the homogeneous variance model corresponding to the identity correlation structure is the simple variance components model, which specifies vii = σ², ∀i, with off-diagonal elements equal to zero. In most software, this is the default variance model for terms classified as random in the linear mixed model.

4.5.3 Heterogeneous variance models

The third variance model is the heterogeneous variance model, for which the diagonal elements are positive but differ. If V = {vij}, i, j = 1, . . . , n, is an n × n heterogeneous variance matrix, then

    V = {vij} :  vii = σi², i = 1, . . . , n;  vij = vji, i ≠ j.

If V is the heterogeneous variance model matrix corresponding to the correlation model matrix C, then

    V = DCD

where D = diag(σi) is n × n. This model has an additional n parameters compared to the base correlation model. For example, the heterogeneous variance model corresponding to the identity correlation model is the diagonal variance model, for which vii = σi², ∀i, with zero off-diagonal elements. Examples include the diagonal variance model, factor analytic, reduced rank and ante-dependence models; the most general is the unstructured model with p(p + 1)/2 parameters.
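The three model types are easily constructed from a correlation matrix. A minimal numpy sketch (the matrices and values are arbitrary illustrative choices):

    import numpy as np

    # An illustrative 3 x 3 correlation matrix C (unit diagonal)
    C = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.0, 0.5],
                  [0.2, 0.5, 1.0]])

    # Homogeneous variance model: V = sigma^2 C (one extra parameter)
    sigma2 = 2.5
    V_hom = sigma2 * C

    # Heterogeneous variance model: V = D C D with D = diag(sigma_i)
    # (n extra parameters relative to the correlation model)
    sigma = np.array([1.0, 1.5, 2.0])
    D = np.diag(sigma)
    V_het = D @ C @ D

    # Diagonals recover sigma_i^2; off-diagonals are sigma_i sigma_j c_ij
    assert np.allclose(np.diag(V_het), sigma ** 2)
    assert np.isclose(V_het[0, 1], sigma[0] * sigma[1] * C[0, 1])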

4.6 Identifiability of variance models

Because of the generality we have attempted to maintain in constructing the variance models for the random effects in the linear mixed model, it is almost inevitable that even the most experienced user will encounter problems of identifiability of variance models. The cause of non-identifiability can be hard to diagnose. In principle, the issues are akin to those of ensuring that the fixed effects model is not over-parameterised: a variance model may not be identifiable because it is over-parameterised (the analogue of intrinsic aliasing in the fixed effects model), or there may be insufficient data to estimate the parameters of the chosen variance model (the analogue of extrinsic aliasing in the fixed effects model). There are some general principles which can be useful in avoiding over-parameterisation of variance models, and in the following we present some of these by way of example.

4.6.1 Variance components or variance ratios

In this section we assume Σ = In, that is, the error variance model is a scaled identity. This occurs in many applications, and some examples were considered in chapter 3. The variance structure is therefore given by

    σ²_H H = σ²_H (σ²In + ZGZ′)        (4.6.15)

This variance model is over-parameterised because the residual variance σ² cannot be estimated separately from σ²_H. There are several ways of overcoming this. If σ²_H is set to one, then the variance matrix for y is

    σ²In + ZGZ′

A consequence of this parameterization is that G must now be a variance matrix. For example, for the one-way classification and the RCB design, G = σu²Ib. For the split plot design, G is the direct sum of 2 sub-matrices, one for blocks and one for whole-plots. That is,

    G = ⊕²ᵢ₌₁ Gi = [ σb²Ib    0
                     0        σw²Ibw ]

Setting σ²_H = 1 implies that the variance parameters are variance components.

If instead we set σ² = 1, then σ²_H is an overall scale parameter and is equal to the residual variance. As a consequence of this parameterization, the matrix G cannot be a variance matrix. Again, for the one-way classification and the RCB design, G = γu Ib where γu = σu²/σ²_H. Similarly, for the split plot design

    G = ⊕²ᵢ₌₁ Gi = [ γb Ib    0
                     0        γw Ibw ]

where γb = σb²/σ²_H and γw = σw²/σ²_H. Thus the parameters in G are variance ratios under this parameterization.

Lastly, it is clear that σ² and σ²_H cannot both be set to one, as the residual variance would then be fixed at one. To summarize, just as for fixed effects, the parameterization chosen has implications for identifiability and for the interpretation of the parameters.

4.6.2 Non-identity R-structure

When Σ = Σ(φ), the scale parameters σ² and σ²_H can either be separately set to one or jointly set to one. As in the previous section, both cannot be estimated in the same model.

If σ²_H = 1 and σ² ≠ 1, then var(e) = σ²Σ and thus Σ must be a scaled variance matrix or a correlation matrix. For example, if a component matrix of Σ is a diagonal matrix of variances, then one of its elements must be fixed to ensure identifiability. On the other hand, G must be a variance matrix, since var(u) = G.

If σ²_H ≠ 1 and σ² = 1, then var(e) = σ²_H Σ and var(u) = σ²_H G. Thus both Σ and G must be scaled variance matrices or correlation matrices.

Lastly, if σ² = 1 and σ²_H = 1, then var(e) = Σ and var(u) = G. Thus both Σ and G must be variance matrices and their parameter vectors must include at least one scale parameter. The most common applications for this case are in the analysis of multivariate data or repeated measures analysis with heterogeneous variances.

4.7 Combining variance models

When either R or G is formed from the kronecker product of several sub-matrices, some general rules must be obeyed to avoid over-parameterisation. In the following we consider models with two components for G and R and use Ci and Vi, i = 1, 2, to denote arbitrary correlation and variance matrices.

1. If Σ = C1 ⊗ C2 then Σ is a valid correlation model, and so a scale parameter (either σ² or σ²_H) must be included in the variance model. If G = C1 ⊗ C2 then a scale parameter should be included in the parameter vector γ regardless of the status of σ² and σ²_H.

2. If Σ = C1 ⊗ V2 or Σ = V1 ⊗ C2 then Σ is a variance matrix, in which case neither σ² nor σ²_H can be estimated. If G = C1 ⊗ V2 or G = V1 ⊗ C2 then G is a variance matrix. This usually coincides with σ²_H = 1.

3. If Σ or G = V1 ⊗ V2 = V then Σ or G is an over-parameterised variance matrix, which would necessitate fixing one of the variance parameters in V1 or V2, as illustrated in the sketch below.
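Rule 3 reflects a simple scale confounding, which the following numpy sketch makes explicit (arbitrary illustrative matrices):

    import numpy as np

    V1 = np.array([[1.0, 0.3], [0.3, 2.0]])
    V2 = np.array([[1.5, 0.4], [0.4, 1.0]])

    # Rescaling V1 by c and V2 by 1/c leaves the kronecker product unchanged,
    # so the overall scales of V1 and V2 are not separately identifiable.
    c = 3.7
    assert np.allclose(np.kron(V1, V2), np.kron(c * V1, V2 / c))

    # Fixing one parameter (here the (1,1) element of V1 set to one)
    # removes the redundancy.
    V1_constrained = V1 / V1[0, 0]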

4.8 Summary

In this chapter we have introduced the general form of the linear mixed model and described the range of models which are either used in this book or available in the software packages which we use to undertake the analyses. The main concepts that have been introduced in the context of the linear mixed model are

• R- and G-structures, their general definition and structure
• the assumption of separability for the variance models used in the R- and G-structures
• the combination of variance models both within and between R- and G-structures, and how the form of these models (ie as variance or correlation models) relates to the presence of the overall scaling parameter σ²_H.

As a useful summary of the issues discussed in sections 4.6 and 4.7, we present in table 4.1 the possible variance models for y which can be obtained by altering the type of scale parameters in the variance model. As we have seen, the two scale parameters, namely σ² and σ²_H, can each be either fixed (usually to one) or free, which means the parameter must be estimated from the data. Generally, the overall scale parameter σ²_H controls whether we estimate variance components or variance component ratios. There may be some computational savings in the iteration process using this parameterisation, as given estimates of γ and φ the score equation for σ²_H has an algebraic solution (see chapter ??). In some cases, however, it does not make sense to include an overall scale parameter. We have discussed some examples where these types of variance models may be used.

Table 4.1 Summary of the variance models

Σ = I
σ²_H    σ²      Description
Fixed   Fixed   Not admissible, since there is no residual variance
Fixed   Free    var(e) = σ²I; γ are variance components
Free    Fixed   var(e) = σ²_H I; γ are variance component ratios
Free    Free    σ² and σ²_H are not identifiable

Σ = Σ(φ)
σ²_H    σ²      Description
Fixed   Fixed   var(e) = Σ, so φ is a vector of components, eg parameters for unstructured or antedependence models; γ are variance components
Fixed   Free    var(e) = σ²Σ, so φ are ratios relative to σ²; γ are variance components
Free    Fixed   var(e) = σ²_H Σ, so φ are ratios relative to σ²_H; γ are variance component ratios
Free    Free    σ² and σ²_H are (generally) not identifiable

CHAPTER 5

Estimation

The linear mixed model is composed of fixed effects τ, random effects u and variance parameters σ²_H and κ = (γ′, σ², φ′)′. Likelihood methods are used for estimation of the fixed effects and variance parameters. The prediction of the random effects is sometimes of interest, and this can be considered to be a post-estimation process, in a manner to be discussed below. The principles involved in estimation and prediction do not necessarily provide an efficient algorithm to achieve those aims. Thus this chapter begins with the principles of estimation and prediction and then presents an efficient algorithm. Residual maximum likelihood is presented in this chapter as a formal method of estimation of variance parameters. In the process, estimation of fixed effects is also achieved. This leaves the problem of prediction, which can be approached in a number of ways. The computational strategy for efficient estimation is also presented in this chapter. The approach has important consequences for prediction using the linear mixed model, and these extensions will be discussed in chapter 7.

5.1 Estimation of fixed effects and variance parameters

Estimation of the variance parameters by residual maximum likelihood is considered first; estimation of the fixed effects and prediction of the random effects then follow, via the mixed model equations.

5.2 Estimation of variance parameters

As we saw in chapter 3, when the variance parameters of the linear mixed model are simple components, that is, Gi = γi Iqi ∀i, the block structure is orthogonal, the treatment structure is orthogonal and the variance matrix for the errors is σ²In, then we can estimate the stratum variances (and hence variance components) by equating residual mean squares from an ANOVA table with their expectations. This method is attributed to R.A. Fisher (Searle et al., 1992) and has been widely used for many years. There are many applications, however, where the data are unbalanced and/or we wish to model the (co)variation in the data by using more complex variance structures. Many authors have suggested using extensions of the ANOVA methods described in chapter 3 for variance component estimation in unbalanced data. Searle et al. (1992) give an exhaustive account of three such approaches, which were originally proposed by Henderson (1953). The three methods of estimation have become known as Henderson's methods I, II and III. Method I uses quadratic forms which are analogous to the sums

of squares of generally balanced designs with orthogonal treatment structure; Method II is an adaptation of Method I which takes account of fixed effects in the model; Method III uses sums of squares from fitting the full mixed model (and sub-models thereof) as though all terms were fixed effects. These techniques have become superseded by Maximum Likelihood (ML) or, more recently, Residual Maximum Likelihood (REML). There are several reasons for this. Firstly, the original methods were proposed before the advent of high speed computers, and it was therefore important to have an approach which was not computationally intensive. This is no longer an issue, with the proliferation of efficient mixed models software and the high capacity computing power available to most researchers. The other attraction of REML (and ML) is that it provides the framework for variance parameter estimation in a much wider class of variance models than simple variance components. The paper by Patterson and Thompson (1971) is the original reference for REML. REML takes into account the degrees of freedom associated with the estimation of fixed effects, so that REML estimates of variance parameters are less biased than ML estimates. As we have indicated, REML estimates coincide with ANOVA based estimates for orthogonal block and treatment structures. We will only consider REML estimation in this book. Other texts in this area, such as Searle et al. (1992) and Verbeke and Molenberghs (2000), cover ML estimation.

5.2.1 The residual log-likelihood function

Recall that if y (n × 1) denotes the vector of observations, the linear mixed model can be written as

    y = Xτ + Zu + e        (5.2.1)

where τ (p × 1) is the vector of fixed effects, X (n × p) is the design matrix (parameterised to be of full rank) that associates observations with the appropriate combination of fixed effects, u (b × 1) is the vector of random effects, Z (n × b) is the design matrix that associates observations with the appropriate combination of random effects, and e (n × 1) is the vector of residual errors. We assume

    [ u ]        ( [ 0 ]        [ G  0 ] )
    [ e ]  ∼  N  ( [ 0 ], σ²_H  [ 0  R ] )        (5.2.2)

where G = G(γ) and R = σ²Σ, Σ = Σ(φ). The vectors γ and φ are vectors of variance parameters associated with the random effects and residuals respectively. The distribution of the data is thus Gaussian with mean Xτ and variance matrix V = σ²_H H where H = R + ZGZ′.

Result 5.1 The residual log-likelihood for the model in (5.2.1) is given by

    ℓ_R = ℓ(σ²_H, κ; y2)
        = −½ { (n − p) log σ²_H + log |H| + log |X′H⁻¹X| + y′Py/σ²_H }        (5.2.3)

where y2 = L2′y, L2 is an n × (n − p) matrix with full column rank chosen such that L2′X = 0, and P = H⁻¹ − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹.

Proof: Verbyla (1990) presented an illuminating derivation of the Patterson and Thompson (1971) residual likelihood. He partitions the full likelihood for the mixed model in (5.2.1) into two independent parts: one relates to the treatment (fixed effect) contrasts Xτ (there are p such effects) and the other to the residual contrasts Zu + e, that is, contrasts whose expectation is zero (there are n − p independent error contrasts). Maximization of the former provides estimates of the fixed effects, whereas maximization of the residual likelihood provides estimates of the variance parameters and the random effects.

Verbyla (1990) considers a non-singular matrix L = [L1 L2], where L1 (n × p) and L2 (n × (n − p)) are matrices chosen to satisfy L1′X = Ip and L2′X = 0. The distribution of the transformed data L′y = [y1′ y2′]′, say, is given by

    [ y1 ]        ( [ τ ]        [ L1′HL1  L1′HL2 ] )
    [ y2 ]  ∼  N  ( [ 0 ], σ²_H  [ L2′HL1  L2′HL2 ] )        (5.2.4)

The likelihood of L′y can be expressed as the product of the conditional likelihood of y1 given y2 and the marginal likelihood of y2. From (5.2.4) the marginal distribution of y2 is

    y2 ∼ N(0, σ²_H L2′HL2)

and using result ?? the conditional distribution of y1 given y2 is normal, with mean

    E(y1|y2) = τ + L1′HL2 (L2′HL2)⁻¹ y2

and variance

    var(y1|y2) = σ²_H [ L1′HL1 − L1′HL2 (L2′HL2)⁻¹ L2′HL1 ]

Using result ?? and the fact that L1′X = Ip, this can be written as

    y1|y2 ∼ N( τ + y2*, σ²_H (X′H⁻¹X)⁻¹ )

where y2* = L1′HL2 (L2′HL2)⁻¹ y2. The associated log-likelihood functions (excluding constant terms) are given by

    ℓ_R = ℓ(σ²_H, κ; y2)
        = −½ { (n − p) log σ²_H + log |L2′HL2| + y2′(L2′HL2)⁻¹y2/σ²_H }
        = −½ { (n − p) log σ²_H + log |L2′HL2| + y′L2(L2′HL2)⁻¹L2′y/σ²_H }        (5.2.5)

and

    ℓ1 = ℓ(τ, σ²_H, κ; y1|y2)
       = −½ { p log σ²_H + log |(X′H⁻¹X)⁻¹| + (y1 − τ − y2*)′ (X′H⁻¹X) (y1 − τ − y2*)/σ²_H }        (5.2.6)

Clearly the likelihood of y2 contains no information on τ so that τ must be estimated from the conditional distribution of y1 given y2. From (5.2.6) and the derivative results in (??) the MLE of τ is obtained as the solution to

    ∂ℓ1/∂τ = (X′H⁻¹X) ( y1 − τ − L1′HL2 (L2′HL2)⁻¹ y2 ) / σ²_H = 0

This gives

    τ̂ = y1 − L1′HL2 (L2′HL2)⁻¹ y2
       = L1′ ( I − HL2 (L2′HL2)⁻¹ L2′ ) y

       = L1′ ( H − HL2 (L2′HL2)⁻¹ L2′H ) H⁻¹ y
       = (X′H⁻¹X)⁻¹ X′H⁻¹ y

using result ?? and the fact that L1′X = Ip.

The likelihood of y1 given y2 is a function of τ, σ²_H and κ, but since τ and y1 are both vectors of length p, once τ has been estimated there is no information left to estimate σ²_H and κ. The variance parameters σ²_H and κ = (γ′, σ², φ′)′ are therefore estimated using the marginal likelihood of y2, that is, the residual likelihood. Since ℓ(τ, σ²_H, κ; L′y) = ℓ(σ²_H, κ; y2) + ℓ(τ, σ²_H, κ; y1|y2), the determinants can be similarly partitioned:

    log |L′HL| = log |L2′HL2| + log |(X′H⁻¹X)⁻¹|
    ⇒ log |L2′HL2| = log |L′L| + log |H| + log |X′H⁻¹X|

Now |L′L| does not involve σ²_H or κ, and so the log-likelihood in (5.2.5) can be written (ignoring constants) as

    ℓ_R = −½ { (n − p) log σ²_H + log |H| + log |X′H⁻¹X| + y′L2(L2′HL2)⁻¹L2′y/σ²_H }

Using result ??, L2(L2′HL2)⁻¹L2′ = P, so the residual log-likelihood can be written as

    ℓ_R = −½ { (n − p) log σ²_H + log |H| + log |X′H⁻¹X| + y′Py/σ²_H }        (5.2.7)

as required. □

5.2.2 REML score equations

The REML estimates of σ²_H and κ = (γ′, σ², φ′)′ are obtained by solving the system of equations (known as score equations):

    U_R(σ²_H) = ∂ℓ_R/∂σ²_H = 0
    U_R(κi) = ∂ℓ_R/∂κi = 0

for i = 1, . . . , nk, where nk is the number of variance parameters in κ.

Result 5.2 The score for σ²_H is given by

    U_R(σ²_H) = −½ { (n − p)/σ²_H − y′Py/σ⁴_H }

Hence it follows that the REML estimate of σ²_H, given κ, is

    σ̂²_H = y′Py/(n − p)

Result 5.3 The score for κi is given by

    U_R(κi) = −½ { tr(PḢi) − y′PḢiPy/σ²_H }        (5.2.8)

where Ḣi = ∂H/∂κi.

Proof: First consider the derivative of the log determinants in (5.2.3). Using the derivative results in section ??,

    ∂ log |H|/∂κi + ∂ log |X′H⁻¹X|/∂κi
      = tr(H⁻¹Ḣi) + tr( (X′H⁻¹X)⁻¹ ∂(X′H⁻¹X)/∂κi )
      = tr(H⁻¹Ḣi) − tr( (X′H⁻¹X)⁻¹ X′H⁻¹ḢiH⁻¹X )

      = tr( H⁻¹Ḣi − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹Ḣi )
      = tr(PḢi)        (5.2.9)

Now consider the derivative of the sum of squares in (5.2.3):

    ∂y′Py/∂κi = y′(∂P/∂κi)y = −y′PḢiPy        (5.2.10)

using result ??. Combining (5.2.9) and (5.2.10) gives the result as required. □
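For small problems, (5.2.3) and (5.2.8) can be evaluated directly from their definitions. The following sketch is a naive illustration only (the function names and the one-way example are ours, not taken from any software described in this book; efficient implementations work through the mixed model equations, as discussed in section 5.3):

    import numpy as np

    def reml_pieces(y, X, H, sigma2H=1.0):
        """Residual log-likelihood (5.2.3), up to a constant, computed naively."""
        n, p = X.shape
        Hi = np.linalg.inv(H)
        XtHiX = X.T @ Hi @ X
        P = Hi - Hi @ X @ np.linalg.inv(XtHiX) @ X.T @ Hi
        yPy = float(y @ P @ y)
        ldH = np.linalg.slogdet(H)[1]
        ldX = np.linalg.slogdet(XtHiX)[1]
        lR = -0.5 * ((n - p) * np.log(sigma2H) + ldH + ldX + yPy / sigma2H)
        return lR, P, yPy

    def score_kappa(y, P, Hdot, sigma2H=1.0):
        """Score for a variance parameter kappa_i, equation (5.2.8)."""
        return -0.5 * (np.trace(P @ Hdot) - float(y @ P @ Hdot @ P @ y) / sigma2H)

    # One-way classification sketch: H = I_n + gamma_u Z Z'
    rng = np.random.default_rng(1)
    b, r = 4, 5
    Z = np.kron(np.eye(b), np.ones((r, 1)))
    X = np.ones((b * r, 1))
    H = np.eye(b * r) + 0.8 * Z @ Z.T    # gamma_u = 0.8
    y = rng.standard_normal(b * r)

    lR, P, yPy = reml_pieces(y, X, H)
    print(lR, score_kappa(y, P, Z @ Z.T))   # dH/d gamma_u = Z Z'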

5.2.3 REML likelihood for the split plot trial

In this section we derive the REML log-likelihood for the split plot example presented in section 3.3. Since the development for the split plot example was in terms of variance components, not ratios, we simply set σ²_H = 1. Thus the variance parameter vector γ is a vector of variance components. In the following, for consistency with the notation used in section 3.3, we denote these by γ1 = σb² and γ2 = σw². Recall that since the design has an orthogonal block structure,

    H = Σ³ᵢ₌₁ ξi Pi

where

    ξ1 = wsσb² + sσw² + σ²,  ξ2 = sσw² + σ²,  ξ3 = σ²

and

    P1 = Ib ⊗ Aw ⊗ As

    P2 = Ib ⊗ Bw ⊗ As

    P3 = Ib ⊗ Iw ⊗ Bs

and further we define P0 = Ab ⊗ Aw ⊗ As. The REML log-likelihood is

    ℓ_R = −½ { log |H| + log |X′H⁻¹X| + y′Py }

Now

    log |H| = Σ³ᵢ₌₁ νi log ξi

            = b log ξ1 + b(w − 1) log ξ2 + bw(s − 1) log ξ3

since νi is the rank of Pi. Also, using the property of general balance, it can be shown that

    X′H⁻¹X = bT1/ξ1 + bT2/ξ2 + bT3/ξ3 + bT4/ξ3        (5.2.11)

and so, ignoring a constant,

    log |X′H⁻¹X| = −log ξ1 − (w − 1) log ξ2 − (s − 1) log ξ3 − (s − 1)(w − 1) log ξ3

Obtaining y′Py is a little tedious; however, after some algebra it can be shown that

    y′Py = y′ [ ξ1⁻¹ (Bb ⊗ Aw ⊗ As) + ξ2⁻¹ (Ib ⊗ Bw ⊗ As − Ab ⊗ Bw ⊗ As)
              + ξ3⁻¹ (Ib ⊗ Iw ⊗ Bs − Ab ⊗ Aw ⊗ Bs − Ab ⊗ Bw ⊗ Bs) ] y
         = s1/ξ1 + s2/ξ2 + s3/ξ3

since, for example, the sum of squares due to the fungicide treatment is given by λ22(T2τ̂[2])′(T2τ̂[2]), which equals

    y′(Ab ⊗ Bw ⊗ As)y

Gathering terms we get the REML log-likelihood

    ℓ_R = −½ { (b − 1) log ξ1 + (b − 1)(w − 1) log ξ2 + (b − 1)w(s − 1) log ξ3 + s1/ξ1 + s2/ξ2 + s3/ξ3 }

This is the same as the likelihood given in 3.3.32. Differentiation with respect to the stratum variances ξi, i = 1, 2, 3, leads to the usual ANOVA estimates, as required.

5.3 Estimation of fixed and random effects

5.3.1 A little on prediction

Suppose we have a random variable T we wish to predict using data y. For example, T might be u from the linear mixed model (4.1.1). The best predictor of T, denoted by T̃ = T̃(y), is the predictor minimizing the prediction mean square error (MSE)

    min_{T̃} E[ {T − T̃(y)}² ]

The MSE can be expressed in the following manner using conditional expectations (see section ??):

    E[ (T − T̃(y))² ] = E[ E{ (T − T̃(y))² | y } ]
      = E[ E(T²|y) − 2E(T|y)T̃(y) + T̃(y)² ]
      = E[ var(T|y) + {E(T|y)}² − 2E(T|y)T̃(y) + T̃(y)² ]
      = E[ var(T|y) ] + E[ {T̃(y) − E(T|y)}² ]
      ≥ E[ var(T|y) ]

Thus the MSE is minimized if and only if T̃(y) = E(T|y), and the best predictor of T is the conditional expectation

    T̃(y) = E(T|y)        (5.3.12)

This is a very important result and, in conjunction with the normality assumption, allows the specification of predictors and their distribution in a relatively simple manner.

5.3.2 Prediction in linear mixed models

In all the linear mixed models presented in chapter 3, except the malt run example, the emphasis of the analysis was the efficient estimation of fixed treatment effects. The presence of the random effects in the mixed model resulted from the stratification of the experimental units and the subsequent restricted randomisation of units to treatments. In the malt run example, the random effects in the linear model resulted in a natural decomposition of the total variation in diastatic power into variation between malt runs and variation within malt runs. At no stage have we been concerned with "estimating" the random effects. We use the term "estimate" in a loose sense, in that we cannot estimate a random variable in the same way we estimate a fixed parameter. To illustrate these concepts, we again consider the one-way random effects model, but now in a context where the estimation of the random effects is of interest. The model is given by

    yij = μ + ui + eij  or  y = 1nμ + Zu + e        (5.3.13)

The model (5.3.13) is applicable to animal breeding data where, for example, yij may be the milk yield of the jth dairy cow, who is the daughter of the ith bull. The bull effects ui are regarded as random effects. For simplicity we assume there are r dairy cows for each of b bulls, so we have a total of n = rb records. As before, we assume that ui ∼ N(0, σu²) and eij ∼ N(0, σ²), and in addition that ui and eij are statistically independent for all i and j. Further, for consistency with the development given earlier, we set σ²_H = 1, although in general we would prefer to use R = In and σ²_H = σ². In this example the focus is not only to estimate the variance components: more importantly, perhaps, it is of interest to predict the ui, as this measures the performance, or more correctly the genetic merit, of the ith bull in terms of its (inherited) milk production. This may then be used to select bulls for mating. However, as we noted, ui is a random effect, and so the best we can do, in terms of minimum mean squared error of prediction, is to consider its expected value given the data. That is, we seek to estimate the conditional mean E(ui|y). To do this we use the results of section 5.3.1. Firstly we find the joint distribution of y and u. We present two alternate derivations. The first uses the orthogonal block structure and is useful to link with the approach used in chapter 3. The second approach is useful as it presents results in a mixed model setting rather than the ANOVA setting. The joint distribution of y and u is

    [ y ]        ( [ 1nμ ]   [ H       σu²Z   ] )
    [ u ]  ∼  N  ( [ 0   ],  [ σu²Z′   σu²Ib  ] )        (5.3.14)

where var(y) = H and recall that Z (n × b) = Ib ⊗ 1r. Also we recall from (3.1.6) that

    H = ξ1P1 + ξ2P2

where ξ1 = rσu² + σ², ξ2 = σ² and

    P1 = Ib ⊗ Ar,  P2 = Ib ⊗ Br

and hence

    H⁻¹ = ξ1⁻¹P1 + ξ2⁻¹P2
    Z′H⁻¹ = ξ1⁻¹Z′

Alternatively, we can derive this using results on the inverse of H. Thus if we let γu = σu²/σ² then

    σ²H⁻¹ = In − Z(Z′Z + γu⁻¹Ib)⁻¹Z′

and hence

    σ²Z′H⁻¹ = Z′ − Z′Z(Z′Z + γu⁻¹Ib)⁻¹Z′
            = [ (Z′Z + γu⁻¹Ib) − Z′Z ] (Z′Z + γu⁻¹Ib)⁻¹Z′
            = γu⁻¹ [ (r + γu⁻¹)Ib ]⁻¹ Z′
            = { γu⁻¹/(r + γu⁻¹) } Z′
            = σ²Z′/ξ1

Hence, using result ??, we see that

    ũ = E(u|y) = 0 + σu²Z′H⁻¹(y − 1nμ)
               = { σu²/(rσu² + σ²) } Z′(y − 1nμ)

Note, however, that this result depends on the unknown parameter μ. It is intuitively obvious that we replace μ by ȳ; however, it is not sufficient to condition on y. We need to condition on the component of y that is free of μ. This is the part of the data used for the REML estimation of the variance components, that is y2, and details are given in this chapter (see result 5.9). For the moment we shall replace μ by ȳ, giving

    ũ = { σu²/(rσu² + σ²) } Z′(y − 1nȳ)
      = { rσu²/(rσu² + σ²) } (ȳ_b − 1b ȳ)

where ȳ_b = (ȳ1, . . . , ȳb)′ is the b × 1 vector of bull means. This predictor ũ is in fact the Best Linear Unbiased Predictor (BLUP) of u. BLUPs are predictors of the realised values of the random variables (effects) u which are

• linear functions of the data,
• unbiased, in the sense that the expected value of the estimate is equal to the expected value of the quantity being estimated,
• best, in the sense that they have minimum mean squared error within the class of linear unbiased estimators, and
• predictors, to distinguish them from estimators of fixed effects.

In this example it is clear that the BLUP is a shrinkage estimate: compared to the fixed effect estimate ȳi − ȳ, the BLUP is shrunk towards the (prior) mean of zero. As σu² becomes large relative to σ², the BLUP of ui tends to the fixed effect solution, while for small σu² relative to σ² the BLUP of ui tends towards zero, the assumed initial mean. Thus the BLUP represents a weighted mean of the fixed effect estimate and the prior mean of the ui; a numerical sketch follows below. Note also that the BLUPs in this simple case sum to zero. This is essentially because the unit vector defining X can be found by summing the columns of the Z matrix. This linear dependence of the matrices translates to constraints on the BLUPs. This constraint occurs whenever the column space of X is contained in the column space of Z. The constraint is more complex with correlated random effects. We shall return to this later.
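The shrinkage behaviour just described is easily seen numerically. A minimal sketch, assuming numpy, using simulated data with the variance components treated as known (all names and values are ours, chosen for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    b, r = 6, 4                      # b bulls, r daughters per bull
    sigma2_u, sigma2 = 1.0, 2.0      # variance components (assumed known here)

    u = rng.normal(0.0, np.sqrt(sigma2_u), b)
    y = 10.0 + np.repeat(u, r) + rng.normal(0.0, np.sqrt(sigma2), b * r)

    ybar_i = y.reshape(b, r).mean(axis=1)   # bull means
    ybar = y.mean()                          # grand mean

    # The BLUP shrinks each fixed-effect style deviation towards zero
    shrink = r * sigma2_u / (r * sigma2_u + sigma2)
    u_fixed = ybar_i - ybar
    u_blup = shrink * u_fixed

    print(np.round(u_fixed, 3))
    print(np.round(u_blup, 3))
    print(u_blup.sum())   # numerically zero, as noted above

We now present a general result concerning prediction in the general linear mixed model (4.1.1).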

Result 5.4 Consider the prediction of the linear combination c1′τ + c2′u of fixed and random effects, where c1 is p × 1 and c2 is b × 1. If σ²_H and H are known, implying σ²_H and κ are known, the predictor which has the minimum mean square error (MSE) among the class of linear unbiased predictors is given by c1′τ̂ + c2′ũ where

    τ̂ = (X′H⁻¹X)⁻¹X′H⁻¹y        (5.3.15)
    ũ = GZ′Py        (5.3.16)

Proof: Let a′y be an unbiased linear predictor of c1′τ + c2′u, so that

    E(a′y) = E(c1′τ + c2′u)
    ⇒ a′Xτ = c1′τ
    ⇒ X′a = c1        (5.3.17)

The MSE is given by

    MSE = E[ (a′y − c1′τ − c2′u)² ]
        = E[ (a′y − c2′u)² ] + E[ (c1′τ)² ] − 2E(a′y − c2′u)E(c1′τ)
        = var(a′y − c2′u) + {E(a′y − c2′u)}² + E[ (c1′τ)² ] − 2E(a′y − c2′u)E(c1′τ)
        = var(a′y − c2′u) + 2(c1′τ)² − 2(c1′τ)²    using (5.3.17)
        = var(a′y − c2′u)
        = var(a′y) + var(c2′u) − a′cov(y, u)c2 − c2′cov(u, y)a
        = σ²_H ( a′Ha + c2′Gc2 − a′ZGc2 − c2′GZ′a )
        = σ²_H ( a′Ha + c2′Gc2 − 2a′ZGc2 )

To minimise this with respect to a, subject to (5.3.17), Lagrange multipliers are used. The function to be minimised is

    M = MSE + 2σ²_H λ′(c1 − X′a)

where λ (p × 1) is a vector of multipliers. Using results on the derivative of matrices (see section ??),

    ∂M/∂a = 2σ²_H ( Ha − ZGc2 − Xλ )
    ∂M/∂λ = 2σ²_H ( c1 − X′a )

Equating the derivatives to zero gives:

    a = H⁻¹(ZGc2 + Xλ)        (5.3.18)
    c1 = X′a

Then (5.3.17) and (5.3.18) give

    λ = (X′H⁻¹X)⁻¹ ( c1 − X′H⁻¹ZGc2 )

Substituting this expression in (5.3.18) gives

    a = H⁻¹ZGc2 + H⁻¹X(X′H⁻¹X)⁻¹c1 − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹ZGc2
      = PZGc2 + H⁻¹X(X′H⁻¹X)⁻¹c1

    ⇒ a′y = c1′(X′H⁻¹X)⁻¹X′H⁻¹y + c2′GZ′Py
          = c1′τ̂ + c2′ũ

as required. □

As a consequence, the minimum variance estimate of τ is obtained by setting c2 to zero and taking the sequence of vectors c1 = (1, 0, . . . , 0)′, . . . , c1 = (0, 0, . . . , 1)′. This leads to τ̂ as the best linear unbiased estimator (BLUE) of τ. It is also the generalised least squares (GLS) estimate of τ. A similar process leads to ũ as the best linear unbiased predictor (BLUP) of u. This derivation of BLUP does not explicitly provide the link with the estimation of the variance parameters through REML. This link is provided by considering the original derivation given by Henderson (1950). Henderson (1950) described the BLUP estimates as being "joint maximum likelihood estimates". Later, Henderson (1973) retracted this statement and suggested that this terminology should not be used, as the function being maximised is not a likelihood. Robinson (1991) provides an excellent account of this and many other aspects concerning the prediction of random effects.

5.3.3 Mixed Model Equations

Following Henderson's (1950) suggestion we will derive the estimates τ̂ and ũ in (5.3.15) and (5.3.16) by maximising a function derived from the joint distribution of y and u. The latter is given by:

    [ y ]        ( [ Xτ ]        [ H    ZG ] )
    [ u ]  ∼  N  ( [ 0  ], σ²_H  [ GZ′  G  ] )        (5.3.19)

The log-density function for (y, u) can be written as

    log f_Y(y | u; τ, σ²_H, σ², φ) + log f_U(u; σ²_H, γ)        (5.3.20)

This is the log-joint distribution of (y, u). It is not a log-likelihood, as u is not observed. From (5.3.19) the marginal distribution of u is u ∼ N(0, σ²_H G) and the conditional distribution of y given u is y|u ∼ N(Xτ + Zu, σ²_H R). The function in (5.3.20), ignoring the constant term, is then given by

    −½ { n log σ²_H + log |R| + (y − Xτ − Zu)′R⁻¹(y − Xτ − Zu)/σ²_H }
    −½ { b log σ²_H + log |G| + u′G⁻¹u/σ²_H }

which equals

    −½ { (n + b) log σ²_H + log |R| + log |G| + (y − Xτ)′R⁻¹(y − Xτ)/σ²_H }
    −½ { u′(Z′R⁻¹Z + G⁻¹)u − 2(y − Xτ)′R⁻¹Zu } /σ²_H

The vectors of fixed and random effects (τ and u) can be "estimated" by maximising this function. Differentiation with respect to τ and u and setting to zero leads to

    X′R⁻¹(y − Xτ̂) − X′R⁻¹Zũ = 0
    Z′R⁻¹(y − Xτ̂) − (Z′R⁻¹Z + G⁻¹)ũ = 0

This system of equations is known as the mixed model equations (MME), as proposed by Henderson (1950, 1973). They can be written in matrix-vector notation as:

    [ X′R⁻¹X   X′R⁻¹Z       ] [ τ̂ ]   [ X′R⁻¹y ]
    [ Z′R⁻¹X   Z′R⁻¹Z + G⁻¹ ] [ ũ ] = [ Z′R⁻¹y ]        (5.3.21)

A more abbreviated representation of the MMEs, which is used in this book, is

    Cβ̃ = W′R⁻¹y        (5.3.22)

where W = [X Z], β′ = [τ′ u′] and

    C = W′R⁻¹W + G*,  G* = [ 0  0
                             0  G⁻¹ ]

In the following we show that the solutions to the MMEs are in fact equivalent to the estimates τ̂ and ũ in (5.3.15) and (5.3.16). In chapter ?? we illustrate how this result allows for a unified computing strategy for both variance parameter estimation and the estimation and prediction of fixed and random effects, centred on the mixed model equations.

Result 5.5 The estimates τ̂ and ũ in (5.3.15) and (5.3.16) can be obtained as solutions to the mixed model equations.

Proof: Write the coefficient matrix in (5.3.21) as

    C = [ C_XX  C_XZ
          C_ZX  C_ZZ ]        (5.3.23)

where the partitioning is conformal with the fixed and random effects design matrices X and Z. Thus the MMEs are given by

    C_XX τ̂ + C_XZ ũ = c_Xy

    C_ZX τ̂ + C_ZZ ũ = c_Zy

where for convenience of notation we write c_Xy = X′R⁻¹y and c_Zy = Z′R⁻¹y. These equations are solved by first substituting the second set of equations (for ũ) into the first set (for τ̂) in a process known as absorption (see also chapter ??). The second set gives:

    ũ = C_ZZ⁻¹ c_Zy − C_ZZ⁻¹ C_ZX τ̂        (5.3.24)

Substituting into the first set gives:

    ( C_XX − C_XZ C_ZZ⁻¹ C_ZX ) τ̂ = c_Xy − C_XZ C_ZZ⁻¹ c_Zy        (5.3.25)

Note that

    C_XX − C_XZ C_ZZ⁻¹ C_ZX
      = X′R⁻¹X − X′R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹X
      = X′[ R⁻¹ − R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹ ]X
      = X′H⁻¹X

using the matrix identity in result ??. Thus, using (5.3.25),

    τ̂ = (X′H⁻¹X)⁻¹ [ X′R⁻¹y − X′R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹y ]

       = (X′H⁻¹X)⁻¹ X′[ R⁻¹ − R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹ ]y

       = (X′H⁻¹X)⁻¹ X′H⁻¹y

using the matrix identity in result ??. The solution for ũ is then obtained by back-substitution, that is, substituting the above expression for τ̂ into (5.3.24). This gives:

    ũ = C_ZZ⁻¹ [ c_Zy − C_ZX (X′H⁻¹X)⁻¹X′H⁻¹y ]
      = (Z′R⁻¹Z + G⁻¹)⁻¹ [ Z′R⁻¹y − Z′R⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹y ]

      = (Z′R⁻¹Z + G⁻¹)⁻¹ Z′R⁻¹[ I − X(X′H⁻¹X)⁻¹X′H⁻¹ ]y

      = GZ′H⁻¹[ I − X(X′H⁻¹X)⁻¹X′H⁻¹ ]y    using result ??
      = GZ′Py ≡ GZ′H⁻¹(y − Xτ̂)

Thus the solutions to the MMEs are the GLS estimate τ̂ and the BLUP ũ. □

Having shown the equivalence between the two sets of estimates, we present two more results which prove to be useful. These results obtain an expression for the inverse of the coefficient matrix C and present a convenient alternative expression for P in terms of C⁻¹, R⁻¹ and W.

Result 5.6

    C⁻¹ = [ (X′H⁻¹X)⁻¹             −(X′H⁻¹X)⁻¹X′H⁻¹ZG
            −GZ′H⁻¹X(X′H⁻¹X)⁻¹     G − GZ′PZG         ]

Proof: Let

    C⁻¹ = [ C^XX  C^XZ        = [ T     −TS′
            C^ZX  C^ZZ ]          −ST   C_ZZ⁻¹ + STS′ ]        (5.3.26)

where T = ( C_XX − C_XZ C_ZZ⁻¹ C_ZX )⁻¹ and S = C_ZZ⁻¹ C_ZX. Thus

    T = [ X′R⁻¹X − X′R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹X ]⁻¹

      = { X′[ R⁻¹ − R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹ ]X }⁻¹

      = [ X′(R + ZGZ′)⁻¹X ]⁻¹    using result ??

      = (X′H⁻¹X)⁻¹

Also,

    S = (Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹X
      = GZ′(R + ZGZ′)⁻¹X
      = GZ′H⁻¹X

using result ??. Thus

    C^XZ = −(X′H⁻¹X)⁻¹X′H⁻¹ZG

and

    C^ZZ = (Z′R⁻¹Z + G⁻¹)⁻¹ + GZ′H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹ZG        (5.3.27)
         = (Z′R⁻¹Z + G⁻¹)⁻¹ + GZ′H⁻¹ZG − GZ′PZG
         = (Z′R⁻¹Z + G⁻¹)⁻¹ + (Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹ZG − GZ′PZG
         = (Z′R⁻¹Z + G⁻¹)⁻¹ ( I + Z′R⁻¹ZG ) − GZ′PZG
         = (Z′R⁻¹Z + G⁻¹)⁻¹ ( G⁻¹ + Z′R⁻¹Z ) G − GZ′PZG
         = G − GZ′PZG        (5.3.28)

Gathering terms gives the result:

" 0 −1 −1 0 −1 −1 0 −1 # −1 X H X − X H X X H ZG C = −1 −GZ0H−1X X0H−1X G − GZ0PZG giving the result. 2

Result 5.7 The matrix P = H⁻¹ − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹ can also be written as R⁻¹ − R⁻¹WC⁻¹W′R⁻¹.

Proof: First note that

    H⁻¹ZG = (R + ZGZ′)⁻¹ZG = R⁻¹ZK

where K = (Z′R⁻¹Z + G⁻¹)⁻¹. Then

    C^XZ = −(X′H⁻¹X)⁻¹X′R⁻¹ZK

and in (5.3.27)

    C^ZZ = K + KZ′R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹ZK

so that

    R⁻¹ − R⁻¹WC⁻¹W′R⁻¹
      = R⁻¹ − R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹
        + R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹ZKZ′R⁻¹
        + R⁻¹ZKZ′R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹
        − R⁻¹ZKZ′R⁻¹
        − R⁻¹ZKZ′R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹ZKZ′R⁻¹
      = R⁻¹ − R⁻¹ZKZ′R⁻¹
        − ( R⁻¹ − R⁻¹ZKZ′R⁻¹ ) X(X′H⁻¹X)⁻¹X′R⁻¹
        + ( R⁻¹ − R⁻¹ZKZ′R⁻¹ ) X(X′H⁻¹X)⁻¹X′R⁻¹ZKZ′R⁻¹
      = H⁻¹ − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹
      = P

as required. □

Finally, we present a convenient expression for the vector of estimated residuals ẽ.

Result 5.8 BLUPs of the residuals are given by ẽ = RPy.

Proof:

    ẽ = y − Wβ̃
      = y − WC⁻¹W′R⁻¹y
      = R ( R⁻¹ − R⁻¹WC⁻¹W′R⁻¹ ) y
      = RPy

using (5.3.22) and result 5.7. A comparison with (5.3.16) shows that this has the same form as the BLUP of the random effects. □

The estimates of the fixed and random effects involve the variance matrices G and R. These matrices are assumed to be functions of the variance parameters γ, σ² and φ, which are usually unknown and must be estimated from the data. In the examples considered in chapter 3, except the balanced incomplete block design, estimates of fixed effects were available within strata and hence did not involve estimates of the variance parameters, though the estimated variances of the fixed effects depend on the variance parameters. However, the BLUP of u in the one-way classification (and more generally) requires estimates of the variance components σu² and σ², or the variance component ratio γu = σu²/σ²_H. For the remainder of the book we will replace the unknown values of the variance parameters by their REML estimates in calculating the solutions to the mixed model equations. These are then termed empirical generalised least squares (EGLS) estimates of the fixed effects and empirical best linear unbiased predictors of the random effects. This can affect inference, particularly in small samples, and we return to this issue in chapter 6.
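The equivalence established in Results 5.5-5.8 is easily checked numerically. The following sketch is a brute-force illustration on a small simulated example with σ²_H = 1 (the matrices, values and names are ours, not those of any particular package): it forms and solves the MMEs (5.3.21)-(5.3.22) and compares the solutions with the direct expressions (5.3.15) and (5.3.16).

    import numpy as np

    rng = np.random.default_rng(3)
    b, r = 4, 3
    n = b * r
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # p = 2 fixed effects
    Z = np.kron(np.eye(b), np.ones((r, 1)))                    # b random effects
    G = 0.7 * np.eye(b)                                        # var(u), sigma2_H = 1
    R = 1.3 * np.eye(n)                                        # var(e)
    y = X @ np.array([10.0, 2.0]) + Z @ rng.normal(0, np.sqrt(0.7), b) \
        + rng.normal(0, np.sqrt(1.3), n)

    Ri, Gi = np.linalg.inv(R), np.linalg.inv(G)

    # Mixed model equations (5.3.22): C beta = W' R^{-1} y
    W = np.hstack([X, Z])
    Gstar = np.zeros((W.shape[1], W.shape[1]))
    Gstar[X.shape[1]:, X.shape[1]:] = Gi
    C = W.T @ Ri @ W + Gstar
    beta = np.linalg.solve(C, W.T @ Ri @ y)
    tau_mme, u_mme = beta[:X.shape[1]], beta[X.shape[1]:]

    # Direct GLS and BLUP, (5.3.15) and (5.3.16)
    H = R + Z @ G @ Z.T
    Hi = np.linalg.inv(H)
    tau_gls = np.linalg.solve(X.T @ Hi @ X, X.T @ Hi @ y)
    P = Hi - Hi @ X @ np.linalg.inv(X.T @ Hi @ X) @ X.T @ Hi
    u_blup = G @ Z.T @ P @ y

    assert np.allclose(tau_mme, tau_gls) and np.allclose(u_mme, u_blup)

The MME form involves only the inversion of C, of order p + b, rather than H, of order n, which is the basis of the computing strategy discussed in chapter 7.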

5.4 An approach to prediction in linear mixed models - REML construct

We conclude this chapter with some important results on prediction in linear mixed models. We present an approach to prediction in linear mixed models based on the notion of the REML construct. This is used extensively in the following chapters as a device to form general predictions (see chapter 7) and for the derivation of the REML-EM algorithm (see appendix A).

5.4.1 REML construct

We consider the joint distribution of ( (L2′y)′, u′, e′ )′. This is given by

    [ y2 ]        ( [ 0 ]         [ L2′HL2  L2′ZG  L2′R ] )
    [ u  ]  ∼  N  ( [ 0 ], σ²_H   [ GZ′L2   G      0    ] )
    [ e  ]        ( [ 0 ]         [ RL2     0      R    ] )

Thus, using (5.2.2), we can derive the conditional distribution of u given y2. Given σ²_H and κ we have

    E(u|y2) = 0 + GZ′L2(L2′HL2)⁻¹L2′y
            = GZ′Py    using result ??
            = ũ

Furthermore,

    var(u|y2) = σ²_H [ G − GZ′L2(L2′HL2)⁻¹L2′ZG ]
              = σ²_H ( G − GZ′PZG )
              = σ²_H C^ZZ

This result resolves the issue that we alluded to in section 5.3.2. Conditioning on y2 rather than y produces the usual BLUP for u, with the estimate of τ automatically incorporated. Furthermore, the conditional variance is in fact the prediction error variance. That is,

    var(u|y2) = var(ũ − u)        (5.4.29)

This is a very important result, which we summarise in the following.

Result 5.9

    u|y2 ∼ N( ũ, σ²_H C^ZZ )

where C^ZZ is the portion of C⁻¹ relating to u.

Similarly, we can derive the conditional distribution of e given y2. Given σ²_H and κ we have

    E(e|y2) = 0 + RL2(L2′HL2)⁻¹L2′y = RPy = ẽ

Furthermore,

    var(e|y2) = σ²_H [ R − RL2(L2′HL2)⁻¹L2′R ]
              = σ²_H ( R − RPR )
              = σ²_H WC⁻¹W′

Again this shows that the conditional mean is the BLUP of e and the conditional variance is the prediction error variance var(ẽ − e), which we summarise in the following result.

Result 5.10

    e|y2 ∼ N( ẽ, σ²_H WC⁻¹W′ )

5.5 Summary

In this chapter we have defined the residual likelihood and illustrated how Residual Maximum Likelihood estimation for variance parameters in linear mixed models may proceed. As well, we have considered the problem of prediction, both in a general sense and, more extensively, as applied to linear mixed models. The basic theory required for the implementation of the computing strategy and inference has also been set in place. In summary, we have

• defined and derived the residual likelihood
• derived the estimating equations (ie the REML score equations) for the variance parameters
• introduced the general concept of prediction for random (and fixed) effects
• derived formulae for the Best Linear Unbiased Predictor and Generalised Least Squares (GLS) estimates in the linear mixed model
• introduced the mixed model equations and showed that their solutions are the Best Linear Unbiased Predictors and GLS estimates
• introduced the concept of the REML construct.

CHAPTER 6

Inference

In chapter 2 we considered likelihood ratio tests for the fixed effects in the linear model and showed the connection between the likelihood ratio test and the F-test for the hypothesis that τ = 0. Hypothesis testing in linear mixed models is, in general, more difficult. Tests for fixed effects cannot be readily constructed from likelihood ratio tests. This is because, as the design matrix X changes, so does the construct y2 used to form the REML likelihood. Welham and Thompson (1997) consider this problem and developed adjusted likelihood ratio tests. Our approach is to base inference on Wald tests. These tests, although valid only asymptotically, are simple to compute and have been readily implemented in our software. For balanced designs with orthogonal block structure there is a direct equivalence between the Wald test and the F-test from the ANOVA table. In general, however, it is not possible to obtain such equivalence. Kenward and Roger (1997) have considered inference for fixed effects in linear mixed models in small samples. Their approach adjusts the denominator degrees of freedom to obtain approximately valid t-tests. This adjustment gives the exact test for orthogonal designs. In this chapter we consider the general problem of inference for variance parameters and fixed and random effects. Tests of hypotheses for variance parameters, using REML likelihood ratio tests, are presented in section 6.1. We present a general result for the distribution of β̃, which is then used to derive tests for fixed effects in section 6.2. We conclude the chapter by considering inference for random or a mixture of fixed and random effects in sections 6.2.4 and 6.2.5.

6.1 General hypothesis tests for variance models

To develop tests for general (nested) variance models we use REML likelihood ratio tests. For ease of notation we refer to the REsidual Maximum Likelihood Ratio Test by the acronym REMLRT. We summarise the approach in the following result.

Result 6.1 For a comparison of (nested) models M0 and M1 with the same fixed model, where M1 contains an extra k variance parameters, the REMLRT statistic is given by

    D = −2(ℓ_R0 − ℓ_R1)

where ℓ_Ri is the residual log-likelihood for model i. The statistic D is asymptotically distributed as a chi-squared variable with k

degrees of freedom. The exception is when the test involves a null hypothesis with the parameter on the boundary of the parameter space. We illustrate this by way of example in section 6.1.2. Again we stress that the REMLRT cannot be used to compare models which differ in their fixed effects. This is clear from the derivation of the residual likelihood as the marginal likelihood of error contrasts. A change in the fixed part of a model will therefore change the basis of the residual likelihood.

6.1.1 Information criteria for non-nested models

The REMLRT can only be used to compare "nested" models. In these cases we reject model M0 in favour of model M1 if D exceeds a pre-specified critical value, cα, which depends on the size of the test and the difference in degrees of freedom (ie the number of additional variance parameters) between the two models. The REMLRT cannot be used to compare non-nested models. In these situations the Akaike Information Criterion, AIC (Akaike, 1973), has been proposed as a model selection criterion. For this setting it is given by

    AIC = −2ℓ_R + 2k        (6.1.1)

where ℓ_R is the value of the maximised REML log-likelihood and k is the number of variance parameters being estimated. That is, the AIC "penalises" the maximised residual log-likelihood for the use of too many variance parameters. Various other criteria have been suggested to improve the performance of the AIC in different settings. Some examples of these include the corrected Akaike information criterion AICC (Hurvich and Tsai, 1989) and the Bayesian information criterion BIC (Schwarz, 1978).
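A one-line helper for (6.1.1) (the function name is ours):

    def aic(loglik_reml, k):
        """Akaike information criterion (6.1.1) for a fitted variance model."""
        return -2.0 * loglik_reml + 2 * k

    # Between two non-nested variance models fitted with the same fixed model,
    # the one with the smaller AIC is preferred.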

6.1.2 Simple variance components tests: Malt run data-set

To illustrate the above methodology and demonstrate the relationship with the exact F-test, we consider the malt run example presented in section 3.1. In this example we wish to test that the variance component associated with the random effects is zero. Thus suppose we wish to test H0 : σu² = 0. The simplest way to derive the test is to use the representation given by (3.1.9). The explicit use of the strata means that the likelihood ratio statistic is easily derived and has a simple form that can be manipulated. Thus recall (3.1.9),

          [ y0 ]        ( [ L0′1nμ ]   [ ξ0  0       0      ] )
    L′y = [ y1 ]  ∼  N  ( [ 0      ],  [ 0   ξ1Ib−1  0      ] )
          [ y2 ]        ( [ 0      ]   [ 0   0       ξ2In−b ] )

Recalling that the REML likelihood is the joint likelihood of y1 and y2, it follows that if σu² is not constrained, the maximized residual likelihood is

    (2π)^{−(n−1)/2} ξ̂1^{−(b−1)/2} ξ̂2^{−(n−b)/2} e^{−(n−1)/2}

while under H0, y ∼ N(1nμ, ξ2In), since ξ2 = σ². Using standard results, the maximized residual likelihood is then

    (2π)^{−(n−1)/2} ξ̂20^{−(n−1)/2} e^{−(n−1)/2}

where ξ̂20 denotes the estimate of ξ2 under H0. Noting that

    ξ̂20 = { (b − 1)ξ̂1 + (n − b)ξ̂2 } / { (b − 1) + (n − b) }

the residual likelihood ratio statistic is (after some algebra)

    Λ = (n − 1)^{(n−1)/2} F^{(b−1)/2} / { (b − 1)F + (n − b) }^{(n−1)/2}        (6.1.2)

where

    F = ξ̂1/ξ̂2

is the ratio of the between malt run mean square and the within malt run mean square, which is an F-statistic on b − 1 and n − b degrees of freedom under H0. The likelihood ratio statistic is not a monotone function of F. In fact it has a maximum at F = 1 and declines to zero as F → 0 and F → ∞. Thus if we use the usual rule to reject H0 when Λ < cα, we reject when F is too large or too small. Up to this point we have not specified the alternative hypothesis. If H1 : σu² > 0 is our alternative, we would use the F-statistic but would only reject for large values of F. The likelihood ratio test is implicitly two-sided, and must be adjusted when the test involves an H0 with the parameter on the boundary of the parameter space. The asymptotic adjustment is to take half the P-value. In fact, it can be shown theoretically that −2 log Λ has a mixture distribution, where the mixing probabilities are 0.5 and the two distributions are a chi-square on 0 degrees of freedom (a spike at 0) and a chi-square on 1 degree of freedom (Stram and Lee, 1994).

For the malt run experiment, the estimates of the stratum variance components were ξ̂1 = 2.145 and ξ̂2 = 0.261. The F-statistic is

    F = ξ̂1/ξ̂2 = 8.214

with a P-value of 4.87 × 10⁻⁶. We therefore strongly reject H0 : σu² = 0. The asymptotic test can be carried out by fitting the two models

    y ∼ mu + maltrun    (Model 1)
    y ∼ mu              (Model 0)

and examining the change in residual log-likelihood. Note that Model 0 does not contain the maltrun factor; hence we are implicitly fitting the model with σu² = 0. We calculate −2 log Λ = −2(ℓ_R0 − ℓ_R1) = 19.25, where ℓ_Ri is the REML log-likelihood for model i = 0, 1. The P-value for this test is 5.74 × 10⁻⁶. The REML likelihood ratio test is an asymptotic test, and so the P-values will not generally agree exactly for this test and the F-test. Substituting F = 8.214 into Λ in (6.1.2) shows the numerical equivalence of the statistics for this example.
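The boundary-adjusted P-value can be computed directly. A minimal sketch, assuming scipy, reproducing the value quoted above from the observed REMLRT statistic:

    from scipy.stats import chi2

    D = 19.25                        # -2 log(Lambda) for the malt run data
    # Boundary adjustment: 0.5 chi^2_0 + 0.5 chi^2_1 mixture (Stram and Lee, 1994)
    p_value = 0.5 * chi2.sf(D, df=1)
    print(p_value)                   # approximately 5.74e-06, as quoted above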

6.2 Hypothesis testing in mixed models: fixed and random effects

In this section we consider inference in a mixed model framework for both fixed and random effects. We begin with a general result concerning the distribution of the BLUEs and BLUPs of the fixed and random effects and then consider, separately, testing fixed effects and illustrate the application of these ideas for the split plot example presented in chapter 3.

Result 6.2 β̃ − β ∼ N(0_{p+b}, σ²_H C⁻¹), where C is the coefficient matrix of the mixed model equations (5.3.21).

Proof: First recall the following basic results from chapter 5:

    β̃ = C⁻¹W′R⁻¹y
    C = W′R⁻¹W + G*
    G* = SG⁻¹S′,  S = [ 0
                        Ib ]

Thus

    E(β̃ − β) = E[ τ̂ − τ
                  ũ − u ] = 0_{p+b}

since E(τ̂) = τ and E(ũ) = E(u) = 0. Furthermore,

    var(β̃ − β) = σ²_H { var(β̃) + var(β) − cov(β̃, β) − cov(β, β̃) }
      = σ²_H { var(C⁻¹W′R⁻¹y) + var(β) − cov(C⁻¹W′R⁻¹y, β) − cov(β, C⁻¹W′R⁻¹y) }
      = σ²_H { C⁻¹W′R⁻¹(ZGZ′ + R)R⁻¹WC⁻¹ + SGS′ − C⁻¹W′R⁻¹ZGS′ − SGZ′R⁻¹WC⁻¹ }

Since Z = WS, then

    var(β̃ − β) = σ²_H { C⁻¹(W′R⁻¹W SGS′ W′R⁻¹W + W′R⁻¹W)C⁻¹ + SGS′
                        − C⁻¹W′R⁻¹W SGS′ − SGS′ W′R⁻¹WC⁻¹ }
      = σ²_H { C⁻¹W′R⁻¹WC⁻¹ + C⁻¹(W′R⁻¹W − C) SGS′ (W′R⁻¹W − C)C⁻¹ }
      = σ²_H { C⁻¹W′R⁻¹WC⁻¹ + C⁻¹ G* SGS′ G* C⁻¹ }
      = σ²_H { C⁻¹W′R⁻¹WC⁻¹ + C⁻¹ SG⁻¹S′ SGS′ SG⁻¹S′ C⁻¹ }
      = σ²_H { C⁻¹W′R⁻¹WC⁻¹ + C⁻¹G*C⁻¹ }
      = σ²_H C⁻¹(W′R⁻¹W + G*)C⁻¹
      = σ²_H C⁻¹

as required. □

6.2.1 Hypothesis tests for fixed effects

Analysis of variance and regression analysis both use F-tests for inference concerning fixed effects. Inference for fixed effects in linear mixed models introduces some difficulties. In general, the methods used in chapter 3 to construct F-tests for fixed effects from the analysis of variance cannot be used for the range of applications illustrated in this book and that arise in practice. One approach would be to use likelihood ratio methods. Unfortunately, if the fixed effects part of the model is changed, the residual likelihood changes. Thus the maximized residual likelihoods under the null and alternative hypotheses are not comparable, and alternative methods are required. As noted by Kenward and Roger (1997), conventionally, estimates of precision and inference for fixed effects are based on their asymptotic distribution through the Wald statistic. In the following we present this statistic in our setting. To test the hypothesis H0 : Lτ = l, for given L (r × p) and l (r × 1), we define the (empirical) Wald test statistic by

    W = (Lτ̂ − l)′ { L(X′Ĥ⁻¹X)⁻¹L′ }⁻¹ (Lτ̂ − l) / σ̂²_H        (6.2.3)

where the "hat" signifies that the parameters σ²_H and κ have been replaced by their REML estimates. This statistic arises from a natural generalisation of the result given in appendix ??, section ??, on the distribution of quadratic forms. However, the empirical Wald statistic in (6.2.3) is only asymptotically distributed as a chi-square random variable on r degrees of freedom. This is a result of the replacement of the unknown variance parameters by their REML estimates. The applicability of this asymptotic result is known to be inadequate for some small sample problems. For example, simulations have shown that the empirical Wald statistic tends to be anti-conservative for small samples, ie the test indicates that an effect may be important more often than expected under the null hypothesis of no effect (Kenward and Roger, 1997). Welham and Thompson (1997) have shown that this depends on the treatment degrees of freedom and residual degrees of freedom in the stratum of interest. However, the degree of anti-conservatism is dependent on the problem. In the analysis of field experiments, Lill et al. (1988) showed that the small sample behaviour of the empirical Wald test was adequate.

Kenward and Roger (1997) have introduced a scaled Wald statistic, together with an F approximation to its sampling distribution. Although their simulation studies indicate that these adjustments lead to a test statistic with improved small sample properties, the adjustments are computationally intensive. This is a particularly difficult problem if there are a large number of variance parameters or the number of fixed effects under consideration is large. As a result, the Kenward-Roger adjustments have only recently been implemented into the software we use. Computation of the adjusted variance matrix of the empirical best linear unbiased estimate of the vector of fixed effects and the approximate F-test of the hypothesis H0 : Lτ = l is presented in the following section. The reader is referred to ? for a more complete description.

It is convenient to compute the empirical Wald statistics for fixed model terms as they are added into the linear mixed model. This implies that each term is assessed eliminating the terms that appear before it, but ignoring any that appear after it.
If the fixed model terms are non-orthogonal (ie (X′H⁻¹X)⁻¹ is not block diagonal), then the empirical Wald statistics may change depending on the order in which the terms are fitted. We adopt a strategy which respects marginality and is consistent with the approach used in examining the significance of terms in regression analysis. The approach is best illustrated by way of a simple example; a computational sketch of the Wald statistic itself follows below. Consider the following linear model

    y ∼ A + A.B + C

where A, B and C are all included in the model as fixed effects. The following sequence of models would be (implicitly) fitted, with the model terms added to the model from left to right, in order to correctly assess the significance of the relevant term:

    y ∼ A + A.B + C    to assess C
    y ∼ C + A + A.B    to assess A
    y ∼ A + C + A.B    to assess A.B

Alternatively, we may calculate the F statistics by dropping terms, but respecting marginality.
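A minimal sketch of the computation in (6.2.3) (the function name and arguments are ours; var_tau_hat would be taken from the fitted model):

    import numpy as np
    from scipy.stats import chi2

    def wald_test(tau_hat, var_tau_hat, L, l):
        """Empirical Wald statistic (6.2.3).

        var_tau_hat is the estimated variance matrix of tau_hat, ie
        sigma2_H_hat * (X' H_hat^{-1} X)^{-1}, with REML estimates plugged in.
        """
        d = L @ tau_hat - l
        W = float(d @ np.linalg.solve(L @ var_tau_hat @ L.T, d))
        r = L.shape[0]
        return W, chi2.sf(W, df=r)   # asymptotic chi-square reference

As stressed above, the chi-square reference distribution is asymptotic and can be anti-conservative in small samples.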

6.2.2 Computing Kenward-Roger adjustments

6.2.3 Testing fixed effects in the split plot example

In this section we consider hypothesis tests for fixed effects in the split plot example presented in chapter 3. This is a special linear mixed model which has an orthogonal block structure and is an orthogonal design. In this case the tests of hypotheses concerning fixed effects can be performed using the F-tests in the ANOVA table. The ANOVA table is reproduced below in table 6.1. The statistics in the final column are F-statistics for the tests that the fungicide, variety and fungicide by variety effects are zero, respectively.

Table 6.1 Analysis of variance for the fungicide by variety example

Strata/Decomposition      d.f.       S.S.        M.S.    F-test
Blocks                       4  15389.808
  Mean                       1  15374.580   15374.580
  Residual                   3     15.228       5.076
Blocks.wplots                4     45.149
  Fungicide                  1     42.019      42.019    40.271
  Residual                   3      3.130       1.043
Blocks.wplots.splots       552     77.107
  Variety                   69     39.284       0.569     7.201
  F.V                       69      5.090       0.074     0.933
  Residual                 414     32.733       0.079

The analysis of variance table has been formed by partitioning the total sum of squares in each stratum into a sum of squares due to the fixed effects estimated in that stratum and an error sum of squares. For this example we have that

    τ = Tτ = Σ⁴ⱼ₌₁ Tjτ

where, for suitably chosen Tj (see 3.3.31), the four terms in the summation represent the overall mean, the main effects of fungicide, the main effects of variety and the interaction effects of fungicide by variety. Since the design is orthogonal, then

    TX′PiXT = Σ⁴ⱼ₌₁ λij Tj

for i = 1, 2, 3. The sets Tj, j = 1, 2, 3, 4, and Pi, i = 1, 2, 3, are sets of mutually orthogonal matrices summing to identity matrices of orders 140 and 560 respectively. Each treatment effect Tkτ can only be estimated in one stratum. The sum of squares due to Tkτ in stratum i is given by

    λik (Tkτ̂[i])′(Tkτ̂[i])

with df equal to the rank of Tk, where

    Tkτ̂[i] = (1/λik) Tk T X′Pi y

The sum of squares due to the fungicide treatment effects is therefore λ22(T2τ̂[2])′(T2τ̂[2]), with 1 df. The F-test for the hypothesis H0 : T2τ = 0 is given by

    F = λ22 (T2τ̂[2])′(T2τ̂[2]) / ξ̂2

This has an F-distribution with 1 and 3 degrees of freedom under H0. The Wald statistic to test the hypothesis H0: T_2 τ = 0 can be calculated by noting that

    var(T_2 τ̂) = T_2 (X'H⁻¹X)⁻¹ T_2 = λ_22⁻¹ ξ_2 T_2

Thus

    W = λ_22 ξ_2⁻¹ τ̂' T_2 τ̂ = λ_22 ξ_2⁻¹ (T_2 τ̂_[2])' (T_2 τ̂_[2])

To compute the empirical Wald statistic we replace ξ_2 by its REML estimate, and W is then equivalent to the F-test in the ANOVA table. The equivalence in this case arises because the numerator df of the F-test is one. In general W = rF, where r is the rank of the matrix L presented in section 6.2.1. That is, in this case we have the same test statistic compared with different reference distributions: the F-test is exact, while the chi-squared is often a poor approximation.

6.2.4 Inference for random effects

It is sometimes of interest to conduct inferences concerning random effects. In the analysis of crop variety evaluation data, probability statements concerning the ranking or superiority of newer varieties under study compared to commercial or standard varieties are often useful summaries of the analysis (Besag and Higdon, 1999; Smith and Cullis, 2001). The natural approach for random effects is to consider inference from the conditional distribution of the random effects given the "data". We have already seen that the mean of this conditional distribution is the BLUP of u, which is analogous to the mean of the posterior distribution of u given the data in a Bayesian approach. Thus using (5.2.2) we can derive the conditional distribution of u given y_2 and κ:

    u | y_2 ~ N( ũ, σ²_H C^ZZ )

where C^ZZ = G − GZ'PZG is the portion of C⁻¹ relating to u. Conditioning on y_2 produces the usual BLUP for u, with the estimate of τ automatically incorporated. The conditional variance is in fact the prediction error variance, that is

    var(u | y_2) = var(u − ũ)

This also avoids the somewhat artificial assumption of the vector of fixed effects τ having a prior distribution with infinite variance, as employed in the Bayesian derivation of REML. In practice, we replace κ by its REML estimate κ̂. Probability statements will therefore generally be anti-conservative, since we have ignored the uncertainty in the estimation of σ²_H and κ.

Since u is a random variable, tests involving equality do not make sense; however, we can make sensible probability statements concerning u. For example, in the variety trial example, we may wish to determine the probability that, given the data, variety i is superior to variety j. That is

    P(u_i > u_j | y_2) = P(u_i − u_j > 0 | y_2)
                       = P( [u_i − ũ_i − (u_j − ũ_j)] / √(σ²_H s'C^ZZ s) > (ũ_j − ũ_i) / √(σ²_H s'C^ZZ s) | y_2 )
                       = 1 − Φ( (ũ_j − ũ_i) / √(σ²_H s'C^ZZ s) )

say, where s is a b × 1 vector of zeroes, except for s_i = 1 and s_j = −1. Similarly we can construct (approximate) prediction intervals for u. Further examples will be presented in the analysis of examples in later chapters.
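As a sketch of how such probability statements might be computed, assume the BLUPs ũ, the C^ZZ block of C⁻¹ and the REML estimate of σ²_H are available from a fitted model; the function name and the fabricated values below are purely illustrative.

import numpy as np
from scipy.stats import norm

def prob_superior(i, j, u_tilde, Czz, sigma2_H):
    # P(u_i > u_j | y2) = 1 - Phi( (u~_j - u~_i) / sqrt(sigma^2_H s' Czz s) )
    s = np.zeros(len(u_tilde))
    s[i], s[j] = 1.0, -1.0
    pev = sigma2_H * s @ Czz @ s        # prediction error variance of u_i - u_j
    return 1.0 - norm.cdf((u_tilde[j] - u_tilde[i]) / np.sqrt(pev))

# fabricated BLUPs and C^ZZ block for three varieties
u_tilde = np.array([0.40, 0.10, -0.20])
Czz = np.diag([0.05, 0.06, 0.05])
print(prob_superior(0, 1, u_tilde, Czz, sigma2_H=1.0))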

6.2.5 Inference on a combination of fixed and random effects

In a similar manner to the approach used in the previous section we can construct probability statements concerning s_1'τ + s_2'u from the result

    (s_1'τ̂ + s_2'ũ) − (s_1'τ + s_2'u) ~ N(0, σ²_H s'C⁻¹s)

where s = [s_1', s_2']'. An example of this is in the analysis of longitudinal data using smoothing splines (see chapter ??).

6.3 Summary

In this chapter we have presented results concerning inference for variance parameters and fixed and random effects in linear mixed models and discussed the approaches we implement for such inferences in this book. For example, we

• use the REMLRT for inferences concerning variance parameters, with adjustments when the test involves an H0 with the parameter on the boundary of the parameter space
• use information criteria for non-nested variance models
• have shown that likelihood ratio methods cannot be used for inference concerning fixed effects in a REML context
• use Wald tests for inference concerning fixed effects, but exercise care with interpretation in small samples
• can make probability statements for random effects and for a mixture of fixed and random effects

CHAPTER 7

Prediction from linear mixed models

The problem of general prediction from linear mixed models has recently been considered by ? and Gilmour et al. (2003). Their work extended prediction for linear models as considered by Lane and Nelder (1982); Lane (1998). In this chapter we present a review of prediction in linear mixed models, considering the issues of prediction as well as the computational implementation of the algorithm. This chapter is based on the papers by ? and Gilmour et al. (2003). We have attempted to present their algorithm in more detail to clarify the concepts and difficulties in forming predictions from the linear mixed model.

7.1 Introduction

It is often desirable to construct predicted values from the effects fitted in the linear mixed model (4.1.1). Such predictions may be summaries in the form of treatment means in the analysis of designed experiments or fitted values from a multiple regression. As we shall see in part II of the book, we often also require summaries of quite complex analyses.

Lane and Nelder (1982) described an approach for forming predictions in general(ised) linear models. Briefly, their approach involves choosing a combination of variables to be predicted, forming the fitted values for these combinations with all combinations of other variables in the model, and taking marginal means across the variables not relevant to the current prediction. Their approach has been implemented in GENSTAT 5. It has been widely tested and is suitable for most applications. Some of the computational limitations with the calculation of the standard errors of predicted values have recently been removed (Lane, 1998). This algorithm, however, is not generally suitable for use in linear mixed models.

In simple balanced applications of the linear mixed model (see chapter 3, for example), predictions are formed by consideration of the relevant tables of treatment means. This may not be appropriate for forming predictions in the analysis of unbalanced data-sets. Furthermore, depending on the role of the random effects in the model, a decision must be made concerning the inclusion of such effects in the linear combination of effects to be used in forming predictions. For correlated random effects, information on effects present in the data may be used to predict effects not present in the data set, with prediction standard errors allowing for the extra uncertainty associated with the effect not being observed. The application of this principle to the residual error gives the kriging predictions used in geostatistics. We will first discuss the principles of prediction in linear mixed models and present two motivating examples to illustrate some of the main issues.

7.1.1 Principles of prediction

In the simplest case a prediction is the sum of the best linear unbiased predictor (BLUP) of random effects with the best linear unbiased estimate (BLUE) of fixed effects for a particular combination of explanatory variables, either averaged over or ignoring any other explanatory variables in the model. This gives a prediction for different explanatory variable combinations estimated from the current experiment. Additionally, we may also wish to consider prediction of future observations allowing for the variation due to unknown random effects. The description of prediction above implies a partition of the explanatory variables into three sets: those for which predicted values are required, called the classify set; those which are to be averaged over, called the averaging set; and those to be ignored.

The terms of the linear mixed model are constructed from these sets of explanatory variables. As we have already seen, the terms in the linear mixed model may refer to a single categorical variable (factor), a single continuous variable (covariate) or an interaction of two or more variables (see chapter 1). However an additional complication in linear mixed models is that terms can be classified as either fixed or random. We will first consider the role of fixed and random single factor terms in the model separately with respect to prediction. Fixed terms have an associated set of effects (parameters) to be estimated. Random terms have the additional constraint that the associated set of effects are normally distributed with zero mean and a variance matrix that is a function of unknown parameters. We have considered examples of linear mixed models in the analysis of designed experiments in which the random terms may represent error terms due to randomisation or other structure in the data (chapter 3). Alternatively they may be used to account for (co)variance in the data, for example, in the analysis of repeated measures data.

Random factor terms may contribute to predictions in several ways. They may be evaluated at a given value(s) specified by the user, they may be averaged over, or, unlike fixed effects, they may be omitted from the linear combination of effects used to form the prediction. Averaging over the set of random effects gives a prediction relevant for the set of random effects in the data-set. We use the term "conditional" predictive margin to describe this type of prediction. Omitting a (random) term from the prediction implicitly produces a prediction with the random effects set to zero, which is usually the assumed prior mean of the random effects. We use the term "marginal" to describe this type of prediction.

For fixed factors, there is no natural interpretation for a prediction derived by omitting a fixed term from the fitted values. Predictions must therefore consider all fixed terms of the linear mixed model in some way: either by averaging over all the levels of the factors which are not involved in the prediction set of factors, or levels for each of these factors must be specified by the user.

For covariate terms (fixed or random), the associated effect is the estimate of the regression coefficient of the response on the covariate. Regardless of whether the term involving the covariate is classified as fixed or random, the term should be evaluated at a given value of the covariate, or averaged over several given values.
Note that omitting a covariate from the predictive model is equivalent to predicting at a zero covariate value, which may not be appropriate.

Interaction terms follow the behaviour of the explanatory variables from which they are composed. Interactions constructed from factors generate an effect for each combination of the factor levels and hence are considered in the same way as terms which involve a single factor. Interactions between covariates or between covariates and factors are treated in the same way as covariates. If the term involves an interaction between a factor and a covariate and is classified as random, then a decision regarding omission of the effects must be made.

7.1.2 Split Plot Design

The first example of prediction in a linear mixed model we consider is the split plot example considered in section 3.3. The experiment was designed to investigate the effect on yield of controlling the fungus powdery mildew in barley. Seventy varieties of barley were grown with and without fungicide application. The layout of the trial indicating the allocation of fungicide treatments to main plots and the arrangement of blocks was presented in table 3.9. The model we considered for the analysis of these data is written symbolically by

    yield ~ mu + fungicide + variety + fungicide.variety + block + block.wplot

In this case the block and block.wplot terms are regarded as error terms which have been used in the estimation of treatment effects, but are not otherwise relevant to prediction of treatment effects. This model may be written in the usual form as

    y_ijk = μ + b_i + f_r(ij) + w_ij + v_s(ijk) + (fv)_rs + e_ijk

where y_ijk is the yield on block i = 1,...,4, whole-plot j = 1, 2, sub-plot k = 1,...,70; μ is the overall constant; b_i is the effect of block i; f_r(ij) is the effect of fungicide r(ij), where r(ij) represents the randomisation of fungicides to whole-plots; w_ij is the effect of whole-plot j in block i; v_s(ijk) is the effect of variety level s, where s(ijk) represents the randomisation of varieties to sub-plots; (fv)_rs is the interaction of fungicide level r with variety level s; and e_ijk is the residual error for block i, whole-plot j, sub-plot k. We assume b = (b_1,...,b_4)', w = (w_1,1,...,w_4,2)' and e = (e_1,1,1,...,e_4,2,70)' with

    ( b )       ( ( 0 )   ( σ²_b I_4      0           0        ) )
    ( w )  ~  N ( ( 0 ) , (    0       σ²_w I_8       0        ) )
    ( e )       ( ( 0 )   (    0          0        σ² I_560    ) )

A prediction of fungicide mean l marginal to block and whole-plot would then be

    μ̂ + f̂_l + (1/70) Σ_{j=1}^{70} ( v̂_j + (fv)̂_lj )    (7.1.1)

This is different to the "conditional" prediction which includes the random effects (ie using all blocks and whole-plots):

    μ̂ + (1/4) Σ_{i=1}^{4} b̃_i + f̂_l + (1/8) Σ_{i=1}^{4} Σ_{j=1}^{2} w̃_ij + (1/70) Σ_{j=1}^{70} ( v̂_j + (fv)̂_lj )    (7.1.2)

The consequence and interpretation of the difference between these two predictions is examined in more detail when we revisit the analysis of these data in chapter ??.
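The distinction between (7.1.1) and (7.1.2) amounts to whether the block and whole-plot BLUPs enter the linear combination of effects. A sketch, using invented placeholder values for the fitted effects (in practice these come from the fitted model), is:

import numpy as np

# Placeholders for effects from a fitted split plot model; values are invented
mu_hat = 3.0
f_hat = np.array([0.25, -0.25])                   # fungicide BLUEs (levels 1, 2)
v_hat = np.zeros(70)                              # variety BLUEs
fv_hat = np.zeros((2, 70))                        # fungicide.variety BLUEs
b_tilde = np.array([0.10, -0.10, 0.05, -0.05])    # block BLUPs
w_tilde = np.zeros((4, 2))                        # whole-plot BLUPs

l = 0   # index of the fungicide level to be predicted

# (7.1.1): marginal prediction, omitting the block and whole-plot BLUPs
marginal = mu_hat + f_hat[l] + np.mean(v_hat + fv_hat[l])

# (7.1.2): conditional prediction, averaging the BLUPs over blocks/whole-plots
conditional = (mu_hat + b_tilde.mean() + f_hat[l] + w_tilde.mean()
               + np.mean(v_hat + fv_hat[l]))

For balanced data the BLUPs average to zero, so the two predictions coincide; with unbalanced data they generally differ.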

7.1.3 An unbalanced variety trial - SA wheat

The second example is taken from Smith and Cullis (2001) and involves the evaluation of elite breeding lines and commercial varieties of wheat by the South Australia crop variety evaluation programme. The major objective of crop variety evaluation programmes is to provide reliable information about the performance of potential new varieties relative to existing 'standard' commercial varieties. This dataset comprises 6028 mean yields from the separate analysis of 174 trials conducted from 1992 to 1998. Each trial is analysed separately and the mean yield for each variety is saved with a measure of precision for subsequent across trial analysis (see Smith and Cullis, 2001, for further details).

Each year trials are conducted within each of six regions (denoted by LEP - Lower Eyre Peninsula, MM - Murray Mallee, MN - Mid North, SE - South East, UEP - Upper Eyre Peninsula and Y - Yorke Peninsula). Table 7.1 presents the numbers of trials conducted in each region between 1992 and 1998. Trial locations are largely invariant between years, mainly for convenience. Elite lines are usually tested for two to three years before a decision is made regarding release or rejection. Commercial varieties remain in the programme for a longer period. Table 7.2 presents the concurrence matrix of varieties. This illustrates the degree of imbalance in the data-set.

The combined analysis is based on a linear mixed model which accounts for all sources of variation. Factors are set up to indicate the year, region, location and variety for each mean yield. Locations are nested within regions, but this is ignored when setting up the factors. The dataset can be regarded as a variety × environment array, where environments are the year × region × location combinations. Variety effects and interaction terms involving variety are regarded as random effects, and a diagonal vector of relative weights is used with a fixed residual variance to take account of plot error from individual trials (see Smith et al., 2001, for a complete discussion).

Table 7.1 Numbers of trials conducted in each region for the SA wheat dataset

Region   1992  1993  1994  1995  1996  1997  1998
LEP        3     3     3     3     3     3     3
MM         6     5     6     6     6     6     6
MN         4     4     4     4     4     3     4
SE         1     0     1     1     1     2     3
UEP        8     7     8     8     8     8     8
Y          3     3     3     3     3     3     3

The model can be written as

    yield ~ mu + year + region + region.loc + year.region + year.region.loc
            + variety + variety.year + variety.region + variety.region.loc
            + variety.year.region + variety.year.region.loc

In this analysis, variety × environment effects are partitioned and fitted as random terms to investigate patterns of variation between varieties as discussed by Patterson et al. (1977). This contrasts with the split-plot design in section 7.1.2, where random terms were used solely to define error strata.

Table 7.2 SA wheat data: numbers of varieties common across years (diagonal entries are numbers for individual years)

Year   1992  1993  1994  1995  1996  1997  1998
1992     32
1993     21    31
1994     18    20    37
1995     13    15    23    36
1996     11    11    19    24    38
1997     11    12    14    15    21    34
1998     10    10    12    13    15    23    36

We are interested in the prediction of variety mean yields both at a state level and at a regional level. There are several issues which require consideration. At a regional level, traditional arguments lead to the classification of locations within regions and years as random. Additionally we wish to predict for a broader set over years and locations for each variety, ie. to predict variety performance at an 'average' environment within each region. At a state level, the six regions cannot be regarded as a random sample, so we would require predictions specific to the regions in these data or averaged over those regions in these data. Hence the variety predictions are conditional predictions with respect to regions and marginal predictions with respect to years and locations within regions. That is, we omit all random terms except the variety and variety.region terms from the predictions.

For predictions of variety regional yields, the four-way variety × year × region × location table of predictions is formed from the relevant terms in the model, then averages are taken across years and locations within each region. For prediction of state yields for each variety, the same procedure is followed, with averages additionally being taken across regions. However, in both cases, there are new issues to consider in forming the multi-way table of predictions and in taking marginal means. Locations are nested within regions, so that it is natural to take means only across the locations present in the data for each region. This implies that a 0/1 weighting scheme be applied to region × location combinations for averaging across locations, with value 1 when a location is present in a region, zero otherwise. This nesting also generates linear dependencies (aliasing) between the region and location factors, so that not all parameters are estimable, and individual parameter estimates may change according to the order of fitting. The issue of aliasing was discussed briefly by Lane (1998). It is important to establish when aliasing will affect the calculated predictions. Another relevant aspect of the design is that, although varieties will not all be tested at each year × region × location combination, it is desirable for all variety predictions to be calculated across a common set of environments. This again requires weighting based on data presence, and predictions may again be affected by parameter aliasing. Where aliasing is present, predicted values may not be invariant to the parameterisation used. We describe a method for detecting invariant predictions in section 7.5, suggested by Gilmour et al. (2003).

This example again shows that, unlike prediction for linear models, there must be user control over the model terms to be used in fitted values for prediction. There must also be flexibility in the averaging process which recognises aliasing and nesting, and allows for different weighting schemes over factors, or combinations of factors.

7.2 The Prediction Model

7.2.1 Steps in the prediction process

In this section we present the conceptual steps involved in the prediction process. The four main steps are:

1. Choose the explanatory variable(s) and their respective values for which predictive margins are required; the variables involved will be referred to as the classify set.
2. Determine which variables should be averaged over to form predictions. The values to be averaged over must also be defined for each variable; the variables involved will be referred to as the averaging set. The combination of the classify set with these averaging variables defines a multiway hyper-table.
3. Determine which terms from the linear mixed model are to be used in forming the linear combination of effects to form the predictions for each cell in the multiway hyper-table.
4. Choose the weighting for forming predictive margins over each dimension (or combination of dimensions) of the hyper-table.

Note that after steps 1 and 2, there may be some explanatory variables in the model that do not classify the hyper-table. These variables must be either evaluated at a given value within each cell of the hyper-table, or only occur in terms that are ignored when forming the fitted values. It was explained in section 7.1.1 that fixed terms could not sensibly be ignored in forming predictions, so that factors can only be excluded from the hyper-table when they appear in random terms only. Whether terms including these factors should be used when forming predictions depends on the application and purpose of prediction.

7.2.2 Prediction process

Prediction involves forming a linear function of β̃. If we denote the vector of predictive margins of interest by π̃, then

    π̃ = Dβ̃    (7.2.3)

say, for some matrix D (d × t), where t = p + b is the number of effects in β. It follows that (see result 6.2)

    D(β̃ − β) = D ( τ̂ − τ )  ~  N( 0, σ²_H D C⁻¹ D' )    (7.2.4)
                 ( ũ − u )

Consideration of the values required for forming confidence intervals makes it clear that it is the prediction error variance, ie. var(β̃ − β), rather than the variance of the estimator, var(β̃), that is usually of interest. The sizes of D and C are often prohibitively large, so that evaluation of π̃ and the prediction error variance of π̃ requires special consideration. From the point of view of understanding the processes involved, however, it is instructive to decompose D into its component matrices, where each component matrix relates to a step in the prediction process described in the previous section. We can write D as

    D = A W_M M S    (7.2.5)

where

• S (r × t) is a binary matrix which selects the elements of β which are used to form the predictions for each cell of the hyper-table; this relates to step 3. Note that r, the number of effects used in forming the fitted values, satisfies r ≤ t and is in general much less than t.
• M (c × r) is a "design" matrix which forms (a portion of) the multiway hyper-table for the specified combinations of the classify set plus the averaging set; this relates to step 2. Note that c is the number of values in the hyper-table, (usually) equal to the product of the number of combinations in the classify set with the number of combinations in the averaging set.
• W_M (c × c) is a diagonal matrix of weights; this relates to step 4.
• A (d × c) is a matrix which, when combined with W_M, averages the multiway table to produce the predictive margins; this relates to steps 1 and 4. Note that d is the number of predicted values, equal to the number of combinations of factor and covariate values in the classify set.

The matrices A and W_M may be combined; however, it is helpful to keep them separate to reflect the type of averaging of the multiway hyper-table. This will be particularly important for problems in which aliasing has occurred. Lane (1998) discusses this issue and indicates that aliasing may occur either as a result of linear dependencies in the explanatory variables or because of non-representation of some combinations of factor levels in the data-set. The latter may occur by chance or through the intrinsic structure of the data, as in the second example, where locations are nested within regions. Care must be taken in this case to ensure sensible averaging occurs. This will be discussed in detail in section 7.5. At this level, the major difference between the algorithm described here and the algorithm proposed by Lane and Nelder (1982) is the presence of the matrix S in D.
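Once D has been generated, the predictions and their precision follow mechanically from (7.2.3) and (7.2.4). As a sketch (ignoring the sparse absorption strategy of section 7.3, and assuming C⁻¹, β̃ and σ̂²_H are available in dense form; all names are illustrative):

import numpy as np

def predictive_margins(D, beta_tilde, Cinv, sigma2_H):
    # pi~ = D beta~, with prediction error variance sigma^2_H D C^-1 D'
    pi = D @ beta_tilde
    pev = sigma2_H * D @ Cinv @ D.T     # var(pi~ - pi), as in (7.2.4)
    return pi, np.sqrt(np.diag(pev))    # predictions and their standard errors

In practice D and C are far too large for this dense calculation, which is precisely the motivation for the absorption-based strategy described next.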

7.3 Computing Strategy

The implementation of the Lane and Nelder (1982) algorithm in GENSTAT 5 has been limited by the size of the model and/or dataset for which predictions and associated standard errors can be easily obtained. The following is a brief description of the strategy which has been implemented in ASReml, samm and the most recent release of the REML directive in GENSTAT 5. It minimises the computational load by use of sparse matrix methods and judicious formation of D and the matrix of prediction error variances.

7.3.1 Initialisation of the component matrices

For full flexibility, the user must be able to specify factor/covariate combinations for which predictions are required (the classify set), combinations of factors/covariates to be averaged over (the averaging set), methods of averaging for each factor (or combination of factors) in the averaging set and terms to be used when forming predictions. However, given the basic information, sensible default values can be determined to minimise user input. For example, in the terms which define the full multiway hyper-table, sensible default values would be the mean value for covariates, all levels for factors and knot points for spline terms (see chapter ??). Note that where a single variable defines several terms (eg. lin and spl, or lin and quad) care must be taken to maintain the link with the underlying variable.

7.3.2 Forming D

The algorithm uses the following method to compute the matrix D efficiently from the information obtained in the initialisation process described in the previous section. This specification influences the size and composition of the component matrices of D. These matrices are not formed individually, as it is more efficient to generate D directly. Recall that each row of D relates to a unique combination of the levels of the factors and values of the covariates in the classify set. These rows are successively formed using a modified version of the subroutine which generates the design matrix, W, for the linear mixed model. Each row of the prediction design matrix generates one predicted value. Columns corresponding to the predicted combination will be set to the appropriate value (1 for a factor level, the specified value for a covariate). Columns corresponding to averaging factors will contain weights dependent on the averaging process (although a slightly different procedure is used for weights depending on data presence, see section 7.5). Columns corresponding to model terms ignored in the prediction process will be set to zero. The matrix D is stored in a link-list sparse form, and a check is made for aliasing caused by the absence of data for an effect involved in D (see section 7.5).

7.3.3 Calculation of predictions and prediction error variances

The major computational challenge in the implementation of the prediction algorithm is the calculation of the prediction error variance matrix. The following describes the approach used in ASReml as outlined in Gilmour et al. (2003). It computes π̃ and the scaled prediction error variance matrix of π̃. The approach extends the mixed model equations, which can be manipulated during the final iteration of the AI algorithm. That is, let Q̃_e be the matrix of augmented mixed model equations, given by

           ( y'R⁻¹y    0     y'R⁻¹W )
    Q̃_e =  (   0       0       D    )
           ( W'R⁻¹y    D'      C    )

Absorption of C gives

    Q̃_a =  ( y'Py    −π̃'     )
           ( −π̃     −DC⁻¹D'  )

The absorption using the reordering of the mixed model equations (briefly described in chapter A) is designed to retain a high degree of sparsity (Gilmour et al., 1995). It is advantageous to have control over the formation of the elements of DC⁻¹D'. For example, where standard errors of differences (SEDs) are required, the full matrix must be calculated. However, calculation of the average SED can be performed during absorption without storing the full matrix, and if only standard errors (SEs) of predictions are required then only the diagonal of the matrix must be retained.

7.4 An example of the prediction model

In this section we present a simple example of the prediction model, taking particular care to describe the form of the matrices involved in forming the predictions. This will assist with the understanding of prediction in the more complex examples which are presented later in this book.

Consider the following (trivial) meta-analysis in which three treatments are tested in two experiments. Experiment 1 had three replicates for each treatment; experiment 2 had two replicates per treatment. The symbolic representation of the model is given by

    y ~ mu + expt * trt

where expt and trt are factors with 2 and 3 levels respectively. The linear (mixed) model for the vector of data y (15 × 1) is given by

    y = Xτ + e

where X = [1_15  X_e  X_t  X_et] is the design matrix, which is partitioned conformably with the terms in the linear model; the partitions represent the intercept, the main effect of experiment, the main effects of treatment and the interaction effects of experiment and treatment. The corner point parameterisation has been applied to ensure the design matrix X is of full column rank.

Case 1: Suppose we are interested in predicting the treatment means averaged over both experiments. The predictive margins are therefore formed, using the approach outlined in section 7.2, by

    π̃ = Dβ̃,    D = A W_M M S

where the matrices on the right hand side are given by

    S = I_6,    A = 1_2' ⊗ I_3,    W_M = (1/2) I_2 ⊗ I_3

and

        ( 1 0 0 0 0 0 )
        ( 1 0 1 0 0 0 )
    M = ( 1 0 0 1 0 0 )
        ( 1 1 0 0 0 0 )
        ( 1 1 1 0 1 0 )
        ( 1 1 0 1 0 1 )

      = [ 1_6 | B_2 ⊗ 1_3 | 1_2 ⊗ B_3 | B_2 ⊗ B_3 ]

where the matrix B_n (n × (n−1)) is given by

    B_n = ( 0'      )
          ( I_{n−1} )

Case 2: Suppose now we are interested in deriving the predictive margins for each treatment by weighting the means for each experiment by the number of replicates for each treatment in each experiment. The matrices A, M and S are as above and the matrix W_M is given by

    W_M = ( (3/5) I_3       0      )
          (     0       (2/5) I_3  )

Case 3: Suppose now that each of the experiments were randomised complete block designs, in which case a possible model for the data would be

    y ~ mu + expt * trt + expt.block

where block is a factor with 3 levels. There are missing combinations in the expt.block term, but this does not cause problems in the analysis, as these effects are classified as random effects, and hence no singularity will occur. As for case 1, we wish to form predictive margins for the treatment factor. We require predictive margins which are marginal to the term expt.block, and hence we exclude this term in forming the linear combination of effects in β by setting S = [ I_6  0^(6×6) ].
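The component matrices for cases 1 and 2 can be written down directly with Kronecker products. The sketch below simply instantiates the matrices above; B and ones are our helper names.

import numpy as np

def B(n):
    # B_n = [0' ; I_{n-1}], the n x (n-1) corner-point expansion matrix
    return np.vstack([np.zeros((1, n - 1)), np.eye(n - 1)])

def ones(n):
    return np.ones((n, 1))

# M forms the 2 x 3 expt-by-trt hyper-table from the corner-point effects
M = np.hstack([ones(6),
               np.kron(B(2), ones(3)),
               np.kron(ones(2), B(3)),
               np.kron(B(2), B(3))])
S = np.eye(6)
A = np.kron(ones(2).T, np.eye(3))               # collapse the table over experiments

W1 = 0.5 * np.eye(6)                            # case 1: equal experiment weights
W2 = np.kron(np.diag([3/5, 2/5]), np.eye(3))    # case 2: replicate weights

D1 = A @ W1 @ M @ S
D2 = A @ W2 @ M @ S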

7.5 Prediction in models not of full rank

There are often situations in which the fixed effects design matrix, X, is not of full column rank. These can be classified according to the cause of aliasing as follows:

1. linear dependencies between explanatory variables due to over-parameterisation of factor terms
2. no data present for some factor combinations, so that the corresponding effects cannot be estimated
3. linear dependencies due to other, usually unexpected, structure in the data

Absorption to produce predictions does not need to be performed until parameter estimation is complete. The first type of aliasing can be detected when setting up the design matrix for parameter estimation, and the second type should be detected from the data, but if missed can also be detected during absorption of the mixed model equations. The third case cannot be detected until absorption of the extended matrix Q̃_e during the predict step of the algorithm. In any case, a strategy is required to ensure predictions are estimable in the sense defined by Searle (1971, pgs. 160, 180). In the following we focus on predictions involving the vector of fixed effects (τ) only, as estimability is not an issue for random effects. We recall that the equations for fixed effects, after absorption of the equations associated with the random effects in (5.3.21), are given by

    X'H⁻¹X τ̂ = X'H⁻¹y    (7.5.6)

If X is not of full rank, then there is no unique solution to (7.5.6). To obtain a solution, say τ̂_0, we compute

    τ̂_0 = (X'H⁻¹X)⁻ X'H⁻¹y

for some generalised inverse (X'H⁻¹X)⁻ of X'H⁻¹X. We note that τ̂_0 is not an unbiased estimator of τ, since, in general,

    E(τ̂_0) = (X'H⁻¹X)⁻ X'H⁻¹X τ ≠ τ

Since X'H⁻¹X is symmetric, there exists an orthogonal L such that

    L' X'H⁻¹X L = ( A_11   A_12 )
                  ( A_12'  A_22 )

where A_22 is a square matrix of full rank, equal to the rank of X'H⁻¹X. Further, we define

    X* = [X_1*  X_2*] = XL
    τ* = ( τ_1* ) = L'τ
         ( τ_2* )

and note X*τ* = Xτ. Hence a convenient choice for (X'H⁻¹X)⁻ is given by

    L ( 0      0    ) L'    (7.5.7)
      ( 0   A_22⁻¹  )

giving

    τ̂_0 = L (  0   )
            ( τ̂_2* )

where

    τ̂_2* = A_22⁻¹ X_2*' H⁻¹ y

The package ASReml uses this approach when it encounters aliasing. Any aliased effects are flagged and reordered to the top of the mixed model equations. Thus the estimate of the fixed effects τ* is the τ̂_2* above.
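The construction in (7.5.7) can be imitated numerically using the eigendecomposition of the symmetric matrix X'H⁻¹X as the orthogonal L. This is a sketch of one convenient generalised inverse, not the reordering scheme used by ASReml; the function name is ours.

import numpy as np

def tau_0(XtHinvX, XtHinvy, tol=1e-10):
    # One solution (X'H^-1 X)^- X'H^-1 y built from an orthogonal L, as in (7.5.7)
    vals, L = np.linalg.eigh(XtHinvX)    # X'H^-1 X = L diag(vals) L', L orthogonal
    inv_vals = np.array([1.0 / v if v > tol * vals.max() else 0.0 for v in vals])
    ginv = L @ np.diag(inv_vals) @ L.T   # a generalised inverse of X'H^-1 X
    return ginv @ XtHinvy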

7.5.1 Estimability of predictions

We first consider the case of the estimability of functions of fixed effects, as this corresponds to the case considered by Searle (1971): the function D_τ τ̂_0 is estimable if

    E(D_τ τ̂_0) = D_τ τ    (7.5.8)

Note that estimability in this context implies that the value of D_τ τ̂_0 is invariant to the parameterisation (ie. the generalised inverse of X'H⁻¹X) chosen. Thus,

    E(D_τ τ̂_0) = D_τ L ( 0      0    ) L' X'H⁻¹X τ
                       ( 0   A_22⁻¹  )

               = D_τ L ( 0      0    ) L' X'H⁻¹X L L'τ
                       ( 0   A_22⁻¹  )

               = [D_τ1*  D_τ2*] ( 0      0    ) ( A_11   A_12 ) τ*
                                ( 0   A_22⁻¹  ) ( A_12'  A_22 )

               = [D_τ1*  D_τ2*] (      0          0 ) τ*
                                ( A_22⁻¹ A_12'    I )

               = [ D_τ2* A_22⁻¹ A_12'   D_τ2* ] ( τ_1* )    (7.5.9)
                                                ( τ_2* )

For Dτ τˆ0 to be estimable (7.5.9) must equal ∗ ∗ ∗ ∗ ∗ ∗ Dτ τ = Dτ τ = Dτ1τ 1 + Dτ2τ 2 and so ∗ ∗ −1 0 Dτ1 − Dτ2A22 A12 = 0 (7.5.10) The other case to consider is of estimability of a linear function of random effects Duu˜. It can be shown that the E (Duu˜) is zero, taking expectation with respect to the distribution of u. This is because the subset of equation in the mixed model equations corresponding to u˜ are full rank and Xτˆ is estimable Searle (1971). If Duu˜ and Dτ τˆ are estimable then it follows that the linear function Duu˜ + Dτ τˆ is also estimable.

7.5.2 Computing strategy for determining estimability of predictions

The above criterion for detecting estimability of the predictions (7.5.10) can be incorporated into the computing algorithm described in chapter A and as part of the prediction process devised in section 7.3. Consider the augmented mixed model matrix, after reordering and absorption of random effects, which is given by

            ( y'H⁻¹y       0        y'H⁻¹X_1*   y'H⁻¹X_2* )
    Q_e1 =  (   0          0        D_τ1*       D_τ2*     )
            ( X_1*'H⁻¹y    D_τ1*'   A_11        A_12      )
            ( X_2*'H⁻¹y    D_τ2*'   A_12'       A_22      )

Absorption of the last row, pertaining to τ_2*, leaves the symmetric matrix

             ( y'Py                                                                       )
    Q_e1^a = ( −D_τ2* τ̂_2*             −D_τ2* A_22⁻¹ D_τ2*'                               )
             ( X_1*'H⁻¹y − A_12 τ̂_2*    D_τ1*' − A_12 A_22⁻¹ D_τ2*'   A_11 − A_12 A_22⁻¹ A_12' )

Since the reordering of the vector τ* into the partition (τ_1*', τ_2*')' has been established and implemented during the first iteration, the criterion for determining estimability (invariance to parameterisation) can be assessed during the same absorption process that determines the vector of predictions and the matrix of prediction variances, ie. estimable predictions are characterised by columns corresponding to D' in Q_e1^a taking the value zero during the absorption process.

7.5.3 An example of prediction in models not of full rank To motivate the next section and to clarify and perhaps reinforce some of the technical issues presented in section 7.5 we will consider a portion of the data-set described in section 7.1.3. We consider data taken from three years (1996-1998) at four locations (BOO, CUM, GER and COO) for a sample of 6 varieties (FRAME, HALBERD, MACHETE, SPEAR, TRIDENT and KRICHAUFF). Table 7.3 presents the number of varieties for each year by location combination for these data. The variety TRIDENT was not sown in 1998 and the location COO was not used in 1996 and 1997.

Table 7.3 Numbers of varieties for each year by location for the reduced SA wheat data-set

Year    BOO   COO   CUM   GER
1996      6     0     6     6
1997      6     0     6     6
1998      5     5     5     5

For the purposes of illustration we consider fitting the simple linear model to these data given by

    y ~ mu + variety + year * loc

where the terms in the model formula have the obvious interpretation. This can be written conveniently in vector matrix notation by

    y = Xτ + e    (7.5.11)

where y (56 × 1) is the vector of yields and X (56 × 17) is the design matrix, with type 1 aliased effects removed, for the vector of fixed effects given by

    τ' = [ τ_μ, τ_v', τ_y', τ_l', τ_yl' ]

The sub-vectors of τ (of lengths 5, 2, 3 and 6 respectively, after removing dependencies) represent the variety, year and location main effects and the year by location interaction effects; τ_μ is the intercept. The linear dependencies in the model have been removed (ie type 1 aliasing) by setting τ_v;1 = τ_y;1 = τ_l;1 = 0, τ_yl;1k = 0, k = 1,...,4 and τ_yl;j1 = 0, j = 1, 2, 3. However, due to "missingness" in the data (ie type 2 aliasing), there are an additional two linear dependencies in X. We assume e ~ N(0, σ² I_56). Table 7.4 presents the least squares estimates of the full set of effects, including type 1 and 2 aliased effects.

Table 7.4 Best linear unbiased estimates of the fixed effects in the analysis of the reduced SA wheat data-set

Term       Effect       Estimate   Alias Type
mu                         0.930
variety    FRAME           0.000   1
variety    HALBERD        -0.217
variety    MACHETE        -0.208
variety    SPEAR          -0.181
variety    TRIDENT         0.056
variety    KRICHAUFF       0.137
year       96              0.000   1
year       97              0.396
year       98              0.393
loc        BOO             0.000   1
loc        CUM             2.763
loc        GER             2.754
loc        COO            -0.462
year.loc   96.BOO          0.000   1
year.loc   96.CUM          0.000   1
year.loc   96.GER          0.000   1
year.loc   96.COO          0.000   1
year.loc   97.BOO          0.000   1
year.loc   97.CUM         -0.727
year.loc   97.GER         -2.409
year.loc   97.COO          0.000   2
year.loc   98.BOO          0.000   1
year.loc   98.CUM         -0.604
year.loc   98.GER         -1.413
year.loc   98.COO          0.000   2

Consider forming variety predictive margins. Since all the terms in the linear model are classified as fixed, the hyper-table is the three-way table classified by variety, year and location. To simplify the remainder we assume that the columns of D relate to the vector τ̂_0, the vector of fixed effects which includes type 2 aliased effects, in the order assumed in (7.5.11). The matrix M (72 × 17) is therefore given by

    M = [ 1_72 | B_6 ⊗ 1_3 ⊗ 1_4 | 1_6 ⊗ B_3 ⊗ 1_4 | 1_6 ⊗ 1_3 ⊗ B_4 | 1_6 ⊗ B_3 ⊗ B_4 ]

The matrix S = I_17. The hyper-table must now be "averaged" in some way to produce the variety predictive margins. Simple averaging, by setting A (6 × 72) = I_6 ⊗ 1_12' and W_M = (1/12) I_72, produces a predictive margin for variety i of

    τ̂_μ + τ̂_v;i + (1/3) Σ_{j=1}^{3} τ̂_y;j + (1/4) Σ_{k=1}^{4} τ̂_l;k + (1/12) Σ_{j=1}^{3} Σ_{k=1}^{4} τ̂_yl;jk

This predictive margin is not estimable. We can avoid this problem by choosing to average over the year by location table cells which are present in the data. That is, we choose the ith diagonal block of W_M to be given by

    (1/10) diag(1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1)

The predictive margin for variety i is then given by

    τ̂_μ + τ̂_v;i + (1/10)(3τ̂_y;1 + 3τ̂_y;2 + 4τ̂_y;3) + (1/10)(3τ̂_l;1 + τ̂_l;2 + 3τ̂_l;3 + 3τ̂_l;4) + (1/10) Σ_{(j,k)∉S} τ̂_yl;jk

where S = {(1,2), (2,2)}. This predictive margin is estimable, since it involves a linear combination of fitted values; this can be checked algebraically using the condition (7.5.10).
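The data-present weighting can be generated directly from a presence indicator. The following sketch assumes the hyper-table cells are ordered with year varying fastest within location, matching the diagonal block displayed above; all names are illustrative.

import numpy as np

# Presence of data in the 4 x 3 location-by-year cells (year varying fastest):
# the location in row 1 (COO) is absent in the first two years
present = np.ones((4, 3))
present[1, 0] = present[1, 1] = 0

w = (present / present.sum()).ravel()       # 12 cell weights summing to one
W_block = np.diag(w)                        # the i-th diagonal block of W_M
W_M = np.kron(np.eye(6), W_block)           # the same block for each of 6 varieties
A = np.kron(np.eye(6), np.ones((1, 12)))    # collapse the weighted cells per variety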

7.6 Issues of averaging

We have seen in the previous section that there is a need for different averaging schemes when forming the marginal table of predictions from the multiway hyper-table. Often it is sufficient and sensible to average over all the dimensions of the hyper-table not in the classify set, using equal weights. In other cases, specific user-supplied weights will be required for some factors, with equal weighting over others. In either case the weight matrix is defined by multiplying together the fixed weights associated with each margin being averaged. The additional case, which was illustrated in the previous section and which deserves attention because it requires a slightly modified algorithm, is the case of averaging only over factor combinations that are present in the data. In this case, the prediction matrix D can be written as

    D = A_D W_M_D A_F W_M_F M S

where the averaging step is split into averaging over factors with fixed weighting (without reference to the data, subscript F) and factors with weighting determined by data presence, denoted by subscript D. The first step of averaging over factors with fixed weights can be done to reduce the size of the problem, before checking data presence on the reduced hyper-table.

A general algorithm for prediction should allow specification of the type of weighting (equal, population, data present, user-supplied) on each of the averaging factors, or on any combination of the averaging factors.

7.7 Prediction of new observations

In some circumstances, it is desirable to predict new observations. This may include the predicted mean for a new experiment or prediction at new points within the data set. An important application is in the analysis of spatial data, in which kriging (Cressie, 1991) has become a popular approach for prediction. Examples of this are presented in chapter ??. If y_p (n_p × 1) is the vector of (unobserved) data we wish to predict then a model for y_p is

    y_p = X_p τ + Z_p1 u + Z_p2 u_p + e_p    (7.7.12)

where the subscript p denotes design matrices associated with the predicted values. The vectors u_p and e_p denote random effects not present in the observed data set, but drawn from the same population as u and e. We assume joint normality for (u_a', e_a')' with

    var ( u_a ) = σ²_H ( G_a    0  )
        ( e_a )        (  0    R_a )

where u_a' = [u'  u_p'], e_a' = [e'  e_p'],

    G_a = ( G     G_op )    and    R_a = ( R     R_op )
          ( G_po  G_pp )                 ( R_po  R_pp )

The matrices σ²_H G and σ²_H R are the variance matrices of the observed random effects. Note that these matrices do not have suffixes; however, inverse matrices associated with them do, for example G^oo and R^oo respectively. This is to maintain consistency with previous notation. We do not include residual errors from the current experiment or a general design matrix for the new residual errors e_p here, but the extension is straightforward.

A convenient device to obtain the best predictor of y_p, denoted by ỹ_p, is now presented. It has been used extensively for the estimation of missing data (Verbyla and Cullis, 1992). Their approach consists of forming the augmented data vector y_a (n_a × 1), where n_a = n + n_p and y_a = [y', 0']'. That is, we place n_p zeros in the positions where the data is not observed and set up a pseudo linear mixed model for y_a, which is an extension of (4.1.1) and is given by

    y_a = Fψ + W_a β_a + e_a    (7.7.13)

where ψ (n_p × 1) is the vector of effects for prediction, with design matrix

    F (n_a × n_p) = (   0   )
                    ( I_n_p )

and the vector β_a = (τ', u_a')' is the vector of fixed and random effects, with design matrix given by

    W_a = ( X     Z     0    )
          ( X_p   Z_p1  Z_p2 )

This pseudo model is merely a computational device. The variance parameters (σ²_H, κ) and the vectors τ and u have already been estimated (or predicted) using the observed data. The inclusion of the vector ψ as a fixed effect at first appears to be inconsistent with the aim of prediction of y_p; however, the inclusion of ψ as a fixed effect ensures that if (7.7.13) were fitted to the augmented data vector then the resulting estimates of the variance parameters and fixed effects and predictions of random effects would be identical to those obtained by fitting the usual linear mixed model to the observed data only. For a complete proof the reader is referred to Verbyla and Cullis (1992). In the following we present a portion of their proof which examines the prediction of y_p through the set of extended mixed model equations for (7.7.13). Hence we consider the extended mixed model equations

    ( y_a'R_a⁻¹y_a                                      )
    ( W_a'R_a⁻¹y_a   W_a'R_a⁻¹W_a + G_a*                )    (7.7.14)
    ( F'R_a⁻¹y_a     F'R_a⁻¹W_a          F'R_a⁻¹F       )

where

    G_a* = ( 0     0    )
           ( 0   G_a⁻¹  )

Since

    F'R_a⁻¹y_a = R^po y
    F'R_a⁻¹F   = R^pp
    F'R_a⁻¹W_a = R^po W + R^pp W_p

then (7.7.14) can be written as

    ( y_a'R_a⁻¹y_a                                    )
    ( W_a'R_a⁻¹y_a   W_a'R_a⁻¹W_a + G_a*              )
    ( R^po y         R^po W + R^pp W_p        R^pp    )

Absorbing the last row, ie the set of equations for ψ, gives the reduced set of extended mixed model equations

    ( y_a'P_F y_a                          )    (7.7.15)
    ( W_a'P_F y_a   W_a'P_F W_a + G_a*     )

where

    P_F = R_a⁻¹ − R_a⁻¹F (F'R_a⁻¹F)⁻¹ F'R_a⁻¹
        = ( R⁻¹   0 )
          (  0    0 )

Hence (7.7.15) becomes

    ( y'R⁻¹y                                       )
    ( X'R⁻¹y   X'R⁻¹X                              )    (7.7.16)
    ( Z'R⁻¹y   Z'R⁻¹X   Z'R⁻¹Z + G^oo              )
    (   0        0        G^po            G^pp     )

where

    G_a⁻¹ = ( G^oo   G^op )
            ( G^po   G^pp )

The final row gives the solution

    ũ_p = −(G^pp)⁻¹ G^po ũ = G_po G⁻¹ ũ

Absorbing the last section, corresponding to u_p, leaves

    ( y'R⁻¹y                    )
    ( W'R⁻¹y   W'R⁻¹W + G*      )

ie the original mixed model equations. So the augmented data approach gives the required estimates of the fixed and random effects. Estimation of the variance parameters using this model can also be shown to be equivalent to estimation based solely on the observed data (Verbyla and Cullis, 1992). To obtain ψ̂ we use back-substitution in (7.7.14), giving

    ψ̂ = (R^pp)⁻¹ R^po y − (R^pp)⁻¹ (R^po W + R^pp W_p) β̃    (7.7.17)
       = (R^pp)⁻¹ R^po ẽ − W_p β̃
       = −( X_p τ̂ + Z_p1 ũ + Z_p2 ũ_p − (R^pp)⁻¹ R^po ẽ )
       = −ỹ_p

7.7.1 Computation for new observations

In the following development it is simpler to incorporate any new random effects into the usual linear mixed model. Hence we can write the predicted observations as

    ỹ_p = X_p τ̂ + Z_p ũ + R_po R⁻¹ ẽ

where Z_p = [Z_p1  Z_p2], ũ now represents the full set of observed and predicted random effects, and W_p = [X_p  Z_p]. In the previous section we have made use of the following result (see section ??)

    ( R     R_op )⁻¹   ( R^oo   R^op )
    ( R_po  R_pp )   = ( R^po   R^pp )

                       ( R⁻¹ + R⁻¹ R_op R^pp R_po R⁻¹    −R⁻¹ R_op R^pp )
                     = ( −R^pp R_po R⁻¹                   R^pp          )

where R^pp = (R_pp − R_po R⁻¹ R_op)⁻¹. Thus we recall from (7.7.17) that the predicted observations can be written as

    ỹ_p = [ W_p + (R^pp)⁻¹ R^po W ] β̃ − (R^pp)⁻¹ R^po y

The prediction error variance of ỹ_p is

    var(ỹ_p − y_p) = var( W_p(β̃ − β) + R_po R⁻¹ ẽ − e_p )
                   = var( (W_p − R_po R⁻¹ W)(β̃ − β) + R_po R⁻¹ e − e_p )
                   = σ²_H [ (W_p − R_po R⁻¹ W) C⁻¹ (W_p − R_po R⁻¹ W)' + (R^pp)⁻¹ ]
                   = σ²_H [ (W_p + (R^pp)⁻¹ R^po W) C⁻¹ (W_p + (R^pp)⁻¹ R^po W)' + (R^pp)⁻¹ ]

since cov(β̃ − β, R_po R⁻¹ e − e_p) = 0.

We see that ỹ_p is a function of β̃ and the observed data y, and in the spirit of section 7.3 we can extend the mixed model equations and calculate predictions and prediction error variances through an absorption step. The extended mixed model equations now become

    ( y'R⁻¹y                                 )
    ( (R^pp)⁻¹ R^po y    −(R^pp)⁻¹           )
    ( W'R⁻¹y             D'            C     )

where D = W_p + (R^pp)⁻¹ R^po W. Absorption of C gives

    ( y'Py                             )
    ( −ỹ_p    −var(ỹ_p − y_p)/σ²_H    )

Additionally we can check for prediction invariance (see section 7.5) during the absorption process.

CHAPTER 8
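Alternatively, once a model has been fitted, ỹ_p may be computed directly from its definition rather than through the extended equations. The sketch below assumes all fitted quantities and covariance blocks are available in dense form; names are illustrative.

import numpy as np

def predict_new(Xp, Zp, Rpo, R, tau_hat, u_tilde, X, Z, y):
    # y~_p = X_p tau^ + Z_p u~ + R_po R^-1 e~, with e~ the estimated residuals
    e_tilde = y - X @ tau_hat - Z @ u_tilde
    return Xp @ tau_hat + Zp @ u_tilde + Rpo @ np.linalg.solve(R, e_tilde)

With R_po = 0 (independent new errors) this reduces to the familiar X_p τ̂ + Z_p ũ; a non-zero R_po gives the kriging-type adjustment from the observed residuals.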

From ANOVA to variance components

8.1 Introduction

In this chapter we consider the analysis of data arising from designed experiments, observational studies or surveys for which the variance matrix of the data is, in terms of variance component ratios, given by

    var(y) = σ²_H H = σ²_H ( Σ_{i=1}^{q} γ_i Z_i Z_i' + I )

or, in terms of variance components (setting σ²_H = 1), by

    var(y) = H = Σ_{i=1}^{q} γ_i Z_i Z_i' + σ² I

This variance matrix is termed 'linear' in the parameters since ∂²H/∂γ_i² = 0. It arises in many applications: the most notable is in the analysis of designed experiments. In this setting the variance structure of the data can be derived from a randomisation argument (see Nelder, 1965a). In contrast, most analyses of longitudinal and spatial data using linear mixed models are model-based and as such do not have this robustness to misspecification of the variance model. In observational studies the linear mixed model is usually constructed from consideration of likely sources of variation in the data.
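As a small sketch, H in the first form can be assembled directly from the random-effect design matrices and the ratios γ_i; the single blocking factor in the example is invented for illustration.

import numpy as np

def variance_matrix(Z_list, gammas, n):
    # H = sum_i gamma_i Z_i Z_i' + I_n, linear in the ratios gamma_i
    H = np.eye(n)
    for Z, g in zip(Z_list, gammas):
        H = H + g * (Z @ Z.T)
    return H

# e.g. one random blocking factor: 4 blocks of 5 units, gamma = 0.8
Zb = np.kron(np.eye(4), np.ones((5, 1)))
H = variance_matrix([Zb], [0.8], n=20)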

8.2 Navel Orange Trial

This example considers an experiment conducted in the Sunraysia district of Australia during the 1999-2000 orange season. The aim of the experiment was to examine the effect of branch size (strong or weak), inflorescence type (A - leafy with a single fruit, B - leafy with more than one fruit and C - leafless terminal fruit) and the possible interaction of these factors with tree aspect (facing north or facing south). Figure 8.1 presents a sketch of these inflorescence types. The experiment was conducted in a commercial orchard not far from the NSW Agriculture's Horticultural Research Station at Dareton, NSW. The data is kindly provided by Dr. T. Khurshid.

Five mature navel orange trees were randomly selected from the orchard. On each side of the tree (ie aspect) a number of branches were tagged and classified as strong or weak according to their average diameter. The researcher then randomly selected 5 oranges for each tree, aspect, branch strength class and inflorescence type.

[Figure 8.1 Schematic Orange tree structure - nb wrong figure!]

Thus a total of 300 oranges were tagged and identified. During this process no particular care was taken to ensure that the sampling respected the branch structure of the tree. We shall return to this issue later. Fruit diameter was recorded on several occasions throughout the growing season. The data we consider is the penultimate measurement before fruit maturity.

The observational unit is an orange. The factors in the experiment are tree, aspect (north or south), strength (strong or weak), type (A, B, C) and fruit. The fixed factors are aspect, strength and type; tree and fruit are considered random. The linear mixed model could be derived using both our knowledge of the sampling protocol and the aims of the experiment. Following the tier methodology we have

Tier   Description                     Factors
1      Unrandomised fruit factors      tree, aspect, strength, type, fruit
2      Randomised treatment factors    aspect, strength, type

The corresponding structure formulae are

Tier   Structure formula
1      tree/aspect/strength/type/fruit
2      aspect*strength*type

One important issue that has been thus far overlooked, however, is the issue of tree structure. Inflorescence types grow from branches. Each branch within a tree plays a vital role in transport of nutrients and also has a fixed position within the tree with respect to orientation and height above ground. Furthermore, the classification of branches to strength class was based on branch diameter. There was a reasonable amount of variation in branch diameter within each strength class. It is therefore quite likely that fruit growth could be affected by the particular branch on which it grew. Individual branches were subsequently identified within each tree, aspect and strength class using the diameter records, which according to the researcher were unique to each branch (within these factors). Table 8.1 presents the actual number of branches that were chosen out of a possible 15 for each tree, aspect and strength combination. A total of 35 branches had two fruit sampled. The structure formula for tier 1 was modified to include the term tree.aspect.strength.branch, where branch is a factor with 15 levels.

Table 8.1 Number of branches sampled

               North             South
Tree      Strong   Weak     Strong   Weak
1           13      14        14      14
2           14      12        13      14
3           14      12        14      13
4           14      12        13      11
5           14      13        15      12

Ignoring the complication of branch and missing values (16 in all), table 8.2 presents the decomposition of the sources of variation using the rules described in section ??. For interest, and for comparison with Kenward and Roger's approach for determining denominator d.f. for F-statistics, we have included an additional column denoting the d.f. for each term. The linear mixed model can now be constructed by collating the terms in the last column of table 8.2 and adding tree.aspect.strength.branch, and is

    y ~ mu + aspect * strength * type + tree/aspect/strength/type
        + tree.aspect.strength.branch    (8.2.1)

Table 8.3 presents a summary of the REML estimates of the variance components for the random terms for the model determined by (8.2.1) and two additional models, viz dropping tree.aspect.strength.branch (model 1) and dropping aspect.strength.type from the full model (model 2). The results are very interesting. The largest (excluding units) source of variation is associated with tree.aspect.strength.branch, ie the actual branch. A formal test of the hypothesis that the variance component is zero would not be rejected. We prefer to retain the term, given its biological relevance. The lack of information in this data-set is the most probable cause of the degree of uncertainty in its estimate.

Table 8.4 presents a summary of the tests of the fixed effects for these data. The F-statistics have been computed using the approach of Kenward and Roger (1997), described in section ??.

Table 8.2 Skeletal ANOVA for the navel orange trial

Source/Decomposition               d.f.   Term in linear mixed model
Tree¹                                 5
  Mean²                               1   mu
  Residual²                           4   tree
Tree.Aspect¹                          5
  Aspect²                             1   aspect
  Residual²                           4   tree.aspect
Tree.Aspect.Strength¹                10
  Strength²                           1   strength
  Aspect.Strength²                    1   aspect.strength
  Residual²                           8   tree.aspect.strength
Tree.Aspect.Strength.Type¹           40
  Type²                               2   type
  Aspect.Type²                        2   aspect.type
  Strength.Type²                      2   strength.type
  Aspect.Strength.Type²               2   aspect.strength.type
  Residual²                          32   tree.aspect.strength.type
Tree.Aspect.Strength.Type.Fruit¹    240   units

Table 8.3 REML estimates of variance components for three models fitted to the navel orange trial

Term                          No. of effects   Full Model   Model 1   Model 2
tree                                 5              0           0         0
tree.aspect                         10              2.471       2.491     2.476
tree.aspect.strength                20              1.590       1.529     1.580
tree.aspect.strength.type           60              0           0.179     0
tree.aspect.strength.branch        265              2.619       -         3.273
units                              284             18.467      20.936    17.763
REML log-likelihood                              -578.73     -578.94      -

The approximate denominator d.f. for each F-statistic is presented, as well as a probability based on an F-distribution with the appropriate numerator and denominator d.f., in the columns labelled ν1 and ν2. The three-way interaction is not significant and therefore it has been dropped in subsequent analyses. There is some evidence of a strength.type interaction and possibly an aspect.type interaction. It is very interesting to note the effect of using the small sample adjustment of Kenward and Roger (1997). The most important difference is the use of the more appropriate denominator d.f. for each term. The inflation of the Wald statistics was modest (the largest percentage reduction was 4%). The probability for the strength.type interaction using the Wald statistic and its asymptotic reference distribution was 0.037 (cf 0.056).

Table 8.4 Testing of fixed effects for the navel orange trial

Term                    ν1      ν2    F-statistic   P value
aspect                   1     4.5        1.85       .238
strength                 1     7.9       43.31      <.001
type                     2    41.0        8.35       .006
aspect.strength          1     8.0        0.06       .813
aspect.type              2    30.6        2.52       .097
strength.type            2    30.6        3.18       .056
aspect.strength.type     2    30.1        0.49       .619

The values for the two-way interactions were based on model 2, whereas the values for the main effects were calculated by fitting a model which included only main effects for aspect, strength and type, but fixing the variance parameters at the REML estimates from model 1. This respects marginality and design structure, more closely reflecting the ANOVA decomposition and our uncertainty of inference for some of the two-way interactions.

To summarise the results for the researcher we present predictions for the strength by type table. These are formed using the rules outlined in chapter 7. Briefly, the classify set is {aspect, strength, type}. Marginal predictions are formed for all random terms. That is, the linear combination of effects is taken from the unsaturated three-factor model aspect*strength*type - aspect.strength.type. The two-way margins for strength by type are obtained by averaging over the two levels of aspect. Table 8.5 presents the predictions and the associated matrix of standard errors of differences (SEDs) based on the adjusted variance matrix.

Table 8.5 Predictions and SEDs for the navel orange trial (SEDs form the lower triangle, in row order)

Strength  Type  Prediction   SEDs
Strong    A        69.7
Strong    B        73.2      0.96
Strong    C        73.6      0.96  0.97
Weak      A        66.1      1.10  1.11  1.11
Weak      B        67.8      1.12  1.12  1.12  0.95
Weak      C        66.7      1.11  1.12  1.12  0.95  0.96
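To make the averaging rule concrete, the following minimal Python sketch forms the strength-by-type margin from a hypothetical aspect × strength × type table of predictions. The array and its values are invented for illustration only; they are not the fitted values from this analysis.

```python
import numpy as np

# Hypothetical 2 x 2 x 3 table of predictions from the unsaturated model,
# indexed as [aspect, strength, type]; values are illustrative only.
pred = np.array([
    [[69.0, 72.8, 73.1],   # aspect 1: Strong A, B, C
     [65.7, 67.4, 66.2]],  # aspect 1: Weak   A, B, C
    [[70.4, 73.6, 74.1],   # aspect 2: Strong A, B, C
     [66.5, 68.2, 67.2]],  # aspect 2: Weak   A, B, C
])

# The strength-by-type margin averages over the two levels of aspect
# (axis 0), mirroring the rule used to form table 8.5.
strength_by_type = pred.mean(axis=0)
print(strength_by_type)  # 2 x 3 table: rows Strong/Weak, columns A/B/C
```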

8.3 Sensory experiment on frozen peas

A sensory experiment was conducted in order to evaluate 16 frozen pea products. These were assessed 3 times by each of 12 assessors for a range of attributes (for details see Brockhoff, 2002). We consider here the analysis of the attribute "pea taste". Sensory experiments are multi-phase experiments (?). The analysis of more complex and completely replicated multi-phase experiments will be considered in chapter ??.

In terms of phase I of this experiment, namely the selection of samples to be evaluated, we have limited information on the design. Each sample represents a different treatment. The treatments were originally chosen as a factorial combination of 4 sizes by 2 colours by 2 sucrose levels. However, an error was made, resulting in no sample for the combination size 1, colour 1, sucrose 2 and two samples for the combination size 1, colour 2, sucrose 1 (see table 8.6, ordered on the factorial structure of the products). A preliminary analysis of the factorial structure revealed a significant 3-way interaction. On the basis of this, and the fact that the treatments were confounded with the variety of pea (see table 8.6), we ignore the treatment structure when conducting the analysis. A key issue is that there appears to be no replication of the treatments in phase I. Thus we will not be able to make broad inferences about product differences but must restrict attention to the product samples used in the study.

In terms of phase II, namely the evaluation of the products by the assessors, a cross-over design was used. Products were evaluated over 12 days, with 4 products evaluated each day. Days 1 to 4 comprised a complete evaluation of all 16 products (i.e. a complete "repeat"); similarly, days 5 to 8 and days 9 to 12 comprised complete repeats. Products were assigned to days using an incomplete block design. Within each day each of the 12 assessors judged all 4 products. The order of presentation within a day was varied between assessors in an attempt to balance for carry-over effects. There were a large number of missing values in these data. Although the design suggests that there should be 3 data values for each product by assessor combination, there are many combinations with fewer values (table 8.7). Assessor 7 was poorly represented, with no data for 4 of the products and only a single value for the remaining products. For this reason the data for this assessor were omitted from the analysis.

Table 8.6 Description of frozen pea products

product  size  colour  sucrose  variety
811       1      1        1        3
231       1      2        1        2
569       1      2        1        6
315       1      2        2        2
720       2      1        1        8
494       2      1        2        4
625       2      2        1        6
930       2      2        2        2
936       3      1        1        9
981       3      1        2        5
535       3      2        1        6
540       3      2        2        7
701       4      1        1        8
237       4      1        2        7
132       4      2        1        4
863       4      2        2        1

Although our analysis will only focus on phase II of the trial we find that, given the complexity of the experimental design, it is helpful to use the tier methodology again to develop an analysis which reflects the randomisation process and accounts for the sources of variation. The factors we define are assessor, repeat, day, order and product, with 11, 3, 4, 4 and 16 levels respectively. We also define another factor called previous, which has 17 levels and represents the product assessed in the previous session; the extra level is created for those products assessed in session 1 of each repeat and day. The observational unit is the session for each assessor, that is, the repeat, day, order combination for each assessor. The fixed factors are product and previous; all other factors are taken as random. There are two tiers, described by

Tier  Description                      Factors
1     Unrandomised (sessions)          assessor, repeat, day, order
2     Randomised (treatment factors)   product, previous

In tier 1, assessor is crossed with repeat, day and order, whilst the latter three factors are nested, and so the structure formulae are

Tier  Structure formula
1     assessor*(repeat/day/order)
2     product + previous + product.assessor

Table 8.7 Number of non-missing data for each assessor and frozen pea product

          Assessor
Product    1   2   3   4   5   6   8   9  10  11  12
811        1   1   3   2   2   2   3   3   2   2   3
231        1   1   3   3   2   2   3   3   2   2   2
569        3   3   3   3   2   2   3   2   2   2   2
315        1   1   3   2   2   2   3   2   2   2   3
720        3   3   3   3   1   2   3   1   2   2   3
494        3   3   3   3   2   2   3   1   2   2   3
625        2   2   3   3   2   2   3   2   2   2   3
930        2   2   3   3   1   2   3   2   2   2   3
936        2   2   3   3   2   2   3   3   2   2   2
981        3   3   3   3   3   2   3   3   2   2   2
535        2   2   3   3   3   2   3   3   2   2   2
540        3   3   3   3   2   2   3   3   2   2   2
701        2   2   3   2   3   2   3   2   2   2   2
237        3   3   3   3   3   2   3   2   2   2   1
132        3   3   3   3   3   2   3   2   2   2   1
863        2   2   3   2   3   2   3   2   2   2   2

The interaction of product and assessor has been included in the structure formula for tier 2, and since it is an interaction between a fixed factor and a random factor the term is considered random. The order of presentation of products is also known to be important in sensory evaluation trials, and so previous is included to adjust for residual or carry-over effects. Each term in the structure formula for tier 1 must be checked to ensure that it has been respected in the randomisation process. Thus

Term                       Status   Reason
assessor                   in       products assigned to all assessors
repeat                     in       is a complete block
repeat.day                 in       forms IB design for products
repeat.day.order           in       order of presentation considered
assessor.repeat            in       complete block for each assessor
assessor.repeat.day        in       IB design repeated for each assessor
assessor.repeat.day.order  in       the units

Although the data are highly unbalanced, with so many missing values and data available from only 11 of the original 12 assessors, it is still instructive to consider the decomposition of the sources of variation as presented in a skeletal ANOVA table (table 8.8). Where a term has been bracketed, this indicates that there is information in that stratum for that term.

Table 8.8 Skeletal ANOVA for the frozen pea experiment

Source/Decomposition           Term in linear mixed model
Assessor¹
  Mean²                        mu
  Residual²                    assessor
Repeat¹                        repeat
Repeat.Day¹
  [Product]²                   product
  [Previous]²                  previous
  [Product.Assessor]²          product.assessor
  Residual²                    repeat.day
Repeat.Day.Order¹
  [Product]²                   product
  [Previous]²                  previous
  [Product.Assessor]²          product.assessor
  Residual²                    repeat.day.order
Assessor.Repeat¹               assessor.repeat
Assessor.Repeat.Day¹
  [Product]²                   product
  [Previous]²                  previous
  [Product.Assessor]²          product.assessor
  Residual²                    assessor.repeat.day
Assessor.Repeat.Day.Order¹
  Product²                     product
  Previous²                    previous
  Product.Assessor²            product.assessor
  Residual²                    assessor.repeat.day.order

The linear mixed model can now be constructed by collating the terms in the last column of table 8.8, and is given by

y ∼ mu + product + previous + assessor*(repeat/day/order) + product.assessor    (8.3.2)

Figure 8.2 presents a QQ plot of the residuals from fitting (8.3.2); the Gaussian assumption for the errors seems reasonable. Table 8.9 presents the REML estimates for the variance parameters. The interaction between assessors and products is significant (−2 × REML log-likelihood ratio = 4.24, with reference distribution ½χ²₀ + ½χ²₁; p < .05). Table 8.10 presents Wald tests for the fixed effects. Both terms are highly significant.
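Because the product.assessor variance component lies on the boundary of its parameter space under the null hypothesis, the REML likelihood ratio statistic is referred to the mixture ½χ²₀ + ½χ²₁ rather than χ²₁. A minimal sketch of the p-value calculation, assuming SciPy is available:

```python
from scipy.stats import chi2

D = 4.24  # -2 * REML log-likelihood ratio, from the text

# The chi^2_0 component is a point mass at zero, so for D > 0 only the
# chi^2_1 component contributes to the tail probability.
p_value = 0.5 * chi2.sf(D, df=1)
print(round(p_value, 4))  # approximately 0.020, i.e. p < .05
```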


Figure 8.2 QQ plot for the residuals for the frozen pea experiment

Table 8.9 REML estimates of variance components for the frozen pea experiment

Term                 No. of effects   Component
assessor                   11          0.717
repeat                      3          0.00406
repeat.day                 12          0.0191
repeat.day.order           48          0
assessor.repeat            33          0.0563
assessor.repeat.day       132          0.242
product.assessor          176          0.3910
units                     384          2.039

In this example, although the adjusted statistics differ numerically from the unadjusted or large-sample test statistics, the conclusions remain the same. The predicted means for the products are presented in table 8.11. The values presented are marginal predictions with respect to all random terms, including assessors, and are formed by simple averaging of the two-way hyper-table classified by product and previous. The SEDs are based on the adjusted variance matrix for τ̂. Product 863 is the superior product for pea taste, but it is difficult to determine the effects of size, colour, sucrose and variety from these data.

Table 8.10 Testing of fixed effects for the frozen pea experiment

Term       ν1    ν2    F-statistic   P value   Wald F   P value
product    15    86       11.54       <.001     12.49    <.001
previous   16   139        2.01        .017      2.24     .003

Table 8.11 Predicted means for the frozen pea experiment

Product  size  colour  sucrose  variety   Predicted Pea Taste
811       1      1        1        3             8.12
231       1      2        1        2             7.55
569       1      2        1        6             8.70
315       1      2        2        2             7.81
720       2      1        1        8             7.70
494       2      1        2        4             8.24
625       2      2        1        6             8.70
930       2      2        2        2             7.25
936       3      1        1        9             5.34
981       3      1        2        5             7.38
535       3      2        1        6             6.28
540       3      2        2        7             8.37
701       4      1        1        8             4.70
237       4      1        2        7             5.98
132       4      2        1        4             6.65
863       4      2        2        1             9.11

Mean SED  0.531
Min SED   0.505
Max SED   0.555

CHAPTER 9

Mixed models for Geostatistics

9.1 Introduction

Geostatistics is an area of statistics which has evolved as a branch of spatial statistics concerned primarily with the prediction of a spatially dependent quantity based on observations at a set of (pre-specified) locations. The development of geostatistics traces back to the 1960s, but much earlier references to the existence of spatial variation exist. One of the earliest records appears in the paper by Mercer and Hall (1911), who examined the variation in the yield of crops in small plots at Rothamsted Experimental Station, UK. Their paper introduced some of the fundamental concepts, such as spatial dependence, correlation range and the "nugget" effect.

Not long after Fisher arrived at Rothamsted he also became aware of the existence of spatial variation in the field. His primary objective, however, was the efficient estimation of treatment effects in designed experiments. In an attempt to neutralise spatial dependence in this setting he developed the principles of randomisation and blocking, which became the building blocks for the design and analysis of comparative experiments. Valid inferences regarding treatment effects could be achieved through randomisation-based analysis, and little consideration or attention was paid to modelling spatial dependence in this setting until the papers by Bartlett (1978) and Wilkinson et al. (1983).

In the meantime, geostatistics had developed from the initial work of Georges Matheron and colleagues at Fontainebleau, France, originally motivated by problems of ore estimation. These ideas were developed independently of other work in spatial statistics, in particular the work of Matérn, whose doctoral thesis published in 1960 is still widely cited, and of Whittle and Bartlett, to name a few. The division between geostatistics and mainstream spatial statistics is illustrated by the basic geostatistical tool known as kriging. It is now well known that kriging (after Krige (1951)) is equivalent to minimum mean square error prediction under a (Gaussian) linear mixed model. This connection has been made on various occasions since Ripley (1981) and was a primary motivation for the excellent theoretical treatment of the subject in Stein (1999). Recently, Diggle et al. (1998) coined the phrase model-based geostatistics to refer to an approach to geostatistical problems based on the application of formal statistical methods under an assumed stochastic model. We adopt this approach in the following but narrow our focus to the Gaussian setting. We will consider the analysis of designed experiments using geostatistical or spatial statistics in chapter 11.

9.2 Motivating Examples

9.2.1 Cashmore Field

This example arose from a comprehensive study by Lark et al. (1998), who were interested in determining the amount of spatial variability of crop yield within a field which was associated with differences between soil map units. These map units had been previously determined according to a system of simple units corresponding to soil series as defined by the Soil Survey of England and Wales. The study was conducted over three successive seasons on a 6 ha field known as Cashmore field, at Silsoe Research Institute, Bedfordshire, UK. Winter barley was grown in each year, with accompanying soil surveys taken at strategic times during the growing season. The data we consider are the gravimetric water content of the topsoil (0-200 mm) taken in March 1995. These data have been analysed more recently by Lark et al. (2005) and were collected on a 50 m square grid supplemented with additional sample sites to give a total of 100 observations. At the time of sampling the soil was deemed to be at field capacity. The sampling locations are presented in figure 9.1. The aim of the study is to produce a map of the field in terms of soil moisture, which could then be used to produce predictions of soil moisture for individual map units within the field (see figure 2 of Lark et al. (1998) and Clayton and Hollis (1984) for further details).


Figure 9.1 Sampling locations for Cashmore water survey

9.2.2 Electromagnetic salinity

Rapid and cost-effective measurement of soil salinity via the apparent electrical conductivity (ECa) of soil profiles is becoming an important management tool for determining the suitability of soils for growing rice in parts of New South Wales, Australia. The current protocol involves measurements of ECa from a ground-based electromagnetic induction instrument, EM31, which is linked to a differential global positioning system and towed behind a four-wheel motorbike, providing a large number of geographically-referenced observations. From one rice field, 2000 observations were gathered in a serpentine fashion throughout an irregularly-shaped field, as displayed in Figure 9.2. There were 1995 distinct locations, with five locations having two observations. The aim was to produce a fine-scale map to determine where ECa is at least 150 mS/m (milli-Siemens per metre), fulfilling one requirement for suitability for growing rice (Beecher et al., 2002).


Figure 9.2 Sampling locations for EM survey

9.2.3 Fine scale soil pH data

Our third and final example arose as part of a larger study conducted at the Wagga Wagga Agricultural Institute by Dr. Mark Conyers of the New South Wales Department of Primary Industries. Soil pH was determined for five replicates of 100 1 cm³ cubes in a 10 cm × 10 cm square at the soil surface, in a plot that had been cropped and grazed for several years as part of a long term experiment examining a range of cropping rotations (Heenan et al., 1994). The plot had been in a wheat-clover rotation since 1979, was in a clover phase grazed by sheep in 1991, and was sampled in 1992 prior to cultivation for sowing wheat. A sampling location was chosen for each of the five 10 cm × 10 cm grids so that the soil surface was judged to have less than 0.3 cm deviation from horizontal; the orientation of each block was not recorded. A trench was excavated around each grid and slices of soil were taken in layers through the face of the trench. The soil layers were dissected so that the whole of each 1 cm³ cube was taken from the slice. This resulted in a total of 600 1 cm³ cubes of soil taken from each of the five 10 × 10 × 6 cm³ blocks of soil. Soil pH, NO3 and NH4 were measured on each cube. To simplify the analysis we will consider the pH data for the 0-10 cm depth only.

The aim of the study was to examine and identify the extent of micro-scale spatial variation of pH in a typical field in south-western NSW. Previous studies have shown that up to 50% of the total variance occurs within a square metre (Beckett and Webster, 1971), and if progress is to be made in understanding soil acidity at the process level (Conyers et al., 1995) and in the management of variable soil acidity at the field scale (Van Vuuren et al., 2000), it is crucial that the scale at which biologically meaningful variation in pH occurs be determined.

9.3 Geostatistical mixed model

Our development of the model follows the aims of analysis of the data-sets. Geostatistics is usually concerned with the problem of producing a map or interpolation of a quantity of interest over a particular area (for simplicity we assume the area lies in R²). We assume that we have observed data at a set of n locations (of which b, possibly less than n, are distinct), with the ith observation yi taken at the location identified by a vector si, i = 1, ..., n. The observed data may relate to a single point or to a subset of points, say Ri, within a bounded subset of R²; the subsets Ri are assumed to be mutually disjoint. For example, in the soil pH example the observed data relate to a 1 cm³ cube of soil, while for the EM salinity data each observation is an integration over a pre-determined (by the scope of the EM31 sensor) range of soil, both horizontally and vertically. A model for yi is

yi = f(si) + ei    (9.3.1)

where f(si) is some function of the spatial location si and the ei are mutually independent N(0, σ²) random variables. If s represents the set of b distinct observed locations, and f(s) = (f(s1), ..., f(sn))ᵀ, then we assume

f(s) = Xsτs + Zsus(s)    (9.3.2)

where Xs is an n × p matrix of polynomials in s, often of degree 1, and the associated p × 1 vector τs is a vector of polynomial regression coefficients. This term is included in the model to account for so-called trend or non-stationary behaviour. The matrix Zs is an n × b indicator matrix for random effects at the distinct locations, accommodating duplicated locations (typically Zs = In if all the locations are distinct, otherwise b < n). Finally, us(s) is a realisation of a stationary Gaussian process, distributed independently of e = (e1, ..., en)ᵀ, with zero mean and variance matrix γsσ²Gs, so that γs is the ratio of the variance of the spatially-correlated process to the so-called nugget variance (i.e. σ²). The elements of Gs are given by ρ(si − sj; φ), where ρ(·) is a correlation function with parameter vector φ, depending on the spatial separation vector hij = si − sj. The matrix Gs is assumed positive definite. Subscripts s indicate elements specifically related to spatial effects. Combining equations (9.3.1) and (9.3.2) we have, in matrix notation,

y(s) = Xsτs + Zsus(s) + e    (9.3.3)

and

y(s) ∼ N(Xsτs, σ²(γsZsGsZsᵀ + I))    (9.3.4)

which is a mixed model with spatially correlated random effects and identically and independently distributed residual errors. In geostatistical terms, the identity matrix in (9.3.4) models a nugget effect; we shall return to this later (see section 9.4.10). Alternatively, provided the n locations are distinct (b = n), the roles of e and u can be switched, formulating the model with spatially correlated residual errors es(s) ∼ N(0, σs²Rs), with the nugget effect modelled as an independent random n × 1 effect u ∼ N(0, σs²γIn) with design matrix Z = In,

y(s) = Xsτs + u + es    (9.3.5)

and

y(s) ∼ N(Xsτs, σs²(γIn + Rs)).    (9.3.6)

Here σs² = σ²γs, and γ = 1/γs is the ratio of the nugget variance to the variance of the spatially-correlated process; the nugget effect can be excluded by setting γ = 0, equivalent to dropping u from (9.3.5). Using this form, the values of the spatial process f(s) in (9.3.2) are now Xsτs + es, and e in (9.3.3) is replaced by u. In this model the matrix Rs is actually the same as the matrix Gs in the first form (9.3.3) and (9.3.4), but we use the R notation for consistency with the notation of previous chapters. The first form, (9.3.3) and (9.3.4), is necessary when there are locations with multiple observations, and the second form, (9.3.5) and (9.3.6), is required if it is desired to fit a model with no nugget effect. We frequently interchange between these two forms in the analysis of the examples, and it is important to understand their duality and equivalence in certain cases. It is clear that a difficulty would arise if we wished to fit a model without a nugget effect for data with multiple observations.

9.4 Covariance Models for Gaussian random fields

The mathematical description of the dependence between observations at different locations is central to geostatistics, and hence the key element of our geostatistical model is the specification of the covariance model for us(s). We have already used terms such as stationarity which may be unfamiliar to some. In this section we briefly review some aspects of the theory of random fields, common nomenclature and results which are necessary for understanding the development of covariance models for us(s) and for the analysis of the examples described in section 9.2. A more thorough account can be found in Stein (1999).
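Before turning to specific covariance models, the duality between the two formulations of section 9.3 can be verified numerically. The following minimal Python sketch, using an exponential correlation function and parameter values chosen purely for illustration, checks that the marginal covariance σ²(γsGs + I) from (9.3.4) (with Zs = I) coincides with σs²(γIn + Rs) from (9.3.6) when σs² = σ²γs, γ = 1/γs and Rs = Gs:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(0, 10, size=(6, 2))           # six distinct 2-d locations
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)

phi, sigma2, gamma_s = 2.0, 1.5, 3.0          # illustrative parameter values
G = np.exp(-d / phi)                          # exponential correlation, for illustration

V_form1 = sigma2 * (gamma_s * G + np.eye(6))            # from (9.3.4), Z = I
sigma2_s, gamma = sigma2 * gamma_s, 1.0 / gamma_s
V_form2 = sigma2_s * (gamma * np.eye(6) + G)            # from (9.3.6), R = G

print(np.allclose(V_form1, V_form2))          # True: the two forms agree
```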

9.4.1 Preliminaries

In the following, for ease of notation, we drop the subscript and denote the random field us(s) by U(s). The field is Gaussian if the joint distribution of U(s1), ..., U(sn) is multivariate normal for any set of n locations. We denote the covariance function by ψ(·, ·); this function must satisfy

Σᵢ,ⱼ cᵢcⱼψ(sᵢ, sⱼ) ≥ 0,  i, j = 1, ..., n,

for all n < ∞, all s1, ..., sn ∈ R² and all real c1, ..., cn. This requirement results from

var(Σᵢ cᵢU(sᵢ)) = Σᵢ,ⱼ cᵢcⱼψ(sᵢ, sⱼ).

Furthermore, if we denote the mean of U(s) by m(s), then the joint distribution of U(s1), ..., U(sn) is multivariate normal with mean (m(s1), ..., m(sn))ᵀ and covariance matrix Ψ with ijth element ψ(sᵢ, sⱼ).

9.4.2 Stationarity

Since observations are made on only a single realisation of a random field, we cannot make much progress without further assumptions. Stationarity is an important simplifying assumption, and there are several forms. A random field is said to be weakly (or second order) stationary if
• its mean is independent of location, i.e. m(s) = µ, say, for all s ∈ R²
• the covariance function between any pair of locations s and t is a function only of the spatial separation vector h = s − t, that is, ψ(s, t) = ψ(h).

Under this assumption it is equivalent to consider the correlation function of U, denoted ρ(·), with ψ(h) = σs²ρ(h), noting that ρ(0) = 1.

Strong stationarity occurs when the complete distribution of any pair of observations, including all the moments of the distribution, depends only on the spatial separation vector. Lastly, a weaker form is intrinsic stationarity, for which the increments U(s) − U(t) have zero mean and variance depending only on the spatial separation vector h. This concept leads to the class of intrinsic random functions (IRFs) introduced by Matheron (1973) and discussed by Cressie (1993) and Zimmerman (1989). These are the spatial analogue of integrated time series, but we will not deal with them in any detail here.

9.4.3 Isotropy

Another important property of a random field is isotropy. A weakly stationary random field (in more than one dimension) is said to be isotropic if the dependence between any pair of observations depends only on the Euclidean distance between them; otherwise it is said to be anisotropic. In terms of the correlation function, ρ(s, t) = ρ(h) = ρ(d), where d = ||h||.

9.4.4 Mean square continuity and differentiability

A descriptive property of a spatial surface is its smoothness. The concept of smoothness is widely used in semi-parametric regression (Green and Silverman, 1994a; Verbyla et al., 1999a), and we can similarly define the smoothness of a random field mathematically, using the ideas of mean square continuity and differentiability (Bartlett, 1966). The random field U(s) is mean square continuous at s if

E[{U(s + ε) − U(s)}²] → 0  as  ε → 0.

Similarly, U(s) is mean square differentiable if there exists a random field U′(s) such that, for all s,

E[{(U(s + ε) − U(s))/ε − U′(s)}²] → 0  as  ε → 0.

Higher order mean square derivatives are defined in a similar manner. The mean square differentiability of the random field U(s) is directly linked to the differentiability of its covariance function at 0 (Stein, 1999).

Result 9.1 For a weakly stationary random field U(s),
• U(s) is mean square continuous at s if and only if ψ(ε) is continuous at the origin;
• U(s) is m-times mean square differentiable if and only if ψ⁽²ᵐ⁾(0) exists and is finite, and if so the covariance function of U⁽ᵐ⁾ is (−1)ᵐψ⁽²ᵐ⁾.
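As a worked instance of Result 9.1, consider the exponential and Matérn (ν = 1.5) covariance functions introduced below in section 9.4.9; the expansions here are a sketch added for illustration and are straightforward to verify:

```latex
% Exponential covariance: continuous at 0, but the |d| term gives a kink,
% so psi''(0) does not exist: U is mean square continuous but not mean
% square differentiable.
\psi(d) = \sigma^2 e^{-|d|/\phi}
        = \sigma^2\left(1 - \tfrac{|d|}{\phi} + \tfrac{d^2}{2\phi^2} - \cdots\right)

% Matern covariance with nu = 3/2: no |d| term, and psi''(0) = -sigma^2/phi^2
% is finite, so U is once mean square differentiable; the |d|^3 term prevents
% the fourth derivative existing at 0, so U is not twice differentiable.
\psi(d) = \sigma^2 e^{-|d|/\phi}\left(1 + \tfrac{|d|}{\phi}\right)
        = \sigma^2\left(1 - \tfrac{d^2}{2\phi^2} + \tfrac{|d|^3}{3\phi^3} - \cdots\right)
```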

9.4.5 The variogram

Traditional geostatistics relies heavily on the variogram. The variogram exists and is linked to the covariance function for weakly stationary random fields, but it also exists for other random fields, such as an IRF-0 (intrinsic random function of order 0). The variogram of a random field U(s) is defined to be the function V(s, t) = ½ var(U(s) − U(t)) for any s, t ∈ R². If the random field is weakly stationary then this reduces to

V(s, t) = V(h) = ½ var(U(s) − U(t)) = ½ E[{U(s) − U(t)}²] = σs²(1 − ρ(h)).

If the random field is also isotropic then the variogram is a function only of the Euclidean distance d. Technically, 2V(h) is called the variogram and V(h) the semi-variogram, but the term variogram is more often used for V(·), and we will follow this convention.

9.4.6 Geometric Anisotropy

When considering observations taken at spatial locations in two (or higher) dimensional space, we may wish to retain the assumption of weak stationarity but avoid the assumption of isotropy. This amounts to relaxing the assumption that the covariance function ψ(·) is a function only of the Euclidean distance. In R², isotropic correlation (or covariance) has circular contours of constant correlation with respect to the elements of the spatial separation vector h = (h1, h2)ᵀ. In some cases it may be preferable to align the components of h with the coordinate axes of the spatial locations, hence making the correlation model dependent on them. One such application is in the area of field trials, where axial dependence of the correlation function may be a sensible model given the likely imposition of cultural and agronomic practices on the underlying random field. More often, however, it may be more natural to allow the preferred direction with respect to the random field to correspond to a rotation of the coordinate axes. The most common form of anisotropic behaviour which achieves this is termed geometric anisotropy. Geometric anisotropy in two dimensions can be specified via a transformation of h which depends on an anisotropy angle α and an anisotropy ratio δ; higher dimensional geometric anisotropy requires more parameters to be completely general.

The correlation function of an isotropic random field is a function only of the Euclidean distance d. To convert this correlation function ρ(·) to geometric anisotropy, we apply a rotation of the original coordinates through α radians and then stretch (or shrink) the resulting axes relative to each other. In matrix notation we have

h″ = (h″1, h″2)ᵀ = S h′ = S T h,

where

S = [ √δ   0    ]      T = [  cos α   sin α ]
    [ 0    1/√δ ],         [ −sin α   cos α ],

and h′ = T h, say. The geometric anisotropic correlation function is then a function of the Euclidean distance based on h″, that is,

d² = h″ᵀh″ = hᵀTᵀS²Th.

We note that there is non-uniqueness in this metric d(·), since inverting δ and adding π/2 to α gives the same distance. This non-uniqueness can be removed by constraining 0 ≤ α < π/2 and δ > 0, or by constraining 0 ≤ α < π and either 0 < δ ≤ 1 or δ ≥ 1. Isotropy corresponds to δ = 1, and then the rotation angle α is irrelevant: correlation contours are circles, compared with ellipses in general. Figure 9.3 presents several forms of anisotropic correlation functions.

[Panels, labelled (δ, α): Isotropic (1, 0); Anisotropic (2, 0); Anisotropic (2, .4); Anisotropic (2, −.4).]

Figure 9.3 Examples of geometric anisotropy

9.4.7 Minkowski Metric

The anisotropic correlation function described in section 9.4.6 can be further generalised (for random fields in more than one dimension) by replacing the usual Euclidean metric with the so-called Minkowski metric. The Minkowski metric applied to the transformed coordinates is

d(h; δ, α, λ) = { δ|h′1|^λ + (1/δ)|h′2|^λ }^(1/λ),

recalling that h′ = (h′1, h′2)ᵀ = T h, with T the rotation matrix of section 9.4.6, and with λ usually taken to be a positive integer. When λ = 2 this metric is the Euclidean metric, and when λ = 1 it corresponds to the city-block metric used in the analysis of field trials (Cullis and Gleeson, 1991). Following Haskard et al. (2005) we can then embed this generalised metric into the correlation function ρ(·), giving ρ(h) = ρ(d(h; δ, α, λ)).

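A minimal Python sketch of this distance calculation, with parameter values chosen purely for illustration:

```python
import numpy as np

def aniso_distance(h, delta, alpha, lam):
    """Minkowski distance of a separation vector h = (h1, h2) after
    rotation through alpha radians, with anisotropy ratio delta and
    metric parameter lam (lam = 2: Euclidean, lam = 1: city-block)."""
    T = np.array([[np.cos(alpha), np.sin(alpha)],
                  [-np.sin(alpha), np.cos(alpha)]])
    h1, h2 = T @ np.asarray(h, dtype=float)
    return (delta * abs(h1) ** lam + abs(h2) ** lam / delta) ** (1.0 / lam)

# With delta = 1 the rotation is irrelevant and lam = 2 recovers the
# ordinary Euclidean distance.
print(aniso_distance([3.0, 4.0], delta=1.0, alpha=0.7, lam=2))  # 5.0
print(aniso_distance([3.0, 4.0], delta=2.0, alpha=0.0, lam=1))  # 8.0
```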
9.4.8 Separability

Correlation functions for a random field on R² can be taken to be the product of one-dimensional correlation functions; specifically, for correlation functions ρ1(·) and ρ2(·), ρ(h) = ρ1(h1)ρ2(h2). Such random fields are known as separable random fields and are widely used in the analysis of field trials (Martin, 1979; Cullis and Gleeson, 1991; Gilmour et al., 1997). They offer substantial computational savings but may not be acceptable for other spatial data. For routine geostatistical applications Stein (1999) suggests that they would be "untenable for most physical processes", primarily because of their dependence on the choice of axes; he demonstrates that this can lead to very unsatisfactory behaviour from a prediction viewpoint.

9.4.9 Parametric correlation models

Some of the parametric correlation models that have been suggested and used in geostatistics are:

exponential: ρ(d) = exp(−d/φ)

gaussian: ρ(d) = exp(−(d/φ)²)

spherical: ρ(d) = 1 − (3/2)(d/φ) + (1/2)(d/φ)³ if 0 ≤ d < φ, and ρ(d) = 0 if d ≥ φ

circular: ρ(d) = (2/π){cos⁻¹(d/φ) − (d/φ)√(1 − (d/φ)²)} if 0 ≤ d < φ, and ρ(d) = 0 if d ≥ φ

powered exponential: ρ(d) = exp(−(d/φ)ᵏ), where k is restricted to 0 < k ≤ 2 to ensure a valid correlation function; setting k = 1 or k = 2 gives the exponential or gaussian correlation functions respectively

Whittle's elementary correlation: ρ(d) = (d/φ)K1(d/φ)

where K1(·) is the modified Bessel function of the third kind of order 1 (Abramowitz and Stegun, 1965);

bounded linear: ρ(d) = 1 − d/φ if 0 ≤ d < φ, and ρ(d) = 0 if d ≥ φ.

Following the recommendations of Stein (1999) we base our correlation model on the Matérn family of correlation functions. The isotropic Matérn correlation function is given by

ρM(d; φ, ν) = {2^(ν−1)Γ(ν)}⁻¹ (d/φ)^ν Kν(d/φ),    (9.4.7)

where φ > 0 is a range parameter, ν > 0 is a smoothness parameter, Γ(·) is the gamma function, and Kν(·) is the modified Bessel function of the third kind of order ν (Abramowitz and Stegun, 1965, S9.6). For a given ν, the range parameter φ affects the rate of decay of ρ(·) with increasing d. The parameter ν > 0 controls the analytic smoothness of the underlying process us, the process being ⌈ν⌉ − 1 times mean-square differentiable, where ⌈ν⌉ is the smallest integer greater than or equal to ν (Stein, 1999, p. 31). Larger ν correspond to smoother processes.

When ν = m + ½ with m a non-negative integer, ρM(·) is the product of exp(−d/φ) and a polynomial of degree m in d. Thus if ν = ½ we obtain the exponential correlation function, ρM(d; φ, ½) = exp(−d/φ), while ν = 1 yields Whittle's elementary correlation function, ρM(d; φ, 1) = (d/φ)K1(d/φ) (Webster and Oliver, 2001, p. 119). When ν = 1.5,

ρM(d; φ, 1.5) = exp(−d/φ)(1 + d/φ),

which is the correlation function of a random field which is continuous and once differentiable; this form has been used recently by Kammann and Wand (2003). As ν → ∞, ρM(·) tends to the gaussian correlation function. Thus the Matérn correlation function offers flexibility and parsimony, and includes many other correlation functions as special cases. There are also links with generalised covariance functions of IRFs (Stein, 1999, p. 177). Figure 9.4 presents some examples of the Matérn correlation function for specific choices of ν and φ.
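A minimal Python sketch of (9.4.7), using SciPy's modified Bessel function; it also checks the ν = ½ special case against the exponential correlation:

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function K_nu

def matern(d, phi, nu):
    """Isotropic Matern correlation (9.4.7); d may be a scalar or array."""
    d = np.asarray(d, dtype=float)
    x = d / phi
    return np.where(
        d > 0,
        (x ** nu) * kv(nu, np.where(d > 0, x, 1.0)) / (2 ** (nu - 1) * gamma(nu)),
        1.0,  # rho(0) = 1 by continuity
    )

d = np.linspace(0.01, 1.0, 5)
# nu = 0.5 should reproduce the exponential correlation exp(-d/phi)
print(np.allclose(matern(d, phi=0.2, nu=0.5), np.exp(-d / 0.2)))  # True
```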

[Two panels of correlation against distance: left, Matérn with ν = 1.5 and φ = 0.05, 0.1, 0.15, 0.2; right, Matérn with ν = 0.5, 1.5, 2.5.]

Figure 9.4 Examples of the Matérn correlation function

9.4.10 Extended geometric anisotropy within the Matérn class

We will use the correlation model suggested by Haskard (2005), which is the Matérn family of correlation functions incorporating geometric anisotropy and a choice of distance metrics. This is given by

ρ(h; φ) = ρM(d(h; δ, α, λ); φ, ν)

where h = (h1, h2)ᵀ is the spatial separation vector, (δ, α, λ) govern the choice of metric and geometric anisotropy, and (φ, ν) are the parameters of the Matérn correlation function. The metric parameter λ is usually set to either 1 or 2, primarily based on context, but mindful of the criticisms aimed at the use of separable correlation models for spatial data. Geometric anisotropy is discussed in most geostatistical books (Webster and Oliver, 2001; Diggle et al., 2003), but rarely are the anisotropy angle or ratio estimated from the data. Similarly, the smoothness parameter ν is often set a priori (Kammann and Wand, 2003; Diggle et al., 2003); however, Stein (1999) and Haskard (2005) demonstrate that ν can be reliably estimated even for modest sized data-sets, subject to caveats regarding the sampling design. These issues will be investigated in the analysis of the three motivating examples. Haskard et al. (2005) present a more thorough investigation of the properties of REML estimates of the parameters in the extended correlation model described above.

9.4.11 Measurement errors - the nugget effect

The geostatistical mixed model (9.3.3) has three components: a deterministic component (Xτ) and two stochastic components (us(s) and e). The variance parameter σ², as mentioned earlier, is termed the nugget variance in the geostatistical literature, owing its rather colourful name to the origins of geostatistics in the mining industry. The mutual independence of the ei is reasonable if they represent errors arising from the measurement process. This implies that duplicated observations at the same sampling location would differ (and be statistically independent); the nugget variance could then be estimated from these duplicated observations. In practice, however, it is less common to have duplicated observations, and further empirical evidence has suggested that even the closest observations in space differ by more than the technical error (Laslett et al., 1987). The term nugget therefore reflects the fact that, by including e in the model, we are essentially modelling both micro-scale spatial variation and measurement error. We can see this by supposing that the true model for the data is

y(s) = Xsτs + Zsus(s) + Zsvs(s) + e    (9.4.8)

where the additional term vs(·) is a stationary Gaussian random field, independent of us(·), with variance γvσ² and correlation function ρv(·) such that ρv(d) = 0 for d > d0, say, where d0 corresponds to the minimum distance of the sampling design. It is therefore clear that model (9.4.8) is indistinguishable from

y(s) = Xsτs + Zsus(s) + e′

where the e′i are mutually independent Gaussian random variables with mean 0 and variance σ²(1 + γv). This wrong model would give incorrect predictions whenever the predictand is close to a sampling location. Since measurement errors are mostly unavoidable, it is generally sound practice, though not necessary, to include a nugget effect, and perhaps to be less concerned whether the discontinuity in the spatial correlation of the data at the origin is due to measurement errors alone. It can be shown (Stein, 1999) that for any σ² > 0 and a finite set of locations, the presumed mean square error of a prediction from a model ignoring the nugget effect will be too small, but, pragmatically, near-optimal predictions and accurate estimates of mean square error may still be achievable for sufficiently small measurement errors.
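The indistinguishability argument can be checked numerically: if ρv has range shorter than the minimum sampling distance, its correlation matrix at the observed locations is the identity, so vs is absorbed into the nugget. A minimal sketch, using the spherical correlation of section 9.4.9 with invented locations and parameter values:

```python
import numpy as np

def spherical(d, phi):
    """Spherical correlation: zero beyond the range phi."""
    d = np.asarray(d, dtype=float)
    return np.where(d < phi, 1 - 1.5 * (d / phi) + 0.5 * (d / phi) ** 3, 0.0)

rng = np.random.default_rng(1)
s = rng.uniform(0, 50, size=(20, 2))                      # sampling locations
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)

d0 = d[d > 0].min()                                       # minimum sampling distance
G_v = spherical(d, phi=0.9 * d0)                          # range below d0

# G_v is the identity, so Z v(s) contributes only gamma_v * sigma^2 to the
# diagonal of var(y): indistinguishable from an inflated nugget.
print(np.allclose(G_v, np.eye(20)))                       # True
```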

9.5 Prediction

The aims of the three examples emphasise that in many applications of the geostatistical linear mixed model the main objective is to form predictions. In the Cashmore and rice salinity examples, fine-scale maps of soil water and ECa are required. To achieve this objective we therefore wish to predict f(·) at a new set of locations. To keep the notation simple we consider prediction at a single new location, denoted s0 ∈ R². The form of f(s0) varies depending on which form of the geostatistical linear mixed model we fit. In the following we refer to predictions based on (9.3.3) and (9.3.5) as G-level predictions and R-level predictions respectively. Hence

f(s0) = x0ᵀτs + us(s0)   (G-level)
f(s0) = x0ᵀτs + es(s0)   (R-level)        (9.5.9)

where x0 is the vector of fixed effects for s0. Using the results of chapter 3 it is relatively simple to show that the best linear unbiased prediction of f(s0), for known γs (γ for R-level) and θ = (φ, ν, δ, α)ᵀ, is given by

f̃(s0) = x0ᵀτ̂s + gs0ᵀGs⁻¹ũs(s)   (G-level)
f̃(s0) = x0ᵀτ̂s + rs0ᵀRs⁻¹ẽs(s)   (R-level)        (9.5.10)

where gs0 = cor(us(s), us(s0)) (G-level), rs0 = cor(es(s), es(s0)) (R-level), and τ̂s, ũs(s) and ẽs(s) are solutions to the mixed model equations for the G-level or R-level forms. For the G-level form the mixed model equations are

[ XsᵀXs    XsᵀZs             ] [ τ̂s    ]   [ Xsᵀy(s) ]
[ ZsᵀXs    ZsᵀZs + γs⁻¹Gs⁻¹  ] [ ũs(s) ] = [ Zsᵀy(s) ],

written Cg β̃g = Wgᵀ y(s), say, and for the R-level form

[ XsᵀRs⁻¹Xs    XsᵀRs⁻¹     ] [ τ̂s ]   [ XsᵀRs⁻¹y(s) ]
[ Rs⁻¹Xs       Rs⁻¹ + γ⁻¹I ] [ ũ  ] = [ Rs⁻¹y(s)    ],

written Cr β̃r = Wrᵀ Rs⁻¹ y(s), with ẽs(s) = ys(s) − Wr β̃r. In this form it is also convenient to consider a more compact expression for ẽs(s). If εs(s) = u + es(s), then it can be shown that

ẽs(s) = Rs(γI + Rs)⁻¹(ys(s) − Xsτ̂s) = Rs Hs⁻¹ ε̃s(s),

say, where Hs = γI + Rs and ε̃s(s) = ys(s) − Xsτ̂s. The BLUP of f(s0) in (9.5.10) is the universal kriging estimate, after Krige (1951). The additional utility of the above result is that we can also use standard results to compute the mean squared error of prediction (MSEP), or prediction error variance (pev), of the BLUP. That is, for a G-level prediction we have

f̃(s0) − f(s0) = w0gᵀ(β̃g − βg) + {gs0ᵀGs⁻¹us(s) − us(s0)}

where w0gᵀ = [x0ᵀ, gs0ᵀGs⁻¹]. Using the result that

var [ β̃g − βg                  ]        [ Cg⁻¹   0                       ]
    [ gs0ᵀGs⁻¹us(s) − us(s0)   ] = σ²   [ 0      γs(1 − gs0ᵀGs⁻¹gs0)     ],

we hence obtain

var(f̃(s0) − f(s0)) = σ²{ w0gᵀCg⁻¹w0g + γs(1 − gs0ᵀGs⁻¹gs0) }.

For an R-level prediction,

f̃(s0) − f(s0) = x0rᵀ(τ̂s − τs) + {rs0ᵀHs⁻¹εs(s) − es(s0)}

where x0rᵀ = x0ᵀ − rs0ᵀHs⁻¹Xs. Using the result that

var [ τ̂s − τs                 ]        [ (XsᵀHs⁻¹Xs)⁻¹   0                ]
    [ rs0ᵀHs⁻¹εs(s) − es(s0)  ] = σs²  [ 0               1 − rs0ᵀHs⁻¹rs0  ],

we then obtain

var(f̃(s0) − f(s0)) = σs²{ x0rᵀ(XsᵀHs⁻¹Xs)⁻¹x0r + (1 − rs0ᵀHs⁻¹rs0) }.

Gilmour et al. (2004) present a computationally efficient algorithm for computing both G- and R-level predictions and their associated MSEPs.
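The following Python sketch illustrates an R-level prediction of the form (9.5.10) for known variance parameters, using the exponential correlation function purely for illustration; the data and parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
s = rng.uniform(0, 10, size=(n, 2))              # observed locations
X = np.column_stack([np.ones(n), s])             # degree-1 trend: 1, x, y
y = 2 + 0.3 * s[:, 0] - 0.1 * s[:, 1] + rng.normal(0, 1, n)  # toy data

phi, gamma = 2.0, 0.2                            # illustrative range and nugget ratio
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
R = np.exp(-d / phi)                             # exponential correlation
Hi = np.linalg.inv(gamma * np.eye(n) + R)        # H^-1, H = gamma*I + R

# GLS estimate of the trend: tau_hat = (X' H^-1 X)^-1 X' H^-1 y
tau_hat = np.linalg.solve(X.T @ Hi @ X, X.T @ Hi @ y)

# R-level BLUP at a new location s0: x0' tau_hat + r0' H^-1 (y - X tau_hat),
# since r0' R^-1 e_tilde = r0' H^-1 (y - X tau_hat)
s0 = np.array([5.0, 5.0])
x0 = np.array([1.0, *s0])
r0 = np.exp(-np.linalg.norm(s - s0, axis=1) / phi)
f0 = x0 @ tau_hat + r0 @ Hi @ (y - X @ tau_hat)
print(f0)
```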

9.6 Estimation

The application of (9.5.10) requires knowledge of the variance parameters, i.e. σ², γs and θ. As in previous chapters, we replace the unknown variance parameters by their REML estimates. The BLUPs are then E-BLUPs; their optimality is no longer guaranteed, and the formulae for the prediction error variances apply only asymptotically. We examine this issue in section 9.9. Haskard (2005) presents details of the REML estimation of the anisotropic Matérn class variance parameters using the AI algorithm. This requires the derivative of the variance matrix Gs with respect to each θi. These are all straightforward to compute except for ν, and so, to ensure stability, Haskard (2005) suggests that it is best to use numerical methods to compute the derivative for this parameter.

In contrast to many geostatistical texts, we share the views of Stein (1999, p. 222) and regard REML estimation of variance parameters as the only reasonable approach; estimation based on the sample semi-variogram (defined in section 9.7) cannot be recommended in general. One potential obstacle, however, to more widespread adoption of likelihood-based estimation is the burden of computing the likelihood, score and AI matrix for large irregularly spaced data-sets. Unlike the spatial analysis of field trials, which exploits the assumption of separability to overcome the computational burden, we are not in a position to use such assumptions in general for irregularly spaced data. Moreover, it is the lack of sparsity in Gs⁻¹ which is the major source of computational load. Stein et al. (2004) describe an approach for efficiently obtaining REML estimates in large spatial data-sets which may resolve these problems.

9.7 Model building and diagnostics

The other major advantage of the geostatistical linear mixed model (over classical geostatistics) is that the relative fit of different variance models can be assessed within the framework of REML likelihood ratio tests. As in chapter ??, the REML likelihood ratio for testing hypothesis H0 nested within hypothesis H1, where H1 contains an additional k parameters, is given by

D = −2{ℓR(ρ̂0; y) − ℓR(ρ̂1; y)}

where ρ̂0 and ρ̂1 are the REML estimates of ρ, the vector of all variance parameters, for the models under H0 and H1 respectively. The statistic D is asymptotically distributed as a chi-squared variable with k degrees of freedom. There are several exceptions which are relevant for the models we may consider. For example, the distribution theory for D is complicated when a parameter is on the boundary of the parameter space under H0 (Stram and Lee, 1994); a test of the need for the inclusion of a nugget effect would require such attention.

A test for (geometric) anisotropy is another example which deserves closer attention, and this issue has been considered in some detail by Haskard (2005). At first glance, the usual χ² test may not seem to apply, as isotropy occurs when δ = 1, with anisotropy otherwise. This suggests a test of H0: δ = 1 against H1: δ ≠ 1. Although the boundary issue mentioned above does not apply, because α can be constrained to 0 ≤ α < π/2 and then δ = 1 is not on the boundary, the test appears non-regular because the angle α is required under H1 but not under H0, and therefore cannot be estimated under H0. Haskard (2005) demonstrates that this issue can be overcome by using the alternative invertible reparameterisation

ξ1 = (δ − 1/δ) sin(2α)
ξ2 = (δ − 1/δ) cos(2α),

as isotropy corresponds exactly to H0: (ξ1, ξ2) = (0, 0) and anisotropy to H1: (ξ1, ξ2) ≠ (0, 0), leading to a χ²₂ reference distribution for D under H0. For comparison of non-nested models we use an extended version of the AIC or BIC criteria based on the REML log-likelihood.

The formal model selection process is aided by various graphical tools based on either the original data or, more often, BLUPs or residuals from intermediate and final models. Included among these are graphical displays of the sample omni-directional or directional semi-variograms, which are defined below.

9.7.1 Sample semi-variograms

The sample omni-directional semi-variogram is based on the empirical omni-directional semi-variogram of the BLUP of a random field, ũs = (ũs(s1), ..., ũs(sn))ᵀ, which is given by the set of points {(dij, ṽij) : j < i}, where dij = ||si − sj|| and ṽij = ½(ũs(si) − ũs(sj))².

In a regular spatial sampling design, say an r × c array with equal spacings (r1 and r2, say), the separation vectors and the set of pairwise distances take a smaller number of unique values, and it may then be sensible to average the ṽij for each of these distinct values of dij. For irregular spatial sampling designs there may be no replication of the unique values of dij, and the standard approach is to 'bin' the semi-variances according to the following principle. The sample omni-directional semi-variogram is the set of points (dk, v̄k), where the dk, k = 1, ..., q, are pre-specified distances and

v̄k = (1/nk) Σ_{dij ∈ Sk} ṽij,

where Sk is the set of points for which d is closer to dk than to any other dk′, and nk is the number of elements in Sk.

To examine anisotropy we need to consider graphical displays of the semi-variance in terms of the elements (or functions) of the spatial separation vector h. The empirical semi-variogram cloud is defined to be the set of triples (hij1, hij2, ṽij). This is then graphically represented by two forms of 'binning', based either on the cartesian coordinates (h1, h2) or on the polar coordinates (d, t), where d = √(h1² + h2²) and t = tan⁻¹(h2/h1). The former display, based on cartesian coordinates, has been widely used for the analysis of field trials (see chapter 11); the latter, based on polar coordinates, is usually more suitable for other applications of the geostatistical linear mixed model.

The cartesian sample semi-variogram cloud is defined as the set of points (hk1, hl2, v̄kl), k = 1, ..., q1 and l = 1, ..., q2, with

v̄kl = (1/nkl) Σ_{hij1 ∈ Sk1, hij2 ∈ Sl2} ṽij

for pre-specified sets of lags hk1 and hl2, where Sk1 is the set of values of h1 closer to hk1 than to any other hk′1, Sl2 is the set of values of h2 closer to hl2 than to any other hl′2, and nkl is the number of points such that hij1 ∈ Sk1 and hij2 ∈ Sl2.

The directional empirical semi-variogram is the set of points {(dij, tij, ṽij) : j < i}, and the directional sample semi-variogram is the set of points (dk, tl, v̄kl) with

v̄kl = (1/nkl) Σ_{dij ∈ Sk, tij ∈ Tl} ṽij

for pre-specified sets of distances dk and angles tl, where Sk is the set of values of d closer to dk than to any other dk′, Tl is the set of values of t closer to tl than to any other tl′, and nkl is the number of points such that dij ∈ Sk and tij ∈ Tl. Figure 9.5 shows a binning region for 45° with d = 1.5.


Figure 9.5 Typical binning region for a directional sample semi-variogram
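A minimal Python sketch of the binned omni-directional sample semi-variogram just described, computed from a vector of values at observed locations; all inputs here are invented for illustration (the random values stand in for BLUPs of a random field):

```python
import numpy as np

def sample_semivariogram(s, u, bin_edges):
    """Binned omni-directional sample semi-variogram of the values u
    observed at locations s (n x 2), averaging the half squared
    differences v_ij within pre-specified distance bins."""
    i, j = np.triu_indices(len(u), k=1)
    d = np.linalg.norm(s[i] - s[j], axis=1)        # pairwise distances d_ij
    v = 0.5 * (u[i] - u[j]) ** 2                   # semi-variances v_ij
    k = np.digitize(d, bin_edges)                  # assign each pair to a bin
    d_bar = np.array([d[k == b].mean() for b in range(1, len(bin_edges))])
    v_bar = np.array([v[k == b].mean() for b in range(1, len(bin_edges))])
    return d_bar, v_bar

rng = np.random.default_rng(3)
s = rng.uniform(0, 10, size=(100, 2))
u = rng.normal(size=100)                           # stand-in for BLUPs
print(sample_semivariogram(s, u, bin_edges=np.linspace(0, 14, 8)))
```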

9.8 Analysis of examples

9.8.1 Cashmore Field

Figure 9.6 presents plots of the soil water content against the x and y coordinates of the data, which are aligned with the natural topography of the field and run from west to east and from south to north. There are obvious increasing trends in soil water content from north to south and from east to west. Lark et al. (1998) pointed out that a previous soil survey indicated that the north of the field overlies the Lower Greensand (Cretaceous sands). The Gault Clay is downfaulted against the Lower Greensand, with the boundary running across the sampled area from west-north-west to east-south-east in the southern part of the field. These solid formations are overlaid by loose material. In the north-eastern part of the field the soil is dominated by coarse alluvium, with heavy textured soils in the north-west. The south-west of the field has a sandy loam, while in the south-east the soil is an Evesham series formed in swelling clay to the surface and loamy textured alluvium. This means that the driest soils are in the north-east of the sampled area and the wettest in the south-west.


Figure 9.6 Scatter plots of soil water content against sampling coordinates

Figure 9.7 shows the sample omni-directional and directional semi-variograms for the residuals from zero, first and second degree polynomial trend models. There is no evidence of "flattening" of the semi-variogram for the raw data, which strongly supports the need to include terms accounting for the spatial trend linked to the changing soil type across the field; the degree of anisotropy is also linked to the changing soil types. Inclusion of first degree polynomials substantially improved the situation; however, inclusion of second order terms in both x and y appears preferable. The sample semi-variogram for the residuals from this model is more reasonable, though there is a suggestion of remaining anisotropy, which will be examined in more detail in the following.

To avoid numerical difficulties we have rescaled the x and y coordinates in the fitting of the correlation models by dividing both by 100; for Matérn models this affects the estimate of φ only. Table 9.1 presents a summary of the sequence of correlation models with a second degree polynomial trend. We begin by fitting a series of models with ν fixed at 0.5, 1, 1.5 or 2. This helps to improve convergence by choosing sensible starting values and avoids most numerical problems with the scale of the x and y values. The results for ν = 0.5 and ν = 1 are presented (M1 and M2) for comparison. This process suggests that a reasonable starting value for ν would be 0.5. Fitting M3 resulted in REML estimates of ν̂ = .259 and φ̂ = .242 (a rescaled value of 24.2). Before including a nugget effect we examined the assumption of isotropy (M4). The REML estimates of the anisotropy ratio and angle for M4 were 3.93 and .693 respectively, though there was only a modest increase in the REML log-likelihood,

[Panels NULL, LIN and QUAD: semi-variance against distance, by angle (0, 45, 90, 135 degrees) and omni-directional.]

Figure 9.7 Sample directional semi-variograms for residuals from three trend models for the Cashmore data

which was insufficient to reject the hypothesis of isotropy (χ²₂ = 1.80, p > .05). It was not possible to achieve sensible convergence for either M3 or M4 with a nugget effect. This is a typical problem in our experience with moderate sample sizes and less than optimal sampling designs. Our approach in these situations is to fix the value of ν at a value which is consistent with the data; this almost always overcomes the problem and is a reasonable and practical way forward in the absence of a more exhaustive data-set. Haskard (2005) examines the convergence and properties of REML estimation within the class of correlation models we use. Her results suggest that REML estimates achieve low bias and good convergence properties for moderate (i.e. n ≥ 100) sample sizes with good sampling designs.

Models M5 and M6 fix ν = 0.5 and include a nugget effect, for an isotropic and an anisotropic correlation model respectively. M5 is the preferred model, though anisotropy is still a lingering issue. The iso-correlation contours for the anisotropy ratio and angle from M6, plotted against north-south and east-west displacements, are presented in figure 9.8. The contours are more or less in agreement with the pattern in the directional sample semi-variograms, with the stronger dependence running north-west to south-east. If we assume the bulk of the large scale spatial trend has been accounted for by the second degree polynomial, then this anisotropy may be due to small scale spatial variation in soil water content arising from either natural or non-natural sources. The other important feature of the field is the topography. There is a gradual slope (downwards) from north to


Figure 9.8 Correlation iso-contours of M6 for the Cashmore example

south (see figure 1 of Lark et al. (1998)), which necessitates all cultural operations being conducted in an east-west direction. The gridded sampling scheme is, in hindsight, perhaps not optimal for differentiating between an isotropic and an anisotropic correlation model for these data.

To examine this anisotropy empirically we fitted the anisotropic city-block correlation model (Ani-CB). This model (M7) has a REML log-likelihood of -66.1, which is quite an improvement over M5. This lends empirical support to the hypothesis that the small scale spatial variation may be partially a result of the cultural and agronomic practices which have been used in this field for many years. There is really insufficient data and relevant background information to judge which is the superior model; the issue of model mis-specification is examined later, in section 9.9.

Figure 9.9 presents contour plots based on E-BLUPs computed on a 34 × 23 grid (approximately 8 m × 8 m) generated by the intersection of the unique x and y sampling locations, for models M5 (Iso-Euc) and M7 (Ani-CB). This shows the smooth trend from the driest soils in the north-east corner to the wettest in the south-west. The contours from the Ani-CB correlation model are far more "edgy". The

Table 9.1 Summary of sequence of models fitted to the Cashmore data: GA represents geometric anisotropy, bolded terms are fixed in the model

Model   λ    ν̂      φ̂      δ̂     α̂      σ²     σs²    ℓR      num. par.
M1      2   0.5    .096    1     0      0      1.77   -71.1       2
M2      2   1.0    .047    1     0      0      1.70   -72.2       2
M3      2   .259   .243    1     0      0      1.91   -70.5       3
M4      2   .338   .205   3.93  .693    0      1.88   -69.6       5
M5      2   0.5    .187    1     0     .473    1.43   -70.4       3
M6      2   0.5    .187   3.92  .694   .308    1.53   -69.4       5
M7      1   0.5     -      -     -     .825    5.18   -66.1       4

MSEPs are smallest near the sampling locations, indicating the influence of the stationary component of the model. Scatter diagrams of the E-BLUPs and MSEPs for each correlation model are presented in figure 9.10, to illustrate the degree of disagreement in more detail. The E-BLUPs are in reasonable agreement away from the edges of the field, where there is good coverage in the sampling design (i.e. 450 < y < 600, 500 < x < 750); some points are highlighted by their x coordinate (/100) to support this comment. There is much more discrepancy between the MSEPs of the E-BLUPs for the two models: the MSEPs for the Ani-CB model are always less than those for the Iso-Euc model. We return to this issue in section 9.9, but conclude by remarking that even with the most rigorous analysis we are often limited in model selection by insufficient data, shortcomings of design, or lack of contextual information.

9.9 Simulation Study

In this section we describe a limited simulation study to examine the following issues:
• for a moderate sample size and reasonable sampling design, do model-based MSEPs provide reliable estimates of the precision of the E-BLUPs?
• what is the effect of model mis-specification on MSEPs?

To our knowledge there has been little work done on these questions. Stein (1999, ch. 6) provides limited evidence that model-based MSEPs are reasonable if the true model is fitted; if a wrong correlation model is fitted, say a gaussian instead of a true Matérn model with ν = 1.5, then the MSEPs of the E-BLUPs from this model are seriously biased. His study was for a sample size of 23 and for a one-dimensional process. We extend these results to two dimensions and a sample size of 100.

Our study is based on the sampling scheme of Haskard (2005), which is depicted in figure 9.11. A total of 20 points were removed from a 10 × 10 grid


Figure 9.9 Contour plots of E-BLUPs and MSEPs for Ani-CB and Iso-Euc correlation models for the Cashmore example

and replaced by three clusters of locations, in either a horizontal, vertical or diagonal pattern. Data were generated from two correlation models: either an anisotropic city-block model with φᵀ = (.8, .3)ᵀ, or an isotropic Matérn correlation with ν = 0.5 and φ = 2. These models represent extreme forms of isotropic versus anisotropic correlation and are of relevance to the analysis of the Cashmore example. A nugget effect was included with γ = .111, so that the total variance was 1. For each simulated data-set we fitted the Ani-CB and the Iso-Euc models. Predictions were made at the three locations shown in figure 9.11: location "a" is a central location with row, column and diagonal neighbours but without close support, "b" is on the edge of the sampling design without close support, and "c" is a central location with close support. A total of 800 simulations were done for each correlation model.

Table 9.2 presents a summary of the results. Model-based MSEPs and empirical MSEPs are presented for each true model and show excellent agreement for both correlation models. This is very encouraging, as the use of so-called

Figure 9.10 Scatter plots of E-BLUPs and MSEPs for the Ani-CB and Iso-Euc correlation models for the Cashmore example

“plug-in” estimates has been widely criticised within the recent Bayesian literature on the analysis of spatial data, on the grounds that they underestimate the precision of the E-BLUPs. When an incorrect model is fitted the results are less encouraging. If an Iso-Euc correlation model is fitted to data generated from an Ani-CB correlation model then both the model based and empirical MSEPs are substantially inflated. When an Ani-CB correlation model is fitted to data generated from an Iso-Euc correlation model then the empirical MSEPs are only slightly inflated, while the model based MSEPs are perhaps slightly conservative. The tentative conclusions based on this limited study are:

• model based MSEPs provide reasonable estimates of the precision of E-BLUPs when the correct correlation model is fitted;

• if anisotropy is suspected it is far preferable to “overfit”, and pay a small price for bias when using model based MSEPs.
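These calculations are straightforward to reproduce in outline. The sketch below is our own illustration, not the authors' code: it uses a plain unit-spaced 10 × 10 grid (the actual design replaces 20 grid points with clusters), a known zero mean, and simple kriging with known parameters in place of the full REML/E-BLUP machinery. Data are generated from the isotropic exponential (Matérn ν = 0.5) model with a nugget so that the total variance is 1, and the model based MSEP at a central prediction location is compared with the empirical MSEP over 800 simulations.

import numpy as np

rng = np.random.default_rng(1)

# Unit-spaced 10 x 10 sampling grid plus one central prediction location
xy = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
x0 = np.array([[4.5, 4.5]])
n = len(xy)

phi, gamma = 2.0, 0.111            # exponential range; nugget (total var = 1)

def corr(a, b):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return np.exp(-d / phi)

K = (1 - gamma) * corr(xy, xy) + gamma * np.eye(n)   # var(y), y = field + nugget
k0 = (1 - gamma) * corr(xy, x0)[:, 0]                # cov(y, field at x0)

w = np.linalg.solve(K, k0)                           # simple-kriging weights
msep_model = (1 - gamma) - k0 @ w                    # model based MSEP

# Empirical MSEP: simulate the smooth field at data and prediction sites
pts = np.vstack([xy, x0])
L = np.linalg.cholesky((1 - gamma) * corr(pts, pts) + 1e-10 * np.eye(n + 1))
sq_err = []
for _ in range(800):
    f = L @ rng.standard_normal(n + 1)               # smooth random field
    y = f[:n] + np.sqrt(gamma) * rng.standard_normal(n)
    sq_err.append((w @ y - f[n]) ** 2)

print(f"model based MSEP {msep_model:.3f}, empirical {np.mean(sq_err):.3f}")

The two printed values agree closely, mirroring the excellent agreement reported for the true-model fits in Table 9.2.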

Figure 9.11 Sampling scheme for the simulation study

9.9.1 Electromagnetic salinity

Scatter plots (figure 9.12) of ECa against northing (labelled the y direction in the following) and easting (labelled the x direction) suggest possible non-stationarity, which may be accounted for by inclusion of a second-degree polynomial in northing and easting. Figure 9.13 presents empirical omnidirectional and directional semi-variograms in four directions, for the raw data and for the residuals after an ordinary least squares (OLS) fit of a quadratic surface in northing and easting; these provide added support for the need to allow for a global trend. However, there remains evidence of anisotropy, which we investigate more formally in the following.

For these data we focus on anisotropic correlation models within the Matérn class and use the model formulation (9.3.3) to accommodate the five duplicated locations. This models the spatially dependent random field as a G-structure. We feel that there are insufficient duplicated observations to obtain a reliable external estimate of measurement error, and the duplication process was not an intentional part of the sampling design.

Table 9.3 presents a summary of the models fitted to these data. As in the Cashmore example we begin with models which fix ν, to obtain sensible starting values. These model fits (M1, M2 and M3) showed that a reasonable starting value for ν is 1.0. The fit of the exponential model, with a nugget effect, resulted in instability in the convergence sequence, with φ and σ² becoming unacceptably large. Geometric anisotropy is supported for these data by a comparison of M4 with M5, giving a value of D = 383.12, which is highly significant (p ≪ 0.001) when compared with a χ²₂ distribution.

Table 9.2 MSEPs for the simulation study, numbers in brackets are percentages of the empirical MSEPs for the true model

(a) True model Ani-CB

            Ani-CB fitted              Iso-Euc fitted
Location    Model Based   Empirical    Model Based   Empirical
a           .214 (101)    .212 (100)   .479 (226)    .429 (202)
b           .327 (88)     .371 (100)   .527 (142)    .616 (166)
c           .078 (99)     .079 (100)   .122 (154)    .130 (164)

(b) True model Iso-Euc

            Iso-Euc fitted             Ani-CB fitted
Location    Model Based   Empirical    Model Based   Empirical
a           .365 (106)    .343 (100)   .294 (86)     .373 (109)
b           .410 (101)    .406 (100)   .353 (87)     .441 (109)
c           .093 (90)     .103 (100)   .113 (110)    .106 (103)

Table 9.3 Summary of the sequence of models fitted to the EM data; values marked * were fixed in the model

Model   ν̂       φ̂      δ      α̂       σ̂²     σ̂²_s   ℓ_R       num. par.
M1      0.5*    -      1*     0*      -      -      -5104.9   3
M2      1.0*    22.4   1*     0*      5.87   447    -5043.4   3
M3      1.5*    12.7   1*     0*      8.23   370    -5046.4   3
M4      1.12    18.7   1*     0*      6.51   415    -5042.7   4
M5      0.770   40.7   2.63   −.410   6.55   611    -4851.1   6

Figure 9.14 is a contour map of E-BLUPs from M5 for the full rice bay. This was produced by evaluating the E-BLUPs of f(·) at each of the 1996 unique sampling locations. Overlaid on this map are the estimated rotated axes. Correlation remains highest (the range parameter is largest) along the nearer-to-vertical axis, and drops most quickly along the more horizontal axis (−23.5° to the horizontal). The anisotropy ratio and anisotropy angle produce ellipsoidal iso-correlation contours which are aligned with the direction of the slope and water flow in the rice field. This is shown more clearly in figure 9.15, which presents the axes of the ellipsoidal iso-correlation contours. The anisotropy ratio of 2.63 results in more rapid decay of correlation in the direction parallel with the slope and water flow.

Figure 9.16 is a plot of the E-BLUPs from M4 and M5, showing that even with such a large data-set, a common trend model and dense sampling locations, there is significant variation in predicted values relative to their model based MSEPs.

Figure 9.12 Scatter plots of ECa against sampling coordinates (Relative Northing, y, and Relative Easting, x)

9.9.2 Fine-scale soil pH data

Empirical directional semi-variograms are presented in figure 9.17 for the raw fine-scale soil pH data. There is strong evidence of non-stationarity for most blocks, particularly 2, 4 and 5. These also suggest anisotropy for all but block 4. Empirical directional semi-variograms of the ordinary least squares residuals, after fitting a linear global trend surface separately to each block, are displayed in figure 9.18. The non-stationarity appears to have been substantially accounted for, but anisotropy may still be present. The other notable point is that the ranges of the random fields appear comparable, but the variances are quite different between blocks. We consider these issues in the formal modelling to follow.

In the modelling we used the formulation (9.3.5) to ensure numerical stability if the nugget variance component is zero. In fact, it is most likely that the nugget variance will be very small, since pH was determined for complete adjoining 1 cm³ cubes, effecting a complete survey of the five small blocks. The blocks were all sampled from the same field plot, though not on the same date. Hence the sequence of models we fitted examined which of the spatial covariance parameters could reasonably be assumed equal. As these models included geometric anisotropy it was not sensible to fix the anisotropy angle to be equal across blocks, as the individual orientation of the five blocks was not recorded.

Figure 9.13 Sample variograms of the residuals from a quadratic trend model for the EM data

The same trend model, with a linear effect for x and y for each block, was maintained throughout. A common nugget variance was also included in each model. Table 9.4 presents the sequence of models fitted to these data. Two models (M1 and M5) failed to converge satisfactorily. Commencing from M2 (equal σ²_s) we examine equality of each remaining variance parameter in turn, accepting equality of δ (χ²₄ = 3.64) and choosing M8 as the basis of the next round of testing. From this model we accept equality only of ν (χ²₄ = 2.52), choosing M14 as the preferred model. No further reduction from this model can be achieved. Based on the AIC criterion, M14 is therefore the preferred model, with equal δ, ν and σ²_s. REML estimates of the common variance parameters from M14 were σ̂²_s = .180, σ̂² = .00137, ν̂ = 1.078 and δ̂ = 1.36, and REML estimates of the anisotropy angles and ranges for the five blocks were α̂^T = (1.57, .854, 1.40, 1.28, 2.13) and φ̂^T = (.982, 1.04, 1.98, 1.47, 3.71) respectively.

Using an extension of the χ² test of isotropy to the situation with five separate anisotropy angles yields an asymptotic χ²₆ likelihood-ratio test statistic of 18.46 (p = 0.005). As for both of the previous examples there is ample evidence of anisotropy, even at a micro-scale level. Most geostatistical modelling would ignore it, using more commonly available models such as a spherical or exponential model. The consequences of fitting an incorrect model have already been shown in terms of MSEPs, though our simulation study is very limited and more work in this area is needed. The REML estimate of ν implies that the spatial random field is just differentiable.

Figure 9.14 Contour plots of E-BLUPs from M5 for the EM data

The final model is interesting and biologically explainable. The variation of the range parameter between blocks (from 0.98 cm to 3.71 cm) is most likely a result of variation in organic matter at a micro-scale level in the top 1 cm of soil.

Figure 9.15 Direction of rotated axes for the EM data

Figure 9.16 Scatter plot of E-BLUPs from M4 and M5 (standardised difference, Iso − An, against the anisotropic EM38v E-BLUPs, ν = 0.77)

Figure 9.17 Sample variograms for each block of the raw data for the pH data

Figure 9.18 Sample variograms for each block of the residuals from a linear trend model for the pH data

Table 9.4 Summary of the sequence of models fitted to the pH data; ≠ indicates parameters not constrained to be equal between blocks, = indicates parameters constrained to be equal between blocks

Model   δ   ν   φ   σ²_s   ℓ_R      num. par.   AIC
M1      ≠   ≠   ≠   ≠      NA       26          NA
M2      ≠   ≠   ≠   =      512.55   22          18.9
M3      ≠   ≠   =   ≠      510.55   22          22.9
M4      ≠   =   ≠   ≠      510.96   22          22.1
M5      =   ≠   ≠   ≠      NA       22          NA
M6      ≠   ≠   =   =      506.19   18          23.6
M7      ≠   =   ≠   =      509.92   18          16.2
M8      =   ≠   ≠   =      510.73   18          14.5
M9      ≠   =   =   ≠      507.24   18          21.5
M10     =   ≠   =   ≠      509.16   18          17.7
M11     =   =   ≠   ≠      509.67   18          16.6
M12     ≠   =   =   =      465.72   14          96.6
M13     =   ≠   =   =      504.81   14          18.4
M14     =   =   ≠   =      508.47   14          11.1
M15     =   =   =   ≠      506.10   14          15.8
M16     =   =   =   =      450.62   10          118.8
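The comparisons behind Table 9.4 are standard REML likelihood-ratio tests between nested variance models (the trend model is held fixed across models, so the REML log-likelihoods are directly comparable), together with AIC computed as −2ℓ_R + 2 × (number of parameters), up to a constant. A small helper of our own reproduces the M2 versus M8 comparison quoted in the text:

from scipy.stats import chi2

def reml_lrt(loglik_full, p_full, loglik_reduced, p_reduced):
    """REML likelihood-ratio test for nested variance models."""
    d = 2.0 * (loglik_full - loglik_reduced)
    df = p_full - p_reduced
    return d, df, chi2.sf(d, df)

# M2 (22 parameters) vs M8 (18 parameters), log-likelihoods from Table 9.4
d, df, p = reml_lrt(512.55, 22, 510.73, 18)
print(f"D = {d:.2f} on {df} df, p = {p:.2f}")    # D = 3.64: accept equal delta

# AIC differences match the table, e.g. AIC(M8) - AIC(M2):
print(f"{(-2 * 510.73 + 2 * 18) - (-2 * 512.55 + 2 * 22):.1f}")   # -4.4 = 14.5 - 18.9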

CHAPTER 10

Population and Quantitative Genetics

10.1 Introduction

The developments in quantitative genetics, and in the molecular genetics involved in QTL analysis, are based on certain assumptions. These assumptions arise from classical genetics and date back to the 19th century.

10.2 Mendel’s Laws

Gregor Mendel entered the Augustinian monastery in Brno in 1843 to prepare for the priesthood. The monastery was a place of learning, which encouraged the study of the natural sciences alongside religion. Mendel’s interests lay in the field of hybridization, or the interbreeding of pure-breeding strains with one another. He chose plants as his experimental tool and chose to work with a single species, the common garden pea. Mendel worked with simply inherited traits, but his deductions provide the basic foundation on which knowledge of the inheritance of complex traits has been built.

Figure 10.1 Gregor Mendel

10.2.1 Law of segregation

The two members of a gene pair segregate (separate) from each other into the gametes, so that one half of the gametes carry one member of the pair and the other half carry the other member of the gene pair.

10.2.2 Law of independent assortment

During gamete formation the segregation of one gene pair is independent of other gene pairs.

Mendel had no knowledge of the physical nature of genetic material when, in 1865, he promulgated his laws, which were based on years of data collection and analysis. Mendel was fortunate that all of the seven traits he examined were on separate chromosomes. Genes that are on the same chromosome are said to be linked and do not follow Mendel’s second law.

Figure 10.2 Mendel’s law of segregation: colour

10.2.3 Linkage

At the turn of the 20th century, three geneticists working in different parts of the world, and on different organisms, re-discovered Mendel’s laws. Mendel’s factors were now called genes, and some scientists had noted that chromosomes seemed to be involved in reproduction and might be related to the biological explanation of Mendel’s laws. In 1902, Walter Sutton correctly hypothesized that Mendel’s genes were located on chromosomes. Sutton did his research at Columbia University, which was fast becoming a hotbed for the blossoming field of genetics. Scientists at Columbia also noticed that some genes in fruit flies were statistically linked to each other in a way that seemed to contradict Mendel’s laws.

Thomas Hunt Morgan’s laboratory at Columbia University in New York City focused on the genetics of Drosophila melanogaster, the common fruit fly, and used this simple organism to tease apart the mechanisms of heredity. Morgan’s ”fly room” combined classical genetics with microscopy to prove that chromosomes have a definite function in heredity, establish mutation theory, and outline the fundamental mechanisms of heredity. Morgan and his associates developed important applications of crossing-over and genetic mapping, and helped initiate cross-disciplinary science, that is, the use of what had been learned in other biological disciplines to explain common over-arching themes. In many respects flies are similar to mice, as well as humans; in the same way, some aspects of genetics are related to physiology, chemistry, even physics.

Figure 10.3 Thomas Hunt Morgan

Morgan was a developmental biologist who came to Drosophila while studying Mendelian heredity patterns in rodents. Flies were fast-breeding and resilient, and seemed the perfect organism for observing the general patterns of heredity. Soon into his experiments, Morgan came across a male fly with white eyes - a random mutation from the normal red colour. This chance discovery inspired a series of breeding experiments with red-eyed females. Some white-eyed flies were produced, and this was expected, in keeping with Mendelian inheritance ratios. However, all the white-eyed flies were male, without exception. This gave rise to one of Morgan’s seminal insights - the discovery of sex-linked characteristics.

It gradually dawned on Morgan that some traits had a greater chance of being inherited together. He reasoned that there must be some physical reason for this, and he realised that the traits had an actual location on the chromosome, and that their positions relative to each other dictated how likely they were to be inherited together. Eventually, these discoveries would lead Morgan to realise that specific traits were generated from specific genes, each of which had a location along a specific chromosome. The genetic distance measure is named after Morgan in recognition of his major contribution.

10.3 Population genetics

Population genetics is concerned with relating heritable changes in populations of organisms to the underlying individual processes of inheritance and development. An extension of Mendelian genetics, it deals with frequencies of genes and genotypes in populations but does not assign a genotypic value to each genotype. Thus for a single locus two alleles might occur, B and b say, and in a diploid we have possible genotypes BB, Bb and bb. Population genetics is concerned with the frequencies of alleles, and hence of genotypes, in subsequent generations.

10.3.1 Hardy-Weinberg Equilibrium

Suppose we have a population in which the frequencies of alleles B and b at a particular locus are p_B and p_b respectively. Under random mating, the genotypic frequencies are therefore

p_BB = p_B²,    p_Bb = 2 p_B p_b,    p_bb = p_b²

and these frequencies are constant from generation to generation. This is Hardy-Weinberg equilibrium.
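This is easy to verify numerically. The following sketch (our own illustration) simulates several generations of random mating in a large population and shows the genotypic frequencies sitting at, and remaining at, the Hardy-Weinberg values:

import numpy as np

rng = np.random.default_rng(0)
pB = 0.3                                   # frequency of allele B

print(f"expected: bb={(1-pB)**2:.4f} Bb={2*pB*(1-pB):.4f} BB={pB**2:.4f}")

N = 200_000
alleles = rng.random((N, 2)) < pB          # True = allele B on each gamete
for gen in range(3):
    geno = alleles.sum(axis=1)             # 0 = bb, 1 = Bb, 2 = BB
    freq = np.bincount(geno, minlength=3) / N
    print(f"gen {gen}: bb={freq[0]:.4f} Bb={freq[1]:.4f} BB={freq[2]:.4f}")
    # Random mating: each offspring receives one randomly chosen allele
    # from each of two randomly chosen parents
    mum = alleles[rng.integers(N, size=N), rng.integers(2, size=N)]
    dad = alleles[rng.integers(N, size=N), rng.integers(2, size=N)]
    alleles = np.column_stack([mum, dad])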

10.3.2 Assumptions

The assumptions underlying Hardy-Weinberg equilibrium are:

1. sexual reproduction
2. non-overlapping generations
3. random mating
4. no natural selection
5. no migration
6. no mutation
7. large (infinite) population size

If there are systematic forces that impact on a population then it is possible to predict the changes in gene frequency in both amount and direction. Thus if selection occurs within a population, or migration is possible between populations, or mutation rates of loci are known, the changes to Hardy-Weinberg equilibrium can be modelled. There are also dispersive forces for which it may be possible to predict the amount but not the direction of change, so-called genetic (or random) drift.

10.3.3 Types of mating

Under random mating, the genetic character being studied has no influence on the choice of mate. With assortative mating, phenotypically alike (or phenotypically unalike) individuals mate preferentially. Clearly the mating system impacts on the frequencies of alleles and genotypes.

10.3.4 Selection

Selection of progeny to progress to the next generation, and hence to reproduce, will also impact on allele frequency. A selection advantage of an allele will result in departure from Hardy-Weinberg equilibrium.

10.3.5 Mutation

Mutations result in a change from one allele to another through a sudden heritable change in genetic material. This may result in a new allele, or change one allele into an existing allele at the locus. Mutation is the ultimate source of new genetic material. Many mutations are lethal and hence non-recurring. When a mutation is not lethal, the frequency of the allele will increase with the mutation rate. Typically, mutation rates are of the order of 10⁻⁴ to 10⁻⁸ mutations per generation; the influence of mutation is therefore felt over a number of generations. There is often an equilibrium between mutation in two directions, with reverse mutation about 0.1 times as frequent. While mutations are the source of variation, the process of mutation does not itself drive genetic change (evolution), because the rate of change in gene frequency from the mutation process is very low, due to the low rate of spontaneous mutation.

10.3.6 Genetic drift

Changes in gene frequency can occur simply by chance, particularly in small populations; this is influenced by the number of parents in the population. Genetic drift was thought to be a major component of genetic evolution.

10.3.7 Inbreeding

If mating between relatives occurs more commonly than would occur by pure chance, then the population is inbred. The result is to increase the number of homozygous genes in the population, such that completely inbred lines are homozygous at all loci. This clearly alters the genotype frequencies.

10.4 Quantitative genetics

Quantitative genetics is concerned with the inheritance of quantitative traits. Both genetic and environmental components are considered, with the genetic effects arising from a possibly large number of genes.

10.4.1 Model

The basic quantitative genetics model is assumed here, and hence we assume that trait observations come from the model specification

y_ij = μ + g_i + ε_ij

for genotypes i = 1, 2, ..., n_g, and replicates j = 1, 2, ..., r_i. In accord with quantitative genetics the mean of y_ij is μ. This places constraints on the realised values of g_i, as does the pedigree structure which contains these genotypes (as well as other genotypes).

10.4.2 Effects

The genotypic effects g_i will be composed of additive effects (in the case of a diploid) of the two alleles that make up the genotype, and their “interaction” or dominance deviation. These components will be relevant for each gene. In addition, the interaction between genes may be important and hence epistatic effects may be postulated.

The development presented below involves accounting for additive and dominance effects in a pedigree, and developing mean, variance and covariance terms that incorporate relationships between genotypes. The epistatic component will be taken as a random effect, and the development will not include interactions between additive and dominance components between genes; these will all be encapsulated by the simple epistatic “residual” term.

10.4.3 Assumptions

The assumptions underlying the development include Mendelian sampling and Hardy-Weinberg equilibrium. However, inbreeding is allowed for in the derivations.

10.5 Theory

10.5.1 Identity coefficients

To understand the variation of genetic lines and the relationships between lines, it is necessary to consider identity modes and coefficients between genetic lines. It is usual to consider a diploid locus, indexed by r (r = 1, 2, ..., L), with m_r alleles A_r1, ..., A_rm_r that have an additive or main effect a_rs (s = 1, 2, ..., m_r) on the performance of a line carrying allele A_rs, while the interaction or dominance effect will be denoted by d_rs1s2. In the population these alleles have (relative) frequency (and hence probability of being selected in a random mating population) p_rs, with Σ_{s=1}^{m_r} p_rs = 1. Thus for locus r we have the details given in Table 10.1.

The extension to polyploidy introduces further complexity, in that the number of identity modes increases; for k-ploidy (k = 2, 3, ...) there are 2^{2k} − 1 identity modes. If the parental type is not distinguished, this reduces to 3{(k − 1)(k − 2) + 1} for k = 2, 3, .... Furthermore, higher order interactions or dominance effects need to be considered. Technically this is not a problem, but the notation becomes quite complex.

Table 10.1 Probability distribution for locus r

Allele        A_r1    A_r2    ...    A_rm_r
Effect        a_r1    a_r2    ...    a_rm_r
Probability   p_r1    p_r2    ...    p_rm_r    (total 1)

The identity modes between two lines i and j for a single diploid locus are given in Figure 10.4. The lines or edges of the graphs represent identity by descent (IBD), so that alleles joined by an edge come from the same ancestor. The non-distinguishable groups (9 for the diploid case) are also indicated in Figure 10.4 by boxing the specific modes. We label the modes I1, ..., I9, with probabilities ϖ_1, ..., ϖ_9 respectively. These identity modes are presented in Table 10.2 for two individuals i and j. The column “i edge” has value 1 if the alleles of i are IBD, and zero otherwise. The column labelled “ij edge” counts edges from an allele of i to an allele of j; any non-zero value in this column indicates that there exist alleles of i and j that are IBD.

Table 10.2 Identity modes using graph theoretic structure

Mode   i edge   j edge   ij edge   Total edges   Probability   i IBD   j IBD   ij IBD
I1     1        1        4         6             ϖ_1           yes     yes     yes
I2     1        0        2         3             ϖ_2           yes     no      yes
I3     0        1        2         3             ϖ_3           no      yes     yes
I4     0        0        2         2             ϖ_4           no      no      yes
I5     0        0        1         1             ϖ_5           no      no      yes
I6     1        1        0         2             ϖ_6           yes     yes     no
I7     1        0        0         1             ϖ_7           yes     no      no
I8     0        1        0         1             ϖ_8           no      yes     no
I9     0        0        0         0             ϖ_9           no      no      no

Figure 10.4 15 IBD states with the 9 identity modes used (edges join alleles of lines i and j; modes I1, ..., I9)

The basic quantitative genetics model is assumed here, but additional fixed and random effects can be added. These additional terms are generally required in the analysis of data, but our interest is in the genetic component of the model, and hence without loss of generality we assume that trait observations come from the model specification

y_ik = μ + g_i + ε_ik

for lines i = 1, 2, ..., n_g, and replicates k = 1, 2, ..., r_i. In accord with quantitative genetics the mean of y_ik is μ. This places constraints on the realized values of g_i, as does the pedigree structure which contains these lines (as well as other lines).

The reproductive process is viewed as sampling from the allele pool at each of the L loci. The sampling process depends on the IBD status at each locus and also on the population relative frequency, or probability of selection, of an allele; that is, the p_rs for allele s of locus r. Because we consider diploids for simplicity, two independent samples are chosen with replacement (although sampling will clearly depend on the parental alleles available). Let (S_r1, S_r2) represent the bivariate sampling random variable for locus r, and let p(A_rs A_rt) = Pr(S_r1 = A_rs ∩ S_r2 = A_rt). Then

p(A_rs A_rt) = Σ_{v=1}^{9} Pr(S_r1 = A_rs ∩ S_r2 = A_rt | I_v) Pr(I_v)
             = Σ_{v=1}^{9} Pr(S_r1 = A_rs ∩ S_r2 = A_rt | I_v) ϖ_rv        (10.5.1)

where the identity mode probabilities will in general depend on the locus r. Once (S_r1, S_r2), r = 1, 2, ..., L, are observed, the realization of g_i, namely g_ir, is obtained. If g_ir represents the genotype at locus r, the realized expression for line i across all loci is the sum of the effects of the two alleles present and a dominance component. Thus

g_i(s, t) = Σ_{r=1}^{L} g_ir = 1_L^T g_i = Σ_{r=1}^{L} (a_rs_r + a_rt_r + d_rs_rt_r + e_rs_rt_r)

where g_i is the vector of the g_ir. Notice that the form of the realized value is similar to a standard factorial model. The terms a_rs_r and a_rt_r are the additive effects due to the two alleles present at the rth locus. The term d_rs_rt_r is the dominance effect, or interaction between the two alleles present at the rth locus. The final term e_rs_rt_r represents the remaining non-additive (epistatic) effects; these terms are specified as independent random effects, normally distributed with mean zero and variance σ²_Ir.

The constraints required for identifiability can be expressed in terms of a_r and D_r, the vector of additive effects and the matrix of dominance effects respectively for locus r. The weighted zero-sum constraints used in quantitative genetics are

a_r^T p_r = 0,    D_r p_r = 0.        (10.5.2)

In the absence of IBD information, the mean, variance and covariance of the genotypic effects can be calculated easily. Thus

E(g_i) = Σ_{r=1}^{L} Σ_{s=1}^{m_r} Σ_{t=1}^{m_r} (a_rs + a_rt + d_rst) p_rs p_rt = Σ_{r=1}^{L} (2 a_r^T p_r + p_r^T D_r p_r) = 0

because of the constraints. Similarly, the variance is given by

var(g_i) = 2 Σ_{r=1}^{L} a_r^(2)T p_r + Σ_{r=1}^{L} p_r^T D_r^(2) p_r = σ_a² + σ_d²        (10.5.3)

where a_r^(2) is the vector of squared values of a_r, and D_r^(2) denotes the matrix whose elements are the squares of those of D_r. The same notation will be used for other vectors of squares. A similar calculation shows that the covariance between two lines i and j is zero.

The variance components σ_a² and σ_d² are the additive and dominance variances in the simple random mating situation. The components in (10.5.3) will, however, also appear in the more general case of a pedigree with possible inbreeding. The mean, variance and covariance for lines in a pedigree are developed in the following sub-sections. Importantly, we begin with basic definitions of the coefficient of parentage or kinship, and of inbreeding.
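As a numerical check of these random-mating results, the sketch below (our own; it drops the epistatic residual term e, whose variance would simply add to the total) builds a single locus with arbitrary allele frequencies and effects, imposes the constraints (10.5.2) by projection, and verifies E(g) = 0 and var(g) = 2 a^(2)T p + p^T D^(2) p by enumerating all genotypes:

import numpy as np

rng = np.random.default_rng(0)
m = 4                                  # number of alleles at the locus
p = rng.dirichlet(np.ones(m))          # allele frequencies

a = rng.normal(size=m)
a = a - (a @ p)                        # impose a^T p = 0
D = rng.normal(size=(m, m)); D = D + D.T
M = np.eye(m) - np.outer(np.ones(m), p)
D = M @ D @ M.T                        # impose D p = 0, keeping D symmetric

P = np.outer(p, p)                     # genotype probabilities, random mating
G = a[:, None] + a[None, :] + D        # genotypic values g = a_s + a_t + d_st

mean_g = (P * G).sum()
var_g = (P * G**2).sum() - mean_g**2
print(mean_g)                          # 0 up to rounding error
print(var_g, 2 * (a**2) @ p + p @ (D**2) @ p)   # identical values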

10.5.2 Coefficient of kinship

A major difference in the development presented here is the manner in which relationships between two individuals i and j are represented. We assume that a pair of individuals has overall mean identity probabilities as presented in Table 10.2, and that the probabilities at locus r, designated by the vector ϖ_r, are realizations of the overall ϖ. Thus we have (using standard multinomial results) that

E(ϖ_r) = ϖ,    var(ϖ_r) = diag(ϖ) − ϖϖ^T

Linear functions of ϖ_r therefore obey

E(c^T ϖ_r) = c^T ϖ,    var(c^T ϖ_r) = c^T diag(ϖ) c − (c^T ϖ)²        (10.5.4)

The result (10.5.4) will be used in the derivations below. A similar result applies for the covariance, namely

cov(c_r^T ϖ_r, c_s^T ϖ_s) = c_r^T diag(ϖ) c_s − (c_r^T ϖ)(c_s^T ϖ)        (10.5.5)

Let S_ir be an allele sampled at random from individual i at locus r, with a similar definition for S_jr. The coefficient of kinship is defined as

f_rij = Pr(S_ir ≡ S_jr)

where ≡ means the alleles are identical by descent (IBD). Then from Figure 10.4 and Table 10.2,

f_rij = Σ_{v=1}^{9} Pr(IBD | I_v) ϖ_rv
      = ϖ_r1 + (1/2)(ϖ_r2 + ϖ_r3 + ϖ_r4) + (1/4) ϖ_r5

At this point it is appropriate to define the inbreeding coefficient F_ir for individual i at a specific locus r. We denote the parents of i by u and v. If A_ru and A_rv are the alleles of i received from u and v respectively, the inbreeding coefficient is defined as

F_ir = Pr(A_ru ≡ A_rv)
     = ϖ_r1 + ϖ_r2 + ϖ_r6 + ϖ_r7
     = f_ruv        (10.5.6)

which is the relationship between the parents at locus r. The coefficient of kinship for individual i itself involves sampling from the alleles of i: either we sample two distinct alleles, which are IBD with probability F_ir, or we sample the same allele twice, in which case the two draws are trivially IBD. The two cases each have probability 0.5, and using (10.5.6) we have

f_rii = (1/2)(ϖ_r1 + ϖ_r2 + ϖ_r6 + ϖ_r7) + (1/2)
      = (1/2)(1 + f_ruv)
      = (1/2)(1 + F_ir)

Notice that, as E(ϖ_r) = ϖ, using (10.5.4)

E(Fir) = Fi, E(frij) = fij
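In practice the expected coefficients f_ij (and hence F_i = f_uv) are computed from a pedigree by the standard tabular (recursive) method. The recursion is not spelled out in the text, but it is consistent with the definitions above: f_ii = (1 + F_i)/2, and for j that is not a descendant of i, f_ij = (f_uj + f_vj)/2 where u and v are the parents of i. A sketch of our own, with founders assumed unrelated and non-inbred:

import numpy as np

def kinship(parents):
    """parents[i] = (u, v) with u, v < i, or (None, None) for a founder."""
    n = len(parents)
    f = np.zeros((n, n))
    for i in range(n):
        u, v = parents[i]
        f[i, i] = 0.5 if u is None else 0.5 * (1 + f[u, v])
        for j in range(i):
            # j precedes i in the ordering, so j is not a descendant of i
            f[i, j] = f[j, i] = 0.0 if u is None else 0.5 * (f[u, j] + f[v, j])
    return f

# 0 and 1 are founders, 2 and 3 are their full-sib offspring,
# and 4 is the offspring of the sib mating 2 x 3
ped = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
f = kinship(ped)
print(f[2, 3])           # 0.25: kinship of full sibs
print(2 * f[4, 4] - 1)   # 0.25: inbreeding coefficient F_4 = f_23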

10.5.3 Mean genetic effect under IBD

We consider E(g_i) under inbreeding and Mendelian sampling. The approach involves conditioning on ϖ_r, r = 1, 2, ..., L; the complete set of these vectors is denoted by Π. In the three derivations to follow, the expression

g_i = Σ_{r=1}^{L} g_ir = 1_L^T g_i        (10.5.7)

is used. The expressions for the terms used in calculating the expected value conditional on the identity modes are given in Table 10.3.

In all the derivations that follow, the common thread is the partition of possible events into IBD (and hence inbred) and not IBD. Under IBD at a locus, the alleles must be the same. This implies identity modes I1, I2, I6, I7, for which line i has alleles that are IBD; the probability of these modes is F_ir, the inbreeding coefficient at locus r, as given by (10.5.6). On the other hand, all pairs of alleles (including the same allele) can occur when the locus is not IBD (two equal alleles are then identical by state, IBS, but not IBD); the probability of the non-IBD modes is 1 − F_ir. In vector-matrix notation the possible outcomes for g_ir are

2a_r + d_rh,    and    a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r

where the first term is conveniently written as a vector and the second as a matrix; the h in d_rh and elsewhere denotes homozygosity. Thus, conditionally,

E(g_ir | Π) = F_ir (2a_r + d_rh)^T p_r + (1 − F_ir) p_r^T (a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r) p_r
            = F_ir d_rh^T p_r = F_ir Δ_rh        (10.5.8)

Using a standard result on conditional expectations, we therefore have

E(g_ir) = F_i Δ_rh

If Δ_h is the vector of the Δ_rh, then for the vector of effects E(g_i) = F_i Δ_h. Hence

E(g_i) = F_i 1_L^T Δ_h = F_i Δ_h        (10.5.9)

where, with a slight abuse of notation, the scalar Δ_h = Σ_{r=1}^{L} Δ_rh.

Table 10.3 Calculation of the mean of g_ir

Identity mode   Identity probability   i IBD   Component of g_ir       Probability
I1              ϖ_r1                   yes     2a_rs + d_rss           p_rs
I2              ϖ_r2                   yes     2a_rs + d_rss           p_rs
I3              ϖ_r3                   no      a_rs + a_rt + d_rst     p_rs p_rt
I4              ϖ_r4                   no      a_rs + a_rt + d_rst     p_rs p_rt
I5              ϖ_r5                   no      a_rs + a_rt + d_rst     p_rs p_rt
I6              ϖ_r6                   yes     2a_rs + d_rss           p_rs
I7              ϖ_r7                   yes     2a_rs + d_rss           p_rs
I8              ϖ_r8                   no      a_rs + a_rt + d_rst     p_rs p_rt
I9              ϖ_r9                   no      a_rs + a_rt + d_rst     p_rs p_rt

10.5.4 Genetic variance

To find the genetic variance we proceed in a similar manner as for the expectation. We can write

g_i² = 1_L^T g_i g_i^T 1_L

and hence we need to evaluate E(g_ir² | Π) and E(g_ir g_is | Π). To evaluate E(g_ir² | Π) using vector-matrix notation, we consider the element-by-element products

(2a_r + d_rh).(2a_r + d_rh) = 4a_r^(2) + d_rh^(2) + 4a_r.d_rh

and

(a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r).(a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r)

which produces eight terms. Then, in a similar manner to the mean,

E(g_ir² | Π) = F_ir (4a_r^(2) + d_rh^(2) + 4a_r.d_rh)^T p_r
             + (1 − F_ir) p_r^T {(a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r).(a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r)} p_r
           = 4F_ir a_r^(2)T p_r + F_ir d_rh^(2)T p_r + 4F_ir (a_r.d_rh)^T p_r + 2(1 − F_ir) a_r^(2)T p_r + (1 − F_ir) p_r^T D_r^(2) p_r
           = 2(1 + F_ir) a_r^(2)T p_r + (1 − F_ir) p_r^T D_r^(2) p_r + F_ir d_rh^(2)T p_r + 4F_ir (a_r.d_rh)^T p_r        (10.5.10)

Noting that alleles at different loci are assumed independent,

E(g_ir g_is | Π) = E(g_ir | Π) E(g_is | Π) = F_ir Δ_rh F_is Δ_sh        (10.5.11)

Taking expectations with respect to ϖ_r, F_ir is replaced by F_i in (10.5.10), while in (10.5.11) we need to find E(F_ir F_is). The calculations provide the rth diagonal term and the (r, s)th term of G_i = E(g_i g_i^T) respectively. Thus, using

the definitions explicit in (10.5.3), and defining

σ_dh² = Σ_{r=1}^{L} d_rh^(2)T p_r − Δ_h²,    σ_adh = 2 Σ_{r=1}^{L} (a_r.d_rh)^T p_r

which are, respectively, the variance of the dominance effects among homozygotes and the interaction between additive and homozygous dominance effects, we find

E(g_i²) = (1 + F_i)σ_a² + (1 − F_i)σ_d² + F_i σ_dh² + F_i Δ_h² + 2F_i σ_adh + Σ_{r≠s} E(F_ir F_is) Δ_rh Δ_sh        (10.5.12)

Combining equations (10.5.9) and (10.5.12), we find

var(g_i) = (1 + F_i)σ_a² + (1 − F_i)σ_d² + F_i σ_dh² + 2F_i σ_adh + F_i(1 − F_i)Δ_h² + Σ_{r≠s} E(F_ir F_is) Δ_rh Δ_sh        (10.5.13)

The last term provides for possible linkage between loci, through the expected value E(F_ir F_is). If loci are linked, for example on the same chromosome, their relationship will depend on the recombination frequency between them. If θ_rs is the recombination frequency between loci r and s, the correlation between the loci is given by (1 − 2θ_rs) and

E(F_ir F_is) = (1 − 2θ_rs)(F_i − F_i²) + F_i² = (1 − 2θ_rs)F_i + 2θ_rs F_i²

using (10.5.5). Loci on different chromosomes are not linked and θ_rs = 0.5, so that

E(F_ir F_is) = F_i²

In this case, the last term can be written as

F_i² (Δ_h² − Δ_h^(2))

10.5.5 Genetic covariance

We turn to E(g_i g_j). Again,

g_i g_j = 1_L^T g_i g_j^T 1_L

and hence we need to consider expectations of the terms g_ir g_jr and g_ir g_js. For the first term, the components for the calculation of the conditional mean are indicated in Table 10.4. It is much simpler in this case to multiply the terms and then sum over the subscripts. There are seven distinct such sums, multiplied in turn by ϖ_r1, ϖ_r2 + ϖ_r3, ϖ_r4, ϖ_r5, ϖ_r6, ϖ_r7 + ϖ_r8, and ϖ_r9 respectively. Many of the terms are zero due to the constraints (10.5.2).

Table 10.4 Calculation of the mean of g_ir g_jr

Mode   Probability   i IBD   j IBD   ij IBD   Component of g_ir g_jr                        Probability
I1     ϖ_r1          yes     yes     yes      (2a_rs + d_rss)²                              p_rs
I2     ϖ_r2          yes     no      yes      (2a_rs + d_rss)(a_rs + a_rt + d_rst)          p_rs p_rt
I3     ϖ_r3          no      yes     yes      (a_rs + a_rt + d_rst)(2a_rs + d_rss)          p_rs p_rt
I4     ϖ_r4          no      no      yes      (a_rs + a_rt + d_rst)²                        p_rs p_rt
I5     ϖ_r5          no      no      yes      (a_rs + a_rt + d_rst)(a_rs + a_ru + d_rsu)    p_rs p_rt p_ru
I6     ϖ_r6          yes     yes     no       (2a_rs + d_rss)(2a_rt + d_rtt)                p_rs p_rt
I7     ϖ_r7          yes     no      no       (2a_rs + d_rss)(a_rt + a_ru + d_rtu)          p_rs p_rt p_ru
I8     ϖ_r8          no      yes     no       (a_rs + a_rt + d_rst)(2a_ru + d_ruu)          p_rs p_rt p_ru
I9     ϖ_r9          no      no      no       (a_rs + a_rt + d_rst)(a_ru + a_rv + d_ruv)    p_rs p_rt p_ru p_rv

After taking expectation with respect to the ϖ_r, we find

E(g_ir g_jr) = 4{ϖ_1 + (1/2)(ϖ_2 + ϖ_3 + ϖ_4) + (1/4)ϖ_5} a_r^(2)T p_r + ϖ_4 p_r^T D_r^(2) p_r
             + 4{ϖ_1 + (1/4)(ϖ_2 + ϖ_3)} (a_r.d_rh)^T p_r + ϖ_1 d_rh^(2)T p_r + ϖ_6 Δ_rh²

The expectation of g_ir g_js is simply (ϖ_1 + ϖ_6) Δ_rh Δ_sh; most terms in Table 10.4 have zero probability for different loci on different individuals. As for E(g_i²), we can form the matrix of expectations G_ij, and E(g_i g_j) = 1_L^T G_ij 1_L. This becomes

E(g_i g_j) = 2{ϖ_1 + (1/2)(ϖ_2 + ϖ_3 + ϖ_4) + (1/4)ϖ_5} σ_a² + ϖ_4 σ_d² + {2ϖ_1 + (1/2)(ϖ_2 + ϖ_3)} σ_adh
           + ϖ_1 Σ_{r=1}^{L} d_rh^(2)T p_r + ϖ_6 Σ_{r=1}^{L} Δ_rh² + (ϖ_1 + ϖ_6) Σ_{r≠s} Δ_rh Δ_sh

The covariance between g_i and g_j is, using (10.5.9),

cov(g_i, g_j) = 2{ϖ_1 + (1/2)(ϖ_2 + ϖ_3 + ϖ_4) + (1/4)ϖ_5} σ_a² + ϖ_4 σ_d² + {2ϖ_1 + (1/2)(ϖ_2 + ϖ_3)} σ_adh
              + ϖ_1 σ_dh² + (ϖ_1 + ϖ_6 − F_i F_j) Δ_h²        (10.5.14)

It remains to express the ϖ in terms of F_i, F_j and kinship coefficients. The values are given in Table 10.5. Note that individuals i and j have parents u and v, and w and x, respectively, and we define

f_uv·wx = f_uw f_vx + f_ux f_vw

Table 10.5 Identity mode probabilities: individuals i and j with parents u and v, and w and x

Mode   Probability
I1     F_i F_j f_ij
I2     2 F_i (1 − F_j) f_ij
I3     2 (1 − F_i) F_j f_ij
I4     (1 − F_i)(1 − F_j) f_uv·wx
I5     (1 − F_i)(1 − F_j)(4f_ij − 2f_uv·wx)
I6     F_i F_j (1 − f_ij)
I7     F_i (1 − F_j)(1 − 2f_ij)
I8     (1 − F_i) F_j (1 − 2f_ij)
I9     (1 − F_i)(1 − F_j)(1 − 4f_ij + f_uv·wx)

Substituting these results into (10.5.14), we find

cov(g_i, g_j) = 2f_ij σ_a² + (1 − F_i)(1 − F_j) f_uv·wx σ_d² + (F_i + F_j) f_ij σ_adh + F_i F_j f_ij σ_dh²        (10.5.15)

10.5.6 Full variance-covariance matrix

Combining (10.5.13) and (10.5.15), and writing g for the vector of individual genetic effects, we find that

var(g) = σ_a² A + σ_d² D + σ_adh C + σ_dh² H + Δ_h² D_I + (Δ_h² − Δ_h^(2)) E        (10.5.16)

where D_I = diag(F_i)(I − diag(F_i)), and we assume all loci are unlinked. Equation (10.5.16) is of the same form as that given by ?, although explicit expressions for all terms were not given there. The elements of A are A_ij = 2f_ij, while the terms in all other matrices can be written as functions of the A_ij. The diagonal elements of D are

D_ii = 1 − F_i = 2 − A_ii

and the off-diagonal elements are

D_ij = (1 − F_i)(1 − F_j) f_uv·wx = (1/4)(1 − F_i)(1 − F_j)(A_uw A_vx + A_ux A_vw)

where the parents of i are (u, v) and of j are (w, x). The diagonal elements of C are

C_ii = 2F_i = 2(A_ii − 1)

while for i ≠ j,

C_ij = (F_i + F_j) f_ij = (1/2)(A_ii + A_jj − 2) A_ij

Similarly, the elements of H are

H_ii = F_i = A_ii − 1

while for i ≠ j,

H_ij = F_i F_j f_ij = (1/2)(A_ii − 1)(A_jj − 1) A_ij

The diagonal matrix E has elements

E_ii = F_i² = (A_ii − 1)²

Lastly, the term involving the diagonal matrix D_I = diag(F_i)(I − diag(F_i)) represents inbreeding depression.
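The element-wise formulae above translate directly into code. The sketch below is our own; it assumes a pedigree ordered so that parents precede offspring, and gives individuals with unrecorded parents a zero dominance relationship (an assumption made for illustration only):

import numpy as np

def genetic_var_matrices(A, parents):
    """Build D, C, H, E element-wise from A (A_ij = 2 f_ij) and the pedigree."""
    n = A.shape[0]
    F = np.diag(A) - 1.0                        # inbreeding coefficients F_i
    D, C, H = (np.zeros((n, n)) for _ in range(3))
    for i in range(n):
        D[i, i] = 2.0 - A[i, i]                 # 1 - F_i
        C[i, i] = 2.0 * F[i]
        H[i, i] = F[i]
        for j in range(i):
            u, v = parents[i]
            w, x = parents[j]
            if None in (u, v, w, x):
                dij = 0.0                       # unknown parents (assumption)
            else:
                dij = 0.25 * (1 - F[i]) * (1 - F[j]) * (
                    A[u, w] * A[v, x] + A[u, x] * A[v, w])
            D[i, j] = D[j, i] = dij
            C[i, j] = C[j, i] = 0.5 * (A[i, i] + A[j, j] - 2.0) * A[i, j]
            H[i, j] = H[j, i] = 0.5 * F[i] * F[j] * A[i, j]
    return D, C, H, np.diag(F**2)

# Two unrelated non-inbred founders (0, 1) and their full-sib offspring (2, 3)
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
parents = [(None, None), (None, None), (0, 1), (0, 1)]
D, C, H, E = genetic_var_matrices(A, parents)
print(D[2, 3])    # 0.25, the classical full-sib dominance relationship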

10.5.7 Computation of the relationship matrices

The major problem from a practical point of view is both the number of terms in (10.5.16) and the evaluation of the inverses of the matrices involved. ? show in a simulation study that the leading two terms provide an accurate approximation to the full matrix under certain circumstances. Thus

var(g) ≈ σ_a² A + σ_d² D

which is the variance matrix for the sum of two random effects, one for additive and one for dominance effects. This is the approach used in animal and some plant breeding situations where mixed models and pedigrees are standard. Thus, if we include an independent residual component, the model is

g = a + d + i        (10.5.17)

(see ?). The efficient calculation of the numerator relationship matrix A has been the subject of considerable research. ? provided a recursive method based on triples, in which each progeny is listed together with its parents, and parents precede their progeny. The algorithm focused on obtaining the inverse of A, because this matrix is required for the mixed model equations (see ?). ? provide the extension to inbred lines, including doubled haploid and recombinant inbred lines, in the general pedigree situation.

The second problem has been the dominance relationship matrix D. This is a large matrix, and methods to simplify the calculations, in particular in regard to the inverse, have been examined. The elements of D are functions of the elements of A, and it is difficult to see a simple way in which A⁻¹ can be used to determine D.

? present an approach to find D that uses the family structure of the population. As dominance relationships are determined by the parents of individuals, the full dominance relationship matrix can be partitioned into a between-family dominance matrix and a within-family dominance matrix. Note that the dominance component of the covariance between lines arises from the identity mode I4. I4 is also a component of the variance of a line, and hence the term corresponding to this identity state provides the between-family dominance variance. All lines within a family share the same variance and the same covariance with lines of other families. The variance for a line i in a family is

D_bi = (1/4)(1 − F_i)²(A_uu A_vv + A_uv²)

and the covariance between i and a line j in another family is

D_bij = (1/4)(1 − F_i)(1 − F_j)(A_uw A_vx + A_ux A_vw)

The within-family variance is the residual variance, giving a scaled diagonal matrix with elements

D_wi = (1 − F_i) − (1/4)(1 − F_i)²(A_uu A_vv + A_uv²)

the same for all individuals in a family. Thus we can write

d = Z_b d_b + d_w

where d_b contains the between-family dominance effects, with variance matrix D_b having diagonal elements D_bi and off-diagonal elements D_bij. The design matrix Z_b provides the assignment of individuals to families. The term d_w is the vector of within-family dominance effects, with diagonal variance matrix having family blocks D_wi I_qi, where q_i is the size of family i. This between/within family formulation has the potential to reduce the computational burden in many situations. Note that

D = var(d) = Z_b D_b Z_b^T + D_w

so that

D⁻¹ = D_w⁻¹ − D_w⁻¹ Z_b (Z_b^T D_w⁻¹ Z_b + D_b⁻¹)⁻¹ Z_b^T D_w⁻¹

and the inverse can be found using smaller and simpler matrices; however, D_b is constructed from A and it is not clear how to use A⁻¹.

The two matrices C and H are also functions of A, but, like D, their nonlinear dependence seems to preclude a simple way to find their inverses, which are required for estimation and prediction. The matrix in the inbreeding depression term and E are diagonal matrices and present little difficulty.

A full model for g that provides the structure given in (10.5.16) is

g = a + Z_b d_b + d_w + d_h + a_dh + i_d + d_e + i        (10.5.18)

where

a ∼ N(0, σ_a² A),        i ∼ N(0, σ_i² I),
d_b ∼ N(0, σ_d² D_b),    d_w ∼ N(0, σ_d² D_w),
d_h ∼ N(0, σ_dh² H),     a_dh ∼ N(0, σ_adh C),
i_d ∼ N(0, σ_id² D_I),   d_e ∼ N(0, σ_e E)

Note that σ_adh and σ_e = (Δ_h² − Δ_h^(2)) need not be positive, and in estimation these parameters should be unconstrained.

All the variance-covariance matrices in this model are a scalar variance-covariance parameter multiplied by a known (that is, able to be calculated) symmetric matrix. Thus the required inverses can be calculated before estimation, and only once for each problem. With the power of modern computers this is only likely to be an issue for large pedigrees.
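The between/within family expression for D⁻¹ used above is the standard Woodbury identity, and is easy to verify numerically; a sketch of our own with made-up family sizes and variance matrices:

import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 2, 4]                           # family sizes q_i (made up)
n, nf = sum(sizes), len(sizes)

Zb = np.zeros((n, nf))                      # assigns individuals to families
row = 0
for fam, q in enumerate(sizes):
    Zb[row:row + q, fam] = 1.0
    row += q

B = rng.normal(size=(nf, nf))
Db = B @ B.T + nf * np.eye(nf)              # positive definite between-family part
Dw = np.diag(rng.uniform(0.5, 1.5, size=n)) # diagonal within-family part

D = Zb @ Db @ Zb.T + Dw
Dw_inv, Db_inv = np.linalg.inv(Dw), np.linalg.inv(Db)
inner = np.linalg.inv(Zb.T @ Dw_inv @ Zb + Db_inv)
D_inv = Dw_inv - Dw_inv @ Zb @ inner @ Zb.T @ Dw_inv

print(np.allclose(D_inv, np.linalg.inv(D)))   # True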

10.6 Discussion

The aim of this chapter has been to present a simplified and modern derivation of the variance matrix of related individuals. The results obtained are explicit and are important for use in the analysis of plant breeding trials (?). The use of family structure simplifies the computation of the dominance effects (?); in fact it allows a partition of effects that is useful from a practical point of view. See ? for the practical application of this idea. The implementation involves the use of A in the direct calculation of the other variance-covariance matrices.

The approach provides explicit results for all components of the full variance-covariance matrix, and differs from ?, who focus on the decomposition of the dominance variance-covariance matrix. They also differ in that they recover the within-family dominance components by back-substitution.

Importantly, in applications, determining the various components of variance/covariance is of interest, and the additive, dominance and inbreeding depression effects for each line can be obtained by best linear unbiased prediction (?); this may not be relevant for some of the other terms. The formulation presented here has been implemented in ASReml (?).

CHAPTER 11

Mixed models for plant breeding

11.1 Introduction

The aim of plant breeding and crop evaluation programs is to produce new varieties that are superior in terms of a range of key traits. These traits include grain yield, quality traits associated with product end-use (for example milling yield of wheat and malting quality of barley) and resistance to disease. The process of breeding new varieties is lengthy and typically involves sequential stages of testing. In the initial stages a large number (possibly greater than 1000) of new breeding lines are grown in a small number (possibly only one) of designed field experiments. Note that we reserve the term ‘variety’ for a commercial variety and use the terms ‘line’ or ‘genotype’ to collectively represent any of the material (either new test or existing commercial material) grown in a breeding trial. The ‘best’ test lines (in terms of a range of traits) are selected to progress to the next stage of testing. This process continues with subsequent stages involving progressively fewer test lines grown in more field experiments. Ultimately only a few (typically less than 10) elite test lines are grown together with existing standard varieties in trials that span a wide range of geographic locations. The best test lines may then be released as varieties for commercial use.

11.1.1 Obtaining trait data: single and multi-phase experiments

In this chapter we consider selection for quantitative traits, since this can be achieved using a linear mixed model approach for data analysis. These traits include grain yield and many of the traits associated with grain and end-product quality. Resistance to disease is often measured using qualitative scoring systems; such data are not amenable to a linear mixed model analysis and so are not considered here. All data under consideration are derived from designed field experiments. Some traits, such as grain yield, are measured directly from the field in what can be termed single phase experiments. Most quality traits, however, involve an additional measurement phase (or phases) in which grain from field plots undergoes some process in a laboratory. For example, wheat grain is milled in order to measure the amount of flour produced (so-called milling yield), and barley grain is malted in order to take a range of measurements that reflect suitability for beer production. Such traits are therefore derived from multi-phase experiments.

In this chapter we consider the analysis of data from both single and multi-phase plant breeding experiments.

The design of field experiments and the analysis of the resultant data has a long history. In terms of plant breeding trials, designs may vary depending on the stage of testing. In early stages replication of test lines may not be feasible, due to insufficient seed for some lines and/or restrictions on the size of the trial. The most widely adopted design for this situation is a grid-plot design, in which a systematic grid of a standard variety (or several standard varieties) is interposed among plots of unreplicated test lines. More recently Cullis et al. (2005) proposed a superior alternative, known as p-rep designs, in which grid plots are replaced by replicate plots of test lines. Thus a percentage, p, of test lines is replicated and the total trial size is unaffected. In later stages of testing fully replicated trials are common practice. Both the fully and partially replicated designs usually employ some form of blocking and possibly neighbour balance.

The analysis of data from field trials may be either randomisation or model based. In the latter the focus is on the need to control spatial variation. As implied by the terminology, this variation is linked to the spatial locations of plots in the field and may be due, for example, to fluctuations in soil fertility. Such models are now widely used for plant breeding trials and can lead to substantial gains in the accuracy and efficiency of estimated genotype contrasts compared with the randomisation approach, particularly when block sizes are large. The major criticism of model based approaches is that estimates of treatment effects and their standard errors rely solely on the chosen model, whereas the randomisation based analysis is validated by recourse to randomisation theory. In our experience with conducting the annual analyses of breeding trials from most Australian public breeding and evaluation programs, the gains of a spatial approach outweigh this potential disadvantage. We safeguard against it to some extent by using an approach that merges the randomisation and spatial approaches: we use the randomisation based model as the baseline, then build on this to model remaining spatial variation, using the approach of Gilmour et al. (1997). Thus we do not regard a spatial model as a replacement for the randomisation based model, but rather as an enhancement to better accommodate field trend. Details of our approach are given in Section 11.2.

In contrast to data obtained directly from the field, the design and analysis of quality trait data has received scant attention in the literature. In terms of design, common practice involves the use of a single field replicate and no randomisation or replication in the laboratory (although a laboratory control sample is often processed at regular intervals). Smith et al. (2001b) promote the need to employ proper experimental techniques in the laboratory phase. Thus they propose that individual field replicate samples should be processed in the laboratory, and that the resultant samples should be replicated and randomised in the laboratory process. Until recently there has been little guidance on how this should be achieved. Wood et al. (1988) give some information for multi-phase experiments in general, but quality trait data provide some specific challenges that cannot be handled using their approach.
The key issue is that the cost of obtaining quality trait data is very high. This means that the number of samples tested must be kept to a minimum, and certainly it is not feasible to process all field replicates of all test lines and then replicate all the resultant samples in the laboratory. Smith et al. (2005b) provide a viable solution that builds on the p-rep concept of Cullis et al. (2005). In short, replicate plots of a percentage, p, of lines are processed in the laboratory, making a total of n_p plots, say. Then a percentage, q, of these n_p plots is replicated in the laboratory (that is, the grain from each of these plots is split into two or more samples to be processed separately); the sketch at the end of this section illustrates the resulting sample numbers. The concept can be extended to accommodate more than one phase in the laboratory. In terms of the randomisation of samples to 'positions' in the laboratory process there are many unresolved issues. For example, should we use field layout information when allocating samples in the laboratory? If so, how should this be done? Design issues, particularly for the partially replicated designs of Smith et al. (2005b), are very complex and are the subject of current research. At present we base our designs on the recommendation in Wood et al. (1988), namely that the design in the laboratory phase should be efficient for genotype comparisons. Thus we ignore other field information and allocate lines to positions in the laboratory, using blocking and neighbour balance as required.

In terms of the analysis of quality trait data, Smith et al. (2001b) and Cullis et al. (2003) give mixed model approaches. The key elements are the partitioning of error variation into all potential sources, that is, those associated with each phase of the experiment, and the modelling of trend as appropriate. Modelling of quality trait data can be challenging, particularly if there are several laboratory phases. However, as with the spatial analysis of field trials, the gains compared with a randomisation based analysis can be substantial. Our approach to analysis is discussed in detail in Section 11.3.
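The sample-number arithmetic of such a partially replicated two-phase design is simple; the figures below are hypothetical and purely illustrative:

n, p, q = 500, 0.20, 0.10       # test lines; field and laboratory replication rates

field_plots = round(n * (1 + p))            # n_p in the notation above
lab_samples = round(field_plots * (1 + q))  # samples entering the laboratory
print(field_plots, lab_samples)             # 600 field plots, 660 lab samples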

11.1.2 Selecting the best test lines

The mixed model approaches we propose for the analysis of plant breeding data (whether derived from single or multi-phase experiments) all have the same aim, namely to provide information on which to base selection decisions. This is a crucial point, as it determines the status of genotype effects in the models, that is, whether they should be taken as fixed or random. Since the aim is selection, we require the rankings of the estimated test line effects to be as close as possible to those of the true effects. In more exact terms, we require a set of estimates of test line effects that best predicts the true effects. By definition this means that genotype effects should be regarded as random, and consequently predicted using best linear unbiased prediction (BLUP). This may be contrasted with the aim of determining the difference between specific lines (for example, if a seed company wishes to know the difference between their potential new variety and other commercial varieties), in which case the use of best linear unbiased prediction is inappropriate, since the BLUP of a specific difference is biased; in that case genotype effects should be regarded as fixed. We believe, however, that the aim of the analysis of most plant breeding and crop improvement data (irrespective of the stage of testing) is selection, so we use the assumption of random genotype effects. The reader is referred to Smith et al. (2005a) for a discussion of this issue and for a listing of relevant references.

Selection may be conducted separately for individual traits, or for a range of traits simultaneously using a selection index approach (see Falconer and Mackay, 1996, for example). We consider the former, and with this in mind propose the following general form for the mixed model for a single trait (derived either from a single or a multi-phase experiment):

y = Xτ + Z_g u_g + Z_o u_o + e        (11.1.1)

where y is the n × 1 data vector, τ is the p × 1 vector of fixed effects with associated n × p design matrix X (assumed to have full column rank), u_g is the g × 1 vector of random genotype effects with associated n × g design matrix Z_g, u_o is the b × 1 vector of other (non-genetic) random effects with associated n × b design matrix Z_o, and e is the vector of residual effects. Without loss of generality we write u_g = (u_s′, u_t′)′, where u_s and u_t are vectors of standard and test genotype effects respectively (with lengths s and t such that g = s + t).

In the simplest case the fixed effects in (11.1.1) comprise a single effect, namely an overall mean, but they may include effects for covariates to model trend associated with either the field or laboratory phase/s. The vector u_o comprises block effects associated with the experimental design and effects to model variation. In the case of multi-phase experiments it also includes residual effects for phases other than the final phase. Thus in a two-phase experiment the vector e corresponds to the laboratory residuals and the field plot residuals are represented by a sub-vector of u_o; in a single phase experiment the field plot residuals are represented by the vector e. Full details of model terms are given in Sections 11.2 and 11.3.

We assume that the joint distribution of (u_g′, u_o′, e′)′ is Gaussian with zero mean and variance matrix

         [ G_g(γ_g)   0          0    ]
V = σ_H² [ 0          G_o(γ_o)   0    ]
         [ 0          0          R(φ) ]

where γ_g, γ_o and φ are vectors of unknown variance parameters. The matrix G_g is often a scaled identity matrix, that is, G_g = γ_g I_g; the associated variance component, σ_g² = σ_H² γ_g, is often termed the genetic variance. Another possibility is G_g = γ_g A, where A is a known relationship matrix. At present pedigrees are not generally used in routine analyses of early generation trials, so this will not be considered here. Forms for G_o and R will be discussed in Sections 11.2 and 11.3.
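As a concrete toy illustration of (11.1.1), the sketch below is our own: all sizes and variance ratios are hypothetical, and the variance parameters are treated as known (in practice REML estimates are used, giving E-BLUPs). It generates a small complete-block trial with random genotype and block effects, and recovers the fixed mean and the genotype BLUPs from Henderson's mixed model equations:

import numpy as np

rng = np.random.default_rng(1)
g, nb = 6, 4                            # genotypes, complete blocks
gamma_g, gamma_o = 2.0, 0.5             # variance ratios, taken as known
n = g * nb

geno = np.tile(np.arange(g), nb)        # field layout: each block has all lines
block = np.repeat(np.arange(nb), g)

X = np.ones((n, 1))                     # fixed effects: overall mean only
Zg = np.eye(g)[geno]                    # genotype design matrix
Zo = np.eye(nb)[block]                  # block design matrix

u_g = np.sqrt(gamma_g) * rng.standard_normal(g)
u_o = np.sqrt(gamma_o) * rng.standard_normal(nb)
y = 10.0 + Zg @ u_g + Zo @ u_o + rng.standard_normal(n)

# Henderson's mixed model equations with R = I and sigma_H^2 = 1
W = np.hstack([X, Zg, Zo])
lhs = W.T @ W
lhs[1:, 1:] += np.diag(np.r_[np.full(g, 1 / gamma_g),
                             np.full(nb, 1 / gamma_o)])
sol = np.linalg.solve(lhs, W.T @ y)
mu_blue, ug_blup = sol[0], sol[1:1 + g]

print(mu_blue)                          # BLUE of the overall mean
print(np.corrcoef(ug_blup, u_g)[0, 1])  # BLUPs track the true genotype effects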
Thus, given values of σ2 , γ and Ctt, we would H g simulate values of u˜t from the distribution in (11.1.2) then select the top m% of lines and calculate EGG for that simulation as the corresponding mean. The final EGG is taken to be the mean across a (large) number of simulations. An alternative and simpler method of calculation of EGG is obtained by approximating the prediction error variance matrix in (11.1.2) with a scaled identity. A good approximation can be obtained using the concept of ‘effective error variance’ (see Cochran and Cox, 1957, for example). Thus we approx- tt imate C by AttIt/2 where Att is the (scaled) average pairwise prediction error variance of test genotype effects, that is, 2  1  A = tr Ctt − 10 Ctt1 (11.1.3) tt t − 1 t t t Then we have u˜ ≈ N (0, σ2 (γ − A /2)I ) (11.1.4) t H g tt t q ⇒ u˜ / σ2 (γ − A /2) ≈ N (0, I ) t H g tt t so can calculate EGG as q EGG = i σ2 (γ − A /2) H g tt

= iσghg (11.1.5) where i is the ‘selection intensity’ corresponding to m (that is, the mean of the 182 MIXED MODELS FOR PLANT BREEDING top m% of order statistics from a standard normal distribution of size t) and hg is the square root of a generalised measure of mean line heritability with 2 hg = 1 − Att/(2γg). Note that the equation for EGG in (11.1.5) is analogous to the standard quantitative genetics formula (see Falconer and Mackay, 1996, for example). The difference is that we propose the use of the generalised mea- sure of heritability rather than the standard measure that is calculated as the ratio of genetic variance to total (genetic plus error) variance. In the simplest case of balanced data and a model with no fixed effects other than an overall mean and no random effects other than those associated with genotype and residual error (both with simple scaled identity variance matrices) the gener- alised and standard heritability measures are identical. In all other cases they will differ and importantly, the standard heritability measure will not relate to response to selection. Note that we have computed measures of heritability and EGG in terms of the total genetic variance. Often these measures are given in relation to the so-called additive genetic variance which is the por- tion of genetic variance associated with the resemblance between relatives. In order to obtain heritability and EGG in terms of additive genetic variance we would need to include a known relationship matrix in the model, that is, use 2 a variance matrix for the genotype effects of the form var (ug) = σg A.
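To make the two routes to EGG concrete, the following is a minimal sketch in Python (an assumption of ours — the authors work with ASReml and samm, not Python; the matrix $C^{tt}$ and variance parameters below are illustrative placeholders, not values from any real analysis). It computes EGG both by simulating from (11.1.2) and via the effective error variance approximation (11.1.3)–(11.1.5).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative inputs (placeholders, not from a real analysis)
t = 200                        # number of test lines
sigma2_H = 1.0                 # scale parameter sigma^2_H
gamma_g = 0.5                  # genetic variance ratio gamma_g
Ctt = 0.2 * np.eye(t) + 0.0005 * np.ones((t, t))  # scaled PEV matrix C^tt

m = 0.20                       # select the top m = 20% of lines
k = int(np.ceil(m * t))        # number of lines selected

# Route 1: simulate u~_t from (11.1.2), average the top m% over many draws
V = sigma2_H * (gamma_g * np.eye(t) - Ctt)
L = np.linalg.cholesky(V)
egg_sim = np.mean([np.sort(L @ rng.standard_normal(t))[-k:].mean()
                   for _ in range(2000)])

# Route 2: effective error variance approximation (11.1.3)-(11.1.5)
ones = np.ones(t)
A_tt = 2.0 / (t - 1) * (np.trace(Ctt) - ones @ Ctt @ ones / t)
i = norm.pdf(norm.ppf(1 - m)) / m   # selection intensity (infinite-sample form)
egg_approx = i * np.sqrt(sigma2_H * (gamma_g - A_tt / 2))

print(egg_sim, egg_approx)

With a near-diagonal $C^{tt}$ as above the two routes agree closely, which is exactly the situation in which the scaled-identity approximation is good.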

11.2 Spatial analysis of field trials

The literature on methods for the analysis of field trials (which includes our specific setting of variety trials) is quite diverse but the methods can be broadly classified as either randomisation or model based. In the former the model for residual effects is determined purely from the experimental design whereas in the latter it is either assumed or selected with the objective of providing a good fit to the data. In the following sections we discuss both types of analysis. Finally we describe our approach to analysis, which combines both randomisation and modelling aspects and is based on the methods of Cullis and Gleeson (1991) and Gilmour et al. (1997). We illustrate the approach using grain yield data from an Australian early generation wheat variety trial.

11.2.1 Randomisation based analysis

A randomisation based analysis of a variety trial may be conducted using the model in (11.1.1) with sub-vectors of $u_o$ corresponding to terms in the block structure of the experiment (see Nelder, 1965a, for a complete account). For example, if the experiment was designed as a randomised complete block (RCB) experiment with $n_r$ replicates (complete blocks) then $u_o$ would have length $n_r$ (comprising an effect for each replicate) and the effects would be assumed independent with constant variance $\sigma^2_H \gamma_r$, say. Thus $G_o = \gamma_r I_{n_r}$. The vector of residuals would then comprise independent effects with constant variance $\sigma^2_H$, thence $R = I_n$. In an incomplete block (IB) design with $n_r$ replicates and $n_b$ (incomplete) blocks per replicate there would be two sub-vectors in $u_o$, the first corresponding to the replicate effects and the second to block within replicate effects. Independence is assumed both within and between these sub-vectors and the effects have associated variance components of $\sigma^2_H \gamma_r$ and $\sigma^2_H \gamma_b$ for replicates and blocks within replicates, respectively. Thus

$G_o = \mathrm{diag}\,(\gamma_r I_{n_r},\; \gamma_b I_{n_r n_b})$. As in the RCB design we would have $R = I_n$. Note that Nelder (1954) discusses the need to allow variance components associated with blocking factors to be negative in order for the mixed model to provide a proper surrogate for the randomisation analysis.
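As a concrete sketch of these randomisation based structures (Python with numpy is our assumption; the dimensions and variance ratios are invented for illustration), the $G$-structure for the IB design is simply block-diagonal in the replicate and block-within-replicate parts:

import numpy as np

nr, nb, k = 3, 5, 4            # replicates, blocks per replicate, plots per block
n = nr * nb * k                # total number of plots
gamma_r, gamma_b = 0.2, 0.1    # illustrative variance ratios

# G_o = diag(gamma_r I_{nr}, gamma_b I_{nr*nb}); R = I_n
G_o = np.block([
    [gamma_r * np.eye(nr), np.zeros((nr, nr * nb))],
    [np.zeros((nr * nb, nr)), gamma_b * np.eye(nr * nb)],
])
R = np.eye(n)
# With Z_o built from the replicate and block factors,
# var(y) = sigma2_H * (Z_g G_g Z_g' + Z_o G_o Z_o' + R).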

11.2.2 Model based analysis

Model based approaches for the analysis of variety trials aim to account for the effect of spatial heterogeneity on the prediction of genotype contrasts. Typically the heterogeneity reflects the fact that, in the absence of design effects, data from plots that are close together (that is, neighbouring plots) are more similar (positively correlated) than those that are further apart. Numerous authors have proposed analytical methods to remove the effects of such trend from the estimation of genotype contrasts. The earliest method was that of Papadakis (1937) in which neighbouring plot yields were used as covariates in the analysis. Interest in the area was re-ignited by Wilkinson et al. (1983) who suggested that spatial field trend could be expressed as the sum of two components, namely a smooth trend and an independent error term. They removed the assumed trend by (second) differencing the data. The method of differencing adjacent plot yields has been used by other authors as a means of modelling trend (Green et al., 1985; Besag and Kempton, 1986, for example). Gleeson and Cullis (1987), Martin (1990) and Cullis and Gleeson (1991) proposed approaches that model trend using time series models (with differencing still having a role as a means to achieve stationarity). A key aspect of Martin (1990) and Cullis and Gleeson (1991) is the use of separable correlation models to accommodate trend in two dimensions (field rows and columns). Zimmerman and Harville (1991) also proposed a direct modelling approach using geostatistical models. They view spatial variation as comprising two sources, namely large-scale variation that is modelled through the mean, and small-scale variation that is modelled through the covariance structure. Gilmour et al. (1997) extended the approach of Cullis and Gleeson (1991) by partitioning spatial variation into two types of smooth trend (local and global) and extraneous variation. Local trend reflects, for example, small scale soil depth and fertility fluctuations. Global trend reflects non-stationary trend across the field. Extraneous variation is often linked to trial management, in particular, procedures that are aligned with the field rows and columns (for example, the sowing and harvesting of plots). Certain procedures may result in row and column effects (systematic and/or random). In the Gilmour et al. (1997) approach global trend and extraneous variation are accommodated in the model by including appropriate fixed and/or random effects. Local stationary trend is accommodated using a correlation structure for the residuals. Thus there are similarities with the Zimmerman and Harville (1991) approach.

Most of the current spatial approaches for the analysis of field trials are of the form advocated by Zimmerman and Harville (1991) and Gilmour et al. (1997), that is, they involve a direct modelling of local spatial trend using a covariance model. In the following we present a general framework for such models, then give specific details for our approach to analysis, which is based on Gilmour et al. (1997).

11.2.3 Covariance models for local spatial trend

We assume a field trial that has a two-dimensional layout indexed by field rows $(1 \ldots r)$ and columns $(1 \ldots c)$ so that the total number of observations is $n = rc$. We assume the data to be ordered as rows within columns. Extensions to non-rectangular or non-contiguous layouts are straight-forward. Consistent with the notation of Chapter 9 we let the vector $s_i = (s_{ir}, s_{ic})$ denote the spatial location of the $i$th plot in the field, where $s_{ir}$ and $s_{ic}$ are the row and column co-ordinates respectively. In terms of the mixed model in (11.1.1) we assume the residuals to be spatially correlated, that is, $e = e(s)$ where $e$ is a realisation of a stationary Gaussian process with zero mean and variance matrix $\sigma^2_H R$. The elements of $R$ are given by $\rho(s_i - s_j, \phi)$, $\rho(\cdot)$ being a correlation function with a parameter vector $\phi$ and dependent on the spatial separation vector $h_{ij} = (h_{ijr}, h_{ijc}) = s_i - s_j$. We note that each field plot is in itself a two-dimensional region and the data for an individual plot consist of the grain yield harvested from the entire region (or most of that region, with the exclusion of small edge areas). In terms of the mixed model in (11.1.1) we assume that the data for the $i$th plot are concentrated at the centroid of the plot.

The correlation process $e(s)$ is second-order stationary so that the dependence between any plots $i$ and $j$ depends only on the distance between them (see Chapter 9). In contrast to geostatistical applications we further assume that the two-dimensional process is separable so that the correlation function is given by the product of the correlation functions for rows and columns. The separability assumption is computationally convenient and appears to be reasonable for the two-dimensional spatial trend process associated with field trials (see Martin, 1990; Cullis and Gleeson, 1991, for example). More work is needed in this area, however. For the separable model we have

$$\rho(h_{ij}, \phi) = \rho_r(h_{ijr}, \phi_r)\, \rho_c(h_{ijc}, \phi_c)$$
where $\rho_r$ and $\rho_c$ are the correlation functions for rows and columns respectively. Correspondingly, the variance matrix for $e$ can be written as
$$\mathrm{var}(e) = \sigma^2_H \left( R_c(\phi_c) \otimes R_r(\phi_r) \right)$$
where $R_r$ and $R_c$ are the $r \times r$ and $c \times c$ correlation matrices for the row and column dimensions respectively.

As discussed in Chapter 9 many forms for $\rho_j(\cdot)$, $j = r, c$, are possible. The use of statistical criteria for choosing between covariance functions for a particular data set can be a difficult task (see Zimmerman and Harville, 1991, for example). Also, we question the merit of searching for the 'best' covariance model. After all, as Besag and Kempton (1986) pointed out, any of these models "... can only provide a rather crude representation of the underlying fertility pattern and is not intended as a proper biological model." Also, Zimmerman and Harville (1991) found that "... among isotropic covariance functions that are continuous, nonnegative, and monotone decreasing, the estimates of treatment contrasts are relatively insensitive to the choice of covariance function ...". We concur with these points of view so recommend use of a plausible spatial model that has broad application. We choose a separable autoregressive process of order 1 (hereafter denoted AR1×AR1) as originally proposed by Cullis and Gleeson (1991) and used by Gilmour et al. (1997). The AR1×AR1 model is a special case of a separable exponential model. The correlation function for the latter (see Chapter 9) in the context of our field trial is given by

$$\rho(h_{ij}, \phi) = \exp(-|h_{ijr}|/\phi_r)\, \exp(-|h_{ijc}|/\phi_c) \qquad (11.2.6)$$
In field experiments, plots are often of equal size and are laid out in a contiguous array, so that the distance between plots can be measured simply in terms of row and column numbers. Let $h^*_{ijr}$ be the difference in row numbers between plots $i$ and $j$, so that $h^*_{ijr}$ has possible values $0, 1, \ldots, (r-1)$. Define $h^*_{ijc}$ similarly. If $m_r$ and $m_c$ are the actual distances (in metres, say) between the centroids of plots in the row and column directions, respectively, then $h_{ijr} = m_r h^*_{ijr}$, so that the function in equation (11.2.6) can be written as

$$\rho(h_{ij}, \phi) = \alpha_r^{|h^*_{ijr}|}\, \alpha_c^{|h^*_{ijc}|} \qquad (11.2.7)$$
where, for example, $\alpha_r = \exp(-m_r/\phi_r)$. The parameters $\alpha_r$ and $\alpha_c$ are, by definition, positive. If this restriction is lifted, equation (11.2.7) is the correlation function for an AR1×AR1 process. The parameters $\phi = (\alpha_r, \alpha_c)'$ are known as the autoregressive correlation coefficients.
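A short sketch (Python/numpy is our assumption; the layout and correlation values are illustrative only) of how the separable AR1×AR1 correlation matrix is assembled for data ordered as rows within columns:

import numpy as np

def ar1_corr(dim, alpha):
    """Correlation matrix of an AR1 process: alpha^{|i-j|}."""
    idx = np.arange(dim)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

# Hypothetical layout and parameters (not from a real trial):
r, c = 10, 6                 # field rows and columns
alpha_r, alpha_c = 0.6, 0.4  # autoregressive correlation coefficients

# Data ordered as rows within columns: var(e) = sigma2_H * (Rc kron Rr)
Rr, Rc = ar1_corr(r, alpha_r), ar1_corr(c, alpha_c)
R = np.kron(Rc, Rr)          # n x n with n = r*c

The Kronecker form is what makes separability computationally convenient: only the small $r \times r$ and $c \times c$ matrices ever need to be factorised.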

11.2.4 Proposed analysis

The first step in our approach to analysis is to fit the model in equation (11.1.1) with an AR1×AR1 process for the residuals together with all random effects as necessary to reflect the randomisation employed in the experimental design. These randomisation based effects are maintained in the model throughout the modelling process, irrespective of their level of significance, in order to preserve the covariance structure associated with the experimental design. Note that if the number of rows or columns in the trial is small (less than 4, say) then we may choose to assume independence for this dimension rather than fit an AR1 process. Such models are denoted ID×AR1 and AR1×ID (for independence in the column and row dimensions respectively). This base-line model is then examined for adequacy. Here we focus on issues associated with the assumed spatial model, namely the existence of non-stationary global trend, extraneous variation and measurement error. We also consider outlier detection. Issues concerning non-normality of the data and dependence of variance on the mean are not considered here. We note that these problems do not usually occur with respect to the most commonly analysed trait from variety trials, namely grain yield.

Outlier detection

Outlier detection in linear mixed models is a difficult problem. Haslett (1999) and Haslett and Hayes (1998) consider outlier detection in correlated data and suggest using diagnostics based on so-called conditional residuals. In a similar approach, Gogel (1997) extended the alternative outlier model of Thompson (1985) for linear mixed models. Neither of these approaches has been implemented in statistical software so routine use is difficult. We adopt a less formal approach centred on graphical displays. A key graphic, hereafter called a 'residual plot', is of estimated residuals against row (column) number for each column (row). This enables identification of unusual data points relative to their spatial neighbours. We have also found the QQ-plot (Wilk and Gnanadesikan, 1968) to be an informative diagnostic for this purpose. Erroneous data tend to lie well away from the ends of the 'extrapolated' straight line. The outlier detection process is iterative and consultative. At each stage, possible outliers should be identified, then advice sought as to their likely cause and an appropriate remedy.

Global trend and extraneous variation

The existence of non-stationary global trend and extraneous variation may be evident from examination of the residual plots. For example, non-stationary global trend in the row direction will be reflected in the residual plot as smooth trend (linear or non-linear) over row number for each column. Extraneous variation may be more difficult to detect in residual plots. We therefore use an additional diagnostic tool, namely the three-dimensional graph of the sample variogram of the estimated residuals. Note that we do not usually adjust this variogram for the bias induced by using estimates of the residuals (see Chapter 9) since we propose to use it purely as an informal diagnostic tool. The bias does not have an adverse effect on the visual interpretation of the sample variogram.

As a reference point we consider the nature of the theoretical variogram for the assumed covariance model, that is, the separable AR1×AR1 model. This is given by
$$V(h) = \sigma^2_H \left(1 - \alpha_r^{|h^*_{ijr}|}\, \alpha_c^{|h^*_{ijc}|}\right)$$
This increases monotonically in both the row and column directions as the separation between plots increases (and thus correlation decreases). It reaches a plateau that is given by the variance $\sigma^2_H$. The greater the autoregressive correlation coefficients, the slower the rise to the plateau. Figure 11.1 shows a three-dimensional graph of the theoretical variogram for a separable AR1×AR1 model with $\alpha_r = 0.8$, $\alpha_c = 0.2$ and $\sigma^2_H = 1$.

Figure 11.1 Three-dimensional plot of the theoretical variogram for an AR1×AR1 process with $\alpha_r = 0.8$, $\alpha_c = 0.2$ and $\sigma^2_H = 1$.
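The surface in Figure 11.1 can be tabulated directly from the formula above; a minimal sketch (Python/numpy is our assumption), using the same parameter values as the figure:

import numpy as np

def ar1xar1_variogram(h_r, h_c, alpha_r, alpha_c, sigma2=1.0):
    """Theoretical variogram of a separable AR1xAR1 process at
    row/column displacements h_r, h_c (in plot units)."""
    return sigma2 * (1.0 - alpha_r ** np.abs(h_r) * alpha_c ** np.abs(h_c))

hr = np.arange(0, 26)          # row displacements 0..25
hc = np.arange(0, 8)           # column displacements 0..7
V = ar1xar1_variogram(hr[:, None], hc[None, :], alpha_r=0.8, alpha_c=0.2)
# V rises monotonically towards the plateau sigma2 = 1 in both directions,
# more slowly in the row direction because alpha_r is the larger correlation.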

The existence of non-stationary global trend is reflected in a sample variogram that fails to reach a plateau in the row (column) direction. Historically, non-stationary trend of this type was corrected by differencing the data (see Gleeson and Cullis, 1987, for example) but this complicates the analysis unnecessarily. We adopt the approach of Gilmour et al. (1997), namely we fit polynomial functions or smoothing splines (see Chapter ??) to the row (column) coordinates of the plots. Thus non-stationary global trend is explicitly modelled. In terms of the model in (11.1.1), polynomial regression coefficients are included as effects in $\tau$.

Extraneous variation may be revealed in the sample variogram in a number of ways. For example, a sample variogram with a sawtooth appearance indicates the presence of cyclic row (column) effects. Since this is a systematic effect, it can be accommodated in the model by fitting a fixed factor with number of levels corresponding to the length of the cycle. There may also be non-systematic variation associated with rows and columns. In general we choose to accommodate this by fitting random row (column) effects in the model. The sample variogram can be used to diagnose the existence of such variation. If there are row effects, then the sample variogram ordinates will be lower at zero row displacement compared with other row displacements, and similarly for column effects. In terms of the model in (11.1.1), random row (column) effects are included as sub-vectors in $u_o$ and have scaled identity variance matrices.

Measurement error

A measurement error term or so-called nugget effect in the context of spatial models constitutes lack of fit about the smooth spatial trend. The inclusion of such an effect has been proposed by several authors including Wilkinson et al. (1983) in their 'smooth trend plus independent error' model for field experiments. A nugget effect may be included in the model of equation (11.1.1) as in equation (9.3.5), that is, by including an $n \times 1$ sub-vector $u_\eta$ in $u_o$ with a design matrix of $I_n$ and variance matrix $\sigma^2_H \gamma_\eta I_n$, say. The need for measurement error may be revealed in the sample variogram. To see this we first consider the theoretical variogram for an AR1×AR1 process together with a nugget effect, that is, the variogram for $e + u_\eta$. This is given by

$$V(h) = \begin{cases} \sigma^2_H \left(\gamma_\eta + 1 - \alpha_r^{|h^*_{ijr}|}\, \alpha_c^{|h^*_{ijc}|}\right) & h^*_{ij} \neq 0 \\ 0 & h^*_{ij} = 0 \end{cases}$$

Measurement error introduces a jump discontinuity in the variogram at zero separation. The need for measurement error in the model may be diagnosed using the sample variogram although this is often difficult. It may be helpful to consider the two 'faces' of the variogram, that is, the slices corresponding to zero displacement in the column/row direction, and superimpose the corresponding fitted values (for example, $\hat\sigma^2_H (1 - \hat\alpha_r^{|h^*_{ijr}|})$ for the AR1 process in the row direction). If it is unreasonable to constrain the intercept to be zero (which is implicit in the omission of measurement error) then the fitted variogram will rise too sharply to a plateau (reflecting the fact that there will have been a downward bias in the estimation of the autoregressive correlations).

Although the inclusion of measurement error may be desirable it is not always computationally possible. Experience has shown that there are estimation problems when the autoregressive correlations are small (less than about 0.3) or the experiment consists of only a few rows and columns. Zimmerman and Harville (1991) also note the additional computational burden with the inclusion of a nugget effect.
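To visualise the jump discontinuity, one face of the variogram (zero column displacement) can be evaluated with and without the nugget; a sketch (Python/numpy is our assumption) with values loosely in the spirit of model M3 of Table 11.1 ($\gamma_\eta \approx 0.031/0.115$, $\alpha_r = 0.87$):

import numpy as np

def ar1_face_variogram(h, alpha, gamma_eta=0.0, sigma2=1.0):
    """Variogram of an AR1 row process plus nugget at row displacements h;
    it is 0 at zero separation and jumps by sigma2*gamma_eta just off
    the origin when a nugget is present."""
    h = np.asarray(h, dtype=float)
    v = sigma2 * (gamma_eta + 1.0 - alpha ** np.abs(h))
    return np.where(h == 0.0, 0.0, v)

h = np.arange(0, 31)
v_no_nugget = ar1_face_variogram(h, alpha=0.87)
v_nugget = ar1_face_variogram(h, alpha=0.87, gamma_eta=0.031 / 0.115)
# v_nugget[1] - v_no_nugget[1] equals the nugget jump sigma2*gamma_eta.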

Table 11.1 Spatial example: estimated variance parameters for all models fitted.

Model term                 Parameter   M0       M1       M2       M3
genotype                   σ²_g        0.015    0.030    0.029    0.015
block                      σ²_b        0.087    0.046    0.051    0.031
column                     σ²_c                          0.013    0.009
nugget                     σ²_η                                   0.031
error                      σ²_H        0.147    0.139    0.124    0.115
                           α_c                  0.223    0.243    0.405
                           α_r                  0.783    0.749    0.870
residual log-likelihood                504.80   866.88   870.41   891.13

11.2.5 Example: early generation wheat variety trial

We consider an early generation wheat variety trial grown as part of the New South Wales Department of Primary Industries (NSWDPI) wheat breeding program based at the Wagga Wagga Agricultural Institute (data kindly supplied by Dr. Peter Martin). This trial consisted of a total of 1005 lines (1001 test lines and 4 standard varieties). A p-rep design was used, with 189 of the test lines replicated and the remaining 812 planted in single plots (so p = 18.8%). In addition there were multiple plots of the standard varieties (14 plots each for three of the standards and 16 plots for the fourth). Thus there was a total of 1248 plots, arranged in the field in a 104 row by 12 column array. The plot dimensions were 6 m by 1 m so that the full trial occupied 72 m by 104 m. The replicated lines were randomised so that a single replicate appeared in the block comprising rows 1–52 and the other replicate in rows 53–104. Subject to this constraint, the design then involved optimisation of the design criterion, namely $A_{tt}$, in terms of a model with random row and column effects ($\gamma = 0.1$ in each case) and an AR1×AR1 spatial model with $\alpha_r = 0.6$ and $\alpha_c = 0.4$. The design was generated using the DiGGer program (Coombes, 2002). The reader is referred to Cullis et al. (2005) for details of the design search algorithm. The data collected for each plot comprised grain yield (t/ha). The mean yield for the trial was 1.72 t/ha.

We commence with the fitting of the mixed model with terms reflecting the randomisation employed in the design and an AR1×AR1 process for the residuals. In terms of the former the only requirement for this experiment is to include a two level block factor corresponding to the resolvable blocks (so that block 1 = rows 1 to 52 and block 2 = rows 53 to 104). The block effects are included as a sub-vector of $u_o$ with associated variance component $\sigma^2_b = \sigma^2_H \gamma_b$, say. The remaining design strategies were model based, that is, a covariance model was assumed and an optimum design sought for that model.


Figure 11.2 Spatial example: plot of estimated residuals from model M1 against row number for each column.

The estimated variance parameters from this initial model are given in Table 11.1 (as model M1). In order to show the significance of the spatial correlations we also fitted a model with random block effects but independent errors (model M0 in Table 11.1). The REML-LR test statistic for the comparison of models M0 and M1 was 724.16. This can be compared with a critical value of $\chi^2_{0.95}(2) = 5.99$. Thus the correlation model is very significant. The plot of estimated residuals from model M1 against row number for each column is given in Figure 11.2. This graph reveals that there are no extreme outliers but there is evidence of column effects, with the residuals for columns 10 and 4 being relatively low whilst those for columns 11 and 8 are relatively high. The existence of column effects is also clearly seen in the three-dimensional graph of the sample variogram (Figure 11.3 (a)), with variogram ordinates at zero column displacement being lower than at other displacements. Note that for clarity of visual presentation we have restricted the graph of the variogram to row displacements of 30 or less. Thus we add random column effects to the model as a sub-vector in $u_o$ with associated variance component $\sigma^2_c$. The resultant variance parameter estimates are given in Table 11.1 (model M2). The variance component for column effects is very significant, with a REML-LR statistic of 7.06 (p = 0.004 using the Stram and Lee, 1994, adjustment). The variogram from this model (Figure 11.3 (b)) now has a form that is very similar to the theoretical variogram for an AR1×AR1 process.

Finally we add a measurement error term to the model as a sub-vector in $u_o$ with variance component $\sigma^2_\eta$ (see model M3 in Table 11.1 for parameter estimates). The associated variance component is very significant (REML-LR statistic of 41.44, p < .001). The need for the nugget effect can also be seen by considering the graph of the sample variogram corresponding to zero column displacement both before and after inclusion of measurement error. These are shown, together with the associated fitted variograms, in Figure 11.4 (a) and (b). The agreement between sample and fitted variograms appears to be superior for the model including measurement error.

The addition of a nugget effect to the model has had a large impact on the results. In particular, there has been a substantial reduction in the genetic variance (see Table 11.1). The implication is that, without the nugget effect, error variation was incorrectly being assigned to genetic variation. There is some biological justification for inclusion of the nugget effect. The wheat breeder commented that immediately prior to sowing this trial heavy rain resulted in pools of water across the trial. Later in the season when the trial was subjected to drought conditions, plant growth was less affected where the pools had lain. Variation potentially attributable to this soil moisture effect was very small scale and patchy, varying both at a within and between plot level.

From a statistical perspective we are led to consider the reliability of estimation of measurement error in the context of p-rep designs, an issue that has not been examined elsewhere.
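The REML likelihood-ratio comparisons above are simple to reproduce from the log-likelihoods in Table 11.1; a sketch (Python with scipy is our assumption) including the one-sided adjustment of Stram and Lee (1994) for a variance component tested on its boundary:

from scipy.stats import chi2

def reml_lrt(ll0, ll1, df, on_boundary=False):
    """REML likelihood-ratio test. For a single variance component
    tested on the boundary, use the 0.5*chi2(0) + 0.5*chi2(df)
    mixture of Stram and Lee (1994)."""
    D = 2.0 * (ll1 - ll0)
    p = chi2.sf(D, df)
    if on_boundary:
        p = 0.5 * p
    return D, p

# M0 vs M1: two correlation parameters, interior of the parameter space
print(reml_lrt(504.80, 866.88, df=2))                     # D = 724.16
# M1 vs M2: one variance component tested on the boundary
print(reml_lrt(866.88, 870.41, df=1, on_boundary=True))   # D = 7.06, p ~ 0.004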


Figure 11.3 Spatial example: three-dimensional plot of sample variogram for (a) model M1 and (b) model M2 (that is, after addition of random column effects).


Figure 11.4 Spatial example: plot of sample and theoretical variogram at zero column displacement for (a) model M2 and (b) model M3 (that is, after addition of nugget effect).

Table 11.2 Estimation of nugget effect for p-rep design. Empirical mean squared errors of prediction (×100) for two data models and two analysis models. Results are means of 400 simulations.

                     Data generation model
Analysis model       no nugget    nugget
no nugget            1.628        1.573
nugget               1.632        1.163

In order to assess this we conducted a simulation study based on the factorial combination of two data models (with and without nugget effect) by two analysis models (with and without nugget effect). The models and variance parameters for data generation were models M2 (no nugget model) and M3 (nugget model) from Table 11.1. The data were generated according to the design used in the example. Each generated data set was analysed using two models, namely those corresponding to models M2 and M3. For each analysis, predictions of genotype effects ($\tilde{u}_{g_i}$, $i = 1 \ldots 1005$) were calculated together with the empirical mean squared error of prediction (MSEP), $\sum_{i=1}^{1005} (\tilde{u}_{g_i} - u_{g_i})^2 / 1005$. The results are presented in Table 11.2.

When data were generated without a nugget effect there was no penalty in fitting the nugget effect, with the empirical MSEP being very similar. We note that for these data the estimate of the nugget variance was on the boundary (that is, estimated as zero) for 203 out of the 400 simulations. Thus there was no evidence of 'phantom' nugget variance being estimated. When data were generated with a nugget effect, however, there was a substantial loss in efficiency by not fitting the effect (the ratio of MSEP being 1.35). In terms of the estimation of variance components there was little bias in the estimates when obtained using a model for analysis that matched the data generation model.

This simulation study showed that measurement error can be reliably estimated for a p-rep design and that we can therefore accept model M3 for the example as being the best model of those fitted. It is also important to note that the asymptotic prediction error variances for the fitted models were very similar to the empirical values. The asymptotic MSEP for the genetic effects for a model can be calculated based on the approximation used in Section 11.1. Thus we calculate MSEP as $\sigma^2_H A_{gg}/2$ where $\sigma^2_H A_{gg}$ is the average pairwise prediction error variance of all genotype effects (rather than just the test genotype effects). The asymptotic MSEP (×100) for model M2 was 1.614, which compares favourably with the empirical value of 1.628 in Table 11.2, and the asymptotic MSEP (×100) for model M3 was 1.139, which is very similar to the empirical value of 1.163. The agreement between the asymptotic and empirical MSEP is reassuring as it provides confidence in inference based on asymptotic results and suggests that recourse to Monte Carlo methods may be unnecessary.
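The empirical MSEP itself is a one-line calculation; the following sketch (Python/numpy is our assumption; the true effects and E-BLUPs are stand-ins, since reproducing the full data generation and REML fits is beyond a few lines) shows the form of the computation:

import numpy as np

rng = np.random.default_rng(7)

g = 1005
u_true = rng.normal(0.0, np.sqrt(0.015), size=g)        # "true" genotype effects
u_blup = 0.8 * u_true + rng.normal(0.0, 0.05, size=g)   # stand-in for E-BLUPs

# Empirical MSEP: mean over genotypes of (u~ - u)^2
msep_empirical = np.mean((u_blup - u_true) ** 2)

# The asymptotic counterpart is sigma2_H * A_gg / 2, where sigma2_H * A_gg
# is the average pairwise prediction error variance of all genotype effects,
# obtained from the C-inverse matrix of the fitted model.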


Figure 11.5 Spatial example: plot of histogram and approximate density function for test line BLUPs. Cut-off for selection of top 20% of lines is indicated.

A histogram of the test line BLUPs is shown in Figure 11.5. In terms of selection we consider the top 20% of test line BLUPs (the cut-off point for which is marked as the vertical line in Figure 11.5). The associated EGG (calculated using equation (11.1.5)) is 0.083 t/ha. This value is quite low, as is the estimate of line mean heritability with $h^2_g = 0.23$.
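The reported EGG can be recovered (approximately) from the quantities in the text; a sketch (Python with scipy is our assumption) using $\sigma^2_g = 0.015$ from model M3 and the generalised heritability $h^2_g = 0.23$, noting from (11.1.5) that $\mathrm{EGG} = i\,\sigma_g h_g = i\sqrt{\sigma^2_g h^2_g}$:

from math import sqrt
from scipy.stats import norm

m = 0.20                              # select the top 20% of test lines
i = norm.pdf(norm.ppf(1 - m)) / m     # selection intensity, approx. 1.40

sigma2_g = 0.015                      # genetic variance (model M3, Table 11.1)
h2_g = 0.23                           # generalised mean line heritability
egg = i * sqrt(sigma2_g * h2_g)       # = i * sigma_g * h_g from (11.1.5)
print(round(egg, 3))                  # ~0.082, close to the reported 0.083 t/ha

The small discrepancy arises because the infinite-sample selection intensity is used here, whereas (11.1.5) defines $i$ via the order statistics of a sample of size $t$.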

11.3 Analysis of multi-phase trials for quality traits

Many quality traits are obtained from multi-phase experiments in which lines are grown in the field then the resultant grain samples processed in the laboratory. In terms of varietal selection, quality traits are as important as grain yield. However, in contrast to grain yield data, the literature on methods of design and analysis for quality trait data from multi-phase trials is scarce. In a general setting, key references for multi-phase experiments are McIntyre (1955), Wood et al. (1988) and Brien (1983). We propose an approach that is analogous to our approach for field trials in the sense that we combine randomisation and model based techniques. The approach is based on that of Smith et al. (2001b) and Cullis et al. (2003).

The motivation for our approach has been in terms of the analysis of two key quality traits, namely the milling yield of wheat and the malting quality of barley. It is therefore useful to provide some background information concerning the laboratory processes involved in obtaining these data. The milling of grain samples from a field trial is conducted as a sequential process that almost always requires more than a single day of processing. Each grain sample is milled, that is, ground until all bran has been removed and the remaining endosperm has been reduced to flour. The resultant flour sample is weighed to provide either a milling yield (in absolute units) or a trait known as flour extraction (weight of flour expressed as a percentage of the weight of grain).

The measurement of barley malting quality is a complex process. Grain samples from field plots are initially malted (a controlled germination and kilning process) in a machine known as a micro-malter. Samples are arranged in the micro-malter in a two-dimensional (row by column) array. Depending on the capacity of the machine, the malting of a field trial may require several 'runs' of the micro-malter. After completion of the malting phase samples are further processed in order to obtain the traits of interest. There is rarely any re-randomisation or additional replication at this stage so that, for the purposes of statistical analysis, we can regard this as a two-phase experiment (with a field and a malting phase).

The acquisition of data for both these traits involves two-phase experiments. There are natural potential sources of variation (between days for milling and between runs for malting) and sources of correlation (temporal correlation within days for milling and spatial correlation within the micro-malter for malting). Our approach to analysis aims to accommodate all such sources of variation. As in Section 11.2 we begin with a discussion of randomisation and model based analyses, then describe our method in detail. We illustrate the approach using milling yield data from an Australian early generation wheat variety trial.

11.3.1 Randomisation based analysis

The randomisation based analysis requires determination of all sources of variation associated with the experimental design. In single phase experiments this is usually straightforward but it may be more difficult in the context of multi-phase experiments. Brien (1983) provides some helpful guidelines with his concept of 'tiers'. The application of this technique for quality trait data is well explained in Cullis et al. (2003) so we do not pursue this further here. The basic principle, however, is to include terms in the model that capture the randomisation processes used in each phase of the experiment. In a two-phase quality experiment, for example, there are two randomisations, namely the randomisation of genotypes to field plots then the randomisation of field plots to 'positions' in the laboratory process. Thus the effects for blocking factors associated with each of these randomisations must be included in the analysis. The most crucial feature of the analysis of multi-phase data is the need to include an 'error' term for each phase of the experiment. In terms of the model in equation (11.1.1), error effects for all phases other than the final phase and effects for blocking factors for all phases are included as sub-vectors in $u_o$. It is instructive to re-write the model in (11.1.1) and explicitly include these error effects rather than leaving them embedded within $u_o$. For simplicity we restrict attention to two-phase experiments but the extension to more phases is straight-forward. Thus we write the model for a two-phase quality trait experiment as

$$y = X\tau + Z_g u_g + Z_p u_p + Z_o u_o + e \qquad (11.3.8)$$
where the vectors $y$, $\tau$, $u_g$ and $e$ are as defined for (11.1.1) and $u_p$ is the $n_p \times 1$ vector of random (residual) field plot effects (where $n_p$ is the number of plots tested) with associated $n \times n_p$ design matrix $Z_p$. The vector $u_o$ now represents any remaining non-genetic effects, that is, other than the plot effects. Before we discuss model formulae in detail it is instructive to consider a simple balanced example that can be analysed using ANOVA.

Simple hypothetical two-phase experiment for milling yield

We consider a hypothetical milling yield trial. We assume that $r$ field replicates of $g$ genotypes are grown in a field trial that is designed as an RCB. Grain samples from each of the $rg$ field plots are split into $d$ smaller samples to be used as replicates in the laboratory process. Thus there is a total of $n = rgd$ samples to be milled. We assume here that $rg$ samples can be processed each day so that the full trial requires $d$ days. Field plots are randomised to times in the milling process using an RCB design with days as blocks. A single sample from each of the $rg$ field plots is processed each day and the plots are allocated

completely at random within a day. The data measured for each sample is the flour extraction.

Table 11.3 ANOVA table for hypothetical milling yield example.

Strata/Decomposition      df                 E(mean square)              Model term
mean                      1
mrep                      d − 1              rgσ²₁ + σ²_H                mrep
mrep.order                d(rg − 1)
  frep                    r − 1              dgσ²₂ + dσ²₃ + σ²_H         frep
  frep.plot               r(g − 1)
    genotype              g − 1                                          genotype
    residual              (r − 1)(g − 1)     dσ²₃ + σ²_H                 frep.plot
  residual                (d − 1)(rg − 1)    σ²_H                        units
total                     rgd − 1

In order to develop the analysis within an ANOVA framework we follow the usual practice of assuming that block effects are random and treatment effects are fixed. Thus we assume here that genotype effects are fixed. Accordingly, and based on the randomisation processes described above, the symbolic model formula for the data can be written as
y ∼ genotype + mrep + frep + frep.plot + mrep.order        (11.3.9)
where genotype is a factor with g levels, mrep is a factor (for replicates in the milling process) with d levels, frep is a factor (for field replicates) with r levels, plot is a factor (for plots within field replicates) with g levels and order is a factor (indexing the order of processing of samples within days) with rg levels. Thus the final term in (11.3.9) is the residual term that is also represented generically as units. We denote the variance components for the random effects by $\sigma^2_i$, $i = 1 \ldots 3$, for mrep, frep and frep.plot respectively. In terms of the algebraic form for the mixed model in (11.3.8) the effects associated with frep.plot correspond to $u_p$ and the effects for mrep and frep are included in $u_o$.

We can apply the techniques of Nelder (1965a) and Nelder (1965b), as described in Chapter 1, to these data in order to derive the ANOVA table. Key things to note are the existence of five strata (corresponding to the mean, laboratory replicates, field replicates, plots within field replicates and samples within laboratory replicates). Also it can be shown that the design is orthogonal, that is, the genotype effects are estimated in a single stratum, and this stratum corresponds to plots within field replicates. Thus the skeletal ANOVA table is as given in Table 11.3. The model terms from (11.3.9) that correspond to the sources of variation in the ANOVA table are given in the final column.

Following the approach in Chapter 1 it can also be shown that the variance matrix of the estimated (fixed) genotype means (as deviations from the overall mean) is given by
$$\frac{d\sigma^2_3 + \sigma^2_H}{rd}\, (I_g - A_g)$$
and thence the average pairwise variance is $2(d\sigma^2_3 + \sigma^2_H)/rd$. Thus it is clear that inclusion of the 'error' term for field plots, namely frep.plot, is crucial, otherwise the variance matrix (and thence inference) for estimated genotype effects will be incorrect (unless $\sigma^2_3 = 0$).

When we change from fixed to random genotype effects the orthogonality properties of the design remain, with all of the information on genotype effects being contained in the plots within field replicates stratum. Thus, analogous to the fixed effects setting, a field plot error term should be included in the model, otherwise prediction error variances will be incorrect. More important, however, is that the genotype predictions (BLUPs) themselves will be affected because the associated shrinkage will be incorrect.

This example is atypical of trials for measuring quality trait data in that we rarely have balanced data and orthogonal designs.
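The expected mean squares in Table 11.3 lead directly to method-of-moments estimators of the variance components, obtained by equating observed mean squares to their expectations; a sketch (Python is our assumption; the dimensions and mean squares are invented placeholders):

# Hypothetical dimensions and observed mean squares (placeholders)
r, g, d = 2, 50, 3
ms_mrep      = 12.0   # stratum: mrep
ms_frep      = 9.0    # stratum: frep
ms_fplot_res = 4.0    # residual within the frep.plot stratum
ms_resid     = 1.5    # bottom stratum (units)

# Equate mean squares to their expectations from Table 11.3
sigma2_H = ms_resid                           # E(MS) = sigma_H^2
sigma2_3 = (ms_fplot_res - sigma2_H) / d      # E(MS) = d*sigma_3^2 + sigma_H^2
sigma2_1 = (ms_mrep - sigma2_H) / (r * g)     # E(MS) = rg*sigma_1^2 + sigma_H^2
sigma2_2 = (ms_frep - d * sigma2_3 - sigma2_H) / (d * g)

# Average pairwise variance of genotype means: 2*(d*sigma_3^2 + sigma_H^2)/(rd)
avg_pair_var = 2 * (d * sigma2_3 + sigma2_H) / (r * d)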
The most common scenario is quite complex, with only a subset of genotypes from the field trial being quality tested and partial rather than complete replication in terms of both the field and laboratory phases. We consider an example of this type in Section 11.3.4. Thus in general we cannot use ANOVA but must conduct a mixed model analysis. Information on genotype effects will be contained in more than one stratum, but the majority is still likely to be associated with the field plot stratum. In terms of the mixed model we must therefore ensure that random plot effects (that is, $u_p$) are included.

11.3.2 Model based analysis

As with the analysis of field trials, we can model trend in multi-phase trials in order to improve the accuracy and efficiency of genotype contrasts. In the case of field trials we model spatial trend in the errors. In multi-phase quality trials the potential exists to model trend (spatial or temporal) associated with the errors for any of the phases. The type of trend modelling depends on the trait and/or measurement process. Since our experience has largely been in terms of the analysis of milling yield in wheat and malting quality in barley we discuss modelling in the context of these data but the concepts generalise to other traits.

Covariance models for local trend

As previously discussed, the milling of multiple samples from a field trial involves a sequential process that usually requires more than a single day. There is potential for temporal correlation linked to the order of processing samples within a day. If we assume a rectangular trial layout of $d$ days and $s$ samples per day, making a total of $n = ds$ samples, and order the data sequentially within days, then the temporal correlation leads to a variance matrix for the residuals of the form

$$\mathrm{var}(e) = \sigma^2_H \left( I_d \otimes R_o(\phi_o) \right)$$
where $R_o$ is the $s \times s$ correlation matrix for sample order within days. As in the spatial modelling of field trials, a range of covariance models is possible. We have found that an autoregressive process of order 1 provides a plausible model for $R_o$. We therefore denote the full correlation model for $e$ by ID×AR1.

In terms of the measurement of malting quality, we recall that grain samples are tested in a micro-malter machine. Due to the arrangement of samples in the micro-malter there is potential for spatial variation. We let $r_m$ and $r_c$ denote the numbers of rows and columns in the micro-malter. For simplicity we initially assume that all samples can be processed in a single 'run' of the micro-malter so that $n = r_m r_c$. If the data are ordered as micro-malter rows within columns then we have

$$\mathrm{var}(e) = \sigma^2_H \left( R_{mc}(\phi_{mc}) \otimes R_{mr}(\phi_{mr}) \right)$$
where $R_{mr}$ and $R_{mc}$ are the correlation matrices for the micro-malter row and column dimensions. Once again we have found the AR1×AR1 model to be reasonable. If the data comprise several runs of the micro-malter then we assume independence of the errors between runs and often constrain the autocorrelation parameters to be the same for all runs.

In addition to modelling at the (laboratory) residual level there is potential for modelling at the field plot residual level. We proceed here as for the spatial analysis of a field trial. Thus we may consider an AR1×AR1 process for $u_p$. Of course it is important to recognise that the modelling of correlation for a random term in the mixed model will only be possible if that term accounts for a substantial amount of the total variation in the data.
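For the milling setting, the ID×AR1 residual structure is just a block-diagonal Kronecker product; a sketch (Python/numpy is our assumption) using the layout of the example below (20 days of 28 samples, design value $\alpha_o = 0.5$):

import numpy as np

def ar1_corr(dim, alpha):
    """AR1 correlation matrix: alpha^{|i-j|}."""
    idx = np.arange(dim)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

d, s, alpha_o = 20, 28, 0.5    # days, samples per day, order autocorrelation
R = np.kron(np.eye(d), ar1_corr(s, alpha_o))  # var(e) = sigma2_H * R
# Days are independent (identity blocks); within a day, correlation decays
# geometrically with separation in milling order.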

11.3.3 Proposed analysis

The first step in our approach to analysis is to fit the model in equation (11.3.8) with all random effects as necessary to reflect the randomisation employed in the experimental design. Initially we assume independence for all error effects, that is, for $u_p$ and $e$. As in our spatial approach the randomisation based effects are maintained in the model throughout the modelling process, irrespective of their level of significance, in order to preserve the experimental strata. We examine the estimated residuals from the randomisation based model in order to detect outliers, global trend and extraneous variation. We must consider these issues in relation to both phases. We then consider correlation models for $e$ (the laboratory residuals) and check the adequacy using previously described diagnostic techniques. Finally we investigate correlation models for $u_p$ (the field plot residuals), provided that the associated source of variation accounts for a substantial proportion of the total variation in the data.

11.3.4 Example: early generation wheat milling yield trial

We consider the milling of genotypes grown in the field trial described in Section 11.2.5. Recall that the field trial comprised 1005 lines (1001 test lines and 4 standard varieties). The wheat breeder wished to mill only a subset of the test lines (some lines having already been discarded on the basis of other traits, including grain yield). A total of 416 lines was milled, comprising all standard varieties and 412 test lines. The milling trial was designed as a so-called p/q-rep design (Smith et al., 2005b) in which field replicates of 52 of the test lines were milled (so that p = 12.6%) together with single plots of the remaining 360 test lines. Additionally 9 plots of each of three of the standard varieties and 2 plots of the other standard were milled, making a total of $n_p$ = 493 plots. Note that this represents only 40% of the original field trial. The design was generated using the DiGGer program (Coombes, 2002).

In the laboratory 67 of the plots (that is, q = 13.6%) were replicated, making a total of n = 560 samples. These were milled as 28 samples per day so that the full trial required 20 days. The plots replicated in the laboratory were randomised so that a single replicate appeared in the block comprising days 1–10 and the other replicate in days 11–20. The randomisation was restricted with respect to an additional blocking factor, namely days within replicates. Subject to these constraints, the design then involved optimisation with respect to a model with an ID×AR1 process for the residuals (with an autocorrelation for sample order of $\alpha_o = 0.5$). The data collected for each sample comprised the flour extraction (weight of flour expressed as a percentage of the weight of the grain sample). The mean flour extraction for the trial was 63.5% (but see below).
A complication with these data is that the genotypes fall into two populations depending on the presence of genes that control the hardness of the genotype. The standard practice within the NSW wheat breeding program is to divide the lines into two groups, namely 'hard' and 'soft' genotypes. In terms of flour extraction the two groups usually differ with respect to their means (with soft lines yielding less) and their genetic and residual variances (with larger variances for the soft group in both cases). The two types of genotypes have different end uses. For simplicity we focus here on the hard genotypes, which are amenable to bread making. These usually constitute the majority of lines (in this example 311, that is, 75% of the test lines being milled, are hard and all standard varieties are hard). The mean flour extraction for the hard lines in this trial was 64.8% (and for the soft lines it was 58.7%).

In order to restrict attention to the hard lines and avoid the need to fit a complex model with heterogeneous variances for hard and soft genotypes we include a fixed effect for each sample corresponding to a soft genotype. In this way all sources of variation in the data relate to hard genotypes alone. Note that we could have achieved the same result either by dropping observations from the data-set or by changing the real data for the soft genotypes to missing values. Both of these approaches have disadvantages compared with the approach we adopted. If the former approach is used we can no longer use a separable variance structure for the residuals so lose the associated computational benefits. With the latter we must alter the data themselves, which may be undesirable. Our approach allows us to deal with the problem via model specification, and we avoid any loss of efficiency induced by fitting a large number of fixed effects by utilising the sparse matrix methods of the software, namely ASReml (?) and samm (?).

We can write the model formula for the randomisation based model in symbolic notation as
fe ∼ sunits + genotype + mblock + mblock.daywb + fblock + plot + units
where genotype is a factor with 416 levels, mblock is a factor (corresponding to resolvable blocks in the laboratory) with 2 levels, daywb is a factor with 10 levels (corresponding to days within resolvable blocks), fblock is a factor (corresponding to resolvable blocks in the field) with 2 levels and plot is a factor (indexing field plots) with 493 levels. The factor sunits is included to 'eliminate' the data for the soft genotypes. It has 122 levels, where the first 121 index the samples corresponding to soft genotypes and the last level is assigned to all samples corresponding to hard genotypes. We prefer to work with a different (but equivalent) specification of the model, namely
fe ∼ sunits + genotype + mblock + day + fblock + column.row + units        (11.3.10)
where day is a factor with 20 levels (indexing all days) so that mblock.daywb ≡ day, and column and row are factors indexing field columns (12 levels) and rows (104 levels) so that plot ≡ column.row. This last term represents the residual field plot error.

We let the variance components for the non-genetic random effects in (11.3.10) be denoted by $\sigma^2_{mb}$, $\sigma^2_d$, $\sigma^2_{fb}$ and $\sigma^2_p$ for milling blocks, days, field blocks and plots, respectively. Estimates of these components are given in Table 11.4 as model M1. The genetic variance for this trait is large. In terms of non-genetic variation, the majority (76%) stems from the laboratory. The largest non-genetic component is associated with milling days (within blocks). The graph of the BLUPs of day effects (Figure 11.6) reveals that the last 4 days of the trial gave rise to very low flour extractions. This is consistent with information from the laboratory technician, who commented that the last 4 days of milling occurred after a wet weekend, causing the sample moistures to rise so that extractions for all samples on these days may well be lower than expected.

The plot of estimated residuals against milling order for each day (Figure 11.7) reveals the existence of several potential outliers. However, the laboratory technician did not find these to be erroneous so no changes were made to the data. An obvious feature of Figure 11.7 is the consistent (declining) linear trend in flour extraction over the course of each day. Such a phenomenon has been reported elsewhere for the type of mill used here (see Smith et al., 2001b) and is partly due to a warming of the mill over the day. There are controls in place that attempt to maintain mill temperature but the effect observed here still remains.
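The 'sunits' device described above is easy to construct; a sketch (Python with pandas is our assumption; the data frame is a stand-in for the real trial data) in which each soft-genotype sample gets its own factor level and all hard samples share one level:

import numpy as np
import pandas as pd

# Stand-in for the real data: 121 soft samples and 439 hard samples (n = 560)
df = pd.DataFrame({"hardness": ["soft"] * 121 + ["hard"] * 439})

soft = df["hardness"].eq("soft")
df["sunits"] = np.where(soft, np.arange(1, len(df) + 1), 0)  # 0 = common hard level
df["sunits"] = df["sunits"].astype("category")               # 122 levels in all

# Fitting a fixed effect for each sunits level absorbs every soft observation,
# so the remaining variance parameters relate to hard genotypes only, while
# the full 560-sample layout (and hence separability) is preserved.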

Table 11.4 Multi-phase example: estimated variance parameters and key fixed effects for all models fitted.

Model term                 Parameter   M1        M2        M3
genotype                   σ²_g        3.886     4.037     4.163
mblock                     σ²_mb       0.853     0.795     0.834
day                        σ²_d        1.555     1.715     1.477
fblock                     σ²_fb       0.087     0.106     0.140
column.row                 σ²_p        0.904     0.582     0.520
error                      σ²_H        0.815     0.562     0.765
                           α_o                             0.803
lin(order)                 τ_o                   −0.075    −0.082
residual log-likelihood                          −535.69   −526.35

We therefore added a term to the model for the linear regression of flour extraction on order within each day. Thus the model included a fixed effect, $\tau_o$ say, corresponding to the slope of this regression. The estimate of the slope is given in Table 11.4 (for model M2). The regression was very significant (p < .001). Remaining (stationary) trend over order within days was modelled using an ID×AR1 correlation model for the residuals. This was very significant (REML-LR test statistic of 18.69, p < .001). The autocorrelation parameter $\alpha_o$ was estimated as 0.802 (see model M3 in Table 11.4). We added a nugget effect to the model but the estimate of the associated variance component was on the boundary (that is, zero).

Having investigated the laboratory phase we now consider the field phase. First we construct a residual plot for field errors. Thus we take BLUPs of the plot effects from model M3 and graph them against row number for each column (see Figure 11.8). There are, of course, numerous gaps in this graph due to the fact that only a subset of the full field trial was milled. This can make outlier and trend detection difficult. For these data there is little suggestion of extraneous variation. We fitted an AR1×AR1 process for the column.row term but this resulted in only a small increase (0.33 units) in the residual log-likelihood, despite reasonably large values for the autocorrelations ($\alpha_r = 0.40$ and $\alpha_c = 0.50$). Thus we would choose model M3 as the best of those fitted.

A histogram of the test line BLUPs for hard genotypes is shown in Figure 11.9. This reveals the existence of two very poor genotypes. Superimposed on the histogram is a plot of the density for the approximate distribution for $\tilde{u}_t$ as given in (11.1.4). In terms of selection we consider the top 10% of test line BLUPs (the cut-off point for which is marked as the vertical line in Figure 11.9).


Figure 11.6 Multi-phase example: plot of BLUPs of day effects from model M1.

The associated EGG (calculated using equation (11.1.5)) is 3.25%. This value is quite high, as is the estimate of mean line heritability with $h^2_g = 0.83$. The high heritability and EGG are largely due to the high genetic variance for flour extraction.


Figure 11.7 Multi-phase example: plot of estimated residuals (for hard genotypes only) from model M1 against milling order for each day.


Figure 11.8 Multi-phase example: plot of estimated field plot residuals (for hard genotypes only) from model M3 against field row for each column.


Figure 11.9 Multi-phase example: plot of histogram and approximate density function for hard test line BLUPs. Cut-off for selection of top 10% of lines is indicated.

CHAPTER 12

The analysis of quantitative trait loci

12.1 Introduction

The sequencing of the human genome (http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml) is but one part of the explosion in genetic research for humans, animals and plants. Data are being generated at an ever increasing rate as technology is developed at perhaps an even faster rate. The data generated aim to support the discovery of genes, proteins and metabolites, their function, and the relationships between them in all living things. This chapter touches on one part of this grand design, the search for quantitative trait loci (QTL).

Quantitative traits are those traits that vary continuously. For example, the yield of wheat in a field plot will take on a non-negative value that will depend on the underlying genetics of the variety grown, the environment in which the wheat is grown, and the interaction between the genetics and the environment. These aspects have been discussed in a broad manner in Chapter 11, where variety effects represent the genetic component.

The determination of genomic regions or genes that influence the expression of a quantitative trait is important in plant and animal breeding as well as in the human context. For plants, this can lead to rapid improvements in agronomic and quality traits, in disease resistance, and in tolerance to both biotic and abiotic stresses. In the livestock industry, this can lead to improvements in quality traits, such as marbling in beef, and wool characteristics in sheep.

12.2 Example

To motivate our development we use data from two field experiments conducted in 1999 and 2000, which involved 175 of a total of 180 doubled haploid (DH) lines from the Sunco × Tasman mapping population. This population was developed as part of the Grains Research and Development Corporation National Wheat Molecular Marker Program in Australia. Marker assisted selection for quality characteristics is difficult as these traits are often influenced by many QTLs whose effects can be environmentally dependent. Additionally, ? and ? have shown that quality traits are often subject to large amounts of non-genetic sources of variation, hindering both genetic progress using traditional breeding approaches and efficient and accurate identification of QTLs. The quality trait we use for illustration here is flour yield, which is one of the most commercially important quality traits in wheat breeding in Australia.

? have recently analyzed these data with an improved linkage map and using the methods of ?.

Both field trials were designed as randomized complete block designs with two replicates of each DH line and additional plots of parental genotypes and commercial varieties. Each trial was laid out in the field as a rectangular array of 38 rows and 12 columns. Grain samples from most of the field plots were then milled using a Buhler mill. For the 1999 field trial, none of the field plots were replicated in the milling process but an additional 47 so-called milling control samples were included at regular intervals during the milling of the field samples. The field plots were randomly assigned to mill days and mill order within mill days. The laboratory measurement phase took a total of 38 mill days with 11 samples milled per day. In 2000, 23% of the field samples were replicated in the milling process. Thus a total of 456 samples were milled over 38 mill days with 12 samples per mill day. Field plot samples were randomly assigned to mill days and mill order within mill days.

12.3 Overview of Molecular Genetics

Living things are composed of cells. Prokaryotes are organisms, such as bacteria, that do not have a cell nucleus. Eukaryotes have complex cells; animals, plants, fungi and protista fit into this class. The cell is central to the so-called central dogma of biology. A fundamental part of the nucleus of each cell is deoxyribonucleic acid (DNA). DNA is often referred to as the molecule of heredity, as it contains the genetic instructions specifying the biological development of all cellular forms of life (http://en.wikipedia.org/wiki/). DNA is a polymer, a very long molecule consisting of structural units that are connected by covalent chemical bonds. It includes repetition of many identical, similar, or complementary molecular subunits that are called monomers. Monomers link during a chemical reaction called polymerization. A simple representation of DNA is presented in Figure 12.1. The double helix consists of two sugar-phosphate backbones together with nitrogenous bases that pair across the strands and are connected by hydrogen bonds. The bases are the purines, Adenine (A) and Guanine (G), and the pyrimidines, Cytosine (C) and Thymine (T). A pairs with T, and G pairs with C. The chemical structure of DNA is presented in Figures 12.2 and 12.3. The two strands are complementary; the pentose molecules (ribose or deoxyribose) and the 5′ and 3′ labels as given in Figure 12.4 indicate the direction in which the DNA is read. The nucleus contains chromosomes, threadlike "packages" of genes and other DNA; see Figure 12.5. These genes are the beginning of the complex biochemical processes that include maintenance and development of the living thing, and reproduction. Different organisms have different numbers of chromosomes and hence a different genome.

Figure 12.1 Double helix structure of DNA

Figure 12.2 Chemical structure of DNA

Figure 12.3 Nucleotide structure within DNA and Purines/Pyrimidines

Figure 12.4 Pentose within DNA with hydrogen bond labels

Human beings are diploid, with 23 pairs of chromosomes: 46 chromosomes in total, of which 44 are autosomes and 2 are the sex chromosomes. The bovine genome consists of 30 pairs of chromosomes, 29 pairs of autosomes and 1 pair of sex chromosomes. Barley is also diploid and has 7 pairs of chromosomes. Wheat, however, is a hexaploid, with three genomes each of 7 chromosome pairs. Despite this complexity, the wheat genome is often treated as though it has 21 chromosome pairs. Polyploidy is common in horticultural plants such as strawberries. Sugarcane is an autopolyploid with varying genome size and represents one of the most complex plant genomes.

The central dogma, as represented in Figure 12.6, has formed the basis of our understanding of biological processes for some time. The cell produces the appropriate protein for the specific process required in several steps: transcription, post-transcription, translation and post-translation. The result is an active protein that could, for example, trigger a response in a plant in drought situations. It is the gene, and hence the portion of DNA that regulates the protein synthesis, that we seek to discover and to understand.

Figure 12.5 Chromosomes in the nucleus of the cell

Figure 12.6 Central dogma: protein synthesis

Figure 12.7 Recombination in Meiosis

12.4 Reproduction

The reproduction of the species presents the opportunity for changes in genetic makeup through the creation of a new individual. In animal and plant breeding this is the opportunity to seek improvements in important economic attributes of animals and plants. In experimental populations, the generation of new offspring provides the mechanism for determining genomic regions that impact on these attributes. Thus the genetics of reproduction is crucial.

12.4.1 Meiosis

There are two types of cell division: mitosis and meiosis. Mitosis occurs during normal biological growth and differentiation, and results in two identical cells produced from a single cell. Meiosis is the specialized division of cells that, in diploid organisms, produces four daughter cells, each having half the number of chromosomes of the progenitor cell. These are the so-called haploid gametes. Meiosis includes one round of chromosome duplication and two rounds of cell division. Meiosis begins with cells containing two sets of chromosomes (in diploids). The DNA duplicates so that each chromosome consists of two identical DNA duplex strands, called sister chromatids. The pair of chromosomes are called homologs. The homologs form four duplex strands of DNA, with non-sister chromatids pairing. Two rounds of cell division, the first to diploid cells and the second to haploid cells, result in four gametes for each chromosome in the cell. Figure 12.7 presents the process for a single chromosome. The diploid nature of the organism is restored during fertilization, when a male and a female gamete fuse together to form the zygote, which can become an embryo.

12.4.2 Recombination

A crucial part of meiosis is the chiasma, the cross-shaped structure formed between non-sister chromatids. The breakage of non-sister chromatids and their reunion with the other chromatid is called crossing-over or recombination; see Figure 12.7. This means that the next generation may carry a non-parental haploid combination when gametes fuse in fertilization, and hence possible diversity in genetic composition. Recombination is fundamental in linkage and QTL analysis; indeed, these methods rely on recombination events.

12.5 Genetic information

We have seen that the pedigree structure can provide genetic information on lines. The relationship matrices provide average genetic connections between progeny. However, to understand how and which genes influence the phenotype of the progeny, genetic information within progeny is required. Combining both pedigree and within-progeny information is also desirable. The ideal genetic information would be the complete DNA sequence for each progeny. However, this is impractical because genomes are huge and complex, and sequencing even one progeny is a formidable task. The solution is to "summarize" information on the genome in some way.

12.5.1 Molecular markers

Molecular markers represent genetic differences between individual organisms or species. A marker needs to exhibit variation, that is, to be polymorphic. Monomorphic markers are those markers that do not exhibit variation across progeny. These markers are usually non-informative, but in structured populations a marker may be uninformative in one population yet informative in others, and hence useful. Molecular markers are not genes in general, but they flag diversity at specific locations on the genome called loci. Markers close to genes are sometimes called gene 'tags'. There are various types of markers. Morphological (from the Greek morphe, meaning form) markers have been used by plant breeders for a long time. Essentially these are phenotypes observed after crossing of lines. Table 12.1 gives an example of morphological markers for wheat. Biochemical markers such as isozymes, enzymes that have different amino acid sequences but catalyze the same chemical reaction, have been used for low-level genetic information.

Table 12.1 Morphological markers for wheat

However, it was the advent of DNA-based molecular markers that allowed significant progress to be made in providing more detailed genetic information on individuals. Molecular markers are generally abundant, and arise from different classes of DNA mutations, namely substitution mutations (point mutations), rearrangements (insertions or deletions), and errors in replication of tandemly repeated DNA. Broadly, there are three types: hybridization based, polymerase chain reaction (PCR) based, and DNA sequence based.

Hybridization is the process of combining complementary, single-stranded nucleic acids into a single molecule. Two perfectly complementary strands will bind to each other, whereas a single inconsistency between the two strands will prevent them from binding. This is the basis for determining whether an organism has a specific sequence, and hence provides a means for obtaining potential markers.

Polymerase chain reaction (PCR) is sometimes called "molecular photocopying", because it is a method to amplify or produce DNA. The origin of this technique is a hot spring in Yellowstone National Park, where a bacterium, Thermus aquaticus, highlighted how an organism copies its DNA in the cell cycle. The two DNA strands are denatured, or separated, and an enzyme called DNA polymerase copies the strands using each strand as a template. However, the enzyme cannot copy a chain without a short sequence of nucleotides to "prime" the process. Another enzyme, primase, makes the first few nucleotides: this stretch of DNA is called a primer. Subsequently the polymerase takes over and finishes the job. PCR does this in a test tube using Taq polymerase, isolated from Thermus aquaticus.

Restriction enzymes provide a way to isolate sections of DNA. These enzymes cut double-stranded DNA with two incisions, one through each of the phosphate backbones of the double helix, without damaging the bases. The chemical bonds that the enzymes cleave can be reformed by other enzymes known as ligases. Ligation, or splicing together, allows reformation provided the ends are complementary. These enzymes were discovered in E. coli strains; EcoRI is a particular example, and its recognition sequence is 5'-GAATTC-3'.

Gel-electrophoresis is a technique used to produce and score markers. The method involves the movement of an electrically charged substance under the influence of an electric field. Molecules are separated according to size and electrical charge by applying an electric current to DNA. The current forces the molecules through a gel, for example agarose (made from seaweed) or polyacrylamide, and the gel can be made to allow separation of molecules of specific size and shape.

There are many types of molecular markers, and these reflect the developments over a number of years. Restriction Fragment Length Polymorphism (RFLP) involves DNA extracted from individuals being digested by a restriction enzyme (fragments of 2-10 kilobases, kb), followed by PCR, gel-electrophoresis, Southern blotting and hybridization of fragments to a locus-specific radiolabelled DNA probe. Visualization of the polymorphism is by radiography. RFLPs are abundant and randomly distributed throughout the genome, and are reproducible. The bands observed are interpreted in terms of loci and codominant alleles. However, generation of RFLPs is laborious and technically demanding because the approach is not amenable to automation.
Amplified Fragment Length Polymorphism (AFLP) also involves DNA digested by a restriction enzyme (80-500 kb), followed by ligation of oligonucleotide (20 bp) adapters to the fragments. Selective PCR follows, with separation of fragments by gel-electrophoresis. Again, AFLPs are abundant and randomly distributed throughout the genome, and are reproducible. Many informative bands can be obtained per reaction, and the approach is amenable to automation. The alleles found are dominant: if a band is present the allele is present, but absence does not specify what allele is in its place. Another negative is that purified, high molecular weight DNA is required.

Microsatellites or Simple Sequence Repeats (SSR) are tandemly repeated sequences, for example GCGCGCGC = (GC)4. Restriction enzymes are used together with PCR and primers obtained from genomic sequence libraries (databases). Separation of fragments is by gel-electrophoresis. SSRs are abundant and randomly distributed throughout the genome, highly polymorphic and reproducible. The bands can be interpreted in terms of loci and codominant alleles. Multiplexing (running many SSRs in a single gel) is possible and the approach is amenable to automation. Only low quantities of DNA are required, although primer development may be costly.

Diversity Array Technology (DArT) is a high-throughput genotyping technology based on a microarray platform. The process involves 20-120 slides, with 2,000-12,000 clones per slide. Each slide is hybridized with a target (representation) prepared from a line, and two images are obtained, a reference and the target. Clustering of each clone across multiple slides is used to determine polymorphism. This is a high-throughput technique, with clone libraries developed for spotting the array. The markers obtained are dominant.

The final marker discussed is the Single Nucleotide Polymorphism (SNP). A SNP is a DNA sequence variation occurring when a single nucleotide, adenine (A), thymine (T), cytosine (C) or guanine (G), in the genome is altered. For example, a SNP might change the nucleotide sequence AAGCCTA to AAGCTTA.

Figure 12.8 Population structures for field crops

SNPs make up 90% of all human genetic variation and occur every 100 to 300 bases. SNP chips are available with arrays of sizes 10K, 100K and 500K (and soon 1 million) for humans. This is a high-throughput approach and the markers are bi-allelic, but the scale of the data makes the question of analysis critical.

12.5.2 Checking markers: Segregation analysis

Each marker will have distinct alleles if it is polymorphic. The segregation ratios (that is, the relative proportions of the possible alleles or genotypes) at a locus depend on the type of population. For field crops, population types are given in Figure 12.8. Doubled haploid lines are formed by doubling a haploid and hence are homozygous at all loci. Recombinant inbred lines (RIL) are largely homozygous after sufficient generations. For various population types, the segregation ratios at a locus that has two possible alleles A1 and A2 are given in Table 12.2.

Table 12.2 Segregation ratios for population types

Population type   Codominant markers           Dominant markers
F2                1:2:1 (A1A1, A1A2, A2A2)     3:1 (A1-, A2A2)
Backcross         1:1 (Aa, aa)                 1:1 (Aa, aa)
RIL or DH         1:1 (A1A1, A2A2)             1:1 (A1A1, A2A2)

A simple chi-squared test can be used to assess whether there is segregation distortion, that is, a departure from the expected segregation ratio, for each marker. If a marker departs significantly from the expected ratio, a decision must be made on whether or not to use the marker in linkage analysis. As an example of segregation analysis, consider two markers in a doubled haploid population, presented in Table 12.3. This example will be used again in the discussion of linkage, and hence the full two-way table is given rather than just the margins for each marker.

Table 12.3 Segregation analysis for two markers in a doubled haploid population

                  Marker 2
Marker 1      B1      B2    Total
A1            35      35       70
A2            10      20       30
Total         45      55      100

The likelihood ratio chi-squared statistics, computed from the margins of Table 12.3, are 16.46 and 1.00 for markers 1 and 2 respectively. These should be compared with the chi-squared value of 3.841, the 5% critical value on 1 degree of freedom. Marker 1 shows significant segregation distortion, whereas marker 2 does not.
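These statistics are easily reproduced. The following is a minimal R sketch (the function name segregation.lrt is ours), comparing observed marker counts with the 1:1 ratio expected in a doubled haploid population:

    # Likelihood ratio test of 1:1 segregation for a biallelic marker
    segregation.lrt <- function(counts) {
      n <- sum(counts)
      expected <- rep(n / 2, 2)          # 1:1 ratio for a DH population
      2 * sum(counts * log(counts / expected))
    }
    segregation.lrt(c(70, 30))   # marker 1: 16.46
    segregation.lrt(c(45, 55))   # marker 2:  1.00
    qchisq(0.95, df = 1)         # 5% critical value: 3.841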

12.5.3 Checking individuals

Segregation analysis focuses on the markers. Checks on the pattern of markers for individuals are also advisable. For example, individuals that carry unusually few alleles from either parent may be in error, and individuals that are very similar to (perhaps identical with) or very distinct from the rest should also be examined for potential errors.

12.6 Linkage analysis

Linkage analysis is the term given to examining the connection between markers. Are two markers close to one another, or are they independent in their scores? For a particular crop the chromosomal structure is known. Where do the markers fit into this structure? Can we group the markers in such a way as to reflect the chromosomal structure, and order them within a chromosome? If we are able to answer these questions, we can construct a representation of the genome using molecular markers. Two markers are linked if they are on the same chromosome. So our aim is to assemble the markers into linkage groups that correspond to chromosomes. We begin by examining some simple cases.

12.6.1 Linkage between two markers

The basis for linkage analysis is crossing-over or recombination. Two markers are "close" to one another if few recombinations have occurred between them. Notice that this measure of 'distance' is in terms of recombination events and may not reflect the true physical distance, which may be measured in terms of base-pairs.

We begin with a discussion of recombination. Suppose two loci (or markers) are located at the end-points of an interval [a, b]; this interval represents the separation (in some sense, as we have not yet defined distance) of the two markers. Let N be the number of cross-overs or chiasmata on that interval, and assume chiasmata occur randomly on the interval. If θ_n is the probability that a gamete is a recombinant given there are n chiasmata on the interval, then θ_0 = 0. Furthermore, as chiasmata occur randomly, a difference equation for θ_n in terms of θ_{n−1} is given by

  θ_n = (1/2)θ_{n−1} + (1/2)(1 − θ_{n−1}) = 1/2,   n ≥ 1,

so that, given at least one chiasma, a gamete is a recombinant with probability 1/2. This is a conditional probability, and the unconditional probability of a recombinant is given by

  θ = Pr(recombinant)
    = Σ_{n=0}^∞ Pr(recombinant | N = n) Pr(N = n)
    = Σ_{n=0}^∞ θ_n Pr(N = n)
    = (1/2) Σ_{n=1}^∞ Pr(N = n)
    = (1/2) Pr(N > 0)
    = (1/2)(1 − Pr(N = 0))

which is known as Mather's formula. This shows that 0 ≤ θ ≤ 0.5. Genetic map distance is then defined as

  d = (1/2) E(N)

with the unit of distance being the Morgan or centi-Morgan (cM), in honour of Thomas Hunt Morgan; 1 cM corresponds to 1 recombination event per 100 gametes. If the number of chiasmata follows a Poisson distribution, N ∼ Po(λ), then Pr(N = 0) = e^{−λ} and

  θ = (1/2)(1 − e^{−λ}).

Now λ = E(N) = 2d, so that

  θ = (1/2)(1 − e^{−2d}).

Inverting this relationship results in Haldane's mapping function

  d = −(1/2) log(1 − 2θ).

Notice that if θ = 0 then d = 0, and as θ → 0.5, d → ∞. Thus small values of θ suggest that markers are close in terms of recombination events, while θ = 0.5 corresponds to two markers that are unlinked. There are other mapping functions, for example Kosambi's mapping function. ? provide a review of mapping functions and their links to renewal processes.

Linkage is equivalent to statistical association or dependence. The two-way table of probabilities of the possible marker pairs, A1B1, A1B2, A2B1 and A2B2, is given in Table 12.4. Zero or an even number of cross-overs have occurred if the genotype is A1B1 or A2B2. If an odd number of cross-overs has occurred, we will observe either A1B2 or A2B1 and a recombination has occurred. Thus θ = p_{A1B2} + p_{A2B1}.

Table 12.4 Two-way table of probabilities for markers 1 and 2

                    Marker 2
Marker 1     B1          B2          Total
A1           p_{A1B1}    p_{A1B2}    p_{A1}
A2           p_{A2B1}    p_{A2B2}    p_{A2}
Total        p_{B1}      p_{B2}      1
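Before moving to tests of association, the mapping function derived above is easy to compute. A minimal R sketch (the function names haldane.d and haldane.theta are ours):

    # Haldane's mapping function: recombination fraction to distance (Morgans) ...
    haldane.d <- function(theta) -0.5 * log(1 - 2 * theta)
    # ... and its inverse: distance (Morgans) to recombination fraction
    haldane.theta <- function(d) 0.5 * (1 - exp(-2 * d))

    100 * haldane.d(0.10)             # theta = 0.10 corresponds to about 11.2 cM
    haldane.theta(haldane.d(0.25))    # recovers 0.25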

We would like to provide a measure of association between two categorical variables using the marker data across individuals. In statistical terms, the first thing to do is to form a contingency table. Thus if we have alleles A and a for marker 1 and B and b for marker 2 in a doubled haploid population, we form the table of counts as given, for example, by Table 12.3. A test of independence of the two categorical variables, that is, of the two markers, is a standard statistical procedure. The null hypothesis of independence is H0 : p_{AiBj} = p(Ai)p(Bj). The likelihood ratio test of independence for Table 12.3 is given by

  −2 log(Λ) = 2(35 log(35/31.5) + 35 log(35/38.5) + 10 log(10/13.5) + 20 log(20/16.5)) = 2.40

which should be compared to a chi-squared value on 1 degree of freedom (3.841 at the 5% level). Thus we retain independence, and the two markers are not significantly linked. We can estimate θ under the independence assumption. We have

  θ = p(A1B2) + p(A2B1)
    = p(A1)p(B2) + p(A2)p(B1)    (12.6.1)

The intuitive estimate of p(A1) is 0.7, of p(A2) is 0.3, and so on. This leads to the estimate θ̂ = 0.52, which is outside the range of permissible values. What has gone wrong? For doubled haploid lines we expect marker allele probabilities p(A1) = p(A2) = 0.5, with the same result for the other marker. In this case θ = 0.5 using (12.6.1), which is what we would expect if the markers are unlinked. The likelihood ratio test for independence under these restrictions is

  −2 log(Λ) = 2(35 log(35/25) + 35 log(35/25) + 10 log(10/25) + 20 log(20/25)) = 19.86

which looks large and suggests we reject independence, and hence that the two markers are linked. Note, however, that we are now testing both for equal segregation ratios for each marker and for independence, which gives 3 degrees of freedom (5% value of 7.815). In fact this final statistic is the sum of the two statistics for testing equal segregation and the statistic for testing independence, that is,

  19.86 = 16.46 + 1.00 + 2.40.

Segregation distortion is therefore the culprit in terms of clouding the issue. The intuitive estimate of the recombination fraction is in fact

  θ̂ = (35 + 10)/100 = 0.45

and we can test H0 : θ = 0.5 using a simpler table.

Table 12.5 Recombination in a doubled haploid population

Type              Count         Probability
Recombinant       r = 45        θ
Non-recombinant   n − r = 55    1 − θ
Total             n = 100       1

There are two possible outcomes, and hence estimation and a test based on the binomial distribution are appropriate. The likelihood is (omitting constants)

  L(θ) = θ^r (1 − θ)^{n−r}.

Taking logarithms, we have

  log L(θ) = r log(θ) + (n − r) log(1 − θ).

We choose the estimate of θ which maximizes this function. If we graph the function for 0 ≤ θ ≤ 0.5 we find a single maximum at r/n. Alternatively, differentiating and equating to zero, we find

  r/θ − (n − r)/(1 − θ) = 0

and solving gives

  θ̂ = r/n.

This reduces to 45/100 = 0.45 for our data, and hence is the same as our intuitive estimate.

The likelihood ratio statistic for H0 is

  Λ = (0.5^r 0.5^{n−r}) / (0.45^r 0.55^{n−r})

for r = 45 and n = 100. Hence −2 log(Λ) = 1.00 and we retain the hypothesis that the markers are unlinked. The traditional statistic used in this area is not the likelihood ratio statistic but a related statistic called the LOD (log-odds) score, in which the logarithm is taken to base 10. In fact, in testing H0 : θ = 0.5,

  LOD = 45 log10(0.45) + 55 log10(0.55) − 45 log10(0.5) − 55 log10(0.5) = 0.22

and the traditional cutoff for significance is LOD = 3 (p-value 0.0002). This very stringent threshold reflects the multiple testing involved in assembling a full linkage map.
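These calculations can be sketched in R as follows (a minimal illustration; the function name recomb.test is ours):

    # Estimate the recombination fraction in a DH population and compute the
    # likelihood ratio statistic and LOD score for H0: theta = 0.5
    recomb.test <- function(r, n) {
      theta.hat <- r / n
      logL <- function(theta) r * log(theta) + (n - r) * log(1 - theta)
      lrt <- 2 * (logL(theta.hat) - logL(0.5))
      lod <- lrt / (2 * log(10))   # the LOD is the LR statistic on the log10 scale
      c(theta.hat = theta.hat, lrt = lrt, lod = lod)
    }
    recomb.test(45, 100)   # theta.hat = 0.45, lrt = 1.00, lod = 0.22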

12.6.2 Linkage between three markers

Having three markers introduces the problem of order. Consider the data presented in Table 12.6. Assuming the markers are linked, what is the most likely order: 123, 231 or 312? Notice that 321, 132 and 213 are equivalent to these three.

Table 12.6 Three markers

                         Marker 3
Marker 1   Marker 2      C      c    Total
A          B            47     16       63
A          b             2      5        7
a          B             7      1        8
a          b            12     60       72
Total                   68     82      150

An obvious approach to determining the best order is to evaluate the likelihood for the three possible orders. This will not be an option when a large number of markers must be ordered, but the principles are best covered in this simple case. The expressions in Table 12.7 involve only pairwise recombination fractions, and we have seen how to estimate these. The pairwise estimates are θ̂12 = 0.10, θ̂13 = 0.27 and θ̂23 = 0.21. The corresponding distances (using Haldane's mapping function) are d12 = 11, d13 = 39 and d23 = 27 cM. This suggests that 123 is the most likely order. Notice that the pairwise estimates of recombination fractions "do not add up", and nor do the distances.

Table 12.7 Three markers: order 123

Genotype   Probability                   Frequency
ABC        (1/2)(1 − θ12)(1 − θ23)           47
ABc        (1/2)(1 − θ12)θ23                 16
AbC        (1/2)θ12θ23                        2
Abc        (1/2)θ12(1 − θ23)                  5
aBC        (1/2)θ12(1 − θ23)                  7
aBc        (1/2)θ12θ23                        1
abC        (1/2)(1 − θ12)θ23                 12
abc        (1/2)(1 − θ12)(1 − θ23)           60

The log-likelihoods for each order are given in Table 12.8, and confirm that the order 123 is the best. Notice that the double recombinants, in this case AbC and aBc, are then the least frequently occurring genotypes.

Table 12.8 Log-likelihoods for the three possible orders

Order   log-likelihood
123          -125.19
231          -163.41
312          -135.75
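The values in Table 12.8 can be reproduced directly from the counts in Table 12.6. In the R sketch below (a minimal illustration; the function names are ours), the middle marker determines the two intervals, pairwise recombination fractions are estimated from the counts, and the constant factor 1/2 per observation is omitted, which appears to be the convention used in Table 12.8:

    # Genotype counts from Table 12.6, ordered ABC, ABc, AbC, Abc, aBC, aBc, abC, abc
    counts <- c(47, 16, 2, 5, 7, 1, 12, 60)
    # marker scores (+1/-1) for markers 1, 2, 3 in each of the 8 genotype classes
    scores <- cbind(m1 = c( 1,  1,  1,  1, -1, -1, -1, -1),
                    m2 = c( 1,  1, -1, -1,  1,  1, -1, -1),
                    m3 = c( 1, -1,  1, -1,  1, -1,  1, -1))
    # pairwise recombination fraction between markers i and j
    theta.hat <- function(i, j) sum(counts[scores[, i] != scores[, j]]) / sum(counts)
    # log-likelihood when marker 'middle' is the central marker
    order.loglik <- function(middle) {
      flank <- setdiff(1:3, middle)
      p <- function(i, j) ifelse(scores[, i] != scores[, j],
                                 theta.hat(i, j), 1 - theta.hat(i, j))
      sum(counts * log(p(flank[1], middle) * p(middle, flank[2])))
    }
    sapply(1:3, order.loglik)
    # -135.75 (order 312), -125.19 (order 123), -163.41 (order 231)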

The pairwise recombination fractions do not add up because double cross-overs between marker 1 and marker 3 are ignored in estimating θ13. In fact θ13 can be calculated as

  θ13 = θ12(1 − θ23) + (1 − θ12)θ23
      = θ12 + θ23 − 2θ12θ23

which is called Trow's formula (?). There is an alternative form given by

  (1 − 2θ13) = (1 − 2θ12)(1 − 2θ23)

which makes clear why Haldane's mapping function leads to an additive distance measure. There is a theory that the presence of one chiasma reduces the probability of another chiasma in the near vicinity. This is termed interference. Then

  θ13 = θ12(1 − kθ23) + (1 − kθ12)θ23

where k < 1, and Trow's formula becomes

  θ13 = θ12 + θ23 − 2kθ12θ23.    (12.6.2)

The alternative form becomes

  (1 − 2θ13) = (1 − 2θ12)(1 − 2θ23) − 4(1 − k)θ12θ23

so that the simple product form, and hence additivity of distances, no longer holds when k < 1. Using data to estimate k for each order of three markers results in maximized log-likelihoods that are all equal. The underlying biological mechanism for interference is not understood. It is thought that recombinations tend to occur in hot spots, and some regions experience very few if any recombinations.
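The additivity under no interference (k = 1) is easy to verify numerically. A minimal R sketch, using the estimates from Table 12.6 and the haldane.d function defined earlier (the function name trow is ours):

    # Trow's formula with interference parameter k (k = 1: no interference)
    trow <- function(t12, t23, k = 1) t12 + t23 - 2 * k * t12 * t23
    t12 <- 15 / 150; t23 <- 31 / 150        # estimates from Table 12.6
    haldane.d(trow(t12, t23))               # 0.3782 Morgans ...
    haldane.d(t12) + haldane.d(t23)         # ... identical: distances are additive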

12.6.3 Linkage or genetic map construction

Ordering three markers illustrates the problem, but in reality things are much more complex. Firstly, we may have several hundred markers. These markers need to be organised into linkage groups, preferably with the number of such groups corresponding to the known chromosome number of the organism. Having formed linkage groups, the order of the markers and their separation in terms of recombination fraction, and hence distance, must be determined. Grouping markers into linkage groups involves deciding on the level of linkage required, and in one sense is a clustering problem. Ordering markers within a linkage group has been recognised as a travelling salesman problem, a combinatorial optimisation problem with a large number of possible orders. For large numbers of markers the number of orders is huge: for m markers it is m!/2. There are many algorithms available, including seriation, simulated annealing, branch and bound, the so-called genetic algorithm, and the Tabu search. Methods based on likelihood and the bootstrap have also been proposed. Whatever method is used, there is no guarantee of an optimal solution. Software is available for linkage map construction. There are a large number of programs (see http://linkage.rockefeller.edu/soft/list.html), and the quality varies considerably. An example of a genetic linkage map is given in Figure 12.9, while an example of a physical map is given in Figure 12.10. Physical maps require sequence information and provide detail on how much DNA separates two or more genes (measured in base-pairs). A physical map can be used with a genetic linkage map to anchor that linkage map.

12.7 QTL analysis

A major aim of molecular genetics is to find genes that control important traits or characteristics of plants, animals and human beings. For example, yield of wheat is a very important economic trait: what is the genetic basis for high yield? Marbling in beef cattle is very important for the Japanese market: what is the genetic basis of marbling? Heart disease or diabetes may have a genetic component: can we determine which genes make people susceptible to such diseases?

Figure 12.9 Example of a linkage map: Wheat

Figure 12.10 Example of a physical map: Barley

Figure 12.11 Markers linked to a gene, and a perfect marker


QTL mapping involves finding the location and size of QTL effects for a trait or traits. It is a first step towards possible determination of a gene that may influence expression of the trait. Figure 12.11 provides a graphical representation. QTL detection can lead to marker assisted selection of lines in plant breeding programmes. Sound QTL information can lead to better agricultural outcomes and products. Farmers benefit, with improvements in the food chain, in health and well-being, and finally in export income. The focus in this chapter is on whole genome interval mapping in the regression setting. Other approaches are presented as background to this more comprehensive approach. There is no attempt to compare methods in the chapter; the whole genome method has been compared with the best available methods in ?, and has been shown to be a powerful approach for QTL analysis.

12.7.1 The Data

There are three sources of data in QTL analysis. The first source is data available on the material of interest, which we call the genotypes (for example varieties of wheat or barley): measured or scored traits of interest, together with the experimental or study design and any management variables that might impact on the traits of interest. This phenotypic data is represented by (y, X, Z, Z_g), where y is the n × 1 vector of trait values on the n observational units, and X and Z are n × t and n × b matrices of fixed effect and random effect variates and factors; each row specifies the corresponding values for the ith observation. The matrix Z_g is an n × l binary matrix that relates the observed data to each of the l genotypes.

Let M^g be the l × m matrix of m marker scores on the l genotypes. This is the second source of data and provides genetic information on each of the lines. Interval mapping methods rely on the availability of a linkage map for the population, that is, an ordered set of markers arranged in the chromosomal structure for the organism (for example wheat or barley), together with estimated recombination fractions between adjacent markers. Let c denote the number of chromosomes and m_k denote the number of markers on chromosome k; the number of markers generally varies across chromosomes. Let M^g_k denote the matrix of scored markers on chromosome k. Then the matrix of marker scores can be ordered as M^g = [M^g_1 M^g_2 . . . M^g_c] and m = m_1 + · · · + m_c. For genotype i, m^g_{k;ij} and m^g_{k;i,j+1} will denote the jth and (j+1)th marker scores for a pair of adjacent markers on chromosome k; this is the jth interval on chromosome k. The vectors of these scores across all lines will be denoted by m^g_{k;j} and m^g_{k;j+1} respectively, and are columns j and j + 1 of M^g_k.

The third source of data is the linkage map itself, which consists of a list of the ordered markers together with the recombination fractions between neighbouring or flanking markers. Let θ_{k;j,j+1} denote the recombination fraction between markers j and j + 1 (the jth interval) on chromosome k. The genetic distance for interval j, d_{k;j,j+1}, will be based on Haldane's distance,

  d_{k;j,j+1} = −(1/2) log(1 − 2θ_{k;j,j+1})    (12.7.3)

although other distance measures could be used. The calculations used in interval mapping assume that recombination events occur at random along the genome, and this corresponds to this distance measure. Missing trait data is usually estimated in the analysis, together with the effects of interest. Missing marker scores are handled as in ?.

12.7.2 Statistical model

The general statistical model that forms the basis of analyses is (see ? for details)

  y = Xτ + Z_g g + Zu + e    (12.7.4)

where τ is a t × 1 vector of fixed effects, u is a b × 1 vector of random effects assumed N(0, σ²G(γ)), and e is the residual vector, assumed N(0, σ²R(φ)). The latter two vectors are assumed mutually independent. The fixed, random and residual terms reflect the design and conduct of the trial, and as such provide the underlying structure for non-genetic variation through the associated parameters, namely τ and the parameters γ and φ of the covariance matrices.

12.7.3 Genetic model

The total genetic effect for genotype i = 1, 2, . . . , l will be denoted by gi and the vector of these effects by g. This vector of genotypic effects is of prime interest and is decomposed as follows. If we have a single QTL,

gi = qia + pi (12.7.5) where a represents the size of the QTL (on the scale of the trait), qi is un- known, but is either −1 or 1 for doubled haploid lines depending on the parental allele at the QTL (?), and pi is the residual or polygenic effect, as- 2 sumed to be distributed N(0, σ γg). There are many unknown components in (12.7.5). Firstly we do not know the location of the QTL, nor do we know the size of the effect (or difference between the allelic expressions), and lastly, we do not know the appropriate QTL allele for each genotype. The linkage map and marker scores provide information to estimate aspects of these three components.

12.7.4 Single marker methods

Consider a single marker, M. Is this marker, and hence its alleles, associated with differences in trait expression? Figure 12.12 depicts the genetic situation, with the QTL, Q, and the marker assumed to have recombination fraction θ between them.

Figure 12.12 Single marker and a QTL

A Punnett square can be formed as presented in Table 12.9. The entries are the probability that the QTL score is −1 or 1 given the marker score and the recombination frequency θ.

Table 12.9 Punnett square for a single marker and QTL

                           QTL q_i
Marker x_i    −1              1               Total
−1            (1/2)(1 − θ)    (1/2)θ          0.5
1             (1/2)θ          (1/2)(1 − θ)    0.5

Conditional probabilities such as Pr(q_i = −1 | x_i = −1) are found by dividing the entries in a row by the row total; this simply removes the factor 0.5, so that, for example, Pr(q_i = −1 | x_i = −1) = 1 − θ.

12.7.5 Single marker: Likelihood approach

Now g_i = q_i a + p_i and, as p_i ∼ N(0, σ_g²), we have

  g_i | q_i ∼ N(q_i a, σ_g²).

We now condition on x_i and "eliminate" q_i. Thus, for example, if f(·) denotes a probability density function,

  f(g_i | x_i = −1) = f(g_i | x_i = −1, q_i = −1) Pr(q_i = −1 | x_i = −1)
                    + f(g_i | x_i = −1, q_i = 1) Pr(q_i = 1 | x_i = −1)

    = (1 − θ) f(g_i | q_i = −1) + θ f(g_i | q_i = 1)

which is a mixture of two normal distributions, one with mean −a (q_i = −1) and one with mean a (q_i = 1). Similarly,

  f(g_i | x_i = 1) = θ f(g_i | q_i = −1) + (1 − θ) f(g_i | q_i = 1)

so that the mixture distribution given x_i can be written compactly as

  f(g_i | x_i) = (1 − θ) f(g_i | q_i = x_i) + θ f(g_i | q_i = −x_i).

This result is central to likelihood based methods of estimation as introduced by ?. We need to estimate θ, a and σ_g². In simple situations this mixture distribution translates to a simple mixture for the trait observations. However, with field and laboratory data, the mixture is but part of what can be a complex model, involving additional unknown parameters that must also be estimated. Every marker can be examined for association with the trait. This provides a putative QTL size, a location as given by the marker, and an assessment of the strength of the association using a LOD score. No association implies θ = 0.5 or a = 0; in fact this is a non-standard test because the null hypothesis 'loses' one parameter (see ?). Permutation tests have been used to assess significance, but these are time-consuming and perhaps not practical for complex situations. The likelihood ratio/LOD score approach provides a ranking.

12.7.6 Single marker: Regression approach

The regression approach (?, ?) for a single marker involves replacing qi by its expected value given the marker. Thus

  E(q_i | x_i = −1) = (−1)(1 − θ) + (1)θ = −(1 − 2θ)

and similarly E(q_i | x_i = 1) = (1 − 2θ), so that

  E(q_i | x_i) = (1 − 2θ) x_i.

Our genetic model becomes

  g_i = (1 − 2θ) a x_i + p_i

      = β x_i + p_i

which is a regression of g_i on x_i. Thus we can estimate β using regression methods, or in more complex situations using mixed model methods. Note however that, unlike the likelihood approach, we can only estimate β = (1 − 2θ)a, and hence we cannot separate the location (θ) from the size (a) of the QTL. Thus if a marker is associated with the trait, it may be because a small QTL is close by or because a larger QTL is further away. The regression approach can be used in the same way as the likelihood approach by considering each marker in turn.
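A small simulation illustrates the attenuation β = (1 − 2θ)a. This is a minimal R sketch under assumed values θ = 0.2 and a = 1 (all names are ours):

    set.seed(1)
    n <- 5000; theta <- 0.2; a <- 1
    q <- sample(c(-1, 1), n, replace = TRUE)   # unobserved QTL genotypes
    recomb <- rbinom(n, 1, theta)              # recombination between marker and QTL
    x <- ifelse(recomb == 1, -q, q)            # observed marker scores
    g <- q * a + rnorm(n, sd = 0.5)            # genetic effect plus polygenic noise
    coef(lm(g ~ x))["x"]                       # close to (1 - 2 * theta) * a = 0.6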

12.7.7 Single marker: Shortcomings

The obvious shortcomings of the single marker methods are the difficulty of determining the location of a QTL (despite being able to estimate θ using likelihood methods), the piecemeal nature of the analysis (estimation at each marker), the multiple testing problem (hence LOD scores of 3 or more), and the impact of other QTL on the assessment of a particular QTL.

12.7.8 Interval mapping

The obvious extension of single marker methods is to use intervals as defined by the linkage map. Consider then Figure 12.13, where we have two markers defining an interval, the left flanking marker M_L and the right flanking marker M_R, with a putative QTL Q lying in the interval. The recombination fractions are θ_{LR} between the left and right flanking markers, and θ_{LQ} and θ_{QR} between the left marker and the QTL, and between the QTL and the right marker, respectively. The Punnett square for this situation is given in Table 12.10, and again conditional probabilities are found by dividing the values in rows by their row totals.


Figure 12.13 Interval mapping: two markers, QTL and recombination frequencies 232 THE ANALYSIS OF QUANTITATIVE TRAIT LOCI

Table 12.10 Punnett square for two markers defining an interval and a QTL in the interval

                                QTL q_i
x_{Li}   x_{Ri}   −1                                1                                 Total
−1       −1       (1/4)(1 − θ_{LQ})(1 − θ_{QR})     (1/4)θ_{LQ}θ_{QR}                 (1/4)(1 − θ_{LR})
−1        1       (1/4)(1 − θ_{LQ})θ_{QR}           (1/4)θ_{LQ}(1 − θ_{QR})           (1/4)θ_{LR}
 1       −1       (1/4)θ_{LQ}(1 − θ_{QR})           (1/4)(1 − θ_{LQ})θ_{QR}           (1/4)θ_{LR}
 1        1       (1/4)θ_{LQ}θ_{QR}                 (1/4)(1 − θ_{LQ})(1 − θ_{QR})     (1/4)(1 − θ_{LR})

12.7.9 Interval mapping: Likelihood

We now condition on the left and right flanking markers. Thus, using the same approach as for a single marker, we find

  f(g_i | x_{Li} = −1, x_{Ri} = −1) = [(1 − θ_{LQ})(1 − θ_{QR})/(1 − θ_{LR})] f(g_i | q_i = −1)
                                    + [θ_{LQ}θ_{QR}/(1 − θ_{LR})] f(g_i | q_i = 1)

with similar results for the other three combinations of values of the left and right markers. Thus again we have a mixture of normal distributions. Typically, the genome is scanned at regular intervals; the mixing probabilities are calculated and combined to allow the size to be estimated at each location. This eliminates the need to use mixture methods, as normal theory results then hold. Again, complex designs and processes make this approach time-consuming and involve re-estimation of non-genetic parameters.

12.8 Interval mapping: The Regression Approach

The regression approach for QTL analysis (?, ?) is used for the remainder of this chapter; general notation is introduced to allow the whole genome approach to be developed. A fundamental reason for using regression methods is the ability to include easily additional sources of variation in the model, both fixed and random effects; hence the models discussed are linear mixed models. Much of the remaining material is presented by ?. The approach presented below builds on the method of ?, as extended by ? and ?, and is based on a single environment trial for doubled haploid or recombinant inbred populations in field crops. However, the ideas and methods are generally applicable to other population structures.

The regression method for interval mapping follows the single marker approach and involves replacing q_i by its expected value given the flanking markers that define the interval being examined. Consider chromosome k. Let θ_{k;j} denote the recombination fraction between marker j and the putative QTL in the jth interval on chromosome k, and θ*_{k;j} denote the recombination fraction between the putative QTL in the jth interval and marker j + 1. Thus 0 ≤ θ_{k;j}, θ*_{k;j} ≤ θ_{k;j,j+1}, and a form of Trow's formula (?), (1 − 2θ_{k;j})(1 − 2θ*_{k;j}) = 1 − 2θ_{k;j,j+1}, connects these recombination fractions; this notation extends that of (12.6.2). ? show that for the functions

  λ_{k;jj} = λ_{k;jj}(θ_{k;j}; θ_{k;j,j+1}) = (1 − θ_{k;j,j+1} − θ_{k;j})(θ_{k;j,j+1} − θ_{k;j}) / [θ_{k;j,j+1}(1 − θ_{k;j,j+1})(1 − 2θ_{k;j})]

  λ_{k;j+1,j} = λ_{k;j+1,j}(θ_{k;j}; θ_{k;j,j+1}) = θ_{k;j}(1 − θ_{k;j})(1 − 2θ_{k;j,j+1}) / [θ_{k;j,j+1}(1 − θ_{k;j,j+1})(1 − 2θ_{k;j})]

the conditional expectation of the QTL genotype given the two flanking markers is

  E(q_i | m^g_{k;ij}, m^g_{k;i,j+1}, θ_{k;j,j+1}) = m^g_{k;ij} λ_{k;jj} + m^g_{k;i,j+1} λ_{k;j+1,j}    (12.8.6)

Note that 0 ≤ λ_{k;jj} ≤ 1 and 0 ≤ λ_{k;j+1,j} ≤ 1. At this point, a change of notation for the size of the QTL effect is appropriate. Interval analysis implicitly assumes there may be a QTL in every interval. Thus we let a_{k;j} denote the size of a putative QTL in the jth interval on chromosome k, and replace a by a_{k;j}. Applying (12.8.6) in (12.7.5), we have in vector form

  g = (m^g_{k;j} λ_{k;jj} + m^g_{k;j+1} λ_{k;j+1,j}) a_{k;j} + p
    = m^g_{k;j} α_{k;jj} + m^g_{k;j+1} α_{k;j+1,j} + p    (12.8.7)

where α_{k;jj} = λ_{k;jj} a_{k;j} and α_{k;j+1,j} = λ_{k;j+1,j} a_{k;j}. In (12.8.7), the subscripts k;j and k;j+1 indicate the interval being examined, and hence there are only two regression parameters in each fit of flanking markers across the genome. The full model for the analysis of interval j on chromosome k is

  y = Xτ + Z_g M^g_{k;j} α_{k;j} + Z_g p + Zu + e    (12.8.8)

where M^g_{k;j} = [m^g_{k;j} m^g_{k;j+1}] and α_{k;j} = [α_{k;jj} α_{k;j+1,j}]^T. This model is fitted for each k and each appropriate j, so that τ is re-estimated, as are the parameters associated with p, u and e, namely σ², γ_g, γ and φ.
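The weights λ_{k;jj} and λ_{k;j+1,j} are easily computed, and can be verified numerically against the quadratic constraint derived below at (12.8.10). A minimal R sketch (the function name lambdas is ours):

    # lambda weights for a putative QTL at recombination fraction theta.L from
    # the left marker, in an interval with flanking recombination fraction theta.LR
    lambdas <- function(theta.L, theta.LR) {
      denom <- theta.LR * (1 - theta.LR) * (1 - 2 * theta.L)
      c(left  = (1 - theta.LR - theta.L) * (theta.LR - theta.L) / denom,
        right = theta.L * (1 - theta.L) * (1 - 2 * theta.LR) / denom)
    }
    l <- lambdas(0.05, 0.18)   # QTL close to the left marker: 0.754, 0.229
    phi <- 0.5 * ((1 - 2 * 0.18) + 1 / (1 - 2 * 0.18))
    l["left"]^2 + 2 * phi * l["left"] * l["right"] + l["right"]^2   # equals 1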

Constraints

? show that

  a²_{k;j} = (α_{k;jj} + (1 − 2θ_{k;j,j+1})α_{k;j+1,j})(α_{k;j+1,j} + (1 − 2θ_{k;j,j+1})α_{k;jj}) / (1 − 2θ_{k;j,j+1}).

This equation indicates that αk;jj and αk;j+1,j must be of the same sign. Substituting for αk;jj and αk;j+1,j, we find

  (λ_{k;jj} + (1 − 2θ_{k;j,j+1})λ_{k;j+1,j})(λ_{k;j+1,j} + (1 − 2θ_{k;j,j+1})λ_{k;jj}) = 1 − 2θ_{k;j,j+1}    (12.8.9)

In fact this is Trow's formula in terms of λ_{k;jj} and λ_{k;j+1,j} rather than θ_{k;j} and θ*_{k;j}. Equation (12.8.9) is important because it provides a simple quadratic constraint on λ_{k;jj} and λ_{k;j+1,j} for each j. It can be written as

  λ²_{k;jj} + 2φ_{k;j,j+1} λ_{k;jj} λ_{k;j+1,j} + λ²_{k;j+1,j} = 1    (12.8.10)

where

  φ_{k;j,j+1} = (1/2)[(1 − 2θ_{k;j,j+1}) + 1/(1 − 2θ_{k;j,j+1})] ≥ 1    (12.8.11)

The simplicity of (12.8.10) means that we can express λk;j+1,j as

  λ_{k;j+1,j} = −φ_{k;j,j+1} λ_{k;jj} + √(1 + (φ²_{k;j,j+1} − 1)λ²_{k;jj})    (12.8.12)

12.8.1 Interval mapping: Shortcomings

Like single marker methods, scanning the genome is a piecemeal approach, with multiple testing issues, issues in the selection of QTL in the presence of other QTL, and repeated re-estimation of non-genetic parameters. In addition, a single ghost QTL can appear if two QTL occur in close proximity in coupling.

12.8.2 Composite Interval Mapping

To overcome the impact of other QTL on the detection of a specific QTL, the introduction of markers into the genetic model was proposed; see ?, ? and ?. The approach is called Composite Interval Mapping (CIM). The genetic model becomes

  g_i = q_i a + Σ_{j=1}^{q} c_{ij} β_j + p_i

where the c_{ij} are the marker scores on genotype i for marker j. Single marker or interval mapping methods can now be applied to the leading term. The introduction of the additional markers is an attempt to allow for background genetic effects. While this is a good idea, neither the choice of the number of markers, q, nor their location is obvious. However, there is a clear impact on the significance and the clarity of location when using CIM.

12.8.3 Other approaches

The literature on QTL analysis is enormous. Bayesian methods have been proposed using Markov chain Monte Carlo (MCMC). ? review various procedures and suggest that QTL determination is a model selection problem; they conduct a simulation study to examine the properties of various methods and recommend an approach based on MCMC and a Bayesian Information Criterion (BIC). There have been suggestions of simultaneous use of the full linkage map in marker assisted selection by ?, and ? use all markers in a mixed model to attempt to locate QTL. More recently, an approach using markers and the LASSO (Tibshirani (1996)) has been proposed by ? in an unpublished PhD thesis. These approaches use all markers as random effects.

12.8.4 Summary

Genome scans provide a piecemeal approach to analysis, and have generated extensions such as CIM and MQM that try to incorporate background genetic variation in the model. Using all markers has been proposed by other researchers; potentially this approach accommodates background genetic variation in a natural way. However, QTLs are not in general located at markers, and the fixed effect status of QTLs appears reasonable. Thus we examine a method, based on interval mapping, that uses the whole genome in a working model and ultimately provides QTLs as fixed effects.

12.9 Whole genome interval mapping

A key step forward in reformulating QTL analysis is to view the process from a selection point of view. Each interval may contain a QTL, and our aim is to assemble the evidence for each interval, rank this evidence, and ultimately select intervals on the basis of the strength of the evidence. The conventional approach in QTL analysis is via fixed effects, a test of significance on the size a_{k;j} in interval j of chromosome k, and a LOD score. As in conventional interval mapping, we assume every interval may contain a QTL. Unlike interval mapping, however, we include all intervals in a single analysis, and assume a working model in which the sizes of QTLs across the genome are a simple random effect.

Working statistical model

The genetic model we use is

  g_i = Σ_{k=1}^{c} Σ_{j=1}^{m_k−1} q_{k;ij} a_{k;j} + p_i    (12.9.13)

where q_{k;ij} is the indicator of parental type (or allele number) of a putative QTL in the jth interval on chromosome k, and a_{k;j} is the size of a putative QTL in that interval. As a working model we assume a_{k;j} ∼ N(0, σ²γ_a). Thus we assume a QTL may exist in every interval, and the presence of a significant QTL variance γ_a suggests that at least one QTL may be present. We do not believe the working model reflects reality as far as QTL effects are concerned. It is a vehicle that allows the detection of putative QTLs; the random effects assumption behaves like a penalty, as in ridge regression (?). We return to detection below. As for interval mapping, we replace q_{k;ij} by its expected value given markers

j and j + 1 on chromosome k. Thus, using (12.8.7), we have

  g = Σ_{k=1}^{c} Σ_{j=1}^{m_k−1} (m^g_{k;j} λ_{k;jj} + m^g_{k;j+1} λ_{k;j+1,j}) a_{k;j} + p    (12.9.14)

This can be written as

  g = M^g Λ a + p    (12.9.15)

where M^g is the l × m matrix of marker scores for each line and all m markers, a is the vector of sizes of effects, and Λ is a block diagonal matrix of size m × (m − c), with kth block

  Λ_k = [ λ_{k;11}    0           0          ...   0
          λ_{k;21}    λ_{k;22}    0          ...   0
          0           λ_{k;32}    λ_{k;33}   ...   0
          ...         ...         ...        ...   ...
          0           0           0          ...   λ_{k;m_k−1,m_k−1}
          0           0           0          ...   λ_{k;m_k,m_k−1} ]    (12.9.16)

so that the blocks correspond to chromosomes. Given Λ, under (12.9.15), g follows a normal distribution with mean vector 0 and variance matrix given by

  var(g | Λ) = σ²(γ_a M^g Λ Λ^T M^{gT} + γ_g I_l)    (12.9.17)

Thus the variance matrix is of factor analytic form; this type of structure arises elsewhere (?). The assumption of random QTL sizes moves the λ_{k;jj} into the variance matrix. There are m − c + 1 distinct parameters in this variance matrix, namely the m − c values λ_{k;jj} (the λ_{k;j+1,j} are functions of the λ_{k;jj}) and γ_g. The full model for the trait is based on (12.7.4) and (12.9.15). The distribution of y given Λ is

  y ∼ N(Xτ, σ²H)    (12.9.18)

where, if M = Z_g M^g is the full matrix of marker scores for the data set,

  H = R + γ_a M Λ Λ^T M^T + γ_g Z_g Z_g^T + Z G Z^T.

If a QTL is at a marker, the matrix Λ will contain two identical columns, each with a single unit entry. The model is then singular in terms of the sizes of the effects in the two adjoining intervals: they coincide, and obviously both cannot be estimated.
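Assembling a chromosome block of Λ from the interval weights is straightforward. A minimal R sketch (the function name make.Lambda is ours), taking the vectors of λ_{k;jj} and λ_{k;j+1,j} for the m_k − 1 intervals of one chromosome:

    # Build the m_k x (m_k - 1) block Lambda_k of (12.9.16) for one chromosome
    make.Lambda <- function(lambda.jj, lambda.j1j) {
      n.int <- length(lambda.jj)                      # number of intervals, m_k - 1
      L <- matrix(0, n.int + 1, n.int)
      L[cbind(1:n.int, 1:n.int)] <- lambda.jj         # lambda_{k;jj} on the diagonal
      L[cbind(2:(n.int + 1), 1:n.int)] <- lambda.j1j  # lambda_{k;j+1,j} below it
      L
    }
    make.Lambda(c(0.49, 0.75), c(0.49, 0.23))   # a three-marker chromosome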

Estimation

The model presented above for QTL analysis is a mixed model, and the models to follow also fall into this class. Thus residual maximum likelihood or REML (?) and best linear unbiased prediction or BLUP (?) are appropriate methods for analysis (see ?). The main advantage of the model presented is that all markers are used simultaneously in one analysis. Thus the benefits of composite interval mapping are built into the approach, and polygenic and non-genetic effects are estimated allowing for all possible QTL. There are m − c unknown parameters in Λ because of the constraints on the λ_{k;j+1,j} given by (12.8.12). Calculations under REML require derivatives of H with respect to the λ_{k;jj}. The derivative of λ_{k;j+1,j}, found by differentiating (12.8.10) implicitly, is

  ∂λ_{k;j+1,j}/∂λ_{k;jj} = −(λ_{k;jj} + φ_{k;j,j+1} λ_{k;j+1,j}) / (φ_{k;j,j+1} λ_{k;jj} + λ_{k;j+1,j}).

For m large relative to the sample size n, problems may arise in the estimation of the λ_{k;jj}, and hence of the θ_{k;j}. Even if estimation is possible, the estimates may be very poor, as the information on the precise location of a QTL is likely to be small. An approach that overcomes this difficulty is presented in the next section.

Reducing the parameterization

The approach presented here avoids the estimation of the specific location of each QTL and focuses on the broad location as defined by each interval. It also overcomes the potentially confusing situation of a QTL being located at a marker. To eliminate the parameters λ_{k;jj}, we assign a prior distribution to the location of the QTL and integrate (or average) over each interval. Without prior information, a QTL in any interval can occur at any location within that interval. We therefore assume that the distance from the left hand marker to a putative QTL, d_{k;j} say, is uniformly distributed. Notice that it must be the distance, and not the recombination fraction, that is given the uniform distribution, because distance is additive whereas recombination fractions are not. Again using Haldane's distance measure, the distance to the QTL is

  d_{k;j} = −(1/2) log(1 − 2θ_{k;j})

and it is assumed that d_{k;j} ∼ U[0, d_{k;j,j+1}], where U denotes the uniform distribution on the range specified. With this specification, the d_{k;j} (or θ_{k;j}) need to be integrated out to form a marginal distribution. Unfortunately, this is analytically intractable. In the manner of the regression approach to interval mapping, we instead replace Λ in (12.9.15) by its expected value. Let Λ_E = E(Λ). The non-zero elements of Λ_E are

  E(λ_{k;jj}) = E(λ_{k;j+1,j}) = θ_{k;j,j+1} / (2 d_{k;j,j+1}(1 − θ_{k;j,j+1}))    (12.9.19)

Note that this implies that we have a regression on

  [θ_{k;j,j+1} / (2 d_{k;j,j+1}(1 − θ_{k;j,j+1}))] (m^g_{k;j} + m^g_{k;j+1})

for each interval. Thus the average of the marker scores for the two markers defining the interval is scaled according to their separation, in terms of both recombination fraction and genetic distance. Our genetic model is based on these moment calculations and is given by

  g = M^g Λ_E a + p    (12.9.20)

Under this model, if M_E = Z_g M^g Λ_E, then (12.9.18) holds with variance matrix σ²H_E, where

  H_E = R + γ_a M_E M_E^T + γ_g Z_g Z_g^T + Z G Z^T    (12.9.21)

We call the columns of M_E derived interval markers, and the model leading to the variance structure (12.9.21) the interval marker random regression model (IMRRM). Notice that under the IMRRM the sizes are correlated along a chromosome via the random regression, something to be expected because of the linkage between the markers. Thus the model provides a natural covariance structure to capture that linkage.
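Constructing the derived interval markers from a marker matrix and the adjacent recombination fractions is then direct. A minimal R sketch (all names are ours; markers are scored ±1 as in a DH population, and the haldane.d function defined earlier is assumed):

    # Derived interval markers (columns of M^g Lambda_E) for one chromosome:
    # M is the l x m_k matrix of marker scores; theta holds the m_k - 1
    # recombination fractions between adjacent markers
    derived.markers <- function(M, theta) {
      d <- haldane.d(theta)                  # interval lengths in Morgans
      lamE <- theta / (2 * d * (1 - theta))  # E(lambda) of (12.9.19)
      sapply(seq_along(theta), function(j) lamE[j] * (M[, j] + M[, j + 1]))
    }
    M <- matrix(sample(c(-1, 1), 20, TRUE), nrow = 5)   # 5 lines, 4 markers
    derived.markers(M, theta = c(0.10, 0.18, 0.25))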

Detection of QTLs: An outlier model

All intervals on the linkage map can be classified into two groups. The first group consists of the intervals not containing a QTL, and is large in number; the sizes of QTL effect for these intervals will be small. The second group is small in number and consists of the intervals that do contain a QTL; the sizes of QTL effect for these intervals will reflect the presence of a QTL. Thus these QTL effect sizes represent outliers in comparison with the majority of intervals as given by the first group, and an approach for detecting outliers may be used to select intervals for putative QTLs.

? present outlier detection in linear mixed models based on the alternative outlier model (?, ?), following from the unpublished PhD thesis of B. Gogel (?). The alternative outlier model assumes an inflated variance for those components which are outliers. However, to avoid the need to refit models, or to use approximations as in ?, a score statistic is proposed for QTL detection. The use of a score statistic means that the outlier model does not have to be fitted; the statistic is evaluated under the null hypothesis, that is, that there is no QTL in an interval. The selection procedure presented below is not unique. However, as is shown in the simulation study, the approach performs well in terms of Type I error rate and detection of QTLs, and has a small false discovery rate.

The first step involves fitting models both with and without the random regression effects for the sizes of QTL, and testing the significance of the random regression term. If it is significant, the process continues; if not, the process is terminated. Once the significance of the interval marker random regression is established, there is an outlier detection phase. One approach would be to find the interval that appears as the largest outlier. The alternative approach used here is to carry out a nested process, which performs well in practice.

The first part of the nested outlier determination is to evaluate a score based statistic to determine the most likely chromosome for a QTL. The rationale behind this first step is that not only will the specific interval that contains the QTL be inflated, but so will surrounding intervals. Thus the overall impact of each chromosome establishes the strength of evidence for at least one outlier on that chromosome, and the chromosome with the largest score statistic is chosen as the most likely location of a QTL. The score based statistic is then used to choose the most likely interval on the selected chromosome as the putative QTL. The selected interval is then transferred to the fixed effects part of the model. This process is repeated until the change in the residual log-likelihood for a fit with random QTL effects, compared to a fit without random QTL effects, is not significant.

The score statistic is based on an alternative outlier model (AOM). In the first instance this model is developed at the chromosome level; subsequently the statistic is specialized to a single interval. For chromosome k the AOM is

  a^o = a + E_k δ_k    (12.9.22)

where E_k = [0 I_{m_k−1} 0]^T and δ_k is a vector of random effects, assumed δ_k ∼ N(0, σ²γ_{a,k} I_{m_k−1}). This model modifies the sizes of the QTL effects for the intervals j = 1, 2, . . . , m_k − 1 on chromosome k by inflating the variance on that chromosome, and hence allowing larger predicted random size effects.
If this variance inflation is significant, it suggests that at least one QTL may be present on that chromosome. The outlier score statistic developed below is motivated by a test of the hypothesis H0 : γ_{a,k} = 0 against the one-sided alternative H1 : γ_{a,k} > 0. The full mixed model under (12.9.22) is given by

  y = Xτ + M_E a + M_E E_k δ_k + Z_g p + Zu + e    (12.9.23)

In order to develop a statistic that can be used to locate QTLs, we consider the score for γ_{a,k} evaluated at the null hypothesis, which is our whole genome interval fit. If P = H^{−1} − H^{−1}X(X^T H^{−1}X)^{−1}X^T H^{−1}, the score for γ_{a,k} under (12.9.23) when H0 is true is given by

  U_k(0) = −(1/2)[tr(P M_E E_k E_k^T M_E^T) − (1/σ²) y^T P M_E E_k E_k^T M_E^T P y]    (12.9.24)

If ã is the BLUP of the sizes a under (12.9.18), we define the vector

  w̃_k = E_k^T M_E^T P y = (1/γ_a) E_k^T ã = ã_k / γ_a    (12.9.25)

If

  C_{k,k} = E_k^T M_E^T P M_E E_k    (12.9.26)

then the score for γ_{a,k} at the null hypothesis is given by

  U_k(0) = (1/2)(w̃_k^T w̃_k / σ² − tr(C_{k,k}))
         = (tr(C_{k,k}) / 2)(t_k² − 1)    (12.9.27)

where

  t_k² = w̃_k^T w̃_k / (σ² tr(C_{k,k})).

Note that E(w̃_k^T w̃_k) = σ² tr(C_{k,k}). It can be shown that

  M_E^T P M_E = (1/γ_a) I_{m−c} − (1/γ_a²) C^{M_E M_E}

where C^{M_E M_E} is the component of C^{−1} relating to a (and hence is the prediction error variance matrix of ã), and where C is the coefficient matrix of the mixed model equations (see for example ?). Thus

  C_{k,k} = (1/γ_a) I_{m_k−1} − (1/γ_a²) C^{M_E M_E}_{k,k}

where C^{M_E M_E}_{k,k} is the kth diagonal block of C^{M_E M_E}, and is the prediction error variance matrix of ã_k. Thus only the prediction error variances (the diagonal elements) of the QTL sizes are required to calculate the statistic t_k². To determine which chromosome is likely to contain a QTL, the chromosome with the largest t_k² is selected. If there is no QTL on a chromosome, the score statistic has mean zero, and a "large" deviation of the observed score from zero indicates that a QTL may be present; thus a large t_k² suggests a QTL is present. If we replace E_k in the derivation of the score by a single column e_{k;j} that selects interval j on chromosome k (the selected chromosome), the score for each interval on chromosome k can be determined. In fact, for a single interval j on chromosome k, (12.9.27) can be written as

  U_{k;j}(0) = (c_{k;jj} / 2)(t_{k;j}² − 1)    (12.9.28)

where

  t_{k;j}² = w̃_{k;j}² / (σ² c_{k;jj})

and the t_{k;j}² reflect the importance of an interval with respect to a putative QTL. If t_{k;j}² is large, this suggests a QTL may be present in the interval, and hence the interval with the largest t_{k;j}² is chosen as the likely position of the QTL. The selected QTL interval is then moved to the fixed effects and the process repeated until the random effects QTL component is not significant. When the selection process concludes, all putative QTLs will appear as fixed effects. Thus if S putative QTLs are selected, the final model is

$$y = X\tau + \sum_{s=1}^{S}m_{E,s}a_s + Z_gp + Zu + e \qquad (12.9.29)$$

where $m_{E,s} = Z_gM_s^g\lambda_{E,s}$ is the appropriate vector for the $s$th putative QTL, $M_s^g$ is an $n\times 2$ matrix of the marker scores defining the interval, and $\lambda_{E,s}$ is the appropriate column of $\Lambda_E$.
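In practice the interval statistics are simple to compute once the whole-genome model has been fitted: only the BLUPs of the interval sizes, the diagonal of their prediction error variance matrix, and the REML estimates of $\sigma^2$ and $\gamma_a$ are needed. The R sketch below is our illustration only, with hypothetical input values standing in for quantities that would come from the fitted model; it assumes the prediction error variances are supplied on the same scale as $C^{M_EM_E}$.

# Interval-level outlier statistics (12.9.28) for a chromosome, given the BLUPs
# a_blup of the QTL sizes, the diagonal pev of the prediction error variance
# matrix C^{ME,ME}_{k,k}, and REML estimates sigma2 and gamma_a.
interval_score <- function(a_blup, pev, sigma2, gamma_a) {
  w   <- a_blup / gamma_a                 # w-tilde, from (12.9.25)
  cjj <- 1 / gamma_a - pev / gamma_a^2    # diagonal of C_{k,k}, from the identity above
  t2  <- w^2 / (sigma2 * cjj)             # t^2_{k;j}
  U0  <- cjj / 2 * (t2 - 1)               # score (12.9.28)
  data.frame(interval = seq_along(w), t2 = t2, score = U0)
}
# Hypothetical inputs: the second interval stands out as a putative QTL.
interval_score(a_blup = c(0.05, -0.62, 0.08), pev = c(0.021, 0.018, 0.022),
               sigma2 = 0.40, gamma_a = 0.30)

The interval with the largest $t_{k;j}^2$ on the selected chromosome would then be moved to the fixed effects and the whole-genome fit repeated, exactly as described above.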

Implementation

The approach presented above has been implemented by Simon Diffey of the New South Wales Department of Primary Industries. The implementation is written in R (?), is built on the qtl library (?), and uses the samm library (?) for model fitting. It was used in the simulation study and in the analysis of the Sunco × Tasman flour yield trials presented below.

12.9.1 Analysis of the Sunco × Tasman flour yield experiment

Our analysis commences with determining the appropriate non-genetic model for the flour yield data. ? present an approach to the analysis of quality trait data from so-called multi-phase experiments. Flour yield is measured in a two-phase (field and milling phases) experiment. Their modelling includes blocking factors that respect the randomization processes in the design phases, as well as terms accounting for other sources of non-genetic variation and for spatial and/or temporal correlation from the field and laboratory processes. They demonstrate that such modelling significantly increases the response to selection in traditional breeding analysis, and suggest that it should also enhance accurate detection of QTLs for quality traits. To simplify our approach, we exclude the QTL terms from the initial modelling, but then use the established phenotypic model as the baseline model in the outlier QTL detection methodology.

Tables 12.11 and 12.12 present a summary of the models fitted to the 1999 and 2000 data respectively. The initial model (M0) for 1999 can be written symbolically as

y ~ 1 + gfac + genotype + frep + column.row + day.order

where 1 represents a constant (or overall mean) term, gfac is a fixed factor with 10 levels, 1 to 9 for parental genotypes and commercial varieties and 10 for DH lines (thereby providing a baseline mean for the DH lines), genotype is a random factor with 184 levels (175 DH lines plus the 9 parental and commercial lines) that represents the polygenic effects, frep is a random factor indexing field replicates, column and row are factors indexing field columns and rows respectively, and day and order are factors indexing mill days and mill order within days. Terms like column.row are interactions.

Table 12.11 Estimates of variance parameters and Z-statistics for key fixed effects (rows labelled fixed in the Parameter column) for models fitted to the 1999 Sunco × Tasman flour yield data. Terms involving lin(order) are regression terms on the order of milling.

Term              Parameter    M0      M1      M2      M3      M4      M5      M6      M6+QTL
genotype          σ²γ_g        2.448   2.564   2.491   2.491   2.561   2.617   2.737   0.817
frep              σ²_r         0.297   0.169   0.159   0.070   0.084   0.057   0.120   0.122
column.row        σ²_p         0.172   0.158   0.108   0.317   0.292   0.383   0.345   0.360
                  ρ_c                                                          0.693   0.785
                  ρ_r                                                          0.849   0.884
day               σ²_md                0.205   0.215   0.314   0.259   0.000   0.240   0.236
day.lin(order)    σ²_m.o                                       0.433   0.381   0.406   0.343
                  ρ_m.o                                       -0.162  -0.155  -0.064   0.047
units             σ²           1.297   0.704   0.705   0.345   0.413   0.806   0.370   0.397
                  ρ_o                                          0.318   0.765   0.347   0.306
lin(order)        fixed                        -3.886  -2.462  -2.589  -3.006  -2.266  -2.236
Residual log-likelihood       -402.2  -350.9  -347.5  -338.8  -337.8  -340.1  -319.9  -267.5

Note that terms in the models listed in Tables 12.11 and 12.12 may have several associated parameters. Thus column.row represents variation at the plot level in the field and has a variance σ²_p associated with it, together with correlations ρ_c and ρ_r in the column and row directions to allow for spatial variation. The terms day and day.lin(order) are random intercepts and slopes that allow for mill day effects and linear effects within a mill day; these random effects are correlated, and hence the correlation ρ_m.o is included. Correlation also exists at the residual level, labelled units in Tables 12.11 and 12.12, between samples milled on the same day (ρ_o). Detailed discussion of these terms is presented in ?.

For the data from the 1999 trial, several outliers were identified and removed, with a substantial reduction in σ² (compare model M1 with model M0). Terms and covariance parameters were added and tested using the approach outlined in ?, leading to the sequence M0 to M6. The final model, M6, included terms for spatial correlation, temporal correlation within mill days, an overall regression on mill order, and a random regression for mill days on mill order. A similar approach was used for the 2000 data, where the final model included temporal correlation within mill days, mill days, and regressions on mill order and field row. The improvement in the fit of models for both the 1999 and 2000 trials can be seen in Tables 12.11 and 12.12 by examining the increase in residual log-likelihood and the reduction in residual variance σ² for the models listed.

Table 12.12 Estimates of variance parameters and Z-statistics for key fixed effects (rows labelled fixed in the Parameter column) for models fitted to the 2000 Sunco × Tasman flour yield data. Terms involving lin(order) and lin(row) are regression terms on the order of milling and on field row respectively.

Term              Parameter    M0      M1      M2      M3      M4      M4+QTL
genotype          σ²γ_g        2.086   1.997   1.996   1.971   1.923   0.292
frep              σ²_r         0.107   0.078   0.079   0.077   0.077   0.076
column.row        σ²_p         0.359   0.365   0.399   0.434   0.406   0.413
day               σ²_md                0.586   0.575   0.486   0.482   0.499
units             σ²           0.830   0.238   0.200   0.264   0.271   0.267
                  ρ_o                                  0.693   0.711   0.718
lin(order)        fixed                       -3.791  -3.242  -3.231  -3.435
lin(row)          fixed                                       -3.823  -4.234
Residual log-likelihood       -405.1  -322.9  -320.8  -311.5  -309.6  -222.3

These final models were then used as the baseline models for the QTL analysis using the outlier method. The final analysis, including all selected QTLs, is presented as model M6+QTL in Table 12.11 and model M4+QTL in Table 12.12. The estimated polygenic variance for these models is clearly greatly reduced when compared to the baseline model (M6 or M4), indicating the impact of the detected QTLs on the genetic component of flour yield.

Tables 12.13 and 12.15 present a summary of the QTLs identified for 1999 and 2000 respectively, using a 5% threshold for the likelihood ratio test. Using the full non-genetic modelling approach together with the outlier method resulted in 15 and 12 QTLs being identified for 1999 and 2000 respectively. Of these, a total of 9 could realistically be regarded as the same QTLs. The Z-statistic reflects the importance of the selected QTL (it is the estimate divided by its standard error).

The impact on the QTL analysis of incorporating the variation due to the environment, and also due to the experimental design and the conduct of the experiment, was also examined. The outlier QTL procedure was repeated using the simplistic model of genotype plus error, where these two terms are independently normally distributed with associated variance parameters. Thus the field variation and laboratory variation are condensed into a single variance parameter. The QTLs found under this simplistic model are presented in Tables 12.14 and 12.16. Only 6 and 7 QTLs were identified for the two years, with 5 in common. These results demonstrate the potential benefits of using the most efficient non-genetic models for these types of data.

Table 12.13 QTL summary for the 1999 Sunco × Tasman data for M6 with QTLs, from the QTL analysis using the outlier method. Ci Ij gives the chromosome and the interval on the chromosome of the QTL, while Dist (L, R) is the distance in cM along the chromosome of the left and right flanking markers for each QTL. The Z-statistic is the estimate of the size of the QTL divided by the standard error of the estimate; the corresponding LOD score is also given.

QTL   Ci Ij    Markers                   Dist L, R (cM)   Z statistic   LOD
1     3D I2    (gwm71a, P32.M48.280)     0.0, 2.1         -2.54         1.40
2     4B I12   (wmc47, gwm6)             35.5, 41.4        2.62         1.49
3     5D I8    (gwm292, cfd19b)          117.9, 132.1     -2.86         1.78
4     4A I14   (wmc313, P33.M76.2)       154.7, 157.6     -2.46         1.31
5     4D I4    (almt1, wmc48b)           18.9, 30.0        2.53         1.39
6     2B I10   (ksuD22, wmc149)          61.8, 95.9        2.44         1.29
7     1B I4    (ksuD14, Glu.B3)          10.7, 11.0        2.08         0.94
8     7D I4    (wmc94, gwm121)           94.0, 98.2        2.10         0.96
9     6D I6    (sun5b, P39.M50.149)      50.0, 51.2       -2.68         1.56
10    1D I5    (abc156, cfd19a)          36.6, 68.9       -3.09         2.07
11    6B I4    (cdo1380, P39.M49.142)    5.1, 7.2         -4.06         3.58
12    3D I13   (gwm3, psr931)            161.5, 175.3      3.54         2.72
13    5A I13   (wg232c, PAACTelo2)       90.6, 95.1       -3.71         2.99
14    1B I16   (ksuI27a, P36.M37.2)      247.3, 256.7     -3.53         2.71
15    2B I5    (gwm515a, wmc474USQ)      51.2, 54.8        4.75         4.90

The results using the outlier method can also be compared with those of ?. These authors identified only 3 and 4 QTLs for 1999 and 2000 respectively, with 3 QTLs in common. Importantly, the outlier method identified the same QTLs as ?, but in addition detected further QTLs, most of which were common across the two years. This is a remarkable result, given that ? used efficient non-genetic models in their analyses but used the method due to ? for QTL identification.

12.10 Conclusions

The outlier approach to QTL detection presented in this chapter provides a mechanism to incorporate all intervals on a linkage map into a single model. The model is based on an extension of the simple interval mapping approach of ? in the regression setting. A major difference is the notion of a working model, which is developed only for detection; this working model does not represent the underlying genetics. However, at the end of what is a multi-stage selection process, the result is a simple genetic model. A further difference with

Table 12.14 QTL summary for 1999 Sunco × Tasman data for M6 with QTLs and QTL analysis using the outlier method but with no modelling of non-genetic effects. The QTL number matches the labelled QTLs in Table 12.13.

QTL   Ci Ij    Markers                   Dist L, R (cM)   Z statistic   LOD
11    6B I9    (gwm626, barc24)          12.1, 21.6       -2.62         1.49
8     7D I4    (wmc94, gwm121)           94.0, 98.2        1.96         0.83
10    1D I5    (abc156, cfd19a)          36.6, 68.9       -2.63         1.50
13    5A I13   (wg232c, PAACTelo2)       90.6, 95.1       -3.59         2.80
14    1B I14   (gwm11, gwm140)           90.9, 239.5      -3.09         2.07
15    2B I5    (gwm515a, wmc474USQ)      51.2, 54.8        7.18         11.19

Table 12.15 QTL summary for the 2000 Sunco × Tasman data for M4 with QTLs, from the QTL analysis using the outlier method. Ci Ij gives the chromosome and the interval on the chromosome of the QTL, while Dist (L, R) is the distance in cM along the chromosome of the left and right flanking markers for each QTL. The Z-statistic is the estimate of the size of the QTL divided by the standard error of the estimate; the corresponding LOD score is also given.

QTL   Ci Ij    Markers                   Dist L, R (cM)   Z statistic   LOD
1     2D I22   (gwm349, gwm301)          162.5, 178.3      2.18         1.03
2     2A I8    (wmc198, wmc170)          29.7, 40.9       -2.71         1.59
3     3D I6    (TeloPAGG2, TeloPAGG1)    55.0, 61.2       -2.43         1.28
4     4A I3    (germin, cdo795)          10.3, 11.4        2.76         1.65
5     1B I2    (gwm550, P36.M67.1)       0.0, 9.0          2.96         1.90
6     5A I14   (PAACTelo2, P46.M37.4)    95.1, 102.1      -4.58         4.55
7     6B I6    (cdo507, barc354)         8.9, 9.4         -6.47         9.09
8     1B I14   (gwm11, gwm140)           90.9, 239.5      -5.53         6.64
9     4B I2    (barc193, csME1)          0.0, 12.0        -5.40         6.33
10    4D I2    (Rht2.mut, csME2)         0.0, 1.8          4.84         5.09
11    7D I3    (gwm437, wmc94)           86.6, 94.0        4.26         3.94
12    2B I6    (wmc474USQ, wmc35a)       54.8, 59.6        10.74        25.05

Table 12.16 QTL summary for 2000 Sunco × Tasman data for M4 with QTLs and QTL analysis using the outlier method, but without modelling non-genetic variation. The QTL number matches the labelled QTLs in Table 12.15.

QTL   Ci Ij    Markers                   Dist L, R (cM)   Z statistic   LOD
8     1B I14   (gwm11, gwm140)           90.9, 239.5      -3.01         1.97
6     5A I14   (PAACTelo2, P46.M37.4)    95.1, 102.1      -4.16         3.76
10    4D I2    (Rht2.mut, csME2)         0.0, 1.8          3.56         2.75
7     6B I6    (cdo507, barc354)         8.9, 9.4         -5.33         6.17
9     4B I2    (barc193, csME1)          0.0, 12.0        -4.40         4.20
12    2B I6    (wmc474USQ, wmc35a)       54.8, 59.6        8.01         13.93
11    7D I4    (wmc94, gwm121)           94.0, 98.2        4.41         4.22

interval mapping and other approaches is that the exact location of a QTL within an interval is not estimated.

The effectiveness of the random effects formulation for selecting putative QTLs reflects the stability offered by this model: while QTL sizes are shrunk towards zero, these sizes together with the outlier approach are able to isolate genuinely important effects very effectively. Rather than requiring a LOD score, a sequential test procedure based on the residual likelihood ratio statistic allows a genome-wide assessment of significance. The simulation studies presented show that, while a theoretical justification for a critical value for a specified Type I error rate is not available, a simple approach of using a 5% test at each stage leads to Type I error rates that are very close to the nominal 5%. In addition, the simulations show that the outlier approach is able to detect more genuine QTLs, with only small increases in the rate of false positives.

The regression approach for QTL analysis also allows non-genetic effects to be included in the analysis in a routine fashion. Thus the outlier method for QTL analysis is available in complex experimental situations. The Sunco × Tasman doubled haploid wheat population data from trials in 1999 and 2000 provide an example of the ability of the method to incorporate important sources of non-genetic variation. Flour yield is a complex trait, and the outlier approach enables the determination of many putative QTLs. The number, both for individual trials and in common across trials, far surpasses the results from previous QTL analyses.

The impact of non-genetic variation on the detection of QTLs can be substantial. The analysis of flour yield for the Sunco × Tasman population highlighted the improvement possible from incorporating the multi-phase nature of the data generation process. More genuine QTLs and fewer extraneous QTLs should be the result. Thus the manner in which phenotypic data or trait data are generated must be understood and incorporated in the analysis.

The approach presented is currently being extended to more complex situations. A multi-environment approach that allows QTL × environment interactions to be examined, multi-trait QTL analysis, and the impact of treatments on the expression of QTLs are some of the new developments in progress.

CHAPTER 13

Mixed models for penalized models

13.1 Introduction

We have seen that mixed models are very flexible and provide a mechanism for modelling in many situations. There are other approaches that introduce structure into a statistical model, and one such approach is based on the idea of a penalty function. In this chapter we consider regression models for data with a quantitative explanatory variable, and in particular examine some methods for providing a smooth representation of the regression in a non-parametric fashion. There is a link between some of these models and mixed models, and this is developed in the chapter. This link allows penalized regression models to be fitted using standard mixed model software, a powerful tool.

The idea of penalizing an objective function is an old one. Whittaker (1923) considered using a discrete third difference penalty for equally spaced data to smooth in a non-parametric manner. Since this first paper the literature on such methods has grown to be very large. Discrete penalties have been discussed by a number of authors, notably Eilers and Marx (1996), who introduced P-splines. Penalized regression splines are discussed in detail by Ruppert et al. (2003). The literature on spline smoothing is enormous, with much work by Grace Wahba. Books by Wahba (1990), Green and Silverman (1994b), and, on functional data analysis, Ramsay and Silverman (1997) provide details and extensions. We will touch on some of the ideas and methods in this chapter. The extension to additive models (Hastie and Tibshirani (1990)), smoothing spline ANOVA (Gu (2002)), and the related mixed model approaches in this area (Verbyla et al. (1999b), Brumback and Rice (1998b), Zhang et al. (1998), Wang (1998), Wand (2003)) provide additional flexibility to model in the presence of treatment and additional structure.

To illustrate the ideas and methods we consider some data from a study on wheat flour quality. We thank Rudi Appels and Chris Rath for providing the data. A mixograph is a machine that mixes flour and water to produce dough. The mixograph used to generate the data considered here has a series of fixed pins and two rotating heads with two pins each. The mathematical properties of the mixing trajectories are well understood. The aim of this study was the identification of structure in the mixing process, and ultimately to provide a mechanism for comparing different flours in terms of their mixing process. The power required to drive the machine at a constant number of revolutions per minute is measured at regular intervals throughout the mixing process (at intervals of 0.0025 seconds). A portion of the data obtained in a study on a

particular wheat variety (a bakers quality flour was used in the mixograph) is presented in Figure 13.1. The portion of data is from 0.2 to 0.4 seconds into the mixing process (the full data set consists of approximately 240,000 observations). The power required to drive the mixograph shows a general increase over time in Figure 13.1. There is additional structure in the data, as can be seen by careful examination.

The data generated by the mixing process could be taken to be correlated. There is a certain duality between trend and correlation, and the models considered in this chapter specifically assume that trend is important and that correlation is present because of trend. It may be the case that additional correlation exists at the residual or error level. This is examined as we proceed with the various analysis methods.

Figure 13.1 Mixograph data: bakers wheat dough development. The force required to drive the mixograph at a constant revolutions per minute, recorded at time intervals of 0.0025 seconds.


The times used in this chapter are translated to (0, 0.2) to simplify some of the mathematical and statistical detail. Our aim is to detect patterns in the data in an exploratory analysis that allows the process to be quantified.

In this chapter, we begin by considering several methods of smoothing data that have a single quantitative explanatory variable; in the example this is time in seconds. The underlying approach to smoothing involves the use of a penalty function. We motivate the use of penalties by first considering a simple so-called hard-edged constraint in the context of a linear model. Relaxing the hard-edged constraint to something softer is then presented. This naturally gives rise to a quadratic penalty function that forms the basis of most of the methods considered in this chapter. Thus we proceed to penalized regression splines, P-splines, natural polynomial smoothing splines and then L-splines. The mixograph data is used as each method is presented. Although this data is not ideal for some of these methods, the data set does highlight that we must be careful not to use methods blindly.

An important aspect of this chapter is the connection between methods using penalties and mixed models. This conceptual link allows estimation to be conducted within the residual likelihood (REML) and best linear unbiased prediction (BLUP) paradigm. This is a very powerful vehicle, but there are arguments regarding the selection of the underlying smoothing parameter: typically cross-validation is used to select this parameter, whereas the methods presented here use REML. Furthermore, there is debate on whether inferential aspects of mixed models can be used formally for penalized models. While these issues are important, the ease with which mixed model technology can be used to fit complex models involving penalties makes it a very attractive approach, and it is this approach that we pursue in this chapter.

13.2 Hard-edge constraints

A very simple (and possibly unrealistic) model for the data of Figure 13.1 is a straight line. To set the notation for this chapter, let $y$ denote the $n\times 1$ vector of responses, taken at times $t_i$ such that $a \le t_1 < t_2 < \cdots < t_n \le b$ for some $a$ and $b$. Let the function $g(t_i)$ be the mean response at time $t_i$. If $g$ is the vector of $g(t_i)$, we assume that

$$y = g + e \qquad (13.2.1)$$

If the function $g(\cdot)$ is linear in $t$, we can write

$$g = X\tau \qquad (13.2.2)$$

where $X$ is an $n\times 2$ matrix and $\tau$ has two parameters, the intercept and the slope. Because the observations are taken over time, they constitute repeated measurements and they are likely to be correlated. Thus for some covariance matrix $\sigma^2R$ depending on a scale parameter $\sigma^2$ and a parameter vector $\phi$, we assume $e \sim N(0, \sigma^2R(\phi))$.

To motivate the approach using penalties, we begin by considering the simple regression problem in a non-conventional way; see Ramsay and Silverman (1997). Rather than specify the linear regression using (13.2.2), we specify the model using constraints. Thus if $g(\cdot)$ is linear, there exists a matrix $\Delta_2$ such that $\Delta_2^Tg = 0$. If the $t_i$ are equally spaced, the matrix $\Delta_2$ can be chosen to take second order differences. Under the linear model (13.2.2), $\Delta_2^TX = 0$.

Under the constraints, use of a Lagrangian is appropriate, and we maximise the log-likelihood (for given $R$), namely

$$PLL(g, \lambda) = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y-g)^TR^{-1}(y-g) + 2\lambda^T\Delta_2^Tg\right\} \qquad (13.2.3)$$

Differentiating with respect to $g$ and $\lambda$ we have

$$\frac{\partial PLL}{\partial g} = \frac{1}{\sigma^2}\left\{R^{-1}(y-g) - \Delta_2\lambda\right\}, \qquad \frac{\partial PLL}{\partial \lambda} = -\frac{1}{\sigma^2}\Delta_2^Tg$$

and hence obtain the adjusted normal equations

$$\begin{bmatrix} R^{-1} & \Delta_2 \\ \Delta_2^T & 0 \end{bmatrix}\begin{bmatrix} \hat{g} \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} R^{-1}y \\ 0 \end{bmatrix} \qquad (13.2.4)$$

To solve (13.2.4) note that

$$\begin{bmatrix} R^{-1} & \Delta_2 \\ \Delta_2^T & 0 \end{bmatrix}^{-1} = \begin{bmatrix} R - R\Delta_2(\Delta_2^TR\Delta_2)^{-1}\Delta_2^TR & R\Delta_2(\Delta_2^TR\Delta_2)^{-1} \\ (\Delta_2^TR\Delta_2)^{-1}\Delta_2^TR & -(\Delta_2^TR\Delta_2)^{-1} \end{bmatrix}$$

so that

$$\begin{bmatrix} \hat{g} \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} \left(I - R\Delta_2(\Delta_2^TR\Delta_2)^{-1}\Delta_2^T\right)y \\ (\Delta_2^TR\Delta_2)^{-1}\Delta_2^Ty \end{bmatrix}$$

Now, using the result

$$R - R\Delta_2(\Delta_2^TR\Delta_2)^{-1}\Delta_2^TR = X(X^TR^{-1}X)^{-1}X^T \qquad (13.2.5)$$

we see that

$$\hat{g} = X(X^TR^{-1}X)^{-1}X^TR^{-1}y = X\hat{\tau}$$

so that our estimate of $g$ is the fitted straight line at the design points. Thus under the straight line model $E(\hat{g}) = X\tau = g$, so that the estimator is unbiased, while

$$\mathrm{var}(\hat{g}) = \sigma^2X(X^TR^{-1}X)^{-1}X^T$$

The constraint imposed forces the model to be linear in $t$. This is a global constraint and is called "hard" by Ramsay and Silverman (1997). We now consider a way to soften the constraint.
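The equivalence in (13.2.5) is easy to verify numerically. The following R sketch is our illustration only, with simulated data standing in for the mixograph series and R = I; it solves the adjusted normal equations (13.2.4) directly and compares the result with the generalized least squares straight-line fit.

# Solve the adjusted normal equations (13.2.4) and compare with the GLS line.
set.seed(1)
n  <- 20
tt <- seq(0, 1, length.out = n)              # equally spaced design points
y  <- 2500 + 5 * tt + rnorm(n, sd = 0.5)
Rm <- diag(n)                                # R = I for simplicity
X  <- cbind(1, tt)
D2 <- t(diff(diag(n), differences = 2))      # Delta_2: columns take 2nd differences
A  <- rbind(cbind(solve(Rm), D2),
            cbind(t(D2), matrix(0, n - 2, n - 2)))
sol   <- solve(A, c(solve(Rm, y), rep(0, n - 2)))
g_hat <- sol[1:n]
tau_hat <- solve(t(X) %*% solve(Rm, X), t(X) %*% solve(Rm, y))
max(abs(g_hat - X %*% tau_hat))              # essentially zero

Note that the check relies on the design points being equally spaced, so that the second differences of the columns of X are exactly zero.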

13.2.1 Mixograph data

The analysis of the mixograph data is conducted assuming independence and constant variance for the residual or error components, as in the previous section. Figure 13.2 presents the data with the straight line fit. Using the methods discussed in the chapter on geostatistics, a variogram of the residuals is presented in Figure 13.3. This variogram shows that there is a periodic nature to the residuals. This is not surprising given the underlying mixing process. As mixing proceeds, dough is formed and the mixograph goes through a process of breaking down the dough and mixing the subsequent reduced mixture. For the time period considered here, the process induces an approximately periodic pattern.

Figure 13.2 Straight line fit assuming independence and constant variance of residual errors


Figure 13.3 Sample variogram for the residuals of the linear fit for the mixograph data.


We can highlight the cyclic or periodic nature of the data first by providing a plot with the points joined by lines, as in Figure 13.4; there is clearly a high frequency periodic trend. Secondly, we can examine the residuals from the linear fit as a line plot, as in Figure 13.5. This further reinforces that a high frequency periodic trend exists, but it also suggests that a lower frequency periodic effect may be present.

Figure 13.4 Line plot of mixograph data


Figure 13.5 Line plot of residuals from the linear fit for the mixograph data


To examine the cyclic patterns, a periodogram (Diggle (1990)) of the residuals from the linear fit is presented in Figure 13.6; it highlights two strong Fourier frequencies, namely ω = 0.125 and ω = 0.2. As mentioned above, these cyclic patterns are also evident in Figure 13.5. Note that if a fixed effects model with periodic terms were contemplated, we would have the model

$$g(t) = \beta_0 + \beta_1t + \beta_{1s}\sin 2\pi\omega_1t^* + \beta_{1c}\cos 2\pi\omega_1t^* + \beta_{2s}\sin 2\pi\omega_2t^* + \beta_{2c}\cos 2\pi\omega_2t^* \qquad (13.2.6)$$

where $t^* = t/0.0025$, and $\omega_1$ and $\omega_2$ correspond to 0.125 and 0.2 respectively in Figure 13.6.
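Once the sines and cosines are formed, (13.2.6) is an ordinary linear regression. A minimal R sketch follows (our illustration only; a simulated series with the same per-sample frequencies stands in for the mixograph data, and the generating coefficients are arbitrary).

# Fit the fixed-effects periodic model (13.2.6) by least squares.
set.seed(2)
n     <- 400
tstar <- 1:n                               # t* = t / 0.0025 is the sample index
y     <- 2500 + 0.01 * tstar + 0.5 * sin(2 * pi * 0.125 * tstar) +
         0.3 * cos(2 * pi * 0.2 * tstar) + rnorm(n, sd = 0.3)
fit <- lm(y ~ tstar + sin(2 * pi * 0.125 * tstar) + cos(2 * pi * 0.125 * tstar) +
              sin(2 * pi * 0.2 * tstar) + cos(2 * pi * 0.2 * tstar))
coef(fit)                                  # recovers the generating coefficients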

Figure 13.6 Sample periodogram of the residuals from the linear fit to the mixograph data


The methods of this chapter need to model both of these cyclic or periodic components. Notice that the periodic components do not appear to be entirely deterministic, so the methods that follow aim to allow for departures from strict Fourier terms.

13.3 Soft-edge constraints

Suppose we would like our fitted model to be close to a straight line, if appropriate, rather than exactly a straight line. By close we mean in terms of some distance measure. Thus let $d = \Delta_2^Tg$ and consider the squared distance $d^Td$. Suppose we wish this distance measure to be small, and the residual sum of squares to be small as well. Then we might consider choosing $g$ to maximize the penalised log-likelihood

$$PLL(g, \lambda) = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y-g)^TR^{-1}(y-g) + \lambda d^Td\right\} \qquad (13.3.7)$$

Notice that the Lagrangian $\lambda$ is now a scalar, and that $\lambda$ controls the interplay between the residual sum of squares and the distance measure. If $\lambda$ is large, we weight heavily in favour of a straight line, whereas if $\lambda$ is small, fidelity of the fitted model to the data is weighted more heavily.

The penalty can be written as

$$J(g) = d^Td = g^T\Delta_2\Delta_2^Tg = g^TKg$$

which is a quadratic form in $g$. This form of the penalty leads to very different estimation and properties than the hard constraint case. Notice that the $n\times n$ matrix $K$ is symmetric, but is of rank $n-2$. This reflects the fact that we are penalising departures from the rank 2 linear representation. Differentiating $PLL$ we find

$$\frac{\partial PLL}{\partial g} = \frac{1}{\sigma^2}\left\{R^{-1}(y-g) - \lambda Kg\right\}$$

and equating to zero we see that

$$\tilde{g} = (R^{-1} + \lambda K)^{-1}R^{-1}y$$

Now,

$$E(\tilde{g}) = (R^{-1} + \lambda K)^{-1}R^{-1}g, \qquad \mathrm{var}(\tilde{g}) = \sigma^2(R^{-1} + \lambda K)^{-1}R^{-1}(R^{-1} + \lambda K)^{-1}$$

Thus $\tilde{g}$ is a biased estimator of $g$; it is in fact a shrinkage estimator. The bias can be reduced by decreasing $\lambda$; a consequence, however, is an increase in the variance of the estimator. If precision is important, $\lambda$ should be increased, in which case the bias increases. Thus the choice of $\lambda$ governs the trade-off between bias and precision.

Note that as $\lambda \to \infty$, we have

$$\tilde{g} \to X(X^TH^{-1}X)^{-1}X^TH^{-1}y$$

so that in the limit we are fitting a straight line. Furthermore, as $\lambda \to 0$,

$$\tilde{g} \to X(X^TH^{-1}X)^{-1}X^TH^{-1}y + Z_2(Z_2^TR^{-1}Z_2)^{-1}Z_2^TPy = y$$

using (13.2.5), so that we interpolate the data.

The form of $K$ allows the estimate to be written in a different and informative manner. First note the matrix identity

$$(A + BCD)^{-1} = A^{-1} - A^{-1}B(DA^{-1}B + C^{-1})^{-1}DA^{-1}$$

Using this identity twice, together with the identity (13.2.5), we can show (see Verbyla et al. (1999b), p. 298)

$$\tilde{g} = X\hat{\tau} + \lambda^{-1}Z_2Z_2^TPy \qquad (13.3.8)$$

where

$$\hat{\tau} = (X^TH^{-1}X)^{-1}X^TH^{-1}y, \qquad Z_2 = \Delta_2(\Delta_2^T\Delta_2)^{-1}, \qquad H = R + \lambda^{-1}Z_2Z_2^T$$

and

$$P = H^{-1} - H^{-1}X(X^TH^{-1}X)^{-1}X^TH^{-1}$$

The form in (13.3.8) suggests the solution could be obtained via a mixed model. If

$$g = X\tau + Z_2u_2 \qquad (13.3.9)$$

with $u_2 \sim N(0, \sigma_u^2I_{n-2})$ and $\lambda = \sigma^2/\sigma_u^2$, then estimation using REML and prediction using BLUP will yield (13.3.8). Thus the model with a penalty can be fitted using a mixed model, with a suitable form for $Z_2$ and a random effect $u_2$.
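A direct implementation of the shrinkage estimator takes a few lines of R. This sketch is our illustration only (simulated data, R = I, and a fixed λ rather than a REML estimate); it shows the limiting straight-line behaviour as λ grows.

# Soft-constraint estimator g~ = (R^-1 + lambda K)^-1 R^-1 y, K = Delta_2 Delta_2'.
set.seed(3)
n   <- 50
tt  <- seq(0, 1, length.out = n)
y   <- 2 + tt + sin(2 * pi * tt) + rnorm(n, sd = 0.2)
D2  <- t(diff(diag(n), differences = 2))     # Delta_2
K   <- D2 %*% t(D2)
gfit <- function(lambda) solve(diag(n) + lambda * K, y)   # R = I
g_wiggly <- gfit(1)       # small lambda: follows the data closely
g_line   <- gfit(1e8)     # large lambda: essentially the least squares line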

13.3.1 Mixograph data

The penalized model was fitted to the mixograph data using REML. The variance component for the second difference random effects converged to zero. Thus the smoothing parameter goes to infinity, and hence the fit is a straight line, as in the previous subsection. This simple penalty is therefore unable to provide estimation of the two periodic components present in the data.

13.4 Penalized Regression splines

In the literature on splines, basis functions play a prominent part. We begin with a simple approach that is advocated by Ruppert et al. (2003) and that introduces the truncated power basis.

13.4.1 Truncated Power Function Basis

Penalized regression splines are built on a base polynomial, together with a power function basis that facilitates smoothing. There are several aspects that need to be selected in fitting penalized regression splines. The first of these is the number of knots and their values. Thus, suppose there are $r$ knots at locations $x_i$ defined by $a < x_1 < \cdots < x_r < b$. A penalized spline also involves a polynomial of degree $k$ and so-called truncated power functions. The truncated power functions introduce smooth departures from the polynomial in a very simple fashion. Ruppert et al. (2003) provide a simple justification for the power functions by building on the broken stick model, which can be represented using a truncated power model of degree 1. The function $g(\cdot)$ is represented by

$$g(t) = \sum_{j=0}^{k}\tau_{Tj}t^j + \sum_{i=1}^{r}u_{Ti}(t - x_i)_+^k$$

where the truncated function is defined as $(a)_+ = a$ for $a \ge 0$ and $0$ otherwise. These truncated functions arise for other spline models, as we shall see below. In matrix terms, the model at the $t_i$ is given by

$$g = X_T\tau_T + Z_Tu_T \qquad (13.4.10)$$

The terms in (13.4.10) are $X_T$, an $n\times(k+1)$ matrix whose $(i,j)$th element is $t_i^j$ (for $j = 0, 1, \ldots, k$); $\tau_T$, a vector whose $j$th element is $\tau_{Tj}$; $Z_T$, an $n\times r$ matrix with $(i,j)$th element $(t_i - x_j)_+^k$; and $u_T$, the vector of the $u_{Ti}$. The penalised log-likelihood to be maximized is

$$PLL_T = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - X_T\tau_T - Z_Tu_T)^TR^{-1}(y - X_T\tau_T - Z_Tu_T) + \lambda_Tu_T^Tu_T\right\} \qquad (13.4.11)$$

This penalized log-likelihood incorporates a simple ad-hoc penalty which is designed to shrink the estimator and hence provide a scatterplot smoother. The link with mixed models is immediate if we assume $u_T \sim N(0, \sigma_T^2I_r)$, with $\lambda_T = \sigma^2/\sigma_T^2$.

The truncated power function basis can be numerically unstable in computation, and alternative basis functions are often used in its place. In the following section we turn to P-splines, which are built on the B-spline basis; this basis is related to the truncated power function basis, but B-splines are computationally more stable and are easy to generate.

13.4.2 Mixograph data

There are a number of "parameters" that need to be set to fit penalized regression splines: the number and location of the knots, and the degree of the polynomial and hence of the truncated power function. For the mixograph data, knots were equally spaced and were 11, 21 and 41 in number. The polynomial degree was taken as 1, 2 and 3, in combination with the various numbers of knots. In all cases, the variance component for the power function penalty converged to zero. Thus this method failed to detect the periodicities in the data under automatic estimation of the variance component $\sigma_T^2$ using REML. A straight line fit results.
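Constructing the basis in (13.4.10) and solving the penalized least squares problem for a fixed λ_T is straightforward; the following R sketch is our illustration only (simulated data, R = I, and an arbitrary fixed λ_T rather than a REML estimate).

# Truncated power basis of degree k with r interior knots, fitted with the
# ad-hoc ridge penalty of (13.4.11) at a fixed value of lambda_T.
set.seed(4)
n     <- 100
tt    <- seq(0, 1, length.out = n)
y     <- 2 + tt + 0.5 * sin(4 * pi * tt) + rnorm(n, sd = 0.1)
k     <- 1
knots <- seq(0.1, 0.9, length.out = 9)
XT <- outer(tt, 0:k, `^`)                                  # polynomial part
ZT <- outer(tt, knots, function(t, x) pmax(t - x, 0)^k)    # truncated power part
lambda_T <- 10
Cmat <- rbind(cbind(crossprod(XT), crossprod(XT, ZT)),
              cbind(crossprod(ZT, XT), crossprod(ZT) + lambda_T * diag(length(knots))))
coefs <- solve(Cmat, c(crossprod(XT, y), crossprod(ZT, y)))
ghat  <- cbind(XT, ZT) %*% coefs                           # fitted smooth

Only the truncated power coefficients are penalized, so the polynomial part is left unshrunk, exactly as in the mixed model formulation.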

13.5 P-splines

Eilers and Marx (1996) introduced P-splines: smoothers that are based on the B-spline basis functions and a difference penalty. B-splines are constructed from piecewise polynomials that are joined at chosen knot points. As for the penalized regression splines, we suppose we have $r$ distinct knot points $x$. We begin with a brief review of B-splines and then present details of estimation based on the penalized log-likelihood. A basic reference for B-splines is de Boor (1978).

13.5.1 B-splines

Because B-splines are based on polynomials, any basis or set of B-splines depends on the degree of the polynomial selected; as for penalized regression splines, we denote the degree of the underlying (piecewise, in this case) polynomial by $k$. With $r$ knots there is a requirement to select an additional $k$ knots below $a$ and another $k$ knots above $b$; the placement of these knots is arbitrary.

B-splines can be generated from the truncated power function basis: B-splines of order $k$ can be constructed from $k$th divided differences of the truncated power function basis of order $k$ on the same knots. B-splines can also be generated using simple recurrence relations. Thus if $B_i(t; k, x)$ is the $i$th B-spline, $i = -(k-1), \ldots, 0, 1, \ldots, r-1$, we have $r + k - 1$ B-splines in the full set. $B_i(t; k, x)$ is the value of a polynomial of degree $k$ at the point $t$, for the $i$th B-spline. In fact

$$B_i(t; 1, x) = \begin{cases} 1 & x_i \le t < x_{i+1} \\ 0 & \text{otherwise} \end{cases}$$

and

$$B_i(t; j+1, x) = \frac{t - x_i}{x_{i+j} - x_i}B_i(t; j, x) + \frac{x_{i+j+1} - t}{x_{i+j+1} - x_{i+1}}B_{i+1}(t; j, x), \qquad j > 0$$

de Boor (1978) provides a comprehensive account of the properties of B-splines.
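A direct, if inefficient, implementation of this recurrence is a useful check. The R sketch below is our illustration only: ord = j + 1 denotes the order, so ord = k + 1 gives a degree-k B-spline, the knot vector must already include the extra knots described above, and the result can be compared against the columns produced by splines::splineDesign.

# Cox-de Boor recurrence for a single B-spline evaluated at t (scalar or vector).
bspline <- function(t, i, ord, x) {
  if (ord == 1) return(as.numeric(x[i] <= t & t < x[i + 1]))
  a1 <- if (x[i + ord - 1] > x[i]) (t - x[i]) / (x[i + ord - 1] - x[i]) else 0
  a2 <- if (x[i + ord] > x[i + 1]) (x[i + ord] - t) / (x[i + ord] - x[i + 1]) else 0
  a1 * bspline(t, i, ord - 1, x) + a2 * bspline(t, i + 1, ord - 1, x)
}
x  <- seq(-3, 13, by = 1) / 10          # an extended knot sequence
tt <- seq(0, 0.99, length.out = 50)
B1 <- bspline(tt, i = 4, ord = 4, x)    # a cubic B-spline supported on x[4..8]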

13.5.2 Penalized B-splines: P-splines

The P-splines approach involves using B-splines to represent the underlying function. Thus the vector of function values at the design points $t_1, t_2, \ldots, t_n$ is $g = Ba$, where the vector $a$ contains unknown coefficients. The penalized log-likelihood that forms the basis of the P-spline approach is

$$PLL_b = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - Ba)^TR^{-1}(y - Ba) + \lambda_ba^T\Delta_d\Delta_d^Ta\right\} \qquad (13.5.12)$$

where $\Delta_d$ is a differencing matrix of order $d$ and hence is of size $(r+k-1)\times(r+k-1-d)$.

Eilers (1999) shows that a mixed model representation is possible for P-splines. The projection matrix defined by $\Delta_d$ is decomposed as

$$I_{r+k-1} - \Delta_d(\Delta_d^T\Delta_d)^{-1}\Delta_d^T = LL^T$$

where $L$ is an $(r+k-1)\times d$ matrix of full column rank. Defining $\tau_b = L^Ta$, $Z = B\Delta_d(\Delta_d^T\Delta_d)^{-1}$ and $u_b = \Delta_d^Ta$, we see that

$$a = L\tau_b + \Delta_d(\Delta_d^T\Delta_d)^{-1}u_b$$

and a mixed model is given by

$$y = BL\tau_b + Zu_b + e \qquad (13.5.13)$$

As Eilers (1999) notes, $BL$ is a transformation of a polynomial of degree $d-1$ on the vector $t$ of design points, and this polynomial is represented exactly by the B-spline basis. Further, we take $u_b \sim N(0, \sigma_b^2I_{r+k-1-d})$ and $\lambda_b = \sigma^2/\sigma_b^2$. The penalized log-likelihood is then of the form required to obtain the mixed model equations, and again REML and BLUP can be used to estimate the P-spline.
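For a fixed λ_b the P-spline fit again reduces to penalized least squares. The following sketch is our illustration only (simulated data, R = I, equally spaced knots, cubic B-splines built with splines::splineDesign, and a second-order difference penalty).

# P-spline: B-spline basis plus a difference penalty, at a fixed lambda_b.
library(splines)
set.seed(5)
n  <- 200
tt <- seq(0, 1, length.out = n)
y  <- sin(2 * pi * tt) + rnorm(n, sd = 0.2)
k  <- 3; r <- 21
knots <- seq(-k, r + k, by = 1) / r            # k extra knots on each side of [0, 1]
B  <- splineDesign(knots, tt, ord = k + 1)     # n x (length(knots) - ord) basis
Dd <- diff(diag(ncol(B)), differences = 2)     # rows take 2nd differences of a
lambda_b <- 10
a  <- solve(crossprod(B) + lambda_b * crossprod(Dd), crossprod(B, y))
ghat <- B %*% a                                # fitted P-spline at the design points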

13.5.3 Mixograph data

P-splines allow the flexibility of choosing the degree of the underlying piecewise polynomials, and the number of knots and their location. As with penalized regression splines, we considered 11, 21 and 41 equally spaced knots, and $d = 0, 1, 2$. For $d = 0$ there are no fixed effects in the mixed model (although for fitting we include the constant). The case $d = 0$ performed best in following the trend (see Figure 13.7, where the fitted function is based on 41 knot points), but the fit is not smooth, being piecewise linear. Second order differencing was too severe, recovering only the linear trend. However, first order differencing recovers the linear plus the broad cyclic trend, but very little of the high frequency signal; see Figure 13.8.

Figure 13.7 P-spline with 41 equally spaced knots and zero degree differencing in the penalty


The plots of the sample variogram and residuals for the d = 0 case are presented in Figures 13.9 and 13.10 respectively. Much of the cyclic nature of the data has been accommodated, but there are still cyclic patterns evident. The

Figure 13.8 P-spline with 21 equally spaced knots and first degree differencing in the penalty


d = 1 fit exhibits the high frequency cyclic pattern in the sample variogram and the residual plot.

Figure 13.9 Sample variogram for residuals from the d = 0, 41 knot, P-spline fit


Figure 13.10 Plot of residuals after the d = 0, 41 knot, P-spline fit


13.6 Smoothing splines

13.6.1 Background: Green's functions and reproducing kernels

We turn to a special form for the penalty function. Suppose that our observation times or design values satisfy $a < t_1 < t_2 < \cdots < t_n < b$. For ease of presentation we take $a = 0$ and $b = 1$, something that can be achieved by a simple transformation of the design variable. The unknown function $g(\cdot)$ is chosen to maximize the penalized log-likelihood

$$PLL = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y-g)^TR^{-1}(y-g) + \lambda_s\int_0^1\{g^{(m)}(t)\}^2dt\right\} \qquad (13.6.14)$$

where $g^{(m)}(t)$ denotes the $m$th derivative of $g(t)$. We shall also use the notation $D^mg = D^mg(t)$ to denote $m$th derivatives in certain developments. Implicit in the penalized log-likelihood is the fact that

1. $g^{(j)}(t)$ is absolutely continuous for $j = 0, 1, \ldots, m-1$, where $g^{(0)}(t) = g(t)$;

2. $g^{(m)}(t)$ is square integrable, so that $\int_0^1\{g^{(m)}(t)\}^2dt < \infty$.

The solution found by maximizing the penalized log-likelihood is a polynomial smoothing spline. This means it is a piecewise polynomial of degree $2m-1$ which satisfies the continuity property of condition 1; indeed its derivatives up to order $2m-2$ are continuous, with the $(2m-1)$th derivative discontinuous at the knots. If $m = 2$ we obtain the familiar cubic smoothing spline; see Green and Silverman (1994b).

The parameter $\lambda_s$ governs the amount of smoothing. The heuristic rationale behind the penalty is as follows. The polynomial of degree $m-1$ is considered the smoothest curve, and the $m$th derivative $g^{(m)}(t) = 0$ in this case. Thus we are penalizing curves departing from this $(m-1)$th degree polynomial, and the size of the penalty determines the deviation allowed.

To see how the solution can be obtained, we use the two properties or conditions to expand $g(t)$ in a Taylor series about $t = 0$ with an integral form of the remainder, namely

$$g(t) = \sum_{j=0}^{m-1}\frac{t^j}{j!}g^{(j)}(0) + \int_0^1\frac{(t-u)_+^{m-1}}{(m-1)!}g^{(m)}(u)\,du \qquad (13.6.15)$$

Notice that we have the truncated power function and the $m$th derivative appearing in the remainder term. The remainder is actually a special integral that arises in the solution of the differential equation

$$D^mg(t) = f(t)$$

The solution of this differential equation under certain boundary conditions is

$$g(t) = \int_0^1G_m(t,u)f(u)\,du = \int_0^1G_m(t,u)\,g^{(m)}(u)\,du$$

where $G_m(t,u)$ is Green's function. This is of the same form as the remainder in our Taylor series expansion (13.6.15), with Green's function being

$$G_m(t,u) = \frac{(t-u)_+^{m-1}}{(m-1)!}$$

Different Green's functions arise for different differential equations, as we shall see below. In general, determination of the solution of such differential equations requires finding $G_m(t,u)$.

Let $H_2$ be the set of functions satisfying conditions 1 and 2 above, but in addition satisfying $g^{(j)}(0) = 0$ for $j = 0, 1, \ldots, m-1$. This set contains all functions that are linear combinations of functions satisfying these requirements, is closed under limiting operations, and has an associated inner product. In simple Euclidean geometry, the inner product plays a role in specifying the angle between vectors, and relates in a natural way to the length of vectors. Thus if $x$ and $y$ are two vectors, we define the inner product as $(x, y) = x^Ty$ and a measure of distance called the norm as

$$\|x\| = \sqrt{(x,x)} = \sqrt{x^Tx}$$

Note that the angle $\theta$ between $x$ and $y$ can be found using

$$\cos\theta = \frac{(x,y)}{\|x\|\,\|y\|}$$

thereby showing that the inner product provides information on angles between vectors.

Our set $H_2$ contains functions, and hence we need to provide an inner product for functions. Thus for $H_2$ we define the inner product

$$(f, g) = \int_0^1f^{(m)}(u)\,g^{(m)}(u)\,du$$

so that the (squared) norm is

$$\|g\|^2 = \int_0^1\left\{g^{(m)}(u)\right\}^2du$$

which is our penalty function. This is a very logical selection, as we are intending to penalize functions distant from a polynomial of degree $m-1$. We use these results in what follows.

Green's functions play a vital role in the maximization of (13.6.14). Define

$$k_2(s,t) = \int_0^1G_m(s,u)\,G_m(t,u)\,du \qquad (13.6.16)$$

Note that $k_2(s,t) = k_2(t,s)$. If we consider $t$ to be fixed and write $k_{2,t}(v) = k_2(v,t)$, then it can be shown that

$$D^mk_{2,t}(v) = G_m(t,v)$$

Notice that with this result

$$(k_{2,s}, k_{2,t}) = \int_0^1G_m(s,u)\,G_m(t,u)\,du = k_2(s,t)$$

and hence the name reproducing kernel given to $k_2$. Note also that

$$(k_{2,t}, g) = \int_0^1G_m(t,u)\,g^{(m)}(u)\,du = g(t)$$

so that $k_{2,t}$ is called the representer of evaluation of $g$ at $t$. The set $H_2$ together with the function $k_2$ forms a reproducing kernel Hilbert space.

In the same manner, the polynomials $\phi_j(t) = t^j/j!$, $j = 0, 1, \ldots, m-1$, also form a reproducing kernel Hilbert space, which we call $H_1$. The inner product is defined by

$$(\phi_j, \phi_k) = \sum_{i=0}^{m-1}D^i\phi_j(0)\,D^i\phi_k(0)$$

so that the squared norm is

$$\|\phi_j\|^2 = \sum_{i=0}^{m-1}\left\{D^i\phi_j(0)\right\}^2$$

Note that with these definitions, $\phi_0, \phi_1, \ldots, \phi_{m-1}$ are orthonormal; that is, they have unit length and they are perpendicular. This follows because

$$(D^j\phi_k)(0) = \begin{cases} 1 & j = k \\ 0 & j \ne k \end{cases}$$

In this setting the reproducing kernel is

$$k_1(s,t) = \sum_{j=0}^{m-1}\phi_j(s)\,\phi_j(t)$$

and it satisfies the reproducing and representer of evaluation conditions required.
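As a concrete special case (a worked example for m = 2, added here for illustration and not part of the original development), the Green's function is $G_2(t,u) = (t-u)_+$ and the kernel integral in (13.6.16) can be evaluated in closed form:

$$k_2(s,t) = \int_0^1(s-u)_+(t-u)_+\,du = \int_0^s(s-u)(t-u)\,du = \frac{s^2(3t-s)}{6}, \qquad s \le t$$

This is the kernel that underlies the cubic smoothing spline of Section 13.6.3.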

13.6.2 Solution

What is the consequence of these results? Firstly, all functions $g$ satisfying conditions 1 and 2 belong to a reproducing kernel Hilbert space $H$ which is such that

$$H = H_1 \oplus H_2$$

so that we can form $g$ as the sum of two functions, one from $H_1$ and one from $H_2$. In the same manner, the reproducing kernel is the sum of $k_1$ and $k_2$, so that $k(s,t) = k_1(s,t) + k_2(s,t)$. The most important consequence of these results, however, is the fact that we can write

$$g(t) = \sum_{j=0}^{m-1}\tau_{s,j}\phi_j(t) + \sum_{i=1}^{n}d_ik_2(t_i,t) \qquad (13.6.17)$$

Note that if $k_{2,t_i}(t) = k_2(t_i,t)$ and $k_2(t)$ is the vector of the $k_{2,t_i}(t)$, we can write (13.6.17) as

$$g(t) = \phi(t)^T\tau_s + k_2(t)^Td$$

It now follows that

$$g^{(m)}(t) = D^mg(t) = D^mk_2^T(t)\,d$$

and hence that

$$\int_0^1\left\{g^{(m)}(u)\right\}^2du = \int_0^1\left\{D^mk_2^T(u)\,d\right\}^2du = d^T\left[\int_0^1D^mk_{2,t_i}(u)\,D^mk_{2,t_j}(u)\,du\right]_{ij}d$$

$$= d^T\left[\int_0^1G_m(t_i,u)\,G_m(t_j,u)\,du\right]_{ij}d = d^T\left[k_2(t_i,t_j)\right]_{ij}d = d^TK_2d$$

Thus the penalty can be written as a quadratic form with the matrix $K_2$ evaluated at the design points $t_1, t_2, \ldots, t_n$, and at the design points the penalized log-likelihood is given by

$$PLL = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - X\tau_s - K_2d)^TR^{-1}(y - X\tau_s - K_2d) + \lambda_sd^TK_2d\right\} \qquad (13.6.18)$$

This suggests that a mixed model formulation is again possible. Note that the solution based on (13.6.18) may not always be numerically stable, and so we often seek to transform to a better-conditioned system (this is not always possible, and problems arise because $K_2$ can be ill-conditioned).

There are implicit constraints on the above system. We impose the constraints $X^Td = 0$. Then there exists an $n\times(n-m)$ matrix $Q_2$ of full column rank such that $d = Q_2\delta$. Then if $Z_2 = K_2Q_2$ and $G_s = Q_2^TK_2Q_2$, we have

$$PLL = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - X\tau_s - Z_2\delta)^TR^{-1}(y - X\tau_s - Z_2\delta) + \lambda_s\delta^TG_s^{-1}\delta\right\} \qquad (13.6.19)$$

which looks like the objective function that gives rise to mixed model equations. Thus we can fit such models using mixed model techniques. Here $\delta \sim N(0, \sigma_s^2G_s)$ and $\lambda_s = \sigma^2/\sigma_s^2$.

13.6.3 Cubic smoothing spline

The special case $m = 2$, which corresponds to the cubic smoothing spline, has received much attention. The elements of the cubic smoothing spline are as follows. Let $h_j = t_{j+1} - t_j$, $j = 1, 2, \ldots, n-1$. Define $\Delta$ and $G_s$ to be $n\times(n-2)$ and $(n-2)\times(n-2)$ banded matrices respectively, where the only non-zero elements are (for $i = 1, 2, \ldots, n-2$)

$$\Delta_{ii} = \frac{1}{h_i}, \qquad \Delta_{i+1,i} = -\left(\frac{1}{h_i} + \frac{1}{h_{i+1}}\right), \qquad \Delta_{i+2,i} = \frac{1}{h_{i+1}} \qquad (13.6.20)$$

$$G_{s;i,i+1} = G_{s;i+1,i} = \frac{h_{i+1}}{6}, \qquad G_{s;ii} = \frac{h_i + h_{i+1}}{3} \qquad (13.6.21)$$

Then $G_s$ appears in (13.6.19) and $Z_2 = \Delta(\Delta^T\Delta)^{-1}$. In addition, it can be shown that $\delta$ is the $(n-2)$-vector of second derivatives at the internal knots. A further transformation reduces the estimation to that of a variance component analysis: if $u_s = G_s^{-1/2}\delta$ and $Z = Z_2G_s^{1/2}$, then $u_s \sim N(0, \sigma_s^2I_{n-2})$ and the penalized log-likelihood can be modified accordingly.

13.6.4 Mixograph Data

The cubic smoothing spline was fitted to the mixograph data using the mixed model approach, and hence used REML and BLUP. The fit is presented in Figure 13.11, and we see that there is a mix of the low and high frequency cycles (together with the linear trend). However, this figure and the residual and sample variogram plots in Figures 13.12 and 13.13 highlight that the high frequency cycle is still present in the data. Thus the smoothing spline has picked up broad trends but is unable to accommodate the high frequency periodic effects. Note that using cross-validation for the smoothing parameter does not change this result.
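The band matrices (13.6.20) and (13.6.21) are easy to assemble directly. The sketch below is our illustration only (simulated data, R = I, and a fixed λ_s rather than a REML estimate); it computes the cubic smoothing spline fit of Section 13.6.3 through the penalty matrix K = ΔG_s⁻¹Δᵀ.

# Cubic smoothing spline from (13.6.20)-(13.6.21) at a fixed lambda_s.
set.seed(6)
n  <- 60
tt <- seq(0, 1, length.out = n)
y  <- sin(2 * pi * tt) + rnorm(n, sd = 0.2)
h  <- diff(tt)
Delta <- matrix(0, n, n - 2)
Gs    <- matrix(0, n - 2, n - 2)
for (i in 1:(n - 2)) {
  Delta[i, i]     <- 1 / h[i]
  Delta[i + 1, i] <- -(1 / h[i] + 1 / h[i + 1])
  Delta[i + 2, i] <- 1 / h[i + 1]
  Gs[i, i]        <- (h[i] + h[i + 1]) / 3
  if (i < n - 2) Gs[i, i + 1] <- Gs[i + 1, i] <- h[i + 1] / 6
}
lambda_s <- 1e-6
K    <- Delta %*% solve(Gs, t(Delta))      # penalty matrix
ghat <- solve(diag(n) + lambda_s * K, y)   # fitted spline at the design points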

Figure 13.11 Smoothing spline fit to the mixograph data


13.7 L-splines

13.7.1 Differential Operator

The previous approaches to analysing the mixograph data have essentially ignored the cyclic structure evident in the data. Given that there are three components to the trend, it is perhaps not surprising that in the main the methods have failed. We now turn to an extension of the smoothing spline approach which targets more complex functions than polynomials in the penalty. We replace the differential operator $D^m$ by a more general operator $L$, where

$$L = w_0(t) + w_1(t)D + \cdots + w_{m-1}(t)D^{m-1} + D^m \qquad (13.7.22)$$

for some functions $w_j(t)$, $j = 0, 1, \ldots, m-1$. We also consider the more general

Figure 13.12 Residuals after the smoothing spline fit to the mixograph data


Figure 13.13 Sample variogram of residuals from the smoothing spline fit to the mixograph data


penalty

$$\int_0^1\{Lg(t)\}^2dt$$

How are we going to use this $L$? We wish to penalize curves that obey $Lg(t) = 0$. Thus, given a function $g$, we wish to determine $L$ such that $Lg(t) = 0$. The functions $g$ that are chosen provide a curve that is sensible in the application but that may not be strictly appropriate. For example, the patterns observed in Figures 13.4, 13.5 and 13.6 suggest that a suitable $g$ to consider is the function

$$g(t) = \beta_0 + \beta_1t + \beta_{1s}\sin\omega_1t + \beta_{1c}\cos\omega_1t + \beta_{2s}\sin\omega_2t + \beta_{2c}\cos\omega_2t \qquad (13.7.23)$$

The six basis functions that make up $g$ are $(1, t, \sin\omega_1t, \cos\omega_1t, \sin\omega_2t, \cos\omega_2t)$. The operator $L$ should annihilate each of these six "basis" functions, and determining the $L$ that does so involves finding the functions $w_j(t)$, $j = 0, 1, \ldots, m-1$. We will illustrate the general theory by considering the underlying basis set for (13.7.23). Thus define the vectors

$$w = w(t) = \begin{pmatrix} w_0(t) \\ w_1(t) \\ w_2(t) \\ w_3(t) \\ w_4(t) \\ w_5(t) \end{pmatrix}, \qquad u = u(t) = \begin{pmatrix} 1 \\ t \\ \sin\omega_1t \\ \cos\omega_1t \\ \sin\omega_2t \\ \cos\omega_2t \end{pmatrix}$$

We require the $w$ that ensures $Lu = 0$. Applying $L$ to $u$ we have

$$w_0(t)u + \sum_{j=1}^{m-1}w_j(t)D^ju = -D^mu$$

In matrix terms this becomes

$$\left[u\;\; Du\;\; \ldots\;\; D^{m-1}u\right]w = -D^mu \qquad \text{or} \qquad Ww = -D^mu \qquad (13.7.24)$$

The matrix $W$ is called the Wronskian matrix, and it plays an important role here and below. Note that if $W$ is non-singular for all $t$, we can find $w$ by

$$w = -W^{-1}D^mu$$

The derivatives of our $u$, namely $D^ju$, are simple to find. However, the algebra required to solve (13.7.24) is best conducted using a symbolic mathematical package. The calculations reported here were carried out using the freeware symbolic package Maxima (http://maxima.sourceforge.net/index.shtml). Using this manipulator we find

$$w = w(t) = \begin{pmatrix} 0 \\ 0 \\ \omega_1^2\omega_2^2 \\ 0 \\ \omega_1^2 + \omega_2^2 \\ 0 \end{pmatrix}$$

and hence

$$L = \omega_1^2\omega_2^2D^2 + (\omega_1^2 + \omega_2^2)D^4 + D^6$$

It is easily checked that this operator does indeed satisfy $Lg(t) = 0$ for functions based on the 6 basis functions.
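A quick numerical version of that check is sketched below (our illustration; the frequencies are hypothetical values standing in for the estimated ones). For sin(ωt), D² contributes −ω², D⁴ contributes ω⁴ and D⁶ contributes −ω⁶, while 1 and t are killed by D² alone, so only a scalar coefficient needs to vanish.

# Coefficient of sin(w t) in L sin(w t); the same coefficient applies to cos(w t).
w1 <- 2 * pi * 0.125; w2 <- 2 * pi * 0.2     # hypothetical frequencies
Lsin <- function(w) -w1^2 * w2^2 * w^2 + (w1^2 + w2^2) * w^4 - w^6
c(Lsin(w1), Lsin(w2))                        # both zero, as required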

13.7.2 Green's function

As for the smoothing spline case, the solution to the differential equation $Lg(t) = f(t)$ is given by

$$g(t) = \int_0^1G(t,u)f(u)\,du = \int_0^1G(t,u)\,Lg(u)\,du$$

where $G(t,u)$ is Green's function for the solution. It is necessary to derive $G(t,u)$ for any particular $L$. If $v_1(t), v_2(t), \ldots, v_m(t)$ are the elements of the last row of the inverse of the Wronskian matrix, $W^{-1}$, and $v(t)$ is the vector of these elements, Green's function can be determined by

$$G(t,s) = \begin{cases} u(t)^Tv(s) & s \le t \\ 0 & \text{otherwise} \end{cases}$$

For our model given by (13.7.23) we find

$$v(s) = \begin{pmatrix} -\dfrac{s}{\omega_1^2\omega_2^2} \\[4pt] \dfrac{1}{\omega_1^2\omega_2^2} \\[4pt] -\dfrac{\cos\omega_1s}{\omega_1^3\omega_2^2 - \omega_1^5} \\[4pt] \dfrac{\sin\omega_1s}{\omega_1^3\omega_2^2 - \omega_1^5} \\[4pt] -\dfrac{\cos\omega_2s}{\omega_1^2\omega_2^3 - \omega_2^5} \\[4pt] \dfrac{\sin\omega_2s}{\omega_1^2\omega_2^3 - \omega_2^5} \end{pmatrix}$$

and hence

$$G(t,s) = \begin{cases} \dfrac{t-s}{\omega_1^2\omega_2^2} - \dfrac{\sin\omega_1(t-s)}{\omega_1^3\omega_2^2 - \omega_1^5} - \dfrac{\sin\omega_2(t-s)}{\omega_1^2\omega_2^3 - \omega_2^5} & s \le t \\[6pt] 0 & \text{otherwise} \end{cases}$$

13.7.3 Reproducing kernel

As for the smoothing spline,

$$k_2(s,t) = \int_0^1G(s,u)\,G(t,u)\,du$$

Symbolic manipulation yields the reproducing kernel: for $s \le t$,

$$\begin{aligned}
k_2(s,t) ={}& \frac{s^2(3t-s)}{6\,\omega_1^4\omega_2^4} \\
&- \frac{\omega_1s\cos\omega_1t + \sin\omega_1(t-s) - \sin\omega_1t}{\omega_1^7\omega_2^2(\omega_1^2 - \omega_2^2)}
 + \frac{\omega_2s\cos\omega_2t + \sin\omega_2(t-s) - \sin\omega_2t}{\omega_1^2\omega_2^7(\omega_1^2 - \omega_2^2)} \\
&+ \frac{\omega_1(t-s) - \omega_1t\cos\omega_1s + \sin\omega_1s}{\omega_1^7\omega_2^2(\omega_1^2 - \omega_2^2)}
 + \frac{\omega_2(t-s) - \omega_2t\cos\omega_2s + \sin\omega_2s}{\omega_1^2\omega_2^7(\omega_1^2 - \omega_2^2)} \\
&+ \frac{2\omega_1s\cos\omega_1(t-s) + \sin\omega_1(t-s) - \sin\omega_1(t+s)}{4\omega_1^7(\omega_1^2 - \omega_2^2)^2}
 + \frac{2\omega_2s\cos\omega_2(t-s) + \sin\omega_2(t-s) - \sin\omega_2(t+s)}{4\omega_2^7(\omega_1^2 - \omega_2^2)^2} \\
&- \frac{(\omega_2 - \omega_1)\sin(\omega_1s + \omega_2t) + (\omega_1 + \omega_2)\sin(\omega_1s - \omega_2t) + 2\omega_1\sin\omega_2(t-s)}{2\omega_1^3\omega_2^3(\omega_1^2 - \omega_2^2)^3} \\
&+ \frac{(\omega_1 - \omega_2)\sin(\omega_2s + \omega_1t) + (\omega_1 + \omega_2)\sin(\omega_2s - \omega_1t) + 2\omega_2\sin\omega_1(t-s)}{2\omega_1^3\omega_2^3(\omega_1^2 - \omega_2^2)^3}
\end{aligned} \qquad (13.7.25)$$

For s > t we note that k2(s, t) = k2(t, s). The functional form for g(t) given by (13.6.17) is again appropriate with φj(t) replaced by uj(t). The mixed model that arises is given by

y = Xτ + Zu + e

where $X$ has $(i,j)$th element $u_j(t_i)$, $Z = K_2Q_2G^{1/2}$, $G = Q_2^TK_2Q_2$ and $u \sim N(0, \sigma_L^2I)$. The penalized log-likelihood is then given by

$$PLL = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - X\tau - Zu)^TR^{-1}(y - X\tau - Zu) + \lambda_Lu^Tu\right\}$$

and $\lambda_L = \sigma^2/\sigma_L^2$.
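Rather than coding the closed form (13.7.25), one can build the matrix K₂ at the design points by numerical quadrature of the definitional integral, using the closed-form Green's function above. The R sketch below is our illustration only, with hypothetical frequencies.

# k2(s,t) = integral of G(s,u) G(t,u) over (0,1), by quadrature.
w1 <- 2 * pi * 0.125; w2 <- 2 * pi * 0.2     # hypothetical frequencies
Gfun <- function(t, s) ifelse(s <= t,
  (t - s) / (w1^2 * w2^2) -
    sin(w1 * (t - s)) / (w1^3 * w2^2 - w1^5) -
    sin(w2 * (t - s)) / (w1^2 * w2^3 - w2^5), 0)
k2 <- function(s, t) integrate(function(u) Gfun(s, u) * Gfun(t, u), 0, 1)$value
k2(0.3, 0.7)                                 # one element of the K2 matrix

Looping k2 over all pairs of design points gives K₂, after which the mixed model construction above proceeds as for the smoothing spline.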

13.7.4 Mixograph Data

For the mixograph data the mixed model was fitted using REML. The estimated L-spline is presented in Figure 13.14. The fit appears to be very good, and the sample variogram of the residuals from the fit is presented in Figure 13.15. The cyclic pattern is not as evident, and the variation has largely been accounted for by the L-spline.

Figure 13.14 L-spline fit for the mixograph data


Figure 13.15 Sample variogram for the residuals from the L-spline fit for the mixograph data


13.8 Variance modelling

We have seen that mixed models can be used to fit a wide range of models under a penalized likelihood scheme. In each case, a variance structure was generated by the penalty, this variance structure was fitted using REML, and the fitted curves are BLUPs. We might argue that penalties are a way to achieve a satisfactory underlying variance model, in fact that penalties are a surrogate for variance modelling. This may be a controversial view of modelling using penalties, but an alternative to that approach is the use of a variance structure directly.

Given the clear linear plus cyclic patterns in the data, as exemplified by the L-spline approach, it would seem sensible to include the six terms in the basis as fixed effects in the model and then investigate possible variance models. There are various possible models that could be investigated. Some obvious ones are autoregressive models, moving average models and, possibly the most attractive, models using the Matern class of correlations. Nugget effects could not be fitted in any variance model. The autoregressive model of order 1 (AR(1)), the moving average model of order 1 (MA(1)) and the Matern correlation model with ν = 0.5, 1.0, 1.5, 2.0, 2.2 were all fitted. The Matern model with ν = 2.2 achieved the largest REML likelihood. Table 13.1 presents the REML log-likelihoods for the various models.

Table 13.1 REML log-likelihoods for various models fitted to the mixograph data

Model             REML log-likelihood
L-spline          -45.49
AR(1)             -37.47
MA(1)             -31.02
Matern ν = 0.5    -37.47
Matern ν = 1.0    -33.31
Matern ν = 1.5    -31.08
Matern ν = 2.0    -30.01
Matern ν = 2.2    -29.76

The parameter ν governs the level of differentiability of the underlying Matern process. Thus it appears we have a function that is at least twice differentiable. The log-likelihoods indicate that the Matern correlation structure with ν approximately 2 fits the data best. Figure 13.16 shows that the fit is very close to that of the L-spline, but in likelihood terms it is much better. Thus we have an appropriate variance structure. Interestingly, the MA(1) model is also a good fit, while the AR(1) (equivalently the Matern model with ν = 0.5) is not as good.

The sample variogram for the best Matern fit is given in Figure 13.17. What is evident is the cyclic pattern through the variogram. However, the residuals used in constructing the variogram are correlated, and hence we expect to see some pattern that reflects that correlation.
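For reference, the Matern correlation function is easy to evaluate directly. The R sketch below is our illustration only, using one common parameterization (a range parameter phi and smoothness nu; the values shown are hypothetical, not the fitted ones).

# Matern correlation: rho(d) = 2^(1-nu)/Gamma(nu) * (d/phi)^nu * K_nu(d/phi).
matern_cor <- function(d, phi, nu) {
  x <- d / phi
  ifelse(d == 0, 1, (2^(1 - nu) / gamma(nu)) * x^nu * besselK(x, nu))
}
d <- seq(0, 0.2, length.out = 100)
head(cbind(nu0.5 = matern_cor(d, 0.02, 0.5),   # exponential, i.e. AR(1)-like
           nu2.2 = matern_cor(d, 0.02, 2.2)))  # much smoother near the origin

The behaviour of the correlation near zero distance is what ν controls, which is why values of ν around 2 correspond to a process that is (approximately) twice differentiable.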

Figure 13.16 Linear plus double periodic plus Matern fit to the mixograph data


Figure 13.17 Sample variogram for residuals from Matern fit to the mixograph data


Thus for this dataset, the use of a formal variance matrix rather than a penalty appears to provide the best fit. There is an implicit assumption here that the REML log-likelihood is an appropriate measure of goodness of fit. This might be challenged for penalized likelihood methods, for we are using standard mixed model inferential tools. This is a difficult and important topic that remains to be resolved.

13.9 Analysis of high-resolution mixograph data

13.10 Analysis of another example: still to come

13.11 LASSO

The common component throughout the development of penalized models in this chapter has been a quadratic penalty, whether introduced in an ad-hoc or a more formal way. If the penalty is changed to an absolute value form, as used in the Least Absolute Shrinkage and Selection Operator (LASSO) of Tibshirani (1996), the convenient linear mixed model properties are no longer available. Approximations are required to move towards a mixed model, and estimation involves nonlinear programming (Osborne et al. (2000)). Thus the mechanics of linear mixed models are useful but not universal.

13.12 Discussion

All of this chapter has focussed on estimation. The inferential use of mixed models for estimation based on penalties is very contentious. The fact that the estimators (or predictors) are like best linear unbiased predictors does not mean the generating model is a mixed model; indeed it is a mathematical nicety. However, it is a very useful vehicle for fitting the models.

Nothing has been said about the use of REML to estimate the smoothing parameter rather than some form of cross-validation. Various studies have shown that REML is a reasonable approach; see Wahba (1985) and Ruppert et al. (2003). There are studies that show that some of the inferential properties may hold. For example, testing linearity in a smoothing spline context using a residual likelihood ratio test was shown in a small simulation study by Verbyla et al. (1997) to closely follow the proposed asymptotic distribution under a mixed model setting. More recently, Claeskens (2004) has studied this problem in more detail for penalized regression splines. Prediction or pointwise confidence intervals based on the Bayesian (Wahba (1983)) or mixed model (Verbyla et al. (1999b)) formulations have been proposed, and simulation studies suggest these perform as expected.

Given that the mixed model involves variance structures, the appropriateness of inferences using mixed models derived from penalized modelling will ultimately depend on the appropriateness of the variance structure generated by the penalty. If this is a good representation of such structure in the data, we surmise that inferences will be valid. If this is not the case, inferences will be problematic.

Bibliography

Abramowitz, M. and Stegun, I. A., editors (1965). Handbook of Mathematical Functions. Dover Publications, New York.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki, editors, Proceedings of the 2nd International Symposium on Information Theory, Budapest, pages 267-281. Akademiai Kiado.
Bartlett, M. S. (1966). Stochastic Processes. Cambridge University Press, Cambridge, 2nd edition.
Bartlett, M. S. (1978). Nearest neighbour models in the analysis of field experiments. Journal of the Royal Statistical Society, Series B, 40, 147-158.
Beckett, P. H. T. and Webster, R. (1971). Soil variability: a review. Soils and Fertilizers, 34, 1-15.
Beecher, H. G., Hume, I. H., and Dunn, B. W. (2002). Improved method for assessing rice soil suitability to restrict recharge. Australian Journal of Experimental Agriculture, 42(3), 297-307.
Besag, J. and Kempton, R. A. (1986). Statistical analysis of field experiments using neighbouring plots. Biometrics, 42, 231-251.
Besag, J. E. and Higdon, D. (1999). Analysis of field experiments using a Bayesian approach (with discussion). Journal of the Royal Statistical Society, Series B, xx, xxx-xxx.
Brien, C. J. (1983). Analysis of variance tables based on experimental structure. Biometrics, 39, 53-59.
Brumback, B. A. and Rice, J. A. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves (with discussion). Journal of the American Statistical Association, 93, 961-994.
Claeskens, G. (2004). Restricted likelihood ratio lack-of-fit tests using mixed spline models. Journal of the Royal Statistical Society, Series B, 66, 909-926.
Clayton, B. and Hollis, J. M. (1984). Criteria for differentiating soil series. Soil Survey of England and Wales Technical Monograph 17, Harpenden.
Cochran, W. G. and Cox, G. M. (1957). Experimental Designs. Wiley, New York, 2nd edition.
Conyers, M. K., Uren, N. C., and Helyar, K. R. (1995). Causes of changes in pH in acidic mineral soils. Soil Biology and Biochemistry, 27, 1383-1392.
Coombes, N. (2002). The Reactive Tabu Search for efficient correlated experimental designs. Ph.D. thesis, Liverpool John Moores University, Liverpool, U.K.
Cressie, N. A. C. (1991). Statistics for Spatial Data. John Wiley and Sons, New York.
Cressie, N. A. C. (1993). Statistics for Spatial Data. John Wiley and Sons, New York, revised edition.
Cullis, B., Smith, A., Panozzo, J., and Lim, P. (2003). Barley malting quality: are we selecting the best? Australian Journal of Agricultural Research, 54, 1261-1275.
Cullis, B., Smith, A., and Coombes, N. (2005). On the design of early generation variety trials with correlated data.
Cullis, B. R. and Gleeson, A. C. (1991). Spatial analysis of field experiments - an extension to two dimensions. Biometrics, 47, 1449-1460.
Cullis, B. R., Gogel, B. J., Verbyla, A. P., and Thompson, R. (1998). Spatial analysis of multi-environment early generation trials. Biometrics, 54, 1-18.
de Boor, C. (1978). A Practical Guide to Splines, volume 27 of Applied Mathematical Sciences. Springer-Verlag, New York.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Diggle, P. J. (1990). Time Series: A Biostatistical Introduction. Clarendon Press, Oxford.
Diggle, P. J., Liang, K.-Y., and Zeger, S. L. (1994). Analysis of Longitudinal Data. Clarendon Press, Oxford.
Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics (with discussion). Applied Statistics, 47(3), 299-350.
Diggle, P. J., Ribeiro, P. J. J., and Christensen, O. F. (2003). An introduction to model-based geostatistics. In J. Moller, editor, Spatial Statistics and Computational Methods, pages 43-86. Springer-Verlag.
Eilers, P. H. C. (1999). Contribution to the discussion of Verbyla et al. (1999). Applied Statistics, 48, 307-308.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89-121.
Falconer, D. S. and Mackay, T. (1996). Introduction to Quantitative Genetics. Longman Scientific and Technical, 4th edition.
Foulley, J.-L. and Quaas, R. L. (1995). Heterogeneous variances in Gaussian linear mixed models. Genetics Selection Evolution, 27, 211-228.
Foulley, J.-L. and van Dyk, D. A. (2000). The PX-EM algorithm for fast stable fitting of Henderson's mixed model. Genetics Selection Evolution, 32, 143-163.
Foulley, J.-L., Jaffrezic, F., and Robert-Granie, C. (2000). EM-REML estimation of covariance parameters in Gaussian mixed models for longitudinal data analysis. Genetics Selection Evolution, 32, 129-141.
Gilmour, A. R., Thompson, R., and Cullis, B. R. (1995). Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51, 1440-1450.
Gilmour, A. R., Cullis, B. R., and Verbyla, A. P. (1997). Accounting for natural and extraneous variation in the analysis of field experiments. Journal of Agricultural, Biological and Environmental Statistics, 2, 269-293.
Gilmour, A. R., Cullis, B. R., Welham, S. J., Gogel, B. J., and Thompson, R. (2003). ASReml Reference Manual. Technical report, VSN International.
Gilmour, A. R., Cullis, B. R., Welham, S. J., Gogel, B. J., and Thompson, R. (2004). An efficient computing strategy for prediction in mixed linear models. Computational Statistics and Data Analysis, 44, 571-586.
Gleeson, A. C. and Cullis, B. R. (1987). Residual maximum likelihood (REML) estimation of a neighbour model for field experiments. Biometrics, 43, 277-288.
Gogel, B. J. (1997). Spatial analysis of multi-environment variety trials. Ph.D. thesis, Department of Statistics, University of Adelaide, South Australia.
Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. Chapman and Hall, London.
Green, P. J., Jennison, C., and Seheult, A. H. (1985). Analysis of field experiments by least squares smoothing. Journal of the Royal Statistical Society, Series B, 47, 299-315.
Gu, C. (2002). Smoothing Spline ANOVA Models. Springer, New York.
Haskard, K. A. (2005). Anisotropic Matérn correlation and other issues in model-based geostatistics. Ph.D. thesis, BiometricsSA, University of Adelaide.
Haskard, K. A., Cullis, B. R., and Verbyla, A. P. (2005). Anisotropic Matérn correlation and spatial prediction using REML. Journal of Agricultural and Biological Sciences, unknown, ***-***.
Haslett, J. (1999). A simple derivation of deletion diagnostic results for the general linear model with correlated errors. Journal of the Royal Statistical Society, Series B, 61, 603-609.
Haslett, J. and Hayes, K. (1998). Residuals for the linear model with general covariance structure. Journal of the Royal Statistical Society, Series B, 60, 201-215.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.
Heenan, D. P., Taylor, A. C., Cullis, B. R., and Lill, W. J. (1994). Long term effects of rotation, tillage and stubble management on wheat production in southern NSW. Australian Journal of Agricultural Research, 45, 93-117.
Henderson, C. R. (1950). Estimation of genetic parameters (abstract). Annals of Mathematical Statistics, 21, 309-310.
Henderson, C. R. (1973). Sire evaluation and genetic trends. In Proceedings of the Animal Breeding and Genetics Symposium in Honour of Dr. Jay L. Lush, Champaign, pages 10-41.
Hurvich, C. M. and Tsai, C. L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 271-293.
Kammann, E. E. and Wand, M. P. (2003). Geoadditive models. Applied Statistics, 52(1), 1-18.
Kenward, M. G. (1987). A method for comparing profiles of repeated measurements. Applied Statistics, 36, 296-308.
Kenward, M. G. and Roger, J. H. (1997). The precision of fixed effects estimates from restricted maximum likelihood. Biometrics, 53, 983-997.
Krige, D. G. (1951). A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52, 119-139.
Laird, N. and Ware, J. (1982). Random effects models for longitudinal data. Biometrics, 38, 963-975.
Lane, P. W. (1998). Predicting from unbalanced linear or generalised linear models. In COMPSTAT98 Proceedings in Computational Statistics. Physica-Verlag, Heidelberg.
Lane, P. W. and Nelder, J. A. (1982). Analysis of covariance and standardisation as instances of prediction. Biometrics, 38, 613-621.
Lark, R., Cullis, B. R., and Welham, S. (2005). On spatial prediction of soil properties in the presence of a spatial trend: the empirical best linear unbiased predictor (E-BLUP) with REML. European Journal of Soil Science, unknown, ***-***.
Lark, R. M., Catt, J. A., and Stafford, J. V. (1998). Towards the explanation of within-field variability of yield of winter barley: soil series differences. Journal of Agricultural Science, Cambridge, 131, 409-416.
Laslett, G. M., McBratney, A. B., Pahl, P., and Hutchinson, M. F. (1987). Comparison of several spatial prediction methods for soil pH. Journal of Soil Science, 38, 325-341.
Lill, W. J., Cullis, B. R., and Gleeson, A. C. (1988). SAFE, a computer program for the spatial analysis of field experiments. In Proceedings of the National Mathematical Sciences Congress.
Liu, C. and Rubin, D. B. (1994). The ECME algorithm: a simple extension of EM and ECM with fast monotone convergence. Biometrika, 81, 633-648.
Liu, C., Rubin, D. B., and Wu, Y. N. (1998). Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, 85, 755-770.
Martin, R. J. (1979). A subclass of lattice processes applied to a problem in planar sampling. Biometrika, 66, 209-217.
Martin, R. J. (1990). The use of time-series models and methods in the analysis of agricultural field trials. Communications in Statistics, 19, 55-81.
Matheron, G. (1973). The intrinsic random functions and their applications. Advances in Applied Probability, 5, 439-468.
McCullagh, P. and Nelder, J. A. (1994). Generalized Linear Models. Chapman and Hall, London, 2nd edition.
McIntyre, G. A. (1955). Design and analysis of two-phase experiments. Biometrics, 11, 324-334.
McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. John Wiley and Sons, New York.
Meng, X. L. and van Dyk, D. A. (1997). The EM algorithm: an old folk-song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society, Series B, 59, 511-567.
Meng, X. L. and van Dyk, D. A. (1998). Fast EM-type implementations for mixed effects models. Journal of the Royal Statistical Society, Series B, 60, 559-578.
Mercer, W. B. and Hall, A. D. (1911). The experimental error of field trials. Journal of Agricultural Science, Cambridge, 4, 107-132.
Nelder, J. A. (1954). The interpretation of negative components of variance. Biometrika, 41, 544-548.
Nelder, J. A. (1965a). The analysis of randomized experiments with orthogonal block structure. I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London, Series A, 283, 147-162.
Nelder, J. A. (1965b). The analysis of randomized experiments with orthogonal block structure. II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London, Series A, 283, 163-178.
Nelder, J. A. (1968). The combination of information in generally balanced designs. Journal of the Royal Statistical Society, Series B, 30, 303-311.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9, 319-337.
Papadakis, J. S. (1937). Methode statistique pour des experiences sur champ. Bulletin scientifique, Institut d'Amelioration des Plantes a Thessaloniki (Grece).
Patterson, H. D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545-554.
Patterson, H. D., Silvey, V., Talbot, M., and Weatherup, S. T. C. (1977). Variability of yields of cereal varieties in U.K. trials. Journal of Agricultural Science, Cambridge, 89, 238-245.
Payne, R. W., editor (1993). Genstat 5 Release 3 Reference Manual. Clarendon Press, Oxford.
Pinheiro, J. and Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS. Springer-Verlag, New York.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1991). Numerical Recipes in Fortran 77: The Art of Scientific Computing. Cambridge University Press.
Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. Springer, New York.
Ripley, B. D. (1981). Spatial Statistics. John Wiley, New York.
Robinson, G. K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science, 6, 15-51.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Searle, S. R. (1971). Linear Models. John Wiley and Sons, New York.
Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance Components. John Wiley and Sons, New York.
Smith, A., Cullis, B. R., and Thompson, R. (2001a). Analyzing variety by environment data using multiplicative mixed models and adjustments for spatial field trend. Biometrics, 57, 1138-1147.
Smith, A., Cullis, B., Appels, R., Campbell, A., Cornish, G., Martin, D., and Allen, H. (2001b). The statistical analysis of quality traits in plant improvement programs with application to the mapping of milling yield in wheat. Australian Journal of Agricultural Research, 52, 1207-1219.
Smith, A., Cullis, B., and Thompson, R. (2005a). The analysis of crop cultivar breeding and evaluation trials: an overview of current mixed model approaches. Journal of Agricultural Science, Cambridge.
Smith, A., Cullis, B., and Lim, P. (2005b). On the design of multi-phase experiments for quality trait data.
Smith, A. B. and Cullis, B. R. (2001). The analysis of crop variety evaluation data in Australia. Australian and New Zealand Journal of Statistics, 43, 129-145.
Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer-Verlag, New York.
Stein, M. L., Chi, Z., and Welty, L. J. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society, Series B, 66, 275-296.
Stram, D. O. and Lee, J. W. (1994). Variance components testing in the longitudinal mixed effects model. Biometrics, 50, 1171-1177.
Thompson, R. (1985). A note on restricted maximum likelihood estimation with an alternative outlier model. Journal of the Royal Statistical Society, Series B, 47, 53-55.
Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58, 267-288.
Van Vuuren, J. A. J., Barnard, R. O., and Claassens, A. S. (2000). Soil sampling under fixed cultivation practices. Communications in Soil Science and Plant Analysis, 31, 2055-2066.
Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York.
Verbyla, A. P. (1990). A conditional derivation of residual maximum likelihood. Australian Journal of Statistics, 32, 227-230.
Verbyla, A. P. and Cullis, B. R. (1992). The analysis of multistratum and spatially correlated repeated measures data. Biometrics, 48, 1015-1032.
Verbyla, A. P., Cullis, B. R., Kenward, M. G., and Welham, S. J. (1997). The analysis of designed experiments and longitudinal data using smoothing splines. Research Report 97/4, Department of Statistics, The University of Adelaide.
Verbyla, A. P., Cullis, B. R., Kenward, M. G., and Welham, S. J. (1999). The analysis of designed experiments and longitudinal data by using smoothing splines (with discussion). Applied Statistics, 48, 269-311.
Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B, 45, 133-150.
Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics, 13, 1378-1402.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Wand, M. P. (2003). Smoothing and mixed models. Computational Statistics, 18, 223-249.
Wang, Y. (1998). Mixed effects smoothing spline analysis of variance. Journal of the Royal Statistical Society, Series B, 60, 159-174.
Webster, R. and Oliver, M. A. (2001). Geostatistics for Environmental Scientists. John Wiley and Sons, Chichester.
Welham, S. J. and Thompson, R. (1997). Likelihood ratio tests for fixed model terms using residual maximum likelihood. Journal of the Royal Statistical Society, Series B, 59, 701-714.
Whittaker, E. T. (1923). On a new method of graduation. Proceedings of the Edinburgh Mathematical Society, 41, 63-75.
Wilk, M. B. and Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55, 1-17.
Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic description of factorial models for analysis of variance. Applied Statistics, 22, 392-399.
Wilkinson, G. N., Eckert, S. R., Hancock, T. W., and Mayo, O. (1983). Nearest neighbour (NN) analysis of field experiments (with discussion). Journal of the Royal Statistical Society, Series B, 45, 151-211.
Wood, J. T., Williams, E. R., and Speed, T. P. (1988). Non-orthogonal block structure in two-phase designs. Australian Journal of Statistics, 30A, 225-237.
Yates, F. (1940). The recovery of inter-block information in balanced incomplete block designs. Annals of Eugenics, 10, 317-325.
Zhang, D., Lin, X., Raz, J., and Sowers, M. (1998). Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association, 93, 710-719.
Zimmerman, D. L. (1989). Computationally exploitable structure of covariance matrices and generalised covariance matrices in spatial models. Journal of Statistical Computation and Simulation, 32, 1-15.
Zimmerman, D. L. and Harville, D. A. (1991). A random field approach to the analysis of field plot experiments. Biometrics, 47, 223-239.

APPENDIX A

Iterative Schemes

A.1 Introduction

REML estimation for linear mixed models generally involves the use of an iterative scheme. Patterson and Thompson (1971) proposed using the Fisher scoring (FS) algorithm, which requires computation of the expected information matrix. The terms in this matrix involve traces of products of large matrices and can be prohibitively expensive to compute for large data-sets or complex variance models. Gilmour et al. (1995) presented a variation of the FS algorithm in which the expected information matrix is replaced by an approximate average of the observed and expected information matrices. The elements of the so-called average information (AI) matrix are proportional to the residual sums of squares of working variates, one for each variance parameter in the model. Gilmour et al. (1995) demonstrated that the AI algorithm is competitive with the FS algorithm in terms of rate of convergence. The AI algorithm is the basis of the iterative scheme in ASReml (Gilmour et al., 2003). ASReml also uses sparse matrix methods, which allow quite complex models to be fitted to very large data-sets, and it provides the basis for the REML algorithms in samm and GENSTAT 5. These are the software packages that we have used to conduct the analyses in this book.

The AI algorithm is a derivative-based scheme and therefore its convergence sequence is not guaranteed to be monotone in terms of the REML log-likelihood. Problems with non-monotonic convergence may be experienced when fitting more complex R-structures or G-structures, especially for small data-sets where there is less information on the variance parameters. Such models include the unstructured, ante-dependence and factor analytic (FA) models. Stein (1999) demonstrated that convergence may be problematic when fitting higher order (i.e. $k > 1$) FA models. The problem is exacerbated by the difficulty of obtaining reliable starting values for variance parameters when analysing highly unbalanced data-sets or when fitting complex variance models. Another problem with the AI algorithm (and in general with all derivative-based algorithms) is the difficulty of ensuring that the variance parameters remain within the parameter space during iterations; this becomes more difficult for complex variance models.

The EM (Expectation-Maximisation) algorithm (Dempster et al., 1977) has been widely used for REML estimation of variance parameters in mixed models. Meng and van Dyk (1997) argue that this popularity stems from its computational simplicity, numerical stability and reliable convergence. The EM algorithm convergence sequence is monotonic, and the variance parameters at each iteration remain within the parameter space. The major disadvantage of the EM algorithm is its slow rate of convergence. Gilmour et al. (1995) showed that for several different variance models the FS and AI algorithms converge in fewer iterations than the EM algorithm. They also showed that the computing time required for each iteration of the AI algorithm was similar to that required for each iteration of the EM algorithm. Their study considered the basic EM algorithm, but there have been several more recent enhancements to the EM algorithm aimed at improving its rate of convergence. Meng and van Dyk (1998) used a Cholesky decomposition for the variance parameters in random coefficient regression models, which improved the rate of convergence of the basic EM algorithm. This principle was further extended by Liu et al. (1998) with the introduction of the parameter expanded or PX-EM algorithm, in which a rescaling is embedded into the iterative scheme. Foulley and van Dyk (2000) reviewed the EM and PX-EM algorithms and compared the convergence rates of the EM, ECME (Liu and Rubin, 1994) and PX-EM algorithms for the analysis of three data-sets. Their results indicated that the PX-EM algorithm generally converged faster than the EM and ECME algorithms, though convergence still required a large number of iterations (60-100). The data-sets used by Foulley and van Dyk (2000) were very small, and their study did not include a direct comparison with derivative-based schemes.

Stein (1999) introduced the concept of a composite EM-AI algorithm for the fitting of FA models, illustrating that the convergence behaviour of the AI algorithm could be significantly improved by commencing with a few iterations of the EM algorithm, then switching to the AI algorithm once the iterative values of the variance parameters were closer to the REML estimates. Similarly, Pinheiro and Bates (2000) suggest the use of a composite algorithm: they use 20 EM iterates before switching to Newton-Raphson (NR) iterates. This algorithm is used in their S-PLUS mixed models software, lme.

In this appendix we present the derivation of the AI algorithm as well as the EM and PX-EM algorithms for REML estimation in mixed models. The REML-EM algorithm is derived using the concept of the REML construct (see section 5.4.1).

A.2 Gradient methods: Average Information algorithm

In general the score equations (5.2.8) require a numerical solution. An algebraic solution is available for the overall scale parameter $\sigma^2_H$: given the $m$th iterate for $\kappa$, denoted $\kappa^{(m)}$,

$\sigma^{2(m)}_H = y' P^{(m)} y / (n - p)$   (1.2.1)

where $P^{(m)}$ is $P$ with $\kappa$ replaced by $\kappa^{(m)}$, and recall that $P$ is given by

$P = H^{-1} - H^{-1} X (X' H^{-1} X)^{-1} X' H^{-1} = R^{-1} - R^{-1} W C^{-1} W' R^{-1}$

Gradient methods involve the linearisation of the score equations using the first term in a Taylor series expansion. In the following we denote $\psi' = (\sigma^2_H, \kappa')$. Expanding the score equations about the value $\psi^{(m)}$ yields

$U_R(\psi) = U_R(\psi^{(m)}) + \left. \dfrac{\partial U_R(\psi)}{\partial \psi'} \right|_{\psi = \psi^{(m)}} \left( \psi - \psi^{(m)} \right)$   (1.2.2)

Solving the score equations, that is, equating the right hand side to zero, leads to

$\psi^{(m+1)} = \psi^{(m)} - \left[ \left. \dfrac{\partial U_R(\psi)}{\partial \psi'} \right|_{\psi = \psi^{(m)}} \right]^{-1} U_R(\psi^{(m)}) = \psi^{(m)} + \left[ \mathcal{I}_o^{(m)} \right]^{-1} U_R(\psi^{(m)})$   (1.2.3)

where $\mathcal{I}_o^{(m)}$ is the observed information matrix for $\psi$ evaluated at $\psi^{(m)}$. In the iterative scheme $\psi^{(m)}$ is the value from iteration $m$ and the updated value $\psi^{(m+1)}$ is given by the left hand side of (1.2.3). This is known as the Newton-Raphson algorithm. Closely related is the Fisher scoring algorithm, in which the expected information is used in place of the observed information. Since an algebraic form for $\sigma^{2(m)}_H$ exists, we only need an update for $\kappa$. This is given by

$\kappa^{(m+1)} = \kappa^{(m)} + \left[ \mathcal{I}_o^{(m)\,\kappa\kappa} \right] U_R(\kappa^{(m)})$   (1.2.4)

where $\mathcal{I}_o^{(m)\,\kappa\kappa}$ is the portion of $\big[ \mathcal{I}_o^{(m)} \big]^{-1}$ relating to $\kappa$. If $\sigma^2_H$ is fixed (usually to one) then the updating formula for $\kappa$ is

$\kappa^{(m+1)} = \kappa^{(m)} + \left[ \mathcal{I}_o(\kappa, \kappa)^{(m)} \right]^{-1} U_R(\kappa^{(m)})$   (1.2.5)

where $\mathcal{I}_o(\kappa, \kappa)^{(m)}$ is the portion of $\mathcal{I}_o^{(m)}$ relating to $\kappa$.
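Each of the updates (1.2.3)-(1.2.5) follows the same template: solve the relevant information system for a step and add it to the current iterate. The following minimal sketch of that template is ours, not the sparse production strategy of section A.5; it assumes user-supplied callables score and info returning $U_R(\psi)$ and an information matrix (observed, expected or average, as appropriate):

```python
import numpy as np

def scoring_update(psi, score, info, max_iter=20, tol=1e-8):
    """Generic gradient-scheme iteration (1.2.3): psi <- psi + I^{-1} U(psi).

    `score` and `info` are callables returning U_R(psi) and an information
    matrix (observed for Newton-Raphson, expected for Fisher scoring,
    average for the AI algorithm) evaluated at psi.
    """
    for _ in range(max_iter):
        step = np.linalg.solve(info(psi), score(psi))
        psi = psi + step
        if np.max(np.abs(step)) < tol:
            break
    return psi
```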

In the following three results we present the observed, expected and average information matrices for $\psi$, and show how the elements of the average information matrix may be computed using the concept of the "working variate".

Result A.1 The elements of the observed information matrix are given by

$\mathcal{I}_o(\sigma^2_H, \sigma^2_H) = -(n-p)/(2\sigma^4_H) + y'Py/\sigma^6_H$
$\mathcal{I}_o(\sigma^2_H, \kappa_i) = y'P\dot H_iPy/(2\sigma^4_H)$
$\mathcal{I}_o(\kappa_i, \kappa_j) = \tfrac{1}{2}\mathrm{tr}(P\ddot H_{ij}) - \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j) + y'P\dot H_iP\dot H_jPy/\sigma^2_H - y'P\ddot H_{ij}Py/(2\sigma^2_H)$

where $\ddot H_{ij} = \partial^2 H/\partial\kappa_i\partial\kappa_j$.

Proof: Firstly,

$\mathcal{I}_o(\sigma^2_H, \sigma^2_H) = -\dfrac{\partial^2\ell_R}{\partial(\sigma^2_H)^2} = -\dfrac{\partial U_R(\sigma^2_H)}{\partial\sigma^2_H} = \tfrac{1}{2}\,\partial\left\{ (n-p)/\sigma^2_H - y'Py/\sigma^4_H \right\}/\partial\sigma^2_H = -(n-p)/(2\sigma^4_H) + y'Py/\sigma^6_H$   (1.2.6)

and

$\mathcal{I}_o(\sigma^2_H, \kappa_i) = -\dfrac{\partial^2\ell_R}{\partial\sigma^2_H\partial\kappa_i} = \tfrac{1}{2}\,\partial\left\{ (n-p)/\sigma^2_H - y'Py/\sigma^4_H \right\}/\partial\kappa_i = y'P\dot H_iPy/(2\sigma^4_H)$ using appendix ??   (1.2.7)

Lastly,

$\mathcal{I}_o(\kappa_i, \kappa_j) = -\dfrac{\partial^2\ell_R}{\partial\kappa_i\partial\kappa_j} = -\dfrac{\partial U_R(\kappa_i)}{\partial\kappa_j} = \tfrac{1}{2}\left\{ \partial\,\mathrm{tr}(P\dot H_i)/\partial\kappa_j - \partial\left( y'P\dot H_iPy/\sigma^2_H \right)/\partial\kappa_j \right\}$   (1.2.8)

In (1.2.8) first consider

$\dfrac{\partial\,\mathrm{tr}(P\dot H_i)}{\partial\kappa_j} = \mathrm{tr}\left( \dfrac{\partial P}{\partial\kappa_j}\dot H_i + P\dfrac{\partial\dot H_i}{\partial\kappa_j} \right) = \mathrm{tr}\left( -P\dot H_jP\dot H_i + P\ddot H_{ij} \right)$ using appendix ??   (1.2.9)

Then consider

$\dfrac{\partial\, y'P\dot H_iPy}{\partial\kappa_j} = y'\left( \dfrac{\partial P}{\partial\kappa_j}\dot H_iP + P\ddot H_{ij}P + P\dot H_i\dfrac{\partial P}{\partial\kappa_j} \right)y$
$= y'\left( P\ddot H_{ij}P - 2P\dot H_iP\dot H_jP \right)y$   (1.2.10)

Substituting (1.2.9) and (1.2.10) into (1.2.8) gives the result as required. □

Result A.2 The elements of the expected information matrix are given by

$\mathcal{I}_e(\sigma^2_H, \sigma^2_H) = \tfrac{1}{2}(n-p)/\sigma^4_H$
$\mathcal{I}_e(\sigma^2_H, \kappa_i) = \tfrac{1}{2}\mathrm{tr}(P\dot H_i)/\sigma^2_H$
$\mathcal{I}_e(\kappa_i, \kappa_j) = \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j)$

Proof: Using Result A.1 and the expected value of quadratic forms (see section ??),

$\mathcal{I}_e(\sigma^2_H, \sigma^2_H) = E\left( -\dfrac{\partial^2\ell_R}{\partial(\sigma^2_H)^2} \right) = -(n-p)/(2\sigma^4_H) + \mathrm{tr}(PH)/\sigma^4_H + (X\tau)'PX\tau/\sigma^6_H = \tfrac{1}{2}(n-p)/\sigma^4_H$

using $PX = 0$ and $\mathrm{tr}(PH) = n - p$. Also, using Result A.1, we have

$\mathcal{I}_e(\sigma^2_H, \kappa_i) = E\left( -\dfrac{\partial^2\ell_R}{\partial\sigma^2_H\partial\kappa_i} \right) = \mathrm{tr}(P\dot H_i)/(2\sigma^2_H) + (X\tau)'P\dot H_iPX\tau/\sigma^4_H = \tfrac{1}{2}\mathrm{tr}(P\dot H_i)/\sigma^2_H$

Lastly,

$\mathcal{I}_e(\kappa_i, \kappa_j) = E\left( -\dfrac{\partial^2\ell_R}{\partial\kappa_i\partial\kappa_j} \right) = \tfrac{1}{2}\mathrm{tr}(P\ddot H_{ij}) - \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j) + \mathrm{tr}(HP\dot H_iP\dot H_jP) - \tfrac{1}{2}\mathrm{tr}(HP\ddot H_{ij}P)$
$= \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j)$

as required, using $PX = 0$ and $PHP = P$. □

In many applications considered in this book, and in others, either the number of observations ($n$), the dimension of the coefficient matrix of the mixed model equations ($C$) or the number of variance parameters can be large, so that the calculation of the trace terms in the expected (and observed) information matrices can be very computer intensive. To overcome this problem, Gilmour et al. (1995) suggest the use of a matrix with elements given by simplified averages of the corresponding terms of the observed and expected information matrices.

Result A.3 The elements of the average information matrix are given by:

$\mathcal{I}_A(\sigma^2_H, \sigma^2_H) = y'Py/(2\sigma^6_H) = \tfrac{1}{2}\left( HPy/\sigma^2_H \right)'\left( P/\sigma^2_H \right)\left( HPy/\sigma^2_H \right)$   (1.2.11)
$\mathcal{I}_A(\sigma^2_H, \kappa_i) = y'P\dot H_iPy/(2\sigma^4_H) = \tfrac{1}{2}\left( HPy/\sigma^2_H \right)'\left( P/\sigma^2_H \right)\left( \dot H_iPy \right)$   (1.2.12)
$\mathcal{I}_A(\kappa_i, \kappa_j) = \tfrac{1}{2}\left( \dot H_iPy \right)'\left( P/\sigma^2_H \right)\left( \dot H_jPy \right)$   (1.2.13)

Proof: Taking averages of the corresponding observed and expected terms we have

$\mathcal{I}_A(\sigma^2_H, \sigma^2_H) = \tfrac{1}{2}\left\{ -(n-p)/(2\sigma^4_H) + y'Py/\sigma^6_H + (n-p)/(2\sigma^4_H) \right\} = y'Py/(2\sigma^6_H)$

$\mathcal{I}_A(\sigma^2_H, \kappa_i) = \tfrac{1}{2}\left\{ y'P\dot H_iPy/(2\sigma^4_H) + \mathrm{tr}(P\dot H_i)/(2\sigma^2_H) \right\} \simeq y'P\dot H_iPy/(2\sigma^4_H)$

approximating the trace term $\sigma^2_H\,\mathrm{tr}(P\dot H_i)$ by $y'P\dot H_iPy$. Lastly,

$\mathcal{I}_A(\kappa_i, \kappa_j) = \tfrac{1}{2}\left\{ \tfrac{1}{2}\mathrm{tr}(P\ddot H_{ij}) - \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j) + y'P\dot H_iP\dot H_jPy/\sigma^2_H - y'P\ddot H_{ij}Py/(2\sigma^2_H) + \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j) \right\}$
$= \tfrac{1}{2}\left\{ \tfrac{1}{2}\mathrm{tr}(P\ddot H_{ij}) - y'P\ddot H_{ij}Py/(2\sigma^2_H) + y'P\dot H_iP\dot H_jPy/\sigma^2_H \right\} \simeq y'P\dot H_iP\dot H_jPy/(2\sigma^2_H)$

approximating $y'P\ddot H_{ij}Py$ by its expectation $\sigma^2_H\,\mathrm{tr}(P\ddot H_{ij})$, as required. □

Note that $\mathcal{I}_A(\kappa_i, \kappa_j)$ is an exact average for simple variance component models in which the variance structure is linear in the parameters $\gamma_i$, that is, in which $H = \sum_i \gamma_i Z_iZ_i'$, since in this case $\ddot H_{ij} = 0$. The matrix $\mathcal{I}_A$ is a scaled residual sums of squares matrix and is given by

$\mathcal{I}_A = \tfrac{1}{2} Q'(P/\sigma^2_H)Q$

where the columns of $Q$ are the so-called working variates. The working variates for $\sigma^2_H$ and $\kappa_i$, $i = 1, \ldots, n_k$, where $n_k$ is the length of $\kappa$, are given by

$q_{\sigma^2_H} = HPy/\sigma^2_H$
$q_i = \dot H_iPy$

Since $PHP = P$, the working variate for $\sigma^2_H$ can equivalently be written as

$q_{\sigma^2_H} = y/\sigma^2_H$

This form is consistent with a general definition of the working variates for any variance parameter in which $\sigma^2_H$ is included in the calculation of $P$. An outline of the computing strategy implemented in ASReml, samm and GENSTAT 5 is given in section A.5.
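To make the working-variate construction concrete, the following sketch computes $\mathcal{I}_A = \tfrac{1}{2}Q'(P/\sigma^2_H)Q$ of Result A.3 directly from a dense $H$ and its derivatives. It is illustrative only; the absorption strategy of section A.5 avoids ever forming $P$ explicitly:

```python
import numpy as np

def ai_matrix(y, X, H, Hdots, sigma2H):
    """Average information matrix I_A = Q'(P/sigma2H)Q / 2 of Result A.3.

    H is the (n x n) variance matrix of y (up to the scale sigma2H) and
    Hdots is a list of its derivatives dH/dkappa_i. Dense textbook sketch;
    ASReml's sparse absorption strategy is the practical alternative.
    """
    Hinv = np.linalg.inv(H)
    XtHinv = X.T @ Hinv
    P = Hinv - XtHinv.T @ np.linalg.solve(XtHinv @ X, XtHinv)
    Py = P @ y
    # working variates: q_{sigma2H} = y/sigma2H (using PHP = P), q_i = Hdot_i P y
    Q = np.column_stack([y / sigma2H] + [Hd @ Py for Hd in Hdots])
    return 0.5 * Q.T @ (P / sigma2H) @ Q
```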

A.2.1 Constraints on variance parameters

There may be a need to impose the linear constraints $T'\kappa = s$, where $T$ is an $n_k \times n_c$ matrix, $s$ is an $n_c \times 1$ vector and $n_c$ is the number of constraints. Note that this is equivalent to transforming $\kappa$ ($n_k \times 1$) to $\alpha$ ($n_t \times 1$, where $n_t = n_k - n_c$) according to

$\kappa = M\alpha + Es$   (1.2.14)

where $M$ and $E$ are $n_k \times n_t$ and $n_k \times n_c$ matrices respectively, such that $T'M = 0$ and $T'E = I_{n_c}$. The REML estimate of $\alpha$ can be obtained in terms of the score and information for the original parameters $\kappa$. First consider the score equations for $\alpha$:

$U_R(\alpha_i) = \dfrac{\partial\ell_R}{\partial\alpha_i} = \sum_{j=1}^{n_k} \dfrac{\partial\ell_R}{\partial\kappa_j}\dfrac{\partial\kappa_j}{\partial\alpha_i}$

From (1.2.14),

$\kappa_j = \sum_{i=1}^{n_t} m_{ji}\alpha_i + \sum_{l=1}^{n_c} e_{jl}s_l$

where $m_{ji}$ is the $(j,i)$th element of $M$ and $e_{jl}$ is the $(j,l)$th element of $E$. Thus $\partial\kappa_j/\partial\alpha_i = m_{ji}$, so that

$U_R(\alpha_i) = \sum_{j=1}^{n_k} U_R(\kappa_j)\, m_{ji} \quad \Rightarrow \quad U_R(\alpha) = M'U_R(\kappa)$

Elements of the observed information matrix for $\alpha$ are given by

$\mathcal{I}_o(\alpha_i, \alpha_{i'}) = -\dfrac{\partial^2\ell_R}{\partial\alpha_i\partial\alpha_{i'}} = -\sum_{k=1}^{n_k} \dfrac{\partial U_R(\alpha_i)}{\partial\kappa_k}\dfrac{\partial\kappa_k}{\partial\alpha_{i'}} = -\sum_{k=1}^{n_k}\sum_{j=1}^{n_k} \dfrac{\partial^2\ell_R}{\partial\kappa_k\partial\kappa_j}\dfrac{\partial\kappa_j}{\partial\alpha_i}\dfrac{\partial\kappa_k}{\partial\alpha_{i'}} = \sum_{k=1}^{n_k}\sum_{j=1}^{n_k} \mathcal{I}_o(\kappa_k, \kappa_j)\, m_{ji}m_{ki'}$

$\Rightarrow \quad \mathcal{I}_o(\alpha) = M'\mathcal{I}_o(\kappa)M$

The expected information matrix is then given by $\mathcal{I}_e(\alpha) = E\{M'\mathcal{I}_o(\kappa)M\} = M'\mathcal{I}_e(\kappa)M$, and the average information matrix by $\mathcal{I}_A(\alpha) = M'\mathcal{I}_A(\kappa)M$. Finally, updates for the estimation of $\alpha$ can be obtained as

$\alpha^{(m+1)} = \alpha^{(m)} + \left[ \mathcal{I}_A(\alpha^{(m)}) \right]^{-1} U_R(\alpha^{(m)}) = \alpha^{(m)} + \left[ M'\mathcal{I}_A(\kappa^{(m)})M \right]^{-1} M'U_R(\kappa^{(m)})$
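As a sketch of how this might be coded, a basis $M$ for (1.2.14) can be obtained as a null-space basis of $T'$ via the SVD; the update for $\alpha$ then maps back to $\kappa$ directly, since $\kappa^{(m+1)} - \kappa^{(m)} = M(\alpha^{(m+1)} - \alpha^{(m)})$. Here U_kappa and I_kappa denote $U_R(\kappa)$ and $\mathcal{I}_A(\kappa)$ evaluated at the current iterate:

```python
import numpy as np

def constrained_update(kappa, U_kappa, I_kappa, T):
    """One update of kappa subject to T'kappa = s, via the transformation
    (1.2.14). M spans the null space of T' (so T'M = 0), hence the step
    M @ alpha_step automatically preserves the constraints.
    """
    _, sv, Vt = np.linalg.svd(T.T)
    rank = int(np.sum(sv > 1e-10))
    M = Vt[rank:].T                     # n_k x n_t null-space basis of T'
    step = M @ np.linalg.solve(M.T @ I_kappa @ M, M.T @ U_kappa)
    return kappa + step
```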

A.3 EM Algorithm

The Expectation-Maximisation (EM) algorithm is a widely applicable algorithm for estimation in many settings. Special forms of the algorithm had been proposed for some time before Dempster et al. (1977) presented a unifying and formal account. McLachlan and Krishnan (1997) give a thorough account of the EM algorithm and its extensions. The EM algorithm is particularly well suited to estimation in linear mixed models: the random component $u$ is not observed, but if it were then ML (or REML) estimation would be greatly simplified. The EM algorithm has been used for estimation in many applications of the linear mixed model; for example, Laird and Ware (1982) suggest its use for the analysis of longitudinal data using random coefficient regression. In this section we present the EM algorithm for estimation in the linear mixed model. There are limitations, however, for models that are not linear in the parameters, and these will not be considered. Foulley et al. (2000) consider extending the basic EM algorithm for estimation of $\phi$ in the analysis of longitudinal data, but this will not be considered here.

Iterations of the REML-EM algorithm require the conditional distributions summarised in results 5.9 and 5.10, in which we replace the vector of variance parameters by the current iterates $\sigma^{2(m)}_H$ and $\kappa^{(m)}$. We write this as

$u \mid y_2, \psi = \psi^{(m)} \sim N\left( \tilde u^{(m)}, \sigma^{2(m)}_H C^{ZZ(m)} \right)$   (1.3.15)

Similarly we have

$e \mid y_2, \psi = \psi^{(m)} \sim N\left( \tilde e^{(m)}, \sigma^{2(m)}_H WC^{-1(m)}W' \right)$   (1.3.16)

A.3.1 REML-EM algorithm

In the following we develop the REML-EM algorithm for obtaining the REML estimates of the variance parameters $\sigma^2_H$ and $\kappa$. We note, however, that for $\kappa = \kappa^{(m)}$, direct maximisation of the residual likelihood with respect to $\sigma^2_H$ leads to

$\sigma^{2(m+1)}_H = y'P^{(m)}y/(n - p)$

This can be achieved without recourse to the "missing data", $u$. Each iteration of the EM algorithm consists of two steps, the expectation (E) step and the maximisation (M) step. For mixed models the algorithm splits into two parts: one for $\gamma$ and one for $(\sigma^2, \phi')$. The E-step involves evaluating the conditional expectation of the joint likelihood of the so-called complete data $(u, y)$, given the observed part of the data which relates to the estimation of the variance parameters, namely $y_2$. This conditional expectation is evaluated at the current iterates $\sigma^{2(m)}_H$ and $\kappa^{(m)}$. The M-step involves maximisation of this quantity with respect to $\kappa$. The joint likelihood of $(u, y)$ (though we recall that it is not a likelihood), ignoring constants and evaluated at $\sigma^2_H = \sigma^{2(m)}_H$, is

$\ell_c(\kappa) = \log f_Y(y \mid u;\, \tau, \sigma^2_H = \sigma^{2(m)}_H, \kappa) + \log f_U(u;\, \sigma^2_H = \sigma^{2(m)}_H, \kappa)$
$= -\tfrac{1}{2}\left\{ (n + b)\log\sigma^{2(m)}_H + \log|R| + e'R^{-1}e/\sigma^{2(m)}_H + \log|G| + u'G^{-1}u/\sigma^{2(m)}_H \right\}$   (1.3.17)

The expected value of this joint likelihood conditional on $y_2$ and evaluated at the current iterate $\psi^{(m)\prime} = (\sigma^{2(m)}_H, \kappa^{(m)\prime})$ is therefore given by

$\ell_{ce}(\kappa)^{(m)} = E\left\{ \ell_c(\kappa) \mid y_2, \psi = \psi^{(m)} \right\} = \ell_{ce}(\gamma)^{(m)} + \ell_{ce}(\sigma^2, \phi)^{(m)}$, say,

where

$\ell_{ce}(\gamma)^{(m)} = -\tfrac{1}{2} E\left\{ b\log\sigma^{2(m)}_H + \log|G| + u'G^{-1}u/\sigma^{2(m)}_H \mid y_2, \psi = \psi^{(m)} \right\}$
$\ell_{ce}(\phi)^{(m)} = -\tfrac{1}{2} E\left\{ n\log\sigma^{2(m)}_H + \log|R| + e'R^{-1}e/\sigma^{2(m)}_H \mid y_2, \psi = \psi^{(m)} \right\}$

Using (1.3.15) and the results in section ??,

$E\left( u'G^{-1}u \mid y_2, \psi = \psi^{(m)} \right) = \sigma^{2(m)}_H \mathrm{tr}\left( G^{-1}C^{ZZ(m)} \right) + \tilde u^{(m)\prime}G^{-1}\tilde u^{(m)}$

and so

$\ell_{ce}(\gamma)^{(m)} = -\tfrac{1}{2}\left\{ b\log\sigma^{2(m)}_H + \log|G| + \mathrm{tr}(G^{-1}C^{ZZ(m)}) + \tilde u^{(m)\prime}G^{-1}\tilde u^{(m)}/\sigma^{2(m)}_H \right\}$   (1.3.18)

Similarly, using (1.3.16),

$E\left( e'R^{-1}e \mid y_2, \psi = \psi^{(m)} \right) = \sigma^{2(m)}_H \mathrm{tr}\left( R^{-1}WC^{-1(m)}W' \right) + \tilde e^{(m)\prime}R^{-1}\tilde e^{(m)}$

and so

$\ell_{ce}(\phi)^{(m)} = -\tfrac{1}{2}\left\{ n\log\sigma^{2(m)}_H + \log|R| + \mathrm{tr}(R^{-1}WC^{-1(m)}W') + \tilde e^{(m)\prime}R^{-1}\tilde e^{(m)}/\sigma^{2(m)}_H \right\}$   (1.3.19)

This is the E-step. The M-step involves maximisation of (1.3.18) and (1.3.19) with respect to $\gamma$ and $(\sigma^2, \phi)$ respectively. Thus differentiating (1.3.18) with respect to $\gamma_i$ gives

$\dfrac{\partial\ell_{ce}(\gamma)^{(m)}}{\partial\gamma_i} = -\tfrac{1}{2}\left\{ \mathrm{tr}(G^{-1}\dot G_i) - \mathrm{tr}(G^{-1}\dot G_iG^{-1}C^{ZZ(m)}) - \tilde u^{(m)\prime}G^{-1}\dot G_iG^{-1}\tilde u^{(m)}/\sigma^{2(m)}_H \right\}$   (1.3.20)

where $\dot G_i = \partial G/\partial\gamma_i$. Similarly for $\phi_i$,

$\dfrac{\partial\ell_{ce}(\sigma^2, \phi)^{(m)}}{\partial\phi_i} = -\tfrac{1}{2}\left\{ \mathrm{tr}(\dot\Sigma_i\Sigma^{-1}) - \mathrm{tr}(\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}WC^{-1(m)}W')/\sigma^2 - \tilde e^{(m)\prime}\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}\tilde e^{(m)}/(\sigma^2\sigma^{2(m)}_H) \right\}$   (1.3.21)

where $\dot\Sigma_i = \partial\Sigma/\partial\phi_i$, and for $\sigma^2$,

$\dfrac{\partial\ell_{ce}(\sigma^2, \phi)^{(m)}}{\partial\sigma^2} = -\tfrac{1}{2}\left\{ n/\sigma^2 - \mathrm{tr}(\Sigma^{-1}WC^{-1(m)}W')/\sigma^4 - \tilde e^{(m)\prime}\Sigma^{-1}\tilde e^{(m)}/(\sigma^4\sigma^{2(m)}_H) \right\}$   (1.3.22)

The specific form of the updating formulae depends on the variance models specified for $G$ and $R$. Simple updating formulae can be derived for variance component models and also for unstructured models. As an illustration, we now present updating formulae for the one-way classification, which can then be easily extended to models with several terms whose G-structure is a scaled identity, such as those used in chapter 3.

A.3.2 REML-EM for the one-way classification

For illustrative purposes we consider implementation of the REML-EM algorithm for the one-way classification, in which we parameterise the variance model in terms of variance components by setting $\sigma^2_H = 1$. We present the updating formula for the variance component ratio at the end of this section. Recall that for the one-way classification we assume

$\mathrm{var}(e) = R = \sigma^2 I_n$ and $\mathrm{var}(u) = G = \sigma^2_u I_b$

so that $\gamma = \sigma^2_u$ and $\phi$ is the null vector. Hence $\dot G_i = I_b$ and (1.3.20) becomes

$\dfrac{\partial\ell_{ce}(\sigma^2_u)^{(m)}}{\partial\sigma^2_u} = -\tfrac{1}{2}\left\{ b/\sigma^2_u - \mathrm{tr}(C^{ZZ(m)})/\sigma^4_u - \tilde u^{(m)\prime}\tilde u^{(m)}/\sigma^4_u \right\}$

Equating this to zero gives the $(m+1)$th iterate as

$\sigma^{2(m+1)}_u = \left\{ \mathrm{tr}(C^{ZZ(m)}) + \tilde u^{(m)\prime}\tilde u^{(m)} \right\}/b$

Similarly, from (1.3.22),

$\dfrac{\partial\ell_{ce}(\sigma^2)^{(m)}}{\partial\sigma^2} = -\tfrac{1}{2}\left\{ n/\sigma^2 - \mathrm{tr}(WC^{-1(m)}W')/\sigma^4 - \tilde e^{(m)\prime}\tilde e^{(m)}/\sigma^4 \right\}$

Equating this to zero gives the $(m+1)$th iterate as

$\sigma^{2(m+1)} = \left\{ \mathrm{tr}(WC^{-1(m)}W') + \tilde e^{(m)\prime}\tilde e^{(m)} \right\}/n$   (1.3.23)

To simplify (1.3.23) we recall that $C = W'R^{-1}W + G^*$, hence for the one-way classification $C^{(m)} = W'W/\sigma^{2(m)} + G^{*(m)}$, where

$G^* = \begin{bmatrix} 0 & 0 \\ 0 & I_b/\sigma^2_u \end{bmatrix}$

Thus

$\mathrm{tr}(WC^{-1(m)}W') = \mathrm{tr}(W'WC^{-1(m)}) = \sigma^{2(m)}\,\mathrm{tr}\left( (C^{(m)} - G^{*(m)})C^{-1(m)} \right)$
$= \sigma^{2(m)}\left\{ (b + 1) - \mathrm{tr}(C^{ZZ(m)})/\sigma^{2(m)}_u \right\}$   (1.3.24)

since here $p = 1$. Substituting (1.3.24) into (1.3.23) gives the $(m+1)$th update for $\sigma^2$ as

$\sigma^{2(m+1)} = \left\{ (b+1)\sigma^{2(m)} - \sigma^{2(m)}\,\mathrm{tr}(C^{ZZ(m)})/\sigma^{2(m)}_u + \tilde e^{(m)\prime}\tilde e^{(m)} \right\}/n$   (1.3.25)

If $\sigma^2_H \neq 1$, which implies that $\sigma^2 = 1$, then we only require an updating formula for the variance component ratio $\gamma_u = \sigma^2_u/\sigma^2_H$. This is given by

$\gamma^{(m+1)}_u = \left\{ \mathrm{tr}(C^{ZZ(m)}) + \tilde u^{(m)\prime}\tilde u^{(m)}/\sigma^{2(m)}_H \right\}/b$   (1.3.26)
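As a concrete illustration, the sketch below iterates the one-way updates just derived (with $\sigma^2_H = 1$): the MME with coefficient matrix $C = W'W/\sigma^2 + G^*$ are solved for $(\hat\tau', \tilde u')'$, and $\sigma^2_u$ and $\sigma^2$ are then refreshed from $\mathrm{tr}(C^{ZZ(m)})$, $\mathrm{tr}(WC^{-1(m)}W')$, $\tilde u^{(m)}$ and $\tilde e^{(m)}$. This is a dense illustration of the algebra, not an efficient implementation:

```python
import numpy as np

def em_oneway(y, X, Z, s2u, s2e, n_iter=200):
    """REML-EM for the one-way classification (sigma2_H = 1).

    Each pass solves the mixed model equations, then applies the updates
    for sigma2_u and sigma2 derived above ((1.3.23) and the sigma2_u update).
    """
    n, b, p = len(y), Z.shape[1], X.shape[1]
    W = np.hstack([X, Z])
    for _ in range(n_iter):
        Gstar = np.zeros((p + b, p + b))
        Gstar[p:, p:] = np.eye(b) / s2u
        C = W.T @ W / s2e + Gstar            # coefficient matrix of the MME
        Cinv = np.linalg.inv(C)
        beta = Cinv @ (W.T @ y) / s2e        # (tau-hat', u-tilde')'
        u = beta[p:]
        e = y - W @ beta
        s2u = (np.trace(Cinv[p:, p:]) + u @ u) / b
        s2e = (np.trace(W @ Cinv @ W.T) + e @ e) / n
    return s2u, s2e
```

The monotone behaviour of the EM log-likelihood sequence makes this loop a useful source of starting values for the AI iteration, in the spirit of the composite schemes described in section A.1.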

A.3.3 REML-EM for variance components models

In this section we present the updating formulae for variance components models. Recall the general form

$G = \oplus_{i=1}^{q} G_i$

for which there is a corresponding partition of $Z$, namely $Z = [Z_1\ Z_2\ \ldots\ Z_q]$. The coefficient matrix of the mixed model equations (and its inverse) are partitioned conformably with the partitioning of $G$. Finally, for variance components models, such as the variance models for the split plot example and the balanced incomplete block design, each $G_i$ is a scaled identity matrix. Thus, for the split plot example,

$G = \oplus_{i=1}^{2} G_i = \begin{bmatrix} \sigma^2_b I_4 & 0 \\ 0 & \sigma^2_w I_8 \end{bmatrix}$

The updating formula for $\sigma^2_i$, $i = b, w$, is

$\sigma^{2(m+1)}_i = \left\{ \mathrm{tr}(C^{Z_iZ_i(m)}) + \tilde u_i^{(m)\prime}\tilde u_i^{(m)} \right\}/b_i$

A.4 PX-EM - an improved EM algorithm

Although the EM algorithm has been widely used for the estimation of variance parameters in the linear mixed model, it can be slow to converge. This is particularly a problem when the estimates of the variance parameters are on or near the boundary of the parameter space (Laird and Ware, 1982). Furthermore, Foulley and van Dyk (2000) suggest that biometricians working in animal breeding have been among the largest users of the EM algorithm, but note that the EM algorithm can be very slow to converge in these applications due to the relative magnitude of some of the variance components. To improve the rate of convergence of the EM algorithm, Liu et al. (1998) introduced the parameter expanded EM or PX-EM algorithm. In the case of linear mixed models, the algorithm involves a re-scaling of the random effects for simple variance components models, or a rotation of the random effects for unstructured G-structures. In this section we briefly review the PX-EM algorithm and illustrate its application in a simple example. For a more thorough review the reader is referred to Foulley and van Dyk (2000) and Liu et al. (1998).

A.4.1 Definition of PX-EM

The PX-EM algorithm assumes that the parameter vector $\kappa$ can be expanded to a larger set of parameters $\Gamma' = (\kappa^{*\prime}, \lambda')$, where $\lambda$ is a "working" parameter. The expanded parameterisation must satisfy the following two conditions:

1. it can be reduced to the original parameterisation $\kappa$, maintaining the same data model, via a many-to-one reduction $\kappa = F(\Gamma)$;

2. when $\lambda$ is set to its "null" value, say $\lambda_0$, it induces the same complete data model as with $\kappa = \kappa^*$.

Once the expanded parameter set has been set up, the PX-EM algorithm proceeds in a similar fashion to the EM algorithm, with an E-step and an M-step. The PX-E step computes the conditional expectation of the joint density of the complete data given $y_2$, the so-called observed data, with $\Gamma^{(m)\prime}$ set to $(\kappa^{*(m)\prime}, \lambda_0')$. The PX-M step then maximises this conditional expectation with respect to the expanded parameters, and $\kappa^{(m)}$ is updated via the reduction $\kappa^{(m+1)} = F(\Gamma^{(m+1)})$. In the next section we illustrate the PX-EM algorithm for linear mixed models with simple variance components structures. For ease of presentation we consider only one variance component. Iterations of the REML-PX-EM algorithm require an additional result on conditional distributions, similar to those summarised in results 5.9 and 5.10, in which we replace the vector of variance parameters by the current iterates $\sigma^{2(m)}_H$ and $\kappa^{(m)}$. We summarise this in the result below.

Result A.4 The joint distribution of $u$ and $y^* = y - X\tau$ given $y_2$ and evaluated at the current iterates $\sigma^{2(m)}_H$ and $\kappa^{(m)}$ is Gaussian with

$E\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2, \psi = \psi^{(m)} \right\} = \begin{bmatrix} \tilde u^{(m)} \\ \hat y^{*(m)} \end{bmatrix}$

$\mathrm{var}\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2, \psi = \psi^{(m)} \right\} = \sigma^{2(m)}_H \begin{bmatrix} C^{ZZ(m)} & -C^{ZX(m)}X' \\ -XC^{XZ(m)} & XC^{XX(m)}X' \end{bmatrix}$

where $\hat y^* = y - X\hat\tau$.

Proof: We consider the joint distribution of $(y_2', u', y^{*\prime})'$, which is Gaussian with mean zero and variance matrix

$\sigma^2_H \begin{bmatrix} L_2'HL_2 & L_2'ZG & L_2'H \\ GZ'L_2 & G & GZ' \\ HL_2 & ZG & H \end{bmatrix}$

Hence, given $\psi$,

$E\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2 \right\} = 0 + \begin{bmatrix} GZ' \\ H \end{bmatrix} L_2(L_2'HL_2)^{-1}L_2'y = \begin{bmatrix} GZ'Py \\ HPy \end{bmatrix} = \begin{bmatrix} \tilde u \\ \hat y^* \end{bmatrix}$

using $P = L_2(L_2'HL_2)^{-1}L_2'$, so that

$E\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2, \psi = \psi^{(m)} \right\} = \begin{bmatrix} \tilde u^{(m)} \\ \hat y^{*(m)} \end{bmatrix}$

Similarly, given $\psi$,

$\mathrm{var}\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2 \right\} = \sigma^2_H \left( \begin{bmatrix} G & GZ' \\ ZG & H \end{bmatrix} - \begin{bmatrix} GZ' \\ H \end{bmatrix} L_2(L_2'HL_2)^{-1}L_2' \begin{bmatrix} ZG & H \end{bmatrix} \right) = \sigma^2_H \begin{bmatrix} G - GZ'PZG & GZ' - GZ'PH \\ ZG - HPZG & H - HPH \end{bmatrix}$

Now, using result 5.6, we have

$G - GZ'PZG = C^{ZZ}$
$H - HPH = H\left( H^{-1} - H^{-1}X(X'H^{-1}X)^{-1}X'H^{-1} \right)H - H + 2X(X'H^{-1}X)^{-1}X' - X(X'H^{-1}X)^{-1}X' = X(X'H^{-1}X)^{-1}X' = XC^{XX}X'$

and

$GZ' - GZ'PH = GZ'H^{-1}X(X'H^{-1}X)^{-1}X' = -C^{ZX}X'$

proving the result. □

A.4.2 REML-PX-EM for variance components models

In this section we illustrate the implementation of the REML-PX-EM algorithm for a simple variance components model with one random factor. Recall (4.1.1), which can be rewritten as

$y = X\tau + \lambda Zf + e$   (1.4.27)

where $u = \lambda f$ and $\mathrm{var}(u) = \sigma^2_H\gamma_u I_b$, $\mathrm{var}(f) = \sigma^2_H d\, I_b$, so that $\gamma_u = \lambda^2 d$. Also $\mathrm{var}(e) = \sigma^2_H I_n$ and $\sigma^2$ is set to one. The reduced parameter vector is $\kappa = \gamma_u$, while the expanded variance parameter model is $\Gamma' = [\kappa^*\ \lambda]$, where $\kappa^* = d$. The role of the extra parameter $\lambda$ is simply to re-scale the random effects. Note that the null value $\lambda = 1$ results in the same variance model parameterisation as the reduced variance parameter model. The joint likelihood of the complete data $(f', y')'$, evaluated at $\sigma^2_H = \sigma^{2(m)}_H$ and ignoring constants, is

$\ell_c(\Gamma;\, f, y) = -\tfrac{1}{2}\left\{ (n + b)\log\sigma^{2(m)}_H + e'e/\sigma^{2(m)}_H + b\log d + f'f/(\sigma^{2(m)}_H d) \right\}$

where $e = y - X\tau - \lambda Zf$. The expected complete joint likelihood is therefore given by

$\ell_{ce}(\Gamma)^{(m)} = E\left\{ \ell_c \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\} = \ell_{ce}(d)^{(m)} + \ell_{ce}(\lambda)^{(m)}$, say,

where

$\ell_{ce}(d)^{(m)} = -\tfrac{1}{2} E\left\{ b\log d + f'f/(\sigma^{2(m)}_H d) \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\}$
$\ell_{ce}(\lambda)^{(m)} = -\tfrac{1}{2} E\left\{ n\log\sigma^{2(m)}_H + e'e/\sigma^{2(m)}_H \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\}$

It follows that, since

$E\left( f'f \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right) = \tilde u^{(m)\prime}\tilde u^{(m)} + \sigma^{2(m)}_H\,\mathrm{tr}(C^{ZZ(m)})$

the $(m+1)$th iterate for $d$ is given by

$d^{(m+1)} = \left\{ \tilde u^{(m)\prime}\tilde u^{(m)}/\sigma^{2(m)}_H + \mathrm{tr}(C^{ZZ(m)}) \right\}/b$

This is identical to the updating formula for $\gamma_u$ given in (1.3.26). Next we turn to obtaining an updating formula for $\lambda$, which is obtained by maximising $\ell_{ce}(\lambda)^{(m)}$. Rather than compute the expected value and then differentiate, it is more convenient to exchange the expectation and differentiation operators. We have

$\dfrac{\partial}{\partial\lambda}\, e'e = \dfrac{\partial}{\partial\lambda}\, (y - X\tau - \lambda Zf)'(y - X\tau - \lambda Zf) = -2f'Z'(y - X\tau - \lambda Zf)$

Hence

$\dfrac{\partial}{\partial\lambda}\, \ell_{ce}(\lambda)^{(m)} = E\left\{ f'Z'(y - X\tau - \lambda Zf) \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\}/\sigma^{2(m)}_H$

and setting this to zero gives

$\lambda^{(m+1)} = E\left\{ f'Z'(y - X\tau) \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\} \big/ E\left\{ f'Z'Zf \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\}$

Using Result A.4 it follows that

$E\left\{ f'Z'(y - X\tau) \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\} = \tilde u^{(m)\prime}Z'\hat y^{*(m)} - \sigma^{2(m)}_H\,\mathrm{tr}(Z'XC^{XZ(m)})$

$E\left\{ f'Z'Zf \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\} = \tilde u^{(m)\prime}Z'Z\tilde u^{(m)} + \sigma^{2(m)}_H\,\mathrm{tr}(Z'ZC^{ZZ(m)})$

Finally we reduce these parameters to the parameter of interest, viz

$\gamma^{(m+1)}_u = \left( \lambda^{(m+1)} \right)^2 d^{(m+1)}$

The above equations can be simply modified for variance components by setting $\sigma^2_H = 1$ and including $\sigma^2$.
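A single PX-EM iteration for this model differs from the corresponding EM iteration only by the rescaling step: the EM-style update for $d$ is combined with $\lambda^{(m+1)}$, computed from the conditional expectations of Result A.4, and the reduction $\gamma_u^{(m+1)} = (\lambda^{(m+1)})^2 d^{(m+1)}$ is applied. The sketch below continues the one-way example above in the variance components parameterisation ($\sigma^2_H = 1$, with $\sigma^2$ included); the translation of Result A.4 to this parameterisation is our assumption:

```python
import numpy as np

def px_em_step(y, X, Z, s2u, s2e):
    """One REML-PX-EM iteration for the one-way model (sigma2_H = 1).

    Conditional moments follow Result A.4 (with sigma2_H = 1), so
    E(f'Z'(y - X tau) | y2) and E(f'Z'Zf | y2) use the C^{XZ} and C^{ZZ}
    partitions of the inverse MME coefficient matrix.
    """
    n, b, p = len(y), Z.shape[1], X.shape[1]
    W = np.hstack([X, Z])
    Gstar = np.zeros((p + b, p + b))
    Gstar[p:, p:] = np.eye(b) / s2u
    Cinv = np.linalg.inv(W.T @ W / s2e + Gstar)
    beta = Cinv @ (W.T @ y) / s2e
    tau, u = beta[:p], beta[p:]
    ystar = y - X @ tau
    e = y - W @ beta
    Czz, Cxz = Cinv[p:, p:], Cinv[:p, p:]
    d = (u @ u + np.trace(Czz)) / b                   # EM-style update for d
    num = u @ Z.T @ ystar - np.trace(Z.T @ X @ Cxz)   # E(f'Z'(y - X tau) | y2)
    den = u @ Z.T @ Z @ u + np.trace(Z.T @ Z @ Czz)   # E(f'Z'Zf | y2)
    lam = num / den
    s2u = lam**2 * d                                  # reduction: gamma_u
    s2e = (np.trace(W @ Cinv @ W.T) + e @ e) / n
    return s2u, s2e
```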

A.5 Computational Implementation

In this final section of the appendix we present details of a computing algorithm for REML estimation of variance parameters in the linear mixed model. As a byproduct of this algorithm we obtain the BLUEs and BLUPs of the fixed and random effects respectively, as solutions to the mixed model equations. The key reference is Gilmour et al. (1995), who describe the algorithm in detail; we give a brief sketch of the computational implementation here. The algorithm forms the basis of the REML estimation procedures in GENSTAT 5, ASReml and samm.

A.5.1 Basic Results

The following results provide computationally convenient forms for the residual log-likelihood and the REML score equations.

Result A.5 The residual log-likelihood in (5.2.7) can also be written as

$\ell_R = -\tfrac{1}{2}\left\{ (n-p)\log\sigma^2_H + \log|G| + \log|R| + \log|C| + y'Py/\sigma^2_H \right\}$

Proof: Use the identity

$|C| = |C_{ZZ}|\,|C_{XX} - C_{XZ}C_{ZZ}^{-1}C_{ZX}| = |C_{ZZ}|\,|(C^{XX})^{-1}|$

so that

$|C| = |Z'R^{-1}Z + G^{-1}|\,|X'H^{-1}X|$
$= |G^{-1}|\,|I_n + R^{-1}ZGZ'|\,|X'H^{-1}X|$ using result ??
$= |G^{-1}|\,|R^{-1}|\,|R + ZGZ'|\,|X'H^{-1}X|$
$= |G^{-1}|\,|R^{-1}|\,|H|\,|X'H^{-1}X|$

Use of the above completes the proof as required. □

Recall that the vector of variance parameters is $\kappa' = (\gamma', \sigma^2, \phi')$. The vector $\gamma$ holds the so-called G-structure parameters, while $\sigma^2$ and the vector $\phi$ are the R-structure parameters. For example, in the analysis of the split plot design presented in chapter 3, $\gamma$ comprised $\sigma^2_b$ and $\sigma^2_w$, the parameter $\sigma^2$ is the residual variance, the vector $\phi$ is null and $\sigma^2_H$ is set to one. This is not the usual parameterisation: usually $\sigma^2$ is set to one, $\sigma^2_H$ is estimated, and hence $\gamma$ comprises variance component ratios, that is, $\gamma_i = \sigma^2_i/\sigma^2_H$, $i = b, w$. The following results expand the general formulation of the REML score equation for a generic variance parameter $\kappa_i$, given in (5.2.8).

Result A.6 The REML score for a variance parameter $\phi_i$ associated with the errors $e$ is given by

$U_R(\phi_i) = -\tfrac{1}{2}\left\{ \mathrm{tr}(\Sigma^{-1}\dot\Sigma_i) - \mathrm{tr}(C^{-1}W'\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}W)/\sigma^2 - \tilde e'\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}\tilde e/(\sigma^2\sigma^2_H) \right\}$

Result A.6 The REML score for a variance parameter φi associated with the errors e is given by n     1 −1 ˙ −1 0 −1 ˙ −1 2 UR(φi) = − 2 tr Σ Σi − tr C W Σ ΣiΣ W /σ − o e˜0Σ−1Σ˙ Σ−1e˜/(σ2σ2 ) i H COMPUTATIONAL IMPLEMENTATION 301 where Σ˙ i = ∂Σ/∂φi

Proof: Note that $\partial H/\partial\phi_i = \sigma^2\dot\Sigma_i$ for a variance parameter $\phi_i$. Then, using the results on differentiation in section ??,

$U_R(\phi_i) = -\tfrac{1}{2}\left\{ \mathrm{tr}(P\dot H_i) - y'P\dot H_iPy/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}\left( (R^{-1} - R^{-1}WC^{-1}W'R^{-1})\dot R_i \right) - y'PRR^{-1}\dot R_iR^{-1}RPy/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}(R^{-1}\dot R_i) - \mathrm{tr}(C^{-1}W'R^{-1}\dot R_iR^{-1}W) - \tilde e'R^{-1}\dot R_iR^{-1}\tilde e/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}(\Sigma^{-1}\dot\Sigma_i) - \mathrm{tr}(C^{-1}W'\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}W)/\sigma^2 - \tilde e'\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}\tilde e/(\sigma^2\sigma^2_H) \right\}$

as required, where $\dot R_i = \sigma^2\dot\Sigma_i$ and $\tilde e = RPy$. □

Result A.7 The REML score for the variance parameter $\sigma^2$ associated with the errors $e$ is given by

$U_R(\sigma^2) = -\tfrac{1}{2}\left\{ n/\sigma^2 - \mathrm{tr}(C^{-1}W'\Sigma^{-1}W)/\sigma^4 - \tilde e'\Sigma^{-1}\tilde e/(\sigma^4\sigma^2_H) \right\}$

Proof: The result follows from Result A.6 by noting that $\partial H/\partial\sigma^2 = \Sigma$. □

Result A.8 The REML score for a variance parameter $\gamma_{ij}$ associated with the $i$th random factor $u_i$ is given by

$U_R(\gamma_{ij}) = -\tfrac{1}{2}\left\{ \mathrm{tr}(G_i^{-1}\dot G_{ij}) - \mathrm{tr}(G_i^{-1}\dot G_{ij}G_i^{-1}C^{Z_iZ_i}) - \tilde u_i'G_i^{-1}\dot G_{ij}G_i^{-1}\tilde u_i/\sigma^2_H \right\}$

where $\dot G_{ij} = \partial G_i/\partial\gamma_{ij}$ and $C^{Z_iZ_i}$ is the $b_i \times b_i$ partition of the inverse of $C$ which corresponds to $u_i$.

Proof: Note that $\dot H_{ij} = Z_i\dot G_{ij}Z_i'$ for a variance parameter $\gamma_{ij}$. Define the $(p+b) \times b_i$ matrix $S_i$, where $b = \sum_{i=1}^{q} b_i$, so that $WS_i = Z_i$. Thus $S_i$ contains zeros everywhere except for an identity matrix of order $b_i$ in the partition corresponding to $Z_i$ in $W$. Also define $G^* = C - W'R^{-1}W$. Then the trace term in (5.2.8) is

$\mathrm{tr}(P\dot H_{ij}) = \mathrm{tr}\left( \dot G_{ij}Z_i'(R^{-1} - R^{-1}WC^{-1}W'R^{-1})Z_i \right)$
$= \mathrm{tr}\left( \dot G_{ij}S_i'W'R^{-1}(I_n - WC^{-1}W'R^{-1})WS_i \right)$
$= \mathrm{tr}\left( \dot G_{ij}S_i'W'R^{-1}WC^{-1}(C - W'R^{-1}W)S_i \right)$
$= \mathrm{tr}\left( \dot G_{ij}S_i'(C - G^*)C^{-1}G^*S_i \right)$
$= \mathrm{tr}\left( \dot G_{ij}S_i'G^*S_i \right) - \mathrm{tr}\left( \dot G_{ij}S_i'G^*C^{-1}G^*S_i \right)$
$= \mathrm{tr}(\dot G_{ij}G_i^{-1}) - \mathrm{tr}(G_i^{-1}\dot G_{ij}G_i^{-1}C^{Z_iZ_i})$

and the score for $\gamma_{ij}$ is

$U_R(\gamma_{ij}) = -\tfrac{1}{2}\left\{ \mathrm{tr}(P\dot H_{ij}) - y'P\dot H_{ij}Py/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}(\dot G_{ij}G_i^{-1}) - \mathrm{tr}(G_i^{-1}\dot G_{ij}G_i^{-1}C^{Z_iZ_i}) - y'PZ_iG_iG_i^{-1}\dot G_{ij}G_i^{-1}G_iZ_i'Py/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}(\dot G_{ij}G_i^{-1}) - \mathrm{tr}(G_i^{-1}\dot G_{ij}G_i^{-1}C^{Z_iZ_i}) - \tilde u_i'G_i^{-1}\dot G_{ij}G_i^{-1}\tilde u_i/\sigma^2_H \right\}$ using (5.3.16)

as required. □

Note that in the case of a standard variance component, that is, with $\sigma^2_HG_i = \sigma^2_H\gamma_i I_{b_i}$, this reduces to

$U_R(\gamma_i) = -\tfrac{1}{2}\left\{ b_i/\gamma_i - \mathrm{tr}(C^{Z_iZ_i})/\gamma_i^2 - \tilde u_i'\tilde u_i/(\sigma^2_H\gamma_i^2) \right\}$
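For a standard variance component, then, the score requires only $\mathrm{tr}(C^{Z_iZ_i})$ and $\tilde u_i$, both of which are by-products of the absorption strategy described in the next section. A minimal sketch of this last formula:

```python
import numpy as np

def score_gamma(C_zizi, u_i, gamma_i, sigma2H):
    """REML score for a standard variance component ratio gamma_i, using the
    simplified form of Result A.8: only tr(C^{ZiZi}) and u-tilde_i are needed.
    """
    b_i = len(u_i)
    return -0.5 * (b_i / gamma_i
                   - np.trace(C_zizi) / gamma_i**2
                   - (u_i @ u_i) / (sigma2H * gamma_i**2))
```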

A.5.2 Absorption of Mixed Model Equations

It was shown in section 5.3.3 that estimates of the fixed and random effects can be obtained via absorption of the MME. This process is fundamental to the computing strategy, as it also provides the quantities needed for the calculation of the log-likelihood, the score equations and the variance matrix of the prediction errors of the fixed and random effects. The following is intended to provide a sketch of the approach; more details of Gaussian elimination with back-substitution can be found, for example, in Press et al. (1991).

Result A.9 Absorption of the MME provides $\hat\tau$ and $\tilde u$, $C^{-1}$, $\log|C|$ and $y'Py$.

Proof: COMPUTATIONAL IMPLEMENTATION 303 First the coefficient matrix of the MME is augmented to form  0 0  cyy cXy cZy M =  cXy CXX CXZ  (1.5.28) cZy CZX CZZ 0 −1 where cyy = y R y. Thus M is the coefficient matrix of the MME aug- mented by the right hand side of the equations and a quadratic form involv- ing the data vector y. Then Gaussian elimination with diagonal pivoting is performed, that is, sequential absorption of the rows and columns of M into the first element cyy. For notational simplicity we present this in terms of the full CZZ matrix (corresponding to the entire set of random effects), ie the first pivot is the matrix CZZ . The actual implementation in ASReml involves absorption one equation at a time with the MMEs reordered to maximise sparsity. Furthermore, whilst performing all of the following operations such as absorption and back-substitution, computing time is kept to a minimum by avoiding arithmetic operations involving zero elements. Step 1: First partition M as   M 11 M 12 M = 0 M 12 M 22 where  0   0  cyy cXy cZy M 11 = , M 12 = and M 22 = CZZ . cXy CXX CXZ

CZZ (= M 22) is absorbed by constructing ∗ −1 0 M = M 11 − M 12M 22 M 12  0 −1 0 0 −1  cyy − cZyCZZ cZy cXy − cZyCZZ CZX = −1 −1 cXy − CXZ CZZ cZy CXX − CXZ CZZ CZX  ∗ ∗  m11 m12 = ∗ 0 ∗ (1.5.29) m12 M 22  y0H−1y y0H−1X  = X0H−1y X0H−1X Step 2: ∗ Perform another absorption with M 22 as the pivot: ∗∗ ∗ ∗ ∗ −1 ∗ 0 M = m11 − m12 (M 22) m12 = y0P y (1.5.30) Step 3: We now obtain estimates of fixed and random effects as solution to (1.5.29) defines the single set of equations for τ , viz, ∗ ∗ 0 M 22τ = m12 ∗ −1 ∗ 0 ⇒ τˆ = (M 22) m12 304 ITERATIVE SCHEMES Then via backsubstitution we obtain both u˜ and e˜ which are given by −1 u˜ = CZZ (cZy − CZX τˆ) e˜ = y − Xτˆ − Zu˜ The elements of C−1 are obtained during this process by noting that for example, ∗ −1 −1 −1 XX (M 22) = CXX − CXZ CZZ CZX = C The log determinant of C is also calculated using the identity that |C| = −1 |CZZ ||CXX − CXZ CZZ CZX | so that ∗ log |C| = log |CZZ | + log |M 22| That is, the sum of the log determinants of the pivots in the absorption pro- cess. Similarly, when equations are absorbed one at a time the log determinant is the sum of the (non-zero) pivots. Result A.10 The average information matrix can be obtained by absorption of the matrix M but with the data vector y replaced by the matrix Q = h i 2 q 2 , q ,..., q of working variates for the variance parameters σ and κ σ 1 nk H H where 2 q 2 = y/σ σ H H ˙ qi = HiP y Proof: From (1.2.13) the average information matrix can be written as I = 1 Q0(P /σ2 )Q A 2 H Note that this is of the same form as y0P y which was obtained from the final absorption step on the matrix M in (1.5.28). Thus Q0PQ can be similarly obtained via absorption of the matrix  0 0  CQQ CXQ CZQ M =  CXQ CXX CXZ  CZQ CZX CZZ 0 −1 0 −1 0 −1 where CQQ = Q R Q, CXQ = X R Q and CZQ = Z R Q. 2 ˙ −1 Result A.11 The working variate for φi is given by qi = RiR e˜, for 2 2 ˙ −1 σ , qσ2 = e˜/σ and for γij, qij = ZiGijGi u˜i. Proof: 2 1. φi: note that H˙ i = R˙ i = σ Σ˙ i so that ˙ qi = RiP y −1 = R˙ iR RP y −1 = R˙ iR e˜ −1 = Σ˙ iΣ e˜ SUMMARY 305 2 2. σ : result follows from H˙ i = Σ ˙ ˙ 0 3. γij: Note that Hij = ZiGijZi so that ˙ 0 qij = ZiGijZiP y ˙ −1 0 = ZiGijGi GiZiP y ˙ −1 = ZiGijGi u˜i as required. 2

A.6 Summary

In this appendix we have presented a review of the most commonly used algorithms for solving the REML score equations. We presented a detailed derivation of the derivative-based methods, namely Newton-Raphson and Fisher scoring, and described how the more computationally efficient Average Information algorithm can be derived and implemented. We also presented a brief review of the Expectation-Maximisation (EM) algorithm for REML estimation in linear mixed models and illustrated its application in a simple example. The parameter expanded EM (PX-EM) algorithm was also described briefly and its implementation illustrated using a simple linear mixed model.