National Institute for Applied Statistics Research Australia
University of Wollongong, Australia
Working Paper
15-18
Mixed Models for Data Analysts
Brian Cullis, Alison Smith, Ari Verbyla, Robin Thompson, and Sue Welham
Copyright © 2021 by the National Institute for Applied Statistics Research Australia, UOW. Work in progress; no part of this paper may be reproduced without permission from the Institute.
National Institute for Applied Statistics Research Australia, University of Wollongong, Wollongong NSW 2522, Australia. Phone +61 2 4221 5435, Fax +61 2 4221 4998. Email: [email protected]

Mixed Models for Data Analysts - DRAFT 2018
Brian Cullis, Alison Smith
National Institute for Applied Statistics Research Australia, University of Wollongong

Ari Verbyla (CSIRO), Robin Thompson, Sue Welham
Contents
1 Introduction
  1.1 Applications of linear mixed models
  1.2 Model definition
  1.3 Setting the scene

2 An overview of linear models
  2.1 Example: Glasshouse experiment on the growth of diseased and healthy plants
  2.2 Model construction and assumptions
  2.3 Estimation in the linear model
  2.4 Non-full rank Design Matrix
  2.5 A glimpse at REML
  2.6 Factors and model specification
  2.7 Tests of hypotheses
  2.8 Analysis of plant growth example
  2.9 Summary

3 Analysis of designed experiments
  3.1 One-way classification
  3.2 Randomised complete blocks
  3.3 Split plot design
  3.4 Balanced incomplete blocks
  3.5 In search of efficient estimation for variance components
  3.6 Summary

4 The Linear Mixed Model
  4.1 The Model
  4.2 Variance structures for the errors: R-structures
  4.3 Variance structures for the random effects: G-structures
  4.4 Separability
  4.5 Variance models
  4.6 Identifiability of variance models
  4.7 Combining variance models
  4.8 Summary

5 Estimation
  5.1 Estimation of fixed effects and variance parameters
  5.2 Estimation of variance parameters
  5.3 Estimation of fixed and random effects
  5.4 An approach to prediction in linear mixed models - REML construct
  5.5 Summary

6 Inference
  6.1 General hypothesis tests for variance models
  6.2 Hypothesis testing in mixed models: fixed and random effects
  6.3 Summary

7 Prediction from linear mixed models
  7.1 Introduction
  7.2 The Prediction Model
  7.3 Computing Strategy
  7.4 An example of the prediction model
  7.5 Prediction in models not of full rank
  7.6 Issues of averaging
  7.7 Prediction of new observations

8 From ANOVA to variance components
  8.1 Introduction
  8.2 Navel Orange Trial
  8.3 Sensory experiment on frozen peas

9 Mixed models for Geostatistics
  9.1 Introduction
  9.2 Motivating Examples
  9.3 Geostatistical mixed model
  9.4 Covariance Models for Gaussian random fields
  9.5 Prediction
  9.6 Estimation
  9.7 Model building and diagnostics
  9.8 Analysis of examples
  9.9 Simulation Study

10 Population and Quantitative Genetics
  10.1 Introduction
  10.2 Mendel's Laws
  10.3 Population genetics
  10.4 Quantitative genetics
  10.5 Theory
  10.6 Discussion

11 Mixed models for plant breeding
  11.1 Introduction
  11.2 Spatial analysis of field trials
  11.3 Analysis of multi-phase trials for quality traits

12 The analysis of quantitative trait loci
  12.1 Introduction
  12.2 Example
  12.3 Overview of Molecular Genetics
  12.4 Reproduction
  12.5 Genetic information
  12.6 Linkage analysis
  12.7 QTL analysis
  12.8 Interval mapping: The Regression Approach
  12.9 Whole genome interval mapping
  12.10 Conclusions

13 Mixed models for penalized models
  13.1 Introduction
  13.2 Hard-edge constraints
  13.3 Soft-edge constraints
  13.4 Penalized Regression splines
  13.5 P-splines
  13.6 Smoothing splines
  13.7 L-splines
  13.8 Variance modelling
  13.9 Analysis of high-resolution mixograph data
  13.10 Analysis of another example: still to come
  13.11 LASSO
  13.12 Discussion

Bibliography

A Iterative Schemes
  A.1 Introduction
  A.2 Gradient methods: Average Information algorithm
  A.3 EM Algorithm
  A.4 PX-EM - an improved EM algorithm
  A.5 Computational Implementation
  A.6 Summary
CHAPTER 1
Introduction
The linear model is a basic tool in statistical modelling with widespread use and application in data analysis and applied statistics. The expected value of the response variable is given by a linear combination of explanatory variables, often termed the linear predictor (McCullagh and Nelder, 1994). The stochastic nature of the response is modelled using a single random component, elements of which are assumed to be independent with constant variance. The linear mixed model is a natural extension of the linear model. In the simplest case the linear mixed model is a linear model which has been extended to allow for a correlated error term or additional random components. The wide range of variance and correlation models makes the linear mixed model a very flexible tool for data analysts. The linear mixed model provides the basis for the analysis of many data-sets commonly arising in the agricultural, biological, medical and environmental sciences, as well as other areas.

This introductory chapter provides an overview of the book through examples. The flavour of the linear mixed model and the diversity of possible applications are presented. The examples also allow the development of models that contain both variates and factors, and indeed allow us to define these two types of variable. The symbolic representation of linear models is introduced as it will be used throughout the book. The philosophical issues that arise in the analysis of data are also raised.

The authors have a background in statistics arising from experiments. Thus the models proposed in this book begin with the design or sampling structure of the study. These models are randomization based. Hence it is important to understand the origins of mixed models, which belong in the analysis of variance. There are often further considerations in model building that suggest plausible variance models that are outside a randomization based approach. The diversity of applications and the complexity and flexibility of variance models result in the subjective choice of a variance model, and hence analysis, for any particular data-set. A consequence of the increased flexibility and complexity of variance modelling is the danger of fitting inappropriate variance models to small or highly unbalanced data-sets. The approach taken in this book is to highlight the difference between analyses that use a variance modelling approach and a design-based approach. The former relies heavily on the appropriateness of the fitted variance structure for the validity of inferences concerning both fixed and random effects. The importance of data-driven diagnostics is stressed throughout the book, although there remain unresolved issues in this area.
1.1 Applications of linear mixed models
Typical applications covered in this book include the analysis of balanced and unbalanced designed experiments, the analysis of balanced and unbal- anced longitudinal data, repeated measures analysis, the analysis of regular or irregular spatial data, the combined analysis of several experiments (ie meta-analysis) and the analysis of both univariate and multivariate animal and plant genetics data. This chapter will be finished last!!!
1.2 Model definition
1.2.1 Defining the linear model
A linear model relates the expected value of a response variable (outcome) to a linear combination of a set of explanatory variables, termed the linear predictor. The stochastic properties of the response are determined by the addition of a single random component to the linear predictor. Explanatory variables are assumed to be measured without error, and may be design-based (values planned as part of the experiment) or observational. For example, we may wish to examine the effect of imposed dietary regimes (explanatory variable) on the live-weight of pigs (response variable). Let yᵢ represent the response measured at the end of the experiment for the ith individual or experimental unit (in the example the pig), i = 1, ..., n. These are combined to form the data vector y (n × 1). The linear model for y is then written as

\[ y = X\tau + e \qquad (1.2.1) \]

where τ (p × 1) is a vector of (fixed) effects corresponding to the explanatory variables (in this example diet effects) and X (n × p) is the associated design matrix (chosen to be of full rank, see chapter 2). The columns of X may be dummy variables, that is columns of zeros and ones, which assign categorical variables or factors to units, or columns of continuous variables or covariates. These aspects will be covered in more detail in chapter 2. The residuals or errors, eᵢ, are assumed to be identically and independently distributed with zero mean and common variance, and to follow a Normal distribution. This is written as

\[ e \sim N(0, \sigma^2 I_n) \]

The parameters to be estimated are σ² and τ. Generally we wish to give measures of uncertainty or precision and sometimes conduct tests of hypotheses on τ.

1.2.2 Defining the linear mixed model

The linear mixed model also relates the expected value of a response variable (outcome) to the linear predictor. In this case, however, the stochastic properties of the response are determined by one or several random effects which are added to the linear predictor. For example, groups of pigs may have been housed together in pens and the live-weight of the pigs may therefore be affected by both the dietary treatment and the pen to which the pigs were assigned. We may regard the dietary effects as fixed effects and the pen effects as random effects. In doing this we assume that in the absence of dietary effects there is a (positive) covariance between the live-weights of pigs that are in the same pen.

Another important extension of the linear model is to allow the elements of the residual vector e or, more generally, any random effects to be correlated. For example, in the analysis of longitudinal data, repeated measurements on the same individual may be correlated. This feature may be accounted for by relaxing the assumption of independence between the residuals within each individual, although residuals on different individuals may remain independent. We need a flexible framework to define the types of correlation models we may wish to fit. Further, the variance may change as time progresses. This is often the case in longitudinal data-sets involving the measurement of weight (Kenward, 1987). Hence we may also need to relax the assumption of constant variance, choosing either to model the variance as a function of time, or to use another parametric model which adequately reflects the variance properties of the data.

If y (n × 1) denotes the vector of observations, the linear mixed model can be written as

\[ y = X\tau + Zu + e \qquad (1.2.2) \]

where τ (p × 1) is the vector of fixed effects, X (n × p) is the design matrix (of full rank) that associates observations with the appropriate combination of fixed effects, u (b × 1) is the vector of random effects, Z (n × b) is the design matrix that associates observations with the appropriate combination of random effects, and e is the vector of residual errors. It is assumed that

\[ \begin{bmatrix} u \\ e \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \; \sigma^2_H \begin{bmatrix} G(\gamma) & 0 \\ 0 & R(\phi) \end{bmatrix} \right) \qquad (1.2.3) \]

where the matrices G and R are functions of parameters γ and φ, respectively. The parameter σ²_H is a variance parameter which we will refer to as the overall scale parameter.
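To make the pieces of (1.2.2) and (1.2.3) concrete, here is a minimal numerical sketch (in Python, which is not otherwise used in this book) of the marginal variance matrix implied by the model. The pen layout, the parameter values and the identity G- and R-structures are invented purely for illustration.

```python
import numpy as np

# Toy layout: 6 pigs housed in 3 pens (2 pigs per pen); values are illustrative only.
n, b = 6, 3
Z = np.kron(np.eye(b), np.ones((2, 1)))   # pen design matrix (n x b)

sigma2_H = 1.0          # overall scale parameter
G = 0.5 * np.eye(b)     # G-structure: independent pen effects, variance ratio 0.5
R = np.eye(n)           # R-structure: independent, constant-variance errors

# Marginal variance of y implied by (1.2.2)-(1.2.3): var(y) = sigma2_H * (Z G Z' + R)
V = sigma2_H * (Z @ G @ Z.T + R)
print(V)   # block diagonal: pigs sharing a pen have covariance 0.5
```

The block-diagonal form of V shows directly the (positive) covariance between pigs in the same pen described above.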
1.2.3 Factors and covariates

Covariates and factors are used to define explanatory variables. A covariate is defined to be an explanatory variable which may take any value (within a given valid range). For example, in a study of the growth of weaner pigs, the initial live-weight of each pig may have been measured with the view of adjusting for its effect. The initial weight, denoted by initwt, is a covariate in the linear mixed model. When a covariate is included then, in general, a single effect, associated with the assumed linear regression of the response on the covariate, is included in either τ (for a fixed covariate) or u (for a random covariate).

A factor is defined as an explanatory variable which takes one of a set of discrete values (or categories). The set of discrete values are called levels. For example, in the study of the growth of pigs, an individual pig may be described by a factor denoting its sex. This factor, denoted by sex, can take one of two values, either male or female. Factors may be purely qualitative, ordinal (eg, with levels such as low, medium and high), or relate to some underlying quantitative scale. In the preceding example the factor sex is a qualitative factor, since its levels cannot be ordered in any special way. However, to assess the effect of phosphorus (P) fertilizer on the yield of wheat, an agronomist may conduct an experiment with several treatments which are rates of fertilizer, say 0, 5, 10, 20 and 40 kg/ha P. Here the treatment, P, is a factor with 5 levels. It is a quantitative factor and it is usually of interest to model the effect of P on yield using some functional form. In general, when a factor is included as a term in the linear mixed model, the result is to include in τ or u an effect for each level of the factor. For quantitative factors there are examples (see chapter ??) where the factor may be included more than once in the linear model, in differing roles, say as a covariate and as a factor. As a covariate we are fitting the linear regression of the response on the levels of the factor, while in the second case an effect is included for each level of the factor. This latter setting may be useful to examine the goodness of fit of the assumed linear regression of the response on the levels of the factor (see the sketch at the end of this section).

Terms in the linear mixed model may relate directly to the factors and covariates in the set of explanatory variables or may be formed as a combination of factors and/or covariates. For example, the notion of interaction usually involves the examination of the effect of one factor or covariate in the presence of another factor or covariate. The form for the model terms as combinations of factors and covariates can be conveniently and succinctly written down using a syntax originally developed by Wilkinson and Rogers (1973), which is now briefly described. This syntax is widely adopted in many statistical packages such as GENSTAT 5 (Payne et al., 2001) and S-PLUS (Mathsoft, 2000).
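The dual role of a quantitative factor can be illustrated with a short sketch. The fertilizer rates are those of the example above, but the simulated yields, the column names and the use of the Python statsmodels package are our own assumptions, so this is a sketch of the idea rather than an analysis prescribed by the text.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Invented yields for the fertilizer example: rates 0, 5, 10, 20, 40 kg/ha P,
# three replicates per rate (the yield values are made up for illustration).
rng = np.random.default_rng(1)
rates = np.repeat([0, 5, 10, 20, 40], 3)
yld = 2.0 + 0.05 * rates - 0.0008 * rates**2 + rng.normal(0, 0.1, rates.size)
dat = pd.DataFrame({"P": rates, "yld": yld})

# P as a covariate: a single linear-regression effect.
linear = smf.ols("yld ~ P", dat).fit()
# P as a factor: one effect for each level, via the C() operator.
factor = smf.ols("yld ~ C(P)", dat).fit()

# Comparing the nested fits tests lack of fit of the straight line:
# the extra factor degrees of freedom capture deviations from the linear trend.
print(anova_lm(linear, factor))
```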
1.2.4 Symbolic representation of model formulae
We describe the terms of the linear model using the syntax of Wilkinson and Rogers (1973). Factors and covariates may be present in model terms as either main effects or as a component of an interaction. In the example on pigs, if the male and female pigs were randomly assigned to one of three diets, we may wish to examine both the main effects of sex and diet and allow for a possible interaction. We denote the interaction between factors sex and diet using the dot operator, ie as sex.diet. In this example, we wish to fit both the main effects of diet and sex as well as the interaction. Thus the factors are said to be crossed, and this is succinctly written down by the single term sex*diet, where the ∗ operator is the crossing operator. This single term expands to

sex ∗ diet = sex + diet + sex.diet

where sex.diet denotes the 6 interaction effects of sex and diet.

Alternatively, factors may be nested. For example, in testing new wheat varieties in South Australia, the wheat growing area of the state has been divided into 6 regions and within each region comparative yield trials are grown at several locations. These locations are said to be nested within regions, ie each location occurs in only one region. Locations may be coded from one to the number of locations within each region, rather than by their unique name, to reflect the aim to fit a nested model in which the main effect of location is not explicitly fitted. This nested form implies we fit the main effect of region and the interaction of region and location. A succinct representation of these two terms results from use of the / operator, which represents the nesting operation. For example,

region/location = region + region.location

Of course, if the location factor was recoded using the individual location names then the main effect of the recoded factor is equivalent to the interaction with the nesting factor. More complex operators and model term functions will be introduced as necessary.

The terms of the linear mixed model are classified according to whether the effects associated with the term are fixed or random. Hence we can represent a linear mixed model by fixed and random model formulae. In the above example of pig weight, if we denote the response by y and, in addition, pigs were housed in groups or pens within an animal house, then a convenient representation of the linear mixed model for these data is given by

y ∼ sex ∗ diet + pen + units

where pen is a factor denoting the allocation of pigs to pens. The reserved word units represents the residual term, which could be regarded as a random factor with one level for each of the experimental units. As discussed in the preface, terms which involve random effects are bolded in the symbolic model formula.
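The symbolic formulae above map directly onto the model-formula syntax of modern software. The following sketch fits the pig model y ∼ sex*diet + pen + units in Python's statsmodels; the simulated data values, and the device of passing the random pen term through the `groups` argument, are illustrative assumptions rather than anything prescribed in the text.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated pig data, purely for illustration: 24 pigs, 2 sexes x 3 diets,
# housed in 8 pens of 3 pigs each.
rng = np.random.default_rng(0)
pigs = pd.DataFrame({
    "sex":  np.tile(["M", "F"], 12),
    "diet": np.repeat(["A", "B", "C"], 8),
    "pen":  np.repeat(np.arange(8), 3),
})
pen_eff = rng.normal(0, 1, 8)                        # random pen effects
pigs["y"] = 50 + pen_eff[pigs["pen"]] + rng.normal(0, 2, 24)

# The fixed formula sex*diet expands to sex + diet + sex.diet; the random
# pen term is supplied through `groups`, and units (the residual) is implicit.
fit = smf.mixedlm("y ~ sex * diet", data=pigs, groups=pigs["pen"]).fit(reml=True)
print(fit.summary())
```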
1.3 Setting the scene

As we have seen, the effects (and therefore the terms associated with these effects) in a linear mixed model are classified as either fixed or random. A random term models the (co)variation in the data, in the simplest case by the introduction of a component of variance. A fixed term contributes to the expectation of the response. It is difficult to construct a set of rules which determine the classification of effects (and hence terms) as fixed or random. Traditionally it has been suggested that if the "levels of the factor come from a probability distribution" then the factor may be considered as random (see Searle et al., 1992, chap. 1). However, a term may also be taken as random even though the levels of the factor which relate to the term (or a component of the term) cannot be assumed to have arisen from a probability distribution. Our approach is pragmatic and we allow the classification to be less rigid. There are general rules which can assist with the decision, but the final choice must depend on the aim of the analysis and the role of the term in the linear mixed model. In the following we present a range of examples which are discussed in the book and describe how fixed and random effects may arise in the context of the analysis of such examples.

Components of variance: The aim of the analysis of these data is typically concerned with determining and quantifying the major sources of variation. The data are often collected as part of a designed or observational study with many factors classifying the units, and the analysis is to determine the magnitude of variation associated with each factor or combination of factors. Examples of these types of factors include trees within an orchard, students within a school, doctors within a hospital, cows within a herd within a farm, days within a sampling period, batches within a process and so on. The levels of the factor can usually be thought of as having arisen from a probability distribution and, as the aim is to examine (co)variation associated with the factor (and terms derived from it), it is natural to classify the factor (and terms derived from it) as random. The analysis of an example of this type of data is described in chapters 2 and 3.

Designed experiments: The analysis of data arising from designed experiments is one of the earliest examples of the application of linear mixed models. Traditionally, factors are generally defined as either blocking or treatment factors (see for example GENSTAT 5, Payne et al., 2001). Blocking factors are concerned with the description and definition of the stratification of the experimental units. In this sense their role in the linear mixed model is to describe (co)variation in the data and their levels can be assumed to have arisen from a probability distribution. Hence blocking factors are usually classified as random. We note that in the analysis of orthogonal designs, such as randomised complete blocks, estimation of treatment effects is identical regardless of whether blocking factors are classified as fixed or random. Treatment factors are generally classified as fixed.

Quantitative treatments: In chapter ?? we present the analysis of experiments with quantitative treatments in which the aim is usually to describe or quantify the (continuous) relationship between the response and the treatment levels. There is often more than one observation for each value (ie level) of the factor.
Traditionally, the analysis proceeds by modelling the relationship using low order polynomials or non-linear models. More recently, semi-parametric methods such as smoothing splines have been suggested and used (Green and Silverman, 1994a; Verbyla et al., 1999a). Verbyla et al. (1999a) decompose the factor into a linear component and "smooth" non-linear deviations, where the linear component is fitted as a fixed effect and the smooth component is fitted as a random term. When there are replicate covariate values, the opportunity exists to partition the variation not explained by the modelled response into lack of fit and pure error components. It is often convenient to model lack of fit by inclusion of the associated treatment factor as a random term in the linear mixed model.

Repeated measures: These data arise in many applications where multiple measurements are taken on experimental units. The aim of the analysis of repeated measures from designed experiments may be to examine the overall effects of treatments and how these effects vary with time. In this context subjects or experimental units will usually be measured at common time points. The classification of treatment and blocking factors will usually still be fixed and random respectively. The main effects of time and interactions with treatments will usually be fixed terms, interactions with blocking factors will usually be random terms, and correlated error terms may also be added to the model (which are of course random). The aim of the analysis of repeated measures from observational studies may be to describe or quantify the response profiles for experimental units, groups of units or factors arising in the study. Low order polynomials with coefficients assumed to be fixed effects have often been suggested for describing the response profiles (Diggle et al., 1994). Random coefficient regression has also been widely advocated and used for the analysis of such data (Laird and Ware, 1982). Most recently, smoothing splines and other semi-parametric modelling approaches have also been suggested (Brumback and Rice, 1998a; Verbyla et al., 1999a). Terms modelling treatment response profiles are usually fitted as fixed (except for spline components). Terms used to model (co)variation due to structure between experimental units (for example blocking) or variation of individuals about treatment profiles are usually fitted as random. These random terms are often fitted to quantify the population variability but also model the covariance between the set of repeated measurements within each individual. The analysis of these data is described in chapters ?? and ??.

Multivariate analysis: Multivariate linear mixed models are used where several measurements (traits) have been made on a set of experimental units, and a linear mixed model is required for each trait, taking account of the correlation between the traits (within each unit). In this case, the overall constant term which is usually fitted in a univariate linear mixed model is replaced by a factor which fits a constant term for each trait. Similarly, the overall residual variance is replaced by a residual variance for each trait, with correlation between traits. Complex variance models are fitted to all random terms in the linear mixed model to account for the correlation between traits.

Spatial analysis: The analysis of spatial data usually involves modelling the (co)variation of the data.
Experimental units are measured at a set of points which can be described by a coordinate system in either 1 or 2 (and occasionally 3) dimensions. The data may be regularly or irregularly spaced. As an example of regularly spaced data, the analysis of field experiments has received much attention since the seminal paper of Wilkinson et al. (1983). Many covariance models have been proposed for the errors of field trials, including time-series models (Cullis and Gleeson, 1991) as well as those originating in geostatistics (Cressie, 1991). The analysis of these types of data is considered in chapter ??.

Analysis of a series of experiments: This is increasing in popularity as a method of summarising and integrating the findings of studies or experiments with a common set (or subset) of treatments or aims. These occur in many application areas, especially medical (see eg Yusuf, 1985) and agricultural (Smith and Cullis, 2001). As well as including factors which are measured within studies, factors may also be defined at the study level, such as geographic location, date of study, type or size of clinic. The aim of the analysis is usually to determine the treatment factors affecting the response, and the influence of study-level factors, both as main effects and in how these may interact with treatment factors. Terms associated with study-level factors may be classified as fixed or random, depending on the context. Terms associated with treatment factors are usually classified as fixed, but there are examples where these are classified as random. These issues are discussed more fully in ??.

The above examples illustrate that often, consideration of whether the levels of a factor arise from a probability distribution is not sufficient to determine the classification of the factor as fixed or random. A term may also be classified as random rather than fixed to:
1. quantify (co)variation between different factor levels via a variance model
2. reflect the structure of the data
3. achieve efficient selection, based on prediction of future performance
4. allow inference for a broader set of conditions

Conversely, there are examples where a term may be classified as fixed even though the levels could have arisen from a probability distribution. We note that if a term in the linear mixed model is classified as random then any other term which shares a common set of factors is also classified as random. For example, if variety and site.variety are terms in the linear mixed model and variety is classified as random, then site.variety must be classified as random.

CHAPTER 2
An overview of linear models
In this chapter we review some basic ideas and results for the linear model. For a more thorough treatment of linear models the reader is referred to Searle (1971), for example. The ideas will be covered in the context of a small example.
2.1 Example: Glasshouse experiment on the growth of diseased and healthy plants
The data for this example are presented in the GENSTAT 5 manual (Payne, 1993). An experiment was designed to examine the difference between the growth of diseased (MAV) and healthy (HC) plants. The heights of plants were measured at 1, 3, 5, 7 and 10 weeks after treatment. There were seven plants per treatment and the plants were arranged in a completely randomised design. For the purpose of illustration, we consider the height of each plant 10 weeks after treatment as our response variable. The data are given in table 2.1.
Table 2.1 Height (cm) of plants 10 weeks after treatment

Plant number                 Treatment
within treatment         HC        MAV
       1               57.0       55.0
       2              123.5       67.6
       3               66.0       61.5
       4              130.0       58.0
       5              114.0      104.0
       6              107.5       62.0
       7              110.5       75.9
We are interested in estimating the effect of the treatments on plant height and to examine if there is a significant difference in mean plant height for the two treatments.
2.2 Model construction and assumptions

To allow inferences to be conducted, we propose the statistical model

\[ y_{ij} = \tau_i + e_{ij} \qquad (2.2.1) \]

where yᵢⱼ is the height of the jth plant (j = 1, ..., 7) for the ith treatment (i = 1, 2), τᵢ is the mean effect for the ith treatment and the eᵢⱼ are random "errors" that reflect plant-to-plant variability that is not related to the treatment. We assume the random errors are such that eᵢⱼ ∼ N(0, σ²) and that they are independent for all i and j. Let n = 14 denote the total number of observations.

In symbolic form (2.2.1) can be written as

y ∼ treatment + units

where the variable treatment is a factor taking two levels, with value i for observation yᵢⱼ. If i = 1 the treatment is HC, while if i = 2 the treatment is MAV. The term units is a factor that has levels 1 to n and represents the random errors. Let

\[ y = \begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{17} \\ y_{21} \\ \vdots \\ y_{27} \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{bmatrix}, \qquad \tau = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix} \qquad (2.2.2) \]

and let e be the vector of eᵢⱼ in the same order as y. The matrix X defines the treatment term, and has two columns (corresponding to the two levels of the factor), each row having a zero and a one, with the one indicating which treatment level is appropriate for each experimental unit (plant). Using the above vectors and matrices, the model (2.2.1) can be written succinctly as

\[ y = X\tau + e \qquad (2.2.3) \]

and this is the vector-matrix form of the linear model. Note that the assumptions regarding eᵢⱼ imply that e ∼ N(0, σ²Iₙ), so that the distribution of y is given by

\[ y \sim N(X\tau, \sigma^2 I_n) \qquad (2.2.4) \]

The unknown parameters, τ and σ², must be estimated from the data.
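A small numerical sketch may help fix ideas. The following is an illustration we have added (not the book's own computation): it builds the design matrix of (2.2.2) from the heights in Table 2.1 and solves the normal equations, recovering the two treatment means.

```python
import numpy as np

# Heights at 10 weeks from Table 2.1, ordered as HC then MAV.
hc  = [57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5]
mav = [55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9]
y = np.array(hc + mav)

# Design matrix of (2.2.2): an indicator column for each treatment level.
X = np.vstack([np.repeat([[1, 0]], 7, axis=0),
               np.repeat([[0, 1]], 7, axis=0)])

# Solving the normal equations X'X tau = X'y gives the treatment means.
tau_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(tau_hat)   # approximately [101.21, 69.14]: the HC and MAV means
```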
2.3 Estimation in the linear model

We consider the linear model in a general setting and return to the example throughout this chapter. We now denote the individual observations by yᵢ, i = 1, 2, ..., n, and let xᵢ′ (where ′ denotes the transpose) be the ith row of X. For example, the first row of X given in (2.2.2) is x₁′ = [1 0]. Then from (2.2.4), the individual observations yᵢ are statistically independent and have distribution

\[ y_i \sim N(x_i'\tau, \sigma^2) \qquad (2.3.5) \]

Equation (2.3.5) is a convenient form for the estimation of the unknown parameters τ and σ². We use a likelihood based approach for estimation of the parameters. The likelihood function for independent observations is defined as

\[ L(\tau, \sigma^2; y) = \prod_{i=1}^n f(y_i; \tau, \sigma^2) \]

where f(yᵢ; τ, σ²) is the probability density function for yᵢ. In our case yᵢ follows a normal distribution specified by (2.3.5). Hence

\[ L(\tau, \sigma^2; y) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(y_i - x_i'\tau)^2 \right\} \]

and the log-likelihood function, denoted by ℓ, is given by

\[ \ell = \ell(\tau, \sigma^2; y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i'\tau)^2 \qquad (2.3.6) \]

Standard maximum likelihood estimation consists of differentiating the log-likelihood with respect to τ and σ² to form the vector of derivatives, which is called the score vector. The estimates are found by equating the score vector to zero. For the linear model it is possible to solve the resulting equations directly, but in general an iterative procedure is required. Here the derivatives of ℓ with respect to τ and σ² are

\[ \frac{\partial \ell}{\partial \tau} = \frac{1}{\sigma^2}\sum_{i=1}^n x_i(y_i - x_i'\tau) \qquad (2.3.7) \]

and

\[ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - x_i'\tau)^2 \qquad (2.3.8) \]

respectively. Noting

\[ \sum_{i=1}^n x_i x_i' = X'X, \qquad \sum_{i=1}^n x_i y_i = X'y \]

the score vector is given by

\[ U = U(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}(X'y - X'X\tau) \\[4pt] -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\tau)'(y - X\tau) \end{bmatrix} \qquad (2.3.9) \]

Equating the score vector to zero, we see that

\[ X'X\hat{\tau} = X'y \qquad (2.3.10) \]

The equations in (2.3.10) are called the normal equations. If X is of full column rank, that is the columns are linearly independent, X'X is non-singular and

\[ \hat{\tau} = (X'X)^{-1}X'y \qquad (2.3.11) \]

is the maximum likelihood estimate of τ (also the least squares estimate). The case when X is not of full column rank is discussed below. Under model (2.3.5),

\[ \hat{\tau} \sim N(\tau, \sigma^2(X'X)^{-1}) \qquad (2.3.12) \]

and τ̂ is an unbiased estimator of τ. By the Gauss-Markov Theorem, linear functions a′τ̂ are also the minimum variance unbiased estimators of a′τ.

The maximum likelihood estimate of σ² is

\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\hat{\tau})^2 = \frac{1}{n}(y - X\hat{\tau})'(y - X\hat{\tau}) = \frac{1}{n}(y'y - \hat{\tau}'X'y) \qquad (2.3.13) \]
\[ = \frac{1}{n}R \qquad (2.3.14) \]

where R is the residual sum of squares. This estimate is known to be biased (E σ̂² = ((n − p)/n)σ²), because it does not take account of the p degrees of freedom used in the estimation of τ. We return to this shortly. Note also that (2.3.13) shows that the sum of squares due to the linear model (2.2.3) is given by

\[ \mathrm{SSQ}(\tau) = \hat{\tau}'X'y = \hat{\tau}'X'X\hat{\tau} = (X\hat{\tau})'(X\hat{\tau}) \qquad (2.3.15) \]

The second derivatives of the log-likelihood and their expected values are important for estimation and also for inference. The negative of the matrix of second derivatives is called the observed information matrix, while the expected value of this matrix is called the expected or Fisher information matrix.
The observed information matrix is

\[ J = J(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}X'X & \frac{1}{\sigma^4}(X'y - X'X\tau) \\[4pt] \frac{1}{\sigma^4}(y'X - \tau'X'X) & -\frac{n}{2\sigma^4} + \frac{1}{\sigma^6}(y - X\tau)'(y - X\tau) \end{bmatrix} \]

while the expected information matrix is equal to

\[ \mathcal{I} = \mathcal{I}(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}X'X & 0 \\[4pt] 0 & \frac{n}{2\sigma^4} \end{bmatrix} \qquad (2.3.16) \]

The fitted values for each observation are given by

\[ X\hat{\tau} = X(X'X)^{-1}X'y = P_X y \]

where P_X (n × n) is called the projection matrix for X. It is also called the "hat matrix". The properties of P_X are simple but very important, namely

\[ P_X' = P_X, \qquad P_X^2 = P_X, \qquad P_X X = X \]

which imply that P_X is an orthogonal projection matrix onto the plane or space defined by the columns of X. P_X is a real symmetric matrix, and hence we can diagonalize it (see appendix ??) so that

\[ P_X = K\Lambda K' \]

where Λ is a diagonal matrix whose elements are the eigenvalues of P_X, and K is the matrix of orthonormal eigenvectors (K'K = Iₙ). It is easy to show that P_X has p unit eigenvalues and n − p zero eigenvalues. Thus we can partition Λ into two diagonal matrices Λ₁ = I_p, Λ₂ = 0_{n−p} (a square matrix of zeros of size n − p), and K into two orthogonal components K₁ and K₂ (K₁′K₂ = 0), where the columns of K₁ are those eigenvectors corresponding to the unit eigenvalues, and the columns of K₂ are those eigenvectors corresponding to the zero eigenvalues. Thus we have

\[ P_X = [K_1\ K_2]\begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix}\begin{bmatrix} K_1' \\ K_2' \end{bmatrix} = K_1K_1' \qquad (2.3.17) \]

As K₁′K₁ = I_p,

\[ P_X = X(X'X)^{-1}X' = K_1(K_1'K_1)^{-1}K_1' = K_1K_1' \]

are equivalent representations of the orthogonal projection onto the space defined by the columns of X.

The estimated e or residual vector is given by

\[ \tilde{e} = y - X\hat{\tau} = y - P_Xy = (I_n - P_X)y \]

and (Iₙ − P_X) is also an orthogonal projection matrix. This projection is orthogonal to the plane or space defined by X. The eigenvectors of this matrix are identical to those of P_X, that is K, and the diagonal matrix of eigenvalues of Iₙ − P_X is equal to Iₙ − Λ. Thus we have an equivalent result to (2.3.17), namely

\[ I_n - P_X = K_2K_2' \qquad (2.3.18) \]

Notice also that as (Iₙ − P_X)X = 0, K₂′X = 0. The two representations (2.3.17) and (2.3.18) provide the basis of transformations discussed below and in Chapter 3.

Lastly, notice that the notation ẽ suggests that this estimate is of a different type than the estimate τ̂. In fact, we have estimated a random vector, something that is called prediction rather than estimation; ẽ is the best linear unbiased predictor (BLUP) of e. This will be discussed later.

2.4 Non-full rank Design Matrix

We briefly discuss the case where X is not of full column rank and hence when X'X is singular. In this case the solution of the normal equations (2.3.10) is not unique. This case arises naturally in chapter 3. If X is singular one solution is to find a full rank version. This involves finding matrices X* and A such that X = X*A and X* is of full column rank. This is the standard method used for over-specified models and is discussed in section 2.6. We consider an alternative approach.

Write the normal equations as

\[ C\hat{\tau} = b \qquad (2.4.19) \]

so that C = X'X and b = X'y. Then a (non-unique) solution is given by

\[ \hat{\tau} = C^-b \qquad (2.4.20) \]

where C⁻ is a generalized inverse of C, that is a matrix satisfying

\[ CC^-C = C \qquad (2.4.21) \]

To show (2.4.20), note that premultiplying (2.4.19) by CC⁻ we find that CC⁻b = b, and on substituting (2.4.20) into (2.4.19) we see that indeed the former is a solution of the latter. Now, in terms of the original matrices,

\[ \hat{\tau} = (X'X)^-X'y \qquad (2.4.22) \]

A particularly nice generalised inverse is the so-called Moore-Penrose generalised inverse. This is written as C⁺ and, in addition to satisfying (2.4.21), also satisfies

\[ C^+CC^+ = C^+ \]

In this case

\[ \mathrm{var}(\hat{\tau}) = \sigma^2(X'X)^+ \qquad \text{and} \qquad P_X = X(X'X)^+X' \]

are results that are similar to the full-rank case. Lastly, note that in the full rank case the projection matrix P_X = X(X'X)⁻¹X' is a Moore-Penrose inverse of itself. This can be useful in the analysis of variance for multi-stratum experiments, discussed in chapter 3.
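As an illustration of the generalized inverse approach (again a sketch we have added, not the book's own computation), numpy's `pinv` provides the Moore-Penrose inverse. Note that the fitted values Xτ̂ are invariant to the choice of generalized inverse even though τ̂ itself is not unique.

```python
import numpy as np

# Overparameterised model in the style of section 2.6: columns for mu, tau1, tau2.
# The first column is the sum of the other two, so X'X is singular.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([10.0, 12.0, 20.0, 22.0])

C = X.T @ X
b = X.T @ y

# Moore-Penrose solution tau_hat = C^+ b: one solution of the normal equations.
tau_hat = np.linalg.pinv(C) @ b
print(tau_hat)

# Fitted values X tau_hat are invariant to the choice of generalized inverse.
print(X @ tau_hat)   # the group means 11, 11, 21, 21
```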
2.5 A glimpse at REML

It was indicated above that the maximum likelihood estimate σ̂² given by (2.3.14) did not allow for estimation of τ. We can obtain an estimate that allows for the estimation of τ as follows.

If K is the matrix of eigenvectors of P_X, we transform y to K'y (a non-singular and hence one-to-one transformation). Noting that K₂′X = 0, the distribution of the transformed data is

\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} K_1'y \\ K_2'y \end{bmatrix} \sim N\left( \begin{bmatrix} K_1'X\tau \\ 0 \end{bmatrix}, \; \sigma^2\begin{bmatrix} I_p & 0 \\ 0 & I_{n-p} \end{bmatrix} \right) \qquad (2.5.23) \]

The two components of the transformation, y₁ and y₂, are independent and thus (2.5.23) provides two linear models. The first linear model has a design matrix K₁′X that is square and non-singular and hence transforms τ to a new parameter vector τ*, say, of the same length. The vector y₁ provides the only information on τ* under the transformation and hence is the basis of estimation of τ. The parameter vector matches the data vector y₁ in length, so that once τ has been estimated, y₁ cannot be used for estimation of σ² as there is no further information available. In fact, using (2.3.11) with the design matrix X replaced by K₁′X, the independence of y₁ and y₂ implies that we can estimate τ by

\[ \hat{\tau} = (X'K_1K_1'X)^{-1}X'K_1K_1'y = (X'P_XX)^{-1}X'P_Xy = (X'X)^{-1}X'y \]

which actually reproduces (2.3.11), as in hindsight it should.

The second linear model has a known (zero) mean and depends only on σ². Because y₁ cannot be used to estimate σ², we use the marginal distribution of y₂ for estimation of σ². The log-likelihood of y₂ is given by

\[ \ell(\sigma^2; y_2) = -\frac{n-p}{2}\log(2\pi) - \frac{n-p}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}y_2'y_2 \]

and hence the estimate of σ² based on this marginal likelihood is given by (having differentiated the marginal log-likelihood with respect to σ² and equated to zero)

\[ s^2 = \frac{1}{n-p}y_2'y_2 = \frac{1}{n-p}y'K_2K_2'y = \frac{1}{n-p}y'(I_n - P_X)y = \frac{1}{n-p}R \qquad (2.5.24) \]

rather than the form in (2.3.14). The estimator in (2.5.24) is unbiased. This is the standard estimate corrected for estimation of τ using a simple degrees of freedom adjustment.

The above approach for estimating σ², based on the marginal distribution of y₂, is an example of residual maximum likelihood (REML) estimation, discussed originally by Patterson and Thompson (1971). The basic idea is to partition the likelihood into components, which in general may not be independent, with one component (here y₁) for estimation of τ and the other component (y₂), whose distribution does not depend on τ, for estimation of σ². A more complex decomposition is presented in chapter 4 and a thorough derivation of REML is given in chapter 5.
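The contrast between (2.3.14) and (2.5.24) can be checked numerically. The sketch below is our own illustration, using the plant data of Table 2.1, for which R = 6506.1 on n − p = 12 degrees of freedom.

```python
import numpy as np

# ML versus REML estimates of sigma^2 for the plant-height data of
# Table 2.1 (HC then MAV), so p = 2 treatment parameters.
hc  = [57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5]
mav = [55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9]
y = np.array(hc + mav)
X = np.kron(np.eye(2), np.ones((7, 1)))      # two treatment indicator columns
n, p = X.shape

P_X = X @ np.linalg.inv(X.T @ X) @ X.T       # projection ("hat") matrix
R = y @ (np.eye(n) - P_X) @ y                # residual sum of squares

print(R / n)         # ML estimate (2.3.14): biased downwards
print(R / (n - p))   # REML estimate (2.5.24): unbiased, 6506.1/12 = 542.2
```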
2.6 Factors and model specification
All linear models can initially be expressed in terms of means. For example, in (2.2.1) we have specified the expected value for the ith treatment by the mean τᵢ. In this case the design matrix X might be called the replication matrix, as it simply indicates which mean value an experimental unit takes on. In many situations the mean value depends on several factors, and the model is reparameterised to reflect the factors of interest. For example, comparative inferences concerning the levels of a factor may be of prime interest (as is the case in our example). This subsequently leads to modelling the mean in terms of the factors, and in particular to introducing a transformation from the mean to a new set of parameters. In the linear models situation the transformation is linear, so that the replication matrix is then multiplied by the matrix of this linear transformation to provide a new design matrix. In our simple example, the mean may be reparameterised as

\[ \tau_i = \mu + \tau_i^* \qquad (2.6.25) \]

The τᵢ* in (2.6.25) represent the deviations from µ. This specification contains redundancies or intrinsic aliasing of effects, because we have 3 parameters µ, τ₁*, τ₂* to model the means for the 2 treatments, namely τ₁ and τ₂. To overcome this redundancy a constraint must be applied. This over-parameterisation occurs whenever factors are present in the model (as fixed effects), so by convention constraints are applied to each factor term. These constraints are then applied to any term in the model which has this factor as an elemental component (for example in an interaction). The type of constraint that is applied affects the interpretation of the term µ.

Corner-point constraint: This is the standard constraint for linear models in GENSTAT 5. We set τ₁* = 0 and in terms of the parameterisation in (2.2.1) we have

\[ \tau_1 = \mu + \tau_1^* = \mu, \qquad \tau_2 = \mu + \tau_2^* \]

Thus µ is the mean for treatment 1, and τ₂* is the deviation of the mean of treatment 2 from the mean of treatment 1. This is easily generalised to more than 2 treatments, in which case the τᵢ*, i = 2, ..., p are deviations of treatment i from treatment 1. In addition, the sets of effects in interaction terms have all effects corresponding to the first level of all component factors set to zero.

In terms of the vector of means, we have

\[ \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} \mu \\ \tau_2^* \end{bmatrix} \]

so that if T is the matrix of the transformation and τ* is the vector of µ and τ₂* we have

\[ \tau = T\tau^* \qquad (2.6.26) \]

and the linear model becomes

\[ y = XT\tau^* + e = X^*\tau^* + e \]

which is a linear model with a new design matrix X* and non-redundant parameter vector τ*.

Zero-sum constraint: We set Σ²ᵢ₌₁ τᵢ* = 0. This ensures (for our simple case) that the intercept, µ, is the overall mean. This constraint also has been widely used in statistical packages to fit the linear model. The possible advantage of this constraint is that the intercept represents the overall mean. In the case of balanced data, the estimate of the intercept is the mean of all the data. However, this is not the case for unbalanced data. For this constraint, note τ₁* = −τ₂*, and the matrix T of (2.6.26) is

\[ T = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \qquad (2.6.27) \]

It is also useful for the development in Chapter 3 to provide an alternative approach given by Nelder (1965b). Thus we write

\[ \tau_i = \bar{\tau}_\cdot + (\tau_i - \bar{\tau}_\cdot) \]

where τ̄· = ½1₂′τ is the mean of the τᵢ. Notice that the zero-sum constraint is automatically incorporated for the terms (τᵢ − τ̄·). In vector form,

\[ \tau = 1_2\bar{\tau}_\cdot + (\tau - 1_2\bar{\tau}_\cdot) = \tfrac{1}{2}1_21_2'\tau + (I_2 - \tfrac{1}{2}1_21_2')\tau = 1_2(1_2'1_2)^{-1}1_2'\tau + (I_2 - 1_2(1_2'1_2)^{-1}1_2')\tau \]
\[ = T_1\tau + T_2\tau \qquad (2.6.28) \]

Notice that T₁ and T₂ are orthogonal projection matrices, T₁ onto the vector 1₂ and T₂ onto the vector orthogonal to 1₂, namely [−1 1]′, and that these two vectors make up the matrix T in (2.6.27). In addition, T₁ + T₂ = I₂ so that the decomposition of τ is into complete components. Thus

\[ T_1\tau \equiv 1_2\mu, \qquad T_2\tau \equiv \tfrac{1}{2}(\tau_1 - \tau_2)\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \tau^* \]

This type of decomposition into orthogonal parts occurs not only in the mean structure but also in the covariance structures to be discussed in Chapter 3, and it is the relationship between these mean and covariance structures that is important in analysis of variance.

Conventions in this book: In this book we assume that the design matrix X has full column rank. This often necessitates imposition of constraints on the vector τ of fixed effects. The method of estimation of the model and subsequent prediction of effects of interest is not affected by the type of constraint used, but of course the interpretation of the parameters or effects will depend on the constraints applied.
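To see the two constraints side by side, the following sketch (an added illustration, taking the treatment means of the plant example as τ) recovers the constrained parameters by inverting the relevant T matrix, and forms Nelder's orthogonal decomposition (2.6.28).

```python
import numpy as np

# Treatment means for the two-treatment plant example.
tau = np.array([101.21, 69.14])

# Corner-point: tau = T @ [mu, tau2*] with tau1* = 0, so mu is the
# treatment-1 mean and tau2* the deviation from it.
T_corner = np.array([[1.0, 0.0],
                     [1.0, 1.0]])
print(np.linalg.solve(T_corner, tau))   # [101.21, -32.07], cf. table 2.4

# Zero-sum: mu is the overall mean and tau1* = -tau2*.
T_zero = np.array([[1.0, -1.0],
                   [1.0,  1.0]])
print(np.linalg.solve(T_zero, tau))     # [85.175, -16.035]

# Nelder's orthogonal decomposition tau = T1 tau + T2 tau of (2.6.28).
one = np.ones((2, 1))
T1 = one @ one.T / 2                    # projection onto the vector of ones
T2 = np.eye(2) - T1                     # projection onto its orthogonal complement
print(T1 @ tau, T2 @ tau)               # mean part and deviation part
```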
2.7 Tests of hypotheses
Before leaving the introduction to linear models we consider tests of hypotheses. There are basically three approaches (usually based on large sample theory) to deriving tests. The three methods are the likelihood ratio test, the Wald test and the score test (?).

We consider the hypothesis H₀: L′τ = l for given L (p × r) and l (r × 1). In the example, the hypothesis of interest is simply the equality of treatment means, ie τ₁ = τ₂, or more appropriately for our general formulation, that τ₁ − τ₂ = 0, so that L = [1 −1]′ and l = 0. Suppose M is a p × (p − r) matrix such that the matrix T = [M L] is non-singular. We can choose M such that L′M = 0. The linear model (2.2.3) can be written as

\[ y = X(T')^{-1}T'\tau + e = X^*\tau^* + e \qquad (2.7.29) \]

Now

\[ T'\tau = \begin{bmatrix} M'\tau \\ L'\tau \end{bmatrix} = \begin{bmatrix} \tau_0^* \\ \tau_1^* \end{bmatrix} \]

and our null hypothesis becomes H₀: τ₁* = l. This demonstrates that if we partition τ into two components τ₀ and τ₁, without loss of generality we can consider the test of H₀: τ₁ = l (and hence avoid messy notation). We therefore consider tests of this hypothesis.

Note firstly that under the partition of τ, X can be partitioned to yield the linear model

\[ y = X_0\tau_0 + X_1\tau_1 + e \qquad (2.7.30) \]

The estimation of the partitioned τ can be carried out using partitioned matrices and the normal equations (2.3.10). The solutions can be written in a number of ways, but the useful form for the derivation of the likelihood ratio test is

\[ \hat{\tau}_0 = (X_0'X_0)^{-1}X_0'(y - X_1\hat{\tau}_1) \qquad (2.7.31) \]
\[ \hat{\tau}_1 = \left\{X_1'(I_n - P_{X_0})X_1\right\}^{-1}X_1'(I_n - P_{X_0})y \]
\[ \hat{\sigma}^2 = \frac{1}{n}(y - X_1\hat{\tau}_1)'(I_n - P_{X_0})(y - X_1\hat{\tau}_1) = \frac{1}{n}R \]

where P_{X₀} is the orthogonal projection matrix onto the column space of X₀.

Testing H₀: τ₁ = l can be carried out by considering the sequence of models

\[ y = X_0\tau_0 + X_1l + e \]
\[ y = X_0\tau_0 + X_1\tau_1 + e \]

where X₀ is n × (p − r) and the column space of X contains the column space of X₀ (Searle, 1971). We assume that both X and X₀ are of full rank. Under H₀ the linear model becomes

\[ y \sim N(X_0\tau_0 + X_1l, \; \sigma_0^2 I_n) \qquad (2.7.32) \]

and the maximum likelihood estimates are

\[ \hat{\tau}_0^0 = (X_0'X_0)^{-1}X_0'(y - X_1l) \qquad (2.7.33) \]
\[ \hat{\sigma}_0^2 = \frac{1}{n}(y - X_0\hat{\tau}_0^0 - X_1l)'(y - X_0\hat{\tau}_0^0 - X_1l) = \frac{1}{n}(y - X_1l)'(I_n - P_{X_0})(y - X_1l) = \frac{1}{n}R_0 \qquad (2.7.34) \]

where R₀ is the residual sum of squares under H₀.

The generalized likelihood ratio test is a standard approach for tests of hypotheses for models fitted using likelihood methods (?). For our hypothesis the generalized likelihood ratio statistic is defined as

\[ \Lambda = \frac{L(\hat{\tau}^0, \hat{\sigma}_0^2; y)}{L(\hat{\tau}, \hat{\sigma}^2; y)} \]

which is the maximized likelihood under H₀ divided by the maximized likelihood for the full model. Under the normality assumptions of section 2.2 the statistic is given by

\[ \Lambda = \left(\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}\right)^{-n/2} = \left(\frac{R_0}{R}\right)^{-n/2} = \left(\frac{R + (R_0 - R)}{R}\right)^{-n/2} = \left(1 + \frac{R_0 - R}{R}\right)^{-n/2} = \left(1 + \frac{r}{n-p}F\right)^{-n/2} \]

where

\[ F = \frac{(R_0 - R)/r}{R/(n-p)} \]

is the standard F-test for the above null hypothesis. It is easy to show that

\[ F = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{r s^2} \]

This form will arise below. Several things need to be noted here. These are
• the asymptotic distribution of −2 log(Λ) is χ²(r) under H₀;
• using a likelihood ratio test we reject H₀ if −2 log(Λ) > χ²_{1−α}(r);
• the likelihood ratio statistic Λ is a monotone decreasing function of F;
• F ∼ F(r, n − p) under H₀, where the numerator and denominator d.f. are presented within the brackets;
• using an F-test we reject H₀ if F > F_{1−α}(r, n − p).

This shows how the standard F-test relates to the likelihood ratio test for fixed effects under full maximum likelihood. The F-test is preferred in this context because, if the normality assumption holds, the test statistic has an exact distribution, whereas the null distribution of the likelihood ratio test is asymptotic, that is, based on a large sample approximation. Likelihood ratio methods play an important role in mixed models and are discussed in detail in chapter 6.

Before we turn to the remaining methods, note that an analysis of variance table is implicit in the above formulation. In fact table 2.2 provides the logical decomposition.

The Wald statistic (?) is based on the distribution of the estimator of τ, namely (2.3.12). As we are interested in a test involving τ₁, we require the distribution of τ̂₁, namely

\[ \hat{\tau}_1 \sim N\left(\tau_1, \; \sigma^2\left\{X_1'(I_n - P_{X_0})X_1\right\}^{-1}\right) \]
Table 2.2 ANOVA decomposition of sums of squares

Term               d.f.           S.S.      M.S.          F-test
Difference         r              R0 − R    (R0 − R)/r    F
Full model         n − p          R         R/(n − p)
Null hypothesis    n − (p − r)    R0
Under H₀, τ₁ = l, and the Wald statistic is given by

\[ W = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{\hat{\sigma}^2} = \frac{nr}{n-p}F \]

which is a simple monotone function of the F-statistic. Thus the Wald test is equivalent to the likelihood ratio test in this case.

The score test is given by

\[ S = U_0'\mathcal{I}_0^{-1}U_0 \]

where U₀ and 𝓘₀ are the score vector and expected information matrix derived under the full model but evaluated under H₀, that is at τ̂⁰ = (τ̂₀⁰′, l′)′ and σ̂₀². Thus

\[ U_0 = \begin{bmatrix} \frac{1}{\hat{\sigma}_0^2}\left(X'y - X'X\hat{\tau}^0\right) \\ 0 \end{bmatrix} \qquad (2.7.35) \]

while the expected information matrix is given by (2.3.16) with σ² evaluated at σ̂₀². The score statistic is then

\[ S = \frac{1}{\hat{\sigma}_0^2}\left(X'y - X'X\hat{\tau}^0\right)'\,\hat{\sigma}_0^2(X'X)^{-1}\,\frac{1}{\hat{\sigma}_0^2}\left(X'y - X'X\hat{\tau}^0\right) = \frac{1}{\hat{\sigma}_0^2}\begin{bmatrix} \hat{\tau}_0 - \hat{\tau}_0^0 \\ \hat{\tau}_1 - l \end{bmatrix}' X'X \begin{bmatrix} \hat{\tau}_0 - \hat{\tau}_0^0 \\ \hat{\tau}_1 - l \end{bmatrix} \]

Using (2.7.31) and (2.7.33) we have

\[ \begin{bmatrix} \hat{\tau}_0 \\ \hat{\tau}_1 \end{bmatrix} - \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} = \begin{bmatrix} -(X_0'X_0)^{-1}X_0'X_1 \\ I_r \end{bmatrix}(\hat{\tau}_1 - l) \]

and replacing X'X by its partitioned form

\[ \begin{bmatrix} X_0'X_0 & X_0'X_1 \\ X_1'X_0 & X_1'X_1 \end{bmatrix} \]

the score statistic can be shown to be

\[ S = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{\hat{\sigma}_0^2} = \frac{nrF}{n - p + rF} \]

which is a monotone increasing function of the F-statistic. Thus for the linear model and under H₀ the score statistic is equivalent to the F-statistic. For linear models, all three methods lead to the standard F-test for inference concerning the vector of fixed effects, τ.

For completeness we present the F-statistic for the original test of H₀: L′τ = l. The statistic is given by

\[ F = \frac{(L'\hat{\tau} - l)'\left\{L'(X'X)^{-1}L\right\}^{-1}(L'\hat{\tau} - l)}{r s^2} \]

and under H₀, F ∼ F(r, n − p).
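The algebraic links between Λ, W, S and F derived above are easily verified numerically. The sketch below, added for illustration, plugs in the sums of squares from the plant growth example of section 2.8; the use of scipy for the reference distribution is our own choice.

```python
import numpy as np
from scipy import stats

# Quantities from the plant-growth example of section 2.8:
# R0 = 10106.1, R = 6506.1, n = 14, p = 2, r = 1.
R0, R, n, p, r = 10106.1, 6506.1, 14, 2, 1

F = ((R0 - R) / r) / (R / (n - p))
lam = (1 + r * F / (n - p)) ** (-n / 2)      # likelihood ratio statistic
W = n * r * F / (n - p)                      # Wald statistic
S = n * r * F / (n - p + r * F)              # score statistic

print(F, -2 * np.log(lam), W, S)             # F = 6.640, as in table 2.3
print(stats.f.sf(F, r, n - p))               # exact p-value: < .05, cf. section 2.8
```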
2.8 Analysis of plant growth example

For this example the two models (in symbolic form) to be fitted are

y ∼ mu + units
y ∼ treatment + units

We note that the latter model could also be written as

y ∼ mu + treatment + units

where constraints are now necessary. Thus, in this example, excluding the treatment term from the linear model allows us to test the hypothesis that the treatment mean effects are equal.

To complete this chapter we present the analysis of variance table for this plant growth data-set. The results of the analysis are summarized in table 2.3. The total (mean corrected) sum of squares has been partitioned into a sum of squares due to treatments (R₀ − R) and a within treatment sum of squares (R). In matrix terms these are given by

\[ R = y'(I_n - P_X)y, \qquad R_0 = y'(I_n - P_{X_0})y \]

Table 2.3 contains an entry for the between treatments and within treatments sources of variation as well as the total variation (about the mean), together with the decomposition in terms of R and R₀, with a column for the degrees of freedom (d.f.), the sum of squares (S.S.), the mean square (M.S. = S.S./d.f.) and the F-test (the ratio of the M.S. for between treatments to the M.S. for within treatments). That is,

\[ F = \frac{(R_0 - R)/r}{R/(n-p)} \]

where n = 14, p = 2 and r = 1. The F-test is significant (p < .05); hence we reject the hypothesis that the treatment means are equal. The least squares estimates of the fixed effects and their standard errors are presented in table 2.4. These are defined using the corner-point constraints.
Table 2.3 Decomposition of sums of squares for plant growth data

Term                           d.f.     S.S.       M.S.      F-test
Between treatments (R0 − R)      1      3600.0    3600.0      6.640
Within treatments (R)           12      6506.1     542.2
Total (R0)                      13     10106.1
Table 2.4 Summary of fixed effects for plant growth example

Effect    Estimate    S.E.
τ1*           0.00
τ2*         -32.07    12.45
µ           101.21     8.80
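Tables 2.3 and 2.4 can be reproduced from the raw data. The following added sketch recomputes the sums of squares decomposition from the heights in Table 2.1.

```python
import numpy as np

# Reconstructing Table 2.3 from the raw heights of Table 2.1.
hc  = np.array([57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5])
mav = np.array([55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9])
y = np.concatenate([hc, mav])
n, p, r = 14, 2, 1

R0 = np.sum((y - y.mean())**2)         # total (mean-corrected) sum of squares
R  = np.sum((hc - hc.mean())**2) + np.sum((mav - mav.mean())**2)  # within SS

F = ((R0 - R) / r) / (R / (n - p))
print(R0 - R, R, R0)   # 3600.0, 6506.1, 10106.1 (to rounding)
print(F)               # 6.640
```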
2.9 Summary

In this chapter, results for the linear model have been presented to
• provide the vector-matrix setting for chapters to follow,
• revise basic results on estimation in the linear model,
• introduce Residual Maximum Likelihood (REML) for estimation of variance parameters,
• derive the likelihood ratio test, the Wald test and the score test and show their equivalence to the standard F-test in the analysis of variance.
CHAPTER 3
Analysis of designed experiments
We begin the study of mixed models by considering the analysis of designed experiments. Designed experiments are highly structured, and this structure can be utilised to provide an approach to analysis and to set the foundations for more complex developments.

The fundamental principles underpinning analysis of variance, namely projections, strata and orthogonal block structures, are developed in this chapter, in the manner of Nelder (1965a,b). There are three components in the development, namely the covariance structures generated by orthogonal block structures, the decomposition of treatment effects into components of interest (and this is usually the aim of the experiment), and lastly the interplay between the block and treatment structures.

For the most part, the models to be discussed in this chapter have a simple random effects structure and the treatment effects appear in only one part of the analysis of variance table. Estimation of variances allied to these random effects is very simple and involves equating expected (residual) mean squares to their observed values in an analysis of variance table. These are Residual Maximum Likelihood (REML) estimates in these simple cases.

When the same treatment effects appear in several independent parts of the analysis of variance, as for example in incomplete block designs, the correspondence between analysis of variance and REML estimates of variances is broken. Efficient estimation of treatment effects and variance components in such situations is via REML, and this extends to the analysis of unbalanced data to be discussed later in the book. The use of analysis of variance tables is to be encouraged in unbalanced situations in order to define appropriate models. Thus the specialized nature of this chapter is a springboard to more complex situations.
3.1 One-way classification

3.1.1 Motivation

The data to be considered in this section come from a larger study conducted by Dr J Panozzo, in which the aim is to assess the malting quality of a number of barley varieties. An important trait in determining the malt quality of barley is the diastatic power (DP). Samples of barley grain are put through a malting process using a micro-malter and DP is measured on these samples. The micro-malter holds 80 canisters in a 16 × 5 array. Often there are more than 80 samples to be malted, so that sequential runs of the micro-malter must be undertaken. In the study 10 sequential malt runs were required. In order to assess variation between malt runs, 4 of the 80 canisters in each run were randomly assigned to a control barley. Each of these control canisters was filled with a subsample from a uniform batch of barley grain (from a single variety). The DP data for these control samples in each malt run are presented in table 3.1.
Table 3.1 Diastatic power (DP) for control samples in 10 malt runs

Malt Run    Diastatic Power
    1       10.0   9.9  10.1  10.6
    2        9.1  10.3  10.0   9.0
    3       11.5  11.3  11.6  11.3
    4       10.0   9.6  10.6  10.8
    5       10.0   9.2  10.6   9.2
    6       10.0  10.9  10.9  10.1
    7        9.1   9.1   9.3   9.0
    8        9.0   8.3   9.9  10.0
    9       10.3   9.0   9.0   9.7
   10        9.1   9.1   8.9   9.0
Figure 3.1 Malt run data: dotplot of the control samples from each malt run (DP on the horizontal axis, malt run on the vertical axis)

Our aim is to quantify the variation between and within malt runs using the control sample data, and in particular to see if the between malt run variation is "large". We begin with a preliminary look at the data using the dotplot (S-PLUS, Insightful Corp., 2000) given in Figure 3.1. The dotplot suggests that there is variation between malt runs, but in addition that within malt run variation may also be large.

A simple statistical model which allows for both between and within malt run variation is

\[ y_{ij} = \mu + u_i + e_{ij} \qquad (3.1.1) \]

where yᵢⱼ is the observed DP, µ is the mean DP across all malt runs, uᵢ represents the ith malt run effect, i = 1, 2, ..., 10, and eᵢⱼ ∼ N(0, σ²), j = 1, 2, ..., 4. This has the same form as (2.2.1). However, there is a major difference in the aim of this analysis, namely that a measure of variation across malt runs is required. Rather than assume the malt run effects are fixed, and following the principles discussed in Chapter 1, it is appropriate to assume the effect is random. Thus we assume uᵢ ∼ N(0, σᵤ²), i = 1, ..., 10. In addition we assume the malt run effects (uᵢ) and the residual errors (eᵢⱼ) are statistically independent. The parameters to be estimated are now µ (the mean DP over all malt runs), σᵤ² and σ². In contrast to (2.2.1), the malt run effects are random variables and therefore do not have fixed values which can be estimated. We return to this issue in chapter 5.

If y is the vector of observations (ordered as samples within malt run), n = 40 is the sample size, b = 10 is the number of malt runs, and r = 4 is the number of replicate controls per malt run, (3.1.1) can be written as
y = 1nµ + Zu + e   (3.1.2)

where 1n is a vector of n ones, Z = Ib ⊗ 1r is an n × b design matrix, u ∼ N(0, σu²Ib) and e ∼ N(0, σ²In); ⊗ is the kronecker product operator (??). Note that the dimension of the vector of ones and of the identity matrix is given as a subscript. This is not to be confused with the convention of presenting the dimensionality of a general matrix or vector as a superscript. In the notation of chapter 2, we can also write this model symbolically as

y ∼ mu + maltrun + units

where maltrun is a factor with 10 levels and, following the argument above, is a random factor and hence is presented in bold (by the convention established in chapter 1). Under these assumptions the marginal distribution of y is

y ∼ N(1nµ, σu²ZZ' + σ²In)   (3.1.3)

The aim is to estimate σu² and σ² in order to gauge the relative size of between and within malt run variation. As σu² is a variance, we constrain it to be positive. In some applications discussed in this book, this non-negativity constraint can be relaxed. The model given by (3.1.2) specifies a simple linear mixed model. It has a random component Zu in addition to the residual error random effect. The marginal distribution of y is given by (3.1.3) and has a structured covariance matrix which we will denote by

V = σu²ZZ' + σ²In
3.1.2 Projections, strata and analysis of variance
The one-way classification is an example of a designed experiment which possesses strata of variation. The concept of strata has been considered primarily, for example by Nelder (1965a,b), for the analysis of designed experiments with orthogonal block structure. We shall present the formal definition of strata later in this section, but for the moment we introduce the concepts by considering the linear mixed model (and associated distributional assumptions for the random components) for the one-way classification as given by (3.1.2) and (3.1.3). If u were a fixed effect, that is a vector of parameters, we could proceed as in chapter 2 and partition the malt run effects using the approach leading to (2.6.28). Thus the malt run effects can be decomposed into a complete orthogonal set, defined by (the notation is changed from chapter 2 because of the change in the status of the factor from fixed to random)
P1 = Z(Z'Z)⁻¹Z'   (3.1.4)
P2 = In − Z(Z'Z)⁻¹Z'   (3.1.5)

The matrices P1 and P2 are orthogonal projections onto the plane defined by Z and the plane orthogonal to Z respectively; these projections correspond to the between groups and the within groups effects. Thus, for i = 1, 2 and j ≠ i,

Pi' = Pi,   Pi² = Pi,   PiPj = 0,   P1 + P2 = In
The model in which ui is random moves the design matrix Z into the variance structure. While the group effects no longer appear explicitly, they are implied by the form of V. The decomposition using (3.1.4) and (3.1.5) can still be applied, but its impact is to separate the data into components, each of which can be modelled using a simple linear model, that is, a model with a single random term with constant variance. For the one-way classification this can be achieved as follows. Firstly note that Z'Z = rIb. Then
V = rσu²Z(Z'Z)⁻¹Z' + σ²In
  = rσu²P1 + σ²In
  = (σ² + rσu²)P1 + σ²(In − P1)
  = ξ1P1 + ξ2P2   (3.1.6)

Note that

ξ1 = rσu² + σ²   (3.1.7)
ξ2 = σ²   (3.1.8)

Thus the variance matrix contains the decomposition into between and within group components, together with two weights or variances, ξ1 and ξ2, which are functions of the original variances. Since each Pi is a projection matrix we can proceed as in chapter 2 and write

Pi = KiKi'

where the Ki are matrices of size n × b and n × (n − b) for i = 1, 2 respectively. In addition, K1'K2 = 0 and Ki'Ki = Iνi, where νi is the rank of Ki. Since Ki has full column rank, the rank of Ki equals the number of columns of Ki, that is, b and n − b for i = 1, 2 respectively. For the estimation of (µ, ξ1, ξ2), ie (µ, σu², σ²), we partition the data into two parts that reflect the between and within malt run components. To do so, consider the transformation of the data vector y to K'y, where K = [K1 K2] is a non-singular matrix.
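The decomposition (3.1.6) is straightforward to verify numerically. The following is a minimal sketch in Python/numpy (not part of the original development; the variance component values chosen are purely illustrative):

    import numpy as np

    b, r = 10, 4                     # malt runs, replicate controls per run
    n = b * r
    sigma2_u, sigma2 = 0.47, 0.26    # illustrative variance components

    Z = np.kron(np.eye(b), np.ones((r, 1)))    # Z = I_b kron 1_r (n x b)
    P1 = Z @ np.linalg.inv(Z.T @ Z) @ Z.T      # between groups projection
    P2 = np.eye(n) - P1                        # within groups projection

    # projection identities: idempotent, mutually orthogonal, sum to I_n
    assert np.allclose(P1 @ P1, P1) and np.allclose(P1 @ P2, 0)
    assert np.allclose(P1 + P2, np.eye(n))

    # the two forms of V agree: sigma_u^2 ZZ' + sigma^2 I = xi1 P1 + xi2 P2
    xi1, xi2 = r * sigma2_u + sigma2, sigma2
    V = sigma2_u * Z @ Z.T + sigma2 * np.eye(n)
    assert np.allclose(V, xi1 * P1 + xi2 * P2)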
Since P2 1n = 0, we have E(y2) = 0n−b. Further, for i = 1, 2,

var(yi) = Ki'(ξ1P1 + ξ2P2)Ki = ξiKi'PiKi = ξiIνi

and

cov(y1, y2) = K1'(ξ1P1 + ξ2P2)K2 = 0

Thus

K'y = (y1', y2')' ∼ N( ((K1'1nµ)', 0')', diag(ξ1Ib, ξ2In−b) )   (3.1.9)

and we have two independent components, one depending on µ and ξ1 and the other depending only on ξ2. These two independent distributions define the two strata in our experiment: the between groups stratum, specified by y1, and the within groups stratum, specified by y2. The variance parameters ξi, i = 1, 2, are known as stratum variances. In essence we have reduced the linear mixed model to two independent linear models, namely

y1 ∼ N(K1'1nµ, ξ1Ib)
y2 ∼ N(0, ξ2In−b)

Allied to this partition into strata is the decomposition of the total sum of squares as the sum of the stratum sums of squares. This implies no loss of information. To see this, consider

Σ_{i=1}^{2} yi'yi = Σ_{i=1}^{2} y'KiKi'y = y'(P1 + P2)y = y'y

as required.
It is clear that the overall mean can only be estimated from y1 (ie in the between groups stratum). The least squares (and maximum likelihood) estimate using (2.3.11) is given by
µ̂ = (1n'K1K1'1n)⁻¹ 1n'K1K1'y = 1n'y/n = ȳ

The residual sum of squares for the between groups stratum is therefore given by
y1'y1 − y1'K1'1n(1n'P1 1n)⁻¹1n'K1y1 = y'P1y − y'1n(1n'1n)⁻¹1n'y
                                    = y'P1y − y'P0y

where P0 is the projection matrix for the overall mean. An analysis of variance (ANOVA) table, as given in table 3.2, can be constructed based on the two strata and the decomposition of the sum of squares in the between groups stratum into a sum of squares due to the overall mean and a residual. There is no decomposition of the within groups stratum because there are no fixed effects in the linear model for y2. This is a simple example of a multi-stratum experiment, without treatment factors. The ANOVA estimates of ξ1 and ξ2 can be obtained by equating the residual mean squares in table 3.2 to their expected values. Hence,

ξ̂1 = y'(P1 − P0)y/(b − 1)
ξ̂2 = y'P2y/(n − b)   (3.1.10)
and σ̂² and σ̂u² can be found using equations (3.1.7) and (3.1.8), namely

σ̂² = ξ̂2
σ̂u² = (ξ̂1 − ξ̂2)/r

The ANOVA table for the malt run data is presented in table 3.3. The estimates of ξ1 and ξ2 are ξ̂1 = 2.145 and ξ̂2 = 0.261. Using the expressions above, the ANOVA estimates of the variance components are then σ̂u² = 0.4708 and σ̂² = 0.2614 respectively. We see that the estimated between malt run variance is approximately 1.8 times the estimated residual variance, indicating the need for careful design protocols to account for between malt run variation.

Table 3.2 Analysis of variance for one-way classification

Strata/Decomposition   d.f.     S.S.            Expectation of M.S.
Between groups         b        y'P1y           -
  Mean                 1        y'P0y           nµ² + ξ1
  Residual             b − 1    y'(P1 − P0)y    ξ1
Within groups          n − b    y'P2y           -
  Residual             n − b    y'P2y           ξ2

Table 3.3 Analysis of variance for the malt run data

Strata/Decomposition   d.f.   S.S.       M.S.
Between groups         10
  Mean                 1      3888.784   3888.784
  Residual             9      19.296     2.145
Within groups          30
  Residual             30     7.840      0.261
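The entries of tables 3.2 and 3.3 can be reproduced directly from the data of table 3.1. A short sketch (numpy is assumed; S-PLUS was the software used in the original study):

    import numpy as np

    # DP data from table 3.1: one row per malt run, four control samples each
    y = np.array([[10.0,  9.9, 10.1, 10.6],
                  [ 9.1, 10.3, 10.0,  9.0],
                  [11.5, 11.3, 11.6, 11.3],
                  [10.0,  9.6, 10.6, 10.8],
                  [10.0,  9.2, 10.6,  9.2],
                  [10.0, 10.9, 10.9, 10.1],
                  [ 9.1,  9.1,  9.3,  9.0],
                  [ 9.0,  8.3,  9.9, 10.0],
                  [10.3,  9.0,  9.0,  9.7],
                  [ 9.1,  9.1,  8.9,  9.0]])
    b, r = y.shape
    n = b * r

    run_means, grand_mean = y.mean(axis=1), y.mean()
    ss_mean = n * grand_mean ** 2                           # y'P0y = 3888.784
    ss_between = r * ((run_means - grand_mean) ** 2).sum()  # y'(P1 - P0)y = 19.296
    ss_within = ((y - run_means[:, None]) ** 2).sum()       # y'P2y = 7.840

    xi1_hat = ss_between / (b - 1)          # 2.145
    xi2_hat = ss_within / (n - b)           # 0.261
    sigma2_hat = xi2_hat                    # residual variance
    sigma2_u_hat = (xi1_hat - xi2_hat) / r  # between malt run variance, 0.471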
We now present a formal definition of strata.

Definition 3.1 A stratum is a maximal set of linear orthonormal functions of y which are independent and have equal variances.

When orthogonal strata exist, the total variation in the data can be partitioned accordingly and represented in an analysis of variance table, as presented above. We note that the strata presented above differ from the strata derived by a randomisation argument (as described in detail by Nelder (1965a), for example). The randomisation analysis for the one-way classification has three strata: the mean stratum, the between groups stratum and the within groups stratum. While our approach yields two strata, the between groups and within groups strata, we shall see that the mean stratum arises naturally with crossed classified random effects.
3.1.3 Another glimpse at REML

In the previous section the estimation of the variance parameters, namely the variance components (via the stratum variances), was achieved by equating stratum residual mean squares to their expectations. These estimates are the so-called ANOVA estimates of variance components (Searle et al., 1992). We saw in chapter 2 that a residual likelihood could be found by eliminating the mean effects from the linear model. In the multi-stratum case we have several linear models, and to develop an appropriate residual likelihood we need to eliminate the mean effects for each stratum.
In the one-way classification, these mean-free components are y2 and that part of y1 which has zero expectation. As y1 follows a linear model, the arguments of section 2.5 apply. In particular, if X = K1'1n, we can define matrices K11 and K12 of full column rank such that K12'X = K12'K1'1n = 0b−1. Let K1* = [K11 K12]. Now K1K12K12'K1' is an orthogonal projection matrix. In fact K12K12' projects orthogonally to X = K1'1n, and so equals (after some algebra)

K12K12' = Ib − K1'P0K1

where P0 = 1n(1n'1n)⁻¹1n'. Thus

K1K12K12'K1' = K1K1' − K1K1'P0K1K1'
             = P1 − P1P0P1
             = P1 − P0   (3.1.11)

This projection is merely the component of the between groups space orthogonal to the vector of ones (which specifies the unconditional mean of the linear mixed model). Thus the transformation of y to K1*'K1'y results in the complete decomposition
(y11, y12', y2')' ∼ N( (K11'K1'1nµ, 0', 0')', diag(ξ1, ξ1Ib−1, ξ2In−b) )

where y11 = K11'K1'y, y12 = K12'K1'y and y2 = K2'y.
It follows that the log-likelihood free of mean effects is based on the distribution of (y12', y2')' and is given by

ℓR = ℓ(ξ1; y12) + ℓ(ξ2; y2)
   = −½ { (b − 1) log ξ1 + y12'y12/ξ1 + (n − b) log ξ2 + y2'y2/ξ2 }   (3.1.12)
Maximisation of (3.1.12) with respect to ξ1 and ξ2 leads to the unbiased ANOVA estimates as before. The log-likelihood in (3.1.12) is the so-called (log) residual likelihood, since it is the log-likelihood of a maximal set of contrasts which have zero expectation, that is, error or residual contrasts (Patterson and Thompson, 1971). If we define
s1 = y12'y12 = y'(P1 − P0)y
s2 = y2'y2 = y'P2y

then (3.1.12) can be written as (ignoring constants)

ℓs = ℓ(ξ1; s1) + ℓ(ξ2; s2)
   = −½ { (b − 1) log ξ1 + s1/ξ1 + (n − b) log ξ2 + s2/ξ2 }   (3.1.13)
Differentiation of (3.1.13) with respect to ξ1 and ξ2 gives

−2 ∂ℓs/∂ξ1 = (b − 1)/ξ1 − s1/ξ1²
−2 ∂ℓs/∂ξ2 = (n − b)/ξ2 − s2/ξ2²   (3.1.14)

and equating (3.1.14) to zero again gives the ANOVA estimates in (3.1.10). The sums of squares s1 and s2 are independent and distributed as (scaled) chi-squared variates with degrees of freedom equal to the stratum residual degrees of freedom, namely b − 1 and n − b respectively. Thus "residuals" at the various levels have been used to construct a likelihood free of fixed effects, and hence the name residual likelihood. The residual likelihood has therefore been constructed using the complete sufficient statistics for ξ1 and ξ2, namely s1 and s2.
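The closed-form solution can also be checked by maximising (3.1.13) numerically; a small sketch, assuming scipy is available and using the stratum sums of squares of the malt run example:

    import numpy as np
    from scipy.optimize import minimize_scalar

    s1, s2 = 19.296, 7.840   # stratum residual sums of squares
    df1, df2 = 9, 30         # b - 1 and n - b

    def neg2_loglik(xi, s, df):
        # -2 times the contribution of one stratum to (3.1.13)
        return df * np.log(xi) + s / xi

    xi1_hat = minimize_scalar(lambda x: neg2_loglik(x, s1, df1),
                              bounds=(1e-6, 100.0), method="bounded").x
    xi2_hat = minimize_scalar(lambda x: neg2_loglik(x, s2, df2),
                              bounds=(1e-6, 100.0), method="bounded").x
    # xi1_hat -> s1/df1 = 2.145 and xi2_hat -> s2/df2 = 0.261, the ANOVA estimates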
3.2 Randomised complete blocks

The data for this example has been kindly provided by Dr Maria Durban and involves a field experiment testing the yield performance of 272 barley varieties. The trial was laid out in 16 rows, each row consisting of 34 beds. There were two complete blocks: block one occupied rows 1 to 8 and block two rows 9 to 16. Thus within each block there are 272 plots, to which the varieties are allocated at random. The trial layout, in terms of the subdivision of the field into blocks, is presented in table 3.4. A simple model for the yield observed on block i = 1, 2 and plot j = 1, ..., 272 is given by
yij = τs(ij) + ui + eij   (3.2.15)

where s(ij) represents the random assignment of a variety to plot j in block i, τk (k = 1, ..., 272) is the mean effect for the kth variety and ui is the effect for block i. Note that we have not included an overall mean or intercept term in (3.2.15). This avoids the unnecessary complication of placing constraints on the set of 272 variety effects, which for the present represent the variety mean levels. Here we assume the variety effects are fixed. Since the role of the blocks is to model the (co)variation in the data, we classify the block effects as random. In this sense we are partitioning the total variance into σu² and σ², where σu² is the variance of the block effects and σ² is the residual variance. We assume further that the ui and the eij are statistically independent.

Table 3.4 Layout of the blocks in the field for the variety trial example

            Row
Bed    1  2 ... 7  8  9 10 ... 15 16
 1     1  1 ... 1  1  2  2 ...  2  2
 2     1  1 ... 1  1  2  2 ...  2  2
 .     .  .     .  .  .  .      .  .
33     1  1 ... 1  1  2  2 ...  2  2
34     1  1 ... 1  1  2  2 ...  2  2

In general, if there are t fixed (treatment) means and b blocks, with a total of n = bt observations, then we can write (3.2.15) in matrix form as

y = Xτ + Zu + e   (3.2.16)

where X (n × t) is the design matrix which assigns the treatment means to plots and Z (n × b) is the design matrix for the block effects. The data are assumed to be ordered by plot number within blocks, and so it follows that Z = Ib ⊗ 1t. Hence

V = var(y) = σu²ZZ' + σ²In

The model is written symbolically as

y ∼ variety + block + units

where variety and block are factors with 272 and 2 levels respectively.
3.2.1 Orthogonal projections and strata

The strata for this design are identical to those for the one-way classification, with blocks being equivalent to groups. The two strata are therefore the between blocks stratum and the within blocks stratum. As before, we transform the data vector y to K'y, where K = [K1 K2] and these matrices arise from P1 and P2. We have

K'y = (y1', y2')' ∼ N( ((K1'Xτ)', (K2'Xτ)')', diag(ξ1Ib, ξ2Ib(t−1)) )   (3.2.17)

where the ξi, i = 1, 2, are the stratum variances and are given by

ξ1 = tσu² + σ²,   ξ2 = σ²
3.2.2 Estimation of treatment effects and analysis of variance

The variety effects are of interest in this example, and this introduces the first complication as far as estimation is concerned. The model (3.2.17) suggests that variety effects may be present in both strata or, in general, in several strata. The problem here, and in fact in general for more complicated designs, is to obtain efficient estimates of both τ and the ξi. Nelder (1965b) considers this problem, and the following is largely (though not exactly) based on his development. We begin by defining some matrices that will appear in the developments of this chapter. The matrices An and Bn are defined by
An = 1n(1n'1n)⁻¹1n'
Bn = In − 1n(1n'1n)⁻¹1n' = In − An
Both An and Bn are orthogonal projection matrices, they are orthogonal to each other, and their size (which will vary depending on the application) is given by the subscript (here n). An replaces each element of a vector by the mean of that vector, while Bn replaces each element by its deviation from the mean. Equation (3.2.17) shows that the treatment effects may appear in both strata and hence in both linear models defined in that equation. We can therefore estimate part or all of τ in each stratum. Consider estimation of τ in the ith stratum. If τ̂[i] denotes the estimate of τ using only the ith stratum, then, using the normal equations (2.3.10) of chapter 2, we have

X'PiX τ̂[i] = X'Pi y   (3.2.18)

The matrix X'PiX is the information matrix for the fixed effects in stratum i. It may not be of full rank, so that the solution is not unique, and hence obtaining a specific solution to (3.2.18) depends on finding an appropriate generalised inverse. We consider this below, but firstly we examine the form of the information matrices. As Z = Ib ⊗ 1t, we have
P1 = Ib ⊗ At,   P2 = Ib ⊗ Bt
An important property of the orthogonal projection matrices Pi is that they are invariant to permutations of units within blocks, because permuting units within blocks does not change Pi. This means that the rows of X can be reordered in the manipulations to follow. Thus we take a convenient form for X, namely X = 1b ⊗ It. Using properties of kronecker products it is easy to show that
X'P1X = bAt = bT1,   X'P2X = bBt = bT2   (3.2.19)

Note also that

X'P1 = 1b' ⊗ At,   X'P2 = 1b' ⊗ Bt   (3.2.20)
In stratum 1 we therefore have the normal equations

bAt τ̂[1] = (1b' ⊗ At)y

or

At τ̂[1] = (1/b)(1b' ⊗ At)y   (3.2.21)

Now

At τ = τ̄· 1t

so that the left-hand side of (3.2.21) shows that in stratum 1 we can only estimate the overall mean of the treatment effects. The right-hand side of (3.2.21) confirms this, as it equals

ȳ·· 1t

In stratum 2,

bBt τ̂[2] = (1b' ⊗ Bt)y

and we can use properties of kronecker products to reduce the equations to

Bt τ̂[2] = (1/b)(1b' ⊗ Bt)y = Bt (1/b)(1b' ⊗ It)y = Bt ȳt   (3.2.22)

where ȳt is the vector of treatment means calculated across the blocks. Now

Bt τ = τ − τ̄· 1t

so that the left-hand side of (3.2.22) shows that in stratum 2 we can only estimate the deviations of the treatment effects from their overall mean. The right-hand side of (3.2.22) equals

ȳt − ȳ·· 1t

the deviations of the treatment sample means about the overall sample mean. Before we turn to important aspects of these results, note that we can solve the two sets of normal equations for strata 1 and 2 by using generalised inverses. Both At and Bt are generalised inverses of themselves, and hence using (2.4.22) we find

τ̂[1] = At (1/b)(1b' ⊗ At)y = (1/b)(1b' ⊗ At)y = ȳ·· 1t

τ̂[2] = Bt Bt ȳt = Bt ȳt = ȳt − ȳ·· 1t

which confirms the statements made regarding the effects that can be estimated from each stratum. Now in chapter 2 we saw that a decomposition of the mean in a single-factor design was given by (2.6.28). The matrices in that equation are exactly of the form given in (3.2.20). If we let
T1 = At,   T2 = Bt

then it is easy to see that T1 + T2 = It, T1'T2 = 0 and that T1 and T2 are orthogonal projection matrices. Thus the treatment effects have an orthonormal decomposition in a similar manner to the variance matrix, which was determined by the block structure. Based on (3.2.20), the estimates can be written as

Ti τ̂[i] = (1/b) X'Pi y   (3.2.23)

In this form the left-hand side represents the effects being estimated in the ith stratum, while the right-hand side is a sum involving the data and a divisor b, which is called the effective replication of the effect in stratum i. This example is a special case of an important concept for the estimation of fixed effects within strata, that of generally balanced designs (Nelder, 1965b). A design is generally balanced if the information matrix for the fixed effects in stratum i can be written as

X'PiX = Σ_{j=1}^{l} λij Tj   (3.2.24)

This form differs from that of Nelder (1965b), but it can be shown that the two definitions are equivalent. When this condition holds, there is no need to find an inverse for the left-hand side of (3.2.18). If the λik corresponding to a Tk is zero, then it follows that there is no information in stratum i on the fixed effects Tkτ.
Table 3.5 Effective replication for the randomised complete blocks example

Stratum          T1τ (mean)   T2τ (treatment)
Between Blocks   λ11 = b      λ12 = 0
Within Blocks    λ21 = 0      λ22 = b
For a generally balanced design, consider estimation of Tkτ, where k = 1, ..., l, in stratum i; this is only possible if λik ≠ 0. Pre-multiplying (3.2.18) by Tk gives

Tk (Σ_{j=1}^{l} λij Tj) τ̂[i] = Tk X'Pi y
⇒ λik Tk τ̂[i] = Tk X'Pi y
⇒ Tk τ̂[i] = (1/λik) Tk X'Pi y   (3.2.25)

At this point it is worth considering the form of (3.2.25) in more detail. Beginning from standard maximum likelihood estimation of the vector of fixed effects τ, it follows that for generally balanced designs there is a very simple form for the maximum likelihood estimate of Tkτ in stratum i. This estimate (Tkτ̂[i]) is a simple function of Piy: pre-multiplication by X' forms totals for each treatment, Tk takes deviations and λik is a scaling factor. The scalar λik is known as the effective replication of Tkτ in stratum i. We have seen that the RCB design is an example of this type: equation (3.2.24) holds, see (3.2.20), with the effective replication for each treatment term given for each stratum in table 3.5. The sum of squares due to the fixed effects Tkτ in stratum i is then given by (see (2.3.15) and the derivation leading to that form)

(Tk τ̂[i])' X'KiKi'y = τ̂[i]'Tk X'Pi y = λik (Tk τ̂[i])'(Tk τ̂[i])   (3.2.26)

using (3.2.25) and the idempotency of Tk. This has degrees of freedom equal to the rank of Tk. As Tk is an orthogonal projection, its rank equals its trace. If each set of fixed effects Tkτ, k = 1, ..., l, can only be estimated in one stratum, then the design is said to be orthogonal. This occurs if there is only one non-zero λik for each k. The randomised complete block design is an orthogonal design, as can be seen from table 3.5: the mean is estimated in the between blocks stratum and the treatment effects (deviations from the mean) are estimated in the within blocks stratum. The analysis of variance table for the RCB can be constructed by subdividing the total sum of squares for each stratum into a sum of squares due to the fixed effects estimated in that stratum and a residual sum of squares. The ANOVA estimates of the stratum variances are the stratum residual mean squares. The full analysis of variance decomposition for a randomised complete block design is given in table 3.6. Note that using (3.2.26) the sum of squares due to the mean is given by

sm = λ11 (T1 τ̂[1])'(T1 τ̂[1]) = (1/λ11) y'P1XT1X'P1y = y'P0y

The residual sum of squares for the between blocks stratum is obtained by difference as s1 = y'P1y − sm. Similarly, the sum of squares due to the treatment effects is given by

st = λ22 (T2 τ̂[2])'(T2 τ̂[2]) = (1/λ22) y'P2XT2X'P2y

and the residual sum of squares for the within blocks stratum is obtained as s2 = y'P2y − st. Also note that the expectations of the mean squares are given in an abbreviated form: the non-centrality parameters (see appendix ??) are written as µ(·) to indicate which treatment effects are involved. The ANOVA table for the variety trial data is presented in table 3.7. Note firstly that the treatment effects are significantly different from zero, indicating that varietal differences exist. Secondly, the ANOVA estimates of the stratum variances are ξ̂1 = 2.324 and ξ̂2 = 0.1380, from which the ANOVA estimates of the variance components are σ̂² = 0.1380 and σ̂u² = 0.00803. Thus the within blocks variance is considerably larger than the between blocks variance (which is based on only 1 degree of freedom).
Table 3.6 Analysis of variance for an RCB design

Strata/Decomposition   d.f.             S.S.     Expectation of M.S.
Between Blocks         b                y'P1y    -
  Mean                 1                sm       ξ1 + µ(T1τ[1])
  Residual             b − 1            s1       ξ1
Within Blocks          b(t − 1)         y'P2y    -
  Treatment            t − 1            st       ξ2 + µ(T2τ[2])
  Residual             (b − 1)(t − 1)   s2       ξ2
Table 3.7 Analysis of variance for the variety trial data

Strata/Decomposition   d.f.   S.S.       M.S.       F-test
Between Blocks         2      16544.90
  Mean                 1      16542.58   16542.58
  Residual             1      2.324      2.324
Within Blocks          542    118.01
  Treatment            271    80.613     0.297      2.156
  Residual             271    37.400     0.138
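The variance component estimates quoted above follow in one line from the stratum residual mean squares, as this short sketch shows (values taken from table 3.7):

    t = 272                           # varieties per block
    xi1_hat, xi2_hat = 2.324, 0.138   # stratum residual mean squares (table 3.7)
    sigma2_hat = xi2_hat                    # residual variance: 0.138
    sigma2_u_hat = (xi1_hat - xi2_hat) / t  # block variance: ~0.00803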
3.2.3 Another look at REML

ANOVA estimates of the stratum variances, and hence of the variance components, are obtained by equating stratum residual mean squares to their expectations. In the spirit of the approach used in section 3.1, we consider the log-likelihood of the between blocks and within blocks residual sums of squares. These sums of squares are distributed as (scaled) chi-squared variates with degrees of freedom equal to the stratum residual degrees of freedom given in table 3.6. The log-likelihood is, ignoring constants,

ℓs = ℓ(ξ1; s1) + ℓ(ξ2; s2)
   = −½ { (b − 1) log ξ1 + s1/ξ1 + (b − 1)(t − 1) log ξ2 + s2/ξ2 }   (3.2.27)
Differentiation of (3.2.27) with respect to ξ1 and ξ2 leads to the ANOVA estimates. It can be shown that the log-likelihood in (3.2.27) is equivalent, as far as estimation is concerned, to the log-likelihood of that part of y1 and y2 which has zero expectation. That is, if we let y1* = K1*'y and y2* = K2*'y, where the matrices K1* (n × (b − 1)) and K2* (n × (b − 1)(t − 1)) are full column rank matrices such that

K1*K1*' = P1 − P0
K2*K2*' = P2 − (1/λ22) P2XT2X'P2
K1*'K2* = 0

then, with K* = [K1* K2*],

K*'y = (y1*', y2*')' ∼ N( 0, diag(ξ1Ib−1, ξ2I(b−1)(t−1)) )
3.3 Split plot design
The last example we present on orthogonal designs, with orthogonal block and treatment structure, is a split plot example. The data is again kindly provided by Dr Maria Durban and involves an experiment designed to investigate the effect on yield of controlling the fungus powdery mildew in barley. Seventy varieties of barley were grown with and without fungicide application. The field layout consisted of four blocks (labelled I, II, III, IV) with two whole-plots per block, each split into 70 sub-plots. The two fungicide treatments were randomly allocated to the two whole-plots per block, while the 70 varieties were randomly assigned to the 70 sub-plots. The trial was laid out in 56 beds by 10 rows. Each block consisted of 14 beds by 10 rows, with block 1 occupying beds 1 to 14, block 2 beds 15 to 28 and so on. Each whole-plot within each block comprised 7 beds by 10 rows, and a sub-plot consisted of a single row. The layout of the trial, indicating the allocation of fungicide treatments to whole-plots and the arrangement of blocks, is presented in table 3.8.
Table 3.8 Indicative trial layout for the split plot design

Beds          Block   Whole-Plot   Fungicide
1, ..., 7     I       1            −
8, ..., 14    I       2            +
15, ..., 21   II      1            −
22, ..., 28   II      2            +
29, ..., 35   III     1            −
36, ..., 42   III     2            +
43, ..., 49   IV      1            +
50, ..., 56   IV      2            −
The statistical model for the yield observed on sub-plot k = 1, ..., 70, whole-plot j = 1, 2 and block i = 1, ..., 4 is

yijk = τs(ijk) + bi + wij + eijk   (3.3.28)

where s(ijk) represents the randomisation of treatments (fungicides and varieties) to experimental units and τl (l = 1, ..., 140) is the mean effect for treatment l. As in the randomised block example, we use τ to represent the 140 treatment mean effects, rather than partitioning at this stage into the main effects of fungicide and variety and their interaction. The standard analysis assumes the terms bi for blocks, wij for whole-plots within blocks and eijk for sub-plots are all normally distributed, mutually independent (within and between) sets of random effects, with variances σb², σw² and σ² respectively. We assume further that the bi, wij and eijk are pairwise statistically independent. In general we assume there are t treatment effects, b blocks, w whole-plots in each block and s sub-plots in each whole-plot, with n = bws total observations. Then we can write (3.3.28) in matrix form as
y = Xτ + Z1u1 + Z2u2 + e = Xτ + Zu + e

where X (n × t) is a design matrix which assigns the factorial combinations of the treatments to experimental units. We assume there are w treatments applied to the whole-plots and s treatments applied to the sub-plots, so that t = ws. The whole-plot treatment factor will be denoted in the following by Wtreat and the sub-plot treatment factor by Streat; for notational convenience these are sometimes abbreviated to W and S respectively. In the fungicide by variety example w = 2 and s = 70. The matrix Z1 (n × b) is the design matrix for block effects and the matrix Z2 (n × bw) is the design matrix for effects of whole-plots within blocks. The vectors u1 (b × 1) and u2 (bw × 1) represent the effects for blocks and whole-plots within blocks respectively. Finally, we define Z = [Z1 Z2] and u = (u1', u2')'. This is a mixed model with three random components. The marginal distribution of y is

y ∼ N(Xτ, σb²Z1Z1' + σw²Z2Z2' + σ²In)   (3.3.29)

The mixed model can also be written as

y ∼ Wtreat ∗ Streat + block/wplot + units

where block and wplot are factors with b and w levels respectively, and wplot labels the whole-plots within blocks. Note that the fixed effects formulation Wtreat ∗ Streat reflects the decomposition of the treatment effects into the main effects of Wtreat and Streat and their interaction, to be considered in section 3.3.2.
3.3.1 Orthogonal projections and strata

Using a similar approach to section 3.2.1, the variance matrix can be expressed in terms of projections involving orthogonal components. Thus if

var(y) = V = σb²Z1Z1' + σw²Z2Z2' + σ²In

then V can be written as

V = Σ_{i=1}^{3} ξiPi

where

ξ1 = wsσb² + sσw² + σ²,   ξ2 = sσw² + σ²,   ξ3 = σ²
Simple expressions for the Pi can be obtained by noting the form of the random effects design matrices, ie Z1 and Z2. That is, the data are assumed ordered as sub-plots within whole-plots within blocks, so that

Z1 = Ib ⊗ 1w ⊗ 1s
Z2 = Ib ⊗ Iw ⊗ 1s

Hence

P1 = Z1(Z1'Z1)⁻¹Z1' = Ib ⊗ Aw ⊗ As
P2 = Z2(Z2'Z2)⁻¹Z2' − Z1(Z1'Z1)⁻¹Z1' = Ib ⊗ Bw ⊗ As
P3 = In − Z2(Z2'Z2)⁻¹Z2' = Ib ⊗ Iw ⊗ Bs

We also define

P0 = 1n(1n'1n)⁻¹1n' = Ab ⊗ Aw ⊗ As
Recalling the properties of Am and Bm, it follows that the Pi are orthogonal projection matrices summing to the identity matrix. The ranks of P1, P2 and P3 are b, b(w − 1) and bw(s − 1), since rank(Bm) = m − 1, rank(Am) = 1 and rank(A ⊗ B) = rank(A) rank(B) (see section ??). Thus we have three strata of variation, and using a similar approach to section 3.2.1 we can transform to three independent linear models, each with homogeneous variation. Formally, we transform the data vector y to K'y, where K = [K1 K2 K3] and KiKi' = Pi, i = 1, 2, 3. Then

K'y = (y1', y2', y3')' ∼ N( ((K1'Xτ)', (K2'Xτ)', (K3'Xτ)')', diag(ξ1Ib, ξ2Ib(w−1), ξ3Ibw(s−1)) )

These strata correspond to blocks, whole-plots within blocks and sub-plots within whole-plots. For notational convenience these strata will be referred to in the following by the labels blocks, blocks.wplots and blocks.wplots.splots.
3.3.2 Estimation of treatment effects and analysis of variance
The model for the fixed effects, τ, can be written in a form similar to that used for the fixed effects in section 3.2.2. That is,

τ = Σ_{j=1}^{l} Tjτ   (3.3.30)

where l = 4 for the fungicide by variety trial example. The individual terms partition the treatment effects into the overall mean, the main effect of factor Wtreat, the main effect of factor Streat and the Wtreat.Streat interaction. The treatment projection matrices are given by
T1 = Aw ⊗ As
T2 = Bw ⊗ As
T3 = Aw ⊗ Bs
T4 = Bw ⊗ Bs   (3.3.31)
The Tj are a set of orthogonal projection matrices summing to the identity. The split plot design is a generally balanced design; this follows from the definition given in section 3.2.2. That is,

X'PiX = Σ_{j=1}^{l} λij Tj

for i = 1, 2, 3. In fact, reordering X as in section 3.2.2, it can be shown that

X'P1X = bAw ⊗ As = bT1
X'P2X = bBw ⊗ As = bT2
X'P3X = bIw ⊗ Bs = bT3 + bT4
The effective replication (λij) is given in table 3.9. This implies that the overall mean is estimated in the blocks stratum, the main effects of Wtreat are estimated in the blocks.wplots stratum and the main effects of Streat and the interaction effects of Wtreat and Streat are estimated in the blocks.wplots.splots stratum. The design is therefore orthogonal since each set of treatment effects in (3.3.30) is estimated in one stratum only.
Table 3.9 Effective replication for the split plot example

Stratum                T1τ (mean)   T2τ (W)   T3τ (S)   T4τ (W.S)
Blocks                 λ11 = b      λ12 = 0   λ13 = 0   λ14 = 0
Blocks.wplots          λ21 = 0      λ22 = b   λ23 = 0   λ24 = 0
Blocks.wplots.splots   λ31 = 0      λ32 = 0   λ33 = b   λ34 = b
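As a concrete check on the treatment decomposition (3.3.31), the following sketch builds the four projection matrices for the fungicide by variety example (w = 2, s = 70) and verifies that they are orthogonal projections whose traces give the treatment degrees of freedom:

    import numpy as np

    def A(m):   # averaging projection: replaces each element by the mean
        return np.ones((m, m)) / m

    def B(m):   # deviations-from-the-mean projection
        return np.eye(m) - A(m)

    w, s = 2, 70
    T1 = np.kron(A(w), A(s))   # overall mean
    T2 = np.kron(B(w), A(s))   # Wtreat main effects
    T3 = np.kron(A(w), B(s))   # Streat main effects
    T4 = np.kron(B(w), B(s))   # Wtreat.Streat interaction

    for T in (T1, T2, T3, T4):
        assert np.allclose(T @ T, T)                  # idempotent
    assert np.allclose(T1 + T2 + T3 + T4, np.eye(w * s))
    # ranks (traces): 1, w-1, s-1 and (w-1)(s-1)
    assert np.isclose(np.trace(T4), (w - 1) * (s - 1))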
The analysis of variance can now be constructed. The total sum of squares in each stratum can be subdivided into treatment sum(s) of squares and a residual sum of squares. Table 3.10 presents the full analysis of variance table. The sums of squares due to the mean, the main effects of Wtreat and Streat, and their interaction are given by

sm = (1/λ11) y'P1XT1X'P1y = y'P0y
sw = (1/λ22) y'P2XT2X'P2y
ss = (1/λ33) y'P3XT3X'P3y
sws = (1/λ34) y'P3XT4X'P3y

The residual sum of squares for each stratum is obtained by difference; for example, in the blocks.wplots.splots stratum, s3 = y'P3y − ss − sws. The stratum variances ξ1, ξ2 and ξ3 are estimated by the residual mean squares in each stratum. Table 3.11 presents the analysis of variance table for the fungicide by variety example, including the F-tests for the fixed effects. Since the fungicide effects are estimated in the blocks.wplots stratum, the appropriate mean square for testing fungicide effects is the residual in this stratum (see table 3.10). Similarly, since the variety and fungicide by variety effects are estimated in the blocks.wplots.splots stratum, the residual mean square in this stratum is the appropriate error for testing these effects (see table 3.10). There is a very large effect of fungicide treatment; however, there is no evidence of an interaction. We revisit this data-set in chapter ??.
Table 3.10 Analysis of variance for a split plot design

Strata/Decomposition   d.f.              S.S.     Expectation of M.S.
Blocks                 b                 y'P1y    -
  Mean                 1                 sm       ξ1 + µ(T1τ[1])
  Residual             b − 1             s1       ξ1
Blocks.wplots          b(w − 1)          y'P2y    -
  Wtreat               w − 1             sw       ξ2 + µ(T2τ[2])
  Residual             (b − 1)(w − 1)    s2       ξ2
Blocks.wplots.splots   bw(s − 1)         y'P3y    -
  Streat               s − 1             ss       ξ3 + µ(T3τ[3])
  W.S                  (w − 1)(s − 1)    sws      ξ3 + µ(T4τ[3])
  Residual             w(b − 1)(s − 1)   s3       ξ3
Table 3.11 Analysis of variance for the fungicide by variety example

Strata/Decomposition   d.f.   S.S.        M.S.        F-test
Blocks                 4      15389.808
  Mean                 1      15374.580   15374.58
  Residual             3      15.228      5.076
Blocks.wplots          4      45.149
  Fungicide            1      42.019      42.019      40.271
  Residual             3      3.130       1.043
Blocks.wplots.splots   552    77.107
  Variety              69     39.284      0.569       7.201
  F.V                  69     5.090       0.074       0.933
  Residual             414    32.733      0.079
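Using the expressions for ξ1, ξ2 and ξ3 in section 3.3.1, the ANOVA estimates of the three variance components follow from the stratum residual mean squares in table 3.11; a brief sketch:

    w, s = 2, 70
    xi1, xi2, xi3 = 5.076, 1.043, 0.079   # stratum residual mean squares
    sigma2_hat = xi3                      # sub-plot (residual) variance: 0.079
    sigma2_w_hat = (xi2 - xi3) / s        # whole-plot variance: ~0.0138
    sigma2_b_hat = (xi1 - xi2) / (w * s)  # block variance: ~0.0288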
3.3.3 REML for a split plot design

Estimation of the stratum variances (and thence the variance components) has been based on the stratum residual mean squares. As for the randomised complete block design in section 3.2.3, we could also consider the joint likelihood of the residual sums of squares of each stratum. Since the stratum residual sums of squares are independently distributed as (scaled) chi-squared variates, it can be shown that the log-likelihood of s1, s2 and s3 is, ignoring constants,

ℓs = ℓ(ξ1; s1) + ℓ(ξ2; s2) + ℓ(ξ3; s3)
   = −½ { (b − 1) log ξ1 + s1/ξ1 + (b − 1)(w − 1) log ξ2 + s2/ξ2 + (b − 1)w(s − 1) log ξ3 + s3/ξ3 }   (3.3.32)

It is interesting to compare (3.3.32) to the log-likelihood of the data y, which, after replacing the fixed effects by their generalised least squares estimates, is given by

−½ { log |V| + (y − Xτ̂)'V⁻¹(y − Xτ̂) }   (3.3.33)

First consider the log determinant in (3.3.33). Since K is orthogonal,

|V| = |K'||V||K| = |K'VK| = Π_{i=1}^{3} ξi^νi

since K'VK = diag(ξiIνi), where νi = rank(Ki). Thus

log |V| = Σ_{i=1}^{3} νi log ξi

Recall that V = Σ_{i=1}^{3} ξiPi, so that

V⁻¹ = Σ_{i=1}^{3} ξi⁻¹Pi

and the quadratic form in the log-likelihood is given by

(y − Xτ̂)'V⁻¹(y − Xτ̂) = Σ_{i=1}^{3} ξi⁻¹(yi − Ki'Xτ̂)'(yi − Ki'Xτ̂) = s1/ξ1 + s2/ξ2 + s3/ξ3

Thus (3.3.33) is given by

−½ { b log ξ1 + s1/ξ1 + b(w − 1) log ξ2 + s2/ξ2 + bw(s − 1) log ξ3 + s3/ξ3 }   (3.3.34)
The coefficient of log ξi in (3.3.34) is the stratum total degrees of freedom, while in (3.3.32) it is the stratum residual degrees of freedom, ie the total degrees of freedom minus the number of treatment effects estimated in that stratum. The other terms, which depend on the data, are the same.
Differentiation of (3.3.32) with respect to ξ1, ξ2 and ξ3 leads to the ANOVA estimates; this is not the case for (3.3.34). Hence it seems natural, as far as likelihood estimation of the stratum variances is concerned, to use (3.3.32), as it takes account of the degrees of freedom used in the estimation of treatment effects, and the consequent (residual) maximum likelihood estimates of the stratum variances will be unbiased and equal to the ANOVA estimates. Similarly, it can be shown (see chapter 5 for details) that the log-likelihood in (3.3.32) is equivalent, as far as estimation is concerned, to the log-likelihood of that part of y1, y2 and y3 which has zero expectation. This is again the residual log-likelihood as defined by Patterson and Thompson (1971).
3.4 Balanced incomplete blocks

Before leaving the analysis of designed experiments to consider more general mixed models, it is useful to consider the analysis of a balanced incomplete block design. This is an example of a design with an orthogonal block structure in which treatments are estimated in more than one stratum. The data we will use to illustrate the analysis is taken from a long term experiment conducted at the Horticultural Research Station, Dareton, NSW. The data is kindly provided by Dr. A. Grieve (State Forests, NSW) and Ms L. McFadyen (NSW Agriculture). The experiment involved examining the effects of irrigation frequency and volume on the growth and yield of sultana grapes. A total of 9 treatments was used, namely the factorial combinations of three irrigation amounts (low, medium and high) by three irrigation frequencies (based on soil moisture deficit levels). In the following we ignore the factorial structure, as this unnecessarily complicates our development of the incomplete block analysis; furthermore, the scientists were primarily interested in the combined effects of amount and frequency. The experiment design consisted of two repeats of a balanced incomplete block design, resulting in 8 replicates. Within each replicate there were 3 incomplete blocks with 3 plots in each block. The actual field layout is presented in table 3.12; it comprises a rectangular array of 9 rows (indexed by blocks and plots within blocks) by 8 columns (squares and replicates). The statistical model for the yield of grapes in replicate i = 1, ..., 8, block j = 1, 2, 3 within replicate i and plot k = 1, 2, 3 within replicate i and block j is

yijk = τs(ijk) + ri + bij + eijk   (3.4.35)

where s(ijk) represents the randomisation of treatment combinations to experimental units and τl, l = 1, ..., 9, is the mean for treatment l. The standard analysis assumes the terms ri for replicates, bij for blocks and eijk for plots are normally distributed, mutually independent sets of random effects with variances σr², σb² and σ² respectively. In general we assume there are t treatments, r replicates, b (incomplete) blocks within replicates and p plots within blocks, with n = rbp total observations and t = bp. The grape example has n = 72, t = 9, r = 8, b = 3 and p = 3.
Table 3.12 Field layout and treatment randomisation of irrigation management trial

                    Replicate
Block   Plot    1   2   3   4   5   6   7   8
1       1       2   7   1   5   1   3   2   2
1       2       3   3   4   1   3   5   7   5
1       3       1   5   7   9   2   7   6   8
2       1       5   2   3   7   7   8   4   1
2       2       6   4   6   6   9   1   3   4
2       3       4   9   9   2   8   6   8   7
3       1       8   6   2   3   4   4   9   3
3       2       9   8   5   8   6   9   5   6
3       3       7   1   8   4   5   2   1   9
Then we can write (3.4.35) in matrix form as

y = Xτ + Z1u1 + Z2u2 + e = Xτ + Zu + e

where X (n × t) is a design matrix which assigns the treatments to experimental units. The matrix Z1 (n × r) is the design matrix for replicate effects and Z2 (n × rb) is the design matrix for the effects of blocks within replicates. The vectors u1 and u2 represent the replicate and block within replicate effects respectively. Finally, we define the matrix Z = [Z1 Z2] and u = (u1', u2')'. The marginal distribution of y is

y ∼ N(Xτ, σr²Z1Z1' + σb²Z2Z2' + σ²In)   (3.4.36)

The mixed model can also be written as

y ∼ treatment + rep + rep.block

where treatment, rep and block are factors with t, r and b levels respectively.
3.4.1 Orthogonal projections and strata

Using a similar approach to section 3.2.1, the variance matrix can be expressed in terms of projections involving orthogonal components. Thus if

var(y) = V = σr²Z1Z1' + σb²Z2Z2' + σ²In

then V can be written as

V = Σ_{i=1}^{3} ξiPi

where

ξ1 = bpσr² + pσb² + σ²,   ξ2 = pσb² + σ²,   ξ3 = σ²
As before, simple expressions for the Pi can be obtained by noting the form of the random effects design matrices, ie Z1 and Z2. Assuming the data are ordered as plots within blocks within replicates,

Z1 = Ir ⊗ 1b ⊗ 1p
Z2 = Ir ⊗ Ib ⊗ 1p

Hence

P1 = Z1(Z1'Z1)⁻¹Z1' = Ir ⊗ Ab ⊗ Ap
P2 = Z2(Z2'Z2)⁻¹Z2' − Z1(Z1'Z1)⁻¹Z1' = Ir ⊗ Bb ⊗ Ap
P3 = In − Z2(Z2'Z2)⁻¹Z2' = Ir ⊗ Ib ⊗ Bp

and we also define

P0 = 1n(1n'1n)⁻¹1n' = Ar ⊗ Ab ⊗ Ap
The Pi are orthogonal projection matrices summing to the identity matrix. The ranks of P1, P2 and P3 are r, r(b − 1) and rb(p − 1). There are three strata, and these are the same, in terms of variance structure, as in the split plot example. Hence we can transform to three independent linear models, each with homogeneous variation; the details are omitted. The strata will be labelled rep, rep.block and rep.block.plot.
3.4.2 Estimation of treatment effects

The model for the fixed effects, τ, can be written as

τ = Σ_{j=1}^{l} Tjτ   (3.4.37)

where l = 2 for the grape example, and the two terms represent the overall mean and the deviations of the treatment effects from the overall mean. The treatment projection matrices are given by

T1 = At and T2 = Bt   (3.4.38)
It can be shown that the design is generally balanced, ie

X'PiX = Σ_{j=1}^{l} λij Tj

for i = 1, 2, 3. Using the properties of balanced incomplete block designs it can be shown that

X'P1X = rT1
X'P2X = r(1 − E)T2
X'P3X = rE T2

where E is the efficiency factor of the design (see John and Williams, 1998), which for a BIB is given by E = {t(p − 1)}/{p(t − 1)}. Thus (3.2.24) holds, with the effective replication given in table 3.13.
Table 3.13 Effective replication for the BIB example

Stratum          T1τ (mean)   T2τ (treatment)
Rep              λ11 = r      λ12 = 0
Rep.block        λ21 = 0      λ22 = r(1 − E)
Rep.block.plot   λ31 = 0      λ32 = rE
The effective replication of T2τ (ie the treatment effects) in the two strata where there is information is given by λ22 = r(1 − E) and λ32 = rE. In the grape example the effective replication of T2τ is 2 and 6 respectively, since E = 0.75. Hence, if we consider estimation of T2τ in the rep.block stratum, it follows that

T2(X'P2X)τ̂[2] = T2X'P2y
⇒ λ22T2τ̂[2] = T2X'P2y
⇒ T2τ̂[2] = (1/λ22) T2X'P2y   (3.4.39)

It also follows that

var(T2τ̂[2]) = (1/λ22²) T2X'P2 var(y) P2XT2
            = (1/λ22²) T2X'P2VP2XT2
            = (ξ2/λ22²) T2X'P2XT2
            = (ξ2/λ22) T2
Similarly, estimation of T2τ in the rep.block.plot stratum gives

T2τ̂[3] = (1/λ32) T2X'P3y

with variance

var(T2τ̂[3]) = (ξ3/λ32) T2
To obtain an efficient estimate of T2τ we therefore need to combine these estimates, weighting each by the inverse of its variance, that is by λi2/ξi, i = 2, 3. The explicit form for the combined estimate of T2τ can be derived using this approach. Alternatively, if we consider generalised least squares estimation of τ, it follows that

X'V⁻¹Xτ̂ = X'V⁻¹y
⇒ X'(Σ_{i=1}^{3} ξi⁻¹Pi)Xτ̂ = X'(Σ_{i=1}^{3} ξi⁻¹Pi)y
⇒ ( (λ11/ξ1)T1 + (λ22/ξ2)T2 + (λ32/ξ3)T2 ) τ̂ = X'(ξ1⁻¹P1 + ξ2⁻¹P2 + ξ3⁻¹P3)y

since X'P1X = λ11T1, X'P2X = λ22T2 and X'P3X = λ32T2. Pre-multiplying by T2 gives

(λ22/ξ2 + λ32/ξ3) T2τ̂ = T2X'(ξ2⁻¹P2 + ξ3⁻¹P3)y

Hence

T2τ̂ = (λ22/ξ2 + λ32/ξ3)⁻¹ T2X'(ξ2⁻¹P2 + ξ3⁻¹P3)y
     = (λ22/ξ2 + λ32/ξ3)⁻¹ { (λ22/ξ2) T2τ̂[2] + (λ32/ξ3) T2τ̂[3] }   (3.4.40)

and

var(T2τ̂) = (λ22/ξ2 + λ32/ξ3)⁻¹ T2

This estimate combines the information across the two strata in which the treatment effects T2τ are estimated. The weights in (3.4.40) depend on the stratum variances, which are unknown. Yates (1940) gives an analysis of variance decomposition for this design, which is presented in table 3.14. The sums of squares due to the mean and due to the treatments in the rep.block and rep.block.plot strata are given by

sm = y'P0y
st2 = (1/λ22) y'P2XT2X'P2y
st3 = (1/λ32) y'P3XT2X'P3y

The residual sum of squares for each stratum is obtained by difference; for example, in the rep.block.plot stratum, s3 = y'P3y − st3. Yates (1940) suggests estimating ξ2 using the residual mean square for the rep.block stratum and ξ3 using the residual mean square for the rep.block.plot stratum. These estimates are then used to form the combined estimate of T2τ. Although intuitively sensible, there are difficulties with this estimation approach, which suggests that the estimates of the stratum variances may not be efficient.
Table 3.14 Analysis of variance for the BIB trial

Strata/Decomposition   d.f.                S.S.     Expectation of M.S.
Rep                    r                   y'P1y    -
  Mean                 1                   sm       ξ1 + µ(T1τ[1])
  Residual             r − 1               s1       ξ1
Rep.block              r(b − 1)            y'P2y    -
  Treatment            t − 1               st2      ξ2 + µ(T2τ[2])
  Residual             r(b − 1) − t + 1    s2       ξ2
Rep.block.plot         rb(p − 1)           y'P3y    -
  Treatment            t − 1               st3      ξ3 + µ(T2τ[3])
  Residual             rb(p − 1) − t + 1   s3       ξ3
Nelder (1968) considers this issue and presents a fully efficient approach, achieved by iterating the successive estimation of the treatment effects and of the stratum variances in the strata where treatment effects are estimated. REML estimation produces the fully efficient estimates of the stratum variances; the combined estimates of the treatment effects are produced as a by-product of the algorithm, as Empirical Generalised Least Squares (EGLS) estimates (see chapter 5).
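The weighting in (3.4.40) is easy to express computationally. In the sketch below, tau2_block and tau2_plot are hypothetical placeholders for the stratum estimates T2τ̂[2] and T2τ̂[3], and the stratum variances are given illustrative values:

    import numpy as np

    t, p, r = 9, 3, 8
    E = t * (p - 1) / (p * (t - 1))      # efficiency factor: 0.75
    lam22, lam32 = r * (1 - E), r * E    # effective replication: 2 and 6
    xi2, xi3 = 1.0, 0.5                  # stratum variances (illustrative)

    tau2_block = np.zeros(t)             # placeholder for T2 tau_hat[2]
    tau2_plot = np.zeros(t)              # placeholder for T2 tau_hat[3]

    w2, w3 = lam22 / xi2, lam32 / xi3    # inverse-variance weights
    tau2_combined = (w2 * tau2_block + w3 * tau2_plot) / (w2 + w3)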
3.5 In search of efficient estimation for variance components
The problem of obtaining efficient estimates of variance components in more general designs or settings is now examined. Of the examples presented, those with orthogonal treatment structures present no difficulties for simultaneous estimation of variance components and fixed effects. When there is information on treatment effects in more than one stratum, it is not as straightforward to obtain efficient estimates of variance components, although in the case of generally balanced designs Nelder (1968) presents an iterative approach. In more complex examples, say in animal breeding, where blocks correspond to groups of related animals, it often happens that blocks will not be of equal size, and the variance structure implied by the decomposition into an orthogonal block structure is not appropriate. For example, consider a one-way classification with b groups and ri observations per group (i = 1, ..., b). The decomposition into orthogonal block structure generates

V = (ξ0 − ξ1)1n(1n'1n)⁻¹1n' + (ξ1 − ξ2)Z(Z'Z)⁻¹Z' + ξ2In

This implies that the covariance between observations in the same block is inversely proportional to the block size (since Z'Z = diag(ri)). It is more usual to assume that the covariances between observations in the same block are all the same, irrespective of block size. For unbalanced or complex mixed models there is in general no decomposition into orthogonal strata, and hence it is not immediately obvious how to set up sums of squares of residuals for the more general mixed model. Patterson and Thompson (1971) suggest maximising the likelihood of error contrasts, ie contrasts with zero expectation and non-zero variance, to estimate variance components. We have already indicated, in the previous examples for the one-way, randomised complete block and split plot designs, how the so-called Residual Maximum Likelihood (REML) estimates correspond to the ANOVA estimates. In chapter 5 we will present a full account of REML estimation for a wide class of mixed models, which includes the models in this chapter as special cases.
3.6 Summary

In this chapter, results for designed experiments have been presented to

• demonstrate the use of orthogonal projections to create strata in balanced designs
• derive the analysis of variance table and thence present an approach to obtaining ANOVA estimates of variance components
• show the equivalence of the ANOVA estimates and the so-called REML estimates of variance components in designs with orthogonal block and treatment structures
• indicate how efficient estimation of variance components (and treatment effects) can be achieved for generally balanced designs
• suggest that the ANOVA and iterated ANOVA approaches for estimation of variance components cannot readily be extended to more complex data-sets where the design or study is non-orthogonal or unbalanced
CHAPTER 4
The Linear Mixed Model
To this point we have considered special cases of the linear mixed model. These special cases reduce to ordinary linear models via the transformation to strata. The transformation to strata is only available in situations where we have orthogonal block and treatment structures; in particular, the approach fails for unbalanced data and more general covariance structures. In this chapter the general formulation of the linear mixed model is presented, and we thereby extend the class of models discussed in previous chapters to those allowing completely general covariance structures for both the random effects and the residual random errors.
4.1 The Model
If y (n × 1) denotes the vector of observations, the linear mixed model can be written as

y = Xτ + Zu + e   (4.1.1)

where τ (p × 1) is the vector of fixed effects, X (n × p) is the design matrix (parameterised to be of full rank) that associates observations with the appropriate combination of fixed effects, u (b × 1) is the vector of random effects, Z (n × b) is the design matrix which associates observations with the appropriate combination of random effects, and e (n × 1) is the vector of residual errors. The model (4.1.1) is called a linear mixed model or linear mixed-effects model. It is assumed that

(u', e')' ∼ N( 0, σH² diag(G, R) )   (4.1.2)

The parameter σH² is a scale parameter that plays an important role, to be discussed below. The variance models given by the matrices G and R are called G-structures and R-structures respectively. Under these assumptions we have

y | u ∼ N(Xτ + Zu, σH²R)   (4.1.3)
u ∼ N(0, σH²G)   (4.1.4)

so that

y ∼ N(Xτ, σH²(R + ZGZ'))   (4.1.5)

We write

V = σH²(R + ZGZ') = σH²H   (4.1.6)

so that

y ∼ N(Xτ, V)   (4.1.7)

Equation (4.1.6) explains the notation σH²: this parameter multiplies the matrix H. Typically G and R are functions of parameters that need to be estimated; these parameters were variances or variance ratios in the previous chapters. A general and consistent notation, to be used throughout the book, is

G = G(γ)   (4.1.8)

and

R = σ²Σ   (4.1.9)
Σ = Σ(φ)   (4.1.10)

The vectors γ and φ are parameter vectors associated with the random effects (u) and residuals (e) respectively; their precise meaning is discussed below.
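To fix ideas, the following sketch simulates a small data set from (4.1.1) with σH² = 1 and computes the generalised least squares estimate of τ for known V; estimation when the variance parameters are unknown is the subject of chapter 5. All quantities here are simulated, not taken from the text:

    import numpy as np

    rng = np.random.default_rng(0)
    n, b = 40, 10
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # fixed effects design
    Z = np.kron(np.eye(b), np.ones((n // b, 1)))               # random effects design
    G, R = 0.5 * np.eye(b), np.eye(n)                          # G- and R-structures

    V = R + Z @ G @ Z.T                  # V = H since sigma_H^2 = 1
    u = rng.multivariate_normal(np.zeros(b), G)
    y = X @ np.array([10.0, -0.5]) + Z @ u + rng.standard_normal(n)

    # GLS: tau_hat = (X' V^-1 X)^-1 X' V^-1 y
    Vinv = np.linalg.inv(V)
    tau_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)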
4.2 Variance structures for the errors: R-structures

In most cases the vector of residuals represents the errors from a single experiment or a single set of data. In chapter 3, R was a scaled identity matrix, that is R = σ²I, so that the errors were assumed independent and identically distributed. In some situations, for example in the analysis of multi-clinic trials, the analysis of animal breeding data across populations (Foulley and Quass, 1995) or the analysis of multi-environment variety trials (Smith et al., 2001a), the vector e will be a series of vectors indexed by a factor or factors. The sub-vectors relate to sections of the data, which in the examples above may be a clinic, a population or a trial. Thus in general we write e = (e1', e2', ..., es')', so that ej represents the vector of errors of the jth section of the data. The variance matrix for each section may differ, but we assume that the errors from different sections are independent (if they are not, we can coalesce the dependent components into a single section and hence maintain the independence structure). In matrix terms this gives

R = ⊕_{j=1}^{s} Rj = diag(R1, R2, ..., Rs)

where ⊕ is the direct sum operator. An example of such a structure is presented by Cullis et al. (1998) in the context of spatial analysis of multi-environment trials. In this case the jth section has variance matrix given by

Rj = Rj(φj) = σj²Σj(ρj) + ψjInj

Each section represents a trial. The variance parameters allow for a different variance for each trial (σj²), and hence heterogeneity, a different correlation structure for each trial (through Σj and ρj) and a different measurement error term (ψj).
4.3 Variance structures for the random effects: G-structures

The b × 1 vector of random effects is often composed of q sub-vectors, u = (u1', u2', ..., uq')', where the sub-vector ui is of length bi. These sub-vectors are assumed independent and normally distributed with variance matrices σH²Gi. Thus, as for R, we have

G = ⊕_{i=1}^{q} Gi = diag(G1, G2, ..., Gq)

There is a corresponding partition in Z, namely Z = [Z1 Z2 ... Zq].
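In code, a direct sum is simply a block-diagonal matrix. A minimal sketch (scipy assumed; the component values are illustrative):

    import numpy as np
    from scipy.linalg import block_diag

    G1 = 0.5 * np.eye(3)             # an iid variance component term
    G2 = np.array([[1.0, 0.3],
                   [0.3, 2.0]])      # an unstructured 2 x 2 term
    G = block_diag(G1, G2)           # G = G1 (+) G2, block diagonal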
4.4 Separability

Complex variance structures arise in many applications, including the analysis of longitudinal data, multivariate analysis and spatial analysis. In some cases the component matrices Rj (or R itself, if there is only one section), or Gi (or G itself, if there is only one G-structure), are related to the underlying structure in the data. To illustrate this we begin with an example of balanced multivariate data. Suppose we measure p traits or variables on each of nb units (nb > p). To put this in context, suppose the units represent n animals in each of b cattle breeds. If yj, j = 1, ..., p, represents the data vector of the jth trait or variable, the model considered here is given by
yj = Dτj + Buj + ej   (4.4.11)

where D is the fixed effects design matrix, τj is the vector of fixed effects for the jth trait, B is the random effects design matrix, uj is the vector of random breed effects for the jth trait and ej is the vector of residuals. The design matrices are the same for each trait. In this setting, the components of each ej are assumed independent.
We consider the ith animal, i = 1, 2, ..., nb. If y(i)' denotes the row vector of observations on the p traits for this animal, we can write the model

y(i)' = di'[τ1 τ2 ... τp] + bi'[u1 u2 ... up] + [ei1 ei2 ... eip]
      = di'T + bi'U + e(i)'   (4.4.12)

where di' and bi' are the ith rows of the matrices D and B respectively, and T = [τ1, τ2, ..., τp] and U = [u1, u2, ..., up] are matrices of the fixed and random effects.
As (4.4.12) contains observations on the same unit, we assume the random error vector e(i) has components (the traits) that are correlated, with possibly heterogeneous variances. Thus

e(i) ∼ N(0, Σp)

where Σp is the covariance matrix of the p traits. Similarly, if u(i) is the vector of random effects for the p traits for the ith breed, we assume

u(i) ∼ N(0, Gp)

Both Σp and Gp are symmetric, positive definite p × p matrices, each with p(p + 1)/2 unique parameters. A matrix model for the complete data set which combines (4.4.11) and (4.4.12) is then given by

Y = DT + BU + E   (4.4.13)

where Y = [y1, y2, ..., yp] and E = [e1, e2, ..., ep] are both nb × p. If we define

y = vec(Y),   τ = vec(T),   u = vec(U)   and   e = vec(E)

where vec(·) forms a vector by stacking the columns of its matrix argument (see section ??), then (4.4.13) can be written equivalently as

y = Xτ + Zu + e

where X = Ip ⊗ D and Z = Ip ⊗ B. Under the assumptions given above, the variance structures for u and e are therefore given by
var(u) = G = Gp ⊗ Ib
var(e) = R = Σp ⊗ Inb
cov(u, e) = 0

where the parameters σH² and σ² of the general formulation (4.1.1) are both set equal to one. The variance models for u and e are called separable, because they can be represented by the kronecker product of two matrices. These separable structures arise quite naturally in this example and in essence correspond to underlying factors in the data structure. For the random effects the two factors are the trait and breed variables, while for the random errors the two factors are the trait and the observational unit (the animal within each breed, or simply units). This type of separable decomposition arises in other applications and in more complex situations. The concept of separability was introduced by Martin (1979) in the context of lattice processes. Martin (1979) showed that the correlation matrix of a linear-by-linear process observed on an r × c rectangular lattice can be written as the kronecker product of two correlation matrices which relate to the rows and columns of the lattice. To conclude this example, we present the symbolic representation of the model and of the structures in that representation. If the only fixed effect for each trait is an intercept or overall mean (that is, D = 1nb), then the symbolic model formula is given by
To conclude this example, we present the symbolic representation of the model and the structures in that representation. If the only fixed effect for each trait is an intercept or overall mean (that is, $D = 1_{nb}$), then the symbolic model formula is given by

y ~ trait + trait.breed + trait.units

where trait is a factor with $p$ levels which indexes the traits and breed is a factor with $b$ levels which codes the breed for each animal. The residual term is constructed as the interaction between the trait factor and units. The symbolic representation of the R-structure is given by US(trait) ⊗ ID(units), where the model acronym US refers to an unstructured variance matrix (a fully parameterised variance model with $p(p+1)/2$ parameters) relating to trait, and ID refers to an identity variance model relating to units. This notation will be extended and widely used throughout the book. Similarly, the G-structure is given by US(trait) ⊗ ID(breed).

Separability is a very useful assumption regarding the form of the variance matrices $R$ and $G$ (or the sub-matrices $R_j$ and $G_i$). Formally, if $\mathrm{var}(e) = \sigma^2_H R$, then the matrix $R$ (and the error process) is said to be separable with two components if
$R = R_1 \otimes R_2$ (4.4.14)

where $R_i$ ($r_i \times r_i$) is proportional to the variance matrix for the $i$th factor defining the data structure. The same definition applies to G-structures, and the definition extends in an obvious way to more than two components. The assumption of separability greatly reduces the computational load. Of particular use in fitting the linear mixed model are the following results (see section ??):

$R^{-1} = R_1^{-1} \otimes R_2^{-1}$
$|R| = |R_1|^{r_2}\,|R_2|^{r_1}$

and the eigenvalues of $R$ are the $r_1 r_2$ products of the $r_1$ eigenvalues of $R_1$ with the $r_2$ eigenvalues of $R_2$.
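These identities are easy to verify numerically. The following sketch, with arbitrary randomly generated positive definite components chosen purely for illustration, checks the inverse, determinant and eigenvalue results for a Kronecker product:

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(k):
    """Return an arbitrary symmetric positive definite k x k matrix."""
    A = rng.normal(size=(k, k))
    return A @ A.T + k * np.eye(k)

r1, r2 = 3, 4
R1, R2 = spd(r1), spd(r2)
R = np.kron(R1, R2)

# Inverse: the inverse of a Kronecker product is the Kronecker
# product of the inverses.
assert np.allclose(np.linalg.inv(R),
                   np.kron(np.linalg.inv(R1), np.linalg.inv(R2)))

# Determinant: |R| = |R1|^r2 |R2|^r1, checked on the log scale.
assert np.isclose(np.linalg.slogdet(R)[1],
                  r2 * np.linalg.slogdet(R1)[1]
                  + r1 * np.linalg.slogdet(R2)[1])

# Eigenvalues: the r1*r2 eigenvalues of R are the pairwise products
# of the eigenvalues of R1 and R2.
prods = np.outer(np.linalg.eigvalsh(R1), np.linalg.eigvalsh(R2)).ravel()
assert np.allclose(np.linalg.eigvalsh(R), np.sort(prods))
```

Computationally, the point is that $R_1$ and $R_2$ are inverted at a cost of order $r_1^3 + r_2^3$ rather than $(r_1 r_2)^3$ for $R$ itself.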
Separability allows a flexible framework for modelling variance structures in the linear mixed model. Many other examples will be considered in this book where the usual assumptions concerning the stochastic properties of the random effects lead naturally to a separable variance matrix.

4.5 Variance models

Three types of variance model are used for R and G-structures in this book: correlation models, homogeneous variance models and heterogeneous variance models. A complete list of the variance models used in this book is presented in appendix ??, which also contains a reference to the first use of each variance model in the book.

4.5.1 Correlation models

In correlation models all diagonal elements are identically equal to 1. If $C = \{c_{ij}\}$, $i, j = 1, \ldots, n$, denotes the $n \times n$ correlation matrix for a particular correlation model, then

$C = \{c_{ij}\}: \quad c_{ii} = 1\ \forall i; \quad c_{ij} = c_{ji},\ |c_{ij}| < 1,\ i \neq j.$

The simplest correlation model is the identity model, for which the off-diagonal elements are identically equal to zero, that is, $c_{ij} = 0$, $i \neq j$. Correlation models include those arising in time-series analysis, geostatistics and spatial statistics, as well as more general models such as the banded correlation model and the completely general correlation model with $p(p-1)/2$ parameters.
4.5.2 Homogeneous variance models

In homogeneous variance models the diagonal elements all have the same positive value, $\sigma^2$ say. If $V = \{v_{ij}\}$, $i, j = 1, \ldots, n$, is an $n \times n$ homogeneous variance matrix, then

$V = \{v_{ij}\}: \quad v_{ii} = \sigma^2\ \forall i; \quad v_{ij} = v_{ji},\ i \neq j.$

Note that if $V$ is the homogeneous variance model matrix corresponding to the correlation model matrix $C$, then

$V = \sigma^2 C$

and $V$ has just one more parameter than $C$. For example, the homogeneous variance model corresponding to the identity correlation model is the simple variance components model, which specifies $v_{ii} = \sigma^2\ \forall i$ with off-diagonal elements equal to zero. In most software this is the default variance model for terms classified as random in the linear mixed model.
4.5.3 Heterogeneous variance models

The third type is the heterogeneous variance model, for which the diagonal elements are positive but differ. If $V = \{v_{ij}\}$, $i, j = 1, \ldots, n$, is an $n \times n$ heterogeneous variance matrix, then

$V = \{v_{ij}\}: \quad v_{ii} = \sigma_i^2,\ i = 1, \ldots, n; \quad v_{ij} = v_{ji},\ i \neq j.$

If $V$ is the heterogeneous variance model matrix corresponding to the correlation model matrix $C$, then

$V = DCD$

where $D = \mathrm{diag}(\sigma_i)$ is $n \times n$. This model has an additional $n$ parameters compared to the base correlation model. For example, the heterogeneous variance model corresponding to the identity correlation model is the diagonal variance model, for which $v_{ii} = \sigma_i^2\ \forall i$ with zero off-diagonal elements. Examples include the diagonal, factor analytic, reduced rank and ante-dependence variance models; the most general is the unstructured model with $p(p+1)/2$ parameters.
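The three types of model can be related through a common correlation matrix $C$. The sketch below, using a hypothetical AR(1) correlation model with $n = 4$ and illustrative parameter values, builds the homogeneous model $V = \sigma^2 C$ and the heterogeneous model $V = DCD$:

```python
import numpy as np

n, rho = 4, 0.6
# Correlation model: AR(1), all diagonal elements equal to 1 (one parameter).
C = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Homogeneous variance model: one extra parameter sigma^2, V = sigma^2 C.
sigma2 = 2.5
V_hom = sigma2 * C

# Heterogeneous variance model: n extra parameters sigma_1, ..., sigma_n,
# V = D C D with D = diag(sigma_i), so v_ij = sigma_i sigma_j c_ij.
sig = np.array([1.0, 1.5, 2.0, 2.5])
Dmat = np.diag(sig)
V_het = Dmat @ C @ Dmat

print(np.diag(V_hom))   # constant diagonal sigma^2
print(np.diag(V_het))   # differing diagonal sigma_i^2
```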
4.6 Identifiability of variance models

Because of the generality we have attempted to maintain in constructing the variance models for the random effects in the linear mixed model, it is almost inevitable that even the most experienced user will encounter problems of identifiability of variance models. The cause of non-identifiability can be hard to diagnose. In principle the causes parallel those in the fixed effects model. A variance model may not be identifiable because it is over-parameterised; in the fixed effects model this is called intrinsic aliasing. Alternatively, there may be insufficient data to estimate the parameters of the chosen variance model; in the fixed effects model this is called extrinsic aliasing. There are some general principles which can be useful in avoiding over-parameterisation of variance models, and in the following we present some of these by way of example.
4.6.1 Variance components or variance ratios
In this section we assume $\Sigma = I_n$, so that the variance matrix for the errors is a scaled identity. This occurs in many applications, and some examples were considered in chapter 3. The variance structure is therefore given by

$\sigma^2_H H = \sigma^2_H(\sigma^2 I_n + ZGZ')$ (4.6.15)

This variance model is over-parameterised because the residual variance $\sigma^2$ cannot be estimated separately from $\sigma^2_H$. There are several ways of overcoming this. If $\sigma^2_H$ is set to one, then the variance matrix for $y$ is

$\sigma^2 I_n + ZGZ'$

A consequence of this parameterisation is that $G$ must now be a variance matrix. For example, for the one-way classification and the RCB design, $G = \sigma^2_u I_b$. For the split plot design, $G$ is the direct sum of two sub-matrices, one for blocks and one for whole-plots. That is,
$G = \oplus_{i=1}^{2} G_i = \begin{bmatrix} \sigma^2_b I_b & 0 \\ 0 & \sigma^2_w I_{bw} \end{bmatrix}$

Setting $\sigma^2_H = 1$ thus implies that the variance parameters are variance components. If instead we set $\sigma^2 = 1$, then $\sigma^2_H$ is an overall scale parameter and is equal to the residual variance. A consequence of this parameterisation is that the matrix $G$ cannot be a variance matrix. Again, for the one-way classification and the RCB design, $G = \gamma_u I_b$ where $\gamma_u = \sigma^2_u/\sigma^2_H$. Similarly, for the split plot design,

$G = \oplus_{i=1}^{2} G_i = \begin{bmatrix} \gamma_b I_b & 0 \\ 0 & \gamma_w I_{bw} \end{bmatrix}$

where $\gamma_b = \sigma^2_b/\sigma^2_H$ and $\gamma_w = \sigma^2_w/\sigma^2_H$. Thus the parameters in $G$ are variance ratios under this parameterisation. Lastly, it is clear that $\sigma^2_H$ and $\sigma^2$ cannot both be set to one, as the residual variance would then be fixed at one. To summarise, just as for fixed effects, the chosen parameterisation has implications for identifiability and for the interpretation of the parameters.
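The equivalence of the two parameterisations is easy to check numerically. The sketch below uses a hypothetical RCB-like structure ($b = 4$ blocks of $k = 3$ plots) with illustrative values of $\sigma^2$ and $\sigma^2_u$:

```python
import numpy as np

b, k = 4, 3                                   # blocks and plots per block
n = b * k
Z = np.kron(np.eye(b), np.ones((k, 1)))       # block design matrix

sigma2, sigma2_u = 2.0, 3.0                   # illustrative components
gamma_u = sigma2_u / sigma2                   # block variance ratio

# Parameterisation 1 (sigma^2_H = 1): G = sigma^2_u I_b holds components.
V1 = sigma2 * np.eye(n) + Z @ (sigma2_u * np.eye(b)) @ Z.T

# Parameterisation 2 (sigma^2 = 1): sigma^2_H is the residual variance
# and G = gamma_u I_b holds variance ratios.
V2 = sigma2 * (np.eye(n) + Z @ (gamma_u * np.eye(b)) @ Z.T)

assert np.allclose(V1, V2)                    # same model, different parameters
```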
4.6.2 Non-identity R-structure

When $\Sigma = \Sigma(\phi)$, the scale parameters $\sigma^2$ and $\sigma^2_H$ can be set to one either separately or jointly. As in the previous section, both cannot be estimated in the same model.

If $\sigma^2_H = 1$ and $\sigma^2 \neq 1$ then $\mathrm{var}(e) = \sigma^2\Sigma$, and thus $\Sigma$ must be a scaled variance matrix or a correlation matrix. For example, if a component matrix of $\Sigma$ is a diagonal matrix of variances, then one of its elements must be fixed to ensure identifiability. On the other hand, $G$ must be a variance matrix, since $\mathrm{var}(u) = G$.

If $\sigma^2 = 1$ and $\sigma^2_H \neq 1$ then $\mathrm{var}(e) = \sigma^2_H\Sigma$ and $\mathrm{var}(u) = \sigma^2_H G$. Thus both $\Sigma$ and $G$ must be scaled variance matrices or correlation matrices.

Lastly, if $\sigma^2 = 1$ and $\sigma^2_H = 1$ then $\mathrm{var}(e) = \Sigma$ and $\mathrm{var}(u) = G$. Thus both $\Sigma$ and $G$ must be variance matrices and their parameter vectors must include at least one scale parameter. The most common applications of this case are in the analysis of multivariate data or repeated measures data with heterogeneous variances.
4.7 Combining variance models

When either $R$ or $G$ is formed from the Kronecker product of several sub-matrices, some general rules must be obeyed to avoid over-parameterisation. In the following we consider models with two components for $G$ and $R$, and use $C_i$ and $V_i$, $i = 1, 2$, to denote arbitrary correlation and variance matrices.
1. If $\Sigma = C_1 \otimes C_2$ then $\Sigma$ is a valid correlation model, and so a scale parameter (either $\sigma^2$ or $\sigma^2_H$) must be included in the variance model. If $G = C_1 \otimes C_2$ then a scale parameter should be included in the parameter vector $\gamma$, regardless of the status of $\sigma^2$ and $\sigma^2_H$.
2. If $\Sigma = C_1 \otimes V_2$ or $\Sigma = V_1 \otimes C_2$ then $\Sigma$ is a variance matrix, in which case neither $\sigma^2$ nor $\sigma^2_H$ can be estimated. If $G = C_1 \otimes V_2$ or $G = V_1 \otimes C_2$ then $G$ is a variance matrix. This usually coincides with $\sigma^2_H = 1$.
3. If $\Sigma$ or $G = V_1 \otimes V_2$ then the result is an over-parameterised variance matrix, which necessitates fixing one of the variance parameters in $V_1$ or $V_2$ (see the sketch below).
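The third rule reflects the fact that a common scale can be moved freely between the two components of a Kronecker product, as the following small sketch (with arbitrary illustrative matrices) demonstrates:

```python
import numpy as np

V1 = np.array([[2.0, 0.5],
               [0.5, 1.0]])
V2 = np.array([[3.0, 1.0],
               [1.0, 2.0]])

a = 1.7   # any positive constant
# Rescaling one component while inversely rescaling the other leaves the
# Kronecker product unchanged, so V1 (x) V2 is over-parameterised.
assert np.allclose(np.kron(V1, V2), np.kron(a * V1, V2 / a))
```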
4.8 Summary

In this chapter we have introduced the general form of the linear mixed model and described the range of models which are either used in this book or available in the software packages which we use to undertake the analyses. The main concepts that have been introduced in the context of the linear mixed model are:

• R and G-structures, their general definition and structure
• the assumption of separability for the variance models used in the R and G-structures
• the combination of variance models both within and between R and G-structures, and how the form of these models (i.e. as variance or correlation models) relates to the presence of the overall scaling parameter $\sigma^2_H$.

As a useful summary of the issues discussed in sections 4.6 and 4.7, table 4.1 lists the possible variance models for $y$ which can be obtained by altering the type of scale parameters in the variance model. As we have seen, the two scale parameters $\sigma^2$ and $\sigma^2_H$ can each be either fixed (usually to one) or free, meaning that it must be estimated from the data. Generally, the overall scale parameter $\sigma^2_H$ controls whether we estimate variance components or variance component ratios. There may be some computational savings in the iteration process under this parameterisation since, given estimates of $\gamma$ and $\phi$, the score equation for $\sigma^2_H$ has an algebraic solution (see chapter ??). In some cases, however, it does not make sense to include an overall scale parameter. We have discussed some examples where these types of variance models may be used.
Table 4.1 Summary of the variance models

$\Sigma = I$
  $\sigma^2_H$ fixed, $\sigma^2$ fixed: not admissible, since there is no residual variance
  $\sigma^2_H$ fixed, $\sigma^2$ free: $\mathrm{var}(e) = \sigma^2 I$; $\gamma$ are variance components
  $\sigma^2_H$ free, $\sigma^2$ fixed: $\mathrm{var}(e) = \sigma^2_H I$; $\gamma$ are variance component ratios
  $\sigma^2_H$ free, $\sigma^2$ free: $\sigma^2_H$ and $\sigma^2$ are not identifiable

$\Sigma = \Sigma(\phi)$
  $\sigma^2_H$ fixed, $\sigma^2$ fixed: $\mathrm{var}(e) = \Sigma$, so $\phi$ is a vector of components (e.g. parameters of unstructured or antedependence models); $\gamma$ are variance components
  $\sigma^2_H$ fixed, $\sigma^2$ free: $\mathrm{var}(e) = \sigma^2\Sigma$, so $\phi$ are ratios relative to $\sigma^2$; $\gamma$ are variance components
  $\sigma^2_H$ free, $\sigma^2$ fixed: $\mathrm{var}(e) = \sigma^2_H\Sigma$, so $\phi$ are ratios relative to $\sigma^2_H$; $\gamma$ are variance component ratios
  $\sigma^2_H$ free, $\sigma^2$ free: $\sigma^2_H$ and $\sigma^2$ are (generally) not identifiable

CHAPTER 5
Estimation
The linear mixed model is composed of fixed effects $\tau$, random effects $u$ and variance parameters $\sigma^2_H$ and $\kappa = (\gamma', \sigma^2, \phi')'$. Likelihood methods are used for estimation of the fixed effects and variance parameters. The prediction of the random effects is sometimes of interest, and this can be considered a post-estimation process, in a manner to be discussed below. The principles involved in estimation and prediction do not necessarily provide an efficient algorithm to achieve those aims. Thus this chapter begins with the principles of estimation and prediction and then presents an efficient algorithm. Residual maximum likelihood is presented in this chapter as a formal method of estimation of variance parameters. In the process, estimation of fixed effects is also achieved. This leaves the problem of prediction, which can be approached in a number of ways. The computational strategy for efficient estimation is also presented in this chapter. The approach has important consequences for prediction using the linear mixed model, and these extensions will be discussed in chapter 7.
5.1 Estimation of fixed effects and variance parameters

In the linear mixed model the fixed effects and the variance parameters are estimated together using likelihood methods. The variance parameters are estimated by residual maximum likelihood (section 5.2), and estimates of the fixed effects, together with predictions of the random effects, are obtained in the process (section 5.3).
5.2 Estimation of variance parameters

As we saw in chapter 3, when the variance parameters of the linear mixed model are simple components, that is, $G_i = \gamma_i I_{q_i}\ \forall i$, the block structure is orthogonal, the treatment structure is orthogonal and the variance matrix for the errors is $\sigma^2 I_n$, we can estimate the stratum variances (and hence the variance components) by equating residual mean squares from an ANOVA table with their expectations. This method is attributed to R.A. Fisher (Searle et al., 1992) and has been widely used for many years.

There are many applications, however, where the data are unbalanced and/or we wish to model the (co)variation in the data using more complex variance structures. Many authors have suggested extensions of the ANOVA methods described in chapter 3 for variance component estimation with unbalanced data. Searle et al. (1992) give an exhaustive account of three such approaches, originally proposed by Henderson (1953), which have become known as Henderson's methods I, II and III. Method I uses quadratic forms analogous to the sums of squares of generally balanced designs with orthogonal treatment structure; Method II is an adaptation of Method I which takes account of fixed effects in the model; Method III uses sums of squares from fitting the full mixed model (and sub-models thereof) as though all terms were fixed effects.

These techniques have now been superseded by maximum likelihood (ML) and, more recently, residual maximum likelihood (REML) estimation. There are several reasons for this. Firstly, the original methods were proposed before the advent of high-speed computers, when it was important to have an approach that was not computationally intensive. This is no longer an issue, given the proliferation of efficient mixed models software and the high-capacity computing power available to most researchers. The other attraction of REML (and ML) is that it provides a framework for variance parameter estimation in a much wider class of variance models than simple variance components. The paper by Patterson and Thompson (1971) is the original reference for REML. REML takes into account the degrees of freedom associated with the estimation of fixed effects, so that REML estimates of variance parameters are less biased than ML estimates. As we have indicated, REML estimates coincide with ANOVA-based estimates for orthogonal block and treatment structures. We will only consider REML estimation in this book; other texts such as Searle et al. (1992) and Verbeke and Molenberghs (2000) cover ML estimation.
5.2.1 The residual log-likelihood function

Recall that if $y$ ($n \times 1$) denotes the vector of observations, the linear mixed model can be written as

$y = X\tau + Zu + e$ (5.2.1)

where $\tau$ ($p \times 1$) is the vector of fixed effects, $X$ ($n \times p$) is the design matrix (parameterised to be of full rank) that associates observations with the appropriate combination of fixed effects, $u$ ($b \times 1$) is the vector of random effects, $Z$ ($n \times b$) is the design matrix that associates observations with the appropriate combination of random effects, and $e$ ($n \times 1$) is the vector of residual errors. We assume

$\begin{bmatrix} u \\ e \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \sigma^2_H \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix} \right)$ (5.2.2)

where $G = G(\gamma)$ and $R = \sigma^2\Sigma$, $\Sigma = \Sigma(\phi)$. The vectors $\gamma$ and $\phi$ are vectors of variance parameters associated with the random effects and residuals respectively. The distribution of the data is thus Gaussian with mean $X\tau$ and variance matrix $V = \sigma^2_H H$, where $H = R + ZGZ'$.

Result 5.1 The residual log-likelihood for the model in (5.2.1) is given by

$\ell_R = \ell(\sigma^2_H, \kappa; y_2) = -\frac{1}{2}\left\{ (n-p)\log\sigma^2_H + \log|H| + \log|X'H^{-1}X| + y'Py/\sigma^2_H \right\}$ (5.2.3)

where $y_2 = L_2'y$, $L_2$ is an $n \times (n-p)$ matrix of full column rank chosen such that $L_2'X = 0$, and $P = H^{-1} - H^{-1}X(X'H^{-1}X)^{-1}X'H^{-1}$.
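Before turning to the proof, it may help to see (5.2.3) as a computation. The following is a naive Python sketch that evaluates the residual log-likelihood directly from dense matrices; the function name and arguments are ours, and serious software would exploit sparsity and the separability results of chapter 4 rather than forming $H^{-1}$ explicitly.

```python
import numpy as np

def resid_loglik(y, X, Z, G, Sigma, sigma2, sigma2_H):
    """Residual log-likelihood (5.2.3), up to an additive constant.

    Here H = sigma2 * Sigma + Z G Z' and var(y) = sigma2_H * H.
    """
    n, p = X.shape
    H = sigma2 * Sigma + Z @ G @ Z.T
    Hinv = np.linalg.inv(H)                          # dense; fine for small n
    XtHiX = X.T @ Hinv @ X
    P = Hinv - Hinv @ X @ np.linalg.inv(XtHiX) @ X.T @ Hinv
    return -0.5 * ((n - p) * np.log(sigma2_H)
                   + np.linalg.slogdet(H)[1]
                   + np.linalg.slogdet(XtHiX)[1]
                   + y @ P @ y / sigma2_H)
```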
Proof: Verbyla (1990) presented an illuminating derivation of the Patterson and Thompson (1971) residual likelihood. He partitions the full likelihood for the mixed model in (5.2.1) into two independent parts: one relates to the treatment (fixed effect) contrasts $X\tau$ (there are $p$ such effects) and the other to the residual contrasts $Zu + e$, that is, contrasts whose expectation is zero (there are $n - p$ independent error contrasts). Maximisation of the former provides estimates of the fixed effects, whereas maximisation of the residual likelihood provides estimates of the variance parameters and the random effects.

Verbyla (1990) considers a non-singular matrix $L = [L_1\ L_2]$, where $L_1$ ($n \times p$) and $L_2$ ($n \times (n-p)$) are matrices chosen to satisfy $L_1'X = I_p$ and $L_2'X = 0$. The distribution of the transformed data $L'y = [y_1'\ y_2']'$, say, is given by

$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \sim N\left( \begin{bmatrix} \tau \\ 0 \end{bmatrix}, \sigma^2_H \begin{bmatrix} L_1'HL_1 & L_1'HL_2 \\ L_2'HL_1 & L_2'HL_2 \end{bmatrix} \right)$ (5.2.4)
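One convenient (though not unique) choice of this transformation, used in the sketches here, is $L_1 = X(X'X)^{-1}$ together with an orthonormal basis $L_2$ for the null space of $X'$:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
n, p = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

L1 = X @ np.linalg.inv(X.T @ X)   # satisfies L1' X = I_p
L2 = null_space(X.T)              # n x (n - p), columns orthogonal to X

assert np.allclose(L1.T @ X, np.eye(p))
assert np.allclose(L2.T @ X, 0)   # L2' y are the n - p error contrasts
```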
The likelihood of $L'y$ can be expressed as the product of the conditional likelihood of $y_1$ given $y_2$ and the marginal likelihood of $y_2$. From (5.2.4) the marginal distribution of $y_2$ is

$y_2 \sim N(0, \sigma^2_H L_2'HL_2)$

and using result ?? the conditional distribution of $y_1$ given $y_2$ is normal, with mean
$E(y_1|y_2) = \tau + L_1'HL_2(L_2'HL_2)^{-1}y_2$

and variance
$\mathrm{var}(y_1|y_2) = \sigma^2_H\left[ L_1'HL_1 - L_1'HL_2(L_2'HL_2)^{-1}L_2'HL_1 \right]$
Using result ?? and the fact that $L_1'X = I_p$, this can be written as
$y_1|y_2 \sim N\left( \tau + y_2^*,\ \sigma^2_H (X'H^{-1}X)^{-1} \right)$
where $y_2^* = L_1'HL_2(L_2'HL_2)^{-1}y_2$. The associated log-likelihood functions (excluding constant terms) are given by
$\ell_R = \ell(\sigma^2_H, \kappa; y_2)$
$= -\frac{1}{2}\left\{ (n-p)\log\sigma^2_H + \log|L_2'HL_2| + y_2'(L_2'HL_2)^{-1}y_2/\sigma^2_H \right\}$
$= -\frac{1}{2}\left\{ (n-p)\log\sigma^2_H + \log|L_2'HL_2| + y'L_2(L_2'HL_2)^{-1}L_2'y/\sigma^2_H \right\}$ (5.2.5)

and
$\ell_1 = \ell(\tau, \sigma^2_H, \kappa; y_1|y_2)$
$= -\frac{1}{2}\left\{ p\log\sigma^2_H + \log|(X'H^{-1}X)^{-1}| + (y_1 - \tau - y_2^*)'(X'H^{-1}X)(y_1 - \tau - y_2^*)/\sigma^2_H \right\}$ (5.2.6)
Clearly the likelihood of $y_2$ contains no information on $\tau$, so $\tau$ must be estimated from the conditional distribution of $y_1$ given $y_2$. From (5.2.6) and the derivative results in (??), the MLE of $\tau$ is obtained as the solution to
$\frac{\partial \ell_1}{\partial \tau} = (X'H^{-1}X)\left( y_1 - \tau - L_1'HL_2(L_2'HL_2)^{-1}y_2 \right)/\sigma^2_H = 0$

This gives
$\hat\tau = y_1 - L_1'HL_2(L_2'HL_2)^{-1}y_2$
$= L_1'\left( I - HL_2(L_2'HL_2)^{-1}L_2' \right) y$
$= L_1'\left( H - HL_2(L_2'HL_2)^{-1}L_2'H \right) H^{-1} y$
$= (X'H^{-1}X)^{-1}X'H^{-1}y$
using result (??) and the fact that $L_1'X = I_p$.

The likelihood of $y_1$ given $y_2$ is a function of $\tau$, $\sigma^2_H$ and $\kappa$, but since $\tau$ and $y_1$ are both vectors of length $p$, once $\tau$ has been estimated there is no information left with which to estimate $\sigma^2_H$ and $\kappa$. The variance parameters $\sigma^2_H$ and $\kappa = (\gamma', \sigma^2, \phi')'$ are therefore estimated using the marginal likelihood of $y_2$, that is, the residual likelihood.

Since $\ell(\tau, \sigma^2_H, \kappa; L'y) = \ell(\sigma^2_H, \kappa; y_2) + \ell(\tau, \sigma^2_H, \kappa; y_1|y_2)$, the determinants can be similarly partitioned:
$\log|L'HL| = \log|L_2'HL_2| + \log|(X'H^{-1}X)^{-1}|$
$\Rightarrow \log|L_2'HL_2| = \log|L'L| + \log|H| + \log|X'H^{-1}X|$

Now $|L'L|$ does not involve $\sigma^2_H$ and $\kappa$, and so the log-likelihood in (5.2.5) can be written (ignoring constants) as
$\ell_R = -\frac{1}{2}\left\{ (n-p)\log\sigma^2_H + \log|H| + \log|X'H^{-1}X| + y'L_2(L_2'HL_2)^{-1}L_2'y/\sigma^2_H \right\}$
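As a numerical check of the derivation, the sketch below (with an arbitrary positive definite $H$ and simulated data, purely for illustration) verifies that $P = L_2(L_2'HL_2)^{-1}L_2'$, that the determinant partition above holds, and that $\hat\tau$ reduces to the generalised least squares estimator:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
A = rng.normal(size=(n, n)); H = A @ A.T + n * np.eye(n)  # arbitrary p.d. H
y = rng.normal(size=n)

L1 = X @ np.linalg.inv(X.T @ X)
L2 = null_space(X.T)
L = np.column_stack([L1, L2])

Hinv = np.linalg.inv(H)
XtHiX = X.T @ Hinv @ X
P = Hinv - Hinv @ X @ np.linalg.inv(XtHiX) @ X.T @ Hinv

# P = L2 (L2' H L2)^{-1} L2', so the quadratic forms in (5.2.5)
# and (5.2.3) coincide.
assert np.allclose(P, L2 @ np.linalg.inv(L2.T @ H @ L2) @ L2.T)

# log|L2' H L2| = log|L'L| + log|H| + log|X' H^{-1} X|
lhs = np.linalg.slogdet(L2.T @ H @ L2)[1]
rhs = (np.linalg.slogdet(L.T @ L)[1] + np.linalg.slogdet(H)[1]
       + np.linalg.slogdet(XtHiX)[1])
assert np.isclose(lhs, rhs)

# The estimator of tau from the conditional likelihood is the GLS estimator.
tau1 = L1.T @ y - L1.T @ H @ L2 @ np.linalg.inv(L2.T @ H @ L2) @ L2.T @ y
tau2 = np.linalg.inv(XtHiX) @ X.T @ Hinv @ y
assert np.allclose(tau1, tau2)
```

Since $P = L_2(L_2'HL_2)^{-1}L_2'$, the quadratic form $y'L_2(L_2'HL_2)^{-1}L_2'y$ equals $y'Py$, and the expression above is exactly (5.2.3).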