National Institute for Applied Statistics Research Australia

University of Wollongong, Australia

Working Paper

15-18

Mixed Models for Data Analysts

Brian Cullis, Alison Smith, Ari Verbyla, Robin Thompson, and Sue Welham

Copyright © 2021 by the National Institute for Applied Statistics Research Australia, UOW. Work in progress; no part of this paper may be reproduced without permission from the Institute.

National Institute for Applied Statistics Research Australia, University of Wollongong, Wollongong NSW 2522, Australia. Phone +61 2 4221 5435, Fax +61 2 4221 4998. Email: [email protected]

Mixed Models for Data Analysts - DRAFT 2018

Brian Cullis, Alison Smith National Institute for Applied Statistics Research Australia University of Wollongong

Ari Verbyla, Robin Thompson, Sue Welham
CSIRO

Contents

1 Introduction
  1.1 Applications of linear mixed models
  1.2 Model definition
  1.3 Setting the scene

2 An overview of linear models
  2.1 Example: Glasshouse experiment on the growth of diseased and healthy plants
  2.2 Model construction and assumptions
  2.3 Estimation in the linear model
  2.4 Non-full rank Design Matrix
  2.5 A glimpse at REML
  2.6 Factors and model specification
  2.7 Tests of hypotheses
  2.8 Analysis of plant growth example
  2.9 Summary

3 Analysis of designed experiments
  3.1 One-way classification
  3.2 Randomised complete blocks
  3.3 Split plot design
  3.4 Balanced incomplete blocks
  3.5 In search of efficient estimation for variance components
  3.6 Summary

4 The Linear Mixed Model
  4.1 The Model
  4.2 Variance structures for the errors: R-structures
  4.3 Variance structures for the random effects: G-structures
  4.4 Separability
  4.5 Variance models
  4.6 Identifiability of variance models
  4.7 Combining variance models
  4.8 Summary

5 Estimation
  5.1 Estimation of fixed effects and variance parameters
  5.2 Estimation of variance parameters
  5.3 Estimation of fixed and random effects
  5.4 An approach to prediction in linear mixed models - REML construct
  5.5 Summary

6 Inference
  6.1 General hypothesis tests for variance models
  6.2 Hypothesis testing in mixed models: fixed and random effects
  6.3 Summary

7 Prediction from linear mixed models
  7.1 Introduction
  7.2 The Prediction Model
  7.3 Computing Strategy
  7.4 An example of the prediction model
  7.5 Prediction in models not of full rank
  7.6 Issues of averaging
  7.7 Prediction of new observations

8 From ANOVA to variance components
  8.1 Introduction
  8.2 Navel Orange Trial
  8.3 Sensory experiment on frozen peas

9 Mixed models for Geostatistics
  9.1 Introduction
  9.2 Motivating Examples
  9.3 Geostatistical mixed model
  9.4 Covariance Models for Gaussian random fields
  9.5 Prediction
  9.6 Estimation
  9.7 Model building and diagnostics
  9.8 Analysis of examples
  9.9 Simulation Study

10 Population and Quantitative Genetics
  10.1 Introduction
  10.2 Mendel’s Laws
  10.3 Population genetics
  10.4 Quantitative genetics
  10.5 Theory
  10.6 Discussion

11 Mixed models for plant breeding
  11.1 Introduction
  11.2 Spatial analysis of field trials
  11.3 Analysis of multi-phase trials for quality traits

12 The analysis of quantitative trait loci
  12.1 Introduction
  12.2 Example
  12.3 Overview of Molecular Genetics
  12.4 Reproduction
  12.5 Genetic information
  12.6 Linkage analysis
  12.7 QTL analysis
  12.8 Interval mapping: The Regression Approach
  12.9 Whole genome interval mapping
  12.10 Conclusions

13 Mixed models for penalized models
  13.1 Introduction
  13.2 Hard-edge constraints
  13.3 Soft-edge constraints
  13.4 Penalized Regression splines
  13.5 P-splines
  13.6 Smoothing splines
  13.7 L-splines
  13.8 Variance modelling
  13.9 Analysis of high-resolution mixograph data
  13.10 Analysis of another example: still to come
  13.11 LASSO
  13.12 Discussion

Bibliography

A Iterative Schemes
  A.1 Introduction
  A.2 Gradient methods: Average Information algorithm
  A.3 EM Algorithm
  A.4 PX-EM - an improved EM algorithm
  A.5 Computational Implementation
  A.6 Summary

CHAPTER 1

Introduction

The linear model is a basic tool in statistical modelling with widespread use and application in data analysis and applied statistics. The expected value of the response variable is given by a linear combination of explanatory variables, often termed the linear predictor (McCullagh and Nelder, 1994). The stochastic nature of the response is modelled using a single random component, elements of which are assumed to be independent with constant variance.

The linear mixed model is a natural extension of the linear model. In the simplest case the linear mixed model is a linear model which has been extended to allow for a correlated error term or additional random components. The wide range of variance and correlation models makes the linear mixed model a very flexible tool for data analysts. The linear mixed model provides the basis for the analysis of many data-sets commonly arising in the agricultural, biological, medical and environmental sciences, as well as other areas.

This introductory chapter provides an overview of the book through examples. The flavour of the linear mixed model and the diversity of possible applications are presented. The examples also allow the development of models that contain both variates and factors, and indeed to define these two types of variable. The symbolic representation of linear models is introduced as it will be used throughout the book. The philosophical issues that arise in the analysis of data are also raised.

The authors have a background in statistics arising from experiments. Thus the models proposed in this book begin with the design or sampling structure of the study. These models are randomization based. Hence it is important to understand the origins of mixed models, which belong in the analysis of designed experiments. There are often further considerations in model building that suggest plausible variance models that are outside a randomization based approach. The diversity of applications and the complexity and flexibility of variance models result in the subjective choice of a variance model, and hence analysis, for any particular data-set. A consequence of the increased flexibility and complexity of variance modelling is the danger of fitting inappropriate variance models to small or highly unbalanced data-sets. The approach taken in this book is to highlight the difference between analyses that use a variance modelling approach and a design-based approach. The former relies heavily on the appropriateness of the fitted variance structure for the validity of inferences concerning both fixed and random effects. The importance of data-driven diagnostics is stressed throughout the book although there remain unresolved issues in this area.

1.1 Applications of linear mixed models

Typical applications covered in this book include the analysis of balanced and unbalanced designed experiments, the analysis of balanced and unbal- anced longitudinal data, repeated measures analysis, the analysis of regular or irregular spatial data, the combined analysis of several experiments (ie meta-analysis) and the analysis of both univariate and multivariate animal and plant genetics data. This chapter will be finished last!!!

1.2 Model definition

1.2.1 Defining the linear model

A linear model relates the expected value of a response variable (outcome) to a linear combination of a set of explanatory variables, termed the linear predictor. The stochastic properties of the response are determined by the addition of a single random component to the linear predictor. Explanatory variables are assumed to be measured without error, and may be design-based (values planned as part of the experiment) or observational. For example, we may wish to examine the effect of imposed dietary regimes (explanatory variable) on the live-weight of pigs (response variable). Let y_i represent the response measured at the end of the experiment for the ith individual or experimental unit (in the example the pig), i = 1, ..., n. These are combined to form the n × 1 data vector y. The linear model for y is then written as

y = Xτ + e        (1.2.1)

where τ is a p × 1 vector of (fixed) effects corresponding to the explanatory variables (in this example diet effects) and X is the associated n × p design matrix (chosen to be of full rank, see chapter 2). The columns of X may be dummy variables, that is columns of zeros and ones, which assign categorical variables or factors to units, or columns of continuous variables or covariates. These aspects will be covered in more detail in chapter 2. The residuals or errors, e_i, are assumed to be identically and independently distributed with zero mean and common variance, and to follow a Normal distribution. This is written as

e ∼ N(0, σ²I_n)

The parameters to be estimated are σ² and τ. Generally we wish to give measures of uncertainty or precision, and sometimes conduct tests of hypotheses on τ.

1.2.2 Defining the linear mixed model

The linear mixed model also relates the expected value of a response variable (outcome) to the linear predictor. In this case, however, the stochastic properties of the response are determined by one or several random effects which are added to the linear predictor. For example, groups of pigs may have been housed together in pens and the live-weight of the pigs may therefore be affected by both the dietary treatment and the pen to which the pigs were assigned. We may regard the dietary effects as fixed effects and the pen effects as random effects. In doing this we assume that in the absence of dietary effects there is a (positive) covariance between the live-weight of pigs that are in the same pen.

Another important extension of the linear model is to allow the elements of the residual vector e or, more generally, any random effects to be correlated. For example, in the analysis of longitudinal data, repeated measurements on the same individual may be correlated. This feature may be accounted for by relaxing the assumption of independence between the residuals within each individual, although residuals on different individuals may remain independent. We need a flexible framework to define the types of correlation models we may wish to fit. Further, the variance may change as time progresses. This is often the case in longitudinal data-sets involving the measurement of weight (Kenward, 1987). Hence we may also need to relax the assumption of constant variance, choosing either to model the variance as a function of time, or to use another parametric model which adequately reflects the variance properties of the data.

If y denotes the n × 1 vector of observations, the linear mixed model can be written as

y = Xτ + Zu + e        (1.2.2)

where τ is the p × 1 vector of fixed effects, X is the n × p design matrix (of full rank) that associates observations with the appropriate combination of fixed effects, u is the b × 1 vector of random effects, Z is the n × b design matrix that associates observations with the appropriate combination of random effects, and e is the vector of residual errors. It is assumed that

\[
\begin{bmatrix} u \\ e \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \sigma^2_H \begin{bmatrix} G(\gamma) & 0 \\ 0 & R(\phi) \end{bmatrix} \right) \qquad (1.2.3)
\]

where the matrices G and R are functions of parameters γ and φ, respectively. The parameter σ²_H is a variance parameter which we will refer to as the overall scale parameter.
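To make the definition concrete, the following is a minimal sketch of fitting a model of the form (1.2.2) to the pig example using Python's statsmodels library. The file name and columns (weight, diet, pen) are hypothetical stand-ins, since no pig data-set is given in the text.

```python
# A minimal sketch of fitting (1.2.2) to the pig example with Python's
# statsmodels; the file name and columns (weight, diet, pen) are
# hypothetical stand-ins, as no pig data-set is given in the text.
import pandas as pd
import statsmodels.formula.api as smf

pigs = pd.read_csv("pigs.csv")  # assumed columns: weight, diet, pen

# Fixed effects: diet. Random effects: a pen intercept, which induces a
# positive covariance between the live-weights of pigs in the same pen.
model = smf.mixedlm("weight ~ diet", data=pigs, groups=pigs["pen"])
fit = model.fit(reml=True)      # REML estimation, developed in chapter 5
print(fit.summary())
```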

1.2.3 Factors and covariates

Covariates and factors are used to define explanatory variables. A covariate is defined to be an explanatory variable which may take any value (within a given valid range). For example, in a study of the growth of weaner pigs, the initial live-weight of each pig may have been measured with the view of adjusting for its effect. The initial weight, denoted by initwt, is a covariate in the linear mixed model. When a covariate is included then, in general, a single effect, associated with the assumed regression of the response on the covariate, is included in either τ (for a fixed covariate) or u (for a random covariate).

A factor is defined as an explanatory variable which takes one of a set of discrete values (or categories). The set of discrete values are called levels. For example, in the study of the growth of pigs, an individual pig may be described by a factor denoting its sex. This factor, denoted by sex, can take one of two values, either male or female. Factors may be purely qualitative, ordinal (eg, with levels such as low, medium and high), or relate to some underlying quantitative scale. In the preceding example the factor sex is a qualitative factor, since its levels cannot be ordered in any special way. However, to assess the effect of phosphorus (P) fertilizer on the yield of wheat, an agronomist may conduct an experiment with several treatments which are rates of fertilizer, 0, 5, 10, 20 and 40 kg/ha P, say. Here the treatment, P, is a factor with 5 levels. It is a quantitative factor and it is usually of interest to model the effect of P on yield using some functional form.

In general, when a factor is included as a term in the linear mixed model the result is to include in τ or u an effect for each level of the factor. For quantitative factors there are often examples (see chapter ??) in which the factor may be included more than once in the linear model, in differing roles, say as a covariate and as a factor. As a covariate we are fitting the linear regression of the response on the levels of the factor, while in the second case an effect is included for each level of the factor. This latter setting may be useful to examine the goodness of fit of the assumed linear regression of the response on the levels of the factor.

Terms in the linear mixed model may relate directly to the factors and covariates in the set of explanatory variables or may be formed as a combination of factors and/or covariates. For example, the notion of interaction usually involves the examination of the effect of one factor or covariate in the presence of another factor or covariate. The form for the model terms as combinations of factors and covariates can be conveniently and succinctly written down using a syntax originally developed by Wilkinson and Rogers (1973), which is now briefly described. This syntax is widely adopted in many statistical packages such as GENSTAT 5 (Payne et al., 2001) and S-PLUS (Mathsoft, 2000).

1.2.4 Symbolic representation of model formulae

We describe the terms of the linear model using the syntax of Wilkinson and Rogers (1973). Factors and covariates may be present in model terms as either main effects or as a component of an interaction. In the example on pigs, if the male and female pigs were randomly assigned to one of three diets, we may wish to examine both the main effects of sex and diet and allow for a possible interaction. We denote the interaction between factors sex and diet using the dot operator, ie as sex.diet. In this example, we wish to fit both the main effects of diet and sex as well as the interaction. Thus the factors are said to be crossed, and this is succinctly written down by the single term sex*diet, where the * operator is the crossing operator. This single term expands to

sex * diet = sex + diet + sex.diet

where sex.diet denotes the 6 interaction effects of sex and diet.

Alternatively, factors may be nested. For example, in testing new wheat varieties in South Australia the wheat growing area of the state has been divided into 6 regions and within each region comparative yield trials are grown at several locations. These locations are said to be nested within regions, ie each location occurs in only one region. Locations may be coded from one to the number of locations within each region rather than by their unique name, to reflect the aim to fit a nested model, in which the main effect of location is not explicitly fitted. This nested form implies we fit the main effect of region and the interaction of region and location. A succinct representation of these two terms results from use of the / operator, which represents the nesting operation. For example,

region/location = region + region.location

Of course, if the location factor was recoded using the individual location names then the main effect of the recoded factor is equivalent to the interaction with the nesting factor. More complex operators and model term functions will be introduced as necessary.

The terms of the linear mixed model are classified according to whether the effects associated with the term are fixed or random. Hence we can represent a linear mixed model by fixed and random model formulae. In the above example of pig weight, if we denote the response by y and, in addition, pigs were housed in groups or pens within an animal house, then a convenient representation of the linear mixed model for these data is given by

y ∼ sex * diet + pen + units

where pen is a factor denoting the allocation of pigs to pens. The reserved word units represents the residual term, which could be regarded as a random factor with one level for each of the experimental units. As discussed in the preface, terms which involve random effects are bolded in the symbolic model formula.
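The Wilkinson and Rogers syntax is directly executable in several modern environments. As an illustration (not part of the original text), the Python library patsy implements a close dialect; the tiny data frame below is invented purely to show how the crossing operator expands.

```python
# Sketch of the Wilkinson-Rogers operators using the Python library
# patsy, which implements a close dialect of this syntax (patsy writes
# the dot operator as ':'). The data frame is invented for illustration.
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({
    "sex":  ["male", "male", "female", "female", "male", "female"],
    "diet": ["d1", "d2", "d3", "d1", "d3", "d2"],
})

# The crossing operator expands as in the text:
# sex * diet = sex + diet + sex:diet
X = dmatrix("sex * diet", df)
print(X.design_info.term_names)
# ['Intercept', 'sex', 'diet', 'sex:diet']

# Nesting via '/': region/location = region + region:location
```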

1.3 Setting the scene

As we have seen, the effects (and therefore the terms associated with these effects) in a linear mixed model are classified as either fixed or random. A random term models the (co)variation in the data, in the simplest case by the introduction of a component of variance. A fixed term contributes to the expectation of the response. It is difficult to construct a set of rules which determine the classification of effects (and hence terms) as fixed or random. Traditionally it has been suggested that if the "levels of the factor come from a probability distribution" then the factor may be considered as random (see Searle et al., 1992, chap. 1). However, a term may also be taken as random even though the levels of the factor which relate to the term (or a component of the term) cannot be assumed to have arisen from a probability distribution. Our approach is pragmatic and we allow the classification to be less rigid. There are general rules which can assist with the decision, but the final choice must depend on the aim of the analysis and the role of the term in the linear mixed model. In the following we present a range of examples which are discussed in the book and describe how fixed and random effects may arise in the context of the analysis of such examples.

Components of variance: The aim of the analysis of these data is typically concerned with determining and quantifying the major sources of variation. The data are often collected as part of a designed or observational study with many factors classifying the units, and the analysis is to determine the magnitude of variation associated with each factor or combination of factors. Examples of these types of factors include trees within an orchard, students within a school, doctors within a hospital, cows within a herd within a farm, days within a sampling period, batches within a process and so on. The levels of the factor can usually be thought of as having arisen from a probability distribution, and as the aim is to examine (co)variation associated with the factor (and terms derived from it) it is natural to classify the factor (and terms derived from it) as random. The analysis of an example of this type of data is described in chapters 2 and 3.

Designed experiments: The analysis of data arising from designed experiments is one of the earliest examples of the application of linear mixed models. Traditionally, factors are generally defined as either blocking or treatment factors (see, for example, GENSTAT 5 (Payne et al., 2001)). Blocking factors are concerned with the description and definition of the stratification of the experimental units. In this sense their role in the linear mixed model is to describe (co)variation in the data, and their levels can be assumed to have arisen from a probability distribution. Hence blocking factors are usually classified as random. We note that in the analysis of orthogonal designs, such as randomised complete blocks, estimation of treatment effects is identical regardless of whether blocking factors are classified as fixed or random. Treatment factors are generally classified as fixed.

Quantitative treatments: In chapter ?? we present the analysis of experiments with quantitative treatments in which the aim is usually to describe or quantify the (continuous) relationship between the response and the treatment levels. There is often more than one observation for each value (ie level) of the factor. Traditionally, the analysis proceeds by modelling the relationship using low order polynomials or non-linear models. More recently, semi-parametric methods such as smoothing splines have been suggested and used (Green and Silverman, 1994a; Verbyla et al., 1999a). Verbyla et al. (1999a) decompose the factor into a linear component and "smooth" non-linear deviations, where the linear component is fitted as a fixed effect and the smooth component is fitted as a random term. When there are replicate covariate values, the opportunity exists to partition the variation not explained by the modelled response into lack of fit and pure error components. It is often convenient to model lack of fit by inclusion of the associated treatment factor as a random term in the linear mixed model.

Repeated measures: These data arise in many applications where multiple measurements are taken on experimental units. The aim of the analysis of repeated measures from designed experiments may be to examine the overall effects of treatments and how these effects vary with time. In this context subjects or experimental units will usually be measured at common time points. The classification of treatment and blocking factors will usually still be fixed and random respectively. The main effects of time and interactions with treatments will usually be fixed terms, interactions with blocking factors will usually be random terms, and correlated error terms may also be added to the model (which are of course random). The aim of the analysis of repeated measures from observational studies may be to describe or quantify the response profiles for experimental units, groups of units or factors arising in the study. Low order polynomials with coefficients assumed to be fixed effects have often been suggested for describing the response profiles (Diggle et al., 1994). Random coefficient regression has also been widely advocated and used for the analysis of such data (Laird and Ware, 1982). Most recently, smoothing splines and other semi-parametric modelling approaches have also been suggested (Brumback and Rice, 1998a; Verbyla et al., 1999a). Terms modelling treatment response profiles are usually fitted as fixed (except for spline components). Terms used to model (co)variation due to structure between experimental units (for example blocking) or variation of individuals about treatment profiles are usually fitted as random. These random terms are often fitted to quantify the population variability, but they also model the covariance between the set of repeated measurements within each individual. The analysis of these data is described in chapters ?? and ??.

Multivariate analysis: Multivariate linear mixed models are used where several measurements (traits) have been made on a set of experimental units, and a linear mixed model is required for each trait, taking account of the correlation between the traits (within each unit). In this case, the overall constant term which is usually fitted in a univariate linear mixed model is replaced by a factor which fits a constant term for each trait. Similarly, the overall residual variance is replaced by a residual variance for each trait, with correlation between traits. Complex variance models are fitted to all random terms in the linear mixed model to account for the correlation between traits.

Spatial analysis: The analysis of spatial data usually involves modelling the (co)variation of the data. Experimental units are measured at a set of points which can be described by a coordinate system in 1 or 2 (and occasionally 3) dimensions. The data may be regularly or irregularly spaced. As an example of regularly spaced data, the analysis of field experiments has received much attention since the seminal paper of Wilkinson et al. (1983). Many covariance models have been proposed for the errors of field trials, including time-series models (Cullis and Gleeson, 1991) as well as those originating in geostatistics (Cressie, 1991). The analysis of these types of data is considered in chapter ??.

Analysis of a series of experiments: This is increasing in popularity as a method of summarising and integrating the findings of studies or experiments with a common set (or subset) of treatments or aims. These occur in many application areas, especially medical (see eg, Yusuf, 1985) and agricultural (Smith and Cullis, 2001). As well as including factors which are measured within studies, factors may also be defined at the study level, such as geographic location, date of study, type or size of clinic. The aim of the analysis is usually to determine the treatment factors affecting the response, and the influence of study-level factors, both as main effects and how these may interact with treatment factors. Terms associated with study-level factors may be classified as fixed or random, depending on the context. Terms associated with treatment factors are usually classified as fixed, but there are examples where these are classified as random. These issues are discussed more fully in ??.

The above examples illustrate that often, consideration of whether the levels of a factor arise from a probability distribution is not sufficient to determine the classification of the factor as fixed or random. A term may also be classified as random rather than fixed to:

1. quantify (co)variation between different factor levels via a variance model
2. reflect the structure of the data
3. achieve efficient selection, based on prediction of future performance
4. allow inference for a broader set of conditions

Conversely, there are examples where a term may be classified as fixed even though its levels could have arisen from a probability distribution. We note that if a term in the linear mixed model is classified as random then any other term which shares a common set of factors is also classified as random. For example, if variety and site.variety are terms in the linear mixed model and variety is classified as random, then site.variety must be classified as random.

CHAPTER 2

An overview of linear models

In this chapter we review some basic ideas and results for the linear model. For a more thorough treatment of linear models the reader is referred to Searle (1971), for example. The ideas will be covered in the context of a small example.

2.1 Example: Glasshouse experiment on the growth of diseased and healthy plants

The data for this example are presented in the GENSTAT 5 manual (Payne, 1993). An experiment was designed to examine the difference between the growth of diseased (MAV) and healthy (HC) plants. The heights of plants were measured at 1, 3, 5, 7 and 10 weeks after treatment. There were seven plants per treatment and the plants were arranged in a completely randomised design. For the purpose of illustration, we consider the height of each plant 10 weeks after treatment as our response variable. The data are given in table 2.1.

Table 2.1 Height (cm) of plants 10 weeks after treatment

  Plant number            Treatment
  within treatment      HC        MAV
  1                    57.0       55.0
  2                   123.5       67.6
  3                    66.0       61.5
  4                   130.0       58.0
  5                   114.0      104.0
  6                   107.5       62.0
  7                   110.5       75.9

We are interested in estimating the effect of the treatments on plant height and in examining whether there is a significant difference in mean plant height for the two treatments.

2.2 Model construction and assumptions

To allow inferences to be conducted, we propose the model

y_ij = τ_i + e_ij        (2.2.1)

where y_ij is the height of the jth plant (j = 1, ..., 7) for the ith treatment (i = 1, 2), τ_i is the mean effect for the ith treatment, and the e_ij are random "errors" that reflect plant to plant variability that is not related to the treatment. We assume the random errors are such that e_ij ∼ N(0, σ²) and that they are independent for all i and j. Let n = 14 denote the total number of observations.

In symbolic form (2.2.1) can be written as

y ∼ treatment + units

where the variable treatment is a factor taking two levels, with value i for observation y_ij. If i = 1 the treatment is HC, while if i = 2 the treatment is MAV. The term units is a factor that has levels 1 to n and represents the random errors.

Let

\[
y = \begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{17} \\ y_{21} \\ \vdots \\ y_{27} \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{bmatrix}, \qquad
\tau = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix} \qquad (2.2.2)
\]

and let e be the vector of e_ij in the same order as y. The matrix X defines the treatment term, and has two columns (corresponding to the two levels of the factor), each row having a zero and a one, with the one indicating which treatment level is appropriate for each experimental unit (plant). Using the above vectors and matrices, the model (2.2.1) can be written succinctly as

y = Xτ + e        (2.2.3)

and this is the vector-matrix form of the linear model. Note that the assumptions regarding e_ij imply that e ∼ N(0, σ²I_n), so that the distribution of y is given by

y ∼ N(Xτ, σ²I_n)        (2.2.4)

The unknown parameters, τ and σ², must be estimated from the data.
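As a sketch (not part of the original text), the data vector y and design matrix X of (2.2.2) can be built from the Table 2.1 heights with numpy:

```python
# Sketch: the data vector y and design matrix X of (2.2.2) built with
# numpy from the Table 2.1 heights.
import numpy as np

hc  = np.array([57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5])
mav = np.array([55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9])
y = np.concatenate([hc, mav])               # n = 14, ordered as in (2.2.2)

# One dummy column per treatment level: rows 1-7 pick out tau_1,
# rows 8-14 pick out tau_2.
X = np.kron(np.eye(2), np.ones((7, 1)))
print(X.shape)                              # (14, 2)
```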

2.3 Estimation in the linear model

We consider the linear model in a general setting and return to the example throughout this chapter.

We now denote the individual observations by y_i, i = 1, 2, ..., n, and let x_i′ (where ′ denotes the transpose) be the ith row of X. For example, the first row of X given in (2.2.2) is x_1′ = [1 0]. Then from (2.2.4), the individual observations y_i are statistically independent and have distribution

y_i ∼ N(x_i′τ, σ²)        (2.3.5)

Equation (2.3.5) is a convenient form for the estimation of the unknown parameters τ and σ². We use a likelihood based approach for estimation of the parameters. The likelihood function for independent observations is defined as

\[
L(\tau, \sigma^2; y) = \prod_{i=1}^{n} f(y_i; \tau, \sigma^2)
\]

where f(y_i; τ, σ²) is the probability density function for y_i. In our case y_i follows a normal distribution specified by (2.3.5). Hence

\[
L(\tau, \sigma^2; y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(y_i - x_i'\tau)^2 \right\}
\]

and the log-likelihood function, denoted by ℓ, is given by

\[
\ell = \ell(\tau, \sigma^2; y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i'\tau)^2 \qquad (2.3.6)
\]

Standard maximum likelihood estimation consists of differentiating the log-likelihood with respect to τ and σ² to form the vector of derivatives, which is called the score vector. The estimates are found by equating the score vector to zero. For the linear model it is possible to solve the resulting equations directly, but in general an iterative procedure is required. Here the derivatives of ℓ with respect to τ and σ² are

\[
\frac{\partial \ell}{\partial \tau} = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i(y_i - x_i'\tau) \qquad (2.3.7)
\]

and

\[
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - x_i'\tau)^2 \qquad (2.3.8)
\]

respectively. Noting

\[
\sum_{i=1}^{n} x_i x_i' = X'X, \qquad \sum_{i=1}^{n} x_i y_i = X'y
\]

the score vector is given by

\[
U = U(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}(X'y - X'X\tau) \\[4pt] -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\tau)'(y - X\tau) \end{bmatrix} \qquad (2.3.9)
\]

Equating the score vector to zero, we see that

X′Xτ̂ = X′y        (2.3.10)

The equations in (2.3.10) are called the normal equations. If X is of full column rank, that is the columns are linearly independent, X′X is non-singular and

τ̂ = (X′X)⁻¹X′y        (2.3.11)

is the maximum likelihood estimate of τ (also the least squares estimate). The case when X is not of full column rank is discussed below. Under model (2.3.5),

τ̂ ∼ N(τ, σ²(X′X)⁻¹)        (2.3.12)

and τ̂ is an unbiased estimator of τ. By the Gauss-Markov theorem, linear functions a′τ̂ are also the minimum variance unbiased estimators of a′τ.

The maximum likelihood estimate of σ² is

\[
\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i'\hat{\tau})^2 \\
&= \frac{1}{n}(y - X\hat{\tau})'(y - X\hat{\tau}) \\
&= \frac{1}{n}(y'y - \hat{\tau}'X'y) \qquad (2.3.13) \\
&= \frac{1}{n}R \qquad (2.3.14)
\end{aligned}
\]

where R is the residual sum of squares. This estimate is known to be biased (E σ̂² = ((n − p)/n)σ²), because it does not take account of the p degrees of freedom used in the estimation of τ. We return to this shortly. Note also that (2.3.13) shows that the sum of squares due to the linear model (2.2.3) is given by

SSQ(τ̂) = τ̂′X′y = τ̂′X′Xτ̂ = (Xτ̂)′(Xτ̂)        (2.3.15)

The second derivatives of the log-likelihood and their expected values are important for estimation and also for inference. The negative of the matrix of second derivatives is called the observed information matrix, while the expected value of this matrix is called the expected or Fisher information matrix.
The observed information matrix is

\[
J = J(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}X'X & \frac{1}{\sigma^4}(X'y - X'X\tau) \\[4pt] \frac{1}{\sigma^4}(y'X - \tau'X'X) & -\frac{n}{2\sigma^4} + \frac{1}{\sigma^6}(y - X\tau)'(y - X\tau) \end{bmatrix}
\]

while the expected information matrix is equal to

\[
I = I(\tau, \sigma^2) = \begin{bmatrix} \frac{1}{\sigma^2}X'X & 0 \\ 0 & \frac{n}{2\sigma^4} \end{bmatrix} \qquad (2.3.16)
\]

The fitted values for each observation are given by

Xτ̂ = X(X′X)⁻¹X′y = P_X y

where the n × n matrix P_X is called the projection matrix for X. It is also called the "hat matrix". The properties of P_X are simple but very important, namely

P_X′ = P_X,  P_X² = P_X,  P_X X = X

which imply that P_X is an orthogonal projection matrix onto the plane or space defined by the columns of X. P_X is a real symmetric matrix, and hence we can diagonalize it (see appendix ??) so that

P_X = KΛK′

where Λ is a diagonal matrix whose elements are the eigenvalues of P_X, and K is the matrix of orthonormal eigenvectors (K′K = I_n). It is easy to show that P_X has p unit eigenvalues and n − p zero eigenvalues. Thus we can partition Λ into two diagonal matrices, Λ₁ = I_p and Λ₂ = 0_{n−p} (a square matrix of zeros of size n − p), and K into two orthogonal components K₁ and K₂ (K₁′K₂ = 0), where the columns of K₁ are those eigenvectors corresponding to the unit eigenvalues, and the columns of K₂ are those eigenvectors corresponding to the zero eigenvalues. Thus we have

\[
P_X = \begin{bmatrix} K_1 & K_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} K_1' \\ K_2' \end{bmatrix} = K_1 K_1' \qquad (2.3.17)
\]

As K₁′K₁ = I_p,

P_X = X(X′X)⁻¹X′ = K₁(K₁′K₁)⁻¹K₁′ = K₁K₁′

are equivalent representations of the orthogonal projection onto the space defined by the columns of X.

The estimate of e, the residual vector, is given by

ẽ = y − Xτ̂
  = y − P_X y
  = (I_n − P_X)y

and (I_n − P_X) is also an orthogonal projection matrix. This projection is orthogonal to the plane or space defined by X. The eigenvectors of this matrix are identical to those of P_X, that is K, and the diagonal matrix of eigenvalues of I_n − P_X is equal to I_n − Λ. Thus we have an equivalent result to (2.3.17), namely

I_n − P_X = K₂K₂′        (2.3.18)

Notice also that as (I_n − P_X)X = 0, K₂′X = 0. The two representations (2.3.17) and (2.3.18) provide the basis of transformations discussed below and in chapter 3.

Lastly, notice that the notation ẽ suggests that this estimate is of a different type than the estimate τ̂. In fact, we have estimated a random vector, something that is called prediction rather than estimation; ẽ is the best linear unbiased predictor (BLUP) of e. This will be discussed later.

2.4 Non-full rank Design Matrix

We briefly discuss the case where X is not of full column rank and hence where X′X is singular. In this case the solution of the normal equations (2.3.10) is not unique. This case arises naturally in chapter 3.

If X is not of full column rank, one solution is to find a full rank version. This involves finding matrices X* and A such that X = X*A and X* is of full column rank. This is the standard method used for over-specified models and is discussed in section 2.6. We consider an alternative approach. Write the normal equations as

Cτ̂ = b        (2.4.19)

so that C = X′X and b = X′y. Then a (non-unique) solution is given by

τ̂ = C⁻b        (2.4.20)

where C⁻ is a generalized inverse of C, that is a matrix satisfying

CC⁻C = C        (2.4.21)

To show (2.4.20), note that premultiplying (2.4.19) by CC⁻ we find that CC⁻b = b, and on substituting (2.4.20) into (2.4.19) we see that indeed the former is a solution of the latter. Now, in terms of the original matrices,

τ̂ = (X′X)⁻X′y        (2.4.22)

A particularly nice generalised inverse is the so-called Moore-Penrose generalised inverse. This is written as C⁺ and, in addition to satisfying (2.4.21), also satisfies

C⁺CC⁺ = C⁺

In this case

var(τ̂) = σ²(X′X)⁺  and  P_X = X(X′X)⁺X′

are results that are similar to the full-rank case. Lastly, note that in the full rank case the projection matrix P_X = X(X′X)⁻¹X′ is a Moore-Penrose inverse of itself. This can be useful in the analysis of variance for multi-stratum experiments, discussed in chapter 3.

2.5 A glimpse at REML

It was indicated above that the maximum likelihood estimate σ̂² given by (2.3.14) did not allow for estimation of τ. We can obtain an estimate that allows for the estimation of τ as follows.

If K is the matrix of eigenvectors of P_X, we transform y to K′y (a non-singular and hence one-to-one transformation). Noting that K₂′X = 0, the distribution of the transformed data is

\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} K_1'y \\ K_2'y \end{bmatrix} \sim N\!\left( \begin{bmatrix} K_1'X\tau \\ 0 \end{bmatrix},\ \sigma^2 \begin{bmatrix} I_p & 0 \\ 0 & I_{n-p} \end{bmatrix} \right) \qquad (2.5.23)
\]

The two components of the transformation, y₁ and y₂, are independent and thus (2.5.23) provides two linear models. The first linear model has a design matrix K₁′X that is square and non-singular, and hence transforms τ to a new parameter vector τ*, say, of the same length. The vector y₁ provides the only information on τ* under the transformation and hence is the basis of estimation of τ. The parameter vector matches the data vector y₁ in length, so that once τ has been estimated, y₁ cannot be used for estimation of σ² as there is no further information available. In fact, using (2.3.11) with the design matrix X replaced by K₁′X, the independence of y₁ and y₂ implies that we can estimate τ by

\[
\begin{aligned}
\hat{\tau} &= (X'K_1K_1'X)^{-1}X'K_1K_1'y \\
&= (X'P_XX)^{-1}X'P_Xy \\
&= (X'X)^{-1}X'y
\end{aligned}
\]

which actually reproduces (2.3.11), as in hindsight it should.

The second linear model has a known (zero) mean and depends only on σ². Because y₁ cannot be used to estimate σ², we use the marginal distribution of y₂ for estimation of σ². The log-likelihood of y₂ is given by

\[
\ell(\sigma^2; y_2) = -\frac{n-p}{2}\log(2\pi) - \frac{n-p}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}y_2'y_2
\]

and hence the estimate of σ² based on this marginal likelihood is given by (having differentiated the marginal log-likelihood with respect to σ² and equated to zero)

\[
\begin{aligned}
s^2 &= \frac{1}{n-p}y_2'y_2 \\
&= \frac{1}{n-p}y'K_2K_2'y \\
&= \frac{1}{n-p}y'(I_n - P_X)y \\
&= \frac{1}{n-p}R \qquad (2.5.24)
\end{aligned}
\]

rather than the form in (2.3.14). The estimator in (2.5.24) is unbiased. This is the standard estimate corrected for estimation of τ using a simple degrees of freedom adjustment.

The above approach for estimating σ², based on the marginal distribution of y₂, is an example of residual maximum likelihood (REML) estimation, discussed originally by Patterson and Thompson (1971). The basic idea is to partition the likelihood into components, which in general may not be independent, with one component (here y₁) for estimation of τ and the other component (y₂), whose distribution does not depend on τ, for estimation of σ². A more complex decomposition is presented in chapter 4 and a thorough derivation of REML is given in chapter 5.
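The contrast between the two variance estimates can be checked numerically. The following sketch (not part of the original text) computes the ML estimate R/n and the REML estimate R/(n − p) for the Table 2.1 plant data, together with the projection matrix properties used above.

```python
# Sketch: the ML estimate R/n of (2.3.14) versus the REML estimate
# R/(n - p) of (2.5.24) for the plant data, plus a check of the
# projection matrix properties of section 2.3.
import numpy as np

hc  = np.array([57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5])
mav = np.array([55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9])
y = np.concatenate([hc, mav])
X = np.kron(np.eye(2), np.ones((7, 1)))     # full column rank, p = 2
n, p = X.shape

PX = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix P_X
assert np.allclose(PX @ PX, PX)             # idempotent
assert np.allclose(PX @ X, X)               # P_X X = X
assert np.isclose(np.trace(PX), p)          # p unit eigenvalues

R = y @ (np.eye(n) - PX) @ y                # residual sum of squares
print(R / n)        # ML estimate of sigma^2 (biased)
print(R / (n - p))  # REML estimate s^2 (unbiased), approx 542.2 here
```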

2.6 Factors and model specification

All linear models can initially be expressed in terms of means. For example, in (2.2.1) we have specified the expected value for the ith treatment by the mean τ_i. In this case the design matrix X might be called the replication matrix, as it simply indicates which mean value an experimental unit takes on. In many situations the mean value depends on several factors, and the model is reparameterised to reflect the factors of interest. For example, comparative inferences concerning the levels of a factor may be of prime interest (as is the case in our example). This subsequently leads to modelling the mean in terms of the factors, and in particular to introducing a transformation from the mean to a new set of parameters. In the linear models situation the transformation is linear, so that the replication matrix is then multiplied by the matrix of this linear transformation to provide a new design matrix. In our simple example, the mean may be reparameterised as

τ_i = µ + τ_i*        (2.6.25)

The τ_i* in (2.6.25) represent the deviations from µ. This specification contains redundancies or intrinsic aliasing of effects, because we have 3 parameters µ, τ_1*, τ_2* to model the means for the 2 treatments, namely τ_1 and τ_2. To overcome this redundancy a constraint must be applied. This over-parameterisation occurs whenever factors are present in the model (as fixed effects), so by convention constraints are applied to each factor term. These constraints are then applied to any term in the model which has this factor as an elemental component (for example in an interaction). The type of constraint that is applied affects the interpretation of the term µ.

Corner-point constraint: This is the standard constraint for linear models in GENSTAT 5. We set τ_1* = 0, and in terms of the parameterisation in (2.2.1) we have

τ_1 = µ + τ_1* = µ
τ_2 = µ + τ_2*

Thus µ is the mean for treatment 1, and τ_2* is the deviation of the mean of treatment 2 from the mean of treatment 1. This is easily generalised to more than 2 treatments, in which case the τ_i*, i = 2, ..., p, are deviations of treatment i from treatment 1. In addition, the set of effects in interaction terms have all effects corresponding to the first level of all component factors set to zero.

In terms of the vector of means, we have

\[
\begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_2^* \end{bmatrix}
\]

so that if T is the matrix of the transformation and τ* is the vector of µ and τ_2*, we have

τ = Tτ*        (2.6.26)

and the linear model becomes

y = XTτ* + e = X*τ* + e

which is a linear model with a new design matrix X* and non-redundant parameter vector τ*.

Zero-sum constraint: We set Σ_{i=1}^{2} τ_i* = 0. This ensures (for our simple case) that the intercept, µ, is the overall mean. This constraint has also been widely used in statistical packages to fit the linear model. The possible advantage of this constraint is that the intercept represents the overall mean. In the case of balanced data, the estimate of the intercept is the mean of all the data. However, this is not the case for unbalanced data. For this constraint, note τ_1* = −τ_2*, and the matrix T of (2.6.26) is

\[
T = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \qquad (2.6.27)
\]

It is also useful for the development in chapter 3 to provide an alternative approach given by Nelder (1965b). Thus we write

τ_i = τ̄. + (τ_i − τ̄.)

where τ̄. = ½ 1_2′τ is the mean of the τ_i. Notice that the zero-sum constraint is automatically incorporated for the terms (τ_i − τ̄.). In vector form,

\[
\begin{aligned}
\tau &= \mathbf{1}_2\bar{\tau}. + (\tau - \mathbf{1}_2\bar{\tau}.) \\
&= \tfrac{1}{2}\mathbf{1}_2\mathbf{1}_2'\tau + (I_2 - \tfrac{1}{2}\mathbf{1}_2\mathbf{1}_2')\tau \\
&= \mathbf{1}_2(\mathbf{1}_2'\mathbf{1}_2)^{-1}\mathbf{1}_2'\tau + (I_2 - \mathbf{1}_2(\mathbf{1}_2'\mathbf{1}_2)^{-1}\mathbf{1}_2')\tau \\
&= T_1\tau + T_2\tau \qquad (2.6.28)
\end{aligned}
\]

Notice that T₁ and T₂ are orthogonal projection matrices, T₁ onto the vector 1₂ and T₂ onto the vector orthogonal to 1₂, namely [−1 1]′, and that these two vectors make up the matrix T in (2.6.27). In addition T₁ + T₂ = I₂, so that the decomposition of τ is into complete components. Thus

\[
T_1\tau = \mathbf{1}_2\bar{\tau}. \equiv \mathbf{1}_2\mu, \qquad T_2\tau = \tfrac{1}{2}\begin{bmatrix} -1 \\ 1 \end{bmatrix}(\tau_2 - \tau_1) = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\tau_2^*
\]

This type of decomposition into orthogonal parts occurs not only in the mean structure but also in the covariance structures to be discussed in chapter 3, and it is the relationship between these mean and covariance structures that is important in analysis of variance.

Conventions in this book: In this book we assume that the design matrix X has full column rank. This often necessitates the imposition of constraints on the vector τ of fixed effects. The method of estimation of the model and subsequent prediction of effects of interest is not affected by the type of constraint used, but of course the interpretation of the parameters or effects will depend on the constraints applied.
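To close this section, a small numerical sketch (not from the original text) of the constraint matrices just described; the values of µ and τ₂* are illustrative only.

```python
# Sketch: the constraint matrices of section 2.6 for two treatments,
# checked with numpy; mu and tau2* take illustrative values only.
import numpy as np

T_corner = np.array([[1.0, 0.0],       # tau = T @ [mu, tau2*]'
                     [1.0, 1.0]])
T_zero   = np.array([[1.0, -1.0],      # columns 1_2 and [-1, 1]' of (2.6.27)
                     [1.0,  1.0]])

# T1 and T2 of (2.6.28): orthogonal projections onto 1_2 and its
# orthogonal complement.
one2 = np.ones((2, 1))
T1 = one2 @ one2.T / 2.0
T2 = np.eye(2) - T1
assert np.allclose(T1 @ T1, T1) and np.allclose(T2 @ T2, T2)
assert np.allclose(T1 @ T2, np.zeros((2, 2)))
assert np.allclose(T1 + T2, np.eye(2))

mu, tau2_star = 10.0, 2.0                        # illustrative values
tau = T_corner @ np.array([mu, tau2_star])       # treatment means
print(tau)                                       # [10., 12.]
```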

2.7 Tests of hypotheses

Before leaving the introduction to linear models we consider tests of hypotheses. There are basically three approaches (usually based on large sample theory) to deriving tests. The three methods are the likelihood ratio test, the Wald test and the score test (?).

We consider the hypothesis H₀ : L′τ = l for given L (p × r) and l (r × 1). In the example, the hypothesis of interest is simply the equality of treatment means, ie τ₁ = τ₂, or more appropriately for our general formulation, that τ₁ − τ₂ = 0, so that L′ = [1 −1] and l = 0.

Suppose M is a p × (p − r) matrix such that the matrix T = [M L] is non-singular. We can choose M such that L′M = 0. The linear model (2.2.3) can be written as

y = X(T′)⁻¹T′τ + e = X*τ* + e        (2.7.29)

Now

\[
T'\tau = \begin{bmatrix} M'\tau \\ L'\tau \end{bmatrix} = \begin{bmatrix} \tau_0^* \\ \tau_1^* \end{bmatrix}
\]

and our null hypothesis becomes H₀ : τ₁* = l. This demonstrates that if we partition τ into two components τ₀ and τ₁, without loss of generality we can consider the test of H₀ : τ₁ = l (and hence avoid messy notation). We therefore consider tests of this hypothesis.

Note firstly that under the partition of τ, X can be partitioned to yield the linear model

y = X₀τ₀ + X₁τ₁ + e        (2.7.30)

The estimation of the partitioned τ can be carried out using partitioned matrices and the normal equations (2.3.10). The solutions can be written in a number of ways, but the useful form for the derivation of the likelihood ratio test is

\[
\begin{aligned}
\hat{\tau}_0 &= (X_0'X_0)^{-1}X_0'(y - X_1\hat{\tau}_1) \qquad (2.7.31) \\
\hat{\tau}_1 &= \left[ X_1'(I_n - P_{X_0})X_1 \right]^{-1} X_1'(I_n - P_{X_0})y \\
\hat{\sigma}^2 &= \frac{1}{n}(y - X_1\hat{\tau}_1)'(I_n - P_{X_0})(y - X_1\hat{\tau}_1) = \frac{1}{n}R
\end{aligned}
\]

where P_{X₀} is the orthogonal projection matrix onto the column space of X₀.

Testing H₀ : τ₁ = l can be carried out by considering the sequence of models

y = X₀τ₀ + X₁l + e
y = X₀τ₀ + X₁τ₁ + e

where X₀ is n × (p − r) and the column space of X contains the column space of X₀ (Searle, 1971). We assume that both X and X₀ are of full rank. Under H₀, the linear model becomes

y ∼ N(X₀τ₀ + X₁l, σ₀²I_n)        (2.7.32)

and the maximum likelihood estimates are

\[
\begin{aligned}
\hat{\tau}_0^0 &= (X_0'X_0)^{-1}X_0'(y - X_1 l) \qquad (2.7.33) \\
\hat{\sigma}_0^2 &= \frac{1}{n}(y - X_0\hat{\tau}_0^0 - X_1 l)'(y - X_0\hat{\tau}_0^0 - X_1 l) \\
&= \frac{1}{n}(y - X_1 l)'(I_n - P_{X_0})(y - X_1 l) \qquad (2.7.34) \\
&= \frac{1}{n}R_0
\end{aligned}
\]

where R₀ is the residual sum of squares under H₀.

The generalized likelihood ratio test is a standard approach for tests of hypotheses for models fitted using likelihood methods (?). For our hypothesis the generalized likelihood ratio statistic is defined as

\[
\Lambda = \frac{L(\hat{\tau}_0^0, \hat{\sigma}_0^2; y)}{L(\hat{\tau}, \hat{\sigma}^2; y)}
\]

which is the maximized likelihood under H₀ divided by the maximized likelihood for the full model. Under the normality assumptions of section 2.2 the statistic is given by

\[
\begin{aligned}
\Lambda &= \left( \frac{\hat{\sigma}_0^2}{\hat{\sigma}^2} \right)^{-n/2} \\
&= \left( \frac{R_0}{R} \right)^{-n/2} \\
&= \left( \frac{R + (R_0 - R)}{R} \right)^{-n/2} \\
&= \left( 1 + \frac{R_0 - R}{R} \right)^{-n/2} \\
&= \left( 1 + \frac{r}{n-p}F \right)^{-n/2}
\end{aligned}
\]

where

\[
F = \frac{(R_0 - R)/r}{R/(n-p)}
\]

is the standard F-test for the above null hypothesis. It is easy to show that

\[
F = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{r s^2}
\]

This form will arise below. Several things need to be noted here. These are

• the asymptotic distribution of −2 log(Λ) is χ²(r) under H₀;
• using a likelihood ratio test we reject H₀ if −2 log(Λ) > χ²_{1−α}(r);
• the likelihood ratio statistic Λ is a monotone decreasing function of F;
• F ∼ F(r, n − p) under H₀, where the numerator and denominator d.f. are presented within the brackets;
• using an F-test we reject H₀ if F > F_{1−α}(r, n − p).

This shows how the standard F-test relates to the likelihood ratio test for fixed effects under full maximum likelihood. The F-test is preferred in this context since, if the normality assumption holds, the test statistic has an exact distribution, whereas the null distribution for the likelihood ratio test is asymptotic, that is, based on a large sample approximation. Likelihood ratio methods play an important role in mixed models and are discussed in detail in chapter 6.

Before we turn to the remaining methods, note that an analysis of variance table is implicit in the above formulation. In fact, table 2.2 provides the logical decomposition.

The Wald statistic (?) is based on the distribution of the estimator of τ, namely (2.3.12). As we are interested in a test involving τ₁, we require the distribution of τ̂₁, namely

\[
\hat{\tau}_1 \sim N\!\left( \tau_1,\ \sigma^2 \left[ X_1'(I_n - P_{X_0})X_1 \right]^{-1} \right)
\]

Table 2.2 ANOVA decomposition of sums of squares

  Term              d.f.           S.S.      M.S.          F-test
  Difference        r              R₀ − R    (R₀ − R)/r    F
  Full model        n − p          R         R/(n − p)
  Null hypothesis   n − (p − r)    R₀

Under H₀, τ₁ = l, and the Wald statistic is given by

\[
W = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{\hat{\sigma}^2} = \frac{nr}{n-p}F
\]

which is a simple monotone function of the F-statistic. Thus the Wald test is equivalent to the likelihood ratio test in this case.

The score test is given by

S = U₀′ I₀⁻¹ U₀

where U₀ and I₀ are the score vector and expected information matrix derived under the full model but evaluated under H₀. Thus, at τ̂₀⁰ and σ̂₀²,

\[
U_0 = \begin{bmatrix} \frac{1}{\hat{\sigma}_0^2}\left( X'y - X'X \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right) \\ 0 \end{bmatrix} \qquad (2.7.35)
\]

while the expected information matrix is given by (2.3.16) with σ² evaluated at σ̂₀². The score statistic is then

\[
\begin{aligned}
S &= \frac{1}{\hat{\sigma}_0^2}\left( X'y - X'X \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right)' (X'X)^{-1} \left( X'y - X'X \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right) \\
&= \frac{1}{\hat{\sigma}_0^2} \left( \begin{bmatrix} \hat{\tau}_0 \\ \hat{\tau}_1 \end{bmatrix} - \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right)' X'X \left( \begin{bmatrix} \hat{\tau}_0 \\ \hat{\tau}_1 \end{bmatrix} - \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} \right)
\end{aligned}
\]

Using (2.7.31) and (2.7.33) we have

\[
\begin{bmatrix} \hat{\tau}_0 \\ \hat{\tau}_1 \end{bmatrix} - \begin{bmatrix} \hat{\tau}_0^0 \\ l \end{bmatrix} = \begin{bmatrix} -(X_0'X_0)^{-1}X_0'X_1 \\ I_r \end{bmatrix} (\hat{\tau}_1 - l)
\]

and replacing X′X by its partitioned form

\[
\begin{bmatrix} X_0'X_0 & X_0'X_1 \\ X_1'X_0 & X_1'X_1 \end{bmatrix}
\]

the score statistic can be shown to be

\[
S = \frac{(\hat{\tau}_1 - l)'X_1'(I_n - P_{X_0})X_1(\hat{\tau}_1 - l)}{\hat{\sigma}_0^2} = \frac{nrF}{n - p + rF}
\]

which is a monotone increasing function of the F-statistic. Thus, for the linear model and under H₀, the score statistic is equivalent to the F-statistic. For linear models, all three methods lead to the standard F-test for inference concerning the vector of fixed effects, τ.

For completeness we present the F-statistic for the original test of H₀ : L′τ = l. The statistic is given by

\[
F = \frac{(L'\hat{\tau} - l)'\left[ L'(X'X)^{-1}L \right]^{-1}(L'\hat{\tau} - l)}{r s^2}
\]

and under H₀, F ∼ F(r, n − p).

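Before moving to the example, a small numerical sketch (not from the original text) of the monotone relations just derived; n, p, r and F take illustrative values only.

```python
# Sketch: the monotone relations among -2 log(Lambda), the Wald
# statistic W and the score statistic S derived above, for the
# illustrative values n = 14, p = 2, r = 1 and a chosen F.
import numpy as np

n, p, r = 14, 2, 1
F = 6.64                                       # illustrative F value

neg2logLam = n * np.log(1 + r * F / (n - p))   # -2 log(Lambda)
W = n * r * F / (n - p)                        # Wald statistic
S = n * r * F / (n - p + r * F)                # score statistic

# Each statistic increases with F, so the three tests order the
# evidence identically; asymptotically each is chi-squared(r).
print(neg2logLam, W, S)
```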
2.8 Analysis of plant growth example

For this example the two models (in symbolic form) to be fitted are

y ∼ mu + units
y ∼ treatment + units

We note that the latter model could also be written as

y ∼ mu + treatment + units

where constraints are now necessary. Thus, in this example, excluding the treatment term from the linear model allows us to test the hypothesis that the treatment mean effects are equal.

To complete this chapter we present the analysis of variance table for this plant growth data-set. The results of the analysis are summarized in table 2.3. The total (mean corrected) sum of squares has been partitioned into a sum of squares due to treatments (R₀ − R) and a within treatment sum of squares (R). In matrix terms these are given by

R = y′(I_n − P_X)y
R₀ = y′(I_n − P_{X₀})y

Table 2.3 contains an entry for the between treatments and within treatments sources of variation as well as the total variation (about the mean), together with the decomposition in terms of R and R₀, with a column for the degrees of freedom (d.f.), the sum of squares (S.S.), the mean squares (M.S. = S.S./d.f.) and the F-test (the ratio of the M.S. for between treatments to the M.S. for within treatments). That is,

\[
F = \frac{(R_0 - R)/r}{R/(n-p)}
\]

where n = 14, p = 2 and r = 1. The F-test is significant (p < .05), hence we reject the hypothesis that the treatment means are equal. The least squares estimates of the fixed effects and their standard errors are presented in table 2.4. These are defined using the corner-point constraints.

Table 2.3 Decomposition of sums of squares for plant growth data

  Term                         d.f.   S.S.      M.S.     F-test
  Between treatments (R₀ − R)  1      3600.0    3600.0   6.640
  Within treatments (R)        12     6506.1    542.2
  Total (R₀)                   13     10106.1

Table 2.4 Summary of fixed effects for plant growth example

  Effect   Estimate   S.E.
  τ₁*      0.00
  τ₂*      -32.07     12.45
  µ        101.21     8.80
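As a check (not part of the original text), the following numpy/scipy sketch reproduces tables 2.3 and 2.4 directly from the Table 2.1 data.

```python
# Sketch: reproducing the ANOVA of table 2.3 and the corner-point
# estimates of table 2.4 from the Table 2.1 data with numpy/scipy.
import numpy as np
from scipy import stats

hc  = np.array([57.0, 123.5, 66.0, 130.0, 114.0, 107.5, 110.5])
mav = np.array([55.0, 67.6, 61.5, 58.0, 104.0, 62.0, 75.9])
y = np.concatenate([hc, mav])
n, p, r = 14, 2, 1

R0 = np.sum((y - y.mean()) ** 2)    # total (mean corrected) SS
R  = np.sum((hc - hc.mean()) ** 2) + np.sum((mav - mav.mean()) ** 2)

F = ((R0 - R) / r) / (R / (n - p))
print(R0 - R, R, F)                 # approx 3600.0, 6506.1, 6.64
print(stats.f.sf(F, r, n - p))      # p-value, approx 0.024

# Corner-point estimates: mu = mean(HC); tau2* = mean(MAV) - mean(HC).
s2 = R / (n - p)
print(hc.mean(), np.sqrt(s2 / 7))                   # approx 101.21, 8.80
print(mav.mean() - hc.mean(), np.sqrt(2 * s2 / 7))  # approx -32.07, 12.45
```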

2.9 Summary

In this chapter, results for the linear model have been presented to

• provide the vector-matrix setting for chapters to follow,
• revise basic results on estimation in the linear model,
• introduce Residual Maximum Likelihood (REML) for estimation of variance parameters,
• derive the likelihood ratio test, the Wald test and the score test, and show their equivalence to the standard F-test in the analysis of variance.

CHAPTER 3

Analysis of designed experiments

We begin the study of mixed models by considering the analysis of designed experiments. Designed experiments are highly structured, and this structure can be utilised to provide an approach to analysis and to set the foundations for more complex developments.

The fundamental principles underpinning analysis of variance, namely projections, strata and orthogonal block structures, are developed in this chapter in the manner of Nelder (1965a,b). There are three components in the development, namely the covariance structures generated by orthogonal block structures, the decomposition of treatment effects into components of interest (this decomposition usually being the aim of the experiment), and lastly the interplay between the block and treatment structures.

For the most part, the models to be discussed in this chapter have a simple random effects structure and the treatment effects appear in only one part of the analysis of variance table. Estimation of variances allied to these random effects is very simple and involves equating expected (residual) mean squares to their observed values in an analysis of variance table. These are Residual Maximum Likelihood (REML) estimates in these simple cases.

When the same treatment effects appear in several independent parts of the analysis of variance, as for example in incomplete block designs, the correspondence between analysis of variance and REML estimates of variances is broken. Efficient estimation of treatment effects and variance components in such situations is via REML, and this extends to the analysis of unbalanced data to be discussed later in the book. The use of analysis of variance tables is to be encouraged in unbalanced situations in order to define appropriate models. Thus the specialized nature of this chapter is a springboard to more complex situations.

3.1 One-way classification

3.1.1 Motivation

The data to be considered in this section come from a larger study conducted by Dr J Panozzo, in which the aim is to assess the malting quality of a number of barley varieties. An important trait in determining the malt quality of barley is the diastatic power (DP). Samples of barley grain are put through a malting process using a micro-malter and DP is measured on these samples. The micro-malter holds 80 canisters in a 16 × 5 array. Often there are more than 80 samples to be malted, so that sequential runs of the micro-malter must be undertaken. In the study 10 sequential malt runs were required. In order to assess variation between malt runs, 4 of the 80 canisters in each run were randomly assigned to a control barley. Each of these control canisters was filled with a subsample from a uniform batch of barley grain (from a single variety). The DP data for these control samples in each malt run are presented in table 3.1.

Table 3.1 Diastatic power (DP) for control samples in 10 malt runs

  Malt run   Diastatic power
  1          10.0    9.9   10.1   10.6
  2           9.1   10.3   10.0    9.0
  3          11.5   11.3   11.6   11.3
  4          10.0    9.6   10.6   10.8
  5          10.0    9.2   10.6    9.2
  6          10.0   10.9   10.9   10.1
  7           9.1    9.1    9.3    9.0
  8           9.0    8.3    9.9   10.0
  9          10.3    9.0    9.0    9.7
  10          9.1    9.1    8.9    9.0


Figure 3.1 Malt run data: dotplot of the control samples from each malt run

Our aim is to quantify the variation between and within malt runs using the control sample data, and in particular to see if the between malt run variation is "large". We begin with a preliminary look at the data using the dotplot (S-PLUS, Insightful Corp., 2000) given in Figure 3.1. The dotplot suggests that there is variation between malt runs, but in addition that within malt run variation may also be large.

A simple statistical model which allows for both between and within malt run variation is

y_ij = µ + u_i + e_ij        (3.1.1)

where y_ij is the observed DP, µ is the mean DP across all malt runs, u_i represents the ith malt run effect, i = 1, 2, ..., 10, and e_ij ∼ N(0, σ²), j = 1, 2, ..., 4. This has the same form as (2.2.1). However, there is a major difference in the aim of this analysis, namely that a measure of variation across malt runs is required. Rather than assume the malt run effects are fixed, and following the principles discussed in chapter 1, it is appropriate to assume the effect is random. Thus we assume u_i ∼ N(0, σ²_u), i = 1, ..., 10. In addition we assume the malt run effects (u_i) and the residual errors (e_ij) are statistically independent. The parameters to be estimated are now µ (the mean DP over all malt runs), σ²_u and σ². In contrast to (2.2.1), the malt run effects are random variables and therefore do not have fixed values which can be estimated. We return to this issue in chapter 5.

If y is the vector of observations (ordered as samples within malt run), n = 40 is the sample size, b = 10 is the number of malt runs, and r = 4 is the number of replicate controls per malt run, (3.1.1) can be written as

    y = 1_n µ + Z u + e    (3.1.2)

where 1_n is a vector of n ones, Z^{n×b} = I_b ⊗ 1_r is a design matrix, u ∼ N(0, σ_u^2 I_b) and e ∼ N(0, σ^2 I_n); ⊗ is the kronecker product operator (??). Note that the dimension of the vector of ones and of the identity matrix is given as a subscript. This is not to be confused with the convention of presenting the dimensionality of a general matrix or vector as a superscript. In the notation of chapter 2, we can also write this model symbolically as

    y ∼ mu + maltrun + units

where maltrun is a factor with 10 levels, and, following on from the argument above, it is a random factor and hence is presented in bold (by the convention established in chapter 1). Under these assumptions the marginal distribution of y is

    y ∼ N(1_n µ, σ_u^2 Z Z^T + σ^2 I_n)    (3.1.3)

The aim is to estimate σ_u^2 and σ^2 in order to gauge the relative size of between and within malt run variation. As σ_u^2 is a variance, we constrain it to be positive. In some applications discussed in this book, this non-negativity constraint can be relaxed.

The model given by (3.1.2) specifies a simple linear mixed model. It has a random component Zu in addition to the residual error random effect. The marginal distribution of y is given by (3.1.3) and has a structured variance matrix which we will denote by

    V = σ_u^2 Z Z^T + σ^2 I_n
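As a concrete sketch in R (the S language; the document uses S-PLUS for graphics), the design matrix and the marginal variance matrix for the malt run layout can be constructed directly. The variance component values below are illustrative only, not estimates from the data.

## Design matrix Z = I_b (x) 1_r and marginal variance for the malt run model (3.1.2).
b <- 10; r <- 4; n <- b * r
Z <- kronecker(diag(b), matrix(1, nrow = r, ncol = 1))   # 40 x 10 design matrix
sigma2u <- 0.5    # illustrative value only
sigma2  <- 0.25   # illustrative value only
V <- sigma2u * tcrossprod(Z) + sigma2 * diag(n)          # tcrossprod(Z) = Z %*% t(Z)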

3.1.2 Projections, strata and analysis of variance

The one-way classification is an example of a designed experiment which possesses strata of variation. The concept of strata has been considered primarily, for example by Nelder (1965a,b), for the analysis of designed experiments with orthogonal block structure. We shall present the formal definition of strata later in this section, but for the moment we introduce the concepts by considering the linear mixed model (and associated distributional assumptions for the random components) for the one-way classification as given by (3.1.2) and (3.1.3). If u was a fixed effect, that is a vector of parameters, we could proceed as in chapter 2, and partition malt run effects using the approach leading to (2.6.28). Thus malt run effects can be decomposed into a complete orthogonal set, defined by (the notation is changed from chapter 2 because of the change in the status of the factor from fixed to random)

    P_1 = Z (Z^T Z)^{-1} Z^T    (3.1.4)
    P_2 = I_n − Z (Z^T Z)^{-1} Z^T    (3.1.5)

and the matrices P_1 and P_2 are orthogonal projections onto the plane defined by Z and the plane orthogonal to Z respectively; these projections correspond to the between groups and the within groups effects respectively. Thus for i = 1, 2 and j ≠ i,

    P_i^T = P_i,   P_i^2 = P_i,   P_i P_j = 0,   P_1 + P_2 = I_n

The model in which u_i is random moves the design matrix Z to the variance structure. While group effects no longer appear explicitly, they are implied by the form of V. The decomposition using (3.1.4) and (3.1.5) can still be applied, but the impact is to separate the data into components, each of which can be modelled using a simple linear model, that is, a model with a single random term with constant variance. For the one-way classification this can be achieved as follows. Firstly note that Z^T Z = r I_b. Then

    V = r σ_u^2 Z (Z^T Z)^{-1} Z^T + σ^2 I_n
      = r σ_u^2 P_Z + σ^2 I_n
      = (σ^2 + r σ_u^2) P_Z + σ^2 (I_n − P_Z)

      = ξ_1 P_1 + ξ_2 P_2    (3.1.6)

Note that

    ξ_1 = r σ_u^2 + σ^2,    (3.1.7)
    ξ_2 = σ^2    (3.1.8)

Thus the variance matrix contains the decomposition into between and within group components, together with two weights or variances ξ_1 and ξ_2 which are functions of the original variances.

Since each P_i is a projection matrix we can proceed as in chapter 2 and write

    P_i = K_i K_i^T

where the K_i are matrices of size n × b and n × (n − b) for i = 1, 2 respectively. In addition, K_1^T K_2 = 0 and K_i^T K_i = I_{ν_i}, where ν_i is the rank of K_i. Since K_i is of full column rank, the rank of K_i equals the number of columns of K_i, that is b and n − b for i = 1, 2 respectively.

For the estimation of (µ, ξ_1, ξ_2), i.e. (µ, σ_u^2, σ^2), we partition the data into two parts that reflect the between and within malt run components. To do so consider the transformation of the data vector y to K^T y where K = [K_1 K_2] is a non-singular matrix.

Since P_2 1_n = 0, we have E(y_2) = 0_{n−b}. Further, for i = 1, 2,

    var(y_i) = K_i^T (ξ_1 P_1 + ξ_2 P_2) K_i
             = ξ_i K_i^T P_i K_i

             = ξ_i I_{ν_i}

and

    cov(y_1, y_2) = K_1^T (ξ_1 P_1 + ξ_2 P_2) K_2 = 0

Thus,

    K^T y = (y_1; y_2) ∼ N( (K_1^T 1_n µ; 0), diag(ξ_1 I_b, ξ_2 I_{n−b}) )    (3.1.9)

and we have two independent components, one depending on µ and ξ_1 and the other depending only on ξ_2. These two independent distributions define the two strata in our experiment, the between groups stratum specified by y_1 and the within groups stratum specified by y_2. The variance parameters ξ_i, i = 1, 2 are known as stratum variances. In essence we have reduced the linear mixed model to two independent linear models, namely

    y_1 ∼ N(K_1^T 1_n µ, ξ_1 I_b)

    y_2 ∼ N(0, ξ_2 I_{n−b})

Allied to this partition into strata is the decomposition of the total sum of squares as a sum of the stratum sums of squares. This implies no loss of information. To see this consider

    Σ_{i=1}^{2} y_i^T y_i = Σ_{i=1}^{2} y^T K_i K_i^T y
                          = y^T (Σ_{i=1}^{2} P_i) y
                          = y^T y

as required.
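These identities are easily checked numerically. The following R sketch enters the data of table 3.1 (ordered as samples within runs), builds P_1 and P_2, and confirms that the two stratum sums of squares recover the total sum of squares.

## Projections and strata for the malt run data (table 3.1).
dp <- c(10.0,  9.9, 10.1, 10.6,   9.1, 10.3, 10.0,  9.0,
        11.5, 11.3, 11.6, 11.3,  10.0,  9.6, 10.6, 10.8,
        10.0,  9.2, 10.6,  9.2,  10.0, 10.9, 10.9, 10.1,
         9.1,  9.1,  9.3,  9.0,   9.0,  8.3,  9.9, 10.0,
        10.3,  9.0,  9.0,  9.7,   9.1,  9.1,  8.9,  9.0)
b <- 10; r <- 4; n <- b * r
Z  <- kronecker(diag(b), matrix(1, r, 1))
P1 <- Z %*% solve(crossprod(Z)) %*% t(Z)   # projection onto the malt run space
P2 <- diag(n) - P1                         # orthogonal complement
all.equal(P1 %*% P1, P1)                   # idempotent
max(abs(P1 %*% P2))                        # orthogonal: effectively zero
## The two stratum sums of squares add to the total sum of squares:
c(between = drop(t(dp) %*% P1 %*% dp),
  within  = drop(t(dp) %*% P2 %*% dp),
  total   = sum(dp^2))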

It is clear that the overall mean can only be estimated from y_1 (i.e. in the between groups stratum). The least squares (and maximum likelihood) estimate using (2.3.11) is given by

    µ̂ = (1_n^T K_1 K_1^T 1_n)^{-1} 1_n^T K_1 K_1^T y = 1_n^T y / n = ȳ

The residual sum of squares for the between groups stratum is therefore given by

    y_1^T y_1 − y_1^T K_1^T 1_n (1_n^T P_1 1_n)^{-1} 1_n^T K_1 y_1 = y^T P_1 y − y^T 1_n (1_n^T 1_n)^{-1} 1_n^T y
                                                                  = y^T P_1 y − y^T P_0 y

where P_0 is the projection matrix for the overall mean. An analysis of variance (ANOVA) table, as given in table 3.2, can be constructed based on the two strata and the decomposition of the sums of squares in the between groups stratum into the sum of squares due to the overall mean and a residual. There is no decomposition of the within groups stratum because there are no fixed effects in the linear model for y_2. This is a simple example of a multi-stratum experiment, without treatment factors.

The ANOVA estimates of ξ_1 and ξ_2 can be obtained by equating the residual mean squares in table 3.2 to their expected values. Hence,

    ξ̂_1 = y^T (P_1 − P_0) y / (b − 1)
    ξ̂_2 = y^T P_2 y / (n − b)    (3.1.10)

and σ̂^2 and σ̂_u^2 can be found using equations (3.1.7) and (3.1.8), namely

    σ̂^2 = ξ̂_2
    σ̂_u^2 = (ξ̂_1 − ξ̂_2) / r

The ANOVA table for the malt run data is presented in table 3.3. The estimates of ξ_1 and ξ_2 are ξ̂_1 = 2.145 and ξ̂_2 = 0.261. Using the expressions above, or (3.1.7) and (3.1.8), the ANOVA estimates of the variance components are then σ̂_u^2 = 0.4708 and σ̂^2 = 0.2614 respectively. We see that the estimated between malt run variance is approximately 1.8 times the estimated residual variance, indicating the need for careful design protocols to account for between malt run variation.

Table 3.2 Analysis of variance for the one-way classification

Strata/Decomposition    d.f.     S.S.                 Expectation of M.S.
Between groups          b        y^T P_1 y            -
  Mean                  1        y^T P_0 y            n µ^2 + ξ_1
  Residual              b − 1    y^T (P_1 − P_0) y    ξ_1
Within groups           n − b    y^T P_2 y            -
  Residual              n − b    y^T P_2 y            ξ_2


Table 3.3 Analysis of variance for the malt run data

Strata/Decomposition    d.f.    S.S.        M.S.
Between groups          10
  Mean                   1      3888.784    3888.784
  Residual               9        19.296       2.145
Within groups           30
  Residual              30         7.840       0.261
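The calculations behind table 3.3 are easily reproduced. A minimal, self-contained R sketch computes the stratum residual sums of squares and the ANOVA estimates of the variance components for the data of table 3.1.

## ANOVA estimates of the variance components for the malt run data.
dp  <- c(10.0,  9.9, 10.1, 10.6,   9.1, 10.3, 10.0,  9.0,
         11.5, 11.3, 11.6, 11.3,  10.0,  9.6, 10.6, 10.8,
         10.0,  9.2, 10.6,  9.2,  10.0, 10.9, 10.9, 10.1,
          9.1,  9.1,  9.3,  9.0,   9.0,  8.3,  9.9, 10.0,
         10.3,  9.0,  9.0,  9.7,   9.1,  9.1,  8.9,  9.0)
run <- gl(10, 4)                            # 10 malt runs, 4 samples each
b <- 10; r <- 4; n <- b * r
run.means <- tapply(dp, run, mean)
s1 <- r * sum((run.means - mean(dp))^2)     # y'(P1 - P0)y, approx 19.30
s2 <- sum((dp - run.means[run])^2)          # y'P2 y,       approx  7.84
xi1 <- s1 / (b - 1)                         # approx 2.14
xi2 <- s2 / (n - b)                         # approx 0.26
c(sigma2 = xi2, sigma2u = (xi1 - xi2) / r)  # approx 0.261 and 0.471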

We now present a formal definition of strata.

Definition 3.1 A stratum is a maximal set of linear orthonormal functions of y which are independent and have equal variances.

When orthogonal strata exist, the total variation in the data can be partitioned accordingly and represented in an analysis of variance table, as presented above. We note that the strata presented above differ from the strata derived by a randomisation argument (as described in detail by Nelder (1965a), for example). The randomisation analysis for the one-way classification has three strata: the mean stratum, the between groups stratum and the within groups stratum. While our approach yields two strata, the between groups and within groups strata, we shall see that the mean stratum arises naturally with crossed classified random effects.

3.1.3 Another glimpse at REML

In the previous section the estimation of the variance parameters, namely the variance components (via the stratum variances), was achieved by equating stratum residual mean squares to their expectations. These estimates are the so-called ANOVA estimates of variance components (Searle et al., 1992).

We saw in chapter 2 that a residual likelihood could be found by eliminating the mean effects from the linear model. In the multi-stratum case, we have several linear models, and to develop an appropriate residual likelihood we need to eliminate the mean effects for each stratum.

In the one-way classification, these mean-free components are y_2 and that part of y_1 that has zero expectation. As y_1 follows a linear model, the arguments of section 2.5 apply. In particular, if X = K_1^T 1_n, we can define matrices K_11 and K_12 of full column rank such that K_12^T X = K_12^T K_1^T 1_n = 0_{b−1}. Let K_1^* = [K_11 K_12].

Now K_1 K_12 K_12^T K_1^T is an orthogonal projection matrix. In fact K_12 K_12^T projects orthogonally to X = K_1^T 1_n, and so equals (after some algebra)

    K_12 K_12^T = I_b − K_1^T P_0 K_1

where P_0 = 1_n (1_n^T 1_n)^{-1} 1_n^T. Thus

    K_1 K_12 K_12^T K_1^T = K_1 K_1^T − K_1 K_1^T P_0 K_1 K_1^T
                          = P_1 − P_1 P_0 P_1
                          = P_1 − P_0    (3.1.11)

This projection is merely the component of the between groups space orthogonal to the vector of ones (which specifies the unconditional mean of the linear mixed model). Thus the transformation of y to K_1^{*T} K_1^T y results in the complete decomposition

    (y_11; y_12; y_2) ∼ N( (K_11^T K_1^T 1_n µ; 0; 0), diag(ξ_1, ξ_1 I_{b−1}, ξ_2 I_{n−b}) )

It follows that the log-likelihood free of mean effects is based on the distribution of (y_12^T, y_2^T)^T and is given by

    ℓ_R = ℓ(ξ_1; y_12) + ℓ(ξ_2; y_2)
        = −(1/2) [ (b − 1) log ξ_1 + y_12^T y_12 / ξ_1 + (n − b) log ξ_2 + y_2^T y_2 / ξ_2 ]    (3.1.12)

Maximisation of (3.1.12) with respect to ξ_1 and ξ_2 leads to the unbiased ANOVA estimates as before. The log-likelihood in (3.1.12) is the so-called (log) residual likelihood, since it is the log-likelihood of a maximal set of contrasts which have zero expectation, that is, error or residual contrasts (Patterson and Thompson, 1971). If we define

    s_1 = y_12^T y_12 = y^T (P_1 − P_0) y
    s_2 = y_2^T y_2 = y^T P_2 y

then (3.1.12) can be written as (ignoring constants)

    ℓ_s = ℓ(ξ_1; s_1) + ℓ(ξ_2; s_2)
        = −(1/2) [ (b − 1) log ξ_1 + s_1 / ξ_1 + (n − b) log ξ_2 + s_2 / ξ_2 ]    (3.1.13)

Differentiation of (3.1.13) with respect to ξ_1 and ξ_2 gives

    ∂ℓ_s/∂ξ_1 = −(1/2) [ (b − 1)/ξ_1 − s_1/ξ_1^2 ]
    ∂ℓ_s/∂ξ_2 = −(1/2) [ (n − b)/ξ_2 − s_2/ξ_2^2 ]    (3.1.14)

and equating (3.1.14) to zero again gives the ANOVA estimates given in (3.1.10). The sums of squares s_1 and s_2 are independent and distributed as (scaled) chi-squared variates with degrees of freedom equal to the stratum residual degrees of freedom, namely b − 1 and n − b respectively. Thus "residuals" at various levels have been used to construct a likelihood free of fixed effects, and hence the name residual likelihood. The residual likelihood has therefore been constructed using the complete sufficient statistics for ξ_1 and ξ_2, namely s_1 and s_2.
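Equivalently, (3.1.13) can be maximised numerically. A small R sketch, using the stratum residual sums of squares and degrees of freedom from table 3.3, confirms that the maximiser coincides with the ANOVA estimates s_1/(b − 1) and s_2/(n − b).

## Numerical maximisation of the residual log-likelihood (3.1.13).
s1 <- 19.296; s2 <- 7.840      # stratum residual sums of squares (table 3.3)
df1 <- 9;     df2 <- 30        # stratum residual degrees of freedom
negll <- function(xi) 0.5 * (df1 * log(xi[1]) + s1 / xi[1] +
                             df2 * log(xi[2]) + s2 / xi[2])
fit <- optim(c(1, 1), negll, method = "L-BFGS-B", lower = 1e-6)
fit$par                        # approx (2.14, 0.261), i.e. (s1/df1, s2/df2)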

3.2 Randomised complete blocks

The data for this example has been kindly provided by Dr Maria Durban and involves a field experiment testing the yield performance of 272 barley varieties. The trial was laid out in 16 rows, each row consisting of 34 beds. There were two complete blocks: block one occupied rows 1 to 8, block two rows 9 to 16. Thus within each block there are 272 plots to which the varieties are allocated at random. The trial layout, in terms of the subdivision of the field into blocks, is presented in table 3.4. A simple model for the yield observed on block i = 1, 2, plot j = 1, ..., 272, is given by

    y_ij = τ_s(ij) + u_i + e_ij    (3.2.15)

where s(ij) represents the random assignment of a variety to plot j in block i, τ_k (k = 1, ..., 272) is the mean effect for the kth variety, and u_i is the effect for block i. Note that we have not included an overall mean or intercept term in (3.2.15). This avoids the unnecessary complication of placing constraints on the set of 272 variety effects, which for the present represent the variety mean levels. Here we assume the variety effects are fixed. Since the role of the blocks is to model the (co)variation in the data, we classify block effects as random. In this sense we are partitioning the total variance into σ_u^2 and σ^2, where σ_u^2 is the variance of the block effects and σ^2 is the residual variance. We assume further that the u_i and the e_ij are statistically independent.

In general, if there are t fixed (treatment) means and b blocks with a total

Table 3.4 Layout of the blocks in the field for the variety trial example

                            Row
Bed     1   2  ...  7   8   9  10  ...  15  16
 1      1   1  ...  1   1   2   2  ...   2   2
 2      1   1  ...  1   1   2   2  ...   2   2
...
33      1   1  ...  1   1   2   2  ...   2   2
34      1   1  ...  1   1   2   2  ...   2   2

of n = bt observations, then we can write (3.2.15) in matrix form as

    y = X τ + Z u + e    (3.2.16)

where X^{n×t} is the design matrix which assigns the treatment means to plots and Z^{n×b} is the design matrix for block effects. The data are assumed to be ordered by plot number within blocks, and so it follows that Z = I_b ⊗ 1_t. Hence

    V = var(y) = σ_u^2 Z Z^T + σ^2 I_n

The model is written symbolically as

    y ∼ variety + block + units

where variety and block are factors with 272 and 2 levels respectively.

3.2.1 Orthogonal projections and strata

The strata for this design are identical to those for the one-way classification, with blocks being equivalent to groups. The two strata are therefore the between blocks stratum and the within blocks stratum. As before, we transform the data vector y to K^T y where K = [K_1 K_2], and these matrices arise from P_1 and P_2. We have

    K^T y = (y_1; y_2) ∼ N( (K_1^T X τ; K_2^T X τ), diag(ξ_1 I_b, ξ_2 I_{b(t−1)}) )    (3.2.17)

where ξ_i, i = 1, 2 are the stratum variances, given by

    ξ_1 = t σ_u^2 + σ^2,    ξ_2 = σ^2

3.2.2 Estimation of treatment effects and analysis of variance

The variety effects are of interest in this example, and this introduces the first complication as far as estimation is concerned. The model (3.2.17) suggests that variety effects may be present in both strata or, in general, in several strata. The problem here, and in fact in general for more complicated designs, is to obtain efficient estimates of both τ and the ξ_i. Nelder (1965b) considers this problem and the following is largely (though not exactly) based on his development.

We begin by defining some matrices that will appear in the developments of this chapter. The matrices A_n and B_n are defined by

    A_n = 1_n (1_n^T 1_n)^{-1} 1_n^T
    B_n = I_n − 1_n (1_n^T 1_n)^{-1} 1_n^T = I_n − A_n

Both A_n and B_n are orthogonal projection matrices, they are orthogonal to each other, and their size (which will vary depending on the application) is given by the subscript (here n). A_n replaces each element of a vector by the mean of that vector, while B_n replaces each element by its deviation from the mean.

Equation (3.2.17) shows that the treatment effects may appear in both strata and hence in both linear models defined in that equation. We can therefore estimate part or all of τ in each stratum. Consider estimation of τ in the ith stratum. If τ̂_[i] denotes the estimate of τ using only the ith stratum, then using the normal equations (2.3.10) of chapter 2, we have

    X^T P_i X τ̂_[i] = X^T P_i y    (3.2.18)

The matrix X^T P_i X is the information matrix for the fixed effects in stratum i. It may not be of full rank, so that the solution is not unique, and hence obtaining a specific solution to (3.2.18) depends on finding an appropriate generalised inverse. We consider this below, but firstly we examine the form of the information matrices. As Z = I_b ⊗ 1_t, we have

    P_1 = I_b ⊗ A_t,    P_2 = I_b ⊗ B_t

An important property of the orthogonal projection matrices P_i is that they are invariant to permutations of units within blocks, because permuting unit values within blocks does not change P_i. This means that the rows of X can be reordered in the manipulations to follow. Thus we take a convenient form for X, namely X = 1_b ⊗ I_t. Using properties of kronecker products it is easy to show that

    X^T P_1 X = b A_t = b T_1,    X^T P_2 X = b B_t = b T_2    (3.2.19)

Note also that

    X^T P_1 = 1_b^T ⊗ A_t,    X^T P_2 = 1_b^T ⊗ B_t    (3.2.20)

In stratum 1, we therefore have the normal equations

    b A_t τ̂_[1] = (1_b^T ⊗ A_t) y

or

    A_t τ̂_[1] = (1/b)(1_b^T ⊗ A_t) y    (3.2.21)

Now

    A_t τ = τ̄· 1_t

so that the left-hand side of (3.2.21) shows that in stratum 1 we can only estimate the overall mean of the treatment effects. The right-hand side of (3.2.21) confirms this, as it equals

    ȳ·· 1_t

In stratum 2,

    b B_t τ̂_[2] = (1_b^T ⊗ B_t) y

and we can use properties of kronecker products to reduce the equations to

    B_t τ̂_[2] = (1/b)(1_b^T ⊗ B_t) y
              = B_t ((1/b) 1_b^T ⊗ I_t) y
              = B_t ȳ_t    (3.2.22)

where ȳ_t is the vector of treatment means calculated across the blocks.

    B_t τ = τ − τ̄· 1_t

so that the left-hand side of (3.2.22) shows that in stratum 2 we can only estimate the deviations of the treatment effects from their overall mean. The right-hand side of (3.2.22) equals

    ȳ_t − ȳ·· 1_t

the deviations of the treatment sample means about the overall sample mean.

Before we turn to important aspects of these results, note that we can solve the two sets of normal equations for strata 1 and 2 by using generalised inverses. Both A_t and B_t are generalised inverses of themselves, and hence using (2.4.22) we find

    τ̂_[1] = A_t (1/b)(1_b^T ⊗ A_t) y
          = (1/b)(1_b^T ⊗ A_t) y
          = ȳ·· 1_t

    τ̂_[2] = B_t B_t ȳ_t
          = B_t ȳ_t
          = ȳ_t − ȳ·· 1_t

which confirms the statements made regarding the effects that can be estimated from each stratum.

Now in chapter 2, we saw that a decomposition of the mean in a single

T 1 = At, T 2 = Bt T then it easy to see that T 1 + T 2 = It, T 1 T 2 = 0 and that T 1 and T 2 are orthogonal projection matrices. Thus the treatment effects have an or- thonormal decomposition in a similar manner to the variance matrix which was determined by the block structure. Based on (3.2.20), the estimates can be written as 1 T τˆ = XT P y (3.2.23) i [i] b i In this form the left-hand side represents the effects being estimated in the ith stratum, while the right hand side is a sum involving the data and a divisor b which is called the effective replication of the effect in stratum i. This example is a special case of an important concept for the estimation of fixed effects within strata, that is of generally balanced designs (Nelder, 1965b). A design is Generally balanced if the information matrix for the fixed effects in stratum i can be written as l T X X P iX = λijT j (3.2.24) j=1 This form differs to that of Nelder (1965b) but it can be shown the two definitions are equivalent. When this condition holds, there is no need to find an inverse for the left hand side of (3.2.18). If the λik corresponding to a T k is zero then it follows that there is no information in stratum i on the fixed effects T kτ .

Table 3.5 Effective replication for the randomised complete blocks example

Stratum           T_1 τ (mean)    T_2 τ (treatment)
Between blocks    λ_11 = b        λ_12 = 0
Within blocks     λ_21 = 0        λ_22 = b

For a generally balanced design, consider estimation of T_k τ, k = 1, ..., l, in stratum i, which is only possible if λ_ik ≠ 0. Pre-multiplying (3.2.18) by T_k gives

    T_k (Σ_{j=1}^{l} λ_ij T_j) τ̂_[i] = T_k X^T P_i y
    ⇒ λ_ik T_k τ̂_[i] = T_k X^T P_i y

    ⇒ T_k τ̂_[i] = (1/λ_ik) T_k X^T P_i y    (3.2.25)

At this point it is worth considering the form of (3.2.25) in more detail. Beginning from standard maximum likelihood estimation of the vector of fixed effects τ, it follows that for generally balanced designs there is a very simple form for the maximum likelihood estimate of T_k τ in stratum i. This estimate (T_k τ̂_[i]) is a simple function of P_i y: pre-multiplication by X^T forms totals for each treatment, T_k takes deviations, and λ_ik is a scaling factor. The scalar λ_ik is known as the effective replication of T_k τ in stratum i. We have seen that the RCB design is an example of this type: equation (3.2.24) holds, see (3.2.20), with the effective replication for each treatment term given for each stratum in table 3.5.

The sum of squares due to the fixed effects T_k τ in stratum i is then given by (see (2.3.15) and the derivation leading to that form)

    (T_k τ̂_[i])^T X^T K_i K_i^T y = τ̂_[i]^T T_k X^T P_i y
                                  = λ_ik (T_k τ̂_[i])^T (T_k τ̂_[i])    (3.2.26)

using (3.2.25) and the idempotency of T_k. This has degrees of freedom equal to the rank of T_k. As T_k is an orthogonal projection, its rank equals its trace.

If each set of fixed effects T_k τ, k = 1, ..., l, can only be estimated in one stratum, then the design is said to be orthogonal. This occurs if there is only one non-zero λ_ik for each k. The randomised complete block design is an orthogonal design, as can be seen from table 3.5: the mean is estimated in the between blocks stratum and the treatment effects (deviations from the mean) are estimated in the within blocks stratum.

The analysis of variance table for the RCB can be constructed by subdividing the total sum of squares for each stratum into a sum of squares due to the fixed effects estimated in that stratum and a residual sum of squares. The ANOVA estimates of the stratum variances are the stratum residual mean squares. The full analysis of variance decomposition for a randomised complete block design is given in table 3.6. Note that using (3.2.26) the sum of squares due to the mean is given by

    s_m = λ_11 (T_1 τ̂_[1])^T (T_1 τ̂_[1])

        = (1/λ_11) y^T P_1 X T_1 X^T P_1 y
        = y^T P_0 y

The residual sum of squares for the between blocks stratum is obtained by difference as s_1 = y^T P_1 y − s_m. Similarly, the sum of squares due to the treatment effects is given by

    s_t = λ_22 (T_2 τ̂_[2])^T (T_2 τ̂_[2])

        = (1/λ_22) y^T P_2 X T_2 X^T P_2 y

and the residual sum of squares for the within blocks stratum is obtained as s_2 = y^T P_2 y − s_t. Note also that the expectations of the mean squares are given in an abbreviated form, where the non-centrality parameters (see appendix ??) are written as µ(·) to indicate which treatment effects are involved.

The ANOVA table for the variety trial data is presented in table 3.7. Note firstly that the treatment effects are significantly different from zero, indicating that varietal differences exist. Secondly, the ANOVA estimates of the stratum variances are ξ̂_1 = 2.324 and ξ̂_2 = 0.1380, from which the ANOVA estimates of the variance components are σ̂^2 = 0.1380 and σ̂_u^2 = 0.00803. Thus the within blocks variance is considerably larger than the between blocks variance (which is based on only 1 degree of freedom).

Table 3.6 Analysis of variance for an RCB design

Strata/Decomposition    d.f.              S.S.         Expectation of M.S.
Between blocks          b                 y^T P_1 y    -
  Mean                  1                 s_m          ξ_1 + µ(T_1 τ_[1])
  Residual              b − 1             s_1          ξ_1
Within blocks           b(t − 1)          y^T P_2 y    -
  Treatment             t − 1             s_t          ξ_2 + µ(T_2 τ_[2])
  Residual              (b − 1)(t − 1)    s_2          ξ_2

Table 3.7 Analysis of variance for the variety trial data

Strata/Decomposition    d.f.    S.S.        M.S.        F-test
Between blocks          2       16544.90
  Mean                  1       16542.58    16542.58
  Residual              1           2.324       2.324
Within blocks           542       118.01
  Treatment             271        80.613       0.297   2.156
  Residual              271        37.400       0.138
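One standard computational route to this multi-stratum decomposition (not the only one) is the S/R function aov() with an Error() term for blocks; aov() uses the usual intercept parameterisation rather than the full set of variety means. Since the variety trial data are not listed here, the sketch below simulates a response with the same structure; all effect sizes are arbitrary, and the output simply displays the two strata of table 3.6.

## Multi-stratum ANOVA for an RCB design via aov() with an Error() term.
set.seed(42)
b <- 2; t <- 272
d <- data.frame(block   = gl(b, t),
                variety = factor(rep(1:t, times = b)))
d$yield <- 3 + 0.4 * rnorm(t)[d$variety] +   # variety effects (arbitrary)
           0.1 * rnorm(b)[d$block] +         # block effects (arbitrary)
           rnorm(b * t, sd = 0.4)            # plot errors
fit <- aov(yield ~ variety + Error(block), data = d)
summary(fit)   # between-blocks and within-blocks strata, as in table 3.6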

3.2.3 Another look at REML

ANOVA estimates of stratum variances, and hence variance components, are obtained by equating stratum residual mean squares to their expectations. In the spirit of the approach used in section 3.1, we consider the log-likelihood of the between blocks and within blocks residual sums of squares. These sums of squares are distributed as (scaled) chi-squared variates with degrees of freedom equal to the stratum residual degrees of freedom given in table 3.6. The log-likelihood is, ignoring constants,

    ℓ_s = ℓ(ξ_1; s_1) + ℓ(ξ_2; s_2)
        = −(1/2) [ (b − 1) log ξ_1 + s_1/ξ_1 + (b − 1)(t − 1) log ξ_2 + s_2/ξ_2 ]    (3.2.27)

Differentiation of (3.2.27) with respect to ξ_1 and ξ_2 leads to the ANOVA estimates.

It can be shown that the log-likelihood in (3.2.27) is equivalent, as far as estimation is concerned, to the log-likelihood of that part of y_1 and y_2 which has zero expectation. That is, if we let y_1^* = K_1^{*T} y and y_2^* = K_2^{*T} y, where the matrices K_1^{*n×(b−1)} and K_2^{*n×(b−1)(t−1)} are full column rank matrices such that

    K_1^* K_1^{*T} = P_1 − P_0
    K_2^* K_2^{*T} = P_2 − (1/λ_22) P_2 X T_2 X^T P_2
    K_1^{*T} K_2^* = 0

Thus if K^* = [K_1^* K_2^*] then,

    K^{*T} y = (y_1^*; y_2^*) ∼ N( 0, diag(ξ_1 I_{b−1}, ξ_2 I_{(b−1)(t−1)}) )

3.3 Split plot design

The last example we present on designs with orthogonal block and treatment structure is a split plot example. The data is again kindly provided by Dr Maria Durban and involves an experiment designed to investigate the effect on yield of controlling the fungus powdery mildew in barley. Seventy varieties of barley were grown with and without fungicide application. The field layout consisted of four blocks (labelled I, II, III, IV) with two whole-plots per block, each split into 70 sub-plots. The two fungicide treatments were randomly allocated to the two whole-plots within each block, while the 70 varieties were randomly assigned to the 70 sub-plots. The trial was laid out in 56 beds by 10 rows. Each block consisted of 14 beds by 10 rows, with block I occupying beds 1 to 14, block II beds 15 to 28, and so on. Each whole-plot within each block comprised 7 beds by 10 rows. A sub-plot consisted of a single row. The layout of the trial, indicating the allocation of fungicide treatments to whole-plots and the arrangement of blocks, is presented in table 3.8.

Table 3.8 Indicative trial layout for the split plot design

Beds           Block    Whole-plot    Fungicide
 1, ...,  7    I        1             −
 8, ..., 14    I        2             +
15, ..., 21    II       1             −
22, ..., 28    II       2             +
29, ..., 35    III      1             −
36, ..., 42    III      2             +
43, ..., 49    IV       1             +
50, ..., 56    IV       2             −

The statistical model for the yield observed on sub-plot k = 1, ..., 70, whole-plot j = 1, 2 and block i = 1, ..., 4 is

    y_ijk = τ_s(ijk) + b_i + w_ij + e_ijk    (3.3.28)

where s(ijk) represents the randomisation of treatments (fungicides and varieties) to experimental units and τ_l (l = 1, ..., 140) is the mean effect for treatment l. As in the randomised block example, we use τ to represent the 140 treatment mean effects, rather than partitioning at this stage into the main effects of fungicide and variety and their interaction. The standard analysis assumes that the terms b_i for blocks, w_ij for whole-plots within blocks, and e_ijk for sub-plots are all normally distributed, mutually independent (within and between) sets of random effects, with variances σ_b^2, σ_w^2 and σ^2 respectively; the b_i, w_ij and e_ijk are thus pairwise statistically independent.

In general we assume there are t treatment effects, b blocks, w whole-plots in each block and s sub-plots in each whole-plot, with n = bws total observations. Then we can write (3.3.28) in matrix form as

    y = X τ + Z_1 u_1 + Z_2 u_2 + e = X τ + Z u + e

where X^{n×t} is a design matrix which assigns the factorial combinations of the treatments to experimental units. We assume there are w treatments applied to the whole-plots and s treatments applied to the sub-plots, so that t = ws. The whole-plot treatment factor will be denoted in the following by Wtreat and the sub-plot treatment factor by Streat; for notational convenience these names are sometimes abbreviated to W and S. In the fungicide by variety example w = 2 and s = 70. The matrix Z_1^{n×b} is the design matrix for block effects and the matrix Z_2^{n×bw} is the design matrix for effects of whole-plots within blocks. The vectors u_1^{b×1} and u_2^{bw×1} represent the effects for blocks and whole-plots within blocks respectively. Finally we define Z = [Z_1 Z_2] and u = [u_1^T u_2^T]^T.

This is a mixed model with three random components. The marginal distribution of y is

    y ∼ N(X τ, σ_b^2 Z_1 Z_1^T + σ_w^2 Z_2 Z_2^T + σ^2 I_n)    (3.3.29)

The mixed model can also be written symbolically as

    y ∼ Wtreat ∗ Streat + block/wplot + units

where block and wplot are factors with b and w levels respectively, and wplot labels the whole-plots within blocks. Note that the fixed effects formulation reflects the decomposition of the treatment effects into the main effects of Wtreat and Streat and their interaction, to be considered in section 3.3.2.

3.3.1 Orthogonal projections and strata

Using a similar approach to section 3.2.1, the variance matrix can be expressed in terms of projections involving orthogonal components. Thus if

    var(y) = V = σ_b^2 Z_1 Z_1^T + σ_w^2 Z_2 Z_2^T + σ^2 I_n

then V can be written as

    V = Σ_{i=1}^{3} ξ_i P_i

where

    ξ_1 = ws σ_b^2 + s σ_w^2 + σ^2,    ξ_2 = s σ_w^2 + σ^2,    ξ_3 = σ^2

Simple expressions for the P_i can be obtained by noting the form of the random effects design matrices Z_1 and Z_2. The data are assumed ordered as sub-plots within whole-plots within blocks, so that

    Z_1 = I_b ⊗ 1_w ⊗ 1_s
    Z_2 = I_b ⊗ I_w ⊗ 1_s

Hence

    P_1 = Z_1 (Z_1^T Z_1)^{-1} Z_1^T = I_b ⊗ A_w ⊗ A_s
    P_2 = Z_2 (Z_2^T Z_2)^{-1} Z_2^T − Z_1 (Z_1^T Z_1)^{-1} Z_1^T = I_b ⊗ B_w ⊗ A_s
    P_3 = I_n − Z_2 (Z_2^T Z_2)^{-1} Z_2^T = I_b ⊗ I_w ⊗ B_s

We also define

    P_0 = 1_n (1_n^T 1_n)^{-1} 1_n^T = A_b ⊗ A_w ⊗ A_s

Recalling the properties of A_m and B_m, it follows that the P_i are orthogonal projection matrices summing to the identity matrix. The ranks of P_1, P_2, P_3 are b, b(w − 1) and bw(s − 1), since rank(B_m) = m − 1, rank(A_m) = 1 and rank(A ⊗ B) = rank(A) rank(B) (see section ??). Thus we have three strata of variation, and using a similar approach to section 3.2.1 we can transform to three independent linear models, each with homogeneous variation. Formally, we transform the data vector y to K^T y where K = [K_1 K_2 K_3] and K_i K_i^T = P_i, i = 1, 2, 3. Then

    K^T y = (y_1; y_2; y_3) ∼ N( (K_1^T X τ; K_2^T X τ; K_3^T X τ), diag(ξ_1 I_b, ξ_2 I_{b(w−1)}, ξ_3 I_{bw(s−1)}) )

These strata correspond to blocks, whole-plots within blocks and sub-plots within whole-plots. For notational convenience these strata will be referred to in the following by the labels blocks, blocks.wplots and blocks.wplots.splots.
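The triple kronecker form of the P_i makes them simple to construct. A minimal R sketch builds the three stratum projections for the fungicide by variety dimensions and checks their ranks (the trace of an orthogonal projection) and that they sum to the identity.

## Stratum projections for the split plot design via kronecker products.
b <- 4; w <- 2; s <- 70; n <- b * w * s
A <- function(m) matrix(1 / m, m, m)     # A_m: replaces elements by the mean
B <- function(m) diag(m) - A(m)          # B_m: deviations from the mean
P1 <- diag(b) %x% A(w) %x% A(s)          # blocks stratum
P2 <- diag(b) %x% B(w) %x% A(s)          # blocks.wplots stratum
P3 <- diag(b) %x% diag(w) %x% B(s)       # blocks.wplots.splots stratum
c(sum(diag(P1)), sum(diag(P2)), sum(diag(P3)))  # ranks: 4, 4, 552
all.equal(P1 + P2 + P3, diag(n))                # the P_i sum to I_n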

3.3.2 Estimation of treatment effects and analysis of variance

The model for the fixed effects τ can be written in a form similar to that used in section 3.2.2. That is,

    τ = Σ_{j=1}^{l} T_j τ    (3.3.30)

where l = 4 for the fungicide by variety example. The individual terms partition the treatment effects into the overall mean, the main effect of factor Wtreat, the main effect of factor Streat and the Wtreat.Streat interaction. The treatment projection matrices are given by

    T_1 = A_w ⊗ A_s
    T_2 = B_w ⊗ A_s
    T_3 = A_w ⊗ B_s
    T_4 = B_w ⊗ B_s    (3.3.31)

The T_j are a set of orthogonal projection matrices summing to the identity. The split plot design is a generally balanced design. This follows from the definition given in section 3.2.2, that is,

    X^T P_i X = Σ_{j=1}^{l} λ_ij T_j

for i = 1, 2, 3. In fact, reordering X as was done in section 3.2.2, it can be shown that

    X^T P_1 X = b A_w ⊗ A_s = b T_1
    X^T P_2 X = b B_w ⊗ A_s = b T_2
    X^T P_3 X = b I_w ⊗ B_s = b T_3 + b T_4

The effective replication λ_ij is given in table 3.9. This implies that the overall mean is estimated in the blocks stratum, the main effects of Wtreat are estimated in the blocks.wplots stratum, and the main effects of Streat and the interaction effects of Wtreat and Streat are estimated in the blocks.wplots.splots stratum. The design is therefore orthogonal, since each set of treatment effects in (3.3.30) is estimated in one stratum only.

Table 3.9 Effective replication for the split plot example

Stratum                 T_1 τ (mean)    T_2 τ (W)    T_3 τ (S)    T_4 τ (W.S)
Blocks                  λ_11 = b        λ_12 = 0     λ_13 = 0     λ_14 = 0
Blocks.wplots           λ_21 = 0        λ_22 = b     λ_23 = 0     λ_24 = 0
Blocks.wplots.splots    λ_31 = 0        λ_32 = 0     λ_33 = b     λ_34 = b

The analysis of variance can now be constructed. The total sum of squares in each stratum can be subdivided into treatment sum(s) of squares and a residual sum of squares. Table 3.10 presents the full analysis of variance table. The sums of squares due to the mean, the main effects of Wtreat and Streat, and their interaction are given by

    s_m = (1/λ_11) y^T P_1 X T_1 X^T P_1 y = y^T P_0 y
    s_w = (1/λ_22) y^T P_2 X T_2 X^T P_2 y
    s_s = (1/λ_33) y^T P_3 X T_3 X^T P_3 y
    s_ws = (1/λ_34) y^T P_3 X T_4 X^T P_3 y

The residual sum of squares for each stratum is obtained by difference; for example, in the blocks.wplots.splots stratum s_3 = y^T P_3 y − s_s − s_ws. The stratum variances ξ_1, ξ_2, ξ_3 are estimated by the residual mean squares in each stratum.

Table 3.11 presents the analysis of variance table for the fungicide by variety example. It contains the F-tests for the fixed effects. Since the fungicide effects are estimated in the blocks.wplots stratum, the appropriate mean square for testing fungicide effects is the residual in this stratum (see table 3.10). Similarly, since the variety and fungicide by variety effects are estimated in the blocks.wplots.splots stratum, the residual mean square in this stratum is the appropriate error for testing these effects (see table 3.10). There is a very large effect of fungicide treatment; however, there is no evidence of an interaction. We revisit this data-set in chapter ??.

Table 3.10 Analysis of variance for a split plot design

Strata/Decomposition      d.f.                S.S.         Expectation of M.S.
Blocks                    b                   y^T P_1 y    -
  Mean                    1                   s_m          ξ_1 + µ(T_1 τ_[1])
  Residual                b − 1               s_1          ξ_1
Blocks.wplots             b(w − 1)            y^T P_2 y    -
  Wtreat                  w − 1               s_w          ξ_2 + µ(T_2 τ_[2])
  Residual                (b − 1)(w − 1)      s_2          ξ_2
Blocks.wplots.splots      bw(s − 1)           y^T P_3 y    -
  Streat                  s − 1               s_s          ξ_3 + µ(T_3 τ_[3])
  W.S                     (w − 1)(s − 1)      s_ws         ξ_3 + µ(T_4 τ_[3])
  Residual                w(b − 1)(s − 1)     s_3          ξ_3

Table 3.11 Analysis of variance for the fungicide by variety example

Strata/Decomposition      d.f.    S.S.         M.S.        F-test
Blocks                    4       15389.808
  Mean                    1       15374.580    15374.58
  Residual                3          15.228        5.076
Blocks.wplots             4          45.149
  Fungicide               1          42.019       42.019   40.271
  Residual                3           3.130        1.043
Blocks.wplots.splots      552        77.107
  Variety                 69         39.284        0.569    7.201
  F.V                     69          5.090        0.074    0.933
  Residual                414        32.733        0.079
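The nested error structure of table 3.10 corresponds to Error(block/wplot) in the S/R aov() function. The fungicide by variety data are not reproduced here, so the following sketch simulates a response of the same shape (all effect sizes arbitrary) purely to display the three strata.

## Split plot ANOVA via aov(), with whole-plots nested within blocks.
set.seed(1)
b <- 4; w <- 2; s <- 70
d <- expand.grid(variety = factor(1:s), wplot = factor(1:w), block = factor(1:b))
d$fungicide <- d$wplot                       # one fungicide level per whole-plot
d$yield <- 10 + 0.8 * (d$fungicide == "2") +   # fungicide effect (arbitrary)
           0.3 * rnorm(s)[d$variety] +         # variety effects (arbitrary)
           rnorm(nrow(d), sd = 0.3)            # sub-plot errors
fit <- aov(yield ~ fungicide * variety + Error(block / wplot), data = d)
summary(fit)   # strata: block, block:wplot and Within, as in table 3.10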

3.3.3 REML for a split plot design

Estimation of the stratum variances (and thence the variance components) has been based on the stratum residual mean squares. As for the randomised complete block design in section 3.2.3, we could also consider the joint likelihood of the residual sums of squares of each stratum. Since the stratum residual sums of squares are independently distributed as (scaled) chi-squared variates, it can be shown that the log-likelihood of s_1, s_2 and s_3 is, ignoring constants,

    ℓ_s = ℓ(ξ_1; s_1) + ℓ(ξ_2; s_2) + ℓ(ξ_3; s_3)
        = −(1/2) [ (b − 1) log ξ_1 + s_1/ξ_1 + (b − 1)(w − 1) log ξ_2 + s_2/ξ_2
                   + (b − 1)w(s − 1) log ξ_3 + s_3/ξ_3 ]    (3.3.32)

It is interesting to compare (3.3.32) to the log-likelihood of the data y which, after replacing the fixed effects by their generalised least squares estimates, is given by

    −(1/2) [ log |V| + (y − X τ̂)^T V^{-1} (y − X τ̂) ]    (3.3.33)

First consider the log determinant in (3.3.33). The matrix K = [K_1 K_2 K_3] is orthogonal, so that

    |V| = |K^T| |V| |K| = |K^T V K| = Π_{i=1}^{3} ξ_i^{ν_i}

since K^T V K = diag(ξ_i I_{ν_i}), where ν_i = rank(K_i). Thus

    log |V| = Σ_{i=1}^{3} ν_i log ξ_i

Recall that V = Σ_{i=1}^{3} ξ_i P_i, so that

    V^{-1} = Σ_{i=1}^{3} ξ_i^{-1} P_i

and the quadratic form in the log-likelihood is then

    (y − X τ̂)^T V^{-1} (y − X τ̂) = Σ_{i=1}^{3} ξ_i^{-1} (y_i − K_i^T X τ̂)^T (y_i − K_i^T X τ̂)
                                  = s_1/ξ_1 + s_2/ξ_2 + s_3/ξ_3

Thus (3.3.33) is given by

    −(1/2) [ b log ξ_1 + s_1/ξ_1 + b(w − 1) log ξ_2 + s_2/ξ_2 + bw(s − 1) log ξ_3 + s_3/ξ_3 ]    (3.3.34)

The coefficient of log ξ_i in (3.3.34) is the stratum total degrees of freedom, while in (3.3.32) it is the stratum residual degrees of freedom, i.e. the total degrees of freedom minus the number of treatment effects estimated in that stratum. The other terms, which depend on the data, are the same.

Differentiation of (3.3.32) with respect to ξ_1, ξ_2 and ξ_3 leads to the ANOVA estimates. This is not the case for (3.3.34). Hence it seems natural, as far as likelihood estimation of the stratum variances is concerned, to use (3.3.32), as it takes account of the degrees of freedom used in the estimation of treatment effects; the consequent (residual) maximum likelihood estimates of stratum variances are unbiased and equal to the ANOVA estimates. Similarly, it can be shown (see chapter 5 for details) that the log-likelihood in (3.3.32) is equivalent, as far as estimation is concerned, to the log-likelihood of that part of y_1, y_2 and y_3 which has zero expectation. This is again the residual log-likelihood as defined by Patterson and Thompson (1971).

3.4 Balanced incomplete blocks

Before leaving the analysis of designed experiments to consider more general mixed models, it is useful to consider the analysis of a balanced incomplete block design. This is an example of a design with an orthogonal block structure in which treatments are estimated in more than one stratum. The data we use to illustrate the analysis are taken from a long term experiment conducted at the Horticultural Research Station, Dareton, NSW, kindly provided by Dr A. Grieve (State Forests, NSW) and Ms L. McFadyen (NSW Agriculture). The experiment involved examining the effects of irrigation frequency and volume on the growth and yield of sultana grapes. A total of 9 treatments was used, namely the factorial combinations of three irrigation amounts (low, medium and high) by three irrigation frequencies (based on soil moisture deficit levels). In the following we ignore the factorial structure, as this would unnecessarily complicate our development of the incomplete block analysis; furthermore, the scientists were primarily interested in the combined effects of amount and frequency. The experiment design consisted of two repeats of a balanced incomplete block design, resulting in 8 replicates. Within each replicate there were 3 incomplete blocks with 3 plots in each block. The actual field layout is presented in table 3.12; it comprises a rectangular array of 9 rows (indexed by blocks and plots within blocks) by 8 columns (squares and replicates).

The statistical model for the yield of grapes in replicate i = 1, ..., 8, block j = 1, 2, 3 within replicate i, and plot k = 1, 2, 3 within replicate i and block j is

    y_ijk = τ_s(ijk) + r_i + b_ij + e_ijk    (3.4.35)

where s(ijk) represents the randomisation of treatment combinations to experimental units and τ_l, l = 1, ..., 9, is the mean for treatment l. The standard analysis assumes the terms r_i for replicates, b_ij for blocks and e_ijk for plots are normally distributed, mutually independent sets of random effects with variances σ_r^2, σ_b^2 and σ^2 respectively.

In general we assume there are t treatments, r replicates, b (incomplete) blocks within replicates and p plots within blocks, with n = rbp total observations and t = bp. The grape example has n = 72, t = 9, r = 8, b = 3 and

Table 3.12 Field layout and treatment randomisation of the irrigation management trial

                        Replicate
Block    Plot    1   2   3   4   5   6   7   8
1        1       2   7   1   5   1   3   2   2
1        2       3   3   4   1   3   5   7   5
1        3       1   5   7   9   2   7   6   8
2        1       5   2   3   7   7   8   4   1
2        2       6   4   6   6   9   1   3   4
2        3       4   9   9   2   8   6   8   7
3        1       8   6   2   3   4   4   9   3
3        2       9   8   5   8   6   9   5   6
3        3       7   1   8   4   5   2   1   9

p = 3. Then we can write (3.4.35) in matrix form as

    y = X τ + Z_1 u_1 + Z_2 u_2 + e = X τ + Z u + e

where X^{n×t} is a design matrix which assigns the treatments to experimental units, Z_1^{n×r} is the design matrix for replicate effects and Z_2^{n×rb} is the design matrix for the effects of blocks within replicates. The vectors u_1 and u_2 represent the replicate and block within replicate effects respectively. Finally we define Z = [Z_1 Z_2] and u = [u_1^T u_2^T]^T. The marginal distribution of y is

    y ∼ N(X τ, σ_r^2 Z_1 Z_1^T + σ_b^2 Z_2 Z_2^T + σ^2 I_n)    (3.4.36)

The mixed model can also be written symbolically as

    y ∼ treatment + rep + rep.block

where treatment, rep and block are factors with t, r and b levels respectively.

3.4.1 Orthogonal projections and strata

Using a similar approach to section 3.2.1, the variance matrix can be expressed in terms of projections involving orthogonal components. Thus if

    var(y) = V = σ_r^2 Z_1 Z_1^T + σ_b^2 Z_2 Z_2^T + σ^2 I_n

then V can be written as

    V = Σ_{i=1}^{3} ξ_i P_i

where

    ξ_1 = bp σ_r^2 + p σ_b^2 + σ^2,    ξ_2 = p σ_b^2 + σ^2,    ξ_3 = σ^2

As before, simple expressions for the P_i can be obtained by noting the form of the random effects design matrices Z_1 and Z_2. Assuming the data are ordered as plots within blocks within replicates,

    Z_1 = I_r ⊗ 1_b ⊗ 1_p
    Z_2 = I_r ⊗ I_b ⊗ 1_p

Hence

    P_1 = Z_1 (Z_1^T Z_1)^{-1} Z_1^T = I_r ⊗ A_b ⊗ A_p
    P_2 = Z_2 (Z_2^T Z_2)^{-1} Z_2^T − Z_1 (Z_1^T Z_1)^{-1} Z_1^T = I_r ⊗ B_b ⊗ A_p
    P_3 = I_n − Z_2 (Z_2^T Z_2)^{-1} Z_2^T = I_r ⊗ I_b ⊗ B_p

and we also define

    P_0 = 1_n (1_n^T 1_n)^{-1} 1_n^T = A_r ⊗ A_b ⊗ A_p

The P_i are orthogonal projection matrices summing to the identity matrix. The ranks of P_1, P_2, P_3 are r, r(b − 1) and rb(p − 1). There are three strata, and these are the same, in terms of variance structure, as in the split plot example. Hence we transform to three independent linear models, each with homogeneous variation; the details are omitted. The strata will be labelled rep, rep.block and rep.block.plot.

3.4.2 Estimation of treatment effects

The model for the fixed effects τ can be written as

    τ = Σ_{j=1}^{l} T_j τ    (3.4.37)

where l = 2 for the grape example and the two terms represent the overall mean and the deviations of the treatment effects from the overall mean. The treatment projection matrices are given by

    T_1 = A_t  and  T_2 = B_t    (3.4.38)

It can be shown that the design is generally balanced, i.e.

    X^T P_i X = Σ_{j=1}^{l} λ_ij T_j

for i = 1, 2, 3. Using the properties of balanced incomplete block designs it can be shown that

    X^T P_1 X = r T_1
    X^T P_2 X = r(1 − E) T_2
    X^T P_3 X = r E T_2

where E is the efficiency factor of the design (see John and Williams, 1998), which for a BIB is given by E = {t(p − 1)}/{p(t − 1)}. Thus (3.2.24) holds, with effective replication given in table 3.13.

Table 3.13 Effective replication for the BIB example

Stratum           T_1 τ (mean)    T_2 τ (treatment)
Rep               λ_11 = r        λ_12 = 0
Rep.block         λ_21 = 0        λ_22 = r(1 − E)
Rep.block.plot    λ_31 = 0        λ_32 = rE

The effective replication of T_2 τ (i.e. the treatment effects) in the two strata where there is information is given by λ_22 = r(1 − E) and λ_32 = rE. In the grape example the effective replication of T_2 τ is 2 and 6 respectively, since E = 0.75. Hence if we consider estimation of T_2 τ in the rep.block stratum, then it follows that

    T_2 X^T P_2 X τ̂_[2] = T_2 X^T P_2 y
    ⇒ λ_22 T_2 τ̂_[2] = T_2 X^T P_2 y
    ⇒ T_2 τ̂_[2] = (1/λ_22) T_2 X^T P_2 y    (3.4.39)

It also follows that

    var(T_2 τ̂_[2]) = (1/λ_22^2) T_2 X^T P_2 var(y) P_2 X T_2
                   = (1/λ_22^2) T_2 X^T P_2 V P_2 X T_2
                   = (ξ_2/λ_22^2) T_2 X^T P_2 X T_2
                   = (ξ_2/λ_22) T_2

Similarly, estimation of T_2 τ in the rep.block.plot stratum gives

    T_2 τ̂_[3] = (1/λ_32) T_2 X^T P_3 y

with variance

    var(T_2 τ̂_[3]) = (ξ_3/λ_32) T_2

To obtain an efficient estimate of T_2 τ we therefore need to combine these estimates, weighting each by the inverse of its variance, i.e. by λ_i2/ξ_i, i = 2, 3. The explicit form of the combined estimate of T_2 τ can be derived using this approach. Alternatively, if we consider generalised least squares estimation of τ, then it follows that

    X^T V^{-1} X τ̂ = X^T V^{-1} y
    ⇒ X^T (Σ_{i=1}^{3} ξ_i^{-1} P_i) X τ̂ = X^T (Σ_{i=1}^{3} ξ_i^{-1} P_i) y
    ⇒ ( (λ_11/ξ_1) T_1 + (λ_22/ξ_2) T_2 + (λ_32/ξ_3) T_2 ) τ̂ = X^T (ξ_1^{-1} P_1 + ξ_2^{-1} P_2 + ξ_3^{-1} P_3) y

since X^T P_1 X = λ_11 T_1, X^T P_2 X = λ_22 T_2 and X^T P_3 X = λ_32 T_2. Pre-multiplying by T_2 gives

    (λ_22/ξ_2 + λ_32/ξ_3) T_2 τ̂ = T_2 X^T (ξ_2^{-1} P_2 + ξ_3^{-1} P_3) y

Hence

    T_2 τ̂ = (λ_22/ξ_2 + λ_32/ξ_3)^{-1} T_2 X^T (ξ_2^{-1} P_2 + ξ_3^{-1} P_3) y
          = (λ_22/ξ_2 + λ_32/ξ_3)^{-1} [ (λ_22/ξ_2) T_2 τ̂_[2] + (λ_32/ξ_3) T_2 τ̂_[3] ]    (3.4.40)

and

    var(T_2 τ̂) = (λ_22/ξ_2 + λ_32/ξ_3)^{-1} T_2

This estimate combines information across the two strata in which the treatment effects T_2 τ are estimated. The weights in (3.4.40) depend on the stratum variances, which are unknown. Yates (1940) gives an analysis of variance decomposition for this design, which is presented in table 3.14. The sums of squares due to the mean and the sums of squares due to treatments in the rep.block and rep.block.plot strata are given by

    s_m = y^T P_0 y
    s_t2 = (1/λ_22) y^T P_2 X T_2 X^T P_2 y
    s_t3 = (1/λ_32) y^T P_3 X T_2 X^T P_3 y

The residual sum of squares for each stratum is obtained by difference; for example, in the rep.block.plot stratum s_3 = y^T P_3 y − s_t3. Yates (1940) suggests estimating ξ_2 using the residual mean square for the rep.block stratum and ξ_3 using the residual mean square for the rep.block.plot stratum. These estimates are then used to form the combined estimate of T_2 τ. Although intuitively sensible, there are difficulties with this estimation approach, which suggest that the resulting estimates of the stratum variances may not be efficient.

Table 3.14 Analysis of variance for the BIB trial

Strata/Decomposition    d.f.                 S.S.         Expectation of M.S.
Rep                     r                    y^T P_1 y
  Mean                  1                    s_m          ξ_1 + µ(T_1 τ_[1])
  Residual              r − 1                s_1          ξ_1
Rep.block               r(b − 1)             y^T P_2 y
  Treatment             t − 1                s_t2         ξ_2 + µ(T_2 τ_[2])
  Residual              r(b − 1) − t + 1     s_2          ξ_2
Rep.block.plot          rb(p − 1)            y^T P_3 y
  Treatment             t − 1                s_t3         ξ_3 + µ(T_2 τ_[3])
  Residual              rb(p − 1) − t + 1    s_3          ξ_3

Nelder (1968) considered this issue and presented a fully efficient approach, achieved by iterating between estimation of the treatment effects and estimation of the stratum variances in the strata where treatment effects are estimated. REML estimation produces these fully efficient estimates of the stratum variances; the combined estimates of the treatment effects are produced as a by-product of the algorithm, as Empirical Generalised Least Squares (EGLS) estimates (see chapter 5).
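The key quantities of this section are easily computed. A short R sketch evaluates the efficiency factor and effective replications for the grape example, and forms the weights that (3.4.40) attaches to the two stratum estimates; the ξ values below are illustrative stand-ins for REML estimates.

## Efficiency factor and combining weights for the BIB grape example.
t <- 9; r <- 8; b <- 3; p <- 3
E <- t * (p - 1) / (p * (t - 1))   # efficiency factor: 0.75
lambda22 <- r * (1 - E)            # effective replication, rep.block stratum: 2
lambda32 <- r * E                  # effective replication, rep.block.plot stratum: 6
xi2 <- 0.9; xi3 <- 0.3             # illustrative stratum variances only
wts <- c(lambda22 / xi2, lambda32 / xi3)
wts / sum(wts)                     # relative weights on T2 tauhat[2] and T2 tauhat[3]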

3.5 In search of efficient estimation for variance components

The problem of obtaining efficient estimates of variance components in more general designs or settings is now examined. Of the examples presented, those with orthogonal treatment structures present no difficulties for simultaneous estimation of variance components and fixed effects. When there is information on treatment effects in more than one stratum, it is not as straightforward to obtain efficient estimates of variance components, although in the case of generally balanced designs Nelder (1968) presents an iterative approach.

In more complex examples, say in animal breeding, where blocks correspond to groups of related animals, it often happens that blocks will not be of equal size, and then the variance structure implied by the decomposition into an orthogonal block structure is not appropriate. For example, consider a one-way classification with b groups and r_i observations per group (i = 1, ..., b). The decomposition into orthogonal block structure generates

    V = (ξ_0 − ξ_1) 1_n (1_n^T 1_n)^{-1} 1_n^T + (ξ_1 − ξ_2) Z (Z^T Z)^{-1} Z^T + ξ_2 I_n

This implies that the covariance between observations in the same block is inversely proportional to the block size (since Z^T Z = diag(r_i)). It is more usual to assume that the covariances between observations in the same block are all the same, irrespective of block size.

For unbalanced or complex mixed models there is in general no decomposition into orthogonal strata, and hence it is not immediately obvious how to set up sums of squares of residuals for the more general mixed model. Patterson and Thompson (1971) suggest maximising the likelihood of error contrasts, i.e. contrasts with zero expectation and non-zero variance, to estimate the variance components. We have already indicated, in the previous examples for the one-way, randomised complete block and split plot designs, how the so-called Residual Maximum Likelihood (REML) estimates correspond to the ANOVA estimates. In chapter 5 we will present a full account of REML estimation for a wide class of mixed models, which includes the models in this chapter as special cases.
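For such unbalanced settings, REML estimation is available directly in mixed model software. As a hedged illustration (this assumes the lme4 package is installed, and the data are simulated since none are given here), an unbalanced one-way classification can be fitted as follows.

## REML for an unbalanced one-way classification, where no orthogonal
## strata exist and the ANOVA approach breaks down.
library(lme4)
set.seed(7)
ri    <- c(2, 3, 5, 4, 7, 3)                       # unequal group sizes
group <- factor(rep(seq_along(ri), times = ri))
y     <- 10 + 0.7 * rnorm(length(ri))[group] +     # group effects (arbitrary)
         rnorm(sum(ri), sd = 0.5)                  # within-group errors
fit <- lmer(y ~ 1 + (1 | group), REML = TRUE)
VarCorr(fit)   # REML estimates of the group and residual variance components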

3.6 Summary

In this chapter, results for designed experiments have been presented to

• demonstrate the use of orthogonal projections to create strata in balanced designs
• derive the analysis of variance table and thence present an approach for obtaining ANOVA estimates of variance components
• show the equivalence of the ANOVA estimates and the so-called REML estimates of variance components in designs with orthogonal block and treatment structures
• indicate how efficient estimation of variance components (and treatment effects) can be achieved for generally balanced designs
• suggest that the ANOVA and iterated ANOVA approaches to the estimation of variance components cannot be readily extended to more complex data-sets where the design or study is non-orthogonal or unbalanced.

CHAPTER 4

The Linear Mixed Model

To this point we have considered special cases of the linear mixed model. These special cases reduce to ordinary linear models via the transformation to strata, which is only available in situations where we have orthogonal block and treatment structures. In particular, this approach fails for unbalanced data and more general covariance structures.

In this chapter the general formulation of the linear mixed model is presented, and we thereby extend the class of models discussed in previous chapters to those allowing completely general covariance structures for both the random effects and the residual random errors.

4.1 The Model

If y^{n×1} denotes the vector of observations, the linear mixed model can be written as

    y = X τ + Z u + e    (4.1.1)

where τ^{p×1} is the vector of fixed effects, X^{n×p} is the design matrix (parameterised to be of full rank) that associates observations with the appropriate combination of fixed effects, u^{b×1} is the vector of random effects, Z^{n×b} is the design matrix which associates observations with the appropriate combination of random effects, and e^{n×1} is the vector of residual errors. The model (4.1.1) is called a linear mixed model or linear mixed-effects model. It is assumed that

    (u; e) ∼ N( 0, σ_H^2 diag(G, R) )    (4.1.2)

The parameter σ_H^2 is a scale parameter that plays an important role, to be discussed below. The variance models given by the matrices G and R are called G-structures and R-structures respectively. Under these assumptions we have

    y | u ∼ N(X τ + Z u, σ_H^2 R)    (4.1.3)
    u ∼ N(0, σ_H^2 G)    (4.1.4)

so that

    y ∼ N( X τ, σ_H^2 (R + Z G Z^T) )    (4.1.5)

We write

    V = σ_H^2 (R + Z G Z^T) = σ_H^2 H    (4.1.6)

so that

    y ∼ N(X τ, V)    (4.1.7)

Equation (4.1.6) explains the notation σ_H^2; this parameter multiplies the matrix H. Typically G and R are functions of parameters that need to be estimated; these parameters were variances or variance ratios in previous chapters. A general and consistent notation to be used throughout the book is

    G = G(γ)    (4.1.8)

and

    R = σ^2 Σ    (4.1.9)
    Σ = Σ(φ)    (4.1.10)

The vectors γ and φ are parameter vectors associated with the random effects (u) and the residuals (e) respectively. Their precise meaning is discussed below.

4.2 Variance structures for the errors: R-structures

In most cases the vector of residuals represents the errors from a single experiment or a single set of data. In chapter 3, R was a scaled identity matrix, that is R = σ^2 I, so that the errors were assumed independent and identically distributed. In some situations, for example in the analysis of multi-clinic trials, the analysis of animal breeding data across populations (Foulley and Quass, 1995) or the analysis of multi-environment variety trials (Smith et al., 2001a), the vector e will be a series of sub-vectors indexed by a factor or factors. The sub-vectors relate to sections of the data, which in the examples above may be a clinic, a population or a trial.

Thus in general we write e = [e_1^T, e_2^T, ..., e_s^T]^T, so that e_j represents the vector of errors of the jth section of the data. The variance matrix for each section may differ, but we assume that the errors from different sections are independent (if they are not, we can coalesce the dependent components into a single component and hence maintain the independence structure). In matrix terms this gives

    R = ⊕_{j=1}^{s} R_j = diag(R_1, R_2, ..., R_s)

where ⊕ is the direct sum operator. An example of such a structure is presented by Cullis et al. (1998) in the context of the spatial analysis of multi-environment trials. In this case the jth section has variance matrix given by

    R_j = R_j(φ_j) = σ_j^2 Σ_j(ρ_j) + ψ_j I_{n_j}

Each section represents a trial. The variance parameters allow for a different variance for each trial (σ_j^2), and hence heterogeneity, a different correlation structure for each trial (through Σ_j and ρ_j), and a different measurement error term (ψ_j).
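A minimal R sketch of such a direct-sum R-structure follows; the sections are small and the AR(1)-type correlation used for Σ_j is simply one common choice, with all parameter values illustrative.

## A direct-sum R-structure: independent sections, each with its own
## variance, correlation and measurement error parameters.
ar1 <- function(n, rho) rho^abs(outer(1:n, 1:n, "-"))   # one choice of Sigma_j
R1 <- 1.2 * ar1(4, 0.5) + 0.10 * diag(4)   # section 1: sigma_j^2, rho_j, psi_j
R2 <- 0.8 * ar1(3, 0.2) + 0.05 * diag(3)   # section 2 (all values illustrative)
R  <- matrix(0, 7, 7)                      # direct sum R = R1 (+) R2
R[1:4, 1:4] <- R1
R[5:7, 5:7] <- R2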

4.3 Variance structures for the random effects: G-structures

The b × 1 vector of random effects is often composed of q sub-vectors, u = [u_1^T u_2^T ... u_q^T]^T, where the sub-vectors u_i are of length b_i. These sub-vectors are assumed independent and normally distributed with variance matrices σ_H^2 G_i. Thus, as for R, we have

    G = ⊕_{i=1}^{q} G_i = diag(G_1, G_2, ..., G_q)

There is a corresponding partition of Z, namely Z = [Z_1 Z_2 ... Z_q].

4.4 Separability

Complex variance structures arise in many applications, including the analysis of longitudinal data, multivariate analysis and spatial analysis. In some cases, the component matrices R_j (or R itself if there is only one section), or G_i (or G if there is only one G-structure), are related to the underlying structure in the data. To illustrate this we begin with an example of balanced multivariate data.

Suppose we measure p traits or variables on each of nb units (nb > p). To put this in context, suppose the units represent n animals in each of b cattle breeds. If y_j, j = 1, ..., p, represents the data vector of the jth trait or variable, the model considered here is given by

    y_j = D τ_j + B u_j + e_j    (4.4.11)

where D is the fixed effects design matrix, τ_j is the vector of fixed effects for the jth trait, B^{nb×b} is the random effects design matrix, u_j^{b×1} is the vector of random breed effects for the jth trait, and e_j is the vector of residuals. The design matrices are the same for each trait. In this setting, the components of each e_j are assumed independent.

We consider the ith animal, i = 1, 2, ..., nb. If y_(i)^T denotes the row vector of observations on the p traits for this animal, we can write the model

    y_(i)^T = d_i^T [τ_1 τ_2 ... τ_p] + b_i^T [u_1 u_2 ... u_p] + [e_i1 e_i2 ... e_ip]
            = d_i^T T + b_i^T U + e_(i)^T    (4.4.12)

where d_i^T and b_i^T are the ith rows of the matrices D and B respectively, and T = [τ_1, τ_2, ..., τ_p] and U = [u_1, u_2, ..., u_p] are matrices of the fixed and random effects. As (4.4.12) contains observations on the same unit, we assume the random error vector e_(i) has components that are correlated (across the traits), with possibly heterogeneous variances. Thus

    e_(i) ∼ N(0, Σ_p)

where Σ_p is the covariance matrix of the p traits. Similarly, if u_(i) is the vector of random effects for the p traits for the ith breed, we assume

    u_(i) ∼ N(0, G_p)

Both Σ_p^{p×p} and G_p^{p×p} are symmetric, positive definite matrices, each with p(p + 1)/2 unique parameters. A matrix model for the complete data set which combines (4.4.11) and (4.4.12) is then given by

    Y = D T + B U + E    (4.4.13)

where Y^{nb×p} = [y_1, y_2, ..., y_p] and E = [e_1, e_2, ..., e_p]. If we define

    y = vec(Y),  τ = vec(T),  u = vec(U)  and  e = vec(E)

where vec(·) forms a vector by stacking the columns of the matrix argument (see section ??), then (4.4.13) can be written equivalently as

    y = X τ + Z u + e

where X = I_p ⊗ D and Z = I_p ⊗ B. Under the assumptions given above, the variance structures for u and e are therefore given by

    var(u) = G = G_p ⊗ I_b

    var(e) = R = Σ_p ⊗ I_{nb}
    cov(u, e) = 0

where the parameters σ_H^2 and σ^2 of the general formulation (4.1.1) are both set equal to one. The variance models for u and e are called separable because they can be represented by the kronecker product of two matrices. These separable structures arise quite naturally in this example and in essence correspond to underlying factors in the data structure. For the random effects the two factors are the trait and breed variables, while for the random errors the two factors are trait and the observational unit (the animal within each breed, or simply units). This type of separable decomposition arises in other applications and in more complex situations. The concept of separability was introduced by Martin (1979) in the context of lattice processes: Martin (1979) showed that the correlation matrix of a linear-by-linear process observed on an r × c rectangular lattice can be written as the kronecker product of two correlation matrices which relate to the rows and columns of the lattice.
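A small R sketch illustrates the separable forms (with illustrative p, n, b and assumed trait covariance values) and checks the kronecker inverse identity that makes them computationally convenient; the determinant and eigenvalue identities noted at the end of this section can be verified in the same way.

## Separable G- and R-structures for the multivariate breed example.
p <- 2; n <- 5; b <- 3                   # traits, animals per breed, breeds
Gp     <- matrix(c(0.6, 0.2,
                   0.2, 0.4), p, p)      # assumed breed (G) trait covariance
Sigmap <- matrix(c(1.0, 0.3,
                   0.3, 0.5), p, p)      # assumed residual trait covariance
G <- Gp %x% diag(b)                      # G = G_p (x) I_b
R <- Sigmap %x% diag(n * b)              # R = Sigma_p (x) I_nb
## The inverse of a kronecker product is the kronecker product of inverses:
all.equal(solve(R), solve(Sigmap) %x% diag(n * b))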

    y ∼ trait + trait.breed + trait.units

where trait is a factor with p levels which indexes the traits and breed is a factor with b levels which codes the breed for each animal. The residual term is constructed as the interaction between the trait factor and units. The symbolic representation of the R-structure is given by US(trait) x ID(units), where the model acronym US refers to an unstructured variance matrix (a fully parameterized variance model with p(p + 1)/2 parameters) relating to trait and ID refers to an identity variance model relating to units. This notation will be extended and widely used throughout the book. Similarly, the G-structure is given by US(trait) x ID(breed).

Separability is a very useful assumption regarding the form of the variance matrices R and G (or sub-matrices Rj and Gi). Formally, if var(e) = σ²_H R, then the matrix R (and the error process) is said to be separable with two components if

    R = R1 ⊗ R2        (4.4.14)

where Ri (ri × ri) is proportional to the variance matrix for the ith factor defining the data structure. The same definition applies to G-structures, and the definition extends in an obvious way to more than two components. The assumption of separability greatly reduces the computational load. Of particular use in fitting the linear mixed model are the following results (see section ??):

    R⁻¹ = R1⁻¹ ⊗ R2⁻¹ and |R| = |R1|^r2 |R2|^r1

and the eigenvalues of R are the r1 r2 products of the r1 eigenvalues of R1 with the r2 eigenvalues of R2. Separability allows a flexible framework for modelling variance structures in the linear mixed model. Many other examples will be considered in this book where the usual assumptions concerning the stochastic properties of the random effects in the linear mixed model lead naturally to a separable variance matrix.
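These identities are easily verified numerically. A minimal sketch, assuming numpy and using arbitrary illustrative component matrices (not taken from any example in this book):

    import numpy as np

    # Two small positive definite component matrices (illustrative choices)
    R1 = np.array([[1.0, 0.5], [0.5, 1.0]])            # r1 = 2
    R2 = np.array([[2.0, 0.3, 0.0],
                   [0.3, 1.0, 0.2],
                   [0.0, 0.2, 1.5]])                   # r2 = 3
    R = np.kron(R1, R2)
    r1, r2 = R1.shape[0], R2.shape[0]

    # Inverse of the kronecker product is the kronecker product of the inverses
    assert np.allclose(np.linalg.inv(R),
                       np.kron(np.linalg.inv(R1), np.linalg.inv(R2)))

    # |R| = |R1|^r2 * |R2|^r1
    assert np.isclose(np.linalg.det(R),
                      np.linalg.det(R1) ** r2 * np.linalg.det(R2) ** r1)

    # Eigenvalues of R are all r1*r2 products of the component eigenvalues
    ev = np.sort(np.outer(np.linalg.eigvalsh(R1), np.linalg.eigvalsh(R2)).ravel())
    assert np.allclose(np.sort(np.linalg.eigvalsh(R)), ev)

The computational saving is clear: the factorisations required involve matrices of orders r1 and r2 rather than r1 r2.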

4.5 Variance models

There are three types of variance model used for R- and G-structures in this book, namely, correlation models, homogeneous variance models and heterogeneous variance models. A complete list of the variance models used in this book is presented in appendix ??. This appendix also contains a reference to the first use of each variance model in the book.

4.5.1 Correlation models

In correlation models all diagonal elements are identically equal to 1. If C = {cij}, i, j = 1, . . . , n, denotes the n × n correlation matrix for a particular correlation model, then

    C = {cij} :  cii = 1, ∀i;  cij = cji;  |cij| < 1, i ≠ j.

The simplest correlation model is the identity model, for which the off-diagonal elements are identically equal to zero, that is, cij = 0, i ≠ j. Correlation models include those arising in time-series analysis, geostatistics and spatial statistics, as well as more general correlation models such as banded models or the completely general correlation model with p(p − 1)/2 parameters.

4.5.2 Homogeneous variance models

In homogeneous variance models the diagonal elements all have the same positive value, σ² say. If V = {vij}, i, j = 1, . . . , n, is an n × n homogeneous variance matrix, then

    V = {vij} :  vii = σ², ∀i;  vij = vji, i ≠ j.

Note that if V is the homogeneous variance model matrix corresponding to the correlation model matrix C, then

    V = σ²C

and V has just one more parameter than C. For example, the homogeneous variance model corresponding to the identity correlation structure is the simple variance components model, which specifies vii = σ², ∀i, with off-diagonal elements equal to zero. In most software, this is the default variance model for terms classified as random in the linear mixed model.

4.5.3 Heterogeneous variance models

The third variance model is the heterogeneous variance model, for which the diagonal elements are positive but differ. If V = {vij}, i, j = 1, . . . , n, is an n × n heterogeneous variance matrix, then

    V = {vij} :  vii = σi², i = 1, . . . , n;  vij = vji, i ≠ j.

If V is the heterogeneous variance model matrix corresponding to the correlation model matrix C, then

    V = DCD

where D = diag(σi) is n × n. This model has an additional n parameters compared to the base correlation model. For example, the heterogeneous variance model corresponding to the identity correlation model is the diagonal variance model, for which vii = σi², ∀i, with zero off-diagonal elements. Examples include the diagonal variance model, factor analytic, reduced rank and ante-dependence models; the most general is the unstructured model with p(p + 1)/2 parameters.
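The three model types are easily constructed from a correlation matrix. A minimal numpy sketch (the matrices and values are arbitrary illustrative choices):

    import numpy as np

    # An illustrative 3 x 3 correlation matrix C (unit diagonal)
    C = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.0, 0.5],
                  [0.2, 0.5, 1.0]])

    # Homogeneous variance model: V = sigma^2 C (one extra parameter)
    sigma2 = 2.5
    V_hom = sigma2 * C

    # Heterogeneous variance model: V = D C D with D = diag(sigma_i)
    # (n extra parameters relative to the correlation model)
    sigma = np.array([1.0, 1.5, 2.0])
    D = np.diag(sigma)
    V_het = D @ C @ D

    # Diagonals recover sigma_i^2; off-diagonals are sigma_i sigma_j c_ij
    assert np.allclose(np.diag(V_het), sigma ** 2)
    assert np.isclose(V_het[0, 1], sigma[0] * sigma[1] * C[0, 1])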

4.6 Identifiability of variance models

Because of the generality we have attempted to maintain in constructing the variance models for the random effects in the linear mixed model, it is almost inevitable that even the most experienced user will encounter problems of identifiability of variance models. The cause of non-identifiability can be hard to diagnose. In principle, the issues are akin to those of ensuring that the fixed effects model is not over-parameterised: a variance model may not be identifiable because it is over-parameterised (the analogue of intrinsic aliasing in the fixed effects model), or there may be insufficient data to estimate the parameters of the chosen variance model (the analogue of extrinsic aliasing in the fixed effects model). There are some general principles which can be useful in avoiding over-parameterisation of variance models, and in the following we present some of these by way of example.

4.6.1 Variance components or variance ratios

In this section we assume Σ = In, that is, the error variance model is a scaled identity. This occurs in many applications, and some examples were considered in chapter 3. The variance structure is therefore given by

    σ²_H H = σ²_H (σ²In + ZGZ′)        (4.6.15)

This variance model is over-parameterised because the residual variance σ² cannot be estimated separately from σ²_H. There are several ways of overcoming this. If σ²_H is set to one, then the variance matrix for y is

    σ²In + ZGZ′

A consequence of this parameterization is that G must now be a variance matrix. For example, for the one-way classification and the RCB design, G = σu²Ib. For the split plot design, G is the direct sum of 2 sub-matrices, one for blocks and one for whole-plots. That is,

    G = ⊕²ᵢ₌₁ Gi = [ σb²Ib    0
                     0        σw²Ibw ]

Setting σ²_H = 1 implies that the variance parameters are variance components.

If instead we set σ² = 1, then σ²_H is an overall scale parameter and is equal to the residual variance. As a consequence of this parameterization, the matrix G cannot be a variance matrix. Again, for the one-way classification and the RCB design, G = γu Ib where γu = σu²/σ²_H. Similarly, for the split plot design

    G = ⊕²ᵢ₌₁ Gi = [ γb Ib    0
                     0        γw Ibw ]

where γb = σb²/σ²_H and γw = σw²/σ²_H. Thus the parameters in G are variance ratios under this parameterization.

Lastly, it is clear that σ² and σ²_H cannot both be set to one, as the residual variance would then be fixed at one. To summarize, just as for fixed effects, the parameterization chosen has implications for identifiability and for the interpretation of the parameters.

4.6.2 Non-identity R-structure

When Σ = Σ(φ), the scale parameters σ² and σ²_H can either be separately set to one or jointly set to one. As in the previous section, both cannot be estimated in the same model.

If σ²_H = 1 and σ² ≠ 1, then var(e) = σ²Σ and thus Σ must be a scaled variance matrix or a correlation matrix. For example, if a component matrix of Σ is a diagonal matrix of variances, then one of its elements must be fixed to ensure identifiability. On the other hand, G must be a variance matrix, since var(u) = G.

If σ²_H ≠ 1 and σ² = 1, then var(e) = σ²_H Σ and var(u) = σ²_H G. Thus both Σ and G must be scaled variance matrices or correlation matrices.

Lastly, if σ² = 1 and σ²_H = 1, then var(e) = Σ and var(u) = G. Thus both Σ and G must be variance matrices and their parameter vectors must include at least one scale parameter. The most common applications for this case are in the analysis of multivariate data or repeated measures analysis with heterogeneous variances.

4.7 Combining variance models

When either R or G is formed from the kronecker product of several sub-matrices, some general rules must be obeyed to avoid over-parameterisation. In the following we consider models with two components for G and R and use Ci and Vi, i = 1, 2, to denote arbitrary correlation and variance matrices.

1. If Σ = C1 ⊗ C2 then Σ is a valid correlation model, and so a scale parameter (either σ² or σ²_H) must be included in the variance model. If G = C1 ⊗ C2 then a scale parameter should be included in the parameter vector γ regardless of the status of σ² and σ²_H.

2. If Σ = C1 ⊗ V2 or Σ = V1 ⊗ C2 then Σ is a variance matrix, in which case neither σ² nor σ²_H can be estimated. If G = C1 ⊗ V2 or G = V1 ⊗ C2 then G is a variance matrix. This usually coincides with σ²_H = 1.

3. If Σ or G = V1 ⊗ V2 = V then Σ or G is an over-parameterised variance matrix, which would necessitate fixing one of the variance parameters in V1 or V2, as illustrated in the sketch below.
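Rule 3 reflects a simple scale confounding, which the following numpy sketch makes explicit (arbitrary illustrative matrices):

    import numpy as np

    V1 = np.array([[1.0, 0.3], [0.3, 2.0]])
    V2 = np.array([[1.5, 0.4], [0.4, 1.0]])

    # Rescaling V1 by c and V2 by 1/c leaves the kronecker product unchanged,
    # so the overall scales of V1 and V2 are not separately identifiable.
    c = 3.7
    assert np.allclose(np.kron(V1, V2), np.kron(c * V1, V2 / c))

    # Fixing one parameter (here the (1,1) element of V1 set to one)
    # removes the redundancy.
    V1_constrained = V1 / V1[0, 0]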

4.8 Summary

In this chapter we have introduced the general form of the linear mixed model and described the range of models which are either used in this book or available in the software packages which we use to undertake the analyses. The main concepts that have been introduced in the context of the linear mixed model are

• R- and G-structures, their general definition and structure
• the assumption of separability for the variance models used in the R- and G-structures
• the combination of variance models both within and between R- and G-structures, and how the form of these models (ie as variance or correlation models) relates to the presence of the overall scaling parameter σ²_H.

As a useful summary of the issues discussed in sections 4.6 and 4.7, we present in table 4.1 the possible variance models for y which can be obtained by altering the type of scale parameters in the variance model. As we have seen, the two scale parameters, namely σ² and σ²_H, can each be either fixed (usually to one) or free, which means the parameter must be estimated from the data. Generally, the overall scale parameter σ²_H controls whether we estimate variance components or variance component ratios. There may be some computational savings in the iteration process using this parameterisation, as given estimates of γ and φ the score equation for σ²_H has an algebraic solution (see chapter ??). In some cases, however, it does not make sense to include an overall scale parameter. We have discussed some examples where these types of variance models may be used.

Table 4.1 Summary of the variance models

Σ = I
σ²_H    σ²      Description
Fixed   Fixed   Not admissible, since there is no residual variance
Fixed   Free    var(e) = σ²I; γ are variance components
Free    Fixed   var(e) = σ²_H I; γ are variance component ratios
Free    Free    σ² and σ²_H are not identifiable

Σ = Σ(φ)
σ²_H    σ²      Description
Fixed   Fixed   var(e) = Σ, so φ is a vector of components, eg parameters for unstructured or antedependence models; γ are variance components
Fixed   Free    var(e) = σ²Σ, so φ are ratios relative to σ²; γ are variance components
Free    Fixed   var(e) = σ²_H Σ, so φ are ratios relative to σ²_H; γ are variance component ratios
Free    Free    σ² and σ²_H are (generally) not identifiable

CHAPTER 5

Estimation

The linear mixed model is composed of fixed effects τ, random effects u and variance parameters σ²_H and κ = (γ′, σ², φ′)′. Likelihood methods are used for estimation of the fixed effects and variance parameters. The prediction of the random effects is sometimes of interest, and this can be considered to be a post-estimation process, in a manner to be discussed below. The principles involved in estimation and prediction do not necessarily provide an efficient algorithm to achieve those aims. Thus this chapter begins with the principles of estimation and prediction and then presents an efficient algorithm. Residual maximum likelihood is presented in this chapter as a formal method of estimation of variance parameters. In the process, estimation of fixed effects is also achieved. This leaves the problem of prediction, which can be approached in a number of ways. The computational strategy for efficient estimation is also presented in this chapter. The approach has important consequences for prediction using the linear mixed model, and these extensions will be discussed in chapter 7.

5.1 Estimation of fixed effects and variance parameters

Estimation of the variance parameters by residual maximum likelihood is considered first; estimation of the fixed effects and prediction of the random effects then follow, via the mixed model equations.

5.2 Estimation of variance parameters

As we saw in chapter 3, when the variance parameters of the linear mixed model are simple components, that is, Gi = γi Iqi ∀i, the block structure is orthogonal, the treatment structure is orthogonal and the variance matrix for the errors is σ²In, then we can estimate the stratum variances (and hence variance components) by equating residual mean squares from an ANOVA table with their expectations. This method is attributed to R.A. Fisher (Searle et al., 1992) and has been widely used for many years. There are many applications, however, where the data are unbalanced and/or we wish to model the (co)variation in the data by using more complex variance structures. Many authors have suggested using extensions of the ANOVA methods described in chapter 3 for variance component estimation in unbalanced data. Searle et al. (1992) give an exhaustive account of three such approaches, which were originally proposed by Henderson (1953). The three methods of estimation have become known as Henderson's methods I, II and III. Method I uses quadratic forms which are analogous to the sums

of squares of generally balanced designs with orthogonal treatment structure; Method II is an adaptation of Method I which takes account of fixed effects in the model; Method III uses sums of squares from fitting the full mixed model (and sub-models thereof) as though all terms were fixed effects. These techniques have become superseded by Maximum Likelihood (ML) or, more recently, Residual Maximum Likelihood (REML). There are several reasons for this. Firstly, the original methods were proposed before the advent of high speed computers, and it was therefore important to have an approach which was not computationally intensive. This is no longer an issue, with the proliferation of efficient mixed models software and the high capacity computing power available to most researchers. The other attraction of REML (and ML) is that it provides the framework for variance parameter estimation in a much wider class of variance models than simple variance components. The paper by Patterson and Thompson (1971) is the original reference for REML. REML takes into account the degrees of freedom associated with the estimation of fixed effects, so that REML estimates of variance parameters are less biased than ML estimates. As we have indicated, REML estimates coincide with ANOVA based estimates for orthogonal block and treatment structures. We will only consider REML estimation in this book. Other texts in this area, such as Searle et al. (1992) and Verbeke and Molenberghs (2000), cover ML estimation.

5.2.1 The residual log-likelihood function

Recall that if y (n × 1) denotes the vector of observations, the linear mixed model can be written as

    y = Xτ + Zu + e        (5.2.1)

where τ (p × 1) is the vector of fixed effects, X (n × p) is the design matrix (parameterised to be of full rank) that associates observations with the appropriate combination of fixed effects, u (b × 1) is the vector of random effects, Z (n × b) is the design matrix that associates observations with the appropriate combination of random effects, and e (n × 1) is the vector of residual errors. We assume

    [ u ]        ( [ 0 ]        [ G  0 ] )
    [ e ]  ∼  N  ( [ 0 ], σ²_H  [ 0  R ] )        (5.2.2)

where G = G(γ) and R = σ²Σ, Σ = Σ(φ). The vectors γ and φ are vectors of variance parameters associated with the random effects and residuals respectively. The distribution of the data is thus Gaussian with mean Xτ and variance matrix V = σ²_H H where H = R + ZGZ′.

Result 5.1 The residual log-likelihood for the model in (5.2.1) is given by

    ℓ_R = ℓ(σ²_H, κ; y2)
        = −½ { (n − p) log σ²_H + log |H| + log |X′H⁻¹X| + y′Py/σ²_H }        (5.2.3)

where y2 = L2′y, L2 is an n × (n − p) matrix with full column rank chosen such that L2′X = 0, and P = H⁻¹ − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹.

Proof: Verbyla (1990) presented an illuminating derivation of the Patterson and Thompson (1971) residual likelihood. He partitions the full likelihood for the mixed model in (5.2.1) into two independent parts: one relates to the treatment (fixed effect) contrasts Xτ (there are p such effects) and the other to the residual contrasts Zu + e, that is, contrasts whose expectation is zero (there are n − p independent error contrasts). Maximization of the former provides estimates of the fixed effects, whereas maximization of the residual likelihood provides estimates of the variance parameters and the random effects.

Verbyla (1990) considers a non-singular matrix L = [L1 L2], where L1 (n × p) and L2 (n × (n − p)) are matrices chosen to satisfy L1′X = Ip and L2′X = 0. The distribution of the transformed data L′y = [y1′ y2′]′, say, is given by

    [ y1 ]        ( [ τ ]        [ L1′HL1  L1′HL2 ] )
    [ y2 ]  ∼  N  ( [ 0 ], σ²_H  [ L2′HL1  L2′HL2 ] )        (5.2.4)

The likelihood of L′y can be expressed as the product of the conditional likelihood of y1 given y2 and the marginal likelihood of y2. From (5.2.4) the marginal distribution of y2 is

    y2 ∼ N(0, σ²_H L2′HL2)

and using result ?? the conditional distribution of y1 given y2 is normal, with mean

    E(y1|y2) = τ + L1′HL2 (L2′HL2)⁻¹ y2

and variance

    var(y1|y2) = σ²_H [ L1′HL1 − L1′HL2 (L2′HL2)⁻¹ L2′HL1 ]

Using result ?? and the fact that L1′X = Ip, this can be written as

    y1|y2 ∼ N( τ + y2*, σ²_H (X′H⁻¹X)⁻¹ )

where y2* = L1′HL2 (L2′HL2)⁻¹ y2. The associated log-likelihood functions (excluding constant terms) are given by

    ℓ_R = ℓ(σ²_H, κ; y2)
        = −½ { (n − p) log σ²_H + log |L2′HL2| + y2′(L2′HL2)⁻¹y2/σ²_H }
        = −½ { (n − p) log σ²_H + log |L2′HL2| + y′L2(L2′HL2)⁻¹L2′y/σ²_H }        (5.2.5)

and

    ℓ1 = ℓ(τ, σ²_H, κ; y1|y2)
       = −½ { p log σ²_H + log |(X′H⁻¹X)⁻¹| + (y1 − τ − y2*)′ (X′H⁻¹X) (y1 − τ − y2*)/σ²_H }        (5.2.6)

Clearly the likelihood of y2 contains no information on τ so that τ must be estimated from the conditional distribution of y1 given y2. From (5.2.6) and the derivative results in (??) the MLE of τ is obtained as the solution to

    ∂ℓ1/∂τ = (X′H⁻¹X) ( y1 − τ − L1′HL2 (L2′HL2)⁻¹ y2 ) / σ²_H = 0

This gives

    τ̂ = y1 − L1′HL2 (L2′HL2)⁻¹ y2
       = L1′ ( I − HL2 (L2′HL2)⁻¹ L2′ ) y

       = L1′ ( H − HL2 (L2′HL2)⁻¹ L2′H ) H⁻¹ y
       = (X′H⁻¹X)⁻¹ X′H⁻¹ y

using result ?? and the fact that L1′X = Ip.

The likelihood of y1 given y2 is a function of τ, σ²_H and κ, but since τ and y1 are both vectors of length p, once τ has been estimated there is no information left to estimate σ²_H and κ. The variance parameters σ²_H and κ = (γ′, σ², φ′)′ are therefore estimated using the marginal likelihood of y2, that is, the residual likelihood. Since ℓ(τ, σ²_H, κ; L′y) = ℓ(σ²_H, κ; y2) + ℓ(τ, σ²_H, κ; y1|y2), the determinants can be similarly partitioned:

    log |L′HL| = log |L2′HL2| + log |(X′H⁻¹X)⁻¹|
    ⇒ log |L2′HL2| = log |L′L| + log |H| + log |X′H⁻¹X|

Now |L′L| does not involve σ²_H or κ, and so the log-likelihood in (5.2.5) can be written (ignoring constants) as

    ℓ_R = −½ { (n − p) log σ²_H + log |H| + log |X′H⁻¹X| + y′L2(L2′HL2)⁻¹L2′y/σ²_H }

Using result ??, L2(L2′HL2)⁻¹L2′ = P, so the residual log-likelihood can be written as

    ℓ_R = −½ { (n − p) log σ²_H + log |H| + log |X′H⁻¹X| + y′Py/σ²_H }        (5.2.7)

as required. □

5.2.2 REML score equations

The REML estimates of σ²_H and κ = (γ′, σ², φ′)′ are obtained by solving the system of equations (known as score equations):

    U_R(σ²_H) = ∂ℓ_R/∂σ²_H = 0
    U_R(κi) = ∂ℓ_R/∂κi = 0

for i = 1, . . . , nk, where nk is the number of variance parameters in κ.

Result 5.2 The score for σ²_H is given by

    U_R(σ²_H) = −½ { (n − p)/σ²_H − y′Py/σ⁴_H }

Hence it follows that the REML estimate of σ²_H, given κ, is

    σ̂²_H = y′Py/(n − p)

Result 5.3 The score for κi is given by

    U_R(κi) = −½ { tr(PḢi) − y′PḢiPy/σ²_H }        (5.2.8)

where Ḣi = ∂H/∂κi.

Proof: First consider the derivative of the log determinants in (5.2.3). Using the derivative results in section ??,

    ∂ log |H|/∂κi + ∂ log |X′H⁻¹X|/∂κi
      = tr(H⁻¹Ḣi) + tr( (X′H⁻¹X)⁻¹ ∂(X′H⁻¹X)/∂κi )
      = tr(H⁻¹Ḣi) − tr( (X′H⁻¹X)⁻¹ X′H⁻¹ḢiH⁻¹X )

      = tr( H⁻¹Ḣi − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹Ḣi )
      = tr(PḢi)        (5.2.9)

Now consider the derivative of the sum of squares in (5.2.3):

    ∂y′Py/∂κi = y′(∂P/∂κi)y = −y′PḢiPy        (5.2.10)

using result ??. Combining (5.2.9) and (5.2.10) gives the result as required. □
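For small problems, (5.2.3) and (5.2.8) can be evaluated directly from their definitions. The following sketch is a naive illustration only (the function names and the one-way example are ours, not taken from any software described in this book; efficient implementations work through the mixed model equations, as discussed in section 5.3):

    import numpy as np

    def reml_pieces(y, X, H, sigma2H=1.0):
        """Residual log-likelihood (5.2.3), up to a constant, computed naively."""
        n, p = X.shape
        Hi = np.linalg.inv(H)
        XtHiX = X.T @ Hi @ X
        P = Hi - Hi @ X @ np.linalg.inv(XtHiX) @ X.T @ Hi
        yPy = float(y @ P @ y)
        ldH = np.linalg.slogdet(H)[1]
        ldX = np.linalg.slogdet(XtHiX)[1]
        lR = -0.5 * ((n - p) * np.log(sigma2H) + ldH + ldX + yPy / sigma2H)
        return lR, P, yPy

    def score_kappa(y, P, Hdot, sigma2H=1.0):
        """Score for a variance parameter kappa_i, equation (5.2.8)."""
        return -0.5 * (np.trace(P @ Hdot) - float(y @ P @ Hdot @ P @ y) / sigma2H)

    # One-way classification sketch: H = I_n + gamma_u Z Z'
    rng = np.random.default_rng(1)
    b, r = 4, 5
    Z = np.kron(np.eye(b), np.ones((r, 1)))
    X = np.ones((b * r, 1))
    H = np.eye(b * r) + 0.8 * Z @ Z.T    # gamma_u = 0.8
    y = rng.standard_normal(b * r)

    lR, P, yPy = reml_pieces(y, X, H)
    print(lR, score_kappa(y, P, Z @ Z.T))   # dH/d gamma_u = Z Z'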

5.2.3 REML likelihood for the split plot trial

In this section we derive the REML log-likelihood for the split plot example presented in section 3.3. Since the development for the split plot example was in terms of variance components, not ratios, we simply set σ²_H = 1. Thus the variance parameter vector γ is a vector of variance components. In the following, for consistency with the notation used in section 3.3, we denote these by γ1 = σb² and γ2 = σw². Recall that since the design has an orthogonal block structure,

    H = Σ³ᵢ₌₁ ξi Pi

where

    ξ1 = wsσb² + sσw² + σ²,  ξ2 = sσw² + σ²,  ξ3 = σ²

and

    P1 = Ib ⊗ Aw ⊗ As

    P2 = Ib ⊗ Bw ⊗ As

    P3 = Ib ⊗ Iw ⊗ Bs

and further we define P0 = Ab ⊗ Aw ⊗ As. The REML log-likelihood is

    ℓ_R = −½ { log |H| + log |X′H⁻¹X| + y′Py }

Now

    log |H| = Σ³ᵢ₌₁ νi log ξi

            = b log ξ1 + b(w − 1) log ξ2 + bw(s − 1) log ξ3

since νi is the rank of Pi. Also, using the property of general balance, it can be shown that

    X′H⁻¹X = bT1/ξ1 + bT2/ξ2 + bT3/ξ3 + bT4/ξ3        (5.2.11)

and so, ignoring a constant,

    log |X′H⁻¹X| = −log ξ1 − (w − 1) log ξ2 − (s − 1) log ξ3 − (s − 1)(w − 1) log ξ3

Obtaining y′Py is a little tedious; however, after some algebra it can be shown that

    y′Py = y′ [ ξ1⁻¹ (Bb ⊗ Aw ⊗ As) + ξ2⁻¹ (Ib ⊗ Bw ⊗ As − Ab ⊗ Bw ⊗ As)
              + ξ3⁻¹ (Ib ⊗ Iw ⊗ Bs − Ab ⊗ Aw ⊗ Bs − Ab ⊗ Bw ⊗ Bs) ] y
         = s1/ξ1 + s2/ξ2 + s3/ξ3

since, for example, the sum of squares due to the fungicide treatment is given by λ22(T2τ̂[2])′(T2τ̂[2]), which equals

    y′(Ab ⊗ Bw ⊗ As)y

Gathering terms we get the REML log-likelihood

    ℓ_R = −½ { (b − 1) log ξ1 + (b − 1)(w − 1) log ξ2 + (b − 1)w(s − 1) log ξ3 + s1/ξ1 + s2/ξ2 + s3/ξ3 }

This is the same as the likelihood given in 3.3.32. Differentiation with respect to the stratum variances ξi, i = 1, 2, 3, leads to the usual ANOVA estimates, as required.

5.3 Estimation of fixed and random effects

5.3.1 A little on prediction

Suppose we have a random variable T we wish to predict using data y. For example, T might be u from the linear mixed model (4.1.1). The best predictor of T, denoted by T̃ = T̃(y), is the predictor minimizing the prediction mean square error (MSE)

    min_{T̃} E[ {T − T̃(y)}² ]

The MSE can be expressed in the following manner using conditional expectations (see section ??):

    E[ (T − T̃(y))² ] = E[ E{ (T − T̃(y))² | y } ]
      = E[ E(T²|y) − 2E(T|y)T̃(y) + T̃(y)² ]
      = E[ var(T|y) + {E(T|y)}² − 2E(T|y)T̃(y) + T̃(y)² ]
      = E[ var(T|y) ] + E[ {T̃(y) − E(T|y)}² ]
      ≥ E[ var(T|y) ]

Thus the MSE is minimized if and only if T̃(y) = E(T|y), and the best predictor of T is the conditional expectation

    T̃(y) = E(T|y)        (5.3.12)

This is a very important result and, in conjunction with the normality assumption, allows the specification of predictors and their distribution in a relatively simple manner.

5.3.2 Prediction in linear mixed models

In all the linear mixed models presented in chapter 3, except the malt run example, the emphasis of the analysis was the efficient estimation of fixed treatment effects. The presence of the random effects in the mixed model resulted from the stratification of the experimental units and the subsequent restricted randomisation of units to treatments. In the malt run example, the random effects in the linear model resulted in a natural decomposition of the total variation in diastatic power into variation between malt runs and variation within malt runs. At no stage have we been concerned with "estimating" the random effects. We use the term "estimate" in a loose sense, in that we cannot estimate a random variable in the same way we estimate a fixed parameter. To illustrate these concepts, we again consider the one-way random effects model, but now in a context where the estimation of the random effects is of interest. The model is given by

    yij = μ + ui + eij  or  y = 1nμ + Zu + e        (5.3.13)

The model (5.3.13) is applicable to animal breeding data where, for example, yij may be the milk yield of the jth dairy cow, who is the daughter of the ith bull. The bull effects ui are regarded as random effects. For simplicity we assume there are r dairy cows for each of b bulls, so we have a total of n = rb records. As before, we assume that ui ∼ N(0, σu²) and eij ∼ N(0, σ²), and in addition that ui and eij are statistically independent for all i and j. Further, for consistency with the development given earlier, we set σ²_H = 1, although in general we would prefer to use R = In and σ²_H = σ². In this example the focus is not only to estimate the variance components: more importantly, perhaps, it is of interest to predict the ui, as this measures the performance, or more correctly the genetic merit, of the ith bull in terms of its (inherited) milk production. This may then be used to select bulls for mating. However, as we noted, ui is a random effect, and so the best we can do, in terms of minimum mean squared error of prediction, is to consider its expected value given the data. That is, we seek to estimate the conditional mean E(ui|y). To do this we use the results of section 5.3.1. Firstly we find the joint distribution of y and u. We present two alternate derivations. The first uses the orthogonal block structure and is useful to link with the approach used in chapter 3. The second approach is useful as it presents results in a mixed model setting rather than the ANOVA setting. The joint distribution of y and u is

    [ y ]        ( [ 1nμ ]   [ H       σu²Z   ] )
    [ u ]  ∼  N  ( [ 0   ],  [ σu²Z′   σu²Ib  ] )        (5.3.14)

where var(y) = H and recall that Z (n × b) = Ib ⊗ 1r. Also we recall from (3.1.6) that

    H = ξ1P1 + ξ2P2

where ξ1 = rσu² + σ², ξ2 = σ² and

    P1 = Ib ⊗ Ar,  P2 = Ib ⊗ Br

and hence

    H⁻¹ = ξ1⁻¹P1 + ξ2⁻¹P2
    Z′H⁻¹ = ξ1⁻¹Z′

Alternatively, we can derive this using results on the inverse of H. Thus if we let γu = σu²/σ² then

    σ²H⁻¹ = In − Z(Z′Z + γu⁻¹Ib)⁻¹Z′

and hence

    σ²Z′H⁻¹ = Z′ − Z′Z(Z′Z + γu⁻¹Ib)⁻¹Z′
            = [ (Z′Z + γu⁻¹Ib) − Z′Z ] (Z′Z + γu⁻¹Ib)⁻¹Z′
            = γu⁻¹ [ (r + γu⁻¹)Ib ]⁻¹ Z′
            = { γu⁻¹/(r + γu⁻¹) } Z′
            = σ²Z′/ξ1

Hence, using result ??, we see that

    ũ = E(u|y) = 0 + σu²Z′H⁻¹(y − 1nμ)
               = { σu²/(rσu² + σ²) } Z′(y − 1nμ)

Note, however, that this result depends on the unknown parameter μ. It is intuitively obvious that we replace μ by ȳ; however, it is not sufficient to condition on y. We need to condition on the component of y that is free of μ. This is the part of the data used for the REML estimation of the variance components, that is y2, and details are given in this chapter (see result 5.9). For the moment we shall replace μ by ȳ, giving

    ũ = { σu²/(rσu² + σ²) } Z′(y − 1nȳ)
      = { rσu²/(rσu² + σ²) } (ȳ_b − 1b ȳ)

where ȳ_b = (ȳ1, . . . , ȳb)′ is the b × 1 vector of bull means. This predictor ũ is in fact the Best Linear Unbiased Predictor (BLUP) of u. BLUPs are predictors of the realised values of the random variables (effects) u which are

• linear functions of the data,
• unbiased, in the sense that the expected value of the estimate is equal to the expected value of the quantity being estimated,
• best, in the sense that they have minimum mean squared error within the class of linear unbiased estimators, and
• predictors, to distinguish them from estimators of fixed effects.

In this example it is clear that the BLUP is a shrinkage estimate: compared to the fixed effect estimate ȳi − ȳ, the BLUP is shrunk towards the (prior) mean of zero. As σu² becomes large relative to σ², the BLUP of ui tends to the fixed effect solution, while for small σu² relative to σ² the BLUP of ui tends towards zero, the assumed initial mean. Thus the BLUP represents a weighted mean of the fixed effect estimate and the prior mean of the ui; a numerical sketch follows below. Note also that the BLUPs in this simple case sum to zero. This is essentially because the unit vector defining X can be found by summing the columns of the Z matrix. This linear dependence of the matrices translates to constraints on the BLUPs. This constraint occurs whenever the column space of X is contained in the column space of Z. The constraint is more complex with correlated random effects. We shall return to this later.
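The shrinkage behaviour just described is easily seen numerically. A minimal sketch, assuming numpy, using simulated data with the variance components treated as known (all names and values are ours, chosen for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    b, r = 6, 4                      # b bulls, r daughters per bull
    sigma2_u, sigma2 = 1.0, 2.0      # variance components (assumed known here)

    u = rng.normal(0.0, np.sqrt(sigma2_u), b)
    y = 10.0 + np.repeat(u, r) + rng.normal(0.0, np.sqrt(sigma2), b * r)

    ybar_i = y.reshape(b, r).mean(axis=1)   # bull means
    ybar = y.mean()                          # grand mean

    # The BLUP shrinks each fixed-effect style deviation towards zero
    shrink = r * sigma2_u / (r * sigma2_u + sigma2)
    u_fixed = ybar_i - ybar
    u_blup = shrink * u_fixed

    print(np.round(u_fixed, 3))
    print(np.round(u_blup, 3))
    print(u_blup.sum())   # numerically zero, as noted above

We now present a general result concerning prediction in the general linear mixed model (4.1.1).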

Result 5.4 Consider the prediction of the linear combination c1′τ + c2′u of fixed and random effects, where c1 is p × 1 and c2 is b × 1. If σ²_H and H are known, implying σ²_H and κ are known, the predictor which has the minimum mean square error (MSE) among the class of linear unbiased predictors is given by c1′τ̂ + c2′ũ where

    τ̂ = (X′H⁻¹X)⁻¹X′H⁻¹y        (5.3.15)
    ũ = GZ′Py        (5.3.16)

Proof: Let a′y be an unbiased linear predictor of c1′τ + c2′u, so that

    E(a′y) = E(c1′τ + c2′u)
    ⇒ a′Xτ = c1′τ
    ⇒ X′a = c1        (5.3.17)

The MSE is given by

    MSE = E[ (a′y − c1′τ − c2′u)² ]
        = E[ (a′y − c2′u)² ] + E[ (c1′τ)² ] − 2E(a′y − c2′u)E(c1′τ)
        = var(a′y − c2′u) + {E(a′y − c2′u)}² + E[ (c1′τ)² ] − 2E(a′y − c2′u)E(c1′τ)
        = var(a′y − c2′u) + 2(c1′τ)² − 2(c1′τ)²    using (5.3.17)
        = var(a′y − c2′u)
        = var(a′y) + var(c2′u) − a′cov(y, u)c2 − c2′cov(u, y)a
        = σ²_H ( a′Ha + c2′Gc2 − a′ZGc2 − c2′GZ′a )
        = σ²_H ( a′Ha + c2′Gc2 − 2a′ZGc2 )

To minimise this with respect to a, subject to (5.3.17), Lagrange multipliers are used. The function to be minimised is

    M = MSE + 2σ²_H λ′(c1 − X′a)

where λ (p × 1) is a vector of multipliers. Using results on the derivative of matrices (see section ??),

    ∂M/∂a = 2σ²_H ( Ha − ZGc2 − Xλ )
    ∂M/∂λ = 2σ²_H ( c1 − X′a )

Equating the derivatives to zero gives:

    a = H⁻¹(ZGc2 + Xλ)        (5.3.18)
    c1 = X′a

Then (5.3.17) and (5.3.18) give

    λ = (X′H⁻¹X)⁻¹ ( c1 − X′H⁻¹ZGc2 )

Substituting this expression in (5.3.18) gives

    a = H⁻¹ZGc2 + H⁻¹X(X′H⁻¹X)⁻¹c1 − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹ZGc2
      = PZGc2 + H⁻¹X(X′H⁻¹X)⁻¹c1

    ⇒ a′y = c1′(X′H⁻¹X)⁻¹X′H⁻¹y + c2′GZ′Py
          = c1′τ̂ + c2′ũ

as required. □

As a consequence, the minimum variance estimate of τ is obtained by setting c2 to zero and taking the sequence of vectors c1 = (1, 0, . . . , 0)′, . . . , c1 = (0, 0, . . . , 1)′. This leads to τ̂ as the best linear unbiased estimator (BLUE) of τ. It is also the generalised least squares (GLS) estimate of τ. A similar process leads to ũ as the best linear unbiased predictor (BLUP) of u. This derivation of BLUP does not explicitly provide the link with the estimation of the variance parameters through REML. This link is provided by considering the original derivation given by Henderson (1950). Henderson (1950) described the BLUP estimates as being "joint maximum likelihood estimates". Later, Henderson (1973) retracted this statement and suggested that this terminology should not be used, as the function being maximised is not a likelihood. Robinson (1991) provides an excellent account of this and many other aspects concerning the prediction of random effects.

5.3.3 Mixed Model Equations

Following Henderson's (1950) suggestion we will derive the estimates τ̂ and ũ in (5.3.15) and (5.3.16) by maximising a function derived from the joint distribution of y and u. The latter is given by:

    [ y ]        ( [ Xτ ]        [ H    ZG ] )
    [ u ]  ∼  N  ( [ 0  ], σ²_H  [ GZ′  G  ] )        (5.3.19)

The log-density function for (y, u) can be written as

    log f_Y(y | u; τ, σ²_H, σ², φ) + log f_U(u; σ²_H, γ)        (5.3.20)

This is the log-joint distribution of (y, u). It is not a log-likelihood, as u is not observed. From (5.3.19) the marginal distribution of u is u ∼ N(0, σ²_H G) and the conditional distribution of y given u is y|u ∼ N(Xτ + Zu, σ²_H R). The function in (5.3.20), ignoring the constant term, is then given by

    −½ { n log σ²_H + log |R| + (y − Xτ − Zu)′R⁻¹(y − Xτ − Zu)/σ²_H }
    −½ { b log σ²_H + log |G| + u′G⁻¹u/σ²_H }

which equals

    −½ { (n + b) log σ²_H + log |R| + log |G| + (y − Xτ)′R⁻¹(y − Xτ)/σ²_H }
    −½ { u′(Z′R⁻¹Z + G⁻¹)u − 2(y − Xτ)′R⁻¹Zu } /σ²_H

The vectors of fixed and random effects (τ and u) can be "estimated" by maximising this function. Differentiation with respect to τ and u and setting to zero leads to

    X′R⁻¹(y − Xτ̂) − X′R⁻¹Zũ = 0
    Z′R⁻¹(y − Xτ̂) − (Z′R⁻¹Z + G⁻¹)ũ = 0

This system of equations is known as the mixed model equations (MME), as proposed by Henderson (1950, 1973). They can be written in matrix-vector notation as:

    [ X′R⁻¹X   X′R⁻¹Z       ] [ τ̂ ]   [ X′R⁻¹y ]
    [ Z′R⁻¹X   Z′R⁻¹Z + G⁻¹ ] [ ũ ] = [ Z′R⁻¹y ]        (5.3.21)

A more abbreviated representation of the MMEs, which is used in this book, is

    Cβ̃ = W′R⁻¹y        (5.3.22)

where W = [X Z], β′ = [τ′ u′] and

    C = W′R⁻¹W + G*,  G* = [ 0  0
                             0  G⁻¹ ]

In the following we show that the solutions to the MMEs are in fact equivalent to the estimates τ̂ and ũ in (5.3.15) and (5.3.16). In chapter ?? we illustrate how this result allows for a unified computing strategy for both variance parameter estimation and the estimation and prediction of fixed and random effects, centred on the mixed model equations.

Result 5.5 The estimates τ̂ and ũ in (5.3.15) and (5.3.16) can be obtained as solutions to the mixed model equations.

Proof: Write the coefficient matrix in (5.3.21) as

    C = [ C_XX  C_XZ
          C_ZX  C_ZZ ]        (5.3.23)

where the partitioning is conformal with the fixed and random effects design matrices X and Z. Thus the MMEs are given by

    C_XX τ̂ + C_XZ ũ = c_Xy

    C_ZX τ̂ + C_ZZ ũ = c_Zy

where for convenience of notation we write c_Xy = X′R⁻¹y and c_Zy = Z′R⁻¹y. These equations are solved by first substituting the second set of equations (for ũ) into the first set (for τ̂) in a process known as absorption (see also chapter ??). The second set gives:

    ũ = C_ZZ⁻¹ c_Zy − C_ZZ⁻¹ C_ZX τ̂        (5.3.24)

Substituting into the first set gives:

    ( C_XX − C_XZ C_ZZ⁻¹ C_ZX ) τ̂ = c_Xy − C_XZ C_ZZ⁻¹ c_Zy        (5.3.25)

Note that

    C_XX − C_XZ C_ZZ⁻¹ C_ZX
      = X′R⁻¹X − X′R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹X
      = X′[ R⁻¹ − R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹ ]X
      = X′H⁻¹X

using the matrix identity in result ??. Thus, using (5.3.25),

    τ̂ = (X′H⁻¹X)⁻¹ [ X′R⁻¹y − X′R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹y ]

       = (X′H⁻¹X)⁻¹ X′[ R⁻¹ − R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹ ]y

       = (X′H⁻¹X)⁻¹ X′H⁻¹y

using the matrix identity in result ??. The solution for ũ is then obtained by back-substitution, that is, substituting the above expression for τ̂ into (5.3.24). This gives:

    ũ = C_ZZ⁻¹ [ c_Zy − C_ZX (X′H⁻¹X)⁻¹X′H⁻¹y ]
      = (Z′R⁻¹Z + G⁻¹)⁻¹ [ Z′R⁻¹y − Z′R⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹y ]

      = (Z′R⁻¹Z + G⁻¹)⁻¹ Z′R⁻¹[ I − X(X′H⁻¹X)⁻¹X′H⁻¹ ]y

      = GZ′H⁻¹[ I − X(X′H⁻¹X)⁻¹X′H⁻¹ ]y    using result ??
      = GZ′Py ≡ GZ′H⁻¹(y − Xτ̂)

Thus the solutions to the MMEs are the GLS estimate τ̂ and the BLUP ũ. □

Having shown the equivalence between the two sets of estimates, we present two more results which prove to be useful. These results obtain an expression for the inverse of the coefficient matrix C and present a convenient alternative expression for P in terms of C⁻¹, R⁻¹ and W.

Result 5.6

    C⁻¹ = [ (X′H⁻¹X)⁻¹             −(X′H⁻¹X)⁻¹X′H⁻¹ZG
            −GZ′H⁻¹X(X′H⁻¹X)⁻¹     G − GZ′PZG         ]

Proof: Let

    C⁻¹ = [ C^XX  C^XZ        = [ T     −TS′
            C^ZX  C^ZZ ]          −ST   C_ZZ⁻¹ + STS′ ]        (5.3.26)

where T = ( C_XX − C_XZ C_ZZ⁻¹ C_ZX )⁻¹ and S = C_ZZ⁻¹ C_ZX. Thus

    T = [ X′R⁻¹X − X′R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹X ]⁻¹

      = { X′[ R⁻¹ − R⁻¹Z(Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹ ]X }⁻¹

      = [ X′(R + ZGZ′)⁻¹X ]⁻¹    using result ??

      = (X′H⁻¹X)⁻¹

Also,

    S = (Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹X
      = GZ′(R + ZGZ′)⁻¹X
      = GZ′H⁻¹X

using result ??. Thus

    C^XZ = −(X′H⁻¹X)⁻¹X′H⁻¹ZG

and

    C^ZZ = (Z′R⁻¹Z + G⁻¹)⁻¹ + GZ′H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹ZG        (5.3.27)
         = (Z′R⁻¹Z + G⁻¹)⁻¹ + GZ′H⁻¹ZG − GZ′PZG
         = (Z′R⁻¹Z + G⁻¹)⁻¹ + (Z′R⁻¹Z + G⁻¹)⁻¹Z′R⁻¹ZG − GZ′PZG
         = (Z′R⁻¹Z + G⁻¹)⁻¹ ( I + Z′R⁻¹ZG ) − GZ′PZG
         = (Z′R⁻¹Z + G⁻¹)⁻¹ ( G⁻¹ + Z′R⁻¹Z ) G − GZ′PZG
         = G − GZ′PZG        (5.3.28)

Gathering terms gives the result:

" 0 −1 −1 0 −1 −1 0 −1 # −1 X H X − X H X X H ZG C = −1 −GZ0H−1X X0H−1X G − GZ0PZG giving the result. 2

Result 5.7 The matrix P = H⁻¹ − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹ can also be written as R⁻¹ − R⁻¹WC⁻¹W′R⁻¹.

Proof: First note that

    H⁻¹ZG = (R + ZGZ′)⁻¹ZG = R⁻¹ZK

where K = (Z′R⁻¹Z + G⁻¹)⁻¹. Then

    C^XZ = −(X′H⁻¹X)⁻¹X′R⁻¹ZK

and in (5.3.27)

    C^ZZ = K + KZ′R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹ZK

so that

    R⁻¹ − R⁻¹WC⁻¹W′R⁻¹
      = R⁻¹ − R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹
        + R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹ZKZ′R⁻¹
        + R⁻¹ZKZ′R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹
        − R⁻¹ZKZ′R⁻¹
        − R⁻¹ZKZ′R⁻¹X(X′H⁻¹X)⁻¹X′R⁻¹ZKZ′R⁻¹
      = R⁻¹ − R⁻¹ZKZ′R⁻¹
        − ( R⁻¹ − R⁻¹ZKZ′R⁻¹ ) X(X′H⁻¹X)⁻¹X′R⁻¹
        + ( R⁻¹ − R⁻¹ZKZ′R⁻¹ ) X(X′H⁻¹X)⁻¹X′R⁻¹ZKZ′R⁻¹
      = H⁻¹ − H⁻¹X(X′H⁻¹X)⁻¹X′H⁻¹
      = P

as required. □

Finally, we present a convenient expression for the vector of estimated residuals ẽ.

Result 5.8 BLUPs of the residuals are given by ẽ = RPy.

Proof:

    ẽ = y − Wβ̃
      = y − WC⁻¹W′R⁻¹y
      = R ( R⁻¹ − R⁻¹WC⁻¹W′R⁻¹ ) y
      = RPy

using (5.3.22) and result 5.7. A comparison with (5.3.16) shows that this has the same form as the BLUP of the random effects. □

The estimates of the fixed and random effects involve the variance matrices G and R. These matrices are assumed to be functions of the variance parameters γ, σ² and φ, which are usually unknown and must be estimated from the data. In the examples considered in chapter 3, except the balanced incomplete block design, estimates of fixed effects were available within strata and hence did not involve estimates of the variance parameters, though the estimated variances of the fixed effects depend on the variance parameters. However, the BLUP of u in the one-way classification (and more generally) requires estimates of the variance components σu² and σ², or the variance component ratio γu = σu²/σ²_H. For the remainder of the book we will replace the unknown values of the variance parameters by their REML estimates in calculating the solutions to the mixed model equations. These are then termed empirical generalised least squares (EGLS) estimates of the fixed effects and empirical best linear unbiased predictors of the random effects. This can affect inference, particularly in small samples, and we return to this issue in chapter 6.
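The equivalence established in Results 5.5-5.8 is easily checked numerically. The following sketch is a brute-force illustration on a small simulated example with σ²_H = 1 (the matrices, values and names are ours, not those of any particular package): it forms and solves the MMEs (5.3.21)-(5.3.22) and compares the solutions with the direct expressions (5.3.15) and (5.3.16).

    import numpy as np

    rng = np.random.default_rng(3)
    b, r = 4, 3
    n = b * r
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # p = 2 fixed effects
    Z = np.kron(np.eye(b), np.ones((r, 1)))                    # b random effects
    G = 0.7 * np.eye(b)                                        # var(u), sigma2_H = 1
    R = 1.3 * np.eye(n)                                        # var(e)
    y = X @ np.array([10.0, 2.0]) + Z @ rng.normal(0, np.sqrt(0.7), b) \
        + rng.normal(0, np.sqrt(1.3), n)

    Ri, Gi = np.linalg.inv(R), np.linalg.inv(G)

    # Mixed model equations (5.3.22): C beta = W' R^{-1} y
    W = np.hstack([X, Z])
    Gstar = np.zeros((W.shape[1], W.shape[1]))
    Gstar[X.shape[1]:, X.shape[1]:] = Gi
    C = W.T @ Ri @ W + Gstar
    beta = np.linalg.solve(C, W.T @ Ri @ y)
    tau_mme, u_mme = beta[:X.shape[1]], beta[X.shape[1]:]

    # Direct GLS and BLUP, (5.3.15) and (5.3.16)
    H = R + Z @ G @ Z.T
    Hi = np.linalg.inv(H)
    tau_gls = np.linalg.solve(X.T @ Hi @ X, X.T @ Hi @ y)
    P = Hi - Hi @ X @ np.linalg.inv(X.T @ Hi @ X) @ X.T @ Hi
    u_blup = G @ Z.T @ P @ y

    assert np.allclose(tau_mme, tau_gls) and np.allclose(u_mme, u_blup)

The MME form involves only the inversion of C, of order p + b, rather than H, of order n, which is the basis of the computing strategy discussed in chapter 7.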

5.4 An approach to prediction in linear mixed models - REML construct

We conclude this chapter with some important results on prediction in linear mixed models. We present an approach to prediction in linear mixed models based on the notion of the REML construct. This is used extensively in the following chapters as a device to form general predictions (see chapter 7) and for the derivation of the REML-EM algorithm (see appendix A).

5.4.1 REML construct

We consider the joint distribution of ( (L2′y)′, u′, e′ )′. This is given by

    [ y2 ]        ( [ 0 ]         [ L2′HL2  L2′ZG  L2′R ] )
    [ u  ]  ∼  N  ( [ 0 ], σ²_H   [ GZ′L2   G      0    ] )
    [ e  ]        ( [ 0 ]         [ RL2     0      R    ] )

Thus, using (5.2.2), we can derive the conditional distribution of u given y2. Given σ²_H and κ we have

    E(u|y2) = 0 + GZ′L2(L2′HL2)⁻¹L2′y
            = GZ′Py    using result ??
            = ũ

Furthermore,

    var(u|y2) = σ²_H [ G − GZ′L2(L2′HL2)⁻¹L2′ZG ]
              = σ²_H ( G − GZ′PZG )
              = σ²_H C^ZZ

This result resolves the issue that we alluded to in section 5.3.2. Conditioning on y2 rather than y produces the usual BLUP for u, with the estimate of τ automatically incorporated. Furthermore, the conditional variance is in fact the prediction error variance. That is,

    var(u|y2) = var(ũ − u)        (5.4.29)

This is a very important result, which we summarise in the following.

Result 5.9

    u|y2 ∼ N( ũ, σ²_H C^ZZ )

where C^ZZ is the portion of C⁻¹ relating to u.

Similarly, we can derive the conditional distribution of e given y2. Given σ²_H and κ we have

    E(e|y2) = 0 + RL2(L2′HL2)⁻¹L2′y = RPy = ẽ

Furthermore,

    var(e|y2) = σ²_H [ R − RL2(L2′HL2)⁻¹L2′R ]
              = σ²_H ( R − RPR )
              = σ²_H WC⁻¹W′

Again this shows that the conditional mean is the BLUP of e and the conditional variance is the prediction error variance var(ẽ − e), which we summarise in the following result.

Result 5.10

    e|y2 ∼ N( ẽ, σ²_H WC⁻¹W′ )

5.5 Summary

In this chapter we have defined the residual likelihood and illustrated how Residual Maximum Likelihood estimation for variance parameters in linear mixed models may proceed. As well, we have considered the problem of prediction, both in a general sense and, more extensively, as applied to linear mixed models. The basic theory required for the implementation of the computing strategy and inference has also been set in place. In summary, we have

• defined and derived the residual likelihood
• derived the estimating equations (ie the REML score equations) for the variance parameters
• introduced the general concept of prediction for random (and fixed) effects
• derived formulae for the Best Linear Unbiased Predictor and Generalised Least Squares (GLS) estimates in the linear mixed model
• introduced the mixed model equations and showed that their solutions are the Best Linear Unbiased Predictors and GLS estimates
• introduced the concept of the REML construct.

CHAPTER 6

Inference

In chapter 2 we considered likelihood ratio tests for the fixed effects in the linear model and showed the connection between the likelihood ratio test and the F-test for the hypothesis that τ = 0. Hypothesis testing in linear mixed models is, in general, more difficult. Tests for fixed effects cannot be readily constructed from likelihood ratio tests. This is because, as the design matrix X changes, so does the construct y2 used to form the REML likelihood. Welham and Thompson (1997) consider this problem and developed adjusted likelihood ratio tests. Our approach is to base inference on Wald tests. These tests, although valid only asymptotically, are simple to compute and have been readily implemented in our software. For balanced designs with orthogonal block structure there is a direct equivalence between the Wald test and the F-test from the ANOVA table. In general, however, it is not possible to obtain such equivalence. Kenward and Roger (1997) have considered inference for fixed effects in linear mixed models in small samples. Their approach adjusts the denominator degrees of freedom to obtain approximately valid t-tests. This adjustment gives the exact test for orthogonal designs. In this chapter we consider the general problem of inference for variance parameters and fixed and random effects. Tests of hypotheses for variance parameters, using REML likelihood ratio tests, are presented in section 6.1. We present a general result for the distribution of β̃, which is then used to derive tests for fixed effects in section 6.2. We conclude the chapter by considering inference for random or a mixture of fixed and random effects in sections 6.2.4 and 6.2.5.

6.1 General hypothesis tests for variance models

To develop tests for general (nested) variance models we use REML likelihood ratio tests. For ease of notation we refer to the REsidual Maximum Likelihood Ratio Test by the acronym REMLRT. We summarise the approach in the following result.

Result 6.1 For a comparison of (nested) models M0 and M1 with the same fixed model, where M1 contains an extra k variance parameters, the REMLRT statistic is given by

    D = −2(ℓ_R0 − ℓ_R1)

where ℓ_Ri is the residual log-likelihood for model i. The statistic D is asymptotically distributed as a chi-squared variable with k

degrees of freedom. The exception is when the test involves a null hypothesis with the parameter on the boundary of the parameter space. We illustrate this by way of example in section 6.1.2. Again we stress that the REMLRT cannot be used to compare models which differ in their fixed effects. This is clear from the derivation of the residual likelihood as the marginal likelihood of error contrasts. A change in the fixed part of a model will therefore change the basis of the residual likelihood.

6.1.1 Information criteria for non-nested models

The REMLRT can only be used to compare "nested" models. In these cases we reject model M0 in favour of model M1 if D exceeds a pre-specified critical value, cα, which depends on the size of the test and the difference in degrees of freedom (ie the number of additional variance parameters) between the two models. The REMLRT cannot be used to compare non-nested models. In these situations the Akaike Information Criterion, AIC (Akaike, 1973), has been proposed as a model selection criterion. For this setting it is given by

    AIC = −2ℓ_R + 2k        (6.1.1)

where ℓ_R is the value of the maximised REML log-likelihood and k is the number of variance parameters being estimated. That is, the AIC "penalises" the maximised residual log-likelihood for the use of too many variance parameters. Various other criteria have been suggested to improve the performance of the AIC in different settings. Some examples of these include the corrected Akaike information criterion AICC (Hurvich and Tsai, 1989) and the Bayesian information criterion BIC (Schwarz, 1978).
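A one-line helper for (6.1.1) (the function name is ours):

    def aic(loglik_reml, k):
        """Akaike information criterion (6.1.1) for a fitted variance model."""
        return -2.0 * loglik_reml + 2 * k

    # Between two non-nested variance models fitted with the same fixed model,
    # the one with the smaller AIC is preferred.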

6.1.2 Simple variance components tests: Malt run data-set

To illustrate the above methodology and demonstrate the relationship with the exact F-test, we consider the malt run example presented in section 3.1. In this example we wish to test that the variance component associated with the random effects is zero. Thus suppose we wish to test H0 : σu² = 0. The simplest way to derive the test is to use the representation given by (3.1.9). The explicit use of the strata means that the likelihood ratio statistic is easily derived and has a simple form that can be manipulated. Thus recall (3.1.9),

          [ y0 ]        ( [ L0′1nμ ]   [ ξ0  0       0      ] )
    L′y = [ y1 ]  ∼  N  ( [ 0      ],  [ 0   ξ1Ib−1  0      ] )
          [ y2 ]        ( [ 0      ]   [ 0   0       ξ2In−b ] )

Recalling that the REML likelihood is the joint likelihood of y1 and y2, it follows that if σu² is not constrained, the maximized residual likelihood is

    (2π)^{−(n−1)/2} ξ̂1^{−(b−1)/2} ξ̂2^{−(n−b)/2} e^{−(n−1)/2}

while under H0, y ∼ N(1nμ, ξ2In), since ξ2 = σ². Using standard results, the maximized residual likelihood is then

    (2π)^{−(n−1)/2} ξ̂20^{−(n−1)/2} e^{−(n−1)/2}

where ξ̂20 denotes the estimate of ξ2 under H0. Noting that

    ξ̂20 = { (b − 1)ξ̂1 + (n − b)ξ̂2 } / { (b − 1) + (n − b) }

the residual likelihood ratio statistic is (after some algebra)

    Λ = (n − 1)^{(n−1)/2} F^{(b−1)/2} / { (b − 1)F + (n − b) }^{(n−1)/2}        (6.1.2)

where

    F = ξ̂1/ξ̂2

is the ratio of the between malt run mean square and the within malt run mean square, which is an F-statistic on b − 1 and n − b degrees of freedom under H0. The likelihood ratio statistic is not a monotone function of F. In fact it has a maximum at F = 1 and declines to zero as F → 0 and F → ∞. Thus if we use the usual rule to reject H0 when Λ < cα, we reject when F is too large or too small. Up to this point we have not specified the alternative hypothesis. If H1 : σu² > 0 is our alternative, we would use the F-statistic but would only reject for large values of F. The likelihood ratio test is implicitly two-sided, and must be adjusted when the test involves an H0 with the parameter on the boundary of the parameter space. The asymptotic adjustment is to take half the P-value. In fact, it can be shown theoretically that −2 log Λ has a mixture distribution, where the mixing probabilities are 0.5 and the two distributions are a chi-square on 0 degrees of freedom (a spike at 0) and a chi-square on 1 degree of freedom (Stram and Lee, 1994).

For the malt run experiment, the estimates of the stratum variance components were ξ̂1 = 2.145 and ξ̂2 = 0.261. The F-statistic is

    F = ξ̂1/ξ̂2 = 8.214

with a P-value of 4.87 × 10⁻⁶. We therefore strongly reject H0 : σu² = 0. The asymptotic test can be carried out by fitting the two models

    y ∼ mu + maltrun    (Model 1)
    y ∼ mu              (Model 0)

and examining the change in residual log-likelihood. Note that Model 0 does not contain the maltrun factor; hence we are implicitly fitting the model with σu² = 0. We calculate −2 log Λ = −2(ℓ_R0 − ℓ_R1) = 19.25, where ℓ_Ri is the REML log-likelihood for model i = 0, 1. The P-value for this test is 5.74 × 10⁻⁶. The REML likelihood ratio test is an asymptotic test, and so the P-values will not generally agree exactly for this test and the F-test. Substituting F = 8.214 into Λ in (6.1.2) shows the numerical equivalence of the statistics for this example.
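The boundary-adjusted P-value can be computed directly. A minimal sketch, assuming scipy, reproducing the value quoted above from the observed REMLRT statistic:

    from scipy.stats import chi2

    D = 19.25                        # -2 log(Lambda) for the malt run data
    # Boundary adjustment: 0.5 chi^2_0 + 0.5 chi^2_1 mixture (Stram and Lee, 1994)
    p_value = 0.5 * chi2.sf(D, df=1)
    print(p_value)                   # approximately 5.74e-06, as quoted above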

6.2 Hypothesis testing in mixed models: fixed and random effects

In this section we consider inference in a mixed model framework for both fixed and random effects. We begin with a general result concerning the distribution of the BLUEs and BLUPs of the fixed and random effects and then consider, separately, testing fixed effects and illustrate the application of these ideas for the split plot example presented in chapter 3.

Result 6.2 β̃ − β ∼ N(0_{p+b}, σ²_H C⁻¹), where C is the coefficient matrix of the mixed model equations (5.3.21).

Proof: First recall the following basic results from chapter 5:

    β̃ = C⁻¹W′R⁻¹y
    C = W′R⁻¹W + G*
    G* = SG⁻¹S′,  S = [ 0
                        Ib ]

Thus

    E(β̃ − β) = E[ τ̂ − τ
                  ũ − u ] = 0_{p+b}

since E(τ̂) = τ and E(ũ) = E(u) = 0. Furthermore,

    var(β̃ − β) = σ²_H { var(β̃) + var(β) − cov(β̃, β) − cov(β, β̃) }
      = σ²_H { var(C⁻¹W′R⁻¹y) + var(β) − cov(C⁻¹W′R⁻¹y, β) − cov(β, C⁻¹W′R⁻¹y) }
      = σ²_H { C⁻¹W′R⁻¹(ZGZ′ + R)R⁻¹WC⁻¹ + SGS′ − C⁻¹W′R⁻¹ZGS′ − SGZ′R⁻¹WC⁻¹ }

Since Z = WS, then

    var(β̃ − β) = σ²_H { C⁻¹(W′R⁻¹W SGS′ W′R⁻¹W + W′R⁻¹W)C⁻¹ + SGS′
                        − C⁻¹W′R⁻¹W SGS′ − SGS′ W′R⁻¹WC⁻¹ }
      = σ²_H { C⁻¹W′R⁻¹WC⁻¹ + C⁻¹(W′R⁻¹W − C) SGS′ (W′R⁻¹W − C)C⁻¹ }
      = σ²_H { C⁻¹W′R⁻¹WC⁻¹ + C⁻¹ G* SGS′ G* C⁻¹ }
      = σ²_H { C⁻¹W′R⁻¹WC⁻¹ + C⁻¹ SG⁻¹S′ SGS′ SG⁻¹S′ C⁻¹ }
      = σ²_H { C⁻¹W′R⁻¹WC⁻¹ + C⁻¹G*C⁻¹ }
      = σ²_H C⁻¹(W′R⁻¹W + G*)C⁻¹
      = σ²_H C⁻¹

as required. □

6.2.1 Hypothesis tests for fixed effects

Analysis of variance and regression analysis both use F-tests for inference concerning fixed effects. Inference for fixed effects in linear mixed models introduces some difficulties. In general, the methods used in chapter 3 to construct F-tests for fixed effects from the analysis of variance cannot be used for the range of applications illustrated in this book and that arise in practice. One approach would be to use likelihood ratio methods. Unfortunately, if the fixed effects part of the model is changed, the residual likelihood changes. Thus the maximized residual likelihoods under the null and alternative hypotheses are not comparable, and alternative methods are required. As noted by Kenward and Roger (1997), conventionally, estimates of precision and inference for fixed effects are based on their asymptotic distribution through the Wald statistic. In the following we present this statistic in our setting. To test the hypothesis H0 : Lτ = l, for given L (r × p) and l (r × 1), we define the (empirical) Wald test statistic by

    W = (Lτ̂ − l)′ { L(X′Ĥ⁻¹X)⁻¹L′ }⁻¹ (Lτ̂ − l) / σ̂²_H        (6.2.3)

where the "hat" signifies that the parameters σ²_H and κ have been replaced by their REML estimates. This statistic arises from a natural generalisation of the result given in appendix ??, section ??, on the distribution of quadratic forms. However, the empirical Wald statistic in (6.2.3) is only asymptotically distributed as a chi-square random variable on r degrees of freedom. This is a result of the replacement of the unknown variance parameters by their REML estimates. The applicability of this asymptotic result is known to be inadequate for some small sample problems. For example, simulations have shown that the empirical Wald statistic tends to be anti-conservative for small samples, ie the test indicates that an effect may be important more often than expected under the null hypothesis of no effect (Kenward and Roger, 1997). Welham and Thompson (1997) have shown that this depends on the treatment degrees of freedom and residual degrees of freedom in the stratum of interest. However, the degree of anti-conservatism is dependent on the problem. In the analysis of field experiments, Lill et al. (1988) showed that the small sample behaviour of the empirical Wald test was adequate.

Kenward and Roger (1997) have introduced a scaled Wald statistic, together with an F approximation to its sampling distribution. Although their simulation studies indicate that these adjustments lead to a test statistic with improved small sample properties, the adjustments are computationally intensive. This is a particularly difficult problem if there are a large number of variance parameters or the number of fixed effects under consideration is large. As a result, the Kenward-Roger adjustments have only recently been implemented into the software we use. Computation of the adjusted variance matrix of the empirical best linear unbiased estimate of the vector of fixed effects and the approximate F-test of the hypothesis H0 : Lτ = l is presented in the following section. The reader is referred to ? for a more complete description.

It is convenient to compute the empirical Wald statistics for fixed model terms as they are added into the linear mixed model. This implies that each term is assessed eliminating the terms that appear before it, but ignoring any that appear after it.
If the fixed model terms are non-orthogonal (ie (X′H⁻¹X)⁻¹ is not block diagonal), then the empirical Wald statistics may change depending on the order in which the terms are fitted. We adopt a strategy which respects marginality and is consistent with the approach used in examining the significance of terms in regression analysis. The approach is best illustrated by way of a simple example; a computational sketch of the Wald statistic itself follows below. Consider the following linear model

    y ∼ A + A.B + C

where A, B and C are all included in the model as fixed effects. The following sequence of models would be (implicitly) fitted, with the model terms added to the model from left to right, in order to correctly assess the significance of the relevant term:

    y ∼ A + A.B + C    to assess C
    y ∼ C + A + A.B    to assess A
    y ∼ A + C + A.B    to assess A.B

Alternatively, we may calculate the F statistics by dropping terms, but respecting marginality.
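A minimal sketch of the computation in (6.2.3) (the function name and arguments are ours; var_tau_hat would be taken from the fitted model):

    import numpy as np
    from scipy.stats import chi2

    def wald_test(tau_hat, var_tau_hat, L, l):
        """Empirical Wald statistic (6.2.3).

        var_tau_hat is the estimated variance matrix of tau_hat, ie
        sigma2_H_hat * (X' H_hat^{-1} X)^{-1}, with REML estimates plugged in.
        """
        d = L @ tau_hat - l
        W = float(d @ np.linalg.solve(L @ var_tau_hat @ L.T, d))
        r = L.shape[0]
        return W, chi2.sf(W, df=r)   # asymptotic chi-square reference

As stressed above, the chi-square reference distribution is asymptotic and can be anti-conservative in small samples.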

6.2.2 Computing Kenward-Roger adjustments

6.2.3 Testing fixed effects in the split plot example

In this section we consider hypothesis tests for fixed effects in the split plot example presented in chapter 3. This is a special linear mixed model which has an orthogonal block structure and is an orthogonal design. In this case the tests of hypotheses concerning fixed effects can be performed using the F-tests in the ANOVA table. The ANOVA table is reproduced below in table 6.1. The statistics in the final column are F-statistics for the tests that the fungicide, variety and fungicide by variety effects are zero, respectively.

Table 6.1 Analysis of variance for the fungicide by variety example

Strata/Decomposition      d.f.       S.S.        M.S.    F-test
Blocks                       4  15389.808
  Mean                       1  15374.580   15374.580
  Residual                   3     15.228       5.076
Blocks.wplots                4     45.149
  Fungicide                  1     42.019      42.019    40.271
  Residual                   3      3.130       1.043
Blocks.wplots.splots       552     77.107
  Variety                   69     39.284       0.569     7.201
  F.V                       69      5.090       0.074     0.933
  Residual                 414     32.733       0.079

The analysis of variance table has been formed by partitioning the total sum of squares in each stratum into a sum of squares due to the fixed effects estimated in that stratum and an error sum of squares. For this example we have that

    τ = Tτ = Σ⁴ⱼ₌₁ Tjτ

where, for suitably chosen Tj (see 3.3.31), the four terms in the summation represent the overall mean, the main effects of fungicide, the main effects of variety and the interaction effects of fungicide by variety. Since the design is orthogonal, then

    TX′PiXT = Σ⁴ⱼ₌₁ λij Tj

for i = 1, 2, 3. The sets Tj, j = 1, 2, 3, 4, and Pi, i = 1, 2, 3, are sets of mutually orthogonal matrices summing to identity matrices of orders 140 and 560 respectively. Each treatment effect Tkτ can only be estimated in one stratum. The sum of squares due to Tkτ in stratum i is given by

    λik (Tkτ̂[i])′(Tkτ̂[i])

with df equal to the rank of Tk, where

    Tkτ̂[i] = (1/λik) Tk T X′Pi y

The sum of squares due to the fungicide treatment effects is therefore λ22(T2τ̂[2])′(T2τ̂[2]), with 1 df. The F-test for the hypothesis H0 : T2τ = 0 is given by

    F = λ22 (T2τ̂[2])′(T2τ̂[2]) / ξ̂2

This has an F-distribution with 1 and 3 degrees of freedom under H0. The Wald statistic to test the hypothesis H0: T_2 τ = 0 can be calculated by noting that

    var(T_2 τ̂) = T_2 (X'H⁻¹X)⁻¹ T_2 = λ_22⁻¹ ξ_2 T_2

Thus

    W = λ_22 ξ_2⁻¹ τ̂' T_2 τ̂ = λ_22 ξ_2⁻¹ (T_2 τ̂_[2])' (T_2 τ̂_[2])

To compute the empirical Wald statistic we replace ξ_2 by its REML estimate, and W is then equivalent to the F-test in the ANOVA table. The equivalence in this case arises because the numerator df of the F-test is one. In general W = rF, where r is the rank of the matrix L presented in section 6.2.1. That is, in this case we have the same test statistic compared with different reference distributions: the F-test is exact, while the chi-squared is often a poor approximation.

6.2.4 Inference for random effects

It is sometimes of interest to conduct inferences concerning random effects. In the analysis of crop variety evaluation data, probability statements concerning the ranking or superiority of newer varieties under study compared to commercial or standard varieties are often useful summaries of the analysis (Besag and Higdon, 1999; Smith and Cullis, 2001). The natural approach for random effects is to consider inference from the conditional distribution of the random effects given the "data". We have already seen that the mean of this conditional distribution is the BLUP of u, which is analogous to the mean of the posterior distribution of u given the data in a Bayesian approach. Thus using (5.2.2) we can derive the conditional distribution of u given y_2 and κ:

    u | y_2 ~ N( ũ, σ²_H C^ZZ )

where C^ZZ = G − GZ'PZG is the portion of C⁻¹ relating to u. Conditioning on y_2 produces the usual BLUP for u, with the estimate of τ automatically incorporated. The conditional variance is in fact the prediction error variance, that is

    var(u | y_2) = var(u − ũ)

This also avoids the somewhat artificial assumption of the vector of fixed effects τ having a prior distribution with infinite variance, as employed in the Bayesian derivation of REML. In practice, we replace κ by its REML estimate κ̂. Probability statements will therefore generally be anti-conservative, since we have ignored the uncertainty in the estimation of σ²_H and κ.

Since u is a random variable, tests involving equality do not make sense; however, we can make sensible probability statements concerning u. For example, in the variety trial example, we may wish to determine the probability that, given the data, variety i is superior to variety j. That is

    P(u_i > u_j | y_2) = P(u_i − u_j > 0 | y_2)
                       = P( [u_i − ũ_i − (u_j − ũ_j)] / √(σ²_H s'C^ZZ s) > (ũ_j − ũ_i) / √(σ²_H s'C^ZZ s) | y_2 )
                       = 1 − Φ( (ũ_j − ũ_i) / √(σ²_H s'C^ZZ s) )

say, where s is a b × 1 vector of zeroes, except for s_i = 1 and s_j = −1. Similarly we can construct (approximate) prediction intervals for u. Further examples will be presented in the analysis of examples in later chapters.
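As a sketch of how such probability statements might be computed, assume the BLUPs ũ, the C^ZZ block of C⁻¹ and the REML estimate of σ²_H are available from a fitted model; the function name and the fabricated values below are purely illustrative.

import numpy as np
from scipy.stats import norm

def prob_superior(i, j, u_tilde, Czz, sigma2_H):
    # P(u_i > u_j | y2) = 1 - Phi( (u~_j - u~_i) / sqrt(sigma^2_H s' Czz s) )
    s = np.zeros(len(u_tilde))
    s[i], s[j] = 1.0, -1.0
    pev = sigma2_H * s @ Czz @ s        # prediction error variance of u_i - u_j
    return 1.0 - norm.cdf((u_tilde[j] - u_tilde[i]) / np.sqrt(pev))

# fabricated BLUPs and C^ZZ block for three varieties
u_tilde = np.array([0.40, 0.10, -0.20])
Czz = np.diag([0.05, 0.06, 0.05])
print(prob_superior(0, 1, u_tilde, Czz, sigma2_H=1.0))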

6.2.5 Inference on a combination of fixed and random effects

In a similar manner to the approach used in the previous section we can construct probability statements concerning s_1'τ + s_2'u from the result

    (s_1'τ̂ + s_2'ũ) − (s_1'τ + s_2'u) ~ N(0, σ²_H s'C⁻¹s)

where s = [s_1', s_2']'. An example of this is in the analysis of longitudinal data using smoothing splines (see chapter ??).

6.3 Summary

In this chapter we have presented results concerning inference for variance parameters and fixed and random effects in linear mixed models and discussed the approaches we implement for such inferences in this book. For example, we

• use the REMLRT for inferences concerning variance parameters, with adjustments when the test involves an H0 with the parameter on the boundary of the parameter space
• use information criteria for non-nested variance models
• have shown that likelihood ratio methods cannot be used for inference concerning fixed effects in a REML context
• use Wald tests for inference concerning fixed effects, but exercise care with interpretation in small samples
• can make probability statements for random effects and for a mixture of fixed and random effects

CHAPTER 7

Prediction from linear mixed models

The problem of general prediction from linear mixed models has recently been considered by ? and Gilmour et al. (2003). Their work extended prediction for linear models as considered by Lane and Nelder (1982); Lane (1998). In this chapter we present a review of prediction in linear mixed models, considering the issues of prediction as well as the computational implementation of the algorithm. This chapter is based on the papers by ? and Gilmour et al. (2003). We have attempted to present their algorithm in more detail to clarify the concepts and difficulties in forming predictions from the linear mixed model.

7.1 Introduction

It is often desirable to construct predicted values from the effects fitted in the linear mixed model (4.1.1). Such predictions may be summaries in the form of treatment means in the analysis of designed experiments or fitted values from a multiple regression. As we shall see in part II of the book, we often also require summaries of quite complex analyses.

Lane and Nelder (1982) described an approach for forming predictions in general(ised) linear models. Briefly, their approach involves choosing a combination of variables to be predicted, forming the fitted values for these combinations with all combinations of other variables in the model, and taking marginal means across the variables not relevant to the current prediction. Their approach has been implemented in GENSTAT 5. It has been widely tested and is suitable for most applications. Some of the computational limitations with the calculation of the standard errors of predicted values have recently been removed (Lane, 1998). This algorithm, however, is not generally suitable for use in linear mixed models.

In simple balanced applications of the linear mixed model (see chapter 3, for example), predictions are formed by consideration of the relevant tables of treatment means. This may not be appropriate for forming predictions in the analysis of unbalanced data-sets. Furthermore, depending on the role of the random effects in the model, a decision must be made concerning the inclusion of such effects in the linear combination of effects to be used in forming predictions. For correlated random effects, information on effects present in the data may be used to predict effects not present in the data set, with prediction standard errors allowing for the extra uncertainty associated with the effect not being observed. The application of this principle to the residual error gives the kriging predictions used in geostatistics. We will first discuss the principles of prediction in linear mixed models and present two motivating examples to illustrate some of the main issues.

7.1.1 Principles of prediction

In the simplest case a prediction is the sum of the best linear unbiased predictor (BLUP) of random effects with the best linear unbiased estimate (BLUE) of fixed effects for a particular combination of explanatory variables, either averaged over or ignoring any other explanatory variables in the model. This gives a prediction for different explanatory variable combinations estimated from the current experiment. Additionally, we may also wish to consider prediction of future observations allowing for the variation due to unknown random effects. The description of prediction above implies a partition of the explanatory variables into three sets: those for which predicted values are required, called the classify set; those which are to be averaged over, called the averaging set; and those to be ignored.

The terms of the linear mixed model are constructed from these sets of explanatory variables. As we have already seen, the terms in the linear mixed model may refer to a single categorical variable (factor), a single continuous variable (covariate) or an interaction of two or more variables (see chapter 1). However an additional complication in linear mixed models is that terms can be classified as either fixed or random. We will first consider the role of fixed and random single factor terms in the model separately with respect to prediction. Fixed terms have an associated set of effects (parameters) to be estimated. Random terms have the additional constraint that the associated set of effects are normally distributed with zero mean and a variance matrix that is a function of unknown parameters. We have considered examples of linear mixed models in the analysis of designed experiments in which the random terms may represent error terms due to randomisation or other structure in the data (chapter 3). Alternatively they may be used to account for (co)variance in the data, for example, in the analysis of repeated measures data.

Random factor terms may contribute to predictions in several ways. They may be evaluated at a given value(s) specified by the user, they may be averaged over, or, unlike fixed effects, they may be omitted from the linear combination of effects used to form the prediction. Averaging over the set of random effects gives a prediction relevant for the set of random effects in the data-set. We use the term "conditional" predictive margin to describe this type of prediction. Omitting a (random) term from the prediction implicitly produces a prediction with the random effects set to zero, which is usually the assumed prior mean of the random effects. We use the term "marginal" to describe this type of prediction.

For fixed factors, there is no natural interpretation for a prediction derived by omitting a fixed term from the fitted values. Predictions must therefore consider all fixed terms of the linear mixed model in some way: either by averaging over all the levels of the factors which are not involved in the prediction set of factors, or levels for each of these factors must be specified by the user.

For covariate terms (fixed or random), the associated effect is the estimate of the regression coefficient of the response on the covariate. Regardless of whether the term involving the covariate is classified as fixed or random, the term should be evaluated at a given value of the covariate, or averaged over several given values.
Note that omitting a covariate from the predictive model is equivalent to predicting at a zero covariate value, which may not be appropriate.

Interaction terms follow the behaviour of the explanatory variables from which they are composed. Interactions constructed from factors generate an effect for each combination of the factor levels and hence are considered in the same way as terms which involve a single factor. Interactions between covariates or between covariates and factors are treated in the same way as covariates. If the term involves an interaction between a factor and a covariate and is classified as random, then a decision regarding omission of the effects must be made.

7.1.2 Split Plot Design

The first example of prediction in a linear mixed model we consider is the split plot example considered in section 3.3. The experiment was designed to investigate the effect on yield of controlling the fungus powdery mildew in barley. Seventy varieties of barley were grown with and without fungicide application. The layout of the trial indicating the allocation of fungicide treatments to main plots and the arrangement of blocks was presented in table 3.9. The model we considered for the analysis of these data is written symbolically by

    yield ~ mu + fungicide + variety + fungicide.variety + block + block.wplot

In this case the block and block.wplot terms are regarded as error terms which have been used in the estimation of treatment effects, but are not otherwise relevant to prediction of treatment effects. This model may be written in the usual form as

    y_ijk = μ + b_i + f_r(ij) + w_ij + v_s(ijk) + (fv)_rs + e_ijk

where y_ijk is the yield on block i = 1,...,4, whole-plot j = 1, 2, sub-plot k = 1,...,70; μ is the overall constant; b_i is the effect of block i; f_r(ij) is the effect of fungicide r(ij), where r(ij) represents the randomisation of fungicides to whole-plots; w_ij is the effect of whole-plot j in block i; v_s(ijk) is the effect of variety level s, where s(ijk) represents the randomisation of varieties to sub-plots; (fv)_rs is the interaction of fungicide level r with variety level s; and e_ijk is the residual error for block i, whole-plot j, sub-plot k. We assume b = (b_1,...,b_4)', w = (w_1,1,...,w_4,2)' and e = (e_1,1,1,...,e_4,2,70)' with

    ( b )       ( ( 0 )   ( σ²_b I_4      0           0        ) )
    ( w )  ~  N ( ( 0 ) , (    0       σ²_w I_8       0        ) )
    ( e )       ( ( 0 )   (    0          0        σ² I_560    ) )

A prediction of fungicide mean l marginal to block and whole-plot would then be

    μ̂ + f̂_l + (1/70) Σ_{j=1}^{70} ( v̂_j + (fv)̂_lj )    (7.1.1)

This is different to the "conditional" prediction which includes the random effects (ie using all blocks and whole-plots):

    μ̂ + (1/4) Σ_{i=1}^{4} b̃_i + f̂_l + (1/8) Σ_{i=1}^{4} Σ_{j=1}^{2} w̃_ij + (1/70) Σ_{j=1}^{70} ( v̂_j + (fv)̂_lj )    (7.1.2)

The consequence and interpretation of the difference between these two predictions is examined in more detail when we revisit the analysis of these data in chapter ??.
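The distinction between (7.1.1) and (7.1.2) amounts to whether the block and whole-plot BLUPs enter the linear combination of effects. A sketch, using invented placeholder values for the fitted effects (in practice these come from the fitted model), is:

import numpy as np

# Placeholders for effects from a fitted split plot model; values are invented
mu_hat = 3.0
f_hat = np.array([0.25, -0.25])                   # fungicide BLUEs (levels 1, 2)
v_hat = np.zeros(70)                              # variety BLUEs
fv_hat = np.zeros((2, 70))                        # fungicide.variety BLUEs
b_tilde = np.array([0.10, -0.10, 0.05, -0.05])    # block BLUPs
w_tilde = np.zeros((4, 2))                        # whole-plot BLUPs

l = 0   # index of the fungicide level to be predicted

# (7.1.1): marginal prediction, omitting the block and whole-plot BLUPs
marginal = mu_hat + f_hat[l] + np.mean(v_hat + fv_hat[l])

# (7.1.2): conditional prediction, averaging the BLUPs over blocks/whole-plots
conditional = (mu_hat + b_tilde.mean() + f_hat[l] + w_tilde.mean()
               + np.mean(v_hat + fv_hat[l]))

For balanced data the BLUPs average to zero, so the two predictions coincide; with unbalanced data they generally differ.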

7.1.3 An unbalanced variety trial - SA wheat

The second example is taken from Smith and Cullis (2001) and involves the evaluation of elite breeding lines and commercial varieties of wheat by the South Australia crop variety evaluation programme. The major objective of crop variety evaluation programmes is to provide reliable information about the performance of potential new varieties relative to existing 'standard' commercial varieties. This dataset comprises 6028 mean yields from the separate analysis of 174 trials conducted from 1992 to 1998. Each trial is analysed separately and the mean yield for each variety is saved with a measure of precision for subsequent across trial analysis (see Smith and Cullis, 2001, for further details).

Each year trials are conducted within each of six regions (denoted by LEP - Lower Eyre Peninsula, MM - Murray Mallee, MN - Mid North, SE - South East, UEP - Upper Eyre Peninsula and Y - Yorke Peninsula). Table 7.1 presents the numbers of trials conducted in each region between 1992 and 1998. Trial locations are largely invariant between years, mainly for convenience. Elite lines are usually tested for two to three years before a decision is made regarding release or rejection. Commercial varieties remain in the programme for a longer period. Table 7.2 presents the concurrence matrix of varieties. This illustrates the degree of imbalance in the data-set.

The combined analysis is based on a linear mixed model which accounts for all sources of variation. Factors are set up to indicate the year, region, location and variety for each mean yield. Locations are nested within regions, but this is ignored when setting up the factors. The dataset can be regarded as a variety × environment array, where environments are the year × region × location combinations. Variety effects and interaction terms involving variety are regarded as random effects, and a diagonal vector of relative weights is used with a fixed residual variance to take account of plot error from individual trials (see Smith et al., 2001, for a complete discussion).

Table 7.1 Numbers of trials conducted in each region for the SA wheat dataset

Region   1992  1993  1994  1995  1996  1997  1998
LEP        3     3     3     3     3     3     3
MM         6     5     6     6     6     6     6
MN         4     4     4     4     4     3     4
SE         1     0     1     1     1     2     3
UEP        8     7     8     8     8     8     8
Y          3     3     3     3     3     3     3

The model can be written as

    yield ~ mu + year + region + region.loc + year.region + year.region.loc
            + variety + variety.year + variety.region + variety.region.loc
            + variety.year.region + variety.year.region.loc

In this analysis, variety × environment effects are partitioned and fitted as random terms to investigate patterns of variation between varieties as discussed by Patterson et al. (1977). This contrasts with the split-plot design in section 7.1.2, where random terms were used solely to define error strata.

Table 7.2 SA wheat data: numbers of varieties common across years (diagonal entries are numbers for individual years)

Year   1992  1993  1994  1995  1996  1997  1998
1992     32
1993     21    31
1994     18    20    37
1995     13    15    23    36
1996     11    11    19    24    38
1997     11    12    14    15    21    34
1998     10    10    12    13    15    23    36

We are interested in the prediction of variety mean yields both at a state level and at a regional level. There are several issues which require consideration. At a regional level, traditional arguments lead to the classification of locations within regions and years as random. Additionally we wish to predict for a broader set over years and locations for each variety, ie. to predict variety performance at an 'average' environment within each region. At a state level, the six regions cannot be regarded as a random sample, so we would require predictions specific to the regions in these data or averaged over those regions in these data. Hence the variety predictions are conditional predictions with respect to regions and marginal predictions with respect to years and locations within regions. That is, we omit all random terms except the variety and variety.region terms from the predictions.

For predictions of variety regional yields, the four-way variety × year × region × location table of predictions is formed from the relevant terms in the model, then averages are taken across years and locations within each region. For prediction of state yields for each variety, the same procedure is followed, with averages additionally being taken across regions. However, in both cases, there are new issues to consider in forming the multi-way table of predictions and in taking marginal means. Locations are nested within regions, so that it is natural to take means only across the locations present in the data for each region. This implies that a 0/1 weighting scheme be applied to region × location combinations for averaging across locations, with value 1 when a location is present in a region, zero otherwise. This nesting also generates linear dependencies (aliasing) between the region and location factors, so that not all parameters are estimable, and individual parameter estimates may change according to the order of fitting. The issue of aliasing was discussed briefly by Lane (1998). It is important to establish when aliasing will affect the calculated predictions. Another relevant aspect of the design is that, although varieties will not all be tested at each year × region × location combination, it is desirable for all variety predictions to be calculated across a common set of environments. This again requires weighting based on data presence, and predictions may again be affected by parameter aliasing. Where aliasing is present, predicted values may not be invariant to the parameterisation used. We describe a method for detecting invariant predictions in section 7.5, suggested by Gilmour et al. (2003).

This example again shows that, unlike prediction for linear models, there must be user control over the model terms to be used in fitted values for prediction. There must also be flexibility in the averaging process which recognises aliasing and nesting, and allows for different weighting schemes over factors, or combinations of factors.

7.2 The Prediction Model

7.2.1 Steps in the prediction process

In this section we present the conceptual steps involved in the prediction process. The four main steps are:

1. Choose the explanatory variable(s) and their respective values for which predictive margins are required; the variables involved will be referred to as the classify set.
2. Determine which variables should be averaged over to form predictions. The values to be averaged over must also be defined for each variable; the variables involved will be referred to as the averaging set. The combination of the classify set with these averaging variables defines a multiway hyper-table.
3. Determine which terms from the linear mixed model are to be used in forming the linear combination of effects to form the predictions for each cell in the multiway hyper-table.
4. Choose the weighting for forming predictive margins over each dimension (or combination of dimensions) of the hyper-table.

Note that after steps 1 and 2, there may be some explanatory variables in the model that do not classify the hyper-table. These variables must be either evaluated at a given value within each cell of the hyper-table, or only occur in terms that are ignored when forming the fitted values. It was explained in section 7.1.1 that fixed terms could not sensibly be ignored in forming predictions, so that factors can only be excluded from the hyper-table when they appear in random terms only. Whether terms including these factors should be used when forming predictions depends on the application and purpose of prediction.

7.2.2 Prediction process

Prediction involves forming a linear function of β̃. If we denote the vector of predictive margins of interest by π̃, then

    π̃ = Dβ̃    (7.2.3)

say, for some matrix D (d × t), where t = p + b is the number of effects in β. It follows that (see result 6.2)

    D(β̃ − β) = D ( τ̂ − τ )  ~  N( 0, σ²_H D C⁻¹ D' )    (7.2.4)
                 ( ũ − u )

Consideration of the values required for forming confidence intervals makes it clear that it is the prediction error variance, ie. var(β̃ − β), rather than the variance of the estimator, var(β̃), that is usually of interest. The sizes of D and C are often prohibitively large, so that evaluation of π̃ and the prediction error variance of π̃ requires special consideration. From the point of view of understanding the processes involved, however, it is instructive to decompose D into its component matrices, where each component matrix relates to a step in the prediction process described in the previous section. We can write D as

    D = A W_M M S    (7.2.5)

where

• S (r × t) is a binary matrix which selects the elements of β which are used to form the predictions for each cell of the hyper-table; this relates to step 3. Note that r, the number of effects used in forming the fitted values, satisfies r ≤ t and is in general much less than t.
• M (c × r) is a "design" matrix which forms (a portion of) the multiway hyper-table for the specified combinations of the classify set plus the averaging set; this relates to step 2. Note that c is the number of values in the hyper-table, (usually) equal to the product of the number of combinations in the classify set with the number of combinations in the averaging set.
• W_M (c × c) is a diagonal matrix of weights; this relates to step 4.
• A (d × c) is a matrix which, when combined with W_M, averages the multiway table to produce the predictive margins; this relates to steps 1 and 4. Note that d is the number of predicted values, equal to the number of combinations of factor and covariate values in the classify set.

The matrices A and W_M may be combined; however, it is helpful to keep them separate to reflect the type of averaging of the multiway hyper-table. This will be particularly important for problems in which aliasing has occurred. Lane (1998) discusses this issue and indicates that aliasing may occur either as a result of linear dependencies in the explanatory variables or because of non-representation of some combinations of factor levels in the data-set. The latter may occur by chance or through the intrinsic structure of the data, as in the second example, where locations are nested within regions. Care must be taken in this case to ensure sensible averaging occurs. This will be discussed in detail in section 7.5. At this level, the major difference between the algorithm described here and the algorithm proposed by Lane and Nelder (1982) is the presence of the matrix S in D.
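Once D has been generated, the predictions and their precision follow mechanically from (7.2.3) and (7.2.4). As a sketch (ignoring the sparse absorption strategy of section 7.3, and assuming C⁻¹, β̃ and σ̂²_H are available in dense form; all names are illustrative):

import numpy as np

def predictive_margins(D, beta_tilde, Cinv, sigma2_H):
    # pi~ = D beta~, with prediction error variance sigma^2_H D C^-1 D'
    pi = D @ beta_tilde
    pev = sigma2_H * D @ Cinv @ D.T     # var(pi~ - pi), as in (7.2.4)
    return pi, np.sqrt(np.diag(pev))    # predictions and their standard errors

In practice D and C are far too large for this dense calculation, which is precisely the motivation for the absorption-based strategy described next.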

7.3 Computing Strategy

The implementation of the Lane and Nelder (1982) algorithm in GENSTAT 5 has been limited by the size of the model and/or dataset for which predictions and associated standard errors can be easily obtained. The following is a brief description of the strategy which has been implemented in ASReml, samm and the most recent release of the REML directive in GENSTAT 5. It minimises the computational load by use of sparse matrix methods and judicious formation of D and the matrix of prediction error variances.

7.3.1 Initialisation of the component matrices

For full flexibility, the user must be able to specify factor/covariate combinations for which predictions are required (the classify set), combinations of factors/covariates to be averaged over (the averaging set), methods of averaging for each factor (or combination of factors) in the averaging set and terms to be used when forming predictions. However, given the basic information, sensible default values can be determined to minimise user input. For example, in the terms which define the full multiway hyper-table, sensible default values would be the mean value for covariates, all levels for factors and knot points for spline terms (see chapter ??). Note that where a single variable defines several terms (eg. lin and spl, or lin and quad) care must be taken to maintain the link with the underlying variable.

7.3.2 Forming D

The algorithm uses the following method to compute the matrix D efficiently from the information obtained in the initialisation process described in the previous section. This specification influences the size and composition of the component matrices of D. These matrices are not formed individually, as it is more efficient to generate D directly. Recall that each row of D relates to a unique combination of the levels of the factors and values of the covariates in the classify set. These rows are successively formed using a modified version of the subroutine which generates the design matrix, W, for the linear mixed model. Each row of the prediction design matrix generates one predicted value. Columns corresponding to the predicted combination will be set to the appropriate value (1 for a factor level, the specified value for a covariate). Columns corresponding to averaging factors will contain weights dependent on the averaging process (although a slightly different procedure is used for weights depending on data presence, see section 7.5). Columns corresponding to model terms ignored in the prediction process will be set to zero. The matrix D is stored in a link-list sparse form, and a check is made for aliasing caused by the absence of data for an effect involved in D (see section 7.5).

7.3.3 Calculation of predictions and prediction error variances

The major computational challenge in the implementation of the prediction algorithm is the calculation of the prediction error variance matrix. The following describes the approach used in ASReml as outlined in Gilmour et al. (2003). It computes π̃ and the scaled prediction error variance matrix of π̃. The approach extends the mixed model equations, which can be manipulated during the final iteration of the AI algorithm. That is, let Q̃_e be the matrix of augmented mixed model equations, given by

           ( y'R⁻¹y    0     y'R⁻¹W )
    Q̃_e =  (   0       0       D    )
           ( W'R⁻¹y    D'      C    )

Absorption of C gives

    Q̃_a =  ( y'Py    −π̃'     )
           ( −π̃     −DC⁻¹D'  )

The absorption using the reordering of the mixed model equations (briefly described in chapter A) is designed to retain a high degree of sparsity (Gilmour et al., 1995). It is advantageous to have control over the formation of the elements of DC⁻¹D'. For example, where standard errors of differences (SEDs) are required, the full matrix must be calculated. However, calculation of the average SED can be performed during absorption without storing the full matrix, and if only standard errors (SEs) of predictions are required then only the diagonal of the matrix must be retained.

7.4 An example of the prediction model

In this section we present a simple example of the prediction model, taking particular care to describe the form of the matrices involved in forming the predictions. This will assist with the understanding of prediction in the more complex examples which are presented later in this book.

Consider the following (trivial) meta-analysis in which three treatments are tested in two experiments. Experiment 1 had three replicates for each treatment; experiment 2 had two replicates per treatment. The symbolic representation of the model is given by

    y ~ mu + expt * trt

where expt and trt are factors with 2 and 3 levels respectively. The linear (mixed) model for the vector of data y (15 × 1) is given by

    y = Xτ + e

where X = [1_15  X_e  X_t  X_et] is the design matrix, which is partitioned conformably with the terms in the linear model; the partitions represent the intercept, the main effect of experiment, the main effects of treatment and the interaction effects of experiment and treatment. The corner point parameterisation has been applied to ensure the design matrix X is of full column rank.

Case 1: Suppose we are interested in predicting the treatment means averaged over both experiments. The predictive margins are therefore formed, using the approach outlined in section 7.2, by

    π̃ = Dβ̃,    D = A W_M M S

where the matrices on the right hand side are given by

    S = I_6,    A = 1_2' ⊗ I_3,    W_M = (1/2) I_2 ⊗ I_3

and

        ( 1 0 0 0 0 0 )
        ( 1 0 1 0 0 0 )
    M = ( 1 0 0 1 0 0 )
        ( 1 1 0 0 0 0 )
        ( 1 1 1 0 1 0 )
        ( 1 1 0 1 0 1 )

      = [ 1_6 | B_2 ⊗ 1_3 | 1_2 ⊗ B_3 | B_2 ⊗ B_3 ]

where the matrix B_n (n × (n−1)) is given by

    B_n = ( 0'      )
          ( I_{n−1} )

Case 2: Suppose now we are interested in deriving the predictive margins for each treatment by weighting the means for each experiment by the number of replicates for each treatment in each experiment. The matrices A, M and S are as above and the matrix W_M is given by

    W_M = ( (3/5) I_3       0      )
          (     0       (2/5) I_3  )

Case 3: Suppose now that each of the experiments were randomised complete block designs, in which case a possible model for the data would be

    y ~ mu + expt * trt + expt.block

where block is a factor with 3 levels. There are missing combinations in the expt.block term, but this does not cause problems in the analysis, as these effects are classified as random effects, and hence no singularity will occur. As for case 1, we wish to form predictive margins for the treatment factor. We require predictive margins which are marginal to the term expt.block, and hence we exclude this term in forming the linear combination of effects in β by setting S = [ I_6  0^(6×6) ].
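The component matrices for cases 1 and 2 can be written down directly with Kronecker products. The sketch below simply instantiates the matrices above; B and ones are our helper names.

import numpy as np

def B(n):
    # B_n = [0' ; I_{n-1}], the n x (n-1) corner-point expansion matrix
    return np.vstack([np.zeros((1, n - 1)), np.eye(n - 1)])

def ones(n):
    return np.ones((n, 1))

# M forms the 2 x 3 expt-by-trt hyper-table from the corner-point effects
M = np.hstack([ones(6),
               np.kron(B(2), ones(3)),
               np.kron(ones(2), B(3)),
               np.kron(B(2), B(3))])
S = np.eye(6)
A = np.kron(ones(2).T, np.eye(3))               # collapse the table over experiments

W1 = 0.5 * np.eye(6)                            # case 1: equal experiment weights
W2 = np.kron(np.diag([3/5, 2/5]), np.eye(3))    # case 2: replicate weights

D1 = A @ W1 @ M @ S
D2 = A @ W2 @ M @ S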

7.5 Prediction in models not of full rank

There are often situations in which the fixed effects design matrix, X, is not of full column rank. These can be classified according to the cause of aliasing as follows:

1. linear dependencies between explanatory variables due to over-parameterisation of factor terms
2. no data present for some factor combinations, so that the corresponding effects cannot be estimated
3. linear dependencies due to other, usually unexpected, structure in the data

Absorption to produce predictions does not need to be performed until parameter estimation is complete. The first type of aliasing can be detected when setting up the design matrix for parameter estimation, and the second type should be detected from the data, but if missed can also be detected during absorption of the mixed model equations. The third case cannot be detected until absorption of the extended matrix Q̃_e during the predict step of the algorithm. In any case, a strategy is required to ensure predictions are estimable in the sense defined by Searle (1971, pgs. 160, 180). In the following we focus on predictions involving the vector of fixed effects (τ) only, as estimability is not an issue for random effects. We recall that the equations for fixed effects, after absorption of the equations associated with the random effects in (5.3.21), are given by

    X'H⁻¹X τ̂ = X'H⁻¹y    (7.5.6)

If X is not of full rank, then there is no unique solution to (7.5.6). To obtain a solution, say τ̂_0, we compute

    τ̂_0 = (X'H⁻¹X)⁻ X'H⁻¹y

for some generalised inverse (X'H⁻¹X)⁻ of X'H⁻¹X. We note that τ̂_0 is not an unbiased estimator of τ, since, in general,

    E(τ̂_0) = (X'H⁻¹X)⁻ X'H⁻¹X τ ≠ τ

Since X'H⁻¹X is symmetric, there exists an orthogonal L such that

    L' X'H⁻¹X L = ( A_11   A_12 )
                  ( A_12'  A_22 )

where A_22 is a square matrix of full rank, equal to the rank of X'H⁻¹X. Further, we define

    X* = [X_1*  X_2*] = XL
    τ* = ( τ_1* ) = L'τ
         ( τ_2* )

and note X*τ* = Xτ. Hence a convenient choice for (X'H⁻¹X)⁻ is given by

    L ( 0      0    ) L'    (7.5.7)
      ( 0   A_22⁻¹  )

giving

    τ̂_0 = L (  0   )
            ( τ̂_2* )

where

    τ̂_2* = A_22⁻¹ X_2*' H⁻¹ y

The package ASReml uses this approach when it encounters aliasing. Any aliased effects are flagged and reordered to the top of the mixed model equations. Thus the estimate of the fixed effects τ* is the τ̂_2* above.
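The construction in (7.5.7) can be imitated numerically using the eigendecomposition of the symmetric matrix X'H⁻¹X as the orthogonal L. This is a sketch of one convenient generalised inverse, not the reordering scheme used by ASReml; the function name is ours.

import numpy as np

def tau_0(XtHinvX, XtHinvy, tol=1e-10):
    # One solution (X'H^-1 X)^- X'H^-1 y built from an orthogonal L, as in (7.5.7)
    vals, L = np.linalg.eigh(XtHinvX)    # X'H^-1 X = L diag(vals) L', L orthogonal
    inv_vals = np.array([1.0 / v if v > tol * vals.max() else 0.0 for v in vals])
    ginv = L @ np.diag(inv_vals) @ L.T   # a generalised inverse of X'H^-1 X
    return ginv @ XtHinvy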

7.5.1 Estimability of predictions

We first consider the case of the estimability of functions of fixed effects, as this corresponds to the case considered by Searle (1971): the function D_τ τ̂_0 is estimable if

    E(D_τ τ̂_0) = D_τ τ    (7.5.8)

Note that estimability in this context implies that the value of D_τ τ̂_0 is invariant to the parameterisation (ie. the generalised inverse of X'H⁻¹X) chosen. Thus,

    E(D_τ τ̂_0) = D_τ L ( 0      0    ) L' X'H⁻¹X τ
                       ( 0   A_22⁻¹  )

               = D_τ L ( 0      0    ) L' X'H⁻¹X L L'τ
                       ( 0   A_22⁻¹  )

               = [D_τ1*  D_τ2*] ( 0      0    ) ( A_11   A_12 ) τ*
                                ( 0   A_22⁻¹  ) ( A_12'  A_22 )

               = [D_τ1*  D_τ2*] (      0          0 ) τ*
                                ( A_22⁻¹ A_12'    I )

               = [ D_τ2* A_22⁻¹ A_12'   D_τ2* ] ( τ_1* )    (7.5.9)
                                                ( τ_2* )

For Dτ τˆ0 to be estimable (7.5.9) must equal ∗ ∗ ∗ ∗ ∗ ∗ Dτ τ = Dτ τ = Dτ1τ 1 + Dτ2τ 2 and so ∗ ∗ −1 0 Dτ1 − Dτ2A22 A12 = 0 (7.5.10) The other case to consider is of estimability of a linear function of random effects Duu˜. It can be shown that the E (Duu˜) is zero, taking expectation with respect to the distribution of u. This is because the subset of equation in the mixed model equations corresponding to u˜ are full rank and Xτˆ is estimable Searle (1971). If Duu˜ and Dτ τˆ are estimable then it follows that the linear function Duu˜ + Dτ τˆ is also estimable.

7.5.2 Computing strategy for determining estimability of predictions

The above criterion for detecting estimability of the predictions (7.5.10) can be incorporated into the computing algorithm described in chapter A and as part of the prediction process devised in section 7.3. Consider the augmented mixed model matrix, after reordering and absorption of random effects, which is given by

            ( y'H⁻¹y       0        y'H⁻¹X_1*   y'H⁻¹X_2* )
    Q_e1 =  (   0          0        D_τ1*       D_τ2*     )
            ( X_1*'H⁻¹y    D_τ1*'   A_11        A_12      )
            ( X_2*'H⁻¹y    D_τ2*'   A_12'       A_22      )

Absorption of the last row, pertaining to τ_2*, leaves the symmetric matrix

             ( y'Py                                                                       )
    Q_e1^a = ( −D_τ2* τ̂_2*             −D_τ2* A_22⁻¹ D_τ2*'                               )
             ( X_1*'H⁻¹y − A_12 τ̂_2*    D_τ1*' − A_12 A_22⁻¹ D_τ2*'   A_11 − A_12 A_22⁻¹ A_12' )

Since the reordering of the vector τ* into the partition (τ_1*', τ_2*')' has been established and implemented during the first iteration, the criterion for determining estimability (invariance to parameterisation) can be assessed during the same absorption process that determines the vector of predictions and the matrix of prediction variances, ie. estimable predictions are characterised by columns corresponding to D' in Q_e1^a taking the value zero during the absorption process.

7.5.3 An example of prediction in models not of full rank To motivate the next section and to clarify and perhaps reinforce some of the technical issues presented in section 7.5 we will consider a portion of the data-set described in section 7.1.3. We consider data taken from three years (1996-1998) at four locations (BOO, CUM, GER and COO) for a sample of 6 varieties (FRAME, HALBERD, MACHETE, SPEAR, TRIDENT and KRICHAUFF). Table 7.3 presents the number of varieties for each year by location combination for these data. The variety TRIDENT was not sown in 1998 and the location COO was not used in 1996 and 1997.

Table 7.3 Numbers of varieties for each year by location for the reduced SA wheat data-set

Year    BOO   COO   CUM   GER
1996      6     0     6     6
1997      6     0     6     6
1998      5     5     5     5

For the purposes of illustration we consider fitting the simple linear model to these data given by

    y ~ mu + variety + year * loc

where the terms in the model formula have the obvious interpretation. This can be written conveniently in vector matrix notation by

    y = Xτ + e    (7.5.11)

where y (56 × 1) is the vector of yields and X (56 × 17) is the design matrix, with type 1 aliased effects removed, for the vector of fixed effects given by

    τ' = [ τ_μ, τ_v', τ_y', τ_l', τ_yl' ]

The sub-vectors of τ (of lengths 5, 2, 3 and 6 respectively, after removing dependencies) represent the variety, year and location main effects and the year by location interaction effects; τ_μ is the intercept. The linear dependencies in the model have been removed (ie type 1 aliasing) by setting τ_v;1 = τ_y;1 = τ_l;1 = 0, τ_yl;1k = 0, k = 1,...,4 and τ_yl;j1 = 0, j = 1, 2, 3. However, due to "missingness" in the data (ie type 2 aliasing), there are an additional two linear dependencies in X. We assume e ~ N(0, σ² I_56). Table 7.4 presents the least squares estimates of the full set of effects, including type 1 and 2 aliased effects.

Table 7.4 Best linear unbiased estimates of the fixed effects in the analysis of the reduced SA wheat data-set

Term       Effect       Estimate   Alias Type
mu                         0.930
variety    FRAME           0.000   1
variety    HALBERD        -0.217
variety    MACHETE        -0.208
variety    SPEAR          -0.181
variety    TRIDENT         0.056
variety    KRICHAUFF       0.137
year       96              0.000   1
year       97              0.396
year       98              0.393
loc        BOO             0.000   1
loc        CUM             2.763
loc        GER             2.754
loc        COO            -0.462
year.loc   96.BOO          0.000   1
year.loc   96.CUM          0.000   1
year.loc   96.GER          0.000   1
year.loc   96.COO          0.000   1
year.loc   97.BOO          0.000   1
year.loc   97.CUM         -0.727
year.loc   97.GER         -2.409
year.loc   97.COO          0.000   2
year.loc   98.BOO          0.000   1
year.loc   98.CUM         -0.604
year.loc   98.GER         -1.413
year.loc   98.COO          0.000   2

Consider forming variety predictive margins. Since all the terms in the linear model are classified as fixed, the hyper-table is the three-way table classified by variety, year and location. To simplify the remainder we assume that the columns of D relate to the vector τ̂_0, the vector of fixed effects which includes type 2 aliased effects, in the order assumed in (7.5.11). The matrix M (72 × 17) is therefore given by

    M = [ 1_72 | B_6 ⊗ 1_3 ⊗ 1_4 | 1_6 ⊗ B_3 ⊗ 1_4 | 1_6 ⊗ 1_3 ⊗ B_4 | 1_6 ⊗ B_3 ⊗ B_4 ]

The matrix S = I_17. The hyper-table must now be "averaged" in some way to produce the variety predictive margins. Simple averaging, by setting A (6 × 72) = I_6 ⊗ 1_12' and W_M = (1/12) I_72, produces a predictive margin for variety i of

    τ̂_μ + τ̂_v;i + (1/3) Σ_{j=1}^{3} τ̂_y;j + (1/4) Σ_{k=1}^{4} τ̂_l;k + (1/12) Σ_{j=1}^{3} Σ_{k=1}^{4} τ̂_yl;jk

This predictive margin is not estimable. We can avoid this problem by choosing to average over the year by location table cells which are present in the data. That is, we choose the ith diagonal block of W_M to be given by

    (1/10) diag(1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1)

The predictive margin for variety i is then given by

    τ̂_μ + τ̂_v;i + (1/10)(3τ̂_y;1 + 3τ̂_y;2 + 4τ̂_y;3) + (1/10)(3τ̂_l;1 + τ̂_l;2 + 3τ̂_l;3 + 3τ̂_l;4) + (1/10) Σ_{(j,k)∉S} τ̂_yl;jk

where S = {(1,2), (2,2)}. This predictive margin is estimable, since it involves a linear combination of fitted values; this can be checked algebraically using the condition (7.5.10).
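The data-present weighting can be generated directly from a presence indicator. The following sketch assumes the hyper-table cells are ordered with year varying fastest within location, matching the diagonal block displayed above; all names are illustrative.

import numpy as np

# Presence of data in the 4 x 3 location-by-year cells (year varying fastest):
# the location in row 1 (COO) is absent in the first two years
present = np.ones((4, 3))
present[1, 0] = present[1, 1] = 0

w = (present / present.sum()).ravel()       # 12 cell weights summing to one
W_block = np.diag(w)                        # the i-th diagonal block of W_M
W_M = np.kron(np.eye(6), W_block)           # the same block for each of 6 varieties
A = np.kron(np.eye(6), np.ones((1, 12)))    # collapse the weighted cells per variety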

7.6 Issues of averaging

We have seen in the previous section that there is a need for different averaging schemes when forming the marginal table of predictions from the multiway hyper-table. Often it is sufficient and sensible to average over all the dimensions of the hyper-table not in the classify set, using equal weights. In other cases, specific user-supplied weights will be required for some factors, with equal weighting over others. In either case the weight matrix is defined by multiplying together the fixed weights associated with each margin being averaged. The additional case, which was illustrated in the previous section and which deserves attention because it requires a slightly modified algorithm, is the case of averaging only over factor combinations that are present in the data. In this case, the prediction matrix D can be written as

    D = A_D W_M_D A_F W_M_F M S

where the averaging step is split into averaging over factors with fixed weighting (without reference to the data, subscript F) and factors with weighting determined by data presence, denoted by subscript D. The first step of averaging over factors with fixed weights can be done to reduce the size of the problem, before checking data presence on the reduced hyper-table.

A general algorithm for prediction should allow specification of the type of weighting (equal, population, data present, user-supplied) on each of the averaging factors, or on any combination of the averaging factors.

7.7 Prediction of new observations

In some circumstances, it is desirable to predict new observations. This may include the predicted mean for a new experiment or prediction at new points within the data set. An important application is in the analysis of spatial data, in which kriging (Cressie, 1991) has become a popular approach for prediction. Examples of this are presented in chapter ??. If y_p (n_p × 1) is the vector of (unobserved) data we wish to predict then a model for y_p is

    y_p = X_p τ + Z_p1 u + Z_p2 u_p + e_p    (7.7.12)

where the subscript p denotes design matrices associated with the predicted values. The vectors u_p and e_p denote random effects not present in the observed data set, but drawn from the same population as u and e. We assume joint normality for (u_a', e_a')' with

    var ( u_a ) = σ²_H ( G_a    0  )
        ( e_a )        (  0    R_a )

where u_a' = [u'  u_p'], e_a' = [e'  e_p'],

    G_a = ( G     G_op )    and    R_a = ( R     R_op )
          ( G_po  G_pp )                 ( R_po  R_pp )

The matrices σ²_H G and σ²_H R are the variance matrices of the observed random effects. Note that these matrices do not have suffixes; however, inverse matrices associated with them do, for example G^oo and R^oo respectively. This is to maintain consistency with previous notation. We do not include residual errors from the current experiment or a general design matrix for the new residual errors e_p here, but the extension is straightforward.

A convenient device to obtain the best predictor of y_p, denoted by ỹ_p, is now presented. It has been used extensively for the estimation of missing data (Verbyla and Cullis, 1992). Their approach consists of forming the augmented data vector y_a (n_a × 1), where n_a = n + n_p and y_a = [y', 0']'. That is, we place n_p zeros in the positions where the data is not observed and set up a pseudo linear mixed model for y_a, which is an extension of (4.1.1) and is given by

    y_a = Fψ + W_a β_a + e_a    (7.7.13)

where ψ (n_p × 1) is the vector of effects for prediction, with design matrix

    F (n_a × n_p) = (   0   )
                    ( I_n_p )

and the vector β_a = (τ', u_a')' is the vector of fixed and random effects, with design matrix given by

    W_a = ( X     Z     0    )
          ( X_p   Z_p1  Z_p2 )

This pseudo model is merely a computational device. The variance parameters (σ²_H, κ) and the vectors τ and u have already been estimated (or predicted) using the observed data. The inclusion of the vector ψ as a fixed effect at first appears to be inconsistent with the aim of prediction of y_p; however, the inclusion of ψ as a fixed effect ensures that if (7.7.13) were fitted to the augmented data vector then the resulting estimates of the variance parameters and fixed effects and predictions of random effects would be identical to those obtained by fitting the usual linear mixed model to the observed data only. For a complete proof the reader is referred to Verbyla and Cullis (1992). In the following we present a portion of their proof which examines the prediction of y_p through the set of extended mixed model equations for (7.7.13). Hence we consider the extended mixed model equations

    ( y_a'R_a⁻¹y_a                                      )
    ( W_a'R_a⁻¹y_a   W_a'R_a⁻¹W_a + G_a*                )    (7.7.14)
    ( F'R_a⁻¹y_a     F'R_a⁻¹W_a          F'R_a⁻¹F       )

where

    G_a* = ( 0     0    )
           ( 0   G_a⁻¹  )

Since

    F'R_a⁻¹y_a = R^po y
    F'R_a⁻¹F   = R^pp
    F'R_a⁻¹W_a = R^po W + R^pp W_p

then (7.7.14) can be written as

    ( y_a'R_a⁻¹y_a                                    )
    ( W_a'R_a⁻¹y_a   W_a'R_a⁻¹W_a + G_a*              )
    ( R^po y         R^po W + R^pp W_p        R^pp    )

Absorbing the last row, ie the set of equations for ψ, gives the reduced set of extended mixed model equations

    ( y_a'P_F y_a                          )    (7.7.15)
    ( W_a'P_F y_a   W_a'P_F W_a + G_a*     )

where

    P_F = R_a⁻¹ − R_a⁻¹F (F'R_a⁻¹F)⁻¹ F'R_a⁻¹
        = ( R⁻¹   0 )
          (  0    0 )

Hence (7.7.15) becomes

    ( y'R⁻¹y                                       )
    ( X'R⁻¹y   X'R⁻¹X                              )    (7.7.16)
    ( Z'R⁻¹y   Z'R⁻¹X   Z'R⁻¹Z + G^oo              )
    (   0        0        G^po            G^pp     )

where

    G_a⁻¹ = ( G^oo   G^op )
            ( G^po   G^pp )

The final row gives the solution

    ũ_p = −(G^pp)⁻¹ G^po ũ = G_po G⁻¹ ũ

Absorbing the last section, corresponding to u_p, leaves

    ( y'R⁻¹y                    )
    ( W'R⁻¹y   W'R⁻¹W + G*      )

ie the original mixed model equations. So the augmented data approach gives the required estimates of the fixed and random effects. Estimation of the variance parameters using this model can also be shown to be equivalent to estimation based solely on the observed data (Verbyla and Cullis, 1992). To obtain ψ̂ we use back-substitution in (7.7.14), giving

    ψ̂ = (R^pp)⁻¹ R^po y − (R^pp)⁻¹ (R^po W + R^pp W_p) β̃    (7.7.17)
       = (R^pp)⁻¹ R^po ẽ − W_p β̃
       = −( X_p τ̂ + Z_p1 ũ + Z_p2 ũ_p − (R^pp)⁻¹ R^po ẽ )
       = −ỹ_p

7.7.1 Computation for new observations

In the following development it is simpler to incorporate any new random effects into the usual linear mixed model. Hence we can write the predicted observations as

    ỹ_p = X_p τ̂ + Z_p ũ + R_po R⁻¹ ẽ

where Z_p = [Z_p1  Z_p2], ũ now represents the full set of observed and predicted random effects, and W_p = [X_p  Z_p]. In the previous section we have made use of the following result (see section ??)

    ( R     R_op )⁻¹   ( R^oo   R^op )
    ( R_po  R_pp )   = ( R^po   R^pp )

                       ( R⁻¹ + R⁻¹ R_op R^pp R_po R⁻¹    −R⁻¹ R_op R^pp )
                     = ( −R^pp R_po R⁻¹                   R^pp          )

where R^pp = (R_pp − R_po R⁻¹ R_op)⁻¹. Thus we recall from (7.7.17) that the predicted observations can be written as

    ỹ_p = [ W_p + (R^pp)⁻¹ R^po W ] β̃ − (R^pp)⁻¹ R^po y

The prediction error variance of ỹ_p is

    var(ỹ_p − y_p) = var( W_p(β̃ − β) + R_po R⁻¹ ẽ − e_p )
                   = var( (W_p − R_po R⁻¹ W)(β̃ − β) + R_po R⁻¹ e − e_p )
                   = σ²_H [ (W_p − R_po R⁻¹ W) C⁻¹ (W_p − R_po R⁻¹ W)' + (R^pp)⁻¹ ]
                   = σ²_H [ (W_p + (R^pp)⁻¹ R^po W) C⁻¹ (W_p + (R^pp)⁻¹ R^po W)' + (R^pp)⁻¹ ]

since cov(β̃ − β, R_po R⁻¹ e − e_p) = 0.

We see that ỹ_p is a function of β̃ and the observed data y, and in the spirit of section 7.3 we can extend the mixed model equations and calculate predictions and prediction error variances through an absorption step. The extended mixed model equations now become

    ( y'R⁻¹y                                 )
    ( (R^pp)⁻¹ R^po y    −(R^pp)⁻¹           )
    ( W'R⁻¹y             D'            C     )

where D = W_p + (R^pp)⁻¹ R^po W. Absorption of C gives

    ( y'Py                             )
    ( −ỹ_p    −var(ỹ_p − y_p)/σ²_H    )

Additionally we can check for prediction invariance (see section 7.5) during the absorption process.

CHAPTER 8
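Alternatively, once a model has been fitted, ỹ_p may be computed directly from its definition rather than through the extended equations. The sketch below assumes all fitted quantities and covariance blocks are available in dense form; names are illustrative.

import numpy as np

def predict_new(Xp, Zp, Rpo, R, tau_hat, u_tilde, X, Z, y):
    # y~_p = X_p tau^ + Z_p u~ + R_po R^-1 e~, with e~ the estimated residuals
    e_tilde = y - X @ tau_hat - Z @ u_tilde
    return Xp @ tau_hat + Zp @ u_tilde + Rpo @ np.linalg.solve(R, e_tilde)

With R_po = 0 (independent new errors) this reduces to the familiar X_p τ̂ + Z_p ũ; a non-zero R_po gives the kriging-type adjustment from the observed residuals.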

From ANOVA to variance components

8.1 Introduction

In this chapter we consider the analysis of data arising from designed experiments, observational studies or surveys for which the variance matrix of the data is, in terms of variance component ratios, given by

    var(y) = σ²_H H = σ²_H ( Σ_{i=1}^{q} γ_i Z_i Z_i' + I )

or, in terms of variance components (setting σ²_H = 1), by

    var(y) = H = Σ_{i=1}^{q} γ_i Z_i Z_i' + σ² I

This variance matrix is termed 'linear' in the parameters since ∂²H/∂γ_i² = 0. It arises in many applications: the most notable is in the analysis of designed experiments. In this setting the variance structure of the data can be derived from a randomisation argument (see Nelder, 1965a). In contrast, most analyses of longitudinal and spatial data using linear mixed models are model-based and as such do not have this robustness to misspecification of the variance model. In observational studies the linear mixed model is usually constructed from consideration of likely sources of variation in the data.
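As a small sketch, H in the first form can be assembled directly from the random-effect design matrices and the ratios γ_i; the single blocking factor in the example is invented for illustration.

import numpy as np

def variance_matrix(Z_list, gammas, n):
    # H = sum_i gamma_i Z_i Z_i' + I_n, linear in the ratios gamma_i
    H = np.eye(n)
    for Z, g in zip(Z_list, gammas):
        H = H + g * (Z @ Z.T)
    return H

# e.g. one random blocking factor: 4 blocks of 5 units, gamma = 0.8
Zb = np.kron(np.eye(4), np.ones((5, 1)))
H = variance_matrix([Zb], [0.8], n=20)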

8.2 Navel Orange Trial

This example considers an experiment conducted in the Sunraysia district of Australia during the 1999-2000 orange season. The aim of the experiment was to examine the effect of branch size (strong or weak), inflorescence type (A - leafy with a single fruit, B - leafy with more than one fruit and C - leafless terminal fruit) and the possible interaction of these factors with tree aspect (facing north or facing south). Figure 8.1 presents a sketch of these inflorescence types. The experiment was conducted in a commercial orchard not far from the NSW Agriculture's Horticultural Research Station at Dareton, NSW. The data is kindly provided by Dr. T. Khurshid.

Five mature navel orange trees were randomly selected from the orchard. On each side of the tree (ie aspect) a number of branches were tagged and classified as strong or weak according to their average diameter. The researcher then randomly selected 5 oranges for each tree, aspect, branch strength class and inflorescence type.

[Figure 8.1 Schematic Orange tree structure - nb wrong figure!]

Thus a total of 300 oranges were tagged and identified. During this process no particular care was taken to ensure that the sampling respected the branch structure of the tree. We shall return to this issue later. Fruit diameter was recorded on several occasions throughout the growing season. The data we consider is the penultimate measurement before fruit maturity.

The observational unit is an orange. The factors in the experiment are tree, aspect (north or south), strength (strong or weak), type (A, B, C) and fruit. The fixed factors are aspect, strength and type; tree and fruit are considered random. The linear mixed model could be derived using both our knowledge of the sampling protocol and the aims of the experiment. Following the tier methodology we have

Tier   Description                     Factors
1      Unrandomised fruit factors      tree, aspect, strength, type, fruit
2      Randomised treatment factors    aspect, strength, type

The corresponding structure formulae are

Tier   Structure formula
1      tree/aspect/strength/type/fruit
2      aspect*strength*type

One important issue that has been thus far overlooked, however, is the issue of tree structure. Inflorescence types grow from branches. Each branch within a tree plays a vital role in transport of nutrients and also has a fixed position within the tree with respect to orientation and height above ground. Furthermore, the classification of branches to strength class was based on branch diameter. There was a reasonable amount of variation in branch diameter within each strength class. It is therefore quite likely that fruit growth could be affected by the particular branch on which it grew. Individual branches were subsequently identified within each tree, aspect and strength class using the diameter records, which according to the researcher were unique to each branch (within these factors). Table 8.1 presents the actual number of branches that were chosen out of a possible 15 for each tree, aspect and strength combination. A total of 35 branches had two fruit sampled. The structure formula for tier 1 was modified to include the term tree.aspect.strength.branch, where branch is a factor with 15 levels.

Table 8.1 Number of branches sampled

               North             South
Tree      Strong   Weak     Strong   Weak
1           13      14        14      14
2           14      12        13      14
3           14      12        14      13
4           14      12        13      11
5           14      13        15      12

Ignoring the complication of branch and missing values (16 in all), table 8.2 presents the decomposition of the sources of variation using the rules described in section ??. For interest, and for comparison with Kenward and Roger's approach for determining denominator d.f. for F-statistics, we have included an additional column denoting the d.f. for each term. The linear mixed model can now be constructed by collating the terms in the last column of table 8.2 and adding tree.aspect.strength.branch, and is

    y ~ mu + aspect * strength * type + tree/aspect/strength/type
        + tree.aspect.strength.branch    (8.2.1)

Table 8.3 presents a summary of the REML estimates of the variance components for the random terms for the model determined by (8.2.1) and two additional models, viz dropping tree.aspect.strength.branch (model 1) and dropping aspect.strength.type from the full model (model 2). The results are very interesting. The largest (excluding units) source of variation is associated with tree.aspect.strength.branch, ie the actual branch. A formal test of the hypothesis that the variance component is zero would not be rejected. We prefer to retain the term, given its biological relevance. The lack of information in this data-set is the most probable cause of the degree of uncertainty in its estimate.

Table 8.4 presents a summary of the tests of the fixed effects for these data. The F-statistics have been computed using the approach of Kenward and Roger (1997), described in section ??.

Table 8.2 Skeletal ANOVA for the navel orange trial

Source/Decomposition               d.f.   Term in linear mixed model
Tree¹                                 5
  Mean²                               1   mu
  Residual²                           4   tree
Tree.Aspect¹                          5
  Aspect²                             1   aspect
  Residual²                           4   tree.aspect
Tree.Aspect.Strength¹                10
  Strength²                           1   strength
  Aspect.Strength²                    1   aspect.strength
  Residual²                           8   tree.aspect.strength
Tree.Aspect.Strength.Type¹           40
  Type²                               2   type
  Aspect.Type²                        2   aspect.type
  Strength.Type²                      2   strength.type
  Aspect.Strength.Type²               2   aspect.strength.type
  Residual²                          32   tree.aspect.strength.type
Tree.Aspect.Strength.Type.Fruit¹    240   units

Table 8.3 REML estimates of variance components for three models fitted to the navel orange trial

Term                          No. of effects   Full Model   Model 1   Model 2
tree                                 5              0           0         0
tree.aspect                         10              2.471       2.491     2.476
tree.aspect.strength                20              1.590       1.529     1.580
tree.aspect.strength.type           60              0           0.179     0
tree.aspect.strength.branch        265              2.619       -         3.273
units                              284             18.467      20.936    17.763
REML log-likelihood                              -578.73     -578.94      -

The approximate denominator d.f. for each F-statistic is presented, as well as a probability based on an F-distribution with the appropriate numerator and denominator d.f., in the columns labelled ν1 and ν2. The three-way interaction is not significant and therefore it has been dropped in subsequent analyses. There is some evidence of a strength.type interaction and possibly an aspect.type interaction. It is very interesting to note the effect of using the small sample adjustment of Kenward and Roger (1997). The most important difference is the use of the more appropriate denominator d.f. for each term. The inflation of the Wald statistics was modest (the largest percentage reduction was 4%). The probability for the strength.type interaction using the Wald statistic and its asymptotic reference distribution was 0.037 (cf 0.056).

Table 8.4 Testing of fixed effects for the navel orange trial

Term                    ν1      ν2    F-statistic   P value
aspect                   1     4.5        1.85       .238
strength                 1     7.9       43.31      <.001
type                     2    41.0        8.35       .006
aspect.strength          1     8.0        0.06       .813
aspect.type              2    30.6        2.52       .097
strength.type            2    30.6        3.18       .056
aspect.strength.type     2    30.1        0.49       .619

The values for the two-way interactions were based on model 2, whereas the values for the main effects were calculated by fitting a model which included only main effects for aspect, strength and type, but fixing the variance parameters at the REML estimates from model 1. This respects marginality and design structure, more closely reflecting the ANOVA decomposition and our uncertainty of inference for some of the two-way interactions.

To summarise the results for the researcher we present predictions for the strength by type table. These are formed using the rules outlined in chapter 7. Briefly, the classify set is {aspect, strength, type}. Marginal predictions are formed for all random terms. That is, the linear combination of effects is taken from the unsaturated three-factor model aspect*strength*type - aspect.strength.type. The two-way margins for strength by type are obtained by averaging over the two levels of aspect. Table 8.5 presents the predictions and the associated matrix of standard errors of differences (SEDs) based on the adjusted variance matrix.

Table 8.5 Predictions and SEDs for the navel orange trial (SEDs form the lower triangle, in row order)

Strength  Type  Prediction   SEDs
Strong    A        69.7
Strong    B        73.2      0.96
Strong    C        73.6      0.96  0.97
Weak      A        66.1      1.10  1.11  1.11
Weak      B        67.8      1.12  1.12  1.12  0.95
Weak      C        66.7      1.11  1.12  1.12  0.95  0.96
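To make the averaging rule concrete, the following minimal Python sketch forms the strength-by-type margin from a hypothetical aspect × strength × type table of predictions. The array and its values are invented for illustration only; they are not the fitted values from this analysis.

```python
import numpy as np

# Hypothetical 2 x 2 x 3 table of predictions from the unsaturated model,
# indexed as [aspect, strength, type]; values are illustrative only.
pred = np.array([
    [[69.0, 72.8, 73.1],   # aspect 1: Strong A, B, C
     [65.7, 67.4, 66.2]],  # aspect 1: Weak   A, B, C
    [[70.4, 73.6, 74.1],   # aspect 2: Strong A, B, C
     [66.5, 68.2, 67.2]],  # aspect 2: Weak   A, B, C
])

# The strength-by-type margin averages over the two levels of aspect
# (axis 0), mirroring the rule used to form table 8.5.
strength_by_type = pred.mean(axis=0)
print(strength_by_type)  # 2 x 3 table: rows Strong/Weak, columns A/B/C
```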

8.3 Sensory experiment on frozen peas

A sensory experiment was conducted in order to evaluate 16 frozen pea products. These were assessed 3 times by each of 12 assessors for a range of attributes (for details see Brockhoff, 2002). We consider here the analysis of the attribute "pea taste". Sensory experiments are multi-phase experiments (?). The analysis of more complex and completely replicated multi-phase experiments will be considered in chapter ??.

In terms of phase I of this experiment, namely the selection of samples to be evaluated, we have limited information on the design. Each sample represents a different treatment. The treatments were originally chosen as a factorial combination of 4 sizes by 2 colours by 2 sucrose levels. However, an error was made, resulting in no sample for the combination size 1, colour 1, sucrose 2 and two samples for the combination size 1, colour 2, sucrose 1 (see table 8.6, ordered on the factorial structure of the products). A preliminary analysis of the factorial structure revealed a significant 3-way interaction. On the basis of this, and the fact that the treatments were confounded with the variety of pea (see table 8.6), we ignore the treatment structure when conducting the analysis. A key issue is that there appears to be no replication of the treatments in phase I. Thus we will not be able to make broad inferences about product differences but must restrict attention to the product samples used in the study.

In terms of phase II, namely the evaluation of the products by the assessors, a cross-over design was used. Products were evaluated over 12 days, with 4 products evaluated each day. Days 1 to 4 comprised a complete evaluation of all 16 products (i.e. a complete "repeat"); similarly, days 5 to 8 and days 9 to 12 comprised complete repeats. Products were assigned to days using an incomplete block design. Within each day each of the 12 assessors judged all 4 products. The order of presentation within a day was varied between assessors in an attempt to balance for carry-over effects. There were a large number of missing values in these data. Although the design suggests that there should be 3 data values for each product by assessor combination, there are many combinations with fewer values (table 8.7). Assessor 7 was poorly represented, with no data for 4 of the products and only a single value for the remaining products. For this reason the data for this assessor were omitted from the analysis.

Table 8.6 Description of frozen pea products

product  size  colour  sucrose  variety
811       1      1        1        3
231       1      2        1        2
569       1      2        1        6
315       1      2        2        2
720       2      1        1        8
494       2      1        2        4
625       2      2        1        6
930       2      2        2        2
936       3      1        1        9
981       3      1        2        5
535       3      2        1        6
540       3      2        2        7
701       4      1        1        8
237       4      1        2        7
132       4      2        1        4
863       4      2        2        1

Although our analysis will only focus on phase II of the trial we find that, given the complexity of the experimental design, it is helpful to use the tier methodology again to develop an analysis which reflects the randomisation process and accounts for the sources of variation. The factors we define are assessor, repeat, day, order and product, with 11, 3, 4, 4 and 16 levels respectively. We also define another factor called previous, which has 17 levels and represents the product assessed in the previous session; the extra level is created for those products assessed in session 1 of each repeat and day. The observational unit is the session for each assessor, that is, the repeat, day, order combination for each assessor. The fixed factors are product and previous; all other factors are taken as random. There are two tiers, described by

Tier  Description                      Factors
1     Unrandomised (sessions)          assessor, repeat, day, order
2     Randomised (treatment factors)   product, previous

In tier 1, assessor is crossed with repeat, day and order, whilst the latter three factors are nested, and so the structure formulae are

Tier  Structure formula
1     assessor*(repeat/day/order)
2     product + previous + product.assessor

Table 8.7 Number of non-missing data for each assessor and frozen pea product

          Assessor
Product    1   2   3   4   5   6   8   9  10  11  12
811        1   1   3   2   2   2   3   3   2   2   3
231        1   1   3   3   2   2   3   3   2   2   2
569        3   3   3   3   2   2   3   2   2   2   2
315        1   1   3   2   2   2   3   2   2   2   3
720        3   3   3   3   1   2   3   1   2   2   3
494        3   3   3   3   2   2   3   1   2   2   3
625        2   2   3   3   2   2   3   2   2   2   3
930        2   2   3   3   1   2   3   2   2   2   3
936        2   2   3   3   2   2   3   3   2   2   2
981        3   3   3   3   3   2   3   3   2   2   2
535        2   2   3   3   3   2   3   3   2   2   2
540        3   3   3   3   2   2   3   3   2   2   2
701        2   2   3   2   3   2   3   2   2   2   2
237        3   3   3   3   3   2   3   2   2   2   1
132        3   3   3   3   3   2   3   2   2   2   1
863        2   2   3   2   3   2   3   2   2   2   2

The interaction of product and assessor has been included in the structure formula for tier 2, and since it is an interaction between a fixed factor and a random factor the term is considered random. The order of presentation of products is also known to be important in sensory evaluation trials, and so previous is included to adjust for residual or carry-over effects. Each term in the structure formula for tier 1 must be checked to ensure that it has been respected in the randomisation process. Thus

Term                       Status   Reason
assessor                   in       products assigned to all assessors
repeat                     in       is a complete block
repeat.day                 in       forms IB design for products
repeat.day.order           in       order of presentation considered
assessor.repeat            in       complete block for each assessor
assessor.repeat.day        in       IB design repeated for each assessor
assessor.repeat.day.order  in       the units

Although the data are highly unbalanced, with so many missing values and data available from only 11 of the original 12 assessors, it is still instructive to consider the decomposition of the sources of variation as presented in a skeletal ANOVA table (table 8.8). Where a term has been bracketed, this indicates that there is information in that stratum for that term.

Table 8.8 Skeletal ANOVA for the frozen pea experiment

Source/Decomposition           Term in linear mixed model
Assessor¹
  Mean²                        mu
  Residual²                    assessor
Repeat¹                        repeat
Repeat.Day¹
  [Product]²                   product
  [Previous]²                  previous
  [Product.Assessor]²          product.assessor
  Residual²                    repeat.day
Repeat.Day.Order¹
  [Product]²                   product
  [Previous]²                  previous
  [Product.Assessor]²          product.assessor
  Residual²                    repeat.day.order
Assessor.Repeat¹               assessor.repeat
Assessor.Repeat.Day¹
  [Product]²                   product
  [Previous]²                  previous
  [Product.Assessor]²          product.assessor
  Residual²                    assessor.repeat.day
Assessor.Repeat.Day.Order¹
  Product²                     product
  Previous²                    previous
  Product.Assessor²            product.assessor
  Residual²                    assessor.repeat.day.order

The linear mixed model can now be constructed by collating the terms in the last column of table 8.8, and is given by

y ∼ mu + product + previous + assessor*(repeat/day/order) + product.assessor    (8.3.2)

Figure 8.2 presents a QQ plot of the residuals from fitting (8.3.2); the Gaussian assumption for the errors seems reasonable. Table 8.9 presents the REML estimates for the variance parameters. The interaction between assessors and products is significant (−2 × REML log-likelihood ratio = 4.24, with reference distribution ½χ²₀ + ½χ²₁; p < .05). Table 8.10 presents Wald tests for the fixed effects. Both terms are highly significant.
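Because the product.assessor variance component lies on the boundary of its parameter space under the null hypothesis, the REML likelihood ratio statistic is referred to the mixture ½χ²₀ + ½χ²₁ rather than χ²₁. A minimal sketch of the p-value calculation, assuming SciPy is available:

```python
from scipy.stats import chi2

D = 4.24  # -2 * REML log-likelihood ratio, from the text

# The chi^2_0 component is a point mass at zero, so for D > 0 only the
# chi^2_1 component contributes to the tail probability.
p_value = 0.5 * chi2.sf(D, df=1)
print(round(p_value, 4))  # approximately 0.020, i.e. p < .05
```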


Figure 8.2 QQ plot for the residuals for the frozen pea experiment

Table 8.9 REML estimates of variance components for the frozen pea experiment

Term                 No. of effects   Component
assessor                   11          0.717
repeat                      3          0.00406
repeat.day                 12          0.0191
repeat.day.order           48          0
assessor.repeat            33          0.0563
assessor.repeat.day       132          0.242
product.assessor          176          0.3910
units                     384          2.039

In this example, although the adjusted statistics differ numerically from the unadjusted or large-sample test statistics, the conclusions remain the same. The predicted means for the products are presented in table 8.11. The values presented are marginal predictions with respect to all random terms, including assessors, and are formed by simple averaging of the two-way hyper-table classified by product and previous. The SEDs are based on the adjusted variance matrix for τ̂. Product 863 is the superior product for pea taste, but it is difficult to determine the effects of size, colour, sucrose and variety from these data.

Table 8.10 Testing of fixed effects for the frozen pea experiment

Term       ν1    ν2    F-statistic   P value   Wald F   P value
product    15    86       11.54       <.001     12.49    <.001
previous   16   139        2.01        .017      2.24     .003

Table 8.11 Predicted means for the frozen pea experiment

Product  size  colour  sucrose  variety   Predicted Pea Taste
811       1      1        1        3             8.12
231       1      2        1        2             7.55
569       1      2        1        6             8.70
315       1      2        2        2             7.81
720       2      1        1        8             7.70
494       2      1        2        4             8.24
625       2      2        1        6             8.70
930       2      2        2        2             7.25
936       3      1        1        9             5.34
981       3      1        2        5             7.38
535       3      2        1        6             6.28
540       3      2        2        7             8.37
701       4      1        1        8             4.70
237       4      1        2        7             5.98
132       4      2        1        4             6.65
863       4      2        2        1             9.11

Mean SED  0.531
Min SED   0.505
Max SED   0.555

CHAPTER 9

Mixed models for Geostatistics

9.1 Introduction

Geostatistics is an area of statistics which has evolved as a branch of spatial statistics concerned primarily with the prediction of a spatially dependent quantity based on observations at a set of (pre-specified) locations. The development of geostatistics traces back to the 1960s, but much earlier references to the existence of spatial variation exist. One of the earliest records appears in the paper by Mercer and Hall (1911), who examined the variation in the yield of crops in small plots at Rothamsted Experimental Station, UK. Their paper introduced some of the fundamental concepts, such as spatial dependence, correlation range and the "nugget" effect.

Not long after Fisher arrived at Rothamsted he also became aware of the existence of spatial variation in the field. His primary objective, however, was the efficient estimation of treatment effects in designed experiments. In an attempt to neutralise spatial dependence in this setting he developed the principles of randomisation and blocking, which became the building blocks for the design and analysis of comparative experiments. Valid inferences regarding treatment effects could be achieved through randomisation-based analysis, and little consideration or attention was paid to modelling spatial dependence in this setting until the papers by Bartlett (1978) and Wilkinson et al. (1983).

In the meantime, geostatistics had developed from the initial work of Georges Matheron and colleagues at Fontainebleau, France, originally motivated by problems of ore estimation. These ideas were developed independently of other work in spatial statistics, in particular the work of Matérn, whose doctoral thesis published in 1960 is still widely cited, and of Whittle and Bartlett, to name a few. The division between geostatistics and mainstream spatial statistics is illustrated by the basic geostatistical tool known as kriging. It is now well known that kriging (after Krige (1951)) is equivalent to minimum mean square error prediction under a (Gaussian) linear mixed model. This connection has been made on various occasions since Ripley (1981) and was a primary motivation for the excellent theoretical treatment of the subject in Stein (1999). Recently, Diggle et al. (1998) coined the phrase model-based geostatistics to refer to an approach to geostatistical problems based on the application of formal statistical methods under an assumed stochastic model. We adopt this approach in the following but narrow our focus to the Gaussian setting. We will consider the analysis of designed experiments using geostatistical or spatial statistics in chapter 11.

9.2 Motivating Examples

9.2.1 Cashmore Field

This example arose from a comprehensive study by Lark et al. (1998), who were interested in determining the amount of spatial variability of crop yield within a field which was associated with differences between soil map units. These map units had been previously determined according to a system of simple units corresponding to soil series as defined by the Soil Survey of England and Wales. The study was conducted over three successive seasons on a 6 ha field known as Cashmore field, at Silsoe Research Institute, Bedfordshire, UK. Winter barley was grown in each year, with accompanying soil surveys taken at strategic times during the growing season. The data we consider are the gravimetric water content of the topsoil (0-200 mm) taken in March 1995. These data have been analysed more recently by Lark et al. (2005) and were collected on a 50 m square grid supplemented with additional sample sites to give a total of 100 observations. At the time of sampling the soil was deemed to be at field capacity. The sampling locations are presented in figure 9.1. The aim of the study is to produce a map of the field in terms of soil moisture, which could then be used to produce predictions of soil moisture for individual map units within the field (see figure 2 of Lark et al. (1998) and Clayton and Hollis (1984) for further details).


Figure 9.1 Sampling locations for Cashmore water survey

9.2.2 Electromagnetic salinity

Rapid and cost-effective measurement of soil salinity via the apparent electrical conductivity (ECa) of soil profiles is becoming an important management tool for determining the suitability of soils for growing rice in parts of New South Wales, Australia. The current protocol involves measurements of ECa from a ground-based electromagnetic induction instrument, EM31, which is linked to a differential global positioning system and towed behind a four-wheel motorbike, providing a large number of geographically-referenced observations. From one rice field, 2000 observations were gathered in a serpentine fashion throughout an irregularly-shaped field, as displayed in Figure 9.2. There were 1995 distinct locations, with five locations having two observations. The aim was to produce a fine-scale map to determine where ECa is at least 150 mS/m (milli-Siemens per metre), fulfilling one requirement for suitability for growing rice (Beecher et al., 2002).


Figure 9.2 Sampling locations for EM survey

9.2.3 Fine scale soil pH data

Our third and final example arose as part of a larger study conducted at the Wagga Wagga Agricultural Institute by Dr. Mark Conyers of the New South Wales Department of Primary Industries. Soil pH was determined for five replicates of 100 1 cm³ cubes in a 10 cm × 10 cm square at the soil surface, in a plot that had been cropped and grazed for several years as part of a long term experiment examining a range of cropping rotations (Heenan et al., 1994). The plot had been in a wheat-clover rotation since 1979, was in a clover phase grazed by sheep in 1991, and was sampled in 1992 prior to cultivation for sowing wheat. A sampling location was chosen for each of the five 10 cm × 10 cm grids so that the soil surface was judged to have less than 0.3 cm deviation from horizontal; the orientation of each block was not recorded. A trench was excavated around each grid and slices of soil were taken in layers through the face of the trench. The soil layers were dissected so that the whole of each 1 cm³ cube was taken from the slice. This resulted in a total of 600 1 cm³ cubes of soil taken from each of the five 10 × 10 × 6 cm³ blocks of soil. Soil pH, NO3 and NH4 were measured on each cube. To simplify the analysis we will consider the pH data for the 0-10 cm depth only.

The aim of the study was to examine and identify the extent of micro-scale spatial variation of pH in a typical field in south-western NSW. Previous studies have shown that up to 50% of the total variance occurs within a square metre (Beckett and Webster, 1971), and if progress is to be made in understanding soil acidity at the process level (Conyers et al., 1995) and in the management of variable soil acidity at the field scale (Van Vuuren et al., 2000), it is crucial that the scale at which biologically meaningful variation in pH occurs be determined.

9.3 Geostatistical mixed model

Our development of the model follows the aims of analysis of the data-sets. Geostatistics is usually concerned with the problem of producing a map or interpolation of a quantity of interest over a particular area (for simplicity we assume the area lies in R²). We assume that we have observed data at a set of n locations (of which b, possibly less than n, are distinct), with the ith observation yi taken at the location identified by a vector si, i = 1, ..., n. The observed data may relate to a single point or to a subset of points, say Ri, within a bounded subset of R²; the subsets Ri are assumed to be mutually disjoint. For example, in the soil pH example the observed data relate to a 1 cm³ cube of soil, while for the EM salinity data each observation is an integration over a pre-determined (by the scope of the EM31 sensor) range of soil, both horizontally and vertically. A model for yi is

yi = f(si) + ei    (9.3.1)

where f(si) is some function of the spatial location si and the ei are mutually independent N(0, σ²) random variables. If s represents the set of b distinct observed locations, and f(s) = (f(s1), ..., f(sn))ᵀ, then we assume

f(s) = Xsτs + Zsus(s)    (9.3.2)

where Xs is an n × p matrix of polynomials in s, often of degree 1, and the associated p × 1 vector τs is a vector of polynomial regression coefficients. This term is included in the model to account for so-called trend or non-stationary behaviour. The matrix Zs is an n × b indicator matrix for random effects at the distinct locations, accommodating duplicated locations (typically Zs = In if all the locations are distinct, otherwise b < n). Finally, us(s) is a realisation of a stationary Gaussian process, distributed independently of e = (e1, ..., en)ᵀ, with zero mean and variance matrix γsσ²Gs, so that γs is the ratio of the variance of the spatially-correlated process to the so-called nugget variance (i.e. σ²). The elements of Gs are given by ρ(si − sj; φ), where ρ(·) is a correlation function with parameter vector φ, depending on the spatial separation vector hij = si − sj. The matrix Gs is assumed positive definite. Subscripts s indicate elements specifically related to spatial effects. Combining equations (9.3.1) and (9.3.2) we have, in matrix notation,

y(s) = Xsτs + Zsus(s) + e    (9.3.3)

and

y(s) ∼ N(Xsτs, σ²(γsZsGsZsᵀ + I))    (9.3.4)

which is a mixed model with spatially correlated random effects and identically and independently distributed residual errors. In geostatistical terms, the identity matrix in (9.3.4) models a nugget effect; we shall return to this later (see section 9.4.10). Alternatively, provided the n locations are distinct (b = n), the roles of e and u can be switched, formulating the model with spatially correlated residual errors es(s) ∼ N(0, σs²Rs), with the nugget effect modelled as an independent random n × 1 effect u ∼ N(0, σs²γIn) with design matrix Z = In,

y(s) = Xsτs + u + es    (9.3.5)

and

y(s) ∼ N(Xsτs, σs²(γIn + Rs)).    (9.3.6)

Here σs² = σ²γs, and γ = 1/γs is the ratio of the nugget variance to the variance of the spatially-correlated process; the nugget effect can be excluded by setting γ = 0, equivalent to dropping u from (9.3.5). Using this form, the values of the spatial process f(s) in (9.3.2) are now Xsτs + es, and e in (9.3.3) is replaced by u. In this model the matrix Rs is actually the same as the matrix Gs in the first form (9.3.3) and (9.3.4), but we use the R notation for consistency with the notation of previous chapters. The first form, (9.3.3) and (9.3.4), is necessary when there are locations with multiple observations, and the second form, (9.3.5) and (9.3.6), is required if it is desired to fit a model with no nugget effect. We frequently interchange between these two forms in the analysis of the examples, and it is important to understand their duality and equivalence in certain cases. It is clear that a difficulty would arise if we wished to fit a model without a nugget effect for data with multiple observations.

9.4 Covariance Models for Gaussian random fields

The mathematical description of the dependence between observations at different locations is central to geostatistics, and hence the key element of our geostatistical model is the specification of the covariance model for us(s). We have already used terms such as stationarity which may be unfamiliar to some. In this section we briefly review some aspects of the theory of random fields, common nomenclature and results which are necessary for understanding the development of covariance models for us(s) and for the analysis of the examples described in section 9.2. A more thorough account can be found in Stein (1999).
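Before turning to specific covariance models, the duality between the two formulations of section 9.3 can be verified numerically. The following minimal Python sketch, using an exponential correlation function and parameter values chosen purely for illustration, checks that the marginal covariance σ²(γsGs + I) from (9.3.4) (with Zs = I) coincides with σs²(γIn + Rs) from (9.3.6) when σs² = σ²γs, γ = 1/γs and Rs = Gs:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(0, 10, size=(6, 2))           # six distinct 2-d locations
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)

phi, sigma2, gamma_s = 2.0, 1.5, 3.0          # illustrative parameter values
G = np.exp(-d / phi)                          # exponential correlation, for illustration

V_form1 = sigma2 * (gamma_s * G + np.eye(6))            # from (9.3.4), Z = I
sigma2_s, gamma = sigma2 * gamma_s, 1.0 / gamma_s
V_form2 = sigma2_s * (gamma * np.eye(6) + G)            # from (9.3.6), R = G

print(np.allclose(V_form1, V_form2))          # True: the two forms agree
```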

9.4.1 Preliminaries

In the following, for ease of notation, we drop the subscript and denote the random field us(s) by U(s). The field is Gaussian if the joint distribution of U(s1), ..., U(sn) is multivariate normal for any set of n locations. We denote the covariance function by ψ(·, ·); this function must satisfy

Σᵢ,ⱼ cᵢcⱼψ(sᵢ, sⱼ) ≥ 0,  i, j = 1, ..., n,

for all n < ∞, all s1, ..., sn ∈ R² and all real c1, ..., cn. This requirement results from

var(Σᵢ cᵢU(sᵢ)) = Σᵢ,ⱼ cᵢcⱼψ(sᵢ, sⱼ).

Furthermore, if we denote the mean of U(s) by m(s), then the joint distribution of U(s1), ..., U(sn) is multivariate normal with mean (m(s1), ..., m(sn))ᵀ and covariance matrix Ψ with ijth element ψ(sᵢ, sⱼ).

9.4.2 Stationarity

Since observations are made on only a single realisation of a random field, we cannot make much progress without further assumptions. Stationarity is an important simplifying assumption, and there are several forms. A random field is said to be weakly (or second order) stationary if
• its mean is independent of location, i.e. m(s) = µ, say, for all s ∈ R²
• the covariance function between any pair of locations s and t is a function only of the spatial separation vector h = s − t, that is, ψ(s, t) = ψ(h).

Under this assumption it is equivalent to consider the correlation function of U, denoted ρ(·), with ψ(h) = σs²ρ(h), noting that ρ(0) = 1.

Strong stationarity occurs when the complete distribution of any pair of observations, including all the moments of the distribution, depends only on the spatial separation vector. Lastly, a weaker form is intrinsic stationarity, for which the increments U(s) − U(t) have zero mean and variance depending only on the spatial separation vector h. This concept leads to the class of intrinsic random functions (IRFs) introduced by Matheron (1973) and discussed by Cressie (1993) and Zimmerman (1989). These are the spatial analogue of integrated time series, but we will not deal with them in any detail here.

9.4.3 Isotropy

Another important property of a random field is isotropy. A weakly stationary random field (in more than one dimension) is said to be isotropic if the dependence between any pair of observations depends only on the Euclidean distance between them; otherwise it is said to be anisotropic. In terms of the correlation function, ρ(s, t) = ρ(h) = ρ(d), where d = ||h||.

9.4.4 Mean square continuity and differentiability

A descriptive property of a spatial surface is its smoothness. The concept of smoothness is widely used in semi-parametric regression (Green and Silverman, 1994a; Verbyla et al., 1999a), and we can similarly define the smoothness of a random field mathematically, using the ideas of mean square continuity and differentiability (Bartlett, 1966). The random field U(s) is mean square continuous at s if

E[{U(s + ε) − U(s)}²] → 0  as  ε → 0.

Similarly, U(s) is mean square differentiable if there exists a random field U′(s) such that, for all s,

E[{(U(s + ε) − U(s))/ε − U′(s)}²] → 0  as  ε → 0.

Higher order mean square derivatives are defined in a similar manner. The mean square differentiability of the random field U(s) is directly linked to the differentiability of its covariance function at 0 (Stein, 1999).

Result 9.1 For a weakly stationary random field U(s),
• U(s) is mean square continuous at s if and only if ψ(ε) is continuous at the origin;
• U(s) is m-times mean square differentiable if and only if ψ⁽²ᵐ⁾(0) exists and is finite, and if so the covariance function of U⁽ᵐ⁾ is (−1)ᵐψ⁽²ᵐ⁾.
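As a worked instance of Result 9.1, consider the exponential and Matérn (ν = 1.5) covariance functions introduced below in section 9.4.9; the expansions here are a sketch added for illustration and are straightforward to verify:

```latex
% Exponential covariance: continuous at 0, but the |d| term gives a kink,
% so psi''(0) does not exist: U is mean square continuous but not mean
% square differentiable.
\psi(d) = \sigma^2 e^{-|d|/\phi}
        = \sigma^2\left(1 - \tfrac{|d|}{\phi} + \tfrac{d^2}{2\phi^2} - \cdots\right)

% Matern covariance with nu = 3/2: no |d| term, and psi''(0) = -sigma^2/phi^2
% is finite, so U is once mean square differentiable; the |d|^3 term prevents
% the fourth derivative existing at 0, so U is not twice differentiable.
\psi(d) = \sigma^2 e^{-|d|/\phi}\left(1 + \tfrac{|d|}{\phi}\right)
        = \sigma^2\left(1 - \tfrac{d^2}{2\phi^2} + \tfrac{|d|^3}{3\phi^3} - \cdots\right)
```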

9.4.5 The variogram

Traditional geostatistics relies heavily on the variogram. The variogram exists and is linked to the covariance function for weakly stationary random fields, but it also exists for other random fields, such as an IRF-0 (intrinsic random function of order 0). The variogram of a random field U(s) is defined to be the function V(s, t) = ½ var(U(s) − U(t)) for any s, t ∈ R². If the random field is weakly stationary then this reduces to

V(s, t) = V(h) = ½ var(U(s) − U(t)) = ½ E[{U(s) − U(t)}²] = σs²(1 − ρ(h)).

If the random field is also isotropic then the variogram is a function only of the Euclidean distance d. Technically, 2V(h) is called the variogram and V(h) the semi-variogram, but the term variogram is more often used for V(·), and we will follow this convention.

9.4.6 Geometric Anisotropy

When considering observations taken at spatial locations in two (or higher) dimensional space, we may wish to retain the assumption of weak stationarity but avoid the assumption of isotropy. This amounts to relaxing the assumption that the covariance function ψ(·) is a function only of the Euclidean distance. In R², isotropic correlation (or covariance) has circular contours of constant correlation with respect to the elements of the spatial separation vector h = (h1, h2)ᵀ. In some cases it may be preferable to align the components of h with the coordinate axes of the spatial locations, hence making the correlation model dependent on them. One such application is in the area of field trials, where axial dependence of the correlation function may be a sensible model given the likely imposition of cultural and agronomic practices on the underlying random field. More often, however, it may be more natural to allow the preferred direction with respect to the random field to correspond to a rotation of the coordinate axes. The most common form of anisotropic behaviour which achieves this is termed geometric anisotropy. Geometric anisotropy in two dimensions can be specified via a transformation of h which depends on an anisotropy angle α and an anisotropy ratio δ; higher dimensional geometric anisotropy requires more parameters to be completely general.

The correlation function of an isotropic random field is a function only of the Euclidean distance d. To convert this correlation function ρ(·) to geometric anisotropy, we apply a rotation of the original coordinates through α radians and then stretch (or shrink) the resulting axes relative to each other. In matrix notation we have

h″ = (h″1, h″2)ᵀ = S h′ = S T h,

where

S = [ √δ   0    ]      T = [  cos α   sin α ]
    [ 0    1/√δ ],         [ −sin α   cos α ],

and h′ = T h, say. The geometric anisotropic correlation function is then a function of the Euclidean distance based on h″, that is,

d² = h″ᵀh″ = hᵀTᵀS²Th.

We note that there is non-uniqueness in this metric d(·), since inverting δ and adding π/2 to α gives the same distance. This non-uniqueness can be removed by constraining 0 ≤ α < π/2 and δ > 0, or by constraining 0 ≤ α < π and either 0 < δ ≤ 1 or δ ≥ 1. Isotropy corresponds to δ = 1, and then the rotation angle α is irrelevant: correlation contours are circles, compared with ellipses in general. Figure 9.3 presents several forms of anisotropic correlation functions.

[Panels, labelled (δ, α): Isotropic (1, 0); Anisotropic (2, 0); Anisotropic (2, .4); Anisotropic (2, −.4).]

Figure 9.3 Examples of geometric anisotropy

9.4.7 Minkowski Metric

The anisotropic correlation function described in section 9.4.6 can be further generalised (for random fields in more than one dimension) by replacing the usual Euclidean metric with the so-called Minkowski metric. The Minkowski metric applied to the transformed coordinates is

d(h; δ, α, λ) = { δ|h′1|^λ + (1/δ)|h′2|^λ }^(1/λ),

recalling that h′ = (h′1, h′2)ᵀ = T h, with T the rotation matrix of section 9.4.6, and with λ usually taken to be a positive integer. When λ = 2 this metric is the Euclidean metric, and when λ = 1 it corresponds to the city-block metric used in the analysis of field trials (Cullis and Gleeson, 1991). Following Haskard et al. (2005) we can then embed this generalised metric into the correlation function ρ(·), giving ρ(h) = ρ(d(h; δ, α, λ)).

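A minimal Python sketch of this distance calculation, with parameter values chosen purely for illustration:

```python
import numpy as np

def aniso_distance(h, delta, alpha, lam):
    """Minkowski distance of a separation vector h = (h1, h2) after
    rotation through alpha radians, with anisotropy ratio delta and
    metric parameter lam (lam = 2: Euclidean, lam = 1: city-block)."""
    T = np.array([[np.cos(alpha), np.sin(alpha)],
                  [-np.sin(alpha), np.cos(alpha)]])
    h1, h2 = T @ np.asarray(h, dtype=float)
    return (delta * abs(h1) ** lam + abs(h2) ** lam / delta) ** (1.0 / lam)

# With delta = 1 the rotation is irrelevant and lam = 2 recovers the
# ordinary Euclidean distance.
print(aniso_distance([3.0, 4.0], delta=1.0, alpha=0.7, lam=2))  # 5.0
print(aniso_distance([3.0, 4.0], delta=2.0, alpha=0.0, lam=1))  # 8.0
```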
9.4.8 Separability

Correlation functions for a random field on R² can be taken to be the product of one-dimensional correlation functions; specifically, for correlation functions ρ1(·) and ρ2(·), ρ(h) = ρ1(h1)ρ2(h2). Such random fields are known as separable random fields and are widely used in the analysis of field trials (Martin, 1979; Cullis and Gleeson, 1991; Gilmour et al., 1997). They offer substantial computational savings but may not be acceptable for other spatial data. For routine geostatistical applications Stein (1999) suggests that they would be "untenable for most physical processes", primarily because of their dependence on the choice of axes; he demonstrates that this can lead to very unsatisfactory behaviour from a prediction viewpoint.

9.4.9 Parametric correlation models

Some of the parametric correlation models that have been suggested and used in geostatistics are:

exponential: ρ(d) = exp(−d/φ)

gaussian: ρ(d) = exp(−(d/φ)²)

spherical: ρ(d) = 1 − (3/2)(d/φ) + (1/2)(d/φ)³ if 0 ≤ d < φ, and ρ(d) = 0 if d ≥ φ

circular: ρ(d) = (2/π){cos⁻¹(d/φ) − (d/φ)√(1 − (d/φ)²)} if 0 ≤ d < φ, and ρ(d) = 0 if d ≥ φ

powered exponential: ρ(d) = exp(−(d/φ)ᵏ), where k is restricted to 0 < k ≤ 2 to ensure a valid correlation function; setting k = 1 or k = 2 gives the exponential or gaussian correlation functions respectively

Whittle's elementary correlation: ρ(d) = (d/φ)K1(d/φ)

where K1(·) is the modified Bessel function of the third kind of order 1 (Abramowitz and Stegun, 1965);

bounded linear: ρ(d) = 1 − d/φ if 0 ≤ d < φ, and ρ(d) = 0 if d ≥ φ.

Following the recommendations of Stein (1999) we base our correlation model on the Matérn family of correlation functions. The isotropic Matérn correlation function is given by

ρM(d; φ, ν) = {2^(ν−1)Γ(ν)}⁻¹ (d/φ)^ν Kν(d/φ),    (9.4.7)

where φ > 0 is a range parameter, ν > 0 is a smoothness parameter, Γ(·) is the gamma function, and Kν(·) is the modified Bessel function of the third kind of order ν (Abramowitz and Stegun, 1965, S9.6). For a given ν, the range parameter φ affects the rate of decay of ρ(·) with increasing d. The parameter ν > 0 controls the analytic smoothness of the underlying process us, the process being ⌈ν⌉ − 1 times mean-square differentiable, where ⌈ν⌉ is the smallest integer greater than or equal to ν (Stein, 1999, p. 31). Larger ν correspond to smoother processes.

When ν = m + ½ with m a non-negative integer, ρM(·) is the product of exp(−d/φ) and a polynomial of degree m in d. Thus if ν = ½ we obtain the exponential correlation function, ρM(d; φ, ½) = exp(−d/φ), while ν = 1 yields Whittle's elementary correlation function, ρM(d; φ, 1) = (d/φ)K1(d/φ) (Webster and Oliver, 2001, p. 119). When ν = 1.5,

ρM(d; φ, 1.5) = exp(−d/φ)(1 + d/φ),

which is the correlation function of a random field which is continuous and once differentiable; this form has been used recently by Kammann and Wand (2003). As ν → ∞, ρM(·) tends to the gaussian correlation function. Thus the Matérn correlation function offers flexibility and parsimony, and includes many other correlation functions as special cases. There are also links with generalised covariance functions of IRFs (Stein, 1999, p. 177). Figure 9.4 presents some examples of the Matérn correlation function for specific choices of ν and φ.
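A minimal Python sketch of (9.4.7), using SciPy's modified Bessel function; it also checks the ν = ½ special case against the exponential correlation:

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function K_nu

def matern(d, phi, nu):
    """Isotropic Matern correlation (9.4.7); d may be a scalar or array."""
    d = np.asarray(d, dtype=float)
    x = d / phi
    return np.where(
        d > 0,
        (x ** nu) * kv(nu, np.where(d > 0, x, 1.0)) / (2 ** (nu - 1) * gamma(nu)),
        1.0,  # rho(0) = 1 by continuity
    )

d = np.linspace(0.01, 1.0, 5)
# nu = 0.5 should reproduce the exponential correlation exp(-d/phi)
print(np.allclose(matern(d, phi=0.2, nu=0.5), np.exp(-d / 0.2)))  # True
```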

[Two panels of correlation against distance: left, Matérn with ν = 1.5 and φ = 0.05, 0.1, 0.15, 0.2; right, Matérn with ν = 0.5, 1.5, 2.5.]

Figure 9.4 Examples of the Matérn correlation function

9.4.10 Extended geometric anisotropy within the Matérn class

We will use the correlation model suggested by Haskard (2005), which is the Matérn family of correlation functions incorporating geometric anisotropy and a choice of distance metrics. This is given by

ρ(h; φ) = ρM(d(h; δ, α, λ); φ, ν)

where h = (h1, h2)ᵀ is the spatial separation vector, (δ, α, λ) govern the choice of metric and geometric anisotropy, and (φ, ν) are the parameters of the Matérn correlation function. The metric parameter λ is usually set to either 1 or 2, primarily based on context, but mindful of the criticisms aimed at the use of separable correlation models for spatial data. Geometric anisotropy is discussed in most geostatistical books (Webster and Oliver, 2001; Diggle et al., 2003), but rarely are the anisotropy angle or ratio estimated from the data. Similarly, the smoothness parameter ν is often set a priori (Kammann and Wand, 2003; Diggle et al., 2003); however, Stein (1999) and Haskard (2005) demonstrate that ν can be reliably estimated even for modest sized data-sets, subject to caveats regarding the sampling design. These issues will be investigated in the analysis of the three motivating examples. Haskard et al. (2005) present a more thorough investigation of the properties of REML estimates of the parameters in the extended correlation model described above.

9.4.11 Measurement errors - the nugget effect

The geostatistical mixed model (9.3.3) has three components: a deterministic component (Xτ) and two stochastic components (us(s) and e). The variance parameter σ², as mentioned earlier, is termed the nugget variance in the geostatistical literature, owing its rather colourful name to the origins of geostatistics in the mining industry. The mutual independence of the ei is reasonable if they represent errors arising from the measurement process. This implies that duplicated observations at the same sampling location would differ (and be statistically independent); the nugget variance could then be estimated from these duplicated observations. In practice, however, it is less common to have duplicated observations, and further empirical evidence has suggested that even the closest observations in space differ by more than the technical error (Laslett et al., 1987). The term nugget therefore reflects the fact that, by including e in the model, we are essentially modelling both micro-scale spatial variation and measurement error. We can see this by supposing that the true model for the data is

y(s) = Xsτs + Zsus(s) + Zsvs(s) + e    (9.4.8)

where the additional term vs(·) is a stationary Gaussian random field, independent of us(·), with variance γvσ² and correlation function ρv(·) such that ρv(d) = 0 for d > d0, say, where d0 corresponds to the minimum distance of the sampling design. It is therefore clear that model (9.4.8) is indistinguishable from

y(s) = Xsτs + Zsus(s) + e′

where the e′i are mutually independent Gaussian random variables with mean 0 and variance σ²(1 + γv). This wrong model would give incorrect predictions whenever the predictand is close to a sampling location. Since measurement errors are mostly unavoidable, it is generally sound practice, though not necessary, to include a nugget effect, and perhaps to be less concerned whether the discontinuity in the spatial correlation of the data at the origin is due to measurement errors alone. It can be shown (Stein, 1999) that for any σ² > 0 and a finite set of locations, the presumed mean square error of a prediction from a model ignoring the nugget effect will be too small, but, pragmatically, near-optimal predictions and accurate estimates of mean square error may still be achievable for sufficiently small measurement errors.
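The indistinguishability argument can be checked numerically: if ρv has range shorter than the minimum sampling distance, its correlation matrix at the observed locations is the identity, so vs is absorbed into the nugget. A minimal sketch, using the spherical correlation of section 9.4.9 with invented locations and parameter values:

```python
import numpy as np

def spherical(d, phi):
    """Spherical correlation: zero beyond the range phi."""
    d = np.asarray(d, dtype=float)
    return np.where(d < phi, 1 - 1.5 * (d / phi) + 0.5 * (d / phi) ** 3, 0.0)

rng = np.random.default_rng(1)
s = rng.uniform(0, 50, size=(20, 2))                      # sampling locations
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)

d0 = d[d > 0].min()                                       # minimum sampling distance
G_v = spherical(d, phi=0.9 * d0)                          # range below d0

# G_v is the identity, so Z v(s) contributes only gamma_v * sigma^2 to the
# diagonal of var(y): indistinguishable from an inflated nugget.
print(np.allclose(G_v, np.eye(20)))                       # True
```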

9.5 Prediction

The aims of the three examples emphasise that in many applications of the geostatistical linear mixed model the main objective is to form predictions. In the Cashmore and rice salinity examples, fine-scale maps of soil water and ECa are required. To achieve this objective we therefore wish to predict f(·) at a new set of locations. To keep the notation simple we consider prediction at a single new location, denoted s0 ∈ R². The form of f(s0) varies depending on which form of the geostatistical linear mixed model we fit. In the following we refer to predictions based on (9.3.3) and (9.3.5) as G-level predictions and R-level predictions respectively. Hence

f(s0) = x0ᵀτs + us(s0)   (G-level)
f(s0) = x0ᵀτs + es(s0)   (R-level)        (9.5.9)

where x0 is the vector of fixed effects for s0. Using the results of chapter 3 it is relatively simple to show that the best linear unbiased prediction of f(s0), for known γs (γ for R-level) and θ = (φ, ν, δ, α)ᵀ, is given by

f̃(s0) = x0ᵀτ̂s + gs0ᵀGs⁻¹ũs(s)   (G-level)
f̃(s0) = x0ᵀτ̂s + rs0ᵀRs⁻¹ẽs(s)   (R-level)        (9.5.10)

where gs0 = cor(us(s), us(s0)) (G-level), rs0 = cor(es(s), es(s0)) (R-level), and τ̂s, ũs(s) and ẽs(s) are solutions to the mixed model equations for the G-level or R-level forms. For the G-level form the mixed model equations are

[ XsᵀXs    XsᵀZs             ] [ τ̂s    ]   [ Xsᵀy(s) ]
[ ZsᵀXs    ZsᵀZs + γs⁻¹Gs⁻¹  ] [ ũs(s) ] = [ Zsᵀy(s) ],

written Cg β̃g = Wgᵀ y(s), say, and for the R-level form

[ XsᵀRs⁻¹Xs    XsᵀRs⁻¹     ] [ τ̂s ]   [ XsᵀRs⁻¹y(s) ]
[ Rs⁻¹Xs       Rs⁻¹ + γ⁻¹I ] [ ũ  ] = [ Rs⁻¹y(s)    ],

written Cr β̃r = Wrᵀ Rs⁻¹ y(s), with ẽs(s) = ys(s) − Wr β̃r. In this form it is also convenient to consider a more compact expression for ẽs(s). If εs(s) = u + es(s), then it can be shown that

ẽs(s) = Rs(γI + Rs)⁻¹(ys(s) − Xsτ̂s) = Rs Hs⁻¹ ε̃s(s),

say, where Hs = γI + Rs and ε̃s(s) = ys(s) − Xsτ̂s. The BLUP of f(s0) in (9.5.10) is the universal kriging estimate, after Krige (1951). The additional utility of the above result is that we can also use standard results to compute the mean squared error of prediction (MSEP), or prediction error variance (pev), of the BLUP. That is, for a G-level prediction we have

f̃(s0) − f(s0) = w0gᵀ(β̃g − βg) + {gs0ᵀGs⁻¹us(s) − us(s0)}

where w0gᵀ = [x0ᵀ, gs0ᵀGs⁻¹]. Using the result that

var [ β̃g − βg                  ]        [ Cg⁻¹   0                       ]
    [ gs0ᵀGs⁻¹us(s) − us(s0)   ] = σ²   [ 0      γs(1 − gs0ᵀGs⁻¹gs0)     ],

we hence obtain

var(f̃(s0) − f(s0)) = σ²{ w0gᵀCg⁻¹w0g + γs(1 − gs0ᵀGs⁻¹gs0) }.

For an R-level prediction,

f̃(s0) − f(s0) = x0rᵀ(τ̂s − τs) + {rs0ᵀHs⁻¹εs(s) − es(s0)}

where x0rᵀ = x0ᵀ − rs0ᵀHs⁻¹Xs. Using the result that

var [ τ̂s − τs                 ]        [ (XsᵀHs⁻¹Xs)⁻¹   0                ]
    [ rs0ᵀHs⁻¹εs(s) − es(s0)  ] = σs²  [ 0               1 − rs0ᵀHs⁻¹rs0  ],

we then obtain

var(f̃(s0) − f(s0)) = σs²{ x0rᵀ(XsᵀHs⁻¹Xs)⁻¹x0r + (1 − rs0ᵀHs⁻¹rs0) }.

Gilmour et al. (2004) present a computationally efficient algorithm for computing both G- and R-level predictions and their associated MSEPs.
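The following Python sketch illustrates an R-level prediction of the form (9.5.10) for known variance parameters, using the exponential correlation function purely for illustration; the data and parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
s = rng.uniform(0, 10, size=(n, 2))              # observed locations
X = np.column_stack([np.ones(n), s])             # degree-1 trend: 1, x, y
y = 2 + 0.3 * s[:, 0] - 0.1 * s[:, 1] + rng.normal(0, 1, n)  # toy data

phi, gamma = 2.0, 0.2                            # illustrative range and nugget ratio
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
R = np.exp(-d / phi)                             # exponential correlation
Hi = np.linalg.inv(gamma * np.eye(n) + R)        # H^-1, H = gamma*I + R

# GLS estimate of the trend: tau_hat = (X' H^-1 X)^-1 X' H^-1 y
tau_hat = np.linalg.solve(X.T @ Hi @ X, X.T @ Hi @ y)

# R-level BLUP at a new location s0: x0' tau_hat + r0' H^-1 (y - X tau_hat),
# since r0' R^-1 e_tilde = r0' H^-1 (y - X tau_hat)
s0 = np.array([5.0, 5.0])
x0 = np.array([1.0, *s0])
r0 = np.exp(-np.linalg.norm(s - s0, axis=1) / phi)
f0 = x0 @ tau_hat + r0 @ Hi @ (y - X @ tau_hat)
print(f0)
```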

9.6 Estimation

The application of (9.5.10) requires knowledge of the variance parameters, i.e. σ², γs and θ. As in previous chapters, we replace the unknown variance parameters by their REML estimates. The BLUPs are then E-BLUPs; their optimality is no longer guaranteed, and the formulae for the prediction error variances apply only asymptotically. We examine this issue in section 9.9. Haskard (2005) presents details of the REML estimation of the anisotropic Matérn class variance parameters using the AI algorithm. This requires the derivative of the variance matrix Gs with respect to each θi. These are all straightforward to compute except for ν, and so, to ensure stability, Haskard (2005) suggests that it is best to use numerical methods to compute the derivative for this parameter.

In contrast to many geostatistical texts, we share the views of Stein (1999, p. 222) and regard REML estimation of variance parameters as the only reasonable approach; estimation based on the sample semi-variogram (defined in section 9.7) cannot be recommended in general. One potential obstacle, however, to more widespread adoption of likelihood-based estimation is the burden of computing the likelihood, score and AI matrix for large irregularly spaced data-sets. Unlike the spatial analysis of field trials, which exploits the assumption of separability to overcome the computational burden, we are not in a position to use such assumptions in general for irregularly spaced data. Moreover, it is the lack of sparsity in Gs⁻¹ which is the major source of computational load. Stein et al. (2004) describe an approach for efficiently obtaining REML estimates in large spatial data-sets which may resolve these problems.

9.7 Model building and diagnostics

The other major advantage of the geostatistical linear mixed model (over classical geostatistics) is that the relative fit of different variance models can be assessed within the framework of REML likelihood ratio tests. As in chapter ??, the REML likelihood ratio for testing hypothesis H0 nested within hypothesis H1, where H1 contains an additional k parameters, is given by

D = −2{ℓR(ρ̂0; y) − ℓR(ρ̂1; y)}

where ρ̂0 and ρ̂1 are the REML estimates of ρ, the vector of all variance parameters, for the models under H0 and H1 respectively. The statistic D is asymptotically distributed as a chi-squared variable with k degrees of freedom. There are several exceptions which are relevant for the models we may consider. For example, the distribution theory for D is complicated when a parameter is on the boundary of the parameter space under H0 (Stram and Lee, 1994); a test of the need for the inclusion of a nugget effect would require such attention.

A test for (geometric) anisotropy is another example which deserves closer attention, and this issue has been considered in some detail by Haskard (2005). At first glance, the usual χ² test may not seem to apply, as isotropy occurs when δ = 1, with anisotropy otherwise. This suggests a test of H0: δ = 1 against H1: δ ≠ 1. Although the boundary issue mentioned above does not apply, because α can be constrained to 0 ≤ α < π/2 and then δ = 1 is not on the boundary, the test appears non-regular because the angle α is required under H1 but not under H0, and therefore cannot be estimated under H0. Haskard (2005) demonstrates that this issue can be overcome by using the alternative invertible reparameterisation

ξ1 = (δ − 1/δ) sin(2α)
ξ2 = (δ − 1/δ) cos(2α),

as isotropy corresponds exactly to H0: (ξ1, ξ2) = (0, 0) and anisotropy to H1: (ξ1, ξ2) ≠ (0, 0), leading to a χ²₂ reference distribution for D under H0. For comparison of non-nested models we use an extended version of the AIC or BIC criteria based on the REML log-likelihood.

The formal model selection process is aided by various graphical tools based on either the original data or, more often, BLUPs or residuals from intermediate and final models. Included among these are graphical displays of the sample omni-directional or directional semi-variograms, which are defined below.

9.7.1 Sample semi-variograms

The sample omni-directional semi-variogram is based on the empirical omni-directional semi-variogram of the BLUP of a random field, ũs = (ũs(s1), ..., ũs(sn))ᵀ, which is given by the set of points {(dij, ṽij) : j < i}, where dij = ||si − sj|| and ṽij = ½(ũs(si) − ũs(sj))².

In a regular spatial sampling design, say an r × c array with equal spacings (r1 and r2, say), the separation vectors and the set of pairwise distances take a smaller number of unique values, and it may then be sensible to average the ṽij for each of these distinct values of dij. For irregular spatial sampling designs there may be no replication of the unique values of dij, and the standard approach is to 'bin' the semi-variances according to the following principle. The sample omni-directional semi-variogram is the set of points (dk, v̄k), where the dk, k = 1, ..., q, are pre-specified distances and

v̄k = (1/nk) Σ_{dij ∈ Sk} ṽij,

where Sk is the set of points for which d is closer to dk than to any other dk′, and nk is the number of elements in Sk.

To examine anisotropy we need to consider graphical displays of the semi-variance in terms of the elements (or functions) of the spatial separation vector h. The empirical semi-variogram cloud is defined to be the set of triples (hij1, hij2, ṽij). This is then graphically represented by two forms of 'binning', based either on the cartesian coordinates (h1, h2) or on the polar coordinates (d, t), where d = √(h1² + h2²) and t = tan⁻¹(h2/h1). The former display, based on cartesian coordinates, has been widely used for the analysis of field trials (see chapter 11); the latter, based on polar coordinates, is usually more suitable for other applications of the geostatistical linear mixed model.

The cartesian sample semi-variogram cloud is defined as the set of points (hk1, hl2, v̄kl), k = 1, ..., q1 and l = 1, ..., q2, with

v̄kl = (1/nkl) Σ_{hij1 ∈ Sk1, hij2 ∈ Sl2} ṽij

for pre-specified sets of lags hk1 and hl2, where Sk1 is the set of values of h1 closer to hk1 than to any other hk′1, Sl2 is the set of values of h2 closer to hl2 than to any other hl′2, and nkl is the number of points such that hij1 ∈ Sk1 and hij2 ∈ Sl2.

The directional empirical semi-variogram is the set of points {(dij, tij, ṽij) : j < i}, and the directional sample semi-variogram is the set of points (dk, tl, v̄kl) with

v̄kl = (1/nkl) Σ_{dij ∈ Sk, tij ∈ Tl} ṽij

for pre-specified sets of distances dk and angles tl, where Sk is the set of values of d closer to dk than to any other dk′, Tl is the set of values of t closer to tl than to any other tl′, and nkl is the number of points such that dij ∈ Sk and tij ∈ Tl. Figure 9.5 shows a binning region for 45° with d = 1.5.


Figure 9.5 Typical binning region for a directional sample semi-variogram
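A minimal Python sketch of the binned omni-directional sample semi-variogram just described, computed from a vector of values at observed locations; all inputs here are invented for illustration (the random values stand in for BLUPs of a random field):

```python
import numpy as np

def sample_semivariogram(s, u, bin_edges):
    """Binned omni-directional sample semi-variogram of the values u
    observed at locations s (n x 2), averaging the half squared
    differences v_ij within pre-specified distance bins."""
    i, j = np.triu_indices(len(u), k=1)
    d = np.linalg.norm(s[i] - s[j], axis=1)        # pairwise distances d_ij
    v = 0.5 * (u[i] - u[j]) ** 2                   # semi-variances v_ij
    k = np.digitize(d, bin_edges)                  # assign each pair to a bin
    d_bar = np.array([d[k == b].mean() for b in range(1, len(bin_edges))])
    v_bar = np.array([v[k == b].mean() for b in range(1, len(bin_edges))])
    return d_bar, v_bar

rng = np.random.default_rng(3)
s = rng.uniform(0, 10, size=(100, 2))
u = rng.normal(size=100)                           # stand-in for BLUPs
print(sample_semivariogram(s, u, bin_edges=np.linspace(0, 14, 8)))
```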

9.8 Analysis of examples

9.8.1 Cashmore Field

Figure 9.6 presents plots of the soil water content against the x and y coordinates of the data, which are aligned with the natural topography of the field and run from west to east and from south to north. There are obvious increasing trends in soil water content from north to south and from east to west. Lark et al. (1998) pointed out that a previous soil survey indicated that the north of the field overlies the Lower Greensand (Cretaceous sands). The Gault Clay is downfaulted against the Lower Greensand, with the boundary running across the sampled area from west-north-west to east-south-east in the southern part of the field. These solid formations are overlaid by loose material. In the north-eastern part of the field the soil is dominated by coarse alluvium, with heavy textured soils in the north-west. The south-west of the field has a sandy loam, while in the south-east the soil is an Evesham series formed in swelling clay to the surface and loamy textured alluvium. This means that the driest soils are in the north-east of the sampled area and the wettest in the south-west.


Figure 9.6 Scatter plots of soil water content against sampling coordinates

Figure 9.7 shows the sample omni-directional and directional semi-variograms for the residuals from zero, first and second degree polynomial trend models. There is no evidence of "flattening" of the semi-variogram for the raw data, which strongly supports the need to include terms accounting for the spatial trend linked to the changing soil type across the field; the degree of anisotropy is also linked to the changing soil types. Inclusion of first degree polynomials substantially improved the situation; however, inclusion of second order terms in both x and y appears preferable. The sample semi-variogram for the residuals from this model is more reasonable, though there is a suggestion of remaining anisotropy, which will be examined in more detail in the following.

To avoid numerical difficulties we have rescaled the x and y coordinates in the fitting of the correlation models by dividing both by 100; for Matérn models this affects the estimate of φ only. Table 9.1 presents a summary of the sequence of correlation models with a second degree polynomial trend. We begin by fitting a series of models with ν fixed at 0.5, 1, 1.5 or 2. This helps to improve convergence by choosing sensible starting values and avoids most numerical problems with the scale of the x and y values. The results for ν = 0.5 and ν = 1 are presented (M1 and M2) for comparison. This process suggests that a reasonable starting value for ν would be 0.5. Fitting M3 resulted in REML estimates of ν̂ = .259 and φ̂ = .242 (a rescaled value of 24.2). Before including a nugget effect we examined the assumption of isotropy (M4). The REML estimates of the anisotropy ratio and angle for M4 were 3.93 and .693 respectively, though there was only a modest increase in the REML log-likelihood,

[Panels NULL, LIN and QUAD: semi-variance against distance, by angle (0, 45, 90, 135 degrees) and omni-directional.]

Figure 9.7 Sample directional semi-variograms for residuals from three trend models for the Cashmore data

which was insufficient to reject the hypothesis of isotropy (χ²₂ = 1.80, p > .05). It was not possible to achieve sensible convergence for either M3 or M4 with a nugget effect. This is a typical problem in our experience with moderate sample sizes and less than optimal sampling designs. Our approach in these situations is to fix the value of ν at a value which is consistent with the data; this almost always overcomes the problem and is a reasonable and practical way forward in the absence of a more exhaustive data-set. Haskard (2005) examines the convergence and properties of REML estimation within the class of correlation models we use. Her results suggest that REML estimates achieve low bias and good convergence properties for moderate (i.e. n ≥ 100) sample sizes with good sampling designs.

Models M5 and M6 fix ν = 0.5 and include a nugget effect, for an isotropic and an anisotropic correlation model respectively. M5 is the preferred model, though anisotropy is still a lingering issue. The iso-correlation contours for the anisotropy ratio and angle from M6, plotted against north-south and east-west displacements, are presented in figure 9.8. The contours are more or less in agreement with the pattern in the directional sample semi-variograms, with the stronger dependence running north-west to south-east. If we assume the bulk of the large scale spatial trend has been accounted for by the second degree polynomial, then this anisotropy may be due to small scale spatial variation in soil water content arising from either natural or non-natural sources. The other important feature of the field is the topography. There is a gradual slope (downwards) from north to


Figure 9.8 Correlation iso-contours of M6 for the Cashmore example

south (see figure 1 of Lark et al. (1998)), which necessitates all cultural operations being conducted in an east-west direction. The gridded sampling scheme is, in hindsight, perhaps not optimal for differentiating between an isotropic and an anisotropic correlation model for these data.

To examine this anisotropy empirically we fitted the anisotropic city-block correlation model (Ani-CB). This model (M7) has a REML log-likelihood of -66.1, which is quite an improvement over M5. This lends empirical support to the hypothesis that the small scale spatial variation may be partially a result of the cultural and agronomic practices which have been used in this field for many years. There is really insufficient data and relevant background information to judge which is the superior model; the issue of model mis-specification is examined later, in section 9.9.

Figure 9.9 presents contour plots based on E-BLUPs computed on a 34 × 23 grid (approximately 8 m × 8 m) generated by the intersection of the unique x and y sampling locations, for models M5 (Iso-Euc) and M7 (Ani-CB). This shows the smooth trend from the driest soils in the north-east corner to the wettest in the south-west. The contours from the Ani-CB correlation model are far more "edgy". The

Table 9.1 Summary of sequence of models fitted to the Cashmore data: GA represents geometric anisotropy, bolded terms are fixed in the model

Model   λ    ν̂      φ̂      δ̂     α̂      σ²     σs²    ℓR      num. par.
M1      2   0.5    .096    1     0      0      1.77   -71.1       2
M2      2   1.0    .047    1     0      0      1.70   -72.2       2
M3      2   .259   .243    1     0      0      1.91   -70.5       3
M4      2   .338   .205   3.93  .693    0      1.88   -69.6       5
M5      2   0.5    .187    1     0     .473    1.43   -70.4       3
M6      2   0.5    .187   3.92  .694   .308    1.53   -69.4       5
M7      1   0.5     -      -     -     .825    5.18   -66.1       4

MSEPs are smallest near the sampling locations, indicating the influence of the stationary component of the model. Scatter diagrams of the E-BLUPs and MSEPs for each correlation model are presented in figure 9.10, to illustrate the degree of disagreement in more detail. The E-BLUPs are in reasonable agreement away from the edges of the field, where there is good coverage in the sampling design (i.e. 450 < y < 600, 500 < x < 750); some points are highlighted by their x coordinate (/100) to support this comment. There is much more discrepancy between the MSEPs of the E-BLUPs for the two models: the MSEPs for the Ani-CB model are always less than those for the Iso-Euc model. We return to this issue in section 9.9, but conclude by remarking that even with the most rigorous analysis we are often limited in model selection by insufficient data, shortcomings of design, or lack of contextual information.

9.9 Simulation Study

In this section we describe a limited simulation study to examine the following issues:
• for a moderate sample size and reasonable sampling design, do model-based MSEPs provide reliable estimates of the precision of the E-BLUPs?
• what is the effect of model mis-specification on MSEPs?

To our knowledge there has been little work done on these questions. Stein (1999, ch. 6) provides limited evidence that model-based MSEPs are reasonable if the true model is fitted; if a wrong correlation model is fitted, say a gaussian instead of a true Matérn model with ν = 1.5, then the MSEPs of the E-BLUPs from this model are seriously biased. His study was for a sample size of 23 and for a one-dimensional process. We extend these results to two dimensions and a sample size of 100.

Our study is based on the sampling scheme of Haskard (2005), which is depicted in figure 9.11. A total of 20 points were removed from a 10 × 10 grid


Figure 9.9 Contour plots of E-BLUPs and MSEPs for Ani-CB and Iso-Euc correlation models for the Cashmore example

and replaced by three clusters of locations, in either a horizontal, vertical or diagonal pattern. Data were generated from two correlation models: either an anisotropic city-block model with φᵀ = (.8, .3)ᵀ, or an isotropic Matérn correlation with ν = 0.5 and φ = 2. These models represent extreme forms of isotropic versus anisotropic correlation and are of relevance to the analysis of the Cashmore example. A nugget effect was included with γ = .111, so that the total variance was 1. For each simulated data-set we fitted the Ani-CB and the Iso-Euc models. Predictions were made at the three locations shown in figure 9.11: location "a" is a central location with row, column and diagonal neighbours but without close support, "b" is on the edge of the sampling design without close support, and "c" is a central location with close support. A total of 800 simulations were done for each correlation model.

Table 9.2 presents a summary of the results. Model-based MSEPs and empirical MSEPs are presented for each true model and show excellent agreement for both correlation models. This is very encouraging, as the use of so-called

Figure 9.10 Scatter plots of E-BLUPs and MSEPs for the Ani-CB and Iso-Euc correlation models for the Cashmore example

“plug-in” estimates has been widely criticised within the recent Bayesian literature on the analysis of spatial data, on the grounds that they underestimate the precision of the E-BLUPs. When an incorrect model is fitted the results are less encouraging. If an Iso-Euc correlation model is fitted to data generated from an Ani-CB correlation model then both the model based and empirical MSEPs are substantially inflated. When an Ani-CB correlation model is fitted to data generated from an Iso-Euc correlation model then the empirical MSEPs are only slightly inflated, while the model based MSEPs are perhaps slightly conservative. The tentative conclusions based on this limited study are:

• model based MSEPs provide reasonable estimates of the precision of E-BLUPs when the correct correlation model is fitted;

• if anisotropy is suspected it is far preferable to “overfit”, and pay a small price for bias when using model based MSEPs.
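These calculations are straightforward to reproduce in outline. The sketch below is our own illustration, not the authors' code: it uses a plain unit-spaced 10 × 10 grid (the actual design replaces 20 grid points with clusters), a known zero mean, and simple kriging with known parameters in place of the full REML/E-BLUP machinery. Data are generated from the isotropic exponential (Matérn ν = 0.5) model with a nugget so that the total variance is 1, and the model based MSEP at a central prediction location is compared with the empirical MSEP over 800 simulations.

import numpy as np

rng = np.random.default_rng(1)

# Unit-spaced 10 x 10 sampling grid plus one central prediction location
xy = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
x0 = np.array([[4.5, 4.5]])
n = len(xy)

phi, gamma = 2.0, 0.111            # exponential range; nugget (total var = 1)

def corr(a, b):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return np.exp(-d / phi)

K = (1 - gamma) * corr(xy, xy) + gamma * np.eye(n)   # var(y), y = field + nugget
k0 = (1 - gamma) * corr(xy, x0)[:, 0]                # cov(y, field at x0)

w = np.linalg.solve(K, k0)                           # simple-kriging weights
msep_model = (1 - gamma) - k0 @ w                    # model based MSEP

# Empirical MSEP: simulate the smooth field at data and prediction sites
pts = np.vstack([xy, x0])
L = np.linalg.cholesky((1 - gamma) * corr(pts, pts) + 1e-10 * np.eye(n + 1))
sq_err = []
for _ in range(800):
    f = L @ rng.standard_normal(n + 1)               # smooth random field
    y = f[:n] + np.sqrt(gamma) * rng.standard_normal(n)
    sq_err.append((w @ y - f[n]) ** 2)

print(f"model based MSEP {msep_model:.3f}, empirical {np.mean(sq_err):.3f}")

The two printed values agree closely, mirroring the excellent agreement reported for the true-model fits in Table 9.2.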

Figure 9.11 Sampling scheme for the simulation study

9.9.1 Electromagnetic salinity

Scatter plots (figure 9.12) of ECa against northing (labelled the y direction in the following) and easting (labelled the x direction) suggest possible non-stationarity, which may be accounted for by inclusion of a second-degree polynomial in northing and easting. Figure 9.13 presents empirical omnidirectional and directional semi-variograms in four directions, for the raw data and for the residuals after an ordinary least squares (OLS) fit of a quadratic surface in northing and easting; these provide added support for the need to allow for a global trend. However, there remains evidence of anisotropy, which we investigate more formally in the following.

For these data we focus on anisotropic correlation models within the Matérn class and use the model formulation (9.3.3) to accommodate the five duplicated locations. This models the spatially dependent random field as a G-structure. We feel that there are insufficient duplicated observations to obtain a reliable external estimate of measurement error, and the duplication process was not an intentional part of the sampling design.

Table 9.3 presents a summary of the models fitted to these data. As in the Cashmore example we begin with models which fix ν, to obtain sensible starting values. These model fits (M1, M2 and M3) showed that a reasonable starting value for ν is 1.0. The fit of the exponential model, with a nugget effect, resulted in instability in the convergence sequence, with φ and σ² becoming unacceptably large. Geometric anisotropy is supported for these data by a comparison of M4 with M5, giving a value of D = 383.12, which is highly significant (p ≪ 0.001) when compared with a χ²₂ distribution.

Table 9.2 MSEPs for the simulation study, numbers in brackets are percentages of the empirical MSEPs for the true model

(a) True model Ani-CB

            Ani-CB fitted              Iso-Euc fitted
Location    Model Based   Empirical    Model Based   Empirical
a           .214 (101)    .212 (100)   .479 (226)    .429 (202)
b           .327 (88)     .371 (100)   .527 (142)    .616 (166)
c           .078 (99)     .079 (100)   .122 (154)    .130 (164)

(b) True model Iso-Euc

            Iso-Euc fitted             Ani-CB fitted
Location    Model Based   Empirical    Model Based   Empirical
a           .365 (106)    .343 (100)   .294 (86)     .373 (109)
b           .410 (101)    .406 (100)   .353 (87)     .441 (109)
c           .093 (90)     .103 (100)   .113 (110)    .106 (103)

Table 9.3 Summary of the sequence of models fitted to the EM data; values marked * were fixed in the model

Model   ν̂       φ̂      δ      α̂       σ̂²     σ̂²_s   ℓ_R       num. par.
M1      0.5*    -      1*     0*      -      -      -5104.9   3
M2      1.0*    22.4   1*     0*      5.87   447    -5043.4   3
M3      1.5*    12.7   1*     0*      8.23   370    -5046.4   3
M4      1.12    18.7   1*     0*      6.51   415    -5042.7   4
M5      0.770   40.7   2.63   −.410   6.55   611    -4851.1   6

Figure 9.14 is a contour map of E-BLUPs from M5 for the full rice bay. This was produced by evaluating the E-BLUPs of f(·) at each of the 1996 unique sampling locations. Overlaid on this map are the estimated rotated axes. Correlation remains highest (the range parameter is largest) along the nearer-to-vertical axis, and drops most quickly along the more horizontal axis (−23.5° to the horizontal). The anisotropy ratio and anisotropy angle produce ellipsoidal iso-correlation contours which are aligned with the direction of the slope and water flow in the rice field. This is shown more clearly in figure 9.15, which presents the axes of the ellipsoidal iso-correlation contours. The anisotropy ratio of 2.63 results in more rapid decay of correlation in the direction parallel with the slope and water flow.

Figure 9.16 is a plot of the E-BLUPs from M4 and M5, showing that even with such a large data-set, a common trend model and dense sampling locations, there is significant variation in predicted values relative to their model based MSEPs.

Figure 9.12 Scatter plots of ECa against sampling coordinates (Relative Northing, y, and Relative Easting, x)

9.9.2 Fine-scale soil pH data

Empirical directional semi-variograms are presented in figure 9.17 for the raw fine-scale soil pH data. There is strong evidence of non-stationarity for most blocks, particularly 2, 4 and 5. These also suggest anisotropy for all but block 4. Empirical directional semi-variograms of the ordinary least squares residuals, after fitting a linear global trend surface separately to each block, are displayed in figure 9.18. The non-stationarity appears to have been substantially accounted for, but anisotropy may still be present. The other notable point is that the ranges of the random fields appear comparable, but the variances are quite different between blocks. We consider these issues in the formal modelling to follow.

In the modelling we used the formulation (9.3.5) to ensure numerical stability if the nugget variance component is zero. In fact, it is most likely that the nugget variance will be very small, since pH was determined for complete adjoining 1 cm³ cubes, effecting a complete survey of the five small blocks. The blocks were all sampled from the same field plot, though not on the same date. Hence the sequence of models we fitted examined which of the spatial covariance parameters could reasonably be assumed equal. As these models included geometric anisotropy it was not sensible to fix the anisotropy angle to be equal across blocks, as the individual orientation of the five blocks was not recorded.

Figure 9.13 Sample variograms of the residuals from a quadratic trend model for the EM data

The same trend model, with a linear effect for x and y for each block, was maintained throughout. A common nugget variance was also included in each model. Table 9.4 presents the sequence of models fitted to these data. Two models (M1 and M5) failed to converge satisfactorily. Commencing from M2 (equal σ²_s) we examine equality of each remaining variance parameter in turn, accepting equality of δ (χ²₄ = 3.64) and choosing M8 as the basis of the next round of testing. From this model we accept equality only of ν (χ²₄ = 2.52), choosing M14 as the preferred model. No further reduction from this model can be achieved. Based on the AIC criterion, M14 is therefore the preferred model, with equal δ, ν and σ²_s. REML estimates of the common variance parameters from M14 were σ̂²_s = .180, σ̂² = .00137, ν̂ = 1.078 and δ̂ = 1.36, and REML estimates of the anisotropy angles and ranges for the five blocks were α̂^T = (1.57, .854, 1.40, 1.28, 2.13) and φ̂^T = (.982, 1.04, 1.98, 1.47, 3.71) respectively.

Using an extension of the χ² test of isotropy to the situation with five separate anisotropy angles yields an asymptotic χ²₆ likelihood-ratio test statistic of 18.46 (p = 0.005). As for both of the previous examples there is ample evidence of anisotropy, even at a micro-scale level. Most geostatistical modelling would ignore it, using more commonly available models such as a spherical or exponential model. The consequences of fitting an incorrect model have already been shown in terms of MSEPs, though our simulation study is very limited and more work in this area is needed. The REML estimate of ν implies that the spatial random field is just differentiable.

Figure 9.14 Contour plots of E-BLUPs from M5 for the EM data

The final model is interesting and biologically explainable. The variation of the range parameter between blocks (from 0.98 cm to 3.71 cm) is most likely a result of variation in organic matter at a micro-scale level in the top 1 cm of soil.

Figure 9.15 Direction of rotated axes for the EM data

Figure 9.16 Scatter plot of E-BLUPs from M4 and M5 (standardised difference, Iso − An, against the anisotropic EM38v E-BLUPs, ν = 0.77)

Figure 9.17 Sample variograms for each block of the raw data for the pH data

Figure 9.18 Sample variograms for each block of the residuals from a linear trend model for the pH data

Table 9.4 Summary of the sequence of models fitted to the pH data; ≠ indicates parameters not constrained to be equal between blocks, = indicates parameters constrained to be equal between blocks

Model   δ   ν   φ   σ²_s   ℓ_R      num. par.   AIC
M1      ≠   ≠   ≠   ≠      NA       26          NA
M2      ≠   ≠   ≠   =      512.55   22          18.9
M3      ≠   ≠   =   ≠      510.55   22          22.9
M4      ≠   =   ≠   ≠      510.96   22          22.1
M5      =   ≠   ≠   ≠      NA       22          NA
M6      ≠   ≠   =   =      506.19   18          23.6
M7      ≠   =   ≠   =      509.92   18          16.2
M8      =   ≠   ≠   =      510.73   18          14.5
M9      ≠   =   =   ≠      507.24   18          21.5
M10     =   ≠   =   ≠      509.16   18          17.7
M11     =   =   ≠   ≠      509.67   18          16.6
M12     ≠   =   =   =      465.72   14          96.6
M13     =   ≠   =   =      504.81   14          18.4
M14     =   =   ≠   =      508.47   14          11.1
M15     =   =   =   ≠      506.10   14          15.8
M16     =   =   =   =      450.62   10          118.8
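The comparisons behind Table 9.4 are standard REML likelihood-ratio tests between nested variance models (the trend model is held fixed across models, so the REML log-likelihoods are directly comparable), together with AIC computed as −2ℓ_R + 2 × (number of parameters), up to a constant. A small helper of our own reproduces the M2 versus M8 comparison quoted in the text:

from scipy.stats import chi2

def reml_lrt(loglik_full, p_full, loglik_reduced, p_reduced):
    """REML likelihood-ratio test for nested variance models."""
    d = 2.0 * (loglik_full - loglik_reduced)
    df = p_full - p_reduced
    return d, df, chi2.sf(d, df)

# M2 (22 parameters) vs M8 (18 parameters), log-likelihoods from Table 9.4
d, df, p = reml_lrt(512.55, 22, 510.73, 18)
print(f"D = {d:.2f} on {df} df, p = {p:.2f}")    # D = 3.64: accept equal delta

# AIC differences match the table, e.g. AIC(M8) - AIC(M2):
print(f"{(-2 * 510.73 + 2 * 18) - (-2 * 512.55 + 2 * 22):.1f}")   # -4.4 = 14.5 - 18.9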

CHAPTER 10

Population and Quantitative Genetics

10.1 Introduction

The developments in quantitative genetics, and in the molecular genetics involved in QTL analysis, are based on certain assumptions. These assumptions arise from classical genetics and date back to the 19th century.

10.2 Mendel’s Laws

Gregor Mendel entered the Augustinian monastery in Brno in 1843 to prepare for the priesthood. The monastery was a place of learning, which encouraged the study of the natural sciences alongside religion. Mendel’s interests lay in the field of hybridization, or the interbreeding of pure-breeding strains with one another. He chose plants as his experimental tool and chose to work with a single species, the common garden pea. Mendel worked with simply inherited traits, but his deductions provide the basic foundation on which knowledge of the inheritance of complex traits has been built.

Figure 10.1 Gregor Mendel

10.2.1 Law of segregation

The two members of a gene pair segregate (separate) from each other into the gametes, so that one half of the gametes carry one member of the pair and the other half carry the other member of the gene pair.

10.2.2 Law of independent assortment

During gamete formation the segregation of one gene pair is independent of other gene pairs.

Mendel had no knowledge of the physical nature of genetic material when, in 1865, he promulgated his laws, which were based on years of data collection and analysis. Mendel was fortunate that all of the seven traits he examined were on separate chromosomes. Genes that are on the same chromosome are said to be linked and do not follow Mendel’s second law.

Figure 10.2 Mendel’s law of segregation: colour

10.2.3 Linkage

At the turn of the 20th century, three geneticists working in different parts of the world, and on different organisms, re-discovered Mendel’s laws. Mendel’s factors were now called genes, and some scientists had noted that chromosomes seemed to be involved in reproduction and might be related to the biological explanation of Mendel’s laws. In 1902, Walter Sutton correctly hypothesized that Mendel’s genes were located on chromosomes. Sutton did his research at Columbia University, which was fast becoming a hotbed for the blossoming field of genetics. Scientists at Columbia also noticed that some genes in fruit flies were statistically linked to each other in a way that seemed to contradict Mendel’s laws.

Thomas Hunt Morgan’s laboratory at Columbia University in New York City focused on the genetics of Drosophila melanogaster, the common fruit fly, and used this simple organism to tease apart the mechanisms of heredity. Morgan’s ”fly room” combined classical genetics with microscopy to prove that chromosomes have a definite function in heredity, establish mutation theory, and outline the fundamental mechanisms of heredity. Morgan and his associates developed important applications of crossing-over and genetic mapping, and helped initiate cross-disciplinary science, that is, the use of what had been learned in other biological disciplines to explain common over-arching themes. In many respects flies are similar to mice, as well as humans; in the same way, some aspects of genetics are related to physiology, chemistry, even physics.

Figure 10.3 Thomas Hunt Morgan

Morgan was a developmental biologist who came to Drosophila while studying Mendelian heredity patterns in rodents. Flies were fast-breeding and resilient, and seemed the perfect organism for observing the general patterns of heredity. Soon into his experiments, Morgan came across a male fly with white eyes - a random mutation from the normal red colour. This chance discovery inspired a series of breeding experiments with red-eyed females. Some white-eyed flies were produced, and this was expected, in keeping with Mendelian inheritance ratios. However, all the white-eyed flies were male, without exception. This gave rise to one of Morgan’s seminal insights - the discovery of sex-linked characteristics.

It gradually dawned on Morgan that some traits had a greater chance of being inherited together. He reasoned that there must be some physical reason for this, and he realised that the traits had an actual location on the chromosome, and that their positions relative to each other dictated how likely they were to be inherited together. Eventually, these discoveries would lead Morgan to realise that specific traits were generated from specific genes, each of which had a location along a specific chromosome. The genetic distance measure is named after Morgan in recognition of his major contribution.

10.3 Population genetics

Population genetics is concerned with relating heritable changes in populations of organisms to the underlying individual processes of inheritance and development. An extension of Mendelian genetics, it deals with frequencies of genes and genotypes in populations but does not assign a genotypic value to each genotype. Thus for a single locus two alleles might occur, B and b say, and in a diploid we have possible genotypes BB, Bb and bb. Population genetics is concerned with the frequencies of alleles, and hence of genotypes, in subsequent generations.

10.3.1 Hardy-Weinberg Equilibrium

Suppose we have a population in which the frequencies of alleles B and b at a particular locus are p_B and p_b respectively. Under random mating, the genotypic frequencies are therefore

p_BB = p_B²,    p_Bb = 2 p_B p_b,    p_bb = p_b²

and these frequencies are constant from generation to generation. This is Hardy-Weinberg equilibrium.
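This is easy to verify numerically. The following sketch (our own illustration) simulates several generations of random mating in a large population and shows the genotypic frequencies sitting at, and remaining at, the Hardy-Weinberg values:

import numpy as np

rng = np.random.default_rng(0)
pB = 0.3                                   # frequency of allele B

print(f"expected: bb={(1-pB)**2:.4f} Bb={2*pB*(1-pB):.4f} BB={pB**2:.4f}")

N = 200_000
alleles = rng.random((N, 2)) < pB          # True = allele B on each gamete
for gen in range(3):
    geno = alleles.sum(axis=1)             # 0 = bb, 1 = Bb, 2 = BB
    freq = np.bincount(geno, minlength=3) / N
    print(f"gen {gen}: bb={freq[0]:.4f} Bb={freq[1]:.4f} BB={freq[2]:.4f}")
    # Random mating: each offspring receives one randomly chosen allele
    # from each of two randomly chosen parents
    mum = alleles[rng.integers(N, size=N), rng.integers(2, size=N)]
    dad = alleles[rng.integers(N, size=N), rng.integers(2, size=N)]
    alleles = np.column_stack([mum, dad])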

10.3.2 Assumptions

The assumptions underlying Hardy-Weinberg equilibrium are:

1. sexual reproduction
2. non-overlapping generations
3. random mating
4. no natural selection
5. no migration
6. no mutation
7. large (infinite) population size

If there are systematic forces that impact on a population then it is possible to predict the changes in gene frequency in both amount and direction. Thus if selection occurs within a population, or migration is possible between populations, or mutation rates of loci are known, the changes to Hardy-Weinberg equilibrium can be modelled. There are also dispersive forces for which it may be possible to predict the amount but not the direction of change, so-called genetic (or random) drift.

10.3.3 Types of mating

Under random mating, the genetic character being studied has no influence on the choice of mate. With assortative mating, phenotypically alike (or phenotypically unalike) individuals mate preferentially. Clearly the mating system impacts on the frequencies of alleles and genotypes.

10.3.4 Selection

Selection of progeny to progress to the next generation, and hence to reproduce, will also impact on allele frequency. A selection advantage of an allele will result in departure from Hardy-Weinberg equilibrium.

10.3.5 Mutation

Mutations result in a change from one allele to another through a sudden heritable change in genetic material. This may result in a new allele, or change one allele into an existing allele at the locus. Mutation is the ultimate source of new genetic material. Many mutations are lethal and hence non-recurring. When a mutation is not lethal, the frequency of the allele will increase with the mutation rate. Typically, mutation rates are of the order of 10⁻⁴ to 10⁻⁸ mutations per generation; the influence of mutation is therefore felt over a number of generations. There is often an equilibrium between mutation in two directions, with reverse mutation about 0.1 times as frequent. While mutations are the source of variation, the process of mutation does not itself drive genetic change (evolution), because the rate of change in gene frequency from the mutation process is very low, due to the low rate of spontaneous mutation.

10.3.6 Genetic drift

Changes in gene frequency can occur simply by chance, particularly in small populations; this is influenced by the number of parents in the population. Genetic drift was thought to be a major component of genetic evolution.

10.3.7 Inbreeding

If mating between relatives occurs more commonly than would occur by pure chance, then the population is inbred. The result is to increase the number of homozygous genes in the population, such that completely inbred lines are homozygous at all loci. This clearly alters the genotype frequencies.

10.4 Quantitative genetics

Quantitative genetics is concerned with the inheritance of quantitative traits. Both genetic and environmental components are considered, with the genetic effects arising from a possibly large number of genes.

10.4.1 Model

The basic quantitative genetics model is assumed here, and hence we assume that trait observations come from the model specification

y_ij = μ + g_i + ε_ij

for genotypes i = 1, 2, ..., n_g, and replicates j = 1, 2, ..., r_i. In accord with quantitative genetics the mean of y_ij is μ. This places constraints on the realised values of g_i, as does the pedigree structure which contains these genotypes (as well as other genotypes).

10.4.2 Effects

The genotypic effects g_i will be composed of additive effects (in the case of a diploid) of the two alleles that make up the genotype, and their “interaction” or dominance deviation. These components will be relevant for each gene. In addition, the interaction between genes may be important and hence epistatic effects may be postulated.

The development presented below involves accounting for additive and dominance effects in a pedigree, and developing mean, variance and covariance terms that incorporate relationships between genotypes. The epistatic component will be taken as a random effect, and the development will not include interactions between additive and dominance components between genes; these will all be encapsulated by the simple epistatic “residual” term.

10.4.3 Assumptions

The assumptions underlying the development include Mendelian sampling and Hardy-Weinberg equilibrium. However, inbreeding is allowed for in the derivations.

10.5 Theory

10.5.1 Identity coefficients

To understand the variation of genetic lines and the relationships between lines, it is necessary to consider identity modes and coefficients between genetic lines. It is usual to consider a diploid locus, indexed by r (r = 1, 2, ..., L), with m_r alleles A_r1, ..., A_rm_r that have an additive or main effect a_rs (s = 1, 2, ..., m_r) on the performance of a line carrying allele A_rs, while the interaction or dominance effect will be denoted by d_rs1s2. In the population these alleles have (relative) frequency (and hence probability of being selected in a random mating population) p_rs, with Σ_{s=1}^{m_r} p_rs = 1. Thus for locus r we have the details given in Table 10.1.

The extension to polyploidy introduces further complexity, in that the number of identity modes increases; for k-ploidy (k = 2, 3, ...) there are 2^{2k} − 1 identity modes. If the parental type is not distinguished, this reduces to 3{(k − 1)(k − 2) + 1} for k = 2, 3, .... Furthermore, higher order interactions or dominance effects need to be considered. Technically this is not a problem, but the notation becomes quite complex.

Table 10.1 Probability distribution for locus r

Allele        A_r1    A_r2    ...    A_rm_r
Effect        a_r1    a_r2    ...    a_rm_r
Probability   p_r1    p_r2    ...    p_rm_r    (total 1)

The identity modes between two lines i and j for a single diploid locus are given in Figure 10.4. The lines or edges of the graphs represent identity by descent (IBD), so that alleles joined by an edge come from the same ancestor. The non-distinguishable groups (9 for the diploid case) are also indicated in Figure 10.4 by boxing the specific modes. We label the modes I1, ..., I9, with probabilities ϖ_1, ..., ϖ_9 respectively. These identity modes are presented in Table 10.2 for two individuals i and j. The column “i edge” has value 1 if the alleles of i are IBD, and zero otherwise. The column labelled “ij edge” counts edges from an allele of i to an allele of j; any non-zero value in this column indicates that there exist alleles of i and j that are IBD.

Table 10.2 Identity modes using graph theoretic structure

Mode   i edge   j edge   ij edge   Total edges   Probability   i IBD   j IBD   ij IBD
I1     1        1        4         6             ϖ_1           yes     yes     yes
I2     1        0        2         3             ϖ_2           yes     no      yes
I3     0        1        2         3             ϖ_3           no      yes     yes
I4     0        0        2         2             ϖ_4           no      no      yes
I5     0        0        1         1             ϖ_5           no      no      yes
I6     1        1        0         2             ϖ_6           yes     yes     no
I7     1        0        0         1             ϖ_7           yes     no      no
I8     0        1        0         1             ϖ_8           no      yes     no
I9     0        0        0         0             ϖ_9           no      no      no

Figure 10.4 15 IBD states with the 9 identity modes used (edges join alleles of lines i and j; modes I1, ..., I9)

The basic quantitative genetics model is assumed here, but additional fixed and random effects can be added. These additional terms are generally required in the analysis of data, but our interest is in the genetic component of the model, and hence without loss of generality we assume that trait observations come from the model specification

y_ik = μ + g_i + ε_ik

for lines i = 1, 2, ..., n_g, and replicates k = 1, 2, ..., r_i. In accord with quantitative genetics the mean of y_ik is μ. This places constraints on the realized values of g_i, as does the pedigree structure which contains these lines (as well as other lines).

The reproductive process is viewed as sampling from the allele pool at each of the L loci. The sampling process depends on the IBD status at each locus and also on the population relative frequency, or probability of selection, of an allele; that is, the p_rs for allele s of locus r. Because we consider diploids for simplicity, two independent samples are chosen with replacement (although sampling will clearly depend on the parental alleles available). Let (S_r1, S_r2) represent the bivariate sampling random variable for locus r, and let p(A_rs A_rt) = Pr(S_r1 = A_rs ∩ S_r2 = A_rt). Then

p(A_rs A_rt) = Σ_{v=1}^{9} Pr(S_r1 = A_rs ∩ S_r2 = A_rt | I_v) Pr(I_v)
             = Σ_{v=1}^{9} Pr(S_r1 = A_rs ∩ S_r2 = A_rt | I_v) ϖ_rv        (10.5.1)

where the identity mode probabilities will in general depend on the locus r. Once (S_r1, S_r2), r = 1, 2, ..., L, are observed, the realization of g_i, namely g_ir, is obtained. If g_ir represents the genotype at locus r, the realized expression for line i across all loci is the sum of the effects of the two alleles present and a dominance component. Thus

g_i(s, t) = Σ_{r=1}^{L} g_ir = 1_L^T g_i = Σ_{r=1}^{L} (a_rs_r + a_rt_r + d_rs_rt_r + e_rs_rt_r)

where g_i is the vector of the g_ir. Notice that the form of the realized value is similar to a standard factorial model. The terms a_rs_r and a_rt_r are the additive effects due to the two alleles present at the rth locus. The term d_rs_rt_r is the dominance effect, or interaction between the two alleles present at the rth locus. The final term e_rs_rt_r represents the remaining non-additive (epistatic) effects; these terms are specified as independent random effects, normally distributed with mean zero and variance σ²_Ir.

The constraints required for identifiability can be expressed in terms of a_r and D_r, the vector of additive effects and the matrix of dominance effects respectively for locus r. The weighted zero-sum constraints used in quantitative genetics are

a_r^T p_r = 0,    D_r p_r = 0.        (10.5.2)

In the absence of IBD information, the mean, variance and covariance of the genotypic effects can be calculated easily. Thus

E(g_i) = Σ_{r=1}^{L} Σ_{s=1}^{m_r} Σ_{t=1}^{m_r} (a_rs + a_rt + d_rst) p_rs p_rt = Σ_{r=1}^{L} (2 a_r^T p_r + p_r^T D_r p_r) = 0

because of the constraints. Similarly, the variance is given by

var(g_i) = 2 Σ_{r=1}^{L} a_r^(2)T p_r + Σ_{r=1}^{L} p_r^T D_r^(2) p_r = σ_a² + σ_d²        (10.5.3)

where a_r^(2) is the vector of squared values of a_r, and D_r^(2) denotes the matrix whose elements are the squares of those of D_r. The same notation will be used for other vectors of squares. A similar calculation shows that the covariance between two lines i and j is zero.

The variance components σ_a² and σ_d² are the additive and dominance variances in the simple random mating situation. The components in (10.5.3) will, however, also appear in the more general case of a pedigree with possible inbreeding. The mean, variance and covariance for lines in a pedigree are developed in the following sub-sections. Importantly, we begin with basic definitions of the coefficient of parentage or kinship, and of inbreeding.
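As a numerical check of these random-mating results, the sketch below (our own; it drops the epistatic residual term e, whose variance would simply add to the total) builds a single locus with arbitrary allele frequencies and effects, imposes the constraints (10.5.2) by projection, and verifies E(g) = 0 and var(g) = 2 a^(2)T p + p^T D^(2) p by enumerating all genotypes:

import numpy as np

rng = np.random.default_rng(0)
m = 4                                  # number of alleles at the locus
p = rng.dirichlet(np.ones(m))          # allele frequencies

a = rng.normal(size=m)
a = a - (a @ p)                        # impose a^T p = 0
D = rng.normal(size=(m, m)); D = D + D.T
M = np.eye(m) - np.outer(np.ones(m), p)
D = M @ D @ M.T                        # impose D p = 0, keeping D symmetric

P = np.outer(p, p)                     # genotype probabilities, random mating
G = a[:, None] + a[None, :] + D        # genotypic values g = a_s + a_t + d_st

mean_g = (P * G).sum()
var_g = (P * G**2).sum() - mean_g**2
print(mean_g)                          # 0 up to rounding error
print(var_g, 2 * (a**2) @ p + p @ (D**2) @ p)   # identical values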

10.5.2 Coefficient of kinship

A major difference in the development presented here is the manner in which relationships between two individuals i and j are represented. We assume that a pair of individuals has overall mean identity probabilities as presented in Table 10.2, and that the probabilities at locus r, designated by the vector ϖ_r, are realizations of the overall ϖ. Thus we have (using standard multinomial results) that

E(ϖ_r) = ϖ,    var(ϖ_r) = diag(ϖ) − ϖϖ^T

Linear functions of ϖ_r therefore obey

E(c^T ϖ_r) = c^T ϖ,    var(c^T ϖ_r) = c^T diag(ϖ) c − (c^T ϖ)²        (10.5.4)

The result (10.5.4) will be used in the derivations below. A similar result applies for the covariance, namely

cov(c_r^T ϖ_r, c_s^T ϖ_s) = c_r^T diag(ϖ) c_s − (c_r^T ϖ)(c_s^T ϖ)        (10.5.5)

Let S_ir be an allele sampled at random from individual i at locus r, with a similar definition for S_jr. The coefficient of kinship is defined as

f_rij = Pr(S_ir ≡ S_jr)

where ≡ means the alleles are identical by descent (IBD). Then from Figure 10.4 and Table 10.2,

f_rij = Σ_{v=1}^{9} Pr(IBD | I_v) ϖ_rv
      = ϖ_r1 + (1/2)(ϖ_r2 + ϖ_r3 + ϖ_r4) + (1/4) ϖ_r5

At this point it is appropriate to define the inbreeding coefficient F_ir for individual i at a specific locus r. We denote the parents of i by u and v. If A_ru and A_rv are the alleles of i received from u and v respectively, the inbreeding coefficient is defined as

F_ir = Pr(A_ru ≡ A_rv)
     = ϖ_r1 + ϖ_r2 + ϖ_r6 + ϖ_r7
     = f_ruv        (10.5.6)

which is the relationship between the parents at locus r. The coefficient of kinship for individual i itself involves sampling from the alleles of i: either we sample two distinct alleles, which are IBD with probability F_ir, or we sample the same allele twice, in which case the two draws are trivially IBD. The two cases each have probability 0.5, and using (10.5.6) we have

f_rii = (1/2)(ϖ_r1 + ϖ_r2 + ϖ_r6 + ϖ_r7) + (1/2)
      = (1/2)(1 + f_ruv)
      = (1/2)(1 + F_ir)

Notice that, as E(ϖ_r) = ϖ, using (10.5.4)

E(Fir) = Fi, E(frij) = fij
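In practice the expected coefficients f_ij (and hence F_i = f_uv) are computed from a pedigree by the standard tabular (recursive) method. The recursion is not spelled out in the text, but it is consistent with the definitions above: f_ii = (1 + F_i)/2, and for j that is not a descendant of i, f_ij = (f_uj + f_vj)/2 where u and v are the parents of i. A sketch of our own, with founders assumed unrelated and non-inbred:

import numpy as np

def kinship(parents):
    """parents[i] = (u, v) with u, v < i, or (None, None) for a founder."""
    n = len(parents)
    f = np.zeros((n, n))
    for i in range(n):
        u, v = parents[i]
        f[i, i] = 0.5 if u is None else 0.5 * (1 + f[u, v])
        for j in range(i):
            # j precedes i in the ordering, so j is not a descendant of i
            f[i, j] = f[j, i] = 0.0 if u is None else 0.5 * (f[u, j] + f[v, j])
    return f

# 0 and 1 are founders, 2 and 3 are their full-sib offspring,
# and 4 is the offspring of the sib mating 2 x 3
ped = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
f = kinship(ped)
print(f[2, 3])           # 0.25: kinship of full sibs
print(2 * f[4, 4] - 1)   # 0.25: inbreeding coefficient F_4 = f_23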

10.5.3 Mean genetic effect under IBD

We consider E(g_i) under inbreeding and Mendelian sampling. The approach involves conditioning on ϖ_r, r = 1, 2, ..., L; the complete set of these vectors is denoted by Π. In the three derivations to follow, the expression

g_i = Σ_{r=1}^{L} g_ir = 1_L^T g_i        (10.5.7)

is used. The expressions for the terms used in calculating the expected value conditional on the identity modes are given in Table 10.3.

In all the derivations that follow, the common thread is the partition of possible events into IBD (and hence inbred) and not IBD. Under IBD at a locus, the alleles must be the same. This implies identity modes I1, I2, I6, I7, for which line i has alleles that are IBD; the probability of these modes is F_ir, the inbreeding coefficient at locus r, as given by (10.5.6). On the other hand, all pairs of alleles (including the same allele) can occur when the locus is not IBD (two equal alleles are then identical by state, IBS, but not IBD); the probability of the non-IBD modes is 1 − F_ir. In vector-matrix notation the possible outcomes for g_ir are

2a_r + d_rh,    and    a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r

where the first term is conveniently written as a vector and the second as a matrix; the h in d_rh and elsewhere denotes homozygosity. Thus, conditionally,

E(g_ir | Π) = F_ir (2a_r + d_rh)^T p_r + (1 − F_ir) p_r^T (a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r) p_r
            = F_ir d_rh^T p_r = F_ir Δ_rh        (10.5.8)

Using a standard result on conditional expectations, we therefore have

E(g_ir) = F_i Δ_rh

If Δ_h is the vector of the Δ_rh, then for the vector of effects E(g_i) = F_i Δ_h. Hence

E(g_i) = F_i 1_L^T Δ_h = F_i Δ_h        (10.5.9)

where, with a slight abuse of notation, the scalar Δ_h = Σ_{r=1}^{L} Δ_rh.

Table 10.3 Calculation of the mean of g_ir

Identity mode   Identity probability   i IBD   Component of g_ir       Probability
I1              ϖ_r1                   yes     2a_rs + d_rss           p_rs
I2              ϖ_r2                   yes     2a_rs + d_rss           p_rs
I3              ϖ_r3                   no      a_rs + a_rt + d_rst     p_rs p_rt
I4              ϖ_r4                   no      a_rs + a_rt + d_rst     p_rs p_rt
I5              ϖ_r5                   no      a_rs + a_rt + d_rst     p_rs p_rt
I6              ϖ_r6                   yes     2a_rs + d_rss           p_rs
I7              ϖ_r7                   yes     2a_rs + d_rss           p_rs
I8              ϖ_r8                   no      a_rs + a_rt + d_rst     p_rs p_rt
I9              ϖ_r9                   no      a_rs + a_rt + d_rst     p_rs p_rt

10.5.4 Genetic variance

To find the genetic variance we proceed in a similar manner as for the expectation. We can write

g_i² = 1_L^T g_i g_i^T 1_L

and hence we need to evaluate E(g_ir² | Π) and E(g_ir g_is | Π). To evaluate E(g_ir² | Π) using vector-matrix notation, we consider the element-by-element products

(2a_r + d_rh).(2a_r + d_rh) = 4a_r^(2) + d_rh^(2) + 4a_r.d_rh

and

(a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r).(a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r)

which produces eight terms. Then, in a similar manner to the mean,

E(g_ir² | Π) = F_ir (4a_r^(2) + d_rh^(2) + 4a_r.d_rh)^T p_r
             + (1 − F_ir) p_r^T {(a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r).(a_r ⊗ 1_{m_r}^T + 1_{m_r} ⊗ a_r^T + D_r)} p_r
           = 4F_ir a_r^(2)T p_r + F_ir d_rh^(2)T p_r + 4F_ir (a_r.d_rh)^T p_r + 2(1 − F_ir) a_r^(2)T p_r + (1 − F_ir) p_r^T D_r^(2) p_r
           = 2(1 + F_ir) a_r^(2)T p_r + (1 − F_ir) p_r^T D_r^(2) p_r + F_ir d_rh^(2)T p_r + 4F_ir (a_r.d_rh)^T p_r        (10.5.10)

Noting that alleles at different loci are assumed independent,

E(g_ir g_is | Π) = E(g_ir | Π) E(g_is | Π) = F_ir Δ_rh F_is Δ_sh        (10.5.11)

Taking expectations with respect to ϖ_r, F_ir is replaced by F_i in (10.5.10), while in (10.5.11) we need to find E(F_ir F_is). The calculations provide the rth diagonal term and the (r, s)th term of G_i = E(g_i g_i^T) respectively. Thus, using

the definitions explicit in (10.5.3), and defining

σ_dh² = Σ_{r=1}^{L} d_rh^(2)T p_r − Δ_h²,    σ_adh = 2 Σ_{r=1}^{L} (a_r.d_rh)^T p_r

which are, respectively, the variance of the dominance effects among homozygotes and the interaction between additive and homozygous dominance effects, we find

E(g_i²) = (1 + F_i)σ_a² + (1 − F_i)σ_d² + F_i σ_dh² + F_i Δ_h² + 2F_i σ_adh + Σ_{r≠s} E(F_ir F_is) Δ_rh Δ_sh        (10.5.12)

Combining equations (10.5.9) and (10.5.12), we find

var(g_i) = (1 + F_i)σ_a² + (1 − F_i)σ_d² + F_i σ_dh² + 2F_i σ_adh + F_i(1 − F_i)Δ_h² + Σ_{r≠s} E(F_ir F_is) Δ_rh Δ_sh        (10.5.13)

The last term provides for possible linkage between loci, through the expected value E(F_ir F_is). If loci are linked, for example on the same chromosome, their relationship will depend on the recombination frequency between them. If θ_rs is the recombination frequency between loci r and s, the correlation between the loci is given by (1 − 2θ_rs) and

E(F_ir F_is) = (1 − 2θ_rs)(F_i − F_i²) + F_i² = (1 − 2θ_rs)F_i + 2θ_rs F_i²

using (10.5.5). Loci on different chromosomes are not linked and θ_rs = 0.5, so that

E(F_ir F_is) = F_i²

In this case, the last term can be written as

F_i² (Δ_h² − Δ_h^(2))

10.5.5 Genetic covariance

We turn to E(g_i g_j). Again,

g_i g_j = 1_L^T g_i g_j^T 1_L

and hence we need to consider expectations of the terms g_ir g_jr and g_ir g_js. For the first term, the components for the calculation of the conditional mean are indicated in Table 10.4. It is much simpler in this case to multiply the terms and then sum over the subscripts. There are seven distinct such sums, multiplied in turn by ϖ_r1, ϖ_r2 + ϖ_r3, ϖ_r4, ϖ_r5, ϖ_r6, ϖ_r7 + ϖ_r8, and ϖ_r9 respectively. Many of the terms are zero due to the constraints (10.5.2).

Table 10.4 Calculation of the mean of g_ir g_jr

Mode   Probability   i IBD   j IBD   ij IBD   Component of g_ir g_jr                        Probability
I1     ϖ_r1          yes     yes     yes      (2a_rs + d_rss)²                              p_rs
I2     ϖ_r2          yes     no      yes      (2a_rs + d_rss)(a_rs + a_rt + d_rst)          p_rs p_rt
I3     ϖ_r3          no      yes     yes      (a_rs + a_rt + d_rst)(2a_rs + d_rss)          p_rs p_rt
I4     ϖ_r4          no      no      yes      (a_rs + a_rt + d_rst)²                        p_rs p_rt
I5     ϖ_r5          no      no      yes      (a_rs + a_rt + d_rst)(a_rs + a_ru + d_rsu)    p_rs p_rt p_ru
I6     ϖ_r6          yes     yes     no       (2a_rs + d_rss)(2a_rt + d_rtt)                p_rs p_rt
I7     ϖ_r7          yes     no      no       (2a_rs + d_rss)(a_rt + a_ru + d_rtu)          p_rs p_rt p_ru
I8     ϖ_r8          no      yes     no       (a_rs + a_rt + d_rst)(2a_ru + d_ruu)          p_rs p_rt p_ru
I9     ϖ_r9          no      no      no       (a_rs + a_rt + d_rst)(a_ru + a_rv + d_ruv)    p_rs p_rt p_ru p_rv

After taking expectation with respect to the ϖ_r, we find

E(g_ir g_jr) = 4{ϖ_1 + (1/2)(ϖ_2 + ϖ_3 + ϖ_4) + (1/4)ϖ_5} a_r^(2)T p_r + ϖ_4 p_r^T D_r^(2) p_r
             + 4{ϖ_1 + (1/4)(ϖ_2 + ϖ_3)} (a_r.d_rh)^T p_r + ϖ_1 d_rh^(2)T p_r + ϖ_6 Δ_rh²

The expectation of g_ir g_js is simply (ϖ_1 + ϖ_6) Δ_rh Δ_sh; most terms in Table 10.4 have zero probability for different loci on different individuals. As for E(g_i²), we can form the matrix of expectations G_ij, and E(g_i g_j) = 1_L^T G_ij 1_L. This becomes

E(g_i g_j) = 2{ϖ_1 + (1/2)(ϖ_2 + ϖ_3 + ϖ_4) + (1/4)ϖ_5} σ_a² + ϖ_4 σ_d² + {2ϖ_1 + (1/2)(ϖ_2 + ϖ_3)} σ_adh
           + ϖ_1 Σ_{r=1}^{L} d_rh^(2)T p_r + ϖ_6 Σ_{r=1}^{L} Δ_rh² + (ϖ_1 + ϖ_6) Σ_{r≠s} Δ_rh Δ_sh

The covariance between g_i and g_j is, using (10.5.9),

cov(g_i, g_j) = 2{ϖ_1 + (1/2)(ϖ_2 + ϖ_3 + ϖ_4) + (1/4)ϖ_5} σ_a² + ϖ_4 σ_d² + {2ϖ_1 + (1/2)(ϖ_2 + ϖ_3)} σ_adh
              + ϖ_1 σ_dh² + (ϖ_1 + ϖ_6 − F_i F_j) Δ_h²        (10.5.14)

It remains to express the ϖ in terms of F_i, F_j and kinship coefficients. The values are given in Table 10.5. Note that individuals i and j have parents u and v, and w and x, respectively, and we define

f_uv·wx = f_uw f_vx + f_ux f_vw

Table 10.5 Identity mode probabilities: individuals i and j with parents u and v, and w and x

Mode   Probability
I1     F_i F_j f_ij
I2     2 F_i (1 − F_j) f_ij
I3     2 (1 − F_i) F_j f_ij
I4     (1 − F_i)(1 − F_j) f_uv·wx
I5     (1 − F_i)(1 − F_j)(4f_ij − 2f_uv·wx)
I6     F_i F_j (1 − f_ij)
I7     F_i (1 − F_j)(1 − 2f_ij)
I8     (1 − F_i) F_j (1 − 2f_ij)
I9     (1 − F_i)(1 − F_j)(1 − 4f_ij + f_uv·wx)

Substituting these results into (10.5.14), we find

cov(g_i, g_j) = 2f_ij σ_a² + (1 − F_i)(1 − F_j) f_uv·wx σ_d² + (F_i + F_j) f_ij σ_adh + F_i F_j f_ij σ_dh²        (10.5.15)

10.5.6 Full variance-covariance matrix

Combining (10.5.13) and (10.5.15), and writing g for the vector of individual genetic effects, we find that

var(g) = σ_a² A + σ_d² D + σ_adh C + σ_dh² H + Δ_h² D_I + (Δ_h² − Δ_h^(2)) E        (10.5.16)

where D_I = diag(F_i)(I − diag(F_i)), and we assume all loci are unlinked. Equation (10.5.16) is of the same form as that given by ?, although explicit expressions for all terms were not given there. The elements of A are A_ij = 2f_ij, while the terms in all other matrices can be written as functions of the A_ij. The diagonal elements of D are

D_ii = 1 − F_i = 2 − A_ii

and the off-diagonal elements are

D_ij = (1 − F_i)(1 − F_j) f_uv·wx = (1/4)(1 − F_i)(1 − F_j)(A_uw A_vx + A_ux A_vw)

where the parents of i are (u, v) and of j are (w, x). The diagonal elements of C are

C_ii = 2F_i = 2(A_ii − 1)

while for i ≠ j,

C_ij = (F_i + F_j) f_ij = (1/2)(A_ii + A_jj − 2) A_ij

Similarly, the elements of H are

H_ii = F_i = A_ii − 1

while for i ≠ j,

H_ij = F_i F_j f_ij = (1/2)(A_ii − 1)(A_jj − 1) A_ij

The diagonal matrix E has elements

E_ii = F_i² = (A_ii − 1)²

Lastly, the term involving the diagonal matrix D_I = diag(F_i)(I − diag(F_i)) represents inbreeding depression.
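The element-wise formulae above translate directly into code. The sketch below is our own; it assumes a pedigree ordered so that parents precede offspring, and gives individuals with unrecorded parents a zero dominance relationship (an assumption made for illustration only):

import numpy as np

def genetic_var_matrices(A, parents):
    """Build D, C, H, E element-wise from A (A_ij = 2 f_ij) and the pedigree."""
    n = A.shape[0]
    F = np.diag(A) - 1.0                        # inbreeding coefficients F_i
    D, C, H = (np.zeros((n, n)) for _ in range(3))
    for i in range(n):
        D[i, i] = 2.0 - A[i, i]                 # 1 - F_i
        C[i, i] = 2.0 * F[i]
        H[i, i] = F[i]
        for j in range(i):
            u, v = parents[i]
            w, x = parents[j]
            if None in (u, v, w, x):
                dij = 0.0                       # unknown parents (assumption)
            else:
                dij = 0.25 * (1 - F[i]) * (1 - F[j]) * (
                    A[u, w] * A[v, x] + A[u, x] * A[v, w])
            D[i, j] = D[j, i] = dij
            C[i, j] = C[j, i] = 0.5 * (A[i, i] + A[j, j] - 2.0) * A[i, j]
            H[i, j] = H[j, i] = 0.5 * F[i] * F[j] * A[i, j]
    return D, C, H, np.diag(F**2)

# Two unrelated non-inbred founders (0, 1) and their full-sib offspring (2, 3)
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
parents = [(None, None), (None, None), (0, 1), (0, 1)]
D, C, H, E = genetic_var_matrices(A, parents)
print(D[2, 3])    # 0.25, the classical full-sib dominance relationship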

10.5.7 Computation of the relationship matrices

The major problem from a practical point of view is both the number of terms in (10.5.16) and the evaluation of the inverses of the matrices involved. ? show in a simulation study that the leading two terms provide an accurate approximation to the full matrix under certain circumstances. Thus

var(g) ≈ σ_a² A + σ_d² D

which is the variance matrix for the sum of two random effects, one for additive and one for dominance effects. This is the approach used in animal and some plant breeding situations where mixed models and pedigrees are standard. Thus, if we include an independent residual component, the model is

g = a + d + i        (10.5.17)

(see ?). The efficient calculation of the numerator relationship matrix A has been the subject of considerable research. ? provided a recursive method based on triples, in which each progeny is listed together with its parents, and parents precede their progeny. The algorithm focused on obtaining the inverse of A, because this matrix is required for the mixed model equations (see ?). ? provide the extension to inbred lines, including doubled haploid and recombinant inbred lines, in the general pedigree situation.

The second problem has been the dominance relationship matrix D. This is a large matrix, and methods to simplify the calculations, in particular in regard to the inverse, have been examined. The elements of D are functions of the elements of A, and it is difficult to see a simple way in which A⁻¹ can be used to determine D.

? present an approach to find D that uses the family structure of the population. As dominance relationships are determined by the parents of individuals, the full dominance relationship matrix can be partitioned into a between-family dominance matrix and a within-family dominance matrix. Note that the dominance component of the covariance between lines arises from the identity mode I4. I4 is also a component of the variance of a line, and hence the term corresponding to this identity state provides the between-family dominance variance. All lines within a family share the same variance and the same covariance with lines of other families. The variance for a line i in a family is

D_bi = (1/4)(1 − F_i)²(A_uu A_vv + A_uv²)

and the covariance between i and a line j in another family is

D_bij = (1/4)(1 − F_i)(1 − F_j)(A_uw A_vx + A_ux A_vw)

The within-family variance is the residual variance, giving a scaled diagonal matrix with elements

D_wi = (1 − F_i) − (1/4)(1 − F_i)²(A_uu A_vv + A_uv²)

the same for all individuals in a family. Thus we can write

d = Z_b d_b + d_w

where d_b contains the between-family dominance effects, with variance matrix D_b having diagonal elements D_bi and off-diagonal elements D_bij. The design matrix Z_b provides the assignment of individuals to families. The term d_w is the vector of within-family dominance effects, with diagonal variance matrix having family blocks D_wi I_qi, where q_i is the size of family i. This between/within family formulation has the potential to reduce the computational burden in many situations. Note that

D = var(d) = Z_b D_b Z_b^T + D_w

so that

D⁻¹ = D_w⁻¹ − D_w⁻¹ Z_b (Z_b^T D_w⁻¹ Z_b + D_b⁻¹)⁻¹ Z_b^T D_w⁻¹

and the inverse can be found using smaller and simpler matrices; however, D_b is constructed from A and it is not clear how to use A⁻¹.

The two matrices C and H are also functions of A, but, like D, their nonlinear dependence seems to preclude a simple way to find their inverses, which are required for estimation and prediction. The matrix in the inbreeding depression term and E are diagonal matrices and present little difficulty.

A full model for g that provides the structure given in (10.5.16) is

g = a + Z_b d_b + d_w + d_h + a_dh + i_d + d_e + i        (10.5.18)

where

a ∼ N(0, σ_a² A),        i ∼ N(0, σ_i² I),
d_b ∼ N(0, σ_d² D_b),    d_w ∼ N(0, σ_d² D_w),
d_h ∼ N(0, σ_dh² H),     a_dh ∼ N(0, σ_adh C),
i_d ∼ N(0, σ_id² D_I),   d_e ∼ N(0, σ_e E)

Note that σ_adh and σ_e = (Δ_h² − Δ_h^(2)) need not be positive, and in estimation these parameters should be unconstrained.

All the variance-covariance matrices in this model are a scalar variance-covariance parameter multiplied by a known (that is, able to be calculated) symmetric matrix. Thus the required inverses can be calculated before estimation, and only once for each problem. With the power of modern computers this is only likely to be an issue for large pedigrees.
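The between/within family expression for D⁻¹ used above is the standard Woodbury identity, and is easy to verify numerically; a sketch of our own with made-up family sizes and variance matrices:

import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 2, 4]                           # family sizes q_i (made up)
n, nf = sum(sizes), len(sizes)

Zb = np.zeros((n, nf))                      # assigns individuals to families
row = 0
for fam, q in enumerate(sizes):
    Zb[row:row + q, fam] = 1.0
    row += q

B = rng.normal(size=(nf, nf))
Db = B @ B.T + nf * np.eye(nf)              # positive definite between-family part
Dw = np.diag(rng.uniform(0.5, 1.5, size=n)) # diagonal within-family part

D = Zb @ Db @ Zb.T + Dw
Dw_inv, Db_inv = np.linalg.inv(Dw), np.linalg.inv(Db)
inner = np.linalg.inv(Zb.T @ Dw_inv @ Zb + Db_inv)
D_inv = Dw_inv - Dw_inv @ Zb @ inner @ Zb.T @ Dw_inv

print(np.allclose(D_inv, np.linalg.inv(D)))   # True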

10.6 Discussion

The aim of this chapter has been to present a simplified and modern derivation of the variance matrix of related individuals. The results obtained are explicit and are important for use in the analysis of plant breeding trials (?). The use of family structure simplifies the computation of the dominance effects (?); in fact it allows a partition of effects that is useful from a practical point of view. See ? for the practical application of this idea. The implementation involves the use of A in the direct calculation of the other variance-covariance matrices.

The approach provides explicit results for all components of the full variance-covariance matrix, and differs from ?, who focus on the decomposition of the dominance variance-covariance matrix. They also differ in that they recover the within-family dominance components by back-substitution.

Importantly, in applications, determining the various components of variance/covariance is of interest, and the additive, dominance and inbreeding depression effects for each line can be obtained by best linear unbiased prediction (?); this may not be relevant for some of the other terms. The formulation presented here has been implemented in ASReml (?).

CHAPTER 11

Mixed models for plant breeding

11.1 Introduction

The aim of plant breeding and crop evaluation programs is to produce new varieties that are superior in terms of a range of key traits. These traits include grain yield, quality traits associated with product end-use (for example milling yield of wheat and malting quality of barley) and resistance to disease. The process of breeding new varieties is lengthy and typically involves sequential stages of testing. In the initial stages a large number (possibly greater than 1000) of new breeding lines are grown in a small number (possibly only one) of designed field experiments. Note that we reserve the term ‘variety’ for a commercial variety and use the terms ‘line’ or ‘genotype’ to collectively represent any of the material (either new test or existing commercial material) grown in a breeding trial. The ‘best’ test lines (in terms of a range of traits) are selected to progress to the next stage of testing. This process continues with subsequent stages involving progressively fewer test lines grown in more field experiments. Ultimately only a few (typically less than 10) elite test lines are grown together with existing standard varieties in trials that span a wide range of geographic locations. The best test lines may then be released as varieties for commercial use.

11.1.1 Obtaining trait data: single and multi-phase experiments

In this chapter we consider selection for quantitative traits, since this can be achieved using a linear mixed model approach for data analysis. These traits include grain yield and many of the traits associated with grain and end-product quality. Resistance to disease is often measured using qualitative scoring systems; such data are not amenable to a linear mixed model analysis and so are not considered here. All data under consideration are derived from designed field experiments. Some traits, such as grain yield, are measured directly from the field in what can be termed single phase experiments. Most quality traits, however, involve an additional measurement phase (or phases) in which grain from field plots undergoes some process in a laboratory. For example, wheat grain is milled in order to measure the amount of flour produced (so-called milling yield), and barley grain is malted in order to take a range of measurements that reflect suitability for beer production. Such traits are therefore derived from multi-phase experiments.

In this chapter we consider the analysis of data from both single and multi-phase plant breeding experiments.

The design of field experiments and the analysis of the resultant data has a long history. In terms of plant breeding trials, designs may vary depending on the stage of testing. In early stages replication of test lines may not be feasible, due to insufficient seed for some lines and/or restrictions on the size of the trial. The most widely adopted design for this situation is a grid-plot design, in which a systematic grid of a standard variety (or several standard varieties) is interposed among plots of unreplicated test lines. More recently Cullis et al. (2005) proposed a superior alternative, known as p-rep designs, in which grid plots are replaced by replicate plots of test lines. Thus a percentage, p, of test lines is replicated and the total trial size is unaffected. In later stages of testing fully replicated trials are common practice. Both the fully and partially replicated designs usually employ some form of blocking and possibly neighbour balance.

The analysis of data from field trials may be either randomisation or model based. In the latter the focus is on the need to control spatial variation. As implied by the terminology, this variation is linked to the spatial locations of plots in the field and may be due, for example, to fluctuations in soil fertility. Such models are now widely used for plant breeding trials and can lead to substantial gains in the accuracy and efficiency of estimated genotype contrasts compared with the randomisation approach, particularly when block sizes are large. The major criticism of model based approaches is that estimates of treatment effects and their standard errors rely solely on the chosen model, whereas the randomisation based analysis is validated by recourse to randomisation theory. In our experience with conducting the annual analyses of breeding trials from most Australian public breeding and evaluation programs, the gains of a spatial approach outweigh this potential disadvantage. We safeguard against it to some extent by using an approach that merges the randomisation and spatial approaches: we use the randomisation based model as the baseline, then build on this to model remaining spatial variation, using the approach of Gilmour et al. (1997). Thus we do not regard a spatial model as a replacement for the randomisation based model, but rather as an enhancement to better accommodate field trend. Details of our approach are given in Section 11.2.

In contrast to data obtained directly from the field, the design and analysis of quality trait data has received scant attention in the literature. In terms of design, common practice involves the use of a single field replicate and no randomisation or replication in the laboratory (although a laboratory control sample is often processed at regular intervals). Smith et al. (2001b) promote the need to employ proper experimental techniques in the laboratory phase. Thus they propose that individual field replicate samples should be processed in the laboratory, and that the resultant samples should be replicated and randomised in the laboratory process. Until recently there has been little guidance on how this should be achieved. Wood et al. (1988) give some information for multi-phase experiments in general, but quality trait data provide some specific challenges that cannot be handled using their approach.
The key issue is that the cost of obtaining quality trait data is very high. This means that the number of samples tested must be kept to a minimum, and certainly it is not feasible to process all field replicates of all test lines and then replicate all the resultant samples in the laboratory. Smith et al. (2005b) provide a viable solution that builds on the p-rep concept of Cullis et al. (2005). In short, replicate plots of a percentage, p, of lines are processed in the laboratory, making a total of n_p plots, say. Then a percentage, q, of these n_p plots is replicated in the laboratory (that is, the grain from each of these plots is split into two or more samples to be processed separately); the sketch at the end of this section illustrates the resulting sample numbers. The concept can be extended to accommodate more than one phase in the laboratory. In terms of the randomisation of samples to 'positions' in the laboratory process there are many unresolved issues. For example, should we use field layout information when allocating samples in the laboratory? If so, how should this be done? Design issues, particularly for the partially replicated designs of Smith et al. (2005b), are very complex and are the subject of current research. At present we base our designs on the recommendation in Wood et al. (1988), namely that the design in the laboratory phase should be efficient for genotype comparisons. Thus we ignore other field information and allocate lines to positions in the laboratory, using blocking and neighbour balance as required.

In terms of the analysis of quality trait data, Smith et al. (2001b) and Cullis et al. (2003) give mixed model approaches. The key elements are the partitioning of error variation into all potential sources, that is, those associated with each phase of the experiment, and the modelling of trend as appropriate. Modelling of quality trait data can be challenging, particularly if there are several laboratory phases. However, as with the spatial analysis of field trials, the gains compared with a randomisation based analysis can be substantial. Our approach to analysis is discussed in detail in Section 11.3.
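The sample-number arithmetic of such a partially replicated two-phase design is simple; the figures below are hypothetical and purely illustrative:

n, p, q = 500, 0.20, 0.10       # test lines; field and laboratory replication rates

field_plots = round(n * (1 + p))            # n_p in the notation above
lab_samples = round(field_plots * (1 + q))  # samples entering the laboratory
print(field_plots, lab_samples)             # 600 field plots, 660 lab samples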

11.1.2 Selecting the best test lines

The mixed model approaches we propose for the analysis of plant breeding data (whether derived from single or multi-phase experiments) all have the same aim, namely to provide information on which to base selection decisions. This is a crucial point, as it determines the status of genotype effects in the models, that is, whether they should be taken as fixed or random. Since the aim is selection, we require the rankings of the estimated test line effects to be as close as possible to those of the true effects. In more exact terms, we require a set of estimates of test line effects that best predicts the true effects. By definition this means that genotype effects should be regarded as random, and consequently predicted using best linear unbiased prediction (BLUP). This may be contrasted with the aim of determining the difference between specific lines (for example, if a seed company wishes to know the difference between their potential new variety and other commercial varieties), in which case the use of best linear unbiased prediction is inappropriate, since the BLUP of a specific difference is biased; in that case genotype effects should be regarded as fixed. We believe, however, that the aim of the analysis of most plant breeding and crop improvement data (irrespective of the stage of testing) is selection, so we use the assumption of random genotype effects. The reader is referred to Smith et al. (2005a) for a discussion of this issue and for a listing of relevant references.

Selection may be conducted separately for individual traits, or for a range of traits simultaneously using a selection index approach (see Falconer and Mackay, 1996, for example). We consider the former, and with this in mind propose the following general form for the mixed model for a single trait (derived either from a single or a multi-phase experiment):

y = Xτ + Z_g u_g + Z_o u_o + e        (11.1.1)

where y is the n × 1 data vector, τ is the p × 1 vector of fixed effects with associated n × p design matrix X (assumed to have full column rank), u_g is the g × 1 vector of random genotype effects with associated n × g design matrix Z_g, u_o is the b × 1 vector of other (non-genetic) random effects with associated n × b design matrix Z_o, and e is the vector of residual effects. Without loss of generality we write u_g = (u_s′, u_t′)′, where u_s and u_t are vectors of standard and test genotype effects respectively (with lengths s and t such that g = s + t).

In the simplest case the fixed effects in (11.1.1) comprise a single effect, namely an overall mean, but they may include effects for covariates to model trend associated with either the field or laboratory phase/s. The vector u_o comprises block effects associated with the experimental design and effects to model variation. In the case of multi-phase experiments it also includes residual effects for phases other than the final phase. Thus in a two-phase experiment the vector e corresponds to the laboratory residuals and the field plot residuals are represented by a sub-vector of u_o; in a single phase experiment the field plot residuals are represented by the vector e. Full details of model terms are given in Sections 11.2 and 11.3.

We assume that the joint distribution of (u_g′, u_o′, e′)′ is Gaussian with zero mean and variance matrix

         [ G_g(γ_g)   0          0    ]
V = σ_H² [ 0          G_o(γ_o)   0    ]
         [ 0          0          R(φ) ]

where γ_g, γ_o and φ are vectors of unknown variance parameters. The matrix G_g is often a scaled identity matrix, that is, G_g = γ_g I_g; the associated variance component, σ_g² = σ_H² γ_g, is often termed the genetic variance. Another possibility is G_g = γ_g A, where A is a known relationship matrix. At present pedigrees are not generally used in routine analyses of early generation trials, so this will not be considered here. Forms for G_o and R will be discussed in Sections 11.2 and 11.3.
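As a concrete toy illustration of (11.1.1), the sketch below is our own: all sizes and variance ratios are hypothetical, and the variance parameters are treated as known (in practice REML estimates are used, giving E-BLUPs). It generates a small complete-block trial with random genotype and block effects, and recovers the fixed mean and the genotype BLUPs from Henderson's mixed model equations:

import numpy as np

rng = np.random.default_rng(1)
g, nb = 6, 4                            # genotypes, complete blocks
gamma_g, gamma_o = 2.0, 0.5             # variance ratios, taken as known
n = g * nb

geno = np.tile(np.arange(g), nb)        # field layout: each block has all lines
block = np.repeat(np.arange(nb), g)

X = np.ones((n, 1))                     # fixed effects: overall mean only
Zg = np.eye(g)[geno]                    # genotype design matrix
Zo = np.eye(nb)[block]                  # block design matrix

u_g = np.sqrt(gamma_g) * rng.standard_normal(g)
u_o = np.sqrt(gamma_o) * rng.standard_normal(nb)
y = 10.0 + Zg @ u_g + Zo @ u_o + rng.standard_normal(n)

# Henderson's mixed model equations with R = I and sigma_H^2 = 1
W = np.hstack([X, Zg, Zo])
lhs = W.T @ W
lhs[1:, 1:] += np.diag(np.r_[np.full(g, 1 / gamma_g),
                             np.full(nb, 1 / gamma_o)])
sol = np.linalg.solve(lhs, W.T @ y)
mu_blue, ug_blup = sol[0], sol[1:1 + g]

print(mu_blue)                          # BLUE of the overall mean
print(np.corrcoef(ug_blup, u_g)[0, 1])  # BLUPs track the true genotype effects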
Thus, given values of σ2 , γ and Ctt, we would H g simulate values of u˜t from the distribution in (11.1.2) then select the top m% of lines and calculate EGG for that simulation as the corresponding mean. The final EGG is taken to be the mean across a (large) number of simulations. An alternative and simpler method of calculation of EGG is obtained by approximating the prediction error variance matrix in (11.1.2) with a scaled identity. A good approximation can be obtained using the concept of ‘effective error variance’ (see Cochran and Cox, 1957, for example). Thus we approx- tt imate C by AttIt/2 where Att is the (scaled) average pairwise prediction error variance of test genotype effects, that is, 2  1  A = tr Ctt − 10 Ctt1 (11.1.3) tt t − 1 t t t Then we have u˜ ≈ N (0, σ2 (γ − A /2)I ) (11.1.4) t H g tt t q ⇒ u˜ / σ2 (γ − A /2) ≈ N (0, I ) t H g tt t so can calculate EGG as q EGG = i σ2 (γ − A /2) H g tt

= iσghg (11.1.5) where i is the ‘selection intensity’ corresponding to m (that is, the mean of the 182 MIXED MODELS FOR PLANT BREEDING top m% of order statistics from a standard normal distribution of size t) and hg is the square root of a generalised measure of mean line heritability with 2 hg = 1 − Att/(2γg). Note that the equation for EGG in (11.1.5) is analogous to the standard quantitative genetics formula (see Falconer and Mackay, 1996, for example). The difference is that we propose the use of the generalised mea- sure of heritability rather than the standard measure that is calculated as the ratio of genetic variance to total (genetic plus error) variance. In the simplest case of balanced data and a model with no fixed effects other than an overall mean and no random effects other than those associated with genotype and residual error (both with simple scaled identity variance matrices) the gener- alised and standard heritability measures are identical. In all other cases they will differ and importantly, the standard heritability measure will not relate to response to selection. Note that we have computed measures of heritability and EGG in terms of the total genetic variance. Often these measures are given in relation to the so-called additive genetic variance which is the por- tion of genetic variance associated with the resemblance between relatives. In order to obtain heritability and EGG in terms of additive genetic variance we would need to include a known relationship matrix in the model, that is, use 2 a variance matrix for the genotype effects of the form var (ug) = σg A.
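To make the two routes to EGG concrete, the following is a minimal sketch in Python (an assumption of ours — the authors work with ASReml and samm, not Python; the matrix $C^{tt}$ and variance parameters below are illustrative placeholders, not values from any real analysis). It computes EGG both by simulating from (11.1.2) and via the effective error variance approximation (11.1.3)–(11.1.5).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative inputs (placeholders, not from a real analysis)
t = 200                        # number of test lines
sigma2_H = 1.0                 # scale parameter sigma^2_H
gamma_g = 0.5                  # genetic variance ratio gamma_g
Ctt = 0.2 * np.eye(t) + 0.0005 * np.ones((t, t))  # scaled PEV matrix C^tt

m = 0.20                       # select the top m = 20% of lines
k = int(np.ceil(m * t))        # number of lines selected

# Route 1: simulate u~_t from (11.1.2), average the top m% over many draws
V = sigma2_H * (gamma_g * np.eye(t) - Ctt)
L = np.linalg.cholesky(V)
egg_sim = np.mean([np.sort(L @ rng.standard_normal(t))[-k:].mean()
                   for _ in range(2000)])

# Route 2: effective error variance approximation (11.1.3)-(11.1.5)
ones = np.ones(t)
A_tt = 2.0 / (t - 1) * (np.trace(Ctt) - ones @ Ctt @ ones / t)
i = norm.pdf(norm.ppf(1 - m)) / m   # selection intensity (infinite-sample form)
egg_approx = i * np.sqrt(sigma2_H * (gamma_g - A_tt / 2))

print(egg_sim, egg_approx)

With a near-diagonal $C^{tt}$ as above the two routes agree closely, which is exactly the situation in which the scaled-identity approximation is good.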

11.2 Spatial analysis of field trials

The literature on methods for the analysis of field trials (which includes our specific setting of variety trials) is quite diverse but the methods can be broadly classified as either randomisation or model based. In the former the model for residual effects is determined purely from the experimental design whereas in the latter it is either assumed or selected with the objective of providing a good fit to the data. In the following sections we discuss both types of analysis. Finally we describe our approach to analysis, which combines both randomisation and modelling aspects and is based on the methods of Cullis and Gleeson (1991) and Gilmour et al. (1997). We illustrate the approach using grain yield data from an Australian early generation wheat variety trial.

11.2.1 Randomisation based analysis

A randomisation based analysis of a variety trial may be conducted using the model in (11.1.1) with sub-vectors of $u_o$ corresponding to terms in the block structure of the experiment (see Nelder, 1965a, for a complete account). For example, if the experiment was designed as a randomised complete block (RCB) experiment with $n_r$ replicates (complete blocks) then $u_o$ would have length $n_r$ (comprising an effect for each replicate) and the effects would be assumed independent with constant variance $\sigma^2_H \gamma_r$, say. Thus $G_o = \gamma_r I_{n_r}$. The vector of residuals would then comprise independent effects with constant variance $\sigma^2_H$, thence $R = I_n$. In an incomplete block (IB) design with $n_r$ replicates and $n_b$ (incomplete) blocks per replicate there would be two sub-vectors in $u_o$, the first corresponding to the replicate effects and the second to block within replicate effects. Independence is assumed both within and between these sub-vectors and the effects have associated variance components of $\sigma^2_H \gamma_r$ and $\sigma^2_H \gamma_b$ for replicates and blocks within replicates, respectively. Thus

$G_o = \mathrm{diag}\,(\gamma_r I_{n_r},\; \gamma_b I_{n_r n_b})$. As in the RCB design we would have $R = I_n$. Note that Nelder (1954) discusses the need to allow variance components associated with blocking factors to be negative in order for the mixed model to provide a proper surrogate for the randomisation analysis.
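As a concrete sketch of these randomisation based structures (Python with numpy is our assumption; the dimensions and variance ratios are invented for illustration), the $G$-structure for the IB design is simply block-diagonal in the replicate and block-within-replicate parts:

import numpy as np

nr, nb, k = 3, 5, 4            # replicates, blocks per replicate, plots per block
n = nr * nb * k                # total number of plots
gamma_r, gamma_b = 0.2, 0.1    # illustrative variance ratios

# G_o = diag(gamma_r I_{nr}, gamma_b I_{nr*nb}); R = I_n
G_o = np.block([
    [gamma_r * np.eye(nr), np.zeros((nr, nr * nb))],
    [np.zeros((nr * nb, nr)), gamma_b * np.eye(nr * nb)],
])
R = np.eye(n)
# With Z_o built from the replicate and block factors,
# var(y) = sigma2_H * (Z_g G_g Z_g' + Z_o G_o Z_o' + R).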

11.2.2 Model based analysis

Model based approaches for the analysis of variety trials aim to account for the effect of spatial heterogeneity on the prediction of genotype contrasts. Typically the heterogeneity reflects the fact that, in the absence of design effects, data from plots that are close together (that is, neighbouring plots) are more similar (positively correlated) than those that are further apart. Numerous authors have proposed analytical methods to remove the effects of such trend from the estimation of genotype contrasts. The earliest method was that of Papadakis (1937) in which neighbouring plot yields were used as covariates in the analysis. Interest in the area was re-ignited by Wilkinson et al. (1983) who suggested that spatial field trend could be expressed as the sum of two components, namely a smooth trend and an independent error term. They removed the assumed trend by (second) differencing the data. The method of differencing adjacent plot yields has been used by other authors as a means of modelling trend (Green et al., 1985; Besag and Kempton, 1986, for example). Gleeson and Cullis (1987), Martin (1990) and Cullis and Gleeson (1991) proposed approaches that model trend using time series models (with differencing still having a role as a means to achieve stationarity). A key aspect of Martin (1990) and Cullis and Gleeson (1991) is the use of separable correlation models to accommodate trend in two dimensions (field rows and columns). Zimmerman and Harville (1991) also proposed a direct modelling approach using geostatistical models. They view spatial variation as comprising two sources, namely large-scale variation that is modelled through the mean, and small-scale variation that is modelled through the covariance structure. Gilmour et al. (1997) extended the approach of Cullis and Gleeson (1991) by partitioning spatial variation into two types of smooth trend (local and global) and extraneous variation. Local trend reflects, for example, small scale soil depth and fertility fluctuations. Global trend reflects non-stationary trend across the field. Extraneous variation is often linked to trial management, in particular, procedures that are aligned with the field rows and columns (for example, the sowing and harvesting of plots). Certain procedures may result in row and column effects (systematic and/or random). In the Gilmour et al. (1997) approach global trend and extraneous variation are accommodated in the model by including appropriate fixed and/or random effects. Local stationary trend is accommodated using a correlation structure for the residuals. Thus there are similarities with the Zimmerman and Harville (1991) approach.

Most of the current spatial approaches for the analysis of field trials are of the form advocated by Zimmerman and Harville (1991) and Gilmour et al. (1997), that is, they involve a direct modelling of local spatial trend using a covariance model. In the following we present a general framework for such models, then give specific details for our approach to analysis, which is based on Gilmour et al. (1997).

11.2.3 Covariance models for local spatial trend

We assume a field trial that has a two-dimensional layout indexed by field rows $(1 \ldots r)$ and columns $(1 \ldots c)$ so that the total number of observations is $n = rc$. We assume the data to be ordered as rows within columns. Extensions to non-rectangular or non-contiguous layouts are straight-forward. Consistent with the notation of Chapter 9 we let the vector $s_i = (s_{ir}, s_{ic})$ denote the spatial location of the $i$th plot in the field, where $s_{ir}$ and $s_{ic}$ are the row and column co-ordinates respectively. In terms of the mixed model in (11.1.1) we assume the residuals to be spatially correlated, that is, $e = e(s)$ where $e$ is a realisation of a stationary Gaussian process with zero mean and variance matrix $\sigma^2_H R$. The elements of $R$ are given by $\rho(s_i - s_j, \phi)$, $\rho(\cdot)$ being a correlation function with a parameter vector $\phi$ and dependent on the spatial separation vector $h_{ij} = (h_{ijr}, h_{ijc}) = s_i - s_j$. We note that each field plot is in itself a two-dimensional region and the data for an individual plot consist of the grain yield harvested from the entire region (or most of that region, with the exclusion of small edge areas). In terms of the mixed model in (11.1.1) we assume that the data for the $i$th plot are concentrated at the centroid of the plot.

The correlation process $e(s)$ is second-order stationary so that the dependence between any plots $i$ and $j$ depends only on the distance between them (see Chapter 9). In contrast to geostatistical applications we further assume that the two-dimensional process is separable so that the correlation function is given by the product of the correlation functions for rows and columns. The separability assumption is computationally convenient and appears to be reasonable for the two-dimensional spatial trend process associated with field trials (see Martin, 1990; Cullis and Gleeson, 1991, for example). More work is needed in this area, however. For the separable model we have

$$\rho(h_{ij}, \phi) = \rho_r(h_{ijr}, \phi_r)\, \rho_c(h_{ijc}, \phi_c)$$
where $\rho_r$ and $\rho_c$ are the correlation functions for rows and columns respectively. Correspondingly, the variance matrix for $e$ can be written as
$$\mathrm{var}(e) = \sigma^2_H \left( R_c(\phi_c) \otimes R_r(\phi_r) \right)$$
where $R_r$ and $R_c$ are the $r \times r$ and $c \times c$ correlation matrices for the row and column dimensions respectively.

As discussed in Chapter 9 many forms for $\rho_j(\cdot)$, $j = r, c$, are possible. The use of statistical criteria for choosing between covariance functions for a particular data set can be a difficult task (see Zimmerman and Harville, 1991, for example). Also, we question the merit of searching for the 'best' covariance model. After all, as Besag and Kempton (1986) pointed out, any of these models "... can only provide a rather crude representation of the underlying fertility pattern and is not intended as a proper biological model." Also, Zimmerman and Harville (1991) found that "... among isotropic covariance functions that are continuous, nonnegative, and monotone decreasing, the estimates of treatment contrasts are relatively insensitive to the choice of covariance function ...". We concur with these points of view so recommend use of a plausible spatial model that has broad application. We choose a separable autoregressive process of order 1 (hereafter denoted AR1×AR1) as originally proposed by Cullis and Gleeson (1991) and used by Gilmour et al. (1997). The AR1×AR1 model is a special case of a separable exponential model. The correlation function for the latter (see Chapter 9) in the context of our field trial is given by

$$\rho(h_{ij}, \phi) = \exp(-|h_{ijr}|/\phi_r)\, \exp(-|h_{ijc}|/\phi_c) \qquad (11.2.6)$$
In field experiments, plots are often of equal size and are laid out in a contiguous array, so that the distance between plots can be measured simply in terms of row and column numbers. Let $h^*_{ijr}$ be the difference in row numbers between plots $i$ and $j$, so that $h^*_{ijr}$ has possible values $0, 1, \ldots, (r-1)$. Define $h^*_{ijc}$ similarly. If $m_r$ and $m_c$ are the actual distances (in metres, say) between the centroids of plots in the row and column directions, respectively, then $h_{ijr} = m_r h^*_{ijr}$, so that the function in equation (11.2.6) can be written as

$$\rho(h_{ij}, \phi) = \alpha_r^{|h^*_{ijr}|}\, \alpha_c^{|h^*_{ijc}|} \qquad (11.2.7)$$
where, for example, $\alpha_r = \exp(-m_r/\phi_r)$. The parameters $\alpha_r$ and $\alpha_c$ are, by definition, positive. If this restriction is lifted, equation (11.2.7) is the correlation function for an AR1×AR1 process. The parameters $\phi = (\alpha_r, \alpha_c)'$ are known as the autoregressive correlation coefficients.
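A short sketch (Python/numpy is our assumption; the layout and correlation values are illustrative only) of how the separable AR1×AR1 correlation matrix is assembled for data ordered as rows within columns:

import numpy as np

def ar1_corr(dim, alpha):
    """Correlation matrix of an AR1 process: alpha^{|i-j|}."""
    idx = np.arange(dim)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

# Hypothetical layout and parameters (not from a real trial):
r, c = 10, 6                 # field rows and columns
alpha_r, alpha_c = 0.6, 0.4  # autoregressive correlation coefficients

# Data ordered as rows within columns: var(e) = sigma2_H * (Rc kron Rr)
Rr, Rc = ar1_corr(r, alpha_r), ar1_corr(c, alpha_c)
R = np.kron(Rc, Rr)          # n x n with n = r*c

The Kronecker form is what makes separability computationally convenient: only the small $r \times r$ and $c \times c$ matrices ever need to be factorised.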

11.2.4 Proposed analysis

The first step in our approach to analysis is to fit the model in equation (11.1.1) with an AR1×AR1 process for the residuals together with all random effects as necessary to reflect the randomisation employed in the experimental design. These randomisation based effects are maintained in the model throughout the modelling process, irrespective of their level of significance, in order to preserve the covariance structure associated with the experimental design. Note that if the number of rows or columns in the trial is small (less than 4, say) then we may choose to assume independence for this dimension rather than fit an AR1 process. Such models are denoted ID×AR1 and AR1×ID (for independence in the column and row dimensions respectively). This base-line model is then examined for adequacy. Here we focus on issues associated with the assumed spatial model, namely the existence of non-stationary global trend, extraneous variation and measurement error. We also consider outlier detection. Issues concerning non-normality of the data and dependence of variance on the mean are not considered here. We note that these problems do not usually occur with respect to the most commonly analysed trait from variety trials, namely grain yield.

Outlier detection

Outlier detection in linear mixed models is a difficult problem. Haslett (1999) and Haslett and Hayes (1998) consider outlier detection in correlated data and suggest using diagnostics based on so-called conditional residuals. In a similar approach, Gogel (1997) extended the alternative outlier model of Thompson (1985) for linear mixed models. Neither of these approaches has been implemented in statistical software so routine use is difficult. We adopt a less formal approach centred on graphical displays. A key graphic, hereafter called a 'residual plot', is of estimated residuals against row (column) number for each column (row). This enables identification of unusual data points relative to their spatial neighbours. We have also found the QQ-plot (Wilk and Gnanadesikan, 1968) to be an informative diagnostic for this purpose. Erroneous data tend to lie well away from the ends of the 'extrapolated' straight line. The outlier detection process is iterative and consultative. At each stage, possible outliers should be identified, then advice sought as to their likely cause and an appropriate remedy.

Global trend and extraneous variation

The existence of non-stationary global trend and extraneous variation may be evident from examination of the residual plots. For example, non-stationary global trend in the row direction will be reflected in the residual plot as smooth trend (linear or non-linear) over row number for each column. Extraneous variation may be more difficult to detect in residual plots. We therefore use an additional diagnostic tool, namely the three-dimensional graph of the sample variogram of the estimated residuals. Note that we do not usually adjust this variogram for the bias induced by using estimates of the residuals (see Chapter 9) since we propose to use it purely as an informal diagnostic tool. The bias does not have an adverse effect on the visual interpretation of the sample variogram.

As a reference point we consider the nature of the theoretical variogram for the assumed covariance model, that is, the separable AR1×AR1 model. This is given by
$$V(h) = \sigma^2_H \left(1 - \alpha_r^{|h^*_{ijr}|}\, \alpha_c^{|h^*_{ijc}|}\right)$$
This increases monotonically in both the row and column directions as the separation between plots increases (and thus correlation decreases). It reaches a plateau that is given by the variance $\sigma^2_H$. The greater the autoregressive correlation coefficients, the slower the rise to the plateau. Figure 11.1 shows a three-dimensional graph of the theoretical variogram for a separable AR1×AR1 model with $\alpha_r = 0.8$, $\alpha_c = 0.2$ and $\sigma^2_H = 1$.

Figure 11.1 Three-dimensional plot of the theoretical variogram for an AR1×AR1 process with $\alpha_r = 0.8$, $\alpha_c = 0.2$ and $\sigma^2_H = 1$.
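The surface in Figure 11.1 can be tabulated directly from the formula above; a minimal sketch (Python/numpy is our assumption), using the same parameter values as the figure:

import numpy as np

def ar1xar1_variogram(h_r, h_c, alpha_r, alpha_c, sigma2=1.0):
    """Theoretical variogram of a separable AR1xAR1 process at
    row/column displacements h_r, h_c (in plot units)."""
    return sigma2 * (1.0 - alpha_r ** np.abs(h_r) * alpha_c ** np.abs(h_c))

hr = np.arange(0, 26)          # row displacements 0..25
hc = np.arange(0, 8)           # column displacements 0..7
V = ar1xar1_variogram(hr[:, None], hc[None, :], alpha_r=0.8, alpha_c=0.2)
# V rises monotonically towards the plateau sigma2 = 1 in both directions,
# more slowly in the row direction because alpha_r is the larger correlation.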

The existence of non-stationary global trend is reflected in a sample variogram that fails to reach a plateau in the row (column) direction. Historically, non-stationary trend of this type was corrected by differencing the data (see Gleeson and Cullis, 1987, for example) but this complicates the analysis unnecessarily. We adopt the approach of Gilmour et al. (1997), namely we fit polynomial functions or smoothing splines (see Chapter ??) to the row (column) coordinates of the plots. Thus non-stationary global trend is explicitly modelled. In terms of the model in (11.1.1), polynomial regression coefficients are included as effects in $\tau$.

Extraneous variation may be revealed in the sample variogram in a number of ways. For example, a sample variogram with a sawtooth appearance indicates the presence of cyclic row (column) effects. Since this is a systematic effect, it can be accommodated in the model by fitting a fixed factor with number of levels corresponding to the length of the cycle. There may also be non-systematic variation associated with rows and columns. In general we choose to accommodate this by fitting random row (column) effects in the model. The sample variogram can be used to diagnose the existence of such variation. If there are row effects, then the sample variogram ordinates will be lower at zero row displacement compared with other row displacements, and similarly for column effects. In terms of the model in (11.1.1), random row (column) effects are included as sub-vectors in $u_o$ and have scaled identity variance matrices.

Measurement error

A measurement error term or so-called nugget effect in the context of spatial models constitutes lack of fit about the smooth spatial trend. The inclusion of such an effect has been proposed by several authors including Wilkinson et al. (1983) in their 'smooth trend plus independent error' model for field experiments. A nugget effect may be included in the model of equation (11.1.1) as in equation (9.3.5), that is, by including an $n \times 1$ sub-vector $u_\eta$ in $u_o$ with a design matrix of $I_n$ and variance matrix $\sigma^2_H \gamma_\eta I_n$, say. The need for measurement error may be revealed in the sample variogram. To see this we first consider the theoretical variogram for an AR1×AR1 process together with a nugget effect, that is, the variogram for $e + u_\eta$. This is given by

$$V(h) = \begin{cases} \sigma^2_H \left(\gamma_\eta + 1 - \alpha_r^{|h^*_{ijr}|}\, \alpha_c^{|h^*_{ijc}|}\right) & h^*_{ij} \neq 0 \\ 0 & h^*_{ij} = 0 \end{cases}$$

Measurement error introduces a jump discontinuity in the variogram at zero separation. The need for measurement error in the model may be diagnosed using the sample variogram although this is often difficult. It may be helpful to consider the two 'faces' of the variogram, that is, the slices corresponding to zero displacement in the column/row direction, and superimpose the corresponding fitted values (for example, $\hat\sigma^2_H (1 - \hat\alpha_r^{|h^*_{ijr}|})$ for the AR1 process in the row direction). If it is unreasonable to constrain the intercept to be zero (which is implicit in the omission of measurement error) then the fitted variogram will rise too sharply to a plateau (reflecting the fact that there will have been a downward bias in the estimation of the autoregressive correlations).

Although the inclusion of measurement error may be desirable it is not always computationally possible. Experience has shown that there are estimation problems when the autoregressive correlations are small (less than about 0.3) or the experiment consists of only a few rows and columns. Zimmerman and Harville (1991) also note the additional computational burden with the inclusion of a nugget effect.
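To visualise the jump discontinuity, one face of the variogram (zero column displacement) can be evaluated with and without the nugget; a sketch (Python/numpy is our assumption) with values loosely in the spirit of model M3 of Table 11.1 ($\gamma_\eta \approx 0.031/0.115$, $\alpha_r = 0.87$):

import numpy as np

def ar1_face_variogram(h, alpha, gamma_eta=0.0, sigma2=1.0):
    """Variogram of an AR1 row process plus nugget at row displacements h;
    it is 0 at zero separation and jumps by sigma2*gamma_eta just off
    the origin when a nugget is present."""
    h = np.asarray(h, dtype=float)
    v = sigma2 * (gamma_eta + 1.0 - alpha ** np.abs(h))
    return np.where(h == 0.0, 0.0, v)

h = np.arange(0, 31)
v_no_nugget = ar1_face_variogram(h, alpha=0.87)
v_nugget = ar1_face_variogram(h, alpha=0.87, gamma_eta=0.031 / 0.115)
# v_nugget[1] - v_no_nugget[1] equals the nugget jump sigma2*gamma_eta.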

Table 11.1 Spatial example: estimated variance parameters for all models fitted.

Model term                 Parameter   M0       M1       M2       M3
genotype                   σ²_g        0.015    0.030    0.029    0.015
block                      σ²_b        0.087    0.046    0.051    0.031
column                     σ²_c                          0.013    0.009
nugget                     σ²_η                                   0.031
error                      σ²_H        0.147    0.139    0.124    0.115
                           α_c                  0.223    0.243    0.405
                           α_r                  0.783    0.749    0.870
residual log-likelihood                504.80   866.88   870.41   891.13

11.2.5 Example: early generation wheat variety trial

We consider an early generation wheat variety trial grown as part of the New South Wales Department of Primary Industries (NSWDPI) wheat breeding program based at the Wagga Wagga Agricultural Institute (data kindly supplied by Dr. Peter Martin). This trial consisted of a total of 1005 lines (1001 test lines and 4 standard varieties). A p-rep design was used, with 189 of the test lines replicated and the remaining 812 planted in single plots (so p = 18.8%). In addition there were multiple plots of the standard varieties (14 plots each for three of the standards and 16 plots for the fourth). Thus there was a total of 1248 plots, arranged in the field in a 104 row by 12 column array. The plot dimensions were 6 m by 1 m so that the full trial occupied 72 m by 104 m. The replicated lines were randomised so that a single replicate appeared in the block comprising rows 1–52 and the other replicate in rows 53–104. Subject to this constraint, the design then involved optimisation of the design criterion, namely $A_{tt}$, in terms of a model with random row and column effects ($\gamma = 0.1$ in each case) and an AR1×AR1 spatial model with $\alpha_r = 0.6$ and $\alpha_c = 0.4$. The design was generated using the DiGGer program (Coombes, 2002). The reader is referred to Cullis et al. (2005) for details of the design search algorithm. The data collected for each plot comprised grain yield (t/ha). The mean yield for the trial was 1.72 t/ha.

We commence with the fitting of the mixed model with terms reflecting the randomisation employed in the design and an AR1×AR1 process for the residuals. In terms of the former the only requirement for this experiment is to include a two level block factor corresponding to the resolvable blocks (so that block 1 = rows 1 to 52 and block 2 = rows 53 to 104). The block effects are included as a sub-vector of $u_o$ with associated variance component $\sigma^2_b = \sigma^2_H \gamma_b$, say. The remaining design strategies were model based, that is, a covariance model was assumed and an optimum design sought for that model.


Figure 11.2 Spatial example: plot of estimated residuals from model M1 against row number for each column.

The estimated variance parameters from this initial model are given in Table 11.1 (as model M1). In order to show the significance of the spatial correlations we also fitted a model with random block effects but independent errors (model M0 in Table 11.1). The REML-LR test statistic for the comparison of models M0 and M1 was 724.16. This can be compared with a critical value of $\chi^2_{0.95}(2) = 5.99$. Thus the correlation model is very significant. The plot of estimated residuals from model M1 against row number for each column is given in Figure 11.2. This graph reveals that there are no extreme outliers but there is evidence of column effects, with the residuals for columns 10 and 4 being relatively low whilst those for columns 11 and 8 are relatively high. The existence of column effects is also clearly seen in the three-dimensional graph of the sample variogram (Figure 11.3 (a)), with variogram ordinates at zero column displacement being lower than at other displacements. Note that for clarity of visual presentation we have restricted the graph of the variogram to row displacements of 30 or less. Thus we add random column effects to the model as a sub-vector in $u_o$ with associated variance component $\sigma^2_c$. The resultant variance parameter estimates are given in Table 11.1 (model M2). The variance component for column effects is very significant, with a REML-LR statistic of 7.06 (p = 0.004 using the Stram and Lee, 1994, adjustment). The variogram from this model (Figure 11.3 (b)) now has a form that is very similar to the theoretical variogram for an AR1×AR1 process.

Finally we add a measurement error term to the model as a sub-vector in $u_o$ with variance component $\sigma^2_\eta$ (see model M3 in Table 11.1 for parameter estimates). The associated variance component is very significant (REML-LR statistic of 41.44, p < .001). The need for the nugget effect can also be seen by considering the graph of the sample variogram corresponding to zero column displacement both before and after inclusion of measurement error. These are shown, together with the associated fitted variograms, in Figure 11.4 (a) and (b). The agreement between sample and fitted variograms appears to be superior for the model including measurement error.

The addition of a nugget effect to the model has had a large impact on the results. In particular, there has been a substantial reduction in the genetic variance (see Table 11.1). The implication is that, without the nugget effect, error variation was incorrectly being assigned to genetic variation. There is some biological justification for inclusion of the nugget effect. The wheat breeder commented that immediately prior to sowing this trial heavy rain resulted in pools of water across the trial. Later in the season when the trial was subjected to drought conditions, plant growth was less affected where the pools had lain. Variation potentially attributable to this soil moisture effect was very small scale and patchy, varying both at a within and between plot level.

From a statistical perspective we are led to consider the reliability of estimation of measurement error in the context of p-rep designs, an issue that has not been examined elsewhere.
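The REML likelihood-ratio comparisons above are simple to reproduce from the log-likelihoods in Table 11.1; a sketch (Python with scipy is our assumption) including the one-sided adjustment of Stram and Lee (1994) for a variance component tested on its boundary:

from scipy.stats import chi2

def reml_lrt(ll0, ll1, df, on_boundary=False):
    """REML likelihood-ratio test. For a single variance component
    tested on the boundary, use the 0.5*chi2(0) + 0.5*chi2(df)
    mixture of Stram and Lee (1994)."""
    D = 2.0 * (ll1 - ll0)
    p = chi2.sf(D, df)
    if on_boundary:
        p = 0.5 * p
    return D, p

# M0 vs M1: two correlation parameters, interior of the parameter space
print(reml_lrt(504.80, 866.88, df=2))                     # D = 724.16
# M1 vs M2: one variance component tested on the boundary
print(reml_lrt(866.88, 870.41, df=1, on_boundary=True))   # D = 7.06, p ~ 0.004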


Figure 11.3 Spatial example: three-dimensional plot of sample variogram for (a) model M1 and (b) model M2 (that is, after addition of random column effects).


Figure 11.4 Spatial example: plot of sample and theoretical variogram at zero column displacement for (a) model M2 and (b) model M3 (that is, after addition of nugget effect).

Table 11.2 Estimation of nugget effect for p-rep design. Empirical mean squared errors of prediction (×100) for two data models and two analysis models. Results are means of 400 simulations.

                     Data generation model
Analysis model       no nugget    nugget
no nugget            1.628        1.573
nugget               1.632        1.163

In order to assess this we conducted a simulation study based on the factorial combination of two data models (with and without nugget effect) by two analysis models (with and without nugget effect). The models and variance parameters for data generation were models M2 (no nugget model) and M3 (nugget model) from Table 11.1. The data were generated according to the design used in the example. Each generated data set was analysed using two models, namely those corresponding to models M2 and M3. For each analysis, predictions of genotype effects ($\tilde{u}_{g_i}$, $i = 1 \ldots 1005$) were calculated together with the empirical mean squared error of prediction (MSEP), $\sum_{i=1}^{1005} (\tilde{u}_{g_i} - u_{g_i})^2 / 1005$. The results are presented in Table 11.2.

When data were generated without a nugget effect there was no penalty in fitting the nugget effect, with the empirical MSEP being very similar. We note that for these data the estimate of the nugget variance was on the boundary (that is, estimated as zero) for 203 out of the 400 simulations. Thus there was no evidence of 'phantom' nugget variance being estimated. When data were generated with a nugget effect, however, there was a substantial loss in efficiency by not fitting the effect (the ratio of MSEP being 1.35). In terms of the estimation of variance components there was little bias in the estimates when obtained using a model for analysis that matched the data generation model.

This simulation study showed that measurement error can be reliably estimated for a p-rep design and that we can therefore accept model M3 for the example as being the best model of those fitted. It is also important to note that the asymptotic prediction error variances for the fitted models were very similar to the empirical values. The asymptotic MSEP for the genetic effects for a model can be calculated based on the approximation used in Section 11.1. Thus we calculate MSEP as $\sigma^2_H A_{gg}/2$ where $\sigma^2_H A_{gg}$ is the average pairwise prediction error variance of all genotype effects (rather than just the test genotype effects). The asymptotic MSEP (×100) for model M2 was 1.614, which compares favourably with the empirical value of 1.628 in Table 11.2, and the asymptotic MSEP (×100) for model M3 was 1.139, which is very similar to the empirical value of 1.163. The agreement between the asymptotic and empirical MSEP is reassuring as it provides confidence in inference based on asymptotic results and suggests that recourse to Monte Carlo methods may be unnecessary.
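The empirical MSEP itself is a one-line calculation; the following sketch (Python/numpy is our assumption; the true effects and E-BLUPs are stand-ins, since reproducing the full data generation and REML fits is beyond a few lines) shows the form of the computation:

import numpy as np

rng = np.random.default_rng(7)

g = 1005
u_true = rng.normal(0.0, np.sqrt(0.015), size=g)        # "true" genotype effects
u_blup = 0.8 * u_true + rng.normal(0.0, 0.05, size=g)   # stand-in for E-BLUPs

# Empirical MSEP: mean over genotypes of (u~ - u)^2
msep_empirical = np.mean((u_blup - u_true) ** 2)

# The asymptotic counterpart is sigma2_H * A_gg / 2, where sigma2_H * A_gg
# is the average pairwise prediction error variance of all genotype effects,
# obtained from the C-inverse matrix of the fitted model.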


Figure 11.5 Spatial example: plot of histogram and approximate density function for test line BLUPs. Cut-off for selection of top 20% of lines is indicated.

A histogram of the test line BLUPs is shown in Figure 11.5. In terms of selection we consider the top 20% of test line BLUPs (the cut-off point for which is marked as the vertical line in Figure 11.5). The associated EGG (calculated using equation (11.1.5)) is 0.083 t/ha. This value is quite low, as is the estimate of line mean heritability with $h^2_g = 0.23$.
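The reported EGG can be recovered (approximately) from the quantities in the text; a sketch (Python with scipy is our assumption) using $\sigma^2_g = 0.015$ from model M3 and the generalised heritability $h^2_g = 0.23$, noting from (11.1.5) that $\mathrm{EGG} = i\,\sigma_g h_g = i\sqrt{\sigma^2_g h^2_g}$:

from math import sqrt
from scipy.stats import norm

m = 0.20                              # select the top 20% of test lines
i = norm.pdf(norm.ppf(1 - m)) / m     # selection intensity, approx. 1.40

sigma2_g = 0.015                      # genetic variance (model M3, Table 11.1)
h2_g = 0.23                           # generalised mean line heritability
egg = i * sqrt(sigma2_g * h2_g)       # = i * sigma_g * h_g from (11.1.5)
print(round(egg, 3))                  # ~0.082, close to the reported 0.083 t/ha

The small discrepancy arises because the infinite-sample selection intensity is used here, whereas (11.1.5) defines $i$ via the order statistics of a sample of size $t$.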

11.3 Analysis of multi-phase trials for quality traits

Many quality traits are obtained from multi-phase experiments in which lines are grown in the field then the resultant grain samples processed in the laboratory. In terms of varietal selection, quality traits are as important as grain yield. However, in contrast to grain yield data, the literature on methods of design and analysis for quality trait data from multi-phase trials is scarce. In a general setting, key references for multi-phase experiments are McIntyre (1955), Wood et al. (1988) and Brien (1983). We propose an approach that is analogous to our approach for field trials in the sense that we combine randomisation and model based techniques. The approach is based on that of Smith et al. (2001b) and Cullis et al. (2003).

The motivation for our approach has been in terms of the analysis of two key quality traits, namely the milling yield of wheat and the malting quality of barley. It is therefore useful to provide some background information concerning the laboratory processes involved in obtaining these data. The milling of grain samples from a field trial is conducted as a sequential process that almost always requires more than a single day of processing. Each grain sample is milled, that is, ground until all bran has been removed and the remaining endosperm has been reduced to flour. The resultant flour sample is weighed to provide either a milling yield (in absolute units) or a trait known as flour extraction (weight of flour expressed as a percentage of the weight of grain).

The measurement of barley malting quality is a complex process. Grain samples from field plots are initially malted (a controlled germination and kilning process) in a machine known as a micro-malter. Samples are arranged in the micro-malter in a two-dimensional (row by column) array. Depending on the capacity of the machine, the malting of a field trial may require several 'runs' of the micro-malter. After completion of the malting phase samples are further processed in order to obtain the traits of interest. There is rarely any re-randomisation or additional replication at this stage so that, for the purposes of statistical analysis, we can regard this as a two-phase experiment (with a field and a malting phase).

The acquisition of data for both these traits involves two-phase experiments. There are natural potential sources of variation (between days for milling and between runs for malting) and sources of correlation (temporal correlation within days for milling and spatial correlation within the micro-malter for malting). Our approach to analysis aims to accommodate all such sources of variation. As in Section 11.2 we begin with a discussion of randomisation and model based analyses, then describe our method in detail. We illustrate the approach using milling yield data from an Australian early generation wheat variety trial.

11.3.1 Randomisation based analysis

The randomisation based analysis requires determination of all sources of variation associated with the experimental design. In single phase experiments this is usually straightforward but it may be more difficult in the context of multi-phase experiments. Brien (1983) provides some helpful guidelines with his concept of 'tiers'. The application of this technique for quality trait data is well explained in Cullis et al. (2003) so we do not pursue this further here. The basic principle, however, is to include terms in the model that capture the randomisation processes used in each phase of the experiment. In a two-phase quality experiment, for example, there are two randomisations, namely the randomisation of genotypes to field plots then the randomisation of field plots to 'positions' in the laboratory process. Thus the effects for blocking factors associated with each of these randomisations must be included in the analysis. The most crucial feature of the analysis of multi-phase data is the need to include an 'error' term for each phase of the experiment. In terms of the model in equation (11.1.1), error effects for all phases other than the final phase and effects for blocking factors for all phases are included as sub-vectors in $u_o$. It is instructive to re-write the model in (11.1.1) and explicitly include these error effects rather than leaving them embedded within $u_o$. For simplicity we restrict attention to two-phase experiments but the extension to more phases is straight-forward. Thus we write the model for a two-phase quality trait experiment as

$$y = X\tau + Z_g u_g + Z_p u_p + Z_o u_o + e \qquad (11.3.8)$$
where the vectors $y$, $\tau$, $u_g$ and $e$ are as defined for (11.1.1) and $u_p$ is the $n_p \times 1$ vector of random (residual) field plot effects (where $n_p$ is the number of plots tested) with associated $n \times n_p$ design matrix $Z_p$. The vector $u_o$ now represents any remaining non-genetic effects, that is, other than the plot effects. Before we discuss model formulae in detail it is instructive to consider a simple balanced example that can be analysed using ANOVA.

Simple hypothetical two-phase experiment for milling yield

We consider a hypothetical milling yield trial. We assume that $r$ field replicates of $g$ genotypes are grown in a field trial that is designed as an RCB. Grain samples from each of the $rg$ field plots are split into $d$ smaller samples to be used as replicates in the laboratory process. Thus there is a total of $n = rgd$ samples to be milled. We assume here that $rg$ samples can be processed each day so that the full trial requires $d$ days. Field plots are randomised to times in the milling process using an RCB design with days as blocks. A single sample from each of the $rg$ field plots is processed each day and the plots are allocated

completely at random within a day. The data measured for each sample is the flour extraction.

Table 11.3 ANOVA table for hypothetical milling yield example.

Strata/Decomposition      df                 E(mean square)              Model term
mean                      1
mrep                      d − 1              rgσ²₁ + σ²_H                mrep
mrep.order                d(rg − 1)
  frep                    r − 1              dgσ²₂ + dσ²₃ + σ²_H         frep
  frep.plot               r(g − 1)
    genotype              g − 1                                          genotype
    residual              (r − 1)(g − 1)     dσ²₃ + σ²_H                 frep.plot
  residual                (d − 1)(rg − 1)    σ²_H                        units
total                     rgd − 1

In order to develop the analysis within an ANOVA framework we follow the usual practice of assuming that block effects are random and treatment effects are fixed. Thus we assume here that genotype effects are fixed. Accordingly, and based on the randomisation processes described above, the symbolic model formula for the data can be written as
y ∼ genotype + mrep + frep + frep.plot + mrep.order        (11.3.9)
where genotype is a factor with g levels, mrep is a factor (for replicates in the milling process) with d levels, frep is a factor (for field replicates) with r levels, plot is a factor (for plots within field replicates) with g levels and order is a factor (indexing the order of processing of samples within days) with rg levels. Thus the final term in (11.3.9) is the residual term that is also represented generically as units. We denote the variance components for the random effects by $\sigma^2_i$, $i = 1 \ldots 3$, for mrep, frep and frep.plot respectively. In terms of the algebraic form for the mixed model in (11.3.8) the effects associated with frep.plot correspond to $u_p$ and the effects for mrep and frep are included in $u_o$.

We can apply the techniques of Nelder (1965a) and Nelder (1965b), as described in Chapter 1, to these data in order to derive the ANOVA table. Key things to note are the existence of five strata (corresponding to the mean, laboratory replicates, field replicates, plots within field replicates and samples within laboratory replicates). Also it can be shown that the design is orthogonal, that is, the genotype effects are estimated in a single stratum, and this stratum corresponds to plots within field replicates. Thus the skeletal ANOVA table is as given in Table 11.3. The model terms from (11.3.9) that correspond to the sources of variation in the ANOVA table are given in the final column.

Following the approach in Chapter 1 it can also be shown that the variance matrix of the estimated (fixed) genotype means (as deviations from the overall mean) is given by
$$\frac{d\sigma^2_3 + \sigma^2_H}{rd}\, (I_g - A_g)$$
and thence the average pairwise variance is $2(d\sigma^2_3 + \sigma^2_H)/rd$. Thus it is clear that inclusion of the 'error' term for field plots, namely frep.plot, is crucial, otherwise the variance matrix (and thence inference) for estimated genotype effects will be incorrect (unless $\sigma^2_3 = 0$).

When we change from fixed to random genotype effects the orthogonality properties of the design remain, with all of the information on genotype effects being contained in the plots within field replicates stratum. Thus, analogous to the fixed effects setting, a field plot error term should be included in the model, otherwise prediction error variances will be incorrect. More important, however, is that the genotype predictions (BLUPs) themselves will be affected because the associated shrinkage will be incorrect.

This example is atypical of trials for measuring quality trait data in that we rarely have balanced data and orthogonal designs.
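The expected mean squares in Table 11.3 lead directly to method-of-moments estimators of the variance components, obtained by equating observed mean squares to their expectations; a sketch (Python is our assumption; the dimensions and mean squares are invented placeholders):

# Hypothetical dimensions and observed mean squares (placeholders)
r, g, d = 2, 50, 3
ms_mrep      = 12.0   # stratum: mrep
ms_frep      = 9.0    # stratum: frep
ms_fplot_res = 4.0    # residual within the frep.plot stratum
ms_resid     = 1.5    # bottom stratum (units)

# Equate mean squares to their expectations from Table 11.3
sigma2_H = ms_resid                           # E(MS) = sigma_H^2
sigma2_3 = (ms_fplot_res - sigma2_H) / d      # E(MS) = d*sigma_3^2 + sigma_H^2
sigma2_1 = (ms_mrep - sigma2_H) / (r * g)     # E(MS) = rg*sigma_1^2 + sigma_H^2
sigma2_2 = (ms_frep - d * sigma2_3 - sigma2_H) / (d * g)

# Average pairwise variance of genotype means: 2*(d*sigma_3^2 + sigma_H^2)/(rd)
avg_pair_var = 2 * (d * sigma2_3 + sigma2_H) / (r * d)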
The most common scenario is quite complex, with only a subset of genotypes from the field trial being quality tested and partial rather than complete replication in terms of both the field and laboratory phases. We consider an example of this type in Section 11.3.4. Thus in general we cannot use ANOVA but must conduct a mixed model analysis. Information on genotype effects will be contained in more than one stratum, but the majority is still likely to be associated with the field plot stratum. In terms of the mixed model we must therefore ensure that random plot effects (that is, $u_p$) are included.

11.3.2 Model based analysis

As with the analysis of field trials, we can model trend in multi-phase trials in order to improve the accuracy and efficiency of genotype contrasts. In the case of field trials we model spatial trend in the errors. In multi-phase quality trials the potential exists to model trend (spatial or temporal) associated with the errors for any of the phases. The type of trend modelling depends on the trait and/or measurement process. Since our experience has largely been in terms of the analysis of milling yield in wheat and malting quality in barley we discuss modelling in the context of these data but the concepts generalise to other traits.

Covariance models for local trend

As previously discussed, the milling of multiple samples from a field trial involves a sequential process that usually requires more than a single day. There is potential for temporal correlation linked to the order of processing samples within a day. If we assume a rectangular trial layout of $d$ days and $s$ samples per day, making a total of $n = ds$ samples, and order the data sequentially within days, then the temporal correlation leads to a variance matrix for the residuals of the form

$$\mathrm{var}(e) = \sigma^2_H \left( I_d \otimes R_o(\phi_o) \right)$$
where $R_o$ is the $s \times s$ correlation matrix for sample order within days. As in the spatial modelling of field trials, a range of covariance models is possible. We have found that an autoregressive process of order 1 provides a plausible model for $R_o$. We therefore denote the full correlation model for $e$ by ID×AR1.

In terms of the measurement of malting quality, we recall that grain samples are tested in a micro-malter machine. Due to the arrangement of samples in the micro-malter there is potential for spatial variation. We let $r_m$ and $r_c$ denote the numbers of rows and columns in the micro-malter. For simplicity we initially assume that all samples can be processed in a single 'run' of the micro-malter so that $n = r_m r_c$. If the data are ordered as micro-malter rows within columns then we have

$$\mathrm{var}(e) = \sigma^2_H \left( R_{mc}(\phi_{mc}) \otimes R_{mr}(\phi_{mr}) \right)$$
where $R_{mr}$ and $R_{mc}$ are the correlation matrices for the micro-malter row and column dimensions. Once again we have found the AR1×AR1 model to be reasonable. If the data comprise several runs of the micro-malter then we assume independence of the errors between runs and often constrain the autocorrelation parameters to be the same for all runs.

In addition to modelling at the (laboratory) residual level there is potential for modelling at the field plot residual level. We proceed here as for the spatial analysis of a field trial. Thus we may consider an AR1×AR1 process for $u_p$. Of course it is important to recognise that the modelling of correlation for a random term in the mixed model will only be possible if that term accounts for a substantial amount of the total variation in the data.
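For the milling setting, the ID×AR1 residual structure is just a block-diagonal Kronecker product; a sketch (Python/numpy is our assumption) using the layout of the example below (20 days of 28 samples, design value $\alpha_o = 0.5$):

import numpy as np

def ar1_corr(dim, alpha):
    """AR1 correlation matrix: alpha^{|i-j|}."""
    idx = np.arange(dim)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

d, s, alpha_o = 20, 28, 0.5    # days, samples per day, order autocorrelation
R = np.kron(np.eye(d), ar1_corr(s, alpha_o))  # var(e) = sigma2_H * R
# Days are independent (identity blocks); within a day, correlation decays
# geometrically with separation in milling order.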

11.3.3 Proposed analysis

The first step in our approach to analysis is to fit the model in equation (11.3.8) with all random effects as necessary to reflect the randomisation employed in the experimental design. Initially we assume independence for all error effects, that is, for $u_p$ and $e$. As in our spatial approach the randomisation based effects are maintained in the model throughout the modelling process, irrespective of their level of significance, in order to preserve the experimental strata. We examine the estimated residuals from the randomisation based model in order to detect outliers, global trend and extraneous variation. We must consider these issues in relation to both phases. We then consider correlation models for $e$ (the laboratory residuals) and check the adequacy using previously described diagnostic techniques. Finally we investigate correlation models for $u_p$ (the field plot residuals), provided that the associated source of variation accounts for a substantial proportion of the total variation in the data.

11.3.4 Example: early generation wheat milling yield trial

We consider the milling of genotypes grown in the field trial described in Section 11.2.5. Recall that the field trial comprised 1005 lines (1001 test lines and 4 standard varieties). The wheat breeder wished to mill only a subset of the test lines (some lines having already been discarded on the basis of other traits, including grain yield). A total of 416 lines was milled, comprising all standard varieties and 412 test lines. The milling trial was designed as a so-called p/q-rep design (Smith et al., 2005b) in which field replicates of 52 of the test lines were milled (so that p = 12.6%) together with single plots of the remaining 360 test lines. Additionally 9 plots of each of three of the standard varieties and 2 plots of the other standard were milled, making a total of $n_p$ = 493 plots. Note that this represents only 40% of the original field trial. The design was generated using the DiGGer program (Coombes, 2002).

In the laboratory 67 of the plots (that is, q = 13.6%) were replicated, making a total of n = 560 samples. These were milled as 28 samples per day so that the full trial required 20 days. The plots replicated in the laboratory were randomised so that a single replicate appeared in the block comprising days 1–10 and the other replicate in days 11–20. The randomisation was restricted with respect to an additional blocking factor, namely days within replicates. Subject to these constraints, the design then involved optimisation with respect to a model with an ID×AR1 process for the residuals (with an autocorrelation for sample order of $\alpha_o = 0.5$). The data collected for each sample comprised the flour extraction (weight of flour expressed as a percentage of the weight of the grain sample). The mean flour extraction for the trial was 63.5% (but see below).
A complication with these data is that the genotypes fall into two populations depending on the presence of genes that control the hardness of the genotype. The standard practice within the NSW wheat breeding program is to divide the lines into two groups, namely 'hard' and 'soft' genotypes. In terms of flour extraction the two groups usually differ with respect to their means (with soft lines yielding less) and their genetic and residual variances (with larger variances for the soft group in both cases). The two types of genotypes have different end uses. For simplicity we focus here on the hard genotypes, which are amenable to bread making. These usually constitute the majority of lines (in this example 311, that is, 75% of the test lines being milled, are hard and all standard varieties are hard). The mean flour extraction for the hard lines in this trial was 64.8% (and for the soft lines it was 58.7%).

In order to restrict attention to the hard lines and avoid the need to fit a complex model with heterogeneous variances for hard and soft genotypes we include a fixed effect for each sample corresponding to a soft genotype. In this way all sources of variation in the data relate to hard genotypes alone. Note that we could have achieved the same result either by dropping observations from the data-set or by changing the real data for the soft genotypes to missing values. Both of these approaches have disadvantages compared with the approach we adopted. If the former approach is used we can no longer use a separable variance structure for the residuals so lose the associated computational benefits. With the latter we must alter the data themselves, which may be undesirable. Our approach allows us to deal with the problem via model specification, and we avoid any loss of efficiency induced by fitting a large number of fixed effects by utilising the sparse matrix methods of the software, namely ASReml (?) and samm (?).

We can write the model formula for the randomisation based model in symbolic notation as
fe ∼ sunits + genotype + mblock + mblock.daywb + fblock + plot + units
where genotype is a factor with 416 levels, mblock is a factor (corresponding to resolvable blocks in the laboratory) with 2 levels, daywb is a factor with 10 levels (corresponding to days within resolvable blocks), fblock is a factor (corresponding to resolvable blocks in the field) with 2 levels and plot is a factor (indexing field plots) with 493 levels. The factor sunits is included to 'eliminate' the data for the soft genotypes. It has 122 levels, where the first 121 index the samples corresponding to soft genotypes and the last level is assigned to all samples corresponding to hard genotypes. We prefer to work with a different (but equivalent) specification of the model, namely
fe ∼ sunits + genotype + mblock + day + fblock + column.row + units        (11.3.10)
where day is a factor with 20 levels (indexing all days) so that mblock.daywb ≡ day, and column and row are factors indexing field columns (12 levels) and rows (104 levels) so that plot ≡ column.row. This last term represents the residual field plot error.

We let the variance components for the non-genetic random effects in (11.3.10) be denoted by $\sigma^2_{mb}$, $\sigma^2_d$, $\sigma^2_{fb}$ and $\sigma^2_p$ for milling blocks, days, field blocks and plots, respectively. Estimates of these components are given in Table 11.4 as model M1. The genetic variance for this trait is large. In terms of non-genetic variation, the majority (76%) stems from the laboratory. The largest non-genetic component is associated with milling days (within blocks). The graph of the BLUPs of day effects (Figure 11.6) reveals that the last 4 days of the trial gave rise to very low flour extractions. This is consistent with information from the laboratory technician, who commented that the last 4 days of milling occurred after a wet weekend, causing the sample moistures to rise so that extractions for all samples on these days may well be lower than expected.

The plot of estimated residuals against milling order for each day (Figure 11.7) reveals the existence of several potential outliers. However, the laboratory technician did not find these to be erroneous so no changes were made to the data. An obvious feature of Figure 11.7 is the consistent (declining) linear trend in flour extraction over the course of each day. Such a phenomenon has been reported elsewhere for the type of mill used here (see Smith et al., 2001b) and is partly due to a warming of the mill over the day. There are controls in place that attempt to maintain mill temperature but the effect observed here still remains.
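The 'sunits' device described above is easy to construct; a sketch (Python with pandas is our assumption; the data frame is a stand-in for the real trial data) in which each soft-genotype sample gets its own factor level and all hard samples share one level:

import numpy as np
import pandas as pd

# Stand-in for the real data: 121 soft samples and 439 hard samples (n = 560)
df = pd.DataFrame({"hardness": ["soft"] * 121 + ["hard"] * 439})

soft = df["hardness"].eq("soft")
df["sunits"] = np.where(soft, np.arange(1, len(df) + 1), 0)  # 0 = common hard level
df["sunits"] = df["sunits"].astype("category")               # 122 levels in all

# Fitting a fixed effect for each sunits level absorbs every soft observation,
# so the remaining variance parameters relate to hard genotypes only, while
# the full 560-sample layout (and hence separability) is preserved.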

Table 11.4 Multi-phase example: estimated variance parameters and key fixed effects for all models fitted.

Model term                 Parameter   M1        M2        M3
genotype                   σ²_g        3.886     4.037     4.163
mblock                     σ²_mb       0.853     0.795     0.834
day                        σ²_d        1.555     1.715     1.477
fblock                     σ²_fb       0.087     0.106     0.140
column.row                 σ²_p        0.904     0.582     0.520
error                      σ²_H        0.815     0.562     0.765
                           α_o                             0.803
lin(order)                 τ_o                   −0.075    −0.082
residual log-likelihood                          −535.69   −526.35

We therefore added a term to the model for the linear regression of flour extraction on order within each day. Thus the model included a fixed effect, $\tau_o$ say, corresponding to the slope of this regression. The estimate of the slope is given in Table 11.4 (for model M2). The regression was very significant (p < .001). Remaining (stationary) trend over order within days was modelled using an ID×AR1 correlation model for the residuals. This was very significant (REML-LR test statistic of 18.69, p < .001). The autocorrelation parameter $\alpha_o$ was estimated as 0.802 (see model M3 in Table 11.4). We added a nugget effect to the model but the estimate of the associated variance component was on the boundary (that is, zero).

Having investigated the laboratory phase we now consider the field phase. First we construct a residual plot for field errors. Thus we take BLUPs of the plot effects from model M3 and graph them against row number for each column (see Figure 11.8). There are, of course, numerous gaps in this graph due to the fact that only a subset of the full field trial was milled. This can make outlier and trend detection difficult. For these data there is little suggestion of extraneous variation. We fitted an AR1×AR1 process for the column.row term but this resulted in only a small increase (0.33 units) in the residual log-likelihood, despite reasonably large values for the autocorrelations ($\alpha_r = 0.40$ and $\alpha_c = 0.50$). Thus we would choose model M3 as the best of those fitted.

A histogram of the test line BLUPs for hard genotypes is shown in Figure 11.9. This reveals the existence of two very poor genotypes. Superimposed on the histogram is a plot of the density for the approximate distribution for $\tilde{u}_t$ as given in (11.1.4). In terms of selection we consider the top 10% of test line BLUPs (the cut-off point for which is marked as the vertical line in Figure 11.9).


Figure 11.6 Multi-phase example: plot of BLUPs of day effects from model M1.

The associated EGG (calculated using equation (11.1.5)) is 3.25%. This value is quite high, as is the estimate of mean line heritability with $h^2_g = 0.83$. The high heritability and EGG are largely due to the high genetic variance for flour extraction.


Figure 11.7 Multi-phase example: plot of estimated residuals (for hard genotypes only) from model M1 against milling order for each day.


Figure 11.8 Multi-phase example: plot of estimated field plot residuals (for hard genotypes only) from model M3 against field row for each column.


Figure 11.9 Multi-phase example: plot of histogram and approximate density function for hard test line BLUPs. Cut-off for selection of top 10% of lines is indicated.

CHAPTER 12

The analysis of quantitative trait loci

12.1 Introduction

The sequencing of the human genome (http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml) is but one part of the explosion in genetic research for humans, animals and plants. Data are being generated at an ever increasing rate as technology is developed at perhaps an even faster rate. The data generated aim to support the discovery of genes, proteins and metabolites, their function, and the relationships between them in all living things. This chapter touches on one part of this grand design, the search for quantitative trait loci (QTL).

Quantitative traits are those traits that vary continuously. For example, the yield of wheat in a field plot will take on a non-negative value that will depend on the underlying genetics of the variety grown, the environment in which the wheat is grown, and the interaction between the genetics and the environment. These aspects have been discussed in a broad manner in Chapter 11, where variety effects represent the genetic component.

The determination of genomic regions or genes that influence the expression of a quantitative trait is important in plant and animal breeding as well as in the human context. For plants, this can lead to rapid improvements in agronomic and quality traits, in disease resistance, and in tolerance to both biotic and abiotic stresses. In the livestock industry, this can lead to improvements in quality traits, such as marbling in beef, and wool characteristics in sheep.

12.2 Example

To motivate our development we use data from two field experiments conducted in 1999 and 2000, which involved 175 of a total of 180 doubled haploid (DH) lines from the Sunco × Tasman mapping population. This population was developed as part of the Grains Research and Development Corporation National Wheat Molecular Marker Program in Australia. Marker assisted selection for quality characteristics is difficult as these traits are often influenced by many QTLs whose effects can be environmentally dependent. Additionally, ? and ? have shown that quality traits are often subject to large amounts of non-genetic sources of variation, hindering both genetic progress using traditional breeding approaches and efficient and accurate identification of QTLs. The quality trait we use for illustration here is flour yield, which is one of the most commercially important quality traits in wheat breeding in Australia.

? have recently analyzed these data with an improved linkage map and using the methods of ?.

Both field trials were designed as randomized complete block designs with two replicates of each DH line and additional plots of parental genotypes and commercial varieties. Each trial was laid out in the field as a rectangular array of 38 rows and 12 columns. Grain samples from most of the field plots were then milled using a Buhler mill. For the 1999 field trial, none of the field plots were replicated in the milling process but an additional 47 so-called milling control samples were included at regular intervals during the milling of the field samples. The field plots were randomly assigned to mill days and mill order within mill days. The laboratory measurement phase took a total of 38 mill days with 11 samples milled per day. In 2000, 23% of the field samples were replicated in the milling process. Thus a total of 456 samples were milled over 38 mill days with 12 samples per mill day. Field plot samples were randomly assigned to mill days and mill order within mill days.

12.3 Overview of Molecular Genetics

Living things are composed of cells. Prokaryotes are organisms, such as bacteria, that do not have a cell nucleus. Eukaryotes have complex cells; animals, plants, fungi and protista fit into this class. The cell is central to the so-called central dogma of biology. A fundamental part of the nucleus of each cell is deoxyribonucleic acid (DNA). DNA is often referred to as the molecule of heredity, as it contains the genetic instructions specifying the biological development of all cellular forms of life (http://en.wikipedia.org/wiki/). DNA is a polymer, a very long molecule consisting of structural units that are connected by covalent chemical bonds. It includes repetition of many identical, similar, or complementary molecular subunits that are called monomers. Monomers link during a chemical reaction called polymerization. A simple representation of DNA is presented in Figure 12.1. The double helix consists of two sugar-phosphate backbones together with nitrogenous bases that pair across the strands and are connected by hydrogen bonds. The bases are the purines, Adenine (A) and Guanine (G), and the pyrimidines, Cytosine (C) and Thymine (T). A pairs with T, and G pairs with C. The chemical structure of DNA is presented in Figures 12.2 and 12.3. The two strands are complementary; the pentose molecules (ribose or deoxyribose) and the 5′ and 3′ labels as given in Figure 12.4 indicate the direction in which the DNA is read. The nucleus contains chromosomes, threadlike "packages" of genes and other DNA; see Figure 12.5. These genes are the beginning of the complex biochemical processes that include maintenance and development of the living thing, and reproduction. Different organisms have different numbers of chromosomes and hence a different genome.

Figure 12.1 Double helix structure of DNA

Figure 12.2 Chemical structure of DNA

Figure 12.3 Nucleotide structure within DNA and Purines/Pyrimidines

Figure 12.4 Pentose within DNA with hydrogen bond labels

Human beings are diploid, with 23 pairs of chromosomes: 46 chromosomes in total, of which 44 are autosomes and 2 are the sex chromosomes. The bovine genome consists of 30 pairs of chromosomes, 29 pairs of autosomes and 1 pair of sex chromosomes. Barley is also diploid and has 7 pairs of chromosomes. Wheat, however, is a hexaploid, with three genomes each of 7 chromosome pairs. Despite this complexity, the wheat genome is often treated as though it has 21 chromosome pairs. Polyploidy is common in horticultural plants such as strawberries. Sugarcane is an autopolyploid with varying genome size and represents one of the most complex plant genomes.

The central dogma, as represented in Figure 12.6, has formed the basis of our understanding of biological processes for some time. The cell produces the appropriate protein for the specific process required in several steps: transcription, post-transcription, translation and post-translation. The result is an active protein that could, for example, trigger a response in a plant in drought situations. It is the gene, and hence the portion of DNA that regulates the protein synthesis, that we seek to discover and to understand.

Figure 12.5 Chromosomes in the nucleus of the cell

Figure 12.6 Central dogma: protein synthesis

Figure 12.7 Recombination in Meiosis

12.4 Reproduction

The reproduction of the species presents the opportunity for changes in genetic makeup through the creation of a new individual. In animal and plant breeding this is the opportunity to seek improvements in important economic attributes of animals and plants. In experimental populations, the generation of new offspring provides the mechanism for determining genomic regions that impact on these attributes. Thus the genetics of reproduction is crucial.

12.4.1 Meiosis

There are two types of cell division: mitosis and meiosis. Mitosis occurs during normal biological growth and differentiation, and results in two identical cells produced from a single cell. Meiosis is the specialized division of cells that, in diploid organisms, produces four daughter cells, each having half the number of chromosomes of the progenitor cell. These are the so-called haploid gametes. Meiosis includes one round of chromosome duplication and two rounds of cell division. Meiosis begins with cells containing two sets of chromosomes (in diploids). The DNA duplicates so that each chromosome consists of two identical DNA duplex strands, called sister chromatids. The pair of chromosomes are called homologs. The homologs form four duplex strands of DNA, with non-sister chromatids pairing. Two rounds of cell division, the first to diploid cells and the second to haploid cells, result in four gametes for each chromosome in the cell. Figure 12.7 presents the process for a single chromosome. The diploid nature of the organism is restored during fertilization, when a male and a female gamete fuse together to form the zygote, which can become an embryo.

12.4.2 Recombination

A crucial part of meiosis is the chiasma, the cross-shaped structure formed between non-sister chromatids. The breakage of non-sister chromatids and their reunion with the other chromatid is called crossing-over or recombination; see Figure 12.7. This means that the next generation may carry a non-parental haploid combination when gametes fuse in fertilization, and hence possible diversity in genetic composition. Recombination is fundamental in linkage and QTL analysis; indeed, these methods rely on recombination events.

12.5 Genetic information

We have seen that the pedigree structure can provide genetic information on lines. The relationship matrices provide average genetic connections between progeny. However, to understand how and which genes influence the phenotype of the progeny, genetic information within progeny is required. Combining both pedigree and within-progeny information is also desirable. The ideal genetic information would be the complete DNA sequence for each progeny. However, this is impractical because genomes are huge and complex, and sequencing even one progeny is a formidable task. The solution is to "summarize" information on the genome in some way.

12.5.1 Molecular markers

Molecular markers represent genetic differences between individual organisms or species. A marker needs to exhibit variation, that is, to be polymorphic. Monomorphic markers are those markers that do not exhibit variation across progeny. These markers are usually non-informative, but in structured populations a marker may be uninformative in one population yet informative in others, and hence useful. Molecular markers are not genes in general, but they flag diversity at specific locations on the genome called loci. Markers close to genes are sometimes called gene 'tags'. There are various types of markers. Morphological (from the Greek morphe, meaning form) markers have been used by plant breeders for a long time. Essentially these are phenotypes observed after crossing of lines. Table 12.1 gives an example of morphological markers for wheat. Biochemical markers such as isozymes, enzymes that have different amino acid sequences but catalyze the same chemical reaction, have been used for low-level genetic information.

Table 12.1 Morphological markers for wheat

However, it was the advent of DNA-based molecular markers that allowed significant progress to be made in providing more detailed genetic information on individuals. Molecular markers are generally abundant, and arise from different classes of DNA mutations, namely substitution mutations (point mutations), rearrangements (insertions or deletions), and errors in replication of tandemly repeated DNA. Broadly, there are three types: hybridization based, polymerase chain reaction (PCR) based, and DNA sequence based.

Hybridization is the process of combining complementary, single-stranded nucleic acids into a single molecule. Two perfectly complementary strands will bind to each other, whereas a single inconsistency between the two strands will prevent them from binding. This is the basis for determining whether an organism has a specific sequence, and hence provides a means for obtaining potential markers.

Polymerase chain reaction (PCR) is sometimes called "molecular photocopying", because it is a method to amplify or produce DNA. The origin of this technique is a hot spring in Yellowstone National Park, where a bacterium, Thermus aquaticus, highlighted how an organism copies its DNA in the cell cycle. The two DNA strands are denatured, or separated, and an enzyme called DNA polymerase copies the strands using each strand as a template. However, the enzyme cannot copy a chain without a short sequence of nucleotides to "prime" the process. Another enzyme, primase, makes the first few nucleotides: this stretch of DNA is called a primer. Subsequently the polymerase takes over and finishes the job. PCR does this in a test tube using Taq polymerase, isolated from Thermus aquaticus.

Restriction enzymes provide a way to isolate sections of DNA. These enzymes cut double-stranded DNA with two incisions, one through each of the phosphate backbones of the double helix, without damaging the bases. The chemical bonds that the enzymes cleave can be reformed by other enzymes known as ligases. Ligation, or splicing together, allows reformation provided the ends are complementary. These enzymes were discovered in E. coli strains; EcoRI is a particular example, and its recognition sequence is 5'-GAATTC-3'.

Gel-electrophoresis is a technique used to produce and score markers. The method involves the movement of an electrically charged substance under the influence of an electric field. Molecules are separated according to size and electrical charge by applying an electric current to DNA. The current forces the molecules through a gel, for example agarose (made from seaweed) or polyacrylamide, and the gel can be made to allow separation of molecules of specific size and shape.

There are many types of molecular markers, and these reflect the developments over a number of years. Restriction Fragment Length Polymorphism (RFLP) involves DNA extracted from individuals being digested by a restriction enzyme (fragments of 2-10 kilobases, kb), followed by PCR, gel-electrophoresis, Southern blotting and hybridization of fragments to a locus-specific radiolabelled DNA probe. Visualization of the polymorphism is by radiography. RFLPs are abundant and randomly distributed throughout the genome, and are reproducible. The bands observed are interpreted in terms of loci and codominant alleles. However, generation of RFLPs is laborious and technically demanding because the approach is not amenable to automation.
Amplified Fragment Length Polymorphism (AFLP) also involves DNA digested by a restriction enzyme (80-500 kb), followed by ligation of oligonucleotide (20 bp) adapters to the fragments. Selective PCR follows, with separation of fragments by gel-electrophoresis. Again, AFLPs are abundant and randomly distributed throughout the genome, and are reproducible. Many informative bands can be obtained per reaction, and the approach is amenable to automation. The alleles found are dominant: if a band is present the allele is present, but absence does not specify what allele is in its place. Another negative is that purified, high molecular weight DNA is required.

Microsatellites or Simple Sequence Repeats (SSR) are tandemly repeated sequences, for example GCGCGCGC = (GC)4. Restriction enzymes are used together with PCR and primers obtained from genomic sequence libraries (databases). Separation of fragments is by gel-electrophoresis. SSRs are abundant and randomly distributed throughout the genome, highly polymorphic and reproducible. The bands can be interpreted in terms of loci and codominant alleles. Multiplexing (running many SSRs in a single gel) is possible and the approach is amenable to automation. Only low quantities of DNA are required, although primer development may be costly.

Diversity Array Technology (DArT) is a high-throughput genotyping technology based on a microarray platform. The process involves 20-120 slides, with 2,000-12,000 clones per slide. Each slide is hybridized with a target (representation) prepared from a line, and two images are obtained, a reference and the target. Clustering of each clone across multiple slides is used to determine polymorphism. This is a high-throughput technique, with clone libraries developed for spotting the array. The markers obtained are dominant.

The final marker discussed is the Single Nucleotide Polymorphism (SNP). A SNP is a DNA sequence variation occurring when a single nucleotide, adenine (A), thymine (T), cytosine (C) or guanine (G), in the genome is altered. For example, a SNP might change the nucleotide sequence AAGCCTA to AAGCTTA.

Figure 12.8 Population structures for field crops

SNPs make up 90% of all human genetic variation and occur every 100 to 300 bases. SNP chips are available with arrays of sizes 10K, 100K and 500K (and soon 1 million) for humans. This is a high-throughput approach and the markers are bi-allelic, but the scale of the data makes the question of analysis critical.

12.5.2 Checking markers: Segregation analysis

Each marker will have distinct alleles if it is polymorphic. The segregation ratios (that is, the relative proportions of the possible alleles or genotypes) at a locus depend on the type of population. For field crops, population types are given in Figure 12.8. Doubled haploid lines are formed by doubling a haploid and hence are homozygous at all loci. Recombinant inbred lines (RIL) are largely homozygous after sufficient generations. For various population types, the segregation ratios at a locus that has two possible alleles A1 and A2 are given in Table 12.2.

Table 12.2 Segregation ratios for population types

Population type   Codominant markers           Dominant markers
F2                1:2:1 (A1A1, A1A2, A2A2)     3:1 (A1-, A2A2)
Backcross         1:1 (Aa, aa)                 1:1 (Aa, aa)
RIL or DH         1:1 (A1A1, A2A2)             1:1 (A1A1, A2A2)

A simple chi-squared test can be used to assess whether there is segregation distortion, that is, a departure from the expected segregation ratio, for each marker. If a marker departs significantly from the expected ratio, a decision must be made on whether or not to use the marker in linkage analysis. As an example of segregation analysis, consider two markers in a doubled haploid population, presented in Table 12.3. This example will be used again in the discussion of linkage, and hence the full two-way table is given rather than just the margins for each marker.

Table 12.3 Segregation analysis for two markers in a doubled haploid population

                  Marker 2
Marker 1      B1      B2    Total
A1            35      35       70
A2            10      20       30
Total         45      55      100

The likelihood ratio chi-squared statistics, computed from the margins of Table 12.3, are 16.46 and 1.00 for markers 1 and 2 respectively. These should be compared with the chi-squared value of 3.841, the 5% critical value on 1 degree of freedom. Marker 1 shows significant segregation distortion, whereas marker 2 does not.
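These statistics are easily reproduced. The following is a minimal R sketch (the function name segregation.lrt is ours), comparing observed marker counts with the 1:1 ratio expected in a doubled haploid population:

    # Likelihood ratio test of 1:1 segregation for a biallelic marker
    segregation.lrt <- function(counts) {
      n <- sum(counts)
      expected <- rep(n / 2, 2)          # 1:1 ratio for a DH population
      2 * sum(counts * log(counts / expected))
    }
    segregation.lrt(c(70, 30))   # marker 1: 16.46
    segregation.lrt(c(45, 55))   # marker 2:  1.00
    qchisq(0.95, df = 1)         # 5% critical value: 3.841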

12.5.3 Checking individuals

Segregation analysis focuses on the markers. Checks on the pattern of markers for individuals are also advisable. For example, individuals that carry unusually few alleles from either parent may be in error, and individuals that are very similar to (perhaps identical with) or very distinct from the rest should also be examined for potential errors.

12.6 Linkage analysis

Linkage analysis is the term given to examining the connection between markers. Are two markers close to one another, or are they independent in their scores? For a particular crop the chromosomal structure is known. Where do the markers fit into this structure? Can we group the markers in such a way as to reflect the chromosomal structure, and order them within a chromosome? If we are able to answer these questions, we can construct a representation of the genome using molecular markers. Two markers are linked if they are on the same chromosome. So our aim is to assemble the markers into linkage groups that correspond to chromosomes. We begin by examining some simple cases.

12.6.1 Linkage between two markers

The basis for linkage analysis is crossing-over or recombination. Two markers are "close" to one another if few recombinations have occurred between them. Notice that this measure of 'distance' is in terms of recombination events and may not reflect the true physical distance, which may be measured in terms of base-pairs.

We begin with a discussion of recombination. Suppose two loci (or markers) are located at the end-points of an interval [a, b]; this interval represents the separation (in some sense, as we have not yet defined distance) of the two markers. Let N be the number of cross-overs or chiasmata on that interval, and assume chiasmata occur randomly on the interval. If θ_n is the probability that a gamete is a recombinant given there are n chiasmata on the interval, then θ_0 = 0. Furthermore, as chiasmata occur randomly, a difference equation for θ_n in terms of θ_{n−1} is given by

  θ_n = (1/2)θ_{n−1} + (1/2)(1 − θ_{n−1}) = 1/2,   n ≥ 1,

so that, given at least one chiasma, a gamete is a recombinant with probability 1/2. This is a conditional probability, and the unconditional probability of a recombinant is given by

  θ = Pr(recombinant)
    = Σ_{n=0}^∞ Pr(recombinant | N = n) Pr(N = n)
    = Σ_{n=0}^∞ θ_n Pr(N = n)
    = (1/2) Σ_{n=1}^∞ Pr(N = n)
    = (1/2) Pr(N > 0)
    = (1/2)(1 − Pr(N = 0))

which is known as Mather's formula. This shows that 0 ≤ θ ≤ 0.5. Genetic map distance is then defined as

  d = (1/2) E(N)

with the unit of distance being the Morgan or centi-Morgan (cM), in honour of Thomas Hunt Morgan; 1 cM corresponds to 1 recombination event per 100 gametes. If the number of chiasmata follows a Poisson distribution, N ∼ Po(λ), then Pr(N = 0) = e^{−λ} and

  θ = (1/2)(1 − e^{−λ}).

Now λ = E(N) = 2d, so that

  θ = (1/2)(1 − e^{−2d}).

Inverting this relationship results in Haldane's mapping function

  d = −(1/2) log(1 − 2θ).

Notice that if θ = 0 then d = 0, and as θ → 0.5, d → ∞. Thus small values of θ suggest that markers are close in terms of recombination events, while θ = 0.5 corresponds to two markers that are unlinked. There are other mapping functions, for example Kosambi's mapping function. ? provide a review of mapping functions and their links to renewal processes.

Linkage is equivalent to statistical association or dependence. The two-way table of probabilities of the possible marker pairs, A1B1, A1B2, A2B1 and A2B2, is given in Table 12.4. Zero or an even number of cross-overs have occurred if the genotype is A1B1 or A2B2. If an odd number of cross-overs has occurred, we will observe either A1B2 or A2B1 and a recombination has occurred. Thus θ = p_{A1B2} + p_{A2B1}.

Table 12.4 Two-way table of probabilities for markers 1 and 2

                    Marker 2
Marker 1     B1          B2          Total
A1           p_{A1B1}    p_{A1B2}    p_{A1}
A2           p_{A2B1}    p_{A2B2}    p_{A2}
Total        p_{B1}      p_{B2}      1
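Before moving to tests of association, the mapping function derived above is easy to compute. A minimal R sketch (the function names haldane.d and haldane.theta are ours):

    # Haldane's mapping function: recombination fraction to distance (Morgans) ...
    haldane.d <- function(theta) -0.5 * log(1 - 2 * theta)
    # ... and its inverse: distance (Morgans) to recombination fraction
    haldane.theta <- function(d) 0.5 * (1 - exp(-2 * d))

    100 * haldane.d(0.10)             # theta = 0.10 corresponds to about 11.2 cM
    haldane.theta(haldane.d(0.25))    # recovers 0.25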

We would like to provide a measure of association between two categorical variables using the marker data across individuals. In statistical terms, the first thing to do is to form a contingency table. Thus if we have alleles A and a for marker 1 and B and b for marker 2 in a doubled haploid population, we form the table of counts as given, for example, by Table 12.3. A test of independence of the two categorical variables, that is, of the two markers, is a standard statistical procedure. The null hypothesis of independence is H0 : p_{AiBj} = p(Ai)p(Bj). The likelihood ratio test of independence for Table 12.3 is given by

  −2 log(Λ) = 2(35 log(35/31.5) + 35 log(35/38.5) + 10 log(10/13.5) + 20 log(20/16.5)) = 2.40

which should be compared to a chi-squared value on 1 degree of freedom (3.841 at the 5% level). Thus we retain independence, and the two markers are not significantly linked. We can estimate θ under the independence assumption. We have

  θ = p(A1B2) + p(A2B1)
    = p(A1)p(B2) + p(A2)p(B1)    (12.6.1)

The intuitive estimate of p(A1) is 0.7, of p(A2) is 0.3, and so on. This leads to the estimate θ̂ = 0.52, which is outside the range of permissible values. What has gone wrong? For doubled haploid lines we expect marker allele probabilities p(A1) = p(A2) = 0.5, with the same result for the other marker. In this case θ = 0.5 using (12.6.1), which is what we would expect if the markers are unlinked. The likelihood ratio test for independence under these restrictions is

  −2 log(Λ) = 2(35 log(35/25) + 35 log(35/25) + 10 log(10/25) + 20 log(20/25)) = 19.86

which looks large and suggests we reject independence, and hence that the two markers are linked. Note, however, that we are now testing both for equal segregation ratios for each marker and for independence, which gives 3 degrees of freedom (5% value of 7.815). In fact this final statistic is the sum of the two statistics for testing equal segregation and the statistic for testing independence, that is,

  19.86 = 16.46 + 1.00 + 2.40.

Segregation distortion is therefore the culprit in terms of clouding the issue. The intuitive estimate of the recombination fraction is in fact

  θ̂ = (35 + 10)/100 = 0.45

and we can test H0 : θ = 0.5 using a simpler table.

Table 12.5 Recombination in a doubled haploid population

Type              Count         Probability
Recombinant       r = 45        θ
Non-recombinant   n − r = 55    1 − θ
Total             n = 100       1

There are two possible outcomes, and hence estimation and a test based on the binomial distribution are appropriate. The likelihood is (omitting constants)

  L(θ) = θ^r (1 − θ)^{n−r}.

Taking logarithms, we have

  log L(θ) = r log(θ) + (n − r) log(1 − θ).

We choose the estimate of θ which maximizes this function. If we graph the function for 0 ≤ θ ≤ 0.5 we find a single maximum at r/n. Alternatively, differentiating and equating to zero, we find

  r/θ − (n − r)/(1 − θ) = 0

and solving gives

  θ̂ = r/n.

This reduces to 45/100 = 0.45 for our data, and hence is the same as our intuitive estimate.

The likelihood ratio statistic for H0 is

  Λ = (0.5^r 0.5^{n−r}) / (0.45^r 0.55^{n−r})

for r = 45 and n = 100. Hence −2 log(Λ) = 1.00 and we retain the hypothesis that the markers are unlinked. The traditional statistic used in this area is not the likelihood ratio statistic but a related statistic called the LOD (log-odds) score, in which the logarithm is taken to base 10. In fact, in testing H0 : θ = 0.5,

  LOD = 45 log10(0.45) + 55 log10(0.55) − 45 log10(0.5) − 55 log10(0.5) = 0.22

and the traditional cutoff for significance is LOD = 3 (p-value 0.0002). This very stringent threshold reflects the multiple testing involved in assembling a full linkage map.
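These calculations can be sketched in R as follows (a minimal illustration; the function name recomb.test is ours):

    # Estimate the recombination fraction in a DH population and compute the
    # likelihood ratio statistic and LOD score for H0: theta = 0.5
    recomb.test <- function(r, n) {
      theta.hat <- r / n
      logL <- function(theta) r * log(theta) + (n - r) * log(1 - theta)
      lrt <- 2 * (logL(theta.hat) - logL(0.5))
      lod <- lrt / (2 * log(10))   # the LOD is the LR statistic on the log10 scale
      c(theta.hat = theta.hat, lrt = lrt, lod = lod)
    }
    recomb.test(45, 100)   # theta.hat = 0.45, lrt = 1.00, lod = 0.22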

12.6.2 Linkage between three markers

Having three markers introduces the problem of order. Consider the data presented in Table 12.6. Assuming the markers are linked, what is the most likely order: 123, 231 or 312? Notice that 321, 132 and 213 are equivalent to these three.

Table 12.6 Three markers

                         Marker 3
Marker 1   Marker 2      C      c    Total
A          B            47     16       63
A          b             2      5        7
a          B             7      1        8
a          b            12     60       72
Total                   68     82      150

An obvious approach to determining the best order is to evaluate the likelihood for the three possible orders. This will not be an option when a large number of markers must be ordered, but the principles are best covered in this simple case. The expressions in Table 12.7 involve only pairwise recombination fractions, and we have seen how to estimate these. The pairwise estimates are θ̂12 = 0.10, θ̂13 = 0.27 and θ̂23 = 0.21. The corresponding distances (using Haldane's mapping function) are d12 = 11, d13 = 39 and d23 = 27 cM. This suggests that 123 is the most likely order. Notice that the pairwise estimates of recombination fractions "do not add up", and nor do the distances.

Table 12.7 Three markers: order 123

Genotype   Probability                   Frequency
ABC        (1/2)(1 − θ12)(1 − θ23)           47
ABc        (1/2)(1 − θ12)θ23                 16
AbC        (1/2)θ12θ23                        2
Abc        (1/2)θ12(1 − θ23)                  5
aBC        (1/2)θ12(1 − θ23)                  7
aBc        (1/2)θ12θ23                        1
abC        (1/2)(1 − θ12)θ23                 12
abc        (1/2)(1 − θ12)(1 − θ23)           60

The log-likelihoods for each order are given in Table 12.8, and confirm that the order 123 is the best. Notice that the double recombinants, in this case AbC and aBc, are then the least frequently occurring genotypes.

Table 12.8 Log-likelihoods for the three possible orders

Order   log-likelihood
123          -125.19
231          -163.41
312          -135.75
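The values in Table 12.8 can be reproduced directly from the counts in Table 12.6. In the R sketch below (a minimal illustration; the function names are ours), the middle marker determines the two intervals, pairwise recombination fractions are estimated from the counts, and the constant factor 1/2 per observation is omitted, which appears to be the convention used in Table 12.8:

    # Genotype counts from Table 12.6, ordered ABC, ABc, AbC, Abc, aBC, aBc, abC, abc
    counts <- c(47, 16, 2, 5, 7, 1, 12, 60)
    # marker scores (+1/-1) for markers 1, 2, 3 in each of the 8 genotype classes
    scores <- cbind(m1 = c( 1,  1,  1,  1, -1, -1, -1, -1),
                    m2 = c( 1,  1, -1, -1,  1,  1, -1, -1),
                    m3 = c( 1, -1,  1, -1,  1, -1,  1, -1))
    # pairwise recombination fraction between markers i and j
    theta.hat <- function(i, j) sum(counts[scores[, i] != scores[, j]]) / sum(counts)
    # log-likelihood when marker 'middle' is the central marker
    order.loglik <- function(middle) {
      flank <- setdiff(1:3, middle)
      p <- function(i, j) ifelse(scores[, i] != scores[, j],
                                 theta.hat(i, j), 1 - theta.hat(i, j))
      sum(counts * log(p(flank[1], middle) * p(middle, flank[2])))
    }
    sapply(1:3, order.loglik)
    # -135.75 (order 312), -125.19 (order 123), -163.41 (order 231)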

The pairwise recombination fractions do not add up because double cross-overs between marker 1 and marker 3 are ignored in estimating θ13. In fact θ13 can be calculated as

  θ13 = θ12(1 − θ23) + (1 − θ12)θ23
      = θ12 + θ23 − 2θ12θ23

which is called Trow's formula (?). There is an alternative form given by

  (1 − 2θ13) = (1 − 2θ12)(1 − 2θ23)

which makes clear why Haldane's mapping function leads to an additive distance measure. There is a theory that the presence of one chiasma reduces the probability of another chiasma in the near vicinity. This is termed interference. Then

  θ13 = θ12(1 − kθ23) + (1 − kθ12)θ23

where k < 1, and Trow's formula becomes

  θ13 = θ12 + θ23 − 2kθ12θ23.    (12.6.2)

The alternative form becomes

  (1 − 2θ13) = (1 − 2θ12)(1 − 2θ23) − 4(1 − k)θ12θ23

so that the simple product form, and hence additivity of distances, no longer holds when k < 1. Using data to estimate k for each order of three markers results in maximized log-likelihoods that are all equal. The underlying biological mechanism for interference is not understood. It is thought that recombinations tend to occur in hot spots, and some regions experience very few if any recombinations.
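The additivity under no interference (k = 1) is easy to verify numerically. A minimal R sketch, using the estimates from Table 12.6 and the haldane.d function defined earlier (the function name trow is ours):

    # Trow's formula with interference parameter k (k = 1: no interference)
    trow <- function(t12, t23, k = 1) t12 + t23 - 2 * k * t12 * t23
    t12 <- 15 / 150; t23 <- 31 / 150        # estimates from Table 12.6
    haldane.d(trow(t12, t23))               # 0.3782 Morgans ...
    haldane.d(t12) + haldane.d(t23)         # ... identical: distances are additive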

12.6.3 Linkage or genetic map construction

Ordering three markers illustrates the problem, but in reality things are much more complex. Firstly, we may have several hundred markers. These markers need to be organised into linkage groups, preferably with the number of such groups corresponding to the known chromosome number of the organism. Having formed linkage groups, the order of the markers and their separation in terms of recombination fraction, and hence distance, must be determined. Grouping markers into linkage groups involves deciding on the level of linkage required, and in one sense is a clustering problem. Ordering markers within a linkage group has been recognised as a travelling salesman problem, a combinatorial optimisation problem with a large number of possible orders. For large numbers of markers the number of orders is huge: for m markers it is m!/2. There are many algorithms available, including seriation, simulated annealing, branch and bound, the so-called genetic algorithm, and the Tabu search. Methods based on likelihood and the bootstrap have also been proposed. Whatever method is used, there is no guarantee of an optimal solution. Software is available for linkage map construction. There are a large number of programs (see http://linkage.rockefeller.edu/soft/list.html), and the quality varies considerably. An example of a genetic linkage map is given in Figure 12.9, while an example of a physical map is given in Figure 12.10. Physical maps require sequence information and provide detail on how much DNA separates two or more genes (measured in base-pairs). A physical map can be used with a genetic linkage map to anchor that linkage map.

12.7 QTL analysis

A major aim of molecular genetics is to find genes that control important traits or characteristics of plants, animals and human beings. For example, yield of wheat is a very important economic trait: what is the genetic basis for high yield? Marbling in beef cattle is very important for the Japanese market: what is the genetic basis of marbling? Heart disease or diabetes may have a genetic component: can we determine which genes make people susceptible to such diseases?

Figure 12.9 Example of a linkage map: Wheat

Figure 12.10 Example of a physical map: Barley

Figure 12.11 Markers linked to a gene, and a perfect marker


QTL mapping involves finding the location and size of QTL effects for a trait or traits. It is a first step towards possible determination of a gene that may influence expression of the trait. Figure 12.11 provides a graphical representation. QTL detection can lead to marker assisted selection of lines in plant breeding programmes. Sound QTL information can lead to better agricultural outcomes and products. Farmers benefit, with improvements in the food chain, in health and well-being, and finally in export income. The focus in this chapter is on whole genome interval mapping in the regression setting. Other approaches are presented as background to this more comprehensive approach. There is no attempt to compare methods in the chapter; the whole genome method has been compared with the best available methods in ?, and has been shown to be a powerful approach for QTL analysis.

12.7.1 The Data

There are three sources of data in QTL analysis. The first source is data available on the material of interest, which we call the genotypes (for example varieties of wheat or barley): measured or scored traits of interest, together with the experimental or study design and any management variables that might impact on the traits of interest. This phenotypic data is represented by (y, X, Z, Z_g), where y is the n × 1 vector of trait values on the n observational units, and X and Z are n × t and n × b matrices of fixed effect and random effect variates and factors; each row specifies the corresponding values for the ith observation. The matrix Z_g is an n × l binary matrix that relates the observed data to each of the l genotypes.

Let M^g be the l × m matrix of m marker scores on the l genotypes. This is the second source of data and provides genetic information on each of the lines. Interval mapping methods rely on the availability of a linkage map for the population, that is, an ordered set of markers arranged in the chromosomal structure for the organism (for example wheat or barley), together with estimated recombination fractions between adjacent markers. Let c denote the number of chromosomes and m_k denote the number of markers on chromosome k; the number of markers generally varies across chromosomes. Let M^g_k denote the matrix of scored markers on chromosome k. Then the matrix of marker scores can be ordered as M^g = [M^g_1 M^g_2 . . . M^g_c] and m = m_1 + · · · + m_c. For genotype i, m^g_{k;ij} and m^g_{k;i,j+1} will denote the jth and (j+1)th marker scores for a pair of adjacent markers on chromosome k; this is the jth interval on chromosome k. The vectors of these scores across all lines will be denoted by m^g_{k;j} and m^g_{k;j+1} respectively, and are columns j and j + 1 of M^g_k.

The third source of data is the linkage map itself, which consists of a list of the ordered markers together with the recombination fractions between neighbouring or flanking markers. Let θ_{k;j,j+1} denote the recombination fraction between markers j and j + 1 (the jth interval) on chromosome k. The genetic distance for interval j, d_{k;j,j+1}, will be based on Haldane's distance,

  d_{k;j,j+1} = −(1/2) log(1 − 2θ_{k;j,j+1})    (12.7.3)

although other distance measures could be used. The calculations used in interval mapping assume that recombination events occur at random along the genome, and this corresponds to this distance measure. Missing trait data is usually estimated in the analysis, together with the effects of interest. Missing marker scores are handled as in ?.

12.7.2 Statistical model

The general statistical model that forms the basis of analyses is (see ? for details)

  y = Xτ + Z_g g + Zu + e    (12.7.4)

where τ is a t × 1 vector of fixed effects, u is a b × 1 vector of random effects assumed N(0, σ²G(γ)), and e is the residual vector, assumed N(0, σ²R(φ)). The latter two vectors are assumed mutually independent. The fixed, random and residual terms reflect the design and conduct of the trial, and as such provide the underlying structure for non-genetic variation through the associated parameters, namely τ and the parameters γ and φ of the covariance matrices.

12.7.3 Genetic model

The total genetic effect for genotype i = 1, 2, . . . , l will be denoted by gi and the vector of these effects by g. This vector of genotypic effects is of prime interest and is decomposed as follows. If we have a single QTL,

gi = qia + pi (12.7.5) where a represents the size of the QTL (on the scale of the trait), qi is un- known, but is either −1 or 1 for doubled haploid lines depending on the parental allele at the QTL (?), and pi is the residual or polygenic effect, as- 2 sumed to be distributed N(0, σ γg). There are many unknown components in (12.7.5). Firstly we do not know the location of the QTL, nor do we know the size of the effect (or difference between the allelic expressions), and lastly, we do not know the appropriate QTL allele for each genotype. The linkage map and marker scores provide information to estimate aspects of these three components.

12.7.4 Single marker methods

Consider a single marker, M. Is this marker, and hence its alleles, associated with differences in trait expression? Figure 12.12 depicts the genetic situation, with the QTL, Q, and the marker assumed to have recombination fraction θ between them.

Figure 12.12 Single marker and a QTL

A Punnett square can be formed as presented in Table 12.9. The entries are the probability that the QTL score is −1 or 1 given the marker score and the recombination frequency θ.

Table 12.9 Punnett square for a single marker and QTL

                           QTL q_i
Marker x_i    −1              1               Total
−1            (1/2)(1 − θ)    (1/2)θ          0.5
1             (1/2)θ          (1/2)(1 − θ)    0.5

Conditional probabilities such as Pr(q_i = −1 | x_i = −1) are found by dividing the entries in a row by the row total; this simply removes the factor 0.5, so that, for example, Pr(q_i = −1 | x_i = −1) = 1 − θ.

12.7.5 Single marker: Likelihood approach

Now g_i = q_i a + p_i and, as p_i ∼ N(0, σ_g²), we have

  g_i | q_i ∼ N(q_i a, σ_g²).

We now condition on x_i and "eliminate" q_i. Thus, for example, if f(·) denotes a probability density function,

  f(g_i | x_i = −1) = f(g_i | x_i = −1, q_i = −1) Pr(q_i = −1 | x_i = −1)
                    + f(g_i | x_i = −1, q_i = 1) Pr(q_i = 1 | x_i = −1)

    = (1 − θ) f(g_i | q_i = −1) + θ f(g_i | q_i = 1)

which is a mixture of two normal distributions, one with mean −a (q_i = −1) and one with mean a (q_i = 1). Similarly,

  f(g_i | x_i = 1) = θ f(g_i | q_i = −1) + (1 − θ) f(g_i | q_i = 1)

so that the mixture distribution given x_i can be written compactly as

  f(g_i | x_i) = (1 − θ) f(g_i | q_i = x_i) + θ f(g_i | q_i = −x_i).

This result is central to likelihood based methods of estimation as introduced by ?. We need to estimate θ, a and σ_g². In simple situations this mixture distribution translates to a simple mixture for the trait observations. However, with field and laboratory data, the mixture is but part of what can be a complex model, involving additional unknown parameters that must also be estimated. Every marker can be examined for association with the trait. This provides a putative QTL size, a location as given by the marker, and an assessment of the strength of the association using a LOD score. No association implies θ = 0.5 or a = 0; in fact this is a non-standard test because the null hypothesis 'loses' one parameter (see ?). Permutation tests have been used to assess significance, but these are time-consuming and perhaps not practical for complex situations. The likelihood ratio/LOD score approach provides a ranking.

12.7.6 Single marker: Regression approach

The regression approach (?, ?) for a single marker involves replacing qi by its expected value given the marker. Thus

  E(q_i | x_i = −1) = (−1)(1 − θ) + (1)θ = −(1 − 2θ)

and similarly E(q_i | x_i = 1) = (1 − 2θ), so that

  E(q_i | x_i) = (1 − 2θ) x_i.

Our genetic model becomes

  g_i = (1 − 2θ) a x_i + p_i

      = β x_i + p_i

which is a regression of g_i on x_i. Thus we can estimate β using regression methods, or in more complex situations using mixed model methods. Note however that, unlike the likelihood approach, we can only estimate β = (1 − 2θ)a, and hence we cannot separate the location (θ) from the size (a) of the QTL. Thus if a marker is associated with the trait, it may be because a small QTL is close by or because a larger QTL is further away. The regression approach can be used in the same way as the likelihood approach by considering each marker in turn.
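A small simulation illustrates the attenuation β = (1 − 2θ)a. This is a minimal R sketch under assumed values θ = 0.2 and a = 1 (all names are ours):

    set.seed(1)
    n <- 5000; theta <- 0.2; a <- 1
    q <- sample(c(-1, 1), n, replace = TRUE)   # unobserved QTL genotypes
    recomb <- rbinom(n, 1, theta)              # recombination between marker and QTL
    x <- ifelse(recomb == 1, -q, q)            # observed marker scores
    g <- q * a + rnorm(n, sd = 0.5)            # genetic effect plus polygenic noise
    coef(lm(g ~ x))["x"]                       # close to (1 - 2 * theta) * a = 0.6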

12.7.7 Single marker: Shortcomings

The obvious shortcomings of the single marker methods are the difficulty of determining the location of a QTL (despite being able to estimate θ using likelihood methods), the piecemeal nature of the analysis (estimation at each marker), the multiple testing problem (hence LOD scores of 3 or more), and the impact of other QTL on the assessment of a particular QTL.

12.7.8 Interval mapping

The obvious extension of single marker methods is to use intervals as defined by the linkage map. Consider then Figure 12.13, where we have two markers defining an interval, the left flanking marker M_L and the right flanking marker M_R, with a putative QTL Q lying in the interval. The recombination fractions are θ_{LR} between the left and right flanking markers, and θ_{LQ} and θ_{QR} between the left marker and the QTL, and between the QTL and the right marker, respectively. The Punnett square for this situation is given in Table 12.10, and again conditional probabilities are found by dividing the values in rows by their row totals.


Figure 12.13 Interval mapping: two markers, QTL and recombination frequencies 232 THE ANALYSIS OF QUANTITATIVE TRAIT LOCI

Table 12.10 Punnett square for two markers defining an interval and a QTL in the interval

                                QTL q_i
x_{Li}   x_{Ri}   −1                                1                                 Total
−1       −1       (1/4)(1 − θ_{LQ})(1 − θ_{QR})     (1/4)θ_{LQ}θ_{QR}                 (1/4)(1 − θ_{LR})
−1        1       (1/4)(1 − θ_{LQ})θ_{QR}           (1/4)θ_{LQ}(1 − θ_{QR})           (1/4)θ_{LR}
 1       −1       (1/4)θ_{LQ}(1 − θ_{QR})           (1/4)(1 − θ_{LQ})θ_{QR}           (1/4)θ_{LR}
 1        1       (1/4)θ_{LQ}θ_{QR}                 (1/4)(1 − θ_{LQ})(1 − θ_{QR})     (1/4)(1 − θ_{LR})

12.7.9 Interval mapping: Likelihood

We now condition on the left and right flanking markers. Thus, using the same approach as for a single marker, we find

  f(g_i | x_{Li} = −1, x_{Ri} = −1) = [(1 − θ_{LQ})(1 − θ_{QR})/(1 − θ_{LR})] f(g_i | q_i = −1)
                                    + [θ_{LQ}θ_{QR}/(1 − θ_{LR})] f(g_i | q_i = 1)

with similar results for the other three combinations of values of the left and right markers. Thus again we have a mixture of normal distributions. Typically, the genome is scanned at regular intervals; the mixing probabilities are calculated and combined to allow the size to be estimated at each location. This eliminates the need to use mixture methods, as normal theory results then hold. Again, complex designs and processes make this approach time-consuming and involve re-estimation of non-genetic parameters.

12.8 Interval mapping: The Regression Approach

The regression approach for QTL analysis (?, ?) is used for the remainder of this chapter; general notation is introduced to allow the whole genome approach to be developed. A fundamental reason for using regression methods is the ability to include easily additional sources of variation in the model, both fixed and random effects; hence the models discussed are linear mixed models. Much of the remaining material is presented by ?. The approach presented below builds on the method of ?, as extended by ? and ?, and is based on a single environment trial for doubled haploid or recombinant inbred populations in field crops. However, the ideas and methods are generally applicable to other population structures.

The regression method for interval mapping follows the single marker approach and involves replacing q_i by its expected value given the flanking markers that define the interval being examined. Consider chromosome k. Let θ_{k;j} denote the recombination fraction between marker j and the putative QTL in the jth interval on chromosome k, and θ*_{k;j} denote the recombination fraction between the putative QTL in the jth interval and marker j + 1. Thus 0 ≤ θ_{k;j}, θ*_{k;j} ≤ θ_{k;j,j+1}, and a form of Trow's formula (?), (1 − 2θ_{k;j})(1 − 2θ*_{k;j}) = 1 − 2θ_{k;j,j+1}, connects these recombination fractions; this notation extends that of (12.6.2). ? show that for the functions

  λ_{k;jj} = λ_{k;jj}(θ_{k;j}; θ_{k;j,j+1}) = (1 − θ_{k;j,j+1} − θ_{k;j})(θ_{k;j,j+1} − θ_{k;j}) / [θ_{k;j,j+1}(1 − θ_{k;j,j+1})(1 − 2θ_{k;j})]

  λ_{k;j+1,j} = λ_{k;j+1,j}(θ_{k;j}; θ_{k;j,j+1}) = θ_{k;j}(1 − θ_{k;j})(1 − 2θ_{k;j,j+1}) / [θ_{k;j,j+1}(1 − θ_{k;j,j+1})(1 − 2θ_{k;j})]

the conditional expectation of the QTL genotype given the two flanking markers is

  E(q_i | m^g_{k;ij}, m^g_{k;i,j+1}, θ_{k;j,j+1}) = m^g_{k;ij} λ_{k;jj} + m^g_{k;i,j+1} λ_{k;j+1,j}    (12.8.6)

Note that 0 ≤ λ_{k;jj} ≤ 1 and 0 ≤ λ_{k;j+1,j} ≤ 1. At this point, a change of notation for the size of the QTL effect is appropriate. Interval analysis implicitly assumes there may be a QTL in every interval. Thus we let a_{k;j} denote the size of a putative QTL in the jth interval on chromosome k, and replace a by a_{k;j}. Applying (12.8.6) in (12.7.5), we have in vector form

  g = (m^g_{k;j} λ_{k;jj} + m^g_{k;j+1} λ_{k;j+1,j}) a_{k;j} + p
    = m^g_{k;j} α_{k;jj} + m^g_{k;j+1} α_{k;j+1,j} + p    (12.8.7)

where α_{k;jj} = λ_{k;jj} a_{k;j} and α_{k;j+1,j} = λ_{k;j+1,j} a_{k;j}. In (12.8.7), the subscripts k;j and k;j+1 indicate the interval being examined, and hence there are only two regression parameters in each fit of flanking markers across the genome. The full model for the analysis of interval j on chromosome k is

  y = Xτ + Z_g M^g_{k;j} α_{k;j} + Z_g p + Zu + e    (12.8.8)

where M^g_{k;j} = [m^g_{k;j} m^g_{k;j+1}] and α_{k;j} = [α_{k;jj} α_{k;j+1,j}]^T. This model is fitted for each k and each appropriate j, so that τ is re-estimated, as are the parameters associated with p, u and e, namely σ², γ_g, γ and φ.
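The weights λ_{k;jj} and λ_{k;j+1,j} are easily computed, and can be verified numerically against the quadratic constraint derived below at (12.8.10). A minimal R sketch (the function name lambdas is ours):

    # lambda weights for a putative QTL at recombination fraction theta.L from
    # the left marker, in an interval with flanking recombination fraction theta.LR
    lambdas <- function(theta.L, theta.LR) {
      denom <- theta.LR * (1 - theta.LR) * (1 - 2 * theta.L)
      c(left  = (1 - theta.LR - theta.L) * (theta.LR - theta.L) / denom,
        right = theta.L * (1 - theta.L) * (1 - 2 * theta.LR) / denom)
    }
    l <- lambdas(0.05, 0.18)   # QTL close to the left marker: 0.754, 0.229
    phi <- 0.5 * ((1 - 2 * 0.18) + 1 / (1 - 2 * 0.18))
    l["left"]^2 + 2 * phi * l["left"] * l["right"] + l["right"]^2   # equals 1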

Constraints

? show that

  a²_{k;j} = (α_{k;jj} + (1 − 2θ_{k;j,j+1})α_{k;j+1,j})(α_{k;j+1,j} + (1 − 2θ_{k;j,j+1})α_{k;jj}) / (1 − 2θ_{k;j,j+1}).

This equation indicates that αk;jj and αk;j+1,j must be of the same sign. Substituting for αk;jj and αk;j+1,j, we find

  (λ_{k;jj} + (1 − 2θ_{k;j,j+1})λ_{k;j+1,j})(λ_{k;j+1,j} + (1 − 2θ_{k;j,j+1})λ_{k;jj}) = 1 − 2θ_{k;j,j+1}    (12.8.9)

In fact this is Trow's formula in terms of λ_{k;jj} and λ_{k;j+1,j} rather than θ_{k;j} and θ*_{k;j}. Equation (12.8.9) is important because it provides a simple quadratic constraint on λ_{k;jj} and λ_{k;j+1,j} for each j. It can be written as

  λ²_{k;jj} + 2φ_{k;j,j+1} λ_{k;jj} λ_{k;j+1,j} + λ²_{k;j+1,j} = 1    (12.8.10)

where

  φ_{k;j,j+1} = (1/2)[(1 − 2θ_{k;j,j+1}) + 1/(1 − 2θ_{k;j,j+1})] ≥ 1    (12.8.11)

The simplicity of (12.8.10) means that we can express λk;j+1,j as

  λ_{k;j+1,j} = −φ_{k;j,j+1} λ_{k;jj} + √(1 + (φ²_{k;j,j+1} − 1)λ²_{k;jj})    (12.8.12)

12.8.1 Interval mapping: Shortcomings

Like single marker methods, scanning the genome is a piecemeal approach, with multiple testing issues, issues in the selection of QTL in the presence of other QTL, and repeated re-estimation of non-genetic parameters. In addition, a single ghost QTL can appear if two QTL occur in close proximity in coupling.

12.8.2 Composite Interval Mapping

To overcome the impact of other QTL on the detection of a specific QTL, the introduction of markers into the genetic model was proposed; see ?, ? and ?. The approach is called Composite Interval Mapping (CIM). The genetic model becomes

  g_i = q_i a + Σ_{j=1}^{q} c_{ij} β_j + p_i

where the c_{ij} are the marker scores on genotype i for marker j. Single marker or interval mapping methods can now be applied to the leading term. The introduction of the additional markers is an attempt to allow for background genetic effects. While this is a good idea, neither the choice of the number of markers, q, nor their location is obvious. However, there is a clear impact on the significance and the clarity of location when using CIM.

12.8.3 Other approaches

The literature on QTL analysis is enormous. Bayesian methods have been proposed using Markov chain Monte Carlo (MCMC). ? review various procedures and suggest that QTL determination is a model selection problem; they conduct a simulation study to examine the properties of various methods and recommend an approach based on MCMC and a Bayesian Information Criterion (BIC). There have been suggestions of simultaneous use of the full linkage map in marker assisted selection by ?, and ? use all markers in a mixed model to attempt to locate QTL. More recently, an approach using markers and the LASSO (Tibshirani (1996)) has been proposed by ? in an unpublished PhD thesis. These approaches use all markers as random effects.

12.8.4 Summary

Genome scans provide a piecemeal approach to analysis, and have generated extensions such as CIM and MQM that try to incorporate background genetic variation in the model. Using all markers has been proposed by other researchers; potentially this approach accommodates background genetic variation in a natural way. However, QTLs are not in general located at markers, and the fixed effect status of QTLs appears reasonable. Thus we examine a method, based on interval mapping, that uses the whole genome in a working model and ultimately provides QTLs as fixed effects.

12.9 Whole genome interval mapping

A key step forward in reformulating QTL analysis is to view the process from a selection point of view. Each interval may contain a QTL, and our aim is to assemble the evidence for each interval, rank this evidence, and ultimately select intervals on the basis of the strength of the evidence. The conventional approach in QTL analysis is via fixed effects, a test of significance on the size a_{k;j} in interval j of chromosome k, and a LOD score. As in conventional interval mapping, we assume every interval may contain a QTL. Unlike interval mapping, however, we include all intervals in a single analysis, and assume a working model in which the sizes of QTLs across the genome are a simple random effect.

Working statistical model

The genetic model we use is

  g_i = Σ_{k=1}^{c} Σ_{j=1}^{m_k−1} q_{k;ij} a_{k;j} + p_i    (12.9.13)

where q_{k;ij} is the indicator of parental type (or allele number) of a putative QTL in the jth interval on chromosome k, and a_{k;j} is the size of a putative QTL in that interval. As a working model we assume a_{k;j} ∼ N(0, σ²γ_a). Thus we assume a QTL may exist in every interval, and the presence of a significant QTL variance γ_a suggests that at least one QTL may be present. We do not believe the working model reflects reality as far as QTL effects are concerned. It is a vehicle that allows the detection of putative QTLs; the random effects assumption behaves like a penalty, as in ridge regression (?). We return to detection below. As for interval mapping, we replace q_{k;ij} by its expected value given markers

j and j + 1 on chromosome k. Thus, using (12.8.7), we have

  g = Σ_{k=1}^{c} Σ_{j=1}^{m_k−1} (m^g_{k;j} λ_{k;jj} + m^g_{k;j+1} λ_{k;j+1,j}) a_{k;j} + p    (12.9.14)

This can be written as

  g = M^g Λ a + p    (12.9.15)

where M^g is the l × m matrix of marker scores for each line and all m markers, a is the vector of sizes of effects, and Λ is a block diagonal matrix of size m × (m − c), with kth block

  Λ_k = [ λ_{k;11}    0           0          ...   0
          λ_{k;21}    λ_{k;22}    0          ...   0
          0           λ_{k;32}    λ_{k;33}   ...   0
          ...         ...         ...        ...   ...
          0           0           0          ...   λ_{k;m_k−1,m_k−1}
          0           0           0          ...   λ_{k;m_k,m_k−1} ]    (12.9.16)

so that the blocks correspond to chromosomes. Given Λ, under (12.9.15), g follows a normal distribution with mean vector 0 and variance matrix given by

  var(g | Λ) = σ²(γ_a M^g Λ Λ^T M^{gT} + γ_g I_l)    (12.9.17)

Thus the variance matrix is of factor analytic form; this type of structure arises elsewhere (?). The assumption of random QTL sizes moves the λ_{k;jj} into the variance matrix. There are m − c + 1 distinct parameters in this variance matrix, namely the m − c values λ_{k;jj} (the λ_{k;j+1,j} are functions of the λ_{k;jj}) and γ_g. The full model for the trait is based on (12.7.4) and (12.9.15). The distribution of y given Λ is

  y ∼ N(Xτ, σ²H)    (12.9.18)

where, if M = Z_g M^g is the full matrix of marker scores for the data set,

  H = R + γ_a M Λ Λ^T M^T + γ_g Z_g Z_g^T + Z G Z^T.

If a QTL is at a marker, the matrix Λ will contain two identical columns, each with a single unit entry. The model is then singular in terms of the sizes of the effects in the two adjoining intervals: they coincide, and obviously both cannot be estimated.
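Assembling a chromosome block of Λ from the interval weights is straightforward. A minimal R sketch (the function name make.Lambda is ours), taking the vectors of λ_{k;jj} and λ_{k;j+1,j} for the m_k − 1 intervals of one chromosome:

    # Build the m_k x (m_k - 1) block Lambda_k of (12.9.16) for one chromosome
    make.Lambda <- function(lambda.jj, lambda.j1j) {
      n.int <- length(lambda.jj)                      # number of intervals, m_k - 1
      L <- matrix(0, n.int + 1, n.int)
      L[cbind(1:n.int, 1:n.int)] <- lambda.jj         # lambda_{k;jj} on the diagonal
      L[cbind(2:(n.int + 1), 1:n.int)] <- lambda.j1j  # lambda_{k;j+1,j} below it
      L
    }
    make.Lambda(c(0.49, 0.75), c(0.49, 0.23))   # a three-marker chromosome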

Estimation

The model presented above for QTL analysis is a mixed model, and the models to follow also fall into this class. Thus residual maximum likelihood or REML (?) and best linear unbiased prediction or BLUP (?) are appropriate methods for analysis (see ?). The main advantage of the model presented is that all markers are used simultaneously in one analysis. Thus the benefits of composite interval mapping are built into the approach, and polygenic and non-genetic effects are estimated allowing for all possible QTL. There are m − c unknown parameters in Λ because of the constraints on the λ_{k;j+1,j} given by (12.8.12). Calculations under REML require derivatives of H with respect to the λ_{k;jj}. The derivative of λ_{k;j+1,j}, found by differentiating (12.8.10) implicitly, is

  ∂λ_{k;j+1,j}/∂λ_{k;jj} = −(λ_{k;jj} + φ_{k;j,j+1} λ_{k;j+1,j}) / (φ_{k;j,j+1} λ_{k;jj} + λ_{k;j+1,j}).

For m large relative to the sample size n, problems may arise in the estimation of the λ_{k;jj}, and hence of the θ_{k;j}. Even if estimation is possible, the estimates may be very poor, as the information on the precise location of a QTL is likely to be small. An approach that overcomes this difficulty is presented in the next section.

Reducing the parameterization

The approach presented here avoids the estimation of the specific location of each QTL and focuses on the broad location as defined by each interval. It also overcomes the potentially confusing situation of a QTL being located at a marker. To eliminate the parameters λ_{k;jj}, we assign a prior distribution to the location of the QTL and integrate (or average) over each interval. Without prior information, a QTL in any interval can occur at any location within that interval. We therefore assume that the distance from the left hand marker to a putative QTL, d_{k;j} say, is uniformly distributed. Notice that it must be the distance, and not the recombination fraction, that is given the uniform distribution, because distance is additive whereas recombination fractions are not. Again using Haldane's distance measure, the distance to the QTL is

  d_{k;j} = −(1/2) log(1 − 2θ_{k;j})

and it is assumed that d_{k;j} ∼ U[0, d_{k;j,j+1}], where U denotes the uniform distribution on the range specified. With this specification, the d_{k;j} (or θ_{k;j}) need to be integrated out to form a marginal distribution. Unfortunately, this is analytically intractable. In the manner of the regression approach to interval mapping, we instead replace Λ in (12.9.15) by its expected value. Let Λ_E = E(Λ). The non-zero elements of Λ_E are

  E(λ_{k;jj}) = E(λ_{k;j+1,j}) = θ_{k;j,j+1} / (2 d_{k;j,j+1}(1 − θ_{k;j,j+1}))    (12.9.19)

Note that this implies that we have a regression on

  [θ_{k;j,j+1} / (2 d_{k;j,j+1}(1 − θ_{k;j,j+1}))] (m^g_{k;j} + m^g_{k;j+1})

for each interval. Thus the average of the marker scores for the two markers defining the interval is scaled according to their separation, in terms of both recombination fraction and genetic distance. Our genetic model is based on these moment calculations and is given by

  g = M^g Λ_E a + p    (12.9.20)

Under this model, if M_E = Z_g M^g Λ_E, then (12.9.18) holds with variance matrix σ²H_E, where

  H_E = R + γ_a M_E M_E^T + γ_g Z_g Z_g^T + Z G Z^T    (12.9.21)

We call the columns of M_E derived interval markers, and the model leading to the variance structure (12.9.21) the interval marker random regression model (IMRRM). Notice that under the IMRRM the sizes are correlated along a chromosome via the random regression, something to be expected because of the linkage between the markers. Thus the model provides a natural covariance structure to capture that linkage.
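Constructing the derived interval markers from a marker matrix and the adjacent recombination fractions is then direct. A minimal R sketch (all names are ours; markers are scored ±1 as in a DH population, and the haldane.d function defined earlier is assumed):

    # Derived interval markers (columns of M^g Lambda_E) for one chromosome:
    # M is the l x m_k matrix of marker scores; theta holds the m_k - 1
    # recombination fractions between adjacent markers
    derived.markers <- function(M, theta) {
      d <- haldane.d(theta)                  # interval lengths in Morgans
      lamE <- theta / (2 * d * (1 - theta))  # E(lambda) of (12.9.19)
      sapply(seq_along(theta), function(j) lamE[j] * (M[, j] + M[, j + 1]))
    }
    M <- matrix(sample(c(-1, 1), 20, TRUE), nrow = 5)   # 5 lines, 4 markers
    derived.markers(M, theta = c(0.10, 0.18, 0.25))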

Detection of QTLs: An outlier model

All intervals on the linkage map can be classified into two groups. The first group consists of the intervals not containing a QTL, and is large in number; the sizes of QTL effect for these intervals will be small. The second group is small in number and consists of the intervals that do contain a QTL; the sizes of QTL effect for these intervals will reflect the presence of a QTL. Thus these QTL effect sizes represent outliers in comparison with the majority of intervals as given by the first group, and an approach for detecting outliers may be used to select intervals for putative QTLs.

? present outlier detection in linear mixed models based on the alternative outlier model (?, ?), following from the unpublished PhD thesis of B. Gogel (?). The alternative outlier model assumes an inflated variance for those components which are outliers. However, to avoid the need to refit models, or to use approximations as in ?, a score statistic is proposed for QTL detection. The use of a score statistic means that the outlier model does not have to be fitted; the statistic is evaluated under the null hypothesis, that is, that there is no QTL in an interval. The selection procedure presented below is not unique. However, as is shown in the simulation study, the approach performs well in terms of Type I error rate and detection of QTLs, and has a small false discovery rate.

The first step involves fitting models both with and without the random regression effects for the sizes of QTL, and testing the significance of the random regression term. If it is significant, the process continues; if not, the process is terminated. Once the significance of the interval marker random regression is established, there is an outlier detection phase. One approach would be to find the interval that appears as the largest outlier. The alternative approach used here is to carry out a nested process, which performs well in practice.

The first part of the nested outlier determination is to evaluate a score based statistic to determine the most likely chromosome for a QTL. The rationale behind this first step is that not only will the specific interval that contains the QTL be inflated, but so will surrounding intervals. Thus the overall impact of each chromosome establishes the strength of evidence for at least one outlier on that chromosome, and the chromosome with the largest score statistic is chosen as the most likely location of a QTL. The score based statistic is then used to choose the most likely interval on the selected chromosome as the putative QTL. The selected interval is then transferred to the fixed effects part of the model. This process is repeated until the change in the residual log-likelihood for a fit with random QTL effects, compared to a fit without random QTL effects, is not significant.

The score statistic is based on an alternative outlier model (AOM). In the first instance this model is developed at the chromosome level; subsequently the statistic is specialized to a single interval. For chromosome k the AOM is

  a^o = a + E_k δ_k    (12.9.22)

where E_k = [0 I_{m_k−1} 0]^T and δ_k is a vector of random effects, assumed δ_k ∼ N(0, σ²γ_{a,k} I_{m_k−1}). This model modifies the sizes of the QTL effects for the intervals j = 1, 2, . . . , m_k − 1 on chromosome k by inflating the variance on that chromosome, and hence allowing larger predicted random size effects.
If this variance inflation is significant, it suggests that at least one QTL may be present on that chromosome. The outlier score statistic developed below is motivated by a test of the hypothesis H0 : γ_{a,k} = 0 against the one-sided alternative H1 : γ_{a,k} > 0. The full mixed model under (12.9.22) is given by

  y = Xτ + M_E a + M_E E_k δ_k + Z_g p + Zu + e    (12.9.23)

In order to develop a statistic that can be used to locate QTLs, we consider the score for γ_{a,k} evaluated at the null hypothesis, which is our whole genome interval fit. If P = H^{−1} − H^{−1}X(X^T H^{−1}X)^{−1}X^T H^{−1}, the score for γ_{a,k} under (12.9.23) when H0 is true is given by

  U_k(0) = −(1/2)[tr(P M_E E_k E_k^T M_E^T) − (1/σ²) y^T P M_E E_k E_k^T M_E^T P y]    (12.9.24)

If ã is the BLUP of the sizes a under (12.9.18), we define the vector

  w̃_k = E_k^T M_E^T P y = (1/γ_a) E_k^T ã = ã_k / γ_a    (12.9.25)

If

  C_{k,k} = E_k^T M_E^T P M_E E_k    (12.9.26)

then the score for γ_{a,k} at the null hypothesis is given by

  U_k(0) = (1/2)(w̃_k^T w̃_k / σ² − tr(C_{k,k}))
         = (tr(C_{k,k}) / 2)(t_k² − 1)    (12.9.27)

where

  t_k² = w̃_k^T w̃_k / (σ² tr(C_{k,k})).

Note that E(w̃_k^T w̃_k) = σ² tr(C_{k,k}). It can be shown that

  M_E^T P M_E = (1/γ_a) I_{m−c} − (1/γ_a²) C^{M_E M_E}

where C^{M_E M_E} is the component of C^{−1} relating to a (and hence is the prediction error variance matrix of ã), and where C is the coefficient matrix of the mixed model equations (see for example ?). Thus

  C_{k,k} = (1/γ_a) I_{m_k−1} − (1/γ_a²) C^{M_E M_E}_{k,k}

where C^{M_E M_E}_{k,k} is the kth diagonal block of C^{M_E M_E}, and is the prediction error variance matrix of ã_k. Thus only the prediction error variances (the diagonal elements) of the QTL sizes are required to calculate the statistic t_k². To determine which chromosome is likely to contain a QTL, the chromosome with the largest t_k² is selected. If there is no QTL on a chromosome, the score statistic has mean zero, and a "large" deviation of the observed score from zero indicates that a QTL may be present; thus a large t_k² suggests a QTL is present. If we replace E_k in the derivation of the score by a single column e_{k;j} that selects interval j on chromosome k (the selected chromosome), the score for each interval on chromosome k can be determined. In fact, for a single interval j on chromosome k, (12.9.27) can be written as

  U_{k;j}(0) = (c_{k;jj} / 2)(t_{k;j}² − 1)    (12.9.28)

where

  t_{k;j}² = w̃_{k;j}² / (σ² c_{k;jj})

and the t_{k;j}² reflect the importance of an interval with respect to a putative QTL. If t_{k;j}² is large, this suggests a QTL may be present in the interval, and hence the interval with the largest t_{k;j}² is chosen as the likely position of the QTL. The selected QTL interval is then moved to the fixed effects and the process repeated until the random effects QTL component is not significant. When the selection process concludes, all putative QTLs will appear as fixed effects. Thus if S putative QTLs are selected, the final model is

$$y = X\tau + \sum_{s=1}^{S}m_{E,s}a_s + Z_gp + Zu + e \qquad (12.9.29)$$

where $m_{E,s} = Z_gM_s^g\lambda_{E,s}$ is the appropriate vector for the $s$th putative QTL, $M_s^g$ is an $n\times 2$ matrix of the marker scores defining the interval, and $\lambda_{E,s}$ is the appropriate column of $\Lambda_E$.
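In practice the interval statistics are simple to compute once the whole-genome model has been fitted: only the BLUPs of the interval sizes, the diagonal of their prediction error variance matrix, and the REML estimates of $\sigma^2$ and $\gamma_a$ are needed. The R sketch below is our illustration only, with hypothetical input values standing in for quantities that would come from the fitted model; it assumes the prediction error variances are supplied on the same scale as $C^{M_EM_E}$.

# Interval-level outlier statistics (12.9.28) for a chromosome, given the BLUPs
# a_blup of the QTL sizes, the diagonal pev of the prediction error variance
# matrix C^{ME,ME}_{k,k}, and REML estimates sigma2 and gamma_a.
interval_score <- function(a_blup, pev, sigma2, gamma_a) {
  w   <- a_blup / gamma_a                 # w-tilde, from (12.9.25)
  cjj <- 1 / gamma_a - pev / gamma_a^2    # diagonal of C_{k,k}, from the identity above
  t2  <- w^2 / (sigma2 * cjj)             # t^2_{k;j}
  U0  <- cjj / 2 * (t2 - 1)               # score (12.9.28)
  data.frame(interval = seq_along(w), t2 = t2, score = U0)
}
# Hypothetical inputs: the second interval stands out as a putative QTL.
interval_score(a_blup = c(0.05, -0.62, 0.08), pev = c(0.021, 0.018, 0.022),
               sigma2 = 0.40, gamma_a = 0.30)

The interval with the largest $t_{k;j}^2$ on the selected chromosome would then be moved to the fixed effects and the whole-genome fit repeated, exactly as described above.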

Implementation

The approach presented above has been implemented by Simon Diffey of the New South Wales Department of Primary Industries. The implementation is written in R (?), is built on the qtl library (?), and uses the samm library (?) for model fitting. It was used in the simulation study and in the analysis of the Sunco × Tasman flour yield trials presented below.

12.9.1 Analysis of the Sunco × Tasman flour yield experiment

Our analysis commences with determining the appropriate non-genetic model for the flour yield data. ? present an approach to the analysis of quality trait data from so-called multi-phase experiments. Flour yield is measured in a two-phase (field and milling phases) experiment. Their modelling includes blocking factors that respect the randomization processes in the design phases, as well as terms accounting for other sources of non-genetic variation and for spatial and/or temporal correlation from the field and laboratory processes. They demonstrate that such modelling significantly increases the response to selection in traditional breeding analysis, and suggest that it should also enhance accurate detection of QTLs for quality traits. To simplify our approach, we exclude the QTL terms from the initial modelling, but then use the established phenotypic model as the baseline model in the outlier QTL detection methodology.

Tables 12.11 and 12.12 present a summary of the models fitted to the 1999 and 2000 data respectively. The initial model (M0) for 1999 can be written symbolically as

y ~ 1 + gfac + genotype + frep + column.row + day.order

where 1 represents a constant (or overall mean) term, gfac is a fixed factor with 10 levels, 1 to 9 for parental genotypes and commercial varieties and 10 for DH lines (thereby providing a baseline mean for the DH lines), genotype is a random factor with 184 levels (175 DH lines plus the 9 parental and commercial lines) that represents the polygenic effects, frep is a random factor indexing field replicates, column and row are factors indexing field columns and rows respectively, and day and order are factors indexing mill days and mill order within days. Terms like column.row are interactions.

Table 12.11 Estimates of variance parameters and Z-statistics for key fixed effects (rows labelled fixed in the Parameter column) for models fitted to the 1999 Sunco × Tasman flour yield data. Terms involving lin(order) are regression terms on the order of milling.

Term              Parameter    M0      M1      M2      M3      M4      M5      M6      M6+QTL
genotype          σ²γ_g        2.448   2.564   2.491   2.491   2.561   2.617   2.737   0.817
frep              σ²_r         0.297   0.169   0.159   0.070   0.084   0.057   0.120   0.122
column.row        σ²_p         0.172   0.158   0.108   0.317   0.292   0.383   0.345   0.360
                  ρ_c                                                          0.693   0.785
                  ρ_r                                                          0.849   0.884
day               σ²_md                0.205   0.215   0.314   0.259   0.000   0.240   0.236
day.lin(order)    σ²_m.o                                       0.433   0.381   0.406   0.343
                  ρ_m.o                                       -0.162  -0.155  -0.064   0.047
units             σ²           1.297   0.704   0.705   0.345   0.413   0.806   0.370   0.397
                  ρ_o                                          0.318   0.765   0.347   0.306
lin(order)        fixed                        -3.886  -2.462  -2.589  -3.006  -2.266  -2.236
Residual log-likelihood       -402.2  -350.9  -347.5  -338.8  -337.8  -340.1  -319.9  -267.5

Note that terms in the models listed in Tables 12.11 and 12.12 may have several associated parameters. Thus column.row represents variation at the plot level in the field and has a variance σ²_p associated with it, together with correlations ρ_c and ρ_r in the column and row directions to allow for spatial variation. The terms day and day.lin(order) are random intercepts and slopes that allow for mill day effects and linear effects within a mill day; these random effects are correlated, and hence the correlation ρ_m.o is included. Correlation also exists at the residual level, labelled units in Tables 12.11 and 12.12, between samples milled on the same day (ρ_o). Detailed discussion of these terms is presented in ?.

For the data from the 1999 trial, several outliers were identified and removed, with a substantial reduction in σ² (compare model M1 with model M0). Terms and covariance parameters were added and tested using the approach outlined in ?, leading to the sequence M0 to M6. The final model, M6, included terms for spatial correlation, temporal correlation within mill days, an overall regression on mill order, and a random regression for mill days on mill order. A similar approach was used for the 2000 data, where the final model included temporal correlation within mill days, mill days, and regressions on mill order and field row. The improvement in the fit of models for both the 1999 and 2000 trials can be seen in Tables 12.11 and 12.12 by examining the increase in residual log-likelihood and the reduction in residual variance σ² for the models listed.

Table 12.12 Estimates of variance parameters and Z-statistics for key fixed effects (rows labelled fixed in the Parameter column) for models fitted to the 2000 Sunco × Tasman flour yield data. Terms involving lin(order) and lin(row) are regression terms on the order of milling and on field row respectively.

Term              Parameter    M0      M1      M2      M3      M4      M4+QTL
genotype          σ²γ_g        2.086   1.997   1.996   1.971   1.923   0.292
frep              σ²_r         0.107   0.078   0.079   0.077   0.077   0.076
column.row        σ²_p         0.359   0.365   0.399   0.434   0.406   0.413
day               σ²_md                0.586   0.575   0.486   0.482   0.499
units             σ²           0.830   0.238   0.200   0.264   0.271   0.267
                  ρ_o                                  0.693   0.711   0.718
lin(order)        fixed                       -3.791  -3.242  -3.231  -3.435
lin(row)          fixed                                       -3.823  -4.234
Residual log-likelihood       -405.1  -322.9  -320.8  -311.5  -309.6  -222.3

These final models were then used as the baseline models for the QTL analysis using the outlier method. The final analysis, including all selected QTLs, is presented as model M6+QTL in Table 12.11 and model M4+QTL in Table 12.12. The estimated polygenic variance for these models is clearly greatly reduced when compared to the baseline model (M6 or M4), indicating the impact of the detected QTLs on the genetic component of flour yield.

Tables 12.13 and 12.15 present a summary of the QTLs identified for 1999 and 2000 respectively, using a 5% threshold for the likelihood ratio test. Using the full non-genetic modelling approach together with the outlier method resulted in 15 and 12 QTLs being identified for 1999 and 2000 respectively. Of these, a total of 9 could realistically be regarded as the same QTLs. The Z-statistic reflects the importance of the selected QTL (it is the estimate divided by its standard error).

The impact on the QTL analysis of incorporating the variation due to the environment, and also due to the experimental design and the conduct of the experiment, was also examined. The outlier QTL procedure was repeated using the simplistic model of genotype plus error, where these two terms are independently normally distributed with associated variance parameters. Thus the field variation and laboratory variation are condensed into a single variance parameter. The QTLs found under this simplistic model are presented in Tables 12.14 and 12.16. Only 6 and 7 QTLs were identified for the two years, with 5 in common. These results demonstrate the potential benefits of using the most efficient non-genetic models for these types of data.

Table 12.13 QTL summary for the 1999 Sunco × Tasman data for M6 with QTLs, from the QTL analysis using the outlier method. Ci Ij gives the chromosome and the interval on the chromosome of the QTL, while Dist (L, R) is the distance in cM along the chromosome of the left and right flanking markers for each QTL. The Z-statistic is the estimate of the size of the QTL divided by the standard error of the estimate; the corresponding LOD score is also given.

QTL   Ci Ij    Markers                   Dist L, R (cM)   Z statistic   LOD
1     3D I2    (gwm71a, P32.M48.280)     0.0, 2.1         -2.54         1.40
2     4B I12   (wmc47, gwm6)             35.5, 41.4        2.62         1.49
3     5D I8    (gwm292, cfd19b)          117.9, 132.1     -2.86         1.78
4     4A I14   (wmc313, P33.M76.2)       154.7, 157.6     -2.46         1.31
5     4D I4    (almt1, wmc48b)           18.9, 30.0        2.53         1.39
6     2B I10   (ksuD22, wmc149)          61.8, 95.9        2.44         1.29
7     1B I4    (ksuD14, Glu.B3)          10.7, 11.0        2.08         0.94
8     7D I4    (wmc94, gwm121)           94.0, 98.2        2.10         0.96
9     6D I6    (sun5b, P39.M50.149)      50.0, 51.2       -2.68         1.56
10    1D I5    (abc156, cfd19a)          36.6, 68.9       -3.09         2.07
11    6B I4    (cdo1380, P39.M49.142)    5.1, 7.2         -4.06         3.58
12    3D I13   (gwm3, psr931)            161.5, 175.3      3.54         2.72
13    5A I13   (wg232c, PAACTelo2)       90.6, 95.1       -3.71         2.99
14    1B I16   (ksuI27a, P36.M37.2)      247.3, 256.7     -3.53         2.71
15    2B I5    (gwm515a, wmc474USQ)      51.2, 54.8        4.75         4.90

The results using the outlier method can also be compared with those of ?. These authors identified only 3 and 4 QTLs for 1999 and 2000 respectively, with 3 QTLs in common. Importantly, the outlier method identified the same QTLs as ?, but in addition detected further QTLs, most of which were common across the two years. This is a remarkable result, given that ? used efficient non-genetic models in their analyses but used the method due to ? for QTL identification.

12.10 Conclusions

The outlier approach to QTL detection presented in this chapter provides a mechanism to incorporate all intervals on a linkage map into a single model. The model is based on an extension of the simple interval mapping approach of ? in the regression setting. A major difference is the notion of a working model, which is developed only for detection; this working model does not represent the underlying genetics. However, at the end of what is a multi-stage selection process, the result is a simple genetic model. A further difference with

Table 12.14 QTL summary for 1999 Sunco × Tasman data for M6 with QTLs and QTL analysis using the outlier method but with no modelling of non-genetic effects. The QTL number matches the labelled QTLs in Table 12.13.

QTL   Ci Ij    Markers                   Dist L, R (cM)   Z statistic   LOD
11    6B I9    (gwm626, barc24)          12.1, 21.6       -2.62         1.49
8     7D I4    (wmc94, gwm121)           94.0, 98.2        1.96         0.83
10    1D I5    (abc156, cfd19a)          36.6, 68.9       -2.63         1.50
13    5A I13   (wg232c, PAACTelo2)       90.6, 95.1       -3.59         2.80
14    1B I14   (gwm11, gwm140)           90.9, 239.5      -3.09         2.07
15    2B I5    (gwm515a, wmc474USQ)      51.2, 54.8        7.18         11.19

Table 12.15 QTL summary for the 2000 Sunco × Tasman data for M4 with QTLs, from the QTL analysis using the outlier method. Ci Ij gives the chromosome and the interval on the chromosome of the QTL, while Dist (L, R) is the distance in cM along the chromosome of the left and right flanking markers for each QTL. The Z-statistic is the estimate of the size of the QTL divided by the standard error of the estimate; the corresponding LOD score is also given.

QTL   Ci Ij    Markers                   Dist L, R (cM)   Z statistic   LOD
1     2D I22   (gwm349, gwm301)          162.5, 178.3      2.18         1.03
2     2A I8    (wmc198, wmc170)          29.7, 40.9       -2.71         1.59
3     3D I6    (TeloPAGG2, TeloPAGG1)    55.0, 61.2       -2.43         1.28
4     4A I3    (germin, cdo795)          10.3, 11.4        2.76         1.65
5     1B I2    (gwm550, P36.M67.1)       0.0, 9.0          2.96         1.90
6     5A I14   (PAACTelo2, P46.M37.4)    95.1, 102.1      -4.58         4.55
7     6B I6    (cdo507, barc354)         8.9, 9.4         -6.47         9.09
8     1B I14   (gwm11, gwm140)           90.9, 239.5      -5.53         6.64
9     4B I2    (barc193, csME1)          0.0, 12.0        -5.40         6.33
10    4D I2    (Rht2.mut, csME2)         0.0, 1.8          4.84         5.09
11    7D I3    (gwm437, wmc94)           86.6, 94.0        4.26         3.94
12    2B I6    (wmc474USQ, wmc35a)       54.8, 59.6        10.74        25.05

Table 12.16 QTL summary for 2000 Sunco × Tasman data for M4 with QTLs and QTL analysis using the outlier method, but without modelling non-genetic variation. The QTL number matches the labelled QTLs in Table 12.15.

QTL   Ci Ij    Markers                   Dist L, R (cM)   Z statistic   LOD
8     1B I14   (gwm11, gwm140)           90.9, 239.5      -3.01         1.97
6     5A I14   (PAACTelo2, P46.M37.4)    95.1, 102.1      -4.16         3.76
10    4D I2    (Rht2.mut, csME2)         0.0, 1.8          3.56         2.75
7     6B I6    (cdo507, barc354)         8.9, 9.4         -5.33         6.17
9     4B I2    (barc193, csME1)          0.0, 12.0        -4.40         4.20
12    2B I6    (wmc474USQ, wmc35a)       54.8, 59.6        8.01         13.93
11    7D I4    (wmc94, gwm121)           94.0, 98.2        4.41         4.22

interval mapping and other approaches is that the exact location of a QTL within an interval is not estimated.

The effectiveness of the random effects formulation for selecting putative QTLs reflects the stability offered by this model: while QTL sizes are shrunk towards zero, these sizes together with the outlier approach are able to isolate genuinely important effects very effectively. Rather than requiring a LOD score, a sequential test procedure based on the residual likelihood ratio statistic allows a genome-wide assessment of significance. The simulation studies presented show that, while a theoretical justification for a critical value for a specified Type I error rate is not available, a simple approach of using a 5% test at each stage leads to Type I error rates that are very close to the nominal 5%. In addition, the simulations show that the outlier approach is able to detect more genuine QTLs, with only small increases in the rate of false positives.

The regression approach for QTL analysis also allows non-genetic effects to be included in the analysis in a routine fashion. Thus the outlier method for QTL analysis is available in complex experimental situations. The Sunco × Tasman doubled haploid wheat population data from trials in 1999 and 2000 provide an example of the ability of the method to incorporate important sources of non-genetic variation. Flour yield is a complex trait, and the outlier approach enables the determination of many putative QTLs. The number, both for individual trials and in common across trials, far surpasses the results from previous QTL analyses.

The impact of non-genetic variation on the detection of QTLs can be substantial. The analysis of flour yield for the Sunco × Tasman population highlighted the improvement possible from incorporating the multi-phase nature of the data generation process. More genuine QTLs and fewer extraneous QTLs should be the result. Thus the manner in which phenotypic data or trait data are generated must be understood and incorporated in the analysis.

The approach presented is currently being extended to more complex situations. A multi-environment approach that allows QTL × environment interactions to be examined, multi-trait QTL analysis, and the impact of treatments on the expression of QTLs are some of the new developments in progress.

CHAPTER 13

Mixed models for penalized models

13.1 Introduction

We have seen that mixed models are very flexible and provide a mechanism for modelling in many situations. There are other approaches that introduce structure into a statistical model, and one such approach is based on the idea of a penalty function. In this chapter we consider regression models for data with a quantitative explanatory variable, and in particular examine some methods for providing a smooth representation of the regression in a non-parametric fashion. There is a link between some of these models and mixed models, and this is developed in the chapter. This link allows penalized regression models to be fitted using standard mixed model software, a powerful tool.

The idea of penalizing an objective function is an old one. Whittaker (1923) considered using a discrete third difference penalty for equally spaced data to smooth in a non-parametric manner. Since this first paper the literature on such methods has grown to be very large. Discrete penalties have been discussed by a number of authors, notably Eilers and Marx (1996), who introduced P-splines. Penalized regression splines are discussed in detail by Ruppert et al. (2003). The literature on spline smoothing is enormous, with much work by Grace Wahba. Books by Wahba (1990), Green and Silverman (1994b), and, on functional data analysis, Ramsay and Silverman (1997) provide details and extensions. We will touch on some of the ideas and methods in this chapter. The extension to additive models (Hastie and Tibshirani (1990)), smoothing spline ANOVA (Gu (2002)), and the related mixed model approaches in this area (Verbyla et al. (1999b), Brumback and Rice (1998b), Zhang et al. (1998), Wang (1998), Wand (2003)) provide additional flexibility to model in the presence of treatment and additional structure.

To illustrate the ideas and methods we consider some data from a study on wheat flour quality. We thank Rudi Appels and Chris Rath for providing the data. A mixograph is a machine that mixes flour and water to produce dough. The mixograph used to generate the data considered here has a series of fixed pins and two rotating heads with two pins each. The mathematical properties of the mixing trajectories are well understood. The aim of this study was the identification of structure in the mixing process, and ultimately to provide a mechanism for comparing different flours in terms of their mixing process. The power required to drive the machine at a constant number of revolutions per minute is measured at regular intervals throughout the mixing process (at intervals of 0.0025 seconds). A portion of the data obtained in a study on a

particular wheat variety (a bakers quality flour was used in the mixograph) is presented in Figure 13.1. The portion of data is from 0.2 to 0.4 seconds into the mixing process (the full data set consists of approximately 240,000 observations). The power required to drive the mixograph shows a general increase over time in Figure 13.1. There is additional structure in the data, as can be seen by careful examination.

The data generated by the mixing process could be taken to be correlated. There is a certain duality between trend and correlation, and the models considered in this chapter specifically assume that trend is important and that correlation is present because of trend. It may be the case that additional correlation exists at the residual or error level. This is examined as we proceed with the various analysis methods.

Figure 13.1 Mixograph data: bakers wheat dough development. The force required to drive the mixograph at a constant revolutions per minute, recorded at time intervals of 0.0025 seconds.


The times used in this chapter are translated to (0, 0.2) to simplify some of the mathematical and statistical detail. Our aim is to detect patterns in the data in an exploratory analysis that allows the process to be quantified.

In this chapter, we begin by considering several methods of smoothing data that have a single quantitative explanatory variable; in the example this is time in seconds. The underlying approach to smoothing involves the use of a penalty function. We motivate the use of penalties by first considering a simple so-called hard-edged constraint in the context of a linear model. Relaxing the hard-edged constraint to something softer is then presented. This naturally gives rise to a quadratic penalty function that forms the basis of most of the methods considered in this chapter. Thus we proceed to penalized regression splines, P-splines, natural polynomial smoothing splines and then L-splines. The mixograph data is used as each method is presented. Although this data is not ideal for some of these methods, the data set does highlight that we must be careful not to use methods blindly.

An important aspect of this chapter is the connection between methods using penalties and mixed models. This conceptual link allows estimation to be conducted within the residual likelihood (REML) and best linear unbiased prediction (BLUP) paradigm. This is a very powerful vehicle, but there are arguments regarding the selection of the underlying smoothing parameter: typically cross-validation is used to select this parameter, whereas the methods presented here use REML. Furthermore, there is debate on whether inferential aspects of mixed models can be used formally for penalized models. While these issues are important, the ease with which mixed model technology can be used to fit complex models involving penalties makes it a very attractive approach, and it is this approach that we pursue in this chapter.

13.2 Hard-edge constraints

A very simple (and possibly unrealistic) model for the data of Figure 13.1 is a straight line. To set the notation for this chapter, let $y$ denote the $n\times 1$ vector of responses, taken at times $t_i$ such that $a \le t_1 < t_2 < \cdots < t_n \le b$ for some $a$ and $b$. Let the function $g(t_i)$ be the mean response at time $t_i$. If $g$ is the vector of $g(t_i)$, we assume that

$$y = g + e \qquad (13.2.1)$$

If the function $g(\cdot)$ is linear in $t$, we can write

$$g = X\tau \qquad (13.2.2)$$

where $X$ is an $n\times 2$ matrix and $\tau$ has two parameters, the intercept and the slope. Because the observations are taken over time, they constitute repeated measurements and they are likely to be correlated. Thus for some covariance matrix $\sigma^2R$ depending on a scale parameter $\sigma^2$ and a parameter vector $\phi$, we assume $e \sim N(0, \sigma^2R(\phi))$.

To motivate the approach using penalties, we begin by considering the simple regression problem in a non-conventional way; see Ramsay and Silverman (1997). Rather than specify the linear regression using (13.2.2), we specify the model using constraints. Thus if $g(\cdot)$ is linear, there exists a matrix $\Delta_2$ such that $\Delta_2^Tg = 0$. If the $t_i$ are equally spaced, the matrix $\Delta_2$ can be chosen to take second order differences. Under the linear model (13.2.2), $\Delta_2^TX = 0$.

Under the constraints, use of a Lagrangian is appropriate, and we maximise the log-likelihood (for given $R$), namely

$$PLL(g, \lambda) = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y-g)^TR^{-1}(y-g) + 2\lambda^T\Delta_2^Tg\right\} \qquad (13.2.3)$$

Differentiating with respect to $g$ and $\lambda$ we have

$$\frac{\partial PLL}{\partial g} = \frac{1}{\sigma^2}\left\{R^{-1}(y-g) - \Delta_2\lambda\right\}, \qquad \frac{\partial PLL}{\partial \lambda} = -\frac{1}{\sigma^2}\Delta_2^Tg$$

and hence obtain the adjusted normal equations

$$\begin{bmatrix} R^{-1} & \Delta_2 \\ \Delta_2^T & 0 \end{bmatrix}\begin{bmatrix} \hat{g} \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} R^{-1}y \\ 0 \end{bmatrix} \qquad (13.2.4)$$

To solve (13.2.4) note that

$$\begin{bmatrix} R^{-1} & \Delta_2 \\ \Delta_2^T & 0 \end{bmatrix}^{-1} = \begin{bmatrix} R - R\Delta_2(\Delta_2^TR\Delta_2)^{-1}\Delta_2^TR & R\Delta_2(\Delta_2^TR\Delta_2)^{-1} \\ (\Delta_2^TR\Delta_2)^{-1}\Delta_2^TR & -(\Delta_2^TR\Delta_2)^{-1} \end{bmatrix}$$

so that

$$\begin{bmatrix} \hat{g} \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} \left(I - R\Delta_2(\Delta_2^TR\Delta_2)^{-1}\Delta_2^T\right)y \\ (\Delta_2^TR\Delta_2)^{-1}\Delta_2^Ty \end{bmatrix}$$

Now, using the result

$$R - R\Delta_2(\Delta_2^TR\Delta_2)^{-1}\Delta_2^TR = X(X^TR^{-1}X)^{-1}X^T \qquad (13.2.5)$$

we see that

$$\hat{g} = X(X^TR^{-1}X)^{-1}X^TR^{-1}y = X\hat{\tau}$$

so that our estimate of $g$ is the fitted straight line at the design points. Thus under the straight line model $E(\hat{g}) = X\tau = g$, so that the estimator is unbiased, while

$$\mathrm{var}(\hat{g}) = \sigma^2X(X^TR^{-1}X)^{-1}X^T$$

The constraint imposed forces the model to be linear in $t$. This is a global constraint and is called "hard" by Ramsay and Silverman (1997). We now consider a way to soften the constraint.
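The equivalence in (13.2.5) is easy to verify numerically. The following R sketch is our illustration only, with simulated data standing in for the mixograph series and R = I; it solves the adjusted normal equations (13.2.4) directly and compares the result with the generalized least squares straight-line fit.

# Solve the adjusted normal equations (13.2.4) and compare with the GLS line.
set.seed(1)
n  <- 20
tt <- seq(0, 1, length.out = n)              # equally spaced design points
y  <- 2500 + 5 * tt + rnorm(n, sd = 0.5)
Rm <- diag(n)                                # R = I for simplicity
X  <- cbind(1, tt)
D2 <- t(diff(diag(n), differences = 2))      # Delta_2: columns take 2nd differences
A  <- rbind(cbind(solve(Rm), D2),
            cbind(t(D2), matrix(0, n - 2, n - 2)))
sol   <- solve(A, c(solve(Rm, y), rep(0, n - 2)))
g_hat <- sol[1:n]
tau_hat <- solve(t(X) %*% solve(Rm, X), t(X) %*% solve(Rm, y))
max(abs(g_hat - X %*% tau_hat))              # essentially zero

Note that the check relies on the design points being equally spaced, so that the second differences of the columns of X are exactly zero.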

13.2.1 Mixograph data

The analysis of the mixograph data is conducted assuming independence and constant variance for the residual or error components, as in the previous section. Figure 13.2 presents the data with the straight line fit. Using the methods discussed in the chapter on geostatistics, a variogram of the residuals is presented in Figure 13.3. This variogram shows that there is a periodic nature to the residuals. This is not surprising given the underlying mixing process. As mixing proceeds, dough is formed and the mixograph goes through a process of breaking down the dough and mixing the subsequent reduced mixture. For the time period considered here, the process induces an approximately periodic pattern.

Figure 13.2 Straight line fit assuming independence and constant variance of residual errors


Figure 13.3 Sample variogram for the residuals of the linear fit for the mixograph data.


We can highlight the cyclic or periodic nature of the data first by providing a plot with the points joined by lines, as in Figure 13.4; there is clearly a high frequency periodic trend. Secondly, we can examine the residuals from the linear fit as a line plot, as in Figure 13.5. This further reinforces that a high frequency periodic trend exists, but it also suggests that a lower frequency periodic effect may be present.

Figure 13.4 Line plot of mixograph data


Figure 13.5 Line plot of residuals from the linear fit for the mixograph data


To examine the cyclic patterns, a periodogram (Diggle (1990)) of the residuals from the linear fit is presented in Figure 13.6; it highlights two strong Fourier frequencies, namely ω = 0.125 and ω = 0.2. As mentioned above, these cyclic patterns are also evident in Figure 13.5. Note that if a fixed effects model with periodic terms were contemplated, we would have the model

$$g(t) = \beta_0 + \beta_1t + \beta_{1s}\sin 2\pi\omega_1t^* + \beta_{1c}\cos 2\pi\omega_1t^* + \beta_{2s}\sin 2\pi\omega_2t^* + \beta_{2c}\cos 2\pi\omega_2t^* \qquad (13.2.6)$$

where $t^* = t/0.0025$, and $\omega_1$ and $\omega_2$ correspond to 0.125 and 0.2 respectively in Figure 13.6.
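Once the sines and cosines are formed, (13.2.6) is an ordinary linear regression. A minimal R sketch follows (our illustration only; a simulated series with the same per-sample frequencies stands in for the mixograph data, and the generating coefficients are arbitrary).

# Fit the fixed-effects periodic model (13.2.6) by least squares.
set.seed(2)
n     <- 400
tstar <- 1:n                               # t* = t / 0.0025 is the sample index
y     <- 2500 + 0.01 * tstar + 0.5 * sin(2 * pi * 0.125 * tstar) +
         0.3 * cos(2 * pi * 0.2 * tstar) + rnorm(n, sd = 0.3)
fit <- lm(y ~ tstar + sin(2 * pi * 0.125 * tstar) + cos(2 * pi * 0.125 * tstar) +
              sin(2 * pi * 0.2 * tstar) + cos(2 * pi * 0.2 * tstar))
coef(fit)                                  # recovers the generating coefficients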

Figure 13.6 Sample periodogram of the residuals from the linear fit to the mixograph data


The methods of this chapter need to model both of these cyclic or periodic components. Notice that the periodic components do not appear to be entirely deterministic, so the methods that follow aim to allow for departures from strict Fourier terms.

13.3 Soft-edge constraints

Suppose we would like our fitted model to be close to a straight line, if appropriate, rather than exactly a straight line. By close we mean in terms of some distance measure. Thus let $d = \Delta_2^Tg$ and consider the squared distance $d^Td$. Suppose we wish this distance measure to be small, and the residual sum of squares to be small as well. Then we might consider choosing $g$ to maximize the penalised log-likelihood

$$PLL(g, \lambda) = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y-g)^TR^{-1}(y-g) + \lambda d^Td\right\} \qquad (13.3.7)$$

Notice that the Lagrangian $\lambda$ is now a scalar, and that $\lambda$ controls the interplay between the residual sum of squares and the distance measure. If $\lambda$ is large, we weight heavily in favour of a straight line, whereas if $\lambda$ is small, fidelity of the fitted model to the data is weighted more heavily.

The penalty can be written as

$$J(g) = d^Td = g^T\Delta_2\Delta_2^Tg = g^TKg$$

which is a quadratic form in $g$. This form of the penalty leads to very different estimation and properties than the hard constraint case. Notice that the $n\times n$ matrix $K$ is symmetric, but is of rank $n-2$. This reflects the fact that we are penalising departures from the rank 2 linear representation. Differentiating $PLL$ we find

$$\frac{\partial PLL}{\partial g} = \frac{1}{\sigma^2}\left\{R^{-1}(y-g) - \lambda Kg\right\}$$

and equating to zero we see that

$$\tilde{g} = (R^{-1} + \lambda K)^{-1}R^{-1}y$$

Now,

$$E(\tilde{g}) = (R^{-1} + \lambda K)^{-1}R^{-1}g, \qquad \mathrm{var}(\tilde{g}) = \sigma^2(R^{-1} + \lambda K)^{-1}R^{-1}(R^{-1} + \lambda K)^{-1}$$

Thus $\tilde{g}$ is a biased estimator of $g$; it is in fact a shrinkage estimator. The bias can be reduced by decreasing $\lambda$; a consequence, however, is an increase in the variance of the estimator. If precision is important, $\lambda$ should be increased, in which case the bias increases. Thus the choice of $\lambda$ governs the trade-off between bias and precision.

Note that as $\lambda \to \infty$, we have

$$\tilde{g} \to X(X^TH^{-1}X)^{-1}X^TH^{-1}y$$

so that in the limit we are fitting a straight line. Furthermore, as $\lambda \to 0$,

$$\tilde{g} \to X(X^TH^{-1}X)^{-1}X^TH^{-1}y + Z_2(Z_2^TR^{-1}Z_2)^{-1}Z_2^TPy = y$$

using (13.2.5), so that we interpolate the data.

The form of $K$ allows the estimate to be written in a different and informative manner. First note the matrix identity

$$(A + BCD)^{-1} = A^{-1} - A^{-1}B(DA^{-1}B + C^{-1})^{-1}DA^{-1}$$

Using this identity twice, together with the identity (13.2.5), we can show (see Verbyla et al. (1999b), p. 298)

$$\tilde{g} = X\hat{\tau} + \lambda^{-1}Z_2Z_2^TPy \qquad (13.3.8)$$

where

$$\hat{\tau} = (X^TH^{-1}X)^{-1}X^TH^{-1}y, \qquad Z_2 = \Delta_2(\Delta_2^T\Delta_2)^{-1}, \qquad H = R + \lambda^{-1}Z_2Z_2^T$$

and

$$P = H^{-1} - H^{-1}X(X^TH^{-1}X)^{-1}X^TH^{-1}$$

The form in (13.3.8) suggests the solution could be obtained via a mixed model. If

$$g = X\tau + Z_2u_2 \qquad (13.3.9)$$

with $u_2 \sim N(0, \sigma_u^2I_{n-2})$ and $\lambda = \sigma^2/\sigma_u^2$, then estimation using REML and prediction using BLUP will yield (13.3.8). Thus the model with a penalty can be fitted using a mixed model, with a suitable form for $Z_2$ and a random effect $u_2$.
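A direct implementation of the shrinkage estimator takes a few lines of R. This sketch is our illustration only (simulated data, R = I, and a fixed λ rather than a REML estimate); it shows the limiting straight-line behaviour as λ grows.

# Soft-constraint estimator g~ = (R^-1 + lambda K)^-1 R^-1 y, K = Delta_2 Delta_2'.
set.seed(3)
n   <- 50
tt  <- seq(0, 1, length.out = n)
y   <- 2 + tt + sin(2 * pi * tt) + rnorm(n, sd = 0.2)
D2  <- t(diff(diag(n), differences = 2))     # Delta_2
K   <- D2 %*% t(D2)
gfit <- function(lambda) solve(diag(n) + lambda * K, y)   # R = I
g_wiggly <- gfit(1)       # small lambda: follows the data closely
g_line   <- gfit(1e8)     # large lambda: essentially the least squares line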

13.3.1 Mixograph data

The penalized model was fitted to the mixograph data using REML. The variance component for the second difference random effects converged to zero. Thus the smoothing parameter goes to infinity, and hence the fit is a straight line, as in the previous subsection. This simple penalty is therefore unable to provide estimation of the two periodic components present in the data.

13.4 Penalized Regression splines

In the literature on splines, basis functions play a prominent part. We begin with a simple approach that is advocated by Ruppert et al. (2003) and that introduces the truncated power basis.

13.4.1 Truncated Power Function Basis

Penalized regression splines are built on a base polynomial, together with a power function basis that facilitates smoothing. There are several aspects that need to be selected in fitting penalized regression splines. The first of these is the number of knots and their values. Thus, suppose there are $r$ knots at locations $x_i$ defined by $a < x_1 < \cdots < x_r < b$. A penalized spline also involves a polynomial of degree $k$ and so-called truncated power functions. The truncated power functions introduce smooth departures from the polynomial in a very simple fashion. Ruppert et al. (2003) provide a simple justification for the power functions by building on the broken stick model, which can be represented using a truncated power model of degree 1. The function $g(\cdot)$ is represented by

$$g(t) = \sum_{j=0}^{k}\tau_{Tj}t^j + \sum_{i=1}^{r}u_{Ti}(t - x_i)_+^k$$

where the truncated function is defined as $(a)_+ = a$ for $a \ge 0$ and $0$ otherwise. These truncated functions arise for other spline models, as we shall see below. In matrix terms, the model at the $t_i$ is given by

$$g = X_T\tau_T + Z_Tu_T \qquad (13.4.10)$$

The terms in (13.4.10) are $X_T$, an $n\times(k+1)$ matrix whose $(i,j)$th element is $t_i^j$ (for $j = 0, 1, \ldots, k$); $\tau_T$, a vector whose $j$th element is $\tau_{Tj}$; $Z_T$, an $n\times r$ matrix with $(i,j)$th element $(t_i - x_j)_+^k$; and $u_T$, the vector of the $u_{Ti}$. The penalised log-likelihood to be maximized is

$$PLL_T = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - X_T\tau_T - Z_Tu_T)^TR^{-1}(y - X_T\tau_T - Z_Tu_T) + \lambda_Tu_T^Tu_T\right\} \qquad (13.4.11)$$

This penalized log-likelihood incorporates a simple ad-hoc penalty which is designed to shrink the estimator and hence provide a scatterplot smoother. The link with mixed models is immediate if we assume $u_T \sim N(0, \sigma_T^2I_r)$, with $\lambda_T = \sigma^2/\sigma_T^2$.

The truncated power function basis can be numerically unstable in computation, and alternative basis functions are often used in its place. In the following section we turn to P-splines, which are built on the B-spline basis; this basis is related to the truncated power function basis, but B-splines are computationally more stable and are easy to generate.

13.4.2 Mixograph data

There are a number of "parameters" that need to be set to fit penalized regression splines: the number and location of the knots, and the degree of the polynomial and hence of the truncated power function. For the mixograph data, knots were equally spaced and were 11, 21 and 41 in number. The polynomial degree was taken as 1, 2 and 3, in combination with the various numbers of knots. In all cases, the variance component for the power function penalty converged to zero. Thus this method failed to detect the periodicities in the data under automatic estimation of the variance component $\sigma_T^2$ using REML. A straight line fit results.
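Constructing the basis in (13.4.10) and solving the penalized least squares problem for a fixed λ_T is straightforward; the following R sketch is our illustration only (simulated data, R = I, and an arbitrary fixed λ_T rather than a REML estimate).

# Truncated power basis of degree k with r interior knots, fitted with the
# ad-hoc ridge penalty of (13.4.11) at a fixed value of lambda_T.
set.seed(4)
n     <- 100
tt    <- seq(0, 1, length.out = n)
y     <- 2 + tt + 0.5 * sin(4 * pi * tt) + rnorm(n, sd = 0.1)
k     <- 1
knots <- seq(0.1, 0.9, length.out = 9)
XT <- outer(tt, 0:k, `^`)                                  # polynomial part
ZT <- outer(tt, knots, function(t, x) pmax(t - x, 0)^k)    # truncated power part
lambda_T <- 10
Cmat <- rbind(cbind(crossprod(XT), crossprod(XT, ZT)),
              cbind(crossprod(ZT, XT), crossprod(ZT) + lambda_T * diag(length(knots))))
coefs <- solve(Cmat, c(crossprod(XT, y), crossprod(ZT, y)))
ghat  <- cbind(XT, ZT) %*% coefs                           # fitted smooth

Only the truncated power coefficients are penalized, so the polynomial part is left unshrunk, exactly as in the mixed model formulation.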

13.5 P-splines

Eilers and Marx (1996) introduced P-splines: smoothers that are based on the B-spline basis functions and a difference penalty. B-splines are constructed from piecewise polynomials that are joined at chosen knot points. As for the penalized regression splines, we suppose we have $r$ distinct knot points $x$. We begin with a brief review of B-splines and then present details of estimation based on the penalized log-likelihood. A basic reference for B-splines is de Boor (1978).

13.5.1 B-splines

Because B-splines are based on polynomials, any basis or set of B-splines depends on the degree of the polynomial selected; as for penalized regression splines, we denote the degree of the underlying (piecewise, in this case) polynomial by $k$. With $r$ knots there is a requirement to select an additional $k$ knots below $a$ and another $k$ knots above $b$; the placement of these knots is arbitrary.

B-splines can be generated from the truncated power function basis: B-splines of order $k$ can be constructed from $k$th divided differences of the truncated power function basis of order $k$ on the same knots. B-splines can also be generated using simple recurrence relations. Thus if $B_i(t; k, x)$ is the $i$th B-spline, $i = -(k-1), \ldots, 0, 1, \ldots, r-1$, we have $r + k - 1$ B-splines in the full set. $B_i(t; k, x)$ is the value of a polynomial of degree $k$ at the point $t$, for the $i$th B-spline. In fact

$$B_i(t; 1, x) = \begin{cases} 1 & x_i \le t < x_{i+1} \\ 0 & \text{otherwise} \end{cases}$$

and

$$B_i(t; j+1, x) = \frac{t - x_i}{x_{i+j} - x_i}B_i(t; j, x) + \frac{x_{i+j+1} - t}{x_{i+j+1} - x_{i+1}}B_{i+1}(t; j, x), \qquad j > 0$$

de Boor (1978) provides a comprehensive account of the properties of B-splines.
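A direct, if inefficient, implementation of this recurrence is a useful check. The R sketch below is our illustration only: ord = j + 1 denotes the order, so ord = k + 1 gives a degree-k B-spline, the knot vector must already include the extra knots described above, and the result can be compared against the columns produced by splines::splineDesign.

# Cox-de Boor recurrence for a single B-spline evaluated at t (scalar or vector).
bspline <- function(t, i, ord, x) {
  if (ord == 1) return(as.numeric(x[i] <= t & t < x[i + 1]))
  a1 <- if (x[i + ord - 1] > x[i]) (t - x[i]) / (x[i + ord - 1] - x[i]) else 0
  a2 <- if (x[i + ord] > x[i + 1]) (x[i + ord] - t) / (x[i + ord] - x[i + 1]) else 0
  a1 * bspline(t, i, ord - 1, x) + a2 * bspline(t, i + 1, ord - 1, x)
}
x  <- seq(-3, 13, by = 1) / 10          # an extended knot sequence
tt <- seq(0, 0.99, length.out = 50)
B1 <- bspline(tt, i = 4, ord = 4, x)    # a cubic B-spline supported on x[4..8]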

13.5.2 Penalized B-splines: P-splines

The P-splines approach involves using B-splines to represent the underlying function. Thus the vector of function values at the design points $t_1, t_2, \ldots, t_n$ is $g = Ba$, where the vector $a$ contains unknown coefficients. The penalized log-likelihood that forms the basis of the P-spline approach is

$$PLL_b = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - Ba)^TR^{-1}(y - Ba) + \lambda_ba^T\Delta_d\Delta_d^Ta\right\} \qquad (13.5.12)$$

where $\Delta_d$ is a differencing matrix of order $d$ and hence is of size $(r+k-1)\times(r+k-1-d)$.

Eilers (1999) shows that a mixed model representation is possible for P-splines. The projection matrix defined by $\Delta_d$ is decomposed as

$$I_{r+k-1} - \Delta_d(\Delta_d^T\Delta_d)^{-1}\Delta_d^T = LL^T$$

where $L$ is an $(r+k-1)\times d$ matrix of full column rank. Defining $\tau_b = L^Ta$, $Z = B\Delta_d(\Delta_d^T\Delta_d)^{-1}$ and $u_b = \Delta_d^Ta$, we see that

$$a = L\tau_b + \Delta_d(\Delta_d^T\Delta_d)^{-1}u_b$$

and a mixed model is given by

$$y = BL\tau_b + Zu_b + e \qquad (13.5.13)$$

As Eilers (1999) notes, $BL$ is a transformation of a polynomial of degree $d-1$ on the vector $t$ of design points, and this polynomial is represented exactly by the B-spline basis. Further, we take $u_b \sim N(0, \sigma_b^2I_{r+k-1-d})$ and $\lambda_b = \sigma^2/\sigma_b^2$. The penalized log-likelihood is then of the form required to obtain the mixed model equations, and again REML and BLUP can be used to estimate the P-spline.
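For a fixed λ_b the P-spline fit again reduces to penalized least squares. The following sketch is our illustration only (simulated data, R = I, equally spaced knots, cubic B-splines built with splines::splineDesign, and a second-order difference penalty).

# P-spline: B-spline basis plus a difference penalty, at a fixed lambda_b.
library(splines)
set.seed(5)
n  <- 200
tt <- seq(0, 1, length.out = n)
y  <- sin(2 * pi * tt) + rnorm(n, sd = 0.2)
k  <- 3; r <- 21
knots <- seq(-k, r + k, by = 1) / r            # k extra knots on each side of [0, 1]
B  <- splineDesign(knots, tt, ord = k + 1)     # n x (length(knots) - ord) basis
Dd <- diff(diag(ncol(B)), differences = 2)     # rows take 2nd differences of a
lambda_b <- 10
a  <- solve(crossprod(B) + lambda_b * crossprod(Dd), crossprod(B, y))
ghat <- B %*% a                                # fitted P-spline at the design points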

13.5.3 Mixograph data

P-splines allow the flexibility of choosing the degree of the underlying piecewise polynomials, and the number of knots and their location. As with penalized regression splines, we considered 11, 21 and 41 equally spaced knots, and $d = 0, 1, 2$. For $d = 0$ there are no fixed effects in the mixed model (although for fitting we include the constant). The case $d = 0$ performed best in following the trend (see Figure 13.7, where the fitted function is based on 41 knot points), but the fit is not smooth, being piecewise linear. Second order differencing was too severe, recovering only the linear trend. However, first order differencing recovers the linear plus the broad cyclic trend, but very little of the high frequency signal; see Figure 13.8.

Figure 13.7 P-spline with 41 equally spaced knots and zero degree differencing in the penalty


The plots of the sample variogram and residuals for the d = 0 case are presented in Figures 13.9 and 13.10 respectively. Much of the cyclic nature of the data has been accommodated, but there are still cyclic patterns evident. The

Figure 13.8 P-spline with 21 equally spaced knots and first degree differencing in the penalty


d = 1 fit exhibits the high frequency cyclic pattern in the sample variogram and the residual plot.

Figure 13.9 Sample variogram for residuals from the d = 0, 41 knot, P-spline fit


Figure 13.10 Plot of residuals after the d = 0, 41 knot, P-spline fit


13.6 Smoothing splines

13.6.1 Background: Green's functions and reproducing kernels

We turn to a special form for the penalty function. Suppose that our observation times or design values satisfy $a < t_1 < t_2 < \cdots < t_n < b$. For ease of presentation we take $a = 0$ and $b = 1$, something that can be achieved by a simple transformation of the design variable. The unknown function $g(\cdot)$ is chosen to maximize the penalized log-likelihood

$$PLL = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y-g)^TR^{-1}(y-g) + \lambda_s\int_0^1\{g^{(m)}(t)\}^2dt\right\} \qquad (13.6.14)$$

where $g^{(m)}(t)$ denotes the $m$th derivative of $g(t)$. We shall also use the notation $D^mg = D^mg(t)$ to denote $m$th derivatives in certain developments. Implicit in the penalized log-likelihood is the fact that

1. $g^{(j)}(t)$ is absolutely continuous for $j = 0, 1, \ldots, m-1$, where $g^{(0)}(t) = g(t)$;

2. $g^{(m)}(t)$ is square integrable, so that $\int_0^1\{g^{(m)}(t)\}^2dt < \infty$.

The solution found by maximizing the penalized log-likelihood is a polynomial smoothing spline. This means it is a piecewise polynomial of degree $2m-1$ which satisfies the continuity property of condition 1; indeed its derivatives up to order $2m-2$ are continuous, with the $(2m-1)$th derivative discontinuous at the knots. If $m = 2$ we obtain the familiar cubic smoothing spline; see Green and Silverman (1994b).

The parameter $\lambda_s$ governs the amount of smoothing. The heuristic rationale behind the penalty is as follows. The polynomial of degree $m-1$ is considered the smoothest curve, and the $m$th derivative $g^{(m)}(t) = 0$ in this case. Thus we are penalizing curves departing from this $(m-1)$th degree polynomial, and the size of the penalty determines the deviation allowed.

To see how the solution can be obtained, we use the two properties or conditions to expand $g(t)$ in a Taylor series about $t = 0$ with an integral form of the remainder, namely

$$g(t) = \sum_{j=0}^{m-1}\frac{t^j}{j!}g^{(j)}(0) + \int_0^1\frac{(t-u)_+^{m-1}}{(m-1)!}g^{(m)}(u)\,du \qquad (13.6.15)$$

Notice that we have the truncated power function and the $m$th derivative appearing in the remainder term. The remainder is actually a special integral that arises in the solution of the differential equation

$$D^mg(t) = f(t)$$

The solution of this differential equation under certain boundary conditions is

$$g(t) = \int_0^1G_m(t,u)f(u)\,du = \int_0^1G_m(t,u)\,g^{(m)}(u)\,du$$

where $G_m(t,u)$ is Green's function. This is of the same form as the remainder in our Taylor series expansion (13.6.15), with Green's function being

$$G_m(t,u) = \frac{(t-u)_+^{m-1}}{(m-1)!}$$

Different Green's functions arise for different differential equations, as we shall see below. In general, determination of the solution of such differential equations requires finding $G_m(t,u)$.

Let $H_2$ be the set of functions satisfying conditions 1 and 2 above, but in addition satisfying $g^{(j)}(0) = 0$ for $j = 0, 1, \ldots, m-1$. This set contains all functions that are linear combinations of functions satisfying these requirements, is closed under limiting operations, and has an associated inner product. In simple Euclidean geometry, the inner product plays a role in specifying the angle between vectors, and relates in a natural way to the length of vectors. Thus if $x$ and $y$ are two vectors, we define the inner product as $(x, y) = x^Ty$ and a measure of distance called the norm as

$$\|x\| = \sqrt{(x,x)} = \sqrt{x^Tx}$$

Note that the angle $\theta$ between $x$ and $y$ can be found using

$$\cos\theta = \frac{(x,y)}{\|x\|\,\|y\|}$$

thereby showing that the inner product provides information on angles between vectors.

Our set $H_2$ contains functions, and hence we need to provide an inner product for functions. Thus for $H_2$ we define the inner product

$$(f, g) = \int_0^1f^{(m)}(u)\,g^{(m)}(u)\,du$$

so that the (squared) norm is

$$\|g\|^2 = \int_0^1\left\{g^{(m)}(u)\right\}^2du$$

which is our penalty function. This is a very logical selection, as we are intending to penalize functions distant from a polynomial of degree $m-1$. We use these results in what follows.

Green's functions play a vital role in the maximization of (13.6.14). Define

$$k_2(s,t) = \int_0^1G_m(s,u)\,G_m(t,u)\,du \qquad (13.6.16)$$

Note that $k_2(s,t) = k_2(t,s)$. If we consider $t$ to be fixed and write $k_{2,t}(v) = k_2(v,t)$, then it can be shown that

$$D^mk_{2,t}(v) = G_m(t,v)$$

Notice that with this result

$$(k_{2,s}, k_{2,t}) = \int_0^1G_m(s,u)\,G_m(t,u)\,du = k_2(s,t)$$

and hence the name reproducing kernel given to $k_2$. Note also that

$$(k_{2,t}, g) = \int_0^1G_m(t,u)\,g^{(m)}(u)\,du = g(t)$$

so that $k_{2,t}$ is called the representer of evaluation of $g$ at $t$. The set $H_2$ together with the function $k_2$ forms a reproducing kernel Hilbert space.

In the same manner, the polynomials $\phi_j(t) = t^j/j!$, $j = 0, 1, \ldots, m-1$, also form a reproducing kernel Hilbert space, which we call $H_1$. The inner product is defined by

$$(\phi_j, \phi_k) = \sum_{i=0}^{m-1}D^i\phi_j(0)\,D^i\phi_k(0)$$

so that the squared norm is

$$\|\phi_j\|^2 = \sum_{i=0}^{m-1}\left\{D^i\phi_j(0)\right\}^2$$

Note that with these definitions, $\phi_0, \phi_1, \ldots, \phi_{m-1}$ are orthonormal; that is, they have unit length and they are perpendicular. This follows because

$$(D^j\phi_k)(0) = \begin{cases} 1 & j = k \\ 0 & j \ne k \end{cases}$$

In this setting the reproducing kernel is

$$k_1(s,t) = \sum_{j=0}^{m-1}\phi_j(s)\,\phi_j(t)$$

and it satisfies the reproducing and representer of evaluation conditions required.
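As a concrete special case (a worked example for m = 2, added here for illustration and not part of the original development), the Green's function is $G_2(t,u) = (t-u)_+$ and the kernel integral in (13.6.16) can be evaluated in closed form:

$$k_2(s,t) = \int_0^1(s-u)_+(t-u)_+\,du = \int_0^s(s-u)(t-u)\,du = \frac{s^2(3t-s)}{6}, \qquad s \le t$$

This is the kernel that underlies the cubic smoothing spline of Section 13.6.3.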

13.6.2 Solution

What is the consequence of these results? Firstly, all functions $g$ satisfying conditions 1 and 2 belong to a reproducing kernel Hilbert space $H$ which is such that

$$H = H_1 \oplus H_2$$

so that we can form $g$ as the sum of two functions, one from $H_1$ and one from $H_2$. In the same manner, the reproducing kernel is the sum of $k_1$ and $k_2$, so that $k(s,t) = k_1(s,t) + k_2(s,t)$. The most important consequence of these results, however, is the fact that we can write

$$g(t) = \sum_{j=0}^{m-1}\tau_{s,j}\phi_j(t) + \sum_{i=1}^{n}d_ik_2(t_i,t) \qquad (13.6.17)$$

Note that if $k_{2,t_i}(t) = k_2(t_i,t)$ and $k_2(t)$ is the vector of the $k_{2,t_i}(t)$, we can write (13.6.17) as

$$g(t) = \phi(t)^T\tau_s + k_2(t)^Td$$

It now follows that

$$g^{(m)}(t) = D^mg(t) = D^mk_2^T(t)\,d$$

and hence that

$$\int_0^1\left\{g^{(m)}(u)\right\}^2du = \int_0^1\left\{D^mk_2^T(u)\,d\right\}^2du = d^T\left[\int_0^1D^mk_{2,t_i}(u)\,D^mk_{2,t_j}(u)\,du\right]_{ij}d$$

$$= d^T\left[\int_0^1G_m(t_i,u)\,G_m(t_j,u)\,du\right]_{ij}d = d^T\left[k_2(t_i,t_j)\right]_{ij}d = d^TK_2d$$

Thus the penalty can be written as a quadratic form with the matrix $K_2$ evaluated at the design points $t_1, t_2, \ldots, t_n$, and at the design points the penalized log-likelihood is given by

$$PLL = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - X\tau_s - K_2d)^TR^{-1}(y - X\tau_s - K_2d) + \lambda_sd^TK_2d\right\} \qquad (13.6.18)$$

This suggests that a mixed model formulation is again possible. Note that the solution based on (13.6.18) may not always be numerically stable, and so we often seek to transform to a better-conditioned system (this is not always possible, and problems arise because $K_2$ can be ill-conditioned).

There are implicit constraints on the above system. We impose the constraints $X^Td = 0$. Then there exists an $n\times(n-m)$ matrix $Q_2$ of full column rank such that $d = Q_2\delta$. Then if $Z_2 = K_2Q_2$ and $G_s = Q_2^TK_2Q_2$, we have

$$PLL = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - X\tau_s - Z_2\delta)^TR^{-1}(y - X\tau_s - Z_2\delta) + \lambda_s\delta^TG_s^{-1}\delta\right\} \qquad (13.6.19)$$

which looks like the objective function that gives rise to mixed model equations. Thus we can fit such models using mixed model techniques. Here $\delta \sim N(0, \sigma_s^2G_s)$ and $\lambda_s = \sigma^2/\sigma_s^2$.

13.6.3 Cubic smoothing spline

The special case $m = 2$, which corresponds to the cubic smoothing spline, has received much attention. The elements of the cubic smoothing spline are as follows. Let $h_j = t_{j+1} - t_j$, $j = 1, 2, \ldots, n-1$. Define $\Delta$ and $G_s$ to be $n\times(n-2)$ and $(n-2)\times(n-2)$ banded matrices respectively, where the only non-zero elements are (for $i = 1, 2, \ldots, n-2$)

$$\Delta_{ii} = \frac{1}{h_i}, \qquad \Delta_{i+1,i} = -\left(\frac{1}{h_i} + \frac{1}{h_{i+1}}\right), \qquad \Delta_{i+2,i} = \frac{1}{h_{i+1}} \qquad (13.6.20)$$

$$G_{s;i,i+1} = G_{s;i+1,i} = \frac{h_{i+1}}{6}, \qquad G_{s;ii} = \frac{h_i + h_{i+1}}{3} \qquad (13.6.21)$$

Then $G_s$ appears in (13.6.19) and $Z_2 = \Delta(\Delta^T\Delta)^{-1}$. In addition, it can be shown that $\delta$ is the $(n-2)$-vector of second derivatives at the internal knots. A further transformation reduces the estimation to that of a variance component analysis: if $u_s = G_s^{-1/2}\delta$ and $Z = Z_2G_s^{1/2}$, then $u_s \sim N(0, \sigma_s^2I_{n-2})$ and the penalized log-likelihood can be modified accordingly.

13.6.4 Mixograph Data

The cubic smoothing spline was fitted to the mixograph data using the mixed model approach, and hence used REML and BLUP. The fit is presented in Figure 13.11, and we see that there is a mix of the low and high frequency cycles (together with the linear trend). However, this figure and the residual and sample variogram plots in Figures 13.12 and 13.13 highlight that the high frequency cycle is still present in the data. Thus the smoothing spline has picked up broad trends but is unable to accommodate the high frequency periodic effects. Note that using cross-validation for the smoothing parameter does not change this result.
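The band matrices (13.6.20) and (13.6.21) are easy to assemble directly. The sketch below is our illustration only (simulated data, R = I, and a fixed λ_s rather than a REML estimate); it computes the cubic smoothing spline fit of Section 13.6.3 through the penalty matrix K = ΔG_s⁻¹Δᵀ.

# Cubic smoothing spline from (13.6.20)-(13.6.21) at a fixed lambda_s.
set.seed(6)
n  <- 60
tt <- seq(0, 1, length.out = n)
y  <- sin(2 * pi * tt) + rnorm(n, sd = 0.2)
h  <- diff(tt)
Delta <- matrix(0, n, n - 2)
Gs    <- matrix(0, n - 2, n - 2)
for (i in 1:(n - 2)) {
  Delta[i, i]     <- 1 / h[i]
  Delta[i + 1, i] <- -(1 / h[i] + 1 / h[i + 1])
  Delta[i + 2, i] <- 1 / h[i + 1]
  Gs[i, i]        <- (h[i] + h[i + 1]) / 3
  if (i < n - 2) Gs[i, i + 1] <- Gs[i + 1, i] <- h[i + 1] / 6
}
lambda_s <- 1e-6
K    <- Delta %*% solve(Gs, t(Delta))      # penalty matrix
ghat <- solve(diag(n) + lambda_s * K, y)   # fitted spline at the design points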

Figure 13.11 Smoothing spline fit to the mixograph data


13.7 L-splines

13.7.1 Differential Operator

The previous approaches to analysing the mixograph data have essentially ignored the cyclic structure evident in the data. Given that there are three components to the trend, it is perhaps not surprising that in the main the methods have failed. We now turn to an extension of the smoothing spline approach which targets more complex functions than polynomials in the penalty. We replace the differential operator $D^m$ by a more general operator $L$, where

$$L = w_0(t) + w_1(t)D + \cdots + w_{m-1}(t)D^{m-1} + D^m \qquad (13.7.22)$$

for some functions $w_j(t)$, $j = 0, 1, \ldots, m-1$. We also consider the more general

Figure 13.12 Residuals after the smoothing spline fit to the mixograph data


Figure 13.13 Sample variogram of residuals from the smoothing spline fit to the mixograph data


penalty

$$\int_0^1\{Lg(t)\}^2dt$$

How are we going to use this $L$? We wish to penalize curves that obey $Lg(t) = 0$. Thus, given a function $g$, we wish to determine $L$ such that $Lg(t) = 0$. The functions $g$ that are chosen provide a curve that is sensible in the application but that may not be strictly appropriate. For example, the patterns observed in Figures 13.4, 13.5 and 13.6 suggest that a suitable $g$ to consider is the function

$$g(t) = \beta_0 + \beta_1t + \beta_{1s}\sin\omega_1t + \beta_{1c}\cos\omega_1t + \beta_{2s}\sin\omega_2t + \beta_{2c}\cos\omega_2t \qquad (13.7.23)$$

The six basis functions that make up $g$ are $(1, t, \sin\omega_1t, \cos\omega_1t, \sin\omega_2t, \cos\omega_2t)$. The operator $L$ should annihilate each of these six "basis" functions, and determining the $L$ that does so involves finding the functions $w_j(t)$, $j = 0, 1, \ldots, m-1$. We will illustrate the general theory by considering the underlying basis set for (13.7.23). Thus define the vectors

$$w = w(t) = \begin{pmatrix} w_0(t) \\ w_1(t) \\ w_2(t) \\ w_3(t) \\ w_4(t) \\ w_5(t) \end{pmatrix}, \qquad u = u(t) = \begin{pmatrix} 1 \\ t \\ \sin\omega_1t \\ \cos\omega_1t \\ \sin\omega_2t \\ \cos\omega_2t \end{pmatrix}$$

We require the $w$ that ensures $Lu = 0$. Applying $L$ to $u$ we have

$$w_0(t)u + \sum_{j=1}^{m-1}w_j(t)D^ju = -D^mu$$

In matrix terms this becomes

$$\left[u\;\; Du\;\; \ldots\;\; D^{m-1}u\right]w = -D^mu \qquad \text{or} \qquad Ww = -D^mu \qquad (13.7.24)$$

The matrix $W$ is called the Wronskian matrix, and it plays an important role here and below. Note that if $W$ is non-singular for all $t$, we can find $w$ by

$$w = -W^{-1}D^mu$$

The derivatives of our $u$, namely $D^ju$, are simple to find. However, the algebra required to solve (13.7.24) is best conducted using a symbolic mathematical package. The calculations reported here were carried out using the freeware symbolic package Maxima (http://maxima.sourceforge.net/index.shtml). Using this manipulator we find

$$w = w(t) = \begin{pmatrix} 0 \\ 0 \\ \omega_1^2\omega_2^2 \\ 0 \\ \omega_1^2 + \omega_2^2 \\ 0 \end{pmatrix}$$

and hence

$$L = \omega_1^2\omega_2^2D^2 + (\omega_1^2 + \omega_2^2)D^4 + D^6$$

It is easily checked that this operator does indeed satisfy $Lg(t) = 0$ for functions based on the 6 basis functions.
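A quick numerical version of that check is sketched below (our illustration; the frequencies are hypothetical values standing in for the estimated ones). For sin(ωt), D² contributes −ω², D⁴ contributes ω⁴ and D⁶ contributes −ω⁶, while 1 and t are killed by D² alone, so only a scalar coefficient needs to vanish.

# Coefficient of sin(w t) in L sin(w t); the same coefficient applies to cos(w t).
w1 <- 2 * pi * 0.125; w2 <- 2 * pi * 0.2     # hypothetical frequencies
Lsin <- function(w) -w1^2 * w2^2 * w^2 + (w1^2 + w2^2) * w^4 - w^6
c(Lsin(w1), Lsin(w2))                        # both zero, as required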

13.7.2 Green's function

As for the smoothing spline case, the solution to the differential equation $Lg(t) = f(t)$ is given by

$$g(t) = \int_0^1G(t,u)f(u)\,du = \int_0^1G(t,u)\,Lg(u)\,du$$

where $G(t,u)$ is Green's function for the solution. It is necessary to derive $G(t,u)$ for any particular $L$. If $v_1(t), v_2(t), \ldots, v_m(t)$ are the elements of the last row of the inverse of the Wronskian matrix, $W^{-1}$, and $v(t)$ is the vector of these elements, Green's function can be determined by

$$G(t,s) = \begin{cases} u(t)^Tv(s) & s \le t \\ 0 & \text{otherwise} \end{cases}$$

For our model given by (13.7.23) we find

$$v(s) = \begin{pmatrix} -\dfrac{s}{\omega_1^2\omega_2^2} \\[4pt] \dfrac{1}{\omega_1^2\omega_2^2} \\[4pt] -\dfrac{\cos\omega_1s}{\omega_1^3\omega_2^2 - \omega_1^5} \\[4pt] \dfrac{\sin\omega_1s}{\omega_1^3\omega_2^2 - \omega_1^5} \\[4pt] -\dfrac{\cos\omega_2s}{\omega_1^2\omega_2^3 - \omega_2^5} \\[4pt] \dfrac{\sin\omega_2s}{\omega_1^2\omega_2^3 - \omega_2^5} \end{pmatrix}$$

and hence

$$G(t,s) = \begin{cases} \dfrac{t-s}{\omega_1^2\omega_2^2} - \dfrac{\sin\omega_1(t-s)}{\omega_1^3\omega_2^2 - \omega_1^5} - \dfrac{\sin\omega_2(t-s)}{\omega_1^2\omega_2^3 - \omega_2^5} & s \le t \\[6pt] 0 & \text{otherwise} \end{cases}$$

13.7.3 Reproducing kernel

As for the smoothing spline,

$$k_2(s,t) = \int_0^1G(s,u)\,G(t,u)\,du$$

Symbolic manipulation yields the reproducing kernel: for $s \le t$,

$$\begin{aligned}
k_2(s,t) ={}& \frac{s^2(3t-s)}{6\,\omega_1^4\omega_2^4} \\
&- \frac{\omega_1s\cos\omega_1t + \sin\omega_1(t-s) - \sin\omega_1t}{\omega_1^7\omega_2^2(\omega_1^2 - \omega_2^2)}
 + \frac{\omega_2s\cos\omega_2t + \sin\omega_2(t-s) - \sin\omega_2t}{\omega_1^2\omega_2^7(\omega_1^2 - \omega_2^2)} \\
&+ \frac{\omega_1(t-s) - \omega_1t\cos\omega_1s + \sin\omega_1s}{\omega_1^7\omega_2^2(\omega_1^2 - \omega_2^2)}
 + \frac{\omega_2(t-s) - \omega_2t\cos\omega_2s + \sin\omega_2s}{\omega_1^2\omega_2^7(\omega_1^2 - \omega_2^2)} \\
&+ \frac{2\omega_1s\cos\omega_1(t-s) + \sin\omega_1(t-s) - \sin\omega_1(t+s)}{4\omega_1^7(\omega_1^2 - \omega_2^2)^2}
 + \frac{2\omega_2s\cos\omega_2(t-s) + \sin\omega_2(t-s) - \sin\omega_2(t+s)}{4\omega_2^7(\omega_1^2 - \omega_2^2)^2} \\
&- \frac{(\omega_2 - \omega_1)\sin(\omega_1s + \omega_2t) + (\omega_1 + \omega_2)\sin(\omega_1s - \omega_2t) + 2\omega_1\sin\omega_2(t-s)}{2\omega_1^3\omega_2^3(\omega_1^2 - \omega_2^2)^3} \\
&+ \frac{(\omega_1 - \omega_2)\sin(\omega_2s + \omega_1t) + (\omega_1 + \omega_2)\sin(\omega_2s - \omega_1t) + 2\omega_2\sin\omega_1(t-s)}{2\omega_1^3\omega_2^3(\omega_1^2 - \omega_2^2)^3}
\end{aligned} \qquad (13.7.25)$$

For s > t we note that k2(s, t) = k2(t, s). The functional form for g(t) given by (13.6.17) is again appropriate with φj(t) replaced by uj(t). The mixed model that arises is given by

y = Xτ + Zu + e

where $X$ has $(i,j)$th element $u_j(t_i)$, $Z = K_2Q_2G^{1/2}$, $G = Q_2^TK_2Q_2$ and $u \sim N(0, \sigma_L^2I)$. The penalized log-likelihood is then given by

$$PLL = -\frac{1}{2}\log\det\sigma^2R - \frac{1}{2\sigma^2}\left\{(y - X\tau - Zu)^TR^{-1}(y - X\tau - Zu) + \lambda_Lu^Tu\right\}$$

and $\lambda_L = \sigma^2/\sigma_L^2$.
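Rather than coding the closed form (13.7.25), one can build the matrix K₂ at the design points by numerical quadrature of the definitional integral, using the closed-form Green's function above. The R sketch below is our illustration only, with hypothetical frequencies.

# k2(s,t) = integral of G(s,u) G(t,u) over (0,1), by quadrature.
w1 <- 2 * pi * 0.125; w2 <- 2 * pi * 0.2     # hypothetical frequencies
Gfun <- function(t, s) ifelse(s <= t,
  (t - s) / (w1^2 * w2^2) -
    sin(w1 * (t - s)) / (w1^3 * w2^2 - w1^5) -
    sin(w2 * (t - s)) / (w1^2 * w2^3 - w2^5), 0)
k2 <- function(s, t) integrate(function(u) Gfun(s, u) * Gfun(t, u), 0, 1)$value
k2(0.3, 0.7)                                 # one element of the K2 matrix

Looping k2 over all pairs of design points gives K₂, after which the mixed model construction above proceeds as for the smoothing spline.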

13.7.4 Mixograph Data

For the mixograph data the mixed model was fitted using REML. The estimated L-spline is presented in Figure 13.14. The fit appears to be very good, and the sample variogram of the residuals from the fit is presented in Figure 13.15. The cyclic pattern is not as evident, and the variation has largely been accounted for by the L-spline.

Figure 13.14 L-spline fit for the mixograph data


Figure 13.15 Sample variogram for the residuals from the L-spline fit for the mixograph data


13.8 Variance modelling

We have seen that mixed models can be used to fit a wide range of models under a penalized likelihood scheme. In each case, a variance structure was generated by the penalty, this variance structure was fitted using REML, and the fitted curves are BLUPs. We might argue that penalties are a way to achieve a satisfactory underlying variance model, in fact that penalties are a surrogate for variance modelling. This may be a controversial view of modelling using penalties, but an alternative to that approach is the use of a variance structure directly.

Given the clear linear plus cyclic patterns in the data, as exemplified by the L-spline approach, it would seem sensible to include the six terms in the basis as fixed effects in the model and then investigate possible variance models. There are various possible models that could be investigated. Some obvious ones are autoregressive models, moving average models and, possibly the most attractive, models using the Matern class of correlations. Nugget effects could not be fitted in any variance model. The autoregressive model of order 1 (AR(1)), the moving average model of order 1 (MA(1)) and the Matern correlation model with ν = 0.5, 1.0, 1.5, 2.0, 2.2 were all fitted. The Matern model with ν = 2.2 achieved the largest REML likelihood. Table 13.1 presents the REML log-likelihoods for the various models.

Table 13.1 REML log-likelihoods for various models fitted to the mixograph data

Model             REML log-likelihood
L-spline          -45.49
AR(1)             -37.47
MA(1)             -31.02
Matern ν = 0.5    -37.47
Matern ν = 1.0    -33.31
Matern ν = 1.5    -31.08
Matern ν = 2.0    -30.01
Matern ν = 2.2    -29.76

The parameter ν governs the level of differentiability of the underlying Matern process. Thus it appears we have a function that is at least twice differentiable. The log-likelihoods indicate that the Matern correlation structure with ν approximately 2 fits the data best. Figure 13.16 shows that the fit is very close to that of the L-spline, but in likelihood terms it is much better. Thus we have an appropriate variance structure. Interestingly, the MA(1) model is also a good fit, while the AR(1) (equivalently the Matern model with ν = 0.5) is not as good.

The sample variogram for the best Matern fit is given in Figure 13.17. What is evident is the cyclic pattern through the variogram. However, the residuals used in constructing the variogram are correlated, and hence we expect to see some pattern that reflects that correlation.
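For reference, the Matern correlation function is easy to evaluate directly. The R sketch below is our illustration only, using one common parameterization (a range parameter phi and smoothness nu; the values shown are hypothetical, not the fitted ones).

# Matern correlation: rho(d) = 2^(1-nu)/Gamma(nu) * (d/phi)^nu * K_nu(d/phi).
matern_cor <- function(d, phi, nu) {
  x <- d / phi
  ifelse(d == 0, 1, (2^(1 - nu) / gamma(nu)) * x^nu * besselK(x, nu))
}
d <- seq(0, 0.2, length.out = 100)
head(cbind(nu0.5 = matern_cor(d, 0.02, 0.5),   # exponential, i.e. AR(1)-like
           nu2.2 = matern_cor(d, 0.02, 2.2)))  # much smoother near the origin

The behaviour of the correlation near zero distance is what ν controls, which is why values of ν around 2 correspond to a process that is (approximately) twice differentiable.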

Figure 13.16 Linear plus double periodic plus Matern fit to the mixograph data


Figure 13.17 Sample variogram for residuals from Matern fit to the mixograph data


Thus for this dataset, the use of a formal variance matrix rather than a penalty appears to provide the best fit. There is an implicit assumption here that the REML log-likelihood is an appropriate measure of goodness of fit. This might be challenged for penalized likelihood methods, for we are using standard mixed model inferential tools. This is a difficult and important topic that remains to be resolved.

13.9 Analysis of high-resolution mixograph data

13.10 Analysis of another example: still to come

13.11 LASSO

The common component throughout the development of penalized models in this chapter has been a quadratic penalty, whether introduced in an ad-hoc or a more formal way. If the penalty is changed to an absolute value form, as used in the Least Absolute Shrinkage and Selection Operator (LASSO) of Tibshirani (1996), the convenient linear mixed model properties are no longer available. Approximations are required to move towards a mixed model, and estimation involves nonlinear programming (Osborne et al. (2000)). Thus the mechanics of linear mixed models are useful but not universal.

13.12 Discussion

All of this chapter has focussed on estimation. The inferential use of mixed models for estimation based on penalties is very contentious. The fact that the estimators (or predictors) are like best linear unbiased predictors does not mean the generating model is a mixed model; indeed it is a mathematical nicety. However, it is a very useful vehicle for fitting the models.

Nothing has been said about the use of REML to estimate the smoothing parameter rather than some form of cross-validation. Various studies have shown that REML is a reasonable approach; see Wahba (1985) and Ruppert et al. (2003). There are studies that show that some of the inferential properties may hold. For example, testing linearity in a smoothing spline context using a residual likelihood ratio test was shown in a small simulation study by Verbyla et al. (1997) to closely follow the proposed asymptotic distribution under a mixed model setting. More recently, Claeskens (2004) has studied this problem in more detail for penalized regression splines. Prediction or pointwise confidence intervals based on the Bayesian (Wahba (1983)) or mixed model (Verbyla et al. (1999b)) formulations have been proposed, and simulation studies suggest these perform as expected.

Given that the mixed model involves variance structures, the appropriateness of inferences using mixed models derived from penalized modelling will ultimately depend on the appropriateness of the variance structure generated by the penalty. If this is a good representation of such structure in the data, we surmise that inferences will be valid. If this is not the case, inferences will be problematic.

Bibliography

Abramowitz, M. and Stegun, I. A., editors (1965). Handbook of Mathematical Functions. Dover Publications, New York.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki, editors, Proceedings of the 2nd International Symposium on Information Theory, Budapest, pages 267-281. Akademiai Kiado.
Bartlett, M. S. (1966). Stochastic Processes. Cambridge University Press, Cambridge, 2nd edition.
Bartlett, M. S. (1978). Nearest neighbour models in the analysis of field experiments. Journal of the Royal Statistical Society, Series B, 40, 147-158.
Beckett, P. H. T. and Webster, R. (1971). Soil variability: a review. Soils and Fertilizers, 34, 1-15.
Beecher, H. G., Hume, I. H., and Dunn, B. W. (2002). Improved method for assessing rice soil suitability to restrict recharge. Australian Journal of Experimental Agriculture, 42(3), 297-307.
Besag, J. and Kempton, R. A. (1986). Statistical analysis of field experiments using neighbouring plots. Biometrics, 42, 231-251.
Besag, J. E. and Higdon, D. (1999). Analysis of field experiments using a Bayesian approach (with discussion). Journal of the Royal Statistical Society, Series B, xx, xxx-xxx.
Brien, C. J. (1983). Analysis of variance tables based on experimental structure. Biometrics, 39, 53-59.
Brumback, B. A. and Rice, J. A. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves (with discussion). Journal of the American Statistical Association, 93, 961-994.
Claeskens, G. (2004). Restricted likelihood ratio lack-of-fit tests using mixed spline models. Journal of the Royal Statistical Society, Series B, 66, 909-926.
Clayton, B. and Hollis, J. M. (1984). Criteria for differentiating soil series. Soil Survey of England and Wales Technical Monograph 17, Harpenden.
Cochran, W. G. and Cox, G. M. (1957). Experimental Designs. Wiley, New York, 2nd edition.
Conyers, M. K., Uren, N. C., and Helyar, K. R. (1995). Causes of changes in pH in acidic mineral soils. Soil Biology and Biochemistry, 27, 1383-1392.
Coombes, N. (2002). The Reactive Tabu Search for efficient correlated experimental designs. Ph.D. thesis, Liverpool John Moores University, Liverpool, U.K.
Cressie, N. A. C. (1991). Statistics for Spatial Data. John Wiley and Sons, New York.
Cressie, N. A. C. (1993). Statistics for Spatial Data. John Wiley and Sons, New York, revised edition.
Cullis, B., Smith, A., Panozzo, J., and Lim, P. (2003). Barley malting quality: are we selecting the best? Australian Journal of Agricultural Research, 54, 1261-1275.
Cullis, B., Smith, A., and Coombes, N. (2005). On the design of early generation variety trials with correlated data.
Cullis, B. R. and Gleeson, A. C. (1991). Spatial analysis of field experiments - an extension to two dimensions. Biometrics, 47, 1449-1460.
Cullis, B. R., Gogel, B. J., Verbyla, A. P., and Thompson, R. (1998). Spatial analysis of multi-environment early generation trials. Biometrics, 54, 1-18.
de Boor, C. (1978). A Practical Guide to Splines, volume 27 of Applied Mathematical Sciences. Springer-Verlag, New York.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Diggle, P. J. (1990). Time Series: A Biostatistical Introduction. Clarendon Press, Oxford.
Diggle, P. J., Liang, K.-Y., and Zeger, S. L. (1994). Analysis of Longitudinal Data. Clarendon Press, Oxford.
Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics (with discussion). Applied Statistics, 47(3), 299-350.
Diggle, P. J., Ribeiro, P. J. J., and Christensen, O. F. (2003). An introduction to model-based geostatistics. In J. Moller, editor, Spatial Statistics and Computational Methods, pages 43-86. Springer-Verlag.
Eilers, P. H. C. (1999). Contribution to the discussion of Verbyla et al. (1999). Applied Statistics, 48, 307-308.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89-121.
Falconer, D. S. and Mackay, T. (1996). Introduction to Quantitative Genetics. Longman Scientific and Technical, 4th edition.
Foulley, J.-L. and Quaas, R. L. (1995). Heterogeneous variances in Gaussian linear mixed models. Genetics Selection Evolution, 27, 211-228.
Foulley, J.-L. and van Dyk, D. A. (2000). The PX-EM algorithm for fast stable fitting of Henderson's mixed model. Genetics Selection Evolution, 32, 143-163.
Foulley, J.-L., Jaffrezic, F., and Robert-Granie, C. (2000). EM-REML estimation of covariance parameters in Gaussian mixed models for longitudinal data analysis. Genetics Selection Evolution, 32, 129-141.
Gilmour, A. R., Thompson, R., and Cullis, B. R. (1995). Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51, 1440-1450.
Gilmour, A. R., Cullis, B. R., and Verbyla, A. P. (1997). Accounting for natural and extraneous variation in the analysis of field experiments. Journal of Agricultural, Biological and Environmental Statistics, 2, 269-293.
Gilmour, A. R., Cullis, B. R., Welham, S. J., Gogel, B. J., and Thompson, R. (2003). ASReml Reference Manual. Technical report, VSN International.
Gilmour, A. R., Cullis, B. R., Welham, S. J., Gogel, B. J., and Thompson, R. (2004). An efficient computing strategy for prediction in mixed linear models. Computational Statistics and Data Analysis, 44, 571-586.
Gleeson, A. C. and Cullis, B. R. (1987). Residual maximum likelihood (REML) estimation of a neighbour model for field experiments. Biometrics, 43, 277-288.
Gogel, B. J. (1997). Spatial analysis of multi-environment variety trials. Ph.D. thesis, Department of Statistics, University of Adelaide, South Australia.
Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. Chapman and Hall, London.
Green, P. J., Jennison, C., and Seheult, A. H. (1985). Analysis of field experiments by least squares smoothing. Journal of the Royal Statistical Society, Series B, 47, 299-315.
Gu, C. (2002). Smoothing Spline ANOVA Models. Springer, New York.
Haskard, K. A. (2005). Anisotropic Matérn correlation and other issues in model-based geostatistics. Ph.D. thesis, BiometricsSA, University of Adelaide.
Haskard, K. A., Cullis, B. R., and Verbyla, A. P. (2005). Anisotropic Matérn correlation and spatial prediction using REML. Journal of Agricultural and Biological Sciences, unknown, ***-***.
Haslett, J. (1999). A simple derivation of deletion diagnostic results for the general linear model with correlated errors. Journal of the Royal Statistical Society, Series B, 61, 603-609.
Haslett, J. and Hayes, K. (1998). Residuals for the linear model with general covariance structure. Journal of the Royal Statistical Society, Series B, 60, 201-215.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.
Heenan, D. P., Taylor, A. C., Cullis, B. R., and Lill, W. J. (1994). Long term effects of rotation, tillage and stubble management on wheat production in southern NSW. Australian Journal of Agricultural Research, 45, 93-117.
Henderson, C. R. (1950). Estimation of genetic parameters (abstract). Annals of Mathematical Statistics, 21, 309-310.
Henderson, C. R. (1973). Sire evaluation and genetic trends. In Proceedings of the Animal Breeding and Genetics Symposium in Honour of Dr. Jay L. Lush, Champaign, pages 10-41.
Hurvich, C. M. and Tsai, C. L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 271-293.
Kammann, E. E. and Wand, M. P. (2003). Geoadditive models. Applied Statistics, 52(1), 1-18.
Kenward, M. G. (1987). A method for comparing profiles of repeated measurements. Applied Statistics, 36, 296-308.
Kenward, M. G. and Roger, J. H. (1997). The precision of fixed effects estimates from restricted maximum likelihood. Biometrics, 53, 983-997.
Krige, D. G. (1951). A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52, 119-139.
Laird, N. and Ware, J. (1982). Random effects models for longitudinal data. Biometrics, 38, 963-975.
Lane, P. W. (1998). Predicting from unbalanced linear or generalised linear models. In COMPSTAT98 Proceedings in Computational Statistics. Physica-Verlag, Heidelberg.
Lane, P. W. and Nelder, J. A. (1982). Analysis of covariance and standardisation as instances of prediction. Biometrics, 38, 613-621.
Lark, R., Cullis, B. R., and Welham, S. (2005). On spatial prediction of soil properties in the presence of a spatial trend: the empirical best linear unbiased predictor (E-BLUP) with REML. European Journal of Soil Science, unknown, ***-***.
Lark, R. M., Catt, J. A., and Stafford, J. V. (1998). Towards the explanation of within-field variability of yield of winter barley: soil series differences. Journal of Agricultural Science, Cambridge, 131, 409-416.
Laslett, G. M., McBratney, A. B., Pahl, P., and Hutchinson, M. F. (1987). Comparison of several spatial prediction methods for soil pH. Journal of Soil Science, 38, 325-341.
Lill, W. J., Cullis, B. R., and Gleeson, A. C. (1988). SAFE, a computer program for the spatial analysis of field experiments. In Proceedings of the National Mathematical Sciences Congress.
Liu, C. and Rubin, D. B. (1994). The ECME algorithm: a simple extension of EM and ECM with fast monotone convergence. Biometrika, 81, 633-648.
Liu, C., Rubin, D. B., and Wu, Y. N. (1998). Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, 85, 755-770.
Martin, R. J. (1979). A subclass of lattice processes applied to a problem in planar sampling. Biometrika, 66, 209-217.
Martin, R. J. (1990). The use of time-series models and methods in the analysis of agricultural field trials. Communications in Statistics, 19, 55-81.
Matheron, G. (1973). The intrinsic random functions and their applications. Advances in Applied Probability, 5, 439-468.
McCullagh, P. and Nelder, J. A. (1994). Generalized Linear Models. Chapman and Hall, London, 2nd edition.
McIntyre, G. A. (1955). Design and analysis of two-phase experiments. Biometrics, 11, 324-334.
McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. John Wiley and Sons, New York.
Meng, X. L. and van Dyk, D. A. (1997). The EM algorithm: an old folk-song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society, Series B, 59, 511-567.
Meng, X. L. and van Dyk, D. A. (1998). Fast EM-type implementations for mixed effects models. Journal of the Royal Statistical Society, Series B, 60, 559-578.
Mercer, W. B. and Hall, A. D. (1911). The experimental error of field trials. Journal of Agricultural Science, Cambridge, 4, 107-132.
Nelder, J. A. (1954). The interpretation of negative components of variance. Biometrika, 41, 544-548.
Nelder, J. A. (1965a). The analysis of randomized experiments with orthogonal block structure. I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London, Series A, 283, 147-162.
Nelder, J. A. (1965b). The analysis of randomized experiments with orthogonal block structure. II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London, Series A, 283, 163-178.
Nelder, J. A. (1968). The combination of information in generally balanced designs. Journal of the Royal Statistical Society, Series B, 30, 303-311.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9, 319-337.
Papadakis, J. S. (1937). Methode statistique pour des experiences sur champ. Bulletin scientifique, Institut d'Amelioration des Plantes a Thessaloniki (Grece).
Patterson, H. D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545-554.
Patterson, H. D., Silvey, V., Talbot, M., and Weatherup, S. T. C. (1977). Variability of yields of cereal varieties in U.K. trials. Journal of Agricultural Science, Cambridge, 89, 238-245.
Payne, R. W., editor (1993). Genstat 5 Release 3 Reference Manual. Clarendon Press, Oxford.
Pinheiro, J. and Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS. Springer-Verlag, New York.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1991). Numerical Recipes in Fortran 77: The Art of Scientific Computing. Cambridge University Press.
Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. Springer, New York.
Ripley, B. D. (1981). Spatial Statistics. John Wiley, New York.
Robinson, G. K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science, 6, 15-51.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Searle, S. R. (1971). Linear Models. John Wiley and Sons, New York.
Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance Components. John Wiley and Sons, New York.
Smith, A., Cullis, B. R., and Thompson, R. (2001a). Analyzing variety by environment data using multiplicative mixed models and adjustments for spatial field trend. Biometrics, 57, 1138-1147.
Smith, A., Cullis, B., Appels, R., Campbell, A., Cornish, G., Martin, D., and Allen, H. (2001b). The statistical analysis of quality traits in plant improvement programs with application to the mapping of milling yield in wheat. Australian Journal of Agricultural Research, 52, 1207-1219.
Smith, A., Cullis, B., and Thompson, R. (2005a). The analysis of crop cultivar breeding and evaluation trials: an overview of current mixed model approaches. Journal of Agricultural Science, Cambridge.
Smith, A., Cullis, B., and Lim, P. (2005b). On the design of multi-phase experiments for quality trait data.
Smith, A. B. and Cullis, B. R. (2001). The analysis of crop variety evaluation data in Australia. Australian and New Zealand Journal of Statistics, 43, 129-145.
Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer-Verlag, New York.
Stein, M. L., Chi, Z., and Welty, L. J. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society, Series B, 66, 275-296.
Stram, D. O. and Lee, J. W. (1994). Variance components testing in the longitudinal mixed effects model. Biometrics, 50, 1171-1177.
Thompson, R. (1985). A note on restricted maximum likelihood estimation with an alternative outlier model. Journal of the Royal Statistical Society, Series B, 47, 53-55.
Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58, 267-288.
Van Vuuren, J. A. J., Barnard, R. O., and Claassens, A. S. (2000). Soil sampling under fixed cultivation practices. Communications in Soil Science and Plant Analysis, 31, 2055-2066.
Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York.
Verbyla, A. P. (1990). A conditional derivation of residual maximum likelihood. Australian Journal of Statistics, 32, 227-230.
Verbyla, A. P. and Cullis, B. R. (1992). The analysis of multistratum and spatially correlated repeated measures data. Biometrics, 48, 1015-1032.
Verbyla, A. P., Cullis, B. R., Kenward, M. G., and Welham, S. J. (1997). The analysis of designed experiments and longitudinal data using smoothing splines. Research Report 97/4, Department of Statistics, The University of Adelaide.
Verbyla, A. P., Cullis, B. R., Kenward, M. G., and Welham, S. J. (1999). The analysis of designed experiments and longitudinal data by using smoothing splines (with discussion). Applied Statistics, 48, 269-311.
Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B, 45, 133-150.
Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics, 13, 1378-1402.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Wand, M. P. (2003). Smoothing and mixed models. Computational Statistics, 18, 223-249.
Wang, Y. (1998). Mixed effects smoothing spline analysis of variance. Journal of the Royal Statistical Society, Series B, 60, 159-174.
Webster, R. and Oliver, M. A. (2001). Geostatistics for Environmental Scientists. John Wiley and Sons, Chichester.
Welham, S. J. and Thompson, R. (1997). Likelihood ratio tests for fixed model terms using residual maximum likelihood. Journal of the Royal Statistical Society, Series B, 59, 701-714.
Whittaker, E. T. (1923). On a new method of graduation. Proceedings of the Edinburgh Mathematical Society, 41, 63-75.
Wilk, M. B. and Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55, 1-17.
Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic description of factorial models for analysis of variance. Applied Statistics, 22, 392-399.
Wilkinson, G. N., Eckert, S. R., Hancock, T. W., and Mayo, O. (1983). Nearest neighbour (NN) analysis of field experiments (with discussion). Journal of the Royal Statistical Society, Series B, 45, 151-211.
Wood, J. T., Williams, E. R., and Speed, T. P. (1988). Non-orthogonal block structure in two-phase designs. Australian Journal of Statistics, 30A, 225-237.
Yates, F. (1940). The recovery of inter-block information in balanced incomplete block designs. Annals of Eugenics, 10, 317-325.
Zhang, D., Lin, X., Raz, J., and Sowers, M. (1998). Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association, 93, 710-719.
Zimmerman, D. L. (1989). Computationally exploitable structure of covariance matrices and generalised covariance matrices in spatial models. Journal of Statistical Computation and Simulation, 32, 1-15.
Zimmerman, D. L. and Harville, D. A. (1991). A random field approach to the analysis of field plot experiments. Biometrics, 47, 223-239.

APPENDIX A

Iterative Schemes

A.1 Introduction

REML estimation for linear mixed models generally involves the use of an iterative scheme. Patterson and Thompson (1971) proposed using the Fisher scoring (FS) algorithm, which requires computation of the expected information matrix. The terms in this matrix involve traces of products of large matrices and can be prohibitively expensive to compute for large data-sets or complex variance models. Gilmour et al. (1995) presented a variation of the FS algorithm in which the expected information matrix is replaced by an approximate average of the observed and expected information matrices. The elements of the so-called average information (AI) matrix are proportional to the residual sums of squares of working variates, one for each variance parameter in the model. Gilmour et al. (1995) demonstrated that the AI algorithm is competitive with the FS algorithm in terms of rate of convergence. The AI algorithm is the basis of the iterative scheme in ASReml (Gilmour et al., 2003). ASReml also uses sparse matrix methods, which allow quite complex models to be fitted to very large data-sets, and it provides the basis for the REML algorithms in samm and GENSTAT 5. These are the software packages that we have used to conduct the analyses in this book.

The AI algorithm is a derivative-based scheme and therefore its convergence sequence is not guaranteed to be monotone in terms of the REML log-likelihood. Problems with non-monotonic convergence may be experienced when fitting more complex R-structures or G-structures, especially for small data-sets where there is less information on the variance parameters. Such models include the unstructured, ante-dependence and factor analytic (FA) models. Stein (1999) demonstrated that convergence may be problematic when fitting higher order (i.e. $k > 1$) FA models. The problem is exacerbated by the difficulty of obtaining reliable starting values for variance parameters when analysing highly unbalanced data-sets or when fitting complex variance models. Another problem with the AI algorithm (and in general with all derivative-based algorithms) is the difficulty of ensuring that the variance parameters remain within the parameter space during iterations; this becomes more difficult for complex variance models.

The EM (Expectation-Maximisation) algorithm (Dempster et al., 1977) has been widely used for REML estimation of variance parameters in mixed models. Meng and van Dyk (1997) argue that this popularity stems from its computational simplicity, numerical stability and reliable convergence. The EM algorithm convergence sequence is monotonic, and the variance parameters at each iteration remain within the parameter space. The major disadvantage of the EM algorithm is its slow rate of convergence. Gilmour et al. (1995) showed that for several different variance models the FS and AI algorithms converge in fewer iterations than the EM algorithm. They also showed that the computing time required for each iteration of the AI algorithm was similar to that required for each iteration of the EM algorithm. Their study considered the basic EM algorithm, but there have been several more recent enhancements to the EM algorithm aimed at improving its rate of convergence. Meng and van Dyk (1998) used a Cholesky decomposition for the variance parameters in random coefficient regression models, which improved the rate of convergence of the basic EM algorithm. This principle was further extended by Liu et al. (1998) with the introduction of the parameter expanded or PX-EM algorithm, in which a rescaling is embedded into the iterative scheme. Foulley and van Dyk (2000) reviewed the EM and PX-EM algorithms and compared the convergence rates of the EM, ECME (Liu and Rubin, 1994) and PX-EM algorithms for the analysis of three data-sets. Their results indicated that the PX-EM algorithm generally converged faster than the EM and ECME algorithms, though convergence still required a large number of iterations (60-100). The data-sets used by Foulley and van Dyk (2000) were very small, and their study did not include a direct comparison with derivative-based schemes.

Stein (1999) introduced the concept of a composite EM-AI algorithm for the fitting of FA models, illustrating that the convergence behaviour of the AI algorithm could be significantly improved by commencing with a few iterations of the EM algorithm, then switching to the AI algorithm once the iterative values of the variance parameters were closer to the REML estimates. Similarly, Pinheiro and Bates (2000) suggest the use of a composite algorithm: they use 20 EM iterates before switching to Newton-Raphson (NR) iterates. This algorithm is used in their S-PLUS mixed models software, lme.

In this appendix we present the derivation of the AI algorithm as well as the EM and PX-EM algorithms for REML estimation in mixed models. The REML-EM algorithm is derived using the concept of the REML construct (see section 5.4.1).

A.2 Gradient methods: Average Information algorithm

In general the score equations (5.2.8) require a numerical solution. An algebraic solution is available for the overall scale parameter $\sigma^2_H$: given the $m$th iterate for $\kappa$, denoted $\kappa^{(m)}$,

$\sigma^{2(m)}_H = y' P^{(m)} y / (n - p)$   (1.2.1)

where $P^{(m)}$ is $P$ with $\kappa$ replaced by $\kappa^{(m)}$, and recall that $P$ is given by

$P = H^{-1} - H^{-1} X (X' H^{-1} X)^{-1} X' H^{-1} = R^{-1} - R^{-1} W C^{-1} W' R^{-1}$

Gradient methods involve the linearisation of the score equations using the first term in a Taylor series expansion. In the following we denote $\psi' = (\sigma^2_H, \kappa')$. Expanding the score equations about the value $\psi^{(m)}$ yields

$U_R(\psi) = U_R(\psi^{(m)}) + \left. \dfrac{\partial U_R(\psi)}{\partial \psi'} \right|_{\psi = \psi^{(m)}} \left( \psi - \psi^{(m)} \right)$   (1.2.2)

Solving the score equations, that is, equating the right hand side to zero, leads to

$\psi^{(m+1)} = \psi^{(m)} - \left[ \left. \dfrac{\partial U_R(\psi)}{\partial \psi'} \right|_{\psi = \psi^{(m)}} \right]^{-1} U_R(\psi^{(m)}) = \psi^{(m)} + \left[ \mathcal{I}_o^{(m)} \right]^{-1} U_R(\psi^{(m)})$   (1.2.3)

where $\mathcal{I}_o^{(m)}$ is the observed information matrix for $\psi$ evaluated at $\psi^{(m)}$. In the iterative scheme $\psi^{(m)}$ is the value from iteration $m$ and the updated value $\psi^{(m+1)}$ is given by the left hand side of (1.2.3). This is known as the Newton-Raphson algorithm. Closely related is the Fisher scoring algorithm, in which the expected information is used in place of the observed information. Since an algebraic form for $\sigma^{2(m)}_H$ exists, we only need an update for $\kappa$. This is given by

$\kappa^{(m+1)} = \kappa^{(m)} + \left[ \mathcal{I}_o^{(m)\,\kappa\kappa} \right] U_R(\kappa^{(m)})$   (1.2.4)

where $\mathcal{I}_o^{(m)\,\kappa\kappa}$ is the portion of $\big[ \mathcal{I}_o^{(m)} \big]^{-1}$ relating to $\kappa$. If $\sigma^2_H$ is fixed (usually to one) then the updating formula for $\kappa$ is

$\kappa^{(m+1)} = \kappa^{(m)} + \left[ \mathcal{I}_o(\kappa, \kappa)^{(m)} \right]^{-1} U_R(\kappa^{(m)})$   (1.2.5)

where $\mathcal{I}_o(\kappa, \kappa)^{(m)}$ is the portion of $\mathcal{I}_o^{(m)}$ relating to $\kappa$.
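Each of the updates (1.2.3)-(1.2.5) follows the same template: solve the relevant information system for a step and add it to the current iterate. The following minimal sketch of that template is ours, not the sparse production strategy of section A.5; it assumes user-supplied callables score and info returning $U_R(\psi)$ and an information matrix (observed, expected or average, as appropriate):

```python
import numpy as np

def scoring_update(psi, score, info, max_iter=20, tol=1e-8):
    """Generic gradient-scheme iteration (1.2.3): psi <- psi + I^{-1} U(psi).

    `score` and `info` are callables returning U_R(psi) and an information
    matrix (observed for Newton-Raphson, expected for Fisher scoring,
    average for the AI algorithm) evaluated at psi.
    """
    for _ in range(max_iter):
        step = np.linalg.solve(info(psi), score(psi))
        psi = psi + step
        if np.max(np.abs(step)) < tol:
            break
    return psi
```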

In the following three results we present the observed, expected and average information matrices for $\psi$, and show how the elements of the average information matrix may be computed using the concept of the "working variate".

Result A.1 The elements of the observed information matrix are given by

$\mathcal{I}_o(\sigma^2_H, \sigma^2_H) = -(n-p)/(2\sigma^4_H) + y'Py/\sigma^6_H$
$\mathcal{I}_o(\sigma^2_H, \kappa_i) = y'P\dot H_iPy/(2\sigma^4_H)$
$\mathcal{I}_o(\kappa_i, \kappa_j) = \tfrac{1}{2}\mathrm{tr}(P\ddot H_{ij}) - \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j) + y'P\dot H_iP\dot H_jPy/\sigma^2_H - y'P\ddot H_{ij}Py/(2\sigma^2_H)$

where $\ddot H_{ij} = \partial^2 H/\partial\kappa_i\partial\kappa_j$.

Proof: Firstly,

$\mathcal{I}_o(\sigma^2_H, \sigma^2_H) = -\dfrac{\partial^2\ell_R}{\partial(\sigma^2_H)^2} = -\dfrac{\partial U_R(\sigma^2_H)}{\partial\sigma^2_H} = \tfrac{1}{2}\,\partial\left\{ (n-p)/\sigma^2_H - y'Py/\sigma^4_H \right\}/\partial\sigma^2_H = -(n-p)/(2\sigma^4_H) + y'Py/\sigma^6_H$   (1.2.6)

and

$\mathcal{I}_o(\sigma^2_H, \kappa_i) = -\dfrac{\partial^2\ell_R}{\partial\sigma^2_H\partial\kappa_i} = \tfrac{1}{2}\,\partial\left\{ (n-p)/\sigma^2_H - y'Py/\sigma^4_H \right\}/\partial\kappa_i = y'P\dot H_iPy/(2\sigma^4_H)$ using appendix ??   (1.2.7)

Lastly,

$\mathcal{I}_o(\kappa_i, \kappa_j) = -\dfrac{\partial^2\ell_R}{\partial\kappa_i\partial\kappa_j} = -\dfrac{\partial U_R(\kappa_i)}{\partial\kappa_j} = \tfrac{1}{2}\left\{ \partial\,\mathrm{tr}(P\dot H_i)/\partial\kappa_j - \partial\left( y'P\dot H_iPy/\sigma^2_H \right)/\partial\kappa_j \right\}$   (1.2.8)

In (1.2.8) first consider

$\dfrac{\partial\,\mathrm{tr}(P\dot H_i)}{\partial\kappa_j} = \mathrm{tr}\left( \dfrac{\partial P}{\partial\kappa_j}\dot H_i + P\dfrac{\partial\dot H_i}{\partial\kappa_j} \right) = \mathrm{tr}\left( -P\dot H_jP\dot H_i + P\ddot H_{ij} \right)$ using appendix ??   (1.2.9)

Then consider

$\dfrac{\partial\, y'P\dot H_iPy}{\partial\kappa_j} = y'\left( \dfrac{\partial P}{\partial\kappa_j}\dot H_iP + P\ddot H_{ij}P + P\dot H_i\dfrac{\partial P}{\partial\kappa_j} \right)y$
$= y'\left( P\ddot H_{ij}P - 2P\dot H_iP\dot H_jP \right)y$   (1.2.10)

Substituting (1.2.9) and (1.2.10) into (1.2.8) gives the result as required. □

Result A.2 The elements of the expected information matrix are given by

$\mathcal{I}_e(\sigma^2_H, \sigma^2_H) = \tfrac{1}{2}(n-p)/\sigma^4_H$
$\mathcal{I}_e(\sigma^2_H, \kappa_i) = \tfrac{1}{2}\mathrm{tr}(P\dot H_i)/\sigma^2_H$
$\mathcal{I}_e(\kappa_i, \kappa_j) = \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j)$

Proof: Using Result A.1 and the expected value of quadratic forms (see section ??),

$\mathcal{I}_e(\sigma^2_H, \sigma^2_H) = E\left( -\dfrac{\partial^2\ell_R}{\partial(\sigma^2_H)^2} \right) = -(n-p)/(2\sigma^4_H) + \mathrm{tr}(PH)/\sigma^4_H + (X\tau)'PX\tau/\sigma^6_H = \tfrac{1}{2}(n-p)/\sigma^4_H$

using $PX = 0$ and $\mathrm{tr}(PH) = n - p$. Also, using Result A.1, we have

$\mathcal{I}_e(\sigma^2_H, \kappa_i) = E\left( -\dfrac{\partial^2\ell_R}{\partial\sigma^2_H\partial\kappa_i} \right) = \mathrm{tr}(P\dot H_i)/(2\sigma^2_H) + (X\tau)'P\dot H_iPX\tau/\sigma^4_H = \tfrac{1}{2}\mathrm{tr}(P\dot H_i)/\sigma^2_H$

Lastly,

$\mathcal{I}_e(\kappa_i, \kappa_j) = E\left( -\dfrac{\partial^2\ell_R}{\partial\kappa_i\partial\kappa_j} \right) = \tfrac{1}{2}\mathrm{tr}(P\ddot H_{ij}) - \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j) + \mathrm{tr}(HP\dot H_iP\dot H_jP) - \tfrac{1}{2}\mathrm{tr}(HP\ddot H_{ij}P)$
$= \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j)$

as required, using $PX = 0$ and $PHP = P$. □

In many applications considered in this book, and in others, either the number of observations ($n$), the dimension of the coefficient matrix of the mixed model equations ($C$) or the number of variance parameters can be large, so that the calculation of the trace terms in the expected (and observed) information matrices can be very computer intensive. To overcome this problem, Gilmour et al. (1995) suggest the use of a matrix with elements given by simplified averages of the corresponding terms of the observed and expected information matrices.

Result A.3 The elements of the average information matrix are given by:

$\mathcal{I}_A(\sigma^2_H, \sigma^2_H) = y'Py/(2\sigma^6_H) = \tfrac{1}{2}\left( HPy/\sigma^2_H \right)'\left( P/\sigma^2_H \right)\left( HPy/\sigma^2_H \right)$   (1.2.11)
$\mathcal{I}_A(\sigma^2_H, \kappa_i) = y'P\dot H_iPy/(2\sigma^4_H) = \tfrac{1}{2}\left( HPy/\sigma^2_H \right)'\left( P/\sigma^2_H \right)\left( \dot H_iPy \right)$   (1.2.12)
$\mathcal{I}_A(\kappa_i, \kappa_j) = \tfrac{1}{2}\left( \dot H_iPy \right)'\left( P/\sigma^2_H \right)\left( \dot H_jPy \right)$   (1.2.13)

Proof: Taking averages of the corresponding observed and expected terms we have

$\mathcal{I}_A(\sigma^2_H, \sigma^2_H) = \tfrac{1}{2}\left\{ -(n-p)/(2\sigma^4_H) + y'Py/\sigma^6_H + (n-p)/(2\sigma^4_H) \right\} = y'Py/(2\sigma^6_H)$

$\mathcal{I}_A(\sigma^2_H, \kappa_i) = \tfrac{1}{2}\left\{ y'P\dot H_iPy/(2\sigma^4_H) + \mathrm{tr}(P\dot H_i)/(2\sigma^2_H) \right\} \simeq y'P\dot H_iPy/(2\sigma^4_H)$

approximating the trace term $\sigma^2_H\,\mathrm{tr}(P\dot H_i)$ by $y'P\dot H_iPy$. Lastly,

$\mathcal{I}_A(\kappa_i, \kappa_j) = \tfrac{1}{2}\left\{ \tfrac{1}{2}\mathrm{tr}(P\ddot H_{ij}) - \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j) + y'P\dot H_iP\dot H_jPy/\sigma^2_H - y'P\ddot H_{ij}Py/(2\sigma^2_H) + \tfrac{1}{2}\mathrm{tr}(P\dot H_iP\dot H_j) \right\}$
$= \tfrac{1}{2}\left\{ \tfrac{1}{2}\mathrm{tr}(P\ddot H_{ij}) - y'P\ddot H_{ij}Py/(2\sigma^2_H) + y'P\dot H_iP\dot H_jPy/\sigma^2_H \right\} \simeq y'P\dot H_iP\dot H_jPy/(2\sigma^2_H)$

approximating $y'P\ddot H_{ij}Py$ by its expectation $\sigma^2_H\,\mathrm{tr}(P\ddot H_{ij})$, as required. □

Note that $\mathcal{I}_A(\kappa_i, \kappa_j)$ is an exact average for simple variance component models in which the variance structure is linear in the parameters $\gamma_i$, that is, in which $H = \sum_i \gamma_i Z_iZ_i'$, since in this case $\ddot H_{ij} = 0$. The matrix $\mathcal{I}_A$ is a scaled residual sums of squares matrix and is given by

$\mathcal{I}_A = \tfrac{1}{2} Q'(P/\sigma^2_H)Q$

where the columns of $Q$ are the so-called working variates. The working variates for $\sigma^2_H$ and $\kappa_i$, $i = 1, \ldots, n_k$, where $n_k$ is the length of $\kappa$, are given by

$q_{\sigma^2_H} = HPy/\sigma^2_H$
$q_i = \dot H_iPy$

Since $PHP = P$, the working variate for $\sigma^2_H$ can equivalently be written as

$q_{\sigma^2_H} = y/\sigma^2_H$

This form is consistent with a general definition of the working variates for any variance parameter in which $\sigma^2_H$ is included in the calculation of $P$. An outline of the computing strategy implemented in ASReml, samm and GENSTAT 5 is given in section A.5.
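To make the working-variate construction concrete, the following sketch computes $\mathcal{I}_A = \tfrac{1}{2}Q'(P/\sigma^2_H)Q$ of Result A.3 directly from a dense $H$ and its derivatives. It is illustrative only; the absorption strategy of section A.5 avoids ever forming $P$ explicitly:

```python
import numpy as np

def ai_matrix(y, X, H, Hdots, sigma2H):
    """Average information matrix I_A = Q'(P/sigma2H)Q / 2 of Result A.3.

    H is the (n x n) variance matrix of y (up to the scale sigma2H) and
    Hdots is a list of its derivatives dH/dkappa_i. Dense textbook sketch;
    ASReml's sparse absorption strategy is the practical alternative.
    """
    Hinv = np.linalg.inv(H)
    XtHinv = X.T @ Hinv
    P = Hinv - XtHinv.T @ np.linalg.solve(XtHinv @ X, XtHinv)
    Py = P @ y
    # working variates: q_{sigma2H} = y/sigma2H (using PHP = P), q_i = Hdot_i P y
    Q = np.column_stack([y / sigma2H] + [Hd @ Py for Hd in Hdots])
    return 0.5 * Q.T @ (P / sigma2H) @ Q
```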

A.2.1 Constraints on variance parameters

There may be a need to impose the linear constraints $T'\kappa = s$, where $T$ is an $n_k \times n_c$ matrix, $s$ is an $n_c \times 1$ vector and $n_c$ is the number of constraints. Note that this is equivalent to transforming $\kappa$ ($n_k \times 1$) to $\alpha$ ($n_t \times 1$, where $n_t = n_k - n_c$) according to

$\kappa = M\alpha + Es$   (1.2.14)

where $M$ and $E$ are $n_k \times n_t$ and $n_k \times n_c$ matrices respectively, such that $T'M = 0$ and $T'E = I_{n_c}$. The REML estimate of $\alpha$ can be obtained in terms of the score and information for the original parameters $\kappa$. First consider the score equations for $\alpha$:

$U_R(\alpha_i) = \dfrac{\partial\ell_R}{\partial\alpha_i} = \sum_{j=1}^{n_k} \dfrac{\partial\ell_R}{\partial\kappa_j}\dfrac{\partial\kappa_j}{\partial\alpha_i}$

From (1.2.14),

$\kappa_j = \sum_{i=1}^{n_t} m_{ji}\alpha_i + \sum_{l=1}^{n_c} e_{jl}s_l$

where $m_{ji}$ is the $(j,i)$th element of $M$ and $e_{jl}$ is the $(j,l)$th element of $E$. Thus $\partial\kappa_j/\partial\alpha_i = m_{ji}$, so that

$U_R(\alpha_i) = \sum_{j=1}^{n_k} U_R(\kappa_j)\, m_{ji} \quad \Rightarrow \quad U_R(\alpha) = M'U_R(\kappa)$

Elements of the observed information matrix for $\alpha$ are given by

$\mathcal{I}_o(\alpha_i, \alpha_{i'}) = -\dfrac{\partial^2\ell_R}{\partial\alpha_i\partial\alpha_{i'}} = -\sum_{k=1}^{n_k} \dfrac{\partial U_R(\alpha_i)}{\partial\kappa_k}\dfrac{\partial\kappa_k}{\partial\alpha_{i'}} = -\sum_{k=1}^{n_k}\sum_{j=1}^{n_k} \dfrac{\partial^2\ell_R}{\partial\kappa_k\partial\kappa_j}\dfrac{\partial\kappa_j}{\partial\alpha_i}\dfrac{\partial\kappa_k}{\partial\alpha_{i'}} = \sum_{k=1}^{n_k}\sum_{j=1}^{n_k} \mathcal{I}_o(\kappa_k, \kappa_j)\, m_{ji}m_{ki'}$

$\Rightarrow \quad \mathcal{I}_o(\alpha) = M'\mathcal{I}_o(\kappa)M$

The expected information matrix is then given by $\mathcal{I}_e(\alpha) = E\{M'\mathcal{I}_o(\kappa)M\} = M'\mathcal{I}_e(\kappa)M$, and the average information matrix by $\mathcal{I}_A(\alpha) = M'\mathcal{I}_A(\kappa)M$. Finally, updates for the estimation of $\alpha$ can be obtained as

$\alpha^{(m+1)} = \alpha^{(m)} + \left[ \mathcal{I}_A(\alpha^{(m)}) \right]^{-1} U_R(\alpha^{(m)}) = \alpha^{(m)} + \left[ M'\mathcal{I}_A(\kappa^{(m)})M \right]^{-1} M'U_R(\kappa^{(m)})$
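As a sketch of how this might be coded, a basis $M$ for (1.2.14) can be obtained as a null-space basis of $T'$ via the SVD; the update for $\alpha$ then maps back to $\kappa$ directly, since $\kappa^{(m+1)} - \kappa^{(m)} = M(\alpha^{(m+1)} - \alpha^{(m)})$. Here U_kappa and I_kappa denote $U_R(\kappa)$ and $\mathcal{I}_A(\kappa)$ evaluated at the current iterate:

```python
import numpy as np

def constrained_update(kappa, U_kappa, I_kappa, T):
    """One update of kappa subject to T'kappa = s, via the transformation
    (1.2.14). M spans the null space of T' (so T'M = 0), hence the step
    M @ alpha_step automatically preserves the constraints.
    """
    _, sv, Vt = np.linalg.svd(T.T)
    rank = int(np.sum(sv > 1e-10))
    M = Vt[rank:].T                     # n_k x n_t null-space basis of T'
    step = M @ np.linalg.solve(M.T @ I_kappa @ M, M.T @ U_kappa)
    return kappa + step
```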

A.3 EM Algorithm

The Expectation-Maximisation (EM) algorithm is a widely applicable algorithm for estimation in many settings. Special forms of the algorithm had been proposed for some time before Dempster et al. (1977) presented a unifying and formal account. McLachlan and Krishnan (1997) give a thorough account of the EM algorithm and its extensions. The EM algorithm is particularly well suited to estimation in linear mixed models: the random component $u$ is not observed, but if it were then ML (or REML) estimation would be greatly simplified. The EM algorithm has been used for estimation in many applications of the linear mixed model; for example, Laird and Ware (1982) suggest its use for the analysis of longitudinal data using random coefficient regression. In this section we present the EM algorithm for estimation in the linear mixed model. There are limitations, however, for models that are not linear in the parameters, and these will not be considered. Foulley et al. (2000) consider extending the basic EM algorithm for estimation of $\phi$ in the analysis of longitudinal data, but this will not be considered here.

Iterations of the REML-EM algorithm require the conditional distributions summarised in results 5.9 and 5.10, in which we replace the vector of variance parameters by the current iterates $\sigma^{2(m)}_H$ and $\kappa^{(m)}$. We write this as

$u \mid y_2, \psi = \psi^{(m)} \sim N\left( \tilde u^{(m)}, \sigma^{2(m)}_H C^{ZZ(m)} \right)$   (1.3.15)

Similarly we have

$e \mid y_2, \psi = \psi^{(m)} \sim N\left( \tilde e^{(m)}, \sigma^{2(m)}_H WC^{-1(m)}W' \right)$   (1.3.16)

A.3.1 REML-EM algorithm

In the following we develop the REML-EM algorithm for obtaining the REML estimates of the variance parameters $\sigma^2_H$ and $\kappa$. We note, however, that for $\kappa = \kappa^{(m)}$, direct maximisation of the residual likelihood with respect to $\sigma^2_H$ leads to

$\sigma^{2(m+1)}_H = y'P^{(m)}y/(n - p)$

This can be achieved without recourse to the "missing data", $u$. Each iteration of the EM algorithm consists of two steps, the expectation (E) step and the maximisation (M) step. For mixed models the algorithm splits into two parts: one for $\gamma$ and one for $(\sigma^2, \phi')$. The E-step involves evaluating the conditional expectation of the joint likelihood of the so-called complete data $(u, y)$, given the observed part of the data which relates to the estimation of the variance parameters, namely $y_2$. This conditional expectation is evaluated at the current iterates $\sigma^{2(m)}_H$ and $\kappa^{(m)}$. The M-step involves maximisation of this quantity with respect to $\kappa$. The joint likelihood of $(u, y)$ (though we recall that it is not a likelihood), ignoring constants and evaluated at $\sigma^2_H = \sigma^{2(m)}_H$, is

$\ell_c(\kappa) = \log f_Y(y \mid u;\, \tau, \sigma^2_H = \sigma^{2(m)}_H, \kappa) + \log f_U(u;\, \sigma^2_H = \sigma^{2(m)}_H, \kappa)$
$= -\tfrac{1}{2}\left\{ (n + b)\log\sigma^{2(m)}_H + \log|R| + e'R^{-1}e/\sigma^{2(m)}_H + \log|G| + u'G^{-1}u/\sigma^{2(m)}_H \right\}$   (1.3.17)

The expected value of this joint likelihood conditional on $y_2$ and evaluated at the current iterate $\psi^{(m)\prime} = (\sigma^{2(m)}_H, \kappa^{(m)\prime})$ is therefore given by

$\ell_{ce}(\kappa)^{(m)} = E\left\{ \ell_c(\kappa) \mid y_2, \psi = \psi^{(m)} \right\} = \ell_{ce}(\gamma)^{(m)} + \ell_{ce}(\sigma^2, \phi)^{(m)}$, say,

where

$\ell_{ce}(\gamma)^{(m)} = -\tfrac{1}{2} E\left\{ b\log\sigma^{2(m)}_H + \log|G| + u'G^{-1}u/\sigma^{2(m)}_H \mid y_2, \psi = \psi^{(m)} \right\}$
$\ell_{ce}(\phi)^{(m)} = -\tfrac{1}{2} E\left\{ n\log\sigma^{2(m)}_H + \log|R| + e'R^{-1}e/\sigma^{2(m)}_H \mid y_2, \psi = \psi^{(m)} \right\}$

Using (1.3.15) and the results in section ??,

$E\left( u'G^{-1}u \mid y_2, \psi = \psi^{(m)} \right) = \sigma^{2(m)}_H \mathrm{tr}\left( G^{-1}C^{ZZ(m)} \right) + \tilde u^{(m)\prime}G^{-1}\tilde u^{(m)}$

and so

$\ell_{ce}(\gamma)^{(m)} = -\tfrac{1}{2}\left\{ b\log\sigma^{2(m)}_H + \log|G| + \mathrm{tr}(G^{-1}C^{ZZ(m)}) + \tilde u^{(m)\prime}G^{-1}\tilde u^{(m)}/\sigma^{2(m)}_H \right\}$   (1.3.18)

Similarly, using (1.3.16),

$E\left( e'R^{-1}e \mid y_2, \psi = \psi^{(m)} \right) = \sigma^{2(m)}_H \mathrm{tr}\left( R^{-1}WC^{-1(m)}W' \right) + \tilde e^{(m)\prime}R^{-1}\tilde e^{(m)}$

and so

$\ell_{ce}(\phi)^{(m)} = -\tfrac{1}{2}\left\{ n\log\sigma^{2(m)}_H + \log|R| + \mathrm{tr}(R^{-1}WC^{-1(m)}W') + \tilde e^{(m)\prime}R^{-1}\tilde e^{(m)}/\sigma^{2(m)}_H \right\}$   (1.3.19)

This is the E-step. The M-step involves maximisation of (1.3.18) and (1.3.19) with respect to $\gamma$ and $(\sigma^2, \phi)$ respectively. Thus differentiating (1.3.18) with respect to $\gamma_i$ gives

$\dfrac{\partial\ell_{ce}(\gamma)^{(m)}}{\partial\gamma_i} = -\tfrac{1}{2}\left\{ \mathrm{tr}(G^{-1}\dot G_i) - \mathrm{tr}(G^{-1}\dot G_iG^{-1}C^{ZZ(m)}) - \tilde u^{(m)\prime}G^{-1}\dot G_iG^{-1}\tilde u^{(m)}/\sigma^{2(m)}_H \right\}$   (1.3.20)

where $\dot G_i = \partial G/\partial\gamma_i$. Similarly for $\phi_i$,

$\dfrac{\partial\ell_{ce}(\sigma^2, \phi)^{(m)}}{\partial\phi_i} = -\tfrac{1}{2}\left\{ \mathrm{tr}(\dot\Sigma_i\Sigma^{-1}) - \mathrm{tr}(\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}WC^{-1(m)}W')/\sigma^2 - \tilde e^{(m)\prime}\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}\tilde e^{(m)}/(\sigma^2\sigma^{2(m)}_H) \right\}$   (1.3.21)

where $\dot\Sigma_i = \partial\Sigma/\partial\phi_i$, and for $\sigma^2$,

$\dfrac{\partial\ell_{ce}(\sigma^2, \phi)^{(m)}}{\partial\sigma^2} = -\tfrac{1}{2}\left\{ n/\sigma^2 - \mathrm{tr}(\Sigma^{-1}WC^{-1(m)}W')/\sigma^4 - \tilde e^{(m)\prime}\Sigma^{-1}\tilde e^{(m)}/(\sigma^4\sigma^{2(m)}_H) \right\}$   (1.3.22)

The specific form of the updating formulae depends on the variance models specified for $G$ and $R$. Simple updating formulae can be derived for variance component models and also for unstructured models. As an illustration, we now present updating formulae for the one-way classification, which can then be easily extended to models with several terms whose G-structure is a scaled identity, such as those used in chapter 3.

A.3.2 REML-EM for the one-way classification

For illustrative purposes we consider implementation of the REML-EM algorithm for the one-way classification, in which we parameterise the variance model in terms of variance components by setting $\sigma^2_H = 1$. We present the updating formula for the variance component ratio at the end of this section. Recall that for the one-way classification we assume

$\mathrm{var}(e) = R = \sigma^2 I_n$ and $\mathrm{var}(u) = G = \sigma^2_u I_b$

so that $\gamma = \sigma^2_u$ and $\phi$ is the null vector. Hence $\dot G_i = I_b$ and (1.3.20) becomes

$\dfrac{\partial\ell_{ce}(\sigma^2_u)^{(m)}}{\partial\sigma^2_u} = -\tfrac{1}{2}\left\{ b/\sigma^2_u - \mathrm{tr}(C^{ZZ(m)})/\sigma^4_u - \tilde u^{(m)\prime}\tilde u^{(m)}/\sigma^4_u \right\}$

Equating this to zero gives the $(m+1)$th iterate as

$\sigma^{2(m+1)}_u = \left\{ \mathrm{tr}(C^{ZZ(m)}) + \tilde u^{(m)\prime}\tilde u^{(m)} \right\}/b$

Similarly, from (1.3.22),

$\dfrac{\partial\ell_{ce}(\sigma^2)^{(m)}}{\partial\sigma^2} = -\tfrac{1}{2}\left\{ n/\sigma^2 - \mathrm{tr}(WC^{-1(m)}W')/\sigma^4 - \tilde e^{(m)\prime}\tilde e^{(m)}/\sigma^4 \right\}$

Equating this to zero gives the $(m+1)$th iterate as

$\sigma^{2(m+1)} = \left\{ \mathrm{tr}(WC^{-1(m)}W') + \tilde e^{(m)\prime}\tilde e^{(m)} \right\}/n$   (1.3.23)

To simplify (1.3.23) we recall that $C = W'R^{-1}W + G^*$, hence for the one-way classification $C^{(m)} = W'W/\sigma^{2(m)} + G^{*(m)}$, where

$G^* = \begin{bmatrix} 0 & 0 \\ 0 & I_b/\sigma^2_u \end{bmatrix}$

Thus

$\mathrm{tr}(WC^{-1(m)}W') = \mathrm{tr}(W'WC^{-1(m)}) = \sigma^{2(m)}\,\mathrm{tr}\left( (C^{(m)} - G^{*(m)})C^{-1(m)} \right)$
$= \sigma^{2(m)}\left\{ (b + 1) - \mathrm{tr}(C^{ZZ(m)})/\sigma^{2(m)}_u \right\}$   (1.3.24)

since here $p = 1$. Substituting (1.3.24) into (1.3.23) gives the $(m+1)$th update for $\sigma^2$ as

$\sigma^{2(m+1)} = \left\{ (b+1)\sigma^{2(m)} - \sigma^{2(m)}\,\mathrm{tr}(C^{ZZ(m)})/\sigma^{2(m)}_u + \tilde e^{(m)\prime}\tilde e^{(m)} \right\}/n$   (1.3.25)

If $\sigma^2_H \neq 1$, which implies that $\sigma^2 = 1$, then we only require an updating formula for the variance component ratio $\gamma_u = \sigma^2_u/\sigma^2_H$. This is given by

$\gamma^{(m+1)}_u = \left\{ \mathrm{tr}(C^{ZZ(m)}) + \tilde u^{(m)\prime}\tilde u^{(m)}/\sigma^{2(m)}_H \right\}/b$   (1.3.26)
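As a concrete illustration, the sketch below iterates the one-way updates just derived (with $\sigma^2_H = 1$): the MME with coefficient matrix $C = W'W/\sigma^2 + G^*$ are solved for $(\hat\tau', \tilde u')'$, and $\sigma^2_u$ and $\sigma^2$ are then refreshed from $\mathrm{tr}(C^{ZZ(m)})$, $\mathrm{tr}(WC^{-1(m)}W')$, $\tilde u^{(m)}$ and $\tilde e^{(m)}$. This is a dense illustration of the algebra, not an efficient implementation:

```python
import numpy as np

def em_oneway(y, X, Z, s2u, s2e, n_iter=200):
    """REML-EM for the one-way classification (sigma2_H = 1).

    Each pass solves the mixed model equations, then applies the updates
    for sigma2_u and sigma2 derived above ((1.3.23) and the sigma2_u update).
    """
    n, b, p = len(y), Z.shape[1], X.shape[1]
    W = np.hstack([X, Z])
    for _ in range(n_iter):
        Gstar = np.zeros((p + b, p + b))
        Gstar[p:, p:] = np.eye(b) / s2u
        C = W.T @ W / s2e + Gstar            # coefficient matrix of the MME
        Cinv = np.linalg.inv(C)
        beta = Cinv @ (W.T @ y) / s2e        # (tau-hat', u-tilde')'
        u = beta[p:]
        e = y - W @ beta
        s2u = (np.trace(Cinv[p:, p:]) + u @ u) / b
        s2e = (np.trace(W @ Cinv @ W.T) + e @ e) / n
    return s2u, s2e
```

The monotone behaviour of the EM log-likelihood sequence makes this loop a useful source of starting values for the AI iteration, in the spirit of the composite schemes described in section A.1.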

A.3.3 REML-EM for variance components models

In this section we present the updating formulae for variance components models. Recall the general form

$G = \oplus_{i=1}^{q} G_i$

for which there is a corresponding partition of $Z$, namely $Z = [Z_1\ Z_2\ \ldots\ Z_q]$. The coefficient matrix of the mixed model equations (and its inverse) are partitioned conformably with the partitioning of $G$. Finally, for variance components models, such as the variance models for the split plot example and the balanced incomplete block design, each $G_i$ is a scaled identity matrix. Thus, for the split plot example,

$G = \oplus_{i=1}^{2} G_i = \begin{bmatrix} \sigma^2_b I_4 & 0 \\ 0 & \sigma^2_w I_8 \end{bmatrix}$

The updating formula for $\sigma^2_i$, $i = b, w$, is

$\sigma^{2(m+1)}_i = \left\{ \mathrm{tr}(C^{Z_iZ_i(m)}) + \tilde u_i^{(m)\prime}\tilde u_i^{(m)} \right\}/b_i$

A.4 PX-EM - an improved EM algorithm

Although the EM algorithm has been widely used for the estimation of variance parameters in the linear mixed model, it can be slow to converge. This is particularly a problem when the estimates of the variance parameters are on or near the boundary of the parameter space (Laird and Ware, 1982). Furthermore, Foulley and van Dyk (2000) suggest that biometricians working in animal breeding have been among the largest users of the EM algorithm, but note that the EM algorithm can be very slow to converge in these applications due to the relative magnitude of some of the variance components. To improve the rate of convergence of the EM algorithm, Liu et al. (1998) introduced the parameter expanded EM or PX-EM algorithm. In the case of linear mixed models, the algorithm involves a re-scaling of the random effects for simple variance components models, or a rotation of the random effects for unstructured G-structures. In this section we briefly review the PX-EM algorithm and illustrate its application in a simple example. For a more thorough review the reader is referred to Foulley and van Dyk (2000) and Liu et al. (1998).

A.4.1 Definition of PX-EM

The PX-EM algorithm assumes that the parameter vector $\kappa$ can be expanded to a larger set of parameters $\Gamma' = (\kappa^{*\prime}, \lambda')$, where $\lambda$ is a "working" parameter. The expanded parameterisation must satisfy the following two conditions:

1. it can be reduced to the original parameterisation $\kappa$, maintaining the same data model, via a many-to-one reduction $\kappa = F(\Gamma)$;

2. when $\lambda$ is set to its "null" value, say $\lambda_0$, it induces the same complete data model as with $\kappa = \kappa^*$.

Once the expanded parameter set has been set up, the PX-EM algorithm proceeds in a similar fashion to the EM algorithm, with an E-step and an M-step. The PX-E step computes the conditional expectation of the joint density of the complete data given $y_2$, the so-called observed data, with $\Gamma^{(m)\prime}$ set to $(\kappa^{*(m)\prime}, \lambda_0')$. The PX-M step then maximises this conditional expectation with respect to the expanded parameters, and $\kappa^{(m)}$ is updated via the reduction $\kappa^{(m+1)} = F(\Gamma^{(m+1)})$. In the next section we illustrate the PX-EM algorithm for linear mixed models with simple variance components structures. For ease of presentation we consider only one variance component. Iterations of the REML-PX-EM algorithm require an additional result on conditional distributions, similar to those summarised in results 5.9 and 5.10, in which we replace the vector of variance parameters by the current iterates $\sigma^{2(m)}_H$ and $\kappa^{(m)}$. We summarise this in the result below.

Result A.4 The joint distribution of $u$ and $y^* = y - X\tau$ given $y_2$ and evaluated at the current iterates $\sigma^{2(m)}_H$ and $\kappa^{(m)}$ is Gaussian with

$E\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2, \psi = \psi^{(m)} \right\} = \begin{bmatrix} \tilde u^{(m)} \\ \hat y^{*(m)} \end{bmatrix}$

$\mathrm{var}\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2, \psi = \psi^{(m)} \right\} = \sigma^{2(m)}_H \begin{bmatrix} C^{ZZ(m)} & -C^{ZX(m)}X' \\ -XC^{XZ(m)} & XC^{XX(m)}X' \end{bmatrix}$

where $\hat y^* = y - X\hat\tau$.

Proof: We consider the joint distribution of $(y_2', u', y^{*\prime})'$, which is Gaussian with mean zero and variance matrix

$\sigma^2_H \begin{bmatrix} L_2'HL_2 & L_2'ZG & L_2'H \\ GZ'L_2 & G & GZ' \\ HL_2 & ZG & H \end{bmatrix}$

Hence, given $\psi$,

$E\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2 \right\} = 0 + \begin{bmatrix} GZ' \\ H \end{bmatrix} L_2(L_2'HL_2)^{-1}L_2'y = \begin{bmatrix} GZ'Py \\ HPy \end{bmatrix} = \begin{bmatrix} \tilde u \\ \hat y^* \end{bmatrix}$

using $P = L_2(L_2'HL_2)^{-1}L_2'$, so that

$E\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2, \psi = \psi^{(m)} \right\} = \begin{bmatrix} \tilde u^{(m)} \\ \hat y^{*(m)} \end{bmatrix}$

Similarly, given $\psi$,

$\mathrm{var}\left\{ \begin{bmatrix} u \\ y^* \end{bmatrix} \,\middle|\, y_2 \right\} = \sigma^2_H \left( \begin{bmatrix} G & GZ' \\ ZG & H \end{bmatrix} - \begin{bmatrix} GZ' \\ H \end{bmatrix} L_2(L_2'HL_2)^{-1}L_2' \begin{bmatrix} ZG & H \end{bmatrix} \right) = \sigma^2_H \begin{bmatrix} G - GZ'PZG & GZ' - GZ'PH \\ ZG - HPZG & H - HPH \end{bmatrix}$

Now, using result 5.6, we have

$G - GZ'PZG = C^{ZZ}$
$H - HPH = H\left( H^{-1} - H^{-1}X(X'H^{-1}X)^{-1}X'H^{-1} \right)H - H + 2X(X'H^{-1}X)^{-1}X' - X(X'H^{-1}X)^{-1}X' = X(X'H^{-1}X)^{-1}X' = XC^{XX}X'$

and

$GZ' - GZ'PH = GZ'H^{-1}X(X'H^{-1}X)^{-1}X' = -C^{ZX}X'$

proving the result. □

A.4.2 REML-PX-EM for variance components models

In this section we illustrate the implementation of the REML-PX-EM algorithm for a simple variance components model with one random factor. Recall (4.1.1), which can be rewritten as

$y = X\tau + \lambda Zf + e$   (1.4.27)

where $u = \lambda f$ and $\mathrm{var}(u) = \sigma^2_H\gamma_u I_b$, $\mathrm{var}(f) = \sigma^2_H d\, I_b$, so that $\gamma_u = \lambda^2 d$. Also $\mathrm{var}(e) = \sigma^2_H I_n$ and $\sigma^2$ is set to one. The reduced parameter vector is $\kappa = \gamma_u$, while the expanded variance parameter model is $\Gamma' = [\kappa^*\ \lambda]$, where $\kappa^* = d$. The role of the extra parameter $\lambda$ is simply to re-scale the random effects. Note that the null value $\lambda = 1$ results in the same variance model parameterisation as the reduced variance parameter model. The joint likelihood of the complete data $(f', y')'$, evaluated at $\sigma^2_H = \sigma^{2(m)}_H$ and ignoring constants, is

$\ell_c(\Gamma;\, f, y) = -\tfrac{1}{2}\left\{ (n + b)\log\sigma^{2(m)}_H + e'e/\sigma^{2(m)}_H + b\log d + f'f/(\sigma^{2(m)}_H d) \right\}$

where $e = y - X\tau - \lambda Zf$. The expected complete joint likelihood is therefore given by

$\ell_{ce}(\Gamma)^{(m)} = E\left\{ \ell_c \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\} = \ell_{ce}(d)^{(m)} + \ell_{ce}(\lambda)^{(m)}$, say,

where

$\ell_{ce}(d)^{(m)} = -\tfrac{1}{2} E\left\{ b\log d + f'f/(\sigma^{2(m)}_H d) \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\}$
$\ell_{ce}(\lambda)^{(m)} = -\tfrac{1}{2} E\left\{ n\log\sigma^{2(m)}_H + e'e/\sigma^{2(m)}_H \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\}$

It follows that, since

$E\left( f'f \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right) = \tilde u^{(m)\prime}\tilde u^{(m)} + \sigma^{2(m)}_H\,\mathrm{tr}(C^{ZZ(m)})$

the $(m+1)$th iterate for $d$ is given by

$d^{(m+1)} = \left\{ \tilde u^{(m)\prime}\tilde u^{(m)}/\sigma^{2(m)}_H + \mathrm{tr}(C^{ZZ(m)}) \right\}/b$

This is identical to the updating formula for $\gamma_u$ given in (1.3.26). Next we turn to obtaining an updating formula for $\lambda$, which is obtained by maximising $\ell_{ce}(\lambda)^{(m)}$. Rather than compute the expected value and then differentiate, it is more convenient to exchange the expectation and differentiation operators. We have

$\dfrac{\partial}{\partial\lambda}\, e'e = \dfrac{\partial}{\partial\lambda}\, (y - X\tau - \lambda Zf)'(y - X\tau - \lambda Zf) = -2f'Z'(y - X\tau - \lambda Zf)$

Hence

$\dfrac{\partial}{\partial\lambda}\, \ell_{ce}(\lambda)^{(m)} = E\left\{ f'Z'(y - X\tau - \lambda Zf) \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\}/\sigma^{2(m)}_H$

and setting this to zero gives

$\lambda^{(m+1)} = E\left\{ f'Z'(y - X\tau) \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\} \big/ E\left\{ f'Z'Zf \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\}$

Using Result A.4 it follows that

$E\left\{ f'Z'(y - X\tau) \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\} = \tilde u^{(m)\prime}Z'\hat y^{*(m)} - \sigma^{2(m)}_H\,\mathrm{tr}(Z'XC^{XZ(m)})$

$E\left\{ f'Z'Zf \mid y_2, \Gamma = (\kappa^{(m)*\prime}, \lambda = 1) \right\} = \tilde u^{(m)\prime}Z'Z\tilde u^{(m)} + \sigma^{2(m)}_H\,\mathrm{tr}(Z'ZC^{ZZ(m)})$

Finally we reduce these parameters to the parameter of interest, viz

$\gamma^{(m+1)}_u = \left( \lambda^{(m+1)} \right)^2 d^{(m+1)}$

The above equations can be simply modified for variance components by setting $\sigma^2_H = 1$ and including $\sigma^2$.
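A single PX-EM iteration for this model differs from the corresponding EM iteration only by the rescaling step: the EM-style update for $d$ is combined with $\lambda^{(m+1)}$, computed from the conditional expectations of Result A.4, and the reduction $\gamma_u^{(m+1)} = (\lambda^{(m+1)})^2 d^{(m+1)}$ is applied. The sketch below continues the one-way example above in the variance components parameterisation ($\sigma^2_H = 1$, with $\sigma^2$ included); the translation of Result A.4 to this parameterisation is our assumption:

```python
import numpy as np

def px_em_step(y, X, Z, s2u, s2e):
    """One REML-PX-EM iteration for the one-way model (sigma2_H = 1).

    Conditional moments follow Result A.4 (with sigma2_H = 1), so
    E(f'Z'(y - X tau) | y2) and E(f'Z'Zf | y2) use the C^{XZ} and C^{ZZ}
    partitions of the inverse MME coefficient matrix.
    """
    n, b, p = len(y), Z.shape[1], X.shape[1]
    W = np.hstack([X, Z])
    Gstar = np.zeros((p + b, p + b))
    Gstar[p:, p:] = np.eye(b) / s2u
    Cinv = np.linalg.inv(W.T @ W / s2e + Gstar)
    beta = Cinv @ (W.T @ y) / s2e
    tau, u = beta[:p], beta[p:]
    ystar = y - X @ tau
    e = y - W @ beta
    Czz, Cxz = Cinv[p:, p:], Cinv[:p, p:]
    d = (u @ u + np.trace(Czz)) / b                   # EM-style update for d
    num = u @ Z.T @ ystar - np.trace(Z.T @ X @ Cxz)   # E(f'Z'(y - X tau) | y2)
    den = u @ Z.T @ Z @ u + np.trace(Z.T @ Z @ Czz)   # E(f'Z'Zf | y2)
    lam = num / den
    s2u = lam**2 * d                                  # reduction: gamma_u
    s2e = (np.trace(W @ Cinv @ W.T) + e @ e) / n
    return s2u, s2e
```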

A.5 Computational Implementation

In this final section of the appendix we present details of a computing algorithm for REML estimation of variance parameters in the linear mixed model. As a byproduct of this algorithm we obtain the BLUEs and BLUPs of the fixed and random effects respectively, as solutions to the mixed model equations. The key reference is Gilmour et al. (1995), who describe the algorithm in detail; we give a brief sketch of the computational implementation here. The algorithm forms the basis of the REML estimation procedures in GENSTAT 5, ASReml and samm.

A.5.1 Basic Results

The following results provide computationally convenient forms for the residual log-likelihood and the REML score equations.

Result A.5 The residual log-likelihood in (5.2.7) can also be written as

$\ell_R = -\tfrac{1}{2}\left\{ (n-p)\log\sigma^2_H + \log|G| + \log|R| + \log|C| + y'Py/\sigma^2_H \right\}$

Proof: Use the identity

$|C| = |C_{ZZ}|\,|C_{XX} - C_{XZ}C_{ZZ}^{-1}C_{ZX}| = |C_{ZZ}|\,|(C^{XX})^{-1}|$

so that

$|C| = |Z'R^{-1}Z + G^{-1}|\,|X'H^{-1}X|$
$= |G^{-1}|\,|I_n + R^{-1}ZGZ'|\,|X'H^{-1}X|$ using result ??
$= |G^{-1}|\,|R^{-1}|\,|R + ZGZ'|\,|X'H^{-1}X|$
$= |G^{-1}|\,|R^{-1}|\,|H|\,|X'H^{-1}X|$

Use of the above completes the proof as required. □

Recall that the vector of variance parameters is $\kappa' = (\gamma', \sigma^2, \phi')$. The vector $\gamma$ holds the so-called G-structure parameters, while $\sigma^2$ and the vector $\phi$ are the R-structure parameters. For example, in the analysis of the split plot design presented in chapter 3, $\gamma$ comprised $\sigma^2_b$ and $\sigma^2_w$, the parameter $\sigma^2$ is the residual variance, the vector $\phi$ is null and $\sigma^2_H$ is set to one. This is not the usual parameterisation: usually $\sigma^2$ is set to one, $\sigma^2_H$ is estimated, and hence $\gamma$ comprises variance component ratios, that is, $\gamma_i = \sigma^2_i/\sigma^2_H$, $i = b, w$. The following results expand the general formulation of the REML score equation for a generic variance parameter $\kappa_i$, given in (5.2.8).

Result A.6 The REML score for a variance parameter $\phi_i$ associated with the errors $e$ is given by

$U_R(\phi_i) = -\tfrac{1}{2}\left\{ \mathrm{tr}(\Sigma^{-1}\dot\Sigma_i) - \mathrm{tr}(C^{-1}W'\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}W)/\sigma^2 - \tilde e'\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}\tilde e/(\sigma^2\sigma^2_H) \right\}$

Result A.6 The REML score for a variance parameter φi associated with the errors e is given by n     1 −1 ˙ −1 0 −1 ˙ −1 2 UR(φi) = − 2 tr Σ Σi − tr C W Σ ΣiΣ W /σ − o e˜0Σ−1Σ˙ Σ−1e˜/(σ2σ2 ) i H COMPUTATIONAL IMPLEMENTATION 301 where Σ˙ i = ∂Σ/∂φi

Proof: Note that $\partial H/\partial\phi_i = \sigma^2\dot\Sigma_i$ for a variance parameter $\phi_i$. Then, using the results on differentiation in section ??,

$U_R(\phi_i) = -\tfrac{1}{2}\left\{ \mathrm{tr}(P\dot H_i) - y'P\dot H_iPy/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}\left( (R^{-1} - R^{-1}WC^{-1}W'R^{-1})\dot R_i \right) - y'PRR^{-1}\dot R_iR^{-1}RPy/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}(R^{-1}\dot R_i) - \mathrm{tr}(C^{-1}W'R^{-1}\dot R_iR^{-1}W) - \tilde e'R^{-1}\dot R_iR^{-1}\tilde e/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}(\Sigma^{-1}\dot\Sigma_i) - \mathrm{tr}(C^{-1}W'\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}W)/\sigma^2 - \tilde e'\Sigma^{-1}\dot\Sigma_i\Sigma^{-1}\tilde e/(\sigma^2\sigma^2_H) \right\}$

as required, where $\dot R_i = \sigma^2\dot\Sigma_i$ and $\tilde e = RPy$. □

Result A.7 The REML score for the variance parameter $\sigma^2$ associated with the errors $e$ is given by

$U_R(\sigma^2) = -\tfrac{1}{2}\left\{ n/\sigma^2 - \mathrm{tr}(C^{-1}W'\Sigma^{-1}W)/\sigma^4 - \tilde e'\Sigma^{-1}\tilde e/(\sigma^4\sigma^2_H) \right\}$

Proof: The result follows from Result A.6 by noting that $\partial H/\partial\sigma^2 = \Sigma$. □

Result A.8 The REML score for a variance parameter $\gamma_{ij}$ associated with the $i$th random factor $u_i$ is given by

$U_R(\gamma_{ij}) = -\tfrac{1}{2}\left\{ \mathrm{tr}(G_i^{-1}\dot G_{ij}) - \mathrm{tr}(G_i^{-1}\dot G_{ij}G_i^{-1}C^{Z_iZ_i}) - \tilde u_i'G_i^{-1}\dot G_{ij}G_i^{-1}\tilde u_i/\sigma^2_H \right\}$

where $\dot G_{ij} = \partial G_i/\partial\gamma_{ij}$ and $C^{Z_iZ_i}$ is the $b_i \times b_i$ partition of the inverse of $C$ which corresponds to $u_i$.

Proof: Note that $\dot H_{ij} = Z_i\dot G_{ij}Z_i'$ for a variance parameter $\gamma_{ij}$. Define the $(p+b) \times b_i$ matrix $S_i$, where $b = \sum_{i=1}^{q} b_i$, so that $WS_i = Z_i$. Thus $S_i$ contains zeros everywhere except for an identity matrix of order $b_i$ in the partition corresponding to $Z_i$ in $W$. Also define $G^* = C - W'R^{-1}W$. Then the trace term in (5.2.8) is

$\mathrm{tr}(P\dot H_{ij}) = \mathrm{tr}\left( \dot G_{ij}Z_i'(R^{-1} - R^{-1}WC^{-1}W'R^{-1})Z_i \right)$
$= \mathrm{tr}\left( \dot G_{ij}S_i'W'R^{-1}(I_n - WC^{-1}W'R^{-1})WS_i \right)$
$= \mathrm{tr}\left( \dot G_{ij}S_i'W'R^{-1}WC^{-1}(C - W'R^{-1}W)S_i \right)$
$= \mathrm{tr}\left( \dot G_{ij}S_i'(C - G^*)C^{-1}G^*S_i \right)$
$= \mathrm{tr}\left( \dot G_{ij}S_i'G^*S_i \right) - \mathrm{tr}\left( \dot G_{ij}S_i'G^*C^{-1}G^*S_i \right)$
$= \mathrm{tr}(\dot G_{ij}G_i^{-1}) - \mathrm{tr}(G_i^{-1}\dot G_{ij}G_i^{-1}C^{Z_iZ_i})$

and the score for $\gamma_{ij}$ is

$U_R(\gamma_{ij}) = -\tfrac{1}{2}\left\{ \mathrm{tr}(P\dot H_{ij}) - y'P\dot H_{ij}Py/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}(\dot G_{ij}G_i^{-1}) - \mathrm{tr}(G_i^{-1}\dot G_{ij}G_i^{-1}C^{Z_iZ_i}) - y'PZ_iG_iG_i^{-1}\dot G_{ij}G_i^{-1}G_iZ_i'Py/\sigma^2_H \right\}$
$= -\tfrac{1}{2}\left\{ \mathrm{tr}(\dot G_{ij}G_i^{-1}) - \mathrm{tr}(G_i^{-1}\dot G_{ij}G_i^{-1}C^{Z_iZ_i}) - \tilde u_i'G_i^{-1}\dot G_{ij}G_i^{-1}\tilde u_i/\sigma^2_H \right\}$ using (5.3.16)

as required. □

Note that in the case of a standard variance component, that is, with $\sigma^2_HG_i = \sigma^2_H\gamma_i I_{b_i}$, this reduces to

$U_R(\gamma_i) = -\tfrac{1}{2}\left\{ b_i/\gamma_i - \mathrm{tr}(C^{Z_iZ_i})/\gamma_i^2 - \tilde u_i'\tilde u_i/(\sigma^2_H\gamma_i^2) \right\}$
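For a standard variance component, then, the score requires only $\mathrm{tr}(C^{Z_iZ_i})$ and $\tilde u_i$, both of which are by-products of the absorption strategy described in the next section. A minimal sketch of this last formula:

```python
import numpy as np

def score_gamma(C_zizi, u_i, gamma_i, sigma2H):
    """REML score for a standard variance component ratio gamma_i, using the
    simplified form of Result A.8: only tr(C^{ZiZi}) and u-tilde_i are needed.
    """
    b_i = len(u_i)
    return -0.5 * (b_i / gamma_i
                   - np.trace(C_zizi) / gamma_i**2
                   - (u_i @ u_i) / (sigma2H * gamma_i**2))
```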

A.5.2 Absorption of Mixed Model Equations

It was shown in section 5.3.3 that estimates of the fixed and random effects can be obtained via absorption of the MME. This process is fundamental to the computing strategy, as it also provides the quantities needed for the calculation of the log-likelihood, the score equations and the variance matrix of the prediction errors of the fixed and random effects. The following is intended to provide a sketch of the approach; more details of Gaussian elimination with back-substitution can be found, for example, in Press et al. (1991).

Result A.9 Absorption of the MME provides $\hat\tau$ and $\tilde u$, $C^{-1}$, $\log|C|$ and $y'Py$.

Proof: COMPUTATIONAL IMPLEMENTATION 303 First the coefficient matrix of the MME is augmented to form  0 0  cyy cXy cZy M =  cXy CXX CXZ  (1.5.28) cZy CZX CZZ 0 −1 where cyy = y R y. Thus M is the coefficient matrix of the MME aug- mented by the right hand side of the equations and a quadratic form involv- ing the data vector y. Then Gaussian elimination with diagonal pivoting is performed, that is, sequential absorption of the rows and columns of M into the first element cyy. For notational simplicity we present this in terms of the full CZZ matrix (corresponding to the entire set of random effects), ie the first pivot is the matrix CZZ . The actual implementation in ASReml involves absorption one equation at a time with the MMEs reordered to maximise sparsity. Furthermore, whilst performing all of the following operations such as absorption and back-substitution, computing time is kept to a minimum by avoiding arithmetic operations involving zero elements. Step 1: First partition M as   M 11 M 12 M = 0 M 12 M 22 where  0   0  cyy cXy cZy M 11 = , M 12 = and M 22 = CZZ . cXy CXX CXZ

CZZ (= M 22) is absorbed by constructing ∗ −1 0 M = M 11 − M 12M 22 M 12  0 −1 0 0 −1  cyy − cZyCZZ cZy cXy − cZyCZZ CZX = −1 −1 cXy − CXZ CZZ cZy CXX − CXZ CZZ CZX  ∗ ∗  m11 m12 = ∗ 0 ∗ (1.5.29) m12 M 22  y0H−1y y0H−1X  = X0H−1y X0H−1X Step 2: ∗ Perform another absorption with M 22 as the pivot: ∗∗ ∗ ∗ ∗ −1 ∗ 0 M = m11 − m12 (M 22) m12 = y0P y (1.5.30) Step 3: We now obtain estimates of fixed and random effects as solution to (1.5.29) defines the single set of equations for τ , viz, ∗ ∗ 0 M 22τ = m12 ∗ −1 ∗ 0 ⇒ τˆ = (M 22) m12 304 ITERATIVE SCHEMES Then via backsubstitution we obtain both u˜ and e˜ which are given by −1 u˜ = CZZ (cZy − CZX τˆ) e˜ = y − Xτˆ − Zu˜ The elements of C−1 are obtained during this process by noting that for example, ∗ −1 −1 −1 XX (M 22) = CXX − CXZ CZZ CZX = C The log determinant of C is also calculated using the identity that |C| = −1 |CZZ ||CXX − CXZ CZZ CZX | so that ∗ log |C| = log |CZZ | + log |M 22| That is, the sum of the log determinants of the pivots in the absorption pro- cess. Similarly, when equations are absorbed one at a time the log determinant is the sum of the (non-zero) pivots. Result A.10 The average information matrix can be obtained by absorption of the matrix M but with the data vector y replaced by the matrix Q = h i 2 q 2 , q ,..., q of working variates for the variance parameters σ and κ σ 1 nk H H where 2 q 2 = y/σ σ H H ˙ qi = HiP y Proof: From (1.2.13) the average information matrix can be written as I = 1 Q0(P /σ2 )Q A 2 H Note that this is of the same form as y0P y which was obtained from the final absorption step on the matrix M in (1.5.28). Thus Q0PQ can be similarly obtained via absorption of the matrix  0 0  CQQ CXQ CZQ M =  CXQ CXX CXZ  CZQ CZX CZZ 0 −1 0 −1 0 −1 where CQQ = Q R Q, CXQ = X R Q and CZQ = Z R Q. 2 ˙ −1 Result A.11 The working variate for φi is given by qi = RiR e˜, for 2 2 ˙ −1 σ , qσ2 = e˜/σ and for γij, qij = ZiGijGi u˜i. Proof: 2 1. φi: note that H˙ i = R˙ i = σ Σ˙ i so that ˙ qi = RiP y −1 = R˙ iR RP y −1 = R˙ iR e˜ −1 = Σ˙ iΣ e˜ SUMMARY 305 2 2. σ : result follows from H˙ i = Σ ˙ ˙ 0 3. γij: Note that Hij = ZiGijZi so that ˙ 0 qij = ZiGijZiP y ˙ −1 0 = ZiGijGi GiZiP y ˙ −1 = ZiGijGi u˜i as required. 2

A.6 Summary

In this appendix we have presented a review of the most commonly used algorithms for solving the REML score equations. We presented a detailed derivation of the derivative-based methods, namely Newton-Raphson and Fisher scoring, and described how the more computationally efficient Average Information algorithm can be derived and implemented. We also presented a brief review of the Expectation-Maximisation (EM) algorithm for REML estimation in linear mixed models and illustrated its application in a simple example. The parameter expanded EM (PX-EM) algorithm was also described briefly and its implementation illustrated using a simple linear mixed model.