

Short Course — Applied Linear and Nonlinear Mixed Models*

Introduction

Mixed-effect models (or simply, “mixed models”) are like classical (“fixed-effects”) statistical models, except that some of the parameters describing group effects or covariate effects are replaced by random variables, or random effects.

• Thus, the model has both parameters, also known as “fixed effects”, and random effects; hence the model has mixed effects.

• Random effects can be thought of as random versions of parameters. So, in some sense, a mixed model has both fixed and random parameters.

– This can be a useful way to think about it, but it’s really not quite right and can lead to confusion.

– The word “parameter” means a fixed, unknown constant, so it is really something of an oxymoron to say “random parameter.”

– As we’ll see, the distinction between a parameter and a random effect goes well beyond vocabulary.

* Temple-Inland Forest Products, Inc., Jan. 17–18, 2005

Random effects arise when the observations being analyzed are heterogeneous, and can be thought of as belonging to several groups or clusters.

• This happens when there is one observation per experimental unit (tree, patient, plot, animal) and the experimental units occur or are measured in different locations, at different time points, from different sires or genetic strains, etc.

• Random effects also often arise when repeated measurements of each experimental unit are taken.

– E.g., several observations are taken through time of the height of 100 trees. The repeated height measurements are grouped or clustered by tree.

The use of random effects in linear models leads to linear mixed models (LMMs).

• LMMs are not new. Some of the simplest and most familiar examples of this class are very old.

• However, until recently, software and statistical methods for inference were not well-developed enough to handle the general case.

– Thus, only recently has the full flexibility and power of this class of models been realized.

Some Simple LMMs:

The one-way random effects model — Railway Rails:

(See Pinheiro and Bates, §1.1.) The data displayed below are from an experiment conducted to measure longitudinal (lengthwise) stress in railway rails. Six rails were chosen at random and tested three times each by measuring the time it took for a certain type of ultrasonic wave to travel the length of the rail.

[Figure: dotplot of zero-force travel time (nanoseconds, roughly 40–100) by rail; rails ordered by mean travel time: 2, 5, 1, 6, 3, 4.]

Clearly, these data are grouped, or clustered, by rail. This clustering has two closely related implications:

1. (within-cluster correlation) we should expect that observations from the same rail will be more similar to one another than observations from different rails; and

2. (between-cluster heterogeneity) we should expect that the response will vary from rail to rail in addition to varying from one measurement to the next.

• These ideas are really flip-sides of the same coin.

Although it is fairly obvious that clustering by rail must be incorporated in the modeling of these data somehow, we first consider a naive approach.

The primary interest here is in measuring the mean travel time. Therefore, we might naively consider the model

y_ij = µ + e_ij, i = 1, ..., 6, j = 1, ..., 3,

where y_ij is the travel time for the j-th trial on the i-th rail, and we assume e_11, ..., e_63 are iid N(0, σ²).

• Here, the notation “iid ∼ N(0, σ²)” means “are independent, identically distributed random variables, each with a normal distribution with mean 0 and (constant) variance σ².”

• In addition, µ is the mean travel time, which we wish to estimate. Its maximum likelihood (ML)/ordinary least squares (OLS) estimate is the grand sample mean of all observations in the data set: ȳ·· = 66.5.

• The mean square error (MSE) is s² = 23.645², which estimates the error variance σ².

However, an examination of the residuals from this model, plotted separately by rail, reveals the inadequacy of the model:

[Figure: boxplots of raw residuals (roughly −40 to 20) by rail for the simple mean model; rails ordered 2, 5, 1, 6, 3, 4.]

Clearly, the mean response is changing from rail to rail. Therefore, we consider a one-way ANOVA model:

y_ij = µ + α_i + e_ij. (*)

Here, µ is a grand mean across the rails included in the experiment, and α_i is an effect up or down from the grand mean specific to the i-th rail.

Alternatively, we could define µ_i = µ + α_i as the mean response for the i-th rail and reparameterize this model as

y_ij = µ_i + e_ij.

The OLS estimates of the parameters of this model are µ̂_i = ȳ_i·, giving (µ̂_1, ..., µ̂_6) = (54.00, 31.67, 84.67, 96.00, 50.00, 82.67) and s² = 4.02². The residual plot looks much better:

[Figure: boxplots of raw residuals (roughly −6 to 6) by rail for the one-way fixed effects model; rails ordered 2, 5, 1, 6, 3, 4.]

However, there are still drawbacks to this one-way fixed effects model:

• It only models the specific sample of rails used in the experiment, while the main interest is in the population of rails from which these rails were drawn.

• It does not produce an estimate of the rail-to-rail variability in travel time, which is a quantity of significant interest in the study.

• The number of parameters increases linearly with the number of rails used in the experiment.

These deficiencies are overcome by the one-way random effects model.

To motivate this model, consider again the one-way fixed effects model. Model (*) can be written as

y_ij = µ + (µ_i − µ) + e_ij,

where, under the usual constraint Σ_i α_i = 0, (µ_i − µ) = α_i has mean 0 when averaged over the groups (rails).

The one-way random effects model replaces the fixed parameter (µ_i − µ) with a random effect b_i, a random quantity specific to the i-th rail, which is assumed to have mean 0 and an unknown variance σ_b². This yields the model

y_ij = µ + b_i + e_ij, (**)

where b_1, ..., b_6 are independent random variables, each with mean 0 and variance σ_b². Often the b_i's are assumed normal, and they are usually assumed independent of the e_ij's. Thus we have

iid 2 iid 2 b1,...,ba ∼ N(0,σb ), independent of e11 ...,ean ∼ N(0,σ ), where a is the number of rails, n the number of observations on the ith rail.

• Note that now the interpretation of µ changes from the mean over the 6 rails included in the experiment (fixed effects model) to the mean over the population of all rails from which the six rails were sampled.

• In addition, we don’t estimate µ_i, the mean response for rail i, which is not of interest. Instead we estimate the population mean µ and the variance from rail to rail in the population, σ_b².

– That is, our scope of inference has changed from the six rails included in the study to the population of rails from which those six rails were drawn.

In addition:

• we can estimate rail-to-rail variability σ_b²; and

• the number of parameters no longer increases with the number of rails tested in the experiment.

– The parameters in the fixed-effect model were the grand mean µ, the rail-specific effects α_1, ..., α_a, and the error variance σ².

– In the random effects model, the only parameters are µ, σ², and σ_b².

σ_b² quantifies heterogeneity from rail to rail, which is one consequence of having observations that are grouped or clustered by rail, but what about within-rail correlation?

Unlike a purely fixed-effect model, the one-way random effects model does not assume that all of the responses are independent. Instead, it implies that observations that share the same random effect are correlated.

• E.g., for two observations from the i-th rail, y_i1 and y_i3, say, the model implies y_i1 = µ + b_i + e_i1 and y_i3 = µ + b_i + e_i3.

That is, yi1 and yi3 share the random effect bi, and are therefore correlated.

Why?

Because one can easily show that

var(y_ij) = σ_b² + σ²,
cov(y_ij, y_ij′) = σ_b², for j ≠ j′,
corr(y_ij, y_ij′) = ρ ≡ σ_b²/(σ_b² + σ²), for j ≠ j′, and
cov(y_ij, y_i′j′) = 0, for i ≠ i′.

That is, if we stack up all of the observations from the i-th rail (the observations that share the random effect b_i) as y_i = (y_i1, ..., y_in)^T, then

\mathrm{var}(y_i) = (\sigma_b^2 + \sigma^2) \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix} \qquad (\dagger)

and groups of observations from different rails (those that do not share random effects) are independent.

• The variance-covariance structure given by (†) has a special name: compound symmetry. This means that

– observations from the same rail all have constant variance equal to σ² + σ_b², and

– all pairs of observations from the same rail have constant correlation equal to ρ = σ_b²/(σ² + σ_b²).

– ρ, the correlation between any two observations from the same rail, is called the intraclass correlation coefficient.

• In addition, because the total variance of any observation, var(y_ij) = σ_b² + σ², is the sum of two terms, σ_b² and σ² are called variance components.
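• As a concrete illustration (a sketch, not from the course scripts), the short R code below builds the compound-symmetric matrix (†) for n = 3 observations per rail, using hypothetical variance component values:

# Build the compound-symmetric var-cov matrix (†) for n = 3.
# The variance components here are hypothetical, for illustration only.
sigma2.b <- 25                              # between-rail variance (hypothetical)
sigma2.e <- 16                              # within-rail error variance (hypothetical)
n <- 3
rho <- sigma2.b / (sigma2.b + sigma2.e)     # intraclass correlation
R <- matrix(rho, n, n); diag(R) <- 1        # 1's on the diagonal, rho elsewhere
V <- (sigma2.b + sigma2.e) * R              # var-cov matrix of y_i, as in (†)
V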

• Both fixed-effects and random-effects versions of the one-way model are fit to these data in intro.R (a sketch follows below).

• For (*), the fixed-effect version of the one-way model, we obtain µ̂ = 66.5 with a standard error of 0.948.

• For (**), the random-effect version of the one-way model, we obtain µ̂ = 66.5 with a standard error of 10.17.

– The standard error is larger in the random-effects model, because this model has a larger scope of inference.

– That is, the two models are estimating different µ’s: the fixed effect model is estimating the grand mean for the six rails in the study; the random effects model is estimating the grand mean for all possible rails.

– It makes sense that we would be much less certain of (i.e., there would be more error in) our estimate of the latter quantity, especially if there is a lot of rail-to-rail variability.

• The usual method of moments/anova/REML estimates of the variance components in model (**) are σ̂_b² = 24.8² and σ̂² = 4.02², so here there is much more between-rail variability than within-rail variability.
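• The course script intro.R is not reproduced here, but a minimal sketch of the two fits follows. It assumes the Rail data frame shipped with the nlme package (variables travel and Rail) matches the data shown above:

# Sketch of the one-way fixed- and random-effects fits to the rail data.
library(nlme)
data(Rail)                                         # travel times, grouped by rail
rail.lm  <- lm(travel ~ Rail - 1, data = Rail)     # one-way fixed effects, model (*)
rail.lme <- lme(travel ~ 1, data = Rail, random = ~ 1 | Rail)  # model (**)
coef(rail.lm)              # per-rail mean estimates
summary(rail.lme)$tTable   # muhat = 66.5 with the larger standard error
VarCorr(rail.lme)          # REML variance component estimates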

The randomized complete block model — Stool Example:

In the last example, the data were grouped by rail and we were interested in only one treatment (there was only one experimental condition under which the travel time along the rail was measured).

Often, several treatments are of interest and the data are grouped. In a randomized complete block design (RCBD), each of a treatments are observed in each of n blocks.

As an example, consider the data displayed below. These data come from an experiment to compare the ergonomics of four different stool designs. n = 9 subjects were asked to sit in each of a = 4 stools. The response measured was the amount of effort required to stand up.

[Figure: effort required to arise (Borg scale, roughly 8–14) for stool types T1–T4, one row of points per subject (subjects 1–9).]

Here, subjects form the blocks and we have a complete set of treatments observed in each block (each subject tests each stool). Thus we have a RCBD.

Let y_ij be the response for the j-th stool type tested by the i-th subject.

The classical fixed effects model for the RCBD assumes

y_ij = µ + α_j + β_i + e_ij, i = 1, ..., n, j = 1, ..., a,
     = µ_j + β_i + e_ij,

where e_11, ..., e_na are iid N(0, σ²).

• Here, µ_j is the mean response for the j-th stool type, which can be broken apart into a grand mean µ and a stool type effect α_j. β_i is a fixed subject effect.

• Again, the scope of inference for this model is the set of 9 subjects used in this experiment.

• If we wish to generalize to the population from which the 9 subjects were drawn, we would consider the subject effects to be random.

The RCBD model with random block effects is

y_ij = µ_j + b_i + e_ij, where

b_1, ..., b_n iid ∼ N(0, σ_b²), independent of e_11, ..., e_na iid ∼ N(0, σ²).

• Since the µ_j's are fixed and the b_i's are random, this is a mixed model.

The variance-covariance structure here is quite similar to that in the one- way random effects model.

Again, the model implies that any two observations that share a random effect (i.e., any two observations from the same block) are correlated.

In fact, the same compound symmetry structure holds. In particular, if y_i = (y_i1, ..., y_ia)^T is the vector of observations from the i-th block, then as in the last example,

\mathrm{var}(y_i) = (\sigma_b^2 + \sigma^2) \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix} \qquad (\dagger)

• All pairs of observations from the same block have correlation ρ = σ_b²/(σ_b² + σ²);

• all pairs of observations from different blocks are independent; and

• all observations have variance σ² + σ_b² (two components: the within-block and between-block variances).

• The RCBD model is fit to these data in intro.R. First block effects are treated as fixed, then random.

It is often stated that whether block effects are assumed random or fixed does not affect the analysis of the RCBD.

This is not completely true.

It is true that whether or not blocks are treated as random does not affect the ANOVA F test for treatments. Either way we test for equal treatment means with the test statistic F = MS_Trt/MSE. However, there are important differences in the analyses under the two models. These differences affect inferences on treatment means.

For instance, the variance of a treatment mean is

\mathrm{var}(\bar y_{\cdot j}) = \begin{cases} \sigma^2/n & \text{for fixed block effects,} \\ (\sigma_b^2 + \sigma^2)/n & \text{for random block effects.} \end{cases}

Substituting the usual method of moments/anova estimators for σ² and σ_b² leads to a standard error of

\mathrm{s.e.}(\bar y_{\cdot j}) = \sqrt{\widehat{\mathrm{var}}(\bar y_{\cdot j})} = \begin{cases} \sqrt{\{MS_{\mathrm{Blocks}} + (a-1)MSE\}/(na)} & \text{for random block effects,} \\ \sqrt{MSE/n} & \text{for fixed block effects,} \end{cases}

where a is the number of treatments.

• Again, the standard error of a treatment mean is larger in the random effects model, because the scope of inference is broader.

– For these data, s.e.(µ̂_j) = .367 in the fixed block effects model, and s.e.(µ̂_j) = .576 in the random block effects model.

• For these data, the estimated between- and within-subject variance components are σ̂_b² = 1.33² and σ̂² = 1.10².

– This means that the estimated correlation between any pair of observations on the same subject is ρ̂ = σ̂_b²/(σ̂² + σ̂_b²) = 1.33²/(1.10² + 1.33²) = 0.59.
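• As a quick arithmetic check (a sketch, using only the estimates reported above with n = 9 subjects):

# Recompute the two standard errors of a treatment mean from the
# reported variance component estimates.
sig2.b <- 1.33^2; sig2.e <- 1.10^2; n <- 9
sqrt(sig2.e / n)              # fixed block effects:  about 0.367
sqrt((sig2.b + sig2.e) / n)   # random block effects: about 0.576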

A Split-plot model — Grass Example:

A split-plot experimental design is one in which two sizes of experimental unit are used. The larger experimental unit, known as the whole plot, is randomized to some experimental design (a RCBD, say).

The whole plot is then subdivided into smaller units, known as split plots, which are assigned to a second experimental design within each whole plot.

Example: A study of the effects of three bacterial inoculation treatments and two cultivar types on grass yield was conducted as follows.

• Four fields, or blocks, were divided in half, and the two cultivars (A1 and A2) were assigned at random to be grown in the two halves of each field.

• Then each half-field (the whole plot) was divided into three sub-units or split-plots, and the three inoculation treatments (B1, B2, and B3) were randomly assigned to the three split-plots in each whole plot.

The resulting design and data are as follows:

              Block 1            Block 2            Block 3            Block 4
              A1       A2        A2       A1        A2       A1        A1       A2
  Split 1   B2 29.7  B3 34.4   B3 36.4  B1 28.9   B2 29.1  B1 28.6   B1 26.7  B3 30.7
  Split 2   B1 27.4  B1 29.4   B2 32.4  B2 28.7   B1 27.2  B3 32.9   B3 31.8  B2 28.6
  Split 3   B3 34.5  B2 32.5   B1 28.7  B3 33.4   B3 32.6  B2 29.7   B2 28.9  B1 26.8

• Here it was easier to randomize the planting of the two cultivars to a few large units (the whole plots) than to many small units (the split plots).

– Convenience is the motivation for this design.

• Here the 8 columns within the four rectangles are the whole plots, and cultivar is the whole plot factor.

• The 24 smaller squares within the columns are the split plots and inoculation type is the split plot factor.

The Data: y_ijk, i = 1, ..., a (levels of the whole-plot factor), j = 1, ..., n (blocks), k = 1, ..., b (levels of the split-plot factor).

• That is, y_ijk is the response for the i-th cultivar in the j-th block treated with the k-th inoculation type.

Model: y_ijk = µ + α_i + τ_j + b_ij + β_k + (αβ)_ik + e_ijk, where

α_i = effect of the i-th cultivar
τ_j = effect of the j-th block (treated here as fixed)
β_k = effect of the k-th inoculation treatment
(αβ)_ik = interaction between cultivars and inoculations

In addition,

the b_ij's are iid N(0, σ_b²), independent of the e_ijk's, which are iid N(0, σ²).

• The b_ij's are sometimes described as “whole plot error terms”. In a sense that is what a random effect is: an additional error term in the model.

The b_ij's are random effects for each whole plot (one for each half-field). They account for:

• heterogeneity from one whole plot to the next (quantified by σ_b²);

• correlation among the three split-plots within a given whole plot.

Again, the variance-covariance structure in this model is compound symmetric. The model implies

\mathrm{var}(y_{ij}) = (\sigma_b^2 + \sigma^2) \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}

where y_ij = (y_ij1, y_ij2, ..., y_ijb)^T (the vector of all observations on the (i, j)-th whole plot).

This means:

• all pairs of observations from the same whole-plot have correlation ρ = σ_b²/(σ_b² + σ²);

• all pairs of observations from different whole-plots are independent; and

• all observations have variance σ² + σ_b² (two components: within-whole-plot and between-whole-plot variances).

• The split plot model is fit to these data with the lme() function in S-PLUS/R in intro.R (a sketch follows below).

• Note that here we’ve treated blocks as fixed. Later, we’ll return to this example and model block effects as random.
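• A hedged sketch of that lme() call (intro.R itself is not reproduced; a data frame grass with columns yield, block, cult, and inoc is assumed, with block effects kept fixed as above):

# Sketch: split-plot model with random whole-plot effects.
library(nlme)
grass$wplot <- with(grass, interaction(block, cult))  # one level per half-field
grass.lme <- lme(yield ~ cult * inoc + block,         # block treated as fixed
                 data = grass, random = ~ 1 | wplot)
anova(grass.lme)   # whole-plot factor (cult) tested with 3 denominator d.f.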

The Experimental Unit, Pseudoreplication, D.F., & Balance

• The split-plot design involves two different experimental units: the whole plot and the split plot.

• Whole plots are randomly assigned to the whole plot factor.

– E.g., half-fields were randomized to the two cultivars.

• There are many fewer whole plot experimental units than there are observations (which equals the number of split plots in the experiment).

– Only 8 half-fields, but 24 observations in the grass experiment.

• With respect to cultivar, then, 8 experimental units are randomized. So degrees of freedom for testing cultivar are based on a sample size of 8.

– At the whole plot level, we have a RCBD design with two treatments (cultivars), four blocks, so error d.f. for testing cultivars is the error d.f. in a RCBD of this size. Namely: (2-1)(4-1)=3.

• With respect to cultivar, the measurements on the three split plots in each whole plot are pseudoreplicates (or subsamples).

– That is, they are not independently randomized to cultivars and thus provide no additional d.f. (information) regarding cultivar effects.

In some sense, modeling whole plots with random effects:

— identifies the appropriate error term for the whole plot factor;

— identifies the appropriate d.f. (amount of relevant information in the data/design) for testing the whole plot factor (cultivar); and

— identifies which units are true experimental units and which are pseu- doreplicates with respect to each experimental factor.

• If a purely fixed-effect model is used in the split-plot design* then the usual MSE and DFE based upon the eijk error term will lead to incorrect inferences on the whole plot factor.

– See model grass.lm1 in intro.R.

• Correct inferences on whole plot factors can sometimes be obtained from a fixed-effects analysis, but

– you have to really know what you’re doing, especially in complex situations like split-split-plot models, etc.; and

– the design has to be balanced (rare!).

• So, the use of random effects is also motivated by

– use of multiple sizes of experimental units with distinct randomizations;

– presence of pseudoreplication; and

– imbalance.

• Mixed effects models handle these complications much more “automatically” than fixed-effects models and, consequently, avoid incorrect inferences to which fixed-effects models are prone in these situations.

* this would be done by modeling variability among whole plots with a fixed cult*block interaction effect

A More Complex Example — PMRC Site Preparation Study:

• Study of various site preparation and intensive management regimes on the growth of slash pine.

• Involved 191 0.2 ha plots nested within 16 sites in lower coastal plain of GA and FL.

• Data consist of repeated plot-level measurements of hd = dominant height (m), ba = basal area (m²/ha), tph (100's of trees per ha), derived volume (total volume outside bark in m³/ha), and other variables at ages 2, 5, 8, 11, 14, 17, and 20 years.

• At each site, plots were randomized to eleven treatments consisting of a subset of the 2⁵ = 32 combinations of five two-level (absent/present) treatment factors:

A = Chop, site prep. w/ a single pass of a rolling drum chopper; B = Fert, fertilizer following the first, 12th and 17th growing seasons; C = Burn, broadcast burn of site prior to planting; D = Bed, a double pass bedding of the site; E = Herb, veg. control with chemical herbicide.

Here is a plot of the data. Each panel represents a site, and each panel contains tph-over-time profiles for each plot on that site.

[Figure: trees/hectare (100s of trees) vs. age (yrs); a separate profile for each plot, graphed in separate panels by site. Each panel is a site.]

• These data are grouped, or clustered, by site.

– We would expect heterogeneity from site to site.
– We would expect correlation among plots within the same site.

• Data are also grouped by plot, since we have repeated measures through time on each plot.

– Again, we would expect plots to be heterogeneous.
– Expect stronger correlation among observations from the same plot than observations from different plots.

• In addition, we’d like to make inferences about the population of plantation sites for which these sites are representative, not just these sites alone.

• Also would like to be able to generalize to the population from which these plots are drawn.

• Hence, it makes sense to model sites with random site effects, plots with random plot effects.

– Plots are nested within sites. This would be an example of a multilevel mixed model.

• In addition, plots are randomized to treatments, then repeated mea- sures through time are taken on each plot.

– With respect to treatments, plots are the experimental unit, but the measurement unit occurs at a finer scale: times within plots.

• These time-specific measurements are a bit like measurements on split plots.

– However, in a split-plot example, observations from the same whole plot are correlated due to shared characteristics of that whole plot. These are captured by whole plot random effects.

– In a repeated measures context, observations through time from the same unit are correlated due to shared characteristics of that unit and are subject to serial correlation (observations taken close together in time more similar than observations taken far apart in time).

– Thus, in a repeated measures context, we may want random effects and serial correlation built into our model.

• We’ll soon see how multilevel random effects, serial correlation, and other features can be handled in the general form of the LMM.

Fixed vs. random effects: The effects in the model account for variability in the response across levels of treatment and design factors. The decision as to whether fixed effects or random effects should be used depends upon what the appropriate scope of generalization is.

• If it is appropriate to think of the levels of a factor as randomly drawn from, or otherwise representative of, a population to which we’d like to generalize, then random effects are suitable.

– Design or grouping factors are usually more appropriately modeled with random effects.

– E.g., blocks (sections of land) in an agricultural experiment, days when an experiment is conducted over several days, lab technician when measurements are taken by several technicians, subjects in a study, locations or sites along a river when we desire to generalize to the entire river.

• If, however, the specific levels of the factor are of interest in and of themselves then fixed effects are more appropriate.

– Treatment factors are usually more appropriately modeled with fixed effects.

– E.g., in studies to compare drugs, amounts of fertilizer, hybrids of corn, teaching techniques, and measurement devices, these factors are most appropriately modeled with fixed effects.

• A good litmus test for whether the level of some factor should be treated as fixed is to ask whether it would be of broad interest to report a mean for that level. For example, if I’m conducting an experiment in which each of four different classes of third grade students are taught with each of three methods of instruction (e.g., in a crossover design), then it will be of broad interest to report the mean response (level of learning, say) for a particular method of instruction, but not for a particular classroom of third graders.

– Here, fixed effects are appropriate for instruction method, ran- dom effects for class.

Preliminaries/Background

• In order to really understand the LMM, we need to study it in its vector/matrix form. So, we need to discuss/review random vectors and the multivariate normal distribution.

• Also need to review the classical linear model (CLM) before generalizing to the LMM.

• Estimation in the CLM is based on least squares, but in the LMM, maximum likelihood (ML) estimation is used. Therefore, we need to cover/review the basic ideas of ML estimation.

Random Vectors:

Random Vector: A vector whose elements are random variables. E.g., y = (y_1, y_2, ..., y_n)^T,

where y_1, y_2, ..., y_n are each random variables.

• Random vectors we will be concerned with:

– A vector containing the response variable measured on n units in the sample: y = (y_1, ..., y_n)^T.
– A vector of error terms in a model for y: e = (e_1, ..., e_n)^T.
– A vector of random effects: b = (b_1, b_2, ..., b_q)^T.

Expected Value: The expected value (population mean) of a random vector is the vector of expected values, often denoted µ. For y (n × 1),

E(y) = (E(y_1), E(y_2), ..., E(y_n))^T ≡ (µ_1, µ_2, ..., µ_n)^T = µ.

(Population) Variance-Covariance Matrix: For a random vector y = (y_1, y_2, ..., y_n)^T (n × 1) with mean µ = (µ_1, µ_2, ..., µ_n)^T, the matrix

E[(y-\mu)(y-\mu)^T] = \begin{pmatrix} \mathrm{var}(y_1) & \mathrm{cov}(y_1,y_2) & \cdots & \mathrm{cov}(y_1,y_n) \\ \mathrm{cov}(y_2,y_1) & \mathrm{var}(y_2) & \cdots & \mathrm{cov}(y_2,y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(y_n,y_1) & \mathrm{cov}(y_n,y_2) & \cdots & \mathrm{var}(y_n) \end{pmatrix} \equiv \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix}

is called the variance-covariance matrix of y and is denoted var(y).

(Population) Correlation Matrix: For a random vector y (n × 1), the population correlation matrix is the matrix of correlations among the elements of y:

\mathrm{corr}(y) = \begin{pmatrix} 1 & \mathrm{corr}(y_1,y_2) & \cdots & \mathrm{corr}(y_1,y_n) \\ \mathrm{corr}(y_2,y_1) & 1 & \cdots & \mathrm{corr}(y_2,y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{corr}(y_n,y_1) & \mathrm{corr}(y_n,y_2) & \cdots & 1 \end{pmatrix}

• Recall: for random variables y_i and y_j,

corr(y_i, y_j) = cov(y_i, y_j) / √(var(y_i) var(y_j))

measures the amount of linear association between y_i and y_j.

• Correlation matrices are symmetric.

Properties of expected value, variance:

Let x, y be random vectors of the same dimension, and let C and c be a matrix and vector, respectively, of constants. Then

1. E(y + c) = E(y) + c.

2. E(x + y) = E(x) + E(y).

3. E(Cy) = C E(y).

4. var(y + c) = var(y).

5. var(y + x) = var(y) + var(x) + cov(y, x) + cov(x, y), where the covariance terms are 0 if x and y are independent.

6. var(Cy) = C var(y) C^T.
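Property 6 is easy to verify numerically; the R sketch below uses a simulated sample, with Σ and C chosen arbitrarily for illustration:

# Numerical illustration of property 6: var(Cy) = C var(y) C^T.
set.seed(1)
Sigma <- matrix(c(2, 1, 0,
                  1, 3, 1,
                  0, 1, 1), nrow = 3)            # illustrative var-cov matrix
Y <- matrix(rnorm(10000 * 3), ncol = 3) %*% chol(Sigma)  # rows are draws of y
C <- matrix(c(1, 0,  1,
              0, 1, -1), nrow = 2, byrow = TRUE) # illustrative constant matrix
var(Y %*% t(C))          # sample var-cov matrix of Cy
C %*% var(Y) %*% t(C)    # C var(y) C^T -- agrees exactly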

Multivariate normal distribution:

• The multivariate normal distribution is to a random vector as the univariate (usual) normal distribution is to a random variable.

– It is the version of the normal distribution appropriate to the joint distribution of several random variables (collected and stacked as a vector) rather than a single random variable.

• Recall that we write y ∼ N(µ, σ²) to signify that the univariate r.v. y has the normal distribution with mean µ and variance σ².

– Means that y has probability density function (p.d.f.)

f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(\frac{-(y-\mu)^2}{2\sigma^2}\right)

– Meaning: for two values y_1 < y_2, the probability that y falls between y_1 and y_2 is the area under the p.d.f. between y_1 and y_2.

• We write y ∼ N_n(µ, Σ) to denote that y follows the n-dimensional multivariate normal distribution with mean µ and variance-covariance matrix Σ.

• E.g., for a bivariate random vector y = (y_1, y_2)^T ∼ N_2(µ, Σ), the p.d.f. of y maps out a bell over the (y_1, y_2) plane, centered at µ, with spread described by Σ.

• Recall for y ∼ N(µ, σ²) the p.d.f. of y is

f(y) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2}\frac{(y-\mu)^2}{\sigma^2}\right\},

• In the multivariate case, for y ∼ N_n(µ, Σ), the p.d.f. of y is

f(y) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)\right\}.

– Here |Σ| denotes the determinant of the var-cov matrix Σ.
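The multivariate normal p.d.f. can be coded directly from this formula; the sketch below (an illustration, not from the course scripts) checks it against dnorm() in the n = 1 case:

# Direct implementation of the N_n(mu, Sigma) density given above.
dmvn <- function(y, mu, Sigma) {
  n <- length(y)
  d <- y - mu
  as.numeric(exp(-0.5 * t(d) %*% solve(Sigma) %*% d) /
             ((2 * pi)^(n / 2) * sqrt(det(Sigma))))
}
# Sanity check in the univariate case (sigma^2 = 4):
dmvn(0.3, mu = 0, Sigma = matrix(4))
dnorm(0.3, mean = 0, sd = 2)    # same value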

Review of Classical (Fixed-Effects) Linear Model

Assume we observe a sample of independent pairs, (y_1, x_1), ..., (y_n, x_n), where y_i is a response variable and x_i = (x_i1, ..., x_ip)^T is a p × 1 vector of explanatory variables.

The classical linear model can be written

y_i = β_1 x_i1 + ··· + β_p x_ip + e_i, i = 1, ..., n,
    = x_i^T β + e_i,

where e_1, ..., e_n are iid N(0, σ²).

Equivalently, we can stack these n equations and write the model as follows:

\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}

or y = Xβ + e.

• Our assumptions on e_1, ..., e_n can be equivalently restated as e ∼ N_n(0, σ²I_n).

• Since y = Xβ + e and e ∼ N_n(0, σ²I_n), it follows that y is multivariate normal too: y ∼ N_n(Xβ, σ²I_n).

• The var-cov matrix for y is

\sigma^2 I_n = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}

• ⇒ the y_i's are uncorrelated and have constant variance σ².

• Therefore, in the CLM, y is assumed to have a multivariate normal joint p.d.f.

Estimation of β and σ²:

Maximum likelihood estimation:

In general, the likelihood function is just the probability density function, but thought of as a function of the parameters rather than of the data.

• Interpretation: likelihood function quantifies how likely the data are for a given value of the parameters.

• The idea behind maximum likelihood estimation is to find the values of β and σ² under which the data are most likely.

– That is, we find the β and σ² that maximize the likelihood function, or equivalently, the loglikelihood function, for the value of y actually observed.

– These values are the maximum likelihood estimates (MLEs) of the parameters.

For the CLM, the loglikelihood is

\ell(\beta, \sigma^2; y) = \underbrace{-\frac{n}{2}\log(2\pi)}_{\text{a constant}} \ \underbrace{- \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)}_{\text{kernel of } \ell}.

Notice that maximizing ℓ(β, σ²; y) with respect to β is equivalent to maximizing the third term:

−(1/(2σ²)) (y − Xβ)^T(y − Xβ),

which is equivalent to minimizing

(y − Xβ)^T(y − Xβ) = Σ_{i=1}^n (y_i − x_i^T β)² (Least-Squares Criterion). (*)

• (y − Xβ)^T(y − Xβ) is the squared distance between y and its mean, Xβ.

– The parameter estimate β̂ minimizes this distance.

– That is, β̂ gives the estimated mean Xβ̂ that is closest to y.

• So, the estimators of β given by ML and (ordinary) least squares (OLS) coincide.

– For β in the CLM: ML = OLS and, if X is of full rank (model is not overparameterized), then

β̂ = (X^T X)^{-1} X^T y.
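The closed-form estimator is easy to compute directly; a sketch with simulated (purely illustrative) data, checked against lm():

# Computing betahat = (X'X)^{-1} X'y directly and comparing with lm().
set.seed(2)
n <- 20
X <- cbind(1, runif(n), runif(n))                  # n x p design matrix (p = 3)
y <- drop(X %*% c(1, 2, -1) + rnorm(n, sd = 0.5))  # simulated response
beta.hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) b = X'y
cbind(direct = beta.hat, lm = coef(lm(y ~ X - 1))) # identical columns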

Estimation of σ²:

• Setting the partial derivative of ℓ with respect to σ² to 0 and solving leads to the MLE of σ²:

σ̂²_ML = (1/n)(y − Xβ̂)^T(y − Xβ̂) = (1/n) Σ_i (y_i − x_i^T β̂)² = SSE/n.

• Problem: This estimator is biased for σ².

• This bias can be easily fixed, which leads to the generally preferred estimator:

σ̂² = (1/(n − p))(y − Xβ̂)^T(y − Xβ̂) = SSE/(n − p) = SSE/df_E = MSE.

• Note that the MLE of σ² is biased, and this is due to using the wrong value for df_E (the divisor for SSE).

– df_E = n − p is the information in the data left for estimating σ² after having estimated β_1, ..., β_p.

– Because σ̂²_ML uses n rather than n − p, it is often said that the MLE of σ² fails to account for d.f. used (or lost) in estimating β.

• MSE, the preferred estimator of σ², is an example of what is known as a restricted ML (REML) estimator.

– As we’ll see, REML is the preferred method of estimating variance components in LMMs. This method simply generalizes using σ̂² = MSE rather than σ̂²_ML in the CLM.

Example — Volume of Cherry Trees:

For 31 black cherry trees the following measurements were obtained:

V = Volume of usable wood (cubic feet)
H = Height of tree (feet)
D = Diameter at breast height (inches)

Goal: Predict usable wood volume from diameter and height.

• See S-PLUS script, backgrnd.R.

• Here, we first consider a simple multiple regression model, cherry.lm1, for these data:

V_i = β_0 + β_1 H_i + β_2 D_i + e_i, i = 1, ..., 31.

• Initial plots of V against both explanatory variables, D and H, look linear, so this model may be reasonable.

• cherry.lm1 gives a high R² of .941 and most residual plots look pretty good. However, the plot of residuals vs. diameter looks “U”-shaped, so we consider some other models for these data.
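• backgrnd.R is not reproduced here, but a minimal stand-in for cherry.lm1 can be run with R's built-in trees data, which records the same 31 black cherry trees (with diameter stored as Girth); exact numbers may differ slightly from those quoted above:

# A minimal stand-in for cherry.lm1 using R's built-in `trees` data
# (Volume in cubic feet, Height in feet, Girth = diameter in inches).
cherry.lm1 <- lm(Volume ~ Height + Girth, data = trees)
summary(cherry.lm1)$r.squared          # compare with the R^2 quoted above
plot(trees$Girth, resid(cherry.lm1))   # look for the "U" shape in diameter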

Inference in the CLM:

Under the basic assumptions of the CLM (independence, homoscedasticity, normality), β̂, the ML/OLS estimator of β, has distribution

β̂ ∼ N(β, σ²(X^T X)^{-1}).

That is,

• β̂ is unbiased for β;
• β̂ has var-cov matrix σ²(X^T X)^{-1};
• β̂_j has standard error s.e.(β̂_j) = √(MSE [(X^T X)^{-1}]_jj);
• β̂ is normally distributed.

• It can also be shown that β̂ is the optimal estimator (BLUE, UMVUE).

These properties lead to a number of normal-theory methods of inference:

1. t tests and confidence intervals for an individual regression coefficient β_j, based on

(β̂_j − β_j) / s.e.(β̂_j) ∼ t(n − p),

the t distribution with n − p d.f.

– A 100(1 − α)% CI for β_j is given by β̂_j ± t_{1−α/2}(n − p) s.e.(β̂_j).

– For an α-level test of H_0: β_j = β_0 versus H_1: β_j ≠ β_0, we use the rule: reject H_0 if

|β̂_j − β_0| > t_{1−α/2}(n − p) s.e.(β̂_j).

– Tests of H_0: β_j = 0 for each β_j are given by the summary() function in S-PLUS/R.

2. More generally, inference on linear combinations of the β_j's of the form c^T β (e.g., contrasts), based on the t distribution:

(c^T β̂ − c^T β) / √(MSE c^T(X^T X)^{-1}c) ∼ t(n − p).

– E.g., a 100(1 − α)% C.I. for the expected response at a given value of the vector of explanatory variables x_0 is given by

x_0^T β̂ ± t_{1−α/2}(n − p) √(MSE x_0^T(X^T X)^{-1}x_0).

– A 100(1 − α)% prediction interval for the response on a new subject with vector of explanatory variables x_0 is given by

x_0^T β̂ ± t_{1−α/2}(n − p) √(MSE(1 + x_0^T(X^T X)^{-1}x_0)).

– Confidence intervals for fitted and predicted values are given by the predict() function in S-PLUS/R (see the sketch following this list).

3. Inference on the entire vector β is based on the fact that

(β̂ − β)^T(X^T X)(β̂ − β) / (p MSE) ∼ F(p, n − p),

the F distribution with p and n − p d.f.

– E.g., we can test any hypothesis of the form H0 : Aβ = c where A is a k×p matrix of constants (e.g., contrast coefficients) with an F test. The appropriate test has rejection rule: reject if

F = (Aβ̂ − c)^T {A(X^T X)^{-1}A^T}^{-1}(Aβ̂ − c) / (k MSE) > F_{1−α}(k, n − p).

4. The fit of nested models can be compared via an F test comparing their MSE’s.

– Accomplished with the anova() function in S-PLUS/R.
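As a short sketch of items 1–2 in R (using the trees-based cherry fit assumed above; the new tree x0 is hypothetical):

# CIs for coefficients and intervals for a new observation via predict().
cherry.lm1 <- lm(Volume ~ Height + Girth, data = trees)
confint(cherry.lm1)                      # t-based CIs for each coefficient
x0 <- data.frame(Height = 76, Girth = 13)                    # hypothetical tree
predict(cherry.lm1, newdata = x0, interval = "confidence")   # CI for mean response
predict(cherry.lm1, newdata = x0, interval = "prediction")   # prediction interval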

Clustered Data:

Clustered data are data that are collected on subjects/animals/trees/units which are heterogeneous, falling into natural groupings, or clusters, based upon characteristics of the units themselves or the experimental design, but not on the basis of treatments or interventions.

• The most common example of clustered data is repeated measures data.

• By repeated measures, people typically mean data consisting of mul- tiple measurements of essentially the same variable on a given subject or unit of observation.

– Repeated measurements are typically taken through time, but can be at different spatial locations, or can arise from multiple measuring devices, observers, etc.

– When repeated measures are taken through time, the terms longitudinal data and panel data are roughly synonymous.

• We’ll use the more generic term clustered data to refer to any of these situations.

– Clustered data also include data from split-plot designs, crossover designs, hierarchical designs, and designs with pseudoreplication/subsampling.

Advantages of longitudinal/clustered data:

• Allow study of individual patterns of change — i.e., growth.

• Economize on experimental units.

• Heterogeneous experimental units are often better representative of the population to which we’d like to generalize.

• Each subject/unit can “serve as his or her own control”.

– E.g., in a split-plot experiment or crossover design, comparisons between treatments can be done within the same subject.
– In a longitudinal study, comparisons of time effects (growth) can be made within a subject rather than between subjects.

– Between-unit heterogeneity can be eliminated when assessing treatment or time effects. This leads to more power/efficiency (think paired t-test versus two-sample t-test).

Disadvantages:

• Correlation, multiple sources of heterogeneity in the data.

– Makes statistical methods harder to understand, implement.

– LMMs flexible enough to deal with these features.

• Imbalance, incompleteness in data more common.

– This can be hard for some statistical methods, especially if data are not missing at random.

– LMMs handle unbalanced data relatively easily, well.

Linear Mixed Models (LMMs)

• We will present the LMM for clustered data. It can be presented and used in a somewhat more general context, but most applications are to clustered data and this is a simpler case to discuss/understand.

Examples revisited:

Example 1, One-way random effects model — Rails

• Recall that we had three observations on each of 6 rails.

Model: y_ij = µ + b_i + e_ij, i = 1, ..., 6, j = 1, ..., 3, where

y_ij = response from the j-th measurement on the i-th rail
µ = grand mean response across the population of all rails
b_i = random effect for the i-th rail
e_ij = error term

• Data are clustered by rail.

Model for all data from the i-th rail can be written in vector/matrix form:

\begin{pmatrix} y_{i1} \\ y_{i2} \\ y_{i3} \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \mu + \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} b_i + \begin{pmatrix} e_{i1} \\ e_{i2} \\ e_{i3} \end{pmatrix}

or y_i = X_i β + Z_i b_i + e_i.

Example 2, RCBD model — Stools

• Recall that we had n = 9 subjects, each of whom tested all a =4 stool designs under study.

Model: y_ij = µ_j + b_i + e_ij, i = 1, ..., n, j = 1, ..., a, where

y_ij = response from the j-th stool tested by the i-th subject
µ_j = mean response for stool type j across the population of all subjects
b_i = random effect for the i-th subject
e_ij = error term

• Data are clustered by subject.

Model for all data from the i-th subject can be written in vector/matrix form:

\begin{pmatrix} y_{i1} \\ y_{i2} \\ y_{i3} \\ y_{i4} \end{pmatrix} = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{pmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix} + \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} b_i + \begin{pmatrix} e_{i1} \\ e_{i2} \\ e_{i3} \\ e_{i4} \end{pmatrix}

or y_i = X_i β + Z_i b_i + e_i.

Example 3, Split-plot model — Grass

• Recall that we had 8 whole plots (half-fields) randomized to a RCBD, and then each was split into 3 split-plots, which were randomized to the 3 inoculation types.

Model: y_ijk = µ + α_i + β_k + (αβ)_ik + τ_j + b_ij + e_ijk, where y_ijk is the response from the split-plot assigned to the k-th inoculation type within the (i, j)-th whole plot (which is assigned to the i-th cultivar in the j-th block).

In addition,

µ = grand mean
α_i = i-th cultivar effect (fixed)
β_k = k-th inoculation type effect (fixed)
(αβ)_ik = cultivar × inoculation interaction effect (fixed)
τ_j = j-th block effect (treated as fixed, but could be random)
b_ij = effect for the (i, j)-th whole plot (random)
e_ijk = error term (random)

• Data are clustered by whole plot.

Model for all data from the (i, j)-th whole plot can be written in vector/matrix form:

\begin{pmatrix} y_{ij1} \\ y_{ij2} \\ y_{ij3} \end{pmatrix} = \begin{pmatrix} 1&1&1&0&0&1&0&0&1 \\ 1&1&0&1&0&0&1&0&1 \\ 1&1&0&0&1&0&0&1&1 \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_i \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ (\alpha\beta)_{i1} \\ (\alpha\beta)_{i2} \\ (\alpha\beta)_{i3} \\ \tau_j \end{pmatrix} + \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} b_{ij} + \begin{pmatrix} e_{ij1} \\ e_{ij2} \\ e_{ij3} \end{pmatrix}

or y_ij = X_ij β + Z_ij b_ij + e_ij.

The Linear Mixed Model for Clustered Data:

• Notice that all 3 of the previous examples have the same form.

• They are all examples of LMMs with a single (univariate) random effect: a random cluster-specific intercept.

Suppose we have data on n clusters, where y_i = (y_i1, ..., y_{i,t_i})^T are the t_i observations available on the i-th cluster, i = 1, ..., n.

Then the LMM with a random cluster-specific intercept is given (in general) by y_i = X_i β + Z_i b_i + e_i, i = 1, ..., n, where X_i is a t_i × p design matrix for the fixed effects β, Z_i is a t_i × 1 vector of ones, and e_i is a vector of error terms.

If you’re not comfortable with the vector/matrix representation, another way to write it is

y_ij = β_1 x_1ij + β_2 x_2ij + ··· + β_p x_pij + z_ij b_i + e_ij,

where z_ij = 1; the β terms are the fixed part and z_ij b_i is the random part.

Assumptions:

— cluster effects: the b_i's are independent, normal with variance (variance component) σ_b².

— error terms: the e_ij's are independent, normal with variance (variance component) σ².

• We will relax both the assumption of independence and that of constant variance (homoscedasticity) later.

— bi’s and eij’s assumed independent of each other.

• Often, it makes sense to have more than one random effect in the model. To motivate this, let’s consider another example.

Example — Microfibril Angle in Loblolly Pine

• Whole-disk cross-sectional microfibril angle (MFA) was measured at 1.4, 4.6, 7.6, 10.7 and 13.7 meters up the stem of 59 trees, sampled from four physiographic regions.

– Regions (no. trees) were Atlantic Coastal Plain (24), Piedmont (17), Gulf Coastal Plain (9), and Hilly Coastal Plain (9).

A plot of the data:

[Figure: whole-disk cross-sectional microfibril angle (deg) vs. height on stem (m), one panel per region (Atlantic, Gulf, Hilly, Piedmont).]

• Here we have 4 or 5 repeated measures on each tree.

– Repeated measures not through time, but through space, up the stem of the tree.

• Any reasonable model would account for

– heterogeneity between individual trees;

– correlation among observations on the same tree; and

– dependence of MFA on height at which it is measured.

• From the plots it is clear that MFA decreases with height.

– For simplicity, suppose it decreases linearly with height (it doesn’t, but let’s keep things easy).

Let y_ijk be the MFA on the j-th tree in the i-th region, measured at the k-th height.

Then a reasonable model might be

y_ijk = µ_i + β height_ijk + b_ij + e_ijk,

where µ_i + β height_ijk is the fixed part and

µ_i = mean response for the i-th region
β = slope for the linear effect of height on MFA
b_ij = random effect for the (i, j)-th tree
e_ijk = error term for the height-specific measurements

• Fixed part of model says that MFA decreases linearly in height, with an intercept that depends on region.

– I.e., mean MFA is different from one region to next.

• Random effects (the b_ij's) say that the intercept varies from tree to tree within region.

Rather than just random tree-specific intercepts, suppose we believe that the slope (linear effect of height on MFA) also varies from subject to subject.

• This leads to a random intercept and slope model:

y_ijk = (µ_i + b_1ij) + (β + b_2ij) height_ijk + e_ijk,

where (µ_i + b_1ij) is the intercept and (β + b_2ij) the slope; i.e.,

y_ijk = µ_i + β height_ijk + b_1ij + b_2ij height_ijk + e_ijk.

• Now there are two random effects, b_1ij and b_2ij, or a bivariate random effect: b_ij = (b_1ij, b_2ij)^T.

– No reason to expect that an individual tree’s effect on the intercept would be independent of that same tree’s effect on the slope.

– So, we would assume b1ij and b2ij are correlated (probably negatively).

The model can be written as

\begin{pmatrix} y_{ij1} \\ \vdots \\ y_{ij5} \end{pmatrix} = \begin{pmatrix} 1 & \mathrm{height}_{ij1} \\ \vdots & \vdots \\ 1 & \mathrm{height}_{ij5} \end{pmatrix} \begin{pmatrix} \mu_i \\ \beta \end{pmatrix} + \begin{pmatrix} 1 & \mathrm{height}_{ij1} \\ \vdots & \vdots \\ 1 & \mathrm{height}_{ij5} \end{pmatrix} \begin{pmatrix} b_{1ij} \\ b_{2ij} \end{pmatrix} + \begin{pmatrix} e_{ij1} \\ \vdots \\ e_{ij5} \end{pmatrix}

or y_ij = X_ij β + Z_ij b_ij + e_ij.

So, the LMM in general may have > 1 random effect, which leads us to the general form of the model:

y_i = X_i β + Z_i b_i + e_i, i = 1, ..., n, where

X_i = design matrix for the fixed effects
β = p × 1 vector of fixed effects (parameters)
Z_i = design matrix for the random effects
b_i = q × 1 vector of random effects
e_i = vector of error terms

If you’re not comfortable with the vector/matrix representation, another way to write it is

y_ij = β_1 x_1ij + β_2 x_2ij + ··· + β_p x_pij + z_1ij b_1i + ··· + z_qij b_qi + e_ij,

where the β terms are the fixed part and the z·b terms the random part.

Assumptions:

— cluster effects: the b_i's are normal and independent from cluster to cluster.

— We allow b_1i, ..., b_qi (random effects from the same cluster, e.g., a random intercept and slope) to be correlated, with var-cov matrix D:

the b_i's are iid N_q(0, D).

— error terms: the e_ij's are independent, normal with variance (variance component) σ². That is, the e_i's are iid N_{t_i}(0, σ²I).

• We will relax both the assumption of independence and that of constant variance (homoscedasticity) later.

— bi’s and eij’s assumed independent of each other.

Example — Microfibril Angle in Loblolly Pine (Continued)

Recall the original random-intercept (only) model:

y_ijk = µ_i + β height_ijk + b_ij + e_ijk.

• This model is fit with the lme() function in LMM.R:

> mfa.lme1 <- lme(mfa ~ regname + diskht - 1, data=mfa, random= ~1|tree)
> summary(mfa.lme1)
Linear mixed-effects model fit by REML
 Data: mfa
       AIC      BIC    logLik
  1501.148 1526.311 -743.5738

Random effects:
 Formula: ~1 | tree
        (Intercept) Residual
StdDev:    1.795762 3.347371

Fixed effects: mfa ~ regname + diskht - 1
                    Value Std.Error  DF  t-value p-value
regnameAtlantic 20.267993 0.5996115  55 33.80187       0
regnameGulf     18.266745 0.8800006  55 20.75765       0
regnameHilly    18.425222 0.8719278  55 21.13159       0
regnamePiedmont 20.948914 0.6741677  55 31.07374       0
diskht          -0.116966 0.0147071 215 -7.95304       0
 Correlation:
                rgnmAt rgnmGl rgnmHl rgnmPd
regnameGulf      0.192
regnameHilly     0.219  0.117
regnamePiedmont  0.323  0.172  0.196
diskht          -0.601 -0.320 -0.365 -0.537

Standardized Within-Group Residuals:
       Min         Q1        Med        Q3       Max
-1.7505407 -0.6968633 -0.1144518 0.5237712 3.9879323

Number of Observations: 274
Number of Groups: 59

The random intercept and random slope model was

y_ijk = µ_i + β height_ijk + b_1ij + b_2ij height_ijk + e_ijk.

• This model can be fit with lme() too, but an easy way to refit a model with a slight change is via update():

> mfa.lme2 <- update(mfa.lme1, random= ~diskht|tree)
> summary(mfa.lme2)
Linear mixed-effects model fit by REML
 Data: mfa
       AIC      BIC   logLik
  1504.586 1536.938 -743.293

Random effects:
 Formula: ~diskht | tree
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev     Corr
(Intercept) 2.21069507 (Intr)
diskht      0.02712458 -0.678
Residual    3.31192368

Fixed effects: mfa ~ regname + diskht - 1
                    Value Std.Error  DF  t-value p-value
regnameAtlantic 20.237540 0.6221012  55 32.53095       0
regnameGulf     18.312019 0.9048791  55 20.23698       0
regnameHilly    18.449352 0.8925792  55 20.66971       0
regnamePiedmont 20.950184 0.6944940  55 30.16611       0
diskht          -0.116822 0.0149822 215 -7.79740       0
 Correlation:
                rgnmAt rgnmGl rgnmHl rgnmPd
regnameGulf      0.215
regnameHilly     0.248  0.132
regnamePiedmont  0.364  0.193  0.223
diskht          -0.636 -0.338 -0.390 -0.573

Standardized Within-Group Residuals:
       Min         Q1        Med        Q3       Max
-1.7613139 -0.7114237 -0.1082732 0.5346069 3.7535692

Number of Observations: 274
Number of Groups: 59

Questions:

• The models were fit by REML. What does that mean?
• Which model is better?
• How do we know if the model assumptions are met (diagnostics)?
• How do we predict MFA at a given height for a given tree? For the population of all trees from a given region?

Estimation and Inference in the LMM:

Estimation:

• In the classical linear model, the usual method of estimation is ordinary least squares.

• However, we saw that if we assume normal errors, then OLS gives the same estimates of β as maximum likelihood (ML) estimation.

In the LMM, there are fixed effects β, but also parameters related to the distribution of the random effects (e.g., variance components such as σ_b²) as well as parameters related to the error terms (e.g., the error variance σ²).

• Least-squares doesn’t provide a framework for estimation and inference for all of these parameters, so ML and related likelihood-based methods (i.e., restricted maximum likelihood, or REML) are generally preferred.

ML: recall that ML proceeds by finding the parameters that maximize the loglikelihood, or joint p.d.f., of the data.

• Finds the parameter values under which the observed data are most likely.

• Since the LMM assumes that the errors are normal, the random effects are normal, and the response y is linearly related to the errors and random effects via y = Xβ + Zb + e, it’s not hard to show that the LMM implies that the response vector y is normal too.

– That is, it’s easy to show that the observations from different clusters are independent, with

y_i ∼ N(X_i β, V_i), where V_i = Z_i D Z_i^T + σ²I.

• ⇒ the joint p.d.f. of the data is multivariate normal.
• ⇒ the loglikelihood is the log of a multivariate normal p.d.f.

– This loglikelihood is easy to write down, but requires iterative algorithm to maximize.

• Implemented optionally in lme() with the method="ML" option.

REML: Recall from the classical linear model that the MLE of σ² was biased.

• Did not adjust for d.f. lost in estimating β (fixed effects).

• Instead we used MSE as the preferred estimator of σ².

REML was developed as a general likelihood-based methodology that would be applicable to all LMMs, but which would

• take account of d.f. lost in estimation of β to produce less biased estimates of variance-covariance parameters (e.g., variance components) than ML;

• generalize the old, well-known, unbiased estimators in those simple cases of the LMM where such estimators are known;

– e.g., REML yields MSE as its estimator of σ² in the CLM.

• REML is based upon maximizing the restricted loglikelihood

– Can be thought of as that portion of the loglikelihood that doesn’t depend on β.

• Like ML estimation, REML requires an iterative algorithm to produce estimates.

• REML is the default estimation method for the lme() function and PROC MIXED in SAS.

• It’s generally regarded as the preferred method of estimation for LMMs.

– However, some aspects of inference are easier with ML, so sometimes competing models are fit and compared with ML, and then the “best” model is refit with REML at the end.
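– In lme() this workflow is just a matter of the method argument; a sketch, using the MFA model from LMM.R (data frame mfa assumed available):

# Fit with ML for model comparison, then refit the chosen model by REML.
library(nlme)
mfa.ml <- lme(mfa ~ regname + diskht - 1, data = mfa,
              random = ~ 1 | tree, method = "ML")
mfa.reml <- update(mfa.ml, method = "REML")   # REML is the default method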

Inference on Fixed Effects:

• Remember, the framework for estimation and inference in the LMM is ML or REML, not least-squares as in the CLM.

The standard methods of inference in a likelihood-based framework are Wald tests and likelihood ratio tests (LRTs).

• LRTs and Wald tests are based upon asymptotic theory. That is, they provide methods that hold exactly when the sample size goes to infinity and only approximately for finite sample sizes.

• LRTs are useful for comparing nested models.

– Shouldn’t be used for comparing random-effect structures/variance-covariance structures.

– Shouldn’t be used with REML, only ML.

• Wald tests are useful for testing linear hypotheses (e.g., contrasts) on fixed effects.

– Wald tests yield approximate z and chi-square tests.

– These tests can be improved as t and F tests to produce better inferences in small samples.

Wald Tests:

It can be shown that the approximate (i.e., large sample) distribution of the (restricted) ML estimator β̂ in the LMM is

\hat\beta \sim N\left(\beta,\ \underbrace{\left(\sum_{i=1}^n X_i^T V_i^{-1} X_i\right)^{-1}}_{=\,\mathrm{var}(\hat\beta)}\right), \qquad (\clubsuit)

where V_i = Z_i D Z_i^T + σ²I.

• In practice var(βˆ) is estimated by plugging in final (restricted) ML estimates obtained from fitting the model.

• Standard errors of β̂_j, the j-th component of β̂, are obtained as the square root of the j-th diagonal element of v̂ar(β̂).

The distributional result (♣) leads to the general Wald test on β.

In particular, for H_0: Aβ = c at level α, where A is k × p, we reject H_0 if

(Aβ̂ − c)^T {A [v̂ar(β̂)] A^T}^{-1} (Aβ̂ − c) > χ²_{1−α}(k),

where χ²_{1−α}(k) is the upper α critical value of a chi-square distribution on k d.f.

• As a special case, an approximate z test of H_0: β_j = 0 versus H_1: β_j ≠ 0 rejects H_0 if

|β̂_j| / s.e.(β̂_j) > z_{1−α/2},

where z_{1−α/2} is the (1 − α/2) quantile of a standard normal distribution.

• In addition, an approximate 100(1 − α)% CI for β_j is given by

β̂_j ± z_{1−α/2} s.e.(β̂_j).
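– For example, a large-sample Wald 95% CI for the height (diskht) coefficient can be computed from the fitted model mfa.lme1 (assumed available, as in LMM.R); a sketch:

# Large-sample Wald 95% CI for the diskht coefficient in mfa.lme1.
b  <- fixef(mfa.lme1)
se <- sqrt(diag(vcov(mfa.lme1)))
b["diskht"] + c(-1, 1) * qnorm(0.975) * se["diskht"]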

These Wald tests can be improved in small samples by using the t and F distributions in place of the z and χ².

• An approximate F test of H_0: Aβ = c is based on the test statistic

F = (Aβ̂ − c)^T {A v̂ar(β̂) A^T}^{-1} (Aβ̂ − c) / k, which is approximately ∼ F(k, ν).

• In addition, H_0: β_j = 0 can be tested via the test statistic β̂_j / s.e.(β̂_j), which is approximately ∼ t(ν).

What is the appropriate choice for the denominator d.f. ν in these tests?

• This is a question which is difficult to answer, in general.

• Pinheiro and Bates’ lme() function uses the “containment” method.

– This method produces the right answers in simple cases such as a split-plot model where those answers are known.

– This approach can give non-optimal answers in non-standard examples of the LMM, but it tends to work pretty well overall.

– Same method (essentially) is implemented in SAS’s PROC MIXED with the ddfm=contain option (which is the default).

– However, this is one place where PROC MIXED is superior to lme() because other, better approaches are implemented. In particular, the “Kenward-Roger” (ddfm=kr) method works well much more generally than the containment method.

• Approximate t and F tests are implemented in lme() in the summary() function and in the anova() function when the Terms or L options are specified.

Example — Microfibril Angle in Loblolly Pine (Continued)

• The random-intercept model is fit as mfa.lme1 in LMM.R:

y_ijk = µ_i + β height_ijk + b_ij + e_ijk.

• The hypothesis of equal means/intercepts is

H_0: µ_1 = µ_2 = µ_3 = µ_4,

which can be written as

H_0: Aβ = 0,

where

A\beta = \begin{pmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \end{pmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \beta \end{pmatrix} = \begin{pmatrix} \mu_1 - \mu_2 \\ \mu_2 - \mu_3 \\ \mu_3 - \mu_4 \end{pmatrix}.

• Can be tested with the anova function:

> A <- matrix( c(1,-1,0,0,0, 0,1,-1,0,0, 0,0,1,-1,0), nrow=3, byrow=T)
> A
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   -1    0    0    0
[2,]    0    1   -1    0    0
[3,]    0    0    1   -1    0
> fixef(mfa.lme1)
regnameAtlantic     regnameGulf    regnameHilly regnamePiedmont          diskht
     20.2679929      18.2667446      18.4252221      20.9489136      -0.1169664
> anova(mfa.lme1, L=A, type="marginal")   # use marginal, or "Type 3" SSs
F-test for linear combination(s)
  regnameAtlantic regnameGulf regnameHilly regnamePiedmont
1               1          -1            0               0
2               0           1           -1               0
3               0           0            1              -1
  numDF denDF  F-value p-value
1     3    55 3.679848  0.0173

• Alternatively, the same test can be obtained by generating the anova table for the equivalent model

y_ijk = µ + α_i + β height_ijk + b_ij + e_ijk,

which can be done as follows:

> mfa.lme1a <- lme(mfa ~ regname + diskht, data=mfa, random= ~1|tree)
> anova( mfa.lme1a, type="marginal")
            numDF denDF   F-value p-value
(Intercept)     1   214 1142.5667  <.0001
regname         3    55    3.6798  0.0173
diskht          1   214   63.2508  <.0001

• t-based confidence intervals for the parameters can be obtained with the intervals() function:

> intervals(mfa.lme1)
Approximate 95% confidence intervals

 Fixed effects:
                     lower        est.       upper
regnameAtlantic 19.0663446  20.2679929 21.46964124
regnameGulf     16.5031840  18.2667446 20.03030520
regnameHilly    16.6778398  18.4252221 20.17260436
regnamePiedmont 19.5978513  20.9489136 22.29997592
diskht          -0.1459550  -0.1169664 -0.08797776

LRTs:

In fairly broad generality, for models with a parametric likelihood function L(γ) depending on a parameter γ, nested models can be tested by examining the ratio

λ ≡ L(γ̂) / L(γ̂_0),

where

γ̂_0 = MLE under H_0 (under the null, or partial, model)
γ̂ = MLE under H_A (under the alternative, or full, model)

• Logic: If the observed data are much less likely under the simple than under the complex model, then λ will be large, and we should choose the complex one.

• If the models explain the data equally well, then λ ≈ 1 and we prefer H_0, the simpler model.

• We reject H0 for large values of λ, or equivalently, for large values of log(λ).

Asymptotic version of the test: Reject the partial model in favor of the full model if

2{log L(γ̂) − log L(γ̂_0)} > χ²_{1−α}(ν),

where

ν = (# parameters estimated under full model) − (# parameters estimated under partial model)
  = number of restrictions imposed by H_0.

• LRTs generalize the F test for nested models in the CLM.

• LRTs implemented in lme software via the anova() function.

• The maximized loglikelihood for any fitted model is given by the logLik() function.

• Important: LRTs should never be performed for two models with different fixed effect specifications when using REML, only with ML!

Example — Microfibril Angle in Loblolly Pine (Continued)

• Suppose we believe that MFA changes in a quadratic way with height. Then we might consider the model

y_ijk = µ_i + β_1 height_ijk + β_2 height²_ijk + b_ij + e_ijk.

• If we fit this model and the linear in height model with ML, then a LRT can be done to test the quadratic effect in height.

> mfa.lme3.ML <- lme(mfa ~ regname + diskht + I(diskht^2) -1 ,
+                    data=mfa, random= ~1|tree, method="ML")        # full model
> mfa.lme1.ML <- update(mfa.lme3.ML, fixed= ~ regname + diskht -1 ) # partial model
>
> anova(mfa.lme3.ML, mfa.lme1.ML)   # do LRT of diskht^2
            Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mfa.lme3.ML     1  8 1322.110 1351.015 -653.0552
mfa.lme1.ML     2  7 1498.380 1523.672 -742.1899 1 vs 2 178.2694  <.0001
> 2*(logLik(mfa.lme3.ML)[1] - logLik(mfa.lme1.ML)[1])
[1] 178.2694
> summary(mfa.lme3.ML)$tTable   # gives Wald-based t-test on diskht^2,
>                               # alternative to LRT
                      Value    Std.Error  DF   t-value      p-value
regnameAtlantic 25.29239285 0.6096669291  55  41.48559 3.437954e-43
regnameGulf     23.70407853 0.8796914640  55  26.94590 2.269282e-33
regnameHilly    23.61188490 0.8702167111  55  27.13334 1.591075e-33
regnamePiedmont 25.94446296 0.6800771466  55  38.14929 2.969921e-41
diskht          -0.74916125 0.0395495622 214 -18.94234 1.212177e-47
I(diskht^2)      0.01308780 0.0007938345 214  16.48682 5.789352e-40
> anova(mfa.lme3.ML, Terms=3)   # Wald F test, just square of previous
F-test for: I(diskht^2)
  numDF denDF  F-value p-value
1     1   214 271.8151  <.0001

• Wald F test obtained above from the anova() function using the Terms option.

• Using either the LRT or the Wald F test, it's clear that the quadratic term in height is necessary.

Which test do I use?

Recommendation: use Wald-based F tests for inference on fixed effects.

Inference on Random Effects, Var-Cov Structure:

Inference on the variance-covariance structure (e.g., variance components, serial correlation parameters, heteroscedasticity parameters) is complicated by a number of theoretical and technical difficulties.

• For example, we may want to test H0 : σ²_b = 0, i.e., that the variance component associated with some random effect is zero.

– Under H0 the corresponding random effect is zero.

– Complications are caused by σ²_b being constrained to be ≥ 0 and by the fact that the null places σ²_b at the boundary of its set of possible values.

• In principle LRTs still apply, but the reference distribution is not necessarily χ², and is difficult to determine.

• Wald tests apply as well, but can perform extremely poorly unless sample size is very large.

Instead, a simpler but somewhat less formal approach to inference on the var-cov structure is via model selection criteria.

• The two most common model selection criteria are AIC and BIC.

– Both are based on the maximized value of the loglikelihood or restricted loglikelihood.

– Idea is to choose model under which data are most likely, but each criterion imposes a penalty for model complexity (lack of parsimony).

– Penalties differ for AIC, BIC.

– Hard to say which is better; BIC tends to lead to simpler models, AIC is slightly more commonly used.

• Use: choose one or the other criterion, then choose model which minimizes that criterion.
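A minimal sketch of this use (fm.a and fm.b are hypothetical fitted lme objects with identical fixed-effects specifications):

AIC(fm.a, fm.b)    # smaller is better
BIC(fm.a, fm.b)
anova(fm.a, fm.b)  # reports AIC, BIC, and logLik side by side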

Example — Microfibril Angle in Loblolly Pine (Continued)

• We’ve now concluded that the following quadratic (in height) model for MFA is better than the linear one. Model:

y_ijk = µ_i + β1 height_ijk + β2 height²_ijk + b_ij + e_ijk   (mfa.lme3)

• We now consider whether the linear and quadratic effects of height vary from subject to subject. That is, we consider models

y_ijk = µ_i + β1 height_ijk + β2 height²_ijk + b_1ij + b_2ij height_ijk + e_ijk   (mfa.lme4)

y_ijk = µ_i + β1 height_ijk + β2 height²_ijk + b_1ij + b_2ij height_ijk + b_3ij height²_ijk + e_ijk   (mfa.lme5)

• These models differ in their random effects structure.

• Can't test model mfa.lme4 vs. mfa.lme3 or mfa.lme5 vs. mfa.lme3 with an LRT or Wald test.

– Instead use AIC or BIC to choose best model. Smallest AIC, say, wins:

> mfa.lme3 <- update(mfa.lme3.ML, method="REML")
> mfa.lme4 <- update(mfa.lme3, random= ~diskht|tree)
> mfa.lme5 <- update(mfa.lme3, random= ~diskht+I(diskht^2)|tree)
> anova(mfa.lme3, mfa.lme4, mfa.lme5)
         Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mfa.lme3     1  8 1338.214 1366.942 -661.1069
mfa.lme4     2 10 1329.896 1365.805 -654.9477 1 vs 2 12.31831  0.0021
mfa.lme5     3 13 1318.841 1365.524 -646.4205 2 vs 3 17.05449  0.0007

• Conclusion: according to both AIC and BIC, model mfa.lme5 is the best model considered so far.

• LRTs and the p-values given here should not be used in this context.

Prediction of Random Effects:

To keep things relatively simple, let’s return to the random intercept (only) version of the quadratic-in-height model for MFA:

y_ijk = µ_i + β1 height_ijk + β2 height²_ijk + b_ij + e_ijk   (mfa.lme3)

• Mean response in the model is the fixed part:

E(y_ijk) = µ_i + β1 height_ijk + β2 height²_ijk   (†)

• Describes the average behavior over population of all trees from which the trees in the study were drawn.

–(†) is estimated by plugging in the parameter estimates.

• The model also “localizes” to the individual tree level. The (i, j)th tree behaves somewhat differently than average. That tree's mean MFA is described by

µ_i + β1 height_ijk + β2 height²_ijk + b_ij   (‡)

– Since b_ij is random, the quantity above is random.

– Makes sense, because it describes a particular tree's responses, where that tree is random (randomly drawn from a population).

– (‡) is predicted by plugging in parameter estimates and predictions of the b_ij's.

Terminology:

— We estimate fixed, unknown constants (parameters), or quantities depending only on parameters, like (†).

— We predict unknown random variables (random effects), or quantities involving random effects, like (‡).

The preferred method of predicting the random effects is to use estimated best linear unbiased predictors (BLUPs).

These predictions of the (unobserved) random effects are based upon

— the response vector y (observed)

— the distribution of the random effects according to the model (e.g., iid ∼ N(0, σ²_b)) (estimated from the fitted model).

— the joint distribution of y and the random effects (estimated from fitted model).

• Estimated BLUPs of the random effects given by the ranef() function in S-PLUS/R.

• E.g., for model mfa.lme3, these predictions given by

> ranef(mfa.lme3)
    (Intercept)
1  -1.643437203
2  -0.481270493
...
51 -0.908065397
52  0.900232600
53 -0.965476440
54  0.880095610
55 -0.090215557
56 -0.009334064
57  0.541355263
58 -0.145192338
59 -0.203399677

• Estimates of (†) can be obtained from the predict() function by specifying level=0 (population level).

– Can think of this as plugging in 0 for random effects in fitted model equation.

– Appropriate for 1) estimating the mean for population of all trees; or for 2) predicting the response for a tree whose random effect is unknown (a new tree).

• Predictions of (‡) can be obtained from the predict() function by specifying level=1 (first level or tree level, in this case).

• A nice way to plot the data, the population-level estimates, and the tree-level predictions is via augPred(); a sketch of these calls appears after the plot below.

– Using this function for the Hilly region only (trees 51–59) produces the following plot:

[Figure: “Data from Hilly Region only”: lattice plot of MFA vs. height for trees 51–59, one panel per tree, showing the data with the population-level (fixed) and tree-level fitted curves; x-axis: Height on stem (m); y-axis: Whole-disk cross-sectional microfibril angle (deg).]
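A sketch of the calls just described, using model mfa.lme3 from the text (newdat is a hypothetical data frame containing regname, diskht, and tree):

pop.est  <- predict(mfa.lme3, newdata = newdat, level = 0)  # estimates of (†)
tree.prd <- predict(mfa.lme3, newdata = newdat, level = 1)  # predictions of (‡)
plot(augPred(mfa.lme3, level = 0:1))  # data + population- and tree-level curves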

Model Diagnostics:

As in classical linear models, residual plots are the work-horse of model diagnostics.

In a LMM, however, we need to think a bit more carefully about what we mean by “residuals”.

• Residuals can be based on the difference between a response and its estimated mean: y_ij − µ̂_ij, where µ̂_ij is the estimated mean.

– These are level 0 (population-level) residuals

• Alternatively, residuals can be based on the difference between a response from the ith cluster and its cluster-specific predicted value:

y_ij − (µ̂_ij + b̂_i), where µ̂_ij + b̂_i is the predicted value.

– These are level 1 (cluster-level) residuals

• Typically, we are interested in the cluster-level residuals.

– Residuals are extracted via the residuals() (or resid() for short) function. The option type controls the residual type (raw, Pearson, etc.).

– Fitted values are extracted with the fitted() function.

• The general form of the command for residual plots is

plot(fitted.model, y ~ x, options)

• In LMM.R, we produce various standard residual plots.
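A sketch of a few such plots for model mfa.lme3 (the exact set produced in LMM.R may differ):

plot(mfa.lme3, resid(., type = "p") ~ fitted(.) | regname, abline = 0)
qqnorm(mfa.lme3, ~ resid(., type = "p"))  # normality of within-tree errors
qqnorm(mfa.lme3, ~ ranef(.))              # normality of the random effects
plot(mfa.lme3, mfa ~ fitted(.))           # observed vs. fitted values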

• The resulting conclusion is that the model is misspecified. It appears that

– the shape of the MFA vs height function depends on region.

– model is often underpredicting mfa at breast height.

• After considering various alternatives, a much improved model is one where we treat diskht as a factor, and use a standard two-way layout model with random tree effects:

y_ijk = µ + α_i + β_k + (αβ)_ik + b_ij + e_ijk   (mfa.lme8)

where
    µ = grand mean
    α_i = region effects
    β_k = disk height effects
    (αβ)_ik = disk height × region interactions
    b_ij = random tree effect
    e_ijk = within-tree error term

Extensions of LMMs

Accommodating Heteroscedasticity (Non-constant Variance):

In the LMM as we’ve presented it so far, the error terms are assumed to have constant variance.

• I.e., if we write the model as

y_i = X_i β + Z_i b_i + e_i

then we assume

    var(e_i) = σ²I =
        [ σ²  0  ⋯  0  ]
        [ 0   σ² ⋯  0  ]
        [ ⋮   ⋮   ⋱  ⋮  ]
        [ 0   0  ⋯  σ² ]

– Constant variance among errors (also implies the response has constant variance).

Such an assumption is often unrealistic and can be violated both empirically and theoretically.

• A common example is variance which increases with the magnitude of the response.

– E.g., heights of tall trees more variable than those of short.

• Another example is where the error variance differs across groups.

– Intensively managed trees less variable than natural stands.

• Another possibility is that variability depends upon a covariate.

– E.g., variability in tree heights decreases with increasing site quality (site index, or a soil quality measure).

Such non-constant variance can be accommodated by modeling the variance as a function of covariates, factors (groups), and/or the mean response, in addition to one or more unknown parameters.

• In particular, the lme software allows the error variance to be of the form

    var(e_ij) = σ² g²(v_i, δ)

where

v_i = a vector of one or more variance covariates,
δ = a vector of unknown variance parameters to be estimated,
g²(·) = a known variance function.

• We allow v_i, the variance covariates, to include µ_i = E(y_i), the mean response.

– This requires a more complex fitting algorithm and a bit different theory than the standard ML, REML theory that applies when the variance doesn't depend on the mean.

– However, from the user's perspective, variance depending on the mean causes no complication and is extremely useful.

Variance Functions Available in the lme/nlme Software:

• Variance functions in the nlme software are described in §5.2.1 in Pinheiro and Bates (2000). Here, we give only brief descriptions.

1. varFixed. The varFixed variance function is g²(v_i) = v_i. That is,

var(e_i) = σ² v_i.

– Says that the error variance is proportional to the value of a covariate.

– This is the traditional form.

2. varIdent. This variance specification corresponds to different variances at each level of some stratification (grouping) variable s.

3. varPower. This generalizes the varFixed function so that the error variance can be a to-be-estimated power of the magnitude of a variance covariate:

var(e_i) = σ²|v_i|^(2δ), so that g²(v_i, δ) = |v_i|^(2δ).

The power is taken to be 2δ rather than δ so that s.d.(e_i) = σ|v_i|^δ.

A very useful specification is to take the variance covariate to be the mean response. That is,

var(e_i) = σ²|µ_i|^(2δ)

4. varComb. Finally, the varComb class allows the other variance classes to be combined so that the variance function of the model is a product of two or more component variance functions.
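Sketches of how each class is passed to lme() through the weights argument (y, x, and grp are illustrative names, not variables from the MFA data):

lme(y ~ x, random = ~1|tree, weights = varFixed(~ x))            # var prop. to x
lme(y ~ x, random = ~1|tree, weights = varIdent(form = ~1|grp))  # per-stratum variances
lme(y ~ x, random = ~1|tree, weights = varPower(form = ~ x))     # sigma^2 * |x|^(2*delta)
lme(y ~ x, random = ~1|tree,
    weights = varComb(varIdent(form = ~1|grp), varPower(form = ~ x)))  # product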

Example — Microfibril Angle in Loblolly Pine (Continued)

We return to the two-way layout model with random tree effects:

yijk = µ + αi + βk +(αβ)ik + bij + eijk (mfa.lme8)

• The residuals vs. fitted values plot shows increasing variance as the fitted values increase.

• Appears that variance increases with the mean response, not systematically with a covariate such as height.

• Can allow variance to be proportional to a power of the mean as follows:

> mfa.lme9 <- update(mfa.lme8, weights=varPower(form= ~fitted(.)))
> anova(mfa.lme9, mfa.lme8)
         Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mfa.lme9     1 23 1205.206 1286.564 -579.6028
mfa.lme8     2 22 1261.743 1339.564 -608.8712 1 vs 2 58.53691  <.0001
> summary(mfa.lme9)
Linear mixed-effects model fit by REML
 Data: mfa
       AIC      BIC    logLik
  1205.206 1286.564 -579.6028

Random effects:
 Formula: ~1 | tree
        (Intercept)   Residual
StdDev:    1.987213 0.00575837

Variance function:
 Structure: Power of variance covariate
 Formula: ~fitted(.)
 Parameter estimates:
   power
2.018668

• Fitted variance model says:

s.d.(e_i) = σ̂|µ̂_i|^δ̂ = 0.00575837 |µ̂_i|^2.018668

• According to AIC and BIC, the heteroscedastic model mfa.lme9 is a big improvement over mfa.lme8.

• Residual plots look much better too.

Accommodating Serial Correlation:

In the LMM as we’ve presented it so far, the error terms are assumed to be uncorrelated.

• Only correlation among the responses due to shared random effects.

– Accounts for shared characteristics.

– Ignores serial correlation.

• Serial correlation (autocorrelation) is temporally or spatially structured.

– Typically, observations close together in time/space are more similar than observations far apart.

• We can accommodate serial correlation by relaxing our independent errors assumption.

• Instead, we model the correlation among error terms as a function of time lag, spatial distance, and unknown parameters. The lme/nlme software allows a correlation model of the form

corr(e_ij, e_ik) = h{d(p_ij, p_ik), ρ}

where
    ρ = a vector of correlation parameters,
    h(·) = a known correlation function,
    p_ij, p_ik = position variables corresponding to observations y_ij, y_ik,
    d(·, ·) = a known distance function.

• The correlation function h(·) is assumed continuous in ρ, returning values in [−1, +1]. In addition, h(0, ρ) = 1, so that observations that are 0 distance apart (identical observations) are perfectly correlated.

Correlation Structures Available in the lme/nlme Software:

• Correlation structures in the nlme software are described in §5.3 in Pinheiro and Bates (2000).

• There are also several spatial correlation structures.

• Here, we give brief descriptions of the serial and general correlation structures.

Serial Correlation Structures:

1. corAR1. Autoregressive of order 1.

• Appropriate for observations taken at evenly spaced time points.

• E.g., for e_i = (e_i1, e_i2, ..., e_it)^T, taken at times 1, 2, ..., t, the model says

    corr(e_ij, e_ik) = ρ^|j−k|

– E.g., for t = 5 the model implies

    corr(e_i) =
        [ 1   ρ   ρ²  ρ³  ρ⁴ ]
        [ ρ   1   ρ   ρ²  ρ³ ]
        [ ρ²  ρ   1   ρ   ρ² ]
        [ ρ³  ρ²  ρ   1   ρ  ]
        [ ρ⁴  ρ³  ρ²  ρ   1  ]

2. corCAR1. This correlation structure is a continuous-time version of an AR(1) correlation structure. The specification is the same as in corAR1, but now the covariate indexing time can take any non- negative non-repeated value and we restrict φ ≥ 0.

3. corARMA. This correlation structure corresponds to an ARMA(p, q) model. AR(p) and MA(q) models can be specified with this function, but keep in mind that the corAR1 specification is more efficient than specifying corARMA with p = 1 and q = 0.

General Correlation Structures:

1. corCompSymm. Compound symmetry. In this structure,

    corr(e_ij, e_ik) = 1 if j = k, and ρ if j ≠ k.

• Same correlation structure as implied by a random cluster-specific intercept with independent errors (e.g., split-plot model).

2. corSymm. Specifies a completely general correlation structure with a separate parameter for every non-redundant correlation.

• E.g., for e_i = (e_i1, e_i2, e_i3, e_i4, e_i5)^T,

    corr(e_i) =
        [ 1    ρ1   ρ2   ρ3   ρ4  ]
        [ ρ1   1    ρ5   ρ6   ρ7  ]
        [ ρ2   ρ5   1    ρ8   ρ9  ]
        [ ρ3   ρ6   ρ8   1    ρ10 ]
        [ ρ4   ρ7   ρ9   ρ10  1   ]
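Sketches of how these structures are requested through the correlation argument (fit0, time, and tree are hypothetical/illustrative names):

update(fit0, correlation = corAR1(form = ~1 | tree))       # AR(1), integer lags
update(fit0, correlation = corCAR1(form = ~ time | tree))  # continuous-time AR(1)
update(fit0, correlation = corARMA(p = 1, q = 1, form = ~1 | tree))
update(fit0, correlation = corCompSymm(form = ~1 | tree))  # compound symmetry
update(fit0, correlation = corSymm(form = ~1 | tree))      # general correlation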

Q: How do we choose a correlation structure?

• Hard question.

• In a single dimension (e.g., time, position on the bole), AR(1) models are often sufficient.

If we are willing to consider other ARMA models, two tools that are useful in selecting the right ARMA model are the sample autocorrelation function (ACF) and the sample partial autocorrelation function (PACF).

• The ACF is produced in the lme/nlme software with the ACF() function. The PACF is harder to obtain.

• AR(p) models have PACFs that are non-zero for lags ≤ p and 0 for lags > p. Therefore, we can look at the magnitude of the sample PACF to try to identify the order of an AR process that will fit the data. The number of “significant” partial autocorrelations is a good guess at the order of an appropriate AR process.

• MA(q) models have ACFs that are nonzero for lags ≤ q and 0 for lags >q. Again, we can look at the sample ACF to choose q.

• Simpler approach: trial and error until ACF looks good.
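A sketch of both tools, assuming the MFA model mfa.lme9 (using base R's pacf() on a single tree's residual series is a crude workaround, and tree "1" is an illustrative label):

plot(ACF(mfa.lme9, resType = "normalized"), alpha = 0.05)  # sample ACF from nlme
r <- resid(mfa.lme9, type = "normalized")
pacf(r[mfa$tree == "1"])  # per-tree PACF; short series, so interpret cautiously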

Example — Microfibril Angle in Loblolly Pine (Continued)

• In the MFA example, the following code produces the ACF plot below:

> plot(ACF(mfa.lme9,resType="n", form=~1|tree),alpha=.05)

[Figure: empirical autocorrelation function of the normalized residuals from mfa.lme9, lags 0–4, with 95% bounds; x-axis: Lag; y-axis: Autocorrelation.]

• The plot doesn't look great, but it isn't terrible either.

• Can try a continuous-time AR(1) correlation structure as follows:

> mfa.lme10 <- update(mfa.lme9, corr=corCAR1(form= ~diskht|tree))
> anova(mfa.lme10, mfa.lme9)
          Model df      AIC      BIC    logLik   Test   L.Ratio p-value
mfa.lme10     1 24 1206.418 1291.314 -579.2088
mfa.lme9      2 23 1205.206 1286.564 -579.6028 1 vs 2 0.7879748  0.3747

• Doesn’t help. No strong evidence of serial correlation here and it probably can be safely ignored.

Multilevel Models

Sometimes data are clustered at two or more nested levels. E.g.,

• Educational data: students’ test scores are clustered by class within school, school within school district.

• Multisite clinical trials with repeated measures: observations are clustered by patient within clinic, and by clinic within the overall study.

• Forestry data: repeated measures on trees clustered by tree within plot, plot within stand, etc.

Example — Microfibril Angle in Loblolly Pine (Continued)

When introducing these data, I lied and simplified the description of the sampling design.

• In reality, several stands were sampled within each physiographic region, and then three trees per stand were sampled.

– Data clustered by tree within stand, and by stand within region.

– May want to account for stand-level heterogeneity, tree-level heterogeneity separately, with nested random effects.

Now let y_ijkℓ be the response at the ℓth height, for the kth tree, in the jth stand, in the ith region.

Multilevel extension of our two-way anova with random tree-specific intercepts:

y_ijkℓ = µ + α_i + β_ℓ + (αβ)_iℓ + b_ij + b_ijk + e_ijkℓ   (mfa2.lme2)

• This model is fit as mfa2.lme2 in extend.R. But first we must construct a two-level groupedData object mfa2:

mfa2 <- groupedData(mfa ~ diskht | stand/tree, data=mfa,
          labels=list(x="Height on stem",
                      y="Whole-disk cross-sectional microfibril angle"),
          units=list(x="(m)", y="(deg)"), order.groups=F)

• Now refit model mfa.lme9 to the new data set (call it mfa2.lme1), and then update by adding a stand-level random intercept:

> mfa2.lme2 <- update(mfa2.lme1, random=list(stand= ~1, tree= ~1))
> anova(mfa2.lme1, mfa2.lme2)
          Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mfa2.lme1     1 23 1205.206 1286.564 -579.6028
mfa2.lme2     2 24 1198.543 1283.439 -575.2714 1 vs 2 8.662672  0.0032
> summary(mfa2.lme2)
Linear mixed-effects model fit by REML
 Data: mfa2
       AIC      BIC    logLik
  1198.543 1283.439 -575.2714

Random effects:
 Formula: ~1 | stand
        (Intercept)
StdDev:    1.586422

 Formula: ~1 | tree %in% stand
        (Intercept)    Residual
StdDev:    1.401302 0.007036924

Variance function:
 Structure: Power of variance covariate
 Formula: ~fitted(.)
 Parameter estimates:
   power
1.948697

• According to AIC, BIC, two-level model fits better.

• The estimated variance component from stand to stand is 1.586².

• The estimated variance component between trees within stands is 1.401².

Nonlinear Mixed Effects Models

A Motivating Example — Circumference of Orange Trees

The data in the table below are the circumferences of five orange trees over time.

                         Tree No.
Time (days)      1      2      3      4      5
    118         30     33     30     32     30
    484         58     69     51     62     49
    664         87    111     75    112     81
   1004        115    156    108    167    125
   1231        120    172    115    179    142
   1372        142    203    139    209    174
   1582        145    203    140    214    177

A plot of the data, with observations from the same tree connected, appears below.

[Figure: “Orange Tree Data w/ NLS Fit”: observed growth curves for Trees 1–5 and the single fitted logistic curve; x-axis: Time since December 31, 1968 (days), 500–1500; y-axis: Trunk circumference (mm), 0–200.]

Also displayed in this plot is the fitted curve from a logistic function fit with NLS.

That is, if we let y_ij = circumference of the ith tree at age t_ij, i = 1, ..., 5, j = 1, ..., 7, then the fitted model is

y_ij = θ1 / (1 + exp[−(t_ij − θ2)/θ3]) + e_ij   (m1Oran.gnls)

where {e_ij} iid ∼ N(0, σ²).
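Presumably the fit was obtained along these lines (a sketch using the built-in Orange data and the self-starting logistic SSlogis(); the parameter names th2 and th3 are chosen to match the text):

library(nlme)
m1Oran.gnls <- gnls(circumference ~ SSlogis(age, Asym, th2, th3), data = Orange)
summary(m1Oran.gnls)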

• Clearly, model (m1Oran.gnls) is inadequate.

Fitted curve goes through the center of the combined data from all trees, but growth curves of individual trees are poorly estimated.

• Because the growth curves of the different trees spread out as the trees get older, this misspecification will show up as a cone-shaped residuals vs. fitted values plot, suggesting heteroskedasticity.

• But heteroskedasticity is not the problem: it is only (or at least mainly) between-tree variability that is increasing over time. The within-tree error variance looks to be homoskedastic.

Another problem with m1Oran.gnls: it treats observations as independent. Two obvious potential sources of correlation in these data:

1. Clustering. The data are grouped, or clustered, by tree. Observations from the same tree should share characteristics of that tree which make them similar, or correlated.

– Minimized by very homogeneous groups.

2. Serial Dependence. Observations close together in time will tend to be correlated more highly than observations far apart.

– Often reduced by long lags between measurements, and/or homogeneous environmental conditions through time.

The first of these sources almost certainly affects the orange tree data and the second may as well.

To deal with these problems, we could fit different parameters to each tree, or perhaps it would be sufficient to just fit different asymptotes, or different rate parameters, to each tree.

E.g., the model with 5 separate asymptote parameters, one for each tree, is:

    y_ij = θ_1i / (1 + exp[−(t_ij − θ2)/θ3]) + e_ij   (m2Oran.gnls)

where we assume corr(e_i) = C(ρ), where C is an assumed form for the within-group correlation matrix, depending on an unknown parameter ρ.

• In m2Oran.gnls, C(ρ) = I, but we could fit an AR(1) or other correlation structure.

While this approach is clearly an improvement over (m1Oran.gnls), it has some disadvantages:

A. # of parameters grows with sample size. In (m2Oran.gnls) we've introduced a distinct fixed asymptote parameter for each tree. Therefore, if we had measured 500 trees, our model would have 502 regression parameters.

Having the number of parameters increase with the sample size introduces a number of problems:

• Theoretical: in ML and LS estimation, asymptotic arguments establishing consistency, optimality break down.

• Computational: Difficult to optimize a criterion of estimation with respect to many parameters.

• Interpretation: We have 500 separate asymptotes and no single parameter describing the average limit of growth. Do we really care what the limit of growth was for tree #391?

• Conceptual: θ_1i is the asymptote parameter for tree i. That is, it's the fixed theoretical population constant for the limit of growth for tree i. But what's the population? And why is the asymptote of tree i a fixed constant? Wasn't tree i randomly selected from a population of trees? If so, the asymptote of this randomly drawn tree should be regarded as a random variable, not a parameter.

• Scope of inference: Results apply to sample at hand, not to population from which the trees were drawn.

B. Correlation structure. The correlation structure in model (m2Oran.gnls) accounts for within-tree correlation by modelling source 2 (serial correlation) but not source 1 (grouping correlation). It is often difficult and unnecessary to model both sources, but for short series, modelling 2 is often harder than modelling 1.

That is, it is often not easy to fit an ARMA model to the within-group observations through time. This can be so because of:

• Short series.

• Non-stationary series.

• Unbalanced/missing data and/or irregular or continuous time indexing.

• Having many fixed, cluster-specific parameters to better fit the data from each cluster (tree, plot, etc.) in nonlinear growth curve models is (essentially) the approach taken in the self-referencing functions popular in forest biometrics.

– I’m not a fan.

An alternative: A nonlinear mixed-effects model (NLMM) for the orange tree data.

Again, our fixed effects nonlinear model (m2Oran.gnls) with 5 separate tree-specific asymptotes is

y_ij = θ_1i / (1 + exp[−(t_ij − θ2)/θ3]) + e_ij   (m2Oran.gnls)

Using an ANOVA-type parameterization for θ_1i we can write θ_1i = θ̄1 + τ_i, where Σ_{i=1}^{5} τ_i = 0. Here θ̄1 is the average or typical θ1-value (asymptote) and τ_i is the ith tree effect.

Under this parameterization, model (m2Oran.gnls) becomes

y_ij = (θ̄1 + τ_i) / (1 + exp[−(t_ij − θ2)/θ3]) + e_ij,   Σ_{i=1}^{5} τ_i = 0.

In the ordinary model, the θ’s and the τ’s are all considered to be fixed unknown parameters, a.k.a. fixed effects.

In the NLMM, we consider the τ_i's to be random variables, or random effects. τ_i is the deviation from θ̄1 of the asymptote of the ith tree; it is considered to be random because the tree itself is a randomly selected representative element of the population to which we want to generalize.

Changing symbols from τ_i to b_i, the model becomes

y_ij = (θ1 + b_i) / (1 + exp[−(t_ij − θ2)/θ3]) + e_ij,   (†)

where b_1, ..., b_5 iid ∼ N(0, σ²_b) and {e_ij} iid ∼ N(0, σ²).

Here we've also dropped the bar from θ̄1.

• Now the asymptote for the ith tree is θ1 + b_i, a random variable because b_i is a random variable. The asymptote for the typical tree is θ1 (when b_i = 0).

• If we write θ_1i ≡ θ1 + b_i, then we have that the 5 asymptotes are randomly distributed around θ1: θ_11, ..., θ_15 iid ∼ N(θ1, σ²_b).

Fitting Model (†):

The fact that the random effects {b_i} enter into the NLMM (†) nonlinearly complicates the methodology and theory of NLMMs substantially compared to ordinary NLMs and LMMs.

• To focus on the motivation, interpretation, and basic ideas of NLMMs, we temporarily skip this material and just assume that the nlme() function in S-PLUS can fit an NLMM with a “good” method.

• See the R script NLMM.R, where we analyze these data with NLMMs.

• In this script, we first fit models (m1Oran.gnls) and (m2Oran.gnls). We then fit the NLMM (†) as m1Oran.nlme using the nlme() func- tion.

• Notice that the NLMM (m1Oran.nlme) has estimated regression parameters θ̂ similar to the estimated regression parameters in the fixed-effects model that fit a mean curve to all the data, ignoring tree effects (m1Oran.gnls):

> m1Oran.nlme <- nlme(circumference ~ SSlogis(age,Asym,th2,th3), data=Orange,
+     fixed= Asym+th2+th3~1, random= Asym~1, start=coef(m1Oran.gnls))
>
> fixef(m1Oran.nlme)   # compare fixed effect estimates
    Asym      th2      th3
191.0499 722.5590 344.1681
> coef(m1Oran.gnls)
    Asym      th2      th3
192.6876 728.7564 353.5337
> summary(m1Oran.nlme)
Nonlinear mixed-effects model fit by maximum likelihood
  Model: circumference ~ SSlogis(age, Asym, th2, th3)
 Data: Orange
       AIC      BIC    logLik
  273.1691 280.9459 -131.5846

Random effects:
 Formula: Asym ~ 1 | Tree
            Asym Residual
StdDev: 31.48255 7.846255

• Variability in the asymptotes from tree to tree is captured through b_i, which is assumed normal, mean 0, with estimated variance σ̂²_b = (31.48)². The error variance is estimated to be σ̂² = (7.85)².

> AIC(m1Oran.nlme, m1Oran.gnls, m2Oran.gnls)
            df      AIC
m1Oran.nlme  5 273.1691
m1Oran.gnls  4 324.7974
m2Oran.gnls  8 254.1040

• The NLMM m1Oran.nlme has AIC=273.2, BIC=280.9 for 5 estimated parameters: θ1, θ2, θ3, σ²_b, σ². This compares with AIC=324.8, BIC=331.0 for the 4-parameter model (m1Oran.gnls) and AIC=254.1, BIC=266.5 for the 8-parameter model (m2Oran.gnls).

• So, the addition of random effects in the asymptote of (m1Oran.gnls) only costs us 1 df and results in a vast improvement in fit.

• Fit is even better when fitting separate asymptotes to each tree (m2Oran.gnls), but that shouldn’t be surprising.

– In (m1Oran.nlme) we save on df in comparison to (m2Oran.gnls) by making a parametric assumption on the distribution of the random effects: that they’re normal with only an unknown variance to be estimated.

– In contrast, model (m2Oran.gnls) doesn't make any assumption about the tree-to-tree variability in asymptotes; it separately estimates each asymptote.

However:

– the problems with many cluster-specific parameters cited above; and
– the advantage would go away if we had more trees in the data set. Then the penalty for lack of parsimony would increase, and model m2Oran.gnls would have higher AIC, BIC than m1Oran.nlme.

• Of course the residuals of model (m1Oran.gnls) looked terrible because the individual trees were poorly fit by the average curve. The residuals of models (m2Oran.gnls) and (m1Oran.nlme) look about equally good.

Our fitted model for tree i at time t_ij is

    ŷ_ij = (θ̂1 + b̂_i) / (1 + exp[−(t_ij − θ̂2)/θ̂3])

• The b̂_i's aren't estimated parameters of the model. They're predicted quantities based on the fitted model, the data, and the assumption that b_1, ..., b_5 iid ∼ N(0, σ²_b).

The b̂_i's are as follows (note these aren't sorted by tree #):

> ranef(m1Oran.nlme)
        Asym
3 -37.000247
1 -29.403585
5  -5.179485
2  31.565006
4  40.018311

Therefore, the predicted circumference of tree 1, say, at time t_1j is given by

    ŷ_1j = (191.0 − 29.4) / (1 + exp[−(t_1j − 722.6)/344.2])
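The same prediction can be obtained from predict() (the ages in newdat are illustrative):

newdat <- data.frame(age = c(500, 1000, 1500), Tree = "1")
predict(m1Oran.nlme, newdata = newdat, level = 1)  # tree-1 curve, uses b1-hat
predict(m1Oran.nlme, newdata = newdat, level = 0)  # population curve (b = 0)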

• Plugging the b̂_i's into the fitted model equation yields the pink curves (tree-level), and plugging in b_i = 0 yields the blue curves (population-level) in the following plot, obtained from the augPred() function:

[Figure: augPred() plot for m1Oran.nlme, one panel per tree (3, 1, 5, 2, 4), showing the data, the tree-level curves, and the common fixed (population-level) curve; x-axis: Time since December 31, 1968 (days); y-axis: Trunk circumference (mm).]

• Finally, we examine the ACFs for models (m2Oran.gnls) and (m1Oran.nlme).

[Figure: “ACF, Model m2Oran.gnls”: residual autocorrelation by lag (0–6).]
[Figure: “ACF, Model m1Oran.nlme”: residual autocorrelation by lag (0–6).]

• The ACFs of these two models, (m2Oran.gnls) and (†), are similar — the two models “account for” the residual correlation structure similarly and adequately here.

The NLME Model Formulation

• We consider a single level of clustering, before tackling the multilevel case.

Formulation for Single Level Data:

Let y_ij denote the jth observation (e.g., through time) on the ith cluster (e.g., subject, tree, plot), where we have n clusters and t_i observations in the ith cluster. Let w_ij be a vector of covariates corresponding to response y_ij.

The general form of the NLMM for this situation is

y_ij = f(θ_ij, w_ij) + e_ij,   i = 1, ..., n,  j = 1, ..., t_i,
θ_ij = X_ij β + Z_ij b_i,                                        (∗)

b_1, ..., b_n iid ∼ N(0, D),   {e_ij} iid ∼ N(0, σ²)

where

β = a p × 1 vector of fixed effects,
b_i = a q × 1 vector of cluster-specific random effects with var-cov matrix D,
X_ij = a model/design matrix for β,
Z_ij = a model/design matrix for b_i.

• Note that we assume homoscedastic, uncorrelated errors for now, but this can be relaxed as in the LMM.

Model (*) can be equivalently expressed in matrix form as

y_i = f_i(θ_i, w_i) + e_i,
θ_i = X_i β + Z_i b_i,        (∗∗)

for i = 1, ..., n, where

y_i = (y_i1, ..., y_it_i)^T,   θ_i = (θ_i1, ..., θ_it_i)^T,   e_i = (e_i1, ..., e_it_i)^T,
f_i(θ_i, w_i) = (f(θ_i1, w_i1), ..., f(θ_it_i, w_it_i))^T,

and w_i, X_i, and Z_i are formed by stacking the w_ij, X_ij, and Z_ij, j = 1, ..., t_i, row-wise.

We assume

b_1, ..., b_n iid ∼ N_q(0, D),   e_i iid ∼ N_{t_i}(0, σ²I),

and the random effects {b_i} are independent of the errors {e_i}.

Example — Orange Tree Data

To illustrate the model formulation, we write model (†)=(m1Oran.nlme) that we used for these data in the form (**). Model (†) can be written as

y_ij = θ_1ij / (1 + exp[−(t_ij − θ_2ij)/θ_3ij]) + e_ij,

where

    θ_ij = (θ_1ij, θ_2ij, θ_3ij)^T = X_ij β + Z_ij b_i,

with X_ij = I (the 3 × 3 identity), β = (β1, β2, β3)^T, and Z_ij = (1, 0, 0)^T, so that θ_1ij = β1 + b_1i, θ_2ij = β2, θ_3ij = β3. Here b_i is a scalar, so q = 1, and

b_1, ..., b_n iid ∼ N(0, σ²_b) (so D = σ²_b), and {e_ij} iid ∼ N(0, σ²).

A Multilevel Example — TPH in the PMRC Site Preparation Study:

• Study of various site preparation and intensive management regimes on the growth of slash pine.

• Involved 191 0.2 ha plots nested within 16 sites in lower coastal plain of GA and FL.

• Data consist of tph (100’s of trees per ha), site index, soil type, and treatment variables (herb, fert, chop, etc.) at ages 2, 5, 8, 11, 14, 17, and 20 years.

A plot of the data:

[Figure: “Separate profiles for each plot, graphed separately by site”: tph trajectories over time for each plot, one panel per site; x-axis: Age (yrs), 5–20; y-axis: Trees/Hectare (100s of trees), 5–15.]

(Each panel is a site)

• Because the profiles of tph over time appear to be sigmoidal (‘S’-shaped) for at least some plots, we consider the four-parameter logistic model for these data.

• The function SSfpl() in the lme/nlme software implements this function in the following form:

y_ijk = θ_1ij + (θ_2ij − θ_1ij) / (1 + exp{(θ_3ij − Age_ijk)/θ_4ij}) + e_ijk   (♥)

where

θ_1ijk = left (upper) asymptote (mixed effect)
θ_2ijk = right (lower) asymptote (mixed effect)
θ_3ijk = age at inflection point (mixed effect)
θ_4ijk = scale parameter determining how quickly the response reaches the lower asymptote (mixed)
y_ijk = tph at the kth age, for the jth plot in the ith site
e_ijk = error term

The θ_ijk's are mixed effects. Each one can be modeled in terms of explanatory variables, parameters, and random plot and random site effects.

• E.g., we might believe that the lower asymptote θ_2ijk depends on whether or not the site was fertilized.

– In that case we might set

θ_2ijk = β20 + β21 Fert_ij   (fixed effects only)

• However, we might also believe that there is variability from plot to plot and variability from site to site in the lower asymptote.

– In that case, we would consider modeling the lower asymptote as

θ_2ijk = β20 + β21 Fert_ij + b_2i + b_2ij   (mixed effects)

where the b_2i's are site-specific random effects and the b_2ij's are plot-specific random effects.

• The other θ_ijk's (the basic “parameters” of the nonlinear model) can each be modeled this way as well; a sketch of the corresponding nlme() syntax follows.
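A sketch of how such sub-models are declared in nlme(): here the lower asymptote th2 depends on a hypothetical fertilization indicator Fert and gets site- and plot-level random intercepts (the start values are illustrative):

nlme(tph100 ~ SSfpl(age, th1, th2, th3, th4), data = siteprep.tph,
     fixed  = list(th1 ~ 1, th2 ~ Fert, th3 ~ 1, th4 ~ 1),  # theta2 = b20 + b21*Fert
     random = list(site = th2 ~ 1, plot = th2 ~ 1),         # b_2i and b_2ij
     start  = c(10, 8, 0, 10, 1))  # one start value per fixed effect, in order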

• E.g., suppose that:

– the upper asymptote depends upon whether the site was fertilized, and varies across sites and plots;

– the lower asymptote depends upon whether the site was bedded and whether it was burned, and varies across sites;

– the inflection point depends upon site index;

– and the scale parameter is constant.

• Then an NLMM to describe such a situation would be

y_ijk = θ_1ij + (θ_2ij − θ_1ij) / (1 + exp{(θ_3ij − Age_ijk)/θ_4ij}) + e_ijk

      = (β10 + β11 Fert_ij + b_1i + b_1ij)
        + [(β20 + β21 Bed_ij + β22 Burn_ij + b_2i) − (β10 + β11 Fert_ij + b_1i + b_1ij)]
          / (1 + exp{((β30 + β31 SI_ij) − Age_ijk)/β40}) + e_ijk

Note that multivariate random effects at a given level occur more naturally and often in NLMMs than LMMs.

• E.g., here we have random site effects on both the upper and lower asymptotes. That is, the site effects are bivariate:

    b_i = (b_1i, b_2i)^T

• It's natural to expect that if a site has a unique effect on the lower asymptote and a unique effect on the upper asymptote, then those effects are probably related. Therefore, we typically assume multivariate random effects are correlated. E.g.,

    {b_i} iid ∼ N(0, D), where D = [ σ11  σ12 ]
                                   [ σ12  σ22 ]      (∗)

• There are several functions built into the lme/nlme software that can specify various var-cov structures on multivariate random effects in the model.

– pdDiag() specifies uncorrelated (independent) random effects.

– pdSymm() specifies general covariance among random effects (as above in (*)).

– pdBlocked() specifies block-diagonal covariance among random effects.
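Sketches of the corresponding random specifications (m0 is a hypothetical fit with site-level random effects on th1 and th2):

update(m0, random = list(site = pdDiag(th1 + th2 ~ 1)))   # independent effects
update(m0, random = list(site = pdSymm(th1 + th2 ~ 1)))   # general D, as in (*)
update(m0, random = list(site = pdBlocked(list(th1 ~ 1, th2 ~ 1))))  # block-diagonal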

Example — TPH (continued):

In NLMM.R we consider model (♥) for the tph data. This involves

— building mixed models for the “basic” parameters of the logistic function,

— modeling the variance-covariance structures for the random effects, and

— modeling the error variance-covariance structure to allow heteroscedasticity and serial correlation, if necessary.

• First, we create a two-level groupedData object siteprep.tph to contain the data. The data are clustered by site (level 1) and plot within site (level 2).

– Level 0 refers to the population level (corresponding to random effects equal to their mean, 0).

• Next, we try to fit the basic model with θ's not depending on covariates and no serial correlation or heteroscedasticity in the error terms. We do assume random plot and site effects in θ1, θ2, and θ3:

m1tph.fpl <- nlme(tph100 ~ SSfpl(age, th1, th2, th3, th4), data=siteprep.tph,
    fixed=list(th1~1, th2~1, th3~1, th4~1),
    random=list(site=pdDiag(th1+th2+th3~1), plot=pdDiag(th1+th2+th3~1)),
    start=c(10,8,10,1))

• The model here is

y_ijk = θ_1ij + (θ_2ij − θ_1ij) / (1 + exp{(θ_3ij − Age_ijk)/θ_4ij}) + e_ijk

      = (β10 + b_1i + b_1ij)
        + [(β20 + b_2i + b_2ij) − (β10 + b_1i + b_1ij)]
          / (1 + exp{((β30 + b_3i + b_3ij) − Age_ijk)/β40}) + e_ijk

– The pdDiag() specification in the random option requests that the random site effects in the model be independent, and likewise for the random plot effects. I.e., for site effects:

    var(b_i) = var((b_1i, b_2i, b_3i)^T) =
        [ σ11^(1)    0         0      ]
        [ 0        σ22^(1)     0      ]
        [ 0          0       σ33^(1)  ]

and for plot effects

    var(b_ij) = var((b_1ij, b_2ij, b_3ij)^T) =
        [ σ11^(2)    0         0      ]
        [ 0        σ22^(2)     0      ]
        [ 0          0       σ33^(2)  ]

– Assuming independent random effects that operate on the same level of clustering (e.g., site) is usually not realistic. However, it may be necessary, at least when fitting the model initially, to obtain convergence.

• Here is a summary of the model fit:

> summary(m1tph.fpl)
Nonlinear mixed-effects model fit by maximum likelihood
  Model: tph100 ~ SSfpl(age, th1, th2, th3, th4)
 Data: siteprep.tph
       AIC      BIC    logLik
  2151.744 2207.161 -1064.872

Random effects:
 Formula: list(th1 ~ 1, th2 ~ 1, th3 ~ 1)
 Level: site
 Structure: Diagonal
             th1     th2      th3
StdDev: 0.855265 1.86076 2.941078

 Formula: list(th1 ~ 1, th2 ~ 1, th3 ~ 1)
 Level: plot %in% site
 Structure: Diagonal
             th1      th2      th3  Residual
StdDev: 1.424189 1.262397 1.378790 0.3068705

Fixed effects: list(th1 ~ 1, th2 ~ 1, th3 ~ 1, th4 ~ 1)
        Value Std.Error  DF  t-value p-value
th1 11.969625 0.2395291 945 49.97149       0
th2 10.646728 0.4759046 945 22.37156       0
th3 12.675188 0.7959850 945 15.92390       0
th4  2.020528 0.1151651 945 17.54462       0

Number of Observations: 1139
Number of Groups:
           site plot %in% site
             16            191

• Variance components from all site and plot effects appear to be large, so we don’t consider dropping any of these effects (yet).

• A plot of the observed and fitted values (obtained with augPred()) indicates how well the model fits, and reveals its nested structure:

[Figure: augPred() plot of observed tph and the fitted curves at the population (fixed), site, and plot levels, one panel per plot, for a subset of the plots in sites 1 and 4; x-axis: Age (yrs); y-axis: Trees/Hectare (100s of trees).]

• Only a subset of the data is plotted here.

• At this stage we might consider relaxing the assumption of uncorrelated (independent) random site and plot effects. However, the data and model are big enough here that this causes lots of convergence problems.

• Instead, we consider whether the asymptotes and other parameters depend on covariates (e.g., treatment indicators, site index, soil variables, etc.).

– These are all plot-level measurements.

• One useful way to determine which parameters may depend upon which covariates is to graph the predicted plot-specific random effects for each parameter against potential covariates (a sketch of this appears below).

– It's as though we're building a separate little linear model for each mixed-effect parameter (each θ), and the predicted plot-level random effects are the residuals from those models.

• The most obvious covariate on which one of the parameters depends is initial trees per hectare (itph100, measured at age 2). Clearly, we’d expect the left (upper) asymptote to be highly dependent on this variable.

– This is apparent from the plot of the θ_1ijk random effects vs. itph100.
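A sketch of this diagnostic for model m1tph.fpl (augFrame = TRUE appends cluster-level summaries of the covariates to the predicted random effects):

re.plot <- ranef(m1tph.fpl, level = 2, augFrame = TRUE)  # plot-level effects
plot(re.plot, form = ~ itph100)  # each predicted random effect vs. initial tph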

• Therefore, we add itph100 to the model, yielding m2tph.fpl, which fits much better than m1tph.fpl.

• Continuing in this manner we build up the mixed effects until arriving at the specifications (m5tph.fpl):

θ_1ijk = β10 + β11 itph_ij + (fixed trtmnt effects) + b_1ij + b_1i
θ_2ijk = β20 + (fixed trtmnt effects) + b_2ij + b_2i
θ_3ijk = β30 + (fixed trtmnt effects) + b_3ij + b_3i
θ_4ijk = β40

No additional fixed effects appear necessary at this point from the plots of random effects and residuals vs. explanatory variables, so we now consider the assumptions on the error var-cov structure:

• A plot of the residuals vs. fitted values reveals no obvious pattern of non-constant variance.

• However, a plot of the residuals versus site index (si) shows a slight increasing variance pattern.

• Therefore, in model m6tph.fpl, we add heteroskedasticity of the form

var(e_ijk) = σ² si^(2δ),  where σ̂ = 0.0041, δ̂ = 1.49

> m6tph.fpl <- update(m5tph.fpl, weights=varPower(form= ~si),
+                     start=fixef(m5tph.fpl))
> summary(m6tph.fpl)
Random effects:
 Formula: list(th1 ~ 1, th2 ~ 1, th3 ~ 1)
 Level: site
 Structure: Diagonal
        th1.(Intercept) th2.(Intercept) th3.(Intercept)
StdDev:       0.1559281        1.849531        2.303681

 Formula: list(th1 ~ 1, th2 ~ 1, th3 ~ 1)
 Level: plot %in% site
 Structure: Diagonal
        th1.(Intercept) th2.(Intercept) th3.(Intercept)    Residual
StdDev:       0.5110908        1.203069        1.403265 0.004075045

Variance function:
 Structure: Power of variance covariate
 Formula: ~si
 Parameter estimates:
   power
1.494952

• This improves the model (decreases AIC) substantially.

Now consider serial correlation:

• The ACF plot of m6tph.fpl shows clear evidence of negative autocorrelation among the residuals.

– To deal with this, we consider adding an AR(1) autocorrelation structure to the model (done in two steps, models m7tph.fpl, and m8tph.fpl).

– However, this does not improve the model according to AIC, BIC and has almost no impact on the ACF plot.

– Whenever modeling the autocorrelation has virtually no effect on the ACF plot, we should be suspicious that the problem is not autocorrelation per se, but misspecification of the mean, which manifests as autocorrelated residuals.

– To investigate this, a plot of the residuals by age can help:

[Figure: standardized residuals from m6tph.fpl plotted against Age (yrs), showing a wavy pattern over time.]

• The plot reveals a slight mean misspecification in the model, which results in a wavy shape to the residuals over time. This leads to the large negative autocorrelation at lag 1 found in the ACF plot.

– To remove this autocorrelation, it would probably be necessary to consider another nonlinear form for tph over time which decreases more rapidly from the upper asymptote. For now, however, we satisfy ourselves with the logistic model and live with the apparent autocorrelation.

• Finally, I fit a number of other models in an attempt to simplify the random effect structures (not all shown).

– In particular, in model m9tph.fpl we find that dropping the plot-level random effect in θ_2ijk decreases the AIC (improves the fit).

– Therefore, the “final” model here is m9tph.fpl although, undoubtedly, we could do some further tweaking.
