SB1a Applied Statistics Lectures 5-6

Dr Geoff Nicholls

Week 3 MT15 Blocks, Treatments and Designs

“Treatment factors are those for which we wish to determine if there is an effect. Blocking factors are those for which we believe there is an effect. We wish to prevent a presumed blocking effect from interfering with our measurement of the treatment effect”, Heiberger and Holland, ‘Statistical Analysis and Data Display’, Springer (2004).

Example: Piglet diet data from D. Lunn (2007)

                 Diet
Litter      A      B      C
I          89     68     62
II         78     59     61
III       114     85     83
IV         79     61     82

The effect due to diet is masked if the effect due to litter is not accounted for.

[Figure: Gain (60-110) plotted against Diet (A, B, C), with points labelled by Litter I-IV, alongside the table above]

Model for the expected gain E(Y):

                         Diet
Litter      A             B                  C
I           α             α + τ2             α + τ3
II          α + γ2        α + γ2 + τ2        α + γ2 + τ3
III         α + γ3        α + γ3 + τ2        α + γ3 + τ3
IV          α + γ4        α + γ4 + τ2        α + γ4 + τ3

Above: b blocks and t treatments give 1 + (b − 1) + (t − 1) parameters, with n = bt observations.

Number the piglets k = 1, 2, ..., 12 and let yk be the gain of the k'th piglet:

yk = α + γ2 gk,2 + γ3 gk,3 + γ4 gk,4 + τ2 zk,2 + τ3 zk,3 + εk

where gk,2 = 1/0 according as piglet k was/was not in litter II, etc., and zk,2 = 1/0 according as piglet k had/did not have diet B, etc.

> pigs
   Gain Litter Diet
1    89      I    A
2    78     II    A
3   114    III    A
4    79     IV    A
5    68      I    B
6    59     II    B
7    85    III    B
8    61     IV    B
9    62      I    C
10   61     II    C
11   83    III    C
12   82     IV    C
> (X<-model.matrix(Gain~Litter+Diet,data=pigs))
   (Intercept) LitterII LitterIII LitterIV DietB DietC
1            1        0         0        0     0     0
2            1        1         0        0     0     0
3            1        0         1        0     0     0
4            1        0         0        1     0     0
5            1        0         0        0     1     0
6            1        1         0        0     1     0
7            1        0         1        0     1     0
8            1        0         0        1     1     0
9            1        0         0        0     0     1
10           1        1         0        0     0     1
11           1        0         1        0     0     1
12           1        0         0        1     0     1
> pigs.lm<-lm(Gain~1+Litter+Diet,data=pigs)
> summary(pigs.lm)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   86.250      5.763  14.967  5.6e-06
LitterII      -7.000      6.654  -1.052  0.33332
LitterIII     21.000      6.654   3.156  0.01967
LitterIV       1.000      6.654   0.150  0.88547
DietB        -21.750      5.763  -3.774  0.00924
DietC        -18.000      5.763  -3.124  0.02049
...
Residual standard error: 8.15 on 6 degrees of freedom
F-statistic: 7.18 on 5 and 6 DF,  p-value: 0.0162

Litter I and Diet A are baseline. Is Diet predictive for Gain?

> anova(pigs.lm)
Analysis of Variance Table

Response: Gain
          Df  Sum Sq Mean Sq F value  Pr(>F)
Litter     3 1304.25  434.75  6.5458 0.02545
Diet       2 1081.50  540.75  8.1418 0.01952
Residuals  6  398.50   66.42

Diet is predictive at the 5% level. What if Litter had not been recorded in the data? Then s² = RSS/(n − p) increases, the t and F statistics decrease, and the p-values increase: we lose significance.

> anova(lm(Gain~Diet,data=pigs))
Analysis of Variance Table

Response: Gain
          Df  Sum Sq Mean Sq F value Pr(>F)
Diet       2 1081.50  540.75  2.8582 0.1094
Residuals  9 1702.75  189.19

Diet is no longer explanatory.

Experimental design - planning a design

1. What data on each subject? Blocking factors known in advance.

2. How to distribute treatments over subjects? Balanced between blocks, randomised within.

3. How many subjects? Power to detect a treatment effect.
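As an illustration of step 3, power can be estimated by simulation before data are collected. This sketch is not from the lecture: the block and treatment effects, the baseline 80 and sigma = 8 are illustrative assumptions loosely modelled on the piglet data. It simulates a randomised complete block design repeatedly and records how often the treatment effect is detected at the 5% level.

```r
# Power by simulation for a randomised complete block design.
# All numbers below (block/treatment effects, sigma) are assumptions.
set.seed(1)
b <- 4; t <- 3                              # blocks and treatments
gamma <- c(0, -7, 21, 1)                    # assumed block effects
tau   <- c(0, -20, -15)                     # assumed treatment effects
sigma <- 8                                  # assumed error sd
design <- expand.grid(Litter = factor(1:b), Diet = factor(1:t))
reject <- replicate(1000, {
  y <- 80 + gamma[as.integer(design$Litter)] +
            tau[as.integer(design$Diet)] + rnorm(b * t, 0, sigma)
  anova(lm(y ~ Litter + Diet, data = design))["Diet", "Pr(>F)"] < 0.05
})
mean(reject)    # estimated power to detect the diet effect at the 5% level
```

If the estimated power is too low, increase the number of sub-cells per block and re-run.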

Good (balanced) design: maximum power for given n; block effects don't confound treatment effects.

Example: Field trial of fertilizers in a single field

1. Divide the field into a grid of b square cells (cells are blocks)

2. Divide the ith grid cell into ni sub-cells (ni subjects)

3. Within each cell assign fertilizers (treatments) at random to sub-cells (including the non-treatment). Try to use each treatment the same number of times in each block.

Design terminology

• One way ANOVA - no blocks

• Two way ANOVA - blocks recorded but not used to decide assignment of treatments to subjects.

• Randomised (complete) block design - treatments assigned to subjects randomly within blocks, each treatment once in each block.

• Orthogonal block design - each treatment applied the same number of times in each block (randomised, balanced)

• Incomplete block design - treatment applied unequal numbers of times between blocks.

• Balanced incomplete block design - treatments applied different numbers of times in each block, but block and treatment variables orthogonal.

Design Example - Piglets

The piglet design is complete ⇒ blocks and treatments are 'orthogonal'.

> X<-model.matrix(Gain~1+Litter+Diet,data=pigs)
> round(solve(t(X)%*%X),2)
            (Int)    II   III    IV     B     C
(Intercept)  0.50 -0.33 -0.33 -0.33 -0.25 -0.25
LitterII    -0.33  0.67  0.33  0.33  0.00  0.00
LitterIII   -0.33  0.33  0.67  0.33  0.00  0.00
LitterIV    -0.33  0.33  0.33  0.67  0.00  0.00
DietB       -0.25  0.00  0.00  0.00  0.50  0.25
DietC       -0.25  0.00  0.00  0.00  0.25  0.50

Treatment effects are separated from block effects in a complete design. The treatment effects (size, significance) are not changed by having the block effects in the model (they contribute to the estimate of the error variance s²). [R - verify that the ANOVA is unchanged by the order in which factors are added]

> pigs.lm1<-lm(Gain~1+Litter+Diet,data=pigs)
> anova(pigs.lm1)
> pigs.lm2<-lm(Gain~1+Diet+Litter,data=pigs)
> anova(pigs.lm2)

Exercise (EFE): show that a complete design is orthogonal (in the sense above).

Model Diagnostics and Goodness of Fit

Observation model Y = Xβ + ε, ε ∼ N(0, σ²In)

Our F and t-tests can't be trusted if this model is wrong.

Problems with the model:

errors are not iid normal:
  – errors ε are correlated with the linear predictor x1β1 + ... + xpβp
  – errors are correlated with each other

y is not a linear function of x1,...,xp

Problems with the data:

the bulk of the data fit the normal linear model, but outliers have ε's with large variance: data-entry errors, rare unmodeled events.

Model checking

Under the NLM the residuals satisfy e ∼ N(0, σ²(I − H)), and e and ŷ are independent. A qqplot of the residuals against the quantiles of a standard normal is standard.

A plot of residuals e v. fitted values ŷ allows us to check many model assumptions in a single graph.

[Figure: residuals against fitted values for the house price data, final model, merged Terr/SD]

The example shows simple diagnostics for the house price data final model (merged Terr/SD). The residuals show non-constant variance and correlation with ŷ. Ignoring correlation often leads to incorrect (over-significant) p-values.

If the fitted surface does not interpolate the data (eg if the data are curved) then this may show up here.

[Figure: (left) response y against covariate x; (right) studentised residuals against fitted values ŷ, for curved data]

The e v. ŷ plot is particularly helpful when there are many dimensions of covariates.

Data cleaning - classification of data points and identification of outliers

We classify data vectors (yi, xi) according to:

Misfit - the observation has excess variance.

Leverage - the observation pulls the fitted surface onto itself (thereby hiding misfit).

Influence - a combined measure of misfit and leverage.

Outliers are data points which depart from an otherwise well-fitting observation model. Individual data vectors of particularly high influence are often classified as outliers and removed.

Misfit

Residuals e = y − ŷ may have unequal variance.

var(e) = σ²(I − H); writing diag(H) ≡ (h1,...,hn), we have var(ei) = σ²(1 − hi).

Exercise: show var(ŷi) = σ²hi, so 0 ≤ hi ≤ 1.

Standardised residuals r = (r1,...,rn):

rk = ek / (s √(1 − hk)),

converging to N(0, 1) for large n − p. |rk| > 2 indicates possible misfit.
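As a quick numerical check (not in the original notes), the standardised residuals can be computed by hand for the piglet fit pigs.lm above and compared with R's built-in rstandard():

```r
# Standardised residuals by hand, using the piglet fit pigs.lm from above.
e <- residuals(pigs.lm)
h <- hatvalues(pigs.lm)
s <- summary(pigs.lm)$sigma            # s = sqrt(RSS/(n-p))
r <- e / (s * sqrt(1 - h))
all.equal(unname(r), unname(rstandard(pigs.lm)))   # TRUE
```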

Exercise: show r is independent of ŷ (the distribution of r is unknown, as numerator and denominator are correlated).

Data can fool r: s² = eᵀe/(n − p), so an outlier inflates s and so shrinks r.

Studentised residuals r′ = (r′1,...,r′n):

r′k = (yk − xk β̂−k) / std.err(yk − xk β̂−k),

with β̂−k = (X−kᵀ X−k)⁻¹ X−kᵀ y−k the estimate with observation k deleted.

Claim:

r′k = rk √( (n − p − 1) / (n − p − rk²) ).

Compute r′ directly from the main regression (no need for n deletions and re-fits).
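The claim is easy to verify numerically; a sketch using the piglet fit pigs.lm from above:

```r
# Verify r'_k = r_k * sqrt((n-p-1)/(n-p-r_k^2)) against rstudent().
r  <- rstandard(pigs.lm)
n  <- nobs(pigs.lm)
p  <- length(coef(pigs.lm))
rp <- r * sqrt((n - p - 1) / (n - p - r^2))
all.equal(unname(rp), unname(rstudent(pigs.lm)))   # TRUE
```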

Exercise: r′k ∼ t(n − p − 1) (approximately normal at large n) and r′ and ŷ are independent.

Plot r′ against ŷ. |r′k| > 2 indicates misfit and a possible outlier.

[Figure: standardised and studentised residuals against fitted values for the OHP and piglet data]

Top/bottom = OHP/piglet data, left/right = standardised/studentised residuals.

Leverage

hi = Hii are the leverage components.

0 ≤ hi ≤ 1; var(ei) = σ²(1 − hi) and E(ei) = 0, so xiβ̂ is pulled onto data point i by large hi.

Claim: h̄ = p/n. H is the orthogonal projection onto col(X). If col(X) has dimension p then

Hv = v for p linearly independent vectors,
Hv = 0 for n − p linearly independent vectors.

Eigenvalues by multiplicity: λ1 = λ2 = ... = λp = 1 and λp+1 = λp+2 = ... = λn = 0, so Σi λi = p ⇒ Σi hi = p (or apply trace(ABC) = trace(CAB) to H).
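A sketch checking the claim h̄ = p/n numerically for the piglet fit pigs.lm above (where p = 6 and n = 12):

```r
# Leverages sum to p, the number of parameters.
h <- hatvalues(pigs.lm)
p <- length(coef(pigs.lm))
sum(h)       # equals p = 6
mean(h)      # equals p/n = 6/12 = 0.5
```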

Flag data i with (for example) hi > 2p/n as 'high' leverage.

[Figure: hatvalues(ohp.lm) and hatvalues(pigs.lm) plotted against observation index]

Leverages for the (top) OHP and (bottom) piglet data.

Influence

Cook's distance for (xk, yk):

Ck = (ŷ − ŷ−k)ᵀ(ŷ − ŷ−k) / (p s²),

where ŷ−k = Xβ̂−k (which includes ŷ−k,k = xkβ̂−k).

Claim:

Ck = rk² hk / (p(1 − hk)).

Here hk/(1 − hk) is a measure of leverage and rk² of misfit (measured by the standardised residual).

|rk| > 2 and hk > 2p/n are the checks above, so Ck ≳ 8/(n − 2p) is an (arbitrary but typical) threshold for high influence.
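The claimed identity for Ck can also be checked directly; a sketch using pigs.lm from above:

```r
# Verify C_k = r_k^2 h_k / (p (1 - h_k)) against cooks.distance().
r <- rstandard(pigs.lm)
h <- hatvalues(pigs.lm)
p <- length(coef(pigs.lm))
all.equal(unname(r^2 * h / (p * (1 - h))),
          unname(cooks.distance(pigs.lm)))         # TRUE
```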

Identify outliers; remove them and repeat the analysis.

[Figure: Cook's distances for the OHP data (against month) and the piglet data (against index)]

Cook's distances for the (top) OHP and (bottom) piglet data.

Data Diagnostics, Example

The data(swiss) dataset has n = 47 observations of fertility with 5 potentially explanatory variables:

Fertility          fertility measure
Agriculture        % of males in agriculture
Examination        % top grade in army exam
Education          % educated beyond primary
Catholic           % catholic
Infant.Mortality   % babies living < 1 year

Which variables explain Fertility?

Transform (0, 100) to the real line with the logit function

x ← log(x/(100 − x)),

which increases sensitivity near 0 and 100. To be robust to x = 0, 100 use instead

x ← log((1 + x)/(101 − x)).

> data(swiss)
> head(swiss)  #first few rows
             Fertility Agric Exam Edu  Cath Mortality
Courtelary        80.2  17.0   15  12  9.96      22.2
Delemont          83.1  45.1    6   9 84.84      22.2
Franches-Mnt      92.5  39.7    5   5 93.40      20.2
Moutier           85.8  36.5   12   7 33.77      20.3
Neuveville        76.9  43.5   17  15  5.16      20.6
Porrentruy        76.1  35.3    9   7 90.57      26.6
>
> sw<-swiss; #map data into R
> sw[,-1]<-log((swiss[,-1]+1)/(101-swiss[,-1]))
> n<-dim(sw)[1]; p<-dim(sw)[2]
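A quick sketch (not in the notes) of why the adjusted map is preferred when values of exactly 0 or 100 can occur:

```r
# The plain logit blows up at the boundaries; the adjusted map stays finite.
x <- c(0, 50, 100)
log(x / (100 - x))            # -Inf at 0, +Inf at 100
log((1 + x) / (101 - x))      # finite at all three points
```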

The analysis plan: (i) fit the NLM, (ii) look for outliers and remove them, (iii) select a model, (iv) check again for outliers, etc.

[Figure: pairs plot for the Swiss fertility data (Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality). Percentage data has been mapped monotonically to the interval (−∞, ∞).]

> # (i) fit a normal linear model
> sw1.lm<-lm(Fertility~Mortality+Exam+Edu+Cath+Agric,
+            data=sw)
> summary(sw1.lm)
...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       76.7438    11.4611   6.696 4.44e-08 ***
Infant.Mortality  23.6269     6.9393   3.405  0.00149 **
Examination       -6.2086     3.7104  -1.673  0.10188
Education         -6.8316     2.6984  -2.532  0.01528 *
Catholic           0.8225     0.6183   1.330  0.19079
Agriculture       -1.8702     1.6896  -1.107  0.27478
...
Residual standard error: 8.398 on 41 degrees of freedom
F-statistic: 12.16 on 5 and 41 DF,  p-value: 2.960e-07

[Figure: diagnostics for sw1.lm - qqplot of studentised residuals, studentised residuals against fitted values, leverages and Cook's distances against index, with points labelled by canton]

> # (ii) look for outliers (above) remove and refit
> i<-cooks.distance(sw1.lm)>(8/(n-2*p))
> swr<-sw[-which(i),]
> nr<-dim(swr)[1];
> swr1.lm<-lm(Fertility~Mortality+Exam+Edu+Cath+Agric,
+             data=swr)
> summary(swr1.lm)
...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       82.5002    12.3752   6.667 6.17e-08 ***
Infant.Mortality  26.9630     8.2567   3.266  0.00228 **
Examination       -6.7927     3.5219  -1.929  0.06107 .
Education         -5.9604     2.5509  -2.337  0.02469 *
Catholic           1.0270     0.6005   1.710  0.09514 .
Agriculture       -2.8355     1.7661  -1.606  0.11645
...
Residual standard error: 7.866 on 39 degrees of freedom
F-statistic: 10.41 on 5 and 39 DF,  p-value: 2.139e-06

[Figure: diagnostics for swr1.lm after removing the outliers, as for sw1.lm above]

> round( cor(sw[,2:6]), 2)
          Agric  Exam   Edu  Cath Mortality
Agric      1.00 -0.06  0.58 -0.26     -0.40
Exam      -0.06  1.00  0.74 -0.92     -0.10
Edu        0.58  0.74  1.00 -0.81     -0.10
Cath      -0.26 -0.92 -0.81  1.00      0.45
Mortality -0.40 -0.10 -0.10  0.45      1.00

> swr0.lm<-lm(Fertility~Mortality+Exam,data=swr)
> anova(swr0.lm,swr1.lm)
Analysis of Variance Table

Model 1: Fertility ~ Mortality + Exam
Model 2: Fertility ~ Mortality + Exam + Edu + Cath + Agric
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     42 2810.68
2     39 2413.14  3    397.54 2.1416 0.1106

There is no evidence to support the more complex model. We repeated the visual outlier and model checks for the final model swr0 (Fertility ~ Mortality + Exam), yielding figures similar to those shown above for swr1, the fit to the data with outliers removed. These show no evidence for problems with the model or data.

> summary(swr0.lm)
...
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        95.133     10.871   8.751 5.15e-11 ***
Infant.Mortality   32.982      8.009   4.118 0.000175 ***
Examination       -11.811      2.100  -5.624 1.38e-06 ***
...
F-statistic: 21.1 on 2 and 42 DF,  p-value: 4.545e-07

We conclude that fertility is higher where infant mortality is higher, lower where examination results (closely correlated with education) are better, and that there is no evidence for an association between fertility and the other measured covariates.