SB1a Applied Statistics Lectures 5-6

Dr Geoff Nicholls

Week 3 MT15 Blocks, Treatments and Designs

“Treatment factors are those for which we wish to determine if there is an effect. Blocking factors are those for which we believe there is an effect. We wish to prevent a presumed blocking effect from interfering with our measurement of the treatment effect”, Heiberger and Holland, ‘Statistical Analysis and Data Display’, Springer (2004).

Example: Piglet diet data from D. Lunn (2007)

                 Diet
Litter      A      B      C
I          89     68     62
II         78     59     61
III       114     85     83
IV         79     61     82

The effect due to diet is masked if the effect due to litter is not accounted for.

[Figure: Gain (60-110) plotted against Diet (A, B, C), with points labelled by Litter I-IV, alongside the table above]

Model for the expected gain E(Y):

                         Diet
Litter      A             B                  C
I           α             α + τ2             α + τ3
II          α + γ2        α + γ2 + τ2        α + γ2 + τ3
III         α + γ3        α + γ3 + τ2        α + γ3 + τ3
IV          α + γ4        α + γ4 + τ2        α + γ4 + τ3

Above: b blocks and t treatments give 1 + (b − 1) + (t − 1) parameters, with n = bt observations.

Number the piglets k = 1, 2, ..., 12 and let yk be the gain of the k'th piglet:

yk = α + γ2 gk,2 + γ3 gk,3 + γ4 gk,4 + τ2 zk,2 + τ3 zk,3 + εk

where gk,2 = 1/0 according as piglet k was/was not in litter II, etc., and zk,2 = 1/0 according as piglet k had/did not have diet B, etc.

> pigs
   Gain Litter Diet
1    89      I    A
2    78     II    A
3   114    III    A
4    79     IV    A
5    68      I    B
6    59     II    B
7    85    III    B
8    61     IV    B
9    62      I    C
10   61     II    C
11   83    III    C
12   82     IV    C
> (X<-model.matrix(Gain~Litter+Diet,data=pigs))
   (Intercept) LitterII LitterIII LitterIV DietB DietC
1            1        0         0        0     0     0
2            1        1         0        0     0     0
3            1        0         1        0     0     0
4            1        0         0        1     0     0
5            1        0         0        0     1     0
6            1        1         0        0     1     0
7            1        0         1        0     1     0
8            1        0         0        1     1     0
9            1        0         0        0     0     1
10           1        1         0        0     0     1
11           1        0         1        0     0     1
12           1        0         0        1     0     1
> pigs.lm<-lm(Gain~1+Litter+Diet,data=pigs)
> summary(pigs.lm)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   86.250      5.763  14.967  5.6e-06
LitterII      -7.000      6.654  -1.052  0.33332
LitterIII     21.000      6.654   3.156  0.01967
LitterIV       1.000      6.654   0.150  0.88547
DietB        -21.750      5.763  -3.774  0.00924
DietC        -18.000      5.763  -3.124  0.02049
...
Residual standard error: 8.15 on 6 degrees of freedom
F-statistic: 7.18 on 5 and 6 DF,  p-value: 0.0162

Litter I and Diet A are baseline. Is Diet predictive for Gain?

> anova(pigs.lm)
Analysis of Variance Table

Response: Gain
          Df  Sum Sq Mean Sq F value  Pr(>F)
Litter     3 1304.25  434.75  6.5458 0.02545
Diet       2 1081.50  540.75  8.1418 0.01952
Residuals  6  398.50   66.42

Diet is predictive at the 5% level. What if Litter had not been recorded in the data? Then s² = RSS/(n − p) increases, the t and F statistics decrease, and the p-values increase: we lose significance.

> anova(lm(Gain~Diet,data=pigs))
Analysis of Variance Table

Response: Gain
          Df  Sum Sq Mean Sq F value Pr(>F)
Diet       2 1081.50  540.75  2.8582 0.1094
Residuals  9 1702.75  189.19

Diet is no longer explanatory.

Experimental design - planning a design

1. What data on each subject? Blocking factors known in advance.

2. How to distribute treatments over subjects? Balanced between blocks, randomised within.

3. How many subjects? Power to detect a treatment effect.
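As an illustration of step 3, power can be estimated by simulation before data are collected. This sketch is not from the lecture: the block and treatment effects, the baseline 80 and sigma = 8 are illustrative assumptions loosely modelled on the piglet data. It simulates a randomised complete block design repeatedly and records how often the treatment effect is detected at the 5% level.

```r
# Power by simulation for a randomised complete block design.
# All numbers below (block/treatment effects, sigma) are assumptions.
set.seed(1)
b <- 4; t <- 3                              # blocks and treatments
gamma <- c(0, -7, 21, 1)                    # assumed block effects
tau   <- c(0, -20, -15)                     # assumed treatment effects
sigma <- 8                                  # assumed error sd
design <- expand.grid(Litter = factor(1:b), Diet = factor(1:t))
reject <- replicate(1000, {
  y <- 80 + gamma[as.integer(design$Litter)] +
            tau[as.integer(design$Diet)] + rnorm(b * t, 0, sigma)
  anova(lm(y ~ Litter + Diet, data = design))["Diet", "Pr(>F)"] < 0.05
})
mean(reject)    # estimated power to detect the diet effect at the 5% level
```

If the estimated power is too low, increase the number of sub-cells per block and re-run.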

Good (balanced) design: maximum power for given n; block effects don't confound treatment effects.

Example: Field trial of fertilizers in a single field

1. Divide the field into a grid of b square cells (cells are blocks)

2. Divide the ith grid cell into ni sub-cells (ni subjects)

3. Within each cell assign fertilizers (treatments) at random to sub-cells (including the non-treatment). Try to use each treatment the same number of times in each block.

Design terminology

• One way ANOVA - no blocks

• Two way ANOVA - blocks recorded but not used to decide assignment of treatments to subjects.

• Randomised (complete) block design - treatments assigned to subjects randomly within blocks, each treatment once in each block.

• Orthogonal block design - each treatment applied the same number of times in each block (randomised, balanced)

• Incomplete block design - treatment applied unequal numbers of times between blocks.

• Balanced incomplete block design - treatments applied different numbers of times in each block, but block and treatment variables orthogonal.

Design Example - Piglets

The piglet design is complete ⇒ blocks and treatments are 'orthogonal'.

> X<-model.matrix(Gain~1+Litter+Diet,data=pigs)
> round(solve(t(X)%*%X),2)
            (Int)    II   III    IV     B     C
(Intercept)  0.50 -0.33 -0.33 -0.33 -0.25 -0.25
LitterII    -0.33  0.67  0.33  0.33  0.00  0.00
LitterIII   -0.33  0.33  0.67  0.33  0.00  0.00
LitterIV    -0.33  0.33  0.33  0.67  0.00  0.00
DietB       -0.25  0.00  0.00  0.00  0.50  0.25
DietC       -0.25  0.00  0.00  0.00  0.25  0.50

Treatment effects are separated from block effects in a complete design. The treatment effects (size, significance) are not changed by having the block effects in the model (they contribute to the estimate of the error variance s²). [R - verify that the ANOVA is unchanged by the order in which factors are added]

> pigs.lm1<-lm(Gain~1+Litter+Diet,data=pigs)
> anova(pigs.lm1)
> pigs.lm2<-lm(Gain~1+Diet+Litter,data=pigs)
> anova(pigs.lm2)

Exercise (EFE): show that a complete design is orthogonal (in the sense above).

Model Diagnostics and Goodness of Fit

Observation model Y = Xβ + ε, ε ∼ N(0, σ²In)

Our F and t-tests can't be trusted if this model is wrong.

Problems with the model:

errors are not iid normal:
  – errors ε are correlated with the linear predictor x1β1 + ... + xpβp
  – errors are correlated with each other

y is not a linear function of x1,...,xp

Problems with the data:

the bulk of the data fit the normal linear model, but outliers have ε's with large variance: data-entry errors, rare unmodeled events.

Model checking

Under the NLM the residuals satisfy e ∼ N(0, σ²(I − H)), and e and ŷ are independent. A qqplot of the residuals against the quantiles of a standard normal is standard.

A plot of residuals e v. fitted values ŷ allows us to check many model assumptions in a single graph.

[Figure: residuals against fitted values for the house price data, final model, merged Terr/SD]

The example shows simple diagnostics for the house price data final model (merged Terr/SD). The residuals show non-constant variance and correlation with ŷ. Ignoring correlation often leads to incorrect (over-significant) p-values.

If the fitted surface does not interpolate the data (eg if the data are curved) then this may show up here.

[Figure: (left) response y against covariate x; (right) studentised residuals against fitted values ŷ, for curved data]

The e v. ŷ plot is particularly helpful when there are many dimensions of covariates.

Data cleaning - classification of data points and identification of outliers

We classify data vectors (yi, xi) according to:

Misfit - the observation has excess variance.

Leverage - the observation pulls the fitted surface onto itself (thereby hiding misfit).

Influence - a combined measure of misfit and leverage.

Outliers are data points which depart from an otherwise well-fitting observation model. Individual data vectors of particularly high influence are often classified as outliers and removed.

Misfit

Residuals e = y − ŷ may have unequal variance.

var(e) = σ²(I − H); writing diag(H) ≡ (h1,...,hn), we have var(ei) = σ²(1 − hi).

Exercise: show var(ŷi) = σ²hi, so 0 ≤ hi ≤ 1.

Standardised residuals r = (r1,...,rn):

rk = ek / (s √(1 − hk)),

converging to N(0, 1) for large n − p. |rk| > 2 indicates possible misfit.
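As a quick numerical check (not in the original notes), the standardised residuals can be computed by hand for the piglet fit pigs.lm above and compared with R's built-in rstandard():

```r
# Standardised residuals by hand, using the piglet fit pigs.lm from above.
e <- residuals(pigs.lm)
h <- hatvalues(pigs.lm)
s <- summary(pigs.lm)$sigma            # s = sqrt(RSS/(n-p))
r <- e / (s * sqrt(1 - h))
all.equal(unname(r), unname(rstandard(pigs.lm)))   # TRUE
```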

Exercise: show r is independent of ŷ (the distribution of r is unknown, as numerator and denominator are correlated).

Data can fool r: s² = eᵀe/(n − p), so an outlier inflates s and so shrinks r.

Studentised residuals r′ = (r′1,...,r′n):

r′k = (yk − xk β̂−k) / std.err(yk − xk β̂−k),

with β̂−k = (X−kᵀ X−k)⁻¹ X−kᵀ y−k the estimate with observation k deleted.

Claim:

r′k = rk √( (n − p − 1) / (n − p − rk²) ).

Compute r′ directly from the main regression (no need for n deletions and re-fits).
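The claim is easy to verify numerically; a sketch using the piglet fit pigs.lm from above:

```r
# Verify r'_k = r_k * sqrt((n-p-1)/(n-p-r_k^2)) against rstudent().
r  <- rstandard(pigs.lm)
n  <- nobs(pigs.lm)
p  <- length(coef(pigs.lm))
rp <- r * sqrt((n - p - 1) / (n - p - r^2))
all.equal(unname(rp), unname(rstudent(pigs.lm)))   # TRUE
```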

Exercise: r′k ∼ t(n − p − 1) (approximately normal at large n) and r′ and ŷ are independent.

Plot r′ against ŷ. |r′k| > 2 indicates misfit and a possible outlier.

[Figure: standardised and studentised residuals against fitted values for the OHP and piglet data]

Top/bottom = OHP/piglet data, left/right = standardised/studentised residuals.

Leverage

hi = Hii are the leverage components.

0 ≤ hi ≤ 1; var(ei) = σ²(1 − hi) and E(ei) = 0, so xiβ̂ is pulled onto data point i by large hi.

Claim: h̄ = p/n. H is the orthogonal projection onto col(X). If col(X) has dimension p then

Hv = v for p linearly independent vectors,
Hv = 0 for n − p linearly independent vectors.

Eigenvalues by multiplicity: λ1 = λ2 = ... = λp = 1 and λp+1 = λp+2 = ... = λn = 0, so Σi λi = p ⇒ Σi hi = p (or apply trace(ABC) = trace(CAB) to H).
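A sketch checking the claim h̄ = p/n numerically for the piglet fit pigs.lm above (where p = 6 and n = 12):

```r
# Leverages sum to p, the number of parameters.
h <- hatvalues(pigs.lm)
p <- length(coef(pigs.lm))
sum(h)       # equals p = 6
mean(h)      # equals p/n = 6/12 = 0.5
```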

Flag data i with (for example) hi > 2p/n as 'high' leverage.

[Figure: hatvalues(ohp.lm) and hatvalues(pigs.lm) plotted against observation index]

Leverages for the (top) OHP and (bottom) piglet data.

Influence

Cook's distance for (xk, yk):

Ck = (ŷ − ŷ−k)ᵀ(ŷ − ŷ−k) / (p s²),

where ŷ−k = Xβ̂−k (which includes ŷ−k,k = xkβ̂−k).

Claim:

Ck = rk² hk / (p(1 − hk)).

Here hk/(1 − hk) is a measure of leverage and rk² of misfit (measured by the standardised residual).

|rk| > 2 and hk > 2p/n are the checks above, so Ck ≳ 8/(n − 2p) is an (arbitrary but typical) threshold for high influence.
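The claimed identity for Ck can also be checked directly; a sketch using pigs.lm from above:

```r
# Verify C_k = r_k^2 h_k / (p (1 - h_k)) against cooks.distance().
r <- rstandard(pigs.lm)
h <- hatvalues(pigs.lm)
p <- length(coef(pigs.lm))
all.equal(unname(r^2 * h / (p * (1 - h))),
          unname(cooks.distance(pigs.lm)))         # TRUE
```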

Identify outliers; remove them and repeat the analysis.

[Figure: Cook's distances for the OHP data (against month) and the piglet data (against index)]

Cook's distances for the (top) OHP and (bottom) piglet data.

Data Diagnostics, Example

The data(swiss) dataset has n = 47 observations of fertility with 5 potentially explanatory variables:

Fertility          fertility measure
Agriculture        % of males in agriculture
Examination        % top grade in army exam
Education          % educated beyond primary
Catholic           % catholic
Infant.Mortality   % babies living < 1 year

Which variables explain Fertility?

Transform (0, 100) to the real line with the logit function

x ← log(x/(100 − x)),

which increases sensitivity near 0 and 100. To be robust to x = 0, 100 use instead

x ← log((1 + x)/(101 − x)).

> data(swiss)
> head(swiss)  #first few rows
             Fertility Agric Exam Edu  Cath Mortality
Courtelary        80.2  17.0   15  12  9.96      22.2
Delemont          83.1  45.1    6   9 84.84      22.2
Franches-Mnt      92.5  39.7    5   5 93.40      20.2
Moutier           85.8  36.5   12   7 33.77      20.3
Neuveville        76.9  43.5   17  15  5.16      20.6
Porrentruy        76.1  35.3    9   7 90.57      26.6
>
> sw<-swiss; #map data into R
> sw[,-1]<-log((swiss[,-1]+1)/(101-swiss[,-1]))
> n<-dim(sw)[1]; p<-dim(sw)[2]
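A quick sketch (not in the notes) of why the adjusted map is preferred when values of exactly 0 or 100 can occur:

```r
# The plain logit blows up at the boundaries; the adjusted map stays finite.
x <- c(0, 50, 100)
log(x / (100 - x))            # -Inf at 0, +Inf at 100
log((1 + x) / (101 - x))      # finite at all three points
```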

The analysis plan: (i) fit the NLM, (ii) look for outliers and remove them, (iii) select a model, (iv) check again for outliers, etc.

[Figure: pairs plot for the Swiss fertility data (Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality). Percentage data has been mapped monotonically to the interval (−∞, ∞).]

> # (i) fit a normal linear model
> sw1.lm<-lm(Fertility~Mortality+Exam+Edu+Cath+Agric,
+            data=sw)
> summary(sw1.lm)
...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       76.7438    11.4611   6.696 4.44e-08 ***
Infant.Mortality  23.6269     6.9393   3.405  0.00149 **
Examination       -6.2086     3.7104  -1.673  0.10188
Education         -6.8316     2.6984  -2.532  0.01528 *
Catholic           0.8225     0.6183   1.330  0.19079
Agriculture       -1.8702     1.6896  -1.107  0.27478
...
Residual standard error: 8.398 on 41 degrees of freedom
F-statistic: 12.16 on 5 and 41 DF,  p-value: 2.960e-07

[Figure: diagnostics for sw1.lm - qqplot of studentised residuals, studentised residuals against fitted values, leverages and Cook's distances against index, with points labelled by canton]

> # (ii) look for outliers (above) remove and refit
> i<-cooks.distance(sw1.lm)>(8/(n-2*p))
> swr<-sw[-which(i),]
> nr<-dim(swr)[1];
> swr1.lm<-lm(Fertility~Mortality+Exam+Edu+Cath+Agric,
+             data=swr)
> summary(swr1.lm)
...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       82.5002    12.3752   6.667 6.17e-08 ***
Infant.Mortality  26.9630     8.2567   3.266  0.00228 **
Examination       -6.7927     3.5219  -1.929  0.06107 .
Education         -5.9604     2.5509  -2.337  0.02469 *
Catholic           1.0270     0.6005   1.710  0.09514 .
Agriculture       -2.8355     1.7661  -1.606  0.11645
...
Residual standard error: 7.866 on 39 degrees of freedom
F-statistic: 10.41 on 5 and 39 DF,  p-value: 2.139e-06

[Figure: diagnostics for swr1.lm after removing the outliers, as for sw1.lm above]

> round( cor(sw[,2:6]), 2)
          Agric  Exam   Edu  Cath Mortality
Agric      1.00 -0.06  0.58 -0.26     -0.40
Exam      -0.06  1.00  0.74 -0.92     -0.10
Edu        0.58  0.74  1.00 -0.81     -0.10
Cath      -0.26 -0.92 -0.81  1.00      0.45
Mortality -0.40 -0.10 -0.10  0.45      1.00

> swr0.lm<-lm(Fertility~Mortality+Exam,data=swr)
> anova(swr0.lm,swr1.lm)
Analysis of Variance Table

Model 1: Fertility ~ Mortality + Exam
Model 2: Fertility ~ Mortality + Exam + Edu + Cath + Agric
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     42 2810.68
2     39 2413.14  3    397.54 2.1416 0.1106

There is no evidence to support the more complex model. We repeated the visual outlier and model checks for the final model swr0 (Fertility ~ Mortality + Exam), yielding figures similar to those shown above for swr1, the fit to the data with outliers removed. These show no evidence for problems with the model or data.

> summary(swr0.lm)
...
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        95.133     10.871   8.751 5.15e-11 ***
Infant.Mortality   32.982      8.009   4.118 0.000175 ***
Examination       -11.811      2.100  -5.624 1.38e-06 ***
...
F-statistic: 21.1 on 2 and 42 DF,  p-value: 4.545e-07

We conclude that fertility is higher where infant mortality is higher, lower where examination results (closely correlated with education) are better, and that there is no evidence for an association between fertility and the other measured covariates.