
REGRESSION AND ANALYSIS OF VARIANCE

P. L. Davies

Eindhoven, February 2007

Reading List

Daniel, C. (1976) Applications of Statistics to Industrial Experimentation, Wiley.

Tukey, J. W. (1977) Exploratory Data Analysis, Addison-Wesley.

Mosteller, F. and Tukey, J. W. (1977) Data Analysis and Regression, Addison-Wesley.

Miller, R. G. (1980) Simultaneous Statistical Inference, Springer.

Huber, P. J. (1981) Robust Statistics, Wiley.

Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (1983) Understanding Robust and Exploratory Data Analysis, Wiley.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions, Wiley.

Miller, R. G. (1986) Beyond ANOVA, Basics of Applied Statistics, Wiley.

Rousseeuw, P. J. and Leroy, A. M. (1987) Robust Regression and Outlier Detection, Wiley.

Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (1991) Fundamentals of Exploratory Analysis of Variance, Wiley.

Davies, P. L. and Gather, U. (2004) Robust Statistics, Chapter III.9 in Handbook of Computational Statistics: Concepts and Methods, Editors Gentle, J. E., Härdle, W. and Mori, Y., Springer.

Jurečková, J. and Picek, J. (2006) Robust Statistical Methods with R, Chapman and Hall.

Maronna, R. A., Martin, R. D. and Yohai, V. J. (2006) Robust Statistics: Theory and Methods, Wiley.

Contents

1 Location and Scale in R
  1.1 Measures of location
    1.1.1 The mean
    1.1.2 The median
    1.1.3 The shorth
    1.1.4 What to use?
  1.2 Measures of scale or dispersion
    1.2.1 The standard deviation
    1.2.2 The MAD
    1.2.3 The length of the shorth
    1.2.4 The interquartile range IQR
    1.2.5 What to use?
  1.3 Boxplots
  1.4 Empirical distributions
    1.4.1 Definition
    1.4.2 Empirical distribution function
    1.4.3 Transforming data
    1.4.4 Statistics and functionals
    1.4.5 Fisher consistency
  1.5 Equivariance considerations
    1.5.1 The group of affine transformations
    1.5.2 Measures of location
    1.5.3 Measures of scale
  1.6 Outliers and breakdown points
    1.6.1 Outliers
    1.6.2 Breakdown points and equivariance
    1.6.3 Identifying outliers
  1.7 M-measures of location and scale
    1.7.1 M-measures of location
    1.7.2 M-measures of scatter
    1.7.3 Simultaneous M-measures of location and scale
  1.8 Analytic properties of M-functionals
    1.8.1 Local boundedness and breakdown
  1.9 Redescending M-estimators I
  1.10 Differentiability and asymptotic normality
  1.11 Redescending M-estimators II
  1.12 Confidence intervals

2 The One-Way Table
  2.1 The two sample t-test
  2.2 The k sample problem
  2.3 The classical method
  2.4 Multiple comparisons
    2.4.1 Tukey: Honest Significant Difference
    2.4.2 Bonferroni and Holm
    2.4.3 Scheffé's method
  2.5 Simultaneous confidence intervals
  2.6 Borrowing strength
  2.7 An example
    2.7.1 Tukey: Honest Significant Difference
    2.7.2 Bonferroni
    2.7.3 Holm
    2.7.4 Scheffé
    2.7.5 Simultaneous confidence intervals

3 The Two-Way Table
  3.1 The Two-Way Table
    3.1.1 The problem
    3.1.2 The classical method
    3.1.3 Interactions

4 Regression
  4.1 The classical theory
    4.1.1 The straight line
    4.1.2 The general case
    4.1.3 The two-way table
  4.2 Breakdown and equivariance
    4.2.1 Outliers and least-squares
    4.2.2 Equivariance
    4.2.3 Breakdown
  4.3 The Least Median of Squares (LMS)

1 Location and Scale in R

1.1 Measures of location

1.1.1 The mean

We start with measures of location for a data set $x_n = (x_1,\ldots,x_n)$ consisting of $n$ real numbers. By far the most common measure of location is the mean
$$\mathrm{mean}(x_n) = \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i. \qquad (1.1)$$

Example 1.1. 27 measurements of the amount of copper (milligrams per litre) in a sample of drinking water.

2.16 2.21 2.15 2.05 2.06 2.04 1.90 2.03 2.06 2.02 2.06 1.92 2.08 2.05 1.88 1.99 2.01 1.86 1.70 1.88 1.99 1.93 2.20 2.02 1.92 2.13 2.13

The mean is given by
$$\bar{x}_{27} = \frac{1}{27}(2.16 + \ldots + 2.13) = \frac{1}{27}\,54.43 = 2.015926.$$
In R the command is

> mean(copper)
[1] 2.015926

Example 1.2. Length of a therapy in days of 86 control patients after a suicide attempt.

1 1 1 5 7 8 8 13 14 14 17 18 21 21 22 25 27 27 30 30 31 31 32 34 35 36 37 38 39 39 40 49 49 54 56 56 62 63 65 65 67 75 76 79 82 83 84 84 84 90 91 92 93 93 103 103 111 112 119 122 123 126 129 134 144 147 153 163 167 175 228 231 235 242 256 256 257 311 314 322 369 415 573 609 640 737

The mean is given by
$$\bar{x}_{86} = \frac{1}{86}(1 + \ldots + 737) = \frac{1}{86}\,10520 = 122.3256.$$

Example 1.3. Measurements taken by Charles Darwin on the differences in growth of cross- and self-fertilized plants. Heights of individual plants (in inches):

Pot I    Cross 23.5 12.0 21.0            Self 17.4 20.4 20.0
Pot II   Cross 22.0 19.2 21.5            Self 20.0 18.4 18.6
Pot III  Cross 22.2 20.4 18.3 21.6 23.2  Self 18.6 15.2 16.5 18.0 16.2
Pot IV   Cross 21.0 22.1 23.0 12.0       Self 18.0 12.8 15.5 18.0

The differences in heights are 23.5 − 17.4, ..., 12.0 − 18.0, that is

6.1, −8.4, 1.0, 2.0, 0.8, 2.9, 3.6, 5.2, 1.8, 3.6, 7.0, 3.0, 9.3, 7.5, −6.0.

The mean of the differences is 2.626667.

1.1.2 The median

The second most common measure of location is the median. To define the median we order the observations $(x_1,\ldots,x_n)$ according to their size

$$x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)} \qquad (1.2)$$
and define the median to be the central observation of the ordered sample. If the size of the sample $n$ is an odd number, $n = 2m+1$, then this is unambiguous as the central observation is $x_{(m+1)}$. If the number of observations is even, $n = 2m$, then we take the mean of the two innermost observations of the ordered sample. This gives

$$\mathrm{med}(x_n) = \begin{cases} x_{(m+1)} & \text{if } n = 2m+1,\\ (x_{(m)} + x_{(m+1)})/2 & \text{if } n = 2m. \end{cases} \qquad (1.3)$$

Example 1.4. We take the first 5 measurements of Example 1.1. These are

2.16, 2.21, 2.15, 2.05, 2.06 so the ordered sample is

2.05, 2.06, 2.15, 2.16, 2.21.

The number of observations is odd, $5 = n = 2m+1$ with $m = 2$. The median is therefore

$$\mathrm{med}(2.16, 2.21, 2.15, 2.05, 2.06) = x_{(m+1)} = x_{(3)} = 2.15.$$

We take the first 6 measurements of Example 1.1. These are

2.16, 2.21, 2.15, 2.05, 2.06, 2.04 so the ordered sample is

2.04, 2.05, 2.06, 2.15, 2.16, 2.21.

The number of observations is even, $6 = n = 2m$ with $m = 3$. The median is therefore
$$\mathrm{med}(2.16, 2.21, 2.15, 2.05, 2.06, 2.04) = (x_{(m)} + x_{(m+1)})/2$$

$$= (x_{(3)} + x_{(4)})/2 = (2.06 + 2.15)/2 = 4.21/2 = 2.105.$$

In R the command is

> median(copper)
[1] 2.03

1.1.3 The shorth

A less common measure of location is the so called shorth (shortest half) introduced by Tukey. We require some notation.

(a) $\lfloor x \rfloor$, spoken "floor $x$", is the largest integer $n$ such that $n \le x$. For example $\lfloor 2.001 \rfloor = 2$, $\lfloor 5 \rfloor = 5$, $\lfloor -2.3 \rfloor = -3$.

(b) $\lceil x \rceil$, spoken "ceiling $x$", is the smallest integer $n$ such that $x \le n$.

(c) For any number $x$, $\lfloor x \rfloor + 1$ is the smallest integer $n$ which is strictly greater than $x$.

(d) The R command is floor:

> floor(2.39)
[1] 2

For a sample $(x_1,\ldots,x_n)$ of size $n$ we consider all intervals $I = [a, b]$ which contain at least $\lfloor n/2 \rfloor + 1$ of the observations, that is at least half the number of observations. Let $I_0$ denote that interval which has the smallest length amongst all intervals containing at least $\lfloor n/2 \rfloor + 1$ observations. The shorth is defined to be the mean of those observations in this shortest interval. There is a degree of indeterminacy in this definition as there may be more than one such interval. One way of trying to break the indeterminacy is to take that interval with the largest number of observations. However this may not be sufficient as the next example shows.

Example 1.5. We consider the data of Example 1.1. The first step is to order the data:

1.70 1.86 1.88 1.88 1.90 1.92 1.92 1.93 1.99 1.99 2.01 2.02 2.02 2.03 2.04 2.05 2.05 2.06 2.06 2.06 2.08 2.13 2.13 2.15 2.16 2.20 2.21

The sample size is $n = 27$ so $\lfloor n/2 \rfloor + 1 = \lfloor 13.5 \rfloor + 1 = 13 + 1 = 14$. We look for the shortest interval which contains at least 14 observations. The intervals [1.70, 2.03], [1.86, 2.04], [1.88, 2.05], ..., [2.03, 2.21] all contain at least 14 observations. Their lengths are 0.33, 0.18, 0.17, 0.17, ..., 0.18. The smallest intervals have length 0.14, namely [1.92, 2.06], [1.99, 2.13], [2.01, 2.15] and [2.02, 2.16]. The intervals [1.92, 2.06] and [1.99, 2.13] both contain 15 observations. Can you think of any way of breaking the indeterminacy? Is there always a means of breaking the indeterminacy of the interval? Even if there is not, is there always a means of obtaining a unique value for the shorth? The observations in [1.92, 2.06] are

1.92 1.92 1.93 1.99 1.99 2.01 2.02 2.02 2.03 2.04 2.05 2.05 2.06 2.06 2.06

and their mean is 2.01; the mean of the observations in the interval [1.99, 2.13] is 2.048.

Exercise 1.6. Write an R-function to calculate the shortest half.
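One possible R solution to Exercise 1.6 is sketched below; the function name shorth and the tie-breaking rule (take the first shortest interval) are my own choices, not part of the notes.

## Sketch for Exercise 1.6: mean of the observations in a shortest
## interval containing at least floor(n/2) + 1 of the data points.
shorth <- function(x) {
  x <- sort(x)
  n <- length(x)
  h <- floor(n/2) + 1                       # at least half of the observations
  len <- x[h:n] - x[1:(n - h + 1)]          # lengths of the candidate intervals
  i <- which.min(len)                       # first shortest interval (arbitrary tie-break)
  mean(x[x >= x[i] & x <= x[i + h - 1]])    # mean of all observations in that interval
}
## For the copper data of Example 1.1 this selects the interval [1.92, 2.06].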

1.1.4 What to use?

We now have three measures of location and there are more to come. A natural question is which one to use. The natural answer is that it depends on the data and the subject matter. What we can do is to analyse the behaviour of the different measures both analytically and practically when applied to different data sets. As an example we point out that the mean and median are always uniquely defined but the shorth is not. This is a disadvantage of the shorth but it may have advantages which, for particular data sets, outweigh this. We shall examine the three measures in greater depth below but in the meantime the student can try them out on different data sets and form their own judgment.

1.2 Measures of scale or dispersion

1.2.1 The standard deviation

The most common measure of dispersion is the standard deviation. Unfortunately there are two competing definitions with no universal agreement. They are

$$\mathrm{sd}(x_n) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2} \qquad (1.4)$$
and

$$\mathrm{sd}(x_n) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}. \qquad (1.5)$$

WARNING: In these notes I shall use definition (1.4). In R the definition (1.5) is used.

Example 1.7. We consider the data of Example 1.1. The standard devia- tion according to (1.4) is 0.1140260. If you use R you get

> sd(copper)
[1] 0.1161981
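Since R's sd() uses (1.5), a small helper (the name sdn is mine) is needed to reproduce the value 0.1140260 quoted above.

## Standard deviation with divisor n as in definition (1.4);
## R's built-in sd() uses the divisor n-1 of definition (1.5).
sdn <- function(x) sqrt(mean((x - mean(x))^2))
sdn(copper)   # 0.1140260 for the data of Example 1.1
sd(copper)    # 0.1161981, the R version (1.5)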

1.2.2 The MAD

The second most common measure of dispersion, but one which is nevertheless not widely known, is the MAD (Median Absolute Deviation). It is defined as follows. We form the absolute deviations from the median

$$|x_1 - \mathrm{med}(x_n)|, \ldots, |x_n - \mathrm{med}(x_n)|.$$
The MAD is now simply the median of these absolute deviations:

$$\mathrm{MAD}(x_n) = \mathrm{median}(|x_1 - \mathrm{med}(x_n)|, \ldots, |x_n - \mathrm{med}(x_n)|). \qquad (1.6)$$

Example 1.8. We consider the data of Example 1.1. The median of the data is 2.03 which gives the following absolute deviations

0.13 0.18 0.12 0.02 0.03 0.01 0.13 0.00 0.03 0.01 0.03 0.11 0.05 0.02 0.15 0.04 0.02 0.17 0.33 0.15 0.04 0.10 0.17 0.01 0.11 0.10 0.10

The median of these values is 0.1 so MAD(xn)=0.1. If you use R you get

> mad(copper)
[1] 0.14826

which is not the same. If instead you use

> mad(copper, const = 1)
[1] 0.1

which is the same. I shall use mad(·, const = 1) in the lecture notes apart from the R command. The factor 1.4826 in the R version of the MAD is chosen so that the MAD of a $N(\mu, \sigma^2)$ random variable is precisely $\sigma$. To see this we transfer the definition of the MAD to a random variable $X$ as the median of $|X - \mathrm{median}(X)|$.

Exercise 1.9. Let $X$ be a $N(\mu, \sigma^2)$ random variable. Show that $\mathrm{MAD}(X) = 0.6744898\,\sigma$ so that $1.482602\,\mathrm{MAD}(X) = \sigma$.
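A quick numerical check of this factor in R: the MAD of a standard normal random variable is the 0.75-quantile of the standard normal, which is the content of Exercise 1.9.

## MAD of a N(mu, sigma^2) random variable equals qnorm(0.75) * sigma,
## so multiplying by 1/qnorm(0.75) makes the MAD Fisher consistent for sigma.
qnorm(0.75)                        # 0.6744898
1/qnorm(0.75)                      # 1.482602, the constant used by R's mad()
mad(copper)                        # equals 1.4826 * mad(copper, const = 1)
1.482602 * mad(copper, const = 1)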

1.2.3 The length of the shorth

The third measure of dispersion is the length of the shortest interval in the definition of the shorth. We denote this by lshorth.

Example 1.10. We consider the data of Example 1.1 and use the results of Example 1.5. The length of the shortest interval is 0.14 and hence $\mathrm{lshorth}(x_n) = 0.14$.

1.2.4 The interquartile range IQR

The fourth measure of dispersion is the so called interquartile range. For this we require the definitions of the first quartile Q1 and the third quartile Q3 of a data set. Again, unfortunately, there is no agreed definition and the one I give here differs from them all. An argument in its favour will be given below. We order the data and define Q1 and Q3 as follows:

$$Q1 = \bigl(x_{(\lceil n/4 \rceil)} + x_{(\lfloor n/4 \rfloor + 1)}\bigr)/2, \qquad Q3 = \bigl(x_{(\lceil 3n/4 \rceil)} + x_{(\lfloor 3n/4 \rfloor + 1)}\bigr)/2. \qquad (1.7)$$
The interquartile range is then simply IQR = Q3 − Q1.
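A short R sketch of definition (1.7); the function name quartiles is mine, and note that R's built-in quantile() uses a different convention.

## First and third quartiles according to definition (1.7).
quartiles <- function(x) {
  x <- sort(x)
  n <- length(x)
  Q1 <- (x[ceiling(n/4)]   + x[floor(n/4) + 1])   / 2
  Q3 <- (x[ceiling(3*n/4)] + x[floor(3*n/4) + 1]) / 2
  c(Q1 = Q1, Q3 = Q3, IQR = Q3 - Q1)
}
quartiles(copper)   # Q1 = 1.92, Q3 = 2.08, IQR = 0.16 as in Example 1.11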

Example 1.11. We consider the data of Example 1.1. We have 27 observations and so

$$\lceil n/4 \rceil = \lceil 27/4 \rceil = \lceil 6.75 \rceil = 7, \qquad \lfloor n/4 \rfloor + 1 = \lfloor 6.75 \rfloor + 1 = 6 + 1 = 7.$$

This gives $Q1 = x_{(7)} = 1.92$. Similarly $Q3 = x_{(21)} = 2.08$ so that the interquartile range is IQR = 2.08 − 1.92 = 0.16.

1.2.5 What to use?

We now have four measures of scale or dispersion and there are more to come. The comments made in Section 1.1.4 apply here as well. One particular problem with the MAD is that it is zero if more than 50% of the observations are the same. This can happen for small data sets with rounding but can also occur in counting data where there is, for example, a large atom at zero.

Exercise 1.12. Think of an improved version of the MAD which is less susceptible to the rounding of measurements.

1.3 Boxplots

Even for small samples it is difficult to obtain a general impression of a data set, for example its location, scale, outliers and skewness, from the raw numbers. A boxplot is a graphical representation of the data which in most cases can provide such an impression at a glance. Figure 1 shows from left to right the boxplots of the copper data (Example 1.1), the suicide data (Example 1.2) and the Darwin data (Example 1.3 with the differences taken over all pots). We now describe the construction of a boxplot with the proviso that there are different definitions in the literature. The line through the box is the median. The lower edge of the box is the first quartile Q1 and the upper edge is the third quartile Q3. The "whiskers" extending from the bottom and top edges of the box start at the edge of the box and are extended to that observation which is furthest from the edge but at a distance of at most 1.5 IQR from the edge. Observations outside the whiskers are displayed individually. Although the boxplot can be used for individual data sets it is most useful for comparing data sets.
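Assuming the three data sets are available as numeric vectors copper, suicide and darwin (names of my choosing), boxplots as in Figure 1 can be produced along the following lines.

## Side-by-side boxplots of the data of Examples 1.1-1.3, as in Figure 1.
par(mfrow = c(1, 3))
boxplot(copper,  main = "copper")
boxplot(suicide, main = "suicide")
boxplot(darwin,  main = "darwin")
par(mfrow = c(1, 1))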

Example 1.13. The data displayed in Table 1 were taken from an interlaboratory test and give the measurements returned by 14 laboratories of the amount of mercury (micrograms per litre) in a water sample. Each laboratory was required to give four measurements.

Figure 1: The panels from left to right show the boxplots of the copper data (Example 1.1), the suicide data (Example 1.2) and the Darwin data (Example 1.3) respectively.

Table 1: The results of an interlaboratory test involving 14 laboratories (Lab. 1 to Lab. 14), each of which was required to give four measurements of the quantity of mercury in a sample of drinking water. The units are micrograms per litre. The results are of different precisions because the laboratories were using different instruments and there was no instruction to round to a certain number of significant figures.

Figure 2: The boxplots of the interlaboratory test data of Example 1.13.

The boxplot shown in Figure 2 gives an immediate impression of the outlying laboratories and also of the precision of the measurements they returned. The fact that the measurements have different numbers of significant figures is due to the returns of the individual laboratories. The R command is

>boxplot(data.frame(t(mercury)))

Exercise 1.14. Interpret the boxplot of Figure 2.

1.4 Empirical distributions

1.4.1 Definition

Given a real number x we denote by δx the measure with unit mass at the point x, the so called Dirac measure. For any subset B of R we have

$$\delta_x(B) = \{x \in B\}$$

13 x / B. For any function f : R R we have ∈ →

f(u) dδx(u)= f(x). Z Given a sample (x1,...,xn) we define the corresponding empirical measure Pn by 1 n P = δ . (1.8) n n xi i=1 X The empirical distribution Pn is a probability measure and for any subset B of R we have 1 P (B)= number of points x in B. n n × i Furthermore for any function f : R R we have → 1 n f(u) dP (u)= f(x ). (1.9) n n i i=1 Z X 1.4.2 Empirical distribution function

Given data xn =(x1,...,xn) the empirical distribution function Fn is defined by F (x) = P (( , x]) n n −∞ 1 n = x x (1.10) n { i ≤ } i=1 X Figure 1.8 shows the empirical distribution functions of the data sets of the Examples 1.1 and 1.2.

1.4.3 Transforming data

First of all we recall some standard notation from measure and integration theory. It has the advantage of unifying definitions and concepts, just as measure and integration theory allows in many situations a unified approach to discrete (say Poisson) and continuous (say Gaussian) probability models.

Definition 1.15. Given a probability distribution $P$ on (the Borel sets of) $\mathbb{R}$ and a (measurable) transformation $f : \mathbb{R} \to \mathbb{R}$ we define the transformed distribution $P^f$ by
$$P^f(B) := P(\{x : f(x) \in B\}) = P(f^{-1}(B)). \qquad (1.11)$$


Figure 3: The upper panel shows the empirical distribution function of the data of Example 1.1. The lower panel shows the empirical distribution function of the data of Example 1.2.

Exercise 1.16. Show that P f is indeed a probability distribution.

In many situations we have a natural family of transformations of the sample space (here R) which forms a group.F Examples are the affine group on R which corresponds to changes in units and origin and will be investigated in the following section. In the case of directional data in two and three dimensions one is lead to the group of rotations or orthogonal transformations. We refer to Davies and Gather (2005) for further examples. If we have a data set xn =(x1,...,xn) and transform it using a function f : R R we obtain the data set x′ = (x′ ,...,x′ ) with x′ = f(x ), i = → n 1 n i i 1,...,n. If the the empirical distribution of xn is Pn then the empirical dis-

15 ′ tribution of xn is given by

1 n P′ (B) = x′ B n n { i ∈ } i=1 X 1 n = f(x ) B n { i ∈ } i=1 X 1 n = x f −1(B) n { i ∈ } i=1 P X−1 Pf = n(f (B)) = n(B)

P′ Pf so that n = n.

1.4.4 Statistics and functionals

A statistic may be regarded simply as a function of the data so that the mean, median and standard deviation of a data set are all statistics. There is however another way of looking at many, but not all, statistics and that is to regard them as functions of a probability distribution. Functions on large spaces are usually called functionals and so this approach is a functional analytic approach. The advantage of doing this is that we are able to apply the ideas and techniques of functional analysis to statistical problems. If we denote the identity function on $\mathbb{R}$ by $\iota$, that is $\iota(x) = x$, then we see that for a data set $x_n$ with empirical distribution $P_n$

$$\mathrm{mean}(x_n) = \int \iota(x)\, dP_n(x) = \int x\, dP_n(x) =: T_{ave}(P_n).$$
If we set $\mathcal{P}' = \{P : \int |x|\, dP(x) < \infty\}$ then we define the mean functional by
$$T_{ave}(P) = \int x\, dP(x) \quad \text{for all } P \in \mathcal{P}'. \qquad (1.12)$$
The mean is not defined for all distributions but the median can be. We have

Theorem 1.17. For any probability distribution P on the Borel sets of R the set I = x : P (( , x]) 1/2, P ([x, )) 1/2 med { −∞ ≥ ∞ ≥ } is a non-empty closed interval.

Definition 1.18. For any distribution P we define the median to be the mid-point of the interval Imed of Theorem 1.17.

16 Exercise 1.19. Show that this definition agrees with the old one for a data set xn in the following sense. If the data xn have empirical distribution function Pn then med(xn)= Pmed(Pn) To define the MAD we proceed as follows. For any probability measure P let f denote the function f(x)= x med(P ) . Then | − | f Tmad(P ) := Pmed(P ). Exercise 1.20. Show that this definition agrees with the old one for a data set xn.

1.4.5 Fisher consistency

k Suppose we have a parametric family Θ = Pθ : θ =(θ1,...,θk) Θ R of distributions on R. P { ∈ ⊂ } Definition 1.21. A statistical function T : ′ R with ′ is called P → PΘ ⊂ P Fisher consistent for θj if T (P )= θ for all θ Θ. θ j ∈ In some cases Fisher consistency does not hold but can be made to do so by a simple transformation. Example 1.22. For the MAD and the (µ, σ2) distribution we have N T ( (µ, σ2))=0.6745σ mad N so that the MAD is not Fisher consistent for σ. It can be made so by multi- plying by 1.4826. A Fisher consistent functional is not necessarily unbiased. Example 1.23. Consider the family of binomial distributions b(n, p) and define T (P )= T (P ) for any distribution P concentrated on R = [0, ). ave + ∞ Then T (b(n, p)) = √p but there is no unbiased estimate for √p. p Exercise 1.24. Prove the claim made in Example 1.23 that there is no unbiased estimator for √p. A Fisher which is in some sense continuous is however ∞ consistent. If T is Fisher consistent for θj and (X(θ)n)1 are i.i.d. random P ∞ variables with distribution Pθ and empirical distributions ( n)1 then the Glivenko-Cantelli theorem tells us that P P . If now T is continuous at n → θ Pθ then lim T (Pn)= T (Pθ)= θj. n→∞

1.5 Equivariance considerations

1.5.1 The group of affine transformations

A real data set xn = (x1,...,xn) often consists of readings made in cer- tain units such as heights measured in centimeters. If the same heights are ′ ′ ′ measured in inches xn =(x1,...,xn) then to a first approximation we have ′ xi =2.54xi, i =1,...,n. The basis from which the heights are measured may also change. Instead of being measured from sea level or, in Holland, NAP they may be measured with respect to the floor of a house. If the units of measurement are the same the two samples will be related by ′ xi = xi + b, i =1,...,n for some real number b. In general we can consider a change of units and a change of basis which leads to transformations A : R R of the form → A(x)= ax + b with a, b R, a =0. ∈ 6 Such a transformation is called an affine transformation and the set of affine transformations A : R R forms a group, the affine group. An affineA → transformation may be regarded as a change of units and origin. Exercise 1.25. Show that is indeed a group. A 1.5.2 Measures of location

If we apply the affine transformation A(x) = ax + b to the data set xn = ′ ′ ′ ′ (x1,...,xn) to give xn =(x1,...,xn) with xi = A(xi) then this relationship transfers to the means ′ ′ mean(xn)= x¯n = A(x¯n)= ax¯n + b = a mean(xn)+ b. (1.13) This relationship holds also for the median and the shorth. Indeed we may take it to be the defining property of a location measure. In other words a location measure TL is one for which

TL(A(xn)) = A(TL(xn)). (1.14) Such measures are called location equivariant. For this to be mathematically ′ correct the domain of TL must be invariant under the group of affine trans- formations, that is P A ′ for all P ′ and for all affine transformations ∈ P ∈ P A. Exercise 1.26. Show that the Definition 1.18 is the only one which makes the median affine equivariant.

1.5.3 Measures of scale

Similar considerations apply to scale measures although the definition of equivariance is different. Indeed if we transform the data set $x_n = (x_1,\ldots,x_n)$ using the affine transformation $A(x) = ax + b$ to give the data set $x'_n = (x'_1,\ldots,x'_n)$ then the relationship between the standard deviations is

$$\mathrm{sd}(x'_n) = |a|\, \mathrm{sd}(x_n).$$
This applies to the MAD, the length of the shorth and the interquartile range IQR. We call a measure $T_S$ scale equivariant if

$$T_S(A(x_n)) = |a|\, T_S(x_n), \qquad A(x) = ax + b. \qquad (1.15)$$
Here again we require $P^A \in \mathcal{P}'$ for all $P \in \mathcal{P}'$ and for all affine transformations $A$.

1.6 Outliers and breakdown points

1.6.1 Outliers

Outliers are important for two reasons. Firstly an outlier is an observation which differs markedly from the majority of the other observations. This may be simply due to a mistake, an incorrectly placed decimal point, or it may be because of particular circumstances which are themselves of interest. Secondly, an undetected outlier can completely invalidate a statistical analysis. In one-dimensional data outliers can be detected by visual inspection. Nevertheless there is an interest in automatic detection. There are national and international standards for analysing the results of interlaboratory tests (Example 1.13). The main problem when devising the statistical part of such a standard is how to deal with outliers, which are very common in such data. Another reason for the interest in automatic methods is the large amount of data available, which makes it impossible to have every data set visually inspected. The effect of outliers can be illustrated as follows.

Example 1.27. We consider the copper data of Example 1.1. The mean, standard deviation, median and MAD of the data are, as returned by R, respectively 2.015926, 0.1140260, 2.03, 0.1. Suppose we replace the last observation 2.13 by 21.3. The corresponding results are 2.725926, 3.644391, 2.03, 0.1.

The standard deviation has been most affected, followed by the mean. The values of the median and MAD have not changed. Even if we also change the value of the second last observation from 2.13 to 21.3 the median and the MAD do not change. The values returned by R for this altered sample are 3.435926, 5.053907, 2.03, 0.1. We look at this phenomenon more closely in the next section.

1.6.2 Breakdown points and equivariance

We wish to measure the resilience of a statistical measure $T$ with respect to outliers. A simple and useful concept is that of the breakdown point introduced by Hampel (1968). The version we use is the so called finite sample breakdown point due to Donoho and Huber (1983). To define this we consider a sample $x_n = (x_1,\ldots,x_n)$ and, to allow for outliers, we consider samples which can be obtained from $x_n$ by replacing at most $k$ of the $x_i$ by other values. We denote a generic replacement sample by $x_n^k = (x_1^k,\ldots,x_n^k)$ and its empirical distribution by $P_n^k$. We have
$$\sum_{i=1}^{n} \{x_i^k \ne x_i\} \le k.$$
We define the finite sample breakdown point of a statistical functional $T$ at the sample $x_n$ by
$$\varepsilon^*(T, x_n) = \min\Bigl\{k/n : \sup_{x_n^k} |T(x_n^k) - T(x_n)| = \infty\Bigr\}. \qquad (1.16)$$

Example 1.28. The mean has a breakdown point $\varepsilon^*(\mathrm{mean}, x_n) = 1/n$ for any data $x_n$. We need only move one observation, say $x_1$, to $\infty$ and it is clear that the mean of this sample will also tend to infinity.

Example 1.29. The median has a breakdown point $\varepsilon^*(\mathrm{med}, x_n) = \lfloor (n+1)/2 \rfloor / n$ for any data $x_n$. We consider the case that $n = 2m$ is even. If we move $k = m-1$ observations all to $\infty$ or all to $-\infty$ (worst case) then the two central values of the ordered replacement sample will come from the original sample $x_n$. This implies
$$\sup_{x_n^k} |\mathrm{med}(x_n^k) - \mathrm{med}(x_n)| < \infty, \qquad k \le m-1.$$
If we now move $m$ points all to $\infty$ then only one of the two central points will come from the original sample. This implies
$$\sup_{x_n^k} |\mathrm{med}(x_n^k) - \mathrm{med}(x_n)| = \infty, \qquad k = m$$

and hence
$$\varepsilon^*(\mathrm{med}, x_n) = m/2m = 1/2 = \lfloor (n+1)/2 \rfloor / n. \qquad (1.17)$$

Exercise 1.30. Show that (1.17) also holds if the sample size $n = 2m+1$ is an odd number.

This is the highest possible finite sample breakdown point for measures of location as we shall now show.

Theorem 1.31. Let TL be a measure of location, that is T satisfies (1.14). Then for any sample xn we have ε∗(T , x ) (n + 1)/2 /n. L n ≤ ⌊ ⌋ Proof. We restrict ourselves to the case where n =2m is even. We consider ′ ′′ two replacement samples xn and xn with k = m defined by ′ ′ xi = xi, i =1,...,m, xm+i = xi + λ, i =1,...,m, x′′ = x λ, i =1,...,m, x′′ = x , i =1,...,m. i i − m+i i ′ ′′ The samples xn and xn are related by x′′ = x′ λ, i =1,...,n i i − ′′ ′ and hence, as TL is affinely equivariant TL(xn) TL(xn) = λ. Both are replacement samples with k = m and on letting| λ − we see| that not both T (x′ ) T (x ) and T (x′′ ) T (x ) can remain→∞ bounded. Therefore | L n − L n | | L n − L n | m sup TL(xn ) TL(xn) = m xn | − | ∞ and hence ε∗(T , x ) m/2m =1/2= (n + 1)/2 /n. L n ≤ ⌊ ⌋ . Remark 1.32. In Theorem 1.31 we did not make full use of the affine equiv- ariance. All we required was the translation equivariance. Remark 1.33. Theorem 1.31 connects the idea of breakdown point and equiv- ariance which at first sight are unrelated. However if no conditions are placed upon a functional T then the breakdown point can be 1. A trivial way of at- taining a breakdown point of 1 is simply to put T (xn)=0 for all xn. An apparently less trivial example is to restrict T to an interval [a, b] for which it is is a priori known that the correct value must lie in this interval. It was argued in Davies and Gather (2005) that equivariance considerations are re- quired in order that a non-trivial bound for the breakdown point exists.

21 We now apply the above ideas to measures of scale. One problem is the definition of breakdown. Clearly breakdown will be said to occur if arbitrarily large values are possible. As measures of scale are non-negative it is not possible to have the values tending to . The question is whether values tending to zero should be regarded as breakdown.−∞ This is often the case and zero values of scale due to rounding the measurements are one of the problems when evaluating interlaboratory tests. To take this into account we use the metric d(s ,s )= log(s ) log(s ) (1.18) 1 2 | 1 − 2 | on R+. The finite sample breakdown point for measures of scale is defined by

∗ k ε (TS, xn) = min k/n : sup log(TS(xn)) log(TS(xn)) = . (1.19) { k | − | ∞} xn

∗ with the convention that ε (TS, xn)=0if TS(xn)=0. In order to state the theorem for measures of scale corresponding to The- orem 1.31 we require the following definition. Definition 1.34. For any probability measure P on R we define

∆(P ) = max P ( x ) x { } and for a sample xn we sometimes write

∆(xn) := ∆(Pn) = max Pn( x ) = max j : xj = x /n. x { } x |{ }| In other words ∆(P ) is the largest mass concentrated at a single point. For a sample it is the largest proportion of observations with the same value. Theorem 1.35. Let T be a measure of scale, that is T (A(x )) = a T (x ) S S n | | S n for A(x)= ax + b. Then for any sample xn we have

ε∗(T , x ) (n(1 ∆(x ))+1)/2 /n. S n ≤ ⌊ − n ⌋ Proof. Let ∆(x )= l/n. We restrict ourselves to the case where n l =2m is n − even. At least one value in the sample xn is present l times and we take this to be x1,...,xl. Let A be an affine transformation with A(x1)= ax1 +b = x1 ′ ′′ and a = 1. We consider two replacement samples xn and xn with k = m defined6 by

x′ = x , 1 i l, x′ = x , 1 i m, i i ≤ ≤ l+i l+i ≤ ≤ x′ = A−1(x ), 1 i m, l+m+i l+i ≤ ≤ 22 and

x′′ = A(x )= x , 1 i l, x′′ = A(x ), 1 i m, i i i ≤ ≤ l+i l+i ≤ ≤ x′′ = x , 1 i m. l+m+i l+i ≤ ≤ ′ ′′ The samples xn and xn are related by

′′ ′ xi = A(xi), i =1,...,n and hence, as T is affinely equivariant T (x′′ ) = a T (x′ ). Both are re- S S n | | S n placement samples with k = m and on letting a or a 0 we see that ′ →∞′′ → not both log(TS(xn)) log(TS(xn)) and log(TS(xn)) log(TS(xn)) can remain bounded.| Therefore− | | − |

m sup log(TS(xn )) log(TS(xn)) = m xn | − | ∞ and hence

ε∗(T , x ) m/n = ((n l)/2)/n = (n(1 ∆(P ))+1)/2 /n. S n ≤ − ⌊ − n ⌋ 

Exercise 1.36. Prove Theorem 1.35 for the case n l =2m + 1. − Example 1.37. The standard deviation functional sd has a breakdown point ∗ ε (sd, xn)=1/n for any data xn. We need only move one observation say x1 to and it is clear that the standard deviation of this sample will also tend to∞ infinity.

Example 1.38. We calculate the finite sample breakdown point of the MAD. We note firstly that if ∆(xn) > 1/2 then MAD(xn) = 0 and hence ∗ ε (MAD, xn)=0. This shows that the MAD does not attain the upper bound of Theorem 1.35. Suppose now that ∆(xn) = l/n 1/2 and that n = 2m is even. If the value x is taken on l times and we≤ now alter m +1 l of 1 − the remaining sample values to x1 we see that the MAD of this replacement sample is zero. This gives

ε∗(MAD, x ) (m +1 l)/n = n/2+1 /n ∆(x ). n ≤ − ⌊ ⌋ − n This upper bound also holds for n odd. So far we have altered the sample so that the MAD of the replacement sample is zero. We can also try and make the MAD breakdown by becoming arbitrarily large. A short calculation

23 shows that in order to do this we must send off (n + 1)/2 observations to and hence ⌊ ⌋ ∞ ε∗(MAD, x ) min n/2+1 /n ∆(x ), (n + 1)/2 /n n ≤ {⌊ ⌋ − n ⌊ ⌋ } = n/2+1 /n ∆(x ) ⌊ ⌋ − n as ∆(xn) 1/n. An examination of the arguments shows that the MAD can only breakdown≥ under the conditions we used for the upper bound and hence

ε∗(MAD, x )= n/2+1 /n ∆(x ). n ⌊ ⌋ − n As already mentioned this is less than the upper bound of Theorem 1.35. It is not a simple matter to define a scale functional which attains the upper bound but it can be done.

1.6.3 Identifying outliers This section is based on Davies and Gather (1993). We take the copper data of Example 1.1 and we wish to identify any possible outliers. There are many ways of doing this and much misguided effort was put into devising statistical tests for the presence of outliers in one-dimensional data sets (see Barnett and Lewis (1994)). The problem is that identifying outliers does not fit well into standard statistical methodology. An optimal test for say one outlier may be devised but, as we shall see, it performs badly if more than one outlier is present. The Bayesian approach is no better requiring as it does the parameterization of the number of outliers and their distribution together with an appropriate prior. We describe the so called ‘two-sided discordancy test’ (Barnett and Lewis (1994, pages 223-224)). For a data set xn the is

TS(xn) = max xi mean(xn) /sd(xn) (1.20) i | − | with critical value c(T S, n, α) for a test of size α. The distribution of the test statistic TS is calculated under the . The affine invariance of TS means that it is sufficient to consider the standard normal distribution. Simulations show that for n = 27 and α =0.05 the critical value is c(T S, 27, 0.05) = 2.84 where we have used the R version of the standard deviation. The attained value for the copper data is 2.719 so that the test accepts the null hypothesis of no outliers. Suppose we now replace the largest observation 2.21 by 22.1. The value of the test statistic is now 5.002 and the observation 22.1 and only this observation is identified as an outlier. We now replace the second largest observation, 2.20 by 22. The value of the

24 test statistic is now 3.478 and the observations 22 and 22.2 and only these observations are identified as outliers. We continue and replace the third largest observation 2.16 by 21.6. The value of the test statistic is now 2.806 and we come to the rather remarkable conclusion that there are no outliers present. The source of the failure is easily identified. The values of the mean as we move from no outlier to three are 2.016, 2.752, 3.486 and 4.206. The corresponding values of the standard deviation are 0.116, 3.868, 5.352 and 6.376. We conclude that the mean and standard deviation are so influenced by the very outliers they are meant to detect that the test statistic drops below its critical value. The following exercises explain this behaviour.

Exercise 1.39. Let X be a (0, 1) random variable. Show that for all x> 0 N 2 1 P ( X x) exp( x2/2). | | ≥ ≤ rπ x − Exercise 1.40. Let X , i = 1,...,n be i.i.d. (0, 1) random variables. i N Show that 0, τ< 2, lim P ( max Xi τ log(n))= n→∞ 1≤i≤n | | ≤ 1, τ 2.  ≥ p Exercise 1.41. Show that the critical values c(T S, n, α) of the two-sided discordancy test satisfy

c(T S, n, α) lim =1 n→∞ 2 log(n) for any α, 0 <α< 1. p

Exercise 1.42. Show that for large n the two-sided discordancy test can fail to detect n/(τ log(n)) arbitrary large outliers for any τ < 2.

The failure of the two-sided discordancy test to identify large outliers is related to the low finite sample breakdown points of 1/n of the mean and standard deviation. The remedy is simply to replace the mean and standard deviations by other measures with a high finite sample breakdown point. The obvious candidates are the median and the MAD. We replace the TS of (1.20) by TH(xn) = max xi med(xn) /MAD(xn). (1.21) i | − | The ‘H’ in (1.21) stands for Hampel who proposed cleaning data by remov- ing all observations x for which x med(x ) 5.2MAD(x ) and then i | i − n | ≥ n

25 calculating the mean of the remainder (see Hampel 1985). Again we calcu- late the cut-off value for TH using the normal distribution. Repeating the simulations as for TS we obtain a value of c(T H, 27, 0.05) = 5.78 where we have used the const = 1 version of the MAD in R. The values for none, one, two and three outliers as before are 3.3, 200.7, 200.7 and 200.7 so that all outliers are easily identified. How far can we go? If we replace the largest 13 observations by ten times their actual value the value of TH is 60.82 and all outliers are correctly identified. However if we replace the largest 14 ob- servations by ten times their actual value the value of TH is 10.94 and the smallest 13 observations are now identified as outliers. The sample is now 1.70 1.86 1.88 1.88 1.90 1.92 1.92 1.93 1.99 1.99 2.01 2.02 2.02 2.03 20.4 20.5 20.5 20.6 20.6 20.6 20.8 21.3 21.3 21.5 21.6 22.0 22.1. In the absence of further information this is perfectly reasonable as the large observations are now in the majority. The fact that the switch over takes place when about half the observations are outliers is related to the finite sample breakdown point of the median and the MAD, namely 1/2. The cut-off value c(T H, 27, 0.05) = 5.78 was obtained by simulations. Simulations are a very powerful technique for evaluating integrals in high dimensions which is what we are essentially doing but they are computa- tionally expensive. Furthermore it is not possible to simulate the value of c(T H, n, α) for every n and α and thought must be given to how best to calculate the c(T H, n, α) at least approximately. The following method is often appropriate. (1) Determine a simple asymptotic formula which applies for large n.

(2) Use simulations to obtain a sample size n1 for which the asymptotic formula (1) is sufficiently accurate for n n . ≥ 1

(3) Choose a n0 < n1 such that it is possible to approximate the values for n n n by a simple curve to be determined empirically. 0 ≤ ≤ 1 (4) Simulate the values for n n and store them in a table. ≤ 0 We restrict ourselves to say 3 values of α, say 0.1,0.05 and 0.01. For small values of n, say 3:19, that is n0 = 19, we calculate the simulated values and store them. These are given in Table 2 and Figure 4 shows a of the values of c(T H, n, 0.1) for n = 3(1)19. Exercise 1.43. Explain the odd-even sample size behaviour of c(T H, n, 0.1) evident in Figure 4.

26 For large values of n we try and determine the asymptotic behaviour of c(T H, n, α) and then consider the differences to the actual behaviour for large n. With luck these will tend to zero and exhibit a regular behaviour which allows them to be approximated by a simple function of n. This can be calculated using the linear regression module of R and we finally obtain a simple formula for c(T H, n, α) for large n. The first step is to calculate the asymptotic behaviour of c(T H, n, α). If the sample X is i.i.d. (0, 1) then n N med(X ) 0 and MAD(X ) 0.67449. This gives n ≈ n ≈ TH(Xn) max Xi /0.67449. ≈ i | |

n α = 0.1 α = 0.05 α = 0.01 3 15.40 32.54 158.97 4 7.11 10.19 21.82 5 7.84 11.58 26.42 6 6.18 8.07 13.53 7 6.78 9.17 16.28 8 5.84 7.23 11.98 9 5.99 7.45 12.16 10 5.58 6.77 10.35 11 5.68 6.92 11.06 12 5.37 6.47 9.34 13 5.52 6.58 9.38 14 5.28 6.16 8.46 15 5.44 6.40 8.66 16 5.25 6.12 8.17 17 5.36 6.20 8.42 18 5.21 6.01 7.91 19 5.26 6.02 8.00

Table 2: The cut-off values c(T H, n, α) of TH of (1.21) for n = 3(1)19 and for α =0.1, 0.05 and 0.01.

Now if x is chosen so that

n 1 α = P (max) Xi x)= P ( X1 x) − i | | ≤ | | ≤ we see that 2P (X x) 1= P ( X x)=(1 α)1/n 1 ≤ − | 1| ≤ − so that c(T H, n, α) Φ−1 (1+(1 α)1/n)/2 /0.67449. (1.22) ≈ −  27 c(TH,n,0.1) 6 8 10 12 14 16


Figure 4: The values of c(T H, n, 0.1) of TH of (1.21) for n = 3(1)19. This is the second column of Table 2.

28 This is the asymptotic formula we wished to derive. On putting γ(T H, n, α)=0.67449 c(T H, n, α)/Φ−1 (1+(1 α)1/n)/2 (1.23) −  we expect that limn→∞ γ(T H, n, α)=1. In fact due to the variability of the median and MAD we expect that for finite n the value of γ(T H, n, α) should exceed one. It seems reasonable therefore that lim→∞ log(γ(T H, n, α) 1) = 0. The R command for Φ−1(p) is −

> qnorm(p) so that Φ−1 (1+(1 α)1/n)/2 can be easily calculated. The next ques- tion is which values of−n to use. As we expect the dependence to be in log(n)  (experience!) we choose the values of n such that log(n) is approximately linear. In the following we put n = 20 1.250(1)19. To test whether this is satisfactory a small simulation with 1000· simulations for each value of n can be carried out. The following results are based on 50000 simulations for each value of n. The upper panel of Figure 5 shows the values of γ(T H, n, 0.1) plotted against those of log(n). As expected they tend to one as n increases. The lower panel of Figure 5 shows the values of log(γ(T H, n, 0.1) 1) plotted against those of log(n). As expected they tend to as n − . Luckily the dependence on log(n) is clearly almost linear. The−∞ line in the→ lower∞ panel show the regression line log(γ(T H, n, 0.1) 1)=1.261 0.867 log(n). − − Putting all this together we get the following approximation for c(T H, n, 0.1) where we simplify the numerical values as the exact ones are not very im- portant. We are only trying to identify outliers and this does not depend on the last decimal point. Bearing this in mind we have the approximation for n 20 ≥ c(T H, n, 0.1) 1.48(1+ 3.53 n−0.87)Φ−1 (1+0.91/n)/2 . (1.24) ≈ The corresponding results for c(T H, n, 0.05) and c(T H, n, 0.01) for n 20 are ≥ c(T H, n, 0.05) 1.48(1+ 4.65 n−0.88)Φ−1 (1+0.951/n)/2 , (1.25) ≈ c(T H, n, 0.01) 1.48(1+ 8.10 n−0.93)Φ−1 (1+0.991/n)/2 . (1.26) ≈  Exercise 1.44. Repeat the simulations and approximations for the outlier identifier based on the shorth and the length of the shortest half-interval:

TSH(xn) = max xi shrth(xn) /lshrth(xn). i | − |
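For reference, the approximations (1.24)-(1.26) are easy to code; the function name cTH is my own and, as stated above, the formulas are only intended for n of roughly 20 or more.

## Approximate cut-off values c(TH, n, alpha) of (1.24)-(1.26) for n >= 20.
cTH <- function(n, alpha = 0.1) {
  pars <- switch(as.character(alpha),
                 "0.1"  = c(3.53, 0.87),
                 "0.05" = c(4.65, 0.88),
                 "0.01" = c(8.10, 0.93),
                 stop("alpha must be 0.1, 0.05 or 0.01"))
  1.48 * (1 + pars[1] * n^(-pars[2])) * qnorm((1 + (1 - alpha)^(1/n))/2)
}
cTH(27, 0.05)   # roughly comparable with the simulated value 5.78 for n = 27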


Figure 5: The upper panel shows the values of γ(T H, n, 0.1) of (1.23) plotted against the values of log(n). the lower panel shows the values of log(γ(T H, n, 0.1) 1) plotted against the values of log(n) together with the approximating linear− regression.

1.7 M-measures of location and scale

1.7.1 M-measures of location

So far we only have three measures of location: the mean, the median and the shorth. We now introduce a whole class of location measures which we motivate by showing that the mean and the median are special cases. Let $P_n$ denote the empirical measure of the sample $x_n$. Then if we use (1.9) with $f(x) = \iota(x) = x$ we see that

mean(xn)= ι(x) dPn(x)= xdPn(x). Z Z From this it follows that

ι(x mean(x )) dP (x)=0 (1.27) − n n Z 30 and in fact this defines the mean. By choosing a different function namely the sign function sgn defined by

$$\mathrm{sgn}(x) = \begin{cases} -1, & x < 0,\\ \ \ 0, & x = 0,\\ \ \ 1, & x > 0, \end{cases} \qquad (1.28)$$
we see that the median satisfies

sgn(x med(x )) dP (x)=0. (1.29) − n n Z This leads to a general class of statistical functionals TL, known as M– functionals defined by

1 n ψ(x T (P )) = ψ(x T (P )) dP (x)=0. (1.30) n − L n − L n n i=1 X Z It follows from the definition of TL that it depends only on the data xn through the empirical distribution Pn and hence we can regard it as a sta- tistical functional. Clearly there is no reason why M–functionals should be restricted to empirical distributions and we may define TL(P ) for any prob- ability distribution P by

ψ(x T (P )) dP (x)=0. (1.31) − L Z M–estimators were first introduced by Huber (1964) as a generalization of maximum likelihood estimates by breaking the link between the function ψ and the density f of the proposed model (see also Huber (1981)). To see this consider the standard location problem. Let Xi, i = 1,...,n be i.i.d. random variables with density f(x µ),µ R, and we wish to estimate µ by maximum likelihood. The estimate− is ∈

n µˆ = argmax log f(X µ). n µ i − i=1 X If f is differentiable then the first derivative of the the right hand side must be zero atµ ˆn and this give

n f ′(X µˆ ) i − n =0 f(X µˆ ) i=1 i n X − which is (1.30) with ψ(x)= f ′(x)/f(x).

31 Not all estimators are M-estimators. An example is the class of L- estimators which are of the form

n

L(xN )= anixi:n i=1 X where the ordered sample is x . . . x . 1:n ≤ ≤ n:n

Example 1.45. Consider an ordered sample x1:n . . . xn:n and for 0 < α < 1/2 put αn =: h. Define an estimate as follows.≤ ≤ Replace x for ⌊ ⌋ i:n 1 i h by xh+1:n and xn−i+1:n for 1 i h by xn−h:n. The mean of the modified≤ ≤ sample is known as the α–Winsorized≤ ≤ mean. Show that it is an L-estimator.

Under appropriate conditions (1.30) always has a unique solution.

Theorem 1.46. Let ψ : R R be an odd, continuous, bounded and strictly → increasing function. Then the equation

ψ(x m) dP (x)=0 − Z has a unique solution m =: TL(P ) for every distribution P.

Proof. The conditions placed on ψ guarantee that limx→±∞ ψ(x) = ψ( ) exist with < ψ( ) < 0 < ψ( ) < . The function ±∞ −∞ −∞ ∞ ∞ Ψ(m)= ψ(x m) dP (x) − Z is continuous and strictly monotone decreasing with Ψ( ) = ψ( ) > 0 and Ψ( ) = ψ( ) < 0. It follows that there exist a uniquely−∞ defined∞ m ∞ −∞ with Ψ(m) = 0 and this proves the theorem. 

Remark 1.47. Although it is generally assumed that ψ is an odd function the theorem still holds is this is replaced by ψ( ) < 0 < ψ( ). −∞ ∞

The statistical functional TL is as it stands not affine equivariant. A change of scale of the observations will not be correctly reflected in the value of the TL. Exercise 1.48. Show that if ψ satisfies the conditions of Theorem 1.46 then TL of (1.30) is translation equivariant but not affine equivariant.

32 To make TL affine equivariant we must augment (1.30) by an auxiliary measure of scale TS(P ). This leads to solving x T (P )) ψ − L dP (x)=0 (1.32) TS(P ) Z   for TL(P ). In general (but not always) a good auxiliary measure of scale is the MAD which leads to x T (P ) ψ − L dP (x)=0. (1.33) Tmad(P ) Z   To be of use one must to calculate the M-measure for data sets. In other words it must be possible to solve 1 n x m Ψ(m)= ψ i − =0 n s i=1 0 X   where s0 is the auxiliary scale functional. The following procedure can be shown to work.

(a) Put m0 = med(xn) and s0 = MAD(xn).

(b) If Ψ(m0) = 0 put TL(Pn)= m0 and terminate.

k (c) If Ψ(m0) < 0 put m2 = m0, choose k so that Ψ(m0 2 s0) > 0 and put m = m 2ks . − 1 0 − 0 k (d) If Ψ(m0) > 0 put m1 = m0, choose k so that Ψ(m0 +2 s0) < 0 and k put m2 = m0 +2 s0.

(e) Put m3 =(m1 + m2)/2. If Ψ(m3) < 0 put m2 = m3. If Ψ(m3) > 0 put m1 = m3. (f) Iterate (e) until m m < 10−4s /√n. 2 − 1 0 (g) Put TL(Pn)=(m1 + m2)/2 and terminate. We do of course require an explicit function ψ. In the following we shall restrict ourselves to the ψ–functions of the next example. We can control their behaviour by an appropriate choice of the tuning constant c.
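A direct R transcription of steps (a)-(g) might look as follows; this is only a sketch of the kind of function Exercise 1.50 asks for, the name mlocation is mine, and psi is any increasing, odd, bounded function such as the one of (1.34) defined in Example 1.49 below.

## M-measure of location by interval halving, following steps (a)-(g).
mlocation <- function(x, psi) {
  s0 <- mad(x, constant = 1)              # auxiliary scale, step (a)
  Psi <- function(m) mean(psi((x - m)/s0))
  m1 <- m2 <- median(x)
  if (Psi(m1) == 0) return(m1)            # step (b)
  step <- s0
  if (Psi(m1) < 0) {                      # step (c): move m1 down until Psi > 0
    while (Psi(m1) <= 0) { m1 <- m1 - step; step <- 2*step }
  } else {                                # step (d): move m2 up until Psi < 0
    while (Psi(m2) >= 0) { m2 <- m2 + step; step <- 2*step }
  }
  while (m2 - m1 >= 1e-4 * s0 / sqrt(length(x))) {   # steps (e) and (f)
    m3 <- (m1 + m2)/2
    if (Psi(m3) < 0) m2 <- m3 else m1 <- m3
  }
  (m1 + m2)/2                             # step (g)
}
psi3 <- function(x) tanh(1.5 * x)   # equals (exp(3x)-1)/(exp(3x)+1), i.e. (1.34) with c = 3
mlocation(copper, psi3)             # close to the value 2.0224 of Example 1.51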

Example 1.49. The function $\psi_c$ given by
$$\psi_c(x) = \frac{\exp(cx) - 1}{\exp(cx) + 1}, \qquad x \in \mathbb{R},\ c > 0, \qquad (1.34)$$
satisfies the conditions of Theorem 1.46.

33 Exercise 1.50. Write an R function to calculate TL for the ψ–function of (1.49).

Example 1.51. If we put c = 3 and use TS = Tmad then the value of the M-measure of location for the copper data of Example 1.1 is 2.0224 which lies between the mean and the median.

That this is generally the case can be seen by noting

lim ψc(x)/c = x c→0 and lim ψc(x) = sgn(x). c→∞ It follows that by letting c increase from a small to a large value we can ex- trapolate between the mean and the median. The question arises as to which value to choose. To help us in this decision we can investigate the behaviour of the different functionals under well–controlled situations by simulating data with different distributions. The ones we choose are the standard nor- mal distribution and the t-distributions with 20, 10, 5, 4, 3, 2 and 1 degrees of freedom. For each of the eight functionals the median, the M–functional with c=40, 10, 5, 3, 2, 1 and the mean and we calculate the mean square error 1 10000 MSE = T (P )2, i =1,..., 8 i 10000 i nj j=1 X based on 10000 simulations for each situation. In all cases we use the MAD as an auxiliary scale measure. The results are given in Table expressed in terms of their percentage efficiency with respect to the most efficient method that is 100 min(MSE1:8)/MSEi, i =1,... 8. For the normal distribution the mean is optimal.

Exercise 1.52. Repeat the simulations using the standard deviation as aux- iliary measure of scale and interpret your results.

The results indicate that a values of c between 2 and 3 are a reason- able choice. They represent a good compromise between being efficient for normal data whilst still being able to accommodate larger values from the t–distributions. Here is Tukey on the compromises which lead to the milk bottle:

34 median c = 40 c = 10 c =5 c =3 c =2 c = 1 mean N(0, 1) 64 66 73 81 88 93 98 100 t20 68 70 77 85 91 96 100 99 t10 72 74 82 89 95 98 100 94 t5 79 80 88 95 99 100 96 80 t4 81 83 91 97 100 99 93 70 t3 83 85 92 98 100 98 88 49 t2 87 89 97 100 97 92 76 3 t1 92 95 100 95 84 72 49 0 Table 3: Relative efficiencies of the mean and median and the the M– functional with ψ–function given by (1.34) for different values of c. The MAD was used as an auxiliary measure of scale.

Given that our concern is with engineering, it follows that we are likely to be concerned with additional attributes beyond those which arise naturally in a science-flavored approach to the under- lying problems. The history of the milk bottle is an illuminating example. Maximum milk for minimum glass would lead to spher- ical shape; adding a bound on cross-sectional dimensions would lead to circular cylinders with hemispherical ends. The following steps of practicality occurred, historically: a flat bottom to rest stably, • a reduced top to allow a small cap, • a rounded square cross section to pack better in the refrig- • erator, pushed in sides to make the bottles easier to grasp, • replacement of glass by paper restarted evolution with new • considerations. Those who had been concerned with the initial optimization could logically (but not reasonably) have argued that all these addi- tional considerations had dirtied up a clean problem. As a user of milk bottles, I am glad to see the additional attributes brought in.

1.7.2 M-measures of scatter

M-measures of scale TS are defined in a similar manner as follows.

35 Definition 1.53. For a function χ and a sample xn with empirical distri- bution Pn any solution TS(xn) of

1 n x x χ i = χ dP (x)=0. (1.35) n T (x ) T (x ) n i=1 S n S n X   Z   is call an M-measure or M-estimator of scale. For a general probability dis- tribution P we define TS(P ) to be the solution of

x χ dP (x)=0. (1.36) T (P ) Z  S 

Example 1.54. Suppose that the sample xn has mean zero and put χ(x)= 2 x 1. Show that TS is the standard deviation of the sample. Repeat for (1.36).−

Example 1.55. Suppose that the sample xn has median zero and put χ(x)= sgn( x 1). Show that T is the MAD of the sample. Repeat for (1.36). | | − S Corresponding to Theorem 1.46 we have

Theorem 1.56. Let χ : R R be an even, continuous, bounded and strictly increasing function on R with→ χ(0) = 1 and χ( )=1. Then the equation + − ∞ χ(x/s) dP (x)=0 Z has a unique solution s =: T (P ) with 0 < T (P ) < for every probability S S ∞ distribution P with P ( 0 ) < 1/2. { } Proof. As P ( 0 ) < 1/2 < 1 the function Ξ(s) = χ(x/s) dP (x) is strictly monotone decreasing{ } and continuous. Furthermore lim Ξ(s) = 1 and R s→∞ − lim Ξ(s) = P ( 0 )+(1 P ( 0 ) > 0 as P ( 0 ) < 1/2. Hence there s→0 − { } − { } { } exist a unique s =: TS(P ) with Ξ(s)=0. 

Example 1.57. The function

$$\chi(x) = \frac{x^4 - 1}{x^4 + 1} \qquad (1.37)$$
satisfies the assumptions of Theorem 1.56 and it will be our standard $\chi$-function.

36 Exercise 1.58. Explain why is there no point in using a tuning constant c and putting (cx)4 1 χ (x)= − . c (cx)4 +1 Does the same hold if we put x p 1 χ (x)= | | − p x p +1 | | with p> 0? Usually scale is a nuisance parameter and efficiency is not as important as robustness. This explains to some extent the choice of χ in Example 1.37.

Exercise 1.59. Plot the χp of Example 1.58 for various values of p and compare with sgn( x 1). | | − The calculation of an M-measure of scale follows the lines of that for M-measures of location. We write 1 n Ξ(s)= χ(x /s). n i i=1 X (a) Put s0 = MAD(xn).

(b) If Ξ(s0) = 0 put TS(Pn)= s0 and terminate.

−k (c) If Ξ(s0) < 0 put s2 = s0, choose k so that Ξ(s02 ) > 0 and put −k s1 = s02 .

k (d) If Ξ(s0) > 0 put s1 = s0, choose k so that Ψ(s02 ) < 0 and put k m2 = s02 .

(e) Put s3 = √s1s2. If Ξ(s3) < 0 put s2 = s3. If Ξ(s3) > 0 put s1 = s3. (f) Iterate (e) until s /s 1 < 10−4/√n 2 1 − (g) Put TS(Pn)= √s1s2 and terminate.

Exercise 1.60. Write an R function to calculate TS for the χ–function of Exercise 1.37.

Example 1.61. The measure TS with the χ–function (1.37) is not Fisher con- 2 sistent for σ of the (0, σ ) distribution. We have TS( (0, 1)) = 0.646409 so to make it FisherN consistent we must multiply by 1.547005.N The same holds for the exponential distribution (λ) which is a scale family. We have E TS( (λ)) = 0.6756551 so to make it Fisher consistent at the exponential distributionE we should multiply by 1.480045.

37 Example 1.62. If we treat the suicide data as a pure scale problem then the value of TS with the χ–function (1.37) is 71.49496. If we make TS Fisher consistent at the exponential distribution (look at the lower panel of Figure 1.37) then we should multiply by 1.547008. This results in 105.81 compared with a mean of 122.32.

1.7.3 Simultaneous M-measures of location and scale

In general location and scale functionals have to be estimated simultaneously. This leads to the following system of equations:

1 n x m x m ψ i − = ψ − dP (x) = 0 (1.38) n s s n i=1 X   Z   1 n x m x m χ i − = χ − dP (x) = 0 (1.39) n s s n i=1 X   Z   or for a general probability distribution P, x m ψ − dP (x)=0 (1.40) s Z   x m χ − dP (x)=0 (1.41) s Z   In order to guarantee that a solution exists we must place conditions on the functions ψ and χ as well as on the probability distribution P.

(ψ1) ψ( x)= ψ(x) for all x R. − − ∈ (ψ2) ψ is strictly increasing. (ψ3) limx→∞ ψ(x)=1. (ψ4) ψ is differentiable with a continuous derivative ψ(1).

(χ1) χ( x)= χ(x) for all x R. (χ2) χ :−R [ 1, 1] is strictly∈ increasing. + → − (χ3) χ(0) = 1. − (χ4) limx→∞ χ(x)=1. (χ5) χ is differentiable with a continuous derivative χ(1).

(ψχ1) χ(1)/ψ(1) : R R is strictly increasing. + → + (P ) ∆(P ) < 1/2 where ∆(P ) := max P ( x ) which agrees x { } with (1.34) when P = Pn.

38 The existence of joint M-measures of location and scale is not trivial but the following theorem holds. Theorem 1.63. Under the above assumptions (ψi), i = 1,..., 4, (χi), i = 1,..., 5 , (ψχ1) and (P ) the equations (1.38) and (1.39) have a unique solution (m, s)=:(TL(P ), TS(P )). Proof. A proof may be found in Huber (1981). 

Example 1.64. We consider the ψ– and χ–functions (exp(cx) 1) ψ(x)= ψ(x, c) = − (1.42) (exp(cx)+1) x4 1 χ(x) = − . (1.43) x4 +1 They clearly fulfill all the conditions of Theorem 1.63 for all values of c> 0 apart possibly for the condition (ψχ1). Direct calculations show that this holds for c> 2.56149. This does not exclude the existence and uniqueness if c< 2.56149 as the condition is a sufficient one. Exercise 1.65. Verify the calculation of Example 1.64 An algorithm for calculating the solutions for a data set may be derived from the proof of the theorem given in Huber (1981). The following simple iteration scheme seems to always converge although I have no proof: Put s = s = T (P ) in (1.38) and solve for m = m . • 1 mad n 1 Put m = m in (1.39) and solve for s = s . • 1 2 Put s = s in (1.38) and solve for m = m . • 2 2 Put m = m in (1.39) and solve for s = s . • 2 3 Iterate as indicated and terminate when • m m + s s < 10−4s /√n. | k+1 − k| | k+1 − k| k Example 1.66. We calculate the simultaneous location and scale functionals TL and TS for the copper data using the functions (exp(cx) 1) ψ(x)= ψ (x) = − c (exp(cx)+1) x4 1 χ(x) = − . x4 +1

39 median c = 40 c = 10 c =5 c =3 c =2 c = 1 mean N(0, 1) 64 65 70 76 82 87 96 100 t20 69 69 74 80 86 92 99 100 t10 72 73 78 84 90 95 100 94 t5 79 80 85 91 96 99 100 80 t4 81 82 88 93 97 100 98 70 t3 83 84 89 95 98 100 95 49 t2 87 88 94 98 100 97 86 3 t1 93 95 100 99 94 84 61 0 Table 4: Relative efficiencies of the mean and median and the the location part of the M–functional defined by (1.38) and (1.39) with ψ–function given by (1.34) for different values of c and χ– function given by (1.37).

with tuning parameter c = 3 We obtain TL = 2.021713 (mean 2.015926, median 2.03) and TS =0.0674 (standard deviation 0.1140260, MAD 0.l). If we make TS Fisher consistent at the normal model we must multiply TS by 1.547005. The scale then becomes 0.1043. Exercise 1.67. Verify the calculation of Example 1.66. We now repeat the simulations of Table 3 for the simultaneous M–functional. Care should be taken when making a direct comparison as the location es- timates are based on different scales and therefore the values of c cannot be compared directly. To see this we note if we use the ψ–function (1.34) with c = c0 and the auxiliary scale measure TS when solving (1.32) the result is exactly the same as solving (1.32) with c = 2c0 and auxiliary scale measure 2TS(P ). In our particular case Table 3 was based on th R default version of the MAD which is Fisher consistent whereas the results of Table 4 use the TS of the simultaneous M–functional. This is approximately a factor 0.6464 smaller which means that a value of c in Table 3 corresponds roughly to a values of 1.547 c in Table 4.

1.8 Analytic properties of M–functionals We now take advantage of the functional analytical approach and investi- gate two analytical properties, namely local boundedness and differentiabil- ity, which are of relevance for statistical applications. I know of no statistical application of continuity alone. Local boundedness is useful for explorative data analysis whereas differentiability is useful for the more formal inference part of statistics. Concepts such as continuity and differentiability require a

40 topology. We denote the set of all probability distributions on the Borel sets of R by and turn into a metric space by introducing the Kolmogorov P P metric

Definition 1.68.

dko(P, Q) = sup P (( , x]) Q(( , x]) . (1.44) x∈R {| −∞ − −∞ |}

A second useful metric is the Kuiper metric

Definition 1.69.

dku(P, Q) = sup P ((a, b]) Q((a, b]) . (1.45) a

Exercise 1.70. Show that d and d define the same topology on . ko ku P Exercise 1.71. Show that the metric space ( ,d ) is neither separable nor P ko complete.

A real–valued statistical functional T is now a mapping from some subset into R PT T : ( ,d ) R. PT ko → In general will be a strict subset of but will contain the subset of PT P En all empirical measures Pn of the form

1 n P = δ n n xi i=1 X where δx is the Dirac measure

δ (B)= x B . x { ∈ } Example 1.72. For the mean functional T (P )= xdP (x) we have R = P : x dP (x) < . PT { | | ∞} Z Clearly . En ⊂ PT Example 1.73. For the median functional defined as in Definition (1.18) we have = . PT P

41 1.8.1 Local boundedness and breakdown We now investigate how sensitively T (P ) depends on P. A useful concept is the bias function b(T,P,ε,d) of T at the point P. It is defined by

Definition 1.74.

b(T,P,ε) := b(T,P,ε,d ) = sup T (Q) T (P ) : d (P, Q) <ε (1.46) ko {| − | ko } where T (Q) := if Q / . The functional T is called locally bounded at P ∞ ∈ PT if b(T,P,ε) < for some ε> 0. ∞

Example 1.75. For the mean functional T (P ) = Tave(P ) = xdP (x) we have R b(T ,P,ε)= ave ∞ for all P and varepsilon > 0. The mean is not locally bounded. ∈ PT

Example 1.76. For the median functional T (P )= Pmed(P ) we have

b(P ,P,ε) < med ∞ for all P and ε< 1/2. The median is locally bounded. ∈ PT We prove a stronger result than that stated in Example 1.76.

Definition 1.77. We call a functional T translation equivariant if

µ µ P T for all µ R and P T where P (B) = P (B µ) for all • Borel∈ Psets B. ∈ ∈ P −

T (P µ)= T (P )+ µ for all µ and P • ∈ PT Clearly translation equivariance is a weakening of affine equivariance for location functionals.

Theorem 1.78. Let T : R be a translation equivariant functional. Then P → −1 b(Pmed, N(0, 1),ε,dko) = Φ (1/2+ ε) and b(P , N(0, 1),ε,d ) b(T, N(0, 1),ε,d ) med ko ≤ ko for all ε, 0 ε< 1/2. ≤

42 Proof. Let Φ denote the distribution function of the standard normal distri- bution. The Kolmogorov ball with centre Φ and radius ε is given by

K(Φ,ε)= F : Φ(x) ε < F (x) < Φ(x)+ ε, x R . { − ∈ }

This is a tube with upper boundary Fu and lower boundary Fl given by F (x) = min 1, Φ(x)+ ε , F (x) = max 0, Φ(x) ε u { } l { − }

We can regard Fu as a distribution function with a point mass of ε at and F as a distribution function with a point mass of ε at + . We now move−∞ l ∞ Fu to the right so that it just remains in the ball K(Φ,ε). This is shown in Figure 6. Some calculation shows that we can translate Fu by an amount δ where

F ( x δ)= F (x )=1/2 where F ( x )=1/2. (1.47) u − 1 − l 1 u − 1

If now 0 <ε< 1/2 we have Fu(x)= Φ(x)+ ε and inserting this into (1.47) leads to Φ( x δ)+ ε =1/2 − 1 − so that δ = x Φ−1(1/2 ε)=2Φ−1(1/2+ ε) 1 − − on noting that Φ( x1)+ δ = Fu( x1)=1/2. As T is translation equivariant we have − − T (F ( δ)) T (F )=2Φ−1(1/2+ ε) u · − − u and as both F and F ( δ) lie in the Kolmogorov ball this implies u u · − T (Φ) T (F ) Φ−1(1/2+ ε) | − u | ≥ or T (Φ) T (F ( δ)) Φ−1(1/2+ ε). | − u · − | ≥ In either case we have

b(ε,T,N(0, 1),d ) Φ−1(1/2+ ε). ko ≥ It remains to show that

−1 b(ε, Tmed, N(0, 1),dko) = Φ (1/2+ ε). (1.48) We note that the median respects the stochastic ordering of distributions. That is if F G then T (F ) T (G). As F F F for all F in ≤ med ≥ med l ≤ ≤ u K(Φ,ε) we have T (F ) T (F ) T (F ) med u ≤ med ≤ med l 43 0.0 0.2 0.4 0.6 0.8 1.0

−4 −2 0 2 4

Figure 6: The figure shows the distribution function Φ of the standard normal distribution, the upper bound Fu and the lower bound Fl of the Kolmogorov ball K(Φ,ε) with ε = 0.15. It also show the upper bound Fu translated to the right so that it just remains in the K(Φ,ε). and as T (F ) = x = Φ−1(1/2 + ε) and T (F ) = x we see in med u − 1 − med u 1 combination with Tmed(Φ) = 0 that (1.48) holds.  We define the breakdown point of a functional T at the distribution P as follows. Definition 1.79. For a statistical functional T we define

∗ ε (T,P,dk0) = inf ε> 0 : sup T (Q) T (P ) = . (1.49) { Q,dko(Q,P )<ε | − | ∞}

Corresponding to Theorem 1.31 we state without proof Theorem 1.80. Let T : R be a translation equivariant functional. PT → Then for any P T ∈ P ε∗(T,P,d ) 1/2 k0 ≤ and ∗ ε (median, P, dk0)=1/2. ∗ The calculation of breakdown points ε (T,P,dk0) is some what more tech- nically complicated than the calculation of finite sample breakdown points ∗ ε (T, xn).. Indeed this was one of the motivations for introducing finite sam- ple breakdown points. We now calculate the finite sample breakdown point

44 of the location M-functional T defined by

1 n ψ(x T (x ))=0 n i − n i=1 X where ψ satisfies the usual conditions for a ψ–function with ψ( )=1. k k ∞ Consider now replacement samples xn for which T (xn) tends to infinity. For the sample points x which have not been replaced ψ(x T (xk )) tends to i i − n -1. For the k remaining points xk we have ψ(xk T (xk )) 1. From this and i i − n ≤ 1 n ψ(xk T (xk ))=0 n i − n i=1 X ∗ we deduce (n k)+ k 0 and hence k n/2. This implies ε (T, xn) − − ≥ ∗ ≥ ≥ (n + 1)/2 . By Theorem 1.31 however ε (T, xn) (n + 1)/2 and hence ε⌊∗(T, x )=⌋ (n + 1)/2 . We note that if we choose≤ the ⌊ ψ–function⌋ n ⌊ ⌋ exp(cx) 1 ψ (x)= − c exp(cx)+1 then the breakdown point does not depend on c. If we denote the functional with ψ = ψc by Tc then the bias function b(ε, Tc,P,dko) does depend on c. To show this we calculate

T ((1 ε)N(0, 1) + εδ ) c − 100 which represents a standard normal sample contaminated by 100ε% of out- liers all situated at 100. Figure 7 shows the bias curves of the mean (up- permost) and the median (lowest) together with those of the functionals Tc for c = 1, 2, 3, 4, 5, 10, 40 in decreasing order of bias. These indicate that a value of c 3 has a bias behaviour for degrees of contamination of upto 30% ≥ which is not greatly inferior to that of the median. We look at the breakdown behaviour of the joint M-functional (1.38) and (1.39). For a sample xn the we set (m, s)=(TL(Pn), TS(Pn)) so the equations read n x m ψ i − =0 (1.50) s i=1 X   n x m χ i − = 0. (1.51) s i=1 X   The situation is more complicated. The condition on ∆(P) < 1/2 for the exis- tence and uniqueness of a solution is only sufficient. This makes it difficult to

45 bias 0 1 2 3 4 5

0.0 0.1 0.2 0.3 0.4 0.5

epsilon

Figure 7: The figure shows the bias curves of the mean (uppermost) and the median (lowest) together with those of the functionals Tc for c = 1, 2, 3, 4, 5, 10, 40 in decreasing order of bias. decide whether a solution exists even if ∆(P) 1/2. Because of this we con- ≥ sider only the breakdown of the location part TL(Pn). Suppose we can alter k ∞ observations in such a manner than we obtain a sequence of values ((ml,sl))1 of solutions of (1.50) and (1.51) with lim m = . We consider two cases. l→∞ l ∞

Case 1. sl is bounded from . For the n k observations not∞ altered (x m )/s tends to . This implies − i − l l −∞ (n k)+ k 0 and hence k n/2. − − ≥ ≥ Case 2. lim sup s = . l→∞ l ∞ We can find a subsequence with lim ′ m ′ /s ′ = γ with 0 γ . For l →∞ l l ≤ ≤ ∞ the n k observations not altered (x m ′ )/s ′ tends to γ. This implies − i − l l − (n k)ψ( γ)+ k 0 and (n k)χ( γ)+ k 0. As ψ is an odd function and−χ an even− function≥ we deduce− ψ(γ−) k/(n≥ k) and χ(γ) k/(n k). As γ 0 this implies ≤ − ≥ − − ≥ χ−1( k/(n k)) γ ψ−1(k/(n k)) − − ≤ ≤ − Let ε0 be the solution of χ−1( ε /(1 ε )) = ψ−1(ε /(1 ε )). (1.52) − 0 − 0 0 − 0 46 Then k/n ε0. As ε0 < 1/2 we see that the finite sample breakdown point is at least ε≥∗ nε /n. ≥ ⌈ 0⌉ ∗ We conclude nε0 /n ε (TL, Pn) (n + 1)/2 /n. In fact one can go further and show⌈ that⌉ ≤ ≤ ⌊ ⌋

nε /n ε∗(T , P ) ( nε +1 )/n (1.53) ⌈ 0⌉ ≤ L n ≤ ⌈ 0 ⌉ with ε0 given by (1.52). We now calculate the finite sample breakdown point for the simultaneous M–estimator given by (1.38) and (1.39) with exp(cx) 1 x4 1 ψ (x)= − , χ(x)= − . c exp(cx)+1 x4 +1 A quick calculation shows

1 1+ u 1+ u 1/4 ψ−1(u)= log , χ−1(u)= c c 1 u 1 u  −   −  which, when inserted into (1.52), leads to log(1 2ε∗) − − = c (1 2ε∗)1/4 − which can be solved numerically. Figure 8 shows a plot of the breakdown point as a function of the tuning constant c. It is clear that the breakdown point for the simultaneous M–estimator does depend on the tuning constant c which is in to the simple location functional with auxiliary scale taken to be the MAD. The reason why the breakdown does not occur in the simulation results of Table 4 is that all simulations were done with symmetric distribution whereas breakdown occurs only when the outliers or contami- nation is one–sided. In the next section we discuss the case of one–sided outliers and how they can be dealt with.

1.9 Redescending M–estimators I As a first step we investigate the bias behaviour of the shorth. We consider the simple point mass contamination model

P = (1 ε)N(0, 1)+ εδ (1.54) ε,x − x which represents 100ε% contamination of the N(0, 1) distribution by a point mass of ε situated at the point x. The bias of the shorth is b(x)= shorth(P ) | ε,x | 47 Breakdown point 0.1 0.2 0.3 0.4 0.5

0 2 4 6 8 10

c

Figure 8: The figure shows the finite sample breakdown point of the simul- taneous M–estimator with ψ = ψc given by (1.34) and χ given by (1.37) as a function of the tuning parameter c. bias 0.0 0.5 1.0 1.5

0 2 4 6 8 10

location of point mass

Figure 9: The figure shows the bias of the shorth and that of the median for the contamination model (1.54) as a function of x.

48 and is shown in Figure 9 together with the bias of the median. The largest bias of the shorth exceeds that of the median but when x is large the bias of the shorth drops to zero in contrast to that of the median. The reason is that the shorth completely discounts outliers when they are sufficiently large whereas their influence on the median is bounded but is never zero. This is true for all M–estimators whose ψ–function is monotone and bounded. The influence of large outliers can only be decreased to zero if the monotonic property is dropped and limx→±∞ ψ(x)=0. Such a ψ–function is called re- descending. Figure 10 shows the Hampel three part redescending ψ–function with hinges at the points 1.31, 2.039 and 4 as in Hampel et al (1986) page 166. It is an odd function of x and for x> 0 it is defined by x, 0 x 1.31, ≤ ≤ 1.31, 1.31 < x 2.039, ψ (x)=  (1.55) hamp 1.31 0.6680265(x 2.039), 2.039 < x≤ 4,  − − ≤  0, x> 4.  We require an auxiliary scale measure which we take to be the MAD and are lead to solving the equation 1 n x m Ψ (m)= ψ i − =0 (1.56) n n MAD(x ) i=1 n X   where ψ is now no longer monotonic. Figure 11 shows a plot of the values of Ψn(m) against m for a sample of size n = 100 of which 60 were N(0, 1 and the remaining 40 were N(6.5, 0.82). It is apparent that the solution of (1.56) is not unique. There are several ways of overcoming this problem and the one we choose is to take that solution which is closest to the median of the sample as follows. Evaluate Ψn(m0) with m0 = Tmed(xn) and then move left + and right in steps say of 0.1MAD(xn) so that mk = m0 +0.1kMAD(xn) and − + mk = m0 0.1kMAD(xn),k = 1, 2,... until the first time that Ψn(mk ) or − − Ψn(mk ) has a different sign from Ψn(m0). We now have two points where Ψn has different signs and we can use the bisection method to find the zero with a prescribed accuracy. Figure 12 repeats Figure 9 for the Hampel redescending M–estimator with ψ given by (1.55). It is seen that the largest bias is greater than the largest bias of the median as was to be expected. The difference is however not substantial and the Hampel estimator has the advantage that the influence of large outliers is down–weighted to zero. If the Hampel estimator is included in the results given in Table 3 then its relative efficiencies for the N(0, 1) and the tk–distributions for k = 20, 10, 5, 4, 3, 2, 1 are 90, 91, 93, 96, 97, 93, 95and101 (1.57)

49 Hampel redescending −2 −1 0 1 2

−6 −4 −2 0 2 4 6

x

Figure 10: The figure shows the plot of the Hampel three part redescending ψ–function given by (1.55). Psi_n −0.4 −0.2 0.0 0.2 0.4 0.6

−10 −5 0 5 10 15

m

Figure 11: The figure shows the plot of Ψn of (1.56) for a sample of size n = 100 as described in the text.

50 bias 0.0 0.2 0.4 0.6 0.8 1.0

0 2 4 6 8 10

location of point mass

Figure 12: The figure shows the bias of the Hampel redescending M– estimator with ψ–function given by (1.55) and that of the median for the contamination model (1.54) as a function of x. respectively. In other words the overall performance of the Hampel estimator is better than that of all the estimators included in Table 3. A redescend- ing estimator will be our estimator of choice when considering the one-way analysis of variance.

1.10 Differentiability and asymptotic normality

If the data Xn are normally distributed then the mean Xn is also normally distributed. If the data are not normally distributed but have a finite second then it follows from the that Xn is asymp- totically normally distributed. If the distribution of the Xi has long tails then the normal approximation for small values of n may be poor. If the Xi do not have a finite second moment then Xn may not even be asymptotically nor- mally distributed. This is the case for the Cauchy distribution where Xn has the same distribution as X1. One advantage of M–estimators with smooth ψ– and χ–functions is that they are asymptotically normally distrubuted for almost any distribution P of the sample Xn. Indeed the only restriction is ∆(P ) < 1/2. The upper panel of Figure 13 shows the distribution of Xn for n = 10 for N(0, 1) data. The figures is based on 50000 simulations and

51 shows a of the different values of Xn together with the density of the normal distribution. The lower panel shows the same for the redescend- ing M–estimator to be defined in Section 1.11 below.The simulated standard deviations of the two estimates were 1.00 and 1.06 and respectively. Figure 14 shows the results for Cauchy data where the non-normality of the mean is apparent. The standard deviations over the 50000 simulations were 427.51 and 2.02 respectively. If the distribution of the sample is skewed then the normal approximation deteriorates. Figure 15 shows the results correspond- ing to Figure 13 for data generated according to the Gamma distribution of Figure 16 which has 3 and 1. The me- dian is a location M–estimator but its ψ–function is not even continuous. Depending on the distribution P of the data the median may converge in a finite number of steps, it may be asymptotically normal or it may not even be consistent. The same applies to the MAD as a scale–estimator. Even if we just use the MAD as an auxiliary scale functional it may still influence the asymptotic behaviour of the location estimate. We show an example with data generated according to P =0.5U([0, 0.8])+0.5U([1, 1.3]). (1.58) The MAD is not continuous at this P. Figure 17 shows the result of 5000 simulations for the M–estimator defined by (1.33) with ψ given by (1.42) with c = 3. The sample size was n = 100 and the data were generated according to P of (1.58). The deviation from normality is clear. Figure 18 shows exactly the same simulation but using simultaneous estimator defined by (1.38) and (1.39) with ψ and χ given respectively by (1.42) with c = 3 and (1.43). Here the normal approximation is clear. but its χ–function is not even continuous. Consider a statistical functional T defined on a family ′ of probability PT distributions on the Borel sets of R. We suppose that is convex, that PT is, P1 and P2 in T imply that αP1 + (1 α)P2 also belongs to T for all α, 0 α 1. WeP further suppose that −contains all empiricalP measures. ≤ ≤ PT We wish to quantify how much influence an observation at the point x has on the value of the functional T . To motivate the definition consider an empirical distribution Pn and then add an additional observation at the points x. The new sample has sample size n + 1 and its empirical distribution is P = (1 1/(n + 1))P + δ /(n + 1). n+1 − n x The difference in the value of T is given by T (Pn+1) T (Pn) which will in general tend to 0 as n . We therefore consider the− ‘derivative’. More → ∞ formally we define the influence function of a statistical functional T at the distribution P as follows.

52 histogram/density 0.0 0.4 0.8 1.2

−1.0 −0.5 0.0 0.5 1.0 1.5

x histogram/density 0.0 0.4 0.8 1.2

−1.0 −0.5 0.0 0.5 1.0 1.5

x

Figure 13: The upper panel shows the histogram of 50000 values of the mean taken from a N(0, 1) sample of size n = 10 together with the density of the normal distribution with the same median and . The lower panel shows the results for the redescending M–estimator defined in Section 1.11).

53 histogram/density 0.00 0.10 0.20 0.30 −4 −2 0 2 4

x histogram/density 0.0 0.2 0.4 0.6 0.8

−4 −2 0 2 4

x

Figure 14: As for Figure 13 but this time with Cauchy distributed data. histogram/density 0.0 0.2 0.4 0.6

2 3 4 5 6

x histogram/density 0.0 0.2 0.4 0.6

2 3 4 5 6

x

Figure 15: As for Figure 13 but this time with Gamma distribution of Figure 16.

54 density 0.00 0.05 0.10 0.15 0.20 0.25

0 2 4 6 8 10 12

x

Figure 16: The density of the gamma distribution with shape parameter 3 and scale parameter 1. histogram/density 0 1 2 3 4 5

0.7 0.8 0.9 1.0 1.1

x

Figure 17: The results of 5000 simulations of the M–estimator defined by (1.33) with c = 3 and data of size n = 100 distributed according to (1.58). The density of the normal distribution with the same mean and variance is superimposed.

55 histogram/density 0 2 4 6 8

0.7 0.8 0.9 1.0 1.1

x

Figure 18: The results of 5000 simulations of the M–estimator defined by the simultaneous M–estimator defined by (1.38) and (1.39) with ψ and χ given respectively by (1.42) with c = 3 and (1.43) and data of size n = 100 distributed according to (1.58). The density of the normal distribution with the same mean and variance is superimposed.

Definition 1.81. If for P ∈ PT T ((1 ε)P + εδ ) T (P ) I(x,T,P ) = lim − x − . (1.59) ε→0 ε exists for all x R then I(x, T, P) is called the influence function of T at the ∈ measure P. The influence function was first introduced by Hampel (1968, 1974) (see also Hampel, Ronchetti, Rousseeuw and Stahel (1986)). Example 1.82. We set T = T and = P : u dP (u) < . We have ave PT { | | ∞} R T ((1 ε)P + εδ )) = ud((1 ε)P + εδ )(u) ave − x − x Z = (1 ε) udP (u)+ ε u dδ (u) − x Z Z = (1 ε)T (P )+ εx − ave and hence I(x, T ,P )= x T (P ). ave − ave

56 Example 1.83. We take T = T and to be the set of all probability med PT measures with a density fP which is continuous and strictly positive at the median of P. On writing Tmed(P ) = m we have for x > m that the median of (1 ε)P + εδx must move an amount γ to the right where γfP (m) ε/2. This− gives ≈

T ((1 ε)P + εδ )) T (P ) ε/(2f (m)) med − x − med ≈ P and hence I(x, Tmed,P )=1/fP (m) if x > m. A corresponding result holds for x < m and we finally obtain

sgn(Tmed(P ) x) I(x, Tmed,P )= − . (1.60) 2fP (Tmed(P )) The influence function of (1.59) is a so called Gˆateaux derivative. It has an intuitive interpretation and aids the understanding of the behaviour of statistical functionals but is too weak to be of use in itself. A stronger form of differentiability is the so called Fr´echet differentiability which is defined as follows. ′ Definition 1.84. Let be an open subset of ( ,dko). A statistical functional ′ P P ′ T : ( ,dko) (R, ) is Fr´echet differentiable at P if there exists a boundedP measurable→ || function u : R R such that ∈ P → T (Q) T (P ) u(x) d(Q P )(x) = o(d (Q, P )). (1.61) − − − ko Z

This means that for every ε> 0 there exists a γ > 0 such that

T (Q) T (P ) u(x) d(Q P )(x) εd (Q, P ) − − − ≤ ko Z for all Q in B (P,γ)= Q : d (Q, P ) <γ . Consider now η > 0 and set { ko } Q = Q = (1 η)P + ηδ . η − x Then d (Q ,P ) η and if η γ we have ko η ≤ ≤ T ((1 η)P + ηδ ) T (P ) ηu(x)+ η udP εd (Q ,P ) εη. − x − − ≤ ko η ≤ Z

On letting η tend to zero and remembering that ε can be chosen to be arbitrarily small we see that

lim T ((1 η)P + ηδx) T (P ) /η = u(x) udP. η→0 − − − Z  57 This is however the definition of the influence function of (1.59). It follows that if T is differentiable at P in the sense of (1.61) then the influence function of T exists at P and

T (Q) T (P ) I(x,T,P ) d(Q P )(x) = o(d (Q, P )). (1.62) − − − ko Z

Although the mean and median functionals have an influence function at certain distributions P they are nowhere Fr´echet differentiable. We now show that M–functionals with a smooth and bounded ψ–function are Fr´echet differentiable. Example 1.85. Let ψ be bounded, strictly monotone increasing with a con- tinuous and bounded first and second derivatives and with ψ(1) everywhere positive. Let T = T (P ) be the solution of

ψ(x T (P )) dP (x)=0. − Z Theorem 1.46 guarantees that T (P ) is well–defined for all P . We first ∈ P show that T is continuous at P. If not we can find a sequence Qn of distribu- tions with limn→∞ dko(Qn,P ) = 0 but T (Qn) T (P ) > δ for some δ > 0. As | − | Ψ(m) := ψ(x m) dP (x) − Z is a strictly decreasing and differentiable function of m and Ψ(T (P )) = 0 this implies that for some γ > 0

γ < Ψ(T (Q )) = ψ(x T (Q )) dP (x) | n | − n Z

= ψ(x T (Q )) d(P Q )(x) − n − n Z

= ψ(1)(x T (Q ))(F (x) F ( x)) dx − n − n Z (1) ψ ∞dko(Qn,P ) 0 ≤ k k → where Fn and F are the distribution functions of Qn and P respectively. This contradiction shows that T is continues at every distribution P. Let now Q

58 be such that d (Q, P ) is small. then T (Q) T (P ) is small and we have ko | − | 0 = ψ(x T (Q)) dQ(x) ψ(x T (P )) dP (x) − − − Z Z = ψ(x T (P )+(T (Q) T (P ))) dQ(x) ψ(x T (P )) dP (x) − − − − Z Z = ψ(x T (P ))+(T (Q) T (P ))ψ(1)(x T (P )) dQ(x) − − − Z ψ(x T (P )) dP (x)+o( T (Q) T (P ) ) − − | − | Z = ψ(x T (P )) d(Q P )(x) − − Z +(T (Q) T (P )) ψ(1)(x T (P )) dQ(x)+o( T (Q) T (P ) ). − − | − | Z Arguing as above we have

ψ(1)(x T (P )) dQ(x) ψ(1)(x T (P )) dP (x) ψ(2) d (Q, P ) − − − ≤k k∞ ko Z Z

and hence

(T (Q) T (P ))(1+o(1)) ψ(1)(x T (P )) dP (x)= ψ(x T (P )) d(Q P )(x) − − − − Z Z This implies as above

(T (Q) T (P ))(1 + o(1)) ψ(1)(x T (P )) dP (x) ψ81) d (Q, P ) | − − |≤k k∞ ko Z and hence as ψ(1)(x) > 0 we finally conclude

ψ(x T (P )) d(Q P )(x) T (Q) T (P ) − − = o(d (Q, P )). − − ψ(1)(x T (P )) dP (x) ko R −

This implies that T is Fr´echetR differentiable at each distribution P with influence function ψ(x T (P )) I(x,T,P )= − . (1.63) ψ(1)(u T (P )) dP (u) − Example 1.86. A correspondingR result holds for the simultaneous M–functional of (1.40) and (1.41). If ψ, χ and P satisfy the conditions of Theorem 1.63 and

59 in addition ψ and χ have bounded and continuous first and second deriva- tives then TL and tS are Fr´echet differentiable at each distribution P with ∆(P ) < 1/2. The influence functions satisfy the following equations

u T (P ) ψ − L T (P ) T (P ) S  S  u T (P ) = I(x, T ,P ) ψ(1) − L dP (u)+ L T (P ) Z  S  u T (P ) u T (P ) I(x, T ,P ) − L ψ(1) − L dP (u) (1.64) S T (P ) T (P ) Z  S   S  u T (P ) χ − L T (P ) T (P ) S  S  u T (P ) = I(x, T ,P ) χ(1) − L dP (u)+ L T (P ) Z  S  u T (P ) u T (P ) I(x, T ,P ) − L χ(1) − L dP (u) (1.65) S T (P ) T (P ) Z  S   S  which can be solved for I(x, TL,P ) and I(x, TS,P ). If P is symmetric then the expressions simply. In this case

u T (P ) u T (P ) − L ψ(1) − L dP (u)=0 T (P ) T (P ) Z  S   S  and u T (P ) χ(1) − L dP (u)=0 T (P ) Z  S  so that ψ x−TL(P ) TS (P ) I(x, TL,P )= TS(P ) (1.66) ψ(1) u−TL(P ) dP (u) TS(P ) and R  

χ u−TL(P ) TS(P ) I(x, TS,P )= TS(P ) . (1.67) u−TL(P ) χ(1) u−TL(P ) dP (u) TS(P ) TS(P ) R    

Exercise 1.87. Solve (1.64) and (1.65) for I(x, TL,P ) and I(x, TS,P ).

60 Fr´echet differentiability can be used to prove central limit theorems as ∞ ∞ follows. Let (Xn)1 = (Xn(P ))1 be a sequence of i.i.d. random variables with distribution P. We denote the empirical distribution by Pn(P ) and note the following inequality (see Massart (1990))

P(√nd (P (P ),P ) λ) 2 exp( 2λ2) ko n ≥ ≤ − which holds uniformly for all P. In particular this implies that dko(Pn(P ),P )= O(1/√n). If we use this in (1.62) with Q = Pn(P ) we obtain after multiplying by √n

1 n √n(T (P (P )) T (P )) = (I(X ,T,P ) E(I(X ,T,P )))+o(1) (1.68) n − √n i − i i=1 X It follows form (1.61) that u is defined only up to an additive constant. Without loss of generality we can therefore assume that E(I(X(P ),T,P )) = 0. The influence function I(x,T,P ) is bounded by assumption and it follows from the Central Limit Theorem that

1 n I(X ,T,P ) N 0, I(x,T,P )2 dP (x) . √n i ⇒ i=1 X  Z  As the o(1) term in (1.68) tends to zero in probability we obtain

√n(T (P (P )) T (P )) N 0, I(x,T,P )2 dP (x) . (1.69) n − ⇒  Z 

Example 1.88. Let TL(Pn(P )) and TS(Pn(P )) be the solutions of (1.38) and (1.39) where the data Xi(P ), 1=1,... are i.i.d. with common distribution P with ∆(P ) < 1/2. Then

√n (T (P (P )) T (P )) N 0, I(x, T ,P )2 dP (x) (1.70) L n − L ⇒ L  Z  where I(x, TL,P ) is given by (1.64) and (1.65). This implies that TL(Pn(P )) is asymptotically for all P except for those with a large atom. This contrasts with the mean which requires a finite second moment. Example 1.89. It is not necessary that a functional T be Fr´echet dif- ferentiable for (1.69) to hold. The mean functional Tave is not Fr´echet differentiable but if P has a finite second moment then (1.69) holds as 2 I(x, Tave,P ) = x Tave(P ) so that I(x,T,P ) dP (x) is the variance of the observations. − R 61 Example 1.90. The median is also a functional which is nowhere Fr´echet differentiable but for which (1.69) holds for all P with a density continuous and strictly positive at Tmed(P ). We then have √n (T (P (P )) T (P )) N 0, 1/(4f (T (P ))2) med n − med ⇒ P med which agrees with (1.60) and (1.69). In the particular case that P = N(0, 1) we have √n T (P (N(0, 1))) N(0,π/2). med n ⇒ This implies that its efficiency relative to the mean is 100 2/π% = 63.7% which is consistent with the first line of Table 3. ·

1.11 Redescending M–estimators II The Hampel redescending M–estimator is not Fr´echet differentiable for two reasons. Firstly the ψ–function of (1.55) is not differentiable and secondly it uses the MAD as an auxiliary measure of scale. The MAD is not in general continuous so that the Hampel redescending M–estimator will not in general be asymptotically normally distributed. We can over come both problems as follows. Firstly we define a redescending ψ–function which is similar to (1.55) but which has a continuous and second derivative. Given 0 < c1 < c2 < c3 we construct a redescending ψ–function as follows. Firstly we note that

p (x) = 15(1 (x/c )2)2/(8c ) (1.71) 1 − 1 1 (1) defines a polynomial whose first derivative p1 at x = c1 is zero. If we define x p (x)= p (u) du = 15(x/c 2x3/(3c3)+ x5/(5c5)) (1.72) 2 1 1 − 1 1 Z0 then p0 is an odd function of x with

(1) (2) p2(0) = 0, p2(c1)=1, p2 (c1)= p2 (c1)=0. We define a third polynomial by

p (x) = 15(x2 c2)2 (c2 x2)2 (1.73) 3 − 2 ∗ 3 − which satisfies

(1) (1) p3(c2)= p3(c3)= p3 (c2)= p3 (c3)=0. We put x p (x)=1 c p (u) du (1.74) 4 − 3 Zc2 62 psi −1.0 −0.5 0.0 0.5 1.0

−10 −5 0 5 10

x

Figure 19: The figure shows the smooth redescending ψ–function defined by (1.75) with c1 =3, c2 =6, c3 =9. where c3 c =1 p3(u) du Zc2 from which it follows 

(1) (1) (2) (2) p4(c2)=1, p4(c3)=0, p4 (c2)= p4 (c3)= p4 (c2)= p4 (c3)=0.

The redescending ψ–function is now defined by

p (x), x c , 2 | | ≤ 1 1, c < x c , ψ(x)=  1 2 (1.75) p (x), c < |x| ≤ c ,  4 2 | | ≤ 3  0, x >c . | | 3  The default values we use arec1 = 3, c2 = 6, c3 = 9. Figure 19 shows the resulting function. We require a Fr´echet differentiable auxiliary measure of scale. The one we choose is scale measure of the joint location and scale functionals of (1.40) and (1.41) with ψ and χ given by (1.42) and (1.43) with c =5. Corresponding to (1.57) we have the following relative efficiencies

92, 94, 97, 100, 101, 97, 94 and 81. (1.76)

63 The asymptotic efficiencies can be calculated exactly using (1.66). For the P =Φ= N(0, 1) distribution we have

√n T (P (Φ)) N 0, I(x, T , Φ)2 dΦ(x) (1.77) L n ⇒ L  Z  with ψ x TS (Φ) I(x, TL, Φ) = TS(Φ) . (1.78) ψ(1)  u dΦ(u) TS (Φ)   We have TS(Φ) = 0.647155 and evaluatingR the integral gives

x 2 x 2 ψ ϕ(x) dx ψ(1) ϕ(x) dx =2.56. (1.79) TS(Φ) TS(Φ) Z   . Z    On noting that 0.647155√2.56=1.07436 it follows that

√n T (P (Φ)) N(0, 1.07436) (1.80) L n ⇒ which implies an asymptotic efficiency of 1/1.07436 = 0.931 which is consis- tent with (1.76).

1.12 Confidence intervals Point estimates such as 2.016 for the copper data of Example 1.1 are in themselves of little use unless they are accompanied by some indication of the possible range of values. This is accomplished by giving a confidence interval. The standard confidence interval for data such as that as the copper data is the one derives from the t–distribution. A α–confidence interval is given by

mean(x ) qt((1 + α)/2, n 1)sd(x )/√n, n − − n  mean(xn)+qt((1+ α)/2, n 1)sd(xn)/√n − i If we set α = 0.95 then the confidence interval for the copper content of the water sample is [1.970, 2.062]. The confidence interval based on the t- distribution is however susceptible to outliers and deviations from normality. If we replace the last observation 2.13 by 21.3 the confidence interval is now [1.257, 4.1950]. A simple but effective way of avoiding this is to use the Ham- pel 5.2MAD rule. All observation which are further than 5.2MAD (3.5MAD with the R–default version of the MAD) from the median are eliminated and

64 bias 0.0 0.2 0.4 0.6 0.8 1.0

0 2 4 6 8 10

location of point mass

Figure 20: The figure shows the bias of the smooth redescending M–estimator with ψ–function given by (1.75) with c1 = 3, c2 =6, c3 = 9 and that of the median for the contamination model (1.54) as a function of x. the confidence interval is calculated using the remaining observations. If we do this for the sample with 2.13 replaced by 21.3 the confidence interval is [1.965, 2.058] which agrees well with confidence interval for the initial sample. Eliminating larger observations and then ignoring this will cause the coverage probability to be somewhat lower than the nominal one of 0.95. This can be taken into account by increasing the factor qt((1 + α)/2, n 1)sd(x )/√n. − n Simulations for n = 27 show that it must be multiplied by 1.05. table 5 shows the results for the standard confidence interval based on the t–distribution for a sample size n = 27. The first lines show the coverage probability for the distributions tk, k = 1,..., 5 and the N(0, 1) distribution. the second lines shows the mean lengths of the confidence intervals. Lines 3 and 4 show the corresponding results for the Hampel 5.2 method calibrated for the normal distribution, n = 27 and a coverage probability of 0.95. Lines 5 and 6 show the corresponding results for the joint M–estimator describes in the last sec- tion, again calibrated for the normal distribution, n = 27 and a coverage probability of 0.95. Having to perform the simulations to calibrate the con- fidence interval for every value of n can be avoided by using the asymptotic behaviour which can be calculated and then performing simulations to decide from which value of n onwards the asymptotic result is sufficiently accurate.

65 t1 t2 t3 t4 t5 N(0, 1) t-conf cov. prob. 0.980 0.962 0.953 0.955 0.951 0.946 mean length 18.3 1.90 1.27 1.08 0.99 0.78 Hamp5.2 cov. prob. 0.942 0.938 0.942 0.941 0.946 0.946 mean length 1.50 1.14 1.03 0.98 0.94 0.81 M-est. cov.prob. 0.959 0.950 0.948 0.952 0.948 0.945 mean length 1.38 1.07 0.99 0.95 0.93 0.85

Table 5: The first line shows the coverage probabilities for the standard confi- dence interval based on the t–distribution with a nominal covering probability of 0.95. The second line shows the mean lengths of the intervals. Lines 3 and 4 show the corresponding results for the Hampel 5.2 method calibrated for the normal distribution, n = 27 and a coverage probability of 0.95. Lines 5 and 6 show the corresponding results for the joint M–estimator describes in the last section, again calibrated for the normal distribution, n = 27 and a coverage probability of 0.95.

This was done in Section 1.6.3 for developing a procedure for identifying out- liers. We repeat is here for confidence intervals. We know from (1.77), (1.78) and (1.79) that a 0.95 confidence interval for a sample of size n will be of the form

T (P ) fac(n)1.96 1.6 T (P )/√n, L n − · S n  TL(Pn) + fac(n)1.96 1.6 TS(Pn)/√n, · i −1 where limn→∞ fac(n)=1. The factor 1.96 is Φ (0.975) and the factor 1.6 is simpll √2.56 of (1.79). As we expect the behaviour to be somewhat worse than that based on the t–distribution we replace 1.96 by qt(0.975, n 1). − T (P ) fac(n) qt(0.975, n 1) 1.6 T (P )/√n, L n − · − · S n  T (P ) + fac(n) qt(0.975, n 1) 1.6 T (P )/√n, L n · − · S n i With this alteration Figure 21 shows the values of log(fac(n) 1) plotted against those of log(n). We approximate the values by a line of− the form

log(fac(n) 1) = a + b log n + c/ log n − and determine the values of a, b and c using the fit provided by R.

66 > b<-lm(log(tmp$res-1)~I(log(tmpn))+I(1/log(tmpn))) > summary(b)

Call: lm(formula = log(tmp$res - 1) ~ I(log(tmpn)) + I(1/log(tmpn)))

Residuals: Min 1Q Median 3Q Max -0.143204 -0.011041 0.002116 0.025020 0.083480

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.60421 0.11072 -5.457 1.01e-05 *** I(log(tmpn)) -0.84637 0.02148 -39.405 < 2e-16 *** I(1/log(tmpn)) 3.14815 0.12931 24.345 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual : 0.04544 on 26 degrees of freedom Multiple R-Squared: 0.9988, Adjusted R-squared: 0.9987 F-statistic: 1.115e+04 on 2 and 26 DF, p-value: < 2.2e-16 The Figure 21 shows the resulting fit. The approximation is given by

fac(n, 0.95) = 1+exp( 0.601 0.846 log n +3.148/ log n). (1.81) − − The corresponding results for 0.9– and 0.99–confidence intervals are respec- tively

fac(n, 0.90) = 1+exp( 0.946 0.795 log n +3.023/ log n), (1.82) − − fac(n, 0.99) = 1+exp( 0.705 0.785 log n +4.179/ log n). (1.83) − − The resulting 0.95–confidence interval for the copper data is [1.984, 2.076] compared to that based on the t–distribution which is [1.970, 2.062]. If we replace the last observation by 21.3 the confidence interval becomes [1.982, 2.078] compared with [1.257, 4.1950]. Remark 1.91. From (1.70) and (1.66) we know that for an i.i.d. sample Xn with common distribution P

√n (T (P (P )) T (P )) N 0, I(x, T ,P )2 dP (x) L n − L ⇒ L  Z 

67 log(fac(n)−1) −3 −2 −1 0 1

1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5

log(n)

Figure 21: The figure shows the estimated values of log(fac(n) 1) obtained by simulations plotted against the values of log(n). The lines shows− the ap- proximating function of (1.81). where ψ x−TL(P ) TS (P ) I(x, TL,P )= TS(P ) . ψ(1) u−TL(P ) dP (u) TS (P )   As TS(Pn(P )) converges to TS(P ) itR follows that

(T (P (P )) T (P )) √n L n − L N 0, Σ(P )2 TS(Pn(P )) ⇒  where 2 u−TL(P ) ψ T (P ) dP (u) Σ(P )2 = S   2 Rψ(1) u−TL(P ) dP (u) TS (P ) which of course depends on theR underlying P. One possibility is to replace Σ(P ) by Σ(Pn(P )) which will then automatically adapt to the underlying P as dko(Pn(P ),P ) 0. Fortunately the dependence of Σ(P ) on P is not very strong as long as →P is close to the normal distribution. For the redescending ψ–function the values of Σ(P ) for the uniform, the Gaussian, the Laplace

68 and the Cauchy distribution are respectively

2.071, 2.563, 2.972, 2.766 and their square roots deviate by even less

1.439, 1.601, 1.724, 1.663.

In consequence we simply use the value which derives from the Gaussian distribution.

69 70 2 The One-Way Table

2.1 The two sample t-test The following example is Exercise 5 on page 65 of Miller (1986). Sodium citrate is added to blood to prevent coagulation (blood transfusions etc.) By adding the protein thromboplastin the blood can be made to coagulate again. In standardized tests the time for the blood with added sodium ni- trate to coagulate after the addition of thromboplastin is about 38 seconds. In a study to investigate the effectiveness of streptokinase to dissolve blood clots the the partial thromboplastin times (PTT) for patients whose clot dis- solved (R) and those whose clot did not dissolve (NR) were measures with the following results.

R: 41 86 90 74 146 57 62 78 55 R: 105 46 94 26 101 72 119 88 NR: 34 23 36 25 35 23 87 48

Table 6: PTT times for patients whose clot dissolved (R,n1 = 16) and those whose clot did not dissolve (NR,n2 = 8) 20 40 60 80 100 120 140

NR R

Figure 22: The boxplot of the data of Table 6.

We may use a t-test to test the hypothesis that there is no significant difference between the means of the two sample. Assuming independent

71 normal samples and equal the test statistic is

(mean(x ) mean(x )) T = 1n1 − 2n2 (2.1) x 2 x 2 (n1−1)sd( 1n1 ) +(n2−1)sd( 2n2 ) 1 1 n1+n2−2 n1 + n2 r    which follows the t–distribution with n + n 2=16+8 2 = 24 degrees of 1 2 − − freedom. With x1n1 denoting the R–sample the value of the test statistics is 3.612347 which for a two–sided test on 24 degrees of freedom gives a p–value of 2pt(3.612347, 24)= 0.0014 which is highly significant. Of course one might wish to test assumption of the equality of the variances. If we use the F –test

2 2 F = sd(x1n1 ) /sd(x2n2 ) (2.2) which under the assumption of independent normal samples and equality of variances follows the F –distribution with (n 1, n 1)–degrees of freedom. 1 − 2 − For the data for Table 6 the value of the test statistic is 0.4356. The 0.025– and 0.975–quantiles of the F –distribution are 0.2453 and 3.1248 respectively so that the hypothesis of equality of variances is not rejected. A glance at the boxplot of Figure 22 shows however a large observation 87. If we use the Hampel 5.2 rule it would just not be eliminated as an outlier but remove it in any case to see its effect on the result. The new sample has n2 =7 and a standard deviation of 8.484 against 14.826 for the original sample. If we test for equality of variances the value of the test statistic (2.2) is now 0.07948 on (7, 16) degrees of freedom. The corresponding p–value is

2pf(0.07948, 7, 16)= 0.00235 so that now the assumption of equality of variances is not consistent with the data. There is no general agreement about how to proceed when the variances are different. One popular procedure is to use

mean(x1n1 ) mean(x2n2 ) t = 2− 2 (2.3) sd(x1n1 ) sd(x2n2 ) n1 + n2 q and treat it as a t–statistic with ν degrees of freedom where

min(n 1, n 1) ν n + n 2 1 − 2 − ≤ ≤ 1 2 −

72 with one suggestion being

2 2 2 sd(x1n1 ) sd(x2n2 ) n1 + n2 ν = . (2.4) x 2 2 x 2 2 1 sd( 1n1 ) 1 sd( 2n2 ) n1−1 n1 + n2−1 n2     After eliminating the observation 88 the value of the statistic T of (2.3) is 5.886 and the value of ν is 20.52. This gives a p–value of 4.198707 10−6. The average time for the blood to coagulate is about µ = 38 seconds.· We may wish to test the hypothesis that the NR–patients are no different from the average. The test statistic is √n(mean(x ) µ) t = n − . (2.5) sd(xn) The values for the NR–data are 0.0671 and -1.875 with and without the ob- servation 88 respectively and the degrees of freedom are 8 and 7 respectively. In neither case is the result significant so that we can conclude that the NR–data at least in this respect do not differ from average data. There are several aspects of this analysis which are worth mentioning. Firstly we have now carried out six tests which although not excessive and in this particu- lar case not problematic does in general carry the risk of finding a spurious effect. The more tests one conducts the more likely it is that a significant result will be found just by chance. Somehow we must explicitly allow for multiple testing. Secondly when dealing with more than one sample we must allow for different variances and also to some extent different distributions. Finally we must take outliers into account. These are the main topics of the following sections.

2.2 The k sample problem We are given k samples of possibly different sizes and which were obtained under different conditions, for example, different forms of treatment for some disease or for purifying water. We are interested in differences in the location values of the samples and, in particular, in quantifying the differences and deciding whether or not they can be explained by chance. In the past k was a relatively small number but with the advent of gene expression data k may now be of the order of tens of thousands. The problem is that it becomes more difficult to recognize a signal the larger the number of samples is. Thus is shown by Figure 23. We consider k N(0, 1)–samples of size n = 5 but add a signal of µ = 1 to the values of the first sample. For each sample we calculate the absolute value of the t–statistic Ti = √nX¯i,5/sd(Xi,5).

73 20 40 60 80

10 20 30 40 50 8 10 12 14

10 20 30 40 50

Figure 23: The upper panel shows the percentage of simulations for which the first sample had the highest t-statistics as a function of the number of samples k. The lower panel shows the proportion of samples with a higher absolute t–value than first sample expressed as a percentage.

The upper panel of Figure 23 shows the percentage of simulations for which the first sample had the highest absolute t-statistic plotted against the number of sample. For k = 50 the first t-statistic has the largest abso- lute value in only 20% of the cases. The lower panel shows the proportion of samples with a higher absolute t–value than first sample expressed as a percentage. For k = 50 it shows that on average about 7 of the samples have a t–statistic whose absolute value is larger than that for the first sample. Clearly if k = 50000 the signal will have to be very large before it can be distinguished from the noise. Clearly even for small k the number of samples has to be taken into account when analysing the data.

2.3 The classical method The Analysis of Variance goes back to R. A. Fisher who developed the idea whilst working at the Rothamsted Experimental Station in the 1920s. http://www.rothamsted.bbsrc.ac.uk/

His concern was yields of wheat and their dependence on factors such as the field, the fertilizer and the variety of wheat. We consider the simplest

74 situation where the same wheat with the same fertilizer was grown on differ- ent fields. The question to be answered is to whether the different location, perhaps as a proxy for different soils or exposure to the sun had a significant effect on the yield. We have k samples corresponding to the k fields and we denote them by xi,ni = (xi,1,...,xi,ni ), i = 1,...,k. One can imagine the ith field being split up into ni small plots and the yield in each plot being measured. Fisher’s idea was to decompose the sums of squares of the yields into three components. The first represents the average yield over all fields, the second the differences in the yields between the fields and the third the k differences in yields between the plots. On writing n = i=1 ni Table 7 is

Source Degrees of freedom Sums ofP squares

2 Average yield 1 nxn

Between fields k 1 k n (x x )2 − i=1 i i,ni − n Pk n Within fields n k i (x x )2 − i=1 j=1 i,j − i,ni P P

k ni 2 Total n i=1 j=1 xi,j P P Table 7: The classical one-way table just a mathematical identity which may be proved as follows.

k ni k ni x2 = (x x )2 + nx2 i,j i,j − n n i=1 j=1 i=1 j=1 X X X X k ni = (x x +(x x ))2 + nx2 i,j − i,ni i,ni − n n i=1 j=1 X X k ni k ni = (x x )2 + (x x )2 + nx2 . i,j − i,ni i,ni − n n i=1 j=1 i=1 j=1 X X X X Even though this is an identity it can give insight into the behaviour of the wheat on different fields. Fisher however went further and proposed modelling such data a follows

Xi,j = µi + σǫi,j, , j =1,...,ni, i =1,...,k (2.6)

75 where the ǫi,j are standard Gaussian . Under these assumptions it can be shown that the random variables

k k ni 2 nX , n (X X )2, (X X )2 n i i,ni − n i,j − i,ni i=1 i=1 j=1 X X X are independently distributed according to non-central χ2–distributions. Sup- pose now that the means µi are all equal. In this case

k k ni n (X X )2 and (X X )2 i i,ni − n i,j − i,ni i=1 i=1 j=1 X X X obey a central χ2–squared distribution with k 1 and n k degrees of freedom − − respectively. It follows that

1 k 2 k−1 i=1 ni(Xi,ni Xn) Fk−1,n−k = − 1 k ni (X X )2 n−k Pi=1 j=1 i,j − i,ni obeys an F –distribution with k P 1 andP n k degrees of freedom. This − − enables us to reject the null hypothesis of equality of means if

1 k 2 ni(Xi,n Xn) k−1 i=1 i − qf (0.95,k 1, n k). 1 k ni (X X )2 ≥ − − n−k Pi=1 j=1 i,j − i,ni Example 2.1.PWe nowP apply the procedure to three data sets to be found on page 114 om Miller (1986). They are as follows:

x1,23 = (5.37, 5.80, 4.70, 5.70, 3.40, 8.60, 7.48, 5.77, 7.15, 6.49, 4.09, 5.94, 6.38, 9.24, 5.66, 4.53, 6.51, 7.00, 6.20, 7.04, 4.82, 6.73, 5.26),

x2,17 = (3.96, 3.04, 5.28, 3.40, 4.10, 3.61, 6.16, 3.22, , 7.48, 3.87, 4.27, 4.05, 2.40, 5.81, 4.29, 2.77, 4.40),

x3,28 = (5.37, 10.60, 5.02, 14.30, 9.90, 4.27, 5.75, 5.03, 5.74, 7.85, 6.82, 7.90, 8.36, 5.72, 6.00, 4.75, 5.83, 7.30, 7.52, 5.32, 6.05, 5.68, 7.57, 5.68, 8.91, 5.39, 4.40, 7.13) (2.7)

The data give plasma bradykininogen levels measured in control patients x1,23, in patients where Hodgkin’s disease is active x2,17 and in patients where the disease is inactive x3,28.

76 2 4 6 8 10 12 14

1 2 3

Figure 24: Boxplot of measurements of plasma bradykininogen levels in con- trol patients (left), patients with active Hodgkin’s disease (centre) and pa- tients with inactive Hodgkin’s disease (right).

The means and standard deviations of the three samples are respectively 6.081, 4.242, 6.791 and 1.362, 1.302, 2.176. The overall mean is 5.914. From this we see that

k k ni n (x x )2 = 69.73 and (x x )2 = 195.88. i i,ni − n i,j − i,ni i=1 i=1 j=1 X X X and hence

1 k 2 k−1 i=1 ni(xi,ni xn) Fk−1,n−k = − = 34.87/3.01 = 11.57. 1 k ni (X x )2 n−k Pi=1 j=1 i,j − i,ni In this situation we areP onlyP concerned with large values of the test statistic so we use a one–sided test. The 0.95 quantiles of the F –distribution with 2 and 65 degrees is 3.138142. Clearly the attained value lies outside of this range so that the null hypothesis of equality of means is rejected. The F –distribution is an L2–statistic, that is, it is based on quadratic differences which result from optimality considerations deriving from the pro- posed model (2.6). The use of L2–methods was criticized in Chapter 1 and these criticisms remain valid here, namely that if the data have a somewhat heavier tail than the normal distribution the power of the test to detect dif- ferences in the means can decline sharply. Even if the data are normally

77 distributed it is not clear what to do if the variance differs from sample to sample. Finally even if we stick to the model (2.6) the test is not satisfac- tory. If the null hypothesis is accepted then there is little to be said at this level, but if the null hypothesis is rejected then all we can conclude is that the means are different without being able to specify which is different from what. We have just seen this with the data on Hodgkin’s disease. In many cases we are explicitly interested on quantifying the differences, say between the control patients and those with active Hodgkin’s disease, and the F – test does not accomplish this. This leads naturally to the idea of multiple comparisons.

2.4 Multiple comparisons 2.4.1 Tukey: Honest Significant Difference

We suppose initially that the design is balanced, that is n1 = . . . = nk = m. The direct comparison of the means µi and µj leads naturally to the statistic

(X X ) m/2 i,m − j,m (2.8) Sn−k p where 1 k ni S2 = (X X )2. n−k n k i,j − i,ni i=1 j=1 − X X Under the model (2.6) the statistic of (2.8) follows the t distribution with n k degrees of freedom. As we are interested in all pairwise comparisons − this would lead to the statistic X X m/2 max | i,m − j,m|. (2.9) 1≤i

78 Under the null hypothesis of equal means in model (2.6) it follows that

X X √m max | i,m − j,m| 1≤i

X X √m max | i,m − j,m| > qtukey(1 α,k,n k). (2.11) 1≤i

X X √m | i,m − j,m| > qtukey(α,k,n k). (2.12) Sn−k − So far we have assumed that the sample sizes of the k samples are equal. If this is not so we replace (2.12) by

X X 1 1 1 1/2 | i,m − j,m| > qtukey(1 α,k,n k) + . (2.13) S − − 2 n n n−k   i j  It has been shown that this test is conservative, that is the probability that the null hypothesis will not be rejected if true is at least α. The tests (2.12) and (2.13) can be inverted to give confidence intervals for the differences µi µj. For the test (2.12) we obtain as a 100α%–confidence interval for µ − µ i − j X X S qtukey(α,k,n k)/√m, i,m − j,m − n−k − h X X + S qtukey(α,k,n k)/√m . (2.14) i,m − j,m n−k − i A corresponding interval for the case of unequal sample sizes can be derived from (2.13). Sometimes more complicated hypotheses than equality of means may be tested. If for example the sample sizes are equal and two groups G1 and G2 of samples have special interest we may be interested in testing whether their overall means differ. This leads to the null hypothesis 1 1 µi = µj G1 G2 i∈G1 j∈G2 | | X | | X 79 which may be written in the form 1 1 µi µj =0. G1 − G2 i∈G1 j∈G2 | | X | | X Even more generally we may write

k k

ci µi = 0 where ci =0. i=1 i=1 X X Definition 2.2. A contrast is a set of k numbers c1,...,ck, not all zero, k such that i=1 ci =0. n n TheoremP 2.3. Let (ci)1 be real numbers with i=1 ci =0. Then for any real n numbers (xi) we have i=1 P n 1 n ci xi max xj xi ci . ≤ 2 1≤i 0,c−− > 0 and suppose without i=1 i i=1 i { i i } loss of generality that this is c+. Then we choose c− >P0 and we can write P P j k n n n n n c x = c+x c−x = c+(x x )+ c˜+x c˜−x i i i i − i i j j − k i i − i i i=1 i=1 i=1 i=1 i=1 X X X X X where n n c˜+ = c˜− = c+ c+ < c+ i i i − j i i=1n i=1n i=1 i=1 X X X X We can now repeat this process with thec ˜i and finally we have

n n c x = γ (x x ) i i i i − j(i) i=1 i=1 X X n n + for some non-negative numbers γi with i=1 γi = i=1 ci . This implies n n P nP n + ci xi = γi(xi xj(i)) max xj xi γi = max xj xi ci . − ≤ 1≤i

n n n n c = c+ + c− =2 c+. | i| i i i i=1 i=1 i=1 i=1 X X X X  Using this result we can adapt Tukey’s test to contrasts as follows. We have k k c X = c (X X ) i i,m i i,m − j,m i=1 i=1 X X k for any j as i=1 ci =0. This gives

P k k c X c X X i i,m ≤ | i|| i,m − j,m| i=1 i=1 X X k 1 max Xi,m Xj,m ci . ≤ 2 1≤i

k k m/2 c X > qtukey(1 α,k,n k) c /2 (2.15) i i,m − − | i| i=1 i=1 p X X k or a α–confidence interval for i=1 ci µi

k P k 1 c X S qtukey(α,k,n k) c /√m, i i,m − n−k − 2 | i| i=1 i=1 h X X k 1 k c X + S qtukey(α,k,n k) c /√m . (2.16) i i,m n−k − 2 | i| i=1 i=1 X X i We note that in the case c =1, c = 1 and the rest of the c = 0 the test i j − l reduces to testing µi = µj using Tukey’s test.

2.4.2 Bonferroni and Holm If m statistical tests are carried out under the corresponding null hypotheses and all at the level α then for large m about mα of these tests will result in a significant value of the test statistic and the null hypotheses will be rejected. If sufficiently many tests are carried out at a fixed level then there will be

81 one significant result just by chance. A general method of adjusting for the number of tests carried is based on the simple inequality

m P ( m F ) P (F ) ∪i=1 i ≤ i i=1 X which holds without restrictions on the Fi. Let Fi denote the event that the ith null hypothesis Hi is rejected given that all the null hypotheses are true. Then we can guarantee that the probability of falsely rejecting at least one is at most α by testing each individual hypothesis ate the level α/m. We then have m m P ( m F ) P (F ) α/m = α. ∪i=1 i ≤ i ≤ i=1 i=1 X X As an example we consider Tukey’s least honest difference test. If we test all pairs of means being equal we have k(k 1)/2 null hypotheses. Each single test is at the level β is of the form −

m X X | i,m − j,m| > qt(1 β/2, n k) 2 S − − r n−k and by setting β = α/(k(k 1)/2)=2α/(k(k 1)). This results in the tests − − m X X | i,m − j,m| > qt(1 α/(k(k 1)), n k). 2 S − − − r n−k Just as with Tukey’s least honest difference test it is guaranteed that the probability of at least one type 1 error is at most α. It can be shown that the critical points for the test statistics are larger than those of the Tukey test which is therefore more powerful. However the Bonferroni method is more general. Holm proposed a sequential procedure which leads to more powerful tests than the Bonferroni method and is equally generally applicable. For each of the test statistics Ti we calculate the attained p–value pi which is defined as follows. Suppose that under the ith null hypothesis the test statistic Ti has the distribution function F . Then we define p = 1 F −1(T ) which i i − i i is the attained level of the tests. We order the pi from smallest to largest p p . . . p . To ease the notation we suppose the tests are (1) ≤ (2) ≤ ≤ (m) labeled in this order. The procedure is as follows. If p(1) > α/m then all null hypotheses are accepted. If p α/m then the corresponding null (1) ≤ hypothesis is rejected and we consider the next value p(2) which is compared with α/(m 1). If p > α/(m 1) then all remaining null hypotheses are − (2) − 82 accepted. If p(2) < α/(m 1) then the corresponding hypothesis is rejected and we move on to p which− is compared with α/(m 2). If p > α/(m 2) (3) − (3) − then all remaining null hypotheses are accepted. If p < α/(m 2) then (3) − the corresponding hypothesis is rejected and we move on to p(4) which is compared with α/(m 3). This is continued until either all null hypotheses − are rejected or for some i we have p(i) > α/(m i + 1) in which case this and all remaining hypotheses are accepted. It can− be shown that this method also guarantees that the probability of at least one type 1 error is at most α but it is clearly more powerful than the Bonferroni method. One point in favour of the Bonferroni method is that it can provide confidence intervals whereas the Holm method cannot.

2.4.3 Scheff´e’s method

Tukey's test for contrasts is only exact in the case that all the c_i are zero apart from two values which are +1 and -1. In all other cases the test is conservative but loses power. Scheffé's method was specially developed to deal with contrasts. We restrict ourselves to the case of a balanced design where all the samples have the same sample size m. We consider a contrast c = (c_1, ..., c_k) which, without loss of generality, we may take to have length 1, \|c\|^2 = \sum_{i=1}^k c_i^2 = 1. The test statistic we use is

\sum_{i=1}^k c_i X_{i,m} = \sum_{i=1}^k c_i (X_{i,m} - X_m).

If we now maximize this over all possible contrasts of length 1 we have

\max_{\|c\|=1} \sum_{i=1}^k c_i (X_{i,m} - X_m) = \Bigl( \sum_{i=1}^k (X_{i,m} - X_m)^2 \Bigr)^{1/2},

with the maximizing contrast being given by

c_i = \frac{X_{i,m} - X_m}{\bigl(\sum_{i=1}^k (X_{i,m} - X_m)^2\bigr)^{1/2}}

which is indeed a contrast as

\sum_{i=1}^k (X_{i,m} - X_m) = 0.

From this we see that

P\Bigl( \max_{\|c\|=1} \sum_{i=1}^k c_i (X_{i,m} - X_m) / S_{n-k} \ge K \Bigr) = P\Bigl( \frac{\sum_{i=1}^k (X_{i,m} - X_m)^2}{S_{n-k}^2} \ge K^2 \Bigr).

m k 2 (Xi,m Xm) k−1 i=1 − =D F S2 k−1,n−k P n−k where Fn1,n2 denotes the F –distribution with n1 and n2 degrees of freedom. This leads to the following test for contrasts of size α k k 1 c X S − F (1 α) . i i,m ≥ n−k m k−1,n−k − i=1 r X The test guarantees that with probability of at least 1 α no type 1 error will be committed non matter how many contrasts are− tested. Confidence k intervals for i=1 ciµi can be obtained by inverting the tests. The 100α% confidence interval is given by P k k 1 c X S − F (α), i i,m − n−k m k−1,n−k i=1 r  X k k 1 c X + S − F (α) . i i,m n−k m k−1,n−k i=1 r X  We note that all these confidence intervals hold simultaneously. Scheff´e’s method can be generalized to the case that the samples sizes are not equal without difficulty. The resulting tests for any contrast c are given by k k 1/2 c X S (k 1)F (1 α) c2/n (2.17) i i,m ≥ n−k − k−1,n−k − i i i=1 i=1 X q  X  k 2 where we now no longer assume that i=1 ci = 1. Th corresponding confi- dence intervals are P k k 1/2 c X S (k 1)F (α) c2/n , i i,m − n−k − k−1,n−k i i h i=1 q  i=1  X k X k 1/2 c X + S (k 1)F (α) c2/n . (2.18) i i,m n−k − k−1,n−k i i i=1 i=1 X q  X  i 84 2.5 Simultaneous confidence intervals The methods considered so far are all based on the normal distribution. Some of them allow for different samples size but none are explicitly designed for different variances even if the distribution is normal. Moreover we do not require that the different samples differ only in their location and scale values; the underlying distributions may differ as well and they may be slightly skewed. Outliers occur in real data and have to be accounted for as well. Contrasts are not the only form of test which may be considered. The samples may be ordered according to some measurement, say severity of a disease, and one possible test is whether the means are increasing or decreasing. Finally confidence intervals may also be required, both for individual means as well as contrasts. We now offer an approach which allows for deviations from the model and permits any possible test to be performed or confidence interval to be calculated whilst still controlling the probability of a type 1 error and the levels of the confidence intervals. We consider the redescending M-estimator (TL, TS) as described in the Section 1.11. For each sample Xi we calculate a confidence interval Ii, i = 1,...,k. We want the confidence intervals to be simultaneous confidence intervals of level α, that is we require P (T (P ) I , i =1,...,k)= α L i ∈ i where Pi denotes the distribution of the ith sample. We shall assume that the samples are independent, an assumption which is not always plausible. However given this we have

k P (T (P ) I , i =1,...,k)= P (T (P ) I )= α L i ∈ i L i ∈ i i=1 Y and if we require all condidence intervals to be of the same level then we must have P (T (P ) I )= α1/k, i =1,...,k. L i ∈ i Alternatively we can use Bonferroni bounds which leads to replacing α by 1 (1 α)/k which is always close to α1/k so that the independence of the samples− − is almost a worst case situation. The construction of confidence intervals was described in Section 1.12 but there only conventional sizes α = 0.9, 0.95 and 0.99 were used. In the present case much higher values of α will be encountered. If for example α =0.95 and k = 5 then α1/k =0.990 whereas for k = 10 and k = 50 the values are 0.998996 and 0.9998 respectively. We therefore require the quantiles of the statistic √n(T (P ) T (P)) L n − L 85 for a much larger range. This can be done by more extensive simulations and for the redescending ψ–estimator and we indicate how this is done. We consider firstly the case that we base the asymptotic behaviour on the normal distribution as described in Remark 1.91. The asymptotic value of the α– quantile is 1.60qnorm(α)TS(P ) which for finite n we approximate by

fac(α, n)\,qt(α, n-1)\,1.60\,T_S(P_n).   (2.19)

The values of the correction factors fac(α, n) for 3 ≤ n ≤ 100 are given by

fac(α, n) = \exp\bigl(a(n-2) + b(n-2)\beta^2 + c(n-2)\beta\bigr)

where β = \log(-\log(1-α)) and the constants a(n-2), b(n-2) and c(n-2) are determined from simulations. For n ≥ 100 we put

fac(α, n) = \exp\bigl((a(98) + b(98)\beta^2 + c(98)\beta)\exp(-1.13\log(0.01\log n))\bigr).

For example for n = 3 we have a(1) = 1.872, b(1) = 1.862 and c(1) = -1.689. We use the same form of approximation if we replace the factor 1.60 in (2.19) by the empirical version

\Biggl( \frac{\frac{1}{n}\sum_{i=1}^n \psi\bigl(\frac{x_i - T_L(E_n)}{T_S(E_n)}\bigr)^2}{\Bigl(\frac{1}{n}\sum_{i=1}^n \psi^{(1)}\bigl(\frac{x_i - T_L(E_n)}{T_S(E_n)}\bigr)\Bigr)^2} \Biggr)^{1/2}

but of course with different values for the constants a, b and c. For n = 3 we have a(1) = 1.708, b(1) = 2.014 and c(1) = -2.216. For such small values of n there is however little point in using the empirical version as it is impossible to say anything about the distribution of the sample for such a small sample size. For values of n ≥ 20 the empirical version does have some advantages.

We now apply the methodology to the data of Example 2.1 using the default redescending M-estimator. The values of T_L are 6.037, 4.041, 6.360 respectively and those of T_S are 0.771, 0.593, 1.023. The simultaneous confidence intervals with α = 0.95 are

I1 = [5.226, 6.849], I2 = [3.226, 4.856], I3 = [5.433, 7.288]. (2.20)

These are shown graphically in Figure 25, which can be compared with the boxplots of Figure 24. We see that the interval for the second data set x_{2,17} is disjoint from the others so that its location is significantly different. The other two intervals overlap so that the data are consistent with a common location value.


Figure 25: The figure shows the simultaneous confidence intervals for the data of (2.7).

A 95% confidence interval for this common value is simply the intersection of the two intervals, I_1 ∩ I_3 = [5.433, 6.849].

Once the simultaneous confidence intervals have been calculated all questions concerning the relationship between the population values can be answered by reference to the intervals. One example is the question of contrasts. Suppose we wish to know whether \sum_{i=1}^k c_i T_L(P_i) = 0 is consistent with the data. To consider this we calculate an upper and a lower bound for \sum_{i=1}^k c_i T_L(P_i). We denote the ith interval by [a_i, b_i]. A confidence interval for the contrast is

\Bigl[ \sum_{i=1}^k c_i d_i,\ \sum_{i=1}^k c_i e_i \Bigr]

where

d_i = \begin{cases} a_i & c_i > 0 \\ b_i & c_i < 0 \end{cases}, \qquad e_i = \begin{cases} b_i & c_i > 0 \\ a_i & c_i < 0 \end{cases}.

If this interval covers the origin then \sum_{i=1}^k c_i T_L(P_i) = 0 is consistent with the data. The special case c_i = 1, c_j = -1 with the remainder of the c_l zero gives a confidence bound for T_L(P_i) - T_L(P_j). We note that the length of the confidence interval is

\sum_{i=1}^k c_i (e_i - d_i) = \sum_{i=1}^k |c_i| (b_i - a_i).

A different question may arise if the samples are in a certain order such as the degree of severity of a disease or the temperature at which the observations were taken. In such a case it may be of interest to ask whether the population values are increasing. This will be accepted if there exists a sequence \mu_j with \mu_j \in I_j and \mu_1 \le \mu_2 \le \ldots \le \mu_k. It is possible to give confidence bounds for such increasing sequences as follows. If we ask for an increasing sequence then an upper bound is given by

u_i = \min\{ b_i, b_{i+1}, \ldots, b_k \}, \quad i = 1, \ldots, k,

and the corresponding lower bound by

l_i = \max\{ a_1, a_2, \ldots, a_i \}, \quad i = 1, \ldots, k.

If at any point the lower bound l_i exceeds the upper bound u_i then there is no increasing sequence. Otherwise the tube is a confidence bound for an increasing sequence in the sense that any increasing sequence \mu_i, i = 1, ..., k, with \mu_i acceptable for the ith sample must lie in the tube.

Small samples with rounding of the data can cause problems with M-estimators when the restriction that the largest atom is less than 0.5 does not hold. This is the case for the data of Example 1.13 where the first laboratory returned the data 2.2, 2.215, 2.2, 2.1, giving an atom of size 0.5 at the point 2.2. There is no correct method of dealing with this problem and any method will be to some extent ad hoc. For the present we use the following. If the size of the largest atom is at most 0.4 then we use the usual M-estimator. If it is larger than 0.4 then we consider the strictly positive absolute deviations from the atom. If there are m of them we use as scale the ⌈m/2⌉th largest of these values. The location part is then calculated separately using this auxiliary scale. Using this, Figure 26 shows the simultaneous confidence intervals. The result is not very satisfactory. There are negative values, which in the present context make no sense, and some of the confidence intervals are very large. This is because of the large value of α = 0.95^{1/14} = 0.99634 and the fact that there are only 4 observations per laboratory. One way of overcoming the negative values is to transform the data by taking logarithms, performing the analysis and transforming back. This would not however reduce the size of the confidence intervals. How this may be done is the subject of the next section.
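The calculations with the simultaneous intervals described above are elementary. The following is a minimal sketch in R, assuming only that the intervals [a_i, b_i] of (2.20) are available; the helper function contrast.interval is not part of any package.

a <- c(5.226, 3.226, 5.433)        # lower end points of (2.20)
b <- c(6.849, 4.856, 7.288)        # upper end points of (2.20)

## interval for a contrast sum(c_i T_L(P_i)): take a_i where c_i > 0 and b_i
## where c_i < 0 for the lower bound, and the other way round for the upper
contrast.interval <- function(cc, a, b) {
  c(sum(cc * ifelse(cc > 0, a, b)), sum(cc * ifelse(cc > 0, b, a)))
}
contrast.interval(c(1, -1, 0), a, b)   # bound for T_L(P_1) - T_L(P_2)

## bounds for an increasing sequence mu_1 <= ... <= mu_k
u <- rev(cummin(rev(b)))   # u_i = min(b_i, ..., b_k)
l <- cummax(a)             # l_i = max(a_1, ..., a_i)
any(l > u)                 # TRUE: no increasing sequence is consistent with the data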

2.6 Borrowing strength If the measurements of the different samples have been made under similar conditions then there may be good reasons to suppose that the scales of the


Figure 26: The figure shows the simultaneous confidence intervals for the data of Example 1.13.

samples should be more or less the same. This is the case for the inter-laboratory data of Example 1.13 where the techniques of measurement are similar, so there is reason to suspect that the scales should also be similar. Whether there is any justification for this is an empirical question and not one that can be decided by statistics. If there is reason to suppose that the scales are similar then advantage can be taken of this by pooling the various scale estimates to give one which is more reliable. At the moment there is no generally accepted way of doing this and the following is just a suggestion which still needs to be investigated in more detail. Whether it is satisfactory or not can only be decided by applications to real data. Let S_i = T_S(x_{i,n_i}) denote the scale of the ith sample. We replace this by

S_i^* = \sum_{j=1}^k S_j \exp\bigl(-2(\log(S_i/S_j))^2\bigr) \Big/ \sum_{j=1}^k \exp\bigl(-2(\log(S_i/S_j))^2\bigr)   (2.21)

and in the expression for the confidence interval we replace

fac(n_i, α)\,qt(α, n_i - 1)\,S_i \quad\text{by}\quad fac(n_i^*, α)\,qt(α, n_i^* - 1)\,S_i^*

where

n_i^* = \sum_{j=1}^k n_j \exp\bigl(-2(\log(S_i/S_j))^2\bigr).   (2.22)

We do not perform a formal test of equality of the different scales but use a weighted average of those scale values which are close to that of the sample under consideration. We illustrate this first for the data of (2.7). We calculate the residuals from each sample to give a total of 68 observations. The scale estimate for this sample is 0.8670 and the location estimates for the three samples are 6.038, 4.099 and 6.286 respectively. The confidence intervals are

I1 = [5.395, 6.680], I2 = [3.350, 4.847], I3 = [5.704, 6.868].

Only the third interval has become shorter, which is due to the fact that the sample sizes are so large that pooling brings little benefit. If the sample sizes are small the effect can be considerable. Figure 27 shows the result for the data of Example 1.13. For these data the scales with borrowing strength


Figure 27: The figure shows the simultaneous confidence intervals for the data of Example 1.13 but with borrowing strength.

are 0.0123, 0.0179, 0.2383, 0.0079, 0.2203, 0.1336, 0.0385, 0.0576, 0.0127, 0.0349, 0.2111, 0.0057, 0.008, 0.0119 compared to the original scales of 0.0150, 0.0250, 0.3848, 0.0078, 0.2609, 0.1520, 0.0510, 0.0800, 0.0158, 0.0460, 0.2422, 0.0052, 0.0080, 0.0142. The differences are not large. What has led to the considerable reduction in the lengths of the confidence intervals is the increase in the number of degrees of freedom. Without borrowing strength the number of degrees of freedom is 3 for all samples. With borrowing strength the numbers as given by (2.22) are

18, 15, 10, 16, 13, 11, 13, 11, 18, 13, 13, 10, 16, 18.
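A minimal sketch of the pooling defined by (2.21) and (2.22) is the following R function; it assumes only the vectors of scale estimates and sample sizes, here those quoted above for the data of Example 2.1.

pool.scales <- function(S, n) {
  k <- length(S)
  S.star <- numeric(k); n.star <- numeric(k)
  for (i in 1:k) {
    w <- exp(-2 * log(S / S[i])^2)    # weights: samples with similar scale dominate
    S.star[i] <- sum(S * w) / sum(w)  # pooled scale, (2.21)
    n.star[i] <- sum(n * w)           # effective degrees of freedom, (2.22)
  }
  list(S.star = S.star, n.star = n.star)
}
pool.scales(S = c(0.771, 0.593, 1.023), n = c(23, 17, 28))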

2.7 An example We now apply the different methods to the data set of Example 2.1. As we have already seen Fisher’s F –test leads to the conclusion that the means are significantly different.

2.7.1 Tukey: Honest Significant Difference We have k = 3 and n = 68. Direct calculation gives

s^2_{n-k} = \frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{i,j} - \bar x_{i,n_i})^2 = \frac{1}{65}(40.81718 + 27.15985 + 127.8995) = 3.013485

so that s_{n-k} = 1.735939. For α = 0.05 the 1-α quantile of Tukey's studentized range with k = 3 and n-k = 65 is

> qtukey(0.95,3,65)
[1] 3.392061

The means are 6.08087, 4.241765 and 6.791429 and the 95% confidence intervals for µ_i - µ_j are given by

\Bigl[ \bar x_{i,n_i} - \bar x_{j,n_j} - s_{n-k}\,qtukey(1-α, k, n-k)\sqrt{\tfrac{1}{2}\bigl(\tfrac{1}{n_i} + \tfrac{1}{n_j}\bigr)},\ \bar x_{i,n_i} - \bar x_{j,n_j} + s_{n-k}\,qtukey(1-α, k, n-k)\sqrt{\tfrac{1}{2}\bigl(\tfrac{1}{n_i} + \tfrac{1}{n_j}\bigr)} \Bigr].

We have

s_{n-k}\,qtukey(1-α, k, n-k)\sqrt{\tfrac{1}{2}\bigl(\tfrac{1}{n_1} + \tfrac{1}{n_2}\bigr)} = 1.735939 \cdot 3.392061 \cdot \sqrt{0.5(0.04347826 + 0.05882353)} = 1.331756.

Consequently the confidence interval for µ_1 - µ_2 is

[6.08087 - 4.241765 - 1.331756,\ 6.08087 - 4.241765 + 1.331756] = [0.507349, 3.170861].

Similarly the confidence intervals for µ_1 - µ_3 and µ_2 - µ_3 are respectively [-1.8822833, 0.4611653] and [-3.8298881, -1.2694396]. As the interval for µ_1 - µ_3 contains 0 it follows that the difference in the means for the control and inactive patients is not significant.
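The calculation can be checked in R; the following minimal sketch reproduces the half-width 1.331756 and the interval for µ_1 - µ_2 from the numbers quoted above.

s.nk  <- 1.735939                        # s_{n-k}
n     <- c(23, 17, 28)                   # sample sizes
xbar  <- c(6.08087, 4.241765, 6.791429)  # sample means
alpha <- 0.05

half.width <- function(i, j)
  s.nk * qtukey(1 - alpha, 3, 65) * sqrt(0.5 * (1/n[i] + 1/n[j]))

half.width(1, 2)                                   # 1.331756
xbar[1] - xbar[2] + c(-1, 1) * half.width(1, 2)    # interval for mu_1 - mu_2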

2.7.2 Bonferroni

We use the t-tests

\sqrt{\frac{n_i n_j}{n_i + n_j}}\ \frac{|\bar x_{i,n_i} - \bar x_{j,n_j}|}{s_{n-k}}

based on n-k degrees of freedom. As we perform three tests we must replace α = 0.05 by α* = α/3 = 0.01667. The critical value we must use is

qt(1 - α*/2, n - k) = qt(0.991667, 65) = 2.457531.

For samples 1 and 2 the value of the test statistic is

\sqrt{\frac{23 \cdot 17}{23 + 17}}\ \frac{|6.08087 - 4.241765|}{1.735939} = 3.312306

which is larger than 2.457531, so the means are significantly different. The values of the test statistic for samples 1 and 3 and samples 2 and 3 are respectively 1.454533 and 4.776895. We therefore conclude that the means of samples 1 and 3 are not significantly different but those of samples 2 and 3 are. The confidence intervals are given by

\Bigl[ \bar x_{i,n_i} - \bar x_{j,n_j} - s_{n-k}\,qt(1-α^*/2, n-k)\sqrt{\tfrac{1}{n_i} + \tfrac{1}{n_j}},\ \bar x_{i,n_i} - \bar x_{j,n_j} + s_{n-k}\,qt(1-α^*/2, n-k)\sqrt{\tfrac{1}{n_i} + \tfrac{1}{n_j}} \Bigr].

For µ_1 - µ_2 we get

6.08087 - 4.241765 - 1.735939 \cdot 2.457531 \cdot \sqrt{1/23 + 1/17} = 0.4746002

and

6.08087 - 4.241765 + 1.735939 \cdot 2.457531 \cdot \sqrt{1/23 + 1/17} = 3.20361

which gives a confidence interval [0.4746002, 3.20361]. For µ_1 - µ_3 we obtain [-1.911096, 0.4899782] and for µ_2 - µ_3 we get [-3.861369, -1.237959]. These can be compared with the intervals for Tukey's method.

2.7.3 Holm

We use the t-tests as in the last section. The attained values of the test statistic are 3.312306 (H_12 : µ_1 = µ_2), 1.454533 (H_13 : µ_1 = µ_3) and 4.776895 (H_23 : µ_2 = µ_3). The attained p-values are

2(1 - pt(3.312306, 65)),\quad 2(1 - pt(1.454533, 65)),\quad 2(1 - pt(4.776895, 65))

which are respectively 0.001514134, 0.1506107 and 0.0000153655. On ordering these values we see that the hypotheses are to be tested in the order H_23, H_12, H_13. The associated values of α are α/3, α/2 and α. For α = 0.05 these are 0.016667, 0.025 and 0.05. As 0.0000153655 < 0.016667 we reject H_23 and consider H_12. As 0.001514134 < 0.025 we reject H_12 and consider H_13. As 0.1506107 > 0.05 we accept H_13.
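The Holm decisions can again be obtained with p.adjust; a minimal sketch using the t-statistics quoted above is the following.

t.stats <- c(H12 = 3.312306, H13 = 1.454533, H23 = 4.776895)
p <- 2 * (1 - pt(t.stats, 65))         # two-sided p-values
round(p, 7)
p.adjust(p, method = "holm") <= 0.05   # H12 and H23 rejected, H13 accepted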

2.7.4 Scheffé

We use (2.17). The value of S_{n-k} is, as before, 1.735939 and the value of F_{k-1,n-k}(1-α) for k = 3, n = 68 and α = 0.05 is

> qf(0.95,2,65)
[1] 3.138142

The null hypothesis H_12 : µ_1 = µ_2 can be written as H_12 : µ_1 - µ_2 = 0 which corresponds to the contrast c_1 = 1, c_2 = -1 and c_3 = 0. The critical value of the test statistic of (2.17) is therefore

S_{n-k} \sqrt{(k-1) F_{k-1,n-k}(1-α)} \Bigl( \sum_{i=1}^k c_i^2/n_i \Bigr)^{1/2} = 1.735939 \cdot \sqrt{(3-1) \cdot 3.138142} \cdot \sqrt{1/23 + 1/17} = 1.390280.

The value of the test statistic is |6.08087 - 4.241765| = 1.839105 so that the null hypothesis H_12 : µ_1 = µ_2 is rejected. For the hypothesis H_13 : µ_1 = µ_3 the critical value is

1.735939 \cdot \sqrt{(3-1) \cdot 3.138142} \cdot \sqrt{1/23 + 1/28} = 1.223215

and the value of the test statistic is |6.08087 - 6.791429| = 0.710559 so that H_13 : µ_1 = µ_3 is not rejected. Finally the critical value for H_23 : µ_2 = µ_3 is

1.735939 \cdot \sqrt{(3-1) \cdot 3.138142} \cdot \sqrt{1/17 + 1/28} = 1.336483

and the value of the test statistic is |4.241765 - 6.791429| = 2.549664 so that the hypothesis H_23 : µ_2 = µ_3 is also rejected. The confidence intervals are calculated according to (2.18). For µ_1 - µ_2 we get

[6.08087 - 4.241765 - 1.390280,\ 6.08087 - 4.241765 + 1.390280] = [0.448825, 3.229385].

The corresponding intervals for µ_1 - µ_3 and µ_2 - µ_3 are [-1.933774, 0.512656] and [-3.886147, -1.213181] respectively.
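A minimal sketch of the Scheffé calculation for H_12 in R, using (2.17) and the numbers quoted above, is the following.

s.nk <- 1.735939
n    <- c(23, 17, 28)
xbar <- c(6.08087, 4.241765, 6.791429)
cc   <- c(1, -1, 0)                       # contrast for mu_1 - mu_2

crit <- s.nk * sqrt((3 - 1) * qf(0.95, 2, 65)) * sqrt(sum(cc^2 / n))
stat <- abs(sum(cc * xbar))
c(statistic = stat, critical = crit)      # reject H_12 as stat >= crit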

2.7.5 Simultaneous confidence intervals

The analysis is based on the simultaneous confidence intervals for µ_1, µ_2 and µ_3 of (2.20),

I_1 = [5.226, 6.849], I_2 = [3.226, 4.856], I_3 = [5.433, 7.288].

As I_1 ∩ I_2 = ∅ the hypothesis H_12 : µ_1 = µ_2 is rejected. Similarly as I_2 ∩ I_3 = ∅ the hypothesis H_23 : µ_2 = µ_3 is rejected. As I_1 ∩ I_3 ≠ ∅ the hypothesis H_13 : µ_1 = µ_3 is accepted and I_1 ∩ I_3 = [5.433, 6.849] is a confidence interval for the common value. The confidence intervals for µ_1 - µ_2, µ_1 - µ_3 and µ_2 - µ_3 are respectively

[5.226 - 4.856, 6.849 - 3.226] = [0.370, 3.623]
[5.226 - 7.288, 6.849 - 5.433] = [-2.062, 1.416]
[3.226 - 7.288, 4.856 - 5.433] = [-4.062, -0.577].

3 The Two-way table

3.1 The Two-Way Table

3.1.1 The problem

The one-way table takes only one factor, for example the variety of wheat, into consideration. The two-way table takes two factors into consideration, for example the variety of the wheat and the fertilizer. The problem is to separate the effects of the two factors, which is done, at least initially, by assuming that their joint effect is simply additive. Interactions can also be taken into account as it may be that a particular fertilizer is particularly suitable for a particular variety of wheat. Clearly more complicated relationships between the two factors which go beyond additivity and individual interactions can be imagined, but we shall consider only the simplest model.

3.1.2 The classical method

We consider two factors each of which is at several levels. In the wheat/fertilizer example the levels are the varieties of wheat and the different fertilizers. We have K_ij observations with Factor 1 at level i and Factor 2 at level j. The purely additive model for this situation is

X_{ijk} = a_i + b_j + σ ε_{ijk},\quad k = 1, \ldots, K_{ij},\ i = 1, \ldots, I,\ j = 1, \ldots, J,   (3.1)

where the ε_{ijk} are standard white noise. The parameter a_i represents the effect of the first factor at level i and b_j represents the effect of the second factor at level j. The model assumes that the effects are additive. The first thing to notice is that the parameters a_i and b_j are not identifiable. If we increase all the a_i by d and decrease all the b_j by d then the model remains exactly as before. What can be estimated are differences in the a_i and b_j and these are the relevant quantities for such questions as whether one variety of wheat is better than another. For this reason the model is often written in the form

X_{ijk} = µ + a_i + b_j + σ ε_{ijk},\quad k = 1, \ldots, K_{ij},\ i = 1, \ldots, I,\ j = 1, \ldots, J,   (3.2)

with the following restrictions imposed on the a_i and b_j,

\sum_{i=1}^I a_i = \sum_{j=1}^J b_j = 0.   (3.3)

The restrictions (3.3) are often presented as a technical condition to enable the parameters to be identified and one which is neutral in terms of data analysis. This is not the case. The restrictions (3.3) are equivalent to the means of the row and column effects being zero. If there is one large value, say a_{i_0}, then all the other a_i may be zero, giving a perhaps false impression of being worse than average. It may be more sensible to obtain identifiability by requiring that the medians of the row and column effects be zero. The classical method of estimating the parameters a_i and b_j in (3.1), and one that may be sold as maximum likelihood using the Gaussian model, is to minimize

\sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^{K_{ij}} (X_{ijk} - µ - a_i - b_j)^2   (3.4)

subject to the restrictions (3.3). This is only a simple task if the design is balanced, that is if K_{ij} = K for all i and j. In the unbalanced case the

calculations can be done as a particular form of linear regression analysis as we shall indicate later. In the balanced case we have the following Table 8, which corresponds to Table 7 for the one-way analysis of variance.

Source       Degrees of freedom   Sums of squares
Grand mean   1                    IJK \bar x_{\cdot\cdot\cdot}^2
Rows         I - 1                JK \sum_{i=1}^I (\bar x_{i\cdot\cdot} - \bar x_{\cdot\cdot\cdot})^2
Columns      J - 1                IK \sum_{j=1}^J (\bar x_{\cdot j\cdot} - \bar x_{\cdot\cdot\cdot})^2
Error        IJK - I - J + 1      \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (x_{ijk} - \bar x_{i\cdot\cdot} - \bar x_{\cdot j\cdot} + \bar x_{\cdot\cdot\cdot})^2
Total        IJK                  \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K x_{ijk}^2

Table 8: The classical balanced two-way table.

Under the model (3.1) the sums of squares in the third column are independently distributed and, when divided by σ^2, they follow non-central χ^2-distributions with the numbers of degrees of freedom given in the second column. If there are no row or column effects then the distributions become central χ^2-distributions and these considerations lead to an F-test for the null hypotheses

H_0^a : a_1 = \ldots = a_I = 0, \qquad H_0^b : b_1 = \ldots = b_J = 0.   (3.5)

Thus the test statistic for H_0^a is

\frac{\frac{1}{I-1} JK \sum_{i=1}^I (\bar X_{i\cdot\cdot} - \bar X_{\cdot\cdot\cdot})^2}{\frac{1}{IJK-I-J+1} \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (X_{ijk} - \bar X_{i\cdot\cdot} - \bar X_{\cdot j\cdot} + \bar X_{\cdot\cdot\cdot})^2}

and a test of size α rejects the null hypothesis H_0^a if the test statistic exceeds F_{I-1, IJK-I-J+1}(1-α). The test for the null hypothesis a_i = a_j is based on the test statistic

\frac{\sqrt{JK}\,(\bar X_{i\cdot\cdot} - \bar X_{j\cdot\cdot})}{\sqrt{\frac{1}{IJK-I-J+1} \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (X_{ijk} - \bar X_{i\cdot\cdot} - \bar X_{\cdot j\cdot} + \bar X_{\cdot\cdot\cdot})^2}}

which under the null hypothesis follows a t-distribution with IJK - I - J + 1 degrees of freedom. The standard methods for multiple testing described for the one-way table can be applied here.
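In R the classical two-way analysis is most easily carried out with aov; the following minimal sketch uses simulated data and illustrative factor names A and B.

set.seed(1)
dat <- expand.grid(A = factor(1:3), B = factor(1:4), rep = 1:2)
dat$y <- rnorm(nrow(dat))              # placeholder response

summary(aov(y ~ A + B, data = dat))    # additive model (3.1): F-tests of (3.5)
summary(aov(y ~ A * B, data = dat))    # model (3.6) with interaction terms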

3.1.3 Interactions

Interaction terms are often introduced into the model which then becomes

X_{ijk} = µ + a_i + b_j + c_{ij} + σ ε_{ijk},\quad k = 1, \ldots, K_{ij},\ i = 1, \ldots, I,\ j = 1, \ldots, J.   (3.6)

Once again the parameters are not identifiable and the restrictions (3.3) are no longer sufficient. To see this we can simply put c_{ij} = a_i + b_j and then replace the initial a_i and b_j by zero. The conventional answer, and one that is justified by the requirement of identifiability, is to require that the marginal sums of the interactions are zero,

c_{i\cdot} = \sum_{j=1}^J c_{ij} = 0, \qquad c_{\cdot j} = \sum_{i=1}^I c_{ij} = 0   (3.7)

for all i and j. The variance decomposition of Table 8 can be extended to cover interactions as shown in Table 9. Once again, under the model (3.6) and the usual assumptions of normality the sums of squares in the third column are independently distributed random variables with non-central χ^2-distributions if the relevant parameter values are non-zero. This allows the construction of F-tests for the hypotheses, and confidence intervals can be obtained as was done in the one-way table. Care must be taken however when interpreting the null hypotheses if the interaction terms are non-zero. Thus the null hypothesis H_0 : a_1 = a_2 will only mean that the levels 1 and 2 of the first factor are more or less equivalent if the interactions are zero. If they are not zero the null hypothesis expresses no difference only after the interactions have been factored out. There is however a deeper reason why it is difficult to interpret the effect of interactions and we now discuss it in some detail.

Source        Degrees of freedom   Sums of squares
Grand mean    1                    IJK \bar x_{\cdot\cdot\cdot}^2
Rows          I - 1                JK \sum_{i=1}^I (\bar x_{i\cdot\cdot} - \bar x_{\cdot\cdot\cdot})^2
Columns       J - 1                IK \sum_{j=1}^J (\bar x_{\cdot j\cdot} - \bar x_{\cdot\cdot\cdot})^2
Interactions  (I - 1)(J - 1)       K \sum_{i=1}^I \sum_{j=1}^J (\bar x_{ij\cdot} - \bar x_{i\cdot\cdot} - \bar x_{\cdot j\cdot} + \bar x_{\cdot\cdot\cdot})^2
Error         IJ(K - 1)            \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (x_{ijk} - \bar x_{ij\cdot})^2
Total         IJK                  \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K x_{ijk}^2

Table 9: The classical balanced two-way table with interactions.

As mentioned above the restrictions (3.7) are often regarded simply as a way of allowing the model to be identified. With very few exceptions no mention is made of the fact that (3.7) is not data-analytically neutral. To see this we simply note that the restrictions imply that the smallest number of non-zero interactions is four. The next smallest number is six and this will only occur if one row or column has three interactions. If this is not the case then eight interactions will be required. In particular a single interaction in a single cell is not allowed. It is not clear why nature should be so obliging to statisticians.

It is easier to understand the problem in the no-noise situation of an exact model. We start with the data shown in the left tableau of Table 10. The right tableau shows the same data expressed in the form corresponding to (3.6), namely

x_{ij} = µ + a_i + b_j + c_{ij}

with the restrictions (3.3) and (3.7) in force. The last column shows the row effects, the bottom row the column effects and the bottom right entry the grand mean; the main body shows the interaction terms. Only a professional statistician would prefer the right-hand tableau.

1 0 0        4/9  -2/9  -2/9 |  2/9
0 0 0       -2/9   1/9   1/9 | -1/9
0 0 0       -2/9   1/9   1/9 | -1/9
            -----------------+-----
             2/9  -1/9  -1/9 |  1/9

Table 10: A single interaction.

For reasons of parsimony it seems reasonable to choose the row and column effects so as to minimize the number of interactions required. This may be expressed in the form

minimize_{a_i, b_j} |\{ (i, j) : |x_{ij} - a_i - b_j| > 0 \}|.   (3.8)

In general the solution of (3.8) will not be unique in the sense that there may be several minimal sets of pairs (i, j). A simple example is shown in Table 11.

 1  0  0                              0  0 -1
 0 -1  0   --(a_1 = 1, b_2 = -1)-->   0  0  0
 0  0  0                              0  1  0

Table 11: Two different minimal sets.

In some cases (3.8) may have a unique solution but this may depend on the particular values of the interactions even if these interactions are located in the same cells. This leads to the following definition. Given a set of interactions c_{ij} we define the interaction pattern as the matrix

C = \{\{ c_{ij} \ne 0 \}\}_{1 \le i \le I,\ 1 \le j \le J}.

In other words C_{ij} = 1 if and only if c_{ij} \ne 0 and C_{ij} = 0 otherwise. Consider now numbers x_{ij} generated by x_{ij} = a_i + b_j + c_{ij} where the c_{ij} have the interaction pattern C. We call C unconditionally identifiable if

x_{ij} - \tilde a_i - \tilde b_j = c_{ij}

for all c_{ij} with interaction pattern C, where \tilde a_i and \tilde b_j are any solution of (3.8). We have the following theorem.

Theorem 3.1. Suppose x_{ij} = a_i^* + b_j^* + c_{ij} where the c_{ij} have an unconditionally identifiable interaction pattern C. Then the L^1 solution

minimize \sum_{i=1}^I \sum_{j=1}^J |x_{ij} - a_i - b_j|

is unique, and if the solution is a_i^1, i = 1, \ldots, I, and b_j^1, j = 1, \ldots, J, then

x_{ij} - a_i^1 - b_j^1 = c_{ij}.

The theorem would be of little use if there were hardly any unconditionally identifiable interaction patterns but it turns out that there are many. Here are two sufficient conditions.

Theorem 3.2.
(a) An interaction pattern C in which a majority of the rows and a majority of the columns contain no interactions is unconditionally identifiable.
(b) An interaction pattern C such that in each row and column less than one quarter of the cells contain interactions is unconditionally identifiable.

It is clear that if C is an unconditionally identifiable interaction pattern then so is any pattern C' obtained from C by permuting rows and columns and by interchanging rows and columns. It follows from Theorem 3.2 (a) and (b) respectively that

1 0 1 0 0        1 0 0 0 0
0 0 0 0 0        0 0 0 1 0
0 1 0 1 0        0 0 1 0 0
0 0 0 0 0        0 1 0 0 0
0 0 0 0 0        0 0 0 0 1

are both unconditionally identifiable. There is an elegant method of deciding whether an interaction pattern is unconditionally identifiable.

Theorem 3.3. Let C be an interaction pattern and C' be any other interaction pattern obtainable from C by adding rows and columns of 1s mod 2. If |C| denotes the number of 1s in C then C is unconditionally identifiable if and only if |C| < |C'| for any C' ≠ C.

There are essentially 2^{min(I,J)} possibilities to check using a simple computer programme. As long as, say, min(I,J) ≤ 30 this is possible in essentially finite time. Using Theorem 3.3 it is possible to show that the interaction pattern

1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 1 1

is unconditionally identifiable, a fact which does not follow from Theorem 3.2.

When analysing a two-way table we will require an estimate of scale based on the residuals. We want this estimate to be robust and to be able to with- stand large outliers or interactions in certain cells. In order to be able to

avoid these we need to know how many there can be. An upper bound is provided by the next theorem.

Theorem 3.4. Suppose the c_{ij} have an unconditionally identifiable interaction pattern. Then at most

\min\Bigl\{ \Bigl(J - \Bigl\lfloor\tfrac{J-1}{2}\Bigr\rfloor\Bigr)\Bigl\lfloor\tfrac{I-2}{2}\Bigr\rfloor,\ \Bigl(I - \Bigl\lfloor\tfrac{I-1}{2}\Bigr\rfloor\Bigr)\Bigl\lfloor\tfrac{J-2}{2}\Bigr\rfloor \Bigr\} + \Bigl\lfloor\tfrac{I-1}{2}\Bigr\rfloor \Bigl\lfloor\tfrac{J-1}{2}\Bigr\rfloor   (3.9)

of the interactions are non-zero.

If I = J = 3 then the maximum number is 1. If I = J = 5 then the maximum number is at most 7. An unconditionally identifiable pattern with 6 interactions was given just after Theorem 3.3. So far we have considered exact data but the results carry over to noisy data in a sense made precise by the following theorem.

Theorem 3.5. Let xij be data of the form

x = a0 + b0 + r , 1 i I, 1 j J ij i j ij ≤ ≤ ≤ ≤ where the rij are arbitrary but fixed. Consider all possible data sets of the form xc = x + c , 1 i I, 1 j J ij ij ij ≤ ≤ ≤ ≤ where the cij are arbitrary but have an unconditionally identifiable interaction c ˜c 1 pattern. If a˜i and bj are L –solutions of

I J minimize xc a b | ij − i − j| i=1 j=1 X X c normalized to have a˜1 =0, then

sup ( a˜c + ˜bc ) < . c k k k k ∞ The theorem say that if the interactions are sufficiently large then then they can be identified using the L1–solution. We have to decide what we mean by “sufficiently large” and again we have recourse to simulations. We use the Gaussian distribution as our target model and it is sufficient to consider data of the form X = Z , 1 i I, 1 j J ij ij ≤ ≤ ≤ ≤ 101 where the Zij are Gaussian white noise. We calculate the residuals from an L1–solution and now wish to define a robust scale functional based on these residuals. It is a property of an L1–solution that I + J 1 of these residuals may be zero, and indeed will be zero if the solution is− unique. Clearly our estimate must be based on the remaining IJ I J + 2 residuals. Of these − − according to Theorem 3.4 at most N(I,J) can have been effected by outliers where N(I,J) is the number defined in (3.9). Let r r . . . r (1) ≤ (2) ≤ ≤ (IJ) denote the ordered residuals. We set IJ + I + J 1 s = r , L = − . (3.10) IJ (L) 2   It is easy to show that if the interactions are sufficiently large then this cannot come from an interaction cell. We use the scale to standardize the residuals and by means of simulations we can obtain the say 95% quantile of max r /s . This is then the bound for outliers or interactions. For any i,j | ij| IJ data set xij the residual in the cell (i, j) will be declared to be an outlier or an interaction if r qtw(0.95,I,J)S . | ij| ≥ IJ The values of qtw(0.95,I,J) can be approximated by simple functions just as in the one-way table. There is a problem however which must first be overcome. In general the L1–solution is not unique and although the sum of the absolute residuals is the same for different solutions the values of sIJ are not. This means that the procedure is not invariant to a permutation of rows and columns. There is no easy solution to this problem but it is only serious for small values of I and J. To reduce the effect we randomize the rows and 1 columns, calculate the new L -residuals and define sIJ to be the smallest value attained. We illustrate the method using a simple data set shown in the leftmost tableau of Table 12. The centre tableau shows the residuals of the standard analysis whilst the rightmost tableau shows the residuals of an L1–analysis. If we calculate the sum of squares of the least squares residuals and divide by the number of degrees of freedom 9-5=4 then the estimated standard deviation is 34.61. The largest standardized residual, that in cell (3,3) is 1.300 which is in no manner suspicious. The situation

449 413 326      6.3  20.3 -26.7      0  15  -5
409 358 291      9.7   8.7 -18.3      0   0   0
341 278 312    -16.0 -29.0  45.0      0 -12  89

Table 12: The first table shows the original data, the second the residuals after a standard L2 fit and the third the residuals of an L1 fit.

seems clear when one looks at the L1 residuals. The large residual in cell (3,3) does indeed look like an outlier or an interaction. However matters are not that simple. It is not easy to recognize an outlier with only nine observations when 5 parameters have been estimated. However we can simulate by taking 3 × 3 tables of white noise, standardizing the residuals by dividing by the scale s_{33} of (3.10) and then estimating, say, the 95% quantile of this statistic. If we do this the result is about 5.12. For the data of Table 12 we have s_{33} = 12 and consequently the largest standardized residual is 89/12 = 7.4167, so that there are grounds for regarding this observation as an outlier. If we eliminate it and perform a least squares analysis the outlying value is now about 94, which suggests perhaps a transcription error.
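The two fits for the data of Table 12 can be reproduced as follows; this is a minimal sketch in which the L1 fit is obtained from the quantreg package (an assumption — any L1 solver will do, and because of non-uniqueness the residuals may differ slightly from those of Table 12).

x   <- c(449, 413, 326, 409, 358, 291, 341, 278, 312)
row <- factor(rep(1:3, each = 3))
col <- factor(rep(1:3, times = 3))

r.ls <- residuals(lm(x ~ row + col))                        # least-squares residuals
r.l1 <- residuals(quantreg::rq(x ~ row + col, tau = 0.5))   # an L1 solution
round(matrix(r.ls, 3, 3, byrow = TRUE), 1)
round(matrix(r.l1, 3, 3, byrow = TRUE), 1)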

The L1 solution is related to Tukey's median polish. We start with the rows (or columns) and subtract the median. We then subtract the median from the columns and continue in this fashion until, hopefully, convergence. Table 13 shows the result of applying the median polish to the data of Table

36   0 -87 | 413       0   0 -20 | 413       0   0 -20 | 413
51   0 -67 | 358      15   0   0 | 358      15   0   0 | 358
29 -34   0 | 312      -7 -34  67 | 312       0 -27  74 | 305
                      36   0 -67            36   0 -67

Table 13: The first three sweeps of Tukey's median polish for the data of Table 12. Convergence is attained after the third sweep.

12. Convergence is reached after three sweeps and at each stage, if the median of the row or column in non-zero the sum of the absolute values of the residuals is decreased. However even if convergence takes place the the result does not converge to an L1 solution. This is shown by the rightmost tableau of Table 13. The sum of the absolute residuals is 136 whereas the sum for the L1 solution, the rightmost tableau of Table 12 is 121. The median polish may not converge and as just shown, even if it does converge it does not in general converge to an L1 solution. Finally, in common with the L1 solution, it is not in general permutation invariant. As shown above, the L1 solution can be used to identify outliers or in- teractions and in this respect it is optimal as in can identify all sufficiently large interactions whose interaction pattern is unconditionally identifiable. It is however not suitable for providing confidence intervals of parameters of

103 interest. The L1 solution minimizes

I J x a b | ij − i − j| i=1 j=1 X X and the obvious generalization would be to minimize

I J ρ(x a b ) (3.11) ij − i − j i=1 j=1 X X for some function ρ. If ρ is strictly convex and if we standardize by putting for instance a(1) = 0 then the solution is unique. If moreover ρ(u) A u for some constant A> 0 and for large u then it can be shown that Theorem≍ | | | | 3.5 holds for the solution. For such a ρ it is possible to calculate the solution by solving J ψ(x b a )=0 (3.12) ij − j − i j=1 X for ai with the corresponding equations for the bj. At each stage the sum in (3.11) will be decreased and so tend to a limit. For the ai and bj at the limit (3.12) will hold and the strict convexity now implies that this is the solution. The solution is however not scale equivariant and to make it so must (3.11) be replaced by

I J minimize ρ((x a b )/s ) (3.13) ij − i − j IJ i=1 j=1 X X where sIJ is a robust scale estimator. Ideally one would like to have a well- defined robust (against unconditionally identifiable interaction patterns) lo- cation functional which can be easily calculated. Unfortunately sIJ of (3.10) is in general not well defined due to the non-uniqueness of the L1 solutions. 1 Theoretically one could minimize sIJ over all L solutions bit it is not clear how to due this. Strategies such as taking differences of rows and columns can theoretically be made to work, or at least provide a scale functional which does not explode, but it may well implode in the presence of interactions. The problem is that taking differences may well increase the number of out- liers: the difference of two rows each with one outlier may have two outliers. Finally even if such a scale exists it is not clear how to obtain confidence intervals in the presence of interactions.

104 Addendum The following seems to be a reasonable and practical approach to the problem sketched above. Calculate the L1–norm of the residuals of an L1– solutions which we denote by ml1. These are all the same up to machine accuracy. Let sIL be as in (3.10). Consider the ρ–function with derivative ψc(x) = (exp(cx) 1)/(exp(cx) + 1) as in (1.34). We start with c = 1, calculate the solution− of (3.13) and then calculate the L1–norm of the resid- 1 1 1 −3 uals which we denote by l . If l > ml + 10 IJsIJ then we increase c by 1 1 −3 a factor of 1.5 and repeat. This is continued until l ml + 10 IJsIJ . As c increases the function ρ will approach A u for some≤ A > 0 and the | | solutions will tend to an L1–solutions. Thus for some c sufficiently large l1 ml1 + 10−3IJs will eventually hold and the solution will be unique. ≤ IJ There is still a slight conceptual problem in the non-uniqueness of sIJ but as it is only used as a measure of accuracy possibly different sIJ resulting from different L1–solutions will only lead to small differences in the final solution. The choice 10−3 is dictated by problems of machine accuracy. If there are different L1–solutions the function to minimize in (3.13) will be very flat at the minimum making it difficult to locate the minimum itself in spite of the strict convexity. The factor 10−3 seems to work very well. An example is shown in Table 14.

-0.15 -1.56  0.83  1.46 -0.26        0.29 -0.71  0.00  0.51 -0.01
-1.03 -0.47  1.21  1.85  0.14       -0.96  0.00  0.00  0.52  0.00
-0.31 -0.28 -0.70  1.09  0.87        0.00  0.43 -1.68  0.00  0.98
 0.84 -2.01 -1.22 -0.20  0.79        2.44  0.00 -0.90  0.00  2.19
-2.20  0.62  1.40  0.48 -1.95       -1.29  1.94  1.03  0.00 -1.24

 0.22 -1.22  0.00  0.00 -0.52        0.34 -0.91  0.01  0.19 -0.12
-0.53  0.00  0.51  0.52  0.00       -0.80 -0.10  0.11  0.30 -0.01
 0.00  0.00 -1.60 -0.43  0.54        0.00  0.17 -1.72 -0.38  0.81
 2.87  0.00 -0.39  0.00  2.19        2.69 -0.01 -0.69 -0.12  2.27
-0.86  1.94  1.54  0.00 -1.24       -0.92  2.05  1.36 -0.00 -1.04

Table 14: The original data are shown in the top-left tableau. The resid- uals from two different L1–solutions are shown top-right and bottom-left. The bottom-right tableau show the result of the procedure described in the addendum.

4 Linear Regression

4.1 The classical theory

4.1.1 The straight line

Figure 28 shows the simplest situation of a cloud of points in R^2 which may reasonably be approximated by a straight line and indeed were generated as such. The classical model is

Y(i) = a + b\,x(i) + ε(i),\quad 1 \le i \le n,   (4.1)

where the ε(i) are Gaussian white noise with variance σ^2. The word "linear" refers to the coefficients a and b in (4.1) rather than to the term x. Thus

Y(t) = a + b\log(x(t)) + σε(t),\quad 1 \le t \le n,   (4.2)

is also a linear model and Figure 29 shows data generated by this model.


Figure 28: A cloud of points in R^2 with an approximating straight line in black and the generating line according to (4.1) in red.

The theory of the model (4.1) is relatively simple. The least squares or, in the Gaussian case, maximum likelihood estimator is

\operatorname*{argmin}_{a,\,b} \sum_{i=1}^n (y(i) - a - b\,x(i))^2.   (4.3)


Figure 29: A cloud of points in R^2 with an approximating log-linear function in black and the generating function according to (4.2) in red.

and the solution is given by

1 n x(i)Y (i) 1 n x(i) 1 n Y (i) ˆ n i=1 n i=1 n i=1 b = − 2 (4.4) 1 n x(i)2 1 n x(i) P n i=1 −Pn i=1 P 1 n 1 n aˆ = Y (Pi) ˆb x(i)P  (4.5) n − n i=1 i=1 X X as can be quickly verified. The results for the second model (4.2) follow on simply replacing x(i) by log(x(i)). On substituting (4.1) in (4.4) and (4.5) we obtain 1 n x(i)(ε(i) 1 n ε(i)) ˆ n i=1 n i=1 b = b + − 2 1 n x(i)2 1 n x(i) nP i=1 − n Pi=1 1 n ε(i) 1 n x(i)2 1 n x(i) 1 n x(i)ε(i) aˆ = a + n Pi=1 n i=1 P − n  i=1 n i=1 . 1 n x(i)2 1 n x(i) 2 P n P i=1 − nP i=1 P As the ε(i) are Gaussian whiteP noise with varianceP σ2 it follows thata ˆ ∼

107 N(a, Σ(a)2) and ˆb N(b, Σ(b)2) where ∼ 2 1 n 2 σ n i=1 x(i) Σ(a)2 = E((ˆa a)2)= 1+ (4.6)  n n 2  − n 1 x(i)P2 1  x(i) n i=1 − n i=1   2  P P   2 E ˆ 2 σ 1 Σ(b) = ((b b) )= 2 . (4.7) − n 1 n x(i)2 1 n x(i) ! n i=1 − n i=1 The residuals r(i), i =1,...,n are givenP by P  R(i)= Y (i) aˆ ˆb x(i) − − which suggests 1 n σˆ2 = R(i)2 (4.8) n 2 i=1 − X as an estimate of σ. If the factor n 2 in (4.8) is replaced by n then we get − the maximum likelihood estimator for σ based on the Gaussian model. It can be shown that (n 2)ˆσ2/σ2 has a χ2–squared distribution with n 2 degrees − − of freedom. Moreoverσ ˆ is distributed independently ofa ˆ and ˆb. From this it follows that Σ(a)−1(ˆa a)/σˆ and Σ(b)−1(ˆb b)/σˆ both have a t–distribution with n 2 degrees of freedom.− This allows− simple test for various values of a − and b. Often one is interested in whether the co-variable b has any influence at all. Under the null hypothesis of not influence H0 : b = 0 it follows that Σ(b)−1ˆb/σˆ has a t–distribution with n 2 degrees of freedom. For a test of size 0.05 this leads to the critical value− of qt(0.975, n 2) and the null hypothesis is rejected if − ˆb qt(0.975, n 2)Σ(b)ˆσ. | | ≥ − 4.1.2 The general case Apart from generality and greater flexibility the general case has the advan- tage of clarity: it is easier to understand and interpret. We have observations ⊤ ⊤ (yi, xi ), I =1,...,n, where xi =(x1i,...,xik) is the row vector of k covari- ates. The model is k y = x β + ε , 1 i n, (4.9) i ij j i ≤ ≤ j=1 X 2 where again the εi denote Gaussian white noise with variance σ . This can be written in matrix form as y = Xβ + ε (4.10)

108 where y1 x11 x12 ... xik y2 x21 x22 ... x2k y =  .  , X =  . . . .  . . . . .      yn   xn1 xn2 ... xnk      and     β1 ε1 β2 ε2 β =  .  , ε =  .  . . .      βk   εn      The simple straight line model of Section 4.1.1 can be written in this form by putting k =2, x1j = 1 and x2j = xj, j =1, . . . n. The least squares estimator which is, as always, the maximum likelihood estimate in the Gaussian model can be written as

n k minimize (y x β )2 = y Xβ 2 =(y Xβ)⊤(y Xβ). β i − ij j k − k − − i=1 j=1 X X By differentiating we obtain

⊤ −1 ⊤ βLS =(X X) X y. (4.11)

This requires that X⊤X be a non-singular matrix which is the case if the columns of X are linearly independent. If this is not so, some column depend linearly on the remaining and may be removed. That the solution to the least squares problem is given by (4.11) can also be seen as follows

(y Xβ)⊤(y Xβ) − − = (y Xβ X(β β ))⊤ +(y Xβ X(β β )) − LS − − LS − LS − − LS = (y Xβ⊤ )(y Xβ ) 2(y Xβ⊤ )X(β β ) (4.12) − LS − LS − − LS − LS +(X(β β ))⊤X(β β ) − LS − LS Now

(y Xβ )⊤X = (y X(X⊤X)−1X⊤y)⊤X = y⊤(X X(X⊤X)−1 − LS − − = y⊤(I X(X⊤X)−1X⊤)X = Y ⊤(X X)=0 − − and on using this in (4.12) we obtain

y Xβ 2 = y Xβ 2 + X(β β ) 2 (4.13) k − k k − LSk k − LS k 109 from which it clearly follows that the sum of squared deviations is minimized by βLS. This has a geometric interpretation. We have

⊤ −1 XβLS = X(X X) Xy = Py with P = X(X⊤X)−1X⊤. Now P ⊤ = P and

P 2 = X(X⊤X)−1X⊤X(X⊤X)−1X⊤ = X(X⊤X)−1X⊤ = P so that P is a projection operator. In fact P = X(X⊤X)−1X⊤ projects onto the k dimensional linear subspace of Rn spanned by the columns of X. We have y =(I P )y + Py − and as (I P )⊤P = (I P )P = P P 2 = P P = 0 we see that (4.13) is nothing− more than Pythagorus’s− theorem.− On− substituting y = Xβ + ε in (4.11) we obtain ⊤ −1 βLS = β +(X X) Xε (4.14) and using the fact that the ε are white Gaussian noise with variance σ2 we see β N(β, σ2(X⊤X)−1). (4.15) LS ∼ The residuals are given by

r = y Xβ = y X(X⊤X)−1Xy =(I P )y =(I P )ε − LS − − − and hence r 2 = ε⊤(I P )ε. (4.16) k k − It can be shown that r 2/σ2 has a χ2–squared distribution with n k k k − degrees of freedom and is distributed independently of βLS. This allows the usual tests of hypotheses and the construction of confidence intervals. In particular we have Σ−1(β β )/S t (4.17) ii LS,i − i n−k ∼ n−k 2 ⊤ −1 2 2 where Σii is the ith diagonal element of (X X) and Sn−k = r /(n k). If k k − β b Σ S qt(0.975, n k) (4.18) | LS,i − i| ≥ ii n−k − then the null hypothesis H0 : βi = bi is rejected. A 0.95–confidence interval for βi is given by

[β Σ S qt(0.975, n k), β + Σ S qt(0.975, n k)]. (4.19) LS,i − ii n−k − LS,i ii n−k −

We can also conclude that

(β_{LS} - β)^⊤ (X^⊤X) (β_{LS} - β) \sim σ^2 χ^2_k   (4.20)

so that

\frac{1}{k} (β_{LS} - β)^⊤ (X^⊤X) (β_{LS} - β) / S^2_{n-k} \sim F_{k, n-k}.   (4.21)

In particular the null hypothesis of zero influence of the covariates, H_0 : β = 0, can be tested using the F-test

\frac{1}{k} β_{LS}^⊤ (X^⊤X) β_{LS} / S^2_{n-k} \ge F_{k, n-k}(0.95).   (4.22)
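As a check, the closed form (4.11) and the standard errors of (4.17) can be compared with the output of lm; a minimal sketch on simulated data:

set.seed(2)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))            # design: intercept and two covariates
y <- as.vector(X %*% c(1, 2, -1) + rnorm(n))

beta.ls <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^{-1} X'y, equation (4.11)
fit <- lm(y ~ X - 1)
cbind(beta.ls, coef(fit))                    # identical up to rounding

s2 <- sum(residuals(fit)^2) / (n - ncol(X))  # S^2_{n-k}
sqrt(diag(solve(t(X) %*% X)) * s2)           # standard errors as in (4.17)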

4.1.3 The two-way table The two-way analysis of variance can be treated as a linear regression problem as follow. We consider the case without interactions but with an arbitrary number of observations in each cell. To put it in the form y = Xβ we must first decide on some way of ordering the yi. We start with start with yijk and let the third index change fastest followed by the second. The order is

y111,y112,...y11K11 ,y121,y122,...,y12K12 ,...yIJKIJ . As already mentioned in the section on the two-way table the model is not identifiable and one simple way of forcing identifiability is simple to eliminate a1 by setting a1 =0. The first column of the X–matrix corresponding to a2, the second to a , the Ith column to b and the (I + J 1)th column to b . 3 1 − J If yijk occurs in row t then the corresponding tth row of X is as follows. If i = 1 then the (I + j 1)th element of the row is 1 and the remainder are zero. If i 2 the the− (i 1)th and the (I + j 1)th elements of the row ≥ − − are 1 and the remainder are zero. This guarantees that the matrix X⊤X is non-singular so that all coefficients can by estimated. Under this convention the vector β is given by ⊤ β =(a2,...,aI , b1,...,bJ ). With somewhat more effort interactions as defined in the classical manner can be included as follows. To ensure that X⊤X is non-singular we put c11,c21 = . . . = cI1 = 0 and c11 = c12 = . . . = c1J = 0. The X matrix is extended so that the (I + J 1+ i 1+(j 2)(I 1))th column refers to − − − − cij. If yijk is on row t then for 2 i I and 2 j J the first I + J 1 values are as before and additionally≤ the≤ (I + J ≤1+≤i 1+(j 2)(I 1))th− − − − − element is 1, the remainder being 0. The vector β is given by ⊤ β =(a2,...,aI , b1,...,bJ ,c22,c32,...cI2,c23,...,cIJ ). All tests and confidence intervals can now be carried out as in the general linear regression model.
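In practice the design matrix of the two-way table need not be coded by hand; R's model.matrix produces an equivalent parametrization (a minimal sketch — note that the default treatment contrasts drop a_1 and b_1 rather than following the convention described above).

dat <- expand.grid(A = factor(1:3), B = factor(1:4))
dat$y <- rnorm(nrow(dat))
head(model.matrix(~ A + B, data = dat))    # columns: intercept, A2, A3, B2, B3, B4
head(model.matrix(~ A * B, data = dat))    # adds the interaction columns
coef(lm(y ~ A + B, data = dat))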

4.2 Breakdown and equivariance

4.2.1 Outliers and least-squares

Generate data according to

Y (i)=5+0.1x1(i)+5x2(i)+ Z(i), i =1,..., 100, (4.23)

⊤ with x1 = (1, 2,..., 100) and x2 100 i.i.d. N(0, 1) random variables. The standard least-squares estimates are

4.747357, 0.104378, 4.915645 respectively. The plot of the residuals is shown in Figure 30.


Figure 30: Plot of least-squares residuals for data generated according to (4.23).

We now generate outliers by putting

x1(5(2 : 10)) = 2x1(5(2 : 10)), x2(5(2 : 10)) = 25x2(5(2 : 10)). (4.24)

The least squares estimates are

4.91767, 0.10334, 0.31907 respectively. The plot of the residuals is shown in Figure 31. Although the standard deviation of the residuals has increased quite considerably to 4.91895 the residual plot does not show any outliers.



Figure 31: Plot of least-squares residuals for data generated according to (4.23) and outliers generated as in (4.24).

This is known as the masking effect. The outliers alter the least-squares coefficients to such a degree that they are no longer detectable from a plot of the residuals. If we calculate the residuals based on the coefficients used to generate the data (4.23) then the outliers are clearly visible. This is shown in Figure 32.
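The experiment of (4.23) and (4.24) is easily repeated in R; the seed, and hence the exact estimates, will differ from the values quoted above.

set.seed(3)
x1 <- 1:100
x2 <- rnorm(100)
y  <- 5 + 0.1 * x1 + 5 * x2 + rnorm(100)
coef(lm(y ~ x1 + x2))                 # close to (5, 0.1, 5)

idx <- 5 * (2:10)                     # observations 10, 15, ..., 50
x1[idx] <- 2  * x1[idx]
x2[idx] <- 25 * x2[idx]
fit <- lm(y ~ x1 + x2)
coef(fit)                             # the coefficient of x2 collapses
plot(residuals(fit))                  # the outliers are masked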

4.2.2 Equivariance Consider the linear regression problem with data (y,X) with y Rn and X an n k–matrix. The ordinary least–squares regression functiona∈ l T is × LS given by ⊤ −1 ⊤ TLS((y,X))=(X X) X y. Suppose now y αy with α =0. Then → 6 ⊤ −1 ⊤ TLS((αy,X)) = (X X) X αy = α(X⊤X)−1X⊤y

= αTLS((y,X)).

Similarly if y y + Xγ then → ⊤ −1 ⊤ TLS((y + Xγ,X)) = (X X) X (y + Xγ) = (X⊤X)−1X⊤y + γ

= TLS((y,X))+ γ.

113 residuals −100 −50 0 50 100 150

0 20 40 60 80 100

Figure 32: Plot of the residuals calculated using the coefficients in (4.23) but with the outliers in the outliers generated as in (4.24).

114 Finally if Γ is a non-singular k k–matrix then × ⊤ −1 ⊤ TLS((y,XΓ)) = ((XΓ) XΓ) (XΓ) y = (Γ⊤X⊤XΓ)−1Γ⊤X⊤y = Γ−1(X⊤X)−1(Γ⊤)−1Γ⊤X⊤y = Γ−1(X⊤X)−1X⊤y −1 = Γ TLS((y,X)).

In general (y,X) (αy + Xγ,XΓ) (4.25) → then −1 TLS(αy + Xγ,XΓ) = γ +(α +Γ )TLS(y,X). (4.26) The set of operations (4.25) form a group and the corresponding change in the parameter β as in (4.26) is the corresponding equivariance structure.

Definition 4.1. A regression functional TR is called regression equivariant if −1 TR(αy + Xγ,XΓ) = γ +(α +Γ )TR(y,X) for all α =0, for all γ Rk and for all non-singular k k–matrices Γ. 6 ∈ × The L1–regression functional

n T (y, x) = argmin y (Xβ) (4.27) L1 β | i − i| i=1 X is regression equivariant. Remark 4.2. We have defined the regression functionals on the sample space and not on the space of probability distributions. This can be done as follows. Let ′ be a family of probability measures on R Rk. The group G of trans- formationsP on the sample space is as before ×

g((y, x⊤)⊤)=(αy + x⊤γ, (Γx)⊤)⊤ (4.28) where as before α = 0,γ Rk and Γ a non-singular k k–matrix. We 6 ∈ × suppose that ′ is closed under the group operation, P g ′ for all P ′. P ′ k ∈ P ∈ P a regression functional TR is a mapping TR : R which is equivariant under the group operation: if g G is given byP (4.28)→ then for all P ′ ∈ ∈ P g −1 TR(P )= γ +(α +Γ )TR(P ). (4.29)

115 ⊤ ⊤ Given data (yi, xi ) , i =1,...,n, we define the corresponding empirical mea- sure as before by 1 n Pn = δ ⊤ ⊤ n (yi,xi ) i=1 X and the definition of equivariance reduces to what we had before.

Remark 4.3. The least-squares estimator TLS can be defined as

T (P)= argmin (y x⊤β)2 dP(y, x). LS β − Z This requires ′ = P : z 2 dP(z) < . P { k k ∞} Z Similarly the L1 functional TL1 is defined by

T (P)= argmin y x⊤β dP(y, x) L1 β | − | Z and requires ′ = P : z dP(z) < . P { k k ∞} Z 4.2.3 Breakdown

⊤ ⊤ We consider only the finite sample breakdown point. Given a sample (yi, xi ) , i = 1,...,n, of size n we denote the corresponding empirical measure by Pn. We are now allowed to alter m of the observations to give a sample ′ ′⊤ ⊤ (yi, xi ) , i =1,...,n, with

i :(y′, x′⊤)⊤ =(y , x⊤)⊤ = m. |{ i i 6 i i }| Pm We denote the corresponding empirical distribution by n and define the ∗ finite sample breakdown point ǫ (TL, Pn) by

∗ P m Pm P ǫ (TL, n) = min : sup TR( n ) TR( n) = . (4.30) m Pm n n k − k ∞ n o We note that in the definition of breakdown point the covariate xi can also be changed and not only the values of the yi. We have the following theorem.

116 ⊤ ⊤ Theorem 4.4. For a data set (yi, xi ) , i =1,...,n, with empirical measure Pn we define ∆(Pn) = max Pn(L) L where L is any linear subspace of Rk+1 of dimension at most k then for an equivariant regression functional TR we have n(1 ∆(P ))+1 ǫ∗(T , P ) − n n. R n ≤ 2 j k. Remark 4.5. We always have ∆(Pn) k/n. If ∆(Pn)= k/n then the points are said to be “in general position”. ≥ ∗ Example 4.6. For TR = TLS we clearly have ǫ (TLS, Pn)=1/n. This follows ⊤ −1 ⊤ from the representation TLS(Pn)=(X X) X y. We keep X fixed and let y1 for example tend to infinity. ∗ P Example 4.7. For TR = TL1 ,k =1, we also have ǫ (TL1 , n)=1/n. This is perhaps not so obvious as in the location case minimizing the L1–norm gives the median which has the highest possible breakdown point. We write

n s(β)= y′ βx′ + y βx | 1 − 1| | i − i| i=2 X ′ where we alter the observation first observation. Suppose now that x1 > n x and that β is such that y′ > βx′ . Then for η > 0 with y′ > (β+η)x′ i=2 | i| 1 1 1 1 we have P n s(β + η) s(β) ηx′ + y (β + η)x y βx − ≤ − 1 | i − i|−| i − i| i=2 Xn ηx′ + η x ≤ − 1 | i| i=2 < 0. X

∗ ′ ∗ ′ This implies that the solution β must satisfy y1 β x1. Similarly it must also satisfy y′ β∗x′ and hence the L –norm minimizer≤ is β∗ = y′ /x′ . 1 ≥ 1 1 1 1 This leads to the question as to whether there exists a regression func- tional with a finite sample breakdown point given by the upper bound in Theorem 4.4.

117 4.3 The Least Median of Squares (LMS) The genesis of this idea is Tukey’s shorth. It was applied to the regression ⊤ ⊤ problem by Hampel who proposed the following. Given data (yi, xi ) , i = 1,...,n and for a given β we form the residuals

r (β)= y x⊤β, i =1,...,n. i i − i Hampel proposed determining β as the solution of the problem

argmin med r (β) ,..., r (β) . (4.31) β {| 1 | | n |} The connection with Tukey’s shorth is clear. In the case of simple regression y = a+bx we take parallel lines a vertical distance of h apart and then choose the lines to minimize h under the restriction that at least n/2 +1 of the data ⌊ ⌋ points (yi, xi) lie on or between the two lines. The method was propagated by Rousseeuw who introduced the name “least median of squares”. Clearly (4.31) is equivalent to

argmin med r (β) 2,..., r (β) 2 . (4.32) β {| 1 | | n | } The use of squares was to relate it to the usual least-squares method. We denote Hampel’s functional by TLMS. We have the following theorem. Theorem 4.8. The finite sample breakdown point of Hampel’s regression functional TLMS is

ǫ∗(T , P )=( n/2 n∆(P )+2)/n. (4.33) LMS n ⌊ ⌋ − n

Remark 4.9. The finite sample breakdown point of tLMS as defined by (4.31) is slightly lower than the upper bound of Theorem 4.4. The upper bound can however be attained by a slight variation of TLMS. We define it by

TLMS = argmin β dq(β) (4.34) where

d (β) = r (β) = y x⊤β i | i | | i − i | d (β) d (β) . . . d (β) (1) ≤ (2) ≤ ≤ (n) q = n/2 + (n∆(P )+1)/2 . ⌊ ⌋ ⌊ n ⌋ With this choice of quantile we have n(1 ∆(P ))+1 ǫ∗(T , P )= − n n. LMS n 2 j k. 118 No functional is of use unless it can be calculated at least approximately: We now turn to this problem for the case of simple regression y = a + bx. Suppose the solution is aLMS and bLMS and let med r1(β) ,..., rn(β) = h. Then at least (n/2) + 1 observations lie between the{| two| lines| | ⌊ ⌋ y = a + b x + h, y = a + b x h. LMS LMS LMS LMS − Suppose that no data point lies on any of the two lines. then it is clear that we can reduce h until one data point is either on the upper or lower line. If there is one on the upper lines but none on the lower than we can move the lower lines upwards which will decrease the perpendicular distance between the two lines. we continue until we hit a data point. at this stage we have a solution with at least one point on the upper line and one point on the lower line. We now rotate both lines through the same angle. This does not alter the perpendicular distance. We keep doing this until one of the lines hits another data point. Whether this point was initially between the lines or is a point from outside the distance between the two lines has not decreased. From this it is clear that a solution can always be put into this form, at least two data points on one line and at least one data point on a parallel line. This suggests the following algorithm. Select all pairs of different points and for each pair calculate the line through them. Calcu- late the residuals from this line and then work out how much the line must be moved upwards or downwards so the n/2 + 1 data points lie between the two lines. This gives the vertical distance⌊ ⌋ between the lines. Finally that pair which minimize the vertical distance can be used to calculate the LMS line. This algorithm is of oder O(n3) because we have O(n2) pairs of points and calculating the required parallel lines is of order O(n). This is one of the weaknesses of the LMS. In k dimensions the complexity is of order O(nk+1) which makes an exact solution impractical for all but small k and n. There are good heuristic algorithms which are based on a random section of points and also more sophisticated ideas but in the last analysis they cannot be guaranteed to give the correct answer. Other weakness of the LMS and other available high-breakdown functionals is that they are inherently unsta- ble. Just as the shorth may not have a unique solution, neither does LMS. Even if the solution is unique and continuous at some distribution it will be discontinuous at some other distribution arbitrarily close. All of which means that such high breakdown methods are not ideally suited to the construction of confidence intervals or other more delicate forms of statistical analysis. On the other hand they are extremely useful in a data analytical sense. It is good statistical practice to always run a robust regression alongside a least squares regression. If the results are similar then one may feel somewhat happier about the linear model. if they are dissimilar, then the data exhibit

119 residuals −100 −50 0 50 100 150

0 20 40 60 80 100

Figure 33: Plot of the residuals calculated from the LMS fit with the data generated as in (4.23) and with the outliers generated as in (4.24). a structure not captured by the linear model. R has a robust library:

> library(lqs) > tmp<-lmsreg(tmpy~I(tmpx1)+I(tmpx2)) > tmp$coef (Intercept) I(tmpx1) I(tmpx2) 4.76729825 0.09963781 5.07685616 >plot(tmp$res) Figure 33 shows the residuals from the LMS fit. The outliers are clearly visible and the coefficients close to the one used to generate the data.

120 References

[Barnett and Lewis, 1994] Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data. Wiley, Chichester, third edition.

[Davies and Gather, 2005] Davies, P. L. and Gather, U. (2005). Breakdown and groups (with discussion). Annals of Statistics, 33.

[Donoho and Huber, 1983] Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. In Bickel, P. J., Doksum, K. A., and Hodges, J. L., Jr., editors, A Festschrift for Erich L. Lehmann, pages 157–184, Belmont, California. Wadsworth.

[Hampel, 1968] Hampel, F. R. (1968). Contributions to the theory of robust estimation. PhD thesis, University of California, Berkeley.

[Hampel, 1974] Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69:383– 393.

[Hampel, 1985] Hampel, F. R. (1985). The breakdown points of the mean combined with some rejection rules. Technometrics, 27:95–107.

[Hampel et al., 1986] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.

[Huber, 1964] Huber, P. J. (1964). Robust estimation of a location parame- ter. Annals of , 35:73–101.

[Huber, 1981] Huber, P. J. (1981). Robust Statistics. Wiley, New York.

[Massart, 1990] Massart, P. (1990). The tight constant in the Dvoretzky- Kiefer-Wolfowitz inequality. Annals of Probability, 18(3):1269–1283.
