ANOVA, Regression, Correlation
252regr 2/26/07 (Open this document in 'Outline' view!) Roger Even Bove
G. LINEAR REGRESSION - Curve Fitting

1. Exact vs. Inexact Relations

2. The Ordinary Least Squares Formula

We wish to estimate the coefficients in $Y = \beta_0 + \beta_1 X$. Our 'prediction' will be $\hat{Y} = b_0 + b_1 X$ and our error will be $e = Y - \hat{Y}$, so that $Y = b_0 + b_1 X + e$. (See the appendix for a derivation.) The least squares formulas are
$$b_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad b_0 = \bar{Y} - b_1\bar{X}.$$
3. Example

    i     Y     X    XY    X^2   Y^2
    1     0     0     0     0     0
    2     2     1     2     1     4
    3     1     2     2     4     1
    4     3     1     3     1     9
    5     1     0     0     0     1
    6     3     3     9     9     9
    7     4     4    16    16    16
    8     2     2     4     4     4
    9     1     2     2     4     1
   10     2     1     2     1     4
  sum    19    16    40    40    49

First copy $n = 10$, $\sum X = 16$, $\sum Y = 19$, $\sum XY = 40$, $\sum X^2 = 40$ and $\sum Y^2 = 49$. Then compute the means:
$$\bar{X} = \frac{\sum X}{n} = \frac{16}{10} = 1.60, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{19}{10} = 1.90.$$
Use these to compute the 'spare parts':
$$SS_x = \sum X^2 - n\bar{X}^2 = 40 - 10(1.60)^2 = 14.40$$
$$SS_y = \sum Y^2 - n\bar{Y}^2 = 49 - 10(1.90)^2 = 12.90 = SST \ \text{(Total Sum of Squares)}$$
$$S_{xy} = \sum XY - n\bar{X}\bar{Y} = 40 - 10(1.60)(1.90) = 9.60.$$
Note that $SS_x$ and $SS_y$ must be positive, while $S_{xy}$ can be either positive or negative. We can compute the coefficients:
$$b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{9.60}{14.40} = 0.6667, \qquad b_0 = \bar{Y} - b_1\bar{X} = 1.90 - 0.6667(1.60) = 0.8333.$$
So our regression equation is $\hat{Y} = 0.8333 + 0.6667X$, or $Y = 0.8333 + 0.6667X + e$.
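The arithmetic above is easy to mis-key, so it helps to check it by machine. Below is a minimal Python sketch (assuming NumPy is available) that reproduces the spare parts and the coefficients from the ten data points; the variable names SSx, SSy, and Sxy simply mirror the notation of these notes and are not from any library.

```python
import numpy as np

# The ten (X, Y) pairs from the table above.
X = np.array([0, 1, 2, 1, 0, 3, 4, 2, 2, 1], dtype=float)
Y = np.array([0, 2, 1, 3, 1, 3, 4, 2, 1, 2], dtype=float)
n = len(X)

xbar, ybar = X.mean(), Y.mean()           # 1.60 and 1.90

# The 'spare parts'.
SSx = (X ** 2).sum() - n * xbar ** 2      # 14.40
SSy = (Y ** 2).sum() - n * ybar ** 2      # 12.90 (= SST)
Sxy = (X * Y).sum() - n * xbar * ybar     # 9.60

b1 = Sxy / SSx                            # 0.6667
b0 = ybar - b1 * xbar                     # 0.8333
print(b0, b1)                             # Y-hat = 0.8333 + 0.6667 X
```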
4. $R^2$, the Coefficient of Determination

$SST = SSR + SSE$, where $SSR = b_1 S_{xy}$ is the Regression (Explained) Sum of Squares and $SSE = SST - SSR$ is the Error (Unexplained or Residual) Sum of Squares. $SSE$ is defined as $\sum (Y - \hat{Y})^2$, a formula that should never be used for computation.
$$R^2 = \frac{SSR}{SST} = \frac{b_1 S_{xy}}{SS_y} = \frac{S_{xy}^2}{SS_x SS_y} = \frac{(9.6)^2}{(14.40)(12.90)} = .4961$$
An alternate formula, if no spare parts have been computed, is
$$R^2 = \frac{b_0 \sum Y + b_1 \sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}.$$
$R^2 = r^2$: the coefficient of determination is the square of the correlation. Note that $S_{xy}$, $b_1$, and $r$ all have the same sign.
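To see the decomposition and the relation $R^2 = r^2$ numerically, here is a short sketch continuing the same example (plain Python; variable names are mine):

```python
import math

# Spare parts from the example in G3.
SSx, SSy, Sxy = 14.40, 12.90, 9.60
b1 = Sxy / SSx

SST = SSy                         # total sum of squares, 12.90
SSR = b1 * Sxy                    # explained: 6.400
SSE = SST - SSR                   # unexplained: 6.500
R2 = SSR / SST                    # 0.4961

r = Sxy / math.sqrt(SSx * SSy)    # sample correlation; same sign as Sxy and b1
print(R2, r ** 2)                 # the two agree: R-squared = r-squared
```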
H. LINEAR REGRESSION - Simple Regression

1. Fitting a Line

2. The Gauss-Markov Theorem: OLS is BLUE (the Best Linear Unbiased Estimator).

3. Standard Errors

The standard error is defined as
$$s_e^2 = \frac{SSE}{n-2} = \frac{\sum (Y - \hat{Y})^2}{n-2}.$$
Equivalently,
$$s_e^2 = \frac{SSE}{n-2} = \frac{SST - SSR}{n-2} = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{SS_y - b_1^2 SS_x}{n-2}.$$
Or, if no spare parts are available,
$$s_e^2 = \frac{\sum Y^2 - b_0 \sum Y - b_1 \sum XY}{n-2}.$$
Note also that if $R^2$ is available,
$$s_e^2 = \frac{SS_y(1 - R^2)}{n-2}.$$
Using the data from G3 and our spare parts $SS_x = 14.40$, $SS_y = 12.90 = SST$, and $S_{xy} = 9.60$:
$$s_e^2 = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{(\sum Y^2 - n\bar{Y}^2) - b_1(\sum XY - n\bar{X}\bar{Y})}{n-2} = \frac{12.90 - 0.6667(9.60)}{8} = 0.8125.$$
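The equivalent formulas for $s_e^2$ can be checked against each other. Here is a small sketch using the raw sums and spare parts from G3 (variable names are mine):

```python
# Raw sums, spare parts, and coefficients from the example in G3.
n, sumY, sumXY, sumY2 = 10, 19.0, 40.0, 49.0
SSx, SSy, Sxy = 14.40, 12.90, 9.60
b1 = Sxy / SSx                    # 0.6667
b0 = 1.90 - b1 * 1.60             # 0.8333
R2 = Sxy ** 2 / (SSx * SSy)       # 0.4961

se2_spare = (SSy - b1 * Sxy) / (n - 2)                  # spare-parts form
se2_raw = (sumY2 - b0 * sumY - b1 * sumXY) / (n - 2)    # raw-sum form
se2_r2 = SSy * (1 - R2) / (n - 2)                       # R-squared form
print(se2_spare, se2_raw, se2_r2)                       # all about 0.8125
```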
4. The Variances of $b_0$ and $b_1$

$$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar{X}^2}{SS_x}\right) \qquad \text{and} \qquad s_{b_1}^2 = s_e^2\left(\frac{1}{SS_x}\right)$$
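A quick sketch of these two variance formulas, using the numbers from the example in G3 (the same values are worked out by hand in section I below):

```python
import math

# Values from the example in G3; se2 was computed in H3 above.
n, xbar = 10, 1.60
SSx, se2 = 14.40, 0.8125

sb1_sq = se2 / SSx                            # about 0.0564
sb0_sq = se2 * (1.0 / n + xbar ** 2 / SSx)    # about 0.2257
# About 0.2375 and 0.4751; the notes round to 0.2374 and 0.4749.
print(math.sqrt(sb1_sq), math.sqrt(sb0_sq))
```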
I. LINEAR REGRESSION - Confidence Intervals and Tests
1. Confidence Intervals for $\beta_1$
$$\beta_1 = b_1 \pm t_{\alpha/2}^{(n-2)} s_{b_1}, \qquad df = n - 2.$$
The interval can be made smaller by increasing either $n$ or the amount of variation in $x$.
2. Tests for $\beta_1$
To test $H_0\!: \beta_1 = \beta_{10}$ against $H_1\!: \beta_1 \ne \beta_{10}$, use
$$t = \frac{b_1 - \beta_{10}}{s_{b_1}}.$$
Remember that $\beta_{10}$ is most often zero; if the null hypothesis is false in that case, we say that $\beta_1$ is significant.
To continue the example in G3:
$$R^2 = \frac{S_{xy}^2}{SS_x SS_y} = \frac{(9.6)^2}{(14.40)(12.90)} = .4961$$
or, using the alternate formula,
$$R^2 = \frac{b_0 \sum Y + b_1 \sum XY - n\bar{Y}^2}{SS_y} = \frac{0.8333(19) + 0.6667(40) - 10(1.90)^2}{12.90} = .4961.$$
$$SSR = b_1 S_{xy} = 0.6667(9.60) = 6.400.$$
We have already computed $s_e^2 = 0.8125$, which implies that
$$s_{b_1}^2 = s_e^2\left(\frac{1}{SS_x}\right) = 0.8125\left(\frac{1}{14.40}\right) = 0.0564 \qquad \text{and} \qquad s_{b_1} = \sqrt{0.0564} = 0.2374.$$
The significance test is now $H_0\!: \beta_1 = 0$ against $H_1\!: \beta_1 \ne 0$ with $df = n - 2 = 10 - 2 = 8$. Assume that $\alpha = .05$, so that for a 2-sided test $t_{\alpha/2}^{(n-2)} = t_{.025}^{(8)} = 2.306$, and we reject the null hypothesis if $t$ is below -2.306 or above 2.306. Since
$$t = \frac{b_1 - 0}{s_{b_1}} = \frac{0.6667}{0.2374} = 2.809$$
is in the rejection region, we say that $\beta_1$ is significant. A further test says that $\beta_1$ is not significantly different from 1.
If we want a confidence interval, $\beta_1 = b_1 \pm t_{\alpha/2} s_{b_1} = 0.6667 \pm 2.306(0.2374) = 0.667 \pm 0.547$. Note that this interval includes 1, but not zero.

Note that since
$$s_{b_1}^2 = \frac{s_e^2}{SS_x} = \frac{s_e^2}{\sum X^2 - n\bar{X}^2} = \frac{s_e^2}{(n-1)s_x^2},$$
both a large sample size $n$ and a large variance of $x$ will tend to make $s_{b_1}^2$ smaller, and thus decrease the size of a confidence interval for $\beta_1$ or increase the size (and significance) of the t-ratio. To put it more negatively, small amounts of variation in $x$ or small sample sizes will tend to produce values of $b_1$ that are not significant. The common-sense interpretation of this statement is that we need a lot of experience with what happens to $y$ when we vary $x$ before we can put any confidence in our estimate of the slope of the equation that relates them.
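The slope test and interval above can be reproduced in a few lines. The sketch below assumes SciPy is available for the t critical value (scipy.stats.t.ppf); it also checks the claim that $b_1$ is not significantly different from 1:

```python
from scipy import stats

# Numbers from the example in G3 (sections H3 and I2 of these notes).
b1, sb1, n = 0.6667, 0.2374, 10
df = n - 2
t_crit = stats.t.ppf(0.975, df)   # 2.306 for alpha = .05, two-sided

t_zero = (b1 - 0) / sb1           # 2.809 -> |t| > 2.306: beta1 is significant
t_one = (b1 - 1) / sb1            # -1.404 -> |t| < 2.306: not significantly different from 1

half = t_crit * sb1               # 0.547
print(t_zero, t_one, (b1 - half, b1 + half))   # interval contains 1 but not 0
```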
3. Confidence Intervals and Tests for $\beta_0$
We are now testing $H_0\!: \beta_0 = \beta_{00}$ against $H_1\!: \beta_0 \ne \beta_{00}$ with
$$t = \frac{b_0 - \beta_{00}}{s_{b_0}}.$$
Here
$$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar{X}^2}{SS_x}\right) = 0.8125\left(\frac{1}{10} + \frac{(1.60)^2}{14.40}\right) = 0.8125(0.2778) = 0.2256, \qquad \text{so } s_{b_0} = \sqrt{0.2256} = 0.4749.$$
If we are testing $H_0\!: \beta_0 = 0$ against $H_1\!: \beta_0 \ne 0$,
$$t = \frac{b_0 - 0}{s_{b_0}} = \frac{0.8333}{0.4749} = 1.754.$$
Since the rejection region is the same as in I2, we accept the null hypothesis and say that $\beta_0$ is not significant. A confidence interval would be
$$\beta_0 = b_0 \pm t_{\alpha/2} s_{b_0} = 0.8333 \pm 2.306(0.4749) = 0.833 \pm 1.095.$$
A common way to summarize our results is $\hat{Y} = 0.8333 + 0.6667X$, written with the standard deviations below the corresponding coefficients: $(0.4749)$ under the intercept and $(0.2374)$ under the slope. For a Minitab printout example of a simple regression problem, see 252regrex1.
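A matching sketch for the intercept test and interval, again assuming SciPy for the critical value:

```python
from scipy import stats

# Numbers from the example in G3 (section I3 of these notes).
b0, sb0, df = 0.8333, 0.4749, 8
t_crit = stats.t.ppf(0.975, df)   # 2.306

t_stat = (b0 - 0) / sb0           # 1.754 -> inside (-2.306, 2.306): not significant
half = t_crit * sb0               # 1.095
print(t_stat, (b0 - half, b0 + half))   # interval includes zero
```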
4. Prediction and Confidence Intervals for $y$

The Confidence Interval is $\mu_{Y_0} = \hat{Y}_0 \pm t s_{\hat{Y}}$, where
$$s_{\hat{Y}}^2 = s_e^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x}\right),$$
and the Prediction Interval is $Y_0 = \hat{Y}_0 \pm t s_Y$, where
$$s_Y^2 = s_e^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x} + 1\right).$$
In these two formulas, for some specific $X_0$, $\hat{Y}_0 = b_0 + b_1 X_0$. For example, assume that $X_0 = 5$, so that for the results in G3, $\hat{Y}_0 = 0.8333 + 0.6667(5) = 4.168$. Then
$$s_{\hat{Y}}^2 = s_e^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x}\right) = 0.8125\left(\frac{1}{10} + \frac{(5 - 1.6)^2}{14.40}\right) = 0.733$$
and $s_{\hat{Y}} = \sqrt{0.733} = 0.856$, so that the confidence interval is $\mu_{Y_0} = \hat{Y}_0 \pm t s_{\hat{Y}} = 4.168 \pm 2.306(0.856) = 4.168 \pm 1.974$. This represents a confidence interval for the average value that $Y$ will take when $X = 5$. For the same data,
$$s_Y^2 = s_e^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x} + 1\right) = 0.8125\left(\frac{1}{10} + \frac{(5 - 1.6)^2}{14.40} + 1\right) = 1.545$$
and $s_Y = \sqrt{1.545} = 1.243$, so that the prediction interval is $Y_0 = \hat{Y}_0 \pm t s_Y = 4.168 \pm 2.306(1.243) = 4.168 \pm 2.866$. This is a confidence interval for the value that $Y$ will take in a particular instance when $X = 5$.
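Both intervals at $X_0 = 5$ can be reproduced as follows (plain Python; the name 'bracket' for the parenthesized factor is mine, not from the notes):

```python
import math

# Values from the example in G3.
n, xbar = 10, 1.60
SSx, se2 = 14.40, 0.8125
b0, b1, t_crit = 0.8333, 0.6667, 2.306

X0 = 5.0
Y0_hat = b0 + b1 * X0                             # 4.168

bracket = 1.0 / n + (X0 - xbar) ** 2 / SSx        # 0.9028
s_mean = math.sqrt(se2 * bracket)                 # 0.856: error of the mean response
s_pred = math.sqrt(se2 * (bracket + 1.0))         # 1.243: error of a single new value
print(Y0_hat, t_crit * s_mean, t_crit * s_pred)   # half-widths 1.974 and 2.866
```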
Ignore the remainder of this document unless you have had calculus!
Appendix to G2 - Explanation of the OLS Formula

Assume that we have three points: $(X_1, Y_1)$, $(X_2, Y_2)$ and $(X_3, Y_3)$. We wish to fit a regression line to these points, with the equation $\hat{Y} = b_0 + b_1 X$ and the characteristic that the sum of squares $SS = \sum (Y - \hat{Y})^2$ is a minimum. If we imagine that there is a 'true' regression line $Y = \beta_0 + \beta_1 X$, we can consider $b_0$ and $b_1$ to be estimates of $\beta_0$ and $\beta_1$.

Let us make the definition $e = Y - \hat{Y}$. Note that if we substitute our equation for $\hat{Y}$, we find that $e = Y - \hat{Y} = Y - (b_0 + b_1 X)$, or $Y = b_0 + b_1 X + e$. This has two consequences: first, the sum of squares can be written as $SS = \sum (Y - \hat{Y})^2 = \sum (Y - b_0 - b_1 X)^2 = \sum e^2$; and second, if we fit the line so that $\sum e = 0$, the mean of $Y$ and $\hat{Y}$ is the same and we have $\bar{Y} = b_0 + b_1\bar{X}$.

Now if we subtract the equation for $\bar{Y}$, $\bar{Y} = b_0 + b_1\bar{X}$, from the equation for $Y$, $Y = b_0 + b_1 X + e$, we find $Y - \bar{Y} = b_1(X - \bar{X}) + e$. Now let us measure $X$ and $Y$ as deviations from the mean, replacing $X$ with $\tilde{X} = X - \bar{X}$ and $Y$ with $\tilde{Y} = Y - \bar{Y}$. This means that $\tilde{Y} = b_1\tilde{X} + e$, or $e = \tilde{Y} - b_1\tilde{X}$. If we substitute this expression into our sum of squares, we find that
$$SS = \sum e^2 = \sum (\tilde{Y} - b_1\tilde{X})^2.$$
Now write this expression out in terms of our three points and differentiate it to minimize $SS$ with respect to $b_1$. To do this, recall that $b_1$ is our unknown and that the $\tilde{X}$s and $\tilde{Y}$s are numbers (constants!), so that $\frac{d}{db_1}\left(b_1\sum \tilde{X}\tilde{Y}\right) = \sum \tilde{X}\tilde{Y}$ and $\frac{d}{db_1}\left(b_1^2\sum \tilde{X}^2\right) = 2b_1\sum \tilde{X}^2$.
$$SS = \sum (\tilde{Y} - b_1\tilde{X})^2 = (\tilde{Y}_1 - b_1\tilde{X}_1)^2 + (\tilde{Y}_2 - b_1\tilde{X}_2)^2 + (\tilde{Y}_3 - b_1\tilde{X}_3)^2$$
$$= (\tilde{Y}_1^2 - 2b_1\tilde{X}_1\tilde{Y}_1 + b_1^2\tilde{X}_1^2) + (\tilde{Y}_2^2 - 2b_1\tilde{X}_2\tilde{Y}_2 + b_1^2\tilde{X}_2^2) + (\tilde{Y}_3^2 - 2b_1\tilde{X}_3\tilde{Y}_3 + b_1^2\tilde{X}_3^2)$$
If we now take the derivative of this expression with respect to $b_1$ and set it equal to zero to find a minimum, we find that
$$\frac{d}{db_1}SS = (-2\tilde{X}_1\tilde{Y}_1 + 2b_1\tilde{X}_1^2) + (-2\tilde{X}_2\tilde{Y}_2 + 2b_1\tilde{X}_2^2) + (-2\tilde{X}_3\tilde{Y}_3 + 2b_1\tilde{X}_3^2) = -2\sum \tilde{X}\tilde{Y} + 2b_1\sum \tilde{X}^2 = 0.$$
But if $-2\left(\sum \tilde{X}\tilde{Y} - b_1\sum \tilde{X}^2\right) = 0$, then $\sum \tilde{X}\tilde{Y} - b_1\sum \tilde{X}^2 = 0$, or $\sum \tilde{X}\tilde{Y} = b_1\sum \tilde{X}^2$, so that if we solve for $b_1$, we find
$$b_1 = \frac{\sum \tilde{X}\tilde{Y}}{\sum \tilde{X}^2}.$$
But if we remember that $\tilde{X} = X - \bar{X}$ and $\tilde{Y} = Y - \bar{Y}$, we can write this as
$$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} \qquad \text{or} \qquad b_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}.$$
Of course, we still need $b_0$; but remember that $\bar{Y} = b_0 + b_1\bar{X}$, so that $b_0 = \bar{Y} - b_1\bar{X}$.
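As a numerical check on this derivation, the sketch below computes $b_1$ from the deviation form and compares it with NumPy's built-in least squares fit (np.polyfit), which minimizes the same sum of squared errors; both should give the coefficients found in G3 (assuming NumPy is available):

```python
import numpy as np

# The ten (X, Y) pairs from the example in G3.
X = np.array([0, 1, 2, 1, 0, 3, 4, 2, 2, 1], dtype=float)
Y = np.array([0, 2, 1, 3, 1, 3, 4, 2, 1, 2], dtype=float)

# Deviations from the means: the X-tilde and Y-tilde of the appendix.
xt = X - X.mean()
yt = Y - Y.mean()

b1 = (xt * yt).sum() / (xt ** 2).sum()   # 0.6667, the derived formula
b0 = Y.mean() - b1 * X.mean()            # 0.8333

# np.polyfit returns [slope, intercept]; the coefficients agree.
print(np.polyfit(X, Y, 1), (b1, b0))
```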