ANOVA, Regression, Correlation
252regr 2/26/07 (Open this document in 'Outline' view!) Roger Even Bove
G. LINEAR REGRESSION - Curve Fitting

1. Exact vs. Inexact Relations

2. The Ordinary Least Squares Formula

We wish to estimate the coefficients in $Y = \beta_0 + \beta_1 X$. Our 'prediction' will be $\hat{Y} = b_0 + b_1 X$ and our error will be $e = Y - \hat{Y}$, so that $Y = b_0 + b_1 X + e$. (See the appendix for a derivation.) The least squares formulas are
$$b_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad b_0 = \bar{Y} - b_1\bar{X}.$$
3. Example

    i     Y     X    XY    X^2   Y^2
    1     0     0     0     0     0
    2     2     1     2     1     4
    3     1     2     2     4     1
    4     3     1     3     1     9
    5     1     0     0     0     1
    6     3     3     9     9     9
    7     4     4    16    16    16
    8     2     2     4     4     4
    9     1     2     2     4     1
   10     2     1     2     1     4
  sum    19    16    40    40    49

First copy $n = 10$, $\sum X = 16$, $\sum Y = 19$, $\sum XY = 40$, $\sum X^2 = 40$ and $\sum Y^2 = 49$. Then compute the means:
$$\bar{X} = \frac{\sum X}{n} = \frac{16}{10} = 1.60, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{19}{10} = 1.90.$$
Use these to compute the 'spare parts':
$$SS_x = \sum X^2 - n\bar{X}^2 = 40 - 10(1.60)^2 = 14.40$$
$$SS_y = \sum Y^2 - n\bar{Y}^2 = 49 - 10(1.90)^2 = 12.90 = SST \ \text{(Total Sum of Squares)}$$
$$S_{xy} = \sum XY - n\bar{X}\bar{Y} = 40 - 10(1.60)(1.90) = 9.60.$$
Note that $SS_x$ and $SS_y$ must be positive, while $S_{xy}$ can be either positive or negative. We can compute the coefficients:
$$b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{9.60}{14.40} = 0.6667, \qquad b_0 = \bar{Y} - b_1\bar{X} = 1.90 - 0.6667(1.60) = 0.8333.$$
So our regression equation is $\hat{Y} = 0.8333 + 0.6667X$, or $Y = 0.8333 + 0.6667X + e$.
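The arithmetic above is easy to mis-key, so it helps to check it by machine. Below is a minimal Python sketch (assuming NumPy is available) that reproduces the spare parts and the coefficients from the ten data points; the variable names SSx, SSy, and Sxy simply mirror the notation of these notes and are not from any library.

```python
import numpy as np

# The ten (X, Y) pairs from the table above.
X = np.array([0, 1, 2, 1, 0, 3, 4, 2, 2, 1], dtype=float)
Y = np.array([0, 2, 1, 3, 1, 3, 4, 2, 1, 2], dtype=float)
n = len(X)

xbar, ybar = X.mean(), Y.mean()           # 1.60 and 1.90

# The 'spare parts'.
SSx = (X ** 2).sum() - n * xbar ** 2      # 14.40
SSy = (Y ** 2).sum() - n * ybar ** 2      # 12.90 (= SST)
Sxy = (X * Y).sum() - n * xbar * ybar     # 9.60

b1 = Sxy / SSx                            # 0.6667
b0 = ybar - b1 * xbar                     # 0.8333
print(b0, b1)                             # Y-hat = 0.8333 + 0.6667 X
```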
4. $R^2$, the Coefficient of Determination

$SST = SSR + SSE$, where $SSR = b_1 S_{xy}$ is the Regression (Explained) Sum of Squares and $SSE = SST - SSR$ is the Error (Unexplained or Residual) Sum of Squares. $SSE$ is defined as $\sum (Y - \hat{Y})^2$, a formula that should never be used for computation.
$$R^2 = \frac{SSR}{SST} = \frac{b_1 S_{xy}}{SS_y} = \frac{S_{xy}^2}{SS_x SS_y} = \frac{(9.6)^2}{(14.40)(12.90)} = .4961$$
An alternate formula, if no spare parts have been computed, is
$$R^2 = \frac{b_0 \sum Y + b_1 \sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}.$$
$R^2 = r^2$: the coefficient of determination is the square of the correlation. Note that $S_{xy}$, $b_1$, and $r$ all have the same sign.
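To see the decomposition and the relation $R^2 = r^2$ numerically, here is a short sketch continuing the same example (plain Python; variable names are mine):

```python
import math

# Spare parts from the example in G3.
SSx, SSy, Sxy = 14.40, 12.90, 9.60
b1 = Sxy / SSx

SST = SSy                         # total sum of squares, 12.90
SSR = b1 * Sxy                    # explained: 6.400
SSE = SST - SSR                   # unexplained: 6.500
R2 = SSR / SST                    # 0.4961

r = Sxy / math.sqrt(SSx * SSy)    # sample correlation; same sign as Sxy and b1
print(R2, r ** 2)                 # the two agree: R-squared = r-squared
```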
H. LINEAR REGRESSION - Simple Regression

1. Fitting a Line

2. The Gauss-Markov Theorem: OLS is BLUE (the Best Linear Unbiased Estimator).

3. Standard Errors

The standard error is defined as
$$s_e^2 = \frac{SSE}{n-2} = \frac{\sum (Y - \hat{Y})^2}{n-2}.$$
Equivalently,
$$s_e^2 = \frac{SSE}{n-2} = \frac{SST - SSR}{n-2} = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{SS_y - b_1^2 SS_x}{n-2}.$$
Or, if no spare parts are available,
$$s_e^2 = \frac{\sum Y^2 - b_0 \sum Y - b_1 \sum XY}{n-2}.$$
Note also that if $R^2$ is available,
$$s_e^2 = \frac{SS_y(1 - R^2)}{n-2}.$$
Using the data from G3 and our spare parts $SS_x = 14.40$, $SS_y = 12.90 = SST$, and $S_{xy} = 9.60$:
$$s_e^2 = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{(\sum Y^2 - n\bar{Y}^2) - b_1(\sum XY - n\bar{X}\bar{Y})}{n-2} = \frac{12.90 - 0.6667(9.60)}{8} = 0.8125.$$
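The equivalent formulas for $s_e^2$ can be checked against each other. Here is a small sketch using the raw sums and spare parts from G3 (variable names are mine):

```python
# Raw sums, spare parts, and coefficients from the example in G3.
n, sumY, sumXY, sumY2 = 10, 19.0, 40.0, 49.0
SSx, SSy, Sxy = 14.40, 12.90, 9.60
b1 = Sxy / SSx                    # 0.6667
b0 = 1.90 - b1 * 1.60             # 0.8333
R2 = Sxy ** 2 / (SSx * SSy)       # 0.4961

se2_spare = (SSy - b1 * Sxy) / (n - 2)                  # spare-parts form
se2_raw = (sumY2 - b0 * sumY - b1 * sumXY) / (n - 2)    # raw-sum form
se2_r2 = SSy * (1 - R2) / (n - 2)                       # R-squared form
print(se2_spare, se2_raw, se2_r2)                       # all about 0.8125
```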
4. The Variances of $b_0$ and $b_1$

$$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar{X}^2}{SS_x}\right) \qquad \text{and} \qquad s_{b_1}^2 = s_e^2\left(\frac{1}{SS_x}\right)$$
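A quick sketch of these two variance formulas, using the numbers from the example in G3 (the same values are worked out by hand in section I below):

```python
import math

# Values from the example in G3; se2 was computed in H3 above.
n, xbar = 10, 1.60
SSx, se2 = 14.40, 0.8125

sb1_sq = se2 / SSx                            # about 0.0564
sb0_sq = se2 * (1.0 / n + xbar ** 2 / SSx)    # about 0.2257
# About 0.2375 and 0.4751; the notes round to 0.2374 and 0.4749.
print(math.sqrt(sb1_sq), math.sqrt(sb0_sq))
```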
I. LINEAR REGRESSION - Confidence Intervals and Tests
1. Confidence Intervals for $\beta_1$
$$\beta_1 = b_1 \pm t_{\alpha/2}^{(n-2)} s_{b_1}, \qquad df = n - 2.$$
The interval can be made smaller by increasing either $n$ or the amount of variation in $x$.
2. Tests for $\beta_1$
To test $H_0\!: \beta_1 = \beta_{10}$ against $H_1\!: \beta_1 \ne \beta_{10}$, use
$$t = \frac{b_1 - \beta_{10}}{s_{b_1}}.$$
Remember that $\beta_{10}$ is most often zero; if the null hypothesis is false in that case, we say that $\beta_1$ is significant.
To continue the example in G3:
$$R^2 = \frac{S_{xy}^2}{SS_x SS_y} = \frac{(9.6)^2}{(14.40)(12.90)} = .4961$$
or, using the alternate formula,
$$R^2 = \frac{b_0 \sum Y + b_1 \sum XY - n\bar{Y}^2}{SS_y} = \frac{0.8333(19) + 0.6667(40) - 10(1.90)^2}{12.90} = .4961.$$
$$SSR = b_1 S_{xy} = 0.6667(9.60) = 6.400.$$
We have already computed $s_e^2 = 0.8125$, which implies that
$$s_{b_1}^2 = s_e^2\left(\frac{1}{SS_x}\right) = 0.8125\left(\frac{1}{14.40}\right) = 0.0564 \qquad \text{and} \qquad s_{b_1} = \sqrt{0.0564} = 0.2374.$$
The significance test is now $H_0\!: \beta_1 = 0$ against $H_1\!: \beta_1 \ne 0$ with $df = n - 2 = 10 - 2 = 8$. Assume that $\alpha = .05$, so that for a 2-sided test $t_{\alpha/2}^{(n-2)} = t_{.025}^{(8)} = 2.306$, and we reject the null hypothesis if $t$ is below -2.306 or above 2.306. Since
$$t = \frac{b_1 - 0}{s_{b_1}} = \frac{0.6667}{0.2374} = 2.809$$
is in the rejection region, we say that $\beta_1$ is significant. A further test says that $\beta_1$ is not significantly different from 1.
If we want a confidence interval, $\beta_1 = b_1 \pm t_{\alpha/2} s_{b_1} = 0.6667 \pm 2.306(0.2374) = 0.667 \pm 0.547$. Note that this interval includes 1, but not zero.

Note that since
$$s_{b_1}^2 = \frac{s_e^2}{SS_x} = \frac{s_e^2}{\sum X^2 - n\bar{X}^2} = \frac{s_e^2}{(n-1)s_x^2},$$
both a large sample size $n$ and a large variance of $x$ will tend to make $s_{b_1}^2$ smaller, and thus decrease the size of a confidence interval for $\beta_1$ or increase the size (and significance) of the t-ratio. To put it more negatively, small amounts of variation in $x$ or small sample sizes will tend to produce values of $b_1$ that are not significant. The common-sense interpretation of this statement is that we need a lot of experience with what happens to $y$ when we vary $x$ before we can put any confidence in our estimate of the slope of the equation that relates them.
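The slope test and interval above can be reproduced in a few lines. The sketch below assumes SciPy is available for the t critical value (scipy.stats.t.ppf); it also checks the claim that $b_1$ is not significantly different from 1:

```python
from scipy import stats

# Numbers from the example in G3 (sections H3 and I2 of these notes).
b1, sb1, n = 0.6667, 0.2374, 10
df = n - 2
t_crit = stats.t.ppf(0.975, df)   # 2.306 for alpha = .05, two-sided

t_zero = (b1 - 0) / sb1           # 2.809 -> |t| > 2.306: beta1 is significant
t_one = (b1 - 1) / sb1            # -1.404 -> |t| < 2.306: not significantly different from 1

half = t_crit * sb1               # 0.547
print(t_zero, t_one, (b1 - half, b1 + half))   # interval contains 1 but not 0
```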
3. Confidence Intervals and Tests for $\beta_0$
We are now testing $H_0\!: \beta_0 = \beta_{00}$ against $H_1\!: \beta_0 \ne \beta_{00}$ with
$$t = \frac{b_0 - \beta_{00}}{s_{b_0}}.$$
Here
$$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar{X}^2}{SS_x}\right) = 0.8125\left(\frac{1}{10} + \frac{(1.60)^2}{14.40}\right) = 0.8125(0.2778) = 0.2256, \qquad \text{so } s_{b_0} = \sqrt{0.2256} = 0.4749.$$
If we are testing $H_0\!: \beta_0 = 0$ against $H_1\!: \beta_0 \ne 0$,
$$t = \frac{b_0 - 0}{s_{b_0}} = \frac{0.8333}{0.4749} = 1.754.$$
Since the rejection region is the same as in I2, we accept the null hypothesis and say that $\beta_0$ is not significant. A confidence interval would be
$$\beta_0 = b_0 \pm t_{\alpha/2} s_{b_0} = 0.8333 \pm 2.306(0.4749) = 0.833 \pm 1.095.$$
A common way to summarize our results is $\hat{Y} = 0.8333 + 0.6667X$, written with the standard deviations below the corresponding coefficients: $(0.4749)$ under the intercept and $(0.2374)$ under the slope. For a Minitab printout example of a simple regression problem, see 252regrex1.
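A matching sketch for the intercept test and interval, again assuming SciPy for the critical value:

```python
from scipy import stats

# Numbers from the example in G3 (section I3 of these notes).
b0, sb0, df = 0.8333, 0.4749, 8
t_crit = stats.t.ppf(0.975, df)   # 2.306

t_stat = (b0 - 0) / sb0           # 1.754 -> inside (-2.306, 2.306): not significant
half = t_crit * sb0               # 1.095
print(t_stat, (b0 - half, b0 + half))   # interval includes zero
```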
4. Prediction and Confidence Intervals for $y$

The Confidence Interval is $\mu_{Y_0} = \hat{Y}_0 \pm t s_{\hat{Y}}$, where
$$s_{\hat{Y}}^2 = s_e^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x}\right),$$
and the Prediction Interval is $Y_0 = \hat{Y}_0 \pm t s_Y$, where
$$s_Y^2 = s_e^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x} + 1\right).$$
In these two formulas, for some specific $X_0$, $\hat{Y}_0 = b_0 + b_1 X_0$. For example, assume that $X_0 = 5$, so that for the results in G3, $\hat{Y}_0 = 0.8333 + 0.6667(5) = 4.168$. Then
$$s_{\hat{Y}}^2 = s_e^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x}\right) = 0.8125\left(\frac{1}{10} + \frac{(5 - 1.6)^2}{14.40}\right) = 0.733$$
and $s_{\hat{Y}} = \sqrt{0.733} = 0.856$, so that the confidence interval is $\mu_{Y_0} = \hat{Y}_0 \pm t s_{\hat{Y}} = 4.168 \pm 2.306(0.856) = 4.168 \pm 1.974$. This represents a confidence interval for the average value that $Y$ will take when $X = 5$. For the same data,
$$s_Y^2 = s_e^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x} + 1\right) = 0.8125\left(\frac{1}{10} + \frac{(5 - 1.6)^2}{14.40} + 1\right) = 1.545$$
and $s_Y = \sqrt{1.545} = 1.243$, so that the prediction interval is $Y_0 = \hat{Y}_0 \pm t s_Y = 4.168 \pm 2.306(1.243) = 4.168 \pm 2.866$. This is a confidence interval for the value that $Y$ will take in a particular instance when $X = 5$.
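Both intervals at $X_0 = 5$ can be reproduced as follows (plain Python; the name 'bracket' for the parenthesized factor is mine, not from the notes):

```python
import math

# Values from the example in G3.
n, xbar = 10, 1.60
SSx, se2 = 14.40, 0.8125
b0, b1, t_crit = 0.8333, 0.6667, 2.306

X0 = 5.0
Y0_hat = b0 + b1 * X0                             # 4.168

bracket = 1.0 / n + (X0 - xbar) ** 2 / SSx        # 0.9028
s_mean = math.sqrt(se2 * bracket)                 # 0.856: error of the mean response
s_pred = math.sqrt(se2 * (bracket + 1.0))         # 1.243: error of a single new value
print(Y0_hat, t_crit * s_mean, t_crit * s_pred)   # half-widths 1.974 and 2.866
```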
Ignore the remainder of this document unless you have had calculus!
Appendix to G2 - Explanation of the OLS Formula

Assume that we have three points: $(X_1, Y_1)$, $(X_2, Y_2)$ and $(X_3, Y_3)$. We wish to fit a regression line to these points, with the equation $\hat{Y} = b_0 + b_1 X$ and the characteristic that the sum of squares $SS = \sum (Y - \hat{Y})^2$ is a minimum. If we imagine that there is a 'true' regression line $Y = \beta_0 + \beta_1 X$, we can consider $b_0$ and $b_1$ to be estimates of $\beta_0$ and $\beta_1$.

Let us make the definition $e = Y - \hat{Y}$. Note that if we substitute our equation for $\hat{Y}$, we find that $e = Y - \hat{Y} = Y - (b_0 + b_1 X)$, or $Y = b_0 + b_1 X + e$. This has two consequences: first, the sum of squares can be written as $SS = \sum (Y - \hat{Y})^2 = \sum (Y - b_0 - b_1 X)^2 = \sum e^2$; and second, if we fit the line so that $\sum e = 0$, the mean of $Y$ and $\hat{Y}$ is the same and we have $\bar{Y} = b_0 + b_1\bar{X}$.

Now if we subtract the equation for $\bar{Y}$, $\bar{Y} = b_0 + b_1\bar{X}$, from the equation for $Y$, $Y = b_0 + b_1 X + e$, we find $Y - \bar{Y} = b_1(X - \bar{X}) + e$. Now let us measure $X$ and $Y$ as deviations from the mean, replacing $X$ with $\tilde{X} = X - \bar{X}$ and $Y$ with $\tilde{Y} = Y - \bar{Y}$. This means that $\tilde{Y} = b_1\tilde{X} + e$, or $e = \tilde{Y} - b_1\tilde{X}$. If we substitute this expression into our sum of squares, we find that
$$SS = \sum e^2 = \sum (\tilde{Y} - b_1\tilde{X})^2.$$
Now write this expression out in terms of our three points and differentiate it to minimize $SS$ with respect to $b_1$. To do this, recall that $b_1$ is our unknown and that the $\tilde{X}$s and $\tilde{Y}$s are numbers (constants!), so that $\frac{d}{db_1}\left(b_1\sum \tilde{X}\tilde{Y}\right) = \sum \tilde{X}\tilde{Y}$ and $\frac{d}{db_1}\left(b_1^2\sum \tilde{X}^2\right) = 2b_1\sum \tilde{X}^2$.
$$SS = \sum (\tilde{Y} - b_1\tilde{X})^2 = (\tilde{Y}_1 - b_1\tilde{X}_1)^2 + (\tilde{Y}_2 - b_1\tilde{X}_2)^2 + (\tilde{Y}_3 - b_1\tilde{X}_3)^2$$
$$= (\tilde{Y}_1^2 - 2b_1\tilde{X}_1\tilde{Y}_1 + b_1^2\tilde{X}_1^2) + (\tilde{Y}_2^2 - 2b_1\tilde{X}_2\tilde{Y}_2 + b_1^2\tilde{X}_2^2) + (\tilde{Y}_3^2 - 2b_1\tilde{X}_3\tilde{Y}_3 + b_1^2\tilde{X}_3^2)$$
If we now take the derivative of this expression with respect to $b_1$ and set it equal to zero to find a minimum, we find that
$$\frac{d}{db_1}SS = (-2\tilde{X}_1\tilde{Y}_1 + 2b_1\tilde{X}_1^2) + (-2\tilde{X}_2\tilde{Y}_2 + 2b_1\tilde{X}_2^2) + (-2\tilde{X}_3\tilde{Y}_3 + 2b_1\tilde{X}_3^2) = -2\sum \tilde{X}\tilde{Y} + 2b_1\sum \tilde{X}^2 = 0.$$
But if $-2\left(\sum \tilde{X}\tilde{Y} - b_1\sum \tilde{X}^2\right) = 0$, then $\sum \tilde{X}\tilde{Y} - b_1\sum \tilde{X}^2 = 0$, or $\sum \tilde{X}\tilde{Y} = b_1\sum \tilde{X}^2$, so that if we solve for $b_1$, we find
$$b_1 = \frac{\sum \tilde{X}\tilde{Y}}{\sum \tilde{X}^2}.$$
But if we remember that $\tilde{X} = X - \bar{X}$ and $\tilde{Y} = Y - \bar{Y}$, we can write this as
$$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} \qquad \text{or} \qquad b_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}.$$
Of course, we still need $b_0$; but remember that $\bar{Y} = b_0 + b_1\bar{X}$, so that $b_0 = \bar{Y} - b_1\bar{X}$.
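As a numerical check on this derivation, the sketch below computes $b_1$ from the deviation form and compares it with NumPy's built-in least squares fit (np.polyfit), which minimizes the same sum of squared errors; both should give the coefficients found in G3 (assuming NumPy is available):

```python
import numpy as np

# The ten (X, Y) pairs from the example in G3.
X = np.array([0, 1, 2, 1, 0, 3, 4, 2, 2, 1], dtype=float)
Y = np.array([0, 2, 1, 3, 1, 3, 4, 2, 1, 2], dtype=float)

# Deviations from the means: the X-tilde and Y-tilde of the appendix.
xt = X - X.mean()
yt = Y - Y.mean()

b1 = (xt * yt).sum() / (xt ** 2).sum()   # 0.6667, the derived formula
b0 = Y.mean() - b1 * X.mean()            # 0.8333

# np.polyfit returns [slope, intercept]; the coefficients agree.
print(np.polyfit(X, Y, 1), (b1, b0))
```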