
Lecture 12: Tests and Confidence Intervals

Fall 2013. Prof. Yao Xie, [email protected]. H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech.

Outline

• Properties of β̂1 and β̂0 as point estimators
• Hypothesis tests on slope and intercept
• Confidence intervals for slope and intercept
• Real example: house prices and taxes

Regression

• Step 1: graphical display of the data — scatter plot (e.g., sales vs. advertisement cost); calculate the correlation
• Step 2: find the relationship or association between sales and advertisement cost — regression
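As a quick illustration of Step 1 (a minimal Python sketch; the sales/advertising numbers below are hypothetical, not from the lecture):

    # Scatter plot and sample correlation for (hypothetical) sales vs. ads data.
    import numpy as np
    import matplotlib.pyplot as plt

    ads = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])    # advertisement cost (made up)
    sales = np.array([2.1, 2.8, 3.2, 4.1, 4.4, 5.2, 5.6])  # sales (made up)

    plt.scatter(ads, sales)                # Step 1a: scatter plot
    plt.xlabel("advertisement cost")
    plt.ylabel("sales")
    plt.show()

    r = np.corrcoef(ads, sales)[0, 1]      # Step 1b: sample correlation coefficient
    print(f"correlation = {r:.3f}")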


Based on the scatter diagram, it is probably reasonable to assume that the mean of the response Y is related to X by the following simple linear regression model:

    Yi = β0 + β1 xi + εi,  i = 1, 2, …, n,  εi ∼ N(0, σ²)

where Y is the response, x is the regressor or predictor, εi is a random error, and the slope β1 and intercept β0 of the line are called the regression coefficients.

• The case of simple linear regression considers a single regressor or predictor x and a dependent or response variable Y.

(From the textbook, Section 11-2, Simple Linear Regression:)

Simplifying the two least squares equations yields the normal equations

    n β̂0 + β̂1 ∑ xi = ∑ yi
    β̂0 ∑ xi + β̂1 ∑ xi² = ∑ xi yi        (11-6)

(all sums run over i = 1, …, n). The solution to the normal equations results in the least squares estimators β̂0 and β̂1.

Least Squares Estimates: the least squares estimates of the intercept and slope in the simple linear regression model are

    β̂0 = ȳ − β̂1 x̄        (11-7)

    β̂1 = [ ∑ xi yi − (∑ xi)(∑ yi)/n ] / [ ∑ xi² − (∑ xi)²/n ]        (11-8)

where ȳ = (1/n) ∑ yi and x̄ = (1/n) ∑ xi.

The fitted or estimated regression line is therefore

    ŷ = β̂0 + β̂1 x        (11-9)

Note that each pair of observations satisfies the relationship yi = β̂0 + β̂1 xi + ei, i = 1, 2, …, n, where ei = yi − ŷi is called the residual. The residual describes the error in the fit of the model to the ith observation yi; later in this chapter the residuals are used to provide information about the adequacy of the fitted model.

Notationally, it is occasionally convenient to give special symbols to the numerator and denominator of Equation 11-8. Given data (x1, y1), (x2, y2), …, (xn, yn), let

    Sxx = ∑ (xi − x̄)² = ∑ xi² − (∑ xi)²/n        (11-10)
    Sxy = ∑ (yi − ȳ)(xi − x̄) = ∑ xi yi − (∑ xi)(∑ yi)/n        (11-11)

so that the fitted (estimated) regression model is ŷi = β̂0 + β̂1 xi with β̂1 = Sxy / Sxx and β̂0 = ȳ − β̂1 x̄.

Caveat: the regression relationship is valid only for values of the regressor variable within the range of the original data. Be careful with extrapolation.

Estimation of σ²

• Using the fitted model, we can estimate the value of the response variable for a given predictor:

    ŷi = β̂0 + β̂1 xi

• Residuals: ri = yi − ŷi
• Our model: Yi = β0 + β1 Xi + εi, i = 1, …, n, with Var(εi) = σ²
• Unbiased estimator of σ² (MSE: mean square error):

    σ̂² = MSE = (∑ ri²) / (n − 2)
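Continuing the sketch above (reusing least_squares_fit), the residuals and the MSE estimate of σ²:

    def sigma2_hat(x, y):
        """Unbiased estimate of sigma^2: MSE = sum(r_i^2) / (n - 2)."""
        b0, b1, _ = least_squares_fit(x, y)
        r = y - (b0 + b1 * x)                  # residuals r_i = y_i - yhat_i
        return np.sum(r ** 2) / (len(x) - 2)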

Punchline

• The coefficients β̂1 and β̂0 are both calculated from data, and they are subject to error.
• If the true model is y = β1 x + β0, then β̂1 and β̂0 are point estimators for the true coefficients.
• We can therefore talk about the "accuracy" of β̂1 and β̂0.

Assessing the linear regression model

• Test hypotheses about the true slope and intercept: β1 = ?, β0 = ?
• Construct confidence intervals: β1 ∈ [β̂1 − a, β̂1 + a] and β0 ∈ [β̂0 − b, β̂0 + b], each with probability 1 − α
• Assume the errors are normally distributed:

    εi ∼ N(0, σ²)

Properties of Regression Estimators

(From the textbook, Section 11-3, Properties of the Least Squares Estimators:)

The statistical properties of the least squares estimators β̂0 and β̂1 may be easily described. Recall that we have assumed the error term ε in the model Y = β0 + β1 x + ε is a random variable with mean zero and variance σ². Since the values of x are fixed, Y is a random variable with mean μ(Y | x) = β0 + β1 x and variance σ². Therefore the values of β̂0 and β̂1 depend on the observed y's, and the least squares estimators of the regression coefficients may be viewed as random variables.

Consider first β̂1. Because β̂1 is a linear combination of the observations Yi, properties of expectation show that

    E(β̂1) = β1        (11-15)

so β̂1 is an unbiased estimator of the true slope parameter β1. Since V(εi) = σ² implies V(Yi) = σ²,

    V(β̂1) = σ² / Sxx        (11-16)

For the intercept, we can show in a similar manner that

    E(β̂0) = β0  and  V(β̂0) = σ² [1/n + x̄²/Sxx]        (11-17)

Thus β̂0 is an unbiased estimator of the intercept parameter β0. The covariance of the random variables β̂0 and β̂1 is not zero; it can be shown (see Exercise 11-98) that cov(β̂0, β̂1) = −σ² x̄ / Sxx.

Standard errors of coefficients: we can replace σ² with its estimator σ̂² in Equations 11-16 and 11-17 to provide estimates of the variance of the slope and the intercept. The square roots of the resulting variance estimators are the estimated standard errors of the slope and intercept:

    se(β̂1) = √(σ̂² / Sxx)  and  se(β̂0) = √(σ̂² [1/n + x̄²/Sxx])

respectively, where σ̂² is the MSE computed above (Equation 11-13). The Minitab computer output in Table 11-2 reports the estimated standard errors of the slope and intercept under the column heading "SE Coef."
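In code, the estimated standard errors follow directly from the formulas above (a sketch reusing the earlier helper functions, not a full implementation):

    def std_errors(x, y):
        """Estimated standard errors of the slope and intercept."""
        n = len(x)
        b0, b1, Sxx = least_squares_fit(x, y)
        s2 = sigma2_hat(x, y)
        se_b1 = np.sqrt(s2 / Sxx)                              # se(beta1-hat)
        se_b0 = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / Sxx))  # se(beta0-hat)
        return se_b0, se_b1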


11-4 HYPOTHESIS TESTS IN SIMPLE LINEAR REGRESSION

An important part of assessing the adequacy of a linear regression model is testing statistical hypotheses about the model parameters and constructing certain confidence intervals. Hypothesis testing in simple linear regression is discussed in this section, and Section 11-5 presents methods for constructing confidence intervals. To test hypotheses about the slope and intercept of the regression model, we must make the additional assumption that the error component in the model, ε, is normally distributed. Thus, the complete assumptions are that the errors are normally and independently distributed with mean zero and variance σ², abbreviated NID(0, σ²).

11-4.1 Use of t-Tests

Suppose we wish to test the hypothesis that the slope equals a constant, say, β1,0. The appropriate hypotheses are

    H0: β1 = β1,0

    H1: β1 ≠ β1,0        (11-18)

where we have assumed a two-sided alternative. Since the errors εi are NID(0, σ²), it follows directly that the observations Yi are NID(β0 + β1 xi, σ²).

For example, relating ads to sales, we may be interested in whether an extra dollar spent on ads increases sales by a dollar, i.e., whether sales = ads + constant.


Now β̂1 is a linear combination of independent normal random variables, and consequently β̂1 is N(β1, σ²/Sxx), using the bias and variance properties of the slope discussed in Section 11-3. In addition, (n − 2)σ̂²/σ² has a chi-square distribution with n − 2 degrees of freedom, and β̂1 is independent of σ̂². As a result of those properties, the statistic

    T0 = (β̂1 − β1,0) / √(σ̂² / Sxx)        (11-19)

follows the t distribution with n − 2 degrees of freedom under H0: β1 = β1,0. We would reject H0: β1 = β1,0 if

    |t0| > tα/2,n−2        (11-20)

where t0 is computed from Equation 11-19. The denominator of Equation 11-19 is the standard error of the slope, so we could write the test statistic as

    T0 = (β̂1 − β1,0) / se(β̂1)

A similar procedure can be used to test hypotheses about the intercept. To test

    H0: β0 = β0,0

    H1: β0 ≠ β0,0        (11-21)

we would use the statistic

    T0 = (β̂0 − β0,0) / √(σ̂² [1/n + x̄²/Sxx]) = (β̂0 − β0,0) / se(β̂0)        (11-22)

and reject the null hypothesis if the computed value of this test statistic, t0, is such that |t0| > tα/2,n−2. Note that the denominator of the test statistic in Equation 11-22 is just the standard error of the intercept.

A related and important question is whether or not the slope is zero. This very important special case of the hypotheses of Equation 11-18 is the significance of regression:

    H0: β1 = 0
    H1: β1 ≠ 0        (11-23)

These hypotheses relate to the significance of regression: if β1 = 0, then Y does not depend on X, i.e., Y and X are independent (in the advertisement example, do ads increase sales, or do they have no effect?). Failure to reject H0: β1 = 0 is equivalent to concluding that there is no linear relationship between x and Y. This situation is illustrated in Fig. 11-5. Note that this may imply either that x is of little value in explaining the variation in Y and that the best estimator of Y for any x is ŷ = ȳ [Fig. 11-5(a)], or that the true relationship between x and Y is not linear [Fig. 11-5(b)]. Alternatively, if H0: β1 = 0 is rejected, this implies that x is of value in explaining the variability in Y (see Fig. 11-6). Rejecting H0: β1 = 0 could mean either that the straight-line model is adequate [Fig. 11-6(a)] or that, although there is a linear effect of x, better results could be obtained with the addition of higher-order polynomial terms in x [Fig. 11-6(b)].
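A sketch of the slope t-test (Equations 11-19 and 11-20) in Python, reusing the earlier helper functions; setting beta10 = 0 gives the significance-of-regression test:

    import numpy as np
    from scipy import stats

    def t_test_slope(x, y, beta10=0.0, alpha=0.05):
        """Two-sided t-test of H0: beta1 = beta10."""
        n = len(x)
        _, b1, Sxx = least_squares_fit(x, y)
        se_b1 = np.sqrt(sigma2_hat(x, y) / Sxx)
        t0 = (b1 - beta10) / se_b1                   # Eq. 11-19
        t_crit = stats.t.ppf(1 - alpha / 2, n - 2)   # t_{alpha/2, n-2}
        p = 2 * stats.t.sf(abs(t0), n - 2)           # two-sided P-value
        return t0, t_crit, p                         # reject H0 when |t0| > t_crit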


[Figure 11-5: The hypothesis H0: β1 = 0 is not rejected; panels (a) and (b) plot y against x.]

• Fig. 11-5: H0 not rejected; Fig. 11-6: H0 rejected.

EXAMPLE 11-2 Oxygen Purity Tests of Coefficients

We will test for significance of regression using the model for the oxygen purity data from Example 11-1. The hypotheses are

    H0: β1 = 0
    H1: β1 ≠ 0

and we will use α = 0.01. From Example 11-1 and Table 11-2 we have

    β̂1 = 14.947,  n = 20,  Sxx = 0.68088,  σ̂² = 1.18

so the t-statistic in Equation 11-19 becomes

    t0 = β̂1 / √(σ̂²/Sxx) = β̂1 / se(β̂1) = 14.947 / √(1.18/0.68088) = 11.35

Practical interpretation: since the reference value of t is t0.005,18 = 2.88, the value of the test statistic is very far into the critical region, implying that H0: β1 = 0 should be rejected. There is strong evidence to support this claim. The P-value for this test is P ≈ 1.23 × 10⁻⁹ (obtained manually with a calculator).

Table 11-2 presents the Minitab output for this problem. Notice that the t-statistic value for the slope is computed as 11.35 and that the reported P-value is P = 0.000. Minitab also reports the t-statistic for testing the hypothesis H0: β0 = 0. This statistic is computed from Equation 11-22, with β0,0 = 0, as t0 = 46.62. Clearly, then, the hypothesis that the intercept is zero is rejected.
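The numbers in Example 11-2 can be reproduced from the reported summary statistics (a quick check in Python with scipy):

    import numpy as np
    from scipy import stats

    b1, n, Sxx, s2 = 14.947, 20, 0.68088, 1.18   # from Example 11-1 / Table 11-2

    t0 = b1 / np.sqrt(s2 / Sxx)                  # test H0: beta1 = 0
    t_crit = stats.t.ppf(1 - 0.01 / 2, n - 2)    # t_{0.005, 18}
    p = 2 * stats.t.sf(t0, n - 2)

    print(t0, t_crit, p)   # about 11.35, 2.88, 1.2e-09 -> reject H0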

11-4.2 Analysis of Variance Approach to Test Significance of Regression

A method called the analysis of variance can be used to test for significance of regression. The procedure partitions the total variability in the response variable into meaningful components as the basis for the test. The analysis of variance identity is as follows:

Analysis of Variance Identity:

    ∑ (yi − ȳ)² = ∑ (ŷi − ȳ)² + ∑ (yi − ŷi)²        (11-24)
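The identity is easy to verify numerically on any fitted model (again reusing the earlier sketch):

    def anova_identity(x, y):
        """SST = SSR + SSE (Eq. 11-24), up to floating-point rounding."""
        b0, b1, _ = least_squares_fit(x, y)
        yhat, ybar = b0 + b1 * x, y.mean()
        SST = np.sum((y - ybar) ** 2)      # total sum of squares
        SSR = np.sum((yhat - ybar) ** 2)   # regression sum of squares
        SSE = np.sum((y - yhat) ** 2)      # error sum of squares
        return SST, SSR, SSE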

[Figure 11-6: The hypothesis H0: β1 = 0 is rejected; panels (a) and (b) plot y against x.]

Use t-test for slope

• Under H0: β1 = β1,0, we have β̂1 ∼ N(β1,0, σ²/Sxx), and the test statistic

    T0 = (β̂1 − β1,0) / √(σ̂² / Sxx)        (11-19)

follows a t distribution with n − 2 degrees of freedom.
• Reject H0 if |t0| > tα/2,n−2 (two-sided test).

Example: oxygen purity tests of coefficients

Background (from Section 11-1, Empirical Models): consider the data in Table 11-1, where y is the purity of oxygen produced in a chemical distillation process and x is the percentage of hydrocarbons present in the main condenser of the distillation unit. Inspection of the scatter diagram (Figure 11-1) indicates that, although no simple curve will pass exactly through all the points, there is a strong indication that the points lie scattered randomly around a straight line. Therefore, it is probably reasonable to assume that the mean of the random variable Y is related to x by the straight-line relationship

    E(Y | x) = μ(Y | x) = β0 + β1 x

where the slope and intercept of the line are the regression coefficients. While the mean of Y is a linear function of x, the actual observed value y does not fall exactly on a straight line; the actual value of Y is the mean value function plus a random error term:

    Y = β0 + β1 x + ε        (11-1)

Table 11-1 Oxygen and Hydrocarbon Levels

    Observation    Hydrocarbon level x (%)    Purity y (%)
     1             0.99                       90.01
     2             1.02                       89.05
     3             1.15                       91.43
     4             1.29                       93.74
     5             1.46                       96.73
     6             1.36                       94.45
     7             0.87                       87.59
     8             1.23                       91.77
     9             1.55                       99.42
    10             1.40                       93.65
    11             1.19                       93.54
    12             1.15                       92.52
    13             0.98                       90.56
    14             1.01                       89.54
    15             1.11                       89.85
    16             1.20                       90.39
    17             1.26                       93.25
    18             1.32                       93.41
    19             1.43                       94.98
    20             0.95                       87.33

[Figure 11-1: Scatter diagram of oxygen purity versus hydrocarbon level from Table 11-1.]

For the significance-of-regression test on these data:

• Consider the test H0: β1 = 0 vs. H1: β1 ≠ 0, with α = 0.01; from Example 11-1 and Table 11-2, β̂1 = 14.947, n = 20, Sxx = 0.68088, σ̂² = 1.18.
• Calculate the test statistic: t0 = β̂1 / √(σ̂²/Sxx) = 14.947 / √(1.18/0.68088) = 11.35.
• Threshold: tα/2,n−2 = t0.005,18 = 2.88.
• Reject H0 since |t0| > tα/2,n−2.
H0: ␤1 ϭ 0 could mean either that the straight-line model is adequate [Fig. 11-6(a)] or that, Figure although11-6 The there is a linear effect of x, better results could be obtained with the addition of hypothesishigher H0 :order ␤1 ϭ polynomial0 terms in x [Fig. 11-6(b)]. x x is rejected. (a) (b) JWCL232_c11_401-448.qxd 1/14/10 8:02 PM Page 416

Hypothesis test on the slope

β̂1 is a linear combination of independent normal random variables, and consequently β̂1 is N(β1, σ²/Sxx), using the bias and variance properties of the slope discussed in Section 11-3. In addition, (n − 2)σ̂²/σ² has a chi-square distribution with n − 2 degrees of freedom, and β̂1 is independent of σ̂². As a result of those properties, the statistic

Test Statistic:   T0 = (β̂1 − β1,0) / √(σ̂²/Sxx)    (11-19)

follows the t distribution with n − 2 degrees of freedom under H0: β1 = β1,0. We would reject H0: β1 = β1,0 if

|t0| > t_{α/2, n−2}    (11-20)

where t0 is computed from Equation 11-19. The denominator of Equation 11-19 is the standard error of the slope, so we could write the test statistic as

T0 = (β̂1 − β1,0) / se(β̂1)

Use a t-test for the intercept

A similar form of test can be used for hypotheses about the intercept. To test

H0: β0 = β0,0
H1: β0 ≠ β0,0    (11-21)

we would use the statistic

Test Statistic:   T0 = (β̂0 − β0,0) / √(σ̂²[1/n + x̄²/Sxx]) = (β̂0 − β0,0) / se(β̂0)    (11-22)

Under H0, T0 has a t distribution with n − 2 degrees of freedom; note that the denominator of the test statistic in Equation 11-22 is just the standard error of the intercept. We reject H0 if |t0| > t_{α/2, n−2}.

A very important special case of the hypotheses of Equation 11-18 is

H0: β1 = 0
H1: β1 ≠ 0    (11-23)

These hypotheses relate to the significance of regression. Failure to reject H0: β1 = 0 is equivalent to concluding that there is no linear relationship between x and Y. This situation is illustrated in Fig. 11-5. Note that this may imply either that x is of little value in explaining the variation in Y and that the best estimator of Y for any x is ŷ = ȳ [Fig. 11-5(a)], or that the true relationship between x and Y is not linear [Fig. 11-5(b)]. Alternatively, if H0: β1 = 0 is rejected, this implies that x is of value in explaining the variability in Y (see Fig. 11-6). Rejecting H0: β1 = 0 could mean either that the straight-line model is adequate [Fig. 11-6(a)] or that, although there is a linear effect of x, better results could be obtained with the addition of higher-order polynomial terms in x [Fig. 11-6(b)].

[Figure 11-5: the hypothesis H0: β1 = 0 is not rejected. Figure 11-6: the hypothesis H0: β1 = 0 is rejected.]
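As a computational companion to Equations 11-19 and 11-22, here is a minimal Python sketch of both t-tests (the function name and return layout are my own, not from the text; it assumes NumPy and SciPy are available):

import numpy as np
from scipy import stats

def t_tests_slope_intercept(x, y, beta1_0=0.0, beta0_0=0.0):
    """Two-sided t-tests for H0: beta1 = beta1_0 (Eq. 11-19) and H0: beta0 = beta0_0 (Eq. 11-22)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    Sxy = np.sum((x - xbar) * (y - ybar))
    b1 = Sxy / Sxx                                         # least squares slope
    b0 = ybar - b1 * xbar                                  # least squares intercept
    resid = y - (b0 + b1 * x)
    sigma2 = np.sum(resid ** 2) / (n - 2)                  # sigma-hat^2 = MSE
    se_b1 = np.sqrt(sigma2 / Sxx)                          # standard error of the slope
    se_b0 = np.sqrt(sigma2 * (1.0 / n + xbar ** 2 / Sxx))  # standard error of the intercept
    t1 = (b1 - beta1_0) / se_b1                            # Eq. 11-19
    t0 = (b0 - beta0_0) / se_b0                            # Eq. 11-22
    # reject H0 at level alpha when |t| > t_{alpha/2, n-2}; equivalently, P-value < alpha
    p1 = 2 * stats.t.sf(abs(t1), df=n - 2)
    p0 = 2 * stats.t.sf(abs(t0), df=n - 2)
    return (t1, p1), (t0, p0)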

Class activity

Given the regression line ŷ = 22.2 + 10.5x, estimated for x = 1, 2, 3, …, 20:

1. The estimated slope is:
A. β̂1 = 22.2   B. β̂1 = 10.5   C. biased

2. The predicted value for x* = 10 is:
A. y* = 22.2   B. y* = 127.2   C. y* = 32.7

3. The predicted value for x* = 40 is:
A. y* = 442.2   B. y* = 127.2   C. Cannot extrapolate

Class activity

1. The estimated slope is significantly different from zero when:
A. β̂1√Sxx / σ̂ > t_{α/2, n−2}   B. β̂1√Sxx / σ̂ < t_{α/2, n−2}   C. β̂1²Sxx / σ̂² > F_{α/2, n−1, 1}

2. The estimated intercept is plausibly zero when:
A. Its confidence interval contains 0.
B. β̂0√Sxx / σ̂ < t_{α/2, n−2}
C. β̂0 / (σ̂√(1/n + x̄²/Sxx)) > t_{α/2, n−2}


Confidence interval

• We can obtain confidence interval estimates of the slope and intercept.
• The width of a confidence interval is a measure of the overall quality of the regression.
• With β1 and β0 denoting the true parameters,

(β̂1 − β1) / √(σ̂²/Sxx)   and   (β̂0 − β0) / √(σ̂²[1/n + x̄²/Sxx])

each follow a t distribution with n − 2 degrees of freedom.


11-5 CONFIDENCE INTERVALS

11-5.1 Confidence Intervals on the Slope and Intercept

In addition to point estimates of the slope and intercept, it is possible to obtain confidence interval estimates of these parameters. The width of these confidence intervals is a measure of the overall quality of the regression line. If the error terms, εi, in the regression model are normally and independently distributed,

(β̂1 − β1) / √(σ̂²/Sxx)   and   (β̂0 − β0) / √(σ̂²[1/n + x̄²/Sxx])

are both distributed as t random variables with n − 2 degrees of freedom. This leads to the following definition of 100(1 − α)% confidence intervals on the slope and intercept.

Confidence Intervals on Parameters: Under the assumption that the observations are normally and independently distributed, a 100(1 − α)% confidence interval on the slope β1 in simple linear regression is

β̂1 − t_{α/2, n−2}√(σ̂²/Sxx) ≤ β1 ≤ β̂1 + t_{α/2, n−2}√(σ̂²/Sxx)    (11-29)

Similarly, a 100(1 − α)% confidence interval on the intercept β0 is

β̂0 − t_{α/2, n−2}√(σ̂²[1/n + x̄²/Sxx]) ≤ β0 ≤ β̂0 + t_{α/2, n−2}√(σ̂²[1/n + x̄²/Sxx])    (11-30)

EXAMPLE 11-4 Oxygen Purity Confidence Interval on the Slope

We will find a 95% confidence interval on the slope of the regression line using the data in Example 11-1. Recall that β̂1 = 14.947, Sxx = 0.68088, and σ̂² = 1.18 (see Table 11-2). Then, from Equation 11-29,

14.947 − 2.101√(1.18/0.68088) ≤ β1 ≤ 14.947 + 2.101√(1.18/0.68088)

This simplifies to

12.181 ≤ β1 ≤ 17.713

Practical Interpretation: This CI does not include zero, so there is strong evidence (at α = 0.05) that the slope is not zero. The CI is reasonably narrow (±2.766) because the error variance is fairly small.

Example: oxygen purity tests of coefficients (α = 0.05)

The 95% confidence interval on the slope, 12.181 ≤ β1 ≤ 17.713, does not include 0, so there is enough evidence of correlation between X and Y.
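As a sanity check, a few lines of Python reproduce the interval of Example 11-4 (a sketch; the inputs β̂1 = 14.947, Sxx = 0.68088, σ̂² = 1.18, n = 20 are taken from the example):

from math import sqrt
from scipy import stats

b1, Sxx, sigma2, n = 14.947, 0.68088, 1.18, 20
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # t_{0.025,18}, approx. 2.101
half = t_crit * sqrt(sigma2 / Sxx)             # half-width, approx. 2.766
print(b1 - half, b1 + half)                    # approx. (12.181, 17.713)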

11-5.2 Confidence Interval on the Mean Response

A confidence interval may be constructed on the mean response at a specified value of x, say, x0. This is a confidence interval about E(Y|x0) = μ_{Y|x0} and is often called a confidence interval about the regression line.


Example: house selling price and annual taxes

11-4. An article in Technometrics by S. C. Narula and J. F. Wellington ["Prediction, Linear Regression, and a Minimum Sum of Relative Errors" (Vol. 19, 1977)] presents data on the selling price and annual taxes for 24 houses. The data are shown in the following table (taxes are local, school, and county taxes).

Sale Price/1000   Taxes/1000      Sale Price/1000   Taxes/1000
25.9              4.9176          30.0              5.0500
29.5              5.0208          36.9              8.2464
27.9              4.5429          41.9              6.6969
25.9              4.5573          40.5              7.7841
29.9              5.0597          43.9              9.0384
29.9              3.8910          37.5              5.9894
30.9              5.8980          37.9              7.5422
28.9              5.6039          44.5              8.7951
35.9              5.8282          37.9              6.0831
31.5              5.3003          38.9              8.3607
31.0              6.2712          36.9              8.1400
30.9              5.9592          45.8              9.1416

(a) Assuming that a simple linear regression model is appropriate, obtain the least squares fit relating selling price to taxes paid. What is the estimate of σ²?
(b) Find the mean selling price given that the taxes paid are x = 7.50.
(c) Calculate the fitted value of y corresponding to x = 5.8980. Find the corresponding residual.
(d) Calculate the fitted ŷi for each value of xi used to fit the model. Then construct a graph of ŷi versus the corresponding observed value yi and comment on what this plot would look like if the relationship between y and x was a deterministic (no random error) straight line. Does the plot actually obtained indicate that taxes paid is an effective regressor variable in predicting selling price?

In the lecture's computation, sale price is taken as the regressor:
Independent variable X: sale price
Dependent variable Y: taxes

• qualitative analysis

Calculate correlation: r = 0.8760


Least squares fit of the house data (n = 24):

Independent variable X: sale price
Dependent variable Y: taxes

x̄ = 34.6125,   ȳ = 6.4049
Sxx = 829.0462,   Sxy = 191.361

β̂1 = Sxy / Sxx = 191.361 / 829.0462 = 0.2308

β̂0 = ȳ − β̂1x̄ = 6.4049 − 0.2308 × 34.6125 = −1.5837

Fitted simple linear regression model: ŷ = −1.5837 + 0.2308x

• residuals: σ̂² = MSE = (Σᵢ rᵢ²) / (n − 2) = 0.6088, where rᵢ = yᵢ − ŷᵢ

• standard error of the regression coefficients:

se(β̂1) = √(σ̂²/Sxx) = √(0.6088 / 829.0462) = 0.0271

se(β̂0) = √(σ̂²[1/n + x̄²/Sxx]) = √(0.6088 × [1/24 + 34.6125²/829.0462]) = 0.9514
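These numbers can be reproduced end to end with a short script; here is a sketch (variable names are mine; the data are the 24 (sale price, taxes) pairs from the table above). Up to rounding it returns r = 0.8760, β̂1 = 0.2308, β̂0 = −1.5837, σ̂² = 0.6088, se(β̂1) = 0.0271, and se(β̂0) = 0.9514.

import numpy as np

# x = sale price / 1000, y = taxes / 1000 (24 houses, Narula and Wellington data)
x = np.array([25.9, 29.5, 27.9, 25.9, 29.9, 29.9, 30.9, 28.9, 35.9, 31.5, 31.0, 30.9,
              30.0, 36.9, 41.9, 40.5, 43.9, 37.5, 37.9, 44.5, 37.9, 38.9, 36.9, 45.8])
y = np.array([4.9176, 5.0208, 4.5429, 4.5573, 5.0597, 3.8910, 5.8980, 5.6039,
              5.8282, 5.3003, 6.2712, 5.9592, 5.0500, 8.2464, 6.6969, 7.7841,
              9.0384, 5.9894, 7.5422, 8.7951, 6.0831, 8.3607, 8.1400, 9.1416])

n = len(x)
xbar, ybar = x.mean(), y.mean()                        # 34.6125, 6.4049
Sxx = np.sum((x - xbar) ** 2)                          # 829.0462
Sxy = np.sum((x - xbar) * (y - ybar))                  # 191.361
r = Sxy / np.sqrt(Sxx * np.sum((y - ybar) ** 2))       # sample correlation, approx. 0.8760
b1 = Sxy / Sxx                                         # slope, approx. 0.2308
b0 = ybar - b1 * xbar                                  # intercept, approx. -1.5837
resid = y - (b0 + b1 * x)                              # residuals
sigma2 = np.sum(resid ** 2) / (n - 2)                  # MSE, approx. 0.6088
se_b1 = np.sqrt(sigma2 / Sxx)                          # approx. 0.0271
se_b0 = np.sqrt(sigma2 * (1.0 / n + xbar ** 2 / Sxx))  # approx. 0.9514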

11-4.2 Analysis of Variance Approach to Test Significance of Regression

A method called the analysis of variance can be used to test for significance of regression. The procedure partitions the total variability in the response variable into meaningful components as the basis for the test. The analysis of variance identity is

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²   (sums over i = 1, …, n)    (11-24)

Writing SSR = Σ(ŷᵢ − ȳ)² and SSE = Σ(yᵢ − ŷᵢ)², the test statistic for H0: β1 = 0 is F0 = MSR/MSE = (SSR/1)/(SSE/(n − 2)) (Equation 11-26), which follows the F distribution with 1 and n − 2 degrees of freedom under H0.

Note that the analysis of variance procedure for testing for significance of regression is equivalent to the t-test in Section 11-4.1. That is, either procedure will lead to the same conclusions.

This is easy to demonstrate by starting with the t-test statistic in Equation 11-19 with β1,0 = 0, say

T0 = β̂1 / √(σ̂²/Sxx)    (11-27)

Squaring both sides of Equation 11-27 and using the fact that σ̂² = MSE results in

T0² = β̂1²Sxx / MSE = β̂1Sxy / MSE = MSR / MSE    (11-28)

Note that T0² in Equation 11-28 is identical to F0 in Equation 11-26. It is true, in general, that the square of a t random variable with v degrees of freedom is an F random variable, with one and v degrees of freedom in the numerator and denominator, respectively. Thus, the test using

T0 is equivalent to the test based on F0. Note, however, that the t-test is somewhat more flexible in that it would allow testing against a one-sided alternative hypothesis, while the F-test is restricted to a two-sided alternative.
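The fact that the square of a t random variable with v degrees of freedom is an F(1, v) random variable is easy to confirm numerically; a quick sketch comparing the two α = 0.05 cutoffs at v = 22 degrees of freedom:

from scipy import stats

v, alpha = 22, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=v)        # two-sided t cutoff, approx. 2.074
f_crit = stats.f.ppf(1 - alpha, dfn=1, dfd=v)    # F cutoff with 1 and v df, approx. 4.30
print(t_crit ** 2, f_crit)                       # equal up to floating-point error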


EXAMPLE 11-2 Oxygen Purity Tests of Coefficients

We will test for significance of regression using the model for the oxygen purity data from Example 11-1. The hypotheses are

H0: β1 = 0
H1: β1 ≠ 0

and we will use α = 0.01. From Example 11-1 and Table 11-2 we have

β̂1 = 14.947,   n = 20,   Sxx = 0.68088,   σ̂² = 1.18

so the t-statistic in Equation 11-19 becomes

t0 = β̂1 / √(σ̂²/Sxx) = 14.947 / √(1.18/0.68088) = 11.35

Practical Interpretation: Since the reference value of t is t_{0.005,18} = 2.88, the value of the test statistic is very far into the critical region, implying that H0: β1 = 0 should be rejected. There is strong evidence to support this claim. The P-value for this test is P ≈ 1.23 × 10⁻⁹. Minitab reports the same t-statistic (11.35) with P-value 0.000, and also reports the t-statistic for testing H0: β0 = 0, computed from Equation 11-22 with β0,0 = 0, as t0 = 46.62; clearly, the hypothesis that the intercept is zero is rejected as well.

Back to the house data (Exercise 11-26 asks for a test of H0: β1 = 0 on these data using the t-test with α = 0.05):

• calculate the test statistic: t0 = β̂1 / se(β̂1) = 0.2308 / 0.0271 = 8.5166
• threshold: t_{0.0025,22} = 3.119 (with α = 0.05 the cutoff would be t_{0.025,22} ≈ 2.074; the conclusion is the same either way)
• the value of the test statistic is greater than the threshold
• → reject H0
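The same test in code (a sketch using the slide's rounded values, so the last digits may differ slightly from an exact computation):

from scipy import stats

b1_hat, se_b1, n = 0.2308, 0.0271, 24
t0 = b1_hat / se_b1                                        # approx. 8.5166
p = 2 * stats.t.sf(abs(t0), df=n - 2)                      # far below 0.05, so reject H0
print(t0, p, abs(t0) > stats.t.ppf(1 - 0.0025, df=n - 2))  # cutoff approx. 3.119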



• construct a confidence interval for the slope parameter of the house data (Equation 11-29):


t_{α/2, n−2} = t_{0.0025,22} = 3.119

0.2308 − 3.119 × 0.0271 ≤ β1 ≤ 0.2308 + 3.119 × 0.0271

0.1463 ≤ β1 ≤ 0.3153
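The same interval in code (again with the slide's rounded values; note that t_{0.0025,22} = 3.119 corresponds to a 99.5% interval, while a 95% interval would use t_{0.025,22} ≈ 2.074):

from scipy import stats

b1_hat, se_b1, n = 0.2308, 0.0271, 24
t_crit = stats.t.ppf(1 - 0.0025, df=n - 2)               # approx. 3.119
print(b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)  # approx. (0.1463, 0.3153)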
