The Simple Linear Regression Model
Suppose we have a data set consisting of n bivariate observations {(x1, y1), ..., (xn, yn)}. The response variable y and the predictor variable x satisfy the simple linear model if they obey

    yi = β0 + β1 xi + εi,   i = 1, ..., n,   (1)

where the intercept and slope coefficients β0 and β1 are unknown constants and the random errors {εi} satisfy the following conditions:

1. The errors ε1, ..., εn all have mean 0, i.e., μ_εi = 0 for all i.
2. The errors ε1, ..., εn all have the same variance σ², i.e., σ²_εi = σ² for all i.
3. The errors ε1, ..., εn are independent random variables.
4. The errors ε1, ..., εn are normally distributed.

Note that the author provides these assumptions on page 564 BUT ORDERS THEM DIFFERENTLY.

Fitting the Simple Linear Model: Estimating β0 and β1

Suppose we believe our data obey the simple linear model. The next step is to fit the model by estimating the unknown intercept and slope coefficients β0 and β1. There are various ways of estimating these from the data, but we will use the Least Squares Criterion invented by Gauss. The least squares estimates of β0 and β1, which we will denote by β̂0 and β̂1 respectively, are the values of β0 and β1 that minimize the sum of squared errors S(β0, β1):

    S(β0, β1) = Σ_{i=1}^{n} ei²
              = Σ_{i=1}^{n} [yi − ŷi]²
              = Σ_{i=1}^{n} [yi − (β0 + β1 xi)]²,

where the ith modeling error ei is simply the difference between the ith value of the response variable yi and the fitted/predicted value ŷi. Note: the modeling errors {ei} are called residuals.

Since the sum of squared errors S(β0, β1) is a function of two variables, we use the Calc III approach and set the partial derivatives of S equal to zero:

    ∂S(β0, β1)/∂β0 = 0
    ∂S(β0, β1)/∂β1 = 0.

This yields a system of two linear equations in β0 and β1. Solving this system yields formulas 7.14 and 7.15 on page 530 for computing β̂0 and β̂1. Just as we don't compute square roots using Newton's method, we won't compute β̂0 and β̂1 using these formulas.
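For readers curious what solving the normal equations actually produces, here is a small sketch in Python. The toy data below are hypothetical (not the course data set), and the closed-form expressions used, β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β̂0 = ȳ − β̂1 x̄, are the standard least squares solutions that the textbook's formulas correspond to:

```python
# Least squares fit of y = b0 + b1*x by the closed-form solution
# of the normal equations (toy data, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Slope: sum of cross-deviations over sum of squared x-deviations
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
# Intercept: forces the fitted line through the point (xbar, ybar)
b0 = ybar - b1 * xbar

print(b0, b1)  # -> approximately 0.14 and 1.96 for this toy data
```

In practice, as the notes say, we let software do this arithmetic; the sketch just makes the minimization concrete.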
We will use Minitab and save our mental energy for more important things, such as assessing the adequacy of our model.

Exercise 1A. Fitting and using the regression model

i. Open the data set mpgweight.MTW at the bottom of the course website.
ii. Fit the model citympgi = β0 + β1(wti) + εi by
   1. Accessing the regression menu via Stat -> Regression -> Regression,
   2. Selecting citympg for Response: and wt for Predictors:, and
   3. Clicking the Storage option, then checking Residuals under the Diagnostic Measures column.
iii. What are the least squares estimates β̂0 and β̂1?
iv. Compute the predicted value of citympg (ŷ1) for the first observation, for which wt is 2705.
v. Compute the residual for the first observation, e1 = y1 − ŷ1 = y1 − (β̂0 + β̂1 wt1), for which wt1 = 2705, and compare it with Minitab's value in column RES1.

Model Assessment/Residual Analysis

The actual values of the random errors {εi} are unknown but, re-arranging equation 1, we have

    εi = yi − (β0 + β1 xi).

Since the ith residual is defined to be the difference between yi and ŷi, we have

    ei = yi − ŷi                  (2)
       = yi − (β̂0 + β̂1 xi)      (3)
       ≈ yi − (β0 + β1 xi)       (4)
       ≈ εi                      (5)

In other words, since the least squares estimates β̂0 and β̂1 estimate the coefficients β0 and β1, respectively, equations 2 through 5 show that the ith residual, ei, is an estimate of the ith random error, εi. Thus, if the random errors {εi} satisfy properties 1-4 above, then the residuals {ei} will satisfy them approximately. Using this fact, we check the regression assumptions by examining the behavior of the residuals as follows:

i. Construct a scatterplot of the residuals vs. the predictor variable x and check that
   1. the residuals exhibit no systematic trends about zero, and
   2. the residuals exhibit no systematic trends in variability.
If the residuals satisfy the first requirement, then our model fits the data well. If either of these two requirements is violated, we need to improve the model.
ii.
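The claim that ei ≈ εi can be illustrated numerically. The sketch below uses simulated data with a known line and known errors (a hypothetical stand-in for the course data, where the true errors are of course unobservable), fits by least squares, and compares the residuals to the true errors:

```python
import random

random.seed(0)

# Simulate from a known model: y = 2 + 0.5*x + eps, eps ~ N(0, 1)
b0_true, b1_true = 2.0, 0.5
xs = [float(i) for i in range(1, 31)]
eps = [random.gauss(0, 1) for _ in xs]
ys = [b0_true + b1_true * x + e for x, e in zip(xs, eps)]

# Least squares fit via the closed-form solution
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

# Residuals e_i = y_i - (b0_hat + b1_hat*x_i) approximate the true eps_i
res = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(max(abs(r - e) for r, e in zip(res, eps)))  # small when the fit is good
print(sum(res))  # least squares residuals sum to (numerically) zero
```

A side effect worth noticing: with an intercept in the model, the residuals always sum to zero exactly, which is one reason assumption 1 must be checked through trends in a residual plot rather than through the residuals' overall mean.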
If we know the order in which the data were collected, then we can arrange the residuals in that order and check for lack of independence by creating a lag plot. In many cases we won't know the order of the data but, based on the nature of the data and/or how they were acquired, we can reasonably assume independence.
iii. If the residuals satisfy i and ii above, then we can check the normality assumption by creating a normality plot of the residuals. If the resulting points don't deviate systematically from a straight line, we conclude the normality assumption is met.

Exercise 1B. Assessing the fit of the model

i. Plot the residuals (RES1) versus wt. Do you see any violations of the assumption that the residuals have mean 0 and constant variance, i.e., do the residuals exhibit any systematic trends away from zero or with respect to variability?
ii. Since we have a random sample of cars, the independence assumption should be met.
iii. Check the normality assumption. What do you conclude?

2. Fitting and assessing the least squares line, take 2

i. Using the mpgweight.MTW data set, create a column containing the inverse of weight using Minitab's Calc -> Calculator menu and name it invwt.
ii. Fit the model citympgi = β0 + β1(invwti) + εi following the procedure in 1A, part ii above.
iii. Repeat the steps in 1B to assess the fit of this model. Note that Minitab puts the residuals in a column named RES2 since RES1 is in use.
iv. Based on the residuals, which of the two models fits the data better?

3.