Lectures 9: Simple Linear Regression
Junshu Bao, University of Pittsburgh

Table of Contents
- Introduction
- Data Exploration
- Simple Linear Regression
- Utility Test
- Assess Model Adequacy
- Inference about Linear Regression
- Model Diagnostics
- Models without an Intercept

Introduction

Simple Linear Regression
The simple linear regression model is

  y_i = \beta_0 + \beta_1 x_i + \epsilon_i

where
- y_i is the i-th observation of the response variable y
- x_i is the i-th observation of the explanatory variable x
- \epsilon_i is an error term, with \epsilon_i \sim N(0, \sigma^2)
- \beta_0 is the intercept
- \beta_1 is the slope of the linear relationship between y and x
The "simple" here means that the model has only a single explanatory variable.

Least-Squares Method
Choose \beta_0 and \beta_1 to minimize the sum of the squared vertical deviations,

  Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2 .

Taking partial derivatives of Q(\beta_0, \beta_1) and solving the resulting system of equations yields the least-squares estimators

  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} , \qquad
  \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} .

Predicted Values and Residuals
- Predicted (estimated) values of y from the model: \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i
- Residuals: e_i = y_i - \hat{y}_i
  - Residuals play an important role in diagnosing a fitted model.
  - They are sometimes standardized before use (different statisticians prefer different options).

Alcohol and Death Rate
A study in Osborne (1979):
- Independent variable: average alcohol consumption, in liters per person per year
- Dependent variable: death rate per 100,000 people from cirrhosis or alcoholism
- Data on 15 countries (n = 15 observations)
- Question: Can we predict the cirrhosis death rate from alcohol consumption?

Analysis Plan
- Obtain basic descriptive statistics (mean and standard deviation)
- Draw a scatterplot to visualize the relationship
- Run a simple linear regression
- Carry out diagnostic checks
- Interpret the model

Data Structure

  Country        Alcohol Consumption   Cirrhosis and Alcoholism
                 (l/Person/Year)       (Death Rate/100,000)
  France               24.7                   46.1
  Italy                15.2                   23.6
  West Germany         12.3                   23.7
  Australia            10.9                    7.0
  Belgium              10.8                   12.3
  USA                   9.9                   14.2
  Canada                8.3                    7.4
  ...
  Israel                3.1                    5.4

Reading Data

data drinking;
  input country $ 1-12 alcohol cirrhosis;
  cards;
France        24.7  46.1
Italy         15.2  23.6
W.Germany     12.3  23.7
... ...
Israel         3.1   5.4
;
run;

- Some of the country names are longer than the default length of eight for character variables, so column input is used to read them in.
- The values of the two numeric variables can then be read in with list input.

Data Exploration

Descriptive Statistics, Correlation and Scatterplot

proc means data=drinking;
  var alcohol cirrhosis;
run;

proc sgplot data=drinking;
  scatter x=alcohol y=cirrhosis / datalabel=country;
run;

proc corr data=drinking;
run;

- There is a very strong, positive, linear relationship between the death rate and a country's average alcohol consumption.
- Notice that France is an outlier in both the x and y directions. We will address this problem later.
- The correlation between the two variables is 0.9388, with a p-value less than 0.0001.
- The datalabel= option specifies a variable whose values are used as labels on the plot.
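Because the least-squares estimators above have a closed form, they can be checked by hand on the drinking data set before any model is fitted. The following DATA step is a minimal sketch and is not part of the original slides; the accumulator and variable names (sx, sxy, beta1, and so on) are arbitrary, and the logic simply builds Sxy and Sxx and applies β̂1 = Sxy/Sxx and β̂0 = ȳ - β̂1 x̄.

/* Sketch (not from the slides): compute the least-squares estimates by hand. */
data _null_;
  set drinking end=last;
  n   + 1;                   /* sum statements retain their values and start at 0 */
  sx  + alcohol;
  sy  + cirrhosis;
  sxy + alcohol*cirrhosis;
  sxx + alcohol**2;
  if last then do;
    xbar  = sx/n;
    ybar  = sy/n;
    beta1 = (sxy - n*xbar*ybar) / (sxx - n*xbar**2);   /* Sxy / Sxx */
    beta0 = ybar - beta1*xbar;                         /* ybar - beta1*xbar */
    put "Least-squares slope:     " beta1=;
    put "Least-squares intercept: " beta0=;
  end;
run;

Run on the full 15-country data set, this should reproduce, up to rounding, the estimates that PROC REG reports in the next section (intercept about -6.00, slope about 1.98).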
Simple Linear Regression

Fit Linear Regression Model
The linear regression model can be fitted using the following SAS code.

proc reg data=drinking;
  model cirrhosis=alcohol;
run;

The fitted linear regression equation is

  Death rate = -6.00 + 1.98 × Average Alcohol Consumption

Utility Test

Partition of Sum of Squares and ANOVA Table
The analysis of variance identity for a linear regression model is

  \underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\text{SST}}
    = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{SSR}}
    + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{SSE}}

The corresponding ANOVA table is:

  Source   DF    SS    MS                F Value         Pr > F
  Model    1     SSR   MSR = SSR/1       F_0 = MSR/MSE   p-value
  Error    n-2   SSE   MSE = SSE/(n-2)
  Total    n-1   SST

where SS denotes a sum of squares, MS a mean square, and p-value = P(F_{1,n-2} > F_0).

Utility Test
For a simple linear regression (SLR) model, if the slope (β1) is zero there is no linear relationship between x and y, and the model is useless. A utility test of SLR proceeds as follows:
1. H0: β1 = 0 versus Ha: β1 ≠ 0
2. Test statistic: F_0 = MSR/MSE
3. p-value = P(F_{1,n-2} > F_0)
4. Decision and conclusion

The test results for the alcohol consumption example:
- Test statistic: F_0 = 96.61
- p-value = P(F_{1,13} > 96.61) < 0.0001
- Reject the null hypothesis.

Assess Model Adequacy

Coefficient of Determination
- Recall the analysis of variance identity SST = SSR + SSE.
- The coefficient of determination, denoted by R², is
    R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}
- R² is often used to judge the adequacy of a regression model.
- We often refer loosely to R² as the amount of variability in the data explained or accounted for by the regression model.
- 0 ≤ R² ≤ 1, and it can be shown that R² = r², the square of the sample correlation coefficient.
In the alcohol consumption example, R² = 0.8814.

Inference about Linear Regression

Estimation of Variances
- The variance σ² is estimated by s², given by
    s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{\text{SSE}}{n-2} \equiv \text{MSE}
- The estimated variance of β̂1 is
    \widehat{\text{Var}}(\hat{\beta}_1) = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
- The estimated variance of a predicted value y* at a given value of x, say x*, is
    \widehat{\text{Var}}(y^*) = s^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]

Estimation of Variances (example)
- The estimate of the variance, s², is the MSE:
    s^2 = 17.39076 \implies s = \sqrt{17.39076} = 4.17022
  In the SAS output, s is called Root MSE.
- The standard error of β̂1 is given in the table of parameter estimates: se(β̂1) = 0.20123.

Inference about the Slope (β1)
Under our model assumptions,

  T = \frac{\hat{\beta}_1 - \beta_1}{\text{se}(\hat{\beta}_1)} \sim t_{n-2},
  \qquad \text{where } \text{se}(\hat{\beta}_1) = \sqrt{\frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \equiv \sqrt{\frac{s^2}{S_{xx}}} .

- Suppose we want to test whether β1 equals a certain value β1,0:
    H0: β1 = β1,0 versus Ha: β1 ≠ β1,0 (or < or >)
- The test statistic under the null hypothesis is
    t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{s^2 / S_{xx}}}
- Note that when the alternative is two-sided and β1,0 = 0, this is the same as the utility test.
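The slide above covers tests of H0: β1 = β1,0 for a general null value. As a hedged sketch, not shown in the original slides, the same fitted model can be asked for parameter confidence limits and for a test of a nonzero null value directly in PROC REG; the CLB option and the TEST statement are standard PROC REG syntax, while the null value 1.5 below is purely illustrative.

proc reg data=drinking;
  model cirrhosis = alcohol / clb;  /* 95% confidence limits for beta0 and beta1 (default alpha=0.05) */
  test alcohol = 1.5;               /* F test of H0: beta1 = 1.5 (illustrative value only) */
run;
quit;

With a single restriction, the F statistic printed by the TEST statement is the square of the t statistic t_0 = (β̂1 - β1,0)/√(s²/Sxx) described above.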
Hypothesis Test about β1
In the standard SAS output, the result of the test that is equivalent to the utility test is automatically included. For example, in the alcohol example:
- Hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0
- Test statistic:
    t_0 = \frac{\hat{\beta}_1}{\sqrt{s^2 / S_{xx}}} = \frac{1.97792}{0.20123} = 9.83
- p-value < 0.0001
- Reject the null hypothesis.
With this standard output it is straightforward to test other cases. For example, if the alternative is right-sided, Ha: β1 > 0, then the p-value is P(T_{n-2} > t_0), which (since t_0 > 0 here) is half of the p-value of the two-sided test.

Model Diagnostics
Check the model assumptions:
- Constant variance
- Normality of the error terms
The residuals e_i = y_i - \hat{y}_i play an essential role in diagnosing a fitted model. Useful diagnostic tools:
- Diagnostic plots:
  - Residuals versus fitted values
  - Residuals versus explanatory variables
  - Normal probability plot of the residuals
- Residuals and studentized residuals
- Cook's distance
- Leverage

Residuals and Studentized Residuals
Residuals are used to identify outlying or extreme Y observations.
- Residuals: e_i = y_i - \hat{y}_i
- Studentized residuals:
    r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{s \sqrt{1 - h_{ii}}}
  where h_{ii} is the i-th diagonal element of the hat matrix.
While the residuals e_i have substantially different sampling variances when their standard deviations differ markedly, the studentized residuals r_i have constant variance.

Deleted Residuals and Studentized Deleted Residuals
If y_i is far outlying, the regression line may be pulled close to y_i, yielding a fitted value \hat{y}_i near y_i, so that the residual e_i will be small and will not disclose that y_i is outlying. A refinement that makes residuals more effective for detecting such y outliers is to measure the i-th residual when the fitted regression is based on all the cases except the i-th one.
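As a hedged sketch, not part of the original slides, the quantities discussed in this section, including the studentized deleted residuals just introduced, can be saved with the OUTPUT statement of PROC REG; the keywords STUDENT=, RSTUDENT=, COOKD= and H= are standard, while the output data set name diag and the new variable names are arbitrary choices for illustration.

proc reg data=drinking plots=diagnostics;
  model cirrhosis = alcohol;
  output out=diag            /* arbitrary name for the output data set */
         p=fitted            /* fitted values                          */
         r=resid             /* raw residuals e_i                      */
         student=stud_res    /* studentized residuals r_i              */
         rstudent=del_res    /* studentized deleted residuals          */
         cookd=cooksd        /* Cook's distance                        */
         h=leverage;         /* leverage h_ii                          */
run;
quit;

proc print data=diag; run;

Listing the output data set makes it easy to spot observations with large studentized deleted residuals, Cook's distance, or leverage; France, already flagged in the scatterplot as an outlier in both directions, is a natural candidate to examine.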