Lecture 9: Simple Linear Regression
Junshu Bao
University of Pittsburgh
Table of contents
Introduction
Data Exploration
Simple Linear Regression
Utility Test
Assess Model Adequacy
Inference about Linear Regression
Model Diagnostics
Models without an Intercept
Simple Linear Regression

The simple linear regression model is
\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]
where
- y_i is the i-th observation of the response variable y
- x_i is the i-th observation of the explanatory variable x
- ε_i is an error term, with ε_i ∼ N(0, σ²)
- β_0 is the intercept
- β_1 is the slope of the linear relationship between y and x

"Simple" here means the model has only a single explanatory variable.
Least-Squares Method
Choose β0 and β1 that minimize the sum of the squares of the vertical deviations.
\[
Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2 .
\]
Taking partial derivatives of Q(β_0, β_1) and solving the resulting system of equations yields the least-squares estimators:
\[
\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}},
\qquad
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} .
\]
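PROC REG computes these estimates automatically (as shown later in this lecture), but the formulas can also be verified by hand. The following is a minimal SAS sketch, assuming the drinking data set read in below; the intermediate data set names (mns, parts, sums, est) are illustrative:

proc means data=drinking noprint;
  var alcohol cirrhosis;
  output out=mns mean=xbar ybar;   /* sample means of x and y */
run;

data parts;
  if _n_ = 1 then set mns;         /* attach xbar and ybar to every row */
  set drinking;
  sxy = (alcohol - xbar) * (cirrhosis - ybar);
  sxx = (alcohol - xbar) ** 2;
run;

proc means data=parts noprint;
  var sxy sxx;
  output out=sums sum=Sxy Sxx;     /* S_xy and S_xx */
run;

data est;
  merge mns sums;
  b1 = Sxy / Sxx;                  /* slope:     Sxy / Sxx        */
  b0 = ybar - b1 * xbar;           /* intercept: ybar - b1 * xbar */
  put b0= b1=;                     /* should reproduce the PROC REG estimates */
run;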
Predicted Values and Residuals
- Predicted (estimated) values of y from the fitted model:
  \[
  \hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i
  \]
- Residuals: e_i = y_i − ŷ_i
  - Residuals play an important role in diagnosing a fitted model.
  - They are sometimes standardized before use (different statisticians favor different standardizations).
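In SAS, PROC REG's OUTPUT statement saves both quantities to a data set. A minimal sketch (the output data set name fits is illustrative), previewing the fuller OUTPUT statement used in the diagnostics section:

proc reg data=drinking;
  model cirrhosis = alcohol;
  output out=fits p=yhat r=resid;  /* predicted values and raw residuals */
run;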
Alcohol and Death Rate

A study in Osborne (1979):
- Independent variable: average alcohol consumption, in liters per person per year
- Dependent variable: death rate per 100,000 people from cirrhosis or alcoholism
- Data on 15 countries (the number of observations)
- Question: Can we predict the cirrhosis death rate from alcohol consumption?
Analysis Plan
- Obtain basic descriptive statistics (mean and standard deviation)
- Draw a scatterplot to visualize the relationship
- Run a simple linear regression
- Perform diagnostic checks
- Interpret the model
Data Structure
Country        Alcohol Consumption   Cirrhosis and Alcoholism
               (l/person/year)       (death rate/100,000)
France         24.7                  46.1
Italy          15.2                  23.6
West Germany   12.3                  23.7
Australia      10.9                   7.0
Belgium        10.8                  12.3
USA             9.9                  14.2
Canada          8.3                   7.4
...             ...                   ...
Israel          3.1                   5.4
Reading Data
data drinking;
  input country $ 1-12 alcohol cirrhosis;
  cards;
France      24.7 46.1
Italy       15.2 23.6
W.Germany   12.3 23.7
...
Israel       3.1  5.4
;
run;
- Some of the country names are longer than the default length of eight characters for character variables, so column input is used to read them in.
- The values of the two numeric variables can then be read in with list input.
Descriptive Statistics, Correlation and Scatterplot
proc means data=drinking;
  var alcohol cirrhosis;
run;

proc sgplot data=drinking;
  scatter x=alcohol y=cirrhosis / datalabel=country;
run;

proc corr data=drinking;
run;
- There is a very strong, positive, linear relationship between death rate and a country's average alcohol consumption.
- Notice that France is an outlier in both the x and y directions. We will address this problem later.
- The correlation between the two variables is 0.9388, with a p-value less than 0.0001.
- The datalabel option specifies a variable whose values are used as labels on the plot.
Fit Linear Regression Model

The linear regression model can be fitted using the following SAS code.
proc reg data=drinking;
  model cirrhosis=alcohol;
run;
The fitted linear regression equation is
Death rate = −6.00 + 1.98 × Average Alcohol Consumption
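As a quick illustration of using the fitted equation, a hypothetical country with an average consumption of 10 liters per person per year would have a predicted death rate of
\[
-6.00 + 1.98 \times 10 = 13.8 \text{ per } 100{,}000 .
\]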
Partition of Sum of Squares and ANOVA Table

The analysis of variance identity for a linear regression model is
\[
\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{SST} =
\underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{SSR} +
\underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{SSE}
\]
The corresponding ANOVA table is:
Source   DF    SS    MS               F Value        Pr > F
Model    1     SSR   MSR = SSR/1      F0 = MSR/MSE   p-value
Error    n-2   SSE   MSE = SSE/(n-2)
Total    n-1   SST

where SS is the sum of squares, MS is the mean square, and
\[
\text{p-value} = P(F_{1,\,n-2} > F_0)
\]
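For the alcohol example, the table can be assembled from quantities reported later in these slides (F0 = 96.61, MSE = 17.39076, n = 15); the remaining entries below are back-calculated from those values and rounded:

Source   DF   SS       MS       F Value   Pr > F
Model    1    1680.1   1680.1   96.61     < 0.0001
Error    13    226.1    17.39
Total    14   1906.2

As a consistency check, SSR/SST = 1680.1/1906.2 ≈ 0.8814, which matches the R² reported in the model-adequacy section.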
Utility Test
For a simple linear regression (SLR) model, if the slope (β1) is zero, there is no linear relationship between x and y (the model is useless). A utility test for SLR is as follows:
1. H0 : β1 = 0 versus Ha : β1 ≠ 0
2. Test statistic: F0 = MSR/MSE
3. p-value = P(F_{1,n−2} > F_0)
4. Decision and conclusion

The test results for the alcohol consumption example:
- Test statistic: F0 = 96.61
- p-value = P(F_{1,15−2} > 96.61) < 0.0001
- Reject the null hypothesis.
Coefficient of Determination
- The analysis of variance identity:
  \[
  \underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{SST} =
  \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{SSR} +
  \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{SSE}
  \]
- The coefficient of determination, denoted by R², is
  \[
  R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
  \]
- R² is often used to judge the adequacy of a regression model.
- We often refer loosely to R² as the amount of variability in the data explained or accounted for by the regression model.
- 0 ≤ R² ≤ 1. It can be shown that R² = r², the square of the sample correlation coefficient.
In the alcohol consumption example, R² = 0.8814.
Estimation of Variances
- The variance σ² is estimated by s², given by
  \[
  s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{SSE}{n-2} \equiv MSE
  \]
- The estimated variance of β̂1 is
  \[
  \widehat{\mathrm{Var}}(\hat\beta_1) = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  \]
- The estimated variance of a predicted value y* at a given value of x, say x*, is
  \[
  \widehat{\mathrm{Var}}(y^*) = s^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]
  \]
Estimation of Variances (example)
- The estimator of the variance, s², is the MSE:
  \[
  s^2 = 17.39076 \implies s = \sqrt{17.39076} = 4.17022
  \]
  In the SAS output, s is called Root MSE.
- The standard error of β̂1 is given in the table of parameter estimates:
  \[
  se(\hat\beta_1) = 0.20123
  \]
Inference about the Slope (β1)

Under our model assumptions,
\[
T = \frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim t_{n-2},
\qquad \text{where} \quad
se(\hat\beta_1) = \sqrt{s^2 \Big/ \textstyle\sum_{i=1}^{n} (x_i - \bar{x})^2} \equiv \sqrt{s^2 / S_{xx}} .
\]
- Suppose we want to test whether β1 equals a certain value, β1,0:
  H0 : β1 = β1,0 versus Ha : β1 ≠ (< or >) β1,0
- The test statistic under the null is
  \[
  t_0 = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{s^2 / S_{xx}}}
  \]
- Note that if the alternative is the two-sided (not equal) case and β1,0 = 0, this is the same as the utility test.
Hypothesis Test about β1

In the standard SAS output, the result of the test that is equivalent to the utility test is automatically included. For the alcohol example:
- Hypotheses: H0 : β1 = 0 versus Ha : β1 ≠ 0
- Test statistic:
  \[
  t_0 = \frac{\hat\beta_1}{\sqrt{s^2 / S_{xx}}} = \frac{1.97792}{0.20123} = 9.83
  \]
- p-value < 0.0001.
- Reject the null hypothesis.

With this standard output, it is straightforward to test other cases. For example, if the alternative is right-sided, Ha : β1 > 0, then the p-value is P(T_{n−2} > t_0), which here is half of the two-sided p-value (since t_0 > 0).
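As a quick sketch, the one-sided and two-sided p-values can be computed from the reported t statistic with a short DATA step; the values 9.83 and df = 13 come from the slides above:

data pvals;
  t0 = 9.83;                              /* reported t statistic */
  df = 13;                                /* n - 2 = 15 - 2       */
  p_two   = 2 * (1 - probt(abs(t0), df)); /* two-sided p-value    */
  p_right = 1 - probt(t0, df);            /* right-sided p-value  */
  put p_two= p_right=;
run;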
Model Diagnostics

Check the model assumptions:
- Constant variance
- Normality of the error terms

The residuals e_i = y_i − ŷ_i play an essential role in diagnosing a fitted model.
- Diagnostic plots:
  - Residuals versus fitted values
  - Residuals versus explanatory variables
  - Normal probability plot of the residuals
- Residuals and Studentized residuals
- Cook's distance
- Leverage
Residuals and Studentized Residual
Residuals are used to identify outlying or extreme y observations.
- Residuals: e_i = y_i − ŷ_i
- Studentized residuals:
  \[
  r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{s \sqrt{1 - h_{ii}}}
  \]
  where h_ii is the i-th element in the diagonal of the hat matrix.

While the residuals e_i will have substantially different sampling variations if their standard deviations differ markedly, the Studentized residuals r_i have constant variance.
Deleted Residuals and Studentized Deleted Residual
If y_i is far outlying, the regression line may be influenced to come close to y_i, yielding a fitted value ŷ_i near y_i, so that the residual e_i will be small and will not disclose that y_i is outlying. A refinement that makes residuals more effective for detecting y outliers is to measure the i-th residual when the fitted regression is based on all the cases except the i-th one.
- Deleted residuals:
  \[
  d_i = y_i - \hat{y}_{i(i)} = \frac{e_i}{1 - h_{ii}}
  \]
  where ŷ_{i(i)} is the fitted value of y_i with the i-th case omitted.
- Studentized deleted residuals:
  \[
  t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}
  \]
  where s_{(i)} is s computed with the i-th case omitted.
Cook's Distance

Cook's D statistic is a measure of the influence of an observation on the regression line. The statistic is defined as follows:
\[
D_k = \frac{1}{(p+1) s^2} \sum_{i=1}^{n} \left( \hat{y}_{i(k)} - \hat{y}_i \right)^2
\]
where ŷ_{i(k)} is the fitted value of the i-th observation when the k-th observation is omitted from the model, and p is the number of explanatory variables.
- The values of D_k assess the impact of the k-th observation on the estimated regression coefficients.
- A value of D_k greater than 1 implies undue influence on the model coefficients.
Leverage
This is a measure of how unusual the x value of a point is, relative to the x observations as a whole.
- The leverage of observation i is h_ii, the i-th diagonal element of the hat matrix.
- A leverage value h_ii is usually considered to be large if it is more than twice as large as the mean leverage value, (p + 1)/n, i.e.,
  \[
  h_{ii} > \frac{2(p+1)}{n}
  \]
  where p is the number of predictors in the regression model. (For the drinking data, p = 1 and n = 15, so the cutoff is 4/15 ≈ 0.27, which motivates the leverage > .3 screen used below.)
- Another suggested guideline:
  - h_ii > 0.5 ⟹ high leverage
  - 0.2 ≤ h_ii ≤ 0.5 ⟹ moderate leverage
  - h_ii < 0.2 ⟹ low leverage
Diagnostic Plots

ODS Graphics is used to generate diagnostic plots:

ods graphics on;
proc reg data=drinking;
  model cirrhosis=alcohol;
run;
ods graphics off;
- Raw residuals against predicted values
- Standardized/Studentized residuals against predicted values
- Studentized residuals against leverage
- Cook's D against observation number
- Histogram of the residuals

In both of the residual plots, there is a large negative residual. In the residuals vs. leverage plot, there is a large leverage value. The first observation has a very large Cook's D.
Identifying Outlying Observations and Influential Points

To identify the observations with large residuals, high leverage, or large Cook's distance, we need to save these quantities:

proc reg data=drinking;
  model cirrhosis=alcohol;
  output out = drinkingfit
    predicted = yhat
    student   = student    /* Studentized residuals         */
    rstudent  = rstudent   /* Studentized deleted residuals */
    h         = leverage
    cookd     = cooksd;
run;

proc print data = drinkingfit;
  where abs(rstudent) > 2 or leverage > .3 or cooksd > 1;
run;
- Influential point: France (Cook's D = 1.54)
- x-outlier and influential point: France (leverage = 0.64)
- y-outlier: Australia (Studentized deleted residual = −2.55)
Problem of Outlying Observations

We have found that France is an x-outlier and an influential point. Outlying observations may distort the values of the estimated regression coefficients. Let us simply drop it and refit the model:
ods graphics on;
proc reg data=drinking;
  model cirrhosis=alcohol;
  where country ne 'France';
run;
ods graphics off;
- The residuals look more normal.
- There are no influential points.
- There are two marginally large residuals.
Models without an Intercept

In some applications of simple linear regression, a model without an intercept is required. The model is of the following form:
\[
y_i = \beta x_i + \varepsilon_i
\]
In this case, application of the least-squares method gives the following estimator of β:
\[
\hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}
\]
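As with the two-parameter model, this estimator follows from minimizing the sum of squared deviations; a one-line sketch of the derivation:
\[
Q(\beta) = \sum_{i=1}^{n} (y_i - \beta x_i)^2, \qquad
\frac{dQ}{d\beta} = -2 \sum_{i=1}^{n} x_i (y_i - \beta x_i) = 0
\;\Longrightarrow\;
\hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} .
\]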
Example: Estimating the Age of the Universe

The dominant motion in the universe is the smooth expansion known as Hubble's Law:
\[
V = H_0 D
\]
where
- V is the recessional velocity, the observed velocity of the galaxy away from us, usually in km/sec
- H0 is Hubble's constant, in km/sec/Mpc
- D is the distance to the galaxy, in Mpc

Based on this law (if the expansion rate has been constant, the travel time is t = D/V = 1/H0), we can easily estimate the age of the universe:
\[
\text{age} = \frac{1}{H_0}
\]
Example: Estimating the Age of the Universe (cont.)

Wood (2006) gives the relative velocity and the distances of 24 galaxies, according to measurements made using the Hubble Space Telescope.
Observation   Galaxy     Velocity (km/sec)   Distance (Mpc)
1             NGC0300      133                 2.00
2             NGC0925      664                 9.16
3             NGC1326A    1794                16.14
...           ...          ...                  ...
24            NGC7331      999                14.72

where Mpc stands for mega-parsec and 1 Mpc = 3.09 × 10^19 km.
Scatterplot

After reading in the data file, we can create a scatterplot to visualize the relationship between velocity and distance.
proc sgplot data=universe;
  scatter y=velocity x=distance;
  yaxis label='Velocity (km/sec)';
  xaxis label='Distance (mega-parsec)';
run;
- The yaxis and xaxis statements are used to give the units of measurement in the axis labels.
- The diagram shows a clear, strong relationship between velocity and distance.
- The scatterplot also suggests that the intercept is essentially zero, as Hubble's Law implies.
Model Diagnostics

There are two large Studentized deleted residuals (absolute value > 2) and two large leverage values (> 2/24 ≈ 0.08). We can identify these observations by saving these values:

proc reg data=universe;
  model velocity = distance / noint;
  output out=regout
    predicted = pred
    rstudent  = rstudent
    h         = leverage;
run;
quit;
proc print data=regout;
  where abs(rstudent) > 2;
run;

proc print data=regout;
  where leverage > .08;
run;
Age of the Universe
Now we can use the estimated value of β to find an approximate value for the age of the universe.
\[
\text{Age} = \frac{1}{\hat\beta \ \text{km/sec/Mpc}} \times \frac{3.09 \times 10^{19}\ \text{km}}{1\ \text{Mpc}} \times \frac{1\ \text{year}}{3.15 \times 10^{7}\ \text{sec}}
\]
It is approximately 12.8 billion years.
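A sketch of this unit conversion as a SAS DATA step. The slope value 76.58 km/sec/Mpc is an assumption for illustration only (the slides do not print the estimate); substitute β̂ from the PROC REG output:

data age;
  beta_hat    = 76.58;           /* assumed slope (km/sec/Mpc); replace with the PROC REG estimate */
  km_per_mpc  = 3.09e19;         /* kilometers per mega-parsec */
  sec_per_yr  = 3.15e7;          /* seconds per year           */
  age_years   = (1 / beta_hat) * km_per_mpc / sec_per_yr;
  age_billion = age_years / 1e9; /* roughly 12.8 billion years */
  put age_billion=;
run;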