Lecture 9: Simple Linear Regression

Junshu Bao

University of Pittsburgh

Table of contents

Introduction

Data Exploration

Simple Linear Regression

Utility Test

Assess Model Adequacy

Inference about Linear Regression

Model Diagnostics

Models without an Intercept

Introduction

Simple Linear Regression

The simple linear regression model is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where

- $y_i$ is the $i$th observation of the response variable $y$
- $x_i$ is the $i$th observation of the explanatory variable $x$
- $\epsilon_i$ is an error term and $\epsilon_i \sim N(0, \sigma^2)$
- $\beta_0$ is the intercept
- $\beta_1$ is the slope of the linear relationship between $y$ and $x$

The "simple" here means the model has only a single explanatory variable.


Least-Squares Method

Choose β0 and β1 that minimize the sum of the squares of the vertical deviations.

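$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2.$$

Taking partial derivatives of $Q(\beta_0, \beta_1)$ and solving the resulting system of equations yields the least-squares estimators:

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \frac{S_{xy}}{S_{xx}}.$$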
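The step from the criterion $Q$ to the estimators can be spelled out as follows. This is the standard normal-equations argument written in the notation of this slide; it is a sketch added for clarity, not taken from the original slides.

$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n}[y_i - (\beta_0 + \beta_1 x_i)] = 0, \qquad \frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n}x_i[y_i - (\beta_0 + \beta_1 x_i)] = 0.$$

The first equation gives $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. Substituting this into the second equation gives

$$\sum_{i=1}^{n} x_i\,[(y_i - \bar y) - \hat\beta_1 (x_i - \bar x)] = 0 \quad\Longrightarrow\quad \hat\beta_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar y)}{\sum_{i=1}^{n} x_i (x_i - \bar x)} = \frac{S_{xy}}{S_{xx}},$$

where the last equality uses $\sum_{i=1}^{n}\bar x\,(y_i - \bar y) = 0$ and $\sum_{i=1}^{n}\bar x\,(x_i - \bar x) = 0$.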


Predicted Values and Residuals

- Predicted (estimated) values of $y$ from a model:

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$

- Residuals:

$$e_i = y_i - \hat y_i$$

  - Play an important role in diagnosing a fitted model.
  - Sometimes standardized before use (different statisticians use different options).


Alcohol and Death Rate

A study in Osborne (1979):

- Independent variable: average alcohol consumption in liters per person per year
- Dependent variable: death rate per 100,000 people from cirrhosis or alcoholism
- Data on 15 countries (number of observations)
- Question: Can we predict the cirrhosis death rate from alcohol consumption?


Analysis Plan

- Obtain basic descriptive statistics (mean and standard deviation)
- Draw a scatterplot to visualize the relationship
- Run simple linear regression
- Run diagnostic checks
- Interpret the model


Data Structure

Country         Alcohol Consumption    Cirrhosis and Alcoholism
                (l/person/year)        (death rate/100,000)
France          24.7                   46.1
Italy           15.2                   23.6
West Germany    12.3                   23.7
Australia       10.9                    7.0
Belgium         10.8                   12.3
USA              9.9                   14.2
Canada           8.3                    7.4
......
Israel           3.1                    5.4


Reading Data

data drinking;
  input country $ 1-12 alcohol cirrhosis;
  cards;
France      24.7 46.1
Italy       15.2 23.6
W.Germany   12.3 23.7
......
Israel      3.1 5.4
;
run;

- Some of the country names are longer than the default length of eight characters for character variables, so column input is used to read them in.
- The values of the two numeric variables can then be read in with list input.

Data Exploration

Descriptive Statistics, Correlation and Scatterplot

proc means data=drinking;
  var alcohol cirrhosis;
run;

proc sgplot data=drinking;
  scatter x=alcohol y=cirrhosis / datalabel=country;
run;

proc corr data=drinking;
run;

- There is a very strong, positive, linear relationship between death rate and a country's average alcohol consumption.
- Notice that France is an outlier in both the x and y directions. We will address this problem later.
- The correlation between the two variables is 0.9388, with a p-value less than 0.0001.
- The datalabel option specifies a variable whose values are to be used as labels on the plot.

Simple Linear Regression

Fit Linear Regression Model

The linear regression model can be fitted using the following SAS code.

proc reg data=drinking;
  model cirrhosis=alcohol;
run;

The fitted linear regression equation is

Death rate = −6.00 + 1.98 × Average Alcohol Consumption
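As a reading aid for the fitted equation: each additional liter of average alcohol consumption is associated with an estimated increase of about 1.98 deaths per 100,000. For illustration only, take a hypothetical country (not one of the 15 in the data) with an average consumption of 10 liters per person per year. Its predicted death rate would be

$$-6.00 + 1.98 \times 10 = 13.8 \text{ per } 100{,}000.$$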

Utility Test

Partition of Sum of Squares and ANOVA Table

The analysis of variance identity of a linear regression model is:

$$\underbrace{\sum_{i=1}^{n}(y_i - \bar y)^2}_{SST} = \underbrace{\sum_{i=1}^{n}(\hat y_i - \bar y)^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(y_i - \hat y_i)^2}_{SSE}$$

The corresponding ANOVA table is:

Source   DF     SS     MS                  F Value         Pr > F
Model    1      SSR    MSR = SSR/1         F0 = MSR/MSE    p-value
Error    n-2    SSE    MSE = SSE/(n-2)
Total    n-1    SST

where SS is the sum of squares, MS is the mean square, and

$$\text{p-value} = P(F_{1,n-2} > F_0)$$


Utility Test

For a simple linear regression (SLR) model, if the slope (β1) is zero there is no linear relationship between x and y (the model is useless). A utility test of SLR is as follows:

1. $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$
2. Test statistic: $F_0 = MSR/MSE$
3. p-value $= P(F_{1,n-2} > F_0)$
4. Decision and conclusion

The test results of the alcohol consumption example:

- Test statistic: $F_0 = 96.61$
- p-value $= P(F_{1,15-2} > 96.61) < 0.0001$
- Reject the null.
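The p-value above can be reproduced from the F statistic alone. The following is a minimal sketch that assumes only the values quoted on this slide ($F_0 = 96.61$, $n = 15$); the variable names are illustrative. It uses the SAS `PROBF` function, which returns the cumulative F distribution.

```sas
/* Sketch: upper-tail p-value of the utility test from the quoted F statistic. */
data _null_;
  f0   = 96.61;                     /* test statistic from the ANOVA table */
  ndf  = 1;                         /* model degrees of freedom */
  ddf  = 15 - 2;                    /* error degrees of freedom, n - 2 */
  pval = 1 - probf(f0, ndf, ddf);   /* P(F(1, n-2) > F0) */
  put pval=;                        /* result is written to the SAS log */
run;
```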

Assess Model Adequacy

Coefficient of Determination

- The analysis of variance identity:

$$\underbrace{\sum_{i=1}^{n}(y_i - \bar y)^2}_{SST} = \underbrace{\sum_{i=1}^{n}(\hat y_i - \bar y)^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(y_i - \hat y_i)^2}_{SSE}$$

- Coefficient of determination, denoted by $R^2$:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

- $R^2$ is often used to judge the adequacy of a regression model.
- We often refer loosely to $R^2$ as the proportion of variability in the data explained or accounted for by the regression model.
- $0 \le R^2 \le 1$. It can be shown that $R^2 = r^2$, the square of the sample correlation coefficient.

In the alcohol consumption example, $R^2 = 0.8814$.
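The coefficient of determination and the utility-test statistic are linked. Dividing the numerator and denominator of $F_0$ by SST gives the following identity (a standard result, added here as a consistency check rather than taken from the slides):

$$F_0 = \frac{SSR/1}{SSE/(n-2)} = \frac{(n-2)\,R^2}{1 - R^2}.$$

With $n = 15$ and $R^2 = 0.8814$ this gives $13 \times 0.8814 / 0.1186 \approx 96.6$, which agrees with the reported $F_0 = 96.61$ up to rounding. Likewise $r^2 = 0.9388^2 \approx 0.8813$, consistent with $R^2 = r^2$.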

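Inference about Linear Regression

Estimation of Variances

- The variance $\sigma^2$ is estimated by $s^2$, given by

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{n-2} = \frac{SSE}{n-2} \equiv MSE$$

- The estimated variance of $\hat\beta_1$ is

$$\mathrm{Var}(\hat\beta_1) = \frac{s^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}$$

- The estimated variance of a predicted value $y^*$ at a given value of $x$, say $x^*$, is

$$\mathrm{Var}(y^*) = s^2\left[1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\right]$$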

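These variance estimates are what interval estimates are built from. A minimal sketch, assuming the `drinking` data set read in earlier: the `CLB`, `CLM`, and `CLI` options of the `MODEL` statement in `PROC REG` request confidence limits for the parameters, for the mean response, and for an individual predicted value, respectively. The limits requested by `CLI` correspond to the variance of a predicted value $y^*$ given above.

```sas
/* Sketch: interval estimates based on the variance formulas above. */
proc reg data=drinking;
  model cirrhosis=alcohol / clb   /* confidence limits for the intercept and slope */
                            clm   /* confidence limits for the mean response */
                            cli;  /* prediction limits for an individual response */
run;
quit;
```

By default the limits are computed at the 95% level.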

Estimation of Variances (example)

- The estimate of the variance, $s^2$, is MSE:

$$s^2 = 17.39076 \implies s = \sqrt{17.39076} = 4.17022$$

  In the SAS output, $s$ is called Root MSE.
- The standard error of $\hat\beta_1$ is given in the table of parameter estimates:

$$se(\hat\beta_1) = 0.20123$$


Inference about the Slope ($\beta_1$)

Under our model assumptions,

$$T = \frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim t_{n-2}$$

where

$$se(\hat\beta_1) = \sqrt{s^2 \Big/ \sum_{i=1}^{n}(x_i - \bar x)^2} \equiv \sqrt{s^2/S_{xx}}.$$

- Suppose we want to test whether $\beta_1$ equals a certain value, $\beta_{1,0}$:

$$H_0: \beta_1 = \beta_{1,0} \quad \text{versus} \quad H_a: \beta_1 \neq (< \text{ or } >)\ \beta_{1,0}$$

- The test statistic under the null is

$$t_0 = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{s^2/S_{xx}}}$$

- Note that if the alternative is two-sided and $\beta_{1,0} = 0$, this test is the same as the utility test.

Hypothesis Test about $\beta_1$

In the standard SAS output, the result of the test that is equivalent to the utility test is automatically included. For example, in the alcohol example,

- Hypotheses: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$
- Test statistic:

$$t_0 = \frac{\hat\beta_1}{\sqrt{s^2/S_{xx}}} = \frac{1.97792}{0.20123} = 9.83$$

- p-value < 0.0001.
- Reject the null.

With this standard output, it is straightforward to test other cases. For example, if the alternative is right-sided, $H_a: \beta_1 > 0$, then the p-value is $P(T_{n-2} > t_0)$, which is half of the two-sided p-value when $t_0 > 0$, as it is here.
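The one-sided test can be carried out by hand from the standard output. The sketch below assumes only the estimates quoted in these slides (slope 1.97792, standard error 0.20123, $n = 15$); all variable names are illustrative. It uses the SAS functions `PROBT` (cumulative t distribution) and `TINV` (t quantile). The 95% confidence interval for the slope is an addition that follows from the same t distribution and is not shown in the slides.

```sas
/* Sketch: t statistic, one- and two-sided p-values, and a 95% CI for the slope. */
data _null_;
  b1 = 1.97792;                            /* estimated slope */
  se = 0.20123;                            /* standard error of the slope */
  df = 15 - 2;                             /* n - 2 */
  t0 = b1 / se;                            /* test statistic for H0: beta1 = 0 */
  p_right = 1 - probt(t0, df);             /* Ha: beta1 > 0 */
  p_two   = 2 * (1 - probt(abs(t0), df));  /* Ha: beta1 not equal to 0 */
  tcrit = tinv(0.975, df);                 /* 97.5th percentile of t(df) */
  lower = b1 - tcrit * se;
  upper = b1 + tcrit * se;
  put t0= p_right= p_two= lower= upper=;   /* results are written to the SAS log */
run;
```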

Model Diagnostics

Check model assumptions:

- Constant variance
- Normality of error terms

The residuals $e_i = y_i - \hat y_i$ play an essential role in diagnosing a fitted model.

- Diagnostic plots:
  - Residuals versus fitted values
  - Residuals versus explanatory variables
  - Normal probability plot of the residuals
- Residuals and Studentized residuals
- Cook's distance

Residuals and Studentized Residuals

Residuals are used to identify outlying or extreme Y observations.

- Residuals: $e_i = y_i - \hat y_i$
- Studentized residuals:

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

  where $h_{ii}$ is the $i$th diagonal element of the hat matrix.

While the residuals $e_i$ will have substantially different sampling variations if their standard deviations differ markedly, the studentized residuals $r_i$ have constant variance.


Deleted Residuals and Studentized Deleted Residuals

If $y_i$ is far outlying, the regression line may be influenced to come close to $y_i$, yielding a fitted value $\hat y_i$ near $y_i$, so that the residual $e_i$ will be small and will not disclose that $y_i$ is outlying. A refinement that makes residuals more effective for detecting outlying y observations is to measure the $i$th residual when the fitted regression is based on all the cases except the $i$th one.

- Deleted residuals:

$$d_i = y_i - \hat y_{i(i)} = \frac{e_i}{1 - h_{ii}}$$

  where $\hat y_{i(i)}$ is the fitted value of $y_i$ with the $i$th case omitted.
- Studentized deleted residuals:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}$$

  where $s_{(i)}$ is $s$ with the $i$th case omitted.

Cook's Distance

Cook's D statistic is a measure of the influence of an observation on the regression line. The statistic is defined as follows:

$$D_k = \frac{1}{(p+1)s^2}\sum_{i=1}^{n}\left(\hat y_{i(k)} - \hat y_i\right)^2$$

where $\hat y_{i(k)}$ is the fitted value of the $i$th observation when the $k$th observation is omitted from the model, and $p$ is the number of explanatory variables.

- The values of $D_k$ assess the impact of the $k$th observation on the estimated regression coefficients.
- Values of $D_k$ greater than 1 imply undue influence on the model coefficients.

Leverage

This is a measure of how unusual the x value of a point is, relative to the x observations as a whole.

- The leverage of observation $i$ is $h_{ii}$, the $i$th diagonal element of the hat matrix.
- A leverage value $h_{ii}$ is usually considered to be large if it is more than twice as large as the mean leverage value, $(p+1)/n$, i.e.

$$h_{ii} > \frac{2(p+1)}{n}$$

  where $p$ is the number of predictors in the regression model.
- Another suggested guideline is that
  - $h_{ii} > 0.5$: high leverage
  - $0.2 \le h_{ii} \le 0.5$: moderate leverage
  - $h_{ii} < 0.2$: low leverage
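For simple linear regression with an intercept, the leverage has a closed form that makes the "unusual x value" interpretation explicit. This is a standard result stated in the notation used earlier, added here as a sketch:

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{S_{xx}}, \qquad \sum_{i=1}^{n} h_{ii} = 1 + 1 = p + 1.$$

The leverage grows with the squared distance of $x_i$ from $\bar x$, and the sum explains why the mean leverage is $(p+1)/n$. For the alcohol data, with $n = 15$ and $p = 1$, the twice-the-mean cutoff is $2(1+1)/15 \approx 0.27$.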

Diagnostic Plots

ODS graphics are used to generate diagnostic plots:

ods graphics on;
proc reg data=drinking;
  model cirrhosis=alcohol;
run;
ods graphics off;

- Raw residuals against predicted values
- Standardized/Studentized residuals against predicted values
- Studentized residuals against leverage
- Cook's D against observation number
- Histogram of residuals

In both residual plots, there is a large negative residual. In the residual vs. leverage plot, there is one observation with a large leverage. The first observation has a very large Cook's D.

Identifying Outlying Observations and Influential Points

To identify the observations with large residuals, high leverage, or large Cook's distance, we need to save these quantities.

proc reg data=drinking;
  model cirrhosis=alcohol;
  output out=drinkingfit
         predicted=yhat
         student=student      /* Studentized residuals */
         rstudent=rstudent    /* Studentized deleted residuals */
         h=leverage
         cookd=cooksd;
run;

proc print data=drinkingfit;
  where abs(rstudent) > 2 or leverage > .3 or cooksd > 1;
run;

- Influential point: France (Cook's D = 1.54)
- X-outlier and influential point: France (leverage = 0.64)
- Y-outlier: Australia (Studentized deleted residual = -2.55)

Problem of Outlying Observations

We've found that France is an X-outlier and an influential point. Observations that are outliers may distort the value of an estimated regression coefficient. Let us simply drop it and refit the model:

ods graphics on;
proc reg data=drinking;
  model cirrhosis=alcohol;
  where country ne 'France';
run;
ods graphics off;

- The residuals look more normal.
- There are no influential points.
- There are two marginally large residuals.

Models without an Intercept

In some applications of simple linear regression, a model without an intercept is required. The model is of the following form:

$$y_i = \beta x_i + \epsilon_i$$

In this case, application of the least squares method gives the following estimator of $\beta$:

$$\hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
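The estimator follows from a one-parameter version of the least-squares argument used for the model with an intercept. As a sketch in the same notation, minimize

$$Q(\beta) = \sum_{i=1}^{n}(y_i - \beta x_i)^2, \qquad \frac{dQ}{d\beta} = -2\sum_{i=1}^{n} x_i (y_i - \beta x_i) = 0.$$

Solving for $\beta$ gives $\sum_{i=1}^{n} x_i y_i = \hat\beta \sum_{i=1}^{n} x_i^2$, which is the estimator stated above. Unlike the model with an intercept, the sums are not centered at $\bar x$ and $\bar y$, because the fitted line is forced through the origin.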


Example: Estimating the Age of the Universe

The dominant motion in the universe is the smooth expansion known as Hubble's Law:

$$V = H_0 D$$

where

- $V$ is the recessional velocity, the observed velocity of the galaxy away from us, usually in km/sec
- $H_0$ is Hubble's constant, in km/sec/Mpc
- $D$ is the distance to the galaxy in Mpc

Based on this law, we can easily estimate the age of the universe:

$$\text{age} = \frac{1}{H_0}$$


Example: Estimating the Age of the Universe (cont.)

Wood (2006) gives the relative velocity and the distances of 24 galaxies, according to measurements made using the Hubble Space Telescope.

Observation   Galaxy      Velocity (km/sec)   Distance (mega-parsec)
1             NGC0300      133                 2.00
2             NGC0925      664                 9.16
3             NGC1326A    1794                16.14
......
24            NGC7331      999                14.72

where mega-parsec is abbreviated Mpc and 1 Mpc = $3.09 \times 10^{19}$ km.


Scatterplot

After reading in the data file, we can create a scatterplot to visualize the relationship between velocity and distance.

proc sgplot data=universe;
  scatter y=velocity x=Distance;
  yaxis label='velocity (kms)';
  xaxis label='Distance (mega-parsec)';
run;

- The yaxis and xaxis statements are used to give the units of measurement in the axis labels.
- The diagram shows a clear, strong relationship between velocity and distance.
- The scatterplot also shows that the intercept is probably zero.


Model Diagnostics

There are two large Studentized deleted residuals (absolute value > 2) and two large leverages (> 2/24 ≈ 0.08). We can identify these observations by saving these values.

proc reg data=universe;
  model velocity=distance / noint;
  output out=regout predicted=pred rstudent=rstudent h=leverage;
run;
quit;

proc print data=regout;
  where abs(rstudent) > 2;
run;

proc print data=regout;
  where leverage > .08;
run;

Age of the Universe

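Now we can use the estimated value of $\beta$ to find an approximate value for the age of the universe.

$$\text{Age} = \frac{1}{\hat\beta\ \text{km/sec/Mpc}} \times \frac{3.09 \times 10^{19}\ \text{km}}{1\ \text{Mpc}} \times \frac{1\ \text{year}}{3.15 \times 10^{7}\ \text{sec}}$$

It is approximately 12.8 billion years.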
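The unit conversion can also be done in SAS rather than by hand. The following is a minimal sketch, assuming the `universe` data set with variables `velocity` and `distance` used above. The data set name `est` and the variables `age_sec` and `age_gyr` are illustrative. It relies on the `OUTEST=` option of `PROC REG`, which saves the parameter estimates in a data set where each coefficient is stored in a variable named after its regressor.

```sas
/* Sketch: compute the age estimate from the fitted no-intercept slope. */
proc reg data=universe outest=est noprint;
  model velocity=distance / noint;
run;
quit;

data _null_;
  set est;
  /* In the OUTEST= data set, the variable distance holds the estimated
     slope, i.e. the estimate of Hubble's constant in km/sec/Mpc. */
  age_sec = (1 / distance) * 3.09e19;   /* converting Mpc to km leaves seconds */
  age_gyr = age_sec / 3.15e7 / 1e9;     /* seconds to years, then billions of years */
  put age_gyr=;                         /* result is written to the SAS log */
run;
```

Reading the estimate from `est` avoids retyping a rounded slope, so the conversion uses the full precision of the fit.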
