
Chapter 5: Ordinary Regression
Part 1: Simple Linear Regression: Introduction and Estimation

• Methods for studying the relationship of two or more quantitative variables

• Examples: – predict salary from years of experience

– find effect of lead exposure on school performance

– predict force at which a metal alloy rod bends based on iron content

Simple Linear Regression: Linear regression model

• The basic model

Yi = β0 + β1xi + εi

– Yi is the response or dependent variable

– xi is the observed predictor, explanatory variable, independent variable, covariate

– xi is treated as a fixed quantity (or if random it is conditioned upon)

– εi is the error term

– εi are iid N(0, σ²)

So, E[Yi] = β0 + β1xi + 0 = β0 + β1xi

Simple Linear Regression: Linear regression model

• Key assumptions (will check these later)

– linear relationship (between Y and x)

*we say the relationship between Y and x is linear if the means of the conditional distributions of Y|x lie on a straight line

– independent errors (independent observations in SLR)

– constant variance of errors

– normally distributed errors

Simple Linear Regression: Interpreting the model

• Model can also be written as:

Yi | Xi = xi ∼ N(β0 + β1xi, σ²)

– mean of Y given X = x is β0 + β1x (known as conditional mean)

– β0 + β1x is the mean value of all the Y ’s for the given value of x

– β0 is conditional mean when x=0

– β1 is slope, change in mean of Y per 1 unit change in x

– σ² is the variance of the responses at x (i.e. dispersion around the conditional mean)
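To make the model concrete, here is a minimal simulation sketch (the values β0 = 1, β1 = 2, and σ = 0.5 are made up for illustration, not taken from any data set in these notes): each Yi is drawn from a normal distribution centered at the conditional mean β0 + β1xi.

## Simulate responses from Yi = beta0 + beta1*xi + eps_i, with eps_i ~ N(0, sigma^2)
set.seed(1)
beta0 <- 1; beta1 <- 2; sigma <- 0.5      # illustrative values only
x <- seq(0, 5, length.out = 50)           # fixed predictor values
y <- beta0 + beta1*x + rnorm(length(x), mean = 0, sd = sigma)
plot(x, y)                                # points scatter around the line
abline(a = beta0, b = beta1)              # true conditional mean E[Y|x] = beta0 + beta1*x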

Simple Linear Regression: Estimation of β0 and β1

We wish to use the sample data to estimate the population parameters: the slope β1 and the intercept β0

• Least squares estimation

– choose β̂0 = b0 and β̂1 = b1 such that we minimize the sum of the squared residuals, i.e. minimize Σ_{i=1}^{n} (Yi − Ŷi)²

– minimize g(b0, b1) = Σ_{i=1}^{n} (Yi − (b0 + b1xi))²

– Take derivative of g(b0, b1) with respect to b0 and b1, set equal to zero, and solve

– Results:

b0 = Ȳ − b1x̄

b1 = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^{n} (xi − x̄)²

The point (x̄, Ȳ) will always be on the least squares line. (A numerical check of these formulas appears after the figure below.)

– b0 and b1 are best linear unbiased estimators (“best” meaning smallest variance among linear unbiased estimators)

Notation for fitted line:

Ŷi = β̂0 + β̂1xi or Ŷi = b0 + b1xi

or, in the text, Ŷi = A + Bxi

– predicted (fitted) value: Ŷi = b0 + b1xi

– residual: ei = Yi − Ŷi

[Figure: scatterplot of Y vs. X with the fitted least squares line]

The least squares regression line minimizes the residual sum of squares (RSS) = Σ_{i=1}^{n} (Yi − Ŷi)²
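As referenced above, here is a quick numerical check (a sketch on simulated data with made-up values, since no real data set has been introduced yet) that minimizing the residual sum of squares numerically reproduces the closed-form b0 and b1:

## Minimize g(b0, b1) = sum((Yi - (b0 + b1*xi))^2) directly with optim()
set.seed(42)
x <- runif(30, 0, 10)
y <- 2 + 0.5*x + rnorm(30, sd = 1)        # made-up "true" line plus noise
g <- function(b) sum((y - (b[1] + b[2]*x))^2)
optim(c(0, 0), g)$par                     # numerical minimizer of the RSS
lm(y ~ x)$coefficients                    # closed-form least squares estimates
## The two agree up to optim()'s numerical tolerance.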

Example: Cigarette data

Measurements of weight and tar, nicotine, and carbon monoxide content are given for 25 brands of domestic cigarettes.

VARIABLE DESCRIPTIONS:
Brand name
Tar content (mg)
Nicotine content (mg)
Weight (g)
Carbon monoxide content (mg)

Mendenhall, William, and Sincich, Terry (1992), Statistics for Engineering and the Sciences (3rd ed.), New York: Dellen Publishing

Do a scatterplot, then fit the best fitting line according to least squares estimation.

> cig.data=as.data.frame(read.delim("cig.txt",sep=" ", header=FALSE))
> dim(cig.data)
[1] 25 5

## This data set had no header, so I will assign
## the column names here:
> dimnames(cig.data)[[2]]=c("Brand","Tar","Nic","Weight","CO")
> head(cig.data)
          Brand  Tar  Nic Weight   CO
1        Alpine 14.1 0.86 0.9853 13.6
2 Benson-Hedges 16.0 1.06 1.0938 16.6
3    BullDurham 29.8 2.03 1.1650 23.5
4   CamelLights  8.0 0.67 0.9280 10.2
5       Carlton  4.1 0.40 0.9462  5.4
6  Chesterfield 15.0 1.04 0.8885 15.0

> plot(cig.data$Tar,cig.data$Nic)

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar]

## Fit a simple linear regression of Nicotine on Tar.
> lm.out=lm(Nic~Tar,data=cig.data)

## Get the estimated slope and intercept:
> lm.out$coefficients
(Intercept)         Tar
 0.13087532  0.06102854

You can do this manually too...

b1 = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^{n} (xi − x̄)²

R easily works with vectors and matrices.

> numerator=sum((cig.data$Tar-mean(cig.data$Tar))*(cig.data$Nic-mean(cig.data$Nic)))
> denominator=sum((cig.data$Tar-mean(cig.data$Tar))^2)
> b1=numerator/denominator
> b1
[1] 0.06102854

b0 = Ȳ − b1x̄

> b0=mean(cig.data$Nic)-mean(cig.data$Tar)*b1
> b0
[1] 0.1308753

The fitted line for this data:

Ŷi = 0.1309 + 0.0610xi
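With b0 and b1 in hand, the fitted values and residuals can also be computed by hand and checked against what lm() stores (a sketch, assuming cig.data, b0, b1, and lm.out from above are still in the workspace):

## Fitted values and residuals for the cigarette data, computed by hand
fitted.manual <- b0 + b1*cig.data$Tar
resid.manual  <- cig.data$Nic - fitted.manual
max(abs(fitted.manual - lm.out$fitted.values))  # essentially 0
max(abs(resid.manual - lm.out$residuals))       # essentially 0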

## Add the fitted line to the original plot:
> plot(cig.data$Tar,cig.data$Nic)
> abline(lm.out)

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar with the fitted least squares line]

Simple Linear Regression: Estimating σ²

• One of the assumptions of linear regression is that the variance for each of the conditional distributions of Y|x is the same at all x-values.

[Figure: scatterplot of Y vs. X]

• In this case, it makes sense to pool all the error information to come up with a common estimate for σ²

Recall the model:

Yi = β0 + β1xi + εi, with εi iid N(0, σ²)

• We use the sum of the squares of the residuals to estimate σ²

Acronyms:
RSS ≡ Residual sum of squares
SSE ≡ Sum of squared errors
RSS ≡ SSE

σ̂² = RSS/(n − 2) = Σ_{i=1}^{n} (Yi − Ŷi)² / (n − 2)

RSS = Σ_{i=1}^{n} (Yi − Ŷi)²

E[RSS/(n − 2)] = σ²

σ̂ = SE = √σ̂²; SE is called the standard error for the regression (a phrase used by this author)

– ‘2’ is subtracted from n in the denominator because we’ve used 2 degrees of freedom for estimating the slope and intercept (i.e. there were 2 parameters estimated in the mean structure).

– When we estimate σ² in a one-sample problem, we divide Σ_{i=1}^{n} (Yi − Ȳ)² by (n − 1) because we only estimate 1 parameter in the mean structure, namely µ.
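A sketch of this estimate for the cigarette regression, assuming lm.out and cig.data from earlier are still available (summary(lm.out)$sigma reports σ̂, the residual standard error):

## sigma^2-hat = RSS/(n - 2) for the cigarette fit
n   <- nrow(cig.data)                 # n = 25
RSS <- sum(lm.out$residuals^2)
sigma2.hat <- RSS/(n - 2)
sigma2.hat                            # estimate of sigma^2
sqrt(sigma2.hat)                      # sigma-hat, the "standard error of the regression"
summary(lm.out)$sigma                 # same value as reported by summary()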

Simple Linear Regression: Total sums of squares (TSS)

• Total sums of squares (TSS) quantifies the overall squared distance of the Y-values from the overall mean of the responses Ȳ

[Figure: scatterplot of y vs. x with a horizontal line at Ȳ = 10.91]

• TSS = Σ_{i=1}^{n} (Yi − Ȳ)²

• For regression, we can ‘decompose’ this distance and write:

Yi − Ȳ = (Yi − Ŷi) + (Ŷi − Ȳ)

where (Yi − Ŷi) is the distance from the observation to the fitted line, and (Ŷi − Ȳ) is the distance from the fitted line to the overall mean

• Which leads to the equation¹:

Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Yi − Ŷi)² + Σ_{i=1}^{n} (Ŷi − Ȳ)²

or

TSS = RSS + RegSS

where RegSS is the regression sum of squares

¹ (a + b)² ≠ a² + b². You must square both sides, then include the summation terms, and then the cross terms will cancel out due to properties of the fitted line.
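A quick numerical check of the footnote’s claim that the cross terms cancel, using the cigarette fit from earlier (a sketch, assuming lm.out and cig.data are still in the workspace):

## The cross term in the decomposition sums to (numerically) zero
Yhat <- lm.out$fitted.values
Ybar <- mean(cig.data$Nic)
sum((cig.data$Nic - Yhat)*(Yhat - Ybar))   # ~ 0, so TSS = RSS + RegSS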

• Total variability has been decomposed into “explained” and “unexplained” variability

• In general, when the proportion of total variability that is explained is high, we have a good-fitting model

• The R² value (coefficient of determination):

– the proportion of variation in the response that is explained by the model

– R² = RegSS / TSS

– R² = 1 − RSS / TSS

– also stated as r² in simple linear regression

– the square of the correlation coefficient ‘r’

– 0 ≤ R² ≤ 1

– R² near 1 suggests a good fit to the data

– if R² = 1, ALL points fall exactly on the line

– different disciplines have different views on what is a high R², in other words what is a good model

∗ social scientists may get excited about an R² near 0.30

∗ a researcher with a designed experiment may want to see an R² near 0.80

Simple Linear Regression: Analysis of Variance (ANOVA)

The decomposition of total variance into parts is part of ANOVA.

As was stated before:

TSS = RSS + RegSS

Example: cigarette data

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar]

Look at the ANOVA table:
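In R, the table for the cigarette fit can be requested with the anova() function (output not reproduced here):

> anova(lm.out)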

You can get these sums of squares manually too...
> sum((lm.out$fitted.values-mean(cig.data$Nic))^2)
[1] 2.869467
> sum(lm.out$residuals^2)
[1] 0.1391091
> sum((cig.data$Nic-mean(cig.data$Nic))^2)
[1] 3.008576

Get the R² value (2 ways shown):

> summary(lm.out)
look for....
Multiple R-Squared: 0.9538

> summary(lm.out)$r.squared
[1] 0.9537625

Example: Lifespan and Thorax of fruitflies

LONGEVITY: Lifespan, in days
THORAX: Length of thorax, in mm
n = 125

[Figure: scatterplot of data$Longevity vs. data$Thorax]

“Sexual Activity and the Lifespan of Male Fruitflies” by Linda Partridge and Marion Farquhar. Nature, 294, 580-581, 1981.

The data and the variables:

> ff.data=as.data.frame(read.delim("/fruitfly.txt",sep="\t",header=FALSE))
> dimnames(ff.data)[[2]]=c("ID","Partners","Type","Longevity","Thorax","Sleep")
> head(ff.data)
  ID Partners Type Longevity Thorax Sleep
1  1        8    0        35   0.64    22
2  2        8    0        37   0.68     9
3  3        8    0        49   0.68    49
4  4        8    0        46   0.72     1
5  5        8    0        63   0.72    23
6  6        8    0        39   0.76    83

See how many different Partner values there are:
> unique(ff.data$Partners)
[1] 8 0 1

Fit the simple linear regression model:
> lm.fruitflies=lm(ff.data$Longevity~ff.data$Thorax)
> summary(lm.fruitflies)

. . .

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      -61.05      13.00  -4.695 7.00e-06 ***
ff.data$Thorax   144.33      15.77   9.152 1.50e-15 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 13.6 on 123 degrees of freedom
Multiple R-Squared: 0.4051, Adjusted R-squared: 0.4003
F-statistic: 83.76 on 1 and 123 DF,  p-value: 1.497e-15

Slope interpretation: For every 1 mm increase in the thorax length of a fruitfly, the average lifespan increases by 144.3 days.
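Since the observed thorax lengths only span roughly 0.65 to 0.95 mm, a per-0.1 mm statement of the slope (about 14.4 days) may be more natural; a small sketch using the fitted coefficients:

## Change in fitted mean longevity per 0.1 mm of thorax length
b <- coef(lm.fruitflies)
b[1] + b[2]*0.80      # fitted mean longevity at thorax = 0.80 mm
b[1] + b[2]*0.90      # fitted mean longevity at thorax = 0.90 mm
b[2]*0.1              # difference: about 14.4 days per 0.1 mm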

Intercept interpretation: When the thorax length is 0 mm, the average lifespan is -61.05 days (?!).

This doesn’t make sense for this data set for 2 reasons... a fruitfly wouldn’t have a 0 mm thorax, and x=0 is far outside of the range of observed x-values.

> plot(ff.data$Thorax,ff.data$Longevity)
> abline(lm.fruitflies,lwd=2)

[Figure: scatterplot of ff.data$Longevity vs. ff.data$Thorax with the fitted line]

> anova(lm.fruitflies)
Analysis of Variance Table

Response: ff.data$Longevity
                Df Sum Sq Mean Sq F value    Pr(>F)
ff.data$Thorax   1  15497   15497  83.761 1.497e-15 ***
Residuals      123  22756     185
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Regression sum of squares  RegSS  15497
Residual sum of squares    RSS    22756
Total sum of squares       TSS    38253

R² = RegSS/TSS = 15497/38253 = 0.4051186

> summary(lm.fruitflies)$r.squared
[1] 0.4051113

R2 interpretation: 40.5% of the total variability in lifespan for fruitflies is explained by the length of the thorax.

Simple Linear Regression: Correlation coefficient

• The correlation coefficient r measures the strength of a linear relationship

r = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √[ Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² ]

  = [ √( Σ_{i=1}^{n} (Xi − X̄)² / (n − 1) ) / √( Σ_{i=1}^{n} (Yi − Ȳ)² / (n − 1) ) ] · b1

  = (SX / SY) · b1

– it is the standardized slope, a unitless measure

– can be thought of as the value we would get for the slope if the standard deviations of X and Y were equal (similar spreads)

– would be the slope if X and Y had been standardized before fitting the regression

– −1 ≤ r ≤ 1

– r near -1 or +1 shows a strong linear relationship

– a negative (positive) r is associated with an estimated negative (positive) slope

– the sample correlation coefficient r estimates the population correlation coefficient ρ

– r is NOT used to measure strength of a curved line
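A sketch verifying the “standardized slope” form r = (SX/SY)·b1 on the cigarette data (assuming cig.data and b1 from earlier are still in the workspace):

## r computed directly and via the standardized-slope identity
r.direct <- cor(cig.data$Tar, cig.data$Nic)
r.from.slope <- (sd(cig.data$Tar)/sd(cig.data$Nic)) * b1
c(r.direct, r.from.slope)     # both are about 0.9766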

Common mistake

People often think that as the estimated slope of the regression line, β̂1, gets larger (steeper), so does r. But r really measures how close all the data points are to our estimated regression line.

You could have a steep fitted line with a small r (noisy relationship), or a fairly flat fitted line with large r (less noisy relationship).

This can be confusing because when the estimated slope is actually 0, then r is 0 no matter how close the points are to the regression line (see formulas for r on the previous pages).
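A small simulated illustration of this point (made-up numbers, just a sketch): a steep but noisy relationship can have a smaller r than a shallow but tight one.

## Steep slope with lots of noise vs. shallow slope with little noise
set.seed(7)
x <- 1:50
y.steep <- 10*x  + rnorm(50, sd = 200)   # slope 10, very noisy
y.flat  <- 0.1*x + rnorm(50, sd = 0.5)   # slope 0.1, tight around the line
cor(x, y.steep)   # smaller r despite the steeper fitted slope
cor(x, y.flat)    # larger r despite the nearly flat fitted slope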

Example: Cigarette data

> cor(cig.data$Tar,cig.data$Nic)
[1] 0.9766076

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar]

Standardize the dependent and independent variables and fit the simple linear regression model to the standardized variables...

Standardizing the variables:
> std.Y=(cig.data$Nic-mean(cig.data$Nic))/sqrt(var(cig.data$Nic))
> std.X=(cig.data$Tar-mean(cig.data$Tar))/sqrt(var(cig.data$Tar))

Fitting the model to the standardized variables:
> (lm(std.Y~std.X))$coefficients
  (Intercept)         std.X
 5.420460e-17  9.766076e-01

The slope in the standardized regression is 0.9766, which is the correlation between the original two variables (as we saw on the previous slide).

[Figure: Standardized Y vs. Standardized X]

A very strong curved relationship can have an r value near 0.
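The x and y used below are not constructed earlier in these notes; one hypothetical way to generate data with this kind of strong curved relationship (a sketch assuming a quadratic mean with noise) is:

## Hypothetical construction of a strong but curved (quadratic) relationship
set.seed(3)
x <- seq(-6, 6, length.out = 80)
y <- 4*x^2 + rnorm(80, sd = 10)   # strong relationship, but not a linear one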

> plot(x,y)

[Figure: scatterplot of y vs. x showing a strong curved relationship]

> abline(lm(y~x))
> cor(x,y)
[1] -0.2789433

The correlation coefficient measures the strength of a linear relationship.
