
Lecture 14: Simple Linear Regression

Ordinary Least Squares (OLS)

Consider the following model

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
where, for each unit $i$,

• $Y_i$ is the dependent variable (response).

• $X_i$ is the independent variable (predictor).

• $\varepsilon_i$ is the error between the observed $Y_i$ and what the model predicts.

This model describes the relation between $X_i$ and $Y_i$ using an intercept and a slope parameter. The error $\varepsilon_i = Y_i - \alpha - \beta X_i$ is also known as the residual because it measures the amount of variation in $Y_i$ not explained by the model.

We saw last class that there exist $\hat\alpha$ and $\hat\beta$ that minimize the sum of the $\varepsilon_i^2$. Specifically, we wish to find $\hat\alpha$ and $\hat\beta$ such that
$$\sum_{i=1}^{n}\left[Y_i - (\hat\alpha + \hat\beta X_i)\right]^2$$
is minimized. Using calculus, it turns out that

$$\hat\alpha = \bar Y - \hat\beta\bar X, \qquad \hat\beta = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}.$$
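As a quick numerical illustration (not part of the lecture), here is a minimal Python sketch that computes $\hat\alpha$ and $\hat\beta$ from the closed-form formulas above on simulated data (the true values $\alpha = 1$, $\beta = 2$, $\sigma = 0.5$ are illustrative assumptions) and cross-checks them against `numpy.polyfit`.

```python
import numpy as np

# Simulated data; true alpha = 1, beta = 2, sigma = 0.5 are illustrative choices.
rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 10, n)
Y = 1 + 2 * X + rng.normal(0, 0.5, n)

# Closed-form OLS estimates from the formulas above
beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()

# Cross-check against numpy's least-squares polynomial fit (degree 1)
beta_np, alpha_np = np.polyfit(X, Y, 1)
print(alpha_hat, beta_hat)   # close to (1, 2)
print(alpha_np, beta_np)     # matches the closed-form values
```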

OLS and Transformation

If we center the predictor, $\tilde X_i = X_i - \bar X$, then $\tilde X_i$ has mean zero. Therefore,

$$\hat\alpha^* = \bar Y, \qquad \hat\beta^* = \frac{\sum_i \tilde X_i (Y_i - \bar Y)}{\sum_i \tilde X_i^2}.$$
By horizontally shifting the values of $X_i$, the slope is unchanged ($\hat\beta^* = \hat\beta$), but the intercept becomes the overall average of $Y_i$.
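The following minimal sketch (not from the lecture) checks this numerically on simulated data: centering $X$ leaves the slope unchanged and makes the intercept equal to $\bar Y$.

```python
import numpy as np

# Illustrative check that centering X leaves the slope unchanged
# and makes the intercept equal to the mean of Y.
rng = np.random.default_rng(9)
X = rng.uniform(0, 10, 100)
Y = 1 + 2 * X + rng.normal(0, 1, 100)

def ols(x, y):
    """Return (intercept, slope) from the closed-form OLS formulas."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b * x.mean(), b

alpha_hat, beta_hat = ols(X, Y)
alpha_c, beta_c = ols(X - X.mean(), Y)
print(np.isclose(beta_c, beta_hat))   # True: slope unchanged
print(np.isclose(alpha_c, Y.mean()))  # True: intercept equals Y-bar
```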

Consider the linear transformation $Z_i = a + bX_i$ with $\bar Z = a + b\bar X$, and consider the model

$$Y_i = \alpha^* + \beta^* Z_i + \varepsilon_i$$

Let $\hat\alpha$ and $\hat\beta$ be the estimates using $X$ as the predictor. Then the slope using $Z$ as the predictor is
$$\begin{aligned}
\hat\beta^* &= \frac{\sum_i (Z_i - \bar Z)(Y_i - \bar Y)}{\sum_i (Z_i - \bar Z)^2}
= \frac{\sum_i (a + bX_i - a - b\bar X)(Y_i - \bar Y)}{\sum_i (a + bX_i - a - b\bar X)^2} \\
&= \frac{\sum_i (bX_i - b\bar X)(Y_i - \bar Y)}{\sum_i (bX_i - b\bar X)^2}
= \frac{b\sum_i (X_i - \bar X)(Y_i - \bar Y)}{b^2\sum_i (X_i - \bar X)^2}
= \frac{1}{b}\,\hat\beta
\end{aligned}$$
and

$$\hat\alpha^* = \bar Y - \hat\beta^*\bar Z = \bar Y - \frac{\hat\beta}{b}(a + b\bar X) = \bar Y - \hat\beta\bar X - \frac{a\hat\beta}{b} = \hat\alpha - \frac{a\hat\beta}{b}$$

We can get the same results by substituting $X_i = \frac{Z_i - a}{b}$ into the linear model:

$$\begin{aligned}
Y_i &= \alpha + \beta X_i + \varepsilon_i \\
Y_i &= \alpha + \beta\,\frac{Z_i - a}{b} + \varepsilon_i \\
Y_i &= \alpha - \frac{a\beta}{b} + \frac{\beta}{b}Z_i + \varepsilon_i \\
Y_i &= \alpha^* + \beta^* Z_i + \varepsilon_i
\end{aligned}$$
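A short numerical check of these transformation rules (not from the lecture; the constants $a = 3$ and $b = 4$ are arbitrary illustrative choices): refitting on $Z = a + bX$ should scale the slope by $1/b$ and shift the intercept by $-a\hat\beta/b$.

```python
import numpy as np

# Illustrative check: fit on X, then on Z = a + b*X, and confirm
# beta_z = beta_x / b and alpha_z = alpha_x - a * beta_x / b.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 200)
Y = 1 + 2 * X + rng.normal(0, 0.5, 200)

def ols(x, y):
    """Return (intercept, slope) from the closed-form OLS formulas."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b * x.mean(), b

a, b = 3.0, 4.0            # arbitrary transformation constants
Z = a + b * X
alpha_x, beta_x = ols(X, Y)
alpha_z, beta_z = ols(Z, Y)

print(np.isclose(beta_z, beta_x / b))                 # True
print(np.isclose(alpha_z, alpha_x - a * beta_x / b))  # True
```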

Properties of OLS

Given the estimates $\hat\alpha$ and $\hat\beta$, we can define (1) the fitted (predicted) value $\hat Y_i$ and (2) the estimated residual $\hat\varepsilon_i$.

$$\hat Y_i = \hat\alpha + \hat\beta X_i, \qquad \hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \hat\alpha - \hat\beta X_i$$

The least squares estimates have the following properties.

1. $\sum_i \hat\varepsilon_i = 0$.
$$\begin{aligned}
\sum_{i=1}^n \hat\varepsilon_i &= \sum_{i=1}^n (Y_i - \hat\alpha - \hat\beta X_i) = \sum_{i=1}^n Y_i - n\hat\alpha - \hat\beta\sum_{i=1}^n X_i \\
&= n\bar Y - n\hat\alpha - n\hat\beta\bar X = n\big(\bar Y - \hat\alpha - \hat\beta\bar X\big) = n\big(\bar Y - (\bar Y - \hat\beta\bar X) - \hat\beta\bar X\big) = 0
\end{aligned}$$

2. The point $(\bar X, \bar Y)$ is always on the fitted line.

$$\hat\alpha + \hat\beta\bar X = (\bar Y - \hat\beta\bar X) + \hat\beta\bar X = \bar Y$$

3. $\hat\beta = r_{XY} \times \dfrac{s_Y}{s_X}$, where $s_X$ and $s_Y$ are the sample standard deviations of $X$ and $Y$, and $r_{XY}$ is the sample correlation between $X$ and $Y$. Note that the sample correlation is given by
$$r_{XY} = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{(n-1)\,s_X s_Y}.$$
Therefore,

$$\begin{aligned}
\hat\beta &= \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}\times\frac{n-1}{n-1}\times\frac{s_Y}{s_Y}\times\frac{s_X}{s_X} \\
&= \frac{\mathrm{Cov}(X,Y)}{s_X^2}\times\frac{s_Y}{s_Y}\times\frac{s_X}{s_X} \\
&= \frac{\mathrm{Cov}(X,Y)}{s_X s_Y}\times\frac{s_Y}{s_X} \\
&= r_{XY}\times\frac{s_Y}{s_X}
\end{aligned}$$
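A one-line numerical confirmation of property 3 (not from the lecture, using simulated data):

```python
import numpy as np

# Illustrative check that beta_hat = r_XY * s_Y / s_X.
rng = np.random.default_rng(2)
X = rng.normal(5, 2, 150)
Y = 1 + 2 * X + rng.normal(0, 1, 150)

beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
r_xy = np.corrcoef(X, Y)[0, 1]
print(np.isclose(beta_hat, r_xy * Y.std(ddof=1) / X.std(ddof=1)))  # True
```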

4. $\hat\beta$ is a weighted sum of the $Y_i$.

$$\begin{aligned}
\hat\beta &= \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2} \\
&= \frac{\sum_i (X_i - \bar X)Y_i}{\sum_i (X_i - \bar X)^2} - \frac{\sum_i (X_i - \bar X)\bar Y}{\sum_i (X_i - \bar X)^2} \\
&= \frac{\sum_i (X_i - \bar X)Y_i}{\sum_i (X_i - \bar X)^2} - \frac{\bar Y\sum_i (X_i - \bar X)}{\sum_i (X_i - \bar X)^2} \\
&= \frac{\sum_i (X_i - \bar X)Y_i}{\sum_i (X_i - \bar X)^2} - 0 \\
&= \sum_i \left[\frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2}\right] Y_i \\
&= \sum_i w_i Y_i, \qquad w_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2}
\end{aligned}$$

5. $\hat\alpha$ is a weighted sum of the $Y_i$.

$$\begin{aligned}
\hat\alpha &= \bar Y - \hat\beta\bar X \\
&= \sum_i \frac{1}{n}Y_i - \bar X\sum_i w_i Y_i \\
&= \sum_i v_i Y_i, \qquad v_i = \frac{1}{n} - \frac{\bar X(X_i - \bar X)}{\sum_j (X_j - \bar X)^2}
\end{aligned}$$

6. $\sum_i X_i\hat\varepsilon_i = 0$.
$$\begin{aligned}
\sum_i X_i\hat\varepsilon_i &= \sum_i (Y_i - \hat Y_i)X_i = \sum_i (Y_i - \hat\alpha - \hat\beta X_i)X_i \\
&= \sum_i \big(Y_i - (\bar Y - \hat\beta\bar X) - \hat\beta X_i\big)X_i \\
&= \sum_i \big((Y_i - \bar Y) - \hat\beta(X_i - \bar X)\big)X_i \\
&= \sum_i X_i(Y_i - \bar Y) - \hat\beta\sum_i X_i(X_i - \bar X) \\
&= \hat\beta\sum_i (X_i - \bar X)^2 - \hat\beta\sum_i X_i(X_i - \bar X) \\
&= \hat\beta\sum_i \big[(X_i - \bar X)^2 - X_i(X_i - \bar X)\big] \\
&= \hat\beta\sum_i \big[X_i^2 - 2X_i\bar X + \bar X^2 - X_i^2 + X_i\bar X\big] \\
&= \hat\beta\sum_i \big[\bar X^2 - X_i\bar X\big] \\
&= \hat\beta\bar X\sum_i \big(\bar X - X_i\big) = 0
\end{aligned}$$

Note that $\hat\beta = \dfrac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2} = \dfrac{\sum_i X_i(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}$, which justifies the fifth equality above.
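The sketch below (not from the lecture) verifies properties 1 and 6 numerically: the OLS residuals sum to zero and are orthogonal to the predictor.

```python
import numpy as np

# Illustrative numerical check of properties 1 and 6.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, 100)
Y = 1 + 2 * X + rng.normal(0, 1, 100)

beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()
resid = Y - (alpha_hat + beta_hat * X)

print(np.isclose(resid.sum(), 0))        # property 1: residuals sum to zero
print(np.isclose(np.sum(X * resid), 0))  # property 6: residuals orthogonal to X
```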

7. Partition of total variation.

$$Y_i - \bar Y = Y_i - \bar Y + \hat Y_i - \hat Y_i = (\hat Y_i - \bar Y) + (Y_i - \hat Y_i)$$

• $(Y_i - \bar Y)$: total deviation of the observed $Y_i$ from a null model ($\beta = 0$).

• $(\hat Y_i - \bar Y)$: deviation of the modeled value $\hat Y_i$ ($\beta \neq 0$) from a null model ($\beta = 0$).

• $(Y_i - \hat Y_i)$: deviation of the observed value from the modeled value.

Total Variation (Sums of Squares)

• $\sum_i (Y_i - \bar Y)^2$: total variation, equal to $(n-1)\times V(Y)$.

• $\sum_i (\hat Y_i - \bar Y)^2$: variation explained by the model.

• $\sum_i (Y_i - \hat Y_i)^2$: variation not explained by the model, the sum of squared errors (SSE), equal to $\sum_i \hat\varepsilon_i^2$.

Partition of Total Variation

$$\sum_i (Y_i - \bar Y)^2 = \sum_i (\hat Y_i - \bar Y)^2 + \sum_i (Y_i - \hat Y_i)^2$$

Proof:

$$\begin{aligned}
\sum_i (Y_i - \bar Y)^2 &= \sum_i \big[(\hat Y_i - \bar Y) + (Y_i - \hat Y_i)\big]^2 \\
&= \sum_i (\hat Y_i - \bar Y)^2 + \sum_i (Y_i - \hat Y_i)^2 + 2\sum_i (\hat Y_i - \bar Y)(Y_i - \hat Y_i) \\
&= \sum_i (\hat Y_i - \bar Y)^2 + \sum_i (Y_i - \hat Y_i)^2
\end{aligned}$$

where,

$$\begin{aligned}
\sum_i (\hat Y_i - \bar Y)(Y_i - \hat Y_i) &= \sum_i \big[(\hat\alpha + \hat\beta X_i) - (\hat\alpha + \hat\beta\bar X)\big]\hat\varepsilon_i \\
&= \sum_i \big[\hat\beta(X_i - \bar X)\big]\hat\varepsilon_i \\
&= \hat\beta\sum_i X_i\hat\varepsilon_i - \hat\beta\bar X\sum_i \hat\varepsilon_i = 0 + 0 = 0
\end{aligned}$$
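A quick numerical check of the partition (not from the lecture): on any simulated data set, the total sum of squares equals the explained plus unexplained sums of squares.

```python
import numpy as np

# Illustrative check that SST = SSR + SSE for an OLS fit.
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, 100)
Y = 1 + 2 * X + rng.normal(0, 1, 100)

beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()
Y_hat = alpha_hat + beta_hat * X

sst = np.sum((Y - Y.mean()) ** 2)      # total variation
ssr = np.sum((Y_hat - Y.mean()) ** 2)  # explained by the model
sse = np.sum((Y - Y_hat) ** 2)         # unexplained (residual)
print(np.isclose(sst, ssr + sse))      # True
```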

8. Alternative formula for $\sum_i \hat\varepsilon_i^2$.

$$\begin{aligned}
\sum_i (Y_i - \hat Y_i)^2 &= \sum_i (Y_i - \bar Y)^2 - \sum_i (\hat Y_i - \bar Y)^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \sum_i (\hat\alpha + \hat\beta X_i - \bar Y)^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \sum_i (\bar Y - \hat\beta\bar X + \hat\beta X_i - \bar Y)^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \sum_i \big[\hat\beta(X_i - \bar X)\big]^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \hat\beta^2\sum_i (X_i - \bar X)^2
\end{aligned}$$

or

$$\sum_i (Y_i - \hat Y_i)^2 = \sum_i (Y_i - \bar Y)^2 - \frac{\big[\sum_i (X_i - \bar X)(Y_i - \bar Y)\big]^2}{\sum_i (X_i - \bar X)^2}$$

9. Variation Explained by the Model:

An easy way to calculate $\sum_i (\hat Y_i - \bar Y)^2$ is

$$\begin{aligned}
\sum_i (\hat Y_i - \bar Y)^2 &= \sum_i (Y_i - \bar Y)^2 - \sum_i (Y_i - \hat Y_i)^2 \\
&= \sum_i (Y_i - \bar Y)^2 - \left[\sum_i (Y_i - \bar Y)^2 - \frac{\big[\sum_i (X_i - \bar X)(Y_i - \bar Y)\big]^2}{\sum_i (X_i - \bar X)^2}\right] \\
&= \frac{\big[\sum_i (X_i - \bar X)(Y_i - \bar Y)\big]^2}{\sum_i (X_i - \bar X)^2}
\end{aligned}$$

10. Coefficient of Determination $r^2$:

$$r^2 = \frac{\text{variation explained by the model}}{\text{total variation}} = \frac{\sum_i (\hat Y_i - \bar Y)^2}{\sum_i (Y_i - \bar Y)^2}$$

(a) $r^2$ is related to the residual variance.

$$\begin{aligned}
\sum_i (Y_i - \hat Y_i)^2 &= \sum_i (Y_i - \bar Y)^2 - \sum_i (\hat Y_i - \bar Y)^2 \\
&= \left[\sum_i (Y_i - \bar Y)^2 - \sum_i (\hat Y_i - \bar Y)^2\right]\times\frac{\sum_i (Y_i - \bar Y)^2}{\sum_i (Y_i - \bar Y)^2} \\
&= (1 - r^2)\times\sum_i (Y_i - \bar Y)^2
\end{aligned}$$

Dividing the left-hand side by $n-2$ and the right-hand side by $n-1$, for large $n$,

$$s^2 \approx (1 - r^2)\,\mathrm{Var}(Y)$$

(b) $r^2$ is the square of the sample correlation between $X$ and $Y$.

$$\begin{aligned}
r^2 &= \frac{\sum_i (\hat Y_i - \bar Y)^2}{\sum_i (Y_i - \bar Y)^2} \\
&= \frac{\big[\sum_i (X_i - \bar X)(Y_i - \bar Y)\big]^2}{\sum_i (Y_i - \bar Y)^2\,\sum_i (X_i - \bar X)^2} \\
&= \left[\frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_i (X_i - \bar X)^2}\,\sqrt{\sum_i (Y_i - \bar Y)^2}}\right]^2 = r_{XY}^2
\end{aligned}$$
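The sketch below (not from the lecture) computes $r^2$ both ways on simulated data, as the ratio of explained to total variation and as the squared sample correlation, and confirms they agree.

```python
import numpy as np

# Illustrative check that SSR/SST equals the squared correlation between X and Y.
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, 100)
Y = 1 + 2 * X + rng.normal(0, 2, 100)

beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()
Y_hat = alpha_hat + beta_hat * X

r2_from_ss = np.sum((Y_hat - Y.mean()) ** 2) / np.sum((Y - Y.mean()) ** 2)
r2_from_corr = np.corrcoef(X, Y)[0, 1] ** 2
print(np.isclose(r2_from_ss, r2_from_corr))  # True
```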

Statistical Inference for OLS Estimates

The estimates $\hat\alpha$ and $\hat\beta$ can be computed for any given sample of data. Therefore, we also need to consider their sampling distributions, because each sample of $(X_i, Y_i)$ pairs will result in different estimates of $\alpha$ and $\beta$.

Consider the following distributional assumption on the error,

$$Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2).$$

The above is now a statistical model that describes the distribution of $Y_i$ given $X_i$. Specifically, we assume the observed $Y_i$ is error-prone but centered around the linear model for each value of $X_i$:

$$Y_i \overset{iid}{\sim} N(\alpha + \beta X_i,\ \sigma^2)$$

The error distribution results in the following assumptions.

1. Homoskedasticity: the variance of $Y_i$ is the same for any $X_i$.

$$V(Y_i) = V(\alpha + \beta X_i + \varepsilon_i) = V(\varepsilon_i) = \sigma^2$$

2. Linearity: the mean of $Y_i$ is linear with respect to $X_i$.

$$E(Y_i) = E(\alpha + \beta X_i + \varepsilon_i) = \alpha + \beta X_i + E(\varepsilon_i) = \alpha + \beta X_i$$

3. Independence: the $Y_i$ are independent of each other.

$$\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(\alpha + \beta X_i + \varepsilon_i,\ \alpha + \beta X_j + \varepsilon_j) = \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j$$

We can now investigate the bias and variance of the OLS estimators $\hat\alpha$ and $\hat\beta$. We assume that the $X_i$ are known constants.

• $\hat\beta$ is unbiased.
$$\begin{aligned}
E[\hat\beta] &= E\left[\sum_i w_i Y_i\right], \qquad w_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2} \\
&= \sum_i w_i E[Y_i] \\
&= \sum_i w_i(\alpha + \beta X_i) \\
&= \alpha\sum_i w_i + \beta\sum_i w_i X_i = \beta
\end{aligned}$$

because

$$\begin{aligned}
\sum_i w_i &= \frac{1}{\sum_i (X_i - \bar X)^2}\sum_i (X_i - \bar X) = 0 \\
\sum_i w_i X_i &= \frac{1}{\sum_i (X_i - \bar X)^2}\sum_i X_i(X_i - \bar X) \\
&= \frac{\sum_i X_i(X_i - \bar X)}{\sum_i (X_i - \bar X)(X_i - \bar X)} \\
&= \frac{\sum_i X_i(X_i - \bar X)}{\sum_i X_i(X_i - \bar X) - \bar X\sum_i (X_i - \bar X)} \\
&= \frac{\sum_i X_i(X_i - \bar X)}{\sum_i X_i(X_i - \bar X) - 0} = 1
\end{aligned}$$

• $\hat\alpha$ is unbiased.

$$\begin{aligned}
E[\hat\alpha] &= E[\bar Y - \hat\beta\bar X] = E[\bar Y] - \bar X E[\hat\beta] \\
&= E\left[\frac{1}{n}\sum_i Y_i\right] - \beta\bar X = \frac{1}{n}\sum_i E[\alpha + \beta X_i] - \beta\bar X \\
&= \frac{1}{n}\big(n\alpha + n\beta\bar X\big) - \beta\bar X \\
&= \alpha + \beta\bar X - \beta\bar X = \alpha
\end{aligned}$$

• $\hat\beta$ has variance $\dfrac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2} = \dfrac{\sigma^2}{(n-1)s_x^2}$, where $\sigma^2$ is the residual variance and $s_x^2$ is the sample variance of $x_1, \ldots, x_n$.
$$\begin{aligned}
V[\hat\beta] &= V\left[\sum_i w_i Y_i\right], \qquad w_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2} \\
&= \sum_i w_i^2 V[Y_i] \\
&= \sigma^2\sum_i w_i^2
\end{aligned}$$

where

$$\sum_i w_i^2 = \frac{1}{\big[\sum_i (X_i - \bar X)^2\big]^2}\sum_i (X_i - \bar X)^2 = \frac{1}{\sum_i (X_i - \bar X)^2}$$

• $\hat\alpha$ has variance $\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right)$.
$$\begin{aligned}
V[\hat\alpha] &= V\left[\sum_i v_i Y_i\right], \qquad v_i = \frac{1}{n} - \frac{\bar X(X_i - \bar X)}{\sum_j (X_j - \bar X)^2} \\
&= \sum_i v_i^2 V[Y_i] \\
&= \sigma^2\sum_i \left[\frac{1}{n} - \frac{\bar X(X_i - \bar X)}{\sum_j (X_j - \bar X)^2}\right]^2 \\
&= \sigma^2\sum_i \left[\frac{1}{n^2} + \frac{\bar X^2(X_i - \bar X)^2}{\big[\sum_j (X_j - \bar X)^2\big]^2} - \frac{2\bar X(X_i - \bar X)}{n\sum_j (X_j - \bar X)^2}\right] \\
&= \sigma^2\left[\frac{1}{n} + \frac{\bar X^2}{\big[\sum_j (X_j - \bar X)^2\big]^2}\sum_i (X_i - \bar X)^2 - \frac{2\bar X}{n\sum_j (X_j - \bar X)^2}\sum_i (X_i - \bar X)\right] \\
&= \sigma^2\left[\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right]
\end{aligned}$$

• $\hat\alpha$ and $\hat\beta$ are negatively correlated (when $\bar X > 0$).
$$\begin{aligned}
\mathrm{Cov}(\hat\alpha, \hat\beta) &= \mathrm{Cov}\left(\sum_i v_i Y_i,\ \sum_i w_i Y_i\right), \qquad v_i = \frac{1}{n} - \frac{\bar X(X_i - \bar X)}{\sum_j (X_j - \bar X)^2}, \quad w_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2} \\
&= \sum_i v_i w_i\,\mathrm{Cov}(Y_i, Y_i) \\
&= \sum_i v_i w_i\,\sigma^2 \\
&= \sigma^2\sum_i \left[\frac{1}{n} - \frac{\bar X(X_i - \bar X)}{\sum_j (X_j - \bar X)^2}\right]\frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2} \\
&= \sigma^2\left[\frac{\sum_i (X_i - \bar X)}{n\sum_j (X_j - \bar X)^2} - \frac{\bar X\sum_i (X_i - \bar X)^2}{\big[\sum_j (X_j - \bar X)^2\big]^2}\right] \\
&= \sigma^2\left[0 - \frac{\bar X\sum_i (X_i - \bar X)^2}{\big[\sum_j (X_j - \bar X)^2\big]^2}\right] \\
&= -\frac{\sigma^2\bar X}{\sum_i (X_i - \bar X)^2}
\end{aligned}$$
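As a Monte Carlo illustration (not from the lecture), the sketch below holds $X$ fixed, simulates many response vectors, and compares the empirical variance of $\hat\beta$ to the formula $\sigma^2/\sum_i (X_i - \bar X)^2$ derived above; the true parameter values are illustrative assumptions.

```python
import numpy as np

# Illustrative Monte Carlo check of V[beta_hat] = sigma^2 / sum((X_i - X_bar)^2).
rng = np.random.default_rng(6)
n, alpha, beta, sigma = 50, 1.0, 2.0, 1.5
X = rng.uniform(0, 10, n)          # X held fixed across replications
Sxx = np.sum((X - X.mean()) ** 2)

beta_hats = []
for _ in range(20000):
    Y = alpha + beta * X + rng.normal(0, sigma, n)
    beta_hats.append(np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx)

print(np.var(beta_hats))   # empirical variance over replications
print(sigma ** 2 / Sxx)    # theoretical variance; the two should be close
```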

Therefore, the sampling distributions for the regression parameters are:

$$\hat\alpha \sim N\!\left(\alpha,\ \sigma^2\left[\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right]\right), \qquad \hat\beta \sim N\!\left(\beta,\ \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right)$$

However, in most cases we don't know $\sigma^2$ and will need to estimate it from the data. One unbiased estimator for $\sigma^2$ is obtained from the residuals.

$$s^2 = \frac{1}{n-2}\sum_{i=1}^{n}(Y_i - \hat Y_i)^2$$

By replacing $\sigma^2$ with $s^2$, we have the following sampling distributions for $\hat\alpha$ and $\hat\beta$:
$$\hat\alpha \sim t_{n-2}\!\left(\alpha,\ s^2\left[\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right]\right), \qquad \hat\beta \sim t_{n-2}\!\left(\beta,\ \frac{s^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right)$$
Note that the degrees of freedom are $n-2$ because we estimated two parameters ($\alpha$ and $\beta$) in order to calculate the residuals. Similarly, recall that $\sum_i (\hat Y_i - Y_i) = 0$.

Knowing that $\hat\alpha$ and $\hat\beta$ follow a $t$-distribution with $n-2$ degrees of freedom, we are able to construct confidence intervals and carry out hypothesis tests as before. For example, a 90% confidence interval for $\beta$ is given by
$$\hat\beta \pm t_{5\%,\,n-2}\times\sqrt{\frac{s^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\,.$$

Similarly, consider testing the hypotheses $H_0\!: \beta = \beta_0$ versus $H_A\!: \beta \neq \beta_0$. The corresponding $t$-statistic is
$$\frac{\hat\beta - \beta_0}{\sqrt{\dfrac{s^2}{\sum_{i=1}^n (X_i - \bar X)^2}}} \sim t_{n-2}\,.$$
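The following sketch (not from the lecture) puts these pieces together on simulated data: it estimates $s^2$, forms the 90% confidence interval for $\beta$, and computes the $t$-statistic and two-sided p-value for $H_0\!:\beta = 0$, using `scipy.stats` for the $t$ quantiles.

```python
import numpy as np
from scipy import stats

# Illustrative inference for the slope: standard error, 90% CI, and t-test of beta = 0.
rng = np.random.default_rng(7)
n = 40
X = rng.uniform(0, 10, n)
Y = 1 + 2 * X + rng.normal(0, 1.5, n)

Sxx = np.sum((X - X.mean()) ** 2)
beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
alpha_hat = Y.mean() - beta_hat * X.mean()
resid = Y - (alpha_hat + beta_hat * X)
s2 = np.sum(resid ** 2) / (n - 2)        # unbiased estimate of sigma^2
se_beta = np.sqrt(s2 / Sxx)

t_crit = stats.t.ppf(0.95, df=n - 2)     # 5% in each tail for a 90% interval
ci = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)

t_stat = (beta_hat - 0) / se_beta        # test H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(ci, t_stat, p_value)
```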

Prediction

Based on the sampling distributions of $\hat\alpha$ and $\hat\beta$, we can also investigate the uncertainty associated with the fitted value $\hat Y_0 = \hat\alpha + \hat\beta X_0$ on the fitted line for any $X_0$. Specifically,
$$\hat Y_0 \sim N\!\left(\alpha + X_0\beta,\ \sigma^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2}\right]\right).$$

Note that $\alpha + X_0\beta$ is the mean of $Y$ given $X_0$; the textbook uses the notation $\mu_0$. Replacing the unknown parameters with our estimates, we have
$$\hat Y_0 \sim t_{n-2}\!\left(\hat\alpha + X_0\hat\beta,\ s^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2}\right]\right).$$

• The fitted value $\hat Y_0 = \hat\alpha + \hat\beta X_0$ has mean $\alpha + X_0\beta$ and variance $\sigma^2\left[\dfrac{1}{n} + \dfrac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2}\right]$.

$$E(\hat Y_0) = E(\hat\alpha + \hat\beta X_0) = E(\hat\alpha) + X_0 E(\hat\beta) = \alpha + X_0\beta$$

$$\begin{aligned}
V(\hat Y_0) &= V(\hat\alpha + \hat\beta X_0) = V(\hat\alpha) + X_0^2 V(\hat\beta) + 2X_0\,\mathrm{Cov}(\hat\alpha, \hat\beta) \\
&= \sigma^2\left[\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2}\right] + \sigma^2\frac{X_0^2}{\sum_{i=1}^n (X_i - \bar X)^2} - 2\sigma^2\frac{X_0\bar X}{\sum_i (X_i - \bar X)^2} \\
&= \sigma^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2}\right]
\end{aligned}$$

We can also consider the uncertainty of a new observation $Y_0^*$ at $X_0$. In other words, what values of $Y_0^*$ do we expect to see at $X_0$? Note that this is different from $\hat Y_0$, which is the fitted value that represents the mean of $Y$ at $X_0$. Since

$$Y_0^* = \alpha + \beta X_0 + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$
by replacing the unknown parameters with our estimates, we obtain

$$Y_0^* \sim t_{n-2}\!\left(\hat\alpha + \hat\beta X_0,\ s^2 + s^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum_i (X_i - \bar X)^2}\right]\right).$$
Again, the above distribution describes the actual observable data rather than model parameters; the corresponding interval estimates are known as prediction intervals, instead of confidence intervals.
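To make the distinction concrete, here is a sketch (not from the lecture; $X_0 = 5$ and the 95% level are illustrative choices) that computes both a confidence interval for the mean of $Y$ at $X_0$ and a prediction interval for a new observation at $X_0$; the prediction interval is wider because it includes the extra $s^2$ term.

```python
import numpy as np
from scipy import stats

# Illustrative confidence interval for the mean at X0 vs. prediction interval
# for a new observation at X0.
rng = np.random.default_rng(8)
n = 40
X = rng.uniform(0, 10, n)
Y = 1 + 2 * X + rng.normal(0, 1.5, n)

Sxx = np.sum((X - X.mean()) ** 2)
beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
alpha_hat = Y.mean() - beta_hat * X.mean()
s2 = np.sum((Y - (alpha_hat + beta_hat * X)) ** 2) / (n - 2)

X0 = 5.0
y0_hat = alpha_hat + beta_hat * X0
t_crit = stats.t.ppf(0.975, df=n - 2)

se_mean = np.sqrt(s2 * (1 / n + (X0 - X.mean()) ** 2 / Sxx))       # mean of Y at X0
se_pred = np.sqrt(s2 + s2 * (1 / n + (X0 - X.mean()) ** 2 / Sxx))  # new observation at X0

print("CI:", (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean))
print("PI:", (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred))
```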
