
Multiple Linear Regression (MLR) Handouts

Yibi Huang

• Data and Models
• Least Squares Estimate, Fitted Values, Residuals
• Sum of Squares
• Do Regression in R
• Interpretation of Regression Coefficients
• t-Tests on Individual Regression Coefficients
• F-Tests on Multiple Regression Coefficients/Goodness-of-Fit

MLR - 1

Data for Multiple Linear Regression

Multiple linear regression is a generalized form of simple linear regression, in which the data contain multiple explanatory variables.

             SLR                MLR
             x      y           x1     x2     ...    xp     y
 case 1:     x1     y1          x11    x12    ...    x1p    y1
 case 2:     x2     y2          x21    x22    ...    x2p    y2
   ...
 case n:     xn     yn          xn1    xn2    ...    xnp    yn

• For SLR, we observe pairs of variables. For MLR, we observe rows of variables. Each row (or pair) is called a case, a record, or a data point.
• yi is the response (or dependent variable) of the ith observation.
• There are p explanatory variables (or covariates, predictors, independent variables), and xik is the value of the explanatory variable xk of the ith case.

MLR - 2

Multiple Linear Regression Models

    yi = β0 + β1xi1 + ... + βpxip + εi,   where the εi's are i.i.d. N(0, σ²)

In the model above,

• εi's (errors, or noise) are i.i.d. N(0, σ²)
• Parameters include:
    β0 = intercept;
    βk = regression coefficient (slope) for the kth explanatory variable, k = 1, ..., p;
    σ² = Var(εi) is the variance of the errors.
• Observed (known): yi, xi1, xi2, ..., xip
  Unknown: β0, β1, ..., βp, σ², εi's
• Random variables: εi's, yi's
  Constants (nonrandom): βk's, σ², xik's

MLR - 3

Questions

• What are the mean, the variance, and the distribution of yi?

• We assume εi's are independent. Are yi's independent?
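To make the model concrete, here is a minimal simulation sketch in R (the sample size and all parameter values below are hypothetical, chosen only for illustration):

> # simulate one data set from an MLR model with p = 2 covariates
> set.seed(222)
> n = 100
> x1 = runif(n); x2 = runif(n)       # two explanatory variables
> eps = rnorm(n, mean=0, sd=2)       # errors i.i.d. N(0, sigma^2), with sigma = 2
> y = 3 + 1.5*x1 - 0.5*x2 + eps      # beta0 = 3, beta1 = 1.5, beta2 = -0.5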

MLR - 4

Fitting the Model — Method of Least Squares

• Recall that for SLR, the least squares estimate (β̂0, β̂1) for (β0, β1) is the intercept and slope of the straight line with the minimum sum of squared vertical distances to the data points,

    Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi)²

  [Figure: scatter plot of y against x, comparing the fitted regression line with an arbitrary straight line]

• MLR is just like SLR. The least squares estimate (β̂0, β̂1, ..., β̂p) for (β0, ..., βp) is the intercept and slopes of the (hyper)plane with the minimum sum of squared vertical distances to the data points,

    Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi1 − ... − β̂pxip)²

MLR - 5

Solving the Least Squares Problem (1)

From now on, we use the "hat" symbol to differentiate the estimated coefficient β̂j from the actual unknown coefficient βj. To find the (β̂0, β̂1, ..., β̂p) that minimize

    L(β̂0, β̂1, ..., β̂p) = Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi1 − ... − β̂pxip)²

one can take the derivatives of L with respect to each β̂j:

    ∂L/∂β̂0 = −2 Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi1 − ... − β̂pxip)
    ∂L/∂β̂k = −2 Σᵢ₌₁ⁿ xik (yi − β̂0 − β̂1xi1 − ... − β̂pxip),   k = 1, 2, ..., p

and then equate them to 0. This results in a system of (p + 1) equations in (p + 1) unknowns.

MLR - 6

Solving the Least Squares Problem (2)

The least squares estimate (β̂0, β̂1, ..., β̂p) is the solution to the following system of equations, called the normal equations.

    n β̂0      + β̂1 Σᵢ xi1     + ··· + β̂p Σᵢ xip      = Σᵢ yi
    β̂0 Σᵢ xi1 + β̂1 Σᵢ xi1²    + ··· + β̂p Σᵢ xi1 xip  = Σᵢ xi1 yi
      ⋮
    β̂0 Σᵢ xik + β̂1 Σᵢ xik xi1 + ··· + β̂p Σᵢ xik xip  = Σᵢ xik yi
      ⋮
    β̂0 Σᵢ xip + β̂1 Σᵢ xip xi1 + ··· + β̂p Σᵢ xip²     = Σᵢ xip yi

where every sum Σᵢ runs over i = 1, ..., n.

• Don't worry about solving the equations. R and many other software packages can do the computation for us.

• In general, β̂j ≠ βj, but they will be close under some conditions.
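For the curious, here is a minimal sketch showing that solving the normal equations directly agrees with lm() (the simulated data below are purely hypothetical):

> set.seed(1)
> n = 20
> x1 = rnorm(n); x2 = rnorm(n)
> y = 1 + 2*x1 - x2 + rnorm(n)     # hypothetical data with p = 2
> X = cbind(1, x1, x2)             # design matrix with a leading column of 1's
> solve(t(X) %*% X, t(X) %*% y)    # solve the normal equations directly
> lm(y ~ x1 + x2)$coef             # the same estimates from lm()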

MLR - 7

Fitted Values

The fitted value or predicted value:

    ŷi = β̂0 + β̂1xi1 + ... + β̂pxip

• Again, the "hat" symbol is used to differentiate the fitted value ŷi from the actual observed value yi.

MLR - 8

Residuals

• One cannot directly compute the errors

εi = yi − β0 − β1xi1 − ... − βpxip

since the coefficients β0, β1, ..., βp are unknown.

• The errors εi can be estimated by the residuals ei, defined as:

    residual ei = observed yi − predicted yi
                = yi − ŷi
                = yi − (β̂0 + β̂1xi1 + ... + β̂pxip)
                = (β0 + β1xi1 + ... + βpxip + εi) − (β̂0 + β̂1xi1 + ... + β̂pxip)

• ei ≠ εi in general, since β̂j ≠ βj
• Graphical explanation
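As a small sketch of these definitions in R (assuming the housing-price fit lm1 = lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing) introduced later in these notes), the fitted values and residuals can be computed by hand and checked against what R stores:

> X = model.matrix(lm1)                   # design matrix, including the intercept column
> yhat = as.vector(X %*% coef(lm1))       # fitted values: b0 + b1*x_i1 + ... + bp*x_ip
> e = housing$Price - yhat                # residuals: observed minus predicted
> all.equal(yhat, as.vector(lm1$fitted))  # TRUE
> all.equal(e, as.vector(lm1$res))        # TRUE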

MLR - 9

Properties of Residuals

Recall that the least squares estimate (β̂0, β̂1, ..., β̂p) satisfies the equations

    Σᵢ₌₁ⁿ (yi − β̂0 − β̂1xi1 − ... − β̂pxip) = 0   and
    Σᵢ₌₁ⁿ xik (yi − β̂0 − β̂1xi1 − ... − β̂pxip) = 0,   k = 1, 2, ..., p,

where yi − β̂0 − β̂1xi1 − ... − β̂pxip = yi − ŷi = ei is the ith residual.

Thus the residuals ei have the properties

    Σᵢ₌₁ⁿ ei = 0   (residuals add up to 0)
    Σᵢ₌₁ⁿ xik ei = 0,   k = 1, 2, ..., p   (residuals are orthogonal to covariates)

The two properties also imply that the residuals are uncorrelated with each of the p covariates. (proof: HW1 bonus).

MLR - 10

Sum of Squares

Observe that

    yi − ȳ = (ŷi − ȳ) + (yi − ŷi).

Squaring up both sides, we get

    (yi − ȳ)² = (ŷi − ȳ)² + (yi − ŷi)² + 2(ŷi − ȳ)(yi − ŷi).

Summing up over all the cases i = 1, 2, ..., n, we get

    Σᵢ₌₁ⁿ (yi − ȳ)²  =  Σᵢ₌₁ⁿ (ŷi − ȳ)²  +  Σᵢ₌₁ⁿ (yi − ŷi)²  +  2 Σᵢ₌₁ⁿ (ŷi − ȳ)(yi − ŷi)
         SST                SSR                 SSE

Here yi − ŷi = ei, and we'll shortly show the last (cross-product) term is 0.

MLR - 11

Why Σᵢ₌₁ⁿ (ŷi − ȳ)(yi − ŷi) = 0?

    Σᵢ₌₁ⁿ (ŷi − ȳ)(yi − ŷi)
      = Σᵢ₌₁ⁿ ŷi ei − ȳ Σᵢ₌₁ⁿ ei                                      (since yi − ŷi = ei)
      = Σᵢ₌₁ⁿ (β̂0 + β̂1xi1 + ... + β̂pxip) ei − ȳ Σᵢ₌₁ⁿ ei
      = β̂0 Σᵢ₌₁ⁿ ei + β̂1 Σᵢ₌₁ⁿ xi1 ei + ... + β̂p Σᵢ₌₁ⁿ xip ei − ȳ Σᵢ₌₁ⁿ ei
      = 0,

in which we used the properties of the residuals: Σᵢ₌₁ⁿ ei = 0 and Σᵢ₌₁ⁿ xik ei = 0 for all k = 1, ..., p.

MLR - 12

Interpretation of Sum of Squares

    Σᵢ₌₁ⁿ (yi − ȳ)²  =  Σᵢ₌₁ⁿ (ŷi − ȳ)²  +  Σᵢ₌₁ⁿ (yi − ŷi)²
         SST                SSR                 SSE             (with yi − ŷi = ei)

• SST = total sum of squares
    • total variability of y
    • depends on the response y only, not on the form of the model
• SSR = regression sum of squares
    • variability of y explained by x1, ..., xp
• SSE = error (residual) sum of squares
    • = min over β0, β1, ..., βp of Σᵢ₌₁ⁿ (yi − β0 − β1xi1 − ··· − βpxip)²
    • variability of y not explained by the x's
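As a quick numerical check (again assuming the housing-price fit lm1 and data frame housing used later in these notes), the decomposition SST = SSR + SSE can be verified directly:

> y = housing$Price
> SST = sum((y - mean(y))^2)            # total sum of squares
> SSR = sum((lm1$fitted - mean(y))^2)   # regression sum of squares
> SSE = sum(lm1$res^2)                  # error sum of squares
> SST; SSR + SSE                        # the two numbers agree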

MLR - 13

Nested Models

We say Model 1 is nested in Model 2 if Model 1 is a special case of Model 2 (and hence Model 2 is an extension of Model 1). E.g., for the 4 models below,

Model A: Y = β0 + β1X1 + β2X2 + β3X3 + ε

Model B: Y = β0 + β1X1 + β2X2 + ε

Model C: Y = β0 + β1X1 + β3X3 + ε

Model D: Y = β0 + β1(X1 + X2) + ε

• B is nested in A, since A reduces to B when β3 = 0.

• C is also nested in A, since A reduces to C when β2 = 0.

• D is nested in B, since B reduces to D when β1 = β2.
• B and C are NOT nested either way.

• D is NOT nested in C.

MLR - 14

Nested Relationship Is Transitive

If Model 1 is nested in Model 2, and Model 2 is nested in Model 3, then Model 1 is also nested in Model 3. For example, for the models on the previous slide,

D is nested in B, and B is nested in A,

implies D is also nested in A, which is clearly true because Model A reduces to Model D when

β1 = β2, and β3 = 0.

When two models are nested (Model 1 is nested in Model 2),

• the smaller model (Model 1) is called the reduced model, and

• the more general model (Model 2) is called the full model.

MLR - 15

SST of Nested Models

Question: Compare the SST for Model A and the SST for Model B. Which one is larger? Or are they equal?

What about the SST for Model C? For Model D?

MLR - 16

SSE of Nested Models

When a reduced model is nested in a full model, then

(i) SSEreduced ≥ SSEfull, and (ii) SSRreduced ≤ SSRfull.

Proof. We will prove (i) for

• the full model yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi and
• the reduced model yi = β0 + β1xi1 + β3xi3 + εi.

The proofs for other nested models are similar.

    SSEfull = min over (β0, β1, β2, β3) of Σᵢ₌₁ⁿ (yi − β0 − β1xi1 − β2xi2 − β3xi3)²
            ≤ min over (β0, β1, β3) of Σᵢ₌₁ⁿ (yi − β0 − β1xi1 − β3xi3)²
            = SSEreduced

The inequality holds because the second minimization is just the first with β2 restricted to 0, and a minimum over a restricted set cannot be smaller. Part (ii) follows directly from (i), the identity SST = SSR + SSE, and the fact that all MLR models of the same data set have a common SST.

MLR - 17

Degrees of Freedom

It can be shown that if the MLR model

    yi = β0 + β1xi1 + ... + βpxip + εi,   εi's i.i.d. ∼ N(0, σ²)

is true, then

    SSE/σ² ∼ χ²ₙ₋ₚ₋₁.

If we further assume that β1 = β2 = ··· = βp = 0, then

    SST/σ² ∼ χ²ₙ₋₁,   SSR/σ² ∼ χ²ₚ,

and SSR is independent of SSE. Note that the degrees of freedom of the 3 chi-square distributions,

dfT = n − 1, dfR = p, dfE = n − p − 1

break down similarly

dfT = dfR + dfE

just like SST = SSR + SSE.

MLR - 18

Why Is the Degrees of Freedom for Errors dfE = n − p − 1?

The n residuals e1,..., en cannot all vary freely.

There are p + 1 constraints:

    Σᵢ₌₁ⁿ ei = 0   and   Σᵢ₌₁ⁿ xik ei = 0   for k = 1, ..., p.

So only n − (p + 1) of them can vary freely.

The p + 1 constraints come from the p + 1 coefficients β0, ..., βp in the model; each contributes one constraint ∂L/∂β̂k = 0.

MLR - 19

Mean Square Error (MSE) — Estimate of σ²

The mean square is the sum of squares divided by its degrees of freedom:

    MST = SST/dfT = SST/(n − 1) = sample variance of Y,
    MSR = SSR/dfR = SSR/p,
    MSE = SSE/dfE = SSE/(n − p − 1) = σ̂².

• From the fact that SSE/σ² ∼ χ²ₙ₋ₚ₋₁ and that the mean of a χ²ₖ distribution is k, we know that MSE is an unbiased estimator for σ².

• Though SSE always decreases as we add terms to the model, adding unimportant terms may increase MSE.
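A one-line check in R (assuming the housing-price fit lm1 from later in these notes, where n = 26 and p = 8): the hand-computed σ̂ should match the "Residual standard error" reported by summary():

> sqrt(sum(lm1$res^2)/(26 - 8 - 1))   # sqrt(SSE/(n - p - 1)) = sigma-hat
> summary(lm1)$sigma                  # the same value, 4.021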

MLR - 20

Multiple R-Squared

Multiple R², also called the coefficient of determination, is defined as

    R² = SSR/SST = 1 − SSE/SST
       = proportion of variability in y explained by x1, ..., xp,

which measures the strength of the linear relationship between y and the p covariates.

• 0 ≤ R² ≤ 1
• For SLR, R² = rxy² is the square of the correlation coefficient between X and Y (Proof: HW1). So multiple R² is a generalization of the correlation coefficient.
• When two models are nested, R²(reduced model) ≤ R²(full model).

• Is a large R² always preferable?

MLR - 21

Adjusted R-Squared

Because R² always increases as we add terms to the model, some people prefer to use an adjusted R², defined as

    R²adj = 1 − (SSE/dfE)/(SST/dfT) = 1 − [SSE/(n − p − 1)]/[SST/(n − 1)]
          = 1 − (1 − R²)(n − 1)/(n − p − 1).

• −p/(n − p − 1) ≤ R²adj ≤ R² ≤ 1
• Unlike R², R²adj can be negative.
• R²adj does not always increase as more variables are added to the model. In fact, if unnecessary terms are added, R²adj may decrease.
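Both quantities can be recomputed from scratch in R (again assuming the housing-price fit lm1 from later in these notes, with n = 26 and p = 8) and compared with summary():

> y = housing$Price
> SST = sum((y - mean(y))^2); SSE = sum(lm1$res^2)
> 1 - SSE/SST                              # multiple R-squared, 0.9305
> summary(lm1)$r.squared                   # same
> 1 - (SSE/(26 - 8 - 1))/(SST/(26 - 1))    # adjusted R-squared, 0.8979
> summary(lm1)$adj.r.squared               # same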

MLR - 22

Example: Housing Price

Price BDR  FLR FP RMS ST LOT BTH CON GAR LOC
   53   2  967  0   5  0  39 1.5   1 0.0   0
   55   2  815  1   5  0  33 1.0   1 2.0   0
   56   3  900  0   5  1  35 1.5   1 1.0   0
   58   3 1007  0   6  1  24 1.5   0 2.0   0
   64   3 1100  1   7  0  50 1.5   1 1.5   0
   44   4  897  0   7  0  25 2.0   0 1.0   0
   49   5 1400  0   8  0  30 1.0   0 1.0   0
   70   3 2261  0   6  0  29 1.0   0 2.0   0
   72   4 1290  0   8  1  33 1.5   1 1.5   0
   82   4 2104  0   9  0  40 2.5   1 1.0   0
   85   8 2240  1  12  1  50 3.0   0 2.0   0
   45   2  641  0   5  0  25 1.0   0 0.0   1
   47   3  862  0   6  0  25 1.0   1 0.0   1
   49   4 1043  0   7  0  30 1.5   0 0.0   1
   56   4 1325  0   8  0  50 1.5   0 0.0   1
   60   2  782  0   5  1  25 1.0   0 0.0   1
   62   3 1126  0   7  1  30 2.0   1 0.0   1
   64   4 1226  0   8  0  37 2.0   0 2.0   1
    .
    .
   50   2  691  0   6  0  30 1.0   0 2.0   0
   65   3 1023  0   7  1  30 2.0   1 1.0   0

Price = Selling price in $1000
BDR   = Number of bedrooms
FLR   = Floor space in sq. ft.
FP    = Number of fireplaces
RMS   = Number of rooms
ST    = Storm windows (1 if present, 0 if absent)
LOT   = Front footage of lot in feet
BTH   = Number of bathrooms
CON   = Construction (1 if frame, 0 if brick)
GAR   = Garage size (0 = no garage, 1 = one-car garage, etc.)
LOC   = Location (1 if property is in zone A, 0 otherwise)

MLR - 23

How to Do Regression Using R?

> housing = read.table("housing.dat", h=TRUE)   # to load the data
> lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing)

Call:
lm(formula = Price ~ FLR + RMS + BDR + GAR + LOT + ST + CON +
    LOC, data = housing)

Coefficients:
(Intercept)          FLR          RMS          BDR          GAR
   12.47980      0.01704      2.38264     -4.54355      5.07873
        LOT           ST          CON          LOC
    0.38241      9.82757      4.86507      6.95701

The lm() command above asks R to fit the model

    Price = β0 + β1 FLR + β2 RMS + β3 BDR + β4 GAR + β5 LOT + β6 ST + β7 CON + β8 LOC + ε

and R gives us the regression equation

    Price = 12.480 + 0.017 FLR + 2.383 RMS − 4.544 BDR + 5.079 GAR + 0.382 LOT + 9.828 ST + 4.865 CON + 6.957 LOC

MLR - 24

More R Commands

> lm1 = lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing)
> summary(lm1)   # regression output with more details,
                 # including multiple R-squared
                 # and the estimate of sigma

> lm1$coef     # show the estimated beta's
> lm1$fitted   # show the fitted values
> lm1$res      # show the residuals

Guess what we will get in R if we type the following commands?

> sum(lm1$res)
> mean(lm1$res)
> cor(lm1$res, housing$FLR)
> cor(lm1$res, housing$RMS)
> cor(lm1$res, housing$BDR)

MLR - 25

> lm1 = lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing)
> summary(lm1)
Call:
lm(formula = Price ~ FLR + RMS + BDR + GAR + LOT + ST + CON +
    LOC, data = housing)

Residuals:
   Min     1Q Median     3Q    Max
-6.020 -2.129 -0.213  2.147  6.492

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.479799   4.443985   2.808 0.012094 *
FLR          0.017038   0.002751   6.195  9.8e-06 ***
RMS          2.382638   1.418290   1.680 0.111251
BDR         -4.543550   1.781145  -2.551 0.020671 *
GAR          5.078729   1.209692   4.198 0.000604 ***
LOT          0.382411   0.106832   3.580 0.002309 **
ST           9.827572   1.929232   5.094  9.0e-05 ***
CON          4.865071   1.890718   2.573 0.019746 *
LOC          6.957007   2.044084   3.403 0.003382 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.021 on 17 degrees of freedom
Multiple R-squared: 0.9305, Adjusted R-squared: 0.8979
F-statistic: 28.47 on 8 and 17 DF, p-value: 2.25e-08

MLR - 26

Residual standard error: 4.021 on 17 degrees of freedom
Multiple R-squared: 0.9305, Adjusted R-squared: 0.8979
F-statistic: 28.47 on 8 and 17 DF, p-value: 2.25e-08

• Residual standard error = √MSE = σ̂

• "on 17 degrees of freedom" because dfE = n − p − 1 = 26 − 8 − 1 = 17

• Multiple R-squared = 1 − SSE/SST

• Adjusted R-squared = R²adj = 1 − [SSE/(n − p − 1)]/[SST/(n − 1)]

• The F-statistic 28.47 is for testing

    H0: β1 = β2 = ··· = βp = 0   v.s.   Ha: not all βj = 0, j = 1, ..., p.

We will soon explain the meaning of this F-test.

MLR - 27

    Price = 12.480 + 0.017 FLR + 2.383 RMS − 4.544 BDR + 5.079 GAR + 0.382 LOT + 9.828 ST + 4.865 CON + 6.957 LOC

The regression equation tells us:

• an extra square foot of floor area increases the price by $17,
• an additional room by $2383,
• an additional bedroom by −$4544,
• an additional space in the garage by $5079,
• an extra foot of front footage by $382,
• using storm windows by $9828,
• constructing in brick rather than in frame by $4865,
• locating in zone A rather than in other areas by $6957.

Question: Why does an additional bedroom make a house less valuable?

MLR - 28

Interpretation of Regression Coefficients

• β0: the intercept, the mean value of the response y when all covariates xj are zero.

  • May not make practical sense. E.g., in the housing price model, β0 doesn't make practical sense since there is no housing unit with 0 floor space.

• βj: the regression coefficient for xj, is the mean change in the response y when xj is increased by one unit, holding all other covariates constant.

  • The interpretation of βj depends on the presence of other covariates in the model. E.g., the meanings of the two β1's in the following two models are different:

Model 1: yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi
Model 2: yi = β0 + β1xi1 + εi.

MLR - 29

What's Wrong?

# Model 1
> lmBDRonly = lm(Price ~ BDR, data=housing)
> lmBDRonly$coef
(Intercept)         BDR
  43.487365    3.920578

The regression coefficient for BDR is 3.92 in Model 1 above but −4.54 in Model 2 below.

# Model 2
> lm1 = lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing)
> lm1$coef
(Intercept)         FLR         RMS         BDR         GAR
12.47979941  0.01703833  2.38263831 -4.54355024  5.07872939
        LOT          ST         CON         LOC
 0.38241085  9.82757237  4.86507085  6.95700689

Considering BDR alone, house prices increase with BDR. However, when other covariates (FLR, RMS, etc.) are taken into account, house prices decrease with BDR. Does this make sense?

MLR - 30

One Sample t-Test (Review)

Given a sample of size n, Y1, Y2, ..., Yn, from some (normal) population with unknown mean μ and unknown variance σ², we want to test

    H0: μ = μ0   v.s.   Ha: μ ≠ μ0.

The t-statistic is

    t = (Ȳ − μ0)/SE(Ȳ) = (Ȳ − μ0)/(s/√n),   where s = √[ Σᵢ₌₁ⁿ (Yi − Ȳ)²/(n − 1) ].

If the population is normal, the t-statistic has a t-distribution with n − 1 degrees of freedom.
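For reference, the one-sample t-test is a single call in R (the data below are hypothetical):

> y = c(5.1, 4.8, 5.6, 5.2, 4.9)   # a hypothetical sample
> t.test(y, mu = 5)                # tests H0: mu = 5 v.s. Ha: mu != 5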

MLR - 31

t-Tests on Individual Regression Coefficients

For an MLR model Yi = β0 + β1Xi1 + ... + βpXip + εi, to test the hypotheses

    H0: βj = c   v.s.   Ha: βj ≠ c,

the t-statistic is

    t = (β̂j − c)/SE(β̂j),

in which SE(β̂j) is the standard error for β̂j.

• The general formula for SE(β̂j) is a bit complicated but unimportant in STAT222 and hence is omitted.

• R can compute SE(β̂j) for us.

• Formulas for SE(β̂j) in a few special cases will be given later.

This t-statistic also has a t-distribution, with n − p − 1 degrees of freedom.
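The same t-statistic can be assembled by hand from the coefficient table. A sketch (assuming the housing-price fit lm1 shown on the next slide, where dfE = 17), testing H0: βBDR = c:

> est = summary(lm1)$coef["BDR", "Estimate"]
> se  = summary(lm1)$coef["BDR", "Std. Error"]
> c0  = 0                          # the hypothesized value c (0 here)
> tstat = (est - c0)/se
> 2*pt(-abs(tstat), df = 17)       # two-sided P-value, with dfE = n - p - 1 = 17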

MLR - 32 > lm1 = lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing) > summary(lm1) Call: lm(formula = Price ~ FLR + RMS + BDR + GAR + LOT + ST + CON + LOC, data = housing)

Residuals:
   Min     1Q Median     3Q    Max
-6.020 -2.129 -0.213  2.147  6.492

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.479799   4.443985   2.808 0.012094 *
FLR          0.017038   0.002751   6.195  9.8e-06 ***
RMS          2.382638   1.418290   1.680 0.111251
BDR         -4.543550   1.781145  -2.551 0.020671 *
GAR          5.078729   1.209692   4.198 0.000604 ***
LOT          0.382411   0.106832   3.580 0.002309 **
ST           9.827572   1.929232   5.094  9.0e-05 ***
CON          4.865071   1.890718   2.573 0.019746 *
LOC          6.957007   2.044084   3.403 0.003382 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.021 on 17 degrees of freedom
Multiple R-squared: 0.9305, Adjusted R-squared: 0.8979
F-statistic: 28.47 on 8 and 17 DF, p-value: 2.25e-08

MLR - 33

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.479799   4.443985   2.808 0.012094 *
FLR          0.017038   0.002751   6.195  9.8e-06 ***
RMS          2.382638   1.418290   1.680 0.111251
...... (some rows are omitted)
LOC          6.957007   2.044084   3.403 0.003382 **

• the first column gives variable names

• the column Estimate gives the LS estimates β̂j's for the βj's

• the column Std. Error gives SE(β̂j), the standard error of β̂j

• the column t value gives the t-value = β̂j/SE(β̂j)
• the column Pr(>|t|) gives the P-value for testing H0: βj = 0 v.s. Ha: βj ≠ 0.

E.g., for RMS, we see β̂RMS ≈ 2.38 and SE(β̂RMS) ≈ 1.42, so the t-value for RMS = β̂RMS/SE(β̂RMS) ≈ 2.38/1.42 ≈ 1.68. The P-value 0.111 is the 2-sided P-value for testing H0: βRMS = 0.

MLR - 34

Types of Tests for MLR

The t-test can only test a single coefficient. One may also want to test for multiple coefficients. E.g., for the following MLR model with 6 covariates,

    Y = β0 + β1X1 + ... + β6X6 + ε,

one may want to test whether

1. all the regression coefficients are equal to zero:
   H0: β1 = β2 = ··· = β6 = 0 v.s. Ha: at least one of β1, ..., β6 is not 0
2. some subset of the regression coefficients are equal to zero, e.g.:
   H0: β2 = β3 = β5 = 0 v.s. Ha: at least one of β2, β3, β5 ≠ 0
3. some of the regression coefficients are equal to each other, e.g.:
   H0: β2 = β3 = β5 v.s. Ha: β2, β3, and β5 are not all equal
4. some β's satisfy certain specified linear constraints, e.g.:
   H0: β2 = β3 + β5 v.s. Ha: β2 ≠ β3 + β5

MLR - 35

Tests on Multiple Coefficients Are Model Comparisons

Each of the four tests on the previous slide can be viewed as a comparison of 2 models, a full model and a reduced model.

• Testing β1 = β2 = ··· = β6 = 0

  Full:    Y = β0 + β1X1 + ... + β6X6 + ε
  Reduced: Y = β0 + ε   (all covariates are redundant)

• Testing β2 = β3 = β5 = 0

  Full:    Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + ε
  Reduced: Y = β0 + β1X1 + β4X4 + β6X6 + ε   (X2, X3, X5 are redundant)

• Testing β2 = β3 = β5

  Full:    Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + ε
  Reduced: Y = β0 + β1X1 + β2(X2 + X3 + X5) + β4X4 + β6X6 + ε
  (Y depends on X2, X3, X5 only through their sum X2 + X3 + X5)

MLR - 36

Tests on Multiple Coefficients Are Model Comparisons (2)

• Testing β2 = β3 + β5

  Full:    Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + ε
  Reduced: Y = β0 + β1X1 + (β3 + β5)X2 + β3X3 + β4X4 + β5X5 + β6X6 + ε
           = β0 + β1X1 + β3(X2 + X3) + β4X4 + β5(X2 + X5) + β6X6 + ε

Observe that the reduced model is always nested in the full model.

MLR - 37

General Framework for Testing Nested Models

H0: the reduced model is true   v.s.   Ha: the full model is true

• As the reduced model is nested in the full model, it is always true that SSEreduced ≥ SSEfull.
• Trade-off between simplicity and precision:

  • The full model fits the data better (with smaller SSE) but is more complicated.

  • The reduced model doesn't fit as well but is simpler.

• How to choose between the full and the reduced models?

  • If SSEreduced ≈ SSEfull, one can sacrifice a bit of precision in exchange for simplicity.
  • If SSEreduced ≫ SSEfull, it costs a great reduction in precision to exchange for simplicity; the full model is preferred.

MLR - 38

The F-Statistic

    F = [(SSEreduced − SSEfull)/(dfreduced − dffull)] / [SSEfull/dffull]

• SSEreduced − SSEfull is the reduction in SSE from replacing the reduced model with the full model.

• dffull and dfreduced are the dfE for the full and reduced models, respectively.

• The denominator is the MSE of the full model.

• F ≥ 0, since SSEreduced ≥ SSEfull ≥ 0.

• The smaller the F-statistic, the more we favor the reduced model.

• Under H0, the F-statistic has an F-distribution with dfreduced − dffull and dffull degrees of freedom.
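The F-statistic is easy to compute by hand from two nested fits. A sketch (assuming the housing-price fits lm1 and lm3 used later in these notes; the result should match what anova(lm3, lm1) reports):

> SSEf = sum(lm1$res^2); SSEr = sum(lm3$res^2)    # SSE of the full and reduced models
> dff = lm1$df.residual; dfr = lm3$df.residual    # dfE of the full and reduced models
> Fstat = ((SSEr - SSEf)/(dfr - dff))/(SSEf/dff)
> Fstat                                           # the F-statistic
> pf(Fstat, dfr - dff, dff, lower.tail=FALSE)     # its P-value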

MLR - 39

Testing All Coefficients Equal Zero

Testing the hypotheses

    H0: β1 = ··· = βp = 0   v.s.   Ha: not all of β1, ..., βp are 0

is a test to evaluate the overall significance of a model.

Full:    Y = β0 + β1X1 + ... + βpXp + ε
Reduced: Y = β0 + ε   (all covariates are unnecessary)

• The LS estimate for β0 in the reduced model is β̂0 = Ȳ, so

    SSEreduced = Σᵢ₌₁ⁿ (Yi − β̂0)² = Σᵢ₌₁ⁿ (Yi − Ȳ)² = SSTfull

• dfreduced = dfEreduced = n − 1, because the reduced model has only one coefficient, β0

• dffull = dfEfull = n − p − 1.

MLR - 40

Testing All Coefficients Equal Zero

So

    F = [(SSEreduced − SSEfull)/(dfreduced − dffull)] / [SSEfull/dffull]
      = [(SSTfull − SSEfull)/(n − 1 − (n − p − 1))] / [SSEfull/(n − p − 1)]
      = [SSRfull/p] / [SSEfull/(n − p − 1)].

Moreover, F ∼ F(p, n − p − 1) under H0: β1 = β2 = ··· = βp = 0.

In R, the F-statistic and P-value are displayed in the last line of the output of the summary() command.

> lm1 = lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing)
> summary(lm1)
... (output omitted)

Residual standard error: 4.021 on 17 degrees of freedom
Multiple R-squared: 0.9305, Adjusted R-squared: 0.8979
F-statistic: 28.47 on 8 and 17 DF, p-value: 2.25e-08

MLR - 41

ANOVA and the F-Test

The test of all coefficients equal zero is often summarized in an ANOVA table.

Source       df                Sum of Squares   Mean Squares      F
Regression   dfR = p           SSR              MSR = SSR/dfR     F = MSR/MSE
Error        dfE = n − p − 1   SSE              MSE = SSE/dfE
Total        dfT = n − 1       SST

ANOVA is shorthand for analysis of variance. It decomposes the total variation in the response (SST) into separate pieces that correspond to different sources of variation, like SST = SSR + SSE in the regression setting. Throughout STAT222, we will introduce several other ANOVA tables.

MLR - 42

Testing Some Coefficients Equal to Zero

E.g., for the housing price data, we may want to test if we can eliminate RMS and CON from the model, i.e., H0: βRMS = βCON = 0.

> lm1 = lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing)
> lm3 = lm(Price ~ FLR+BDR+GAR+LOT+ST+LOC, data=housing)
> anova(lm3, lm1)
Analysis of Variance Table

Model 1: Price ~ FLR + BDR + GAR + LOT + ST + LOC
Model 2: Price ~ FLR + RMS + BDR + GAR + LOT + ST + CON + LOC
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     19 472.03
2     17 274.84  2    197.19 6.0985 0.01008 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that SSE is called RSS (residual sum of squares) in R.
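Incidentally, the RSS values in the table above can be pulled out directly: for an lm fit, deviance() returns the RSS.

> deviance(lm3)   # RSS of the reduced model, 472.03
> deviance(lm1)   # RSS of the full model, 274.84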

MLR - 43

Testing Equality of Coefficients (1)

Full model: Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

Example 1: to test H0: β1 = β4, the reduced model is

    Y = β0 + β1X1 + β2X2 + β3X3 + β1X4 + ε
      = β0 + β1(X1 + X4) + β2X2 + β3X3 + ε

1. Make a new variable W = X1 + X4.
2. Fit the reduced model by regressing Y on W, X2, and X3.
3. Find SSEreduced and dfreduced − dffull = 1.
4. This can be done in R as follows:

> lmfull = lm(Y ~ X1 + X2 + X3 + X4)
> lmreduced = lm(Y ~ I(X1 + X4) + X2 + X3)
> anova(lmreduced, lmfull)

The line lmreduced = lm(Y ~ I(X1 + X4) + X2 + X3) is equivalent to

> W = X1 + X4
> lmreduced = lm(Y ~ W + X2 + X3)

MLR - 44

Testing Equality of Coefficients (2)

Example 2: to test H0: β1 = β2 = β3, the reduced model is

    Y = β0 + β1X1 + β1X2 + β1X3 + β4X4 + ε
      = β0 + β1(X1 + X2 + X3) + β4X4 + ε

1. Make a new variable W = X1 + X2 + X3

2. Fit the reduced model by regressing Y on W and X4

3. Find SSEreduced and dfreduced − dffull = 2.
4. This can be done in R as follows:

> lmfull = lm(Y ~ X1 + X2 + X3 + X4)
> lmreduced = lm(Y ~ I(X1 + X2 + X3) + X4)
> anova(lmreduced, lmfull)

MLR - 45

Testing Equality of Coefficients (3)

Example 3: to test H0: β1 = β2 and β3 = β4, the reduced model is

    Y = β0 + β1X1 + β1X2 + β3X3 + β3X4 + ε
      = β0 + β1(X1 + X2) + β3(X3 + X4) + ε

1. Make new variables W1 = X1 + X2, W2 = X3 + X4

2. Fit the reduced model by regressing Y on W1 and W2

3. Find SSEreduced and dfreduced − dffull = 2.
4. This can be done in R as follows:

> lmfull = lm(Y ~ X1 + X2 + X3 + X4)
> lmreduced = lm(Y ~ I(X1 + X2) + I(X3 + X4))
> anova(lmreduced, lmfull)

MLR - 46

Testing Coefficients under Constraints (1)

Again, say the full model is

Full model: Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

Example 4: If H0: β2 = β3 + β4, then the reduced model is

    Y = β0 + β1X1 + (β3 + β4)X2 + β3X3 + β4X4 + ε
      = β0 + β1X1 + β3(X2 + X3) + β4(X2 + X4) + ε

1. Make new variables W1 = X2 + X3, W2 = X2 + X4

2. Fit the reduced model by regressing Y on X1, W1 and W2

3. Find SSEreduced and dfreduced − dffull = 1.
4. This can be done in R as follows:

> lmfull = lm(Y ~ X1 + X2 + X3 + X4)
> lmreduced = lm(Y ~ X1 + I(X2 + X3) + I(X2 + X4))
> anova(lmreduced, lmfull)

MLR - 47

Testing Coefficients under Constraints (2)

Example 5: If we think β2 = 2β1, then the reduced model is

    Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
      = β0 + β1X1 + 2β1X2 + β3X3 + β4X4 + ε
      = β0 + β1(X1 + 2X2) + β3X3 + β4X4 + ε

1. Make a new variable W = X1 + 2X2

2. Fit the reduced model by regressing Y on W, X3, and X4

3. Find SSEreduced and dfreduced − dffull = 1.
4. This can be done in R as follows:

> lmfull = lm(Y ~ X1 + X2 + X3 + X4)
> lmreduced = lm(Y ~ I(X1 + 2*X2) + X3 + X4)
> anova(lmreduced, lmfull)

MLR - 48

Relationship Between t-Tests and F-Tests (Optional)

The F-test can also test for a single coefficient, and the result is equivalent to the t-test. E.g., if one wants to test a single coefficient β3 = 0 in the model

    Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε,

one way is to do a t-test using the command summary(lm(Y ~ X1 + X2 + X3 + X4)) and read off the t-statistic and P-value for X3. Alternatively, the test can also be viewed as a model comparison between

Full model:    Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
Reduced model: Y = β0 + β1X1 + β2X2 + β4X4 + ε

> anova(lm(Y ~ X1 + X2 + X3 + X4), lm(Y ~ X1 + X2 + X4))

One can show that the F-statistic = (t-statistic)² and the P-values are the same, and thus the two tests are equivalent. The proof involves complicated matrix algebra and is hence omitted.

MLR - 49

Consider again the model

    Price = β0 + β1 FLR + β2 RMS + β3 BDR + β4 GAR + β5 LOT + β6 ST + β7 CON + β8 LOC + ε

for the housing price data, and suppose we want to test βBDR = 0. From the output on Slide 33, we see the t-statistic for testing βBDR = 0 is −2.551 with P-value 0.020671. If using an F-test,

> lmfull = lm(Price ~ FLR+RMS+BDR+GAR+LOT+ST+CON+LOC, data=housing)
> lmreduced = lm(Price ~ FLR+RMS+GAR+LOT+ST+CON+LOC, data=housing)
> anova(lmreduced, lmfull)
Analysis of Variance Table

  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     18 380.04
2     17 274.84  1     105.2 6.5072 0.02067 *

we see t² = (−2.551)² = 6.507601 ≈ 6.5072 = F (the subtle difference is due to rounding), and the P-value is also 0.02067.

MLR - 50