Lecture 4:

1 Regression

• Regression is a multivariate analysis, i.e., we are interested in the relationship between several variables.

• For a corporate audience, it is sufficient to show correlation. Causality is of no interest.

• Nevertheless, business people care about outliers and multicollinearity, two issues downplayed in econometrics classes.

2 Scatter Plot

A scatter plot displays one variable on the vertical axis against another on the horizontal axis (one limitation: only two variables can be displayed at a time). Each point represents one observation. Sometimes a line fitted by OLS is attached. For instance, the downward sloping line shown in Figure 1 indicates negative correlation, though we do not know whether it is statistically significant. We do not care whether this correlation is due to causality or to a lurking variable.

Figure 1: Scatter plot of GDP growth (vertical axis) against inflation (horizontal axis).
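For reference, a minimal sketch that produces such a plot in R (assuming the infr and gro vectors introduced in the next section):

plot(infr, gro, xlab = "Inflation", ylab = "GDP Growth")  # one point per observation
abline(lm(gro ~ infr))                                    # attach the OLS fitted line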

3 Outliers I

From the scatter plot, we notice several outliers. For example, there is one observation with an inflation rate less than -5 in the lower left corner. The regression using all observations is

all = data.frame(infr, gro)
m0 = lm(gro ~ infr, data=all)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.32604    0.33757   9.853   <2e-16 ***
infr        -0.06050    0.07094  -0.853    0.394

and the regression excluding that observation (the leave-one-out regression) is

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.57737    0.33914  10.548   <2e-16 ***
infr        -0.11884    0.07163  -1.659   0.0983

So the t value of the inflation rate (infr) changes from -0.853 to -1.659, a big jump.

4 Subsetting

• In Stata, we can use the if statement to run a regression on a subset of the sample.

• In R, we can use the function subset to specify a subsample, say subA, and then run lm(y ~ x, data=subA); see the sketch after this list.

• The logical operators commonly used with subset are:

==      equal to
!=      not equal to
!x      NOT x
x | y   x OR y
x & y   x AND y
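As a minimal sketch of how subset works (assuming the data frame all from Section 3), the leave-one-out regression above can be reproduced by dropping the outlier:

subA = subset(all, infr >= -5)          # keep observations with infr >= -5
summary(lm(gro ~ infr, data = subA))    # same as the leave-one-out regression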

5 Outliers II

It turns out we can duplicate the leave-one-out regression by defining a dummy variable that equals one for that outlier, and running the regression using all observations plus that dummy variable:

d = ifelse(infr < (-5), 1, 0)   # -5 must be inside parentheses!
lm(formula = gro ~ infr + d, data = all)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.57737    0.33914  10.548  < 2e-16 ***
infr        -0.11884    0.07163  -1.659 0.098259 .
d          -13.21955    3.85739  -3.427 0.000705 ***

The coefficient of the dummy variable, -13.21955, is the leave-one-out residual for that outlier observation.
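A quick sketch checking this claim (assuming the data frame all from above): the outlier's actual gro minus its value predicted by the leave-one-out regression should equal the dummy coefficient.

m.loo = lm(gro ~ infr, data = subset(all, infr >= -5))  # leave-one-out fit
out = subset(all, infr < -5)                            # the outlier observation
out$gro - predict(m.loo, newdata = out)                 # should be about -13.21955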

6 Studentized Residuals I

More importantly, the t value of the dummy variable, -3.427, is the studentized residual for that outlier:

\text{Studentized Residual}_i = \frac{\hat{e}_i}{\hat{\sigma}_i} \qquad (1)

where \hat{e}_i is the i-th residual of the regression that uses all observations, and \hat{\sigma}_i is the estimated standard deviation of the i-th residual. In large samples, the studentized residual follows the standard normal distribution. So a studentized residual greater than 1.96 in absolute value may indicate an outlier.

7 Studentized Residuals II

Here |-3.427| > 1.96, so the observation with inflation less than -5 is indeed an outlier, without which the regression result changes significantly. We can obtain all studentized residuals using the R function studres, available in the MASS package, and then display all observations with studentized residuals greater than 1.96 in absolute value:

> library(MASS)
> m0 = lm(gro ~ infr, data=all)
> ehat = studres(m0)
> ehat[which(abs(ehat) > 1.96)]
        8        12        13        14        23        27        32        44        55        96       112
-2.423826  3.283723  2.322132  3.282804  2.557551 -2.495271  2.109686 -3.635595 -2.128574  1.965128 -2.038829
      125       133       139       140       247       248
 3.356643 -2.902490 -2.020119 -2.617281 -3.427072 -2.421953

For more about outliers, see section 9.5 of Wooldridge’s Introductory Econometrics textbook.

8 Matrix Algebra for Studentized Residual

Consider the multiple regression in matrix form

Y = X\beta + e

The residual vector is \hat{e} = Y - X\hat{\beta} = Y - X(X'X)^{-1}X'Y = (I - H)Y, where H \equiv X(X'X)^{-1}X' is called the hat matrix. It follows that the variance-covariance matrix of the residual vector is

E(\hat{e}\hat{e}') = \sigma^2(I - H).

So the variance of the i-th residual is var(\hat{e}_i) = \sigma^2(1 - h_{ii}), where h_{ii}, called the leverage, is the i-th diagonal entry of H. Formally, the i-th studentized residual is

\text{Studentized Residual}_i = \frac{\hat{e}_i}{\sqrt{\sigma^2(1 - h_{ii})}} \qquad (2)

where we estimate \sigma^2 by running the leave-one-out regression without the i-th observation.
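As a check on Equation (2), here is a minimal sketch (assuming the model m0 = lm(gro ~ infr, data=all) from Section 7) that computes the studentized residuals by hand; the result should agree with studres from the MASS package:

h = hatvalues(m0)                            # leverages h_ii
e = resid(m0)                                # residuals e_i
n = length(e); k = length(coef(m0))          # sample size and number of coefficients
s2 = (sum(e^2) - e^2/(1 - h))/(n - k - 1)    # leave-one-out estimates of sigma^2
e/sqrt(s2*(1 - h))                           # Equation (2); matches studres(m0)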

9 Leverage

The vector of fitted values is \hat{Y} = X\hat{\beta} = HY. More explicitly,

\begin{pmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_n \end{pmatrix} =
\begin{pmatrix} h_{11} & \cdots & \\ & \ddots & \\ & \cdots & h_{nn} \end{pmatrix}
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}

So it is evident that

h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i}

Therefore, the leverage measures the change in the fitted value \hat{y}_i when y_i changes by one unit. An observation with high leverage tends to pull the OLS fitted line toward it.
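A minimal numerical check of this interpretation, using simulated data (all names here are hypothetical):

set.seed(1)
x = rnorm(20); y = 1 + 2*x + rnorm(20)  # simulated regression data
m = lm(y ~ x)
y2 = y; y2[5] = y2[5] + 1               # raise y_5 by one unit
m2 = lm(y2 ~ x)
fitted(m2)[5] - fitted(m)[5]            # change in the 5th fitted value...
hatvalues(m)[5]                         # ...equals the leverage h_55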

10 Summary of Outliers

According to Equation (2), an outlier should satisfy both of the following: (1) it has a big \hat{e}_i, which means an unusual value of y, or a big discrepancy from the fitted line; (2) it has a big leverage h_{ii}, which means the value of the regressor is far from the center. In short, an outlier must have an unusual X-value together with an unusual Y-value given its X-value.

11 Least Absolute Deviations Estimation

In the presence of outliers, one option is to report the OLS regression excluding them. Alternatively, we may report the result of Least Absolute Deviations (LAD) estimation, which minimizes the sum of the absolute values of the residuals. Because the residuals are not squared, the effect of outliers is diminished, meaning that the LAD estimate is less sensitive to outliers than the OLS estimate. In fact, LAD estimates the parameters of the conditional median of y given x.

> library(L1pack)
> lad(gro ~ infr, data=all, method = "EM")

Coefficients:
(Intercept)     infr
     3.5915  -0.1487
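Since LAD estimates the conditional median, the same regression can equivalently be run as a median (quantile) regression; a sketch using the quantreg package (an alternative to L1pack, not otherwise used in these notes):

library(quantreg)
rq(gro ~ infr, tau = 0.5, data = all)   # median regression, equivalent to LAD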

See section 9.6 of Wooldridge’s Introductory Econometrics textbook for more details.

12 Frisch-Waugh (FW) Theorem

Consider a multiple regression in the form of a partitioned matrix:

Y = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{e}

Pre-multiplying by the matrix M_1 \equiv I - X_1(X_1'X_1)^{-1}X_1' yields

\hat{\beta}_2 = (X_2'M_1X_2)^{-1}(X_2'M_1Y) = (\hat{r}'\hat{r})^{-1}(\hat{r}'Y)

So the FW theorem states that we can obtain \hat{\beta}_2 in two steps: first, regress X_2 onto X_1 and keep the residual \hat{r} \equiv M_1X_2; second, regress Y onto \hat{r}. It follows that

var(\hat{\beta}_2) = \sigma^2(\hat{r}'\hat{r})^{-1} = \frac{\sigma^2}{SST_{X_2}(1 - R^2_{X_2X_1})}

where SST_{X_2} denotes the total sum of squares (TSS) for X_2, and R^2_{X_2X_1} is the R-squared of regressing X_2 onto X_1.
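A minimal sketch verifying the two-step result on simulated data (all variable names here are hypothetical):

set.seed(1)
n = 100
x1 = rnorm(n); x2 = 0.5*x1 + rnorm(n)   # x2 correlated with x1
y = 1 + 2*x1 + 3*x2 + rnorm(n)
coef(lm(y ~ x1 + x2))["x2"]             # beta2 from the full regression...
r = resid(lm(x2 ~ x1))                  # step 1: residual r = M1 X2
coef(lm(y ~ r))["r"]                    # ...step 2 gives the same beta2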

13 Variance Inflation Factor (VIF)

We obtain an imprecise estimate of \beta_2 (with a big standard error) when

1. \sigma^2 is big, i.e., when there are many omitted factors.

2. SST_{X_2} is small, i.e., when there is little variation in X_2.

3. R^2_{X_2X_1} is close to one, i.e., when X_2 is highly correlated with X_1, an issue called multicollinearity.

In general, we can define the Variance Inflation Factor as

VIF_j = \frac{1}{1 - R_j^2}

where R_j^2 denotes the R-squared of regressing the j-th regressor onto all other regressors. If VIF_j is above 10, we conclude that multicollinearity is a problem for estimating \beta_j. In that case, we may drop some regressors to mitigate the multicollinearity. Doing so produces a more efficient (smaller standard error) but more biased estimate (since we have more omitted variables).

14 An Illustration I

We generate two redundant variables that are highly correlated with the inflation rate. As expected, we obtain big VIFs after including those redundant variables:

r.infr1 = infr + 0.1*rnorm(n)
r.infr2 = infr + 0.1*rnorm(n)
m3 = lm(gro ~ infr + r.infr1 + r.infr2, data=all)
library(car)
> vif(m3)
     infr  r.infr1  r.infr2
 2216.169 1050.749 1186.535
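As a check on the definition in Section 13, a minimal sketch that computes the VIF for infr by hand (assuming the objects above):

r2 = summary(lm(infr ~ r.infr1 + r.infr2))$r.squared  # R_j^2: infr regressed on the others
1/(1 - r2)                                            # should match vif(m3)["infr"]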

15 An Illustration II

By contrast, the VIF is small if an uncorrelated regressor is added:

> r.infr3 = rnorm(n)
> m4 = lm(gro ~ infr + r.infr3, data=all)
> vif(m4)
    infr  r.infr3
1.000043 1.000043

16 Regression with Categorical Information I

We can run groupwise regressions, where each group (subset) is specified by categorical information. For example:

sub2 = subset(all, gro > 0)
sub3 = subset(all, gro <= 0)
summary(lm(gro ~ infr, data=sub2))

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.25127    0.28497  14.918   <2e-16 ***
infr        -0.01957    0.06338  -0.309    0.758

summary(lm(gro ~ infr, data=sub3))

            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.34405    0.53057  -6.303 1.97e-07 ***
infr         0.08438    0.08779   0.961    0.342

17 Regression with Categorical Information II

Of course, we get the same results by using all observations and including a dummy and an interaction term:

d = ifelse(gro <= 0, 1, 0)
i = d*infr
summary(lm(gro ~ infr + d + i, data=all))

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.25127    0.27927  15.223   <2e-16 ***
infr        -0.01957    0.06211  -0.315    0.753
d           -7.59532    0.67155 -11.310   <2e-16 ***
i            0.10395    0.11862   0.876    0.382
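To see the equivalence, note that for the gro <= 0 group the implied intercept is 4.25127 - 7.59532 = -3.34405 and the implied slope is -0.01957 + 0.10395 = 0.08438, exactly the estimates from the sub3 regression in Section 16.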
