
Day 10: OLS Assumptions (and Outliers Too)

Daniel J. Mallinson

School of Public Affairs, Penn State Harrisburg, [email protected]

PADM 576

Road map

Understand the assumptions of OLS

Identify assumption violations

Correct for assumption violations

Problems with OLS

Problem              Biased β?  Biased SE?  Biased t & F?  High Variance?
Non-linear           Yes        Yes         Yes            –
Heteroscedasticity   No         Yes         Yes            Yes
Autocorrelation      No         Yes         Yes            Yes
Non-normal error     No         No          No             Yes
Multicollinearity    No         No          No             Yes
Omit Relevant X      Yes        Yes         Yes            –
Irrelevant X         No         No          No             Yes
X measurement error  Yes        Yes         Yes            –

Table: Jenkins-Smith et al., Figure 15.1.1

Residual Analysis

We can learn a lot about the appropriateness of our empirical (and theoretical) model by looking at residuals

Figure: Source: Statwing

Assumption 4: Homoscedasticity

OLS assumes that error variance is constant among the predicted Ys across the range of X (important for the t-statistic)

Heteroscedasticity

Residuals change systematically across the range of predicted values (non-constant variance)

Figure: Source: Statwing
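
A minimal sketch of this pattern, using simulated data rather than the course dataset: error spread that grows with X produces the fan shape in the figure.

## Simulate an error whose standard deviation increases with x
set.seed(42)
x <- runif(500, 0, 10)
y <- 2 + 3 * x + rnorm(500, sd = 0.5 * x)
sim.ols <- lm(y ~ x)
plot(sim.ols$fitted.values, sim.ols$residuals, xlab = "Fitted", ylab = "Residuals")
abline(h = 0, col = "red")  # residuals fan out instead of staying in a band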

Assumption 4: Heteroscedasticity

Example: Revenue is more variable on hot days than on cold days

Figure: Source: Statwing

Detecting Heteroscedasticity

## Plot residuals against fitted values; a fan or funnel shape suggests
## non-constant variance
library(ggplot2)
ds.small$fit.r <- ols1$residuals
ds.small$fit.p <- ols1$fitted.values
ggplot(ds.small, aes(fit.p, fit.r)) +
  geom_jitter(shape = 1) +
  geom_hline(yintercept = 0, color = "red") +
  ylab("Residuals") +
  xlab("Fitted")
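
As a quick base-R alternative not shown in the original slides, the plot method for lm objects produces the same diagnostic:

plot(ols1, which = 1)  # residuals vs. fitted, with a loess smooth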

Mallinson Day 10 February 18, 2020 8 / 22 Detecting Heteroscedasticity

Breusch-Pagan is the formal test. The null hypothesis is that the variance is constant (homoscedastic).

library(car)
ncvTest(ols1)

## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 12.70938, Df = 1, p = 0.0003638269
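
The studentized Breusch-Pagan test from lmtest (loaded later for the Durbin-Watson test) is a common alternative; a sketch:

library(lmtest)
bptest(ols1)  # null: homoscedasticity; a small p-value rejects it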

Correcting Heteroscedasticity

1 Add the omitted variable

2 Transformation (e.g., log the dependent variable; see the sketch after this list)

3 Heteroscedasticity-consistent (robust) standard errors
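
A minimal sketch of option 2, assuming a strictly positive dependent variable y and hypothetical predictors x1 and x2 in ds.small:

## Logging the dependent variable often stabilizes variance
ols.log <- lm(log(y) ~ x1 + x2, data = ds.small)
ncvTest(ols.log)  # re-check constant variance after transforming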

Robust Standard Errors in R

## Calculate robust standard errors
library(car)
library(magrittr)  # provides the %>% pipe
hccm(ols1) %>% diag() %>% sqrt()

## Calculate corrected p-values
library(car)
robust.se <- function(model) {
  s <- summary(model)
  wse <- sqrt(diag(hccm(model)))  # fixed: use the model argument, not ols1
  t <- model$coefficients / wse
  p <- 2 * pnorm(-abs(t))
  results <- cbind(model$coefficients, wse, t, p)
  dimnames(results) <- dimnames(s$coefficients)
  results
}
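
Calling the helper on the fitted model returns the coefficient table with robust standard errors; lmtest's coeftest with the hccm matrix is an equivalent alternative (not shown in the original slides):

robust.se(ols1)

## Equivalent alternative:
library(lmtest)
coeftest(ols1, vcov. = hccm(ols1))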

Assumption 5: Independence of Errors

Autocorrelation occurs when errors are not independent of each other; there is structure in the errors. It most often arises in over-time (time-series) data.
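
A minimal sketch, using simulated data rather than the course dataset, of what AR(1) structure in the errors looks like:

## Each period's error carries over 0.8 of the previous period's error
set.seed(42)
e <- arima.sim(model = list(ar = 0.8), n = 200)
plot(e, type = "l", xlab = "Time", ylab = "Error")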

Detecting Autocorrelation

We use the Durbin-Watson test. The null hypothesis is that there is no autocorrelation.

library(lmtest)
dwtest(ols1)

## Durbin-Watson test
##
## data: ols1
## DW = 1.9008, p-value = 0.1441
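
As a quick visual check not shown in the slides, the autocorrelation function of the residuals should show no spikes beyond the confidence bands:

acf(residuals(ols1), main = "Residual autocorrelation")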

Correcting Autocorrelation

1 Add a lagged dependent variable (Yt−1, the previous period's Y)

2 Estimate a time-series model (ARIMA); a sketch of both options follows below
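
A minimal sketch of both corrections, assuming a hypothetical time-series data frame ts.df with outcome y and predictor x:

## Option 1: include last period's Y as a regressor
ts.df$y.lag <- c(NA, head(ts.df$y, -1))  # shift y down one period
ols.lag <- lm(y ~ y.lag + x, data = ts.df)

## Option 2: model the error process directly with an AR(1) ARIMA
arima(ts.df$y, order = c(1, 0, 0), xreg = ts.df$x)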

Another “Problem”: Outliers

What’s the problem?

Wrong βs and standard errors

Outlier Residuals

Terminology

Leverage Degree of potential influence on the coefficients that an observation can have

Discrepancy Extent to which an observation is “different” from the rest of the data

Influence How much effect a particular observation’s values on Y and the Xs have on the coefficient estimates

Influence = Leverage × Discrepancy
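
In R, leverage and discrepancy have standard base-function measures; a sketch on the fitted model:

lev <- hatvalues(ols1)   # leverage: diagonal of the hat matrix
disc <- rstudent(ols1)   # discrepancy: studentized residuals
plot(lev, disc, xlab = "Leverage (hat values)", ylab = "Studentized residuals")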

Detecting Outliers

DFBETA(S) Positive values correspond with observations that decrease the estimate of β̂k; negative values correspond with observations that increase the estimate of β̂k

Cook’s D Equation accounts for both leverage and discrepancy, so high values on both are necessary to yield high influence

COVRATIO Standardized indicator of the extent to which an observation influences the precision of the estimated coefficients (the standard errors). COVRATIO > 1 increases the precision (decreases SE) and < 1 decreases precision (increases SE). A value far from 1 indicates an influential observation.
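
All three measures come from base R's influence functions; a sketch on the fitted model:

dfbs <- dfbetas(ols1)        # one column of DFBETAS per coefficient
cd <- cooks.distance(ols1)   # combines leverage and discrepancy
cvr <- covratio(ols1)        # effect on the precision of the estimates
head(cbind(cd, cvr))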

Plots in R: “Bubble” Plot

library(car)
## id.n: how many extreme points to label (X and labels are placeholders)
influencePlot(model, id.n=X, labels=labels, id.cex=0.8,
              id.col="red", xlab="Residuals")

Plots in R: DFBETAS

library(car)
dfbetasPlots(model, id.n=X, labels=labels, id.cex=0.8,
             id.col="red", pch=19)

Plots in R: COVRATIO

FitCOVRATIO <- covratio(model)
plot(FitCOVRATIO ~ X, pch=19, xlab="X", ylab="Value of COVRATIO")
abline(h=1, lty=2)  # COVRATIO = 1 means no effect on precision

Correcting Outliers

1 Think!

2 If logical, leave in

3 If reasonable (a random event), take out

4 If error, fix or exclude

5 Theory is the best guide

6 Re-run the model and footnote your approach
