Day 10: OLS Assumptions: Autocorrelation and Heteroscedasticity (Outliers Too)
Daniel J. Mallinson
School of Public Affairs Penn State Harrisburg [email protected]
PADM 576
Mallinson Day 10 February 18, 2020
Road map
Understand the assumptions of OLS
Identify assumption violations
Correct for assumption violations
Problems with OLS
Problem               Biased β   Biased SE   Biased t/F   Hi Var
Non-linear            Yes        Yes         Yes          –
Heteroscedasticity    No         Yes         Yes          Yes
Autocorrelation       No         Yes         Yes          Yes
Non-normal error      No         No          No           Yes
Multicollinearity     No         No          No           Yes
Omit Relevant X       Yes        Yes         Yes          –
Irrelevant X          No         No          No           Yes
X measurement error   Yes        Yes         Yes          –
Table: Jenkins-Smith et al., Figure 15.1.1
Residual Analysis
We can learn a lot about the appropriateness of our empirical (and theoretical) model by looking at residuals
Figure: Source: Statwing
Assumption 4: Homoscedasticity
OLS assumes the variance of the errors around the predicted Ys is constant across the range of X (important for the t-statistic)
Heteroscedasticity
The spread of the residuals changes systematically across the range of predicted values (non-constant variance)
Figure: Source: Statwing
Assumption 4: Heteroscedasticity
Example: Revenue is more variable on hot days than on cold days
Figure: Source: Statwing
Detecting Heteroscedasticity
library(ggplot2)
# Store the residuals and fitted values from the fitted model ols1
ds.small$fit.r <- ols1$residuals
ds.small$fit.p <- ols1$fitted.values
# Plot residuals against fitted values; a fan or funnel shape suggests heteroscedasticity
ggplot(ds.small, aes(fit.p, fit.r)) +
  geom_jitter(shape = 1) +
  geom_hline(yintercept = 0, color = "red") +
  ylab("Residuals") +
  xlab("Fitted")
Detecting Heteroscedasticity
The Breusch-Pagan test is the formal test. The null hypothesis is that the variance is constant (homoscedastic).
library(car)
ncvTest(ols1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 12.70938 Df = 1 p = 0.0003638269
With p < 0.001, we reject the null hypothesis of constant variance.
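To make the test less of a black box, here is a minimal base-R sketch of the score statistic as I understand `ncvTest()` to compute it with its default variance formula (fitted values). The data are simulated and the names `sim.ols`, `aux`, `bp.stat`, and `bp.p` are hypothetical, not objects from the course data.

```r
# Simulated data (hypothetical): the error spread grows with x
set.seed(576)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
sim.ols <- lm(y ~ x)

# Scale the squared residuals by their mean, then regress them on the fitted values
e2 <- residuals(sim.ols)^2
u <- e2 / mean(e2)
aux <- lm(u ~ fitted(sim.ols))

# The score statistic is half of the explained sum of squares of that regression
reg.ss <- sum((fitted(aux) - mean(u))^2)
bp.stat <- reg.ss / 2

# Compare to a chi-square distribution with 1 df (one variable in the variance formula)
bp.p <- pchisq(bp.stat, df = 1, lower.tail = FALSE)
c(Chisquare = bp.stat, Df = 1, p = bp.p)
```

A small p-value here says the squared residuals track the fitted values, which is the pattern the residual plot shows visually.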
Correcting Heteroscedasticity
1 Add omitted variable
2 Transformation
3 Heteroscedasticity consistent covariance matrix (robust standard errors)
Robust Standard Errors in R
## Calculate robust standard errors
library(car)
library(magrittr)  # provides the %>% pipe
hccm(ols1) %>% diag() %>% sqrt()
## Calculate correct p-values
library(car)
robust.se <- function(model) {
  s <- summary(model)
  # Robust standard errors for the model passed to the function
  wse <- sqrt(diag(hccm(model)))
  t <- model$coefficients / wse
  # Two-sided p-values from the normal approximation
  p <- 2 * pnorm(-abs(t))
  results <- cbind(model$coefficients, wse, t, p)
  # Reuse the row and column labels from summary()
  dimnames(results) <- dimnames(s$coefficients)
  results
}
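For intuition about what the heteroscedasticity-consistent covariance matrix contains, the following base-R sketch builds the HC3 "sandwich" by hand, which is the type `hccm()` uses by default. The data are simulated and every object name (`sim.ols`, `bread`, `meat`, `hc3`) is a hypothetical label chosen for this illustration.

```r
# Simulated data (hypothetical): the error spread grows with x
set.seed(576)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
sim.ols <- lm(y ~ x)

X <- model.matrix(sim.ols)   # design matrix, including the intercept column
e <- residuals(sim.ols)      # OLS residuals
h <- hatvalues(sim.ols)      # leverage of each observation

# Sandwich estimator: (X'X)^-1 X' diag(e^2 / (1 - h)^2) X (X'X)^-1
bread <- solve(crossprod(X))
meat <- crossprod(X * (e / (1 - h)))   # each row of X scaled by e_i / (1 - h_i)
hc3 <- bread %*% meat %*% bread

# Compare classical and robust standard errors side by side
classical <- sqrt(diag(vcov(sim.ols)))
robust <- sqrt(diag(hc3))
cbind(classical, robust)
```

The coefficients are unchanged; only the standard errors (and therefore the t-statistics and p-values) differ.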
Assumption 5: Independence of Errors
Autocorrelation occurs when errors are not independent of each other; that is, there is structure in the errors. It is most often due to over-time data.
Detecting Autocorrelation
We use the Durbin-Watson test. The null hypothesis is that there is no autocorrelation.
library(lmtest)
dwtest(ols1)
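## Durbin-Watson test
##
## data: ols1
## DW = 1.9008, p-value = 0.1441
DW = 1.9008 is close to 2 and p = 0.1441, so we fail to reject the null hypothesis of no autocorrelation.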
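The DW statistic itself is simple to compute from the residuals. This base-R sketch uses simulated over-time data with AR(1) errors; the series, the value 0.7, and the names `ts.ols`, `dw`, and `rho.hat` are assumptions made for illustration only.

```r
# Simulated over-time data (hypothetical) with positively autocorrelated errors
set.seed(576)
n <- 100
period <- 1:n
err <- as.numeric(arima.sim(list(ar = 0.7), n = n))
y <- 1 + 0.2 * period + err
ts.ols <- lm(y ~ period)

# DW = sum of squared changes in the residuals / sum of squared residuals
e <- residuals(ts.ols)
dw <- sum(diff(e)^2) / sum(e^2)

# DW is roughly 2 * (1 - rho), where rho is the lag-1 correlation of the residuals
rho.hat <- cor(e[-1], e[-n])
c(DW = dw, approx = 2 * (1 - rho.hat))
```

Values near 2 suggest no first-order autocorrelation; values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.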
Correcting Autocorrelation
1 Add lagged dependent variable (Y_{t-1})
2 Estimate time series model (ARIMA)
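As a rough sketch of the first option, the lagged dependent variable can be built by shifting Y down one period and adding it as a predictor. The data are simulated, and `sim`, `y.lag`, `static`, `dynamic`, and the helper `dw.stat()` are hypothetical names for this example.

```r
# Simulated over-time data (hypothetical) with AR(1) errors
set.seed(576)
n <- 100
period <- 1:n
err <- as.numeric(arima.sim(list(ar = 0.7), n = n))
sim <- data.frame(period = period, y = 1 + 0.2 * period + err)

# Y at t-1: shift y down one row; the first period has no lag
sim$y.lag <- c(NA, sim$y[-n])

static <- lm(y ~ period, data = sim)
dynamic <- lm(y ~ period + y.lag, data = sim)   # lm() drops the row with the NA lag

# Helper (hypothetical): Durbin-Watson statistic from a fitted model's residuals
dw.stat <- function(m) {
  e <- residuals(m)
  sum(diff(e)^2) / sum(e^2)
}
c(static = dw.stat(static), dynamic = dw.stat(dynamic))
```

The DW statistic should move toward 2 once the lag is included, though DW is known to be biased toward 2 when a lagged dependent variable is in the model, so treat it as a rough check.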
Another “Problem”: Outliers
What’s the problem?
Wrong βs and standard errors
Outlier Residuals
Terminology
Leverage: Degree of potential influence that an observation can have on the coefficients
Discrepancy: Extent to which an observation is “different” from the rest of the data
Influence: How much effect a particular observation’s values on Y and the Xs have on the coefficient estimates
Influence = Leverage × Discrepancy
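The product idea can be checked numerically with Cook's D, which for a linear model equals a discrepancy term times a leverage term. A minimal base-R sketch on simulated data follows; observation 31 is planted as both high-leverage and discrepant, and all names are hypothetical.

```r
# Simulated data (hypothetical): 30 ordinary points plus one planted outlier
set.seed(576)
x <- c(rnorm(30), 6)                        # point 31 has an unusual x (leverage)
y <- c(1 + 2 * x[1:30] + rnorm(30), -5)     # and sits far off the line (discrepancy)
m <- lm(y ~ x)

lev <- hatvalues(m)        # leverage
disc <- rstandard(m)       # discrepancy: internally studentized residual
k <- length(coef(m))       # number of estimated coefficients

# Cook's D = (discrepancy^2 / k) * leverage / (1 - leverage)
cook.manual <- (disc^2 / k) * lev / (1 - lev)

all.equal(unname(cook.manual), unname(cooks.distance(m)))
which.max(cook.manual)
```

An observation needs to be large on both terms to produce a large product, which is why a high-leverage point that sits on the regression line has little influence.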
Detecting Outliers
DFBETA(S): Positive values correspond with observations whose removal decreases the estimate of β̂k; negative values correspond with observations whose removal increases the estimate of β̂k
Cook’s D: The equation accounts for both leverage and discrepancy, so high values on both are necessary to yield high influence
COVRATIO: Standardized indicator of the extent to which an observation influences the precision of the estimated coefficients (the standard errors). An observation with COVRATIO > 1 increases precision (decreases SE), and one with COVRATIO < 1 decreases precision (increases SE). Values far from 1 in either direction indicate an influential observation.
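A short base-R sketch of how these diagnostics might be computed and screened. The data are simulated with a planted outlier (observation 31), and the cutoffs used here, 2/sqrt(n) for DFBETAS and 3k/n for COVRATIO, are common rules of thumb that I am assuming for illustration; they are not thresholds given in this lecture.

```r
# Simulated data (hypothetical): 30 ordinary points plus one planted outlier
set.seed(576)
x <- c(rnorm(30), 6)
y <- c(1 + 2 * x[1:30] + rnorm(30), -5)
m <- lm(y ~ x)
n <- nobs(m)
k <- length(coef(m))

dfb <- dfbetas(m)     # one column per coefficient, one row per observation
cvr <- covratio(m)    # one value per observation

# Rule-of-thumb screens (assumed cutoffs, not from the lecture)
flag.dfb <- which(apply(abs(dfb) > 2 / sqrt(n), 1, any))
flag.cvr <- which(abs(cvr - 1) > 3 * k / n)

# Point 31 pulls the slope down, so its DFBETAS for x is negative
dfb[31, "x"]
flag.dfb
flag.cvr
```

Flagged observations are candidates for a closer look, not automatic deletions.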
Plots in R: “Bubble” Plot
# X is a placeholder for the number of points to label
# car 3.0 and later group these options as id=list(n=X, labels=labels, cex=0.8, col="red")
influencePlot(model, id.n=X, labels=labels, id.cex=0.8, id.col="red", xlab="Residuals")
Plots in R: DFBETAS
dfbetasPlots(model, id.n=X, labels=labels, id.cex=0.8, id.col="red", pch=19)
Plots in R: COVRATIO
# Here X stands in for a predictor variable to plot COVRATIO against
FitCOVRATIO <- covratio(model)
plot(FitCOVRATIO~X, pch=19, xlab="X", ylab="Value of COVRATIO")
abline(h=1, lty=2)
Correcting Outliers
1 Think!
2 If logical, leave in
3 If reasonable (random event), take out
4 If error, fix or exclude
5 Theory is the best guide
6 Re-run the model and footnote approach
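The last step, re-running the model and reporting the difference, might look like the following base-R sketch. The data are simulated with one planted outlier, and the choice to drop the single observation with the largest Cook's D is an assumption for illustration; in practice the decision should follow the reasoning in the list above.

```r
# Simulated data (hypothetical): 30 ordinary points plus one planted outlier
set.seed(576)
d <- data.frame(x = c(rnorm(30), 6))
d$y <- c(1 + 2 * d$x[1:30] + rnorm(30), -5)

full <- lm(y ~ x, data = d)

# Identify the most influential observation by Cook's D (illustrative choice)
flagged <- unname(which.max(cooks.distance(full)))

# Re-run the model without it and compare the coefficients side by side
trimmed <- lm(y ~ x, data = d[-flagged, ])
cbind(full = coef(full), trimmed = coef(trimmed))
```

Reporting both sets of estimates, with a footnote on which observation was excluded and why, lets readers judge how much the conclusion depends on it.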