Day 10: OLS Assumptions: Autocorrelation and Heteroscedasticity (Outliers Too)
Daniel J. Mallinson
School of Public Affairs, Penn State Harrisburg
[email protected]
PADM 576

Road Map

- Understand the assumptions of OLS
- Identify assumption violations
- Correct for assumption violations

Problems with OLS

Problem                  Biased β   Biased SE   Biased t & F   High Var
Non-linearity            Yes        Yes         Yes            -
Heteroscedasticity       No         Yes         Yes            Yes
Autocorrelation          No         Yes         Yes            Yes
Non-normal errors        No         No          No             Yes
Multicollinearity        No         No          No             Yes
Omitted relevant X       Yes        Yes         Yes            -
Irrelevant X             No         No          No             Yes
Measurement error in X   Yes        Yes         Yes            -

Table: Jenkins-Smith et al., Figure 15.1.1

Residual Analysis

We can learn a lot about the appropriateness of our empirical (and theoretical) model by looking at the residuals.

[Figure: residual plots. Source: Statwing]

Assumption 4: Homoscedasticity

OLS assumes that the variance of the errors is constant across the range of predicted Y values (this is important for the t-statistics).

Heteroscedasticity

The residuals change systematically across the range of predicted values (non-constant variance).

[Figure. Source: Statwing]

Assumption 4: Heteroscedasticity

Example: revenue is more variable on hot days than on cold days.

[Figure. Source: Statwing]

Detecting Heteroscedasticity

library(ggplot2)

ds.small$fit.r <- ols1$residuals
ds.small$fit.p <- ols1$fitted.values

ggplot(ds.small, aes(fit.p, fit.r)) +
  geom_jitter(shape = 1) +
  geom_hline(yintercept = 0, color = "red") +
  ylab("Residuals") +
  xlab("Fitted")

Detecting Heteroscedasticity

Breusch-Pagan is the formal test; car's ncvTest() implements this score test against the fitted values. The null hypothesis is that the variance is constant (homoscedastic).

library(car)
ncvTest(ols1)

## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 12.70938    Df = 1    p = 0.0003638269

Correcting Heteroscedasticity

1. Add the omitted variable
2. Transform the data
3. Use a heteroscedasticity-consistent covariance matrix (robust standard errors)

Robust Standard Errors in R

## Calculate robust standard errors
library(car)
library(magrittr)  # provides the %>% pipe

hccm(ols1) %>% diag() %>% sqrt()

## Calculate corrected p-values
robust.se <- function(model) {
  s <- summary(model)
  wse <- sqrt(diag(hccm(model)))  # use the model argument, not a global object
  t <- model$coefficients / wse
  p <- 2 * pnorm(-abs(t))
  results <- cbind(model$coefficients, wse, t, p)
  dimnames(results) <- dimnames(s$coefficients)
  results
}

Assumption 5: Independence of Errors

Autocorrelation occurs when the errors are not independent of one another; there is structure in the errors. It arises most often in over-time (time-series) data.

Detecting Autocorrelation

We use the Durbin-Watson test. The null hypothesis is that there is no autocorrelation.

library(lmtest)
dwtest(ols1)

## Durbin-Watson test
##
## data:  ols1
## DW = 1.9008, p-value = 0.1441

Correcting Autocorrelation

1. Add a lagged dependent variable (Yt-1); see the sketch below
2. Estimate a time series model (e.g., ARIMA)
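A minimal sketch of correction 1, assuming ds.small is a single time series already sorted in time order; the variable names y and x are hypothetical, not from the slides:

## Create the lag by shifting Y down one period (the first value becomes NA)
ds.small$y.lag <- c(NA, head(ds.small$y, -1))

## Refit the model with the lagged DV as an additional predictor
ols.lag <- lm(y ~ y.lag + x, data = ds.small)
summary(ols.lag)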
Another "Problem": Outliers

What's the problem? Wrong βs and wrong standard errors.

Outlier Residuals

[Figure: residual plot showing outliers]

Terminology

Leverage: the degree of potential influence an observation can have on the coefficients.
Discrepancy: the extent to which an observation is "different" from the rest of the data.
Influence: how much effect a particular observation's values on Y and the Xs have on the coefficient estimates.

Influence = Leverage × Discrepancy

Detecting Outliers

DFBETA(S): Positive values correspond with observations that increase the estimate of β̂k; negative values correspond with observations that decrease it.

Cook's D: The equation accounts for both leverage and discrepancy, so high values on both are necessary to yield high influence.

COVRATIO: A standardized indicator of the extent to which an observation influences the precision of the estimated coefficients (the standard errors). COVRATIO > 1 increases precision (decreases the SEs); COVRATIO < 1 decreases precision (increases the SEs). Values far from 1 flag influential observations.

(A numeric sketch of these diagnostics appears at the end of this section.)

Plots in R: "Bubble" Plot

influencePlot(model, id.n = X, labels = labels,
              id.cex = 0.8, id.col = "red")

Plots in R: DFBETAS

dfbetasPlots(model, id.n = X, labels = labels,
             id.cex = 0.8, id.col = "red", pch = 19)

Plots in R: COVRATIO

FitCOVRATIO <- covratio(model)
plot(FitCOVRATIO ~ X, pch = 19,
     xlab = "X", ylab = "Value of COVRATIO")
abline(h = 1, lty = 2)

Correcting Outliers

1. Think!
2. If logical, leave it in
3. If a reasonable (random) event, take it out
4. If an error, fix it or exclude it
5. Theory is the best guide
6. Re-run the model and footnote your approach
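A minimal sketch that computes the diagnostics above numerically, assuming ols1 is the fitted model from earlier in the slides; the cutoffs shown are common rules of thumb, not from the slides:

n  <- nobs(ols1)
d  <- cooks.distance(ols1)  # Cook's D: leverage and discrepancy combined
db <- dfbetas(ols1)         # scaled DFBETAS, one column per coefficient
cr <- covratio(ols1)        # effect of each case on estimate precision

which(d > 4 / n)                                  # common Cook's D cutoff
which(apply(abs(db), 1, max) > 2 / sqrt(n))       # common DFBETAS cutoff
which(abs(cr - 1) > 3 * length(coef(ols1)) / n)   # common COVRATIO cutoff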