Regression Diagnostics
Botond Szabo, Leiden University, 30 April 2018

Outline
1 Error assumptions: introduction, variance, normality
2 Unusual observations: residual vs error, outliers, influential observations

Errors and residuals
Assumption on the errors: ε_i ~ N(0, σ²), i = 1, ..., n, independent.
How to check? Examine the residuals ε̂_i. If the error assumption is okay, the ε̂_i will look like a sample generated from a normal distribution.

Mean zero and constant variance
Diagnostic plot: fitted values Ŷ_i versus residuals ε̂_i.
Illustration: savings data on 50 countries from 1960 to 1970. Linear regression; covariates: per capita disposable income, percentage of the population under 15, etc.

R code
> library(faraway)
> data(savings)
> g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
> plot(fitted(g), residuals(g), xlab="Fitted",
+      ylab="Residuals")
> abline(h=0)

Plot: no significant evidence against constant variance.
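As a quick illustration of what this diagnostic should look like when the assumptions hold, here is a minimal sketch on simulated data (the variable names and numbers are made up for the example, not part of the savings analysis):

```r
# Sketch: fitted-versus-residuals plot for simulated data that
# satisfies the model assumptions (constant-variance normal errors).
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)  # true linear model
fit <- lm(y ~ x)
plot(fitted(fit), residuals(fit), xlab = "Fitted", ylab = "Residuals")
abline(h = 0)
# The points should scatter evenly around the zero line, with no
# trend in either their location (mean zero) or their spread
# (constant variance).
```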
[Figure: residuals versus fitted values for the savings model; the points scatter evenly around zero.]

Constant variance: examples
[Figure: four panels of rnorm(50) plotted against 1:50 — constant spread, no pattern.]

Constant variance: strong violation
[Figure: four panels of (1:50)*rnorm(50) plotted against 1:50 — the spread grows strongly with the index.]

Constant variance: milder violation
[Figure: four panels of sqrt(1:50)*rnorm(50) plotted against 1:50 — the spread grows mildly with the index.]

Nonlinearity
[Figure: four panels of cos((1:50)*pi/25) + rnorm(50) plotted against 1:50 — a systematic nonlinear trend in the mean.]

Predictors versus residuals
Another diagnostic tool: predictors X_ij versus residuals ε̂_i.

> plot(savings$pop15, residuals(g),
+      xlab="Population under 15",
+      ylab="Residuals")

[Figure: residuals versus population under 15 for the savings data.]

Variance test
Two groups can be identified in the plot. Test the null hypothesis that the ratio of the two group variances equals 1 (only the p-value is displayed):
> var.test(residuals(g)[savings$pop15>35],
+          residuals(g)[savings$pop15<35])$p.value
[1] 0.01357595

Dealing with nonconstant variance
Transforming the responses Y_i through a function h into h(Y_i) is a possible way to deal with nonconstant variance.
Two choices that often work: h(y) = log y and h(y) = √y.
General method: the Box-Cox transformation. It works well, but not always. Also, upon transforming the response, what do the parameters mean?

Galapagos tortoise example
> data(gala)
> gg <- lm(Species ~ Area + Elevation + Scruz + Nearest
+          + Adjacent, gala)
> plot(fitted(gg), residuals(gg), xlab="Fitted",
+      ylab="Residuals")

[Figure: residuals versus fitted values for the gala model; the spread increases with the fitted values.]

Fixing the problem
> gs <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest
+          + Adjacent, gala)
> plot(fitted(gs), residuals(gs), xlab="Fitted",
+      ylab="Residuals")

[Figure: residuals versus fitted values after the square-root transformation; the spread is now roughly constant.]

Checking normality
Assume the constant variance assumption is fine. Normality? QQ-plot, histogram and normality tests based on the residuals.

Savings data example: QQ-plot
> qqnorm(residuals(g))
> qqline(residuals(g))

[Figure: normal QQ-plot of the residuals; the points follow the line closely.]

Savings data example: histogram
Usual warning: the histogram is sensitive to bin width and placement.
> hist(residuals(g))

[Figure: histogram of residuals(g), roughly symmetric around zero.]

Savings data example: Shapiro-Wilk test
> shapiro.test(residuals(g))

        Shapiro-Wilk normality test

data:  residuals(g)
W = 0.987, p-value = 0.8524

No evidence against normality found.
Usual warning: the test can be unreliable for small sample sizes, while for large sample sizes even mild deviations from normality will be detected, but is the effect then noticeable enough to care about? Use it only in conjunction with a QQ-plot.

Leverage
Errors (ε_i) and residuals (ε̂_i) are not the same. Recall that H = X(X^T X)^{-1} X^T, and therefore
ε̂ = Y − Ŷ = (I − H)Y = (I − H)Xβ + (I − H)ε = (I − H)ε,
since (I − H)X = 0.
Var(ε̂) = Var[(I − H)ε] = (I − H)σ² (assuming independent noise with variance σ²).

The diagonal entries h_i = H_ii are called leverages.
Variance of the residuals: Var(ε̂_i) = σ²(1 − h_i).
If h_i is large, Var(ε̂_i) is small and the fitted line is forced to stay close to Y_i.
Large values of h_i are due to extreme values in X.
One has Σ_i h_i = p, so on average h_i is p/n, and a rule of thumb is to look at leverages larger than 2p/n.
A high-leverage point is unusual in the predictor space and has the potential of influencing the LS fit.

Savings data example
The code below computes the leverages for the savings data example (only part of the output is displayed).
> ginf <- lm.influence(g)
> ginf$hat[1:3]
 Australia    Austria    Belgium
0.06771343 0.12038393 0.08748248

Leverages and residuals
[Figure: leverages plotted against residuals.]

Leverages: visualisation
Leverages can be visualised through a half-normal plot.
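The identity Σ_i h_i = p (the trace of H) and the 2p/n rule of thumb can be checked directly in R; a minimal sketch, assuming the savings model g fitted earlier (hatvalues() returns the same leverages as lm.influence()$hat):

```r
# Sketch: leverages for the savings model g fitted above.
h <- hatvalues(g)          # diagonal of the hat matrix H
p <- length(coef(g))       # p = 5 (intercept + 4 covariates)
n <- nrow(savings)         # n = 50
sum(h)                     # equals p, since trace(H) = p
# Rule of thumb: flag observations with leverage above 2p/n.
names(h)[h > 2 * p / n]
```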
Unlike in the QQ-plot, we are not looking for a straight-line relationship but for unusually large values.

> countries <- row.names(savings)
> halfnorm(lm.influence(g)$hat, labs=countries,
+          ylab="Leverages")

[Figure: half-normal plot of the leverages; Libya and the United States stand out.]

Aside: studentised residuals
Since Var(ε̂_i) = σ²(1 − h_i), instead of the raw residuals we can use the studentised residuals for diagnostics:
r_i = ε̂_i / (σ̂ √(1 − h_i)).
Studentisation corrects only for nonconstant variance among the residuals (assuming that the errors have constant variance). For nonconstant variance among the errors, studentisation does not help.
Using studentised residuals does not lead to much different conclusions, unless there are points with unusually high leverage.

Studentised residuals: illustration
> stud <- rstandard(g)
> qqnorm(stud)
> qqline(stud)

[Figure: normal QQ-plot of the studentised residuals.]

Outliers
[Figure: example of a single outlier and its effect on the fitted line.]

An outlier is a point that does not fit the current model. Outliers may badly affect the fit, so finding them is important.
Statistic:
T_i = r_i ((n − p − 1) / (n − p − r_i²))^{1/2}.
If the model assumptions are correct, T_i ∼ t_{n−p−1}, and this can be used to construct a hypothesis test of whether the i-th data point is an outlier.
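In R, T_i is what rstudent() returns; a minimal sketch, assuming the savings model g, checking the formula for T_i against the studentised residuals r_i from rstandard():

```r
# Sketch: the outlier statistic T_i, computed from the studentised
# residuals r_i, matches R's jackknife residuals rstudent().
r <- rstandard(g)                        # r_i
n <- nrow(savings); p <- length(coef(g))
T <- r * sqrt((n - p - 1) / (n - p - r^2))
all.equal(T, rstudent(g))                # should return TRUE
```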
Even though we explicitly test only one or two unusual cases, implicitly we are testing all of them and hence need to adjust the level α. Recall the Bonferroni method: test each case at level α/n.

Savings data example
> jack <- rstudent(g)
> jack[which.max(abs(jack))]
  Zambia
2.853558
> qt(0.025/50, 44)
[1] -3.525801
Since 2.853558 < 3.525801, Zambia is not declared an outlier at the Bonferroni-corrected level.

Remarks
Several outliers next to each other might mask each other.
If you transform your model, outliers in the original model will not necessarily be outliers in the transformed model, and vice versa.
Individual outliers are typically not a big problem in large datasets; clusters of outliers are.
Do not remove "outliers" mechanically: use subject-matter (e.g. astronomical) knowledge to understand what is going on and why.
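The Bonferroni outlier check above can be written out for all n cases at once; a minimal sketch, again assuming the savings model g:

```r
# Sketch: Bonferroni-corrected outlier test for every observation.
jack <- rstudent(g)                          # T_i, i = 1, ..., n
n <- nrow(savings); p <- length(coef(g))
pvals <- 2 * pt(-abs(jack), df = n - p - 1)  # two-sided p-values
# Flag the cases whose p-value survives the Bonferroni correction
# at overall level alpha = 0.05.
names(jack)[pvals < 0.05 / n]
# For the savings data this set is empty: even Zambia, the most
# extreme case, is not declared an outlier.
```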