Regression diagnostics
Botond Szabo
Leiden University
Leiden, 30 April 2018
Botond Szabo Diagnostics Error assumptions Unusual observations
Outline
1. Error assumptions: introduction; variance; normality
2. Unusual observations: residuals vs errors; outliers; influential observations
Introduction Errors and residuals
Assumption on the errors: εi ∼ N(0, σ²), i = 1, ..., n. How to check?
Examine the residuals ε̂i. If the error assumption is okay, the ε̂i will look like a sample generated from a normal distribution.
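As a minimal illustration of this idea, here is a hedged Python sketch (simulated data, not the savings example used later): fit a model whose errors really are N(0, σ²) by least squares and look at the residuals.

```python
import numpy as np

# Hedged sketch (Python, not from the slides): simulate data that satisfies
# the model assumptions, fit by least squares, and inspect the residuals.
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)  # errors eps_i ~ N(0, 1)

X = np.column_stack([np.ones(n), x])     # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat                 # residuals eps_hat_i

# With an intercept in the model the residuals sum to (numerically) zero;
# under correct assumptions they resemble a sample from a normal distribution.
print(abs(resid.sum()) < 1e-8)
```

A QQ-plot or histogram of `resid` would then be examined exactly as on the following slides.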
Variance Mean zero and constant variance
Diagnostic plot: fitted values Ŷi versus residuals ε̂i. Illustration: savings data on 50 countries from 1960 to 1970. Linear regression; covariates: per capita disposable income, percentage of population under 15, etc.
Variance R code
> library(faraway)
> data(savings)
> g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
> plot(fitted(g), residuals(g), xlab="Fitted",
+      ylab="Residuals")
> abline(h=0)
Variance Plot
No significant evidence against constant variance.
[Plot: residuals vs fitted values for the savings fit.]
Variance Constant variance: examples
[Four panels: rnorm(50) plotted against 1:50; the spread stays roughly constant across the index.]
Variance Constant variance: strong violation
[Four panels: (1:50) * rnorm(50) plotted against 1:50; the spread grows markedly with the index.]
Variance Constant variance: milder violation
[Four panels: sqrt(1:50) * rnorm(50) plotted against 1:50; the spread grows slowly with the index.]
Variance Nonlinearity
[Four panels: cos((1:50) * pi/25) + rnorm(50) plotted against 1:50; a systematic wave-like pattern in addition to the noise.]
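The four simulated patterns on the preceding slides can be reproduced outside R as well; a hedged Python sketch (variable names are mine, not from the slides):

```python
import numpy as np

# Sketch of the four simulation patterns from the slides: constant variance,
# a strong violation (spread growing like i), a milder violation (spread
# growing like sqrt(i)), and nonlinearity (a cosine signal plus noise).
rng = np.random.default_rng(1)
i = np.arange(1, 51)
constant = rng.normal(size=50)                            # rnorm(50)
strong = i * rng.normal(size=50)                          # (1:50) * rnorm(50)
mild = np.sqrt(i) * rng.normal(size=50)                   # sqrt(1:50) * rnorm(50)
nonlinear = np.cos(i * np.pi / 25) + rng.normal(size=50)  # cos(...) + noise

# Heteroscedasticity shows up as much larger overall spread in the scaled
# series; nonlinearity shows up as a systematic pattern, not extra spread.
print(np.std(strong) > np.std(mild) > np.std(constant))
```

Plotting each series against the index reproduces the panels shown above.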
Variance Predictors versus residuals
Another diagnostic tool: plot the predictors Xij against the residuals ε̂i.
> plot(savings$pop15, residuals(g),
+      xlab="Population under 15",
+      ylab="Residuals")
Variance Plot
[Plot: residuals vs percentage of population under 15 for the savings fit; two groups of countries are visible.]
Variance Variance test
Two groups can be identified in the plot. Test the null hypothesis that the ratio of the two variances is equal to 1. Only the p-value is displayed on this slide.
> var.test(residuals(g)[savings$pop15>35],
+          residuals(g)[savings$pop15<35])$p.value
[1] 0.01357595
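R's var.test is an F-test for the ratio of two normal variances. A hedged Python sketch of the same test, on illustrative simulated data rather than the savings residuals:

```python
import numpy as np
from scipy import stats

# Hedged Python analogue of R's var.test: a two-sided F-test that two
# normal samples have equal variance. Illustrative data, names are mine.
def var_test(x, y):
    """Two-sided F-test for H0: Var(x)/Var(y) = 1."""
    f = np.var(x, ddof=1) / np.var(y, ddof=1)      # ratio of sample variances
    df1, df2 = len(x) - 1, len(y) - 1
    p = 2 * min(stats.f.cdf(f, df1, df2), stats.f.sf(f, df1, df2))
    return f, p

rng = np.random.default_rng(2)
a = rng.normal(0, 1, 30)
b = rng.normal(0, 3, 30)   # three times the standard deviation of a

f, p = var_test(a, b)
print(p < 0.05)            # a large variance ratio gives a small p-value
```
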
Variance Dealing with nonconstant variance
Transforming the responses Yi through a function h into h(Yi) is a possible way to deal with nonconstant variance. Two choices that often work: h(y) = log y and h(y) = √y. General method: the Box-Cox transformation. It works well, but not always. Upon transforming the response, what do the parameters mean?
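As an illustration of the Box-Cox approach, a hedged Python sketch using scipy.stats.boxcox on simulated positive, right-skewed data (not the savings example):

```python
import numpy as np
from scipy import stats

# Sketch of the Box-Cox transformation: estimate the transformation
# parameter lambda by maximum likelihood. Illustrative lognormal data.
rng = np.random.default_rng(3)
y = rng.lognormal(mean=1.0, sigma=0.8, size=200)  # positive, right-skewed

y_t, lam = stats.boxcox(y)  # transformed responses and the fitted lambda

# lambda near 0 corresponds to the log transform, lambda = 0.5 to sqrt;
# for lognormal data the estimate should sit near 0.
print(round(float(lam), 2))
```

One would then refit the regression with `y_t` as the response, keeping in mind the interpretability issue raised above.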
Variance Galapagos tortoise example I
Variance Galapagos tortoise example II
> data(gala)
> gg <- lm(Species ~ Area + Elevation + Scruz + Nearest
+          + Adjacent, gala)
> plot(fitted(gg), residuals(gg), xlab="Fitted",
+      ylab="Residuals")
[Plot: residuals vs fitted values; the spread of the residuals grows with the fitted values.]
Variance Fixing the problem
> gs <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest
+          + Adjacent, gala)
> plot(fitted(gs), residuals(gs), xlab="Fitted",
+      ylab="Residuals")
[Plot: residuals vs fitted values after the square-root transformation; the spread is now roughly constant.]
Normality Checking normality
Assume the constant variance assumption is fine. What about normality? Check with a QQ-plot, a histogram and normality tests based on the residuals.
Normality Savings data example: QQ-plot
> qqnorm(residuals(g)) > qqline(residuals(g))
[Normal Q-Q plot of the residuals: sample quantiles vs theoretical quantiles, points lying close to the line.]
Normality Savings data example: histogram
Usual warning: the histogram is sensitive to bin width and placement.
> hist(residuals(g))
[Histogram of residuals(g): frequency vs residual value, roughly symmetric around 0.]
Normality Savings data example: Shapiro-Wilk test
> shapiro.test(residuals(g))

        Shapiro-Wilk normality test

data:  residuals(g)
W = 0.987, p-value = 0.8524
No evidence against normality found. Usual warning: the test can be unreliable for small sample sizes, while for large sample sizes even mild deviations from normality will be detected; but is the effect so noticeable that we need to care? Use only in conjunction with a QQ-plot.
Residual vs error Leverage
Errors (εi) and residuals (ε̂i) are not the same. Recall that H = X(XᵀX)⁻¹Xᵀ and therefore

ε̂ = Y − Ŷ = (I − H)Y = (I − H)Xβ + (I − H)ε = (I − H)ε.

V(ε̂) = V[(I − H)ε] = (I − H)σ² (assuming independent noise with variance σ²).
Residual vs error Leverage
The hi = Hii are called leverages. Variance of the residuals: V[ε̂i] = σ²(1 − hi). If hi is large, V[ε̂i] is small and the fitted line is forced to stay close to Yi.
Large values of hi are due to extreme values in X. One has Σi hi = p, so on average hi is p/n, and a rule of thumb is to look at leverages larger than 2p/n. A high leverage point is unusual in the predictor space and has the potential of influencing the LS fit.
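The leverage computation can be sketched directly from the definition; a hedged Python example on a simulated design matrix (not the savings data):

```python
import numpy as np

# Sketch: compute leverages h_i = H_ii from the hat matrix
# H = X (X^T X)^{-1} X^T, and check that they sum to p. Simulated design.
rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                  # leverages

print(np.isclose(h.sum(), p))   # sum of the leverages equals p
flagged = h > 2 * p / n         # rule of thumb: inspect points with h_i > 2p/n
```
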
Residual vs error Savings data example
The code below computes the leverages for the savings data example (only part of the output is displayed).
> ginf <- lm.influence(g)
> ginf$hat[1:3]
 Australia    Austria    Belgium 
0.06771343 0.12038393 0.08748248
Residual vs error Leverages and residuals
Residual vs error Leverages: visualisation
Leverages can be visualised through a half-normal plot. Unlike the QQ-plot, we are not looking for a straight-line relationship, but for unusually large values.
> countries <- row.names(savings)
> halfnorm(lm.influence(g)$hat, labs=countries,
+          ylab="Leverages")
Residual vs error Half-normal plot
[Half-normal plot of the leverages against half-normal quantiles: Libya and the United States stand out with the largest leverages.]
Residual vs error Aside: studentised residuals
V[ε̂i] = σ²(1 − hi), so instead of the raw residuals we can use studentised residuals for diagnostics:

ri = ε̂i / (σ̂ √(1 − hi)).

Studentisation corrects only for nonconstant variance among the residuals (assuming that the errors have constant variance). For nonconstant variance among the errors studentisation does not help. Using studentised residuals does not lead to much different conclusions, unless there is unusually high leverage.
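A hedged Python sketch of the studentisation formula above, on simulated data (an analogue of what R's rstandard computes, not its implementation):

```python
import numpy as np

# Sketch: internally studentised residuals r_i = eps_hat_i /
# (sigma_hat * sqrt(1 - h_i)), built from the hat matrix. Simulated data.
rng = np.random.default_rng(5)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
resid = y - H @ y                          # eps_hat = (I - H) Y
sigma_hat = np.sqrt(resid @ resid / (n - p))
r = resid / (sigma_hat * np.sqrt(1 - h))   # studentised residuals
```

Under correct assumptions each r_i has variance roughly 1, which makes different observations directly comparable.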
Residual vs error Studentised residuals: illustration
> stud <- rstandard(g)
> qqnorm(stud)
> qqline(stud)
Residual vs error Plot
[Normal Q-Q plot of the studentised residuals: sample quantiles vs theoretical quantiles, points lying close to the line.]
Outliers Plot: Outlier
Outliers Outlier
An outlier is a point that does not fit the current model. Outliers may badly affect the fit, so finding them is important. Statistic:

Ti = ri ((n − p − 1) / (n − p − ri²))^(1/2).

If the model assumptions are correct, Ti ∼ tn−p−1, and this can be used to construct a hypothesis test that the ith data point is an outlier. Even though we explicitly test only one or two unusual cases, implicitly we are testing all of them and hence need to adjust the level α. Recall the Bonferroni method: test each case at level α/n.
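A hedged Python sketch of this test, using the savings-example dimensions n = 50 and p = 5 (so tn−p−1 has 44 degrees of freedom) but illustrative residual values:

```python
import numpy as np
from scipy import stats

# Sketch of the Bonferroni-corrected outlier test: turn studentised
# residuals r_i into T_i and compare with a t critical value at level
# alpha/n. The residual values below are illustrative, not from the data.
n, p, alpha = 50, 5, 0.05
r = np.array([0.8, -1.1, 2.85, 0.3])              # some studentised residuals

T = r * np.sqrt((n - p - 1) / (n - p - r**2))     # T_i ~ t_{n-p-1} under H0
crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)  # two-sided, level alpha/n

outlier = np.abs(T) > crit
```

With these dimensions `crit` matches the slide's `qt(0.025/50, 44)` value of about 3.526, and even a studentised residual of 2.85 is not flagged.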
Outliers Savings data example
> jack <- rstudent(g)
> jack[which.max(abs(jack))]
  Zambia 
2.853558 
> qt(0.025/50, 44)
[1] -3.525801
Since |2.853558| < 3.525801, Zambia is not declared an outlier at the Bonferroni-corrected level.
Outliers Remarks
Several outliers next to each other might mask each other. If you transform your model, outliers in the original model will not necessarily be outliers in the transformed model, and vice versa. Individual outliers are typically not a big problem in large datasets; clusters of outliers are. Do not remove "outliers" mechanically: use subject-matter knowledge to understand what is going on and why. Always report removal of outliers in your papers.
Outliers Astronomical example
The astronomical data consist of the log surface temperature and the log light intensity of 47 stars in the star cluster CYG OB1 (in the direction of Cygnus).
Outliers Data plot
> data(star)
> plot(star$temp, star$light, xlab="log(Temperature)",
+      ylab="log(Light intensity)")
[Plot: log light intensity vs log temperature for the 47 stars.]
Outliers Least squares fit
> ga <- lm(light ~ temp, star)
> plot(star$temp, star$light, xlab="log(Temperature)",
+      ylab="log(Light intensity)")
> abline(ga)
[Plot: the data with the least squares line superimposed; the giant stars pull the fit away from the main group.]
Outliers Giants excluded
> gaa <- lm(light ~ temp, star, subset=(temp>3.6))
> plot(star$temp, star$light, xlab="log(Temperature)",
+      ylab="log(Light intensity)")
> abline(gaa)
[Plot: with the giants excluded (temp > 3.6), the refitted line follows the main group of stars.]
Influential observations Cook statistic
An influential observation is one whose removal from the dataset causes a large change in the fit. An influential observation may or may not be an outlier, and may or may not have large leverage, but typically it is at least one of these. Cook statistic:

Di = (ri² / p) · hi / (1 − hi).

A half-normal plot can be used to identify influential observations.
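The Cook statistic can be assembled from the leverages and studentised residuals defined earlier; a hedged Python sketch on simulated data (an analogue of R's cooks.distance, not its implementation):

```python
import numpy as np

# Sketch: Cook distances D_i = (r_i^2 / p) * h_i / (1 - h_i), combining
# studentised residuals and leverages. Simulated data, not the savings set.
rng = np.random.default_rng(6)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
resid = y - H @ y
sigma_hat = np.sqrt(resid @ resid / (n - p))
r = resid / (sigma_hat * np.sqrt(1 - h))   # studentised residuals

D = (r**2 / p) * h / (1 - h)               # Cook distances
print(int(D.argmax()))                     # index of the most influential point
```

A half-normal plot of `D`, as on the next slide, highlights the influential observations.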
Influential observations Savings data example
> cook <- cooks.distance(g)
> halfnorm(cook, 3, labs=countries, ylab="Cook's distances")
[Half-normal plot of the Cook distances against half-normal quantiles: Libya stands out with the largest value, followed by Japan and Zambia.]
Influential observations Libya included
Influential observations Libya excluded
We notice in particular that the ddpi parameter estimate changed by about 50%. Libya thus seems to be influential, in accordance with what the Cook statistics told us.
Influential observations Summary
After fitting a model, always perform diagnostics. Try to fix the problems you find, and don't be shy about refitting the model. There is more to diagnostics than could be covered here.