
Regression diagnostics

Botond Szabo

Leiden University

Leiden, 30 April 2018

Botond Szabo Diagnostics Error assumptions Unusual observations

Outline

1 Error assumptions: Introduction; Normality

2 Unusual observations: Residual vs error; Influential observations


Introduction

Assumption on errors: ε_i ∼ N(0, σ²), i = 1, ..., n. How to check?

Examine the residuals ε̂_i. If the error assumption holds, the ε̂_i will look like a sample generated from the N(0, σ²) distribution.


Variance: Zero mean and constant variance

Diagnostic plot: fitted values Ŷ_i versus residuals ε̂_i. Illustration: savings data on 50 countries from 1960 to 1970. Response: savings rate; covariates: per capita disposable income, percentage of population under 15, etc.


Variance: R code

> library(faraway)
> data(savings)
> g<-lm(sr~pop15+pop75+dpi+ddpi,savings)
> plot(fitted(g),residuals(g),xlab="Fitted",
+ ylab="Residuals")
> abline(h=0)


Variance: Plot

No significant evidence against constant variance.

[Plot: fitted values versus residuals for the savings model.]


Variance: Constant variance: examples

[Four panels: rnorm(50) plotted against 1:50, showing typical patterns under constant variance.]

Variance: Constant variance: strong violation

[Four panels: (1:50) * rnorm(50) plotted against 1:50, showing strongly increasing variance.]

Variance: Constant variance: milder violation

[Four panels: sqrt(1:50) * rnorm(50) plotted against 1:50, showing mildly increasing variance.]

Variance: Nonlinearity

[Four panels: cos((1:50) * pi/25) + rnorm(50) plotted against 1:50, showing a nonlinear (cyclic) mean pattern.]
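The four simulated examples above can be reproduced in base R; a minimal sketch (the seed and the 2x2 layout are my choices, not prescribed by the slides):

```r
# Four simulated "residual" series matching the panels above.
set.seed(1)                                  # seed chosen for reproducibility
const  <- rnorm(50)                          # constant variance
strong <- (1:50) * rnorm(50)                 # strongly increasing variance
mild   <- sqrt(1:50) * rnorm(50)             # mildly increasing variance
nonlin <- cos((1:50) * pi / 25) + rnorm(50)  # nonlinear mean pattern

par(mfrow = c(2, 2))                         # 2x2 panel layout
plot(1:50, const);  plot(1:50, strong)
plot(1:50, mild);   plot(1:50, nonlin)
```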

Variance: Predictors versus residuals

Another diagnostic tool: predictors X_ij versus residuals ε̂_i.
> plot(savings$pop15,residuals(g),
+ xlab="Population under 15",
+ ylab="Residuals")


Variance: Plot

[Plot: population under 15 versus residuals.]


Variance: Variance test

Two groups can be identified in the plot. Test the null hypothesis that the ratio of the two group variances is equal to 1. Only the p-value is displayed on this slide.
> var.test(residuals(g)[savings$pop15>35],
+ residuals(g)[savings$pop15<35])$p.value
[1] 0.01357595


Variance: Dealing with nonconstant variance

Transforming the responses Y_i through a function h into h(Y_i) is a possible way to deal with nonconstant variance. Two choices that often work: h(y) = log y and h(y) = √y. General method: the Box-Cox transformation. It works well, but not always. Upon transforming the response, what do the parameters mean?
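As an illustration of the Box-Cox method, here is a sketch using MASS::boxcox; the trees dataset and the model are my choice of example, not from the slides:

```r
library(MASS)  # boxcox() ships with the recommended MASS package

# Profile log-likelihood over the Box-Cox parameter lambda for a
# simple model on the built-in trees data; the peak of the curve
# suggests a power transformation of the response.
fit <- lm(Volume ~ Height + Girth, data = trees)
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.05))
lambda_hat <- bc$x[which.max(bc$y)]  # lambda maximising the likelihood
lambda_hat  # roughly 1/3 here: volume scales like a length cubed
```

Note that after transforming, the coefficients describe effects on h(Y), not on Y, which is exactly the interpretational issue raised above.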


Variance: Galapagos tortoise example I


Variance: Galapagos tortoise example II

> data(gala)
> gg<-lm(Species~Area+Elevation+Scruz+Nearest
+ +Adjacent,gala)
> plot(fitted(gg),residuals(gg),xlab="Fitted",
+ ylab="Residuals")

[Plot: fitted values versus residuals for the Galapagos model.]


Variance: Fixing the problem

> gs<-lm(sqrt(Species)~Area+Elevation+Scruz+Nearest
+ +Adjacent,gala)
> plot(fitted(gs),residuals(gs),xlab="Fitted",
+ ylab="Residuals")

[Plot: fitted values versus residuals after the square-root transformation of the response.]


Normality: Checking normality

Suppose the constant variance assumption holds. What about normality? Use a QQ-plot and normality tests based on the residuals.


Normality: Savings data example: QQ-plot

> qqnorm(residuals(g))
> qqline(residuals(g))

[Normal Q-Q plot of the residuals: sample quantiles versus theoretical quantiles.]

Normality: Savings data example: histogram

Usual warning: the histogram is sensitive to bin width and placement.
> hist(residuals(g))

[Histogram of residuals(g).]


Normality: Savings data example: Shapiro-Wilk test

> shapiro.test(residuals(g))

        Shapiro-Wilk normality test

data:  residuals(g)
W = 0.987, p-value = 0.8524

No evidence against normality found. Usual warning: the test can be unreliable for small sample sizes, while for large sample sizes even mild deviations from normality will be detected; but is the deviation then large enough that we need to care? Use the test only in conjunction with a QQ-plot.


Residual vs error

Errors (ε_i) and residuals (ε̂_i) are not the same. Recall that H = X(XᵀX)⁻¹Xᵀ and therefore

ε̂ = Y − Ŷ = (I − H)Y = (I − H)Xβ + (I − H)ε = (I − H)ε.

Var(ε̂) = Var[(I − H)ε] = (I − H)σ² (assuming independent errors with variance σ²; here we use that I − H is symmetric and idempotent).


Residual vs error: Leverage

h_i = H_ii are called leverages. Variance of residuals: Var[ε̂_i] = σ²(1 − h_i). If h_i is large, Var[ε̂_i] is small and the fitted line is forced to stay close to Y_i.

Large values of h_i are due to extreme values in X. One has Σ_i h_i = p, so on average h_i is p/n, and a rule of thumb is to look at leverages larger than 2p/n. A high leverage point is unusual in the predictor space and has the potential of influencing the LS fit.
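The identity Σ_i h_i = p and the 2p/n rule of thumb can be checked directly in base R. A sketch using hatvalues() on LifeCycleSavings, the base-R copy of the savings data used on these slides:

```r
# Leverages for the savings regression.
fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
h <- hatvalues(fit)

p <- length(coef(fit))       # p = 5 (intercept + 4 covariates)
n <- nrow(LifeCycleSavings)  # n = 50 countries
sum(h)                       # equals p
names(h[h > 2 * p / n])      # countries flagged by the 2p/n rule
```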


Residual vs error: Savings data example

The code below computes leverages for the savings data example (only part of the output is displayed).
> ginf<-lm.influence(g)
> ginf$hat[1:3]
 Australia    Austria    Belgium
0.06771343 0.12038393 0.08748248


Residual vs error: Leverages and residuals


Residual vs error: Leverages: visualisation

Leverages can be visualised through a half-normal plot. Unlike in the QQ-plot, we are not looking for a straight-line relationship, but for unusually large values.
> countries<-row.names(savings)
> halfnorm(lm.influence(g)$hat,labs=countries,
+ ylab="Leverages")


Residual vs error: Half-normal plot

[Half-normal plot of the leverages; Libya and the United States stand out.]


Residual vs error: Aside: studentised residuals

Var[ε̂_i] = σ²(1 − h_i), so instead of the raw residuals we can use studentised residuals for diagnostics:

r_i = ε̂_i / (σ̂ √(1 − h_i)).

Studentisation corrects only for nonconstant variance among the residuals (assuming that the errors have constant variance); for nonconstant variance among the errors studentisation does not help. Using studentised residuals does not lead to much different conclusions, unless some observations have unusually high leverage.
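As a sanity check (my own, not from the slides), rstandard() computes exactly this quantity; a sketch on LifeCycleSavings, the base-R copy of the savings data:

```r
# Studentised (internally standardised) residuals by hand versus
# R's rstandard(); sigma(fit) is the usual estimate of sigma.
fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
r_manual <- resid(fit) / (sigma(fit) * sqrt(1 - hatvalues(fit)))
all.equal(r_manual, rstandard(fit))  # TRUE
```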


Residual vs error: Studentised residuals: illustration

> stud<-rstandard(g)
> qqnorm(stud)
> qqline(stud)


Residual vs error: Plot

[Normal Q-Q plot of the studentised residuals.]

Outliers: Plot


Outliers: Outlier

An outlier is a point that does not fit the current model. Outliers may badly affect the fit, so finding them is important. Consider

T_i = r_i ((n − p − 1) / (n − p − r_i²))^{1/2}.

If the model assumptions are correct, T_i ∼ t_{n−p−1} and this can be used to construct a hypothesis test that the i-th data point is an outlier. Even though we explicitly test only one or two unusual cases, implicitly we are testing all of them, and hence we need to adjust the level α. Recall the Bonferroni method: test each case at level α/n.
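The T_i above are what rstudent() returns; a sketch (my own check) verifying the identity on LifeCycleSavings, the base-R copy of the savings data, together with the Bonferroni critical value:

```r
# Externally studentised residuals: rstudent() versus the formula
# T_i = r_i * sqrt((n - p - 1) / (n - p - r_i^2)).
fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
n <- nrow(LifeCycleSavings); p <- length(coef(fit))
r <- rstandard(fit)
T_manual <- r * sqrt((n - p - 1) / (n - p - r^2))
all.equal(T_manual, rstudent(fit))  # TRUE
qt(0.025 / n, df = n - p - 1)       # Bonferroni critical value at level 5%
```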


Outliers: Savings data example

> jack<-rstudent(g)
> jack[which.max(abs(jack))]
  Zambia
2.853558
> qt(0.025/(50),44)
[1] -3.525801

Since |2.853558| < 3.525801, Zambia is not declared an outlier at the Bonferroni-corrected 5% level.


Outliers: Remarks

Several outliers next to each other might mask each other. If you transform your model, outliers in the original model will not necessarily be outliers in the transformed model, and vice versa. Individual outliers are typically not a big problem in large datasets; clusters of outliers are. Do not remove "outliers" mechanically: use subject-matter knowledge to understand what is going on and why. Always report the removal of outliers in your papers.


Outliers: Astronomical example

The data are the log surface temperature versus the log light intensity of 47 stars in the star cluster CYG OB1 (in the direction of Cygnus).


Outliers: Data plot

> data(star)
> plot(star$temp,star$light,xlab="log(Temperature)",
+ ylab="log(Light intensity)")

[Scatterplot: log temperature versus log light intensity of the 47 stars.]


Outliers: Fit

> ga<-lm(light~temp,star)
> plot(star$temp,star$light,xlab="log(Temperature)",
+ ylab="log(Light intensity)")
> abline(ga)

[Scatterplot with the fitted least squares line.]

Outliers: Giants excluded

> gaa<-lm(light~temp,star,subset=(temp>3.6))
> plot(star$temp,star$light,xlab="log(Temperature)",
+ ylab="log(Light intensity)")
> abline(gaa)

[Scatterplot with the least squares line refitted without the giant stars (temp ≤ 3.6).]

Influential observations: Cook statistic

An influential observation is one whose removal from the dataset causes a large change in the fit. An influential observation may or may not be an outlier, and may or may not have large leverage, but typically it is at least one of the two. Cook statistic:

D_i = (r_i² / p) h_i / (1 − h_i).

A half-normal plot can be used to identify influential observations.
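This is exactly what cooks.distance() computes; a sketch (my own check) of the formula on LifeCycleSavings, the base-R copy of the savings data:

```r
# Cook's distances by hand versus cooks.distance():
# D_i = (r_i^2 / p) * h_i / (1 - h_i), with r_i = rstandard(fit).
fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
h <- hatvalues(fit); p <- length(coef(fit))
D_manual <- (rstandard(fit)^2 / p) * h / (1 - h)
all.equal(D_manual, cooks.distance(fit))  # TRUE
```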


Influential observations: Savings data example

> cook<-cooks.distance(g)
> halfnorm(cook,3,labs=countries,ylab="Cooks distances")

[Half-normal plot of the Cook's distances; Libya, Japan and Zambia stand out.]

Influential observations: Libya included


Influential observations: Libya excluded

We notice in particular that the ddpi parameter estimate changed by about 50%. Libya seems to be influential, which is in accord with what the Cook statistic told us.
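The comparison can be reproduced in base R; a sketch using LifeCycleSavings, the built-in copy of the savings data (the update() call is my way of expressing the refit):

```r
# Effect of dropping Libya on the savings regression coefficients.
fit  <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
fit2 <- update(fit, subset = rownames(LifeCycleSavings) != "Libya")
coef(fit)["ddpi"]   # about 0.41 with Libya included
coef(fit2)["ddpi"]  # about 0.61 without Libya, a roughly 50% change
```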


Influential observations: Summary

After fitting a model, always perform diagnostics. Try to fix problems, and don't be shy about refitting the model. There is more to diagnostics than covered here.
