
22S:152 Applied Linear Regression

Chapter 11: Diagnostics: Unusual and Influential Data: Outliers, Leverage, and Influence

• In multiple regression, we often investigate with a scatterplot matrix first, looking for:

  – collinearity (highly correlated x's)

  – marginal pairwise linearity

• These scatterplots can be useful, but they only show pairwise relationships.

• We've seen that there are certain assumptions made when we model the data.

• These assumptions allow us to calculate a test statistic, and to know the distribution of the test statistic under a null hypothesis.

• Besides the specified distributional assumptions, there are other things to check for in regression. After fitting the model, we check for:

  – nonlinearity (with residual plots)

  – non-constant variance

  – non-normality

  – multicollinearity (VIFs, yet to come in Ch. 13)

  – *outliers and influential observations

• In this section, we look for characteristics in the data that can cause problems with our fitted models.


Outliers, Leverage, and Influence

• In general, an outlier is any unusual data point. It helps to be more specific...

• A strong outlier with respect to the independent variable(s) is said to have high leverage.

• Such a point could have a big impact on the fitted model.

• Here is the fitted line for the full data set:

[Figure: scatterplot of repwt vs. weight with the fitted least-squares line; one point in the bottom right lies far to the right of the rest of the data.]

• In the above plot there is a strong outlier in the bottom right.

• This is an outlier with respect to the X-distribution (independent variable). It is not an outlier with respect to the Y-distribution (dependent variable).

• Here is the fitted line after removing the high leverage point:

[Figure: the same repwt vs. weight scatterplot, re-fit without the high-leverage point; the line now tracks the bulk of the data.]

• This one point had a lot of influence on the fitted line. The β̂0 and β̂1 changed greatly in the re-fit.

• The second picture shows a much better fit to the bulk of the data. But one must be careful about removing outliers. In this case it was a value that was reported incorrectly, and removal was justified.

• A point with high leverage doesn't HAVE to have a large impact on the fitted line.

• Recall the cigarette data:

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar with fitted line; the top right data point sits far beyond the rest of the Tar values.]

• Above is the fitted line with the inclusion of the point of high leverage (the top right data point is far outside the distribution of x-values).

• Here is the fitted line after removing the outlier:

[Figure: the Nic vs. Tar scatterplot re-fit without the high-leverage point; the fitted line is nearly unchanged.]

• The before and after in this case are quite similar. This leverage point did NOT have a large influence on the fitted model.

• A point with high leverage has the potential to greatly affect the fitted model.

• The blue point below is an outlier, but not with respect to the independent variable (not high leverage):

[Figure: scatterplot of y vs. x with fitted line; the outlier lies within the range of x but far from the line.]

• Here is the fitted line after removal of that outlier; it did not have a big influence on the fitted line:

[Figure: the same y vs. x plot re-fit without the outlier; the fitted line is nearly unchanged.]
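To see the leverage/influence distinction numerically, here is a hedged sketch with simulated data (every number below is invented for illustration; this is not one of the notes' datasets):

## A high-leverage point is influential only if its y-value breaks the trend.
>set.seed(1)
>x=runif(30,0,3)
>y=1+2*x+rnorm(30,sd=0.5)
>coef(lm(y~x))                  ## baseline fit: intercept near 1, slope near 2
>coef(lm(c(y,17)~c(x,8)))       ## x=8 is high leverage, y=17 is on the trend: little change
>coef(lm(c(y,1)~c(x,8)))        ## same leverage, y far off the trend: slope drops a lot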

What about higher dimensions (more predictors)?

• A point with high leverage will be outside of the 'cloud' of observed x-values.

• With 2 predictors X1 and X2 (no response shown):

[Figure: scatterplot of x2 vs. x1; one point at the top left sits outside the cloud of points.]

• The point at the top left has high leverage because it is outside the cloud of (X1, X2) independent values.

Assessing Leverage: Hat Values

• Beyond graphics, we have a quantity called the hat value which is a measure of leverage.

• Leverage measures "potential" influence.

• In regression, the hat value h_i (or h_ii) is a common measure of leverage.

• A high hat value equates to high leverage.

• What is h_i? How is it calculated?

  – First, why is it called a hat value?

  – In matrix notation for OLS, we have:

    $\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y$

    where $H = X(X'X)^{-1}X'$ is the $n \times n$ matrix called the Hat matrix, so

    $\hat{Y} = HY$


  – Thus, $\hat{Y}_j$ is the jth row of H times Y. But H is symmetric, so the book shows it as:

    $\hat{Y}_j = h_{1j}Y_1 + h_{2j}Y_2 + \cdots + h_{nj}Y_n = \sum_{i=1}^{n} h_{ij}Y_i$

  – The value $h_{ij}$ captures the contribution of observation $Y_i$ to the fitted value of the jth observation, $\hat{Y}_j$ (consider $h_{ij}$ a weight).

  – If $h_{ij}$ is large, $Y_i$ CAN have a substantial impact on the jth fitted value.

  – It can be shown that

    $h_i = h_{ii} = \sum_{j=1}^{n} h_{ij}^2$

    and $h_i$ summarizes the potential influence of $Y_i$ on the fit of ALL other observations.

• The hat value for observation i is the ith element in the diagonal of H (thus, $h_{ii}$).

• How do we determine if $h_i$ is large?

  – $1/n \le h_i \le 1$

  – $\sum_{i=1}^{n} h_i = (k+1)$ (where k is the number of predictors)

  – $\bar{h} = (k+1)/n$

• We can compare $h_i$ to the average. If it is much larger, then the point has high leverage.

• Often we use $2(k+1)/n$ or $3(k+1)/n$ as a guide for determining high leverage.
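To make the algebra concrete, here is a minimal sketch (it assumes the Duncan data from the car package, which these notes use below) that builds H directly and checks the facts above against R's built-in hatvalues():

## Build the hat matrix by hand and verify its properties.
>library(car)                                 ## provides the Duncan data
>X=cbind(1,Duncan$education,Duncan$income)    ## model matrix with intercept
>H=X%*%solve(t(X)%*%X)%*%t(X)                 ## H = X(X'X)^{-1}X'
>lm.out=lm(prestige~education+income,data=Duncan)
>all.equal(diag(H),unname(hatvalues(lm.out))) ## TRUE: h_i is the ith diagonal
>sum(diag(H))                                 ## k+1 = 3
>mean(diag(H))                                ## (k+1)/n = 3/45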

• In SLR, $h_i$ just measures the distance from the mean $\bar{X}$:

  $h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n}(X_j - \bar{X})^2}$

• In multiple regression, $h_i$ measures distance from the centroid (center of the cloud of data points in the X-space).

• The dependent-variable values are not involved in determining leverage. Leverage is a statement about the X-space.
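A quick hedged check of this formula (the x and y below are made-up numbers, not from the notes):

## Verify the SLR leverage formula against hatvalues().
>x=c(2,4,5,7,12)
>y=c(1,3,2,6,9)
>fit=lm(y~x)
>h.manual=1/length(x)+(x-mean(x))^2/sum((x-mean(x))^2)
>all.equal(unname(hatvalues(fit)),h.manual)  ## TRUE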

Example: Duncan data

Response:
  Prestige
    Percent of raters in an opinion study rating occupation as excellent or good in prestige.

Predictors:
  Income
    Percent of males in occupation earning $3500 or more in 1950.
  Education
    Percent of males in occupation in 1950 who were high school graduates.

>attach(Duncan)
>head(Duncan)
           type income education prestige
accountant prof     62        86       82
pilot      prof     72        76       83
architect  prof     75        92       90
author     prof     55        90       76
chemist    prof     64        86       90
minister   prof     21        84       87

>nrow(Duncan)
[1] 45


Two Predictors:

>plot(education,income,pch=16,cex=2)
## The next line produces an interactive option for the
## user to identify points on a plot with the mouse.
>identify(education,income,row.names(Duncan))

[Figure: scatterplot of income vs. education; 'RR.engineer', 'conductor', and 'minister' are labeled at the edges of the point cloud.]

From this bivariate plot of the two predictors, it looks like 'RR.engineer', 'conductor', and 'minister' may have high leverage.

Fit the model, get the hat values:

>lm.out=lm(prestige~education+income)
>hatvalues(lm.out)
         1          2          3          4
0.05092832 0.05732001 0.06963699 0.06489441 ...

Plot the hat values against the indices 1 to 45, and include thresholds for $2(k+1)/n$ and $3(k+1)/n$:

>plot(hatvalues(lm.out),pch=16,cex=2)
>abline(h=2*3/45,lty=2)
>abline(h=3*3/45,lty=2)
>identify(1:45,hatvalues(lm.out),row.names(Duncan))

[Figure: hatvalues(lm.out) vs. index with the two dashed thresholds; 'RR.engineer', 'conductor', and 'minister' lie above them.]

These three points have high leverage (potential to greatly influence the fitted model) using the 2-times and 3-times the average hat value criterion.

To measure influence, we'll look at another statistic, but first... consider studentized residuals.

Studentized Residuals

• In this section, we're looking for outliers in the Y-direction for a given combination of independent variables (x-vector).

• Such an observation would have an unusually large residual, and we call it a regression outlier.

• If you have two predictors, the mean structure is a plane, and you'd be looking for observations that fall exceptionally far from the plane compared to the other observations.

• First, we start with standardized residuals.

• Though we assume the errors in our model have constant variance [$\epsilon_i \sim N(0, \sigma^2)$], the estimated errors, or the sample residuals $e_i$, DO NOT have equal variance...


• Observations with high leverage tend to have smaller residuals. This is an artifact of our model fitting. These points can pull the fitted model close to them, giving them a tendency toward smaller residuals:

  $\mathrm{Var}(e_i) = \sigma^2(1 - h_i)$

• We know $1/n \le h_i \le 1$.

• With this variance, we can form standardized residuals $e_i'$, which all have equal variance:

  $e_i' = \frac{e_i}{\hat{\sigma}\sqrt{1-h_i}} = \frac{e_i}{S_E\sqrt{1-h_i}}$

  but the distribution of $e_i'$ isn't a t because the numerator and denominator are not independent (part of the theory of a t statistic).

• We can get away from the problem by estimating $\sigma^2$ with a sum of squares that does not include the ith residual. We will use subscript $(-i)$ to indicate quantities calculated without case i...

  – Delete the ith observation, re-fit the model based on $n-1$ observations, and get

    $\hat{\sigma}^2_{(-i)} = \frac{RSS_{(-i)}}{n-1-k-1}$  or  $S_{E(-i)} = \sqrt{\frac{RSS_{(-i)}}{n-1-k-1}}$

• This gives us the studentized residual:

  $e_i^* = \frac{e_i}{S_{E(-i)}\sqrt{1-h_i}}$

• Now, the $\hat{\sigma}$ in the denominator, $S_{E(-i)}$, is not correlated with the numerator, and

  $e_i^* \sim t_{n-1-k-1}$
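To see that this is what R computes, here is a sketch using the Duncan model lm.out from above: drop case i, re-fit, and compare to rstudent():

## Externally studentized residual for one case via an actual re-fit.
>i=6                                        ## row 6 is 'minister'
>fit.i=lm(prestige~education+income,data=Duncan[-i,])
>e.star=residuals(lm.out)[i]/
        (summary(fit.i)$sigma*sqrt(1-hatvalues(lm.out)[i]))
>c(manual=unname(e.star),builtin=unname(rstudent(lm.out)[i]))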

• We can now use the t-distribution to judge what is a large studentized residual, or how likely we are to get a studentized residual as far away from 0 as the ones we get.

• Perhaps take a closer look at those with $|e_i^*| > 2$.

• Test for outliers by comparing $e_i^*$ to a t distribution with $n-k-2$ df and applying a Bonferroni correction (multiply the p-values by the number of residuals).

• Really, we only have to be concerned about the largest studentized residuals.

• As a note, if the ith observation has a large residual and we left it in the computation of $S_E$, it may greatly inflate $S_E$, deflating the standardized residual $e_i'$ and making it hard to notice that it was large.

• The standardized residual is also referred to as the internally studentized residual.

• The studentized residual described here is also referred to as the externally studentized residual.
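The Bonferroni step is simple enough to do by hand; a sketch for the Duncan model (n = 45, k = 2), anticipating the outlierTest() output shown below:

## Bonferroni-adjusted p-value for the largest |e_i*|.
>n=45
>k=2
>t.max=max(abs(rstudent(lm.out)))
>p.unadj=2*pt(t.max,df=n-k-2,lower.tail=FALSE)
>min(1,n*p.unadj)                           ## matches outlierTest()'s Bonferroni p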


• Example: Returning to the Duncan model:

>nrow(Duncan)
[1] 45
>lm.out=lm(prestige~education+income)
>sort(rstudent(lm.out))
-2.3970223990 -1.9309187757 -1.7604905300 ...

• Can look at adjusted p-values from $t_{41}$, but there is a built-in function in the car library to help with this...

• Get the adjusted p-value for the largest $|e_i^*|$:

>outlierTest(lm.out,row.names(Duncan))
max|rstudent| = 3.134519, degrees of freedom = 41,
unadjusted p = 0.003177202, Bonferroni p = 0.1429741

Observation: minister

Since p = 0.1429741 is larger than 0.05, we would conclude that this model doesn't have any extreme residuals.

• An outlier may indicate a sample peculiarity or may indicate a data entry error. It may suggest an observation belongs to another 'population'.

• Fitting a model with and without an observation gives you a feel for the sensitivity of the fitted model to the observation.

• If an observation has a big impact on the fitted model and it's not justified to remove it, one option is to report both models (with and without), commenting on the differences.

• An analysis called robust regression will allow you to leave the observation in, while reducing its impact on the fitted model. This method essentially 'weights' the observations differently when computing the least squares estimates.
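As a hedged illustration (the notes don't name a specific method, so the choice of M-estimation via rlm() from the MASS package is an assumption):

## Robust regression: unusual cases are down-weighted, not removed.
>library(MASS)
>rlm.out=rlm(prestige~education+income,data=Duncan)
>coef(rlm.out)
>rlm.out$w[6]                               ## final IWLS weight for 'minister': below 1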

Measuring influence

• We've described a point with high leverage as having the potential to greatly influence the fitted model.

• To greatly influence the fitted model, a point has high leverage and its Y-value is not 'in line' with the general trend of the data (has high leverage and looks 'odd').

• High leverage + large studentized residual = High influence

• To check influence, we can delete an observation, and see how much the fitted regression coefficients change. A large change suggests high influence:

  $\mathrm{Difference} = \hat{\beta}_j - \hat{\beta}_{j(-i)}$

• Fortunately this can be done analytically (and we don't have to re-fit the model n times). We will use subscript $(-i)$ to indicate quantities calculated without case i...

• DFBETAS: effect of $Y_i$ on a single estimated coefficient

  $DFBETAS_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(-i)}}{SE(\hat{\beta}_{j(-i)})}$

  – $|DFBETAS| > 1$ is considered large in a small or medium sized sample

  – $|DFBETAS| > 2n^{-1/2}$ is considered large in a big sample
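A sketch rebuilding one DFBETAS by deletion (for the income coefficient of case 6, 'minister'; note that R's dfbetas() standardizes with $S_{E(-i)}$ times the full-data $\sqrt{(X'X)^{-1}_{jj}}$):

## DFBETAS by hand vs. dfbetas().
>i=6
>fit.i=lm(prestige~education+income,data=Duncan[-i,])
>d=coef(lm.out)["income"]-coef(fit.i)["income"]
>se=summary(fit.i)$sigma*
    sqrt(summary(lm.out)$cov.unscaled["income","income"])
>c(manual=unname(d/se),builtin=dfbetas(lm.out)[i,"income"])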

• DFFITS: effect of the ith case on the fitted value for $Y_i$

  $DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(-i)}}{S_{E(-i)}\sqrt{h_i}} = e_i^* \sqrt{\frac{h_i}{1-h_i}}$

  – $|DFFITS| > 1$ is considered large in a small or medium sized sample

  – $|DFFITS| > 2\sqrt{(k+1)/n}$ is considered large in a big sample

  – Look at DFFITS relative to each other; in other words, look for large values.

• COOKSD: effect on all fitted values. Based on:

  – how far the x-values are from the mean of the x's

  – how far $Y_i$ is from the regression line

  $COOKSD_i = \frac{\sum_j (\hat{Y}_j - \hat{Y}_{j(-i)})^2}{(k+1)S_E^2} = \frac{e_i^2 h_i}{S_E^2 (k+1)(1-h_i)^2} = \frac{e_i'^2}{k+1} \cdot \frac{h_i}{1-h_i}$

• So, COOKSD can be high if $h_i$ is very large (close to 1) and $e_i'^2$ is moderate, or if $e_i'^2$ is very large and $h_i$ is moderate, or if they're both extreme.
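The last form is easy to check in R (rstandard() gives the internally studentized residual $e_i'$):

## Cook's distance from its studentized-residual form.
>h=hatvalues(lm.out)
>D.manual=(rstandard(lm.out)^2/3)*(h/(1-h))  ## k+1 = 3 here
>all.equal(D.manual,cooks.distance(lm.out))  ## TRUE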

• $COOKSD_i > \frac{4}{n-k-1}$ suggests an unusually influential case.

• Look at COOKSD relative to each other.

• Example: Returning to the Duncan model:

>influence.measures(lm.out)
Influence measures of
lm(formula = prestige ~ education + income) :

    dfb.1_   dfb.edct  dfb.incm    dffit cov.r   cook.d    hat inf
1 -2.25e-02  0.035944  6.66e-04 0.070398 1.125 1.69e-03 0.0509
2 -2.54e-02 -0.008118  5.09e-02 0.084067 1.131 2.41e-03 0.0573
3 -9.19e-03  0.005619  6.48e-03 0.019768 1.155 1.33e-04 0.0696
4 -4.72e-05  0.000140 -6.02e-05 0.000187 1.150 1.20e-08 0.0649
5 -6.58e-02  0.086777  1.70e-02 0.192261 1.078 1.24e-02 0.0513
...

• You get a DFBETA for each regression coefficient and each observation.

• We're not really interested in the intercept DFBETA though. That said, a bivariate plot of the other DFBETAS may be useful...

>plot(dfbetas(lm.out)[,c(2,3)],pch=16)
>identify(dfbetas(lm.out)[,2],dfbetas(lm.out)[,3],
          row.names(Duncan))

[Figure: DFBETAS for income vs. DFBETAS for education; 'minister', 'conductor', and 'RR.engineer' are labeled away from the main cluster at the origin.]

• The 'minister' observation may have a large impact on both regression coefficients (using the $|DFBETAS| > 1$ threshold).

• The Cook's distance is perhaps most commonly looked at (effect on all fitted values).

>plot(cooks.distance(lm.out),pch=16,cex=1.5)
>abline(h=4/(45-2-1),lty=2)
>identify(1:45,cooks.distance(lm.out),row.names(Duncan))

[Figure: Cook's distance vs. index (1 to 45) with the $4/(n-k-1)$ cutoff; 'minister' is clearly the largest, followed by 'conductor'.]

• We can bring together leverage, studentized residuals, and Cook's distance in the following "bubble plot":

## Plot Residuals vs. Leverage:
>plot(hatvalues(lm.out),rstudent(lm.out),type="n")
>cook=sqrt(cooks.distance(lm.out))
>points(hatvalues(lm.out),rstudent(lm.out),
        cex=10*cook/max(cook))

## Include diagnostic thresholds:
>abline(h=c(-2,0,2),lty=2)
>abline(v=c(2,3)*3/45,lty=2)

## Overlay names:
>identify(hatvalues(lm.out),rstudent(lm.out),
          row.names(Duncan))

[Figure: bubble plot of rstudent(lm.out) vs. hatvalues(lm.out), point size proportional to Cook's distance, with the dashed thresholds; 'minister', 'reporter', 'conductor', and 'RR.engineer' are labeled.]

• High leverage: conductor, minister, RR.engineer

• Large regression residuals: minister, reporter

• 'minister' has the highest influence on the fitted model.

It turns out, R will do some work for us in our diagnostic plotting...

>par(mfrow=c(2,2))
>plot(lm.out)

[Figure: the four default lm diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

1. Residuals vs. Fitted (checking constant variance)

2. Normal Q-Q plot (checking normality)

3. Scale-Location plot (similar to 1., but uses $\sqrt{|e_i^*|}$ vs. $\hat{Y}_i$)

4. Residuals vs. Leverage (checking influence, plus it gives you a sweet contour plot of COOKSD)

Jointly influential data points

• The previous section considered the impact that individual data points can have on the fitted regression coefficients.

• Their impact can be seen with leave-one-out deletion diagnostics. See...

>influence.measures(lm.out)

• But data points can have a joint influence on the fitted model.

• Graphical methods can help identify such subsets.

• For this purpose, we will use the Partial Regression Plot that we saw earlier (in the multiple regression introduction). Also known as the Added Variable Plot.

Recall an Added Variable Plot for $X_i$. For $X_i$:

1. Get the residuals from the regression of Y on all predictors except $X_i$, and plot them on the Y axis (what's left for $X_i$ to explain).

2. Get the residuals from the regression of $X_i$ on all other predictors, and plot them on the X axis (the non-redundant part of $X_i$ in the model).

3. The plot provides information on adding the $X_i$ variable to the model.

• The 'special' simple linear regression related to this plot (i.e. of Residuals($Y \sim X_{(-i)}$) on Residuals($X_i \sim X_{(-i)}$)) gives the multiple regression coefficient for $X_i$.

• After we fit the regression to this plot, the residuals are the same as the full-model multiple regression residuals.

• As this 'special' plot is used to fit the multiple regression coefficient, the relationship we see should be linear. (A hands-on sketch of this construction follows below.)

• In the Duncan data set, the data point 'minister' had the most influence when considering leave-one-out diagnostics.

MINISTER DATA POINT:

• This data point had a large positive studentized residual... the prestige for the 'minister' was higher than expected for the given income and education (seems reasonable for the occupation though).
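Here is the sketch of the added-variable construction promised above, using the income coefficient of the Duncan model (the slope equality it verifies is the standard Frisch-Waugh result):

## Build the income added-variable plot by hand.
>e.y=residuals(lm(prestige~education,data=Duncan)) ## Y | others
>e.x=residuals(lm(income~education,data=Duncan))   ## income | others
>plot(e.x,e.y,pch=16)
>abline(lm(e.y~e.x))
>coef(lm(e.y~e.x))[2]                              ## equals coef(lm.out)["income"]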


• This data point also had an odd combination of income/education, with a low income for the given amount of education (high leverage, but the combination again seems reasonable).

• This led to 'minister' being highly influential.

• Let's fit the model with and without this data point...

• And let's see if it is jointly influential by considering added variable plots...

• The fitted regression coefficients with the 'minister' data point:

>summary(lm.out)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.06466    4.27194  -1.420    0.163
education    0.54583    0.09825   5.555 1.73e-06 ***
income       0.59873    0.11967   5.003 1.05e-05 ***

• The fitted regression coefficients without the 'minister' data point:

>summary(lm(prestige[-6]~education[-6]+income[-6]))
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -6.62751    3.88753  -1.705   0.0958 .
education[-6]   0.43303    0.09629   4.497 5.56e-05 ***
income[-6]      0.73155    0.11674   6.266 1.81e-07 ***

• The education coefficient is lower without 'minister'. The income coefficient is higher without 'minister'. (These changes are understandable when we look at the added variable plots.)

• Example: Returning to the Duncan model:

>avPlots(lm.out,"education",labels=row.names(Duncan),
         id.method=cooks.distance(lm.out),id.n=2)

[Figure: added-variable plot for education (prestige | others vs. education | others); 'minister' and 'conductor' are labeled.]

• This is the plot used to fit the education multiple regression coefficient.

• Here we see how the multiple regression coefficient for education is higher when 'minister' is included (influential point).

• The 'conductor' data point influences the fitted line in the same direction (removal of both at once may show a large change in the education regression coefficient).

>avPlots(lm.out,"income",labels=row.names(Duncan),
         id.method=cooks.distance(lm.out),id.n=2)

[Figure: added-variable plot for income (prestige | others vs. income | others); 'minister' and 'conductor' are labeled.]

• Here we see that 'minister' and 'conductor' work together to flatten the income multiple regression coefficient. Removal of both at once may show a large change in the income regression coefficient.
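A sketch of that joint-removal check (row numbers are looked up by name rather than assumed):

## Re-fit without both 'minister' and 'conductor' and compare.
>drop2=which(row.names(Duncan)%in%c("minister","conductor"))
>fit.drop2=lm(prestige~education+income,data=Duncan[-drop2,])
>rbind(full=coef(lm.out),without.both=coef(fit.drop2))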

Removing outliers (careful)

• Removal of an outlier should only be done after careful investigation, and perhaps discussion.

• It shouldn't be removed if it's not an error, or if it can't be shown that the data point belongs to a different 'population'.

• Duncan data set: When reporting the analysis, one could report both analyses (with and without 'minister'). Otherwise, if all involved agree that 'minister' is not representative of the population at hand, it could be removed.
