Regression Diagnostics

Regression III: Unusual and Influential Data

Dave Armstrong
University of Wisconsin – Milwaukee
Department of Political Science
e: [email protected]
w: www.quantoid.net/ICPSR.php

Today's lecture deals specifically with unusual data and how they are identified and measured:
• Regression outliers
• Studentized residuals (and the Bonferroni adjustment)
• Hat values
• Influence: DFBETAs, Cook's D, influence plots, added-variable plots (partial regression plots)
• Robust and resistant regression methods that limit the effect of such cases on the regression estimates will be discussed later in the course


Outlying Observations: Who Cares?

• Unusual cases can cause us to misinterpret patterns in plots
  • Temporarily removing them can sometimes help us see patterns that we otherwise would not
  • Transformations can also spread out clustered observations and bring in the outliers
• More importantly, separated points can have a strong influence on statistical models: removing outliers from a regression model can sometimes give completely different results
• Unusual cases can substantially influence the fit of the OLS model; cases that are both outliers and high leverage exert influence on both the slopes and intercept of the model
• Outliers may also indicate that our model fails to capture important characteristics of the data

Example 1. Influence and Small Samples: Inequality Data

• Small samples are especially vulnerable to outliers - there are fewer cases to counter the outliers
• With the Czech Republic and Slovakia included, there is no relationship between attitudes towards inequality (secpay) and the Gini coefficient
• If these cases are removed, we see a positive relationship

[Figure "Non-Democracies": scatterplot of secpay against gini, with a solid regression line for all observations and a dashed line with the Czech Republic and Slovakia excluded]

Code for Previous Figure

> weakliem2 <- read.csv("http://www.quantoid.net/702/Weakliem2.txt")
> rownames(weakliem2) <- weakliem2[, 1]
> outs <- which(rownames(weakliem2) %in% c("CzechRepublic", "Slovakia"))
> weakliem2$gdp <- weakliem2$gdp/10000
> plot(secpay ~ gini, data = weakliem2, main = "Non-Democracies")
> abline(lm(secpay ~ gini, data = weakliem2))
> abline(lm(secpay ~ gini, data = weakliem2, subset = -outs), lty = 2, col = "red")
> with(weakliem2, text(gini[outs], secpay[outs],
+   rownames(weakliem2)[outs], pos = 4))
> legend("topright", c("All Obs", "No Outliers"),
+   lty = c(1, 2), col = c("black", "red"), inset = .01)

Ex 1. Influence and Small Samples: Inequality Data (2)

                All Obs     No Outliers
(Intercept)     19.4771      5.9234
               (11.0655)    (5.2569)
gini             0.0759      0.4995*
                (0.2884)    (0.1336)
N               26          24
R2               0.0029      0.3887
adj. R2         -0.0387      0.3609
Resid. sd       14.8520      6.2702

Standard errors in parentheses
* indicates significance at p < 0.05
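The table above can be reproduced directly. A minimal sketch (the object names mod.all and mod.out are my own), reusing the objects created in the plotting code:

# Refit the model with and without the two outlying cases and
# compare the coefficient estimates and standard errors.
mod.all <- lm(secpay ~ gini, data = weakliem2)
mod.out <- lm(secpay ~ gini, data = weakliem2, subset = -outs)
summary(mod.all)$coefficients
summary(mod.out)$coefficients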


Example 2. Influence and Small Samples: Davis Data (1)

• These data are the Davis data in the car package
• It is clear that observation 12 is influential
• The model including observation 12 does a poor job of representing the trend in the data; the model excluding observation 12 does much better
• The output on the next slide confirms this

[Figure: scatterplot of weight against height in the Davis data, with a solid regression line for all cases and a dashed line with observation 12 excluded]

R-script for previous slide

> library(car)
> data(Davis)
> plot(weight ~ height, data = Davis)
> with(Davis, text(height[12], weight[12], "12", pos = 1))
> abline(lm(weight ~ height, data = Davis),
+   lty = 1, col = 1, lwd = 2)
> abline(lm(weight ~ height, data = Davis, subset = -12),
+   lty = 2, col = 2, lwd = 2)
> legend("topright", lty = c(1, 2), col = c(1, 2),
+   legend = c("All Cases", "Outlier Excluded"), inset = .01)
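A minimal sketch of the comparison summarized on the next slide (model names are my own): fit with all cases, then excluding observation 12.

library(car)
data(Davis)
davis.all  <- lm(weight ~ height, data = Davis)
davis.no12 <- lm(weight ~ height, data = Davis, subset = -12)
summary(davis.all)$coefficients   # slope badly attenuated
summary(davis.no12)$coefficients  # much better fit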

Example 2. Influence and Small Samples: Davis Data (2)

                All Obs    No Outliers
(Intercept)     25.27      -130.75*
               (14.95)      (11.56)
height           0.24*        1.15*
                (0.09)       (0.07)
N              200          199
R2               0.04         0.59
adj. R2          0.03         0.59
Resid. sd       14.86         8.52

Standard errors in parentheses
* indicates significance at p < 0.05

Example 3. Large Datasets: Contrived Data

• Although regression models from small datasets are most vulnerable to unusual observations, large datasets are not completely immune
• An unusually high (or low) x or y value could easily result from miscoding during the data entry stage. This could in turn influence the findings
• Imagine a dataset with 1001 observations, where a variable, X1, ranges from 0.88 to 7.5
• Assume also that Y is perfectly correlated with X1
• Even if there is just one miscode - e.g., a "55" is wrongly entered instead of "5" - the distribution of X1 is drastically misrepresented. This one miscode also seriously distorts the regression line

> set.seed(123)
> x <- c(rnorm(1000, mean = 4, sd = 1))
> x1 <- c(x, 55)
> y <- c(x, 5)
> range(x1)
[1]  1.190225 55.000000
> range(y)
[1] 1.190225 7.241040
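A short sketch continuing the code above (object names are mine): the single miscode pulls the fitted line well away from the true relationship.

with.miscode    <- lm(y ~ x1)
without.miscode <- lm(y[-1001] ~ x1[-1001])  # drop the miscoded case
coef(with.miscode)     # distorted slope and intercept
coef(without.miscode)  # essentially intercept 0, slope 1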


Example 3. Large Datasets: Contrived Data (2)

> mod1 <- lm(y ~ x1)
> apsrtable(mod1, model.names = "", Sweave = T)

(Intercept)      2.84*
                (0.06)
x1               0.29*
                (0.01)
N             1001
R2               0.30
adj. R2          0.30
Resid. sd        0.83

Standard errors in parentheses
* indicates significance at p < 0.05

[Figure: scatterplot of y against x1; the single miscoded observation at x1 = 55 pulls the regression line away from the bulk of the data]

Example 4. Large Datasets: Marital Coital Frequency (1)

• Jasso, Guillermina (1985) 'Marital Coital Frequency and the Passage of Time: Estimating the Separate Effects of Spouses' Ages and Marital Duration, Birth and Marriage Cohorts, and Period Influences,' American Sociological Review, 50: 224-241.
• Using panel data, estimates age and period effects - controlling for cohort effects - on frequency of sexual relations for married couples from 1970-75
• Major Findings:
  • Controlling for cohort and age effects, there was a negative period effect
  • Controlling for period and cohort effects, wife's age had a positive effect
• Both findings differ significantly from previous research in the area

Example 4. Large Datasets: Marital Coital Frequency (2)

[Reproduced page from Kahn and Udry (1986, p. 735). The recoverable argument: Jasso's fixed-effect age-period-cohort model is computationally intractable because of the large number of couple-specific dummy variables, but with two waves of panel data it can be estimated in first differences,

$$\Delta CF_j = \beta_1 \Delta \ln WAge_j + \beta_2 \Delta \ln HAge_j + \beta_3 \Delta \ln MARDUR_j + \pi \, \Delta PERIOD + \alpha_k \Delta X_{jk} + \Delta \varepsilon_j$$

where the cohort terms and couple-specific dummies drop out because they are constant across the two waves. The log transformation breaks the A = P − C identity only by assuming that the higher-order terms of the Taylor expansion of ln(P − C) are zero; as Heckman and Robb (1985) and Fienberg and Mason (1985) note, no transformation can break the APC identity without restrictions on the values of parameters. Kahn and Udry re-estimated the first-difference model on the same National Fertility Survey sample (continuously married women interviewed in both 1970 and 1975) and, except for a sample size difference of one observation, almost perfectly replicated Jasso's results (columns 1 and 2 of Table 1 below).]

Example 4. Large Datasets: Marital Coital Frequency (3)

• Kahn, J.R. and J.R. Udry (1986) 'Marital Coital Frequency: Unnoticed Outliers and Unspecified Interactions Lead to Erroneous Conclusions,' American Sociological Review, 51: 734-737, critiques and replicates Jasso's research. They claim that Jasso:
  1. Failed to check the data for influential outliers
     • 4 cases were seemingly miscoded 88 (must be missing data - coded 99 - since no other value was higher than 63 and 99.5% were less than 40)
     • 4 additional cases had very large studentized residuals (each was also largely different from the first survey)
  2. Missed an interaction between length of marriage and wife's age
• Dropping the 8 outliers (from a sample of more than 2000) and adding the interaction drastically changes the findings

Table 1. Fixed-Effects Estimates of the Determinants of Marital Coital Frequency

                      (1) Jasso's  (2) Our       (3) Drop 4  (4) Drop Miscodes  (5) Marital     (6) Marital
                      Results      Replication   Miscodes    & 4 Outliers       Duration <= 2   Duration > 2
Period                -0.72***     -0.72***      -0.75***    -0.67***           -3.06**         -0.08
Log Wife's Age        27.61**      27.50**       21.99*      13.56              29.49           -1.62
Log Husband's Age     -6.43        -6.38          1.87        7.87              57.89           -5.23
Log Marital Duration  -1.50***     -1.51**       -1.61***    -1.56***           -1.51*           1.29
Wife Pregnant         -3.71***     -3.70***      -3.71***    -3.74***           -2.88***        -3.95*
Child under 6         -0.56**      -0.55**       -0.73***    -0.68***           -2.91***        -0.55**
Wife Employed          0.37         0.38          0.17        0.23               0.86            0.02
Husband Employed      -1.28**      -1.27**       -1.29**     -1.10**            -4.11***        -0.38
R2                      .0475        .0474         .0568       .0612              .2172           .0411
N                     2062         2063          2059        2055                243            1812

* p < .10. ** p < .05. *** p < .01.


Example 4. Large Datasets: Marital Coital Frequency (4)

• Jasso, Guillermina (1986) 'Is It Outlier Deletion or Is It Sample Truncation? Notes on Science and Sexuality,' American Sociological Review, 51: 738-42.
• Claims that Kahn and Udry's analysis generates a new problem of sample truncation bias
  • The outcome variable has been confined to a specified segment of its range
• She argues that we should not remove data just because they don't conform to our beliefs

Example 4. Large Datasets: Marital Coital Frequency (5)

• She doesn't believe that the 88's are miscodes, claiming that 2 of the complete n = 5981 were coded 98, so 88 is possible
• She claims that having sex 88 times a month - which is only 22 times a week (or about 3 times a day) - is not unrealistic
  • There are large differences in coital frequencies, especially due to cultural/regional differences

Types of Unusual Observations (1)

1. Regression Outliers
• An observation that is unconditionally unusual in either its Y or X value is called a univariate outlier, but it is not necessarily a regression outlier
• A regression outlier is an observation that has an unusual value of the outcome variable Y, conditional on its value of the explanatory variable X
• In other words, for a regression outlier, neither the X nor the Y value is necessarily unusual on its own
• Regression outliers often have large residuals but do not necessarily affect the regression slope coefficient
• Also sometimes referred to as vertical outliers

Types of Unusual Observations (2)

2. Cases with Leverage
• An observation that has an unusual X value - i.e., it is far from the mean of X - has leverage on the regression line
• The further the observation sits from the mean of X (either in a positive or negative direction), the more leverage it has
• High leverage does not necessarily mean that the observation influences the regression coefficients
• It is possible for a case to have high leverage and yet follow straight in line with the pattern of the rest of the data. Such cases are sometimes called "good" leverage points because they help the precision of the coefficient estimates. Remember, $V(B) = \sigma^2_{\varepsilon}(X'X)^{-1}$, so such points increase the variation in X and thus decrease the variance of the coefficient estimates.


Types of Unusual Observations (3)

• Figure (a): Outlier without influence. Although its Y value is unusual given its X value, it has little influence on the regression line because it is in the middle of the X-range
• Figure (b): High leverage because it has a high value of X. However, because its value of Y puts it in line with the general pattern of the data, it has no influence
• Figure (c): The combination of discrepancy (unusual Y value) and leverage (unusual X value) results in strong influence. When this case is deleted, both the slope and intercept change dramatically.

Types of Unusual Observations (4)

3. Influential Observations
• An observation with high leverage that is also a regression outlier will strongly influence the regression line
• In other words, it must have an unusual X-value with an unusual Y-value given its X-value
• In such cases both the intercept and slope are affected, as the line chases the observation

Influence = Leverage × Discrepancy

Assessing Leverage: Hat Values (1)

• The most common measure of leverage is the hat value, $h_i$
• The name hat values results from their calculation based on the fitted values ($\hat{Y}$):

$$\hat{Y}_j = h_{1j}Y_1 + h_{2j}Y_2 + \cdots + h_{nj}Y_n = \sum_{i=1}^{n} h_{ij}Y_i$$

• Recall that the hat matrix, H, projects the Y's onto their predicted values:

$$\hat{y} = Xb = X(X'X)^{-1}X'y = Hy$$
$$H = X(X'X)^{-1}X' \quad (n \times n)$$

Assessing Leverage: Hat Values (2)

• If $h_{ij}$ is large, the ith observation has a substantial impact on the jth fitted value
• Since H is symmetric and idempotent, the diagonal entries represent both the ith row and the ith column:

$$h_i = h_i' h_i = \sum_{j=1}^{n} h_{ij}^2$$

• This means that $h_i = h_{ii}$
• As a result, the hat value $h_i$ measures the potential leverage of $Y_i$ on all the fitted values
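A quick sketch (not in the original slides) confirming the algebra: the diagonal of H computed directly matches R's hatvalues(). It uses the inequality model mod3, which is fitted formally on a later slide.

mod3 <- lm(secpay ~ gini + gdp, data = weakliem2)  # fitted on a later slide
X <- model.matrix(mod3)
H <- X %*% solve(crossprod(X)) %*% t(X)  # H = X(X'X)^{-1}X'
all.equal(unname(diag(H)), unname(hatvalues(mod3)))  # TRUE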


Properties of Hat Values

• The average hat value is $\bar{h} = \frac{k+1}{n}$
• The hat values are bounded between $\frac{1}{n}$ and 1
• In simple regression, hat values measure distance from the mean of X:

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n}(X_j - \bar{X})^2}$$

• In multiple regression, $h_i$ measures the distance from the centroid point of all of the X's (point of means)
• Commonly used cut-off:
  • Hat values exceeding about twice the average hat value should be considered noteworthy
  • With large sample sizes, however, this cut-off is unlikely to identify any observations regardless of whether they deserve attention

Hat Values in Multiple Regression

• A diagram of elliptical contours of hat values for two explanatory variables shows that hat values in multiple regression take into consideration the correlational and variational structure of the X's
• As a result, outliers in multi-dimensional X-space are high leverage observations - i.e., the outcome variable values are irrelevant in calculating $h_i$
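A small sketch with made-up data (the vector xs is mine): in simple regression the formula above reproduces hatvalues() exactly, and the hat values depend only on X.

xs <- c(1, 3, 4, 7, 12)
h.formula <- 1/length(xs) + (xs - mean(xs))^2 / sum((xs - mean(xs))^2)
h.lm <- hatvalues(lm(rnorm(length(xs)) ~ xs))  # any Y gives the same h's
all.equal(unname(h.lm), h.formula)  # TRUE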

Leverage and Hat Values: Inequality Data Revisited (1)

• We start by fitting the model to the complete dataset
• Recall that, looking at the scatterplot of Gini and attitudes, we identified two possible outliers (Czech Republic and Slovakia)
• With these included in the model there was no apparent effect of Gini on attitudes:

(Intercept)      2.83
               (12.78)
gini             0.07
                (0.28)
gdp             17.52*
                (7.99)
N               26
R2               0.18
adj. R2          0.10
Resid. sd       13.80

Standard errors in parentheses
* indicates significance at p < 0.05

Leverage and Hat Values: Inequality Data Revisited (2)

• Several countries have large hat values, suggesting that they have unusual X values
• Notice that several of them (Brazil, Chile, Slovenia) have much higher hat values than the Czech Republic and Slovakia
• These cases have high leverage, but not necessarily high influence

[Figure "Hat Values for Inequality model": index plot of hatvalues(mod3) with the cut-off line drawn; Brazil, Chile, and Slovenia are labelled as exceeding it]


R-Script for Hat Values Plot

> plot(hatvalues(mod3), xlim = c(0, 27),
+   main = "Hat Values for Inequality model")
> cutoff <- 2*3/nrow(weakliem2)
> bighat <- hatvalues(mod3) > cutoff
> abline(h = cutoff, lty = 2)
> tx <- which(bighat)
> text((1:length(bighat))[tx], hatvalues(mod3)[tx],
+   rownames(weakliem2)[tx], pos = 4)

Formal Tests for Outliers: Standardized Residuals

• Unusual observations typically have large residuals, but not necessarily so - high leverage observations can have small residuals because they pull the line towards them:

$$V(E_i) = \sigma^2_{\varepsilon}(1 - h_i)$$

• Standardized residuals provide one possible, though unsatisfactory, way of detecting outliers:

$$E_i' = \frac{E_i}{S_E \sqrt{1 - h_i}}$$

• The numerator and denominator are not independent and thus $E_i'$ does not follow a t-distribution: if $|E_i|$ is large, the standard error is also large:

$$S_E = \sqrt{\frac{\sum E_i^2}{n - k - 1}}$$
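A sketch checking the formula above against R's rstandard(), using mod3 from the inequality data:

e <- residuals(mod3)
s.e <- sqrt(sum(e^2) / df.residual(mod3))  # S_E, with n - k - 1 df
h <- hatvalues(mod3)
all.equal(unname(e / (s.e * sqrt(1 - h))), unname(rstandard(mod3)))  # TRUE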

Studentized Residuals (1)

• If we refit the model deleting the ith observation we obtain an estimate of the standard deviation of the residuals, $S_{E(-i)}$ (the standard error of the regression), that is based on the remaining n − 1 observations
• We then calculate the studentized residuals $E_i^*$, which have an independent numerator and denominator:

$$E_i^* = \frac{E_i}{S_{E(-i)}\sqrt{1 - h_i}}$$

• Studentized residuals follow a t-distribution with n − k − 2 degrees of freedom
• Observations that have a studentized residual outside the ±2 range are considered statistically significant at the 95% level

Studentized Residuals (2)

• An alternative, but equivalent, method of calculating studentized residuals is the so-called 'mean-shift' outlier model:

$$Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \gamma D_i + \varepsilon_i$$

where D is a dummy regressor coded 1 for observation i and 0 otherwise
• We test the null hypothesis that the outlier i does not differ from the rest of the observations, $H_0: \gamma = 0$, by calculating the t-test:

$$t_0 = \frac{\hat{\gamma}}{\widehat{SE}(\hat{\gamma})}$$

• The test statistic is the studentized residual $E_i^*$ and is distributed as $t_{n-k-2}$
• We might employ this method when we have several cases that might be outliers
• This method is most suitable when, after looking at the data, we have determined that a particular case might be an outlier
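A small sketch of this equivalence (not in the original slides), using Slovakia - row 26 of weakliem2, per the later slides - as the suspect case:

# The t-statistic on the dummy equals that case's studentized residual.
D <- as.numeric(seq_len(nrow(weakliem2)) == 26)
mean.shift <- lm(secpay ~ gini + gdp + D, data = weakliem2)
coef(summary(mean.shift))["D", "t value"]
rstudent(mod3)[26]  # the same value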


Studentized Residuals (3): Bonferroni adjustment

• Since we are selecting the furthest outlier, it is not legitimate to use a simple t-test
  • We would expect that 5% of the studentized residuals would be beyond $t_{.025} \approx \pm 2$ by chance alone
• To remedy this we can make a Bonferroni adjustment to the p-value
  • The Bonferroni p-value for the largest outlier is $p = 2np'$, where $p'$ is the unadjusted one-tailed p-value from a t-test with n − k − 2 degrees of freedom
• The outlierTest function in the car package for R gives the Bonferroni p-value for the largest absolute studentized residual

Studentized Residuals (4): An Example of the Outlier Test

• The Bonferroni-adjusted outlier test in car tests the largest absolute studentized residual
• Recalling our inequality model:

> mod3 <- lm(secpay ~ gini + gdp, data = weakliem2)
> outlierTest(mod3)
         rstudent unadjusted p-value Bonferonni p
Slovakia 4.317504         0.00027781     0.007223

• It is now quite clear that Slovakia (observation 26) is a statistically significant outlier, but as of yet we have not assessed whether it influences the regression line

Quantile Comparison Plots (1)

• We can use a quantile comparison plot to compare the distribution of a single variable to the t-distribution, assessing whether the distribution of the variable shows a departure from normality
• Using the same technique, we can compare the distribution of the studentized residuals from our regression model to the t-distribution
• Observations that stray outside of the 95% confidence envelope are statistically significant outliers

> qqPlot(mod3, simulate = T, labels = F)

Quantile Comparison Plot (2): Inequality Data

• Here we can again see that two cases appear to be outliers: these are the Czech Republic and Slovakia

[Figure: quantile comparison plot of the studentized residuals from mod3 against t quantiles, with a 95% confidence envelope; two points fall above the envelope]


Influential Observations: DFBeta (1)

• Recall that an influential observation is one that combines discrepancy with leverage
• The most direct approach to assessing influence is to assess how the regression coefficients change if outliers are omitted from the model
• We can use $D_{ij}$ (often termed DFBeta$_{ij}$) to do so:

$$D_{ij} = B_j - B_{j(-i)} \qquad i = 1, \dots, n; \; j = 1, \dots, k$$

• The $B_j$ are the coefficients for all the data and the $B_{j(-i)}$ are the coefficients for the same model with the ith observation removed
• A standard cut-off for an influential observation is $|D_{ij}| \geq \frac{2}{\sqrt{n}}$

Influential Observations: DFBeta (2)

• We see here that Slovakia makes the gdp coefficient larger and the coefficient for gini smaller
• The Czech Republic also makes the coefficient for gdp larger
• A problem with DFBetas is that each observation has several measures of influence - one for each coefficient - giving n(k + 1) different measures
• Cook's D overcomes this problem by presenting a single summary measure for each observation

[Figure: scatterplot of DFBeta for GDP against DFBeta for GINI, with the Czech Republic and Slovakia standing apart from the other cases]
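A brute-force sketch of $D_{ij} = B_j - B_{j(-i)}$ (not in the original slides): refit the model without each case in turn. R's dfbeta() and dfbetas() compute these more efficiently.

D.mat <- t(sapply(1:nrow(weakliem2), function(i)
  coef(mod3) - coef(update(mod3, subset = -i))))
rownames(D.mat) <- rownames(weakliem2)
D.mat["Slovakia", ]  # large shifts in both slope coefficients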

Identifying DFBetas

> dfb <- as.data.frame(dfbetas(mod3))  # standardized DFBetas; definition implied by the slides
> cutoff <- 2/sqrt(26)
> big <- with(dfb, which(abs(gini) > cutoff |
+   abs(gdp) > cutoff))
> dfb[big, ]
              (Intercept)        gini        gdp
Chile         -0.51676696  0.55836187  0.3132308
CzechRepublic  0.06163614 -0.34805553  0.8471765
Slovakia       1.14014221 -1.43107966  0.5112908
Slovenia       0.17438196  0.08084083 -0.8037418
Taiwan        -0.01827400  0.17003877 -0.4692173

Influential Observations: Cook's D (1)

• Cook's D measures the 'distance' between $B_j$ and $B_{j(-i)}$ by calculating an F-test for the hypothesis that $\beta_j = B_{j(-i)}$, for $j = 0, 1, \dots, k$. An F-statistic is calculated for each observation as follows:

$$D_i = \frac{E_i'^2}{k + 1} \times \frac{h_i}{1 - h_i}$$

where $h_i$ is the hat value for each observation and $E_i'$ is the standardized residual
• The first fraction measures discrepancy; the second fraction measures leverage
• There is no significance test for $D_i$ (i.e., the F-statistic here measures only distance), but a commonly used cut-off is:

$$D_i > \frac{4}{n - k - 1}$$

• The cut-off is useful, but there is no substitute for examining relative discrepancies in plots of Cook's D versus cases, or of $E_i'$ against $h_i$

Cook's D: An Example

• We can see from this plot of Cook's D against the case numbers that Slovakia has an unusually high level of influence on the regression surface
• The Czech Republic and Slovenia also stand out

[Figure: index plot of cooks.distance(mod3) with the cut-off line drawn; Slovakia, the Czech Republic, and Slovenia are labelled]

> mod3.cook <- cooks.distance(mod3)
> plot(cooks.distance(mod3))
> cutoff <- with(mod3, 4/df.residual)
> abline(h = cutoff, lty = 2)
> text(which(mod3.cook > cutoff), mod3.cook[which(mod3.cook > cutoff)],
+   names(mod3.cook[which(mod3.cook > cutoff)]), pos = c(4, 4, 2))

Influence Plot or "Bubble Plot"

• Displays studentized residuals, hat-values and Cook's D on a single plot
• The horizontal axis represents the hat-values; the vertical axis represents the studentized residuals; circles for each observation represent the relative size of the Cook's D
  • The radius is proportional to the square root of Cook's D, and thus the areas are proportional to the Cook's D

[Figure: studentized residuals plotted against hat-values for mod3, with bubble areas proportional to Cook's D; Slovakia stands out clearly]
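The car package can draw this plot in one call; a minimal sketch, assuming mod3 as before:

library(car)
influencePlot(mod3)  # bubble areas proportional to Cook's D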

Joint Influence (1)

• Subsets of cases can jointly influence a regression line, or can offset each other's influence
• Cook's D can help us determine joint influence if there are relatively few influential cases
  • That is, we can delete cases sequentially, updating the model each time and exploring the Cook's D's again
  • This approach is impractical if there are potentially a large number of subsets to explore, however
• Added-variable plots (also called partial-regression plots) provide a more useful method of assessing joint influence
  • These plots essentially show the partial relationships between Y and each X
  • We make one plot for each X

Joint Influence (2)

• The heavy solid line is the regression with all cases included; the broken line is the regression with the asterisk deleted; the light solid line is the regression with both the plus and the asterisk deleted
• Depending on where the jointly influential cases lie, they can have different effects on the regression line
  • (a) and (b) are jointly influential because they change the regression line when included together
  • The observations in (c) offset each other and thus have little effect on the regression line


Added Variable Plots [or Partial Regression Plots] (1)

1. Let $Y_i^{(1)}$ represent the residuals from the least-squares regression of Y on all of the X's except for $X_1$:

$$Y_i = A^{(1)} + B_2^{(1)}X_{i2} + \cdots + B_k^{(1)}X_{ik} + Y_i^{(1)}$$

2. Similarly, $X_i^{(1)}$ are the residuals from the regression of $X_1$ on all the other X's:

$$X_{i1} = C^{(1)} + D_2^{(1)}X_{i2} + \cdots + D_k^{(1)}X_{ik} + X_i^{(1)}$$

3. These two equations determine the residuals $X^{(1)}$ and $Y^{(1)}$ as the parts of $X_1$ and $Y$ that remain when the effects of $X_2, \dots, X_k$ are removed.

Added Variable Plots (2)

• The residuals $Y^{(1)}$ and $X^{(1)}$ have the following properties:
  1. The slope of the regression of $Y^{(1)}$ on $X^{(1)}$ is the least-squares slope $B_1$ from the full multiple regression
  2. The residuals from the regression of $Y^{(1)}$ on $X^{(1)}$ are the same as the residuals from the full regression:
     $$Y_i^{(1)} = B_1 X_i^{(1)} + E_i$$
  3. The variation of $X^{(1)}$ is the conditional variance of $X_1$ holding the other X's constant. Consequently, except for the degrees of freedom, the standard error from the partial simple regression is the same as the multiple-regression standard error of $B_1$:
     $$\widehat{SE}(B_1) = \frac{S_E}{\sqrt{\sum X_i^{(1)2}}}$$
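A sketch verifying property 1 for gini in the inequality model (residual object names are mine): the slope of $Y^{(1)}$ on $X^{(1)}$ equals the full-model coefficient.

sec.res  <- residuals(lm(secpay ~ gdp, data = weakliem2))  # Y^(1)
gini.res <- residuals(lm(gini ~ gdp, data = weakliem2))    # X^(1)
coef(lm(sec.res ~ gini.res))[2]
coef(mod3)["gini"]  # the same value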


● ● 40

40 ●

30 ● 30

• Once again recalling the outlier model from the Inequality data 20 ● ● (1) (1) ● 20 10 • A plot of against allows us to examine the leverage and ● Y X ● ● ● ● ● ● 10 secpay | others secpay ● | others secpay ●

0 ● influence of cases on ● B ● ● 1 ● ● ● ● ●

● 0 ● ● ●● ● 10 ● • we make one plot for each X − ● ● ● ● ● ● ● ● ● ● ● ● ● 10 ● ● ● ● ● − ● ● 20 • These plots also gives us an idea of the precision of our slopes − −10 0 10 20 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 (B1,...,Bk) gini | others gdp | others >avPlots(mod3,"gini") >avPlots(mod3,"gdp")

• We see here that the Czech Republic and Slovakia have unusually high Y values given their X’s • Because they are on the extreme of the X-range as well, they are most likely influencing both slopes


Unusual Observations and their Impact on Standard Errors

• Depending on their location, unusual observations can either increase or decrease standard errors
• Recall that the standard error for a slope is as follows:

$$\widehat{SE}(B) = \frac{S_E}{\sqrt{\sum (X_i - \bar{X})^2}}$$

• An observation with high leverage (i.e., an X-value far from the mean of X) increases the size of the denominator, and thus decreases the standard error
• A regression outlier (i.e., a point with a large residual) that does not have leverage (i.e., it does not have an unusual X-value) does not change the slope coefficients but will increase the standard error

Unusual Cases: Solutions

• Unusual observations may reflect miscoding, in which case the observations can be rectified or deleted entirely
• Outliers are sometimes of substantive interest:
  • If there are only a few cases, we may decide to deal separately with them
  • Several outliers may reflect model misspecification - i.e., an important explanatory variable that accounts for the subset of the data that are outliers has been neglected
• Unless there are strong reasons to remove outliers, we may decide to keep them in the analysis and use alternative models to OLS - for example robust regression, which down-weights outlying data (a minimal sketch follows)
  • Often these models give similar results to an OLS model that omits the influential cases, because they assign very low weight to highly influential cases
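A minimal sketch of the robust-regression alternative mentioned above, using MASS::rlm (an M-estimator that down-weights cases with large residuals); the model call mirrors mod3 and is my own illustration, not from the slides:

library(MASS)
rob <- rlm(secpay ~ gini + gdp, data = weakliem2)
coef(rob)          # compare with coef(mod3)
round(rob$w, 2)    # final IWLS weights: influential cases get low weight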

Summary (1)

• Small samples are especially vulnerable to outliers - there are fewer cases to counter the outlier
• Large samples can also be affected, however, as shown by the "marital coital frequency" example
• Even if you have many cases, and your variables have limited ranges, miscodes that could influence the regression model are still possible
• Unusual cases are only influential when they are both unusual in terms of their Y value given their X (outlier), and when they have an unusual X-value (leverage):

Influence = Leverage × Discrepancy

Summary (2)

• We can test for outliers using studentized residuals and quantile-comparison plots
• Leverage is assessed by exploring the hat-values
• Influence is assessed using DFBetas and, preferably, Cook's D's
• Influence plots (or bubble plots) are useful because they display the studentized residuals, hat-values and Cook's distances all on the same plot
• Joint influence is best assessed using added-variable plots (or partial regression plots)
