Regression III Outliers and Influential Data
Dave Armstrong
University of Wisconsin – Milwaukee
Department of Political Science
e: [email protected]
w: www.quantoid.net/ICPSR.php

Regression Diagnostics
• Today’s lecture deals specifically with unusual data and how they are identified and measured
• Regression Outliers
  • Studentized residuals (and the Bonferroni adjustment)
• Leverage
  • Hat values
• Influence
  • DFBETAs, Cook’s D, influence plots, added-variable plots (partial regression plots)
• Robust and resistant regression methods that limit the effect of such cases on the regression estimates will be discussed later in the course

Outlying Observations: Who Cares?
• Can cause us to misinterpret patterns in plots
  • Temporarily removing them can sometimes help see patterns that we otherwise would not have
  • Transformations can also spread out clustered observations and bring in the outliers
• More importantly, separated points can have a strong influence on statistical models - removing outliers from a regression model can sometimes give completely different results
  • Unusual cases can substantially influence the fit of the OLS model - Cases that are both outliers and high leverage exert influence on both the slopes and intercept of the model
  • Outliers may also indicate that our model fails to capture important characteristics of the data

Example 1. Influence and Small Samples: Inequality Data
• Small samples are especially vulnerable to outliers - there are fewer cases to counter the outlier
• With Czech Republic and Slovakia included, there is no relationship between Attitudes towards inequality and the Gini coefficient
• If these cases are removed, we see a positive relationship
[Figure: "Non-Democracies", secpay plotted against gini, with Slovakia and CzechRepublic labelled and fitted lines for "All Obs" and "No Outliers"]
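The measures named in the outline (hat values, studentized residuals with a Bonferroni adjustment, Cook's D) and the small-sample pattern in Example 1 can be sketched on a contrived sample. This is only an illustration under stated assumptions: the data are made up (they are not the Weakliem data), all object names are hypothetical, and the Bonferroni adjustment is written out by hand as the number of cases times the two-sided p-value. Only base R functions are used.

```r
# Contrived small sample: 24 well-behaved cases plus 2 planted outliers.
# These numbers are made up for illustration; they are not the Weakliem data.
x <- c(20:43, 20, 21)
y <- c(5 + 0.5 * (20:43) + 3 * sin(1:24), 60, 50)
dat <- data.frame(x = x, y = y)

mod_all  <- lm(y ~ x, data = dat)                       # all 26 cases
mod_trim <- lm(y ~ x, data = dat, subset = -c(25, 26))  # planted cases removed

# Leverage: hat values sum to the number of coefficients (here 2).
h <- hatvalues(mod_all)

# Outlyingness: studentized (deleted) residuals are referred to a
# t distribution with n - k - 1 df, where k is the number of coefficients.
rs <- rstudent(mod_all)
n  <- nrow(dat)
k  <- length(coef(mod_all))
p_unadj <- 2 * pt(abs(rs), df = n - k - 1, lower.tail = FALSE)
p_bonf  <- pmin(1, n * p_unadj)   # Bonferroni: multiply by the number of cases

# Influence: Cook's D combines residual size and leverage.
cd <- cooks.distance(mod_all)

# Case with the largest absolute studentized residual.
worst <- which.max(abs(rs))

# Diagnostics for the two planted cases, then the two sets of coefficients.
round(cbind(hat = h, rstudent = rs, cooksD = cd, p_bonf = p_bonf)[c(25, 26), ], 4)
rbind(all = coef(mod_all), trimmed = coef(mod_trim))
```

In this toy sample the two planted cases sit at low x with high y, so including them pulls the fitted slope down, which is the same qualitative pattern the Example 1 figure shows for the real data.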
Code for Previous Figure (Example 1)
> weakliem2 <- read.csv("http://www.quantoid.net/702/Weakliem2.txt")
> rownames(weakliem2) <- weakliem2[,1]
> outs <- which(rownames(weakliem2) %in% c("CzechRepublic","Slovakia"))
> weakliem2$gdp <- weakliem2$gdp/10000
> plot(secpay ~ gini, data=weakliem2, main="Non-Democracies")
> abline(lm(secpay ~ gini, data=weakliem2))
> abline(lm(secpay ~ gini, data=weakliem2, subset=-outs), lty=2, col="red")
> with(weakliem2, text(gini[outs], secpay[outs],
+   rownames(weakliem2)[outs], pos=4))
> legend("topright", c("All Obs","No Outliers"),
+   lty=c(1,2), col=c("black","red"), inset=.01)

Example 1. Influence and Small Samples: Inequality Data (2)

                All Obs      No Outliers
(Intercept)     19.4771      -5.9234
                (11.0655)    (5.2569)
gini            -0.0759      0.4995*
                (0.2884)     (0.1336)
N               26           24
R2              0.0029       0.3887
adj. R2         -0.0387      0.3609
Resid. sd       14.8520      6.2702
Standard errors in parentheses
* indicates significance at p < 0.05

Example 2. Influence and Small Samples: Davis Data (1)
• These data are the Davis data in the car package
• It is clear that observation 12 is influential
• The model including observation 12 does a poor job of representing the trend in the data; the model excluding observation 12 does much better
• The output on the next slide confirms this
[Figure: weight plotted against height for the Davis data, with observation 12 labelled and fitted lines for "All Cases" and "Outlier Excluded"]

R-script for previous slide
> library(car)
> data(Davis)
> plot(weight ~ height, data=Davis)
> with(Davis, text(height[12], weight[12], "12", pos=1))
> abline(lm(weight ~ height, data=Davis),
+   lty=1, col=1, lwd=2)
> abline(lm(weight ~ height, data=Davis, subset=-12),
+   lty=2, col=2, lwd=2)
> legend("topright", lty=c(1,2), col=c(1,2),
+   legend=c("All Cases","Outlier Excluded"), inset=.01)
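The comparison of fits with and without a single case can also be made numerically. The following is a minimal sketch on made-up data standing in for `Davis`, so that it runs without the car package. The planted case has its two values swapped, which is an assumption chosen for illustration, not a statement about the real data. All object names are hypothetical. `update()` refits the same model with one case dropped.

```r
# Made-up height (cm) and weight (kg) for 20 cases; NOT the Davis data.
k      <- 1:20
height <- 150 + 2 * k
weight <- height - 100 + 4 * cos(k)

# Plant one bad case: swap the two values for case 12 (hypothetical miscode).
tmp        <- height[12]
height[12] <- weight[12]
weight[12] <- tmp
toy <- data.frame(height = height, weight = weight)

m_all  <- lm(weight ~ height, data = toy)   # all cases
m_drop <- update(m_all, subset = -12)       # same model without case 12

# Side-by-side coefficients, R-squared and residual SD, in the spirit of
# the "All Obs" / "No Outliers" tables in the lecture.
cmp <- cbind(
  all     = c(coef(m_all),  R2 = summary(m_all)$r.squared,
              resid_sd = summary(m_all)$sigma),
  dropped = c(coef(m_drop), R2 = summary(m_drop)$r.squared,
              resid_sd = summary(m_drop)$sigma)
)
round(cmp, 3)
```

Using `update()` rather than retyping the `lm()` call keeps the two fits identical apart from the dropped case, so any difference in `cmp` is attributable to that case alone.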
Example 2. Influence and Small Samples: Davis Data (2)

                All Obs      No Outliers
(Intercept)     25.27        -130.75*
                (14.95)      (11.56)
height          0.24*        1.15*
                (0.09)       (0.07)
N               200          199
R2              0.04         0.59
adj. R2         0.03         0.59
Resid. sd       14.86        8.52
Standard errors in parentheses
* indicates significance at p < 0.05

Example 3. Large Datasets: Contrived Data
• Although regression models from small datasets are most vulnerable to unusual observations, large datasets are not completely immune
• An unusually high (or low) x or y value could easily result from miscoding during the data entry stage. This could in turn influence the findings
• Imagine a dataset with 1001 observations, where a variable, X1, ranges from 0.88-7.5.
• Assume also that Y is perfectly correlated with X1.
• Even if there is just one miscode - e.g., a “55” is wrongly entered instead of “5” - the distribution of X1 is drastically misrepresented. This one miscode also seriously distorts the regression line.
> set.seed(123)
> x <- c(rnorm(1000, mean=4, sd=1))
> x1 <- c(x, 55)
> y <- c(x, 5)
> range(x1)
[1] 1.190225 55.000000
> range(y)
[1] 1.190225 7.241040

Example 3. Large Datasets: Contrived Data (2)
> library(apsrtable)
> mod1 <- lm(y ~ x1)
> apsrtable(mod1, model.names="", Sweave=T)

(Intercept)     2.84*
                (0.06)
x1              0.29*
                (0.01)
N               1001
R2              0.30
adj. R2         0.30
Resid. sd       0.83
Standard errors in parentheses
* indicates significance at p < 0.05
[Figure: y plotted against x1, with the single miscoded case at x1 = 55]

Example 4. Large Datasets: Marital Coital Frequency (1)
• Jasso, Guillermina (1985) ‘Marital Coital Frequency and the Passage of Time: Estimating the Separate Effects of Spouses’ Ages and Marital Duration, Birth and Marriage Cohorts, and Period Influences,’ American Sociological Review, 50: 224-241.
• Using panel data, estimates age and period effects - controlling for cohort effects - on frequency of sexual relations for married couples from 1970-75
• Major Findings:
  • Controlling for cohort and age effects, there was a negative period effect;
  • Controlling for period and cohort effects, wife’s age had a positive effect
• Both findings differ significantly from previous research in the area

Example 4. Large Datasets: Marital Coital Frequency (2)
[Slide reproduces p. 735 of Kahn and Udry (1986), "Comments":]

\[
\cdots + \beta_3 \ln\mathrm{MARDUR}_{ij} + \pi\,\mathrm{PERIOD}_i + \gamma_1\,\mathrm{WCoh}_j + \gamma_2\,\mathrm{HCoh}_j + \gamma_3\,\mathrm{MARDATE}_j + \alpha_k X_{ijk} + \delta_j D_j + \varepsilon_{ij}
\]

Now, lnWAge no longer equals PERIOD minus WCoh, lnHAge no longer equals PERIOD minus HCoh, and lnMARDUR no longer equals PERIOD minus MARDATE.

The log transformation breaks the APC identity only if we can assume that higher order terms are zero. Note that in the above model,
\[ A = P - C. \]
It follows that since
\[ f(A) = f(P - C), \quad \text{then} \quad \ln(A) = \ln(P - C). \]
Now, ln(P - C) can be expanded using a Taylor series approximation:
\[
\ln(P - C) = (P - C - 1) - \tfrac{1}{2}(P - C - 1)^2 + \tfrac{1}{3}(P - C - 1)^3 - \cdots
\]
\[
\ln(A) = P - C - \left[ 1 + \tfrac{1}{2}(P - C - 1)^2 - \tfrac{1}{3}(P - C - 1)^3 + \cdots \right]
\]
This is simply a reformulation of the APC identity. To break this identity, Jasso assumes that the quantity in brackets (i.e., the higher-order period and cohort terms) equals zero. As Heckman and Robb (1985) and Fienberg and Mason (1985) note, no transformation can break the APC identity without restrictions on the values of parameters (i.e., assuming some to be equal to zero).

We have no immediate qualms about the log transformation; in fact, we tested the functional form of equation (2) using the Box-Cox transformation and found it to be reasonable (see Weisberg, 1980). However, we show that Jasso's identifying restrictions (i.e., making all higher order terms equal zero) lead to a serious misspecification of her model.

In theory, equation (2) is an estimable formulation of a fixed-effect age-period-cohort model. However, in practice, it is computationally intractable because of the large number of dummy variables required. It has been shown that, when using only two waves of panel data, equation (2) can be estimated using first differences (i.e., the model is specified for each wave and then one is subtracted from the other).

Taking first differences from equation (2), we have:
\[
\Delta\mathrm{CF}_j = \beta_1 (\Delta \ln\mathrm{WAge}_j) + \beta_2 (\Delta \ln\mathrm{HAge}_j) + \beta_3 (\Delta \ln\mathrm{MARDUR}_j) + \pi (\Delta\mathrm{PERIOD}) + \alpha_k (\Delta X_{jk}) + \Delta\varepsilon_j \tag{3}
\]
where \(\Delta\) signifies the change between time 1 and time 2.

Note that the cohort terms as well as the couple-specific dummies drop out of equation (3) because their values remain constant in the two waves. Each of the parameters in equation (3) can be interpreted as net of both the cohort effects and the effects of the unobservable couple-specific covariates. In breaking the age-period-cohort identity, Jasso is still unable to estimate cohort effects; rather, she estimates age and period effects that are not confounded with cohort effects.

Specifically, she estimates equation (3) for a sample of continuously married women interviewed in both 1970 and 1975 as part of the National Fertility Survey. As described above, she found significant age and period effects that directly contradict past research. Column 1 of Table 1 presents Jasso's results (Jasso, 1985, Table 4).

Troubled by her substantive findings, we re-estimated equation (3) using the same data. Except for a sample size difference of one observation, we almost perfectly replicated her

Table 1. Fixed-Effects Estimates of the Determinants of Marital Coital Frequency
Columns: (1) Jasso's Results; (2) Our Replication; (3) Drop 4 Miscodes; (4) Drop 4 Miscodes & 4 Outliers; (5) Marital Duration ≤ 2; (6) Marital Duration >

Example 4. Large Datasets: Marital Coital Frequency (3)
• Kahn, J.R. and J.R. Udry (1986) ‘Marital Coital Frequency: Unnoticed Outliers and Unspecified Interactions Lead to Erroneous Conclusions,’ American Sociological Review, 51: 734-737, critiques and replicates Jasso’s research. They claim that Jasso: