
Regression III: Robust Regression
Dave Armstrong
University of Wisconsin – Milwaukee, Department of Political Science
e: [email protected]
w: www.quantoid.net/teachicpsr/regression3

Goals of the lecture

• Introduce the idea of “robustness”, in particular distributional robustness
• Discuss various measures of the robustness of an estimator, such as the breakdown point and the influence function
• Discuss robust and resistant regression methods that can be used when there are unusual observations or skewed distributions
• Particular emphasis will be placed on M-estimation and some extensions (in particular MM-estimation)
• Revisit diagnostics for influential cases, showing how robust regression can be used as a diagnostic tool
• As usual, we’ll see how to do these things in R


Defining “Robust”

• Statistical inferences are based both on observations and on prior assumptions about the underlying distributions and relationships between variables
• Although the assumptions are never exactly true, some statistical models are more sensitive to small deviations from these assumptions than others
• Following Huber (1981), robustness signifies insensitivity to deviations from the assumptions the model imposes
• A model is robust, then, if (1) it is reasonably efficient and unbiased, (2) small deviations from model assumptions will not substantially impair its performance, and (3) somewhat larger deviations will not invalidate the model completely
• Robust regression is concerned with distributional robustness and resistance to unusual observations
• Although conceptually distinct, these are for practical purposes synonymous

Breakdown Point (1)

• Assume a sample, Z, with n observations, and let T be a regression estimator
• Applying T to Z gives us the vector of regression coefficients: T(Z) = β̂
• Imagine all possible “corrupted” samples Z′ that replace any m observations of the dataset with arbitrary values (i.e., influential cases)
• The maximum bias that could arise from these substitutions is:

  effect(m; T, Z) = sup ||T(Z′) − T(Z)||

  where the supremum is taken over all possible corrupted samples Z′

Breakdown Point (2)

• If effect(m; T, Z) is infinite, the m outliers have an arbitrarily large effect on T
• The breakdown point of an estimator T for a finite sample Z is:

  BDP(T, Z) = min{ m/n : effect(m; T, Z) is infinite }

• In other words, the breakdown point is the smallest fraction of “bad” data (outliers or data grouped in the extreme tail of the distribution) the estimator can tolerate without taking on values arbitrarily far from T(Z)
• For OLS regression, one unusual case is enough to influence the coefficient estimates. Its breakdown point is therefore BDP = 1/n
• As n gets larger, 1/n tends toward 0, meaning that the breakdown point for OLS is 0%

Influence Function (or Influence Curve)

• While the breakdown point measures global robustness, the influence function (IF) measures local robustness
• More specifically, the IF measures the impact on an estimator T of a single observation at the point Y that contaminates the theoretically assumed distribution F:

  IF(Y; F, T) = lim(ε→0) [ T{(1 − ε)F + ε·δ_Y} − T{F} ] / ε

  where δ_Y is the distribution that puts all of its mass at the point Y (i.e., δ = 1 at Y and 0 otherwise), and ε is the proportion of contamination at Y
• Simply put, the IF indicates the bias caused by adding outliers at the point Y, standardized by the proportion of contamination
• The IF can be calculated from the first derivative of the estimator
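As a quick illustration of the breakdown idea (a sketch of my own, not from the lecture notes): a single corrupted observation can drag the sample mean arbitrarily far, while the median only breaks down once roughly half of the observations are replaced.

## Sketch: empirical breakdown of the mean vs. the median (illustrative only)
set.seed(123)
y <- rnorm(20)                      # clean sample
y.bad1 <- y; y.bad1[1] <- 1e6       # corrupt a single observation
c(mean(y), mean(y.bad1))            # the mean is dragged toward the outlier
c(median(y), median(y.bad1))        # the median barely moves
y.bad10 <- y; y.bad10[1:10] <- 1e6  # corrupt half of the sample
median(y.bad10)                     # now even the median breaks down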


Influential Cases and OLS

• OLS is not robust to outliers. It can produce misleading results if unusual cases go undetected - even a single case can have a significant impact on the fit of the regression surface
• Moreover, the efficiency of the OLS regression can be hindered by heavy-tailed distributions and outliers
• Diagnostics should be used to detect heavy tails or influential cases, but once they are found we are left with a decision as to what to do
• Investigate whether the deviations are a symptom of model failure that can be repaired by deleting cases, transformations, or adding more terms to the model
• In cases when the unusual data cannot be remedied, robust regression can provide an alternative to OLS

Estimating Location

• In order to explain how robust regression works, it is helpful to start with the simple case of robust estimation of the center of a distribution
• Consider independent observations and the simple model: Y_i = µ + ε_i
• If the underlying distribution is normal, the sample mean is the maximally efficient estimator of µ, producing the fitted model: Y_i = Ȳ + E_i
• The mean minimizes the least-squares objective function:

  Σ ρ_LS(E_i) = Σ ρ_LS(Y_i − µ̂) ≡ Σ (Y_i − µ̂)²

Estimating Location (2)

• The derivative of the objective function with respect to E gives the influence function, which determines the influence of observations: ψ_LS(E) ≡ ρ′_LS(E) = 2E. In other words, influence is proportional to the residual E
• Compared to the median, the mean is sensitive to extreme cases. As an alternative, then, we now consider the median as an estimator of µ
• The median minimizes the least-absolute-values (LAV) objective function:

  Σ ρ_LAV(E_i) = Σ ρ_LAV(Y_i − µ̂) ≡ Σ |Y_i − µ̂|

• This method is more resistant to outliers because, in contrast to the mean, the influence of an unusual observation on the median is bounded

Estimating Location (3)

• Again, taking the derivative of the objective function gives the shape of the influence function:

  ψ_LAV(E) ≡ ρ′_LAV(E) =  1 for E > 0;   0 for E = 0;   −1 for E < 0

• The fact that the median is more resistant than the mean to outliers is a favorable characteristic
• It is far less efficient, however. If Y ~ N(µ, σ²), the sampling variance of the mean is σ²/n; the sampling variance of the median is πσ²/(2n)
• In other words, the sampling variance of the median is π/2 ≈ 1.57 times as large as that of the mean
• The goal, then, is to find an estimator that is more resistant than the mean, but more efficient than the median
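A small simulation (a sketch of my own, not part of the original slides) makes the π/2 efficiency loss concrete:

## Sketch: sampling variance of the median relative to the mean under normality
set.seed(42)
sims <- replicate(5000, {
  y <- rnorm(1000)
  c(mean = mean(y), median = median(y))
})
var(sims["median", ]) / var(sims["mean", ])  # close to pi/2, about 1.57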


M-estimation of Location

• M-estimators are a large class of estimators that generalize the idea of maximum likelihood to robust measures of scale and location, and also extend to robust regression
• M-estimates are very robust for estimating location and relatively efficient compared to other robust measures for large samples (n ≥ 40)
• Let T_n(y_1, ..., y_n) be an estimate of an unknown parameter θ that characterizes the distribution F(Y; θ), with likelihood

  l_F(θ; y_1, ..., y_n) = Π_i f(Y_i; θ)

  where f(Y; θ) is the density corresponding to F(Y; θ)
• The ML estimator is the value of θ that maximizes the likelihood function, or equivalently minimizes:

  −log l = Σ ρ(Y_i; θ),  where ρ(Y; θ) = −log f(Y; θ)

  which for the location problem takes the form Σ ρ((y_i − µ̂)/(cS)), with the residuals standardized by a scale estimate S and tuning constant c (defined below)

M-estimation of Location (2)

• Restricting the objective function ρ(Y; θ) to any function that is differentiable with an absolutely continuous derivative ψ(·) results in the M-estimator T_n defined by:

  Σ ψ(Y_i; θ) = 0,  where ψ(Y; θ) = ∂ρ(Y; θ)/∂θ = −∂ log f(Y; θ)/∂θ

• Least-squares M-estimation relies on the objective function ρ(y; θ) = ½(y − µ̂)², whose derivative gives the influence function ψ(y; θ) = (y − µ̂), which is proportional to the value of y
• If ρ(Y; θ) is symmetric around 0, the breakdown point of the estimator is ε* = lim(n→∞) ε*_n = 0.5
• The commonly used Huber and bisquare estimates fit these criteria by replacing ρ_LS with an objective function that down-weights extreme values

M-Estimation of Location: Huber Estimates (1)

• A good compromise between the efficiency of the least-squares estimator and the robustness of the least-absolute-values estimator is the Huber objective function
• At the center of the distribution, the Huber function behaves like the OLS function, but at the extremes it behaves like the LAV function:

  ρ_H(y; θ) =  ½y²  for |y| ≤ c;   c|y| − ½c²  for |y| > c

• The influence function is determined by taking the derivative:

  ψ_H(y; θ) =  −c  for y < −c;   y  for |y| ≤ c;   c  for y > c

• The tuning constant, c, defines the center and the tails

M-estimation, the Tuning Constant

• The tuning constant can be expressed as a multiple of the scale (the spread) of Y: k = cS, where S is a measure of the scale of Y
• Since the standard deviation is influenced by extreme observations, we instead use the median absolute deviation to measure spread:

  MAD = median|Y_i − µ̂|

• The median of Y serves as an initial estimate of µ̂, thus allowing us to define S = MAD/0.6745, which ensures that S estimates σ when the population is normal
• Using k = 1.345σ (= 1.345/0.6745 ≈ 2 MADs) produces 95% efficiency relative to the sample mean when the population is normal, and gives substantial resistance to outliers when it is not
• A smaller k gives more resistance
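The Huber functions are easy to code directly. The sketch below (my own, using c = 1.345) plots the objective function and also uses MASS’s huber() function, which computes the Huber M-estimate of location with the MAD as the scale estimate:

## Sketch: Huber objective and influence functions, and a Huber M-estimate of location
rho.huber <- function(y, c = 1.345)
  ifelse(abs(y) <= c, 0.5 * y^2, c * abs(y) - 0.5 * c^2)
psi.huber.fn <- function(y, c = 1.345)       # named to avoid masking MASS::psi.huber
  ifelse(abs(y) <= c, y, c * sign(y))
curve(rho.huber(x), -4, 4, ylab = "rho(y)")  # quadratic in the center, linear in the tails
library(MASS)
huber(c(rnorm(50), 20), k = 1.345)           # returns $mu and $s; resistant to the outlier at 20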


M-Estimation of Location: Biweight Estimates

• Biweight estimates behave somewhat differently than Huber weights, but are calculated in a similar manner

• The biweight objective function is especially resistant to observations on the extreme tails:

  ρ_BW(y) =  (c²/6){1 − [1 − (y/c)²]³}  if |y| ≤ c;   c²/6  if |y| > c

• The influence function, then, tends toward zero in the tails:

  ψ_BW(y) =  y[1 − (y/c)²]²  if |y| ≤ c;   0  if |y| > c

• For this function, a tuning constant of c = 4.685 × S ≈ 7 MADs produces 95% efficiency when sampling from a normal population

Weights in M-Estimation

Dividing the influence function ψ(y; θ) by y gives the weight applied to each observation:

  w_H(y)  =  1  if |y| ≤ c;   c/|y|  if |y| > c
  w_BW(y) =  [1 − (y/c)²]²  if |y| ≤ c;   0  if |y| > c

The difference between the two weight functions shows up mostly in the extremes; the biweight function is a bit more resistant to outliers.
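A quick way to see this difference (a sketch of my own, using MASS’s psi.huber() and psi.bisquare(), which return the weight function ψ(u)/u when deriv = 0):

## Sketch: Huber vs. bisquare weights as a function of the (scaled) residual
library(MASS)
u <- seq(-6, 6, length.out = 200)
plot(u, psi.huber(u), type = "l", ylim = c(0, 1), ylab = "weight")  # flat at 1 in the center
lines(u, psi.bisquare(u), lty = 2)                                  # redescends to 0 in the tails
legend("topright", c("Huber", "bisquare"), lty = 1:2)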

Weight Functions for Various Estimators

[Figure: weight functions for the various estimators]

M-Estimation for Regression

• OLS minimizes the sum-of-squares function:

  min Σ (E_i)²

• Following from M-estimation of location, a robust regression M-estimate minimizes the sum of a less rapidly increasing function ρ of the residuals:

  min Σ ρ(E_i)

• Since the solution is not scale invariant, the residuals must be standardized by a robust estimate of their scale, σ_E, which is estimated simultaneously. Usually, the median absolute deviation is used:

  min Σ ρ(E_i / σ̂_E)

M-estimation for Regression (2)

• Taking the derivative and solving produces the shape of the influence function:

  Σ ψ(E_i / σ̂_E) x_i = 0,  where ψ = ρ′

• We then substitute an appropriate weight function, w_i = ψ(E_i/σ̂_E)/(E_i/σ̂_E):

  Σ w_i (E_i / σ̂_E) x_i = 0

• Typically the Huber or bisquare weight is employed. In other words, the solution assigns a different weight to each case depending on the size of its residual, and thus minimizes the weighted sum of squares Σ w_i E_i²

M-Estimation and Regression (3)

• Since the weights cannot be estimated before fitting the model, and estimates cannot be found without the weights, an iterative procedure is needed to find estimates
• Initial estimates of b are selected using OLS
• The residuals from this model are used to calculate an estimate of the scale of the residuals, σ_E^(0), and the weights w_i^(0)
• The model is then refit, with several iterations minimizing the weighted sum of squares to obtain new estimates of b:

  b^(l) = (X′WX)⁻¹ X′Wy

  where l is the iteration counter, x_i′ is the ith row of the model matrix, and W ≡ diag{w_i^(l−1)}
• This process continues until the model converges (b^(l) ≈ b^(l−1))
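The iterative scheme above can be written out in a few lines. This is a bare-bones sketch of IWLS with Huber weights on simulated data (my own illustration; rlm() in MASS does the same thing with better convergence checks):

## Sketch: iteratively reweighted least squares for a Huber M-estimate of regression
irls.huber <- function(X, y, c = 1.345, maxit = 50, tol = 1e-8){
  b <- coef(lm.fit(X, y))                # start from OLS
  for(i in 1:maxit){
    e <- drop(y - X %*% b)
    s <- median(abs(e)) / 0.6745         # robust scale estimate (MAD)
    w <- pmin(1, c / abs(e / s))         # Huber weights psi(u)/u
    b.new <- coef(lm.wfit(X, y, w))      # weighted least-squares step
    if(max(abs(b.new - b)) < tol) break
    b <- b.new
  }
  b
}
set.seed(1)
x <- runif(50)
y <- 1 + 2 * x + rt(50, df = 2)          # heavy-tailed errors
irls.huber(cbind(1, x), y)               # compare with MASS::rlm(y ~ x)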

Asymptotic Standard Errors

• For all M-estimators (including the MM-estimator), asymptotic standard errors are given by the square roots of the diagonal entries of the estimated asymptotic covariance matrix σ̂²_E (X′WX)⁻¹ from the final IWLS fit
• The ASE for a particular coefficient is the square root of the corresponding diagonal entry of:

  V̂(b) = [ (1/n) Σ ψ(E_i)² ] / [ (1/n) Σ ψ′(E_i) ]²  ×  (X′X)⁻¹

• The ASEs are reliable if the sample size n is sufficiently large relative to the number of parameters estimated
• Other evidence suggests that their reliability also decreases as the proportion of influential cases increases
• As a result, if n < 50, bootstrapping should be considered

Inequality Model Revisited (1)

• The scatterplot clearly shows Slovakia and the Czech Republic as unusual cases - they are outliers in Y and have high leverage according to X, a combination that results in high influence
• The OLS line fit to the data indicates that they significantly pull the line toward them

[Figure: scatterplot of secpay against gini with the OLS line; Slovakia and the Czech Republic are labelled as unusual cases]


R-Script for plot with case names

library(boot)
weakliem <- read.table(
  "http://quantoid.net/files/reg3/weakliem2.txt",
  sep=",", header=T, row.names=1)
plot(secpay ~ gini, data=weakliem)
outs <- which(rownames(weakliem) %in%
  c("Slovakia","CzechRepublic"))
with(weakliem, text(gini[outs], secpay[outs],
  rownames(weakliem)[outs], pos=4))

Inequality Model Revisited (2)

## Call:
## lm(formula = secpay ~ gini, data = weakliem)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -11.630  -9.368  -6.123   4.163  44.202
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.47705   11.06550   1.760   0.0911 .
## gini        -0.07586    0.28843  -0.263   0.7948
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.85 on 24 degrees of freedom
## Multiple R-squared: 0.002874, Adjusted R-squared: -0.03867
## F-statistic: 0.06918 on 1 and 24 DF, p-value: 0.7948

Inequality Model Revisited (3)

##            StudRes       Hat     CookD
## Brazil    0.631408 0.2395099 0.0643932
## Slovakia  4.219304 0.1541151 0.9539156

[Figure: influence plot of studentized residuals against hat-values, with Brazil and Slovakia labelled]

We see the Czech Republic and Slovakia standing out, but Chile also has relatively high influence.

Inequality Model Revisited (4)

## Call:
## lm(formula = secpay ~ gini, data = weakliem[-outs, ])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.4536  -2.8870  -0.6556   3.0017  13.8004
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -5.9234     5.2569  -1.127  0.27197
## gini          0.4995     0.1336   3.740  0.00113 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.27 on 22 degrees of freedom
## Multiple R-squared: 0.3887, Adjusted R-squared: 0.3609
## F-statistic: 13.99 on 1 and 22 DF, p-value: 0.001135

M-Estimation in R (1)

library(MASS)
mod3 <- rlm(secpay ~ gini, data=weakliem)
summary(mod3)

## Call: rlm(formula = secpay ~ gini, data = weakliem)
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.1680 -4.1786 -0.5509  3.8944 54.5452
##
## Coefficients:
##             Value  Std. Error t value
## (Intercept) 1.6927 5.6295     0.3007
## gini        0.3057 0.1467     2.0837
##
## Residual standard error: 6.363 on 24 degrees of freedom

M-Estimation in R (2)

• As we would expect, the Czech Republic and Slovakia have the largest residuals, meaning that they would have the largest influence on the OLS regression line

[Figure: index plot of residuals(mod3), with Slovakia and the Czech Republic labelled]

plot(residuals(mod3))
text(outs, residuals(mod3)[outs], rownames(weakliem)[outs], pos=2)

M-Estimation in R (3)

• The plot below shows the weight given to each case
• In line with the previous graph, the Czech Republic and Slovakia received the least weight of all the observations
• Note: this technique can also be used to assess influential cases

loww <- with(mod3, which(w < 1))
with(mod3, plot(w))
text(loww, mod3[["w"]][loww], rownames(weakliem)[loww], pos=2)

[Figure: index plot of the Huber weights, with Russia, Armenia, Uruguay, Chile, the Czech Republic and Slovakia labelled]

M-estimation in R (4)

mod4 <- rlm(secpay ~ gini, data=weakliem, psi=psi.bisquare)
summary(mod4)

## Call: rlm(formula = secpay ~ gini, data = weakliem, psi = psi.bisquare)
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.5520 -2.3978  0.5806  4.6531 57.7087
##
## Coefficients:
##             Value   Std. Error t value
## (Intercept) -4.1712  4.9081    -0.8499
## gini         0.4442  0.1279     3.4724
##
## Residual standard error: 5.552 on 24 degrees of freedom

M-estimation in R (5)

• Again, the Czech Republic and Slovakia have large residuals

[Figure: index plot of the residuals from the bisquare fit, with Slovakia and the Czech Republic labelled]

with(mod4, plot(residuals))
text(outs, mod4[["residuals"]][outs],
  rownames(weakliem)[outs], pos=2)

Comparing Huber and Bisquare Weights

• Notice that although the relative weights are similar, the Huber method gives a weight of 1 to far more cases than does the bisquare method

[Figure: Huber weights plotted against bisquare weights, with the 45-degree line; Russia, Armenia, Uruguay, Chile, the Czech Republic and Slovakia are labelled]

plot(mod4[["w"]], mod3[["w"]], xlim=c(0,1), ylim=c(0,1),
  xlab="bisquare weights", ylab="Huber Weights")
abline(0,1)
text(mod4[["w"]][loww], mod3[["w"]][loww],
  rownames(weakliem)[loww], pos=2)

Limitation of M-estimation

• M-estimation for regression is designed to be robust against heavy-tailed error distributions
• M-estimators are more resistant to vertical outliers than are OLS estimators
• They are also more efficient than OLS estimators when there are vertical outliers
• M-estimates are not immune to unusual observations, however
• They have a breakdown point of 0, and an unbounded influence function, because they pay no attention to leverage
• As a result, “bad” leverage points that have small residuals can have as much influence on M-estimates as they do on OLS estimates

Resistant Regression Methods

• Also referred to as bounded-influence methods
• These have the highest possible breakdown point (ε* = 0.5). Some of them are:
  • S-estimation
  • Least Trimmed Squares (LTS)
  • Least Median Squares (LMS)
• Although very resistant in the presence of unusual observations, these methods have the serious limitation that they are very inefficient relative to OLS when the errors are normally distributed
• For all of these, relative efficiency is less than 40%
• These methods have utility, however, in that they all can play an important role in MM-estimation, which produces estimates that are both highly resistant and efficient


Bounded Influence Regression: LTS

• Least Trimmed Squares orders the squared residuals from smallest to largest: (E²)_(1), (E²)_(2), ..., (E²)_(n)
• It then calculates b to minimize the sum of only the smaller half of the residuals:

  min Σ_{i=1}^{m} (E²)_(i)

  where m = ⌊(n + k + 2)/2⌋; the floor brackets indicate rounding down to the nearest integer
• By using only the 50% of the data that fits closest to the original OLS line, LTS completely ignores extreme outliers (i.e., it eliminates the largest residuals)
• Although highly resistant, LTS is very inefficient, and it can also misrepresent the trend in the data if the data are characterized by clusters of extreme cases or if the dataset is relatively small

Bounded Influence Regression: LMS

• An alternative bounded-influence method is Least Median Squares
• Rather than minimize the sum of the squared residuals, LMS minimizes the median of the squared residuals E_i²:

  min MED(E_i²)

• LMS is very robust with respect to outliers both in terms of X and Y values
• Again, however, like LTS it performs poorly from the point of view of asymptotic efficiency (at best, it is 37%)
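As a small worked example (a sketch of my own; it only evaluates the two criteria at the OLS fit to the weakliem data, it does not carry out the actual LTS/LMS search over candidate fits):

## Sketch: LTS and LMS criteria evaluated at the OLS fit
e2 <- sort(residuals(lm(secpay ~ gini, data = weakliem))^2)
n <- length(e2)
k <- 2                          # assuming k counts the two regression coefficients here
m <- floor((n + k + 2) / 2)
sum(e2[1:m])                    # LTS criterion: sum of the m smallest squared residuals
median(e2)                      # LMS criterion: median squared residual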

Bounded Influence Regression: S-estimation

• S-estimation attempts to find the solution with the smallest possible dispersion of the residuals:

  min σ̂_E( E_1(β̂), ..., E_n(β̂) )

• Of course, OLS does this too, by minimizing the variance
• S-estimation, however, minimizes a robust M-estimate of the residual scale σ̂_E, defined by:

  Σ ρ(E_i / σ̂_E) = (n − p)K

  where K = 0.5 ensures consistency at the normal distribution of errors
• Although it has a high breakdown point, S-estimation isn’t attractive on its own because it has very low efficiency (approximately 30% relative to OLS under normally distributed errors)

Resistant Regression in R

• LTS, LMS and S-estimation can all be done using the MASS library:
  • LTS regression: ltsreg(y ~ x)
  • LMS regression: lmsreg(y ~ x)
  • S-estimation: lqs(y ~ x, method = "S")
• Although the fitted, residuals, and coefficients functions work for these objects, many other functions available for lm objects (including summary) are not
• In any event, these methods really have limited utility on their own

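For example (a sketch of my own; ltsreg() and lmsreg() are convenience wrappers around lqs() in MASS):

## Sketch: resistant fits to the inequality data
library(MASS)
lts.mod <- ltsreg(secpay ~ gini, data = weakliem)
lms.mod <- lmsreg(secpay ~ gini, data = weakliem)
s.mod   <- lqs(secpay ~ gini, data = weakliem, method = "S")
rbind(LTS = coef(lts.mod), LMS = coef(lms.mod), S = coef(s.mod))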

Combining Resistance and Efficiency: MM-estimation (1)

• MM-estimation is perhaps the most commonly employed robust method today
• The “MM” in the name refers to the fact that more than one M-estimation procedure is used to calculate the final estimates
• It combines a high breakdown point (50%), a bounded influence function, and high efficiency under normal errors (≈ 95%)

Steps to MM-estimation (1)

1. Initial estimates of the coefficients B^(1) and corresponding residuals e_i^(1) are taken from a highly resistant robust regression (i.e., a regression with a breakdown point of 50%)
   • Although this initial estimator must be consistent, it does not need to be efficient. As a result, S-estimation with Huber or bisquare weights is typically employed here
2. The residuals E_i^(1) from the S-estimation of stage 1 are used to compute an M-estimate of the scale of the residuals, σ_E
3. Finally, the initial residuals e_i^(1) from stage 1 and the residual scale σ_E from stage 2 are used to compute a single-step M-estimate:

   Σ w_i (E_i^(1) / σ̂_E) x_i

   where the w_i are typically Huber or bisquare weights. In other words, the M-estimation procedure at this stage needs only a single iteration of weighted least squares
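The same two-stage procedure is implemented in rlm(..., method = "MM") from MASS (used below) and in lmrob() from the robustbase package; a minimal sketch, assuming robustbase is installed (it is not used elsewhere in these notes):

## Sketch: MM-estimation via robustbase (an S-estimate start followed by a redescending M-step)
library(robustbase)
mm.alt <- lmrob(secpay ~ gini, data = weakliem)
summary(mm.alt)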

MM-estimation in R

mod5 <- rlm(secpay ~ gini, data=weakliem, method="MM")
summary(mod5)

## Call: rlm(formula = secpay ~ gini, data = weakliem, method = "MM")
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.7533 -2.3654  0.4203  4.6373 57.8463
##
## Coefficients:
##             Value   Std. Error t value
## (Intercept) -4.5408  4.8936    -0.9279
## gini         0.4561  0.1276     3.5759
##
## Residual standard error: 6.068 on 24 degrees of freedom

Comparing the models

                      Estimate      SE        t
OLS (with outliers)    -0.0759  0.2884  -0.2630
OLS (no outliers)       0.4995  0.1336   3.7400
M (Huber)               0.3057  0.1467   2.0837
M (bisquare)            0.4442  0.1279   3.4724
MM (Huber)              0.4561  0.1276   3.5759

• The M-estimation techniques give very similar estimates to the OLS fit with the two influential cases deleted
• The MM-estimate is closest
• Don’t just fit these models blindly - look at the pattern in the data and make sure you pick the method that works best

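The comparison table above can be assembled directly from the fitted objects; a sketch, where ols.all and ols.drop are hypothetical names for the two OLS fits shown earlier (with and without the two outliers):

## Sketch: collecting the gini coefficient, SE and t from each fit
ols.all  <- lm(secpay ~ gini, data = weakliem)
ols.drop <- lm(secpay ~ gini, data = weakliem[-outs, ])
fits <- list("OLS (with outliers)" = ols.all, "OLS (no outliers)" = ols.drop,
             "M (Huber)" = mod3, "M (bisquare)" = mod4, "MM (Huber)" = mod5)
tab <- t(sapply(fits, function(m) summary(m)$coefficients["gini", 1:2]))
colnames(tab) <- c("Estimate", "SE")
round(cbind(tab, t = tab[, "Estimate"] / tab[, "SE"]), 4)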

Comparison Plot of Different Models

[Figure: scatterplot of secpay against gini with the fitted lines from OLS (with outliers), OLS (no outliers), M (Huber), M (bisquare), and MM (Huber)]

Diagnostics Revisited: RR-Plots

• A criticism of common measures of influence, such as Cook’s D, is that they are not robust
• Since their calculation is based on the sample mean and covariance matrix, they can often miss outliers because of a “masking effect,” where a group of influential points can mask each other
• Partial regression plots can overcome this effect with respect to individual coefficients, but they don’t help detect overall influence
• RR-plots (“residual-residual” plots), which plot the residuals from an OLS fit against the residuals from many different robust regressions, are a solution
• If the OLS assumptions hold perfectly, there will be a perfect positive relationship, with a slope = 1 (the identity line), between the OLS residuals and the residuals from any robust regression
• If there are outliers, however, the slope will take a value other than 1, since the OLS regression will not resist them while the robust regression will
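A sketch of how such a plot can be drawn (my own illustration, using the OLS and Huber M fits to the inequality data; ols.mod is a hypothetical name for the OLS fit):

## Sketch: an RR-plot of OLS residuals against Huber M-estimate residuals
ols.mod <- lm(secpay ~ gini, data = weakliem)
plot(residuals(mod3), residuals(ols.mod),
     xlab = "robust (Huber M) residuals", ylab = "OLS residuals")
abline(0, 1, lty = 2)                               # 45-degree (identity) line
abline(lm(residuals(ols.mod) ~ residuals(mod3)))    # fitted LM line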

RR-Plots

[Figure: RR-plots of OLS residuals against the residuals from (a) the M (Huber), (b) the M (bisquare), and (c) the MM (Huber) fits, each with the fitted LM line and the 45-degree line]

Bootstrapping Robust Regression

• Robust regression (MM-estimation) gave a significantly better fit to the data than OLS
• The slope for the Gini coefficient not only reached statistical significance, it also changed direction
• The residual standard error also decreased

mod2 <- rlm(secpay ~ gini, data=weakliem, method="MM")
summary(mod2)

## Call: rlm(formula = secpay ~ gini, data = weakliem, method = "MM")
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.7533 -2.3654  0.4203  4.6373 57.8463
##
## Coefficients:
##             Value   Std. Error t value
## (Intercept) -4.5408  4.8936    -0.9279
## gini         0.4561  0.1276     3.5759
##
## Residual standard error: 6.068 on 24 degrees of freedom


Bootstrapping Regression: Example (5)

• Despite the fact that the slope coefficient is now statistically significant, the standard errors are not reliable because of the small sample size
• Standard errors produced by the rlm function rely on asymptotic approximations, and thus are not trustworthy for samples of size 26
• The bootstrap provides an alternative way to get standard errors
• I could proceed to bootstrap the regression in 2 ways:
  • Random-X resampling (or observation resampling)
  • Fixed-X resampling

Random-X Resampling

• I start by creating a function that returns the Huber M-estimates of the coefficients for each bootstrap sample
• Recall that the Huber weight behaves like OLS in the center of the distribution, but like absolute-values regression in the tails, giving low weight to observations with extreme residuals

boot.huber <- function(data, inds, maxit=20){
  assign(".inds", inds, envir=.GlobalEnv)
  mod <- rlm(secpay ~ gini, data=data[.inds,], maxit=maxit)
  remove(".inds", envir=.GlobalEnv)
  coefficients(mod)
}

Huber Weight Function

[Figure: the Huber weight function]

Random-X Resampling (3)

• The boot function returns the original robust coefficients (from the rlm model), the bootstrap standard errors, and the bias of the bootstrap estimates

boot.ranx <- boot(data=weakliem, statistic=boot.huber, R=1500, maxit=100)
boot.ranx

## ORDINARY NONPARAMETRIC BOOTSTRAP
##
## Call:
## boot(data = weakliem, statistic = boot.huber, R = 1500, maxit = 100)
##
## Bootstrap Statistics :
##      original        bias  std. error
## t1* 1.6927104  2.90201781  13.4664347
## t2* 0.3057488 -0.06137429   0.3297755

• t1* represents the first coefficient in the model frame (the intercept); t2* represents the second (gini)

Bootstrap Confidence Interval

• According to all three measures, the coefficient is not statistically significant (the default for boot.ci is a 95% CI) - still, the BCa interval is slightly larger and centered differently than the normal interval

boot.ci(boot.ranx, type=c("normal","perc", "bca"), index=2)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.ranx, type = c("normal", "perc", "bca"),
##     index = 2)
##
## Intervals :
## Level                                   BCa
## 95%   (-0.7222,  0.7147 )   (-0.9420,  0.6618 )
## Calculations and Intervals on Original Scale

Fixed-X Resampling

fit <- fitted(mod2)
e <- residuals(mod2)
boot.huber.fixed <- function(data, inds, maxit=20){
  assign(".inds", inds, envir=.GlobalEnv)
  y.star <<- fit + e[.inds]
  mod <- update(mod2, y.star ~ .)
  remove(".inds", envir=.GlobalEnv)
  coefficients(mod)
}

Bootstrap Results

boot.fixx <- boot(weakliem, boot.huber.fixed, R=1500, maxit=100)
boot.fixx

## ORDINARY NONPARAMETRIC BOOTSTRAP
##
## Call:
## boot(data = weakliem, statistic = boot.huber.fixed, R = 1500,
##     maxit = 100)
##
## Bootstrap Statistics :
##        original          bias   std. error
## t1* -4.5408396 -0.126663931    4.7126241
## t2*  0.4561285  0.004636794    0.1244956

Bootstrap Distributions

[Figure: density plots of the bootstrapped gini coefficient under random-X and fixed-X resampling]


Make Confidence Intervals

ranx.ci <- boot.ci(boot.ranx, index=2, type=c("perc", "bca"))
fixx.ci <- boot.ci(boot.fixx, index=2, type=c("perc", "bca"))
analytical.ci <- coef(mod2)[2] + qt(.975, nrow(weakliem)-mod2$rank)*
  sqrt(diag(vcov(mod2)))[2] * c(-1,1)
cis <- rbind(ranx.ci$bca[4:5],
  fixx.ci$bca[4:5],
  analytical.ci)
rownames(cis) <- c("randomX", "fixedX", "analytical")
colnames(cis) <- c("lower", "upper")
round(cis, 3)

Comparison

• The standard error from the original fit (0.128) is essentially the same as the fixed-X bootstrap standard error (0.124); both are considerably less than the standard error of 0.330 found in the random-X bootstrap
• This result is also confirmed in the distribution of the bootstrap - although there is still a skew, it is less than in the random-X case

##             lower upper
## randomX    -0.942 0.662
## fixedX      0.236 0.693
## analytical  0.193 0.719

Jasso Robust Regression

library(foreign)
NFS7075 <- read.dta("http://www.quantoid.net/files/reg3/nfs70-75-long.dta")
NFS7075$hage <- exp(NFS7075$loghage)
NFS7075$wiage <- exp(NFS7075$logwiage)
NFS7075$mardur <- exp(NFS7075$logmardur)
X <- model.matrix(~cf + datint + logwiage + loghage +
  logmardur + preg + wemp + hemp, data=NFS7075)[,-1]
X2 <- model.matrix(~log(cf + 1) + datint + logwiage + loghage +
  logmardur + preg + wemp + hemp, data=NFS7075)[,-1]
colnames(X) <- colnames(X2) <- c("cf", "datint", "logwiage",
  "loghage", "logmardur", "preg", "wemp", "hemp")
Xres <- lm(X ~ factor(NFS7075$ID))$residuals
rrmod2a <- rlm(cf ~ datint + logwiage + loghage + logmardur +
  preg + wemp + hemp, data=as.data.frame(Xres))
rrmod2b <- rlm(cf ~ datint + logwiage + loghage + logmardur +
  preg + wemp + hemp, data=as.data.frame(Xres), method="MM")

Jasso Robust Regression (2)

w.dat <- data.frame(
  w = c(rrmod2a$w, rrmod2b$w),
  method = rep(c("M", "MM"), each=length(rrmod2a$w)))

library(lattice)
histogram( ~ w | method, data=w.dat, col="gray75")


Histogram of Weights

[Figure: lattice histograms of the case weights (w) from the M and MM fits, with percent of total on the vertical axis]

Low Weights - M estimation

inds <- which(rrmod2a$w < .1)
ids <- unique(NFS7075$ID[inds])
NFS7075[which(NFS7075$ID %in% ids), c(1,2,3,5,6,7,15,16,17)]

##        ID year cf preg wemp hemp     hage    wiage   mardur
## 1367 2197   70 88   No  Yes  Yes 22.08333 23.08333  3.75000
## 1368 2197   75 10   No  Yes  Yes 26.25000 27.25000  7.91667
## 1369 2198   70 88   No  Yes  Yes 25.08334 23.25000  5.00000
## 1370 2198   75  8   No   No  Yes 29.25000 27.41667  9.16667
## 1373 2203   70 88   No   No  Yes 40.83333 38.91667 18.66667
## 1374 2203   75  4   No   No  Yes 45.83333 43.91666 23.66666
## 2909 4652   70 88   No   No  Yes 34.33333 33.00000 11.58333
## 2910 4652   75  6   No   No  Yes 39.08333 37.75000 16.33334

Low Weights - MM estimation

inds <- which(rrmod2b$w == 0)
ids <- unique(NFS7075$ID[inds])
NFS7075[which(NFS7075$ID %in% ids), c(1,2,3,5,6,7,15,16,17)]

##        ID year cf preg wemp hemp     hage    wiage    mardur
## 574  1000   70 28   No   No  Yes 30.41667 23.58334  5.166666
## 575  1000   75 56   No   No   No 35.41667 28.58333 10.166665
## 663  1144   70 28   No   No  Yes 22.50000 22.66667  2.666667
## 664  1144   75  1   No   No  Yes 27.41667 27.58334  7.583337
## 1343 2165   70  6   No   No  Yes 45.83333 35.25000 15.750002
## 1344 2165   75 52   No   No  Yes 50.75000 40.16667 20.666670
## 1367 2197   70 88   No  Yes  Yes 22.08333 23.08333  3.750000
## 1368 2197   75 10   No  Yes  Yes 26.25000 27.25000  7.916670
## 1369 2198   70 88   No  Yes  Yes 25.08334 23.25000  5.000000
## 1370 2198   75  8   No   No  Yes 29.25000 27.41667  9.166670
## 1373 2203   70 88   No   No  Yes 40.83333 38.91667 18.666668
## 1374 2203   75  4   No   No  Yes 45.83333 43.91666 23.666665
## 2304 3651   70 25   No   No  Yes 20.08333 17.66667  0.750000
## 2305 3651   75 50   No  Yes  Yes 25.00000 22.58333  5.666664
## 2310 3658   70  0   No  Yes  Yes 22.83333 19.58334  3.083333
## 2311 3658   75 46   No   No  Yes 27.75000 24.50000  7.999998
## 2909 4652   70 88   No   No  Yes 34.33333 33.00000 11.583334
## 2910 4652   75  6   No   No  Yes 39.08333 37.75000 16.333335
## 3053 4920   70  0   No   No  Yes 34.16667 28.50000 11.833332
## 3054 4920   75 25   No  Yes  Yes 39.08333 33.41667 16.750000
## 3144 5108   70  0   No  Yes  Yes 24.00000 19.33334  1.583333
## 3145 5108   75 25   No  Yes   No 28.66667 24.00000  6.250003
## 3909 6440   70 30   No   No  Yes 34.08333 28.41667 10.250000
## 3910 6440   75  5   No  Yes  Yes 38.66667 33.00000 14.833335

Which Obs with Low Weights

out <- by(cbind(rrmod2a$w, rrmod2b$w, NFS7075$cf), list(NFS7075$ID),
  function(x) c(mean(x[,1]), mean(x[,2]), diff(x[,3])))
wlens <- which(sapply(out, length) == 2)
out <- do.call("rbind", out)
out <- out[-wlens, ]

plot(out[,1] ~ out[,3],
  xlab="Difference in Coital Frequency",
  ylab="Weight")
plot(out[,2] ~ out[,3],
  xlab="Difference in Coital Frequency",
  ylab="Weight")

[Figure: M and MM weights plotted against the difference in coital frequency]

Summary and Conclusions (1)

• Separate points can have strong influence on statistical models
• Unusual cases can substantially influence the fit of the OLS model - cases that are both outliers and high leverage exert influence on both the slopes and intercept of the model
• Outliers can also cause problems for efficiency
• Efforts should first be made to remedy the problem of unusual cases before proceeding to robust regression
• If robust regression is used, careful attention should be paid to the model - different procedures can give different answers
• Plots of OLS residuals against robust residuals can be helpful for detecting problems with the tails of the error distribution
• If there are no problems, the residuals should all be close to perfectly correlated

Summary and Conclusions (2)

• MM-estimation is typically the best robust method for the linear model
• Inference from these models is better than OLS in the presence of “bad” cases, but its standard errors are not reliable in small samples
• This can be overcome by using bootstrapping to obtain new estimates of the standard errors
• Robust regression can also be extended to the GLM

Readings

Today: Robust Regression
• Andersen (2008)
• *Fox (2008), Chapter 19
• *Cantoni and Ronchetti (2001)
• Rousseeuw and Leroy (1987)

Tomorrow: Heteroskedasticity
• Fox (2008), Chapters 12 & 13
• *Fox and Weisberg (2011), Chapters 3 & 6
• *Long and Ervin (2000)
• *Harvey (1976)
• *Cribari-Neto (2004); Cribari-Neto, Souza and Vasconcellos (2007); Cribari-Neto and da Silva (2011)


Andersen, Robert. 2008. Modern Methods for Robust Regression. Thousand Oaks, CA: Sage.
Cantoni, Eva and Elvezio Ronchetti. 2001. “Robust Inference for Generalized Linear Models.” Journal of the American Statistical Association 96:1022–1030.
Cribari-Neto, Francisco. 2004. “Asymptotic Inference Under Heteroskedasticity of Unknown Form.” Computational Statistics & Data Analysis 45:215–233.
Cribari-Neto, Francisco, Tatiene C. Souza and Klaus L.P. Vasconcellos. 2007. “Inference Under Heteroskedasticity and Leveraged Data.” Communications in Statistics 36(10):1877–1888.
Cribari-Neto, Francisco and Wilton Bernardino da Silva. 2011. “A New Heteroskedasticity-Consistent Covariance Matrix Estimator for the Linear Regression Model.” Advances in Statistical Analysis 95(1):129–146.
Fox, John. 2008. Applied Regression Analysis and Generalized Linear Models, 2nd edition. Thousand Oaks, CA: Sage.
Fox, John and Sanford Weisberg. 2011. An R Companion to Applied Regression, 2nd edition. Thousand Oaks, CA: Sage.
Harvey, Andrew C. 1976. “Estimating Regression Models with Multiplicative Heteroskedasticity.” Econometrica 44(3):461–465.
Long, J. Scott and Laurie H. Ervin. 2000. “Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model.” The American Statistician 54(3):217–224.
Rousseeuw, Peter J. and Annick M. Leroy. 1987. Robust Regression and Outlier Detection. New York: Wiley & Sons, Inc.
