
Regression III: Robust Regression
Dave Armstrong
University of Wisconsin – Milwaukee, Department of Political Science
e: [email protected]
w: www.quantoid.net/teachicpsr/regression3

Goals of the lecture

• Introduce the idea of “robustness”, in particular distributional robustness
• Discuss various measures of the robustness of an estimator, such as the breakdown point and the influence function
• Discuss robust and resistant regression methods that can be used when there are unusual observations or skewed distributions
• Particular emphasis will be placed on M-estimation and some extensions (in particular MM-estimation)
• Revisit diagnostics for influential cases, showing how robust regression can be used as a diagnostic tool
• As usual, we’ll see how to do these things in R


Defining “Robust”

• Statistical inferences are based both on observations and on prior assumptions about the underlying distributions and relationships between variables
• Although the assumptions are never exactly true, some statistical models are more sensitive to small deviations from these assumptions than others
• Following Huber (1981), robustness signifies insensitivity to deviations from the assumptions the model imposes
• A model is robust, then, if (1) it is reasonably efficient and unbiased, (2) small deviations from model assumptions will not substantially impair its performance, and (3) somewhat larger deviations will not invalidate the model completely
• Robust regression is concerned with distributional robustness and resistance to unusual observations
• Although conceptually distinct, these are for practical purposes synonymous

Breakdown Point (1)

• Assume a sample, Z, with n observations, and let T be a regression estimator
• Applying T to Z gives us the vector of regression coefficients: T(Z) = β̂
• Imagine all possible “corrupted” samples Z′ that replace any m observations of the dataset with arbitrary values (i.e., influential cases)
• The maximum bias that could arise from these substitutions is:

  effect(m; T, Z) = sup ||T(Z′) − T(Z)||

  where the supremum is taken over all possible corrupted samples Z′

Breakdown Point (2)

• If effect(m; T, Z) is infinite, the m outliers have an arbitrarily large effect on T
• The breakdown point of an estimator T for a finite sample Z is:

  BDP(T, Z) = min{ m/n : effect(m; T, Z) is infinite }

• In other words, the breakdown point is the smallest fraction of “bad” data (outliers or data grouped in the extreme tail of the distribution) the estimator can tolerate without taking on values arbitrarily far from T(Z)
• For OLS regression, one unusual case is enough to influence the coefficient estimates. Its breakdown point is therefore BDP = 1/n
• As n gets larger, 1/n tends toward 0, meaning that the breakdown point for OLS is 0%

Influence Function (or Influence Curve)

• While the breakdown point measures global robustness, the influence function (IF) measures local robustness
• More specifically, the IF measures the impact on an estimator T of a single observation at the point Y that contaminates the theoretically assumed distribution F:

  IF(Y; F, T) = lim(ε→0) [ T{(1 − ε)F + ε·δ_Y} − T{F} ] / ε

  where δ_Y is the distribution that puts all of its mass at the point Y (i.e., δ = 1 at Y and 0 otherwise), and ε is the proportion of contamination at Y
• Simply put, the IF indicates the bias caused by adding outliers at the point Y, standardized by the proportion of contamination
• The IF can be calculated from the first derivative of the estimator
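As a quick illustration of the breakdown idea (a sketch of my own, not from the lecture notes): a single corrupted observation can drag the sample mean arbitrarily far, while the median only breaks down once roughly half of the observations are replaced.

## Sketch: empirical breakdown of the mean vs. the median (illustrative only)
set.seed(123)
y <- rnorm(20)                      # clean sample
y.bad1 <- y; y.bad1[1] <- 1e6       # corrupt a single observation
c(mean(y), mean(y.bad1))            # the mean is dragged toward the outlier
c(median(y), median(y.bad1))        # the median barely moves
y.bad10 <- y; y.bad10[1:10] <- 1e6  # corrupt half of the sample
median(y.bad10)                     # now even the median breaks down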


Influential Cases and OLS

• OLS is not robust to outliers. It can produce misleading results if unusual cases go undetected - even a single case can have a significant impact on the fit of the regression surface
• Moreover, the efficiency of the OLS regression can be hindered by heavy-tailed distributions and outliers
• Diagnostics should be used to detect heavy tails or influential cases, but once they are found we are left with a decision as to what to do
• Investigate whether the deviations are a symptom of model failure that can be repaired by deleting cases, transformations, or adding more terms to the model
• In cases when the unusual data cannot be remedied, robust regression can provide an alternative to OLS

Estimating Location

• In order to explain how robust regression works, it is helpful to start with the simple case of robust estimation of the center of a distribution
• Consider independent observations and the simple model: Y_i = µ + ε_i
• If the underlying distribution is normal, the sample mean is the maximally efficient estimator of µ, producing the fitted model: Y_i = Ȳ + E_i
• The mean minimizes the least-squares objective function:

  Σ ρ_LS(E_i) = Σ ρ_LS(Y_i − µ̂) ≡ Σ (Y_i − µ̂)²

Estimating Location (2)

• The derivative of the objective function with respect to E gives the influence function, which determines the influence of observations: ψ_LS(E) ≡ ρ′_LS(E) = 2E. In other words, influence is proportional to the residual E
• Compared to the median, the mean is sensitive to extreme cases. As an alternative, then, we now consider the median as an estimator of µ
• The median minimizes the least-absolute-values (LAV) objective function:

  Σ ρ_LAV(E_i) = Σ ρ_LAV(Y_i − µ̂) ≡ Σ |Y_i − µ̂|

• This method is more resistant to outliers because, in contrast to the mean, the influence of an unusual observation on the median is bounded

Estimating Location (3)

• Again, taking the derivative of the objective function gives the shape of the influence function:

  ψ_LAV(E) ≡ ρ′_LAV(E) =  1 for E > 0;   0 for E = 0;   −1 for E < 0

• The fact that the median is more resistant than the mean to outliers is a favorable characteristic
• It is far less efficient, however. If Y ~ N(µ, σ²), the sampling variance of the mean is σ²/n; the sampling variance of the median is πσ²/(2n)
• In other words, the sampling variance of the median is π/2 ≈ 1.57 times as large as that of the mean
• The goal, then, is to find an estimator that is more resistant than the mean, but more efficient than the median
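A small simulation (a sketch of my own, not part of the original slides) makes the π/2 efficiency loss concrete:

## Sketch: sampling variance of the median relative to the mean under normality
set.seed(42)
sims <- replicate(5000, {
  y <- rnorm(1000)
  c(mean = mean(y), median = median(y))
})
var(sims["median", ]) / var(sims["mean", ])  # close to pi/2, about 1.57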


M-estimation of Location

• M-estimators are a large class of estimators that generalize the idea of maximum likelihood to robust measures of scale and location, and also extend to robust regression
• M-estimates are very robust for estimating location and relatively efficient compared to other robust measures for large samples (n ≥ 40)
• Let T_n(y_1, ..., y_n) be an estimate of an unknown parameter θ that characterizes the distribution F(Y; θ), with likelihood

  l_F(θ; y_1, ..., y_n) = Π_i f(Y_i; θ)

  where f(Y; θ) is the density corresponding to F(Y; θ)
• The ML estimator is the value of θ that maximizes the likelihood function, or equivalently minimizes:

  −log l = Σ ρ(Y_i; θ),  where ρ(Y; θ) = −log f(Y; θ)

  which for the location problem takes the form Σ ρ((y_i − µ̂)/(cS)), with the residuals standardized by a scale estimate S and tuning constant c (defined below)

M-estimation of Location (2)

• Restricting the objective function ρ(Y; θ) to any function that is differentiable with an absolutely continuous derivative ψ(·) results in the M-estimator T_n defined by:

  Σ ψ(Y_i; θ) = 0,  where ψ(Y; θ) = ∂ρ(Y; θ)/∂θ = −∂ log f(Y; θ)/∂θ

• Least-squares M-estimation relies on the objective function ρ(y; θ) = ½(y − µ̂)², whose derivative gives the influence function ψ(y; θ) = (y − µ̂), which is proportional to the value of y
• If ρ(Y; θ) is symmetric around 0, the breakdown point of the estimator is ε* = lim(n→∞) ε*_n = 0.5
• The commonly used Huber and bisquare estimates fit these criteria by replacing ρ_LS with an objective function that down-weights extreme values

M-Estimation of Location: Huber Estimates (1)

• A good compromise between the efficiency of the least-squares estimator and the robustness of the least-absolute-values estimator is the Huber objective function
• At the center of the distribution, the Huber function behaves like the OLS function, but at the extremes it behaves like the LAV function:

  ρ_H(y; θ) =  ½y²  for |y| ≤ c;   c|y| − ½c²  for |y| > c

• The influence function is determined by taking the derivative:

  ψ_H(y; θ) =  −c  for y < −c;   y  for |y| ≤ c;   c  for y > c

• The tuning constant, c, defines the center and the tails

M-estimation, the Tuning Constant

• The tuning constant can be expressed as a multiple of the scale (the spread) of Y: k = cS, where S is a measure of the scale of Y
• Since the standard deviation is influenced by extreme observations, we instead use the median absolute deviation to measure spread:

  MAD = median|Y_i − µ̂|

• The median of Y serves as an initial estimate of µ̂, thus allowing us to define S = MAD/0.6745, which ensures that S estimates σ when the population is normal
• Using k = 1.345σ (= 1.345/0.6745 ≈ 2 MADs) produces 95% efficiency relative to the sample mean when the population is normal, and gives substantial resistance to outliers when it is not
• A smaller k gives more resistance
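The Huber functions are easy to code directly. The sketch below (my own, using c = 1.345) plots the objective function and also uses MASS’s huber() function, which computes the Huber M-estimate of location with the MAD as the scale estimate:

## Sketch: Huber objective and influence functions, and a Huber M-estimate of location
rho.huber <- function(y, c = 1.345)
  ifelse(abs(y) <= c, 0.5 * y^2, c * abs(y) - 0.5 * c^2)
psi.huber.fn <- function(y, c = 1.345)       # named to avoid masking MASS::psi.huber
  ifelse(abs(y) <= c, y, c * sign(y))
curve(rho.huber(x), -4, 4, ylab = "rho(y)")  # quadratic in the center, linear in the tails
library(MASS)
huber(c(rnorm(50), 20), k = 1.345)           # returns $mu and $s; resistant to the outlier at 20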


M-Estimation of Location: Biweight Estimates

• Biweight estimates behave somewhat differently than Huber weights, but are calculated in a similar manner

• The biweight objective function is especially resistant to observations on the extreme tails:

  ρ_BW(y) =  (c²/6){1 − [1 − (y/c)²]³}  if |y| ≤ c;   c²/6  if |y| > c

• The influence function, then, tends toward zero in the tails:

  ψ_BW(y) =  y[1 − (y/c)²]²  if |y| ≤ c;   0  if |y| > c

• For this function, a tuning constant of c = 4.685 × S ≈ 7 MADs produces 95% efficiency when sampling from a normal population

Weights in M-Estimation

Dividing the influence function ψ(y; θ) by y gives the weight applied to each observation:

  w_H(y)  =  1  if |y| ≤ c;   c/|y|  if |y| > c
  w_BW(y) =  [1 − (y/c)²]²  if |y| ≤ c;   0  if |y| > c

The difference between the two weight functions shows up mostly in the extremes; the biweight function is a bit more resistant to outliers.
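A quick way to see this difference (a sketch of my own, using MASS’s psi.huber() and psi.bisquare(), which return the weight function ψ(u)/u when deriv = 0):

## Sketch: Huber vs. bisquare weights as a function of the (scaled) residual
library(MASS)
u <- seq(-6, 6, length.out = 200)
plot(u, psi.huber(u), type = "l", ylim = c(0, 1), ylab = "weight")  # flat at 1 in the center
lines(u, psi.bisquare(u), lty = 2)                                  # redescends to 0 in the tails
legend("topright", c("Huber", "bisquare"), lty = 1:2)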

Weight Functions for Various Estimators

[Figure: weight functions for the various estimators]

M-Estimation for Regression

• OLS minimizes the sum-of-squares function:

  min Σ (E_i)²

• Following from M-estimation of location, a robust regression M-estimate minimizes the sum of a less rapidly increasing function ρ of the residuals:

  min Σ ρ(E_i)

• Since the solution is not scale invariant, the residuals must be standardized by a robust estimate of their scale, σ_E, which is estimated simultaneously. Usually, the median absolute deviation is used:

  min Σ ρ(E_i / σ̂_E)

M-estimation for Regression (2)

• Taking the derivative and solving produces the shape of the influence function:

  Σ ψ(E_i / σ̂_E) x_i = 0,  where ψ = ρ′

• We then substitute an appropriate weight function, w_i = ψ(E_i/σ̂_E)/(E_i/σ̂_E):

  Σ w_i (E_i / σ̂_E) x_i = 0

• Typically the Huber or bisquare weight is employed. In other words, the solution assigns a different weight to each case depending on the size of its residual, and thus minimizes the weighted sum of squares Σ w_i E_i²

M-Estimation and Regression (3)

• Since the weights cannot be estimated before fitting the model, and estimates cannot be found without the weights, an iterative procedure is needed to find estimates
• Initial estimates of b are selected using OLS
• The residuals from this model are used to calculate an estimate of the scale of the residuals, σ_E^(0), and the weights w_i^(0)
• The model is then refit, with several iterations minimizing the weighted sum of squares to obtain new estimates of b:

  b^(l) = (X′WX)⁻¹ X′Wy

  where l is the iteration counter, x_i′ is the ith row of the model matrix, and W ≡ diag{w_i^(l−1)}
• This process continues until the model converges (b^(l) ≈ b^(l−1))
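The iterative scheme above can be written out in a few lines. This is a bare-bones sketch of IWLS with Huber weights on simulated data (my own illustration; rlm() in MASS does the same thing with better convergence checks):

## Sketch: iteratively reweighted least squares for a Huber M-estimate of regression
irls.huber <- function(X, y, c = 1.345, maxit = 50, tol = 1e-8){
  b <- coef(lm.fit(X, y))                # start from OLS
  for(i in 1:maxit){
    e <- drop(y - X %*% b)
    s <- median(abs(e)) / 0.6745         # robust scale estimate (MAD)
    w <- pmin(1, c / abs(e / s))         # Huber weights psi(u)/u
    b.new <- coef(lm.wfit(X, y, w))      # weighted least-squares step
    if(max(abs(b.new - b)) < tol) break
    b <- b.new
  }
  b
}
set.seed(1)
x <- runif(50)
y <- 1 + 2 * x + rt(50, df = 2)          # heavy-tailed errors
irls.huber(cbind(1, x), y)               # compare with MASS::rlm(y ~ x)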

Asymptotic Standard Errors

• For all M-estimators (including the MM-estimator), asymptotic standard errors are given by the square roots of the diagonal entries of the estimated asymptotic covariance matrix σ̂²_E (X′WX)⁻¹ from the final IWLS fit
• The ASE for a particular coefficient is the square root of the corresponding diagonal entry of:

  V̂(b) = [ (1/n) Σ ψ(E_i)² ] / [ (1/n) Σ ψ′(E_i) ]²  ×  (X′X)⁻¹

• The ASEs are reliable if the sample size n is sufficiently large relative to the number of parameters estimated
• Other evidence suggests that their reliability also decreases as the proportion of influential cases increases
• As a result, if n < 50, bootstrapping should be considered

Inequality Model Revisited (1)

• The scatterplot clearly shows Slovakia and the Czech Republic as unusual cases - they are outliers in Y and have high leverage according to X, a combination that results in high influence
• The OLS line fit to the data indicates that they significantly pull the line toward them

[Figure: scatterplot of secpay against gini with the OLS line; Slovakia and the Czech Republic are labelled as unusual cases]


R-Script for plot with case names

library(boot)
weakliem <- read.table(
  "http://quantoid.net/files/reg3/weakliem2.txt",
  sep=",", header=T, row.names=1)
plot(secpay ~ gini, data=weakliem)
outs <- which(rownames(weakliem) %in%
  c("Slovakia","CzechRepublic"))
with(weakliem, text(gini[outs], secpay[outs],
  rownames(weakliem)[outs], pos=4))

Inequality Model Revisited (2)

## Call:
## lm(formula = secpay ~ gini, data = weakliem)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -11.630  -9.368  -6.123   4.163  44.202
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.47705   11.06550   1.760   0.0911 .
## gini        -0.07586    0.28843  -0.263   0.7948
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.85 on 24 degrees of freedom
## Multiple R-squared: 0.002874, Adjusted R-squared: -0.03867
## F-statistic: 0.06918 on 1 and 24 DF, p-value: 0.7948

Inequality Model Revisited (3)

##            StudRes       Hat     CookD
## Brazil    0.631408 0.2395099 0.0643932
## Slovakia  4.219304 0.1541151 0.9539156

[Figure: influence plot of studentized residuals against hat-values, with Brazil and Slovakia labelled]

We see the Czech Republic and Slovakia standing out, but Chile also has relatively high influence.

Inequality Model Revisited (4)

## Call:
## lm(formula = secpay ~ gini, data = weakliem[-outs, ])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.4536  -2.8870  -0.6556   3.0017  13.8004
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -5.9234     5.2569  -1.127  0.27197
## gini          0.4995     0.1336   3.740  0.00113 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.27 on 22 degrees of freedom
## Multiple R-squared: 0.3887, Adjusted R-squared: 0.3609
## F-statistic: 13.99 on 1 and 22 DF, p-value: 0.001135

M-Estimation in R (1)

library(MASS)
mod3 <- rlm(secpay ~ gini, data=weakliem)
summary(mod3)

## Call: rlm(formula = secpay ~ gini, data = weakliem)
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.1680 -4.1786 -0.5509  3.8944 54.5452
##
## Coefficients:
##             Value  Std. Error t value
## (Intercept) 1.6927 5.6295     0.3007
## gini        0.3057 0.1467     2.0837
##
## Residual standard error: 6.363 on 24 degrees of freedom

M-Estimation in R (2)

• As we would expect, the Czech Republic and Slovakia have the largest residuals, meaning that they would have the largest influence on the OLS regression line

[Figure: index plot of residuals(mod3), with Slovakia and the Czech Republic labelled]

plot(residuals(mod3))
text(outs, residuals(mod3)[outs], rownames(weakliem)[outs], pos=2)

M-Estimation in R (3)

• The plot below shows the weight given to each case
• In line with the previous graph, the Czech Republic and Slovakia received the least weight of all the observations
• Note: this technique can also be used to assess influential cases

loww <- with(mod3, which(w < 1))
with(mod3, plot(w))
text(loww, mod3[["w"]][loww], rownames(weakliem)[loww], pos=2)

[Figure: index plot of the Huber weights, with Russia, Armenia, Uruguay, Chile, the Czech Republic and Slovakia labelled]

M-estimation in R (4)

mod4 <- rlm(secpay ~ gini, data=weakliem, psi=psi.bisquare)
summary(mod4)

## Call: rlm(formula = secpay ~ gini, data = weakliem, psi = psi.bisquare)
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.5520 -2.3978  0.5806  4.6531 57.7087
##
## Coefficients:
##             Value   Std. Error t value
## (Intercept) -4.1712  4.9081    -0.8499
## gini         0.4442  0.1279     3.4724
##
## Residual standard error: 5.552 on 24 degrees of freedom

M-estimation in R (5)

• Again, the Czech Republic and Slovakia have large residuals

[Figure: index plot of the residuals from the bisquare fit, with Slovakia and the Czech Republic labelled]

with(mod4, plot(residuals))
text(outs, mod4[["residuals"]][outs],
  rownames(weakliem)[outs], pos=2)

Comparing Huber and Bisquare Weights

• Notice that although the relative weights are similar, the Huber method gives a weight of 1 to far more cases than does the bisquare method

[Figure: Huber weights plotted against bisquare weights, with the 45-degree line; Russia, Armenia, Uruguay, Chile, the Czech Republic and Slovakia are labelled]

plot(mod4[["w"]], mod3[["w"]], xlim=c(0,1), ylim=c(0,1),
  xlab="bisquare weights", ylab="Huber Weights")
abline(0,1)
text(mod4[["w"]][loww], mod3[["w"]][loww],
  rownames(weakliem)[loww], pos=2)

Limitation of M-estimation

• M-estimation for regression is designed to be robust against heavy-tailed error distributions
• M-estimators are more resistant to vertical outliers than are OLS estimators
• They are also more efficient than OLS estimators when there are vertical outliers
• M-estimates are not immune to unusual observations, however
• They have a breakdown point of 0, and an unbounded influence function, because they pay no attention to leverage
• As a result, “bad” leverage points that have small residuals can have as much influence on M-estimates as they do on OLS estimates

Resistant Regression Methods

• Also referred to as bounded-influence methods
• These have the highest possible breakdown point (ε* = 0.5). Some of them are:
  • S-estimation
  • Least Trimmed Squares (LTS)
  • Least Median Squares (LMS)
• Although very resistant in the presence of unusual observations, these methods have the serious limitation that they are very inefficient relative to OLS when the errors are normally distributed
• For all of these, relative efficiency is less than 40%
• These methods have utility, however, in that they all can play an important role in MM-estimation, which produces estimates that are both highly resistant and efficient


Bounded Influence Regression: LTS

• Least Trimmed Squares orders the squared residuals from smallest to largest: (E²)_(1), (E²)_(2), ..., (E²)_(n)
• It then calculates b to minimize the sum of only the smaller half of the residuals:

  min Σ_{i=1}^{m} (E²)_(i)

  where m = ⌊(n + k + 2)/2⌋; the floor brackets indicate rounding down to the nearest integer
• By using only the 50% of the data that fits closest to the original OLS line, LTS completely ignores extreme outliers (i.e., it eliminates the largest residuals)
• Although highly resistant, LTS is very inefficient, and it can also misrepresent the trend in the data if the data are characterized by clusters of extreme cases or if the dataset is relatively small

Bounded Influence Regression: LMS

• An alternative bounded-influence method is Least Median Squares
• Rather than minimize the sum of the squared residuals, LMS minimizes the median of the squared residuals E_i²:

  min MED(E_i²)

• LMS is very robust with respect to outliers both in terms of X and Y values
• Again, however, like LTS it performs poorly from the point of view of asymptotic efficiency (at best, it is 37%)
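As a small worked example (a sketch of my own; it only evaluates the two criteria at the OLS fit to the weakliem data, it does not carry out the actual LTS/LMS search over candidate fits):

## Sketch: LTS and LMS criteria evaluated at the OLS fit
e2 <- sort(residuals(lm(secpay ~ gini, data = weakliem))^2)
n <- length(e2)
k <- 2                          # assuming k counts the two regression coefficients here
m <- floor((n + k + 2) / 2)
sum(e2[1:m])                    # LTS criterion: sum of the m smallest squared residuals
median(e2)                      # LMS criterion: median squared residual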

Bounded Influence Regression: S-estimation

• S-estimation attempts to find the solution with the smallest possible dispersion of the residuals:

  min σ̂_E( E_1(β̂), ..., E_n(β̂) )

• Of course, OLS does this too, by minimizing the variance
• S-estimation, however, minimizes a robust M-estimate of the residual scale σ̂_E, defined by:

  Σ ρ(E_i / σ̂_E) = (n − p)K

  where K = 0.5 ensures consistency at the normal distribution of errors
• Although it has a high breakdown point, S-estimation isn’t attractive on its own because it has very low efficiency (approximately 30% relative to OLS under normally distributed errors)

Resistant Regression in R

• LTS, LMS and S-estimation can all be done using the MASS library:
  • LTS regression: ltsreg(y ~ x)
  • LMS regression: lmsreg(y ~ x)
  • S-estimation: lqs(y ~ x, method = "S")
• Although the fitted, residuals, and coefficients functions work for these objects, many other functions available for lm objects (including summary) are not
• In any event, these methods really have limited utility on their own

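For example (a sketch of my own; ltsreg() and lmsreg() are convenience wrappers around lqs() in MASS):

## Sketch: resistant fits to the inequality data
library(MASS)
lts.mod <- ltsreg(secpay ~ gini, data = weakliem)
lms.mod <- lmsreg(secpay ~ gini, data = weakliem)
s.mod   <- lqs(secpay ~ gini, data = weakliem, method = "S")
rbind(LTS = coef(lts.mod), LMS = coef(lms.mod), S = coef(s.mod))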

Combining Resistance and Efficiency: MM-estimation (1)

• MM-estimation is perhaps the most commonly employed robust method today
• The “MM” in the name refers to the fact that more than one M-estimation procedure is used to calculate the final estimates
• It combines a high breakdown point (50%), a bounded influence function, and high efficiency under normal errors (≈ 95%)

Steps to MM-estimation (1)

1. Initial estimates of the coefficients B^(1) and corresponding residuals e_i^(1) are taken from a highly resistant robust regression (i.e., a regression with a breakdown point of 50%)
   • Although this initial estimator must be consistent, it does not need to be efficient. As a result, S-estimation with Huber or bisquare weights is typically employed here
2. The residuals E_i^(1) from the S-estimation of stage 1 are used to compute an M-estimate of the scale of the residuals, σ_E
3. Finally, the initial residuals e_i^(1) from stage 1 and the residual scale σ_E from stage 2 are used to compute a single-step M-estimate:

   Σ w_i (E_i^(1) / σ̂_E) x_i

   where the w_i are typically Huber or bisquare weights. In other words, the M-estimation procedure at this stage needs only a single iteration of weighted least squares
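The same two-stage procedure is implemented in rlm(..., method = "MM") from MASS (used below) and in lmrob() from the robustbase package; a minimal sketch, assuming robustbase is installed (it is not used elsewhere in these notes):

## Sketch: MM-estimation via robustbase (an S-estimate start followed by a redescending M-step)
library(robustbase)
mm.alt <- lmrob(secpay ~ gini, data = weakliem)
summary(mm.alt)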

MM-estimation in R

mod5 <- rlm(secpay ~ gini, data=weakliem, method="MM")
summary(mod5)

## Call: rlm(formula = secpay ~ gini, data = weakliem, method = "MM")
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.7533 -2.3654  0.4203  4.6373 57.8463
##
## Coefficients:
##             Value   Std. Error t value
## (Intercept) -4.5408  4.8936    -0.9279
## gini         0.4561  0.1276     3.5759
##
## Residual standard error: 6.068 on 24 degrees of freedom

Comparing the models

                      Estimate      SE        t
OLS (with outliers)    -0.0759  0.2884  -0.2630
OLS (no outliers)       0.4995  0.1336   3.7400
M (Huber)               0.3057  0.1467   2.0837
M (bisquare)            0.4442  0.1279   3.4724
MM (Huber)              0.4561  0.1276   3.5759

• The M-estimation techniques give very similar estimates to the OLS fit with the two influential cases deleted
• The MM-estimate is closest
• Don’t just fit these models blindly - look at the pattern in the data and make sure you pick the method that works best

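The comparison table above can be assembled directly from the fitted objects; a sketch, where ols.all and ols.drop are hypothetical names for the two OLS fits shown earlier (with and without the two outliers):

## Sketch: collecting the gini coefficient, SE and t from each fit
ols.all  <- lm(secpay ~ gini, data = weakliem)
ols.drop <- lm(secpay ~ gini, data = weakliem[-outs, ])
fits <- list("OLS (with outliers)" = ols.all, "OLS (no outliers)" = ols.drop,
             "M (Huber)" = mod3, "M (bisquare)" = mod4, "MM (Huber)" = mod5)
tab <- t(sapply(fits, function(m) summary(m)$coefficients["gini", 1:2]))
colnames(tab) <- c("Estimate", "SE")
round(cbind(tab, t = tab[, "Estimate"] / tab[, "SE"]), 4)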

Comparison Plot of Different Models

[Figure: scatterplot of secpay against gini with the fitted lines from OLS (with outliers), OLS (no outliers), M (Huber), M (bisquare), and MM (Huber)]

Diagnostics Revisited: RR-Plots

• A criticism of common measures of influence, such as Cook’s D, is that they are not robust
• Since their calculation is based on the sample mean and covariance matrix, they can often miss outliers because of a “masking effect,” where a group of influential points can mask each other
• Partial regression plots can overcome this effect with respect to individual coefficients, but they don’t help detect overall influence
• RR-plots (“residual-residual” plots), which plot the residuals from an OLS fit against the residuals from many different robust regressions, are a solution
• If the OLS assumptions hold perfectly, there will be a perfect positive relationship, with a slope = 1 (the identity line), between the OLS residuals and the residuals from any robust regression
• If there are outliers, however, the slope will take a value other than 1, since the OLS regression will not resist them while the robust regression will
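A sketch of how such a plot can be drawn (my own illustration, using the OLS and Huber M fits to the inequality data; ols.mod is a hypothetical name for the OLS fit):

## Sketch: an RR-plot of OLS residuals against Huber M-estimate residuals
ols.mod <- lm(secpay ~ gini, data = weakliem)
plot(residuals(mod3), residuals(ols.mod),
     xlab = "robust (Huber M) residuals", ylab = "OLS residuals")
abline(0, 1, lty = 2)                               # 45-degree (identity) line
abline(lm(residuals(ols.mod) ~ residuals(mod3)))    # fitted LM line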

RR-Plots

[Figure: RR-plots of OLS residuals against the residuals from (a) the M (Huber), (b) the M (bisquare), and (c) the MM (Huber) fits, each with the fitted LM line and the 45-degree line]

Bootstrapping Robust Regression

• Robust regression (MM-estimation) gave a significantly better fit to the data than OLS
• The slope for the Gini coefficient not only reached statistical significance, it also changed direction
• The residual standard error also decreased

mod2 <- rlm(secpay ~ gini, data=weakliem, method="MM")
summary(mod2)

## Call: rlm(formula = secpay ~ gini, data = weakliem, method = "MM")
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.7533 -2.3654  0.4203  4.6373 57.8463
##
## Coefficients:
##             Value   Std. Error t value
## (Intercept) -4.5408  4.8936    -0.9279
## gini         0.4561  0.1276     3.5759
##
## Residual standard error: 6.068 on 24 degrees of freedom


Bootstrapping Regression: Example (5)

• Despite the fact that the slope coefficient is now statistically significant, the standard errors are not reliable because of the small sample size
• Standard errors produced by the rlm function rely on asymptotic approximations, and thus are not trustworthy for samples of size 26
• The bootstrap provides an alternative way to get standard errors
• I could proceed to bootstrap the regression in 2 ways:
  • Random-X resampling (or observation resampling)
  • Fixed-X resampling

Random-X Resampling

• I start by creating a function that returns the Huber M-estimates of the coefficients for each bootstrap sample
• Recall that the Huber weight behaves like OLS in the center of the distribution, but like absolute-values regression in the tails, giving low weight to observations with extreme residuals

boot.huber <- function(data, inds, maxit=20){
  assign(".inds", inds, envir=.GlobalEnv)
  mod <- rlm(secpay ~ gini, data=data[.inds,], maxit=maxit)
  remove(".inds", envir=.GlobalEnv)
  coefficients(mod)
}

Huber Weight Function

[Figure: the Huber weight function]

Random-X Resampling (3)

• The boot function returns the original robust coefficients (from the rlm model), the bootstrap standard errors, and the bias of the bootstrap estimates

boot.ranx <- boot(data=weakliem, statistic=boot.huber, R=1500, maxit=100)
boot.ranx

## ORDINARY NONPARAMETRIC BOOTSTRAP
##
## Call:
## boot(data = weakliem, statistic = boot.huber, R = 1500, maxit = 100)
##
## Bootstrap Statistics :
##      original        bias  std. error
## t1* 1.6927104  2.90201781  13.4664347
## t2* 0.3057488 -0.06137429   0.3297755

• t1* represents the first coefficient in the model frame (the intercept); t2* represents the second (gini)

Bootstrap Confidence Interval

• According to all three measures, the coefficient is not statistically significant (the default for boot.ci is a 95% CI) - still, the BCa interval is slightly larger and centered differently than the normal interval

boot.ci(boot.ranx, type=c("normal","perc", "bca"), index=2)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.ranx, type = c("normal", "perc", "bca"),
##     index = 2)
##
## Intervals :
## Level                                   BCa
## 95%   (-0.7222,  0.7147 )   (-0.9420,  0.6618 )
## Calculations and Intervals on Original Scale

Fixed-X Resampling

fit <- fitted(mod2)
e <- residuals(mod2)
boot.huber.fixed <- function(data, inds, maxit=20){
  assign(".inds", inds, envir=.GlobalEnv)
  y.star <<- fit + e[.inds]
  mod <- update(mod2, y.star ~ .)
  remove(".inds", envir=.GlobalEnv)
  coefficients(mod)
}

Bootstrap Results

boot.fixx <- boot(weakliem, boot.huber.fixed, R=1500, maxit=100)
boot.fixx

## ORDINARY NONPARAMETRIC BOOTSTRAP
##
## Call:
## boot(data = weakliem, statistic = boot.huber.fixed, R = 1500,
##     maxit = 100)
##
## Bootstrap Statistics :
##        original          bias   std. error
## t1* -4.5408396 -0.126663931    4.7126241
## t2*  0.4561285  0.004636794    0.1244956

Bootstrap Distributions

[Figure: density plots of the bootstrapped gini coefficient under random-X and fixed-X resampling]


Make Confidence Intervals

ranx.ci <- boot.ci(boot.ranx, index=2, type=c("perc", "bca"))
fixx.ci <- boot.ci(boot.fixx, index=2, type=c("perc", "bca"))
analytical.ci <- coef(mod2)[2] + qt(.975, nrow(weakliem)-mod2$rank)*
  sqrt(diag(vcov(mod2)))[2] * c(-1,1)
cis <- rbind(ranx.ci$bca[4:5],
  fixx.ci$bca[4:5],
  analytical.ci)
rownames(cis) <- c("randomX", "fixedX", "analytical")
colnames(cis) <- c("lower", "upper")
round(cis, 3)

Comparison

• The standard error from the original fit (0.128) is essentially the same as the fixed-X bootstrap standard error (0.124); both are considerably less than the standard error of 0.330 found in the random-X bootstrap
• This result is also confirmed in the distribution of the bootstrap - although there is still a skew, it is less than in the random-X case

##             lower upper
## randomX    -0.942 0.662
## fixedX      0.236 0.693
## analytical  0.193 0.719

Jasso Robust Regression

library(foreign)
NFS7075 <- read.dta("http://www.quantoid.net/files/reg3/nfs70-75-long.dta")
NFS7075$hage <- exp(NFS7075$loghage)
NFS7075$wiage <- exp(NFS7075$logwiage)
NFS7075$mardur <- exp(NFS7075$logmardur)
X <- model.matrix(~cf + datint + logwiage + loghage +
  logmardur + preg + wemp + hemp, data=NFS7075)[,-1]
X2 <- model.matrix(~log(cf + 1) + datint + logwiage + loghage +
  logmardur + preg + wemp + hemp, data=NFS7075)[,-1]
colnames(X) <- colnames(X2) <- c("cf", "datint", "logwiage",
  "loghage", "logmardur", "preg", "wemp", "hemp")
Xres <- lm(X ~ factor(NFS7075$ID))$residuals
rrmod2a <- rlm(cf ~ datint + logwiage + loghage + logmardur +
  preg + wemp + hemp, data=as.data.frame(Xres))
rrmod2b <- rlm(cf ~ datint + logwiage + loghage + logmardur +
  preg + wemp + hemp, data=as.data.frame(Xres), method="MM")

Jasso Robust Regression (2)

w.dat <- data.frame(
  w = c(rrmod2a$w, rrmod2b$w),
  method = rep(c("M", "MM"), each=length(rrmod2a$w)))

library(lattice)
histogram( ~ w | method, data=w.dat, col="gray75")


Histogram of Weights

[Figure: lattice histograms of the case weights (w) from the M and MM fits, with percent of total on the vertical axis]

Low Weights - M estimation

inds <- which(rrmod2a$w < .1)
ids <- unique(NFS7075$ID[inds])
NFS7075[which(NFS7075$ID %in% ids), c(1,2,3,5,6,7,15,16,17)]

##        ID year cf preg wemp hemp     hage    wiage   mardur
## 1367 2197   70 88   No  Yes  Yes 22.08333 23.08333  3.75000
## 1368 2197   75 10   No  Yes  Yes 26.25000 27.25000  7.91667
## 1369 2198   70 88   No  Yes  Yes 25.08334 23.25000  5.00000
## 1370 2198   75  8   No   No  Yes 29.25000 27.41667  9.16667
## 1373 2203   70 88   No   No  Yes 40.83333 38.91667 18.66667
## 1374 2203   75  4   No   No  Yes 45.83333 43.91666 23.66666
## 2909 4652   70 88   No   No  Yes 34.33333 33.00000 11.58333
## 2910 4652   75  6   No   No  Yes 39.08333 37.75000 16.33334

Low Weights - MM estimation

inds <- which(rrmod2b$w == 0)
ids <- unique(NFS7075$ID[inds])
NFS7075[which(NFS7075$ID %in% ids), c(1,2,3,5,6,7,15,16,17)]

##        ID year cf preg wemp hemp     hage    wiage    mardur
## 574  1000   70 28   No   No  Yes 30.41667 23.58334  5.166666
## 575  1000   75 56   No   No   No 35.41667 28.58333 10.166665
## 663  1144   70 28   No   No  Yes 22.50000 22.66667  2.666667
## 664  1144   75  1   No   No  Yes 27.41667 27.58334  7.583337
## 1343 2165   70  6   No   No  Yes 45.83333 35.25000 15.750002
## 1344 2165   75 52   No   No  Yes 50.75000 40.16667 20.666670
## 1367 2197   70 88   No  Yes  Yes 22.08333 23.08333  3.750000
## 1368 2197   75 10   No  Yes  Yes 26.25000 27.25000  7.916670
## 1369 2198   70 88   No  Yes  Yes 25.08334 23.25000  5.000000
## 1370 2198   75  8   No   No  Yes 29.25000 27.41667  9.166670
## 1373 2203   70 88   No   No  Yes 40.83333 38.91667 18.666668
## 1374 2203   75  4   No   No  Yes 45.83333 43.91666 23.666665
## 2304 3651   70 25   No   No  Yes 20.08333 17.66667  0.750000
## 2305 3651   75 50   No  Yes  Yes 25.00000 22.58333  5.666664
## 2310 3658   70  0   No  Yes  Yes 22.83333 19.58334  3.083333
## 2311 3658   75 46   No   No  Yes 27.75000 24.50000  7.999998
## 2909 4652   70 88   No   No  Yes 34.33333 33.00000 11.583334
## 2910 4652   75  6   No   No  Yes 39.08333 37.75000 16.333335
## 3053 4920   70  0   No   No  Yes 34.16667 28.50000 11.833332
## 3054 4920   75 25   No  Yes  Yes 39.08333 33.41667 16.750000
## 3144 5108   70  0   No  Yes  Yes 24.00000 19.33334  1.583333
## 3145 5108   75 25   No  Yes   No 28.66667 24.00000  6.250003
## 3909 6440   70 30   No   No  Yes 34.08333 28.41667 10.250000
## 3910 6440   75  5   No  Yes  Yes 38.66667 33.00000 14.833335

Which Obs with Low Weights

out <- by(cbind(rrmod2a$w, rrmod2b$w, NFS7075$cf), list(NFS7075$ID),
  function(x) c(mean(x[,1]), mean(x[,2]), diff(x[,3])))
wlens <- which(sapply(out, length) == 2)
out <- do.call("rbind", out)
out <- out[-wlens, ]

plot(out[,1] ~ out[,3],
  xlab="Difference in Coital Frequency",
  ylab="Weight")
plot(out[,2] ~ out[,3],
  xlab="Difference in Coital Frequency",
  ylab="Weight")

[Figure: M and MM weights plotted against the difference in coital frequency]

Summary and Conclusions (1)

• Separate points can have strong influence on statistical models
• Unusual cases can substantially influence the fit of the OLS model - cases that are both outliers and high leverage exert influence on both the slopes and intercept of the model
• Outliers can also cause problems for efficiency
• Efforts should first be made to remedy the problem of unusual cases before proceeding to robust regression
• If robust regression is used, careful attention should be paid to the model - different procedures can give different answers
• Plots of OLS residuals against robust residuals can be helpful for detecting problems with the tails of the error distribution
• If there are no problems, the residuals should all be close to perfectly correlated

Summary and Conclusions (2)

• MM-estimation is typically the best robust method for the linear model
• Inference from these models is better than OLS in the presence of “bad” cases, but its standard errors are not reliable in small samples
• This can be overcome by using bootstrapping to obtain new estimates of the standard errors
• Robust regression can also be extended to the GLM

Readings

Today: Robust Regression
• Andersen (2008)
• *Fox (2008), Chapter 19
• *Cantoni and Ronchetti (2001)
• Rousseeuw and Leroy (1987)

Tomorrow: Heteroskedasticity
• Fox (2008), Chapters 12 & 13
• *Fox and Weisberg (2011), Chapters 3 & 6
• *Long and Ervin (2000)
• *Harvey (1976)
• *Cribari-Neto (2004); Cribari-Neto, Souza and Vasconcellos (2007); Cribari-Neto and da Silva (2011)


Andersen, Robert. 2008. Modern Methods for Robust Regression. Thousand Oaks, CA: Sage.
Cantoni, Eva and Elvezio Ronchetti. 2001. “Robust Inference for Generalized Linear Models.” Journal of the American Statistical Association 96:1022–1030.
Cribari-Neto, Francisco. 2004. “Asymptotic Inference Under Heteroskedasticity of Unknown Form.” Computational Statistics & Data Analysis 45:215–233.
Cribari-Neto, Francisco, Tatiene C. Souza and Klaus L.P. Vasconcellos. 2007. “Inference Under Heteroskedasticity and Leveraged Data.” Communications in Statistics 36(10):1877–1888.
Cribari-Neto, Francisco and Wilton Bernardino da Silva. 2011. “A New Heteroskedasticity-Consistent Covariance Matrix Estimator for the Linear Regression Model.” Advances in Statistical Analysis 95(1):129–146.
Fox, John. 2008. Applied Regression Analysis and Generalized Linear Models, 2nd edition. Thousand Oaks, CA: Sage.
Fox, John and Sanford Weisberg. 2011. An R Companion to Applied Regression, 2nd edition. Thousand Oaks, CA: Sage.
Harvey, Andrew C. 1976. “Estimating Regression Models with Multiplicative Heteroskedasticity.” Econometrica 44(3):461–465.
Long, J. Scott and Laurie H. Ervin. 2000. “Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model.” The American Statistician 54(3):217–224.
Rousseeuw, Peter J. and Annick M. Leroy. 1987. Robust Regression and Outlier Detection. New York: Wiley & Sons, Inc.
