
On Robust Estimation of High Dimensional Generalized Linear Models

Eunho Yang, Department of Computer Science, University of Texas, Austin ([email protected])
Ambuj Tewari, Department of Statistics, University of Michigan, Ann Arbor ([email protected])
Pradeep Ravikumar, Department of Computer Science, University of Texas, Austin ([email protected])

Abstract

We study robust high-dimensional estimation of generalized linear models (GLMs), where a small number k of the n observations can be arbitrarily corrupted, and where the true parameter is high-dimensional in the "p ≫ n" regime but only has a small number s of non-zero entries. There has been some recent work connecting robustness and sparsity, in the context of linear regression with corrupted observations, by explicitly modeling an outlier response vector that is assumed to be sparse. Interestingly, we show that in the GLM setting such explicit outlier response modeling can be performed in two distinct ways. For each of these two approaches, we give ℓ2 error bounds for parameter estimation for general values of the tuple (n, p, s, k).

1 Introduction

Statistical models in machine learning allow us to make strong predictions even from limited data, by leveraging specific assumptions imposed on the model space. On the flip side, however, when the specific model assumptions do not exactly hold, these standard methods can deteriorate severely. Constructing estimators that are robust to such departures from model assumptions is thus an important problem, and forms the main focus of robust statistics [Huber, 1981; Hampel et al., 1986; Maronna et al., 2006]. In this paper, we focus on the robust estimation of high-dimensional generalized linear models (GLMs). GLMs are a very general class of models for predicting a response given a covariate vector, and include many classical conditional distributions such as Gaussian, logistic, etc. In classical GLMs, the data points are typically low-dimensional and are all assumed to be actually drawn from the assumed model. In our setting of high-dimensional robust GLMs, there are two caveats: (a) the true parameter vector can be very high-dimensional, and furthermore, (b) certain observations are outliers, and could have arbitrary values with no quantitative relationship to the assumed generalized linear model.

Existing Research: Robust Statistics. There has been a long line of work [Huber, 1981; Hampel et al., 1986; Maronna et al., 2006] on robust statistical estimators. These are based on the insight that the typical log-likelihood losses, such as the squared loss for ordinary least squares, are very sensitive to outliers, and that one could instead devise surrogate losses that are more resistant to such outliers. Rousseeuw [1984] for instance proposed the least median of squares estimator as a robust variant of the ordinary least squares estimator. Another class of approaches fits trimmed models to the data after an initial removal of candidate outliers [Rousseeuw and Leroy, 1987]. There have also been estimators that model the outliers explicitly. Gelman et al. [2003] for instance model the responses using a mixture of two Gaussian distributions: one for the regular noise, and the other for the outlier noise, typically modeled as a Gaussian with high variance. Another instance is where the outliers are modeled as being drawn from heavy-tailed distributions such as the t distribution [Lange et al., 1989].

Existing Research: Robustness and Sparsity. The past few years have led to an understanding that outlier-robust estimation is intimately connected to sparse signal recovery [Candes and Tao, 2005; Antoniadis, 2007; Jin and Rao, 2010; Mitra et al., 2010; She and Owen, 2011]. The main insight here is that if the number of outliers is small, they can be cast as a sparse error vector that is added to the standard noise. The problem of sparse signal recovery itself has seen a surge of recent research, where a large body of work has shown that convex and tractable methods employing the likes of ℓ1 regularization enjoy strong statistical guarantees [Donoho and Elad, 2003; Ng, 2004; Candes and Tao, 2006; Meinshausen and Bühlmann, 2006; Tropp, 2006; Zhao and Yu, 2007; Wainwright, 2009; Yang et al., 2012]. Intriguingly, Antoniadis [2007] and She and Owen [2011] show that even classical robust statistics methods can be cast as sparsity-encouraging M-estimators that specifically use non-convex regularization.
Jin and Rao [2010] and Mitra et al. [2010] have also suggested the use of non-convex penalization based methods such as SCAD [Fan and Li, 2001] for robust statistical estimation. Convex regularization based estimators, however, have been enormously successful in high-dimensional statistical estimation, and in particular provide tractable methods that scale to very high-dimensional problems and yet come with rigorous guarantees. To complete the story on the connection between robustness and sparsity, it is thus vital to obtain bounds on the performance of convex regularization based estimators for general high-dimensional robust estimation. For the task of high-dimensional robust linear regression, there has been some interesting recent work [Nguyen and Tran, 2011] that provides precisely such bounds. In this paper, we provide such an analysis for GLMs beyond the standard Gaussian linear model.

It turns out that the story for robust GLMs beyond the standard Gaussian linear model is more complicated. In particular, outlier modeling in GLMs could be done in two ways: (a) in the parameter space of the GLM, or (b) in the output space. For the linear model these two approaches are equivalent, but significant differences emerge in the general case. We show that the former approach always leads to convex optimization problems but only works under rather stringent conditions. On the other hand, the latter approach can lead to a non-convex M-estimator, but enjoys better guarantees. However, we show that all global minimizers of the M-estimation problem arising in the second approach are close to each other, so that the non-convexity in the problem is rather benign. Leveraging recent results [Agarwal et al., 2010; Loh and Wainwright, 2011], we can then show that projected gradient descent will approach one of the global minimizers up to an additive error that scales with the statistical precision of the problem. Our main contributions are thus as follows:

1. For robust estimation of GLMs, we show that there are two distinct ways to use the connection between robustness and sparsity.

2. For each of these two distinct approaches, we provide M-estimators that use ℓ1 regularization and, in addition, appropriate constraints. For the first approach, the M-estimation problem is convex and tractable. For the second approach, the M-estimation problem is typically non-convex, but we provide a projected gradient descent algorithm that is guaranteed to converge to a global minimum of the corresponding M-estimation problem, up to an additive error that scales with the statistical precision of the problem.

3. One of the main contributions of the paper is to provide ℓ2 error bounds for each of the two M-estimators, for general values of the tuple (n, p, s, k). The analysis of corrupted general non-linear models in high-dimensional regimes is highly non-trivial: it combines the twin difficulties of high-dimensional analysis of non-linear models, and analysis given corrupted observations. The presence of both these elements, and specifically the interactions therein, required a subtler analysis, as well as slightly modified M-estimators.
2 Problem Statement and Setup

We consider generalized linear models (GLMs) where the response variable has an exponential family distribution, conditioned on the covariate vector:

P(y | x, θ*) = exp{ ( h(y) + y⟨θ*, x⟩ − A(⟨θ*, x⟩) ) / c(σ) }.   (1)

Examples. The standard linear model with Gaussian noise, logistic regression, and the Poisson model are typical examples of this model. In the case of the standard linear model, the domain Y of the variable y is the set of real numbers R, and with known variance σ², the probability of y in (1) can be rewritten as

P(y | x, θ*) ∝ exp{ ( −y²/2 + y⟨θ*, x⟩ − ⟨θ*, x⟩²/2 ) / σ² },   (2)

where the normalization function A(a) in (1) in this case becomes a²/2. Another very popular example in the GLM family is logistic regression, given a categorical output variable:

P(y | x, θ*) = exp{ y⟨θ*, x⟩ − log(1 + exp(⟨θ*, x⟩)) },   (3)

where Y is {0, 1}, and the normalization function is A(a) = log(1 + exp(a)). We can also derive the Poisson regression model from (1) as follows:

P(y | x, θ*) = exp{ −log(y!) + y⟨θ*, x⟩ − exp(⟨θ*, x⟩) },   (4)

where Y is {0, 1, 2, ...}, and the normalization function is A(a) = exp(a). Our final example is the case where the variable y follows an exponential distribution:

P(y | x, θ*) = exp{ y⟨θ*, x⟩ + log(−⟨θ*, x⟩) },   (5)

where Y is the set of non-negative real numbers, and the normalization function is A(a) = −log(−a). Any distribution in the exponential family can be written in the GLM form (1), where the canonical parameter of the exponential family is ⟨θ*, x⟩. Note however that some distributions, such as the Poisson or the exponential, place restrictions on ⟨θ*, x⟩ to be a valid parameter, so that the density is normalizable, or equivalently so that the normalization function satisfies A(⟨θ*, x⟩) < +∞.
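To make the three canonical instances above concrete, the following minimal sketch (ours, not from the paper; the function name sample_glm and the use of NumPy are illustrative) draws a response from the logistic (3), Poisson (4), and exponential (5) models given a covariate vector and a parameter theta. Note the domain restriction ⟨θ, x⟩ < 0 needed in the exponential case, as discussed above.

import numpy as np

def sample_glm(theta, x, family, rng=None):
    """Draw one response y from the GLM (1) with natural parameter <theta, x>."""
    rng = rng or np.random.default_rng(0)
    eta = float(np.dot(theta, x))                  # natural parameter <theta, x>
    if family == "logistic":                       # A(a) = log(1 + exp(a)), Y = {0, 1}
        return rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    if family == "poisson":                        # A(a) = exp(a), Y = {0, 1, 2, ...}
        return rng.poisson(np.exp(eta))
    if family == "exponential":                    # A(a) = -log(-a), needs <theta, x> < 0
        assert eta < 0, "exponential GLM needs <theta, x> < 0"
        return rng.exponential(scale=-1.0 / eta)   # rate lambda = -eta, mean -1/eta
    raise ValueError("unknown family: %s" % family)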
In the GLM setting, suppose that we are given n covariate vectors, x_i ∈ R^p, drawn i.i.d. from some distribution, and corresponding response variables, y_i ∈ Y, drawn from the distribution P(y | x_i, θ*) in (1). A key goal in statistical estimation is to estimate the parameter θ* given just the samples Z_1^n := {(x_i, y_i)}_{i=1}^n. Such estimation becomes particularly challenging in a high-dimensional regime, where the dimension p is potentially even larger than the number of samples n. In this paper, we are interested in such high-dimensional parameter estimation of a GLM under the additional caveat that some of the observations y_i are arbitrarily corrupted. We can model such corruptions by adding an "outlier error" parameter e_i* in two ways: (i) we add e_i* in the "parameter space" to the uncorrupted parameter ⟨θ*, x_i⟩, or (ii) we introduce e_i* in the output space, so that the output y_i is actually the sum of e_i* and the uncorrupted output ȳ_i. For the specific case of the linear model, both these approaches are exactly the same. We assume that only some of the examples are corrupted, which translates to the error vector e* ∈ R^n being sparse. We further assume that the parameter θ* is also sparse. We thus assume

‖θ*‖_0 ≤ s  and  ‖e*‖_0 ≤ k,

with support sets S and T, respectively. We detail the two approaches (modeling outlier errors in the parameter space and the output space, respectively) in the next two sections.

3 Modeling gross errors in the parameter space

In this section, we discuss a robust estimation approach that models gross outlier errors in the parameter space. Specifically, we assume that the i-th response y_i is drawn from the conditional distribution in (1) but with a "corrupted" parameter ⟨θ*, x_i⟩ + √n e_i*, so that the samples are distributed as

P(y_i | x_i, θ*, e_i*) = exp{ ( h(y_i) + y_i(⟨θ*, x_i⟩ + √n e_i*) − A(⟨θ*, x_i⟩ + √n e_i*) ) / c(σ) }.

We can then write down the resulting negative log-likelihood as

L_p(θ, e; Z_1^n) := −⟨θ, (1/n) Σ_{i=1}^n y_i x_i⟩ − (1/√n) Σ_{i=1}^n y_i e_i + (1/n) Σ_{i=1}^n A(⟨θ, x_i⟩ + √n e_i).

We thus arrive at the following ℓ1 regularized maximum likelihood estimator:

(θ̂, ê) ∈ argmin_{θ, e}  L_p(θ, e; Z_1^n) + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1.

In the sequel, we will consider the following constrained version of this MLE (a_0, b_0 are constants independent of n, p):

(θ̂, ê) ∈ argmin_{‖θ‖_2 ≤ a_0, ‖e‖_2 ≤ b_0/√n}  L_p(θ, e; Z_1^n) + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1.   (6)

The additional regularization provided by the constraints allows us to obtain tighter bounds for the resulting M-estimator. We note that the M-estimation problem in (6) is convex: adding the outlier variables e does not destroy the convexity of the original problem with no gross errors. On the other hand, as we detail below, extremely stringent conditions are required for consistent statistical estimation. (The strictness of these conditions is what motivated us to also consider output-space gross errors in the next section, where the required conditions are more benign.)
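To illustrate the convexity of (6) concretely, here is a minimal sketch (our own, with illustrative names; it assumes the logistic model, for which A(a) = log(1 + exp(a))) that evaluates L_p and takes one proximal-gradient step, handling the ℓ1 penalties by soft-thresholding and the ℓ2-ball constraints of (6) by a simple rescaling onto the ball (an approximation of the exact joint proximal step).

import numpy as np

def A_logistic(u):                       # normalization function A(a) = log(1 + exp(a))
    return np.logaddexp(0.0, u)

def soft_threshold(v, t):                # prox of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def param_space_loss_grad(theta, e, X, y):
    """L_p(theta, e; Z_1^n) of the parameter-space model and its gradients."""
    n = X.shape[0]
    u = X @ theta + np.sqrt(n) * e                     # corrupted natural parameters
    loss = -theta @ (X.T @ y) / n - y @ e / np.sqrt(n) + A_logistic(u).mean()
    mu = 1.0 / (1.0 + np.exp(-u))                      # A'(u), the conditional mean
    return loss, X.T @ (mu - y) / n, (mu - y) / np.sqrt(n)

def prox_grad_step(theta, e, X, y, lam_theta, lam_e, step, a0, b0):
    """One proximal-gradient step for the convex problem (6)."""
    n = X.shape[0]
    _, g_theta, g_e = param_space_loss_grad(theta, e, X, y)
    theta = soft_threshold(theta - step * g_theta, step * lam_theta)
    e = soft_threshold(e - step * g_e, step * lam_e)
    # keep the iterates inside the ell_2 balls of (6) by rescaling (not the exact prox)
    if np.linalg.norm(theta) > a0:
        theta *= a0 / np.linalg.norm(theta)
    if np.linalg.norm(e) > b0 / np.sqrt(n):
        e *= (b0 / np.sqrt(n)) / np.linalg.norm(e)
    return theta, e

Since L_p is the sum of a linear function of (θ, e) and the convex log-partition A evaluated at an affine function of (θ, e), any off-the-shelf composite convex solver could be used in place of this hand-rolled step.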
3.1 ℓ2 Error Bound

We require the following stringent condition:

Assumption 1. ‖θ*‖_2 ≤ a_0 and ‖e*‖_2 ≤ b_0/√n.

We assume the covariates are multivariate Gaussian distributed, as described in the following condition:

Assumption 2. Let X be the n × p design matrix, with the n samples {x_i} along the n rows. We assume that each sample x_i is independently drawn from N(0, Σ). Let λ_max and λ_min > 0 be the maximum and minimum eigenvalues of the covariance matrix Σ, respectively, and let ξ_max be the maximum diagonal entry of Σ. We assume that ξ_max = Θ(1).

Additionally, we place a mild restriction on the normalization function A(·) that all examples of GLMs in Section 2 satisfy:

Assumption 3. The double-derivative A″(·) of the normalization function has at most exponential decay: A″(η) ≥ exp(−c η) for some c > 0.

Theorem 1. Consider the optimal solution (θ̂, ê) of (6) with the regularization parameters

λ_{n,θ} = 2 c_1 √(log p / n)  and  λ_{n,e} = 2 c_2 √(log n / n),

where c_1 and c_2 are some known constants. Then, there exist positive constants K, c_3 and c_4 such that, with probability at least 1 − K/n, the error (θ̂ − θ*, ê − e*) is bounded by

‖θ̂ − θ*‖_2 + ‖ê − e*‖_2 ≤ c_3 ( √(s log p) + √(k log n) ) / ( n^{1/2} − c_4 k / √(log n) ).

Note that the theorem requires the assumption that the outlier errors are bounded as ‖e*‖_2 ≤ b_0/√n. Since the "corrupted" parameters are given by ⟨θ*, x_i⟩ + √n e_i*, the gross outlier error scales as √n ‖e*‖_2, which the assumption thus entails be bounded by a constant (independent of n). Our search for a method that can tolerate larger gross errors led us to introduce the error in the output space in the next section.

4 Modeling gross errors in the output space

In this section, we investigate the consequences of modeling the gross outlier errors directly in the response space. Specifically, we assume that a perturbation of the i-th response, y_i − √n e_i*, is drawn from the conditional distribution in (1) with parameter ⟨θ*, x_i⟩, so that the samples are distributed as

P(y_i | x_i, θ*, e_i*) = exp{ ( h(y_i − √n e_i*) + (y_i − √n e_i*)⟨θ*, x_i⟩ − A(⟨θ*, x_i⟩) ) / c(σ) }.

We can then write down the resulting likelihood objective as

L_o(θ, e; Z_1^n) := (1/n) Σ_{i=1}^n [ B(y_i − √n e_i) − (y_i − √n e_i)⟨θ, x_i⟩ + A(⟨θ, x_i⟩) ],

where B(y) = −h(y), and the resulting ℓ1 regularized maximum likelihood estimator as

(θ̂, ê) ∈ argmin_{θ, e}  L_o(θ, e; Z_1^n) + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1.   (7)

Note that when B(y) is set to −h(y) as above, the estimator has the natural interpretation of maximizing the regularized log-likelihood, but in the sequel we allow B to be an arbitrary function taking the response variable as an input argument. As we will see, setting it to a function other than −h(y) will allow us to obtain stronger statistical guarantees.

Just as in the previous section, we consider a constrained version of the MLE in the sequel:

(θ̂, ê) ∈ argmin_{‖θ‖_1 ≤ a_0 √s, ‖e‖_1 ≤ b_0 √k}  L_o(θ, e; Z_1^n) + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1.   (8)

A key reason we introduce these constraints will be seen in the next section: they help in designing an efficient iterative optimization algorithm to solve the above optimization problem (by providing bounds on the iterates right from the first iteration). One unfortunate facet of the M-estimation problem in (8), and the allied problem in (7), is that it is not convex in general. We will nonetheless show that the computationally tractable algorithm we provide in the next section is guaranteed to converge to a global optimum (up to an additive error that scales at most with the statistical error of the global optimum).

We require the following bounds on the ℓ2 norms of θ* and e*:

Assumption 4. ‖θ*‖_2 ≤ a_0 and ‖e*‖_2 ≤ b_0 for some constants a_0, b_0.

When compared with Assumption 1 in the previous section, the above assumption imposes a much weaker restriction on the magnitude of the gross errors. Specifically, with the √n scaling included, Assumption 4 allows the ℓ2 norm of the gross error √n e* to scale as √n, whereas Assumption 1 in the previous section required the corresponding norm to be bounded above by a constant.
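For later use in the projected gradient method of Section 5, it is convenient to record the gradients of L_o. The display below is our own derivation from the definition of L_o above (it is not stated in this form in the text), with B kept generic:

\[
\nabla_{\theta}\, \mathcal{L}_o(\theta, e; Z_1^n)
  = \frac{1}{n} \sum_{i=1}^{n} \Bigl[ A'\bigl(\langle \theta, x_i \rangle\bigr) - \bigl(y_i - \sqrt{n}\, e_i\bigr) \Bigr] x_i ,
\qquad
\frac{\partial \mathcal{L}_o}{\partial e_i}
  = \frac{1}{\sqrt{n}} \Bigl[ \langle \theta, x_i \rangle - B'\bigl(y_i - \sqrt{n}\, e_i\bigr) \Bigr].
\]

The bilinear term √n e_i ⟨θ, x_i⟩ responsible for the non-convexity discussed in Section 5 is visible here through the cross-dependence of the two gradients. For the quadratic choice of B adopted in Section 4.1 below, B′(y) = y.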
4.1 ℓ2 Error Bound

It turned out, given our analysis, that a natural selection for the function B(·) is the quadratic function (we defer the discussion due to lack of space). Thus, in the spirit of classical robust statistics, we considered the modified log-likelihood objective in (7) with this setting of B(·):

L_o(θ, e; Z_1^n) := (1/n) Σ_{i=1}^n [ (1/2)(y_i − √n e_i)² − (y_i − √n e_i)⟨θ, x_i⟩ + A(⟨θ, x_i⟩) ].

Similarly to the previous section, we assume the random design matrix has rows sampled from a sub-Gaussian distribution:

Assumption 5. Let X be the n × p design matrix which has each sample x_i in its i-th row. Let λ_max and λ_min > 0 be the maximum and minimum eigenvalues of the covariance matrix of x, respectively. For any v ∈ R^p, the variable ⟨v, x_i⟩ is sub-Gaussian with parameter at most σ_u² ‖v‖_2².

Theorem 2. Consider the optimal solution (θ̂, ê) of (8) with the regularization parameters

λ_{n,θ} = max{ 2 c_1 √(log p / n), c_2 √( max(s, k) log p / (s n) ) }  and
λ_{n,e} = max{ 2 c″ / ( c′ n^{1/2} √(log n) ), c_3 √( max(s, k) log p / (k n) ) },

where c′, c″, c_1, c_2, c_3 are some known constants. Then, there exist positive constants K, L and c_4 such that if n ≥ L max(s, k) log p, then with probability at least 1 − K/n, the error (θ̂ − θ*, ê − e*) is bounded by

‖θ̂ − θ*‖_2 + ‖ê − e*‖_2 ≤ c_4 max{ √k / ( c′ n^{1/2} √(log n) ), √( max(s, k) log p / n ) }.

Remarks. Nguyen and Tran [2011] analyze the specific case of the standard linear regression model (which nonetheless is a member of the GLM family), and provide the bound

‖θ̂ − θ*‖_2 + ‖ê − e*‖_2 ≤ c max{ √(s log p / n), √(k log n / n) },

which is asymptotically equivalent to the bound in Theorem 2. As we noted earlier, for the linear regression model both approaches of modeling outlier errors, in the parameter space or the output space, are equivalent, so that we could also compare the linear regression bound to our bound in Theorem 1. There too, the bounds can be seen to be asymptotically equivalent. We thus see that the generality of the GLM family does not adversely affect the ℓ2 norm convergence rates even when compared with the linear model.

5 A Tractable Optimization Method for the Output Space Modeling Approach

In this section we focus on the M-estimation problem (8) that arises in the second approach, where we model errors in the output space. Unfortunately, this is not a tractable optimization problem: in particular, the presence of the bilinear term e_i ⟨θ, x_i⟩ makes the objective function L_o non-convex. A tractable, seemingly-approximate method would be to solve for a local minimum of the objective by using a gradient descent based method. In particular, projected gradient descent (PGD) applied to the M-estimation problem (8) produces the iterates

(θ^{t+1}, e^{t+1}) ∈ argmin_{‖θ‖_1 ≤ a_0 √s, ‖e‖_1 ≤ b_0 √k} { ⟨θ, ∇_θ L_o(θ^t, e^t; Z_1^n)⟩ + (η/2) ‖θ − θ^t‖_2² + ⟨e, ∇_e L_o(θ^t, e^t; Z_1^n)⟩ + (η/2) ‖e − e^t‖_2² + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1 },

where η > 0 is a step-size parameter. Note that even though L_o is non-convex, the problem above is convex, and decouples in θ and e. Moreover, minimizing such a composite objective over the ℓ1 ball can be solved very efficiently by performing two projections onto the ℓ1 ball (see Agarwal et al. [2010] for instance for details).

While the projected gradient descent algorithm with the iterates above might be tractable, one concern might be that these iterates would at most converge to a local minimum, which might not satisfy the consistency and ℓ2 convergence rates outlined in Theorem 2. However, the following theorem shows that this concern is unwarranted: the iterates converge to a global minimum of the optimization problem in (8), up to an additive error that scales at most as the statistical error, (‖θ̂ − θ*‖_2² + ‖ê − e*‖_2²).

Theorem 3. Suppose all conditions of Theorem 2 hold and that n > c_0 (k + s) log p. Let F(θ, e) denote the objective function in (8) and let (θ̂, ê) be a global optimum of the problem. Then, when we apply the PGD steps above with an appropriate step-size η, there exist universal constants C_1, C_2 > 0 and a contraction coefficient γ < 1, independent of (n, p, s, k), such that

‖θ^t − θ̂‖_2² + ‖e^t − ê‖_2² ≤ C_1 ( ‖θ̂ − θ*‖_2² + ‖ê − e*‖_2² )

for all iterates t ≥ T, where

T = C_2 log( (F(θ^0, e^0) − F(θ̂, ê)) / δ² ) / log(1/γ),

with δ² denoting the statistical error term C_1(‖θ̂ − θ*‖_2² + ‖ê − e*‖_2²).
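The following sketch (our illustration under the stated assumptions, not the authors' released code) implements a simplified version of these PGD iterates for the quadratic choice of B from Section 4.1: a gradient step on L_o in (θ, e), soft-thresholding for the ℓ1 penalties, and a single Euclidean projection onto each ℓ1 ball of radii a_0√s and b_0√k, whereas the exact composite update can be computed with two ℓ1-ball projections per block, as noted above. The parameter step plays the role of 1/η.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def project_l1_ball(v, radius):
    """Euclidean projection onto {u : ||u||_1 <= radius} (Duchi et al.-style sort algorithm)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > (cssv - radius))[0][-1]
    tau = (cssv[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pgd_output_space(X, y, Aprime, lam_theta, lam_e, a0, b0, s, k,
                     step=0.1, iters=1000):
    """Simplified PGD for the output-space M-estimator (8) with B(y) = y^2 / 2.

    Gradients of L_o (see the display at the end of Section 4, with B'(y) = y):
      grad_theta = (1/n) X^T (A'(X theta) - (y - sqrt(n) e))
      grad_e_i   = (1/sqrt(n)) (<theta, x_i> - (y_i - sqrt(n) e_i))
    """
    n, p = X.shape
    theta, e = np.zeros(p), np.zeros(n)
    for _ in range(iters):
        r = y - np.sqrt(n) * e                            # "corrected" responses
        u = X @ theta
        g_theta = X.T @ (Aprime(u) - r) / n
        g_e = (u - r) / np.sqrt(n)
        theta = soft_threshold(theta - step * g_theta, step * lam_theta)
        e = soft_threshold(e - step * g_e, step * lam_e)
        theta = project_l1_ball(theta, a0 * np.sqrt(s))   # constraint ||theta||_1 <= a0 sqrt(s)
        e = project_l1_ball(e, b0 * np.sqrt(k))           # constraint ||e||_1 <= b0 sqrt(k)
    return theta, e

For instance, for logistic regression one would pass Aprime = lambda u: 1.0 / (1.0 + np.exp(-u)); the radii and the regularization parameters would be set as in Theorem 2, or tuned in practice.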
6 Experimental Results

In this section, we provide experimental validation, over both simulated as well as real data, of the performance of our M-estimators.

6.1 Simulation Studies

In this section, we provide simulations corroborating Theorems 1 and 2. The theorems are applicable to any distribution in the GLM family (1), and as canonical instances we consider the cases of logistic regression (3), Poisson regression (4), and exponential regression (5). (The case of the standard linear regression model under gross errors has been previously considered in Nguyen and Tran [2011].)

We instantiated our models as follows. We first randomly selected a subset S of {1, ..., p} of size √p as the support set (indexing non-zero values) of the true parameter θ*. We then set the nonzero elements, θ*_S, to be equal to ω, which we vary as noted in the plots. We then randomly generated n i.i.d. samples, {x_1, ..., x_n}, from the normal distribution N(0, I_{p×p}). Given each feature vector x_i, we drew the corresponding true class label ȳ_i from the corresponding GLM distribution. To simulate the worst instance of gross errors, we selected the k samples with the highest value of ⟨θ*, x_i⟩ and corrupted them as follows. For logistic regression, we simply flipped their class labels, to y_i = (1 − ȳ_i). For the Poisson and exponential regression models, the corrupted response y_i is obtained by adding a gross error term to ȳ_i. The learning algorithms were then given the corrupted dataset {(x_i, y_i)}_{i=1}^n. We scaled the number of corrupted samples k with the total number of samples n in three different ways: logarithmic scaling with k = Θ(log n), square-root scaling with k = Θ(√n), and linear scaling with k = Θ(n). For each tuple (n, p, s, k), we drew 50 batches of n samples, and plot the average.
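For concreteness, the following sketch (our rendering of the procedure just described; names such as make_corrupted_logistic_data are illustrative) generates one such corrupted dataset for the logistic case.

import numpy as np

def make_corrupted_logistic_data(n, p, k, omega, rng=None):
    """Sparse theta*, Gaussian covariates, logistic responses, then flip the labels
    of the k samples with the largest <theta*, x_i> to simulate worst-case outliers."""
    rng = rng or np.random.default_rng(0)
    s = int(np.sqrt(p))
    theta_star = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)     # random support S of size sqrt(p)
    theta_star[support] = omega                        # nonzero entries set to omega
    X = rng.standard_normal((n, p))                    # x_i ~ N(0, I_{p x p})
    eta = X @ theta_star
    y_clean = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    y = y_clean.copy()
    outliers = np.argsort(eta)[-k:]                    # k samples with largest <theta*, x_i>
    y[outliers] = 1 - y[outliers]                      # flip their class labels
    return X, y, theta_star, outliers

For the Poisson and exponential cases one would instead add a fixed gross-error term to the clean responses of the selected samples, as described above.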

Figure 1: Comparisons of the ℓ2 error norm for θ vs. n, with p = 196, for different regression cases: (a) logistic regression (top row, ω = 0.5), (b) Poisson regression (middle row, ω = 0.1), and (c) exponential regression (bottom row, ω = 0.1). Each panel compares the estimator trained without gross errors, the ℓ1-penalized GLM regression on corrupted data, and our two M-estimators (gross error in the parameter space and in the output space). Three different types of corruption are presented: k = Θ(log n) (left column), k = Θ(√n) (center column), and k = 0.1 n (right column).

Figure 1 plots the ℓ2 norm error ‖θ̂ − θ*‖_2 of the parameter estimates against the number of samples n. We compare three methods: (a) the standard ℓ1 penalized GLM MLE (e.g., "ℓ1 logistic reg."), which directly learns a GLM regression model over the corrupted data; (b) our first M-estimator (6), which models error in the parameter space ("Gross error in param"); and (c) our second M-estimator (8), which models error in the output space ("Gross error in output"). As a gold standard, we also include the performance of the standard ℓ1 penalized GLM regression over the uncorrupted version of the dataset, {(x_i, ȳ_i)}_{i=1}^n ("w/o gross error"). Note that the ℓ2 norm error is just on the parameter estimates, and we exclude the error in estimating the outliers e themselves, so that we can compare against the gold-standard GLM regression on the uncorrupted data.

While the M-estimation problem with gross errors in the output space is not convex, it can be seen that the projected gradient descent (PGD) iterates converge to the true θ*, corroborating Theorem 3. In the figure, the three rows correspond to the three different GLMs, and the three columns correspond to different outlier scalings, with logarithmic (first column), square-root (second column), and linear (third column) scalings of the number of outliers k as a function of the number of samples n. As the figure shows, the approaches modeling the outliers in the output and parameter spaces perform overwhelmingly better than the baseline ℓ1 penalized GLM regression estimator, and their error even approaches that of the estimator trained on uncorrupted data, even under settings where the number of outliers is a linear fraction of the number of samples. The approach modeling outliers in the output space seems preferable in some cases (logistic, exponential), while the approach modeling outliers in the parameter space seems preferable in others (Poisson).

Figure 2: Comparisons of the empirical prediction errors for different types of outliers on three real data examples: (a) australian, (b) german.numer, and (c) splice. In each panel, the horizontal axis ranges over the outlier regime (w/o, Log, Sqrt, Linear), and the compared methods are standard logistic regression, gross error modeled in the parameter space, and gross error modeled in the output space. Percentage of the samples used in the training dataset: 10% (left column), 50% (center column), and 100% (right column).

6.2 Real Data Examples

In this section, we evaluate the performance of our estimators on some real binary classification datasets obtained from LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We focused on the logistic regression case, and compared our two proposed approaches against standard logistic regression. Note that the datasets we consider have p < n. As Figure 2 shows, our robust estimators show particularly strong performance with more outliers, and/or where fewer samples are used for the training. We found the latter phenomenon interesting, and worthy of further research: robustness might help the performance of regression models even in the absence of outliers, by preventing overfitting.

7 Conclusion

We have provided a comprehensive analysis of statistical estimation of high-dimensional generalized linear models when a small number of the observations are arbitrarily corrupted.

References

A. Y. Ng. Feature selection, ℓ1 vs. ℓ2 regularization, and rotational invariance. In International Conference on Machine Learning, 2004.

N. H. Nguyen and T. D. Tran. Robust Lasso with missing and grossly corrupted observations. IEEE Transactions on Information Theory, 2011. Submitted.

P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, 1987.

P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, 1984.