
On Robust Estimation of High Dimensional Generalized Linear Models

Eunho Yang, Department of Computer Science, University of Texas, Austin ([email protected])
Ambuj Tewari, Department of Statistics, University of Michigan, Ann Arbor ([email protected])
Pradeep Ravikumar, Department of Computer Science, University of Texas, Austin ([email protected])

Abstract

We study robust high-dimensional estimation of generalized linear models (GLMs), where a small number k of the n observations can be arbitrarily corrupted, and where the true parameter is high-dimensional in the "p ≫ n" regime but only has a small number s of non-zero entries. There has been some recent work connecting robustness and sparsity, in the context of linear regression with corrupted observations, by explicitly modeling an outlier response vector that is assumed to be sparse. Interestingly, we show that in the GLM setting such explicit outlier response modeling can be performed in two distinct ways. For each of these two approaches, we give ℓ2 error bounds for parameter estimation for general values of the tuple (n, p, s, k).

1 Introduction

Statistical models in machine learning allow us to make strong predictions even from limited data, by leveraging specific assumptions imposed on the model space. On the flip side, however, when the specific model assumptions do not exactly hold, these standard methods can deteriorate severely. Constructing estimators that are robust to such departures from model assumptions is thus an important problem, and forms the main focus of robust statistics [Huber, 1981; Hampel et al., 1986; Maronna et al., 2006]. In this paper, we focus on the robust estimation of high-dimensional generalized linear models (GLMs). GLMs are a very general class of models for predicting a response given a covariate vector, and include many classical conditional distributions such as Gaussian, logistic, etc. In classical GLMs, the data points are typically low-dimensional and are all assumed to be actually drawn from the assumed model. In our setting of high-dimensional robust GLMs, there are two caveats: (a) the true parameter vector can be very high-dimensional, and furthermore, (b) certain observations are outliers, and could have arbitrary values with no quantitative relationship to the assumed generalized linear model.

Existing Research: Robust Statistics. There has been a long line of work [Huber, 1981; Hampel et al., 1986; Maronna et al., 2006] on robust statistical estimators. These are based on the insight that the typical log-likelihood losses, such as the squared loss for ordinary least squares, are very sensitive to outliers, and that one could instead devise surrogate losses that are more resistant to such outliers. Rousseeuw [1984] for instance proposed the least median of squares estimator as a robust variant of the ordinary least squares estimator. Another class of approaches fits trimmed models to the data after an initial removal of candidate outliers [Rousseeuw and Leroy, 1987]. There have also been estimators that model the outliers explicitly. Gelman et al. [2003] for instance model the responses using a mixture of two Gaussian distributions: one for the regular noise, and the other for the outlier noise, typically modeled as a Gaussian with high variance. Another instance is where the outliers are modeled as being drawn from heavy-tailed distributions such as the t distribution [Lange et al., 1989].

Existing Research: Robustness and Sparsity. The past few years have led to an understanding that outlier-robust estimation is intimately connected to sparse signal recovery [Candes and Tao, 2005; Antoniadis, 2007; Jin and Rao, 2010; Mitra et al., 2010; She and Owen, 2011]. The main insight here is that if the number of outliers is small, they can be cast as a sparse error vector that is added to the standard noise. The problem of sparse signal recovery itself has seen a surge of recent research, where a large body of work has shown that convex and tractable methods employing the likes of ℓ1 regularization enjoy strong statistical guarantees [Donoho and Elad, 2003; Ng, 2004; Candes and Tao, 2006; Meinshausen and Bühlmann, 2006; Tropp, 2006; Zhao and Yu, 2007; Wainwright, 2009; Yang et al., 2012]. Intriguingly, Antoniadis [2007] and She and Owen [2011] show that even classical robust statistics methods can be cast as sparsity-encouraging M-estimators that specifically use non-convex regularization.
Jin and Rao [2010] and Mitra et al. [2010] have also suggested the use of non-convex penalization based methods such as SCAD [Fan and Li, 2001] for robust statistical estimation. Convex regularization based estimators, however, have been enormously successful in high-dimensional statistical estimation, and in particular provide tractable methods that scale to very high-dimensional problems and yet come with rigorous guarantees. To complete the story on the connection between robustness and sparsity, it is thus vital to obtain bounds on the performance of convex regularization based estimators for general high-dimensional robust estimation. For the task of high-dimensional robust linear regression, there has been some interesting recent work [Nguyen and Tran, 2011] that provides precisely such bounds. In this paper, we provide such an analysis for GLMs beyond the standard Gaussian linear model.

It turns out that the story for robust GLMs beyond the standard Gaussian linear model is more complicated. In particular, outlier modeling in GLMs could be done in two ways: (a) in the parameter space of the GLM, or (b) in the output space. For the linear model these two approaches are equivalent, but significant differences emerge in the general case. We show that the former approach always leads to convex optimization problems but only works under rather stringent conditions. On the other hand, the latter approach can lead to a non-convex M-estimator, but enjoys better guarantees. However, we show that all global minimizers of the M-estimation problem arising in the second approach are close to each other, so that the non-convexity in the problem is rather benign. Leveraging recent results [Agarwal et al., 2010; Loh and Wainwright, 2011], we can then show that projected gradient descent will approach one of the global minimizers up to an additive error that scales with the statistical precision of the problem. Our main contributions are thus as follows:

1. For robust estimation of GLMs, we show that there are two distinct ways to use the connection between robustness and sparsity.

2. For each of these two distinct approaches, we provide M-estimators that use ℓ1 regularization and, in addition, appropriate constraints. For the first approach, the M-estimation problem is convex and tractable. For the second approach, the M-estimation problem is typically non-convex, but we provide a projected gradient descent algorithm that is guaranteed to converge to a global minimum of the corresponding M-estimation problem, up to an additive error that scales with the statistical precision of the problem.

3. One of the main contributions of the paper is to provide ℓ2 error bounds for each of the two M-estimators, for general values of the tuple (n, p, s, k). The analysis of corrupted general non-linear models in high-dimensional regimes is highly non-trivial: it combines the twin difficulties of high-dimensional analysis of non-linear models, and analysis given corrupted observations. The presence of both these elements, and specifically the interactions therein, required a subtler analysis, as well as slightly modified M-estimators.
2 Problem Statement and Setup

We consider generalized linear models (GLMs) where the response variable has an exponential family distribution, conditioned on the covariate vector:

P(y | x, θ*) = exp{ ( h(y) + y⟨θ*, x⟩ − A(⟨θ*, x⟩) ) / c(σ) }.   (1)

Examples. The standard linear model with Gaussian noise, logistic regression, and the Poisson model are typical examples of this model. In the case of the standard linear model, the domain Y of the variable y is the set of real numbers R, and with known variance σ², the probability of y in (1) can be rewritten as

P(y | x, θ*) ∝ exp{ ( −y²/2 + y⟨θ*, x⟩ − ⟨θ*, x⟩²/2 ) / σ² },   (2)

where the normalization function A(a) in (1) in this case becomes a²/2. Another very popular example in the GLM family is logistic regression, given a categorical output variable:

P(y | x, θ*) = exp{ y⟨θ*, x⟩ − log(1 + exp(⟨θ*, x⟩)) },   (3)

where Y is {0, 1}, and the normalization function is A(a) = log(1 + exp(a)). We can also derive the Poisson regression model from (1) as follows:

P(y | x, θ*) = exp{ −log(y!) + y⟨θ*, x⟩ − exp(⟨θ*, x⟩) },   (4)

where Y is {0, 1, 2, ...}, and the normalization function is A(a) = exp(a). Our final example is the case where the variable y follows an exponential distribution:

P(y | x, θ*) = exp{ y⟨θ*, x⟩ + log(−⟨θ*, x⟩) },   (5)

where Y is the set of non-negative real numbers, and the normalization function is A(a) = −log(−a). Any distribution in the exponential family can be written in the GLM form (1), where the canonical parameter of the exponential family is ⟨θ*, x⟩. Note however that some distributions, such as the Poisson or the exponential, place restrictions on ⟨θ*, x⟩ to be a valid parameter, so that the density is normalizable, or equivalently so that the normalization function satisfies A(⟨θ*, x⟩) < +∞.
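To make the three canonical instances above concrete, the following minimal sketch (ours, not from the paper; the function name sample_glm and the use of NumPy are illustrative) draws a response from the logistic (3), Poisson (4), and exponential (5) models given a covariate vector and a parameter theta. Note the domain restriction ⟨θ, x⟩ < 0 needed in the exponential case, as discussed above.

import numpy as np

def sample_glm(theta, x, family, rng=None):
    """Draw one response y from the GLM (1) with natural parameter <theta, x>."""
    rng = rng or np.random.default_rng(0)
    eta = float(np.dot(theta, x))                  # natural parameter <theta, x>
    if family == "logistic":                       # A(a) = log(1 + exp(a)), Y = {0, 1}
        return rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    if family == "poisson":                        # A(a) = exp(a), Y = {0, 1, 2, ...}
        return rng.poisson(np.exp(eta))
    if family == "exponential":                    # A(a) = -log(-a), needs <theta, x> < 0
        assert eta < 0, "exponential GLM needs <theta, x> < 0"
        return rng.exponential(scale=-1.0 / eta)   # rate lambda = -eta, mean -1/eta
    raise ValueError("unknown family: %s" % family)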
In the GLM setting, suppose that we are given n covariate vectors, x_i ∈ R^p, drawn i.i.d. from some distribution, and corresponding response variables, y_i ∈ Y, drawn from the distribution P(y | x_i, θ*) in (1). A key goal in statistical estimation is to estimate the parameter θ* given just the samples Z_1^n := {(x_i, y_i)}_{i=1}^n. Such estimation becomes particularly challenging in a high-dimensional regime, where the dimension p is potentially even larger than the number of samples n. In this paper, we are interested in such high-dimensional parameter estimation of a GLM under the additional caveat that some of the observations y_i are arbitrarily corrupted. We can model such corruptions by adding an "outlier error" parameter e_i* in two ways: (i) we add e_i* in the "parameter space" to the uncorrupted parameter ⟨θ*, x_i⟩, or (ii) we introduce e_i* in the output space, so that the output y_i is actually the sum of e_i* and the uncorrupted output ȳ_i. For the specific case of the linear model, both these approaches are exactly the same. We assume that only some of the examples are corrupted, which translates to the error vector e* ∈ R^n being sparse. We further assume that the parameter θ* is also sparse. We thus assume

‖θ*‖_0 ≤ s  and  ‖e*‖_0 ≤ k,

with support sets S and T, respectively. We detail the two approaches (modeling outlier errors in the parameter space and the output space, respectively) in the next two sections.

3 Modeling gross errors in the parameter space

In this section, we discuss a robust estimation approach that models gross outlier errors in the parameter space. Specifically, we assume that the i-th response y_i is drawn from the conditional distribution in (1) but with a "corrupted" parameter ⟨θ*, x_i⟩ + √n e_i*, so that the samples are distributed as

P(y_i | x_i, θ*, e_i*) = exp{ ( h(y_i) + y_i(⟨θ*, x_i⟩ + √n e_i*) − A(⟨θ*, x_i⟩ + √n e_i*) ) / c(σ) }.

We can then write down the resulting negative log-likelihood as

L_p(θ, e; Z_1^n) := −⟨θ, (1/n) Σ_{i=1}^n y_i x_i⟩ − (1/√n) Σ_{i=1}^n y_i e_i + (1/n) Σ_{i=1}^n A(⟨θ, x_i⟩ + √n e_i).

We thus arrive at the following ℓ1 regularized maximum likelihood estimator:

(θ̂, ê) ∈ argmin_{θ, e}  L_p(θ, e; Z_1^n) + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1.

In the sequel, we will consider the following constrained version of this MLE (a_0, b_0 are constants independent of n, p):

(θ̂, ê) ∈ argmin_{‖θ‖_2 ≤ a_0, ‖e‖_2 ≤ b_0/√n}  L_p(θ, e; Z_1^n) + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1.   (6)

The additional regularization provided by the constraints allows us to obtain tighter bounds for the resulting M-estimator. We note that the M-estimation problem in (6) is convex: adding the outlier variables e does not destroy the convexity of the original problem with no gross errors. On the other hand, as we detail below, extremely stringent conditions are required for consistent statistical estimation. (The strictness of these conditions is what motivated us to also consider output-space gross errors in the next section, where the required conditions are more benign.)
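To illustrate the convexity of (6) concretely, here is a minimal sketch (our own, with illustrative names; it assumes the logistic model, for which A(a) = log(1 + exp(a))) that evaluates L_p and takes one proximal-gradient step, handling the ℓ1 penalties by soft-thresholding and the ℓ2-ball constraints of (6) by a simple rescaling onto the ball (an approximation of the exact joint proximal step).

import numpy as np

def A_logistic(u):                       # normalization function A(a) = log(1 + exp(a))
    return np.logaddexp(0.0, u)

def soft_threshold(v, t):                # prox of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def param_space_loss_grad(theta, e, X, y):
    """L_p(theta, e; Z_1^n) of the parameter-space model and its gradients."""
    n = X.shape[0]
    u = X @ theta + np.sqrt(n) * e                     # corrupted natural parameters
    loss = -theta @ (X.T @ y) / n - y @ e / np.sqrt(n) + A_logistic(u).mean()
    mu = 1.0 / (1.0 + np.exp(-u))                      # A'(u), the conditional mean
    return loss, X.T @ (mu - y) / n, (mu - y) / np.sqrt(n)

def prox_grad_step(theta, e, X, y, lam_theta, lam_e, step, a0, b0):
    """One proximal-gradient step for the convex problem (6)."""
    n = X.shape[0]
    _, g_theta, g_e = param_space_loss_grad(theta, e, X, y)
    theta = soft_threshold(theta - step * g_theta, step * lam_theta)
    e = soft_threshold(e - step * g_e, step * lam_e)
    # keep the iterates inside the ell_2 balls of (6) by rescaling (not the exact prox)
    if np.linalg.norm(theta) > a0:
        theta *= a0 / np.linalg.norm(theta)
    if np.linalg.norm(e) > b0 / np.sqrt(n):
        e *= (b0 / np.sqrt(n)) / np.linalg.norm(e)
    return theta, e

Since L_p is the sum of a linear function of (θ, e) and the convex log-partition A evaluated at an affine function of (θ, e), any off-the-shelf composite convex solver could be used in place of this hand-rolled step.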
3.1 ℓ2 Error Bound

We require the following stringent condition:

Assumption 1. ‖θ*‖_2 ≤ a_0 and ‖e*‖_2 ≤ b_0/√n.

We assume the covariates are multivariate Gaussian distributed, as described in the following condition:

Assumption 2. Let X be the n × p design matrix, with the n samples {x_i} along the n rows. We assume that each sample x_i is independently drawn from N(0, Σ). Let λ_max and λ_min > 0 be the maximum and minimum eigenvalues of the covariance matrix Σ, respectively, and let ξ_max be the maximum diagonal entry of Σ. We assume that ξ_max = Θ(1).

Additionally, we place a mild restriction on the normalization function A(·) that all examples of GLMs in Section 2 satisfy:

Assumption 3. The double-derivative A″(·) of the normalization function has at most exponential decay: A″(η) ≥ exp(−c η) for some c > 0.

Theorem 1. Consider the optimal solution (θ̂, ê) of (6) with the regularization parameters

λ_{n,θ} = 2 c_1 √(log p / n)  and  λ_{n,e} = 2 c_2 √(log n / n),

where c_1 and c_2 are some known constants. Then, there exist positive constants K, c_3 and c_4 such that, with probability at least 1 − K/n, the error (θ̂ − θ*, ê − e*) is bounded by

‖θ̂ − θ*‖_2 + ‖ê − e*‖_2 ≤ c_3 ( √(s log p) + √(k log n) ) / ( n^{1/2} − c_4 k / √(log n) ).

Note that the theorem requires the assumption that the outlier errors are bounded as ‖e*‖_2 ≤ b_0/√n. Since the "corrupted" parameters are given by ⟨θ*, x_i⟩ + √n e_i*, the gross outlier error scales as √n ‖e*‖_2, which the assumption thus entails be bounded by a constant (independent of n). Our search for a method that can tolerate larger gross errors led us to introduce the error in the output space in the next section.

4 Modeling gross errors in the output space

In this section, we investigate the consequences of modeling the gross outlier errors directly in the response space. Specifically, we assume that a perturbation of the i-th response, y_i − √n e_i*, is drawn from the conditional distribution in (1) with parameter ⟨θ*, x_i⟩, so that the samples are distributed as

P(y_i | x_i, θ*, e_i*) = exp{ ( h(y_i − √n e_i*) + (y_i − √n e_i*)⟨θ*, x_i⟩ − A(⟨θ*, x_i⟩) ) / c(σ) }.

We can then write down the resulting likelihood objective as

L_o(θ, e; Z_1^n) := (1/n) Σ_{i=1}^n [ B(y_i − √n e_i) − (y_i − √n e_i)⟨θ, x_i⟩ + A(⟨θ, x_i⟩) ],

where B(y) = −h(y), and the resulting ℓ1 regularized maximum likelihood estimator as

(θ̂, ê) ∈ argmin_{θ, e}  L_o(θ, e; Z_1^n) + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1.   (7)

Note that when B(y) is set to −h(y) as above, the estimator has the natural interpretation of maximizing the regularized log-likelihood, but in the sequel we allow B to be an arbitrary function taking the response variable as an input argument. As we will see, setting it to a function other than −h(y) will allow us to obtain stronger statistical guarantees.

Just as in the previous section, we consider a constrained version of the MLE in the sequel:

(θ̂, ê) ∈ argmin_{‖θ‖_1 ≤ a_0 √s, ‖e‖_1 ≤ b_0 √k}  L_o(θ, e; Z_1^n) + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1.   (8)

A key reason we introduce these constraints will be seen in the next section: they help in designing an efficient iterative optimization algorithm to solve the above optimization problem (by providing bounds on the iterates right from the first iteration). One unfortunate facet of the M-estimation problem in (8), and the allied problem in (7), is that it is not convex in general. We will nonetheless show that the computationally tractable algorithm we provide in the next section is guaranteed to converge to a global optimum (up to an additive error that scales at most with the statistical error of the global optimum).

We require the following bounds on the ℓ2 norms of θ* and e*:

Assumption 4. ‖θ*‖_2 ≤ a_0 and ‖e*‖_2 ≤ b_0 for some constants a_0, b_0.

When compared with Assumption 1 in the previous section, the above assumption imposes a much weaker restriction on the magnitude of the gross errors. Specifically, with the √n scaling included, Assumption 4 allows the ℓ2 norm of the gross error √n e* to scale as √n, whereas Assumption 1 in the previous section required the corresponding norm to be bounded above by a constant.
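For later use in the projected gradient method of Section 5, it is convenient to record the gradients of L_o. The display below is our own derivation from the definition of L_o above (it is not stated in this form in the text), with B kept generic:

\[
\nabla_{\theta}\, \mathcal{L}_o(\theta, e; Z_1^n)
  = \frac{1}{n} \sum_{i=1}^{n} \Bigl[ A'\bigl(\langle \theta, x_i \rangle\bigr) - \bigl(y_i - \sqrt{n}\, e_i\bigr) \Bigr] x_i ,
\qquad
\frac{\partial \mathcal{L}_o}{\partial e_i}
  = \frac{1}{\sqrt{n}} \Bigl[ \langle \theta, x_i \rangle - B'\bigl(y_i - \sqrt{n}\, e_i\bigr) \Bigr].
\]

The bilinear term √n e_i ⟨θ, x_i⟩ responsible for the non-convexity discussed in Section 5 is visible here through the cross-dependence of the two gradients. For the quadratic choice of B adopted in Section 4.1 below, B′(y) = y.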
4.1 ℓ2 Error Bound

It turned out, given our analysis, that a natural selection for the function B(·) is the quadratic function (we defer the discussion due to lack of space). Thus, in the spirit of classical robust statistics, we considered the modified log-likelihood objective in (7) with this setting of B(·):

L_o(θ, e; Z_1^n) := (1/n) Σ_{i=1}^n [ (1/2)(y_i − √n e_i)² − (y_i − √n e_i)⟨θ, x_i⟩ + A(⟨θ, x_i⟩) ].

Similarly to the previous section, we assume the random design matrix has rows sampled from a sub-Gaussian distribution:

Assumption 5. Let X be the n × p design matrix which has each sample x_i in its i-th row. Let λ_max and λ_min > 0 be the maximum and minimum eigenvalues of the covariance matrix of x, respectively. For any v ∈ R^p, the variable ⟨v, x_i⟩ is sub-Gaussian with parameter at most σ_u² ‖v‖_2².

Theorem 2. Consider the optimal solution (θ̂, ê) of (8) with the regularization parameters

λ_{n,θ} = max{ 2 c_1 √(log p / n), c_2 √( max(s, k) log p / (s n) ) }  and
λ_{n,e} = max{ 2 c″ / ( c′ n^{1/2} √(log n) ), c_3 √( max(s, k) log p / (k n) ) },

where c′, c″, c_1, c_2, c_3 are some known constants. Then, there exist positive constants K, L and c_4 such that if n ≥ L max(s, k) log p, then with probability at least 1 − K/n, the error (θ̂ − θ*, ê − e*) is bounded by

‖θ̂ − θ*‖_2 + ‖ê − e*‖_2 ≤ c_4 max{ √k / ( c′ n^{1/2} √(log n) ), √( max(s, k) log p / n ) }.

Remarks. Nguyen and Tran [2011] analyze the specific case of the standard linear regression model (which nonetheless is a member of the GLM family), and provide the bound

‖θ̂ − θ*‖_2 + ‖ê − e*‖_2 ≤ c max{ √(s log p / n), √(k log n / n) },

which is asymptotically equivalent to the bound in Theorem 2. As we noted earlier, for the linear regression model both approaches of modeling outlier errors, in the parameter space or the output space, are equivalent, so that we could also compare the linear regression bound to our bound in Theorem 1. There too, the bounds can be seen to be asymptotically equivalent. We thus see that the generality of the GLM family does not adversely affect the ℓ2 norm convergence rates even when compared with the linear model.

5 A Tractable Optimization Method for the Output Space Modeling Approach

In this section we focus on the M-estimation problem (8) that arises in the second approach, where we model errors in the output space. Unfortunately, this is not a tractable optimization problem: in particular, the presence of the bilinear term e_i ⟨θ, x_i⟩ makes the objective function L_o non-convex. A tractable, seemingly-approximate method would be to solve for a local minimum of the objective by using a gradient descent based method. In particular, projected gradient descent (PGD) applied to the M-estimation problem (8) produces the iterates

(θ^{t+1}, e^{t+1}) ∈ argmin_{‖θ‖_1 ≤ a_0 √s, ‖e‖_1 ≤ b_0 √k} { ⟨θ, ∇_θ L_o(θ^t, e^t; Z_1^n)⟩ + (η/2) ‖θ − θ^t‖_2² + ⟨e, ∇_e L_o(θ^t, e^t; Z_1^n)⟩ + (η/2) ‖e − e^t‖_2² + λ_{n,θ} ‖θ‖_1 + λ_{n,e} ‖e‖_1 },

where η > 0 is a step-size parameter. Note that even though L_o is non-convex, the problem above is convex, and decouples in θ and e. Moreover, minimizing such a composite objective over the ℓ1 ball can be solved very efficiently by performing two projections onto the ℓ1 ball (see Agarwal et al. [2010] for instance for details).

While the projected gradient descent algorithm with the iterates above might be tractable, one concern might be that these iterates would at most converge to a local minimum, which might not satisfy the consistency and ℓ2 convergence rates outlined in Theorem 2. However, the following theorem shows that this concern is unwarranted: the iterates converge to a global minimum of the optimization problem in (8), up to an additive error that scales at most as the statistical error, (‖θ̂ − θ*‖_2² + ‖ê − e*‖_2²).

Theorem 3. Suppose all conditions of Theorem 2 hold and that n > c_0 (k + s) log p. Let F(θ, e) denote the objective function in (8) and let (θ̂, ê) be a global optimum of the problem. Then, when we apply the PGD steps above with an appropriate step-size η, there exist universal constants C_1, C_2 > 0 and a contraction coefficient γ < 1, independent of (n, p, s, k), such that

‖θ^t − θ̂‖_2² + ‖e^t − ê‖_2² ≤ C_1 ( ‖θ̂ − θ*‖_2² + ‖ê − e*‖_2² )

for all iterates t ≥ T, where

T = C_2 log( (F(θ^0, e^0) − F(θ̂, ê)) / δ² ) / log(1/γ),

with δ² denoting the statistical error term C_1(‖θ̂ − θ*‖_2² + ‖ê − e*‖_2²).
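The following sketch (our illustration under the stated assumptions, not the authors' released code) implements a simplified version of these PGD iterates for the quadratic choice of B from Section 4.1: a gradient step on L_o in (θ, e), soft-thresholding for the ℓ1 penalties, and a single Euclidean projection onto each ℓ1 ball of radii a_0√s and b_0√k, whereas the exact composite update can be computed with two ℓ1-ball projections per block, as noted above. The parameter step plays the role of 1/η.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def project_l1_ball(v, radius):
    """Euclidean projection onto {u : ||u||_1 <= radius} (Duchi et al.-style sort algorithm)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > (cssv - radius))[0][-1]
    tau = (cssv[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pgd_output_space(X, y, Aprime, lam_theta, lam_e, a0, b0, s, k,
                     step=0.1, iters=1000):
    """Simplified PGD for the output-space M-estimator (8) with B(y) = y^2 / 2.

    Gradients of L_o (see the display at the end of Section 4, with B'(y) = y):
      grad_theta = (1/n) X^T (A'(X theta) - (y - sqrt(n) e))
      grad_e_i   = (1/sqrt(n)) (<theta, x_i> - (y_i - sqrt(n) e_i))
    """
    n, p = X.shape
    theta, e = np.zeros(p), np.zeros(n)
    for _ in range(iters):
        r = y - np.sqrt(n) * e                            # "corrected" responses
        u = X @ theta
        g_theta = X.T @ (Aprime(u) - r) / n
        g_e = (u - r) / np.sqrt(n)
        theta = soft_threshold(theta - step * g_theta, step * lam_theta)
        e = soft_threshold(e - step * g_e, step * lam_e)
        theta = project_l1_ball(theta, a0 * np.sqrt(s))   # constraint ||theta||_1 <= a0 sqrt(s)
        e = project_l1_ball(e, b0 * np.sqrt(k))           # constraint ||e||_1 <= b0 sqrt(k)
    return theta, e

For instance, for logistic regression one would pass Aprime = lambda u: 1.0 / (1.0 + np.exp(-u)); the radii and the regularization parameters would be set as in Theorem 2, or tuned in practice.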
6 Experimental Results

In this section, we provide experimental validation, over both simulated as well as real data, of the performance of our M-estimators.

6.1 Simulation Studies

In this section, we provide simulations corroborating Theorems 1 and 2. The theorems are applicable to any distribution in the GLM family (1), and as canonical instances we consider the cases of logistic regression (3), Poisson regression (4), and exponential regression (5). (The case of the standard linear regression model under gross errors has been previously considered in Nguyen and Tran [2011].)

We instantiated our models as follows. We first randomly selected a subset S of {1, ..., p} of size √p as the support set (indexing non-zero values) of the true parameter θ*. We then set the nonzero elements, θ*_S, to be equal to ω, which we vary as noted in the plots. We then randomly generated n i.i.d. samples, {x_1, ..., x_n}, from the normal distribution N(0, I_{p×p}). Given each feature vector x_i, we drew the corresponding true class label ȳ_i from the corresponding GLM distribution. To simulate the worst instance of gross errors, we selected the k samples with the highest value of ⟨θ*, x_i⟩ and corrupted them as follows. For logistic regression, we simply flipped their class labels, to y_i = (1 − ȳ_i). For the Poisson and exponential regression models, the corrupted response y_i is obtained by adding a gross error term to ȳ_i. The learning algorithms were then given the corrupted dataset {(x_i, y_i)}_{i=1}^n. We scaled the number of corrupted samples k with the total number of samples n in three different ways: logarithmic scaling with k = Θ(log n), square-root scaling with k = Θ(√n), and linear scaling with k = Θ(n). For each tuple (n, p, s, k), we drew 50 batches of n samples, and plot the average.
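For concreteness, the following sketch (our rendering of the procedure just described; names such as make_corrupted_logistic_data are illustrative) generates one such corrupted dataset for the logistic case.

import numpy as np

def make_corrupted_logistic_data(n, p, k, omega, rng=None):
    """Sparse theta*, Gaussian covariates, logistic responses, then flip the labels
    of the k samples with the largest <theta*, x_i> to simulate worst-case outliers."""
    rng = rng or np.random.default_rng(0)
    s = int(np.sqrt(p))
    theta_star = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)     # random support S of size sqrt(p)
    theta_star[support] = omega                        # nonzero entries set to omega
    X = rng.standard_normal((n, p))                    # x_i ~ N(0, I_{p x p})
    eta = X @ theta_star
    y_clean = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    y = y_clean.copy()
    outliers = np.argsort(eta)[-k:]                    # k samples with largest <theta*, x_i>
    y[outliers] = 1 - y[outliers]                      # flip their class labels
    return X, y, theta_star, outliers

For the Poisson and exponential cases one would instead add a fixed gross-error term to the clean responses of the selected samples, as described above.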

Figure 1: Comparisons of the ℓ2 error norm for θ vs. n, with p = 196, for different regression cases: (a) logistic regression (top row, ω = 0.5), (b) Poisson regression (middle row, ω = 0.1), and (c) exponential regression (bottom row, ω = 0.1). Each panel compares the estimator trained without gross errors, the ℓ1-penalized GLM regression on corrupted data, and our two M-estimators (gross error in the parameter space and in the output space). Three different types of corruption are presented: k = Θ(log n) (left column), k = Θ(√n) (center column), and k = 0.1 n (right column).

Figure 1 plots the ℓ2 norm error ‖θ̂ − θ*‖_2 of the parameter estimates against the number of samples n. We compare three methods: (a) the standard ℓ1 penalized GLM MLE (e.g., "ℓ1 logistic reg."), which directly learns a GLM regression model over the corrupted data; (b) our first M-estimator (6), which models error in the parameter space ("Gross error in param"); and (c) our second M-estimator (8), which models error in the output space ("Gross error in output"). As a gold standard, we also include the performance of the standard ℓ1 penalized GLM regression over the uncorrupted version of the dataset, {(x_i, ȳ_i)}_{i=1}^n ("w/o gross error"). Note that the ℓ2 norm error is just on the parameter estimates, and we exclude the error in estimating the outliers e themselves, so that we can compare against the gold-standard GLM regression on the uncorrupted data.

While the M-estimation problem with gross errors in the output space is not convex, it can be seen that the projected gradient descent (PGD) iterates converge to the true θ*, corroborating Theorem 3. In the figure, the three rows correspond to the three different GLMs, and the three columns correspond to different outlier scalings, with logarithmic (first column), square-root (second column), and linear (third column) scalings of the number of outliers k as a function of the number of samples n. As the figure shows, the approaches modeling the outliers in the output and parameter spaces perform overwhelmingly better than the baseline ℓ1 penalized GLM regression estimator, and their error even approaches that of the estimator trained on uncorrupted data, even under settings where the number of outliers is a linear fraction of the number of samples. The approach modeling outliers in the output space seems preferable in some cases (logistic, exponential), while the approach modeling outliers in the parameter space seems preferable in others (Poisson).

Figure 2: Comparisons of the empirical prediction errors for different types of outliers on three real data examples: (a) australian, (b) german.numer, and (c) splice. In each panel, the horizontal axis ranges over the outlier regime (w/o, Log, Sqrt, Linear), and the compared methods are standard logistic regression, gross error modeled in the parameter space, and gross error modeled in the output space. Percentage of the samples used in the training dataset: 10% (left column), 50% (center column), and 100% (right column).

6.2 Real Data Examples

In this section, we evaluate the performance of our estimators on some real binary classification datasets obtained from LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We focused on the logistic regression case, and compared our two proposed approaches against standard logistic regression. Note that the datasets we consider have p < n. As Figure 2 shows, our robust estimators show particularly strong performance with more outliers, and/or where fewer samples are used for the training. We found the latter phenomenon interesting, and worthy of further research: robustness might help the performance of regression models even in the absence of outliers, by preventing overfitting.

7 Conclusion

We have provided a comprehensive analysis of statistical estimation of high-dimensional generalized linear models when a small number of the observations are arbitrarily corrupted.

References

A. Y. Ng. Feature selection, ℓ1 vs. ℓ2 regularization, and rotational invariance. In International Conference on Machine Learning, 2004.

N. H. Nguyen and T. D. Tran. Robust Lasso with missing and grossly corrupted observations. IEEE Transactions on Information Theory, 2011. Submitted.

P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, 1987.

P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, 1984.