On Robust Regression with High-Dimensional Predictors
Noureddine El Karoui∗, Derek Bean∗, Peter Bickel∗, Chinghway Lim†, and Bin Yu∗

∗University of California, Berkeley, and †National University of Singapore

Submitted to Proceedings of the National Academy of Sciences of the United States of America

We study regression M-estimates in the setting where $p$, the number of covariates, and $n$, the number of observations, are both large but $p \leq n$. We find an exact stochastic representation for the distribution of $\widehat{\beta} = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho(Y_i - X_i'\beta)$ at fixed $p$ and $n$ under various assumptions on the objective function $\rho$ and our statistical model. A scalar random variable whose deterministic limit $r_\rho(\kappa)$ can be studied when $p/n \to \kappa > 0$ plays a central role in this representation. Furthermore, we discover a non-linear system of two deterministic equations that characterizes $r_\rho(\kappa)$. Interestingly, the system shows that $r_\rho(\kappa)$ depends on $\rho$ through proximal mappings of $\rho$, as well as on various aspects of the statistical model underlying our study. Several classical results in statistics are upended. In particular, we show that when $p/n$ is large enough, least-squares becomes preferable to least-absolute-deviations for double exponential errors.

robust regression | prox function | high-dimensional statistics | concentration of measure

Abbreviations: EPE, expected prediction error; $\stackrel{\mathcal{L}}{=}$, equal in law; LAD, least absolute deviations; i.i.d., independent identically distributed; fidi, finite dimensional

In the "classical" period up to the 1980's, research on regression models focused on situations for which the number of covariates $p$ was much smaller than $n$, the sample size. Least squares regression (LSE) was the main fitting tool used, but its sensitivity to outliers came to the fore with the work of Tukey, Huber, Hampel and others starting in the 1950's.

Given the model $Y_i = X_i'\beta_0 + \epsilon_i$ and the M-estimation methods described in the abstract, it follows from the discussion in [7] (p. 170 for instance) that, if the design matrix $X$ (an $n \times p$ matrix whose $i$-th row is $X_i'$) is non-singular, then under various regularity conditions on $X$, $\rho$, $\psi = \rho'$ and the (i.i.d.) errors $\{\epsilon_i\}_{i=1}^{n}$, $\widehat{\beta}$ is asymptotically normal with mean $\beta_0$ and covariance matrix $C(\rho, \epsilon)(X'X)^{-1}$. Here $C(\rho, \epsilon) = E[\psi^2(\epsilon)]/[E(\psi'(\epsilon))]^2$ and $\epsilon$ has the same distribution as the $\epsilon_i$'s.

It follows that, for $p$ fixed, the relative efficiency of M-estimates such as LAD, with respect to LSE, does not depend on the design matrix. Thus, LAD has the same advantage over LSE for heavy-tailed distributions as the median has over the mean.

In recent years there has been great focus on the case where $p$ and $n$ are commensurate and large. Greatest attention has been paid to the "sparse" case, where the number of nonzero coefficients is much smaller than $n$ or $p$. This has been achieved by adding an $\ell_1$ type of penalty to the quadratic objective function of LSE, in the case of the LASSO. Unfortunately, these types of methods result in biased estimates of the coefficients, and statistical inference, as opposed to prediction, becomes problematic.

Huber [6] was the first to investigate the regime of large $p$ ($p \to \infty$ with $n$). His results were followed up by Portnoy [9] under weaker conditions. Huber showed that the behavior found for fixed $p$ persisted in regimes such as $p^2/n \to 0$ and $p^3/n \to 0$. That is, estimates of coefficients and contrasts were asymptotically Gaussian, and relative efficiencies of methods did not depend on the design matrix. His arguments were, in part, heuristic but well confirmed by simulation. He also pointed out a surprising feature of the regime $p/n \to \kappa > 0$ for LSE: fitted values were not asymptotically Gaussian. He was unable to deal with this regime otherwise; see the discussion on p. 802 of [6].

In this paper we intend to, in part heuristically and with "computer validation", analyze fully what happens in robust regression when $p/n \to \kappa < 1$. We do limit ourselves to Gaussian covariates, but present grounds that the behavior holds much more generally. We also investigate the sensitivity of our results to the geometry of the design matrix. We have chosen to use heuristics because we believe successful general proofs, by us or others, will require a great deal of time, and some questions may remain unresolved. We proceed in the manner of Huber [6], who also developed highly plausible results buttressed by simulations, many of which have not yet been established rigorously.

We find that:

1. the asymptotic normality and unbiasedness of estimates of coordinates and contrasts which, unlike fitted values, have coefficients independent of the observed covariates persist in these situations;

2. this happens at scale $n^{-1/2}$, as in the fixed-$p$ case, at least when the minimal and maximal eigenvalues of the covariance of the predictors stay bounded away from 0 and $\infty$, respectively¹.

These findings are obtained by

1. using intricate leave-one-out perturbation arguments, both for the data units and the predictors;

2. exhibiting a pair of master equations from which the asymptotic mean square prediction error and the correct expressions for asymptotic variances can be recovered;

3. showing that these two quantities depend in a nonlinear way on $p/n$, the error distribution, the design matrix and the form of the objective function, $\rho$.

It is worth noting that our findings upend the classical intuition that the "ideal" objective function is the negative log-density of the error distribution. That is, we show that when $p/n$ is large enough, it becomes preferable to use least-squares rather than LAD for double exponential errors. We illustrate this point in Figure 3 (and in the simulation sketch below).

The "Main results" section contains a detailed presentation of our results. We give some examples and supporting simulations in the "Examples" section. We present our derivation in the last section.

¹and the vector defining the contrasts has norm bounded away from 0 and $\infty$.
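As a concrete illustration of the LAD-versus-LSE reversal described above, here is a minimal simulation sketch. It is our illustration, not the authors' code: the sample size, the grid of $\kappa = p/n$ values, the replication count, and the use of statsmodels' QuantReg as an off-the-shelf LAD solver are all arbitrary choices (QuantReg may also emit convergence warnings at the largest values of $p/n$).

```python
# Compare the squared estimation error of least squares (LSE) and least
# absolute deviations (LAD) under double-exponential (Laplace) errors
# as p/n grows. Illustrative sketch only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_rep = 200, 20

for kappa in (0.1, 0.3, 0.5, 0.7):
    p = int(kappa * n)
    err_lse, err_lad = [], []
    for _ in range(n_rep):
        X = rng.standard_normal((n, p))       # Gaussian design, Sigma = Id_p
        beta0 = np.zeros(p)                   # beta0 = 0, WLOG by Lemma 1 below
        y = X @ beta0 + rng.laplace(size=n)   # double-exponential errors
        b_lse = np.linalg.lstsq(X, y, rcond=None)[0]
        b_lad = sm.QuantReg(y, X).fit(q=0.5).params  # LAD = median regression
        err_lse.append(np.sum((b_lse - beta0) ** 2))
        err_lad.append(np.sum((b_lad - beta0) ** 2))
    print(f"p/n = {kappa:.1f}: LSE {np.mean(err_lse):.2f}, "
          f"LAD {np.mean(err_lad):.2f}")
```

If the claim holds, LAD should win at small $\kappa$, as classical efficiency theory for Laplace errors predicts, and lose to LSE once $\kappa$ is large enough.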
Main results

We consider the following robust regression problem: let $\widehat{\beta}$ be

$$\widehat{\beta} = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho(Y_i - X_i'\beta)\,. \qquad [1]$$

Here $X_i \in \mathbb{R}^p$, $Y_i = X_i'\beta_0 + \epsilon_i$, where $\beta_0 \in \mathbb{R}^p$, $\epsilon_i$ is a random (scalar) error independent of the vector $X_i \in \mathbb{R}^p$, and $\rho$ is a convex function. We assume that the sequences $\{\epsilon_i\}_{i=1}^{n}$ and $\{X_i\}_{i=1}^{n}$ are independent of one another. Furthermore, we assume that the $X_i$'s are independent. Our aim is to characterize the distribution of $\widehat{\beta}$. As we will discuss later, our approach is not limited to this "standard" robust regression setting: we can, for instance, shed light on similar questions in weighted regression.

The following lemma is easily shown by using the rotational invariance of the Gaussian distribution (see SI).

Lemma 1. Suppose that $X_i = \lambda_i \mathcal{X}_i$, where the $\mathcal{X}_i$'s are $n$ i.i.d. $\mathcal{N}(0, \Sigma)$, with $\Sigma$ of rank $p$, and $\{\lambda_i\}_{i=1}^{n}$ are (non-zero) scalars, independent of $\{\mathcal{X}_i\}_{i=1}^{n}$. Call $\widehat{\beta}(\rho, \beta_0, \Sigma)$ the solution of Equation [1]. When $n > p$, we have the stochastic representation

$$\widehat{\beta}(\rho, \beta_0, \Sigma) \stackrel{\mathcal{L}}{=} \beta_0 + \|\widehat{\beta}(\rho, \beta_0, \mathrm{Id}_p) - \beta_0\|\,\Sigma^{-1/2} u\,,$$

where $u$ is uniform on the sphere of radius 1 in $\mathbb{R}^p$ and is independent of $\|\widehat{\beta}(\rho, \beta_0, \mathrm{Id}_p) - \beta_0\|$. Furthermore, $\widehat{\beta}(\rho, \beta_0, \mathrm{Id}_p) - \beta_0 \stackrel{\mathcal{L}}{=} \widehat{\beta}(\rho, 0, \mathrm{Id}_p)$.
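Lemma 1 lends itself to a quick Monte Carlo sanity check. The sketch below is ours, not the paper's: it takes $\lambda_i \equiv 1$, $\rho(x) = |x|$ (LAD), and small $n$, $p$ purely for illustration. By the lemma, $\|\Sigma^{1/2}(\widehat{\beta}(\rho, \beta_0, \Sigma) - \beta_0)\|$ should have the same law as $\|\widehat{\beta}(\rho, 0, \mathrm{Id}_p)\|$; the code compares the two empirical distributions through their means.

```python
# Monte Carlo check of Lemma 1's stochastic representation (illustrative).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p, reps = 150, 30, 50

A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)   # an arbitrary positive-definite covariance
L = np.linalg.cholesky(Sigma)     # L @ L.T == Sigma, so ||L.T @ v|| == ||Sigma^{1/2} v||
beta0 = rng.standard_normal(p)

def lad_fit(X, y):
    """Solve Eq. [1] with rho(x) = |x|, i.e. median regression."""
    return sm.QuantReg(y, X).fit(q=0.5).params

norms_general, norms_identity = [], []
for _ in range(reps):
    # General design: rows of X are N(0, Sigma), true coefficient beta0.
    X = rng.standard_normal((n, p)) @ L.T
    y = X @ beta0 + rng.laplace(size=n)
    norms_general.append(np.linalg.norm(L.T @ (lad_fit(X, y) - beta0)))
    # Reference design: identity covariance and beta0 = 0.
    X0 = rng.standard_normal((n, p))
    y0 = rng.laplace(size=n)
    norms_identity.append(np.linalg.norm(lad_fit(X0, y0)))

# The two averages should agree up to Monte Carlo error.
print(np.mean(norms_general), np.mean(norms_identity))
```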
The asymptotic normality (in the fidi convergence sense) and unbiasedness of $\widehat{\beta}$ is a consequence of Result 1 and Lemma 1. Note the complicated interaction of $\rho$, the distribution of $\epsilon$, the distribution of the $X_i$'s, and $\kappa$ in determining $r_\rho(\kappa)$. In the $p$ fixed ($\kappa = 0$) case, with $X_i \sim \mathcal{N}(0, \Sigma)$, the contribution of the design is just $\Sigma^{-1}$, which determines the correlation matrix of $\widehat{\beta}$. In general the correlation structure is the same, but the variances also depend on the design.

As our arguments will show, we expect that the results concerning $r_\rho(\kappa)$ detailed in Result 1 will hold when the assumptions of normality on $\{X_i\}_{i=1}^{n}$ are replaced by assumptions on concentration of quadratic forms in $X_i$. Results on fidi convergence of $\widehat{\beta}$ also appear likely to hold under these weakened restrictions.

The difference between the systems of equations characterizing $r_\rho(\kappa)$ in Result 1 and Corollary 1 highlights the importance of the geometry of the predictors, $X_i$, in the results. As a matter of fact, if we consider the case where $\Sigma = \mathrm{Id}_p$ and the $\lambda_i$'s are i.i.d. with $E(\lambda_i^2) = 1$, then in both situations the $X_i$'s have covariance $\mathrm{Id}_p$ and are nearly orthogonal to one another; however, in the setting of Result 1, $\|X_i\|/\sqrt{p}$ is close to $|\lambda_i|$, hence variable with $i$, whereas in the setting of Corollary 1, the $X_i$'s all have almost the same norm and hence are near a sphere (a short computation below makes this precise). The importance of the geometry of the vectors of predictors in this situation is hence a generalization of similar phenomena that were highlighted in [3], for instance. Further examples of a different nature, detailed in [4], p. 27, illustrate the fundamental importance of our implicit geometric assumptions on the design matrix.

Our analysis also extends to the case where $\rho$ is replaced
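For completeness, here is the norm computation behind the elliptical-versus-spherical distinction drawn in the geometry discussion above. It is a standard law-of-large-numbers step (our addition, stated for $\Sigma = \mathrm{Id}_p$), not an excerpt from the paper.

```latex
% Why ||X_i||/sqrt(p) is close to |lambda_i| when X_i = lambda_i * Xcal_i
% with Xcal_i ~ N(0, Id_p):
\[
\frac{\lVert X_i \rVert}{\sqrt{p}}
  = \lvert \lambda_i \rvert \, \frac{\lVert \mathcal{X}_i \rVert}{\sqrt{p}},
\qquad
\frac{\lVert \mathcal{X}_i \rVert^2}{p}
  = \frac{1}{p} \sum_{j=1}^{p} \mathcal{X}_{ij}^2
  \;\xrightarrow{\ \mathrm{a.s.}\ }\; 1
  \quad (p \to \infty),
\]
% so ||X_i||/sqrt(p) -> |lambda_i|: the predictors of Result 1 sit at radii
% that vary with i (an "elliptical" cloud), while for lambda_i = 1 they
% concentrate near a single sphere of radius sqrt(p), as in Corollary 1.
```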