
On robust regression with high-dimensional predictors

Noureddine El Karoui*, Derek Bean*, Peter Bickel*, Chinghway Lim†, and Bin Yu*

*University of California, Berkeley, and †National University of Singapore

Submitted to Proceedings of the National Academy of Sciences of the United States of America

We study regression M-estimates in the setting where p, the number of covariates, and n, the number of observations, are both large but p ≤ n. We find an exact stochastic representation for the distribution of β̂ = argmin_{β ∈ ℝ^p} Σ_{i=1}^n ρ(Y_i − X_i'β) at fixed p and n under various assumptions on the objective function ρ and our statistical model. A scalar random variable whose deterministic limit r_ρ(κ) can be studied when p/n → κ > 0 plays a central role in this representation. Furthermore, we discover a non-linear system of two deterministic equations that characterizes r_ρ(κ). Interestingly, the system shows that r_ρ(κ) depends on ρ through proximal mappings of ρ as well as various aspects of the statistical model underlying our study. Several classical results are upended. In particular, we show that when p/n is large enough, least-squares becomes preferable to least-absolute-deviations for double exponential errors.

robust regression | prox function | high-dimensional statistics | concentration of measure

Abbreviations: EPE, expected prediction error; =_L, equal in law; LAD, least absolute deviations; i.i.d, independent identically distributed; fidi, finite dimensional

In the "classical" period up to the 1980's, research on regression models focused on situations for which the number of covariates p was much smaller than n, the sample size. Least squares regression (LSE) was the main fitting tool used, but its sensitivity to outliers came to the fore with the work of Tukey, Huber, Hampel and others starting in the 1950's.

Given the model Y_i = X_i'β_0 + ε_i and M-estimation methods described in the abstract, it follows from the discussion in [7] (p. 170 for instance) that, if the design matrix X (an n×p matrix whose i-th row is X_i') is non-singular, under various regularity conditions on X, ρ, ψ = ρ' and the (i.i.d) errors {ε_i}_{i=1}^n, β̂ is asymptotically normal with mean β_0 and covariance matrix C(ρ, ε)(X'X)^{-1}. Here C(ρ, ε) = E[ψ²(ε)]/[E(ψ'(ε))]² and ε has the same distribution as the ε_i's.

It follows that, for p fixed, the relative efficiency of M-estimates such as LAD, relative to LSE, does not depend on the design matrix. Thus, LAD has the same advantage over LSE for heavy tailed distributions as the median has over the mean.
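As a quick numerical illustration of this classical fixed-p intuition, here is a small simulation sketch (ours, not from the paper) comparing the Monte Carlo variance of the sample mean and the sample median for double exponential (Laplace) errors; the ratio of roughly 2 is what drives LAD's classical advantage over LSE for this error distribution.

    import numpy as np

    # Sketch (ours): for Laplace errors the sample median is about twice as efficient
    # as the sample mean, mirroring the classical advantage of LAD over LSE when p << n.
    rng = np.random.default_rng(0)
    reps, n = 5000, 200
    samples = rng.laplace(loc=0.0, scale=1.0, size=(reps, n))   # variance 2
    var_mean = np.var(samples.mean(axis=1))
    var_median = np.var(np.median(samples, axis=1))
    print(var_mean / var_median)   # close to 2 for large n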
In recent years there has been great focus on the case where p and n are commensurate and large. Greatest attention has been paid to the "sparse" case where the number of nonzero coefficients is much smaller than n or p. This has been achieved by adding an ℓ1-type penalty to the quadratic objective function of LSE in the case of the LASSO. Unfortunately, these types of methods result in biased estimates of the coefficients, and inference, as opposed to prediction, becomes problematic.

Huber [6] was the first to investigate the regime of large p (p → ∞ with n). His results were followed up by Portnoy [9] under weaker conditions. Huber showed that the behavior found for fixed p persisted in regimes such as p²/n → 0 and p³/n → 0. That is, estimates of coefficients and contrasts were asymptotically Gaussian and relative efficiencies of methods did not depend on the design matrix. His arguments were, in part, heuristic but well confirmed by simulation. He also pointed out a surprising feature of the regime p/n → κ > 0 for LSE: fitted values were not asymptotically Gaussian. He was unable to deal with this regime otherwise; see the discussion on p. 802 of [6].

In this paper we intend to, in part heuristically and with "computer validation", analyze fully what happens in robust regression when p/n → κ < 1. We do limit ourselves to Gaussian covariates but present grounds that the behavior holds much more generally. We also investigate the sensitivity of our results to the geometry of the design matrix. We have chosen to use heuristics because we believe successful general proofs by us or others will require a great deal of time and perhaps remain unresolved. We proceed in the manner of Huber [6], who also developed highly plausible results buttressed by simulations, many of which have not yet been established rigorously.

We find that:

1. the asymptotic normality and unbiasedness of estimates of coordinates and contrasts, which, unlike fitted values, have coefficients independent of the observed covariates, persist in these situations;
2. this happens at scale n^{-1/2}, as in the fixed p case, at least when the minimal and maximal eigenvalues of the covariance of the predictors stay bounded away from 0 and ∞ respectively (and the vector defining the contrasts has norm bounded away from 0 and ∞).

These findings are obtained by

1. using intricate leave-one-out perturbation arguments, both for the units and the predictors;
2. exhibiting a pair of master equations from which the asymptotic square prediction error and the correct expressions for asymptotic variances can be recovered;
3. showing that these two quantities depend in a nonlinear way on p/n, the error distribution, the design matrix and the form of the objective function, ρ.

It is worth noting that our findings upend the classical intuition that the "ideal" objective function is the negative log-density of the error distribution. That is, we show that when p/n is large enough, it becomes preferable to use least-squares rather than LAD for double exponential errors. We illustrate this point in Figure 3.

The "Main results" section contains a detailed presentation of our results. We give some examples and supporting simulations in the "Examples" section. We present our derivation in the last section.

Main results

We consider the following robust regression problem: let β̂ be

    β̂ = argmin_{β ∈ ℝ^p} Σ_{i=1}^n ρ(Y_i − X_i'β) .      [1]

Here X_i ∈ ℝ^p, Y_i = X_i'β_0 + ε_i, where β_0 ∈ ℝ^p and ε_i is a random (scalar) error independent of the vector X_i ∈ ℝ^p. ρ is a convex function. We assume that the pairs {ε_i}_{i=1}^n and {X_i}_{i=1}^n are independent. Furthermore, we assume that the X_i's are independent. Our aim is to characterize the distribution of β̂. As we will discuss later, our approach is not limited to this "standard" robust regression setting: we can, for instance, shed light on similar questions in weighted regression.

The following lemma is easily shown by using the rotational invariance of the Gaussian distribution (see SI).

Lemma 1. Suppose that X_i = λ_i X̄_i, where the X̄_i's are i.i.d N(0, Σ), with Σ of rank p, and {λ_i}_{i=1}^n are (non-zero) scalars, independent of {X̄_i}_{i=1}^n. Call β̂(ρ; β_0, Σ) the solution of Equation [1]. (We write β̂(ρ; β_0, Σ) instead of β̂(ρ; β_0, Σ; {ε_i}_{i=1}^n, {λ_i}_{i=1}^n) for simplicity.) When n > p, we have the stochastic representation

    β̂(ρ; β_0, Σ) =_L β_0 + ‖β̂(ρ; β_0, Id_p) − β_0‖ Σ^{-1/2} u ,

where u is uniform on the sphere of radius 1 in ℝ^p and is independent of ‖β̂(ρ; β_0, Id_p) − β_0‖. Furthermore, β̂(ρ; β_0, Id_p) − β_0 =_L β̂(ρ; 0, Id_p).

In light of this result, it is clear that we just need to understand the distribution of β̂(ρ; 0, Id_p) to understand that of β̂(ρ; β_0, Σ).

Result 1. Suppose that ρ is a non-linear convex function. Let us call r_ρ(p, n) = ‖β̂(ρ; 0, Id_p)‖. We assume that X_i = λ_i X̄_i, where the X̄_i are i.i.d N(0, Id_p) and {λ_i}_{i=1}^n are non-zero scalars, independent of {X̄_i}_{i=1}^n. We also assume that Y_i = ε_i (i.e β_0 = 0) and {ε_i}_{i=1}^n are independent of {X_i}_{i=1}^n.

Then, under regularity conditions on {ε_i}_{i=1}^n, {λ_i}_{i=1}^n and ρ, r_ρ(p, n) has a deterministic limit in probability as p and n tend to infinity while p/n → κ < 1. We call this limit r_ρ(κ).

Let us call ẑ_ε(i) = ε_i + λ_i r_ρ(κ) Z_i, where the Z_i ∼ N(0, 1) are i.i.d and independent of {ε_i}_{i=1}^n and {λ_i}_{i=1}^n. We can determine r_ρ(κ) through solving

    lim_{n→∞} (1/n) Σ_{i=1}^n E[ (prox_{cλ_i²}(ρ))'(ẑ_ε(i)) ] = 1 − κ ,
    lim_{n→∞} (1/n) Σ_{i=1}^n E[ λ_i^{-2} ( ẑ_ε(i) − prox_{cλ_i²}(ρ)(ẑ_ε(i)) )² ] = κ r_ρ²(κ) ,      [S1]

where c is a positive deterministic constant to be determined from the above system. (The expectations above are taken with respect to the joint distribution of {ε_i}_{i=1}^n, {λ_i}_{i=1}^n and {Z_i}_{i=1}^n.)

The prox abbreviation refers to the proximal mapping, which is standard in convex optimization (see [8]). One of its definitions is prox_c(ρ)(x) = argmin_{y ∈ ℝ} [ ρ(y) + (x − y)²/(2c) ].
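Result 1 can be probed by simulation. The sketch below (our illustration, not from the paper) uses ρ(x) = x²/2, chosen only because it needs no iterative solver, and checks that r_ρ(p, n) = ‖β̂(ρ; 0, Id_p)‖ barely fluctuates across replications once p and n are large with p/n = κ fixed; Result 1 itself covers general convex ρ.

    import numpy as np

    # Sketch (ours): with beta_0 = 0, X_i i.i.d. N(0, Id_p) and rho(x) = x^2/2,
    # ||hat_beta|| is essentially the same across independent replications.
    rng = np.random.default_rng(0)
    n, p = 2000, 600                          # kappa = p/n = 0.3
    norms = []
    for _ in range(20):
        X = rng.standard_normal((n, p))       # lambda_i = 1 for all i
        eps = rng.standard_normal(n)          # Y_i = eps_i since beta_0 = 0
        beta_hat, *_ = np.linalg.lstsq(X, eps, rcond=None)
        norms.append(np.linalg.norm(beta_hat))
    print(np.mean(norms), np.std(norms) / np.mean(norms))   # small relative spread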
Corollary 1 (Important special case). When λ_i² = 1 for all i, and the ε_i's are i.i.d, the same conclusions hold but the system characterizing r_ρ(κ) becomes: if ẑ_ε = ε + r_ρ(κ)Z, where ε has the same distribution as the ε_i and is independent of Z ∼ N(0, 1),

    E[ (prox_c(ρ))'(ẑ_ε) ] = 1 − κ ,
    E[ ( ẑ_ε − prox_c(ρ)(ẑ_ε) )² ] = κ r_ρ²(κ) .      [S2]

The asymptotic normality (in the fidi convergence sense) and unbiasedness of β̂ is a consequence of Result 1 and Lemma 1. Note the complicated interplay of ρ, the distribution of ε, the distribution of the X_i's and κ in determining r_ρ(κ). In the p fixed (κ = 0) case, with X_i ∼ N(0, Σ), the contribution of the design is just Σ^{-1}, which determines the correlation matrix of β̂. In general the correlation structure is the same but the variances also depend on the design.

As our arguments will show, we expect that the results concerning r_ρ(κ) detailed in Result 1 will hold when the assumptions of normality on {X_i}_{i=1}^n are replaced by assumptions on concentration of quadratic forms in X_i. Results on fidi convergence of β̂ also appear likely to hold under these weakened restrictions.

The difference between the systems of equations characterizing r_ρ(κ) in Result 1 and Corollary 1 highlights the importance of the geometry of the predictors, X_i, in the results. As a matter of fact, if we consider the case where Σ = Id_p and the λ_i's are i.i.d with E[λ_i²] = 1, in both situations the X_i's have covariance Id_p and are nearly orthogonal to one another; however, in the setting of Result 1, ‖X_i‖/√p is close to |λ_i|, hence variable with i, whereas in the setting of Corollary 1, the X_i's all have almost the same norm and hence are near a sphere. The importance of the geometry of the vectors of predictors in this situation is hence a generalization of similar phenomena that were highlighted in [3] for instance. Further examples of a different nature detailed in [4], p.27, illustrate the fundamental importance of our implicit geometric assumptions on the design matrix.

Our analysis also extends to the case where ρ is replaced by ρ_i, where for instance ρ_i = w_i ρ (the weighted regression case), as long as {w_i}_{i=1}^n is independent of {X_i}_{i=1}^n. (One simply needs to replace ρ by ρ_i in the system [S1] above and take expectation with respect to these quantities, too.) We refer the interested reader to [4], p.26.
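System [S2] can be solved numerically for essentially any convex ρ once its proximal mapping is available. The following minimal sketch (ours; solve_S2, prox_ls and dprox_ls are illustrative names, not the paper's) nests two scalar root-finders around Monte Carlo versions of the two expectations; it is demonstrated on ρ(x) = x²/2, whose prox is x/(1 + c), so the output can be checked against the closed form r²(κ) = κσ²/(1 − κ) given in the Examples section below.

    import numpy as np
    from scipy.optimize import brentq

    # Sketch (ours): solve system [S2] by Monte Carlo, with z = eps + r*Z, Z ~ N(0,1).
    def solve_S2(prox, dprox, eps_sample, kappa, n_mc=200_000, seed=0):
        rng = np.random.default_rng(seed)
        eps = rng.choice(eps_sample, size=n_mc, replace=True)
        Z = rng.standard_normal(n_mc)

        def c_of_r(r):
            z = eps + r * Z
            # first equation: E[(prox_c)'(z)] = 1 - kappa
            return brentq(lambda c: np.mean(dprox(z, c)) - (1 - kappa), 1e-8, 1e4)

        def second_eq(r):
            z = eps + r * Z
            c = c_of_r(r)
            # second equation: E[(z - prox_c(z))^2] - kappa * r^2 = 0
            return np.mean((z - prox(z, c)) ** 2) - kappa * r ** 2

        r = brentq(second_eq, 1e-6, 50.0)
        return r, c_of_r(r)

    # Demonstration on rho(x) = x^2/2, whose prox is prox_c(x) = x / (1 + c):
    prox_ls = lambda z, c: z / (1 + c)
    dprox_ls = lambda z, c: np.full_like(z, 1.0 / (1 + c))
    sigma, kappa = 1.0, 0.5
    eps_sample = sigma * np.random.default_rng(1).standard_normal(100_000)
    r, c = solve_S2(prox_ls, dprox_ls, eps_sample, kappa)
    print(r ** 2, kappa * sigma ** 2 / (1 - kappa), c)   # r^2 near kappa*sigma^2/(1-kappa), c near kappa/(1-kappa)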

Examples

We illustrate the quality of our results on a few numerical examples, showing the importance of both the objective function and the distribution of the errors in the behavior of r_ρ(κ). For simplicity, we focus only on the case where λ_i² = 1 for all i, i.e the case of Gaussian predictors (an example with λ_i random is in the SI). We also assume that the ε_i's are i.i.d.

Least-squares. In this case, ρ(x) = x²/2 and ψ(x) = x. Hence, prox_c(ρ)(x) = x/(1 + c). Elementary computations then show that c = κ/(1 − κ). We also find that r_{ℓ2}²(κ) = κ/(1 − κ) σ², where σ² is the variance of ε. Naturally, in the case of least-squares, one can use results concerning the Wishart distribution [2] as well as the explicit form of β̂ to verify mathematically that this expression is correct. We also note that in this case, the distribution of ε does not matter, only its variance.

Median regression (LAD). This case, where ρ takes values ρ(x) = |x|, is substantially more interesting and reveals the importance of the interaction between objective function and error distribution. Clearly, we first have to compute the prox of the function ρ. It is well-known and not difficult to show that this prox is the soft-thresholding function. More formally, using the notation x_+ = max(x, 0), we have, for any t > 0, prox_t(ρ)(y) = sign(y)(|y| − t)_+. In this subsection, we use the notation r_{ℓ1} instead of r_ρ.

Case of Gaussian errors. Let us call s² = r_{ℓ1}²(κ) + σ². When the ε_i's are i.i.d N(0, σ²), ẑ_ε ∼ N(0, s²). The first equation of our system [S2] therefore becomes P(|Z| > c/s) = 1 − κ, where Z ∼ N(0, 1). Hence, c/s = Φ^{-1}((1 + κ)/2), where Φ^{-1} is the quantile function of the standard normal distribution.

We now turn our attention to the second equation in the system [S2]. We have [y − prox_t(ρ)(y)]² = y² 1_{|y| ≤ t} + t² 1_{|y| ≥ t}. Using the fact that c/s = Φ^{-1}((1 + κ)/2), computations show that the second equation in the system [S2] becomes

    κ r_{ℓ1}²(κ) = s² h²(κ) + c²(1 − κ) = s² [ h²(κ) + (1 − κ)(Φ^{-1}((1 + κ)/2))² ] ,

where h is the function such that, for t ∈ [0, 1],

    h²(t) = t − √(2/π) Φ^{-1}((1 + t)/2) exp( −[Φ^{-1}((1 + t)/2)]²/2 ) .

Finally, calling ζ the function such that, for t ∈ [0, 1], if ϕ denotes the standard normal density,

    ζ(t) = 2 Φ^{-1}(t) ( ϕ[Φ^{-1}(t)] − Φ^{-1}(t)(1 − t) ) ,

further manipulations show that we can solve for s as a function of κ and therefore for r_{ℓ1}(κ). Our final expression is that, when the ε_i's are i.i.d N(0, σ²),

    r_{ℓ1}²(κ) = σ² ( κ − ζ([1 + κ]/2) ) / ζ([1 + κ]/2) .

Figure 1 compares this expression for r_{ℓ1}²(κ) to E[r_{ℓ1}²(p, n)] obtained by simulations. The comparison is done by computing relative errors. (A figure comparing the actual values, which are also of interest, is in the SI.)

Fig. 1. Relative errors |E[r_{ℓ1}²(p, n)]/r_{ℓ1}²(κ) − 1| as a function of p/n, Gaussian errors, 1000 simulations, n = 1000.
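For reference, here is a short sketch (ours) that evaluates the closed-form expression for r_{ℓ1}²(κ) above and prints the least-squares value κσ²/(1 − κ) from the previous subsection alongside it.

    import numpy as np
    from scipy.stats import norm

    # Sketch (ours): closed-form r_l1^2(kappa) for LAD with N(0, sigma^2) errors,
    # using the function zeta defined above, versus the least-squares value.
    def zeta(t):
        q = norm.ppf(t)
        return 2 * q * (norm.pdf(q) - q * (1 - t))

    def r2_lad_gaussian(kappa, sigma2=1.0):
        z = zeta((1 + kappa) / 2)
        return sigma2 * (kappa - z) / z

    for kappa in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(kappa, r2_lad_gaussian(kappa), kappa / (1 - kappa) * 1.0)

As κ → 0 the ratio of the two printed values approaches the classical asymptotic efficiency factor π/2 of LAD relative to least-squares for Gaussian errors, which is a useful sanity check on the formula.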

Case of errors with symmetric distribution. We call f_{r,ε} the density of ẑ_ε and drop the dependence of r_ρ(κ) on ρ and κ from our notations for simplicity. The first equation of system [S2] still reads P(|ẑ_ε| > c) = 1 − κ. Let us call F_{r,ε} the cdf of ẑ_ε and F̄_{r,ε} = 1 − F_{r,ε}. Let us denote by F̄_{r,ε}^{-1} the functional inverse of F̄_{r,ε}.

Integration by parts, symmetry of f_{r,ε}, as well as the above characterization of c, finally lead to the implicit characterization of r_{ℓ1}²(κ) (denoted simply by r² for short in the next equation)

    (1 − κ) r² = 4 ∫_{F̄_{r,ε}^{-1}((1−κ)/2)}^{∞} x F̄_{r,ε}(x) dx − σ² .      [2]

We note in passing that r² + σ² = 4 ∫_0^∞ x F̄_{r,ε}(x) dx; therefore the previous equation can be rewritten κ r² = 4 ∫_0^{F̄_{r,ε}^{-1}((1−κ)/2)} x F̄_{r,ε}(x) dx, a convenient equation to work with numerically when κ is small.

Case of double exponential errors. We now present a comparison of simulation results to numerical solutions of System [S2] when the errors are double exponential. It should be noted that in this case the cdf F_{r,ε} takes values

    F_{r,ε}(t) = Φ(t/r) + (e^{r²/2}/2) [ e^{t} Φ(−(t + r²)/r) − e^{−t} Φ((t − r²)/r) ] .

It is also clear in this case that σ² = 2. We used all this information to solve Equation [2] for r, by doing a dichotomous search. Figure 2 illustrates our results by showing the relative errors between E[r_{ℓ1}²(p, n)] (computed from simulations) and numerical solutions of system [S2] with appropriate parameters. (A figure comparing the actual values is in the SI.)

Fig. 2. Relative errors |E[r_{ℓ1}²(p, n)]/r_{ℓ1}²(κ) − 1| as a function of p/n, double exponential errors, 1000 simulations, n = 1000.
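The dichotomous search on Equation [2] is easy to reproduce. The sketch below (ours) takes a slightly different but equivalent route: it solves system [S2] directly for ρ(x) = |x| by Monte Carlo, using the fact that for the soft-thresholding prox one has (prox_c)'(ẑ) = 1_{|ẑ|>c} almost everywhere and (ẑ − prox_c(ẑ))² = min(ẑ², c²); the least-squares value κσ²/(1 − κ) with σ² = 2 is printed alongside, which is the comparison behind Figures 2 and 3.

    import numpy as np
    from scipy.optimize import brentq

    # Sketch (ours): system [S2] for LAD with double exponential errors (sigma^2 = 2).
    rng = np.random.default_rng(0)
    n_mc = 400_000
    eps = rng.laplace(scale=1.0, size=n_mc)          # density exp(-|x|)/2, variance 2
    Z = rng.standard_normal(n_mc)

    def second_eq(r, kappa):
        z = eps + r * Z
        # first equation of [S2]: P(|z| > c) = 1 - kappa
        c = np.quantile(np.abs(z), kappa)
        # second equation: E[(z - prox_c(z))^2] = E[min(z^2, c^2)] should equal kappa * r^2
        return np.mean(np.minimum(z ** 2, c ** 2)) - kappa * r ** 2

    for kappa in (0.1, 0.3, 0.5):
        r = brentq(second_eq, 1e-6, 100.0, args=(kappa,))
        print(kappa, r ** 2, kappa / (1 - kappa) * 2.0)   # r_l1^2 vs the least-squares value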

Other objective functions. We have carried out similar computations and validations of results for other objective functions, including the Huber objective functions, the objective functions appearing in quantile regression, as well as ℓ3 and ℓ1.5 objective functions, the latter two more for their analytical tractability than for their statistical interest. We refer the reader to [4] for details.

Further remarks. The characterizations of r_ρ(κ) allow us to compare the performance of various regression methods for various error distributions. One mathematical and statistical consequence is that we can optimize over ρ to minimize r_ρ(κ) when the distribution of the errors is given and log-concave and we are in the setup of Gaussian predictors. We have done this in the companion paper [1].

Quite independently, we can investigate the performance of, say, median regression vs least-squares for a range of values of κ. In the case of double exponential errors, it is well-known (see e.g [7]) that median regression is twice as efficient as least-squares when κ is close to 0. As our simulations and computations illustrate, this is not the case when κ is not close to zero. Indeed, when κ > .3 or so, r_{ℓ2}(κ) < r_{ℓ1}(κ) for double exponential errors. This should serve as caution against using "natural" maximum-likelihood methods in high dimension, since they turn out to be suboptimal even in apparently favorable situations.

Fig. 3. Prediction (i.e r_{ℓ1}²(κ)/r_{ℓ2}²(κ)) vs realized value of E[r_{ℓ1}²(p, n)]/E[r_{ℓ2}²(p, n)], double exponential errors, n = 1000. Surprisingly, according to this measure, it becomes preferable to use ordinary least-squares rather than ℓ1-regression when the errors are double-exponential and κ is sufficiently large.

Derivation

We now turn our attention to the derivation of the system of equations [S1] presented in Result 1. Our approach hinges on a "double leave-one-out" approach, the use of concentration properties of certain quadratic forms and the Sherman-Morrison-Woodbury formula of linear algebra.

We focus on the case β_0 = 0 and Σ = Id_p. Lemma 1 guarantees that we can do so without loss of generality. Note that in this case Y_i = ε_i. We call β̂(ρ; 0, Id_p) simply β̂ from now on. We also assume that ρ has two derivatives.

We call the residuals R_i = ε_i − X_i'β̂ and use the notation X_{(i)} = {X_j}_{j≠i}. Recall that ψ = ρ'. We note that under our assumptions β̂ satisfies the gradient equation

    Σ_{i=1}^n X_i ψ(ε_i − X_i'β̂) = 0 .      [3]

In the derivations that follow, we will use repeatedly the fact that if the X_i are i.i.d N(0, Id_p) and A_p is a sequence of deterministic symmetric matrices, under mild conditions on the growth of trace(A_p^k) with k ∈ ℕ, we have as n and p grow

    sup_{i=1,...,n} | X_i'A_p X_i / p − trace(A_p)/p | = o_P(1) .

Many methods can be used to show this concentration result. A particularly simple one is to compute the second and fourth cumulants of X_i'A_p X_i. It shows that the result holds as soon as trace(A_p^4) p^{-4} + (trace(A_p^2) p^{-2})² = o(1/n), a mild condition. This concentration result is easily extended to the case where A_p is random but independent of X_i. The previous result also extends easily to X_i = λ_i X̄_i, under mild conditions on the λ_i's, to yield

    sup_{i=1,...,n} | X_i'A_p X_i / p − λ_i² trace(A_p)/p | = o_P(1) .      [4]
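This concentration statement is easy to see numerically. A minimal sketch (ours), in the λ_i = 1 case, with a fixed, well-behaved symmetric matrix standing in for the deterministic sequence A_p:

    import numpy as np

    # Sketch (ours): X_i' A_p X_i / p is uniformly close to trace(A_p) / p over all i.
    rng = np.random.default_rng(0)
    n, p = 1000, 500
    A = np.diag(rng.uniform(0.5, 1.5, size=p))     # a simple symmetric stand-in for A_p
    X = rng.standard_normal((n, p))                # rows are X_i ~ N(0, Id_p)
    quad = ((X @ A) * X).sum(axis=1) / p           # the quadratic forms X_i' A X_i / p
    print(np.trace(A) / p, np.max(np.abs(quad - np.trace(A) / p)))   # deviation is small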
Leaving-out one observation. Let us call β̂_{(i)} the usual leave-one-out estimator (i.e the estimator we get by not using (X_i, Y_i) in our regression problem). It solves

    Σ_{j≠i} X_j ψ(ε_j − X_j'β̂_{(i)}) = 0 .      [5]

Note that when the {X_i}_{i=1}^n are independent, β̂_{(i)} is independent of X_i. For all j, 1 ≤ j ≤ n, we call r̃_{j,(i)}

    r̃_{j,(i)} = ε_j − X_j'β̂_{(i)} .      [6]

When j ≠ i, these are the residuals from this leave-one-out situation. For j = i, r̃_{i,(i)} is the prediction error for observation i.

Intuitively, it is clear that under regularity conditions on ρ and the ε_i's, when the X_i's are i.i.d, for i ≠ j, R_j ≃ r̃_{j,(i)} (this says statistically that leave-one-out makes sense). On the other hand, it is easy to convince oneself (by looking e.g at the least-squares situation) that r̃_{i,(i)} is very different from R_i in high dimension. The expansion we will get below will indeed confirm this fact in a more general setting than least-squares.

Taking the difference between Equations [3] and [5] we get, after using Taylor expansions for j ≠ i (and truncating the expansion at first order),

    X_i ψ(ε_i − X_i'β̂) + Σ_{j≠i} ψ'(r̃_{j,(i)}) X_j X_j' (β̂_{(i)} − β̂) ≃ 0 .

We call S_i = Σ_{j≠i} ψ'(r̃_{j,(i)}) X_j X_j' (to help with intuition, note that in the least squares case, S_i = Σ_{j≠i} X_j X_j', a sample covariance matrix multiplied by n − 1). This suggests that

    β̂ − β̂_{(i)} ≃ S_i^{-1} X_i ψ(ε_i − X_i'β̂) .      [7]

Note that S_i is independent of X_i. Hence, multiplying the previous expression by X_i', we get, using the approximation given in Equation [4] (which amounts to assuming that S_i^{-1} is "nice enough"),

    R_i − r̃_{i,(i)} ≃ −λ_i² trace(S_i^{-1}) ψ(R_i) .

Experience in random matrix theory as well as the form of the matrix S_i suggest that trace(S_i^{-1}) should have a deterministic limit (again under conditions on ρ, the λ_i's and the ε_i's).
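In the least-squares case mentioned in the parenthetical remark above (ψ(x) = x, ψ' ≡ 1), the leave-one-out relation [7] is not just an approximation but an exact algebraic identity, which the following sketch (ours) verifies:

    import numpy as np

    # Sketch (ours): for least squares, beta_hat - beta_hat_(i) = S_i^{-1} X_i (y_i - X_i' beta_hat)
    # exactly, with S_i = sum_{j != i} X_j X_j'.
    rng = np.random.default_rng(0)
    n, p, i = 200, 50, 0
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)                      # beta_0 = 0, so y_i = eps_i
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta_loo, *_ = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)
    S_i = X.T @ X - np.outer(X[i], X[i])
    lhs = beta_full - beta_loo
    rhs = np.linalg.solve(S_i, X[i]) * (y[i] - X[i] @ beta_full)
    print(np.max(np.abs(lhs - rhs)))                # of the order of machine precision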

Then, by symmetry between the observations, all the trace(S_i^{-1}) are approximately the same, i.e, when p and n are large, trace(S_i^{-1}) ≃ c. Hence,

    R_i − r̃_{i,(i)} ≃ −λ_i² c ψ(R_i) .      [8]

Note that since X_i and β̂_{(i)} are independent when the X_i's are independent and independent of {ε_i}_{i=1}^n and {λ_i}_{i=1}^n, much can be said about the distribution of r̃_{i,(i)}. However, at this point in the derivation it is not clear what the value of c should be.

Leaving-out one predictor. Let us consider what happens when we leave the p-th predictor out. Because we are assuming that X_i is N(0, Id_p) and β_0 = 0, all the predictors play a symmetric role, so we pick the p-th to simplify notations. There is nothing particular about it and the same analysis can be done with any other predictor.

Let us call γ̂ (∈ ℝ^{p−1}) the corresponding optimal regression vector for the objective ρ. We use the notations and partitions

    X_i = (V_i, X_i(p))' ,   β̂ = (β̂_{S_p}, β̂_p)' .

We have V_i ∈ ℝ^{p−1}. Naturally, γ̂ satisfies

    Σ_{i=1}^n V_i ψ(ε_i − V_i'γ̂) = 0 .

We call

    r_{i,[p]} = ε_i − V_i'γ̂ ,

i.e the residuals based on p − 1 predictors. Note that {r_{i,[p]}}_{i=1}^n is independent of {X_i(p)}_{i=1}^n under our assumptions (because V_i is independent of X_i(p) and the X_i's are i.i.d).

It is intuitively clear that, under regularity conditions on ρ, the ε_i's and the λ_i's, R_i ≃ r_{i,[p]} for all i, since adding a predictor will not help us much in estimating β_0 = 0. Hence the residuals should not be much affected by the addition of one predictor. Taking the difference of the equations defining β̂ (Equation [3]) and γ̂, we get

    Σ_i [ X_i ψ(ε_i − X_i'β̂) − (V_i, 0)' ψ(ε_i − V_i'γ̂) ] = 0 .

This p-dimensional equation separates into a scalar and a vector equation, namely,

    Σ_i X_i(p) ψ(ε_i − X_i'β̂) = 0 ,
    Σ_i V_i [ ψ(R_i) − ψ(r_{i,[p]}) ] = 0_{p−1} .

Using a first-order Taylor expansion of ψ(R_i) around ψ(r_{i,[p]}) and noting that R_i − r_{i,[p]} = V_i'(γ̂ − β̂_{S_p}) − X_i(p) β̂_p, we can transform the first equation above into

    Σ_i X_i(p) [ ψ(r_{i,[p]}) + ψ'(r_{i,[p]}) ( V_i'(γ̂ − β̂_{S_p}) − X_i(p) β̂_p ) ] ≃ 0 .

This gives the near identity

    β̂_p ≃ Σ_i X_i(p) [ ψ(r_{i,[p]}) + ψ'(r_{i,[p]}) V_i'(γ̂ − β̂_{S_p}) ] / Σ_i X_i²(p) ψ'(r_{i,[p]}) .

Working similarly on the equations involving V_i, we get

    Σ_i ψ'(r_{i,[p]}) V_i [ R_i − r_{i,[p]} ] ≃ 0 .

Since R_i − r_{i,[p]} = −β̂_p X_i(p) + V_i'(γ̂ − β̂_{S_p}), the previous equation reads

    [ Σ_i ψ'(r_{i,[p]}) V_i V_i' ] (γ̂ − β̂_{S_p}) − β̂_p Σ_i ψ'(r_{i,[p]}) V_i X_i(p) ≃ 0 .

Calling

    S_p = Σ_i ψ'(r_{i,[p]}) V_i V_i' ,   and   u_p = Σ_i ψ'(r_{i,[p]}) V_i X_i(p) ,

we see that (γ̂ − β̂_{S_p}) ≃ β̂_p S_p^{-1} u_p. Using this approximation in the previous equation for β̂_p, we have finally

    β̂_p ≃ Σ_i X_i(p) ψ(r_{i,[p]}) / [ Σ_i X_i²(p) ψ'(r_{i,[p]}) − u_p' S_p^{-1} u_p ] .      [9]

Approximation of this denominator. Let us write in matrix form u_p' S_p^{-1} u_p = X(p)' A X(p), where A = D^{1/2} P_V D^{1/2}, P_V = D^{1/2} V (V'DV)^{-1} V' D^{1/2} and D is a diagonal matrix with D(i, i) = ψ'(r_{i,[p]}). Note that P_V is a projection matrix of rank p − 1 in general. Let us call ξ_n the denominator of β̂_p divided by n. We have

    ξ_n = (1/n) X(p)' (D − A) X(p) .

Let us call S_p(i) = S_p − ψ'(r_{i,[p]}) V_i V_i'. Using the Sherman-Morrison-Woodbury formula (see [5], p.19, and the SI), we see that

    P_V(i, i) = 1 − 1 / ( 1 + ψ'(r_{i,[p]}) V_i' [S_p(i)]^{-1} V_i ) .

We notice that S_p(i)^{-1} can be approximated by a matrix M_i^{-1} which is independent of V_i (by using our leave-one-predictor-out observations), for which V_i' M_i^{-1} V_i / λ_i² = trace(M_i^{-1}) + o_P(1) by Equation [4] (these two approximations naturally require some regularity conditions on ρ, etc., so that M_i^{-1} is "nice enough"). Hence,

    P_V(i, i) = 1 − 1 / ( 1 + λ_i² ψ'(r_{i,[p]}) trace([S_p(i)]^{-1}) ) + o_P(1) .

Therefore, using the approximations r_{i,[p]} ≃ R_i and trace([S_p(i)]^{-1}) ≃ c (because trace([S_p(i)]^{-1}) ≃ trace(S_p^{-1}) using Sherman-Morrison-Woodbury), we also have

    1 − P_V(i, i) = 1 / ( 1 + ψ'(r_{i,[p]}) V_i' [S_p(i)]^{-1} V_i ) ≃ 1 / ( 1 + λ_i² c ψ'(r_{i,[p]}) ) .

Since P_V is a rank (p − 1) projection matrix, we have trace(P_V) = p − 1 = Σ_i P_V(i, i), and therefore

    (1/n) Σ_{i=1}^n 1 / ( 1 + λ_i² c ψ'(r_{i,[p]}) ) = 1 − p/n + o_P(1) .

Using concentration properties of X(p) conditional on {λ_i}_{i=1}^n, we have

    ξ_n = (1/n) trace( D_λ (D − A) D_λ ) + o_P(1) = (1/n) Σ_{i=1}^n λ_i² ψ'(r_{i,[p]}) (1 − P_V(i, i)) + o_P(1) ,

where D_λ is the diagonal matrix with D_λ(i, i) = λ_i.
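The two linear-algebra facts used above, namely that P_V is a rank (p − 1) projection and that its diagonal entries can be expressed through [S_p(i)]^{-1} via Sherman-Morrison-Woodbury, can be checked directly. A small sketch (ours), with arbitrary positive weights standing in for ψ'(r_{i,[p]}):

    import numpy as np

    # Sketch (ours): with D diagonal, P_V = D^{1/2} V (V'DV)^{-1} V' D^{1/2},
    # S_p = V'DV and S_p(i) = S_p - D_ii V_i V_i', check
    # trace(P_V) = p - 1 and P_V(i,i) = 1 - 1/(1 + D_ii * V_i' [S_p(i)]^{-1} V_i).
    rng = np.random.default_rng(0)
    n, p = 60, 20
    V = rng.standard_normal((n, p - 1))              # rows are the V_i
    d = rng.uniform(0.5, 2.0, size=n)                # stand-ins for psi'(r_{i,[p]})
    Sp = V.T @ (d[:, None] * V)
    W = np.sqrt(d)[:, None] * V                      # D^{1/2} V
    PV = W @ np.linalg.solve(Sp, W.T)
    print(np.trace(PV), p - 1)                       # equal up to rounding
    for i in range(3):
        Sp_i = Sp - d[i] * np.outer(V[i], V[i])
        quad = d[i] * V[i] @ np.linalg.solve(Sp_i, V[i])
        print(PV[i, i], 1 - 1 / (1 + quad))          # equal up to rounding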

Replacing 1 − P_V(i, i) by its approximate value, we get

    ξ_n ≃ (1/n) Σ_{i=1}^n λ_i² ψ'(r_{i,[p]}) / ( 1 + c λ_i² ψ'(r_{i,[p]}) ) = (1/c) [ 1 − (1/n) Σ_{i=1}^n 1/( 1 + c λ_i² ψ'(r_{i,[p]}) ) ] ≃ (1/c) (p/n) .

So finally,

    β̂_p ≃ [ Σ_i X_i(p) ψ(r_{i,[p]}) / n ] / ( p/(nc) ) ≃ (c/p) Σ_{i=1}^n λ_i ψ(r_{i,[p]}) X̄_i(p) .      [10]

Using again ψ(r_{i,[p]}) ≃ ψ(R_i), we see that

    E[ ‖β̂‖² ] ≃ (n/p) E[ c² (1/n) Σ_{i=1}^n λ_i² ψ²(R_i) ] ,      [11]

assuming that we can take expectations in all these approximations.

From approximations to functional system. Our approximations concerning the residuals and β̂_p shed considerable light on them. Our focus is now on ‖β̂‖.

From Equation [8] we got the approximation

    r̃_{i,(i)} ≃ R_i + c λ_i² ψ(R_i) .

Recall that for a convex, proper and closed function ρ, whose subdifferential we call ψ, and t > 0, prox_t(ρ) = (Id + tψ)^{-1}. It is an important fact that the prox is indeed a function and not a multi-valued mapping, even when ρ is not differentiable everywhere. We therefore get the approximation

    R_i ≃ prox_{cλ_i²}(ρ)(r̃_{i,(i)}) .

Recalling Equation [6] and using the independence of β̂_{(i)} and X_i, we have r̃_{i,(i)} =_L ε_i + |λ_i| ‖β̂_{(i)}‖ Z_i, where Z_i is N(0, 1) and independent of ε_i, λ_i and ‖β̂_{(i)}‖.

We now argue that ‖β̂‖ is asymptotically deterministic. Using the relationship between β̂ and β̂_{(i)} in Equation [7] and taking squared norms, we see that

    ‖β̂‖² ≃ ‖β̂_{(i)}‖² + 2 β̂_{(i)}' S_i^{-1} X_i ψ(R_i) + X_i' S_i^{-2} X_i ψ²(R_i) .

Assuming that the smallest eigenvalue of S_i/n remains bounded away from 0, which is automatically satisfied with high probability for strongly convex functions ρ, we see that ‖β̂‖² − ‖β̂_{(i)}‖² is O_P(1/n), provided ‖β̂_{(i)}‖ remains bounded and ψ and ψ' do not grow too fast at infinity. Applying the Efron-Stein inequality, we see that var(‖β̂‖²) = O(1/n) if we take squared expectations in our approximations. It follows that ‖β̂‖² is asymptotically deterministic.

These arguments suggest that, as p and n become large, r̃_{i,(i)} =_L ε_i + |λ_i| r_ρ(κ) Z_i + o_P(1), where Z_i ∼ N(0, 1), independent of λ_i and ε_i, and r_ρ(κ) is deterministic. We also note that the Z_i are i.i.d, since the X_i are.

Since c λ_i² ψ(R_i) ≃ r̃_{i,(i)} − R_i ≃ r̃_{i,(i)} − prox_{cλ_i²}(ρ)(r̃_{i,(i)}), we see that Equation [11] now becomes asymptotically

    κ r_ρ²(κ) = (1/n) Σ_{i=1}^n E[ λ_i^{-2} ( r̃_{i,(i)} − prox_{cλ_i²}(ρ)(r̃_{i,(i)}) )² ] ,

where the expectations are over the joint distribution of the λ_i's, ε_i's and Z_i's. (We note that our arguments do not depend on independence of the λ_i's or ε_i's, though both families of random variables need to be independent of {X_i}_{i=1}^n.) This is the second equation of System [S1].

We now recall that, using the fact that the matrix P_V above was a projection matrix, we had argued that asymptotically

    (1/n) Σ_{i=1}^n 1/( 1 + c λ_i² ψ'(R_i) ) = 1 − κ + o_P(1) .

We observe that (prox_{cλ_i²})'(r̃_{i,(i)}) = 1 / ( 1 + c λ_i² ψ'(prox_{cλ_i²}[r̃_{i,(i)}]) ) and therefore 1/( 1 + c λ_i² ψ'(R_i) ) ≃ (prox_{cλ_i²})'(r̃_{i,(i)}). This allows us to conclude that under regularity conditions,

    (1/n) Σ_{i=1}^n E[ (prox_{cλ_i²})'(r̃_{i,(i)}) ] = 1 − κ .

This is the first equation of our System [S1].

A note on non-differentiable ρ's. One of the appeals of our systems [S1] and [S2] is that they yield expressions even in the case of non-differentiable ρ, since the prox is well-defined. However, we derived the systems assuming smoothness of ρ. To go around this hurdle, one can approximate ρ by a family ρ_η of smooth convex functions such that ρ_η → ρ as η → 0 in an appropriate sense. Intuitively it is quite clear that r_{ρ_η}(κ) should tend to r_ρ(κ) as η tends to 0 under appropriate regularity conditions on the ε_i's and λ_i's. We then just need to take limits in our systems to justify them for non-differentiable ρ's.
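As the note above says, the proximal mapping remains a well-defined, single-valued function even when ρ is not differentiable. The following sketch (ours) checks, for ρ(x) = |x|, that a brute-force minimization of ρ(y) + (x − y)²/(2t) recovers the soft-thresholding formula used in the Examples section.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Sketch (ours): prox_t(|.|)(x) computed by direct minimization equals soft thresholding.
    def prox_num(x, t):
        obj = lambda y: abs(y) + (x - y) ** 2 / (2 * t)
        return minimize_scalar(obj, bounds=(-10, 10), method='bounded').x

    t = 0.7
    for x in (-2.0, -0.3, 0.0, 0.5, 3.0):
        soft = np.sign(x) * max(abs(x) - t, 0.0)
        print(x, prox_num(x, t), soft)      # the last two columns agree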

ACKNOWLEDGMENTS. Bean gratefully acknowledges support from NSF grant DMS-0636667 (VIGRE). Bickel gratefully acknowledges support from NSF grant DMS-0907362. El Karoui gratefully acknowledges support from an Alfred P. Sloan Research Fellowship and NSF grant DMS-0847647 (CAREER). Yu gratefully acknowledges support from NSF Grants SES-0835531 (CDI), DMS-1107000 and CCF-0939370.

1. Bean, D., Bickel, P., El Karoui, N., and Yu, B. (2012). Optimal objective function in high-dimensional regression. Submitted to PNAS.
2. Eaton, M. L. (2007). Multivariate statistics: A vector space approach. Institute of Mathematical Statistics Lecture Notes-Monograph Series, 53. Institute of Mathematical Statistics. Reprint of the 1983 original.
3. El Karoui, N. (2010). High-dimensionality effects in the Markowitz problem and other quadratic programs with linear constraints: risk underestimation. Ann. Statist. 38, 3487-3566.
4. El Karoui, N., Bean, D., Bickel, P., Lim, C., and Yu, B. (2011). On robust regression with high-dimensional predictors. Technical report.
5. Horn, R. A. and Johnson, C. R. (1994). Topics in matrix analysis. Cambridge University Press, Cambridge. Corrected reprint of the 1991 original.
6. Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Statist. 1, 799-821.
7. Huber, P. J. and Ronchetti, E. M. (2009). Robust statistics. Wiley Series in Probability and Statistics. John Wiley & Sons Inc., Hoboken, NJ, second edition.
8. Moreau, J.-J. (1965). Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93, 273-299.
9. Portnoy, S. (1984). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. Ann. Statist. 12, 1298-1309.
