UC San Diego Previously Published Works

Title Adaptive Huber Regression.

Permalink https://escholarship.org/uc/item/4sn8g6n9

Journal Journal of the American Statistical Association, 115(529)

ISSN 0162-1459

Authors Sun, Qiang; Zhou, Wen-Xin; Fan, Jianqing

Publication Date 2020

DOI 10.1080/01621459.2018.1543124

Peer reviewed


ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: https://www.tandfonline.com/loi/uasa20


To cite this article: Qiang Sun, Wen-Xin Zhou & Jianqing Fan (2020) Adaptive Huber Regression, Journal of the American Statistical Association, 115:529, 254-265, DOI: 10.1080/01621459.2018.1543124
To link to this article: https://doi.org/10.1080/01621459.2018.1543124


Accepted author version posted online: 12 Dec 2018. Published online: 22 Apr 2019.


Adaptive Huber Regression

Qiang Sun (a), Wen-Xin Zhou (b), and Jianqing Fan (c,d)
(a) Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada; (b) Department of Mathematics, University of California, San Diego, La Jolla, CA; (c) School of Data Science, Fudan University, Shanghai, China; (d) Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ

CONTACT Qiang Sun [email protected] Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/JASA. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA. © 2019 American Statistical Association

ABSTRACT
Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for an optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded (1 + δ)th moment for any δ > 0. We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when δ ≥ 1, the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime 0 < δ < 1, and the transition is smooth and optimal. In addition, we extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive. Supplementary materials for this article are available online.

ARTICLE HISTORY
Received August 2017; Revised October 2018.

KEYWORDS
Adaptive Huber regression; Bias and robustness tradeoff; Finite-sample inference; Heavy-tailed data; Nonasymptotic optimality; Phase transition.

1. Introduction

Modern data acquisitions have facilitated the collection of massive and high-dimensional data with complex structures. Along with holding great promises for discovering subtle population patterns that are less achievable with small-scale data, big data have introduced a series of new challenges to data analysis, both computationally and statistically (Loh and Wainwright 2015; Fan et al. 2018). During the last two decades, extensive progress has been made toward extracting useful information from massive data with high-dimensional features and sub-Gaussian tails. A random variable Z is said to have sub-Gaussian tails if there exist constants c₁ and c₂ such that P(|Z| > t) ≤ c₁ exp(−c₂t²) for any t ≥ 0 (Tibshirani 1996; Fan and Li 2001; Efron et al. 2004; Bickel, Ritov, and Tsybakov 2009). We refer to the monographs Bühlmann and van de Geer (2011) and Hastie, Tibshirani, and Wainwright (2015) for a systematic coverage of contemporary statistical methods for high-dimensional data.

The sub-Gaussian tails requirement, albeit convenient for theoretical analysis, is not realistic in many practical applications, since modern data are often collected with low quality. For example, a recent study on functional magnetic resonance imaging (fMRI) (Eklund, Nichols, and Knutsson 2016) shows that the principal cause of invalid fMRI inferences is that the data do not follow the assumed Gaussian shape, which speaks to the need of validating the statistical methods being used in the field of neuroimaging. In a microarray data example considered in Wang, Peng, and Li (2015), it is observed that some gene expression levels have heavy tails, as their kurtosises are much larger than 3, despite the normalization methods used. In finance, the power-law nature of the distribution of returns has been validated as a stylized fact (Cont 2001). Fan et al. (2016) argued that heavy-tailed distribution is a stylized feature for high-dimensional data and proposed a shrinkage principle to attenuate the influence of outliers. Standard statistical procedures that are based on the method of least squares often behave poorly in the presence of heavy-tailed data. We say a random variable X has heavy tails if P(|X| > t) decays to zero polynomially in 1/t as t → ∞ (Catoni 2012). It is therefore of ever-increasing interest to develop new statistical methods that are robust against heavy-tailed errors and other potential forms of contamination.

In this article, we first revisit the Huber regression that was initiated by Peter Huber in his seminal work Huber (1973). Asymptotic properties of the Huber estimator have been well studied in the literature. We refer to Huber (1973), Yohai and Maronna (1979), Portnoy (1985), Mammen (1989), and He and Shao (1996, 2000) for an unavoidably incomplete overview. However, in all of the aforementioned papers, the robustification parameter is suggested to be set as fixed according to the 95% asymptotic efficiency rule. Thus, this procedure cannot estimate the model-generating parameters consistently when the sample distribution is asymmetric.

From a nonasymptotic perspective (rather than an asymptotic efficiency rule), we propose to use the Huber regression with an adaptive robustification parameter, which is referred to as the adaptive Huber regression, for robust estimation and inference.

Our adaptive procedure achieves nonasymptotic robustness in the sense that the resulting estimator admits exponential-type concentration bounds when only low-order moments exist. Moreover, the resulting estimator is also an asymptotically unbiased estimate of the parameters of interest. In particular, we do not impose symmetry and homoscedasticity conditions on error distributions, so that our problem is intrinsically different from median/quantile regression models, which are also of independent interest and serve as important robust techniques (Koenker 2005).

We make several major contributions toward robust modeling in this article. First and foremost, we establish nonasymptotic deviation bounds for adaptive Huber regression when the error variables have only finite (1 + δ)th moments. By providing a matching lower bound, we observe a sharp phase transition phenomenon, which is in line with that discovered by Devroye et al. (2016) for univariate mean estimation. Second, a similar phase transition for regularized adaptive Huber regression is established in high dimensions. By defining the effective dimension and effective sample size, we present nonasymptotic results under the two different regimes in a unified form. Last, by exploiting the localized analysis developed in Fan et al. (2018), we remove the artificial bounded parameter constraint imposed in previous works; see Loh and Wainwright (2015) and Fan, Li, and Wang (2017). In the supplementary materials, we present a nonasymptotic Bahadur representation for the adaptive Huber estimator when δ ≥ 1, which provides a theoretical foundation for robust finite-sample inference.

An important step toward estimation under heavy-tailedness has been made by Catoni (2012), whose focus is on estimating a univariate mean. Let X be a real-valued random variable with mean μ = E(X) and variance σ² = var(X) > 0, and assume that X₁, ..., Xₙ are independent and identically distributed (iid) from X. For any prespecified exception probability t > 0, Catoni constructs a robust mean estimator μ̂_C(t) that deviates from the true mean μ logarithmically in 1/t, that is,

$$\mathbb{P}\big\{|\hat\mu_C(t) - \mu| \le t\,\sigma/n^{1/2}\big\} \ge 1 - 2\exp(-ct^2), \qquad (1)$$

while the empirical mean deviates from the true mean only polynomially in 1/t², namely, sub-Gaussian tails versus Cauchy tails in terms of t. Further, Devroye et al. (2016) developed adaptive sub-Gaussian estimators that are independent of the prespecified exception probability. Beyond mean estimation, Brownlees, Joly, and Lugosi (2015) extended Catoni's idea to study empirical risk minimization problems when the losses are unbounded. Generalizations of the univariate results to those for matrices, such as the covariance matrices, can be found in Catoni (2016), Minsker (2018), Giulini (2017), and Fan, Li, and Wang (2017). Fan, Li, and Wang (2017) modified Huber's procedure (Huber 1973) to obtain a robust estimator, which is concentrated around the true mean with exponentially high probability in the sense of (1), and also proposed a robust procedure for sparse linear regression with asymmetric and heavy-tailed errors.

The rest of the article proceeds as follows. The rest of this section is devoted to related literature. In Section 2, we revisit the Huber loss and robustification parameter, followed by the proposal of adaptive Huber regression in both low and high dimensions. We sharply characterize the nonasymptotic performance of the proposed estimators in Section 3. In Section 4, we extend the methodology to allow possibly heavy-tailed covariates/predictors. We describe the algorithm and implementation in Section 5. Section 6 is devoted to simulation studies and a real data application. All the proofs are collected in the supplementary materials.
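To see the contrast in (1) numerically, the following is a minimal sketch (ours, not from the paper; a simple truncated mean with a sample-size-dependent threshold serves as a crude stand-in for Catoni's estimator): on heavy-tailed samples, the extreme error quantiles of the empirical mean are visibly fatter than those of the truncated mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20000

def truncated_mean(x, tau):
    # Winsorize at +/- tau and average; a crude surrogate for Catoni's estimator
    return np.clip(x, -tau, tau).mean()

# t-distribution with 2.5 df: mean 0, finite variance, polynomial tails
sigma = np.sqrt(2.5 / 0.5)              # sd of t_{df} is sqrt(df / (df - 2))
tau = sigma * np.sqrt(n / np.log(1e3))  # threshold ~ sigma * sqrt(n/t)

err_mean = np.empty(reps)
err_trunc = np.empty(reps)
for i in range(reps):
    x = rng.standard_t(2.5, size=n)
    err_mean[i] = abs(x.mean())
    err_trunc[i] = abs(truncated_mean(x, tau))

# Extreme deviations: the empirical mean has much fatter error tails
for q in (0.99, 0.999):
    print(q, np.quantile(err_mean, q), np.quantile(err_trunc, q))
```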

1.1. Related Literature

The terminology "robustness" used in this article describes how stable the method performs with respect to the tail behavior of the data, which can be either sub-Gaussian/subexponential or Pareto-like (Delaigle, Hall, and Jin 2011; Catoni 2012; Devroye et al. 2016). This is different from the conventional perspective of robust statistics under Huber's ε-contamination model (Huber 1964), for which a number of depth-based procedures have been developed since the groundbreaking work of Tukey (1975). Significant contributions have also been made in Liu (1990), Liu, Parelius, and Singh (1999), Zuo and Serfling (2000), Mizera (2002), and Mizera and Müller (2004). We refer to Chen, Gao, and Ren (2018) for the most recent result and a literature review concerning this problem.

Our main focus is on the conditional mean regression in the presence of heavy-tailed and asymmetric errors, which automatically distinguishes our method from quantile-based robust regressions (Koenker 2005; Belloni and Chernozhukov 2011; Wang 2013; Fan, Fan, and Barut 2014; Zheng, Peng, and He 2015). In general, quantile regression is biased toward estimating the mean regression coefficient unless the error distributions are symmetric around zero. Another recent work that is related to ours is Alquier, Cottet, and Lecué (2017). They studied a general class of regularized empirical risk minimization procedures with a particular focus on Lipschitz losses, which includes the quantile, hinge, and logistic losses. Different from all these works, our goal is to estimate the mean regression coefficients robustly. The robustness is witnessed by a nonasymptotic analysis: the proposed estimators achieve sub-Gaussian deviation bounds when the regression errors have only finite second moments. Asymptotically, our proposed estimators are fully efficient: they achieve the same efficiency as the ordinary least squares (OLS) estimators.

Notation. We fix some notation that will be used throughout this article. For any vector u = (u₁, ..., u_d)ᵀ ∈ ℝ^d and q ≥ 1, ‖u‖_q = (Σ_{j=1}^d |u_j|^q)^{1/q} is the ℓ_q norm. For any vectors u, v ∈ ℝ^d, we write ⟨u, v⟩ = uᵀv. Moreover, we let ‖u‖₀ = Σ_{j=1}^d 1(u_j ≠ 0) denote the number of nonzero entries of u, and set ‖u‖∞ = max_{1≤j≤d} |u_j|. For two sequences of real numbers {a_n}_{n≥1} and {b_n}_{n≥1}, a_n ≲ b_n denotes a_n ≤ C b_n for some constant C > 0 independent of n, a_n ≳ b_n if b_n ≲ a_n, and a_n ≍ b_n if a_n ≲ b_n and b_n ≲ a_n. For two scalars, we use a ∧ b = min{a, b} to denote the minimum of a and b. If A is an m × n matrix, we use ‖A‖ to denote its spectral norm, defined by ‖A‖ = max_{u ∈ S^{n−1}} ‖Au‖₂, where S^{n−1} = {u ∈ ℝⁿ : ‖u‖₂ = 1} is the unit sphere in ℝⁿ. For an n × n matrix A, we use λ_max(A) and λ_min(A) to denote the maximum and minimum eigenvalues of A, respectively. For two n × n matrices A and B, we write A ⪯ B if B − A is positive semidefinite. For a function f : ℝ^d → ℝ, we use ∇f ∈ ℝ^d to denote its gradient vector as long as it exists.

2. Methodology

We consider iid observations (y₁, x₁), ..., (yₙ, xₙ) that are generated from the following heteroscedastic regression model:

$$y_i = \langle x_i, \beta^* \rangle + \varepsilon_i, \quad \text{with } \mathbb{E}(\varepsilon_i \mid x_i) = 0 \text{ and } v_{i,\delta} = \mathbb{E}\big(|\varepsilon_i|^{1+\delta}\big) < \infty. \qquad (2)$$

Assuming that the second moments are bounded (δ = 1), the standard OLS estimator, denoted by β̂^ols, admits a suboptimal polynomial-type deviation bound, and thus does not concentrate around β* tightly enough for large-scale simultaneous estimation and inference. The key observation that underpins this suboptimality of the OLS estimator is the sensitivity of the quadratic loss to outliers (Huber 1973; Catoni 2012), while the Huber regression with a fixed tuning constant may lead to nonnegligible estimation bias. To overcome this drawback, we propose to employ the Huber loss with an adaptive robustification parameter to achieve robustness and (asymptotic) unbiasedness simultaneously. We begin with the definitions of the Huber loss and the corresponding robustification parameter.

Definition 1 (Huber loss and robustification parameter). The Huber loss ℓ_τ(·) (Huber 1964) is defined as

$$\ell_\tau(x) = \begin{cases} x^2/2, & \text{if } |x| \le \tau, \\ \tau|x| - \tau^2/2, & \text{if } |x| > \tau, \end{cases}$$

where τ > 0 is referred to as the robustification parameter that balances bias and robustness.

The Huber loss ℓ_τ(x) is quadratic for small values of x and becomes linear when x exceeds τ in magnitude. The parameter τ therefore controls the blending of the quadratic and ℓ1 losses, which can be regarded as two extremes of the Huber loss with τ = ∞ and τ → 0, respectively. Compared with least squares, outliers are downweighted in the Huber loss. We will use the name, adaptive Huber loss, to emphasize the fact that the parameter τ should adapt to the sample size, dimension and moments for a better tradeoff between bias and robustness. This distinguishes our framework from the classical setting. As τ → ∞ is needed to reduce the bias when the error distribution is asymmetric, this loss is also called the RA-quadratic (robust approximation to quadratic) loss in Fan, Li, and Wang (2017).

Define the empirical loss function L_τ(β) = n⁻¹ Σ_{i=1}^n ℓ_τ(y_i − ⟨x_i, β⟩) for β ∈ ℝ^d. The Huber estimator is defined through the following convex optimization problem:

$$\hat\beta_\tau = \arg\min_{\beta \in \mathbb{R}^d} \mathcal{L}_\tau(\beta). \qquad (3)$$

In low dimensions, under the condition that v_δ = n⁻¹ Σ_{i=1}^n E(|ε_i|^{1+δ}) < ∞ for some δ > 0, we will prove that β̂_τ with τ ≍ min{v_δ^{1/(1+δ)}, v_1^{1/2}} n^{max{1/(1+δ), 1/2}} (the first factor is kept to show its explicit dependence on the moment) achieves the tight upper bound d^{1/2} τ^{−(δ∧1)} ≍ d^{1/2} n^{−min{δ/(1+δ), 1/2}}. The phase transition at δ = 1 can be easily observed (see Figure 1). When higher moments exist (δ ≥ 1), robustification leads to a sub-Gaussian-type deviation inequality in the sense of (1).

In the high-dimensional regime, we consider the following regularized adaptive Huber regression with a different choice of the robustification parameter:

$$\hat\beta_{\tau,\lambda} \in \arg\min_{\beta \in \mathbb{R}^d} \big\{ \mathcal{L}_\tau(\beta) + \lambda \|\beta\|_1 \big\}, \qquad (4)$$

where τ ≍ ν_δ {n/(log d)}^{max{1/(1+δ), 1/2}} and λ ≍ ν_δ {(log d)/n}^{min{δ/(1+δ), 1/2}} with ν_δ = min{v_δ^{1/(1+δ)}, v_1^{1/2}}. Let s be the size of the true support S = supp(β*). We will show that the regularized Huber estimator achieves an upper bound of the order s^{1/2}{(log d)/n}^{min{δ/(1+δ), 1/2}} for estimating β* in ℓ2-error with high probability.

To unify the nonasymptotic upper bounds in the two different regimes, we define the effective dimension, d_eff, to be d in low dimensions and s in high dimensions. In other words, d_eff denotes the number of nonzero parameters of the problem. The effective sample size, n_eff, is defined as n_eff = n and n_eff = n/log d in low and high dimensions, respectively. We will establish a phase transition: when δ ≥ 1, the proposed estimator enjoys a sub-Gaussian concentration, while it only achieves a slower concentration when 0 < δ < 1. Specifically, we show that, for any δ ∈ (0, ∞), the proposed estimators with τ ≍ min{v_δ^{1/(1+δ)}, v_1^{1/2}} n_eff^{max{1/(1+δ), 1/2}} achieve the following tight upper bound, up to logarithmic factors:

$$\|\hat\beta_\tau - \beta^*\|_2 \lesssim d_{\mathrm{eff}}^{1/2}\, n_{\mathrm{eff}}^{-\min\{\delta/(1+\delta),\,1/2\}} \quad \text{with high probability.} \qquad (5)$$

This finding is summarized in Figure 1.

Figure 1. Phase transition in terms of ℓ2-error for the adaptive Huber estimator. With fixed effective dimension, ‖β̂_τ − β*‖₂ ≍ n_eff^{−δ/(1+δ)} when 0 < δ < 1; ‖β̂_τ − β*‖₂ ≍ n_eff^{−1/2} when δ ≥ 1. Here n_eff is the effective sample size: n_eff = n in low dimensions while n_eff = n/log d in high dimensions.
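To make Definition 1 and the estimator (3) concrete, here is a minimal Python sketch (all helper names are ours, not from the paper's software; Section 5 notes that (3) can be solved by iteratively reweighted least squares, which is what we use, and we set τ ∝ σ̂(n/t)^{1/2} as the theory suggests for δ ≥ 1).

```python
import numpy as np

def huber_loss(x, tau):
    """Huber loss ell_tau(x): quadratic on [-tau, tau], linear beyond."""
    a = np.abs(x)
    return np.where(a <= tau, 0.5 * x**2, tau * a - 0.5 * tau**2)

def adaptive_huber(X, y, tau, n_iter=100, tol=1e-8):
    """Minimize (1/n) sum_i ell_tau(y_i - <x_i, beta>) by iteratively
    reweighted least squares, starting from the OLS solution."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))  # Huber weights
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(1)
n, d = 500, 5
beta_star = np.ones(d)
X = rng.standard_normal((n, d))
eps = rng.lognormal(0.0, 1.5, size=n) - np.exp(1.5**2 / 2)  # centered, asymmetric
y = X @ beta_star + eps

t = np.log(n * d)                      # exception-probability parameter
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
tau = np.std(resid) * np.sqrt(n / t)   # tau ~ sigma * (n/t)^{1/2} for delta >= 1
print(np.linalg.norm(adaptive_huber(X, y, tau) - beta_star))
```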

3. Nonasymptotic Theory

3.1. Adaptive Huber Regression With Increasing Dimensions

We begin with the adaptive Huber regression in the low-dimensional regime. First, we provide an upper bound for the estimation bias of Huber regression. We then establish the phase transition by establishing matching upper and lower bounds on the ℓ2-error. The analysis is carried out under both fixed and random designs. The results under random designs are provided in the supplementary materials. We start with the following regularity condition.

Condition 1. The empirical Gram matrix S_n := n⁻¹ Σ_{i=1}^n x_i x_iᵀ is nonsingular. Moreover, there exist constants c_l and c_u such that c_l ≤ λ_min(S_n) ≤ λ_max(S_n) ≤ c_u.

For any τ > 0, β̂_τ given in (3) is a natural M-estimator of

$$\beta^*_\tau := \arg\min_{\beta \in \mathbb{R}^d} \mathbb{E}\{\mathcal{L}_\tau(\beta)\} = \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \mathbb{E}\{\ell_\tau(y_i - \langle x_i, \beta\rangle)\}, \qquad (6)$$

where the expectation is taken over the regression errors. We call β*_τ the Huber regression coefficient, which is possibly different from the vector of true parameters β*. The estimation bias, measured by ‖β*_τ − β*‖₂, is a direct consequence of robustification and asymmetric error distributions. Heuristically, choosing a sufficiently large τ reduces bias at the cost of losing robustness (the extreme case of τ = ∞ corresponds to the least squares estimator). Our first result shows how the magnitude of τ affects the bias ‖β*_τ − β*‖₂. Recall that v_δ = n⁻¹ Σ_{i=1}^n v_{i,δ} with v_{i,δ} = E(|ε_i|^{1+δ}).

Proposition 1. Assume Condition 1 holds and that v_δ is finite for some δ > 0. Then, the vector β*_τ of Huber regression coefficients satisfies

$$\|\beta^*_\tau - \beta^*\|_2 \le 2 c_l^{-1/2}\, v_\delta\, \tau^{-\delta}, \qquad (7)$$

provided τ ≥ (4v_δ M²)^{1/(1+δ)} for 0 < δ < 1 or τ ≥ (2v₁)^{1/2} M for δ ≥ 1, where M = max_{1≤i≤n} ‖S_n^{−1/2} x_i‖₂.

The total estimation error ‖β̂_τ − β*‖₂ can therefore be decomposed into two parts:

$$\underbrace{\|\hat\beta_\tau - \beta^*\|_2}_{\text{total error}} \;\le\; \underbrace{\|\hat\beta_\tau - \beta^*_\tau\|_2}_{\text{estimation error}} \;+\; \underbrace{\|\beta^*_\tau - \beta^*\|_2}_{\text{approximation bias}},$$

where the approximation bias is of order τ^{−δ}. A large τ reduces the bias but compromises the degree of robustness. Thus, an optimal estimator is the one with τ diverging at a certain rate to achieve the optimal tradeoff between estimation error and approximation bias. Our next result presents nonasymptotic upper bounds on the ℓ2-error with an exponential-type exception probability, when τ is properly tuned. Recall that ν_δ = min{v_δ^{1/(1+δ)}, v₁^{1/2}} for any δ > 0.

Theorem 1 (Upper bound). Assume Condition 1 holds and v_δ < ∞ for some δ > 0. Let L = max_{1≤i≤n} ‖x_i‖∞ and assume n ≥ C(L, c_l)d²t for some C(L, c_l) > 0 depending only on L and c_l. Then, for any t > 0 and τ₀ ≥ ν_δ, the estimator β̂_τ with τ = τ₀(n/t)^{max{1/(1+δ), 1/2}} satisfies the bound

$$\|\hat\beta_\tau - \beta^*\|_2 \le 4 c_l^{-1} L \tau_0\, d^{1/2} \Big(\frac{t}{n}\Big)^{\min\{\delta/(1+\delta),\,1/2\}} \qquad (8)$$

with probability at least 1 − (2d + 1)e^{−t}.

Remark 1. It is worth mentioning that the proposed robust estimator depends on the unknown parameter v_δ^{1/(1+δ)}. Adaptation to the unknown moment is indeed another important problem. In Section 6, we suggest a simple cross-validation scheme for choosing τ with desirable numerical performance. A general adaptive construction of τ can be obtained via Lepski's method (Lepski 1991), which is more challenging due to unspecified constants. In the supplementary materials, we discuss a variant of Lepski's method and establish its theoretical guarantee.

Remark 2. We do not assume E(|ε_i|^{1+δ} | x_i) to be a constant, and hence the proposed method accommodates heteroscedastic regression models. For example, ε_i can take the form σ(x_i)v_i, where σ : ℝ^d → (0, ∞) is a positive function, and v_i are random variables satisfying E(v_i) = 0 and E(|v_i|^{1+δ}) < ∞.

Remark 3. We need the scaling condition to go roughly as n ≳ d²t under fixed designs. With random designs, we show that the scaling condition can be relaxed to n ≳ d + t. Details are given in the supplementary materials.

Theorem 1 indicates that, with only a bounded (1 + δ)th moment, the adaptive Huber estimator achieves the upper bound d^{1/2} n^{−min{δ/(1+δ), 1/2}}, up to a logarithmic factor, by setting t = log(nd). A natural question is whether the upper bound in (8) is optimal. To address this, we provide a matching lower bound up to a logarithmic factor. Let P_δ^{v_δ} be the class of all distributions on ℝ whose (1 + δ)th absolute central moment equals v_δ. Let X = (x₁, ..., xₙ)ᵀ = (x¹, ..., x^d) ∈ ℝ^{n×d} be the design matrix and U_n = {u : u ∈ {−1, 1}ⁿ}.

Theorem 2 (Lower bound). Assume that the regression errors ε_i are iid from a distribution in P_δ^{v_δ} with δ > 0. Suppose there exists a u ∈ U_n such that ‖n⁻¹Xᵀu‖_min ≥ α for some α > 0. Then, for any t ∈ [0, n/2] and any estimator β̌ = β̌(y₁, ..., yₙ, t) possibly depending on t, we have

$$\sup_{P \in \mathcal{P}_\delta^{v_\delta}} \mathbb{P}\bigg\{\|\check\beta - \beta^*\|_2 \ge \alpha c_u^{-1} \nu_\delta\, d^{1/2} \Big(\frac{t}{n}\Big)^{\min\{\delta/(1+\delta),\,1/2\}}\bigg\} \ge \frac{e^{-2t}}{2},$$

where c_u ≥ λ_max(S_n).

Theorem 2 reveals that root-n consistency with exponential concentration is impossible when δ ∈ (0, 1). It widens the phenomenon observed in Theorem 3.1 in Devroye et al. (2016) for estimating a mean. In addition to the eigenvalue assumption, we need to assume that there exists a u ∈ U_n ⊆ ℝⁿ such that the minimum angle between n⁻¹u and x^j is nonvanishing. This assumption comes from the intuition that the linear subspace spanned by x^j is at most of rank d and thus cannot span the whole space ℝⁿ. This assumption naturally holds in the univariate case where X = (1, ..., 1)ᵀ and we can take u = (1, ..., 1)ᵀ and α = 1. More generally, ‖Xᵀu/n‖_min = min{|uᵀx¹|/n, ..., |uᵀx^d|/n}. Taking |uᵀx¹|/n for an example, since u ∈ {−1, +1}ⁿ, we can assume that each coordinate of x¹ is positive. In this case, |uᵀx¹|/n = Σ_{i=1}^n |x_i¹|/n ≥ min_i |x_i¹|, which is strictly positive with probability one, assuming x¹ is drawn from a continuous distribution.

Together, the upper and lower bounds show that the adaptive Huber estimator achieves near-optimal deviations. Moreover, they indicate that the Huber estimator with an adaptive τ exhibits a sharp phase transition: when δ ≥ 1, β̂_τ converges to β* at the parametric rate n^{−1/2}, while only a slower rate of order n^{−δ/(1+δ)} is available when the second moment does not exist.

Remark 4. We provide a parallel analysis under random designs in the supplementary materials. Beyond the nonasymptotic deviation bounds, we also prove a nonasymptotic Bahadur representation, which establishes a linear approximation of the nonlinear robust estimator. This result paves the way for future research on conducting statistical inference and constructing confidence sets under heavy-tailedness. Additionally, the proposed estimator achieves full efficiency: it is as efficient as the OLS estimator asymptotically, while the robustness is characterized via nonasymptotic performance.
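The tradeoff behind Proposition 1 and Theorem 1 can be seen in a small experiment (our own illustration, reusing the IRLS solver sketched in Section 2; the intercept column is what exposes the bias under asymmetric noise): the ℓ2-error is inflated both for small τ, where approximation bias dominates, and for τ = ∞, the OLS limit, where heavy tails dominate, with a minimum at an intermediate, n-dependent value.

```python
import numpy as np

def huber_irls(X, y, tau, n_iter=50):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta

rng = np.random.default_rng(2)
n, d, reps, s = 200, 5, 200, 1.2
beta_star = np.ones(d)

for tau in (0.1, 0.5, 2.0, 8.0, 32.0, np.inf):   # tau = inf reproduces OLS
    errs = np.empty(reps)
    for i in range(reps):
        # intercept column plus Gaussian covariates
        X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d - 1))])
        eps = rng.lognormal(0.0, s, size=n) - np.exp(s**2 / 2)  # asymmetric noise
        y = X @ beta_star + eps
        errs[i] = np.linalg.norm(huber_irls(X, y, tau) - beta_star)
    print(tau, errs.mean())
```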

3.2. Adaptive Huber Regression in High Dimensions

In this section, we study the regularized adaptive Huber estimator in high dimensions where d is allowed to grow with the sample size n exponentially. The analysis is carried out under fixed designs, and results for random designs are again provided in the supplementary materials. We start with a modified version of the localized restricted eigenvalue (LRE) introduced by Fan et al. (2018). Let H_τ(β) = ∇²L_τ(β) denote the Hessian matrix. Recall that S = supp(β*) ⊆ {1, ..., d} is the true support set with |S| = s.

Definition 2 (LRE). The LRE of H_τ is defined as

κ₊(m, γ, r) = sup_{u,β} {⟨u, H_τ(β)u⟩ : (u, β) ∈ C(m, γ, r)},
κ₋(m, γ, r) = inf_{u,β} {⟨u, H_τ(β)u⟩ : (u, β) ∈ C(m, γ, r)},

where C(m, γ, r) := {(u, β) ∈ S^{d−1} × ℝ^d : ∀J ⊆ {1, ..., d} satisfying S ⊆ J, |J| ≤ m, ‖u_{J^c}‖₁ ≤ γ‖u_J‖₁, ‖β − β*‖₁ ≤ r} is a local ℓ1-cone.

The LRE is defined in a local neighborhood of β* under the ℓ1-norm. This facilitates our proof, while Fan et al. (2018) use the ℓ2-norm.

Condition 2. H_τ satisfies the LRE condition LRE(k, γ, r), that is, κ_l ≤ κ₋(k, γ, r) ≤ κ₊(k, γ, r) ≤ κ_u for some constants κ_u, κ_l > 0.

The condition above is referred to as the LRE condition (Fan et al. 2018). It is a unified condition for studying generalized loss functions, whose Hessians may possibly depend on β. For the Huber loss, Condition 2 also involves the observation noise. The following definition concerns the restricted eigenvalues (REs) of S_n instead of H_τ.

Definition 3 (RE). The restricted maximum and minimum eigenvalues of S_n are defined, respectively, as

ρ₊(m, γ) = sup_u {⟨u, S_n u⟩ : u ∈ C(m, γ)},
ρ₋(m, γ) = inf_u {⟨u, S_n u⟩ : u ∈ C(m, γ)},

where C(m, γ) := {u ∈ S^{d−1} : ∀J ⊆ {1, ..., d} satisfying S ⊆ J, |J| ≤ m, ‖u_{J^c}‖₁ ≤ γ‖u_J‖₁}.

Condition 3. S_n satisfies the RE condition RE(k, γ), that is, κ_l ≤ ρ₋(k, γ) ≤ ρ₊(k, γ) ≤ κ_u for some constants κ_u, κ_l > 0.

To make Condition 2 on H_τ practically useful, in what follows, we show that Condition 3 implies Condition 2 with high probability. As before, we write v_δ = n⁻¹ Σ_{i=1}^n v_{i,δ} and L = max_{1≤i≤n} ‖x_i‖∞.

Lemma 1. Condition 3 implies Condition 2 with high probability: if 0 < κ_l ≤ ρ₋(k, γ) ≤ ρ₊(k, γ) ≤ κ_u < ∞ for some k ≥ 1 and γ > 0, then it holds with probability at least 1 − e^{−t} that 0 < κ_l/2 ≤ κ₋(k, γ, r) ≤ κ₊(k, γ, r) ≤ κ_u < ∞, provided τ ≥ max{8Lr, c₁(L²kv_δ)^{1/(1+δ)}} and n ≥ c₂L⁴k²t, where c₁, c₂ > 0 are constants depending only on (γ, κ_l).

With the above preparations in place, we are now ready to present the main results on the adaptive Huber estimator in high dimensions.

Theorem 3 (Upper bound in high dimensions). Assume Condition 3 holds with (k, γ) = (2s, 3), and v_δ < ∞ for some 0 < δ ≤ 1. For any t > 0 and τ₀ ≥ ν_δ, let τ = τ₀(n/t)^{max{1/(1+δ), 1/2}} and λ ≥ 4Lτ₀(t/n)^{min{δ/(1+δ), 1/2}}. Then with probability at least 1 − (2s + 1)e^{−t}, the ℓ1-regularized Huber estimator β̂_{τ,λ} defined in (4) satisfies

$$\|\hat\beta_{\tau,\lambda} - \beta^*\|_2 \le 3\kappa_l^{-1} s^{1/2} \lambda, \qquad (9)$$

as long as n ≥ C(L, κ_l)s²t for some C(L, κ_l) depending only on (L, κ_l). In particular, with t = (1 + c) log d for c > 0, we have

$$\|\hat\beta_{\tau,\lambda} - \beta^*\|_2 \lesssim \kappa_l^{-1} L \tau_0\, s^{1/2} \left\{\frac{(1+c)\log d}{n}\right\}^{\min\{\delta/(1+\delta),\,1/2\}} \qquad (10)$$

with probability at least 1 − d^{−c}.

The above result demonstrates that the regularized Huber estimator with an adaptive robustification parameter converges at the rate s^{1/2}{(log d)/n}^{min{δ/(1+δ), 1/2}} with overwhelming probability. Provided the observation noise has finite variance, the proposed estimator performs as well as the Lasso with sub-Gaussian errors. We advocate the adaptive Huber regression method since the sub-Gaussian condition often fails in practice (Wang, Peng, and Li 2015; Eklund, Nichols, and Knutsson 2016).

Remark 5. As pointed out by a reviewer, if one pursues a sparsity-adaptive approach, such as the SLOPE (Bogdan et al. 2015; Bellec, Lecué, and Tsybakov 2018), the upper bound on ℓ2-error can be improved from {s log(d)/n}^{1/2} to {s log(ed/s)/n}^{1/2}. With heavy-tailed observation noise, it is interesting to investigate whether this sharper bound can be achieved by a Huber-type regularized estimator. We leave this to future work as a significant amount of additional work is still needed. On the other hand, since log(ed/s) = 1 + log d − log s and s ≤ n, log(ed/s) scales the same as log d so long as log d > a log n for some a > 1.

Remark 6. Analogously to the low-dimensional case, here we impose the sample size scaling n ≳ s² log d under fixed designs. In the supplementary materials, we obtain minimax optimal ℓ1-, ℓ2- and prediction error bounds for β̂_{τ,λ} with random designs under the scaling n ≳ s log d.

Finally, we establish a matching lower bound for estimating β*. Recall the definition of U_n in Theorem 2.

Theorem 4 (Lower bound in high dimensions). Assume that the ε_i are independent from some distribution in P_δ^{v_δ}. Suppose that Condition 3 holds with k = 2s and γ = 0. Further assume that there exists a set 𝒜 with |𝒜| = s and u ∈ U_n such that ‖X_𝒜ᵀu/n‖_min ≥ α for some α > 0. Then, for any A > 0 and s-sparse estimator β̌ = β̌(y₁, ..., yₙ, A) possibly depending on A, we have

$$\sup_{P \in \mathcal{P}_\delta^{v_\delta}} \mathbb{P}\bigg\{\|\check\beta - \beta^*\|_2 \ge \frac{\alpha s^{1/2}}{\kappa_u}\, \nu_\delta \left(\frac{A \log d}{2n}\right)^{\min\{\delta/(1+\delta),\,1/2\}}\bigg\} \ge 2^{-1} d^{-A},$$

as long as n ≥ 2(A log d + log 2).

Together, Theorems 3 and 4 show that the regularized adaptive Huber estimator achieves the optimal rate of convergence in ℓ2-error. The proof, which is given in the supplementary materials, involves constructing a subclass of binomial distributions for the regression errors. Unifying the results in low and high dimensions, we arrive at the claim (5) and thus the phase transition in Figure 1.

4. Extension to Heavy-Tailed Designs

In this section, we extend the idea of adaptive Huber regression described in Section 2 to the case where both the covariate vector x and the regression error ε exhibit heavy tails. We focus on the high-dimensional regime d ≫ n, where β* ∈ ℝ^d is sparse with s = ‖β*‖₀ ≪ n. Observe that, for Huber regression, the linear part of the Huber loss penalizes the residuals, and therefore robustifies the quadratic loss in the sense that outliers in the response space (caused by heavy-tailed observation noise) are downweighted or removed. Since no robustification is imposed on the covariates, intuitively, the adaptive Huber estimator may not be robust against heavy-tailed covariates. In what follows, we modify the adaptive Huber regression to robustify both the covariates and regression errors.

To begin with, suppose we observe independent data {(y_i, x_i)}_{i=1}^n from (y, x), which follows the linear model y = ⟨x, β*⟩ + ε. To robustify x_i, we define truncated covariates x̃_i = (ψ_ℓ(x_{i1}), ..., ψ_ℓ(x_{id}))ᵀ, where ψ_ℓ(x) := min{max(−ℓ, x), ℓ} and ℓ > 0 is a tuning parameter. Then we consider the modified adaptive Huber estimator (see Fan et al. (2016) for a general robustification principle)

$$\hat\beta_{\tau,\ell,\lambda} \in \arg\min_{\beta \in \mathbb{R}^d} \big\{\widetilde{\mathcal{L}}_\tau(\beta) + \lambda \|\beta\|_1\big\}, \qquad (11)$$

where L̃_τ(β) = n⁻¹ Σ_{i=1}^n ℓ_τ(y_i − ⟨x̃_i, β⟩) and λ > 0 is a regularization parameter.

Let S be the true support of β* with sparsity |S| = s, and denote by H̃_τ(β) = ∇²L̃_τ(β) the Hessian matrix of the modified Huber loss. To investigate the deviation property of β̂_{τ,ℓ,λ}, we impose the following mild moment assumptions.

Condition 4. (i) E(ε) = 0, σ² = E(ε²) > 0 and v₃ := E(ε⁴) < ∞; (ii) the covariate vector x = (x₁, ..., x_d)ᵀ ∈ ℝ^d is independent of ε and satisfies M₄ := max_{1≤j≤d} E(x_j⁴) < ∞.

We are now in place to state the main result of this section. Theorem 5 demonstrates that the modified adaptive Huber estimator admits exponentially fast concentration when the covariates only have finite fourth moments, although at the cost of stronger scaling conditions.

Theorem 5. Assume Condition 4 holds and let H̃_τ(·) satisfy Condition 2 with k = 2s, γ = 3 and r > 12κ_l^{−1}λs. Then, the modified adaptive Huber estimator β̂_{τ,ℓ,λ} given in (11) satisfies, on the event E(τ, ℓ, λ) = {‖(∇L̃_τ(β*))_S‖∞ ≤ λ/2}, that

$$\|\hat\beta_{\tau,\ell,\lambda} - \beta^*\|_2 \le 3\kappa_l^{-1} s^{1/2} \lambda.$$

For any t > 0, let the triplet (τ, ℓ, λ) satisfy

$$\lambda \;\ge\; 2M_2\|\beta^*\|_2\, s^{1/2} \ell^{-2} \;+\; 8\big(v_2 M_2 + M_4\|\beta^*\|_2^3\big) s^{3/2} \ell^{-2} \;+\; 2\big(2\sigma^2 M_2 + 2M_4\|\beta^*\|_2^2\, s\big)^{1/2} \Big(\frac{t}{n}\Big)^{1/2} \;+\; \tau\,\frac{t}{n}, \qquad (12)$$

where v₂ = E(|ε|³) and M₂ = max_{1≤j≤d} E(x_j²). Then P{E(τ, ℓ, λ)} ≥ 1 − 2se^{−t}.

Remark 7. Assume that the quantities v₃, M₄, and ‖β*‖₂ are all bounded. Taking t ≍ log d in (12), we see that β̂_{τ,ℓ,λ} achieves a near-optimal convergence rate of order s{(log d)/n}^{1/2} when the parameters (τ, ℓ, λ) scale as

$$\tau \asymp s^{1/2}\Big(\frac{n}{\log d}\Big)^{1/4}, \quad \ell \asymp \Big(\frac{n}{\log d}\Big)^{1/4}, \quad \text{and} \quad \lambda \asymp \sqrt{\frac{s \log d}{n}}.$$

We remark here that the theoretically optimal τ is different from that in the sub-Gaussian design case. See Theorem B.2 in the supplementary materials.
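The covariate truncation entering (11) is elementwise and essentially one line of code. Below is a minimal sketch (names ours), with the truncation level ℓ scaled as in Remark 7 and all constants set to 1 purely for illustration.

```python
import numpy as np

def psi(x, ell):
    """Elementwise truncation psi_ell(x) = min(max(-ell, x), ell)."""
    return np.clip(x, -ell, ell)

rng = np.random.default_rng(3)
n, d = 200, 1000
X = rng.standard_t(4.5, size=(n, d))  # heavy-tailed design with finite 4th moments

ell = (n / np.log(d)) ** 0.25         # ell ~ (n / log d)^{1/4}, as in Remark 7
X_tilde = psi(X, ell)
print(ell, float(np.mean(X != X_tilde)))  # truncation level, fraction truncated
# X_tilde can now be fed to any regularized Huber solver, for example the
# LAMM sketch given after Algorithm 1 below.
```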
We then continue i N (0, I ).Inthissection,weset(n, d) = (100, 5), and generate with the iteration to produce next solution until the solution d ∞ regression errors from three different distributions: the normal sequence {β(k)} converges. A simple stopping criterion is k=1 distribution N (0, 4),thet-distribution with degrees of freedom  (k+1) − (k) ≤ −4 β β 2  for a sufficiently small ,say10 .We 1.5, and the log-normal distribution log N (0, 4).Botht and log- refer to Fan et al. (2018) for a detailed complexity analysis of the normal distributions are heavy-tailed and produce outliers with LAMM algorithm. high chance. The results on 2-error for adaptive Huber regression and the 6. Numerical Studies least squares estimator, averaged over 100 simulations, are sum- marized in Table 1. In the case of normally distributed noise, the 6.1. Tuning Parameter and Finite Sample Performance adaptive Huber estimator performs as well as the least squares. With heavy-tailed regression errors following Student’s t or log- For numerical studies and real data analysis, in the case where normal distribution, the adaptive Huber regression significantly the actual order of moments is unspecified, we presume the outperforms the least squares. These empirical results reveal variance is finite and therefore choose robustification and regu- that adaptive Huber regression prevails across various scenarios: larization parameters as follows     not only it provides more reliable estimators in the presence of / / neff 1 2 neff 1 2 heavy-tailed and/or asymmetric errors, but also loses almost no τ = cτ × σ and λ = cλ × σ , t t efficiency at the normal model. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 261

Figure 2. Negative log ℓ2-error versus δ in low (left panel) and high (right panel) dimensions.

Figure 3. Comparison between the (regularized) adaptive Huber estimator and the (regularized) least squares estimator under ℓ2-error.

6.2. Phase Transition

In this section, we validate the phase transition behavior of ‖β̂_τ − β*‖₂ empirically. We generate continuous responses according to (15), where β* and x_i are set the same way as before. We sample independent errors as ε_i ∼ t_df, Student's t-distribution with df degrees of freedom. Note that t_df has finite (1 + δ)th moments provided δ < df − 1. For 0 < δ ≤ 1, the theory predicts the following scaling of the ℓ2-error:

1. (Low dimension):
$$-\log\|\hat\beta_\tau - \beta^*\|_2 \sim \frac{\delta}{1+\delta}\log n - \frac{1}{1+\delta}\log(v_\delta);$$

2. (High dimension):
$$-\log\|\hat\beta_\tau - \beta^*\|_2 \sim \frac{\delta}{1+\delta}\log\frac{n}{\log d} - \frac{1}{1+\delta}\log(v_\delta).$$
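In low dimensions, this prediction can be checked with a few lines of code (a minimal version of the experiment; the grids, constants, and IRLS solver are our illustrative choices): regressing the negative log ℓ2-error on log n should give a slope close to δ/(1 + δ) ≈ 0.31 for t_{1.5} errors with δ = 0.45.

```python
import numpy as np

def huber_irls(X, y, tau, n_iter=50):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta

rng = np.random.default_rng(5)
d, reps, df, delta = 5, 100, 1.5, 0.45   # t_{1.5} errors: need delta < df - 1
beta_star = np.ones(d)

log_n, log_err = [], []
for n in (200, 400, 800, 1600, 3200):
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((n, d))
        y = X @ beta_star + rng.standard_t(df, size=n)
        tau = 0.5 * n ** (1 / (1 + delta))   # tau ~ n^{1/(1+delta)}
        errs.append(np.linalg.norm(huber_irls(X, y, tau) - beta_star))
    log_n.append(np.log(n)); log_err.append(np.log(np.mean(errs)))

slope = np.polyfit(log_n, -np.array(log_err), 1)[0]
print(slope, delta / (1 + delta))   # slope should be roughly delta/(1+delta)
```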

Figure 4. The ℓ2-error versus sample size n (left panel) and the ℓ2-error versus effective sample size n_eff = n/log d (right panel).

6.3. Effective Sample Size

In this section, we verify the scaling behavior of ‖β̂_τ − β*‖₂ with respect to the effective sample size. The data are generated in the same way as before except that the errors are drawn from t_{1.5}. As discussed in the previous subsection, we take δ = 0.45 and then choose the robustification parameter as τ = c_τ v̂_δ (n/log d)^{1/(1+δ)}, where v̂_δ is the (1 + δ)th sample absolute central moment. For simplicity, we take c_τ = 0.5 here, since our goal is to demonstrate the scaling behavior as n grows, instead of to achieve the best finite-sample performance.

The left panel of Figure 4 plots the ℓ2-error ‖β̂_{τ,λ} − β*‖₂ versus sample size over 200 repetitions when the dimension d ∈ {100, 500, 5000}. In all three settings, the ℓ2-error decays as the sample size grows. As expected, the curves shift to the right when the dimension increases. Theorem 3 provides a specific prediction about this scaling behavior: if we plot the ℓ2-error versus effective sample size (n/log d), the curves should align roughly with the theoretical curve

$$\|\hat\beta_{\tau,\lambda} - \beta^*\|_2 \asymp \Big(\frac{n}{\log d}\Big)^{-\delta/(1+\delta)}$$

for different values of d. This is validated empirically by the right panel of Figure 4. This near-perfect alignment in Figure 4 is also observed by Wainwright (2009) for Lasso with sub-Gaussian errors.

6.4. A Real Data Example: NCI-60 Cancer Cell Lines

We apply the proposed methodologies to the NCI-60, a panel of 60 diverse human cancer cell lines. The NCI-60 consists of data on 60 human cancer cell lines and can be downloaded from http://discover.nci.nih.gov/cellminer/. More details on data acquisition can be found in Shankavaram et al. (2007). Our aim is to investigate the effects of genes on protein expressions. The gene expression data were obtained with an Affymetrix HG-U133A/B chip, log2 transformed and normalized with the guanine cytosine robust multi-array analysis. We then combined the same gene expression variables measured by multiple different probes into one by taking their median, resulting in a set of p = 17,924 predictors. The protein expressions based on 162 antibodies were acquired via reverse-phase protein lysate arrays in their original scale. One observation had to be removed since all values were missing in the gene expression data, reducing the number of observations to n = 59.

We first center all the protein and gene expression variables to have mean zero, and then plot the histograms of the kurtosises of all expressions in Figure 5. The left panel in the figure shows that 145 out of 162 protein expressions have kurtosises larger than 3, and 49 larger than 9. In other words, more than 89.5% of the protein expression variables have tails heavier than the normal distribution, and about 30.2% are severely heavy-tailed with tails flatter than t₅, the t-distribution with 5 degrees of freedom. Similarly, about 36.5% of the gene expression variables, even after the log2-transformation, still exhibit empirical kurtosises larger than that of t₅. This suggests that, regardless of the normalization methods used, genomic data can still exhibit heavy-tailedness, which was also pointed out by Purdom and Holmes (2005).

We order the protein expression variables according to their scales, measured by the SD. We show the results for the protein expressions based on the KRT19 antibody, the protein keratin 19, which constitutes the variable with the largest SD, serving as one dependent variable. KRT19, a type I keratin, also known as Cyfra 21-1, is encoded by the KRT19 gene. Due to its high sensitivity, the KRT19 antibody is the most used marker for the tumor cells disseminated in lymph nodes, peripheral blood, and bone marrow of breast cancer patients (Nakata et al. 2004). We denote the adaptive Huber regression as AHuber, and that with truncated covariates as TAHuber. We then compare AHuber and TAHuber with Lasso. Both regularization and robustification parameters are chosen by ten-fold cross-validation. To measure the predictive performance, we consider a robust prediction loss: the mean absolute error (MAE) defined as

$$\mathrm{MAE}(\hat\beta) = \frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} \big| y_i^{\mathrm{test}} - \langle x_i^{\mathrm{test}}, \hat\beta \rangle \big|,$$

where y_i^test and x_i^test, i = 1, ..., n_test, denote the observations of the response and predictor variables in the test data, respectively. We report the MAE via leave-one-out cross-validation. Table 2 reports the MAE, model size, and selected genes for the considered methods.
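The leave-one-out MAE used here can be sketched as follows (fit_fn is a placeholder for any of the three fitted methods; the solvers sketched earlier could be plugged in, and the demo below uses plain least squares on hypothetical data purely to show the interface):

```python
import numpy as np

def loo_mae(X, y, fit_fn):
    """Leave-one-out cross-validated mean absolute error: for each i, fit on
    the remaining n-1 rows and record |y_i - <x_i, beta>|."""
    n = len(y)
    abs_errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta = fit_fn(X[mask], y[mask])
        abs_errs[i] = abs(y[i] - X[i] @ beta)
    return abs_errs.mean()

rng = np.random.default_rng(6)
X = rng.standard_normal((59, 10))
y = X @ rng.standard_normal(10) + rng.standard_t(2.0, size=59)
ols = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
print(loo_mae(X, y, ols))
```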

Figure 5. Histogram of kurtosises for the protein and gene expressions. The dashed red line at 3 is the kurtosis of a normal distribution.

Table 2. Mean absolute error (MAE) for protein expressions based on the KRT19 antibody from the NCI-60 cancer cell lines, computed from leave-one-out cross-validation. We also report the model size and selected genes for each method.

Method   MAE   Size   Selected genes
Lasso    7.64  42     FBLIM1, MT1E, EDN2, F3, FAM102B, S100A14, LAMB3, EPCAM, FN1, TM4SF1, UCHL1, NMU, ANXA3, PLAC8, SPP1, TGFBI, CD74, GPX3, EDN1, CPVL, NPTX2, TES, AKR1B10, CA2, TSPYL5, MAL2, GDA, BAMBI, CST6, ADAMTS15, DUSP6, BTG1, LGALS3, IFI27, MEIS2, TOX3, KRT23, BST2, SLPI, PLTP, XIST, NGFRAP1
AHuber   6.74  11     MT1E, ARHGAP29, CPCAM, VAMP8, MALL, ANXA3, MAL2, BAMBI, LGALS3, KRT19, TFF3
TAHuber  5.76  7      MT1E, ARHGAP29, MALL, ANXA3, MAL2, BAMBI, KRT19

TAHuber clearly shows the smallest MAE, followed by AHuber and Lasso. The Lasso produces a fairly large model despite the small sample. It has now been recognized that Lasso tends to select many noise variables along with the significant ones, especially when data exhibit heavy tails.

The Lasso selects a model with 42 genes but excludes the KRT19 gene, which encodes the protein keratin 19. AHuber finds 11 genes including KRT19. TAHuber results in a model with 7 genes: KRT19, MT1E, ARHGAP29, MALL, ANXA3, MAL2, BAMBI. First, KRT19 encodes the keratin 19 protein. It has been reported in Wu et al. (2008) that the MT1E expression is positively correlated with cancer cell migration and tumor stage, and the MT1E isoform was found to be present in estrogen receptor-negative breast cancer cell lines (Friedline et al. 1998). ANXA3 is highly expressed in all colon cell lines and all breast-derived cell lines positive for the oestrogen receptor (Ross et al. 2000). A very recent study in Zhou et al. (2017) suggested that silencing the ANXA3 expression by RNA interference inhibits the proliferation and invasion of breast cancer cells. Moreover, studies in Shangguan et al. (2012) and Kretzschmar (2000) showed that the BAMBI transduction significantly inhibited TGF-β/Smad signaling and expression of carcinoma-associated fibroblasts in human bone marrow mesenchymal stem cells (BM-MSCs), and disrupted the cytokine network mediating the interaction between MSCs and breast cancer cells. Consequently, the BAMBI transduction abolished protumor effects of BM-MSCs in vitro and in an orthotopic breast cancer xenograft model, and instead significantly inhibited growth and metastasis of coinoculated cancer. MAL2 expressions were shown to be elevated at both RNA and protein levels in breast cancer (Shehata et al. 2008). It has also been shown that MALL is associated with various forms of cancer (Oh et al. 2005; Landi et al. 2014). However, the effect of ARHGAP29 and MALL on breast cancer remains unclear and is worth further investigation.

Supplementary Materials

In the supplementary materials, we provide theoretical analysis under random designs, and proofs of all the theoretical results in this article.

Acknowledgments

The authors thank the editor, associate editor, and two anonymous referees for their valuable comments.

Funding

This work is supported by a Connaught Award, NSERC Grant RGPIN-2018-06484, NSF Grants DMS-1662139, DMS-1712591, and DMS-1811376, NIH Grant 2R01-GM072611-14, and NSFC Grant 11690014.

References

Alquier, P., Cottet, V., and Lecué, G. (2017), "Estimation Bounds and Sharp Oracle Inequalities of Regularized Procedures With Lipschitz Loss Functions," arXiv no. 1702.01402.
Belloni, A., and Chernozhukov, V. (2011), "ℓ1-Penalized Quantile Regression in High-Dimensional Sparse Models," The Annals of Statistics, 39, 82–130.
Bellec, P. C., Lecué, G., and Tsybakov, A. B. (2018), "Slope Meets Lasso: Improved Oracle Bounds and Optimality," The Annals of Statistics, 46, 3603–3642.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009), "Simultaneous Analysis of Lasso and Dantzig Selector," The Annals of Statistics, 37, 1705–1732.
Bogdan, M., van den Berg, E., Sabatti, C., Su, W., and Candès, E. J. (2015), "SLOPE—Adaptive Variable Selection via Convex Optimization," The Annals of Applied Statistics, 9, 1103–1140.
Brownlees, C., Joly, E., and Lugosi, G. (2015), "Empirical Risk Minimization for Heavy-Tailed Losses," The Annals of Statistics, 43, 2507–2536.
Bühlmann, P., and van de Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications, Heidelberg: Springer.
Catoni, O. (2012), "Challenging the Empirical Mean and Empirical Variance: A Deviation Study," Annales de l'Institut Henri Poincaré—Probabilités et Statistiques, 48, 1148–1185.
Catoni, O. (2016), "PAC-Bayesian Bounds for the Gram Matrix and Least Squares Regression With a Random Design," arXiv no. 1603.05229.
Chen, M., Gao, C., and Ren, Z. (2018), "Robust Covariance and Scatter Matrix Estimation Under Huber's Contamination Model," The Annals of Statistics, 46, 1932–1960.
Cont, R. (2001), "Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues," Quantitative Finance, 1, 223–236.
Delaigle, A., Hall, P., and Jin, J. (2011), "Robustness and Accuracy of Methods for High Dimensional Data Analysis Based on Student's t-Statistic," Journal of the Royal Statistical Society, Series B, 73, 283–301.
Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016), "Sub-Gaussian Mean Estimators," The Annals of Statistics, 44, 2695–2725.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.
Eklund, A., Nichols, T., and Knutsson, H. (2016), "Cluster Failure: Why fMRI Inferences for Spatial Extent Have Inflated False-Positive Rates," Proceedings of the National Academy of Sciences of the United States of America, 113, 7900–7905.
Fan, J., Fan, Y., and Barut, E. (2014), "Adaptive Robust Variable Selection," The Annals of Statistics, 42, 324–351.
Fan, J., Li, Q., and Wang, Y. (2017), "Estimation of High Dimensional Mean Regression in the Absence of Symmetry and Light Tail Assumptions," Journal of the Royal Statistical Society, Series B, 79, 247–265.
Fan, J., and Li, R. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., Liu, H., Sun, Q., and Zhang, T. (2018), "I-LAMM for Sparse Learning: Simultaneous Control of Algorithmic Complexity and Statistical Error," The Annals of Statistics, 46, 814–841.
Fan, J., Wang, W., and Zhu, Z. (2016), "A Shrinkage Principle for Heavy-Tailed Data: High-Dimensional Robust Low-Rank Matrix Recovery," arXiv no. 1603.08315.
Friedline, J. A., Garrett, S. H., Somji, S., Todd, J. H., and Sens, D. A. (1998), "Differential Expression of the MT-1E Gene in Estrogen-Receptor-Positive and -Negative Human Breast Cancer Cell Lines," The American Journal of Pathology, 152, 23–27.
Giulini, I. (2017), "Robust PCA and Pairs of Projections in a Hilbert Space," Electronic Journal of Statistics, 11, 3903–3926.
Hastie, T., Tibshirani, R., and Wainwright, M. J. (2015), Statistical Learning With Sparsity: The Lasso and Generalizations, Boca Raton, FL: CRC Press.
He, X., and Shao, Q.-M. (1996), "A General Bahadur Representation of M-Estimators and Its Application to Linear Regression With Nonstochastic Designs," The Annals of Statistics, 24, 2608–2630.
He, X., and Shao, Q.-M. (2000), "On Parameters of Increasing Dimensions," Journal of Multivariate Analysis, 73, 120–135.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1973), "Robust Regression: Asymptotics, Conjectures and Monte Carlo," The Annals of Statistics, 1, 799–821.
Koenker, R. (2005), Quantile Regression, New York: Cambridge University Press.
Kretzschmar, M. (2000), "Transforming Growth Factor-β and Breast Cancer: Transforming Growth Factor-β/Smad Signaling Defects and Cancer," Breast Cancer Research, 2, 107–115.
Landi, A., Vermeire, J., Iannucci, V., Vanderstraeten, H., Naessens, E., Bentahir, M., and Verhasselt, B. (2014), "Genome-Wide shRNA Screening Identifies Host Factors Involved in Early Endocytic Events for HIV-1-Induced CD4 Down-Regulation," Retrovirology, 11, 118–129.
Lepski, O. V. (1991), "Asymptotically Minimax Adaptive Estimation. I. Upper Bounds. Optimally Adaptive Estimates," Theory of Probability and Its Applications, 36, 682–697.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Liu, R. Y., Parelius, J. M., and Singh, K. (1999), "Multivariate Analysis by Data Depth: Descriptive Statistics, Graphics and Inference" (with discussion and a rejoinder by Liu and Singh), The Annals of Statistics, 27, 783–858.
Loh, P., and Wainwright, M. J. (2015), "Regularized M-Estimators With Nonconvexity: Statistical and Algorithmic Theory for Local Optima," Journal of Machine Learning Research, 16, 559–616.
Mammen, E. (1989), "Asymptotics With Increasing Dimension for Robust Regression With Applications to the Bootstrap," The Annals of Statistics, 17, 382–400.
Minsker, S. (2018), "Sub-Gaussian Estimators of the Mean of a Random Matrix With Heavy-Tailed Entries," The Annals of Statistics, 46, 2871–2903.
Mizera, I. (2002), "On Depth and Deep Points: A Calculus," The Annals of Statistics, 30, 1681–1736.
Mizera, I., and Müller, C. H. (2004), "Location-Scale Depth," Journal of the American Statistical Association, 99, 949–966.
Nakata, B., Takashima, T., Ogawa, Y., Ishikawa, T., and Hirakawa, K. (2004), "Serum CYFRA 21-1 (Cytokeratin-19 Fragments) Is a Useful Tumour Marker for Detecting Disease Relapse and Assessing Treatment Efficacy in Breast Cancer," British Journal of Cancer, 91, 873–878.
Oh, J. H., Yang, J. O., Hahn, Y., Kim, M. R., Byun, S. S., Jeon, Y. J., Kim, J. M., Song, K. S., Noh, S. M., Kim, S., and Yoo, H. S. (2005), "Transcriptome Analysis of Human Gastric Cancer," Mammalian Genome, 16, 942–954.
Portnoy, S. (1985), "Asymptotic Behavior of M Estimators of p Regression Parameters When p²/n Is Large; II. Normal Approximation," The Annals of Statistics, 13, 1403–1417.
Purdom, E., and Holmes, S. P. (2005), "Error Distribution for Gene Expression Data," Statistical Applications in Genetics and Molecular Biology, 4, 16.
Ross, D. T., Scherf, U., Eisen, M. B., Perou, C. M., Rees, C., Spellman, P., Iyer, W., Jeffrey, S. S., Van de Rijn, M., Pergamenschikov, A., Lee, J. C. F., Lashkari, D., Shalon, D., Myers, T. G., Weinstein, J. N., Botstein, D., and Brown, P. O. (2000), "Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines," Nature Genetics, 24, 227–235.
Shangguan, L., Ti, X., Krause, U., Hai, B., Zhao, Y., Yang, Z., and Liu, F. (2012), "Inhibition of TGF-β/Smad Signaling by BAMBI Blocks Differentiation of Human Mesenchymal Stem Cells to Carcinoma-Associated Fibroblasts and Abolishes Their Protumor Effects," Stem Cells, 30, 2810–2819.
Shankavaram, U. T., Reinhold, W. C., Nishizuka, S., Major, S., Morita, D., Chary, K. K., Reimers, M. A., Scherf, U., Kahn, A., Dolginow, D., Cossman, J., Kaldjian, E. P., Scudiero, D. A., Petricoin, E., Liotta, L., Lee, J. K., and Weinstein, J. N. (2007), "Transcript and Protein Expression Profiles of the NCI-60 Cancer Cell Panel: An Integromic Microarray Study," Molecular Cancer Therapeutics, 6, 820–832.
Shehata, M., Bièche, I., Boutros, R., Weidenhofer, J., Fanayan, S., Spalding, L., Zeps, N., Byth, K., Bright, R. K., Lidereau, R., and Byrne, J. A. (2008), "Nonredundant Functions for Tumor Protein D52-Like Proteins Support Specific Targeting of TPD52," Clinical Cancer Research, 14, 5050–5060.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tukey, J. W. (1975), "Mathematics and the Picturing of Data," in Proceedings of the International Congress of Mathematicians (Vol. 2), pp. 523–531.
Wainwright, M. J. (2009), "Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using ℓ1-Constrained Quadratic Programming (Lasso)," IEEE Transactions on Information Theory, 55, 2183–2202.
Wang, L. (2013), "The L1 Penalized LAD Estimator for High Dimensional Linear Regression," Journal of Multivariate Analysis, 120, 135–151.
Wang, L., Peng, B., and Li, R. (2015), "A High-Dimensional Nonparametric Multivariate Test for Mean Vector," Journal of the American Statistical Association, 110, 1658–1669.
Wu, Y., Siadaty, M. S., Berens, M. E., Hampton, G. M., and Theodorescu, D. (2008), "Overlapping Gene Expression Profiles of Cell Migration and Tumor Invasion in Human Bladder Cancer Identify Metallothionein E1 and Nicotinamide N-Methyltransferase as Novel Regulators of Cell Migration," Oncogene, 27, 6679–6689.
Yohai, V. J., and Maronna, R. A. (1979), "Asymptotic Behavior of M-Estimators for the Linear Model," The Annals of Statistics, 7, 258–268.
Zheng, Q., Peng, L., and He, X. (2015), "Globally Adaptive Quantile Regression With Ultra-High Dimensional Data," The Annals of Statistics, 43, 2225–2258.
Zhou, T., Li, Y., Yang, L., Liu, L., Ju, Y., and Li, C. (2017), "Silencing of ANXA3 Expression by RNA Interference Inhibits the Proliferation and Invasion of Breast Cancer Cells," Oncology Reports, 37, 388–398.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Function," The Annals of Statistics, 28, 461–482.