
arXiv:1010.2731v3 [math.ST] 12 Mar 2013

Statistical Science
2012, Vol. 27, No. 4, 538–557
DOI: 10.1214/12-STS400
(c) Institute of Mathematical Statistics, 2012

A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers

Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright and Bin Yu

Sahand Negahban is Postdoctoral Researcher, EECS Department, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, USA. Pradeep Ravikumar is Assistant Professor, Department of Computer Science, University of Texas, Austin, Texas 78712, USA. Martin J. Wainwright is Professor, Departments of EECS and Statistics, University of California, Berkeley, California 94720, USA. Bin Yu is Professor, Departments of Statistics and EECS, University of California, Berkeley, California 94720, USA.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2012, Vol. 27, No. 4, 538–557. This reprint differs from the original in pagination and typographic detail.

Abstract. High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse and structured matrices, low-rank matrices and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with some regularization function that encourages the assumed structure. This paper provides a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive some existing results, and also to obtain a number of new results on consistency and convergence rates, in both ℓ2-error and related norms. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure corresponding regularized M-estimators have fast convergence rates and which are optimal in many well-studied cases.

Key words and phrases: High-dimensional statistics, M-estimator, Lasso, group Lasso, sparsity, ℓ1-regularization, nuclear norm.

1. INTRODUCTION

High-dimensional statistics is concerned with models in which the ambient dimension of the problem p may be of the same order as, or substantially larger than, the sample size n. On the one hand, its roots are quite old, dating back to work on random matrix theory and high-dimensional testing problems (e.g., [24], [42], [54], [75]). On the other hand, the past decade has witnessed a tremendous surge of research activity. Rapid development of data collection technology is a major driving force: it allows for more observations to be collected (larger n) and also for more variables to be measured (larger p).

structured regularizers, as well as regularization based on the nuclear norm (sum of singular values).

Past Work
Exam- Within the framework of high-dimensional statis- ples are ubiquitous throughout science: astronomi- tics, the goal is to obtain bounds on a given perfor- cal projects such as the Large Synoptic Survey Tele- mance metric that hold with high for a scope (available at www.lsst.org) produce terabytes finite sample size, and provide explicit control on of data in a single evening; each sample is a high- the ambient dimension p, as well as other struc- resolution image, with several hundred megapixels, tural parameters such as the sparsity of a vector, so that p 108. Financial data is also of a high- degree of a graph or rank of matrix. Typically, such dimensional≫ nature, with hundreds or thousands of bounds show that the ambient dimension and struc- financial instruments being measured and tracked tural parameters can grow as some function of the over time, often at very fine time intervals for use in sample size n, while still having the statistical error high frequency trading. Advances in biotechnology decrease to zero. The choice of performance metric is now allow for measurements of thousands of genes or application-dependent; some examples include pre- proteins, and lead to numerous statistical challenges diction error, parameter estimation error and model (e.g., see the paper [6] and references therein). Vari- selection error. ous types of imaging technology, among them mag- By now, there are a large number of theoreti- netic resonance imaging in medicine [40] and hyper- cal results in place for various types of regularized spectral imaging in ecology [36], also lead to high- M-estimators.1 Sparse has perhaps dimensional data sets. been the most active area, and multiple bodies of In the regime p n, it is well known that con- work can be differentiated by the error metric un- ≫ sistent estimators cannot be obtained unless addi- der consideration. They include work on exact re- tional constraints are imposed on the model. Ac- covery for noiseless observations (e.g., [16, 20, 21]), cordingly, there are now several lines of work within prediction error consistency (e.g., [11, 25, 72, 79]), high-dimensional statistics, all of which are based on consistency of the parameter estimates in ℓ2 or some imposing some type of low-dimensional constraint other norm (e.g., [8, 11, 12, 14, 46, 72, 79]), as well as on the model space and then studying the behav- variable selection consistency (e.g., [45, 73, 81]). The ior of different estimators. Examples include linear information-theoretic limits of sparse linear regres- regression with sparsity constraints, estimation of sion are also well understood, and ℓ1-based meth- structured covariance or inverse covariance matrices, ods are known to be optimal for ℓq-ball sparsity [56] graphical , sparse principal compo- and near-optimal for model selection [74]. For gen- nent analysis, low-rank matrix estimation, matrix eralized linear models (GLMs), estimators based on decomposition problems and estimation of sparse ℓ1-regularized maximum likelihood have also been additive nonparametric models. The classical tech- studied, including results on risk consistency [71], nique of regularization has proven fruitful in all of consistency in the ℓ2 or ℓ1-norm [2, 30, 44] and these contexts. Many well-known estimators are based model selection consistency [9, 59]. 
Sparsity has also on solving a convex optimization problem formed proven useful in application to different types of ma- by the sum of a with a weighted reg- trix estimation problems, among them banded and ularizer; we refer to any such method as a regular- sparse covariance matrices (e.g., [7, 13, 22]). Another ized M-estimator. For instance, in application to lin- line of work has studied the problem of estimating ear models, the Lasso or basis pursuit approach [19, Gaussian Markov random fields or, equivalently, in- 67] is based on a combination of the least squares verse covariance matrices with sparsity constraints. loss with ℓ1-regularization, and so involves solving a Here there are a of results, including conver- quadratic program. Similar approaches have been gence rates in Frobenius, operator and other matrix applied to generalized linear models, resulting in norms [35, 60, 64, 82], as well as results on model se- more general (nonquadratic) convex programs with ℓ1-constraints. Several types of regularization have 1Given the extraordinary number of papers that have ap- been used for estimating matrices, including stan- peared in recent years, it must be emphasized that our refer- dard ℓ1-regularization, a wide range of sparse group- encing is necessarily incomplete. HIGH-DIMENSIONAL ANALYSIS OF REGULARIZED M-ESTIMATORS 3 lection consistency [35, 45, 60]. Motivated by appli- direct manner a large number of corollaries, some of cations in which sparsity arises in a structured man- them known and others novel. In concurrent work, ner, other researchers have proposed different types a subset of the current authors has also used this of block-structured regularizers (e.g., [3, 5, 28, 32, framework to prove several results on low-rank ma- 69, 70, 78, 80]), among them the group Lasso based trix estimation using the nuclear norm [51], as well on ℓ1/ℓ2-regularization. High-dimensional consisten- as minimax-optimal rates for noisy matrix comple- cy results have been obtained for exact recovery tion [52] and noisy matrix decomposition [1]. Fi- based on noiseless observations [5, 66], convergence nally, en route to establishing these corollaries, we rates in the ℓ2-norm (e.g., [5, 27, 39, 47]) as well also prove some new technical results that are of in- as model selection consistency (e.g., [47, 50, 53]). dependent interest, including guarantees of restricted Problems of low-rank matrix estimation also arise in strong convexity for group-structured regularization numerous applications. Techniques based on nuclear (Proposition 1). norm regularization have been studied for differ- The remainder of this paper is organized as fol- ent statistical models, including lows. We begin in Section 2 by formulating the class [37, 62], matrix completion [15, 31, 52, 61], multitask of regularized M-estimators that we consider, and regression [4, 10, 51, 63, 77] and system identifica- then defining the notions of decomposability and tion [23, 38, 51]. Finally, although the primary em- restricted strong convexity. Section 3 is devoted to phasis of this paper is on high-dimensional paramet- the statement of our main result (Theorem 1) and ric models, regularization methods have also proven discussion of its consequences. 
Subsequent sections effective for a class of high-dimensional nonparamet- are devoted to corollaries of this main result for dif- ric models that have a sparse additive decomposi- ferent statistical models, including sparse linear re- tion (e.g., [33, 34, 43, 58]), and have been shown to gression (Section 4) and estimators based on group- achieve minimax-optimal rates [57]. structured regularizers (Section 5). A number of tech- Our Contributions nical results are presented within the appendices in the supplementary file [49]. As we have noted previously, almost all of these estimators can be seen as particular types of regular- 2. PROBLEM FORMULATION AND SOME ized M-estimators, with the choice of loss function, KEY PROPERTIES regularizer and statistical assumptions changing ac- In this section we begin with a precise formulation cording to the model. This methodological similarity of the problem, and then develop some key proper- suggests an intriguing possibility: is there a common ties of the regularizer and loss function. set of theoretical principles that underlies analysis of all these estimators? If so, it could be possible 2.1 A Family of M-Estimators to gain a unified understanding of a large collection Let Zn := Z ,...,Z denote n identically dis- of techniques for high-dimensional estimation and 1 1 n tributed observations{ with} marginal distribution P, afford some insight into the literature. and suppose that we are interested in estimating The main contribution of this paper is to provide some parameter θ of the distribution P. Let : Rp an affirmative answer to this question. In particu- n R be a convex and differentiable lossL func-× lar, we isolate and highlight two key properties of a tionZ → that, for a given set of observations Zn, as- regularized M-estimator—namely, a decomposabil- 1 signs a cost (θ; Zn) to any parameter θ Rp. Let ity property for the regularizer and a notion of re- L 1 ∈ θ∗ arg minθ Rp (θ) be any minimizer of the pop- stricted strong convexity that depends on the inter- ∈ ∈ L n ulation risk (θ) := EZn [ (θ; Z )]. In order to esti- action between the regularizer and the loss func- 1 1 mate this quantityL basedL on the data Zn, we solve tion. For loss functions and regularizers satisfying 1 the convex optimization problem these two conditions, we prove a general result (The- orem 1) about consistency and convergence rates n (1) θλn arg min (θ; Z1 )+ λn (θ) , ∈ θ Rp{L R } for the associated estimators. This result provides ∈ a family of bounds indexed by subspaces, and each where λnb> 0 is a user-defined regularization penalty p bound consists of the sum of approximation error and : R R+ is a norm. Note that this setup and estimation error. This general result, when spe- allowsR for the→ possibility of misspecified models as cialized to different statistical models, yields in a well. 4 NEGAHBAN, RAVIKUMAR, WAINWRIGHT AND YU
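To make the abstract formulation concrete, the following minimal Python sketch solves one instance of the program (1), namely the least-squares loss with ℓ1-regularization (the Lasso of Section 4), by proximal gradient descent. The solver, step-size rule, iteration count and toy data are illustrative assumptions rather than part of the paper; any convex solver could be substituted.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_prox_grad(X, y, lam, step=None, n_iter=500):
    """Solve min_theta (1/2n)||y - X theta||_2^2 + lam ||theta||_1 by
    proximal gradient descent (ISTA): one concrete instance of the
    regularized M-estimator (1) with least-squares loss and l1 regularizer."""
    n, p = X.shape
    if step is None:
        step = n / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz const. of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n       # gradient of the least-squares loss
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# toy usage: sparse linear model y = X theta* + w (illustrative sizes)
rng = np.random.default_rng(0)
n, p, s = 100, 200, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)
lam = 4 * 0.1 * np.sqrt(np.log(p) / n)         # regularization of order sqrt(log p / n), cf. Section 4
theta_hat = lasso_prox_grad(X, y, lam)
print("l2 error:", np.linalg.norm(theta_hat - theta_star))
```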

Our goal is to provide general techniques for de- In order to build some intuition, let us consider riving bounds on the difference between any solution the ideal case = for the time being, so that M M θλn to the convex program (1) and the unknown vec- the decomposition (3) holds for all pairs (θ, γ) tor θ . In this paper we derive bounds on the quan- . For any given pair (θ, γ) of this form,∈ ∗ M × M⊥ tityb θλn θ∗ , where the error norm is induced the vector θ + γ can be interpreted as a pertur- by somek inner− k product , on Rp. Mostk · k often, this bation of the model vector θ away from the sub- h· ·i errorb norm will either be the Euclidean ℓ2-norm on space , and it is desirable that the regularizer vectors or the analogous Frobenius norm for matri- penalizeM such deviations as much as possible. By ces, but our theory also applies to certain types of the triangle inequality for a norm, we always have weighted norms. In addition, we provide bounds on (θ + γ) (θ)+ (γ), so that the decomposabil- R ≤ R R the quantity (θλn θ∗), which measures the error ity condition (3) holds if and only if the triangle in the regularizerR norm.− In the classical setting, the inequality is tight for all pairs (θ, γ) ( , ). It ∈ M M⊥ ambient dimensionb p stays fixed while the number is exactly in this setting that the regularizer penal- of observations n tends to infinity. Under these con- izes deviations away from the model subspace as ditions, there are standard techniques for proving much as possible. M consistency and asymptotic normality for the error In general, it is not difficult to find subspace pairs

θλn θ∗. In contrast, the analysis of this paper is all that satisfy the decomposability property. As a triv- within− a high-dimensional framework, in which the ial example, any regularizer is decomposable with tupleb (n,p), as well as other problem parameters, respect to = Rp and its orthogonal complement M such as vector sparsity or matrix rank, etc., are all ⊥ = 0 . As will be clear in our main theorem, it allowed to tend to infinity. In contrast to asymptotic isM of more{ } interest to find subspace pairs in which statements, our goal is to obtain explicit finite sam- the model subspace is “small,” so that the or- M ple error bounds that hold with high probability. thogonal complement ⊥ is “large.” To formalize this intuition, let us defineM the projection operator 2.2 Decomposability of R (4) Π (u) := arg min u v The first ingredient in our analysis is a property M v k − k ∈M of the regularizer known as decomposability, defined p with the projection Π defined in an analogous in terms of a pair of subspaces of R . The M⊥ M⊆ M manner. To simplify notation, we frequently use the role of the model subspace is to capture the con- shorthand u = Π (u) and u = Π (u). M ⊥ ⊥ straints specified by the model; for instance, it might Of interestM to usM are the actionM of theseM projection be the subspace of vectors with a particular support p operators on the unknown parameter θ∗ R . In (see Example 1) or a subspace of low-rank matrices the most desirable setting, the model subspace∈ (see Example 3). The orthogonal complement of the M can be chosen such that θ∗ θ∗ or, equivalently, space , namely, the set M ≈ M such that θ∗ 0. If this can be achieved with the p M⊥ ≈ (2) ⊥ := v R u, v = 0 for all u , model subspace remaining relatively small, then M { ∈ | h i ∈ M} our main theoremM guarantees that it is possible to is referred to as the perturbation subspace, represent- estimate θ∗ at a relatively fast rate. The following ing deviations away from the model subspace . In examples illustrate suitable choices of the spaces the ideal case, we have = , but ourM defi- M M⊥ M⊥ and in three concrete settings, beginning with nition allows for the possibility that is strictly the caseM of sparse vectors. larger than , so that is strictlyM smaller than M M⊥ ⊥. This generality is needed for treating the case Example 1 (Sparse vectors and ℓ1-norm regular- ofM low-rank matrices and nuclear norm, as discussed ization). Suppose the error norm is the usual k · k in Example 3 to follow. ℓ2-norm and that the model class of interest is the set of s-sparse vectors in p dimensions. For any par- Definition 1. Given a pair of subspaces ticular subset S 1, 2,...,p with cardinality s, we , a norm-based regularizer is decomposableM⊆with ⊆ { } M R define the model subspace respect to ( , ⊥) if M M p (5) (S) := θ R θj = 0 for all j / S . (θ + γ)= (θ)+ (γ) M { ∈ | ∈ } (3) R R R Here our notation reflects the fact that depends M for all θ and γ ⊥. explicitly on the chosen subset S. By construction, ∈ M ∈ M HIGH-DIMENSIONAL ANALYSIS OF REGULARIZED M-ESTIMATORS 5 we have Π (S)(θ∗)= θ∗ for any vector θ∗ that is that for ~α = (1, 1,..., 1), we obtain the standard ℓ1- supported onM S. penalty. Interestingly, our analysis shows that set- In this case, we may define (S)= (S) and ting ~α [2, ]N can often lead to superior statisti- ∈ ∞ G note that the orthogonal complementM withM respect cal performance. 
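The decomposability identity (3) for the ℓ1-norm of Example 1 is easy to check numerically. The short sketch below, with illustrative dimensions and random draws, builds a vector supported on S and a perturbation supported on its complement, and verifies that both the ℓ1-norm and a weighted ℓ1-norm add exactly over the pair (M(S), M⊥(S)).

```python
import numpy as np

rng = np.random.default_rng(1)
p, S = 10, [0, 3, 7]                       # support set S of cardinality s = 3
Sc = [j for j in range(p) if j not in S]

theta = np.zeros(p); theta[S] = rng.standard_normal(len(S))    # theta in M(S)
gamma = np.zeros(p); gamma[Sc] = rng.standard_normal(len(Sc))  # gamma in M_perp(S)

lhs = np.linalg.norm(theta + gamma, 1)
rhs = np.linalg.norm(theta, 1) + np.linalg.norm(gamma, 1)
assert np.isclose(lhs, rhs)                # R(theta + gamma) = R(theta) + R(gamma)

# The same identity holds for a weighted l1-norm with strictly positive weights omega.
omega = rng.uniform(0.5, 2.0, size=p)
lhs_w = np.sum(omega * np.abs(theta + gamma))
rhs_w = np.sum(omega * np.abs(theta)) + np.sum(omega * np.abs(gamma))
assert np.isclose(lhs_w, rhs_w)
print("decomposability verified for the l1 and weighted l1 norms")
```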
to the Euclidean inner product is given by We now show that the norm ,~α is again de- composable with respect to appropriatelyk · kG defined ⊥(S)= ⊥(S) M M subspaces. Indeed, given any subset S 1,...,N (6) G ⊆ { G} p of group indices, say, with cardinality s = S , we = γ R γj = 0 for all j S . { ∈ | ∈ } can define the subspace G | G| This set corresponds to the perturbation subspace, Rp (8) (S ) := θ θGt = 0 for all t / S capturing deviations away from the set of vectors M G { ∈ | ∈ G } with support S. We claim that for any subset S, the as well as its orthogonal complement with respect ℓ1-norm (θ)= θ 1 is decomposable with respect to the usual Euclidean inner product R k k to the pair ( (S), ⊥(S)). Indeed, by construc- M M ⊥(S )= ⊥(S ) tion of the subspaces, any θ (S) can be written M G M G ∈ M Rs (9) in the partitioned form θ = (θS, 0Sc ), where θS Rp p s ∈ := θ θGt = 0 for all t S . and 0Sc R − is a vector of zeros. Similarly, any { ∈ | ∈ G} vector γ∈ (S) has the partitioned representa- With these definitions, for any pair of vectors θ ∈ M⊥ ∈ tion (0S, γSc ). Putting together the pieces, we ob- (S ) and γ ⊥(S ), we have M G ∈ M G tain θ + γ ,~α = θGt + 0Gt αt k kG k k θ + γ 1 = (θS, 0) + (0, γSc ) 1 = θ 1 + γ 1, t S k k k k k k k k X∈ G showing that the ℓ1-norm is decomposable as claimed. (10) + 0 + γ k Gt Gt kαt t/S As a follow-up to the previous example, it is also X∈ G worth noting that the same argument shows that for = θ ,~α + γ ,~α, k kG k kG a strictly positive weight vector ω, the weighted ℓ1- norm θ := p ω θ is also decomposable with thus verifying the decomposability condition. k kω j=1 j| j| respect to the pair ( (S), (S)). For another nat- In the preceding example, we exploited the fact P ural extension, we nowM turnM to the case of sparsity that the groups were nonoverlapping in order to models with more structure. establish the decomposability property. Therefore, some modifications would be required in order to Example 2 (Group-structured norms). In many choose the subspaces appropriately for overlapping applications sparsity arises in a more structured fash- group regularizers proposed in past work [28, 29]. ion, with groups of coefficients likely to be zero (or nonzero) simultaneously. In order to model this be- Example 3 (Low-rank matrices and nuclear norm). p1 p2 havior, suppose that the index set 1, 2,...,p can Now suppose that each parameter Θ R × is ∈ be partitioned into a set of N disjoint{ groups,} say, a matrix; this corresponds to an instance of our G = G1, G2,...,GN . With this setup, for a given general setup with p = p1p2, as long as we identify G { G } N the space Rp1 p2 with Rp1p2 in the usual way. We vector ~α = (α1,...,αN ) [1, ] G , the associated × (1, ~α)-group norm takesG the∈ form∞ equip this space with the inner product Θ, Γ := trace(ΘΓT ), a choice which yields (as thehh inducedii N G norm) the Frobenius norm (7) θ ,~α := θGt αt . k kG k k t=1 p1 p2 X (11) Θ := Θ, Θ = Θ2 . For instance, with the choice ~α = (2, 2,..., 2), we ||| |||F hh ii v jk uj=1 k=1 obtain the group ℓ /ℓ -norm, corresponding to the p uX X 1 2 t regularizer that underlies the group Lasso [78]. On In many settings, it is natural to consider estimating the other hand, the choice ~α = ( ,..., ), corre- matrices that are low-rank; examples include princi- ∞ ∞ sponding to a form of block ℓ1/ℓ -regularization, pal component analysis, spectral clustering, collabo- has also been studied in past work∞ [50, 70, 80]. Note rative filtering and matrix completion. 
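The same kind of numerical check applies to the group norm (7). The following sketch, again with illustrative groups and random draws, evaluates the block (1, α) norm and verifies the decomposability identity (10) over the subspaces (8) and (9) for several choices of the exponent α.

```python
import numpy as np

def group_norm(theta, groups, alpha=2.0):
    """Block (1, alpha) group norm of equation (7): sum_t ||theta_{G_t}||_alpha."""
    return sum(np.linalg.norm(theta[g], alpha) for g in groups)

rng = np.random.default_rng(2)
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 10)]   # disjoint groups
S_G = [0]                                                        # active group indices
p = 10
theta = np.zeros(p); gamma = np.zeros(p)
for t, g in enumerate(groups):
    if t in S_G:
        theta[g] = rng.standard_normal(len(g))   # theta in M(S_G)
    else:
        gamma[g] = rng.standard_normal(len(g))   # gamma in M_perp(S_G)

for alpha in (2.0, 3.0, np.inf):
    lhs = group_norm(theta + gamma, groups, alpha)
    rhs = group_norm(theta, groups, alpha) + group_norm(gamma, groups, alpha)
    assert np.isclose(lhs, rhs)                  # identity (10)
print("group-norm decomposability verified")
```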
With certain 6 NEGAHBAN, RAVIKUMAR, WAINWRIGHT AND YU exceptions, it is computationally expensive to en- a low-rank matrix with a sparse matrix, along with force a rank-constraint in a direct manner, so that a the regularizer formed by a weighted sum of the nu- variety of researchers have studied the nuclear norm, clear norm and the elementwise ℓ1-norm. By a com- also known as the trace norm, as a surrogate for a bination of Examples 1 and 3, this regularizer also rank constraint. More precisely, the nuclear norm is satisfies the decomposability property with respect given by to appropriately defined subspaces.
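Decomposability of the nuclear norm over the subspaces (13a) and (13b) can also be illustrated numerically. The sketch below, with illustrative dimensions, picks r-dimensional subspaces U and V, builds a matrix with column space in U and row space in V, builds a perturbation supported on the orthogonal complements, and checks that the nuclear norm (12) is additive over the pair.

```python
import numpy as np

def nuclear_norm(A):
    """Sum of singular values, equation (12)."""
    return np.linalg.svd(A, compute_uv=False).sum()

rng = np.random.default_rng(3)
p1, p2, r = 8, 6, 2

# U, V: leading r left/right singular vectors of a random matrix (illustrative choice).
U, _, Vt = np.linalg.svd(rng.standard_normal((p1, p2)), full_matrices=True)
U_r, V_r = U[:, :r], Vt[:r, :].T
P_U, P_V = U_r @ U_r.T, V_r @ V_r.T          # orthogonal projectors onto U and V

# Theta in M(U, V): column space in U, row space in V   [cf. (13a)]
Theta = P_U @ rng.standard_normal((p1, p2)) @ P_V
# Gamma in the perturbation subspace: column/row spaces in the complements [cf. (13b)]
Gamma = (np.eye(p1) - P_U) @ rng.standard_normal((p1, p2)) @ (np.eye(p2) - P_V)

lhs = nuclear_norm(Theta + Gamma)
rhs = nuclear_norm(Theta) + nuclear_norm(Gamma)
assert np.isclose(lhs, rhs)                  # nuclear-norm decomposability
print("nuclear-norm decomposability verified:", lhs, rhs)
```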

min p1,p2 { } 2.3 A Key Consequence of Decomposability (12) Θ := σ (Θ), ||| |||nuc j j=1 Thus far, we have specified a class (1) of M-esti- X mators based on regularization, defined the notion where σ (Θ) are the singular values of the matrix { j } of decomposability for the regularizer and worked Θ. through several illustrative examples. We now turn The nuclear norm is decomposable with respect to the statistical consequences of decomposability— to appropriately chosen subspaces. Let us consider more specifically, its implications for the error vector Rp1 p2 the class of matrices Θ × that have rank ∆ = θ θ , where θ Rp is any solution of the r min p ,p . For any∈ given matrix Θ, we let λn λn − ∗ ∈ ≤ { 1 2} regularized M-estimation procedure (1). For a given row(Θ) Rp2 and col(Θ) Rp1 denote its row space ⊆ ⊆ innerb productb , , theb dual norm of is given by and column space, respectively. Let U and V be h· ·i R a given pair of r-dimensional subspaces U Rp1 u, v Rp2 ⊆ (14) ∗(v):= sup h i = sup u, v . and V ; these subspaces will represent left and R u Rp 0 (u) (u) 1h i ⊆ ∈ \{ } R R ≤ right singular vectors of the target matrix Θ∗ to be estimated. For a given pair (U, V ), we can define the This notion is best understood by working through p1 p2 subspaces (U, V ) and ⊥(U, V ) of R × given some examples. by M M Dual of ℓ1-norm For the ℓ1-norm (u) = u 1 p1 p2 R k k (U, V ) := Θ R × row(Θ) V, previously discussed in Example 1, let us compute (13a) M { ∈ | ⊆ its dual norm with respect to the Euclidean inner col(Θ) U product on Rp. For any vector v Rp, we have ⊆ } ∈ and p

p1 p2 sup u, v sup uk vk ⊥(U, V ) := Θ R × row(Θ) V ⊥, u 1 1h i≤ u 1 1 | || | (13b) M { ∈ | ⊆ k k ≤ k k ≤ Xk=1 col(Θ) U ⊥ . p ⊆ } sup uk max vk So as to simplify notation, we omit the indices (U, V ) ≤ | | k=1,...,p | | u 1 1 k=1 ! when they are clear from context. Unlike the preced- k k ≤ X ing examples, in this case, the set is not2 equal = v . M k k∞ to . We claim that this upper bound actually holds with Finally,M we claim that the nuclear norm is decom- equality. In particular, letting j be any index posable with respect to the pair ( , ⊥). By con- M M for which vj achieves the maximum v = struction, any pair of matrices Θ and Γ ∞ ⊥ max v| |, suppose that we form a vectork k u have orthogonal row and column∈ spaces, M which∈ M im- k=1,...,p k Rp with u =| sign(| v ) and u = 0 for all k = j. With∈ plies the required decomposability condition—name- j j k 6 this choice, we have u 1 1 and, hence, ly, Θ+Γ 1 = Θ 1 + Γ 1. k k ≤ ||| ||| ||| ||| ||| ||| p A line of recent work (e.g., [1, 17, 18, 26, 41, 76]) sup u, v ukvk = v , ∞ has studied matrix problems involving the sum of u 1 1h i≥ k k k k ≤ Xk=1 2 showing that the dual of the ℓ1-norm is the ℓ - However, as is required by our theory, we do have the ∞ inclusion . Indeed, given any Θ and Γ ⊥, norm. M⊆T M ∈M ∈ M we have Θ Γ = 0 by definition, which implies that Θ, Γ = T hh ii Dual of group norm Now recall the group norm trace(Θ Γ)=0. Since Γ ⊥ was arbitrary, we have shown ∈ M that Θ is orthogonal to the space ⊥, meaning that it must from Example 2, specified in terms of a vector ~α M ∈ belong to . [2, ]N . A similar calculation shows that its dual M ∞ G HIGH-DIMENSIONAL ANALYSIS OF REGULARIZED M-ESTIMATORS 7

Fig. 1. Illustration of the set C(M, M̄⊥; θ*) in the special case Δ = (Δ1, Δ2, Δ3) ∈ R^3 and regularizer R(Δ) = ‖Δ‖_1, relevant for sparse vectors (Example 1). This picture shows the case S = {3}, so that the model subspace is M(S) = {Δ ∈ R^3 | Δ1 = Δ2 = 0} and its orthogonal complement is given by M⊥(S) = {Δ ∈ R^3 | Δ3 = 0}. (a) In the special case when θ*_1 = θ*_2 = 0, so that θ* ∈ M, the set C(M, M̄⊥; θ*) is a cone. (b) When θ* does not belong to M, the set C(M, M̄⊥; θ*) is enlarged in the coordinates (Δ1, Δ2) that span M⊥. It is no longer a cone, but is still a star-shaped set.

norm, again with respect to the Euclidean norm on R^p, is given by

(15)  ‖v‖_{G*, ᾱ*} = max_{t=1,...,N_G} ‖v_{G_t}‖_{α*_t},

where 1/α_t + 1/α*_t = 1 are dual exponents. As special cases of this general duality relation, the block (1, 2) norm that underlies the usual group Lasso leads to a block (∞, 2) norm as the dual, whereas the block (1, ∞) norm leads to a block (∞, 1) norm as the dual.

Dual of nuclear norm. For the nuclear norm, the dual is defined with respect to the trace inner product on the space of matrices. For any matrix N ∈ R^{p1×p2}, it can be shown that

R*(N) = sup_{|||M|||_nuc ≤ 1} ⟨⟨M, N⟩⟩ = |||N|||_op = max_{j=1,...,min{p1,p2}} σ_j(N),

corresponding to the ℓ∞-norm applied to the vector σ(N) of singular values. In the special case of diagonal matrices, this fact reduces to the dual relationship between the vector ℓ1- and ℓ∞-norms.

The dual norm plays a key role in our general theory, in particular, by specifying a suitable choice of the regularization weight λ_n. We summarize in the following:

Lemma 1. Suppose that L is a convex and differentiable function, and consider any optimal solution θ̂ to the optimization problem (1) with a strictly positive regularization parameter satisfying

(16)  λ_n ≥ 2 R*(∇L(θ*; Z_1^n)).

Then for any pair (M, M̄⊥) over which R is decomposable, the error Δ̂ = θ̂_{λ_n} − θ* belongs to the set

(17)  C(M, M̄⊥; θ*) := {Δ ∈ R^p | R(Δ_{M̄⊥}) ≤ 3 R(Δ_{M̄}) + 4 R(θ*_{M⊥})}.

We prove this result in the supplementary appendix [49]. It has the following important consequence: for any decomposable regularizer and an appropriate choice (16) of regularization parameter, we are guaranteed that the error vector Δ̂ belongs to a very specific set, depending on the unknown vector θ*. As illustrated in Figure 1, the geometry of the set C depends on the relation between θ* and the model subspace M. When θ* ∈ M, then we are guaranteed that R(θ*_{M⊥}) = 0. In this case, the constraint (17) reduces to R(Δ_{M̄⊥}) ≤ 3 R(Δ_{M̄}), so that C is a cone, as illustrated in panel (a). In the more general case when θ* ∉ M so that R(θ*_{M⊥}) ≠ 0, the set C is not a cone, but rather a star-shaped set [panel (b)]. As will be clarified in the sequel, the case θ* ∉ M requires a more delicate treatment.

2.4 Restricted Strong Convexity

We now turn to an important requirement of the loss function and its interaction with the statistical


Fig. 2. Role of curvature in distinguishing parameters. (a) The loss function has high curvature around Δ̂: a small excess loss dL = |L(θ̂_{λ_n}) − L(θ*)| guarantees that the parameter error Δ̂ = θ̂_{λ_n} − θ* is also small. (b) A less desirable setting, in which the loss function has relatively low curvature around the optimum.

model. Recall that ∆= θλn θ∗ is the difference be- is twice differentiable, strong convexity amounts to − 2 tween an optimal solution θ and the true parame- lower bound on the eigenvalues of the Hessian (θ), λn ∇ L b b 3 holding uniformly for all θ in a neighborhood of θ∗. ter, and consider the loss difference (θλn ) (θ∗). In the classical setting, underb fairly mildL conditions,−L Under classical “fixed p, large n” scaling, the loss one expects that the loss difference shouldb converge function will be strongly convex under mild condi- tions. For instance, suppose that population risk to zero as the sample size n increases. It is important L to note, however, that such convergence on its own is strongly convex or, equivalently, that the Hessian 2 (θ) is strictly positive definite in a neighborhood is not sufficient to guarantee that θλn and θ∗ are ∇ L close or, equivalently, that ∆ is small. Rather, the of θ∗. As a concrete example, when the loss function closeness depends on the curvature ofb the loss func- is defined based on negative log likelihood of a sta- tisticalL model, then the Hessian 2 (θ) corresponds tion, as illustrated in Figureb 2. In a desirable set- ∇ L ting [panel (a)], the loss function is sharply curved to the Fisher information matrix, a quantity which arises naturally in asymptotic statistics. If the di- around its optimum θ , so that having a small loss λn mension p is fixed while the sample size n goes to difference (θ ) (θ ) translates to a small er- |L ∗ −L λn | infinity, standard arguments can be used to show ror ∆= θ θ . Panelb (b) illustrates a less desirable λn − ∗ that (under mild regularity conditions) the random setting, in which the lossb function is relatively flat, Hessian 2 (θ) converges to 2 (θ) uniformly for so that the loss difference can be small while the ∇ L ∇ L b b all θ in an open neighborhood of θ∗. In contrast, error ∆ is relatively large. whenever the pair (n,p) both increase in such a way The standard way to ensure that a function is “not that p>n, the situation is drastically different: the too flat”b is via the notion of strong convexity. Since Hessian matrix 2 (θ) is often singular. As a con- is differentiable by assumption, we may perform a ∇ L L crete example, consider linear regression based on first-order Taylor series expansion at θ∗ and in some p samples Zi = (yi,xi) R R , for i = 1, 2,...,n. Us- direction ∆; the error in this Taylor series is given by ∈ × 1 2 ing the least squares loss (θ)= 2n y Xθ 2, the 2 L 1 Tk − k δ (∆,θ∗) := (θ∗ + ∆) (θ∗) p p Hessian matrix (θ)= n X X has rank at (18) L L −L most× n, meaning that∇ theL loss cannot be strongly (θ∗), ∆ . − h∇L i convex when p>n. Consequently, it impossible to One way in which to enforce that is strongly con- guarantee global strong convexity, so that we need vex is to require the existence of someL positive con- to restrict the set of directions ∆ in which we require 2 stant κ> 0 such that δ (∆,θ∗) κ ∆ for all ∆ a curvature condition. p L ≥ k k ∈ R in a neighborhood of θ∗. When the loss function Ultimately, the only direction of interest is given by the error vector ∆= θ θ . Recall that Lemma 1 λn − ∗ 3To simplify notation, we frequently write (θ) as short- guarantees that, for suitable choices of the regular- n nL hand for (θ; Z ) when the underlying data Z is clear from ization parameter λbn, thisb error vector must belong L 1 1 context. to the set C( , ; θ ), as previously defined (17). M M⊥ ∗ HIGH-DIMENSIONAL ANALYSIS OF REGULARIZED M-ESTIMATORS 9
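Lemma 1 and the set C can be explored numerically for the Lasso. In the sketch below (illustrative synthetic data and a basic ISTA solver, not the paper's method), the regularization parameter is set to twice the dual norm R*(∇L(θ*)) = ‖Xᵀw/n‖_∞, as required by (16), and the two sides of the cone condition ‖Δ_{S^c}‖_1 ≤ 3‖Δ_S‖_1, to which (17) reduces when θ* ∈ M(S), are printed for comparison.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Basic proximal gradient (ISTA) solver for the Lasso; illustrative only."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2
    theta = np.zeros(p)
    for _ in range(n_iter):
        theta = soft_threshold(theta - step * X.T @ (X @ theta - y) / n, step * lam)
    return theta

rng = np.random.default_rng(4)
n, p, s, sigma = 200, 500, 5, 0.5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
support = rng.choice(p, s, replace=False)
theta_star[support] = rng.standard_normal(s)
w = sigma * rng.standard_normal(n)
y = X @ theta_star + w

# Dual norm of the gradient at theta*: for the l1 regularizer this is the
# l_infinity norm of grad L(theta*) = -X^T w / n; Lemma 1 asks for lam >= 2 R*(grad L).
grad_at_truth = -X.T @ w / n
lam = 2 * np.linalg.norm(grad_at_truth, np.inf)

Delta = lasso_ista(X, y, lam) - theta_star
mask = np.zeros(p, dtype=bool); mask[support] = True
print("||Delta_Sc||_1   :", np.linalg.norm(Delta[~mask], 1))
print("3 ||Delta_S||_1  :", 3 * np.linalg.norm(Delta[mask], 1))
```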

Fig. 3. (a) Illustration of a generic loss function in the high-dimensional (p > n) setting: it is curved in certain directions, but completely flat in others. (b) When θ* ∉ M, the set C(M, M̄⊥; θ*) contains a ball centered at the origin, which necessitates a tolerance term τ_L(θ*) > 0 in the definition of restricted strong convexity.

Consequently, it suffices to ensure that the function case of ℓ1-regularization, for covariates with suitably is strongly convex over this set, as formalized in the controlled tails, this type of bound holds for the log p following: least squares loss with the function g(n,p)= n ; see equation (31) to follow. For generalized linear Definition 2. The loss function satisfies a re- models and the ℓ -norm, a similar type of bound is stricted strong convexity (RSC) condition with cur- 1 given in equation (43). We also provide a bound of vature κ > 0 and tolerance function τ if L L this form for the least-squares loss group-structured 2 2 δ (∆,θ∗) κ ∆ τ (θ∗) norms in equation (46), with a different choice of the L ≥ Lk k − L (19) function g depending on the group structure. for all ∆ C( , ; θ ). ∈ M M⊥ ∗ A bound of the form (20) implies a form of re- In the simplest of cases—in particular, when θ stricted strong convexity as long as (∆) is not “too ∗ R —there are many statistical models for which this∈ large” relative to ∆ . In order to formalize this no- M k k RSC condition holds with tolerance τ (θ∗)=0. In tion, we define a quantity that relates the error norm the more general setting, it can holdL only with a and the regularizer: nonzero tolerance term, as illustrated in Figure 3(b). Definition 3 (Subspace compatibility constant). As our proofs will clarify, we in fact require only the Rp C For any subspace of , the subspace compati- lower bound (19) to hold for the intersection of bility constant withM respect to the pair ( , ) is with a local ball ∆ R of some radius centered given by R k · k at zero. As will be{k clarifiedk≤ } later, this restriction is not necessary for the least squares loss, but is es- (u) (21) Ψ( ):= sup R . sential for more general loss functions, such as those M u 0 u that arise in generalized linear models. ∈M\{ } k k We will see in the sequel that for many loss func- This quantity reflects the degree of compatibility tions, it is possible to prove that with high probabil- between the regularizer and the error norm over the subspace . In alternative terms, it is the Lipschitz ity the first-order Taylor series error satisfies a lower M bound of the form constant of the regularizer with respect to the error norm, restricted to the subspace . As a simple 2 2 M δ (∆,θ∗) κ1 ∆ κ2g(n,p) (∆) example, if is a s-dimensional coordinate sub- L ≥ k k − R (20) space, with regularizerM (u)= u and error norm for all ∆ 1, R k k1 k k≤ u = u 2, then we have Ψ( )= √s. k k k k M where κ1, κ2 are positive constants and g(n,p) is a This compatibility constant appears explicitly in function of the sample size n and ambient dimension the bounds of our main theorem and also arises p, decreasing in the sample size. For instance, in the in establishing restricted strong convexity. Let us 10 NEGAHBAN, RAVIKUMAR, WAINWRIGHT AND YU now illustrate how it can be used to show that the tation, we can now state the main result of this pa- condition (20) implies a form of restricted strong per: convexity. To be concrete, let us suppose that θ ∗ Theorem 1 (Bounds for general models). Un- belongs to a subspace ; in this case, member- der conditions (G1) and (G2), consider the problem ship of ∆ in the set CM( , ; θ ) implies that M M⊥ ∗ (1) based on a strictly positive regularization con- (∆ ¯ ) 3 (∆ ¯ ). Consequently, by the triangle R M⊥ ≤ R M stant λ 2 ( (θ )). 
Then any optimal solution inequality and the definition (21), we have n ≥ R∗ ∇L ∗ θλn to the convex program (1) satisfies the bound (∆) (∆ ¯ )+ (∆ ¯ ) 4 (∆ ¯ ) R ≤ R M⊥ R M ≤ R M 2 2 λn 2 4Ψ( ) ∆ . b θλ θ∗ 9 Ψ ( ) ≤ M k k k n − k ≤ κ 2 M Therefore, whenever a bound of the form (20) holds (22) L b λn 2 and θ∗ , we are guaranteed that + 2τ (θ∗) + 4 (θ∗ ) . ∈ M ⊥ 2 2 κ { L R M } δ (∆,θ∗) κ 16κ Ψ ( )g(n,p) ∆ L L ≥ { 1 − 2 M }k k Remarks. Let us consider in more detail some for all ∆ 1. different features of this result. k k≤ Consequently, as long as the sample size is large (a) It should be noted that Theorem 1 is actually 2 κ1 enough that 16κ2Ψ ( )g(n,p) < 2 , the restricted a deterministic statement about the set of optimiz- strong convexity conditionM will hold with κ = κ1 L 2 ers of the convex program (1) for a fixed choice of and τ (θ∗) = 0. We make use of arguments of this L λn. Although the program is convex, it need not be flavor throughout this paper. strictly convex, so that the global optimum might be attained at more than one point θ . The stated 3. BOUNDS FOR GENERAL M-ESTIMATORS λn bound holds for any of these optima. Probabilistic We are now ready to state a general result that analysis is required when Theorem b1 is applied to provides bounds and hence convergence rates for particular statistical models, and we need to verify the error θλn θ∗ , where θλn is any optimal so- that the regularizer satisfies the condition lution of thek convex− k program (1). Although it may (23) λ 2 ∗( (θ∗)) appear somewhatb abstract atb first sight, this result n ≥ R ∇L has a number of concrete and useful consequences and that the loss satisfies the RSC condition. A chal- for specific models. In particular, we recover as an lenge here is that since θ∗ is unknown, it is usu- immediate corollary the best known results about ally impossible to compute the right-hand side of estimation in sparse linear models with general de- the condition (23). Instead, when we derive conse- signs [8, 46], as well as a number of new results, quences of Theorem 1 for different statistical mod- including minimax-optimal rates for estimation un- els, we use concentration inequalities in order to pro- der ℓq-sparsity constraints and estimation of block- vide bounds that hold with high probability over the structured sparse matrices. In results that we report data. elsewhere, we also apply these theorems to establish- (b) Second, note that Theorem 1 actually pro- ing results for sparse generalized linear models [48], vides a family of bounds, one for each pair ( , ⊥) estimation of low-rank matrices [51, 52], matrix de- of subspaces for which the regularizer is decompos-M M composition problems [1] and sparse nonparametric able. Ignoring the term involving τ for the , regression models [57]. for any given pair, the error boundL is the sum of two Let us recall our running assumptions on the struc- terms, corresponding to estimation error err and ture of the convex program (1). approximation error , given by, respectively,E Eapp (G1) The regularizer is a norm and is decom- 2 R λn 2 posable with respect to the subspace pair ( , ), err := 9 Ψ ( ) and M M⊥ E κ 2 M where . (24) L M⊆ M (G2) The loss function is convex and differen- λn L app := 4 (θ∗ ). tiable, and satisfies restricted strong convexity with E κ R M⊥ curvature κ and tolerance τ . 
L As the dimension of the subspace increases (so L L M The reader should also recall the definition (21) of that the dimension of ⊥ decreases), the approxi- the subspace compatibility constant. With this no- mation error tends toM zero. But since , the M⊆ M HIGH-DIMENSIONAL ANALYSIS OF REGULARIZED M-ESTIMATORS 11 estimation error is increasing at the same time. Thus, bound (23). The bound (25b) on the error measured in the usual way, optimal rates are obtained by choos- in the regularizer norm is similar, except that it ing and so as to balance these two contri- scales quadratically with the subspace compatibil- butionsM to theM error. We illustrate such choices for ity constant. As the proof clarifies, this additional various specific models to follow. dependence arises since the regularizer over the sub- (c) As will be clarified in the sequel, many high- space is larger than the norm by a factor of M k · k dimensional statistical models have an unidentifi- at most Ψ( ) (see Definition 3). M able component, and the tolerance term τ reflects Obtaining concrete rates using Corollary 1 requires the degree of this nonidentifiability. L some work in order to verify the conditions of Theo- rem 1 and to provide control on the three quantities A large body of past work on sparse linear re- in the bounds (25a) and (25b), as illustrated in the gression has focused on the case of exactly sparse examples to follow. regression models for which the unknown regression vector θ∗ is s-sparse. For this special case, recall 4. CONVERGENCE RATES FOR SPARSE from Example 1 in Section 2.2 that we can define an REGRESSION s-dimensional subspace that contains θ∗. Conse- M As an illustration, we begin with one of the sim- quently, the associated set C( , ⊥; θ∗) is a cone [see Figure 1(a)], and it is thusM possibleM to estab- plest statistical models, namely, the standard linear lish that restricted strong convexity (RSC) holds model. It is based on n observations Zi = (xi,yi) Rp R Rn ∈ with tolerance parameter τ (θ )=0. This same rea- of covariate-response pairs. Let y de- ∗ × ∈Rn p soning applies to other statisticalL models, among note a vector of the responses, and let X × be the design matrix, where x Rp is the ith∈ row. This them group-sparse regression, in which a small sub- i ∈ set of groups are active, as well as low-rank ma- pair is linked via the linear model trix estimation. The following corollary provides a (26) y = Xθ∗ + w, simply stated bound that covers all of these mod- where θ Rp is the unknown regression vector and els: ∗ w Rn is∈ a noise vector. To begin, we focus on this Corollary 1. Suppose that, in addition to the simple∈ linear setup and describe extensions to gen- conditions of Theorem 1, the unknown θ∗ belongs to eralized models in Section 4.4. n n n p and the RSC condition holds over C( , ,θ∗) Given the data set Z = (y, X) R R , our M M M 1 ∈ × × with τ (θ∗) = 0. Then any optimal solution θλn to goal is to obtain a “good” estimate θ of the regres- L the convex program (1) satisfies the bounds sion vector θ∗, assessed either in terms of its ℓ2-error 2 b θ θ∗ 2 or its ℓ1-error θ θ∗ 1. Itb is worth not- λn 2 k − k k − k (25a) θ θ∗ 9 Ψ ( ) ing that whenever p>n, the standard linear model k λn − k≤ κ M L (26b ) is unidentifiable in ab certain sense, since the Rn p and b rectangular matrix X × has a null space of dimension at least p ∈n. Consequently, in order to λn 2 − (25b) (θλn θ∗) 12 Ψ ( ). 
obtain an identifiable model—or at the very least, to R − ≤ κ M L bound the degree of nonidentifiability—it is essen- Focusing firstb on the bound (25a), it consists of tial to impose additional constraints on the regres- three terms, each of which has a natural interpreta- sion vector θ∗. One natural constraint is some type tion. First, it is inversely proportional to the RSC of sparsity in the regression vector; for instance, one constant κ , so that higher curvature guarantees might assume that θ∗ has at most s nonzero coef- L lower error, as is to be expected. The error bound ficients, as discussed at more length in Section 4.2. grows proportionally with the subspace compatibil- More generally, one might assume that although θ∗ ity constant Ψ( ), which measures the compatibil- is not exactly sparse, it can be well-approximated by M ity between the regularizer and error norm a sparse vector, in which case one might say that θ R k · k ∗ over the subspace (see Definition 3). This term is “weakly sparse,” “sparsifiable” or “compressible.” M increases with the size of subspace , which con- Section 4.3 is devoted to a more detailed discussion tains the model subspace . Third,M the bound also of this weakly sparse case. M scales linearly with the regularization parameter λn, A natural M-estimator for this problem is the which must be strictly positive and satisfy the lower Lasso [19, 67], obtained by solving the ℓ1-penalized 12 NEGAHBAN, RAVIKUMAR, WAINWRIGHT AND YU quadratic program This type of ℓ1-based RE condition is less restrictive than the corresponding ℓ2-version (29). We refer the 1 2 (27) θλn arg min y Xθ 2 + λn θ 1 reader to the paper by van de Geer and B¨uhlmann ∈ θ Rp 2nk − k k k ∈   [72] for an extensive discussion of different types of for someb choice λn > 0 of regularization parameter. restricted eigenvalue or compatibility conditions. Note that this Lasso estimator is a particular case It is natural to ask whether there are many ma- of the general M-estimator (1), based on the loss trices that satisfy these types of RE conditions. If n 1 function and regularization pair (θ; Z1 )= 2n y X has i.i.d. entries following a sub-Gaussian distri- 2 p L k − Xθ and (θ)= θj = θ 1. We now show bution (including Gaussian and Bernoulli variables k2 R j=1 | | k k how Theorem 1 can be specialized to obtain bounds as special cases), then known results in random ma- P on the error θλn θ∗ for the Lasso estimate. trix theory imply that the restricted isometry prop- − 4.1 Restricted Eigenvalues for Sparse Linear erty [14] holds with high probability, which in turn Regressionb implies that the RE condition holds [8, 72]. Since statistical applications involve design matrices with For the least squares loss function that under- substantial dependency, it is natural to ask whether lies the Lasso, the first-order Taylor series expansion an RE condition also holds for more general random from Definition 2 is exact, so that designs. This question was addressed by Raskutti et 1 1 al. [55, 56], who showed that if the design matrix T 2 n p δ (∆,θ∗)= ∆, X X∆ = X∆ 2. X R is formed by independently each L n nk k ∈ ×   row Xi N(0, Σ), referred to as the Σ-Gaussian Thus, in this special case, the Taylor series error is ensemble∼, then there are strictly positive constants independent of θ , a fact which allows for substantial ∗ (κ1, κ2), depending only on the positive definite ma- theoretical simplification. 
More precisely, in order trix Σ, such that to establish restricted strong convexity, it suffices 2 2 Xθ 2 2 log p 2 to establish a lower bound on X∆ 2/n that holds k k κ θ κ θ k k n ≥ 1k k2 − 2 n k k1 uniformly for an appropriately restricted subset of (31) p-dimensional vectors ∆. for all θ Rp As previously discussed in Example 1, for any ∈ with probability greater than 1 c1 exp( c2n). The subset S 1, 2,...,p , the ℓ1-norm is decompos- − − able with⊆ respect { to the} subspace (S)= θ Rp bound (31) has an important consequence: it guar- M { ∈ | antees that the RE property (29) holds4 with κ = θSc = 0 and its orthogonal complement. When the κ1 L } p > 0 as long as n> 64(κ2/κ1)s log p. Therefore, unknown regression vector θ∗ R is exactly sparse, 2 it is natural to choose S equal∈ to the support set of not only do there exist matrices satisfying the RE property (29), but any matrix sampled from a Σ- θ∗. By appropriately specializing the definition (17) of C, we are led to consider the cone Gaussian ensemble will satisfy it with high proba- bility. Related analysis by Rudelson and Zhou [65] p (28) C(S) := ∆ R ∆ c 3 ∆ . { ∈ | k S k1 ≤ k Sk1} extends these types of guarantees to the case of sub- See Figure 1(a) for an illustration of this set in three Gaussian designs, also allowing for substantial de- dimensions. With this choice, restricted strong con- pendencies among the covariates. vexity with respect to the ℓ2-norm is equivalent to 4.2 Lasso Estimates with Exact Sparsity requiring that the design matrix X satisfy the con- dition We now show how Corollary 1 can be used to de- 2 rive convergence rates for the error of the Lasso es- Xθ 2 2 C (29) k k κ θ 2 for all θ (S). timate when the unknown regression vector θ∗ is n ≥ Lk k ∈ s-sparse. In order to state these results, we require n This lower bound is a type of restricted eigenvalue some additional notation. Using Xj R to denote (RE) condition and has been studied in past work ∈ on basis pursuit and the Lasso (e.g., [8, 46, 56, 72]). 4To see this fact, note that for any θ C(S), we have ∈ One could also require that a related condition hold θ 1 4 θS 1 4√s θS 2. Given the lower bound (31), k k ≤ k k ≤ k k Xθ 2 with respect to the ℓ1-norm—viz. for any θ C(S), we have the lower bound k k κ1 ∈ √n ≥ { − 2 2 4κ s log p θ κ1 θ , where final inequality follows as Xθ 2 θ 1 2q n 2 2 2 (30) k k κ ′ k k for all θ C(S). }k k ≥ 2k k n ≥ L S ∈ long as n> 64(κ2/κ1) s log p. | | HIGH-DIMENSIONAL ANALYSIS OF REGULARIZED M-ESTIMATORS 13 the jth column of X, we say that X is column- The final step is to compute an appropriate choice normalized if of the regularization parameter. The gradient of the 1 X quadratic loss is given by (θ; (y, X)) = XT w, (32) k jk2 1 for all j = 1, 2,...,p. ∇L n √n ≤ whereas the dual norm of the ℓ1-norm is the ℓ - norm. Consequently, we need to specify a choice∞ of Here we have set the upper bound to one in order to λ > 0 such that simplify notation. This particular choice entails no n loss of generality, since we can always rescale the lin- 1 T λ 2 ∗( (θ∗))=2 X w ear model appropriately (including the observation n ≥ R ∇L n noise ) so that it holds. ∞ with high probability. Using the column normaliza- In addition, we assume that the noise vector w tion (32) and sub-Gaussian (33) conditions, for each Rn is zero- and has sub-Gaussian tails, mean-∈ j = 1,...,p, we have the tail bound P[ Xj, w /n ing that there is a constant σ> 0 such that for any 2 |h i |≥ t] 2 exp( nt ). 
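The lower bound (31) can also be examined by simulation, as in the sketch below: a design is drawn from a Σ-Gaussian ensemble with an AR(1) covariance and the inequality is checked over randomly drawn sparse directions. The covariance, the constants (κ1, κ2) and the sampling scheme are illustrative assumptions, and the check is empirical rather than a proof.

```python
import numpy as np

# Empirical look at the lower bound (31):
#   ||X theta||_2 / sqrt(n) >= kappa1 ||theta||_2 - kappa2 sqrt(log p / n) ||theta||_1.
rng = np.random.default_rng(6)
n, p, rho = 150, 400, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1) covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

kappa1, kappa2 = 0.25, 1.0          # assumed constants for this check
worst = np.inf
for _ in range(2000):
    s = rng.integers(1, 20)
    theta = np.zeros(p)
    theta[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
    lhs = np.linalg.norm(X @ theta) / np.sqrt(n)
    rhs = (kappa1 * np.linalg.norm(theta)
           - kappa2 * np.sqrt(np.log(p) / n) * np.linalg.norm(theta, 1))
    worst = min(worst, lhs - rhs)
print("smallest observed (lhs - rhs):", worst, " (nonnegative means the bound held on these draws)")
```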
Consequently, by union bound, we fixed v 2 = 1, 2σ2 ≤ − 2 k k P T nt 2 conclude that [ X w/n t] 2 exp( 2σ2 + δ k 2 k∞ ≥ ≤ − (33) P[ v, w t] 2exp for all δ> 0. 2 4σ log p |h i| ≥ ≤ −2σ2 log p). Setting t = n , we see that the choice   of λn given in the statement is valid with probabil- For instance, this condition holds when the noise 2 ity at least 1 c1 exp( c2nλn ). Consequently, the vector w has i.i.d. N(0, 1) entries or consists of in- claims (34) follow− from− the bounds (25a) and (25b) dependent bounded random variables. Under these in Corollary 1.  conditions, we recover as a corollary of Theorem 1 the following result: 4.3 Lasso Estimates with Weakly Sparse Models

Corollary 2. Consider an s-sparse instance of We now consider regression models for which θ∗ is the linear regression model (26) such that X satisfies not exactly sparse, but rather can be approximated the RE condition (29) and the column normalization well by a sparse vector. One way in which to for- condition (32). Given the Lasso program (27) with malize this notion is by considering the ℓq “ball” of log p radius Rq, given by regularization parameter λn = 4σ n , then with 2 p probability at least 1 c1 exp( c2nλqn ), any optimal p q − − Bq(Rq) := θ R θi Rq solution θλ satisfies the bounds ∈ | | ≤ n ( i=1 ) 2 X 2 64σ s log p b θ θ∗ and where q [0, 1] is fixed. k λn − k2 ≤ κ 2 n ∈ (34) L In the special case q = 0, this set corresponds to b 24σ log p an exact sparsity constraint—that is, θ B (R ) if θ θ s . ∗ 0 0 λn ∗ 1 and only if θ has at most R nonzero entries.∈ More k − k ≤ κ r n ∗ 0 L generally, for q (0, 1], the set B (R ) enforces a Althoughb error bounds of this form are known ∈ q q from past work (e.g., [8, 14, 46]), our proof illumi- certain decay rate on the ordered absolute values nates the underlying structure that leads to the dif- of θ∗. ferent terms in the bound—in particular, see equa- In the case of weakly sparse vectors, the constraint C tions (25a) and (25b) in the statement of Corol- set takes the form lary 1. C( , ; θ∗) (35) M M Proof of Corollary 2. We first note that the p = ∆ R ∆ c 3 ∆ + 4 θ∗ c . RE condition (30) implies that RSC holds with re- { ∈ | k S k1 ≤ k S k1 k S k1} spect to the subspace (S). As discussed in In contrast to the case of exact sparsity, the set C is M Example 1, the ℓ1-norm is decomposable with re- no longer a cone, but rather contains a ball centered spect to (S) and its orthogonal complement, so at the origin—compare panels (a) and (b) of Fig- M that we may set (S)= (S). Since any vector ure 1. As a consequence, it is never possible to en- θ (S) has atM most s nonzeroM entries, the sub- ∈ M sure that Xθ 2/√n is uniformly bounded from be- space compatibility constant is given by Ψ( (S)) = low for allk vectorsk θ in the set (35), and so a strictly θ 1 M sup k k = √s. θ (S) 0 θ 2 positive tolerance term τ (θ∗) > 0 is required. The ∈M \{ } k k L 14 NEGAHBAN, RAVIKUMAR, WAINWRIGHT AND YU random matrix result (31), stated in the previous Consequently, we may apply Theorem 1 with κ = 2 log p 2 L section, allows us to establish a form of RSC that κ1/4 and τ (θ∗) = 8κ2 n θS∗ c 1 to conclude that L k η k is appropriate for the setting of ℓq-ball sparsity. We 2 summarize our conclusions in the following: θ θ∗ k λn − k2 2 Corollary 3. Suppose that X satisfies the RE λn (38) b 144 2 Sη condition (31) as well as the column normalization ≤ κ1 | | condition (32), the noise w is sub-Gaussian (33) B 4λn log p 2 and θ∗ belongs to q(Rq) for a radius Rq such that + 16κ θ c + 4 θ c , 2 S∗η 1 S∗η 1 log p 1/2 q/4 κ1 n k k k k Rq( n ) − 1. Then if we solve the Lasso   ≤ 2 log p where we have used the fact that Ψ (Sη)= Sη , as withp regularization parameter λn = 4σ , there n noted in the proof of Corollary 2. | | are universal positive constants (c0, c1,q c2) such that It remains to upper bound the cardinality of Sη in any optimal solution θλ satisfies n terms of the threshold η and ℓq-ball radius Rq. Note 2 1 q/2 that we have 2 b σ log p − (36) θλn θ∗ 2 c0Rq 2 p k − k ≤ κ1 n q q q   (39) Rq θj∗ θi∗ η Sη , 2 ≥ | | ≥ | | ≥ | | with probabilityb at least 1 c exp( c nλ ). j=1 j Sη − 1 − 2 n X X∈ Remarks. Note that this corollary is a strict hence, S η qR for any η> 0. 
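The two elementary facts used at this point, the cardinality bound (39) and the ℓ1 tail bound (40) that follows, are easy to confirm numerically. The sketch below builds a polynomially decaying (hence weakly sparse) vector, forms the thresholded set S_η and checks both inequalities; the choices of p, q and η are illustrative.

```python
import numpy as np

# With S_eta = {j : |theta*_j| > eta} and R_q = sum_j |theta*_j|^q, check
#   (39)  |S_eta| <= eta^{-q} R_q    and    (40)  ||theta*_{S_eta^c}||_1 <= R_q eta^{1-q}.
rng = np.random.default_rng(7)
p, q, eta = 2000, 0.5, 0.05
theta_star = rng.standard_normal(p) / (1.0 + np.arange(p)) ** 2   # polynomially decaying
R_q = np.sum(np.abs(theta_star) ** q)

S_eta = np.abs(theta_star) > eta
card = np.count_nonzero(S_eta)
tail_l1 = np.sum(np.abs(theta_star[~S_eta]))

assert card <= eta ** (-q) * R_q          # bound (39)
assert tail_l1 <= R_q * eta ** (1 - q)    # bound (40)
print("|S_eta| =", card, "  bound:", eta ** (-q) * R_q)
print("tail l1 =", tail_l1, "  bound:", R_q * eta ** (1 - q))
```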
Next we upper | η|≤ − q generalization of Corollary 2, to which it reduces bound the approximation error θ∗ c 1, using the k Sη k when q = 0. More generally, the parameter q [0, 1] fact that θ B (R ). Letting Sc denote the com- ∈ ∗ ∈ q q η controls the relative “sparsifiability” of θ∗, with lar- plementary set Sη 1, 2,...,p , we have ger values corresponding to lesser sparsity. Naturally \ { } q 1 q θ c = θ = θ θ − then, the rate slows down as q increases from 0 to- S∗η 1 j∗ j∗ j∗ k k c| | c| | | | ward 1. In fact, Raskutti et al. [56] show that the j Sη j Sη X∈ X∈ (40) rates (36) are minimax-optimal over the ℓq-balls— 1 q R η − . implying that not only are the consequences of The- ≤ q orem 1 sharp for the Lasso, but, more generally, no Setting η = λn/κ1 and then substituting the bounds algorithm can achieve faster rates. (39) and (40) into the bound (38) yields

Proof of Corollary 3. Since the loss func- 2 1 q/2 2 λn − tion is quadratic, the proof of Corollary 2 shows θλ θ∗ 160 Rq L k n − k2 ≤ κ2 σ2 log p  1  that the stated choice λn = 4 is valid with n 2 1 q/2 2 2 b λn − (log p)/n probability at least 1 c exp(qc′nλn ). Let us now + 64κ R . − − 2 κ2 q λ /κ show that the RSC condition holds. We do so via  1   n 1 condition (31) applied to equation (35). For a thresh- For any fixed noise variance, our choice of regular- old η> 0 to be chosen, define the thresholded subset ization parameter ensures that the ratio (log p)/n is λn/κ1 (37) S := j 1, 2,...,p θ∗ > η . of order one, so that the claim follows.  η { ∈ { } | | j | } 4.4 Extensions to Generalized Linear Models Now recall the subspaces (S ) and (S ) previ- M η M⊥ η ously defined in equations (5) and (6) of Example 1, In this section we briefly outline extensions of the where we set S = Sη. The following lemma, proved preceding results to the family of generalized linear in the supplement [49], provides sufficient conditions models (GLM). Suppose that conditioned on a vec- for restricted strong convexity with respect to these tor x Rp of covariates, a response variable y subspace pairs: has the∈ distribution ∈Y Lemma 2. Suppose that the conditions of Corol- P y θ∗,x Φ( θ∗,x ) (41) θ∗ (y x) exp h i− h i . lary 3 hold and n> 9κ2 Sη log p. Then with the choice | ∝ c(σ) λn | |   η = , the RSC condition holds over C( (Sη), κ1 M Here the quantity c(σ) is a fixed and known scale 2 log p 2 R R ⊥(Sη),θ∗) with κ = κ1/4 and τ = 8κ2 n θS∗ c 1. parameter, and the function Φ : is the link M L L k η k → HIGH-DIMENSIONAL ANALYSIS OF REGULARIZED M-ESTIMATORS 15 function, also known. The family (41) includes many have proposed extensions of the Lasso based on reg- well-known classes of regression models as special ularizers that have more structure than the ℓ1-norm cases, including ordinary linear regression [obtained (e.g., [5, 44, 70, 78, 80]). Such regularizers allow with = R, Φ(t)= t2/2 and c(σ)= σ2] and logistic one to impose different types of block-sparsity con- regressionY [obtained with = 0, 1 , c(σ)=1 and straints, in which groups of parameters are assumed Φ(t) = log(1 + exp(t))]. Y { } to be active (or inactive) simultaneously. These norms p Given samples Zi = (xi,yi) R , the goal is arise in the context of multivariate regression, where to estimate the unknown vector∈ θ×YRp. Under a the goal is to predict a multivariate output in Rm ∗ ∈ sparsity assumption on θ∗, a natural estimator is on the basis of a set of p covariates. Here it is ap- based on minimizing the (negative) log likelihood, propriate to assume that groups of covariates are combined with an ℓ1-regularization term. This com- useful for predicting the different elements of the bination leads to the convex program m-dimensional output vector. We refer the reader to n the papers [5, 44, 70, 78, 80] for further discussion 1 of and motivation for the use of block-structured θλn arg min yi θ,xi + Φ( θ,xi ) ∈ θ Rp n {− h i h i } norms. ∈ ( i=1 X Given a collection = G1,...,GN of groups, b (θ;Zn) G { G } 1 recall from Example 2 in Section 2.2 the definition of (42) L | {z } the group norm ,~α. In full generality, this group k · kG + λn θ 1 . norm is based on a weight vector ~α = (α1,...,αN ) G ∈ k k ) [2, ]N , one for each group. 
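As an illustration of the GLM estimator (42), the following sketch implements ℓ1-regularized logistic regression, that is, the link Φ(t) = log(1 + exp(t)) with c(σ) = 1 and responses in {0, 1}, by proximal gradient descent. The step-size rule (based on Φ'' ≤ 1/4), iteration count, regularization level and synthetic data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_logistic(X, y, lam, step=None, n_iter=3000):
    """Proximal-gradient sketch of the estimator (42) for logistic regression:
    loss (1/n) sum_i [ -y_i <theta, x_i> + log(1 + exp(<theta, x_i>)) ] + lam ||theta||_1."""
    n, p = X.shape
    if step is None:
        step = 4.0 * n / np.linalg.norm(X, 2) ** 2   # 1/L, using Phi'' <= 1/4
    theta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))        # Phi'(<theta, x_i>)
        grad = X.T @ (mu - y) / n                    # gradient of the negative log-likelihood
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# toy usage (illustrative sizes)
rng = np.random.default_rng(8)
n, p, s = 300, 600, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 2.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star)))
theta_hat = l1_logistic(X, y, lam=2 * np.sqrt(np.log(p) / n))
print("l2 error:", np.linalg.norm(theta_hat - theta_star))
```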
In order to extend the error bounds from the previous section, a key ingredient is to establish that this GLM-based loss function satisfies a form of restricted strong convexity. Along these lines, Negahban et al. [48] proved the following result: suppose that the covariate vectors x_i are zero-mean with covariance matrix Σ ≻ 0 and are drawn i.i.d. from a distribution with sub-Gaussian tails [see equation (33)]. Then there are constants κ₁, κ₂ such that the first-order Taylor series error for the GLM-based loss (42) satisfies the lower bound

(43)  δL(Δ, θ∗) ≥ κ₁‖Δ‖₂² − κ₂ (log p / n) ‖Δ‖₁²   for all ‖Δ‖₂ ≤ 1.

As discussed following Definition 2, this type of lower bound implies that L satisfies a form of RSC, as long as the sample size scales as n = Ω(s log p), where s is the target sparsity. Consequently, this lower bound (43) allows us to recover analogous bounds on the error ‖θ̂_{λn} − θ∗‖₂ of the GLM-based estimator (42).

5. CONVERGENCE RATES FOR GROUP-STRUCTURED NORMS

The preceding two sections addressed M-estimators based on ℓ1-regularization, the simplest type of decomposable regularizer. We now turn to some extensions of our results to more complex regularizers that are also decomposable. Various researchers have proposed extensions of the Lasso based on regularizers that have more structure than the ℓ1-norm (e.g., [5, 44, 70, 78, 80]). Such regularizers allow one to impose different types of block-sparsity constraints, in which groups of parameters are assumed to be active (or inactive) simultaneously. These norms arise in the context of multivariate regression, where the goal is to predict a multivariate output in R^m on the basis of a set of p covariates. Here it is appropriate to assume that groups of covariates are useful for predicting the different elements of the m-dimensional output vector. We refer the reader to the papers [5, 44, 70, 78, 80] for further discussion of and motivation for the use of block-structured norms.

Given a collection G = {G₁, . . . , G_{N_G}} of groups, recall from Example 2 in Section 2.2 the definition of the group norm ‖ · ‖_{G,~α}. In full generality, this group norm is based on a weight vector ~α = (α₁, . . . , α_{N_G}) ∈ [2, ∞]^{N_G}, one for each group. For simplicity, here we consider the case when α_t = α for all t = 1, 2, . . . , N_G, and we use ‖ · ‖_{G,α} to denote the associated group norm. As a natural extension of the Lasso, we consider the block Lasso estimator

(44)  θ̂ ∈ arg min_{θ∈R^p} { (1/n)‖y − Xθ‖₂² + λn‖θ‖_{G,α} },

where λn > 0 is a user-defined regularization parameter. Different choices of the parameter α yield different estimators, and in this section we consider the range α ∈ [2, ∞]. This range covers the two most commonly applied choices: α = 2, often referred to as the group Lasso, as well as the choice α = +∞.
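For concreteness, the block Lasso (44) with α = 2 and non-overlapping groups can be computed by proximal gradient descent, since the proximal operator of the group norm is then blockwise soft-thresholding. The sketch below is illustrative only and not from the original paper; all problem sizes, the noise level σ = 0.5 and the regularization constant are arbitrary assumptions.

```python
import numpy as np

def group_soft_threshold(v, groups, tau):
    """Blockwise soft-thresholding: proximal operator of tau * sum_t ||v_{G_t}||_2,
    i.e. the group norm with alpha = 2 over non-overlapping groups."""
    out = np.zeros_like(v)
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        if norm_g > tau:
            out[g] = (1.0 - tau / norm_g) * v[g]
    return out

def group_lasso(X, y, groups, lam, n_iter=2000):
    """Proximal-gradient sketch of the block Lasso estimator (44) with alpha = 2."""
    n, p = X.shape
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) * (2.0 / n)         # gradient of (1/n)||y - X theta||_2^2
        theta = group_soft_threshold(theta - step * grad, groups, step * lam)
    return theta

# Synthetic illustration with N_G groups of equal size m (all values arbitrary).
rng = np.random.default_rng(2)
n, m, N_G = 200, 4, 125
p = m * N_G
groups = [np.arange(t * m, (t + 1) * m) for t in range(N_G)]
theta_star = np.zeros(p)
for t in range(3):                                       # s_G = 3 active groups
    theta_star[groups[t]] = rng.standard_normal(m)
X = rng.standard_normal((n, p))
sigma = 0.5
y = X @ theta_star + sigma * rng.standard_normal(n)
lam = 2 * sigma * (np.sqrt(m / n) + np.sqrt(np.log(N_G) / n))   # of the order sqrt(m/n) + sqrt(log N_G / n)
theta_hat = group_lasso(X, y, groups, lam)
print("ell_2 error:", np.linalg.norm(theta_hat - theta_star))
```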
5.1 Restricted Strong Convexity for Group Sparsity

As a parallel to our analysis of ordinary sparse regression, our first step is to provide a condition sufficient to guarantee restricted strong convexity for the group-sparse setting. More specifically, we state the natural extension of condition (31) to the block-sparse setting and prove that it holds with high probability for the class of Σ-Gaussian random designs. Recall from Theorem 1 that the dual norm of the regularizer plays a central role. As discussed previously, for the block-(1, α)-regularizer, the associated dual norm is a block-(∞, α∗) norm, where (α, α∗) are conjugate exponents satisfying 1/α + 1/α∗ = 1.

Letting ε ∼ N(0, I_{p×p}) be a standard normal vector, we consider the following condition. Suppose that there are strictly positive constants (κ₁, κ₂) such that, for all Δ ∈ R^p, we have

(45)  ‖XΔ‖₂²/n ≥ κ₁‖Δ‖₂² − κ₂ ρ_G²(α∗) ‖Δ‖²_{G,α},

where ρ_G(α∗) := E[ max_{t=1,2,...,N_G} ‖ε_{G_t}‖_{α∗} / √n ]. To understand this condition, first consider the special case of N_G = p groups, each of size one, so that the group-sparse norm reduces to the ordinary ℓ1-norm, and its dual is the ℓ∞-norm. Using α = 2 for concreteness, we have ρ_G(2) = E[‖ε‖∞]/√n ≤ √(3 log p / n) for all p ≥ 10, using standard bounds on Gaussian maxima. Therefore, condition (45) reduces to the earlier condition (31) in this special case.

Let us consider a more general setting, say, with α = 2 and N_G groups each of size m, so that p = N_G m. For this choice of groups and norm, we have

ρ_G(2) = E[ max_{t=1,...,N_G} ‖ε_{G_t}‖₂ / √n ],

where each sub-vector ε_{G_t} is a standard Gaussian vector with m elements. Since E[‖ε_{G_t}‖₂] ≤ √m, tail bounds for χ²-variates yield ρ_G(2) ≤ √(m/n) + √(3 log N_G / n), so that the condition (45) is equivalent to

(46)  ‖XΔ‖₂²/n ≥ κ₁‖Δ‖₂² − κ₂ ( √(m/n) + √(3 log N_G / n) )² ‖Δ‖²_{G,2}

for all Δ ∈ R^p.

Thus far, we have seen the form that condition (45) takes for different choices of the groups and parameter α. It is natural to ask whether there are any matrices that satisfy the condition (45). As shown in the following result, the answer is affirmative; more strongly, almost every matrix sampled from the Σ-Gaussian ensemble will satisfy this condition with high probability. [Here we recall that for a nondegenerate covariance matrix, a random design matrix X ∈ R^{n×p} is drawn from the Σ-Gaussian ensemble if each row x_i ∼ N(0, Σ), i.i.d. for i = 1, 2, . . . , n.]

Proposition 1. For a design matrix X ∈ R^{n×p} from the Σ-ensemble, there are constants (κ₁, κ₂) depending only on Σ such that condition (45) holds with probability greater than 1 − c₁ exp(−c₂ n).

We provide the proof of this result in the supplement [49]. This condition can be used to show that appropriate forms of RSC hold, for both the cases of exactly group-sparse and weakly sparse vectors. As with ℓ1-regularization, these RSC conditions are milder than analogous group-based RIP conditions (e.g., [5, 27, 66]), which require that all submatrices up to a certain size are close to isometries.

5.2 Convergence Rates

Apart from RSC, we impose one additional condition on the design matrix. For a given group G of size m, let us view the matrix X_G ∈ R^{n×m} as an operator from ℓ_α^m → ℓ_2^n and define the associated operator norm |||X_G|||_{α→2} := max_{‖θ‖_α=1} ‖X_G θ‖₂. We then require that

(47)  |||X_{G_t}|||_{α→2} / √n ≤ 1   for all t = 1, 2, . . . , N_G.

Note that this is a natural generalization of the column normalization condition (32), to which it reduces when we have N_G = p groups, each of size one. As before, we may assume without loss of generality, rescaling X and the noise as necessary, that condition (47) holds with constant one. Finally, we define the maximum group size m = max_{t=1,...,N_G} |G_t|. With this notation, we have the following novel result:

Corollary 4. Suppose that the noise w is sub-Gaussian (33), and the design matrix X satisfies condition (45) and the block normalization condition (47). If we solve the group Lasso with

(48)  λn ≥ 2σ { m^{1−1/α}/√n + √(log N_G / n) },

then with probability at least 1 − 2/N_G², for any group subset S_G ⊆ {1, 2, . . . , N_G} with cardinality |S_G| = s_G, any optimal solution θ̂_{λn} satisfies

(49)  ‖θ̂_{λn} − θ∗‖₂² ≤ 4 (λn²/κ_L²) s_G + 4 (λn/κ_L) Σ_{t∉S_G} ‖θ∗_{G_t}‖_α.
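To make the quantities entering this corollary concrete, the short sketch below (illustrative only; every size and constant is an arbitrary assumption) evaluates the regularization level (48) for α = 2 and α = ∞, and forms a Monte Carlo estimate of the dual-norm quantity ρ_G(α∗) appearing in condition (45) for equal-size groups.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, N_G, sigma = 400, 8, 50, 1.0   # arbitrary illustrative problem sizes

def lambda_n(alpha):
    """Regularization choice (48) for the block-(1, alpha) norm."""
    return 2 * sigma * (m ** (1 - 1 / alpha) / np.sqrt(n) + np.sqrt(np.log(N_G) / n))

def rho_G(alpha_star, n_mc=2000):
    """Monte Carlo estimate of rho_G(alpha*) = E[max_t ||eps_{G_t}||_{alpha*}] / sqrt(n)."""
    eps = rng.standard_normal((n_mc, N_G, m))
    if np.isinf(alpha_star):
        block_norms = np.max(np.abs(eps), axis=2)
    else:
        block_norms = np.sum(np.abs(eps) ** alpha_star, axis=2) ** (1 / alpha_star)
    return np.mean(np.max(block_norms, axis=1)) / np.sqrt(n)

for alpha, alpha_star in [(2.0, 2.0), (np.inf, 1.0)]:   # (alpha, conjugate exponent alpha*)
    print(f"alpha={alpha}:  lambda_n from (48) = {lambda_n(alpha):.4f},  "
          f"rho_G(alpha*={alpha_star}) ~= {rho_G(alpha_star):.4f}")
```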
Remarks. Since the result applies to any α ∈ [2, ∞], we can observe how the choices of different group-sparse norms affect the convergence rates. So as to simplify this discussion, let us assume that the groups are all of equal size m, so that p = m N_G is the ambient dimension of the problem.

Case α = 2: The case α = 2 corresponds to the block (1, 2) norm, and the resulting estimator is frequently referred to as the group Lasso. For this case, we can set the regularization parameter as λn = 2σ{ √(m/n) + √(log N_G / n) }. If we assume, moreover, that θ∗ is exactly group-sparse, say, supported on a group subset S_G ⊆ {1, 2, . . . , N_G} of cardinality s_G, then the bound (49) takes the form

(50)  ‖θ̂ − θ∗‖₂² ≲ (s_G m)/n + (s_G log N_G)/n.

Similar bounds were derived in independent work by Lounici et al. [39] and Huang and Zhang [27] for this special case of exact block sparsity. The analysis here shows how the different terms arise, in particular, via the noise magnitude measured in the dual norm of the block regularizer.

In the more general setting of weak block sparsity, Corollary 4 yields a number of novel results. For instance, for a given set of groups G, we can consider the block sparse analog of the ℓq-"ball", namely, the set

B_q(R_q; G, 2) := { θ ∈ R^p | Σ_{t=1}^{N_G} ‖θ_{G_t}‖₂^q ≤ R_q }.

In this case, if we optimize the choice of S in the bound (49) so as to trade off the estimation and approximation errors, then we obtain

‖θ̂ − θ∗‖₂² ≲ R_q ( m/n + (log N_G)/n )^{1−q/2},

which is a novel result. This result is a generalization of our earlier Corollary 3, to which it reduces when we have N_G = p groups each of size m = 1.

Case α = +∞: Now consider the case of ℓ1/ℓ∞-regularization, as suggested in past work [70]. In this case, Corollary 4 implies that ‖θ̂ − θ∗‖₂² ≲ (s_G m²)/n + (s_G log N_G)/n. Similar to the case α = 2, this bound consists of an estimation term and a search term. The estimation term s_G m²/n is larger by a factor of m, which corresponds to the amount by which an ℓ∞-ball in m dimensions is larger than the corresponding ℓ2-ball.

We provide the proof of Corollary 4 in the supplementary appendix [49]. It is based on verifying the conditions of Theorem 1: more precisely, we use Proposition 1 in order to establish RSC, and we provide a lemma that shows that the regularization choice (48) is valid in the context of Theorem 1.
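As a purely numerical illustration (not from the original paper; every constant below is an arbitrary choice), the following sketch compares the orders of magnitude of the three error scalings just discussed: the exact group-sparse bound (50) for α = 2, the corresponding ℓ1/ℓ∞ scaling, and the weak block-sparsity rate.

```python
import numpy as np

# Order-of-magnitude comparison of the scalings discussed above (illustrative only).
n, m, N_G, s_G = 4000, 10, 500, 5

rate_alpha2   = s_G * m / n + s_G * np.log(N_G) / n        # bound (50), group Lasso
rate_alphainf = s_G * m**2 / n + s_G * np.log(N_G) / n     # ell_1 / ell_inf regularization

q, R_q = 0.5, 5.0                                          # weak block sparsity parameters
rate_weak = R_q * (m / n + np.log(N_G) / n) ** (1 - q / 2)

print(f"alpha = 2       : {rate_alpha2:.4f}")
print(f"alpha = infinity: {rate_alphainf:.4f}   (estimation term inflated by a factor m)")
print(f"weak, q = {q}   : {rate_weak:.4f}")
```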
6. DISCUSSION

In this paper we have presented a unified framework for deriving error bounds and convergence rates for a class of regularized M-estimators. The theory is high-dimensional and nonasymptotic in nature, meaning that it yields explicit bounds that hold with high probability for finite sample sizes and reveals the dependence on dimension and other structural parameters of the model. Two properties of the M-estimator play a central role in our framework. We isolated the notion of a regularizer being decomposable with respect to a pair of subspaces and showed how it constrains the error vector (meaning the difference between any solution and the nominal parameter) to lie within a very specific set. This fact is significant, because it allows for a fruitful notion of restricted strong convexity to be developed for the loss function. Since the usual form of strong convexity cannot hold under high-dimensional scaling, this interaction between the decomposable regularizer and the loss function is essential.

Our main result (Theorem 1) provides a deterministic bound on the error for a broad class of regularized M-estimators. By specializing this result to different statistical models, we derived various explicit convergence rates for different estimators, including some known results and a range of novel results. We derived convergence rates for sparse linear models, both under exact and approximate sparsity assumptions, and these results have been shown to be minimax optimal [56]. In the case of sparse group regularization, we established a novel upper bound of the oracle type, with a separation between the approximation and estimation error terms. For matrix estimation, the framework described here has been used to derive bounds on the Frobenius error that are known to be minimax-optimal, both for multitask regression and autoregressive estimation [51], as well as the matrix completion problem [52]. In recent work [1], this framework has also been applied to obtain minimax-optimal rates for noisy matrix decomposition, which involves using a combination of the nuclear norm and elementwise ℓ1-norm. Finally, as shown in the paper [48], these results may be applied to derive convergence rates for generalized linear models. Doing so requires leveraging the fact that restricted strong convexity can also be shown to hold for these models, as stated in the bound (43).

There are a variety of interesting open questions associated with our work. In this paper, for simplicity of exposition, we have specified the regularization parameter in terms of the dual norm R∗ of the regularizer. In many cases, this choice leads to optimal convergence rates, including linear regression over ℓq-balls (Corollary 3) for sufficiently small radii, and various instances of low-rank matrix regression. In other cases, some refinements of our convergence rates are possible; for instance, for the special case of linear sparsity regression (i.e., an exactly sparse vector, with a constant fraction of nonzero elements), our rates can be sharpened by a more careful analysis of the noise term, which allows for a slightly smaller choice of the regularization parameter. Similarly, there are other nonparametric settings in which a more delicate choice of the regularization parameter is required [34, 57]. Last, we suspect that there are many other statistical models, not discussed in this paper, for which this framework can yield useful results. Some examples include different types of hierarchical regularizers and/or overlapping group regularizers [28, 29], as well as methods using combinations of decomposable regularizers, such as the fused Lasso [68].

ACKNOWLEDGMENTS

All authors were partially supported by NSF Grants DMS-06-05165 and DMS-09-07632. B. Yu acknowledges additional support from NSF Grant SES-0835531 (CDI); M. J. Wainwright and S. N. Negahban acknowledge additional support from NSF Grant CDI-0941742 and AFOSR Grant 09NL184; and P. Ravikumar acknowledges additional support from NSF Grant IIS-101842. We thank a number of people, including Arash Amini, Francis Bach, Peter Bühlmann, Garvesh Raskutti, Alexandre Tsybakov, Sara van de Geer and Tong Zhang, for helpful discussions.

SUPPLEMENTARY MATERIAL

Supplementary material for "A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers" (DOI: 10.1214/12-STS400SUPP; .pdf). Due to space constraints, the proofs and technical details have been given in the supplementary document by Negahban et al. [49].

REFERENCES

[1] Agarwal, A., Negahban, S. and Wainwright, M. J. (2011). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Ann. Statist. 40 1171–1197.
[2] Bach, F. (2010). Self-concordant analysis for logistic regression. Electron. J. Stat. 4 384–414. MR2645490
[3] Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9 1179–1225. MR2417268
[4] Bach, F. R. (2008). Consistency of trace norm minimization. J. Mach. Learn. Res. 9 1019–1048. MR2417263
[5] Baraniuk, R. G., Cevher, V., Duarte, M. F. and Hegde, C. (2008). Model-based compressive sensing. Technical report, Rice Univ. Available at arXiv:0808.3572.
[6] Bickel, P. J., Brown, J. B., Huang, H. and Li, Q. (2009). An overview of recent developments in genomics and associated statistical methods. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4313–4337. MR2546390
[7] Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Ann. Statist. 36 2577–2604. MR2485008
[8] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732. MR2533469
[9] Bunea, F. (2008). Honest variable selection in linear and logistic regression models via l1 and l1 + l2 penalization. Electron. J. Stat. 2 1153–1194. MR2461898
[10] Bunea, F., She, Y. and Wegkamp, M. (2010). Adaptive rank penalized estimators in multivariate regression. Technical report, Florida State. Available at arXiv:1004.2995.
[11] Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1 169–194. MR2312149
[12] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697. MR2351101
[13] Cai, T. and Zhou, H. (2010). Optimal rates of convergence for sparse covariance matrix estimation. Technical report, Wharton School of Business, Univ. Pennsylvania. Available at http://www-stat.wharton.upenn.edu/~tcai/paper/html/Sparse-Covariance-Matrix.html.
[14] Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351. MR2382644
[15] Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9 717–772. MR2565240
[16] Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203–4215. MR2243152
[17] Candès, E. J., Li, X., Ma, Y. and Wright, J. (2010). Stable principal component pursuit. In IEEE International Symposium on Information Theory, Austin, TX.
[18] Chandrasekaran, V., Sanghavi, S., Parrilo, P. A. and Willsky, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21 572–596.
[19] Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61. MR1639094
[20] Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inform. Theory 52 1289–1306. MR2241189
[21] Donoho, D. L. and Tanner, J. (2005). Neighborliness of randomly projected simplices in high dimensions. Proc. Natl. Acad. Sci. USA 102 9452–9457 (electronic). MR2168716
[22] El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 36 2717–2756. MR2485011
[23] Fazel, M. (2002). Matrix rank minimization with applications. Ph.D. thesis, Stanford. Available at http://faculty.washington.edu/mfazel/thesis-final.pdf.
[24] Girko, V. L. (1995). Statistical Analysis of Observations of Increasing Dimension. Theory and Decision Library. Series B: Mathematical and Statistical Methods 28. Kluwer Academic, Dordrecht. Translated from the Russian. MR1473719
[25] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988. MR2108039
[26] Hsu, D., Kakade, S. M. and Zhang, T. (2011). Robust matrix decomposition with sparse corruptions. IEEE Trans. Inform. Theory 57 7221–7234.
[27] Huang, J. and Zhang, T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978–2004. MR2676881
[28] Jacob, L., Obozinski, G. and Vert, J. P. (2009). Group Lasso with overlap and graph Lasso. In International Conference on Machine Learning (ICML) 433–440, Haifa, Israel.
[29] Jenatton, R., Mairal, J., Obozinski, G. and Bach, F. (2011). Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res. 12 2297–2334.
[30] Kakade, S. M., Shamir, O., Sridharan, K. and Tewari, A. (2010). Learning exponential families in high-dimensions: Strong convexity and sparsity. In AISTATS, Sardinia, Italy.
[31] Keshavan, R. H., Montanari, A. and Oh, S. (2010). Matrix completion from noisy entries. J. Mach. Learn. Res. 11 2057–2078.
[32] Kim, Y., Kim, J. and Kim, Y. (2006). Blockwise sparse regression. Statist. Sinica 16 375–390. MR2267240
[33] Koltchinskii, V. and Yuan, M. (2008). Sparse recovery in large ensembles of kernel machines. In Proceedings of COLT, Helsinki, Finland.
[34] Koltchinskii, V. and Yuan, M. (2010). Sparsity in multiple kernel learning. Ann. Statist. 38 3660–3695. MR2766864
[35] Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 37 4254–4278. MR2572459
[36] Landgrebe, D. (2008). Hyperspectral image data analysis as a high-dimensional signal processing problem. IEEE Signal Processing Magazine 19 17–28.
[37] Lee, K. and Bresler, Y. (2009). Guaranteed minimum rank approximation from linear observations by nuclear norm minimization with an ellipsoidal constraint. Technical report, UIUC. Available at arXiv:0903.4742.
[38] Liu, Z. and Vandenberghe, L. (2009). Interior-point method for nuclear norm approximation with application to system identification. SIAM J. Matrix Anal. Appl. 31 1235–1256. MR2558821
[39] Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2009). Taking advantage of sparsity in multi-task learning. Technical report, ETH Zurich. Available at arXiv:0903.1468.
[40] Lustig, M., Donoho, D., Santos, J. and Pauly, J. (2008). Compressed sensing MRI. IEEE Signal Processing Magazine 27 72–82.
[41] McCoy, M. and Tropp, J. (2011). Two proposals for robust PCA using semidefinite programming. Electron. J. Stat. 5 1123–1160.
[42] Mehta, M. L. (1991). Random Matrices, 2nd ed. Academic Press, Boston, MA. MR1083764
[43] Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling. Ann. Statist. 37 3779–3821. MR2572443
[44] Meinshausen, N. (2008). A note on the Lasso for Gaussian graphical model selection. Statist. Probab. Lett. 78 880–884. MR2398362
[45] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
[46] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37 246–270. MR2488351
[47] Nardi, Y. and Rinaldo, A. (2008). On the asymptotic properties of the group lasso estimator for linear models. Electron. J. Stat. 2 605–633. MR2426104
[48] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS Conference, Vancouver, Canada.
[49] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). Supplement to "A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers." DOI:10.1214/12-STS400SUPP.
[50] Negahban, S. and Wainwright, M. J. (2011). Simultaneous support recovery in high-dimensional regression: Benefits and perils of ℓ1,∞-regularization. IEEE Trans. Inform. Theory 57 3481–3863.
[51] Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist. 39 1069–1097. MR2816348
[52] Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 13 1665–1697.
[53] Obozinski, G., Wainwright, M. J. and Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Ann. Statist. 39 1–47. MR2797839
[54] Pastur, L. A. (1972). The spectrum of random matrices. Teoret. Mat. Fiz. 10 102–112. MR0475502
[55] Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11 2241–2259. MR2719855
[56] Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Trans. Inform. Theory 57 6976–6994. MR2882274
[57] Raskutti, G., Wainwright, M. J. and Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res. 13 389–427. MR2913704
[58] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 1009–1030. MR2750255
[59] Ravikumar, P., Wainwright, M. J. and Lafferty, J. D. (2010). High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann. Statist. 38 1287–1319. MR2662343
[60] Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 5 935–980. MR2836766
[61] Recht, B. (2011). A simpler approach to matrix completion. J. Mach. Learn. Res. 12 3413–3430. MR2877360
[62] Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52 471–501. MR2680543
[63] Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. Ann. Statist. 39 887–930. MR2816342
[64] Rothman, A. J., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat. 2 494–515. MR2417391
[65] Rudelson, M. and Zhou, S. (2011). Reconstruction from anisotropic random measurements. Technical report, Univ. Michigan.
[66] Stojnic, M., Parvaresh, F. and Hassibi, B. (2009). On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Trans. Signal Process. 57 3075–3085. MR2723043
[67] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267–288. MR1379242
[68] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91–108. MR2136641
[69] Tropp, J. A., Gilbert, A. C. and Strauss, M. J. (2006). Algorithms for simultaneous sparse approximation. Signal Process. 86 572–602. Special issue on "Sparse approximations in signal and image processing."
[70] Turlach, B. A., Venables, W. N. and Wright, S. J. (2005). Simultaneous variable selection. Technometrics 47 349–363. MR2164706
[71] van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645. MR2396809
[72] van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392. MR2576316
[73] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202. MR2729873
[74] Wainwright, M. J. (2009). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inform. Theory 55 5728–5741. MR2597190
[75] Wigner, E. P. (1955). Characteristic vectors of bordered matrices with infinite dimensions. Ann. of Math. (2) 62 548–564. MR0077805
[76] Xu, H., Caramanis, C. and Sanghavi, S. (2012). Robust PCA via outlier pursuit. IEEE Trans. Inform. Theory 58 3047–3064.
[77] Yuan, M., Ekici, A., Lu, Z. and Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 329–346. MR2323756
[78] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. MR2212574
[79] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594. MR2435448
[80] Zhao, P., Rocha, G. and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Statist. 37 3468–3497. MR2549566
[81] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449
[82] Zhou, S., Lafferty, J. and Wasserman, L. (2008). Time-varying undirected graphs. In 21st Annual Conference on Learning Theory (COLT), Helsinki, Finland.