1

Structured Signal Recovery from Non-linear and Heavy-tailed Measurements Larry Goldstein, Stanislav Minsker and Xiaohan Wei

Abstract—We study high-dimensional signal recovery of i.i.d. copies of (x, y) defined on a probability −1 from non-linear measurements with design vectors having space (Ω, B, P). As f(a hx, aθ∗i, δ) = f(hx, θ∗i, δ) elliptically symmetric distribution. Special attention is for any a > 0, the best one can hope for is to devoted to the situation when the unknown signal belongs to a set of low statistical complexity, while both the recover θ∗ up to a scaling factor. Hence, without measurements and the design vectors are heavy-tailed. loss of generality, we will assume that θ∗ satisfies 1/2 2 1/2 1/2 We propose and analyze a new estimator that adapts to kΣ θ∗k2 := Σ θ∗, Σ θ∗ = 1, where Σ = the structure of the problem, while being robust both E(x − Ex)(x − Ex)T is the of x. to the possible model misspecification characterized by In many applications, θ possesses special struc- arbitrary non-linearity of the measurements as well as ∗ to data corruption modeled by the heavy-tailed distribu- ture, such as sparsity or low rank (when θ∗ ∈ d1×d2 tions. Moreover, this estimator has low computational R , d1d2 = d, is a matrix). To incorporate complexity. Our results are expressed in the form of such structural assumptions into the problem, we will exponential concentration inequalities for the error of d assume that θ∗ is an element of a closed set Θ ⊆ R the proposed estimator. On the technical side, our proofs of small “statistical complexity” that is characterized rely on the generic chaining methods, and illustrate the power of this approach for statistical applications. Theory by its Gaussian mean width [1]. The past decade is supported by numerical experiments demonstrating has witnessed significant progress related to estima- that our estimator outperforms existing alternatives when tion in high-dimensional spaces, both in theory and data is heavy-tailed. applications. Notable examples include sparse linear Index Terms—Signal reconstruction, Nonlinear mea- regression ([2], [3], [4]), low-rank matrix recovery ([5], surements, Heavy-tailed noise, Elliptically symmetric dis- [6], [7]), and mixed structure recovery [8]. However, tribution. the majority of the aforementioned works assume that the link function f is linear, and their results apply I.INTRODUCTION. only to this particular case. Let (x, y) ∈ Rd × R be a random couple with Generally, the task of estimating the index vector distribution P governed by the semi-parametric single requires approximating the link function f or its deriva- index model tive, assuming that it exists (the so-called Average Derivative Method), see [9], [10], [11]. However, when y = f(hx, θ∗i, δ), (1) the measurement vector x ∼ N (0, Id×d), a somewhat where x is a measurement vector with marginal dis- surprising result states that one can estimate θ∗ di- tribution Π, δ is a noise variable that is assumed rectly, avoiding preliminary link function estimation d to be independent of x, θ∗ ∈ R is a fixed but step completely. More specifically, it has been proven E 2 otherwise unknown signal (“index vector”), and f : in [12] that ηθ∗ = argminθ∈Rd (y − hθ, xi) , where 2 R 7→ R is an unknown link function; here and in η = E y hx, θ∗i. Here is the short proof of this fact: what follows, h·, ·i denotes the Euclidean dot product. 2 We impose mild conditions on f, and in particular argmin E (y − hθ, xi) Rd it is not assumed that f is convex, or even contin- θ∈  2  1 = argmin kθk − 2Ey hx, θi uous. 
Our goal is to estimate the signal θ∗ from 2 θ∈Rd the training data (x , y ),..., (x , y ) - a sequence 1 1 m m  2 E E ⊥ ⊥  = argmin kθk2 − 2 y hx, θ∗i hθ, θ∗i − 2 y x, θ∗ θ, θ∗ Rd L. Goldstein and S. Minsker are with the Department of Mathe- θ∈  2 E  matics, University of Southern , Los Angeles, CA (email: = argmin kθk2 − 2 y hx, θ∗i hθ, θ∗i [email protected]; [email protected]). θ∈Rd X. Wei is with the Department of Electrical Engineering, Uni- 2 versity of Southern California, Los Angeles, CA (email: xiao- = argmin kθ − ηθ∗k2, Rd [email protected]). θ∈ S. Minsker and X. Wei were supported in part by the National ⊥ Science Foundation grant DMS-1712956. where θ∗ denotes the unit vector in the (θ∗, θ) plane 1 The precise assumption on f is specified after (16). orthogonal to θ∗, and the third equality follows from 2

⊥ the fact that x, θ∗ is a centered Gaussian random variable y: for example, when the link function satisfies ˜ variable independent of hx, θ∗i, hence also independent f(hx, θ∗i , δ) = f(hx, θ∗i) + δ, it is only assumed of y. that δ possesses 2 + ε moments, for some ε > 0. In Later, in [13], this result was extended to the more [17], Y. Plan and R. Vershynin present analysis for the general class of elliptically symmetric distributions, Gaussian case and ask “Can the same kind of accuracy which includes Gaussian distributions as a special case; be expected for random non-Gaussian matrices?” In see Lemma 5. In general, it is not always possible to this paper, we give a positive answer to their question. recover θ∗: see [14] for an example in the case when To achieve our goal, we propose a Lasso-type estimator f(x, δ) = sign(x) + δ (so-called “1-bit compressed that admits tight probabilistic guarantees in spirit of (3) sensing” [15]). despite weak tail assumptions (see Theorem 1 below Y. Plan, R. Vershynin and E. Yudovina recently pre- for details). sented the non-asymptotic study for the case of Gaus- Proofs of related non-asymptotic results in the liter- sian measurements in the context of high-dimensional ature rely on special properties of Gaussian measures structured estimation, see [16], [17], [18]; we refer (see, for example, [23]). To handle a wider class of the reader to [19], [14], [20], [21] for further details. elliptically symmetric distributions, we rely on recent On a high level, these works show that when xj’s are developments in generic chaining methods, see [24], Gaussian, nonlinearity can be treated as an additional [25]. These general tools could prove useful in de- noise term. To give an example, works [17] and [16] veloping further extensions to a wider class of design demonstrate that under the same model as (1), when distributions. xj ∼ N (0, Id×d) and θ∗ ∈ Θ, solving the constrained problem II.DEFINITIONS AND BACKGROUND MATERIAL. 2 θb = argmin ky − Xθk2, (2) θ∈Θ This section introduces main notation and the key facts related to elliptically symmetric distributions, y = [y ··· y ]T X = √1 [x ··· x ]T with 1 m and m 1 m , convex geometry and empirical processes. The results results in the following bound: with probability ≥ 0.99, of this section will be used repeatedly throughout the d−1 σ1ω(D(Θ, ηθ∗) ∩ S ) + σ2 paper. For the unified treatment of vectors and matri- θb− ηθ∗ ≤ C √ , (3) v ∈ Rd×1 2 m ces, it will be convenient to treat a vector as a d × 1 matrix. Let d1, d2 ∈ N be such that d1d2 = d. where (with formal definitions to follow in Section II) d1×d2 Given v1, v2 ∈ R , the Euclidean dot product is C is an absolute constant, T then defined as hv1, v2i = tr(v1 v2), where tr(·) stands T g ∼ N (0, 1), η = E(f(g, δ)g), for the trace of a matrix and v denotes the transpose 2 E 2 of v. σ1 = (f(g, δ) − ηg) , d The `1-norm of v ∈ R is defined as kvk1 = 2 E 2 2 d σ2 = (f(g, δ) − ηg) g , P Rd1×d2 j=1 |vj|. The nuclear norm of a matrix v ∈ d−1 d Pmin(d1,d2) S is the unit sphere in R , D(Θ, θ) is the descent is kvk∗ = j=1 σj(v), where σj(v), j = cone of Θ at point θ, and ω(T ) is the Gaussian mean 1,..., min(d1, d2) stand for the singular values of width of a subset T ⊂ Rd. A different approach to v, and the operator norm is defined as kvk = estimation of the index vector in model (1) with similar maxj=1,...,min(d1,d2) σj(v). recovery guarantees has been developed in [21]. How- ever, the key assumption adopted in all these works A. Elliptically symmetric distributions. 
that the vectors {x }m follow Gaussian distributions j j=1 A centered random vector x ∈ Rd has elliptically preclude situations where the measurements are heavy symmetric (alternatively, elliptically contoured or just tailed, and hence might be overly restrictive for some elliptical) distribution with parameters Σ and F , de- practical applications; for example, noise and outliers µ noted x ∼ E(0, Σ,F ), if observed in high-dimensional image recovery often µ exhibit heavy-tailed behavior, see [22]. x =d µBU, (4) As we mentioned above, in [13] authors have shown d that direct consistent estimation of θ∗ is possible where = denotes equality in distribution, µ is a scalar when Π belongs to a family of elliptically symmet- random variable with cumulative distribution function T ric distributions. Our main contribution is the non- Fµ, B is a fixed d × d matrix such that Σ = BB , asymptotic analysis for this scenario, with a particular and U is uniformly distributed over the unit sphere d−1 focus on the case when d > n and θ∗ possesses S and independent of µ. Note that distribution T T special structure, such as sparsity. Moreover, we make E(0, Σ,Fµ) is well defined, as if B1B1 = B2B2 , very mild assumptions on the tails of the response then there exists a unitary matrix Q such that B1 = 3

d B2Q, and QU = U. Along these same lines, we Proof. Let {v1, ··· , vd} be an orthonormal basis in d note that representation (4) is not unique, as one may R such that vd = y2. Let V = [v1 v2 ··· vd] and 1  replace the pair (µ, B) with cµ, c BQ for any consider the linear transformation constant c > 0 and any orthogonal matrix Q. To avoid T such ambiguity, in the following we allow B to be xe = V x. T any matrix satisfying BB = Σ, and noting that the T Then, by (4), xe = µV BU, which is cen- covariance matrix of U is a multiple of the identity, we tered elliptical with full rank covariance matrix further impose the condition that the covariance matrix T V ΣV. Applications of Theorem 1 with x1 = of x is equal to Σ, i.e. ExxT  = Σ. [hx, v1i, ··· , hx, vd−1i] and x2 = hx, vdi = hx, y2i Alternatively, the mean-zero elliptically symmetric yields distribution can be defined uniquely via its character- istic function E(hx, y1i | hx, y2i) d ! T  Rd s → ψ s Σs , s ∈ , E X = hx, viihy1, vii hx, vdi where ψ : R+ → R is called the characteristic i=1 d−1 ! generator of x. See [26] for further details about E X . = hx, viihy1, vii hx, vdi + hx, vdihy1, vdi i=1 An important special case of the family E(0, Σ,Fµ) of elliptical distributions is the Gaussian distribution =hx, vdihy1, vdi = hy1, y2ihx, y2i, √ d 2 N (0, Σ), where µ = z with z = χd, and the −x/2 where in the second to last equality we have characteristic generator is ψ(x) = e . used the fact that the conditional distribution of The following elliptical symmetry property, general- [hv1, xi, ··· , hvd−1, xi] given hx, vdi is elliptical izing the well known fact for the conditional distribu- with mean zero. tion of the multivariate Gaussian, plays an important role in our subsequent analysis, see [27]: B. Geometry. Proposition 1. Let x = [x1, x2] ∼ Ed(0, Σ,Fµ), In this section, we recall the definitions of several where are of dimension d1 and d2 respectively, with quantities that control the “complexity” of the estima- d1 + d2 = d. Let Σ be partitioning accordingly as tion problem in model (1).  Σ Σ  Σ = 11 12 . Σ21 Σ22 Definition 1 (Gaussian mean width). The Gaussian mean width of a set T ⊆ Rd is defined as Then, whenever Σ22 has full rank, the condi- tional distribution of x given x is elliptical   1 2 ω(T ) := E sup hg, ti , Ed1 (0, Σ1|2,Fµ1|2 ), where t∈T −1 Σ1|2 = Σ11 − Σ12Σ22 Σ21, where g ∼ N (0, Id×d). and Fµ1|2 is the cumulative distribution function of Definition 2 (Descent cone). The descent cone of a 2 T −1 1/2 Rd Rd (µ − x2 Σ22 x2) given x2. set Θ ⊆ at a point θ ∈ is defined as 2 T −1 Note that µ − x2 Σ22 x2 is always nonnegative, D(Θ, θ) = {τh : τ ≥ 0, h ∈ Θ − θ}. hence F is well defined, since by (4) we have µ1|2 Sd−1 Rd For example, when T = , the unit sphere√ in , T −1 2 T T −1 it is easy to see that ω(Sd−1) = E(kgk ) ∼ d. We x2 Σ22 x2 = µ (B2U) (B2B2 ) (B2U) 2 = µ2U T BT (B BT )−1B U ≤ µ2U T U = µ2, will be interested in the Gaussian mean widths of sub- 2 2 2 2 sets of the unit sphere of the form T = Sd−1∩D(Θ, θ), where B2 is the matrix consisting of the last d2 rows where θ lies on the boundary of Θ; the importance of of B in (4), and where the inequality holds due to the such subsets in structured recovery is explained in [1]. fact that BT (B BT )−1B is a projection matrix. The 2 2 2 2 Definition 3 (Restricted set). Given c > 1, the c - following corollary is easily deduced from the theorem 0 0 restricted set of the norm k · k at θ ∈ Rd is defined above: K as Corollary 1. 
If x ∼ Ed(0, Σ,Fµ) with Σ of full Rd S S rank, then for any two fixed vectors y1, y2 ∈ with c0 (θ) := c0 (θ; K) ky2k2 = 1,  1  = v ∈ Rd : kθ + vk ≤ kθk + kvk . (5) K K c K E(hx, y1i | hx, y2i) = hy1, y2ihx, y2i. 0 4

2 Restricted set is similar to the “cone of dominant coor- E (y − hx, θi) . If distribution Fµ has heavy tails, the 1 Pm dinates” that appears in the analysis of sparse recovery sample average m i=1 yixi might not concentrate problems; we provide more details and examples in the sufficiently well around its mean, hence we replace Appendix. it by a more “robust” version obtained via truncation. Let µ ∈ R, U ∈ Sd−1 be such that x = µU (so that Definition 4 (Restricted compatibility). The restricted Rd µ = kxk2), and set compatibility constant of a set A ⊆ with respect to √ the norm k · kK is given by Ue = dU, (7) √ kvk q = µy/ d, Ψ(A) := Ψ(A; K) = sup K . v∈A\{0} kvk2 so that qUe = yx and√ Ue is uniformly distributed on The restricted compatibility concept is introduced to the sphere of radius d, implying that its covariance capture the dependence of the equivalence constants matrix is Id, the identity matrix. Next, define the between two norms on the geometry of the set under truncated random variables consideration. qei = sign(qi)(|qi| ∧ τ), i = 1, . . . , m, (8) Remark 1. The restricted set from the Definition 3 is 1 2(1+κ) not necessarily convex. However, if the norm k · kK is where τ = m for some κ ∈ (0, 1) that is chosen decomposable (see Definition 8), then the restricted set based on the integrability properties of q, see (17). is contained in a convex cone, and the corresponding Finally, set restricted compatibility constant is easier to estimate. m τ 2 2 X D E Decomposable norms have been introduced by [28] L (θ) = kθk − qiUei, θ , (9) m 2 m e and later appeared in a number of works, e.g. see [29] i=1 and references therein. For the reader’s convenience, and define the estimator θ as the solution to the we provide a self-contained discussion in the Appendix. bm constrained optimization problem:

τ III.MAIN RESULTS. θbm := argmin Lm(θ). (10) θ∈Θ In this section, we define a version of Lasso estima- tor that is well-suited for heavy-tailed measurements, We will also denote D E and state its performance guarantees. Lτ (θ) := ELτ (θ) = kθk2 − 2E qU, θ . (11) d m 2 ee We will assume that x1, x2,..., xm ∈ R are i.i.d. copies of an isotropic vector x with For the scenarios where structure on the unknown θ∗ spherically symmetric distribution Ed(0, Id×d,Fµ). If is induced by a norm k · kK (e.g., if θ∗ is sparse, then x ∼ Ed(0, Σ,Fµ) for some positive definite ma- k · kK could be the k · k1 norm), we will also consider d 1/2 λ trix Σ, then by definition x = µΣ U, and the estimator θbm defined via −1/2 1/2 −1/2 hx, θ∗i = Σ x, Σ θ∗ , where Σ x = λ h τ i ˜ 1/2 θbm := argmin Lm(θ) + λkθkK , (12) µU ∼ Ed(0, Id×d,Fµ). Hence, if we set θ∗ := Σ θ∗, θ∈Rd then all results that we establish for isotropic mea- where λ > 0 is a regularization parameter to be surements hold with θ replaced by θ˜ ; remark after ∗ ∗ specified, and Lτ (θ) is defined in (9). Theorem 1 includes more details. m Let us note that truncation approach has previ- ously been successfully implemented in [30] to handle A. Description of the proposed estimator. heavy-tailed noise in the context of matrix recovery We first introduce an estimator under the scenario with sub-Gaussian design. In the present paper, we d that θ∗ ∈ Θ, for some known closed set Θ ⊆ R . show that truncation-based approach is also useful 0 Define the loss function Lm(·) as in the situations where the measurements are heavy- m tailed. 2 X L0 (θ) := kθk2 − hy x , θi , (6) m 2 m i i Remark 2. 1) In the special case when the mea- i=1 surement vector x is Gaussian, Y. Plan and R. which is the unbiased estimator of Vershynin [18] proposed and analyzed an esti- 2 mator similar to (6) in the framework of 1-bit L0(θ) := kθk2 − 2E hyx, θi = E (y − hx, θi) − Ey2, 2 compressed sensing. where the last equality follows since x is isotropic. 2) Note that our estimator (12) is in general much Clearly, minimizing L0(θ) over any set Θ ⊆ easier to implement than some other popular Rd is equivalent to minimizing the quadratic loss alternatives, such as the usual Lasso estimator 5

[2]. For example, when the signal θ is sparse, B. Estimator performance guarantees. our estimator takes the form In this section, we present the probabilistic guaran- λ m tees for the performance of the estimators θbm and θb h 2 D E i m λ 2 X defined by (10) and (12) respectively. θbm := argmin kθk2 − qeiUei, θ +λkθk1 , Rd m θ∈ i=1 Everywhere below, C, c, Cj denote numerical con- stants; when these constants depend on parameters of which yields a closed form solution in the form the problem, we specify this dependency by writing of “soft-thresholding”. Specifically, let b = C = C (parameters). Let 1 Pm λ j j m i=1 qeiUei, then, the k-th entry of θbm takes the form: η = E hyx, θ∗i , (16)

 and assume that η 6= 0 and ηθ∗ ∈ Θ. bk − λ/2, if bk ≥ λ/2,  λ   θbm = 0, if − λ/2 ≤ bk ≤ λ/2, Theorem 1. Suppose that x ∼ E(0, Id×d,Fµ). k  Moreover, suppose that for some κ > 0 bk + λ/2, if bk ≤ −λ/2. (13) φ := E|q|2(1+κ) < ∞. (17)

We should note however that such simplification Then there exist constants C1 = C1(κ, φ),C2 = comes at the cost of prior knowledge that the C2(κ, φ) > 0 such that θbm satisfies measurement vector x is isotropic. Despite having  d−1  low computational complexity, our estimator can (ω(D(Θ, ηθ∗) ∩ S ) + 1)β P θbm − ηθ∗ ≥ C1 √ still exploit the structure of the problem, while 2 m being robust both to the possible model misspec- −β/2 ≤ C2e , ification as well as to data corruption modeled 2 d−1 2 by the heavy-tailed distributions. We demonstrate for any m ≥ β ω(D(Θ, ηθ∗) ∩ S ) + 1 , β ≥ 8. this in the following sections. Remark 4. 1) Unknown link function f enters the Remark 3 (Non-isotropic measurements). When x ∼ bound only through the constant η defined in (16). Ed(0, Σ,Fµ) for some Σ 0, then estimator (10) has 2) Aside from independence, conditions on the noise to be replaced by δ are implicit and follow from assumptions on y. In the special case when the error is addi- h 2 m D E i 1/2 2 X 1/2 tive, that is, when y = f(hx, θ∗i) + δ, the mo- θbm := argmin kΣ θk − qiUei, Σ θ , 2 m e E θ∈Θ i=1 ment condition (17) becomes kxk2f(hx, θ∗i) + 2(1+κ) (14) kxk2δ < ∞, for which it is sufficient to 2(1+κ) E which is equivalent to assume that kxk2f(hx, θ∗i) < ∞ and E |kxk δ|2(1+κ) < ∞. h m D E i 2 ˜ 2 2 X 3) Theorem 1 is mainly useful when ηθ lies on the θm := argmin kθk2 − qeiUei, θ , ∗ 1/2 m θ∈Σ Θ i=1 boundary of the set Θ. Otherwise, if ηθ∗ belongs to the relative interior of Θ, the descent cone ˜ 1/2 ˆ is a sense that θm = Σ θm. Hence, results obtained D(Θ, ηθ∗) is the affine hull of Θ (which will often for isotropic measurements easily extend to the more be the whole space Rd). Thus, in such cases the Sd−1 general case. Similarly, estimator (12) should be re- Gaussian mean width√ ω(D(Θ, ηθ∗) ∩ ) can placed by be on the order of d, which is prohibitively large when d  m. We refer the reader to [17], [16] for m h 2 X D E a discussion of related result and possible ways θˆλ := argmin kΣ1/2θk2 − q U , Σ1/2θ m 2 ei ei to tighten them. θ∈Rd m i=1 4) As we mentioned in the introduction, for the spe- 1/2 i + λkΣ θkK , (15) cial case x ∼ N (0, Id×d), the work [17] shows that solution of the problem (2) satisfies bound (3) which is equivalent to with high probability. It is worth comparing the

m aforementioned bound with our results. Note that ˜λ h 2 2 X D E i (3) explicitly captures the effect of nonlinearity θm := argmin kθk2 − qeiUei, θ +λkθkΣ1/2K , Rd m θ∈ i=1 on the error bound via two parameters σ1 and σ2, whereas in our bound is the dependence on ˜λ 1/2 ˆλ meaning that θm = Σ θm. these parameters is merged into an (unspecified) 6

constant C1(κ, φ), mainly due to the limitations the unit ball of k·kΣ1/2K norm, in place of G. Namely, C√3β imposed by the more involved proof techniques for all λ ≥ m (1 + ω(GΣ1/2 )), required to handle the heavy-tailed distributions.    5) In the special case where only the noise variable P 1/2 λ Σ θbm − ηθ∗ ≥ has heavy-tailed distribution while the measure- 2  ment vector x is sub-Gaussian, an alternative 3   1/2  1/2  −β/2 λ · Ψ S2 ηΣ θ∗ ; Σ K ≤ C4e approach based on LAD regression or Huber’s 2 regression is possible; non-asymptotic analysis 1/2 Note that ω (G 1/2 ) ≤ kΣ k ω(G). Moreover, we has been performed in several recent works, for Σ show in the Appendix that for a class of decomposable example see [31], [32]. However, construction of norms (which includes k · k and nuclear norm), the resulting estimators typically requires either 1 the upper bounds for Ψ S ηΣ1/2θ  ; Σ1/2K and the known upper bound on the variance of the 2 ∗ Ψ(S (ηθ )) differ by the factor of Σ−1/2 . noise, or the symmetry of the distribution of the 2 ∗ noise variable. C. Examples. Next, we present performance guarantees for the un- constrained estimator (12). We discuss two popular scenarios: estimation of the sparse vector and estimation of the low-rank matrix. Theorem 2. Assume that the norm k · kK dominates Estimation of the sparse signal. Assume that Rd the 2-norm, i.e. kvkK ≥ kvk2, ∀v ∈ . Let x ∼ there exists J ⊆ {1, . . . , d} of cardinality s ≤ E(0, I ,F ), and suppose that for some κ > 0 d×d µ d such that θ∗,j = 0 for j∈ / J. Let Θ = θ ∈ Rd : kθk ≤ kηθ k , with η de- E 2(1+κ) 1 ∗ 1 φ := |q| < ∞. fined in (16). In this case, it is well-known that 2 Sd−1 3 ω D(Θ, ηθ∗) ∩ ≤ 2s log(d/s)+ 2 s, see equa- Then there exist constants C3 = C3(κ, φ),C4 = tion (8) in [33], hence Theorem 1 implies that, with C (κ, φ) > 0 such that for all λ ≥ C√3β (1 + ω(G)) 4 m high probability,   r P λ 3 S −β/2 . s log(d/s) θbm − ηθ∗ ≥ λ · Ψ( 2 (ηθ∗)) ≤ C4e , θbm − ηθ∗ (18) 2 2 2 m as long as m & s log(d/s). for any β ≥ 8 and m ≥ (ω(G) + 1)2β2, where G := We compare this bound to result of Theorem 2 for con- {x ∈ Rd : kxk ≤ 1} is the unit ball of k · k norm, K K strained estimator. Let k·k be the ` norm. It is well- and S (·) and Ψ(·) are given in Definitions 3 and 4 K 1 2 know that ω(G) = E max |g | ≤ p2 log(2d), respectively. j=1,...,d j where g ∼ N (0, Id×d). Moreover,√ we show in the Appendix that Ψ(S2 (ηθ∗)) ≤ 4 s. Hence, for λ ' Remark 5 (Non-isotropic measurements). It fol- q log(2d) lows from remark 3 and (14) that, whenever x ∼ m , Theorem 2 implies that E (0, Σ,F ), inequality of Theorem 1 has the form d µ r λ . s log(d) θbm − ηθ∗ 2  1/2   m P Σ θbm − ηθ∗ ≥ 2 with high probability whenever m & log(2d). This 1/2 d−1  ! ω Σ D(Θ, ηθ∗) ∩ S + 1 β bound is only marginally weaker than (18) due to the C1 √ λ m logarithmic factor, however, definition of θbm does not −β/2 require the knowledge of kηθ∗k1, as we have already ≤ C2e , mentioned before. Estimation of low-rank matrices. Assume that which can be further combined with the bound d = d1d2 with d1 ≤ d2, and θ∗ ∈ Rd1×d2   has rank r ≤ min(d1, d2). Let Θ = 1/2 Sd−1  d1×d2 ω Σ D(Θ, ηθ∗) ∩ θ ∈ R : kθk∗ ≤ kηθ∗k∗ . Then the Gaussian 1/2 −1/2 Sd−1 mean width of the intersection of a descent cone with ≤ kΣ k · kΣ k ω D(Θ, ηθ∗) ∩ , 2 d−1 a unit ball is bounded as ω D(Θ, ηθ∗) ∩ S ≤ 3r(d + d − r) (see proposition 3.11 in [7]). Hence, that follows from remark 1.7 in [17]. Similarly, the 1 2 Theorem 1 yields that with high probability, inequality of Theorem 2 holds with r r(d + d ) d . 
1 2 G 1/2 := {x ∈ R : kxk 1/2 ≤ 1}, θbm − ηθ∗ Σ Σ K 2 m 7 as long as the number of observations satisfies m & non-zero coordinate chosen uniformly at random, and r(d1 + d2). the magnitude having uniform distribution on [0, 1]. Finally, we derive the corresponding bound from The- Since we can only recover the original signal θ∗ up orem 2. The Gaussian mean width of the unit ball in to scaling, define the relative error for any estimator θˆ √ √ ∗ the nuclear norm is bounded by 2( d1 + d2), see with respect to θ as follows: proposition 10.3 in [1]. It follows from results in the √ θˆ θ∗ Appendix that Ψ(S2 (ηθ∗)) ≤ 4 2r. Theorem 2 now Relative error = − . (21) ˆ ∗ implies that with high probability kθk2 kθ k2 r In each of the following two scenarios, we run the r(d1 + d2) θbm − ηθ∗ . , experiment 200 times for both the Lasso estimator and 2 m the estimator defined in (12) with k·kK being the k·k1 1 which matches the bound of Theorem 1. norm. We set the truncation level as τ = cm 2(1+κ) , and the values of c and regularization parameter λ are IV. NUMERICALEXPERIMENTS obtained via the standard 2-fold cross validation for In this section, we demonstrate the performance of the relative error (21). We then plot the histogram of proposed robust estimator (12) for one-bit compressed obtained results over 200 runs of the experiment. sensing model. The model takes the following form: In the first scenario, we set the additive error δi = 0, i = 1, 2, ··· , 128 in the 1-bit model (19) and plot y = sign(hx, θ∗i) + δ, (19) the histogram in Fig. 1. We can see from the plot that the robust estimator (12) noticeably outperforms the δ θ∗ where is the additive noise and the parameter Lasso estimator. is assumed to be s-sparse. This model is highly non- In the second scenario, we set the additive error linear because one can only observe the sign of each δi, i = 1, 2, ··· , 128 to be i.i.d. heavy tailed noise measurement. with signal-to-noise ratio (SNR)2 equal to 10dB, so The 1-bit compressed sensing model was previously that the noise has the distribution discussed extensively in a number of works, e.g. [16], d √ [14], [17]. It was shown that when the measurement δi = hi/ 10, vectors are either Gaussian or sub-Gaussian, the Lasso and h , i = 1, 2, ··· , 128 are i.i.d. random variables estimator recovers the support of θ∗ with high prob- i with Pareto-type distribution (20). The results are ability. Here, we show that under the heavy-tailed plotted in Fig. 2. The histogram shows that, while elliptically distributed measurements, our estimator nu- performance of the Lasso estimator becomes worse, merically outperforms the standard Lasso estimator results of robust estimator (12) are relatively stable. 2 In the second simulation study, the simulation frame- θLasso = argmin kXθ − yk2 + λkθk1, θ∈Rd work similar to the second scenario above, the only while taking the form of a simple soft-thresholding as difference being the increased sample size m. The explained in (13). results are plotted in Fig. 3, 4 with sample sizes In the first numerical experiment, data are simulated m = 256 and m = 512 respectively. 512 in the following way: x1, x2, ··· , x128 ∈ R d V. FINALREMARKS are i.i.d. with spherically symmetric distribution xi = 512 µiUi, i = 1, . . . , n. The random vectors Ui ∈ R In this paper, we investigated the problem of struc- tured signal recovery from nonlinear and heavy-tailed are i.i.d.√ with uniform distribution over the sphere of radius 512, and the random variables µi ∈ R are also measurements. 
In particular, we focus on the scenario i.i.d., independent of Ui and such that where the measurement vectors have an elliptical sym- metric distribution, and propose an estimator that is 1 d robust both to non-linearity and heavy-tailed nature of µi = p (ξi,1 − ξi,2), (20) 2c(q) the measurements. Several questions remain open: first, the proposed where ξi,1 and ξi,2, i = 1, 2, ··· , 128 are i.i.d. with Pareto distribution, meaning that their probability den- estimator relies heavily on the prior knowledge of the sity function is given by true covariance matrix of the measurements, whereas the usual LASSO estimator is does not require such q p(t; q) = I{t>0}, (1 + t)1+q 2The signal-to-noise ratio (dB) is defined as SNR := 10 log (σ2 /σ2 ). In our case, since hx , θ∗i can be positive q 10 signal noise i c(q) := Var(ξ) = 2 , and q = 2.1. The true 2 2 (q−1) (q−2) or negative with equal probability, σsignal = 1, and thus, σnoise = signal θ∗ has sparsity level s = 5, with index of each 1/10. 8

Fig. 1. Lasso vs robust estimator in the noiseless case. Fig. 3. Lasso vs robust estimator under heavy-tailed noise with signal-to-noise ratio(SNR) equal to 10dB and sample size m = 256.

Fig. 2. Lasso vs robust estimator under heavy-tailed noise with signal-to-noise ratio(SNR) equal to 10dB. Fig. 4. Lasso vs robust estimator under heavy-tailed noise with signal- to-noise ratio(SNR) equal to 10dB and sample size m = 512. prior information. Is it possible to obtain strong theo- retical guarantees in the case of heavy-tailed measure- ments when the covariance matrix is unknown? Lemma 1 (Lemma 2.2 of [35]). Let U have the Sd−1 Second (and perhaps more important) question asks uniform distribution on . Then for any ∆ ∈ (0, 1) Sd−1 whether one can extend results of the present paper and any fixed v ∈ , beyond the class of elliptically symmetric distributions. 2 P (hU, vi ≥ ∆) ≤ e−d∆ /2. We note that the case of non-Gaussian measurement vectors with i.i.d. entries has been investigated in Next, we state several useful results from the theory of several works including [14], [34] which showed that empirical processes. the least squares-type estimator is often biased, and the bias can be controlled by a certain distance (for Definition 5 (ψq-norm). For q ≥ 1, the ψq-norm of a instance, the total variation distance) between the dis- random variable ξ ∈ R is given by tribution of the entries and the Gaussian law. − 1 1 q E p p kξkψq = sup p ( (|X| )) . VI.PROOFS. p≥1 This section is devoted to the proofs of Theorems 1 Specifically, the cases q = 1 and q = 2 are known and 2. as the sub-exponential and sub-Gaussian norms re- spectively. We will say that ξ is sub-exponential if A. Preliminaries. kξkψ1 < ∞, and X is sub-Gaussian if kξkψ2 < ∞. We recall several useful facts from probability theory Remark 6. It is easy to check that ψ -norm is indeed that we rely on in the subsequent analysis. q a norm. The following well-known bound shows that the uni- form distribution on a high-dimensional sphere enjoys Remark 7. A useful property, equivalent to the previ- strong concentration properties. ous definition of a sub-Gaussian random variable ξ, is 9 that there exists a positive constant C such that Lemma 3 (Symmetrization inequalities). P (|ξ| ≥ u) ≤ exp(1 − Cu2). E sup |Zm(t)| ≤ 2E sup |Rm(t)|, For the proof, see Lemma 5.5 in [36]. t∈T t∈T Definition 6 (sub-Gaussian random vector). A random and for any u > 0, we have Rd vector x ∈ is called sub-Gaussian if there exists   Sd−1 P E C > 0 such that khx, vikψ2 ≤ C for any v ∈ . sup |Zm(t)| ≥ 2 sup |Zm(t)| + u The corresponding sub-Gaussian norm is then t∈T t∈T   ≤ 4P sup |R (t)| ≥ u/2 . kxkψ2 := sup khx, vikψ2 . m v∈Sd−1 t∈T Next, we recall the notion of the generic chaining Proof. See Lemmas 6.3 and 6.5 in [37] complexity. Let (T, d) be a metric space. We say a ∞ collection {Al}l=0 of subsets of T is increasing when Finally, we recall Bernstein’s concentration inequal- Al ⊆ Al+1 for all l ≥ 0. ity. Definition 7 (Admissible sequence). An increasing ∞ Lemma 4 (Bernstein’s inequality). Let X1, ··· ,Xm sequence of subsets {Al}l=0 of T is admissible if be a sequence of independent centered random vari- 2l |Al| ≤ Nl, ∀l, where N0 = 1 and Nl = 2 , ∀l ≥ 1. ables. Assume that there exist positive constants σ and

For each Al, define the map πl : T → Al as D such that for all integers p ≥ 2 πl(t) = arg mins∈A d(s, t), ∀t ∈ T . Note that, m l 1 X p! since each Al is a finite set, the minimum is always E(|X |p) ≤ σ2Dp−2, m i 2 achieved. When the minimum is achieved for multiple i=1 elements in Al, we break the ties arbitrarily. The then generic chaining complexity γ2 is defined as m ! ∞ 1 σ √ D X l/2 P X γ2(T, d) := inf sup 2 d(t, πl(t)), (22) Xi ≥ √ 2u + u ≤ 2 exp(−u). t∈T m m m l=0 i=1 where the infimum is over all admissible sequences. In particular, if X1, ··· ,Xm are all sub-exponential The following theorem tells us that γ2-functional con- random variables, then σ and D can be chosen as trols the “size” of a Gaussian process. 1 Pm σ = kXikψ and D = max kXikψ . m i=1 1 i=1...m 1 Lemma 2 (Theorem 2.4.1 of [24]). Let {G(t), t ∈ T } be a centered Gaussian process indexed by the set T , and let B. Roadmap of the proof of Theorem 1. 1/2 d(s, t) = E(G(s) − G(t))2 , ∀s, t ∈ T. We outline the main steps in the proof of Theorem 1, and postpone some technical details to sections Then, there exists a universal constant L such that VI-D and VI-E.   1 E As it will be shown below in Lemma 5, γ2(T, d) ≤ sup G(t) ≤ Lγ2(T, d). 0 L t∈T argmin L (θ) = ηθ∗ for η = E (hyx, θ∗i) and θ∈Θ Let (T, d) be a semi-metric space, and let 0 0 2 L (θbm) − L (ηθ∗) = kθbm − ηθ∗k2, hence X1(t), ··· ,Xm(t) be independent stochastic processes T E|X (t)| < ∞ t ∈ T 2 indexed by such that j for all kθbm − ηθ∗k2 and 1 ≤ j ≤ m. We are interested in bounding the =Lτ (θ ) − Lτ (ηθ ) supremum of the empirical process bm ∗  0 τ 0 τ  m + L (θbm) − L (θbm) − L (ηθ∗) + L (ηθ∗) 1 X Z (t) = [X (t) − E(X (t))] . (23) m m i i =Lτ (θ ) − Lτ (ηθ ) + (Lτ (θ ) − Lτ (ηθ )) i=1 bm ∗ m bm m ∗ D E τ τ E The following well-known symmetrization inequality − (Lm(θbm) − Lm(ηθ∗)) − 2 m yx − qeU,e θbm − ηθ∗ , reduces the problem to bounds on a (conditionally) (24) 1 Pm Rademacher process Rm(t) = m i=1 εiXi(t), t ∈ T , where ε1, . . . , εm are i.i.d. Rademacher random where Em(·) stands for the conditional expectation m variables (meaning that they take values {−1, +1} with given (xi, yi)i=1, and where we used the equal- 0 τ 0 τ probability 1/2 each), independent of Xi’s. ity L (θbm) − L (θbm) − L (ηθ∗) + L (ηθ∗) = 10

D E E −2 m yx − qeU,e θbm − ηθ∗ in the last step. Since which further implies that τ τ τ λ 2 θbm minimizes Lm, Lm(θbm) − Lm(ηθ∗) ≤ 0, and kθbm − ηθ∗k2 m 2 2 X D λ E D λ E kθbm − ηθ∗k ≤ E 2 ≤ qeiUei, θbm − ηθ∗ − m qeU,e θbm − ηθ∗ m m 2 X D E D E i=1 qiUei, θbm − ηθ∗ − Em qU,e θbm − ηθ∗ D λ E  λ  m e e − 2Em yx − qU, θb − ηθ∗ + λ kηθ∗kK − kθb kK i=1 ee m m D E * m + − 2Em yx − qU,e θbm − ηθ∗ . 2 X   λ e = qiUei − E qUe , θb − ηθ∗ m e e m i=1 Note that θbm − ηθ∗ ∈ D(Θ, ηθ∗); dividing both sides D E   − 2E yx − qU, θλ − ηθ + λ kηθ k − kθλ k . of the inequality by kθbm − ηθ∗k2, we obtain m ee bm ∗ ∗ K bm K (26) kθbm − ηθ∗k2 ≤ Letting k · k∗ be the dual norm of k · k (meaning m K K 2 D E D E ∗ X E that kxkK = sup {hx, zi , kzkK ≤ 1}), the first term sup qeiUei, v − qeU,e v Sd−1 m in (26) can be estimated as v∈D(Θ,ηθ∗)∩ i=1 D E E * m + + 2 sup yx − qeU,e v . (25) 1 X   λ v∈Sd−1 qiUei − E qUe , θb − ηθ∗ m e e m To get the desired bound, it remains to estimate two i=1 m ∗ terms above. The bound for the first term is implied 1 X   λ ≤ qiUei − E qUe · kθb − ηθ∗kK. (27) by Lemma 8: setting T = D(Θ, ηθ ) ∩ Sd−1, and ob- m e e m ∗ i=1 serving that the diameter ∆ (T ) := sup ktk = 1, K d t∈T 2 Since we get that with probability ≥ 1 − ce−β/2, ∗ 1 m   m X E 2 X D E D E qeiUei − qeUe sup qiUi, v − E qU, v m e e ee i=1 K v∈D(Θ,ηθ )∩Sd−1 m ∗ i=1 * m + (ω(T ) + 1)β 1 X   = sup qiUei − E qUe , t , ≤ C √ . m e e m ktkK≤1 i=1 d To estimate the second term, we apply Lemma 7: Lemma 8 applies with T = G := {x ∈ R : kxkK ≤ 1}. Together with an observation that ∆ (T ) ≤ D E C˜ d E sup ktk = 1 (due to the assumption kvk ≤ 2 sup yx − qeU,e v ≤ √ . t∈T K 2 Sd−1 m d v∈ kvkK, ∀v ∈ R ), this yiels

Result of Theorem 1 now follows from the combina- * m + tion of these bounds. 1 X   P sup qiUei − E qUe , t m e e ktkK≤1 i=1 C. Roadmap of the proof of Theorem 2. (ω(G) + 1) β  ≥ C0 √ ≤ c0e−β/2, Once again, we will present the main steps while m skipping the technical parts. Lemma 5 implies that for any β ≥ 8 and some constants C0, c > 0. For the 0 argmin L (θ) = ηθ∗ for η = E hyx, θ∗i and second term in (26), we use Lemma 7 to obtain θ∈Θ D E C00 0 λ 0 λ 2 λ λ L (θ ) − L (ηθ ) = kθ − ηθ k . 2Em yx − qU,e θb − ηθ∗ ≤ √ kθb − ηθ∗k2 bm ∗ bm ∗ 2 e m m m 00 Thus, arguing as in (24), C λ ≤ √ kθb − ηθ∗kK, m m λ 2 kθb − ηθ∗k m 2 for some constant C00 > 0, where we have again τ λ τ τ λ τ = L (θbm) − L (ηθ∗) + (Lm(θbm) − Lm(ηθ∗)) applied the inequality kvk2 ≤ kvkK. Combining the τ λ τ above two estimates gives that with probability at least − (Lm(θbm) − Lm(ηθ∗)) D E 1 − ce−β/2, E λ − 2 m yx − qU,e θbm − ηθ∗ . e (ω(G) + 1) β λ 2 √ λ λ kθbm − ηθ∗k2 ≤ C kθbm − ηθ∗kK Since θbm is a solution of problem (12), it follows that m   τ λ λ τ λ + λ kηθ∗kK − kθbmkK , (28) Lm(θm) + λ θm K ≤ Lm (ηθ∗) + λ kηθ∗kK , 11 for some constant C >√ 0 and any β ≥ 8. Since Lemma 6. If x ∼ E(0, Id×d,Fµ), then the unit λ ≥ 2C (ω(G) + 1) β/ m by assumption, and the random vector x/kxk2 is uniformly distributed√ over d−1 right hand side of (28) is nonnegative, it follows that the unit sphere S . Furthermore, Ue = dx/kxk2 is a sub-Gaussian random vector with sub-Gaussian 1 λ λ kθbm − ηθ∗kK + kηθ∗kK − kθbmkK ≥ 0. norm kUk independent of the dimension d. 2 e ψ2 λ S Proof. First, we use decomposition (4) for elliptical This inequality implies that θbm − ηθ∗ ∈ 2(ηθ∗). Finally, from (28) and the triangle inequality, distribution together with our assumption that Σ is the identity matrix, to write x =d µU, which implies that λ 2 3 λ kθb − ηθ∗k ≤ λkθb − ηθ∗kK. m 2 2 m d d x/kxk2 = sign(µ)U/kUk2 = sign(µ)U = U, λ Dividing both sides by kθb − ηθ∗k2 gives m with the final distributional equality holding as Sd−1, λ λ 3 kθbm − ηθ∗kK 3 and hence its uniform distribution, is invariant with kθb − ηθ∗k2 ≤ λ ≤ λ · Ψ(S2(ηθ∗)) . m λ respect to reflections across any hyperplane through 2 kθb − ηθ∗k2 2 m the origin. This finishes the proof of Theorem 2. To prove the second claim, it is enough to show D E Sd−1 that U,e v ≤ C, ∀v ∈ with constant C D. Bias of the truncated mean. ψ2 independent of d. By the first claim and Lemma 1, we The following lemma is motivated by and is similar have to Theorem 2.1 in [13]. P −d∆2/2 Sd−1 E (hx, vi/kxk2 ≥ ∆) ≤ e , ∀v ∈ . Lemma 5. Let η = hyx, θ∗i. Then √ 0 Choosing ∆ = u/ d gives ηθ∗ = argmin L (θ), θ∈Θ D E  2 P U,e v ≥ u ≤ e−u /2, ∀v ∈ Sd−1, ∀u > 0. and for any θ ∈ Θ, By an equivalent definition of sub-Gaussian random L0(θ) − L0(ηθ ) = kθ − ηθ k2. ∗ ∗ 2 variables (Lemma 5.5 in [36]), this inequality implies D E Proof. Since y = f(hx, θ i , δ), we have that for any ∗ that U,e v ≤ C, hence finishing the proof. θ ∈ Rd ψ2 With the previous lemma in hand, we now establish E E hyx, θi = hx, θif(hx, θ∗i, δ) the following result. =EE(hx, θif(hx, θ∗i, δ) | hx, θ∗i, δ) Lemma 7. Under the assumptions of Theorem 1, there EE = (hx, θi | hx, θ∗i) · f(hx, θ∗i, δ) exists a constant C = C(κ, φ) > 0 such that   =E hθ , θihx, θ if(hx, θ i, δ) D E √ ∗ ∗ ∗ E yx − qeU,e v ≤ C/ m, =ηhθ∗, θi, for all v ∈ Sd−1. where the third equality follows from the fact that the noise δ is independent of the measurement vector Proof. 
By (7), we have that yx = qUe, thus the claim x, the second to last equality from the properties of is equivalent to elliptically symmetric distributions (Corollary 1), and D E  √ E the last equality from the definition of η. Thus, U,e v (qe− q) ≤ C/ m.

L0(θ) = kθk2 − 2E(hyx, θi) = kθk2 − 2ηhθ , θi Since qe = sign(q)(|q| ∧ τ), we have |qe− q| = (|q| − 2 2 ∗ τ)1(|q| ≥ τ) ≤ |q|1(|q| ≥ τ), and it follows that = kθ − ηθ k2 − kηθ k2, ∗ 2 ∗ 2 D E E U, v (q − q) which is minimized at θ = ηθ∗. Furthermore, e e 0 ∗ 2 D E L (ηθ ) = −kηθ∗k2, hence E ≤ U,e v (qe− q) 0 0 2  D E  L (θ) − L (ηθ∗) = kθ − ηθ∗k2, E ≤ U,e v q · 1{|q|≥τ} finishing the proof.  D E 21/2 E P 1/2 Next, we estimate the “bias term” ≤ U,e v q (|q| ≥ τ) D E sup Sd−1 E yx − qU, v in inequality (25). In κ v∈ ee 2(1+κ) ! 2(1+κ) 1 D E κ   2(1+κ) 1/2 order to do so, we need the following preliminary ≤E U,e v E |q|2(1+κ) P (|q| ≥ τ) , result. 12 where the second to last inequality uses Cauchy- a concentration result for the supremum of multiplier Schwarz, and the last inequality follows from Holder’s¨ processes under weak moment assumptions. In the cur- inequality. rent work, we show that exponential-type concentration For the first term, by Lemma 6, Ue is sub-Gaussian inequalities for multiplier processes, such as the one in with kUekψ2 independent of d. Thus, by the definition Lemma 8, are achievable by applying truncation under Sd−1 of the k · kψ2 norm and the fact that v ∈ , a bounded 2(1 + κ)-moment assumption. κ Define 2(1+κ) ! 2(1+κ) r D E κ 2(1 + κ) m E 1 X D E D E  U,e v ≤ kUekψ2 . Z(t) = Uei, t qi − E U,e t q , κ m e e i=1 Recall that φ = E|q|2(1+κ). Then, the second term 1 m D E 1 X 2(1+κ) Z(t) = εiqi Uei, t , ∀t ∈ T, is bounded by φ . For the final term, since τ = m e 1 i=1 m 2(1+κ) , Markov’s inequality implies that d m where T is a bounded set in R and {εi} is a 1/2 i=1 E|q|2(1+κ)  φ1/2 sequence i.i.d. Rademacher random variables taking (P (|q| > τ))1/2 ≤ ≤ √ . τ 2(1+κ) m values 1 with probability 1/2 each, and independent of {U , q , i = 1, . . . , m}. Result of Lemma 8 easily Combining these inequalities yields ei ei follows from the following concentration inequality: D E E yx − qeU,e v Lemma 9. For any β ≥ 8,   q 2(1+κ) 2+κ (ω(T ) + ∆d(T ))β 2(1+κ) P −β/2 κ kUekψ2 φ √ sup |Z(t)| ≥ C √ ≤ ce , ≤ √ := C(κ, φ)/ m, t∈T m m (30) completing the proof. where C = C(κ, φ) is another constant possibly different from that of Lemma 8, and c > 0 is an absolute constant. E. Concentration via generic chaining. In the following sections, we will use c, C, C0,C00 To deduce the inequality of Lemma 8, we first apply to denote constants that are either absolute, or depend the symmetrization inequality (Lemma 3), followed by on underlying parameters κ and φ (in the latter case, Lemma 13 with β0 = 8. It implies that     we specify such dependence). To make notation less cumbersome, constants denoted by the same letter E sup Z(t) ≤ 2E sup |Z(t)| t∈T t∈T (c, C, C0, etc.) might be different in various parts of ω(T ) + ∆ (T ) the proof. ≤ 2C 8 + 2ce−4 √ d . The goal of this subsection is to prove the following m inequality: Application of the second bound of the symmetrization√ lemma with u = 2C(ω(T ) + ∆d(T ))β/ m and (30) Lemma 8. Suppose Uei and qei are as defined according completes the proof of Lemma 8. to (7) and (8) respectively. Then, for any bounded It remains to justify (30). 
We start by picking an subset T ⊂ Rd, arbitrary point t0 ∈ T such that there exists an m admissible sequence {t0} = A0 ⊆ A1 ⊆ A2 ⊆ · · · 1 X D E D E  P sup Uei, t qi − E U,e t q satisfying m e e t∈T i=1 ∞  X l/2 (ω(T ) + ∆d(T ))β −β/2 sup 2 kπl(t) − tk2 ≤ 2γ2(T ), (31) ≥ C √ ≤ ce , t∈T m l=0 where we recall that π is the closest point map from for any β ≥ 8, a positive constant C = C(κ, φ) and l T to A and the factor 2 is introduced so as to deal an absolute constant c > 0. Here l with the case where the infimum in the definition (22) ∆d(T ) := sup ktk2. (29) of γ2(T ) is not achieved. Then, write Z(t) − Z(t0) as t∈T the telescoping sum: The main technique we apply is the generic chaining ∞ X method developed by M. Talagrand [24] for bounding Z(t) − Z(t0) = Z(πl(t)) − Z(πl−1(t)) the supremum of stochastic processes. Later, works l=1 [38] and [39] advanced the technique to obtain a sharp ∞ m X 1 X D E bound for supremum of processes index by squares = εiqi Uei, πl(t) − πl−1(t) . m e of functions. More recently, S. Mendelson [25] proved l=1 i=1 13

We claim that the telescoping sum converges with For any integer p ≥ 2, we have the following chains probability 1 for any t ∈ T . Indeed, note that for each of inequalities: m m fixed set of realizations of {xi} and {εi} , each  D E p i=1 i=1 E summand is bounded as εqe U,e πl(t) − πl−1(t) p  D E 2 p−2 ≤E ε U,e πl(t) − πl−1(t) q · |q| |ε q hU , π (t) − π (t)i| e i ei ei l l−1 p  D E 2 p−2 ≤E U,e πl(t) − πl−1(t) q · τ ≤|qei|kUeik2kπl(t) − πl−1(t)k2 κ  1+κ  1+κ 1 ≤|q |kU k (kπ (t) − tk + kπ (t) − tk ). D E κ p   1+κ ei ei 2 l 2 l−1 2 p−2E E 2(1+κ) ≤τ U,e πl(t) − πl−1(t) q Furthermore, since T is a compact subset of Rd, its  p/2 p−2 p (1 + κ)p 1 p Gaussian mean width is finite. Thus, by lemma 2, ≤τ kUek φ 1+κ kπl(t) − πl−1(t)k , ψ2 κ 2 γ2(T ) ≤ Lω(T ) < ∞. This inequality further implies that the sum on the left hand side of (31) converges where the second inequality follows from the trunca- with probability 1. tion bound, the third from Holder’s¨ inequality, and the E 2(1+κ) Next, with β ≥ 8 being fixed, we split the index set last from the assumption that q ≤ φ and the {l ≥ 1} into the following three subsets: following bound: by Lemma 6, Uei is sub-Gaussian, hence for any p ≥ 2

l κ I1 = {l ≥ 1 : 2 β < log em};  1+κ  (1+κ)p D E κ p l E Uei, v I2 = {l ≥ 1 : log em ≤ 2 β < m}; l 1/2 I3 = {l ≥ 1 : 2 β ≥ m}.   (1 + κ)p Rd ≤ kUeikψ2 kvk2, ∀v ∈ . κ By the assumptions in Theorem 1 and the bound β ≥ 8, 2 2 kU k d we have that m ≥ (ω(T ) + 1) β ≥ 64, implying that We also note that ei ψ2 does not depend on log em = 1 + log m < m, and hence these three index by Lemma√ 6. Next, by Stirling’s approximation, √ p 0 sets are well defined. Depending on β, some of them p! ≥ 2π p(p/e) , thus there exist constants C = 0 00 00 might be empty, but this only simplifies our argument C (κ, φ) and C = C (κ) such that by making the partial sum over such an index set equal D E p E 0. εqe U,e πl(t) − πl−1(t) The following argument yields a bound for p! 0 2 00 p−2 ≤ C kπl(t)−πl−1(t)k2(C τkπl(t)−πl−1(t)k2) . Z(πl(t)) − Z(πl−1(t)), assuming all three index sets 2 are nonempty. Specifically, we show that Bernstein’s inequality (Lemma 4), with σ = 0 00 C kπl(t) − πl−1(t)k2, D = C τkπl(t) − πl−1(t)k2  1/2(1+κ) and τ = m now implies P X sup (Z(πl(t)) − Z(πl−1(t))) m t∈T 1 D E l∈I X j P εiqi Uei, πl(t) − πl−1(t)  m e γ2(T )β i=1 ≥ C √ ≤ ce−β/2, (32) √ ! ! m C0 2u C00u ≥ √ + kπ (t) − π (t)k 1− 1 l l−1 2 m m 2(1+κ) for C = C(κ, φ) and j = 1, 2, 3, respectively. ≤ 2e−u, 1) The case l ∈ I1: for any u > 0. Taking u = 2lβ, noting that as β ≥ 8 by assumption, we have m ≥ (ω(T ) + 1)2β2 ≥ 64, l l Proof of inequality (32) for the index set I1. Recall and since l ∈ I , 2 ≤ 2 β < log em. In turn, this 1 1 that τ = m 2(1+κ) . implies For each t ∈ T we apply Bernstein’s inequality 2l 2l/2 2l/2 (Lemma 4) to estimate each summand = · 1− 1 1/2 κ/2(1+κ) m 2(1+κ) m m r 2l/2 log em r1 + κ 2l/2 Z(πl(t)) − Z(πl−1(t)) ≤ · ≤ , m m1/2 mκ/(1+κ) κ m1/2 1 X D E = εiqi Uei, πl(t) − πl−1(t) . where the last inequality follows from the fact that m e i=1 1+κ κ/(1+κ) log em is dominated by κ m for all m ≥ 14

D E 1. This inequality implies that there exists a positive l ∈ I2, we let Xi = qei Uei, πl(t) − πl−1(t) and wi = constant C = C(κ, φ) such that for any β ≥ 8 hUei, πl(t) − πl−1(t)i. Then Xi = qeiwi and l m m P (Ωl,t) ≤ 2 exp(−2 β), (33) 1 X 1 X Z(π (t))−Z(π (t)) = ε X = ε w q . l l−1 m i i m i i ei i=1 i=1 where for all l ≥ 1 and t ∈ T we let (34) ( m For every fixed k ∈ {1, 2, ··· , m − 1} and fixed 1 X D E u > 0, we bound the summation using the following Ωl,t = ω : εiqi Uei, πl(t) − πl−1(t) m e i=1 inequality l/2  2 β  m k m !1/2 ≥ C √ kπ (t) − π (t)k . m l l−1 2 P X X ∗ X ∗ 2  εiXi ≥ Xi + u (Xi )  i=1 i=1 i=k+1 Notice that for each l ≥ 1 the number of pairs ≤ 2 exp(−u2/2), (πl(t), πl−1(t)) appearing in the sum in (32) can be l+1 |A | · |A | ≤ 22 ∗ m bounded by l l−1 . Thus, by a union where {Xi }i=1 is the non-increasing rearrangement m m bound and (33), of {|Xi|}i=1 and {εi}i=1 is a sequence of i.i.d. Rade- m ! mancher random variables independent of {Xi}i=1. P [ 2l+1 l Ωl,t ≤ 2 · 2 exp(−2 β), Remark 8. This bound was first stated and proved t∈T m in [40] with a sequence of fixed constants {Xi}i=1. and hence, The current form can be obtained using independence property and conditioning on {X }m . Furthermore,   i i=1 paper [40] tells us that the optimal choice of k is l+1 [ X 2 l 2 P  Ωl,t ≤ 2 · 2 exp(−2 β) at O(u ) Applications of this inequality to generic l∈I1,t∈T l∈I1 chaining-type arguments were previously introduced in X l+1 ≤ 2 · 22 exp −2l−1β − β/2 [25].

l∈I1 Letting J be the set of indices of the variables corre- −β/2 m ≤ce , sponding to the k largest coordinates of {|wi|}i=1 and m of {|qei|}i=1, we have |J| ≤ 2k and with probability at for some absolute constant c > 0, where in the last least 1 − 2 exp(−u2/2) inequality we use the fact β ≥ 8 to get a geometrically m decreasing sequence. Thus, on the complement of the X εiXi event ∪l∈I1,t∈T Ωl,t, we have that with probability at i=1 least 1 − ce−β/2, !1/2 X X ≤ X∗ + u (X∗)2 i i X i∈J i∈J c sup (Z(πl(t)) − Z(πl−1(t))) t∈T 1/2 l∈I1 k ! X ∗ ∗ X ∗ ∗ 2 X ≤2 wi qi + u (wi qi ) ≤ sup |Z(πl(t)) − Z(πl−1(t))| e e t∈T i=1 i∈J c l∈I 1 1/2 1/2 l/2 k ! k ! X 2 β X ∗ 2 X ∗ 2 ≤ sup C √ kπl(t) − πl−1(t)k2 ≤2 (wi ) (qei ) t∈T m l∈I1 i=1 i=1 ∞ κ 1 X 2l/2β m ! 2(1+κ) m ! 2(1+κ) X ∗ 2(1+κ) X ∗ 2(1+κ) ≤ sup C √ kπl(t) − πl−1(t)k2 + u (w ) κ (q ) t∈T m i ei l=1 i=k+1 i=k+1 γ2(T )β ≤4C √ , k !1/2 m !1/2 m X ∗ 2 X 2 ≤2 (wi ) qei C = C(κ, φ) i=1 i=1 for , where the last inequality fol- κ 1 m ! 2(1+κ) m ! 2(1+κ) lows from triangle inequality kπl(t) − πl−1(t)k2 ≤ 2(1+κ) X ∗ X 2(1+κ) + u (w ) κ q kπl−1(t) − tk2 + kπl(t) − tk2 and (31). This proves i ei i=k+1 i=1 the inequality (32) for l ∈ I1. (35) 2) The case l ∈ I2: This is the most technically where the second to last inequality is a consequence√ involved case of the three. For any fixed t ∈ T and of Holder’s¨ inequality. We take u = 2(l+1)/2 β. The 15 key is to pick an appropriate cut point k for each is independent of d by Lemma 1. Thus, l l l ∈ I2. Here, we choose k = b2 β/ log(em/2 β)c,  k !1/2  which makes k = O(2lβ) and also guarantees that X P (w∗)2 ≥ C2l/2kπ (t) − π (t)k pβ k ∈ {1, 2, ··· , m − 1}; see Lemma 16. Under this  i l l−1 2  i=1 choice, we have the following lemma:  =P ∃J ⊆ {1, ··· , m}, |J| = k : Lemma 10. Let k = b2lβ/ log(em/2lβ)c, w = D E i !1/2  U , π (t) − π (t) and {w∗}m be the nonincreas- X ei l l−1 i i=1 w2 ≥ C2l/2kπ (t) − π (t)k pβ m i l l−1 2  ing rearrangement of {|wi|}i=1. Then there exists an i∈J absolute constant C > 1 such that for all β ≥ 8,  !1/2   m  X ≤ P w2 ≥ C2l/2kπ (t) − π (t)k pβ k  i l l−1 2  i∈J  k !1/2 X  m  P (w∗)2 ≤2 exp(−4 · 2lβ)  i k i=1 emk l/2 p  ≤2 exp(−4 · 2lβ) ≤ 2 exp(−2lβ), ≥ C2 kπl(t) − πl−1(t)k2 β k l ≤ 2 exp(−2 β). em k l where the last step follows from k ≤ exp(3·2 β), an inequality proved in Lemma 15 in the Appendix.

m Proof. By Lemma 6, we know that {wi} are i.i.d. l l i=1 Lemma 11. Let k = b2 β/ log(em/2 β)c, wi = sub-Gaussian random variables. Thus, by Lemma 14, D E U , π (t) − π (t) and {w∗}m be the non- w2 ei l l−1 i i=1 i is sub-exponential with norm m increasing rearrangement of {|wi|}i=1. Then  κ m ! 2(1+κ) 2 2 2 2 X ∗ 2(1+κ) kw kψ1 = 2kwik ≤ 2kUeik kπl(t) − πl−1(t)k . P κ i ψ2 ψ2 2  (wi ) (36) i=k+1 It then follows from Bernstein’s inequality (Lemma 4) κ  that for any fixed set J ⊆ {1, 2, ··· , m} with |J| = k, ≥ C(κ)m 2(1+κ) kπl(t) − πl−1(t)k2 ≤ exp(−2lβ), for any β ≥ 8 and some constant C(κ) > 0. 1 P X 2 E 2 wi − wi Proof. To avoid possible confusion, we use i to index k i∈J the nonincreasing rearrangement and j for the orig- r !! 2u u inal sequence. We start by noting that {w }m are ≥ 2kU k2 kπ (t) − π (t)k2 + j j=1 ei ψ2 l l−1 2 k k i.i.d. sub-Gaussian random variables with kwjkψ2 ≤ kU k kπ (t) − π (t)k . By an equivalent definition ≤ 2 exp(−u). ej ψ2 l l−1 2 of sub-Gaussian random variables (Lemma 5.5. in [36]), we have for any fixed j ∈ {1, 2, . . . , m}, l l+2 l   We choose u = 4 · 2 β = 2 β. Since 2 β ≥ P E |wj| − (|wj|) ≥ CukUejkψ2 kπl(t) − πl−1(t)k2 b2lβ/ log(em/2lβ)c = k ≥ 1, the factor u/k dom- E 2 −u2 inates the right hand side. Noting that wi = ≤ e , (37) 2 kπl(t) − πl−1(t)k2, we obtain for any u > 0 and an absolute constant C > 0. To establish the claim of the lemma, we bound each w∗ separately for i = 1, 2 . . . , m and then combine  !1/2  i X individual bounds. Instead of using a fixed value of u in P w2 ≥ C2l/2kπ (t) − π (t)k pβ  i l l−1 2  (37), our choice of u will depend on the index i. Specif- i∈J ∗ κ/4(1+κ) ically, for each wi , we choose u = cκ(m/i) l ≤ 2 exp(−4 · 2 β), with √ 2+κ r   5 2 + 4  4(1+κ) 4(1 + κ) c := max κ , . (38) κ e1/2(1+κ) κ where C ≤ 4kUeikψ2 ; note that the upper bound for C   16

The reason for this choice will be clear as we proceed. by Lemma 16 in the Appendix and (38) implying √ 2+κ First, for a fixed nonincreasing rearrangement index 4  4(1+κ) 1/2(1+κ) cκ ≥ 5 2 + κ /e , we have i > k, by (37) and the fact that κ 2+κ c2 m 2(1+κ) k 2(1+κ) ≥ 5 · 2lβ. E E 21/2 κ (|wj|) ≤ wj = kπl(t) − πl−1(t)k2, ∀j ∈ {1, 2, ··· , m}, Overall, we have the following bound:

 κ  we have ∀j ∈ {1, 2, ··· , m}, m 4(1+κ) P ∃i > k : w∗ ≥ C0 kπ (t) − π (t)k i i l l−1 2    P l l  l |wj| ≥ 1 + CcκkUejkψ2 ≤ exp 4 · 2 β − 5 · 2 β ≤ exp(−2 β). κ ! m 4(1+κ) l · kπ (t) − π (t)k Thus, with probability at least 1 − exp(−2 β), j l l−1 2 κ m 4(1+κ) κ ! ∗ 0 m 2(1+κ) wi ≤ C kπl(t) − πl−1(t)k2, ∀i > k, ≤ exp −c2 . i κ j hence with the same probability

0 κ To simplify notation, let C = 1 + CcκkUejkψ2 (note m ! 2(1+κ) X ∗ 2(1+κ) that it depends only on κ). It then follows that κ (wi )  κ  i=k+1 m 4(1+κ) P ∗ 0 κ w ≥ C kπl(t) − πl−1(t)k2 2(1+κ) i 1/2! i 0 X m  ≤C kπl(t) − πl−1(t)k2 P i = ∃J ⊆ {1, ··· , m}, |J| = i : i=k+1 κ κ   m  2(1+κ) m 4(1+κ) κ Z dx 0 ≤C0kπ (t) − π (t)k m 4(1+κ) wj ≥ C kπl(t) − πl−1(t)k2, ∀j ∈ J l l−1 2 1/2 i 1 x   κ κ m 2(1+κ) 0 2(1+κ) ≤ ≤2 C kπl(t) − πl−1(t)k2m , i  κ i and the desired result follows. 0 m 4(1+κ) · P |wj| ≥ C kπl(t) − πl−1(t)k2 i Lemma 12. The following inequalities hold for any   m  κ 2+κ  β ≥ 8: ≤ exp −c2m 2(1+κ) i 2(1+κ) i  1/2  i m ! em  2 κ 2+κ  X p ≤ exp −c m 2(1+κ) i 2(1+κ) . P q2 ≥ C0 βm ≤ 2e−β, i  ei  i=1 Union bound gives  1  m ! 2(1+κ) κ X 2(1+κ) 1   P 00 2(1+κ) −β ∗ 0 m 4(1+κ)  qi ≥ C (βm)  ≤ 2e , P ∃i > k : w ≥ C kπl(t) − πl−1(t)k2 e i i i=1 m X emi  κ 2+κ  0 0 00 ≤ exp −c2m 2(1+κ) i 2(1+κ) for some positive constants C = C (φ, κ),C = i 00 i=k+1 C (φ, κ). m X  em κ 2+κ  Proof. Recall that q = sign(q )(|q | ∧ τ), τ = = exp i log − c2m 2(1+κ) i 2(1+κ) ei  i i i 1/2(1+κ) E 2(1+κ) E 2 i=k+1 m , and φ = qi . Thus, qei ≤ κ 2+κ 2 1/1+κ  em 2  E q ≤ φ p ≥ 2 ≤m · exp k log − c m 2(1+κ) k 2(1+κ) i , and for any integer , we have k  κ 2+κ      ≤ exp 4 · 2lβ − c2m 2(1+κ) k 2(1+κ) , E 2p E 2p−2(1+κ) 2(1+κ) qei = qei qei p−1−κ  2(1+κ) p−1−κ where the second to last inequality follows since by ≤ m 1+κ E q ≤ m 1+κ φ. p i the definition (38) of cκ, cκ ≥ 4(1 + κ)/κ, the κ 2+κ em  2 2(1+κ) 2(1+κ) Thus, for any p ≥ 2, function v(i) = i log i − cκm · i is monotonically decreasing with respect to i (recall that   E 2 E 2 p E 2p E 2p i ≤ m), and thus is dominated by v(k). The final |qei − qei | ≤ qei + qi inequality follows from Lemma 15 as well as the p−1−κ p 1−κ p−2 fact that log m ≤ log(em) ≤ 2lβ. Furthermore, ≤ m 1+κ φ + φ 1+κ ≤ (m + φ) 1+κ φ(m + φ) 1+κ . 17

Proof of inequality (32) for the index set $I_2$. Combining Lemmas 10 and 11 with the inequality (35), and setting $u = 2^l \beta$, we get that with probability at least $1 - 4\exp(-2^l \beta)$, for all $l \in I_2$,
$$|Z(\pi_l(t)) - Z(\pi_{l-1}(t))| \le \frac{C \|\pi_l(t) - \pi_{l-1}(t)\|_2}{m} \left[ 2^{l/2} \sqrt{\beta} \left( \sum_{i=1}^m \tilde q_i^{\,2} \right)^{1/2} + m^{\frac{\kappa}{2(1+\kappa)}} \left( \sum_{i=1}^m \tilde q_i^{\,2(1+\kappa)} \right)^{\frac{1}{2(1+\kappa)}} \right]$$
for some constant $C = C(\kappa, \phi) > 0$; note that the factor $1/m$ appears due to equality (34). Next, applying a chaining argument similar to the one used in Section VI-E1, we obtain that with probability at least $1 - ce^{-\beta/2}$,
$$\sup_{t \in T} \sum_{l \in I_2} \left( Z(\pi_l(t)) - Z(\pi_{l-1}(t)) \right) \le C\, \frac{\gamma_2(T) \sqrt{\beta}}{m} \left[ \left( \sum_{i=1}^m \tilde q_i^{\,2} \right)^{1/2} + m^{\frac{\kappa}{2(1+\kappa)}} \left( \sum_{i=1}^m \tilde q_i^{\,2(1+\kappa)} \right)^{\frac{1}{2(1+\kappa)}} \right] \tag{39}$$
for a positive constant $C = C(\kappa, \phi)$ and an absolute constant $c > 0$. In order to handle the remaining terms involving $\tilde q_i$ in (39), we apply Lemma 12, which gives
$$\sup_{t \in T} \sum_{l \in I_2} \left( Z(\pi_l(t)) - Z(\pi_{l-1}(t)) \right) \le C\, \frac{\gamma_2(T) \beta}{\sqrt{m}}$$
with probability at least $1 - ce^{-\beta/2}$, where $C = C(\kappa, \phi)$ and $c > 0$ are positive constants and $\beta \ge 8$. This completes the second part of the chaining argument.

3) The case $l \in I_3$:

Proof of inequality (32) for the index set $I_3$. A direct application of the Cauchy–Schwarz inequality to (34) yields, for all $t \in T$,
$$|Z(\pi_l(t)) - Z(\pi_{l-1}(t))| \le \left( \frac{1}{m} \sum_{i=1}^m w_i^2 \right)^{1/2} \left( \frac{1}{m} \sum_{i=1}^m \tilde q_i^{\,2} \right)^{1/2},$$
where $w_i = \langle U e_i, \pi_l(t) - \pi_{l-1}(t) \rangle$ are sub-Gaussian random variables. Thus, by Lemma 14, the $w_i^2$ are sub-exponential with norm bounded as in (36). Using Bernstein's inequality again, we deduce that
$$\mathbb{P}\left( \left| \frac{1}{m} \sum_{i=1}^m \left( w_i^2 - \mathbb{E} w_i^2 \right) \right| \ge 2 \|U e_i\|_{\psi_2}^2 \|\pi_l(t) - \pi_{l-1}(t)\|_2^2 \left( \sqrt{\frac{2u}{m}} + \frac{u}{m} \right) \right) \le 2 \exp(-u).$$
Let $u = 2^l \beta$. Using the fact that $2^l \beta / m \ge 1$ as well as $\mathbb{E} w_i^2 = \|\pi_l(t) - \pi_{l-1}(t)\|_2^2$, we see that the term $u/m$ dominates the right-hand side, and
$$\mathbb{P}\left( \left( \frac{1}{m} \sum_{i=1}^m w_i^2 \right)^{1/2} \ge C \|\pi_l(t) - \pi_{l-1}(t)\|_2\, \frac{2^{l/2} \sqrt{\beta}}{\sqrt{m}} \right) \le 2 \exp(-2^l \beta)$$
for some absolute constant $C > 0$. Thus, repeating the chaining argument of Section VI-E1 (namely, the argument following (33)), we obtain
$$\sup_{t \in T} \sum_{l \in I_3} \left( Z(\pi_l(t)) - Z(\pi_{l-1}(t)) \right) \le C\, \frac{\gamma_2(T) \sqrt{\beta}}{\sqrt{m}} \left( \frac{1}{m} \sum_{i=1}^m \tilde q_i^{\,2} \right)^{1/2}$$
with probability at least $1 - ce^{-\beta/2}$ for some absolute constants $C, c > 0$. Combining this inequality with the first claim of Lemma 12 gives
$$\sup_{t \in T} \sum_{l \in I_3} \left( Z(\pi_l(t)) - Z(\pi_{l-1}(t)) \right) \le C\, \frac{\gamma_2(T) \beta}{\sqrt{m}}$$
with probability at least $1 - ce^{-\beta/2}$ for absolute constants $C, c > 0$ and any $\beta \ge 8$. This finishes the bound for the third (and final) segment of the “chain”.
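Before assembling the three segments into the final bound, a brief numerical aside (ours, not part of the proof): the process being controlled is the multiplier process $Z(t) = \frac{1}{m}\sum_i \varepsilon_i \tilde q_i \langle Ue_i, t\rangle$, and the bound (30) predicts that its supremum over $T$ decays at rate $\beta/\sqrt m$. The sketch below checks the $1/\sqrt m$ scaling using standard Gaussian stand-ins for the vectors $Ue_i$, a random finite subset of the unit sphere in place of $T$, and truncated Student-$t$ multipliers; all of these choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, kappa, n_T = 20, 0.25, 50
T = rng.standard_normal((n_T, d))
T /= np.linalg.norm(T, axis=1, keepdims=True)    # finite stand-in for T on the unit sphere

def avg_sup(m, reps=100):
    tau = m ** (1 / (2 * (1 + kappa)))           # truncation level tau
    df = 2 * (1 + kappa) + 0.1                   # just enough finite moments
    sups = []
    for _ in range(reps):
        X = rng.standard_normal((m, d))          # stand-ins for the vectors Ue_i
        q = rng.standard_t(df, size=m)
        q_t = np.sign(q) * np.minimum(np.abs(q), tau)
        eps = rng.choice([-1.0, 1.0], size=m)    # Rademacher signs eps_i
        Z = (eps * q_t) @ X / m                  # vector (1/m) sum_i eps_i q~_i Ue_i
        sups.append(np.abs(T @ Z).max())         # sup over t in T of |<Z, t>|
    return float(np.mean(sups))

for m in [100, 400, 1600, 6400]:
    s = avg_sup(m)
    print(f"m = {m:5d}: E sup_T |Z| ~ {s:.4f}, sqrt(m) * sup ~ {np.sqrt(m) * s:.3f}")
    # sqrt(m) * sup stays roughly constant, consistent with the beta/sqrt(m) rate
```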
4) Finishing the proof of Lemma 8:

Proof. So far, we have shown that
$$\sup_{t \in T} |Z(t) - Z(t_0)| = \sup_{t \in T} \sum_{l \ge 1} \left( Z(\pi_l(t)) - Z(\pi_{l-1}(t)) \right) \le \sum_{j \in \{1,2,3\}} \sup_{t \in T} \sum_{l \in I_j} \left( Z(\pi_l(t)) - Z(\pi_{l-1}(t)) \right) \le C\, \frac{\gamma_2(T) \beta}{\sqrt{m}} \tag{40}$$
with probability at least $1 - ce^{-\beta/2}$ for some positive constants $C = C(\kappa, \phi)$ and $c$, and any $\beta \ge 8$.

To finish the proof, it remains to bound $|Z(t_0)| = \left| \frac{1}{m} \sum_{i=1}^m \varepsilon_i \tilde q_i \langle U e_i, t_0 \rangle \right|$. With $\Delta_d(T)$ defined in (29), and since $t_0$ is an arbitrary point in $T$, we trivially have $\|t_0\|_2 \le \Delta_d(T)$. Applying Bernstein's inequality in a way similar to Section VI-E1 yields
$$\mathbb{P}\left( \left| \frac{1}{m} \sum_{i=1}^m \varepsilon_i \tilde q_i \langle U e_i, t_0 \rangle \right| \ge \left( \frac{C' \sqrt{2u}}{m^{1 - \frac{1}{2(1+\kappa)}}} + \frac{C'' u}{m} \right) \Delta_d(T) \right) \le 2 e^{-u},$$
for some constants $C' = C'(\kappa, \phi)$, $C'' = C''(\kappa, \phi) > 0$ and any $u > 0$. Choosing $u = \beta$ gives
$$\mathbb{P}\left( \left| \frac{1}{m} \sum_{i=1}^m \varepsilon_i \tilde q_i \langle U e_i, t_0 \rangle \right| \ge \frac{C \Delta_d(T) \beta}{\sqrt{m}} \right) \le 2 e^{-\beta},$$
for a constant $C = C(\kappa, \phi) > 0$ and any $\beta \ge 0$. Combining this bound with (40) shows that with probability at least $1 - ce^{-\beta/2}$,
$$\sup_{t \in T} \left| \frac{1}{m} \sum_{i=1}^m \varepsilon_i \langle U e_i, t \rangle\, \tilde q_i \right| \le C\, \frac{(\gamma_2(T) + \Delta_d(T)) \beta}{\sqrt{m}} \le C\, \frac{(L \omega(T) + \Delta_d(T)) \beta}{\sqrt{m}},$$
for $C = C(\kappa, \phi)$, an absolute constant $L > 0$ and all $\beta \ge 8$; note that the last inequality follows from Lemma 2. We have established (30), thus completing the proof.

REFERENCES

[1] R. Vershynin. Estimation in high dimensions: a geometric perspective. In Sampling Theory, a Renaissance, pages 3–66. Springer, 2015.
[2] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[3] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
[4] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3), 2011.
[6] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.
[7] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[8] S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. IEEE Transactions on Information Theory, 61(5):2886–2908, 2015.
[9] W. Härdle, P. Hall, and H. Ichimura. Optimal smoothing in single-index models. The Annals of Statistics, 21(1):157–178, 1993.
[10] T. M. Stoker. Consistent estimation of scaled coefficients. Econometrica, pages 1461–1481, 1986.
[11] M. Hristache, A. Juditsky, and V. Spokoiny. Direct estimation of the index coefficient in a single-index model. The Annals of Statistics, pages 595–623, 2001.
[12] D. R. Brillinger. A generalized linear model with “Gaussian” regressor variables. In A Festschrift for Erich L. Lehmann, Wadsworth Statist./Probab. Ser., pages 97–114. Wadsworth, Belmont, CA, 1983.
[13] K.-C. Li and N. Duan. Regression analysis under link violation. The Annals of Statistics, pages 1009–1052, 1989.
[14] A. Ai, A. Lapanowski, Y. Plan, and R. Vershynin. One-bit compressed sensing with non-Gaussian measurements. Linear Algebra and its Applications, 441:222–239, 2014.
[15] P. T. Boufounos and R. G. Baraniuk. 1-bit compressive sensing. In Information Sciences and Systems (CISS 2008), 42nd Annual Conference on, pages 16–21. IEEE, 2008.
[16] Y. Plan, R. Vershynin, and E. Yudovina. High-dimensional estimation with geometric constraints. arXiv preprint arXiv:1404.3749, 2014.
[17] Y. Plan and R. Vershynin. The generalized Lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528–1537, 2016.
[18] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: a convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.
[19] M. Genzel. High-dimensional estimation of structured signals from non-linear observations with general convex loss functions. arXiv preprint arXiv:1602.03436, 2016.
[20] C. Thrampoulidis, E. Abbasi, and B. Hassibi. Lasso with non-linear measurements is equivalent to one with linear measurements. In Advances in Neural Information Processing Systems, pages 3420–3428, 2015.
[21] X. Yi, Z. Wang, C. Caramanis, and H. Liu. Optimal linear estimation under unknown nonlinear transform. In Advances in Neural Information Processing Systems, pages 1549–1557, 2015.
[22] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.

[23] R. Vershynin. High-Dimensional Probability. Cambridge University Press, 2017.
[24] M. Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, 2014.
[25] S. Mendelson. Upper bounds on product and multiplier empirical processes. arXiv preprint arXiv:1410.8003, 2014.
[26] T. W. Anderson. An Introduction to Multivariate Statistical Analysis, volume 2. Wiley, New York, 1958.
[27] S. Cambanis, S. Huang, and G. Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11:368–385, 1981.
[28] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[29] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems (NIPS) 27, 2014.
[30] J. Fan, W. Wang, and Z. Zhu. Robust low-rank matrix recovery. arXiv preprint arXiv:1603.08315, 2016.
[31] L. Wang. The $\ell_1$ penalized LAD estimator for high dimensional linear regression. Journal of Multivariate Analysis, 120:135–151, 2013.
[32] Q. Sun, W. Zhou, and J. Fan. Adaptive Huber regression: optimality and phase transition. arXiv preprint arXiv:1706.06991, 2017.
[33] R. Foygel and L. Mackey. Corrupted sensing: novel guarantees for separating structured signals. IEEE Transactions on Information Theory, 60(2):1223–1247, 2014.
[34] L. Goldstein and X. Wei. Non-Gaussian observations in nonlinear compressed sensing via Stein discrepancies. arXiv preprint arXiv:1609.08512, 2016.
[35] K. Ball. An elementary introduction to modern convex geometry. Cambridge University Press, New York, 1997.
[36] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2010.
[37] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, Berlin, 1991.
[38] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17(4):1248–1282, 2007.
[39] S. Dirksen. Tail bounds via generic chaining. arXiv preprint arXiv:1309.3522, 2013.
[40] S. J. Montgomery-Smith. The distribution of Rademacher sums. Proceedings of the AMS, 109(2):517–522, 1990.
[41] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996.

APPENDIX

Lemma 13. For any nonnegative random variable $X$, if $\mathbb{P}(X > K\beta) \le c e^{-\beta/2}$ for some constants $K, c > 0$ and all $\beta \ge \beta_0 \ge 0$, then
$$\mathbb{E}(X) \le K \left( \beta_0 + 2c e^{-\beta_0/2} \right).$$

Proof. Using a well-known identity for the expectation of non-negative random variables,
$$\mathbb{E}(X) = \int_0^\infty \mathbb{P}(X > u)\, du = K \int_0^\infty \mathbb{P}(X > K\beta)\, d\beta \le K \left( \beta_0 + \int_{\beta_0}^\infty \mathbb{P}(X > K\beta)\, d\beta \right) \le K \left( \beta_0 + \int_{\beta_0}^\infty c e^{-\beta/2}\, d\beta \right) = K \left( \beta_0 + 2c e^{-\beta_0/2} \right).$$

Lemma 14. If $X$ and $Y$ are sub-Gaussian random variables, then the product $XY$ is a sub-exponential random variable, and
$$\|XY\|_{\psi_1} \le \|X\|_{\psi_2} \|Y\|_{\psi_2}.$$

Proof. See [41].

Lemma 15. Let $k = \lfloor 2^l \beta / \log(em/2^l \beta) \rfloor$ and $l \in I_2$; then $\left(\frac{em}{k}\right)^k \le \exp(3 \cdot 2^l \beta)$.

Proof. If $k \ge 2$, then $2^l \beta / \log(em/2^l \beta) \ge 2$, which implies $2^l \beta \ge 2 \log(em/2^l \beta)$ and $k \ge \frac{1}{2} \cdot 2^l \beta / \log(em/2^l \beta)$. Thus,
$$\left(\frac{em}{k}\right)^k \le \exp\left( \frac{2^l \beta}{\log\frac{em}{2^l \beta}}\, \log\left( \frac{2em}{2^l \beta} \log\frac{em}{2^l \beta} \right) \right) \le \exp\left( 2^l \beta \left( \frac{\log\frac{2em}{2^l \beta}}{\log\frac{em}{2^l \beta}} + \frac{\log\log\frac{em}{2^l \beta}}{\log\frac{em}{2^l \beta}} \right) \right) \le \exp\left( 3 \cdot 2^l \beta \right),$$
where the last inequality follows from $m \ge 2^l \beta$, so that $\log\frac{2em}{2^l \beta} / \log\frac{em}{2^l \beta} \le 2$ and $\log\log\frac{em}{2^l \beta} / \log\frac{em}{2^l \beta} \le 1$. On the other hand, if $k = 1$, then, since $\log(em) \le 2^l \beta$,
$$\left(\frac{em}{k}\right)^k = em = \exp(\log(em)) \le \exp(2^l \beta),$$
finishing the proof.
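As a quick sanity check of Lemma 13 (our illustration, not part of the original argument): take $X$ with exact tail $\mathbb{P}(X > u) = e^{-u/2}$, so that the hypothesis holds with $K = c = 1$ and $\beta_0 = 0$; the bound $K(\beta_0 + 2ce^{-\beta_0/2}) = 2$ then matches $\mathbb{E}X$ exactly, since the tail inequality is an equality.

```python
import numpy as np

rng = np.random.default_rng(0)

# X with P(X > u) = exp(-u/2), i.e. Exponential with mean 2, satisfies the
# hypothesis of Lemma 13 with K = 1, c = 1, beta_0 = 0.
K, c, beta0 = 1.0, 1.0, 0.0
X = rng.exponential(scale=2.0, size=10**6)

bound = K * (beta0 + 2 * c * np.exp(-beta0 / 2))  # Lemma 13 bound = 2
print(f"empirical E[X] = {X.mean():.4f}, Lemma 13 bound = {bound:.4f}")
# empirical mean ~= 2.0; the bound is tight here because the tail bound is exact
```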

Lemma 16. With $m \ge 1$, $\beta \ge 1$, $\kappa \in (0, 1)$ and $l \in I_2 = \{l \ge 1 : \log(em) \le 2^l \beta < m\}$, the integer $k = \lfloor 2^l \beta / \log(em/2^l \beta) \rfloor$ satisfies $k \ge 1$, and
$$\left( \frac{4(2+\kappa)}{\kappa} \right)^{\frac{2+\kappa}{2(1+\kappa)}} \frac{1}{e^{1/(1+\kappa)}}\, m^{\frac{\kappa}{2(1+\kappa)}}\, k^{\frac{2+\kappa}{2(1+\kappa)}} \ge 2^l \beta.$$

Proof. Since $2^l \beta \ge \log(em) \ge 1$, it follows that $k \ge 1$, and thus $k \ge 2^l \beta / \left( 2 \log(em/2^l \beta) \right)$. It is then enough to show that
$$\frac{\left( 1 + \frac{2}{\kappa} \right)^{\frac{2+\kappa}{2(1+\kappa)}}}{e^{1/(1+\kappa)}} \left( \frac{m}{2^l \beta} \right)^{\frac{\kappa}{2(1+\kappa)}} \ge \left( \log\frac{em}{2^l \beta} \right)^{\frac{2+\kappa}{2(1+\kappa)}}.$$
Raising both sides to the power $2(1+\kappa)/\kappa$, this is equivalent to
$$\left( 1 + \frac{2}{\kappa} \right)^{\frac{2+\kappa}{\kappa}} e^{-2/\kappa} \ge \left( \log\frac{em}{2^l \beta} \right)^{\frac{2+\kappa}{\kappa}} \frac{2^l \beta}{m}.$$
Consider the function $g(x) = (\log(ex))^{\frac{2+\kappa}{\kappa}} / x$. Since $m > 2^l \beta$, to prove the inequality above it suffices to show that $\sup_{x \ge 1} g(x)$ is upper bounded by the left-hand side. Taking the derivative of $g$ yields
$$g'(x) = \frac{\frac{2+\kappa}{\kappa} (1 + \log x)^{2/\kappa} - (1 + \log x)^{\frac{2+\kappa}{\kappa}}}{x^2}.$$
Since $x \ge 1$, the only critical point, at which the global maximum occurs, is $x = e^{2/\kappa}$. As $g\left( e^{2/\kappa} \right)$ is exactly equal to the left-hand side, the proof is complete.
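Lemmas 15 and 16 are purely combinatorial and can be verified numerically. The sketch below is ours; it uses the constant of Lemma 16 in the reconstructed form stated above (an assumption on our part), and checks both inequalities on the log scale over a grid of $m$, $\beta$, $\kappa$, and $l \in I_2$.

```python
import numpy as np

def k_of(m, beta, l):
    x = 2.0**l * beta
    return int(x // np.log(np.e * m / x))   # k = floor(2^l b / log(em / 2^l b))

def check(m, beta, kappa):
    a = kappa / (2 * (1 + kappa))
    b = (2 + kappa) / (2 * (1 + kappa))
    # constant of Lemma 16 as reconstructed above (our reading of the statement)
    c16 = (4 * (2 + kappa) / kappa) ** b / np.exp(1 / (1 + kappa))
    for l in range(1, 64):
        x = 2.0**l * beta
        if x >= m:
            break                            # outside I_2
        if np.log(np.e * m) > x:
            continue                         # outside I_2
        k = k_of(m, beta, l)
        assert k >= 1
        assert k * np.log(np.e * m / k) <= 3 * x   # Lemma 15, on the log scale
        assert c16 * m**a * k**b >= x              # Lemma 16

for m in [10, 100, 10**4, 10**7]:
    for beta in [1, 8, 32]:
        for kappa in [0.1, 0.5, 0.9]:
            check(m, beta, kappa)
print("Lemmas 15 and 16 hold on the whole grid")
```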

Finally, we discuss some facts about decomposable norms, which were introduced in [28].

Definition 8. Suppose that $L_1 \subseteq L$ are two subspaces of $\mathbb{R}^d$, and let $L_1^\perp$ be the orthogonal complement of $L_1$. The norm $\|\cdot\|_K$ is said to be decomposable with respect to $(L, L_1^\perp)$ if
$$\|\theta_1 + \theta_2\|_K = \|\theta_1\|_K + \|\theta_2\|_K$$
for any $\theta_1 \in L$ and $\theta_2 \in L_1^\perp$; here $\Pi_L$ and $\Pi_{L_1^\perp}$ stand for the orthogonal projectors onto $L$ and $L_1^\perp$ respectively.

It is well known that many frequently used norms, including the $\ell_1$ norm of a vector and the nuclear norm of a matrix, are decomposable with respect to appropriately chosen pairs of subspaces. For instance, the $\ell_1$ norm is decomposable with respect to the pair of subspaces $(L(J), L(J)^\perp)$, where
$$L(J) := \left\{ v \in \mathbb{R}^d : v_j = 0 \text{ for all } j \notin J \right\} \tag{41}$$
consists of sparse vectors with non-zero coordinates indexed by a set $J \subseteq \{1, \dots, d\}$.

Let $W_1 \subseteq \mathbb{R}^{d_1}$, $W_2 \subseteq \mathbb{R}^{d_2}$ be two linear subspaces. Then we define the subspace $L(W_1, W_2) \subseteq \mathbb{R}^{d_1 \times d_2}$ via
$$L(W_1, W_2) := \left\{ M \in \mathbb{R}^{d_1 \times d_2} : \mathrm{row}(M) \subseteq W_1,\ \mathrm{col}(M) \subseteq W_2 \right\},$$
where $\mathrm{row}(M)$ and $\mathrm{col}(M)$ are the linear subspaces spanned by the rows and columns of $M$ respectively, and
$$L_1^\perp(W_1, W_2) := \left\{ M \in \mathbb{R}^{d_1 \times d_2} : \mathrm{row}(M) \subseteq W_1^\perp,\ \mathrm{col}(M) \subseteq W_2^\perp \right\}. \tag{42}$$
Then the nuclear norm $\|\cdot\|_*$ is decomposable with respect to $\left( L(W_1, W_2), L_1^\perp(W_1, W_2) \right)$ (see [28] for details).

Assume that the norm $\|\cdot\|_K$ is decomposable with respect to $(L, L_1^\perp)$, and let $\theta \in L$. It is clear that for any $v \in S_{c_0}(\theta)$,
$$\|\theta + v\|_K \le \|\Pi_L \theta\|_K + \frac{1}{c_0} \|\Pi_{L_1} v\|_K + \frac{1}{c_0} \|\Pi_{L_1^\perp} v\|_K. \tag{43}$$
Since $\theta \in L$, decomposability and the triangle inequality imply that
$$\|\theta + v\|_K = \left\| \Pi_L \theta + \Pi_{L_1} v + \Pi_{L_1^\perp} v \right\|_K = \left\| \Pi_L \theta + \Pi_{L_1} v \right\|_K + \left\| \Pi_{L_1^\perp} v \right\|_K \ge \|\Pi_L \theta\|_K - \|\Pi_{L_1} v\|_K + \|\Pi_{L_1^\perp} v\|_K.$$
Substituting this bound into (43) gives
$$-\|\Pi_{L_1} v\|_K + \|\Pi_{L_1^\perp} v\|_K \le \frac{1}{c_0} \|\Pi_{L_1} v\|_K + \frac{1}{c_0} \|\Pi_{L_1^\perp} v\|_K,$$
which implies that for any $v \in S_{c_0}(\theta)$,
$$\|\Pi_{L_1^\perp} v\|_K \le \frac{c_0 + 1}{c_0 - 1} \|\Pi_{L_1} v\|_K.$$
It is easy to see that the set of all $v$ satisfying the inequality above is a convex cone, which we will denote by $C_{c_0} = C_{c_0}(K)$. Since $S_{c_0}(\theta) \subseteq C_{c_0}$,
$$\Psi(S_{c_0}(\theta)) \le \Psi(C_{c_0})$$
by the definition of the restricted compatibility constant. This inequality is useful due to the fact that it is often easier to estimate $\Psi(C_{c_0})$.

Finally, we make a remark that is useful when dealing with non-isotropic measurements. Let $\Sigma \succ 0$ be a $d \times d$ matrix, and consider the norm corresponding to the convex set $\Sigma^{1/2} K$, so that $\|v\|_{\Sigma^{1/2} K} = \|\Sigma^{-1/2} v\|_K$. It is easy to see that $C_{c_0}(\Sigma^{1/2} K) = \Sigma^{1/2} C_{c_0}(K)$, hence
$$\Psi\left( C_{c_0}(\Sigma^{1/2} K);\, \Sigma^{1/2} K \right) = \sup_{v \in \Sigma^{1/2} C_{c_0}(K) \setminus \{0\}} \frac{\|v\|_{\Sigma^{1/2} K}}{\|v\|_2} = \sup_{u \in C_{c_0}(K) \setminus \{0\}} \frac{\|u\|_K}{\|\Sigma^{1/2} u\|_2} \le \left\| \Sigma^{-1/2} \right\| \Psi\left( C_{c_0}(K);\, K \right).$$

Example 1: $\ell_1$ norm. Let $L(J)$ be as in (41) with $|J| = s \le d$. If $v \in \mathbb{R}^d$ belongs to the corresponding cone $C_{c_0}$, then clearly $\|v\|_1 \le \frac{2 c_0}{c_0 - 1} \|v_J\|_1$, where $v_J := \Pi_{L(J)} v$. Hence
$$\|v\|_1 \le \frac{2 c_0}{c_0 - 1} \|v_J\|_1 \le \frac{2 c_0}{c_0 - 1} \sqrt{|J|}\, \|v\|_2,$$
and $\Psi(C_{c_0}) \le \frac{2 c_0}{c_0 - 1} \sqrt{s}$.
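A small numerical illustration of Example 1 (ours, not from the paper): draw random vectors, push them into the cone $C_{c_0}$ by rescaling the off-support coordinates, and confirm that the ratio $\|v\|_1 / \|v\|_2$ never exceeds $\frac{2c_0}{c_0-1}\sqrt{s}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, s, c0 = 200, 10, 3.0
alpha = (c0 + 1) / (c0 - 1)              # cone constraint: ||v_{J^c}||_1 <= alpha * ||v_J||_1
bound = 2 * c0 / (c0 - 1) * np.sqrt(s)   # claimed restricted compatibility bound

J = rng.choice(d, size=s, replace=False)
mask = np.zeros(d, dtype=bool); mask[J] = True

worst = 0.0
for _ in range(10_000):
    v = rng.standard_normal(d)
    # force v into the cone by shrinking the off-support coordinates if needed
    excess = np.abs(v[~mask]).sum() / (alpha * np.abs(v[mask]).sum())
    if excess > 1:
        v[~mask] /= excess
    worst = max(worst, np.abs(v).sum() / np.linalg.norm(v))

print(f"max ||v||_1 / ||v||_2 over cone samples: {worst:.3f} <= bound {bound:.3f}")
```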

Example 2: nuclear norm. Let $L_1^\perp(W_1, W_2)$ be as in (42). Note that for any $v \in \mathbb{R}^{d_1 \times d_2}$, $\Pi_{L_1^\perp(W_1, W_2)} v = \Pi_{W_2^\perp}\, v\, \Pi_{W_1^\perp}$, where $\Pi_{W_1^\perp}$ and $\Pi_{W_2^\perp}$ are the orthogonal projectors onto the orthogonal complements of $W_1 \subseteq \mathbb{R}^{d_1}$ and $W_2 \subseteq \mathbb{R}^{d_2}$ respectively. Then for any $v \in C_{c_0}$, we have that
$$\|v\|_* \le \left\| \Pi_{L_1^\perp(W_1, W_2)} v \right\|_* + \left\| \Pi_{L_1(W_1, W_2)} v \right\|_* \le \frac{2 c_0}{c_0 - 1} \left\| \Pi_{L_1(W_1, W_2)} v \right\|_*. \tag{44}$$
Note that
$$\Pi_{L_1(W_1, W_2)} v = v - \Pi_{W_2^\perp}\, v\, \Pi_{W_1^\perp} = \Pi_{W_2^\perp}\, v\, \Pi_{W_1} + \Pi_{W_2}\, v,$$
hence
$$\mathrm{rank}\left( \Pi_{L_1(W_1, W_2)} v \right) \le 2 \max\left( \dim(W_1), \dim(W_2) \right),$$
which together with (44) yields
$$\|v\|_* \le \frac{2 c_0}{c_0 - 1} \left\| \Pi_{L_1(W_1, W_2)} v \right\|_* \le \frac{2 c_0}{c_0 - 1} \sqrt{2 \max\left( \dim(W_1), \dim(W_2) \right)}\, \|v\|_2,$$
and $\Psi(C_{c_0}) \le \frac{2\sqrt{2}\, c_0}{c_0 - 1} \sqrt{\max\left( \dim(W_1), \dim(W_2) \right)}$.
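The key step in Example 2 is the rank bound combined with the Cauchy–Schwarz inequality on singular values, $\|M\|_* \le \sqrt{\mathrm{rank}(M)}\,\|M\|_2$. A quick numerical confirmation (ours; the random factorization merely produces a matrix of rank at most $2r$, standing in for $\Pi_{L_1(W_1,W_2)} v$):

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, r = 30, 40, 5

# random matrix of rank <= 2r, standing in for Pi_{L1(W1,W2)} v
A = rng.standard_normal((d1, 2 * r)) @ rng.standard_normal((2 * r, d2))

nuc = np.linalg.norm(A, ord="nuc")   # nuclear norm (sum of singular values)
fro = np.linalg.norm(A, ord="fro")   # Frobenius norm, playing the role of ||.||_2
rank = np.linalg.matrix_rank(A)

# Cauchy-Schwarz on the singular values: ||A||_* <= sqrt(rank) * ||A||_F
assert nuc <= np.sqrt(rank) * fro + 1e-8
print(f"||A||_* = {nuc:.2f} <= sqrt({rank}) * ||A||_F = {np.sqrt(rank) * fro:.2f}")
```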

Larry Goldstein is a Professor in the Department of Mathematics at the University of Southern California. He completed his Ph.D. at UCSD in 1984, where he also earned a Master's degree in Electrical Engineering. His current research mainly focuses on applications of Stein's method to modern-day statistics.

Stanislav Minsker is an Assistant Professor in the Department of Mathematics at the University of Southern California. Before joining USC, he held a Visiting Assistant Professor position at Duke University and was a Quantitative Associate at Wells Fargo Securities. He completed his Ph.D. at the Georgia Institute of Technology in 2012. His current research focuses on high-dimensional statistics and statistical learning theory, specifically on the analysis of robust algorithms and distributed statistical estimation.

Xiaohan Wei received the B.S. degree in electrical engineering and information science from the University of Science and Technology of China, Hefei, Anhui, China, in 2012, and the M.S. degree with honors in electrical engineering from the University of Southern California, Los Angeles, in 2014. He is now working towards the Ph.D. degree in electrical engineering. His research is in the area of stochastic optimization and statistical learning theory.