<<

arXiv:1311.2542v4 [cs.DS] 25 Aug 2015 yNFgat C-829 n DMS-1128155. and CCF-0832797 Par grants Award. NSF Research by Faculty Google a and N00014-14-1-0632, OADAUIIDTER FSAS DIMENSIONALITY SPARSE OF THEORY UNIFIED A TOWARD pedxB ol rmcne nlss53 51 54 36 FJLT an with programs squares least Sketching analysis convex from C. Tools Appendix theory probability from B. Tools Appendix A. Appendix References Discussion 9. subspaces Manifolds of collection infinite 8.6. Possibly subspaces of collection 8.5. Finite vectors 8.4. Flat 8.3. .Eapeapiain fmi hoe 32 subspace 8.2. theorem Linear main of applications 8.1. theorem Example main the of 8. Proof 7. .Secigcntandlatsursporm 20 case 6.2. programs Unconstrained squares least constrained 6.1. Sketching 6. ..Apiainto Application case subspace type-2 5.1. linear The a of theorem case 5. main The of proof of 4. Overview 3. Preliminaries 2. Applications 1.1. Introduction 1. ..spotdb S rn I-477 n AERaadCCF award CAREER and IIS-1447471 Fo grant Deutsche the NSF of by 1060 grant supported SFB J.N. by DMS-1301619. supported grant partially NSF S.D. by supported partially J.B. erig n osrie es qae rbessc sth as such problems squares least constrained compres and model-based learning, and classical Johnson-Lindenstra algebra, sparse linear the merical using in results isomet new restricted Gaussia implies Fourier-based i.i.d. and of Johnson having embeddings, the analog Φ to subspace sparse dense related a results a several is with unify result qualitatively concerned Our was which small. rem, is parameter this that ..s htΦpeevstenr fevery of norm the preserves Φ that so i.e. eed ntegoer of geometry the on depends ilctvl pt + 1 to up tiplicatively ε Abstract. K1]with [KN14] k ℓ 2 ∈ sas vectors -sparse , 1 (0 cntandcase -constrained ENBUGI,SOR IKE,ADJLN NELSON JELANI AND DIRKSEN, SJOERD BOURGAIN, JEAN , 1 / )gvn esuystig for settings study we given, 2) EUTO NECIENSPACE EUCLIDEAN IN REDUCTION e Φ Let s o-eosprclm.Frasubset a For column. per non-zeroes k sas vectors -sparse ∈ R ε eitoueanwcmlxt aaee,which parameter, complexity new a introduce We . m × Φ E T n x n hwta tsffie ochoose to suffices it that show and , sup ∈ easas ono-idntas transform Johnson-Lindenstrauss sparse a be T Contents

k Φ x 1 k 2 2 ,s m, − 1

x eurdt ensure to required ε, < ∈ T fti okdn hl supported while done work this of t iutnosyadmul- and simultaneously Lnesruslemma, -Lindenstrauss T e esn,manifold sensing, sed shnseencat(DFG). rschungsgemeinschaft s rnfr nnu- in transform uss is u okalso work Our ries. fteui sphere, unit the of Lasso. e odnstheo- Gordon’s nre.We entries. n s 1560 N grant ONR -1350670, and m such 45 44 37 34 33 33 33 27 23 22 19 16 12 11 9 5 2 2 JEANBOURGAIN,SJOERDDIRKSEN,ANDJELANINELSON

1. Introduction Dimensionality reduction is a ubiquitous tool across a wide array of disciplines: machine learning [WDL+09], high-dimensional computational geometry [Ind01], privacy [BBDS12], compressed sensing [CT05], spectral graph theory [SS11], inte- rior point methods for linear programming [LS13], numerical linear algebra [Sar06], computational learning theory [BB05, BBV06], manifold learning [HWB07, Cla08], motif-finding in computational biology [BT02], astronomy [CM12], and several oth- ers. Across all these disciplines one is typically faced with data that is not only massive, but each data item itself is represented as a very high-dimensional vec- tor. For example, when learning spam classifiers a data point is an email, and it is represented as a high-dimensional vector indexed by dictionary words [WDL+09]. In astronomy a data point could be a star, represented as a vector of light intensi- ties measured over various points sampled in time [KZM02, VJ11]. Dimensionality reduction techniques in such applications provide the following benefits: Smaller storage consumption. • Speedup during data analysis. • Cheaper signal acquisition. • Cheaper transmission of data across computing clusters. • The technical guarantees required from a dimensionality reduction routine are application-specific, but typically such methods must reduce dimension while still preserving point geometry, e.g. inter-point distances and angles. That is, one has some point set X Rn with n very large, and we would like a dimensionality- reducing map f : X⊂ Rm, m n, such that → ≪ x, y X, (1 ε) x y f(x) f(y) (1 + ε) x y (1.1) ∀ ∈ − k − k≤k − k≤ k − k 2 for some norm . Note also that for unit vectors x, y, cos(∠(x, y)) = (1/2)( x 2 + 2 2k·k k k y 2 x y 2), and thus f also preserves angles with additive error if it preserves kEuclideank −k − normsk of points in X (X X). A powerful tool for achieving∪ Eq. (1.1),− used in nearly all the applications cited above, is the Johnson-Lindenstrauss (JL) lemma [JL84]. Theorem 1 (JL lemma). For any subset X of Euclidean space and 0 <ε< 1/2, m 2 there exists f : X ℓ with m = O(ε− log X ) providing Eq. (1.1) for = . → 2 | | k·k k·k2 n This bound on m is nearly tight: for any n 1 Alon exhibited a point set X ℓ2 , ≥ 2 ⊂ X = n + 1, such that any such JL map f must have m = Ω(ε− (log n)/ log(1/ε)) [Alo03].| | In fact, all known proofs of the JL lemma provide linear f, and the JL lemma is tight up to a constant factor in m when f must be linear [LN14]. Un- fortunately, for actual applications such worst-case understanding is unsatisfying. Rather we could ask: if given a distortion parameter ε and point set X as input (or a succinct description of it if X is large or even infinite, as in some applications), what is the best target dimension m = m(X,ε) such that a JL map exists for X with this particular ε? That is, in practice we are more interested in moving beyond worst case analysis and being as efficient as possible for our particular data X. Unfortunately the previous question seems fairly difficult. For the related ques- tion of computing the optimal distortion for embedding X into a line (i.e. m = 1), it is computationally hard to approximate the optimal distortion even up to a mul- tiplicative factor polynomial in X [BCIS05]. In practice, however, typically f can- not be chosen arbitrarily as a function| | of X anyway. For example, when employing certain learning such as stochastic gradient descent on dimensionality- reduced data, it is at least required that f is differentiable on Rn (where X Rn) [WDL+09]. For several applications it is also crucial that f is linear, e.g.⊂ in nu- merical linear algebra [Sar06] and compressed sensing [CT05, Don06]. In one-pass TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 3 streaming applications [CW09] and data structural problems such as nearest neigh- bor search [HIM12], it is further required that f is chosen randomly without know- ing X. For any particular X, a random f drawn from some distribution must satisfy the JL guarantee with good probability. In streaming applications this is because X is not fully known up front, but is gradually observed in a stream. In data structure applications this is because f must preserve distances to some future query points, which are not known at the time the data structure is constructed. Due to the considerations discussed, in practice typically f is chosen as a random linear map drawn from some distribution with a small number of parameters (in some cases simply the parameter m). For example, popular choices of f include a random matrix with independent Gaussian [HIM12] or Rademacher [Ach03] entries. While worst case bounds inform us how to set parameters to obtain the JL guarantee for worst case X, we typically can obtain better parameters by exploiting prior knowledge about X. Henceforth we only discuss linear f, so we write f(x)=Φx for m n Φ R × . Furthermore by linearity, rather than preserving Euclidean distances in ∈ X it is equivalent to discuss preserving norms of all vectors in T = (x y)/ x y 2 : n 1 { − k − k x, y X S − , the set of all normalized difference vectors amongst points in X. Thus∈ } up ⊂ to changing ε by roughly a factor of 2, Eq. (1.1) is equivalent to

sup Φx 2 1 < ε. (1.2) x T k k − ∈

Furthermore, since we consider Φ chosen at random, we more specifically want E sup Φx 2 1 < ε. (1.3) Φ x T k k − ∈

Instance-wise understanding for achieving Eq. (1.3) was first provided by Gordon [Gor88], who proved that a random Φ with i.i.d. Gaussian entries satisfies Eq. (1.3) as long as m & (g2(T )+1)/ε2, where we write A & B if A CB for a universal constant C > 0. Denoting by g a standard n-dimensional≥ Gaussian vector, the parameter g(T ) is defined as the Gaussian mean width

g(T ) def= E sup g, x . g x T h i ∈ One can think of g(T ) as describing the ℓ2-geometric complexity of T . It is always true that g2(T ) . log T , and thus Gordon’s theorem implies the JL lemma. In fact for all T we know from| | applications, such as for the restricted isometry property from compressed sensing [CT05] or subspace embeddings from numerical linear algebra [Sar06], the best bound on m is a corollary of Gordon’s theorem. Later works extended Gordon’s theorem to other distributions for Φ, such as Φ having independent subgaussian entries (e.g. Rademachers) [KM05, MPTJ07, Dir14]. Although Gordon’s theorem gives a good understanding for m in most scenar- ios, it suffers from the fact that it analyzes a dense random Φ, which means that performing the dimensionality reduction x Φx is dense matrix-vector multiplica- tion, and is thus slow. For example, in some7→ numerical linear algebra applications (such as least squares regression [Sar06]), multiplying a dense unstructured Φ with the input turns out to be slower than obtaining the solution of the original, high- dimensional problem! In compressed sensing, certain iterative recovery algorithms such as CoSamp [NT09] and Iterative Hard Thresholding [BD08] involve repeated multiplications by Φ and Φ∗, the conjugate transpose of Φ, and thus Φ supporting fast matrix-vector multiply are desirable in such applications as well. The first work to provide Φ with small m supporting faster multiplication is the Fast Johnson-Lindenstrauss Transform (FJLT) of [AC09] for finite T . The value 2 3 of m was still O(ε− log T ), with the time to multiply Φx being O(n log n + m ). [AL09] later gave an improved| | construction with time O(n log n + m2+γ ) for any 4 JEANBOURGAIN,SJOERDDIRKSEN,ANDJELANINELSON small constant γ > 0. Most recently several works gave nearly linear embedding time in n, independent of m, at the expense of increasing m by a (log n)c factor [AL13, KW11, NPW14]. In all these works Φ is the product of some number of very sparse matrices and Fourier matrices, with the speed coming from the Fast Fourier Transform (FFT) [CT65]. It is also known that this FFT-based approach can be used to obtain fast RIP matrices for compressed sensing [CT06, RV08, CGV13] and fast oblivious subspace embeddings for numerical linear algebra applications [Sar06] (see also [Tro11, LDFU13] for refined analyses in the latter case). Another line of work, initiated in [Ach03] and greatly advanced in [DKS10], sought fast embedding time by making Φ sparse. If Φ is drawn from a distribution over matrices having at most s non-zeroes per column, then Φx can be computed in time s x 0. After some initial improvements [KN10, BOR10], the best known ·k k 2 achievable value of s to date for the JL lemma while still maintaining m . ε− log T is the sparse Johnson-Lindenstrauss Transform (SJLT) of [KN14], achieving s| .| 1 ε− log T . εm. Furthermore, an example of a set T exists which requires this bound on| |s up to O(log(1/ε)) for any linear JL map [NN13b]. Note however that, again, this is an understanding of the worst-case parameter settings over all T . In summary, while Gordon’s theorem gives us a good understanding of instance- wise bounds on T for achieving good dimensionality reduction, it only does so for dense, slow Φ. Meanwhile, our understanding for efficient Φ, such as the SJLT with small s, has not moved beyond the worst case. In some very specific examples of T we do have good bounds for settings of s,m that suffice, such as T the unit norm vectors in a d-dimensional subspace [CW13, MM13, NN13a], or all elements of T having small ℓ norm [Mat08, DKS10, KN10, BOR10]. However, our understand- ing for general∞T is non-existent. This brings us to the main question addressed in n 1 n this work, where S − denotes the ℓ -unit sphere x R : x =1 . 2 { ∈ k k2 } n 1 Question 2. Let T S − and Φ be the SJLT. What relationship must s,m satisfy, in terms of the⊆ geometry of T , to ensure (1.3)?

We also note that while the FFT-based and sparse Φ approaches seem orthogonal at first glance, the two are actually connected, as pointed out before [AC09, Mat08, NPW14]. The FJLT sets Φ = SP where P is some random preconditioning matrix that makes T “nice” with high probability, and S is a random sparse matrix. We point out that although the SJLT is not the same as the matrix S typically used in the FFT-based literature, one could replace S with the SJLT and hope for similar algorithmic outcome if s is small: nearly linear embedding time. The answer to the analog of Question 2 for a standard Gaussian matrix depends only on the ℓ2-metric structure of T . Indeed, since both ℓ2-distances and Gaussian matrices are invariant under orthogonal transformations, so is (1.3) in this case. This is reflected in Gordon’s theorem, where the embedding dimension m is gov- erned by the Gaussian width, which is invariant under orthogonal transformations. We stress that in sharp contrast, a resolution of Question 2 cannot solely depend on the ℓ2-metric structure of T . Indeed, we require that Φ be sparse in a particu- lar basis and is therefore not invariant under orthogonal transformations. Thus an answer to Question 2 must be more nuanced (see our main theorem, Theorem 3).

Our Main Contribution: We provide a general theorem which answers Ques- n 1 tion 2. Specifically, for every T S − analyzed in previous work that we apply our general theorem to here, we either⊆ (1) qualitatively recover or improve the pre- vious best known result, or (2) prove the first non-trivial result for dimensionality reduction with sparse Φ. We say “qualitatively” since applying our general theorem to these applications loses a factor of logc(n/ε) in m and logc(n/ε)/ε in s. TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 5

In particular for (2), our work is the first to imply that non-trivially sparse dimensionality reducing linear maps can be applied for gain in model-based com- pressed sensing [BCDH10], manifold learning [TdSL00, DG13], and constrained least squares problems such as the popular Lasso [Tib96]. Specifically, we prove the following theorem.

n 1 Theorem 3 (Main Theorem). Let T S − and Φ be an SJLT with column sparsity s. Define the complexity parameter⊂ n def 1 q 1/q κ(T ) = κs,m(T )= max E E sup ηj gjxj , q m log s qs η g | | ≤ s √ x T j=1 n   ∈ X   o where (gj ) are i.i.d. standard Gaussian and (ηj ) i.i.d. Bernoulli with mean qs/(m log s). If (g2(T )+1) m & (log m)3(log n)5 · ε2 1 s & (log m)6(log n)4 . · ε2 Then (1.3) holds as long as s,m furthermore satisfy the condition (log m)2(log n)5/2κ(T ) < ε. (1.4) The complexity parameter κ(T ) may seem daunting at first, but Section 8 shows it can be controlled quite easily for all the T we have come across in applications.

1.1. Applications. Here we describe various T and their importance in certain applications. Then we state the consequences that arise from our theorem. In order to highlight the qualitative understanding arising from our work, we introduce the < 2 c notation A B if A B ε− (log n) . A summary of our bounds is in Figure 1. ∗ ≤ · Finite T : This is the setting T < , for which the SJLT satisfies Eq. (1.3) with 1| | 2 | | ∞ n s . ε− log T , m . ε− log T [KN14]. If also T Bℓ∞ (α), i.e. x α for all | | | | ⊂ 2 k k∞ ≤ x T , [Mat08] showed it is possible to achieve m . ε− log T with a Φ that has ∈ 2 2 | | an expected O(ε− (α log T ) ) non-zeroes per column. Our theorem implies s,m| | < log T suffices in general, and s < 1 + (α log T )2,m < log T in the latter case, qualitatively∗ | | matching the above. ∗ | | ∗ | | Linear subspace: Here T = x E : x 2 = 1 for some d-dimensional lin- ear subspace E Rn, for which{ achieving∈ k k Eq. (1.3)} with m . d/ε2 is possible [AHK06, CW13].⊂ A distribution satisfying Eq. (1.3) for any d-dimensional subspace E is known as an oblivious subspace embedding (OSE). The paper [Sar06] pioneered the use of OSE’s for speeding up approximate algorithms for numerical linear alge- bra problems such as low-rank approximation and least-squares regression. More applications have since been found to approximating leverage scores [DMIMW12], k-means clustering [BZMD11, CEM+14], canonical correlation analysis [ABTZ13], + support vector machines [PBMID13], ℓp-regression [CDMI 13, WZ13], ridge re- gression [LDFU13], streaming approximation of eigenvalues [AN13], and speeding up interior point methods for linear programming [LS13]. In many of these appli- n d cations there is some input A R × , n d, and the subspace E is for example the column space of A. Often∈ an exact solution≫ requires computing the singular value decomposition (SVD) of A, but using OSE’s the running time is reduced to that for computing ΦA, plus computing the SVD of the smaller matrix ΦA. The work [CW13] showed s = 1 with small m is sufficient, yielding algorithms for least squares regression and low-rank approximation whose running times are linear in the number of non-zero entries in A for sufficiently lopsided rectangular matrices. 6 JEANBOURGAIN,SJOERDDIRKSEN,ANDJELANINELSON

Our theorem implies that s < 1 and m < d suffices to satisfy Eq. (1.3), which is qualitatively correct. Furthermore,∗ a subset∗ of our techniques reveals that if the maximum incoherence max1 i n PEei 2 is at most poly(ε/ log n), then s = 1 suffices (Theorem 5). This was≤ ≤ notk knownk in previous work. A random d- dimensional subspace has incoherence d/n w.h.p. for d & log n by the JL lemma, and thus is very incoherent if n d. ≫ p n d n d Closed convex cones: For A R × , b R , and R a closed convex set, ∈ ∈ C ⊆ 2 consider the constrained least squares problem of minimizing Ax b 2 subject to x . A popular choice is the Lasso [Tib96], in whichk the constraint− k set ∈ C d = x R : x 1 R encourages sparsity of x. Let x be an optimal solution, Cand{ let ∈T (x) bek k the≤ tangent} cone of at some point x ∗ Rd (see Appendix B for a definition).C Suppose we wish to accelerateC approximately∈ solving the constrained 2 least squares problem by instead computing a minimizerx ˆ of ΦAx Φb 2 subject k 2 − k to x . The work [PW14] showed that to guarantee Axˆ b 2 (1 + ε) Ax 2 ∈C k − k ≤ k ∗ − b 2, it suffices that Φ satisfy two conditions, one of which is Eq. (1.2) for T = k n 1 AT (x ) S − . The paper [PW14] then analyzed dense random matrices for sketchingC ∗ ∩ constrained least squares problems. For example, for the Lasso if we are promised that the optimal solution x is k-sparse, [PW14] shows that it suffices to set ∗ 2 1 Aj 2 m & 2 max k2 k k log d, ε j=1,...,d σmin,k where Aj is the jth column of A and σmin,k the smallest ℓ1-restricted eigenvalue of A: σmin,k = inf Ay 2. y 2=1, y 1 2√k k k k k k k ≤ Our work also applies to such T (and we further show the SJLT with small s,m satisfies the second condition required for approximate constrained least squares; see Theorem 16). For example for the Lasso, we show that again it suffices that 2 > Aj 2 m max k 2 k k log d ∗ j=1,...,d σmin,k but where we only require from s that A2 > ij s max 2 k ∗ 1 i n σmin,k 1≤j≤d ≤ ≤ That is, the sparsity of Φ need only depend on the largest entry in A as opposed to the largest column norm in A. The former can be much smaller. n 1 Unions of subspaces: Define T = θ ΘEθ S − , where Θ is some index set and n ∪ ∈ ∩ each Eθ R is a d-dimensional linear subspace. A case of particular interest is when θ ⊂Θ ranges over all k-subsets of 1,...,n , and E is the subspace spanned ∈ { } θ by ej j θ (so d = k). Then T is simply the set of all k-sparse unit vectors of unit { } ∈ def n Euclidean norm: Sn,k = x R : x 2 =1, x 0 k for 0 denoting support size. Φ satisfying (1.2) is then{ ∈ referredk k to as havingk k ≤ the}restrictedk·k isometry property (RIP) of order k with restricted isometry constant εk = ε [CT05]. Such Φ are known 2 to exist with m . εk− k log(n/k), and furthermore it is known that ε2k < √2 1 implies that any (approximately) k-sparse x Rn can be (approximately) recovered− from Φx in polynomial time by solving a certain∈ linear program [CT05, Can08]. Unfortunately it is known for ε = Θ(1) that any RIP matrix Φ with such small m must have s & m [NN13b]. Related is the case of vectors sparse in some other n basis, i.e. T = Dx R : Dx 2 = 1, x 0 k for some so-called “dictionary” D (i.e. the subspaces{ ∈ are actedk onk by Dk),k or≤ when} T only allows for some subset TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 7

n of all k sparsity patterns in model-based compressed sensing [BCDH10] (so that Θ < n ). k | |Our main theorem also implies RIP matrices with s,m < k log(n/k). More gen-  ∗ erally when a dictionary D is involved such that the subspaces span( Dej j θ) are all α-incoherent (as defined above), the sparsity can be further impr{ oved} ∈ to s <1+(αk log(n/k))2. That is, to satisfy the RIP with dictionaries yielding incoher- ent∗ subspaces, we can keep m qualitatively the same while making s much smaller. For the general problem of unions of d-dimensional subspaces, our theorem implies one can either set m

set T to preserve our m our s previous m previous s ref T < log T log T log T log T [JL84] | | ∞ | | | | 2 | | | | 2 T < , x T x α log T α log T log T α log T [Mat08] | | ∞ ∀ ∈ k k∞ ≤ | | ⌈ | |⌉ | | ⌈ | |⌉ E, dim(E) d d 1 d 1 [NN13a] ≤ Sn,k k log(n/k) k log(n/k) k log(n/k) k log(n/k) [CT05] HSn,k k log(n/k) 1 k log(n/k) 1 [RV08] 2 2 kAj k2 Aij tangent cone for Lasso maxj 2 k maxi,j 2 k same as here s = m [PW14] σ σ min,k min,k 6 3 Θ < d + log Θ log Θ d (log Θ ) (log Θ ) [NN13a] E Θ| ,|dim(∞E) d | | | | · | | | | ∀ ∈ ≤ 2 Θ < d + log Θ α log Θ — — — E Θ| ,|dim(∞E) d | | ⌈ | |⌉ ∀ ∈ ≤ max1≤j≤n PE ej 2 α E∈Θ k k ≤ Θ infinite see appendix see appendix similar to m [Dir14] E | Θ|, dim(E) d (non-trivial) this work ∀ ∈ ≤ 2 a smooth manifold d 1+(αd) d d [Dir14] M Figure 1. The m,s that suffice when using the SJLT with vari- ous T as a consequence of our main theorem, compared with the best known bounds from previous work. All bounds shown hide 1 poly(ε− log n) factors. One row is blank in previous work due to no non-trivial results being previously known. For the case of the Lasso, we assume k is the sparsity of the true optimal solution. bounds for various T , albeit losing a logc(n/ε) factor in m and a logc(n/ε)/ε factor in s as mentioned above. Finally, Section 9 discusses avenues for future work. In the appendix, in Appendix A for the benefit of the reader we review many probabilistic tools that are used throughout this work . Appendix B reviews some introductory material related to convex analysis, which is helpful for understanding Section 6 on constrained least squares problems. In Appendix C we also give a direct analysis for using the FJLT for sketching constrained least squares programs, providing quantitative benefits over some analyses in [PW14].

2. Preliminaries We fix some notation that we will be used throughout this paper. For a positive integer t, we set [t] = 1,...,t . For a,b R, a . b denotes a Cb for some universal constant C >{0, and }a b signifies∈ that both a . b and≤ b . a hold. n ≃ For any x R and 1 p , we let x p denote its ℓp-norm. To any set Rn ∈ ≤ ≤ ∞ k k S we associate a semi-norm S defined by z S = supx S z, x . Note that⊂ z = z , where conv(||| · |||S) is the closed||| ||| convex hull∈ of| h S,i i.e., | the ||| |||S ||| |||conv(S) closure of the set of all convex combinations of elements in S. We use (ei)1 i n n m n ≤ ≤ and (eij )1 i m,1 j n to denote the standard basis in R and R × , respectively. ≤ ≤ ≤ ≤ If η = (ηi)i 1 is a sequence of random variables, we let (Ωη, Pη) denote the ≥ E p probability space on which it is defined. We use η and Lη to denote the associated expected value and Lp-space, respectively. If ζ is another sequence of random p variables, then p q means that we first take the L -norm and afterwards Lη,Lζ η q k·k the L -norm. We reserve the symbol g to denote a sequence g = (gi)i 1 of i.i.d. ζ ≥ standard Gaussian random variables; unless stated otherwise, the covariance matrix is the identity. If A is an m n matrix, then we use A or A ℓn ℓm to denote its operator 2 → 2 × k k k k 1/2 norm. Moreover we let Tr be the trace operator and use A F = (Tr(A∗A)) to denote the Frobenius norm. k k In the remainder, we reserve the letter ρ to denote (semi-)metrics. If ρ cor- responds to a (semi-)norm , then we let ρ (x, y) = x y denote the k·kX X k − kX 10 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON associated (semi-)metric. Also, we use dρ(S) = supx,y S ρ(x, y) to denote the di- ∈ ameter of a set S with respect to ρ and write dX instead of dρX for brevity. So, for n n example, ρℓ2 is the Euclidean metric and dℓ2 (S) the ℓ2-diameter of S. From here n 1 n on, T is always a fixed subset of S − = x R : x =1 , and ε (0, 1/2) the parameter appearing in (1.3). { ∈ k k } ∈ We make use of chaining results in the remainder, so we define some relevant Rn E quantities. For a bounded set S , g(S) = g supx S g, x is the Gaussian mean width of S, where g Rn ⊂is a Gaussian vector∈ withh identityi covariance ∈ n matrix. For a (semi-)metric ρ on R , Talagrand’s γ2-functional is defined by

∞ r/2 γ2(S,ρ) = inf sup 2 ρ(x, Sr) (2.1) ∞ Sr =0 · r x S r=0 { } ∈ X where ρ(x, Sr) is the distance from x to Sr, and the infimum is taken over all 2r collections S ∞ , S S ... S, with S =1, S 2 . If ρ corresponds { r}r=0 0 ⊂ 1 ⊂ ⊆ | 0| | r|≤ to a (semi-)norm X , then we usually write γ2(S, X ) instead of γ2(S,ρX ). It is known that fork·k any bounded S Rn, g(S) and γ (k·kS, ) differ multiplicatively ⊂ 2 k·k2 by at most a universal constant [Fer75, Tal05]. Whenever γ2 appears without a specified norm, we imply use of ℓ2 or ℓ2 ℓ2 operator norm. We frequently use the entropy integral estimate (see [Tal05])→

∞ γ (S,ρ) . log1/2 (S,ρ,u)du. (2.2) 2 N Z0 Here (S,ρ,u) denotes the minimum number of ρ-balls of radius u centered at pointsN in S required to cover S. If ρ corresponds to a (semi-)norm , then we k·kX write (S, X ,u) instead of (S,ρX ,u). LetN us nowk·k introduce the sparseN Johnson-Lindenstrauss transform in detail. Let σ : Ω 1, 1 be independent Rademacher random variables, i.e., P(σ = ij σ → {− } ij 1) = P(σij = 1)= 1/2. We consider random variables δij :Ωδ 0, 1 with the following properties:− →{ } For fixed j the δ are negatively correlated, i.e. • ij k k s k 1 i

m n Theorem 4. Let R × and let σ1,...,σn be independent Rademacher random variables. For anyA⊂1 p< ≤ ∞ p 1/p E 2 E 2 2 sup Aσ 2 Aσ 2 . γ2 ( , ℓ2 ℓ2 )+ dF ( )γ2( , ℓ2 ℓ2 ) σ A k k − k k A k·k → A A k·k → ∈A   2 + √pdF ( )dℓ2 ℓ2 ( )+ pdℓ2 ℓ2 ( ). (2.5) A → A → A For u Rn and v Rm, let u v : Rm Rn be the operator (u v)w = v, w u. Then, for∈ any x Rn∈we can write⊗ Φx =→A σ, where ⊗ h i ∈ δ,x x(δ1,·) 0 0 − − (δ2 ) ··· 1 m n 1 0 x ,· 0 Aδ,x := δij xj ei eij =  . − . − · · · .  . √s ⊗ √s . . . i=1 j=1   X X  0 0 x(δm,·)     ··· − −(2.6) for x(δi,·) = δ x . Note that E Φx 2 = x 2 for all x Rn and therefore j ij j k k2 k k2 ∈ 2 2 2 E 2 sup Φx 2 x 2 = sup Aδ,xσ 2 Aδ,xσ 2 . x T |k k −k k | x T |k k − k k | ∈ ∈ n Associated with δ = (δij ) we define a random norm on R by

n 1/2 1 2 x δ = max δij xj . (2.7) k k √s 1 i m ≤ ≤ j=1  X  With this definition, we have for any x, y T , ∈ A A = x y , A A = x y . (2.8) k δ,x − δ,yk k − kδ k δ,x − δ,ykF k − k2 Therefore, (2.5) implies that

p 1/p E 2 2 2 2 sup Φx 2 x 2 . γ2 (T, δ)+ γ2(T, δ)+ √pdδ(T )+ pdδ(T ). σ x T k k −k k k·k k·k  ∈  (2.9)

Taking Lp(Ωδ)-norms on both sides yields

p 1/p E 2 2 E 2p 1/p E p 1/p sup Φx 2 x 2 . ( γ2 (T, δ)) + ( γ2 (T, δ)) δ,σ x T k k −k k δ k·k δ k·k  ∈  E p 1/p E 2p 1/p + √p( dδ (T )) + p( dδ (T )) . (2.10) Thus, to find a bound for the expected value of the restricted isometry constant of E 2 E 2 Φ on T , it suffices to estimate δ γ2 (T, δ) and δ dδ(T ). If we can find good E p 1/p E pk·k 1/p bounds on ( δ γ2 (T, δ)) and ( dδ (T )) for all p 1, then we in addition obtain a high probabilityk·k bound for the restricted isometry≥ constant. Unless explicitly stated otherwise, from here on Φ always denotes the SJLT with s non-zeroes per column.

3. Overview of proof of main theorem Here we give an overview of the proof of Theorem 3. To illustrate the ideas going into our proof, it is simplest to first consider the case where T is the set of all unit 12 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON vectors in a d-dimensional linear subspace E Rn. By Eq.(2.10) for p = 1 we have to bound for example E γ (T, ). Standard⊂ estimates give, up to log d, δ 2 k·kδ 1/2 γ2(T, δ) γ2(BE , δ) sup t[log (BE , δ,t)] . (3.1) k·k ≤ k·k ≪ t>0 N k·k n d for BE the unit ball of E. Let U R × have columns forming an orthonormal basis for E. Then dual Sudakov minoration∈ [BLM89, Proposition 4.2], [PTJ86] states that 1/2 sup t[log (BE, δ,t)] E Ug δ (3.2) t>0 N k·k ≤ g k k for a Gaussian vector g. From this point onward one can arrive at a result us- ing the non-commutative Khintchine inequality [LP86, LPP91] and other standard Gaussian concentration arguments (see appendix). Unfortunately, Eq. (3.2) is very specific to unit balls of linear subspaces and has no analog for a general set T . At this point we use a statement about the duality of entropy numbers [BPSTJ89, Proposition 4]. This is the principle that for two symmetric convex bodies K and D, (K,D) and (D◦,aK◦) are roughly comparable for some constant a ( (K,D) is theN number of translatesN of D needed to cover K). Although it has been anN open conjecture for over 40 years as to whether this holds in the general case [Pie72, p. 38], the work [BPSTJ89, Proposition 4] shows that these quantities are comparable up to logarithmic factors as well as a factor depending on the type-2 constant of the norm defined by D (i.e. the norm whose unit vectors are those on the boundary of D). In our case, this lets us relate ˜ m log (T, δ,t) with log (conv( i=1BJi ), T , √st/8), losing small factors. N ˜ k·k N ∪ ||| · ||| Here T is the convex hull of T T , and BJi is the unit ball of span ej : δij =1 . We next use Maurey’s lemma,∪− which is a tool for bounding covering{ numbers of} the set of convex combinations of vectors in various spaces. This lets us relate m 1 log (conv( i=1BJi ), T ,ǫ) to quantities of the form log ( k i A BJi , T ,ǫ), N ∪ |||·||| N ∈ |||·||| where A [m] has size k . 1/ǫ2 (see Lemma 18). For a fixed A, we bucket j [n] ⊂ P α ∈ according to i A δij and define Uα = j [n]: i A δij 2 . Abusing ∈ { ∈ ∈ ≃ } notation, we also let Uα denote the coordinate subspace spanned by j Uα. This leads to (see Eq.P (7.11)) P ∈ 1 k ǫ log B , ,ǫ . log B , , (3.3) N k Ji ||| · |||T N Uα ||| · |||T 2α log m i A α r  X∈  X   Finally we are in a position to apply dual Sudakov minoration to the right hand side of Eq. (3.3), after which point we apply various concentration arguments to yield our main theorem.

4. The case of a linear subspace n n 1 Let E R be a d-dimensional linear subspace, T = E S − , B the unit ⊂ ∩ E ℓ2-ball of E. We use PE to denote the orthogonal projection onto E. The values PEej 2, j = 1,...,n, are typically referred to as the leverage scores of E in the numericalk k linear algebra literature. We denote the maximum leverage score by

µ(E)= max PEej 2, 1 j n k k ≤ ≤ which is sometimes referred to as the incoherence µ(E) of E. In our proof we make use of the well-known Gaussian concentration of Lipschitz functions [Pis86, Corollary 2.3]: if g is an n-dimensional standard Gaussian vector and φ : Rn R is L-Lipschitz from ℓn to R, then for all p 1 → 2 ≥ φ(g) E φ(g) p . L√p. (4.1) k − kLg TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 13

Theorem 5. For any p log m and any 0 <ǫ< 1, ≥ (d + log m) log2(d/ǫ) p log2(d/ǫ) log m (E γ2p(T, ))1/p . ǫ2 + + µ(E)2 (4.2) 2 k·kδ m s and d p (E d2p(T ))1/p . + µ(E)2. (4.3) δ m s As a consequence, if η 1/m and ≤ 2 2 1 2 m & ((d + log m) min log (d/ε), log (m) + d log(η− ))/ε { } 1 2 2 2 1 2 2 s & (log(m) log(η− ) min log (d/ε), log (m) + log (η− ))µ(E) /ε (4.4) { } then Eq. (1.2) holds with probability at least 1 η. − Proof. By dual Sudakov minoration (Lemma 29 in Appendix A) E Ug log (B , ,t) . g k kδ , for all t> 0, N E k·kδ t n d d with U R × having columns f forming an orthonormal basis for E and ∈ { k}k=1 g a random Gaussian vector. Let U (i) be U but where each row j is multiplied by δij . Then by Eq.(2.7) and taking ℓ = log m,

1 n d 2 1/2 E Ug δ = E max δij gk fk,ej g k k √s g 1 i m h i ≤ ≤ h j=1 k=1 i X X 1 (i) = E max U g 2 √s g 1 i m k k ≤ ≤ 1 (i) (i) (i) max E U g 2 + E max U g 2 E U g 2 ≤ √s 1 i m g k k g 1 i m k k − g k k  ≤ ≤ ≤ ≤  m ℓ 1/ℓ 1 (i) (i) (i) max U F + E U g 2 E U g 2 ≤ √s 1 i m k k g k k − g k k  ≤ ≤  i=1   X 1 (i) √ (i) . max U F + ℓ max U ℓd ℓn (4.5) √s 1 i m k k · 1 i m k k 2 → 2  ≤ ≤ ≤ ≤  where Eq. (4.5) used Gaussian concentration of Lipschitz functions (cf. Eq. (4.1)), (i) (i) noting that g U g 2 is U -Lipschitz. Since 7→(1 k /√s)k k, we cank use for small t the bound (cf. Eq. (A.6)) k·kδ ≤ k·k2 log (T, ,t) log (B , ,t√s) N k·kδ ≤ N E k·k2 2 < d log 1+ (4.6) · t√s   Using Eq. (4.6) for small t and noting U (i) = ( n δ P e 2)1/2 for P k kF j=1 ij k E jk2 E the orthogonal projection onto E, Eq. (2.2) yields for t = (ǫ/d)/ log(d/ǫ) P∗ γ2(T, δ) k·k ∗ t 1 1/2 1/√s E Ug . √d log 2+ dt + g k kδ dt · t√s ∗ t Z0 Zt h 1 1/2 i 1 . √dt log + E Ug δ log ∗ t √s g k k t √s h  ∗ i  ∗  n 1/2 log(d/ǫ) 2 (i) . ǫ + max δij PEej 2 + log m max U ℓd ℓn √s · 1 i m k k 1 i m k k 2 → 2 ≤ ≤ j=1 ≤ ≤ h h X i i p (4.7) 14 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

As a consequence,

2 n p 1/p E 2p 1/p 2 log (d/ǫ) E 2 ( γ2 (T, δ)) . ǫ + max δij PEej 2 δ k·k s δ 1 i m k k ≤ ≤ j=1 h h X i  1/p E (i) 2p + log(m) max U ℓd ℓn (4.8) δ 1 i m k k 2 → 2  ≤ ≤  i We estimate the first non-trivial term on the right hand side. Since p log m, ≥ n p 1/p m n p 1/p E 2 E 2 max δij PEej 2 δij PEej 2 δ 1 i m k k ≤ δ k k ≤ ≤ j=1 i=1 j=1  h X i   X X  n p 1/p E 2 e max δij PEej 2 ≤ · 1 i m δ k k ≤ ≤ j=1  h X i  Note that for fixed i, the (δij )1 j n are i.i.d. variables of mean s/m and therefore ≤ ≤ n n s s sd E δ P e 2 = P e ,e = Tr(P )= . ij k E jk2 m h E j ji m E m j=1 j=1 X X Let (rj )1 j n be a vector of independent Rademachers. By symmetrization (Eq. (A.4)) and Khintchine’s≤ ≤ inequality (Eq. (A.2)), n n 2 ds E 2 δij PEej 2 + (δij δij ) PEej 2 k k Lp ≤ m − k k Lp j=1 δ j=1 δ X X n ds 2 +2 rj δij PEej 2 ≤ m k k Lp j=1 δ,r X n 1/2 ds 4 . + √p δij PEej 2 m k k Lp j=1 δ  X  n 1/2 ds 2 + √p max PEej 2 δij PEej 2 p ≤ m · 1 j n k k · k k Lδ ≤ ≤ j=1 X Solving this quadratic inequality yields n 2 ds 2 δij PEej 2 . + p max PEej 2 (4.9) k k Lp m · 1 j n k k j=1 δ ≤ ≤ X We conclude that

n p 1/p E 2 ds 2 max δij PEej 2 . + p max PEej 2 (4.10) δ 1 i m k k m j k k ≤ ≤ j=1  h X i  We now estimate the last term on the right hand side of Eq. (4.8). As p log m, ≥ 1/p 1/p E (i) 2p E (i) 2p max U d n . max U d n δ 1 i m k kℓ2 ℓ2 1 i m k kℓ2 ℓ2 ≤ ≤ → ≤ ≤ →     1/p E (i) (i) p = max (U )∗U ℓd ℓd . 1 i m k k 2 → 2 ≤ ≤   If uj denotes the jth row of U, then n (i) (i) (U )∗U = δij uj uj∗. j=1 X TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 15

By symmetrization and the non-commutative Khintchine inequality (cf. Eq. (A.3)), n s n s δij ujuj∗ + δij ujuj∗ I Lp ≤ m − m Lp j=1 δ j=1 δ X X n s +2 rj δij ujuj∗ ≤ m Lp j=1 δ,r X n 1/2 s 2 . + √p δij uj 2uj uj∗ 2 m k k Lp/ j=1 δ X n s 1/2 + √p max PEej 2 δij uj uj∗ p ≤ m 1 j n k k · Lδ ≤ ≤ j=1 X Solving this quadratic inequality, we find n s 2 δij uj uj∗ . + p max PEej 2 Lp m · 1 j n k k j=1 δ ≤ ≤ X and as a consequence, 1/p E (i) 2p s 2 max U ℓd ℓn . + p max PEej 2 δ 1 i m k k 2 → 2 m · 1 j n k k  ≤ ≤  ≤ ≤ Applying this estimate together with Eq. (4.10) in Eq. (4.8) yields the asserted estimate in Eq. (4.2). For the second assertion, note by Cauchy-Schwarz that

n 1/2 1 2 dδ(T )= max sup δij xj √s 1 i m ≤ ≤ x T j=1 ∈ h X i n d 1 2 1/2 = max sup δij x, fk fk,ej √s 1 i m h i h i ≤ ≤ x T j=1 ∈ h X  kX=1  i n 1/2 1 2 max δij PEej 2 . ≤ √s 1 i m k k ≤ ≤ j=1 h X i Therefore, n p 1/p E 2p 1/p 1 E 2 ( d (T )) max δij PEej 2 δ δ ≤ s δ 1 i m k k ≤ ≤ j=1  h X i  and Eq. (4.3) immediately follows from Eq. (4.10). To prove the final assertion, we combine (2.10), Eq. (4.2) and Eq. (4.3) to obtain

p 1/p E sup Φx 2 x 2 δ,σ x T k k −k k  ∈  2 2 2 (d + log m) log (d/ǫ) p log (d/ǫ) log m 2 . ǫ + + max PEej 2 m s 1 j n k k ≤ ≤ 2 2 1/2 2 (d + log m) log (d/ǫ) p log (d/ǫ) log m 2 + ǫ + + max PEej 2 m s 1 j n ≤ ≤ k k h 2 2 1/2 i pd p 2 pd p 2 + + max PEej 2 + + max PEej 2 . m s 1 j n k k m s 1 j n k k ≤ ≤ h ≤ ≤ i Now apply Lemma 27 of Appendix A to obtain, for any w log m, ≥ 2 2 P 2 (d + log m) log (d/ǫ) w log (d/ǫ) log m 2 δ,σ εΦ,T & ǫ + + max PEej 2 m s 1 j n k k  ≤ ≤ 16 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

2 2 1/2 2 (d + log m) log (d/ǫ) w log (d/ǫ) log m 2 + ǫ + + max PEej 2 m s 1 j n k k ≤ ≤ h 2 2 1/2 i wd w 2 wd w 2 w + + max PEej 2 + + max PEej 2 e− . m s 1 j n m s 1 j n ≤ ≤ k k ≤ ≤ k k ≤ 1 h i  Now set w = log(η− ), choose ǫ = ε/C (with C a large enough absolute constant) and ǫ = d/m, and use the assumptions on m and s to conclude P(ε ε) η.  Φ,T ≥ ≤ Theorem 5 recovers a similar result in [NN13a] but via a different method, less logarithmic factors in the setting of m, and the revelation that s can be taken smaller 1 if all the leverage scores PEej 2 are small (note if PEej 2 (log d log m)− for all j, we may take s = 1,k thoughk this is not true ink general).k ≪ Our dependence· on 1/ε in s is quadratic instead of the linear dependence in [NN13a], but as stated earlier, in most applications of OSE’s ε is taken a constant. Remark 6. As stated, one consequence of the above is we can set s = 1, and m to nearly linear in d as long as µ is at most inverse polylogarithmic in d. It is worth noting that if d & log n, then a random d-dimensional subspace E has µ(E) . d/n by the JL lemma, and thus most subspaces do have even much lower coherence. Note that the latter bound is sharp. Indeed, let E be any d-dimensional p n d subspace and let U R × have orthonormal columns such that E is the column space of U. Then ∈ max max v,ej v E, v 2=1 1 j n | h i | ∈ k k ≤ ≤ is the maximum ℓ norm of a row u of U. Since U has n rows and U 2 = d, 2 j k kF there exists a j such that uj 2 d/n and in particular µ(E) d/n. k k ≥ ≥ n n It is also worth noting that [AMT10, Theorem 3.3] observed that if H R × is p p ∈ a bounded orthonormal system (i.e. H is orthogonal and maxi,j Hij = O(1/√n), and D is a diagonal matrix with independent Rademacher entries,| then| HDE has incoherence µ . d(log n)/n with probability 1 1/poly(n). They then showed that incoherence µ implies that a sampling matrix− S Rm n suffices to achieve p × Eq. (1.3) with m . nµ2 log m/ε2. Putting these two facts∈ together implies SHD gives Eq. (1.3) with m . d(log n)(log m)/ε2. Eq. (4.4) combined with [AMT10, Theorem 3.3] implies a statement of a similar flavor but with an arguably stronger n 1 implication for applications: ΦHD with s = 1 satisfies Eq.(1.3) for T = E S − without increasing m as long as n & d(log(d/ε))c/ε2. ∩

5. The type-2 case

Recall that the type-2 constant T2( ) of a semi-norm is defined as the best constant (if it exists) such that thek·k inequality k·k 1/2 2 E gαxα T2 xα (5.1) g ≤ k k α  α  X X holds, for all finite systems of vectors xα and i.i.d. standard Gaussian (gα). n 1 { } Given T S − , define the semi-norm ⊂ x T = sup x, y ||| ||| y T | h i | ∈ It will be convenient to use the following notation. For i = 1,...,m we let Ji be the set of ‘active’ indices, i.e., J = j =1,...,n : δ =1 . We can then write i { ij } 1 x = max PJ x 2, (5.2) δ √s 1 i m i k k ≤ ≤ k k where we abuse notation and let PJi denote orthogonal projection onto the coordi- nate subspace specified by Ji. TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 17

Lemma 7. Let be any semi-norm satisfying x x for all x Rn. Then, k·k ||| |||T ≤k k ∈ 1 γ (T, ) . T ( )(log n)3(log m)1/2 2 k·kδ √s 2 k·k n m E 1/2 max gj δij ej + (log m) d BJi , × 1 i m g k·k  ≤ ≤ j=1  i=1  X [ where T ( ) is the type-2 constant and for 1 i m, 2 k·k ≤ ≤ B = x e : x2 1 . Ji j j j ≤ j J j J n X∈ i X∈ i o Note also that dδ(T ) can be bounded as

n 1/2 1 2 dδ(T )= max sup δij x, ej √s 1 i m h i ≤ ≤ x T j=1 ∈  X  n 1 2 1/2 = max sup E δij gj x, ej √s 1 i m g h i ≤ ≤ x T j=1 ∈  X  n n 1 1 . max sup E x, δij gjej max E δij gj ej . (5.3) √s 1 i m x T g ≤ √s 1 i m g ≤ ≤ ∈ D j=1 E ≤ ≤ j=1 X X Remark 8. Replacing T by a stronger norm T in this lemma may reduce the type-2 constant.||| · ||| The application to k-sparsek · k ≥ vectors ||| · ||| will illustrate this. Proof of Lemma 7. By Eq. (2.2), to bound γ (T, ) it suffices to evaluate the 2 k·kδ covering numbers (T, ,t) for t < d∗(T )/√s, where we set N k·kδ δ

dδ∗(T )= √sdδ(T ) = sup max PJi x 2. x T 1 i m k k ∈ ≤ ≤ For large values of t we use duality for covering numbers. It will be convenient to work with T˜ = conv(T ( T )), which is closed, bounded, convex and symmetric. Clearly, ∪ − (T, ,t) (T,˜ ,t) N k·kδ ≤ N k·kδ and in the notation of Appendix A,

x T = sup x, y = sup x, y = x T˜◦ . ||| ||| y T T h i y T˜ h i k k ∈ ∪{− } ∈

Here T˜◦ denotes the polar (see Appendix B for a definition). Lemma 30, applied ˜ m Rn 1/2 with U = T , V = i=1 x : s− PJi x 2 1 , ε = t and θ = dδ∗(T )/√s, implies ∩ { ∈ k k ≤ } m d (T ) 1 t log (T,˜ ,t) log δ∗ A log conv B , , N k·kδ ≤ t√s N √s Ji ||| · |||T 8 i=1     [   where A . T ( )2 . log m (by (5.2)) and we used that 2 k·kδ 2 n B = x e : x 1 = x R : P x 1 ◦. Ji j j j ≤ { ∈ k Ji k2 ≤ } j J j J n X∈ i X∈ i o Hence, m d (T ) 1 log (T,˜ ,t) . log δ∗ (log m) log conv B , , √st (5.4) N k·kδ t√s N Ji k·k 8 i=1     [   To estimate the right hand side we use the following consequence of Maurey’s lemma (Lemma 31 of Appendix A). 18 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

Claim 9. Fix 0

conv(Ω) conv(Ω˜ k)+ B (0,ǫ) ⊂ k·k ǫ/2<2−k 1 X ≤ with k Ω˜ k = y Ωk Ωk 1 : y < 3 2− (5.8) ∈ − − k k · By Maurey’s lemma, given zn conv(Ω˜ ) and a positive integero ℓ , there are points ∈ k k y ,...,y Ω˜ such that 1 ℓ ∈ k 1 1 k z (y1 + ... + yℓk ) . T2( ) 3 2− (5.9) − ℓk k·k √ℓk · ·

2 k 2 2 1 Taking ℓk T2( ) 4− ǫ− log (1/ǫ), we obtain a point zk Ω′ = (Ω˜ k+...+Ω˜ k) ≃ k·k ∈ k ℓk such that ǫ z z < k − kk 2 log(1/ǫ) Moreover 2 k 2 2 k log Ω′ ℓ log Ω˜ . T ( ) 4− ǫ− log (1/ǫ) log (Ω, , 2− ) | k|≤ k | k| 2 k·k N k·k k Since k : ǫ/2 < 2− 1 log(2/ǫ), we obtain from the preceding, |{ ≤ }| ≤ 2 2 2 k k log (conv(Ω), ,ǫ) . log (1/ǫ)T ( ) ǫ− 4− log (Ω, , 2− ) N k·k 2 k·k N k·k ǫ/2<2−k 1 X ≤ and Eq. (5.5) follows. This proves the claim.  Observe that

dδ∗(T ) = sup max PJi x 2 x T 1 i m k k ∈ ≤ ≤ m m

= sup sup max x, PJi y = d T BJi d BJi x T y B n 1 i mh i |||·||| ≤ k·k ℓ2 ≤ ≤ i=1 i=1 ∈ ∈  [   [  m m Write d := d ( i=1BJi ) for brevity. Applying Claim 9 with Ω = i=1BJi , k·k k·k ∪ ∪ ǫ = t√s, and the dominating semi-norm T to the right hand side of Eq. (5.4), we find k · k ≥ ||| · ||| 1/2 d 2 t√s log (T,˜ ,t) log k·k (log m)1/2T ( ) N k·kδ ≤ t√s 2 k·k · h i   TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 19

m 1/2

max t′ log BJi , ,t′ . (5.10) t√s t′ d N k·k k·k i=1 h ≤ ≤ h  [ i i Using the dual Sudakov inequality (Lemma 29 of Appendix A) we arrive at 1/2 1 d 2 log (T,˜ ,t) log k·k (log m)1/2T ( ) N k·kδ ≤t√s t√s 2 k·k · h i   1/2 max E gjej + d (log m) . (5.11) 1 i m g k·k h ≤ ≤ j Ji i X∈ 1/2 1/2 We now apply Eq. (5.11) for (sn)− d t

5.1. Application to k-sparse vectors. Let x 0 denote the number of non-zero entries of x and set k k S = x Rn : x =1, x k . n,k { ∈ k k2 k k0 ≤ } n n Theorem 10. Let A R × be orthogonal and denote ∈ T = A(Sn,k) Let η 1/m. If ≤ 1/2 max A < (k log n)− (5.12) | ij | and m & k(log n)8(log m)/ε2 7 1 2 s & (log n) (log m)(log η− )/ε , (5.13) then with probability at least 1 η we have − (1 ε) x 2 Φx 2 (1 + ε) x 2 for all x T. − k k2 ≤k k2 ≤ k k2 ∈ E 2p 1/p E 2p Proof. We apply Eq. (2.10) and estimate ( δ γ2 (T, δ)) and δ dδ (T ) for E 2p k·k p log m. We trivially estimate δ dδ (T ) 1/s. To bound the γ2-functional, note≥ that T K, where ≤ ⊂ K = B n √kA(B n ). (5.14) ℓ2 ∩ ℓ1 The polar body of K is given by 1 K◦ = Bℓn + A(Bℓn ) 2 √k ∞ m and one can readily calculate that T2( K ) . √log n and d K ( i=1BJi ) 1. We apply Lemma 7 with = .||| Note · ||| that for every 1 |||·|||i ∪m, ≤ k·k ||| · |||K ≤ ≤ 1/2 E √ E 2 gj δij ek k max gj δij Ajℓ k log n max δij Ajℓ g K ≤ g 1 ℓ n ≤ 1 ℓ n j ≤ ≤ j ≤ ≤  j  X X p X and therefore the lemma implies that γ (T, ) 2 k·kδ 1/2 1 7/2 k 4 1/2 2 (log n) log m + (log n) (log m) max δij Ajℓ . ≤ √s s 1 i m,1 ℓ n r ≤ ≤ ≤ ≤ j  X  20 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

As a consequence, as p log m, ≥ E 2p 1/p ( γ2 (T, δ)) δ k·k p 1/p 1 7 2 k 8 E 2 (log n) (log m) + (log n) (log m) max max δij Ajℓ . ≤ s s 1 i m δ 1 ℓ n ≤ ≤ ≤ ≤ j   X   (5.15)

Since E δij = s/m and A is orthogonal, we obtain using symmetrization and Khint- chine’s inequality, p 1/p E 2 max δij Ajℓ δ 1 ℓ n ≤ ≤ j   X   p 1/p s E E 2 + max rj δij Ajℓ ≤ m r δ 1 ℓ n ≤ ≤ j   X   p/2 1/p s E 4 + √p max δij Ajℓ ≤ m δ 1 ℓ n ≤ ≤ j   X   p 1/(2p) s E 2 + √p max Ajℓ max δij Ajℓ . ≤ m j,ℓ | | δ 1 ℓ n ≤ ≤ j   X   By solving this quadratic inequality, we find p 1/p E 2 s 2 s p max δij Ajℓ . + p max Ajℓ + . δ 1 ℓ n m j,ℓ ≤ m k log n ≤ ≤ j   X   Combine this bound with Eq. (5.15) and Eq. (2.10) to arrive at p 1/p E 2 2 sup Φx 2 x 2 δ,σ x T k k −k k  ∈  1 k p . (log n)7(log m)2 + (log n)8(log m)+ (log n)7(log m) s m s 1 k p 1/2 + (log n)7(log m)2 + (log n)8(log m)+ (log n)7(log m) s m s  p p  + + . s s r Since p log m was arbitrary, the result now follows from Lemma 27.  ≥ 6. Sketching constrained least squares programs Let us now apply the previous results to constrained least squares minimization. n d We refer to Appendix B for any unexplained terminology. Consider A R × and Rm n 2 ∈ 2 a sketching matrix Φ × . Define f(x)= Ax b 2 and g(x)= ΦAx Φb 2. Let Rd be any closed∈ convex set. Let xk be− a minimizerk of thek constrained− k leastC squares ⊂ program ∗ min f(x) subjectto x (6.1) ∈C and letx ˆ be a minimizer of the associated sketched program min g(x) subjectto x . (6.2) ∈C Before giving our analysis, we define two quantities introduced in [PW14]. Given d n 1 R and u S − we set K⊂ ∈ 2 Z1(A, Φ, ) = inf Φv 2 K v A Sn−1 k k ∈ K∩ Z2(A, Φ, ,u) = sup Φu, Φv u, v . K v A Sn−1 |h i − h i| ∈ K∩ TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 21

We denote the tangent cone of at a point x by T (x) (see Appendix B for a definition). The first statement inC the following lemmaC is [PW14, Lemma 1]. For the convenience of the reader we provide a proof, which is in essence the same as in [PW14], with the second case (when x is a global minimizer) being a slight modification of the proof there. In what follows∗ we assume Ax = b, since otherwise f(ˆx)= f(x ) = 0. ∗ 6 ∗ Lemma 11. Define u = (Ax b)/ Ax b 2 and set Z1 = Z1(A, Φ,T (x )) and ∗ − k ∗ − k C ∗ Z2 = Z2(A, Φ,T (x ),u). Then, C ∗ Z 2 f(ˆx) 1+ 2 f(x ). ≤ Z1 ∗   If x is a global minimizer of f, then ∗ 2 Z2 f(ˆx) 1+ 2 f(x ). ≤ Z1 ∗   Proof. Set e =x ˆ x T (x ) and w = b Ax . By optimality of x , for any x , − ∗ ∈ C ∗ − ∗ ∗ ∈C f(x ), x x = Ax b, A(x x ) 0. h∇ ∗ − ∗i h ∗ − − ∗ i≥ In particular, taking x =x ˆ gives w,Ae 0. As a consequence, h i≤ w Φ∗ΦAe w Φ∗ΦAe w Ae , , , Z2. (6.3) w 2 Ae 2 ≤ w 2 Ae 2 − w 2 Ae 2 ≤ Dk k k k E Dk k k k E Dk k k k E By optimality ofx ˆ, for any x , ∈C g(ˆx), x xˆ = ΦAxˆ Φb, ΦA(x xˆ) 0. h∇ − i h − − i≥ Taking x = x yields ∗ 0 ΦAxˆ Φb, ΦAe = Φw, ΦAe + ΦAe 2. ≥ h − i h− i k k2 Therefore, by (6.3), Z Ae 2 ΦAe 2 Φw, ΦAe Z w Ae . 1k k2 ≤k k2 ≤ h i≤ 2k k2k k2 In other words, Z2 Ae 2 w 2 k k ≤ Z1 k k and by Cauchy-Schwarz, 2 2 2 2 2 Z2 Axˆ b 2 = w 2 + Ae 2 2 w,Ae w 2 1+ . k − k k k k k − h i≤k k Z1   If x is a global minimizer of f, then f(x ) = 0 and therefore w,Ae = 0 (a similar∗ observation to [Sar06, Theorem∇ 12]).∗ This yields the secondh assertion.i 

n 1 Clearly, if Φ satisfies (1.2) for T = AT (x ) S − then Z1 1 ε. We do C ∗ ∩ ≥ − not immediately obtain an upper bound for Z2, however, as u is in general not in n 1 AT (x ) S − . We mend this using the observation in Lemma 13. For its proof weC recall∗ ∩ a chaining result from [Dir15]. Let (T,ρ) be a semi-metric space. Recall that a real-valued process (Xt)t T is called subgaussian if for all s,t T , ∈ ∈ P( X X wρ(t,s)) 2 exp( w2) (w 0). | t − s|≥ ≤ − ≥ The following result is a special case of [Dir15, Theorem 3.2].

Lemma 12. If (Xt)t T is subgaussian, then for any t0 T and 1 p< , ∈ ∈ ≤ ∞ 1/p E p sup Xt Xt0 . γ2(T,ρ)+ √pdρ(T ). t T | − |  ∈  22 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

In particular, if T contains n elements, then 1/p E p 1/2 sup Xt Xt0 . (√p + (log n) )dρ(T ). t T | − |  ∈  n Lemma 13. Fix u B n , T R and let Φ be the SJLT. Set ∈ ℓ2 ⊂ Z = sup Φu, Φv u, v . v T |h i − h i| ∈ For any p 1, ≥ E p 1/p p E p 1/p E p 1/p ( Z ) . +1 ( γ2 (T, δ)) + √p( d (T )) . δ,σ s δ k·k δ δ r   Proof. With the notation (2.6) we have Φx = A σ for any x Rn. Since δ,x ∈ Eσ Φ∗Φ= I we find E Z = sup u, (Φ∗Φ I)v = sup σ∗Aδ,u∗ Aδ,vσ (σ∗Aδ,u∗ Aδ,vσ) . v T |h − i| v T | − σ | ∈ ∈ Let σ′ be an independent copy of σ. By decoupling (see (A.5)), E p 1/p E p 1/p ( Z ) 4( sup σ∗Aδ,u∗ Aδ,vσ′ ) . σ ≤ σ,σ′ v T | | ∈ For any v , v T , Hoeffding’s inequality implies that for all w 0 1 2 ∈ ≥ P ′ ( σ∗A∗ A 1 σ′ σ∗A∗ A 2 σ′ σ | δ,u δ,v − δ,u δ,v | w √w σ∗Aδ,u∗ 2 Aδ,v1 Aδ,v2 ℓ2 ℓ2 ) 2e− ≥ k k k − k → ≤ and therefore v σ∗A∗ A σ′ is a subgaussian process. By Lemma 12 (and (2.8)), 7→ δ,u δ,v E p 1/p ( sup σ∗Aδ,u∗ Aδ,vσ′ ) σ′ v T | | ∈ . A σ γ (T, )+ √p A σ d (T ). k δ,u k2 2 k·kδ k δ,u k2 δ Taking the Lp(Ωσ)-norm on both sides and using Lemma 28 of Appendix A we find

p 1/p p 1/p (E Z ) . (E Aδ,uσ ) γ2(T, δ)+ √pdδ(T ) σ σ k k2 k·k   . (√p A + A ) γ (T, )+ √pd (T ) k δ,uk k δ,ukF 2 k·kδ δ p   +1 γ (T, )+ √pd (T ) . ≤ s 2 k·kδ δ r   Now take the Lp(Ωδ)-norm on both sides to obtain the result.  6.1. Unconstrained case. We first consider unconstrained least squares mini- mization, i.e., the constraint set is = Rd. C Theorem 14. Let Col(A) be the column space of A and let

µ(A)= max PCol(A)ej 2 1 j n k k ≤ ≤ be its largest leverage score. Set = Rd and let x and xˆ be minimizers of (6.1) and (6.2), respectively. Let r(A) beC the rank of A.∗ Suppose that 1 1 2 1 2 m & ε− ((log η− )(r(A) + log m)(log m) + (log η− ) r(A)) 1 2 1 2 1 3 s & ε− µ(A) ((log η− ) + (log η− )(log m) ) 1/2 1 3/2 1 3/2 + ε− µ(A)((log η− ) + (log η− )(log m) ) and η 1 . Then, with probability at least 1 η, ≤ m − 2 f(ˆx) (1 ε)− f(x ). (6.4) ≤ − ∗ TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 23

Proof. Set Z1 = Z1(A, Φ,T (x )) and Z2 = Z2(A, Φ,T (x ),u). Observe that T (x )= Rd. Applying TheoremC ∗ 5 for the r(A)-dimensionalC subspace∗ E = Col(A) C ∗ immediately yields that Z1 1 √ε with probability at least 1 η. Setting n 1 ≥ −n 1 − T = AT (x ) S − = Col(A) S − in Lemma 13 and subsequently using (4.2) and (4.3)C yields,∗ ∩ for any p log(∩m), ≥ E p 1/p p E p 1/p E p 1/p ( Z2 ) . +1 ( γ2 (T, δ)) + √p( d (T )) δ,σ s δ k·k δ δ r   p ( r(A) + log1/2(m)) log(m) pr(A) . +1 + s √m r m r h p √p log3/2(m) p + µ(A)+ µ(A) . √s √s 1 i Applying Lemma 27 (and setting w = log(η− ) in (A.1)) shows that Z2 √ε with probability at least 1 η. The second statement in Lemma 11 now implies≤ that with probability at least− 1 2η, − ε 2 f(ˆx) 1+ f(x ) (1 ε)− f(x ). ≤ (1 √ε)2 ∗ ≤ − ∗  −  To see the last inequality, note that it is equivalent to ε +4ε3/2 3ε2 2ε5/2 +2ε3 0. − − − ≤ Clearly this is satisfied for ε small enough (since then ε is the leading term). In fact, one can verify by calculus that this holds for ε 1−/10, which we may assume without loss of generality. ≤  If Φ is a full random sign matrix (i.e., s = m) then it follows from our proof that (6.4) holds with probability at least 1 η if − 1 1 m & ε− (r(A) + log(η− )). This bound on m is new, and was also recently shown in [Woo14]. Previous works 2 1 1 1 [Sar06, KN14] allowed either m & ε− (r(A) + log(η− )) or m & ε− r(A) log(η− ). Theorem 14 substantially improves s while maintaining m up to logarithmic factors.

6.2. ℓ2,1-constrained case. Throughout this section, we set d = bD. For x = (x ,...,x ) Rd consisting of b blocks, each of dimension D, we define its B1 Bb ∈ ℓ2,1-norm by b

x := x b D = x . (6.5) k k2,1 k kℓ1(ℓ2 ) k Bℓ k2 Xℓ=1 The corresponding dual norm is

D x 2, := x ℓb (ℓ ) = max xBℓ 2. k k ∞ k k ∞ 2 1 ℓ b k k ≤ ≤ In this section we study the effect of sketching on the ℓ2,1-constrained least squares program min Ax b 2 subject to x R, k − k2 k k2,1 ≤ d which is Eq. (6.1) for = x R : x 2,1 R . In the statistics literature, this is called the groupC Lasso{ ∈(with non-overlappingk k ≤ } groups of equal size). The ℓ2,1-constraint encourages a block sparse solution, i.e., a solution which has few blocks containing non-zero entries. We refer to e.g. [BvdG11] for more information. In the special case D = 1 the program reduces to min Ax b 2 subject to x R, k − k2 k k1 ≤ which is the well-known Lasso [Tib96]. 24 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

n d To formulate our results we consider two norms on R × , given by

n 1/2 2 A := max Ajk (6.6) ||| ||| 1 ℓ b | | ≤ ≤ j=1 k B  X X∈ ℓ  and

A ℓ2,1 ℓ∞ = max max (Ajk)k B 2. k k → 1 j n 1 ℓ b k ∈ ℓ k ≤ ≤ ≤ ≤ n d d Lemma 15. Let A R × and consider any set R . If p log m, then ∈ K⊂ ≥ E 2p n 1 1/p ( γ2 (A S − , δ)) δ K ∩ k·k 2 1 2 1 2 . α sup x 2,1 (log b + p) A ℓ2,1 ℓ∞ + A . x : Ax 2=1 k k s k k → m||| ||| h ∈K k k i  where α = (log n)6(log m)2(log b)2. Proof. We deduce the assertion from Lemma 7. By H¨older’s inequality,

sup x, y = sup A∗x, v A∗x 2, sup v 2,1. 1 y A Sn− h i v : Av 2=1h i≤k k ∞ v : Av 2=1 k k ∈ K∩ ∈K k k ∈K k k We define x to be the expression on the right hand side. Observe that for any k k n sequence (xk)k 1 in R , ≥ E E gkA∗xk = max gk(A∗xk)Bℓ g 2, g 1 ℓ b 2 k 1 ∞ ≤ ≤ k 1 X≥ X≥ 1/2 1/ 2 2 (log b) max (A∗xk)Bℓ 2 ≤ 1 ℓ b k k ≤ ≤ k 1  X≥  1/2 1/2 2 (log b) A∗xk 2, ≤ k k ∞ k 1  X≥  and therefore T ( ) (log b)1/2. Similarly, 2 k·k ≤ n n E gj δij A∗ej = E max gjδij Ajk g 2, g 1 ℓ b k B 2 j=1 ≤ ≤ j=1 ℓ X ∞  X  ∈ n 1/2 1/ 2 2 (log b) max δij (Ajk)k Bℓ 2 . (6.7) ≤ 1 ℓ b k ∈ k ≤ ≤ j=1  X  Finally, m

d BJi = sup v 2,1 max sup A∗x 2, k·k 2 k k 1 i m x B k k ∞ i=1 v : Av =1 ≤ ≤ Ji  [  ∈K k k ∈ and for any 1 i m and x B , ≤ ≤ ∈ Ji n

A∗x 2, = max xj δij (Ajk)k Bℓ k k ∞ 1 ℓ b ∈ 2 ≤ ≤ j=1 X n n 1/2 2 max xj δij (Ajk)k Bℓ 2 max δij (Ajk)k Bℓ 2 . ≤ 1 ℓ b | | k ∈ k ≤ 1 ℓ b k ∈ k ≤ ≤ j=1 ≤ ≤ j=1 X  X  n 1 We now apply Lemma 7 for T = A S − and subsequently take Lp(Ωδ)-norms to conclude that for any p log m,K ∩ ≥ 2p n 1 1/p (E γ (A S − , )) 2 K ∩ k·kδ TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 25

n p 1/p α 2 E 2 . sup x 2,1 max max δij (Ajk)k Bℓ 2 . s k k 1 i m δ 1 ℓ b k ∈ k x : Ax 2=1 ≤ ≤ ≤ ≤ j=1 h ∈K k k i   X   (6.8) E s n Since δij = m and (δij )j=1 is independent for any fixed i, we obtain by sym- metrization (see Eq. (A.4)) for (rj ) a vector of independent Rademachers

n p 1/p E 2 max δij (Ajk)k Bℓ 2 δ 1 ℓ b k ∈ k ≤ ≤ j=1   X   n n p 1/p s 2 E E 2 max (Ajk)k Bℓ 2 +2 max rj δij (Ajk)k Bℓ 2 . ≤ m 1 ℓ b k ∈ k δ r 1 ℓ b k ∈ k ≤ ≤ j=1 ≤ ≤ j=1 X   X   By Khintchine’s inequality,

n p 1/p E E 2 max rj δij (Ajk)k Bℓ 2 δ r 1 ℓ b k ∈ k ≤ ≤ j=1   X   n p/2 1/p 1/2 1/2 E 4 . ((log b) + p ) max δij (Ajk)k Bℓ 2 δ 1 ℓ b k ∈ k ≤ ≤ j=1   X   n p 1/(2p) 1/2 1/2 E 2 ((log b) + p ) max δij (Ajk)k Bℓ 2 ≤ δ 1 ℓ b k ∈ k ≤ ≤ j=1   X   max max (Ajk)k B 2. × 1 ℓ b 1 j n k ∈ ℓ k ≤ ≤ ≤ ≤ Putting the last two displays together we find a quadratic inequality. By solving it, we arrive at

n p 1/p E 2 max δij (Ajk)k Bℓ 2 δ 1 ℓ b k ∈ k ≤ ≤ j=1   X   s 2 2 . max (Ajk)k Bℓ 2 + (log b + p) max max (Ajk)k Bℓ 2. (6.9) m 1 ℓ b k ∈ k 1 ℓ b 1 j n k ∈ k ≤ ≤ j ≤ ≤ ≤ ≤ X Combining this estimate with (6.8) yields the assertion. 

Theorem 16. Let Rd be a closed convex set and let α = (log n)6(log m)2(log b)2. Let x and xˆ be minimizersC ⊂ of (6.1) and (6.2), respectively. Set ∗ d2,1 = sup x 2,1. x T (x ) : Ax 2 1=1 k k ∈ C ∗ k k , Suppose that 2 1 1 2 2 m & ε− log(η− )(α + log(η− ) log b) A d , ||| ||| 2,1 2 1 1 1 2 2 s & ε− log(η− )(log b + log(η− ))(α + log(η− ) log b) A ℓ2 1 ℓ d2,1, k k , → ∞ and η 1 . Then, with probability at least 1 η, ≤ m − 2 f(ˆx) (1 ε)− f(x ). ≤ − ∗ n 1 Proof. Set T = AT (x ) S − , Z1 = Z1(A, Φ,T (x )) and Z2 = Z2(A, Φ,T (x ),u). We apply Eq. (5.3),C Eq.∗ ∩ (6.7) and Eq. (6.9) to findC for∗ any p log m, C ∗ ≥ E 2p 1/p ( dδ (T )) n 2p 1/p 1 2 sup x E max E gj δij A∗ej 2,1 g ≤ s x T (x ) : Ax 2 1=1 k k δ 1 i m 2, h ∈ C ∗ k k , i  ≤ ≤ j=1 ∞  X

26 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

n p 1/p 1 2 E 2 (log b) sup x 2,1 max δij (Ajk)k Bℓ 2 ≤ s k k δ 1 ℓ b k ∈ k x TC (x∗) : Ax 2,1=1 ≤ ≤ j=1 h ∈ k k i  X   2 1 2 1 2 (log b) sup x 2,1 A + (log b + p) A ℓ2,1 ℓ∞ . ≤ x T (x ) : Ax 2 1=1 k k m||| ||| s k k → ∈ C ∗ k k , h i (6.10) By Eq. (2.10), Lemma 15 and Eq. (6.10) p 1/p E 2 2 sup Φx 2 x 2 δ,σ x T k k −k k  ∈  E 2p 1/p E p 1/p E p 1/p E 2p 1/p . ( γ2 (T, δ)) + ( γ2 (T, δ)) + √p( d (T )) + p( d (T )) δ k·k δ k·k δ δ αE αF αE αF 1/2 . (log b + p)+ + (log b + p)+ s m s m Ep log b  Fp log b Ep log b Fp log b 1/2 + (log b + p)+ + (log b + p)+ , s m s m where we write   2 2 2 2 E = A ℓ2 1 ℓ d2,1, F = A d2,1. k k , → ∞ ||| ||| Similarly, using Lemma 13 (with Z(u) := Z2(A, Φ,T (x ),u)) we find C ∗ E p 1/p p E p 1/p E p 1/p ( Z2 ) . +1 ( γ2 (T, δ)) + √p( d (T )) δ,σ s δ k·k δ δ r   p αE αF 1/2 . +1 (log b + p)+ s s m r  h Ep log b Fp log b 1/2 + (log b + p)+ . s m 1   i Now apply Lemma 27 (with w = log(η− )) to conclude that with probability at least 1 η we have Z2 ε and Z1 1 ε under the assumptions on m and s. Lemma− 11 now implies that≤ ≥ − 2 ε 2 f(ˆx) 1+ f(x )=(1 ε)− f(x ). ≤ 1 ε ∗ − ∗ −    Corollary 17. Set = x Rd : x R . C { ∈ k k2,1 ≤ } Suppose that x is k-block sparse and x 2,1 = R. Define ∗ k ∗k σmin,k = inf Ay 2. y 2=1, y 2 1 2√k k k k k k k , ≤ Set α = (log n)6(log m)2(log b)2. Assume that 2 1 1 2 2 m & ε− log(η− )(α + log(η− ) log b) A kσ− , ||| ||| min,k 2 1 1 1 2 2 s & ε− log(η− )(log(η− ) + log b)(α + log(η− ) log b) A ℓ2,1 ℓ∞ kσmin− ,k, k k → and η 1 . Then with probability at least 1 η ≤ m − 2 f(ˆx) (1 ε)− f(x∗). ≤ − Proof. As calculated in Example 33 of Appendix B, every x T (x ) satisfies x 2 4k x 2. If moreover Ax = 1, then ∈ C ∗ k k2,1 ≤ k k2 k k2 2 2 2 σ x − 4k x − . min,k ≤k k2 ≤ k k2,1 In other words, 2 2 sup x 2,1 4kσmin− ,k. x T (x ) : Ax 2=1 k k ≤ ∈ C ∗ k k TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 27



For a dense random sign matrix Φ, it follows from [PW14, Theorem 1] that the result in Corollary 17 holds under the condition 2 2 2 1 m & ε− ( A kσ− + log(η− )). ||| ||| min,k Thus, our result shows that one can substantially increase the sparsity in the matrix Φ, while maintaining the embedding dimension m (up to logarithmic factors). Let us now specialize our results to the case D = 1, which corresponds to the Lasso. In this case, if we let Ak denote the columns of A,

A = max Ak 2, ||| ||| 1 k d k k ≤ ≤ the maximum column of A. We can alternatively interpret this as the norm

A ℓ1 ℓ2 . Moreover, the norm A ℓ2 1 ℓ reduces to k k → k k , → ∞

A ℓ1 ℓ∞ = max max Ajk , k k → 1 j n 1 k d | | ≤ ≤ ≤ ≤ the largest entry of the matrix. The first norm can be significantly larger than the second. For example, if A is filled with 1 entries, then ± A ℓ1 ℓ2 = √n, A ℓ1 ℓ =1. k k → k k → ∞ A similar result for the fast J-L transform, but with a worse dependence of m on the matrix A, was obtained in [PW14]. For completeness, we will show in Appendix C that the fast J-L transform in fact satisfies similar results as Theorem 16 and Corollary 17, with the essentially the same dependence of m on A.

7. Proof of the main theorem After having seen instantiations of various subsets of our ideas for specific appli- cations (linear subspaces, T of small type-2 constant, and closed convex cones), we now prove the main theorem|||· ||| of this work, Theorem 3. Our starting point is the observation that inequality Eq. (5.4), i.e., log (T,˜ ,t) N k·kδ d (T ) m 1 . log δ (log m) log conv B , , √st (7.1) t N Ji ||| · |||T 8 i=1     [   holds in full generality. We will need the following replacement of Claim 9. Lemma 18. Let ǫ> 0. Then m log conv B , ,ǫ N Ji ||| · |||T i=1   [   1 1 . log m + (log 1/ǫ) max max log B , ,ǫ (7.2) 2 1 Ji T ǫ k. 2 A =k N k ||| · ||| ǫ | | i A  X∈ 

Proof. Set ρT (x, y)= x y T and let ρℓ2 (x, y)= x y 2. By Maurey’s lemma ||| − ||| k − k 2 (Lemma 31 for 2), given x conv(BJi :1 i m), there is a k . 1/ǫ and a set A 1,...,mk·k, A = k, such∈ that ≤ ≤ ⊂{ } | | ρ (x, conv(B ; i A)) ρ (x, conv(B ; i A)) ǫ (7.3) T Ji ∈ ≤ ℓ2 Ji ∈ ≤ Now, let us take an element y conv(B ; i A), A . 1/ǫ2. Thus ∈ Ji ∈ | | k y = λiyi i=1 X 28 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

2 with yi BJi , λi 0, i λi = 1, k . 1/ǫ . Firstly, we may dismiss the coefficients 3 ∈ ≥ 3 λi <ǫ . Indeed, let S = i : λi <ǫ and sety ˆ = i S λiyi. Then, P{ } ∈ yˆ λ kǫ3 .Pǫ. k k2 ≤ i ≤ i S X∈ 3 Consider now the λi with ǫ λi 1. For ℓ =0, 1,...,ℓ , ℓ log(1/ǫ), denote ≤ ≤ ∗ ∗ ≃ 1 1 A = 1 i k : λ > ℓ ≤ ≤ 2ℓ ≥ i 2ℓ+1 and write n o ℓ∗ ℓ 1 y = 2− Aℓ yi′ +ˆy (7.4) | | · Aℓ ℓ=0 i A X | | X∈ ℓ  ℓ where y′ = λ 2 y B . Note that i i i ∈ Ji ℓ∗ ℓ 2− A < 2 λ 2 (7.5) | ℓ| i ≤ i Xℓ=0 X and 1 1 yi′ BJi Aℓ ∈ kℓ i A i A | | X∈ ℓ X∈ ℓ 2 1 where kℓ = Aℓ . 1/ǫ . Take finite sets ξℓ k i A BJi such that | | ⊂ ℓ ∈ ℓ P1 ρT (z, ξℓ) <ǫ for all z BJi (7.6) ∈ kℓ i A X∈ ℓ 1 ξℓ = BJi , T ,ǫ (7.7) | | N kℓ ||| · ||| i A  X∈ ℓ  Let zℓ ξℓ satisfy ∈ 1 yi′ zℓ <ǫ Aℓ − T | | i Aℓ X∈ and set ℓ∗ ℓ z = 2− A z . (7.8) | ℓ| ℓ Xℓ=0 By (7.4), (7.5)

ℓ∗ ℓ y z . 2− A ǫ + ǫ . ǫ ||| − |||T | ℓ| Xℓ=0 To summarize, we can find an ǫ-net for conv( 1 i mBJ ) with respect to T ∪ ≤ ≤ i ||| · ||| as follows. For every A [m] with A = k, k . 1/ǫ2, we select ξA,...,ξA as 1 ℓ∗ above. Then, ⊂ | | ℓ∗ ℓ A 2− A ξ | ℓ| ℓ k [m] : k.1/ǫ2 A [m], A =k ℓ=0 ∈ [ ⊂ [| | X is an ǫ-net of cardinality at most ℓ 1 m ∗ 1 max m B , ,ǫ 2 2 Ji T k.1/ǫ ǫ k N kℓ ||| · |||   ℓ=0 i Aℓ Y  X∈  1 em k 1 log(1/ǫ) . max m max max BJi , T ,ǫ . k.1/ǫ2 ǫ2 k k 1/ǫ2 A =k N k ||| · ||| ≤ | | i A   h  X∈ i This yields the result.  TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 29

2 Next, we analyze further the set (1/k) i A BJi for some k . 1/ǫ (ǫ > 0 will ∈ be fixed later). The elements of (1/k) i A BJi are of the form ∈P n 1 P y = δ x(i) e k ij j j j=1 i A X  X∈  with x(i) 2 1, for all i. | j | ≤ j J X∈ i Therefore, by Cauchy-Schwarz, n 1 2 1/2 y = δ x(i) k k2 k ij j j=1 i A  X X∈  n 1 1/2 δ x(i) 2 ≤ k ij | j | j=1 i A i A h X  X∈  X∈ i Define for α =1,..., log(min k,s ) the set { } U = U (δ)= 1 j n :2α δ < 2α+1 (7.9) α α ≤ ≤ ≤ ij i A n X∈ o and define U = U (δ)= 1 j n : δ < 2 . 0 0 ≤ ≤ ij i A n X∈ o Estimate for fixed j 1, if 2α 2esk def P α α+1 ≤ m τk,α = 2 δij < 2 α sk 2α α 2esk (7.10) δ ≤ ≤ min 2− , 2− , if 2 > , i A ( m m  X∈  where the first term in the min is a consequencen of Markov’so inequality, and the second term follows by using that k τ P ζ 2α , k,α ≤ j ≥ j=1  X  whenever ζ1,...,ζk are i.i.d. Bernoulli random variables with mean s/m (see (7.19) for bounding Bernoulli moments). Hence, Uα is a random set of intensity τk,α, i.e., P (j U (δ)) = τ . Write according to the preceding δ ∈ α k,α y = yα, with yα = yj ej α j U X X∈ α and 1 α/2 yα 2 . 2 k k √k 2 Hence, denoting BUα := j U xj ej : j U xj 1 , { ∈ α ∈ α | | ≤ } 1 1 P B P 2α/2B . k Ji ⊂ √ Uα i A α k X∈ X Therefore, we deduce that 1 1 ǫ log B , ,ǫ . log 2α/2B , , N k Ji ||| · |||T N √ Uα ||| · |||T log m i A α k  X∈  X   k ǫ = log B , , (7.11) N Uα ||| · |||T 2α log m α r X   30 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

By the dual Sudakov inequality (Lemma 29 of Appendix A), 1/2 E t log (BUα , T ,t) . gj ej N ||| · ||| g T h i j Uα X∈ It follows that 1/2 α 1/2 1 2 log m E log BJi , T ,ǫ . gj ej N k ||| · ||| k ǫ g T h  i A i α     j Uα X∈ X X∈ and, by Lemma 18

m 1/2 log conv B , ,ǫ N Ji ||| · |||T i=1 h   [  i 1/2 α 1 1/2 log m 1 2 E (log m) + log max1 gj ej (7.12) ≤ ǫ ǫ ǫ k. 2 k g T ǫ α r j Uα   A =k h X X∈ i | |

Applying Eq. (7.12) to Eq. (7.1), taking ǫ √st, gives ≃ 1 1 1/2 t log1/2 (T,˜ ,t) . log log m N k·kδ √s √st 1  1 + (log m)3/2 log √s √st ·   2α max E gj ej . A =k k g T | | 1 α r j Uα k. 2 n ∈ o ǫ X X

Using this bound for large t and the elementary bound in Eq. (A.6) for small t in Eq. (2.2), we obtain 1 γ (T, ) . (log n)3/2 log m 2 k·kδ √s 1 + (log m)3/2(log n)2 √s · 2α max max E gj ej (7.13) k m A =k k g T ≤ | | α,2α k r j U n α o X≤ X∈ We split the sum over α into three different parts. Firstly, n 2α s max max E gj ej . E gj ej k m A =k g T g T α k m ≤ | | α,2 2esk/m r j Uα r j=1 ≤X X∈ X s . γ (T, ). (7.14) m 2 k·k2 r Next, by setting 2esk U = j [n]: δ , A ∈ m ≤ ij i A n X∈ o we obtain 2α max max E gjej k m A =k g T α k ≤ | | α,2esk/m<2 10 log n r j Uα X ≤ X∈ 1/2 1 . (log n) max max E gjej k m A =k √k g T ≤ | | j UA X∈

TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 31

1/2 1 (log n) max max E gj ej ≤ m/s k m A =k √k g T h ≤ ≤ | | j UA X∈ 1 + max max E gj ej . k m/s A =k √k g T ≤ | | j UA i X∈ Note that the first term on the right hand side is bounded by 1 s max max E gj ej . γ2(T, 2). (7.15) m/s k m A =k √k g T m k·k ≤ ≤ | | j UA r X∈ To bound the second term, we take expectations with respect to δ and find for m pk = k log sk   1 E max max E gjej δ k m/s A =k √k g T h ≤ | | j UA i X∈ 1 pk 1/pk . max max E E gjej . (7.16) k m/s A =k √k δ g T ≤ | |   j UA   X∈ By Eq. (7.10), for any A [m] with A = k, U is a random set of intensity at most ⊂ | | A esk/m = sq/(m log s), where we set q = ek(log s). If we now let η1,...,ηn be i.i.d. 0, 1 -valued random variables with expectation qs/(m log s), then the random set U{ defined} by P(j U)= P(η = 1) has a higher intensity than U and therefore, ∈ j A 1 E max max E gjej δ k m/s A =k √ g T ≤ | | k j U h X∈ A i n q 1/q 1/2 1 . (log s) max E E ηj gj ej . (7.17) m η g q s log s √q T ≤   j=1   X Finally, consider the α with 2α > 10 log n. Since , ||| · |||T ≤k·k2 2α max max E gj ej k m A =k g T α k ≤ | | α,10 log n<2 k r j Uα X ≤ X∈ α 2 1/2 max max Uα . (7.18) ≤ k m A =k k | | ≤ | | α,10 log n<2α k r X ≤ 2α By Eq. (7.10), the intensity of Uα is bounded by 2− and therefore, 2α E max max E gjej δ k m A =k g T α k ≤ | | α,10 log n<2 k r j Uα h X ≤ X∈ i n 2 α qk 1/(2qk ) max max E ζj , ≤ k m A =k k ≤ | | α,10 log n<2α k r  j=1  X ≤ X 2α where qk k log(m/k) and the ζj are i.i.d. 0, 1 -valued with mean 2− . Since ≃ { } α j ζj is binomially distributed, we find using that 2 > 10 log n, n P 9 α 1/qk qk 10 2 t 1/qk α α ζj t e− . n 2− qk . 2− k(log m) (7.19) qk Lζ ≤ j  t=1  X X and therefore α 2 3/2 E max max E gj ej . (log m) . (7.20) δ k m A =k g T α k h ≤ | | α,10 log n<2 k r j Uα i X ≤ X∈

32 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

By combining Eq. (7.13), Eq. (7.14), Eq. (7.15), Eq. (7.17), and Eq. (7.20) we obtain

E γ (T, ) 2 k·kδ (log m)3/2(log n)5/2γ (T, ) 1 . 2 k·k2 + (log m)3(log n)2+ √m √s(

n q 1/q 2 5/2 1 (log m) (log n) max E E ηj gjej , q m log s η g s √q T ) ≤   j=1   X (7.21) with ηj i.i.d. Bernoulli with mean qs/(m log s). E 2 From our proof it is clear that γ2 (T, δ) can be bounded by the square of the right hand side of Eq. (7.21). Using Eq.k·k (2.10) (for p = 1), we conclude that

E 2 sup Φx 2 1 <ε x T k k − ∈ provided that m,s satisfy

γ2(T, ) m & (log m)3(log n)5 2 k·k2 (7.22) ε2 1 s & (log m)6(log n)4 (7.23) ε2 and

n 1 q 1/q (7.24) = (log m)2(log n)5/2 max E E η g e < ε. m j j j q s log s √qs η g T ≤ n   j=1   o X (7.24) This completes the proof.

8. Example applications of main theorem In this section we use our main theorem to give explicit conditions under which

E 2 sup Φx 2 1 <ε x T k k − ∈ n 1 for several interesting sets T S − . This amounts to computing an upper bound for the parameters γ (T, ⊂) and 2 k·k2 n 1 q 1/q κ(T )= max E E η g e . (8.1) m j j j q s log s √qs η g T ≤ n   j=1   o X

We focus on the latter two and refer to [Dir14] for details on how to estimate 1/2 γ2(T, 2). Note, however, that γ2(T, 2) . (m log s) κ(T ). Indeed, take mk·k k·k q = s log s in Eq. (8.1) and note that the ηj are identically equal to 1 in this case. This gives

1/2 1/2 κ(T ) (m log s)− g(T ) (m log s)− γ (T, ). ≥ ≃ 2 k·k2 Thus, if we ignore logarithmic factors, it suffices to bound κ(T ). TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 33

8.1. Linear subspace. In the application from Section 4 with T the unit ball of a d-dimensional linear subspace E of ℓn, we have γ (T, ) √d and 2 2 k·k2 ≃ 1 1 1/2 (7.24) . (log m)2(log n)5/2 max P e 2η (8.2) m E j 2 j q √s q s log s √q k k Lη ≤ j X with η 0, 1 i.i.d. of mean qs/(m log s). Using Khintchine’s inequality, j ∈{ } 2 dqs 2 ηj PEej 2 q < + q max PEej 2 (8.3) k k Lη m log s j k k j X and therefore

2 5/2 d 1 (7.24) . (log m) (log n) + max PEej 2 (8.4) m √s j k k r  Eq. (7.22) and Eq. (7.23) then give conditions m & d(log n)5(log m)4/ε2 (8.5) 4 6 2 5 4 2 2 s & (log n) (log m) /ε + (log n) (log m) max PEej 2/ε (8.6) j k k 8.2. k-sparse vectors. Consider next the application from Section 5.1, replacing T by K given by Eq. (5.14). Thus γ2(K) . √k log n and Eq. (7.24) is bounded by 1 1 1/2 √k(log m)2(log n)3 max max η A 2 (8.7) m j ij q √s q s log s √q i | | Lη ≤ n  j  o X and qs 2 . 2 max( ηj Aij ) q + q max Aij (8.8) i | | Lη m log s i,j | | j X Hence

2 3 k 2 3 k (7.24) . (log m) (log n) + (log m) (log n) max Aij (8.9) rm r s i,j | | We then arrive at the conditions m & k(log n)6(log m)4/ε2 (8.10) s & (log n)4(log m)6/ε2 (8.11) provided that 1/2 1 max Aij < k− (log n)− (8.12) i,j | | (compare with Eq. (5.13)-(5.12)).

n 1 8.3. Flat vectors. Let T S − be finite with x α for all x T . Then by a similar calculation as in (4.10),⊆ k k∞ ≤ ∈

n q 1/q n 1/2 E E . 2 sup ηigixi log T sup ηixi q (8.13) η g x T | | · x T Lη   ∈ i=1   ∈ i=1 X p X qs log T . | | + (α log T )√q (8.14) r m | | Since γ (T, ) . log T we find the conditions 2 k·k2 | | m & (logpT )(log m)4(log n)5/ε2 (8.15) | | s & (α log T )2(log m)4(log n)5/ε2 + (log m)6(log n)4/ε2, (8.16) | | which is qualitatively similar to [Mat08, Theorem 4.1]. 34 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

8.4. Finite collection of subspaces. Let Θ be a finite collection of d-dimensional subspaces E Rn with Θ = N. Define ⊂ | | T = x E : x =1 . { ∈ k k2 } E Θ [∈ In this case, γ2(T, 2) . √d + √log N. For the duration of the next two sections, we define k·k α = sup max PEej 2. E Θ j k k ∈ Recall that maxj PEej 2 is referred to as the incoherence of E, and thus d/n α 1 (cf. Remarkk 6) isk the maximum incoherence in Θ. ≤ ≤ p 8.4.1. Collection of incoherent subspaces. To estimate κ(T ), consider the collection of operators A = (P e ) e , E Θ. Fix η and define R x = η x e . A j E j ⊗ j ∈ η j j j j Then applying [KMR14, Theorem 3.5] to R , P A η P max η g P e = max AR g j j E j 1 η 1 E Θ k k Lg A k k Lg ∈ j ∈A X . d F ( Rη)+ γ2( Rη, ). (8.17) A A k·k Clearly 1/2 A 1, AR = η P e 2 . k k≤ k ηkF j k E j k2 j  X  By [RV07] (see the formulation in [Tro08, Proposition 7]), 1/p 1/p p p E ARη 3√p E ARη + √̺ A , (8.18) η k k ≤ k k1,2 ·k k     where E η = ̺ and we assume p 2 log n. Here is the ℓn ℓ2 norm, hence j ≥ k·k1,2 1 → n ARη 1,2 = max ARηek 2 = max ηk PEek 2 α. k k 1 k n k k 1 k n k k ≤ ≤ ≤ ≤ ≤ qs Taking ̺ = m log s , it follows that 1/p 1/2 p qs E ARη . α√p + (8.19) η k k m log s     To estimate κ(T ), we need to bound 1/2 q 2 (8.17) Lη max ηj PEej 2 q (8.20) k k ≤ E k k Lη  j  X √(8.20)

+ |γ ( R ) q{z } (8.21) k 2 A η kLη (8.21)

First, by Eq. (8.3) and denoting q1| = q +{z log N},

2 dqs 2 (8.20) . ̺d + q1 max PEej 2 = + (q + log N)α (8.22) j k k m log s Estimate trivially γ2( Rη ) log N max ARη A ≤ · A k k ∈A and hence, applying Eq. (8.19) with p = q + 2 log n + log N 1/p p (8.21) . log N max E ARη · A k k p   qs 1/2 . log N (q + log n + log N)α2 + (8.23) · m log s p n o TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 35

Collecting our estimates we find

1 dqs 1/2 κ(T ) . max + α√q q m log s s √qs( m log s ≤   qs 1/2 + log N (√q + log n + log N)α + · m log s ) p h p p   i d + log N 1/2 (√log n√log N + log N)α . + (8.24) m log s √s   Hence in this application (assuming log N log n), ≥ d + log N 1/2 α log N (7.24) < (log m)2(log n)5/2 + (8.25) m √s n  o and the conditions on m,s become

4 5 2 m & (log m) (log n) (d + log N)ε− (8.26) 6 4 2 4 5 2 s & ((log m) (log n) + (α log N) (log m) (log n) )ε− (8.27)

Notice that s depends only on Θ and not on the dimension d. Thus this bound is of interest when log Θ is small| compared| with d. | | 8.4.2. Collection of coherent subspaces. We can also obtain a bound on s that does not improve for small α, but has linear dependence on log N. Here we will not rely on [RV07]. As described in Section 1, this setting has applications to model-based compressed sensing. For example, for approximate recovery of block-sparse signals using the notation of Section 1, our bounds will show that a measurement matrix Φ may have m < kb + k log(n/k), s < k log(n/k) and allow for efficient recovery. This is non-trivial if∗ the number of blocks∗ satisfies b > log(n/k). One may indeed trivially bound ∗

γ ( R ) . log N 2 A η since certainly AR A 1. Hence p k ηk≤k k≤ (8.21) . log N leading to the following bound on κ(T ) p

1 dqs d log N max + (√q + log N)α + log N . + q m log s qs m log s m s ≤ s √ s r r n p p o Instead of (8.26), (8.27), one may impose the conditions d log N m & (log m)4(log n)5 + (log m)3(log n)5 (8.28) ε2 ε2 1 log N s & (log m)6(log n)4 + (log m)4(log n)5 (8.29) ε2 ε2 We remark that previous work [NN13a] which achieved m d/ε2 for small s had worse dependence on log N: in particular s & (log N)3,m≈& (log N)6. In fact, Conjecture 14 of [NN13a] if true would imply that the correct dependence on N in both m and s should be log N (which is optimal due to known lower bounds for the Johnson-Lindenstrauss lemma, i.e., the special case d = 1). We thus have shown that this implication is indeed true. 36 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

8.5. Possibly infinite collection of subspaces. Assume next Θ is an arbitrary family of linear d-dimensional subspaces E Rn, equipped with the Finsler metric ⊂ ρ (E, E′)= P P ′ . Fin k E − Ek Let T = x E : x =1 { ∈ k k2 } E Θ [∈ for which (cf. [Dir14]) γ (T, ) . √d + γ (Θ,ρ ). (8.30) 2 k·k2 2 Fin Fix some parameter ε > 0 and let Θ Θ be a finite subset such that 0 1 ⊂ Θ (Θ,ρ ,ε ) (8.31) | 1| ≤ N Fin 0 ρ (E, Θ ) ε for any E Θ (8.32) Fin 1 ≤ 0 ∈ Let further T = x E′ : x =1 1 { ∈ k k2 } E′ Θ1 [∈ Let x T, x E, E Θ and ρ (E, E′) ε for some E′ Θ . Hence ∈ ∈ ∈ Fin ≤ 0 ∈ 1 x = P ′ x + (P x P ′ x)= x + x (8.33) E E − E 1 2 where x T and x B + B ′ , x ε . Hence x T with 1 ∈ 1 2 ∈ E E k 2k2 ≤ 0 2 ∈ 2 T = x B + B : x ε 2 { ∈ E F k k2 ≤ 0} E,F Θ [∈ For t<ε , we estimate (T , ,t). Let Θ Θ satisfy 0 N 2 k·k2 t ⊂ t Θ Θ,ρ , (8.34) | t| ≤ N Fin 4  t  ρ (E, Θ ) for all E Θ (8.35) Fin t ≤ 4 ∈

By Eq. (A.6), for each E′ Θ we can find ξ ′ B ′ such that ∈ t E ⊂ E 1 log ξ ′ . d log (8.36) | E | t ρ (x, ξ ′ ) t for all x B ′ (8.37) ℓ2 E ≤ ∈ E Denote ξ = (ξ ′ ξ ′ ) t E − F E′,F ′ Θ [∈ t for which by construction t 1 log ξ . log Θ,ρ , + d log | t| N Fin 4 t   Also, for x T , x = y + z B + B and E′, F ′ Θ satisfying ∈ 2 ∈ E F ∈ t t t ρ (E, E′) ,ρ (F, F ′) , Fin ≤ 4 Fin ≤ 4 we have t x (P ′ y + P ′ z) , k − E F k2 ≤ 2 while t t ρ (P ′ y, ξ ′ ) , ρ (P ′ z, ξ ′ ) ℓ2 E E ≤ 4 ℓ2 F F ≤ 4 Therefore ρ (x, ξ ) t ℓ2 t ≤ TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 37 and we get for t<ε0 (otherwise (8.38) = 0) t 1 log (T , ,t) . log Θ,ρ , + d log (8.38) N 2 k·k2 N Fin 4 t Using the decomposition Eq. (8.33) and the bound Eq. (8.25) for the contribution of T1, we find d + log (Θ,ρ ,ε ) 1/2 α log (Θ,ρ ,ε ) κ(T ) . N Fin 0 + N Fin 0 (8.39) m √s n  o 1 n q 1/q + max E E sup ηj gjxj (8.40) m η g q s log s √qs x T2 ≤ n   ∈ j=1   o X with (η ) 0, 1 i.i.d. of mean qs . j ∈{ } m log s Estimate by the contraction principle [Kah68], the Gaussian concentration for Lipschitz functions (Eq. (4.1)), and Dudley’s inequality Eq. (2.2)[Dud67] 1 n 1 n sup η g x sup g x j j j 1 q j j q q 2 L ,Lη ≤ q 2 Lg √ x T j=1 g √ x T j=1 ∈ X ∈ X n

. sup gjxj L1 x T2 j=1 g ∈ X ε0 . [log (T , ,t)]1/2dt N 2 k·k2 Z0 ε0 1 . [log (Θ,ρ ,t)]1/2dt + √dε log , N Fin 0 ε 0 r 0 Z (8.41) where in the final step we used Eq. (8.38). Summarizing, the conditions on m and s are as follows (for any ε0 > 0) 2 3 5 2 m & ε− (log m) (log n) γ2 (Θ,ρFin) 2 4 5 + ε− (log m) (log n) [d + log (Θ,ρ ,ε )] (8.42) N Fin 0 2 6 4 2 4 5 2 s & ε− (log m) (log n) + ε− (log m) (log n) [α log (Θ,ρ ,ε )] N Fin 0 2 4 5 2 1 + ε− (log m) (log n) ε0 log d ε0  ε0  2 2 4 5 1/2 + ε− (log m) (log n) [log (Θ,ρFin,t)] dt (8.43) 0 N h Z i If Θ = N < , then log (Θ,ρFin,t) log N and (8.42), (8.43) turn into (8.26),| (8.27)| for ε∞ 0. N ≤ 0 → 8.6. Manifolds. Let Rn be a d-dimensional manifold obtained as the image M⊂ n = F (B d ), for a smooth map F : B d R . More precisely, we assume that M ℓ2 ℓ2 → F (x) F (y) x y and the Gauss map, which sends x to E , the k − k2 ≃ k − k2 ∈ M x tangent plane at x, is Lipschitz from the geodesic distance ρ to ρFin. Following [Dir14], we want to ensure that the sparse matrix Φ satisfies M (1 ε) γ Φ(γ) (1 + ε) γ (8.44) − | | ≤ | |≤ | | for any C1-curve γ , where denotes curve length. Note that Eq. (8.44) is equivalent to requiring⊂ M | · | 1 ε Φ(v) 1+ ε (8.45) − ≤k k2 ≤ for any tangent vector v of at a point x . Denote by M ∈M Θ= E : x { x ∈ M} 38 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON the tangent bundle of , to which we apply the estimates on m,s obtained above. By assumption, M 1 1 ρFin(Ex, Ey) . ρ (x, y) F − (x) F − (y) 2, M ≃k − k so that for 0 t 1/2 by Eq. (A.6), ≤ ≤ 1 log (Θ,ρ ,t) . log ( , ,ct) log (B d , ,t) . d log . N Fin N M k·k2 ≃ N ℓ2 k·k2 t In this application

α = max max max v,ej (8.46) x v Ex 1 j n | h i | ∈M v∈2=1 ≤ ≤ k k We assume 1 α (8.47) ≪ √d to make the below of interest (otherwise apply the result of [Dir14]). Taking ε0 = α√d in (8.42), (8.43), it follows that Eq. (8.44) may be ensured under parameter conditions

2 4 5 1 m & ε− (log m) (log n) d log (8.48) · α√d 2 6 4 2 4 7 2 s & ε− (log m) (log n) + ε− (log m) (log n) (αd) (8.49)

Thus for α = o(1/√d), the condition on s becomes non-trivial. Recall from Re- mark 6 that α d/n and therefore the log(1/(α√d)) term in Eq. (8.48) is at most log n. ≥ p d n Remark 19. Returning to the assumptions on F , consider DF : B d (R , R ), ℓ2 → L the space of linear operators from Rd to Rn. The first statement means that uniformly for x B d , ∈ ℓ2 1 d c− ξ DF (x)ξ c ξ for ξ R (8.50) k k2 ≤k k2 ≤ k k2 ∈ The second statement follows from requiring

DF (x) DF (y) 2 2 c x y 2 for x, y Bℓd (8.51) k − k → ≤ k − k ∈ 2 d d Indeed, since Ex = DF (x)(R ), Ey = DF (y)(R ), it follows from Eq. (8.51)

n inf ρℓ2 (u, Ey) . x y 2 u Ex k − k u∈2=1 k k implying

ρFin(Ex, Ey)= PE PE 2 2 . x y 2. k x − y k → k − k Remark 20. If α 1/√d, then necessarily s d in Eq. (8.49) (up to polylog- arithmic factors). Thus≥ the manifold case is very≫ different from the case of linear subspaces. We sketch a construction to demonstrate this. Let n > d10. Denote by 0 ϕ 1 a smooth bump function on R such that ≤ ≤ 1 1 ϕ(t)= t for t , supp(ϕ) [0, 1] (8.52) 4 ≤ ≤ 2 ⊂ d By the lower bound in Eq. (A.6), there is a collection aβ 1 β 2d of 2 points in { } ≤ ≤ def d B d (0, 1/2) = B R such that ℓ2 1/2 ⊂ 1 a ′ a > for β′ = β k β − βk2 10 6 TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 39

Rn and let (ηβ )1 β 2d be any collection of unit vectors in . Consider the map ≤ ≤ f : Rd Rn defined by → 2d f(x)= ϕ(104 x a 2)η (8.53) k − βk2 β βX=1 Thus by construction, the summands in Eq. (8.53) are disjointly supported func- tions of x. Clearly 4 4 2 Df(x)ξ =2 10 ϕ′(10 x a ) x a , ξ η (8.54) · k − βk2 h − β i β Xβ implying that Df(x) 2 2 C k k → ≤ and Df(x) Df(y) 2 2 C x y 2 (8.55) k − k → ≤ k − k Next, let θ ,...,θ Rn be orthogonal vectors such that 1 d ∈ 1 θj 2 =1, θj = (8.56) k k k k∞ √n and define the map d 2n n n F : B d R R = R R ℓ2 ⊂ → × d F (x)= xj θj ,f(x) j=1  X  In view of Eq. (8.55), F satisfies conditions (8.50),(8.51). Also, by (8.54),(8.56) d α . sup max DF (x)ξ . + max ηβ (8.57) 2 ∞ ∞ x B d ξ =1 k k n β k k ∈ ℓ2 k k r

d Let = F (B d ). Fix some 1 β 2 and let for 1/200 t 1/(100√2) M ℓ2 ≤ ≤ ≤ ≤ d 4 2 γ(t)= F (aβ + te1)= aβ,jθj + tθ1, 10 t ηβ j=1  X  where we used Eq. (8.52). Thus γ is a C1-curve in and M 1 γ′ = (θ , 100η ) (8.58) 200 1 β   Let Φ be a sparse m 2n matrix for which Eq. (8.44) holds. Then Φ has to satisfy in particular × (1 ε)(1 + 104)1/2 Φ(θ , 100η ) (1 + ε)(1 + 104)1/2 − ≤k 1 β k2 ≤ and hence Φ(θ , 100η ) < 103 for all 1 β 2d (8.59) k 1 β k2 ≤ ≤ Choose k such that n d 2d > 2k, i.e. k k ≃ log n   k n Rn and let (ηβ ) be the collection of 2 k vectors in of the form 1 η = e with I 1,...,n , I = k (8.60) √ ± j ⊂{ } | | k j I X∈ 40 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

By Eq.(8.59), Φ needs to satisfy Φ(0, η) 20 (8.61) k k2 ≤ for all vectors η of the form Eq. (8.60). But if m < n/d, Eq. (8.61) implies that s k/400 & d/ log n. On the other hand, Eq. (8.57) gives ≥ d 1 log n α . + rn √k ≃ r d 8.6.1. Geodesic distances. In this section we show that not only do sparse maps preserve curve lengths on manifolds, but in fact they preserve geodesic distances. Lemma 21. Fix x Rn such that x 1 for j J 1,...,n , J = r. Then ∈ | j |≥ ∈ ⊂{ } | | 1 sr 1 i m : Φ2 x2 > min ,cm def= r (8.62) ≤ ≤ ij j ≥ s 3 1 n o n o with probability (with respectX to Φ) at least

sr 1 2− (8.63) − 2 2 Proof. Fix I 1,...,m , I r1 and assume j Φij xj < 1/s for i / I. This ⊂ { } | | ≤def ∈ means that for any j J, S = = i;Φij = 0 IP. The probability (with respect to Φ) that i;Φ =0 ∈ I (with j{fixed)6 is (by} ⊂ Eq. (2.3)) { ij 6 }⊂ P(S = K)= E δij Φ K I K I i K KX⊂=s KX⊂=s Y∈ | | | | s s ≤ m K I KX⊂=s   | | I s s | | ≤ s · m     e I s · | | ≤ m Since for different j the events are independent,  it follows the probability that 1 1 i m; Φ2 x2 I ≤ ≤ ij j ≥ s ⊂ ns r Xsr o is bounded by ((e I /m) ) (er1/m) . Taking a union bound over all I 1,...,m , I r | gives| by the≤ choice of r ⊂ { } | |≤ 1 1 sr r1 sr 2 2sr/3 m er1 em er1 e r1 sr < 2− I · m ≤ r1 m ≤ m | |         

For the following lemma recall that for a Rn and (σ ) a Rademacher vector ∈ j 1 4 P aj σj < a 2 < . (8.64) σ 2k k 5  j  X This is a consequence of the Paley-Zygmund inequality, see e.g. [Ber97, Theorem 3.6]. Lemma 22. Let x Rn satisfy the assumption of Lemma 21. Then ∈ 1 cr1 P Φx 2 < < 2− (8.65) Φ k k 2√s   TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 41

Proof. We may assume that Φ2 x2 1/s for i in a subset I 1,...,m , ij j ≥ ⊂ { } I r . Exploiting the random signs of Φ , we find by Eq. (8.64) for each i I | |≥ 1 P ij ∈ 1 4 P Φ x < < (8.66) ij j 2√s 5  j  X If Φx < 1/(2√s), then Φ x < 1 /(2√s) for all i, in particular for all i I. k k2 | j ij j | ∈ By Eq. (8.66) the probability for this event is at most P I 4 | | cr1 < 2− , 5 proving Eq. (8.65).    Corollary 23. Let ξ Rn be a set of vectors x with following properties: ⊂ (a) Each x ξ has a decomposition x = x′ + x′′, x′ ξ′ and there is a set ∈ ∈ J = J ′ 1,...,n , J r so that x′ 1 for j J. Moreover x ⊂{ } | |≥ | j |≥ ∈ 1 x′′ < (8.67) k k2 10n (b) c min sr,m ξ′ < 2 { } (8.68) | | Then 1 Φx > for all x ξ (8.69) k k2 4√s ∈ c min sr,m with probability at least 1 2− { }. − Proof. Write 1 Φx Φx′ Φx′′ Φx′ √n x′′ Φx′ . k k2 ≥k k2 −k k2 ≥k k2 − ·k k2 ≥k k2 − 10√n since clearly Φ √n. Next, Lemma 22 and a union bound will ensure that k k2 ≤ Φx′ 1/(2√s) for all x′ ξ′ with the desired probability.  k k2 ≥ ∈ Lemma 24. Let s c(log n)2. Let ξ Rn be a finite set of unit vectors satisfying ≥ ⊂ log ξ < cm (8.70) | | cs Then with probability at least 1 e− for some constant c> 0, Φ satisfies − 2 c(log n) Φx >e− for x ξ. (8.71) k k ∈ n Proof. Given x R , let x∗ be the decreasing rearrangement of xi . Let K = c log n. We partition∈ ξ as | |

ξ = (ξℓ (ξ0 ξ1 ... ξℓ 1)) \ ∪ ∪ ∪ − [ℓ ℓ 2 where ξ 1 = , and for ℓ 0 satisfying 2 K n− ℓ { ∈ m/(2 K ) } 2 For the vectors x ξ0, apply Corollary 23 with x = x′, r = m/K (after rescaling K ∈ by n− ). Since csr > cm > log ξ log ξ0 , (8.68) holds and by Eq. (8.69), we K | | ≥ | | ensure that Φx >n− /(4√s) for all x ξ . k k2 ∈ 0 Consider the set ξℓ′ = ξℓ (ξ0 ξ1 ... ξℓ 1). Thus each x ξℓ′ has a decompo- sition \ ∪ ∪ ∪ − ∈ x = y + z ℓ 1 2 where y is obtained by considering the m/(2 − K ) largest coordinates of x and K+2ℓ 2 z x∗ ℓ−1 2 n− − since x / ξℓ 1. k k∞ ≤ m/(2 K ) ≤ ∈ − 42 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

Note also that y B , with X = e : j S . ∈ XS S { j ∈ } S 1,...,n ⊂{[m } S = 1 2 | | 2ℓ− K Let f B be a finite subset such that S ⊂ XS K dist(y, f ) n − n− n − n− > − 2 since x ξ . By Eq.(8.69) ∈ ℓ K+2ℓ 1 Φx >n− for all x ξ k k · 4√s ∈ ℓ c min m,sr cs ℓ 2 with probability at least 1 e− { } 1 e− , since m/(2 K ) 1.  − ≥ − ≥ Let Rn be a d-dimensional manifold obtained as a bi-Lipschitz image of M ⊂ the unit ball B d , i.e. F : B d satisfies ℓ2 ℓ2 →M F (x) F (y) x y . (8.74) k − k2 ≃k − k2 d n We assume moreover F is smooth, more specifically DF : B d (R , R ) (i.e. ℓ2 → L linear maps from Rd to Rn under operator norm) is Lipschitz, i.e.,

DF (x) DF (y) ℓd ℓn . x y 2 (8.75) k − k 2 → 2 k − k Lemma 25. Let be as above. Assume M m & d(log n)2 s & (log n)2 (8.76) TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 43

cs Then with probability at least 1 e− for some constant c> 0, Φ is bi-Lipschitz, and more specifically − |M 2 c(log n) Φ(x) Φ(y) e− x y for x, y (8.77) k − k2 ≥ ·k − k2 ∈M Proof. We treat separately the pairs x, y which are at a “large” and “small” ∈M distance from each other. Fix ε1 > 0 and ε2 >ε1 to be specified later. Let Aε1 be an ε -net for . By Eq.(8.74) and Eq.(A.6), we can assume that ⊂M 1 M log A . d log(1/ε ). | ε1 | 1 Assume that x, y , x y 2 > ε2. Take x1,y1 Aε1 s.t. x x1 2 < ε1, y y <ε . Since∈ M Φ isk linear− k ∈ k − k k − 1k2 1 Φ(x) Φ(y) Φ(x y ) 2√nε k − k2 ≥k 1 − 1 k2 − 1 Apply Lemma 24 to the set

x1 y1 ξ = − : x1,y1 Aε1 . x1 y1 2 ∈ nk − k o Assuming m & d log(1/ε ) >c log A (8.78) 1 | ε1 | we ensure Φ to satisfy

2 x1 y1 c(log n) Φ − >e− for x1,y1 Aε1 . x1 y1 2 2 ∈ k − k  Therefore, if x, y , x y >ε ∈M k − k 2 2 2 c(log n) ε1 Φ(x) Φ(y) 2 e− x1 y1 2 2√n x y 2 k − k ≥ ·k − k − ε2 ·k − k 2 1 c(log n) e− x y (8.79) ≥ 2 ·k − k2 2 c(log n) choosing ε2 =5√ne ε1. This takes care of large distances. In order to deal with small distances, we first ensure that 2 c(log n) Φ(v) >e− for any unit tangent vector v of (8.80) k k M Consider = v T : v =1 (8.81) Tε1 ∈ x1 k k x1 A [∈ ε1 n o Under the assumption (8.78) on m (and making an ε1-discretization of each

v Tx1 : v 2 = 1 ), another application of Lemma 24 will ensure that (8.80) {holds∈ for allk vk }. Next, if x , x A , x x < ε , it follows ∈ Tε1 ∈ M 1 ∈ ε1 k − 1k2 1 from Eq. (8.75) that DF (x) DF (x1) ℓ2 ℓ2 . ε1 and hence ρFin(Tx,Tx1 ) . ε1. Therefore, if v T , kv = 1,− there is vk → s.t. v v . ε . Therefore ∈ x k k2 1 ∈ Tε1 k − 1k2 1 2 2 c(log n) 1 c(log n) Φ(v) Φ(v ) √nε e− √nε > e− k k2 ≥k 1 k2 − 1 ≥ − 1 2 2 1/2 c(log n) since ε1

w u 2 sup DF (tw + (1 t)u) DF (u) ℓ2 ℓ2 ≤k − k · t k − − k → . u w 2. (8.82) k − k2 Denote w u v = DF (u) − T . w u ∈ u k − k2 By Eq. (8.82) and using Φ √n,   k k≤ y x u w v < cε x y k − −k − k2 k2 2k − k2 Φ(y) Φ(x) u w Φ(v) 0, Φ preserves geodesicM distances up to factor 1+ ε. − Proof. By (8.84), (1 ε) γ < Φ(γ) < (1 + ε) γ for any C1-curve γ in . Let ρ refer to the geodesic− distance| | in| .| Clearly | | M M M ρΦ( )(Φ(x), Φ(y)) (1 + ε)ρ (x, y) (8.85) M ≤ M 1 from the above. We need to show the reverse inequality. Let γ1 be a C -curve in Φ( ) joining Φ(x), Φ(y) such that ρΦ( )(Φ(x), Φ(y)) = γ1 . Since Φ satisfies (8.77),M Φ : Φ( ) is bi-LipschitzM and hence a diffeomorphism| | (since Φ is |M M → M 1 1 also smooth). It follows that γ =Φ− γ1 is a C -curve joining x and y and

ρ (x, y) γ (1 + ε) Φ(γ) = (1+ ε) γ1 =(1+ ε)ρΦ( )(Φ(x), Φ(y)). M ≤ | |≤ | | | | M 

9. Discussion We have provided a general theorem which captures sparse dimensionality reduc- tion in Euclidean space and qualitatively unifies much of what we know in specific applications. There is still much room though for quantitative improvement. We here list some known quantitative shortcomings of our bounds and discuss some avenues for improvement in future work. First, our dependence on 1/ε in s in all our theorems is, up to logarithmic factors, quadratic. Meanwhile the works [KN14, NN13a] show that the correct dependence in the case of small m should be linear. Part of the reason for this discrepancy may be our use of the chaining result of [KMR14]. Specifically, chaining is a technique TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 45 that in general converts tail bounds into bounds on the expected supremum of stochastic processes (see [Tal05] for details). Perhaps one could generalize the tail bound of [KN14] to give a broad understanding of the decay behavior for the error random variable in the SJLT as a function of s,m then feed such a quantity into a chaining argument to improve our use of [KMR14]. Another place where we lost logarithmic factors is in our use of the duality of entropy numbers [BPSTJ89, Proposition 4] in Eq. (5.4). It is believed that in general if K,D are symmetric convex bodies and (K,D) is the number of translations of D needed to cover K then N 1 (D◦,aK◦) . (K,D) . (D◦,a− K◦) N N N for some universal constant a > 0. This is known as Pietsch’s duality conjecture [Pie72, p. 38], and unfortunately resolving it has been a challenging open problem in convex geometry for over 40 years. We have also lost logarithmic factors in our use of dual Sudakov minoration, which bounds sup t[log (B , ,t)]1/2 by a constant times E P g for t>0 N E k·kX g k E kX any norm X even though what we actually wish to bound is the γ2 functional. k·k 1/2 Passing from this functional via Eq. (2.2) to supt>0 t[log (BE, X ,t)] costs us logarithmic factors. The majorizing measures theoryN (see [Tal05k·k]) shows that the loss can be avoided when X is the ℓ2 norm; it would be interesting to explore how the loss can be avoided ink·k our case. Finally, note that the SJLT Φ considered in this work is a randomly signed adjacency matrix of a random bipartite graph with m vertices in the left bipartition, n in the right, and with all right vertices having equal degree s. Sasha Sodin has asked in personal communication whether taking a random signing of the adjacency matrix of a random biregular graph (degree s on the right and degree ns/m on the left) can yield improved bounds on s,m for some T of interest. We leave this as an interesting open question.

Acknowledgement. We would like to thank Huy L. Nguyˆen˜ for pointing out the second statement in Lemma 11. Part of this work was done during a visit of the second-named author to . He wishes to thank the Theory of Computation group for its hospitality.

References [ABTZ13] Haim Avron, Christos Boutsidis, Sivan Toledo, and Anastasios Zouzias. Efficient dimensionality reduction for canonical correlation analysis. In Proceedings of the 30th International Conference on Ma- chine Learning (ICML), 2013. [AC09] Nir Ailon and Bernard Chazelle. The Fast Johnson–Lindenstrauss Transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009. [Ach03] Dimitris Achlioptas. Database-friendly random projections: Johnson- Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671– 687, 2003. [ADR14] Ula¸sAyaz, Sjoerd Dirksen, and Holger Rauhut. Uniform recovery of fusion frame structured sparse signals. CoRR, abs/1407.7680, 2014. [AHK06] Sanjeev Arora, Elad Hazan, and Satyen Kale. A fast random sampling algorithm for sparsifying matrices. In APPROX-RANDOM, pages 272–279, 2006. [AL09] Nir Ailon and Edo Liberty. Fast dimension reduction using Rade- macher series on dual BCH codes. Discrete Comput. Geom., 42(4):615–630, 2009. 46 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

[AL13] Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Transactions on Algorithms, 9(3):21, 2013. [Alo03] Noga Alon. Problems and results in extremal combinatorics–i. Dis- crete Mathematics, 273(1-3):31–53, 2003. [AMT10] Haim Avron, Petar Maymounkov, and Sivan Toledo. Blendenpik: Supercharging LAPACK’s least-squares solver. SIAM J. Scientific Computing, 32(3):1217–1236, 2010. [AN13] Alexandr Andoni and Huy L. Nguyˆen.˜ Eigenvalues of a matrix in the streaming model. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1729–1737, 2013. [BB05] Maria-Florina Balcan and Avrim Blum. A pac-style model for learn- ing from labeled and unlabeled data. In Learning Theory, 18th Annual Conference on Learning Theory, COLT 2005, Bertinoro, Italy, June 27-30, 2005, Proceedings, pages 111–126, 2005. [BBDS12] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The johnson-lindenstrauss transform itself preserves differential privacy. In 53rd Annual IEEE Symposium on Foundations of Computer Sci- ence, FOCS 2012, New Brunswick, NJ, USA, October 20-23, 2012, pages 410–419, 2012. [BBV06] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Ma- chine Learning, 65(1):79–94, 2006. [BCDH10] Richard G. Baraniuk, Volkan Cevher, Marco F. Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Trans. Inf. Theory, 56:1982–2001, 2010. [BCIS05] Mihai Badoiu, Julia Chuzhoy, , and Anastasios Sidiropou- los. Low-distortion embeddings of general metrics into the line. In Proceedings of the 37th Annual ACM Symposium on Theory of Com- puting, Baltimore, MD, USA, May 22-24, 2005, pages 225–233, 2005. [BD08] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. J. Fourier Anal. Appl., 14:629–654, 2008. [BDF+11] , Stephen Dilworth, Kevin Ford, Sergei Konyagin, and Denka Kutzarova. Explicit constructions of RIP matrices and related problems. Duke J. Math., 159(1):145–185, 2011. [Ber97] Bonnie Berger. The fourth moment method. SIAM J. Comput., 26(4):1188–1207, 1997. [BLM89] Jean Bourgain, Joram Lindenstrauss, and Vitali D. Milman. Approx- imation of zonoids by zonotopes. Acta Math., 162:73–141, 1989. [BOR10] Vladimir Braverman, Rafail Ostrovsky, and Yuval Rabani. Rade- macher chaos, random Eulerian graphs and the sparse Johnson- Lindenstrauss transform. CoRR, abs/1011.2590, 2010. [BPSTJ89] Jean Bourgain, Alain Pajor, Stanislaw J. Szarek, and Nicole Tomczak-Jaegermann. On the duality problem for entropy numbers of operators. Geometric Aspects of , 1376:50–163, 1989. [BT02] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225–242, 2002. [BvdG11] Peter B¨uhlmann and Sara van de Geer. Statistics for high- dimensional data. Springer, Heidelberg, 2011. [BW09] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 47

9(1):51–77, 2009. [BZMD11] Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas. Stochastic dimensionality reduction for k-means clus- tering. CoRR, abs/1110.2897, 2011. [Can08] Emmanuel Cand`es. The restricted isometry property and its impli- cations for compressed sensing. Comptes Rendes Mathematique, 346 (9-10): 589–592, 2008. [Car85] Bernd Carl. Inequalities of Bernstein-Jackson-type and the degree of compactness operators in Banach spaces. Ann. Inst. Fourier (Greno- ble), 35(3):79–118, 1985. [CCFC04] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004. [CDMI+13] Kenneth L. Clarkson, Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, Xiangrui Meng, and David P. Woodruff. The Fast Cauchy Transform and faster robust linear regression. In Pro- ceedings of the 24th Annual ACM-SIAM Symposium on Discrete Al- gorithms (SODA), pages 466–477, 2013. [CEM+14] Michael Cohen, Sam Elder, Cameron Musco, Christopher Musco, and M˘ad˘alina Persu. Dimensionality reduction for k-means clustering and low rank approximation. CoRR, abs/1410.6801, 2014. [CGV13] Mahdi Cheraghchi, Venkatesan Guruswami, and Ameya Velingker. Restricted isometry of Fourier matrices and list decodability of ran- dom linear codes. SIAM J. Comput., 42(5):1888–1914, 2013. [Cla08] Kenneth L. Clarkson. Tighter bounds for random projections of man- ifolds. In Proceedings of the 24th ACM Symposium on Computational Geometry, College Park, MD, USA, June 9-11, 2008, pages 39–48, 2008. [CM12] Pedro Contreras and Fionn Murtagh. Fast, linear time hierarchical clustering using the Baire metric. J. Classification, 29(2):118–143, 2012. [Coh14] Michael B. Cohen. Personal communication, 2014. [CT65] James W. Cooley and John M. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19:297–301, 1965. [CT05] Emmanuel Cand`es and Terence Tao. Decoding by linear program- ming. IEEE Trans. Inf. Theory, 51(12):4203–4215, 2005. [CT06] Emmanuel J. Cand`es and Terence Tao. Near-optimal signal recov- ery from random projections: universal encoding strategies? IEEE Trans. Inform. Theory, 52:5406–5425, 2006. [CW09] Kenneth L. Clarkson and David P. Woodruff. Numerical linear alge- bra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pages 205–214, 2009. [CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approxima- tion and regression in input sparsity time. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 81–90, 2013. [DG13] David L. Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci., 100(10):5591–5596, 2013. 48 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

[Dir14] Sjoerd Dirksen. Dimensionality reduction with subgaussian matrices: a unified theory. CoRR, abs/1402.3973, 2014. [Dir15] Sjoerd Dirksen. Tail bounds via generic chaining. Electron. J. Probab., 20:no. 53, 1–29, 2015. [DKS10] Anirban Dasgupta, Ravi Kumar, and Tam´as Sarl´os. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 341–350, 2010. [DMIMW12] Petros Drineas, Malik Magdon-Ismail, Michael Mahoney, and David Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012. [Don06] David Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52(4):1289–1306, 2006. [Dud67] Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Functional Analysis, 1:290–330, 1967. [EW13] Armin Eftekhari and Michael B Wakin. New analysis of manifold em- beddings and signal recovery from compressive measurements. CoRR, abs/1306.4748, 2013. [Fer75] Xavier Fernique. Regularit´edes trajectoires des fonctions al´eatoires gaussiennes. Lecture Notes in Math., 480:1–96, 1975. [Fig76] T. Figiel. On the moduli of convexity and smoothness. Studia Math., 56(2):121–155, 1976. [FR13] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Birkha¨user, Boston, 2013. [GMPTJ07] O. Gu´edon, S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Subspaces and orthogonal decompositions generated by bounded or- thogonal systems. Positivity, 11(2):269–283, 2007. [GMPTJ08] O. Gu´edon, S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Majorizing measures and proportional subsets of bounded orthonor- mal systems. Rev. Mat. Iberoam., 24(3):1075–1095, 2008. [Gor88] Yehoram Gordon. On Milman’s inequality and random subspaces which escape through a mesh in Rn. Geometric Aspects of Functional Analysis, pages 84–106, 1988. [HIM12] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. The- ory of Computing, 8(1):321–350, 2012. [HUL01] Jean-Baptiste Hiriart-Urruty and Claude Lemar´echal. Fundamentals of convex analysis. Springer-Verlag, Berlin, 2001. [HWB07] C. Hegde, M. Wakin, and R. Baraniuk. Random projections for man- ifold learning. In Advances in neural information processing systems, pages 641–648, 2007. [Ind01] Piotr Indyk. Algorithmic applications of low-distortion geometric embeddings. In Proceedings of the 42nd Annual Symposium on Foun- dations of Computer Science (FOCS), pages 10–33, 2001. [IR13] Piotr Indyk and Ilya Razenshteyn. On model-based RIP-1 matrices. In Proceedings of the 40th International Colloquium on Automata, Languages and Programming (ICALP), pages 564–575, 2013. [JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lip- schitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984. [Kah68] Jean-Pierre Kahane. Some Random Series of Functions. Heath Math. Monographs. Cambridge University Press, 1968. TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 49

[KM05] Bo’az Klartag and Shahar Mendelson. Empirical processes and ran- dom projections. J. Funct. Anal., 225(1):229–245, 2005. [KMR14] F. Krahmer, S. Mendelson, and Holger Rauhut. Suprema of chaos processes and the restricted isometry property. Comm. Pure Appl. Math., 67(11):1877–1904, 2014. [KN10] Daniel M. Kane and Jelani Nelson. A derandomized sparse Johnson- Lindenstrauss transform. CoRR, abs/1006.3585, 2010. [KN14] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1):4, 2014. [KW11] Felix Krahmer and Rachel Ward. New and improved Johnson- Lindenstrauss embeddings via the Restricted Isometry Property. SIAM J. Math. Anal., 43(3):1269–1281, 2011. [KZM02] Geza Kov´acs, Shay Zucker, and Tsevi Mazeh. A box-fitting algo- rithm in the search for periodic transits. Astronomy and Astrophysics, 391:369–377, 2002. [LDFU13] Yichao Lu, Paramveer Dhillon, Dean Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized Hadamard trans- form. In Proceedings of the 26th Annual Conference on Advances in Neural Information Processing Systems (NIPS), 2013. [LN14] Kasper Green Larsen and Jelani Nelson. The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction, 2014. Manu- script. [LP86] Fran¸cois Lust-Piquard. In´egalit´es de Khintchine dans Cp (1

[NT09] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal., 26:301–332, 2009. [PBMID13] Saurabh Paul, Christos Boutsidis, Malik Magdon-Ismail, and Petros Drineas. Random projections for support vector machines. In Pro- ceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 498–506, 2013. [Pie72] Albrecht Pietsch. Theorie der Operatorenideale (Zusammenfassung). Friedrich-Schiller-Universit¨at Jena, 1972. [Pis86] Gilles Pisier. Probabilistic methods in the geometry of Banach spaces. Probability and Analysis, Lecture Notes in Math., 1206:167–241, 1986. [Pis81] Gilles Pisier. Remarques sur un r´esultat non public de B. Maurey. S´eminaire d’analyse fonctionnelle, Exp. V., 1980/81. [PTJ86] Alain Pajor and Nicole Tomczak-Jaegermann. Subspaces of small codimension of finite dimensional Banach spaces. Proc. Amer. Math. Soc., 97:637–642, 1986. [PW14] M. Pilanci and M. Wainwright. Randomized sketches of convex pro- grams with sharp guarantees. arXiv, abs/1404.7203, 2014. [Roc70] R. Rockafellar. Convex Analysis. Press, 1970. [RS00] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality re- duction by locally linear embedding. Science, 290:2323–2326, 2000. [RV07] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4), 2007. [RV08] Mark Rudelson and Roman Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025–1045, 2008. [Sar06] Tam´as Sarl´os. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 143– 152, 2006. [SS11] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011. [Tal05] Michel Talagrand. The generic chaining: upper and lower bounds of stochastic processes. Springer Verlag, 2005. [TdSL00] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. [Tib96] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996. [Tro08] Joel A. Tropp. The random paving property for uniformly bounded matrices. Studia Math., 185:67–82, 2008. [Tro11] Joel A. Tropp. Improved analysis of the subsampled randomized hadamard transform. Adv. Adapt. Data Anal., 3(1–2):115–126, 2011. [VJ11] Andrew Vanderburg and John Asher Johnson. A technique for ex- tracting highly precise photometry for the two-wheeled Kepler mis- sion. CoRR, abs/1408.3853, 2011. [WDL+09] Kilian Q. Weinberger, Anirban Dasgupta, John Langford, Alexan- der J. Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1113–1120, 2009. TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 51

[Woo14] David P. Woodruff. Personal communication, 2014. [WZ13] David P. Woodruff and Qin Zhang. Subspace embeddings and ℓp regression using exponential random variables. In Proceedings of the 26th Conference on Learning Theory (COLT), 2013.

Appendix A. Tools from probability theory We collect some useful tools from probability theory that are used throughout the text. For further reference we record an easy consequence of Markov’s inequality. Lemma 27. If ξ is a real-valued random variable satisfying (E ξ p)1/p a p2 + a p3/2 + a p + a p1/2 + a , for all p p , | | ≤ 1 2 3 4 5 ≥ 0 for some 0 a ,a ,a ,a ,a < , then ≤ 1 2 3 4 5 ∞ P( ξ e(a w2 + a w3/2 + a w + a √w + a )) exp( w) (w p ). (A.1) | |≥ 1 2 3 4 5 ≤ − ≥ 0 n Let us call σ = (σi)i=1 a Rademacher vector if its entries are independent Rade- macher random variables. Khintchine’s inequality states that for any 1 p< ≤ ∞ p 1/p n (E σ, x ) Cp x 2 (x R ), (A.2) σ |h i| ≤ k k ∈ where Cp √p. The noncommutative Khintchine inequality [LP86, LPP91] says ≤ m n that for any A ,...,A R × and 1 p< 1 n ∈ ≤ ∞ n p 1/p 1/2 1/2 E σiAi C√p max Ai∗Ai , AiAi∗ , Sp ≤ Sp Sp  i=1  n  i   i  o X X X where Sp is the p-th Schatten norm. In particular, if p max log m, log n , then k·k ≥ { } n p 1/p 1/2 1/2 E σ A C√p max A∗A , A A∗ (A.3) i i ≤ i i i i  i=1  n  i   i  o X X X as S e in this case. Khintchine’s inequality and its noncommutative k·k≤ k·k p ≤ k·k version are frequently used in combination with symmetrization: if ζ1,...,ζn are X-valued random variables and σ is a Rademacher vector, then for any 1 p< , ≤ ∞ n p 1/p n p 1/p E ζi E ζi E E σiζi . (A.4) ζ − X ≤ ζ σ X  i=1   i=1  X X The following decoupling inequality is elementary to prove (see e.g. [FR13, Theorem n n 8.11]). Let R × , let σ be an n-dimensional Rademacher vector and σ′ an independentM copy. ⊂ Then, for any 1 p< ≤ ∞ p 1/p p 1/p (E sup σ∗Mσ E(σ∗Mσ) ) 4( E sup (σ′)∗Mσ ) . (A.5) σ M | − | ≤ σ,σ′ M | | ∈M ∈M A special case of this bound, combined with Khintchine’s inequality, implies the following.

m n Lemma 28. Let A R × and let σ be an n-dimensional Rademacher vector. For any p 1, ∈ ≥ (E Aσ p)1/p A +2 2p A . k k2 ≤k kF k k Proof. By decoupling, p p 2/p p/2 2/p (E Aσ ) (E σ∗A∗Aσ E σ∗A∗Aσ ) + E(σ∗A∗Aσ) k k2 ≤ σ | − | E p/2 2/p 2 4( (σ′)∗A∗Aσ ) + A F ≤ σ,σ′ | | k k 52 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON and therefore Khintchine’s inequality implies that E p 2/p E p/2 2/p 2 ( Aσ ) 4 p/2( A∗Aσ ) + A F k k2 ≤ σ k k2 k k 2p2p A (E Aσ p)1/p + A 2 . ≤ k k σ k k2 k kF Solving this quadratic inequality yieldsp the claim.  To conclude we collect some tools to estimate covering numbers. Given two closed sets U, V Rn, we let the covering number (U, V ) be the minimal number of translates of V⊂ needed to cover U. If V is closed,N convex and symmetric, then we can associate with it a semi-norm x = inf t> 0 : x tV . k kV { ∈ } In this case it is customary to write (U, ,ε) := (U,εV ) (ε> 0). N k·kV N Conversely, if is a semi-norm, then V = x Rn : x 1 is closed, convex k·k { ∈ k k≤ } and symmetric, and V = . As a first tool wek·k state twok·k elementary bounds that follow from volumetric comparision. If is any semi-norm on Rn and B is the associated unit ball, k·k k·k 1 n 2 n (B , ,ε) 1+ (0 <ε 1). (A.6) ε ≤ N k·k k·k ≤ ε ≤ The following is known as Sudakov minoration  or the dual Sudakov inequality [BLM89, Proposition 4.2], [PTJ86]. Eq. (B.1) in Appendix B gives a definition of the polar V ◦. Lemma 29. Let V Rn be closed, convex and symmetric and let g be a standard Gaussian vector. Then,⊂ 1/2 n E sup ε[log (Bℓ2 , V ,ε)] . sup g, x . ε>0 N k·k x V ◦h i ∈ We will also use the following duality for covering numbers from [BPSTJ89, Proposition 4]. Lemma 30. Let U, V Rn be closed, bounded, convex and symmetric. For every θ ε, ⊂ ≥ r (U, ,ε) (U, ,θ)[ (V ◦, ◦ ,ε/8)] , N k·kV ≤ N k·kV N k·kU with r (27T ( ))2(1 + log(θ/ε)). ≤ 2 k·kV Finally, we state Maurey’s lemma [Pis81, Car85] and its usual proof.

Lemma 31. Let be a semi-norm and T2( ) be its type-2 constant. Let Ω be a set of points xk·keach with x 1. Then fork·k any z conv(Ω) and any integer ℓ> 0 there exist y ,...,y Ωkwithk≤ ∈ 1 ℓ ∈ ℓ 1 T2( ) z yi k·k − ℓ ≤ √ℓ i=1 X Proof. Write z = λ x for 0 < λ 1, λ = 1, x Ω. Let y = (y ,...,y ) i i i i ≤ i i i ∈ 1 ℓ have i.i.d. entries from the distribution in which yj = xj with probability λj . Let σ be a RademacherP vector and let g be a GaussianP vector. By symmetrization, ℓ ℓ 1 2 E z yi E σiyi y − ℓ ≤ ℓ σ,y i=1 i=1 X X ℓ ℓ 2 1 T2( ) = E E σi gi yi . E giyi k·k . ℓ σ,y g | | ℓ y,g ≤ √ℓ i=1 i=1 X X

TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 53



Appendix B. Tools from convex analysis In this appendix we recall some basic facts from convex analysis. More details can be found in e.g. [HUL01, Roc70]. For any set S Rn we let conv(S) denote the closed convex hull of S, i.e., the closure of the set of⊂ all convex combinations of elements in S. The polar of a set S is n S◦ = y R : x, y 1 for all x S , (B.1) { ∈ h i≤ ∈ } which is always closed and convex. A set is called a cone if α for all α> 0. It is called a convex cone if it is, in addition,K convex. We use cone(K ⊂S K) to denote the closed convex cone generated by a set S. The polar ◦ of a convex cone can be written as K K n ◦ = y R : x, y 0 for all x K { ∈ h i≤ ∈ K} and is always a closed convex cone. The bipolar theorem states that ◦◦ is equal to the closure of (cf. [MV97, Theorem 6.11]). Let be a closed convexK set in Rn. The tangentK coneK T (x) to at x is the closedC cone generated by x , i.e., C C ∈C C−{ } T (x) = cone d Rn : d = y x, y . C { ∈ − ∈ C} Clearly, if x is in the interior of , then T (x)= Rn. The normal cone N (x) to at x is given by C C C C ∈C N (x)= s Rn : s,y x 0 y . C { ∈ h − i≤ ∀ ∈ C} It is easy to see that (N (x))◦ = T (x). Since T (x) is closed, the bipolar theorem C C C implies that (T (x))◦ = N (x). If f : Rn RC is any function,C then a vector ξ Rn is called a subgradient of f at x dom(→f) if for any y dom(f) ∈ ∈ ∈ f(y) f(x)+ ξ,y x . (B.2) ≥ h − i The set of all subgradients of f at x is called the subdifferential of f at x and denoted by ∂f(x). If f : Rn R is a proper convex function (i.e., f(x) < for some x Rn), then the descent→ ∪{∞} cone D(f, x) of f at x Rn is defined by ∞ ∈ ∈ D(f, x)= d Rn : f(x + td) f(x) . { ∈ ≤ } t>0 [ The descent cone is always a convex cone, but it may not be closed. Theorem 32. [Roc70, Theorem 23.7] Let f : Rn R be a proper convex function. Suppose that the subdifferential ∂f(x) is nonempty→ and does not contain 0. Then,

D(f, x)◦ = cone(∂f(x)) = t∂f(x). t 0 [≥ Let be any norm on Rn and set k·k = x Rn : x R . C { ∈ k k≤ } Then, for any x with x = R, T (x) is equal to the descent cone of at x. By Theorem 32 and∈C the bipolark k theoremC this implies that k·k

T (x) = [cone(∂ (x))]◦. (B.3) C k·k Let denote the dual norm of , i.e., k·k∗ k·k y = sup x, y . k k∗ x 1h i k k≤ 54 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

It is easy to verify from the definition (B.2) that y Rn : y 1 , if x =0, ∗ ∂ (x)= { ∈ n k k ≤ } k·k ( y R : y 1 and y, x = x , otherwise. { ∈ k k∗ ≤ h i k k} Using these tools, we can readily calculate the tangent cone to an ℓ2,1-ball.

Example 33. Consider the ℓ2,1-norm defined in (6.5) and its dual norm, the ℓ2, - norm. Set ∞ = x RbD : x R . C { ∈ k k2,1 ≤ } Suppose that x = 0 and let S [b] be the indices corresponding to nonzero blocks of x. For z RbD6 , let z be the⊂ vector ∈ S

zBi if i S (zS)B = ∈ i 0 if i Sc. ( ∈ Then, bD ∂ 2,1(x)= z R : z 2, 1 and z, x = x 2,1 k·k { ∈ k k ∞ ≤ h i k k } bD = z R : zSc 2, 1 and zB = xB / xB 2 for all i S . { ∈ k k ∞ ≤ i i k i k ∈ } If x RbD satisfies x = R, then by (B.3), ∈ k k2,1 bD T (x)= y R : z,y 0 for all z ∂ 2,1(x) C { ∈ h i≤ ∈ k·k } bD xBi R c = y : ,yBi + yS 2,1 0 . ∈ xB 2 k k ≤ i S i n X∈ Dk k E o In particular, any y T (x) satisfies ∈ C y = y + y c 2 y 2 S y . k k2,1 k Sk2,1 k S k2,1 ≤ k Sk2,1 ≤ | | k k2 p Appendix C. Sketching least squares programs with an FJLT In this appendix we study sketching of least squares programs using a fast Johnson-Lindenstrauss transform (FJLT). We show in Theorem 40 that for ℓ2,1- constrained least squares minimization, one can achieve the same sketching dimen- sion as in the sparse case (cf. Section 6.2). We first recall the definition of the FJLT. Let F be the discrete Fourier transform. Let θ ,...,θ : Ω 0, 1 be independent random selectors satisfying P (θ = 1 n θ → { } θ i 1) = m/n. Let σ1,...,σn : Ωσ 1, 1 be independent Rademacher random → {− } n variables. The FJLT is defined by Ψ = ΘFDσ, where Θ = n/m diag((θi)i=1) and D = diag((σ )n ). Here diag((x )) denotes the diagonal matrix with the σ i i=1 i p elements of the sequence (xi) on its diagonal. To prove Theorem 40 we apply Lemma 11 using a suitable upper bound for Z2 and lower bound for Z1. To obtain the latter, we use the following chaining estimate. Let T be some index set. For a given set (xt,i)t T,1 i n in R, we define a ∈ ≤ ≤ semi-metric ρx on T by

ρx(t,s)= max xt,i xs,i 1 i n ≤ ≤ | − | and the denote the associated radius by

dx(T ) = sup max xt,i . t T 1 i n | | ∈ ≤ ≤ For every 1 i n we fix a probability space (Ωi, i, Pi) and let (Ω, , P) denote the associated≤ product≤ space. The following observationF was proven inF the special case p = 1 in [GMPTJ07, Theorem 1.2]. TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 55

Lemma 34. Fix 1 p < . For every t T and 1 i n let Xt,i L2p(Ωi). Then, ≤ ∞ ∈ ≤ ≤ ∈

n p 1/p E sup X2 E X2 t,i − t,i t T i=1  ∈ X  n 1/2 E 2p 1/p E 2 E p 1/p . ( γ2 (T,ρX )) + sup Xt,i ( γ2 (T,ρX )) t T i=1 ∈  X  n 1/2 E 2 E p 1/p E 2p 1/p + √p sup Xt,i ( dX (T )) + p( dX (T )) . t T i=1 ∈  X  Proof. Let (ri)i 1 be a Rademacher sequence. By symmetrization, ≥ n p 1/p n p 1/p E 2 E 2 E E 2 sup Xt,i Xt,i 2 sup riXt,i . (C.1) t T − ≤ r t T  ∈ i=1   ∈ i=1  X X By Hoeffding’s inequality, we have for any s,t T , ∈ n n 1/2 P r (X2 X2 ) u (X2 X2 )2 exp( u2/2). r i t,i − s,i ≥ t,i − s,i ≤ − i=1 i=1  X  X   Since n 1/2 n 1/2 (X2 X2 )2 √2 sup X2 ρ (t,s) t,i − s,i ≤ t,i X i=1 t T i=1  X  ∈  X  n 2 we conclude that ( riXt,i)t T is subgaussian with respect to the semi-metric i=1 ∈ P n 1/2 2 ρ (s,t)= √2 sup Xt,i ρX (s,t). ∗ t T i=1 ∈  X  By Lemma 12,

n p 1/p E 2 sup riXt,i r t T i=1  ∈ X  n 1/2 n 1/2 2 2 . sup Xt,i γ2(T,ρX )+ √p sup Xt,i dX (T ) t T i=1 t T i=1 ∈  X  ∈  X  n 1/2 sup X2 E X2 γ (T,ρ )+ √pd (T ) ≤ t,i − t,i 2 X X t T i=1 ∈ X   n 1/2 E 2 + sup Xt,i γ2(T,ρX )+ √pdX (T ) . t T i=1 ∈  X    Taking Lp-norms on both sides, using (C.1) and applying H¨older’s inequality yields

n p 1/p E 2 E 2 sup Xt,i Xt,i t T −  ∈ i=1  X n p 1/2p . E sup X2 E X2 (E γ2p(T,ρ ))1/2p + √p(E d2p(T ))1/2p t,i − t,i 2 X X t T i=1  ∈ X    n 1/2 E 2 E p 1/p E p 1/p + sup Xt,i ( γ2 (T,ρX )) + √p( dX (T )) . t T i=1 ∈  X    Solving this quadratic inequality yields the result.  56 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

To estimate the γ2-functional occuring in Lemma 34 we use a covering number estimate from [GMPTJ08]. Recall the following definitions. Let E be a and let E∗ denote its dual space. The modulus of convexity of E is defined by 1 δ (ε) = inf 1 x + y , x =1, y =1, x y >ε (0 ε 2). E − 2k k k k k k k − k ≤ ≤ n o We say that E is uniformly convex if δE(ε) > 0 for all ε > 0 and that E is 2 2 uniformly convex of power type 2 with constant λ if δE(ε) ε /(8λ ) for all ε> 0. The following observation is due to Figiel [Fig76, Proposition≥ 24]. Lemma 35. Suppose that E is a p-convex and q-concave Banach lattice for some 2 1 < p q < . Set r = max 2, q and K = max 2, √p 1 . Then, δE(ε) 1 r≤r ∞ { } { − } ≥ r− K− ε for all 0 ε 2. ≤ ≤ Using the H¨older-Minkowski inequalities one readily checks that ℓ2,p = ℓp(ℓ2) is p-convex and 2-concave if 1

ρv(x, y)= max vi, x y . (C.2) 1 i N ≤ ≤ |h − i| Set ν = max1 i N vi E∗ and let U BE. Then, for all t> 0, ≤ ≤ k k ⊂ 1/2 2 1/2 1 log (2 (U,ρ ,t)) . νλ T (E∗) log (N)t− . (C.3) N v 2 We can now estimate the parameter Z1. n d d Lemma 37. Set d = bD. Let Ψ be the FJLT, fix A R × and let R . Consider the norm A defined in (6.6) and set ∈ K ⊂ ||| ||| 2 1 3 2 β = (log (η− ) + log(b) + log(n)) log(n) log (b) log (d). (C.4) If 2 2 2 m & βε− A sup x 2,1 , (C.5) ||| ||| x : Ax 2=1 k k h ∈K k k i then with probability at least 1 η we have − 2 2 2 n 1 (1 ε) x Ψx (1 + ε) x , for all x A S − . − k k2 ≤k k2 ≤ k k2 ∈ K ∩ In particular, Z (A, Ψ, ) 1 ε. 1 K ≥ − Proof. Let Fi denote the i-th row of F . Since E 2 2 2 Rn Ψx 2 = Dσx 2 = x 2, for all x , θ k k k k k k ∈ we have 2 2 sup Ψx 2 x 2 x A Sn−1 k k −k k ∈ K∩ n 2 2 n n = sup θi DσFi, x E θi DσFi, x x A Sn−1 m − θ m ∈ K∩ i=1 Dr E Dr E X n n 2 n 2 = sup θi A∗DσFi, x E θi A∗DσFi, x . −1 n−1 m − θ m x A (S ) i=1 r r ∈K∩ X D E D E n We apply Lemma 34 (with X := θ A∗D F , x ) to find x,i h i m σ i i p 1/p E 2 2 p sup Ψx 2 x 2 θ x A Sn−1 k k −k k  ∈ K∩ 

TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 57

E 2p 1 n 1 1/p E p 1 n 1 1/p ( γ2 ( A− (S − ),ρX )) + ( γ2 ( A− (S − ),ρX )) ≤ θ K ∩ θ K ∩ p 1 n 1 1/p 2p 1 n 1 1/p + √p(E d ( A− (S − )) + p(E d ( A− (S − )) θ X K ∩ θ X K ∩ 2 1 n 1 1 n 1 γ ( A− (S − ),ρ )+ γ ( A− (S − ),ρ ) ≤ 2 K ∩ v 2 K ∩ v 1 n 1 2 1 n 1 + √pd ( A− (S − )) + pd ( A− (S − )), v K ∩ v K ∩ n where we have set vi = m A∗DσFi, defined ρv as in (C.2) and used that ρX ρv 1 n 1 ≤ uniformly. Set d2,1 = d 2 1 ( A− (S − )) and ν := max1 i n vi 2, . Clearly k·kp , K ∩ ≤ ≤ k k ∞ 1 n 1 d ( A− (S − )) d ν. (C.6) v K ∩ ≤ 2,1 We estimate the γ2-functional by an entropy integral 1 n 1 γ ( A− (S − ),ρ ) 2 K ∩ v t∗ 1 n 1 ∞ 1 n 1 . ( A− (S − ),ρ ,t)+ ( A− (S − ),ρ ,t) dt. N K ∩ v N K ∩ v Z0 Zt∗ The first integral we estimate using the volumetric bound d 1 n 1 1 n 1 2νd2,1 ( A− (S − ),ρ ,t) ( A− (S − ),ν ,t) 1+ . N K ∩ v ≤ N K ∩ k·k2,1 ≤ t For the second integral we set p = log(b)/(log(b) 1) and apply Lemma 36 with d d − E = ℓ2,p and E∗ = ℓ2,p′ , where p′ = log(b). Note that 2, 2,p′ e 2, ∞ ∞ 1/2 k·k ≤k·k ≤ k·k and therefore T2(E∗) e log (b). Also, by the remark after Lemma 35 we have 2 1 ≤ λ = (p 1)− = log(b) 1. By (C.3), − − 1/2 1/2 3/2 1/2 1 log ( (Bℓd ,ρv,t)) log ( (Bℓd ,ρv,t)) . ν log (b) log (n)t− . (C.7) N 2,1 ≤ N 2,p 1 n 1 Since A− (S − ) d2,1Bℓd , we arrive at K ∩ ⊂ 2,1 1 n 1 γ ( A− (S − ),ρ ) 2 K ∩ v t∗ νd2,1 1/2 2νd2,1 1/2 3/2 1 . √d log 1+ dt + νd log (n) log (b)t− dt t 2,1 Z0 Zt∗ 1/2  1  1/2 3/2 1 √dt log (e +2et− νd2,1)+ νd2,1 log (n) log (b) log(t− νd2,1). ≤ ∗ ∗ ∗ 1/2 Take t = d− νd2,1 to obtain ∗ 1 n 1 γ ( A− (S − ),ρ ) 2 K ∩ v 1/2 1/2 3/2 . νd2,1 log (e +2e√d)+ νd2,1 log (n) log (b) log(√d) 1/2 3/2 . νd2,1 log (n) log (b) log(d). (C.8) In conclusion, p 1/p E 2 2 sup Ψx 2 x 2 θ x A Sn−1 k k −k k  ∈ K∩  2 2 3 2 1/2 3/2 . ν d2,1 log( n) log (b) log (d)+ νd2,1 log (n) log (b) log(d) 2 2 + √pνd2,1 + pν d2,1. (C.9) Since √nF 1, Khintchine’s inequality implies that | ij |≤ (E νp)1/p σ 1 E p 1/p = ( max √n A∗DσFi 2, ) √m σ 1 i n ∞ ≤ ≤ k k 1 n 2 p/2 1/p = E max max σj Ajk√nFij √m σ 1 i n 1 ℓ b  ≤ ≤ ≤ ≤  k Bℓ j=1   X∈ X

58 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

n 1/2 1 1/2 1/2 2 (√p + log (b) + log (n)) max max Ajk√nFij ≤ √m 1 i n 1 ℓ b | | ≤ ≤ ≤ ≤ k B j=1  X∈ ℓ X  n 1/2 1 1/2 1/2 2 (√p + log (b) + log (n)) max Ajk . (C.10) ≤ √m 1 ℓ b | | ≤ ≤ k B j=1  X∈ ℓ X  Taking Lp(Ωσ)-norms in (C.9) and using (C.10), we conclude that p 1/p E 2 2 sup Ψx 2 x 2 θ,σ x A Sn−1 k k −k k  ∈ K∩  1 . (√p + log1/2(b) + log 1/2(n))2 A 2d2 log(n) log3(b) log2(d) m ||| ||| 2,1 1 + (√p + log1/2(b) + log1/2(n)) A d log1/2(n) log3/2(b) log(d) √m ||| ||| 2,1 1 + √p (√p + log1/2(b) + log1/2(n)) A d √m ||| ||| 2,1 1 + p (√p + log1/2(b) + log1/2(n))2 A 2d2 . m ||| ||| 2,1 1 The result now follows from Lemma 27 and taking w = log(η− ) in (A.1). 

To prove an upper bound for Z2 we use the following variation of Lemma 34. Note that the element u below does not need to be in the index set T .

Lemma 38. For every t T and 1 i n let Xt,i L2p(Ωi). Fix also Xu,i L (Ω ). For any 1 p< ∈, ≤ ≤ ∈ ∈ p i ≤ ∞ n p 1/p E sup X X E(X X ) t,i u,i − t,i u,i t T i=1  ∈ X  1/(2p ) n 1/2 E 2p E 2 . √p max Xu,i + Xu,i 1 i n | | ≤ ≤ i=1     X   (E γ2p(T,ρ ))1/(2p) + √p(E d2p(T ))1/(2p) . (C.11) ∗ 2 X X   Proof. Let (ri) be a Rademacher sequence. By Hoeffding’s inequality, we have for any s,t T and w 0, n∈ ≥ n 1/2 2 2 w /2 P r (X X X X ) w (X X X X ) e− . r i t,i u,i − s,i u,i ≥ t,i u,i − s,i u,i ≤ i=1 i=1  X  X   Since n 1/2 n 1/2 (X X X X )2 X2 ρ (t,s), t,i u,i − s,i u,i ≤ u,i X i=1 i=1  X n   X  we conclude that ( i=1 riXt,iXu,i)t T is subgaussian with respect to the semi- metric ∈ P n 1/2 2 ρ (s,t)= Xu,i ρX (s,t). ∗ i=1  X  By Lemma 12,

n p 1/p n 1/2 E 2 sup riXt,iXu,i . Xu,i γ2(T,ρX )+ √pdX (T ). r t T  ∈ i=1   i=1  X X Using symmetrization (A.4), this implies that n p 1/p E sup Xt,iXu,i E(Xt,iXu,i) t T −  ∈ i=1  X

TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 59

n p 1/p 2 E E sup riXt,iXu,i ≤ r t T i=1  ∈ X  n p 1/(2p) E 2 E 2p 1/(2p) E 2p 1/(2p) . Xu,i ( γ2 (T,ρX )) + √p( dX (T )) . i=1   X     By symmetrization and Khintchine’s inequality,

n p 1/p E 2 Xu,i i=1   X   n p 1/p n E X2 E X2 + E X2 ≤ u,i − u,i u,i i=1 i=1  X  X n p 1/p n E E 2 E 2 2 riXu,i + Xu,i ≤ r  i=1  i=1 Xn Xn p/2 1/p 2√p E X4 + E X2 ≤ u,i u,i i=1 i=1  X  X 1/(2p) n p 1/(2p) n E 2p E 2 E 2 2√p max Xu,i Xu,i + Xu,i. ≤ 1 i n | | ≤ ≤ i=1 i=1     X   X Solving this quadratic inequality yields

n p 1/(2p) 1/(2p) n 1/2 E 2 E 2p E 2 Xu,i 2√p max Xu,i + Xu,i . ≤ 1 i n | | i=1 ≤ ≤ i=1   X      X  

n d d n 1 Lemma 39. Let Ψ be the FJLT, let A R × , R and u S − . Let A be as in (6.6) and β as in (C.4). If ∈ K ⊂ ∈ ||| |||

2 2 2 m & βε− A sup x 2,1 , ||| ||| x : Ax 2=1 k k h ∈K k k i then Z (A, Ψ, ,u) ε with probability at least 1 η. 2 K ≤ − Proof. If Ψi denotes the i-th row of Ψ, then we can write n Z2(A, Ψ, ,u) = sup A∗Ψi, x Ψi,u E( A∗Ψi, x Ψi,u ) . K −1 n−1 h ih i− h ih i x A (S ) i=1 ∈K∩ X n Set vi = m A∗DσFi and let ρv be the semi-metric in (C.2). We apply Lemma 38 with X := θ n A D F , x and X := θ n D F ,u . By (C.6) and (C.8) x,ip i m ∗ σ i u,i i m σ i we know thath i h i p p 2p 1/(2p) (E d (T )) d2,1ν θ X ≤ E 2p 1/(2p) 1/2 3/2 ( γ2 (T,ρX )) . d2,1ν log (n) log (b) log(d). θ Moreover, n E 2 2 E 2p 1/(2p) n Xu,i = Dσu 2 =1, ( max Xu,i ) max Fi,Dσu . θ k k θ 1 i n | | ≤ 1 i n m|h i| i=1 ≤ ≤ ≤ ≤ r X Applying these estimates in (C.11) and taking Lp(Ωσ)-norms yields 1/(2p) E p 1/p p E 2p ( Z2 ) . max √n Fi,Dσu +1 θ,σ m σ 1 i n |h i| r  ≤ ≤   2p 1/(2p) 1/2 3/2 (E ν ) d2,1 log (n) log (b) log(d)+ √p . ∗ σ   60 JEANBOURGAIN, SJOERDDIRKSEN,AND JELANINELSON

By Khintchine’s inequality,

1/(2p) n 1/2 E 2p 1/2 2 2 max √n Fi,Dσu . (√p + log (n)) max nFij uj σ 1 i n |h i| 1 i n ≤ ≤ ≤ ≤ j=1    X  . √p + log1/2(n). and by (C.10) 1 (E ν2p)1/(2p) . (√p + log1/2(b) + log1/2(n)) A . √m ||| ||| Combining these estimates and using Lemma 27 yields the result. 

Combining Lemmas 11, 37, and 39 yields the following result.

n d Theorem 40. Set d = bD. Let Ψ be the FJLT, A R × and let be a closed convex set in Rd. Set β as in (C.4). Let x and xˆ be∈ the minimizersC of (6.1) and (6.2), respectively. If ∗

2 2 2 m & βε− A sup x 2,1 , ||| ||| x T (x ) : Ax 2=1 k k h ∈ C ∗ k k i then, with probability at least 1 η, − 2 f(ˆx) (1 ε)− f(x ). ≤ − ∗ The proof of Corollary 17 immediately yields the following consequence.

n d d Corollary 41. Set d = bD. Let Ψ be the FJLT, A R × and let = x R : x R . Define ∈ C { ∈ k k2,1 ≤ }

σmin,k = inf Ay 2. y 2=1, y 2 1 2√k k k k k k k , ≤

Suppose that x is k-block sparse and x 2,1 = R. If ∗ k ∗k 2 2 2 m & βε− A kσ− , ||| ||| min,k then, with probability at least 1 η, − 2 f(ˆx) (1 ε)− f(x ). ≤ − ∗ Observe that the condition on m in Theorem 40 and Corollary 41 is, up to different log-factors, the same as the condition for the SJLT in Theorem 16 and Corollary 17. In the special case D = 1, which corresponds to the Lasso, the result in Corol- lary 41 gives a qualitative improvement over [PW14, Corollary 3]. Recall from the discussion following Corollary 17 that in this case

A = max Ak 2, σmin,k = inf Ay 2. ||| ||| 1 k d k k y 2=1, y 1 2√k k k ≤ ≤ k k k k ≤ In [PW14, Corollary 3] the condition

2 4 2 1 2 2 2 2 4 σmax,k m & ε− log(η− )+ ε− min log (d) A kσmin− ,k , k log(d) log (n) 4 , ||| ||| σmin,k n   o TOWARD A UNIFIED THEORY OF SPARSE DIMENSIONALITY REDUCTION 61 was obtained, where σ = sup Ay . In terms of the depen- max,k y 2=1, y 1 2√k k k2 dence on A this bound is worse thank k our result.k k ≤ We note, however, that the bound contains fewer log-factors and in particular the dependence on η is better.

Institute for Advanced Study, Princeton, NJ 08540 E-mail address: [email protected]

RWTH Aachen University, 52062 Aachen, Germany E-mail address: [email protected]

Harvard University, Cambridge, MA 02138 E-mail address: [email protected]