
Sketching for M-Estimators: A Unified Approach to Robust Regression

Kenneth L. Clarkson∗ David P. Woodruff†

Abstract ing algorithm often results, since a single pass over A

We give algorithms for the M-estimators minx kAx − bkG, suffices to compute SA. n×d n n where A ∈ and b ∈ R , and kykG for y ∈ R is specified An important property of many of these sketching ≥0 P by a cost function G : R 7→ R , with kykG ≡ i G(yi). constructions is that S is a subspace embedding, meaning d The M-estimators generalize `p regression, for which G(x) = that for all x ∈ R , kSAxk ≈ kAxk. (Here the vector |x|p. We first show that the Huber measure can be computed norm is generally `p for some p.) For the regression up to relative error  in O(nnz(A) log n + poly(d(log n)/ε)) d time, where nnz(A) denotes the number of non-zero entries problem of minimizing kAx − bk with respect to x ∈ R , of the matrix A. Huber is arguably the most widely used for inputs A ∈ Rn×d and b ∈ Rn, a minor extension of M-, enjoying the robustness properties of `1 as well the embedding condition implies S preserves the norm as the smoothness properties of ` . 2 of the residual vector Ax − b, that is kS(Ax − b)k ≈ We next develop algorithms for general M-estimators. kAx − bk, so that a vector x that makes kS(Ax − b)k We analyze the M-sketch, which is a variation of a sketch small will also make kAx − bk small. introduced by Verbin and Zhang in the context of estimating A significant bottleneck for these methods is the the earthmover distance. We show that the M-sketch can computation of SA, taking Θ(nmd) time with straight- be used much more generally for sketching any M- estimator forward matrix multiplication. There has been work provided G has growth that is at least linear and at most showing that fast transform methods can be incorpo- quadratic. Using the M-sketch we solve the M-estimation rated into the construction of S and its application to problem in O(nnz(A) + poly(d log n)) time for any such G A, leading to a sketching time of O(nd log n)[3, 4, 7, 38]. that is convex,making a single pass over the matrix and Recently it was shown that there are useful sketch- finding a solution whose residual error is within a constant ing matrices S such that SA can be computed in factor of optimal, with high . time linear in the number nnz(A) of non-zeros of A [6, 9, 34, 36]. With such sketching matrices, various 1 Introduction. problems can be solved with a running time whose lead- In recent years there have been significant advances in ing term is O(nnz(A)) or O(nnz(A) log n). This promi- randomized techniques for solving numerical linear alge- nently includes regression problems on “tall and thin” bra problems, including the solution of diagonally dom- matrices with n  d, both in the least-squares (`2) inant systems [28, 29, 39], low-rank approximation[2, 9, and robust (`1) cases. There are also recent recursive 15, 12, 13, 34, 36, 38], overconstrained regression[9, 21, -based algorithms for `p regression [35], as well 34, 36, 38], and computation of scores [9, 17, as sketching-based algorithms for p ∈ [1, 2) [34] and 34, 36]. There are many other references; please see for p > 2 [41], though the latter requires sketches whose example the survey by Mahoney [30]. Much of this work size grows polynomially with n. Similar O(nnz(A)) time involves the tool of sketching, which in generality is a results were obtained for regresion [42], by re- descendent of random projection methods as described lating it to `1 regression. A natural question raised by by Johnson and Lindenstrauss[1, 4, 3, 11, 26, 27], and these works is which families of penalty functions can also of sampling methods [10, 14, 15, 16, 18, 19, 20]. 
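To make the sketch-and-solve workflow described above concrete, here is a minimal Python sketch (ours, not from the paper; it assumes numpy, and the sketch size t = 10d^2 is only an illustrative choice) that applies a CountSketch-style matrix S to [A b] in time proportional to nnz(A) and then solves the small least-squares problem min_x ||S(Ax - b)||_2.

import numpy as np

def countsketch(M, t, rng):
    """Apply a CountSketch-style matrix S with t rows to M in O(nnz(M)) time:
    each row of M is hashed to one of t buckets and multiplied by a random sign."""
    n = M.shape[0]
    buckets = rng.integers(0, t, size=n)        # hash g: [n] -> [t]
    signs = rng.choice([-1.0, 1.0], size=n)     # random signs
    SM = np.zeros((t, M.shape[1]))
    np.add.at(SM, buckets, signs[:, None] * M)  # unbuffered row accumulation
    return SM

rng = np.random.default_rng(0)
n, d = 10000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

SAb = countsketch(np.hstack([A, b[:, None]]), t=10 * d * d, rng=rng)
SA, Sb = SAb[:, :d], SAb[:, d]
x_sketch = np.linalg.lstsq(SA, Sb, rcond=None)[0]   # solve the sketched problem
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))  # typically close to 1

Because S is applied to [A b] in a single pass, the same code also serves as the streaming computation mentioned above.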
be computed in O(nnz(A)) or O(nnz(A) log n) time. n×d Given a problem involving A ∈ R , a sketching ma- M-estimators. Here we further extend the “nnz” t×n trix S ∈ R with t  n is used to reduce to a similar regime to general statistical M-estimators, specified by problem involving the smaller matrix SA, with the key ≥0 a measure function G : R 7→ R , where G(x) = G(−x), property that with high likelihood with respect to the G(0) = 0, and G is non-decreasing in |x|. The result is randomized choice of S, a solution for SA is a good P a new “norm” kyk ≡ G(yi). (In general these solution for A. More generally, derived using SA G i∈[n] functions kk are not true norms, but we will sometimes is used to efficiently solve the problem for A. In cases G refer to them as norms anyway.) An M-estimator is where no further processing of A is needed, a stream- a solution to minx kAx − bkG. For appropriate G, M- estimators can combine the insensitivity to of ∗ IBM Research - Almaden `1 regression with the low of `2 regression. †IBM Research - Almaden The Huber norm. The Huber norm [24], for The sketch SA (and Sb) can be computed in example, is specified by a parameter τ > 0, and its O(nnz(A)) time, and needs O(poly(d log n)) space; we measure function H is given by show that it can be used in O(poly(d log n)) time to ( find approximate M-estimators, that with constant a2/2τ if |a| ≤ τ H(a) ≡ probability have a cost within a constant factor of |a| − τ/2 otherwise, optimal. The success probability can be amplified by independent repetition and choosing the best solution combining an `2-like measure for small x with an `1-like measure for large x. found among the repetitions. The Huber norm is of particular interest, because it Condition on G. For our results we need some is popular and “recommended for almost all situations” additional conditions on the function G beyond sym- [43], because it is “most robust” in a certain sense[24], metry and monotonicity: that it grows no faster than and because it has the useful computational and statis- quadratically in x, and no slower than linearly. For- 0 tical properties implied by the convexity and smooth- mally: there is α ∈ [1, 2] and CG > 0 so that for all a, a 0 ness of its defining function, as mentioned above. The with |a| ≥ |a | > 0, smoothness makes it differentiable at all points, which a α G(a) a can lead to computational savings over ` , while enjoy- (1.1) ≥ ≥ C 1 0 0 G 0 ing the same robustness properties with respect to out- a G(a ) a liers. Moreover, while some measures, such as `1, treat The subquadratic growth condition is necessary for a small residuals “as seriously” as large residuals, it is of- sketch with a sketching dimension sub-polynomial in ten more appropriate to have robust treatment of large n to exist, as shown by Braverman and Ostrovsky [8]. residuals and Gaussian treatment of small residuals [22]. Also, subquadratic growth is appropriate for robust re- We give in §2 a sampling scheme for the Huber gression, to reduce the effect of large values in the resid- norm based on a combination of Huber’s `1 and `2 ual Ax − b, relative to their effect in least-squares. Al- properties. We obtain an algorithm yielding an ε- most all proposed M-estimators satisfy these conditions approximation with respect to the Huber norm of the [43]. 
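The Huber measure and the growth condition (1.1) are easy to state concretely. The following sketch (our own illustration, assuming numpy; the tolerance constants are arbitrary) evaluates H, evaluates the weighted measure of Definition 1.1 below, and checks numerically that H satisfies (1.1) with alpha = 2 and C_G = 1, as proved later in Lemma 2.2.

import numpy as np

def huber(a, tau):
    """H(a) = a^2/(2 tau) for |a| <= tau, and |a| - tau/2 otherwise."""
    a = np.abs(a)
    return np.where(a <= tau, a * a / (2.0 * tau), a - tau / 2.0)

def weighted_G_measure(y, w, G):
    """||y||_{G,w} = sum_i w_i G(y_i) (Definition 1.1); with w = 1 it is ||y||_G."""
    return np.sum(w * G(y))

tau = 1.0
rng = np.random.default_rng(1)
y = rng.standard_normal(6)
print(weighted_G_measure(y, np.ones(6), lambda a: huber(a, tau)))  # the Huber measure of y

# Growth condition (1.1) for |a| >= |a'| > 0:  (a/a')^2 >= H(a)/H(a') >= (a/a').
for _ in range(1000):
    lo, hi = np.sort(rng.uniform(1e-3, 10.0, size=2))
    ratio = huber(hi, tau) / huber(lo, tau)
    assert (hi / lo) ** 2 + 1e-9 >= ratio >= (hi / lo) - 1e-9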
residual; as stated in Theorem 2.1, the algorithm needs The latter linear lower bound on the growth of G O(nnz(A) log n) + poly(d/ε) time (see, e.g., [31] for holds for all convex G, and many popular M-estimators convex programming algorithms for solving Huber in have convex G [43]. Moreover, the convexity of G im- poly(d/ε) time when the dimension is poly(d/ε)). plies the convexity of kkG, which is needed for comput- M-sketches for M-estimators. We also show ing a solution to the minimization problem in polyno- that the sketching construction of Verbin and Zhang mial time. Convexity also implies significant properties [40], which they applied to the earthmover distance, can for the statistical interpretation of the results, such as also be applied to sketching for general M-estimators. consistency and asymptotic normality[23, 37]. 1 This construction, which we call the M-sketch is However, we do not require G to be convex for constructed independently of the G function specifying our sketching results, and indeed some M-estimators the M-estimator, and so the same sketch can be used are not convex; here we simply reduce a large non- for all G. That is, one can first sketch the input, in convex problem, minx kAx − bk , to a smaller non- one pass, and decide later on the particular choice of G convex problem minx kS(Ax − b)kG,w of a similar kind. penalty function G. That is, the entire algorithm for the The linear growth lower bound does imply that we problem minx kAx − bkG is to compute S · A and S · b, are unable to apply sketching to some proposed M- for a simple sketching matrix S described below, and estimators; the “Tukey” estimator, for example, whose then solve the regression problem minx kSAx − SbkG,w, G function is constant for large argument values, is not where kkG,w is defined as follows. included in our results. However, we can get close, in Definition 1.1. For dimension m and non-negative the sense that at the cost of more computation, we can weights w1, . . . , wm, define the weighted G-measure of a handle G functions that grow arbitrarily slowly. m P Not only do we obtain optimal O( (A) + vector y ∈ R , denoted kykG,w, to be i∈[m] wiG(yi). nnz We refer to w as the weight vector. poly(d log n)) time approximation algorithms for these M-estimators, our sketch is the first to non-trivially re- Notice that kyk equals kyk when w = 1 for all i. If G G,w i duce the dimension of any of these estimators other the G function is convex, then using the non-negativity than the `p-norms (which are a special case of M- of w, it follows that kykG,w is a convex function of y. estimators). E.g., for the L1 − L2 estimator in which G(x) = 2(p1 + x2/2 − 1), the Fair estimator in which 1 Verbin and Zhang call the construction a Rademacher sketch; 2 h |x| |x| i with apologies, we prefer our name, for this application. G(x) = c c − log(1 + c ) , or the Huber estimator, no dimensionality reduction for the regression problem Each Sh implements a particular sketching was known. scheme called COUNT-SKETCH on its random sub- set. COUNT-SKETCH splits the coordinates of y into 1.1 Techniques. groups (“buckets”) at random, and adds together each Huber algorithm. Our algorithm for the Hu- group after multiplying each coordinate by a random ber estimator, §2, involves importance sampling of ±1; each such sum constitutes a coordinate of Shy. 
0 the (Ai:, bi), where a sampling matrix S is obtained COUNT-SKETCH was recently [9, 34, 36] shown to be a 0 such that kS (Ax − b)kH,w is a useful approximation to good subspace embedding for `2, implying here that kAx − bkH . The sampling are based on a the matrix S0, which applies to all the coordinates of n combination of the `1 leverage score vector u ∈ R , and y = Ax, has the property that kS0Axk2 is a good es- 0 n d the `2 leverage score vector u ∈ R . The `1 vector u timator for kAxk2 for all x ∈ R ; in particular, each can be used to obtain good sampling probabilities for `1 coordinate of S0y is the magnitude of the `2 norm of 0 regression, and similarly for u and `2. Since the Huber the coordinates in the contributing group. measure has a mixed `1/`2 character, we are able to use Why should our construction, based on `2 embed- a combination of `1 and `2 scores to obtain good sam- dings, be suitable for, e.g., `1, with kD(w)SAxk1 an es- pling probabilities for Huber. A key observation we use timate of kAxk1? Why should the M-sketch be effective is Lemma 2.1, which roughly bounds the Huber norm for that norm? Here D(w) is an appropriate diagonal of a vector in terms of n, τ, and its `1 and `2 norms, matrix of weights w. An intuition comes from consid- and leads to a recursive sampling algorithm. Several ering the matrix Shmax for the smallest random subset difficulties arise, most notably that the Huber norm is Lhmax of y = Ax to be sketched; we can think of Shmax y not scale-invariant, that is, for small arguments it scales as one coordinate of y = Ax, chosen uniformly at ran- quadratically with its input while for large arguments it dom and sign-flipped. The expectation of kSh yk is P max 1 scales linearly. This complicates the sampling, as well i∈[n] kyik2/n = kyk1/n; with appropriate scaling from as simple aspects such as net arguments typically used D(w), that smallest random subset yields an estimate for ` -regression, which relied on scale-invariance. p of kyk1 = kAxk1. (This scaling is where the values w The M-sketch construction. Our sketch, a vari- are needed.) The variance of this estimate is too high to ant of that of Verbin and Zhang [40], is given formally be useful, especially when the norm of y is concentrated as (3.6) in §3.1. It can be seen as a form of sub-sampling in one coordinate, say y1 = 1, and all other coordinates and finding heavy hitters, techniques common in data zero. For such a y, however, kyk2 = kyk1, so the base streams [25]; however, most analyses we are aware of level estimator kS0yk2 is a good estimate. On the other concerning such data structures, with the exception of hand, when y is the vector with all coordinates 1/n, the that of Verbin and Zhang for earthmover distance, re- variance of kS yk is zero, while kS yk ≈ kyk is logb n 1 0 2 2 quire a operation in the sketch space and thus quite inaccurate as an estimator of kyk1. So in these do not preserve convexity. This is the first time such extreme cases, the extreme ends of the M-sketch are ef- sketches have been considered and shown to work in fective. The intermediate matrices Sh of the M-sketch the context of regression. help with less extreme cases of y-vectors. We describe here a variant construction, comprising Analysis techniques. While helpful to the intu- a sequence of sketching matrices S0,S1,...Shmax , for a ition, the above observations are not used to prove the parameter hmax, each comprising a block of rows of our results here. 
The general structure of our arguments is sketching matrix: to show that, conditioned on several constant probabil- d   ity events, for a fixed x ∈ R there are bounds on: S0 S  1  • contraction, so with high probability, kSAxkG,w is  S  S ≡  2  . not too much smaller than kAxkG;  .   .    • dilation, so with constant probability, kSAxkG,w is Sh max not too much bigger than kAxkG. n When applied to vector y ∈ R , each Sh ignores all This asymmetry in probabilities that some re- h but a subset Lh of n/b entries of y, where b > 1 is a sults are out of reach, but still allows approximation parameter, and where those entries are chosen uniformly algorithms for minx kAx − bkG. (We blur the distinc- 0 00 d at random. (That is, Sh can be factored as ShSh , where tion between applying S to A for vectors x ∈ R , and 00 n/bh×n Sh ∈ R samples row i of A by having column i to [A b] for vectors [x −1].) If the optimum xOPT for 0 with a single 1 entry, and the rest zero, and Sh has only the original problem has kS(AxOPT − b)kG that is not n/bh nonzero entries.) too large, then it will be a not-too-large solution for the ˆ sketched problem minx kS(Ax − b)kG,w. If contraction did for Lhˆ . For h a bit bigger than h, W ∩ Lh will likely bounds hold with high probability for a fixed vector Ax, be empty, and W will make no contribution to kShyk. and a weak dilation bound holds for every Ax, then an This argument does not work when the function argument using a metric-space ε-net shows that the con- G has near quadratic growth, and would result in an traction bounds hold for all x; thus, there will be no x O(log n) dilation. By modifying the estimator we can that gives a good, small kS(Ax − b)kG,w and bad, large achieve an O(1) dilation by ignoring small buckets, and kAx − bkG. adding only those buckets in a level h that are among The contraction and dilation bounds are shown on the top ones in value. Note that if G is convex, then a fixed vector y ∈ Rn by splitting up the coordinates of so is this “clipped” version, since at each level we are y into groups (“weight classes”) with the members of a applying a Ky Fan norm. The distinction of taking the weight class having roughly equal magnitude. (For y = top number of buckets versus those buckets whose value SAx, it will convenient to consider weight classes based is sufficiently large seems important here, since only the on the values G(yi), not |yi| itself; for this section we former results in a convex program. won’t dwell on this distinction: assume here G(a) = |a|.) A weight class W is then analyzed with respect to its 1.2 Outline. We give our algorithm for the Huber cardinality: there will be some random subset (“level”) M-estimator in §2.
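The construction sketched above can be written down in a few lines. The following Python sketch is our reading of (3.6) in §3.1 (numpy assumed; the parameters b, m, c are toy values, far smaller than the settings the analysis requires): each coordinate p receives a level h_p with probability 1/(beta*b^{h_p}), is hashed to one of N = bcm buckets at that level with a random sign, and every bucket at level h carries weight beta*b^h.

import numpy as np

def m_sketch(z, b=8, m=4, c=4, rng=None):
    """M-sketch of z: returns the bucket values Sz and the weight vector w."""
    rng = rng or np.random.default_rng()
    n = len(z)
    hmax = int(np.floor(np.log(n / m) / np.log(b)))
    beta = (b - b ** (-float(hmax))) / (b - 1.0)
    N = b * c * m
    probs = b ** (-np.arange(hmax + 1, dtype=float)) / beta
    levels = rng.choice(hmax + 1, size=n, p=probs / probs.sum())   # h_p
    buckets = rng.integers(0, N, size=n)                            # g_p
    signs = rng.choice([-1.0, 1.0], size=n)                         # Lambda_p
    Sz = np.zeros((hmax + 1) * N)
    np.add.at(Sz, levels * N + buckets, signs * z)                  # bucket sums
    w = beta * b ** np.repeat(np.arange(hmax + 1, dtype=float), N)  # w_j = beta*b^h
    return Sz, w

def G_w_norm(v, w, G):
    """||v||_{G,w} = sum_j w_j G(v_j)."""
    return np.sum(w * G(v))

rng = np.random.default_rng(0)
z = rng.standard_cauchy(100000)
Sz, w = m_sketch(z, rng=rng)
H = lambda a, tau=1.0: np.where(np.abs(a) <= tau, a * a / (2 * tau), np.abs(a) - tau / 2)
print(G_w_norm(Sz, w, H), np.sum(H(z)))   # roughly comparable in magnitude

The same S applied to the columns of [A b] gives the sketched regression problem, so the sketch is formed in one pass over the input and the choice of G can be deferred; the contraction and dilation guarantees of §3.5-§3.6, however, are proved for the larger parameter settings analyzed there.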

Lhˆ for which |W ∩ Lhˆ | is small relative to the number Next we give some definitions and basic lemmas of rows of Shˆ (each row of Shˆ corresponds to a bucket, related to M-sketches, that for a given vector y, under as an implementation of COUNT-SKETCH), and therefore appropriate assumptions S does not contract y too much the members of W are spread out from each other, in (§3.5). We also show it does not dilate it too much separate buckets. This implies that each member of (§3.6). In §3.6.2, we sharpen the dilation result by

W makes its own independent contribution to kSykG,w, changing slightly the way we use the sketches, improving and therefore that kSykG,w will not be too small. Also, the dilation bound while preserving the contraction the level Lhˆ is chosen such that the expected number bound. of entries of the weight class is large enough that the |W ∩ Lhˆ | is concentrated around its 2 -Approximation for the Huber Measure. with exponentially small failure probability in d, Here we consider specifically the Huber measure: for and so this contribution from W is well-behaved enough parameter τ > 0, and a ∈ R, the Huber function to union bound over a net. ( The above argument works when the weight class W a2/2τ if |a| ≤ τ H(a) ≡ has many members, i.e., at least d coordinates in order |a| − τ/2 otherwise. to achieve concentration. For those W without many P The Huber “norm” is kzk = H(zp). members which still contribute significantly to kykG, we H p need to ensure that as we over y in the subspace, The main theorem of this section, proven in §2.1: these weight classes only ever involve a small fixed set of Theorem 2.1. (Input Sparsity Time Huber Regres- coordinates. We show this by relating the G function to sion) In O(nnz(A) log n)+poly(d/) time, given an n×d 2 the function f(x) = x , and arguing that these weight matrix A with nnz(A) non-zero entries and n × 1 vector classes only involve coordinates with a large `2 leverage b, with probability at least 4/5, one can find an x0 ∈ Rd score; thus the number of such coordinates is small 0 for which kAx − bkH ≤ (1 + ε) minx∈ d kAx − bkH . and they can be handled separately once for the entire R subspace by conditioning on a constant probability We will need to relate the Huber norm to the `1 event. and `2 norms. The following lemma is shown via a case analysis of the coordinates of the vector z. To show that kSykG,w will not be too big, we show that W will not contribute too much to levels other than Lemma 2.1. (Huber Inequality) For z ∈ Rn, ˆ the “Goldilocks” level Lˆ : for h < h, for which |Lh ∩W | h Θ(n−1/2) min{kzk , kzk2/2τ} ≤ kzk ≤ kzk . is expected to be large, the fact that members of W ∩Lh 1 2 H 1 will be crowded together in a single bucket implies they Proof. For the upper bound, we note that H(a) ≤ |a|, will cancel each other out, roughly speaking; or more whether |a| ≤ τ or otherwise, and therefore kzkH ≡ precisely, the fact that the COUNT-SKETCH buckets have P P p H(zp) ≤ p |zp| ≡ kzk1. We now prove the lower an expectation that is the `2 norm of the bucket entries bound. We consider a modified Huber measure kzkG implies that if a bucket contains a large number of given a parameter τ > 0 in which entries from one weight class, those entries will make ( a lower contribution to the estimate kSyk than they a2/2τ if |a| ≤ τ G,w G(a) ≡ |a| otherwise. Then kzk ≤ kzk ≤ 2kzk , and so it suffices to prove the definition of this case, and that U ≥ τ, we have H G H √ √ U U 2 L U nL U 2 the lower bound for kzkG. + + ≤ + , which implies that ≤ nL. 4 4τ 2τ 4 4 τ √ By permuting coordinates, which does not affect the So we just need to show that L = Ω(n−1/2) nL , or √ 2τ 2 inequality we are proving, there is an s for which L equivalently, L = Ω(τ). Since 2τ ≥ U ≥ τ, we have 2 |z1| ≤ |z2| ≤ ... ≤ |zs| ≤ τ ≤ |zs+1| ≤ ... ≤ |zn|. L = Ω(τ ), as desired. L Otherwise, we have U ≥ 2τ and to prove (2.5) we  2  (We may have s = 0, when all |zi| ≥ τ, or s = n, when need to show U = Ω(n−1/2) U + L . We can assume Pn Ps 2 2τ 2τ all |zi| ≤ τ.) Let U = |zj| and L = zj . 
j=s+1 j=1 U 2 L Consider the n-dimensional vector w with s coordinates 2τ ≥ 2τ , otherwise this is immediate from the fact q 2 L that U ≥ L , and so we need to show U = Ω(n−1/2) U , equal to s , one coordinate equal to U, and remaining 2τ 2τ U √ coordinates equal to 0. Then, or equivalently, 2τ = O( n). Now we use the fact 2 2 that U + L = Θ( U ) realizes the minimum given L L 2τ 2τ τ √ U 2 (2.2) kwkG = s · + U = + U = kzkG. the case that we are in, and so = O(U + nL), or s2τ 2τ  √  τ equivalently, U = O 1 + nL . Since as mentioned Moreover, τ U 2 √ it holds that U ≥ L , we have U 2 ≥ L, and so √ 2τ 2τ L √ nL √ U √ (2.3) kwk1 = U + s · √ = U + sL ≥ kzk1, U ≤ n. It follows that 2τ = O( n), which is what s we needed to show. since subject to a 2-norm constraint L, the 1-norm is √ 2 Case: 1 (U+ nL) < U + L . What we need to show in maximized when all s coordinates are equal. Also, 4 2τ 2τ √ this case to prove (2.5) is U + L = Ω(n−1/2)(U + nL). 2 2 2τ kwk L U 2 kzk Suppose first that U ≥ L , and so we need to (2.4) 2 = + ≥ 2 , √ 2τ 2τ 2τ 2τ 2τ −1/2 show U = Ω(n )(√U + nL), which is equivalent to showing U = Ω( L). Since L ≤ U, we have since subject to a 1-norm constraint U, the 2-norm is √ √ 2τ maximized when there is a single non-zero coordinate. L = O( Uτ) = O(U), using that U ≥ τ. This Combining (2.2), (2.3), and (2.4), in order to show completes this case. −1/2 2 Otherwise, we have L ≥ U and need to show kzkG = Ω(n ) min(kzk1, kzk2/2τ) it suffices to show √ 2τ √ −1/2 2 L = Ω(n−1/2)(U + nL). We can assume nL ≥ U, kwkG = Ω(n ) min(kwk1, kwk2/2τ). By the above, 2τ otherwise this is immediate using L ≥ U, and so this is equivalent to showing √ 2τ √ we need to show L = Ω(n−1/2) nL = Ω( L), or 2 2τ L  √ U L  2 U + = Ω(n−1/2) · min U + sL, + , equivalently,√ L =√ Ω(τ ). Now we use the fact that 2τ 2τ 2τ U + nL = Θ( nL) realizes the minimum, and so √  U 2 L  L which since s ≤ n, is implied by showing nL = O + , and using that U ≤ , this √ 2τ 2τ 2τ (2.5) implies nL = O L · U + L . Since U ≥ τ, it follows  √ 2  √ 2τ 2τ 2τ √ L −1/2 U L LU  U + = Ω(n ) · min U + nL, + . that nL = O τ 2 . Now using that U ≤ nL, this 2τ 2τ 2τ implies that L = Ω(τ 2), which is what we needed to show. Note that we can assume U 6= 0, as other- This completes the proof. wise the inequality is equivalent to showing √ L −1/2  L  2τ = Ω(n ) · min nL, 2τ . This holds since Suppose we want to solve the Huber regression L −1/2 L problem min d kAx − bk , where A is an n×d matrix 2τ = Ω(n ) 2τ . So we can assume U > 0, and by x∈R H definition of U, this implies that U ≥ τ. We break the and b an n × 1 column vector. We will do so by analysis into cases: a recursive argument, and for that we will need to solve min d kAx − bk , for various weight vectors √ x∈R H,w U 2 L 1 w. Note that kAx − bk is a non-negative linear Case: 2τ + 2τ ≤ 4 (U + nL). What we need to show H,w L −1/2 U 2 L combination of convex functions of x, and hence is in this case to prove (2.5) is U + 2τ = Ω(n )( 2τ + 2τ ). L convex. We develop a lemma for this more general Suppose first that 2τ ≥ U. Then what we need to 2 problem, given w. We maintain that if wi 6= 0, then show in this case is that L = Ω(n−1/2)( U + L ). Since 2τ 2τ 2τ wi ≥ 1. L appears on both the left and right hand sides, this 2τ In our recursion we will have kwk∞ ≤ poly(n) for L −1/2 U 2 follows from showing that 2τ = Ω(n )( 2τ ). Using some polynomial that depends on where we are in the 2 2 recursion. 
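As a sanity check (not part of the proof), Lemma 2.1 can also be verified numerically; the snippet below (ours, numpy assumed) checks the upper bound exactly and the lower bound with an illustrative constant of 1/4 on random heavy-tailed inputs.

import numpy as np

def huber_norm(z, tau):
    a = np.abs(z)
    return np.sum(np.where(a <= tau, a * a / (2.0 * tau), a - tau / 2.0))

rng = np.random.default_rng(2)
tau, n = 1.0, 1000
for _ in range(200):
    z = rng.standard_cauchy(n) * rng.uniform(0.01, 10.0)        # mixes small and large entries
    h, l1, l2sq = huber_norm(z, tau), np.sum(np.abs(z)), np.sum(z * z)
    assert h <= l1 + 1e-9                                       # upper bound of Lemma 2.1
    assert h >= 0.25 * n ** -0.5 * min(l1, l2sq / (2 * tau))    # lower bound, illustrative constant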
These conditions imply that we can partition If Xi − E[Xi] ≤ ∆ for all i, then with σi = E[Xi ] − j 2 the positive coordinates of w into O(log n) groups P , E[Xi] we have j j−1 j for which P = {i | 2 ≤ wi < 2 }. j  −γ2  Let A denote the restriction of the input matrix Pr[X ≥ E[X] + γ] ≤ exp . j j P 2 A to those rows i in the set P . For each j, let U 2 i σi + 2γ∆/3 be an (α, β)-well-conditioned basis for Aj with respect j j If for some i we have pi = 1, then E[Xi] = Xi = to `1, meaning U has the same column span as A , P |U j| = α, and for all x, kxk ≤ βkU jxk wiH(Aix). It follows that such Xi do not contribute i∈P j i 1 ∞ 1 to the deviation of X from E[X ], and therefore we can [10]. Here U j is the i-th row of U j. Let V j be an i i apply Fact (2.1) only to those X for which p < 1. approximately orthonormal basis for the column span i i j P j 2 In order to apply Fact (2.1), we first bound of A , that is, i∈P j kVi k2 = O(d) and for all x, H(Aix)/pi, for the case when pi < 1, by a case j j kV xk2 = (1 ± 1/2)kxk2. Here Vi is the i-th row of analysis. Suppose i ∈ P j. We use Lemma 2.1 to do the j j j j j j j j j V . Let A = U J and A = V K , where J and K case analysis. are d × d matrices. j j j kUi k1 j −1/2 j For each j and i ∈ P , let qi = α and let Case |Aix| ≥ τ and kA xkH = Ω(n kA xk1). It kV j k2 rj = i 2 . For i∈ / P j, let qj = 0 and rj = 0. follows that i P kV j k2 i i i0∈P j i0 2 1/2 −2 H(Aix) |Aix| − τ/2 |Aix| |Aix| |Aix|α Set s = C0 · n max(α · β, d) · dε log(n/ε) for = ≤ ≤ = p p p j j a sufficiently large constant C0 > 0. Suppose we i i i Θ(s)qi Θ(s)kUi k1 independently sample each row i of A with probability kU jk kJ jxk α αβkAjxk P j j ≤ i 1 ∞ ≤ 1 pi = min(1, Θ(s · j(qi + ri ))) (the fact that we choose j Θ(s)kU k1 Θ(s) Θ(s · P (qj + rj)) instead of s · P (qj + rj) in the i j i i j i i αβO(n1/2)kAjxk O(kAjxk ) definition of pi will give us some flexibility in designing H H ≤ = −2 . a fast algorithm, as we will see). s C0ε d log(n/ε) 0 For i ∈ [n], let wi = 0 if we do not sample row j −1/2 j 2 0 Case |Aix| ≥ τ and kA xkH = Ω(n kA xk2/(2τ)). i, and otherwise wi = wi/pi. The expected number of j −1/2 j 0 We claim in this case that kA xkH = Ω(n kA xk1) non-zero elements of w is O(s log n). This is because for j j j kA xk1 1/2 P as well. Suppose not, so that j = ω(n ). each of the O(log n) possibilities of j, i qi +ri = O(1). kA xkH 0 0 Note that if wi 6= 0, then wi ≥ 1. Moreover, by a union Let S ⊆ [n] be the set of ` ∈ [n] for which C j j j bound over the n coordinates, with probability 1−1/n |(A x)`| ≥ τ. Then k(A x)SkH ≥ k(A x)Sk1/2. Hence, 0 C+1 we have kw k∞ ≤ n kwk∞, since the probability that C+1 C j k(Ajx) k + k(Ajx) k any i for which pi ≤ 1/n is sampled is at most 1/n . 1/2 kA xk1 S 1 [n]\S 1 ω(n ) = j = j kA xkH kA xkH j Theorem 2.2. (Huber Embedding) With the notation k(A x)[n]\Sk1 d ≤ 2 + , defined above, for any fixed x ∈ R , j kA xkH

Pr[(1 − ε)kAxk ≤ kAxk 0 ≤ (1 + ε)kAxk ] j 1/2 j H,w H,w H,w so that k(A x)[n]\Sk1 = ω(n )kA xkH . j ≥ 1 − exp(−C2d log(n/ε)), Given a value of k(A x)[n]\Sk1, the value j 2 k(A x)[n]\Sk2 is minimized when all of the coordinates for an arbitrarily large constant C2 > 0. are equal: k(Ajx) k 2 Proof. Fix a vector x and define the non-negative j [n]\S 1 k(A x)[n]\Sk ≥ n · /(2τ) 0 Pn H n random variable Xi = wi · H(Aix). For X = i=1 Xi, Pn j 2 we have E[X] = pi(wi/pi)H(Aix) = kAxk . k(A x) k i=1 H,w = [n]\S 1 . We will use the following version of the Bernstein 2τn inequality. j Note also that k(A x)SkH ≥ τ/2 since there exists an i n Fact 2.1. ([33, 5]) Let {Xi}i=1 be independent random for which |Aix| ≥ τ given that we are in this case. 2 P j variables with E[Xi ] < ∞ and Xi ≥ 0. Set X = i Xi So in order for the condition that k(A x)[n]\Sk1 = and let γ > 0. Then, 1/2 j ω(n )kA xkH , it must be the case that  2   j 2  −γ j 1/2 k(A x)[n]\Sk1 Pr[X ≤ E[X] − γ] ≤ exp P 2 . k(A x)[n]\Sk1 = ω(n ) · τ + . 2 i E[Xi ] 2τn The right hand side of this expression is minimized Setting γ = εkAxkH,w, and applying Fact (2.1), j 2 2 k(A x)[n]\S k1 2 when τ = 2n , which implies Θ(τ n) = j 2 j Pr[kAxkH,w0 ≤ kAxk − γ] k(A x)[n]\Sk1, or equivalently, k(A x)[n]\Sk1 = H,w √ ! Θ(τ n). But then we have −γ2C ε−2d log(n/ε) ≤ exp 1 √ j 1/2 2(kAxk )2 Θ(τ n) = k(A x)[n]\Sk1 = ω(n ) · 2τ, H,w ≤ exp(−C d log(n/ε)), j 2 which is a contradiction. Hence, kA xkH = −1/2 j Ω(n kA xk1), and this case reduces to the first case. and also

j −1/2 j Pr[kAxk 0 ≥ kAxk + γ] Case |Aix| ≤ τ and kA xkH = Ω(n kA xk1). H,w H,w It follows that   −γ2 2 ≤ exp  2  H(Aix) (Aix) τ|Aix| |Aix| (kAxkH,w) 2 kAxkH,w = ≤ = , 2 C ε−2d log(n/ε) + 3 γ C ε−2d log(n/ε) pi 2τpi 2τpi 2pi 1 1 ≤ exp(−C2d log(n/ε)), using that |Aix| ≤ τ. Now we have the same derivation as in the first case, up to a factor of 2. where C2 > 0 is a constant that can be made arbitrarily large by choosing C0 > 0 arbitrarily large. j −1/2 j 2 Case |Aix| ≤ τ and kA xkH = Ω(n kA xk2/(2τ)). It follows using the properties of V j that We now combine Theorem 2.2 with a net argument for the Huber measure. We will use those arguments in 2 2 2 H(Aix) (Aix) (Aix) (Aix) O(d) Section 4. To do so, we need the following lemma. = ≤ j ≤ j pi 2τpi τΘ(s)r τskV k2 i i 2 Lemma 2.2. (Huber Growth Condition) The function kV jk2kKjxk2O(d) kAjxk2O(d) ≤ i 2 2 ≤ 2 H(a) satisfies the growth condition (1.1) with α = 2 j 2 τs τskVi k2 and CG = 1. O(d)τn1/2kAjxk ≤ H Proof. We prove this by a case analysis. We can assume τs a and a0 are positive since the inequality only depends j O(kA xkH ) on the absolute value of these quantities. For notational = −2 . 0 0 C0ε d log(n/ε) convenience, let C ≡ a/a . If a = a , the lemma is immediate, so assume C > 1. j Hence, in all cases, if i ∈ P then First suppose a0 ≥ τ. Then H(a)/H(a0) = (Ca0 − τ/2)/(a0 − τ/2), which is maximized when a0 = τ, H(A x) kAjxk i H yielding (C − 1/2)/(1/2) = 2C − 1. Since 2C − 1 ≤ C2 ≤ −2 pi C1ε d log(n/ε) for C ≥ 1, the left inequality of (1.1) holds. Conversely, 0 for an arbitrarily large constant C > 0. Then, H(a)/H(a ) is at least C, and so the right inequality of 1 (1.1) holds. w H(A x) w kAjxk Next suppose a ≥ τ and a0 < τ. Then X − E[X ] ≤ X ≤ i i ≤ i H . i i i p C ε−2d log(n/ε) i 1 H(a)/H(a0) = (Ca0 − τ/2)/((a0)2/(2τ)) Moreover, using the notation of Fact (2.1), = 2τC/a0 − τ 2/(a0)2.

X 2 X 2 Then σi ≤ E[Xi ] i:pi<1 i:pi<1 0 d(H(a)/H(a )) 0 2 2 0 3 j = −2τC/(a ) + 2τ /(a ) , X X wiH(A x) 0 = w H(A x) i da i i p j i 0 j i:pi<1, i∈P and setting this equal to 0 we find that a = τ/C j maximizes H(a)/H(a0). In this case H(a)/H(a0) = C2, X kA xkH,w X ≤ · wiH(Aix) and so the left inequality of (1.1) holds. Since a ≥ τ, C ε−2d log(n/ε) 1 j 0 j i:pi<1, i∈P 0 d(H(a)/H(a )) 0 τ > a ≥ τ/C, and since da0 < 0 for a ∈ j 2 2 (τ/C, τ], H(a)/H(a0) is minimized when a0 = τ, in X kA xkH,w kAxkH,w ≤ ≤ . C ε−2d log(n/ε) C ε−2d log(n/ε) which case it equals 2C − 1. Since 2C − 1 ≥ C for j 1 1 C ≥ 1, the right inequality of (1.1) holds. 0 Finally, suppose a < τ. In this case 1−o(1). Moreover, since wi = wi/pi with probability pi (and zero otherwise), by a union bound the probability H(a)/H(a0) = a2/(a0)2 = C2, 0 2 that a pi of a nonzero wi is less than 1/n is at most 2 0 2 n/n = 1/n, so with probability 1 − o(1), kw k∞ ≤ n , and the left inequality of (1.1) holds, and the right 2 implying kAx − bk 0 ≤ n kAx − bk for all x. inequality holds as well. H,w H Hence, we can apply Lemma 4.5 with S equal to the identity and our choice of w0 (together with the 2.1 Proof of Theorem 2.1, Huber algorithm input constant 2κ) to conclude, by a union bound that running time. ∗ with probability 1 − o(1), if x = argminxkAx − bkH,w0 ∗ 3/2 Proof. We first solve the regression prob- subject to the constraint kT Ax −T bk2 ≤ 2κcn , then ∗ lem minx kAx−bk2 in O(nnz(A))+poly(d/ε) time using kAx − bkH,w ≤ (1 + ε) minx kAx − bkH,w. [9] up to a factor of 1+ε. This step succeeds with prob- Thus, we have reduced the original regression prob- 0 0 ability 1 − o(1). Suppose y = Ax − b realizes this mini- lem to the regression problem minx kAx − bkH,w0 con- 0 3/2 0 mum. Let c = ky k2/(1 + ε). Then by Lemma 4.4 as we strained by kT Ax − T bk2 ≤ 2κcn , where w has will see in our net argument in §4, applied to w = 1n, n1/2 log2 n · poly(d/ε) non-zero entries. We now repeat ∗ ∗ ∗ this procedure recursively O(1) times. Let w = 1n and if y = Ax − b, where x = argminxkAx − bkH , then 0 ∗ 3/2 0 c ≤ ky k2 ≤ κcn , where κ > 0 is a sufficiently large w1 = w . In the `-th recursive step, ` ≥ 2, we are given constant. the regression problem minx kAx − bkH,w`−1 subject to n 3/2+2`−2 To apply Theorem 2.2 with w = 1 first note that the constraint kT Ax−T bk2 ≤ 2κcn (we use the 1 all weights wi are in the same group P . We then same matrix T in all steps), and we reduce the problem need to be able to compute the sampling probabilities to solving minx kAx − bkH,w` subject to the constraint 1 1 3/2+2`−2 qi and ri , but only up to a constant factor since kT Ax−T bk2 ≤ 2κcn . We now describe the `-th 1 1 1 1 kUi k1 recursive step. pi = min(1, Θ(s · (q + r ))). Recall, q = and i i i α 2`−2 1 2 We inductively have that kw k ≤ n . We kVi k2 1 1 `−1 ∞ let ri = Pd 1 2 , where Ui and Vi denote the i-th i=1 kVi k2 first group the weights of w`−1 into O(log n) groups row of U 1 and V 1, respectively. Here U 1 is an (α, β)- j j j P . For each group we compute Ui and Vi as well-conditioned basis for A with respect to `1, meaning above, thereby obtaining w` in O(t`−1 log n) expected 1 Pd 1 U has the same column span as A, i=1 |Ui |1 = α, time, where t`−1 is the number of non-zero weights 1 and for all x, kxk∞ ≤ βkU xk1. By Lemma 49 and in w . The of t is O(t1/2 max(α · 1 `−1 ` `−1 Theorem 50 of [9] (see also [34, 41]), the qi can be −2 1 β, d) · ε d log(n/ε) log n). 
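The sampling step at the heart of this recursion can be sketched as follows. This is our simplified illustration, not the algorithm verbatim: a QR factorization stands in for the (alpha, beta) well-conditioned l1 basis U and the near-orthonormal basis V that are computed via [9], the sample-size parameter s is arbitrary, and scipy's generic solver replaces the constrained convex program. It samples row i with probability p_i = min(1, s(q_i + r_i)), reweights surviving rows by 1/p_i, and solves the smaller weighted Huber problem.

import numpy as np
from scipy.optimize import minimize

def huber_cost(r, tau=1.0, w=None):
    a = np.abs(r)
    vals = np.where(a <= tau, a * a / (2 * tau), a - tau / 2)
    return np.sum(vals if w is None else w * vals)

def sample_and_reweight(A, b, w, s, rng):
    """One sampling round: p_i combines l1-style and l2-style row scores."""
    Q, _ = np.linalg.qr(np.hstack([A, b[:, None]]))
    q = np.sum(np.abs(Q), axis=1); q /= q.sum()     # stand-in for the l1 scores q_i
    r = np.sum(Q * Q, axis=1); r /= r.sum()         # l2 leverage-type scores r_i
    p = np.minimum(1.0, s * (q + r))
    keep = rng.random(len(p)) < p
    return A[keep], b[keep], w[keep] / p[keep]

def weighted_huber_solve(A, b, w, tau=1.0):
    f = lambda x: huber_cost(A @ x - b, tau, w)
    return minimize(f, np.zeros(A.shape[1]), method="L-BFGS-B").x

rng = np.random.default_rng(3)
n, d = 20000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_cauchy(n)    # heavy-tailed noise
As, bs, ws = sample_and_reweight(A, b, np.ones(n), s=100 * d, rng=rng)
x_hat = weighted_huber_solve(As, bs, ws)
x_full = weighted_huber_solve(A, b, np.ones(n))
print(huber_cost(A @ x_hat - b) / huber_cost(A @ x_full - b))    # typically close to 1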
We can condition on t`−1 computed in O(nnz(A) log n)+poly(d/ε) time, for a U `−1 being O(n1/2 poly(dε−1 log n)) as all events jointly with α, β ≤ poly(d). Similarly, by Theorem 29 of [9], in succeed with probability 1 − o(1). We thus have O(nnz(A) log n) + poly(d/ε) time we can compute the 1/2` −1 1 P 1 2 t` = n poly(dε log n) with probability 1 − o(1). ri, for a matrix V for which i kVi k2 = Θ(d) and for 1 We now consider the regression problem min kAx − all x, kV xk = (1 ± 1/2)kxk2. These steps succeed with x C probability 1−1/ log n probability for arbitrarily large AbkH,w` subject to the constraint kT Ax − T bk2 ≤ 3/2 3/2+2`−2 constant C > 0. 2κcn kw`−1k∞ ≤ 2κcn . By a union bound, The vector w0 in Theorem 2.2 can be computed Theorem 2.2 holds simultaneously for all points in a O(d) in O(n) time, and the expected number of non-zero net N of size (n/ε) , this step succeeding with prob- 0 entries of w0 is O(s log n) = O(n1/2 max(α · β, d) · ability 1 − o(1). Moreover, the w in Theorem 2.2 is 2 2` ε−2d log(n/ε) log n) = n1/2(log2 n)poly(d/ε), and so equal to w` and satisfies kw`k∞ ≤ n kw`−1k∞ ≤ n . with probability 1 − o(1), we will have nnz(w0) ≤ We can thus apply Lemma 4.5 with S equal to the n1/2(log2 n)poly(d/ε). identity to conclude that with probability 1 − o(1), if ∗ Let T be the sparse subspace embedding of [9], so x = argminxkAx − bkH,w` subject to the constraint kT Ax∗ − T bk ≤ 2κcn3/2+2`−2, then kAx∗ − bk ≤ that with probability 1 − o(1), kT Axk2 = (1 ± ε)kAxk2 2 H,w`−1 (1 + ε) min kAx − bk . for all x and TA can be computed in nnz(A) time and x H,w`−1 T has poly(d/ε) rows. It follows that for ` a large enough constant, and by scaling ε by a constant factor, we will have that Now consider the regression problem minx kAx − 3/2 with probability 1 − o(1), if x∗ = argmin kAx − bk bk 0 subject to the constraint kT Ax−T bk ≤ 2κcn . x H,w` H,w 2 ∗ 3/2+2`−2 This 2-norm constraint is needed to ensure that we subject to the constraint kT Ax −T bk2 ≤ 2κcn , ∗ satisfy the conditions needed to apply Lemma 4.5 in then kAx − bkH ≤ (1 + ε) minx kAx − bkH . Moreover, 1/2` −1 our net argument in §4. By a union bound, Theorem t` ≤ n poly(dε log n). This resulting problem is 2.2 holds simultaneously for all points in a net N of that of minimizing a convex function subject to a convex size (n/ε)O(d). This step succeeds with probability constraint and can be solved using the ellipsoid method C ` in t` time for a fixed constant C > 0. Setting 2 > C/2 Lh is multiset of values at a given level, Lh,i is the −1 and assuming the poly(dε log n) factor is at most multiset of values in a bucket. We can write kSzkG,w n1/2 gives us a running time of O(n) to solve this last as P βbhG(kL k ), where kLk denotes i∈[N],h∈[0,hmax] h,i Λ Λ recursive step of the problem. The overall running time | P Λ z |. zp∈L p p of the recursion is dominated by the time to compute the (The function kk is a semi-norm (if we map sets U j and V j in the different recursive levels, which itself is Λ back to vectors), with kLk ≤ kLk , E [kLk2 ] = kLk2, dominated by the top-most level of recursion. This gives Λ 1 Λ Λ 2 k 1/k an overall running time of O(nnz(A) log n) + poly(d/ε). and all (EΛ[kLkΛ]) within constant factors of kLk2, by Khintchine’s inequality.) 3 M-sketches for M-estimators. Regression theorem. Our main theorem of this section states that M-sketches can be used for regres- Given a function G : 7→ + with G(a) = G(−a), and R R sion. 
G(0) = 0, we can use the sketch of z ∈ Rn to estimate P kzkG ≡ p G(zp), assuming G is monotone and satisfies Theorem 3.1. (Input Sparsity Time Regression for G- the growth upper and lower bounds of (1.1). functions) Let ≡ min d kAx − bk . There is OPTG x∈R G (Perhaps a more consistent notation would define an algorithm that in nnz(A) + poly(d log n) time, with the measure based on G as G−1(kzk ), by analogy with G constant probability finds xˆ such that kAxˆ − bkG ≤ ` norms. Moreover, kzk does not in general satisfy p G O(1)OPTG. the properties of a norm. However, if G is convex, then −1 kykG is a convex function of z, and if also G (kzkG) The proof is deferred to §4.1; it requires a net argument, −1 −1 is scale-invariant, so that G (ktzkG) = |t|G (kzkG), Lemma 4.5; the contraction bound Theorem 3.2 from −1 then G (kzkG) is a norm.) §3.5; and from §3.6, a clipped variant Theorem 3.4 of the The sketch. We use an extension of dilation bound Theorem 3.3. First, various definitions, COUNT-SKETCH, which has been shown to be effec- assumptions, and lemmas will be given. tive for subspace embeddings [9, 36, 34]. In that n method, for a vector z ∈ R , each coordinate zp is 3.1 Preliminary Definitions and Lemmas for mapped via a hash function from [n] to one of N hash M-estimators. We will analyze the behavior of sketch- n buckets, written as gp ∈ [N] for p ∈ [n]; a coordinate ing on z ∈ R . We assume that kzkG = 1; this is for is generated for bucket g ∈ [N] as P Λ z , where convenience of notation only, the same argument would gp=g p p Λp = ±1 is chosen independently at random with equal apply to any particular value of kzkG (we do not assume probability for +1 and −1. The resulting N-vector has scale-invariance of G). d approximately the same `2 norm as z. Define y ∈ R by yp = G(zp), so that kyk1 = Here we employ also sampling of the coordinates, as kzkG = 1. A large part of our analysis will be related done in the context of estimating earthmover distance to y, although y does not appear in the sketch. Let Z in [40], where each coordinate zp is mapped to a level denote the multiset comprising the coordinates of z, and hp, and the number of coordinates mapped to level h is let Y denote the multiset comprising the coordinates of exponentially small in h: for an integer branching factor y. For Zˆ ⊂ Z, let G(Zˆ) ⊂ Y denote {G(zp) | zp ∈ Zˆ}. P k 1/k b > 1, we expect the number of coordinates at level h to Let kY kk denote [ y∈Y |y| ] , so kY k1 = kyk1. be about a b−h fraction of the coordinates. The number Hereafter multisets will just be called “sets”. of buckets at a given level is N = bcm, where integers Weight classes. For our analysis, fix γ > 1, and m, c > 1 are parameters to be determined later. for integer q ≥ 1, let Wq denote weight class {yp ∈ Y | −q 1−q Our sketching matrix implementing this approach γ ≤ yp ≤ γ }. Nhmax×n h is S ∈ R , where hmax ≡ blogb(n/m)c, and We have βb E[kG(Lh) ∩ Wqk1] = kWqk1. Nh our scaling vector w ∈ R max . The entries of S are For a set of integers Q, let WQ denote ∪q∈QWq. hp Sj,p ← Λp, and the entries of w are wj ← βb , where Defining qmax and h(q). For given ε > 0, consider −hmax 0 d 0 0 β ≡ (b − b )/(b − 1), j ← gp + Nhp, and y ∈ R with yi ← yi when yi > ε/n, and yi ← 0 0 otherwise. Then ky k1 ≥ 1 − n(ε/n) = 1 − ε. Thus (3.6) for some purposes we can neglect Wq for q > qmax ≡ Λp ← ±1 with equal probability logγ (n/ε), up to error . 
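The role of the Lambda-semi-norm can be checked directly: for any bucket contents L, E_Lambda[||L||_Lambda^2] = ||L||_2^2 exactly, and E_Lambda[||L||_Lambda] is within a constant factor of ||L||_2 by Khintchine's inequality. A small Monte Carlo illustration (ours, numpy assumed; the bucket contents are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
L = rng.standard_normal(50) * rng.uniform(0.1, 10.0, size=50)    # contents of one bucket
signs = rng.choice([-1.0, 1.0], size=(100000, 50))               # many draws of Lambda
semi = np.abs(signs @ L)                                         # ||L||_Lambda for each draw

l2 = np.linalg.norm(L)
print(np.mean(semi ** 2) / l2 ** 2)   # -> 1: E[||L||_Lambda^2] = ||L||_2^2
print(np.mean(semi) / l2)             # -> between 1/sqrt(2) and 1 (Khintchine), about sqrt(2/pi) here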
Moreover, we can assume that kWqk ≥ ε/qmax, since the total contribution of weight gp ∈ [N] chosen with equal probability 1 classes of smaller total weight to kyk1 is at most . h ← h with probability 1/βbh for int h ∈ [0, h ], p max Let h(q) denote blogb(|Wq|/βm)c for |Wq| ≥ βm, and zero otherwise, so that all independently. Let Lh be the multiset {zp | hp = h}, and Lh,i the multiset {zp | hp = h, gp = i}; that is, m ≤ E[|G(Lh(q)) ∩ Wq|] ≤ bm for all Wq except those with |Wq| < βm, for which the Lemma 3.2. For h ∈ [hmax], suppose Q ⊂ {q | h(q) = lower bound does not hold. h, |Wq| ≥ βm}, and Wˆ ⊂ Y contains WQ ≡ ∪q∈QWq. Since |Wq| ≤ n for all q, we have h(q) ≤ If |G(Lh) ∩ Wˆ | ≤ εN, then with failure probability at blog (n/βm)c = hmax. 2 ∗ b most 2|Q| exp(−ε m/3), each Wq has Wq ⊂ G(Lh)∩Wq ∗ −1 −h with |W | ≥ (1 − ε)β b |Wq|, and where each entry 3.2 Assumptions About the Parameters. There q of W ∗ is in a bucket with no other element of Wˆ . Also are many minor assumptions about the relations be- q if condition E of Lemma 3.1 holds, then tween various numerical parameters; some of them are collected here for convenience of reference. Recall that ∗ −1 −h kW k ≥ (1 − 4γε)β b kWqk . N = bcm. q 1 1 Proof. We will show that for q ∈ Q, with high probabil- Assumption 3.1. We will assume b ≥ m, b > c, m = −1 −h ity it will hold that aq ≥ (1−ε))β b |Wq|, where aq is Ω(log log(n/ε)), log b = Ω(log log(n/ε)), γ ≥ 2 ≥ β, an the number of buckets G(Lh,i), over i ∈ [N], containing error parameter ε ∈ (0, 1/3), and log N ≤ ε2m. We will a member of W , and no other members of Wˆ . consider γ to be fixed throughout, that is, not dependent q Consider each q ∈ Q in turn, and the members of on the other parameters. Wq in turn, for k = 1, 2, . . . s ≡ |Wq|, and let Zk denote the number of bins occupied by the first k members 3.3 Distribution into Buckets. The entries of y of Wq. The probability that Zk+1 > Zk is at least are well-distributed into the buckets, as the following −1 −h −1 −h β b (1 − |G(Lh) ∩ Wˆ |/N) ≥ β b (1 − ε). We have lemmas describe. −1 −h aq ≥ (1 − ε)β b |Wq| in expectation. Lemma 3.1. For ε ≤ 1, with failure probability at To show that this holds with high probability, let 2 −ε2m ˆ ˆ ˆ most 4qmaxhmax exp(−ε m/3) ≤ C for a constant Zk ≡ E[Zs | Zk]. Then Z1, Z2,... is a Martingale with C > 1, the event E holds, that for all q ≤ qmax with increments bounded by 1, and with the second −1 −h |Wq| ≥ βm, and all h ≤ h(q), that of each increment at most β b . Applying Freed- man’s inequality gives a concentration for aq similar to −1 −h |G(Lh) ∩ Wq| = β b |Wq|(1 ± ε), the above application of Bernstein’s inequality, yielding a failure probability 2 exp(−ε2m/3), and Applying a union bound over all |Q| yields that with 2 −1 −h probability at least 1 − 2|Q| exp(−ε m/3), for each Wq kG(Lh) ∩ Wqk1 = β b kWqk1(1 ± ε). ∗ −1 −h there is Wq of size at least (1 − ε)β b |Wq| such that ∗ Here a = b(1 ± ε) means that |a − b| ≤ ε|b|. each member of Wq is in a bucket containing no other We will hereafter generally assume that E holds. member of Wˆ . For the last claim, we compare the at least (1−ε)X ∗ −1 −h Proof. Let s ≡ |Wq|. When s ≥ βm and h ≤ h(q), in entries of Wq , where X ≡ β b |Wq|, with the at most h ∗ ∗ expectation |G(Lh) ∩ Wq| is equal to µ ≡ s/βb ≥ m, (1 + ε)X − |Wq | entries of G(Lh) ∩ Wq not in Wq , using h and kG(Lh) ∩ Wqk1 ≥ kWqk1/βb . We need that condition E; we have with high probability, deviations from these bounds are kW ∗k −q small. 
q 1 (1 − ε)Xγ ≥ −q 1−q Applying Bernstein’s inequality to the random vari- kG(Lh) ∩ Wqk1 (1 − ε)Xγ + 2εXγ h able Z with binomial B(s, 1/βb ) distribution, the log- ≥ 1 − 2γε/(1 − ε). arithm of the probability that t ≡ Z − E[Z] = Z − µ exceeds µ is at most Using condition E again to make the comparison with kW k , the claim follows. −(µ)2/2 q 1 ≤ −2µ/3 ≤ −2m/3. ¯ ¯ µ + (µ)/3 Lemma 3.3. For h ∈ [hmax], W ⊂ G(Lh), T ≥ kW k∞, and δ ∈ (0, 1), if Taking the exponential, and using a union bound over h all events (including the event that −t exceeds s/βb ) 6kW¯ k N ≥ 1 , completes the first claim, with half the claimed failure T log(N/δ) probability, using Assumption 3.1 to shown that the claimed C exists. For the second claim, there is a similar then with failure probability δ, argument for the random variables X which are equal p 7 to yp when hp = h and yp ∈ Wq, and zero otherwise. ¯ max kG(Lh,i) ∩ W k1 ≤ T log(N/δ). P 2 P h i∈[N] 6 Here p E[Xp ] ≤ p E[Xp] = kWqk1/βb . Proof. This directly follows from Lemma 2 of [9], (which and the claimed inequality follows. Otherwise, follows directly from Bernstein’s inequality), where t P z2/z2 ≥ k/2d, which implies zq ≥zp q p of that lemma is N, T is the same, us:n is W¯ , r is kW¯ k , and δ is δ. The bound for N also uses  1/2 1 h G(z ) z z 2 ¯ 2 ¯ ¯ X q X q X q kW k2 ≤ kW k∞kW k1. ≥ CG ≥ CG   G(zp) zp zp zq ≥zp zq ≥zp zq ≥zp n 3.4 Leverage Scores. The `2 leverage scores u ∈ R p 2 ≥ CG k/2d, have ui ≡ kUi:k2, where U is an orthogonal basis for the columnspace C(A) ≡ {Ax | x ∈ Rd}. We will use the and the claimed inequality follows. standard facts that these values satisfy kuk∞ ≤ 1 and 2 kuk1 ≤ d, and for y ∈ C(A) with kyk2 = 1, yi ≤ ui for 3.5 Contraction bounds. Here we will show that i ∈ [n]. kSzkG,w is not too much smaller than kzkG. We will condition on a likely event involving the top leverage scores. This lemma will be used to bound the 3.5.1 Estimating kzkG using Sz. For v ∈ T ⊂ Z, −q effect of those Wq with |Wq| small and weight γ large. let T − v denote T \{v}.
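The l2 leverage scores of §3.4 can be computed from any orthonormal basis of the column space; the short sketch below (ours, numpy assumed) does so via a QR factorization and checks the facts used above: ||u||_inf <= 1, ||u||_1 <= d (with equality for full column rank), and y_i^2 <= u_i for unit-norm y in C(A).

import numpy as np

def l2_leverage_scores(A):
    """u_i = ||U_{i:}||_2^2 for an orthonormal basis U of the column space of A."""
    U, _ = np.linalg.qr(A)
    return np.sum(U * U, axis=1)

rng = np.random.default_rng(4)
d = 8
A = rng.standard_normal((500, d))
u = l2_leverage_scores(A)
assert np.all(u <= 1 + 1e-9) and abs(u.sum() - d) < 1e-6

y = A @ rng.standard_normal(d)
y /= np.linalg.norm(y)                 # unit-norm vector in the column space
assert np.all(y ** 2 <= u + 1e-9)      # coordinates bounded by the leverage scores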

n×d n Lemma 3.4. For A ∈ R , let u ∈ R denote the `2 Lemma 3.6. For v ∈ T ⊂ Z, leverage score vector of A. For N1,N2 with N2 ≥ N1  kT − vk 2 and with N1N2 ≤ κN, for κ ∈ (0, 1/2), let Y1 and Λ G(kT kΛ) ≥ 1 − G(v), Y2 denote the sets of indices of the N1 and N2 largest |v| coordinates of u, so that Y1 ⊂ Y2. Then with probability −1 and if G(v) ≥ ε kT − vkG, then at least 1 − 2κ, the event Ec holds, that S sends each member of Y1 into a bucket containing no other member kT − vk (3.7) 2 ≤ ε1/α, of Y2. |v| We will hereafter generally assume that E holds. 1/α c and for a constant C, EΛ[G(kT kΛ)] ≥ (1 − Cε )G(v).

Proof. For each member of Y2, the expected number of Proof. For the first claim, if kT kΛ ≥ |v|, then the claim members of Y1 colliding with it, that is, in the same is immediate since G is non-decreasing. Otherwise, bucket with it, is N1/N. The expected number of note that kT kΛ has the form | |v| ± kT − vkΛ|, so if such collisions is therefore at most N1N2/N < κ. The kT kΛ ≤ |v|, then kT kΛ = | |v| − kT − vkΛ|. We have probability that the number of collisions is at least twice α 2 its mean is at most 2κ, so with probability at least 1−2κ, G(kT k ) kT k  kT k  Λ ≥ Λ ≥ Λ the number of collisions is less than 2κ < 1, that is, zero. G(v) |v| |v| |v| − kT − vk 2  kT − vk 2 We use the `2 leverage scores to bound the coor- = Λ = 1 − Λ , dinates of G(z); this is the one place in proving con- |v| |v| traction bounds that we need the linear lower bound of proving the first claim. For the second claim, we have (1.1) on the growth of G. 0 0 0 |v | < |v| for v ∈ T − v, since G(v ) ≤ kT − vkG ≤ εG(v), and G is non-decreasing in |v|. Therefore Lemma 3.5. If up is the k’th largest `2 leverage score, p 0 then for z ∈ C(A), G(zp) ≤ 2d/kkzk /CG. kT − vk X G(v ) G ε ≥ G = G(v) G(v) v0∈T −v Here CG is the growth parameter from (1.1). α 2 X |v0| X |v0| Proof. We have u ≤ d/k, since P u = d. For ≥ ≥ p i i |v| |v| z = Ux ∈ C(A), v0∈T −v v0∈T −v

2 2 2 2 and so (3.7) follows. For the third claim, we have from z2 ≤ (U x)2 ≤ kU k kxk = u kzk ≤ (d/k)kzk . p p∗ p∗ p the first claim, 2 2 2 2 That is, P z /z ≥ k/d. Suppose P z /z ≥ " 2# q q p zq ≤zp q p   kT − vkΛ k/2d. Then EΛ[G(kT k )] ≥ EΛ 1 − G(v) Λ |v| α 2   X G(zq) X zq X zq EΛ[kT − vk ] ≥ ≥ ≥ k/2d, ≥ 1 − 2 Λ G(v). G(zp) zp zp |v| zq ≤zp zq ≤zp zq ≤zp Using the Khintchine inequality and (3.7), we have Lemma 3.8. Assume that condition E of Lemma 3.1 holds, and that condition Ec of Lemma 3.4 holds for EΛ[kT − vk ] CkT − vk −2 −2 2 0 0 Λ 2 1/α N1 = N2 = O(CG ε dm ). Let Qh ≡ {q | q ≤ Mh}, ≤ ≤ Cε , 0 h+1 2 |v| |v| where Mh ≡ logγ (βb m qmax). Then there is N = 2 2 −1 O(N1 + m bε qmax) so that with probability at least for a constant C, so the claim follows, after adjusting 2 1 − C−ε m for a constant C > 1, for each q ∈ Q∗, there constants. ∗ is Wq ⊂ Lh(q) ∩ Wq such that: We will need a lemma that will allow bounds on 1. |W ∗| ≥ (1 − ε)β−1b−h(q)|W |; the contributions of the weight classes. First, some q q notation. For h = 0 . . . hmax, let ∗ 2. each x ∈ Wq is in a bucket with no other member of WQ∗ ; Qˆh ≡ {q | h(q) = h, |Wq| ≥ βm} M ≡ log (2(1 + 3ε)b/ε) 3. kW ∗k ≥ (1 − 4γε)β−1b−hkW k . ≥ γ q 1 q 1

Qh ≡ {q ∈ Qˆh | q ≤ M≥ + min q} ∗ ˆ 4. for q ∈ Qh, each x ∈ Wq is in a bucket with no (3.8) q∈Qh member of W 0 ; Qh M< ≡ logγ (m/ε) = O(logγ (b/ε))

Q< ≡ {q | |Wq| < βm, q ≤ M<} Proof. There is N1 satisfying the given bound so that ∗ Q ≡ Q< ∪ [∪hQh]. Lemma 3.5 implies that y∈ / Y1 must be smaller than −1p CG 2d/N1 ≤ ε/m, and therefore not in Wq for q ∈ Q . Therefore W ⊂ Y , and with the assumption of Here Qˆh gives the indices of Wq that are “large” and < Q< 1 have h as the level at which between m and bm members condition Ec, no member of WQ< is in the same bucket as any other member of that set. We will take W ∗ ← W of Wq are expected in Lh. The set Qh cuts out the q q weight classes that can be regarded as negligible at level for q ∈ Q<. h. For each h, apply Lemma 3.2 to Qh and with ˆ ∗ W ← WQ ≡ WQ< ∪q∈Qh Wq, so that, using condition Lemma 3.7. Using Assumption 3.1 and assuming con- E, P dition E of Lemma 3.1, q∈Q∗ kWqk1 ≥ 1 − 5ε. |G(Lh) ∩ Wˆ | ≤ M<βm + M≥(1 + ε)bm Proof. The total weight of those weight classes with = O(mb logγ (b/ε)). |Wq| ≤ βm and q > M< is at most −1 ˆ X X To apply Lemma 3.2, we need N > ε |G(Lh) ∩ W |, βm γ1−q ≤ βm(ε/m)γ γ−q ≤ εβ/(1−1/γ) ≤ 4ε, −1 and large enough N in O(mbε logγ (b/ε)) suffices for q>M< g>0 this. We have (1) and (2), with failure probability 2M exp(−ε2m). for γ ≥ 2 and β ≤ 2. ≥ ∗ Condition (3) follows either trivially, for q ∈ Q<, or For given h > 0, let q ≡ min ˆ q. The ratio of h q∈Qh from Lemma 3.2. ˆ −M 0 the total weight of classes in Qh \ Qh to kWq∗ k is at For (4), let Wˆ ← W ∪ W 0 . Since |W 0 |γ h ≤ h 1 Qh Qh Qh most h+1 2 kyk ≤ 1, so that |W 0 | ≤ βb m q , we have 1 Qh max 1 ∗ −qh−M≥ X 1−q ∗ γ (1 + ε)bmγ −q ˆ 0 (1 − ε)γ h m |G(Lh) ∩ W | ≤ |G(Lh) ∩ WQh | + |G(Lh) ∩ WQ | q>0 h −1 −h ≤ (1 + ε)bmM≥ + (1 + ε)β b |W 0 | ε 1 + ε X Qh = b γ−q 2b(1 + 3ε) 1 − ε ≤ O(bm2q ), q≥0 max ε 1 + ε 1 = using condition E. Since |G(Lh) ∩ Wˆ | ≤ εN for large 2(1 + 3ε) 1 − ε 1 − 1/γ 2 −1 enough N = O(m bε qmax), we can apply Lemma 3.2 ≤ ε, to obtain (4). P under the assumptions on γ and ε. So kW ˆ k ≤ + h Qh\Qh 1 Lemma 3.9. Let G : R 7→ R as above. Assume P εkW ∗ k ≤ ε. h qh 1 that condition E of Lemma 3.1 holds, and Assump- Putting together the bounds for the two cases, the tion 3.1, and that condition Ec of Lemma 3.4 holds for total is at most 5ε, as claimed. −2 −2 2 2 N1 = N2 = O(CG ε dm ). There is N = O(N1 + −2 2 −1 ∗ ε m bqmax), so that for h ∈ [hmax] and q ∈ Qh with (1 − 2kL(v) − vkΛ/|v|)G(v). Writing V ≡ G (Wq ), kWqk1 ≥ ε/qmax, we have we have X X 1/α G(kL(v)kΛ) G(kL(yp)kΛ) ≥ (1 − ε )kWqk1 ∗ v∈V yp∈Wq X X  kL(v) − vk  ≥ G(v) + 1 − 2 Λ G(v) −ε2m |v| with failure probability at most C for fixed C > 1. v∈V v∈V kL(v)kΛ>|v| kL(v)kΛ≤|v| Proof. For any q ∈ Qh we have X kL(v) − vk ≥ kW ∗k − 2 Λ γ1−q. q 1 h |v| |Wq| ≤ (1 + ε)βb E[|G(Lh) ∩ Wq|] v∈V kL(v)kΛ≤|v| ≤ (1 + ε)βbhbm It remains to upper bound the sum. Since by condition E and the definition of h(q) = h; since kL(v)kΛ = ||v|±t|, where t ≡ kL(v) − vkΛ, if kL(v)kΛ ≤ |v|, then t ≤ 2|v|. 1−q |Wq|γ ≥ kWqk1 ≥ ε/qmax, Since E[t | t ≤ 2|v|] ≤ E[t] ≤ C kL(v) − vk ≤ C C0ε1/α|v|, using kWqk1 ≥ ε/qmax from the lemma statement, we 1 2 1 have for any yp ∈ Wq, (3.9) using Khintchine’s inequality and (3.7), and similarly −q h+1 E[t2|t ≤ 2|v|] ≤ C (C0ε1/α)2v2, we can use Bernstein’s yp ≥ γ ≥ (ε/qmax)/γ|Wq| ≥ ε/b γβm(1 + ε)qmax. 2 inequality to bound

Condition 4 of Lemma 3.8 holds, since N1,N2, and X kL(v) − vkΛ 1−q X 1/α 1−q N are large enough, and so we have that no bucket γ ≤ C3ε γ ∗ |v| containing yp ∈ Wq contains an entry larger than v∈V v∈V kL(v)k ≤|v| kL(v)k ≤|v| h+1 2 ¯ Λ Λ γ/βb m qmax, so if W comprises G(Lh) ∩ (Y \ WQ0 ), h 1/α ∗ ¯ h+1 2 ≤ C4ε kW k , we have kW k∞ ≤ γ/βb m qmax. Using condition E, q 1 ¯ −h kW k1 ≤ (1 + ε)b , using just the condition kY k1 = 1. −2 with failure probability exp(−ε2m). Hence Therefore the given N is larger than the O(bmε qmax) 2 needed for Lemma 3.3 to apply, with δ = exp(−ε m). X ∗ G(kL(v))k ) This with (3.9) yields that for each yp ∈ Wq , the Λ v∈V remaining entries in its bucket L have kL − ypk1 ≤ 2γ2ε|y |, with failure probability exp(−ε2m). ≥ kW ∗k − 2C ε1/αkW ∗k p q 1 4 q 1 For each such isolated yp we consider the corre- ∗ 1/α = kWq k (1 − 2C4ε ) sponding zp (denoted by v hereafter), and let L(v) 1 −1 −h 1/α denote the set of z values in the bucket containing ≥ β b kWqk1(1 − 4γε)(1 − 2C4ε ), v. We apply Lemma 3.6 to v with L(v) taking the role of T , and 2γ2ε taking the role of ε, obtaining using condition E. Adjusting constants, the result 0 1/α follows. EΛ[G(kL(v)kΛ)] ≥ (1 − C ε )G(v). (Here we fold a factor of (2γ2)1/α into C0, recalling that we consider γ to be fixed.) Using this relation and condition E, we Lemma 3.10. Assume that condition E of Lemma 3.1 holds, and Assumption 3.1, and Ec of Lemma 3.4 holds have −2 −2 2 for large enough N1 = O(CG ε dm ) and N2 = h ∗ −2 2α 4+α 4α−4 2+2α kW k ≤ βb kW k /(1 − 4γε) from Lem 3.8.3 O(CG d(ε m + ε m )). Then for q ∈ Q<, q 1 q 1

h X EΛ[G(kL(v)kΛ)] X 1/α ≤ βb , kG(L(v))k ≥ (1 − ε )kWqk 0 1/α Λ 1 (1 − 4γε)(1 − C ε ) −1 ∗ v∈G (Wq ) G(v)∈Wq

2 so the claim of the lemma follows, in expectation, after with failure probability at most C−ε m for a constant adjusting constants, and conditioned on events of failure C > 1. 2 probability C−ε m for constant C. To show the tail estimate, we relate each Proof. For all yp ∈ WQ< , we have G(kL(v)kΛ) to G(v) via the first claim of ε Lemma 3.6, which implies G(kL(v)k ) ≥ yp ≥ . Λ m Let Combining these lemmas, we have the following ε−α ε2−2α γ0 ≡ min{ , }, contraction bound. 3εαm2+α/2 m1+α 02 2 so that N2 of the lemma statement is at least 2d/γ CG. Theorem 3.2. Assume condition E of Lemma 3.1 Then condition Ec and Lemma 3.5 imply that every holds, and Assumption 3.1, and condition Ec of member of W is in a bucket with no entry other than −2 −2 2 q Lemma 3.4 holds for N1 = O(CG ε dm ) and N2 = itself larger than γ0. −2 2α 4+α 4α−4 2+2α O(CG d(ε m +ε m )), with N = O(N1N2+ Assume for the moment that all h = 0, that is, all −2 2 1/α p ε m bqmax). Then kSzkG,w ≥ kzkG(1 −  ), with 2 values are mapped to level 0. We apply Lemma 3.3 to failure probability no more than C−ε m, for absolute 2 ¯ h = 0, with δ ≡ exp(−ε m) and with W ≡ Y \ Y2, so C > 1. that ε−α  1 α ε 1 kW¯ k ≤ γ0 ≤ = √ . Proof. (We note that c, b, and m can be chosen such ∞ 3m2+α/2 ε1−1/α m 3m ε2m that the relations among these quantities and also N = cbm satisfy Assumption 3.1, up to the weak relations The result is that with large enough N = among m, b, and n/ε, which ultimately will require that O(m1+α/2ε−2+α), and assuming log N ≤ ε2m, so that n is not extremely large relative to d.) log(N/δ) ≤ 2ε2m, we have for v with G(v) = y ∈ W , p q Recalling Q∗ from (3.8), let Q∗∗ ≡ {q | q ∈  α ∗ 7 1 2ε Q , kWqk1 ≥ ε/qmax}. Assuming conditions E and Ec, kG(L(v)) ∩ W¯ k ≤ √ 2 1 6 ε1−1/α m 3m we have, with probability 1 − C−ε m,  1 α ≤ √ |y |, X h 1−1/α p ε m kSzkG,w = βb G(kLh,ikΛ) Def. h,i that is, X ≥ βbh(q)G(kL(v)k ) Lem 3.8 α Λ kL(v) − vk  1  q∈Q∗∗,v∈W ∗ G ≤ √ , q G(v) ε1−1/α m X ≥ βbh(q)(1 − ε1/α)kW ∗k Lems 3.9, 3.10 q 1 so that from (3.7), we have q∈Q∗∗ X ≥ (1 − ε1/α)(1 − 4γε)kW k Lem 3.8. kL(v) − vk2 ε2/α q 1 (3.10) 2 ≤ . q∈Q∗∗ v2 ε2m Since Using Lemma 3.7, 2−2α  α ¯ ε 1 ε X X kW k∞ ≤ = , m1+α mε2−1/α m kWqk1 ≥ −qmax(ε/qmax) + kWqk1 ≥ 1 − 6ε. q∈Q∗∗ q∈Q∗ we also have, for all v0 ∈ L(v) − v, and using that G(v) ≥ ε/m, Adjusting constants gives the result.

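The isolation step above is easy to see numerically: once the other entries in a bucket have small $\ell_2$ mass relative to the isolated entry, random signs make their contribution cancel, so the bucket value is dominated by the isolated entry. The toy check below treats the $\Lambda$-semi-norm as a signed sum with Rademacher signs, which is an assumption suggested by the Khintchine and Bernstein steps; the magnitudes used are made up for illustration and are not the parameters of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
v = 1.0                                            # the isolated large entry in its bucket
small = 3e-4 * rng.uniform(-1.0, 1.0, size=10_000) # remaining entries: tiny l2 mass vs. |v|

print("l1 mass of the small entries   :", np.abs(small).sum())        # comparable to |v|
signed = [abs(np.dot(rng.choice([-1.0, 1.0], size=small.size), small)) for _ in range(200)]
print("largest signed sum (200 trials):", max(signed))                 # stays << |v| = 1
```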
Combining these lemmas, we have the following contraction bound.

Theorem 3.2. Assume condition $E$ of Lemma 3.1 holds, and Assumption 3.1, and condition $E_c$ of Lemma 3.4 holds for $N_1 = O(C_G^{-2}\varepsilon^{-2} d m^2)$ and $N_2 = O(C_G^{-2} d(\varepsilon^{2\alpha} m^{4+\alpha} + \varepsilon^{4\alpha-4} m^{2+2\alpha}))$, with $N = O(N_1 N_2 + \varepsilon^{-2} m^2 b\, q_{\max}^{1/\alpha})$. Then $\|Sz\|_{G,w} \ge \|z\|_G(1 - \varepsilon^{1/\alpha})$, with failure probability no more than $C^{-\varepsilon^2 m}$, for absolute $C > 1$.

Proof. (We note that $c$, $b$, and $m$ can be chosen such that the relations among these quantities and also $N = cbm$ satisfy Assumption 3.1, up to the weak relations among $m$, $b$, and $n/\varepsilon$, which ultimately will require that $n$ is not extremely large relative to $d$.)

Recalling $Q^*$ from (3.8), let $Q^{**} \equiv \{q \mid q \in Q^*,\ \|W_q\|_1 \ge \varepsilon/q_{\max}\}$. Assuming conditions $E$ and $E_c$, we have, with probability $1 - C^{-\varepsilon^2 m}$,
$$\|Sz\|_{G,w} = \sum_{h,i} \beta b^h G(\|L_{h,i}\|_\Lambda) \qquad \text{(Def.)}$$
$$\ge \sum_{q\in Q^{**},\, v\in W_q^*} \beta b^{h(q)} G(\|L(v)\|_\Lambda) \qquad \text{(Lem. 3.8)}$$
$$\ge \sum_{q\in Q^{**}} \beta b^{h(q)} (1 - \varepsilon^{1/\alpha}) \|W_q^*\|_1 \qquad \text{(Lems. 3.9, 3.10)}$$
$$\ge (1 - \varepsilon^{1/\alpha})(1 - 4\gamma\varepsilon) \sum_{q\in Q^{**}} \|W_q\|_1 \qquad \text{(Lem. 3.8)}.$$
Using Lemma 3.7,
$$\sum_{q\in Q^{**}} \|W_q\|_1 \ \ge\ -q_{\max}(\varepsilon/q_{\max}) + \sum_{q\in Q^*} \|W_q\|_1 \ \ge\ 1 - 6\varepsilon.$$
Adjusting constants gives the result.

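For concreteness, the quantity bounded in Theorem 3.2 (and in the dilation bounds of the next subsection) is the level-and-bucket estimate $\|Sz\|_{G,w} = \sum_{h,i}\beta b^h G(\|L_{h,i}\|_\Lambda)$. The snippet below is a minimal toy evaluation of that estimate, with the Huber cost playing the role of $G$, level-$h$ subsampling at rate $b^{-h}$, and random signs standing in for the $\Lambda$-semi-norm; the parameters $b$, $m$, $\beta$, $h_{\max}$ and the hashing scheme are illustrative choices, not the paper's settings.

```python
import numpy as np

def huber(x, tau=1.0):
    """Huber cost G(x): quadratic near 0, linear in the tails."""
    x = np.abs(x)
    return np.where(x <= tau, x**2 / (2 * tau), x - tau / 2)

def m_sketch_estimate(z, G, b=2.0, hmax=None, m=64, beta=1.0, seed=0):
    """Toy evaluation of sum_{h,i} beta * b^h * G(||L_{h,i}||_Lambda).

    Level h keeps each coordinate with probability b^{-h}, hashes survivors
    into m buckets, and collapses each bucket with random signs (a stand-in
    for the Lambda semi-norm)."""
    rng = np.random.default_rng(seed)
    n = len(z)
    if hmax is None:
        hmax = int(np.ceil(np.log(max(n, 2)) / np.log(b)))
    total = 0.0
    for h in range(hmax + 1):
        idx = np.flatnonzero(rng.random(n) < b ** (-h))   # subsample at rate b^{-h}
        if idx.size == 0:
            continue
        bucket = rng.integers(0, m, size=idx.size)        # hash into m buckets
        signs = rng.choice([-1.0, 1.0], size=idx.size)
        sums = np.zeros(m)
        np.add.at(sums, bucket, signs * z[idx])           # signed bucket sums
        total += beta * (b ** h) * G(np.abs(sums)).sum()
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = rng.standard_t(df=2, size=10_000)                 # heavy-tailed test vector
    print("||z||_G        =", huber(z).sum())
    print("sketch estimate =", m_sketch_estimate(z, huber))
```

On inputs like this, the estimate typically lands within a modest factor of $\|z\|_G$, which is the qualitative behavior the contraction and dilation bounds describe; the toy code of course proves nothing.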
3.6 Dilation bounds. We prove two bounds for dilation, where the first gives a dilation that is at most a log factor, and the second gives a constant factor by using a different way to estimate distance based on the sketch.

3.6.1 Bound for $\|Sz\|_{G,w}$. Our first bound for dilation is $E[\|Sz\|_{G,w}] = O(h_{\max})\|z\|_G$, which implies a tail bound via Markov's inequality; first, some lemmas.

Lemma 3.11. For $T \subset Z$, $E_\Lambda[G(\|T\|_\Lambda)] \le CG(\|T\|_2)$, for an absolute constant $C$.

Proof. Let $\mathcal L$ denote the event that $\|T\|_\Lambda \ge \|T\|_2$. Here the expectation is with respect to $\Lambda$ only:
$$E_\Lambda[G(\|T\|_\Lambda)] = E[G(\|T\|_\Lambda) \mid \mathcal L]\,P\{\mathcal L\} + E[G(\|T\|_\Lambda) \mid \neg\mathcal L]\,P\{\neg\mathcal L\}$$
$$\le E\left[\frac{\|T\|_\Lambda^\alpha}{\|T\|_2^\alpha}\, G(\|T\|_2) \,\Big|\, \mathcal L\right] P\{\mathcal L\} + G(\|T\|_2)$$
$$= E[\|T\|_\Lambda^\alpha \mid \mathcal L]\,P\{\mathcal L\}\,\frac{G(\|T\|_2)}{\|T\|_2^\alpha} + G(\|T\|_2)$$
$$\le E[\|T\|_\Lambda^\alpha]\,\frac{G(\|T\|_2)}{\|T\|_2^\alpha} + G(\|T\|_2)$$
$$\le CG(\|T\|_2),$$
for a constant $C$, where the last inequality uses Khintchine.

Lemma 3.12. For $T \subset Z$, $G(\|T\|_2) \le \|T\|_G$, and so $E_\Lambda[G(\|T\|_\Lambda)] \le C\|T\|_G$.

Proof. Using the growth upper bound for $G$,
$$\frac{\|T\|_G}{G(\|T\|_2)} = \sum_{z_p\in T} \frac{G(z_p)}{G(\|T\|_2)} \ \ge\ \sum_{z_p\in T} \left(\frac{z_p}{\|T\|_2}\right)^{\!\alpha} \ \ge\ \sum_{z_p\in T} \frac{z_p^2}{\|T\|_2^2} \ =\ 1.$$
The last claim follows from this and the previous lemma.

Theorem 3.3. Assuming condition $E$ of Lemma 3.1, $E_{g,\ell,\Lambda}[\|Sz\|_{G,w}] = O(h_{\max})\|z\|_G$.

Proof. Note that for each level $h$, $\sum_i E_\Lambda[G(\|L_{h,i}\|_\Lambda)] \le C\|L_h\|_G$, applying the previous lemma. Since $\|L_h\|_G = \|G(L_h)\|_1 = (1\pm\varepsilon)\|y\|_1/\beta b^h$ under assumption $E$, we have
$$E_{g,\ell,\Lambda}[\|Sz\|_{G,w}] = \sum_h \beta b^h \sum_i E_\Lambda[G(\|L_{h,i}\|_\Lambda)] \ \le\ \sum_h \beta b^h\, C(1+\varepsilon)\|y\|_1/\beta b^h \ =\ \sum_h C(1+\varepsilon)\|y\|_1 \ =\ h_{\max}\, C(1+\varepsilon)\|z\|_G,$$
and the theorem follows, picking bounded $\varepsilon$.

3.6.2 Bound for a "clipped" version. We can achieve a better dilation than $O(h_{\max}) = O(\log(\varepsilon n/d))$ by ignoring small buckets, using a subset of the coordinates of $Sz$, as follows: for a given sketch, our new estimate $\|Sz\|_{Gc,w}$ of $\|z\|_G$ is obtained by adding in only those buckets in level $h$ that are among the top
$$M^* \equiv bmM_\ge + \beta m M_< = O(mb\log_\gamma(b/\varepsilon))$$
in $\Lambda$-semi-norm, recalling $M_\ge$ and $M_<$ defined in (3.8). That is,
$$\|Sz\|_{Gc,w} \equiv \sum_j \beta b^j \sum_{i\in[M^*]} G(\|L_{j,(i)}\|_\Lambda),$$
where $L_{j,(i)}$ denotes the level $j$ bucket with the $i$'th largest $\Lambda$-semi-norm among the level $j$ buckets.

The proof of the bounded contraction of $\|Sz\|_{G,w}$, Theorem 3.2, only requires lower bounds on $G(\|L_{h,i}\|_\Lambda)$ for those at most $M^*$ buckets on level $h$ containing some member of $W_q^*$ for $q \in Q^*$, for the $W_q^*$ defined in Lemma 3.8. Thus if the estimate of $\|Sy\|_{G,w}$ uses only the largest such buckets in $\Lambda$-norm, the proven bounds on contraction continue to hold, and in particular $\|Sz\|_{Gc,w} \ge (1-\varepsilon)\|Sz\|_{G,w}$. Moreover, the dilation of $\|\cdot\|_{Gc,w}$ is constant:

Theorem 3.4. There is $c = O(\log_\gamma(b/\varepsilon)(\log_b(n/m)))$ and $b \ge c$, recalling $N = mbc$, such that
$$E[\|Sz\|_{Gc,w}] \le C\|z\|_G$$
for a constant $C$.

Proof. From Lemma 3.12, the contribution of level $h$ satisfies
$$(3.12)\qquad E\Big[\sum_i G(\|L_{h,i}\|_\Lambda)\Big] \ \le\ C\|L_h\|_G \ =\ C\|G(L_h)\|_1.$$
We will consider the contribution of each weight class separately. The contribution of $W_q$ at $h = h(q)$ is $\beta b^h\|G(L_h)\cap W_q\|_1 \le \|W_q\|_1(1+\varepsilon)$, if all entries of $W_q$ land among the top $M^*$ buckets; otherwise the contribution will be smaller.

The expected contribution of $W_q$ at $h = h(q) - k$ for $k > 0$ is at most $M^*|L_{h,i^*}\cap W_q|\gamma^{1-q}$, where $L_{h,i^*}$ contains the largest number of members of $W_q$ among the buckets on level $h$. When $|G(L_h)\cap W_q| \ge N\log N$, $|G(L_{h,i^*})\cap W_q| \le 4|G(L_h)\cap W_q|/N$, with failure probability at most $1/N$. (This follows by applying Bernstein's inequality to the sum of random variables $X_i$, where $X_i = 1$ when the $i$'th element of $W_q$ falls in a given bucket, and $X_i = 0$ otherwise, followed by a union bound over the buckets.) At level $h = h(q) - k$, $|G(L_h)\cap W_q| \ge b^k m(1-\varepsilon)$, using assumption $E$ of Lemma 3.1, so to obtain $b^k m(1-\varepsilon) \ge N\log N$ it suffices that $k \ge 2 + 2\log_b c \ge \log_b(N\log(N)/m(1-\varepsilon))$, using $N = bcm$, obtaining for those $k$ a contribution for $W_q$ that is within a constant factor of
$$\beta b^h M^*\,(4\beta^{-1}b^{-h}|W_q|/N)\,\gamma^{1-q} \ \le\ \frac{O(\log_\gamma(b/\varepsilon))}{c}\,\|W_q\|_1,$$
using the bound on $M^*$ given above. Adding this contribution to that for $k \le 2 + 2\log_b c$, we obtain an overall bound for $W_q$ and $h < h(q)$ that is within a constant factor of $(1 + \log_b c + h_{\max}\frac{M^*}{N})\|W_q\|_1$, and therefore within a constant factor of $\|W_q\|_1$ under the given conditions on $b$ and $c$.

For $h = h(q) + k$ for $k > 0$, the expected size of $G(L_h)\cap W_q$ is at most $m/b^{k-1}$; this quantity is also an upper bound for the probability that $G(L_h)\cap W_q$ is non-empty. Thus for the $q_{\max}$ non-negligible sets $W_q$, by a union bound the event $E_s$ holds with failure probability $\delta$, that all $W_q\cap L_{h(q)+k}$ will be empty for large enough $k = O(\log_b q_{\max} m/\delta)$. For each $q$ and $k$, condition $E$ implies that the contribution $\beta b^h\sum_i G(\|L_{h,i}\|_\Lambda) \le (1+\varepsilon)\|W_q\|_1$, and so the total contribution is $C\|W_q\|_1\log_b(q_{\max}m/\delta)$, within a constant factor of $\|W_q\|_1$, under given conditions.

Note that if $G$ is convex, then so is $\|Sz\|_{Gc,w}$, since at each level we are applying a Ky Fan norm, discussed below; also, if $G^{-1}(\|\cdot\|_G)$ is scale-invariant, then so is $G^{-1}(\|\cdot\|_{Gc,w})$. If both conditions hold, then $G^{-1}(\|\cdot\|_G)$ is a norm, and so is $G^{-1}(\|\cdot\|_{Gc,w})$.

The Ky Fan $k$-norm of a vector $y \in \mathbb R^n$ is $\sum_{i\in[k]}|y_{(i)}|$, where $y_{(i)}$ denotes the $i$'th largest entry of $y$ in magnitude. Thus the Ky Fan 1-norm of $y$ is $\|y\|_\infty$, and the Ky Fan $n$-norm of $y$ is $\|y\|_1$. The matrix version of the norm arises by application to the vector of singular values.

A disadvantage of this approach is that some smoothness is sacrificed: $\|z\|_{Gc,w}$ is not a smooth function, even if $G$ is; while this does not affect the fact that the minimization problem in the sketch space is polynomial time, it could affect the concrete polynomial time complexity, which we leave a subject for future work.

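The clipped estimate is a per-level Ky Fan-style truncation of the bucket values. The snippet below is a minimal illustration: `ky_fan` computes the Ky Fan $k$-norm of a vector, and `clipped_estimate` sums, per level, $G$ applied only to the largest bucket $\Lambda$-semi-norms. The bucket values, the Huber choice of $G$, and the value `m_star` used for $M^*$ are illustrative stand-ins, not the paper's construction.

```python
import numpy as np

def ky_fan(y, k):
    """Ky Fan k-norm: sum of the k largest entries of y in magnitude."""
    a = np.sort(np.abs(np.asarray(y, dtype=float)))[::-1]
    return a[:k].sum()

def clipped_estimate(bucket_seminorms, G, b=2.0, beta=1.0, m_star=8):
    """sum_j beta * b^j * sum over the top-m_star buckets of G(||L_{j,(i)}||_Lambda).

    bucket_seminorms[j] holds the Lambda semi-norms of the level-j buckets;
    only the m_star largest per level contribute (a Ky Fan-style truncation)."""
    total = 0.0
    for j, level in enumerate(bucket_seminorms):
        level = np.sort(np.abs(np.asarray(level, dtype=float)))[::-1]
        total += beta * (b ** j) * G(level[:m_star]).sum()
    return total

if __name__ == "__main__":
    G = lambda x: np.where(np.abs(x) <= 1, x**2 / 2, np.abs(x) - 0.5)   # Huber, tau = 1
    print(ky_fan([3.0, -1.0, 2.0, 0.5], k=2))    # 5.0 = |3| + |2|
    rng = np.random.default_rng(0)
    levels = [rng.normal(size=64) * (2.0 ** -j) for j in range(5)]
    print(clipped_estimate(levels, G))
```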
4 Net Argument.

We prove a general $\varepsilon$-net argument for M-estimators satisfying our growth condition (1.1). We need a few lemmas to develop the net argument.

Lemma 4.1. (Bounded Derivative) There is a constant $C > 0$ for which, for any $a, b$ with $|b| \le \varepsilon|a|$, $G(a + b) = (1 \pm C\varepsilon)G(a)$.

Proof. First suppose that $a$ and $b$ have the same sign. Then by monotonicity, $G(a) \le G(a + b)$. Moreover, by the growth condition,
$$\frac{G(a+b)}{G(a)} \ \le\ \left(\frac{a+b}{a}\right)^{\!2} \ \le\ (1+\varepsilon)^2 \ \le\ 1 + 3\varepsilon,$$
and so $G(a + b) \le (1 + 3\varepsilon)G(a)$.

Now suppose $a$ and $b$ have the opposite sign. Then $G(a + b) \le G(a)$ by monotonicity, and again by the growth condition,
$$\frac{G(a)}{G(a+b)} \ \le\ \left(\frac{a}{a+b}\right)^{\!2} \ \le\ (1+2\varepsilon)^2 \ \le\ 1 + 5\varepsilon,$$
and so $G(a + b) \ge G(a)/(1 + 5\varepsilon)$, and so $G(a + b) \ge (1 - 5\varepsilon)G(a)$.

Lemma 4.2. (Approximate Scale Invariance) For all $a$ and $C \ge 1$, $G(Ca) \le C^2 G(a)$.

Proof. By the growth condition, $G(Ca)/G(a) \le C^2$.

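Lemmas 4.1 and 4.2 are easy to check numerically for a concrete $G$ satisfying the growth condition; the snippet below does so for the Huber cost (threshold $\tau = 1$, an illustrative choice), confirming that relative perturbations of size $\varepsilon$ move $G$ by $O(\varepsilon)$ and that $G(Ca) \le C^2 G(a)$.

```python
import numpy as np

def G(x, tau=1.0):
    """Huber cost, one G satisfying the linear/quadratic growth condition."""
    x = np.abs(x)
    return np.where(x <= tau, x**2 / (2 * tau), x - tau / 2)

rng = np.random.default_rng(0)
eps = 0.01

# Lemma 4.1 (Bounded Derivative): |b| <= eps*|a| implies G(a+b) = (1 +/- C*eps) G(a).
a = rng.standard_cauchy(100_000)
a = a[np.abs(a) > 1e-9]                         # avoid 0/0 in the ratio below
b = eps * a * rng.uniform(-1.0, 1.0, size=a.shape)
ratio = G(a + b) / G(a)
print("Lemma 4.1 ratio range:", ratio.min(), ratio.max())   # stays within 1 +/- O(eps)

# Lemma 4.2 (Approximate Scale Invariance): G(C*a) <= C^2 * G(a) for C >= 1.
C = 10.0
print("Lemma 4.2 holds:", bool((G(C * a) <= C**2 * G(a) + 1e-12).all()))
```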
Lemma 4.3. (Perturbation of the weighted M-Estimator) There is a constant $C' > 0$ for which, for any $e$ and any $w$ with $\|e\|_{G,w} \le \varepsilon^5\|y\|_{G,w}$,
$$\|y + e\|_{G,w} = (1 \pm C'\varepsilon)\|y\|_{G,w}.$$

Proof. By Lemma 4.2, $G(\tfrac{1}{\varepsilon^2}e_i) \le \tfrac{1}{\varepsilon^4}G(e_i)$, and so $\|\tfrac{1}{\varepsilon^2}e\|_{G,w} \le \tfrac{1}{\varepsilon^4}\|e\|_{G,w} \le \varepsilon\|y\|_{G,w}$, where the final inequality follows by the assumption of the lemma.

Now let $S \subseteq [n]$ denote those coordinates $i$ for which $|e_i| \le \varepsilon|y_i|$. By Lemma 4.1, $G(y_i + e_i) = (1 \pm C\varepsilon)G(y_i)$. Now consider an $i \in [n]\setminus S$. In this case $|y_i| \le \varepsilon(\tfrac{1}{\varepsilon^2}|e_i|)$. Using that $G$ is monotonically non-decreasing and applying Lemma 4.1 again,
$$G(e_i + y_i) \ \le\ G\!\left(\frac{1}{\varepsilon^2}e_i + y_i\right) \ =\ (1 \pm C\varepsilon)\,G(e_i/\varepsilon^2),$$
so that
$$\sum_{i\in[n]\setminus S} w_i G(y_i + e_i) \ \le\ (1 + C\varepsilon)\|e/\varepsilon^2\|_{G,w} \ \le\ (1 + C\varepsilon)\varepsilon\|y\|_{G,w}.$$
Again using that $G$ is monotonically non-decreasing, we note that
$$\sum_{i\in[n]\setminus S} w_i G(y_i) \ \le\ \sum_{i\in[n]\setminus S} w_i G(e_i/\varepsilon) \ \le\ \|e/\varepsilon\|_{G,w} \ \le\ \|e/\varepsilon^2\|_{G,w} \ \le\ \varepsilon\|y\|_{G,w}.$$
Hence,
$$\|y + e\|_{G,w} = \sum_{i\in S} w_i G(y_i + e_i) + \sum_{i\in[n]\setminus S} w_i G(y_i + e_i) = (1 \pm C\varepsilon)\sum_{i\in S} w_i G(y_i) \pm (1 + C\varepsilon)\varepsilon\|y\|_{G,w}$$
$$= (1 \pm O(\varepsilon))\sum_{i\in[n]} w_i G(y_i) \pm (2 + C\varepsilon)\varepsilon\|y\|_{G,w} = (1 \pm O(\varepsilon))\|y\|_{G,w}.$$
This completes the proof.

Lemma 4.4. (Relation of weighted M-Estimator to 2-Norm) Suppose $w_i \ge 1$ for all $i$. Given an $n\times d$ matrix $A$ and an $n\times 1$ column vector $b$, let $c = \min_x\|Ax - b\|_2$ (note the norm is the 2-norm here). Let $y^* = Ax^* - b$, where $x^* = \operatorname{argmin}_x\|Ax - b\|_{G,w}$. Then $c \le \|y^*\|_2 \le \kappa c n^{3/2}\|w\|_\infty$, where $\kappa > 0$ is a sufficiently large constant.

Proof. If $c = 0$, then there exists an $x$ for which $Ax = b$. In this case, since $G(0) = 0$, it follows that $\|y^*\|_2 = 0$. Now suppose $c > 0$ and let $y$ be a vector of the form $Ax - b$ of minimal 2-norm. Since $\|y\|_2 = c$, each coordinate of $y$ is at most $c$. Hence, $\|y\|_{G,w} \le \|w\|_\infty \cdot n\cdot G(c)$ by monotonicity of $G$.

Now consider the 2-norm of $y^*$, and let $d = \|y^*\|_2$. By definition, $d \ge c$. Moreover, there exists a coordinate of $y^*$ of absolute value at least $d/\sqrt n$. Hence, by monotonicity of $G$ and using that $w_i \ge 1$ for all $i$, $\|y^*\|_G \ge G(d/\sqrt n)$. Since $y^*$ is the minimizer for $G$ with weight vector $w$, necessarily $G(d/\sqrt n) \le \|w\|_\infty\cdot n\cdot G(c)$. If $d/\sqrt n \le c$, the lemma follows. Otherwise, by the lower bound on the growth condition for $G$, $G(d/\sqrt n) \ge G(c)\cdot C_G d/(c\sqrt n)$, and so $C_G d/(c\sqrt n) \le \|w\|_\infty\cdot n$. Hence, $d \le \|w\|_\infty n^{3/2} c/C_G$.

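A quick numerical illustration of Lemma 4.3, again with the Huber cost standing in for $G$ and arbitrary random data: once the perturbation $e$ is shrunk until $\|e\|_{G,w} \le \varepsilon^5\|y\|_{G,w}$, the weighted cost of $y + e$ matches that of $y$ up to $O(\varepsilon)$.

```python
import numpy as np

def G(x, tau=1.0):
    """Huber cost."""
    x = np.abs(x)
    return np.where(x <= tau, x**2 / (2 * tau), x - tau / 2)

rng = np.random.default_rng(0)
eps, n = 0.1, 50_000
w = rng.uniform(1.0, 2.0, size=n)                  # weights w_i >= 1
y = rng.standard_t(df=2, size=n)

e = rng.standard_normal(n)
while w @ G(e) > eps**5 * (w @ G(y)):              # enforce ||e||_{G,w} <= eps^5 ||y||_{G,w}
    e *= 0.5

print("||e||_{G,w} / ||y||_{G,w}   =", (w @ G(e)) / (w @ G(y)))
print("||y+e||_{G,w} / ||y||_{G,w} =", (w @ G(y + e)) / (w @ G(y)))   # = 1 +/- O(eps), per Lemma 4.3
```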
Lemma 4.5. (Net for weighted M-Estimators) Let $c = \min_x\|Ax - b\|_2$. For any constant $C_S > 0$ there is a constant $C_N > 0$ and a set $N \subseteq \{Ax - b \mid x\in\mathbb R^d\}$ with $|N| \le (n/\varepsilon)^{C_N d}$ for which if both:

1. $\|S(Ax - b)\|_{G,w'} = (1\pm\varepsilon)\|Ax - b\|_{G,w}$ holds for all $Ax - b \in N$, and $S$ is a matrix for which $\|S(Ax - b)\|_{G,w'} \le n^{C_S}\|Ax - b\|_{G,w}$ for all $x$, for an appropriate $w'$,

2. $\|w\|_\infty \le n^{C_S}$ and $w_i \ge 1$ for all $i$,

then for all $x$ for which $\|Ax - b\|_2 \le \kappa c n^{3/2}\|w\|_\infty$, for an arbitrary constant $\kappa > 0$, it holds that $\|S(Ax - b)\|_{G,w'} = (1\pm\varepsilon)\|Ax - b\|_{G,w}$.

Moreover, if the first condition is relaxed to state only that $(1-\varepsilon)\|Ax - b\|_{G,w} \le \|S(Ax - b)\|_{G,w'}$ for all $Ax - b \in N$, and $S$ is a matrix for which $\|S(Ax - b)\|_{G,w'} \le n^{C_S}\|Ax - b\|_{G,w}$ for all $x$ for an appropriate weight vector $w'$, then the following conclusion holds: for all $x$ for which $\|Ax - b\|_2 \le \kappa c n^{3/2}\|w\|_\infty$, for an arbitrary constant $\kappa > 0$, it holds that $(1-\varepsilon)\|Ax - b\|_{G,w} \le \|S(Ax - b)\|_{G,w'}$.

Proof. Let $L$ be the subspace of $\mathbb R^n$ of dimension at most $d + 1$ spanned by the columns of $A$ together with $b$. Let $N_\alpha$ be a finite subset of $\{z \mid z\in L \text{ and } \|z\|_2 = \alpha\}$ for which, for any point $y$ with $\|y\|_2 = \alpha$, there exists a point $e \in N_\alpha$ for which $\|y - e\|_2 \le \frac{\varepsilon^5}{n^{2C_S+2}}\,\alpha$. It is well-known that there exists an $N_\alpha$ for which $|N_\alpha| \le \left(\frac{3n^{2C_S+2}}{\varepsilon^5}\right)^{d+1}$ [32]. We define
$$N = N_c \cup N_{c(1+\varepsilon)} \cup N_{c(1+\varepsilon)^2} \cup\cdots\cup N_{\kappa c n^{3/2}\|w\|_\infty}.$$
Then
$$|N| = O(\log_{1+\varepsilon} \kappa n^{3/2}\|w\|_\infty)\left(\frac{3n^{2C_S+2}}{\varepsilon^5}\right)^{d+1} \le \left(\frac n\varepsilon\right)^{C_N d},$$
where $C_N > 0$ is a large enough constant.

Now consider any $x \in \mathbb R^d$ for which $y = Ax - b$ satisfies $\|y\|_2 \le \kappa c n^{3/2}\|w\|_\infty$. By construction of $N$, there exists an $e \in N$ for which $\|e - y\|_2 = O(\varepsilon^5/n^{2C_S+2})\|y\|_2$. Then,
$$\|S(e - y)\|_{G,w'} \ \le\ n^{C_S}\|e - y\|_{G,w} \ \le\ n^{C_S}\cdot\|w\|_\infty\cdot n\,G(\|e - y\|_2),$$
using the fact that each coordinate of $e - y$ is at most $\|e - y\|_2$ in magnitude and that $G$ is monotonically non-decreasing. By the lower bound on the growth condition on $G$,
$$G(\|e - y\|_2) \ \le\ \frac{\|e - y\|_2}{C_G\|y\|_2}\,G(\|y\|_2) \ =\ O\!\left(\frac{\varepsilon^5}{n^{2C_S+2}}\right)G(\|y\|_2).$$
Note that $\|y\|_{G,w} \ge G(\|y\|_2/\sqrt n)$ by monotonicity and using that $w_i \ge 1$ for all $i$. Furthermore, by the growth condition on $G$, $G(\|y\|_2) \le nG(\|y\|_2/\sqrt n)$. Combining these inequalities, we have
$$\|S(e - y)\|_{G,w'} \ \le\ n^{2C_S+1}G(\|e - y\|_2) \ =\ O\!\left(\frac{\varepsilon^5}{n}\right)G(\|y\|_2)$$
$$(4.13)\qquad\qquad = O(\varepsilon^5)\|y\|_{G,w}.$$
Note that the argument thus far was true for any $S$ and $w'$ for which $\|S(Ax - b)\|_{G,w'} \le n^{C_S}\|Ax - b\|_{G,w}$ for all $x$, and so in particular holds for $S$ being the identity and $w_i = 1$ for all $i\in[n]$. So in particular we have $\|e - y\|_{G,w} = O(\varepsilon^5)\|y\|_{G,w}$. Applying Lemma 4.3, it follows that
$$(4.14)\qquad \|y\|_{G,w} = \|e + (y - e)\|_{G,w} = (1\pm O(\varepsilon))\|e\|_{G,w}.$$
Now we use the assumption of the theorem that for all $e \in N$, with a particular choice of $S$ and $w'$, one has $(1-\varepsilon)\|e\|_{G,w} \le \|Se\|_{G,w'} \le (1+\varepsilon)\|e\|_{G,w}$. Then $\|Sy\|_{G,w'} = \|Se + S(y - e)\|_{G,w'}$. Now, $\|Se\|_{G,w'} = (1\pm\varepsilon)\|e\|_{G,w}$ by the assumption of the theorem, whereas $\|S(y - e)\|_{G,w'} = O(\varepsilon^5)\|y\|_{G,w} = O(\varepsilon^5)\|e\|_{G,w}$ by combining (4.13) and (4.14). So we can apply Lemma 4.3 to conclude that $\|Sy\|_{G,w'} = (1\pm O(\varepsilon))\|Se\|_{G,w'}$, and combining this with the assumption of the theorem and (4.14),
$$\|Sy\|_{G,w'} = (1\pm O(\varepsilon))\|Se\|_{G,w'} = (1\pm O(\varepsilon))\|e\|_{G,w} = (1\pm O(\varepsilon))\|y\|_{G,w}.$$
For the second part of the lemma, suppose we only had that for all $e \in N$, $(1-\varepsilon)\|e\|_{G,w} \le \|Se\|_{G,w'}$. We still have $\|S(y - e)\|_{G,w'} = O(\varepsilon^5)\|e\|_{G,w}$, and so we can still apply Lemma 4.3 to conclude that $\|Sy\|_{G,w'} = (1\pm O(\varepsilon))\|Se\|_{G,w'}$. Using (4.14), we have
$$\|Sy\|_{G,w'} = (1\pm O(\varepsilon))\|Se\|_{G,w'} \ \ge\ (1 - O(\varepsilon))\|e\|_{G,w} \ \ge\ (1 - O(\varepsilon))\|y\|_{G,w},$$
which completes the proof.

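The net in the proof above is a union of nets at geometrically spaced scales between $c$ and $\kappa c n^{3/2}\|w\|_\infty$, so the number of scales is only logarithmic. The snippet below is a small illustration of that bookkeeping: it computes the base scale $c = \min_x\|Ax - b\|_2$ by ordinary least squares and counts the scales $c(1+\varepsilon)^i$; the data, $\kappa$, $\varepsilon$, and $w$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps, kappa = 1000, 5, 0.1, 2.0
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_t(df=2, size=n)   # heavy-tailed residuals
w = np.ones(n)                                                   # weights w_i >= 1

# Base scale of the net: c = min_x ||Ax - b||_2, by ordinary least squares.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
c = np.linalg.norm(A @ x_ls - b)

# The net is the union of nets N_alpha at scales alpha = c(1+eps)^i, up to
# kappa * c * n^{3/2} * ||w||_inf, so the number of scales is logarithmic.
top = kappa * c * n**1.5 * np.abs(w).max()
num_scales = int(np.ceil(np.log(top / c) / np.log(1.0 + eps))) + 1
print(f"c = {c:.2f}, largest scale = {top:.2f}, number of scales = {num_scales}")
```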
4.1 Proof of Theorem 3.1. Using Lemma 4.5, and previous results on contraction and dilation, we can now prove Theorem 3.1.

Proof. The first algorithm: compute $SA$ and $Sb$, for $S$ an M-sketch matrix with large enough $N = O(C_G^{-2} d^2 m^{6+\alpha})$, putting $m = O(d\log n)$, and $\varepsilon = 1/2$. This $N$ is large enough for Theorem 3.2 to apply, obtaining a contraction bound with failure probability $C_1^{-m}$.

To apply Lemma 4.5, we need to ensure the assumptions of the lemma are satisfied. For the second condition, note that indeed $\|w\|_\infty \le n^{C_S}$ for a constant $C_S > 0$ by definition of the sketch, since $h_{\max} \le \log n$. For the first condition, because the second condition holds, it now suffices to bound $\|Sy\|_{G,w}$ for an arbitrary vector $y$. For this it suffices to show that for each level $h$ and bucket $i$, $G(\|L_{h,i}\|_\Lambda) \le n^{O(1)}\sum_{p\in L_{h,i}} G(p)$. By monotonicity of $G$, we have $G(\|L_{h,i}\|_\Lambda) \le G(\|L_{h,i}\|_1)$. By the growth condition on $G$, for $a \ge b$ we have
$$\frac{G(a+b)}{G(a)} \ \le\ \frac{(a+b)^2}{a^2} \ \le\ 2 + \frac{2b^2}{a^2} \ \le\ 2 + \frac{2G(b)}{G(a)},$$
and so $G(a + b) \le 2G(a) + 2G(b)$. Applying this inequality recursively $\lceil\log|L_{h,i}|\rceil$ times, we have $G(\|L_{h,i}\|_1) \le n\sum_{p\in L_{h,i}}|y_p|$, which is what we needed to show (where, with some abuse of notation, we use the definition $y_p = G(z_p)$ given in §3.1).

Hence, we can apply Lemma 4.5, and by Theorem 3.2, the needed contraction bound holds for all members of the net $N$ of Lemma 4.5, with failure probability $O(n)^{C_N d}C_1^{-m} < 1$, for $m = O(d\log n)$, assuming conditions $E$ and $E_c$.

For $x_G$ minimizing $\|Ax - b\|_G$, apply Theorem 3.3 to $x_G$ and $S$, so that with constant probability, $\|S(Ax_G - b)\|_{G,w} = O(\log_d n)\|Ax_G - b\|_G = O(\log_d n)\,\mathrm{OPT}_G$. By making the totals of the failure probabilities for conditions $E$ and $E_c$, for the contraction bound, and for the dilation bound less than one, the overall failure probability is less than one. (Here we note that all such failure probabilities can be made less than $1/5$, even if described as fixed.)

Let $T$ be the sparse subspace embedding of [9], so that with probability $1 - o(1)$, $\|TAx\|_2 = O(1)\|Ax\|_2$ for all $x$, $TA$ can be computed in $\mathrm{nnz}(A)$ time, and $T$ has $\mathrm{poly}(d)$ rows. Find $x_0$ minimizing $\|T(Ax - b)\|_2$, and let $c \equiv \|Ax_0 - b\|_2$. Now find $\tilde x$ minimizing $\|S(Ax - b)\|_{Gc,w}$, subject to $\|T(Ax - b)\|_2 \le \kappa c n^{3/2}$, using the ellipsoid method, in $\mathrm{poly}(d\log n)$ time. Now Lemma 4.5 applies, implying that $\tilde x$ satisfies the claim of the theorem. A similar argument holds for $\hat x$, by minimizing $\|S(Ax - b)\|_{Gc,w}$.

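The following is a minimal, self-contained sketch of the two-stage pipeline just described, under simplifying assumptions: `count_sketch` is a CountSketch-style sparse embedding playing the role of $T$, `m_sketch` is a toy stand-in for the M-sketch $S$ (level-wise subsampling with signed hashing), $G$ is the Huber cost, and the constrained minimization of the clipped norm via the ellipsoid method is replaced by an unconstrained iteratively reweighted least squares heuristic on the sketched data. It shows the structure of the algorithm only; it does not reproduce its guarantees or parameter choices.

```python
import numpy as np

def huber(x, tau=1.0):
    """Huber cost G."""
    x = np.abs(x)
    return np.where(x <= tau, x**2 / (2 * tau), x - tau / 2)

def count_sketch(n_rows, n, rng):
    """CountSketch-style sparse embedding: one random +/-1 per column."""
    T = np.zeros((n_rows, n))
    T[rng.integers(0, n_rows, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return T

def m_sketch(n, b=2.0, m=100, hmax=5, beta=1.0, rng=None):
    """Toy M-sketch stand-in S and row weights w: level h subsamples coordinates
    at rate b^{-h} and hashes them with random signs into m buckets of weight beta*b^h."""
    S = np.zeros(((hmax + 1) * m, n))
    w = np.zeros((hmax + 1) * m)
    for h in range(hmax + 1):
        idx = np.flatnonzero(rng.random(n) < b ** (-h))
        S[h * m + rng.integers(0, m, size=idx.size), idx] = rng.choice([-1.0, 1.0], size=idx.size)
        w[h * m:(h + 1) * m] = beta * b ** h
    return S, w

def weighted_huber_irls(A, b, w, tau=1.0, iters=50):
    """Minimize sum_k w_k * huber((A x - b)_k) by iteratively reweighted least squares."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        r = A @ x - b
        wts = w * np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(wts)
        x = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, kappa = 5000, 8, 2.0
    A = rng.standard_normal((n, d))
    b_vec = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    b_vec[rng.choice(n, size=n // 50, replace=False)] += 100.0     # gross outliers

    # Stage 1: sparse 2-norm sketch T, least-squares solution x0, and the scale c.
    T = count_sketch(10 * d * d, n, rng)
    x0 = np.linalg.lstsq(T @ A, T @ b_vec, rcond=None)[0]
    c = np.linalg.norm(A @ x0 - b_vec)

    # Stage 2: sketched weighted Huber minimization (heuristic solver; the paper
    # instead minimizes the clipped norm subject to ||T(Ax-b)||_2 <= kappa*c*n^{3/2}).
    S, w = m_sketch(n, rng=rng)
    x_tilde = weighted_huber_irls(S @ A, S @ b_vec, w)
    print("2-norm constraint satisfied:", np.linalg.norm(A @ x_tilde - b_vec) <= kappa * c * n**1.5)
    print("huber cost at x_tilde:", huber(A @ x_tilde - b_vec).sum())
    print("huber cost at x0     :", huber(A @ x0 - b_vec).sum())
```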
Acknowledgements. We acknowledge the support of the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C-0323. We thank the referees for their helpful comments.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003.
[2] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix approximations. J. ACM, 54(2), 2007.
[3] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, pages 557–563, 2006.
[4] Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Trans. Algorithms, 9(3):21:1–21:12, June 2013.
[5] S. Bernstein. Theory of Probability. Moscow, 1927.
[6] Jean Bourgain and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. CoRR, abs/1311.2542, 2013.
[7] C. Boutsidis and A. Gittens. Improved matrix algorithms via the Subsampled Randomized Hadamard Transform. ArXiv e-prints, March 2012.
[8] Vladimir Braverman and Rafail Ostrovsky. Zero-one frequency laws. In STOC, pages 281–290, 2010.
[9] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, 2013. Full version at http://arxiv.org/abs/1207.6365.
[10] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for $\ell_p$ regression. SIAM J. Comput., 38(5):2060–2078, 2009.
[11] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In STOC, pages 341–350, 2010.
[12] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In SODA, pages 1117–1126, 2006.
[13] Amit Deshpande and Santosh Vempala. Adaptive sampling and fast low-rank matrix approximation. In APPROX-RANDOM, pages 292–303, 2006.
[14] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM J. Comput., 36(1):132–157, 2006.
[15] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM J. Comput., 36(1):158–183, 2006.
[16] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM J. Comput., 36(1):184–206, 2006.
[17] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. CoRR, abs/1109.3843, 2011.
[18] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for $\ell_2$ regression and applications. In SODA, pages 1127–1136, 2006.
[19] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: Column-based methods. In APPROX-RANDOM, pages 316–326, 2006.
[20] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: Column-row-based methods. In ESA, pages 304–314, 2006.
[21] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. CoRR, abs/0710.1435, 2007.
[22] Antoine Guitton and William W. Symes. Robust and stable velocity analysis using the Huber function, 1999.
[23] S. J. Haberman. Convexity and estimation. Ann. Statist., 17:1631–1661, 1989.
[24] Peter J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.
[25] Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In STOC, pages 202–208, 2005.
[26] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, 1982.
[27] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. In SODA, pages 1195–1206, 2012.
[28] Ioannis Koutis, Gary L. Miller, and Richard Peng. Approaching optimality for solving SDD linear systems. In FOCS, pages 235–244, 2010.
[29] Ioannis Koutis, Gary L. Miller, and Richard Peng. A nearly-m log n time solver for SDD linear systems. In FOCS, pages 590–598, 2011.
[30] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
[31] Olvi L. Mangasarian and David R. Musicant. Robust linear and support vector regression. IEEE Trans. Pattern Anal. Mach. Intell., 22(9):950–955, 2000.
[32] Jiri Matousek. Lectures on Discrete Geometry. Springer, 2002.
[33] A. Maurer. A bound on the deviation probability for sums of non-negative random variables. Journal of Inequalities in Pure and Applied Mathematics, 4(1), 2003.
[34] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. ArXiv e-prints, October 2012.
[35] Gary L. Miller and Richard Peng. Iterative approaches to row sampling. CoRR, abs/1211.2713, 2012.
[36] Jelani Nelson and Huy L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. CoRR, abs/1211.1002, 2012.
[37] Wojciech Niemiro. Asymptotics for M-estimators defined by convex minimization. Ann. Statist., 20(3):1514–1533, 1992.
[38] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143–152, 2006.
[39] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In 36th Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2004.
[40] Elad Verbin and Qin Zhang. Rademacher-sketch: A dimensionality-reducing embedding for sum-product norms, with an application to earth-mover distance. In ICALP (1), pages 834–845, 2012.
[41] David P. Woodruff and Qin Zhang. Subspace embeddings and $\ell_p$-regression using exponential random variables. CoRR, abs/1305.5580, 2013.
[42] Jiyan Yang, Xiangrui Meng, and Michael W. Mahoney. Quantile regression for large-scale applications. In ICML (3), pages 881–887, 2013.
[43] Zhengyou Zhang. Parameter estimation techniques: A tutorial with application to conic fitting. Image and Vision Computing Journal, 15(1):59–76, 1997. http://research.microsoft.com/en-us/um/people/zhang/INRIA/Publis/Tutorial-Estim/node24.html.