Sketching for M-Estimators: A Unified Approach to Robust Regression
Kenneth L. Clarkson∗   David P. Woodruff†

∗IBM Research - Almaden   †IBM Research - Almaden

Abstract

We give algorithms for the M-estimators $\min_x \|Ax - b\|_G$, where $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, and $\|y\|_G$ for $y \in \mathbb{R}^n$ is specified by a cost function $G : \mathbb{R} \mapsto \mathbb{R}^{\ge 0}$, with $\|y\|_G \equiv \sum_i G(y_i)$. The M-estimators generalize $\ell_p$ regression, for which $G(x) = |x|^p$. We first show that the Huber measure can be computed up to relative error $\varepsilon$ in $O(\operatorname{nnz}(A) \log n + \operatorname{poly}(d(\log n)/\varepsilon))$ time, where $\operatorname{nnz}(A)$ denotes the number of non-zero entries of the matrix $A$. Huber is arguably the most widely used M-estimator, enjoying the robustness properties of $\ell_1$ as well as the smoothness properties of $\ell_2$.

We next develop algorithms for general M-estimators. We analyze the M-sketch, which is a variation of a sketch introduced by Verbin and Zhang in the context of estimating the earthmover distance. We show that the M-sketch can be used much more generally for sketching any M-estimator provided $G$ has growth that is at least linear and at most quadratic. Using the M-sketch we solve the M-estimation problem in $O(\operatorname{nnz}(A) + \operatorname{poly}(d \log n))$ time for any such $G$ that is convex, making a single pass over the matrix and finding a solution whose residual error is within a constant factor of optimal, with high probability.

1 Introduction.

In recent years there have been significant advances in randomized techniques for solving numerical linear algebra problems, including the solution of diagonally dominant systems [28, 29, 39], low-rank approximation [2, 9, 15, 12, 13, 34, 36, 38], overconstrained regression [9, 21, 34, 36, 38], and computation of leverage scores [9, 17, 34, 36]. There are many other references; please see for example the survey by Mahoney [30].

Much of this work involves the tool of sketching, which in generality is a descendant of random projection methods as described by Johnson and Lindenstrauss [1, 4, 3, 11, 26, 27], and also of sampling methods [10, 14, 15, 16, 18, 19, 20]. Given a problem involving $A \in \mathbb{R}^{n \times d}$, a sketching matrix $S \in \mathbb{R}^{t \times n}$ with $t \ll n$ is used to reduce it to a similar problem involving the smaller matrix $SA$, with the key property that, with high likelihood with respect to the randomized choice of $S$, a solution for $SA$ is a good solution for $A$. More generally, data derived using $SA$ is used to efficiently solve the problem for $A$. In cases where no further processing of $A$ is needed, a streaming algorithm often results, since a single pass over $A$ suffices to compute $SA$.

An important property of many of these sketching constructions is that $S$ is a subspace embedding, meaning that for all $x \in \mathbb{R}^d$, $\|SAx\| \approx \|Ax\|$. (Here the vector norm is generally $\ell_p$ for some $p$.) For the regression problem of minimizing $\|Ax - b\|$ with respect to $x \in \mathbb{R}^d$, for inputs $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, a minor extension of the embedding condition implies that $S$ preserves the norm of the residual vector $Ax - b$, that is, $\|S(Ax - b)\| \approx \|Ax - b\|$, so that a vector $x$ that makes $\|S(Ax - b)\|$ small will also make $\|Ax - b\|$ small.

A significant bottleneck for these methods is the computation of $SA$, taking $\Theta(tnd)$ time with straightforward matrix multiplication. There has been work showing that fast transform methods can be incorporated into the construction of $S$ and its application to $A$, leading to a sketching time of $O(nd \log n)$ [3, 4, 7, 38]. Recently it was shown that there are useful sketching matrices $S$ such that $SA$ can be computed in time linear in the number $\operatorname{nnz}(A)$ of non-zeros of $A$ [6, 9, 34, 36]. With such sketching matrices, various problems can be solved with a running time whose leading term is $O(\operatorname{nnz}(A))$ or $O(\operatorname{nnz}(A) \log n)$. This prominently includes regression problems on "tall and thin" matrices with $n \gg d$, both in the least-squares ($\ell_2$) and robust ($\ell_1$) cases. There are also recent recursive sampling-based algorithms for $\ell_p$ regression [35], as well as sketching-based algorithms for $p \in [1, 2)$ [34] and $p > 2$ [41], though the latter requires sketches whose size grows polynomially with $n$. Similar $O(\operatorname{nnz}(A))$-time results were obtained for quantile regression [42], by relating it to $\ell_1$ regression. A natural question raised by these works is which families of penalty functions can be handled in $O(\operatorname{nnz}(A))$ or $O(\operatorname{nnz}(A) \log n)$ time.
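The sketch-and-solve pattern just described can be made concrete in a few lines. The following is our illustration, not taken from the paper: a CountSketch-style matrix of the kind in [6, 9], in which each column of $S$ has a single random $\pm 1$ entry, is applied to an overconstrained least-squares problem, so forming $SA$ takes time proportional to $\operatorname{nnz}(A)$. All function and variable names here are ours.

```python
import numpy as np

def countsketch(A, t, rng):
    """Apply a CountSketch-style matrix S (t x n) to A in time proportional
    to nnz(A): each row of A is added, with a random sign, to one random
    row of the sketch SA; S itself is never materialized."""
    n = A.shape[0]
    bucket = rng.integers(0, t, size=n)       # target row of SA per row of A
    sign = rng.choice((-1.0, 1.0), size=n)    # random +/-1 per row of A
    SA = np.zeros((t, A.shape[1]))
    np.add.at(SA, bucket, sign[:, None] * A)  # unbuffered scatter-add
    return SA

rng = np.random.default_rng(0)
n, d, t = 100_000, 10, 2_000                  # tall and thin: n >> d, sketch rows t << n
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Sketch A and b together so the same S hits both, then solve the small
# t x d problem min_x ||S(Ax - b)||_2 in place of the n x d one.
SAb = countsketch(np.column_stack((A, b)), t, rng)
x_sk = np.linalg.lstsq(SAb[:, :d], SAb[:, d], rcond=None)[0]
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x_ls - b))  # close to 1
```

Because $S$ is never materialized and each row of $A$ is touched once, the same scatter-add runs in a single pass over the rows, which is what makes the streaming observation above work.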
M-estimators. Here we further extend the "nnz" regime to general statistical M-estimators, specified by a measure function $G : \mathbb{R} \mapsto \mathbb{R}^{\ge 0}$, where $G(x) = G(-x)$, $G(0) = 0$, and $G$ is non-decreasing in $|x|$. The result is a new "norm" $\|y\|_G \equiv \sum_{i \in [n]} G(y_i)$. (In general these functions $\|\cdot\|_G$ are not true norms, but we will sometimes refer to them as norms anyway.) An M-estimator is a solution to $\min_x \|Ax - b\|_G$. For appropriate $G$, M-estimators can combine the insensitivity to outliers of $\ell_1$ regression with the low variance of $\ell_2$ regression.

The Huber norm. The Huber norm [24], for example, is specified by a parameter $\tau > 0$, and its measure function $H$ is given by

$$H(a) \;\equiv\; \begin{cases} a^2/2\tau & \text{if } |a| \le \tau \\ |a| - \tau/2 & \text{otherwise,} \end{cases}$$

combining an $\ell_2$-like measure for small $a$ with an $\ell_1$-like measure for large $a$.

The Huber norm is of particular interest, because it is popular and "recommended for almost all situations" [43], because it is "most robust" in a certain sense [24], and because it has the useful computational and statistical properties implied by the convexity and smoothness of its defining function, as mentioned above. The smoothness makes it differentiable at all points, which can lead to computational savings over $\ell_1$, while enjoying the same robustness properties with respect to outliers. Moreover, while some measures, such as $\ell_1$, treat small residuals "as seriously" as large residuals, it is often more appropriate to have robust treatment of large residuals and Gaussian treatment of small residuals [22].

We give in §2 a sampling scheme for the Huber norm based on a combination of Huber's $\ell_1$ and $\ell_2$ properties. We obtain an algorithm yielding an $\varepsilon$-approximation with respect to the Huber norm of the residual; as stated in Theorem 2.1, the algorithm needs $O(\operatorname{nnz}(A) \log n) + \operatorname{poly}(d/\varepsilon)$ time (see, e.g., [31] for convex programming algorithms for solving Huber in $\operatorname{poly}(d/\varepsilon)$ time when the dimension is $\operatorname{poly}(d/\varepsilon)$).
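As a reference point for the definition above, the Huber measure and the cost $\|y\|_H$ it induces can be transcribed directly. This snippet is ours (numpy assumed); it simply restates the two branches and their crossover at $|a| = \tau$.

```python
import numpy as np

def huber(a, tau):
    """Huber measure H(a): ell_2-like (a^2 / 2*tau) for |a| <= tau,
    ell_1-like (|a| - tau/2) beyond; the branches agree (= tau/2) at |a| = tau."""
    a = np.abs(np.asarray(a, dtype=float))
    return np.where(a <= tau, a * a / (2 * tau), a - tau / 2)

def huber_cost(y, tau):
    """The 'norm' ||y||_H = sum_i H(y_i) that the Huber M-estimator minimizes."""
    return float(np.sum(huber(y, tau)))

# A large residual is charged linearly, not quadratically: with tau = 1,
# H(100) = 99.5, versus 100**2 / 2 = 5000 under least-squares.
print(huber_cost([0.5, -100.0], tau=1.0))  # 0.125 + 99.5 = 99.625
```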
M-sketches for M-estimators. We also show that the sketching construction of Verbin and Zhang [40], which they applied to the earthmover distance, can also be applied to sketching for general M-estimators. This construction, which we call the M-sketch, is constructed independently of the $G$ function specifying the M-estimator, and so the same sketch can be used for all $G$. That is, one can first sketch the input, in one pass, and decide later on the particular choice of penalty function $G$. In other words, the entire algorithm for the problem $\min_x \|Ax - b\|_G$ is to compute $S \cdot A$ and $S \cdot b$, for a simple sketching matrix $S$ described below, and then solve the regression problem $\min_x \|SAx - Sb\|_{G,w}$, where $\|\cdot\|_{G,w}$, a weighted version of $\|\cdot\|_G$, is defined below.

The sketch $SA$ (and $Sb$) can be computed in $O(\operatorname{nnz}(A))$ time, and needs $O(\operatorname{poly}(d \log n))$ space; we show that it can be used in $O(\operatorname{poly}(d \log n))$ time to find approximate M-estimators that, with constant probability, have a cost within a constant factor of optimal. The success probability can be amplified by independent repetition, choosing the best solution found among the repetitions.

Condition on $G$. For our results we need some additional conditions on the function $G$ beyond symmetry and monotonicity: that it grows no faster than quadratically in $x$, and no slower than linearly. Formally: there are $\alpha \in [1, 2]$ and $C_G > 0$ so that for all $a, a'$ with $|a| \ge |a'| > 0$,

$$\left|\frac{a}{a'}\right|^{\alpha} \;\ge\; \frac{G(a)}{G(a')} \;\ge\; C_G \left|\frac{a}{a'}\right|. \qquad (1.1)$$

The subquadratic growth condition is necessary for a sketch with a sketching dimension sub-polynomial in $n$ to exist, as shown by Braverman and Ostrovsky [8]. Also, subquadratic growth is appropriate for robust regression, to reduce the effect of large values in the residual $Ax - b$, relative to their effect in least-squares. Almost all proposed M-estimators satisfy these conditions [43].

The latter linear lower bound on the growth of $G$ holds for all convex $G$, and many popular M-estimators have convex $G$ [43]. Moreover, the convexity of $G$ implies the convexity of $\|\cdot\|_G$, which is needed for computing a solution to the minimization problem in polynomial time. Convexity also implies significant properties for the statistical interpretation of the results, such as consistency and asymptotic normality [23, 37].

However, we do not require $G$ to be convex for our sketching results, and indeed some M-estimators are not convex; here we simply reduce a large non-convex problem, $\min_x \|Ax - b\|_G$, to a smaller non-convex problem $\min_x \|S(Ax - b)\|_{G,w}$ of a similar kind. The linear growth lower bound does imply that we are unable to apply sketching to some proposed M-estimators; the "Tukey" estimator, for example, whose $G$ function is constant for large argument values, is not covered by our results.
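The following stand-in pipeline (ours) shows how the pieces of this section fit together. We assume the weighted norm has the natural form $\|y\|_{G,w} \equiv \sum_i w_i\, G(y_i)$ suggested by the notation; the actual M-sketch $S$ and weights $w$ are constructed later in the paper, and the generic local solver below is only a placeholder for solving the small sketched problem, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def sketched_m_estimate(SA, Sb, w, G):
    """Solve the small sketched problem min_x ||SAx - Sb||_{G,w}, reading
    ||y||_{G,w} as sum_i w_i * G(y_i) (our assumption). A least-squares fit
    warm-starts a generic derivative-free local minimizer; for convex G
    (e.g. Huber) the local minimum found is global."""
    cost = lambda x: float(np.dot(w, G(SA @ x - Sb)))
    x0 = np.linalg.lstsq(SA, Sb, rcond=None)[0]  # ell_2 warm start
    return minimize(cost, x0, method="Nelder-Mead").x
```

With $SA$ and $Sb$ computed in one pass (say, via the countsketch placeholder above and uniform placeholder weights w = np.ones(t)), the call sketched_m_estimate(SA, Sb, w, lambda y: huber(y, tau=1.0)) returns a sketched Huber estimate, and a different choice of $G$ made later requires no new pass over $A$.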