Layered Sampling for Robust Optimization Problems

Hu Ding 1 Zixiu Wang 1

Abstract

In the real world, our datasets often contain outliers. Moreover, the outliers can seriously affect the final result. Most existing algorithms for handling outliers take high time complexities (e.g., quadratic or cubic complexity). Coreset is a popular approach for compressing data so as to speed up the optimization algorithms. However, the current coreset methods cannot be easily extended to handle the case with outliers. In this paper, we propose a new variant of coreset technique, layered sampling, to deal with two fundamental robust optimization problems: k-median/means clustering with outliers and linear regression with outliers. This new coreset method is in particular suitable for speeding up the iterative algorithms (which often improve the solution within a local range) for those robust optimization problems. Moreover, our method is easy to implement in practice. We expect that our framework of layered sampling will be applicable to other robust optimization problems.

¹School of Computer Science and Technology, University of Science and Technology of China. Correspondence to: Hu Ding <http://staff.ustc.edu>.

1. Introduction

Coreset is a widely studied technique for solving many optimization problems (Phillips, 2016; Bachem et al., 2017; Munteanu et al., 2018; Feldman, 2020). The (informal) definition is as follows. Given an optimization problem with the objective function ∆, denote by ∆(P, C) the objective value determined by a dataset P and a solution C; a small set S is called a coreset if

∆(P, C) ≈ ∆(S, C)   (1)

for any feasible solution C. Roughly speaking, the coreset is a small set of data approximately representing a much larger dataset, and therefore an existing algorithm can run on the coreset (instead of the original dataset) so as to reduce the complexity measures like running time, space, and communication. In the past years, the coreset techniques have been successfully applied to solve many optimization problems, such as clustering (Chen, 2009; Feldman & Langberg, 2011; Huang et al., 2018), logistic regression (Huggins et al., 2016; Munteanu et al., 2018), linear regression (Dasgupta et al., 2009; Drineas et al., 2006), and Gaussian mixture models (Lucic et al., 2017; Karnin & Liberty, 2019).

A large part of existing coreset construction methods are based on the theory of sensitivity, which was proposed by (Langberg & Schulman, 2010). Informally, each data point p ∈ P has a sensitivity φ(p) (in fact, we just need to compute an appropriate upper bound of φ(p)) that measures its importance to the whole instance P over all possible solutions, and Φ(P) = Σ_{p∈P} φ(p) is called the total sensitivity. The coreset construction is a simple sampling procedure where each point p is drawn i.i.d. from P proportional to φ(p)/Φ(P); each sampled point p is assigned a weight w(p) = Φ(P)/(m · φ(p)), where m is the sample size depending on the "pseudo-dimension" of the objective function ∆ (Feldman & Langberg, 2011; Li et al., 2001); eventually, the set of weighted sampled points forms the desired coreset S.

In the real world, datasets are noisy and contain outliers. Moreover, outliers could seriously affect the final results in data analysis (Chandola et al., 2009; Goodfellow et al., 2018). However, the sensitivity based coreset approach is not appropriate for handling robust optimization problems involving outliers (e.g., k-means clustering with outliers). For example, it is not easy to compute the sensitivity φ(p) because the point p could be an inlier or an outlier for different solutions; moreover, it is challenging to build a relation such as (1) between the original instance P and the coreset S (e.g., how to determine the number of outliers for the instance S?).

1.1. Our Contributions

In this paper, we consider two important robust optimization problems: k-median/means clustering with outliers and linear regression with outliers. Their quality guaranteed algorithms exist but often have high complexities that seriously limit their applications in real scenarios (see Section 1.2 for more details). On the other hand, the two problems can be efficiently solved by some heuristic algorithms in practice, though they only guarantee local optimums in theory.
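For concreteness, the generic sensitivity-based construction recalled in the introduction can be summarized by the following minimal Python sketch. It is only an illustration of the generic sampling procedure (the way the sensitivity upper bounds `phi` are obtained is assumed to be given), not a specific construction discussed in this paper.

```python
import numpy as np

def sensitivity_coreset(P, phi, m, rng=None):
    """Generic sensitivity-based coreset construction (illustrative sketch).

    P   : (n, d) array of points.
    phi : (n,) array of (upper bounds on) the sensitivities phi(p).
    m   : desired coreset size.
    Returns the sampled points and their weights w(p) = Phi(P) / (m * phi(p)).
    """
    rng = np.random.default_rng() if rng is None else rng
    total = phi.sum()                                  # Phi(P), the total sensitivity
    probs = phi / total                                # draw p with probability phi(p)/Phi(P)
    idx = rng.choice(len(P), size=m, replace=True, p=probs)
    weights = total / (m * phi[idx])                   # makes the weighted cost estimate unbiased
    return P[idx], weights
```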

For example, (Chawla & Gionis, 2013b) proposed the algorithm k-means-- to solve the problem of k-means clustering with outliers, where the main idea is an alternating minimization strategy. The algorithm is an iterative procedure, where it alternately updates the outliers and the k cluster centers in each iteration; eventually the solution converges to a local optimum. The alternating minimization strategy is also widely used for solving the problem of linear regression with outliers, e.g., (Shen & Sanghavi, 2019). A common feature of these methods is that they usually start from an initial solution and then locally improve the solution round by round. Therefore, a natural question is:

can we construct a "coreset" only for a local range in the solution space?

Using such a coreset, we can substantially speed up those iterative algorithms. Motivated by this question, we introduce a new variant of coreset method called layered sampling. Given an initial solution C̃, we partition the given data set P into a consecutive sequence of "layers" surrounding C̃ and conduct random sampling in each layer; the union of the samples, together with the points located in the outermost layer, forms the coreset S. Actually, our method is partly inspired by the coreset construction method for k-median/means clustering (without outliers) proposed by (Chen, 2009). However, we need to develop significantly new ideas in theory to prove its correctness for the case with outliers. The purpose of layered sampling is not to guarantee the approximation quality (as in (1)) for any solution C; instead, it only guarantees the quality for the solutions in a local range L of the solution space (the formal definition is given in Section 1.3). Informally, we need to prove the following result to replace (1):

∀ C ∈ L,  ∆(P, C) ≈ ∆(S, C).   (2)

See Figure 1 for an illustration. In other words, the new method can help us find a local optimum faster. Our main results are shown in Theorems 1 and 2. The construction algorithms are easy to implement.

Figure 1. The red point represents the initial solution C̃, and our goal is to guarantee (2) for a local range of size L around C̃ in the solution space.

1.2. Related Works

k-median/means clustering (with outliers). k-median/means clustering are two popular center-based clustering problems (Awasthi & Balcan, 2014). It has been extensively studied for using coreset techniques to reduce the complexities of k-median/means clustering algorithms (Chen, 2009; Har-Peled & Kushal, 2007; Fichtenberger et al., 2013; Feldman et al., 2013); in particular, (Feldman & Langberg, 2011) proposed a unified coreset framework for a set of clustering problems. However, the research on using coresets to handle outliers is still quite limited. Recently, (Huang et al., 2018) showed that a uniform independent sample can serve as a coreset for clustering with outliers in Euclidean space; however, such a uniform sampling based method often misses some important points and therefore introduces an unavoidable error on the number of outliers. (Gupta, 2018) also studied the uniform random sampling idea but under the assumption that each optimal cluster should be large enough. Partly inspired by the method of (Mettu & Plaxton, 2004), (Chen et al., 2018) proposed a novel summary construction algorithm to reduce the input data size which guarantees an O(1) factor of distortion on the clustering cost.

In theory, the algorithms with provable guarantees for k-median/means clustering with outliers (Chen, 2008; Krishnaswamy et al., 2018; Friggstad et al., 2018) have high complexities and are difficult to implement in practice. Heuristic but practical algorithms have also been studied before (Chawla & Gionis, 2013b; Ott et al., 2014). By using the local search method, (Gupta et al., 2017b) provided a 274-approximation algorithm for k-means clustering with outliers, but it needs to discard more than the desired number of outliers; to improve the running time, they also used k-means++ (Arthur & Vassilvitskii, 2007b) to seed the "coreset" that yields an O(1) factor approximation. Based on the idea of k-means++, (Bhaskara et al., 2019) proposed an O(log k)-approximation algorithm.

Linear regression (with outliers). Several coreset methods for ordinary linear regression (without outliers) have been proposed (Drineas et al., 2006; Dasgupta et al., 2009; Boutsidis et al., 2013). For the case with outliers, which is also called the "Least Trimmed Squares linear estimator (LTS)", a uniform sampling approach was studied by (Mount et al., 2014; Ding & Xu, 2014). But similar to the scenario of clustering with outliers, such a uniform sampling approach introduces an unavoidable error on the number of outliers. (Mount et al., 2014) also proved that it is impossible to achieve even an approximate solution for LTS within polynomial time under the conjecture of the hardness of affine degeneracy (Erickson & Seidel, 1995), if the dimensionality d is not fixed. Despite its high complexity, several practical algorithms were proposed before, and most of them are based on the idea of alternating minimization that improves the solution within a local range, such as (Rousseeuw, 1984; Rousseeuw & van Driessen, 2006; Hawkins, 1994; Mount et al., 2016; Bhatia et al., 2015; Shen & Sanghavi, 2019). (Klivans et al., 2018) provided another approach based on the sum-of-squares method.
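Since several of the methods above (and the solvers we later run on our coresets) follow this alternating-minimization pattern, the following minimal sketch illustrates a k-means-- style iteration for Euclidean k-means with z outliers. It is a simplified illustration, not the exact implementation of (Chawla & Gionis, 2013b); the initialization and stopping rule are our own assumptions.

```python
import numpy as np

def kmeans_minus_minus(P, k, z, n_iter=50, rng=None):
    """Sketch of the alternating-minimization heuristic for k-means with z outliers."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.asarray(P, dtype=float)
    n = len(P)
    centers = P[rng.choice(n, size=k, replace=False)].copy()   # arbitrary initialization
    for _ in range(n_iter):
        d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        nearest = d2.argmin(1)                  # nearest current center of each point
        cost = d2[np.arange(n), nearest]
        inliers = np.argsort(cost)[: n - z]     # step 1: the z farthest points become outliers
        for j in range(k):                      # step 2: recompute each center from its inliers
            members = inliers[nearest[inliers] == j]
            if len(members) > 0:
                centers[j] = P[members].mean(0)
    return centers
```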

1.3. Preliminaries

Below, we introduce several important definitions.

i. k-Median/Means Clustering with Outliers. Suppose P is a set of n points in R^d. Given two integers 1 ≤ z, k < n, the problem of k-median clustering with z outliers is to find a set of k points C = {c_1, ..., c_k} ⊂ R^d and a subset P' ⊂ P with |P'| = n − z, such that the following objective function

K_1^{−z}(P, C) = (1/(n − z)) Σ_{p∈P'} min_{1≤j≤k} ||p − c_j||   (3)

is minimized. Similarly, we have the objective function

K_2^{−z}(P, C) = (1/(n − z)) Σ_{p∈P'} min_{1≤j≤k} ||p − c_j||²   (4)

for k-means clustering with outliers. The set C is also called a solution of the instance P. Roughly speaking, given a solution C, the farthest z points to C are discarded, and the remaining subset P' is partitioned into k clusters where each point is assigned to its nearest neighbor in C.

ii. Linear Regression with Outliers. Given a vector h = (h_1, h_2, ..., h_d) ∈ R^d, the linear function defined by h is y = Σ_{j=1}^{d−1} h_j x_j + h_d for d − 1 real variables x_1, x_2, ..., x_{d−1}. Thus the linear function can be represented by the vector h. From a geometric perspective, the linear function can be viewed as a (d − 1)-dimensional hyperplane in the space. Let z be an integer between 1 and n, and let P = {p_1, p_2, ..., p_n} be a set of n points in R^d, where each p_i = (x_{i,1}, x_{i,2}, ..., x_{i,d−1}, y_i) for 1 ≤ i ≤ n; the objective is to find a subset P' ⊂ P with |P'| = n − z and a (d − 1)-dimensional hyperplane, represented as a coefficient vector h = (h_1, h_2, ..., h_d) ∈ R^d, such that

LR_1^{−z}(P', h) = (1/(n − z)) Σ_{p_i∈P'} |Res(p_i, h)|   (5)
or LR_2^{−z}(P', h) = (1/(n − z)) Σ_{p_i∈P'} Res(p_i, h)²   (6)

is minimized. Res(p_i, h) = y_i − Σ_{j=1}^{d−1} h_j x_{i,j} − h_d is the "residual" of p_i to h. The objective functions (5) and (6) are called the "least absolute error" and "least squared error", respectively.

Remark 1. All the above problems can be extended to the weighted case. Suppose each point p has a non-negative weight w(p); then the (squared) distance ||p − c_j|| (resp., ||p − c_j||²) is replaced by w(p) · ||p − c_j|| (resp., w(p) · ||p − c_j||²); we can perform the similar modification on |Res(p_i, h)| and Res(p_i, h)² for the problem of linear regression with outliers. Moreover, the total weight of the outliers should be equal to z. Namely, we can view each point p as w(p) unit-weight overlapping points.

Solution range. To analyze the performance of our layered sampling method, we also need to define the "solution range" for the clustering and linear regression problems. Consider the clustering problems first. Given a clustering solution C̃ = {c̃_1, ..., c̃_k} ⊂ R^d and L > 0, we use "C̃ ± L" to denote the range of solutions

L = { C = {c_1, ..., c_k} | ||c̃_j − c_j|| ≤ L, ∀ 1 ≤ j ≤ k }.   (7)

Next, we define the solution range for linear regression with outliers. Given an instance P, we often normalize the values in each of the first d − 1 dimensions as a preprocessing step; without loss of generality, we can assume that x_{i,j} ∈ [0, D] for some D > 0, for any 1 ≤ i ≤ n and 1 ≤ j ≤ d − 1. For convenience, denote by R_D the region {(s_1, s_2, ..., s_d) | 0 ≤ s_j ≤ D, ∀ 1 ≤ j ≤ d − 1}, and thus P ⊂ R_D after the normalization. It is easy to see that the region R_D actually is a vertical square cylinder in the space. Given a coefficient vector (hyperplane) h̃ = (h̃_1, h̃_2, ..., h̃_d) ∈ R^d and L > 0, we use "h̃ ± L" to denote the range of hyperplanes

L = { h = (h_1, h_2, ..., h_d) | |Res(p, h̃) − Res(p, h)| ≤ L, ∀ p ∈ R_D }.   (8)

To understand the range defined in (8), we can imagine two linear functions h̃⁺ = (h̃_1, h̃_2, ..., h̃_d + L) and h̃⁻ = (h̃_1, h̃_2, ..., h̃_d − L); if we only consider the region R_D, the range (8) contains all the linear functions "sandwiched" by h̃⁺ and h̃⁻.

For both the clustering and regression problems, we also say that the size of the solution range L is |L| = L.
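The following short sketch (assuming instances are given as NumPy arrays) spells out how the objective values K_1^{−z}(P, C) and LR_1^{−z}(P, h) defined above are evaluated for a fixed solution: the z farthest points (respectively, the z largest residuals) are discarded and the remaining values are averaged.

```python
import numpy as np

def kmedian_outlier_cost(P, C, z):
    """K_1^{-z}(P, C): average distance of the n - z closest points to C."""
    dist = np.sqrt(((P[:, None, :] - C[None, :, :]) ** 2).sum(-1)).min(1)
    return np.sort(dist)[: len(P) - z].mean()        # drop the z farthest points

def lts_cost(P, h, z):
    """LR_1^{-z}(P, h): average |residual| over the n - z best-fitted points.

    P is an (n, d) array whose last column holds y_i; h = (h_1, ..., h_d).
    """
    X, y = P[:, :-1], P[:, -1]
    res = np.abs(y - X @ h[:-1] - h[-1])             # |Res(p_i, h)|
    return np.sort(res)[: len(P) - z].mean()
```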

2. The Layered Sampling Framework

We present the overview of our layered sampling framework. For the sake of completeness, we first introduce the coreset construction method for ordinary k-median/means clustering proposed by (Chen, 2009).

Suppose α and β ≥ 1. A "bi-criteria (α, β)-approximation" means that it contains αk cluster centers, and the induced clustering cost is at most β times the optimum. Usually, finding a bi-criteria approximation is much easier than achieving a single-criterion approximation. For example, one can obtain a bi-criteria approximation for k-median/means clustering in linear time with α = O(1) and β = O(1) (Chen, 2009). Let T = {t_1, t_2, ..., t_{αk}} ⊂ R^d be the obtained (α, β)-approximate solution of the input instance P. For convenience, we use B(c, r) to denote the ball centered at a point c with radius r > 0. At the beginning of Chen's coreset construction algorithm, it takes two carefully designed values r > 0 and N = O(log n), and partitions the space into N + 1 layers H_0, H_1, ..., H_N, where H_0 = ∪_{j=1}^{αk} B(t_j, r) and H_i = (∪_{j=1}^{αk} B(t_j, 2^i r)) \ (∪_{j=1}^{αk} B(t_j, 2^{i−1} r)) for 1 ≤ i ≤ N. It can be proved that P is covered by ∪_{i=0}^N H_i; then the algorithm takes a random sample S_i from each layer P ∩ H_i, and the union ∪_{i=0}^N S_i forms the desired coreset S satisfying the condition (1).

However, this approach cannot directly solve the case with outliers. First, it is not easy to obtain a bi-criteria approximation for the problem of k-median/means clustering with outliers (e.g., in linear time). Moreover, it is challenging to guarantee the condition (1) for any feasible solution C, because the set of outliers could change when C changes (this is also the major challenge for proving the correctness of our method later on). We propose a modified version of Chen's coreset construction method and aim to guarantee (2) for a local range of solutions. We take the k-median clustering with outliers problem as an example. Let C̃ = {c̃_1, ..., c̃_k} ⊂ R^d be a given solution. Assume ε > 0 and N ∈ Z⁺ are two pre-specified parameters. With a slight abuse of notations, we still use H_0, H_1, ..., H_N to denote the layers surrounding C̃, i.e.,

H_0 = ∪_{j=1}^k B(c̃_j, r);   (9)
H_i = (∪_{j=1}^k B(c̃_j, 2^i r)) \ (∪_{j=1}^k B(c̃_j, 2^{i−1} r)) for 1 ≤ i ≤ N.   (10)

In addition, let

H_out = R^d \ ∪_{j=1}^k B(c̃_j, 2^N r).   (11)

Here, we set the value r to satisfy the following condition:

|P ∩ H_out| = (1 + 1/ε) z.   (12)

That is, the union of the layers ∪_{i=0}^N H_i covers n − (1 + 1/ε)z points of P and excludes the farthest (1 + 1/ε)z. Obviously, such a value r always exists. Suppose P' is the set of n − z inliers induced by C̃; then we have

2^N r ≤ (ε/z) Σ_{p∈P'} min_{1≤j≤k} ||p − c̃_j|| = (ε/z)(n − z) K_1^{−z}(P, C̃)   (13)

via the Markov's inequality. Our new coreset contains the following N + 2 parts:

S = S_0 ∪ S_1 ∪ ··· ∪ S_N ∪ S_out,   (14)

where S_i is still a random sample from P ∩ H_i for 0 ≤ i ≤ N, and S_out contains all the (1 + 1/ε)z points in H_out. In Section 3, we will show that the coreset S of (14) satisfies (2) for the k-median clustering with outliers problem (and similarly for the k-means clustering with outliers problem).

For the linear regression with outliers problem, we apply the similar layered sampling framework. Define S(h, r) to be the slab centered at a (d − 1)-dimensional hyperplane h with r > 0, i.e., S(h, r) = {p ∈ R^d | −r ≤ Res(p, h) ≤ r}. Let P be an instance, and let h̃ = (h̃_1, ..., h̃_d) ∈ R^d be a given hyperplane. We divide the space into N + 2 layers H_0, H_1, ..., H_N, H_out, where

H_0 = S(h̃, r);   (15)
H_i = S(h̃, 2^i r) \ S(h̃, 2^{i−1} r) for 1 ≤ i ≤ N;   (16)
H_out = R^d \ S(h̃, 2^N r).   (17)

Similar to (12), we also require the value r to satisfy the following condition:

|P ∩ H_out| = |P \ S(h̃, 2^N r)| = (1 + 1/ε) z.   (18)

And consequently, we have

2^N r ≤ (ε/z)(n − z) LR_1^{−z}(P, h̃).   (19)

Then, we construct the coreset for linear regression with z outliers in the same manner as (14).

3. k-Median/Means Clustering with Outliers

In this section, we provide the details on applying our layered sampling framework to the problem of k-median clustering with outliers; see Algorithm 1. The algorithm and analysis can be easily modified to handle k-means clustering with outliers, where the only difference is that we need to replace (13) by "2^N r ≤ sqrt((ε/z)(n − z) K_2^{−z}(P, C̃))".

Algorithm 1 LAYERED SAMPLING FOR k-MED-OUTLIER
Input: An instance P ⊂ R^d of k-median clustering with z outliers, a solution C̃ = {c̃_1, ..., c̃_k}, and two parameters ε, η ∈ (0, 1).
  1. Let γ = z/(n − z) and N = ⌈log(1/γ)⌉. Compute the value r satisfying (12).
  2. As described in (9), (10), and (11), the space is partitioned into N + 2 layers H_0, H_1, ..., H_N and H_out.
  3. Randomly sample min{O((1/ε²) kd log(d/ε) log(N/η)), |P ∩ H_i|} points, denoted by S_i, from P ∩ H_i for 0 ≤ i ≤ N.
  4. For each point p ∈ S_i, set its weight to be |P ∩ H_i|/|S_i|; let S_H = ∪_{i=0}^N S_i.
Output: S = S_H ∪ (P ∩ H_out).
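To make the construction concrete, the following sketch mirrors the steps of Algorithm 1 on a NumPy instance; the argument `sample_size` stands in for the O((1/ε²) kd log(d/ε) log(N/η)) bound of Step 3, and the helper names are ours rather than part of the algorithm.

```python
import numpy as np

def layered_sampling_kmed(P, C_tilde, z, eps, sample_size, rng=None):
    """Sketch of Algorithm 1 (layered sampling for k-median with z outliers)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(P)
    gamma = z / (n - z)
    N = max(0, int(np.ceil(np.log2(1.0 / gamma))))
    dist = np.sqrt(((P[:, None, :] - C_tilde[None, :, :]) ** 2).sum(-1)).min(1)

    # Step 1: choose r so that (1 + 1/eps) z points fall outside B(C~, 2^N r), cf. (12)
    m_out = int(np.ceil((1 + 1.0 / eps) * z))
    r = np.sort(dist)[n - m_out - 1] / (2 ** N)

    pts, wts = [], []
    for i in range(N + 1):                               # Steps 2-4: sample each layer
        if i == 0:
            layer = dist <= r
        else:
            layer = (dist <= 2 ** i * r) & (dist > 2 ** (i - 1) * r)
        idx = np.flatnonzero(layer)
        if len(idx) == 0:
            continue
        m = min(sample_size, len(idx))
        chosen = rng.choice(idx, size=m, replace=False)
        pts.append(P[chosen])
        wts.append(np.full(m, len(idx) / m))             # weight |P ∩ H_i| / |S_i|
    out_idx = np.flatnonzero(dist > 2 ** N * r)          # H_out is kept entirely, unit weights
    pts.append(P[out_idx])
    wts.append(np.ones(len(out_idx)))
    return np.vstack(pts), np.concatenate(wts)
```

The intended use of the returned weighted set is to feed it to an alternating-minimization solver such as k-means--, with the total outlier weight fixed to z.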

Theorem 1. Algorithm 1 returns a point set S having the size |S| = Õ((1/ε²) kd) + (1 + 1/ε)z.¹ Moreover, with probability at least 1 − η, for any L > 0 and any solution C ∈ C̃ ± L,

K_1^{−z}(S, C) ∈ K_1^{−z}(P, C) ± ε (K_1^{−z}(P, C̃) + L).   (20)

Here, S is a weighted instance of k-median clustering with outliers, and the total weight of outliers is z (see Remark 1).

¹The asymptotic notation Õ(f) = O(f · polylog(d/(εγη))).

Remark 2. (1) The running time of Algorithm 1 is O(knd). For each point p ∈ P, we compute its shortest distance to C̃, min_{1≤j≤k} ||p − c̃_j||; then we select the farthest (1 + 1/ε)z points and compute the value r by running the linear time selection algorithm (Blum et al., 1973); finally, we obtain the N + 1 layers H_i with 0 ≤ i ≤ N and take the samples S_0, S_1, ..., S_N from them.

(2) Comparing with the standard coreset (1), our result contains an additive error ε (K_1^{−z}(P, C̃) + L) in (20) that depends on the initial objective value K_1^{−z}(P, C̃) and the size L of the solution range. In particular, the smaller the range size L, the lower the error of our coreset.

(3) The algorithm of (Chen et al., 2018) also returns a summary for compressing the input data. But there are two major differences comparing with our result. First, their summary guarantees a constant factor of distortion on the clustering cost, while our error approaches 0 if ε is small enough. Second, their construction algorithm (called "successive sampling", from (Mettu & Plaxton, 2004)) needs to scan the data in multiple passes, while our Algorithm 1 is much simpler and only needs to read the data in one pass. We also compare these two methods in our experiments.

To prove Theorem 1, we first show that S_H is a good approximation of P \ H_out. Fixing a solution C ∈ C̃ ± L, we view the distance from each point p ∈ P to C, i.e., min_{1≤j≤k} ||p − c_j||, as a random variable x_p. For any point p ∈ P ∩ H_i with 0 ≤ i ≤ N, we have the following bounds for x_p. Suppose p is covered by B(c̃_{j_1}, 2^i r), and let the nearest neighbor of p in C be c_{j_2}. Then we have the upper bound

x_p = ||p − c_{j_2}|| ≤ ||p − c_{j_1}|| ≤ ||p − c̃_{j_1}|| + ||c̃_{j_1} − c_{j_1}|| ≤ 2^i r + L.   (21)

Similarly, we have the lower bound

x_p ≥ max{2^{i−1} r − L, 0} if i ≥ 1;  x_p ≥ 0 if i = 0.   (22)

Therefore, we can take a sufficiently large random sample Ŝ_i from P ∩ H_i, such that (1/|Ŝ_i|) Σ_{p∈Ŝ_i} x_p ≈ (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p with certain probability. Specifically, combining (21) and (22), we have the following lemma through the Hoeffding's inequality.

Lemma 1. Let η ∈ (0, 1). If we randomly sample O((1/ε²) log(1/η)) points, denoted by Ŝ_i, from P ∩ H_i, then with probability 1 − η, we have

| (1/|Ŝ_i|) Σ_{p∈Ŝ_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | ≤ ε (2^i r + 2L).

Lemma 1 is only for a fixed solution C. To guarantee the result for any C ∈ C̃ ± L, we discretize the range C̃ ± L. Imagine that we build a grid inside each B(c̃_j, L) for 1 ≤ j ≤ k, where the grid side length is (ε/√d) L. Denote by G_j the set of grid points inside each B(c̃_j, L); then G = G_1 × G_2 × ··· × G_k contains O((2√d/ε)^{kd}) k-tuple points of C̃ ± L in total. We increase the sample size in Lemma 1 by replacing η with η/(N · |G|) in the sample size "O((1/ε²) log(1/η))". As a consequence, through taking the union bound for the success probability, we have the following result.

Lemma 2. Let S_i be the sample obtained from P ∩ H_i in Step 3 of Algorithm 1 for 0 ≤ i ≤ N. Then with probability 1 − η,

| (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | ≤ ε (2^i r + 2L)

for each i ∈ {0, 1, ..., N} and any C ∈ G.

Next, we show that the guarantee of Lemma 2 extends to any C ∈ C̃ ± L (in particular the solutions in (C̃ ± L) \ G). For any solution C = {c_1, ..., c_k} ∈ C̃ ± L, let C' = {c'_1, ..., c'_k} be its nearest neighbor in G, i.e., c'_j is the grid point of the cell containing c_j in G_j, for 1 ≤ j ≤ k. Also, denote by x'_p the distance min_{1≤j≤k} ||p − c'_j||. Then we bound the error | (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | through C'. By using the triangle inequality, we have

| (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p |   (23)
  ≤ | (1/|S_i|) Σ_{p∈S_i} x_p − (1/|S_i|) Σ_{p∈S_i} x'_p |   [term (a)]
  + | (1/|S_i|) Σ_{p∈S_i} x'_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x'_p |   [term (b)]
  + | (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x'_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p |.   [term (c)]

In (23), the term (b) is bounded by Lemma 2 since C' ∈ G. To bound the terms (a) and (c), we study the difference |x_p − x'_p| for each point p. Suppose the nearest neighbor of p in C (resp., C') is c_{j_1} (resp., c'_{j_2}). Then

x_p = ||p − c_{j_1}|| ≤ ||p − c_{j_2}|| ≤ ||p − c'_{j_2}|| + ||c'_{j_2} − c_{j_2}|| ≤ ||p − c'_{j_2}|| + εL = x'_p + εL,   (24)

where the last inequality comes from the fact that c_{j_2} and c'_{j_2} are in the same grid cell with side length (ε/√d) L. Similarly, we have x'_p ≤ x_p + εL. Overall, |x_p − x'_p| ≤ εL. As a consequence, the terms (a) and (c) in (23) are both bounded by εL. Overall, (23) becomes

| (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | ≤ O(ε) (2^i r + L).   (25)

For convenience, we use P_H to denote the set ∪_{i=0}^N (P ∩ H_i).

Lemma 3. Let S_0, S_1, ..., S_N be the samples obtained in Algorithm 1. Then, with probability 1 − η,

(1/(n − z)) | Σ_{i=0}^N (|P ∩ H_i|/|S_i|) Σ_{p∈S_i} x_p − Σ_{p∈P_H} x_p | ≤ O(ε) (K_1^{−z}(P, C̃) + L)   (26)

for any C ∈ C̃ ± L.

Proof. For convenience, let Err_i = | (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | for 0 ≤ i ≤ N. We directly have Err_i ≤ O(ε)(2^i r + L) from (25). Moreover, the left-hand side of (26) = (1/(n − z)) Σ_{i=0}^N |P ∩ H_i| · Err_i

≤ (O(ε)/(n − z)) Σ_{i=0}^N |P ∩ H_i| · (2^i r + L)
= O(ε) Σ_{i=0}^N (|P ∩ H_i|/(n − z)) 2^i r + O(ε) L.   (27)

It is easy to know that the term Σ_{i=0}^N (|P ∩ H_i|/(n − z)) 2^i r of (27) is at most (1/(n − z)) (|P ∩ H_0| r + 2 Σ_{p∈P_H\H_0} x_p) ≤ r + 2 K_1^{−z}(P, C̃). Note we set N = ⌈log(1/γ)⌉ in Algorithm 1. Together with (13), we know r ≤ ε K_1^{−z}(P, C̃), and thus Σ_{i=0}^N (|P ∩ H_i|/(n − z)) 2^i r ≤ O(1) K_1^{−z}(P, C̃). So (26) is true.

Below, we always assume that (26) is true and consider proving (20) of Theorem 1. The set P is partitioned into two parts, P_in^C and P_out^C, by C, where P_out^C is the set of the z farthest points to C (i.e., the outliers) and P_in^C = P \ P_out^C. Similarly, the coreset S is also partitioned into two parts S_in^C and S_out^C by C, where S_out^C is the set of outliers with total weight z. In other words, we need to prove

K_1^{−z}(S, C) = (1/(n − z)) Σ_{p∈S_in^C} w(p) x_p ≈ (1/(n − z)) Σ_{p∈P_in^C} x_p = K_1^{−z}(P, C).   (28)

Consider two cases: (i) P_H \ P_in^C = ∅ and (ii) P_H \ P_in^C ≠ ∅. Intuitively, case (i) indicates that the set P_in^C occupies the whole region ∪_{i=0}^N H_i; case (ii) indicates that the region ∪_{i=0}^N H_i contains some outliers from P_out^C. In the following subsections, we prove that (20) holds for both cases. For ease of presentation, we use w(U) to denote the total weight of a weighted point set U (not to be confused with |U|, which is the number of points in U).

3.1. Case (i): P_H \ P_in^C = ∅

We prove the following key lemma first.

Lemma 4. If P_H \ P_in^C = ∅, then S_in^C = S_H ∪ (P_in^C \ P_H) and S_out^C = P_out^C (recall S_H = ∪_{i=0}^N S_i from Algorithm 1).

Proof. First, the assumption P_H \ P_in^C = ∅ implies

P_H ⊂ P_in^C;   (29)
|P_in^C \ P_H| = |P_in^C| − |P_H|.   (30)

In addition, since S_H ⊂ P_H, we have S_H ⊂ P_in^C from (29). Consequently, the set S_H ∪ (P_in^C \ P_H) ⊂ P_in^C. Therefore, for any p ∈ S_H ∪ (P_in^C \ P_H) and any q ∈ P_out^C, x_p ≤ x_q. Moreover, the set S \ (S_H ∪ (P_in^C \ P_H))

= (S_H ∪ (P \ P_H)) \ (S_H ∪ (P_in^C \ P_H))
= (P \ P_H) \ (P_in^C \ P_H)
= P \ P_in^C = P_out^C   (by (29)).   (31)

Note |P_out^C| = z. As a consequence, S_in^C should be exactly the set S_H ∪ (P_in^C \ P_H), and S_out^C = P_out^C.

Lemma 5. If P_H \ P_in^C = ∅, (20) is true.

Proof. Because the set S_in^C is equal to S_H ∪ (P_in^C \ P_H) from Lemma 4, the objective value K_1^{−z}(S, C) = (1/(n − z)) ( Σ_{p∈S_H} w(p) x_p + Σ_{p∈P_in^C\P_H} x_p ) =

(1/(n − z)) ( Σ_{i=0}^N (|P ∩ H_i|/|S_i|) Σ_{p∈S_i} x_p + Σ_{p∈P_in^C\P_H} x_p ).   (32)

From Lemma 3, the value of (32) is no larger than

(1/(n − z)) ( Σ_{p∈P_H} x_p + O(ε)(n − z)(K_1^{−z}(P, C̃) + L) + Σ_{p∈P_in^C\P_H} x_p ).   (33)

Note that P_H \ P_in^C = ∅, and thus the sum of the two terms Σ_{p∈P_H} x_p and Σ_{p∈P_in^C\P_H} x_p in (33) is Σ_{p∈P_in^C} x_p. Therefore, (33)

≤ (1/(n − z)) Σ_{p∈P_in^C} x_p + O(ε)(K_1^{−z}(P, C̃) + L)
= K_1^{−z}(P, C) + O(ε)(K_1^{−z}(P, C̃) + L).   (34)

Similarly, we have K_1^{−z}(S, C) ≥ K_1^{−z}(P, C) − O(ε)(K_1^{−z}(P, C̃) + L). Thus, (20) is true.

3.2. Case (ii): P_H \ P_in^C ≠ ∅

Since S \ P_H = P \ P_H are the outermost (1 + 1/ε)z points to C̃, we have the following claim first (due to the space limit, please refer to our supplement for the detailed proof).

Claim 1. Either S_in^C \ P_H ⊆ P_in^C \ P_H or P_in^C \ P_H ⊆ S_in^C \ P_H is true.

Lemma 6. If P_H \ P_in^C ≠ ∅, we have x_p ≤ 2^N r + L for any p ∈ S_in^C ∪ P_in^C ∪ P_H.

Proof. We consider the points in the three parts P_H, P_in^C, and S_in^C separately.

(1) Due to (21), we have x_p ≤ 2^N r + L for any p ∈ P_H.

(2) Arbitrarily select one point p_0 from P_H \ P_in^C. By (21) again, we have x_{p_0} ≤ 2^N r + L. Also, because P_H \ P_in^C ⊂ P_out^C, we directly have x_p ≤ x_{p_0} for any p ∈ P_in^C. Namely, x_p ≤ 2^N r + L for any p ∈ P_in^C.

(3) Below, we consider the points in S_in^C. If w(S_in^C ∩ P_H) > |P_in^C ∩ P_H|, i.e., P_H contains more inliers of S than of P, then the outer region H_out should contain fewer inliers of S than of P. Thus, from Claim 1, we have S_in^C \ P_H ⊆ P_in^C \ P_H. Hence, S_in^C = (S_in^C \ P_H) ∪ (S_in^C ∩ P_H) ⊆ (P_in^C \ P_H) ∪ P_H = P_in^C ∪ P_H. From (1) and (2), we know x_p ≤ 2^N r + L for any p ∈ S_in^C.

Else, w(S_in^C ∩ P_H) ≤ |P_in^C ∩ P_H|. Then w(S_in^C ∩ S_H) ≤ |P_in^C ∩ P_H| since S_in^C ∩ P_H = S_in^C ∩ S_H. Because w(S_H) = |P_H|, we have

w(S_H \ S_in^C) ≥ |P_H \ P_in^C|.   (35)

Also, the assumption P_H \ P_in^C ≠ ∅ implies w(S_H \ S_in^C) ≥ |P_H \ P_in^C| > 0, i.e.,

S_H \ S_in^C ≠ ∅.   (36)

Arbitrarily select one point p_0 from S_H \ S_in^C. We know x_{p_0} ≤ 2^N r + L since p_0 ∈ S_H \ S_in^C ⊂ P_H. Also, for any point p ∈ S_in^C, we have x_p ≤ x_{p_0} because p_0 ∈ S_H \ S_in^C ⊂ S_out^C. Therefore x_p ≤ 2^N r + L.

Lemma 7. If P_H \ P_in^C ≠ ∅, (20) is true.

Proof. We prove the upper bound of K_1^{−z}(S, C) first. We analyze the clustering costs of the two parts S_in^C ∩ S_H and S_in^C \ S_H separately:

K_1^{−z}(S, C) = (1/(n − z)) ( Σ_{p∈S_in^C∩S_H} w(p) x_p  [part (a)]  +  Σ_{p∈S_in^C\S_H} x_p  [part (b)] ).

Note that the points of S_in^C \ S_H have unit weight (since S_in^C \ S_H ⊆ P \ P_H are points from the outermost (1 + 1/ε)z points of P). Obviously, the part (a) is no larger than

Σ_{p∈S_H} w(p) x_p = Σ_{i=0}^N (|P ∩ H_i|/|S_i|) Σ_{p∈S_i} x_p
≤ Σ_{p∈P_H} x_p + O(ε)(n − z)(K_1^{−z}(P, C̃) + L)   (37)

from Lemma 3. The set P_H consists of two parts, P_H ∩ P_in^C and P_H \ P_in^C. From Lemma 6 and the fact |P_H \ P_in^C| ≤ |P_out^C| = z, we know Σ_{p∈P_H\P_in^C} x_p ≤ z(2^N r + L). Thus, the upper bound of the part (a) becomes

Σ_{p∈P_H∩P_in^C} x_p + z(2^N r + L) + O(ε)(n − z)(K_1^{−z}(P, C̃) + L).   (38)

To bound the part (b), we consider the size |S_in^C \ S_H|. Since the total weight of outliers is z,

w(S_H ∩ S_in^C) = w(S_H) − w(S_H ∩ S_out^C) ≥ w(S_H) − z = |P_H| − z ≥ |P_H ∩ P_in^C| − z.   (39)

Together with the fact w(S_H ∩ S_in^C) + |S_in^C \ S_H| = |P_H ∩ P_in^C| + |P_in^C \ P_H| = n − z, we have

|S_in^C \ S_H| ≤ |P_in^C \ P_H| + z.   (40)

Therefore |(S_in^C \ S_H) \ (P_in^C \ P_H)| ≤ z from Claim 1. Through Lemma 6 again, we know that the part (b) is no larger than Σ_{p∈P_in^C\P_H} x_p + |(S_in^C \ S_H) \ (P_in^C \ P_H)| · (2^N r + L)

≤ Σ_{p∈P_in^C\P_H} x_p + z(2^N r + L).   (41)

Putting (38) and (41) together, we have

K_1^{−z}(S, C) ≤ K_1^{−z}(P, C) + O(ε)(K_1^{−z}(P, C̃) + L) + (2z/(n − z))(2^N r + L).   (42)

Recall γ = z/(n − z) in Algorithm 1. If γ ≥ ε, the size of our coreset S is at least (1 + 1/ε)z ≥ n; that is, S contains all the points of P. For the other case ε > γ, together with (13), the term (2z/(n − z))(2^N r + L) in (42) is at most

O(ε)(K_1^{−z}(P, C̃) + L).   (43)

Overall, K_1^{−z}(S, C) ≤ K_1^{−z}(P, C) + O(ε)(K_1^{−z}(P, C̃) + L) via (42). So we complete the proof for the upper bound.

Now we consider the lower bound of K_1^{−z}(S, C). Denote X = S_H ∪ (P_in^C \ P_H) and Y = X \ S_in^C. Obviously,

K_1^{−z}(S, C) ≥ (1/(n − z)) ( Σ_{p∈X} w(p) x_p  [part (c)]  −  Σ_{p∈Y} w(p) x_p  [part (d)] ).

From Lemma 3, the part (c) is at least

Σ_{p∈P_H} x_p − O(ε)(n − z)(K_1^{−z}(P, C̃) + L) + Σ_{p∈P_in^C\P_H} x_p
≥ Σ_{p∈P_in^C} x_p − O(ε)(n − z)(K_1^{−z}(P, C̃) + L).   (44)

Further, since w(Y) ≤ z and Y ⊆ X ⊆ P_in^C ∪ P_H, the part (d) is no larger than z(2^N r + L) from Lemma 6. Using the similar manner as for proving the upper bound, we know that K_1^{−z}(S, C) ≥ K_1^{−z}(P, C) − O(ε)(K_1^{−z}(P, C̃) + L).

4. Linear Regression with Outliers

In this section, we consider the problem of linear regression with outliers. Our algorithm and analysis are for the objective function LR_1^{−z}, and the ideas can be extended to handle the objective function LR_2^{−z}.

Algorithm 2 LAYERED SAMPLING FOR LIN1-OUTLIER
Input: An instance P ⊂ R^d of linear regression with z outliers, a solution h̃ = (h̃_1, ..., h̃_d), and two parameters ε, η ∈ (0, 1).
  1. Let γ = z/(n − z) and N = ⌈log(1/γ)⌉. Compute the value r satisfying (18).
  2. As described in (15), (16), and (17), the space is partitioned into N + 2 layers H_0, H_1, ..., H_N and H_out.
  3. Randomly sample min{O((1/ε²) d log(d/ε) log(N/η)), |P ∩ H_i|} points, denoted by S_i, from P ∩ H_i for 0 ≤ i ≤ N.
  4. For each point p ∈ S_i, set its weight to be |P ∩ H_i|/|S_i|; let S_H = ∪_{i=0}^N S_i.
Output: S = S_H ∪ (P ∩ H_out).

Theorem 2. Algorithm 2 returns a point set S having the size |S| = Õ((1/ε²) d) + (1 + 1/ε)z. Moreover, with probability at least 1 − η, for any L > 0 and any solution h ∈ h̃ ± L, we have

LR_1^{−z}(S, h) ∈ LR_1^{−z}(P, h) ± ε (LR_1^{−z}(P, h̃) + L).   (45)

Here, S is a weighted instance of linear regression with outliers, and the total weight of outliers is z (see Remark 1).

We still use P_H to denote the set ∪_{i=0}^N (P ∩ H_i). First, we need to prove that S_H is a good approximation of P_H. Given a hyperplane h, we define a random variable x_p = |Res(p, h)| for each p ∈ P. If p ∈ H_i for 0 ≤ i ≤ N, similar to (21) and (22), we have the following bounds for x_p: x_p ≤ 2^i r + L; x_p ≥ max{2^{i−1} r − L, 0} if i ≥ 1 and x_p ≥ 0 if i = 0.

Then, we can apply the similar idea of Lemma 3 to obtain the following lemma, where the only difference is about the discretization of h̃ ± L. Recall that h̃ is defined by the coefficients h̃_1, ..., h̃_d and the input set P is normalized within the region R_D. We build a grid inside each vertical segment l_j u_j for 0 ≤ j ≤ d − 1, where l_0 = (0, ..., 0, h̃_d − L), u_0 = (0, ..., 0, h̃_d + L), and

l_j = (0, ..., 0, D, 0, ..., 0, h̃_j D + h̃_d − L),   (46)
u_j = (0, ..., 0, D, 0, ..., 0, h̃_j D + h̃_d + L)   (47)

(with the value D in the j-th coordinate) for j ≠ 0; the grid length is (ε/2d) L. Denote by G_j the set of grid points inside the segment l_j u_j. Obviously, G = G_0 × G_1 × ··· × G_{d−1} contains (4d/ε)^d d-tuple points, and each tuple determines a (d − 1)-dimensional hyperplane in h̃ ± L; moreover, we have the following claim (see the proof in our supplement).

Claim 2. For each h ∈ h̃ ± L, there exists a hyperplane h' determined by a d-tuple of points from G, such that |Res(p, h) − Res(p, h')| ≤ εL for any p ∈ P.

Lemma 8. Let S_0, S_1, ..., S_N be the samples obtained in Algorithm 2. Then, with probability 1 − η,

(1/(n − z)) | Σ_{i=0}^N (|P ∩ H_i|/|S_i|) Σ_{p∈S_i} x_p − Σ_{p∈P_H} x_p | ≤ O(ε) (LR_1^{−z}(P, h̃) + L)   (48)

for any h ∈ h̃ ± L.

We fix a solution h ∈ h̃ ± L. Similar to the proof of Theorem 1 in Section 3, we also consider the two parts P_in^h and P_out^h of P partitioned by h, where P_out^h is the set of the z farthest points to the hyperplane h (i.e., the outliers) and P_in^h = P \ P_out^h. Similarly, S is also partitioned into two parts S_in^h and S_out^h by h, where S_out^h is the set of outliers with total weight z. For case (i) P_H \ P_in^h = ∅ and case (ii) P_H \ P_in^h ≠ ∅, we can apply almost the identical ideas of Sections 3.1 and 3.2, respectively, to prove (45).
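Analogously to the sketch given for Algorithm 1, the residual-based layer partition used by Algorithm 2 (Steps 1-2) can be written as follows; the function name and the layer-index computation are illustrative choices of ours under the same NumPy conventions as before.

```python
import numpy as np

def regression_layers(P, h_tilde, z, eps):
    """Partition P into layers H_0, ..., H_N and H_out by residual to h~ (Steps 1-2, sketch)."""
    n = len(P)
    X, y = P[:, :-1], P[:, -1]
    res = np.abs(y - X @ h_tilde[:-1] - h_tilde[-1])    # x_p = |Res(p, h~)|
    gamma = z / (n - z)
    N = max(0, int(np.ceil(np.log2(1.0 / gamma))))
    m_out = int(np.ceil((1 + 1.0 / eps) * z))
    r = np.sort(res)[n - m_out - 1] / (2 ** N)          # so that |P ∩ H_out| = (1 + 1/eps) z, cf. (18)
    # layer index: 0 if res <= r, i if res in (2^{i-1} r, 2^i r], N + 1 for H_out
    idx = np.clip(np.ceil(np.log2(np.maximum(res / r, 1e-12))), 0, N + 1).astype(int)
    idx[res <= r] = 0
    return idx, r
```

The sampling and weighting steps (Steps 3-4) are then identical to the sketch given for Algorithm 1, with the layer membership taken from the returned indices.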

5. Experiments

For both Algorithms 1 and 2, we need to compute an initial solution C̃ or h̃ first. For k-median/means clustering, we run the algorithm of Local Search with Outliers from (Gupta et al., 2017a) on a small sample of size O(k) to obtain the k initial centers. We do not directly use the k-means++ method (Arthur & Vassilvitskii, 2007a) to seed the k initial centers because it is sensitive to outliers. For linear regression, we run the standard linear regression algorithm on a small random sample of size O(z) to compute an initial solution.

Different coreset methods.  Each of the following methods returns a weighted set as the coreset, and then we run the alternating minimization algorithm k-means-- (Chawla & Gionis, 2013a) (or the method of (Shen & Sanghavi, 2019) for linear regression) on it to obtain the solution. For fairness, we keep the coresets from different methods at the same coreset size for each instance.

• Layered Sampling (LaySam), i.e., Algorithms 1 and 2 proposed in this paper.

• Uniform Sampling (UniSam). The most natural and simple method is to take a sample S uniformly at random from the input data set P, where each sampled point has the weight |P|/|S|.

• Uniform Sampling + Nearest Neighbor Weight (NN) (Gupta et al., 2017a; Chen et al., 2018). Similar to UniSam, we also take a random sample S from the input data set P. For each p ∈ P, we assign it to its nearest neighbor in S; for each q ∈ S, we set its weight to be the number of points assigned to it.

• Summary (Chen et al., 2018). It is a method to construct the coreset for k-median/means clustering with outliers by successively sampling and removing points from the original data until the number of the remaining points is small enough.

The above LaySam, UniSam and NN are also used for linear regression in our experiments. We run 10 trials for each case and take the average. All the experimental results were obtained on a Ubuntu server with a 2.4GHz E5-2680V4 and 256GB main memory; the algorithms were implemented in Matlab R2018b.

Performance Measures.  The following measures are taken into account in the experiments (a short illustrative sketch of these measures is given after the dataset list below).

• ℓ1-loss: K_1^{−z} or LR_1^{−z}.

• ℓ2-loss: K_2^{−z} or LR_2^{−z}.

• recall/precision. Let O* and O be the sets of outliers with respect to the optimal solution and our obtained solution, respectively. recall = |O ∩ O*|/|O*| and precision = |O ∩ O*|/|O|. Since |O| = |O*| = z, recall = precision = |O ∩ O*|/z.

• pre-recall. It indicates the proportion of O* that is included in the coreset. Let S be the coreset; pre-recall = |S ∩ O*|/|O*|. We pay particular attention to this measure, because the outliers could be quite important and may reveal some useful information (e.g., ). For example, as for clustering, if we do not have any prior knowledge of a given biological dataset, the outliers could be from an unknown tiny species. Consequently, it is preferable to keep such information when compressing the data. More detailed discussion on the significance of outliers can be found in (Beyer & Sendhoff, 2007; Zimek et al., 2012; Moitra, 2018; Goodfellow et al., 2018).

Datasets.  We consider the following datasets in our experiments.

• syncluster. We generate the synthetic data as follows: first we create k centers with each dimension randomly located in [0, 100]; then we generate the points following standard Gaussian distributions around the centers.

• synregression. First, we randomly set the d coefficients of the hyperplane h in [−5, +5] and construct the x_i in [0, 10]^{d−1} by uniform sampling. Then let y_i be the inner product of (x_i, 1) and h. Finally we randomly perturb each y_i by N(0, 1).

• 3DSpatial (n = 434874, d = 4). This dataset was constructed by adding the elevation information to a 2D road network in North Jutland, Denmark (Kaul et al., 2013).

• covertype (n = 581012, d = 10). It is a forest cover type dataset from Jock A. Blackard (UCI Machine Learning Repository), and we select its first 10 attributes.

• skin (n = 245057, d = 3). The skin dataset is collected by randomly sampling B, G, R values from face images of various age groups, and we select its first three dimensions (Bhatt & Dhall).

• SGEMM (n = 241600, d = 15). It contains the running times for multiplying two 2048 x 2048 matrices using a GPU OpenCL SGEMM kernel with varying parameters (Ballester-Ripoll et al., 2017).

• PM2.5 (n = 41757, d = 11). It is a data set containing the PM2.5 data of the US Embassy in Beijing (Liang et al., 2015).
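As a small illustration of the measures recall, precision and pre-recall defined above (assuming the outlier sets are represented by index sets), one can compute them as follows.

```python
def outlier_metrics(coreset_ids, found_outliers, true_outliers):
    """recall / precision / pre-recall as defined in the Performance Measures list (sketch)."""
    O_star, O, S = set(true_outliers), set(found_outliers), set(coreset_ids)
    recall = len(O & O_star) / len(O_star)
    precision = len(O & O_star) / len(O)      # equals recall when |O| = |O*| = z
    pre_recall = len(S & O_star) / len(O_star)
    return recall, precision, pre_recall
```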

For each dataset, we randomly pick z points to be outliers by perturbing their locations in each dimension. We use a parameter σ to measure the extent of perturbation. For example, we consider the Gaussian distribution N(0, σ) and the uniform distribution on [−σ, +σ]. So the larger the parameter σ is, the more diffused the outliers will be. For simplicity, we use notations of the form [dataset]-[distribution]-σ to indicate the datasets, e.g., syncluster-gauss-σ.

5.1. Coreset Construction Time

We fix the coreset size and vary the data size n of the synthetic datasets, and show the coreset construction times in Figures 2(a) and 2(b). It is easy to see that the construction time of NN is larger than the other construction times by several orders of magnitude. It is not out of expectation that UniSam is the fastest (because it does not need any operation except uniform random sampling). Our LaySam lies in between UniSam and Summary.

Figure 2. Coreset construction time w.r.t. data size. (a) Clustering: coreset size = 10^4, d = 5 and k = 10; (b) Linear Regression: coreset size = 10^4 and d = 20.

We also study the influence of k (for clustering) on the construction time by testing the synthetic datasets and the real-world dataset covertype (see Figures 3(a) and 3(b)).

Figure 3. Construction time w.r.t. k. (a) syncluster with data size n = 10^6, d = 20; (b) covertype.

5.2. Performance

Figures 4(a)-4(e) show the performances of clustering on ℓ2-loss with different σs. The results on ℓ1-loss are very similar (more results are summarized in Tables 1 and 2). We can see that LaySam outperforms the other methods on both synthetic and real-world datasets. Moreover, its performance remains quite stable (with small standard deviation) and is also robust when σ increases. UniSam works well when σ is small, but it becomes very unstable when σ grows large. Both LaySam and UniSam outperform Summary on most datasets.

Similar comparisons of the performance for linear regression are shown in Figures 5(a)-5(c).

The three coreset methods achieve very close values of recall and precision. But UniSam has much lower pre-recall than those of Summary and LaySam.

6. Conclusion

To reduce the time complexities of existing algorithms for clustering and linear regression with outliers, we propose a new variant of coreset method which can guarantee the quality for any solution in a local range surrounding the given initial solution. In the future, it is worth considering applying our framework to a broader range of robust optimization problems, such as logistic regression with outliers and Gaussian mixture models with outliers.

Figure 4. Clustering: ℓ2-loss and stability w.r.t. σ. (a) syncluster-gauss-σ, with d = 20, k = 10, z = 2·10^4 and coreset size = 5·10^4; (b) 3DSpatial-gauss-σ, with k = 5, z = 4000 and coreset size = 10^4; (c) covertype-gauss-σ, with k = 20, z = 10^4 and coreset size = 3·10^4; (d) covertype-uniform-σ, with k = 20, z = 10^4 and coreset size = 3·10^4; (e) skin-uniform-σ, with k = 40, z = 3000 and coreset size = 10^4. (Each panel compares LaySam, Summary and UniSam.)

Figure 5. Linear Regression: ℓ2-loss and stability w.r.t. σ. (a) synregression-gauss-σ, with d = 20, z = 10^4 and coreset size = 2·10^4; (b) SGEMM-gauss-σ, with z = 4000 and coreset size = 10^4; (c) PM2.5-uniform-σ, with z = 1000 and coreset size = 3000. (Each panel compares LaySam and UniSam.)

Table 1. Performance on clustering with outliers.

(a) syncluster-gauss-σ, with d = 20, k = 10, z = 2·10^4 and coreset size = 5·10^4.
σ        | 20                       | 100                      | 200
Coreset  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary
ℓ1-loss  | 3.976   3.976   4.073    | 3.976   3.976   4.067    | 3.977   4.048   4.074
ℓ2-loss  | 18.01   18      18.9     | 18.01   18.01   18.84    | 18.01   18.14   18.91
Prec     | 1       1       1        | 1       1       1        | 1       1       1
Pre-Rec  | 1       0.0506  1        | 1       0.0499  1        | 1       0.0503  1

(b) covertype-gauss-σ, with k = 20, z = 10^4 and coreset size = 3·10^4.
σ        | 20                       | 100                      | 200
Coreset  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary
ℓ1-loss  | 1.781   1.781   1.868    | 1.779   1.785   1.871    | 1.786   1.799   1.853
ℓ2-loss  | 3.501   3.501   3.818    | 3.505   3.523   3.874    | 3.515   3.576   3.767
Prec     | 1       1       1        | 1       1       1        | 1       1       1
Pre-Rec  | 1       0.0524  1        | 1       0.0511  1        | 1       0.0547  1

(c) skin-uniform-σ, with k = 40, z = 3000 and coreset size = 10^4.
σ                  | 20                       | 100                      | 200
Coreset            | LaySam  UniSam  Summary  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary
ℓ1-loss            | 0.1906  0.1906  0.1925   | 0.1883  0.1902  0.1914   | 0.1954  0.2032  0.1957
ℓ2-loss (×10^-2)   | 6.46    6.531   6.579    | 6.449   6.849   6.565    | 6.471   7.173   6.565
Prec               | 0.9993  0.9995  0.9993   | 1       1       1        | 1       1       1
Pre-Rec            | 0.9997  0.0403  1        | 1       0.0403  1        | 1       0.0437  1

Table 2. Performance on linear regression with outliers.

(a) synregression-gauss-σ, with d = 20, z = 10^4 and coreset size = 2·10^4.
σ        | 20              | 300             | 600
Coreset  | LaySam  UniSam  | LaySam  UniSam  | LaySam  UniSam
ℓ1-loss  | 0.7996  0.7982  | 0.7989  0.8007  | 0.7987  0.7998
ℓ2-loss  | 1.003   0.9993  | 1.001   1.01    | 1.002   1.025
Prec     | 0.9305  0.931   | 0.9953  0.9953  | 0.9975  0.9978
Pre-Rec  | 0.9535  0.0108  | 0.9968  0.0093  | 0.9975  0.0098

(b) PM2.5-uniform-σ, with z = 1000 and coreset size = 3000.
σ        | 20              | 500             | 1000
Coreset  | LaySam  UniSam  | LaySam  UniSam  | LaySam  UniSam
ℓ1-loss  | 0.6292  0.6244  | 0.637   0.6368  | 0.6349  0.6791
ℓ2-loss  | 0.7179  0.7075  | 0.744   0.7473  | 0.7456  0.8819
Prec     | 0.9188  0.921   | 0.99    0.9901  | 0.9918  0.9917
Pre-Rec  | 0.9714  0.0732  | 0.999   0.0725  | 1       0.0734

References Chawla, S. and Gionis, A. k-means–: A unified approach to clustering and outlier detection. In Proceedings of the Arthur, D. and Vassilvitskii, S. K-means++: The advan- 2013 SIAM International Conference on , pp. tages of careful seeding. In Proceedings of the Eigh- 189–197. SIAM, 2013b. teenth Annual ACM-SIAM Symposium on Discrete Al- gorithms, SODA ’07, pp. 1027–1035, Philadelphia, PA, Chen, J., Azer, E. S., and Zhang, Q. A practical algorithm USA, 2007a. Society for Industrial and Applied Mathe- for distributed clustering and outlier detection. In Ad- matics. ISBN 978-0-898716-24-5. vances in Neural Information Processing Systems 31: An- nual Conference on Neural Information Processing Sys- Arthur, D. and Vassilvitskii, S. k-means++: The advantages tems 2018, NeurIPS 2018, 3-8 December 2018, Montreal,´ of careful seeding. In Proceedings of the eighteenth an- Canada, pp. 2253–2262, 2018. nual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Society for Industrial and Applied Mathemat- Chen, K. A constant factor approximation algorithm for ics, 2007b. k-median clustering with outliers. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete al- Awasthi, P. and Balcan, M.-F. Center based clustering: A gorithms, pp. 826–835. Society for Industrial and Applied foundational perspective. 2014. Mathematics, 2008. Bachem, O., Lucic, M., and Krause, A. Practical core- Chen, K. On coresets for k-median and k-means clustering set constructions for machine learning. arXiv preprint in metric and euclidean spaces and their applications. arXiv:1703.06476, 2017. SIAM Journal on Computing, 39(3):923–947, 2009. Ballester-Ripoll, R., Paredes, E. G., and Pajarola, R. Sobol Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Ma- tensor trains for global sensitivity analysis. CoRR, honey, M. W. Sampling algorithms and coresets for ell p abs/1712.00233, 2017. regression. SIAM Journal on Computing, 38(5):2060–\ 2078, 2009. Beyer, H.-G. and Sendhoff, B. Robust optimization–a com- prehensive survey. Computer methods in applied mechan- Ding, H. and Xu, J. Sub-linear time hybrid approximations ics and engineering, 196(33-34):3190–3218, 2007. for least trimmed squares estimator and related problems. In 30th Annual Symposium on Computational Geometry, Bhaskara, A., Vadgama, S., and Xu, H. Greedy sampling SOCG’14, Kyoto, Japan, June 08 - 11, 2014, pp. 110, for approximate clustering in the presence of outliers. In 2014. doi: 10.1145/2582112.2582131. Advances in Neural Information Processing Systems, pp. 11146–11155, 2019. Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Sam- pling algorithms for l2 regression and applications. In Bhatia, K., Jain, P., and Kar, P. Robust regression via hard Proceedings of the seventeenth annual ACM-SIAM sym- thresholding. In Advances in Neural Information Pro- posium on Discrete algorithm, pp. 1127–1136. Society cessing Systems 28: Annual Conference on Neural Infor- for Industrial and Applied Mathematics, 2006. mation Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 721–729, 2015. Erickson, J. and Seidel, R. Better lower bounds on de- tecting affine and spherical degeneracies. Discrete & Bhatt, R. and Dhall, A. Skin segmentation dataset. UCI Computational Geometry, 13:41–57, 1995. doi: 10.1007/ Machine Learning Repository. BF02574027. Blum, M., Floyd, R. W., Pratt, V., Rivest, R. L., and Tarjan, Feldman, D. Core-sets: An updated survey. Wiley Interdis- R. E. Time bounds for selection. Journal of Computer cip. Rev. 
Data Min. Knowl. Discov., 10(1), 2020. and System Sciences, 7(4):448–461, 1973. Feldman, D. and Langberg, M. A unified framework for Boutsidis, C., Drineas, P., and Magdon-Ismail, M. Near- approximating and clustering data. In Proceedings of the optimal coresets for least-squares regression. IEEE Trans. 43rd ACM Symposium on Theory of Computing, STOC Information Theory, 59(10):6880–6892, 2013. doi: 10. 2011, San Jose, CA, USA, 6-8 June 2011, pp. 569–578, 1109/TIT.2013.2272457. 2011. Chandola, V., Banerjee, A., and Kumar, V. Anomaly detec- Feldman, D., Schmidt, M., and Sohler, C. Turning big tion: A survey. ACM Computing Surveys (CSUR), 41(3): data into tiny data: Constant-size coresets for k-means, 15, 2009. PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Dis- Chawla, S. and Gionis, A. k-means--: A unified approach crete Algorithms, SODA 2013, New Orleans, Louisiana, to clustering and outlier detection. In SDM, 2013a. USA, January 6-8, 2013, pp. 1434–1453, 2013. Layered Sampling for Robust Optimization Problems

Fichtenberger, H., Gille,´ M., Schmidt, M., Schwiegelshohn, Klivans, A. R., Kothari, P. K., and Meka, R. Efficient C., and Sohler, C. Bico: Birch meets coresets for k-means algorithms for outlier-robust regression. In Conference clustering. In European Symposium on Algorithms, pp. On Learning Theory, COLT 2018, Stockholm, Sweden, 481–492. Springer, 2013. 6-9 July 2018, pp. 1420–1430, 2018.

Friggstad, Z., Khodamoradi, K., Rezapour, M., and Krishnaswamy, R., Li, S., and Sandeep, S. Constant ap- Salavatipour, M. R. Approximation schemes for clus- proximation for k-median and k-means with outliers via tering with outliers. In Proceedings of the Twenty-Ninth iterative rounding. In Proceedings of the 50th Annual Annual ACM-SIAM Symposium on Discrete Algorithms, ACM SIGACT Symposium on Theory of Computing, pp. pp. 398–414. SIAM, 2018. 646–659. ACM, 2018.

Goodfellow, I. J., McDaniel, P. D., and Papernot, N. Mak- Langberg, M. and Schulman, L. J. Universal ε- ing machine learning robust against adversarial inputs. approximators for integrals. In Proceedings of the twenty- Commun. ACM, 61(7):56–66, 2018. first annual ACM-SIAM symposium on Discrete Algo- rithms, pp. 598–607. SIAM, 2010. Gupta, S. Approximation algorithms for clustering and facility location problems. PhD thesis, University of Li, Y., Long, P. M., and Srinivasan, A. Improved bounds on Illinois at Urbana-Champaign, 2018. the sample complexity of learning. Journal of Computer and System Sciences, 62(3):516–527, 2001. Gupta, S., Kumar, R., Lu, K., Moseley, B., and Vassilvitskii, S. Local search methods for k-means with outliers. Proc. Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., VLDB Endow., 10(7):757–768, March 2017a. ISSN 2150- Huang, H., and Chen, S. Assessing beijing’s pm 2.5 pol- 8097. doi: 10.14778/3067421.3067425. lution: severity, weather impact, apec and winter heating. Proceedings of the Royal Society A: Mathematical, Phys- Gupta, S., Kumar, R., Lu, K., Moseley, B., and Vassilvit- ical and Engineering Science, 471:20150257, 10 2015. skii, S. Local search methods for k-means with outliers. doi: 10.1098/rspa.2015.0257. Proceedings of the VLDB Endowment, 10(7):757–768, 2017b. Lucic, M., Faulkner, M., Krause, A., and Feldman, D. Train- ing gaussian mixture models at scale via coresets. The Har-Peled, S. and Kushal, A. Smaller coresets for k-median Journal of Machine Learning Research, 18(1):5885–5909, and k-means clustering. Discrete & Computational Ge- 2017. ometry, 37(1):3–19, 2007. Mettu, R. R. and Plaxton, C. G. Optimal time bounds for Hawkins, D. M. The feasible solution algorithm for least approximate clustering. Machine Learning, 56(1-3):35– trimmed squares regression. Computational Statistics and 60, 2004. Data Analysis, 17, 1994. doi: 10.1016/0167-9473(92) 00070-8. Moitra, A. Robustness meets algorithms (invited talk). In 16th Scandinavian Symposium and Workshops on Algo- Huang, L., Jiang, S., Li, J., and Wu, X. Epsilon-coresets rithm Theory, SWAT 2018, June 18-20, 2018, Malmo,¨ for clustering (with outliers) in doubling metrics. In Sweden, pp. 3:1–3:1, 2018. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 814–825. IEEE, 2018. Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silver- man, R., and Wu, A. Y. On the least trimmed squares Huggins, J., Campbell, T., and Broderick, T. Coresets for estimator. Algorithmica, 69(1):148–183, 2014. doi: scalable bayesian logistic regression. In Advances in 10.1007/s00453-012-9721-8. Neural Information Processing Systems, pp. 4080–4088, 2016. Mount, D. M., Netanyahu, N. S., Piatko, C. D., Wu, A. Y., and Silverman, R. A practical approximation algorithm Karnin, Z. S. and Liberty, E. Discrepancy, coresets, and for the LTS estimator. Computational Statistics and Data sketches in machine learning. CoRR, abs/1906.04845, Analysis, 99:148–170, 2016. doi: 10.1016/j.csda.2016.01. 2019. 016.

Kaul, M., Yang, B., and Jensen, C. Building accurate Munteanu, A., Schwiegelshohn, C., Sohler, C., and 3d spatial networks to enable next generation intelli- Woodruff, D. On coresets for logistic regression. In gent transportation systems. volume 1, 06 2013. doi: Advances in Neural Information Processing Systems, pp. 10.1109/MDM.2013.24. 6561–6570, 2018. Layered Sampling for Robust Optimization Problems

Ott, L., Pang, L., Ramos, F. T., and Chawla, S. On integrated clustering and outlier detection. In Advances in Neural Information Processing Systems, pp. 1359–1367, 2014.

Phillips, J. M. Coresets and sketches. Computing Research Repository, 2016.

Rousseeuw, P. and van Driessen, K. Computing LTS regression for large data sets. Data Min. Knowl. Discov., 12(1):29–45, 2006. doi: 10.1007/s10618-005-0024-4.

Rousseeuw, P. J. Least median of squares regression. Journal of the American Statistical Association, 79, 12 1984. doi: 10.1080/01621459.1984.10477105.

Shen, Y. and Sanghavi, S. Iterative least trimmed squares for mixed linear regression. In Advances in Neural In- formation Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 6076–6086, 2019.

Zimek, A., Schubert, E., and Kriegel, H.-P. A survey on unsupervised outlier detection in high-dimensional nu- merical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(5):363–387, 2012.

7. Proof of Claim 1

Since S_in^C is the set of inliers of S to C, there must exist some value r_S > 0 such that

S_in^C = { p | p ∈ S, min_{1≤j≤k} ||p − c_j|| ≤ r_S }.   (49)

And therefore

S_in^C \ P_H = { p | p ∈ S \ P_H, min_{1≤j≤k} ||p − c_j|| ≤ r_S }.   (50)

Similarly, there exists some value r_P > 0 such that

P_in^C \ P_H = { p | p ∈ P \ P_H, min_{1≤j≤k} ||p − c_j|| ≤ r_P }.   (51)

Note S \ P_H = P \ P_H. So, if r_S ≤ r_P, we have S_in^C \ P_H ⊆ P_in^C \ P_H. Otherwise, P_in^C \ P_H ⊆ S_in^C \ P_H.

8. Proof of Claim 2

Let h = (h_1, ..., h_d), and suppose h' = (h'_1, ..., h'_d) is h's nearest neighbor in G, i.e., |h_d − h'_d| ≤ (ε/2d) L and |D h'_j + h'_d − D h_j − h_d| ≤ (ε/2d) L for 1 ≤ j ≤ d − 1. Then,

|h'_j − h_j| ≤ (1/D) ((ε/2d) L + |h'_d − h_d|) ≤ (ε/(Dd)) L   (52)

for 1 ≤ j ≤ d − 1. For any p = (x_1, ..., x_d) ∈ R_D,

|Res(p, h) − Res(p, h')| ≤ Σ_{j=1}^{d−1} |h'_j − h_j| · |x_j| + |h'_d − h_d|
 ≤ Σ_{j=1}^{d−1} |h'_j − h_j| · D + |h'_d − h_d|
 ≤ εL.   (53)