Layered Sampling for Robust Optimization Problems

Hu Ding 1 Zixiu Wang 1

Abstract

In the real world, our datasets often contain outliers. Moreover, the outliers can seriously affect the final result. Most existing algorithms for handling outliers take high time complexities (e.g., quadratic or cubic complexity). Coreset is a popular approach for compressing data so as to speed up the optimization algorithms. However, the current coreset methods cannot be easily extended to handle the case with outliers. In this paper, we propose a new variant of coreset technique, layered sampling, to deal with two fundamental robust optimization problems: k-median/means clustering with outliers and linear regression with outliers. This new coreset method is in particular suitable for speeding up the iterative algorithms (which often improve the solution within a local range) for those robust optimization problems. Moreover, our method is easy to implement in practice. We expect that our framework of layered sampling will be applicable to other robust optimization problems.

¹School of Computer Science and Technology, University of Science and Technology of China. Correspondence to: Hu Ding <http://staff.ustc.edu>.

1. Introduction

Coreset is a widely studied technique for solving many optimization problems (Phillips, 2016; Bachem et al., 2017; Munteanu et al., 2018; Feldman, 2020). The (informal) definition is as follows. Given an optimization problem with the objective function ∆, denote by ∆(P, C) the objective value determined by a dataset P and a solution C; a small set S is called a coreset if

∆(P, C) ≈ ∆(S, C)   (1)

for any feasible solution C. Roughly speaking, the coreset is a small set of data approximately representing a much larger dataset, and therefore an existing algorithm can run on the coreset (instead of the original dataset) so as to reduce the complexity measures like running time, space, and communication. In the past years, the coreset techniques have been successfully applied to solve many optimization problems, such as clustering (Chen, 2009; Feldman & Langberg, 2011; Huang et al., 2018), logistic regression (Huggins et al., 2016; Munteanu et al., 2018), linear regression (Dasgupta et al., 2009; Drineas et al., 2006), and Gaussian mixture models (Lucic et al., 2017; Karnin & Liberty, 2019).

A large part of existing coreset construction methods are based on the theory of sensitivity, which was proposed by (Langberg & Schulman, 2010). Informally, each data point p ∈ P has a sensitivity φ(p) (in fact, we just need to compute an appropriate upper bound of φ(p)) that measures its importance to the whole instance P over all possible solutions, and Φ(P) = Σ_{p∈P} φ(p) is called the total sensitivity. The coreset construction is a simple sampling procedure where each point p is drawn i.i.d. from P proportional to φ(p)/Φ(P); each sampled point p is assigned a weight w(p) = Φ(P)/(m · φ(p)), where m is the sample size depending on the "pseudo-dimension" of the objective function ∆ (Feldman & Langberg, 2011; Li et al., 2001); eventually, the set of weighted sampled points forms the desired coreset S.

In the real world, datasets are noisy and contain outliers. Moreover, outliers could seriously affect the final results in data analysis (Chandola et al., 2009; Goodfellow et al., 2018). However, the sensitivity based coreset approach is not appropriate for handling robust optimization problems involving outliers (e.g., k-means clustering with outliers). For example, it is not easy to compute the sensitivity φ(p) because the point p could be an inlier or an outlier for different solutions; moreover, it is challenging to build a relation such as (1) between the original instance P and the coreset S (e.g., how to determine the number of outliers for the instance S?).

1.1. Our Contributions

In this paper, we consider two important robust optimization problems: k-median/means clustering with outliers and linear regression with outliers. Their quality guaranteed algorithms exist but often have high complexities that seriously limit their applications in real scenarios (see Section 1.2 for more details). On the other hand, the two problems can be efficiently solved by some heuristic algorithms in practice, though they only guarantee local optimums in theory.
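For concreteness, the generic sensitivity-based construction recalled in the introduction can be summarized by the following minimal Python sketch. It is only an illustration of the generic sampling procedure (the way the sensitivity upper bounds `phi` are obtained is assumed to be given), not a specific construction discussed in this paper.

```python
import numpy as np

def sensitivity_coreset(P, phi, m, rng=None):
    """Generic sensitivity-based coreset construction (illustrative sketch).

    P   : (n, d) array of points.
    phi : (n,) array of (upper bounds on) the sensitivities phi(p).
    m   : desired coreset size.
    Returns the sampled points and their weights w(p) = Phi(P) / (m * phi(p)).
    """
    rng = np.random.default_rng() if rng is None else rng
    total = phi.sum()                                  # Phi(P), the total sensitivity
    probs = phi / total                                # draw p with probability phi(p)/Phi(P)
    idx = rng.choice(len(P), size=m, replace=True, p=probs)
    weights = total / (m * phi[idx])                   # makes the weighted cost estimate unbiased
    return P[idx], weights
```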

For example, (Chawla & Gionis, 2013b) proposed the algorithm k-means-- to solve the problem of k-means clustering with outliers, where the main idea is an alternating minimization strategy. The algorithm is an iterative procedure, where it alternately updates the outliers and the k cluster centers in each iteration; eventually the solution converges to a local optimum. The alternating minimization strategy is also widely used for solving the problem of linear regression with outliers, e.g., (Shen & Sanghavi, 2019). A common feature of these methods is that they usually start from an initial solution and then locally improve the solution round by round. Therefore, a natural question is:

can we construct a "coreset" only for a local range in the solution space?

Using such a coreset, we can substantially speed up those iterative algorithms. Motivated by this question, we introduce a new variant of coreset method called layered sampling. Given an initial solution C̃, we partition the given data set P into a consecutive sequence of "layers" surrounding C̃ and conduct random sampling in each layer; the union of the samples, together with the points located in the outermost layer, forms the coreset S. Actually, our method is partly inspired by the coreset construction method for k-median/means clustering (without outliers) proposed by (Chen, 2009). However, we need to develop significantly new ideas in theory to prove its correctness for the case with outliers. The purpose of layered sampling is not to guarantee the approximation quality (as in (1)) for any solution C; instead, it only guarantees the quality for the solutions in a local range L of the solution space (the formal definition is given in Section 1.3). Informally, we need to prove the following result to replace (1):

∀ C ∈ L,  ∆(P, C) ≈ ∆(S, C).   (2)

See Figure 1 for an illustration. In other words, the new method can help us find a local optimum faster. Our main results are shown in Theorems 1 and 2. The construction algorithms are easy to implement.

Figure 1. The red point represents the initial solution C̃, and our goal is to guarantee (2) for a local range of size L around C̃ in the solution space.

1.2. Related Works

k-median/means clustering (with outliers). k-median/means clustering are two popular center-based clustering problems (Awasthi & Balcan, 2014). It has been extensively studied for using coreset techniques to reduce the complexities of k-median/means clustering algorithms (Chen, 2009; Har-Peled & Kushal, 2007; Fichtenberger et al., 2013; Feldman et al., 2013); in particular, (Feldman & Langberg, 2011) proposed a unified coreset framework for a set of clustering problems. However, the research on using coresets to handle outliers is still quite limited. Recently, (Huang et al., 2018) showed that a uniform independent sample can serve as a coreset for clustering with outliers in Euclidean space; however, such a uniform sampling based method often misses some important points and therefore introduces an unavoidable error on the number of outliers. (Gupta, 2018) also studied the uniform random sampling idea but under the assumption that each optimal cluster should be large enough. Partly inspired by the method of (Mettu & Plaxton, 2004), (Chen et al., 2018) proposed a novel summary construction algorithm to reduce the input data size which guarantees an O(1) factor of distortion on the clustering cost.

In theory, the algorithms with provable guarantees for k-median/means clustering with outliers (Chen, 2008; Krishnaswamy et al., 2018; Friggstad et al., 2018) have high complexities and are difficult to implement in practice. Heuristic but practical algorithms have also been studied before (Chawla & Gionis, 2013b; Ott et al., 2014). By using the local search method, (Gupta et al., 2017b) provided a 274-approximation algorithm for k-means clustering with outliers, but it needs to discard more than the desired number of outliers; to improve the running time, they also used k-means++ (Arthur & Vassilvitskii, 2007b) to seed the "coreset" that yields an O(1) factor approximation. Based on the idea of k-means++, (Bhaskara et al., 2019) proposed an O(log k)-approximation algorithm.

Linear regression (with outliers). Several coreset methods for ordinary linear regression (without outliers) have been proposed (Drineas et al., 2006; Dasgupta et al., 2009; Boutsidis et al., 2013). For the case with outliers, which is also called the "Least Trimmed Squares linear estimator (LTS)", a uniform sampling approach was studied by (Mount et al., 2014; Ding & Xu, 2014). But similar to the scenario of clustering with outliers, such a uniform sampling approach introduces an unavoidable error on the number of outliers. (Mount et al., 2014) also proved that it is impossible to achieve even an approximate solution for LTS within polynomial time under the conjecture of the hardness of affine degeneracy (Erickson & Seidel, 1995), if the dimensionality d is not fixed. Despite its high complexity, several practical algorithms were proposed before, and most of them are based on the idea of alternating minimization that improves the solution within a local range, such as (Rousseeuw, 1984; Rousseeuw & van Driessen, 2006; Hawkins, 1994; Mount et al., 2016; Bhatia et al., 2015; Shen & Sanghavi, 2019). (Klivans et al., 2018) provided another approach based on the sum-of-squares method.
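Since several of the methods above (and the solvers we later run on our coresets) follow this alternating-minimization pattern, the following minimal sketch illustrates a k-means-- style iteration for Euclidean k-means with z outliers. It is a simplified illustration, not the exact implementation of (Chawla & Gionis, 2013b); the initialization and stopping rule are our own assumptions.

```python
import numpy as np

def kmeans_minus_minus(P, k, z, n_iter=50, rng=None):
    """Sketch of the alternating-minimization heuristic for k-means with z outliers."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.asarray(P, dtype=float)
    n = len(P)
    centers = P[rng.choice(n, size=k, replace=False)].copy()   # arbitrary initialization
    for _ in range(n_iter):
        d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        nearest = d2.argmin(1)                  # nearest current center of each point
        cost = d2[np.arange(n), nearest]
        inliers = np.argsort(cost)[: n - z]     # step 1: the z farthest points become outliers
        for j in range(k):                      # step 2: recompute each center from its inliers
            members = inliers[nearest[inliers] == j]
            if len(members) > 0:
                centers[j] = P[members].mean(0)
    return centers
```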

1.3. Preliminaries

Below, we introduce several important definitions.

i. k-Median/Means Clustering with Outliers. Suppose P is a set of n points in R^d. Given two integers 1 ≤ z, k < n, the problem of k-median clustering with z outliers is to find a set of k points C = {c_1, ..., c_k} ⊂ R^d and a subset P' ⊂ P with |P'| = n − z, such that the following objective function

K_1^{−z}(P, C) = (1/(n − z)) Σ_{p∈P'} min_{1≤j≤k} ||p − c_j||   (3)

is minimized. Similarly, we have the objective function

K_2^{−z}(P, C) = (1/(n − z)) Σ_{p∈P'} min_{1≤j≤k} ||p − c_j||²   (4)

for k-means clustering with outliers. The set C is also called a solution of the instance P. Roughly speaking, given a solution C, the farthest z points to C are discarded, and the remaining subset P' is partitioned into k clusters where each point is assigned to its nearest neighbor in C.

ii. Linear Regression with Outliers. Given a vector h = (h_1, h_2, ..., h_d) ∈ R^d, the linear function defined by h is y = Σ_{j=1}^{d−1} h_j x_j + h_d for d − 1 real variables x_1, x_2, ..., x_{d−1}. Thus the linear function can be represented by the vector h. From a geometric perspective, the linear function can be viewed as a (d − 1)-dimensional hyperplane in the space. Let z be an integer between 1 and n, and let P = {p_1, p_2, ..., p_n} be a set of n points in R^d, where each p_i = (x_{i,1}, x_{i,2}, ..., x_{i,d−1}, y_i) for 1 ≤ i ≤ n; the objective is to find a subset P' ⊂ P with |P'| = n − z and a (d − 1)-dimensional hyperplane, represented as a coefficient vector h = (h_1, h_2, ..., h_d) ∈ R^d, such that

LR_1^{−z}(P', h) = (1/(n − z)) Σ_{p_i∈P'} |Res(p_i, h)|   (5)
or LR_2^{−z}(P', h) = (1/(n − z)) Σ_{p_i∈P'} Res(p_i, h)²   (6)

is minimized. Res(p_i, h) = y_i − Σ_{j=1}^{d−1} h_j x_{i,j} − h_d is the "residual" of p_i to h. The objective functions (5) and (6) are called the "least absolute error" and "least squared error", respectively.

Remark 1. All the above problems can be extended to the weighted case. Suppose each point p has a non-negative weight w(p); then the (squared) distance ||p − c_j|| (resp., ||p − c_j||²) is replaced by w(p) · ||p − c_j|| (resp., w(p) · ||p − c_j||²); we can perform the similar modification on |Res(p_i, h)| and Res(p_i, h)² for the problem of linear regression with outliers. Moreover, the total weight of the outliers should be equal to z. Namely, we can view each point p as w(p) unit-weight overlapping points.

Solution range. To analyze the performance of our layered sampling method, we also need to define the "solution range" for the clustering and linear regression problems. Consider the clustering problems first. Given a clustering solution C̃ = {c̃_1, ..., c̃_k} ⊂ R^d and L > 0, we use "C̃ ± L" to denote the range of solutions

L = { C = {c_1, ..., c_k} | ||c̃_j − c_j|| ≤ L, ∀ 1 ≤ j ≤ k }.   (7)

Next, we define the solution range for linear regression with outliers. Given an instance P, we often normalize the values in each of the first d − 1 dimensions as a preprocessing step; without loss of generality, we can assume that x_{i,j} ∈ [0, D] for some D > 0, for any 1 ≤ i ≤ n and 1 ≤ j ≤ d − 1. For convenience, denote by R_D the region {(s_1, s_2, ..., s_d) | 0 ≤ s_j ≤ D, ∀ 1 ≤ j ≤ d − 1}, and thus P ⊂ R_D after the normalization. It is easy to see that the region R_D actually is a vertical square cylinder in the space. Given a coefficient vector (hyperplane) h̃ = (h̃_1, h̃_2, ..., h̃_d) ∈ R^d and L > 0, we use "h̃ ± L" to denote the range of hyperplanes

L = { h = (h_1, h_2, ..., h_d) | |Res(p, h̃) − Res(p, h)| ≤ L, ∀ p ∈ R_D }.   (8)

To understand the range defined in (8), we can imagine two linear functions h̃⁺ = (h̃_1, h̃_2, ..., h̃_d + L) and h̃⁻ = (h̃_1, h̃_2, ..., h̃_d − L); if we only consider the region R_D, the range (8) contains all the linear functions "sandwiched" by h̃⁺ and h̃⁻.

For both the clustering and regression problems, we also say that the size of the solution range L is |L| = L.
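The following short sketch (assuming instances are given as NumPy arrays) spells out how the objective values K_1^{−z}(P, C) and LR_1^{−z}(P, h) defined above are evaluated for a fixed solution: the z farthest points (respectively, the z largest residuals) are discarded and the remaining values are averaged.

```python
import numpy as np

def kmedian_outlier_cost(P, C, z):
    """K_1^{-z}(P, C): average distance of the n - z closest points to C."""
    dist = np.sqrt(((P[:, None, :] - C[None, :, :]) ** 2).sum(-1)).min(1)
    return np.sort(dist)[: len(P) - z].mean()        # drop the z farthest points

def lts_cost(P, h, z):
    """LR_1^{-z}(P, h): average |residual| over the n - z best-fitted points.

    P is an (n, d) array whose last column holds y_i; h = (h_1, ..., h_d).
    """
    X, y = P[:, :-1], P[:, -1]
    res = np.abs(y - X @ h[:-1] - h[-1])             # |Res(p_i, h)|
    return np.sort(res)[: len(P) - z].mean()
```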

2. The Layered Sampling Framework

We present the overview of our layered sampling framework. For the sake of completeness, we first introduce the coreset construction method for ordinary k-median/means clustering proposed by (Chen, 2009).

Suppose α and β ≥ 1. A "bi-criteria (α, β)-approximation" means that it contains αk cluster centers, and the induced clustering cost is at most β times the optimum. Usually, finding a bi-criteria approximation is much easier than achieving a single-criterion approximation. For example, one can obtain a bi-criteria approximation for k-median/means clustering in linear time with α = O(1) and β = O(1) (Chen, 2009). Let T = {t_1, t_2, ..., t_{αk}} ⊂ R^d be the obtained (α, β)-approximate solution of the input instance P. For convenience, we use B(c, r) to denote the ball centered at a point c with radius r > 0. At the beginning of Chen's coreset construction algorithm, it takes two carefully designed values r > 0 and N = O(log n), and partitions the space into N + 1 layers H_0, H_1, ..., H_N, where H_0 = ∪_{j=1}^{αk} B(t_j, r) and H_i = (∪_{j=1}^{αk} B(t_j, 2^i r)) \ (∪_{j=1}^{αk} B(t_j, 2^{i−1} r)) for 1 ≤ i ≤ N. It can be proved that P is covered by ∪_{i=0}^N H_i; then the algorithm takes a random sample S_i from each layer P ∩ H_i, and the union ∪_{i=0}^N S_i forms the desired coreset S satisfying the condition (1).

However, this approach cannot directly solve the case with outliers. First, it is not easy to obtain a bi-criteria approximation for the problem of k-median/means clustering with outliers (e.g., in linear time). Moreover, it is challenging to guarantee the condition (1) for any feasible solution C, because the set of outliers could change when C changes (this is also the major challenge for proving the correctness of our method later on). We propose a modified version of Chen's coreset construction method and aim to guarantee (2) for a local range of solutions. We take the k-median clustering with outliers problem as an example. Let C̃ = {c̃_1, ..., c̃_k} ⊂ R^d be a given solution. Assume ε > 0 and N ∈ Z⁺ are two pre-specified parameters. With a slight abuse of notations, we still use H_0, H_1, ..., H_N to denote the layers surrounding C̃, i.e.,

H_0 = ∪_{j=1}^k B(c̃_j, r);   (9)
H_i = (∪_{j=1}^k B(c̃_j, 2^i r)) \ (∪_{j=1}^k B(c̃_j, 2^{i−1} r)) for 1 ≤ i ≤ N.   (10)

In addition, let

H_out = R^d \ ∪_{j=1}^k B(c̃_j, 2^N r).   (11)

Here, we set the value r to satisfy the following condition:

|P ∩ H_out| = (1 + 1/ε) z.   (12)

That is, the union of the layers ∪_{i=0}^N H_i covers n − (1 + 1/ε)z points of P and excludes the farthest (1 + 1/ε)z. Obviously, such a value r always exists. Suppose P' is the set of n − z inliers induced by C̃; then we have

2^N r ≤ (ε/z) Σ_{p∈P'} min_{1≤j≤k} ||p − c̃_j|| = (ε/z)(n − z) K_1^{−z}(P, C̃)   (13)

via the Markov's inequality. Our new coreset contains the following N + 2 parts:

S = S_0 ∪ S_1 ∪ ··· ∪ S_N ∪ S_out,   (14)

where S_i is still a random sample from P ∩ H_i for 0 ≤ i ≤ N, and S_out contains all the (1 + 1/ε)z points in H_out. In Section 3, we will show that the coreset S of (14) satisfies (2) for the k-median clustering with outliers problem (and similarly for the k-means clustering with outliers problem).

For the linear regression with outliers problem, we apply the similar layered sampling framework. Define S(h, r) to be the slab centered at a (d − 1)-dimensional hyperplane h with r > 0, i.e., S(h, r) = {p ∈ R^d | −r ≤ Res(p, h) ≤ r}. Let P be an instance, and let h̃ = (h̃_1, ..., h̃_d) ∈ R^d be a given hyperplane. We divide the space into N + 2 layers H_0, H_1, ..., H_N, H_out, where

H_0 = S(h̃, r);   (15)
H_i = S(h̃, 2^i r) \ S(h̃, 2^{i−1} r) for 1 ≤ i ≤ N;   (16)
H_out = R^d \ S(h̃, 2^N r).   (17)

Similar to (12), we also require the value r to satisfy the following condition:

|P ∩ H_out| = |P \ S(h̃, 2^N r)| = (1 + 1/ε) z.   (18)

And consequently, we have

2^N r ≤ (ε/z)(n − z) LR_1^{−z}(P, h̃).   (19)

Then, we construct the coreset for linear regression with z outliers in the same manner as (14).

3. k-Median/Means Clustering with Outliers

In this section, we provide the details on applying our layered sampling framework to the problem of k-median clustering with outliers; see Algorithm 1. The algorithm and analysis can be easily modified to handle k-means clustering with outliers, where the only difference is that we need to replace (13) by "2^N r ≤ sqrt((ε/z)(n − z) K_2^{−z}(P, C̃))".

Algorithm 1 LAYERED SAMPLING FOR k-MED-OUTLIER
Input: An instance P ⊂ R^d of k-median clustering with z outliers, a solution C̃ = {c̃_1, ..., c̃_k}, and two parameters ε, η ∈ (0, 1).
  1. Let γ = z/(n − z) and N = ⌈log(1/γ)⌉. Compute the value r satisfying (12).
  2. As described in (9), (10), and (11), the space is partitioned into N + 2 layers H_0, H_1, ..., H_N and H_out.
  3. Randomly sample min{O((1/ε²) kd log(d/ε) log(N/η)), |P ∩ H_i|} points, denoted by S_i, from P ∩ H_i for 0 ≤ i ≤ N.
  4. For each point p ∈ S_i, set its weight to be |P ∩ H_i|/|S_i|; let S_H = ∪_{i=0}^N S_i.
Output: S = S_H ∪ (P ∩ H_out).
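To make the construction concrete, the following sketch mirrors the steps of Algorithm 1 on a NumPy instance; the argument `sample_size` stands in for the O((1/ε²) kd log(d/ε) log(N/η)) bound of Step 3, and the helper names are ours rather than part of the algorithm.

```python
import numpy as np

def layered_sampling_kmed(P, C_tilde, z, eps, sample_size, rng=None):
    """Sketch of Algorithm 1 (layered sampling for k-median with z outliers)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(P)
    gamma = z / (n - z)
    N = max(0, int(np.ceil(np.log2(1.0 / gamma))))
    dist = np.sqrt(((P[:, None, :] - C_tilde[None, :, :]) ** 2).sum(-1)).min(1)

    # Step 1: choose r so that (1 + 1/eps) z points fall outside B(C~, 2^N r), cf. (12)
    m_out = int(np.ceil((1 + 1.0 / eps) * z))
    r = np.sort(dist)[n - m_out - 1] / (2 ** N)

    pts, wts = [], []
    for i in range(N + 1):                               # Steps 2-4: sample each layer
        if i == 0:
            layer = dist <= r
        else:
            layer = (dist <= 2 ** i * r) & (dist > 2 ** (i - 1) * r)
        idx = np.flatnonzero(layer)
        if len(idx) == 0:
            continue
        m = min(sample_size, len(idx))
        chosen = rng.choice(idx, size=m, replace=False)
        pts.append(P[chosen])
        wts.append(np.full(m, len(idx) / m))             # weight |P ∩ H_i| / |S_i|
    out_idx = np.flatnonzero(dist > 2 ** N * r)          # H_out is kept entirely, unit weights
    pts.append(P[out_idx])
    wts.append(np.ones(len(out_idx)))
    return np.vstack(pts), np.concatenate(wts)
```

The intended use of the returned weighted set is to feed it to an alternating-minimization solver such as k-means--, with the total outlier weight fixed to z.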

Theorem 1. Algorithm 1 returns a point set S having the size |S| = Õ((1/ε²) kd) + (1 + 1/ε)z.¹ Moreover, with probability at least 1 − η, for any L > 0 and any solution C ∈ C̃ ± L,

K_1^{−z}(S, C) ∈ K_1^{−z}(P, C) ± ε (K_1^{−z}(P, C̃) + L).   (20)

Here, S is a weighted instance of k-median clustering with outliers, and the total weight of outliers is z (see Remark 1).

¹The asymptotic notation Õ(f) = O(f · polylog(d/(εγη))).

Remark 2. (1) The running time of Algorithm 1 is O(knd). For each point p ∈ P, we compute its shortest distance to C̃, min_{1≤j≤k} ||p − c̃_j||; then we select the farthest (1 + 1/ε)z points and compute the value r by running the linear time selection algorithm (Blum et al., 1973); finally, we obtain the N + 1 layers H_i with 0 ≤ i ≤ N and take the samples S_0, S_1, ..., S_N from them.

(2) Comparing with the standard coreset (1), our result contains an additive error ε (K_1^{−z}(P, C̃) + L) in (20) that depends on the initial objective value K_1^{−z}(P, C̃) and the size L of the solution range. In particular, the smaller the range size L, the lower the error of our coreset.

(3) The algorithm of (Chen et al., 2018) also returns a summary for compressing the input data. But there are two major differences comparing with our result. First, their summary guarantees a constant factor of distortion on the clustering cost, while our error approaches 0 if ε is small enough. Second, their construction algorithm (called "successive sampling", from (Mettu & Plaxton, 2004)) needs to scan the data in multiple passes, while our Algorithm 1 is much simpler and only needs to read the data in one pass. We also compare these two methods in our experiments.

To prove Theorem 1, we first show that S_H is a good approximation of P \ H_out. Fixing a solution C ∈ C̃ ± L, we view the distance from each point p ∈ P to C, i.e., min_{1≤j≤k} ||p − c_j||, as a random variable x_p. For any point p ∈ P ∩ H_i with 0 ≤ i ≤ N, we have the following bounds for x_p. Suppose p is covered by B(c̃_{j_1}, 2^i r), and let the nearest neighbor of p in C be c_{j_2}. Then we have the upper bound

x_p = ||p − c_{j_2}|| ≤ ||p − c_{j_1}|| ≤ ||p − c̃_{j_1}|| + ||c̃_{j_1} − c_{j_1}|| ≤ 2^i r + L.   (21)

Similarly, we have the lower bound

x_p ≥ max{2^{i−1} r − L, 0} if i ≥ 1;  x_p ≥ 0 if i = 0.   (22)

Therefore, we can take a sufficiently large random sample Ŝ_i from P ∩ H_i, such that (1/|Ŝ_i|) Σ_{p∈Ŝ_i} x_p ≈ (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p with certain probability. Specifically, combining (21) and (22), we have the following lemma through the Hoeffding's inequality.

Lemma 1. Let η ∈ (0, 1). If we randomly sample O((1/ε²) log(1/η)) points, denoted by Ŝ_i, from P ∩ H_i, then with probability 1 − η, we have

| (1/|Ŝ_i|) Σ_{p∈Ŝ_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | ≤ ε (2^i r + 2L).

Lemma 1 is only for a fixed solution C. To guarantee the result for any C ∈ C̃ ± L, we discretize the range C̃ ± L. Imagine that we build a grid inside each B(c̃_j, L) for 1 ≤ j ≤ k, where the grid side length is (ε/√d) L. Denote by G_j the set of grid points inside each B(c̃_j, L); then G = G_1 × G_2 × ··· × G_k contains O((2√d/ε)^{kd}) k-tuple points of C̃ ± L in total. We increase the sample size in Lemma 1 by replacing η with η/(N · |G|) in the sample size "O((1/ε²) log(1/η))". As a consequence, through taking the union bound for the success probability, we have the following result.

Lemma 2. Let S_i be the sample obtained from P ∩ H_i in Step 3 of Algorithm 1 for 0 ≤ i ≤ N. Then with probability 1 − η,

| (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | ≤ ε (2^i r + 2L)

for each i ∈ {0, 1, ..., N} and any C ∈ G.

Next, we show that the guarantee of Lemma 2 extends to any C ∈ C̃ ± L (in particular the solutions in (C̃ ± L) \ G). For any solution C = {c_1, ..., c_k} ∈ C̃ ± L, let C' = {c'_1, ..., c'_k} be its nearest neighbor in G, i.e., c'_j is the grid point of the cell containing c_j in G_j, for 1 ≤ j ≤ k. Also, denote by x'_p the distance min_{1≤j≤k} ||p − c'_j||. Then we bound the error | (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | through C'. By using the triangle inequality, we have

| (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p |   (23)
  ≤ | (1/|S_i|) Σ_{p∈S_i} x_p − (1/|S_i|) Σ_{p∈S_i} x'_p |   [term (a)]
  + | (1/|S_i|) Σ_{p∈S_i} x'_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x'_p |   [term (b)]
  + | (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x'_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p |.   [term (c)]

In (23), the term (b) is bounded by Lemma 2 since C' ∈ G. To bound the terms (a) and (c), we study the difference |x_p − x'_p| for each point p. Suppose the nearest neighbor of p in C (resp., C') is c_{j_1} (resp., c'_{j_2}). Then

x_p = ||p − c_{j_1}|| ≤ ||p − c_{j_2}|| ≤ ||p − c'_{j_2}|| + ||c'_{j_2} − c_{j_2}|| ≤ ||p − c'_{j_2}|| + εL = x'_p + εL,   (24)

where the last inequality comes from the fact that c_{j_2} and c'_{j_2} are in the same grid cell with side length (ε/√d) L. Similarly, we have x'_p ≤ x_p + εL. Overall, |x_p − x'_p| ≤ εL. As a consequence, the terms (a) and (c) in (23) are both bounded by εL. Overall, (23) becomes

| (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | ≤ O(ε) (2^i r + L).   (25)

For convenience, we use P_H to denote the set ∪_{i=0}^N (P ∩ H_i).

Lemma 3. Let S_0, S_1, ..., S_N be the samples obtained in Algorithm 1. Then, with probability 1 − η,

(1/(n − z)) | Σ_{i=0}^N (|P ∩ H_i|/|S_i|) Σ_{p∈S_i} x_p − Σ_{p∈P_H} x_p | ≤ O(ε) (K_1^{−z}(P, C̃) + L)   (26)

for any C ∈ C̃ ± L.

Proof. For convenience, let Err_i = | (1/|S_i|) Σ_{p∈S_i} x_p − (1/|P ∩ H_i|) Σ_{p∈P∩H_i} x_p | for 0 ≤ i ≤ N. We directly have Err_i ≤ O(ε)(2^i r + L) from (25). Moreover, the left-hand side of (26) = (1/(n − z)) Σ_{i=0}^N |P ∩ H_i| · Err_i

≤ (O(ε)/(n − z)) Σ_{i=0}^N |P ∩ H_i| · (2^i r + L)
= O(ε) Σ_{i=0}^N (|P ∩ H_i|/(n − z)) 2^i r + O(ε) L.   (27)

It is easy to know that the term Σ_{i=0}^N (|P ∩ H_i|/(n − z)) 2^i r of (27) is at most (1/(n − z)) (|P ∩ H_0| r + 2 Σ_{p∈P_H\H_0} x_p) ≤ r + 2 K_1^{−z}(P, C̃). Note we set N = ⌈log(1/γ)⌉ in Algorithm 1. Together with (13), we know r ≤ ε K_1^{−z}(P, C̃), and thus Σ_{i=0}^N (|P ∩ H_i|/(n − z)) 2^i r ≤ O(1) K_1^{−z}(P, C̃). So (26) is true.

Below, we always assume that (26) is true and consider proving (20) of Theorem 1. The set P is partitioned into two parts, P_in^C and P_out^C, by C, where P_out^C is the set of the z farthest points to C (i.e., the outliers) and P_in^C = P \ P_out^C. Similarly, the coreset S is also partitioned into two parts S_in^C and S_out^C by C, where S_out^C is the set of outliers with total weight z. In other words, we need to prove

K_1^{−z}(S, C) = (1/(n − z)) Σ_{p∈S_in^C} w(p) x_p ≈ (1/(n − z)) Σ_{p∈P_in^C} x_p = K_1^{−z}(P, C).   (28)

Consider two cases: (i) P_H \ P_in^C = ∅ and (ii) P_H \ P_in^C ≠ ∅. Intuitively, case (i) indicates that the set P_in^C occupies the whole region ∪_{i=0}^N H_i; case (ii) indicates that the region ∪_{i=0}^N H_i contains some outliers from P_out^C. In the following subsections, we prove that (20) holds for both cases. For ease of presentation, we use w(U) to denote the total weight of a weighted point set U (not to be confused with |U|, which is the number of points in U).

3.1. Case (i): P_H \ P_in^C = ∅

We prove the following key lemma first.

Lemma 4. If P_H \ P_in^C = ∅, then S_in^C = S_H ∪ (P_in^C \ P_H) and S_out^C = P_out^C (recall S_H = ∪_{i=0}^N S_i from Algorithm 1).

Proof. First, the assumption P_H \ P_in^C = ∅ implies

P_H ⊂ P_in^C;   (29)
|P_in^C \ P_H| = |P_in^C| − |P_H|.   (30)

In addition, since S_H ⊂ P_H, we have S_H ⊂ P_in^C from (29). Consequently, the set S_H ∪ (P_in^C \ P_H) ⊂ P_in^C. Therefore, for any p ∈ S_H ∪ (P_in^C \ P_H) and any q ∈ P_out^C, x_p ≤ x_q. Moreover, the set S \ (S_H ∪ (P_in^C \ P_H))

= (S_H ∪ (P \ P_H)) \ (S_H ∪ (P_in^C \ P_H))
= (P \ P_H) \ (P_in^C \ P_H)
= P \ P_in^C = P_out^C   (by (29)).   (31)

Note |P_out^C| = z. As a consequence, S_in^C should be exactly the set S_H ∪ (P_in^C \ P_H), and S_out^C = P_out^C.

Lemma 5. If P_H \ P_in^C = ∅, (20) is true.

Proof. Because the set S_in^C is equal to S_H ∪ (P_in^C \ P_H) from Lemma 4, the objective value K_1^{−z}(S, C) = (1/(n − z)) ( Σ_{p∈S_H} w(p) x_p + Σ_{p∈P_in^C\P_H} x_p ) =

(1/(n − z)) ( Σ_{i=0}^N (|P ∩ H_i|/|S_i|) Σ_{p∈S_i} x_p + Σ_{p∈P_in^C\P_H} x_p ).   (32)

From Lemma 3, the value of (32) is no larger than

(1/(n − z)) ( Σ_{p∈P_H} x_p + O(ε)(n − z)(K_1^{−z}(P, C̃) + L) + Σ_{p∈P_in^C\P_H} x_p ).   (33)

Note that P_H \ P_in^C = ∅, and thus the sum of the two terms Σ_{p∈P_H} x_p and Σ_{p∈P_in^C\P_H} x_p in (33) is Σ_{p∈P_in^C} x_p. Therefore, (33)

≤ (1/(n − z)) Σ_{p∈P_in^C} x_p + O(ε)(K_1^{−z}(P, C̃) + L)
= K_1^{−z}(P, C) + O(ε)(K_1^{−z}(P, C̃) + L).   (34)

Similarly, we have K_1^{−z}(S, C) ≥ K_1^{−z}(P, C) − O(ε)(K_1^{−z}(P, C̃) + L). Thus, (20) is true.

3.2. Case (ii): P_H \ P_in^C ≠ ∅

Since S \ P_H = P \ P_H are the outermost (1 + 1/ε)z points to C̃, we have the following claim first (due to the space limit, please refer to our supplement for the detailed proof).

Claim 1. Either S_in^C \ P_H ⊆ P_in^C \ P_H or P_in^C \ P_H ⊆ S_in^C \ P_H is true.

Lemma 6. If P_H \ P_in^C ≠ ∅, we have x_p ≤ 2^N r + L for any p ∈ S_in^C ∪ P_in^C ∪ P_H.

Proof. We consider the points in the three parts P_H, P_in^C, and S_in^C separately.

(1) Due to (21), we have x_p ≤ 2^N r + L for any p ∈ P_H.

(2) Arbitrarily select one point p_0 from P_H \ P_in^C. By (21) again, we have x_{p_0} ≤ 2^N r + L. Also, because P_H \ P_in^C ⊂ P_out^C, we directly have x_p ≤ x_{p_0} for any p ∈ P_in^C. Namely, x_p ≤ 2^N r + L for any p ∈ P_in^C.

(3) Below, we consider the points in S_in^C. If w(S_in^C ∩ P_H) > |P_in^C ∩ P_H|, i.e., P_H contains more inliers of S than of P, then the outer region H_out should contain fewer inliers of S than of P. Thus, from Claim 1, we have S_in^C \ P_H ⊆ P_in^C \ P_H. Hence, S_in^C = (S_in^C \ P_H) ∪ (S_in^C ∩ P_H) ⊆ (P_in^C \ P_H) ∪ P_H = P_in^C ∪ P_H. From (1) and (2), we know x_p ≤ 2^N r + L for any p ∈ S_in^C.

Else, w(S_in^C ∩ P_H) ≤ |P_in^C ∩ P_H|. Then w(S_in^C ∩ S_H) ≤ |P_in^C ∩ P_H| since S_in^C ∩ P_H = S_in^C ∩ S_H. Because w(S_H) = |P_H|, we have

w(S_H \ S_in^C) ≥ |P_H \ P_in^C|.   (35)

Also, the assumption P_H \ P_in^C ≠ ∅ implies w(S_H \ S_in^C) ≥ |P_H \ P_in^C| > 0, i.e.,

S_H \ S_in^C ≠ ∅.   (36)

Arbitrarily select one point p_0 from S_H \ S_in^C. We know x_{p_0} ≤ 2^N r + L since p_0 ∈ S_H \ S_in^C ⊂ P_H. Also, for any point p ∈ S_in^C, we have x_p ≤ x_{p_0} because p_0 ∈ S_H \ S_in^C ⊂ S_out^C. Therefore x_p ≤ 2^N r + L.

Lemma 7. If P_H \ P_in^C ≠ ∅, (20) is true.

Proof. We prove the upper bound of K_1^{−z}(S, C) first. We analyze the clustering costs of the two parts S_in^C ∩ S_H and S_in^C \ S_H separately:

K_1^{−z}(S, C) = (1/(n − z)) ( Σ_{p∈S_in^C∩S_H} w(p) x_p  [part (a)]  +  Σ_{p∈S_in^C\S_H} x_p  [part (b)] ).

Note that the points of S_in^C \ S_H have unit weight (since S_in^C \ S_H ⊆ P \ P_H are points from the outermost (1 + 1/ε)z points of P). Obviously, the part (a) is no larger than

Σ_{p∈S_H} w(p) x_p = Σ_{i=0}^N (|P ∩ H_i|/|S_i|) Σ_{p∈S_i} x_p
≤ Σ_{p∈P_H} x_p + O(ε)(n − z)(K_1^{−z}(P, C̃) + L)   (37)

from Lemma 3. The set P_H consists of two parts, P_H ∩ P_in^C and P_H \ P_in^C. From Lemma 6 and the fact |P_H \ P_in^C| ≤ |P_out^C| = z, we know Σ_{p∈P_H\P_in^C} x_p ≤ z(2^N r + L). Thus, the upper bound of the part (a) becomes

Σ_{p∈P_H∩P_in^C} x_p + z(2^N r + L) + O(ε)(n − z)(K_1^{−z}(P, C̃) + L).   (38)

To bound the part (b), we consider the size |S_in^C \ S_H|. Since the total weight of outliers is z,

w(S_H ∩ S_in^C) = w(S_H) − w(S_H ∩ S_out^C) ≥ w(S_H) − z = |P_H| − z ≥ |P_H ∩ P_in^C| − z.   (39)

Together with the fact w(S_H ∩ S_in^C) + |S_in^C \ S_H| = |P_H ∩ P_in^C| + |P_in^C \ P_H| = n − z, we have

|S_in^C \ S_H| ≤ |P_in^C \ P_H| + z.   (40)

Therefore |(S_in^C \ S_H) \ (P_in^C \ P_H)| ≤ z from Claim 1. Through Lemma 6 again, we know that the part (b) is no larger than Σ_{p∈P_in^C\P_H} x_p + |(S_in^C \ S_H) \ (P_in^C \ P_H)| · (2^N r + L)

≤ Σ_{p∈P_in^C\P_H} x_p + z(2^N r + L).   (41)

Putting (38) and (41) together, we have

K_1^{−z}(S, C) ≤ K_1^{−z}(P, C) + O(ε)(K_1^{−z}(P, C̃) + L) + (2z/(n − z))(2^N r + L).   (42)

Recall γ = z/(n − z) in Algorithm 1. If γ ≥ ε, the size of our coreset S is at least (1 + 1/ε)z ≥ n; that is, S contains all the points of P. For the other case ε > γ, together with (13), the term (2z/(n − z))(2^N r + L) in (42) is at most

O(ε)(K_1^{−z}(P, C̃) + L).   (43)

Overall, K_1^{−z}(S, C) ≤ K_1^{−z}(P, C) + O(ε)(K_1^{−z}(P, C̃) + L) via (42). So we complete the proof for the upper bound.

Now we consider the lower bound of K_1^{−z}(S, C). Denote X = S_H ∪ (P_in^C \ P_H) and Y = X \ S_in^C. Obviously,

K_1^{−z}(S, C) ≥ (1/(n − z)) ( Σ_{p∈X} w(p) x_p  [part (c)]  −  Σ_{p∈Y} w(p) x_p  [part (d)] ).

From Lemma 3, the part (c) is at least

Σ_{p∈P_H} x_p − O(ε)(n − z)(K_1^{−z}(P, C̃) + L) + Σ_{p∈P_in^C\P_H} x_p
≥ Σ_{p∈P_in^C} x_p − O(ε)(n − z)(K_1^{−z}(P, C̃) + L).   (44)

Further, since w(Y) ≤ z and Y ⊆ X ⊆ P_in^C ∪ P_H, the part (d) is no larger than z(2^N r + L) from Lemma 6. Using the similar manner as for proving the upper bound, we know that K_1^{−z}(S, C) ≥ K_1^{−z}(P, C) − O(ε)(K_1^{−z}(P, C̃) + L).

4. Linear Regression with Outliers

In this section, we consider the problem of linear regression with outliers. Our algorithm and analysis are for the objective function LR_1^{−z}, and the ideas can be extended to handle the objective function LR_2^{−z}.

Algorithm 2 LAYERED SAMPLING FOR LIN1-OUTLIER
Input: An instance P ⊂ R^d of linear regression with z outliers, a solution h̃ = (h̃_1, ..., h̃_d), and two parameters ε, η ∈ (0, 1).
  1. Let γ = z/(n − z) and N = ⌈log(1/γ)⌉. Compute the value r satisfying (18).
  2. As described in (15), (16), and (17), the space is partitioned into N + 2 layers H_0, H_1, ..., H_N and H_out.
  3. Randomly sample min{O((1/ε²) d log(d/ε) log(N/η)), |P ∩ H_i|} points, denoted by S_i, from P ∩ H_i for 0 ≤ i ≤ N.
  4. For each point p ∈ S_i, set its weight to be |P ∩ H_i|/|S_i|; let S_H = ∪_{i=0}^N S_i.
Output: S = S_H ∪ (P ∩ H_out).

Theorem 2. Algorithm 2 returns a point set S having the size |S| = Õ((1/ε²) d) + (1 + 1/ε)z. Moreover, with probability at least 1 − η, for any L > 0 and any solution h ∈ h̃ ± L, we have

LR_1^{−z}(S, h) ∈ LR_1^{−z}(P, h) ± ε (LR_1^{−z}(P, h̃) + L).   (45)

Here, S is a weighted instance of linear regression with outliers, and the total weight of outliers is z (see Remark 1).

We still use P_H to denote the set ∪_{i=0}^N (P ∩ H_i). First, we need to prove that S_H is a good approximation of P_H. Given a hyperplane h, we define a random variable x_p = |Res(p, h)| for each p ∈ P. If p ∈ H_i for 0 ≤ i ≤ N, similar to (21) and (22), we have the following bounds for x_p: x_p ≤ 2^i r + L; x_p ≥ max{2^{i−1} r − L, 0} if i ≥ 1 and x_p ≥ 0 if i = 0.

Then, we can apply the similar idea of Lemma 3 to obtain the following lemma, where the only difference is about the discretization of h̃ ± L. Recall that h̃ is defined by the coefficients h̃_1, ..., h̃_d and the input set P is normalized within the region R_D. We build a grid inside each vertical segment l_j u_j for 0 ≤ j ≤ d − 1, where l_0 = (0, ..., 0, h̃_d − L), u_0 = (0, ..., 0, h̃_d + L), and

l_j = (0, ..., 0, D, 0, ..., 0, h̃_j D + h̃_d − L),   (46)
u_j = (0, ..., 0, D, 0, ..., 0, h̃_j D + h̃_d + L)   (47)

(with the value D in the j-th coordinate) for j ≠ 0; the grid length is (ε/2d) L. Denote by G_j the set of grid points inside the segment l_j u_j. Obviously, G = G_0 × G_1 × ··· × G_{d−1} contains (4d/ε)^d d-tuple points, and each tuple determines a (d − 1)-dimensional hyperplane in h̃ ± L; moreover, we have the following claim (see the proof in our supplement).

Claim 2. For each h ∈ h̃ ± L, there exists a hyperplane h' determined by a d-tuple of points from G, such that |Res(p, h) − Res(p, h')| ≤ εL for any p ∈ P.

Lemma 8. Let S_0, S_1, ..., S_N be the samples obtained in Algorithm 2. Then, with probability 1 − η,

(1/(n − z)) | Σ_{i=0}^N (|P ∩ H_i|/|S_i|) Σ_{p∈S_i} x_p − Σ_{p∈P_H} x_p | ≤ O(ε) (LR_1^{−z}(P, h̃) + L)   (48)

for any h ∈ h̃ ± L.

We fix a solution h ∈ h̃ ± L. Similar to the proof of Theorem 1 in Section 3, we also consider the two parts P_in^h and P_out^h of P partitioned by h, where P_out^h is the set of the z farthest points to the hyperplane h (i.e., the outliers) and P_in^h = P \ P_out^h. Similarly, S is also partitioned into two parts S_in^h and S_out^h by h, where S_out^h is the set of outliers with total weight z. For case (i) P_H \ P_in^h = ∅ and case (ii) P_H \ P_in^h ≠ ∅, we can apply almost the identical ideas of Sections 3.1 and 3.2, respectively, to prove (45).
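Analogously to the sketch given for Algorithm 1, the residual-based layer partition used by Algorithm 2 (Steps 1-2) can be written as follows; the function name and the layer-index computation are illustrative choices of ours under the same NumPy conventions as before.

```python
import numpy as np

def regression_layers(P, h_tilde, z, eps):
    """Partition P into layers H_0, ..., H_N and H_out by residual to h~ (Steps 1-2, sketch)."""
    n = len(P)
    X, y = P[:, :-1], P[:, -1]
    res = np.abs(y - X @ h_tilde[:-1] - h_tilde[-1])    # x_p = |Res(p, h~)|
    gamma = z / (n - z)
    N = max(0, int(np.ceil(np.log2(1.0 / gamma))))
    m_out = int(np.ceil((1 + 1.0 / eps) * z))
    r = np.sort(res)[n - m_out - 1] / (2 ** N)          # so that |P ∩ H_out| = (1 + 1/eps) z, cf. (18)
    # layer index: 0 if res <= r, i if res in (2^{i-1} r, 2^i r], N + 1 for H_out
    idx = np.clip(np.ceil(np.log2(np.maximum(res / r, 1e-12))), 0, N + 1).astype(int)
    idx[res <= r] = 0
    return idx, r
```

The sampling and weighting steps (Steps 3-4) are then identical to the sketch given for Algorithm 1, with the layer membership taken from the returned indices.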

5. Experiments

For both Algorithms 1 and 2, we need to compute an initial solution C̃ or h̃ first. For k-median/means clustering, we run the algorithm of Local Search with Outliers from (Gupta et al., 2017a) on a small sample of size O(k) to obtain the k initial centers. We do not directly use the k-means++ method (Arthur & Vassilvitskii, 2007a) to seed the k initial centers because it is sensitive to outliers. For linear regression, we run the standard linear regression algorithm on a small random sample of size O(z) to compute an initial solution.

Different coreset methods.  Each of the following methods returns a weighted set as the coreset, and then we run the alternating minimization algorithm k-means-- (Chawla & Gionis, 2013a) (or the method of (Shen & Sanghavi, 2019) for linear regression) on it to obtain the solution. For fairness, we keep the coresets from different methods at the same coreset size for each instance.

• Layered Sampling (LaySam), i.e., Algorithms 1 and 2 proposed in this paper.

• Uniform Sampling (UniSam). The most natural and simple method is to take a sample S uniformly at random from the input data set P, where each sampled point has the weight |P|/|S|.

• Uniform Sampling + Nearest Neighbor Weight (NN) (Gupta et al., 2017a; Chen et al., 2018). Similar to UniSam, we also take a random sample S from the input data set P. For each p ∈ P, we assign it to its nearest neighbor in S; for each q ∈ S, we set its weight to be the number of points assigned to it.

• Summary (Chen et al., 2018). It is a method to construct the coreset for k-median/means clustering with outliers by successively sampling and removing points from the original data until the number of the remaining points is small enough.

The above LaySam, UniSam and NN are also used for linear regression in our experiments. We run 10 trials for each case and take the average. All the experimental results were obtained on a Ubuntu server with a 2.4GHz E5-2680V4 and 256GB main memory; the algorithms were implemented in Matlab R2018b.

Performance Measures.  The following measures are taken into account in the experiments (a short illustrative sketch of these measures is given after the dataset list below).

• ℓ1-loss: K_1^{−z} or LR_1^{−z}.

• ℓ2-loss: K_2^{−z} or LR_2^{−z}.

• recall/precision. Let O* and O be the sets of outliers with respect to the optimal solution and our obtained solution, respectively. recall = |O ∩ O*|/|O*| and precision = |O ∩ O*|/|O|. Since |O| = |O*| = z, recall = precision = |O ∩ O*|/z.

• pre-recall. It indicates the proportion of O* that is included in the coreset. Let S be the coreset; pre-recall = |S ∩ O*|/|O*|. We pay particular attention to this measure, because the outliers could be quite important and may reveal some useful information (e.g., ). For example, as for clustering, if we do not have any prior knowledge of a given biological dataset, the outliers could be from an unknown tiny species. Consequently, it is preferable to keep such information when compressing the data. More detailed discussion on the significance of outliers can be found in (Beyer & Sendhoff, 2007; Zimek et al., 2012; Moitra, 2018; Goodfellow et al., 2018).

Datasets.  We consider the following datasets in our experiments.

• syncluster. We generate the synthetic data as follows: first we create k centers with each dimension randomly located in [0, 100]; then we generate the points following standard Gaussian distributions around the centers.

• synregression. First, we randomly set the d coefficients of the hyperplane h in [−5, +5] and construct the x_i in [0, 10]^{d−1} by uniform sampling. Then let y_i be the inner product of (x_i, 1) and h. Finally we randomly perturb each y_i by N(0, 1).

• 3DSpatial (n = 434874, d = 4). This dataset was constructed by adding the elevation information to a 2D road network in North Jutland, Denmark (Kaul et al., 2013).

• covertype (n = 581012, d = 10). It is a forest cover type dataset from Jock A. Blackard (UCI Machine Learning Repository), and we select its first 10 attributes.

• skin (n = 245057, d = 3). The skin dataset is collected by randomly sampling B, G, R values from face images of various age groups, and we select its first three dimensions (Bhatt & Dhall).

• SGEMM (n = 241600, d = 15). It contains the running times for multiplying two 2048 x 2048 matrices using a GPU OpenCL SGEMM kernel with varying parameters (Ballester-Ripoll et al., 2017).

• PM2.5 (n = 41757, d = 11). It is a data set containing the PM2.5 data of the US Embassy in Beijing (Liang et al., 2015).
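As a small illustration of the measures recall, precision and pre-recall defined above (assuming the outlier sets are represented by index sets), one can compute them as follows.

```python
def outlier_metrics(coreset_ids, found_outliers, true_outliers):
    """recall / precision / pre-recall as defined in the Performance Measures list (sketch)."""
    O_star, O, S = set(true_outliers), set(found_outliers), set(coreset_ids)
    recall = len(O & O_star) / len(O_star)
    precision = len(O & O_star) / len(O)      # equals recall when |O| = |O*| = z
    pre_recall = len(S & O_star) / len(O_star)
    return recall, precision, pre_recall
```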

For each dataset, we randomly pick z points to be outliers by perturbing their locations in each dimension. We use a parameter σ to measure the extent of perturbation. For example, we consider the Gaussian distribution N(0, σ) and the uniform distribution on [−σ, +σ]. So the larger the parameter σ is, the more diffused the outliers will be. For simplicity, we use notations of the form [dataset]-[distribution]-σ to indicate the datasets, e.g., syncluster-gauss-σ.

5.1. Coreset Construction Time

We fix the coreset size and vary the data size n of the synthetic datasets, and show the coreset construction times in Figures 2(a) and 2(b). It is easy to see that the construction time of NN is larger than the other construction times by several orders of magnitude. It is not out of expectation that UniSam is the fastest (because it does not need any operation except uniform random sampling). Our LaySam lies in between UniSam and Summary.

Figure 2. Coreset construction time w.r.t. data size. (a) Clustering: coreset size = 10^4, d = 5 and k = 10; (b) Linear Regression: coreset size = 10^4 and d = 20.

We also study the influence of k (for clustering) on the construction time by testing the synthetic datasets and the real-world dataset covertype (see Figures 3(a) and 3(b)).

Figure 3. Construction time w.r.t. k. (a) syncluster with data size n = 10^6, d = 20; (b) covertype.

5.2. Performance

Figures 4(a)-4(e) show the performances of clustering on ℓ2-loss with different σs. The results on ℓ1-loss are very similar (more results are summarized in Tables 1 and 2). We can see that LaySam outperforms the other methods on both synthetic and real-world datasets. Moreover, its performance remains quite stable (with small standard deviation) and is also robust when σ increases. UniSam works well when σ is small, but it becomes very unstable when σ grows large. Both LaySam and UniSam outperform Summary on most datasets.

Similar comparisons of the performance for linear regression are shown in Figures 5(a)-5(c).

The three coreset methods achieve very close values of recall and precision. But UniSam has much lower pre-recall than those of Summary and LaySam.

6. Conclusion

To reduce the time complexities of existing algorithms for clustering and linear regression with outliers, we propose a new variant of coreset method which can guarantee the quality for any solution in a local range surrounding the given initial solution. In the future, it is worth considering applying our framework to a broader range of robust optimization problems, such as logistic regression with outliers and Gaussian mixture models with outliers.

Figure 4. Clustering: ℓ2-loss and stability w.r.t. σ. (a) syncluster-gauss-σ, with d = 20, k = 10, z = 2·10^4 and coreset size = 5·10^4; (b) 3DSpatial-gauss-σ, with k = 5, z = 4000 and coreset size = 10^4; (c) covertype-gauss-σ, with k = 20, z = 10^4 and coreset size = 3·10^4; (d) covertype-uniform-σ, with k = 20, z = 10^4 and coreset size = 3·10^4; (e) skin-uniform-σ, with k = 40, z = 3000 and coreset size = 10^4. (Each panel compares LaySam, Summary and UniSam.)

Figure 5. Linear Regression: ℓ2-loss and stability w.r.t. σ. (a) synregression-gauss-σ, with d = 20, z = 10^4 and coreset size = 2·10^4; (b) SGEMM-gauss-σ, with z = 4000 and coreset size = 10^4; (c) PM2.5-uniform-σ, with z = 1000 and coreset size = 3000. (Each panel compares LaySam and UniSam.)

Table 1. Performance on clustering with outliers.

(a) syncluster-gauss-σ, with d = 20, k = 10, z = 2·10^4 and coreset size = 5·10^4.
σ        | 20                       | 100                      | 200
Coreset  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary
ℓ1-loss  | 3.976   3.976   4.073    | 3.976   3.976   4.067    | 3.977   4.048   4.074
ℓ2-loss  | 18.01   18      18.9     | 18.01   18.01   18.84    | 18.01   18.14   18.91
Prec     | 1       1       1        | 1       1       1        | 1       1       1
Pre-Rec  | 1       0.0506  1        | 1       0.0499  1        | 1       0.0503  1

(b) covertype-gauss-σ, with k = 20, z = 10^4 and coreset size = 3·10^4.
σ        | 20                       | 100                      | 200
Coreset  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary
ℓ1-loss  | 1.781   1.781   1.868    | 1.779   1.785   1.871    | 1.786   1.799   1.853
ℓ2-loss  | 3.501   3.501   3.818    | 3.505   3.523   3.874    | 3.515   3.576   3.767
Prec     | 1       1       1        | 1       1       1        | 1       1       1
Pre-Rec  | 1       0.0524  1        | 1       0.0511  1        | 1       0.0547  1

(c) skin-uniform-σ, with k = 40, z = 3000 and coreset size = 10^4.
σ                  | 20                       | 100                      | 200
Coreset            | LaySam  UniSam  Summary  | LaySam  UniSam  Summary  | LaySam  UniSam  Summary
ℓ1-loss            | 0.1906  0.1906  0.1925   | 0.1883  0.1902  0.1914   | 0.1954  0.2032  0.1957
ℓ2-loss (×10^-2)   | 6.46    6.531   6.579    | 6.449   6.849   6.565    | 6.471   7.173   6.565
Prec               | 0.9993  0.9995  0.9993   | 1       1       1        | 1       1       1
Pre-Rec            | 0.9997  0.0403  1        | 1       0.0403  1        | 1       0.0437  1

Table 2. Performance on linear regression with outliers.

(a) synregression-gauss-σ, with d = 20, z = 10^4 and coreset size = 2·10^4.
σ        | 20              | 300             | 600
Coreset  | LaySam  UniSam  | LaySam  UniSam  | LaySam  UniSam
ℓ1-loss  | 0.7996  0.7982  | 0.7989  0.8007  | 0.7987  0.7998
ℓ2-loss  | 1.003   0.9993  | 1.001   1.01    | 1.002   1.025
Prec     | 0.9305  0.931   | 0.9953  0.9953  | 0.9975  0.9978
Pre-Rec  | 0.9535  0.0108  | 0.9968  0.0093  | 0.9975  0.0098

(b) PM2.5-uniform-σ, with z = 1000 and coreset size = 3000.
σ        | 20              | 500             | 1000
Coreset  | LaySam  UniSam  | LaySam  UniSam  | LaySam  UniSam
ℓ1-loss  | 0.6292  0.6244  | 0.637   0.6368  | 0.6349  0.6791
ℓ2-loss  | 0.7179  0.7075  | 0.744   0.7473  | 0.7456  0.8819
Prec     | 0.9188  0.921   | 0.99    0.9901  | 0.9918  0.9917
Pre-Rec  | 0.9714  0.0732  | 0.999   0.0725  | 1       0.0734

References Chawla, S. and Gionis, A. k-means–: A unified approach to clustering and outlier detection. In Proceedings of the Arthur, D. and Vassilvitskii, S. K-means++: The advan- 2013 SIAM International Conference on , pp. tages of careful seeding. In Proceedings of the Eigh- 189–197. SIAM, 2013b. teenth Annual ACM-SIAM Symposium on Discrete Al- gorithms, SODA ’07, pp. 1027–1035, Philadelphia, PA, Chen, J., Azer, E. S., and Zhang, Q. A practical algorithm USA, 2007a. Society for Industrial and Applied Mathe- for distributed clustering and outlier detection. In Ad- matics. ISBN 978-0-898716-24-5. vances in Neural Information Processing Systems 31: An- nual Conference on Neural Information Processing Sys- Arthur, D. and Vassilvitskii, S. k-means++: The advantages tems 2018, NeurIPS 2018, 3-8 December 2018, Montreal,´ of careful seeding. In Proceedings of the eighteenth an- Canada, pp. 2253–2262, 2018. nual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Society for Industrial and Applied Mathemat- Chen, K. A constant factor approximation algorithm for ics, 2007b. k-median clustering with outliers. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete al- Awasthi, P. and Balcan, M.-F. Center based clustering: A gorithms, pp. 826–835. Society for Industrial and Applied foundational perspective. 2014. Mathematics, 2008. Bachem, O., Lucic, M., and Krause, A. Practical core- Chen, K. On coresets for k-median and k-means clustering set constructions for machine learning. arXiv preprint in metric and euclidean spaces and their applications. arXiv:1703.06476, 2017. SIAM Journal on Computing, 39(3):923–947, 2009. Ballester-Ripoll, R., Paredes, E. G., and Pajarola, R. Sobol Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Ma- tensor trains for global sensitivity analysis. CoRR, honey, M. W. Sampling algorithms and coresets for ell p abs/1712.00233, 2017. regression. SIAM Journal on Computing, 38(5):2060–\ 2078, 2009. Beyer, H.-G. and Sendhoff, B. Robust optimization–a com- prehensive survey. Computer methods in applied mechan- Ding, H. and Xu, J. Sub-linear time hybrid approximations ics and engineering, 196(33-34):3190–3218, 2007. for least trimmed squares estimator and related problems. In 30th Annual Symposium on Computational Geometry, Bhaskara, A., Vadgama, S., and Xu, H. Greedy sampling SOCG’14, Kyoto, Japan, June 08 - 11, 2014, pp. 110, for approximate clustering in the presence of outliers. In 2014. doi: 10.1145/2582112.2582131. Advances in Neural Information Processing Systems, pp. 11146–11155, 2019. Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Sam- pling algorithms for l2 regression and applications. In Bhatia, K., Jain, P., and Kar, P. Robust regression via hard Proceedings of the seventeenth annual ACM-SIAM sym- thresholding. In Advances in Neural Information Pro- posium on Discrete algorithm, pp. 1127–1136. Society cessing Systems 28: Annual Conference on Neural Infor- for Industrial and Applied Mathematics, 2006. mation Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 721–729, 2015. Erickson, J. and Seidel, R. Better lower bounds on de- tecting affine and spherical degeneracies. Discrete & Bhatt, R. and Dhall, A. Skin segmentation dataset. UCI Computational Geometry, 13:41–57, 1995. doi: 10.1007/ Machine Learning Repository. BF02574027. Blum, M., Floyd, R. W., Pratt, V., Rivest, R. L., and Tarjan, Feldman, D. Core-sets: An updated survey. Wiley Interdis- R. E. Time bounds for selection. Journal of Computer cip. Rev. 
Data Min. Knowl. Discov., 10(1), 2020. and System Sciences, 7(4):448–461, 1973. Feldman, D. and Langberg, M. A unified framework for Boutsidis, C., Drineas, P., and Magdon-Ismail, M. Near- approximating and clustering data. In Proceedings of the optimal coresets for least-squares regression. IEEE Trans. 43rd ACM Symposium on Theory of Computing, STOC Information Theory, 59(10):6880–6892, 2013. doi: 10. 2011, San Jose, CA, USA, 6-8 June 2011, pp. 569–578, 1109/TIT.2013.2272457. 2011. Chandola, V., Banerjee, A., and Kumar, V. Anomaly detec- Feldman, D., Schmidt, M., and Sohler, C. Turning big tion: A survey. ACM Computing Surveys (CSUR), 41(3): data into tiny data: Constant-size coresets for k-means, 15, 2009. PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Dis- Chawla, S. and Gionis, A. k-means--: A unified approach crete Algorithms, SODA 2013, New Orleans, Louisiana, to clustering and outlier detection. In SDM, 2013a. USA, January 6-8, 2013, pp. 1434–1453, 2013. Layered Sampling for Robust Optimization Problems

Fichtenberger, H., Gille,´ M., Schmidt, M., Schwiegelshohn, Klivans, A. R., Kothari, P. K., and Meka, R. Efficient C., and Sohler, C. Bico: Birch meets coresets for k-means algorithms for outlier-robust regression. In Conference clustering. In European Symposium on Algorithms, pp. On Learning Theory, COLT 2018, Stockholm, Sweden, 481–492. Springer, 2013. 6-9 July 2018, pp. 1420–1430, 2018.

Friggstad, Z., Khodamoradi, K., Rezapour, M., and Krishnaswamy, R., Li, S., and Sandeep, S. Constant ap- Salavatipour, M. R. Approximation schemes for clus- proximation for k-median and k-means with outliers via tering with outliers. In Proceedings of the Twenty-Ninth iterative rounding. In Proceedings of the 50th Annual Annual ACM-SIAM Symposium on Discrete Algorithms, ACM SIGACT Symposium on Theory of Computing, pp. pp. 398–414. SIAM, 2018. 646–659. ACM, 2018.

Goodfellow, I. J., McDaniel, P. D., and Papernot, N. Mak- Langberg, M. and Schulman, L. J. Universal ε- ing machine learning robust against adversarial inputs. approximators for integrals. In Proceedings of the twenty- Commun. ACM, 61(7):56–66, 2018. first annual ACM-SIAM symposium on Discrete Algo- rithms, pp. 598–607. SIAM, 2010. Gupta, S. Approximation algorithms for clustering and facility location problems. PhD thesis, University of Li, Y., Long, P. M., and Srinivasan, A. Improved bounds on Illinois at Urbana-Champaign, 2018. the sample complexity of learning. Journal of Computer and System Sciences, 62(3):516–527, 2001. Gupta, S., Kumar, R., Lu, K., Moseley, B., and Vassilvitskii, S. Local search methods for k-means with outliers. Proc. Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., VLDB Endow., 10(7):757–768, March 2017a. ISSN 2150- Huang, H., and Chen, S. Assessing beijing’s pm 2.5 pol- 8097. doi: 10.14778/3067421.3067425. lution: severity, weather impact, apec and winter heating. Proceedings of the Royal Society A: Mathematical, Phys- Gupta, S., Kumar, R., Lu, K., Moseley, B., and Vassilvit- ical and Engineering Science, 471:20150257, 10 2015. skii, S. Local search methods for k-means with outliers. doi: 10.1098/rspa.2015.0257. Proceedings of the VLDB Endowment, 10(7):757–768, 2017b. Lucic, M., Faulkner, M., Krause, A., and Feldman, D. Train- ing gaussian mixture models at scale via coresets. The Har-Peled, S. and Kushal, A. Smaller coresets for k-median Journal of Machine Learning Research, 18(1):5885–5909, and k-means clustering. Discrete & Computational Ge- 2017. ometry, 37(1):3–19, 2007. Mettu, R. R. and Plaxton, C. G. Optimal time bounds for Hawkins, D. M. The feasible solution algorithm for least approximate clustering. Machine Learning, 56(1-3):35– trimmed squares regression. Computational Statistics and 60, 2004. Data Analysis, 17, 1994. doi: 10.1016/0167-9473(92) 00070-8. Moitra, A. Robustness meets algorithms (invited talk). In 16th Scandinavian Symposium and Workshops on Algo- Huang, L., Jiang, S., Li, J., and Wu, X. Epsilon-coresets rithm Theory, SWAT 2018, June 18-20, 2018, Malmo,¨ for clustering (with outliers) in doubling metrics. In Sweden, pp. 3:1–3:1, 2018. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 814–825. IEEE, 2018. Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silver- man, R., and Wu, A. Y. On the least trimmed squares Huggins, J., Campbell, T., and Broderick, T. Coresets for estimator. Algorithmica, 69(1):148–183, 2014. doi: scalable bayesian logistic regression. In Advances in 10.1007/s00453-012-9721-8. Neural Information Processing Systems, pp. 4080–4088, 2016. Mount, D. M., Netanyahu, N. S., Piatko, C. D., Wu, A. Y., and Silverman, R. A practical approximation algorithm Karnin, Z. S. and Liberty, E. Discrepancy, coresets, and for the LTS estimator. Computational Statistics and Data sketches in machine learning. CoRR, abs/1906.04845, Analysis, 99:148–170, 2016. doi: 10.1016/j.csda.2016.01. 2019. 016.

Kaul, M., Yang, B., and Jensen, C. Building accurate Munteanu, A., Schwiegelshohn, C., Sohler, C., and 3d spatial networks to enable next generation intelli- Woodruff, D. On coresets for logistic regression. In gent transportation systems. volume 1, 06 2013. doi: Advances in Neural Information Processing Systems, pp. 10.1109/MDM.2013.24. 6561–6570, 2018. Layered Sampling for Robust Optimization Problems

Ott, L., Pang, L., Ramos, F. T., and Chawla, S. On integrated clustering and outlier detection. In Advances in Neural Information Processing Systems, pp. 1359–1367, 2014.

Phillips, J. M. Coresets and sketches. Computing Research Repository, 2016.

Rousseeuw, P. and van Driessen, K. Computing LTS regression for large data sets. Data Min. Knowl. Discov., 12(1):29–45, 2006. doi: 10.1007/s10618-005-0024-4.

Rousseeuw, P. J. Least median of squares regression. Journal of the American Statistical Association, 79, 12 1984. doi: 10.1080/01621459.1984.10477105.

Shen, Y. and Sanghavi, S. Iterative least trimmed squares for mixed linear regression. In Advances in Neural In- formation Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 6076–6086, 2019.

Zimek, A., Schubert, E., and Kriegel, H.-P. A survey on unsupervised outlier detection in high-dimensional nu- merical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(5):363–387, 2012.

7. Proof of Claim 1

Since S_in^C is the set of inliers of S to C, there must exist some value r_S > 0 such that

S_in^C = { p | p ∈ S, min_{1≤j≤k} ||p − c_j|| ≤ r_S }.   (49)

And therefore

S_in^C \ P_H = { p | p ∈ S \ P_H, min_{1≤j≤k} ||p − c_j|| ≤ r_S }.   (50)

Similarly, there exists some value r_P > 0 such that

P_in^C \ P_H = { p | p ∈ P \ P_H, min_{1≤j≤k} ||p − c_j|| ≤ r_P }.   (51)

Note S \ P_H = P \ P_H. So, if r_S ≤ r_P, we have S_in^C \ P_H ⊆ P_in^C \ P_H. Otherwise, P_in^C \ P_H ⊆ S_in^C \ P_H.

8. Proof of Claim 2

Let h = (h_1, ..., h_d), and suppose h' = (h'_1, ..., h'_d) is h's nearest neighbor in G, i.e., |h_d − h'_d| ≤ (ε/2d) L and |D h'_j + h'_d − D h_j − h_d| ≤ (ε/2d) L for 1 ≤ j ≤ d − 1. Then,

|h'_j − h_j| ≤ (1/D) ((ε/2d) L + |h'_d − h_d|) ≤ (ε/(Dd)) L   (52)

for 1 ≤ j ≤ d − 1. For any p = (x_1, ..., x_d) ∈ R_D,

|Res(p, h) − Res(p, h')| ≤ Σ_{j=1}^{d−1} |h'_j − h_j| · |x_j| + |h'_d − h_d|
 ≤ Σ_{j=1}^{d−1} |h'_j − h_j| · D + |h'_d − h_d|
 ≤ εL.   (53)