Computing Core-Sets and Approximate Smallest Enclosing Hyperspheres in High Dimensions∗
Total Page:16
File Type:pdf, Size:1020Kb
Computing Core-Sets and Approximate Smallest Enclosing HyperSpheres in High Dimensions∗ Piyush Kumar† Joseph S. B. Mitchell‡ E. Alper Yıldırım§ Abstract values of > 0, compared with the best known imple- We study the minimum enclosing ball (MEB) problem for mentations of exact solvers. We also demonstrate that sets of points or balls in high dimensions. Using techniques the sizes of the core-sets tend to be much smaller than of second-order cone programming and “core-sets”, we have the worst-case theoretical upper bounds. developed (1+)-approximation algorithms that perform well Preliminaries. We let Bc,r denote a ball of radius d in practice, especially for very high dimensions, in addition r centered at point c R . Given an input set S = ∈ d p1,..., pn of n objects in R , the minimum enclosing to having provable guarantees. We prove the existence of { } core-sets of size (1/) , improving the previous bound of ball MEB(S) of S is the unique minimum-radius ball O (1/2) , and we study empirically how the core-set size containing S. (Uniqueness follows from results of O grows with dimension. We show that our algorithm, which [14, 33]; if B1 and B2 are two different smallest enclosing is simple to implement, results in fast computation of nearly balls for S, then one can construct a smaller ball containing B B and therefore containing S.) The optimal solutions for point sets in much higher dimension 1 ∩ 2 than previously computable using exact techniques. center, c∗, of MEB(S) is often called the 1-center of S, since it is the point of Rd that minimizes the maximum distance to points in S. We let r∗ denote the radius of 1 Introduction MEB(S). A ball Bc,(1+)r is said to be (1 + )-approximation of MEB(S) if r r∗ and S Bc r. We study the minimum enclosing ball (MEB) problem: ≤ ⊂ ,(1+) Compute a ball of minimum radius enclosing a given Throughout this paper, S will be either a set of d set of objects (points, balls, etc) in Rd. The MEB problem points in R or a set of balls. We let n = S . Given > 0, a subset, X S, is said| to| be a core-set arises in a number of important applications, often re- ⊆ of S if Bc r S, where Bc r = MEB(X); in other words, quiring that it be solved in relatively high dimensions. ,(1+) ⊃ , Applications of MEB computation include gap toler- X is a core-set if an expansion by factor (1+) of its MEB contains S. Since X S, r r∗; thus, the ball Bc r is a ant classifiers [8] in Machine Learning, tuning Support ⊆ ≤ ,(1+) Vector Machine parameters [10], Support Vector Clus- (1 + )-approximation of MEB(S). tering [4, 3], doing fast farthest neighbor query approx- Related Work. For small dimension d, the MEB problem can be solved in (n) time for n points using imation [17], k-center clustering [5], testing of radius O clustering for k = 1 [2], approximate 1-cylinder prob- the fact that it is an LP-type problem [21, 14]. One of the lem [5], computation of spatial hierarchies (e.g., sphere best implementable solutions to compute MEB exactly trees [18]), and other applications [13]. in moderately high dimensions is given by Gartner¨ and In this paper, we give improved time bounds for ap- Schonherr¨ [16]; the largest instance of MEB they solve proximation algorithms for the MEB problem, applica- is d = 300, n = 10000 (in about 20 minutes on their ble to an input set of points or balls in high dimensions. platform). In comparison, the largest instance we solve (1 + )-approximately is d = 1000, n = 100000, = 10 3; We prove a time bound of nd + 1 log 1 , which is − O 4.5 based on an improved bound of (1/) on the size of O ∗The code associated with this paper can be downloaded from “core-sets” as well as the use of second-order cone pro- http://www.compgeom.com/meb/. This research was partially sup- gramming (SOCP) for solving subproblems. We have ported by a DARPA subcontract from HRL Laboratories and grants performed an experimental investigation to determine from Honda Fundamental Research Labs, NASA Ames Research how the core-set size tends to behave in practice, for a (NAG2-1325), NSF (CCR-9732220, CCR-0098172), and Sandia Na- variety of input distributions. We show that substan- tional Labs. †Stony Brook University, [email protected]. Part of this work was tially larger instances, both in terms of the number n of done while the author was visiting MPI-Saarbrucken.¨ input points and the dimension d, of the MEB problem ‡Stony Brook University, [email protected]. can be solved (1 + )-approximately, with very small §Stony Brook University, [email protected]. in this case the virtual memory was running low on the the size of core-sets. Section 4 is devoted to discussion system1. Another implementation of an exact solver of the experiments and of the results obtained with our is based on the algorithm of Gartner¨ [15]; this code is implementation. part of the CGAL2 library. For large dimensions, our approximation algorithm is found to be much faster 2 SOCP Formulation than this exact solver. The minimum enclosing ball (MEB) problem can be for- We are not aware of other implementations of mulated as a second-order cone programming (SOCP) polynomial-time approximation schemes for the MEB problem. SOCP can be viewed as an extension of linear problem. programming in which the nonnegative orthant is re- Independently from our work, the MEB problem in placed by the second-order cone (also called the “Lorenz high dimensions was also studied in [33]. The authors cone,” or the “quadratic cone”), defined as consider two approaches, one based on reformulation 1+d as an unconstrained convex optimization problem and K = (σ, x) R : x σ . { ∈ k k ≤ } another based on a Second Order Cone Programming Therefore, SOCP is essentially linear programming (SOCP) formulation. Similarly,four algorithms (includ- over an affine subset of products of second-order cones. ing a randomized algorithm) are compared in [31] for Recently, SOCP has received a lot of attention from the computation of the minimum enclosing circle of cir- the optimization community due to its applications cles on the plane. Both studies reveal that solving MEB in a wide variety of areas (see, e.g., [20, 1]) and due using a direct SOCP formulation suffers from mem- also to the existence of very efficient algorithms to ory problems as the dimension, d, and the number of solve this class of optimization problems. In particular, points, n, increase. This is why we have worked to any SOCP problem involving n second-order cones combine SOCP with core-sets in designing a practical can be solved within any specified additive error MEB method. > 0 in ( √n log(1/)) iterations by interior-point In a forthcoming paper of Badoiu˘ and Clarkson [6], algorithmsO [22, 26]. the authors have independently also obtained an upper The MEB problem can be formulated as an SOCP bound of (1/) on the size of core-sets and have, most problem as recently [7],O proved a worst-case tight upper bound of 1/ . Note that the worst case upper bound does min r, s.t. c ci + ri r, i = 1,..., n, c,r k − k ≤ notd applye to our experiments since in almost all our experiments, the dimension d satisfies d < 1 . The worst where c1,..., cn and r1,..., rn constitute the centers and the radii of the input set S Rd, respectively, c and r are case upper bound of [6, 7] only applies to the case when ⊂ 1 the center and the radius of the MEB, respectively (Note d . Our experimental results on a wide variety of input≥ sets show that the core set size is smaller than that the formulation reduces to the usual MEB problem 1 for point sets if ri = 0 for i = 1,..., n). By introducing min( , d + 1) (See Figure 2). Badoiu˘ et al. [5] introduced the notion of core-sets slack variables and their use in approximation algorithms for high- (γi, si) K, i = 1,..., n, dimensional clustering problems. In particular, they ∈ the MEB problem can be reformulated in (dual) stan- give an dn + 1 log 1 time (1 + )-approximation O 2 10 dard form as algorithm based on their upper bound of (1/2) on " # " # " # O r γ r the size of core-sets; the upper bound on the core- max r, s.t. + i = i , c,r,γ,s ,...,s c s c set size is remarkable in that it does not depend on 1 n − − i − i d. In comparison, our time bound (Theorem 3.2) is along with the constraints (γi, si) K, i = 1,..., n, where ∈ nd + 1 log 1 . γ denotes the n-dimensional vector whose components O 4.5 Outline of paper. We first show in Section 2 how are given by γ1, . , γn. The Lagrangian dual is given to use second-order cone programming to solve the in (primal) standard form by 2 MEB problem in ( √nd (n + d) log(1/)) arithmetic Pn Pn T O min i=1 riσi i=1 ci xi, operations. This algorithm is specially suited for σ,x1,...,xn − − " # " # problems in which n is small and d is large; thus, we Pn σi 1 s.t. i=1 = − , study algorithms to compute core-sets in Section 3 in − xi 0 an effort to select a small subset X, a core-set, that is (σi, xi) K, i = 1,..., n, sufficient for approximation purposes.