K-Means Requires Exponentially Many Iterations Even in the Plane

k-means Requires Exponentially Many Iterations Even in the Plane Andrea Vattani University of California, San Diego [email protected] ABSTRACT centers as the means (or centers of mass) of their assigned The k-means algorithm is a well-known method for parti- points. This process of assigning data points and readjusting tioning n points that lie in the d-dimensional space into k centers is repeated until it stabilizes. clusters. Its main features are simplicity and speed in prac- Despite its age, k-means is still very popular today and tice. Theoretically, however, the best known upper bound is considered \by far the most popular clustering algorithm on its running time (i.e. O(nkd)) is, in general, exponential used in scientific and industrial applications", as Berkhin re- in the number of points (when kd = Ω(n= log n)). Recently, marks in his survey on data mining [4]. Its widespread usage Arthur and Vassilvitskii [2] showed a super-polynomial worst- extends over a variety of different areas, such as artificial in- case analysis, improving the best known lower bound from telligence, computational biology, computer graphics, just to p p name a few (see [1, 8]). It is particularly popular because of Ω(n) to 2Ω( n) with a construction in d = Ω( n) dimen- its simplicity and observed speed: as Duda et al. say in their sions. In [2] they also conjectured the existence of super- text on pattern classification [6], \In practice the number of polynomial lower bounds for any d ≥ 2. iterations is much less than the number of samples". Our contribution is twofold: we prove this conjecture and we improve the lower bound, by presenting a simple con- Even if, in practice, speed is recognized as one of k-means' struction in the plane that leads to the exponential lower main qualities (see [11] for empirical studies), on the other bound 2Ω(n). hand there are a few theoretical bounds on its worst-case running time and they do not corroborate this feature. Categories and Subject Descriptors An upper bound of O(kn) can be trivially established since it can be shown that no clustering occurs twice during the F.2.2 [Analysis of Algorithms and Problem Complex- course of the algorithm. In [10], Inaba et al. improved ity]: Nonnumerical Algorithms and Problems|Geometrical this bound to O(nkd) by counting the number of Voronoi problems and computations d partitions of n points in R into k classes. Other bounds are known for some special cases. Namely, Dasgupta [5] General Terms analyzed the case d = 1, proving an upper bound of O(n) Algorithms, Theory when k < 5, and a worst-case lower bound of Ω(n). Later, Har-Peled and Sadri [9], again for the one-dimensional case, showed an upper bound of O(n∆2) where ∆ is the spread Keywords of the point set (i.e. the ratio between the largest and the K-means, Lower bounds smallest pairwise distance), and conjectured that k-means might run in time polynomial in n and ∆ for any d. kd 1. INTRODUCTION The upper bound O(n ) for the general case has not been improved since more than a decade, and this suggests that The k-means method is one of the most widely used algo- it might be not far from the truth. Arthur and Vassilvit- rithms for geometric clustering. It was originally proposed skii [2] showed that k-means can run for super-polynomially by Forgy in 1965 [7] and McQueen in 1967 [13], and is often many iterations, improvingp the best known lower bound known as Lloyd's algorithm [12]. It is a local search algo- Ω( n) from Ω(n) [5]p to 2 . Their contruction lies in a space rithm and partitions n data points into k clusters in this with d = Θ( n) dimensions, and they leave an open ques- way: seeded with k initial cluster centers, it assigns every tion about the performance of k-means for a smaller number data point to its closest center, and then recomputes the new of dimensions d, conjecturing the existence of superpolyno- mial lower bounds when d > 1. Also they show that their construction can be modified to have low spread,p disproving the aforementioned conjecture in [9] for d = Ω( n). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are A more recent line of work that aims to close the gap not made or distributed for profit or commercial advantage and that copies between practical and theoretical performance makes use bear this notice and the full citation on the first page. To copy otherwise, to of the smoothed analysis introduced by Spielman and Teng republish, to post on servers or to redistribute to lists, requires prior specific [15]. Arthur and Vassilvitskii [3] proved a smoothed upper p permission and/or a fee. O(k) O( k) SCG’09, June 8–10, 2009, Aarhus, Denmark. bound of poly(n ), recently improved to poly(n ) by Copyright 2009 ACM 978-1-60558-501-7/09/06 ...$5.00. Manthey and Röglin [14]. 324 1.1 Our result We stress that when k-means runs on our constructions, it In this work we are interested in the performance of k- does not fall into any of these situations, so the lower bound means in a low dimensional space. We said it is conjectured does not exploit these degeneracies. [2] that there exist instances in d dimensions for any d ≥ 2, Our construction uses points that have constant integer for which k-means runs for a super-polynomial number of weights. This means that the data set that k-means will iterations. take in input is actually a multiset, and the center of mass Our main result is a construction in the plane (d = 2) of a cluster Ci (that is Step 2 of k-means) is computed as P P for which k-means requires exponentially many iterations to wxx= wx, where wx is the weight of x. This is x2Ci x2Ci stabilize. Specifically, we present a set of n data points ly- not a restriction since integer weights in the range [1;C] can 2 ing in R , and a set of k = Θ(n) adversarially chosen cluster be simulated by blowing up the size of the data set by at 2 Ω(n) centers in R , for which the algorithm runs for 2 iter- most C: it is enough to replace each point x of weight w with ations. This proves the aforementioned conjecture and, at a set of w distinct points (of unitary weight) whose center the samep time, it also improves the best known lower bound of mass is x, and so close to each other that the behavior of from 2Ω( n) to 2Ω(n). Notice that the exponent is optimal k-means (as well as its number of iterations) is not affected. disregarding logarithmic factor, since the bound for the general case O(nkd) can be rewritten as 2O(n log n) when d = 2 3. LOWER BOUND and k = Θ(n). For any k = o(n), our lower bound easily Ω(k) In this section we present a construction in the plane for translates to 2 , which, analogously, is almost optimal Ω(n) O(k log n) which k-means requires 2 iterations. We start with some since the upper bound is 2 . high level intuition of the construction, then we give some A common practice for seeding k-means is to choose the definitions explaining the idea behind the construction, and initial centers as a subset of the data points. We show that finally we proceed to the formal proof. even in this case (i.e. cluster centers adversarially chosen At the end of the section, we show a couple of extensions: among the data points), the running time of k-means is still the first one is a modification of our construction so that the exponential. initial set of centers is a subset of the data points, and the Also, using a result in [2], our construction can be modified second one describes how to obtain low spread. to an instance in d = 3 dimensions having low spread for Ω(n) A simple implementation in Python of the lower bound is which k-means requires 2 iterations, which disproves the available at the web address [16]. conjecture of Har-Peled and Sadri [9] for any d ≥ 3. Finally, we observe that our result implies that the smoothed 3.1 High level intuition analysis helps even for a small number of dimensions, since p The idea behind our construction is simple and can be O( k) the best smoothed upper bound is n , while our lower related to the saying \Who watches the watchmen?" (or the Ω(k) 2 bound is 2 which is larger for k = !(log n). In other original Latin phrase \Quis custodiet ipsos custodes?"). words, perturbing each data point and then running k-means Consider a sequence of t watchmen W0;W1;:::;Wt−1.A would improve the performance of the algorithm. \day" of a watchman Wi (i > 0) can be described as follows (see Fig. 1): Wi watches Wi−1, waking it up once it falls 2. THE K-MEANS ALGORITHM asleep, and does so twice; afterwards, Wi falls asleep itself. The k-means algorithm partitions a set X of n points in The watchman W0 instead will simply fall asleep directly d R into k clusters.

Load more