A Discriminative Framework for Clustering via Similarity Functions

Maria-Florina Balcan∗, Dept. of Computer Science, Carnegie Mellon University, Pittsburgh, PA ([email protected])
Avrim Blum∗, Dept. of Computer Science, Carnegie Mellon University, Pittsburgh, PA ([email protected])
Santosh Vempala†, College of Computing, Georgia Inst. of Technology, Atlanta, GA ([email protected])

∗ Supported in part by the National Science Foundation under grant CCF-0514922, and by a Google Research Grant.
† Supported in part by the National Science Foundation under grant CCF-0721503, and by a Raytheon fellowship.

STOC'08, May 17–20, 2008, Victoria, British Columbia, Canada. Copyright 2008 ACM 978-1-60558-047-0/08/05.

ABSTRACT

Problems of clustering data from pairwise similarity information are ubiquitous in Computer Science. Theoretical treatments typically view the similarity information as ground-truth and then design algorithms to (approximately) optimize various graph-based objective functions. However, in most applications, this similarity information is merely based on some heuristic; the ground truth is really the unknown correct clustering of the data points and the real goal is to achieve low error on the data. In this work, we develop a theoretical approach to clustering from this perspective. In particular, motivated by recent work in learning theory that asks "what natural properties of a similarity (or kernel) function are sufficient to be able to learn well?" we ask "what natural properties of a similarity function are sufficient to be able to cluster well?"

To study this question we develop a theoretical framework that can be viewed as an analog of the PAC learning model for clustering, where the object of study, rather than being a concept class, is a class of (concept, similarity function) pairs, or equivalently, a property the similarity function should satisfy with respect to the ground truth clustering. We then analyze both algorithmic and information-theoretic issues in our model. While quite strong properties are needed if the goal is to produce a single approximately-correct clustering, we find that a number of reasonable properties are sufficient under two natural relaxations: (a) list clustering: analogous to the notion of list-decoding, the algorithm can produce a small list of clusterings (which a user can select from) and (b) hierarchical clustering: the algorithm's goal is to produce a hierarchy such that the desired clustering is some pruning of this tree (which a user could navigate). We develop a notion of the clustering complexity of a given property (analogous to notions of capacity in learning theory) that characterizes its information-theoretic usefulness for clustering. We analyze this quantity for several natural game-theoretic and learning-theoretic properties, as well as design new efficient algorithms that are able to take advantage of them. Our algorithms for hierarchical clustering combine recent learning-theoretic approaches with linkage-style methods. We also show how our algorithms can be extended to the inductive case, i.e., by using just a constant-sized sample, as in property testing. The analysis here uses regularity-type results of [20] and [3].

Categories and Subject Descriptors: F.2.0 [Analysis of Algorithms and Problem Complexity]: General

General Terms: Algorithms, Theory

Keywords: Clustering, Similarity Functions, Learning.

1. INTRODUCTION

Clustering is an important problem in the analysis and exploration of data. It has a wide range of applications in data mining, computer vision and graphics, and gene analysis. It has many variants and formulations and it has been extensively studied in many different communities.

In the Algorithms literature, clustering is typically studied by posing some objective function, such as k-median, min-sum or k-means, and then developing algorithms for approximately optimizing this objective given a data set represented as a weighted graph [12, 24, 22]. That is, the graph is viewed as "ground truth" and then the goal is to design algorithms to optimize various objectives over this graph.
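For reference, these objectives have standard forms (the formulations below are textbook definitions stated for concreteness, not notation from this paper; S denotes the point set and d the pairwise distance):

\[
\begin{aligned}
k\text{-median:} &\quad \min_{c_1,\dots,c_k} \sum_{x \in S} \min_{i} d(x, c_i) \\
k\text{-means:} &\quad \min_{c_1,\dots,c_k} \sum_{x \in S} \min_{i} d(x, c_i)^2 \\
\text{min-sum:} &\quad \min_{C_1,\dots,C_k} \sum_{i=1}^{k} \sum_{x,y \in C_i} d(x, y)
\end{aligned}
\]

where the c_i are chosen centers and C_1, ..., C_k ranges over partitions of S into k clusters.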
However, for most clustering problems such as clustering documents by topic or clustering web-search results by category, ground truth is really the unknown true topic or true category of each object. The construction of the weighted graph is just done using some heuristic: e.g., cosine-similarity for clustering documents or a Smith-Waterman score in computational biology. In all these settings, the goal is really to produce a clustering that gets the data correct. Alternatively, methods developed both in the algorithms and in the machine learning literature for learning mixtures of distributions [1, 5, 23, 36, 17, 15] explicitly have a notion of ground-truth clusters which they aim to recover. However, such methods are based on very strong assumptions: they require an embedding of the objects into R^n such that the clusters can be viewed as distributions with very specific properties (e.g., Gaussian or log-concave). In many real-world situations (e.g., clustering web-search results by topic, where different users might have different notions of what a "topic" is) we can only expect a domain expert to provide a notion of similarity between objects that is related in some reasonable ways to the desired clustering goal, and not necessarily an embedding with such strong properties.

In this work, we develop a theoretical study of the clustering problem from this perspective. In particular, motivated by work on learning with kernel functions that asks "what natural properties of a given kernel (or similarity) function K are sufficient to allow one to learn well?" [6, 7, 31, 29, 21] we ask the question "what natural properties of a pairwise similarity function are sufficient to allow one to cluster well?" To study this question we develop a theoretical framework which can be thought of as a discriminative (PAC style) model for clustering, though the basic object of study, rather than a concept class, is a property of the similarity function K in relation to the target concept, or equivalently a set of (concept, similarity function) pairs.

The main difficulty that appears when phrasing the problem in this general way is that if one defines success as outputting a single clustering that closely approximates the correct clustering, then one needs to assume very strong conditions on the similarity function. For example, if the function provided by our expert is extremely good, say K(x,y) > 1/2 for all pairs x and y that should be in the same cluster, and K(x,y) < 1/2 for all pairs x and y that should be in different clusters, then we could just use it to recover the clusters in a trivial way.¹
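To see why this strong condition trivializes the problem: thresholding the similarity at 1/2 and taking connected components recovers the clusters exactly. Here is a minimal sketch of that step (our illustration, not an algorithm from the paper; the function and variable names are hypothetical):

```python
def recover_clusters(points, K, threshold=0.5):
    """Group `points` (any hashable objects) into the connected
    components of the graph with an edge (x, y) whenever
    K(x, y) > threshold."""
    clusters, assigned = [], set()
    for p in points:
        if p in assigned:
            continue
        # Grow the component containing p by graph search.
        component, frontier = [], [p]
        assigned.add(p)
        while frontier:
            x = frontier.pop()
            component.append(x)
            for y in points:
                if y not in assigned and K(x, y) > threshold:
                    assigned.add(y)
                    frontier.append(y)
        clusters.append(component)
    return clusters
```

Under the stated condition the components coincide exactly with the target clusters; the point of the discussion that follows is that even a slightly weaker condition breaks such direct approaches.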
However, if we just slightly weaken this condition to simply require that all points x are more similar to all points y from their own cluster than to any points y′ from any other clusters, then this is no longer sufficient to uniquely identify even a good approximation to the correct answer. For instance, in the example in Figure 1, there are multiple clusterings consistent with this property. Even if one is told the correct clustering has 3 clusters, there is no way for an algorithm to tell which of the two (very different) possible solutions is correct. In fact, results of Kleinberg [25] can be viewed as effectively ruling out a broad class of scale-invariant properties such as this one as being sufficient for producing the correct answer.

Figure 1: Data lies in four regions A, B, C, D (e.g., think of these as documents on baseball, football, TCS, and AI). Suppose that K(x,y) = 1 if x and y belong to the same region, K(x,y) = 1/2 if x ∈ A and y ∈ B or if x ∈ C and y ∈ D, and K(x,y) = 0 otherwise. Even assuming that all points are more similar to other points in their own cluster than to any point in any other cluster, there are still multiple consistent clusterings, including two consistent 3-clusterings ((A ∪ B, C, D) or (A, B, C ∪ D)). However, there is a single hierarchical decomposition such that any consistent clustering is a pruning of this tree.

In our work we overcome this problem by considering two relaxations of the clustering objective that are natural for many clustering applications. The first, as in list-decoding, is to allow the algorithm to produce a small list of clusterings such that at least one of them has low error.

1.1 Perspective

The standard approach in theoretical computer science to clustering is to choose some objective function (e.g., k-median) and then to develop algorithms that approximately optimize that objective [12, 24, 22, 18]. If the true goal is to achieve low error with respect to an underlying correct clustering (e.g., a user's desired clustering of search results by topic), however, then one can view this as implicitly making the strong assumption that not only does the correct clustering have a good objective value, but also that all clusterings that approximately optimize the objective must be close to the correct clustering as well. In this work, we instead explicitly
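Returning to the example in Figure 1: the following small sketch (our illustration; single-linkage here is a stand-in for the paper's more involved linkage-style algorithms, and all names are ours) instantiates the caption's similarity function and builds a merge tree:

```python
from itertools import combinations

# Two sample points per region, per the Figure 1 setup.
points = [(region, i) for region in "ABCD" for i in range(2)]

def K(x, y):
    """Similarity function from the Figure 1 caption."""
    if x[0] == y[0]:
        return 1.0                                  # same region
    if {x[0], y[0]} in ({"A", "B"}, {"C", "D"}):
        return 0.5                                  # A-B or C-D pair
    return 0.0                                      # all other pairs

def single_linkage(points, K):
    """Repeatedly merge the two clusters joined by the highest-similarity
    pair; return the merged sets (the internal nodes of the tree)."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: max(K(x, y)
                                      for x in clusters[ij[0]]
                                      for y in clusters[ij[1]]))
        merged = clusters[i] | clusters[j]
        merges.append(merged)
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return merges

for node in single_linkage(points, K):
    print(sorted(node))  # regions form first, then A ∪ B and C ∪ D, then all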