The SpectACl of Nonconvex Clustering: A Spectral Approach to Density-Based Clustering
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Sibylle Hess (TU Dortmund University, Computer Science Faculty, Artificial Intelligence LS VIII, D-44221 Dortmund, Germany)
Wouter Duivesteijn (Technische Universiteit Eindhoven, Faculteit Wiskunde & Informatica, Data Mining Group, Eindhoven, the Netherlands)
Philipp Honysz, Katharina Morik (TU Dortmund University, Computer Science Faculty, Artificial Intelligence LS VIII, D-44221 Dortmund, Germany)
[email protected], [email protected], [email protected], [email protected]

Abstract

When it comes to clustering nonconvex shapes, two paradigms are used to find the most suitable clustering: minimum cut and maximum density. The most popular algorithms incorporating these paradigms are Spectral Clustering and DBSCAN. Both paradigms have their pros and cons. While minimum cut clusterings are sensitive to noise, density-based clusterings have trouble handling clusters with varying densities. In this paper, we propose SPECTACL: a method combining the advantages of both approaches, while solving the two mentioned drawbacks. Our method is easy to implement, like Spectral Clustering, and theoretically founded to optimize a proposed density criterion of clusterings. Through experiments on synthetic and real-world data, we demonstrate that our approach provides robust and reliable clusterings.

[Figure 1 (panels: SC, DBSCAN, DBSCAN, SPECTACL): Performance of Spectral Clustering, DBSCAN, and SPECTACL on two concentric circles. Best viewed in color.]

1 Introduction

Despite being one of the core tasks of data mining, and despite having been around since the 1930s (Driver and Kroeber 1932; Klimek 1935; Tryon 1939), the question of clustering has not yet been answered in a manner that doesn't come with innate disadvantages. The paper that you are currently reading will also not provide such an answer. Several advanced solutions to the clustering problem have become quite famous, and justly so, for delivering insight in data where the clusters do not offer themselves up easily. Spectral Clustering (Dhillon, Guan, and Kulis 2004) provides an answer to the curse of dimensionality inherent in the clustering task formulation, by reducing dimensionality through the spectrum of the similarity matrix of the data. DBSCAN (Ester et al. 1996) is a density-based clustering algorithm which won the SIGKDD test of time award in 2014. Both Spectral Clustering and DBSCAN can find non-linearly separable clusters, a feat that trips up naive clustering approaches, and both deliver good results on such data. In this paper, we propose a new clustering model which encompasses the strengths of both Spectral Clustering and DBSCAN; the combination can overcome some of the innate disadvantages of both individual methods.

For all their strengths, even the most advanced clustering methods can still be tripped up by pathological cases: datasets where the human observer immediately sees what is going on, but which prove to remain tricky for all state-of-the-art clustering algorithms. One such example is the dataset illustrated in Figure 1; we will refer to it as the two circles dataset. It consists of two noisily¹ separated concentric circles, each encompassing the same number of observations. As the leftmost plot in Figure 1 shows, Spectral Clustering does not at all uncover the innate structure of the data: two clusters are discovered, but both incorporate about half of each circle. It is well known that Spectral Clustering is highly sensitive to noise (Bojchevski, Matkovic, and Günnemann 2017, Figure 1); if two densely connected communities are additionally connected to each other via a narrow bridge of only a few observations, Spectral Clustering runs the risk of reporting these communities plus the bridge as a single cluster, whereas two clusters plus a few noise observations (outliers) would be the desired outcome. The middle plots in Figure 1 show how DBSCAN (with minPts set to 25 and 26, respectively) fails to realize that there are two clusters. This is hardly surprising, since DBSCAN is known to struggle with several clusters of varying density, and that is exactly what we are dealing with here: since both circles consist of the same number of observations, the inner circle is substantially denser than the outer circle. The rightmost plot in Figure 1 displays the result of the new clustering method that we introduce in this paper, SPECTACL (Spectral Averagely-dense Clustering): it accurately delivers the clustering that represents the underlying phenomena in the dataset.

¹ The scikit-learn clustering documentation (cf. https://scikit-learn.org/stable/modules/clustering.html) shows how SC and DBSCAN can succeed on this dataset, but that version contains barely any noise (scikit's noise parameter set to 0.05, instead of our still benign 0.1).

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
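For illustration, a noisy two circles dataset of this kind can be generated with scikit-learn's make_circles. The following is a minimal sketch: the noise level 0.1 matches the footnote above, while the total sample size, the radius ratio (factor), and the random seed are assumptions made only for this example.

from sklearn.datasets import make_circles

# Two concentric circles with the same number of points per circle and
# Gaussian noise of standard deviation 0.1, as described in the text.
# n_samples, factor, and random_state are illustrative assumptions.
D, labels = make_circles(n_samples=1500, noise=0.1, factor=0.5, random_state=0)
# D has shape (1500, 2); labels mark the outer (0) and inner (1) circle.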
1.1 Main Contributions

In this paper, we provide SPECTACL, a new clustering method which combines the benefits of Spectral Clustering and DBSCAN, while alleviating some of the innate disadvantages of each individual method. Our method finds clusters having a large average density, where the appropriate density for each cluster is automatically determined through the spectrum of the weighted adjacency matrix. Hence, SPECTACL does not suffer from sensitivity to the minPts parameter as DBSCAN does, and unlike DBSCAN it can natively handle several clusters with varying densities. As in the Spectral Clustering pipeline, the final step of SPECTACL is an embedding postprocessing step using k-means. However, unlike Spectral Clustering, we demonstrate the fundamental soundness of applying k-means to the embedding step in SPECTACL: from SPECTACL's objective function we derive an upper bound by means of the eigenvector decomposition, and we show that optimizing this upper bound is equivalent to applying k-means to the eigenvectors. Our Python implementation, and the data generating and evaluation script, are publicly available².

2 A Short Story of Clustering

A formal discussion of the state of clustering requires some notation. We assume that our data is given by the matrix $D \in \mathbb{R}^{m \times n}$. The data represents points $D_{j\cdot}$ for $1 \leq j \leq m$ in the $n$-dimensional feature space. For every point, we denote with $N_\epsilon(j) = \{l \mid \|D_{j\cdot} - D_{l\cdot}\| < \epsilon\}$ its $\epsilon$-neighborhood. Given an arbitrary matrix $M$, we denote with $\|M\|$ its Frobenius norm. If $M$ is diagonalizable, then we denote its truncated eigendecomposition of rank $d$ as $M \approx V^{(d)} \Lambda^{(d)} V^{(d)\top}$. The matrix $\Lambda^{(d)}$ is a $d \times d$ diagonal matrix whose diagonal holds the eigenvalues of $M$ with the $d$ largest absolute values: $|\lambda_1| = |\Lambda_{11}| \geq \ldots \geq |\Lambda_{dd}|$. The matrix $V^{(d)}$ concatenates the corresponding eigenvectors. We write $\mathbf{1}$ to represent a matrix having all elements equal to one; the dimension of such a matrix can always be inferred from the context. We also use the shorthand $\mathbb{1}^{m \times l}$ for the set of all binary partition matrices. That is, $M \in \mathbb{1}^{m \times l}$ if and only if $M \in \{0, 1\}^{m \times l}$ and $|M_{j\cdot}| = 1$ for all $j$.

2.1 k-means

k-means clustering combines an effective yet theoretically founded optimization by alternating minimization with the intuitively easily graspable notion of clusters by the within-cluster point scatter. The underlying assumption for this kind of clustering is that points in one cluster are similar to each other on average. As fallout, this imposes the undesired constraint that clusters are separated by a Voronoi tessellation and thus have convex shapes.

We denote the objective of k-means in its matrix factorization variant:

$$\min_{Y, X} \|D - Y X^\top\|^2 \quad \text{s.t. } Y \in \mathbb{1}^{m \times r},\ X \in \mathbb{R}^{n \times r}. \qquad (1)$$

Here, the matrix $Y$ indicates the cluster memberships, i.e., $Y_{js} = 1$ if point $j$ is assigned to cluster $s$ and $Y_{js} = 0$ otherwise. The matrix $X$ represents the cluster centroids. The objective is nonconvex, but convex if one of the matrices is fixed. The optimum with respect to the matrix $X$ for fixed cluster assignments $Y$ is attained at $X = D^\top Y (Y^\top Y)^{-1}$. This is a matrix formulation of the simple notion that each cluster centroid is equal to the average of the points in the cluster (Bauckhage 2015).
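To make the alternating minimization behind Equation (1) concrete, the sketch below alternates the closed-form centroid update $X = D^\top Y (Y^\top Y)^{-1}$ with nearest-centroid reassignment. This is only an illustrative NumPy sketch of standard k-means in the matrix factorization view, not the SPECTACL implementation; the initialization and stopping criterion are simplistic assumptions.

import numpy as np

def kmeans_alternating(D, r, max_iter=100, seed=0):
    """Minimize ||D - Y X^T||^2 over binary partition Y and centroids X."""
    m, _ = D.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(r, size=m)              # random initial assignment
    for _ in range(max_iter):
        Y = np.eye(r)[labels]                     # binary partition matrix (m x r)
        counts = Y.sum(axis=0)
        counts[counts == 0] = 1                   # guard against empty clusters
        X = (D.T @ Y) / counts                    # X = D^T Y (Y^T Y)^{-1}  (n x r)
        dists = ((D[:, None, :] - X.T[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)         # reassign to nearest centroid
        if np.array_equal(new_labels, labels):
            break                                 # assignments stable: converged
        labels = new_labels
    return labels, X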
The objective of Equation (1) is equivalent to a trace maximization problem (Ding et al. 2006), which derives from the definition of the Frobenius norm by means of the trace:

$$\max_{Y} \operatorname{tr}\!\left(Y^\top D D^\top Y (Y^\top Y)^{-1}\right) \quad \text{s.t. } Y \in \mathbb{1}^{m \times r}. \qquad (2)$$

Note that this objective formulation of k-means depends only on the similarities of data points, expressed by their inner products. This is a stepping stone to the application of kernel methods and the derivation of nonconvex clusters.

2.2 Graph Cuts

The restriction to convex clusters can be circumvented by a transformation of the data points into a potentially higher-dimensional Hilbert space, reflected by a kernel matrix (Schölkopf, Smola, and Müller 1998). Such a representation, based only on the similarity of points, introduces an interpretation of the data as a graph, where each data point corresponds to a node and the denoted similarities are the weights of the corresponding edges. In this view, a cluster is identifiable as a densely connected component which has only weak connections to points outside of the cluster. In other words, if we cut the edges connecting the cluster to its outside, we strive to cut as few edges as possible. This is known as the minimum cut problem. Given a possibly normalized edge weight matrix $W$, the cut is

$$\operatorname{cut}(Y; W) = \sum_{s=1}^{r} Y_{\cdot s}^\top W (\mathbf{1} - Y_{\cdot s}). \qquad (3)$$

However, minimizing the cut often returns unfavorable, unbalanced clusterings, where some clusters encompass only a few points.
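For concreteness, the cut of Equation (3) can be evaluated directly from a binary partition matrix Y and an edge weight matrix W. The following minimal NumPy sketch assumes both are given as dense arrays; it is a direct transcription of the formula, not an efficient or official implementation.

import numpy as np

def cut(Y, W):
    # cut(Y; W) = sum_s  Y[:, s]^T  W  (1 - Y[:, s]):
    # the total weight of edges leaving each cluster s.
    ones = np.ones(Y.shape[0])
    return sum(Y[:, s] @ W @ (ones - Y[:, s]) for s in range(Y.shape[1]))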