Density-Based Clustering Validation
Total Page:16
File Type:pdf, Size:1020Kb
to appear in: Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, 2014 Density-Based Clustering Validation Davoud Moulavi∗ Pablo A. Jaskowiak∗y Ricardo J. G. B. Campelloy Arthur Zimekz Jörg Sander∗ Abstract but also have to properly tune its parameters. Such One of the most challenging aspects of clustering is valida- choices are closely related to clustering validation, one of tion, which is the objective and quantitative assessment of the most challenging topics in the clustering literature. clustering results. A number of different relative validity As stated by Jain and Dubes [20], “without a strong criteria have been proposed for the validation of globular, effort in this direction, cluster analysis will remain a clusters. Not all data, however, are composed of globular black art accessible only to those true believers who have clusters. Density-based clustering algorithms seek partitions experience and great courage”. More striking than the with high density areas of points (clusters, not necessarily statement itself is the fact that it still holds true after globular) separated by low density areas, possibly contain- 25 years, despite all the progress that has been made. ing noise objects. In these cases relative validity indices pro- A common approach to evaluate the quality of clus- posed for globular cluster validation may fail. In this paper tering solutions involves the use of internal validity cri- we propose a relative validation index for density-based, ar- teria [20]. Many of such measures allow one to rank so- bitrarily shaped clusters. The index assesses clustering qual- lutions accordingly to their quality and are hence called ity based on the relative density connection between pairs of relative validity criteria. Because internal validity cri- objects. Our index is formulated on the basis of a new ker- teria measure clustering quality based solely on infor- nel density function, which is used to compute the density mation intrinsic to the data they have great practical of objects and to evaluate the within- and between-cluster appeal and numerous criteria have been proposed in density connectedness of clustering results. Experiments on the literature [24, 20, 30]. The vast majority of rela- synthetic and real world data show the effectiveness of our tive validity criteria are based on the idea of comput- approach for the evaluation and selection of clustering algo- ing the ratio of within-cluster scattering (compactness) rithms and their respective appropriate parameters. to between-cluster separation. Measures that follow this definition have been designed for the evaluation of 1 Introduction convex shaped clusters (e.g., globular clusters) and fail when applied to validate arbitrarily shaped, non-convex Clustering is one of the primary data mining tasks. clusters. They are also not defined for noise objects. Although there is no single consensus on the definition of Density-based clusters are found, e.g., in geograph- a cluster, the clustering procedure can be characterized ical applications, such as clusters of points belonging as the organization of data into a finite set of categories to rivers, roads, power lines or any connected shape in by abstracting their underlying structure, either by image segmentations [21]. Some attempts have been grouping objects in a single partition or by constructing made to develop relative validity measures for arbitrar- a hierarchy of partitions to describe data according to ily shaped clusters [25, 16, 8, 26, 36]. As we shall see, similarities or relationships among its objects [20, 12, however, these measures have serious drawbacks that 18]. Over the previous decades, different clustering limit their practical applicability. To overcome the lack definitions have given rise to a number of clustering of appropriate measures to the validation of density- algorithms, showing a significant field development. based clusters, we propose a measure called the Density- The variety of clustering algorithms, however, poses Based Clustering Validation index (DBCV). DBCV em- difficulties to users, who not only have to select the ploys the concept of Hartigan’s model of density-contour clustering algorithm best suited for a particular task, trees [18] to compute the least dense region inside a cluster and the most dense region between the clus- ∗Dept. of Computing Science, University of Alberta, Edmon- ters, which are used to measure the within and between- ton, AB, Canada, {moulavi, jaskowia, jsander}@ualberta.ca cluster density connectedness of clusters. yDept. of Computer Science, University of São Paulo, São Carlos, SP, Brazil, {pablo, campello}icmc.usp.br Our contributions are: (i) a new core distance def- zLudwig-Maximilians-Universität München, Munich, Ger- inition, which evaluates the density of objects w.r.t. many, [email protected] other objects in the same cluster; these distances are regions. Therefore, clustering results can contain arbi- also comparable to distances of objects inside the clus- trarily shaped clusters and noise, which such measures ter; (ii) a new relative validity measure, based on our cannot handle properly, considering their original def- concept of core distance, for the validation of arbitrarily inition. In spite of the extensive literature on relative shaped clusters (along with noise, if present) and; (iii) validation criteria, not much attention has been given to a novel approach that makes other relative validity cri- density-based clustering validation. Indeed, only a few teria capable of handling noise. preliminary approaches are described in the literature. The remainder of the paper is organized as follows. Trying to capture arbitrary shapes of clusters some In Section 2 we review previous attempts to tackle the authors have incorporated concepts from graph theory validation of density-based clustering results. In Sec- into clustering validation. Pal and Biswas [25] build tion 3 we define the problem of density-based clustering graphs (Minimum Spanning Trees) for each clustering validation and introduce our relative validity index. In solution, and use information from their edges to re- Section 4 we discuss aspects of the evaluation of relative formulate relative measures such as Dunn [10]. Al- validity measures and design an experimental setup to though the use of a graph can, in principle, capture ar- assess the quality of our index. The results of the em- bitrary cluster structures, the measures introduced by pirical evaluation are presented in Section 5. Finally, in the authors still compute compactness and separation Section 6, we draw the main conclusions. based on Euclidean distances, favoring globular clusters. Moreover, separation is still based on cluster centroids, 2 Related Work which is not appropriate for arbitrarily shaped clusters. One of the major challenges in clustering is the vali- Yang and Lee [33] employ a Proximity Graph to detect dation of its results, which is often described as one of cluster borders and develop tests to verify if a cluster- the most difficult and frustrating steps of cluster anal- ing is invalid, possibly valid or good. The problem with ysis [20, 24]. Clustering validation can be divided into this approach is that it does not result in a relative mea- three scenarios: external, internal, and relative [20]. sure. Moreover, the tests employed by the authors re- External clustering validity approaches such as the quire three different parameters from the user. Finally, Adjusted Rand Index [19] compare clustering results in both approaches [25, 33] graphs are obtained directly with a pre-existing clustering (or class) structure, i.e., from distances. No density concept is employed. a ground truth solution. Although disputably useful Occasionally, density-based concepts have been for algorithm comparison and evaluation [13], external used for clustering validation. Chou et al. [8] introduce measures do not have practical applicability, since, a measure that combines concepts from Dunn [10] and according to its definition, clustering is an unsupervised Davies&Bouldin [9] and is aimed to deal with clusters task, with no ground truth solution available a priori. of different densities. The measure cannot, however, In real world applications internal and relative va- handle arbitrary shapes. Pauwels and Frederix [26] in- lidity criteria are preferred, finding wide applicability. troduce a measure based on the notion of cluster homo- Internal criteria measure the quality of a clustering so- geneity and connectivity. Although their measure pro- lution using only the data themselves. Relative crite- vides interesting results, it has two critical parameters, ria are internal criteria able to compare two clustering e.g., the user has to set K in order to compute KNN structures and point out which one is better in rela- distances and obtain cluster homogeneity. The SD and tive terms. Although most external criteria also meet S_Dbw measures [17, 14] have similar definitions, based this requirement, the term relative validity criteria of- on concepts of scattering inside clusters and separation ten refers to internal criteria that are also relative. Such between clusters considering the variance and distance a convention is adopted hereafter. There are many rel- between centroids of the clusters. Both measures can- ative clustering validity criteria proposed in the litera- not, however, deal with arbitrarily shaped clusters given ture [30]. Such measures are based on the general