Scalable Parallel OPTICS Data Clustering Using Graph Algorithmic Techniques

Md. Mostofa Ali Patwary1,†, Diana Palsetia1, Ankit Agrawal1, Wei-keng Liao1, Fredrik Manne2, Alok Choudhary1
1Northwestern University, Evanston, IL 60208, USA
2University of Bergen, Norway
†Corresponding author: [email protected]

ABSTRACT
OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and PRIM's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering; I.5.3 [Pattern Recognition]: Clustering—Algorithms; H.2.8 [Database Management]: Database Applications; G.1.0 [Mathematics of Computing]: Numerical Analysis—Parallel Algorithms

General Terms
Algorithms, Experimentation, Performance, Theory

Keywords
Density-based clustering, Minimum spanning tree, Union-Find algorithm, Disjoint-set data structure

1. INTRODUCTION
Clustering is a data mining technique that groups data into meaningful subclasses, known as clusters, such that it minimizes the intra-differences and maximizes the inter-differences of these subclasses [21]. For the purpose of knowledge discovery, it identifies dense and sparse regions and therefore discovers overall distribution patterns and correlations in the data. Based on the data properties or the task requirements, various clustering algorithms have been developed. Well-known algorithms include K-means [35], K-medoids [41], BIRCH [55], DBSCAN [20], OPTICS [5, 29], STING [52], and WaveCluster [48]. These algorithms have been used in various scientific areas such as satellite image segmentation [38], noise filtering and outlier detection [11], unsupervised document clustering [50], and clustering of bioinformatics data [36]. Existing clustering algorithms have been roughly categorized as partitional, hierarchical, grid-based, and density-based [26, 27]. OPTICS (Ordering Points To Identify the Clustering Structure) is a hierarchical density-based clustering algorithm [5]. The key idea of density-based clustering algorithms such as OPTICS and DBSCAN is that, for each data point in a cluster, the neighborhood within a given radius (ε), known as the generating distance, has to contain at least a minimum number of points (minpts), i.e., the density of the neighborhood has to exceed some threshold [5, 20]. Additionally, OPTICS addresses DBSCAN's major limitation: the problem of detecting meaningful clusters in data of varying density.

OPTICS provides an overview of the cluster structure of a dataset with respect to density and contains information about every cluster level of the dataset. In order to do so, OPTICS generates a linear order of points in which spatially closest points become neighbors. Additionally, for each point a spatial distance (known as the reachability distance) is computed, which represents the density. Once the order and the reachability distances are computed using ε and minpts, we can query for the clusters that a particular value of ε′ (known as the clustering distance) would give, where ε′ ≤ ε. The query is answered in linear time.

One example application of OPTICS, which requires high performance computing, is finding halos and subhalos (clusters) in massive cosmology data in astrophysics [34]. Other application domains include analyzing satellite images, X-ray crystallography, and others [7]. However, OPTICS is challenging to parallelize as its data access pattern is inherently sequential, and to the best of our knowledge there has not been any effort yet to do so. Due to the similarities with DBSCAN, a natural choice for designing a parallel OPTICS could be one of the several master-slave based approaches [6, 13, 14, 17, 23, 54, 56]. However, in [44] we showed that these approaches incur high communication overhead between the master and the slaves, and low parallel efficiency. As an alternative to these approaches we presented a parallel DBSCAN algorithm based on the disjoint-set data structure, suitable for massive data sets [44]. However, this approach is not directly applicable to OPTICS, as DBSCAN produces a clustering result for a single set of density parameters, whereas OPTICS generates a linear order of the points that provides an overview of the cluster structure for a wide range of input parameters. One key difference between these two algorithms from the viewpoint of parallelization is that in DBSCAN, after processing a point, one can process its neighbors in parallel, whereas in OPTICS the processing of the neighbors follows a strict order.

To overcome this challenge, we develop a scalable parallel OPTICS algorithm (POPTICS) using graph algorithmic concepts. POPTICS exploits the similarities between OPTICS and PRIM's Minimum Spanning Tree (MST) algorithm [46] to break the sequential access of data points in the classical OPTICS algorithm. The main idea is that two points should be assigned to the same cluster if they are sufficiently close (and at least one of them has sufficiently many neighbors). This relationship is transitive, so a connected component of points should also be in the same cluster. If the distance bound is set sufficiently high, all vertices will be in the same cluster. As this bound is lowered, the cluster will eventually break apart, forming sub-clusters. This is modeled by calculating a minimum distance spanning tree on the graph using an initial (high) distance bound (ε). Then, to query the dataset for the clusters that an ε′ ≤ ε would give, one has only to remove the edges of the MST with weight more than ε′, and the remaining connected components give the clusters.

The idea of our POPTICS algorithm is as follows. Each processor computes an MST on its local dataset without incurring any communication. We then merge the local MSTs to obtain a global MST. Both steps are performed in parallel. Additionally, we extract the clusters directly from the global MST (without a linear order of the points) for any clustering distance ε′ by simply traversing the edges of the MST once in an arbitrary order, thus also enabling cluster generation in parallel using the parallel disjoint-set data structure [42]. POPTICS shows higher concurrency for data access while maintaining a comparable time complexity and quality with the classical OPTICS algorithm. We note that MST-based techniques have been applied previously in clustering, such as the single-linkage method that uses an MST to join clusters by the shortest distance between them [25]. [22] and [31] also make the connection between OPTICS and PRIM's MST construction.

OPTICS detects meaningful clusters in data of varying density by producing a linear order of points such that points which are spatially closest become neighbors in the order. OPTICS starts by adding an arbitrary point of a cluster to the order list and then iteratively expands the cluster by adding a point within the ε-neighborhood of a point in the cluster which is also closest to any of the already selected points. The process continues until the entire cluster has been added to the order. The process then moves on to the remaining clusters. Additionally, OPTICS computes the reachability distance for each point. This represents the required density in order to keep two points in the same cluster. Once the order and the reachability distances are computed, we can extract the clusters for any clustering distance ε′, where ε′ ≤ ε, in linear time. In the following we first define the notation used throughout the paper and then present a brief description of OPTICS based on [5].

Let X be the set of data points to be clustered. The neighborhood of a point x ∈ X within a given radius ε (known as the generating distance) is called the ε-neighborhood of x, denoted by Nε(x). More formally, Nε(x) = {y ∈ X | DISTANCE(x, y) ≤ ε, y ≠ x}, where DISTANCE(x, y) is the distance function. A point x ∈ X is referred to as a core point if its ε-neighborhood contains at least a minimum number of points (minpts), i.e., |Nε(x)| ≥ minpts. A point y ∈ X is directly density-reachable from x ∈ X if y is within the ε-neighborhood of x and x is a core point. A point y ∈ X is density-reachable from x ∈ X if there is a chain of points x1, x2, ..., xn, with x1 = x and xn = y, such that xi+1 is directly density-reachable from xi for all 1 ≤ i < n.

Figure 1: An example showing the core distance of x and the reachability distances of y and z with respect to x (minpts = 3).
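To make these definitions concrete, the following is a minimal C++ sketch of the ε-neighborhood query and the core-distance computation. It uses a brute-force scan and the common OPTICS convention that the core distance is the distance to the minpts-th closest neighbor inside Nε(x); the type and function names are illustrative assumptions, and the paper's actual implementation uses a kd-tree for the neighborhood search.

```cpp
// Minimal sketch of the epsilon-neighborhood and core-distance definitions.
// Brute-force O(n) neighborhood search; the paper uses a kd-tree instead.
// Names (Point, eps_neighborhood, core_distance) are illustrative.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// N_eps(x): indices of all points within radius eps of X[x], excluding x itself.
std::vector<int> eps_neighborhood(const std::vector<Point>& X, int x, double eps) {
    std::vector<int> N;
    for (int y = 0; y < (int)X.size(); ++y)
        if (y != x && dist(X[x], X[y]) <= eps) N.push_back(y);
    return N;
}

// Core distance: distance to the minpts-th closest neighbor within eps,
// or "undefined" (quiet NaN here) if x is not a core point.
double core_distance(const std::vector<Point>& X, int x,
                     const std::vector<int>& N, int minpts) {
    if ((int)N.size() < minpts) return std::numeric_limits<double>::quiet_NaN();
    std::vector<double> d;
    for (int y : N) d.push_back(dist(X[x], X[y]));
    std::nth_element(d.begin(), d.begin() + (minpts - 1), d.end());
    return d[minpts - 1];
}
```

A point whose core distance is undefined (NaN in this sketch) is not a core point and never seeds a cluster expansion.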

Figure 2: An example showing the similarities between OPTICS and PRIM's MST algorithm (minpts = 3). (a) The data points; (b) the ε-neighborhoods of points a, b, and c in OPTICS; and (c) the adjacent edges of vertices a, b, and c in PRIM's MST algorithm. As can be seen, starting from a, the processing order is the same in both cases, a → b → c → ..., as b and c are the closest point (vertex) of a and b, respectively.

Algorithm 4 Compute the Minimum Spanning Trees based on Prim's algorithm. Input: A set of points X and the input parameters ε and minpts. Output: The Minimum Spanning Trees, T.
1: procedure MOPTICS(X, ε, minpts, T)
2:   for each unprocessed point x ∈ X do
3:     mark x as processed
4:     N ← GETNEIGHBORS(x, ε)
5:     SETCOREDISTANCE(x, N, ε, minpts)
6:     if CD[x] ≠ NULL then
7:       MODUPDATE(x, N, P)
8:       while P ≠ empty do
9:         (u, v, w) ← EXTRACTMIN(P)
10:        T ← T ∪ (u, v, w)
11:        mark u as processed
12:        N′ ← GETNEIGHBORS(u, ε)
13:        SETCOREDISTANCE(u, N′, ε, minpts)
14:        if CD[u] ≠ NULL then
15:          MODUPDATE(u, N′, P)

The pseudocode of the first stage, which computes the MST and the core distances, is given in Algorithm 4 and denoted by MOPTICS. The algorithm is similar to the classical OPTICS (Algorithm 1), but instead of computing the reachability distances and the order, it computes the MST (similar to PRIM's Minimum Spanning Tree algorithm). Both algorithms compute the core distances. MOPTICS starts with an arbitrary point x ∈ X and retrieves its ε-neighborhood N using the GETNEIGHBORS function (Line 4). We then compute the core distance of x using the SETCOREDISTANCE function (Line 5). Note that we use the same GETNEIGHBORS and SETCOREDISTANCE functions as were used in the classical OPTICS algorithm (Algorithm 1). If x is not a core point, we continue to the next unprocessed point; otherwise, we add or update each unprocessed neighbor in N in a minimum priority queue, P, using a modified UPDATE function, named MODUPDATE. For each neighbor x′ ∈ N, MODUPDATE computes the reachability distance of x′ w.r.t. x and adds or decreases its value in P, depending on whether it already existed in P and whether the new reachability distance is smaller than the earlier computed one. Although ideally the UPDATE and MODUPDATE functions are identical, MODUPDATE additionally stores the source point (x in this case) from which the point x′ has been reached. This is done to achieve a memory-efficient parallel MOPTICS (discussed in the next section), as otherwise each process core would require a vector of reachability distances of length equal to the total number of points in X.

The next step (Lines 8-15) in Algorithm 4 is to process all the points in the priority queue P in the same way as discussed above for x (Lines 3-7). Additionally, we add the extracted triple (u, v, w) of Line 9 as an edge v → u with weight w (as point u was reached from v with reachability distance w) into the MST, T, in Line 10. Similar to classical OPTICS, while processing the points in P, more points might be added to P (until the entire cluster has been explored). Since a point can be inserted into P at most once throughout the computation, the edges in T do not form any cycle.

For each point x ∈ X, MOPTICS uses all the points in the ε-neighborhood N of x as the adjacent points (similar to PRIM's algorithm, which considers the adjacent vertices while processing a vertex), and the reachability distance of each point x′ ∈ N from x is considered as the weight of the edge x → x′ in the MST. The complexity of MOPTICS is O(n log n), identical to OPTICS.

Note that, given the MST, one can easily compute the order and the reachability distances from the MST produced by MOPTICS in linear time (w.r.t. the number of edges in the MST). The idea is to rerun the MOPTICS algorithm, but instead of computing the core distances (which are already computed), we compute the reachability distances and the order using only the edges in the MST. This process also does not need the expensive ε-neighborhood query (the GETNEIGHBORS function), as the neighbors can be found from the MST itself. This computation takes only a fraction of the time taken by OPTICS and MOPTICS. Due to space considerations, we only present the results in the experiments section.

We now present the second stage of our MST-based OPTICS algorithm, named MSTTOCLUSTERS, which extracts the clusters directly from the MST for any clustering distance ε′, where 0 ≤ ε′ ≤ ε. This is achieved by removing any edge x → y with weight w > ε′ (hence RD(y) > ε′) from the MST and returning the points in the remaining connected components as the clusters. The MSTTOCLUSTERS function uses the disjoint-set data structure [24, 43] for this purpose. Below we briefly introduce the data structure and how it works.

The disjoint-set data structure defines a mechanism to maintain a dynamic collection of non-overlapping sets of points. The data structure comes with two main operations: FIND and UNION. The FIND operation determines to which set a given point belongs, while the UNION operation joins two existing sets. Each set is identified by a representative, x, which is usually some point of the set. The underlying data structure of each set is typically a rooted tree represented by a parent pointer PAR(x) for each point x ∈ X; the root satisfies PAR(x) = x and is the representative of the set. The output of the FIND(x) operation is the root of the tree containing x. UNION(x, y) merges the two trees containing x and y by changing the parent pointer of one root to the other. To do this, the UNION(x, y) operation first calls two FIND operations, FIND(x) and FIND(y). If they return the same root (i.e., x and y are in the same set), no merging is required. But if the returned roots are different, say rx and ry, the UNION operation sets PAR(rx) = ry or PAR(ry) = rx. Note that this definition of the UNION operation is slightly different from its standard definition, which requires that x and y belong to two different sets before calling UNION. There exist many different techniques to improve the performance of the UNION operation. In this paper we have used the empirically best-known UNION technique (a lower-indexed root points to a higher-indexed root), known as REM's algorithm, with the splicing compression technique. Details on these can be found in [43].

The pseudocode of MSTTOCLUSTERS is given in Algorithm 5. The idea is that two vertices u and v connected by an edge with weight w belong to the same cluster if w ≤ ε′. For each point x ∈ X, MSTTOCLUSTERS starts by creating a new set, setting the parent pointer of x to itself (Lines 2-3). We then go through each edge (u, v, w) in the MST, T (Lines 4-6), and check whether the edge weight w ≤ ε′. If so, then u is density-reachable from v with reachability distance w, and v is a core point with core distance CD[v] ≤ ε′, as CD[v] ≤ RD[u]. Therefore, u and v should belong to the same cluster, and we perform a UNION operation (Line 6) on the trees containing u and v. At the end of the MSTTOCLUSTERS algorithm, a singleton tree containing only one point is a NOISE point, whereas all points in a tree of size greater than one belong to the same cluster.

Algorithm 5 The MSTTOCLUSTERS function. Input: The Minimum Spanning Trees, T, and the clustering distance, ε′. Output: Clusters in CID.
1: procedure MSTTOCLUSTERS(T, ε′)
2:   for each point x ∈ X do
3:     PAR(x) ← x
4:   for each edge (u, v, w) ∈ T do
5:     if w ≤ ε′ then
6:       UNION(u, v)
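As an illustration of Algorithm 5, the sketch below pairs a plain parent-pointer disjoint-set structure with the edge scan just described. It uses simple index-based linking rather than REM's algorithm with splicing compression, and all identifiers are illustrative rather than the authors' code.

```cpp
// Sketch of the disjoint-set data structure and the MSTTOCLUSTERS idea
// (Algorithm 5). Simple index-based union; REM's algorithm with splicing
// compression, used in the paper, is omitted here for brevity.
#include <tuple>
#include <vector>

struct DisjointSets {
    std::vector<int> par;                       // PAR(x); a root satisfies par[x] == x
    explicit DisjointSets(int n) : par(n) {
        for (int x = 0; x < n; ++x) par[x] = x;
    }
    int find(int x) const {                     // FIND(x): walk to the root
        while (par[x] != x) x = par[x];
        return x;
    }
    bool unite(int x, int y) {                  // UNION(x, y): returns true if merged
        int rx = find(x), ry = find(y);
        if (rx == ry) return false;             // already in the same set
        if (rx < ry) par[rx] = ry; else par[ry] = rx;  // lower root points to higher
        return true;
    }
};

// MST edge (u, v, w): w is the reachability distance of u when reached from v.
using Edge = std::tuple<int, int, double>;

// MSTTOCLUSTERS: points ending up in the same tree belong to the same cluster;
// singleton trees are noise. The edges may be processed in any order.
DisjointSets mst_to_clusters(int n, const std::vector<Edge>& T, double eps_prime) {
    DisjointSets ds(n);
    for (const auto& [u, v, w] : T)
        if (w <= eps_prime) ds.unite(u, v);
    return ds;
}
```

After the scan, find(x) gives the cluster representative of x; points left alone in their tree are noise. Because each edge is handled independently, the loop over T can run in any order, which is what makes this stage easy to parallelize with a concurrent UNION.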
THEOREM 3.1. Algorithm 3 (ORDERTOCLUSTERS) and Algorithm 5 (MSTTOCLUSTERS) produce identical clusters.

Due to space considerations, we only outline the proof. Given the order of points O generated by OPTICS, ORDERTOCLUSTERS adds a point u to a cluster if the reachability distance RD[u] ≤ ε′. This implies that u is reachable from a core point v with core distance CD[v] ≤ ε′, as CD[v] ≤ RD[u]. Any point following u in O with reachability distance less than or equal to ε′ belongs to the same cluster, S; therefore, ORDERTOCLUSTERS keeps adding such points to S. MOPTICS does not have the order, but it stores the reachability distance of u as an edge weight w, along with the point v from which u has been reached, in the MST. Therefore, for each edge, if the edge weight w ≤ ε′ (thus RD[u] ≤ ε′), then u and v must belong to the same cluster, which is exactly what MSTTOCLUSTERS does.

Note that since MSTTOCLUSTERS can process the edges in T in an arbitrary order, it is highly parallel in nature. However, this stage takes only a fraction of the time taken by OPTICS and MOPTICS, as will be shown in the experiments section. We therefore omit the discussion of the parallelization of MSTTOCLUSTERS and only present how to parallelize the first stage, MOPTICS. It is worth noting that parallelization of MSTTOCLUSTERS can be achieved easily using our prior work on the PARALLELUNION algorithm, for both shared and distributed memory computers [37, 42]. The idea behind the PARALLELUNION operation on a shared memory computer is that the algorithm uses a separate lock for each point. A thread wishing to set the parent pointer of a root r1 to r2 during a UNION operation has to acquire r1's lock before doing so. More details on parallel UNION using locks can be found in [42]. In distributed memory computers, as the memory is not shared among the processors, message passing is used instead of locks and only the owner of a point is allowed to change its parent pointer. Details are available in [37].

4. THE PARALLEL OPTICS ALGORITHM
We parallelize the OPTICS algorithm by exploiting ideas for how to compute Minimum Spanning Trees in parallel. The key idea of our parallel MST-based OPTICS (denoted by POPTICS) is that each process core first runs the sequential MOPTICS algorithm (Algorithm 4) on its local data points to compute local MSTs in parallel. We then perform a parallel merge of the local MSTs to obtain the global MST. Note that similar ideas have been used in the graph setting for shared address space and GPUs [40, 47].

4.1 POPTICS on Shared Memory
The details of parallel OPTICS on shared memory parallel computers (denoted by POPTICSS) are given in Algorithm 6. The data points X are divided into p partitions {X1, X2, ..., Xp} (one for each of the p threads running in parallel) and each thread t owns partition Xt. We divide the algorithm POPTICSS into two segments, local computation (Lines 2-17) and merging (Lines 18-23).

Algorithm 6 The parallel OPTICS algorithm on a shared memory computer (POPTICSS) using p threads. Input: A set of points X and the input parameters ε and minpts. Let X be divided into p equal disjoint partitions {X1, X2, ..., Xp}, each assigned to one of the p threads. Output: The Minimum Spanning Trees, T.
1: procedure POPTICSS(X, ε, minpts, T)
2:   for t = 1 to p in parallel do            ▷ Stage: Local computation
3:     for each unprocessed point x ∈ Xt do
4:       mark x as processed
5:       N ← GETNEIGHBORS(x, ε)
6:       SETCOREDISTANCE(x, N, ε, minpts)
7:       if CD[x] ≠ NULL then
8:         MODUPDATE(x, N, Pt)
9:         while Pt ≠ empty do
10:          (u, v, w) ← EXTRACTMIN(Pt)
11:          Q ← INSERT(u, v, w)              ▷ in critical
12:          if u ∈ Xt then
13:            mark u as processed
14:            N′ ← GETNEIGHBORS(u, ε)
15:            SETCOREDISTANCE(u, N′, ε, minpts)
16:            if CD[u] ≠ NULL then
17:              MODUPDATE(u, N′, Pt)
18:   for each point x ∈ X in parallel do     ▷ Stage: Merging
19:     PAR(x) ← x
20:   while Q ≠ empty do
21:     (u, v, w) ← EXTRACTMIN(Q)
22:     if UNION(u, v) = TRUE then
23:       T ← T ∪ (u, v, w)

Local computation is similar to sequential MOPTICS, but each thread t processes only its own data points Xt instead of X and also operates on its own priority queue, Pt, while processing the neighbors. Additionally, when we extract a point u (reached from v with reachability distance w, Line 10), we do two things: (i) we add the edge (u, v, w) into a minimum priority queue, Q (Line 11), shared among the threads (and therefore a critical statement), for further processing in the merging stage; and (ii) we check whether u ∈ Xt (Line 12) to make sure that thread t processes only its own points, as the GETNEIGHBORS function (Lines 5 and 14) returns both local and non-local points since all points reside in the commonly accessible shared memory.

It is very likely that the union of the local MSTs found by the threads in the local computation contains redundant edges (in particular those connecting local and non-local points), thus giving rise to cycles. We therefore need one additional merging stage (Lines 18-23) to compute the global MST from the local MSTs stored in Q. One can achieve this using either PRIM's MST algorithm [46] or KRUSKAL's MST algorithm [30]. However, PRIM's algorithm requires an additional computation to obtain the adjacency list from the edges in Q. We therefore use KRUSKAL's algorithm in the merging stage. KRUSKAL's algorithm also needs the edges in Q to be in sorted order, but this is obtained as a by-product, since Q is a priority queue and the edges were inserted in a critical statement. KRUSKAL's algorithm uses the disjoint-set data structure [18, 51] as discussed above. It first creates a new set for each point x ∈ X (Lines 18-19) in parallel. It then iteratively extracts the top edge (u, v, w) from Q and tries to perform a UNION operation on the sets containing u and v. If they are already in the same set, we continue to the next edge. Otherwise, the UNION operation returns TRUE and we add the edge (u, v, w) into the global MST, T. Although the merging stage in the shared memory implementation is mostly sequential, it takes only a very small portion of the total time taken by POPTICSS. This is because merging does not require any communication and operates only on the edges in the local MSTs.
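The following OpenMP skeleton mirrors the two-stage structure of Algorithm 6: per-thread local MST construction (stubbed out here, since it stands for the MOPTICS procedure of Algorithm 4), critical-section insertion of the local edges into a shared priority queue, and a sequential Kruskal-style merge using parent pointers. It is a sketch under those assumptions, not the authors' implementation.

```cpp
// OpenMP sketch of the two-stage structure of POPTICSS (Algorithm 6).
// local_mst() is a stand-in for the per-thread MOPTICS run of Algorithm 4;
// it is stubbed out so the skeleton compiles on its own (-fopenmp).
// All identifiers are illustrative.
#include <omp.h>
#include <algorithm>
#include <numeric>
#include <queue>
#include <vector>

struct Edge { int u, v; double w; };
struct ByWeight { bool operator()(const Edge& a, const Edge& b) const { return a.w > b.w; } };

// Stand-in for MOPTICS on one partition: returns the local MST edges of X_t.
std::vector<Edge> local_mst(int /*thread_id*/) { return {}; }

std::vector<Edge> poptics_shared(int n_points, int n_threads) {
    // Shared min-priority queue Q of candidate edges (Line 11 of Algorithm 6).
    std::priority_queue<Edge, std::vector<Edge>, ByWeight> Q;

    // Stage 1: local computation, one partition per thread, no interaction.
    #pragma omp parallel num_threads(n_threads)
    {
        int t = omp_get_thread_num();
        std::vector<Edge> T_t = local_mst(t);
        for (const Edge& e : T_t) {
            #pragma omp critical            // shared queue => critical insertion
            Q.push(e);
        }
    }

    // Stage 2: Kruskal-style merge (sequential, but cheap in practice).
    std::vector<int> par(n_points);
    std::iota(par.begin(), par.end(), 0);
    auto find = [&](int x) { while (par[x] != x) x = par[x]; return x; };

    std::vector<Edge> T;                     // global MST
    while (!Q.empty()) {
        Edge e = Q.top(); Q.pop();           // edges come out in sorted order
        int ru = find(e.u), rv = find(e.v);
        if (ru == rv) continue;              // would create a cycle
        par[std::min(ru, rv)] = std::max(ru, rv);
        T.push_back(e);
    }
    return T;
}
```

In the actual algorithm the edges are pushed into Q while the local priority queue is being drained rather than after the local MST is complete; the skeleton defers the pushes only to keep the example short.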
Thus, given the global MST, one can compute the clusters for any clustering distance ε′ using the MSTTOCLUSTERS function (Algorithm 5). Note that the MST computed by parallel OPTICS, POPTICSS, could be different from the MST computed by the sequential MST-based OPTICS, MOPTICS. This is mainly because of the following three reasons: (i) the processing sequence of the points, along with the starting points, is different; (ii) the reachability distances between two neighboring points x and y are not symmetric (although RD[x] w.r.t. y can be computed using RD[y] w.r.t. x and CD[y]); and (iii) a border point (neither a core point nor a noise point, but one that falls within the ε-neighborhood of a core point) lying within the boundary of two clusters will be taken by the first cluster which reaches it. Therefore, to evaluate the quality of the clusters obtained from POPTICSS compared to the clusters given by classical OPTICS, we employ a well-known metric called the Omega-Index [16], designed for comparing clustering solutions [39, 53].

The Omega-Index is based on how the pairs of points have been clustered. Two solutions are in agreement on a pair of points if they put both points into the same cluster or each into a different cluster. The Omega-Index is then computed as the observed agreement adjusted by the expected agreement, divided by the maximum possible agreement adjusted by the expected agreement. The score ranges from 0 to 1, where a value of 1 indicates that the two solutions match. More details can be found in [16, 39, 53].

4.2 POPTICS on Distributed Memory
The details of parallel OPTICS on distributed memory parallel computers (denoted by POPTICSD) are given in Algorithm 7. Similar to POPTICSS and traditional parallel algorithms, we assume that the data points X have been equally partitioned into p partitions {X1, X2, ..., Xp} (one for each processor) and each processor t owns Xt only. A point x is a local point on processor t if x ∈ Xt; otherwise x is a remote point. Since the memory is distributed, any partition Xi ≠ Xt, 1 ≤ i ≤ p, is invisible to processor t (in contrast to POPTICSS, which uses shared memory). We therefore need the GETLOCALNEIGHBORS (Line 4) and GETREMOTENEIGHBORS (Line 5) functions to get the local and remote points, respectively. Note that retrieving remote points requires communication with other processors. Instead of calling GETREMOTENEIGHBORS for each local point during the computation, we take advantage of the ε parameter and gather all possible remote neighbors in one step before the start of the algorithm. In the OPTICS algorithm, for any given point x we are only interested in the neighbors that fall within the generating distance ε of x. Therefore, we extend the bounding box of Xt by a distance ε in every direction in each dimension and query the other processors with the extended bounding box to return their local points that fall in it. Thus, each processor t has a copy of the remote points, X′t, that it requires for its computation. We consider this step as a preprocessing step (named gather-neighbors). Our experiments show that gather-neighbors takes only a limited time compared to the total time. Thus, the GETREMOTENEIGHBORS function returns the remote points from the local copy, X′t, without communication.

Algorithm 7 The parallel OPTICS algorithm on a distributed memory computer (POPTICSD) using p processors. Input: A set of points X and the input parameters ε and minpts. Let X be divided into p equal disjoint partitions {X1, X2, ..., Xp} for the p running processors. Each processor t also has a set of remote points, X′t, stored locally to avoid communication during local computation. Output: The Minimum Spanning Trees, T.
1: procedure POPTICSD(X, ε, minpts)
2:   for each unprocessed point x ∈ Xt do    ▷ Stage: Local computation
3:     mark x as processed
4:     Nl ← GETLOCALNEIGHBORS(x, ε)
5:     Nr ← GETREMOTENEIGHBORS(x, ε)
6:     N ← Nl ∪ Nr
7:     SETCOREDISTANCE(x, N, ε, minpts)
8:     if CD[x] ≠ NULL then
9:       MODUPDATE(x, N, Pt)
10:      while Pt ≠ empty do
11:        (u, v, w) ← EXTRACTMIN(Pt)
12:        Qt ← INSERT(u, v, w)
13:        if u ∈ Xt then
14:          mark u as processed
15:          N′l ← GETLOCALNEIGHBORS(u, ε)
16:          N′r ← GETREMOTENEIGHBORS(u, ε)
17:          N′ ← N′l ∪ N′r
18:          SETCOREDISTANCE(u, N′, ε, minpts)
19:          if CD[u] ≠ NULL then
20:            MODUPDATE(u, N′, Pt)
21:   round ← log2(p) − 1                    ▷ Stage: Merging
22:   for i = 0 to round do
23:     if t mod 2^i = 0 then                ▷ check if t participates
24:       if t mod 2^(i+1) = 0 then          ▷ t is receiver
25:         t′ ← t + 2^i                     ▷ t′ is the sender
26:         receive Qt′ from Pt′
27:         Qt ← KRUSKAL(Qt ∪ Qt′)
28:       else
29:         t′ ← t − 2^i                     ▷ t′ is receiver
30:         send Qt to Pt′                   ▷ t is sender

Similar to POPTICSS, POPTICSD also has two stages, local computation (Lines 2-20) and merging (Lines 21-30). During the local computation we compute the local neighbors, Nl (Line 4), and the remote neighbors, Nr (Line 5), for each point x. Based on these we then compute the core distance using the SETCOREDISTANCE function. The rest of the local computation is similar to POPTICSS, except that the tree edges extracted from the priority queue Pt (Line 11) are inserted into a local priority queue, Qt, instead of the shared queue Q used in POPTICSS.

As the local MSTs found by the local computation are distributed among the process cores, we need to gather and merge them to remove any redundant edges (thus breaking cycles) and obtain the global MST. To do this, we perform a pairwise-merging in the merging stage (Lines 21-30). To simplify the explanation, we assume that p is a power of 2. The merging stage then runs in log2(p) rounds (Line 22), following the structure of a binary tree with p leaves. In each merging operation, the edges of two local MSTs are gathered on one processor, which then computes a new local MST using KRUSKAL's algorithm on the gathered edges. After the last round of merging, the global MST is stored on processor 0.

We use KRUSKAL's algorithm in the merging stage even though BORUVKA's MST algorithm [15] is inherently more parallel than KRUSKAL's. This is because BORUVKA's algorithm requires edge contractions, which in distributed memory would require more communication, especially when the contracted edge spans different processors. Since we get an ordering of the edges as a by-product of the main algorithm, KRUSKAL's algorithm is more competitive for the merging stage.

For the merging stage we also implemented a couple of variations to improve performance and memory scalability, of which we only give an outline due to space considerations. In each of the log2 p rounds of merging, we traverse the entire graph (the local MSTs) once, so the overhead is proportional to the number of rounds. We therefore tried to terminate the pairwise-merging after the first few rounds and then gather the rest of the merged local MSTs on process core 0 to compute the global MST. Another technique we implemented was to exclude edges from the local MSTs during the pairwise-merging using BORUVKA's concept [15]: for each point x, the lightest edge connecting x will definitely be part of the global MST. We also considered a hybrid version of these approaches.
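The sketch below illustrates the pairwise-merging pattern of Lines 21-30 of Algorithm 7 with MPI point-to-point messages: in round i, ranks that are multiples of 2^(i+1) receive their partner's edge list and re-run a Kruskal merge on the union, while the partners send their edges and drop out, leaving the global MST on rank 0. The flat (u, v, w)-as-doubles serialization and the kruskal_merge stub are assumptions of this sketch, not the paper's implementation.

```cpp
// MPI sketch of the pairwise merging stage of POPTICSD (Algorithm 7,
// Lines 21-30). Each edge is shipped as three doubles (u, v, w);
// kruskal_merge() stands in for the Kruskal pass shown earlier.
// Assumes the number of ranks p is a power of two. Illustrative only.
#include <mpi.h>
#include <cmath>
#include <vector>

struct Edge { int u, v; double w; };

// Stand-in: merge an edge list into an MST with Kruskal + disjoint sets.
std::vector<Edge> kruskal_merge(std::vector<Edge> edges) { return edges; }

std::vector<Edge> pairwise_merge(std::vector<Edge> Q_t, MPI_Comm comm) {
    int t, p;
    MPI_Comm_rank(comm, &t);
    MPI_Comm_size(comm, &p);
    int rounds = (int)std::log2((double)p);

    for (int i = 0; i < rounds; ++i) {
        int step = 1 << i;
        if (t % step != 0) break;                  // this rank is no longer active
        if (t % (2 * step) == 0) {                 // receiver: partner is t + step
            int src = t + step, count = 0;
            MPI_Recv(&count, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
            std::vector<double> buf(3 * count);
            MPI_Recv(buf.data(), 3 * count, MPI_DOUBLE, src, 1, comm, MPI_STATUS_IGNORE);
            for (int k = 0; k < count; ++k)
                Q_t.push_back({(int)buf[3 * k], (int)buf[3 * k + 1], buf[3 * k + 2]});
            Q_t = kruskal_merge(std::move(Q_t));   // new local MST of the pair
        } else {                                   // sender: partner is t - step
            int dst = t - step, count = (int)Q_t.size();
            std::vector<double> buf;
            buf.reserve(3 * count);
            for (const Edge& e : Q_t) { buf.push_back(e.u); buf.push_back(e.v); buf.push_back(e.w); }
            MPI_Send(&count, 1, MPI_INT, dst, 0, comm);
            MPI_Send(buf.data(), 3 * count, MPI_DOUBLE, dst, 1, comm);
            Q_t.clear();                           // global MST ends up on rank 0
        }
    }
    return Q_t;   // non-empty only on rank 0 after the last round
}
```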
5. EXPERIMENTAL RESULTS
We first present the experimental setup used for both the sequential and the shared memory OPTICS algorithms. The setup for the distributed memory algorithm is presented later.

For the sequential and shared memory experiments we used a Dell computer running GNU/Linux, equipped with four 2.00 GHz Intel Xeon E7-4850 processors and a total of 128 GB of memory. Each processor has ten cores. Each of the 40 cores has 48 KB of L1 and 256 KB of L2 cache, and each processor (10 cores) shares a 24 MB L3 cache. All algorithms were implemented in C++ using OpenMP and compiled with gcc (version 4.7.2) using the -O2 flag.

Our testbed consists of 18 datasets, divided into three categories, each with six datasets. The first category, called real-world, representing alphabets and textures for information retrieval, has been collected from Chameleon (t4.8k, t5.8k, t7.10k, and t8.8k) [2] and CUCIS (edge and texture) [1]. The other two categories, synthetic-random and synthetic-cluster, have been generated synthetically using the IBM synthetic data generator [4, 45]. In the synthetic-random datasets (r50k, r100k, r500k, r1m, r1.5m, and r1.9m), the points in each dataset have been generated uniformly at random. In the synthetic-cluster datasets (c50k, c100k, c500k, c1m, c1.5m, and c1.9m), a specific number of random points are first taken as different clusters, and points are then added randomly to these clusters. The testbed contains up to 1.9 million data points, and each data point is a vector of up to 20 dimensions. Table 1 shows the structural properties of the datasets. In the experiments, the two input parameters (ε and minpts) shown in the table have been chosen carefully to obtain the order, core distances, and reachability distances in a reasonable time. A higher value of ε increases the time taken by the experiments while reducing the number of clusters and noise points; a higher value of minpts increases the noise count. We also select a clustering distance ε′ to extract a fair number of clusters and noise points from the order or the MST.

Table 1: Structural properties of the testbed (real-world, synthetic-random, and synthetic-cluster) and the time taken by the OPTICS and MOPTICS algorithms. d denotes the dimension of each point. The last three columns show the resulting number of clusters and noise points for a sample extraction with clustering distance ε′.

Name     Points     d   ε    minpts   OPTICS (s)   MOPTICS (s)   ε′    Clusters   Noise
t4.8     24,000     2   30   20       1.27         1.51          10    18         2,026
t5.8     24,000     2   30   20       1.98         2.40          10    27         1,852
t7.10    30,000     2   30   20       1.18         1.18          10    84         4,226
t8.8     24,000     2   30   20       1.14         1.23          10    165        9,243
edge     336,205    18  2    3        569.53       574.03        1.5   1,368      83,516
texture  371,595    20  2    3        1,124.93     1,183.44      1.5   2,058      310,925
r50k     50,000     10  120  5        8.47         8.82          100   1,058      42,481
r100k    100,000    10  120  5        22.31        23.89         100   883        33,781
r500k    500,000    10  120  5        694.71       757.77        100   2          1,161
r1m      1,000,000  10  80   5        813.15       835.65        60    10,351     948,867
r1.5m    1,500,000  10  80   5        2,479.70     2,598.80      60    7,809      326,867
r1.9m    1,900,000  10  80   5        3,680.21     3,792.77      60    8,387      368,917
c50k     50,000     10  120  5        11.52        14.05         25    51         3,095
c100k    100,000    10  120  5        22.49        27.57         25    103        6,109
c500k    500,000    10  120  5        119.99       142.80        25    512        36,236
c1m      1,000,000  10  80   5        226.11       275.97        25    1,022      64,740
c1.5m    1,500,000  10  80   5        331.07       405.12        25    1,543      102,818
c1.9m    1,900,000  10  80   5        431.67       526.30        25    1,949      135,948

Figure 3: Performance of (a) the construction of the kd-tree, (b) MOPTICS (Algorithm 4), (c) extracting the clusters, and (d) computing the order from the MST, compared to OPTICS. Panels: (a) timing distribution, (b) comparing MOPTICS, (c) extracting clusters, (d) order from MST.

5.1 OPTICS vs. MOPTICS
As discussed in Section 2, to reduce the running time of the OPTICS algorithm from O(n²) to O(n log n), spatial indexing (a kd-tree [28] or an R*-tree [8]) is commonly used [9]. In all of our implementations we used kd-trees [28] and therefore obtain the reduced time complexities. Moreover, the kd-tree gives a geometric partitioning of the data points, which we use to divide the data points equally among the cores in the parallel OPTICS algorithm. However, there is an overhead in constructing the kd-tree before running the OPTICS algorithms. Figure 3(a) shows a comparison of the time taken by the construction of the kd-tree over the OPTICS algorithm, in percent, for the synthetic-cluster datasets. As can be seen, constructing the kd-tree takes only a fraction of the time (0.33% to 0.68%) taken by the OPTICS algorithm. We found similar results for the synthetic-random datasets (0.07% to 0.93%). However, these ranges are somewhat higher (0.06% to 3.23%) for the real-world datasets. This is because each real-world dataset consists of a small number of points, and therefore the OPTICS algorithm takes less time compared to the other two categories. It should be noted that we have not parallelized the construction of the kd-tree in this paper; we therefore do not consider the timing of the kd-tree construction in the following discussion.

Figure 3(b) presents the performance of MOPTICS (Algorithm 4) compared to OPTICS (Algorithm 1) on the synthetic-random datasets. The raw run-times taken by MOPTICS and OPTICS are provided in Table 1. As can be seen, MOPTICS takes on average 5.16% (range 2.77%-9.08%) more time than OPTICS on the synthetic-random datasets. This is because MOPTICS stores the reachability distances in a map (instead of a vector as in OPTICS), and therefore retrieving the distances to update takes additional time. In OPTICS this is done in constant time, as it uses a vector (Line 7 in Algorithm 2). This additional computation is needed in the parallel OPTICS algorithms, since otherwise each core would require a vector of length equal to the total number of points among the cores. We observe similar performance for the real-world datasets (average 9.09%, range 0.12%-21.34%). This value is higher for the synthetic-cluster datasets, with an average of 21.64% (range 19.01%-22.59%), as we observed that the number of neighbor updates is much higher compared to the other categories.

Figure 3(c) shows the time taken to extract the clusters from the order and from the MST (denoted by order-cls and mst-cls, respectively) compared to the classical OPTICS algorithm for the synthetic-cluster datasets. The clustering distances ε′ used in the experiments can be found in Table 1. As can be seen, both order-cls and mst-cls take a small fraction of the time (maximum 0.01% and 0.05%, respectively) compared to OPTICS, and are comparable to each other. The relative maximum time spent in order-cls is unchanged for the synthetic-random and real-world datasets, while the maximum time spent in mst-cls is 0.04% and 0.19%, respectively. Note that OPTICS and MOPTICS follow the same processing order of points. Therefore, for any clustering distance ε′, the resulting clusters are identical and thus the corresponding Omega-Index is 1.

Figure 3(d) shows the time taken to compute the order and reachability distances from the MST computed by MOPTICS, compared to the classical OPTICS algorithm, for the synthetic-random datasets. As mentioned before, this step takes only a fraction of the time (on average 0.47%, 0.87%, and 1.65% on the synthetic-random, synthetic-cluster, and real-world datasets, respectively), as it only traverses the edges in the MST once.
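The re-computation timed in Figure 3(d) needs only the MST: a Prim-style traversal restricted to the tree edges emits the points in order and records, for each point, the weight of the edge through which it was first reached as its reachability distance. A minimal sketch with illustrative names:

```cpp
// Sketch: recover the OPTICS order and reachability distances from the MST
// alone (no epsilon-neighborhood queries), as discussed for Figure 3(d).
// A Prim-style traversal restricted to MST edges; names are illustrative.
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int u, v; double w; };

void order_from_mst(int n, const std::vector<Edge>& T,
                    std::vector<int>& order, std::vector<double>& reach) {
    std::vector<std::vector<std::pair<int, double>>> adj(n);
    for (const Edge& e : T) {                     // treat the MST as undirected
        adj[e.u].push_back({e.v, e.w});
        adj[e.v].push_back({e.u, e.w});
    }
    const double UNDEF = std::numeric_limits<double>::infinity();
    reach.assign(n, UNDEF);
    std::vector<char> done(n, 0);
    order.clear();

    using Item = std::pair<double, int>;          // (reachability, point)
    for (int s = 0; s < n; ++s) {                 // one pass per connected component
        if (done[s]) continue;
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
        pq.push({UNDEF, s});
        while (!pq.empty()) {
            int x = pq.top().second; pq.pop();
            if (done[x]) continue;
            done[x] = 1;
            order.push_back(x);                   // next point in the linear order
            for (auto [y, w] : adj[x])
                if (!done[y] && w < reach[y]) {   // smaller reachability wins
                    reach[y] = w;
                    pq.push({w, y});
                }
        }
    }
}
```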

!" !" the clusters found by POPTICSS and classical OPTICS match per- #" #!" $!" %!" &!" #" #!" $!" %!" &!" fectly. Increasing the number of process cores leads to marginal -./)0" ,-.)/" reduction in the Omega-Index, suggesting that the resulting clus- (a) Local computation (real-world) (b) Total Time (real-world) ters are almost identical compared to ones obtained by the classical OPTICS algorithm. Similarly, varying the clustering distance, ε￿, &!" .0!1" .#!!1" &!" .0!1" .#!!1" .0!!1" .#2" .0!!1" .#2" (the rightmost three bars in Figure 5(b) representing ε￿ =25, 30, .#302" .#342" %!" .#302" .#342" %!" and 35, respectively) keeping the number of cores fixed (30), we also found the Omega-Index close to 1. However, the computation $!" $!" cost for calculating the Omega-Index is high [16] and the available '())*+(" '())*+("

#!" #!" source code [3] we use for this is only capable of dealing with small datasets. We therefore report these numbers for the smallest dataset !" !" from each of the three categories. #" #!" $!" %!" &!" #" #!" $!" %!" &!" ,-.)/" ,-.)/" '"!!!% &"!# (c) Local computation (syn.-rand) (d) Total time (syn.-rand) %"$# !"###% &!" 01!2" 0#!!2" &!" 01!2" 0#!!2" 01!!2" 0#3" 01!!2" 0#3" %"!# 0#413" 0#453" 0#413" 0#453" %!" %!"

>?1.)@A-21B% !"##&% !"$# )*+,-.*#/.*#0#12341# 32.5674/28#/.*#9:;# $!" $!" !"!# '())*+(" '())*+(" %# %!# &!# '!# (!# !"##$% #!" #!" ()*+,-.%/0*1)23%%%%%%%%%%%%%%%%%%()*+,-.%:;<3/1*,-.% <2+*=# 4'5%'!5%6!5%7!5%8!9% 2,3/)-:1%46=5%7!5%7=9%

!" !" (a) Timing distribution (b) Qualities of the clusters (c50k) #" #!" $!" %!" &!" #" #!" $!" %!" &!" ,-.)/" ,-.)/" Figure 5: (a) Timing distribution for varying number of cores on r1.9m and (b) Comparing qualities of the clustering using (e) Local computation (syn.-clus) (f) Total time (syn.-clus) Omega-Index for varying number of cores and varying cluster- Figure 4: Speedup of parallel OPTICS (Algorithm 6, denoted by ing distance, ei on c50k POPTICSS) on a shared memory computer. Left column: Lo- cal computation in POPTICSS. Right column: Total time (local computation + merging) in POPTICSS. 5.3 POPTICS on Distributed Memory To perform the experiments for parallel OPTICS on a distributed memory computer (POPTICSD), we use Hopper, a Cray XE6 dis- behavior of just the local computation is nearly identical to that tributed memory parallel computer where each node has two twelve- of the overall execution. Note that the speedups for some real- core AMD MagnyCours 2.1-GHz processors and shares 32 GB of world datasets in Figure 4(a) and 4(b) saturate at around 20 process memory. Each core has its own 64 KB L1 and 512 KB L2 caches. cores as they are relatively small compared to the other datasets. Each of the six cores on the MagnyCours processor share one 6 The maximum speedup obtained in our experiments by POPTICSS MB of L3 cache. The algorithms have been implemented in C/C++ is 17.06, 25.62, and 27.50 on edge, r1m, and c1.9m, respectively. using the MPI message-passing library and has been compiled with However, the ranges of maximum speedup are 5.18 to 17.06 (aver- gcc (4.7.2) and -O2 optimization level. age 9.93) for real-world, 12.48 to 25.62 (average 19.35) for synthetic- The datasets used in the previous experiments are relatively small random, and 11.13 to 27.50 (average 20.27) for synthetic-cluster for massively parallel computing. We therefore consider a different datasets. testbed of 10 datasets, which are again divided into three categories, Figure 5(a) shows a comparison of the time taken by the merging each with three, three, and four datasets, respectively. The first stage over the local computation stage in percent for the POPTICSS two categories, called synthetic-cluster-extended (c61m, c91.5m, algorithm using dataset r1.9m for various number of process cores. and c115.9m) and synthetic-random-extended (r61m, r91.5m, and We observe that the merging time remains almost constant (a small r115.9m), have been generated synthetically using the IBM syn- fraction of the local computation time on one process core) as the thetic data generator [4, 45]. As the generator is limited to generate stage is mostly sequential in POPTICSS, and therefore the ratio in- at most 2 million high dimensional points, we replicate the same creases with the number of process cores because the local com- data towards the left and right three times (separating each dataset putation time reduces drastically. Using up to 40 cores, this ra- with a reasonable distance) in each dimension to get datasets with tio is maximum 9.60% (average 6.46%), 11.38% (average 5.51%), hundreds of million of points. The third category, called millennium- and 8.01% (average 3.35%) on real-world, synthetic-random, and run-simulation consists of four datasets from the database on Mil- synthetic-cluster datasets, respectively. lennium Run, the largest simulation of the formation of structure 10 As discussed in Section 4.1, the MST computed by parallel OPT- with the ΛCDM cosmogony with a factor of 10 particles [32, 49]. 
ICS could be different than the MST computed by sequential MST- The four datasets, MPAGalaxiesBertone2007 (mb) [10], MPAGalax- based OPTICS,MOPTICS. Therefore, to compare the quality of the iesDeLucia2006a (md) [19], DGalaxiesBower2006a (db) [12], and solutions obtained by POPTICSS, we compare the clustering ob- MPAHaloTreesMhalo (mm) [10] are taken from the Galaxy and tained by POPTICSS and classical OPTICS using the Omega-Index. Halo databases. To be consistent with the size of the other two cat- egories we have randomly selected 10% of the points from these Since 64 is the minimum number of process cores we used to per- datasets. However, since the dimension of each dataset is high, we form the experiments for POPTICSD and there is no merging stage are eventually considering almost a billion floating point numbers. in sequential OPTICS, we multiplied the local computation time Table 2 shows the structural properties of the datasets and related (taken by POPTICSD on 64 process cores) by 64 and used that value input parameters. To perform the experiments, we use a parallel kd- as the approximate sequential time (shown in Table 2) to compute tree representation as presented in [33, 44] to geometrically parti- the speedup. We show the speedup using a maximum of 4,096 tion the data among the processors. However, we do not include the process cores for the datasets as the speedup saturates and there- partitioning time while computing the speedup by the POPTICSD. fore starts decreasing on larger number of process cores. As can be seen the speedup on the synthetic-cluster-extended dataset (Figure Table 2: Structural properties of the testbed (synthetic- cluster- 6(c)) is significantly lower than the other two datasets. We observed extended, synthetic-random-extended, and millennium-run- that the number of edges in the local MSTs on the synthetic-cluster- simulation) and the input parameters, ε and minpts, along extended datasets are significantly higher than the synthetic-random- with the approximate time (in hours) taken by POPTICSD us- extended and the millenium-simulation-run datasets. On synthetic- ing one process core. cluster-extended, we get a maximum speedup of 466 (average 443) whereas on the synthetic-random-extended and millenium-simula Name Points d εminptsTime (hours) tion-run datasets, these values change to 3,008 (average 2,638) and c61m 61,000,000 10 35 800 9.35 c91.5m 91,500,000 10 35 800 10.22 2,713 (average 2,169), respectively. c115.9m 115,900,000 10 35 800 13.65 Figure 6(d) shows the trade-off between the local computation r61m 61,000,000 10 45 2 5.72 r91.5m 91,500,000 10 45 2 16.57 and the merging stage by comparing them with the total time (lo- r115.9m 115,900,000 10 45 2 24.55 cal computation time + merging time) in percent. We use mb, one DGalaxiesBower2006a (db) 101,459,853 8 40 180 20.35 MPAHaloTreesMhalo (mm) 76,066,700 9 40 250 14.66 of the millennium-simulation-run dataset for this purpose and con- MPAGalaxiesBertone2007 (mb) 105,592,018 8 40 160 11.74 tinue up to 16,384 process cores to understand the behavior clearly. MPAGalaxiesDeLucia2006a (md) 105,592,018 8 40 150 12.42 As can be seen, the communication time increases while the com- putation time decreases with the number of processors. When us- ing more than 8,000 process cores, communication time starts to '$!!!" /1#2" /3#452" /##5432" '$!!!" +1" 22" 21" 2+" dominate the computation time and therefore, the speedup starts being saturated. For example, we achieved a speedup of 2,713 us- &$!!!" &$!!!" 
ing 4,096 process cores on mb whereas the speedup is only 3,952 %$!!!" %$!!!" and 4,750 using 8,192 and 16,384 process cores, respectively. This ()**+,)" ()**+,)" happens because the overlapping regions among the process cores #$!!!" #$!!!" increase with the increment of the number of process cores. We

!" !" observe similar behavior for other datasets. !" #$!!!" %$!!!" &$!!!" '$!!!" !" #$!!!" %$!!!" &$!!!" '$!!!" Figure 6(e) compares the variations of the merging stage we -./*0" -./*0" used in POPTICSD (Algorithm 7). We call the one presented in (a) Synthetic-random-extended (b) Millenium-simulation-runs Algorithm 7 as pairwise-merging, denoted by mg-3. Each round in merging traverses the entire graph (i.e. local MSTs) once, thus &%!!!" 01$2" 03$4#2" 0$$#432" '!!" 81,/8",159:./41-" 5*+0;-0" overhead in merging is proportional to the number of rounds. There-

$%#!!" &!" fore, instead of running the merging stage log2 p rounds, we termi-

%!" nate the pairwise-merging after 3 rounds and then gather the (so $%!!!" far) merged local MSTs to processor 0 to compute the global MST. '())*+(" $!" #!!" We call this variation mg-4. Another variation, mg-5 is the same as #!" )*+,*-./0*"12"./3*-"45*" mg-4 except we terminate the pairwise merging when a round takes !" !" more than 5 seconds. Two other approaches are based on Boruvka’s !" #!!" $%!!!" $%#!!" &%!!!" !" $(!!!" &(!!!" '#(!!!" '%(!!!" ,-.)/" 61+*7" concept [15], where each processor excludes the edges from their local MSTs that will definitely be part of the global MST. We then (c) Synthetic-cluster-extended (d) Local comp vs. merging use the above two approaches to merge the rest of the edges in the

-2<*# -2<@# -2?# %"# 67689:;<"4',"=>?" ()*+,-./,01+&2-3"4',"5" on average 38.46%, 39.52%, and 37.63% on 64, 128, and 256 pro- 01.2342#5-1#46.-78391:## $"# cess cores, respectively. Similar behavior has been found for other !"# !" +**%,)-# .**%,)-# -/# %&" ''" '&" '%" datasets. Also note that mg-1 and mg-2 are scalable w.r.t. required memory. (e) Merging techniques (f) Gathering vs. Total time Figure 6(f) shows a comparison of the time taken by the gather- Figure 6: (a)-(c): Speedup of parallel OPTICS (Algorithm 7, neighbors preprocessing step over the total time taken by POPTICSD POPTICSD denoted by ) on a distributed memory computer. (d)- in percent using 64, 128, and 256 process cores on the millenium- POPTICSD (e): Analyzing . simulation-run datasets. As can be seen, the gather-neighbors step adds an overhead of maximum 4.84% (minimum 0.45%, average Figure 6(a), 6(b), and 6(c) show the speedup obtained by al- 2.14%) of the total time. Similar results have been observed on gorithm POPTICSD using synthetic-random-extended, millenium- synthetic-random-extended datasets (maximum 6.38%, minimum simulation-run, and synthetic-cluster-extended datasets, respectively. 0.77%, average 3.06%). However, these numbers are relatively [6] D. Arlia and M. Coppola. Experiments in parallel clustering higher (maximum 26.92%, minimum 6.22%, average 14.46%) on with DBSCAN. In Euro-Par 2001 Parallel Processing, pages synthetic-cluster-extended datasets as each dataset is much denser, 326–331. Springer, LNCS, 2001. thus the overlapping regions between the processors contain a sig- [7] H. Backlund, A. Hedblom, and N. Neijman. A density-based nificant number of points which each processor needs to gather. spatial clustering of application with noise. 2011. It is also to be noted that these values increase with the number http://staffwww.itn.liu.se/ aidvi/courses/06/dm/Seminars2011. of processors and also with the ε parameter as the overlapping re- [8] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The gion among the processors is proportional to them. However, with r*-tree: an efficient and robust access method for points and this scheme the local-computation stage in POPTICSD can per- rectangles. Proceedings of the 1990 ACM SIGMOD, form the processing without any communication overhead similar 19(2):322–331, 1990. to POPTICSS. The alternative would be to perform communication [9] J. Bentley. Multidimensional binary search trees used for for each point to obtain its remote neighbors. associative searching. Communications of the ACM, 6. CONCLUSION AND FUTURE WORK 18(9):509–517, 1975. In this study we have revisited the well-known density based [10] S. Bertone, G. De Lucia, and P. Thomas. The recycling of gas and metals in galaxy formation: predictions of a clustering algorithm, OPTICS. This algorithm is known to be chal- Monthly Notices of the Royal lenging to parallelize as the computation involves strongly inherent dynamical feedback model. Astronomical Society, 379(3):1143–1154, 2007. sequential data access order. We present a scalable parallel OPTICS (POPTICS) algorithm designed using graph algorithmic concepts. [11] D. Birant and A. Kut. ST-DBSCAN: An algorithm for More specifically, we exploit the similarities between OPTICS and clustering spatial-temporal data. Data & Knowledge PRIM’s Minimum Spanning Tree (MST) algorithm. Additionally, Engineering, 60(1):208–221, 2007. we use the disjoint-set data structure to extract the clusters in par- [12] R. Bower, A. Benson, R. Malbon, J. Helly, C. 
Frenk, allel from the MST for increasing concurrency. POPTICS is im- C. Baugh, S. Cole, and C. Lacey. Breaking the hierarchy of plemented using both OpenMP and MPI. The performance evalua- galaxy formation. Monthly Notices of the Royal tion used a rich set of high dimensional data consisting of instances Astronomical Society, 370(2):645–655, 2006. from real-world and synthetic datasets containing up to a billion [13] S. Brecheisen, H. Kriegel, and M. Pfeifle. Parallel floating point numbers. Our experimental results conducted on a density-based clustering of complex objects. Adv. in Know. shared memory computer show scalable performance, achieving Discovery and Data Mining, pages 179–188, 2006. speedups up to a factor of 27.5 when using 40 cores. Similar scala- [14] M. Chen, X. Gao, and H. Li. Parallel DBSCAN with priority bility results have been obtained on a distributed-memory machine r-tree. In Information Management and Engineering with a speedup of 3,008 using 4,096 process cores. Our experi- (ICIME), 2010 The 2nd IEEE International Conference on, ments also show that while achieving the scalability, the quality of pages 508–511. IEEE, 2010. the results given by POPTICS is comparable to those given by the [15] S. Chung and A. Condon. Parallel implementation of classical OPTICS algorithm. We intend to conduct further studies bouvka’s minimum spanning tree algorithm. In Parallel to provide more extensive results on much larger number of cores Processing Symposium, 1996., Proceedings of IPPS’96, The with datasets from different scientific domains. Finally, we note 10th International, pages 302–308. IEEE, 1996. that our algorithm also seems to be suitable for other parallel archi- [16] L. M. Collins and C. W. Dent. Omega: A general tectures such as GPU and heterogenous architectures. formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivariate Behavioral Research, 7. ACKNOWLEDGMENTS 23(2):231–242, 1988. This work is supported in part by the following grants: NSF [17] M. Coppola and M. Vanneschi. High-performance data awards CCF-0833131, CNS-0830927, IIS-0905205, CCF-0938000, mining with skeleton-based structured parallel programming. CCF-1029166, and OCI-1144061; DOE awards DE-FG02-08ER25 Parallel Computing, 28(5):793–813, 2002. 848, DE-SC0001283, DE-SC0005309, DESC0005340, and DESC0 [18] T. Cormen. Introduction to algorithms. The MIT press, 2001. 007456; AFOSR award FA9550-12-1-0458. This research used [19] G. De Lucia and J. Blaizot. The hierarchical formation of the Hopper Cray XE6 computer of the National Energy Research Sci- brightest cluster galaxies. Monthly Notices of the Royal entific Computing Center, which is supported by the Office of Sci- Astronomical Society, 375(1):2–14, 2007. ence of the U.S. Department of Energy under Contract No. DE- [20] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based AC02-05CH11231. algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International 8. REFERENCES Conference on Knowledge Discovery and Data mining, [1] Parallel K-means data clustering, 2005. volume 1996, pages 226–231. AAAI Press, 1996. http://users.eecs.northwestern.edu/ wkliao/Kmeans/. [21] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data [2] CLUTO - clustering high-dimensional datasets, 2006. mining to knowledge discovery in databases. AI magazine, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/. 17(3):37, 1996. [3] Cliquemod, 2009. [22] M. Forina, M. C. Oliveros, C. 
7. ACKNOWLEDGMENTS
This work is supported in part by the following grants: NSF awards CCF-0833131, CNS-0830927, IIS-0905205, CCF-0938000, CCF-1029166, and OCI-1144061; DOE awards DE-FG02-08ER25848, DE-SC0001283, DE-SC0005309, DE-SC0005340, and DE-SC0007456; and AFOSR award FA9550-12-1-0458. This research used the Hopper Cray XE6 computer of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

8. REFERENCES
[1] Parallel K-means data clustering, 2005. http://users.eecs.northwestern.edu/~wkliao/Kmeans/.
[2] CLUTO - clustering high-dimensional datasets, 2006. http://glaros.dtc.umn.edu/gkhome/cluto/cluto/.
[3] Cliquemod, 2009. http://www.cs.bris.ac.uk/~steve/networks/cliquemod/.
[4] R. Agrawal and R. Srikant. Quest synthetic data generator. IBM Almaden Research Center, 1994.
[5] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD, pages 49–60, New York, NY, USA, 1999. ACM.
[6] D. Arlia and M. Coppola. Experiments in parallel clustering with DBSCAN. In Euro-Par 2001 Parallel Processing, pages 326–331. Springer, LNCS, 2001.
[7] H. Backlund, A. Hedblom, and N. Neijman. A density-based spatial clustering of application with noise. 2011. http://staffwww.itn.liu.se/~aidvi/courses/06/dm/Seminars2011.
[8] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. Proceedings of the 1990 ACM SIGMOD, 19(2):322–331, 1990.
[9] J. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[10] S. Bertone, G. De Lucia, and P. Thomas. The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model. Monthly Notices of the Royal Astronomical Society, 379(3):1143–1154, 2007.
[11] D. Birant and A. Kut. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 60(1):208–221, 2007.
[12] R. Bower, A. Benson, R. Malbon, J. Helly, C. Frenk, C. Baugh, S. Cole, and C. Lacey. Breaking the hierarchy of galaxy formation. Monthly Notices of the Royal Astronomical Society, 370(2):645–655, 2006.
[13] S. Brecheisen, H. Kriegel, and M. Pfeifle. Parallel density-based clustering of complex objects. Advances in Knowledge Discovery and Data Mining, pages 179–188, 2006.
[14] M. Chen, X. Gao, and H. Li. Parallel DBSCAN with priority R-tree. In Proceedings of the 2nd IEEE International Conference on Information Management and Engineering (ICIME), pages 508–511. IEEE, 2010.
[15] S. Chung and A. Condon. Parallel implementation of Borůvka's minimum spanning tree algorithm. In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), pages 302–308. IEEE, 1996.
[16] L. M. Collins and C. W. Dent. Omega: A general formulation of the Rand index of cluster recovery suitable for non-disjoint solutions. Multivariate Behavioral Research, 23(2):231–242, 1988.
[17] M. Coppola and M. Vanneschi. High-performance data mining with skeleton-based structured parallel programming. Parallel Computing, 28(5):793–813, 2002.
[18] T. Cormen. Introduction to Algorithms. The MIT Press, 2001.
[19] G. De Lucia and J. Blaizot. The hierarchical formation of the brightest cluster galaxies. Monthly Notices of the Royal Astronomical Society, 375(1):2–14, 2007.
[20] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226–231. AAAI Press, 1996.
[21] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37, 1996.
[22] M. Forina, M. C. Oliveros, C. Casolino, and M. Casale. Minimum spanning tree: ordering edges to identify clustering structure. Analytica Chimica Acta, 515(1):43–53, 2004.
[23] Y. Fu, W. Zhao, and H. Ma. Research on parallel DBSCAN algorithm design based on MapReduce. Advanced Materials Research, 301:1133–1138, 2011.
[24] B. Galler and M. Fisher. An improved equivalence algorithm. Communications of the ACM, 7:301–303, 1964.
[25] J. C. Gower and G. J. S. Ross. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 18(1):54–64, 1969.
[26] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[27] H. Kargupta and J. Han. Next Generation of Data Mining, volume 7. Chapman & Hall/CRC, 2009.
[28] M. B. Kennel. KDTREE 2: Fortran 95 and C++ software to efficiently search for near neighbors in a multi-dimensional Euclidean space, 2004. Institute for Nonlinear Science, University of California.
[29] H.-P. Kriegel and M. Pfeifle. Hierarchical density-based clustering of uncertain data. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 4 pp. IEEE, 2005.
[30] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, Feb. 1956.
[31] L. Lelis and J. Sander. Semi-supervised density-based clustering. In Proceedings of the Ninth IEEE International Conference on Data Mining (ICDM '09), pages 842–847. IEEE, 2009.
[32] G. Lemson and the Virgo Consortium. Halo and galaxy formation histories from the Millennium Simulation: Public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony. arXiv preprint astro-ph/0608019, 2006.
[33] Y. Liu, W.-k. Liao, and A. Choudhary. Design and evaluation of a parallel HOP clustering algorithm for cosmological simulation. In Proceedings of IPDPS 2003, page 82.1, Washington, DC, USA, 2003. IEEE.
[34] Z. Lukić, D. Reed, S. Habib, and K. Heitmann. The structure of halos: Implications for group and cluster cosmology. The Astrophysical Journal, 692(1):217, 2009.
[35] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, 1967.
[36] S. Madeira and A. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
[37] F. Manne and M. Patwary. A scalable parallel union-find algorithm for distributed memory computers. In Parallel Processing and Applied Mathematics, pages 186–195. Springer, LNCS, 2010.
[38] A. Mukhopadhyay and U. Maulik. Unsupervised satellite image segmentation by combining SA-based fuzzy clustering with support vector machine. In Proceedings of the 7th ICAPR '09, pages 381–384. IEEE, 2009.
[39] G. Murray, G. Carenini, and R. Ng. Using the omega index for evaluating abstractive community detection. In Proceedings of the Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, pages 10–18, Stroudsburg, PA, USA, 2012.
[40] S. Nobari, T.-T. Cao, P. Karras, and S. Bressan. Scalable parallel minimum spanning forest computation. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 205–214. ACM, 2012.
[41] H. Park and C. Jun. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2):3336–3341, 2009.
[42] M. M. A. Patwary, P. Refsnes, and F. Manne. Multi-core spanning forest algorithms using the disjoint-set data structure. In Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS), pages 827–835. IEEE, 2012.
[43] M. Patwary, J. Blair, and F. Manne. Experiments on union-find algorithms for the disjoint-set data structure. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA 2010), pages 411–423. Springer, LNCS 6049, 2010.
[44] M. A. Patwary, D. Palsetia, A. Agrawal, W.-k. Liao, F. Manne, and A. Choudhary. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12), pages 62:1–62:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
[45] J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi. NU-MineBench 3.0. Technical Report CUCIS-2005-08-01, Northwestern University, 2010.
[46] R. C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal, 36:1389–1401, 1957.
[47] R. Setia, A. Nedunchezhian, and S. Balachandran. A new parallel algorithm for minimum spanning tree problem. In Proceedings of the International Conference on High Performance Computing (HiPC), pages 1–5, 2009.
[48] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. The VLDB Journal, 8(3):289–304, 2000.
[49] V. Springel, S. White, A. Jenkins, C. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, et al. Simulations of the formation, evolution and clustering of galaxies and quasars. Nature, 435(7042):629–636, 2005.
[50] M. Surdeanu, J. Turmo, and A. Ageno. A hybrid unsupervised approach for document clustering. In Proceedings of the 11th ACM SIGKDD, pages 685–690. ACM, 2005.
[51] R. Tarjan. A class of algorithms which require nonlinear time to maintain disjoint sets. Journal of Computer and System Sciences, 18(2):110–127, 1979.
[52] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the International Conference on Very Large Data Bases, pages 186–195. IEEE, 1997.
[53] J. Xie, S. Kelley, and B. K. Szymanski. Overlapping community detection in networks: the state of the art and comparative study. ACM Computing Surveys, 45(4), 2013.
[54] X. Xu, J. Jäger, and H. Kriegel. A fast parallel clustering algorithm for large spatial databases. High Performance Data Mining, pages 263–290, 2002.
[55] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Record, volume 25(2), pages 103–114. ACM, 1996.
[56] A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu. Approaches for scaling DBSCAN algorithm to large spatial databases. Journal of Computer Science and Technology, 15(6):509–526, 2000.