A Fast Algorithm for Identifying Density-Based Clustering Structures Using a Constraint Graph

electronics Article A Fast Algorithm for Identifying Density-Based Clustering Structures Using a Constraint Graph Jeong-Hun Kim 1, Jong-Hyeok Choi 1 , Kwan-Hee Yoo 1 , Woong-Kee Loh 2,* and Aziz Nasridinov 1,* 1 Department of Computer Science, Chungbuk National University, Cheongju 28644, Korea; [email protected] (J.-H.K.); [email protected] (J.-H.C.); [email protected] (K.-H.Y.) 2 Department of Software, Gachon University, Seongnam 13120, Korea * Correspondence: [email protected] (W.-K.L.); [email protected] (A.N.) Received: 2 August 2019; Accepted: 24 September 2019; Published: 27 September 2019 Abstract: OPTICS is a state-of-the-art algorithm for visualizing density-based clustering structures of multi-dimensional datasets. However, OPTICS requires iterative distance computations for all objects and is thus computed in O n2 time, making it unsuitable for massive datasets. In this paper, we propose constrained OPTICS (C-OPTICS) to quickly create density-based clustering structures that are identical to those by OPTICS. C-OPTICS uses a bi-directional graph structure, which we refer to as the constraint graph, to reduce unnecessary distance computations of OPTICS. Thus, C-OPTICS achieves a good running time to create density-based clustering structures. Through experimental evaluations with synthetic and real datasets, C-OPTICS significantly improves the running time in comparison to existing algorithms, such as OPTICS, DeLi-Clu, and Speedy OPTICS (SOPTICS), and guarantees the quality of the density-based clustering structures. Keywords: OPTICS; density-based clustering structure; visualization; constraint graph 1. Introduction Clustering is one of the data mining techniques that group data objects based on a similarity [1]. The groups can provide important insights that are used for a broad range of applications [2–8], such as superpixel segmentation for image clustering [2,3], brain cancer detection [4], wireless sensor networks [5,6], pattern recognition [7,8], and others. We can classify clustering algorithms into centroid, hierarchy, model, graph, density, and grid-based clustering algorithms [9]. Many algorithms address various clustering issues including scalability, noise handling, dealing with multi-dimensional datasets, the ability to discover clusters with arbitrary shapes, and the minimum dependency on domain knowledge for determining certain input parameters [10]. Among clustering algorithms, density-based clustering algorithms can discover arbitrary shaped clusters and noise from datasets. Furthermore, density-based clustering algorithms do not require the number of clusters as an input parameter. Instead, clusters are defined as dense regions separated by sparse regions and are formed by growing due to the inter-connectivity between objects. Density-based spatial clustering of applications with noise (DBSCAN) [11] is a well-known density-based clustering algorithm. To define dense regions which serve as clusters, DBSCAN requires two parameters: ", which represents the radius of the neighborhood of an observed object, and MinPts, which is the minimum number of objects in the "-neighborhood of an observed object. Let P be a set of multi-dimensional objects and let the "-neighbors of an object p P be N"(p ). Here, DBSCAN implements two rules: i 2 i An object pi is an "-core object if N"(pi) MinPts; • ≥ If p is an "-core object, all objects in N"(p ) should appear in the same cluster as p . • i i i Electronics 2019, 8, 1094; doi:10.3390/electronics8101094 www.mdpi.com/journal/electronics Electronics 2019, 8, 1094 2 of 23 The process of DBSCAN is simple. Firstly, an arbitrary "-core object pi is added to an empty cluster. Secondly, a cluster grows as follows: for every "-core object pi in the cluster, all objects of N"(pi) are added to the cluster. This process is then repeated until the size of a cluster no longer increases. However, DBSCAN cannot easily select appropriate input parameters to form suitable clusters because the input parameters depend on prior knowledge, such as the distribution of objects and the ranges of datasets. Moreover, DBSCAN cannot find clusters of differing densities. Figure1 demonstrates this limitation of DBSCAN in a two-dimensional dataset P when MinPts = 3. If " = 0.955, p2 is an "-core object and forms a cluster, which contains p1, p2, p3, and p4 because N"(p2) MinPts is satisfied. ≥ However, p9 and p13 are noise objects. Here, a noise object is an object that is not included in any cluster. In other words, a set of objects that cannot reach "-core objects in the clusters are defined as noise objects. On the other hand, if " = 1.031, p13 becomes an "-core object and forms a cluster, which contains p11, p13, and p12. However, p9 is still a noise object. As shown in the example in Figure1, input parameter selection in DBSCAN is problematic. Electronics 2019, 8, x FOR PEER REVIEW 2 of 22 Figure 1. TwoTwo-dimensional-dimensional clustering example demonstrating problems of density-baseddensity-based spatial clustering of applications with noise (DBSCAN) for the selectionselection ofof "휀( MinPts(푀푖푛푃푡푠==3).3). TheTo address process this of disadvantageDBSCAN is simple of DBSCAN,. Firstly, a an method arbitrary for ordering 휀-core object points 푝 to푖 identifyis added the to clusteringan empty structure,cluster. Secondly called OPTICS, a cluster [12 grows], was as proposed. follows: Likefor every DBSCAN, 휀-core OPTICS object requires푝푖 in the two cluster, input all parameters, objects of "푁,휀( and푝푖) MinPtsare added, and to findsthe cluster clusters. This of diprocessffering is densities then repeated by creating until the a reachability size of a cluster plot. no Here, longer the reachabilityincreases. However, plot represents DBSCAN an cannot ordering easily of a select dataset appropriate with respect input to theparameters density-based to form clustering suitable structure.clusters because To create the theinput reachability parameters plot, depend OPTICS on formsprior knowledge, a linear order such of objectsas the distribution where objects of thatobjects are spatiallyand the ranges closest of become datasets. neighbors Moreover, [13 DBSCAN]. Figure2 cannot shows find the reachabilityclusters of differing plot for adensities dataset. PFigure, when 1 "demonstrates= p2 and MinPts this limitation= 3. The of DBSCAN horizontal in axis a two of- thedimensional reachability dataset plot enumerates푃 when 푀푖푛푃푡푠 the objects= 3. If in휀 = a linear0.955, order,푝2 is while an 휀 vertical-core object bars displayand forms reachability a cluster, distance which (see contains Definition 푝1 , 2),푝2 , which 푝3 , and is the 푝4 minimum because distance|푁휀(푝2)| ≥ for푀푖푛푃푡푠 an object is tosatisfied. be included However, in a cluster. 푝9 and The 푝13 reachability are noise objects. distances Here, for some a noise objects object (e.g., is anp1, pobject6, p11 ,that and isp14 not) are included infinite. in In any this cluster. case, an In infinite other reachabilitywords, a set distance of objects results that whencannot the reach distance 휀-core to eachobjects object in the is undefinedclusters are because defined the as distancenoise objects. value On is greater the other than hand, given if "휀. OPTICS= 1.031, does푝13 becomes not provide an clustering휀-core object results and forms explicitly, a cluster, but which the reachability contains 푝 plot11, 푝1 shows3, and the푝12. clusters However, for 푝"9. is For still example, a noise object. when "As= shown0.5, a firstin the cluster exampleC1 containing in Figure 1,p1 input, p2, and parameterp3 is found. selection When in" DBSCAN= 0.943, second is problematic cluster C. 2, which contains p6, p7, and p8 is found. As " grows larger, clusters C1 and C2 continue to expand, and a third cluster C3 containing p11, p12, and p13 is found. Figure 2. A reachability plot of an example dataset (휀 = √2 and 푀푖푛푃푡푠 = 3). To address this disadvantage of DBSCAN, a method for ordering points to identify the clustering structure, called OPTICS [12], was proposed. Like DBSCAN, OPTICS requires two input parameters, 휀, and 푀푖푛푃푡푠, and finds clusters of differing densities by creating a reachability plot. Here, the reachability plot represents an ordering of a dataset with respect to the density-based clustering structure. To create the reachability plot, OPTICS forms a linear order of objects where objects that are spatially closest become neighbors [13]. Figure 2 shows the reachability plot for a dataset 푃, when 휀 = √2 and 푀푖푛푃푡푠 = 3. The horizontal axis of the reachability plot enumerates the objects in a linear order, while vertical bars display reachability distance (see Definition 2), which is the minimum distance for an object to be included in a cluster. The reachability distances for some objects (e.g., 푝1, 푝6, 푝11, and 푝14) are infinite. In this case, an infinite reachability distance results when the distance to each object is undefined because the distance value is greater than given 휀. OPTICS does not provide clustering results explicitly, but the reachability plot shows the clusters for Electronics 2019, 8, x FOR PEER REVIEW 2 of 22 Figure 1. Two-dimensional clustering example demonstrating problems of density-based spatial clustering of applications with noise (DBSCAN) for the selection of 휀 (푀푖푛푃푡푠 = 3). The process of DBSCAN is simple. Firstly, an arbitrary 휀-core object 푝푖 is added to an empty cluster. Secondly, a cluster grows as follows: for every 휀-core object 푝푖 in the cluster, all objects of 푁휀(푝푖) are added to the cluster. This process is then repeated until the size of a cluster no longer increases. However, DBSCAN cannot easily select appropriate input parameters to form suitable clusters because the input parameters depend on prior knowledge, such as the distribution of objects and the ranges of datasets. Moreover, DBSCAN cannot find clusters of differing densities. Figure 1 demonstrates this limitation of DBSCAN in a two-dimensional dataset 푃 when 푀푖푛푃푡푠 = 3. If 휀 = 0.955, 푝2 is an 휀 -core object and forms a cluster, which contains 푝1 , 푝2 , 푝3 , and 푝4 because |푁휀(푝2)| ≥ 푀푖푛푃푡푠 is satisfied.

A Fast Algorithm for Identifying Density-Based Clustering Structures Using a Constraint Graph

A Literature Review on Patent Texts Analysis Techniques Guanlin Li

Enhancement of DBSCAN Algorithm and Transparency Clustering Of

Comparison of Dimensionality Reduction Techniques on Audio Signals

DBSCAN++: Towards Fast and Scalable Density Clustering

Density-Based Clustering of Static and Dynamic Functional MRI Connectivity

Semantic Computing

An Improvement for DBSCAN Algorithm for Best Results in Varied Densities

Unsupervised Novelty Detection Using Deep Autoencoders with Density Based Clustering

Mobility Modes Awareness from Trajectories Based on Clustering and a Convolutional Neural Network

Representatives for Visually Analyzing Cluster Hierarchies

Visualization of the Optics Algorithm

Comparative Study of Different Clustering Algorithms for Association Rule Mining Ms