
Article

A Fast Algorithm for Identifying Density-Based Clustering Structures Using a Constraint Graph

Jeong-Hun Kim 1, Jong-Hyeok Choi 1 , Kwan-Hee Yoo 1 , Woong-Kee Loh 2,* and Aziz Nasridinov 1,*

1 Department of Computer Science, Chungbuk National University, Cheongju 28644, Korea; [email protected] (J.-H.K.); [email protected] (J.-H.C.); [email protected] (K.-H.Y.) 2 Department of Software, Gachon University, Seongnam 13120, Korea * Correspondence: [email protected] (W.-K.L.); [email protected] (A.N.)

 Received: 2 August 2019; Accepted: 24 September 2019; Published: 27 September 2019 

Abstract: OPTICS is a state-of-the-art algorithm for visualizing density-based clustering structures of multi-dimensional datasets. However, OPTICS requires iterative distance computations for all objects and thus runs in O(n²) time, making it unsuitable for massive datasets. In this paper, we propose constrained OPTICS (C-OPTICS) to quickly create density-based clustering structures that are identical to those produced by OPTICS. C-OPTICS uses a bi-directional graph structure, which we refer to as the constraint graph, to reduce unnecessary distance computations of OPTICS. Thus, C-OPTICS achieves a short running time for creating density-based clustering structures. Through experimental evaluations with synthetic and real datasets, C-OPTICS significantly improves the running time in comparison to existing algorithms, such as OPTICS, DeLi-Clu, and Speedy OPTICS (SOPTICS), and guarantees the quality of the density-based clustering structures.

Keywords: OPTICS; density-based clustering structure; visualization; constraint graph

1. Introduction

Clustering is a technique that groups data objects based on similarity [1]. The groups can provide important insights that are used for a broad range of applications [2–8], such as superpixel segmentation for image clustering [2,3], brain cancer detection [4], wireless sensor networks [5,6], pattern recognition [7,8], and others. We can classify clustering algorithms into centroid, hierarchy, model, graph, density, and grid-based clustering algorithms [9]. Many algorithms address various clustering issues, including scalability, noise handling, dealing with multi-dimensional datasets, the ability to discover clusters with arbitrary shapes, and the minimum dependency on domain knowledge for determining certain input parameters [10]. Among clustering algorithms, density-based clustering algorithms can discover arbitrarily shaped clusters and noise from datasets. Furthermore, density-based clustering algorithms do not require the number of clusters as an input parameter. Instead, clusters are defined as dense regions separated by sparse regions and are formed by growing according to the inter-connectivity between objects. Density-based spatial clustering of applications with noise (DBSCAN) [11] is a well-known density-based clustering algorithm. To define the dense regions that serve as clusters, DBSCAN requires two parameters: ε, which represents the radius of the neighborhood of an observed object, and MinPts, which is the minimum number of objects in the ε-neighborhood of an observed object. Let P be a set of multi-dimensional objects and let the ε-neighborhood of an object pi ∈ P be Nε(pi). Here, DBSCAN implements two rules:

• An object pi is an ε-core object if |Nε(pi)| ≥ MinPts;
• If pi is an ε-core object, all objects in Nε(pi) should appear in the same cluster as pi.


The process of DBSCAN is simple. Firstly, an arbitrary ε-core object pi is added to an empty cluster. Secondly, the cluster grows as follows: for every ε-core object pi in the cluster, all objects of Nε(pi) are added to the cluster. This process is then repeated until the size of the cluster no longer increases. However, it is not easy to select appropriate input parameters for DBSCAN to form suitable clusters, because the input parameters depend on prior knowledge, such as the distribution of objects and the ranges of the datasets. Moreover, DBSCAN cannot find clusters of differing densities. Figure 1 demonstrates this limitation of DBSCAN in a two-dimensional dataset P when MinPts = 3. If ε = 0.955, p2 is an ε-core object and forms a cluster, which contains p1, p2, p3, and p4 because |Nε(p2)| ≥ MinPts is satisfied. However, p9 and p13 are noise objects. Here, a noise object is an object that is not included in any cluster. In other words, a set of objects that cannot reach ε-core objects in the clusters is defined as noise objects. On the other hand, if ε = 1.031, p13 becomes an ε-core object and forms a cluster, which contains p11, p12, and p13. However, p9 is still a noise object. As shown in the example in Figure 1, input parameter selection in DBSCAN is problematic.
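To make the ε-core rule and the cluster-growing process described above concrete, the following minimal Python sketch (ours, not part of the original paper; region_query and dbscan are hypothetical helper names) labels a small numeric dataset in exactly this fashion:

```python
import numpy as np

def region_query(data, i, eps):
    """Return the indices of the eps-neighborhood of object i (i counts as its own neighbor)."""
    return np.where(np.linalg.norm(data - data[i], axis=1) <= eps)[0]

def dbscan(data, eps, min_pts):
    """Naive DBSCAN sketch: grow clusters from eps-core objects; label -1 marks noise."""
    labels = np.full(len(data), -1)
    cluster_id = 0
    for i in range(len(data)):
        if labels[i] != -1:                       # already assigned to a cluster
            continue
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:              # i is not an eps-core object
            continue
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                              # grow the cluster rule by rule
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                j_neighbors = region_query(data, j, eps)
                if len(j_neighbors) >= min_pts:   # j is also an eps-core object
                    seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

# Example: dbscan(np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.]]), eps=1.5, min_pts=3)
# assigns the first three points to cluster 0 and marks [5, 5] as noise (-1).
```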

Figure 1. Two-dimensional clustering example demonstrating problems of density-based spatial clustering of applications with noise (DBSCAN) for the selection of ε (MinPts = 3).

To address this disadvantage of DBSCAN, a method for ordering points to identify the clustering structure, called OPTICS [12], was proposed. Like DBSCAN, OPTICS requires two input parameters, ε and MinPts, and finds clusters of differing densities by creating a reachability plot. Here, the reachability plot represents an ordering of a dataset with respect to the density-based clustering structure. To create the reachability plot, OPTICS forms a linear order of objects where objects that are spatially closest become neighbors [13]. Figure 2 shows the reachability plot for a dataset P, when ε = √2 and MinPts = 3. The horizontal axis of the reachability plot enumerates the objects in a linear order, while vertical bars display the reachability distance (see Definition 2), which is the minimum distance for an object to be included in a cluster. The reachability distances for some objects (e.g., p1, p6, p11, and p14) are infinite. In this case, an infinite reachability distance results when the distance to each object is undefined because the distance value is greater than the given ε. OPTICS does not provide clustering results explicitly, but the reachability plot shows the clusters for ε. For example, when ε = 0.5, a first cluster C1 containing p1, p2, and p3 is found. When ε = 0.943, a second cluster C2, which contains p6, p7, and p8, is found. As ε grows larger, clusters C1 and C2 continue to expand, and a third cluster C3 containing p11, p12, and p13 is found.
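For illustration only (ours, not part of the paper), the sketch below extracts flat clusters from a reachability plot, given as an ordered list of (object, reachability distance) pairs, by thresholding at a chosen ε'; it is a simplification that ignores core distances, so border cases may differ from the exact extraction described in [12].

```python
import math

def extract_clusters(ordering, eps_prime):
    """Split a reachability plot into clusters at every bar higher than eps_prime.

    ordering: list of (object_id, reachability_distance) in plot order, with
              math.inf standing for UNDEFINED reachability distances.
    Returns (clusters, noise); single-object groups are treated as noise here.
    """
    clusters, noise, current = [], [], []
    for obj, rdist in ordering:
        if rdist > eps_prime:              # a high bar starts a new candidate cluster
            if len(current) > 1:
                clusters.append(current)
            else:
                noise.extend(current)
            current = [obj]
        else:                              # a low bar joins the current cluster
            current.append(obj)
    if len(current) > 1:
        clusters.append(current)
    else:
        noise.extend(current)
    return clusters, noise

# With the plot of Figure 2 and eps_prime = 0.5, the first group returned would be
# the cluster C1 = [p1, p2, p3]; larger thresholds merge and extend the clusters.
```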

Figure 2. A reachability plot of an example dataset (ε = √2 and MinPts = 3).


As demonstrated in Figure 2, OPTICS addresses the limitations of input parameter selection for DBSCAN. However, OPTICS requires distance computations for all pairs of objects to create a reachability plot. In other words, OPTICS first computes an ε-neighborhood for each object to identify ε-core objects, and then computes the reachability distances at which ε-core objects are reached from all other objects. Thus, OPTICS is computed in O(n²) time, where n is the number of objects in a dataset [14]. Therefore, OPTICS is unsuitable for massive datasets. Prior studies have proposed many algorithms to address the running time of OPTICS, such as DeLi-Clu [15] and SOPTICS [16]. These algorithms improve the running time of OPTICS, but have their own limitations, such as dependence on the number of dimensions and deformation of the reachability plot.

This paper focuses on improving OPTICS by addressing its quadratic time complexity problem. To do this, we propose a fast algorithm, called constrained OPTICS (simply, C-OPTICS). C-OPTICS uses a novel bi-directional graph structure, called the constraint graph, which consists of vertices corresponding to the cells that partition a given dataset. In the constraint graph, the vertices are linked by means of edges when the distance between vertices is less than ε. The constraints are assigned as the weight of each edge. The main feature of C-OPTICS is that it only computes the reachability distances of the objects that satisfy the constraints, which results in a reduction of unnecessary distance computations when creating a reachability plot. We evaluated the performance of C-OPTICS through experiments with the OPTICS, DeLi-Clu, and SOPTICS algorithms. The experimental results show that C-OPTICS significantly reduces the running time compared with the OPTICS, DeLi-Clu, and SOPTICS algorithms and guarantees a reachability plot identical to that of OPTICS.

The rest of the paper is organized as follows: Section 2 provides an overview of OPTICS, including its limitations, and describes related studies that have been performed to improve OPTICS. Section 3 defines the concepts of C-OPTICS and describes the algorithm. Section 4 presents an evaluation of C-OPTICS based on the results of experiments with synthetic and real datasets. Section 5 summarizes and concludes the paper.

2. Related Work

This section focuses on OPTICS and related algorithms proposed to address the quadratic time complexity problem of OPTICS. Section 2.1 presents the concepts of OPTICS and discusses unnecessary distance computations in OPTICS. Section 2.2 describes the existing algorithms proposed to improve the running time of OPTICS. Details of all the symbols used in this paper are defined in Table 1.

Table 1. The notations.

Symbols                     Definitions
P                           A set of N objects for clustering
N                           The cardinality of P
pi                          The i-th object (or data point) in P (1 ≤ i ≤ N)
UC                          Unit cell that partitions the dataset (Definition 3)
d                           The number of dimensions of P
MBR                         The minimum bounding rectangle in UC
v                           A vertex in the constraint graph
C                           A cluster of P
ε                           The radius of the neighborhood of an object pi
MRD                         The maximum reachable range of a vertex by ε (Definition 4)
Nε(pi)                      The ε-neighborhood of an object pi
MinPts                      The minimum number of objects in the ε-neighborhood of an object pi
cdistε, MinPts(pi)          The core distance of pi (Definition 1)
rdistε, MinPts(pi, pj)      The reachability distance of an object pi w.r.t. an ε-core object pj (Definition 2)
dist(pi, pj)                The distance between pi and pj
vdist(vi, vj)               The distance between two vertices vi and vj
RV                          The set of adjacent vertices for a vertex v
vstate                      The state of a vertex v (Definition 6)
LC(vi, vj)                  The linkage constraint from vi to vj (Definition 5)

2.1. OPTICS

The well-known hierarchical density-based algorithm OPTICS visualizes the density-based clustering structure. Section 2.1.1 reviews the definitions of the naïve algorithm used to compute a reachability plot. Section 2.1.2 demonstrates the quadratic time complexity of the naïve algorithm and its unnecessary distance computations.

2.1.1. Definitions

Let P be a set of n objects in the d-dimensional space R^d. Here, the Euclidean distance between two objects pi and pj is denoted by dist(pi, pj). OPTICS creates a reachability plot based on the concepts defined below.

Definition 1 (Core distance of an object p) [12]. Let Nε(p) be the ε-neighborhood of p and let MinPts-dist(p) be the distance between p and its MinPts-th nearest neighbor. The core distance of p, cdistε, MinPts(p), is then defined using Equation (1):

cdistε, MinPts(p) = UNDEFINED, if |Nε(p)| < MinPts; MinPts-dist(p), otherwise. (1)

Note that cdistε, MinPts(p) is the minimum ε at which p qualifies as an ε-core object for DBSCAN. For example, when ε = √2 and MinPts = 3 for the sample dataset P in Figure 1, cdist√2, 3(p9) = 1.315, which is the distance between p9 and its MinPts-th nearest neighbor p6.
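A minimal Python sketch of Equation (1) (ours; it assumes the dataset is an (n, d) NumPy array and that an object counts as its own neighbor, so the index conventions may need adjusting):

```python
import numpy as np

def core_distance(data, i, eps, min_pts):
    """Core distance of object i per Equation (1): None (UNDEFINED) if i is not an
    eps-core object, otherwise the distance to its MinPts-th nearest neighbor."""
    dists = np.sort(np.linalg.norm(data - data[i], axis=1))   # includes the 0 distance to itself
    if np.count_nonzero(dists <= eps) < min_pts:              # |N_eps(i)| < MinPts
        return None
    return float(dists[min_pts - 1])                          # MinPts-th closest point (self included)
```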

Definition 2 (Reachability distance of an object p with respect to an object o) [12]. Let p and o be objects from dataset P; the reachability distance of p with respect to o, rdistε, MinPts(p, o), is defined as Equation (2):

rdistε, MinPts(p, o) = UNDEFINED, if |Nε(o)| < MinPts; max(cdistε, MinPts(o), dist(o, p)), otherwise. (2)

Intuitively, when o is an ε-core object, the reachability distance of p with respect to o is the minimum distance such that p is directly density-reachable from o. Thus, the minimum reachability distance of each object p ∈ P is the minimum distance at which it can be included in a cluster. To create a reachability plot, a linear order of objects is formed by selecting, as the next object, the object p that has the closest reachability distance to the observed object o. Here, the linear order of objects represents the order of interconnection between objects by densities in the dataset. Accordingly, the reachability plot shows the reachability distance for each object in the order the object was processed.
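Equation (2) can be sketched the same way (ours; again self-contained and with the same neighbor-counting assumption):

```python
import numpy as np

def reachability_distance(data, p, o, eps, min_pts):
    """Reachability distance of object p w.r.t. object o per Equation (2):
    None (UNDEFINED) when o is not an eps-core object, otherwise
    max(core distance of o, dist(o, p))."""
    dists_o = np.sort(np.linalg.norm(data - data[o], axis=1))
    if np.count_nonzero(dists_o <= eps) < min_pts:     # o is not an eps-core object
        return None
    cdist_o = float(dists_o[min_pts - 1])              # core distance of o (Equation (1))
    return max(cdist_o, float(np.linalg.norm(data[p] - data[o])))
```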

2.1.2. Computation

OPTICS first finds all ε-core objects in the dataset in O(n²) time and then computes the minimum reachability distance for all objects in O(n²) time to create a reachability plot. This is still the best time complexity known. Alternatively, ε-core objects can be found quickly using spatial indexing structures, such as an R*-tree [17], that optimize range queries to obtain ε-neighborhoods. However, the reachability plot is still created in O(n²) time because computing the reachability distances of all objects to form a linear order of objects is not optimized by the spatial indexing structure. In other words, OPTICS has quadratic time complexity because each object computes reachability distances for all ε-core objects in the dataset. However, only the minimum reachability distance of each object is displayed in the reachability plot (see Figure 2). That is, all reachability distance computations, except for identifying the reachability distance displayed in the reachability plot, are unnecessary.

Figure 3 shows an example of unnecessary reachability distances for sample dataset D when ε = √2 and MinPts = 3. First, the distance between p1 and all sample objects pi ∈ Nε(p1) is computed to determine if p1 is a core object. Next, the distance between p1 and its MinPts-th nearest neighbor is computed to obtain the core distance of p1 according to Definition 1. Subsequently, the reachability distances for all objects contained in Nε(p1) are computed according to Definition 2, as shown in Figure 3a. That is, the reachability distances between p1 and p2, p3, p4, p5 are computed. This process is repeated for all sample objects, as shown in Figure 3b. However, as shown in Figure 3c, not all reachability distances are required to create a reachability plot. For example, p5 is reachable from both p1 and p2; however, its reachability distance from p1 is 1.29 and from p2 is 1.37. Considering that only the minimum reachability distance of each object is displayed in the reachability plot, the reachability distance computation from p5 to p2 is unnecessary because it does not contribute to creating a reachability plot.
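The following hedged sketch (ours, purely illustrative; it does not reproduce the actual OPTICS ordering) makes the waste concrete: it computes a reachability distance from every ε-core object to each of its ε-neighbors, as the naïve computation effectively does, and counts how few of those values are the per-object minima that can ever appear in the plot.

```python
import numpy as np

def count_wasted_computations(data, eps, min_pts):
    """Return (computed, kept): how many reachability distances the naive scheme
    evaluates versus how many per-object minima could actually be plotted."""
    n = len(data)
    pairwise = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    sorted_d = np.sort(pairwise, axis=1)
    is_core = np.count_nonzero(pairwise <= eps, axis=1) >= min_pts   # eps-core flags
    cdist = np.where(is_core, sorted_d[:, min_pts - 1], np.inf)      # Equation (1)
    best = np.full(n, np.inf)
    computed = 0
    for o in range(n):
        if not is_core[o]:
            continue
        for p in range(n):
            if p == o or pairwise[o, p] > eps:
                continue
            computed += 1
            best[p] = min(best[p], max(cdist[o], pairwise[o, p]))    # Equation (2)
    kept = int(np.count_nonzero(np.isfinite(best)))
    return computed, kept

# On dense data, `computed` grows roughly quadratically while `kept` is at most n.
```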


Figure 3. Visualization of the reachability distances for sample dataset D: (a) all reachability distances for p1; (b) all reachability distances between sample objects; (c) unnecessary distance computations between sample objects.


2.2. Existing Work

This subsection describes the algorithms proposed to address the quadratic time complexity of OPTICS. Researchers have proposed new indexing structures and approximate reachability plots to reduce the number of core distance and reachability distance computations. In addition, some researchers have proposed algorithms that visualize a new hierarchical density-based clustering structure. We can roughly classify these algorithms into the following three categories: equivalent algorithms with results identical to those of OPTICS, approximate algorithms, and other hierarchical density-based algorithms.

Among the equivalent algorithms, DeLi-Clu [15] optimized the range queries used to compute the core distance and reachability distance of each object in a dataset using a spatial indexing structure, in this case a variant of the R*-tree. In particular, the proposed self-join R*-tree query significantly reduces the total number of distance computations. Furthermore, DeLi-Clu discards the parameter ε, as the join process automatically stops when all objects are connected, without computing all pairwise distances. However, as the dimensionality of the dataset increases, the limitations of the R*-tree arise in the form of overlapping distance computations, which significantly degrades the performance of DeLi-Clu. The extended DBSCAN and OPTICS algorithms were proposed by Brecheisen et al. [18]. These two algorithms improved the performance of DBSCAN and OPTICS by applying a multi-step query processing paradigm. Specifically, the core distance and reachability distance were quickly computed by replacing range queries with MinPts-th nearest neighbor queries. In addition, complicated distance computations were performed only at the steps essential to creating a reachability plot, to reduce wasted computations. However, the running time of the extended DBSCAN and OPTICS in [18] is not suitable for massive datasets. Other exact approaches are based on graphics processing units (GPUs) or multi-core distributed processing. One study [13] introduced an extensible parallel OPTICS algorithm on a 40-core shared memory machine. The similarities between OPTICS and Prim's minimum spanning tree algorithm were used to extract clusters in parallel from the shared memory. Other work [19] introduced G-OPTICS, which improves the scalability of OPTICS using a GPU. G-OPTICS significantly reduces the running time of OPTICS by processing the iterative computations required to form a reachability plot in parallel through the GPU and shared memory.

Approximate algorithms simplify and reduce the complex distance computations of OPTICS. An algorithm was proposed to compress the dataset into data bubbles containing only key information [20]. The number of distance computations is determined by the data compression ratio, allowing the scalability to be improved. However, if the data compression ratio exceeds a certain level, a reachability plot identical to that of OPTICS cannot be guaranteed. The higher the data compression ratio, the larger the number of objects abstracted into a single representative object. GridOPTICS, which transforms the dataset into a grid structure to create a reachability plot, was also proposed [21]. GridOPTICS initially creates a grid structure for a given dataset, after which it forms a reachability plot for the grids. It then assigns the objects in the dataset to the formed reachability plot. However, clustering quality is not guaranteed, given that the form of the reachability plot depends on the size of the grids. Another method is SOPTICS [16], which applies random projection to improve the running time of OPTICS. After partitioning a dataset into subsets while performing random projection via a pre-defined criterion, it quickly creates a reachability plot using a new density estimation metric based on average distance computations through data sampling. However, SOPTICS introduces many deformations into the reachability plot because the random projection maps objects into lower dimensions.

Other hierarchical density-based clustering algorithms have also been studied continuously. A runt-pruning algorithm that analyzes the minimum spanning tree of the sampled dataset and then uses nearest neighbor-based density estimation to define the density level of each object to create a cluster tree was proposed in [22]. Based on a runt test for multi-modality [23], the runt-pruning algorithm creates a cluster tree by repeatedly partitioning a dense cluster into two connected components. HDBSCAN, which constructs a graph structure with a mutual reachability distance between two objects, was also proposed in [24]. It discovers a minimum spanning tree and then forms a hierarchical structure. In addition to hierarchical clustering, an algorithm for producing a flat partitioning from the hierarchical solution was introduced in [25].

3. Constrained OPTICS (C-OPTICS)

3.1. Overview

We now explain the proposed algorithm, called C-OPTICS. The goal of C-OPTICS is to reduce the total number of distance computations required to create a reachability plot, because such computations affect the running time of OPTICS, as explained in Section 2.1.2. To achieve this goal, C-OPTICS proceeds in three steps: (a) partitioning step, (b) graph construction step, and (c) plotting step. For the convenience of the explanation, Figure 4 shows the process of the C-OPTICS for the two-dimensional dataset P in Figure 1. In the partitioning step, we partition P into identical cells. Figure 4a shows the result of partitioning, where empty cells are discarded. In the graph construction step, as shown in Figure 4b, we construct a constraint graph that consists of vertices corresponding to the cells obtained in the partitioning step and edges linking vertices by constraints. In the plotting step, we compute the reachability distance of each object while traversing the constraint graph. Figure 4c shows the reachability distance and linear order of each object.



Figure 4. The overall process of the C-OPTICS for the two-dimensional dataset P: (a) partitioning step, (b) graph construction step, and (c) plotting step.

C-OPTICS can obtain ε-neighbors by finding adjacent cells for an arbitrary object. Thus, C-OPTICS can reduce the number of distance computations needed to find ε-core objects. Besides, C-OPTICS only computes the reachability distances of the objects that satisfy the constraints in the constraint graph. Consequently, C-OPTICS reduces the total number of distance computations required to create a reachability plot. We explain in detail the partitioning step in Section 3.2, the graph construction step in Section 3.3, and the plotting step in Section 3.4.

3.2. Partitioning Step

This subsection describes the data partitioning step for C-OPTICS. We partition a given dataset into cells of identical size with a diagonal length ε. Definition 3 explains the unit cell of the partitioning step.


Definition 3 (Unit cell, UC). We define UC as a d-dimensional hypercube with a diagonal length ε and straight sides, all of equal length. UC is a square with a diagonal length ε in two dimensions, and in three dimensions UC is a cube with a diagonal length ε.

Here, all UCs have two main features. First, each UC has a unique identifier that is used as the location information in the dataset. We use the coordinate values of a UC to obtain its unique identifier. Here, the coordinate value of a UC represents the sequence number of the UC in each dimension of the dataset. Thus, combining the coordinate values of a UC for all dimensions enables us to obtain the location information from a unique identifier. The location information allows us to quickly find adjacent UCs within a radius ε for an arbitrary UC. Through the partitioning step, we can reduce the number of distance computations in finding the ε-neighbors of each object, because the number of objects in each cell within a radius ε is smaller than in the entire dataset. We can further simplify the process of constructing a constraint graph in a later step. Second, each UC has a d-dimensional minimum bounding rectangle (MBR) that encloses the contained objects. We use MBRs to compute the distance between UCs to find adjacent UCs. We obtain all coordinates of the MBR based on the maximum and minimum values of each dimension for the contained objects. If the UC contains only one object, the object becomes the MBR.

Example 1. Figure 5 shows an example of the partitioning step for the sample dataset P when ε = √2. Figure 5a shows each object pi ∈ P in a two-dimensional universe. Figure 5b shows the partitioning of P into UCs, where each UC has the same diagonal length. Figure 5c shows empty and non-empty UCs. Note that we do not consider empty UCs. Further, we obtain the unique identifier of each UC as follows. To compute the unique identifier of UC21, we first obtain the coordinate value of UC21 in each dimension. As shown in Figure 5c, the coordinate values of UC21 are (2, 1) in the first and second dimensions, respectively. Considering that the unique identifier of UC21 is obtained by the combination of the coordinate values, the unique identifier of UC21 is 21. The result of partitioning is shown in Figure 5d, where P is partitioned into seven UCs, with each UC having an MBR.
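A minimal sketch of this partitioning (ours; it assumes an (n, d) NumPy array and derives the cell side length ε/√d from the diagonal-ε requirement of Definition 3) is shown below.

```python
import math
from collections import defaultdict
import numpy as np

def partition_into_unit_cells(data, eps):
    """Partition data into unit cells (UCs) whose diagonal is eps.

    Returns a dict mapping each cell's coordinate tuple (its unique identifier)
    to the indices of the contained objects and their MBR; empty cells are never
    materialized, mirroring Figure 5c."""
    side = eps / math.sqrt(data.shape[1])                            # diagonal eps => side eps / sqrt(d)
    cells = defaultdict(list)
    for i, point in enumerate(data):
        coord = tuple(int(math.floor(x / side)) for x in point)      # per-dimension cell index
        cells[coord].append(i)
    return {
        coord: {"objects": idx,
                "mbr": (data[idx].min(axis=0), data[idx].max(axis=0))}   # MBR corners
        for coord, idx in cells.items()
    }
```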


Figure 5. Partitioning into UCs of sample dataset P (ε = √2): (a) the sample dataset P, (b) UCs that partition P, (c) non-empty UCs (white area) and their unique identifiers, and (d) the result of the partitioning step.


3.3. Graph Construction Step

In the graph construction step, we construct a constraint graph with the UCs obtained in the data partitioning step. Through constructing a constraint graph, we can reduce the number of distance computations needed to obtain the minimum reachability distance for each object. For this, we first use the UCs obtained in the data partitioning step as the vertices of the constraint graph. Here, the unique identifier of each UC is used as the unique identifier of each vertex. Thus, we can quickly find adjacent vertices because we preserve the location information of the UC for each vertex. Additionally, we also preserve the MBR of each UC. Then, we connect the vertices to define the edges. Here, we connect the vertices according to two properties: (i) the two vertices must be adjacent to each other within a radius ε, and (ii) an edge between two vertices must have constraints. Considering that we can quickly find the ε-neighborhood of each object in the dataset by the first property, we can obtain the objects that are within a radius ε. Thus, there is no need to compute distances with all objects. We first explain the maximum reachability distance, which is defined as Definition 4.

Definition 4 (Maximum reachability distance, MRD). We define MRD as the range of a radius ε at which a vertex can reach. Here, the unique identifier of a vertex is the combination of its coordinate values. Thus, in d dimensions, MRD represents a ±⌈√d⌉ range for the coordinate value of each dimension, because a radius ε is √d times the length of a straight side of the UC that corresponds to the vertex.

Figure 6 shows an example of the MRD for a vertex, v33, in two-dimensional space. We can decompose the unique identifier of v33 into two coordinate values, which are (3, 3). Considering that the MRD of a vertex in the two-dimensional dataset represents a range of +2 and -2 in the coordinate values of each dimension, according to Definition 4, the reachable range is 1 to 5 in the first dimension and 1 to 5 in the second dimension. As a result, the MRD of v33 is equal to the gray area shown in Figure 6. Through the calculation of MRD, we can find adjacent vertices within the same range and thus avoid unnecessary calculations of distances between all vertices. In Figure 6, adjacent vertices v12 and v53 are in the MRD of v33.
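A small sketch of this candidate search (ours): given a vertex's coordinate tuple, it enumerates every cell coordinate inside the MRD, i.e., within ±⌈√d⌉ in each dimension, leaving the exact adjacency test to the MBR distance of Equation (3).

```python
import math
from itertools import product

def mrd_candidates(coord, d):
    """Enumerate the cell coordinates inside the MRD of a vertex (Definition 4)."""
    reach = math.ceil(math.sqrt(d))                 # +/- ceil(sqrt(d)) per dimension
    offsets = range(-reach, reach + 1)
    return [tuple(c + o for c, o in zip(coord, off))
            for off in product(offsets, repeat=d)
            if any(off)]                            # exclude the vertex's own cell

# Example: mrd_candidates((3, 3), d=2) covers coordinates 1..5 in both dimensions,
# matching the gray area around v33 in Figure 6.
```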

Figure 6. An example of maximum reachability distance (MRD) in two-dimensional space.

Additionally, we use the MBR of each vertex to compute the distance between two vertices. We can obtain the adjacent vertices more accurately by computing the distance between the two vertices. Here, adjacent vertices obtained from the MRD of an arbitrary vertex may not actually be vertices within a radius ε, because the range of MRD and the range of ε do not precisely match. Thus, we compute the minimum distance between the MBRs of two vertices to find adjacent vertices within the radius ε. Equation (3) represents the distance between two vertices:

vdist(vi, vj) = min(dist(vi.MBR, vj.MBR)), (3)

where vdist(vi, vj) is the distance between vi and vj, and vi.MBR is the MBR of the unit cell UCi corresponding to vi. Here, if vdist(vi, vj) ≤ ε, vj is an adjacent vertex of vi and vice versa.
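The inner dist(·) of Equation (3) can be realized as the minimum distance between two axis-aligned MBRs; a hedged sketch (ours, assuming each MBR is given by its lower and upper corner vectors) follows.

```python
import numpy as np

def mbr_min_distance(mbr_a, mbr_b):
    """Minimum Euclidean distance between two MBRs given as (lower, upper) corners;
    0 if the rectangles touch or overlap."""
    lo_a, hi_a = (np.asarray(c, dtype=float) for c in mbr_a)
    lo_b, hi_b = (np.asarray(c, dtype=float) for c in mbr_b)
    gap = np.maximum(0.0, np.maximum(lo_a - hi_b, lo_b - hi_a))   # per-dimension gap
    return float(np.linalg.norm(gap))

# Two vertices vi and vj are treated as adjacent when
# mbr_min_distance(vi_mbr, vj_mbr) <= eps, i.e., vdist(vi, vj) <= eps.
```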


Next, we define the constraints to link vertices that satisfy the second property. Recall from Section 1 that the constraint graph only links vertices when the distance between vertices is less than ε. These pairs of vertices become edges, and the constraints are assigned as weights for each edge. Here, a constraint is defined as a subset of ε-core objects that can compute the minimum reachability distances of objects in different vertices. In the plotting step, we only compute the reachability distances of the objects that satisfy the constraints and thus reduce unnecessary distance computations. Here, a constraint is satisfied if the observed object belongs to the subset of ε-core objects designated by the constraint. We first explain the linkage constraints defined in Definition 5 to obtain the ε-core objects corresponding to the constraints.

Definition 5 (Linkage constraint, LC(vi, vj)). For two vertices vi and vj, let an ε-core object in vi be pcore and let the MBR of a vertex vj be vj.MBR. We define the linkage constraint from vi to vj as a subset of ε-core objects in vi. Here, when p1 is the ε-core object closest to vj.MBR and q is the coordinate of vj.MBR farthest from p1, the ε-core objects should be closer to q than they are to p1.

We link two vertices when the linkage constraint between the vertices is defined according to Definition 5. Thus, we can reduce unnecessary distance computations by further pruning the adjacent vertices obtained from the first property. Furthermore, the linkage constraints guarantee the quality of the reachability plot while reducing the number of distance computations. We prove this in Lemma 1.

Lemma 1. Let the linkage constraint from vi to vj be LC(vi, vj) for vi, vj ∈ V in the constraint graph G(V, E). If LC(vi, vj) ≠ ∅, the minimum reachability distances for all objects contained in vj are determined by the LC(vi, vj) × vj pairs and the core distance. In other words, the reachability distances of all objects in vj are determined by the ε-core objects in LC(vi, vj).

Proof. For a given dataset P, let vi = {p1, p2}, let pt ∈ vj, and let the linkage constraint from vi to vj be LC(vi, vj) = {p1}. In this case, the reachability distance of an object pt is determined by p1 when cdistε, MinPts(p1) ≤ dist(p1, pt). Conversely, p2 cannot determine the reachability distance of pt because cdistε, MinPts(p1) < dist(p2, pt) holds in all cases, even when dist(p1, pt) < cdistε, MinPts(p1). Therefore, Lemma 1 is obviously true. □

Additionally, we define the state of a vertex. Through the states of vertices, we can find ε-core objects in the dataset without computing the distances between the objects. We define the state of a vertex based on the rules of DBSCAN and on the UCs obtained in the data partitioning step. As described in Section 1, if the number of objects contained in the ε-neighborhood of an arbitrary object is greater than or equal to MinPts, then the object is an ε-core object. Here, because the diagonal length of a UC is ε, the objects must be ε-core objects when the number of objects contained in the UC is greater than or equal to MinPts. Accordingly, we define three states for each vertex: stable, unstable, and noise. More precisely, the state of a vertex is defined as follows:

Definition 6 (The state of a vertex, vstate). Let the set of adjacent vertices contained in the MRD of vi be RVi and an adjacent vertex be rvk ∈ RVi. The state of vi, which can determine whether to compute the distances of the contained objects, satisfies the following conditions:

1. Let |vi| be the number of data points contained in vi. If MinPts ≤ |vi|, vi.vstate is stable;
2. If condition 1 is not satisfied and MinPts ≤ |vi| + |rvk|, vi.vstate is unstable;
3. If both conditions 1 and 2 are not satisfied, vi.vstate is noise.

Again, we note that objects contained in the same vertex must be ε-neighbors of each other. Thus, when the vstate of a vertex vi is stable, all objects contained in vi are ε-core objects, because |Nε(pt)| of all objects pt ∈ vi is always greater than or equal to MinPts. Conversely, if the vstate of vi is noise, all objects contained in vi are noise objects and are not considered anymore.
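A small sketch of Definition 6 (ours; we read condition 2 as holding if any single adjacent vertex pushes the combined count to MinPts):

```python
from enum import Enum

class VState(Enum):
    STABLE = "stable"      # every contained object is an eps-core object
    UNSTABLE = "unstable"  # core objects must still be verified by distance checks
    NOISE = "noise"        # all contained objects are noise

def vertex_state(own_count, adjacent_counts, min_pts):
    """Classify a vertex from its own object count and the counts of the
    adjacent vertices inside its MRD (Definition 6)."""
    if min_pts <= own_count:
        return VState.STABLE
    if any(min_pts <= own_count + rv for rv in adjacent_counts):
        return VState.UNSTABLE
    return VState.NOISE

# Example: vertex_state(1, [4], min_pts=3) -> VState.UNSTABLE, like v0 in Example 2;
# vertex_state(1, [], min_pts=3) -> VState.NOISE, like v74.
```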

Example 2. Figure 7 shows the step-by-step construction process of the constraint graph G(V, E) when ε = √2 and MinPts = 3 for the sample dataset P, using a dataset identical to that shown in Figure 5. The example starts with a partitioned dataset, P̂, as shown in Figure 7a. First, as shown in Figure 7b, each UC is mapped into the vertices of G, and the unique identifier of each UC is used as the unique identifier of the vertex. Second, a vertex is selected by the input sequence of objects. In our case, a vertex v21 containing p1 is selected, as shown in Figure 7c. To find the adjacent vertices for v21, the MRD of v21 (i.e., the gray area) is computed using Definition 4. Here, two vertices v0 and v33 are found as adjacent vertices of v21. Next, the state of v21, v21.vstate, is obtained according to Definition 6. Here, all objects in v21 are ε-core objects, hence v21.vstate is stable and is depicted as a doubled circle. Third, as shown in Figure 7d, the linkage constraints LC(v21, v0) and LC(v21, v33) are obtained using Definition 5. For example, let the MBR of a vertex vi be vi.MBR. We first obtain the ε-core objects of v21, but we can skip this step because v21.vstate is stable. Next, the ε-core object of v21 closest to v0.MBR, p1, is obtained. Because v0 contains only one object, the coordinate of v0.MBR farthest from p1 is p5 (blue star). Thus, LC(v21, v0) contains only p1 because no ε-core object is closer to p5 than p1. Then, because the constraint between v21 and v0 is defined, the two vertices are linked by e(v0, v21). Figure 7e shows the linkage constraint LC(v0, v21) and the state of v0. The vertex v0 contains only one object. Thus, v0.vstate is unstable (circle) due to the adjacent vertex v21 that contains four objects. LC(v0, v21) contains p5 and thus e(v0, v21) becomes bidirectional. On the other hand, v74 contains only one object and has no adjacent vertex. Thus, the state of v74 is noise and is depicted as a dotted circle. When all vertices are processed, the constraint graph G is constructed, as shown in Figure 7f.

3.4. Plotting Step

In the plotting step, we traverse the constraint graph to generate a reachability plot as in OPTICS. Recall from Section 2.1.2 that OPTICS computes reachability distances for all pairs of objects to create a reachability plot. However, to create a reachability plot, only one reachability distance is required for each object. Through the constraint graph, we can reduce the reachability distance computations that do not contribute to generating the reachability plot. Here, the constraint graph identifies the reachability distance of each object required for the reachability plot by means of the linkage constraints. Thus, we reduce the unnecessary distance computations by only computing the distance between objects contained in the vertices that are linked to each other. We can further reduce the unnecessary distance computations by only computing the reachability distance for objects that satisfy the linkage constraints between the linked vertices. To create a reachability plot, we traverse the constraint graph with the following rules: (i) if two objects are contained in the same vertex, the distance is computed; (ii) if two objects are contained in different vertices, only objects that satisfy the linkage constraint between the two vertices are computed; (iii) the object with the closest reachability distance to the target object becomes the next target object; (iv) if no object is reachable, the next target object is selected by the input order of objects. For clarity, we provide the pseudocode, which will be referred to as the C-OPTICS procedure.



Figure 7. Construction of constraint graph G(V, E) for sample dataset P (ε = √2 and MinPts = 3): (a) the sample dataset P partitioned into non-empty UCs, (b) vertices corresponding to the UCs, (c) MRD and vstate of v21, (d) edges in v21 with linkage constraints weighted, (e) an edge in v0 with a linkage constraint weighted, and (f) the constraint graph.

Algorithm 1 shows the C-OPTICS procedure that creates a reachability plot by traversing the constraint graph and plotting the reachability distance of each object. The inputs for C-OPTICS are the dataset P and the constraint graph G(V, E). The output of C-OPTICS is the reachability plot RP. In line 1, RP, a list structure representing the reachability plot, and OrderSeed, a priority queue structure, are initialized. Here, OrderSeed determines the order in which to traverse the constraint graph. In line 2, a target object ptarget ∈ P is selected by the input order of objects. In line 3, the algorithm checks whether the target object ptarget has been processed. If ptarget is not processed, ptarget is set to the processed state in line 4. Then, ptarget is inputted to RP with an UNDEFINED (infinite) reachability distance. In line 6, the vertex vtarget ∈ V, which contains ptarget, is obtained. If vtarget.vstate is noise, in line 7, ptarget is determined to be a noise object and the algorithm selects the next target object. If vtarget.vstate is not noise, ptarget is checked as to whether it is an ε-core object. If ptarget is not an ε-core object, ptarget is determined to be a noise object. In the opposite case, in line 8, the Update procedure is called to discover Nε(ptarget), after which OrderSeed is updated. In lines 9–15, repeated processes are performed on OrderSeed. The object with the highest priority in OrderSeed is selected as the next target object ptarget in line 10. Then, ptarget is set to the processed state in line 11. Next, in line 12, ptarget is inputted to RP with the computed reachability distance. Then, the vertex vtarget, which contains ptarget, is obtained in line 13. In lines 14 and 15, OrderSeed is updated by the Update procedure when vtarget.vstate is not noise and ptarget is an ε-core object. This process is repeated until OrderSeed is empty. When OrderSeed is empty, the next unprocessed object is selected as the next target object ptarget ∈ P. Thereafter, all of the above processes are repeated. When all objects are processed, RP is output. Here, listing the objects in RP results in a reachability plot.

Algorithm 1: C-OPTICS
Input:
(1) P = {p1, p2, ..., pn}: the input dataset
(2) G(V, E): the constraint graph for P
Output:
(1) RP: a reachability plot
Algorithm:
1.  Initialize the list RP⟨object, rdist⟩ and the priority queue OrderSeed⟨object, rdist⟩.
2.  foreach ptarget ∈ P do
3.    if ptarget.Unprocessed then
4.      ptarget.Unprocessed = false.
5.      Put ptarget into RP.
6.      vtarget = ptarget.getVertex().
7.      if vtarget.vstate ≠ noise & ptarget.iscore() then
8.        Update(ptarget, vtarget, OrderSeed).
9.      while OrderSeed ≠ ∅ do
10.       ptarget = OrderSeed.next().
11.       ptarget.Unprocessed = false.
12.       Put ptarget into RP.
13.       vtarget = ptarget.getVertex().
14.       if vtarget.vstate ≠ noise & ptarget.iscore() then
15.         Update(ptarget, vtarget, OrderSeed).
16. end
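The OrderSeed priority queue in Algorithm 1 must support inserting an object and later lowering its reachability distance; one way to realize this (ours, not the authors' implementation) is a heapq-based queue with lazy deletion, sketched below.

```python
import heapq
import itertools
import math

class OrderSeed:
    """Priority queue keyed by reachability distance, with decrease-key semantics
    implemented via lazy deletion (stale heap entries are skipped on pop)."""

    def __init__(self):
        self._heap = []                     # entries: (rdist, tie_breaker, obj)
        self._best = {}                     # current best rdist per object
        self._counter = itertools.count()

    def update(self, obj, rdist):
        """Insert obj or lower its reachability distance."""
        if rdist < self._best.get(obj, math.inf):
            self._best[obj] = rdist
            heapq.heappush(self._heap, (rdist, next(self._counter), obj))

    def next(self):
        """Pop the object with the smallest current reachability distance."""
        while self._heap:
            rdist, _, obj = heapq.heappop(self._heap)
            if self._best.get(obj) == rdist:    # skip stale entries
                del self._best[obj]
                return obj, rdist
        raise KeyError("OrderSeed is empty")

    def __bool__(self):
        return bool(self._best)

# seeds = OrderSeed(); seeds.update("p2", 0.4); seeds.update("p2", 0.3)
# obj, rdist = seeds.next()   # -> ("p2", 0.3)
```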

Algorithm 2 shows the Update procedure that determines whether OrderSeed is updated according to the state of each vertex and the linkage constraints. The inputs for Update are the object ptarget, the vertex vtarget containing ptarget, and the priority queue OrderSeed. The output of Update is the updated OrderSeed. In line 1, the ε-neighborhood Nε(ptarget) of ptarget is obtained. In line 2, an object pneighbor ∈ Nε(ptarget) is selected. In line 3, the processing state of pneighbor is checked. When pneighbor is already processed, a new object pneighbor ∈ Nε(ptarget) is selected; otherwise, in line 4, the vertex vneighbor ∈ V containing pneighbor is obtained. In line 5, ptarget is checked for whether it satisfies the linkage constraint LC(vtarget, vneighbor). If ptarget satisfies LC(vtarget, vneighbor), the reachability distance of pneighbor to ptarget is computed and OrderSeed is updated. In the opposite case, the reachability distance of pneighbor is not computed, and thus OrderSeed is not updated.

Algorithm 2: Update
Input:
(1) ptarget: a target object of dataset P
(2) vtarget: a target vertex of constraint graph G
(3) OrderSeed: a priority queue
Output:
(1) OrderSeed: the updated priority queue
Algorithm:

1.  neighbors = ptarget.getNeighbors(εmax, MinPts)
2.  foreach pneighbor ∈ neighbors do
3.    if pneighbor.Unprocessed then

4.      vneighbor = pneighbor.getVertex().
5.      if LC(vtarget, vneighbor).contains(ptarget) then
6.        rdist = max(ptarget.cdist, dist(ptarget, pneighbor)).
7.        OrderSeed.update(pneighbor, rdist).
8. end
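The Update step can be sketched in the same style. Again, this is only an illustrative assumption of how the linkage-constraint check might look: it reuses the Point and Vertex classes from the previous sketch, and getNeighbors() and satisfiesLinkageConstraint() are hypothetical stand-ins for the constraint-graph lookups rather than the authors' method names.

import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class UpdateSketch {

    // Mirrors Algorithm 2: only neighbors admitted by the linkage constraint
    // LC(v_target, v_neighbor) trigger a reachability-distance computation.
    static void update(Point target, Vertex vTarget,
                       PriorityQueue<Point> orderSeed, double eps, int minPts) {
        for (Point neighbor : getNeighbors(target, eps, minPts)) {       // line 1: N_eps(p_target)
            if (!neighbor.unprocessed) continue;                         // line 3: skip processed objects
            Vertex vNeighbor = neighbor.vertex;                          // line 4
            if (!satisfiesLinkageConstraint(vTarget, vNeighbor, target)) // line 5
                continue;                                                // constraint not met: no computation
            double rdist = Math.max(target.coreDist, dist(target, neighbor)); // line 6
            if (rdist < neighbor.reachDist) {                            // line 7: OrderSeed.update
                neighbor.reachDist = rdist;
                // java.util.PriorityQueue has no decrease-key, so remove and re-insert.
                orderSeed.remove(neighbor);
                orderSeed.add(neighbor);
            }
        }
    }

    // Euclidean distance between two objects.
    static double dist(Point a, Point b) {
        double sum = 0;
        for (int i = 0; i < a.coords.length; i++) {
            double d = a.coords[i] - b.coords[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Hypothetical stubs standing in for the epsilon-neighborhood query and the
    // linkage-constraint test of the constraint graph.
    static List<Point> getNeighbors(Point p, double eps, int minPts) { return Collections.emptyList(); }
    static boolean satisfiesLinkageConstraint(Vertex a, Vertex b, Point p) { return true; }
}

In practice, an indexed heap would avoid the linear-time remove() call in this sketch.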

Example 3. Figure 8 shows the step-by-step clustering process of the plotting step for the sample dataset P. Here, this example assumes that ε = √2 and MinPts = 3, and a constraint graph G(V, E) is created, as shown in Figure 8a. First, a target object pi ∈ P is selected by the input sequence of objects. Thus, p1 is selected (red dot), as shown in Figure 8b. Second, a vertex v21 containing p1 is obtained (blue circle) and the ε-neighborhood of p1, Nε(p1), is obtained (blue dots). Because LC(v21, v0) is satisfied, p5, contained in a vertex v0, is also a neighbor of p1. Conversely, because LC(v21, v33) is not satisfied, the two objects p9 and p10 contained in a vertex v33 are not neighbors of p1. Third, as shown in Figure 8c, the reachability distance of each object contained in Nε(p1) is computed for p1 according to Definition 2. Fourth, the object p2 with the closest reachability distance to p1 is selected as the next target object, as shown in Figure 8d (red dot). Then, the above process is repeated for p2. Note that only objects contained in v21 are considered because no linkage constraint is satisfied. Figure 8e shows the process for p4. Although p4 satisfies LC(v21, v33), the two objects p9 and p10 are not neighbors of p4 because dist(p4, p9) and dist(p4, p10) are greater than ε. Thus, a next target object is not selected from Nε(p4). In this case, the next target object is selected by the input order of the objects that have not been processed. The above process is repeated for all unprocessed objects, creating a unidirectional graph structure that represents the reachability plot, as shown in Figure 8f.


Figure 8. An example of the plotting step for sample dataset P (ε = √2 and MinPts = 3): (a) the constraint graph for the sample dataset P, (b) Nε(p1) that is obtained by using the linkage constraints of p1, (c) the reachability distances of all objects contained in Nε(p1) for p1, (d) the reachability distances of all objects contained in Nε(p2) for p2, (e) Nε(p4), and (f) a unidirectional graph structure that represents the reachability plot.

4. Performance Evaluation

This section presents the experiments designed to evaluate C-OPTICS. The main purposes of these experiments are to provide experimental evidence that C-OPTICS alleviates the quadratic time complexity of OPTICS and that it outperforms the current state-of-the-art algorithms. Section 4.1 presents the meta-information pertaining to the experiments. Afterward, Section 4.2 shows the quality of the reachability plot created by C-OPTICS and the running time achieved by C-OPTICS compared to the state-of-the-art algorithms.

4.1. Experimental Setup

This subsection describes the setup used to perform the experiments. The experiments were run on a machine with a single core (Intel Core i7-8700 3.20 GHz CPU) and 48 GB of memory. The operating system installed was Windows 10 x64. All algorithms used in our experiments were implemented in the Java programming language. Moreover, the maximum Java heap memory in the JVM environment was set to 48 GB. Section 4.1.1 describes the datasets used in the experiments. Section 4.1.2 describes the existing algorithms which are compared with C-OPTICS. Section 4.1.3 describes the approach used to evaluate the clustering quality (accuracy of the reachability plot) of the algorithms.

4.1.1. Datasets

We conducted experiments with three real datasets and two synthetic datasets. The real datasets are termed HT, Household, and PAMAP2, and were obtained from the UCI Repository [26]. First, HT is a ten-dimensional dataset that collects the measured values of home sensors that monitor the temperature, humidity, and concentration levels of various gases, produced in another project [27]. Second, Household is a seven-dimensional dataset that collects the measured values of active energy consumed by each electronic product in the home. Third, PAMAP2 is a four-dimensional dataset that collects the measured values of three inertial forces and the heart rates for 18 physical activities, produced during a project [28]. The synthetic datasets are referred to here as BIRCH2 and Gaussian. First, BIRCH2 is a synthetic dataset for a clustering benchmark produced in earlier work [29]. This paper extended BIRCH2 to one million instances in seven dimensions to evaluate the dimensionality and scalability of the algorithms. Second, Gaussian is a synthetic dataset for benchmarking the clustering quality and the running time of the algorithms according to the number of dimensions. It has a minimum of ten dimensions and a maximum of 50 dimensions. Table 2 presents the properties of all datasets, including their sizes and dimensions. In addition, each dataset was sampled at various sizes to assess the scalability of the algorithms.

Table 2. Meta-information of the datasets.

Dataset      Number of Instances    Number of Dims    Domain of Each Dim    Data Type
HT           928,991                10                [0,1]                 Real
Household    2,049,280              7                 [0,1]                 Real
PAMAP2       3,850,505              4                 [0,1]                 Real
BIRCH2       1,000,000              7                 [0,1]                 Synthetic
Gaussian     500,000                10–50             [0,1]                 Synthetic
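For reproducibility, the Java sketch below shows one plausible way to prepare such inputs: min–max scaling of every dimension into [0,1] (matching the domains in Table 2) and uniform random sampling at a given rate. The paper does not state its exact normalization or sampling procedure, so this is only an assumption.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DatasetPrep {

    // Scales every dimension of the dataset into [0,1] with min-max normalization.
    static double[][] minMaxScale(double[][] data) {
        int dims = data[0].length;
        double[] min = new double[dims], max = new double[dims];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : data) {
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], row[d]);
                max[d] = Math.max(max[d], row[d]);
            }
        }
        double[][] scaled = new double[data.length][dims];
        for (int i = 0; i < data.length; i++) {
            for (int d = 0; d < dims; d++) {
                scaled[i][d] = (max[d] == min[d]) ? 0.0 : (data[i][d] - min[d]) / (max[d] - min[d]);
            }
        }
        return scaled;
    }

    // Draws a uniform random sample of the given rate (e.g., 0.05 for 5%).
    static double[][] sample(double[][] data, double rate, long seed) {
        List<double[]> rows = new ArrayList<>(Arrays.asList(data));
        Collections.shuffle(rows, new Random(seed));
        int keep = (int) Math.round(data.length * rate);
        return rows.subList(0, keep).toArray(new double[0][]);
    }
}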

4.1.2. Competing Algorithms

We compared C-OPTICS with three state-of-the-art algorithms, each of which is representative in a unique sense, as explained below:

• OPTICS: The naïve algorithm [12] with the spatial indexing structure R*-tree to improve range queries;
• DeLi-Clu: A state-of-the-art algorithm that quickly creates an exact reachability plot by improving the single-linkage approach [15];
• SOPTICS: A fast OPTICS algorithm that achieves sub-quadratic time complexity by resorting to approximation using random projection [16].

Our comparisons with the above algorithms had different purposes. The comparisons with OPTICS and DeLi-Clu, which create an exact reachability plot, focus on evaluating the running time of C-OPTICS. As can be observed from the experimental results in Section 4.2, C-OPTICS is superior in terms of running time to both algorithms in all cases.

The comparison with SOPTICS assesses both the clustering quality (accuracy of the reachability plot) and the running time. Here, the main purpose is to demonstrate two phenomena through experiments. First, the reachability plot created by C-OPTICS is robust and accurate in all cases, unlike that by SOPTICS. Second, C-OPTICS outperforms SOPTICS in terms of running time. These outcomes are verified in the experimental results on the three real datasets and two synthetic datasets introduced in Section 4.1.1. All four algorithms, including C-OPTICS, run in a single-threaded environment, with the selected parameter settings for each algorithm differing depending on the dataset. Table 3 lists the parameters of each algorithm and their search ranges.

Table 3. Parameters and search ranges for the compared algorithms.

Algorithm    Parameters and Search Ranges
OPTICS       ε ∈ [0.2, 2.5]; MinPts ∈ {20, 30, 40, 50}
DeLi-Clu     MinPts ∈ {20, 30, 40, 50}
SOPTICS      MinPts ∈ {20, 30, 40, 50}
C-OPTICS     ε ∈ [0.2, 2.5]; MinPts ∈ {20, 30, 40, 50}

4.1.3. Clustering Quality Metrics

We assess the clustering quality of the algorithms with two approaches. The first seeks to present the reachability plots in full to enable a direct visual comparison. This approach, however, fails to quantify the degree of similarity. In addition, the larger the dataset size, the more difficult it is to observe the differences between the reachability plots. To remedy this defect, we use the adjusted Rand index (ARI) with 30-fold cross-validation. ARI is a metric that evaluates the clustering quality based on the degree of similarity through all pair-wise comparisons between extracted clusters [30–32]. ARI returns a real value between 0 and 1, with 1 representing completely identical clustering. However, the reachability plot visualizes only the hierarchy of clusters without extracting the clusters. To measure the ARI of a reachability plot, we extract clusters from it by defining a threshold εt. Thus, the ARI is computed by comparing the clusters of OPTICS with the clusters of each algorithm for εt. In other words, the reachability plot of OPTICS is used as the ground truth to evaluate the algorithms SOPTICS and C-OPTICS. In order to demonstrate the robustness of the clustering quality, the experiment is repeated and the computed minimum and maximum ARIs are compared.
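As an illustration of this evaluation, the Java sketch below extracts flat clusters from a reachability plot with a threshold εt (a new cluster is started whenever the reachability distance rises above εt, a common simplification rather than necessarily the authors' exact extraction rule) and then computes the pair-counting ARI between two such labelings.

import java.util.HashMap;
import java.util.Map;

public class ReachabilityEvaluation {

    // Assigns a cluster id to each position of a reachability plot: a reachability
    // distance above epsT separates consecutive clusters.
    static int[] extractClusters(double[] reachDist, double epsT) {
        int[] labels = new int[reachDist.length];
        int cluster = -1;
        for (int i = 0; i < reachDist.length; i++) {
            if (reachDist[i] > epsT) cluster++;
            labels[i] = Math.max(cluster, 0);
        }
        return labels;
    }

    // Adjusted Rand index computed from the pair-counting contingency table.
    static double adjustedRandIndex(int[] a, int[] b) {
        Map<Long, Integer> cell = new HashMap<>();
        Map<Integer, Integer> rowSum = new HashMap<>();
        Map<Integer, Integer> colSum = new HashMap<>();
        for (int i = 0; i < a.length; i++) {
            cell.merge(((long) a[i] << 32) | (b[i] & 0xffffffffL), 1, Integer::sum);
            rowSum.merge(a[i], 1, Integer::sum);
            colSum.merge(b[i], 1, Integer::sum);
        }
        double sumCells = 0, sumRows = 0, sumCols = 0;
        for (int n : cell.values()) sumCells += choose2(n);
        for (int n : rowSum.values()) sumRows += choose2(n);
        for (int n : colSum.values()) sumCols += choose2(n);
        double expected = sumRows * sumCols / choose2(a.length);
        double maxIndex = 0.5 * (sumRows + sumCols);
        return (sumCells - expected) / (maxIndex - expected);
    }

    static double choose2(double n) { return n * (n - 1) / 2.0; }

    public static void main(String[] args) {
        // Two hypothetical reachability plots (ground truth vs. plot under test).
        double[] groundTruth = {2.0, 0.3, 0.2, 0.4, 2.1, 0.3, 0.2};
        double[] underTest   = {2.0, 0.3, 0.2, 0.4, 2.1, 0.3, 0.2};
        double epsT = 1.0;
        int[] c1 = extractClusters(groundTruth, epsT);
        int[] c2 = extractClusters(underTest, epsT);
        System.out.println(adjustedRandIndex(c1, c2));   // prints 1.0 for identical plots
    }
}

Running this on two identical plots yields an ARI of 1, which is the behavior one would expect of an exact method under this evaluation.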

4.2. Experimental Results

In this subsection, we present experimental evidence of the robustness of the clustering quality and the superior computational efficiency of C-OPTICS in a comparison with the three state-of-the-art algorithms OPTICS, DeLi-Clu, and SOPTICS. Section 4.2.1 describes the experimental results on the clustering quality of C-OPTICS and SOPTICS based on the two approaches used to evaluate the clustering quality introduced in Section 4.1.3. Section 4.2.2 presents the results of the experiments on scalability and dimensionality to evaluate the computational efficiency of C-OPTICS.

4.2.1. Clustering Quality

In this subsection, we present the robustness of the clustering quality of C-OPTICS through a comparison with OPTICS and SOPTICS. Here, DeLi-Clu is excluded from the evaluation of clustering quality because it creates a reachability plot identical to that of OPTICS. Note that the clustering quality of each algorithm is evaluated on the basis of OPTICS, as mentioned in Section 4.1.3. To assess the clustering quality according to the two approaches mentioned in Section 4.1.3, the reachability plots are directly compared first.

Figure 9 presents a visual comparison of the reachability plots for PAMAP2. Figure 9a–c show the reachability plots for OPTICS, C-OPTICS, and SOPTICS, respectively. The reachability plots for each algorithm are similar; however, many differences are observed based on the threshold ε2. Figure 9a,b show that four identical clusters (C1, C2, C3, and C4) are extracted from OPTICS and C-OPTICS. On the other hand, Figure 9c shows that, unlike OPTICS, two clusters are extracted from SOPTICS. This result is due to the accumulation of deformations in the dataset by repeated random projection, which creates a reachability plot different from that by OPTICS. Conversely, C-OPTICS creates a reachability plot identical to that by OPTICS, as shown in Figure 9b, as it identifies the essential distance computations needed to guarantee an exact reachability plot.


Figure 9. Visual comparison of reachability plots for PAMAP2: (a) OPTICS; (b) C-OPTICS; (c) SOPTICS.

Figure 10 shows the ARI with respect to the value of MinPts for each algorithm. Here, each dot on a curve gives the minimum ARI and is associated with a vertical bar that indicates the corresponding maximum ARI. This experiment compared the influence of MinPts on the ARI of each algorithm for the three real datasets HT, Household, and PAMAP2, and one synthetic dataset, BIRCH2. Figure 10 shows that C-OPTICS creates a reachability plot for all datasets. C-OPTICS guarantees that the reachability distance of each object is identical to that of OPTICS through the linkage constraints, and thus the ARI of C-OPTICS is 1. This means that a reachability plot formed by C-OPTICS is identical to that of OPTICS. Furthermore, these results show that C-OPTICS is robust with respect to MinPts and does not depend on it. SOPTICS creates different reachability plots for all datasets. Furthermore, the maximum and minimum ARI outcomes are influenced by the value of MinPts. In most cases, when MinPts is small, the difference between the minimum and maximum ARI is large: the smaller the value of MinPts, the more often the random projection is performed, and the more the deformations can accumulate.

Figure 10. Clustering quality quantified by adjusted Rand index (ARI) vs. MinPts: (a) HT; (b) Household; (c) PAMAP2; (d) BIRCH2.

Figure 11 shows the ARI outcomes with respect to the number of dimensions for the Gaussian dataset. This experiment focuses on the dependence of the ARI of each algorithm on the number of dimensions. According to Figure 11, C-OPTICS guarantees an ARI of 1 even if the number of dimensions increases. In other words, C-OPTICS creates a reachability plot identical to that by OPTICS regardless of the number of dimensions. The strategy for improving the computational efficiency of C-OPTICS is not dependent on the number of dimensions. However, as the number of dimensions increases, the minimum ARI decreases for SOPTICS. This occurs because the random projection of SOPTICS increases as the number of dimensions increases, as with MinPts. Thus, the difference between the minimum and maximum ARI becomes large, as shown in Figure 11.

Figure 11. Clustering quality quantified by ARI vs. dimensionality for the Gaussian dataset.

4.2.2. Computational Efficiency

This subsection evaluates the computational efficiency of C-OPTICS and the three state-of-the-art algorithms using the three real datasets HT, Household, and PAMAP2, and the two synthetic datasets BIRCH2 and Gaussian (50 dimensions). Although there are distribution-based algorithms which improve the running time of OPTICS, this paper excludes those algorithms, as we do not consider a distributed environment. Figure 12 shows the result of the comparison of running time for each algorithm according to the sampled ratios for the two synthetic datasets. Note that the y-axis is presented on a logarithmic scale. In addition, if the running time of an algorithm exceeds 10⁵ s, it does not appear on the graph. This indicates that the algorithms did not terminate within 24 h and therefore were not considered for further experiments. As shown in Figure 12a, at a sampling rate of 5% for BIRCH2, C-OPTICS improved the running time fivefold over OPTICS. As the sampling rate increases (i.e., as the size of the dataset increases), C-OPTICS improves the running time by up to 25 times over OPTICS. C-OPTICS even improved the running time by up to eight times over SOPTICS, the fastest among the other algorithms. Figure 12b shows the results of the running time comparison for the 50-dimensional Gaussian dataset. Here, C-OPTICS shows an improvement of up to 100 times over OPTICS, as the efficiency of the spatial indexing structure at high dimensions is significantly decreased. These results also show a similar trend in the running time comparison with DeLi-Clu. For SOPTICS, which does not use a spatial indexing structure, a running time similar to that of C-OPTICS arises because it is not influenced by the number of dimensions.

Similar trends were observed in the experiments on the three real datasets, as shown in Figure 13. As the sampling rate increased, the running time of C-OPTICS improved significantly compared to that of OPTICS. For the HT dataset in Figure 13a, C-OPTICS shows an improvement of as much as 50 times over OPTICS and up to nine times over SOPTICS. For the Household dataset in Figure 13b, the results for C-OPTICS are improved by up to 150 times over OPTICS. Conversely, C-OPTICS shows a running time nearly identical to that of SOPTICS. This corresponds to the near-worst case for C-OPTICS, where most of the objects are contained in a few vertices. Nevertheless, it is important to note that C-OPTICS still shows an improvement over SOPTICS. Likewise, for the PAMAP2 dataset in Figure 13c, C-OPTICS overwhelms the other algorithms. These experimental results show that C-OPTICS outperforms the other algorithms in terms of computational efficiency.
that C-OPTICS These outperforms experimental the results other algorithmsshow that C in- termsOPTICS of computationaloutperforms the effi otherciency. algorithms in terms of computational efficiency.

(a) (b) (a) (b) Figure 12. Running time vs. sampling rate for synthetic datasets: (a) BIRCH2; (b) Gaussian (50푑). BIRCH2; Gaussian d FigureFigure 12.12. Running time vs.vs. sampling rate for synthetic datasets: ( a) BIRCH2; (b) Gaussian (50푑).).



Figure 13. Running time vs. sampling rate for real datasets: (a) HT; (b) Household; (c) PAMAP2.

To provide an additional direct comparison of the running time for the algorithms, we compared the rate of reduction of the total number of distance computations for each algorithm with OPTICS. Figure 14 shows the experimental results for all datasets, where it can be observed that the total number of distance computations for C-OPTICS is significantly reduced. Again, it should be noted that the y-axis is presented on a logarithmic scale. C-OPTICS reduces the distance computations by more than ten times in all cases compared to OPTICS. This occurs because C-OPTICS identifies and excludes unnecessary distance computations based on the constraint graph.

Figure 14. Comparison of the total number of distance computations.

We conducted experiments on Gaussian datasets of various dimensions to evaluate the dimensionality of C-OPTICS experimentally. In general, the running time increases with the number of dimensions, and the number of distance computations grows proportionally. Figure 15 shows linear time complexity with respect to the number of dimensions for C-OPTICS, as for SOPTICS. In contrast, OPTICS and DeLi-Clu show an exponential increase in the running time as the number of dimensions increases; their computational efficiency decreases and the running time grows exponentially. As a result, C-OPTICS is shown to improve the scalability significantly and can address the quadratic time complexity of OPTICS while guaranteeing the quality of a reachability plot.

Figure 15. Running time vs. dimensionality for the Gaussian (50d) dataset.

5. Conclusions

In this paper, we proposed C-OPTICS, which improves the running time of OPTICS by reducing unnecessary distance computations to address the quadratic time complexity issue of OPTICS. C-OPTICS partitions a d-dimensional dataset into unit cells which have an identical diagonal length and constructs a constraint graph. Subsequently, C-OPTICS only computes the reachability distance for each object that appears in the reachability plot through the linkage constraints in the constraint graph.

We conducted experiments on synthetic and real datasets to confirm the scalability and efficiency of C-OPTICS. Specifically, C-OPTICS outperformed the state-of-the-art algorithms. Experimental results show that C-OPTICS addressed the quadratic time complexity of OPTICS. Specifically, the running time with regard to the data size is improved by as much as 10² times over DeLi-Clu. In addition, the running time is improved up to nine times over SOPTICS, which creates an approximate reachability plot. We also conducted experiments on dimensionality. These experimental results show that C-OPTICS has robust clustering quality and linear time complexity regardless of the number of dimensions.

Future research can consider methods by which the proposed algorithm can be improved. For example, C-OPTICS can be improved by having it construct a constraint graph without depending on the radius ε. This can provide a solution to the worst case of C-OPTICS. In addition, C-OPTICS can be improved to enable GPU-based parallel processing to accelerate the construction of the constraint graph.

Author Contributions: J.-H.K. and J.-H.C. designed the algorithm. J.-H.K. performed the bibliographic review and writing of the draft and developed the proposed algorithm. A.N. shared his expertise with regard to the overall review of this paper. A.N., K.-H.Y., and W.-K.L. supervised the entire process. Funding: This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A3B03035729). This work was also supported by National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2018R1A2B6009188). Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Wang, Z.; Yu, Z.; Chen, C.P.; You, J.; Gu, T.; Wong, H.S.; Zhang, J. Clustering by local gravitation. IEEE T. Cybern. 2017, 48, 1383–1396. [CrossRef][PubMed] 2. Li, Z.; Chen, J. Superpixel segmentation using linear . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1356–1363. 3. Fang, Z.; Yu, X.; Wu, C.; Chen, D.; Jia, T. Superpixel Segmentation Using Weighted Coplanar Feature Clustering on RGBD Images. Appl. Sci. 2018, 8, 902. [CrossRef] 4. Torti, E.; Florimbi, G.; Castelli, F.; Ortega, S.; Fabelo, H.; Callicó, G.; Marrero-Martin, M.; Leporati, F. Parallel K-Means clustering for brain cancer detection using hyperspectral images. Electronics 2018, 7, 283. [CrossRef] 5. Han, C.; Lin, Q.; Guo, J.; Sun, L.; Tao, Z. A Clustering Algorithm for Heterogeneous Wireless Sensor Networks Based on Solar Energy Supply. Electronics 2018, 7, 103. [CrossRef] 6. Al-Shalabi, M.; Anbar, M.; Wan, T.C.; Khasawneh, A. Variants of the low-energy adaptive clustering hierarchy protocol: Survey, issues and challenges. Electronics 2018, 7, 136. [CrossRef] 7. Panapakidis, I.P.; Michailides, C.; Angelides, D.C. Implementation of Pattern Recognition Algorithms in Processing Incomplete Wind Speed Data for Energy Assessment of Offshore Wind Turbines. Electronics 2019, 8, 418. [CrossRef] 8. Zhang, T.; Haider, M.; Massoud, Y.; Alexander, J. An Oscillatory Neural Network Based Local Processing Unit for Pattern Recognition Applications. Electronics 2019, 8, 64. [CrossRef] 9. Yaohui, L.; Zhengming, M.; Fang, Y. Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy. Knowl. Based Syst. 2017, 133, 208–220. [CrossRef] 10. Zaiane, O.R.; Foss, A.; Lee, C.H.; Wang, W. On data clustering analysis: Scalability, constraints, and validation. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Taipei, Taiwan, 6–8 May 2002; pp. 28–39. 11. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. 12. Ankerst, M.; Breunig, M.; Kriegel, H.P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; pp. 49–60. Electronics 2019, 8, 1094 23 of 23

13. Patwary, M.A.; Palsetia, D.; Agrawal, A.; Liao, W.K.; Manne, F.; Choudhary, A. Scalable parallel OPTICS data clustering using graph algorithmic techniques. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 17–22 November 2013; pp. 49–60. 14. Gunawan, A.; de Berg, M. A Faster Algorithm for DBSCAN. Master’s Thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, March 2013. 15. Achtert, E.; Böhm, C.; Kröger, P. DeLi-Clu: Boosting robustness, completeness, usability, and efficiency of hierarchical clustering by a closest pair ranking. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 9–12 April 2006; pp. 119–128. 16. Schneider, A.; Vlachos, M. Scalable density-based clustering with quality guarantees using random projections. Data Min. Knowl. Discov. 2017, 31, 972–1005. [CrossRef] 17. Beckmann, N.; Kriegel, H.P.; Schneider, R.; Seeger, B. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, USA, 23–25 May 1990; pp. 322–331. 18. Brecheisen, S.; Kriegel, H.P.; Pfeifle, M. Multi-step density-based clustering. Knowl. Inf. Syst. 2006, 9, 284–308. [CrossRef] 19. Lee, W.; Loh, W.K. G-OPTICS: Fast ordering density-based cluster objects using graphics processing units. Int. J. Web Grid Serv. 2018, 14, 273–287. [CrossRef] 20. Breunig, M.M.; Kriegel, H.P.; Sander, J. Fast hierarchical clustering based on compressed data and optics. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, 13–16 September 2000; pp. 232–242. 21.V ágner, A. The GridOPTICS clustering algorithm. Intell. Data Anal. 2016, 20, 1061–1084. [CrossRef] 22. Stuetzle, W. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 2003, 20, 25–47. [CrossRef] 23. Hartigan, J.A.; Mohanty, S. The runt test for multimodality. J. Classif. 1992, 9, 63–70. [CrossRef] 24. Campello, R.J.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 2015, 10, 5–56. [CrossRef] 25. Bryant, A.; Cios, K. RNN-DBSCAN: A density-based clustering algorithm using reverse nearest neighbor density estimation. IEEE Trans. Knowl. Data Eng. 2018, 30, 1109–1121. [CrossRef] 26. Blake, C.; Merz, C. UCI Repository of Machine Learning Database; UCI: Irvine, CA, USA, 1998. 27. Huerta, R.; Mosqueiro, T.; Fonollosa, J.; Rulkov, N.F.; Rodriguez-Lujan, I. Online decorrelation of humidity and temperature in chemical sensors for continuous monitoring. Chemom. Intell. Lab. Syst. 2016, 157, 169–176. [CrossRef] 28. Reiss, A.; Stricker, D. Introducing a new benchmarked dataset for activity monitoring. Proceedings of International Symposium on Wearable Computers, Boston, MA, USA, 11–15 November 2012; pp. 108–109. 29. Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, QC, Canada, 4–6 June 1996; pp. 103–114. 30. Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [CrossRef] 31. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218. [CrossRef] 32. Vinh, N.X.; Epps, J.; Bailey, J. 
Information theoretic measures for clustering comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).