
A Dispersive Degree Based Clustering Algorithm Combined with Classification

Xianchao Zhang, Shimin Shan, Zhihang Yu, He Jiang
School of Software, Dalian University of Technology, Dalian 116620, P. R. China

Abstract

The various-density problem has become one of the focuses of density-based clustering research. A novel dispersive-degree-based algorithm combined with classification, called CDDC, is presented in this paper to remove this hurdle. In CDDC, a sequence is established that depicts the data distribution, discriminates core points and classifies edge points; clusters are discovered by utilizing the revealed information. Several experiments are performed, and the results suggest that CDDC is effective in handling the various-density problem and is more efficient than well-known algorithms such as DBSCAN, OPTICS and KNNCLUST.

Keywords: Clustering analysis, Various-density, Dispersive degree

1. Introduction

One of the primary data mining tasks is clustering analysis. The goal of a clustering algorithm is to group the objects of a database into a set of meaningful subclasses (clusters). A clustering algorithm can be used either as a stand-alone tool to gain insight into the distribution of a data set, or as a preprocessing step for other algorithms which operate on the detected clusters. Applications of clustering include, for instance, computational analysis, medical diagnosis and web retrieval. Each of these applications places specific requirements on the algorithm [1], such as scalability, the ability to deal with different types of attributes, the ability to handle dynamic data, discovery of clusters with arbitrary shape, minimal requirements for domain knowledge, the ability to deal with noise and outliers, insensitivity to the order of input records, support for high dimensionality, incorporation of user-specified constraints, and interpretability and usability, all of which make the research of clustering algorithms challenging and attractive.

Many clustering algorithms have been proposed so far, among which are partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, etc. Density-based clustering is an important branch of cluster analysis, with the advantages of discovering clusters of arbitrary shape and of insensitivity to noise. The basic idea is to use a local cluster criterion, in which clusters are defined as regions of the data space where the objects are dense, separated from one another by low-density regions. There are several notable density-based algorithms: DBSCAN, OPTICS and KNNCLUST [2]-[4].

The key idea of DBSCAN is that, for each object of a cluster, its neighborhood at a given radius must contain at least a minimum number of objects (MinPts); density-connected objects are then grouped into clusters. It is effective in discovering clusters of arbitrary shape and can deal with noise, but the two fixed input parameters, ε and MinPts, weaken its ability to handle data sets with varying densities.

Instead of producing clusters explicitly, OPTICS creates an augmented ordering that represents the density-based clustering structure of a data set. This ordering is equivalent to clustering over a broad range of parameter settings, which overcomes the drawbacks of DBSCAN to some extent. However, the result of OPTICS is a tree of clustering structure, in which every node is a cluster and the children of a node are subclusters of their parent; in other words, OPTICS cannot assign an object to exactly one cluster. KNNCLUST combines nonparametric k-nearest-neighbor density estimation with Bayes' decision rule to cluster objects. Although this technique makes it possible to model clusters of different densities and to identify the number of clusters automatically, it is a "hard" algorithm which assigns each object to one and only one cluster, which means that KNNCLUST cannot identify noise. Furthermore, it is less suited for finding clusters of irregular shape.

Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering results. Furthermore, for many real data sets there is no global parameter setting which describes the intrinsic clustering structure accurately. For density-based clustering, this problem manifests itself in the fact that the intrinsic cluster structure of many real data sets with varying densities cannot be characterized by global density parameters. Most of the widespread algorithms are not effective on practical data sets because they cannot cope with such varying densities. The various-density problem has thus become one of the focuses of density-based clustering research. In this paper, a new dispersive-degree-based algorithm combined with classification, called CDDC (Clustering using Dispersive Degree and Classification), is developed to tackle these situations. Inspired by OPTICS and KNNCLUST, CDDC first computes a dispersive degree from KNN distances, produces an order (a scanning sequence) that depicts the clustering information, then partitions the order into core and edge points, and finally assigns the edge points to clusters using KNN-kernel density estimation.

The rest of the paper is organized as follows. The new dispersive-degree-based algorithm combined with classification, CDDC, is given in Section 2. In Section 3, its performance is evaluated in simulation experiments and compared with the results of DBSCAN, OPTICS and KNNCLUST. Finally, the work is summarized in Section 4. In order to facilitate the description of CDDC, simple 2-dimensional points are used to represent objects in the data space throughout this paper.

2. CDDC clustering algorithm

2.1. Density-based clustering

Density-based clustering algorithms are built on definitions of density and cluster, which are given as follows.

Definition 1. (Neighborhood of a point) Let D be a set of points. The neighborhood of a point p in D wrt. a given radius ε, denoted by Neighbors(p, ε), is defined by Neighbors(p, ε) = {q ∊ D | dist(p, q) ≤ ε}, where dist(p, q) is the distance between p and q.

Definition 2. (Density of a point) The density of a point p wrt. ε in D, denoted by Density(p, ε), is defined by Density(p, ε) = |Neighbors(p, ε)|.

Definition 3. (Density-based cluster) A cluster C is a non-empty subset of D, in which the points have higher density and a uniform density distribution.

Definition 4. (Noise) Let C1, C2, …, Cn be the n clusters in D, i = 1, 2, …, n. Noise is defined by Noise = {p ∊ D | ∀i: p ∉ Ci}.

It can be deduced that points in clusters satisfy the following lemmas:

Lemma 1. Points in clusters have higher density than noise points: Density(p, ε) > Density(q, ε), where p ∊ Ci and q ∊ Noise.

Lemma 2. The density difference between two points in the same cluster is smaller than that between a point in a cluster and a point in noise: |Density(o, ε) − Density(p, ε)| < |Density(o, ε) − Density(q, ε)|, where o, p ∊ Ci and q ∊ Noise.

Definition 5. (Outlier degree) The outlier degree of a point p wrt. ε, denoted by ODegree(p, ε), is defined by

  ODegree(p, ε) = (1 / |Neighbors(p, ε)|) × Σ_i Density(p, ε) / Density(qi, ε)

where ∀i: qi ∊ Neighbors(p, ε).

Following Lemma 2, it is obvious that points inside clusters have an ODegree close to 1. The farther the ODegree of a point is from 1, the higher its probability of being noise.

2.2. Frame of the CDDC algorithm

The CDDC algorithm has four main steps:
• Compute dispersive degrees
• Scan the data
• Divide the scan order
• Classify edge points

These steps are introduced in the following four subsections, respectively.

2.3. Computing the dispersive degree

Definition 6. (KNN distance) Let p be a point in a data set D. p's KNN distance is the distance from p to its kth nearest neighbor, denoted by KNN-dist(p, k).

For a given k, the larger the density of p, the smaller its KNN-dist, so KNN-dist represents the density of p from another perspective. This idea of KNN density was first introduced by Loftsgaarden and Quesenberry [5] and is redefined in this paper as follows:

Definition 7. (KNN density) The KNN density of p in D, denoted by KNN-density, is the reciprocal of p's KNN-dist.

The outlier degree can then be redefined with the KNN density:

Definition 8. (K outlier degree) The K outlier degree of p, denoted by ODegree(p, k), is defined by

  ODegree(p, k) = (1 / (|Neighbors(p, ε′)| − 1)) × Σ_i Density(p, k) / Density(qi, k)

where ε′ = KNN-dist(p, k) and ∀i: qi ∊ Neighbors(p, ε′), qi ≠ p.

Definition 9. (Dispersive degree) The dispersive degree of p in a data set D wrt. k, denoted by DDegree(p, k), is defined by DDegree(p, k) = max(ODegree(p, k), ODegree(p, k)⁻¹) − 1.

It follows from the definition that the smaller the DDegree of a point, the more likely the point lies inside a cluster; conversely, a large DDegree indicates noise.

In this step, CDDC finds the k nearest neighbors of each point in the data set, records the distances to every neighbor, and computes the DDegree from the KNN-dist of the points. This procedure is illustrated by Figure 1.

ComputeDDegree(DataSet, k)
  FORALL Point FROM DataSet DO
    Point.FindNeighbors(DataSet, k);
    Point.knn_dist = Point.Neighbors[k].dist;
  FORALL Point FROM DataSet DO
    FORALL Neighbor FROM Point.Neighbors[] DO
      Point.ODegree = Point.ODegree + Neighbor.knn_dist;
    Point.ODegree = Point.ODegree / (Point.knn_dist * Point.Neighbors[].size);
    Point.DDegree = Max(Point.ODegree, 1 / Point.ODegree) - 1;
  END

Fig. 1: Procedure ComputeDDegree.

The array Point.Neighbors[] preserves all points except Point itself that lie in the neighborhood of Point wrt. Point.knn_dist, together with their distances to Point. Because more than one point may lie at exactly the distance Point.knn_dist from Point, the size of the array may be larger than k − 1.
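For concreteness, the following Python fragment sketches the DDegree computation of Definitions 6-9 and Figure 1. It is our own illustration rather than the authors' code: the names (compute_ddegree, knn_dist, ddegree) are assumptions, Euclidean distance is used, and the neighbor search is brute force for clarity.

  import numpy as np

  def compute_ddegree(points, k):
      # Sketch of Definitions 6-9: KNN distance, K outlier degree, dispersive degree.
      # points: (n, d) array of distinct points; k: number of neighbors.
      n = len(points)
      # Pairwise distances (brute force; a spatial index would reduce this cost).
      diff = points[:, None, :] - points[None, :, :]
      dist = np.sqrt((diff ** 2).sum(axis=-1))
      order = np.argsort(dist, axis=1)                 # column 0 is the point itself
      knn_dist = dist[np.arange(n), order[:, k]]       # distance to the k-th neighbor
      ddegree = np.empty(n)
      for i in range(n):
          # Neighbors wrt. knn_dist[i], excluding the point itself (Definition 8).
          nbrs = np.where((dist[i] <= knn_dist[i]) & (np.arange(n) != i))[0]
          # ODegree(p, k) = mean of Density(p)/Density(q) = mean of knn_dist(q)/knn_dist(p).
          odeg = np.mean(knn_dist[nbrs] / knn_dist[i])
          # DDegree(p, k) = max(ODegree, 1/ODegree) - 1 (Definition 9).
          ddegree[i] = max(odeg, 1.0 / odeg) - 1.0
      return knn_dist, ddegree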

2.4. Scanning the data

CDDC creates an order of the data set and, in addition, stores the dispersive degree of each point. It can be seen that this order is sufficient to extract the density-based clusters. Figure 2 gives the pseudo-code of the procedure ScanData.

Function GetNextStart() selects the not-yet-scanned point with the smallest DDegree as the start of a new scan pass and inserts it into SeedsQueue, in which points are sorted in ascending order of DDegree. Function GetFirst() fetches the first element of SeedsQueue and adds it to ScanOrder. All neighbors of that point are then checked: if a neighbor has not been scanned and is not yet in SeedsQueue, it is inserted into SeedsQueue. This repeats until SeedsQueue is empty, at which point GetNextStart() is called again to start a new pass. The procedure stops when all points have been added to ScanOrder.

ScanData(DataSet, ScanOrder)
  WHILE DataSet.GetNextStart() <> NULL DO
    SeedsQueue.Insert(DataSet.GetNextStart());
    WHILE NOT SeedsQueue.Empty() DO
      CurrentPoint = SeedsQueue.GetFirst();
      ScanOrder.Add(CurrentPoint);
      CurrentPoint.scanned = TRUE;
      FORALL Neighbor FROM CurrentPoint.Neighbors[] DO
        IF Neighbor.scanned = FALSE AND NOT SeedsQueue.Contain(Neighbor)
          SeedsQueue.Insert(Neighbor);
  END

Fig. 2: Procedure ScanData.

GetNextStart() ensures that a scan pass starts at a point inside a cluster, which can be inferred from Lemma 1; GetFirst() ensures that the point most likely to belong to the same cluster as the last point in ScanOrder is added next, in conformity with Lemma 2.

The scan order of a data set can be represented and understood graphically: the clustering structure can be seen if the dispersive degrees are plotted for each point in ScanOrder. Figure 3 depicts the DDegree-plot for a very simple data set containing two uniform clusters and one Gaussian cluster of different densities.

Fig. 3: Illustration of the ScanOrder (dispersive degree plotted against scan order, k = 20).
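As a rough illustration of this scanning step, the Python sketch below follows the logic of Figure 2. It assumes the DDegree values and neighbor lists produced in step 1, uses a binary heap as the seeds queue, and all names are ours rather than the paper's.

  import heapq

  def scan_data(ddegree, neighbors):
      # Sketch of Figure 2: produce a scan order of point indices.
      # ddegree: per-point DDegree values; neighbors: per-point neighbor-index lists.
      n = len(ddegree)
      scanned = [False] * n
      in_queue = [False] * n
      scan_order = []
      # Scan passes are started in ascending order of DDegree (GetNextStart).
      for start in sorted(range(n), key=lambda i: ddegree[i]):
          if scanned[start]:
              continue
          seeds = [(ddegree[start], start)]        # priority queue ordered by DDegree
          in_queue[start] = True
          while seeds:
              _, current = heapq.heappop(seeds)    # GetFirst(): smallest DDegree first
              if scanned[current]:
                  continue
              scan_order.append(current)
              scanned[current] = True
              for nb in neighbors[current]:
                  if not scanned[nb] and not in_queue[nb]:
                      heapq.heappush(seeds, (ddegree[nb], nb))
                      in_queue[nb] = True
      return scan_order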
2.5. Dividing the scan order

Careful observation shows that points in regions of low DDegree are the core points of clusters, distributed in the vicinity of the cluster centers, whereas points in regions of high DDegree are mainly cluster edges or noise. Based on this fact, CDDC divides the scan order of a data set into sets of core points and corresponding edge points.

Definition 10. (Downgrade point) Let pi be a point in a ScanOrder S, i = 1, 2, …, n, where n is the size of D. If the DDegree of pi is larger than that of pi+1, pi is a downgrade point, denoted by DP. DPs = {pi ∊ S | DDegree(pi) > DDegree(pi+1)}.

Definition 11. (Contour point) Let pi be a point in a ScanOrder S, i = 1, 2, …, n, where n is the size of D. The contour point of pi, denoted by CP(pi), is the first point after pi whose DDegree is larger than that of pi: CP(pi) = pj, where j = min{l > i | DDegree(pl) > DDegree(pi)}.

Definition 12. (Peak point) Let DPs be the set of downgrade points of a ScanOrder. Peak points, denoted by PPs, are the subset of DPs in which the DDegrees of the points are monotonically increasing and satisfy the following conditions: the DDegree of the point must be larger than a given threshold dt, and the number of points between a downgrade point p and its contour point CP(p) must be larger than a given threshold mt.

Definition 13. (Dividing point) Let p be a peak point in the set of peak points of a ScanOrder S. If p has the largest DDegree in the set, it is a dividing point of S.

For instance, Figure 4 shows a fragment of a scan order. Points A, B, C, D and E are downgrade points, and C′ and D′ are the contour points of C and D. D is a peak point, but C and E are not, because C′ is not "far" enough from C (the number of points between them is less than mt) and E is too "low" (its DDegree is smaller than dt). Point D is the dividing point of the scan order.

Fig. 4: Downgrade, contour, peak and dividing points.

Figure 5 shows the procedure DivideScanOrder. In this step, CDDC iteratively splits the current order into two halves with the function BinSplit() at the position of its dividing point. The first half is pushed onto a stack of orders, and the second half is processed next as the current order, until it cannot be divided further, i.e. until FindPeaks() finds no peaks in the current order. The function Filter() then separates core points from edge points, with the cutoff at the contour point of the current order's last dividing point. When the stack is empty, the scan order has been divided into sets of core points and edge points.

DivideScanOrder(ScanOrder, CoreSet, EdgeSet, dt, mt)
  Stack.push(ScanOrder);
  WHILE NOT Stack.Empty() DO
    IF CurrentOrder = NULL
      CurrentOrder = Stack.pop();
    CurrentOrder.FindPeaks(dt, mt);
    IF NOT CurrentOrder.Peaks.Empty()
      CurrentOrder.BinSplit();
    ELSE
      CurrentOrder.Filter(CoreSet, EdgeSet);
      CurrentOrder = NULL;
  END

Fig. 5: Procedure DivideScanOrder.
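A simplified Python sketch of the dividing step follows. It mirrors the stack-based loop of Figure 5 under our reading of Definitions 10-13; the helper find_dividing_point is our own name, and the Filter() cutoff at the contour point of the last dividing point is only noted, not reproduced.

  def find_dividing_point(dd, dt, mt):
      # Return the index of the dividing point of the sub-order dd, or None.
      # A downgrade point i (dd[i] > dd[i+1]) qualifies as a peak if dd[i] > dt and its
      # contour point (first later index with a larger DDegree) is more than mt positions
      # away; the peak with the largest DDegree is the dividing point (Definitions 10-13).
      best = None
      for i in range(len(dd) - 1):
          if dd[i] <= dd[i + 1] or dd[i] <= dt:
              continue
          contour = next((j for j in range(i + 1, len(dd)) if dd[j] > dd[i]), len(dd))
          if contour - i > mt and (best is None or dd[i] > dd[best]):
              best = i
      return best

  def divide_scan_order(scan_order, ddegree, dt, mt):
      # Sketch of Figure 5: recursively split the scan order at dividing points.
      # Returns the final sub-orders; the paper's Filter() step, which cuts each
      # sub-order into core and edge points, is not reproduced here.
      segments = []
      stack = [list(scan_order)]
      while stack:
          current = stack.pop()
          dd = [ddegree[p] for p in current]
          split = find_dividing_point(dd, dt, mt)
          if split is not None:
              stack.append(current[:split + 1])   # BinSplit(): first half back on the stack
              stack.append(current[split + 1:])   # second half is handled next
          else:
              segments.append(current)            # no peaks left: a final sub-order
      return segments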

2.6. Classifying edge points

Each set of core points is considered a cluster. The points at the end of the scan order always have the highest dispersive degrees in the data set; therefore the last set of edge points, in the tail of the scan order, is considered to belong to no cluster but to noise. Step 4 of CDDC classifies the remaining edge points into these clusters or noise. There are many classification algorithms; we choose the KNN-kernel density-based method [6], because it is suited to data with varying densities and can fully reuse the KNN information extracted in step 1. The KNN-kernel method was developed by Terrell and Scott [7]:

Definition 14. (KNN-kernel density estimate) The KNN-kernel density estimate at an object x in an N×d data set with kernel K is defined as

  f̂(x) = (1 / (N·Vx)) Σ_{i=1..N} K((x − xi) ./ Hx)

where Vx is the data volume adjusted to include the KNN objects and Hx is a scale vector [h¹x … h^d x] of Vx.

The most common classification rules are based on Bayes' decision rule:

  p(x | Ci)·p(Ci) > p(x | Cj)·p(Cj), ∀ j ≠ i

where p(x | Ci) is the class-conditional density at x for class Ci and p(Ci) is the prior probability. The class-conditional density can be estimated with the KNN-kernel defined above:

  p̂(x | Ci) = (1 / (ni·Vx)) Σ_{xj ∈ Ci} K((x − xj) ./ Hx)

where ni is the size of cluster Ci. Bayes' decision rule can then be rewritten with the KNN-kernel class-conditional densities as

  (1 / (ni·Vx)) Σ_{xl ∈ Ci} K((x − xl) ./ Hx) · p(Ci) > (1 / (nj·Vx)) Σ_{xl ∈ Cj} K((x − xl) ./ Hx) · p(Cj)

p(Ci) and p(Cj) are normally estimated by ni/N and nj/N, so the expression above can be simplified to

  Σ_{xl ∈ Ci} K((x − xl) ./ Hx) > Σ_{xl ∈ Cj} K((x − xl) ./ Hx)

In CDDC, an edge point p is assigned to the cluster Ci (noise being denoted by C0 here) that maximizes the priority function

  Prob(Ci) = Σ_{ql ∈ Ci} K( dis(p, ql) / dis(p, o) )

where the dissimilarity dis(p, q) is defined by

  dis(p, q) = dist(p, q) × |DDegree(p) − DDegree(q)|

o is the most dissimilar k-nearest-neighbor of p, and K is the triangular kernel defined by K(z) = 1 − |z| if |z| < 1, and K(z) = 0 otherwise.
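The Python fragment below sketches this edge-point assignment rule as we read it. The names are ours, the sum over a cluster is restricted to p's k nearest neighbors for simplicity, and the handling of the noise class C0 is left to the caller.

  import numpy as np

  def triangular_kernel(z):
      # K(z) = 1 - |z| for |z| < 1, and 0 otherwise.
      z = abs(z)
      return 1.0 - z if z < 1.0 else 0.0

  def assign_edge_point(p, clusters, points, ddegree, neighbors):
      # p: index of the edge point; clusters: dict cluster_id -> set of point indices;
      # points: (n, d) array; ddegree: per-point dispersive degrees;
      # neighbors: k-nearest-neighbor index lists from step 1.
      def dis(a, b):
          # Dissimilarity dis(a, b) = dist(a, b) * |DDegree(a) - DDegree(b)|.
          return float(np.linalg.norm(points[a] - points[b])) * abs(ddegree[a] - ddegree[b])

      nbrs = neighbors[p]
      d_max = max(dis(p, q) for q in nbrs) or 1.0   # dis(p, o), with a guard against zero
      best_id, best_score = None, -1.0
      for cid, members in clusters.items():
          # Prob(Ci): kernel values summed over p's k-NN that belong to Ci
          # (a simplification of the sum over all of Ci).
          score = sum(triangular_kernel(dis(p, q) / d_max) for q in nbrs if q in members)
          if score > best_score:
              best_id, best_score = cid, score
      return best_id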

2.7. Determining the parameters

The CDDC clustering algorithm needs only one parameter, k, the number of neighbors, to depict the clustering structure of a data set. Some research indicates that if k is too small there are too many modes, while if k is too large modes disappear [8]. CDDC, however, is fairly insensitive to this input parameter. Figure 6 shows the effect of different values of k on the DDegree-plot for the same data set used in Figure 3; in the first plot a smaller k is used and in the second a larger one. For smaller k the DDegree-plot looks more jagged, while larger k smooths the curve. Despite this difference, the clustering structure can be recognized in all these plots in the same way. It is a further advantage of CDDC over other methods that it is not difficult to find a range of k for which the clustering results are stable; CDDC consistently obtains good results with values between 10 and 20.

Fig. 6: Effects of the k setting on the scan order (DDegree-plots of the same data set for k = 4 and k = 40).

The other two parameters, mt and dt, are easy to choose from the DDegree-plot. The parameter mt filters out clusters containing fewer points than the user cares about, and an order whose dividing point is lower than dt will not be divided further, because the DDegree of its points is small enough for them to be core points of a cluster. Both parameters can therefore be fixed visually and interactively.

2.8. Computational complexity

Let n be the size of the data set, containing m clusters and l edge points, and let k be the number of neighbors. The time complexity is then O(n²) for computing the DDegrees, O(n) for producing the ScanOrder, about O((m−1)n) for dividing the ScanOrder and O(lkm) for the classification. Hence the total time complexity is O(n² + mn + lkm). Because m, l and k are in general far smaller than n, CDDC's runtime is O(n²). If a tree-based index such as the R*-tree [9] or the X-tree [10] is used, the runtime is reduced to O(n log n).
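As an aside, the O(n²) term comes entirely from the neighbor search in step 1. The sketch below shows how that step could be backed by a spatial index; it substitutes SciPy's cKDTree for the R*-tree or X-tree mentioned above, which is our choice for illustration, not the paper's.

  import numpy as np
  from scipy.spatial import cKDTree

  def knn_with_index(points, k):
      # Neighbor search for step 1 using a k-d tree instead of brute force.
      # Returns the distance to each point's k-th nearest neighbor and the
      # indices of its k nearest neighbors (self excluded).
      tree = cKDTree(points)
      # Query k+1 neighbors because the nearest one is the point itself.
      dists, idxs = tree.query(points, k=k + 1)
      knn_dist = dists[:, k]          # distance to the k-th real neighbor
      neighbor_idx = idxs[:, 1:]      # drop the self column
      return knn_dist, neighbor_idx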

3. Performance evaluation

In this section we evaluate the performance of CDDC and compare it with DBSCAN, OPTICS and KNNCLUST: the first is an acknowledged classical density-based clustering algorithm, and the other two are designed to cluster data sets with varying densities. We implemented CDDC, DBSCAN and OPTICS in Java without any index. All experiments were run on a workstation with a Pentium D 2.8 GHz CPU, 512 MB of memory and Windows XP Professional SP2.

To compare these algorithms in terms of effectiveness, we use the two synthetic sample data sets depicted in Figure 7. They contain clusters of different shape, size, density and orientation, as well as random noise. DS2 was obtained from Chameleon [11], whereas we generated the remaining data set synthetically.

Fig. 7: The two data sets used in the experiments (DS1: 7118 points; DS2: 8000 points).

Precision, recall and F-measure are common measurements used in information retrieval for evaluation [12]. Their values for DBSCAN, OPTICS and CDDC are shown in Figure 8. Because KNNCLUST cannot find clusters of irregular shape and gave poor results in our experiments, its figures are omitted. The ideal clusters are marked C1-C9 in clockwise direction. The parameters of DBSCAN were fine-tuned to find all the ideal clusters (Eps = 11, MinPts = 4), but even then C8 and C9 (the triangles) are recognized as a single cluster and many noise points are clustered mistakenly. OPTICS cannot tell when to stop dividing a cluster, so we manually selected the "best" result (i.e. the one most similar to the ideal clusters), obtained with the parameters Eps = 22, MinPts = 20 and ξ = 0.01. The parameters of CDDC, k, mt and dt, were set to 20, 250 and 1.097; the two thresholds were determined from the DDegree-plot. Clearly, CDDC obtains the best results of all.

Fig. 8: Comparison of DBSCAN, OPTICS and CDDC (precision, recall and F-measure per cluster C1-C9).

Since there is no common internal quality evaluation for density-based clustering algorithms, we evaluate the results on DS2 by visual inspection. The comparison is depicted in Figures 9-11.

Fig. 9: Clusters found by DBSCAN.

Fig. 10: Clusters found by OPTICS.

Fig. 11: Clusters found by CDDC.
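For reference, a minimal sketch of a per-cluster precision, recall and F-measure computation against the ideal labelling is given below; it implements the standard external definitions, not the authors' evaluation code, and assumes the matching between found and ideal clusters is already given.

  def cluster_prf(found, ideal):
      # found, ideal: dicts mapping cluster id -> set of point ids, where found[c]
      # is the discovered cluster matched to the ideal cluster c.
      scores = {}
      for cid, ideal_pts in ideal.items():
          found_pts = found.get(cid, set())
          tp = len(found_pts & ideal_pts)
          precision = tp / len(found_pts) if found_pts else 0.0
          recall = tp / len(ideal_pts) if ideal_pts else 0.0
          f = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
          scores[cid] = (precision, recall, f)
      return scores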

4. Conclusions

Many density-based clustering algorithms, such as DBSCAN, suffer from the problem of clusters with different densities. Handling this problem is a central purpose, though not the only one, of our newly proposed algorithm, CDDC, which makes use of the dispersive degree and of KNN-based kernel density estimation. CDDC needs only one parameter, k, the number of neighbors, to discover the essential clustering structure, from which mt and dt can easily be determined to produce useful clusters. It is suited to finding clusters of arbitrary shape and density as well as noise. Furthermore, it has the same time complexity as DBSCAN. In conclusion, CDDC is a good method for clustering data sets whose clusters differ greatly in density.

Acknowledgement

This work is partially supported by the National Natural Science Foundation of China (Grant No. 60673066).

References

[1] J.W. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[2] M. Ester, H.P. Kriegel and J. Sander, A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. of Knowledge Discovery and Data Mining (KDD), pp. 226-231, 1996.
[3] M. Ankerst, M. Breunig, H.P. Kriegel and J. Sander, OPTICS: Ordering points to identify the clustering structure. Proc. ACM SIGMOD'99 Int. Conf. on Management of Data, pp. 49-60, 1999.
[4] T.N. Tran, R. Wehrens and L.M.C. Buydens, KNN-kernel density-based clustering for high-dimensional multivariate data. Computational Statistics & Data Analysis, 51: 513-525, 2006.
[5] D.O. Loftsgaarden and C.P. Quesenberry, A nonparametric estimate of a multivariate density function. Ann. Math. Statist., 36: 1049-1051, 1965.
[6] A. Webb, Statistical Pattern Recognition, Wiley, Malvern, 2002.
[7] G.R. Terrell and D.W. Scott, Variable kernel density estimation. Ann. Statist., 20: 1236-1265, 1992.
[8] http://www.ficcs.org/meetings/ficcs2/ficcs2.html; http://web.maths.unsw.edu.au/~tduong/research/index.html.
[9] N. Beckmann, H.P. Kriegel, R. Schneider and B. Seeger, The R*-tree: an efficient and robust access method for points and rectangles. Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 322-331, 1990.
[10] S. Berchtold, D. Keim and H.P. Kriegel, The X-tree: an index structure for high-dimensional data. Proc. 22nd Int. Conf. on Very Large Data Bases, pp. 28-39, 1996.
[11] G. Karypis, E.H. Han and V. Kumar, Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer (Special Issue on Data Analysis and Mining), 32: 68-75, 1999.
[12] D. Crabtree, P. Andreae and X. Gao, QC4: a clustering evaluation method. Proc. of PAKDD, pp. 59-70, 2007.