
Parameter Reduction for Density-based Clustering on Large Data Sets

Baoying Wang, William Perrizo
{baoying.wang, william.perrizo}@ndsu.nodak.edu
Computer Science Department, North Dakota State University, Fargo, ND 58105
Tel: (701) 231-6257, Fax: (701) 231-8255

Abstract

Clustering on large datasets has become one of the most intensively studied areas as data volumes increase. One of the problems of clustering on large datasets is the minimal domain knowledge available to determine the input parameters. In density-based clustering, the main input is the minimum neighborhood radius, and the problem becomes more difficult when the clusters have different densities. In this paper, we explore an automatic approach that determines the minimum neighborhood radius from the distribution of the dataset. The algorithm, MINR, is developed to determine the minimum neighborhood radii for clusters of different densities based on many experiments and observations. MINR can be used together with any density-based clustering method to make a nonparametric clustering algorithm. In this paper, we combine MINR with the enhanced DBSCAN, e-DBSCAN. Experiments show our approach is more efficient and scalable than TURN* [2].

Keywords: Data mining, Density-based clustering, Parameter reduction.

1. INTRODUCTION

Clustering on large datasets has become one of the most intensively studied areas in data mining. In particular, density-based clustering is widely used in various spatial applications such as geographical information analysis, medical applications, and satellite image analysis. In density-based clustering, clusters are dense areas of points in the data space that are separated by areas of low density (noise) [4]. A cluster is regarded as a connected dense area of data points, which grows in any direction that density leads.

One of the problems of clustering on large spatial datasets is the minimal domain knowledge available to determine the input parameters. A dataset may consist of clusters with the same density or with different densities; Figure 1 shows some possible distributions of a dataset.

Figure 1. Clusters with the same or different densities: (a) same density, (b) different densities

In density-based clustering, the main input is the minimum neighborhood radius. When clusters have different densities, it is more difficult to determine the minimum neighborhood radii. Although there have been many efforts to make clustering parameter-free, they either try to give users all possible choices [7] or adopt a trial-and-error approach based on statistical information. We explore an automatic approach that determines the minimum neighborhood radii from the distribution of the dataset. The algorithm, MINR, is developed to determine the minimum neighborhood radii for clusters of different densities based on many experiments and observations. MINR can be used together with any density-based clustering method to make a nonparametric clustering algorithm. In this paper, we combine MINR with the enhanced DBSCAN, e-DBSCAN, into a nonparametric density-based clustering algorithm (NPDBC). Experiments show NPDBC is more efficient and scalable than TURN* [2]. The reason is that for NPDBC, the parameters are computed once at the beginning of the clustering process, while the TURN* algorithm tries different neighborhood radii until the first "turn" is found in the case of clusters with two different densities.

This work is partially supported by GSA Grant ACT# K96130308.

This paper is organized as follows. In section 2, we give a brief review of related work. In section 3, we present the parameter reduction method for density-based clustering and a nonparametric clustering method. We give a performance analysis in section 4. Finally, we conclude the paper in section 5.

2. RELATED WORK

2.1. Clustering methods

There are mainly two kinds of clustering methods: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters. The major problems with partitioning methods are: (1) k has to be predetermined; (2) it is difficult to identify clusters with different sizes; (3) it only finds convex clusters.

Density-based clustering methods are used to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [1]. The basic idea of DBSCAN is that each cluster is a maximal set of density-connected points. Points are connected when they are density-reachable from one neighborhood to another. DBSCAN is very sensitive to its input parameters, which are the neighborhood radius (r) and the minimum number of neighbors (MinPts). Another density-based method is WaveCluster [10], which applies a wavelet transform to the feature space. It can detect arbitrary-shape clusters at different scales, but the algorithm is grid-based and only applicable to low-dimensional data. Its input parameters include the number of grid cells for each dimension, the wavelet to use, and the number of applications of the wavelet transform. In [5], another density-based algorithm, DenClue, is proposed. This algorithm uses a grid but is very efficient because it only keeps information about grid cells that actually contain data points and manages these cells in a tree-based access structure. It generalizes some other clustering approaches, which, however, results in a large number of input parameters.

2.2. Attempts to reduce parameters

There have been many efforts to make the clustering process parameter-free, such as OPTICS [7], CHAMELEON [6], and TURN* [2]. OPTICS computes an augmented cluster ordering that represents the density-based clustering structure of the data; the method is used for interactive cluster analysis. CHAMELEON operates on a derived similarity graph. The algorithm first uses a graph partitioning approach to divide the dataset into a set of small clusters; the small clusters are then merged based on their similarity measure. CHAMELEON has been found to be very effective in clustering convex shapes. However, the algorithm cannot handle outliers and needs parameter settings to work effectively.

TURN* is a brute-force approach. It first decreases the neighborhood radius until it is so small that every data point becomes noise. The radius is then doubled each time clustering is run, until a "turn" is found where stabilization occurs in the clustering process [3]. TURN* uses two constant step sizes, 2 and 0.4, to increase and decrease the neighborhood radius respectively. Obviously the step sizes depend on the data distribution of the dataset. Even though it chooses big steps, the computation time is not promising for large datasets with various densities.

2.3. Enhanced DBSCAN clustering

Given a data set X, the neighborhood radius, r, and the minimum number of points in the neighborhood, k, we introduce some definitions of density-based clustering and then present our enhanced DBSCAN clustering algorithm.

Definition 1. The neighborhood of a data point p with a radius r is defined as the set Nbr(p, r) = {x ∈ X : |p − x| ≤ r}, where |p − x| is the distance between x and p.

Definition 2. A point p is an internal point if it has at least k neighbors within its neighborhood Nbr(p, r), denoted as |Nbr(p, r)| ≥ k. Its neighborhood is called a core.

Definition 3. A point p is an external point if the number of its neighbors within its neighborhood Nbr(p, r) is less than k, i.e. |Nbr(p, r)| < k, and it is located within a core. Figure 2 shows internal and external points, given k = 4.

Figure 2. Internal and external points (k = 4): (a) five internal points, (b) two internal points and one external point

Definition 4. A point p is directly density-reachable from a point q if p ∈ Nbr(q, r) and q is an internal point.

Definition 5. A point p is density-reachable from a point q if there is a chain of points x1, x2, ..., xn, with q = x1 and p = xn, such that xi+1 is directly density-reachable from xi.

Definition 6. A cluster C is a collection of cores, the centers of which are density-reachable from each other.

Definition 7. The boundary points of a cluster are the collection of external points within the cluster.

Enhanced DBSCAN: We develop an enhanced DBSCAN algorithm (e-DBSCAN). e-DBSCAN is used as a nested clustering procedure, which is called repeatedly to process clustering at different densities. e-DBSCAN differs from the original DBSCAN in that the boundary points of each cluster are stored as a separate set; the boundary sets are used for cluster merging at a later stage. The enhanced DBSCAN process is summarized as follows:

1. Pick an arbitrary point x. If it is not an internal point, it is labeled as noise. Otherwise its neighborhood becomes a rudimentary cluster C. Insert all neighbors of point x into the seed store.
2. Retrieve the next point from the seed store. If it is an internal point, merge its neighborhood into cluster C and insert all its neighbors into the seed store; if it is an external point, insert it into the boundary set of C.
3. Go back to step 2 with the next seed until the seed store is empty.
4. Go back to step 1 with the next unclustered point in the dataset.

When the process is finished, there will be a set of clusters, a noise set, and a boundary set for each cluster.
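To make the procedure concrete, the following Python sketch follows steps 1-4 above and keeps a per-cluster boundary set as required by Definition 7. It is an illustration only (the paper's implementation is in C); the function name e_dbscan, the brute-force distance matrix, and the NumPy representation of the data as an (n, d) array are our own choices.

import numpy as np

def e_dbscan(points, r, k=4):
    # Sketch of the e-DBSCAN procedure above (steps 1-4).
    # Returns (labels, boundaries): labels[i] is a cluster id or -1 for noise;
    # boundaries[c] is the set of boundary (external) point indices of cluster c.
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= r)[0] for i in range(n)]   # Nbr(p, r), Definition 1
    internal = np.array([len(nb) >= k for nb in neighbors])     # Definition 2
    labels = np.full(n, -1)
    boundaries = {}
    cid = 0
    for x in range(n):
        if labels[x] != -1 or not internal[x]:
            continue                      # step 1: non-internal points stay noise for now
        labels[x] = cid
        boundaries[cid] = set()
        seeds = list(neighbors[x])        # seed store
        while seeds:                      # steps 2-3: grow the rudimentary cluster C
            p = seeds.pop()
            if labels[p] != -1:
                continue
            labels[p] = cid
            if internal[p]:
                seeds.extend(neighbors[p])    # merge its neighborhood into C
            else:
                boundaries[cid].add(p)        # external point: keep in the boundary set of C
        cid += 1                          # step 4: continue from the next unclustered point
    return labels, boundaries

For example, labels, bnd = e_dbscan(data, r=0.05) labels each row of data with a cluster id (-1 for noise) and returns the boundary sets that are used later for cluster merging.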

3. PARAMETER REDUCTION FOR DENSITY-BASED CLUSTERING

There are two input parameters in the DBSCAN algorithm: the minimum number of neighbors, k, and the minimum neighborhood radius, r. In fact, k is the size of the smallest cluster and need not vary with different datasets. DBSCAN sets k to 4 [1], and TURN* also treats it as a fixed value [2]. We also set k to 4.

Therefore, the only input parameter is the minimum neighborhood radius, r. Intuitively, r should depend on the cluster density of the dataset: clusters of different densities should have different r. For this reason, DBSCAN presents the user with a graph of the sorted distance between each point and its 4th nearest neighbor, and the user is asked to find the "valley" that represents the optimal r. This method only works for clusters with the same density. TURN* treats the whole dataset as an image and tries a range of resolutions (radii), from one end where every point is classified as noise to the other end where all data points fall into a single cluster; an optimal resolution within this range is found by a statistical method.

In this section, we first present a few observations based on our experiments on many different datasets. We then develop a built-in algorithm, MINR, to determine the minimum neighborhood radii for clusters of different densities based on the data distribution. Finally, we develop a nonparametric density-based clustering method by combining MINR with e-DBSCAN.

3.1. Experiments and Observations

Observation 1: We define R as the distance between each point x and its 4th nearest neighbor. The points are then sorted based on R in ascending order. Figure 3 shows two datasets, DS1 and DS2, and their R-x graphs after sorting. DS1 is a dataset used by DBSCAN; its size is 200 points. DS2 is reproduced from a dataset used by CHAMELEON; the original data has 10K points and clusters of similar density. In order to test our algorithm, we inserted more data into the three clusters in the upper-left part, giving DS2 a size of 17.5K.

Figure 3. DS1 and DS2 and their sorted R-x graphs: (a) DS1, (b) DS2, (c) R-x graph of DS1, (d) R-x graph of DS2

As we can see from Figure 3, for a noisy dataset there is a turning point in the R-x graph where R starts to increase dramatically. Our experiments show that most points to the right of the turning point are noise; if the dataset were clean, there would be no turning point in the graph. DS1 and DS2 are both noisy datasets, so turning points appear in Figure 3 (c) and (d). We can even verify this observation on DS1 by eye: the turning point in (c) is at around x = 175, with 24 points to its right, and DS1 in fact has 20 noise points.
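The sorted R-x curve of Observation 1 can be produced with a few lines of code. This is an illustrative sketch (variable names and the NumPy representation are ours), not part of the paper's implementation:

import numpy as np

def sorted_R(points):
    # R-x curve of Observation 1: for each point, the distance to its 4th
    # nearest neighbor, sorted in ascending order. A sharp rise near the
    # right end of the curve marks the noise points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dist.sort(axis=1)        # row-wise; column 0 is the point itself (distance 0)
    R = dist[:, 4]           # distance to the 4th nearest neighbor
    return np.sort(R)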

Observation 2: Given a neighborhood radius r, we calculate the number of neighbors of each point within the given radius, denoted as K, sort the points in descending order of K, and get the sorted K-x graph. When r is small, the curve is quite smooth. As r increases, the graph starts to show "knees." As we continue to increase r, the graph becomes smooth again. The rationale is that if r is very small or very large, the neighbor counts of all points will be close to each other: one extreme case is when r is so small that every point has no neighbor but itself; the other is when r is large enough that the neighborhood covers the whole data set. Figure 4 shows the K-x graphs for DS1 and DS2 for three different radii. Figure 4 (a) and (b) are the cases when r is very small; (c) and (d) are the cases when r is close to the maximum R from the R-x graph; (e) and (f) are the cases when r is very large.

Figure 4. Sorted K-x graphs for datasets DS1 and DS2 with different neighborhood radii: (a) DS1, r = 2; (b) DS2, r = 5; (c) DS1, r = 22; (d) DS2, r = 30; (e) DS1, r = 50; (f) DS2, r = 250

From Figure 4, we can see that when the neighborhood radius is close to the maximum R, the K-x graph shows "knees" very clearly. In order to find the "knees," we need to calculate the differentials of the graphs, ΔK. Figure 5 (a) and (b) show the sorted K-x graphs for DS1 and DS2 when the neighborhood radius is close to the maximum R; (c) and (d) show the differentials of the graphs, respectively.

Figure 5. Sorted K-x graphs of datasets DS1 and DS2 and their differentials ΔK: (a) K-x graph for DS1, (b) K-x graph for DS2, (c) ΔK for DS1, (d) ΔK for DS2
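The sorted K-x graph and its differential ΔK can be computed as follows. This is only a sketch of the quantities used in Observation 2; a plain first difference is assumed for ΔK, and the function names are our own:

import numpy as np

def sorted_K(points, r):
    # Sorted K-x graph: number of neighbors within radius r for each point,
    # in descending order (the point itself is not counted).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    K = (dist <= r).sum(axis=1) - 1
    return np.sort(K)[::-1]

def delta_K(K_sorted):
    # Differential of the sorted K-x graph; its peaks indicate where the "knees" sit.
    return np.abs(np.diff(K_sorted))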


(c) K for DS1 (d) K for DS2 Figure 5. Sorted K-x graphs of datasets DS1 and DS2 Figure 6. Partitions of the sorted DS2 separated by and their differentials Ks two “knees” at 10000 and 15500 Both DS1 and DS2 consist of clusters in two different densities and some noise. The knees are close to the 3.2. Determination of the neighborhood radii points with peak differentials as we can see in (c) and (d). The number of “knees” is equal to the number of cluster Based on the experiments above, we develop an densities in the dataset. Intuitively, we infer that the points algorithm to automatically determine the minimum divided by “knees” belong to different density clusters or neighborhood radii for mining clusters with different noise. densities, MINR, based on the data distribution. The process is as follows: Observation 3: In order to justify our intuition above, we sort the dataset DS2 based on K, and then partition the 1. Calculate the sorted dataset into three subsets separated by two “knees” distance between each point and its 4th neighbor, R; in 5 (b). The two “knees” are at positions of 10000 and Find the maximum R; 15500. Therefore the three partitions are 0 – 10000, 10001 2. Compute the – 15500, and 15501-17500. The three partitions are number of neighbors, K, within the maximum shown in 6. We can see that partition (a) consists of the neighborhood radius R for each point; denser clusters; partition (b) consists of the less dense 3. Sort the points clusters; and partition (c) is mainly noise. in descending order based on K; 4. Calculate the differential K; Search for the peak K values; 5. Find the “knee” point right before each peak point with K = 0.

The "knee" points are denoted as KNi, where i = 1, 2, ..., m and m is the number of "knees." The distance between KNi and its 4th nearest neighbor is taken as the neighborhood radius for clustering the ith density cluster group. The algorithm is summarized in Figure 7.

MINR Algorithm
Input: A data set X
Output: neighborhood radii ri
1. Calculate the distance between each point and its 4th nearest neighbor, R. Get Rm = max(R).
2. Compute the number of neighbors within Rm for each point, K.
3. Sort the points in descending order based on K.
4. Calculate the differential ΔK and find the peak ΔK position, XPi. Stop if it is at the end of the dataset.
5. For the ith peak ΔK position, find the "knee" point KNi: if x < XPi, ΔKx = 0, and |x − XPi| is the smallest, then KNi = x.
6. Set ri = R(KNi), the 4th-nearest-neighbor distance of KNi. Increase i and go back to step 4.

Figure 7. The MINR algorithm
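The following Python sketch is one way to read Figure 7; it is illustrative rather than the authors' implementation. Peaks of ΔK are taken to be local maxima of the first difference, and each knee is the closest earlier position where ΔK is zero, as in step 5; the NumPy representation and all names are our own.

import numpy as np

def minr(points, k=4):
    # Sketch of MINR (Figure 7): one neighborhood radius per density level.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    R = np.sort(dist, axis=1)[:, k]          # step 1: distance to each point's 4th neighbor
    Rm = R.max()                             # maximum R
    K = (dist <= Rm).sum(axis=1) - 1         # step 2: neighbor counts within Rm
    order = np.argsort(-K)                   # step 3: points sorted by K, descending
    dK = np.abs(np.diff(K[order]))           # step 4: differential DeltaK
    peaks = [x for x in range(1, len(dK) - 1)
             if dK[x] > dK[x - 1] and dK[x] >= dK[x + 1] and dK[x] > 0]
    radii = []
    for xp in peaks:                         # step 5: knee = closest x < XPi with DeltaK = 0
        flat = np.where(dK[:xp] == 0)[0]
        if len(flat) == 0:
            continue
        knee = order[flat[-1]]               # the knee point KNi in the original data
        radii.append(float(R[knee]))         # step 6: ri = R(KNi)
    return sorted(set(radii))                # smallest radius first = densest group first

Returning the radii in ascending order matches the iterative clustering of Section 3.3, which handles the densest cluster group first.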

3.3. Nonparametric Density-based Clustering

In this section, we first propose an iterative clustering process given a series of neighborhood radii for the different density cluster groups in the dataset, and then develop our nonparametric density-based clustering method.

We start clustering using the enhanced DBSCAN algorithm, e-DBSCAN, with k = 4 and r = r1. The densest cluster(s) are formed, as shown in Figure 8.

Figure 8. Resulting clusters after clustering with r1: the densest cluster is formed.

Then we set r = r2 and process only the points that are still unclustered. The next, sparser cluster(s) are formed (see Figure 9). The process continues until r = rm; the remaining unclustered points are noise.

Figure 9. Resulting clusters after clustering with r2: the sparser cluster is formed and the remaining unclustered points are noise.
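A minimal sketch of this iterative pass is given below, parameterized by a clustering routine with the interface of the e-DBSCAN sketch in Section 2.3 (any function returning per-point labels with -1 for unclustered will do). The boundary-based merge, step 3 of Figure 10 below, is omitted from this sketch:

import numpy as np

def iterative_clustering(points, radii, cluster_fn, k=4):
    # Cluster with the smallest radius first, then re-cluster only the points
    # that are still unclustered with the next radius, and so on (Figure 9).
    # Points left unclustered after the last radius are noise (-1).
    labels = np.full(len(points), -1)
    next_id = 0
    for r in sorted(radii):                  # r1 < r2 < ... < rm
        todo = np.where(labels == -1)[0]
        if len(todo) == 0:
            break
        sub = np.asarray(cluster_fn(points[todo], r, k))
        for c in sorted(set(sub.tolist()) - {-1}):
            labels[todo[sub == c]] = next_id
            next_id += 1
    return labels

For instance, iterative_clustering(data, minr(data), lambda P, r, k: e_dbscan(P, r, k)[0]) chains the three sketches together, with the boundary merge still to be applied afterwards.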

Our nonparametric density-based clustering algorithm proceeds as follows. First, we calculate a series of neighborhood radii for the different density clusters using MINR; then we run the iterative clustering process using e-DBSCAN with these radii. Finally, we merge any pair of clusters that share most of the boundary points of either cluster. The whole process of our nonparametric clustering algorithm is summarized in Figure 10.

Nonparametric Clustering Algorithm
Input: A dataset X
Output: Clusters and noise
1. Calculate a series of neighborhood radii r1, r2, ..., rm for the different density clusters with MINR;
2. Perform iterative clustering with e-DBSCAN;
3. Check the boundaries of each pair of clusters. If two clusters share most of the boundary of either cluster, merge the two clusters into one.

Figure 10. Nonparametric clustering algorithm

4. PERFORMANCE ANALYSIS

In this section, we compare our nonparametric density-based clustering algorithm (NPDBC) with TURN*. We tested the algorithms on several data sets; here we show the run-time comparison on the dataset DS2 discussed above. In order to make the data contain clusters in different densities, we artificially inserted more data into some clusters to make them denser than the others. The resulting datasets have sizes from 10k to 200k points. We implemented NPDBC in the C language and ran it on a 1 GHz Pentium PC with 1 GB of main memory running Debian Linux 4.0. The run-time comparison of NPDBC and TURN* is shown in Figure 11.

Figure 11. Comparison of NPDBC and TURN*

From Figure 11, we see that NPDBC is more efficient than TURN* for large datasets. The reason is that for NPDBC, the parameters are computed once at the beginning of the clustering process, while the TURN* algorithm tries different neighborhood radii until the first "turn" is found in the case of two different densities. We only compare NPDBC with TURN* on datasets with two different densities. If the density variety increases, NPDBC will outperform TURN* by an even larger margin: in that case, TURN* cannot stop at the first turning point and has to continue searching for more knees to the very end. It is clear that TURN* will fail for large datasets with various densities.

5. CONCLUSION

One of the major challenges of clustering is the minimal domain knowledge available to determine the input parameters. It is even more difficult to determine the input parameters when the dataset contains clusters in different densities. Although many algorithms have tried to make clustering parameter-free, they either try to give users all possible choices or adopt a trial-and-error approach based on statistical information, which is not practical for very large datasets. In this paper, we explored an automatic approach to determine this parameter based on the distribution of the dataset. The algorithm, MINR, was developed to determine the minimum neighborhood radii for clusters of different densities. We developed a nonparametric clustering method (NPDBC) by combining MINR with the enhanced DBSCAN, e-DBSCAN. Experiments show that NPDBC is more efficient and scalable than TURN* for clusters in two different densities. The reason is that in NPDBC, the parameters are computed once at the beginning of the clustering process, while the TURN* algorithm tries different neighborhood radii until the first "turn" is found in the case of clusters in two different densities. When the dataset contains clusters in various densities, our algorithm will be even more efficient. In our future work, we will implement NPDBC using the vertical data structure, the P-tree, an efficient data-mining-ready data representation.

6. REFERENCES

[1] Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, Portland, Oregon, 1996, 226-231.
[2] Foss, A. and Zaiane, O. R. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In Proceedings of ICDM 2002.
[3] Halkidi, M., Batistakis, Y. and Vazirgiannis, M. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3):107-145, December 2001.
[4] Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[5] Hinneburg, A. and Keim, D. A. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th Int'l Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1998.
[6] Karypis, G., Han, E.-H. and Kumar, V. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68-75, August 1999.
[7] Ankerst, M., Breunig, M., Kriegel, H.-P. and Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD Conf. on Management of Data (SIGMOD'99), 1999, 49-60.
[8] Ng, R. T. and Han, J. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th Int'l Conf. on Very Large Data Bases, 1994.
[9] Palmer, C. R. and Faloutsos, C. Density biased sampling: an improved method for data mining and clustering. In Proceedings of the ACM SIGMOD Int'l Conf. on Management of Data, 2000.
[10] Sheikholeslami, G., Chatterjee, S. and Zhang, A. A wavelet-based clustering approach for spatial data in very large databases. The International Journal on Very Large Databases, 8(4):289-304, February 2000.
