Analysis of Mass Based and Density Based Clustering Techniques on Numerical Datasets
Journal of Information Engineering and Applications www.iiste.org ISSN 2224-5782 (print) ISSN 2225-0506 (online) Vol.3, No.4, 2013

Rekha Awasthi 1*, Anil K Tiwari 2, Seema Pathak 3
1. Disha College Ram Nager Kota Raipur 492010 Chhattisgarh India
2. Disha College Ram Nager Kota Raipur 492010 Chhattisgarh India
3. Disha College Ram Nager Kota Raipur 492010 Chhattisgarh India
*E-mail: [email protected]

Abstract
Clustering is one of the techniques adopted by data mining tools across a range of applications. It provides several algorithms that can assess large data sets based on specific parameters and group related points. This paper gives a comparative analysis of density based clustering algorithms and mass based clustering algorithms. DBSCAN [15] is the base algorithm for density based clustering techniques. One advantage of these techniques is that they do not require the number of clusters to be given a priori, and they can detect clusters of different shapes and sizes in large amounts of data containing noise and outliers. OPTICS [14], on the other hand, does not produce a clustering of a data set explicitly; instead it creates an augmented ordering of the database that represents its density based clustering structure. Mass based clustering uses a mass estimation technique as an alternative to density estimation. The mass based clustering algorithm [22] also uses core regions and noise points as parameters. We analyze the algorithms in terms of the parameters essential for creating meaningful clusters. All the algorithms are tested on numerical data sets, both low dimensional and high dimensional.

Keywords: Mass Based (DEMassDBSCAN), DBSCAN, OPTICS.

1. Introduction
Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items.
Data mining often involves the analysis of data stored in a data warehouse. Three of the major data mining techniques are regression, classification and clustering. In this paper we work only with clustering, because it is the most important process when the database is very large. Clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Clustering is a main task of explorative data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. We use the Weka data mining tool for clustering, because it provides a better user interface than other data mining tools.

2. Literature Survey
Clustering is an active research area in data mining, with various methods reported. Xin et al. and Howard et al. made a comparative analysis of two density based clustering algorithms, DBSCAN and DBRS. They concluded that DBSCAN gives extremely good results and is efficient on many datasets. However, if a dataset has clusters of widely varying densities, then DBSCAN does not perform well. DBRS aims to reduce the running time for datasets with varying densities, and it also works well on high-density clusters.
In [19] Mariam et al. and Syed et al. compared two density based clustering algorithms, DBSCAN and RDBC (Recursive Density Based Clustering). RDBC is an improvement of DBSCAN: it calls DBSCAN with different values of the distance threshold ε and the density threshold MinPts. They conclude that RDBC forms more clusters than DBSCAN and that the runtime of RDBC is lower than that of DBSCAN.
In [1] K.Santhisree et al. described a similarity measure for density based clustering of web usage data. They developed a new similarity measure, named the sequence similarity measure, and enhanced DBSCAN [15] and OPTICS [14] for web personalization. Their experiments found that the average intra cluster distance is larger in DBSCAN than in OPTICS.
In [17] K.Mumtaz and Dr. K.Duraiswamy described an analysis of density based clustering of multi-dimensional spatial data. They analyzed the properties of three clustering algorithms, namely DBSCAN, k-means and SOM, using synthetic two dimensional spatial data sets. DBSCAN was seen to perform better for spatial data sets and to produce the correct set of clusters compared to the SOM and k-means algorithms.
In [11] A.Moreia, M.Santos and S.Corneiro et al. described the implementation of two density based clustering algorithms: DBSCAN [15] and SNN [12]. SNN requires more input parameters than DBSCAN. The results showed that SNN performs better than DBSCAN, since SNN can detect clusters with different densities while DBSCAN cannot.

3. Mass Based And Density-Based Clustering Techniques
Mass estimation is another technique for finding clusters in data of arbitrary shape. Among clustering approaches, mass estimation is unique because it uses neither distance nor density [20]. DEMassDBSCAN uses mass estimation as an alternative to density estimation, and it also uses core regions and noise points as parameters.
Density based clustering discovers clusters of arbitrary shape in spatial databases with noise.
It forms clusters based on maximal sets of density connected points. The core concepts in density based clustering are density-reachability and density-connectivity. It requires two input parameters: Eps, the radius, and MinPts, the minimum number of points required to form a cluster. It starts with an arbitrary point that has not yet been visited. The ε-neighborhood of that point is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. This section briefly describes two density based clustering algorithms: DBSCAN (Density Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure). Here, density = number of points within a specified radius.
Density based clustering algorithms mainly include two techniques:
- DBSCAN [15], which grows clusters according to a density based connectivity analysis.
- OPTICS [14], which extends DBSCAN to produce a cluster ordering obtained from a wide range of parameter settings.

Mass Based Clustering: Various estimation techniques can be used to form clusters. As noted above, mass estimation uses neither distance nor density [20]. In the DBSCAN clustering algorithm the region around a point is a hypersphere, whereas the DEMassDBSCAN clustering algorithm uses rectangular regions [21].
Mass: The number of points in a given region is called the mass of that region. Each region in a given data space has a rectangular shape, and mass is estimated with a function called the rectangular function [18].

One-Dimensional Mass Estimation: The estimation of mass in a data space depends on the level h. If h = 1, the shape of the function is concave. For multidimensional mass estimation the value of h is > 1.
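The two counting notions used in this section, density as the number of points within a radius and mass as the number of points in a rectangular region, can be contrasted in a short sketch. This is only an illustration under assumptions: the function names, the sample points, and the choice to count a point as part of its own neighborhood are not taken from the paper.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def density_count(points, center, eps):
    """Number of points within radius eps of center (hypersphere region)."""
    return sum(1 for p in points if dist(p, center) <= eps)


def mass_count(points, lower, upper):
    """Number of points inside the axis-aligned rectangle [lower, upper]."""
    return sum(
        1
        for p in points
        if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))
    )


def is_core_point(points, center, eps, min_pts):
    """Density based test: the eps-neighborhood holds at least min_pts points."""
    return density_count(points, center, eps) >= min_pts


# Hypothetical two dimensional sample, for illustration only.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.9, 0.9), (5.0, 5.0)]

print(density_count(points, (0.0, 0.0), eps=1.0))               # 3
print(mass_count(points, lower=(0.0, 0.0), upper=(1.0, 1.0)))   # 4
print(is_core_point(points, (5.0, 5.0), eps=1.0, min_pts=2))    # False
```

The point (0.9, 0.9) lies inside the unit rectangle but outside the unit circle around the origin, so the same data yield a mass of 4 and a density count of 3.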
[Eq. 1, see [20]]

For one-dimensional mass estimation the value of h is equal to 1. In the above equation, m_i(x) is the mass base function.

[Eq. 2]

Combining these two equations:

[combined equation]

For large databases the values of t and Ψ are 1000 and 256. For small data sets the values of h and Ψ vary.

Multidimensional Mass Estimation: In multidimensional mass estimation the value of h is > 1. There are two functions: the mass function and the random space generator in the data space.

[equation, see [20]]

Mass Based Clustering:

[equation, see [20]]

Multidimensional mass estimation using h:d trees:

[equation, see [20]]

For a multidimensional data set the value of h (level) is greater than 1, and the estimation is similar to one-dimensional mass estimation. In this equation x is replaced by X.

MassTER Algorithm: In the MassTER algorithm the mass is estimated using h:d trees. In an h:d tree, h stands for the number of times an attribute appears on a path and d stands for the number of dimensions. The height of the tree is calculated as l = h × d.

Algorithm
Step 1: Build trees t_i.
Step 2: Assign a cluster seed for every instance.
Step 3: Join two seed based clusters.
Step 4: Clusters having fewer than η instances are called noise instances.
Step 5: Return clusters c_j, j = 1, 2, 3, ..., k, and E (noise instances).

Data input:
D = input data.
T = number of trees.
Ψ = sampling size.
h = number of times an attribute appears in a path.
η = minimum instances in a cluster.

In this algorithm, h:d trees are first built from the instances in a region. A cluster seed is assigned for each region in the data space. The cluster seeds are then joined, each pairing with a neighboring cluster seed.
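The last two steps of the algorithm, together with the tree height formula, can be sketched as follows. This is a minimal illustration and not the MassTER implementation: it assumes Steps 1 to 3 have already produced a mapping from each instance to its cluster seed, and the names `tree_height`, `split_noise` and `seed_of` are hypothetical.

```python
from collections import defaultdict


def tree_height(h, d):
    """Height of an h:d tree, l = h * d, as stated in the text."""
    return h * d


def split_noise(seed_of, eta):
    """Group instances by cluster seed; clusters smaller than eta become noise.

    seed_of: dict mapping instance id -> cluster seed id (assumed to be the
    result of Steps 2 and 3; this representation is hypothetical).
    eta: minimum number of instances in a cluster (Step 4).
    Returns (clusters, noise) as in Step 5.
    """
    members = defaultdict(list)
    for instance, seed in seed_of.items():
        members[seed].append(instance)
    clusters = {s: m for s, m in members.items() if len(m) >= eta}
    noise = sorted(i for m in members.values() if len(m) < eta for i in m)
    return clusters, noise


# Hypothetical assignment of six instances to three cluster seeds.
seed_of = {0: "a", 1: "a", 2: "a", 3: "b", 4: "c", 5: "c"}
clusters, noise = split_noise(seed_of, eta=2)

print(tree_height(h=2, d=3))  # 6
print(clusters)               # {'a': [0, 1, 2], 'c': [4, 5]}
print(noise)                  # [3]
```

With η = 2, the singleton cluster seeded by "b" falls below the minimum size, so its instance is returned as a noise instance rather than as a cluster.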