Effect of Distance Measures on Partitional Clustering Algorithms
Sesham Anand et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (6), 2015, 5308-5312

Effect of Distance Measures on Partitional Clustering Algorithms using Transportation Data

Sesham Anand#1, P Padmanabham*2, A Govardhan#3
#1 Dept of CSE, M.V.S.R Engg College, Hyderabad, India
*2 Director, Bharath Group of Institutions, BIET, Hyderabad, India
#3 Dept of CSE, SIT, JNTU Hyderabad, India

Abstract— Similarity/dissimilarity measures in clustering algorithms play an important role in grouping data and finding out how much the data items differ from each other. The importance of clustering algorithms for transportation data has been illustrated in previous research. This paper compares the effect of different distance/similarity measures on a partitional clustering algorithm, kmedoids (PAM), using a transportation dataset. A recently developed open source data mining tool, ELKI, has been used and the results are illustrated.

Keywords— clustering, transportation data, partitional algorithms, cluster validity, distance measures

I. INTRODUCTION
Clustering as an unsupervised data mining technique has been widely used in various applications for analyzing data.
Kmeans and kmedoids are two very widely used partitional algorithms. Since kmeans is more suitable with the Euclidean distance metric, the kmedoids algorithm has been chosen in order to study other distance measures. Details of the transportation data, its attributes, the aim of the research and the application of clustering have been presented previously[2,3,4,5]. One further aim of the research was to compare different open source software[4]. Previous research illustrated the importance and use of clustering on transportation data, and the results of applying clustering were indicated. It has been observed that the clustering results varied with different algorithms and different parameters.

The aim of the current research is to study the effect of distance measures on the clustering algorithms and to come out with recommendations on suitable distance measures, and on whether a particular distance measure needs to be modified specifically for the transportation dataset. The work has been done using the ELKI[1] software, a recently developed Java based data mining tool. Different distances have been selected using the kmedoids algorithm, different parameters have been recorded, and suitable conclusions have been drawn. All the cluster evaluation measures and distances used in the work are discussed.

II ELKI SOFTWARE
The software used for the present research is ELKI[1], which stands for Environment for Developing KDD-Applications Supported by Index-Structures. It is a Java based open source software; the version used here is ELKI 0.6.0. The original aim of ELKI is to be used in research of algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection[1]. High performance is achieved by using many data index structures such as the R*-tree. ELKI is designed to be easy to extend for researchers in data mining, particularly in the clustering domain. It provides a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms[1].

Data mining research usually leads to many algorithms for similar kinds of tasks, and comparisons need to be made between these algorithms. In ELKI, data mining algorithms and data management tasks are separated, which allows for an independent evaluation. This aspect is unique to ELKI among data mining frameworks such as Weka, RapidMiner or Tanagra. ELKI has many implementations of distance and similarity measures, file formats and data types, and is open to the development of new ones. Visualization of data and results is also available, and ELKI supports evaluation of algorithms using different measures such as pair-counting, entropy based, BCubed based, set-matching based, editing-distance based and Gini measures.

Although not exhaustive, Table 1 shows the wide variety of clustering algorithms implemented in ELKI. All algorithms are highly efficient, and a wide variety of parameters can be used to customize them.

Table 1: Clustering algorithms in ELKI
Category     | Algorithms Implemented
Partitional  | kmeans (Lloyd and MacQueen), kmedoids (PAM & EM), kMedians (Lloyd), kmeansBisecting, kmeansBatchedLloyd, kmeansHybridLloydMacQueen
Hierarchical | ExtractFlatClusteringFromHierarchy, NaiveAgglomerativeHierarchicalClustering, SLINK
Correlation  | CASH, COPAC, ERiC, FourC, HiCO, LMCLUS, ORCLUS
Subspace     | CLIQUE, DiSH, DOC, HiSC, P3C, PROCLUS, SUBCLU, PreDeCon
Others       | Canopy PreClustering, DBSCAN, DeLiClu, EM, AffinityPropagationClustering

III DEFINITION AND MOTIVATION FOR CURRENT RESEARCH
The authors' previous research[2,3,4,5] has described the original aim of the research and the application of clustering techniques to transportation data. Ultimately it is proposed to come out with recommendations on algorithms and distance measures suitable for the HIS transportation dataset. Preprocessing and suitable dimensionality reduction[5] techniques were applied to the data, and it was concluded that clustering on factor scores is more suitable than clustering on the original data. It now remains to be seen which distance measures are suitable and need to be used.

The objective of clustering the data has been to answer questions such as how rich or prosperous a household is, its mobility levels, its vehicle availability etc. The intention is to analyze why people are investing money to acquire vehicles and whether this is related to inadequacy of public transport.

Fig 1: ELKI Home Screen
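To make the role of the distance/similarity measure concrete, the sketch below writes out three measures of the kind compared in this study in plain Python (this is illustrative only, not ELKI's Java API; the sample vectors are invented):

```python
import math

# Three common dissimilarity measures for numeric vectors,
# written out directly. The sample vectors are illustrative only.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(u, v))        # 3.7416...
print(manhattan(u, v))        # 6.0
print(cosine_distance(u, v))  # ~0.0 (same direction)
```

Note how the same pair of vectors is "close" under the cosine measure but not under the Euclidean one; this sensitivity of cluster assignments to the chosen measure is what the experiments in this paper examine.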
It is also of interest whether the potential of households, and their likelihood to buy vehicles in the future, can be predicted.

The ELKI home screen provides a lot of parameters for selecting clustering algorithms and customizing them; Fig 1 shows the initial menu of the software. A large number of clustering algorithms have been implemented in ELKI, unlike in other open source software. Easy to use menus allow the researcher to select and customize the various parameters, as shown in Fig 2.

Fig 2: ELKI algorithm selection

Although not yet mature, ELKI also provides, to some extent, a visualization facility to see the data and the results of clustering, as shown in Fig 3.

IV ALGORITHM USED
This paper studies the effect of distance measures and initialization methods in partitional clustering algorithms when applied to the transportation dataset. Two of the most widely used partitional clustering algorithms are kmeans[7] and kmedoids[6] (the latter also known as partitioning around medoids, PAM). Partitional clustering algorithms construct k clusters, classifying the data into k groups which satisfy the partition requirements: each group must contain at least one object, and each object must belong to exactly one group. All these algorithms require the value of k as input. Since the dataset under consideration is numeric, both algorithms are suitable for application to it.

kmeans is more widely used than kmedoids in the research literature, but it has some drawbacks: it is designed naturally for squared Euclidean distance, and if any other similarity measure is used it may not converge to the global minimum. Precisely for this reason, the kmedoids algorithm has been chosen for further work, the intention being to change the distance measure on each run of the algorithm and observe the effect.
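The reason kmedoids accommodates arbitrary dissimilarities is that a medoid is an actual data object minimizing the summed dissimilarity to the rest of its cluster, so no averaging step (which presumes squared Euclidean geometry) is ever performed. A minimal plain-Python sketch, with invented toy points:

```python
# A medoid is an actual data object, so it is well defined for any
# dissimilarity function, not just squared Euclidean distance.
# The toy points below are invented for illustration.

def medoid(points, dissim):
    """Return the point with the smallest total dissimilarity to the others."""
    return min(points, key=lambda p: sum(dissim(p, q) for q in points))

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))

pts = [(1.0, 1.0), (0.0, 0.0), (2.0, 0.0), (0.0, 3.0), (10.0, 10.0)]
print(medoid(pts, manhattan))  # → (1.0, 1.0)
```

Swapping `manhattan` for any other dissimilarity function changes nothing structurally, which is exactly the property exploited when the distance measure is varied between runs.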
A. Algorithm
The kmedoids algorithm has two phases[6], a build phase and a swap phase. In the build phase, an initial clustering is obtained by selecting k representative objects. The first object selected is the one for which the sum of dissimilarities to all other objects is as small as possible. The remaining k-1 objects are selected by repeating the following steps[6]:
1. Consider an object i which is not yet selected.
2. For each non-selected object j, calculate the difference between its dissimilarity Dj to the most similar previously selected object and its dissimilarity d(j,i) to object i.
3. If this difference is positive, object j contributes to the decision to select object i; therefore Cji = max(Dj - d(j,i), 0).
4. Calculate the total gain obtained by selecting object i: ΣCji.
5. Choose the not yet selected object i which maximizes ΣCji.
6. These steps are repeated until k objects have been found.

Fig 3: Visualization of Data

The second phase consists of trying to improve the representative objects, thereby improving the overall

c) FarthestPointsInitialMeans: The farthest-points method picks an arbitrary point p, then picks a point q that is as far from p as possible, and then selects r to maximize the distance to the nearest of all centroids picked so far, that is, to maximize min{dist(r,p), dist(r,q), ...}. After all k representatives have been chosen, the data D is partitioned: a cluster Cj consists of all points closer to representative sj than to any other representative.
d) PAMInitialMeans: This method[6] is based on the partitioning around medoids (PAM) algorithm, which searches for k representative objects in the dataset
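The build-phase steps and the farthest-points initialization described above can be sketched in plain Python as follows. This is one illustrative reading of the published steps, not ELKI's implementation, and the toy data is invented:

```python
# A minimal sketch of the PAM build phase (steps 1-6 above) and of
# farthest-points initialization, against an arbitrary dissimilarity
# function. Illustrative only; not ELKI's Java implementation.

def pam_build(points, dissim, k):
    """Greedily select k initial medoids (PAM build phase)."""
    # First medoid: the object with the smallest total dissimilarity.
    selected = [min(range(len(points)),
                    key=lambda i: sum(dissim(points[i], p) for p in points))]
    while len(selected) < k:
        best_gain, best_i = -1.0, None
        for i in range(len(points)):              # step 1: candidate i
            if i in selected:
                continue
            gain = 0.0
            for j in range(len(points)):          # step 2: non-selected j
                if j in selected or j == i:
                    continue
                D_j = min(dissim(points[j], points[s]) for s in selected)
                # step 3: contribution C_ji = max(D_j - d(j, i), 0)
                gain += max(D_j - dissim(points[j], points[i]), 0.0)
            if gain > best_gain:                  # steps 4-5: maximize sum C_ji
                best_gain, best_i = gain, i
        selected.append(best_i)                   # step 6: repeat until k found
    return [points[i] for i in selected]

def farthest_points_init(points, dissim, k):
    """Farthest-points initialization (method (c) above): repeatedly take
    the point farthest from its nearest already-chosen representative."""
    chosen = [points[0]]                          # arbitrary first point
    while len(chosen) < k:
        chosen.append(max(points,
                          key=lambda r: min(dissim(r, c) for c in chosen)))
    return chosen

euclidean = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),     # one tight group
        (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]   # another tight group
print(pam_build(data, euclidean, 2))             # one medoid from each group
print(farthest_points_init(data, euclidean, 2))  # one point from each group
```

Because both routines only ever call `dissim`, either can be rerun unchanged with any of the distance measures compared in this study.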