
Effect of Distance Measures on Partitional Clustering Algorithms using Transportation Data

Sesham Anand #1, P Padmanabham *2, A Govardhan #3

#1 Dept of CSE, M.V.S. Engg College, Hyderabad, India
*2 Director, Bharath Group of Institutions, BIET, Hyderabad, India
#3 Dept of CSE, SIT, JNTU Hyderabad, India

Abstract— Similarity/dissimilarity measures in clustering algorithms play an important role in grouping data and in finding out how well the data differ from each other. The importance of clustering algorithms for transportation data has been illustrated in previous research. This paper compares the effect of different distance/similarity measures on a partitional clustering algorithm, kmedoids (PAM), using a transportation dataset. A recently developed open source data mining software, ELKI, has been used and the results are illustrated. Different distances have been selected for the kmedoids algorithm, the resulting values of several validity indices have been recorded, and suitable conclusions have been drawn. All the cluster evaluation measures and distances used in the work are discussed.

Keywords— clustering, transportation data, partitional algorithms, cluster validity, distance measures

I. INTRODUCTION
Clustering, as an unsupervised data mining technique, has been widely used in various applications for analyzing data. Kmeans and kmedoids are two very widely used partitional algorithms. Since kmeans is more suitable for the Euclidean distance metric, the kmedoids algorithm has been chosen in order to study other distance measures. Details of the transportation data, its attributes, the aim of the research and the application of clustering have been presented previously[2,3,4,5]. One other aim of the research was to compare different open source software[4]. Previous research illustrated the importance and use of clustering on transportation data and indicated the results of applying it. It has been observed that the clustering results varied with different algorithms and different parameters.
The need of the current research is to study the effect of distance metrics on the clustering algorithms, to come out with recommendations on suitable distance measures, and to decide whether a particular distance measure needs modification specifically for the transportation dataset. The work has been done using the ELKI[1] software, a recently developed Java based data mining tool.

II ELKI SOFTWARE
The software used for the present research is ELKI[1], which stands for Environment for Developing KDD-Applications Supported by Index-Structures. It is a Java based open source software; the version used here is ELKI 0.6.0. The original aim of ELKI is to be used in research and teaching of algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection[1]. High performance is achieved by using many data index structures such as the R*-tree. ELKI is designed to be easy for researchers to extend, particularly in the clustering domain. It provides a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms[1].
Data mining research usually leads to many algorithms for similar kinds of tasks, and a comparison often has to be made between these algorithms. In ELKI, data mining algorithms and data management tasks are separated, allowing for an independent evaluation. This aspect is unique to ELKI among data mining frameworks such as Rapidminer, Tanagra etc. ELKI has many implementations of distance and similarity measures, file formats and data types, and is open to the development of new ones. Visualization of data and results is also available, and ELKI supports evaluation of different algorithms using measures such as pair-counting, entropy based, BCubed based, set-matching based, editing-distance based and Gini measures.
Although not exhaustive, Table 1 shows the wide variety of clustering algorithms implemented in ELKI. All algorithms are highly efficient, and a wide variety of parameters can be used to customize them.

Table 1: Clustering algorithms in ELKI
Category                 Algorithms Implemented
Partitional Clustering   kmeans (Lloyd and MacQueen), kmedoids (PAM & EM), kMedians (Lloyd), kmeansBisecting, kmeansBatchedLloyd, kmeansHybridLloydMacQueen
Hierarchical             ExtractFlatClusteringFromHierarchy, NaiveAgglomerativeHierarchicalClustering, SLINK
Correlation              CASH, COPAC, ERiC, FourC, HiCO, LMCLUS, ORCLUS
Subspace                 CLIQUE, DiSH, DOC, HiSC, P3C, PROCLUS, SUBCLU, PreDeCon
Others                   Canopy PreClustering, DBScan, DeLiClu, EM, AffinityPropagationClustering
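For orientation, runs of the kind reported in Section VIII can be launched through ELKI's command line entry point, KDDCLIApplication, which is a real ELKI 0.6 class. The sketch below is only an assumption-laden illustration: the parameter and class names passed to it follow ELKI 0.6 conventions but are not verified here, and "transport.csv" is a hypothetical file; consult the ELKI documentation for the exact flags.

```java
// Illustrative only: flag and class names below are assumptions
// based on ELKI 0.6 conventions, not verified against the docs;
// "transport.csv" is a hypothetical input file.
public class RunElki {
    public static void main(String[] args) {
        de.lmu.ifi.dbs.elki.application.KDDCLIApplication.main(new String[] {
            "-dbc.in", "transport.csv",                    // input dataset
            "-algorithm", "clustering.kmeans.KMedoidsPAM", // PAM
            "-kmeans.k", "4",                              // k = 4 as in Section VIII
            "-algorithm.distancefunction",
            "minkowski.ManhattanDistanceFunction"          // swapped per run
        });
    }
}
```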


III DEFINITION AND MOTIVATION FOR CURRENT RESEARCH
The authors' previous research[2,3,4,5] has presented the original aim of the research and the application of clustering techniques to transportation data. Ultimately it is proposed to come out with recommendations on algorithms and distance measures suitable for the HIS transportation dataset. Preprocessing and suitable techniques[5] were applied to the data, and it was concluded that clustering on factor scores is more suitable than clustering on the original data. It now remains to be seen which distance measures are suitable and should be used.
The objective of clustering the data has been to determine answers to questions such as how rich or prosperous a household is, its mobility levels, vehicle availability etc. The intention is to analyze why people are investing money to acquire vehicles and whether this is related to inadequacy of public transport, and also whether it is possible to predict which households are likely to buy vehicles in future.
The ELKI home screen provides a lot of parameters for selecting clustering algorithms and customizing them; Fig 1 shows the initial menu of the software. A large number of clustering algorithms have been implemented in ELKI, unlike in other open source software. Easy to use menus allow the researcher to select and customize various parameters, as shown in Fig 2.

Fig 1: ELKI Home Screen

Fig 2: ELKI algorithm selection

IV ALGORITHM USED
This paper is intended to study the effect of distance measures and initialization methods in partitional clustering algorithms when applied to the transportation dataset. Two of the most widely used partitional clustering algorithms are kmeans[7] and kmedoids[6] (also known as partitioning around medoids, PAM). Partitional clustering algorithms construct k clusters, i.e. they classify the data into k groups which satisfy the partition requirements: each group must contain at least one object, and each object must belong to exactly one group. All these algorithms require the value of k as input. Since the dataset under consideration is numeric, these two algorithms are suitable for it.
The kmeans algorithm is more widely used than kmedoids in the research literature, but it has some drawbacks. kmeans[7] is naturally designed for squared Euclidean distance and its derivatives; if any other similarity measure is used, it may not converge to the global minimum. Precisely for this reason, the kmedoids algorithm has been chosen for further work, the intention being to change the distance measure for each run of the algorithm and observe the effect.
Although not yet mature, ELKI also provides, to some extent, a visualization facility for viewing the data and the clustering results, as shown in Fig 3.

Fig 3: Visualization of Data

A Algorithm
The kmedoids algorithm has two phases[6], a build phase and a swap phase. In the build phase, an initial clustering is obtained by selecting k representative objects. The first object selected is the one for which the sum of dissimilarities to all other objects is as small as possible. The remaining k-1 objects are selected through the following steps[6]:
1. Consider an object i which is not yet selected.
2. For a non-selected object j, calculate the difference between its dissimilarity Dj with the most similar previously selected object and its dissimilarity d(j,i) with object i.
3. If this difference is positive, object j contributes to the decision to select object i: Cji = max(Dj - d(j,i), 0).
4. Calculate the total gain obtained by selecting object i: Σj Cji.
5. Choose the not yet selected object i which maximizes Σj Cji.
6. Repeat these steps until k objects have been found.
The second phase tries to improve the representative objects, and thereby the overall clustering. It considers all pairs of objects (i,h) in which object i has been selected and object h has not; the effect on the clustering of swapping i and h is determined, and if any improvement is seen the swap is carried out. More details of the algorithm can be found in [6].
The advantage of the kmedoids algorithm is that, unlike kmeans, it easily adapts itself to different distance measures. Different distance measures were therefore applied to the same algorithm, and the results of the different runs are reported below. The ELKI software allows the researcher to change the distance function and rerun the same algorithm; this can be extended to all other algorithms implemented in ELKI.
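To make the two phases concrete, the following is a minimal Java sketch of the algorithm just described. It is not ELKI's implementation: the class and helper names are ours, and the swap phase is simplified to a plain hill-climb over total cost rather than Kaufman and Rousseeuw's exact swap-cost computation[6]. The dissimilarity is a plug-in parameter, which is exactly what allows a different distance measure per run.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal PAM sketch (illustrative, simplified swap phase).
public class PamSketch {

    // Any dissimilarity measure over numeric vectors.
    interface Distance { double of(double[] x, double[] y); }

    // Dissimilarity of point x to its nearest medoid.
    static double nearestDist(double[] x, double[][] d, List<Integer> med, Distance f) {
        double best = Double.MAX_VALUE;
        for (int m : med) best = Math.min(best, f.of(x, d[m]));
        return best;
    }

    // Total cost: sum of dissimilarities to nearest medoids.
    static double cost(double[][] d, List<Integer> med, Distance f) {
        double s = 0;
        for (double[] x : d) s += nearestDist(x, d, med, f);
        return s;
    }

    static List<Integer> pam(double[][] d, int k, Distance f) {
        int n = d.length;
        List<Integer> med = new ArrayList<>();
        // Build phase: first medoid minimizes total dissimilarity.
        int first = 0; double bestSum = Double.MAX_VALUE;
        for (int i = 0; i < n; i++) {
            double s = 0;
            for (double[] y : d) s += f.of(d[i], y);
            if (s < bestSum) { bestSum = s; first = i; }
        }
        med.add(first);
        // Steps 1-6: greedily add the object with the largest
        // total gain  sum_j max(Dj - d(j,i), 0).
        while (med.size() < k) {
            int pick = -1; double bestGain = -1;
            for (int i = 0; i < n; i++) {
                if (med.contains(i)) continue;
                double gain = 0;
                for (int j = 0; j < n; j++) {
                    if (j == i || med.contains(j)) continue;
                    double dj = nearestDist(d[j], d, med, f);
                    gain += Math.max(dj - f.of(d[j], d[i]), 0);
                }
                if (gain > bestGain) { bestGain = gain; pick = i; }
            }
            med.add(pick);
        }
        // Swap phase (simplified): replace a medoid by a non-medoid
        // whenever that lowers total cost; repeat until stable.
        double current = cost(d, med, f);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int m = 0; m < k; m++) {
                int keep = med.get(m);
                for (int h = 0; h < n; h++) {
                    if (med.contains(h)) continue;
                    med.set(m, h);
                    double c = cost(d, med, f);
                    if (c < current) { current = c; keep = h; improved = true; }
                    med.set(m, keep);
                }
            }
        }
        return med;
    }
}
```

A run with k = 4 then reads, e.g., `pam(data, 4, Distances::manhattan)`, using any of the distance implementations sketched under Table 2 below.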


V EVALUATION OF DISTANCE MEASURES USING K-MEDOIDS ON TRANSPORTATION DATA
A Review of Distance Metrics for Numerical Data
Some of the distance functions suitable for numeric data which are used in this paper are presented in Table 2. The formulas follow the standard definitions (cf. [8]); x and y are n-dimensional data points with components xi and yi.

Table 2: Distance functions for numeric data
Distance               Formula
Euclidean Distance     d(x,y) = sqrt( Σi (xi - yi)^2 )
Manhattan Distance     d(x,y) = Σi |xi - yi|
LPNorm Distance        Lp(x,y) = ( Σi |xi - yi|^p )^(1/p)
ArcCosine Distance     d(x,y) = arccos( x·y / (||x|| ||y||) )
Canberra Distance      d(x,y) = Σi |xi - yi| / (|xi| + |yi|)
Kulczynski1 Distance   d(x,y) = Σi |xi - yi| / Σi min(xi, yi)
Lorentzian Distance    d(x,y) = Σi ln(1 + |xi - yi|)
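For reference, the sketch below gives minimal Java implementations of the measures in Table 2, again following the standard definitions in [8] rather than ELKI's own code; each method matches the PamSketch.Distance interface above.

```java
// Standard distance definitions (cf. [8]); x, y are equal-length vectors.
public final class Distances {
    public static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }
    public static double manhattan(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += Math.abs(x[i] - y[i]);
        return s;
    }
    public static double lpNorm(double[] x, double[] y, double p) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += Math.pow(Math.abs(x[i] - y[i]), p);
        return Math.pow(s, 1.0 / p);
    }
    // Angle between the vectors; undefined for zero vectors.
    public static double arcCosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i]; nx += x[i] * x[i]; ny += y[i] * y[i];
        }
        return Math.acos(dot / Math.sqrt(nx * ny));
    }
    // Terms with |xi| + |yi| = 0 are skipped to avoid 0/0.
    public static double canberra(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
            double den = Math.abs(x[i]) + Math.abs(y[i]);
            if (den > 0) s += Math.abs(x[i] - y[i]) / den;
        }
        return s;
    }
    // Undefined when the denominator is 0.
    public static double kulczynski1(double[] x, double[] y) {
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += Math.abs(x[i] - y[i]);
            den += Math.min(x[i], y[i]);
        }
        return num / den;
    }
    public static double lorentzian(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += Math.log(1 + Math.abs(x[i] - y[i]));
        return s;
    }
}
```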

VI INITIALIZATION METHODS
The initial k representative objects can be selected in many ways. Some of the methods used in the current work are:
a) RandomlyChosenInitialMeans: Used by default in the kmeans algorithm; k objects are selected from the data at random.
b) FirstkInitialMeans: The simplest method; the first k objects in the data are taken as the representatives of the k clusters.
c) FarthestPointsInitialMeans: The farthest-point method picks an arbitrary point p, then picks a point q that is as far from p as possible, then selects r to maximize the distance to the nearest of all centroids picked so far, i.e. to maximize min{dist(r,p), dist(r,q), ...}, and so on (see the sketch after this list). After all k representatives are chosen, the data D is partitioned: a cluster Cj consists of all points closer to representative sj than to any other representative.
d) PAMInitialMeans: This method[6] is based on the partitioning around medoids (PAM) algorithm, which searches for k representative objects in the dataset, called the medoids of the clusters, such that the sum of dissimilarities of the objects to their closest representative is minimized. The first of the k objects[6] is the one for which the sum of dissimilarities to all other objects is as small as possible; the rest are chosen in subsequent steps.
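The following is a minimal sketch of the farthest-point method in c), with hypothetical helper names, reusing the Distance interface from the PAM sketch above; the starting point is taken arbitrarily as index 0 and ties are broken arbitrarily.

```java
import java.util.ArrayList;
import java.util.List;

// Farthest-point initialization: repeatedly add the point whose
// distance to its nearest already-chosen representative is largest.
public class FarthestPoints {
    static List<Integer> pick(double[][] data, int k, PamSketch.Distance f) {
        List<Integer> reps = new ArrayList<>();
        reps.add(0); // arbitrary starting point p
        while (reps.size() < k) {
            int far = -1; double farDist = -1;
            for (int r = 0; r < data.length; r++) {
                if (reps.contains(r)) continue;
                // distance to the nearest representative chosen so far
                double near = Double.MAX_VALUE;
                for (int s : reps) near = Math.min(near, f.of(data[r], data[s]));
                if (near > farDist) { farDist = near; far = r; }
            }
            reps.add(far);
        }
        return reps;
    }
}
```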

VII COMPARISON MEASURES
Clustering algorithms retrieve the inherent partitions in the data. Although all clustering algorithms partition the data, each of them reveals different clusters based on different input parameters; hence validating the resulting clustering is very important. There are three kinds of cluster validation criteria, namely internal, external and relative. Validation is also important because some datasets may not have a proper inherent cluster structure, which makes the clustering output meaningless. Let the dataset be X, let C denote the clustering structure obtained when a particular algorithm is applied, and let P be a prespecified partition of the dataset X with N data points. All the criteria chosen here are external validity indices; they are presented in Table 3. In the formulas, a is the number of point pairs placed in the same cluster of C and the same group of P, b the number in the same cluster of C but different groups of P, c the number in different clusters of C but the same group of P, and M the total number of pairs; n_ij is the number of points of class i falling in cluster j, n_i the size of class i, and c_j the size of cluster j.

Table 3: Cluster validity indices
Validity Index       Formula
Jaccard[12]          J = a / (a + b + c)
Rand[12]             R = (a + d) / M, where d = M - a - b - c
FowlkesMallows[11]   FM = sqrt( (a / (a + b)) * (a / (a + c)) )
F1-Measure[12]       F(i,j) = 2 * Precision(i,j) * Recall(i,j) / (Precision(i,j) + Recall(i,j))
Precision[12]        Precision(i,j) = n_ij / c_j
Recall[12]           Recall(i,j) = n_ij / n_i

The larger the values of the Rand, Jaccard and FowlkesMallows indices, the better the partitioning of the data.
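The pair-counting indices of Table 3 can be computed from the four counts a, b, c, d over all point pairs. A minimal sketch using the standard definitions[10,11,12] (class names are ours; degenerate denominators, e.g. a single all-encompassing cluster, are not guarded against):

```java
// Pair-counting external indices: for every pair of points count
//   a: same cluster in C and same group in P
//   b: same cluster in C, different groups in P
//   c: different clusters in C, same group in P
//   d: different in both
public class PairCounting {
    static double[] indices(int[] clusterOf, int[] classOf) {
        long a = 0, b = 0, c = 0, d = 0;
        int n = clusterOf.length;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                boolean sameC = clusterOf[i] == clusterOf[j];
                boolean sameP = classOf[i] == classOf[j];
                if (sameC && sameP) a++;
                else if (sameC) b++;
                else if (sameP) c++;
                else d++;
            }
        double rand    = (double) (a + d) / (a + b + c + d);
        double jaccard = (double) a / (a + b + c);
        double fm      = Math.sqrt((double) a / (a + b) * (double) a / (a + c));
        return new double[] { rand, jaccard, fm };
    }
}
```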


VIII RESULTS
The kmedoids (PAM) algorithm has been run with k = 4 throughout; for each run, an initialization method and a particular distance measure were selected, and the resulting values of the validity indices are noted in Tables 4 to 8, one table per initialization method.

TABLE 4: Values for PAMInitialMeans
Distance Measure      F1-Measure   Jaccard   Recall   Rand   FowlkesMallows
Euclidean             0.75         0.6       0.6      0.6    0.78
Manhattan             0.72         0.6       0.6      0.6    0.8
LPNormDistance        0.75         0.6       0.6      0.6    0.78
ArcCosineDistance     0.4          0.3       0.3      0.3    0.5
CanberraDistance      0.4          0.3       0.3      0.3    0.5
Kulczynski1Distance   0.5          0.3       0.3      0.3    0.56
LorentzianDistance    0.49         0.33      0.33     0.33   0.57

TABLE 5: Values for RandomlyChosenInitialMeans
Distance Measure      F1-Measure   Jaccard   Recall   Rand   FowlkesMallows
Euclidean             0.4          0.25      0.25     0.25   0.5
Manhattan             0.69         0.53      0.53     0.53   0.73
LPNormDistance        0.5          0.33      0.33     0.33   0.57
ArcCosineDistance     0.4          0.25      0.25     0.25   0.5
CanberraDistance      0.41         0.26      0.26     0.26   0.51
Kulczynski1Distance   0.4          0.25      0.25     0.25   0.5
LorentzianDistance    0.49         0.33      0.33     0.33   0.57

TABLE 6: Values for FirstKInitialMeans
Distance Measure      F1-Measure   Jaccard   Recall   Rand   FowlkesMallows
Euclidean             0.5          0.33      0.33     0.33   0.57
Manhattan             0.5          0.33      0.33     0.33   0.58
LPNormDistance        0.5          0.33      0.33     0.33   0.57
ArcCosineDistance     0.4          0.25      0.25     0.25   0.5
CanberraDistance      0.52         0.36      0.36     0.36   0.6
Kulczynski1Distance   0.4          0.25      0.25     0.25   0.5
LorentzianDistance    0.49         0.33      0.33     0.33   0.57

TABLE 7: Values for FarthestPointsInitialMeans
Distance Measure      F1-Measure   Jaccard   Recall   Rand   FowlkesMallows
Euclidean             0.75         0.61      0.61     0.61   0.78
Manhattan             0.76         0.61      0.61     0.61   0.78
LPNormDistance        0.75         0.61      0.61     0.61   0.78
ArcCosineDistance     0.4          0.25      0.25     0.25   0.5
CanberraDistance      0.47         0.31      0.31     0.31   0.56
Kulczynski1Distance   0.7          0.54      0.54     0.54   0.73
LorentzianDistance    0.75         0.6       0.6      0.6    0.77

TABLE 8: Values for RandomlyGeneratedInitialMeans
Distance Measure      F1-Measure   Jaccard   Recall   Rand   FowlkesMallows
Euclidean             0.4          0.25      0.25     0.25   0.5
Manhattan             0.5          0.34      0.34     0.34   0.58
LPNormDistance        0.4          0.25      0.25     0.25   0.5
ArcCosineDistance     0.4          0.25      0.25     0.25   0.5
CanberraDistance      0.48         0.32      0.32     0.32   0.56
Kulczynski1Distance   0.49         0.32      0.32     0.32   0.57
LorentzianDistance    0.4          0.25      0.25     0.25   0.5

A RESULT ANALYSIS
Looking at the results, it is observed that the Manhattan distance seems to be the best distance function to use in kmedoids for the given data. All the validity indices gave consistently high values when the initialization method used was FarthestPointsInitialMeans.

IX CONCLUSIONS
A preliminary study has been made of the effect of the distance functions used in a partitional algorithm such as kmedoids, and of the initialization methods used to select the initial cluster centres. It is shown that the clustering results vary when these parameters are changed, with farthest point initialization giving the best results. Further research is planned wherein the instances falling in different clusters will be studied and a suitable weighted distance function will be designed.


REFERENCES
[1] Elke Achtert, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek, Interactive Data Mining with 3D-Parallel-Coordinate-Trees, Proceedings of the ACM International Conference on Management of Data (SIGMOD), New York City, NY, June 22-27, 2013.
[2] Sesham Anand, Sai Hanuman A, Govardhan A, and Padmanabham P, Application of Data Mining Techniques to Transportation Demand Modelling Using Home Interview Survey Data, International Conference on Systemics, Cybernetics and Informatics, 2008.
[3] Sesham Anand, Sai Hanuman A, Govardhan A, and Padmanabham P, Use of Data Mining Techniques in understanding Home Interview Surveys Employed for Travel Demand Estimation, International Conference on Data Mining (DMIN '08), Las Vegas, USA, 2008.
[4] Sai Hanuman A, Sesham Anand, Vinaybabu A, Govardhan A and Padmanabham P, Efficient Use of Open Source Software for Knowledge Discovery from Complex, 2009 IEEE International Advance Computing Conference (IACC 2009), Patiala, India, 6-7 March 2009.
[5] Sesham Anand, P. Padmanabham, A. Govardhan, A. Sai Hanuman, Performance of Clustering Algorithms on Home Interview Survey Data Employed for Travel Demand Estimation, International Journal of Computer Science and Information Technologies, Vol. 5 (3), 2014, 2767-2771.
[6] Kaufman, L. and Rousseeuw, P.J., Clustering by means of medoids, in: Statistical Data Analysis Based on the L1-Norm and Related Methods, edited by Y. Dodge, Elsevier/North-Holland, Amsterdam, 1987, pp. 405-416.
[7] J. B. MacQueen, Some methods for classification and analysis of multivariate observations, in: L. M. Le Cam and J. Neyman (eds.), Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297, University of California Press, 1967.
[8] Elena Deza, Michel Marie Deza, Dictionary of Distances, Elsevier, 2006.
[9] Rui Xu, Donald L. Wunsch II, Clustering, IEEE Press / John Wiley & Sons, 2009, ISBN: 978-0-470-27680-8.
[10] Rand, W.M., Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66: 846-850, 1971.
[11] Fowlkes, E. and Mallows, C., A method for comparing two hierarchical clusterings, Journal of the American Statistical Association, 78: 553-569, 1983.
[12] Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data, Cambridge University Press, 2007.
