International Journal of Computer Science and Communication Vol. 2, No. 1, January-June 2011, pp. 101-104

CURE: CLUSTERING ON SEQUENTIAL DATA FOR WEB PERSONALIZATION: TESTS AND EXPERIMENTAL RESULTS

K. Santhisree and A. Damodaram
Department of Computer Science, Jawaharlal Nehru Technological University, Hyderabad

ABSTRACT

The World Wide Web is a rich source of multi-disciplinary data for knowledge discovery research. In this paper we apply the CURE (Clustering Using Representatives) algorithm to find clusters in web usage data. We use data from the MSNBC.COM website, a free news website with different categories of news and subjects. After generating the clusters with CURE, the average inter-cluster and intra-cluster distances are calculated, and the results are compared across different similarity measures: Euclidean, Jaccard, projected Euclidean, cosine and fuzzy similarity. Finally, we show the behavior of the clusters produced by CURE on sequential data in the web usage domain, quantify the results, and list our conclusions.

Keywords: Clustering Using Representatives, inter cluster, intra cluster, ALD measure, similarity measures

1. INTRODUCTION

A web session is a continuous stream of hyperlink clicks recorded in a web log. Clustering web sessions means grouping them according to their similarity, maximizing the intra-group similarity while minimizing the inter-group similarity. Clustering web sessions is very important for web usage mining: it can be used to discover users' usage patterns and behavior from web data in order to better understand and serve the needs of web-based applications. The most important question to answer when clustering web sessions is how to measure the similarity between two web sessions. Most previous related work applies either Euclidean distance for vectors or set similarity measures such as Cosine or Jaccard. The order of the sequence is considered in the earlier experiments.

The data is typically collected by web servers in large logs. Data mining from web access logs consists of three sequential steps: data collection; data preprocessing and formatting of the log entries; and, finally, pattern analysis, which consists of retrieving and studying the behavior of the discovered patterns by the user.

The rest of the paper is organized as follows. In Section 2 we review the related work done by various researchers in the past. We explain the CURE algorithm in Section 3 and the distance measures in Section 4, and then describe the dataset and the data preprocessing in Section 5. Section 6 presents the experiments and the results obtained with the different similarity measures. Section 7 concludes the paper, and the references used are listed at the end.

2. RELATED WORK

Many researchers have worked on the CURE clustering technique in the past. Listed below are some papers on clustering and the CURE algorithm. Sudipto Guha et al. [7] proposed the CURE algorithm and showed its efficiency on large databases. Qian Yuntao, Shi Qingsong and Wang Qi [5] found a relation between the shrinking scheme of CURE and the hidden assumption of spherically shaped clusters. G. Adomavicius et al. [1] presented a new approach to discovering clusters in very large amounts of continuously arriving data, treating the data as a dataset and applying clustering techniques to it. Ji Dan et al. [2] presented a new synthesized algorithm named CA, which combines an improved CURE algorithm and C4.5 with the help of PCA for scale reduction; they then applied CA to a maize seed breeding dataset and showed experimental results. M. Kaya and R. Alhajj [3] proposed an automated method for mining fuzzy association rules with the help of a genetic algorithm and CURE. In another paper, Mehmet Kaya, Reda Alhajj, Faruk Polat and Ahmet Arslan [4] worked on autonomous mining of both fuzzy sets and fuzzy association rules and proposed an automated method. Parthasarathy, Zaki, Ogihara and Dwarkadas [6] worked on the discovery of clusters from database updates. They proposed a method based on the SPADE algorithm for incremental and interactive frequent sequence mining.

3. CURE CLUSTERING ALGORITHM

Clustering Using Representatives (CURE) is a clustering technique that overcomes the problem of favoring clusters with spherical shape and similar sizes, and is more robust with respect to outliers. It adopts a middle ground between the centroid-based and all-points approaches. Instead of using a single centroid or object to represent a cluster, a fixed number of representative points are chosen.

Pseudocode

CURE (no. of points, k)
Input  : A set of points S
Output : k clusters
Step 1: Draw a random sample, s, of the original objects.
Step 2: Partition sample s into a set of partitions.
Step 3: Partially cluster each partition.
Step 4: Eliminate outliers by random sampling. If a cluster grows too slowly, remove it.
Step 5: Cluster the partial clusters.
Step 6: Mark the data with the corresponding cluster labels.

4. DISTANCE MEASURES

Euclidean distance measure: It is the distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. The Euclidean distance between two sequences p = (p_1, ..., p_n) and q = (q_1, ..., q_n) is defined as

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (1)$$

Jaccard similarity measure: The Jaccard coefficient measures the similarity between sequences and is defined as the size of the intersection divided by the size of the union of the two sequences:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (2)$$

Cosine similarity measure: Cosine similarity is a measure of similarity between two vectors obtained by finding the cosine of the angle between them:

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (3)$$

Fuzzy dissimilarity: Given two ordered fuzzy sets $s_1 = (s_{i1}, s_{i2}, s_{i3}, \ldots, s_{in})$ and $s_2 = (s_{j1}, s_{j2}, s_{j3}, \ldots, s_{jn})$, the fuzzy similarity between them is defined as

$$\text{FuzzySim}(s_1, s_2) = \frac{s_1 \cap s_2}{s_1 \cup s_2} \qquad (4)$$
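The four measures of Section 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments. The function names are hypothetical, and the fuzzy measure assumes the common convention of taking the element-wise minimum as the fuzzy intersection, the element-wise maximum as the fuzzy union, and the sum of memberships as the set size; the paper does not state which convention it uses.

```python
import math


def euclidean_distance(p, q):
    """Equation (1): square root of the summed squared coordinate differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def jaccard_similarity(a, b):
    """Equation (2): |A intersect B| / |A union B|, treating each session as a set."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sessions are treated as identical (an assumption).
    return len(a & b) / len(union) if union else 1.0


def cosine_similarity(a, b):
    """Equation (3): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; return 0.0 in that case (an assumption).
    return dot / norm if norm else 0.0


def fuzzy_similarity(s1, s2):
    """Equation (4), assuming min/max as fuzzy intersection/union."""
    intersection = sum(min(x, y) for x, y in zip(s1, s2))
    union = sum(max(x, y) for x, y in zip(s1, s2))
    return intersection / union if union else 1.0
```

For example, `euclidean_distance((0, 0), (3, 4))` evaluates to 5.0, and `jaccard_similarity([1, 2, 3], [2, 3, 4])` evaluates to 0.5 because two of the four distinct pages are shared.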

Table 1: Clusters Formed Using Cosine Similarity

cosine  C1    C2    C3    C4    C5    C6    C7    C8    C9    C10   C11   C12   C13   C14   C15   C16   C17   C18
C1      0.00  0.15  0.16  0.16  0.16  0.16  0.17  0.17  0.18  0.18  0.19  0.19  0.21  0.22  0.23  0.24  0.25  0.25
C2      0.15  0.00  0.13  0.13  0.14  0.13  0.13  0.14  0.15  0.16  0.17  0.18  0.21  0.22  0.23  0.24  0.25  0.26
C3      0.16  0.13  0.00  0.12  0.14  0.15  0.15  0.15  0.16  0.17  0.19  0.21  0.26  0.27  0.27  0.31  0.31  0.29
C4      0.16  0.13  0.12  0.00  0.18  0.18  0.18  0.19  0.19  0.19  0.21  0.21  0.22  0.23  0.24  0.24  0.25  0.26
C5      0.16  0.14  0.14  0.18  0.00  0.16  0.16  0.16  0.17  0.17  0.18  0.18  0.19  0.21  0.20  0.20  0.21  0.22
C6      0.16  0.13  0.15  0.18  0.16  0.00  0.16  0.17  0.18  0.21  0.22  0.23  0.24  0.24  0.25  0.25  0.25  0.25
C7      0.17  0.13  0.15  0.18  0.16  0.16  0.00  0.21  0.21  0.22  0.23  0.24  0.25  0.25  0.25  0.26  0.26  0.27
C8      0.17  0.14  0.15  0.19  0.16  0.17  0.21  0.00  0.18  0.18  0.18  0.18  0.19  0.19  0.19  0.21  0.21  0.22
C9      0.18  0.15  0.16  0.19  0.17  0.18  0.21  0.18  0.00  0.21  0.21  0.22  0.24  0.25  0.26  0.27  0.21  0.22
C10     0.18  0.16  0.17  0.19  0.17  0.21  0.22  0.18  0.21  0.00  0.22  0.23  0.23  0.24  0.24  0.25  0.25  0.27
C11     0.19  0.17  0.19  0.21  0.18  0.22  0.23  0.18  0.21  0.22  0.00  0.18  0.19  0.19  0.21  0.21  0.22  0.22
C12     0.19  0.18  0.21  0.21  0.18  0.23  0.24  0.18  0.22  0.23  0.18  0.00  0.19  0.14  0.14  0.21  0.24  0.24
C13     0.21  0.21  0.26  0.22  0.19  0.24  0.25  0.19  0.24  0.23  0.19  0.19  0.00  0.21  0.22  0.23  0.24  0.25
C14     0.22  0.22  0.27  0.23  0.21  0.24  0.25  0.19  0.25  0.24  0.19  0.14  0.21  0.00  0.25  0.26  0.26  0.27
C15     0.23  0.23  0.27  0.24  0.20  0.25  0.25  0.19  0.26  0.24  0.21  0.14  0.22  0.25  0.00  0.24  0.24  0.24
C16     0.24  0.24  0.31  0.24  0.20  0.25  0.26  0.21  0.27  0.25  0.21  0.21  0.23  0.26  0.24  0.00  0.19  0.23
C17     0.25  0.25  0.31  0.25  0.21  0.25  0.26  0.21  0.21  0.25  0.22  0.24  0.24  0.26  0.24  0.19  0.00  0.19
C18     0.25  0.26  0.29  0.26  0.22  0.25  0.27  0.22  0.22  0.27  0.22  0.24  0.25  0.27  0.24  0.23  0.19  0.00

Table 2: Number of Clusters Formed, Mean and Standard Deviation Using Various Similarity Measures

Similarity measure   Euclidean  Projected Euclidean  Cosine  Fuzzy dissimilarity  Jaccard
No. of clusters      16         21                   18      19                   23
Mean                 0.146      0.179                0.198   0.205                0.205
Standard deviation   0.025      0.027                0.0267  0.029                0.026
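Summary statistics like the mean and standard deviation in Table 2 can be computed from a pairwise distance matrix. The sketch below is a hedged illustration: the paper does not say exactly which distances were aggregated or whether a population or sample standard deviation was used, so it assumes the distinct off-diagonal entries of a symmetric matrix and reports the population form. The helper name `upper_triangle` is hypothetical, and the 3 x 3 block is copied from the C1 to C3 corner of Table 1.

```python
from statistics import mean, pstdev


def upper_triangle(matrix):
    """Return the distinct pairwise entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


# C1..C3 corner of Table 1 (cosine measure), used only as sample input.
block = [
    [0.00, 0.15, 0.16],
    [0.15, 0.00, 0.13],
    [0.16, 0.13, 0.00],
]

values = upper_triangle(block)      # the three distinct cluster pairs
summary_mean = mean(values)         # average pairwise distance
summary_std = pstdev(values)        # population standard deviation (assumption)
```

Swapping `pstdev` for `statistics.stdev` gives the sample standard deviation instead.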

Table 3: Intra Cluster Distance for the Clusters Formed Using Different Similarity Measures

Cluster  Euclidean  Jaccard  Projected Euclidean  Cosine similarity  Fuzzy distance
C1       0.18       0.23     0.14                 0.27               0.13
C2       0.13       0.21     0.13                 0.23               0.15
C3       0.12       0.22     0.15                 0.25               0.15
C4       0.16       0.22     0.16                 0.25               0.13
C5       0.17       0.21     0.21                 0.26               0.14
C6       0.15       0.23     0.23                 0.24               0.16
C7       0.13       0.23     0.21                 0.21               0.17
C8       0.17       0.23     0.17                 0.19               0.14
C9       0.17       0.24     0.16                 0.21               0.17
C10      0.16       0.25     0.16                 0.21               0.17
C11      0.13       0.23     0.17                 0.19               0.18
C12      0.18       0.23     -                    0.21
C13      0.19       0.21     -                    0.14               0.19
C14      0.17       0.21     -                    0.15               0.21
C15      0.18       0.21     -                    0.12               0.20
C16      0.19       0.22     -                    0.14               0.19
C17                 0.23     -                    0.13               0.21
C18                 0.24     -                    0.13               0.19
C19                 0.19     -                                       0.19
C20                 0.19     -
C21                 0.19     -
C22                 0.18     -
C23                 0.18     -
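The paper does not give a formula for the intra-cluster and inter-cluster distances it reports. One common reading, assumed here purely for illustration, is the average pairwise distance between members of the same cluster (intra) and between members of two different clusters (inter). The function names and the default Euclidean metric are hypothetical choices; any of the measures from Section 4 could be passed in as `dist`.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def intra_cluster_distance(cluster, dist=euclidean):
    """Average distance over all unordered pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:  # a singleton cluster has no internal pairs
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def inter_cluster_distance(c1, c2, dist=euclidean):
    """Average distance over all pairs with one point taken from each cluster."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))
```

Under this reading, a good clustering shows small intra-cluster values and comparatively large inter-cluster values.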

5. DATA PREPROCESSING

In this work we used the MSNBC dataset (www.msnbc.com). MSNBC was founded in 1996 as a joint venture between Microsoft and NBC [MSNBC website]. It is a well-known online news website that covers different news subjects, with breaking news, extensive sources, advanced technology, original journalism and expansive content. The msnbc.com Internet Information Server (IIS) creates a log file with a sequential list of data. After obtaining the user log file from the MSNBC website we preprocessed it. In the first preprocessing step we deleted unused attributes from the dataset. After removing the unused attributes, we created a new dataset as a text file with 40,000 user records. Finally, we applied the CURE algorithm to cluster the dataset and uncover certain hidden patterns.

6. EXPERIMENTAL RESULTS

The results listed below were obtained on an arbitrary selection of 40,000 records of web transactions from MSNBC.COM.
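The clustering step applied to these records can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: it covers only the representative-based merging at the heart of CURE (Steps 3 and 5 of the pseudocode) and omits random sampling, partitioning, outlier elimination and final labeling. The parameters `c` (representatives per cluster) and `alpha` (shrink factor toward the centroid), their default values, and all function names are assumptions made for the example.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def representatives(points, c, alpha, dist):
    """Pick up to c well-scattered points and shrink them toward the centroid."""
    center = centroid(points)
    # Start with the point farthest from the centroid, then repeatedly add
    # the point farthest from the representatives chosen so far.
    chosen = [max(points, key=lambda p: dist(p, center))]
    while len(chosen) < min(c, len(points)):
        chosen.append(max(points, key=lambda p: min(dist(p, r) for r in chosen)))
    return [tuple(x + alpha * (m - x) for x, m in zip(r, center)) for r in chosen]


def cure(points, k, c=4, alpha=0.3, dist=euclidean):
    """Merge the two clusters with the closest representatives until k remain."""
    clusters = [[tuple(p)] for p in points]
    reps = [representatives(cl, c, alpha, dist) for cl in clusters]
    while len(clusters) > k:
        # Cluster distance = smallest distance between any two representatives.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(a, b) for a in reps[ij[0]] for b in reps[ij[1]]
            ),
        )
        merged = clusters[i] + clusters[j]
        del clusters[j], reps[j]  # j > i, so index i stays valid
        clusters[i] = merged
        reps[i] = representatives(merged, c, alpha, dist)
    return clusters


# Toy session vectors (hypothetical data, not taken from the MSNBC logs).
sessions = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = cure(sessions, k=2)
```

Because `dist` is a parameter, the same routine can be rerun with a different measure, which mirrors how the experiments vary the similarity measure while keeping the clustering procedure fixed. This naive version recomputes all pairwise cluster distances at every merge, so it is only suitable for small inputs.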

Initially, clusters are generated using the CURE clustering technique while varying the similarity measure. For example, Table 1 shows the clusters formed using the Cosine similarity measure. Statistical measures such as the mean and standard deviation are generated for every measure, and the intra-cluster distance is calculated for the clusters. Fig. 1 shows the inter-cluster distance, with the number of clusters on the x-axis and the similarity measures on the y-axis. Similarly, Fig. 2 plots the intra-cluster distance, with the clusters on the x-axis and the similarity measures on the y-axis.

Fig. 1: Inter-cluster distance

Fig. 2: Intra-cluster distance

7. CONCLUSIONS

Web usage clustering is an important task in web mining, used to group similar sessions and identify web user access behavior. In our experiments we compared the clustering characteristics of the CURE algorithm under the following session similarity measures: Euclidean, Jaccard coefficient, fuzzy dissimilarity, projected Euclidean and Cosine. From the results we determine that fuzzy dissimilarity generated good results compared to the others. Web clustering is a useful technique for grouping web sessions such that sessions within a cluster have similar characteristics, while sessions in different groups are dissimilar. Finally, we showed the behavior of the clusters produced by the CURE algorithm on sequential data in the web usage domain, quantified the results, and listed our conclusions.

REFERENCES

[1] G. Adomavicius, J. Bockstedt, and V. Parimi. "Scalable Temporal Clustering for Massive Multidimensional Data Streams". Proceedings of the 18th Workshop on Information Technology and Systems (WITS'08), Paris, France, December 2008.
[2] Ji Dan, Qiu Jianlin, Gu Xiang, Chen Li, He Peng. "A Synthesized Data Mining Algorithm Based on Clustering and Decision Tree". 10th IEEE International Conference on Computer and Information Technology, pp. 2722-2728, 2010.
[3] M. Kaya, R. Alhajj. "Genetic Algorithm Based Framework for Mining Fuzzy Association Rules". Fuzzy Sets and Systems, 152 (3), (2005), 587-601.
[4] Mehmet Kaya, Reda Alhajj, Faruk Polat, Ahmet Arslan. "Efficient Automated Mining of Fuzzy Association Rules". DEXA '02 Proceedings of the 13th International Conference on Database and Expert Systems Applications, 2002, ISBN: 3-540-44126-3.
[5] Qian Yuntao, Shi Qingsong, Wang Qi, 2002. "CURE-NS: A Hierarchical Clustering Algorithm with New Shrinking Scheme". ICMLC'2002, Beijing, Nov. 4-5, pp. 895-899.
[6] Srinivasan Parthasarathy, Mohammed J. Zaki, Mitsunori Ogihara, and Sandhya Dwarkadas. "Incremental and Interactive Sequence Mining". Proc. 8th ACM International Conference on Information and Knowledge Management, Nov 1999.
[7] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim, 1998. "CURE: An Efficient Clustering Algorithm for Large Databases". In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pp. 73-84.