International Journal of Computer Engineering & Technology (IJCET)
Volume 9, Issue 4, July-August 2018, pp. 217-228, Article IJCET_09_04_024
Available online at http://iaeme.com/Home/issue/IJCET?Volume=9&Issue=4
Journal Impact Factor (2016): 9.3590 (Calculated by GISI), www.jifactor.com
ISSN Print: 0976-6367, ISSN Online: 0976-6375
© IAEME Publication

CURE IMPLEMENTATION

Anchal Chauhan and Seema Maitrey
Krishna Institute of Engineering and Technology, Uttar Pradesh, India

ABSTRACT
Data mining is the process of extracting relevant knowledge and interesting patterns from large amounts of available information. Among the many data mining techniques, clustering is one. Clustering is the unsupervised classification of patterns (data items, feature vectors, and observations) into groups, i.e. clusters. This paper discusses the CURE hierarchical clustering algorithm and presents its implementation.
Keywords: Data mining, clustering, CURE.
Cite this Article: Anchal Chauhan and Seema Maitrey, Cure Implementation. International Journal of Computer Engineering & Technology, 9(4), 2018, pp. 217-228. http://iaeme.com/Home/issue/IJCET?Volume=9&Issue=4

1. INTRODUCTION
Data mining is a kind of sorting process used to extract hidden patterns from voluminous data. Data mining is sometimes called KDD (knowledge discovery in databases). Its main goals include fast retrieval of information for identifying hidden and previously unexplored patterns, reducing complexity, discovering knowledge from data, and saving time [1]. Classification is a supervised technique: class labels are defined in advance and incoming data is categorized according to those labels. Clustering, on the other hand, is unsupervised: data is categorized into different groups according to similarities, and the groups are then labelled [2]. Clustering can be performed through different families of algorithms, such as partitioning, grid, density, and hierarchical algorithms. Hierarchical clustering algorithms are categorised as agglomerative and divisive [3], and agglomerative algorithms are further categorised into CURE, BIRCH, ROCK, and CHAMELEON [4]. This paper focuses on the CURE hierarchical clustering algorithm and its implementation.


Figure 1 Phases of data mining

2. RELATED WORK
Many researchers have studied the CURE hierarchical clustering technique in the past. Some papers that address the clustering process and the CURE clustering algorithm are listed below. Sudipto Guha et al. [5] proposed the CURE algorithm and demonstrated its efficiency on large databases. Qian Yuntao, Wang Qi, and Shi Qingsong [6] then studied the relation between the shrinking scheme of the CURE algorithm and the hidden assumption of spherical cluster shapes. G. Adomavicius et al. [7] proposed a new approach for discovering clusters in very large amounts of continuously arriving data and used the technique to cluster such datasets. M. Kaya and R. Alhajj [8] introduced an automated method for mining fuzzy association rules with the help of a genetic algorithm and the CURE algorithm. Ogihara and Dwarkadas [9] worked on the discovery of clusters from database updates; they proposed a method based on the SPADE algorithm for interactive and incremental frequent sequence mining.

3. CURE CLUSTERING ALGORITHM
CURE is an improved hierarchical clustering algorithm. In data mining, clustering is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering favours clusters of spherical shape and similar size, and is weak in the presence of outliers. CURE is an algorithm that is very robust to outliers and performs well at identifying clusters with non-spherical shapes or wide variances in size. CURE achieves this by representing each cluster with a fixed number of points, generated by selecting well scattered points and then shrinking them towards the centre of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps dampen the effect of outliers. For handling large databases, CURE employs a combination of random sampling and partitioning [5]. CURE implements a novel hierarchical algorithm that adopts a middle ground between centroid-based approaches and approaches based on a single representative object. Instead of using a single centroid or object to represent a cluster, a fixed number of representative points in the space are selected. The representative points of a cluster are generated through the

selection of well scattered objects, which are then moved and shrunk towards the cluster centre by a specified fraction, the shrinking factor. At each step, the two clusters with the closest pair of representative points are merged. Having more than a single representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, while condensing and shrinking the cluster helps dampen the effect of outliers. CURE is therefore much more robust to outliers and helps identify clusters with non-spherical shapes and wide variances in size. Because of this, it scales very well to large, voluminous databases without sacrificing clustering quality. The random sample drawn from the dataset is first partitioned, and each partition is clustered partially. All the partial clusters are clustered again in a second pass to obtain the required clusters. The quality of the clusters produced by CURE is confirmed to be much better than that of clusters found by other algorithms [10].
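To make the representative-point mechanism concrete, the following Python sketch shows one plausible way to select well scattered points from a cluster and shrink them towards the centroid by a shrinking factor. The function names, the greedy selection heuristic, and the value of alpha are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_scattered_points(points, c):
    """Greedily pick c well-scattered points: start from the point farthest
    from the centroid, then repeatedly add the point farthest from the
    already chosen representatives (an assumed heuristic)."""
    centroid = points.mean(axis=0)
    chosen = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(chosen) < min(c, len(points)):
        dists = np.min(
            [np.linalg.norm(points - r, axis=1) for r in chosen], axis=0)
        chosen.append(points[np.argmax(dists)])
    return np.array(chosen)

def shrink_representatives(reps, centroid, alpha=0.3):
    """Move each representative towards the centroid by the fraction alpha,
    which dampens the influence of outliers."""
    return reps + alpha * (centroid - reps)

# Example: a small 2-D cluster with one stray point
cluster = np.array([[0.0, 0.0], [1.0, 0.2], [0.2, 1.1], [5.0, 5.0]])
reps = select_scattered_points(cluster, c=3)
shrunk = shrink_representatives(reps, cluster.mean(axis=0), alpha=0.3)
print(shrunk)
```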

Figure 2 Overview of CURE Algorithm

4. PROPOSED WORK
In the field of data mining, it is well known that handling voluminous data, or very large amounts of data, can be difficult. We address this issue by implementing, from among the many clustering algorithms, the CURE (Clustering Using REpresentatives) hierarchical algorithm, since it finds clusters in voluminous databases, is very robust to outliers, and can determine clusters with non-spherical shapes. The CURE algorithm is implemented here as a combination of data collection and data reduction, using random sampling and partitioning.

Algorithm:

Input:
  Table A - main table
  Table B - join table
  Column A - join column from the main table
  Column B - join column from the join table
  n - number of clusters
  Column C - filter column
  Value - filter value

1. Join table A and table B on equivalence of column A and column B.
2. Calculate the count of the result set and store it in a variable T = total number of rows.
3. Start random sampling by calculating the size of a clustered partition, dividing the total number of rows T by the number of clusters n:
   size of cluster = total number of rows (T) / number of clusters (n)


4. Store this value in a variable s.
5. Select every nth row from the result set, starting from the 0th row, applying the filter criteria built from the user-supplied filter column and filter value.
6. Create the other partitions by selecting every nth row with the ith element as the starting point, where i ranges from 1 to n-1.
7. Analyse all partitions and perform clustering.
8. Merge relevant partitions to obtain knowledge-based, meaningful, relevant data.
9. Repeat steps 3 to 9 if further clustering is required.
10. End.

A minimal code sketch of steps 1-8 is given after this list.
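The following Python sketch is a minimal, assumption-laden illustration of the procedure above: the table join of step 1 is modelled as a pandas merge, and partition i simply takes every nth row starting at offset i. The function and column parameter names are hypothetical and not part of the original implementation.

```python
import pandas as pd

def cure_sample_and_partition(table_a, table_b, col_a, col_b,
                              n, filter_col=None, filter_value=None):
    """Steps 1-6: join, count, filter, and split into n partitions by
    taking every nth row with starting offsets 0 .. n-1."""
    joined = table_a.merge(table_b, left_on=col_a, right_on=col_b)  # step 1
    total_rows = len(joined)                                        # step 2
    partition_size = total_rows // n                                # steps 3-4
    if filter_col is not None:                                      # filter criteria
        joined = joined[joined[filter_col] == filter_value]
    partitions = [joined.iloc[i::n] for i in range(n)]              # steps 5-6
    return partitions, partition_size

def merge_partitions(partitions):
    """Step 8: merge the analysed partitions back into a single result set."""
    return pd.concat(partitions, ignore_index=True)
```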

5. FLOWCHART

6. STEPS OF THE ALGORITHM
STEP 1: Install the software on the system by clicking the install application icon. When installation is complete, the welcome page appears on screen. Provide the connection URL for the database, based on the database name, username, and password.
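As an illustration of these initial database settings, the following Python snippet sketches one way such a connection URL might be assembled. The driver, host, port, and parameter names are assumptions made for illustration, not the tool's actual configuration.

```python
# Hypothetical connection settings gathered from the welcome page
db_name = "cure_demo"
username = "cure_user"
password = "secret"
host = "localhost"
port = 3306

# A JDBC-style MySQL URL is assumed here purely for illustration
connection_url = (
    f"jdbc:mysql://{host}:{port}/{db_name}"
    f"?user={username}&password={password}"
)
print(connection_url)
```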

Figure 4 Initial Database Settings

STEP 2: Now we set up the initial properties of the process, which takes a few inputs from the user: the main table as table A. A checkbox is then ticked if a join dataset is required for the procedure. If a join is required, further inputs are taken from the user: the join table as table B, the join column of table A as column A, and the join column of table B as column B. All of these settings are populated using the connection string provided in the previous step; the system tables are queried to list all existing tables with their relationship fields, so that the initial settings for the process can be specified.

STEP 3: In the next step, we can see the count of the result set created, which is then used when applying the requested values for the sampling, partitioning, and clustering steps. The result set is stored in a temporary table, and the random sampling request values are provided: the number of clusters (n), the filter value as value, and the filter column as column C. A list of conditionals is also available to filter the data further, with entries such as EQUAL, IN, and BETWEEN. Once all settings for the random sampling procedure are provided, click Next to check the partitioning results.
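To illustrate how such filter conditionals might translate into a query, here is a small Python sketch that builds a SQL WHERE fragment from the chosen operator. The function and parameter names are hypothetical and not part of the described tool.

```python
def build_filter_clause(column, operator, value):
    """Build a simple SQL WHERE fragment for the EQUAL, IN and BETWEEN
    conditionals mentioned above (values are inlined only for illustration;
    a real implementation should use bound parameters)."""
    op = operator.upper()
    if op == "EQUAL":
        return f"{column} = '{value}'"
    if op == "IN":
        items = ", ".join(f"'{v}'" for v in value)
        return f"{column} IN ({items})"
    if op == "BETWEEN":
        low, high = value
        return f"{column} BETWEEN '{low}' AND '{high}'"
    raise ValueError(f"Unsupported operator: {operator}")

# Example usage with a hypothetical filter column
print(build_filter_clause("region", "IN", ["North", "East"]))
```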


Figure 5 Setup Table Name

Figure 6 Setup Cure Properties

STEP 4: In step 4, we can see the different partitions, i.e. every nth value of the result set starting from row number i, where i ranges from 0 to n-1. The result set is grouped, based on the value of column C provided in the previous steps, into different partitions such as partition 1, partition 2, and partition 3. This shows that the clusters are separated from the outliers. The values filtered according to the specified criteria can also be seen in the partition results shown in the figures below.


Figure 7 Partition 1 results

Figure 8 Partition 2 results

STEP 5: In this step, after analysing and processing all partitions, they are merged into a single table as the result. If more clustering is to be performed on the same database instance, the previous setup page can be revisited as the next step to start further clustering.


Figure 9 Final Result Set

STEP 6: Steps 2 to 6 can be repeated if more clustering is required on the same database instance. New initial and CURE settings can be provided, and the data processed according to these settings to obtain more relevant and better clustering results.

Figure 10 Setup Table and Join Table Name


Figure 11 Setup Join Clusters and CURE properties

Figure 12 Join Partition 1 Results


Figure 13 Join Partition 2 Results

Figure 14 Join Partition 3 Results


Figure 15 Join Final Results

STEP 7: Once analysis and processing are complete, the process can be finished by clicking the Finish button on the Merge Results page.

Figure 16 Filtered Result based on parameter

7. CONCLUSION
In this paper we have shown that the CURE clustering algorithm can determine clusters of non-spherical shape and wide variance in size. The CURE algorithm provides better execution time than other algorithms on large databases by using random sampling and partitioning. The CURE clustering algorithm works very well when the data contains outliers: in the CURE hierarchical clustering algorithm, outliers are first detected and then eliminated. Each level or step is important for achieving efficiency, scalability, and improved concurrency. It can therefore be concluded that the CURE algorithm is suitable for handling voluminous data.


8. FUTURE SCOPE
In the future, parallel programming can be introduced into the CURE algorithm, through which results can be obtained with greater accuracy in much less time. In the CURE algorithm, the result set is broken into several partitions during random sampling. As an enhancement to the CURE hierarchical algorithm, these partitions can be processed in a parallel, multi-threaded environment. In this way the performance of the CURE algorithm can be improved, making it more efficient than other hierarchical algorithms. A sketch of this idea is given below.
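The following Python sketch illustrates the proposed parallelisation using a thread pool. The helper cluster_partition is a hypothetical placeholder for whatever per-partition clustering routine is used; it is not defined in the paper.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def cluster_partition(partition):
    """Hypothetical placeholder for the per-partition clustering step;
    here it simply returns the partition unchanged."""
    return partition

def cluster_partitions_in_parallel(partitions, max_workers=4):
    """Process the partitions produced by random sampling concurrently,
    then merge the partial results into a single result set."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = list(pool.map(cluster_partition, partitions))
    return pd.concat(partial_results, ignore_index=True)
```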

REFERENCES

[1] Smita and Priti Sharma, "Use of Data Mining in Various Fields: A Survey Paper", May-June 2014.
[2] Megha Mandloi, "A Survey on Clustering Algorithms and K-Means", July 2014.
[3] G. Thilagavathi, D. Srivaishnavi and N. Aparna, "A Survey on Efficient Hierarchical Algorithm used in Clustering", IJERT, 2013.
[4] Marjan Kuchaki Rafsanjani, Zahra Asghari Varzaneh and Nasibeh Emami Chukanlo, "A Survey of Hierarchical Clustering Algorithms", The Journal of Mathematics and Computer Science, 2012.
[5] Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, "CURE: An Efficient Clustering Algorithm for Large Databases", in Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, 1998, pp. 73-84.
[6] Qian Yuntao, Shi Qingsong and Wang Qi, "CURENS: A Hierarchical Clustering Algorithm with New Shrinking Scheme", ICMLC 2002, Beijing, Nov. 4-5, 2002, pp. 895-899.
[7] G. Adomavicius, J. Bockstedt and V. Parimi, "Scalable Temporal Clustering for Massive Multidimensional Data Streams", Proceedings of the 18th Workshop on Information Technology and Systems (WITS'08), Paris, France, December 2008.
[8] M. Kaya and R. Alhajj, "Genetic Algorithm Based Framework for Mining Fuzzy Association Rules", Fuzzy Sets and Systems, 152(3), 2005, pp. 587-601.
[9] Srinivasan Parthasarathy, Mohammed J. Zaki, Mitsunori Ogihara and Sandhya Dwarkadas, "Incremental and Interactive Sequence Mining", in Proc. of the 8th ACM International Conference on Information and Knowledge Management, Nov. 1999.
[10] Seema Maitrey, C. K. Jha, Rajat Gupta and Jaiveer Singh, "Enhancement of CURE Clustering Technique in Data Mining", International Journal of Computer Applications, Foundation of Computer Science, New York, USA, April 2012.
