Analysis of Representative Values in Clustering Using the CURE Algorithm

International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-9 Issue-2, July 2020

Umaya Ramadhani Putri Nst, Sutarman, Pahala Sirait

Abstract: The collection of data that everyone on earth holds is widely agreed to carry valuable knowledge. Analyzing a large collection of data can take a long processing time, so an algorithm is needed that can speed up the analysis process. One data analysis process is clustering, which groups large amounts of data so that they are easy to understand. One of the algorithms for clustering is CURE (Clustering Using Representative), which partitions a random sample of the data and describes each cluster by representative points. A sample-based process gives a better processing time because it is carried out only on part of the data collection, not on the whole data. The representative points determine the processing time in the tests carried out on the inputs. Input values, representative values, and shrinkage values that are set according to the correct conditions give a faster clustering process.

Keywords: Representative, CURE algorithm

Revised Manuscript Received on June 29, 2020.
* Correspondence Author
Umaya Ramadhani Putri Nst, Bachelor of Information Technology, Universitas Sumatera Utara, Indonesia.
Sutarman, Magister of Applied Probability and Statistics, Northern Illinois University, USA.
Pahala Sirait, Magister of Computer Science, Universitas Indonesia, Indonesia.
Published By: Blue Eyes Intelligence Engineering and Sciences Publication. Retrieval Number: B3742079220/2020©BEIESP. DOI: 10.35940/ijrte.B3742.079220

I. INTRODUCTION

The data contained on this earth is the collection of all data possessed by everyone. A single person can hold a large amount of data with many variants. Such data eventually becomes a pile that can yield important information about each person who holds it.

An accumulated collection of data can be extracted automatically into a pattern structure or knowledge when the data is analyzed. One form of data analysis is clustering, which makes the data easier for humans to understand and gives it value as knowledge.

Clustering is a grouping stage for analyzing large amounts of data [1]. The analysis divides the data into clusters so that the data is more easily understood. One algorithm for clustering is the CURE algorithm, Clustering Using Representative. CURE is a hierarchical algorithm that combines random sampling with partitioning, so that the process is carried out only on a portion of the data [1]. For this reason, CURE can be used on large amounts of data. CURE determines representative points as reference points for forming the clusters. The number of representative points is set initially by an input value. This value influences the time the process takes, that is, whether the analysis speeds up or slows down. For this reason, this study analyzes the representative point value given as input at the beginning, and how to respond if it leads to a long processing time.

II. RELATED RESEARCH

This description begins with research by Manjula and Nandakumar (2018), which uses the CURE algorithm in educational data mining to identify weak engineering students from performance data. The educational data to be analyzed has many attributes, so the study removes irrelevant attributes using a dimensionality reduction method. The resulting running time was 11.929215 s.

Earlier, Yin (2005) classified large datasets using the BIRCH algorithm, where a CF-Tree was used to determine sub-clusters. Each cluster formed is stored as a leaf node, and k-prototype is then used to correct the node if the cluster is not balanced.

Four years later, Meng, Song and Wang (2009) proposed a new algorithm to handle clustering of complex and massive data effectively and efficiently. The proposed algorithm is grid-based: the data is mapped onto a grid and then grouped into clusters according to the density of the grid cells.

Eick, Zeidat and Vilalta (2004) edited the dataset in advance to improve classification accuracy. One example is removing an object from the training set and setting a decision boundary.

Rani, Manju and Rohil (2014) compared the clustering results of the CURE and BIRCH algorithms using WEKA 3.6.9. The results show that CURE gives better cluster results than BIRCH, but BIRCH runs faster than CURE.

III. PROPOSED METHOD

A. Min-Max Normalization Method

The Min-Max method normalizes data by performing a linear transformation of the original data [2]. The formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \left( new_{\max} - new_{\min} \right) + new_{\min} \qquad (1)$$

In this case:
$x'$ = Normalized data
$x$ = Original data
$x_{\min}$ = Minimum value of data per column
$x_{\max}$ = Maximum value of data per column
$new_{\min}$ = The minimum we set
$new_{\max}$ = The maximum we set

B. CURE Algorithm

The CURE (Clustering Using Representative) algorithm is a hierarchical algorithm that combines random sampling with partitioning, so that the process is carried out only on a portion of the data [1]. The algorithm was created to identify data that forms clusters of arbitrary shape with wide variations [3].

CURE represents each cluster by certain scattered points that are shrunk toward the cluster center, using linear space, so as to produce a faster process [4]. The main purpose of CURE is to find these representative points in order to get good cluster results [5]. Although the clustering process itself runs quickly, CURE is still an algorithm with a poor time complexity.

The stages of CURE can be seen in Figure 1, the flowchart of the CURE algorithm.

Fig 1. Flowchart of CURE [1]

Following is the CURE algorithm process [6]:
1) Take a random sample of data from the dataset.
2) Partition the sample according to the input value. With the value set to 2, two initial partitions are formed, each holding its own share of the data.
3) Partition each initial partition again according to a second value, which is greater than 1.
4) Establish the number of points used as representative points for each cluster.
5) Shrink the representative points by the given shrink value to form new representative points.
6) Combine the two clusters with the closest distance.
7) Choose representative points again to represent the newly merged cluster.
8) Stop merging once the target number of clusters is reached.

IV. RESULT AND DISCUSSION

A. Research Data

The data in this study is a dataset taken from the UCI Machine Learning repository. It contains customer credit card data from Taiwan from April to September 2005. The dataset consists of 30,000 records and 24 attributes. The dataset is normalized with the min-max normalization method into a balanced range of values to simplify the calculations in the trials.

The following are examples of the data before and after normalization; 30 records are taken from the 30,000.

Table I: Dataset Before Normalization

No                               …
1     50000    1    1    2    …    0
2     230000   2    1    2    …    0
3     50000    1    2    2    …    716
4     100000   1    1    2    …    2504
5     500000   2    2    1    …    51582
6     500000   1    1    1    …    768
…     …        …    …    …    …    …
30    200000   2    1    2    …    0

Table II: Dataset After Normalization

No                                   …
1     0.08    0.00    0.00    0.50    …    0.00
2     0.45    1.00    0.00    0.50    …    0.00
3     0.08    0.00    0.25    0.50    …    0.01
4     0.18    0.00    0.00    0.50    …    0.05
5     1.00    1.00    0.25    0.00    …    1.00
6     1.00    0.00    0.00    0.00    …    0.01
…     …       …       …       …       …    …
30    0.39    1.00    0.00    0.50    …    0.00

B. Result and Discussion

... reference point that will be used for the boundaries of a cluster. The representative value given can also reduce the amount of calculation, but it can slow down the process. Table III shows the results of testing the input value.

Table III: Input Value Testing Results

Input Value    Processing Time    Number of Representative Points    Number of Clusters Produced
2              0.393 s            2                                  3
3              0.172 s            2                                  3
4              0.155 s            2                                  3
5              0.094 s            2                                  3
6              0.068 s            2                                  3
7              0.067 s            2                                  3
8              0.067 s            2                                  3
9              0.068 s            2                                  3
10             0.065 s            2                                  3

From Table III, it can be concluded that the greater the input value, the shorter the processing time: more partitions are formed, so the process runs simultaneously on smaller calculations. Processing time drops from 0.393 s at an input value of 2 to 0.068 s at 6, and then levels off at about 0.065 to 0.068 s. The test results also show that the number of clusters produced remains stable at 3 for every input value.
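
The min-max normalization of Section III.A can be illustrated with a short Python sketch. It is only a minimal illustration, not the authors' implementation: the function name `min_max_normalize` and its parameters are assumptions made for this example. The sample values are taken from the first data column of Table I, plus an assumed column minimum of 10000 that the paper does not state.

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Linearly rescale a list of numbers into [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread, so map every value to new_min.
        return [new_min for _ in column]
    scale = (new_max - new_min) / (hi - lo)
    return [(x - lo) * scale + new_min for x in column]


# Illustrative column: values from Table I plus an ASSUMED minimum of 10000.
limits = [10000, 50000, 100000, 230000, 500000]
normalized = min_max_normalize(limits)
print([round(v, 2) for v in normalized])
```

Each column is normalized independently, so an attribute with a large range (such as a credit limit) and one with a small range (such as a category code) end up on the same scale before distances are computed.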

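Steps 4 to 8 of the CURE process in Section III.B can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it skips the sampling and partitioning of steps 1 to 3, searches for the closest pair of clusters by brute force, and every name in it (`representatives`, `cluster_distance`, `cure`, `num_rep`, `shrink`) is hypothetical. The sample points are invented for the example.

```python
import math


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def representatives(points, num_rep, shrink):
    """Pick up to num_rep scattered points, then shrink them toward the centroid."""
    center = centroid(points)
    # First representative: the point farthest from the centroid.
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(num_rep, len(points)):
        # Next representative: the point farthest from those already chosen.
        chosen.append(max(points, key=lambda p: min(math.dist(p, c) for c in chosen)))
    return [tuple(c[d] + shrink * (center[d] - c[d]) for d in range(len(center)))
            for c in chosen]


def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: closest pair of their representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


def cure(points, k, num_rep=2, shrink=0.5):
    """Merge the closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    reps = [representatives(c, num_rep, shrink) for c in clusters]
    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: cluster_distance(reps[ab[0]], reps[ab[1]]))
        merged = clusters[i] + clusters[j]
        del clusters[j], reps[j]  # j > i, so index i stays valid
        clusters[i] = merged
        reps[i] = representatives(merged, num_rep, shrink)
    return clusters


# Invented 2-D data: three well-separated groups of three points each.
groups = [
    [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)],
    [(0.0, 5.0), (0.1, 5.0), (0.0, 5.1)],
]
points = [p for g in groups for p in g]
clusters = cure(points, k=3)
print(sorted(len(c) for c in clusters))
```

The defaults `num_rep=2` and `k=3` mirror the two representative points and three clusters reported in Table III; the `shrink` value of 0.5 is an arbitrary choice for the example.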