International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-9 Issue-2, July 2020

Analysis of Representative Values in Clustering using the CURE Algorithm

Umaya Ramadhani Putri Nst, Sutarman, Pahala Sirait

Abstract: The collection of data held by everyone on earth has widely recognized value as a source of knowledge. Analyzing such a collection can take a long processing time, so an algorithm is needed that can accelerate the analysis. One data analysis process is clustering, which groups large amounts of data so that they are easier to understand. One clustering algorithm is CURE (Clustering Using Representatives), which draws a random sample of the data and partitions it using points called representative points. The sample-based process gives better processing time because it operates only on the sampled data, not on the whole data. The representative points determine the processing time of the tests carried out. The input value, the representative value, and the shrink value give a faster process when they are supplied according to the correct conditions.

Keywords: Representative, CURE algorithm

I. INTRODUCTION

The collection of data on this earth is the collection of all data possessed by everyone. One person can hold a large amount of data with large variation. Such data eventually accumulate into a pile that can yield important information for whoever holds it.

An accumulated collection of data can be extracted automatically into a pattern structure or knowledge when the data are analyzed. One form of data analysis is clustering, which makes the data easier for humans to understand and gives them value as knowledge.

Clustering is a grouping stage for analyzing large amounts of data [1]. The analysis divides the data into clusters so that the data are more easily understood. One clustering algorithm is CURE (Clustering Using Representatives). CURE is an algorithm that uses a hierarchical method that combines random sampling with partitioning, so that the process is carried out only on a sample of the data [1]. For this reason, CURE can be used on large amounts of data. CURE determines representative points as reference points to form several clusters. The number of these points is initially determined by the input value of the representative point. The number of representative points influences the processing time, whether it speeds up or slows down the analysis. For this reason, an analysis is needed that compares the representative point values supplied at the beginning and shows how to overcome a long processing time.

II. RELATED RESEARCH

This description begins with research by Manjula and Nandakumar (2018), who used the CURE algorithm on educational data to identify weakly performing engineering students. The educational data analyzed have many attributes, so the study first reduces irrelevant attributes. The resulting run time was 11.929215 s.

Previously, Yin et al. (2005) clustered large datasets using the BIRCH algorithm, where a CF-Tree was used to determine sub-clusters. Each cluster formed is stored as a leaf node, and k-prototype is then used to correct the node if the cluster is not balanced.

Four years later, Meng, Song and Wang (2009) proposed a new algorithm to handle clustering of complex and massive data effectively and efficiently. The proposed algorithm is grid-based: the data are mapped onto a grid and then grouped into clusters by looking at the density of the grid cells.

Eick, Zeidat and Vilalta (2004) edited the dataset in advance in order to improve classification accuracy. One example is removing an object from the training set based on a decision boundary.

Rani, Manju and Rohil (2014) compared the clustering results of the CURE and BIRCH algorithms using WEKA 3.6.9. The results show that CURE gives better cluster results than BIRCH, but BIRCH runs faster than CURE.

Revised Manuscript Received on June 29, 2020.
* Correspondence Author
Umaya Ramadhani Putri Nst, Bachelor of Information Technology, Universitas Sumatera Utara, Indonesia.
Sutarman, Master of Applied Probability and Statistics, Northern Illinois University, USA.
Pahala Sirait, Master of Computer Science, Universitas Indonesia, Indonesia.

Published By: Blue Eyes Intelligence Engineering and Sciences Publication
Retrieval Number: B3742079220/2020©BEIESP
DOI: 10.35940/ijrte.B3742.079220


III. PROPOSED METHOD

A. Min-Max Normalization Method

The Min-Max method normalizes data by applying a linear transformation to the original data [2]. The formula is as follows:

$$x' = \frac{x - \min_A}{\max_A - \min_A}\left(\mathrm{new\_max}_A - \mathrm{new\_min}_A\right) + \mathrm{new\_min}_A \tag{1}$$

In this case:
$x'$ = Normalized data
$x$ = Original data value
$\min_A$ = Minimum value of data per column
$\max_A$ = Maximum value of data per column
$\mathrm{new\_min}_A$ = The minimum we set
$\mathrm{new\_max}_A$ = The maximum we set

B. CURE Algorithm

CURE (Clustering Using Representatives) is an algorithm that uses a hierarchical method that combines random sampling with partitioning, so that the process is carried out only on a sample of the data [1]. The algorithm was created to identify data that form arbitrarily shaped clusters with wide variation [3].

CURE represents each cluster by a set of scattered points that are shrunk toward the cluster center, using linear space, which makes the process faster [4]. The main purpose of CURE is to find these scattered points, called representative points, in order to obtain good cluster results [5]. Although it clusters quickly, CURE is still an algorithm with poor time complexity.

The stages of CURE can be seen in Figure 1 below, namely the flowchart of the CURE algorithm.

Fig. 1. Flowchart of CURE [1]

The CURE algorithm proceeds as follows [6]:

1) Take a random sample of $n$ data points from the dataset.
2) Partition the sample into $p$ partitions of size $n/p$. With $p = 2$, two initial partitions are formed, each holding its share of the sampled data.
3) Partition each initial partition again into size $n/(pq)$, where $q > 1$.
4) Establish the number of points used as representative points for each cluster.
5) Shrink the representative points by the given shrink value, forming new representative points.
6) Merge the two clusters with the smallest distance between them.
7) Choose new representative points for the merged cluster, to be used in the search for further clusters to merge.
8) Stop merging when the target number of clusters is reached.

IV. RESULT AND DISCUSSION

A. Research Data

This study uses a dataset taken from the UCI repository, containing credit card customer data from Taiwan for April to September 2005. The dataset consists of 30,000 records and 24 attributes. The dataset is normalized with the min-max normalization method into a common range of values to simplify the calculations in the trials.

The following tables show examples of the data before and after normalization; 30 records are taken from the 30,000.

Table I: Dataset Before Normalization

| No | | | | | … | |
|---|---|---|---|---|---|---|
| 1 | 50000 | 1 | 1 | 2 | … | 0 |
| 2 | 230000 | 2 | 1 | 2 | … | 0 |
| 3 | 50000 | 1 | 2 | 2 | … | 716 |
| 4 | 100000 | 1 | 1 | 2 | … | 2504 |
| 5 | 500000 | 2 | 2 | 1 | … | 51582 |
| 6 | 500000 | 1 | 1 | 1 | … | 768 |
| … | … | … | … | … | … | … |
| 30 | 200000 | 2 | 1 | 2 | … | 0 |
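As an illustration of the min-max transformation in Equation (1), the following minimal sketch rescales the last attribute of the records displayed in Table I to the range 0 to 1. The function name `min_max_normalize` and its parameters `new_min` and `new_max` are illustrative assumptions, not part of the study's implementation. The minimum and maximum are taken from the displayed values only, which is consistent with the normalized table mapping 0 to 0,00 and 51582 to 1,00.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every entry to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


# Last attribute of the records displayed in Table I (rows 1 to 6 and 30).
last_column = [0, 0, 716, 2504, 51582, 768, 0]
normalized = [round(v, 2) for v in min_max_normalize(last_column)]
print(normalized)  # [0.0, 0.0, 0.01, 0.05, 1.0, 0.01, 0.0]
```

The rounded results agree with the last column of the normalized dataset shown for the same records.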


Table II: Dataset After Normalization

| No | | | | | … | |
|---|---|---|---|---|---|---|
| 1 | 0.08 | 0.00 | 0.00 | 0.50 | … | 0.00 |
| 2 | 0.45 | 1.00 | 0.00 | 0.50 | … | 0.00 |
| 3 | 0.08 | 0.00 | 0.25 | 0.50 | … | 0.01 |
| 4 | 0.18 | 0.00 | 0.00 | 0.50 | … | 0.05 |
| 5 | 1.00 | 1.00 | 0.25 | 0.00 | … | 1.00 |
| 6 | 1.00 | 0.00 | 0.00 | 0.00 | … | 0.01 |
| … | … | … | … | … | … | … |
| 30 | 0.39 | 1.00 | 0.00 | 0.50 | … | 0.00 |

B. Result and Discussion

Fig. 1. Flowchart design CURE

Figure 1 above shows the flowchart of the CURE algorithm as performed in this test. Starting from the input stage, several values are supplied, and the output is produced as the result of the tests conducted.

Testing the CURE algorithm in this study means testing the requested input values. The CURE process has several inputs that must be known, namely the input value and the number of representative points. Analyzing these inputs reveals the strengths and weaknesses of each in the CURE process. The input value is the value used in the second partitioning stage of the clustering process; it can further reduce the amount of calculation but can slow the process down. The representative value is the number of points used as reference points for the boundaries of a cluster; it can also reduce the amount of calculation but can slow the process down. Table III below shows the results of testing the input value.

Table III: Input Value Testing Results

| Input Value | Processing Time | Number of Representative Points | Number of Clusters Produced |
|---|---|---|---|
| 2 | 0.393 s | 2 | 3 |
| 3 | 0.172 s | 2 | 3 |
| 4 | 0.155 s | 2 | 3 |
| 5 | 0.094 s | 2 | 3 |
| 6 | 0.068 s | 2 | 3 |
| 7 | 0.067 s | 2 | 3 |
| 8 | 0.067 s | 2 | 3 |
| 9 | 0.068 s | 2 | 3 |
| 10 | 0.065 s | 2 | 3 |

From Table III above, it can be concluded that the greater the input value, the shorter the processing time (from 0.393 s at an input value of 2 down to 0.065 s at 10), because more partitions are formed, so the process runs simultaneously on smaller calculations. The test results also show that the number of clusters remains stable.

Table IV: Representative Input Value Testing Results

| Number of Representative Points | Processing Time | Number of Clusters Produced |
|---|---|---|
| 2 | 0.2 s | 3 |
| 3 | 0.125 s | 3 |
| 4 | 0.147 s | 3 |
| 5 | 0.179 s | 3 |
| 6 | 0.240 s | 3 |
| 7 | 0.265 s | 3 |
| 8 | 0.271 s | 3 |
| 9 | 0.377 s | 3 |
| 10 | 0.473 s | 3 |

The test results in Table IV show that as the number of representative points grows, the processing time becomes longer (from 0.125 s at 3 points to 0.473 s at 10 points), which indicates that CURE has a weakness in determining the best number of representative points. Nevertheless, it still produces a stable number of clusters. A trial on both inputs is therefore needed to find the input values that give better processing results.

The test results in Table V below reduce this weakness by supplying appropriate input values and representative values found through several tests. After a number of tests, it can be concluded that supplying an input value greater than the representative value results in a shorter processing time.
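To make the role of the tested inputs concrete, the following is a minimal sketch of a simplified CURE procedure, not the implementation used in this study. The names `cure`, `merge_until` and `representatives`, and the parameter names `p` (number of initial partitions), `q` (the input value used in the second partitioning stage), `c` (number of representative points) and `alpha` (shrink value), are assumptions made for illustration. The sketch uses a brute-force closest-pair search and omits outlier handling and the labelling of data outside the sample.

```python
import math
import random
import time


def centroid(points):
    dim = len(points[0])
    return tuple(sum(pt[d] for pt in points) / len(points) for d in range(dim))


def representatives(points, c, alpha):
    """Pick up to c scattered points, then shrink them toward the centroid."""
    center = centroid(points)
    # Start with the point farthest from the centroid, then repeatedly add
    # the point farthest from the representatives chosen so far.
    chosen = [max(points, key=lambda pt: math.dist(pt, center))]
    while len(chosen) < min(c, len(points)):
        chosen.append(max(points, key=lambda pt: min(math.dist(pt, r) for r in chosen)))
    # alpha = 0 leaves the points in place; alpha = 1 moves them onto the centroid.
    return [tuple(x + alpha * (m - x) for x, m in zip(r, center)) for r in chosen]


def cluster_distance(a, b):
    # Distance between clusters = distance between their closest representatives.
    return min(math.dist(r, s) for r in a["reps"] for s in b["reps"])


def merge_until(clusters, target, c, alpha):
    """Repeatedly merge the two closest clusters until `target` clusters remain."""
    clusters = list(clusters)
    while len(clusters) > target:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        points = clusters[i]["points"] + clusters[j]["points"]
        merged = {"points": points, "reps": representatives(points, c, alpha)}
        clusters = [cl for idx, cl in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters


def cure(data, k, sample_size, p, q, c, alpha, seed=0):
    sample = random.Random(seed).sample(data, sample_size)
    partial = []
    for part in range(p):
        chunk = sample[part::p]  # one of the p initial partitions
        singletons = [{"points": [pt], "reps": [pt]} for pt in chunk]
        # Pre-cluster each partition down to (partition size / q) clusters.
        partial.extend(merge_until(singletons, max(1, len(chunk) // q), c, alpha))
    return merge_until(partial, k, c, alpha)


# Synthetic data (an assumption for this demo): three well-separated groups.
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
gen = random.Random(1)
data = [(cx + gen.uniform(-1, 1), cy + gen.uniform(-1, 1))
        for cx, cy in centers for _ in range(100)]

for q in (2, 5, 10):
    start = time.perf_counter()
    result = cure(data, k=3, sample_size=60, p=2, q=q, c=3, alpha=0.5)
    print(f"q={q}: {len(result)} clusters in {time.perf_counter() - start:.4f} s")
```

The timings printed by this sketch depend on the machine and on the synthetic data, so they are not comparable with the processing times reported in the tables; the sketch only shows where each input enters the procedure.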



Table V: Modification and Representative Testing Results

| Input Value | Number of Representative Points | Processing Time | Number of Clusters Produced |
|---|---|---|---|
| 5 | 2 | 0.17 s | 3 |
| 5 | 3 | 0.155 s | 3 |
| 5 | 4 | 0.121 s | 3 |

The CURE clustering process also uses a shrink value, which lies between 0 and 1. The shrink value narrows the representative points that are used as reference points to other points. When the condition above holds, namely the input value is greater than the representative value, and the shrink value is additionally increased toward 1, the processing time obtained becomes even shorter than before, as can be seen in Table VI, Table VII and Table VIII below.

Table VI: Testing Results, Input Value = 5, R = 2

| Value of Shrink | Processing Time | Number of Clusters Produced |
|---|---|---|
| 0.3 | 0.187 s | 3 |
| 0.5 | 0.08 s | 3 |
| 0.8 | 0.075 s | 3 |
| 1 | 0.072 s | 3 |

Table VII: Testing Results, Input Value = 5, R = 3

| Value of Shrink | Processing Time | Number of Clusters Produced |
|---|---|---|
| 0.3 | 0.115 s | 3 |
| 0.5 | 0.094 s | 3 |
| 0.8 | 0.157 s | 3 |
| 1 | 0.098 s | 3 |

Table VIII: Testing Results, Input Value = 5, R = 4

| Value of Shrink | Processing Time | Number of Clusters Produced |
|---|---|---|
| 0.3 | 0.5 s | 3 |
| 0.5 | 0.159 s | 3 |
| 0.8 | 0.128 s | 3 |
| 1 | 0.13 s | 3 |

V. CONCLUSION

This study concludes that the CURE algorithm can cluster large amounts of data with large variation and gives stable cluster results that reach the target of k clusters. The trial analysis also shows that changes in the input value, the representative value and the shrink value change the processing time under the following conditions:

1) When the input value is increased, the processing time becomes shorter.
2) When the representative value is increased, the processing time becomes longer.
3) When the input value is greater than the representative value, the processing time is faster.
4) If, in addition to condition 3, the shrink value is increased toward 1, the processing time is faster still.

REFERENCES

1. Shirkhorsidi, A.S., Aghabozorghi, S., Wah, T.Y., & Herawan, T. 2014. Big data Clustering. International Conference Computational Science and Applications, pp. 707-720.
2. Rani, Y., Manju & Rohil, H. 2014. Comparative Analysis of BIRCH and CURE Algorithm using WEKA 3.6.9. Computer Science Engineering and Application 2: 25-29.
3. Jian, S., Pang, G., & Cao, L. 2018. CURE: Flexible Categorical Data Representation by Hierarchical Coupling Learning. IEEE Transactions on Knowledge and Data Engineering. Vol. 14. No. 8.
4. Eick, Christoph F., Zeidat, N., & Vilalta, R. 2004. Using Representative-Based Clustering for Nearest Neighbor Dataset Editing. Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 2142-2145.
5. Putra, A.K.P., Purwanto, Y., & Novianty, A. 2015. Analisis Sistem Deteksi Anomali Traffic Menggunakan Algoritma CURE dengan Koefisien Silhoutte dalam Validasi Clustering. Proceedings of Engineering, pp. 3837-3842.
6. Han, J., Kamber, M., & Pei, J. 2012. Data Mining: Concepts and Techniques, Third Edition. Elsevier: USA.
7. Manjula, V. & Nandakumar, A.N. 2018. An Effective Cure Clustering Algorithm in Education Data Mining Techniques to Valuate Student's Performance. International Journal of Applied Engineering Research 10: 7493-7498.
8. Meng, H.-D., Song, Y.-C., Song, F.-Y., & Wang, S. L. 2009. Clustering for Complex and Massive Data. International Conference on Information Engineering and Computer Science, pp. 4244-4994.
9. Rani, Y., Manju & Rohil, H. 2014. Comparative Analysis of BIRCH and CURE Hierarchical Clustering Algorithm using WEKA 3.6.9. Computer Science Engineering and Application 2: 25-29.
10. Yin, J., Tan, Z., Ren, J., & Chen, Y. 2005. An Efficient Clustering Algorithm Mixed Type Attributes in Large Dataset. Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, pp. 7803-9091.

AUTHORS PROFILE

Umaya Ramadhani Putri Nst, Graduate School of Information Technology, Universitas Sumatera Utara, Indonesia. The author's first research paper, "The Design of Vehicle Laying Model in Ticket Booking Application Ship", was published in the International Journal of e-Education, e-Business, e-Management, and e-Learning.

Sutarman, Master of Applied Probability and Statistics, Northern Illinois University, USA, 1994.

Pahala Sirait, Master of Computer Science, Universitas Indonesia, Indonesia, 2004.
