Comparison of Different Types of Cluster Algorithm on Netherland
Total Page:16
File Type:pdf, Size:1020Kb
Comparison of Different Types of Cluster Algorithm on Netherland Telecommunication Provider Solutions Christophorus Jeremy Wicaksana University of Twente PO Box 217, 7500 AE Enschede the Netherlands christophorusjeremywicaksana @student.utwente.nl ABSTRACT In this research there are limitation on the cleaning the data Clustering is an unsupervised method of grouping data into because some words still in non-basic form. For example, different groups. In this research the data is in a form of text. “monitoring” is not lemmatized (reduce the word into its basic The text then needs to be translated into numbers with form) into “monitor” while there is other sentence that contain word2vec skip-gram. The numerical data created will be used in word “monitor”. Since the data is high dimension because of its the 4 different types of clustering algorithm which is divided features based on every unique word the only way to reduce the into different groups based on the numerical data based on data is by using PCA where it is not 100% accurate and the plot sample data of Netherland’s telecommunication regarding the diagram is hardly observable. The data is in the form of text, it solution in telecommunication problem. This research is needs to be translated into numbers in order the machine conducted because there has not been any research on learning can understand. The translation algorithm from text to comparing clustering algorithm in telecommunication data and number which in this case is word2vec is not able to put the there is no research on comparing 4 types of clustering similar word into same weight because every word is deemed as algorithm on text data. So, 4 types of clustering algorithm different words so word “like” and “liking” will be different. which are Centroid-based clustering, Density-based Clustering, Lastly, this research does not have any ground truth on the Hierarchical Clustering, and Distribution-based Clustering used clustering, so it depends on cluster validation S_Dbw score for clustering the text documents. The result then is evaluated which see how dense and scatter the clusters are and a by S_Dbw and observation of the result from each algorithm. observation on the data. The result of the research showed that centroid-based has the There are many clustering algorithms exist, but there has not best performance compared to other clustering algorithms based been a comparison on clustering algorithm based on its type on the telecommunication data while Hierarchical algorithm is while there are many other paper comparing between 2 the most consistent on grouping the data. clustering algorithm without any basis on why it is chosen [16,7,13]. Also there has not been any research on Keywords telecommunication provider solution on major problems by Preprocessing, k-medoids, OPTICS, Hierarchical, GMM, using clustering techniques. The research is mainly focusing on S_Dbw what is the best out of 4 types of clustering algorithm based on google developers1 which are Centroid-based clustering, 1. INTRODUCTION Density-based Clustering, Hierarchical Clustering, and Telecommunication infrastructure and internet are part of daily Distribution-based Clustering. The result will be based on the needs for everyone and the foundation to keep society running S_Dbw score and the observation on the data. properly. So, telecommunication provider needs to be able to There are 2 main research questions in this research which maintain major problem such as outage or maintain availability follow CRISP-DM cycle2. First, “what is the appropriate of contact emergencies in Netherlands, 112. If the provider number of clusters for each algorithm?” this question objective manages to solve the major problem, then they need to submit is trying to find the best cluster performance from each their solution to the national providers [6]. However, the clustering algorithm. The comparison is within the same range solution needs to be anonymized such that it does not contain to each other to ensure the validity of the result. In order to find any data related to the origin of the providers due to commercial the best cluster, the S_Dbw score is used as the comparison. reason. Because of that the research is going to use the sample The second research question is “which is the best clustering in the form of anonymized data. The data used will be algorithm based on the possible factors as comparison” and the implemented on unsupervised machine learning given the fact reason as of why the clustering algorithm is the best. no labels on the data in order to help the telecommunication The paper is structured into 4 sections with abstract and providers categorized the data which in this case going to use references. The first section is the introduction explaining what clustering without doing it manually. the background is, limitation, problems, research questions, and the structure of the paper also there is research question in subsection 1. In the section 2 is the background of the research Permission to make digital or hard copies of all or part of this work for all of the information that supports the research and why it is personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that chosen. Section 3 is the findings of the report that shows what is copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, 1https://developers.google.com/machine- requires prior specific permission and/or a fee. learning/clustering/clustering-algorithm 35thTwente Student Conference on IT, April. 26th, 2021, Enschede, The Netherlands. Copyright 2021, University of Twente, Faculty of Electrical 2 https://www.datascience-pm.com/crisp-dm-2/ Engineering, Mathematics and Computer Science. 1 the result and what has been done in the whole of the research. Lastly, section 4 is the conclusion to conclude the paper. 1.1 Research Questions Based on CRISP-DM process model. There are 2 main research question to be answered at the end of this paper. The question is focusing on data modelling and data evaluation. The other CRISP-DM cycle will be explained in methodologies. The provider solutions refer to the data that will be used in this research. Main research question 1: Figure 1. Workflow of CBOW and Skip-gram What is the appropriate number of cluster groups for each algorithm? 2.3 Clustering • What are the appropriate range value of clusters for Clustering is an unsupervised learning method that can classify each algorithm except density-based ratio? data into different clusters of data. Not only numerical data, but text documents also possible to be clustered. However, there is • What is the lowest cluster validation value for each no way for clustering to understand documents directly, the algorithm? documents need to be translated into statistical features through • Will every/any algorithm have the same amount of a weighted text classification in a form of number. In this cluster groups based on lowest cluster validation research, Word2vec is used for that purpose which is previously value of S_Dbw? explained in detail in section 2.2. Main research question 2: There are many clustering methods available, but in this research 4 algorithms are chosen because based on google Based on 4 types of clustering algorithm which one is the most developers1 there are 4 types of clustering. Centroid based optimum implementation for provider solutions? clustering, in this case k-medoids is used. Density-based • What factors from each algorithm can be used as clustering which uses the OPTICS algorithm. Distributed-based comparison? clustering, Gaussian Mixture Model. Lastly, Hierarchical • What causes each algorithm is better than others? clustering using agglomerative hierarchical. Those algorithms used to represent each type will be explained later in detail. 2. STATE OF ART 2.3.1 K-medoids 2.1 Preprocessing Text As the name implies K-medoids is the combination of K-means The subsection is discussing steps on how to clean the data. concept and using medoids instead of means for reference Preprocessing text data is a method to clean the data and always points of a cluster [18]. Medoid is a statistical function to find a used in text mining [1]. This is needed in order to maintain central point of the data which can be seen on equation below. consistency of text data that is translated into statistical feature Later, the reference point from the equation helps reducing in form of numbers. sensitivity of an outlier in set of data. For instance, an outlier will be nearer from the reference point rather than using means, There are 5 steps needed to be performed on the raw data before so it minimizes dissimilarities between data [4]. The algorithm applied to the algorithm [9,19,2]. Lower casing the words to of K-medoids is similar to K-means it chooses random data as prevent any capitals in the text. Tokenizing the texts into words medoid points. Other points then see the nearest reference so it can be processed in the next step. Removing punctuation points to each other. After that all data now in clusters. Now and then stopwords because clustering is sensitive to stopwords each cluster try to find medoids for each cluster then iterate the and lastly implementing lemmatization which generalize the whole data points based on the medoids value hoping for words in the same basic form. For instance, making plural changes. If there are no changes then the clustering is final. words into base form which is singular. Based on how many iterations it wants, the algorithm will store 2.2 Word2vec the clustering results as many as given iteration then see which result is the most balanced clustering.