Density Based Data Clustering
California State University, San Bernardino
CSUSB ScholarWorks
Electronic Theses, Projects, and Dissertations
Office of Graduate Studies
3-2015

Density Based Data Clustering
Rayan Albarakati
California State University - San Bernardino

Follow this and additional works at: https://scholarworks.lib.csusb.edu/etd
Part of the Other Computer Engineering Commons

Recommended Citation
Albarakati, Rayan, "Density Based Data Clustering" (2015). Electronic Theses, Projects, and Dissertations. 134. https://scholarworks.lib.csusb.edu/etd/134

This Project is brought to you for free and open access by the Office of Graduate Studies at CSUSB ScholarWorks. It has been accepted for inclusion in Electronic Theses, Projects, and Dissertations by an authorized administrator of CSUSB ScholarWorks. For more information, please contact [email protected].
DENSITY BASED DATA CLUSTERING

A Project Presented to the Faculty of California State University, San Bernardino

In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science

by Rayan Albarakati
March 2015

Approved by:
Haiyan Qiao, Advisor, School of Computer Science and Engineering
Owen J. Murphy
Krestin Voigt

© 2015 Rayan Albarakati

ABSTRACT

Data clustering is a data analysis technique that groups data based on a measure of similarity. When data is well clustered, the similarities between objects in the same group are high, while the similarities between objects in different groups are low. Data clustering is widely applied in a variety of areas such as bioinformatics, image segmentation and market research. This project conducted an in-depth study of data clustering with a focus on density-based clustering methods. The latest density-based algorithm, Clustering by Fast Search and Find of Density Peaks (CFSFDP), is based on the idea that cluster centers are characterized by a higher density than their neighbours and by a relatively large distance from points with higher densities. This method has been examined, tested experimentally, and improved. Three methods (KNN-based, Gaussian Kernel-based and Iterative Gaussian Kernel-based) are applied in this project to improve CFSFDP density-based clustering. The methods are applied to four milestone datasets, and the results are analyzed and compared.

ACKNOWLEDGEMENTS

I would like to thank God. I also thank my advisor, Dr. Haiyan Qiao, for all her time, knowledge and patience. I would also like to thank my committee members, Dr. Owen J. Murphy and Dr. Krestin Voigt, and my parents for their support and patience.
TABLE OF CONTENTS

Abstract
Acknowledgements
List of Figures
1. INTRODUCTION TO DATA CLUSTERING
1.1 Overview
1.2 Significance
1.3 Grouping of Data Clustering Techniques
2. REVIEW OF DENSITY-BASED CLUSTERING
2.1 Review of Density-based Clustering Algorithms
2.1.1 Density-Based Spatial Clustering Application with Noise (DBSCAN)
2.1.2 Density-Based Clustering (DENCLUE)
2.2 Datasets
3. METHODOLOGY
3.1 Clustering by Fast Search and Find of Density Peaks (CFSFDP)
3.2 Analysis
4. ALGORITHM DESIGN, IMPLEMENTATION AND ANALYSIS
4.1 KNN-based CFSFDP
4.1.1 Method
4.1.2 Results
4.2 Gaussian Kernel-based CFSFDP
4.2.1 Method
4.2.2 Results
4.3 Iterative Gaussian Kernel-based CFSFDP
4.3.1 Method
4.3.2 Results
4.4 Comparisons and Analysis
5. CONCLUSIONS
References

LIST OF FIGURES

2.1 Schematic representation of density connectivity. In blue the core points, in green the points on the boundaries, in red the rest of the points.
2.2 Datasets chosen to test different algorithms for data clustering: A - Aggregation, B - Jain, C - Flame, and D - Spiral.
3.1 ρ - Schematic representation of how ρ_i is calculated. In red there is the point with the highest local density. δ - Schematic representation of how δ_i is calculated. In red there is the point with the highest local density, so δ_red is obtained considering its farthest point. For the blue point, with lower density, δ_blue is calculated as the minimum distance between that point and any other point with a higher local density.
3.2 A - Data set distribution. The points are numbered starting from the highest to the lowest local density. B - Density decision graph for the dataset on the left. The data points with higher δ and ρ are selected as centers of the clusters (points 1 and 10). The points with high ρ and low δ belong to the existing clusters. Points 7, 8 and 9 belong to the cluster centered in 1 (red). Points 13, 15 and 22 belong to the cluster centered in 10 (blue). The data points with low ρ and high δ are considered outliers (black).
3.3 Rectangle selected by the user to define ρ_minimum and δ_minimum.
3.4 Aggregation dataset, CFSFDP method applied.
3.5 Flame dataset, CFSFDP method applied.
3.6 Jain dataset, CFSFDP method applied.
3.7 Spiral dataset, CFSFDP method applied.
4.1 Aggregation dataset, KNN method applied.
4.2 Flame dataset, KNN method applied.
4.3 Jain dataset, KNN method applied.
4.4 Spiral dataset, KNN method applied.
4.5 Aggregation dataset, Gaussian Kernel method applied.
4.6 Flame dataset, Gaussian Kernel method applied.
4.7 Jain dataset, Gaussian Kernel method applied.
4.8 Spiral dataset, Gaussian Kernel method applied.
4.9 Aggregation dataset, Iterative Gaussian Kernel method applied.
4.10 Flame dataset, Iterative Gaussian Kernel method applied.
4.11 Jain dataset, Iterative Gaussian Kernel method applied.
4.12 Spiral dataset, Iterative Gaussian Kernel method applied.

1. INTRODUCTION TO DATA CLUSTERING

1.1 Overview

Clustering analysis is aimed at classifying objects into categories on the basis of their similarity, and, nowadays, it is a technique used in many different fields such as bioinformatics, image segmentation and market research [1]. The goal of data clustering is to find groups of similar objects in a dataset while keeping them separated from the noisy points (outliers).
Data objects that are classified in the same group should display similar properties based on particular criteria; with such classification information, it is possible to infer the properties of a specific object from the category to which it belongs [2]. Aldenderfer and Blashfield (1984) summarized the goals of cluster analysis as follows [3]:
• Development of a classification
• Investigation of useful conceptual schemes for grouping entities
• Hypothesis generation through data exploration
• Hypothesis testing, or the attempt to determine if types defined through other procedures are in fact present in a data set.

Cluster analysis is a useful tool in large scale data analysis, providing a compressed representation of the data. Even though several different clustering strategies have been proposed and many different algorithms have been published to group similar data, it is not possible to determine which criterion is the best solution in general [2].

1.2 Significance

Data clustering became an important problem solving technique because it can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them [17]. Cluster analysis itself is therefore not one specific algorithm but a general task. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, and intervals or particular statistical distributions [16]. Data clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and the intended use of the results.
Cluster analysis is considered an iterative process of knowledge discovery and interactive multi-objective optimization that involves trial and error [17]. Clustering has been applied in a wide variety of fields [16]:
1. Biology. One example of how clustering can be used is human genetic clustering, which uses the similarity of genetic data to infer population structures.
2. Computer sciences. One example of clustering is image segmentation, which divides a digital image into distinct regions for border detection or object recognition.
3. Life and medical sciences. One example of how clustering can be used is to analyze patterns of antibiotic resistance and to classify antimicrobial compounds according to their mechanism of action.
4. Astronomy and earth sciences. One example of how clustering can be used is to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties.
5. Social sciences. One example of how clustering can be used is to identify areas where there are greater incidences of particular types of crime.
6. Economics. One example of how clustering can be used is to group all the shopping items available on the web into a set of unique products; for instance, all the items on eBay can be grouped into unique products.
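
To make the density-based notion of a cluster more concrete, the density-peak idea stated in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's implementation. The rule for δ follows the description in the caption of Figure 3.1; the density estimate ρ used here (a count of neighbours within a cutoff distance `d_c`) is an assumption, since this section does not define it. The function names, the thresholds `rho_min` and `delta_min` (standing in for the user-selected rectangle of Figure 3.3), and the toy data are all hypothetical.

```python
import math


def density_peaks(points, d_c):
    """Compute (rho, delta) for each point.

    rho[i]   : number of other points closer than the cutoff d_c
               (an assumed density estimate, see the lead-in).
    delta[i] : minimum distance to any point with a higher rho; for a
               point with no denser point, the distance to its farthest point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        to_denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(to_denser) if to_denser else max(dist[i]))
    return rho, delta


def pick_centers(rho, delta, rho_min, delta_min):
    """Indices whose rho and delta both reach the chosen thresholds."""
    return [i for i, (r, d) in enumerate(zip(rho, delta))
            if r >= rho_min and d >= delta_min]


# Toy data (hypothetical): two well-separated clumps of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 0.0),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta = density_peaks(points, d_c=0.8)
centers = pick_centers(rho, delta, rho_min=4, delta_min=5.0)
print(centers)  # prints [4, 10], the middle point of each clump
```

In this toy example only the two middle points combine a high ρ with a high δ, so they are the only candidates for cluster centers; the surrounding points have a denser neighbour close by and therefore a small δ. The pairwise distance table makes the sketch quadratic in the number of points, which is acceptable for illustration but not for large datasets.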