Institute for Parallel and Distributed Systems University of Stuttgart Universitätsstraße 38 70569 Stuttgart Germany

Master Thesis

Evaluation of Prediction Mechanisms of Parameters for Algorithms

Kinda Loutfi

Study Program: Computer Science

Examiner: PD Dr. rer. nat. habil. Holger Schwarz

Advisors: Manuel Fritz, M.Sc. Dipl.-Inf. Michael Behringer

Start date: 01.11.2017

End date: 01.05.2018

Abstract

Extracting knowledge and useful information from huge amounts of data is one of the biggest challenges in computer science today. Data mining is an essential step in the knowledge discovery process because of its central contribution to extracting this knowledge.

Clustering is one of the most widely used data mining techniques. It is the process of partitioning a set of objects into smaller groups called clusters, such that similar objects are assigned to the same cluster and dissimilar objects are assigned to different clusters.

K-Means, DBSCAN, and OPTICS are three of the most popular clustering algorithms used to group similar data into clusters. Each of these algorithms requires input parameters, and the difficulty of knowing suitable parameter values in advance is a major weakness of all three. Many approaches have been introduced that predict these parameters. However, these approaches exhibit various problems, such as a long overall runtime to arrive at predicted parameter values, since the clustering algorithm is applied multiple times with varying parameters to identify the best parameter configuration.

This thesis introduces a new approach for predicting the input parameters, called PROD (Position-Based Prediction Using Voronoi Diagrams). PROD overcomes the problems that emerged in previous approaches. It performs the prediction of the input parameters by applying space partitioning techniques and subsequently exploiting their properties.

PROD is evaluated on various data sets. Regarding the K-Means algorithm, PROD predicts the number of clusters as well as the initial centroids. The experimental results reveal that PROD is (a) more accurate and (b) faster by a factor of 26.5 than the best previously existing approaches. Additionally, the evaluation results show that it can be used either as a prediction approach for the number of clusters or as a standalone clustering algorithm.

Despite the effectiveness of the predictions made by PROD with regard to the K-Means parameters, the experiments show that this novel approach needs further improvement before the prediction of the DBSCAN and OPTICS parameters works reliably. More precisely, changing some aspects of PROD might lead to a better prediction with regard to these two algorithms.

Table of Contents

1. Introduction
2. Preliminaries
   2.1 Data Mining
   2.2 Clustering Technique
       2.2.1 K-Means
       2.2.2 DBSCAN
       2.2.3 OPTICS
   2.3 Space Partitioning
       Voronoi Diagram and Delaunay Tessellation
   2.4 Sampling Algorithms
       2.4.1 Random Sampling Method Without Replacement
       2.4.2 Systematic Sampling Method Without Replacement
3. Related Work
   3.1 Previous Prediction Methods for Density-Based Clustering Algorithms
       3.1.1 Determining DBSCAN Parameters
       3.1.2 Determining OPTICS Parameters
   3.2 Previous Prediction Methods for K-Means Algorithm
       3.2.1 Enhanced K-Means Clustering Algorithm Using a Heuristic Approach
       3.2.2 Improved K-Means Algorithm Based on the Clustering Reliability Analysis
       3.2.3 The Global K-Means Clustering Algorithm
       3.2.4 The Elbow Method
       3.2.5 The Information-Theoretic Approach
       3.2.6 The Silhouette Method
4. Position-Based Prediction Using Voronoi Diagrams
   4.1 Overview
   4.2 Sampling
       4.2.1 Position-Based Sampling Approach
       4.2.2 Complexity of Position-Based Sampling
   4.3 Space Partitioning
   4.4 Fine Tuning
       4.4.1 Scattering the Data Points
       4.4.2 Inspection of Voronoi Regions
       4.4.3 Merging Voronoi Regions
       4.4.4 Complexity of the Fine Tuning
   4.5 Predicting Clustering Parameters
       4.5.1 Predicting Input Parameter for K-Means Algorithm
       4.5.2 Predicting Input Parameters for Density-Based Clustering Algorithms
       4.5.3 Complexity of the Prediction
   4.6 Identifying Outliers
   4.7 The Overall Complexity of PROD
5. Implementation
   5.1 Class Diagram
       5.1.1 The kdtnearest Package
       5.1.2 The sampling Package
       5.1.3 The spacePartitioning Package
       5.1.4 The fineTuning Package
       5.1.5 The prediction Package
       5.1.6 The com.mavenproject.predictingparams Package
   5.2 Libraries
       5.2.1 Open Source Implementation of Nearest-Neighbor Search
       5.2.2 Voronoi Diagram Library
       5.2.3 Mining Algorithms Library
       5.2.4 Implementation of Previous Prediction Approaches
   5.3 Python Visualization
   5.4 Data Sets
6. Evaluation
   6.1 Sampling Evaluation
   6.2 Space Partitioning Evaluation
   6.3 Fine Tuning Evaluation
       6.3.1 Evaluation of Different Settings of the Threshold Distance
       6.3.2 Evaluation of the Required Time
   6.4 Evaluation of the Overall PROD Approach
       6.4.1 Evaluation of Predicting K-Means Parameter
       6.4.2 Evaluation of Predicting DBSCAN and OPTICS Parameters
7. Conclusion
8. Outlook
9. References
10. Appendix

1. Introduction

The world is currently in the big data era. Data is the basis on which users perform their studies and analyses, as it provides tremendous value for these analyses. This fact is leading to a rapid increase in the number of available data sets [1]. Moreover, the Internet and automated data collection tools contribute, on the one hand, to increasing the number of data sets stored in databases and other data repositories, and on the other hand, to growing the size of each stored data set. This huge amount of data is beneficial, but it has one main issue: a lack of extracted knowledge.

To overcome this issue, this huge number of data sets must be analyzed in order to extract the necessary knowledge, which can lead to a better understanding of the data. However, analyzing such huge amounts of data manually is cumbersome for the analysts. Hence, the analysts need help in performing such an analytical process. More specifically, a data analytics process that guides and supports the analysts is required in order to ease the process of eliciting knowledge from the available data sets.

There exist multiple reference process models which guide the analysis process. One important reference process model is Knowledge Discovery in Databases (KDD) [2]. The KDD process model consists of the following main steps: (1) Data Selection, which includes selecting the application-relevant target data sets and focusing on the data on which the discovery will be performed; (2) Preprocessing and Cleaning, in which data reliability is enhanced by handling missing values and removing noise from the data; (3) Data Transformation, in which the existing data is transformed to better match the needs of the subsequent mining step; (4) Data Mining, which includes choosing the mining technique, choosing the mining algorithms, applying these algorithms, and generating the models; (5) Evaluation and Interpretation of the generated models and patterns; and finally (6) Using the discovered knowledge by applying the generated models. Similarly, there are other reference process models, e.g., CRISP-DM [41], which have steps similar to those of the KDD process.

There are no exact guidelines on how to perform each of these steps. Instead, the analysts need to explore the solution space along the analytics process in order to find proper options. Having prior domain knowledge about the specific data set, as well as prior technical knowledge about the valid analytical options, might help in reducing the solution space. However, having both the domain and the technical knowledge in advance is very difficult. One analysis on data and analytics made by Gartner indicates that by the year 2020, more than 40 percent of all tasks performed by data scientists will be automated [3]. Furthermore, a research vice president at Gartner said, "The key to simplicity is the automation of tasks that are repetitive, manual intensive and don't require deep expertise" [3]. This means that the technical knowledge needed by analysts to perform data science tasks will not be needed as much as before. Although this will help the analysts in performing the analytical process without requiring deep technical knowledge, domain knowledge about the specific context of the analysis is still important for reducing the solution space. Without any prior knowledge, the process becomes very difficult for the analysts due to the hugely increasing number of data sets.


Data mining plays an essential role in the knowledge discovery process [4]. The importance of this step comes from the fact that the models which deliver the new knowledge are generated in this step. Data mining uses algorithms, i.e., mining algorithms, in order to deliver new knowledge. Mining algorithms are applied to data sets from which the analyst needs to extract hidden knowledge. In general, mining algorithms require input parameters in order to be executed. These parameters have to be chosen by the user, and suitable values are specific to each data set; otherwise no solid results are obtained. This means that the analysts need, for each data set, to explore the solution space of all possible input parameters, which is a very time-consuming process, especially for big data sets. However, prior domain knowledge about the specific data set can reduce the solution space of possible valid input parameters, i.e., it eases the exploration process performed by the analysts. For example, if the analyst knows exactly how many clusters are expected in a data set, then he knows exactly what the input parameter should be. Since such knowledge cannot be guaranteed to be available in advance, another approach for reducing the solution space is required.

In general, machine learning is mainly divided into two parts: either (1) supervised learning or (2) unsupervised learning. In supervised learning, e.g., Classification, a prediction for some expected target value is performed. More precisely, predictive models are generated and applied to the data set. Since the result contains the predicted target value, it can be compared to the actual expected target value in order to assess the validity of the result. In unsupervised learning, e.g., Clustering, on the other hand, no prediction is made and no target value is known in advance. More specifically, descriptive models are generated and applied to the data set. These descriptive models focus on the structure and interconnectedness of the data. This means that the validity of the result can only be assessed by the analysts and their domain knowledge.

The fact that the validity of the result predicted by predictive models in supervised learning can be assessed by comparing it to the actual target value makes it possible to perform this assessment automatically. In fact, for supervised machine learning tasks, the problem of finding good algorithms and good parameters is tackled by using so-called Black-Box Optimization techniques [42]. One concrete instantiation of a Black-Box Optimization technique is the SMBO method (Sequential Model-Based Optimization) [43]. Furthermore, Auto-WEKA [6, 43] is an implementation of the SMBO method which is provided as an extension of the machine learning platform WEKA [5]. Auto-WEKA helps analysts by automatically searching for and choosing the parameter setting which yields the best result, i.e., a model which achieves the best match between predicted class and expected class. These Black-Box Optimization techniques help the analysts in exploring the solution space for supervised machine learning tasks, i.e., they make the exploration much easier. However, Black-Box Optimization techniques have some downsides. For example, one important downside is that they often require a huge amount of time [43].

On the other hand, in the case of unsupervised learning, such an automated tool for assessing the different results of applying the algorithms with different parameter settings is not achievable, since there are no previous expectations about the result. More precisely, the analysts still need to manually explore the solution space of valid input parameters for unsupervised learning. This means that, in unsupervised learning, the solution space exploration is much harder for the analysts to perform.

It is of huge benefit to the analysts to be guided while exploring the solution space, as well as to be provided with a reduced solution space which eases the exploration process. In other words, the analysts should be helped by providing a prediction of the specific input parameters that are required by the mining algorithms for each data set. Currently, there exist some heuristics and best practices for predicting input parameters of mining algorithms [15, 16, 19, 20, 23, 24, 25, 26, 27]. However, some of these heuristics have issues. One of the most important issues is that they require applying the original mining algorithm multiple times in order to perform the prediction of the parameters [20, 23, 24, 25, 26, 27]. Furthermore, some heuristics are not feasible for big data due to their high complexity [24, 25].

Consequently, it is important to find a new prediction approach which is able to overcome the problems from which the previous prediction approaches suffer. More precisely, this new approach should be able to (1) correctly predict input parameters in a reasonable time, i.e., be feasible for big data, and (2) produce the prediction without applying the original mining algorithm multiple times.

Space partitioning approaches [37], e.g., Voronoi diagrams [7], are promising methodologies for such a new prediction approach. More specifically, space partitioning algorithms have several properties which help to overcome the previously mentioned problems: (1) they have a reasonable time complexity, which makes them feasible for big data, and (2) they can be used to perform the prediction instead of applying the mining algorithm multiple times.

Our goal is to find a new approach that predicts the parameters using one of the space partitioning approaches, e.g., Voronoi diagrams. Furthermore, since the main issue, as described before, lies in the difficulty of exploring the solution space in unsupervised learning, this new approach focuses on reducing the explored solution space for unsupervised machine learning tasks. Since clustering is a well-known unsupervised learning technique, the three most common clustering algorithms are chosen: K-Means, DBSCAN, and OPTICS. More precisely, the goal of the new prediction approach is to predict the input parameters required by these three clustering algorithms, using Voronoi diagrams. This prediction helps in reducing the solution space of the input parameters required by the clustering algorithm for each data set.

The subsequent sections are organized as follows: Section 2 introduces the preliminaries of the used concepts and approaches: Data Mining, Clustering Techniques, Space Partitioning approaches, and Sampling approaches. Section 3 describes related work, in which some of the previous approaches for predicting input parameters of mining algorithms are presented. Section 4 introduces the Position-Based Prediction Using Voronoi Diagrams (PROD) approach, followed by Section 5, which describes the experimental setup of the proposed approach. Furthermore, an evaluation of PROD is performed and described in Section 6. Section 7 concludes the discussion, and Section 8 gives an outlook on possible future work.


2. Preliminaries

In order to understand the workflow of the new prediction approach, it is necessary as a first step to understand the basics of some concepts and approaches used in building it. Therefore, this section introduces the preliminaries of each of these concepts and approaches. More specifically, Subsection 2.1 describes Data Mining, Subsection 2.2 introduces Clustering Techniques, Subsection 2.3 presents Space Partitioning approaches, and Subsection 2.4 introduces Sampling approaches.

2.1 Data Mining

Data mining is a substantial and valuable step in the Knowledge Discovery Process [4]. Given huge amounts of data, it is of great importance and tremendous benefit to discover useful information and knowledge from this data. The importance of the data mining step arises from the need to discover this knowledge. In data mining, the huge amount of data contained in large data sets is analyzed in order to discover patterns that can later be interpreted and evaluated to generate new knowledge. This analysis is done without any prior knowledge about the data and without any expectations about the result of the analysis.

Discovering new knowledge is done by using different mining techniques and statistical approaches to generate models which deliver this new knowledge. There exist multiple mining techniques; the four most common ones are Association Rule Discovery, Clustering, Classification, and Regression. Each of these techniques is either predictive, e.g., Classification and Regression, or descriptive, e.g., Clustering and Association Rule Discovery. Each mining technique has different algorithms which realize it. For instance, the most common algorithms which fall under the clustering technique are K-Means, DBSCAN, and OPTICS.

Most mining algorithms require a set of input parameters in order to be executed, and the usefulness of the result depends on the quality of these parameters, such that wrong parameters lead to mining results which are not desired by the analysts. The main issue is that the quality of the algorithm and parameters cannot be estimated until the algorithm has been executed completely. Hence, the analysts need to iterate over multiple combinations of algorithms and parameters in order to decide on the best combination, which is very time-consuming, especially when considering an increasing amount of data to analyze. This issue has been partially solved for supervised learning, e.g., Classification, by providing techniques that are able to explore the solution space automatically. However, regarding unsupervised learning, e.g., Clustering, the exploration of the solution space is still a highly difficult process for an analyst to perform manually. Thus, it is important to focus on solving this problem for unsupervised learning. Subsection 2.2 explains Clustering Techniques, which belong to unsupervised learning, and introduces examples of the most famous clustering algorithms.

2.2 Clustering Technique

One of the mining techniques used to discover new knowledge is clustering [8]. Clustering is a technique used in unsupervised learning in order to generate descriptive models that describe the structure and relations in the data set. Given n items in the space, the goal of clustering is to group similar data objects into k clusters. These k clusters have two main properties: (1) the items contained in one cluster are similar, i.e., the intra-cluster similarity is maximized, and (2) the items contained in different clusters are different, i.e., the inter-cluster similarity is minimized.

Clustering techniques can mainly be realized by five different clustering methods [8]: Partitioning Methods, Density-Based Methods, Hierarchical Methods, Grid-Based Methods, and Model-Based Methods. Each of these five main methods has many algorithms which realize it. For instance, the most common algorithm used for partition-based clustering is K-Means [9]. Regarding the density-based clustering methods, the two most famous algorithms are DBSCAN [13, 14, 15] and OPTICS [16]. Each of these clustering algorithms requires, for each data set, different input parameters in order to perform the clustering of this data set. Since these three clustering algorithms are the most commonly used ones, the focus is on them in the subsequent sections.

2.2.1 K-Means

K-Means is a partitioning-based clustering algorithm [9, 44]. It partitions a set of data items S into k clusters such that each cluster has a centroid (central point) which is the mean of all items belonging to this cluster. The main idea of K-Means is to minimize the Sum of Squared Errors (SSE), which measures the total squared Euclidean distance of the data items to their centroids, while performing the partitioning [9, 44].

The algorithm starts by choosing random centroids, and then multiple iterations are performed until no assignments change anymore, which means that a (local) minimum of the SSE has been reached. In each iteration, the Euclidean distance between each data item and each centroid is calculated in order to assign the item to the cluster of the nearest centroid. Then, the centroid of each cluster is updated by calculating the mean of all data items assigned to this cluster. Algorithm 1 presents this procedure in pseudo code.

K-Means is highly popular because of its many advantages. One of the most important reasons for its wide usage is its linear complexity: the complexity of i iterations in which n data items are partitioned into k clusters is O(i * n * k) [44]. This makes the algorithm efficient even on very large data sets. Other advantages of this algorithm are its simplicity and its speed of convergence.

However, as with any other algorithm, K-Means has some disadvantages. One critical disadvantage is its sensitivity to the selection of both the initial centroids and the number of clusters k, since these selections can affect the result. Many methods have been proposed to solve the problem of the initial selection of the centroids [10, 11]. The selection of k is crucial; however, it is not trivial and requires prior knowledge by the analyst. Furthermore, the algorithm is very sensitive to outliers, because one outlier can dramatically increase the squared error. Another disadvantage of K-Means is its limitation to numeric data; this limitation was addressed by Huang, who introduced the K-prototypes algorithm, which is based on K-Means but can additionally cluster categorical data [12].


Input: S, k
Output: clusters

Randomly choose k cluster centroids.
repeat
    Calculate Euclidean distances and assign data items to the nearest centroid.
    Re-calculate cluster centroids according to the constructed clusters.
until the clusters' centroids do not change anymore

Algorithm 1. K-Means Algorithm.
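To make the procedure of Algorithm 1 concrete, the following Python sketch implements the same loop with NumPy. It assumes S is given as a NumPy array of shape (n, d); the function name, the iteration limit, and the convergence check are illustrative choices and not part of Algorithm 1 (empty clusters are not handled in this sketch).

    import numpy as np

    def k_means(S, k, max_iter=100, seed=0):
        """Minimal K-Means sketch following Algorithm 1 (random initial centroids)."""
        rng = np.random.default_rng(seed)
        # Randomly choose k cluster centroids from the data items.
        centroids = S[rng.choice(len(S), size=k, replace=False)]
        for _ in range(max_iter):
            # Assign each data item to the cluster of the nearest centroid.
            distances = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Re-calculate the centroids as the mean of the assigned items.
            new_centroids = np.array([S[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):  # centroids do not change anymore
                break
            centroids = new_centroids
        return labels, centroids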

2.2.2 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm which is able to discover clusters of arbitrary and complex shapes [13, 14, 15]. In contrast to K-Means, DBSCAN can handle noise, i.e., outlier data points, which means that it does not necessarily group all data points into clusters. Furthermore, using spatial access methods, this algorithm is also efficient for large spatial databases. The DBSCAN algorithm requires two input parameters, ε and MinPts, and works in a two-phase approach in order to find clusters. The parameter ε defines a distance from one data point to another data point in the space. More specifically, the ε-neighborhood of a data point p contains all other data points whose distance to p is less than or equal to ε. The second input parameter, MinPts, defines the number of data points that should be contained in the ε-neighborhood of a data point. More precisely, a data point p should have at least MinPts data points in its ε-neighborhood in order to form a cluster.

Each cluster resulting from applying the DBSCAN algorithm is considered to contain two types of points: border points and core points. DBSCAN is mainly based on three concepts: (1) the ε-neighborhood of a point p, which consists of all points whose distance from p is at most the threshold ε, i.e., N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}; (2) direct density-reachability, such that a point p is directly density-reachable from a point q if q is a core point and p belongs to the ε-neighborhood of q; and (3) density-reachability: a point p is density-reachable from a point q if there is a chain of points o_1, ..., o_n with o_1 = q and o_n = p such that q is a core point and each o_{i+1} is directly density-reachable from o_i, where 1 ≤ i < n. Generally, a border point has fewer points in its ε-neighborhood than a core point. More precisely, for a point p to be a core point, the number of points in its ε-neighborhood has to reach the threshold MinPts, i.e., |N_ε(p)| ≥ MinPts.

The DBSCAN algorithm considers a point q to belong to a cluster C if there is a core point p ∈ C such that q is density-reachable from p. More generally, a cluster C is defined as the set of points which are density-reachable from a core point p in C. Using the two parameters ε and MinPts, DBSCAN works in a two-phase approach in order to find clusters. Firstly, the ε-neighborhood of a point p is checked to determine whether p is a core point or not. In case p satisfies the core point condition, a new cluster is created. Secondly, all points which are density-reachable from p are identified and added to the cluster containing p. The identification of the density-reachable points from p is done iteratively by following a procedure consisting of three steps: (1) collecting the points directly density-reachable from p, (2) checking the ε-neighborhood of each of these collected points, and (3) in case any of these points is identified as a core point, adding its ε-neighborhood to the cluster and checking it in the next step. This procedure is repeated as long as there are new points that can be added to the current cluster. When no new points can be added to the current cluster, DBSCAN identifies a new core point in the space and creates a new cluster as described above.

The ε-neighborhood of a data point is considered to be relatively small, and retrieving this ε-neighborhood can be performed by spatial access methods like R*-trees, which have, in the worst case, a height of O(log n) for n points. For n data points, this yields an average runtime complexity of O(log n) for retrieving the ε-neighborhood of a point and an average runtime complexity of O(n * log n) for DBSCAN, since the algorithm retrieves the ε-neighborhood of each data point at most once.

The DBSCAN algorithm has several advantages which make it superior in comparison with other algorithms. One important strength of this algorithm is that it can easily determine which data points should be identified as outliers; in other words, it can identify data points that do not belong to any of the clusters. Moreover, another important advantage of DBSCAN is its efficiency on large spatial databases. More specifically, using spatial access methods, the computing time of DBSCAN increases almost linearly with the total number of data points [14]. More precisely, applying the DBSCAN algorithm to spatial databases using spatial access methods has an average runtime complexity of O(n * log n) [15]. For other databases, DBSCAN has a worst-case runtime complexity of O(n²), which is one disadvantage of this algorithm. Furthermore, DBSCAN has another important disadvantage, which is its inability to differentiate between clusters that are located close to each other but have different densities.
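The following Python sketch illustrates the two-phase expansion described above. It uses a brute-force neighborhood search instead of a spatial index (hence O(n²)), and the function and label names are illustrative, not taken from the original DBSCAN implementation.

    import numpy as np

    NOISE, UNVISITED = -1, 0

    def dbscan(X, eps, min_pts):
        """Brute-force DBSCAN sketch: label 0 = unvisited, -1 = noise, 1..k = cluster id."""
        n = len(X)
        labels = np.zeros(n, dtype=int)
        # Pre-compute the eps-neighborhood of every point (no spatial index).
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
        cluster_id = 0
        for p in range(n):
            if labels[p] != UNVISITED:
                continue
            if len(neighborhoods[p]) < min_pts:    # p is not a core point
                labels[p] = NOISE
                continue
            cluster_id += 1                        # p is a core point: start a new cluster
            labels[p] = cluster_id
            seeds = list(neighborhoods[p])
            while seeds:                           # expand the cluster via density-reachability
                q = seeds.pop()
                if labels[q] == NOISE:
                    labels[q] = cluster_id         # border point reached from a core point
                if labels[q] != UNVISITED:
                    continue
                labels[q] = cluster_id
                if len(neighborhoods[q]) >= min_pts:
                    seeds.extend(neighborhoods[q]) # q is also a core point
        return labels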

2.2.3 OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based cluster-ordering algorithm which, in contrast to other density-based algorithms, does not explicitly generate a clustering of the data set; instead, it generates an augmented ordering of the data set showing its density-based clustering structure [16]. The information generated by OPTICS is sufficient to construct density-based clusterings corresponding to a wide range of parameter settings. In other words, OPTICS can find clusters with varying densities in a single pass, in contrast to DBSCAN, which finds clusters of only one density in a single pass.

Similar to DBSCAN, OPTICS can handle outlier data points. Furthermore, the OPTICS algorithm is based on the same concepts as DBSCAN; however, it only requires one input parameter, MinPts. The algorithm starts with a generating distance ε, which is the radius of the neighborhood, and produces information about each clustering level, starting from the minimum possible radius up to this ε. The value of ε represents the density of the clusters; more specifically, lower values of ε correspond to higher densities, as the neighborhood is smaller and MinPts points must exist in this smaller neighborhood, which yields a cluster of higher density. Figure 1 illustrates one important observation for the density-based cluster-ordering: for a specific value of MinPts, density-based clusters corresponding to a higher density, e.g., clusters C1 and C2, are completely contained in density-based clusters corresponding to a lower density, e.g., cluster C.


Figure 1. Nested Density-Based Clusters [16].

The OPTICS algorithm assigns two values to each point in the data set: (1) the core-distance and (2) the reachability-distance. These two values can be used, by the DBSCAN algorithm, to construct the clusters and assign a cluster membership to each point. Given the parameter MinPts and a point p with its ε-neighborhood N_ε(p), the core-distance of p is the smallest distance ε' between p and another point q ∈ N_ε(p) such that p is a core point with respect to ε'; in case p is not a core point, the core-distance is UNDEFINED. Moreover, the reachability-distance of p with respect to another point o is either the smallest distance such that p is directly density-reachable from o if o is a core point, or UNDEFINED if o is not a core point.
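As a small illustration of these two definitions, the following Python sketch computes the core-distance of a point and the reachability-distance of a point with respect to another point. The helper names are illustrative, float('inf') stands in for UNDEFINED, and counting the point itself as part of its own neighborhood is a convention assumed here, not prescribed by the text above.

    import numpy as np

    UNDEFINED = float("inf")  # stand-in for UNDEFINED

    def core_distance(X, p, eps, min_pts):
        """Smallest eps' such that p is a core point, or UNDEFINED if p is no core point."""
        dists = np.linalg.norm(X - X[p], axis=1)
        neighbor_dists = np.sort(dists[dists <= eps])   # includes p itself at distance 0
        if len(neighbor_dists) < min_pts:
            return UNDEFINED
        return neighbor_dists[min_pts - 1]               # distance to the MinPts-th neighbor

    def reachability_distance(X, p, o, eps, min_pts):
        """Reachability-distance of p w.r.t. o: max(core-distance of o, dist(o, p))."""
        cd = core_distance(X, o, eps, min_pts)
        if cd == UNDEFINED:
            return UNDEFINED
        return max(cd, float(np.linalg.norm(X[p] - X[o])))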

In order to produce these two values, the OPTICS algorithm proceeds as follows. For each point p in the data set that has not been processed yet, the algorithm retrieves its ε-neighborhood, sets its reachability-distance to UNDEFINED, and determines its core-distance. In case p is not a core point, the algorithm leaves p and moves on to the next unprocessed point in the data set. However, if p is a core point at a distance ≤ ε, the algorithm retrieves all points that are directly density-reachable from p and puts them in an ordered list according to their reachability-distances from the closest core point from which they are directly density-reachable. OPTICS proceeds by choosing the point with the smallest such distance from this ordered list, retrieving its ε-neighborhood, determining its core-distance, and finally checking whether it is a core point, in which case its directly density-reachable points are collected and inserted into the list.

After applying OPTICS, the produced cluster-ordering with respect to MinPts and the generating distance ε allows extracting any density-based clustering with respect to this MinPts and any chosen distance ε' ≤ ε. By following a simple procedure, the density-based clusters can be extracted by assigning cluster memberships to the points depending on their generated reachability-distance and core-distance [16].

The runtime of OPTICS, like that of DBSCAN, heavily depends on the runtime of retrieving the ε-neighborhood, which is needed for each point in the data set. More specifically, for n points in the data set, the complexity of both OPTICS and DBSCAN is O(n * runtime of retrieving an ε-neighborhood). Using a tree-based spatial index, the runtime of OPTICS is O(n * log n) for n data points. Moreover, if the points are organized in a grid, direct access is possible and the runtime of OPTICS is reduced to O(n). An extensive performance test showed that the runtime of OPTICS is approximately 1.6 times the runtime of DBSCAN [16]. On the other hand, one of the advantages of OPTICS over DBSCAN is that OPTICS needs only one input parameter, MinPts, whereas DBSCAN needs two input parameters, MinPts and the clustering distance ε. Moreover, a major strength of OPTICS is that the generated cluster-ordering contains information equivalent to applying DBSCAN with multiple parameter settings, i.e., OPTICS is not limited to one global parameter setting.


2.3 Space Partitioning

The main idea of space partitioning approaches is to divide a given space into multiple smaller regions [37]. The partitioning of a space was mainly introduced to enhance search tasks applied to this space and make them faster [38]. Moreover, partitioning the space helps in reducing the complexity of processing applied to the overall space, since this processing is divided into local processing on the resulting smaller regions. Space partitioning is applied either (1) using algorithmic approaches such as KD-trees [38] and R-trees [39], which are index structures for spatial databases allowing faster processing on these databases, or (2) using visual approaches such as Voronoi diagrams and Delaunay tessellations [7].

A Voronoi diagram is a canonical method for partitioning a space into smaller regions. Voronoi diagrams and Delaunay tessellations partition the space geometrically; by applying these methods, the space is partitioned into regions of specific shapes. Voronoi diagrams were selected as a superior alternative in an attempt to overcome the complexity of the data structures of geographic information systems [40]. Voronoi diagrams have many other benefits that make them a good choice for the required partitioning of the space [40]. Their most important advantage is that they can be computed in reasonable time using specific algorithms such as Fortune's algorithm [30]. Therefore, Voronoi diagrams were chosen for the new approach (PROD) in order to perform the partitioning of the space.

Voronoi Diagram and Delaunay Tessellation

The origin of Voronoi diagrams dates back to the 17th century [7]. The mathematicians Dirichlet and Voronoi introduced the concept of these diagrams, which has been used in various fields. The underlying idea is as follows: given a space M and a set S of points in M, called sites, the space is partitioned into regions such that each region contains exactly one of the sites. The region of a site p consists of all points x for which p is the closest site among all sites s ∈ S, where closeness is determined by the Euclidean distance. The structure resulting from this partitioning is called the Voronoi diagram or Dirichlet tessellation.

The dual structure of the Voronoi diagram is obtained by connecting any two sites whose regions share a common boundary. This dual structure was later studied by Delaunay and has been named the Delaunay tessellation or Delaunay triangulation after him. The Delaunay tessellation is obtained by defining that two sites are connected if and only if they lie on a circle that does not contain any other site in its interior.

The Voronoi diagram of 11 sites is illustrated in Figure 2. Given the set S of n sites and p ∈ S, the Voronoi regions are disjoint, and each region VR(p, S) that contains the site p is defined as the intersection of n − 1 open half-planes. More formally, the Voronoi region is defined as VR(p, S) = ⋂_{q ∈ S, q ≠ p} D(p, q), where D(p, q) = {x | d(p, x) < d(q, x)} is the half-plane containing p. Consequently, the Voronoi diagram V(S), which consists of the boundaries of the regions, is defined as follows:

V(S) = ⋃_{p, q ∈ S, p ≠ q} cl(VR(p, S)) ∩ cl(VR(q, S)), where cl(·) denotes the closure of a region.
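In practice, such a diagram can be computed quickly with scipy's wrapper around the Qhull library. The following sketch is illustrative and assumes a small two-dimensional set of sites with arbitrary coordinates.

    import numpy as np
    from scipy.spatial import Voronoi

    # Eleven example sites in the Euclidean plane (arbitrary illustrative coordinates).
    sites = np.random.default_rng(1).random((11, 2))
    vor = Voronoi(sites)

    print(vor.vertices)       # coordinates of the Voronoi vertices
    print(vor.ridge_points)   # pairs of sites whose regions share a Voronoi edge
    print(vor.point_region)   # index of the Voronoi region belonging to each site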


Figure 2. Voronoi Diagram of 11 Points in the Euclidean Plane [7].

The Voronoi diagram consists of Voronoi edges and Voronoi vertices. A Voronoi edge is the common boundary of two Voronoi regions. When a Voronoi edge e, illustrated as a red line in Figure 2, borders the regions of two sites p and q, then this edge e belongs to the bisector of p and q, i.e., e ∈ B(p, q). The endpoints of the Voronoi edges are called Voronoi vertices. Each vertex v, shown as an empty red circle in Figure 2, belongs to the common boundary of at least three Voronoi regions; in other words, each Voronoi vertex has a degree of at least three. There are cases where sites are cocircular or collinear. In the case of cocircular points, Voronoi vertices of a degree higher than three can occur. One special case is that all sites are collinear; this case results in a disconnected Voronoi diagram, as it consists only of parallel lines.

Given a Voronoi diagram and an arbitrary point x in the plane, there are two ways to determine the region to which x belongs. The first method is to draw a circle C centered at x with radius zero and then grow this radius gradually until C hits one or more of the sites. If C hits exactly one site p, then x belongs to the Voronoi region of p. If C hits exactly two sites, p and q, then x lies on the Voronoi edge that borders the Voronoi regions of p and q. If C hits three or more sites, then x is a Voronoi vertex that belongs to the common boundary of the regions of the sites that have been hit. The second method is to draw a circle of radius zero around each site and then expand all circles at the same speed. The region to which x belongs is determined by the site or sites whose circles hit x first.

On the other hand, given S as the set of sites, the Delaunay tessellation DT(S), which is the dual structure of the Voronoi diagram, is constructed by connecting via a line segment any two sites p, q ∈ S for which there exists a circle C on which p and q lie and which does not contain any other site, neither on its boundary nor in its interior. The edges of the Delaunay tessellation are called Delaunay edges. One important connection between the Voronoi diagram and the Delaunay tessellation is that two sites p and q are joined by a Delaunay edge if and only if their regions in the Voronoi diagram are edge-adjacent. If all sites in S are in general position, i.e., no four sites are cocircular, then the dual of the Voronoi diagram is called the Delaunay triangulation. More precisely, in the latter case, the dual of the Voronoi diagram is a set of triangles where each of the three vertices of each triangle is a site of the Voronoi diagram. Figure 3 shows an example of a Voronoi diagram V(S), drawn with solid lines, and the Delaunay tessellation DT(S), drawn with dashed lines, where S is the set of sites.
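This duality can also be checked numerically with scipy. The sketch below compares the edges of the Delaunay triangulation with the ridge pairs of the Voronoi diagram for the same illustrative sites; the variable names are assumptions of this example.

    import numpy as np
    from scipy.spatial import Voronoi, Delaunay

    sites = np.random.default_rng(2).random((11, 2))

    # Edges of the Delaunay triangulation: all pairs of sites that appear in a common triangle.
    tri = Delaunay(sites)
    delaunay_edges = set()
    for simplex in tri.simplices:
        for a in range(3):
            for b in range(a + 1, 3):
                delaunay_edges.add(tuple(sorted((int(simplex[a]), int(simplex[b])))))

    # Pairs of sites whose Voronoi regions share an edge (edge-adjacent regions).
    vor = Voronoi(sites)
    voronoi_adjacent = {tuple(sorted(pair)) for pair in vor.ridge_points.tolist()}

    print(delaunay_edges == voronoi_adjacent)  # True for sites in general position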

Figure 3. Voronoi Diagram and Delaunay Tessellation [7].

2.4 Sampling Algorithms

Sampling is the technique of selecting units from a large population; the selected set is called a sample [17]. This sample is a subset of the population, and it is chosen in order to reduce the complexity of studying the population by studying only this subset. Sampling algorithms are mainly divided into two classes: probability sampling and non-probability sampling. When probability sampling is used to construct the sample, all units in the population have an equal chance of being chosen. When non-probability sampling is used, a different probability is attached to each unit of the population, and the units do not have an equal chance of being part of the sample. Moreover, any kind of sampling can be done either with replacement or without replacement. Sampling with replacement means that each unit can be chosen more than once, i.e., a selected unit is not removed from the population and can be chosen again for the sample. Sampling without replacement means that each unit can be chosen only once, i.e., after selecting a unit from the population, it is removed from the population and cannot be chosen for the sample again.

2.4.1 Random Sampling Method Without Replacement

The random sampling method is the simplest and least complex sampling method [17]. All data points in the data set have an equal probability of being selected into the sample. The main idea of this method is that the selection of the sample points is done randomly using either a random number generator or a random number table. More precisely, given n data points in the data set and a sample size s, the n points are inserted into a list and each point is given a unique number between 1 and n. Using, for instance, the random number generator, s numbers are selected randomly from this list to be included in the sample.

Several formulas can be used to calculate the sample size s [18]. One of the simplest formulas was provided by Yamane [18]. This formula calculates the size of the sample s from the number of data points n and an allowable sampling error e, as depicted in Equation 1. Another way of choosing the sample size relative to the number of data points in the data set is to choose a fixed portion of the total number of points. For instance, for n = 100 data points, one can choose s = 10% of n, i.e., the sample size in this case is 10 data points.


s = n / (1 + n * e²)     (1)
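A minimal sketch of this method in Python, assuming the data points are given as a list and using Yamane's formula (Equation 1) for the sample size; the function names and the default sampling error of 0.05 are illustrative choices.

    import random

    def yamane_sample_size(n, e):
        """Equation 1: sample size for n data points and allowable sampling error e."""
        return round(n / (1 + n * e ** 2))

    def random_sample_without_replacement(points, e=0.05, seed=0):
        s = yamane_sample_size(len(points), e)
        random.seed(seed)
        # random.sample draws s distinct elements, i.e., sampling without replacement.
        return random.sample(points, s)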

2.4.2 Systematic Sampling Method Without Replacement

The systematic sampling method is an equal-probability sampling method [17]; all data points have an equal probability of being chosen for the sample. The idea of systematic sampling is to construct a sample of size s from an ordered set of n data points by always choosing every r-th point from this ordered set, where r is the sampling interval and is calculated from n and s as in Equation 2. More precisely, for n data points in a two-dimensional plane, the following steps are performed: (1) the n data points are sorted according to their position, (2) the sample size s is calculated, (3) the sampling interval r is computed, and (4) every r-th data point is added to the sample until a sample of size s is reached. The sample size s can either be set to a chosen portion of the data set, e.g., 5 percent of the data points, or it can be computed in a more elaborate way [18].

r = n / s     (2)

Sorting the n data points in the two-dimensional plane is done according to the positions of these data points. Two steps are followed in order to sort by position. First, the Euclidean distance between each data point and the point p_0 with coordinates x = 0 and y = 0 is computed. Second, the data points are sorted according to the computed distances. Given the sample size s and the n data points sorted by position, every r-th data point is chosen to be included in the sample. Sorting the data points according to their position and always choosing the r-th point reduces the possibility that two points which are placed close to each other in the plane are both included in the sample.

The complexity of computing the Euclidean distance for each of the n data points is O(n). Moreover, sorting the data points according to the computed distances using the merge sort algorithm has a complexity of O(n * log n). Finally, choosing every r-th data point from the sorted list requires a sequential pass over all n points, i.e., its complexity is O(n). Consequently, the complexity of the systematic sampling method without replacement is O(n * log n).
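The steps above can be sketched in a few lines of Python; the origin point p_0 = (0, 0) follows the text, while the 5 percent default sample size and the function name are illustrative.

    import numpy as np

    def systematic_sample(points, fraction=0.05):
        """Systematic sampling without replacement on 2-D points (Equation 2: r = n / s)."""
        points = np.asarray(points, dtype=float)
        n = len(points)
        s = max(1, int(n * fraction))          # sample size, here a fixed portion of the data
        r = n // s                             # sampling interval
        # Sort the points by their Euclidean distance to the origin p0 = (0, 0).
        order = np.argsort(np.linalg.norm(points, axis=1))
        # Take every r-th point of the sorted data until s points are collected.
        return points[order[::r][:s]]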


3. Related Work

Many prediction approaches have been introduced to help the analysts in the analytical process by providing predictions for the input parameters required by mining algorithms. However, some of these approaches have issues. The most common issue is the need to apply the original mining algorithm multiple times in order to produce the prediction. Another issue is the complexity of some of these approaches, which makes them infeasible for big data.

In Subsection 3.1, two previous methods are introduced: one concerns the prediction of the two input parameters of the DBSCAN algorithm, and the other predicts the input parameter of the OPTICS algorithm. Furthermore, Subsection 3.2 describes previous approaches used for predicting the input parameter of the K-Means algorithm.

3.1 Previous Prediction Methods for Density-Based Clustering Algorithms

In order to be executed, DBSCAN requires two parameters: MinPts, which is the minimum number of data points that should exist in the neighborhood of a core point of a cluster, and ε, which determines the neighborhood. Moreover, OPTICS has to be provided with the generating distance ε in order to be executed. Knowing these parameters for each data set in advance is not trivial. There exist heuristics and best practices which are used to determine MinPts and ε [15, 16]. This section introduces some of the previously proposed approaches for predicting the input parameters of DBSCAN and OPTICS.

3.1.1 Determining DBSCAN Parameters

Having the knowledge of ε and MinPts in advance is not easy; therefore, an effective heuristic is used to determine global parameters which are used for all clusters [15]. The values of ε and MinPts of the "thinnest" cluster are determined and chosen as the global parameters. The thinnest cluster is the cluster with the lowest density that is not yet considered noise.

The heuristic for determining the global parameters is based on the following observation. Given a point p from the data set and the distance d from p to its k-th nearest neighbor, the d-neighborhood of p contains exactly k + 1 points, assuming that it is unlikely that multiple points have exactly the same distance. For a given k, a function k-dist is defined which maps each data point to the distance to its k-th nearest neighbor. After computing the k-dist value for each data point, all points are sorted in descending order according to this value; the resulting graph is called the sorted k-dist graph, and it gives a hint about the density distribution in the data set. For instance, by choosing an arbitrary point from this graph and setting ε equal to its k-dist value and MinPts equal to k, every point with an equal or smaller k-dist value becomes a core point.

Finding the parameters of the thinnest cluster means finding the threshold point, which is the point with the highest k-dist value that is still not considered noise. This threshold point is chosen from the sorted graph by finding the first "valley" in the graph. All points having a k-dist value higher than that of the threshold point are considered noise, whereas all other points with a lower k-dist value are assigned to clusters. Figure 4 shows an example of the sorted k-dist graph for k = 4. The first valley in this graph can easily be detected in the graphical representation. All points to the left of this valley are noise, and all points to the right of the valley are assigned to clusters.

Figure 4. Sorted 4-dist Graph [15].

Automatically detecting the first valley in the graph is difficult; however, it is relatively easy for the user, which led to proposing an interactive approach for detecting this valley. Furthermore, experiments by Ester, Kriegel, Sander, and Xu show that for two-dimensional data, the k-dist graphs for k > 4 do not differ significantly from the 4-dist graph [15]. As a result, the MinPts parameter is set to 4 for all two-dimensional data, i.e., k is set to 4. The required k-dist value, i.e., the ε parameter, is then determined as follows. First, the system constructs the sorted 4-dist graph and presents it to the user. Then, if the user has some knowledge about the noise percentage, he provides the system with this knowledge so that the system can automatically determine the first valley; otherwise, the user detects and chooses the threshold point in the presented 4-dist graph. Finally, the ε parameter of DBSCAN is set to the 4-dist value of the detected threshold point.
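The sorted 4-dist graph can be produced with a few lines of Python. The use of scikit-learn's NearestNeighbors and the noise-percentage shortcut for picking the threshold point are illustrative choices of this sketch, not the interactive procedure from [15].

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def sorted_k_dist(X, k=4):
        """k-dist value of every point (distance to its k-th nearest neighbor), sorted descending."""
        # k + 1 neighbors because the query point itself is returned at distance 0.
        dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        return np.sort(dists[:, k])[::-1]

    def estimate_eps(X, noise_fraction, k=4):
        """If the noise percentage is known, the threshold point is the first non-noise point."""
        k_dist = sorted_k_dist(X, k)
        threshold_index = int(len(k_dist) * noise_fraction)
        return k_dist[threshold_index]          # use as eps, with MinPts = k = 4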

3.1.2 Determining OPTICS Parameters

The cluster-ordering can be visualized using the reachability-plot [16]. The reachability-plot is insensitive to both parameters, MinPts and ε, which is an important advantage in comparison with other density-based clustering methods. However, the number of clustering levels that can be seen in the reachability-plot is influenced by the generating distance ε; for example, choosing a smaller value of ε prevents clusters of lower density from showing up. Experiments conducted by Ankerst, Breunig, Kriegel, and Sander show that OPTICS generates good results when the value of MinPts is chosen between 10 and 20 [16].

Regarding the generating distance ε, simple heuristics can be used to choose a good value. For instance, assuming that the points are randomly distributed, the value of ε can be computed analytically from the k-nearest-neighbor distance, where k is equal to MinPts [16]. More formally, for n points in a data space DS, the value of ε is the radius r of a d-dimensional hypersphere S that contains exactly k points, i.e., MinPts points. The radius of S can be computed by Equation 3. The function Γ in Equation 3 denotes the Gamma function, which is an extension of the factorial function with its argument shifted down by 1.

r = ( (Volume_DS * k * Γ(d/2 + 1)) / (n * √(π^d)) )^(1/d)     (3)
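A direct translation of Equation 3 into Python, using the Gamma function from the standard library; the argument names mirror the symbols in the equation, and the example values are illustrative.

    import math

    def optics_generating_distance(volume_ds, n, d, min_pts):
        """Equation 3: radius of a d-dimensional hypersphere in DS containing MinPts points."""
        k = min_pts
        return ((volume_ds * k * math.gamma(d / 2 + 1)) / (n * math.sqrt(math.pi ** d))) ** (1 / d)

    # Example: 10,000 points in a 2-D data space of volume 1.0, MinPts = 15.
    print(optics_generating_distance(volume_ds=1.0, n=10_000, d=2, min_pts=15))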


3.2 Previous Prediction Methods for K-Means Algorithm

The K-Means algorithm requires the parameter k, the number of clusters, as an input in order to be executed. Knowing the right number of clusters for each data set in advance is not trivial. Many heuristics and best practices exist for determining the number of clusters [19, 20, 23, 24, 25, 26, 27]. Some of these heuristics have various issues. The main problem with some of the existing approaches is that they require applying the original K-Means algorithm multiple times, which is very time-consuming, especially for huge amounts of data. This issue makes these approaches infeasible for big data. This section introduces some of the previously proposed approaches for predicting the number of clusters and mentions the main issues of each of them.

3.2.1 Enhanced K-Means Clustering Algorithm Using a Heuristic Approach

This algorithm uses a heuristic approach to predict the number of clusters k as well as an initialization of the cluster centers, without executing the actual K-Means algorithm [19]. The main assumption on which this algorithm relies is that the clusters are well separated. Given a set P of n points, the algorithm finds k points from k different clusters.

The algorithm starts by choosing an arbitrary point newPoint ∈ P and considers it to be a point of a new separate cluster newCluster. Based on the assumption of well-separated clusters, the distance between newPoint and all points belonging to the same cluster newCluster is smaller than the distance between newPoint and all points which do not belong to this cluster. Taking this assumption into consideration, the algorithm computes the distances between newPoint and all other points in P and chooses the point with the maximum distance as a point of another new separate cluster. The algorithm repeats these steps to find all k points, ensuring that each point stems from a different cluster.

In a certain iteration, the algorithm reaches a state where k + 1 clusters have been chosen. In order to identify this state and to reject the last cluster, the algorithm uses a heuristic based on the Euclidean distance as a stopping condition. The heuristic states that if two objects belong to the same cluster, then they both have similar distances to each point outside this cluster. Consequently, in each iteration the algorithm takes newPoint and finds the nearest point nearestPoint to it among the previously identified cluster representatives. Then, it computes the distances between newPoint and all other points in P as well as the distances between nearestPoint and all other points in P. If the computed distances are similar, then newPoint and nearestPoint belong to the same cluster, so the algorithm stops, rejects the last identified point, and returns the k points found so far. The main disadvantage of this algorithm arises from the fact that the stopping condition is based on a heuristic and therefore cannot be proven to work in all cases [19]. Another problem with this approach is that it is based on the assumption of well-separated clusters, which is often not the case in the real world.

3.2.2 Improved K-Means Algorithm Based on the Clustering Reliability Analysis

The main idea of the Improved K-Means algorithm is to actually execute the original K-Means algorithm for different values of k and then to evaluate, for each k, the effectiveness of the resulting clusters [20]. For n data points to cluster, the different values of k for which the original K-Means algorithm is executed are limited to a range [k_min, k_max]; several researchers [21, 22] showed that k_max ≤ √n, and k_min is set by the algorithm to 2, i.e., the range in which k_OPT falls is [2, √n]. The validity of the resulting clusters is evaluated using specific measures which are based on two concepts: (1) maximizing the between-cluster distances B, i.e., the distances between the clusters, and (2) minimizing the within-cluster distances W, i.e., the distances within one cluster. These measures are computed for each point of the resulting clusters.

More precisely, for a cluster j and a point i inside j, the goal is (1) to maximize B(j, i), which is calculated from the distances between the point i and every point p not belonging to the cluster j, and (2) to minimize W(j, i), which is calculated from the distances between the point i and every point q belonging to the same cluster j. These two measures are merged into one formula called the Between-Within Proportion (BWP), see Equation 4. BWP is used as an indicator of the clustering quality: the bigger the value, the better the resulting clusters. After computing this indicator for each data point of the resulting clusters, the average BWP of the clustering obtained by applying K-Means for a given k is computed using the formula in Equation 5, where n_j is the number of data points in the corresponding cluster j. Given the average BWP for each clustering obtained by applying K-Means for k ∈ [2, √n], the value k_OPT is determined by Equation 6. Clearly, the need to execute the original K-Means algorithm multiple times is one major disadvantage which is not solved by this algorithm.

BWP(j, i) = (B(j, i) − W(j, i)) / (B(j, i) + W(j, i))     (4)

avg_BWP(k) = (1/n) * Σ_{j=1}^{k} Σ_{i=1}^{n_j} BWP(j, i)     (5)

k_OPT = argmax_{2 ≤ k ≤ √n} { avg_BWP(k) }     (6)
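A compact sketch of this procedure using scikit-learn's KMeans. The exact definitions of B(j, i) and W(j, i) in [20] may differ; here they are taken, as one straightforward reading of the text above, to be the average distance of point i to points outside and inside its cluster, respectively. The computation is O(n²) per k and is meant only for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    def bwp(X, labels, j, i):
        """Equation 4 with B/W taken as average distances to points outside / inside cluster j."""
        same = (labels == j)
        d = np.linalg.norm(X - X[i], axis=1)
        B = d[~same].mean()                          # average distance to points of other clusters
        W = d[same].sum() / max(same.sum() - 1, 1)   # average distance to points of the same cluster
        return (B - W) / (B + W)

    def predict_k(X):
        """Run K-Means for k in [2, sqrt(n)] and return the k with the largest average BWP (Eq. 5, 6)."""
        n = len(X)
        best_k, best_avg = None, -np.inf
        for k in range(2, int(np.sqrt(n)) + 1):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            avg = np.mean([bwp(X, labels, labels[i], i) for i in range(n)])
            if avg > best_avg:
                best_k, best_avg = k, avg
        return best_k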

3.2.3 The Global K-Means Clustering Algorithm

The main goal of the global K-Means algorithm is to improve the process of finding the centroids of the clusters [23]. This algorithm clusters all data points into k clusters incrementally. In other words, in order to solve the problem of clustering n data points into k clusters, all intermediate problems of clustering the n data points into 1, ..., k − 1 clusters are solved in sequence. The algorithm starts with k = 1 cluster, for which the optimal solution is known. Then, for k = 2, the cluster centroid obtained for k = 1 remains at its position, and the centroid of the second cluster is placed at several positions within the data space, such that the original K-Means algorithm is applied multiple times in order to find the optimal position of this centroid. More generally, in order to solve the problem of clustering n data points into M clusters, at any stage M all previously found M − 1 centroids remain at their positions, the new M-th centroid is initially placed at several candidate positions, and its optimal position is found by executing the original K-Means algorithm.

In addition to finding the centroids of the clusters, the global K-Means algorithm can be used to find the correct number of clusters. Through applying this algorithm, all intermediate values of k are computed directly. This gives a huge advantage in reducing the computational effort needed to find the suitable number of clusters. On the other hand, one disadvantage of this algorithm is its computational complexity, which is caused by applying the original K-Means algorithm several times for each intermediate value of k.

3.2.4 The Elbow Method

The main idea of this method is to apply the original K-Means algorithm multiple times, starting with k = 2 and then increasing the value until the sum of squared errors of the resulting clusters approximately stabilizes [24, 25, 26]. More precisely, starting with k = 2, K-Means is applied in each iteration and the sum of squared errors (SSE) is calculated; a smaller value of SSE indicates a better fit of the clustering. Increasing the number of clusters initially decreases the value of SSE dramatically, i.e., it yields clusters with better homogeneity. This sharp decrease of SSE continues until a stage is reached where adding a new cluster leads to partitioning points that should be in the same cluster, i.e., to only a very small decrease in the value of SSE. The value of k at which SSE starts to decrease slowly is the desired number of clusters. This value is identified by plotting SSE against the number of clusters; the desired value of k forms an angle in the graph called the elbow, and the best number of clusters is chosen at this point. Figure 5 shows an example of the Elbow method where the red circle marks the elbow, which represents the best value of k. The main problem with this method is that the elbow cannot always be clearly identified, either because it may not be visible in the plot or because there might be multiple elbows in the plot.

Figure 5. Elbow Method and Identification of Elbow Point [26].
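As a minimal illustration of the quantity plotted by the Elbow method, the sketch below computes the SSE as the sum of squared distances of all points to their assigned centroids; the data structures in this example are assumptions for illustration and do not correspond to a particular library.

```java
/** Minimal sketch of the SSE quantity plotted by the Elbow method:
 *  the sum of squared distances between each point and the centroid
 *  of the cluster it is assigned to. */
public class ElbowSketch {

    static double sse(double[][] points, int[] assignment, double[][] centroids) {
        double sum = 0;
        for (int i = 0; i < points.length; i++) {
            double[] c = centroids[assignment[i]];
            double dx = points[i][0] - c[0], dy = points[i][1] - c[1];
            sum += dx * dx + dy * dy;            // squared Euclidean distance
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 0}, {10, 10}, {11, 10}};
        int[] assignment = {0, 0, 1, 1};         // cluster index per point
        double[][] centroids = {{0.5, 0}, {10.5, 10}};
        // In the Elbow method this value is computed once per k and plotted against k.
        System.out.println("SSE = " + sse(points, assignment, centroids));
    }
}
```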

3.2.5 The Information-Theoretic Approach

The information-theoretic approach determines the number of clusters which maximizes efficiency and minimizes error based on information-theoretic criteria [24]. More specifically, having n data points, the original K-Means algorithm is applied for all values of k in the interval ]1, n[, and then the distortion of the resulting clustering is computed, which is a measure of the within-cluster dispersion. The sum of squared errors (SSE) formula is utilized and extended according to the Gaussian distribution model [29] in order to calculate the distance between a point p and the centroid of the cluster to which p belongs.


More specifically, having k clusters, a point p belonging to a cluster C, and o_C as the centroid of C, the distance between p and o_C is calculated using Equation 7. Here, Γ_C is the within-cluster covariance matrix, and y_p is the vector representing the point p in the data space. The resulting distortion values contain jumps which indicate good values for k, and the largest jump is considered the best value of k, i.e., the best number of clusters to be chosen. The jumps are defined as in Equation 8, where W is the extended SSE and M is the dimensionality of the data. The main disadvantage of this approach is the need to apply the original K-Means algorithm for all values of k in ]1, n[.

\[ \mathit{distance} = (y_p - o_C)^{T} \, \Gamma_C^{-1} \, (y_p - o_C) \tag{7} \]

\[ JS(K) = W_K^{-M/2} - W_{K-1}^{-M/2} \tag{8} \]

3.2.6 The Silhouette Method

The main idea of the Silhouette method is to measure the closeness of data points within a cluster and the separation between the clusters [24]. This measurement is done by calculating the silhouette width for each data point. The silhouette width lies in the range [−1, 1] and indicates whether the point is in the correct cluster or not. More precisely, a silhouette width close to 1 means that the point is contained in the correct cluster, whereas a silhouette width close to −1 indicates that the point should rather be contained in the next closest cluster. For each data point p, given the average distance avg(p) between p and all points in its own cluster, and the average distances between p and all points in each other cluster, the silhouette width s(p) is calculated using Equation 9, where minAvg(p) is the minimum of these average distances to the other clusters.

\[ s(p) = \frac{\mathit{minAvg}(p) - \mathit{avg}(p)}{\max(\mathit{avg}(p), \mathit{minAvg}(p))} \tag{9} \]

After calculating the silhouette width for each data point in the resulting clusters, the average silhouette width over all clusters is computed. The main disadvantage of this method is the necessity of applying the original K-Means algorithm multiple times with different values of k and performing a large amount of computation in order to predict the best value of k. More specifically, predicting the best k requires computing the average silhouette width for each resulting clustering, which in turn requires computing the silhouette width for each data point p; all of these computations involve a large number of distance calculations. Given the clusterings resulting from applying K-Means with different values of k, together with the average silhouette width of each result, the largest average silhouette width indicates the best value of k, i.e., the best number of clusters.
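The sketch below shows how the silhouette width of Equation 9 could be computed for one point; the list-based representation of clusters is an assumption made for illustration.

```java
import java.util.List;

/** Minimal sketch of the silhouette width s(p) of Equation 9 for one point. */
public class SilhouetteSketch {

    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    static double avgDistance(double[] p, List<double[]> cluster) {
        double sum = 0;
        int count = 0;
        for (double[] q : cluster) {
            if (q == p) continue;              // exclude the point itself
            sum += distance(p, q);
            count++;
        }
        return count == 0 ? 0 : sum / count;
    }

    /** clusters: one list of 2D points per cluster; own is the index of p's cluster. */
    static double silhouetteWidth(double[] p, List<List<double[]>> clusters, int own) {
        double avg = avgDistance(p, clusters.get(own));
        double minAvg = Double.MAX_VALUE;      // minimum average distance to any other cluster
        for (int c = 0; c < clusters.size(); c++) {
            if (c == own) continue;
            minAvg = Math.min(minAvg, avgDistance(p, clusters.get(c)));
        }
        return (minAvg - avg) / Math.max(avg, minAvg);   // Equation 9
    }
}
```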


4. Position-Based Prediction Using Voronoi Diagrams

The main goal is to find a new prediction approach that is able to solve the problems of the previously existing approaches. More specifically, this new approach, which is able to predict the required input parameters, has three important properties. First, and most importantly, the prediction is performed without the need to apply the original clustering algorithm multiple times. Second, the new approach is feasible for big data sets, i.e., it can be applied to huge amounts of data and perform the prediction within a reasonable time. Third, the new prediction approach makes the exploration of the solution space much easier for the analysts.

This new approach is called PROD (Position-Based Prediction Using Voronoi Diagrams). The basic idea of PROD is to predict the input parameters of the previously described clustering algorithms by computing Voronoi diagrams for the data set before actually applying the clustering algorithms. An important assumption on which this approach is based is that density areas are almost separated, so that they can be distinguished. Moreover, PROD focuses on two-dimensional data; however, all concepts and observations can be transferred to higher dimensions as well. In this section, an overview of PROD is given, followed by a detailed explanation of each step in the workflow of this approach.

4.1 Overview

Having a data set consisting of n data points, PROD performs the following main steps: (1) Sampling of the original data set into a smaller one, (2) Space Partitioning applied to the space that contains the sampled data set, (3) Fine Tuning of the partitioned space in order to better address the characteristics of the mining algorithms, and finally (4) Predicting the input parameters which are required by these mining algorithms. Figure 6 illustrates the pipeline consisting of these four main steps performed by PROD.

Sampling → Space Partitioning → Fine Tuning → Predicting

Figure 6. The Pipeline of PROD.

More precisely, given a data set consisting of n data points in the space as shown in Figure 7a, PROD starts by sampling these n data points into a smaller set containing m points. For instance, Figure 7b illustrates sampling the original data set shown in Figure 7a into a smaller set that consists of only 11 data points. PROD continues by considering each of these m points as a site and drawing a Voronoi diagram for these m sites, which yields a space partitioned into m Voronoi regions. Figure 7c shows the Voronoi diagram drawn for the 11 chosen sites.

After the Voronoi diagram has been drawn for the m chosen sites, the original n data points are scattered over the m Voronoi regions as shown in Figure 7d. The essential idea of PROD lies in performing some fine tuning on the Voronoi diagram in which all of the n data points are scattered over the m regions. The fine tuning consists mainly of two steps. First, inspecting the Voronoi regions in order to decide whether regions should be merged. Second, applying the decided merging for these regions. For instance, Figure 7e illustrates inspecting the Voronoi regions and deciding that region1 and region2 should be merged into one region. Consequently, in the next step the merging is applied and a final diagram, which can be used to correctly predict the clustering parameters, is reached. For instance, Figure 7f illustrates the final Voronoi diagram resulting from applying the merging of the previous step.

Figure 7a. n Data Points in the Space
Figure 7b. m = 11 Data Points Chosen into the Sample
Figure 7c. Voronoi Diagram for the m = 11 Sites
Figure 7d. Scattering n Data Points Over m Voronoi Regions
Figure 7e. Inspecting Voronoi Regions
Figure 7f. Final Voronoi Diagram with Merged Regions
Figure 7. Predicting Clustering Parameters Using Voronoi Diagrams.

Using this final diagram, PROD is able to predict input parameters for multiple clustering algorithms. For example, in order to predict the required input parameter for the K-Means algorithm, which is the number of clusters, PROD counts the number of Voronoi regions in this final diagram and predicts the number of clusters for this data set to be equal to this number of Voronoi regions.

In the following subsections, each of the four briefly described steps of PROD is explained in more detail. More specifically, Subsection 4.2 describes how the sampling of the data set is applied and how the resulting sampled data set is constructed. Subsection 4.3 describes how space partitioning is performed and which approach is used. Subsection 4.4 provides an explanation of the steps followed for performing the fine tuning of the partitioned space. Subsection 4.5 describes how the prediction of the parameters is done, followed by an explanation in Subsection 4.6 of how outlier data points can be identified. Finally, the overall complexity of PROD is given in Subsection 4.7.

4.2 Sampling

Many approaches for sampling data sets exist; however, the results of these approaches differ. Hence, the applied sampling approach should be chosen according to the sample that is needed. In PROD, choosing for instance random sampling to sample the given data set may lead to a sample that is not suitable for the next steps of PROD. For instance, taking the data set in Figure 8a and applying random sampling to it, the resulting sampled data set might look as shown in Figure 8b. This sample is not suitable for the following steps of PROD, as it leads to a diagram where much more complex fine tuning is needed in order to achieve a solid and adequate prediction.

Figure 8a. Example Data Set
Figure 8b. Sample Data Set
Figure 8. Random Sampling.

Therefore, it is of huge benefit for the following processing that the data set is sampled by taking the positioning of the data points in the space into consideration. More specifically, the positioning of data points should be analyzed in order to identify areas with a higher concentration of points, i.e., areas with a high density of data points. Then, representative points from each identified density area are chosen to be included in the sample data set. Consequently, the sample represents the density areas existing in the original data set, as it is constructed from the representative points of each area. The identification of these density areas and the choice of the number of representative points from each area is done using a new sampling approach called position-based sampling.


4.2.1 Position-Based Sampling Approach

Since the goal is to identify density areas and to choose representative points from each area, it is important to know the overall positioning of the points and which points are placed close to each other, in order to identify them as a density area and to choose representative points of this area. The main idea of this approach is to construct a sample depending on the positioning of the points in the plane. Given n data points, in order to construct a sample of m < n data points, this approach consists of the following steps: (1) sorting all n data points according to their closeness to each other, (2) identifying density areas, and (3) choosing representative points from each area.

In order to sort the n data points depending on their closeness to each other, the approach uses the Nearest-Neighbor Search (NNS). Applying NNS for a data point p finds the nearest point q to p. Having n data points in the data set, in order to construct a sorted list containing the sorted points, for each point p the approach (1) adds p to the sorted list, (2) applies NNS to identify the nearest point q to p, (3) adds q to the list, and (4) applies NNS for this q in order to identify its nearest point among the remaining set of points. These steps are repeated for all data points, i.e., the nearest point for each of the n data points is identified.

Next, after finding the nearest point for each data point in the data set and sorting them accordingly, density areas can be identified. The essential step in order to identify different areas is to gradually remove data points from the original data set. In order to illustrate the importance of this step, an example of the normal case, i.e., without removing data points from the data set, is described. Consider the data set shown in Figure 9 and start with the point P0; we search for the nearest point of P0. By applying nearest-neighbor search, we find that P1 is the nearest point of P0. We continue by searching for the nearest neighbor of P1, and the search returns P2 as the nearest point of P1. Following the same steps, we now search for the nearest point of P2, and here we find that again P1 is its nearest point. Consequently, we end up looping only between the points P0, P1, and P2 without the ability to move to another set of points.

Figure 9. Sampling Example without Removing Points from Data Set.

On the other hand, by applying the essential step of gradually removing data points from the data set after identifying them as the nearest points of other points, we do not face the problem of looping within one set of points. To illustrate the result of the enhanced approach that removes points from the data set, let us take the same example data set of Figure 9. We start with P0, remove P0 from the data set as shown in Figure 10a, and then apply nearest-neighbor search for this point. The search returns P1 as the nearest point of P0; hence, we remove P1 from the data set as in Figure 10b and then search for its nearest neighbor. We find that P2 is the nearest point of P1; therefore, we remove P2 from the data set and apply the nearest-neighbor search for P2 on the remaining points in Figure 10c. Now the search returns P3 as the nearest point of P2 in the remaining data set, i.e., we have moved to another area and are not stuck in the same set of points as before.

Figure 10a. Removing P0 from the Data Set
Figure 10b. Removing P1 from the Data Set
Figure 10c. Removing P2 from the Data Set
Figure 10. Sampling Example with Removing Points from the Data Set.

Recapitulating how the position-based sampling approach works: given n data points and starting from the point P0, the following steps are performed: (1) the point P0 is added to the beginning of an ordered list with a distance equal to 0, as it is the first point, (2) P0 is removed from the data set, (3) NNS is applied on the remaining n − 1 data points in order to find the nearest point P1 to P0, (4) the Euclidean distance between P0 and its nearest point P1 is computed, (5) P1 is added to the ordered list with its computed distance, (6) P1 is removed from the data set, (7) NNS is applied on the remaining n − 2 data points to find the nearest point P2 to P1, and so on. The position-based sampling repeats these steps until the data set is empty, i.e., until the nearest point of each data point has been found.
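The following Java sketch illustrates this ordering step; it uses a simple O(n²) nearest-neighbor scan instead of the KD-tree based search that PROD actually uses, and the class and method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the position-based ordering: repeatedly remove the current point
 *  and jump to its nearest neighbor among the remaining points.
 *  A brute-force scan stands in for the KD-tree based NNS used by PROD. */
public class OrderingSketch {

    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Returns the points in visiting order; distances.get(i) is the distance
     *  from ordered point i-1 to ordered point i (0 for the first point). */
    static List<double[]> order(List<double[]> points, List<Double> distances) {
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> ordered = new ArrayList<>();
        double[] current = remaining.remove(0);   // start with P0
        ordered.add(current);
        distances.add(0.0);
        while (!remaining.isEmpty()) {
            int best = 0;
            for (int i = 1; i < remaining.size(); i++) {
                if (dist(current, remaining.get(i)) < dist(current, remaining.get(best))) {
                    best = i;
                }
            }
            double d = dist(current, remaining.get(best));
            current = remaining.remove(best);     // remove the nearest neighbor from the data set
            ordered.add(current);
            distances.add(d);
        }
        return ordered;
    }
}
```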

Having a list of data points sorted according to their positioning, with the computed distance between each data point p_i and its successor p_{i+1} stored with each point, the position-based sampling approach continues by checking these computed distances sequentially in order to identify density areas. More precisely, the first point p_0 in the ordered list is considered the start of the first density area, and in case the distance between a point p_i and its successor p_{i+1} exceeds a threshold, p_{i+1} is considered the start of a new, separate density area. This threshold is chosen relative to the overall computed distances; more specifically, the threshold is equal to the average distance. For instance, if p_{i+1} is identified as the first point with a distance greater than the average distance, then p_i is considered the end of the first density area and p_{i+1} the start of the second density area. At this stage, where the start and the end of a density area are identified, i.e., p_0 and p_i respectively, the representative points are chosen.

Identified density areas will have varying densities, i.e., varying numbers of points inside them. Hence, when choosing the number of representative points from each of these density areas, it is necessary to take the number of points contained in the area into consideration in order to construct a sample whose density distribution is similar to that of the original data set, but with fewer points. In order to achieve the latter and to preserve the density distribution, the approach chooses from each density area a number of representative points equal to the logarithm base 10 of the number of points contained in this area, i.e., number of chosen points from the area = log10(number of points in this area). Choosing the number of representative points relative to the number of points contained in each area leads to a better sampling result.
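A minimal sketch of the density-area identification and representative selection described above is given below; it consumes the ordering and distances produced by the previous sketch. The rounding of log10 to a whole number of points and the choice of which points within an area are taken as representatives (here simply the first ones) are assumptions, since the text does not specify them.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of splitting the ordered points into density areas at distances
 *  above the average distance, and picking log10(area size) representatives. */
public class SamplingSketch {

    static List<double[]> sample(List<double[]> ordered, List<Double> distances) {
        // Threshold = average of the computed nearest-neighbor distances.
        double avg = distances.stream().skip(1)
                .mapToDouble(Double::doubleValue).average().orElse(0);

        // Split the ordered list into density areas at gaps larger than the average.
        List<List<double[]>> areas = new ArrayList<>();
        List<double[]> current = new ArrayList<>();
        for (int i = 0; i < ordered.size(); i++) {
            if (i > 0 && distances.get(i) > avg) {   // start of a new density area
                areas.add(current);
                current = new ArrayList<>();
            }
            current.add(ordered.get(i));
        }
        areas.add(current);

        // Choose log10(number of points) representatives per area (rounding assumed).
        List<double[]> chosen = new ArrayList<>();
        for (List<double[]> area : areas) {
            int take = Math.max(1, (int) Math.round(Math.log10(area.size())));
            for (int i = 0; i < take && i < area.size(); i++) {
                chosen.add(area.get(i));
            }
        }
        return chosen;
    }
}
```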


4.2.2 Complexity of Position-Based Sampling

Spatial access methods like R*-trees are used in order to enhance the runtime of the nearest-neighbor search. Given n data points, the complexity of applying NNS using R*-trees is O(log n) for one point and O(n log n) for n points. Since position-based sampling gradually removes data points from the data set, the number of points on which NNS is applied decreases by 1 in each iteration. More precisely, applying NNS for the first point has a complexity of O(log n), whereas applying NNS for the second point has a complexity of O(log(n − 1)), and so on.

Moreover, in order to identify density areas, the approach goes through the list resulting from applying NNS to all points and checks the calculated distances of these points. This check has a complexity of O(n), since the approach goes through all data points once and checks the n computed distances. Consequently, the position-based sampling algorithm has a total worst-case complexity of O(n log n).

4.3 Space Partitioning

Having the sample data set constructed from the m chosen representative points, PROD continues by partitioning the space that contains these m data points by computing a Voronoi diagram. More precisely, in order to partition the space, PROD considers each of these chosen m data points as a site and then draws the Voronoi diagram for these m sites. In this way, the space is partitioned into m Voronoi regions such that each of these regions contains one of the m sites. Figure 7c on page 20 shows an example of a Voronoi diagram drawn for 11 chosen sites.

Thereafter, having the Voronoi diagram with the m regions, and having each of the m chosen points placed into one of these regions as shown in Figure 7c, PROD continues by scattering all of the other n − m data points of the original data set over this partitioned space. With all of the n data points scattered over the partitioned space as shown in Figure 7d, each of the m data points (the previously chosen sites) is placed into its own region, whereas each of the n − m data points is placed in one of three possible ways: either the data point is contained in exactly one Voronoi region, or it is placed on a Voronoi vertex, or it lies on a Voronoi edge.

There exist many algorithms that can be used to compute the Voronoi diagram for a set of m data points. One of the most commonly used algorithms is Fortune's algorithm [30]. This is a sweep-line algorithm that generates the Voronoi diagram of a set of m data points in the plane with a complexity of O(m log m).

4.4 Fine Tuning

The essential idea of PROD lies in (1) scattering all of the n data points over the partitioned space, i.e., over the m regions, (2) inspecting the Voronoi regions and examining the scattering of data points in order to decide which regions should be merged together, and finally (3) merging regions when necessary in order to end up with a diagram which can be used to correctly predict the clustering parameters.

Subsection 4.4.1 describes how the scattering of all data points on the partitioned space is done. Subsection 4.4.2 introduces the inspection approach applied to the Voronoi regions. Next, Subsection 4.4.3 shows how the necessary merging is performed. Finally, the complexity of the fine tuning step is described in Subsection 4.4.4.

4.4.1 Scattering the Data Points

PROD starts by scattering all n − m data points over the partitioned space, i.e., over the previously computed Voronoi diagram. While scattering, each data point is assigned to one of the Voronoi regions. Since each Voronoi region contains the data points which are closer to the site of this region than to any other site, PROD applies, for each of the n − m data points, the nearest-neighbor search with respect to the m sites. More precisely, in order to assign each of the n − m data points, PROD finds the nearest site to this data point among all of the m sites and then assigns the data point to the region of this site.
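A minimal sketch of this scattering step is shown below; a brute-force nearest-site scan is used instead of the KD-tree based search, and the small Region class is a simplified stand-in introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the scattering step: each data point is assigned to the
 *  Voronoi region whose site is closest to it. */
public class ScatteringSketch {

    static class Region {
        double[] site;
        List<double[]> points = new ArrayList<>();
        Region(double[] site) { this.site = site; }
    }

    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    static void scatter(List<double[]> points, List<Region> regions) {
        for (double[] p : points) {
            Region nearest = regions.get(0);
            for (Region r : regions) {            // brute-force nearest-site search
                if (dist(p, r.site) < dist(p, nearest.site)) {
                    nearest = r;
                }
            }
            nearest.points.add(p);                 // assign p to its nearest site's region
        }
    }
}
```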

4.4.2 Inspection of Voronoi Regions

In general, the inspection approach performs the following main steps: (1) it computes the necessary distances, (2) it checks these distances in order to decide whether there exist data points in different regions which are close to each other, and (3) it decides whether to merge the regions accordingly.

The inspection of any two regions is mainly based on checking the closeness of the data points contained in one Voronoi region to the data points contained in the other region. This closeness is measured by the distances between these data points and the Voronoi edge bordering the two inspected regions. More specifically, these distances are compared to a specific threshold, and if they satisfy a defined condition with respect to this threshold, then the two regions are merged. This threshold is set to the same threshold used in the position-based sampling step.

For each Voronoi edge, the approach inspects the two neighboring Voronoi regions of this edge in order to decide whether these two regions should be merged. The decision to merge two regions is taken if both regions contain data points which are placed close to the edge bordering these two regions. The closeness of these data points is checked against a specific threshold which is chosen relative to the overall positioning of data points in the space.

More precisely, in order to decide whether two regions should be merged, the approach proceeds as follows: (1) it sets the threshold distance equal to the threshold used in the sampling step, (2) it searches for data points in region1 with distance ≤ threshold to the bordering edge of these two regions, and (3) analogously to region1, it searches for data points in region2 with distance ≤ threshold to the bordering edge. If and only if the approach finds points satisfying the condition in both region1 and region2, the decision to merge these two regions is taken.
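The sketch below illustrates this merge decision for one bordering edge; representing a region as a plain list of points, an edge by its two endpoints, and the point-to-segment distance helper are assumptions made for illustration.

```java
import java.util.List;

/** Sketch of the merge decision for one bordering Voronoi edge: the two
 *  neighboring regions are merged only if BOTH contain at least one point
 *  whose distance to the edge is at most the sampling threshold. */
public class InspectionSketch {

    /** Distance from point p to the segment (a, b) representing a Voronoi edge. */
    static double distToEdge(double[] p, double[] a, double[] b) {
        double vx = b[0] - a[0], vy = b[1] - a[1];
        double wx = p[0] - a[0], wy = p[1] - a[1];
        double t = Math.max(0, Math.min(1, (wx * vx + wy * vy) / (vx * vx + vy * vy)));
        double dx = p[0] - (a[0] + t * vx), dy = p[1] - (a[1] + t * vy);
        return Math.sqrt(dx * dx + dy * dy);
    }

    static boolean hasPointNearEdge(List<double[]> points, double[] a, double[] b,
                                    double threshold) {
        for (double[] p : points) {
            if (distToEdge(p, a, b) <= threshold) return true;
        }
        return false;
    }

    /** Merge decision: both neighboring regions must satisfy the condition. */
    static boolean shouldMerge(List<double[]> region1, List<double[]> region2,
                               double[] edgeStart, double[] edgeEnd, double threshold) {
        return hasPointNearEdge(region1, edgeStart, edgeEnd, threshold)
            && hasPointNearEdge(region2, edgeStart, edgeEnd, threshold);
    }
}
```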

Moreover, one case that can occur while inspecting two Voronoi regions is that data points satisfy the condition in only one of the two regions. More precisely, if region1 and region2 are the two inspected regions, and the approach finds data points in region1 with distance ≤ threshold to the bordering edge but does not find any such data points in region2, then no merging is applied for these two regions, since region2 does not satisfy the condition.

Figure 7e on page 20 shows an example where the red data points from both regions, region1 and region2, are positioned close to the bordering edge of these two regions. More precisely, each of these red data points has a distance ≤ threshold to the bordering edge of these two regions. Consequently, region1 and region2 are merged. If the condition were satisfied for only one of these two regions, the regions would not be merged.

4.4.3 Merging Voronoi Regions

The merging approach is direct and simple, i.e., the Voronoi regions for which the merging decision was taken are merged together directly. More precisely, the merging is simply done by constructing a new Voronoi region that contains all the data points which were contained in each of the merged regions. Finally, the new surrounding edges of this newly constructed region are computed.

Figure 7e on page 20 shows an example of two regions, region1 and region2, for which the merging decision was taken in the inspection step. Therefore, in the merging step these two regions are merged into one region. Figure 7f illustrates the final result after applying the merging.

4.4.4 Complexity of the Fine Tuning

Scattering the n data points and assigning each data point to one Voronoi region is done by applying nearest-neighbor search for each data point. The complexity of applying NNS for a data point is O(log n), and since this is done for each of the n data points, the complexity of the scattering step is O(n log n).

Inspecting the Voronoi regions and deciding whether to merge regions or not requires the following two main steps: (1) take the threshold used in the sampling step and set the merging threshold equal to it, and (2) for each edge, inspect the two neighboring regions of this edge. Regarding the second step, for each of the inspected regions the approach searches for points whose distance to the bordering edge is less than or equal to the defined threshold. In order to compute the distance between a point and the bordering edge, the equation of the line representing this edge is required. Therefore, in order to find points which are positioned close to the bordering edge, the approach proceeds in two steps: first, it determines the line equation of this bordering edge; second, it computes the distance between the points in each region and this bordering edge. More precisely, for each bordering edge and its two neighboring regions region1 and region2, the complexity of inspecting these two regions is O(m) + O(j), where m is the number of points in region1 and j is the number of points in region2. Since the approach performs these steps for each bordering edge, and taking m as the largest number of points in a region, the total complexity of the inspection process is O(e ∗ m), where e is the number of all bordering edges.
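As a small illustration of the distance computation mentioned above, the following sketch derives the line equation ax + by + c = 0 from the two endpoints of a Voronoi edge and evaluates the point-to-line distance; this complements the point-to-segment variant shown earlier and is again only an illustrative assumption about how edges are represented.

```java
/** Sketch of the point-to-line distance used during region inspection:
 *  the bordering edge (x1, y1)-(x2, y2) is turned into the line
 *  a*x + b*y + c = 0, and the distance of a point to that line is evaluated. */
public class EdgeDistanceSketch {

    static double distanceToLine(double px, double py,
                                 double x1, double y1, double x2, double y2) {
        double a = y2 - y1;                 // line coefficients from the two endpoints
        double b = x1 - x2;
        double c = x2 * y1 - x1 * y2;
        return Math.abs(a * px + b * py + c) / Math.sqrt(a * a + b * b);
    }

    public static void main(String[] args) {
        // Distance of point (3, 4) to the horizontal edge from (0, 0) to (10, 0) is 4.
        System.out.println(distanceToLine(3, 4, 0, 0, 10, 0));
    }
}
```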

Finally, merging regions is done by constructing a new region and adding to it all data points from region1 as well as all data points from region2. For each set of merged regions, the complexity is only O(1). Therefore, having r merged regions, the complexity of the merging step is O(r). Consequently, the complexity of the overall fine tuning step is O(e ∗ m).

4.5 Predicting Clustering Parameters

Having the final Voronoi diagram in which some regions are merged, PROD uses this diagram to predict the input parameters for the three clustering algorithms K-Means, DBSCAN, and OPTICS. PROD considers each of the regions in the final Voronoi diagram as a cluster and performs the prediction of the clustering parameters accordingly.

4.5.1 Predicting Input Parameter for K-Means Algorithm

For the K-Means algorithm, the required parameter is k, the number of clusters. This k is chosen as the number of regions in the final Voronoi diagram. Moreover, providing the K-Means algorithm with a good initialization of the clusters' centroids improves the performance of the algorithm by decreasing the overall runtime. Since PROD considers each region in the final Voronoi diagram as a cluster, the centroids of these regions are computed and provided by PROD as a prediction of the initial centroids for the K-Means algorithm. However, before counting the regions and computing the centroid of each region, PROD starts by identifying the outlier data points, which are not taken into consideration when predicting the number of clusters and their centroids.
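The sketch below shows the essence of this prediction, assuming outlier regions have already been removed: k is the number of remaining regions and each region's centroid is the mean of its points. The list-of-point-arrays representation is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the K-Means parameter prediction: k = number of (non-outlier)
 *  regions in the final Voronoi diagram, initial centroids = region means. */
public class KMeansPredictionSketch {

    static double[] centroid(List<double[]> regionPoints) {
        double sx = 0, sy = 0;
        for (double[] p : regionPoints) { sx += p[0]; sy += p[1]; }
        return new double[]{ sx / regionPoints.size(), sy / regionPoints.size() };
    }

    /** regions: one list of points per non-outlier region of the final diagram. */
    static List<double[]> predictCentroids(List<List<double[]>> regions) {
        List<double[]> centroids = new ArrayList<>();
        for (List<double[]> region : regions) {
            centroids.add(centroid(region));
        }
        // The predicted k is simply centroids.size(), i.e., the number of regions.
        return centroids;
    }
}
```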

4.5.2 Predicting Input Parameters for Density-Based Clustering Algorithms

PROD continues by predicting the parameters for the two density-based clustering algorithms OPTICS and DBSCAN. More specifically, the OPTICS algorithm needs MinPts as a parameter. PROD finds the region that contains the minimum number of points and sets MinPts equal to that number. Furthermore, the DBSCAN algorithm requires, in addition to MinPts, the parameter ε. Using the final Voronoi diagram, PROD finds the largest distance between a point and its site and sets ε equal to this distance. Here, too, PROD has to identify the outlier data points so that they are not taken into consideration when predicting MinPts and ε.
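Analogously, a minimal sketch of the MinPts and ε prediction is given below, again assuming outlier regions have already been excluded; modeling the stored point-to-site distances as a simple array is an assumption for illustration.

```java
import java.util.List;

/** Sketch of the density-based parameter prediction:
 *  MinPts = size of the smallest (non-outlier) region,
 *  epsilon = largest distance between a point and the site of its region. */
public class DensityPredictionSketch {

    static int predictMinPts(List<List<double[]>> regions) {
        int min = Integer.MAX_VALUE;
        for (List<double[]> region : regions) {
            min = Math.min(min, region.size());
        }
        return min;
    }

    /** pointToSiteDistances: the precomputed distance of every point to its site. */
    static double predictEpsilon(double[] pointToSiteDistances) {
        double max = 0;
        for (double d : pointToSiteDistances) {
            max = Math.max(max, d);
        }
        return max;
    }
}
```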

4.5.3 Complexity of the Prediction

Predicting the number of clusters is done directly, since this information is readily available, i.e., the complexity is O(1). Moreover, in order to provide the initial centroids, the mean point of each region has to be computed. Having r regions in the final Voronoi diagram, the complexity of computing the mean points of all regions is O(r).

In order to predict the MinPts parameter for DBSCAN and OPTICS, the approach needs to find the minimum number of points in a region. This information is directly available for each region, and since the approach needs to check it for all regions in order to find the smallest number, the complexity is O(r).

Furthermore, in order to predict the remaining parameter ε, the approach needs to find the point with the largest distance to its site. The distance between each point and its site has been computed previously and stored with each point; hence, the approach only needs to go through all points and check their stored distances in order to find the largest one. Having n data points in the data set, the complexity of finding the largest distance is O(n). In total, the prediction process has linear complexity.

4.6 Identifying Outliers

Having the final Voronoi diagram, PROD can identify outliers from this final diagram. Two different approaches for this identification of outliers were evaluated. The first approach inspects the final Voronoi diagram, and whenever it finds a region with only one point lying inside, it considers this point an outlier. However, this has the pitfall that regions with just a few points should arguably also be treated as outlier regions. Hence, it is important to regard the regions in relation to each other. To illustrate the importance of taking this information into consideration, consider the final Voronoi diagram for a data set as shown in Figure 11. In this example, there are two red regions: region1 with approximately 200 points inside it, and region2 with only 2 points inside. Following the first approach for identifying outliers, no outliers are detected in these two red regions, i.e., each of the two red regions would be considered a cluster, which is obviously the wrong conclusion.

The second evaluated approach for identifying outliers, on the other hand, takes into consideration the number of points lying in each region of the final Voronoi diagram. More specifically, PROD searches for the region with the maximum number of points inside. Given this maximum number of points, PROD considers the points inside a region as outliers if that region contains fewer points than ten percent of this maximum number.
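A minimal sketch of this second, relative outlier criterion is shown below; the list-based region representation is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the relative outlier criterion: a region whose number of points
 *  is below 10% of the largest region's size contains only outliers. */
public class OutlierSketch {

    /** Returns the indices of the regions that are considered outlier regions. */
    static List<Integer> outlierRegions(List<List<double[]>> regions) {
        int maxPoints = 0;
        for (List<double[]> region : regions) {
            maxPoints = Math.max(maxPoints, region.size());
        }
        double cutoff = 0.1 * maxPoints;       // ten percent of the largest region
        List<Integer> outliers = new ArrayList<>();
        for (int i = 0; i < regions.size(); i++) {
            if (regions.get(i).size() < cutoff) {
                outliers.add(i);
            }
        }
        return outliers;
    }
}
```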

Figure 11. Example on Identifying Outliers in the Final Voronoi Diagram

For instance, taking the same example in Figure 11 and assuming that region1 has the maximum number of points among all regions, i.e., maximum-number = 200, any region with fewer points than 10% of this maximum-number, i.e., fewer than 20 points, is considered to contain only outliers. Consequently, by following this approach, the points inside region2 are identified as outliers.

4.7 The Overall Complexity of PROD

Determining the complexity of PROD requires studying the complexity of each of the main steps of this approach. Given n data points in the data set, e bordering edges in the first Voronoi diagram, r Voronoi regions in the final Voronoi diagram, and m as the maximum number of points in a region, the complexity is computed as follows. First, the complexity of the position-based sampling, as described in Subsection 4.2, is O(n log n). Second, drawing the Voronoi diagram using an algorithm such as the sweep-line algorithm can be achieved with a complexity of O(n log n). Third, the inspection and merging procedure, as described in Subsection 4.4, has a complexity of O(e ∗ m). Furthermore, performing the prediction of the clustering parameters k, MinPts, and ε has linear complexity, as described in Subsection 4.5. Finally, identifying the outliers requires checking the number of points in all regions, i.e., it has a complexity of O(r). Hence, the overall complexity of PROD is O(e ∗ m).


5. Implementation

In order to conduct experiments and to evaluate the newly described PROD approach, it was implemented in Java version 9.0.1. The result of each step in PROD was drawn and visualized using Python version 3.6.4; this visualization was implemented using IPython Jupyter Notebook. All experiments were run on a notebook with an Intel® Core™ i7-4500U CPU @ 1.80 GHz (2.40 GHz), 8 GB RAM, running Windows 10 Education.

In Subsection 5.1, the class diagram is introduced and the benefits of this class structure are pointed out. Subsection 5.2 describes the libraries used, which provide the implementation of some steps of PROD. Subsection 5.3 briefly describes the visualization implemented in Python. Finally, Subsection 5.4 provides a brief description of the data sets used in all experiments.

5.1 Class Diagram

The implementation of PROD is divided into six main packages. Each step of PROD is implemented in a separate package. More specifically, the Nearest-Neighbor Search is implemented in a package called kdtnearest. Furthermore, the sampling, space partitioning, fine tuning, and the final prediction are implemented in the separate packages sampling, spacePartitioning, fineTuning, and prediction, respectively. The sixth package, called com.mavenproject.predictingparams, is the main package which imports the other five packages in order to control the workflow of PROD. Figure 12 illustrates the package diagram as described above.

Figure 12. Package Diagram

Each package contains the corresponding Java interfaces and classes which provide the functionality of the respective package. The subsequent subsections introduce the class diagram of each of these six packages. The main package com.mavenproject.predictingparams is described in the final Subsection 5.1.6, where the benefits of this package design and of the corresponding class designs are pointed out.


5.1.1 The kdtnearest Package

The package kdtnearest, shown in Figure 13, contains the implementation of finding the nearest neighbor for each data point in the data set. More precisely, three classes are contained in this package: the class KDTNearest, the class LocationKDTree, and the class DataPoint. The class DataPoint represents a data point of the data set. Each data point is identified by its x coordinate and y coordinate. Furthermore, each data point has a specific distance, called dist, which here represents the distance to its nearest neighbor as calculated by the function euclideanDistance(). In later steps, using the same function euclideanDistance(), dist represents the distance between this data point and the site of the Voronoi region to which the data point belongs.

The class LocationKDTree is provided by an open-source implementation called Code of Thrones [45]. This class provides the functionality of finding the nearest neighbor of a given data point by applying the Nearest-Neighbor Search on the set of all data points. This functionality is exposed through the function findNearest(), which takes the x-coordinate and the y-coordinate of the data point. Furthermore, the function kdTreeNearestNeighbor(), contained in the class KDTNearest, calls the findNearest() function for each data point in the data set. Since data points are removed gradually from the data set as described in Subsection 4.2.1, the function removePointFromList() removes a data point from the list of all data points.

Figure 13. Class Diagram of the kdtnearest Package

5.1.2 The sampling Package

The sampling package is responsible for sampling the data set. More precisely, this package contains one interface called SamplingInterface which declares the main function needed for performing the sampling. This function is called sampling() and provides the actual sampling functionality.

Furthermore, the sampling package contains the class PositionBasedSampling, which implements the above described interface with respect to the position-based sampling used by PROD. More precisely, this class contains the function thresholdCalc(), which calculates the threshold used for identifying density areas. More specifically, this function computes the average of all calculated distances of data points and then sets the threshold, i.e., samplingThreshold, equal to this average distance. Moreover, the implemented function sampling() performs the position-based sampling as described in Subsection 4.2.1 by identifying density areas and storing them in the list densityAreas, then choosing a specific number of data points from each identified density area, and finally adding the chosen points to the list chosenPoints. Figure 14 illustrates the class diagram of the sampling package.

Figure 14. Class Diagram of the sampling Package

5.1.3 The spacePartitioning Package

The spacePartitioning package, as illustrated in Figure 15, contains one interface called SpacePartitioning which declares the main function needed for partitioning the space. This function is called spacePartitioning() and provides the actual functionality.

Furthermore, this package contains two classes: the class VoronoiDiagram and the class Region. The class Region defines the attributes of each Voronoi region. More specifically, each Voronoi region contains: (1) a list of data points, called points, contained in this region, (2) a list of Voronoi edges, called polyEdges, defining this region, (3) the site of this region, and (4) an index identifying the region, called siteIdx. The class Region contains a function called removeDuplicates() which removes any duplicated data points or duplicated edges by keeping only one instance of the duplicated objects.

Furthermore, the class VoronoiDiagram implements the interface SpacePartitioning, so that it provides the functionality of partitioning the space by computing Voronoi diagrams. The computation of the Voronoi diagram is done by calling the corresponding functionality of a library called Simple Voronoi [35]. This functionality is provided by the function computeVoronoi(), which takes as input an array of the x-coordinates of all data points, an array of the y-coordinates of all data points, the minimum and the maximum x-coordinate among all data points, and similarly the minimum and the maximum y-coordinate. Using this input, the library computes the Voronoi diagram and returns a list of all computed Voronoi edges. These computed Voronoi edges are stored in the list allEdges.


The next step performed by the VoronoiDiagram class is constructing the Voronoi regions from the computed Voronoi edges. This is done by the function constructRegion(), which uses the list allEdges, containing the Voronoi diagram, and constructs a list of all Voronoi regions contained in the diagram. The constructed Voronoi regions are stored in a list called regions. Moreover, the function writeEdgesToCSV() is responsible for writing the computed Voronoi edges allEdges into a CSV file so that they can be visualized later using Python.

Figure 15. Class Diagram of the voronoi Package

5.1.4 The fineTuning Package

The fineTuning package, as shown in Figure 16, contains one interface called FineTuning which declares the main function needed for fine tuning the partitioned space. This function is called RegionsFineTuning() and provides the actual functionality.

Furthermore, this package contains two classes: the class RegionsInspectionAndMerging, which implements the above described interface with respect to the inspection and merging method used by PROD, and the class MergedRegions. The latter class, MergedRegions, represents an identified set of Voronoi regions which should be merged into one region. Each identified set of regions that should be merged is stored together in a list of regions. The sites' indexes of those merged regions are also stored in a list of Integers called sites. Each list of regions which should be merged together is identified using the index attribute.

The class RegionsInspectionAndMerging provides the functionality needed to perform the inspection and the merging of Voronoi regions. More specifically, for each Voronoi edge, the function RegionsFineTuning() performs the following steps. First, it finds the neighboring Voronoi regions, i.e., leftRegion and rightRegion, by calling the function findNeighboringRegions(). Second, it inspects these two neighboring regions in order to decide whether they should be merged, by calling the function inspectRegions(). Third, in case the decision to merge these two regions is taken, it calls the function addToMergeList(), which adds these regions to an element of the list regionsToMerge. Fourth, after finishing the inspection of the neighboring regions of every Voronoi edge, the function RegionsFineTuning() continues by calling the function merge(), which performs the actual merging for each element of the list regionsToMerge, where each element of this list is a MergedRegions object containing the regions that should be merged together.

After merging each MergedRegions element, the newly resulting region is added to the list of all new regions, called newRegions, with a newly assigned index. Finally, some of the Voronoi regions are not merged with any other region; therefore, these unmerged regions must also be added to the new list of regions. The addition of these unmerged Voronoi regions is done by calling the function addUnmergedRegions().

Figure 16. Class Diagram of the fineTuning Package

5.1.5 The prediction Package

The prediction package, as illustrated in Figure 17, contains one interface called PredictionInterface which declares the main function needed for predicting the parameters. This function is called predictParams() and provides the actual functionality.

Furthermore, this package contains only one class, called Prediction. This class implements the above described interface and is responsible for performing the prediction of the parameters using the resulting Voronoi diagram. The function predictParams() controls the flow of the prediction through the following steps. First, it calls the function findOutlierRegion(), which identifies the regions that contain only outlier data points. This function performs the identification by determining a threshold called removedNumOfPoints, which represents the number of data points a region may contain while still being considered to hold only outliers. Second, the function predictParams() continues by choosing all other actual regions, i.e., all regions that contain a number of data points above the threshold removedNumOfPoints. Third, it performs the prediction of k according to the chosen actual regions, which are considered as clusters. More precisely, based on these chosen actual regions, it calls the function calcClustersCentroids(), which calculates the centroid of each actual region, i.e., of each cluster, and then it predicts the input parameter k. Finally, the function predictParams() calls the function predictDensityBasedParams(), which is dedicated to performing the prediction of MinPts and epsilon.

Figure 17. Class Diagram of the prediction Package

5.1.6 The com.mavenproject.predictingparams Package

The com.mavenproject.predictingparams package, as shown in Figure 18, consists of only one runnable class, called MainClass. This class controls the whole workflow of all steps of PROD and contains the main() function. In order to control the workflow of PROD correctly, the main function performs the following steps. First, it lets the user enter the name of the file containing the data set to which PROD will be applied. Second, it calls the function readDataPoints(), which reads the data set contained in the specified file and adds all data points to a list of DataPoint objects. Third, it continues by calling the functions corresponding to each step of PROD.

More specifically, the main function continues by calling the following functions: (1) the function kdTreeNearestNeighbor() contained in the package kdtnearest, which performs the first step of PROD, (2) the function sampling() contained in the package sampling, in order to sample the data set, (3) the function spacePartitioning() contained in the package spacePartitioning, which computes the Voronoi diagram for the sampled data set, (4) the function RegionsFineTuning() contained in the fineTuning package, which is responsible for inspecting and merging the Voronoi regions, and finally (5) the function predictParams() contained in the package prediction, in order to produce the prediction of the parameters.

Furthermore, the function writeToCSV() writes the result of each step performed by PROD into a separate CSV file so that it can be visualized later. For instance, after performing the sampling, this function writes the chosenPoints into a CSV file in order to visualize the result of the position-based sampling. Finally, as a last step, the main function calls the writePredictions() function, which writes the produced prediction of the input parameters k, MinPts, and epsilon (ε) to a CSV file.


Figure 18. Class Diagram of the com.mavenproject.predictingparams Package

Obviously, the design shown in the package diagram in Figure 12 on page 30 and described in detail in the previous subsections has several benefits. One of the main advantages of this design is its maintainability. More precisely, in case one of the steps performed by PROD has to be changed or modified, this leads only to local changes in the corresponding class of the respective package, without affecting any other package. Furthermore, the simplicity of the design is another important advantage. This simplicity comes from the fact that the implementation of each step of PROD is encapsulated behind an interface, i.e., each interface is responsible for performing only one step of PROD.

Another important advantage is the scalability and extensibility of the design. More precisely, using another space partitioning approach, e.g., Delaunay tessellation, only requires adding a new implementation of the SpacePartitioning interface, i.e., adding a single new class in the package spacePartitioning which implements the respective interface with respect to Delaunay tessellation. Similarly, using another sampling approach or another fine tuning approach only requires adding a new implementation of the corresponding interface in the respective package. Additionally, these modifications require changing only the class instantiation in the MainClass, i.e., changing only one line of code. For instance, adding a new class providing a new implementation of the SamplingInterface requires instantiating in MainClass an object of this new class instead of an object of the PositionBasedSampling class. The addition of a new class is easy due to the simplicity of the design.

5.2 Libraries

PROD uses some existing libraries as a basis for its processing steps. More specifically, in order to perform the sampling step, it is necessary to implement or to use an open-source implementation of the Nearest-Neighbor Search. Moreover, having the sampled data set, PROD continues by drawing the Voronoi diagram for the sampled data set. Therefore, a library for computing the Voronoi diagram of a set of data points is used.

Furthermore, after PROD has finished and the prediction of the input parameters is available for each data set, each of the K-Means, DBSCAN, and OPTICS algorithms is applied with the respective predicted input parameters. Therefore, a library providing an implementation of each of these three clustering algorithms is used. Finally, three of the six previous prediction approaches described in Subsection 3.2 were implemented in order to compare PROD to each of them. These three chosen approaches are: Improved K-Means, Silhouette Method, and Elbow Method.

Subsection 5.2.1 briefly describes the open-source implementation used for the Nearest-Neighbor Search. Subsection 5.2.2 introduces the library used for computing the Voronoi diagram. Subsection 5.2.3 describes the library which provides the mining algorithms. Finally, the implementation of the three previous prediction approaches is described in Subsection 5.2.4.

5.2.1 Open Source Implementation of Nearest-Neighbor Search

Many open-source implementations of NNS exist. Some of these implementations use a special data structure to enhance the performance of finding the nearest neighbor of a data point. The open-source implementation used in the sampling step of PROD is called Code of Thrones [45]. This implementation uses a KD-tree as the data structure for storing the data points. Using this data structure enhances the performance of the nearest-neighbor search, which makes this open-source implementation a good candidate for use in PROD.

5.2.2 Voronoi Diagram Library

There are many algorithms for computing the Voronoi diagram of a set of data points. The most commonly used algorithm is Fortune's algorithm [30]. Therefore, this algorithm was chosen for computing the Voronoi diagram of the sampled data set in PROD. Several Java libraries for computing the Voronoi diagram were found, and each of them was tested in PROD [51, 52, 53, 54, 55]. However, these libraries produced unsatisfying results. More precisely, the subsequent steps performed by PROD require well-defined Voronoi regions as the result of the space partitioning step, but none of these libraries produced this result as expected by PROD. Furthermore, the source code of these libraries could not be modified accordingly, because they lack the documentation necessary to understand and change the code. On the other hand, a project called Simple Voronoi [34] provides Java code for computing the Voronoi diagram using Fortune's algorithm. This project also did not produce the expected result, i.e., it did not produce defined Voronoi regions but only computed the Voronoi edges. However, constructing the Voronoi regions from the computed Voronoi edges can be done directly. Therefore, this project was chosen to be used by PROD for the space partitioning step.

The Simple Voronoi project is written in Java and is hosted on SourceForge [35]. This library provides a function called computeVoronoi which takes the list of data points as input, computes the Voronoi diagram for these data points, and finally returns all Voronoi edges of the computed diagram. The main issue is that in the further steps performed by PROD it is necessary to have the Voronoi regions defined in order to be able to decide which data points are contained in which regions. Therefore, since the result of the library consists only of the Voronoi edges, a construction of the Voronoi regions was implemented inside PROD in order to complete the result of the Simple Voronoi library. The code provided by the library was not modified; instead, the construction of the regions was implemented inside PROD.


This fact somewhat increased the execution runtime of PROD. This added overhead is not inherent to PROD; it is caused by the incomplete result provided by the Simple Voronoi library. More elaborate libraries may overcome this pitfall.

5.2.3 Mining Algorithms Library

Each of the three clustering algorithms K-Means, DBSCAN, and OPTICS has to be applied with the respective predicted input parameters. Therefore, a library providing an implementation of each of these three algorithms is used. Many machine learning libraries exist, each providing a certain number of mining algorithms [56, 57, 58]. One important library in the machine learning area, specialized in pattern mining, is SPMF [36]. SPMF is an open-source data mining library written in Java which provides implementations of 150 different mining algorithms covering various mining techniques. This library is one of the simplest libraries used in the machine learning area; it has satisfying documentation and a large number of test cases for testing and applying each of the mining algorithms. More precisely, the library provides several data sets for each mining algorithm in order to test and apply the algorithm on the respective data sets. Moreover, the library also allows modifying the source code in order to apply these mining algorithms to other data sets. All of these aspects make this library a good candidate for use by PROD. SPMF version v2.30c is the version currently used.

The implementations of the K-Means, DBSCAN, and OPTICS algorithms are also provided by SPMF. In order to apply each of these three algorithms, some changes were made to the code of each of the three implementations. Subsection 5.2.3.1 describes the modified code for both DBSCAN and OPTICS. Similarly, Subsection 5.2.3.2 describes the modified code for the K-Means algorithm. Furthermore, some additional code that encapsulates these implementations had to be added to the SPMF library; Subsection 5.2.3.3 describes this additional code.

5.2.3.1 Modified Code in DBSCAN and OPTICS Algorithms

Each mining algorithm is provided in a separate Java file containing a separate runnable main class. The first step of the modification was to change the runnable main class into a normal class containing a function that can be called from outside this Java file. Furthermore, the SPMF library provides example data sets which are used as input for each of the mining algorithms. More precisely, the input of each mining algorithm was fixed to one of these example data sets. Since we have chosen the data sets to which the mining algorithms should be applied, the second modification made to the code was to enable the user to enter the name of the input data set.

Finally, each mining algorithm performs the clustering and writes the resulting clusters into a separate file whose format is chosen by the algorithm. Since all results are visualized using Python, the third modification was to write the resulting clusters into a separate file in the specific format required by the Python visualization code.

5.2.3.2 Modified Code in K-Means Algorithm

Similar to DBSCAN and OPTICS, K-Means is provided in a separate Java file containing a runnable main class. Therefore, the runnable class was changed into a normal class containing a function that can be called from outside this file. Moreover, the algorithm was changed to let the user enter the input data set. Likewise, writing the resulting clusters in the required format is also part of the changes made to the K-Means algorithm.

Moreover, the K-Means algorithm randomly generates initial centroids in each run, whereas PROD provides a prediction of the initial centroids for each data set. For the evaluation, it is necessary to be able to apply the K-Means algorithm either with these predicted centroids or with randomly initialized centroids. Therefore, the algorithm was modified so that the user can choose between applying K-Means with randomly generated centroids and applying K-Means with the centroids predicted by PROD. If the latter option is chosen, the CSV file produced by PROD, which contains the predicted centroids, is read.

5.2.3.3 Additional Code Encapsulating Implementations of Algorithms

Normally, the SPMF library provides the different mining algorithms as separate runnable Java files. When the user wants to apply a specific mining algorithm, the respective Java file is executed. However, since the code of the K-Means, DBSCAN, and OPTICS algorithms was turned into functions that can be called, a new Java class that is able to call each of the three functions is necessary.

More clearly, a new package called applyMiningAlgorithms was added to the SPMF library. This package contains one runnable main class called ApplyAlgos. This class offers the user three main choices: (1) apply the K-Means algorithm, (2) apply the DBSCAN algorithm, or (3) apply the OPTICS algorithm. For each choice, the class continues by letting the user enter the respective predicted input parameters. More specifically, if the user chooses to apply K-Means, the code lets the user enter the number of clusters and provides the two options regarding the initial centroids. If the user chooses DBSCAN, the code lets the user enter the two parameters MinPts and ε. Similarly, when choosing OPTICS, the user can specify the input parameter ε.

Moreover, these three clustering algorithms require the input data set to be provided in a separate file with a specific format; this format is necessary for the file to be readable by the algorithms. The required format is simply a specific syntax for writing the input. For example, a name is indicated for each instance in the file by putting the keyword @NAME before the instance and then writing the desired name. Since each of the twenty-eight files containing the chosen data sets has a different format than the one required by the algorithms, a transformation of the format of each of these files is needed. This format transformation is provided by a newly added class called PrepareTextFilesMiningAlgos. This class is also included in the package applyMiningAlgorithms; it takes the original file containing the data set as input and produces a new file with the required format.
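The class PrepareTextFilesMiningAlgos itself is written in Java as part of the SPMF project; the following Python sketch only illustrates the kind of transformation it performs. The output syntax (the separator after @NAME and the coordinate separator) is an assumption made for illustration, not the verified SPMF syntax.

```python
import csv

def prepare_spmf_input(csv_path, out_path):
    """Illustrative sketch: convert a two-column CSV data set (x, y) into a
    text file in which every instance is preceded by an @NAME line."""
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        for i, row in enumerate(csv.reader(src)):
            if not row:
                continue
            dst.write(f"@NAME=instance{i}\n")      # name of the instance (assumed syntax)
            dst.write(" ".join(row[:2]) + "\n")    # the instance itself
```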

5.2.4 Implementation of Previous Prediction Approaches

One part of the evaluation of PROD, as described later in Section 6, consists of comparing this new approach, with respect to some specific aspects, to the previous prediction approaches. Three out of the six previous prediction approaches, namely Improved K-Means, the Silhouette Method, and the Elbow Method, were chosen for this evaluation. The other three approaches, Enhanced K-Means, the Global K-Means, and the Information-Theoretic Approach, were not included in the comparison for various reasons. More specifically, the Enhanced K-Means clustering algorithm is mainly based on a heuristic and cannot be proven to work in all cases [19]; therefore, it was excluded from subsequent comparisons. Furthermore, the main focus of the Global K-Means is on improving the process of finding the centroids of the clusters, and only while focusing on this aspect does it try to find a good amount of clusters [23]. Its main goal thus differs from the main goal of PROD, which is sufficient reason to exclude this approach from later comparisons as well. Finally, the Information-Theoretic approach relies on information-theoretic criteria to find the best number of clusters, and these criteria are relatively complex [24]; it was therefore also excluded from subsequent evaluations due to its high complexity. On the other hand, the main goal of the three remaining approaches, Improved K-Means, Silhouette Method, and Elbow Method, is finding the best amount of clusters, which is the same goal as that of PROD. Moreover, they are all mainly based on performing proven measurements and applying K-Means multiple times in order to achieve the best value of k [20, 24, 25, 26]. Consequently, these three approaches were chosen to be implemented and used in all subsequent comparisons.

Each of these three approaches was implemented in Java. Since each of them mainly depends on applying the original K-Means algorithm, the implementation of each approach is added to the SPMF library as a separate package. More specifically, a new package called improvedkmeans is added to the SPMF library. This package contains a class called ImprovedKMeans, which is responsible for (1) reading the name of the input data set to which the Improved K-Means algorithm is applied, and (2) applying the Improved K-Means algorithm. Furthermore, the Silhouette method is implemented by adding a new package called silhouettemethod. This package contains the class SilhouetteMethod, which has the required functions for reading the input data set as well as applying the Silhouette method to the specified data set. Similarly, the new package elbowmethod is added to SPMF, with the class ElbowMethod inside. This class lets the user specify the input data set and has the function required for applying the Elbow method. This function computes the SSE of the resulting clusters for each value of k, which is later visualized.

5.3 Python Visualization

Python is a general-purpose language with an easy-to-understand syntax, and it is amongst the most popular languages used for data analysis. Python can read multiple file types, including CSV, and can be used to easily read and plot the data stored in such files.

Since PROD stores the result of each step in a CSV file, the visualization of these results can be implemented using Python. Those CSV files are stored in the project folder, and the path in which they are stored is specified in the Python code. More specifically, the result of the first step performed by PROD, the sampling, is a list of the data points chosen for the sample, i.e., the CSV file resulting from this step contains these data points such that in each line the x-coordinate and the y-coordinate of a data point are stored in the first and second column, respectively. Moreover, the result of computing the Voronoi diagram is a list of all edges of this diagram. Therefore, the resulting CSV file contains four columns: for each Voronoi edge, the starting point is stored in the first two columns and the ending point in the last two columns. The next step performed by PROD is the fine tuning, whose result is a reduced number of Voronoi edges, i.e., the Voronoi edges of the newly merged regions. Hence, the resulting CSV file also contains four columns in which each row represents one Voronoi edge.

Python provides a library called matplotlib which can be used for plotting points and lines. More specifically, this library provides a function called scatter() which is used for plotting the data points. Regarding the visualization of lines, matplotlib provides two functions: add_line(), which is used as a first step to add the lines to the subplot on which they will be visualized, and plot(), which performs the actual drawing of these lines.

Having these three functions provided by the matplotlib library, as well as the three CSV files containing the intermediate results of the sampling, space partitioning, and fine tuning steps, these results can be visualized by reading the CSV files in Python. More precisely, the sampled data set is visualized by reading the CSV file containing the chosen data points and then calling the function scatter(). Moreover, in order to draw the computed Voronoi diagram, the CSV file containing the Voronoi edges is read, the lines representing the edges are added to the subplots using the add_line() function, and they are then visualized using the function plot(). Similarly, the result of the fine tuning step is visualized by reading the file containing the new Voronoi edges and then calling the two functions add_line() and plot().
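The following is a minimal Python sketch of this visualization, assuming the CSV files are named sample.csv and voronoi_edges.csv (the actual file names and paths in the project may differ).

```python
import csv
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

def read_rows(path):
    # Read a CSV file produced by PROD into a list of float tuples.
    with open(path, newline="") as f:
        return [tuple(map(float, row)) for row in csv.reader(f) if row]

fig, (ax_sample, ax_voronoi) = plt.subplots(1, 2)

# Sampled data set: one point per row (x in the first column, y in the second).
points = read_rows("sample.csv")
ax_sample.scatter([p[0] for p in points], [p[1] for p in points], s=5)

# Voronoi edges: one edge per row (x1, y1, x2, y2).
for x1, y1, x2, y2 in read_rows("voronoi_edges.csv"):
    ax_voronoi.add_line(Line2D([x1, x2], [y1, y2], color="black"))
ax_voronoi.autoscale_view()  # rescale the axes to the added lines

plt.show()
```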

5.4 Data Sets

Running all experiments requires various data sets to which PROD is applied and on which it is evaluated. In total, twenty-eight data sets were used. All of these twenty-eight data sets are two-dimensional, as PROD works in a two-dimensional space. Most of these data sets were taken from GitHub [31], while the others are provided by universities [32, 33].

Table 1 provides the most relevant information about each data set: the alias name given to the data set during the experiments, i.e., the used name, the original name of the data set as provided by its source, the number of instances in the data set, and the source from which the data set was taken.

The twenty-eight data sets are diverse. Each data set has a specific shape and a specific positioning of data points in space which differs from the positioning of the data points in the other data sets. Furthermore, the data sets contain varying numbers of data points, i.e., they are not all of the same size. These two facts are important in order to justify that PROD works in a more general context, i.e., has a higher probability of giving correct predictions for previously unseen data sets as well. To visually illustrate how diverse these twenty-eight data sets are, a plot of each of them is provided in the appendix at the end of the thesis. Each of these data sets was provided by a specific source except the Diamond2 data set. This data set was not taken from a source; it is a fraction of the Diamond9 data set, constructed by taking a small part of the larger Diamond9 data set.

Used Name | Original Name | Number of Instances | Source
4C | 2d-4c-no9 | 876 | [31]
20C | 2d-20c-no0 | 1517 | [31]
Aggregation | Aggregation | 788 | [32]
Compound | Compound | 399 | [32]
Curet0 | cure-t0-2000n-2D | 2000 | [31]
Curet1 | cure-t1-2000n-2D | 2000 | [31]
D31 | D31 | 3100 | [32]
Dartboard | dartboard1 | 1000 | [31]
Diamond2 | Diamond2DataSet | 667 | Fraction of diamond9
Diamond9 | diamond9 | 3000 | [31]
DS | DS850 | 850 | [31]
Elliptical | elliptical_10_2 | 500 | [31]
Flame | Flame | 240 | [32]
Fourty | fourty | 1000 | [31]
Jain | Jain | 373 | [32]
Long1 | long1 | 1000 | [31]
Longsquare | longsquare | 900 | [31]
Lsun | Lsun | 400 | [33]
R15 | R15 | 600 | [32]
Shapes4 | 2d-4c | 1261 | [31]
Shapes9 | 2d-10c | 2990 | [31]
Smile1 | smile1 | 1000 | [31]
Spherical4 | spherical_4_3 | 400 | [31]
Spherical6 | spherical_6_2 | 300 | [31]
Triangle | triangle1 | 1000 | [31]
Xclara | xclara | 3000 | [31]
Zelnik4 | zelnik4 | 622 | [31]
Zelnik5 | zelnik5 | 512 | [31]
Table 1. The twenty-eight used data sets.


6. Evaluation

Evaluating PROD requires evaluating each step performed by this approach. More clearly, each of the four main steps (sampling, Voronoi diagram computation, fine tuning, and prediction) should be evaluated. All previously mentioned twenty-eight data sets were used to perform the evaluation of each of these four main steps.

Subsection 6.1 starts by evaluating the first step in PROD, which is the position-based sampling. Next, an evaluation of the computation of the Voronoi diagram for the sampled data set is performed in Subsection 6.2. Furthermore, Subsection 6.3 evaluates the third step in PROD, the fine tuning of the Voronoi diagram. Finally, Subsection 6.4 evaluates the final prediction performed by PROD in terms of quality and runtime behavior.

6.1 Sampling Evaluation

The evaluation of the position-based sampling performed by PROD is done by evaluating the time required to perform this sampling for each data set. More specifically, in order to perform the position-based sampling, the NNS is applied to each data point in the data set, and then the actual sampling is applied.

Therefore, the time required by the position-based sampling is computed as the time for applying the NNS to all data points plus the time for applying the actual sampling. For each data set, the measured time is evaluated and compared to the time required by random sampling.

The sample size for each data set is determined by the position-based sampling approach itself: after identifying density areas, the approach chooses log10(number of points contained in this area) data points from each area to be included in the sample. In order to compare the two sampling approaches with respect to the runtime, all other aspects should be equal, i.e., the sample size resulting from both sampling approaches should be the same. Since the sample size is determined by the position-based sampling approach itself, this approach was applied first for each data set; the size of the resulting sample was then given as input to the random sampling in order to construct a sample of the same size. Table 2 shows, for each data set, the size of the sample constructed by both sampling approaches.
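As a rough sketch of this rule, assuming the density areas have already been identified and are given as lists of points (the rounding and the choice of which points within an area to keep are illustrative assumptions, not the exact selection logic of PROD):

```python
import math

def sample_from_areas(density_areas):
    """Take log10(|area|) points from every identified density area.
    `density_areas` is assumed to be a list of lists of (x, y) points."""
    sample = []
    for area in density_areas:
        k = max(1, math.ceil(math.log10(len(area))))  # points to keep from this area
        sample.extend(area[:k])                       # illustrative choice of points
    return sample
```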

Inspecting Table 2 shows that the constructed sample sizes are relatively small. For twenty-seven out of the twenty-eight data sets, the size of the sample is less than half the size of the data set, i.e., the sample contains fewer data points than half of the number of data points in the original data set. Reducing the size of the data sets by more than half is of huge benefit for reducing the complexity of the subsequent step. More precisely, computing the Voronoi diagram for less than half of the data set is less complex than computing this diagram for the whole data set.


Data Set | Relative Sample Size | Sample Size in Percentage
4C | 225/876 | 25.7%
20C | 407/1517 | 26.8%
Aggregation | 167/788 | 21.2%
Compound | 121/399 | 30.3%
Curet0 | 672/2000 | 33.6%
Curet1 | 650/2000 | 32.5%
D31 | 840/3100 | 27.1%
Dartboard | 504/1000 | 50.4%
Diamond2 | 213/667 | 31.9%
Diamond9 | 916/3000 | 30.5%
DS | 253/850 | 29.8%
Elliptical | 141/500 | 28.2%
Flame | 51/240 | 21.3%
Fourty | 243/1000 | 24.3%
Jain | 94/373 | 25.2%
Long1 | 239/1000 | 23.9%
Longsquare | 228/900 | 25.3%
Lsun | 40/400 | 10%
R15 | 132/600 | 22%
Shapes4 | 326/1261 | 25.9%
Shapes9 | 742/2990 | 24.8%
Smile1 | 420/1000 | 42%
Spherical4 | 131/400 | 32.8%
Spherical6 | 67/300 | 22.3%
Triangle | 260/1000 | 26%
Xclara | 757/3000 | 25.2%
Zelnik4 | 150/622 | 24.1%
Zelnik5 | 108/512 | 21.1%
Table 2. Sample sizes constructed by position-based sampling and random sampling.

Regarding runtime measurement, and since these measurements are prone to errors due to resource allocation, the position-based sampling as well as random sampling were applied five times for each data set, then the median value is computed for this data set. Table 3 illustrates, for each data set, the median runtime of performing the position-based sampling and the median runtime of performing the random sampling.
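The measurements themselves were taken in the Java implementation; purely to illustrate the median-of-five protocol used here and in the later runtime evaluations, a minimal sketch:

```python
import statistics
import time

def median_runtime(run, repetitions=5):
    """Measure the wall-clock runtime of the callable `run` several times
    and return the median, as done for the sampling measurements."""
    runtimes = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run()
        runtimes.append(time.perf_counter() - start)
    return statistics.median(runtimes)
```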

Table 3 shows that the runtime of random sampling is, for each of the twenty-eight data sets, always less than the runtime of position-based sampling. This is due to the processing that has to be done by the position-based sampling, i.e., the NNS, setting the threshold, and checking the distances of all data points. In contrast, random sampling requires no such processing; it only performs a random selection of a specific number of data points.


Data Set | Position-Based Sampling (median of 5 runs, in seconds) | Random Sampling (median of 5 runs, in seconds)
4C | 0.81 | 0.02
20C | 1.84 | 0.02
Aggregation | 0.74 | 0.02
Compound | 0.28 | 0.02
Curet0 | 2.86 | 0.02
Curet1 | 2.88 | 0.02
D31 | 6.42 | 0.05
Dartboard | 0.97 | 0.02
Diamond2 | 0.61 | 0.02
Diamond9 | 5.94 | 0.05
DS | 0.75 | 0.02
Elliptical | 0.38 | 0.02
Flame | 0.14 | 0.02
Fourty | 0.99 | 0.02
Jain | 0.24 | 0.02
Long1 | 1.0 | 0.02
Longsquare | 0.88 | 0.02
Lsun | 0.3 | 0.02
R15 | 0.5 | 0.02
Shapes4 | 1.58 | 0.02
Shapes9 | 6.61 | 0.03
Smile1 | 0.94 | 0.02
Spherical4 | 0.30 | 0.02
Spherical6 | 0.19 | 0.02
Triangle | 1.03 | 0.02
Xclara | 7.08 | 0.03
Zelnik4 | 0.55 | 0.02
Zelnik5 | 0.41 | 0.02
Table 3. Runtime of position-based sampling and random sampling.

However, at the beginning of Subsection 4.2, the benefit of position-based sampling over random sampling with respect to the resulting sampled data set is described: the characteristics of the data set are better reflected in the sample constructed by position-based sampling than in the sample constructed by random sampling.

Additionally, position-based sampling has another important advantage over random sampling: it is deterministic. Applying this sampling to the same data set multiple times produces the same sampled data set every time.

Random sampling, on the other hand, is not deterministic; applying it multiple times to the same data set produces a different result each time. Having a deterministic sampling is of huge benefit for PROD, as the subsequent steps mainly depend on the sampled data set. Therefore, knowing in advance what the result of the sampling will look like is crucial for PROD.


6.2 Space Partitioning Evaluation

The space partitioning approach applied to the sampled data set is evaluated in terms of the time needed to compute the Voronoi diagram for this sampled data set. Since the used Simple Voronoi library produces only Voronoi edges, the construction of the Voronoi regions has to be applied as well. Therefore, the time required for the space partitioning step is divided into the time required for producing the Voronoi edges by the library and the time required for constructing the Voronoi regions by PROD.

More specifically, two runtime measurements for computing the Voronoi diagram are performed for each of the twenty-eight data sets: first, the runtime for only computing the Voronoi edges, and second, the runtime for computing these edges plus constructing the Voronoi regions. As with the runtime measurement of the sampling step, the computation of the Voronoi diagram is applied five times for each data set and the median value is computed. Table 4 shows these two median values.

Data Set | Producing Voronoi Edges (median of 5 runs, in seconds) | Producing Voronoi Edges and Constructing Voronoi Regions (median of 5 runs, in seconds)
4C | 0.05 | 0.14
20C | 0.06 | 0.67
Aggregation | 0.05 | 0.09
Compound | 0.06 | 0.09
Curet0 | 0.08 | 3.55
Curet1 | 0.06 | 3.08
D31 | 0.08 | 8.09
Dartboard | 0.06 | 0.92
Diamond2 | 0.06 | 0.16
Diamond9 | 0.09 | 10.48
DS | 0.05 | 0.20
Elliptical | 0.06 | 0.09
Flame | 0.05 | 0.06
Fourty | 0.05 | 0.17
Jain | 0.06 | 0.08
Long1 | 0.05 | 0.17
Longsquare | 0.05 | 0.16
Lsun | 0.06 | 0.08
R15 | 0.06 | 0.09
Shapes4 | 0.05 | 0.30
Shapes9 | 0.08 | 5.42
Smile1 | 0.05 | 0.44
Spherical4 | 0.06 | 0.09
Spherical6 | 0.05 | 0.06
Triangle | 0.05 | 0.20
Xclara | 0.08 | 4.56
Zelnik4 | 0.05 | 0.09
Zelnik5 | 0.06 | 0.08
AVERAGE Runtime | 0.06 | 1.41
Table 4. Runtime of computing the Voronoi diagram.


If the result provided by the Simple Voronoi library, i.e., only the Voronoi edges, were sufficient, then the runtime of computing the Voronoi diagram would be as shown in the left column of Table 4. However, having only the edges is not sufficient for the subsequent steps of PROD; the Voronoi regions need to be constructed as well. Therefore, the actual runtime for computing the Voronoi edges with the Simple Voronoi library plus constructing the Voronoi regions in PROD is as shown in the right column of Table 4.

The last row in Table 4 shows the average runtime for each of the two cases. Clearly, the average runtime required for only computing the Voronoi edges is much lower than the average runtime for computing the edges and constructing the regions. However, using a more elaborate library that constructs the Voronoi regions directly would save this additional runtime.

6.3 Fine Tuning Evaluation

The fine tuning step is mainly divided into two parts: (1) inspecting the Voronoi regions and deciding whether to merge them or not, and (2) applying the merging. Therefore, the evaluation of this step consists of two parts. First, in order to decide whether the merging should be applied, a threshold, on which all decisions are based, has to be set. This threshold should be chosen carefully since the subsequent steps depend on the resulting Voronoi diagram that contains the merged regions.

Therefore, an evaluation for different settings of this threshold is performed in Subsection 6.3.1, and the best setting is chosen. The second part of the evaluation concerns evaluating the overall time needed in order to perform the inspection of all the regions and to apply the necessary merging decisions. The evaluation of the required time is performed in Subsection 6.3.2.

6.3.1 Evaluation of Different Settings of the Threshold Distance

While inspecting Voronoi regions, the decision whether some regions should be merged depends on finding points whose distances to the bordering edge do not exceed the threshold distance. Hence, choosing this threshold distance correctly is very important, as it affects the merging decisions and therefore the final Voronoi diagram from which the prediction is made.

When inspecting the Voronoi regions, the merging of two regions should only be applied if each of these regions has points placed close to points in the other region. Therefore, the threshold for deciding whether points are indeed placed close to other points should be chosen relative to the closeness of all points to each other. In order to find the best threshold distance for applying the merging, five different threshold possibilities were evaluated.

6.3.1.1 Region-Specific Thresholds

The first evaluated possibility is to set a different threshold for each inspected region. This region-specific threshold is set according to the distances between the data points contained in this region and the site of this region. More clearly, when inspecting region1 and region2, the approach sets two different thresholds as follows: (1) threshold1 as the average distance between points1 in region1 and their site, and (2) threshold2 as the average distance between points2 in region2 and their site.


Figure 19. Setting the Region-Specific Thresholds

Figure 19 shows an example of two Voronoi regions, where the green data point is the site of region1, the red data point is the site of region2, and the red line is the bordering edge of these two regions. The approach sets threshold1 for region1 as the average distance between points1 and the green site, i.e., threshold1 = avg(dist(points1)). Similarly, the approach sets threshold2 for region2 as the average distance between points2 and the red site, i.e., threshold2 = avg(dist(points2)).

Figure 20a. Example Data Set. Figure 20b. Voronoi Diagram for m Points with all n Points Scattered. Figure 20c. Final Voronoi Diagram.
Figure 20. Example of Fine Tuning with Region-Specific Thresholds.


The approach continues by searching for points in region1 with distances to the bordering edge ≤ avg(dist(points1)) and points in region2 with distances to the bordering edge ≤ avg(dist(points2)). If and only if the approach finds points in region1 satisfying their own condition and points in region2 also satisfying their own condition, the merging of these two regions is applied. This setting of the threshold led to merging decisions for regions that should not be merged.
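The following Python sketch illustrates this merge check with region-specific thresholds. The helper dist_to_edge, which returns the distance of a point to the shared bordering edge, is assumed to be given; the sketch is not the original Java implementation.

```python
import math

def avg_dist_to_site(points, site):
    # Average Euclidean distance between the points of a region and its site.
    return sum(math.dist(p, site) for p in points) / len(points)

def should_merge(points1, site1, points2, site2, dist_to_edge):
    """Region-specific variant: each region uses its own average point-to-site
    distance as threshold, and the two regions are merged only if both contain
    at least one point within their respective threshold of the bordering edge."""
    t1 = avg_dist_to_site(points1, site1)
    t2 = avg_dist_to_site(points2, site2)
    close1 = any(dist_to_edge(p) <= t1 for p in points1)
    close2 = any(dist_to_edge(p) <= t2 for p in points2)
    return close1 and close2
```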

To illustrate the problem more clearly, consider the example data set in Figure 20a and the Voronoi diagram for the chosen m points with all of the n data points scattered over the diagram as shown in Figure 20b. Setting the merging threshold of each region to the average distance in that region leads to the final Voronoi diagram shown in Figure 20c. This final diagram contains incorrect merging decisions; for example, all of the regions in the upper left corner of the diagram were merged into one region, which is obviously not the correct merging to be applied.

6.3.1.2 Average-of-Average Distance

The second evaluated possibility is to set one common threshold for all regions in the Voronoi diagram. This threshold is set as the average of the average distances of all regions. More specifically, before starting to inspect any regions, the approach calculates the average distance in each region. After computing all of the average distances, the approach calculates the average of all of these computed averages and sets the merging threshold equal to this general average distance. Figure 21 shows an example of this setting of the threshold.

Figure 21. Setting the Threshold as the Average-of-Average Distance of All Regions


Figure 21 shows an example of a Voronoi diagram with 11 regions, where each green data point is the site of the respective region. The approach starts by calculating the average distance of each of these 11 regions, i.e., t1 = avg(dist(points1)), t2 = avg(dist(points2)), ..., t11 = avg(dist(points11)). Next, the approach computes the general average of t1, t2, ..., t11 and sets the threshold equal to this computed general average, i.e., threshold = avg(t1, t2, ..., t11).

After computing the general threshold, the approach continues by inspecting each pair of neighboring regions. For instance, when inspecting region1 and region2, the approach searches for points in region1 with distances to the bordering edge less than or equal to the general threshold, i.e., the general average distance, and for points in region2 with distances to the bordering edge less than or equal to the same threshold. If and only if the approach finds points in region1 satisfying the condition and points in region2 satisfying the same condition, the merging of these two regions is applied.

Computing the threshold as described above led to missing some necessary merging decisions. More precisely, the approach rejected some merging decisions for regions that were supposed to be merged. To illustrate the problem more clearly, consider again the example data set in Figure 20a and the Voronoi diagram in Figure 20b. Setting the general merging threshold for all regions to the average of the average distances of all regions leads to the final Voronoi diagram shown in Figure 22.

Figure 22. Example of Fine Tuning with the Threshold Equal to the Average-of-Average Distance

This final diagram, shown in Figure 22, contains missed merging decisions; for example, the red regions in the diagram were supposed to be merged into one region since they have points located very close to the bordering edges. However, these merging decisions were rejected because some of these regions did not satisfy the merging condition.

6.3.1.3 Average Distance of Two Neighboring Regions

The third evaluated possibility is to set the merging threshold as the average of the distances in the two inspected regions. More specifically, when inspecting region1 and region2, and having the distances between the points in region1 and their site as well as the distances between the points in region2 and their site, the approach calculates the average of these distances and sets the threshold equal to this average distance.

The main difference between this setting of the threshold and the second possibility described previously is that the latter sets one general threshold for all regions in the Voronoi diagram, namely the average of all per-region average distances, whereas in this third possibility each pair of neighboring regions has its own threshold, namely the average of all distances in both regions.

To illustrate how the threshold is computed, Figure 23 shows an example of setting the threshold for the two neighboring regions region1 and region2. The green data point is the site of region1, the red data point is the site of region2, and the red line is the bordering edge of these two regions. The approach sets the threshold for these two regions as the average of the distances between points1 and the green site and the distances between points2 and the red site, i.e., threshold = avg(dist(points1), dist(points2)).

Figure 23. Setting the Threshold as the Average Distance of Two Neighboring Regions

As a next step, the approach searches for points in region1 with distances to the bordering edge less than or equal to the computed average distance of these two regions and for points in region2 with distances to the bordering edge less than or equal to the same computed average distance. If and only if the approach finds points in region1 satisfying the condition and points in region2 satisfying the same condition, the merging of these two regions is applied.

This setting of the threshold still leads to rejecting some merging decisions for regions that should be merged. To illustrate the problem more clearly, consider again the example data set in Figure 20a and the Voronoi diagram in Figure 20b. Setting the merging threshold of each two inspected regions to the average distance of these regions leads to the final Voronoi diagram shown in Figure 24.


Figure 24. Example of Fine Tuning with the Threshold Equal to the Average Distance of Two Neighboring Regions

This final diagram, shown in Figure 24, contains missed merging decisions; for example, all of the red regions in the diagram were supposed to be merged into one region since they have points located very close to the bordering edges. However, this merging decision was rejected because these regions did not satisfy the condition.

6.3.1.4 Maximum Average Distance

Similar to the second evaluated possibility, the fourth possibility also sets one common threshold for all regions in the Voronoi diagram. However, the threshold here is set as the maximum among all per-region average distances. More precisely, the approach starts by calculating the average distance of each region. Then, the approach finds the maximum among all of these computed average distances and sets the merging threshold equal to this maximum average distance.

Figure 25 shows an example of a Voronoi diagram with 11 regions, where each green data point is the site of the respective region. The approach starts by calculating the average distance of each of these 11 regions, i.e., t1 = avg(dist(points1)), t2 = avg(dist(points2)), ..., t11 = avg(dist(points11)). Next, the approach finds the maximum average distance among t1, t2, ..., t11 and sets the threshold equal to this maximum, i.e., threshold = max(t1, t2, ..., t11).

Having computed the maximum average distance, the approach starts inspecting the Voronoi regions. For instance, when inspecting region1 and region2, the approach searches for points in region1 with distances to the bordering edge less than or equal to the maximum average distance, and for points in region2 with distances to the bordering edge less than or equal to this maximum average distance. If and only if the approach finds points in region1 satisfying the condition and points in region2 satisfying the same condition, the merging of these two regions is applied.


Figure 25. Setting the Threshold as the Maximum Average Distance

This setting of the threshold led to many wrong merging decisions for regions that should not be merged. To illustrate the problem more clearly, consider again the example data set in Figure 20a and the Voronoi diagram in Figure 20b. Setting the merging threshold for all regions to the maximum average distance leads to the final Voronoi diagram shown in Figure 26.

Figure 26. Example of Fine Tuning with the Threshold Equal to the Maximum Average Distance


Obviously, this final diagram in Figure 26 contains wrong merging decisions, in which most of the regions were merged together since they have points with a distance to the bordering edge ≤ maximum average distance.

6.3.1.5 Sampling Threshold

The final evaluated possibility is to set one common threshold for all regions, here equal to the threshold used in the position-based sampling step. In other words, the merging threshold is set to the computed average distance between each data point and its nearest point in the data set. More precisely, before inspecting any region, the approach takes the threshold used in the sampling step, sets the merging threshold equal to this sampling threshold, and then starts inspecting the Voronoi regions.
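A minimal sketch of how such a sampling threshold can be derived, using a brute-force nearest-neighbor search purely for clarity (PROD itself obtains it during the sampling step via its NNS structure):

```python
import math

def sampling_threshold(points):
    """Average distance between each data point and its nearest neighbor.
    Brute force is used here only for illustration; it is quadratic in
    the number of points."""
    nearest_dists = []
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest_dists.append(nearest)
    return sum(nearest_dists) / len(nearest_dists)
```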

For example, when inspecting region1 and region2, the approach searches for points in region1 with distances to the bordering edge less than or equal to this sampling threshold, and for points in region2 with distances to the bordering edge also less than or equal to the sampling threshold. If and only if the approach finds points in region1 satisfying the condition and points in region2 satisfying the same condition, the merging of these two regions is applied.

This setting of the threshold led to the best merging decisions, where the merging was applied only to regions that were supposed to be merged. To illustrate the final result, consider again the example data set in Figure 20a and the Voronoi diagram with all data points as in Figure 20b; by applying the inspection of the regions with threshold = sampling_threshold, a final diagram with reasonable merging decisions is reached.

Figure 27. Example of Fine Tuning with the Threshold Equal to the Sampling Threshold

The final resulting diagram is shown in Figure 27. Different data sets were evaluated with this threshold, and for each of them we always ended up with well-performing merging decisions for the Voronoi regions.


6.3.2 Evaluation of the Required Time

The second part of evaluating the fine tuning step concerns the time needed to perform the overall process of inspecting the Voronoi regions and applying the necessary merging. More specifically, the fine tuning of the Voronoi diagram is applied for each of the twenty-eight data sets and the time needed for this application is measured.

As with the runtime measurements of the previous steps, the runtime of the fine tuning step is measured by executing this step five times for each of the twenty-eight data sets and then computing the median value for each data set. Table 5 shows, for each data set, the median runtime for performing the overall fine tuning of the Voronoi diagram.

Table 5 shows that the runtime of performing the fine tuning of the Voronoi diagram is relatively small for all data sets. More specifically, the inspection and merging approach performed by PROD on the previously computed Voronoi diagram has a reasonable runtime.

Data Set | Time Required for Performing the Fine Tuning in Seconds (median of 5 runs)
4C | 0.08
20C | 0.16
Aggregation | 0.05
Compound | 0.05
Curet0 | 0.36
Curet1 | 0.28
D31 | 0.58
Dartboard | 0.14
Diamond2 | 0.05
Diamond9 | 0.67
DS | 0.08
Elliptical | 0.03
Flame | 0.03
Fourty | 0.06
Jain | 0.03
Long1 | 0.08
Longsquare | 0.08
Lsun | 0.05
R15 | 0.03
Shapes4 | 0.11
Shapes9 | 0.39
Smile1 | 0.13
Spherical4 | 0.03
Spherical6 | 0.02
Triangle | 0.06
Xclara | 0.5
Zelnik4 | 0.09
Zelnik5 | 0.03
Table 5. Runtime of performing the fine tuning.


6.4 Evaluation of the Overall PROD Approach

The evaluation of the overall prediction made by PROD is divided into two main parts. The first part concerns evaluating the accuracy, quality, and runtime of predicting the parameter of the K-Means algorithm. The second part evaluates the prediction of the DBSCAN and OPTICS parameters.

6.4.1 Evaluation of Predicting K-Means Parameter

The evaluation of the prediction made by PROD for the K-Means parameter consists of four steps. The first step is to evaluate the accuracy of the predicted k, i.e., the predicted number of clusters, with respect to the optimal k of each data set. The second step evaluates the quality of PROD by measuring the Sum of Squared Errors of the resulting clusters. The third step concerns the time needed by PROD to perform the prediction. Finally, the last step is to measure and evaluate the number of iterations performed by the original K-Means algorithm when it is provided with the predicted k as input. For all of these evaluations, twenty-six out of the twenty-eight data sets were used, since the remaining two data sets, Dartboard and Smile1, clearly demand a density-based algorithm.

6.4.1.1 Accuracy Evaluation

The accuracy of the prediction is measured and evaluated with respect to the optimal k of each data set. More specifically, the accuracy of the prediction made by PROD is evaluated by comparing it, with respect to the optimal k, to the accuracy of the predictions made by the three other prediction methods: Improved K-Means, Silhouette Method, and Elbow Method. To perform this comparison, these three prediction methods were implemented and applied to each of the twenty-six data sets, i.e., for each data set, PROD as well as the other three approaches were applied.

Having the predicted k from each of the four approaches, the difference between each of these four ks and the optimal k is calculated. This difference shows how much the predicted k differs from the expected k. Furthermore, for each approach, the average difference over all its predictions is calculated. Finally, the average difference of PROD is compared to the average difference of each of the other three approaches.

Since this evaluation requires knowing the optimal k for each data set, this optimal k had to be determined. For some data sets, the source from which the data sets were taken provides the optimal k. For other data sets, no optimal k was provided explicitly; for some of them it was indicated implicitly in the name of the data set, and for others no optimal k was provided at all. While searching for the optimal k of the latter data sets, some issues emerged. As clustering is unsupervised learning, the main issue is that knowing the 'optimal' k in advance is very difficult. Moreover, even if an optimal k is indicated for a data set, it is still not guaranteed that it is indeed the optimal number of clusters for this data set. For instance, Figure 28a shows the Aggregation data set. The optimal k for this data set is given as 7 by the source from which it was taken [32], and Figure 28b illustrates the clustering of this data set into seven clusters. However, it is not guaranteed that this optimal k is indeed optimal in all cases; one analyst may find that the optimal k here is 5 instead of 7. Figure 28c illustrates clustering this data set into five clusters. Comparing Figure 28b and Figure 28c shows that both clustering options are valid, and it is up to the analyst to decide which one is optimal, i.e., the optimality of k depends on the individual perception of clustering.

Figure 28a. Aggregation Data Set

Figure 28b. Clustering into 7 Clusters. Figure 28c. Clustering into 5 Clusters.
Figure 28. Different Clusterings of the Aggregation Data Set.

Table 6 shows, for each of the twenty-six data sets, the optimal k as well as the predicted k from PROD and from the other three approaches. Each of the three approaches Improved K-Means, Silhouette method, and Elbow method requires applying the original K-Means algorithm multiple times in order to perform the prediction. More precisely, for the Improved K-Means approach, K-Means was applied for each value of k in the range [2, √n], where n is the number of data points in the data set. For the Silhouette method and the Elbow method, K-Means was applied for each value of k in the range [1, n]. K-Means is not deterministic because it depends on the random initialization of the centroids and converges to a local minimum, which might lead to different results in each run of the algorithm. Therefore, for each data set, each of the other approaches was applied five times and the median value was chosen.
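The thesis implementations of these baselines are Java classes added to SPMF. Purely to illustrate how the Silhouette method selects k by repeatedly running K-Means, the following Python sketch uses scikit-learn, which is an assumption for illustration and not the library used in this work:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_best_k(X, k_max):
    """Run K-Means for every candidate k and keep the k with the highest
    average silhouette coefficient (the silhouette needs at least 2 clusters)."""
    best_k, best_score = None, -1.0
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```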


Since some of the approaches require, in some cases, a very long execution time, a threshold of 30 minutes was set for the execution time. The sign '-' in an approach column of the table indicates that this approach, when applied to the respective data set, ran for more than 30 minutes and did not finish; the execution was then stopped and the prediction could not be completed.

Data Set | Optimal k | PROD | Improved K-Means (median of 5 runs) | Silhouette Method (median of 5 runs) | Elbow Method
4C | 4 | 3 | 4 | 7 | Many elbows
20C | 20 | 20 | 29 | - | Many elbows
Aggregation | 7 | 5 | 4 | 4 | Many elbows
Compound | 6 | 3 | 2 | 2 | Many elbows
Curet0 | 3 | 3 | 7 | - | Many elbows
Curet1 | 4 | 5 | 8 | - | Many elbows
D31 | 31 | 13 | 34 | - | -
Diamond2 | 2 | 2 | 2 | 2 | Many elbows
Diamond9 | 9 | 4 | 13 | - | Many elbows
DS | 5 | 5 | 7 | 2 | Many elbows
Elliptical | 7 | 7 | 12 | 11 | Many elbows
Flame | 2 | 1 | 2 | 2 | Many elbows
Fourty | 40 | 40 | 29 | 70 | Many elbows
Jain | 2 | 2 | 12 | 9 | Many elbows
Long1 | 2 | 2 | 14 | 7 | Many elbows
Longsquare | 6 | 6 | 2 | 2 | Many elbows
Lsun | 3 | 4 | 2 | 2 | Many elbows
R15 | 15 | 11 | 17 | 28 | Many elbows
Shapes4 | 4 | 4 | 5 | 12 | Many elbows
Shapes9 | 9 | 9 | 12 | 3 | Many elbows
Spherical4 | 4 | 4 | 5 | 5 | Many elbows
Spherical6 | 6 | 6 | 4 | 5 | Many elbows
Triangle | 4 | 4 | 5 | 4 | Many elbows
Xclara | 3 | 3 | 3 | - | -
Zelnik4 | 4 | 4 | 12 | 11 | Many elbows
Zelnik5 | 4 | 4 | 12 | 9 | Many elbows
Table 6. The optimal k and the four predicted ks for each data set.

In the Elbow method, a function is computed and exploited to produce a prediction of the best k. This function is computed in terms of the SSE value of the resulting clusters for each value of k. The computed function is drawn for each data set, and the "elbow" in the function should then be identified in order to determine the best k for this data set. However, after drawing the function for each of the twenty-six data sets and inspecting each drawn function, many elbows were found in each of them. None of the drawn functions contained only one clearly identifiable elbow; they all contained many elbows, which is indicated by the phrase "Many elbows" in the table. Consequently, since it is not clear which elbow should be chosen, the number of clusters could not be predicted for any data set and no median value could be computed using the Elbow method. Hence, this method could not be taken into consideration in the subsequent evaluations.

By applying the approaches to each of the twenty-six data sets and having the predicted k from PROD as well as the two median values of the predicted ks from Improved K-Means and the Silhouette Method, the difference between each predicted k and the respective optimal k of the data set is calculated. Table 7 shows, for each data set, the calculated difference between each predicted k and the respective optimal k. The last row of Table 7 shows the average difference of each approach over all data sets.

The sign ‘-‘ in Table 7 indicates that the prediction of this approach for the respective data set could not be completed, hence, the k is not available and this data set will not be taken into consideration when computing the average difference for this approach. More specifically, the prediction of the silhouette method could not be completed for six data sets, therefore, when calculating the average difference for this method, the total number of data sets is twenty instead of twenty-six.

Data Set | PROD | Improved K-Means | Silhouette Method
4C | 1 | 0 | 3
20C | 0 | 9 | -
Aggregation | 2 | 3 | 3
Compound | 3 | 4 | 4
Curet0 | 0 | 4 | -
Curet1 | 1 | 4 | -
D31 | 18 | 3 | -
Diamond2 | 0 | 0 | 0
Diamond9 | 5 | 4 | -
DS | 0 | 2 | 3
Elliptical | 0 | 5 | 4
Flame | 1 | 0 | 0
Fourty | 0 | 11 | 30
Jain | 0 | 10 | 7
Long1 | 0 | 12 | 5
Longsquare | 0 | 4 | 4
Lsun | 1 | 1 | 1
R15 | 4 | 2 | 13
Shapes4 | 0 | 1 | 8
Shapes9 | 0 | 3 | 6
Spherical4 | 0 | 1 | 1
Spherical6 | 0 | 2 | 1
Triangle | 0 | 1 | 0
Xclara | 0 | 0 | -
Zelnik4 | 0 | 8 | 7
Zelnik5 | 0 | 8 | 5
AVERAGE Difference | 1.38 | 3.92 | 5.25
Table 7. The difference between each predicted k and the optimal k.


A computed difference of zero means that the predicted k is equal to the optimal k for the respective data set. With PROD, the predicted ks for seventeen data sets were exactly the same as the indicated optimal ks, i.e., PROD predicted the optimal k for seventeen out of twenty-six data sets, which is more than half of the data sets. In contrast, Improved K-Means predicted the optimal k for only four out of twenty-six data sets, and the Silhouette method was able to predict the optimal k for only three out of twenty data sets.

Through inspecting Table 7, it is noticeable that the predicted k for the data set D31 from PROD differs from the optimal k by 18, which is a huge difference compared to the rest of the computed differences for this approach. This significant difference is caused by the density of the D31 data set.

As shown in Figure 29a, the D31 data set is a dense data set in which the data points are not well separated. Since PROD mainly depends, as a first step, on identifying separated density areas in the data set, it was not able to identify the correct density areas, as they are not clearly separated in the D31 data set, which in the end leads to a wrong prediction. However, even though the predicted number of clusters is not as expected, the 13 clusters produced by PROD are plausible. To illustrate this, Figure 29b shows the clusters produced by PROD, with each cluster represented in a different color. In addition, the original K-Means algorithm was also applied to this data set with k set to 13, i.e., the same value predicted by PROD. Figure 29c shows the 13 clusters resulting from applying the original K-Means with random centroids. Comparing Figures 29b and 29c shows that the 13 clusters produced by PROD, as shown in Figure 29b, are more plausible than the 13 clusters produced by K-Means in Figure 29c. One important benefit that makes the clustering made by PROD more plausible is the fact that this approach is able to identify outlier data points, which are shown as black points in Figure 29b.

Moreover, inspecting and comparing Figures 29b and 29c shows that the clustering of PROD corresponds to the density to some extent. For example, the pink cluster in Figure 29b contains close, i.e., not well separated, data points, which correspond to a dense area. The clusters produced by K-Means, in contrast, mainly depend on the positioning of the randomly initialized centroids. For instance, Figure 29c shows the light blue cluster, which obviously contains data points that should belong to the light green cluster instead. This is due to the unsatisfying random initialization of the centroids as well as the fact that the density of the data points is not taken into consideration by K-Means.


Figure 29a. D31 Data Set

Figure 29b. Result of Applying PROD. Figure 29c. Result of Applying K-Means with Random Centroids.
Figure 29. Clustering the D31 Data Set.

Having computed the average difference for each approach, as well as the predicted k for each data set, the range in which the optimal k of a data set lies can be derived. More clearly, with the average difference avgDiff and the predicted number of clusters k_pred, the optimal k for the respective data set lies in the range [k_pred - avgDiff, k_pred + avgDiff]. Since the number of clusters has to be an integer, the computed average difference is rounded up to the nearest integer.

More specifically, regarding PROD, the average difference is equal to 1.38, which after rounding up to the nearest integer gives an average difference of 2. Therefore, the range in which the optimal k lies is [k_pred - 2, k_pred + 2]. Clearly, PROD dramatically reduces the solution space for predicting k, i.e., it reduces the values of k for which the analyst might try to apply the original K-Means algorithm. This means that if the analyst is not satisfied with the k predicted by PROD, there are only four more values for which the original K-Means algorithm might be applied, which is not cumbersome. If the average difference is computed without the D31 data set, the rounded average difference is 1 and the range in which the optimal k lies becomes [k_pred - 1, k_pred + 1], i.e., if the analyst is not satisfied with the predicted k, only two more values have to be tried.

Moreover, Table 7 shows that for nine data sets the k predicted by PROD is not equal to the optimal k. Of these nine predicted ks, seven have values below the optimal k, i.e., for more than half of them the predicted value is less than the optimal value. This means that when analysts are not satisfied with the predicted k and want to test another value of k in the range [k_pred - 2, k_pred + 2], they can start with the upper half of the range, i.e., with the two values k_pred + 1 and k_pred + 2.

Regarding the ks predicted by the Improved K-Means, the average difference is equal to 3.92. After rounding up this value, the resulting range of values is [k_pred - 4, k_pred + 4]. For the Silhouette method, with 5.25 as the average difference and 6 as the rounded value, the resulting range of values is [k_pred - 6, k_pred + 6]. Clearly, PROD achieves the highest reduction of the solution space for predicting k.

6.4.1.2 Sum of Squared Errors Evaluation

The quality of the prediction made by PROD is evaluated by measuring the Sum of Squared Errors (SSE) of the resulting clusters. The SSE is measured for (1) the final result produced by PROD, (2) the clusters resulting from applying the K-Means algorithm with randomly initialized centroids, and (3) the clusters resulting from applying the K-Means algorithm with the centroids predicted by PROD.
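As a reminder of the measure itself, a minimal sketch of the SSE computation used throughout this comparison, assuming each cluster is given as a list of two-dimensional points and its centroid is the mean of its points (outliers identified by PROD are simply not passed in):

```python
import math

def sum_of_squared_errors(clusters):
    """SSE over all clusters: for every cluster, sum the squared Euclidean
    distances between its points and its centroid, then add up the sums."""
    sse = 0.0
    for cluster in clusters:
        cx = sum(p[0] for p in cluster) / len(cluster)   # centroid x-coordinate
        cy = sum(p[1] for p in cluster) / len(cluster)   # centroid y-coordinate
        sse += sum(math.dist(p, (cx, cy)) ** 2 for p in cluster)
    return sse
```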

The final result produced by PROD is the final Voronoi diagram after merging some regions. The merging performed in the Voronoi diagram depends on the closeness of the data points to each other: whenever a region contains points which are close to points in another region, these two regions are merged. As the merging is applied depending on the closeness of data points, each merged region corresponds to one cluster containing all data points that are close to each other. Therefore, the final Voronoi diagram resulting from applying PROD to a data set can be considered a clustering of this data set, where each merged region in this diagram is considered a cluster. Furthermore, as PROD is also able to identify outlier data points, it can be considered a partition-based clustering algorithm that can handle outliers; to the best of our knowledge, there is currently no approach which can handle both at the same time. Since PROD can thus be considered a standalone clustering algorithm, the sum of squared errors of its clusters can be computed and the performance of PROD can be evaluated accordingly.

Moreover, PROD produces a prediction of the centroids that can be used instead of the randomly initialized centroids of the K-Means algorithm. More clearly, as each region in the final Voronoi diagram is considered a cluster, PROD computes the centroid of each region and provides these computed centroids to be used as the initial centroids of the K-Means algorithm.

To perform this evaluation, the following is done for each of the twenty-six data sets: (1) PROD is applied and the SSE of the resulting clusters is computed, (2) the original K-Means algorithm is applied with randomly initialized centroids and the SSE is computed, and (3) the original K-Means algorithm is applied with the predicted centroids as initial centroids and the SSE is computed. As applying K-Means with randomly initialized centroids may lead to a different clustering result each time, the second step is performed five times and the median of the five SSE values is taken.

Table 8 presents, for each data set, the value of the computed SSE for each of the three described steps. In order to do the comparison between PROD, the K-Means with random centroids, and K-Means with the predicted centroids, the median SSE is computed for each of them. The last row in Table 8 shows the computed median SSE for each of the three compared choices.

Data Set | PROD | K-Means with Random Centroids (median of 5 runs) | K-Means with Predicted Centroids
4C | 2,323.48 | 10,583.23 | 11,278.68
20C | 1,098.35 | 2,912.83 | 1,895.35
Aggregation | 3,999.76 | 19,690.13 | 16,598.23
Compound | 1,338.46 | 7,732.68 | 7,732.68
Curet0 | 925.08 | 347.97 | 347.97
Curet1 | 542.50 | 211.92 | 266.64
D31 | 8,055.07 | 19,284.59 | 18,962.21
Diamond2 | 370.13 | 229.97 | 1,539.30
Diamond9 | 4,111.02 | 4,360.98 | 4,375.33
DS | 433.38 | 642.83 | 410.57
Elliptical | 1,335.04 | 4,739.05 | 3,857.83
Flame | 1,038.24 | 5,187.11 | 5,187.11
Fourty | 622.61 | 3,120.61 | 499.09
Jain | 2,491.62 | 22,209.25 | 22,209.25
Long1 | 610.87 | 637.08 | 1,080.50
Longsquare | 1,844.10 | 14,894.05 | 6,637.37
Lsun | 310.64 | 593,049.60 | 593,049.60
R15 | 526.82 | 795.04 | 364.84
Shapes4 | 4,312.85 | 20,434.32 | 20,434.32
Shapes9 | 9,210.21 | 223,603.76 | 52,654.11
Spherical4 | 590.55 | 3,279.74 | 1,030.96
Spherical6 | 364.11 | 1,351.05 | 543.17
Triangle | 1,444.54 | 53,903.15 | 7,035.54
Xclara | 33,323.28 | 611,605.88 | 611,605.88
Zelnik4 | 17.16 | 34.49 | 17.07
Zelnik5 | 61.62 | 8.09 | 9.73
MEDIAN SSE | 1,068.30 | 4,550.02 | 4,116.58
Table 8. Sum of Squared Errors (SSE) of the resulting clusters.

The last row of Table 8 shows that the median SSE of K-Means with predicted centroids is slightly lower than the median SSE of K-Means with random centroids. This means that the clusters resulting from applying K-Means with the predicted centroids as initial centroids are slightly better than the clusters produced by K-Means with randomly initialized centroids. Therefore, the predicted centroids slightly improve the performance of the original K-Means algorithm by reducing the SSE of the resulting clusters.

On the other hand, the overall median SSE of PROD is much lower than the median SSE of both K-Means with random centroids and K-Means with predicted centroids. More precisely, the produced clusters by PROD are better, with respect to the performance of the clustering, than the clusters produced by the original K-Means algorithm.

One main difference between the clustering made by PROD and the clustering made by the original K-Means algorithm is that K-Means is not able to identify outliers, i.e., all data points will be assigned to one of the clusters, even data points that should be considered as outliers. This fact leads to a higher SSE for some data sets as the distance between the outlier data point and the centroid of the cluster to which it is assigned will be relatively large.

Data Set | Relative Amount of Outliers | Outliers Percentage
4C | 61/876 | 7%
20C | 100/1517 | 6.6%
Aggregation | 0/788 | 0%
Compound | 28/399 | 7%
Curet0 | 22/2000 | 1%
Curet1 | 182/2000 | 9%
D31 | 88/3100 | 2.8%
Dartboard | 0/1000 | 0%
Diamond2 | 0/667 | 0%
Diamond9 | 22/3000 | 0.7%
DS | 64/850 | 7.5%
Elliptical | 0/500 | 0%
Flame | 2/240 | 0.8%
Fourty | 7/1000 | 0.7%
Jain | 63/373 | 17%
Long1 | 120/1000 | 12%
Longsquare | 71/900 | 7.9%
Lsun | 10/400 | 2.5%
R15 | 12/600 | 2%
Shapes4 | 9/1261 | 0.7%
Shapes9 | 74/2990 | 2.5%
Smile1 | 118/1000 | 11.8%
Spherical4 | 1/400 | 0.3%
Spherical6 | 6/300 | 2%
Triangle | 81/1000 | 8.1%
Xclara | 178/3000 | 5.9%
Zelnik4 | 134/622 | 21.5%
Zelnik5 | 3/512 | 0.6%
Table 9. Relative Amount and Percentage of Identified Outliers.

On the other hand, PROD is able to identify outliers depending on the distance between data points. More clearly, the distance between an outlier data point and any other data point is relatively larger than the distance between data points that are close to each other. Therefore, through applying all the steps of PROD, these outlier data points remain alone in some regions of the final Voronoi diagram, i.e., we end up with some regions that contain a relatively small number of points. Consequently, since the approach sets a threshold on the number of data points a region must contain in order to be considered a cluster, these regions with a relatively small number of points are considered outliers. These outlier data points are not taken into consideration when computing the SSE of the resulting clusters, which leads to a lower SSE for some data sets. Table 9 shows the relative amount of outliers identified by PROD as well as the percentage of these outliers.

Through inspecting Table 9, it is noticeable that the amount of outlier data points is reasonable for most data sets. However, for some data sets, PROD identified a relatively large amount of outliers. For instance, regarding the Zelnik4 data set, PROD identified 134 outlier data points out of 622 points, which means that approximately 22% of the data set was identified as outliers.

However, the decision whether this amount of identified outlier data points is reasonable or not strongly depends on the data set itself. For instance, Figure 30a shows the Zelnik4 data set. Looking at this data set, it obviously contains a relatively large amount of outlier data points. Therefore, the fact that PROD identified 134 outlier data points out of 622 points is reasonable for this data set. Figure 30b illustrates the clusters and the outlier data points identified by PROD, where these outlier data points are shown in black.

Figure 30. Applying PROD to the Zelnik4 Data Set. (a) Zelnik4 Data Set. (b) Resulting Clusters from PROD.

6.4.1.1 Time Evaluation

An important step in evaluating the performance of an approach is to evaluate the overall time required to perform the prediction with this approach. Therefore, this sub-part evaluates the time required by PROD to perform all steps and produce the final prediction, and compares it to the time required by each of the other three approaches: Improved K-Means, Silhouette Method, and Elbow Method. Due to their high complexity, some of these approaches require a huge amount of time to perform the prediction for some data sets. Hence, a threshold for the execution time is set to 30 minutes, i.e., whenever an approach runs for more than 30 minutes without finishing, the execution is stopped and the prediction of k is not completed.
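The time measurement with the 30-minute threshold can be sketched in Java as follows; the use of an ExecutorService and the method name are illustrative assumptions, not the actual measurement code of the thesis.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Minimal sketch: run one prediction approach with a 30-minute cap, as assumed for Table 10.
final class TimedRunSketch {

    // Returns the runtime in milliseconds, or -1 if the 30-minute threshold was exceeded.
    static long timeWithThreshold(Runnable predictionApproach) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        Future<?> run = executor.submit(predictionApproach);
        try {
            run.get(30, TimeUnit.MINUTES);               // wait at most 30 minutes
            return (System.nanoTime() - start) / 1_000_000;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } catch (TimeoutException e) {
            run.cancel(true);                            // stop the execution; report as '>30 min'
            return -1;
        } finally {
            executor.shutdownNow();
        }
    }
}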


Similar to measuring the runtime of any step in PROD, the whole prediction approach will be applied five times for each data set, then the median value is computed for this data set. Additionally, each of the other previous approaches will also be applied five times for each data set in order to compute the median value for this data set.

Table 10 shows, for each of the twenty-six data sets, the median time required by each of the four approaches in order to perform the overall prediction. More clearly, the measured time here is the runtime of the whole approach, i.e., of the whole pipeline. In case an approach took more than 30 minutes for a specific data set in all 5 runs, the time is specified as '>30 min'.

Data Set | PROD (median of 5 runs) | Improved K-Means (median of 5 runs) | Silhouette Method (median of 5 runs) | Elbow Method (median of 5 runs)
4C | 1.05 s | 23 s | 14 min 50 s | 52 s
20C | 2.67 s | 1 min 4 s | >30 min | 3 min 50 s
Aggregation | 0.88 s | 17 s | 5 min 15 s | 29 s
Compound | 0.42 s | 7 s | 45 s | 12 s
Curet0 | 6.75 s | 3 min 2 s | >30 min | 10 min 5 s
Curet1 | 6.23 s | 2 min 7 s | >30 min | 11 min 30 s
D31 | 15.16 s | 6 min 30 s | >30 min | >30 min
Diamond2 | 0.8 s | 14 s | 4 min 31 s | 35 s
Diamond9 | 17.31 s | 6 min 34 s | >30 min | 30 min
DS | 1.06 s | 20 s | 8 min 45 s | 41 s
Elliptical | 0.52 s | 8 s | 1 min 31 s | 12 s
Flame | 0.23 s | 5 s | 14 s | 6 s
Fourty | 1.24 s | 28 s | 16 min 40 s | 5 s
Jain | 0.36 s | 6 s | 38 s | 7 s
Long1 | 1.23 s | 30 s | 15 min 15 s | 1 min 15 s
Longsquare | 1.11 s | 1 min 5 s | 10 min 5 s | 50 s
Lsun | 0.42 s | 8 s | 51 s | 10 s
R15 | 0.64 s | 12 s | 2 min 45 s | 13 s
Shapes4 | 1.98 s | 49 s | 29 min 58 s | 3 min 32 s
Shapes9 | 12.52 s | 5 min 45 s | 5 min 50 s | 25 s
Spherical4 | 0.42 s | 20 s | 54 s | 2 min 43 s
Spherical6 | 0.28 s | 8 s | 30 s | 7 s
Triangle | 1.3 s | 27 s | 14 min 56 s | 1 min 22 s
Xclara | 12.3 s | 7 min 30 s | >30 min | >30 min
Zelnik4 | 0.74 s | 19 s | 6 min 54 s | 24 s
Zelnik5 | 0.55 s | 17 s | 3 min 51 s | 27 s
AVERAGE Runtime | 3.39 s | 89.81 s | 749.92 s | 300.46 s
Table 10. Time Required by PROD and the Other Three Approaches.

In order to compare the four approaches with respect to the runtime, the average runtime of each of these approaches was computed, as shown in the last row of Table 10. In case the execution time of an approach for a specific data set exceeds the threshold, i.e., exceeds 30 minutes, it is included in the average computation with a value of exactly 30 minutes. For instance, when applying the Silhouette Method to the 20C data set, the execution time exceeded 30 minutes; therefore, when computing the average runtime of this approach, the execution time for the 20C data set was counted as exactly 30 minutes.

Through inspecting the last row of Table 10, it is noticeable that PROD requires the least time among all approaches for performing the prediction of the number of clusters for all twenty-six data sets. More clearly, the number of clusters of each data set can be predicted within only a few seconds using PROD, which is much faster than the prediction made by the other three approaches.

Regarding the Silhouette Method and the Elbow Method, the average runtime of each of these two approaches is far from the average runtime of PROD, even though the average was computed with the best case for each of them, i.e., runtimes exceeding 30 minutes were counted as exactly 30 minutes. Moreover, even the best existing approach, Improved K-Means, has an average runtime of 89.81 seconds, whereas the average runtime of PROD is only 3.39 seconds. More precisely, PROD is faster than the best existing approach by a factor of 26.5. This clearly shows that PROD is suitable for big data.

6.4.1.2 Number of Iterations Evaluation

The last step in the evaluation is to evaluate the number of iterations performed by the original K-Means algorithm when it is applied with k equal to the k predicted by PROD. More specifically, two steps are followed for each data set. First, the original K-Means algorithm is applied with randomly initialized centroids for five runs, i.e., the algorithm is executed 5 times, the number of iterations of each run is recorded, and the median number of iterations is chosen. Second, the K-Means algorithm is applied with the predicted centroids produced by PROD. In all cases, K-Means is applied, for each data set, with the number of clusters predicted by PROD.
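Counting the iterations can be sketched as a plain Lloyd-style K-Means loop that stops once the cluster assignments no longer change; the sketch below shows one possible counting convention and is an assumption, not the counting performed by the mining library used in the thesis implementation.

import java.util.Arrays;
import java.util.List;

// Minimal sketch: count Lloyd iterations of K-Means for a given set of initial centroids.
final class IterationCountSketch {

    record Point(double x, double y) {}

    static int countIterations(List<Point> data, List<Point> initialCentroids) {
        double[][] centroids = initialCentroids.stream()
                .map(p -> new double[]{p.x(), p.y()})
                .toArray(double[][]::new);
        int[] assignment = new int[data.size()];
        Arrays.fill(assignment, -1);
        int iterations = 0;
        while (true) {
            boolean changed = false;
            // Assignment step: nearest centroid for every point.
            for (int i = 0; i < data.size(); i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double dx = data.get(i).x() - centroids[c][0];
                    double dy = data.get(i).y() - centroids[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) return iterations;   // assignments are stable: converged
            iterations++;
            // Update step: recompute each centroid as the mean of its assigned points.
            double[][] sums = new double[centroids.length][2];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < data.size(); i++) {
                sums[assignment[i]][0] += data.get(i).x();
                sums[assignment[i]][1] += data.get(i).y();
                counts[assignment[i]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) {
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
        }
    }
}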

Table 11 shows, for each of the twenty-six data sets, the median number of iterations from applying K-Means with random centroids over 5 runs, as well as the number of iterations from applying K-Means with the predicted centroids. The last row of the table shows the computed median of each column. As the number of iterations should be an integer, the median is rounded up to the nearest integer if necessary. More specifically, as the median number of iterations for K-Means with the predicted centroids is 4.5, this number is rounded up to 5.

Clearly, through inspecting Table 11, PROD itself does not require performing multiple iterations. Moreover, the number of iterations required by K-Means with the predicted centroids is lower than the number of iterations required by K-Means with randomly initialized centroids. More specifically, for twenty-three out of twenty-six data sets, the number of iterations required by K-Means was reduced when using the predicted centroids.

Regarding the three remaining data sets, Curet0, D31, and Flame, the number of iterations either remained the same or increased. More clearly, the predicted centroids for two of these three data sets, Curet0 and D31, led to an increase in the number of iterations required by K-Means instead of a reduction. For the last remaining data set, Flame, the number of iterations remained the same.


Data Set | PROD | K-Means with Random Centroids (median of 5 runs) | K-Means with Predicted Centroids
4C | - | 6 | 3
20C | - | 11 | 3
Aggregation | - | 17 | 9
Compound | - | 10 | 2
Curet0 | - | 12 | 39
Curet1 | - | 30 | 23
D31 | - | 21 | 48
Diamond2 | - | 3 | 2
Diamond9 | - | 26 | 14
DS | - | 22 | 5
Elliptical | - | 21 | 15
Flame | - | 0 | 0
Fourty | - | 15 | 2
Jain | - | 10 | 7
Long1 | - | 10 | 2
Longsquare | - | 14 | 10
Lsun | - | 11 | 10
R15 | - | 14 | 10
Shapes4 | - | 8 | 3
Shapes9 | - | 10 | 2
Spherical4 | - | 12 | 2
Spherical6 | - | 8 | 2
Triangle | - | 17 | 5
Xclara | - | 12 | 2
Zelnik4 | - | 8 | 4
Zelnik5 | - | 18 | 7
MEDIAN Number of Iterations | - | 12 | 4.5
Table 11. Number of Iterations Performed.

In order to clearly illustrate the reduction in the number of iterations when using the predicted centroids, the difference between the number of iterations required by K-Means with random centroids and K-Means with predicted centroids is calculated. Since the Curet0 and D31 data sets are considered outliers here, the difference in the number of iterations is not computed for them. Table 12 shows this calculated difference for each of the remaining twenty-four data sets.

The last row of Table 12 shows the average number of saved iterations over the twenty-four data sets, which is equal to 7.04. More precisely, using the predicted centroids as initial centroids when applying K-Means saves approximately 7 iterations on average. Consequently, the centroids produced by PROD enhance the performance of the original K-Means algorithm through reducing the required number of iterations. This is especially important when dealing with huge amounts of data, which means that PROD helps in enhancing the performance of the K-Means algorithm for big data.


Data Set | Number of Saved Iterations
4C | 3
20C | 8
Aggregation | 8
Compound | 8
Curet1 | 7
Diamond2 | 1
Diamond9 | 12
DS | 17
Elliptical | 6
Flame | 0
Fourty | 13
Jain | 3
Long1 | 8
Longsquare | 4
Lsun | 1
R15 | 4
Shapes4 | 5
Shapes9 | 8
Spherical4 | 10
Spherical6 | 6
Triangle | 12
Xclara | 10
Zelnik4 | 4
Zelnik5 | 11
AVERAGE Number of Saved Iterations | 7.04
Table 12. Number of Saved Iterations.

6.4.2 Evaluation of Predicting DBSCAN and OPTICS Parameters

The second part of evaluating PROD concerns the process of predicting the input parameters of the two density-based algorithms DBSCAN and OPTICS. As the prediction of MinPts and ε mainly depends on the regions of the Voronoi diagram that are actually considered clusters, the outlier regions in the diagram have to be identified before the prediction can be performed. More clearly, the process of predicting MinPts and ε can be divided into two main steps: (1) identifying the outlier regions, and (2) performing the prediction taking into consideration all regions in the Voronoi diagram except the identified outlier regions. Twelve different approaches were tested in order to perform these two steps.

6.4.2.1 Identify Outlier Regions

In order to perform the first step, which is identifying the outlier regions, two different approaches were implemented. The first approach identifies outlier regions by setting a threshold equal to a specific percentage of the maximum number of points in a region. More clearly, the maximum number of data points in a region MaxRegion is identified; then, each region that contains a number of points less than or equal to a specific percentage of this maximum number is considered an outlier region.
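A minimal Java sketch of this percentage-based rule is shown below, operating only on the point counts of the regions; the class, method, and parameter names are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: a region is an outlier region if it holds at most a given
// percentage of the points of the most populated region (MaxRegion).
final class PercentThresholdSketch {

    // regionSizes[i] = number of data points in region i; percent is e.g. 0.05 for 5%.
    static List<Integer> outlierRegions(int[] regionSizes, double percent) {
        int maxRegion = 0;
        for (int size : regionSizes) maxRegion = Math.max(maxRegion, size);
        double threshold = percent * maxRegion;
        List<Integer> outliers = new ArrayList<>();
        for (int i = 0; i < regionSizes.length; i++) {
            if (regionSizes[i] <= threshold) outliers.add(i); // region i is an outlier region
        }
        return outliers;
    }
}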

The second approach for identifying outlier regions works similarly to the idea of identifying the elbow in a data set. More clearly, the elbow is identified depending on the number of data points in each region of the final Voronoi diagram. This elbow represents the number of data points such that any region with at most this number of data points is considered an outlier. The approach follows these steps, sketched in code below: (1) sort the regions of the final Voronoi diagram in ascending order by the number of data points inside each region, (2) calculate the difference between the number of points in the current region and the number of points in the following region, (3) find the maximum difference and consider it the elbow, and (4) set the threshold to the number of points in the region that caused this maximum difference. Finally, each region that contains at most this threshold number of points is considered an outlier region.
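The following minimal Java sketch follows the four steps literally, assuming only an array of region point counts; the class and method names are illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the maximum-difference ("elbow") rule on region sizes.
final class MaxDifferenceSketch {

    // Returns the indices of the regions considered outlier regions.
    static List<Integer> outlierRegions(int[] regionSizes) {
        // (1) Sort the region sizes in ascending order.
        int[] sorted = regionSizes.clone();
        Arrays.sort(sorted);
        // (2) + (3) Find the largest gap between consecutive sizes; (4) the smaller
        // size at that gap becomes the threshold.
        int threshold = 0;
        int maxDifference = -1;
        for (int i = 0; i + 1 < sorted.length; i++) {
            int difference = sorted[i + 1] - sorted[i];
            if (difference > maxDifference) {
                maxDifference = difference;
                threshold = sorted[i];
            }
        }
        // Every region with at most 'threshold' points is an outlier region.
        List<Integer> outliers = new ArrayList<>();
        for (int i = 0; i < regionSizes.length; i++) {
            if (regionSizes[i] <= threshold) outliers.add(i);
        }
        return outliers;
    }
}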

6.4.2.2 Prediction of MinPts and ε

Regarding the second step, which is performing the prediction, MinPts is always predicted as the minimum number of points in a region MinRegion of the Voronoi diagram, whereas for predicting ε, three different approaches were implemented and tested.

The first approach predicts ε as the largest distance between a data point in a region and the site of this region in the Voronoi diagram. The second and the third approach, on the other hand, perform the prediction of ε based on the identified region with the minimum number of data points, MinRegion. More specifically, after identifying MinRegion, the second approach predicts ε as the largest distance between a point in MinRegion and the site of this region. The third approach computes the average of all distances between the data points in MinRegion and the site of this region, and sets ε to this average distance.
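A minimal Java sketch of the MinPts prediction and the three ε variants is given below; the Region abstraction (a site plus its assigned points) and all method names are illustrative assumptions, not the thesis classes.

import java.util.Comparator;
import java.util.List;

// Minimal sketch of the MinPts prediction and the three epsilon variants.
final class EpsMinPtsSketch {

    record Point(double x, double y) {}

    // Hypothetical region abstraction: its site and the data points assigned to it.
    record Region(Point site, List<Point> points) {}

    static double distance(Point a, Point b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }

    // MinPts: the number of points in the region with the fewest points.
    static int predictMinPts(List<Region> regions) {
        return regions.stream().mapToInt(r -> r.points().size()).min().orElse(0);
    }

    // Variant 1: largest point-to-site distance over all regions.
    static double epsLargestDistance(List<Region> regions) {
        double eps = 0.0;
        for (Region r : regions)
            for (Point p : r.points())
                eps = Math.max(eps, distance(p, r.site()));
        return eps;
    }

    // MinRegion: the region with the fewest points (used by variants 2 and 3).
    static Region minRegion(List<Region> regions) {
        return regions.stream()
                .min(Comparator.comparingInt(r -> r.points().size()))
                .orElseThrow();
    }

    // Variant 2: largest point-to-site distance inside MinRegion.
    static double epsLargestDistanceInMinRegion(List<Region> regions) {
        Region min = minRegion(regions);
        return min.points().stream().mapToDouble(p -> distance(p, min.site())).max().orElse(0.0);
    }

    // Variant 3: average point-to-site distance inside MinRegion.
    static double epsAverageDistanceInMinRegion(List<Region> regions) {
        Region min = minRegion(regions);
        return min.points().stream().mapToDouble(p -> distance(p, min.site())).average().orElse(0.0);
    }
}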

6.4.2.3 Combinations of the Approaches

Each of the two approaches for identifying outlier regions and each of the three approaches for predicting ε was implemented twice. The first time, these approaches operate on the final Voronoi diagram, which is the result of the last step of PROD. The second time, they operate on the original Voronoi diagram, which is produced by the space partitioning step that draws the Voronoi diagram for the sampled data set. More clearly, the original diagram is the diagram of the sampled data set with all data points scattered over it before performing the fine tuning, i.e., before applying any merging.

All possible combinations of the two approaches of step one and the three approaches of step two were tested and evaluated, hence there exist six possible combinations. Moreover, each of these six combinations was tested with regard to the final Voronoi diagram as well as to the original Voronoi diagram. Consequently, there exist twelve possible combinations, all of which were implemented and evaluated (see the enumeration sketch below). In order to do these evaluations, two data sets were used. Since this part of the evaluation concerns the prediction of the DBSCAN and OPTICS parameters, the used data sets should require density-based algorithms. Figure 31a and Figure 31b show the two used data sets, the Dartboard data set and the Smile1 data set, which clearly require density-based algorithms. Each color corresponds to a different cluster, i.e., the Dartboard data set has four different clusters and the Smile1 data set also has four different clusters.
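For completeness, the twelve combinations can be enumerated with a simple nested loop over the two diagrams, the two outlier-region rules, and the three ε rules; the enum and class names below are illustrative only.

// Minimal sketch: the twelve combinations = 2 diagrams x 2 outlier rules x 3 epsilon rules.
final class CombinationSketch {

    enum Diagram { FINAL_VORONOI, ORIGINAL_VORONOI }
    enum OutlierRule { MAX_DIFFERENCE, MAX_REGION_PERCENT }
    enum EpsilonRule { LARGEST_DISTANCE, LARGEST_DISTANCE_IN_MIN_REGION, AVERAGE_DISTANCE_IN_MIN_REGION }

    public static void main(String[] args) {
        int combination = 0;
        for (Diagram diagram : Diagram.values())
            for (OutlierRule outlierRule : OutlierRule.values())
                for (EpsilonRule epsilonRule : EpsilonRule.values())
                    System.out.printf("combination %2d: %s + %s + %s%n",
                            ++combination, diagram, outlierRule, epsilonRule);
    }
}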

For each of these two data sets, each of the twelve combinations was tested. More clearly, for each data set, MinPts and ε were predicted twelve times, each time using a different combination. Furthermore, for each prediction of MinPts and ε of a data set, the original DBSCAN algorithm is applied to this data set with the predicted parameters as input parameters. Moreover, the OPTICS algorithm is also applied to the respective data set with the predicted MinPts as an input parameter. Finally, the results of applying DBSCAN and OPTICS are used in order to evaluate the respective combination.

Figure 31. The Data Sets Used for the Evaluation of Predicting Density-Based Algorithm Parameters. (a) Dartboard Data Set. (b) Smile1 Data Set.

Maximum Difference and Largest Distance

This combination consists of identifying the outlier regions using the maximum difference approach, and predicting ε as the largest distance in a region of the Voronoi diagram. Table 13 shows the number of resulting clusters from applying DBSCAN and OPTICS to both data sets using the two Voronoi diagrams.

Data Set | DBSCAN (Final Diagram) | OPTICS (Final Diagram) | DBSCAN (Original Diagram) | OPTICS (Original Diagram)
Dartboard | 1 | 1 | 1 | 1
Smile1 | 1 | 1 | 2 | 1
Table 13. Number of Resulting Clusters for Both Data Sets Using the Two Diagrams.

First, we evaluate this combination on the final Voronoi diagram. Applying DBSCAN as well as OPTICS with the predicted parameters to the Dartboard data set resulted in only one cluster for each of the two algorithms. More clearly, the predicted MinPts and ε caused DBSCAN and OPTICS to produce only one cluster containing all the data points of the Dartboard data set. Similarly, the predicted parameters for the Smile1 data set caused DBSCAN and OPTICS to produce only one cluster containing all data points of the Smile1 data set.

Second, we evaluate this combination on the original Voronoi diagram. Similar to the final Voronoi diagram, applying DBSCAN as well as OPTICS with the predicted parameters to the Dartboard data set resulted in only one cluster. On the other hand, the parameters predicted for the Smile1 data set, using this combination on the original Voronoi diagram, caused DBSCAN to produce two different clusters, as shown in Figure 32. Clearly, this clustering is not valid as it caused most of the data points to be considered outliers and therefore not included in the resulting clusters. However, applying OPTICS with the predicted parameters to the Smile1 data set resulted in only one cluster for the whole data set.

Figure 32. The Resulting Clustering of Applying DBSCAN to the Smile1 Data Set Using the Original Voronoi Diagram.

Consequently, the combination of the maximum difference and the largest distance, on both the final and the original diagram, is not valid and cannot be used to predict the parameters of DBSCAN and OPTICS correctly.

Maximum Difference and Largest Distance in MinRegion

The second combination consists of identifying the outlier regions using the maximum difference approach, and predicting ε as the largest distance in the region that contains the minimum number of points, i.e., in MinRegion. Table 14 shows the number of resulting clusters from applying DBSCAN and OPTICS to both data sets using the two Voronoi diagrams.

Data Set | DBSCAN (Final Diagram) | OPTICS (Final Diagram) | DBSCAN (Original Diagram) | OPTICS (Original Diagram)
Dartboard | 1 | 1 | 1 | 1
Smile1 | 3 | 1 | 2 | 1
Table 14. Number of Resulting Clusters for Both Data Sets Using the Two Diagrams.

Regarding the Dartboard data set, the parameters predicted by this combination using either Voronoi diagram caused DBSCAN as well as OPTICS to produce only one cluster containing all the data points of the Dartboard data set.

On the other hand, using this combination to predict the parameters for the Smile1 data set from the final Voronoi diagram and applying DBSCAN with these predicted parameters led to three different clusters. Figure 33 illustrates the resulting three clusters. Obviously, this clustering is not valid as it considered some of the data points outliers, which is not true for this data set. However, applying OPTICS with the predicted MinPts again produced only one cluster containing all data points.


Figure 33. The Resulting Clustering of Applying DBSCAN to the Smile1 Data Set Using the Final Voronoi Diagram.

Furthermore, using the original Voronoi diagram to perform the prediction for the Smile1 data set caused DBSCAN to produce two different clusters, the same as shown in Figure 32. However, this prediction caused OPTICS to produce only one cluster for the whole data set.

Consequently, the combination of the maximum difference and the largest distance in MinRegion, on both the final and the original diagram, is not valid and cannot be used to predict the parameters of DBSCAN and OPTICS correctly.

Maximum Difference and Average Distance in MinRegion

This combination consists of identifying the outlier regions using the maximum difference approach, and predicting ε as the average distance in the region that contains the minimum number of points, i.e., in MinRegion. Table 15 shows the number of resulting clusters from applying DBSCAN and OPTICS to both data sets using the two Voronoi diagrams.

Data Set | DBSCAN (Final Diagram) | OPTICS (Final Diagram) | DBSCAN (Original Diagram) | OPTICS (Original Diagram)
Dartboard | 1 | 1 | 1 | 1
Smile1 | 2 | 1 | 2 | 1
Table 15. Number of Resulting Clusters for Both Data Sets Using the Two Diagrams.

Regarding the Dartboard data set, the parameters predicted by this combination using either Voronoi diagram caused DBSCAN as well as OPTICS to produce only one cluster containing all the data points of the Dartboard data set.

Regarding the Smile1 data set, the parameters predicted from the final Voronoi diagram and the parameters predicted from the original Voronoi diagram led to the same result for both algorithms. More clearly, the parameters predicted from the final Voronoi diagram caused DBSCAN to produce two different clusters, the same as shown in Figure 32, and caused OPTICS to produce only one cluster. Similarly, the parameters predicted from the original Voronoi diagram also caused DBSCAN to produce two different clusters, the same as in Figure 32, and caused OPTICS to produce only one cluster for the whole data set.


Consequently, the combination of the maximum difference and the average distance in MinRegion, on both the final and the original diagram, is not valid and cannot be used to predict the parameters of DBSCAN and OPTICS correctly.

MaxRegion Percent and Largest Distance

The fourth combination consists of identifying the outlier regions by setting the threshold to a specific percentage of the maximum number of points in a region MaxRegion, and predicting ε as the largest distance in a region of the Voronoi diagram. Many percentages were tested, but no common percentage that fits all data sets could be found. More specifically, for some data sets, setting the threshold to 5% of the maximum number of points led to correctly identified outlier regions, whereas for other data sets this 5% led to a wrong identification of the outlier regions and another percentage, e.g., 30%, led to the correct outlier identification. In order to illustrate this problem, we set the threshold to 5% of the maximum number of points for both data sets.

Table 16 shows the number of resulting clusters from applying DBSCAN and OPTICS to both data sets using the two Voronoi diagrams.

Data Set | DBSCAN (Final Diagram) | OPTICS (Final Diagram) | DBSCAN (Original Diagram) | OPTICS (Original Diagram)
Dartboard | 1 | 1 | 1 | 1
Smile1 | 1 | 1 | 4 | 1
Table 16. Number of Resulting Clusters for Both Data Sets Using the Two Diagrams.

Regarding the Dartboard data set, the parameters predicted by this combination using either Voronoi diagram caused DBSCAN as well as OPTICS to produce only one cluster containing all the data points of the Dartboard data set.

Regarding the Smile1 data set, using the final Voronoi diagram to perform the prediction also led DBSCAN and OPTICS to produce only one cluster for the whole data set. On the other hand, using the original Voronoi diagram to predict MinPts and ε, and using these two parameters as input for DBSCAN, led to the correct four different clusters for this data set, as shown in Figure 34. However, these predicted parameters again caused OPTICS to produce only one cluster.

Figure 34. The Resulting Clustering of Applying DBSCAN to the Smile1 Data Set Using the Original Voronoi Diagram.


The parameters predicted using 5% of the maximum number of points as the threshold and the largest distance in a region as ε led to a correct clustering only for some data sets. More clearly, for the two used data sets, this percentage led to a correct clustering only for the Smile1 data set when using the original Voronoi diagram, whereas it did not work correctly for the Dartboard data set. Hence, in order to use this combination, and since each data set differs from every other data set, the percentage has to be adjusted for each data set separately, i.e., it cannot be fully automated. Therefore, the combination of a specific percentage of the maximum number of points and the largest distance in a region, on both the final and the original diagram, is not feasible for an automatic prediction approach.

MaxRegion Percent and Largest Distance in MinRegion

This combination consists of identifying the outlier regions by setting the threshold to a specific percentage of the maximum number of points in a region, and predicting ε as the largest distance in the region that contains the minimum number of points, i.e., in MinRegion. Table 17 shows the number of resulting clusters from applying DBSCAN and OPTICS to both data sets using the two Voronoi diagrams.

Data Set | DBSCAN (Final Diagram) | OPTICS (Final Diagram) | DBSCAN (Original Diagram) | OPTICS (Original Diagram)
Dartboard | 1 | 1 | 2 | 1
Smile1 | 4 | 1 | 4 | 1
Table 17. Number of Resulting Clusters for Both Data Sets Using the Two Diagrams.

Regarding the Dartboard data set, the parameters predicted by this combination using the final Voronoi diagram caused DBSCAN and OPTICS to produce one cluster for the whole data set. However, using the original Voronoi diagram caused DBSCAN to produce two different clusters, as shown in Figure 35. Clearly, this clustering is not valid as it caused non-outlier data points to be considered outliers and therefore not included in the resulting clusters.

Figure 35. The Resulting Clustering of Applying DBSCAN to the Dartboard Data Set Using the Original Voronoi Diagram.

With regard to the Smile1 data set, using both the final Voronoi diagram and the original Voronoi diagram to predict MinPts and ε, and applying DBSCAN with these predicted parameters as input, led to the correct four different clusters for this data set, as shown in Figure 34. However, these predicted parameters again caused OPTICS to produce only one cluster.

As there exists no single common percentage for setting the threshold for all data sets with this combination, the combination of a specific percentage of the maximum number of points and the largest distance in MinRegion, on both the final and the original diagram, is not feasible for an automatic prediction approach.

MaxRegion Percent and Average Distance in MinRegion

The final combination consists of identifying the outlier regions by setting the threshold to a specific percentage of the maximum number of points in a region, and predicting ε as the average distance in the region that contains the minimum number of points, i.e., in MinRegion. Table 18 shows the number of resulting clusters from applying DBSCAN and OPTICS to both data sets using the two Voronoi diagrams.

Data Set | DBSCAN (Final Diagram) | OPTICS (Final Diagram) | DBSCAN (Original Diagram) | OPTICS (Original Diagram)
Dartboard | 1 | 1 | 1 | 1
Smile1 | 4 | 1 | 4 | 1
Table 18. Number of Resulting Clusters for Both Data Sets Using the Two Diagrams.

Regarding the Dartboard data set, the parameters predicted by this combination using either Voronoi diagram caused DBSCAN as well as OPTICS to produce only one cluster containing all the data points of the Dartboard data set.

With regard to the Smile1 data set, using both the final Voronoi diagram and the original Voronoi diagram to predict MinPts and ε, and applying DBSCAN with these predicted parameters as input, led to the correct four different clusters for this data set, as shown in Figure 34. However, these predicted parameters again caused OPTICS to produce only one cluster.

As there exists no single common percentage for setting the threshold for all data sets with this combination, the combination of a specific percentage of the maximum number of points and the average distance in MinRegion, on both the final and the original diagram, is not feasible for an automatic prediction approach.

Obviously, none of the twelve different combinations for predicting MinPts and ε worked well. Therefore, none of these twelve combinations can be used to predict the parameters of the two density-based algorithms DBSCAN and OPTICS. However, PROD might be modified to achieve solid results for these two algorithms. One approach could be to change the prediction of the parameter MinPts. Another modification could be to find a different threshold for identifying outlier data points, which might lead to better results than the previously tested thresholds.


7. Conclusion

The big data era is causing the world to drown in data while still lacking the useful information and the required knowledge contained in it. Solving this problem requires an analysis of this huge amount of data in order to extract this knowledge. Many reference process models, such as the KDD process [2], exist in order to guide the analytical process performed by the user. The most essential step in such analytical processes is the Data Mining step, since the patterns, which are interpreted in later steps to deliver the new necessary knowledge, are discovered in this step.

Data Mining makes use of a set of mining algorithms which are applied to the data sets in order to generate models that deliver useful information from these data sets [4]. These mining algorithms require a set of input parameters. The decision of which mining algorithm should be applied and which setting of the respective parameters should be chosen is made by the analysts. In order to make the correct decision regarding the mining algorithm and the parameter setting, the analysts need prior technical and domain knowledge respectively. However, having such knowledge in advance is difficult. Some techniques have been introduced in order to reduce the analysts' need for technical knowledge [3]. However, overcoming the need for prior domain knowledge requires other approaches that enable the analysts to apply the mining algorithm without this prior domain knowledge. More specifically, these approaches provide the user with a good setting of the parameters required by the mining algorithm, i.e., they provide predictions for these input parameters.

Clustering is one mining technique that groups the data into a set of clusters [8]. K-Means [9], DBSCAN [13, 14, 15], and OPTICS [16] are among the most commonly used clustering algorithms. Many previous prediction approaches exist, each of which provides a prediction for the input parameters of one of these algorithms [15, 16, 19, 20, 23, 24, 25, 26, 27]. However, most of these approaches have one main issue, which is the need to apply the original clustering algorithm multiple times in order to perform the prediction [20, 23, 24, 25, 26, 27]. Another issue is the high complexity of these approaches, which makes them infeasible for big data. Therefore, a new approach is required which overcomes these issues and is able to provide a prediction for the input parameters.

In this thesis, the Position-Based Prediction Using Voronoi Diagrams (PROD) approach was introduced. This approach produces a prediction for the input parameter of one of the most commonly used clustering algorithms, the K-Means algorithm. The prediction is made by PROD using Voronoi diagrams [7], a space partitioning approach. This prediction is performed for two-dimensional data sets; however, it can easily be expanded to multiple dimensions.

In order to produce the prediction, PROD performs the following four main steps: (1) sampling the data set using the new position-based sampling approach, (2) space partitioning of the space containing the sampled data set using Voronoi diagrams, (3) fine tuning of the partitioned space using an approach for inspecting and merging the Voronoi regions, and finally (4) predicting the required input parameters using the final produced Voronoi diagram.


The new prediction approach PROD was implemented and evaluated. The evaluation of this approach is mainly divided into two parts: (1) evaluating the ability of PROD to produce a prediction for the input parameter k for the K-Means algorithm, and (2) evaluating the ability to predict the input parameters for DBSCAN and OPTICS algorithms as well. In order to perform these evaluations, twenty-eight different data sets were used in all the experiments.

First, evaluating the prediction made by PROD for the parameter k required by the K-Means algorithm is again divided into four parts, which concern evaluating: the accuracy of the predicted k, the Sum of Squared Errors of the resulting clusters, the time needed to perform the prediction, and the number of iterations required by K-Means when taking the prediction made by PROD as input.

On one hand, the evaluation of the accuracy of the predicted k and the evaluation of the time required by PROD are both done by comparing PROD with the predictions made by previous prediction approaches. Three previous prediction approaches, i.e., Improved K-Means, the Silhouette Method, and the Elbow Method, were chosen in order to perform these two evaluations. The result of these evaluations indicated that PROD is better, with respect to both time and accuracy, than the other three previous approaches. More precisely, PROD produces a prediction which reduces the solution space for searching for the optimal k, i.e., the analyst only has to check the range [k_pred - 2, k_pred + 2] in order to find the optimal k instead of [k_pred - 4, k_pred + 4] for the best previously available solution. Hence, the overall solution space shrinks to 4 possibilities to check instead of the previous 8 possibilities.

On the other hand, the evaluation of the Sum of Squared Errors (SSE) is done by comparing the result of PROD to the result of applying K-Means with the predicted k as input twice: once with randomly initialized centroids and once with the predicted centroids as initial centroids. Since the fine tuning step produces a Voronoi diagram where each region is considered a cluster, the result of PROD is considered a clustering of the data set. More precisely, PROD is considered a partition-based clustering algorithm that is able to handle outliers. Therefore, the SSE of these resulting clusters is computed and compared to the SSE of the clusters resulting from applying the K-Means algorithm. The result of this comparison showed that the clusters produced by PROD are better with respect to the SSE. This is due to the fact that PROD, in contrast to K-Means, is able to identify outliers and exclude them from the resulting clusters. More precisely, PROD identifies a reasonable amount of outlier data points for each data set. Moreover, the predicted centroids, when used as initial centroids by the K-Means algorithm, contributed to enhancing the resulting clusters, i.e., the SSE of the clusters resulting from applying K-Means with the predicted centroids is lower than the SSE of the clusters resulting from applying K-Means with randomly initialized centroids.

Finally, the evaluation of the number of iterations is done by comparing K-Means with randomly initialized centroids to K-Means with the predicted centroids as initial centroids. This comparison revealed that the predicted centroids lead to fewer iterations of the K-Means algorithm. More precisely, using the predicted centroids as initial centroids when applying K-Means saves 7 iterations on average, which means a better runtime performance of K-Means.


Second, evaluating the ability of PROD to predict the input parameters for the DBSCAN and OPTICS algorithms is done by evaluating twelve different approaches for producing a prediction for MinPts and ε. These twelve approaches were implemented and tested. Each produced prediction was tested by applying DBSCAN and OPTICS with the predicted parameters. Unfortunately, the experiments showed that none of these twelve approaches was able to correctly predict these two input parameters. Therefore, in its current state, PROD cannot be used to predict the input parameters of DBSCAN and OPTICS as expected. Further research has to unveil whether further improvements yield better results for density-based algorithms.

Eventually, PROD is an approach that can be used either: (1) as a standalone partition-based clustering approach that can handle outlier data points and achieves better results than the original K-Means algorithm with respect to runtime and SSE, or (2) as a prediction approach that is able to predict, for each data set, the number of clusters required as input to the K-Means algorithm, such that the optimal k lies in the range [k_pred - 2, k_pred + 2]. Additionally, when PROD is used as a prediction approach for the number of clusters, it produces a prediction for the initial centroids to be used by the K-Means algorithm instead of the randomly initialized centroids, such that these predicted centroids enhance the performance of K-Means in terms of the runtime of the algorithm.

Moreover, comparing the prediction performed by PROD with the prediction made by previously existing approaches, specifically the best existing approach, Improved K-Means, revealed that PROD: (1) predicts a more accurate number of clusters, (2) produces clusters of higher quality as measured by the SSE, (3) is faster by a factor of 26.5, i.e., has a lower runtime, and (4) saves on average 7 of the iterations required by K-Means.


8. Outlook

The Position-Based Prediction Using Voronoi Diagrams (PROD) approach introduced in this thesis is, as shown previously, superior to the other three prediction approaches. However, PROD can be further enhanced and extended in future work with respect to multiple aspects.

Currently, PROD can only be applied to two-dimensional data sets. Therefore, one extension of this approach is to make it applicable to more than two dimensions, i.e., to n-dimensional data sets. This can be done mainly by adjusting the space partitioning step as well as applying some modifications to all other steps performed by PROD.

Moreover, the performance of PROD can be further enhanced by utilizing the capabilities of modern parallel and distributed computing systems. More precisely, these distributed systems allow not only higher scalability but also higher performance in terms of better runtime. For instance, one important framework used for distributed data processing is MapReduce [46, 47]. This framework processes large-scale data sets in a distributed computer cluster, and it can be used to allow for a distributed processing of PROD. Furthermore, Apache Spark is another important framework which can also be used to allow for a parallel and distributed processing of PROD [48].

Additionally, finding a working prediction for DBSCAN and OPTICS is an important aspect on which future work can focus. This can be done by testing other approaches which may produce usable results. On one hand, this can be either in the fine tuning step, i.e., by finding and testing another approach for the inspection and merging of the Voronoi regions, or in the prediction step, i.e., by finding another approach for producing the final prediction. On the other hand, this can also be done by using another space partitioning algorithm.

Furthermore, another aspect on which future work can focus is to extend the prediction step of PROD in order to enable it to produce predictions for the input parameters of other mining algorithms. More clearly, there might be mining algorithms other than DBSCAN and OPTICS for which the resulting Voronoi diagram can be used to predict the input parameters, in addition to predicting the K-Means parameter. Therefore, future work can focus on finding other mining algorithms, e.g., CURE [49] and DENCLUE [50], and evaluating how the Voronoi diagram can be used to predict input parameters for these algorithms.

Additionally, another aspect in which PROD can be enhanced is the use of other space partitioning approaches. More precisely, future work can focus on choosing a space partitioning approach other than Voronoi diagrams and evaluating the performance of PROD with it. Since PROD tackled the problem with a visual approach, future work can rather focus on algorithmic approaches such as KD-Trees [38] or R-Trees [39].

Furthermore, these other space partitioning approaches might work better for the purpose of predicting input parameters for more than one mining algorithm. More clearly, other space partitioning approaches might produce better results that can be used to predict input parameters for multiple mining algorithms in addition to predicting the number of clusters for K-Means.

Therefore, future work can focus on: (1) making PROD applicable to n-dimensional data, (2) distributing PROD in a parallel processing cluster, (3) finding a working prediction for DBSCAN and OPTICS, and (4) evaluating other space partitioning approaches and finding how and for which mining algorithms these approaches can be used to perform a correct prediction of the input parameters.


9. References

[1] Kudyba, S., & Kwatinetz, M. (2014). Introduction to the Big Data Era. Big Data, Mining, and Analytics: Components of Strategic Decision Making, 1.
[2] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27–34. https://doi.org/10.1145/240455.240464.
[3] Gartner Press Release. https://www.gartner.com/newsroom/id/3570917.
[4] Han, J., & Kamber, M. (2000). Data Mining: Concepts and Techniques, 3–26. https://doi.org/10.1002/1521-3773(20010316)40:6<9823::AID-ANIE9823>3.3.CO;2-C.
[5] Garner, S. R. (1995). WEKA: The Waikato environment for knowledge analysis. Proceedings of the New Zealand Computer Science Research Students Conference, 57–64.
[6] Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., & Leyton-Brown, K. (2016). Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 17, 1–5.
[7] Aurenhammer, F., & Klein, R. (1998). Voronoi diagrams. In J.-R. Sack & J. Urrutia (Eds.), Handbook of Computational Geometry. Elsevier Science Publishers B.V. North-Holland, Amsterdam.
[8] Andritsos, P. (2002). Data clustering techniques. Rapport Technique, University of Toronto, Department of Computer Science, 1–34. Retrieved from ftp://ftp.cs.toronto.edu/cs/ftp/pub/reports/csri/443/depth.pdf.
[9] Zhang, Z. (2012). K-means Algorithm, 1–16. In Data Mining. Retrieved from http://user.engineering.uiowa.edu/~ie_155/lecture/K-means.pdf.
[10] He, Z. (2006). Farthest-Point Heuristic based Initialization Methods for K-Modes Clustering, (2), 7. Department of Computer Science and Engineering, Harbin Institute of Technology. Retrieved from http://arxiv.org/abs/cs/0610043.
[11] Arthur, D., & Vassilvitskii, S. (2007). K-Means++: The Advantages of Careful Seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035. https://doi.org/10.1145/1283383.1283494.
[12] Huang, Z. (1998). Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 2(3), 283–304. https://doi.org/10.1023/A:1009769707641.
[13] Mining, D. (2011). DBSCAN: A Density-Based Spatial Clustering of Application with Noise. Linköpings Universitet - ITN 2011-11-30, 1–8.
[14] Sander, J., Ester, M., Kriegel, H.-P., & Xu, X. (1998). Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Institute for Computer Science, University of Munich, 194(1), 169–194. https://doi.org/10.1023/A:1009745219419.
[15] Ester, M., Xu, X., Kriegel, H.-P., & Sander, J. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 226–231.


[16] Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering Points to Identify the Clustering Structure. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data - SIGMOD '99, 49–60. https://doi.org/10.1145/304182.304187.
[17] Nelsen, R. (2006). Springer Series in Statistics. An Intro. https://doi.org/10.1007/978-0-387-98135-2.
[18] Israel, G. (1992). Determining Sample Size. University of Florida Cooperative Extension Services, Institute of Food and Agriculture Sciences, 85(3), 108–113. https://doi.org/10.4039/Ent85108-3.
[19] Birodkar, V., & Edla, D. R. (2014). Enhanced K-Means Clustering Algorithm using A Heuristic Approach. Journal of Information and Computing Science, 9(4), 277–284.
[20] Zhang, H., Yu, H., Li, Y., & Hu, B. (2015). Improved K-means Algorithm Based on the Clustering Reliability Analysis. International Symposium on Computers & Informatics (ISCI), 2516–2523.
[21] Huang, S. L. T., & Tan, Y. (2011). Research of clustering algorithm based on k-means. [J] Computer Technology and Devel, 21(7), 54–57.
[22] Lei, F. L. X., Xie, K., & Xia, Z. (2008). An efficient clustering algorithm based on local optimality of k-means. [J] Journal of Software, 19(7), 1683–1692.
[23] Likas, A., Vlassis, N., & Verbeek, J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2), 451–461. Retrieved from http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V14-45TTJY0-7&_user=6068188&_coverDate=02.
[24] Kodinariya, T. M., & Makwana, P. R. (2013). Review on determining number of Cluster in K-Means Clustering. International Journal of Advance Research in Computer Science and Management Studies, 1(6), 2321–7782.
[25] Yan, M. (2005). Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion, 120. Blacksburg, Virginia. Retrieved from https://pdfs.semanticscholar.org/8492/87862e01639378d2301fe5489378df4adf59.pdf.
[26] Caramanis, L. (2013). Lecture 3 - January 22: k-means Algorithm, 1–6. EE 381V: Large Scale Learning. Retrieved from http://users.ece.utexas.edu/~sanghavi/courses/scribed_notes/Lecture_3_Scribe_Notes.pdf.
[27] Pelleg, D., & Moore, A. W. (2000). X-means: Extending K-means with efficient estimation of the number of clusters. Proceedings of the Seventeenth International Conference on Machine Learning, 727–734. https://doi.org/10.1007/3-540-44491-2_3.
[28] Myung, J. (2004). Model Selection Methods. Amsterdam Workshop on Model Selection, Ohio State University. Retrieved from http://faculty.psy.ohio-state.edu/myung/personal/model%20selection%20tutorial.pdf.
[29] Sugar, C., & Gareth, J. (2003). Finding the number of clusters in a data set: An information theoretic approach. Journal of the American Statistical Association, 98, 750–763. https://doi.org/10.1198/016214503000000666.
[30] Fortune, S. (1987). A Sweepline Algorithm for Voronoi Diagrams. Algorithmica, 2(2), 153–174.
[31] https://github.com/deric/clustering-benchmark.
[32] Shape Sets. https://cs.joensuu.fi/sipu/datasets/.


[33] https://www.uni-marburg.de/fb12/arbeitsgruppen/datenbionik/data?language_sync=1.
[34] A fast Java implementation of Fortune's Voronoi algorithm. http://ageeksnotes.blogspot.de/2010/11/fast-java-implementation-fortunes.html.
[35] SourceForge: Simple Voronoi Library. https://sourceforge.net/projects/simplevoronoi/.
[36] SPMF: An Open-Source Data Mining Library. http://www.philippe-fournier-viger.com/spmf/.
[37] Devulapalli, R., Quist, M., & Carlsson, J. G. (2013). Spatial partitioning algorithms for data visualization, 90170V. Industrial and Systems Engineering, University of Minnesota, Minneapolis, USA. https://doi.org/10.1117/12.2042607.
[38] Dolci, C., Salvini, D., Schrattner, M., & Weibel, R. (2010). Spatial Partitioning and Indexing. Geographic Information Technology Training Alliance. Retrieved from http://www.gitta.info/SpatPartitio/en/text/SpatPartitio.pdf.
[39] Balasubramanian, L., & Sugumaran, M. (2012). A State-of-Art in R-Tree Variants for Spatial Indexing. International Journal of Computer Applications, 42(20), 35–41. https://doi.org/10.5120/5819-8132.
[40] Gold, C. M. (1990). Advantages of the Voronoi Spatial Model, 1–10. Research Center in Geomatics, Laval University, Quebec, Canada. Retrieved from https://pdfs.semanticscholar.org/13db/35133b613038ceb635e7e367749d87a94a9a.pdf.
[41] Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. CRISP-DM Consortium, 76. https://doi.org/10.1109/ICETET.2008.239.
[42] Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., & Sculley, D. (2017). Google Vizier: A Service for Black-Box Optimization. In KDD '17, p. 10. https://doi.org/10.1145/3097983.3098043.
[43] Thornton, C., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2013). Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '13, p. 847. https://doi.org/10.1145/2487575.2487629.
[44] Chen, K. K-means Clustering, 1–22. COMP24111 Machine Learning, The University of Manchester. Retrieved from http://syllabus.cs.manchester.ac.uk/ugt/2017/COMP24111/materials/slides/K-means.pdf.
[45] http://lith.me/code/2015/06/08/Nearest-Neighbor-Search-with-KDTree/. Last accessed on 3rd of April 2018.
[46] Gillick, D., Faria, A., & DeNero, J. (2006). MapReduce: Distributed Computing for Machine Learning, 1–12. https://doi.org/10.1.1.111.9204.
[47] Chen, H., Vasardani, M., & Winter, S. (2017). Geo-referencing Place from Everyday Natural Language Descriptions. ACM Transactions on Spatial Algorithms and Systems. Retrieved from https://arxiv.org/pdf/1710.03346.pdf.
[48] da Silva Morais, T. (2015). Survey on Frameworks for Distributed Computing, 95–105. Proceedings of the 10th Doctoral Symposium in Informatics Engineering - DSIE'15. https://paginas.fe.up.pt/~prodei/dsie15/web/papers/dsie15_submission_7.pdf.
[49] Guha, S., & Rastogi, R. (2001). CURE: An efficient clustering algorithm for large databases. Information Systems, 26(1), 35–58. Elsevier Science Ltd. https://pdfs.semanticscholar.org/0788/f8a7feaa84aac29100e668b29ad19ab57e16.pdf.


[50] Hinneburg, A., & Gabriel, H. H. (2007). DENCLUE 2.0: Fast clustering based on kernel density estimation. In Proc. 2007 Int. Conf. Intelligent Data Analysis (IDA'07), 70–80. https://doi.org/10.1007/978-3-540-74825-0_7.
[51] https://github.com/ArlindNocaj/power-voronoi-diagram.
[52] https://github.com/serenaz/voronoi.
[53] https://github.com/aschlosser/voronoi-java.
[54] https://github.com/Rogach/jopenvoronoi.
[55] https://github.com/d3/d3-voronoi.
[56] MOA: Machine Learning for Streams. https://moa.cms.waikato.ac.nz/.
[57] ELKI: Environment for Developing KDD-Applications Supported by Index-Structures. https://elki-project.github.io/.
[58] https://deeplearning4j.org/.

All websites were last accessed on 6th of April 2018.


10. Appendix

This section provides a screenshot of each of the twenty-eight data sets used in all experiments. The screenshots cover the following data sets: 4C, 20C, Aggregation, Compound, Curet0, Curet1, D31, Dartboard, Diamond2, Diamond9, DS, Elliptical, Flame, Fourty, Jain, Long1, Longsquare, Lsun, R15, Shapes4, Shapes9, Smile1, Spherical4, Spherical6, Triangle, Xclara, Zelnik4, and Zelnik5.

Declaration

I hereby declare that the work presented in this thesis is entirely my own. I did not use any other sources and references than the listed ones. I have marked all direct or indirect statements from other sources contained therein as quotations. Neither this work nor significant parts of it were part of another examination procedure. I have not published this work in whole or in part before. The electronic copy is consistent with all submitted copies.

Date and Signature:

17.04.2018
