Master Thesis Automatic Data

Eindhoven University of Technology MASTER Automatic data cleaning Zhang, J. Award date: 2018 Link to publication Disclaimer This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration. General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain Department of Mathematics and Computer Science Data Mining Research Group Automatic Data Cleaning Master Thesis Ji Zhang Supervisor: Dr. ir. Joaquin Vanschoren Assessment Committee Members: Dr. ir. Joaquin Vanschoren Dr. ir. Nikolay Yakovets Dr. ir. Michel Westenberg Eindhoven, November 2018 Abstract In the machine learning field, data quality is essential for developing an effective machine learning model. However, raw data always contain various kinds of data problems such as duplicated records, missing values and outliers, which may weaken the model power severely. Therefore, it is vital to clean data thoroughly before proceeding with the data analysis step. The process that cleans the potential problems in the data is called data cleaning. Unfortunately, although inevitable and primary as data cleaning is, it is also quite a tedious and time-consuming task. People do not want to repeat this process endlessly and hence expect a tool to help them clean data automatically. In this thesis, a Python tool is developed in order to fulfill this expectation. This tool is able to identify the potential issues in the data and report results and recommendations such that users can clean data smoothly and effectively with its assistance. Compared with existing data cleaning tools, this tool is specially designed for addressing machine learning tasks and can find the optimal cleaning approach according to the characteristics of the given dataset. There are three aspects meaningfully automated in this thesis: automatic discovery of data types, automatic missing value handling, and automatic outlier detection. Automatic Data Cleaning iii Preface First and foremost, I would like to thank my supervisor Dr. Joaquin Vanschoren for his dedic- ation and guidance throughout this master thesis work. With his encouragement, I can find my passion for data science and have the freedom to pursue this thesis. Special thanks to Dr. Michel Westenberg and Dr. Nikolay Yakovets for their valuable suggestions on this thesis and assessing my project. Many thanks to my great colleague students for their kindness to offer me help at any time. It was really a joy working alongside them. In addition, I would like to thank my wonderful friends for listening, offering me advice, and especially for always being there for me. Finally, I would like to express my gratitude to my parents, Jianju and Jianxin, who have constantly in- spired me to study as much as possible, and given me the opportunity to do so. Without all these people, I would not be here. Automatic Data Cleaning v Contents Contents vii List of Figures ix List of Tables xi 1 Introduction 1 1.1 Motivation........................................1 1.2 Thesis Objective.....................................1 1.3 Results...........................................2 1.4 Outline..........................................3 2 Problem Statement4 2.1 Automatic Discovery of Data Types..........................4 2.1.1 Problem Formulation..............................4 2.1.2 Challenges....................................5 2.2 Automatic Missing Value Handling...........................5 2.2.1 Problem Formulation..............................5 2.2.2 Challenges....................................5 2.3 Automatic Outlier Detection..............................5 2.3.1 Problem Formulation..............................5 2.3.2 Challenges....................................6 3 Literature Analysis7 3.1 Data Cleaning......................................7 3.2 Automatic Discovery of Data Type...........................8 3.2.1 Useful Data Types in Machine Learning....................8 3.2.2 Data Type Discovery Techniques........................9 3.3 Automatic Missing Value Handling........................... 10 3.3.1 Missing Data Mechanisms............................ 10 3.3.2 Missing Value Handling Techniques....................... 11 3.4 Automatic Outlier Detection.............................. 15 3.4.1 Categorization of Outlier Detection....................... 15 3.4.2 Outlier Detection Techniques.......................... 15 3.4.3 Dealing with Outliers.............................. 19 3.5 Visualization Techniques................................. 20 3.5.1 Bar Chart..................................... 20 3.5.2 Box Plot..................................... 20 3.5.3 Pie Chart..................................... 21 3.5.4 Scatter plot.................................... 22 3.5.5 Heatmap..................................... 22 3.5.6 Parallel Coordinates Plot............................ 23 Automatic Data Cleaning vii CONTENTS 3.6 Related Work....................................... 23 3.6.1 Data Analytical Tools.............................. 23 3.6.2 Data Cleaning Tools............................... 24 4 Methodologies and Results 26 4.1 Automatic Discovery of Data Types.......................... 26 4.1.1 Discover Basic Data Types........................... 27 4.1.2 Discover Statistical Data Types......................... 28 4.1.3 Results...................................... 31 4.2 Automatic Missing Value Handling........................... 32 4.2.1 Detect Missing Values.............................. 33 4.2.2 Identify Missing Mechanisms.......................... 34 4.2.3 Clean Missing Values............................... 37 4.3 Automatic Outlier Detection.............................. 39 4.3.1 Meta-learning for Outlier Detection...................... 40 4.3.2 Benchmarking of Outlier Detection Algorithms................ 43 4.3.3 Recommendation Model............................. 45 4.3.4 Visualize Outliers................................. 46 4.4 Other Capabilities.................................... 48 4.4.1 Show Important Features............................ 48 4.4.2 Show Statistical Information.......................... 49 4.4.3 Detect Duplicated Records........................... 49 4.4.4 Unify Inconsistent Capitalization........................ 49 5 Conclusions 51 5.1 Contributions....................................... 51 5.2 Future Work....................................... 51 Bibliography 53 Appendix 59 A Demo 59 A.1 Acquire Datasets from OpenML............................ 59 A.2 Auto Clean with Automatic Cleaning Tool....................... 60 A.2.1 Input dataset ID................................. 60 A.2.2 Show important features............................. 60 A.2.3 Show statistical information........................... 60 A.2.4 Automatic discovery of data types....................... 61 A.2.5 Detect duplicated rows.............................. 61 A.2.6 Detect inconsistent column names....................... 61 A.2.7 Automatic missing value handling....................... 62 A.2.8 Automatic Outlier Detection.......................... 64 B Datasets 67 B.1 Datasets Used in Automatic Discovery of Data Types................ 67 B.2 Datasets Used in Automatic Outlier Detection.................... 67 viii Automatic Data Cleaning List of Figures 3.1 A brief machine learning process............................7 3.2 Low rank representation of a dataset.......................... 10 3.3 Listwise and pairwise deletion.............................. 11 3.4 Mean and regression imputation............................ 12 3.5 Multiple imputation process [56]............................ 13 3.6 Matrix factorization................................... 14 3.7 Outlier detection modes depending on the availability of labels in the dataset [23]. 16 3.8 Standard Deviation [71]................................. 16 3.9 One-class Support Vector Machine........................... 17 3.10 Compute reachability distance (k=3).......................... 18 3.11 Local outlier factor [68]................................. 18 3.12 Isolation Forest [15]................................... 19 3.13 Deal with outliers by log transformation........................ 20 3.14 Bar chart......................................... 20 3.15 Box plot.......................................... 21 3.16 Pie chart......................................... 21 3.17 Scatter plot........................................ 22 3.18 Heatmap.......................................... 22 3.19 Parallel Coordinates Plot [35].............................. 23 4.1 Workflow of discovering the statistical data types................... 27 4.2 Discover basic data types for a parsed OpenML dataset............... 28 4.3 Discover basic data types for an unparsed csv format dataset............ 28 4.4 Input of the Bayesian model..............................

Master Thesis Automatic Data

Categorical and Numerical Data Examples

Contributions to Biostatistics: Categorical Data Analysis, Data Modeling and Statistical Inference Mathieu Emily

Compliance, Safety, Accountability: Analyzing the Relationship of Scores to Crash Risk October 2012

General Latent Feature Models for Heterogeneous Datasets

Advanced Quantitative Data Analysis: Methodology for the Social Sciences

A Probabilistic Programming Approach to Probabilistic Data Analysis

Poisson Regression Model with Application to Doctor Visits

Computing Functions of Random Variables Via Reproducing Kernel Hilbert Space Representations

Mathieu Emily

Statistical Diagnostics of Models for Count Data

Probabilistic Data Analysis with Probabilistic Programming

Automatic Discovery of the Statistical Types of Variables in a Dataset