POLITECNICO DI TORINO
M.Sc. in Computer Science Engineering
DAUIN Department

Automated Data Exploration to Discover Structures and Models Hidden in the Data

Master's candidate: Paolo BETHAZ
Supervisor: Prof. Tania CERQUITELLI

October 2019

To all my family

Acknowledgements

Turin, October 2019

I would like to thank all the people who supported me while writing this thesis. First of all, I thank my whole family, in particular my parents and my brother, who have always been close to me and patiently helped me reach the end of this university career. A special thanks also goes to my supervisor, Professor Tania Cerquitelli, who gave me the opportunity to work on this very interesting topic, always guiding me with great patience and availability. Moreover, thanks to all my friends, the old ones and those I met more recently along my journey, for the fun and carefree moments spent together over these years. I know I can always count on your good mood. Last but not least, thanks to Stefania. Not only for having taught me to use LaTeX, but for always supporting me in this last period, sharing many joyful moments that allowed me to write this thesis with more serenity.

Contents

1 Introduction
  1.1 Techniques Used
2 State of the art
  2.1 Datawrapper
  2.2 AnswerMiner
  2.3 JetPack Data
  2.4 DataPine
  2.5 Dive
  2.6 NCSS
  2.7 ML Sandbox
3 Automated Data Exploration
  3.1 Data type identification
  3.2 Data characterization
  3.3 Outlier detection
  3.4 Self-tuning cluster analysis
    3.4.1 K-Means Clustering
    3.4.2 DBSCAN
    3.4.3 Hierarchical Clustering
  3.5 Classification
  3.6 Data Transformation
  3.7 Visualization
4 Architecture and Workflow
  4.1 Data
  4.2 Data Characterization
  4.3 Exploratory and Unsupervised Learning
  4.4 Data Transformation
  4.5 Storytelling
5 Results and evaluation
  5.1 Iris Dataset
    5.1.1 Data
    5.1.2 User Exploration
    5.1.3 Storytelling
  5.2 Prescription Dataset
    5.2.1 Data
    5.2.2 User Exploration
    5.2.3 Data Transformation
  5.3 Storytelling
6 Conclusions

List of Figures

2.1 First two steps of the Datawrapper application
2.2 Third and fourth steps of the Datawrapper application
2.3 General overview of the functionality offered by AnswerMiner
2.4 Layout of the Datapine analysis page
2.5 Four stages of a data exploration workflow in DIVE: (A) Datasets, (B) Visualize, (C) Analysis, and (D) Stories
2.6 Procedures offered by NCSS analysis
3.1 Traditional knowledge extraction process
3.2 Automated workflow proposed by the framework
3.3 Different parts of a boxplot [6]
3.4 Example of how one-hot encoding works [9]
3.5 Result of the elbow method for choosing the best number of clusters [10]
3.6 Finding the elbow of a curve by looking for the point with the greatest distance from the line b (drawn between the first and the last point of the curve) [12]
3.7 Example of density-based clustering on the left; core, border and noise points on the right [7]
3.8 Hierarchical clustering using 'min' as the criterion of similarity between two clusters [7]
3.9 Hierarchical clustering using 'max' as the criterion of similarity between two clusters [7]
3.10 Hierarchical clustering using 'average' as the criterion of similarity between two clusters [7]
3.11 Hierarchical clustering using 'ward' as the criterion of similarity between two clusters [7]
4.1 Map containing the different data exploration tasks offered by the framework
4.2 Workflow of the automated data exploration proposed by the framework
4.3 Upload page
4.4 Radar plot schema [16]
4.5 Example of a decision tree applied to the Iris dataset [14][17]
4.6 The two attributes 'Id' and 'Exam' are merged into a single column
4.7 Example of how k-fold cross-validation works [18]
4.8 Example of a confusion matrix with two labels (positive and negative); true positives (TP) and true negatives (TN) are the values classified correctly [20]
5.1 Section 'Data'. First page shown; the user can scroll the table horizontally and vertically to see the whole dataset
5.2 Section 'Info'. The user can see some statistics on the various attributes
5.3 Section 'Info'. 'Species' is the only textual attribute; histogram containing the number of elements for each species
5.4 Section 'Info'. Correlation matrix
5.5 Section 'Info'. Boxplots for each numerical field
5.6 Section 'Info'. CDF and histograms for each numerical field
5.7 Section 'Clustering'. 3D scatter chart showing the outliers
5.8 Section 'Clustering'. Elbow graph and 3D scatter chart for K-means; the radar chart is obtained by clicking on centroid2 in the scatter chart; to the right of the scatter chart there is also a pie chart
5.9 Section 'Clustering'. 3D scatter chart and pie chart for DBSCAN; to the left of the scatter chart there is also a k-distance graph; the stacked bar chart is obtained by clicking on the 'Compare with original label' button
5.10 Section 'Clustering'. Dendrogram and 3D scatter chart for hierarchical clustering; the popup window is obtained by clicking on the 'See decision tree' button
5.11 Section 'Storytelling'. First page
5.12 Section 'Storytelling'. First page; result obtained by clicking on the correlation matrix cell highlighted in the previous image
5.13 Section 'Storytelling'. Second page
5.14 Section 'Storytelling'. Second page; result obtained by clicking on any point in cluster0
5.15 Section 'Storytelling'. Third page
5.16 Section 'Data'. First page after the upload page
5.17 Vertical navigation bar present on all pages; in this case there is also the section 'Data Transformation'
5.18 Section 'Clustering'. The structure of this section is the same as that reported for the Iris dataset, but in this case, since there is no label, the 'Compare with original label' button is no longer present
5.19 Section 'Data Transformation'
5.20 Section 'Data Transformation'. Result obtained after clicking on the transformation highlighted in the previous image
5.21 Section 'Storytelling'. First page
5.22 Section 'Storytelling'. Second page
5.23 Section 'Storytelling'. Third page
5.24 Section 'Storytelling'. Third page; window opened after clicking on the transformation highlighted in the previous image

List of Tables

2.1 Dataset schema
4.1 Initial dataset
4.2 Dataset after transformation
5.1 Iris dataset
5.2 Prescription dataset

Chapter 1

Introduction

In today's world, the value of data is severely undermined by our inability to translate it into actionable knowledge. Up to now, a great deal of research effort has been devoted to enhancing the effectiveness and efficiency of analytics algorithms. However, their exploitation is still a multi-step process that requires considerable expertise to be carried out correctly. To streamline the knowledge extraction process and make data analytics tasks more approachable, we set out to design and develop a new generation of data analytics solutions that automatically discover the descriptive, predictive and prescriptive models and structures hidden in the data, without requiring the intervention of a data scientist. A key feature of such solutions is the automation of the full data analytics workflow, from heterogeneous data ingestion to result presentation.

To this end, in this thesis work we implemented a new framework that aims to be as user-friendly as possible, so that it can be accessible to everyone. Furthermore, since our framework analyzes a dataset automatically, minimizing the contribution of the data scientist, the skills required of its users are minimal. We built it so that all users, even those less experienced in the field, can use it easily, moving toward the concept of democratizing data science.

1.1 Techniques Used

To implement our new framework we chose Python as the programming language, as it is one of the most widely used languages for data mining and machine learning. In particular, the libraries and modules we used most to produce this work are the following:

• Pandas, a library for manipulating data in tabular or sequential format. It is very useful for loading and saving standard tabular data formats, such as CSV and Excel files
• Matplotlib, a very powerful and flexible library for creating graphics, which allows many types of plots to be produced in a simple and intuitive way
• Seaborn, a Matplotlib-based data visualization library that provides a high-level interface for drawing information-rich statistical graphics
• Scikit-learn (sklearn), a machine learning library that contains all the most important classification, regression and clustering algorithms

An illustrative sketch of how these libraries can be combined in a typical exploration step is reported at the end of this section.

To build the architecture of our web application we used Flask, a micro-framework for Python web applications [1]. The graphical part of our web pages is then managed with CSS and JavaScript. Of particular interest was Highcharts [2], a library written in pure JavaScript that allowed us to create many interactive, high-quality charts.
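As an illustration of how the libraries listed above fit together in a typical exploration step, the sketch below loads a tabular dataset with Pandas, prints per-attribute statistics, clusters the numerical attributes with scikit-learn's K-means, and plots the result with Matplotlib. The file name, the choice of features and the number of clusters are hypothetical placeholders; this is not the framework's actual source code.

```python
# Illustrative sketch only: file name, feature selection and k are hypothetical,
# not taken from the framework's source code.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load a tabular dataset with Pandas (CSV is one of the supported formats)
df = pd.read_csv("dataset.csv")

# Basic data characterization: per-attribute statistics
print(df.describe(include="all"))

# Cluster the numerical attributes with scikit-learn
numeric = df.select_dtypes(include="number").dropna()
scaled = StandardScaler().fit_transform(numeric)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)

# Visualize the first two numerical attributes, colored by cluster label
plt.scatter(numeric.iloc[:, 0], numeric.iloc[:, 1], c=kmeans.labels_)
plt.xlabel(numeric.columns[0])
plt.ylabel(numeric.columns[1])
plt.title("K-means clustering of the uploaded dataset")
plt.show()
```

In the actual framework the number of clusters is not fixed by hand as in this sketch, but is selected automatically through the elbow method, as discussed in Chapter 3.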
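Similarly, the following is a minimal sketch of how a Flask application can expose an upload page and then show the uploaded dataset as a table, in the spirit of the web architecture described above. Route names, template files and the upload folder are assumptions made for illustration and do not correspond to the framework's actual code.

```python
# Minimal Flask sketch, assuming a hypothetical upload page: route names,
# templates and the upload folder are illustrative placeholders.
import os

import pandas as pd
from flask import Flask, redirect, render_template, request, url_for

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"
os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)


@app.route("/", methods=["GET", "POST"])
def upload():
    # Upload page: save the submitted file and redirect to the data view
    if request.method == "POST":
        file = request.files["dataset"]
        path = os.path.join(app.config["UPLOAD_FOLDER"], file.filename)
        file.save(path)
        return redirect(url_for("data", filename=file.filename))
    return render_template("upload.html")


@app.route("/data/<filename>")
def data(filename):
    # Read the uploaded dataset and pass it to the page as an HTML table
    df = pd.read_csv(os.path.join(app.config["UPLOAD_FOLDER"], filename))
    return render_template("data.html", table=df.to_html())


if __name__ == "__main__":
    app.run(debug=True)
```

In the real application, the interactive charts shown on these pages are rendered client-side with Highcharts from the data passed to the templates.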