DECEMBER 2019 THESSALONIKI – GREECE

Implementation of a web-based platform for Data Analysis, Visualization and Machine Learning

Konstantinos Mouratidis

SID: 8180012

Supervisor: Prof. Ioannis Magnisalis

Supervising Committee Members: Assoc. Prof. Christos Berberidis, Assist. Prof. Stavros Stavrinides

SCHOOL OF SCIENCE & TECHNOLOGY

A thesis submitted for the degree of Master of Science (MSc) in Data Science


Abstract

This dissertation attempts to explain the process of building an open-source platform for data science that allows users to easily upload their data or connect to data sources, visualize said data, and create machine learning models to solve common problems in the field. Everything is implemented with a simple-to-use, web-based graphical interface, and is designed for usage/deployment in a cloud environment. This project came about with the help of my supervisor and mentor, Prof. Ioannis Magnisalis, and fellow students, the team that helped in the implementation of the first version: Vaso Tsichli, George Katrilakas, and Chris Timamopoulos. A shout-out also goes to the thousands of contributors of the open-source tools that this project used, mainly plotly/dash and the SciPy stack.

Konstantinos Mouratidis

December 1st, 2019

Contents

1. Introduction ...... 5
1.1 Background ...... 5
1.2 This project ...... 6
1.3 Licensing & private/public version ...... 7
1.4 Structure ...... 8
1.5 Absence of definitions and terminology, and final remarks ...... 9
2. Literature (i.e. competitors') Review ...... 11
2.1 Overview of papers on various topics relating to implementations ...... 11
2.1.1 Data type inference ...... 11
2.1.2 Other guiding sources ...... 12
2.2 Overviews of competitors and other products ...... 13
2.2.1 Visualization ...... 13
2.2.2 Machine Learning / Statistical Modeling Software ...... 27
2.2.3 Cloud providers and services ...... 34
3. Designing an open-source solution for data visualization and analysis ...... 39
3.1 Python libraries ...... 40
3.1.1 Tech and computing stack ...... 40
3.1.2 Interface / web stack ...... 44
3.2 Database technologies ...... 48
3.2.1 Redis ...... 48
3.2.2 SQLite ...... 49
3.3 Other libraries and tools ...... 50
4. Implementing the EDA Miner tool ...... 52
4.1 Project Structure: Dash & Flask ...... 54
4.2 Per-app implementation notes ...... 61
4.2.1 Data app ...... 61
4.2.2 Visualization app ...... 65
4.2.3 Modeling app ...... 68
4.3 Database Technologies & Models ...... 74
4.4 Directory structure for project & apps, and extensions ...... 81
4.4.1 Directory structure: project ...... 81
4.4.2 Directory structure: Demo app ...... 82
4.4.3 A demonstration: Analytics REST API ...... 83
5. Conclusions & future work ...... 86
Bibliography ...... 90
Appendix 1, Gartner reports ...... 94
Appendix 2, table comparison of tools ...... 98
Appendix 3, sklearn performance ...... 99
Appendix 4, Redis benchmarks ...... 100
Appendix 5, Flask + hey Benchmarks ...... 102
Appendix 6, Contributor guidelines ...... 105
Table of contents ...... 105
Learning resources ...... 105
Style guide recommendations ...... 106
General info on project structure ...... 106
General info on contributions ...... 108
Contributions for code quality ...... 108
Contributions for visualization ...... 109
Contributions for data ...... 109
Contributions for modeling ...... 110
Contributions for deployment and scaling ...... 111
List of contributors ...... 112
Appendix 7, Directory Tree ...... 113

1. Introduction

1.1 Background

With the recent hype around Data Science, Machine Learning, and Artificial Intelligence, a lot of new tools have been created to aid the practitioners of these three domains with their daily work, and a large array of older tools is still relevant and has resurfaced. The popular debates (e.g. "R vs Python", "PyTorch vs TensorFlow", and the list goes on) have become commonplace across social media platforms, and due to self-posting on platforms like Medium1 you can find literally thousands of articles on these topics. Furthermore, each year seems to have its own "hype word(s)", usually spanning multiple subdomains of computer science and statistics. Some of them are "Deep Neural Networks" (and their derivatives), "DevOps" and "Micro-Services", and "Big Data". These guide investments from various funds to startups, as well as hordes of young talent to corporate positions with cool titles. Usually, the hype is an overstatement, and when the balloon pops, so do expectations.

Taking a broader perspective, most of what is done today is not much different from what used to be done in earlier "eras". What has really changed is how almost all of these practices are now freely and openly available to the public, wrapped in nice programmatic interfaces (APIs) that significantly aid the development process. However, even that has its caveats: despite our infinite wisdom, we still fall victim to the most recent trend. Big Data, however, is not a trend; it is simply a niche. This is not to say that big data technologies are useless, just that they are not necessary for a large part of those dealing with data. In fact, as lots of large-scale companies have discovered, they need to collect their data in so-called data lakes and warehouses, and then they find out that they have to mine them for insights, and their computers are simply not strong enough.

They need to resort to other measures, using clusters and distributed computing. But not every company can support, or even needs, that sort of infrastructure scale. In my experience so far, most datasets can easily be handled with an average gaming computer: the GPU would have to work for ~1 week to train on OpenPose or similar academic datasets, recommender system algorithms trained on 80M ratings would fit nicely into a machine with 8 GB (perhaps using out-of-core algorithms) to 32 GB of RAM, and most datasets from companies that I worked with as a freelancer could be very easily handled by a moderately fast quad-core CPU (heck, I could even handle some of it with 2010's Intel Core 2 Duo E5500). Of course, even such a computer is not always accessible, and other problems exist too. In fact, data are never enough (Wu, 2019), and good and helpful datasets are always hard to find.

Considering, then, that there is still a market for small- and medium-sized datasets, it is no wonder that a lot of tools have spawned in order to aid those companies who have data but no analysts. The employees tasked with the analysis face the same problem new practitioners in the field face: a raised knowledge bar, in the sense that they have to span multiple domains; the various tools aim at easing the learning curve. However, if you are not experienced with analysis techniques, the tools don't really help you, and the learning curve is not the only obstacle: even analysts may lack expertise in specific data domains, as well as equipment. Even more often, the super-abundance of available tools has a negative effect: when shown a new tool, a commonly occurring first thought is along the lines of "why do we need another tool?", or, when first starting, "what tool should I choose?".

1 https://medium.com

1.2 This project

This project is indeed yet another tool. Even worse, EDA Miner started as a joke, something evident from the name: it comes from the course on Exploratory Data Analysis (EDA) and the popular tool RapidMiner. A particular pain point that got us into this was the mentality that certain closed-source, expensive, and over-hyped tools "are better". In this regard, a few stand out in particular, ordered by our increasing annoyance: SPSS, SAS, Matlab, and finally SAP. RapidMiner's core is open-source, so it only narrowly escaped making the list. In all arrogance, the design philosophy was "to make a better version of the combined RapidMiner, SAP HANA, Weka, PowerBI, and Orange3" stack. We didn't, obviously, and we mention this again in the conclusions.

What EDA Miner aspires to be is a free and open-source tool that can not only handle basic tasks encountered in all of the aforementioned tools but will also learn from its users and actually incorporate machine learning in its design, so as to completely level the learning curve of data science tools. Also, the target audience is not only data experts but also (and mainly) anyone with at least a dataset and a question. In this era of buzzwords, what is probably truly unique is the "Cloud" and its benefits, so EDA Miner is a web-based application for anyone with internet access but without access to a strong enough computer (Park, 2019).

The development of EDA Miner, at the start, followed a "monkey see, monkey do" mentality, simply copying functionality from other tools and tasks, and just doing what "felt right". Midway, it turned into a serious project that is influenced by previous work on "yet another tool", MIT's Data Integration and Visualization Engine (DIVE)2, among others. Proper scheduling of features and a timetable have been introduced. The project adopted a Code of Conduct3, created Contributing guidelines4, and hosts documentation5, all available online.

1.3 Licensing & private/public version

As mentioned previously, EDA Miner is a free and open-source tool, and the code is publicly available on GitHub. As for the license of the project, the initial team decision was to go with the GNU General Public License v3.06, mainly for the condition of "source disclosure".

This decision is not set in stone and, should the need or demand arise, it might be revised or changed, but only in favor of a more permissive license. The private version was dropped from v0.3 onwards, but why did we need a private version in the first place? The public version was mostly a mirror of the private one, the main differences being a PDF-printing functionality and a login menu. There were various reasons why we decided on this, and (nearly) doubling the commit count (contributions) on our GitHub profiles was definitely not one of them. First, at that point in time, these functionalities were neither well-developed nor well-tested, and thus security was a serious consideration. Secondly, some of these, like the login, were originally meant to be used by the university (if and) when the platform is deployed on its servers, but we decided that other users might like this functionality too. Finally, it served as a great discussion and experimentation ground for the core team to work on still-unsupported features; however, since I'm the only contributor now, this isn't an issue either.

2 http://dive.media.mit.edu/
3 https://github.com/KMouratidis/EDA_miner/blob/master/CODE_OF_CONDUCT.md
4 See appendix 6, or https://github.com/KMouratidis/EDA_miner/blob/master/CONTRIBUTING.md
5 http://edaminer.com/docs
6 https://github.com/KMouratidis/EDA_miner_public/blob/master/LICENSE

1.4 Structure

This report on the project is divided into five chapters. This introductory chapter simply served as a place for me to complain and troll, but also to explain some of the motivations behind the whole project. The second chapter focuses on going over similar tools that are currently on the market and their features. We will review visualization and machine learning tools, both commercial and academic. The third chapter goes over the tools this project uses. It briefly discusses what each library/framework/tool does, the community behind it, and more. The fourth chapter is the essence of this report. We will go over all the major design decisions and considerations, how we used each tool, the overall architecture including discussions about its future, the project & directory structure, and more. The fifth and final chapter contains the closing remarks: missed opportunities, future directions, suggestions for future contributors, and a few final words and comments.

1.5 Absence of definitions and terminology, and final remarks

I expect that people reading this dissertation are at least somewhat familiar with the field, probably due to a degree, online courses, or work experience, and thus I won't go into detail about those concepts that I consider basic, e.g.: "regression", "classification", "visualizations", "databases", "object-relational mappers". It is a common adage that you "google it" when working on/with computers all day, so take responsibility for your knowledge gaps. Furthermore, academics and marketing may actually give thought to distinctions between concepts like "data mining vs statistical/machine learning", but I will probably be using them interchangeably, often subconsciously, a lot. This essay is not meant as a literary discourse, but rather a report / overview / manual / documentation / whatever. The literature review won't be the most comprehensive research on the matter (due to time and money limitations), and due to the nature of the task there isn't much literature to begin with. A lot of the material referenced will be non-academic (websites, blog posts and articles, dreaded print-screens), and I loathe that just as much as the next person, but it can't be helped.

As a final reminder, the main deliverable is the software, which I largely developed alone, often using new technologies I had no idea about, creating my own tools for a lot of tasks since available ones didn't work (e.g. Python doc tools either fail or don't work well enough), doing all that within less than 6 months (including the 2 months spent on it before taking it on as a dissertation project), and having to handle some important limitations of the framework used. To put things into perspective, this kind of software takes whole teams (of often extremely experienced people, as will become evident later) years to develop. I believe that most, if not all, of the code follows high coding standards and best practices, and the project as a whole has been structured equally well.

2. Literature (i.e. competitors’) Review

As mentioned in the introduction, this is not the first software of its kind. A lot of others have made similar efforts at assisting the various data workflows, and this project is not substantially innovative in terms of provided functionality. What EDA Miner offers is a simple (and yet unpolished) web-based interface for getting your data in and doing the standard visualizations, along with some machine learning capabilities. One thing that is different, or so we would assume since we cannot possibly have knowledge of closed-source solutions and their code, is that our system is almost entirely able to learn from usage (we discuss this in the final chapter, as it is not yet implemented). Before going into what's unique and why, we need to go over existing solutions.

2.1 Overview of papers on various topics relating to implementations

2.1.1 Data type inference

Before any action on data, any tool that aims to recommend visualizations to the user must first be made aware of the data types it has to handle. Different data types mean that different transformations (e.g. standardization for numeric data) and different visualizations (e.g. a pie graph on a float column becomes illegible) can be applied. The DIVE paper correctly points to this in the first part of their literature review (Hu, Orghian, & Hidalgo, 2018, p. 2). They mention that popular tools like Power BI implement this feature (and so does BigQuery7). An implementation they don't mention, but which we do use as inspiration, is the one from Amazon8. In DIVE they define their model as an aggregation of these systems: they take them as features and try to detect both semantic and scale types, as well as relationships. This process is aided by computing statistics.

7 https://cloud.google.com/bigquery/docs/loading-data#loading_json_files
8 https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sch-dis-ref.html, https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sch-dis.html, https://docs.aws.amazon.com/machine-learning/latest/dg/creating-a-data-schema-for-amazon-ml.html

Finally, they mention a few programming libraries they use. One stands out in particular: Messytables, by Open Knowledge Labs9. Frictionless Data, a project listed by Open Knowledge Labs10, provides its own library for data type inference, conveniently named tableschema-py11, which we initially used as well. However, this library fails to detect types when they appear in slightly non-ordinary formats which do, however, appear in the specification (e.g. "1985-03" is parsed as a string instead of yearmonth). This is probably because, just like the other libraries mentioned in DIVE, it uses a heuristic-based approach. We will discuss more on this in the third chapter about design decisions and how we deal with this particular issue, but here we would like to go over the underlying specification on Table Schemas (Walsh & Pollock). We adopt a lot of the definitions about concepts present in the Table Schema specification, such as those about tabular data and the physical and logical representation of data. For brevity, and to avoid confusion, we take data type to mean the logical representation, which we choose to categorize as one of: integer, float (number, in the specification), category, date (covering date, time, and datetime), text (string). The descriptor is something you might see us referring to as the (data) schema, which we take to contain all the information about data types, sub-types, meta-information – optionally more – as per the specification.
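To make the heuristic approach concrete, below is a minimal sketch of how such an inference step could look in Python. It is an illustration only, not the tableschema-py implementation; the format list, the category threshold, and the function name are arbitrary choices made for the example.

from datetime import datetime

def infer_type(values, category_threshold=0.1):
    """Guess the logical type of a column by attempting casts on its values."""
    def castable(cast):
        try:
            for value in values:
                cast(value)
            return True
        except (ValueError, TypeError):
            return False

    if castable(int):
        return "integer"
    if castable(float):
        return "float"
    # Try a few common date formats, including partial ones like "1985-03"
    for fmt in ("%Y-%m-%d", "%Y-%m", "%d/%m/%Y"):
        if castable(lambda v, f=fmt: datetime.strptime(v, f)):
            return "date"
    # Few distinct values relative to the column length: treat as categorical
    if len(set(values)) <= max(1, int(category_threshold * len(values))):
        return "category"
    return "text"

print(infer_type(["1985-03", "1990-07"]))  # date
print(infer_type(["12", "7", "5"]))        # integer

A real implementation would also weigh per-value votes, locales, and missing-value markers, and would produce a full descriptor (schema) rather than a single label.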

2.1.2 Other guiding sources

Due to working in Python, adherence to other, external specifications may be loosened in favor of adhering to language-specific best practices and norms, such as:
• PEP 8 – Style Guide for Python Code (van Rossum, Warsaw, & Coghlan, 2001)
• PEP 20 – The Zen of Python (Peters, 2004)
• PEP 257 – Docstring Conventions (Goodger & van Rossum, 2001)
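As a small illustration of what these conventions mean in practice (a hypothetical helper, not code from EDA Miner), a function following PEP 8 naming and a PEP 257-style docstring looks roughly like this:

def summarize_column(values):
    """Return basic descriptive information for a column.

    PEP 257 suggests a one-line summary, a blank line, and then any
    further details; PEP 8 covers the snake_case name, 4-space
    indentation, and line lengths used here.
    """
    return {"count": len(values), "unique": len(set(values))}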

9 https://github.com/okfn/messytables
10 https://okfn.org/projects/
11 https://github.com/frictionlessdata/tableschema-py

2.2 Overviews of competitors and other products

2.2.1 Visualization Software

There are probably a hundred different platforms and products for data visualization and business intelligence (the buzzword for data analysts of the previous decade), but three stand out in particular. These have been selected based on Gartner's continued faith in their companies' potential. For the past three years (see the diagrams from (Microsoft, 2017), (Microsoft, 2018), (Qlik, 2019) or the appendices) these are the only companies to have made it consistently into the "Leaders" quadrant, which is defined as a function of "completeness of vision" and "ability to execute". We will quickly go over them here, and also review the visualization capabilities and libraries of the two most dominant programming languages, Python and R. You will also find a more complete comparison of these tools in Appendix 2.

Microsoft's Power BI12

One of the most famous pieces of software in the field, with a 7.93% market share according to Datanyze's report (Datanyze), and certainly a great solution. It is an application running both locally and on the cloud with a very simple and polished graphical user interface (GUI). Since it is developed by Microsoft, it provides seamless integration with their Office stack. Additionally, they allow imports13 from various sources (see Figure 1), including but not limited to files (e.g. CSV, XML, JSON), databases (from various vendors and types, e.g. SQL Server, Oracle, IBM, PostgreSQL, SAP HANA, Amazon Redshift, Google BigQuery, MySQL, and of course Azure), and about 40 services (e.g. Facebook, GitHub, MailChimp, Zendesk). It should be noted that their team has done an excellent job in integrating R and Python as well as big data tools such as Spark and the Hadoop Distributed File System (HDFS), and in getting data from a URL. After connecting to a data source (or multiple), and optionally transforming and cleaning the data, you can create stunning visualizations with minimal effort, often with not more than a couple dozen clicks (see Figure 2).

12 https://powerbi.microsoft.com/en-us/
13 https://docs.microsoft.com/en-us/power-bi/service-get-data

What makes Power BI different from, say, MS Excel in terms of visualizations is the ability to create dashboards that you can also publish.

Figure 1 https://docs.microsoft.com/en-us/power-bi/desktop-data-sources

Power BI is also able to detect / infer your table schema (Hu, Orghian, & Hidalgo, 2018, p. 2), and offers a Mixed-Initiative Visualization System called Power BI Q&A (see Figure 3). In all its greatness, it does not offer any Machine Learning, nor is it built for use with other operating systems such as Linux. This has another serious drawback: it (seemingly) cannot be integrated with Docker (nish2288, 2018). Furthermore, while the Power BI Desktop version is free, Power BI Pro ($9.99 per month per user) and Power BI Premium ($4,995 per "dedicated cloud compute and storage resource") are not14. A comparison between all three does not exist on their website, but some limitations of the free version include: 1) a storage limit of 10 GB per user (which is pretty reasonable), 2) no API embedding (e.g. embedding visuals into PowerApps, SharePoint, etc.), 3) no Peer-to-Peer (P2P) sharing.

14 https://powerbi.microsoft.com/en-us/pricing/

You can, however, use Python, export reports to PDF, infer data schemata, and save/upload/publish your reports online (xello).

Figure 2 https://docs.microsoft.com/en-us/power-bi/desktop-what-is-desktop

Figure 3 https://docs.microsoft.com/en-us/power-bi/power-bi-visualization-introduction-to-q-and-a

Tableau15

With a market share of 14.51% (Datanyze), the self-proclaimed "analytics platform that disrupted the world of business intelligence", released in 2003 (Levy, 2013), or about 8 years before Power BI, Tableau is a highly interactive data visualization software. It features preprocessing functionalities for combining data sources, reshaping, cleaning, and inspecting the data. Similarly to Power BI, it can connect to various data sources and providers including Amazon, Azure, IBM, Google, Cloudera, SAP, and various SQL and NoSQL servers16.

Figure 4 Tableau Prep Builder https://www.tableau.com/products/prep

It features a free trial, but after that the price is high, with plans starting from $70 per user per month for Tableau Creator. There are various other offerings as well for teams and organizations, or for embedded analytics17. It also offers all the basic visualization types in an easy-to-use menu (see Figure 5), and has Desktop, Online, and Server versions. It is also able to support more complex visualizations, and a gallery (see Figure 6) that has been building up for more than a decade showcases Tableau's power. They offer training, consulting, and support as tiered services18.

15 https://www.tableau.com/
16 https://www.tableau.com/products/prep#data-sources
17 https://www.tableau.com/pricing/

Figure 5 Visualizations https://www.tableau.com/products/what-is-tableau

Figure 6 Tableau Viz Gallery https://www.tableau.com/solutions/gallery

18 https://www.tableau.com/services

Finally, Tableau gives developers tools to create their own extensions and make truly interactive dashboards with custom user interfaces. Not only that, but the extensions are pretty flexible, allowing for connecting to not-yet-supported data sources, or even running their own data science models using languages like R, Python, or Matlab. Embedded analytics enable integrations with Salesforce and Microsoft SharePoint19.

Qlik20

Qlik is, in their words, a "data analytics for modern business intelligence" tool, with a reported market share of 2.22% (Datanyze). QlikTech is probably the oldest of the bunch, having been founded in 1993, with their first product being focused on data analysis (Industrifonden).

Figure 7 Qlik Analytics Platform, https://www.qlik.com/us/products/qlik-sense?ga-link=HP_Mid1_US

They have a free tier, Qlik Sense Cloud Basic, a Business tier for $15 per user per month (including a free trial), and an Enterprise tier with flexible pricing options. They offer similar services to Tableau, like embedded analytics21, with their main product being their analytics platform. Furthermore, they offer other products like data indexing22, reporting23, conversational analytics24, and consulting, among others.

19 https://www.tableau.com/developer/tools
20 https://www.qlik.com/us/
21 https://www.qlik.com/us/bi/embedded-analytics?ga-link=HP_Mid2_US
22 https://www.qlik.com/us/products/associative-big-data-index

Using Qlik Connectors25, you can integrate your data from various sources such as files, databases, SAP, Salesforce, Azure, Amazon, Google, and more, for a total of over 100 sources. They provide the basic visualizations26, but you can opt in to extra tools such as Qlik GeoAnalytics27 with advanced functionality. In fact, this is a common pattern all around Qlik (buying "products" with extra functionality).

Open-source programming languages, libraries, and frameworks

R project28 and Python29

R is primarily a statistical programming language with more than 15,000 packages currently available (R-Project, 2019). Most of these are probably relevant to data analytics tools, and trying to map all of them would be a nearly impossible feat for a time-limited work such as this one. Python, on the other hand, is more of a general-purpose programming language which has been used in lots of fields and for lots of purposes (scientific computing & data science, web development, game development, gluing other languages). Python boasts nearly 190K projects and over 350K users30, being the second most popular (general-purpose) programming language (after JavaScript), the second most loved (after Rust), the most wanted (Stack Overflow, 2019, p. Technology), and the one with the highest rating change according to the TIOBE Index (TIOBE, 2019), with an overall growth overshadowing many others (Robinson, 2017).

First on the list is ggplot231, a project with more than 200 contributors on GitHub, nearly 1.5K forks and 4K stars, with a GPL-2.0 open-source license, and it is part of the tidyverse collection of packages32. The original publication (book) by Springer (Wickham, 2016) has received 3.6K citations and over 200K downloads. It uses a declarative style of programming, which is often cited as a "mini-language" (Wickham, 2016, p. About).

23 https://www.qlik.com/us/products/nprinting
24 https://www.qlik.com/us/products/qlik-insight-bot
25 https://www.qlik.com/us/products/qlik-connectors
26 https://help.qlik.com/en-US/sense/February2019/Subsystems/Hub/Content/Sense_Hub/Visualizations/visualizations.htm
27 https://www.qlik.com/us/products/qlik-geoanalytics
28 https://www.r-project.org/
29 https://www.python.org/
30 https://pypi.org/
31 https://github.com/tidyverse/ggplot2
32 https://www.tidyverse.org/packages/

ggplot2 also has a few Python clones, the two most popular being ggpy33 (not maintained) and plotnine34, which is the suggested alternative (ChKwK, 2018).

Figure 8 Programming Language Growth, https://insights.stackoverflow.com/trends?tags=python%2Cjavascript%2Cjava%2Cc%23%2Cphp%2Cc%2B%2B

However, the Python equivalent to ggplot is matplotlib35, by John Hunter (Hunter, 2007), with 165 paper citations36. This open-source project uses a Python Software Foundation-based license37, is used by nearly 120K projects, and has more than 800 contributors, 9.7K stars, and 4.3K forks on GitHub. matplotlib supports a wide range of highly customizable visualizations which, aided by an array of third-party libraries (including the ones we mentioned previously38), cover most visualization needs.

33 https://github.com/yhat/ggpy
34 https://github.com/has2k1/plotnine
35 https://github.com/matplotlib/matplotlib
36 https://ieeexplore.ieee.org/document/4160265
37 https://matplotlib.org/users/license.html
38 https://matplotlib.org/thirdpartypackages/index.html

Leaflet39 is a free and open-source JavaScript mapping library that has bindings for both R and Python. It has a very short and permissive license40, over 600 contributors, 4K forks, and 25K stars. Furthermore, it is immensely popular and has even been used by popular sites such as The New York Times41, GitHub42 and The Washington Post43, among at least 24K other projects.

Figure 9 SuperZip Example - Shiny Gallery, https://shiny.rstudio.com/gallery/superzip-example.html

Finally, two more projects stand out, both focused on creating dashboards: Shiny44 for R, and Bokeh45 for Python. Shiny has 44 contributors, about 3.5K stars and 1.5K forks, and a mixed license mainly distributed under GPL-346. Bokeh is more popular, used by 12.8K people, with 11K stars and about 2.8K forks, 371 contributors, and a BSD-3-Clause license47.

39 https://leafletjs.com/, https://github.com/Leaflet/Leaflet
40 https://github.com/Leaflet/Leaflet/blob/master/LICENSE
41 http://www.nytimes.com/projects/elections/2013/nyc-primary/mayor/map.html
42 https://github.blog/2013-06-13-there-s-a-map-for-that/
43 http://www.washingtonpost.com/sf/local/2013/11/09/washington-a-world-apart/
44 https://shiny.rstudio.com/
45 https://bokeh.pydata.org/en/latest/
46 https://github.com/rstudio/shiny/blob/master/LICENSE

They both build interactive applications focused on the web browser, but Bokeh has one additional merit: it integrates very well with Jupyter Notebooks, one of the most favored development environments for Python (Stack Overflow, 2019, p. Most Popular Development Environments).

Figure 10: Stocks Example, Bokeh Gallery, https://demo.bokeh.org/stocks

We conveniently left out Plotly and its dashboard tool, Dash, because these are the ones we use for most of our software48. We will go into them in a bit more detail in the next chapters, where we will see a high-level overview of the tools used and discuss the design decisions. Before that, let's review some similar tools for building Machine Learning models and pipelines.

47 https://github.com/bokeh/bokeh/blob/master/LICENSE.txt
48 You can view the docs on the R version here: https://plot.ly/r/

Kibana49

Kibana is part of the Elastic stack50 (with Elasticsearch, Logstash, and now Beats), and it helps you visualize your data. It handles the basics, and provides additional features like time series analysis, machine learning, graph/network and map visualization and analysis, and PDF exports. It also offers integration with Vega for creating custom visualizations51.

Figure 11 Geospatial Analysis, https://www.elastic.co/products/kibana/features

The complete list of features52 (many with their own list of modules) gives many more details, but we will mention a few here. Connections to databases include MySQL, MongoDB, PostgreSQL, and Microsoft SQL Server; connections to external tools include Slack and Jira. Perhaps the most interesting is the ability to create dashboards with a nicely polished UI, which is advertised as its main feature.

49 https://www.elastic.co/products/kibana
50 https://www.elastic.co/products/
51 https://www.elastic.co/products/kibana/features#vega--custom-
52 https://www.elastic.co/products/kibana/features

Figure 12 Kibana Dashboard, https://www.elastic.co/guide/en/kibana/current/dashboard.html

Kibana has an open-source version with about 4.9K forks and 12.5K stars, and 400 contributors53. The main license follows the Apache License Version 2.054. The open-source project has received a lot of traffic (despite the "used by | 5" section, it has about 12.5K closed and 4.6K open issues). It also has a paid (cloud) version starting from $17 per month55. A more detailed comparison, with a huge table of features, can be found on Elastic's page56, according to which data management, graph exploration and analytics, all of the machine learning features, and the PDF exports are all in the various premium versions.

53 https://github.com/elastic/kibana
54 https://github.com/elastic/kibana/blob/master/LICENSE.txt
55 https://www.elastic.co/products/elasticsearch/service/pricing
56 https://www.elastic.co/subscriptions

2.2.2 Machine Learning / Statistical Modeling Software

There are probably as many tools in this category as there are in the previous one, and most cloud providers (see Azure, Google, Amazon) also have their own offerings of machine learning and analysis tools. We will only go over three tools here: KNIME, Weka, and RapidMiner. The selection is based on the relevant paper by Al-Khoder and Harmouch (Al-Khoder & Harmouch, 2015), who base their decision on which tools are better fitted to be complete platforms (Al-Khoder & Harmouch, 2015, p. 3). We will add Orange3 to the list, because it is another completely open-source and popular tool, and Kibana, because even though it is only partly open-source (with ML being the non-open-source part), it is already incredibly popular due to being part of the Elastic ("ELK") stack. Additionally, they are all (completely or mostly) open-source. R is excluded primarily because the whole language is designed for this specific purpose and going over it is redundant (and we briefly went over it in the previous section). Furthermore, this section will focus mainly on presenting the tools, not on any detailed comparisons; a discussion on this can be found in the literature review later in this chapter.

KNIME57

KNIME is the recommended tool for novices (Al-Khoder & Harmouch, 2015, p. 2), reported as a "data analytics, reporting, and integration platform" under a GPLv3 license58 (Al-Khoder & Harmouch, 2015, pp. 3-4). It has been placed as a "Leader" in Gartner's Magic Quadrant for Data Science and Machine Learning Platforms59. The latest publication is from the announcement of version 2 in 2009 (Berthold, et al., 2009). Just like a lot of the tools already mentioned, it supports integrations with various SQL servers and file formats, with HIVE connectors also available60, as well as connections to APIs61.

57 https://www.knime.com/
58 https://www.knime.com/blog/new-version-210-has-been-released
59 see Appendix 1, illustration 15, also: https://www.knime.com/about/news/knime-recognized-by-gartner-as-a-leader-in-data-science-and-machine-learning-platforms-2019
60 https://www.knime.com/blog/new-version-210-has-been-released
61 https://www.knime.com/whats-new-in-knime-210

Figure 13 Machine Learning pipelines in KNIME, https://www.knime.com/analytics-expert

One feature that particularly stands out for KNIME is the ability to deploy your models as REST APIs and integrate them easily with R and/or Python, neural networks included62. Apparently KNIME can use Plotly (in essence, via JavaScript or Python63) or RapidMiner64. It also showcases some solutions to common problems like building recommendation systems, inventory optimization, sentiment analysis, churn, and anomaly detection65. While it is a local application, it can also deploy interactive browser applications with KNIME WebPortal66, and they have also offered a cloud-based version for more than 2 years now67. As per Wikipedia (though we couldn't find a reference on KNIME's website), KNIME works with out-of-core algorithms68.

62 https://www.knime.com/analytics-expert
63 E.g. https://gist.github.com/webbres/3c052788ac55df90ea5b22fe65a68b4e
64 E.g. https://hub.knime.com/aborg/extensions/com.mind_era.knime_rapidminer.knime.feature/latest
65 https://www.knime.com/solutions
66 https://www.knime.com/knime-software/knime-webportal
67 https://www.knime.com/knime-software-in-the-cloud-old
68 https://en.wikipedia.org/wiki/KNIME#Internals

WEKA69

WEKA is definitely one of the most seasoned pieces of software in this list, with its first release dating back to 1994 (Holmes, Donkin, & Witten, 1994) by the University of Waikato, New Zealand, with 33 contributors from the university70. It was originally written in C, C++ and LISP (Holmes, Donkin, & Witten, 1994, p. 2) but was later re-written in Java (Witten, et al., 1999, p. 1). WEKA's website and source control71 are the least user-friendly of what we viewed here, which mirrors the user interface of the actual application; it also lacks its own forums, unlike the other software. Similar comments can be made about their documentation72.

Figure 14 Weka Explorer, https://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html

That said, WEKA does not lack much with respect to its competitors in terms of available machine learning models. It can easily handle time series, decision trees, rule-based models, neural networks, bagging, boosting, Bayesian models, and more (Frank, Hall, & Witten, 2016). It also includes preprocessing utilities (Holmes, Donkin, & Witten, 1994, p. 4). However, it has the fewest visualizations of all the software examined so far, with only 5 of the 27 listed in (Al-Khoder & Harmouch, 2015, pp. 5-6), as well as the least data source support.

69 https://www.cs.waikato.ac.nz/ml/index.html
70 https://www.cs.waikato.ac.nz/ml/people.html
71 https://svn.cms.waikato.ac.nz/svn/weka/
72 http://weka.sourceforge.net/doc.dev/

Even though it lacks more advanced models and does not have equally powerful pipelines (in fact, they probably cannot be called pipelines at all (ben26941, 2017)), it allows integration with languages that data scientists are familiar with (Python73 and R74), and of course Java. Third-party applications have worked to extend WEKA, with some of the first GitHub results being:
• autoweka75 (autoML in weka)
• wekaDeeplearning4j76
• weka python wrapper77
• tmweka78 (text mining in weka)

Figure 15 Weka Explorer (2), https://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html

RapidMiner79

73 E.g. https://pypi.org/project/python-weka-wrapper/, https://www.youtube.com/watch?v=cwbPzumwgNo
74 E.g. https://forums.pentaho.com/threads/154305-Integrating-R-with-Weka/, http://weka.sourceforge.net/packageMetaData/Rplugin/index.html
75 https://github.com/automl/autoweka
76 https://github.com/Waikato/wekaDeeplearning4j
77 https://github.com/chrisspen/weka
78 https://github.com/jmgomezh/tmweka

Despite the jokes about RapidMiner in the introduction, it is probably the most complete, easy to use, and intuitive tool in this list. As said previously, the core of RapidMiner Studio80, which provides data visualization and visual creation of pipelines, is free and open-source81 and written in Java, and the company's business model is based on offering additional services and extensions on top of it. According to their website, more than half a million users spread across over 30K organizations and 1K universities use their products82.

Figure 16 RapidMiner Pipelines, https://rapidminer.com/products/auto-model/

One such extension, handling data management, is Turbo Prep (https://rapidminer.com/products/turbo-prep/), which also allows for exploratory data analysis, data transformations and cleaning, and merging datasets. Another one is Auto Model (https://rapidminer.com/products/auto-model/), which handles even more of the modeling process: it calculates statistics and judges the quality of datasets, performs automated model selection & tuning as well as automatic feature engineering, and provides dashboards and integrations (e.g. with MS Excel).

79 https://rapidminer.com
80 https://rapidminer.com/products/studio/
81 https://github.com/rapidminer/rapidminer-studio
82 https://rapidminer.com/products/

Furthermore, nearly 4K extensions are available in the RapidMiner Marketplace83. Since RapidMiner allows for Python and R extensions, you can find a lot of those in the Marketplace. The other three extensions are: RapidMiner Server84, which handles deployment of models and collaboration; RapidMiner Real-Time Scoring85, which also targets deployment but focuses on fast inference times with a REST API wrapper; and RapidMiner Radoop86, which is the RapidMiner Studio big-data version for Hadoop and Spark, providing cluster computing and integration with the Hadoop ecosystem: SparkR, PySpark, Pig, HiveQL, and more.

Figure 17 RapidMiner Auto Model, Simulator, https://rapidminer.com/products/auto-model

Orange387

Orange is an incredibly modular data mining software which builds largely on scikit-learn and the SciPy stack88, with matplotlib and Qt handling most of the graphical interface89. It has 1.8K stars, 542 forks, is used by 163 others, and is developed by 71 contributors90, with 16 people being mentioned in the original publication (Demsar, et al., 2013) and having contributed the majority of the codebase.

83 https://marketplace.rapidminer.com/UpdateServer/faces/index.xhtml
84 https://rapidminer.com/products/server/
85 https://rapidminer.com/products/real-time-scoring/
86 https://rapidminer.com/products/radoop/
87 https://orange.biolab.si/
88 https://github.com/biolab/orange3/blob/master/requirements-core.txt
89 https://github.com/biolab/orange3/blob/master/requirements-gui.txt

The license is based on GPLv3, with the University of Ljubljana being the designated copyright holder91. A very simple interface, with every sub-menu being a popup, comprises the whole GUI. It looks a lot like the other tools we have discussed so far (see https://orange.biolab.si/screenshots/ for more), and Qt works seamlessly on most operating systems.

Figure 18 Orange3 screenshot, PCA, https://orange.biolab.si/screenshots/

The Workflows tool also allows the user to define complex pipelines including data ingestion, processing, visualization, and machine learning. As Orange uses sklearn behind the scenes for most of its machine learning, everything from there is available, or can be ported.

90 https://github.com/biolab/orange3/
91 https://github.com/biolab/orange3/blob/master/LICENSE

It even includes many well-defined tasks such as correspondence analysis, manifold learning, and regression. Preprocessing tasks like merging data, finding / removing outliers, imputation, and normalization are there too. In terms of visualizations, Orange offers the basics (trees, boxplots, histograms, scatterplots) and a few less commonplace ones (Pythagorean tree, RadViz). Everything is a widget, and a catalog can be found on their website92. SQL connections are limited to MSSQL and PostgreSQL, but files can be acquired from the web, from the typical formats (Excel, txt, CSV), or from other platforms. Some basic data type inference is done, which the user can, of course, modify93. Although this tool is rather limited when compared to the previous ones (which is to be expected, as it is used mainly as an educational platform), developing new add-ons is easy, since Orange is written directly in Python and they provide a reference template94, and thus you can leverage the whole Python community and its packages.

2.2.3 Cloud providers and services

In this final part of the competitors' review we will quickly go over the cloud offerings of Amazon (AWS), Microsoft (Azure) and Google. There are many more, some of which we have already visited (e.g. Kibana is mainly a cloud platform, and Tableau Server can also be deployed in cloud providers like AWS), but due to size constraints we cannot go over them here.

Amazon QuickSight and other ML services95

Amazon's main offering is an array of specialized services revolving around Infrastructure as a Service, but over the years they have developed tools that can be thought of as Software as a Service, or similar. One such offering is Amazon QuickSight, which is a Business Intelligence / Machine Learning service.

92 https://orange.biolab.si/widget-catalog/
93 https://orange.biolab.si/widget-catalog/data/file/, https://orange-visual-programming.readthedocs.io/loading-your-data/index.html
94 https://github.com/biolab/orange3-example-addon
95 https://aws.amazon.com/quicksight/, https://aws.amazon.com/machine-learning/

Figure 19 Amazon QuickSight workflow, https://aws.amazon.com/blogs/big-data/10-visualizations-to-try-in-amazon-quicksight-with-sample-data/

As one would expect, this works with a usage-based pricing model, as do most AWS offerings. When it comes to integrations with data sources, it obviously supports other AWS services like Redshift, S3, Athena, and Glue, and even third-party applications like MySQL, PostgreSQL, Teradata, and Salesforce, or common file formats for user uploads96. It supports dashboard creation and allows for embedding dashboards in your applications. With AutoGraph, the system chooses the most appropriate visualization type according to the dataset schema, and it supports most of the basic visualization/graph types. When it comes to ML services, there are perhaps more offerings, with Amazon SageMaker, Comprehend, Personalize, Translate, Transcribe, and Ground Truth being a few of them, providing solutions to tasks like computer vision, natural language processing, recommendation systems, forecasting, and more. It also supports TensorFlow, PyTorch, Apache MXNet, Gluon, and more out of the box, often with performance improvements97. Finally, it comes with various helpers for hyperparameter tuning and model selection.

96 https://aws.amazon.com/blogs/big-data/10-visualizations-to-try-in-amazon-quicksight-with-sample-data/
97 https://aws.amazon.com/sagemaker/

Google Data Studio and ML Engine98

Similarly to Amazon, Google has its own array of offerings. Diving straight into the available connections that Data Studio can handle, it integrates with other Google services99 (BigQuery, Campaign Manager, Google Analytics, Cloud SQL, Google Sheets), popular databases (MySQL, PostgreSQL), and allows file uploads. Through the "partner connectors" you can also integrate with Amazon, Salesforce, Facebook, Instagram, Mailchimp, LinkedIn, Quora, Reddit, Pinterest, Twitter, and over 100 more. It also allows you to create your own dashboards, and an extensive gallery100 showcases some amazing work. With the exception of 3D scatterplots (for which I have found no reference), Data Studio provides all of the basic visualizations.

Figure 20 World Cup Finalists, https://datastudio.google.com/reporting/1Rg5y6r0640X8uo2xo2XY48sG9IyMiYEN/page/wcCU

Again similarly to Amazon, Google provides its own Machine Learning services, like the Cloud Machine Learning Engine, which also allows for scalable training and comes preconfigured with scikit-learn, XGBoost, TensorFlow and Keras. If you want to add another, there is the option of uploading your own Docker containers. It also provides access to CPU, GPU and TPU hardware. Furthermore, using HyperTune can help with hyperparameter tuning. Finally, it too handles common use cases with more specific offerings in computer vision and natural language processing101.

98 https://datastudio.google.com/overview, https://cloud.google.com/ml-engine/
99 https://datastudio.google.com/data
100 https://datastudio.google.com/gallery

Azure102

The last on the list is Azure, which also provides dashboard functionality; it is somewhat limited, and we won't go into detail about it (after all, Microsoft's main visualization product is Power BI, which we already covered).

Figure 21 Azure Machine Learning Service, https://azure.microsoft.com/en-us/services/machine-learning-service/

Azure's AI/ML offering comprises services targeting the same fields as its competitors, with integrations with Databricks, ONNX, PyTorch, TensorFlow, and scikit-learn. The Azure Machine Learning Service promises to handle all stages from training up to deployment with "automated machine learning capabilities and open-source support". With Jupyter Notebook integration and the ability to run Python code, it is certainly very extensible and can handle custom data input. Furthermore, it provides both a flowchart to help you choose an ML model103, and automated ML104.

101 https://cloud.google.com/vision/, https://cloud.google.com/translate/
102 https://azure.microsoft.com/en-us/overview/ai-platform/, https://docs.microsoft.com/en-us/azure/azure-portal/azure-portal-dashboards
103 https://azuremlsimpleds.azurewebsites.net/simpleds/

Figure 22 Azure Machine Learning Service, https://azure.microsoft.com/en-us/services/machine-learning-service/

104 https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml

3. Designing an open-source solution for data visualization and analysis

This project stands on the shoulders of (mostly open-source) giants, without whose work this endeavor would not have been possible given the time window. These giants roughly sum up to around 40-50 independent projects. Although we will explain the whys and hows for each tool that was used in the next chapter, this small chapter will focus on very briefly introducing them and their capabilities. Seeing that Python was introduced in the previous chapter, we will skip over it (the same goes for JavaScript, it being the most popular language).

Figure 23 EDA Miner's main tech stack, v0.2+

3.1 Python libraries

3.1.1 Tech and computing stack

scikit-learn105

We have mentioned scikit-learn (or sklearn) multiple times before, but only briefly. It is a 12-year-old project that started as a Google Summer of Code project by David Cournapeau106, with the first publication dating back to 2011 (Pedregosa, et al., 2011), and it has received a tremendous community following, with donations from individuals, organizations, and universities. The GitHub repository numbers over 17.9K forks, 36K stars and 64K projects in which it is used, with a total of 1380 contributors and a New BSD license107.

At its core, sklearn is a machine learning library with a very concise and well-designed "bare-bones" programmatic API (Pedregosa, et al., 2011, p. 3) that is built using NumPy, SciPy, and matplotlib, and it boasts great performance with highly optimized and parallelizable models. Since the underlying libraries use C or C++ code (which is precompiled), they can be made extremely efficient (Pedregosa, et al., 2011, p. 2), and sklearn adds to that by bringing multi-threading into the mix, with a lot of it written in Cython (Pedregosa, et al., 2011, p. 3). It also uses Fortran libraries such as LAPACK (Pedregosa, et al., 2011, p. 3). A comparison with other libraries can be found in Appendix 3, as in the original paper (Pedregosa, et al., 2011, p. 4), which attributes sklearn's performance improvements to the reduced overhead due to less copying, among other things (Pedregosa, et al., 2011, pp. 4-5).

The project provides a lot of documentation, which, paired with all the previous points, makes evident how it came to dominate the ML world in Python. The sklearn docs provide a very detailed User Guide which goes into great depth explaining the models, their theory, and the typical use cases108. The complete user guide at the initial publication was estimated at over 300 pages (Pedregosa, et al., 2011, p. 3), and it has probably grown much more since then. A very popular aid is the Flow Chart, which really helps beginners select the appropriate models for their tasks quickly.

105 https://scikit-learn.org/stable/index.html
106 https://scikit-learn.org/stable/about.html
107 https://github.com/scikit-learn/scikit-learn
108 https://scikit-learn.org/stable/user_guide.html

Not only that, but sklearn provides tutorials and examples (along with the hundreds of other external sources like YouTube videos, Medium articles, etc.).

Figure 24 : SciKit-learn Flow Chart, https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Probably the best feature of sklearn is the ability to choose to use the models in standalone mode or to stack them in Pipelines109, which are a great way to define more complex models ready for production environments. With simple inheritance from base classes and mix-ins, creating your own estimator and adding it to a pipeline is very easy110: just subclass the base and mix-ins you want, define the appropriate methods (e.g. fit, predict, transform) and you're good to go. It is this ease of use and clean API that made us choose it as a base for our project, and the same probably applies to Orange3 and others.
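As a rough illustration of this pattern (a toy sketch, not code taken from EDA Miner; the ColumnClipper transformer and its percentile bounds are made up for the example):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ColumnClipper(BaseEstimator, TransformerMixin):
    # Toy transformer: clip each feature to a learned percentile range.
    def __init__(self, low=1, high=99):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        self.low_, self.high_ = np.percentile(X, [self.low, self.high], axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.low_, self.high_)

pipe = Pipeline([
    ("clip", ColumnClipper()),        # preprocessing step
    ("model", LogisticRegression()),  # final estimator
])

X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
pipe.fit(X, y)
print(pipe.predict(X[:5]))

The whole pipeline then exposes the usual fit/predict interface, which is what makes such objects easy to serialize and deploy.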

SciPy111

109 https://scikit-learn.org/stable/modules/compose.html
110 https://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator
111 https://www.scipy.org/

SciPy, or Scientific Python, is not only a Python library originally published in 2001 (Jones, Oliphant, Peterson, et al., 2001); a whole stack of libraries (an "ecosystem") geared towards scientific computing sprung forth from it later. Both sklearn and the next few packages are also part of this ecosystem. It has its own conferences, a very large community, and a lot of other libraries that greatly enhance Python's performance (e.g. Cython, Dask, PyTables). It is a project used by nearly 120K others, with over 6.1K stars, 2.9K forks, and 750 contributors, with a BSD-3-Clause license112. It has lots of applications, from signal processing to image processing, sparse data, Fourier transforms, clustering, and more (e.g. peak finding, convolutions, decompositions, equation optimizers/solvers).
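As a small, self-contained illustration of the kind of utilities mentioned above (the signal and the function being minimized are made up for the example):

import numpy as np
from scipy import optimize, signal

# Peak finding on a noisy sine wave
x = np.linspace(0, 4 * np.pi, 400)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=x.size)
peaks, _ = signal.find_peaks(y, height=0.5, distance=50)
print("peak positions:", x[peaks])

# Minimizing a simple one-dimensional function
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2 + 1.0)
print("minimum found at:", result.x)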

NumPy113

NumPy, or Numerical Python, is primarily used for operations with and on arrays. NumPy allows vectorization of operations, which essentially takes optimized and parallelized code written in C / C++ / Fortran and provides a simple API. The NumPy project is used by nearly 234K others, with 11.4K stars, 3.8K forks and 800 contributors, with the same license as SciPy. Just like SciPy, NumPy is pretty old, with the initial publication dating back to 2006 (Oliphant, 2006).

pandas114

pandas, or the Python Data Analysis Library, is the part of the SciPy ecosystem that focuses on data analysis tasks like data cleaning, missing-value management, computing statistics, and providing functionality for time series, aggregations, I/O, merging / joining datasets, and basic visualization: most of what a Data Analyst / Scientist needs for data handling (McKinney, 2010). Its main feature is the DataFrame, which should be familiar to R programmers. This project is also immensely popular, being used by over 125.8K others, with over 20.7K stars, 8.2K forks, and 1550 contributors, with the same license as the previous ones115.
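A minimal illustration of both ideas, with made-up numbers:

import numpy as np
import pandas as pd

# Vectorization: one NumPy expression replaces an explicit Python loop
prices = np.array([10.0, 12.5, 9.8, 11.2])
discounted = prices * 0.9  # applied element-wise in compiled code

# pandas: the same data with labels, plus a group-wise aggregation
df = pd.DataFrame({"product": ["a", "b", "a", "b"], "price": prices})
print(df.groupby("product")["price"].mean())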

112 https://github.com/scipy/scipy
113 https://www.numpy.org/
114 https://pandas.pydata.org/
115 https://github.com/pandas-dev/pandas

SymPy116

SymPy, or Symbolic Python, is the last library of the SciPy stack that we will go over. It is a library for symbolic mathematics, but it is evolving towards becoming a comprehensive "computer algebra system" (Meurer, et al., 2017). It handles differentiation and integrals, logic, concrete and discrete mathematics, geometry, working with polynomials (e.g. expansion / factoring), and much more (see the example below). Perhaps the most interesting feature is the ability to convert symbolic expressions to Python / NumPy functions. The SymPy project is used by more than 11.3K projects, and has about 6.1K stars, 2.7K forks, 770 contributors and, again, a BSD-type license117.

>>> from sympy import Symbol, cos
>>> x = Symbol('x')
>>> e = 1/cos(x)
>>> print(e.series(x, 0, 10))
1 + x**2/2 + 5*x**4/24 + 61*x**6/720 + 277*x**8/8064 + O(x**10)
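Continuing the example above, the conversion to a NumPy function mentioned earlier can be sketched with lambdify (the input values are arbitrary):

>>> import numpy as np
>>> from sympy import lambdify
>>> f = lambdify(x, e, modules='numpy')
>>> f(np.array([0.0, 1.0]))
array([1.        , 1.85081572])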

Other libraries

We cannot introduce every library used (and their dependencies, which usually are in the tens), so here is a list of other notable scientific libraries we used in our code:
• networkx118 (Hagberg, Schult, & Swart, 2008)
• peakutils119 (Hermann Negri & Vestri, 2017)
• pygraphviz120
• python-Levenshtein121
• textblob122
• xgboost123
• fuzzywuzzy124

116 https://www.sympy.org/en/index.html 117 https://github.com/sympy/sympy 118 https://github.com/networkx/networkx 119 https://bitbucket.org/lucashnegri/peakutils/src/master/ 120 https://github.com/pygraphviz/pygraphviz 121 https://github.com/ztane/python-Levenshtein 122 https://github.com/sloria/TextBlob 123 https://github.com/dmlc/xgboost  fuzzywuzzy124 3.1.2 Interface / web stack Dash125 Dash is a Python library, developed by Plotly Technologies Inc. that, behind the scenes, uses Flask as the back-end server and React as the front-end JavaScript framework. It is also a rather popular project, used by over 4K other projects, with about 9.6K stars, 1K forks, and released under an MIT license126. The Dash App Gallery is another great place to take a first look at the project127. With Dash you can create web applications with HTML and CSS all within Python, and with the use of callback functions (a Python function decorated by dash.Dash.callback) that specify Output (what Dash/HTML element to change, e.g. a graph or a div), Input (what Dash/HTML element triggers -on change- the functions, e.g. “button X was clicked”), and optionally State (something that is read when a function is triggered by the Inputs, e.g. “what is the value of dropdown X when button Y is clicked?”) you can manage the whole interactivity without writing a line of JavaScript. This comes at a small price, and we will discuss that in the next chapter, as well as other limitations and work-arounds. Dash is extendable by creating your own (or from the community) components in React.js, and using a project boilerplate128 you can easily make them available in Dash. You can also integrate D3.js with Dash129. The documentation is pretty extensive and covers various usage patterns in depth, allowing you to create complex applications, while there is also a repository with examples of advanced functionalities available on GitHub130. Dash is broken down in several libraries according to their domains. Some were incorporated from community contributions, others were paid by Plotly clients but were

124 https://github.com/seatgeek/fuzzywuzzy 125 https://dash.plot.ly 126 https://github.com/plotly/dash 127 https://dash-gallery.plotly.host/Portal/ 128 https://github.com/plotly/dash-component-boilerplate 129 https://dash.plot.ly/d3-react-components 130 https://github.com/plotly/dash-recipes

Dash is broken down into several libraries according to their domains. Some were incorporated from community contributions; others were paid for by Plotly clients and later open-sourced. Here we will view a few of them, and some that are community-maintained (i.e. outside of Plotly), all of which we used in EDA Miner:
• Dash Core Components131 is responsible for high-level components like graphs, dropdowns, sliders, check-boxes, and more.
• Dash HTML Components132 contains (almost) every HTML tag, like div, button, p, span, h1-h6, and even script.
• Dash DAQ133, which was recently open-sourced, contains high-level components geared more towards controls, like switches, gauges, joysticks, knobs, thermometers, and more.
• Dash DataTable134 is a library for creating tables that are highly interactive; much like spreadsheets, you can edit values, add or remove rows or columns, sort, filter, use paging, and more.
• Dash Cytoscape135 was also recently released (merely 1.5 months prior to this project's start, or 6 months at the time of writing), and is an extension of the popular Cytoscape.js library for network / graph visualizations.
• Dash Bootstrap Components136 is an independent community project that ports the Bootstrap project into Dash, giving access to components like modals, navbars, tabs, jumbotrons, cards, and more.
• Dash Core Components for Visualization137 is another external project that provides a few extra components: a component to run JavaScript, a network component, and a data-table.

Flask138

Flask is a Python web micro-framework, which means it doesn't make too many assumptions for you, nor does it wrap everything around a rigid and opinionated API139.

131 https://dash.plot.ly/dash-core-components 132 https://dash.plot.ly/dash-html-components 133 https://dash.plot.ly/dash-daq 134 https://dash.plot.ly/datatable 135 https://dash.plot.ly/cytoscape 136 https://dash-bootstrap-components.opensource.faculty.ai/ 137 https://github.com/jimmybow/visdcc 138 https://flask.palletsprojects.com/en/1.1.x/ 139 https://flask.palletsprojects.com/en/1.1.x/foreword/

That said, due to its design decisions its API is very simple and polished140, and scaling Flask applications is not an issue141. It works by using Werkzeug and Jinja as the underlying mechanisms for the web server gateway interface (WSGI) and the template engine respectively. Since sub-classing the Flask object is supported, a lot of applications are based on that, most notably the one we used: Dash (to access the Flask server object from the Dash object you call Dash.server). Flask is, of course, open source as well. The GitHub page shows that it is an immensely popular project, with more than 45.8K stars, 12.8K forks, and 550 contributors, under a BSD-3-Clause license. It is used by more than 303.7K projects142, including large companies like Pinterest (Steven Cohen cites that it handles "over 12 billion requests" daily (Cohen, 2015)) and LinkedIn (0:45 – 1:15 of the LinkedIn talk (Sanders, 2014)). As we said, Flask can be sub-classed to create wrappers, but the primary focus of the development community is designing extensions (Sanders, 2014), and we will see a few of them here:
• Flask-Login143 is an extension that helps with user authentication, session management, "remember me", and anonymous users, among others. What it lacks is handling of permissions (an interesting extension for that is Flask-Security), registration, and account recovery. As a GitHub project, it has nearly 2.2K stars and 500 forks, with 78 contributors and an MIT license144.
• Flask-WTF145 is an extension that helps with forms, validation of input, CSRF protection, reCAPTCHA, and more. The GitHub project uses a BSD license, is used by about 45.3K other projects, and has nearly 1K stars, 240 forks, and 69 contributors146.

140 https://flask.palletsprojects.com/en/1.1.x/design/ 141 https://flask.palletsprojects.com/en/1.1.x/becomingbig/ 142 https://github.com/pallets/flask 143 https://flask-login.readthedocs.io/en/latest/ 144 https://github.com/maxcountryman/flask-login 145 https://flask-wtf.readthedocs.io/en/stable/ 146 https://github.com/lepture/flask-wtf

• Flask-Mail147 is an extension that provides a simple API for handling emails via SMTP, using a simple dictionary for passing configurations. It is also popular, with 22 contributors, more than 100 forks, 400 stars, and 12.8K projects using it, under a BSD license148.
• Flask-SQLAlchemy149 is the Flask extension, developed by the core Flask team, that supports the Python library SQLAlchemy. It is, essentially, an Object-Relational Mapping (ORM). Used by more than 79.6K projects, with about 2.7K stars, 700 forks, and 82 contributors, it is also shared on GitHub under a BSD-3-Clause license150.
• Flask-Caching151 is the last Flask extension on our list; it provides caching support using any of Werkzeug's caching back-ends (we use it with the preconfigured Redis) plus custom ones via sub-classing. The GitHub project sits on the less-used side, with about 2.2K projects using it, 400 stars, 70 forks, 60 contributors, and a mixed BSD license152. A minimal configuration sketch is given after this list.
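As an illustration of the last extension, here is a minimal sketch of configuring Flask-Caching with a Redis back-end and memoizing an expensive function (the configuration values and the function are illustrative assumptions, not EDA Miner's actual setup):

from flask import Flask
from flask_caching import Cache

app = Flask(__name__)
cache = Cache(config={
    "CACHE_TYPE": "redis",                        # use the Redis back-end
    "CACHE_REDIS_URL": "redis://localhost:6379/0",
    "CACHE_DEFAULT_TIMEOUT": 300,                 # seconds before entries expire
})
cache.init_app(app)

@cache.memoize(timeout=3600)
def expensive_query(user_id, metric):
    # Pretend this hits a slow external API; the result is cached per argument combination.
    return {"user": user_id, "metric": metric, "value": 42}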

Others

We used other frameworks and libraries too, but their limited usage (and the limited space here) does not allow us to go into much detail, especially since the Dash code-base is by far the bigger one. We will only focus on JavaScript (JS), which needs no introduction. We used a bit of "vanilla" JavaScript, but the two frameworks we mostly used were React and jQuery:
• React153 is an open-source library / framework for creating component-based interactive elements. By writing JSX (essentially JS with a few extras) to extend the React base Component class, compiling with Babel, and optionally using any library (e.g. installed via the Node Package Manager), you can easily create complex components and single-page apps. React is developed by Facebook and was initially released in 2013 (Occhino & Walke, 2013); the GitHub code is distributed under an MIT license. It is the most popular library of those we reviewed, with 1.3K contributors, 134K stars, 25K forks, and a whopping 2.33M projects using it154.

147 https://pythonhosted.org/Flask-Mail/ 148 https://github.com/mattupstate/flask-mail 149 https://flask-sqlalchemy.palletsprojects.com/en/2.x/ 150 https://github.com/pallets/flask-sqlalchemy 151 https://flask-caching.readthedocs.io/en/latest/ 152 https://github.com/sh4nks/flask-caching 153 https://reactjs.org/

• jQuery155 is also a JS library, one that comes packaged with User Interface elements. Even though it is older, first published in 2006 (York, 2009), and its GitHub page is much less popular (52K stars, 18.4K forks, 347K usages, 275 contributors, MIT license156), it has steadily increased in usage across the most popular websites: among the top million websites it was reportedly used by 63% of them in 2015 (https://www.maxcdn.com/blog/maxscale-jquery/), 69% in 2017 (http://web.archive.org/web/20170219042532/https://libscore.com/) and 73% in 2018 (https://trends.builtwith.com/javascript/jQuery).

3.2 Database technologies

The use of databases throughout the application serves multiple purposes, and we will go over them in the next chapter. Before that, we will briefly take a look at the ones we currently use and a few details about them. For SQL databases we are using SQLAlchemy as an intermediary (see the previous section on Flask-SQLAlchemy).

3.2.1 Redis

Redis157 is an open-source (BSD-3-Clause licensed) NoSQL database whose purpose is to be an "in-memory data structure store". The most popular use case is as a key-value store, where according to G2158 and DB-engines159 it is considered the top contender. The GitHub repository has over 38.1K stars, 14.7K forks, and 300 contributors, with very active development (more than 700 open pull requests)160.

154 https://github.com/facebook/react/ 155 https://jquery.com/ 156 https://github.com/jquery/jquery 157 https://redis.io/ 158 https://www.g2.com/categories/key-value-stores 159 https://db-engines.com/en/ranking/key-value+store

It has clients for around 50 programming languages161 and a very simple API with clear documentation162. Related to key-value storage, use cases include caching and shared-memory-type access to resources. It is also capable of working as a message broker, handling streams and PUB/SUB, and even offering on-disk persistence. Finally, it scales really well with extensions like Redis Sentinel and Redis Cluster. And it is fast; you will find some benchmarks in the next chapter. A minimal usage sketch with the Python client follows.
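A minimal sketch of this key-value usage pattern via redis-py (the key names are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Basic key-value access, as used for sharing data across processes.
r.set("user1_dataset_iris", b"...serialized bytes...")
data = r.get("user1_dataset_iris")

# Keys can be given an expiration, e.g. for temporary tokens or links.
r.set("activation_token_abc123", "user1", ex=3600)  # expires after one hour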

3.2.2 SQLite

SQLite163 is probably the most widely deployed database. If you have used a web browser to navigate the internet, or have an Android phone or an iPhone, or have a Windows 10 or Mac OS X computer, then chances are it is backed by an SQLite database, or a few. Of course, popularity can be measured differently, and according to DB-engines it ranks 11th164. According to the SQLite website, there are more than a trillion databases in use165. As the name implies it is a "lite" version, which means it doesn't support all the commands you would typically expect of the larger SQL databases. It, too, is open source, is in the Public Domain, and boasts of being "uncontaminated", which means that its developers do not readily accept new contributors166. Furthermore, SQLite is supported in Python by default, shipped as part of the standard library167 (a minimal usage sketch is given below). According to the same docs, a common usage pattern is to use it for prototyping / development, and then move on to another provider like PostgreSQL or Oracle. That said, SQLite is not a "dummy database"; it is in fact used in production by some of the biggest companies, like Facebook, Dropbox, Apple, and Google (Chrome web browser, Android), for various of their products168.
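A minimal sketch of the standard-library sqlite3 module mentioned above (table and file names are illustrative):

import sqlite3

conn = sqlite3.connect("example.db")  # creates the file if it doesn't exist
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
conn.commit()
print(cur.execute("SELECT * FROM users").fetchall())
conn.close()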

160 https://github.com/antirez/redis 161 https://redis.io/clients 162 https://redis.io/documentation and https://redis.io/commands 163 https://www.sqlite.org/index.html 164 https://db-engines.com/en/ranking/ 165 https://www.sqlite.org/mostdeployed.html 166 https://sqlite.org/copyright.html 167 https://docs.python.org/3/library/sqlite3.html 168 https://www.sqlite.org/famous.html

3.3 Other libraries and tools

This project also uses a variety of other tools, methods, and techniques. We will see them in more detail in the next chapter, so here we only list them for completeness' sake.
• REpresentational State Transfer (REST) APIs are a software architecture technique: self-imposed rules on the behavior of applications that make interfacing multiple of them easier (W3C Working Group, et al., 2004). These rules, or constraints, require that a REST service follows a client-server architecture, does not track state, is cacheable, and has a uniform interface (usually using URIs, HTTP, and JSON) (Erl, Carlyle, Pautasso, & Balasubramanian, 2012). The "micro-services" architecture very often uses REST-type APIs for communication among services. It is also a great way to lessen the load on your servers (e.g. both Wikipedia and Twitter share their data via APIs). REST APIs are also expected to return appropriate status codes; some examples can be found at restfulapi.net169.
• OAuth2 is used to handle authorization, mainly for accessing Google and other web API resources. It is a widely accepted protocol that allows authorization of only specific resources (Barbettini, 2018). We are using the google-auth170, google-auth-oauthlib171, and google-api-python-client172 Python packages.
• Other APIs, and the Python libraries used for interfacing with them, include Quandl / quandl-python173, Reddit / praw174, and Spotify / spotipy175.
• redis-py176 is used for interfacing with Redis.

169 https://restfulapi.net/http-status-codes/ 170 https://github.com/googleapis/google-auth-library-python 171 https://github.com/googleapis/google-auth-library-python-oauthlib 172 https://github.com/googleapis/google-api-python-client 173 https://github.com/quandl/quandl-python 174 https://github.com/praw-dev/praw 175 https://github.com/plamere/spotipy

• matplotlib is also used, for creating pairplots, which are converted into plotly's format by the latter's tools.
• word_cloud177 is used to generate word-cloud images.
• Selenium (via Python) is used for controlling the browser and handling testing.
• py.test178 is used for testing Python applications, together with pytest-cov179 for parsing coverage reports.
• dill180 is used as a more capable alternative to Python's pickle module.
• requests181 is a library that simplifies HTTP requests in Python.
• Docker182 is a layer between your application and the host operating system allowing for virtualization and the delivery of software in small packages called containers. It is also often used in micro-services architectures. It started in 2013183, and its widespread adoption, along with the tools that interface with it (cloud providers' specialized services, Docker-Swarm), has brought a small revolution to the "DevOps" world.

176 https://github.com/andymccurdy/redis-py 177 https://github.com/amueller/word_cloud 178 https://github.com/pytest-dev/pytest 179 https://github.com/pytest-dev/pytest-cov 180 https://github.com/uqfoundation/dill 181 https://github.com/psf/requests 182 https://www.docker.com/get-started 183 See "Milestones" section at https://www.docker.com/company

4. Implementing the EDA Miner tool

In this chapter we will go over the considerations and design decisions, the implementation, and everything else related to the codebase. We will explain our motivations and thinking process, explore benchmarks, and speak of concrete future plans that don't fit in the "future work" chapter. We will also go over the specific issues we identified (e.g. project code organization, inter-app authentication) that iteratively changed the agile design of our project. It should also be noted that, unless otherwise specified, all benchmarks in this chapter are not rigorous184; that is, they might vary across computers or supporting software versions, and they were run under no external load, only on a local computer with the following specs:
• CPU: Intel i7-8700K
• RAM: 32 GB
• GPU: GTX 1080, driver 418.56, CUDA 10.1
• OS: Ubuntu 18.04.3 LTS 64-bit, Linux kernel 4.15.0-58-generic

It should be made clear upfront that not much optimization has been done during development; we consider the application to still be in (pre-)alpha stage. Furthermore, although we did try to improve styling, it has not been our main focus. The main focus of development so far has been to roll out as many features as fast as possible (up to July); then to clean the code and use object-oriented design patterns to further improve code quality, including writing documentation and unit tests (a continuous process, but August was "refactoring month"); and then to once again add new features, primarily geared towards actual use-cases (e.g. recommender systems, SQL integration, PDF / report generation) and more UI/UX work. Furthermore, the main target group of the application has been data scientists and non-technical people who work with datasets smaller than 25 GB, that is, datasets that could potentially fit in the RAM of a (strong, yet local) machine or a contained cloud instance.

184 No effort was expended in isolating benchmark processes from the OS and/or other processes, or trying to negate the effects of (unplanned) hardware/OS caching.

This means that no Big Data tools were used, and although scaling of each individual component was and is a major consideration, discussions about scaling beyond one instance of the main server have mostly been theoretical. This decision was made after conducting a draft poll, on July 23, in a popular Facebook group (due to budget and time constraints) called "Artificial Intelligence & Deep Learning", with about 280K members. The results of the poll (multiple answers could be selected, and the bucket boundaries were arbitrarily decided) can be viewed in the histogram below. Although the results of this poll should not be taken as representative, about 45.1% of the votes correspond to datasets of less than 2 GB, and 56.7% to less than 25 GB. Additionally, we should consider that many of those working on the higher end of the spectrum work with images, videos, and other multimedia content, applications which this project does not account for.

Figure 25: Dataset size poll in Facebook group "Artificial Intelligence & Deep Learning" (https://www.facebook.com/groups/DeepNetGroup/permalink/911122442613972/)

4.1 Project Structure: Dash & Flask

Dash is the backbone of the project, and the early versions (up to v0.3) were Dash-only. Applications with Dash work by defining a layout, which is a hierarchically structured container of components that map to HTML or React elements, and then defining callbacks, which are decorated Python functions that take their arguments in the form of Input (the Dash element that triggers the callback) and State (Dash elements that are read when the callback is triggered), and output a JSON-serializable Python object (e.g. dict, Dash elements, str, list). These callbacks are registered in the main dash.Dash instance.

The first direct consequence of the above is that only one callback function can mutate a certain Dash element. For example, two different buttons cannot be tied to two different functions that mutate the same div. There are only two known workarounds for this. The official way is to combine the callbacks into one function, which very often leads to increased complexity, function size, and lots of if/else branching within functions, increasing more than linearly in complexity as the number of Input/State elements grows. A second way, which is mostly a hack, is to use custom JavaScript, from Python (e.g. using the extra package visdcc.Run_js) or otherwise. We use both methods (e.g. see utils.py > interactive_menu, and in v0.2 see apps > analyze > Model_Builder.py > modify_graph). Although not exactly different ways, one could also break individual operations into separate functions that return data to separate hidden divs and combine those, and/or offload the logic completely to Python (both of which usually decrease complexity, e.g. in v0.4 see modeling > model_builder.py > update_nodes).

The second consequence is that callbacks triggered by some input will throw errors if not all of the Input/Output/State elements are present. This can mostly be fixed either by tweaking Dash parameters (swapping Input/State elements), or, in more advanced cases, by adding the elements to the layout beforehand and keeping them hidden.

The third, and worst of all, is that "all callbacks must be defined before the server starts". This has a lot of practical implications and directly bars some very common uses such as the "add another item" pattern. The suggested workaround from the documentation and forums is to define all possible elements beforehand and keep them hidden (see this extreme example: callback for dynamically created graph185).

If you noticed that the previous two points ended with "define and keep hidden", then you are in a good spot. This is probably the most common pattern in Dash (hereafter referred to as the "hidden div" pattern/solution) and is the suggested way for moving data across callbacks, having callbacks share the same data, caching, and reducing network traffic. Which brings us to our next topic: novice Dash developers who want to pass data across functions will be tempted to use global variables, which usually works fine if only one user uses the app, but that is often not the case. Variables in the global scope are not scoped on a per-session basis, which will lead to one user overwriting another's data186. We dealt with this both with the hidden-div solution (e.g. reporting metrics for trained models) and with Redis. A minimal sketch of the hidden-div pattern is given below.
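Below is a minimal sketch of the hidden-div pattern (ids and the processing step are illustrative, not EDA Miner's actual code): one callback writes the processed data as JSON into a hidden div, and other callbacks read from it.

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import pandas as pd

app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="dataset-choice",
                 options=[{"label": s, "value": s} for s in ["iris", "wine"]]),
    html.Div(id="hidden-data", style={"display": "none"}),  # invisible storage
    html.Div(id="summary"),
])

@app.callback(Output("hidden-data", "children"),
              [Input("dataset-choice", "value")])
def load_and_store(name):
    # The expensive step is done once; the result lives in the user's page.
    df = pd.DataFrame({"value": range(10)})  # stand-in for actually loading `name`
    return df.to_json()

@app.callback(Output("summary", "children"),
              [Input("hidden-data", "children")])
def show_summary(json_data):
    if not json_data:
        return "No data yet."
    df = pd.read_json(json_data)
    return f"{len(df)} rows loaded."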
As was mentioned briefly before, using custom JavaScript is not well-integrated into Dash, but it is possible. However, the best way to extend Dash is to use React to develop custom components. This tends to increase complexity quite a lot and usually results in a lot of bloat (node packages), but great resources exist to help with both of these issues. Adding scripts is similar to adding CSS187, but even then there are limitations (e.g. due to the way Dash elements are rendered, scripts might be unable to find some of them, thus causing errors). Technically, using other JS frameworks is not supported188, but it can be made to work (both in Dash, and within React when designing custom components). Furthermore, it needs to be stressed again that the price paid for dealing with the issues mentioned above is more than compensated by how Dash rewards you with interactivity. Despite all the problems that forced us to find workarounds and hacks, Dash is still incredibly expressive, and this is what makes the whole project interesting.

185 https://community.plot.ly/t/callback-for-dynamically-created-graph/5511/4 186 https://dash.plot.ly/sharing-data-between-callbacks 187 https://dash.plot.ly/external-resources 188 https://dash.plot.ly/faqs

In terms of judging the size of the codebase in source lines of code, although it is still a work in progress and lacks in various aspects, this project ranks pretty well compared to its peers: it consists of only ~6.2K lines of Python code. Here is a small list of other, similar software:
• DIVE: 32K (back-end: 9.2K, front-end: 22.8K)
• Orange3: 74.7K
• RapidMiner Studio: 383.2K
• Kibana: 704.4K

Before moving further we would like to stress the last point a bit more. One side-benefit of the brevity of the code is that it is easy for newcomers to go through, and for maintainers and contributors to debug. We are actively using the Looks Good To Me (LGTM189) tool (which also provided the line count above) to monitor code quality and fix spurious parts of our code (see the illustrations below).

189 https://lgtm.com/

Figure 26: LGTM: EDA Miner comparison to other Python projects, https://lgtm.com/projects/g/KMouratidis/EDA_miner/context:python

Figure 27: LGTM: EDA Miner alerts & code quality, p.1

Figure 28: LGTM: EDA Miner alerts & code quality, p.2

One good example of the code's expressiveness is how we have implemented navigation throughout the app. In earlier versions (prior to v0.2), the user navigated through sub-apps using three HTML/Dash tabs (the choice doesn't matter; buttons, URLs, or anything else would work too), each of which was a layout (a list of Python & Dash components) returned by a central callback. Each of the apps had an identical callback managing its own sub-pages. As a result, the whole project was a big one-page app that was quite fast (see the illustrations below for before/after). However, since v0.3 the need for compartmentalization and scaling led to breaking up each app into its own Dash app (we kept the navigation inside apps the same), and all of them are connected at the WSGI level with the main Flask app (a minimal sketch of this is given below). It is now very easy for completely independent Dash or Flask (and potentially Django) apps to be integrated, and it is also quite easy to share Flask extensions across apps (e.g. the login, the main database). Performance took a considerable hit though and, ironic as it may seem that performance suffered when one of our goals was scaling, the irony is not entirely warranted: the main driver of the performance hit is that the distribution of static files (which are now smaller on a per-app basis) has not been optimized, and, secondly, each app is essentially its own server. This points to a need for more restructuring.
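A hedged sketch of how independent Dash apps can share one Flask server at the WSGI level (module, path, and layout contents are illustrative, not EDA Miner's actual ones):

import dash
import dash_html_components as html
import flask

server = flask.Flask(__name__)

# Each "mini-app" is its own Dash instance mounted on the shared Flask server.
data_app = dash.Dash(__name__, server=server, url_base_pathname="/data/")
data_app.layout = html.Div("Data app")  # each app defines its own layout and callbacks

viz_app = dash.Dash(__name__, server=server, url_base_pathname="/visualization/")
viz_app.layout = html.Div("Visualization app")

# The shared Flask object is the WSGI entry point, e.g.: gunicorn wsgi:server
if __name__ == "__main__":
    server.run(debug=True)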

Illustration 30: EDA Miner, as of v0.4.1

Figure 29: EDA Miner, app navigation, pre-v0.2 vs. v0.3+

Figure 30: EDA Miner, app navigation code, pre-v0.2 vs. v0.3+

At the start, we did not intend for the apps to be connected as they are today, at the WSGI level, but rather for each to be a completely independent entity connected via an HTTP proxy server. This would allow for much easier load balancing across the whole app, but also within apps, especially as a containerized application with Docker-Swarm (or similar) integration. The problem with this approach is that we would need to implement an inter-app authentication and authorization system (which could of course be implemented in a less-correct but much simpler way), as well as the actual HTTP proxy configuration. As a result, this was left for later than v0.4.

Figure 31: EDA Miner architecture, high-level overview of v0.3 - v0.4, and future plans.

4.2 Per-app implementation notes

4.2.1 Data app

The Data app is responsible for handling all data ingestion and modification operations. Currently four processes are supported, each under its own tab. We will quickly review each one of them, going into detail only where necessary, and also discuss future plans and extensions.

First is the "Upload Data" tab, where users can upload data with a simple button or a drag-n-drop process. We used dcc.Upload190 with the parser given in the docs, and added support for a few more file types using Python's and pandas' functionalities. Additionally, we used Weston John's fork191 of dash_resumable_upload to handle larger files (we tried file sizes of up to 1 GB).

Second is the "Connect to API" tab, which takes user credentials from a form, or obtains authorization with some other method (e.g. uploading credentials, or via OAuth2), and presents a user interface so that the user can select the data they want. For example, for Twitter it presents an input field where the user can type an account name, and the app will fetch that account's tweets. Initially, when developing this, one had to write code in a few different places and include Dash elements and callbacks in the layout for every new API connection. Instead, everything is now abstracted away as follows: programmers are expected to implement an abstract base class called APIConnection, and the rest of the code uses its data and functions to construct the UI. This is easily achieved in Python with the inspect module, which has methods that provide access to function signatures. The base class provides the following functions: save_data_and_schema (given a dataframe and a name, it infers the schema and saves the data) and render_layout (depending on whether the user has previously connected to this API or not, it returns a form for getting authorization or a UI for fetching data). The implementer is expected to provide 2 methods, 1 static method, and 2 properties. The two properties are simply Dash layouts (the authorization form and the data-fetching UI).

190 https://dash.plot.ly/dash-core-components/upload 191 https://github.com/westonkjones login including changing the state of the object, and fetch_data which handles the requests for data from the connected API. Finally, there is a static method called pretty_print which given a dataframe (after fetch_data success) it displays the first few results in a user friendly manner. Each of the implementation classes is used to create objects for each user attempting a connection, and hold all of the relevant data for that connection. The existing implementations use memoization to lessen the load on the server, and future contributors are advised to follow this example but adjust the cache expiration timer according to each API‟s needs (e.g. Quandl‟s data don‟t change as often

Figure 32: EDA Miner / Data / Connect to API as Twitter‟s).

Next on the list is the “View Data” tab, which aspires to be a spreadsheet. Currently, you can only inspect, use filters, and sort your data. Behind the scenes it uses Dash‟s DataTable library. Immediate plans include allowing for editing cells and adding rows and columns. In the long-term, we want to implement creation of new columns using custom formulas.

Figure 33: EDA Miner / Data / View Data

Finally, there is the "Edit data schema" tab, where you can inspect the inferred schema for uploaded datasets and correct any mistakes. It is simply an HTML table with a few dropdowns. Let's quickly go over how this was implemented.

Figure 34: EDA Miner / Data / Edit schema

Having knowledge of the data types can help guide visualization and modeling recommendations, as well as preemptively inform the user of potential actions for data preparation. The DIVE paper mentions this very important aspect and how all the major tools do it (take a peek at how DIVE does it here192). We do so too, but we take a slightly different approach. We chose not to focus too much on getting the data schema as accurate as possible; we simply defined a few simple rules to handle basic types and some "subtypes" (e.g. binary, latitude, email), and leave it to the user to correct any mistakes. We decided on the types based on the types of possible visualizations (for example, a latitude column can be drawn in a scatterplot but wouldn't make sense in a pie chart). In code, it mainly works by trying to explicitly convert columns to certain data types and by matching column names against a list of predefined keywords (e.g. "year", "age", "gender"). The general idea is that user corrections will later be used as explicit feedback to train a machine learning model that, given the first few rows and the column names, can make accurate predictions. Here's an overview of the types and subtypes that are currently handled:

• Types
  o integer
  o float
  o date
  o categorical
  o string
• Subtypes
  o binary (categorical)
  o longitude (float)
  o latitude (float)
  o email (string)
  o ipv4 (string)
  o ipv6 (string)
  o mac address (string)

192 https://github.com/MacroConnections/DIVE-backend/blob/29bc55ff82718238164425232bdcc24ce6e3e113/dive/worker/ingestion/type_detection.py

Part of this is achieved by keeping track of this information in the data schema. Our data schema is a dictionary comprising three elements: a dictionary mapping each column name to a type, another mapping each column name to a subtype, and a pandas.DataFrame with the first five rows. We decided on this format because it is lightweight and covers what we need these data for. That said, we didn't come up with it ourselves; it is a mutation of the format proposed by Frictionless Data, and we even defined a quick helper to convert it to a format compatible with their Table Schema specification193. A rough sketch of the rule-based inference and the schema format is given below.
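An illustrative sketch of the rule-based inference and the schema format described above (the exact rules and keyword lists in EDA Miner differ; the names here are assumptions):

import pandas as pd

KEYWORD_SUBTYPES = {"email": "email", "lat": "latitude", "lon": "longitude"}

def infer_column(col: pd.Series, name: str):
    # Try explicit conversions first, then fall back to keyword matching.
    try:
        pd.to_numeric(col)
        ctype = "integer" if (col.astype(float) % 1 == 0).all() else "float"
    except (ValueError, TypeError):
        ctype = "categorical" if col.nunique() < 10 else "string"
    subtype = next((s for k, s in KEYWORD_SUBTYPES.items() if k in name.lower()), None)
    return ctype, subtype

def infer_schema(df: pd.DataFrame):
    types, subtypes = {}, {}
    for name in df.columns:
        types[name], subtypes[name] = infer_column(df[name], name)
    # The schema keeps the two mappings plus the first five rows.
    return {"types": types, "subtypes": subtypes, "head": df.head()}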

4.2.2 Visualization app

The visualization app is implemented in a similar way in terms of layout and design patterns. It currently contains six tabs, each responsible for a different type of visualization.

First on the list is the "ChartMaker". Inspired by Plotly's Online Graph Maker194, this tab allows users to create a custom graph from their datasets by adding new elements, or traces, that use the selected dataset's variables. The user can then export the created figures, which become accessible elsewhere within the app, e.g. for presentation purposes.

Second, the "Key performance indicators" sub-page is designed to help users create and track their own metrics, or use prebuilt ones. Currently the only KPI implemented is a baseline (see the illustration below). Our implementation of the baseline uses a moving average, an exponential moving average, an alternating least squares baseline195, and the baseline calculated with the Python package peakutils. We combine them using a formula (shown in the figure below) that we found through trial and error and did not test thoroughly.

Figure 35: Baseline formula
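The exact combination formula is the one shown in the figure above and is not reproduced here; the sketch below only shows how three of the baseline components could be computed (window, span, and degree values are illustrative, and the alternating least squares baseline is omitted):

import numpy as np
import pandas as pd
import peakutils

y = pd.Series(np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.1, 200))

moving_avg = y.rolling(window=20, min_periods=1).mean()
exp_moving_avg = y.ewm(span=20).mean()
peakutils_baseline = pd.Series(peakutils.baseline(y.values, deg=3))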

193 https://frictionlessdata.io/specs/table-schema/ 194 https://plot.ly/create/#/ 195 https://stackoverflow.com/a/50160920/

Figure 36: EDA Miner / Visualization / ChartMaker

Figure 37: EDA Miner / Visualization / Key performance indicators

The third sub-page is "Maps & Geoplotting". Currently it supports three map types: a simple geoscatter (points on the map), a choropleth (points, plus aggregation with coloring per country), and lines on a map (drawing lines between given from-to points). Also, three map projection styles are supported: equirectangular, azimuthal equal area, and orthographic. This, just like most other visualizations, uses Plotly's library, but future plans include integration with other mapping libraries and providers (e.g. mapbox, leaflet, folium).

Figure 38: EDA Miner / Visualization / Maps & Geoplotting

The next two tabs provide only basic functionality, and improvements to them are not in the intermediate development plans. They are "Network graphs" (visualizing graph/network data) and "Text Visualizations" (currently this only creates word-clouds but will include word-vector and other visualizations).

Figure 39: EDA Miner / Visualization / Network graphs

The final tab, "DashboardMaker", is still incomplete. Its purpose is to serve as a drawing board where users can place their graphs, add text, and maybe some controls, so as to create interactive interfaces for presentations and stories. This is mostly inspired by Microsoft's Power BI as well as DIVE's visual narratives. Currently it uses react-rnd196 to create a Dash-compatible component that we use to wrap elements that are supposed to be resizable.

4.2.3 Modeling app

Finally we come to the last of the three main apps, the one responsible for all the machine learning and statistical modeling. Similarly to the previous two, it contains tabs to navigate across the sub-pages. We will review them in reverse order, since the first two are more complex.

196 https://github.com/bokuweb/react-rnd

The last tab, "Single model", provides a simple menu consisting of dropdowns for the user to pick a dataset, a problem type (e.g. regression, classification), a model (e.g. "KNN classifier"), and then two more for picking the regressor (X, multiple allowed) and target (Y) variables. It then fits the model and returns metrics according to the problem type (e.g. for classification it returns a classification report and the accuracy, for regression the MSE) and provides an interactive visualization.

Figure 40: EDA Miner / Modeling / Single Model, Visualization

The second tab, "Pipelines trainer", has the exact same interface for the results of model fitting but differs in how model selection works. The models in this case are pipelines exported from the ModelBuilder (see below), which also means that the datasets need to be defined there (although this might change later). As soon as you select a pipeline and the X/Y variables, the model is trained and the results are displayed. The model is then saved in Redis as a pickled object for one hour. If the user chooses to "export" the model, it is saved permanently (using PERSIST from Redis). Exported models become available in the Model Devops app (not analyzed here, as it is still in an early stage of development).

Figure 41: EDA Miner / Modeling / Single Model, Text metrics

The first tab is the "ModelBuilder". Let's first inspect the front-end implementation. It provides a small menu that lets you define a complex model / pipeline. The pipeline is implemented as a directed graph (for now it is expected to be acyclic) using dash-cytoscape. The nodes are the various model classes (estimators in sklearn terms) that map 1:1 to sklearn-like model classes, while the edges simply define links among them. The nodes belong to exactly one of the following categories, or "parents": Inputs (choosing an input source), Cleaning (not implemented), Preprocessing (e.g. StandardScaler, MinMaxScaler, PolynomialFeatures), Dimensionality Reduction (PCA, NMF, TruncatedSVD), and Estimators (e.g. regressors, classifiers). Each of these categories has an order parameter symbolizing its place in the pipeline (a higher order means closer to the output). Adding and removing nodes is handled with dropdowns; connecting nodes requires clicking two or more of them (both the click order and the order parameter matter) and then clicking the "Connect selected nodes" button; and there is also another dropdown where the user can select a prebuilt model tailored for a specific task. When the user finishes with the model definition, a "Save graph and export pipelines" button both saves the current graph and creates one separate pipeline for every output (Estimator) node.

Let's peek into the implementation and the class hierarchy in more detail. We have a total of 6 classes: Node, NodeCollection, Edge, EdgeCollection, _Graph, and Graph. The nodes are implemented as data classes with Python's __slots__ for faster attribute access and lower memory consumption, plus node-only attributes. Nodes' model classes are expected to provide a typical sklearn interface (that is, to implement train, fit, and predict methods). Each node stores its position (xpos, ypos), its parent and the parent's order, a label, a node_id (e.g. "linr_001"), and a node_type (a string that maps to an sklearn-like class, e.g. "linr"). A NodeCollection contains Node objects, points to the Graph it is part of, holds information about how many nodes of each node_type it contains, and defines a few helper methods for creating, removing, and adding nodes.

Figure 42: EDA Miner / Modeling / ModelBuilder / Class hierarchy

In a similar fashion, the Edge class holds a source and a destination node, and whether the edge is bidirectional or not (all edges are directed), while the EdgeCollection holds Edge objects, a pointer to the parent Graph, and utility methods to create and add new edges. The _Graph class is the old implementation of the Graph class, which will probably be removed once the current code refactoring is finished; it contains a NodeCollection and an EdgeCollection. Finally, the Graph class contains the _Graph object and keeps track of the input and output nodes. It defines a dispatch method that uniformly handles all the other methods and allows us to create Dash callbacks and UI with ease, like in the Data app for APIs (see above). Every object provides a render method which returns its representation for plotting in dash-cytoscape. A minimal sketch of the Node class is given below.
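A minimal sketch of the Node class described above; __slots__ gives faster attribute access and lower memory use, and the defaults and exact render format are assumptions.

class Node:
    __slots__ = ["xpos", "ypos", "parent", "order", "label", "node_id", "node_type"]

    def __init__(self, node_type, node_id, parent, order, label="", xpos=0, ypos=0):
        self.node_type = node_type  # maps to an sklearn-like class, e.g. "linr"
        self.node_id = node_id      # e.g. "linr_001"
        self.parent = parent        # category: Inputs, Preprocessing, Estimators, ...
        self.order = order          # the category's place in the pipeline
        self.label = label
        self.xpos, self.ypos = xpos, ypos

    def render(self):
        # Representation consumed by dash-cytoscape.
        return {"data": {"id": self.node_id, "label": self.label, "parent": self.parent},
                "position": {"x": self.xpos, "y": self.ypos}}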

Figure 43: EDA Miner / Modeling / ModelBuilder

A few important points must be stressed. When it comes to connecting nodes, for now we allow connections only among nodes belonging to the same category, or among nodes where the source node's order is smaller than the target node's (e.g. from a Preprocessing node you can only connect to other Preprocessing, Dimensionality Reduction, or Estimator nodes). Also, graphs must be Directed and Acyclic ("DAGs"), so support for Markovian models is not there yet. Furthermore, the way these Graphs are turned into actual pipelines is not fully implemented, in the sense that some core features are still missing: multiple input nodes are not handled well (e.g. no table-union nodes, just dataframe concatenation along one axis), and both multiple-input and multiple-output graphs are still not well-tested. The reason for this is that we have not yet implemented a proper parser; instead we rely on a recursive _traverse_graph function that uses sklearn's FeatureUnion (when one or more nodes are the input of another) and Pipeline (to actually connect the output of the FeatureUnion with the next node) to create a model (see the sketch below). While this approach has simplified coding, it is not without errors (at least until a more careful implementation of the Graph class is done). We also did not implement graph traversal ourselves (and won't until it is absolutely necessary); for that we convert the Graph into a networkx.DiGraph object and use its methods. Finally, all model classes, aside from the interface requirements mentioned above, also need to define a modifiable_params dictionary where the keys are the model parameters and the values are lists of available choices (the first being the default). This was done to avoid situations where users could input incorrect or undesirable values (e.g. an n_jobs=1000 that could crash the server).
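The sketch below illustrates (it is not the project's _traverse_graph) how sklearn's FeatureUnion and Pipeline can express a small DAG: two preprocessing branches feeding one estimator.

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    # Nodes whose outputs feed the same node are wrapped in a FeatureUnion...
    ("union", FeatureUnion([
        ("scaled", StandardScaler()),
        ("poly_pca", Pipeline([("poly", PolynomialFeatures(degree=2)),
                               ("pca", PCA(n_components=3))])),
    ])),
    # ...and the outer Pipeline connects that union to the next node (the estimator).
    ("estimator", LinearRegression()),
])

# pipeline.fit(X, y); pipeline.predict(X_new)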

4.3 Database Technologies & Models

In our app we use both SQL and NoSQL databases. The detailed usage of each is described below, but first let's explore an overview of how they fit into the overall architecture. SQL databases interact with the authentication system and the individual Dash apps, providing both long-term storage and data for the main functionalities. NoSQL, and Redis in particular, serves as temporary storage, handles data shared or moved across apps, and also handles temporary links/tokens. Here's an overview:

Figure 44: Databases in the overall architecture

The choice of SQLite was made for two reasons: 1) because of its easy integration with Python and it being a "lite" database, and 2) since we use SQLAlchemy as an Object-Relational Mapping engine, the choice of database doesn't matter much, as we can easily change it without rewriting code across the applications. In fact, for a production environment, and when more advanced uses are needed, we do recommend a more robust SQL server; in particular, we are looking into PostgreSQL. So how are SQL databases used in our application? First, it is important to note that there will be multiple databases: a main database, and one database per user (the latter is not implemented yet, but is a top priority). The main database schema consists of four tables, all of which can be seen in more detail in the illustration below. The first table, User, holds data for user management and, along with the columns listed below, inherits everything from the SQLAlchemy.Model and flask_login.UserMixin base classes (a minimal sketch is given after the figure below). The second table, DataSchemas, contains a pickle with the schema (2 dictionaries and a pandas.DataFrame) for each user dataset and an indicator of whether the schema was auto-inferred or user-provided. The third table, UserApps, contains a row for every app each user has access to. Finally, the fourth table, UserDatabases, keeps track of where each user's database is stored.

Figure 45: Main database schema
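A hedged sketch of the User table described above, using Flask-SQLAlchemy and Flask-Login; any column names beyond the obvious ones are assumptions:

from flask_sqlalchemy import SQLAlchemy
from flask_login import UserMixin

db = SQLAlchemy()

class User(db.Model, UserMixin):
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(64), unique=True, nullable=False)
    email = db.Column(db.String(120), unique=True, nullable=False)
    password = db.Column(db.String(128), nullable=False)  # store a hash, never plaintext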

For each user we also create a database to store their uploaded data. Since we treat everything as a pandas.DataFrame, it is easy to translate everything into SQL, even more so since we are already inferring data types (a minimal sketch is given below). Furthermore, it is possible to get a schema from the user that connects two or more tables, e.g. a table containing a company's list of clients and user data with a table containing tweets or emails from the accounts of the former (see the example illustration below, and a zoomed overview).
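A hedged sketch of writing a dataframe into a user's SQLite database through SQLAlchemy (table and file names are illustrative):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///user_1.db")  # one database file per user
df = pd.DataFrame({"client": ["acme", "globex"], "value": [1, 2]})

# The inferred schema could be passed via `dtype`; here we let pandas decide.
df.to_sql("clients", con=engine, if_exists="replace", index=False)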

Figure 46: Database for user 1, with a custom schema connecting two tables

SQLite3 was an easy choice to make. Redis, on the other hand, was a more difficult choice and one we need to elaborate on. During early development, Redis was chosen primarily because in the Dash docs it is the go-to example of a "shared memory space" needed to pass data across threads and processes in a safe manner197. Their examples only use Redis as a back-end for Flask-Caching. They also use a hidden div that holds a random string (a session id) to make sure each user only accesses their own data. We used the same technique in early development but, for reasons including authorization, permanent storage, and security (a hidden div can easily be modified), we moved to a complete authentication system with the typical email/username plus password login. This is primarily handled with SQL, and Redis is used only for temporarily storing (using the expiration feature) activation links and the like.

197 https://dash.plot.ly/sharing-data-between-callbacks

Furthermore, our main use of Redis (and its main use in general) is as a "global dict", that is, a key-value store that can be accessed by multiple threads/processes. At the point when we started considering other options, it didn't make sense to transition to a similar database like Memcached198 because, even if it added performance gains, it wouldn't add any extra features (in fact, it provides fewer199), and we definitely didn't need them at the time. What was worth trying out were embedded database managers like Kyoto Cabinet200 and the Python built-in DBM module201, on-disk key-value stores like Python's shelve202 (whose main difference from a DBM is its ability to store any Python object that can be pickled), and finally DiskCache, a pure-Python alternative to shelve/dbm that advertises itself as an "on-disk cache"203. As a result, we chose to benchmark only shelve and DiskCache, but first let's see how Redis fares on its own.

Setting aside its popularity, especially among Python programmers, Redis is an incredibly performant database server.

Figure 47: Architecture, zoom into databases

198 https://memcached.org/ 199 https://stackoverflow.com/a/11257333 200 https://fallabs.com/kyotocabinet/ 201 https://docs.python.org/3/library/dbm.html 202 https://docs.python.org/3/library/shelve.html 203 https://github.com/grantjenks/python-diskcache

Here are two lists of benchmarks of 7 Redis commands, measured on localhost, using 1 million requests, with random keys spanning a range of 1 million (so as to stress cache misses). The first sends and receives requests one at a time in a blocking manner, while the second shows how Redis' pipelines can speed up response times (by not waiting for individual replies and reading them all at once204). We are not currently using the pipeline feature in our implementation, so the raw requests per second more closely match our case.

Raw requests per second (all requests completed within 1 ms):
• GET: 152,718.39
• SET: 156,201.19
• LPUSH: 157,282.17
• LPOP: 156,494.53

Pipelined requests per second (-P 100, see appendix for cumulative response graphs):
• GET: 1,240,694.75
• SET: 914,076.81
• LPUSH: 1,216,545.00
• LPOP: 1,225,490.25

Looking only at the raw GET and SET requests, the ~152-156K requests per second translate to one request every ~6.5 μs; however, timing the requests for a single key-value pair from within Python205 seems to introduce additional delay, making them as slow as ~37-38 μs for SET and 40-41 μs for GET (or 24-27K requests per second). In comparison, shelve seems to take about 1.6 - 9 μs for GET and about 1.3 - 5 μs for SET (we provide ranges due to high fluctuation), while DiskCache is at about 7-8 μs and 120-130 μs respectively. So why not use shelve?

When comparing Redis to shelve, five major considerations steered the decision: 1) shelve does not support concurrent read or write access, which could be added by extending it, but that would introduce the need for manual process and thread synchronization and working with low-level locks and/or queues; 2) shelve suffers from the same limitations as DBMs206;

204 https://redis.io/topics/pipelining 205 For details see the attached Jupyter Notebook with the performance tests. These were done with the %%timeit magic command for 50 runs of 1000 loops.

3) as we work with larger objects, the gap between Redis and shelve closes, e.g. for a 1000x1000 pandas DataFrame of random numbers shelve takes about 2.4 ms (GET) and 2.8 ms (SET) while Redis takes about 5.3 ms and 4.8 ms respectively207; 4) Redis already works as a server, which means it can be accessed by many clients over a network without additional effort; and 5) Redis already has a Docker image and a lot of online support for Docker. One could argue that we could have used MongoDB or another solution friendlier to JSON data. While it is true that most of the data we store are Python dictionaries or can be converted to JSON (e.g. pandas DataFrames), we also want to store complex data types (i.e. Python objects). We could do that, too, by extending the objects with constructors and serializers, but that would introduce unnecessary complexity. Also, even for the cases where JSON supposedly would be better, our benchmarks show that serializing datasets to JSON is definitely not faster than pickle (we use dill, an extension to pickle that can handle more objects). The 1000x1000 dataset takes 28-29 ms to save with pickle (~38 ms for dill) and ~85 ms - 1 s for JSON. Changing to MongoDB while still using BLOBs for pickles would rather defeat the purpose. These differences are mainly because the data need to change format first, but that is a significant overhead; and we didn't even account for reading times and extra data copying208. A small sketch of how such timings can be reproduced is given below.
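A rough sketch of how the single-key timings above can be reproduced with the timeit module (the numbers will differ per machine; a local Redis server and a writable shelve file are assumed):

import timeit, shelve
import redis

r = redis.Redis()
r.set("k", "v")
redis_get = timeit.timeit(lambda: r.get("k"), number=100_000) / 100_000

with shelve.open("/tmp/bench_shelve") as db:
    db["k"] = "v"
    shelve_get = timeit.timeit(lambda: db["k"], number=100_000) / 100_000

print(f"Redis GET:  {redis_get * 1e6:.1f} microseconds")
print(f"shelve GET: {shelve_get * 1e6:.1f} microseconds")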

206 https://docs.python.org/3/library/shelve.html#restrictions 207 Interestingly enough, DiskCache becomes faster than both, with 738 μs for GET and 3.65 ms for SET. 208 pandas does tend to perform a lot of copying when initialized, something that does not apply when data are loaded via pickle, which skews benchmarks but doesn't affect the result.

4.4 Directory structure for project & apps, and extensions

We mentioned how v0.3 changed the overall structure of the project and how it is now comprised of individual "mini-apps". We presented how we use Dash and the various designs we employed at the top level. We went over the major apps in chapter 4.2, and reviewed, in chapter 4.3, how these apps use and interact with the various database technologies. What we did not view, however, is how individual applications are structured. In this chapter, we will view the "demo" template app, the directory structure, and the various forms that extensions can take as of v0.4 (if a later version implements the HTTP proxy, extensions can become arbitrary servers in arbitrary programming languages).

4.4.1 Directory structure: project

The project's top-level directory is structured as a typical Python project. It contains a requirements.txt file with all the package dependencies, a project_info.txt file with some automatically calculated statistics about the project (e.g. test and documentation coverage), YAML configuration files for CI/CD tools (.codeclimate.yml and .travis.yml), git and GitHub files (.gitignore, .github folder), and some markdown/text files like the license, the code of conduct, a readme, and the contribution guidelines (which provide about 9 pages of important additional information and are given as Appendix 6). The top-level directory also contains a few more directories: example_data (a few datasets in various file formats), images (screenshots, an older version of the dash callback chain, and a screenshot of the directory structure), tests (a folder containing scripts for unit testing and for testing docker builds), and the main folder containing all of the source code, EDA_miner. The source folder contains a folder for static files and another for templates (HTML files rendered with Flask), one folder for each "mini-app", one folder for the docs (which are an app of their own, found at /docs), and a folder with a custom Dash component, dash_rnd. It also contains the main database (users.db), a file with database management utilities (users_mgt.py), two files with utilities for all the apps (utils.py and exceptions.py), two files for defining forms (forms.py) and models (models.py), one for Flask extensions (app_extensions.py), various configuration files (env.py, config.py, layouts.py), and the WSGI file that pulls together all the apps and initializes, for each of them, the main database, the user login, and more (wsgi.py). The output of the tree command in Linux is given in Appendix 7.

4.4.2 Directory structure: Demo app

Small Dash applications and dashboards usually live in a single file (by convention app.py), but Dash provides examples of how to structure larger applications, including how it supports URLs209. We also provide a demo app to serve as a template for adding new functionality210. As said previously, each new app should have its own folder in /EDA_miner/, preferably with a short lowercase name and, optionally, underscores. New apps can of course be single files, but a folder is preferable even for smaller apps. When it comes to structuring these extension apps, we expect two files in the top-level directory and a folder containing the app. The files are: an x_requirements.txt file (x stands for the app name) and an x_server.py, which is there mostly as a dummy for the case where an app runs in standalone mode (also, since apps use relative imports, Python needs the executed script to be in a parent folder). The structure of the app folder itself is largely left to the developer. You could either have a single x/app.py file (as in most Dash apps), or a more refined structure. In the demo app we have the following: a demo/server.py file for handling the app-only configurations (e.g. its own database, if it has one) and for initializing the Dash app;

209 https://dash.plot.ly/urls 210 https://github.com/Kmouratidis/EDA_miner

Figure 47: Demo template-app directory structure (tree)

a demo/app.py file for handling the main page (the dash.Dash instance is imported from here); a demo/__init__.py (so the folder can be interpreted as a Python package); a demo/README.md with information about the app; a demo/assets folder containing any app-only CSS and JavaScript (the equivalent of a static folder); and two files for this particular app, each representing one sub-page or tab, demo/upload.py and demo/view.py, which provide a simple interface to drag-n-drop a file and view it in a table. You can see this graphically in the output of the tree command in Linux (as seen above).

4.4.3 A demonstration: Google Analytics REST API

Extensions to the project can have their back-end written with any technology or versions the developer wants. This will become even easier when we transition to an HTTP proxy solution. For now, all you have to do is write your server logic in whatever framework and write a mini-application that integrates with Flask; essentially, apply the Adapter software design pattern. Here we will view one such example, and we will also review how Redis can be used with it. In our initial implementation we were connecting to Google Sheets via a library called gspread. This library's requirements were compatible with the recent versions of Google's libraries for OAuth2 (specifically oauth2client v4.1.3), but the one we wanted to use for Google Analytics wasn't (it needed oauth2client v1.5.2). Needless to say, these two versions are completely incompatible. It seems that the primary way of dealing with this is messing with paths and/or package naming211, both of which seemed too tedious and would probably make deployment harder. Instead, we came up with the idea of packaging the connection to Google Analytics as a REST API deployed in its own Docker container. When we transitioned to a custom implementation of connecting to Google Sheets this micro-service was no longer needed (no package dependency conflicts), but we kept it as an example of yet another way to integrate services and APIs, potentially with back-end languages other than Python. REST principles were mentioned in the previous chapter, so we won't go over them here. Instead, we will go over a few benchmarks and the intended usage.

211 https://stackoverflow.com/a/6572017/

We tested our Google Analytics REST API using an HTTP load generator (hey212). The service is built with Flask and uses memoization as a caching policy (per user and metric) with Flask-Caching and Redis. We tested it with 10,000 requests and 50 concurrent clients, requesting the same (one) metric for the same (one) user, each request transferring a total of 67 bytes. The results show that the slowest request took 3.2022 s (base value) and the fastest 0.0331 s (~96x difference), with the average at 0.0522 s (61x), equating to ~955 requests per second. In essence, only the first batch took significant time to fetch, and the rest were retrieved from the cache. Running the same query again (which is now cached) returns a slowest of 0.0425 s (75x), a fastest of 0.0041 s (781x), and an average time of 0.0361 s (88x), equating to ~1,381 requests per second. You can find both reports in Appendix 5.

We mentioned previously that we kept this as an example, so let's delve into it a bit more. Since this micro-service receives and dispatches data via HTTP methods, it can be implemented in any programming language with whatever framework (which is one of the benefits of micro-services), and using universal formats like JSON makes it language-agnostic. Although it currently connects to the central Redis server (this is considered an anti-pattern; see (Richardson, 2019, p. 40), (Fowler, 2014), and (Schmitz, 2017)), it can easily be modified to have its own connection. The usage splits into two parts: the server and the client. Server-side, the app accepts only two routes, with one method each. First, you are expected to send a POST request to the main page ("/") with the user_id (username), and the private_key and client_email (credentials from Google); the server then saves that info and returns a 201 status code to indicate success. Then you can send GET requests to a URL defined by the comma-separated list of metrics you want and the user_id ("/my_user_id/pageviews,sessions"). The server returns 401 ("Unauthorized") if the user is not authenticated, 400 ("Bad Request") if no metrics were given, and 200 ("OK") if the request was successful. Client-side, the programmer is expected to implement an interface (an abstract base class, APIConnection) and handle the sending of requests within that. Our implementation is just a wrapper around requests.post and requests.get with some custom logic for displaying data, and a menu for requesting data. A rough sketch of the server-side routes is given below.
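A rough sketch of the two server-side routes described above (not the actual service); credential storage and the Analytics call are simplified stubs, and in the real service the results are memoized via Flask-Caching and Redis.

from flask import Flask, request, jsonify

app = Flask(__name__)
credentials = {}  # user_id -> Google credentials (kept in Redis in the real service)

@app.route("/", methods=["POST"])
def register_user():
    data = request.get_json()
    credentials[data["user_id"]] = {"private_key": data["private_key"],
                                    "client_email": data["client_email"]}
    return jsonify({"status": "created"}), 201

@app.route("/<user_id>/<metrics>", methods=["GET"])
def get_metrics(user_id, metrics):
    if user_id not in credentials:
        return jsonify({"error": "Unauthorized"}), 401
    metric_list = [m for m in metrics.split(",") if m]
    if not metric_list:
        return jsonify({"error": "No metrics given"}), 400
    # Here the real service queries the Google Analytics API.
    return jsonify({m: 42 for m in metric_list}), 200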

211 https://stackoverflow.com/a/6572017/
212 https://github.com/rakyll/hey

5. Conclusions & future work

It has been difficult getting this project this far. I had to overcome architecture and design issues and think ahead about matters like scaling, security, and even possible business models for anyone wishing to continue development efforts. The following paragraphs comment on the issues faced, as well as suggestions for future work.

The biggest hurdle in this project has been my lack of experience, which made everything more time-consuming. There are a lot of processes which could have been improved but weren't, lots of code that wasn't well-written at first, limited use of design patterns, and the lack of a clean project architecture. A related issue has been writing clean code. Using Dash's model for all the interactivity means that the whole codebase needs to conform to a certain pattern, which doesn't necessarily reflect well on the back-end logic. Another problem regarding code quality has been that, at times, development of components was hurried, perhaps a bit too much (e.g. trying to complete a new feature before a presentation). Most of these issues have been dealt with since v0.3 came out, but future contributions should take some time to revise existing code and use more rigorous procedures, including cleaning up the code.

Speaking of procedures, almost the whole codebase was contributed by me, which in turn meant that the use of testing, documentation, version control, and many other good practices may have been too superficial. A team that decides to pick up where this project left off would need to implement proper code review with branches and pull requests, monitor code quality more closely, better handle CI/CD, continue writing docstrings and add more generic documentation (including for the overall usage), manage project requirements, and, last but certainly not least, continue with serious unit-testing and, in general, greater test-writing efforts.

Speaking of incomplete work, a lot of decisions have been deferred implicitly or explicitly (e.g. the SQL database choice), while for some others we avoided hard-committing (e.g. the one-app vs multi-app Dash architecture is easy to reverse). This has mostly been a conscious stance, following the words of Robert Martin that "good architecture defers decisions" (Martin, 2015). However, a more serious and calculated approach is needed, as some of these decisions should definitely be made. Some of the mini-apps that we went over in the previous section are not finished, and some that we didn't go over but are currently in the live demo and/or GitHub repository may even lack a complete implementation. We will go over these next.

First, the Data mini-app now has extra functionality to allow uploads of larger files, reportedly in the gigabytes. Although we briefly mentioned this before, these files are not actually handled yet, as simply loading them into Redis would be wasteful, while reading them into memory to perform visualization and modeling would also cause problems. The use of this component has also not been tested much. Viewing data is also not fully implemented, as it currently does not handle editing the data. The schema tab doesn't handle paging either, and it should also have a lot more choices for sub-categories. A very interesting new direction is creating integrations and solutions that involve operations on data, for example using technologies such as Elasticsearch (e.g. text search) or Dremio (e.g. data cataloging), and perhaps adding some data management functionality for creating new columns with custom features (e.g. created from user-specified formulas) or specifying more complex schemas connecting datasets in an SQL fashion. The latter could be achieved by using dash-cytoscape (or whatever the new tool for the Model Builder is). This is especially important since the only possibility of connecting datasets right now is to add two or more of them in the Model Builder while taking care that their dimensions are aligned either in rows or columns (and thus can be concatenated with pandas).

Next, the Visualization mini-app needs a lot more improvements. First and foremost, the ChartMaker needs to handle cases like two traces sharing the X-axis but having a secondary Y-axis with different values, having no common axis at all, or plotting multiple graphs at once in a grid. The "Key performance indicators" tab needs to implement more KPIs (including user-defined ones). The "Maps & Geoplotting" tab should be extended to allow connections with other mapping libraries and APIs, and become more optimized; the same applies to the "Network graphs" tab. The "Text visualizations" tab can only create a word cloud, so that needs additional options as well. Additionally, the "Dashboard Maker" still needs to be implemented (and corrected), or perhaps dropped in favor of the Presentation mini-app (see below). All of the above are in serious need of UI/UX improvements. Finally, a very interesting proposition is to allow integrations with other commercial tools, e.g. exporting data to Qlik.

The Modeling app is perhaps the best part of the app. It also needs UI/UX improvements, especially the "Model Builder", which should probably be reworked completely using a better interface (and probably lots of JavaScript) than the current one built with dash-cytoscape. Part of the code cleaning effort should also go here. It is important that future contributions that seek to scale this project are able to handle partial/incremental training all the way from reading data to potentially augmenting sklearn's models. Future contributions could also mean adding more models and integrations with other libraries, as well as adding a model builder for deep learning. Also, new model classes could be added that just provide utilities (e.g. connecting datasets, plotting, progress reports).

Two new mini-apps didn't make it into the previous chapter, since they are still in the proof-of-concept stage and non-functional. The Presentation mini-app is meant to become a different way to create and present exported visualizations, stories, and reports, in much the same way as presentation software and competitor tools handle it. This app needs to be implemented either by creating new Dash components using React, or by implementing a custom interface from scratch (which still needs to get graphs from Dash). The other demo mini-app, named "DevOps", is meant as a demonstration of how the models built and trained within the Modeling app can become useful in other applications. Currently, it only provides the option to download a trained model as a pickle, and showcases (with a graphical interface) how a REST API exposing the models should work, providing endpoints to predict new data with the selected model, retrain it, or download it. While implementing the rest of this shouldn't be too hard, a few interesting (or necessary) directions would be allowing users to send their own models to train, making sure to provide meta-data (e.g. the required schema) to the users of the API, taking special care to implement security and authentication correctly, and maybe adding some functionality for non-technical people, like auto-generating embeddable forms for inference (which should help with building innovative apps).

Although we discussed a lot about what was and was not created, there are still a few more items, but they need to be mentioned separately since they represent a new category: lost opportunities. The biggest "lost opportunity" was that no 'intelligence' was infused in the software. Contrary to the DIVE paper, we did not implement a visualization recommendation (let alone a mixed-initiative) system. Neither did we implement ML model recommendation. These were some of the important initial goals of the project that got sidelined by the need to focus on more pressing issues regarding the main functionality of the applications. Working on this, though, is well-defined and could be implemented in different ways for each mini-app. For Visualization, I was considering two ways: 1) a large collection of heuristics can be used to define a subset of all possible visualizations that are potentially interesting, and 2) treating this as a machine learning problem where column types/subtypes are mapped to potential visualizations. Both techniques can be used, either by using one to refine the other, and/or by asking for the user's input. More or less the same applies to the Modeling mini-app, but there is one more way that Modeling can be 'infused with intelligence': collective intelligence. The idea is that rather than training an ML model after collecting user data, you can directly allow the users to interact and assist each other. In either case, implementing these is definitely not impossible, and for the simplest case you don't even need to train a model.

The second miss was my inability to correctly implement an HTTP proxy (neither with werkzeug's ProxyMiddleware nor with Nginx) along with Dash. One of the promises of the v0.3+ architecture was that each mini-app would be able to work and be scaled (and load-balanced) independently. Since this failed, the overall performance of the app is worse than when it was a unified Dash app, without gaining much in terms of cleaner structure. However, the ability to easily extend the whole program with easy-to-make mini-apps still compensates overall. When the HTTP proxy solution is implemented, we expect much bigger performance improvements. That said, an architecture involving an HTTP proxy means that the main Flask app won't be able to share user credentials and app configurations in the same way, so more components might need to be added to the architecture to account for that.
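As an illustration of the heuristic approach described above (option 1 for the Visualization mini-app), a minimal sketch could map inferred column types to candidate chart types. The type labels and rules below are assumptions for demonstration, not part of the existing codebase.

# Illustrative sketch only: a tiny rule-based visualization recommender.
# The column type labels ("numeric", "categorical", "date") are assumed to come
# from the schema-inference heuristics; the rules themselves are examples.

def recommend_charts(schema):
    """Map column types to a list of potentially interesting chart suggestions."""
    numeric = [c for c, t in schema.items() if t == "numeric"]
    categorical = [c for c, t in schema.items() if t == "categorical"]
    dates = [c for c, t in schema.items() if t == "date"]

    suggestions = []
    if len(numeric) >= 2:
        suggestions.append(("scatter", numeric[0], numeric[1]))
    if dates and numeric:
        suggestions.append(("line", dates[0], numeric[0]))
    if categorical and numeric:
        suggestions.append(("bar", categorical[0], numeric[0]))
    if len(numeric) == 1:
        suggestions.append(("histogram", numeric[0], None))
    return suggestions


# Example with a hypothetical schema as produced by the Data app's heuristics:
print(recommend_charts({"date": "date", "sales": "numeric", "country": "categorical"}))

Such heuristics could then either be shown directly to the user or be used to pre-filter the options of a learned recommender, as discussed above.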

Bibliography

Al-Khoder, A., & Harmouch, H. (2015, 3). Evaluating four of the most popular Open Source and Free Data. International Journal of Academic Scientific Research, 3(1), 13-23.

Barbettini, N. (2018). OAuth 2.0 and OpenID Connect (in plain English). OktaDev. Retrieved from https://www.youtube.com/watch?v=996OiexHze0

ben26941. (2017, 11 21). Weka equivalent of sklearn's pipelines and feature-unions. Retrieved from Stack Overflow: https://stackoverflow.com/a/47416412

Berthold, M., Cebron, N., Dill, F., Gabriel, T., Kotter, T., Meinl, T., . . . Wiswedel, B. (2009, 6). KNIME - the Konstanz information miner: version 2.0 and beyond. ACM SIGKDD, 11(1), 26-31. doi:10.1145/1656274.1656280

ChKwK. (2018, 8 22). What is a good ggplot2 equivalent to Python? Retrieved from Reddit: https://www.reddit.com/r/datascience/comments/996zkv/what_is_a_good_ggplot2_equivalent_to_python/e4ljloz/

Cohen, S. (2015, 1 8). What challenges has Pinterest encountered with Flask? Retrieved from Quora: https://www.quora.com/What-challenges-has-Pinterest-encountered-with-Flask/answer/Steve-Cohen

Datanyze. (n.d.). Business Intelligence Market Share Report. Retrieved 8 2019, from Datanyze: https://www.datanyze.com/market-share/business-intelligence

Demsar, J., Curk, T., Erjavec, A., Gorup, C., Hocevar, T., Milutinovic, M., . . . Zupan, B. (2013, 8). Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research, 2349- 2353.

Erl, T., Carlyle, B., Pautasso, C., & Balasubramanian, R. (2012). SOA with REST: Principles, Patterns & Constraints for Building Enterprise Solutions with REST. New Jersey, US: Prentice Hall.

Fowler, M. (2014, 3 25). Microservices. Retrieved from Martin Fowler: https://martinfowler.com/articles/microservices.html

Frank, E., Hall, M., & Witten, I. (2016). The WEKA Workbench, Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann.

Goodger, D., & van Rossum, G. (2001, 5 29). Docstring Conventions. Retrieved from Python.org: https://www.python.org/dev/peps/pep-0257/

Hagberg, A., Schult, D., & Swart, P. (2008). Exploring network structure, dynamics, and function using NetworkX. 7th Python in Science Conference (pp. 11-15). Pasadena, CA, USA: Gael Varoquax, Travis Vaught and Jarrod Millman (Eds).

Hermann Negri, L., & Vestri, C. (2017, 9 8). peakutils. doi:10.5281/zenodo.887917

Holmes, G., Donkin, A., & Witten, I. (1994). WEKA: a machine learning workbench. Proceedings of ANZIIS '94 - Australian New Zealand Intelligent Information Systems Conference. IEEE. doi:10.1109/ANZIIS.1994.396988

Hu, K., Orghian, D., & Hidalgo, C. (2018). DIVE: A Mixed-Initiative System Supporting Integrated Data Exploration Workflows. ACM SIGMOD Workshop on Human-In-the-Loop Data Analytics (HILDA) (p. 7). Houston, TX, USA: ACM Digital Library. doi:10.1145/3209900.3209910

Hunter, J. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3), 90-95. Retrieved from https://ieeexplore.ieee.org/document/4160265

Industrifonden. (n.d.). Qlik. Retrieved from Industrifonden: https://industrifonden.com/success-stories/qlik/

Jones, E., Oliphant, T., Peterson, P., & et al. (2001). SciPy: Open source scientific tools for Python.

Levy, A. (2013, 5 16). Seattle’s Tableau raises $254M in year’s biggest tech IPO. Retrieved from Seattle Times: http://old.seattletimes.com/html/businesstechnology/2021001333_tableauipoxml.html

Martin, R. (2015, 12 15). Principles of Clean Architecture. Retrieved from YouTube: https://www.youtube.com/watch?v=o_TH-Y78tt4

McKinney, W. (2010). Data Structures for Statistical Computing in Python. 9th Python in Science Conference, (pp. 51-56).

Meurer, A., Smith, C., Paprocki, M., Certik, O., Kirpichev, S., Rocklin, M., . . . Scopatz, A. (2017, 1). SymPy: symbolic computing in Python. PeerJ Computer Science. Retrieved from https://doi.org/10.7717/peerj-cs.103

Microsoft. (2017, 2 16). Microsoft breaks through in the Gartner Magic Quadrant for Business Intelligence and Analytics Platform. Retrieved from Microsoft blog: https://blogs.microsoft.com/blog/2017/02/16/microsoft-breaks-gartner-magic-quadrant-business-intelligence-analytics-platforms

Microsoft. (2018, 2). Microsoft recognized as a leader in analytics and BI platforms for 11 years. Retrieved from Microsoft info: https://info.microsoft.com/ww-landing-gartner-bi-analytics-mq-2018-partner-consent-test.html

nish2288. (2018, 5 27). docker. Retrieved from Power BI Community: https://community.powerbi.com/t5/Desktop/docker/td-p/426116

Occhino, T., & Walke, J. (2013). JS Apps at Facebook. USA: JSConfUS 2013. Retrieved from https://www.youtube.com/watch?v=GW0rj4sNH2w

Oliphant, T. (2006). A guide to NumPy. Trelgol Publishing.

Park, H. (2019, 7). The Death of Big Data and the Emergence of the Multi-Cloud Era. Retrieved from KD Nuggets: https://www.kdnuggets.com/2019/07/death-big-data-multi-cloud-era.html

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2825-2830. Retrieved from http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf

Peters, T. (2004, 8 19). The Zen of Python. Retrieved from Python.org: https://www.python.org/dev/peps/pep-0020/

Qlik. (2019, 2). 2019 Magic Quadrant for Analytics and Business Intelligence Platforms. Retrieved from Qlik: https://www.qlik.com/us/gartner-magic-quadrant-business-intelligence

Richardson, C. (2019, 1). Microservices adoption anti-patterns: Obstacles to decomposing for testability and deployability. Retrieved from LinkedIn: https://www.slideshare.net/chris.e.richardson/melbourne-jan-2019-microservices-adoption-antipatterns-obstacles-to-decomposing-for-testability-and-deployability

Robinson, D. (2017, 9 6). The Incredible Growth of Python. Retrieved from Stack Overflow Blog: https://stackoverflow.blog/2017/09/06/incredible-growth-python/

R-Project. (2019, 10). Contributed Packages. Retrieved from R-Project.org: https://cran.r-project.org/web/packages/

Sanders, R. (2014). Developing Flask Extensions. PyCon 2014. Retrieved from https://www.youtube.com/watch?v=OXN3wuHUBP0

Schmitz, D. (2017, 8 30). 10 Tips for failing badly at Microservices. Retrieved from YouTube: https://www.youtube.com/watch?v=X0tjziAQfNQ

Stack Overflow. (2019). Stack Overflow Developer Survey. Retrieved from Stack Overflow Insights: https://insights.stackoverflow.com/survey/2019#technology

TIOBE. (2019, 10). TIOBE Index. Retrieved from TIOBE: https://www.tiobe.com/tiobe-index/

van Rossum, G., Warsaw, B., & Coghlan, N. (2001, 7 5). Style Guide for Python Code. Retrieved from Python.org: https://www.python.org/dev/peps/pep-0008/

W3C Working Group, Booth, D., Haas, H., McCabe, F., Newcomer, E., Champion, M., . . . Orchard, D. (2004, 2 11). Web Services Architecture. Retrieved from W3.org: https://www.w3.org/TR/2004/NOTE-ws-arch-20040211/#relwwwrest

Walsh, P., & Pollock, R. (n.d.). Table Schema. Retrieved from Frictionless Data: https://frictionlessdata.io/specs/table-schema/

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag.

Witten, Frank, Trigg, Hall, Holmes, & Cunningham. (1999). Weka: Practical machine learning tools and techniques with Java implementations. Computer Science Working Papers, University of Waikato. Hamilton, New Zealand: University of Waikato, Department of Computer Science.

Wu, T. (2019, 6 16). What are the top ten problems for data scientists? Retrieved from Quora: https://www.quora.com/What-are-the-top-ten-problems-for-data-scientists

xello. (n.d.). Power BI Desktop vs Power BI Pro and Premium: What's the difference? Retrieved from xello: https://xo.xello.com.au/blog/power-bi-desktop-vs-power-bi-pro-and-premium-differences

York, R. (2009). Beginning JavaScript and CSS Development with jQuery. Indianapolis, IN, USA: Wiley Publishing, Inc.

Appendix 1, Gartner reports

Gartner Magic Quadrant, 2017, https://blogs.microsoft.com/blog/2017/02/16/microsoft-breaks-gartner-magic-quadrant-business-intelligence-analytics-platforms/

Gartner Magic Quadrant, 2018, https://info.microsoft.com/ww-landing-gartner-bi-analytics-mq-2018-partner-consent-test.html

Gartner Magic Quadrant, 2019, https://www.qlik.com/us/gartner-magic-quadrant-business-intelligence

Gartner Magic Quadrant, 2019, https://rapidminer.com/resource/gartner-magic-quadrant-data-science-platforms/

Appendix 2, table comparison of tools

Appendix 3, sklearn performance

(Pedregosa, et al., 2011, p. 4)

Appendix 4, Redis benchmarks

Appendix 5, Flask + hey Benchmarks

Report 1

Total:        10.4634 secs
Slowest:      3.2022 secs
Fastest:      0.0331 secs
Average:      0.0522 secs
Requests/sec: 955.7123

Total data:   670000 bytes
Size/request: 67 bytes

Response time histogram:
  0.033 [1]    |
  0.350 [9949] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.667 [0]    |
  0.984 [0]    |
  1.301 [0]    |
  1.618 [0]    |
  1.935 [0]    |
  2.251 [0]    |
  2.568 [0]    |
  2.885 [0]    |
  3.202 [50]   |

Latency distribution:
  10% in 0.0342 secs
  25% in 0.0352 secs
  50% in 0.0364 secs
  75% in 0.0374 secs
  90% in 0.0392 secs
  95% in 0.0400 secs
  99% in 0.0424 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0001 secs, 0.0331 secs, 3.2022 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0024 secs
  resp wait:  0.0521 secs, 0.0329 secs, 3.2009 secs
  resp read:  0.0000 secs, 0.0000 secs, 0.0003 secs

Report 2 (cached)

Total:        7.2396 secs
Slowest:      0.0425 secs
Fastest:      0.0041 secs
Average:      0.0361 secs
Requests/sec: 1381.2998

Total data:   670000 bytes
Size/request: 67 bytes

Response time histogram:
  0.004 [1]    |
  0.008 [3]    |
  0.012 [7]    |
  0.016 [4]    |
  0.019 [6]    |
  0.023 [5]    |
  0.027 [5]    |
  0.031 [6]    |
  0.035 [2698] |■■■■■■■■■■■■■■■■■■
  0.039 [6119] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.042 [1146] |■■■■■■■

Latency distribution:
  10% in 0.0338 secs
  25% in 0.0347 secs
  50% in 0.0360 secs
  75% in 0.0373 secs
  90% in 0.0389 secs
  95% in 0.0400 secs
  99% in 0.0411 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0001 secs, 0.0041 secs, 0.0425 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0008 secs
  resp wait:  0.0360 secs, 0.0020 secs, 0.0424 secs
  resp read:  0.0000 secs, 0.0000 secs, 0.0004 secs

Appendix 6, Contributor guidelines

As you are probably aware, this project is almost entirely written in Dash (Python). There are very few JavaScript snippets here and there, and there are also some custom Dash components made with React. For CSS this project uses part of Dash's defaults, some Bootstrap, and of course some custom sheets.

Table of contents

• Learning resources
• Style guide recommendations
• General info on project structure
• General info on contributions
  o Contributions for code quality
  o Contributions for visualizations
  o Contributions for data
  o Contributions for modeling
  o Contributions for deployment and scaling
• List of contributors

Learning resources

• Basics: To write Dash, you just need a basic knowledge of HTML. It is probably impossible to contribute without first going over the Dash docs, which are also the best place to start your reading. Going through their tutorial, and then taking a quick look at each component library listed, should get you up to speed with what you need to know about callbacks, the main feature of Dash, and more. A minimal callback sketch is shown right after this list.
• Visualizations: Dash stands on the shoulders of plotly for its visualizations, so if you are interested in working on visualizations take a look at the Plotly docs page, and/or at D3.js for Dash from the Dash docs.
• Custom components / React: If you're a JS wiz with a React spellbook, then we love you a bit more (you can design your own components!). You can start working with custom components, React, and JavaScript. Take a quick look at what these Dash docs pages say: React for Python Developers, and Writing your own components.
• Machine Learning: There is plenty of material out there and we won't go over it here, but familiarity with sklearn and its programmatic API is strongly desired (see Quickstart, User guide, API reference, Developers guide, especially the Estimators chapter).
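To give a flavor of what a callback looks like, here is a minimal, self-contained toy app (illustrative only, not part of this project); it uses the same import style as the example further below.

# Toy Dash app illustrating a callback: typing in the input updates the div.
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="name", value="world", type="text"),
    html.Div(id="greeting"),
])


@app.callback(Output("greeting", "children"), [Input("name", "value")])
def update_greeting(name):
    # Dash re-runs this function whenever the input's value changes.
    return "Hello, {}!".format(name)


if __name__ == "__main__":
    app.run_server(debug=True)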

Style guide recommendations

You need to write at least somewhat pythonic code. A few good starting points are: PEP-8 and The Hitchhiker's Guide to Python. Also take a quick look at this quick discussion about project structure (mentioned again later).

Regarding imports, it seems to me far easier to find stuff when they are in a particular order, with a line separating each category:

1. The core Dash libraries (e.g. dash, dcc, html, daq, dash_table) and other Dash components, including any custom ones (e.g. visdcc, dash_rnd).
2. Import modules from the app: first server.app, then other high-level modules, then anything else ordered however you like.
3. Import any other libraries necessary, ordered by fancy points.

Example:

import dash_core_components as dcc
import dash_html_components as html
import dash_table
import visdcc

from server import app
from utils import cleanup, r
from menus import SideBar, MainMenu, landing_page
from apps import data_view, exploration_view
from apps.exploration_tabs import KPIs

import turtle
from functools import partial
import pandas as pd

Naming is extremely important. Use plural for lists (e.g. traces = [scatter]) and singular with optional numbering for sole items (e.g. trace1 = scatter). Edit away others' work if necessary, and feel free to suggest more rules that will help us keep our sanity. Also, if you find some names weird, flame the author hard and suggest alternatives.

General info on project structure

The older project structure was a mess, but I hope the new one (post v0.3, aka 19th August 2019) is much cleaner. The app is structured as a main Flask app (responsible for user management, mainly) which connects various other Flask and Dash apps at the WSGI level with Werkzeug's DispatcherMiddleware. If you want to add a new app, all you need to do is follow the instructions in the wsgi.py file (i.e. register the flask extensions and give it a URL path). You can get acquainted with the project by looking at /docs after starting the app, or (equivalently) reading the module and package docstrings. Another way is to visit the /images folder, where you will find an old version of eda_miner_callback_chain.html, which was the Dash-rendered graph showing every function in the project on v0.2 (also found as images, divided over several screenshots). Yet another way is to look directly at directory_tree.png (made with the Linux tree command).
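For readers unfamiliar with WSGI-level mounting, here is a minimal sketch of the idea; the app names and URL prefixes below are placeholders for illustration, while the real wiring lives in wsgi.py.

# Illustrative sketch: mounting several Flask/Dash apps under one WSGI server.
# The app names and URL prefixes are placeholders, not the project's actual ones.
from flask import Flask
from werkzeug.middleware.dispatcher import DispatcherMiddleware

main_app = Flask(__name__)          # the "main" Flask app (user management, etc.)

data_app = Flask("data")            # stand-ins for the mini-apps' underlying Flask
viz_app = Flask("visualization")    # servers (a Dash app exposes one as `.server`)


@main_app.route("/")
def index():
    return "Main app"


# Each mini-app is served under its own URL prefix by the dispatcher.
application = DispatcherMiddleware(main_app, {
    "/data": data_app,
    "/visualization": viz_app,
})

# Run with any WSGI server, e.g.: gunicorn module_name:application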

It is extremely likely that after these you will still want an explainer; don't worry, anyone working on the project will be more than happy to help you (see Contact section). Anyhow, here is the rough idea:

• Top-level modules define the overall app (Flask server, models, forms, flask extensions), including some utils; probably not the place to start playing.
• /static and /templates are the place to go if you feel like adding some custom JS/CSS (or HTML for the main Flask app) or storing your images.
• App folders; these are (almost-)standalone applications in either Dash/Flask (maybe Django?). If you want to create an extension that does something more elaborate or separate than the existing apps, or just want to add your pre-existing Dash/Flask apps, you create a new directory and do it (remember to visit wsgi.py!). Here are three of them:
  o data: This Dash app contains all the logic for the user to handle data, be it uploading data or fetching them from an API. It also detects a data schema following some heuristics, and allows the user to edit it. Found at /data, and needs login.
    - Contents of each app are largely up to you. We will soon release a "mini-app" template with more detailed instructions so you can use that as a reference.
    - Current apps usually include a server.py (Dash config & definition) and an app.py (index page) file, as well as different Python files for each sub-page (we usually use tabs as a navigation method; feel free to deviate).
  o docs is a standalone app that parses the rest of the Python codebase as strings, extracts docstrings, and automatically creates an API reference page found at /docs (no login required).
  o google_analytics is a REST API built with Flask that runs completely independently of the main Flask server. It was initially created as a separate "micro-service" because of an incompatibility with the libraries that we used, but that is no longer the case. Still, we decided to keep it, mainly for demonstration purposes. Go over to /data/data_utils/api_layouts to see how GAnalyticsAPI interacts with it.

Suggestions on project structure are also welcome.

General info on contributions

Almost everything here is a soft suggestion, not a hard requirement. If you're confident in your coding, don't pay too much attention to these guidelines.

"Do the thing" - Varrick, inventor, businessman, rebel.

For now, modules in the top level of the app should probably not be modified, unless you're an experienced dev (optionally with Flask knowledge). For other modules, there are docstrings with a few more words on whether work is encouraged in those modules. Do note: any and every contribution is welcome; those modules are just marked as such to notify beginners to be cautious, or they merely mark areas where we think improvements are needed most - and are easy to implement. Each module as well as each package has its own guidelines (see the "Notes to others" section of each), which are there to give you a general idea of what you could do. That said, docstrings are somewhat outdated as of this writing (19th August 2019).

Advanced users can safely ignore all that. If you want to contribute but don't know where to start, find issues marked with good first issue in GitHub Issues. If you have some standalone code that works but don't know where that fits into the project, open an issue or a pull request. A few examples:

• Stylistic elements (UI/UX, CSS, themes, graph coloring)
• API integrations, data gathering & handling
• Machine Learning algorithms (classes with a sklearn-like API: .fit, .predict, .transform)
• Custom React components (e.g. resize-n-drag).

Also, if you have suggestions open an issue with a "feature request" tag. A few suggestions on good (and potentially easy) contributions that are really needed (and not mentioned below):

• Writing function/class/module docstrings
• Writing tutorials on the usage of the app
• Writing tutorials for helping other contributors
• Writing (unit-)tests
• Benchmarking (parts or all of the app, libraries, other tools used)

Contributions for code quality

With v0.3 I tried quite a lot to improve code quality; that is, to make everything more readable, with more/better comments and docstrings, cleaner and visually appealing, as well as to make it simpler for new contributors to do their thing. I believe I did well (not perfect) with /data/apis, where you have clear instructions on how to make a new API connection and the rest is handled for you. I probably didn't help with /visualization. I know I failed super hard with /modeling/model_builder; a few times it made me want to cry (thank you, pizza, for the emotional support!). Helping here is HUGELY APPRECIATED. Really. If you are a brave soul and attempt it, do not hesitate to spam and flame me with questions till I cry.

Contributions for visualization

Dash graphs are mainly done in plotly, and we don't promise much for other libraries (but fire away anyway!). We have tried our hand at using matplotlib, which somewhat works (plotly has some integration). D3 visualizations should also be possible and are welcome. If you want to wrap some JS library as a collection of React components, then that is great as well. You will also probably find discussions about integrating Dash with other visualization tools (such as folium, e.g. here and here).

Visualization is currently handled by the /visualization app. It is split across 6 modules but might be trimmed in the future. These modules are (and some ideas on what can be done):

• ChartMaker: Handles basic 2D and 3D visualizations which are built using individual traces, much like a poor man's ripoff of Plotly's Online Graph Maker. Other modules (e.g. maps, kpis) might be integrated here.
• KPIs: Handles... KPIs! Currently it only calculates a simple baseline (without any sort of filtering). You could augment the baseline by adding filters (e.g. for promo dates), or new KPIs, or even allow users to create their own.
• Maps: Currently only a few map types are supported: choropleths, geoscatters (and a combination of the two), and "lines on map", all from Plotly. You could add more, provide more options, or even add mapping functionality with/from other libraries and providers.
• Networks: Very basic network/graph visualizations. Currently it has trouble handling more than a couple hundred nodes, so that is one possible way to contribute. Also, styles and interactivity for the cytoscape graph. You might also want to try integration with networkx or other libraries.
• Text visualizations: For now it only handles a simple wordcloud. Feel free to add additional visualizations, but do leave word-vector visualizations for later (or use very small models), since increasing server boot time is an issue.
• DashboardMaker: Create your own dashboards, Power BI style. Not quite, because the drag-and-resize is not yet complete, and a lot of stuff still needs to be done; so that's your cue, React wizards ;)

Contributions for data

Everyone needs data to work with, and currently there are two ways to get data: the user can upload a file, or we can connect to their account somewhere and fetch data via an API. After that, the user should be able to inspect and edit both the data and the inferred schema. Here are a few suggestions on what you could do:

• Connections to APIs: Currently only 6 API connections are supported (Google Analytics, Google Drive / Spreadsheets, Twitter, Quandl, Spotify, and Reddit). For these connections, we only access a very limited subset of what their APIs offer (e.g. only playlists for Spotify), so this is one thing you can work on. If you want to customize the API connections or create a new one, you need to subclass APIConnection from /data/data_utils/api_layouts.py. Basically it needs two Dash layouts (giving credentials, and getting data after successful authentication), a (potentially dummy) method to prettily display fetched results, a connect method (to connect to the API), and a fetch_data method to actually get the data and save it as a pandas.DataFrame (use the inherited save_data_and_schema). A rough sketch of such a subclass is shown after this list.
• Uploading data: A simple box where the user can drag-n-drop files (or click/navigate/open) to upload. A few things can be done here, like adding support for different filetypes (currently only csv, json, xls/x, and feather are supported), or handling large file uploads (see also Dash Resumable Upload).
• Schema inference: Responsible for detecting the data types of the various columns, using (currently) heuristics. You can improve the heuristics, create a better interface for viewing the data (e.g. with pagination), or other (?).
• View data: A Dash DataTable for inspecting the data. Potential improvements here include handling pagination, better editing (e.g. saving edits), and schema-relevant operations (e.g. categorical columns should have a dropdown).
• New features:
  o Ability to connect datasets with a schema, like in SQL.
  o Concatenate datasets (columns/rows).
  o Permanent storage of data fetched from APIs, e.g. by concatenating previous results. This is important.
  o Query the data with an SQL-like syntax (with or without a front-end GUI) or natural language (text2sql).
  o Create new columns/features using formulas. Spreadsheets can handle it, so why not us too? This used to be part of the FeatureMaker class in the ModelBuilder but was removed for convenience.
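The sketch below illustrates the rough shape of such a subclass; the required pieces follow the description above, but the method names, signatures, layouts, and the hypothetical MyServiceAPI class are illustrative assumptions rather than copies of the actual api_layouts.py code.

# Hypothetical shape of a new API connection (not the project's actual code).
import dash_html_components as html
import pandas as pd

from data.data_utils.api_layouts import APIConnection  # assumed import path


class MyServiceAPI(APIConnection):
    """Example connection to a fictional 'MyService' API."""

    def layout_credentials(self):
        # Dash layout asking the user for credentials (names are illustrative).
        return html.Div([html.H4("MyService"), html.P("Enter your API token...")])

    def layout_fetch(self):
        # Dash layout shown after successful authentication.
        return html.Div([html.Button("Fetch my data", id="myservice_fetch")])

    def pretty_print(self, results):
        # A (potentially dummy) method to prettily display fetched results.
        return html.Pre(str(results))

    def connect(self, token):
        # Authenticate against the API; store whatever client object is needed.
        self.token = token

    def fetch_data(self, user_id):
        # Get the data, turn it into a DataFrame, and persist it with the
        # inherited helper so schema inference runs as usual.
        df = pd.DataFrame({"placeholder": [1, 2, 3]})
        self.save_data_and_schema(df, dataset_name="myservice_data", user_id=user_id)
        return df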

Contributions for modeling

This app is responsible for training Machine Learning (Data Mining, Business Intelligence, Statistics, whatever) models. If you want to train a simple model, the single_model.py module handles that case, but if you want something more complex then you would have to go to model_builder.py, define a custom pipeline, and switch over to pipelines.py to train it. Each is accessible by selecting its respective tab.

• Single model: A simple GUI with a few dropdowns to select the dataset and variables, and a div with tabs for showing the various result types (currently 2: text metrics like accuracy, confusion matrix, and MSE, and graphical results).
• Model Builder: Using dash-cytoscape, we use graph nodes to represent the various "estimators" (using sklearn language). Every model class MUST conform to the sklearn API (fit, predict, transform); a minimal example of such a class is sketched right after this list. You can add new models/classes (see /modeling/models/pipeline_classes.py). If you are brave enough, you can try converting cytoscape graphs to models (see /modeling/models/graph_structures.py) and/or creating pipelines from them (see /modeling/models/pipeline_creator.py), as well as training them (see /modeling/pipelines.py).
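As a reference for the Model Builder point above, here is a minimal sketch of a class that conforms to the sklearn estimator API. The class itself (a do-nothing transformer) is purely illustrative and is not one of the project's pipeline_classes.

# Illustrative sketch: a do-nothing transformer that follows the sklearn API,
# so it could (in principle) be dropped into an sklearn Pipeline.
from sklearn.base import BaseEstimator, TransformerMixin


class IdentityTransformer(BaseEstimator, TransformerMixin):
    """Returns the input unchanged; shows the fit/transform contract."""

    def fit(self, X, y=None):
        # Learn nothing, but return self so the estimator is chainable.
        return self

    def transform(self, X):
        # A real class would modify X here (scaling, feature creation, ...).
        return X


# Usage example with an actual sklearn pipeline:
if __name__ == "__main__":
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([("identity", IdentityTransformer()),
                     ("clf", LogisticRegression())])
    pipe.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    print(pipe.predict([[1.5]]))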

Contributions for deployment and scaling

Up to now, deployment has not been much of a concern because I was mostly handling it on my own. However, as the app grows and is tied to other services (Redis, google_analytics, some other SQL soon?) we will need a better way to handle this. Currently I'm working a bit on a docker-compose script, especially since docker seems to be the best way to package and set up the whole project.

The idea behind splitting up the various apps came from two motivations: allowing for easier addition of new apps (which was achieved), and allowing each app to scale independently (which is currently not done at all). Connecting Flask and Dash apps at the WSGI level is easy and allows for the login_manager and other extensions to be connected easily. It is possible that connecting these extensions does not require all apps to be under the same server, but I simply don't know about that stuff and will look into it later on. If not, then each app will have to be separable, get its own URL, and be connected via another dispatcher / proxy (e.g. see Werkzeug's HTTP Proxy). If you have other suggestions, or know a way to pass user session / information in a secure way across Dash/Flask apps, those would be extremely welcome as well.

Scaling and performance are issues I haven't concerned myself with much so far. Since v0.3 some early attempts have been made at improving performance:

• Instead of loading datasets from Redis every time we need to create options according to dataset columns, we load the data schema, which is a small dict growing only with the number of columns. The same principle applies to other parts where data copying has decreased.
• Caching (memoization) and expiration for Redis data were added to a lot of API calls, and will probably be added to a lot more.

That said, a lot of performance upgrades can still be done, including overall scaling. Here are some ideas:

• Some plotly visualizations can benefit from either using OpenGL or performing aggregations in Python before plotting (see the small example after this list). This lessens the burden both for internet bandwidth and the browser, at the potential cost of graphing precision. For more than a few hundred/thousand data points these are probably worth it.
• The pre-v0.2 application could be scaled by simply creating more containers. This can be done now, too, but it would be a waste (and the user database would need some extra handling). Instead, each app should be scaled on its own (see the previous paragraph).
• Currently CI/CD is lacking. I am using a custom Python script to copy files, create docs, run tests, update coverage, and remove non-public code from the private version (as of 19th August 2019 there is no such code: the login has just been integrated, and export to PDF has been dropped completely). Integrations with TravisCI and other tools exist but are at an early stage. I will be looking into this a bit more over the coming months, as well as into tools like Ansible and Jenkins. If you know about these, do lend a hand or some tips.
• Deployment: I used to have it deployed on an old computer of mine, but now it is deployed on Amazon (pre-v0.2), sponsored by Prof. Ioannis Magnisalis (website). Neither of us has looked hard enough into this; a different Amazon service may be more fitting, or a combination of them. Reach out to either of us for suggestions or details.
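To make the first idea concrete, here is a small sketch (illustrative, with made-up data) of switching a large scatter plot to the WebGL-backed trace type and thinning the data server-side before sending it to the browser:

# Illustrative: for large point counts, prefer the WebGL trace (Scattergl)
# and/or downsample the data before handing it to the browser.
import numpy as np
import plotly.graph_objs as go

x = np.random.randn(200_000)
y = np.random.randn(200_000)

# Downsample: keep every 20th point (a simple, lossy aggregation).
step = 20
trace = go.Scattergl(x=x[::step], y=y[::step], mode="markers")

figure = go.Figure(data=[trace])
# figure.show()  # or return `figure` from a Dash callback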

List of contributors

Notice: try to keep them in alphabetical order; edit if you notice inconsistencies! Also, note that most contributions are not visible in the public version of the repository.

• Active (as of 19th August 2019):
  o Gkoustilis George, promotion, and asking tough questions.
  o Magnisalis Ioannis, oversight of the project, including academic and implementation guidance.
  o Mouratidis Kostas, everything related to code.
  o Tentsoglidis Iordanis, mostly the theoretical part, and university promotion.
• Past:
  o Katrilakas George, mostly the theoretical part, and a bit on treemaps.
  o Timamopoulos Chris, mostly the theoretical part, and a bit on the 3D scatterplot.
  o Tsichli Vaso, most of the graphs, a bit on model fitting reporting, LOTS of suggestions for the interface. Patient receiver of my spam.

Appendix 7, Directory Tree

Output of running the following command at the top-level directory:

tree -I "*.png|*.pyc|*pycache*|*.css*|*.jpg*" --dirsfirst .

.
├── EDA_miner
│   ├── dash_rnd
│   │   ├── dash_rnd.dev.js
│   │   ├── dash_rnd.min.js
│   │   ├── _imports_.py
│   │   ├── __init__.py
│   │   ├── metadata.json
│   │   ├── original_source_credits.txt
│   │   ├── package.json
│   │   └── ResizeDraggable.py
│   ├── data
│   │   ├── assets
│   │   ├── data_utils
│   │   │   ├── api_connectors.py
│   │   │   ├── api_layouts.py
│   │   │   ├── ganalytics_metrics.py
│   │   │   ├── __init__.py
│   │   │   └── schema_heuristics.py
│   │   ├── apis.py
│   │   ├── app.py
│   │   ├── __init__.py
│   │   ├── schemata.py
│   │   ├── server.py
│   │   ├── upload.py
│   │   ├── users.db -> ../users.db
│   │   └── view.py
│   ├── devops
│   │   ├── app.py
│   │   ├── __init__.py
│   │   ├── server.py
│   │   ├── upload.py
│   │   └── view.py
│   ├── docs
│   │   ├── all_layouts.py
│   │   ├── app.py
│   │   ├── doc_maker.py
│   │   └── __init__.py
│   ├── google_analytics
│   │   ├── app.py
│   │   ├── Dockerfile
│   │   ├── README.md
│   │   └── requirements.txt
│   ├── modeling
│   │   ├── models
│   │   │   ├── graph_structures.py
│   │   │   ├── __init__.py
│   │   │   ├── pipeline_classes.py
│   │   │   └── pipeline_creator.py
│   │   ├── app.py
│   │   ├── __init__.py
│   │   ├── model_builder.py
│   │   ├── pipelines.py
│   │   ├── server.py
│   │   ├── single_model.py
│   │   ├── styles.py
│   │   └── users.db -> ../users.db
│   ├── presentation
│   │   ├── app.py
│   │   └── __init__.py
│   ├── static
│   │   ├── css
│   │   ├── images
│   │   │   ├── graph_images
│   │   │   └── icons
│   │   ├── favicon.ico
│   │   └── navbar_interactivity.js
│   ├── tasks
│   │   ├── assets
│   │   ├── app.py
│   │   ├── __init__.py
│   │   └── server.py
│   ├── templates
│   │   ├── apps
│   │   │   ├── data.html
│   │   │   ├── modeling.html
│   │   │   └── visualization.html
│   │   ├── base_dash.py
│   │   ├── base.html
│   │   ├── change_password.html
│   │   ├── create_presentation.html
│   │   ├── forgot_password.html
│   │   ├── index.html
│   │   ├── login.html
│   │   ├── profile.html
│   │   ├── register.html
│   │   ├── reset_password.html
│   │   ├── show_presentation.html
│   │   └── user_apps.html
│   ├── visualization
│   │   ├── graphs
│   │   │   ├── graphs2d.py
│   │   │   ├── __init__.py
│   │   │   ├── kpis.py
│   │   │   ├── textviz.py
│   │   │   └── utils.py
│   │   ├── app.py
│   │   ├── chart_maker.py
│   │   ├── dashboard_maker.py
│   │   ├── __init__.py
│   │   ├── kpis.py
│   │   ├── maps.py
│   │   ├── networks.py
│   │   ├── server.py
│   │   ├── text_viz.py
│   │   └── users.db -> ../users.db
│   ├── app_extensions.py
│   ├── config.py
│   ├── data_server.py
│   ├── devops_server.py
│   ├── docs_server.py
│   ├── env.py
│   ├── env_template.py
│   ├── exceptions.py
│   ├── flask_app.py
│   ├── forms.py
│   ├── initialize_project.py
│   ├── __init__.py
│   ├── layouts.py
│   ├── model_server.py
│   ├── models.py
│   ├── presentation_server.py
│   ├── tasks_server.py
│   ├── users.db
│   ├── users_mgt.py
│   ├── utils.py
│   ├── visualization_server.py
│   └── wsgi.py
├── example_data
│   ├── boston.feather
│   ├── churn.csv
│   ├── data.csv
│   ├── gtd_11to14_0615dist.csv
│   ├── gutenberg_sentences.csv
│   ├── iris.csv
│   ├── monthly-milk-production-pounds-p.csv
│   ├── network.csv
│   └── population.csv
├── images
│   ├── screenshots
│   └── directory_tree.txt
├── tests
│   ├── data
│   │   ├── data_utils
│   │   │   ├── __init__.py
│   │   │   ├── test_data_utils.py
│   │   │   └── test_schema_heuristics.py
│   │   └── __init__.py
│   ├── chromedriver
│   ├── coverage_report.txt
│   ├── __init__.py
│   ├── test_app.py
│   ├── test_docker.sh
│   ├── testing_utils.py
│   ├── test_tabs.py
│   ├── test_user_auth.py
│   ├── test_utils.py
│   └── users.db
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── project_info.txt
├── README.md
└── requirements.txt

28 directories, 136 files