Based Platform for Data Analysis, Visualization and Machine Learning

Total Page:16

File Type:pdf, Size:1020Kb

Based Platform for Data Analysis, Visualization and Machine Learning DECEMBER 2019 THESSALONIKI – GREECE Implementation of a web- based platform for Data Analysis, Visualization and Machine Learning Konstantinos Mouratidis SID: 8180012 Supervisor: Prof. Ioannis Magnisalis Supervising Committee Prof.Assoc. Berberidis Prof. Name Christos Surname Members: Prof.Assist. Stavrinides Prof. Name SurnameStavros SCHOOL OF SCIENCE & TECHNOLOGY A thesis submitted for the degree of Master of Science (MSc) in Data Science DECEMBER 2019 THESSALONIKI – GREECE Abstract This dissertation attempt to explain the process of building an open-source platform for data science that allows users to easily upload their data or connect to data sources, visualize said data, and create machine learning models to solve common problems in the field. Everything is implemented with a simple-to-use web-based graphical interface, and is designed for usage/deployment in a cloud environment. This project came about with the help of my supervisor and mentor, Prof. Ioannis Magnisalis, and fellow students – the team that helped in the implementation of the first version - Vaso Tsichli, George Katrilakas, and Chris Timamopoulos. And a shout-out to the thousands of contributors of the open-source tools that this project used, mainly of plotly/dash and the SciPy stack. Konstantinos Mouratidis December 1st, 2019 Contents 1. Introduction ................................................................................................................................. 5 1.1 Background ............................................................................................................................ 5 1.2 This project ............................................................................................................................ 6 1.3 Licensing & private/public version ........................................................................................ 7 1.4 Structure ................................................................................................................................ 8 1.5 Absence of definitions and terminology, and final remarks .................................................. 9 2. Literature (i.e. competitors’) Review ......................................................................................... 11 2.1 Overview of papers on various topics relating to implementations ................................... 11 2.1.1 Data type inference ...................................................................................................... 11 2.1.2 Other guiding sources................................................................................................... 12 2.2 Overviews of competitors and other products ................................................................... 13 2.2.1 Visualization Software .................................................................................................. 13 2.2.2 Machine Learning / Statistical Modeling Software ...................................................... 27 2.2.3 Cloud providers and services ........................................................................................ 34 3. Designing an open-source solution for data visualization and analysis .................................... 39 3.1 Python libraries .................................................................................................................... 40 3.1.1 Tech and computing stack ............................................................................................ 40 3.1.2 Interface / web stack .................................................................................................... 44 3.2 Database technologies ........................................................................................................ 48 3.2.1 Redis ............................................................................................................................. 48 3.2.2 SQLite ............................................................................................................................ 49 3.3 Other libraries and tools ...................................................................................................... 50 4. Implementing the EDA Miner tool ............................................................................................ 52 4.1 Project Structure: Dash & Flask ........................................................................................... 54 4.2 Per-app implementation notes ............................................................................................. 61 4.2.1 Data app ....................................................................................................................... 61 4.2.2 Visualization app .......................................................................................................... 65 4.2.3 Modeling app ................................................................................................................ 68 4.3 Database Technologies & Models ....................................................................................... 74 4.4 Directory structure for project & apps, and extensions ....................................................... 81 4.4.1 Directory structure: project .......................................................................................... 81 4.4.2 Directory structure: Demo app .................................................................................... 82 4.4.3 A demonstration: Google Analytics REST API ............................................................... 83 5. Conclusions & future work ........................................................................................................ 86 Bibliography ................................................................................................................................... 90 Appendix 1, Gartner reports ......................................................................................................... 94 Appendix 2, table comparison of tools ......................................................................................... 98 Appendix 3, sklearn performance ................................................................................................. 99 Appendix 4, Redis benchmarks ................................................................................................... 100 Appendix 5, Flask + hey Benchmarks .......................................................................................... 102 Appendix 6, Contributor guidelines ............................................................................................ 105 Table of contents ..................................................................................................................... 105 Learning resources .................................................................................................................. 105 Style guide recommendations ................................................................................................. 106 General info on project structure ............................................................................................ 106 General info on contributions ................................................................................................. 108 Contributions for code quality ............................................................................................ 108 Contributions for visualization ............................................................................................ 109 Contributions for data ......................................................................................................... 109 Contributions for modeling ................................................................................................. 110 Contributions for deployment and scaling .......................................................................... 111 List of contributors .................................................................................................................. 112 Appendix 7, Directory Tree.......................................................................................................... 113 1. Introduction 1.1 Background With the recent hype on Data Science, Machine Learning, and Artificial Intelligence a lot of new tools have been created to aid the practitioners of these three domains with their daily work, and a large array of older tools is still relevant and have resurfaced. The popular debates (e.g. “R vs Python”, “PyTorch vs Tensorflow”, and the list goes on) have become commonplace across social media platforms and due to self- posting on platforms like Medium1 you can find literally thousands of articles on these topics. Furthermore, each year seems to be having its own “hype word(s)”, usually spanning multiple subdomains of computer science and statistics. Some of them are “Deep Neural Networks” (and their derivatives), “DevOps” and “Micro-Services”, “Big Data”. These guide investments from various funds to startups, as well as hordes of young talent to corporate positions with cool titles. Usually, the hype is an overstatement and when the balloon pops, so do expectations. Taking a broader perspective, most of what is done today is not much different from what used to be done in earlier “eras”. What has really changed is how almost all of these practices are now freely and openly available to the public, wrapped in nice programmatic interfaces (APIs) that significantly
Recommended publications
  • ROADS and BRIDGES: the UNSEEN LABOR BEHIND OUR DIGITAL INFRASTRUCTURE Preface
    Roads and Bridges:The Unseen Labor Behind Our Digital Infrastructure WRITTEN BY Nadia Eghbal 2 Open up your phone. Your social media, your news, your medical records, your bank: they are all using free and public code. Contents 3 Table of Contents 4 Preface 58 Challenges Facing Digital Infrastructure 5 Foreword 59 Open source’s complicated relationship with money 8 Executive Summary 66 Why digital infrastructure support 11 Introduction problems are accelerating 77 The hidden costs of ignoring infrastructure 18 History and Background of Digital Infrastructure 89 Sustaining Digital Infrastructure 19 How software gets built 90 Business models for digital infrastructure 23 How not charging for software transformed society 97 Finding a sponsor or donor for an infrastructure project 29 A brief history of free and public software and the people who made it 106 Why is it so hard to fund these projects? 109 Institutional efforts to support digital infrastructure 37 How The Current System Works 38 What is digital infrastructure, and how 124 Opportunities Ahead does it get built? 125 Developing effective support strategies 46 How are digital infrastructure projects managed and supported? 127 Priming the landscape 136 The crossroads we face 53 Why do people keep contributing to these projects, when they’re not getting paid for it? 139 Appendix 140 Glossary 142 Acknowledgements ROADS AND BRIDGES: THE UNSEEN LABOR BEHIND OUR DIGITAL INFRASTRUCTURE Preface Our modern society—everything from hospitals to stock markets to newspapers to social media—runs on software. But take a closer look, and you’ll find that the tools we use to build software are buckling under demand.
    [Show full text]
  • Google Summer of Code 2019
    Google Summer of Code 2019 Contributing for: The Terasology Foundation Biome-centric Gameplay Template / Enhancements for Terasology! 1 ABOUT ME Name Hassaan Ali (TheHxn) Email [email protected] Discord @TheHxn (#3124) GitHub - https://github.com/TheHxn Profiles Forum - https://forum.terasology.org/members/thehxn.3148/ 2 BIOME-CENTRIC GAMEPLAY ENHANCEMENTS 2.1 OVERVIEW This Idea has been chosen from Terasology’s GSoC Ready Ideas board from Trello [1]. Currently biomes are used in a few game settings, but not with a huge impact to gameplay. This idea aims to support greater variety, meaning to biomes and to help make worlds more "alive" as said by Brylie on the forum. 2.2 INTEREST My interest in this project comes from the fact that not many GSoC students are interested in it, so it definitely needs work as it is a very good idea for Terasology giving the game engine a unique feel to it. Also because I have worked very much with terrains, used World Machine, L3DT, Terresculptor terrain generators to generate climate based terrains. I am very interested as to how the world and life biomes could be improved in Terasology. 2.3 PROJECT FUNCTIONS 1. Inspection tool: When a player encounters a plant or animal, they might use an 'inspection' tool. It can show the details of the entity, we can use WordlyToolTip module to give such information. These details could include health, hunger, biome preferences, and genomic information for the inspected entity. 2. Transplant/Transport: Plants and animals can be transplanted between biomes. Animals could be transplanted using the GooKeeper module as a catch-and-release tool.
    [Show full text]
  • Ultimate++ Forum - Mentoring How to Ing-Howto/Index.Html
    Subject: Google Summer of Code Posted by koldo on Mon, 08 Mar 2010 11:08:17 GMT View Forum Message <> Reply to Message Hello all Google Summer of Code is a program that awards with money students that work in approved Open Source projects. To participate in it first the open source project has to apply to it as a "mentor organization". The deadline for this is this Friday 12. Main things to do are: - Open a "ideas" page in web - Fill the mentor organization questionnaire There is few time and few opportunities to be approved but some of us think that we would have to try it. If you can help please answer to this post ASAP. We have only 4 days, so we have to be very constructive talking ONLY about "Applying to GSoC as a Mentoring Organization". Please put other discussions in other posts. If you cannot participate this week but you have an idea for a project please post it, including: - Project description - Experience required to do it Do not forget that there is few time to do the project ("summer of code") so please be specific including only projects to be finished in short time. Some links: - Google Summer of Code 2010 FAQ http://socghop.appspot.com/document/show/gsoc_program/google /gsoc2010 - "ideas" page examples: -- https://svn.boost.org/trac/boost/wiki/soc2009 -- http://wiki.winehq.org/SummerOfCode -- http://wiki.wxwidgets.org/Development:_Student_Projects - Selection criteria http://socghop.appspot.com/document/show/program/google/gsoc 2009/orgcriteria - Advices for mentor organization http://code.google.com/p/google-summer-of-code/wiki/Advicefo
    [Show full text]
  • Google Certified Professional - Cloud Architect.Exam.57Q
    Google Certified Professional - Cloud Architect.exam.57q Number : GoogleCloudArchitect Passing Score : 800 Time Limit : 120 min https://www.gratisexam.com/ Google Certified Professional – Cloud Architect (English) https://www.gratisexam.com/ Testlet 1 Company Overview Mountkirk Games makes online, session-based, multiplayer games for the most popular mobile platforms. Company Background Mountkirk Games builds all of their games with some server-side integration, and has historically used cloud providers to lease physical servers. A few of their games were more popular than expected, and they had problems scaling their application servers, MySQL databases, and analytics tools. Mountkirk’s current model is to write game statistics to files and send them through an ETL tool that loads them into a centralized MySQL database for reporting. Solution Concept Mountkirk Gamesis building a new game, which they expect to be very popular. They plan to deploy the game’s backend on Google Compute Engine so they can capture streaming metrics, run intensive analytics, and take advantage of its autoscaling server environment and integrate with a managed NoSQL database. Technical Requirements Requirements for Game Backend Platform 1. Dynamically scale up or down based on game activity 2. Connect to a managed NoSQL database service 3. Run customize Linux distro Requirements for Game Analytics Platform 1. Dynamically scale up or down based on game activity 2. Process incoming data on the fly directly from the game servers 3. Process data that arrives late because of slow mobile networks 4. Allow SQL queries to access at least 10 TB of historical data 5. Process files that are regularly uploaded by users’ mobile devices 6.
    [Show full text]
  • Create Live Dashboards with Google Data Studio
    Make Your Library's Data Accessible and Usable Create Live Dashboards with Google Data Studio Kineret Ben-Knaan & Andrew Darby | University of Miami Libraries INTRODUCTION OBJECTIVES AND GOALS STRATEGY AND METHODS PROJECT STATUS This poster presents the implementation of a The aim of the project is to implement a shared and Customized dashboards have been built and shared with collaborative approach and solution for data gathering, straightforward data collection platform, which will harvest Our solution involves the following: several library departments. So far, we have connected data consolidation and most importantly, data multiple, diverse data sources and present statistics in a or imported manually the following isolated data sources ● Move to shared Google Sheets and Forms for any accessibility through a live connection to free Google clear and visually engaging manner. into Google Data Studio: Data Studio dashboards. local departmental data collection, then connect these Our key goals are: sheets to Google Data Studio, a free tool that allows Metric/Data Source by type of Google Data Studio Connector Units at the University of Miami Libraries (UML) have ● To facilitate the use of data from isolated data sources, users to create customized and shareable data long been engaged in data collection activities. Data from Metrics Data Sources Real-time connection Imported manually into visualization dashboards. and Systems & automatically Google Data Studio (on instructional sessions, consultations and all types of user not only to encourage evidence-based decision updated a weekly bases) ● Import and connect other library data sources to interactions have routinely been gathered. Other making, but also to better communicate how UM Instructional & Workshops Google Sheets Yes assessment measures, including surveys and user Libraries’ activities support student learning and faculty Google Data Studio.
    [Show full text]
  • Phpmyadmin Documentation Release 5.1.2-Dev
    phpMyAdmin Documentation Release 5.1.2-dev The phpMyAdmin devel team Sep 29, 2021 Contents 1 Introduction 3 1.1 Supported features............................................3 1.2 Shortcut keys...............................................4 1.3 A word about users............................................4 2 Requirements 5 2.1 Web server................................................5 2.2 PHP....................................................5 2.3 Database.................................................6 2.4 Web browser...............................................6 3 Installation 7 3.1 Linux distributions............................................7 3.2 Installing on Windows..........................................8 3.3 Installing from Git............................................8 3.4 Installing using Composer........................................9 3.5 Installing using Docker..........................................9 3.6 IBM Cloud................................................ 14 3.7 Quick Install............................................... 14 3.8 Verifying phpMyAdmin releases..................................... 16 3.9 phpMyAdmin configuration storage................................... 17 3.10 Upgrading from an older version..................................... 19 3.11 Using authentication modes....................................... 19 3.12 Securing your phpMyAdmin installation................................ 26 3.13 Using SSL for connection to database server.............................. 27 3.14 Known issues..............................................
    [Show full text]
  • MEDES and Google Summer of Code
    MEDES and Google Summer Of Code All students and developers are welcome to participate in the Google Summer of Code program, with MEDES. Google Summer of Code is a program that offers student developers stipends to write code for various open source projects. For its activities, MEDES has developed an innovative data collection tool based on Open Source technologies: the Imogene solution. This tool has enabled the deployment of information systems in various contexts that are now operational. The platform allows to rapidly design a data collection system and, based on MDA technologies, it allows to generate a set of applications fulfilling the needs specified by the model. The applications generated include: * a Web application, * an Android application, * a Desktop application. Both Android and Desktop applications can work offline and have bi-directionnal synchronization processes. They integrate remote update mechanisms. Read more about Imogene on our website or at code.google.com/p/imogene All our applications are developed using the Java programing language. You will have to work with Eclipse as Imogene is an Eclipse Plugin itself. A good knowledge of Java is required to apply to one of these projects. The ideas below were contributed by our team. If you wish to submit a proposal based on these ideas, you can contact us and find out more about the particular suggestion you're looking at. Project: Unit test project for the web generated application using Selenium Brief explanation: Each time a web application is generated using Imogene, the application needs to be tested. By generating a unit test project, this would automate the unit tests for a generated application allowing the users to validate the application functionalities.
    [Show full text]
  • Facebook's Libra
    JULY 2019 Facebook’s Libra AND THE FUTURE OF DIGITAL IDENTITIES Page 6 (Feature Story) Apple launches its own digital ID program 9Page 9 (News and Trends) The challenges of digital IDs in the mobile space 13Page 13 (Deep Dive) © 2019 PYMNTS.com All Rights Reserved WHAT'S INSIDE Digital ID developers race to provide better, 03 more secure solutions FEATURE STORY Wayne Vaughan, co-founder of the 06 Decentralized Identity Foundation, on Facebook’s Libra cryptocurrency and how it will impact the digital identity industry NEWS AND TRENDS METHODOLOGY The latest headlines from around the digital Who’s on top and how they got there, including 09 identity space, including Apple's new digital ID 15 three sets of top provider rankings program, 3D finger vein scanners at hospitals and more DEEP DIVE SCORECARD An in-depth look at mobile digital IDs and the The results are in. See the highest-ranked 13 issues world governments have faced during 16 companies in a provider directory featuring implementation more than 200 major digital identity players. ABOUT 89 Information about PYMNTS.com and Jumio TABLE OF CONTENTS ACKNOWLEDGMENT The Digital Identity Tracker is done in collaboration with Jumio, and PYMNTS is grateful for the company’s support and insight. PYMNTS.com retains full editorial control over the presented findings, methodology and data analysis. WHAT’S INSIDE The digital identity market is expected to reach $15 billion paralysis of choice with so many options available. One by 2024, and giants such as Google and Apple are rac- potential solution is a decentralized, self-sovereign stan- ing to improve identity verification experiences.
    [Show full text]
  • Economic and Social Impacts of Google Cloud September 2018 Economic and Social Impacts of Google Cloud |
    Economic and social impacts of Google Cloud September 2018 Economic and social impacts of Google Cloud | Contents Executive Summary 03 Introduction 10 Productivity impacts 15 Social and other impacts 29 Barriers to Cloud adoption and use 38 Policy actions to support Cloud adoption 42 Appendix 1. Country Sections 48 Appendix 2. Methodology 105 This final report (the “Final Report”) has been prepared by Deloitte Financial Advisory, S.L.U. (“Deloitte”) for Google in accordance with the contract with them dated 23rd February 2018 (“the Contract”) and on the basis of the scope and limitations set out below. The Final Report has been prepared solely for the purposes of assessment of the economic and social impacts of Google Cloud as set out in the Contract. It should not be used for any other purposes or in any other context, and Deloitte accepts no responsibility for its use in either regard. The Final Report is provided exclusively for Google’s use under the terms of the Contract. No party other than Google is entitled to rely on the Final Report for any purpose whatsoever and Deloitte accepts no responsibility or liability or duty of care to any party other than Google in respect of the Final Report and any of its contents. As set out in the Contract, the scope of our work has been limited by the time, information and explanations made available to us. The information contained in the Final Report has been obtained from Google and third party sources that are clearly referenced in the appropriate sections of the Final Report.
    [Show full text]
  • Harvesting the Twittersphere: Qualitative Research Methods Using Twitter
    Fall 08 Spring 13 Harvesting the Twittersphere: Qualitative Research Methods Using Twitter By: Anthony La Rosa [email protected] May 2013 Marketing Advisor: Deborah Fain Marketing-Lubin School of Business Pace University i Abstract: Harvesting the Twittersphere explores the current state of research, by comparing quantitative to qualitative and analyzing the current market. In a consumer driven market, it seems that most businesses are neglecting performing qualitative research. It could be because of the falling cost and increasing convenience of quantitative research methods provided by cloud systems such as: Salesforce.com, IBM, SAP, and McKinsey. Another reason may be that there is no efficient or inexpensive way to conduct qualitative research on a digital platform. While certain companies try to conduct qualitative studies online by using chat rooms, discussion boards or Facebook prompts, there is no method that is as widespread or respective as the traditional methods. While focus groups and participant observation offer unique insights into consumers, both methods can be costly and difficult to set up. Harvesting the Twittersphere proposes a new methodology of using Twitter to conduct a qualitative study. By search a specific term, a researcher can search through the constantly generated tweets to see what people are saying about the term. The tweets should be captured, sorted and analyzed in order to provide a unique insight from the consumer. By nature, Twitter offers feelings of users since what they tweet is usually their individual perspective on a subject. This is the perfect field in order to conduct a qualitative study since it is about sharing emotions, sentiments and feelings, rather than numbers, facts or statistics.
    [Show full text]
  • Zvonimir Rakamaric June, 2021
    Zvonimir Rakamari´c June, 2021 CONTACT Address: School of Computing, 50 South Central Campus Drive, Rm 3424 INFORMATION University of Utah, Salt Lake City, UT 84112-9205, USA Phone: +1 (801) 581-6139 E-mail: [email protected] WWW: www.zvonimir.info, www.soarlab.org PROFESSIONAL Amazon Web Services (AWS), Seattle, WA, USA APPOINTMENTS Principal Applied Scientist May 2021 – present School of Computing, University of Utah, Salt Lake City, UT, USA Associate Professor (leave of absence) Jul 2018 – present School of Computing, University of Utah, Salt Lake City, UT, USA Assistant Professor Jun 2012 – Jun 2018 Carnegie Mellon University, Silicon Valley Campus, NASA Ames Research Park, CA, USA Postdoctoral Fellow Mar 2011 – Mar 2012 Dept. of Computer Science, University of British Columbia, Vancouver, BC, Canada Research Assistant Sep 2006 – Mar 2011 Software Reliability Research Group, Microsoft Research, Redmond, WA, USA Research Intern Jul 2006 – Oct 2006, Oct 2008 – Jan 2009, Nov 2009 – Feb 2010 Dept. of Computer Science, University of British Columbia, Vancouver, BC, Canada Research Assistant May 2005 – Aug 2006 TIS.kis, Zagreb, Croatia Software Engineer Mar 2003 – Aug 2004 EDUCATION University of British Columbia, Vancouver, BC, Canada Ph.D. in Computer Science, Mar 2011 • Thesis: Modular Verification of Shared-Memory Concurrent System Software • Supervisor: Alan J. Hu M.Sc. in Computer Science, Aug 2006 • Thesis: A Logic and Decision Procedure for Verification of Heap-Manipulating Programs • Supervisor: Alan J. Hu Faculty of Electrical
    [Show full text]
  • Google-Forms-Table-Input.Pdf
    Google Forms Table Input Tutelary Kalle misconjectured that norepinephrine picnicking calligraphy and disherit funnily. Inventible and perpendicular Emmy devilling her secureness overpopulating smokelessly or depolarize uncontrollably, is Josh gnathonic? Acidifiable and mediate Townie sneak slubberingly and single-foot his avowers high-up and bloodthirstily. You can use some code, but use this quantity is extremely useful at url and input table Google Forms Date more Time Robertorecchimurzoit. How god set wallpaper a Google Sheet insert a reliable data represent Data. 25 practical ways to use Google Forms in class school Ditch. To read add any question solve a google form using the plus button and wire change our question new to complex choice exceed The question screen shows Rows OptionsAnswer and Columns TopicQuestion that nothing be added in significant amount of example below shows a three-row otherwise four-column point question. How about insert text table in google forms for matrices qustions. Techniques using Add-ons Formulas Formatting Pivot Tables or Apps Script. Or the filetype operator Google searches a sequence of file formats see the skill in. To rest a SQLish syntax in fuel cell would return results from a such in Sheets. Google Form Responses Spreadsheet Has Blank Rows or No. How to seduce a Dynamic Chart in Google Newco Shift. My next available in input table, it against it with it and adds instrument to start to auto populate a document? The steps required to build a shiny app that mimicks a Google Form. I count a google form complete a multiple type question. Text Field React component Material-UI.
    [Show full text]