Automating Exploratory Data Analysis Via Machine Learning: an Overview
Total Page:16
File Type:pdf, Size:1020Kb
Tutorials SIGMOD ’20, June 14–19, 2020, Portland, OR, USA Automating Exploratory Data Analysis via Machine Learning: An Overview Tova Milo Amit Somech Tel Aviv University Tel Aviv University [email protected] [email protected] ABSTRACT up-close, better understand its nature and characteristics, Exploratory Data Analysis (EDA) is an important initial step and extract preliminary insights from it. Typically, EDA is for any knowledge discovery process, in which data scien- done by interactively applying various analysis operations tists interactively explore unfamiliar datasets by issuing a (such as filtering, aggregation, and visualization) – the user sequence of analysis operations (e.g. filter, aggregation, and examines the results of each operation, and decides if and visualization). Since EDA is long known as a difficult task, which operation to employ next. requiring profound analytical skills, experience, and domain EDA is long known to be a complex, time consuming pro- knowledge, a plethora of systems have been devised over cess, especially for non-expert users. Therefore, numerous the last decade in order to facilitate EDA. lines of work were devised to facilitate the process, in multi- In particular, advancements in machine learning research ple dimensions [10, 24, 25, 31, 42, 46]: First, simplified EDA 1 have created exciting opportunities, not only for better fa- interfaces (e.g., [25], Tableau ) allow non-programmers to cilitating EDA, but to fully automate the process. In this effectively explore datasets without knowing scripting lan- tutorial, we review recent lines of work for automating EDA. guages or SQL. Second, numerous solutions (e.g., [6, 14, 22]) Starting from recommender systems for suggesting a single were devised to improve the interactivity of EDA, by re- exploratory action, going through kNN-based classifiers and ducing the running times of exploratory operations, and active-learning methods for predicting users’ interestingness displaying preliminary sketches of their results. Last, several preferences, and finally to fully automating EDA using state- more systems simplify the query formulation for non-expert of-the-art methods such as deep reinforcement learning and users (e.g., [15, 24]), as well as introducing query-by example sequence-to-sequence models. systems (e.g. [8, 41]) that allow non-technical users to specify We conclude the tutorial with a discussion on the main their information needs by selecting a set of example tuples. challenges and open questions to be dealt with in order to While these works greatly facilitate the exploration pro- ultimately reduce the manual effort required for EDA. cess in terms of response time and ease of use, EDA is still a difficult process, as the user remains in the “driver’s seat”, ACM Reference Format: having to decide which exploratory operation to employ at Tova Milo and Amit Somech. 2020. Automating Exploratory Data each step of the exploratory process. Therefore the scope Analysis via Machine Learning: An Overview. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of of this tutorial is an emerging body of research works that Data (SIGMOD’20), June 14–19, 2020, Portland, OR, USA. ACM, New are aimed at automating EDA, whether partially or fully. We York, NY, USA, 6 pages. https://doi.org/10.1145/3318464.3383126 start by surveying recommender-systems designed for EDA, aimed to assist users in choosing the next exploratory step to 1 TUTORIAL BACKGROUND & SCOPE take, or suggest dataset segments and tuples that are likely to be of interest. We then discuss how machine learning tech- Exploratory Data Analysis (EDA) is an essential process per- niques can improve upon such EDA recommender systems, formed by data scientists in order to examine a new dataset by predicting users’ interests in order to generate better, Permission to make digital or hard copies of all or part of this work for more personalized recommendations. Finally, we explain personal or classroom use is granted without fee provided that copies are not how deep learning methods can be used to fully automate made or distributed for profit or commercial advantage and that copies bear EDA i.e. by auto-generating entire exploratory sessions and this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with visualizations, hereby significantly reducing the time and credit is permitted. To copy otherwise, or republish, to post on servers or to effort data scientists dedicate to EDA. redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. Tutorial Outline. The proposed tutorial begins with a brief SIGMOD’20, June 14–19, 2020, Portland, OR, USA introduction to modern-day EDA, overviewing commonly © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-6735-6/20/06...$15.00 https://doi.org/10.1145/3318464.3383126 1https://www.tableau.com 2617 Tutorials SIGMOD ’20, June 14–19, 2020, Portland, OR, USA used interactive EDA interfaces, methodology and optimiza- it using deep reinforcement learning. Such generated sessions, tion tools. Then, the core of the tutorial consists of the fol- when presented in EDA notebooks, allow allow users to lowing three modules, which are depicted in more details in gain preliminary insights on their dataset, before they begin Section 2. The first module describes recommender systems exploring it themselves [3]. designated for EDA, whereas the latter two modules explain As automated EDA is an emerging field of research, it how automated EDA can be vastly improved using machine conveys many exciting opportunities for future work, as learning and deep learning methods. Table 1 summarizes well as challenges and open questions. We review questions selected works along the three modules. such as How to make the auto-generated sessions personalized, 1. EDA Recommender Systems: Data-driven, Log-based, reactive to users’ information needs? How to design a better and Hybrid Systems. The first section surveys the major user experience and interface for automated EDA? How to build developments in the field of recommender systems dedi- an effective, reproducible, experimental framework to evaluate cated to EDA that provide the user with suggestions for the quality of auto-generated sessions? specific, high-utility exploratory operations. Such systems can be roughly divided into two categories: (1) data-driven Tutorial Duration. The tutorial is 1.5 hour long. It begins (also known as discovery-driven) systems, which use heuris- with an introduction to EDA, in which we review commonly tic notions of interestingness and employ them, e.g., to find used EDA interfaces, and present common data models and data subsets conveying interesting patterns ([12]), data vi- required definitions. We then overview selected works in sualizations [46], and data summaries [43]. (2) Log-based each module: EDA recommender systems, predicting users’ systems [1, 13, 49] leverage a log of former exploratory op- interestingness, and finally - we brifly touch recent works erations, performed by the same or different users, in order in fully automated EDA. We conclude the tutorial with an to generate more personalized EDA recommendations. Last, elaborate discussion on the potential future endeavors of hybrid methods such as [29, 31] allow effectively utilizing automated EDA. both the log and the dataset currently being explored. Target Audience. The tutorial is primarily intended for re- 2. ML for Predicting Users’ Interestingness Preferences. searchers, students, and industry practitioners interested in Measuring the interestingness of exploratory operations is a data exploration and automatic knowledge discovery, but crucial component in many of the EDA systems described also addresses those who are interested in employing ma- above, as well as for fully automating EDA (as described fur- chine learning tools for data management applications. The ther below). Many heuristic measures have been proposed tutorial is self-contained and requires no prior knowledge ex- for assessing the interestingness of analysis operations, each cept for some familiarity with basic databases and machine capturing a different facet of the broad concept16 [ , 27]. How- learning terminology. Advanced techniques such as deep ever, interestingness is often subjective [7] and dynamically reinforcement learning and sequence-to-sequence networks changing, even in the same exploratory session [28]. We are explained from the basics. examine two lines of recent works that utilize machine learn- Related Tutorials. ing techniques to solve the problem: (1) Dynamic selection A comprehensive tutorial on data ex- of interestingness measures, in which systems such as [28] ploration [19] was presented in SIGMOD ’15. It surveyed predict which measure most accurately captures the user’s multiple lines of works, all dedicated to enhance the explo- interest, and (2) ML-based models for users’ interest using, e.g., ration process, from adaptive database storage and indexing active-learning [10, 18] and learning-to-rank [26] techniques. mechanisms, to gesture-based database interfaces. In con- trast, our tutorial is focused particularly on more recent 3. Fully-automated EDA Using Deep Learning Meth- developments in automated EDA, and the only overlap is ods. While the recommender systems and interestingness in the introductory part of our tutorial, where we briefly