Vizcertify: a Framework for Secure Visual Data Exploration

Total Page:16

File Type:pdf, Size:1020Kb

Vizcertify: a Framework for Secure Visual Data Exploration VizCertify: A framework for secure visual data exploration Lorenzo De Stefani, Leonhard F. Spiegelberg, Eli Upfal Tim Kraska Department of Computer Science CSAIL Brown University Massachusetts Institute of Technology Providence, United States of America Cambridge, United States of America florenzo, lspiegel, [email protected] [email protected] Abstract—Recently, there have been several proposals to insightful then. Rather, users most likely would infer that develop visual recommendation systems. The most advanced French wines are generally rated higher; generalizing their systems aim to recommend visualizations, which help users to insight to all wines. Statistically savvy users will now test find new correlations or identify an interesting deviation based on the current context of the user’s analysis. However, when whether this generalization is actually statistically valid using recommending a visualization to a user, there is an inherent risk an appropriate test. Even more technically-savvy users will to visualize random fluctuations rather than solely true patterns: consider other hypotheses they tried and adjust the statistical a problem largely ignored by current techniques. testing procedure to account for the multiple comparisons In this paper, we present VizCertify, a novel framework to problem. This is important since every additional hypothesis, improve the performance of visual recommendation systems by quantifying the statistical significance of recommended vi- explicitly expressed as a test or implicitly observed through a sualizations. The proposed methodology allows controlling the visualization, increases the risk of finding insights which are probability of misleading visual recommendations using a novel just spurious effects. application of the Vapnik Chervonenkis (VC) dimension which The issue with visual recommendation engines: What results in an effective criterion to decide whether a candidate happens when visualization recommendations are generated visualization recommendation corresponds to an actually statis- tically significant phenomenon. by one of the above systems? First, the user does not know if Index Terms—Visualization, Recommendation Systems, the effect shown by the visualization is actually significant FWER, Data Exploration, Visual Recommendation, Vapnik or not. Even worse, she can not use a standard statistical Chervonenkis theory, Statistical Learning method and simply test the effect shown in the visualization for significance. Visual recommendation engines are potentially I. INTRODUCTION checking hundreds of thousands of visualizations for their Recently many visual recommendation engines [1]–[9] have interesting-factor. As a result, by testing large quantities of been proposed to help savvy and unsavvy users with data visualizations it is very likely that a system will find something analysis. While some recommendation engines (e.g., [2], [8], “interesting” regardless of whether the observed phenomenon [4], [9]) aim to recommend better visual encodings, others is actually statistically valid or not. Therefore, a test for sig- (e.g., SeeDB [1], DeepEye [7], MuVE [5], or VizDeck [6]) nificance for the recommended visualization should consider aim to automatically recommend entirely new visualization to the whole testing history. help users finding interesting insights. Why not a holdout? Advocates of visual recommendation While the latter are significantly more powerful, they also engines usually argue that visual recommendations systems significantly increase the risk of finding false insights [10], are meant to be hypothesis generation engines, which should [11]. This happens whenever a visualization is used as a always be validated on a separate hold-out dataset. While this device to represent statistical properties of the data. Consider a is a valid method to control false discoveries, it has several user exploring a dataset containing information about different unpractical implications: (1) None of the found insights from wines. After browsing the data for a while, she creates a the exploration dataset should be regarded as an actual insight visualization of wines ranked by origin showing wines from before they are validated. This is clearly problematic if one France to be apparently higher rated. If her only takeaway observation may steer towards another during the exploration. is, that in this particular dataset wines from France have a (2) Splitting a dataset into an exploration and a hold-out set can higher rating, there is no risk of a false insight. However, significantly reduce the power (i.e., the chance to find actual it is neither in the nature of users to constrain themselves true phenomena). (3) The hold-out needs to be controlled for to such thinking [12], nor would visualizing the data be the multi-hypothesis problem unless the user only wants to use it exactly once for a single test. This research was funded in part by the DARPA Award 16-43-D3M-FP- In this paper, we present VizCertify, a first framework 040, NSF Award IIS-1514491, NSF Award IIS-1562657, NSF Award RI- 18134446, and suppported by Google, Intel, and Microsoft as part of the to make visual recommendation engines “safe”. That is, MIT Data Systems and AI Lab (DSAIL). VizCertify validates candidate recommended visualizations by verifying that they represent actually statistically significant survey dataset we conducted with 2; 644 participants from the phenomena. U.S. on Amazon Mechanical Turk for 32 (mostly unrelated) Scope: As already pointed out in [10], [13], associating multiple-choice questions1. Questions range from “Would you an appropriate hypothesis test for validating a visualization is rather drive an electric car or a gas-powered car?” to “What notoriously hard. In this work, we focus on the visual recom- is your highest achieved education level?” and “Do you mendation style proposed by SeeDB for histograms as it uses a believe in Astrology?”. A global aggregation reveals that the clear semantic for what “interesting” means and rumored to be majority of participants does not believe in Astrology as shown implemented in a widely-used commercial product. SeeDB [1] in Figure 1a. Based on this visualization, SeeDB’s top 3 rec- makes recommendations based on a reference view; it tries ommended visualization are depicted in Figure 1b-1d. SeeDB to find a visualization whose underlying data distribution is proposes various ways to measure the difference between two different from the one’s the user is currently shown (i.e., the visualizations to rank the visualization. For this experiment, implicit test performed by SeeDB evaluates the difference in we configured SeeDB to use the Chebyshev distance. For sub-populations). However, our techniques can be extended example, Figure 1b shows that people who prefer cheese- to other visualization frameworks and hypothesis generation flavored potato chips are more likely to believe in Astrology. tools (e.g., Data Polygamy [14]) provided that the chosen However, the visualizations alone do not answer the ques- “interestingness” criterion can be evaluated by a statistical test. tion, if the shown difference is actually statistically significant Contributions: The core idea of VizCertify is to estimate or not. In contrast, VizCertify would automatically mark the the interest of candidate recommendations and automatically reference view in Figure 1a and 1c as statistically significant, adjusts the significance value based on the search space to but disregard the visualizations in Figure 1b and 1d. Important account for the multiple-hypothesis pitfall. With this work we to note is, that the top-ranked visualization, Figure 1b, is not make the following contributions: considered statistically significant, but the second-ranked is. • We formalize the process of making visualization recom- I.e., higher rank does not imply higher statistical significance. mendations as statistical hypothesis testing. Obviously, detecting statistically significant differences is • We propose a method based on the use of VC dimension, challenging as it depends on the data size, the number of which allows controlling the probability of observing tested hypothesis by the user and the recommendation system. false discoveries during the visualization recommendation Furthermore, while it is easy to define what a test means with process. VizCertify allows control of the Family Wise Er- a two-bar chart histogram, it is much harder for visualizations ror Rate (FWER) at a given control level δ 2 (0; 1). Our with multiple bars. Finally, it has to be pointed out that method provides finite sample FWER control whereas in the case of SeeDB increasing the data size does not classical statistical approaches (i.e., the chi-squared test) necessarily mitigate the problem as SeeDB automatically adds ensure asymptotic FWER control. filter conditions to create ever-smaller sub-populations. • We evaluate the performance of our system, in compari- son with SeeDB via extensive experimental analysis. B. Problem set-up The remainder of this paper is organized as follows: In Sec- In this work, we assume that D (e.g., our survey data) tion II we give a definition of the visualization recommenda- consists of n records chosen uniformly at random and inde- tion problem in rigorous probabilistic terms. In Section III we pendently from a universe Ω. We can imagine Ω in the form discuss possible approaches for the visualization recommen- of a two-dimensional, relational table with N rows and m dation problem, and highlight why they suffer due to
Recommended publications
  • Improving Data Preparation for Business Analytics Applying Technologies and Methods for Establishing Trusted Data Assets for More Productive Users
    BEST PRACTICES REPORT Q3 2016 Improving Data Preparation for Business Analytics Applying Technologies and Methods for Establishing Trusted Data Assets for More Productive Users By David Stodder Research Sponsors Research Sponsors Alation, Inc. Alteryx, Inc. Attivio, Inc. Datameer Looker Paxata Pentaho, a Hitachi Group Company RedPoint Global SAP / Intel SAS Talend Trifacta Trillium Software Waterline Data BEST PRACTICES REPORT Q3 2016 Improving Data Table of Contents Research Methodology and Demographics 3 Preparation for Executive Summary 4 Business Analytics Why Data Preparation Matters 5 Better Integration, Business Definition Capture, Self-Service, and Applying Technologies and Methods for Data Governance. 6 Establishing Trusted Data Assets for Sidebar – The Elements of Data Preparation: More Productive Users Definitions and Objectives 8 By David Stodder Find, Collect, and Deliver the Right Data . 8 Know the Data and Build a Knowledge Base. 8 Improve, Integrate, and Transform the Data . .9 Share and Reuse Data and Knowledge about the Data . 9 Govern and Steward the Data . 9 Satisfaction with the Current State of Data Preparation 10 Spreadsheets Remain a Major Factor in Data Preparation . 10 Interest in Improving Data Preparation . 11 IT Responsibility, CoEs, and Interest in Self-Service . 13 Attributes of Effective Data Preparation 15 Data Catalog: Shared Resource for Making Data Preparation Effective . 17 Overcoming Data Preparation Challenges 19 Addressing Time Loss and Inefficiency . 21 IT Responsiveness to Data Preparation Requests . 22 Data Volume and Variety: Addressing Integration Challenges . 22 Disconnect between Data Preparation and Analytics Processes . 24 Self-Service Data Preparation Objectives 25 Increasing Self-Service Data Preparation . 27 Self-Service Data Integration and Catalog Development .
    [Show full text]
  • Automating Exploratory Data Analysis Via Machine Learning: an Overview
    Tutorials SIGMOD ’20, June 14–19, 2020, Portland, OR, USA Automating Exploratory Data Analysis via Machine Learning: An Overview Tova Milo Amit Somech Tel Aviv University Tel Aviv University [email protected] [email protected] ABSTRACT up-close, better understand its nature and characteristics, Exploratory Data Analysis (EDA) is an important initial step and extract preliminary insights from it. Typically, EDA is for any knowledge discovery process, in which data scien- done by interactively applying various analysis operations tists interactively explore unfamiliar datasets by issuing a (such as filtering, aggregation, and visualization) – the user sequence of analysis operations (e.g. filter, aggregation, and examines the results of each operation, and decides if and visualization). Since EDA is long known as a difficult task, which operation to employ next. requiring profound analytical skills, experience, and domain EDA is long known to be a complex, time consuming pro- knowledge, a plethora of systems have been devised over cess, especially for non-expert users. Therefore, numerous the last decade in order to facilitate EDA. lines of work were devised to facilitate the process, in multi- In particular, advancements in machine learning research ple dimensions [10, 24, 25, 31, 42, 46]: First, simplified EDA 1 have created exciting opportunities, not only for better fa- interfaces (e.g., [25], Tableau ) allow non-programmers to cilitating EDA, but to fully automate the process. In this effectively explore datasets without knowing scripting lan- tutorial, we review recent lines of work for automating EDA. guages or SQL. Second, numerous solutions (e.g., [6, 14, 22]) Starting from recommender systems for suggesting a single were devised to improve the interactivity of EDA, by re- exploratory action, going through kNN-based classifiers and ducing the running times of exploratory operations, and active-learning methods for predicting users’ interestingness displaying preliminary sketches of their results.
    [Show full text]
  • The Landscape of R Packages for Automated Exploratory Data Analysis by Mateusz Staniak and Przemysław Biecek
    CONTRIBUTED RESEARCH ARTICLE 1 The Landscape of R Packages for Automated Exploratory Data Analysis by Mateusz Staniak and Przemysław Biecek Abstract The increasing availability of large but noisy data sets with a large number of heterogeneous variables leads to the increasing interest in the automation of common tasks for data analysis. The most time-consuming part of this process is the Exploratory Data Analysis, crucial for better domain understanding, data cleaning, data validation, and feature engineering. There is a growing number of libraries that attempt to automate some of the typical Exploratory Data Analysis tasks to make the search for new insights easier and faster. In this paper, we present a systematic review of existing tools for Automated Exploratory Data Analysis (autoEDA). We explore the features of fifteen popular R packages to identify the parts of the analysis that can be effectively automated with the current tools and to point out new directions for further autoEDA development. Introduction With the advent of tools for automated model training (autoML), building predictive models is becoming easier, more accessible and faster than ever. Tools for R such as mlrMBO (Bischl et al., 2017), parsnip (Kuhn and Vaughan, 2019); tools for python such as TPOT (Olson et al., 2016), auto-sklearn (Feurer et al., 2015), autoKeras (Jin et al., 2018) or tools for other languages such as H2O Driverless AI (H2O.ai, 2019; Cook, 2016) and autoWeka (Kotthoff et al., 2017) supports fully- or semi-automated feature engineering and selection, model tuning and training of predictive models. Yet, model building is always preceded by a phase of understanding the problem, understanding of a domain and exploration of a data set.
    [Show full text]
  • Lecture #1: Exploratory Data Analysis CS 109A, STAT 121A, AC 209A: Data Science
    lecture #1: exploratory data analysis CS 109A, STAT 121A, AC 209A: Data Science . Weiwei Pan, Pavlos Protopapas, Kevin Rader Fall 2016 Harvard University announcements ■ How to optimally communicate with us (Piazza then help line) ■ Issues with Gutenberg in HW0 (for now stop using Gutenberg) ■ Extension students: labs stream on Friday (posted same day), interactive lab via zoom, TF office hours, extra day on HW ■ Cross-registered students already have or will get full access to Canvas today (please check!) ■ HW0 must be submitted (through Vocareum)! ■ HW0 submission deadline extended to Wednesday (today) at 11:59pm. ■ Time and effort required to complete HW0 will vary depending on familiarity with programming (this is the learning curve), and is a good indicator of fit of the class (the programming will not get easier) ■ HW1 is released 1 the data science process On the first day of class you were introduced to the “data science” process. ■ Ask questions ■ Data Collection ■ Data Exploration ■ Data Modeling ■ Data Analysis ■ Visualization and Presentation of Results Note: This process is not linear! 2 lecture outline What Is Data? Exploring Data Descriptive Statistics Data Visualization An Example What Next? 3 .what is data? what is data? “A datum is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data is multiple such measurements.” Provocative claim: everything is (can be) data! 5 where does it come from? ■ Internal sources: already collected by or is part of the overall data collection of you organization. For example: business-centric data that is available in the organization data base to record day to day operations; scientific or experimental data ■ Existing External Sources: available in ready to read format from an outside source for free or for a fee.
    [Show full text]
  • Database Benchmarking for Supporting Real-Time Interactive
    Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data Leilani Battle, Philipp Eichmann, Marco Angelini, Tiziana Catarci, Giuseppe Santucci, Yukun Zheng, Carsten Binnig, Jean-Daniel Fekete, Dominik Moritz To cite this version: Leilani Battle, Philipp Eichmann, Marco Angelini, Tiziana Catarci, Giuseppe Santucci, et al.. Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data. SIGMOD ’20 - International Conference on Management of Data, Jun 2020, Portland, OR, United States. pp.1571-1587, 10.1145/3318464.3389732. hal-02556400 HAL Id: hal-02556400 https://hal.inria.fr/hal-02556400 Submitted on 28 Apr 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data Leilani Battle Philipp Eichmann Marco Angelini University of Maryland∗ Brown University University of Rome “La Sapienza”∗∗ College Park, Maryland Providence, Rhode Island Rome, Italy [email protected] [email protected] [email protected] Tiziana Catarci Giuseppe Santucci Yukun Zheng University of Rome “La Sapienza”∗∗ University of Rome “La Sapienza”∗∗ University of Maryland∗ [email protected] [email protected] [email protected] Carsten Binnig Jean-Daniel Fekete Dominik Moritz Technical University of Darmstadt Inria, Univ.
    [Show full text]
  • An Interactive Visualization Environment for Data Exploration
    An Interactive Visualization Environment for Data Exploration Mark Derthick and John Kolojejchick and Steven F. Roth {mad+,jake+,roth+}@cs.cmu.edu Carnegie Mellon University School of Computer Science Pittsburgh, PA 15213 Abstract increase sales” it is impossible to formally characterize what constitutes an answer. A surprising observation Exploratory data analysis is a process of sifting may lead to new goals, which lead to new surprises, and through data in search of interesting information or patterns. Analysts’ current tools for exploring data in- in turn more goals. To facilitate this kind of iterative clude database management systems, statistical anal- exploration, we would like to provide the analyst rapid, ysis packages, data mining tools, visualization tools, incremental, and reversible operations giving continu- and report generators. Since the exploration process ous visual feedback. seeks the unexpected in a data-driven manner, it is In some ways, the state of the art 30 years ago crucial that these tools are seamlessly integrated so was closer to this goal of a tight loop between explo- analysts can flexibly select and compose tools to use at ration, hypothesis formation, and hypothesis testing each stage of analysis. Few systems have integrated all these capabilities either architecturally or at the user than it is today. In the era of slide rules and graph pa- interface level. Visage’s information-centric approach per, the analyst’s forced intimacy with the data bred allows coordination among multiple application user strong intuitions. As computer power and data sets interfaces. It uses an architecture that keeps track of have grown, analysis retaining the human focus has the mapping of visual objects to information in shared become on-line analytical processing (OLAP).
    [Show full text]
  • Exploring the Data Wilderness Through Examples
    Exploring the Data Wilderness through Examples Davide Mottin Matteo Lissandrini Aarhus University Aalborg University [email protected] [email protected] Yannis Velegrakis Themis Palpanas Utrecht University Paris Descartes University [email protected] [email protected] Machine Machine ABSTRACT Interface learning Exploration is one of the primordial ways to accrue knowl- edge about the world and its nature. As we accumulate, Example-based approaches mostly automatically, data at unprecedented volumes and Middleware speed, our datasets have become complex and hard to un- derstand. In this context exploratory search provides a handy Data structures and hardware tool for progressively gather the necessary knowledge by starting from a tentative query that hopefully leads to an- Relational Graph Text swers at least partially relevant and that can provide cues about the next queries to issue. Recently, we have witnessed a rediscovery of the so-called example-based methods, in which Figure 1: A view of example-based data exploration. the user or the analyst circumvent query languages by using examples as input. This shift in semantics has led to a num- ber of methods receiving as query a set of example members 1 SCOPE OF THE TUTORIAL of the answer set. The search system then infers the entire Data exploration includes methods to eciently extract knowl- answer set based on the given examples and any additional edge from data, even if we do not know what exactly we are information provided by the underlying database. In this tu- looking for, nor how to precisely describe our needs. In such torial, we present an excursus over the main example-based exploration, the user progressively acquires the knowledge methods for exploratory analysis, show techniques tailored by issuing a sequence of generic queries to gather intelli- to dierent data types, and provide a unifying view of the gence about the data.
    [Show full text]
  • Astronomy in the Big Data Era Search
    Astronomy in the big data era Search... \ Log in Register Share: A- Alt. Download A+ Display Proceedings Papers Astronomy in the Big Data Era Yongheng Authors: Yanxia Zhang , Zhao Abstract The fields of Astrostatistics and Astroinformatics are vital for dealing with the big data issues now faced by astronomy. Like other disciplines in the big data era, astronomy has many V characteristics. In this paper, we list the different data mining algorithms used in astronomy, along with data mining software and tools related to astronomical applications. We present SDSS, a project often referred to by other astronomical projects, as the most successful sky survey in the history of astronomy and describe the factors influencing its success. We also discuss the success of Astrostatistics and Astroinformatics organizations and the conferences and summer schools on these issues that are held annually. All the above indicates that astronomers and scientists from other areas are ready to face the challenges and opportunities provided by massive data volume. Keywords: Big data , Data mining , Astrostatistics , Astroinformatics How to Cite: Zhang, Y. and Zhao, Y., 2015. Astronomy in the Big Data Era. Data Science Journal, 14, p.11. DOI: http://doi.org/10.5334/dsj-2015-011 12912 1197 4 19 14 Views Downloads Citations Twitter Facebook Published on 22 May 2015 Peer Reviewed CC BY 4.0 1 Introduction At present, the continuing construction and development of ground-based and space-born sky surveys ranging from gamma rays and X-rays, ultraviolet, optical, and infrared to radio bands is bringing astronomy into the big data era.
    [Show full text]
  • Big Data Visualization Tools ?
    Big Data Visualization Tools ? Nikos Bikakis ATHENA Research Center, Greece 1 Synonyms Visual exploration; Interactive visualization; Information visualization; Vi- sual analytics; Exploratory data analysis. 2 Definition Data visualization is the presentation of data in a pictorial or graphical for- mat, and a data visualization tool is the software that generates this presenta- tion. Data visualization provides users with intuitive means to interactively explore and analyze data, enabling them to effectively identify interesting patterns, infer correlations and causalities, and supports sense-making activ- ities. 3 Overview Exploring, visualizing and analysing data is a core task for data scientists and analysts in numerous applications. Data visualization2 [1] provides intuitive ways for the users to interactively explore and analyze data, enabling them arXiv:1801.08336v2 [cs.DB] 22 Feb 2018 to effectively identify interesting patterns, infer correlations and causalities, and support sense-making activities. ? This article appears in Encyclopedia of Big Data Technologies, Springer, 2018 2 Throughout the article, terms visualization and visual exploration, as well as terms tool and system are used interchangeably. 1 2 Nikos Bikakis The Big Data era has realized the availability of a great amount of massive datasets that are dynamic, noisy and heterogeneous in nature. The level of difficulty in transforming a data-curious user into someone who can access and analyze that data is even more burdensome now for a great number of users with little or no support and expertise on the data processing part. Data visualization has become a major research challenge involving several issues related to data storage, querying, indexing, visual presentation, interaction, personalization [2,3,4,5,6,7,8,9].
    [Show full text]
  • The New Analytics Lifecycle Accelerating Insight to Drive Innovation
    The New Analytics Lifecycle Accelerating Insight to Drive Innovation Dave Wells November 2017 The New Analytics Lifecycle About the Author Dave Wells is an advisory consultant, educator, and research analyst dedicated to building meaningful connections along the path from data to business impact. He works at the intersection of information and business, driving value through analytics, business intelligence, and innovation. With nearly five decades of combined experience in information management and business management, Dave has a unique perspective about the connections of business, information, data, and technology. Knowledge sharing and skills building are Dave’s passions, carried out through consulting, speaking, teaching, research, and writing. He is a continuous learner—fascinated with understanding how we think—and a student and practitioner of systems thinking, critical thinking, design thinking, divergent thinking, and innovation. About Eckerson Group Eckerson Group is a research and consulting firm that helps business and analytics leaders use data and technology to drive better insights and actions. Through its reports and advisory services, the firm helps companies maximize their investments in data and analytics. Its researchers and consultants each have more than 20 years of experience in the field and are uniquely qualified to help business and technical leaders succeed with business intelligence, analytics, data management, data governance, performance management, and data science. About This Report This report is sponsored by Alteryx, Amazon Web Services and Tableau. © Eckerson Group 2017 www.eckerson.com 2 The New Analytics Lifecycle Executive Summary Analytics is at the forefront of modern business management. The role of analytics for informed decision making is well known, and much attention is given to the value of insight.
    [Show full text]
  • Data Exploration and Visualization
    Data Exploration and Visualization Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no’lu İstanbul Big Data Eğitim ve Araştırma Merkezi Projesi dahilinde gerçekleştirilmiştir. İçerik ile ilgili tek sorumluluk Bahçeşehir Üniversitesi’ne ait olup İSTKA veya Kalkınma Bakanlığı’nın görüşlerini yansıtmamaktadır. Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. Visualization of data is one of the most powerful and appealing techniques for data exploration. – Humans have a well developed ability to analyze large amounts of information that is presented visually – Can detect general patterns and trends – Can detect outliers and unusual patterns Examples of Business Questions • Simple (descriptive) Stats • “Who are the most profitable customers?” • Hypothesis Testing • “Is there a difference in value to the company of these customers?” • Segmentation/Classification • What are the common characteristics of these customers? • Prediction • Will this new customer become a profitable customer? If so, how profitable? adapted from Provost and Fawcett, “Data Science for Business” Ask an interesting What is the scientific goal? What would you do if you had all the data? question. What do you want to predict or estimate? How were the data sampled? Which data are relevant? Get the data. Are there privacy issues? Plot the data. Are there anomalies? Explore the data. Are there patterns? Build a model. Model the data. Fit the model. Validate the model. Communicate and What did we learn? Do the results make sense? visualize the results.
    [Show full text]
  • Futzing and Moseying: Interviews with Professional Data Analysts on Exploration Practices
    Futzing and Moseying: Interviews with Professional Data Analysts on Exploration Practices Sara Alspaugh and Nava Zokaei and Andrea Liu and Cindy Jin and Marti A. Hearst Abstract—We report the results of interviewing thirty professional data analysts working in a range of industrial, academic, and regulatory environments. This study focuses on participants’ descriptions of exploratory activities and tool usage in these activities. Highlights of the findings include: distinctions between exploration as a precursor to more directed analysis versus truly open-ended exploration; confirmation that some analysts see “finding something interesting” as a valid goal of data exploration while others explicitly disavow this goal; conflicting views about the role of intelligent tools in data exploration; and pervasive use of visualization for exploration, but with only a subset using direct manipulation interfaces. These findings provide guidelines for future tool development, as well as a better understanding of the meaning of the term “data exploration” based on the words of practitioners “in the wild.” Index Terms—EDA, exploratory data analysis, interview study, visual analytics tools 1 INTRODUCTION The professional field known variously as data analysis, visual analyt- a given result; that is, when a top-down plan of action coalesces and ics, business intelligence, and more recently, data science, continues to can be specified in advance from start to finish, precisely. Example expand year over year. This interdisciplinary field requires its practition- exploratory questions are “what’s going on with my users?” or “has ers to acquire diverse technical and mental skills, and be comfortable anything interesting happened this quarter?” working with ill-defined goals and uncertain outcomes.
    [Show full text]