An Interactive Visualization Environment for Data Exploration


Mark Derthick and John Kolojejchick and Steven F. Roth
{mad+,jake+,roth+}@cs.cmu.edu
Carnegie Mellon University, School of Computer Science, Pittsburgh, PA 15213

Abstract

Exploratory data analysis is a process of sifting through data in search of interesting information or patterns. Analysts' current tools for exploring data include database management systems, statistical analysis packages, data mining tools, visualization tools, and report generators. Since the exploration process seeks the unexpected in a data-driven manner, it is crucial that these tools are seamlessly integrated so analysts can flexibly select and compose tools to use at each stage of analysis. Few systems have integrated all these capabilities either architecturally or at the user interface level. Visage's information-centric approach allows coordination among multiple application user interfaces. It uses an architecture that keeps track of the mapping of visual objects to information in shared databases. One result is the ability to perform direct manipulation operations such as drag-and-drop transfer of data among applications. This paper describes Visage's Visual Query Language and visualization tools, and illustrates their application to several stages of the exploration process: creating the target dataset, data cleaning and preprocessing, data reduction and projection, and visualization of the reduced data. Unlike previous integrated KDD systems' interfaces, direct manipulation is used pervasively, and the visualizations are more diverse and can be customized automatically as needed. Coordination among all interface objects simplifies iterative modification of decisions at any stage.

Keywords: Data Visualization, Interactive Data Exploration, Visual Query Language, Automatic Presentation

Copyright © 1997, American Association for Artificial Intelligence (www.aaai.org). All rights reserved. Color versions of the figures are at, e.g., http://www.cs.cmu.edu/~sage/KDD97/figure1.gif

Introduction

Exploratory data analysis has been called data archaeology (Brachman et al. 1993) to emphasize that it requires a human analyst tightly in the loop who has high-level goals in seeking useful information from the data. For goals such as "find buying patterns we can use to increase sales" it is impossible to formally characterize what constitutes an answer. A surprising observation may lead to new goals, which lead to new surprises, and in turn more goals. To facilitate this kind of iterative exploration, we would like to provide the analyst rapid, incremental, and reversible operations giving continuous visual feedback.

In some ways, the state of the art 30 years ago was closer to this goal of a tight loop between exploration, hypothesis formation, and hypothesis testing than it is today. In the era of slide rules and graph paper, the analyst's forced intimacy with the data bred strong intuitions. As computer power and data sets have grown, analysis retaining the human focus has become on-line analytical processing (OLAP). In contrast, data mining has emerged as a distinct interdisciplinary field including statistics, machine learning and AI, and database technology, but one that does not include human-computer interaction (HCI). KDD recognizes the broader context in which data mining systems are used, and that the majority of the user's effort is in preparatory and evaluatory steps. HCI is thus an important part of an end-to-end KDD system (Fayyad, Piatetsky-Shapiro, & Smyth 1996).

Visage provides a scripted rapid-prototyping environment for experimenting with interaction and visualization techniques. Unfortunately, the overhead of interpretation and first-class graphical objects limits the dataset size that can be manipulated rapidly. What we propose here are ways to attain initial intuitions about a large dataset from small samples; to create descriptions of the interesting subsets and attributes for input to cleaning and data mining applications; and to evaluate the discovered rules. We also offer a framework for how these distinct applications can be seamlessly integrated in a single information-centric interface.

A key innovation is the combination of extensional direct manipulation navigation with intensional query construction. The former allows fast opportunistic exploration of the sample, while the latter allows these operations to be carried out automatically on the full dataset. Below we survey the state of the art in interactive visualization systems and illustrate the use of Visage (Roth et al. 1997) on KDD tasks for a real-estate domain. We conclude by discussing the possibility of a fully integrated KDD system based on a system like Visage.
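As a concrete illustration of this extensional/intensional combination, the sketch below records direct-manipulation range selections made while browsing a small sample and then replays them as an explicit query over the full dataset. It is only a minimal Python illustration, assuming pandas is available; the data, column names, and helper functions are invented here and are not part of Visage.

    # Minimal sketch (not Visage): record direct-manipulation selections on a sample,
    # then replay them as an explicit query over the full dataset.
    import pandas as pd

    # Hypothetical real-estate data standing in for the paper's domain.
    full = pd.DataFrame({
        "neighborhood": ["Shadyside", "Oakland", "Shadyside", "Squirrel Hill"] * 250,
        "price":        [180_000, 95_000, 240_000, 150_000] * 250,
        "sqft":         [1_800, 900, 2_400, 1_500] * 250,
    })
    sample = full.sample(n=50, random_state=0)   # fast, opportunistic exploration happens here

    recorded_steps = []                          # the "intensional" record of what the user did

    def select_range(df, attribute, low, high):
        """Direct-manipulation style range selection that also records the step."""
        recorded_steps.append((attribute, low, high))
        return df[df[attribute].between(low, high)]

    # Extensional exploration on the sample: narrow a range, look at the result, repeat.
    view = select_range(sample, "price", 100_000, 200_000)
    view = select_range(view, "sqft", 1_200, 2_000)

    def replay(df, steps):
        """Re-run the recorded steps as an explicit query against any dataset."""
        for attribute, low, high in steps:
            df = df[df[attribute].between(low, high)]
        return df

    result_on_full_data = replay(full, recorded_steps)
    print(len(view), "rows in the sample view;", len(result_on_full_data), "rows on the full dataset")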
Interactive Visualization Systems

Previous Interactive Visualization Systems

The interactive aspects of previous visualization systems that we make use of affect how datapoints in the visualization can change their appearance based on attributes of the domain objects they represent. In filtering, the points corresponding to houses in a certain neighborhood could be made invisible. In brushing (Becker & Cleveland 1987) the color of the points is changed. Datapoints can also be dragged and dropped to add or remove them from datasets, or to examine them in different tools. Direct manipulation interfaces are those in which operations are carried out by interacting with interface objects representing the semantic arguments of the operation. For instance, dragging an icon representing a file to a trash can is direct manipulation, while dialog boxes are not.
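To make this appearance-mapping idea concrete, the following small Python sketch (illustrative only; the House class and attribute names are hypothetical and not taken from any of the systems cited above) treats filtering as making a point invisible and brushing as changing its color, both driven by attributes of the domain object the point represents.

    # Illustrative sketch: map attributes of domain objects to the appearance of
    # the datapoints that represent them (filtering hides, brushing recolors).
    from dataclasses import dataclass

    @dataclass
    class House:                  # hypothetical domain object
        neighborhood: str
        price: int

    houses = [House("Shadyside", 240_000), House("Oakland", 95_000), House("Oakland", 120_000)]
    brushed = [houses[2]]         # e.g. the user swept a brush over this point

    def appearance(house, hidden_neighborhood):
        """Derive visual properties of a datapoint from its domain object."""
        return {
            "visible": house.neighborhood != hidden_neighborhood,  # filtering
            "color": "red" if house in brushed else "steelblue",   # brushing
        }

    for h in houses:
        print(h.neighborhood, h.price, appearance(h, hidden_neighborhood="Shadyside"))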
There are no previous systems that allow query construction by direct manipulation on domain-level data as well as schema-level data. This is largely because it is viewed as the job of the database management system (DBMS) to create a table with a given schema, and the visualization system's job to show all and only the domain-level data in the table. Visualizations are not viewed as full exploration interfaces themselves. Although some drill-down capability may be available, changing a query requires a distinct interface.

Further, each time a query is changed and re-run, a new table is sent to the visualization system, which won't know how to coordinate visualizations from the different tables even if the underlying conceptual objects are identical. For instance, if objects in one visualization are brushed/filtered it's difficult to identify the corresponding objects in other visualizations. The problem is compounded if the various visualizations originate from different database joins where the number of records differ.

Research is being done to increase the amount of exploration possible within a visualization via direct manipulation, such as brushing. Dynamic Query sliders (Ahlberg, Williamson, & Shneiderman 1992) filter interface objects representing data objects whose attribute value falls outside the range selected by the slider. They are called "queries" because they not only allow interactive exploration, but also make explicit queries (… Shneiderman 1992). For KDD, where we want to reuse queries on different data, having an explicit query is very important. However, these queries don't involve arithmetic functions of attributes and, more fundamentally, don't involve multiple objects and quantification, nor do they involve aggregation and aggregate functions.

There are some visual query languages, such as GQL (Papantonakis & King 1995), that can express these more complicated queries, but they are not integrated in a drag-and-drop manner with visualization systems. Thus the exploration loop requires the analyst to recall her sequence of exploration operations and reproduce them in the query language, rerun the query, and recreate any visualizations. The problem is much the same as with any other type of disconnected application used in the KDD process. With very few exceptions, the user is forced to manage the transmission of data from one application to another, increasing the overhead of exploration.

Visage

In addition to the capabilities of previous interactive visualization systems, Visage also includes the Sage expert system (Roth et al. 1994) for automated design of sophisticated visualizations integrating many attributes. Visage is information-centric (Roth et al. 1996), which means that data objects are represented as first-class interface objects, which can be manipulated using a common set of basic operations, such as drill-down and roll-up, applicable regardless of where they appear: in a hierarchical table, a slide show, a map, or a query. Furthermore, these graphical objects can be dragged across application boundaries. Allowing the visualization system direct access to the underlying database, rather than just isolated tables derived from it, is key to interface unification of the visualization application with the other components of a KDD system.

A user can navigate from the visual representation of any database object to other related objects. For instance, from a graphical object representing a real estate agent, all the houses listed by the agent could be found. It is also possible to aggregate database objects into a new object, which will have properties derived from its elements. For instance, we could aggregate the houses listed by John, and look up their average size. However, these operations are only specified procedurally, that is, by the sequence of direct manipulation operations – there is no explicit query. Once the …
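For contrast with the procedural account above, the paper's aggregation example (gather the houses listed by John and look up their average size) could be written down as an explicit, reusable query along the following lines. This is a rough pandas sketch with made-up column names, not Visage's Visual Query Language.

    # Rough sketch of the aggregation example as an explicit query (hypothetical columns).
    import pandas as pd

    houses = pd.DataFrame({
        "listing_agent": ["John", "John", "Mary", "John"],
        "sqft":          [1_800, 2_400, 1_500, 2_100],
    })

    # Navigate from an agent to the houses they list, aggregate, and derive a property.
    johns_houses = houses[houses["listing_agent"] == "John"]
    print("average size of John's listings:", johns_houses["sqft"].mean())

    # The same derived property for every agent at once, as one explicit query.
    print(houses.groupby("listing_agent")["sqft"].mean())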
Recommended publications
  • Improving Data Preparation for Business Analytics: Applying Technologies and Methods for Establishing Trusted Data Assets for More Productive Users
    Best Practices Report, Q3 2016, by David Stodder; research sponsors include Alation, Alteryx, Attivio, Datameer, Looker, Paxata, Pentaho (a Hitachi Group company), RedPoint Global, SAP/Intel, SAS, Talend, Trifacta, Trillium Software, and Waterline Data. The report covers why data preparation matters (better integration, business definition capture, self-service, and data governance); the elements of data preparation, namely finding, collecting, and delivering the right data, knowing the data and building a knowledge base, improving, integrating, and transforming the data, sharing and reusing data and knowledge about the data, and governing and stewarding the data; satisfaction with the current state of data preparation, where spreadsheets remain a major factor and interest in improvement is high, along with IT responsibility, centers of excellence, and interest in self-service; attributes of effective data preparation, including the data catalog as a shared resource; overcoming data preparation challenges such as time loss and inefficiency, IT responsiveness to requests, data volume and variety, and the disconnect between data preparation and analytics processes; and self-service data preparation objectives, including increasing self-service data preparation and self-service data integration and catalog development.
  • Automating Exploratory Data Analysis via Machine Learning: An Overview
    SIGMOD ’20 tutorial (June 14–19, 2020, Portland, OR, USA) by Tova Milo and Amit Somech, Tel Aviv University. Abstract: Exploratory Data Analysis (EDA) is an important initial step for any knowledge discovery process, in which data scientists interactively explore unfamiliar datasets by issuing a sequence of analysis operations (e.g., filter, aggregation, and visualization). Since EDA is long known as a difficult task, requiring profound analytical skills, experience, and domain knowledge, a plethora of systems have been devised over the last decade to facilitate it. In particular, advancements in machine learning research have created exciting opportunities, not only for better facilitating EDA but for fully automating the process. The tutorial reviews recent lines of work for automating EDA, starting from recommender systems that suggest a single exploratory action and going through kNN-based classifiers and active-learning methods for predicting users’ interestingness. The introduction adds that EDA lets users examine a dataset up close, better understand its nature and characteristics, and extract preliminary insights from it; that it is typically done by interactively applying analysis operations such as filtering, aggregation, and visualization, with the user examining the results of each operation and deciding if and which operation to employ next; and that, because it is a complex, time-consuming process, especially for non-expert users, numerous lines of work facilitate it along multiple dimensions, including simplified EDA interfaces (e.g., Tableau) for non-programmers who do not know scripting languages or SQL, and solutions that improve interactivity by reducing the running times of exploratory operations and displaying preliminary sketches of their results.
  • The Landscape of R Packages for Automated Exploratory Data Analysis by Mateusz Staniak and Przemysław Biecek
    Contributed research article by Mateusz Staniak and Przemysław Biecek. Abstract: The increasing availability of large but noisy data sets with a large number of heterogeneous variables leads to increasing interest in the automation of common data analysis tasks. The most time-consuming part of this process is Exploratory Data Analysis, crucial for better domain understanding, data cleaning, data validation, and feature engineering. A growing number of libraries attempt to automate some of the typical Exploratory Data Analysis tasks to make the search for new insights easier and faster. The paper presents a systematic review of existing tools for Automated Exploratory Data Analysis (autoEDA), exploring the features of fifteen popular R packages to identify the parts of the analysis that can be effectively automated with current tools and to point out new directions for further autoEDA development. The introduction notes that with the advent of tools for automated model training (autoML), building predictive models is becoming easier, more accessible, and faster than ever: tools for R such as mlrMBO (Bischl et al., 2017) and parsnip (Kuhn and Vaughan, 2019), tools for Python such as TPOT (Olson et al., 2016), auto-sklearn (Feurer et al., 2015), and autoKeras (Jin et al., 2018), and tools for other languages such as H2O Driverless AI (H2O.ai, 2019; Cook, 2016) and autoWeka (Kotthoff et al., 2017) support fully or semi-automated feature engineering and selection, model tuning, and training of predictive models. Yet model building is always preceded by a phase of understanding the problem, the domain, and the data set.
  • Lecture #1: Exploratory Data Analysis (CS 109A, STAT 121A, AC 209A: Data Science)
    Lecture slides by Weiwei Pan, Pavlos Protopapas, and Kevin Rader, Harvard University, Fall 2016. After course announcements (communication via Piazza, HW0 submission through Vocareum with an extended deadline, lab streaming for extension students, and Canvas access for cross-registered students), the lecture recaps the data science process: ask questions, data collection, data exploration, data modeling, data analysis, and visualization and presentation of results, noting that the process is not linear. The lecture outline covers what data is, exploring data, descriptive statistics, data visualization, an example, and what comes next. A datum is defined as “a single measurement of something on a scale that is understandable to both the recorder and the reader,” and data as multiple such measurements, with the provocative claim that everything is (or can be) data. Data may come from internal sources already collected by, or part of, the organization's overall data collection (for example, business-centric data recording day-to-day operations, or scientific and experimental data) or from existing external sources available in ready-to-read format for free or for a fee.
  • Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data
    SIGMOD ’20 paper (International Conference on Management of Data, June 2020, Portland, OR, United States, pp. 1571–1587, DOI 10.1145/3318464.3389732) by Leilani Battle, Philipp Eichmann, Marco Angelini, Tiziana Catarci, Giuseppe Santucci, Yukun Zheng, Carsten Binnig, Jean-Daniel Fekete, and Dominik Moritz, with authors affiliated with the University of Maryland, College Park; Brown University; the University of Rome “La Sapienza”; the Technical University of Darmstadt; and Inria. The paper is available as open-access deposit hal-02556400 (https://hal.inria.fr/hal-02556400), submitted 28 April 2020 to HAL, the multi-disciplinary open archive for research documents.
  • Exploring the Data Wilderness Through Examples
    Tutorial by Davide Mottin (Aarhus University), Matteo Lissandrini (Aalborg University), Yannis Velegrakis (Utrecht University), and Themis Palpanas (Paris Descartes University). Abstract: Exploration is one of the primordial ways to accrue knowledge about the world and its nature. As we accumulate, mostly automatically, data at unprecedented volumes and speed, our datasets have become complex and hard to understand. In this context exploratory search provides a handy tool for progressively gathering the necessary knowledge, starting from a tentative query that hopefully leads to answers that are at least partially relevant and that can provide cues about the next queries to issue. Recently, we have witnessed a rediscovery of the so-called example-based methods, in which the user or the analyst circumvents query languages by using examples as input. This shift in semantics has led to a number of methods that receive as query a set of example members of the answer set; the search system then infers the entire answer set based on the given examples and any additional information provided by the underlying database. The tutorial presents an excursus over the main example-based methods for exploratory analysis, shows techniques tailored to different data types, and provides a unifying view of the field. (The original Figure 1 depicts example-based data exploration as a stack running from the interface and machine learning, through example-based approaches and middleware, down to data structures and hardware over relational, graph, and text data.) The stated scope is data exploration: methods to efficiently extract knowledge from data even if we do not know exactly what we are looking for, nor how to precisely describe our needs, with the user progressively acquiring knowledge by issuing a sequence of generic queries to gather intelligence about the data.
  • Astronomy in the Big Data Era
    Proceedings paper by Yanxia Zhang and Yongheng Zhao, Data Science Journal, 14, p. 11, published 22 May 2015, peer reviewed, CC BY 4.0, DOI: http://doi.org/10.5334/dsj-2015-011. Abstract: The fields of Astrostatistics and Astroinformatics are vital for dealing with the big data issues now faced by astronomy. Like other disciplines in the big data era, astronomy has many V characteristics. The paper lists the different data mining algorithms used in astronomy, along with data mining software and tools related to astronomical applications; presents SDSS, a project often referred to by other astronomical projects, as the most successful sky survey in the history of astronomy and describes the factors influencing its success; and discusses the success of Astrostatistics and Astroinformatics organizations and the conferences and summer schools on these issues held annually. All of this indicates that astronomers and scientists from other areas are ready to face the challenges and opportunities provided by massive data volume. Keywords: big data, data mining, Astrostatistics, Astroinformatics. The introduction notes that the continuing construction and development of ground-based and space-borne sky surveys, ranging from gamma rays, X-rays, ultraviolet, optical, and infrared to radio bands, is bringing astronomy into the big data era.
  • Big Data Visualization Tools
    Encyclopedia article by Nikos Bikakis (ATHENA Research Center, Greece), appearing in the Encyclopedia of Big Data Technologies, Springer, 2018, and also available as arXiv:1801.08336v2 [cs.DB]. Synonyms: visual exploration; interactive visualization; information visualization; visual analytics; exploratory data analysis. Definition: data visualization is the presentation of data in a pictorial or graphical format, and a data visualization tool is the software that generates this presentation; it provides users with intuitive means to interactively explore and analyze data, enabling them to effectively identify interesting patterns, infer correlations and causalities, and support sense-making activities. The overview notes that exploring, visualizing, and analyzing data is a core task for data scientists and analysts in numerous applications (the article uses the terms visualization and visual exploration, and tool and system, interchangeably); that the Big Data era has made available a great number of massive datasets that are dynamic, noisy, and heterogeneous in nature; that turning a data-curious user into someone who can access and analyze that data is now even more burdensome for the many users with little or no support and expertise in data processing; and that data visualization has become a major research challenge involving issues related to data storage, querying, indexing, visual presentation, interaction, and personalization.
  • VizCertify: A Framework for Secure Visual Data Exploration
    Paper by Lorenzo De Stefani, Leonhard F. Spiegelberg, and Eli Upfal (Department of Computer Science, Brown University) and Tim Kraska (CSAIL, Massachusetts Institute of Technology). Abstract: Recently, there have been several proposals to develop visual recommendation systems. The most advanced systems aim to recommend visualizations that help users find new correlations or identify an interesting deviation based on the current context of the user's analysis. However, when recommending a visualization to a user, there is an inherent risk of visualizing random fluctuations rather than solely true patterns: a problem largely ignored by current techniques. The paper presents VizCertify, a novel framework to improve the performance of visual recommendation systems by quantifying the statistical significance of recommended visualizations. The proposed methodology allows controlling the probability of misleading visual recommendations using a novel application of the Vapnik-Chervonenkis (VC) dimension, which results in an effective criterion to decide whether a candidate visualization recommendation corresponds to an actually statistically significant phenomenon. The introduction illustrates the risk with an example in which users infer that French wines are generally rated higher, generalizing an observed insight to all wines: statistically savvy users would test whether the generalization is statistically valid with an appropriate test, and more technically savvy users would adjust the testing procedure for the multiple comparisons problem, since every additional hypothesis, whether explicitly expressed as a test or implicitly observed through a visualization, increases the risk of finding insights that are just spurious effects.
  • The New Analytics Lifecycle: Accelerating Insight to Drive Innovation
    Eckerson Group report by Dave Wells, November 2017, sponsored by Alteryx, Amazon Web Services, and Tableau. Dave Wells is an advisory consultant, educator, and research analyst working at the intersection of information and business, with nearly five decades of combined experience in information management and business management; Eckerson Group is a research and consulting firm that helps business and analytics leaders use data and technology to drive better insights and actions. The executive summary opens by observing that analytics is at the forefront of modern business management, that the role of analytics in informed decision making is well known, and that much attention is given to the value of insight.
  • Data Exploration and Visualization
    Training presentation produced within the Istanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program; sole responsibility for the content belongs to Bahçeşehir University, and it does not reflect the views of ISTKA or the Ministry of Development. The slides define visualization as the conversion of data into a visual or tabular format, which helps in understanding the characteristics of the data and in analyzing or reporting the relationships among data items or attributes; visualization is one of the most powerful and appealing techniques for data exploration because humans have a well-developed ability to analyze large amounts of visually presented information and can detect general patterns and trends as well as outliers and unusual patterns. Example business questions range from simple descriptive statistics (“Who are the most profitable customers?”) through hypothesis testing (“Is there a difference in value to the company of these customers?”), segmentation and classification (“What are the common characteristics of these customers?”), and prediction (“Will this new customer become a profitable customer? If so, how profitable?”), adapted from Provost and Fawcett, Data Science for Business. The slides also walk through the data science workflow: ask an interesting question (What is the scientific goal? What would you do if you had all the data? What do you want to predict or estimate?); get the data (How were the data sampled? Which data are relevant? Are there privacy issues?); explore the data (Plot the data. Are there anomalies? Are there patterns?); model the data (Build a model. Fit the model. Validate the model.); and communicate and visualize the results (What did we learn? Do the results make sense?).
  • Futzing and Moseying: Interviews with Professional Data Analysts on Exploration Practices
    Paper by Sara Alspaugh, Nava Zokaei, Andrea Liu, Cindy Jin, and Marti A. Hearst. Abstract: We report the results of interviewing thirty professional data analysts working in a range of industrial, academic, and regulatory environments. This study focuses on participants' descriptions of exploratory activities and tool usage in these activities. Highlights of the findings include: distinctions between exploration as a precursor to more directed analysis versus truly open-ended exploration; confirmation that some analysts see “finding something interesting” as a valid goal of data exploration while others explicitly disavow this goal; conflicting views about the role of intelligent tools in data exploration; and pervasive use of visualization for exploration, but with only a subset using direct manipulation interfaces. These findings provide guidelines for future tool development, as well as a better understanding of the meaning of the term “data exploration” based on the words of practitioners “in the wild.” Index terms: EDA, exploratory data analysis, interview study, visual analytics tools. The introduction notes that the professional field known variously as data analysis, visual analytics, business intelligence, and, more recently, data science continues to expand year over year, and that this interdisciplinary field requires its practitioners to acquire diverse technical and mental skills and to be comfortable working with ill-defined goals and uncertain outcomes; example exploratory questions are “what's going on with my users?” and “has anything interesting happened this quarter?”