Data Exploration and Analysis Guide

Total Page:16

File Type:pdf, Size:1020Kb

Data Exploration and Analysis Guide Oracle® Big Data Discovery Data Exploration and Analysis Guide Version 1.0.0 • Revision A • March 2015 Copyright and disclaimer Copyright © 2003, 2015, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. UNIX is a registered trademark of The Open Group. This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency- specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs. No other rights are granted to the U.S. Government. This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications. This software or hardware and documentation may provide access to or information on content, products and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services. Oracle® Big Data Discovery: Data Exploration and Analysis Guide Version 1.0.0 • Revision A • March 2015 Table of Contents Copyright and disclaimer ..........................................................2 Preface.........................................................................13 About this guide ...............................................................13 Who should use this guide ........................................................13 Conventions used in this document .................................................13 Contacting Oracle Customer Support ................................................14 Part I: Introduction to Big Data Discovery Chapter 1: Overview .............................................................16 Value of Using Big Data Discovery ..................................................16 Your tasks within Big Data Discovery ................................................17 About the Workflow in Big Data Discovery ............................................18 Find ....................................................................19 Profile and Enrich Data ......................................................21 Explore ..................................................................21 Transform ................................................................22 Discover: Analyze and Obtain Results ...........................................23 Filter, Use Guided Navigation, and Search ........................................24 Understand the Data Flow ....................................................24 Part II: Logging in and Managing Your Account Chapter 2: Getting Access to Big Data Discovery.....................................27 Options for access to Big Data Discovery .............................................27 Logging in to Big Data Discovery ...................................................27 Chapter 3: Managing Your Big Data Discovery Account ...............................29 Changing your Big Data Discovery password ..........................................30 Selecting the locale to use for Big Data Discovery.......................................31 About locales .............................................................31 Configuring the locale for your user account .......................................32 Selecting a locale from the user menu ...........................................32 Chapter 4: Using Big Data Discovery on an iPad .....................................34 Part III: Searching, Filtering, and Navigating in Big Data Discovery Chapter 5: Using Refinements for Filtering ..........................................36 About using refinements for filtering .................................................36 Oracle® Big Data Discovery: Data Exploration and Analysis Guide Version 1.0.0 • Revision A • March 2015 Table of Contents 4 Using a value list to select refinements ...............................................37 Using a range filter to select refinements .............................................41 Using date/time values for refinement................................................44 About implicit refinements ........................................................46 Chapter 6: Using the Big Data Discovery Search Feature ..............................48 About the Big Data Discovery search feature ..........................................48 Displaying and using the results of a search ...........................................48 Notes about keyword searches ....................................................56 Chapter 7: Managing Selected Refinements .........................................58 About the Selected Refinements panel ...............................................58 Removing refinements from the Selected Refinements panel...............................60 Part IV: Finding and Browsing Available Data Sets and Projects Chapter 8: About the Catalog ......................................................62 Chapter 9: Filtering and Searching Data Sets and Projects.............................63 Chapter 10: Browsing Information About Data Sets or Projects.........................65 Chapter 11: About Tags ..........................................................67 Adding and removing tags for catalog items ...........................................67 Part V: Creating, Exploring, and Deleting Data Sets Chapter 12: Creating a New Data Set ...............................................69 Loading data from a file ..........................................................69 Chapter 13: Exploring a Data Set...................................................71 About exploring a data set .......................................................71 Selecting the data to explore ......................................................71 Switching Explore between tile view and table view......................................72 Filtering and sorting the attributes in Explore...........................................72 Working with an individual attribute in Explore..........................................74 Using the scratchpad............................................................75 Chapter 14: Deleting Data Sets ....................................................80 Deleting a data set .............................................................80 Part VI: Creating and Configuring Projects Chapter 15: Creating a project .....................................................82 About projects .................................................................82 Creating a project using the Catalog.................................................82 Creating a project from within a data set .............................................83 Oracle® Big Data Discovery: Data Exploration and Analysis Guide Version 1.0.0 • Revision A • March 2015 Table of Contents 5 Chapter 16: Managing Project Pages ...............................................84 Adding a page to a project ........................................................84 Renaming a page ..............................................................84 Changing the layout of a page .....................................................85 Changing the data set binding for a page .............................................85 Deleting a page................................................................86 Chapter 17: Adding and Configuring Components ....................................87 Adding and deleting components on a page ...........................................87 Renaming components ..........................................................88
Recommended publications
  • Improving Data Preparation for Business Analytics Applying Technologies and Methods for Establishing Trusted Data Assets for More Productive Users
    BEST PRACTICES REPORT Q3 2016 Improving Data Preparation for Business Analytics Applying Technologies and Methods for Establishing Trusted Data Assets for More Productive Users By David Stodder Research Sponsors Research Sponsors Alation, Inc. Alteryx, Inc. Attivio, Inc. Datameer Looker Paxata Pentaho, a Hitachi Group Company RedPoint Global SAP / Intel SAS Talend Trifacta Trillium Software Waterline Data BEST PRACTICES REPORT Q3 2016 Improving Data Table of Contents Research Methodology and Demographics 3 Preparation for Executive Summary 4 Business Analytics Why Data Preparation Matters 5 Better Integration, Business Definition Capture, Self-Service, and Applying Technologies and Methods for Data Governance. 6 Establishing Trusted Data Assets for Sidebar – The Elements of Data Preparation: More Productive Users Definitions and Objectives 8 By David Stodder Find, Collect, and Deliver the Right Data . 8 Know the Data and Build a Knowledge Base. 8 Improve, Integrate, and Transform the Data . .9 Share and Reuse Data and Knowledge about the Data . 9 Govern and Steward the Data . 9 Satisfaction with the Current State of Data Preparation 10 Spreadsheets Remain a Major Factor in Data Preparation . 10 Interest in Improving Data Preparation . 11 IT Responsibility, CoEs, and Interest in Self-Service . 13 Attributes of Effective Data Preparation 15 Data Catalog: Shared Resource for Making Data Preparation Effective . 17 Overcoming Data Preparation Challenges 19 Addressing Time Loss and Inefficiency . 21 IT Responsiveness to Data Preparation Requests . 22 Data Volume and Variety: Addressing Integration Challenges . 22 Disconnect between Data Preparation and Analytics Processes . 24 Self-Service Data Preparation Objectives 25 Increasing Self-Service Data Preparation . 27 Self-Service Data Integration and Catalog Development .
    [Show full text]
  • Automating Exploratory Data Analysis Via Machine Learning: an Overview
    Tutorials SIGMOD ’20, June 14–19, 2020, Portland, OR, USA Automating Exploratory Data Analysis via Machine Learning: An Overview Tova Milo Amit Somech Tel Aviv University Tel Aviv University [email protected] [email protected] ABSTRACT up-close, better understand its nature and characteristics, Exploratory Data Analysis (EDA) is an important initial step and extract preliminary insights from it. Typically, EDA is for any knowledge discovery process, in which data scien- done by interactively applying various analysis operations tists interactively explore unfamiliar datasets by issuing a (such as filtering, aggregation, and visualization) – the user sequence of analysis operations (e.g. filter, aggregation, and examines the results of each operation, and decides if and visualization). Since EDA is long known as a difficult task, which operation to employ next. requiring profound analytical skills, experience, and domain EDA is long known to be a complex, time consuming pro- knowledge, a plethora of systems have been devised over cess, especially for non-expert users. Therefore, numerous the last decade in order to facilitate EDA. lines of work were devised to facilitate the process, in multi- In particular, advancements in machine learning research ple dimensions [10, 24, 25, 31, 42, 46]: First, simplified EDA 1 have created exciting opportunities, not only for better fa- interfaces (e.g., [25], Tableau ) allow non-programmers to cilitating EDA, but to fully automate the process. In this effectively explore datasets without knowing scripting lan- tutorial, we review recent lines of work for automating EDA. guages or SQL. Second, numerous solutions (e.g., [6, 14, 22]) Starting from recommender systems for suggesting a single were devised to improve the interactivity of EDA, by re- exploratory action, going through kNN-based classifiers and ducing the running times of exploratory operations, and active-learning methods for predicting users’ interestingness displaying preliminary sketches of their results.
    [Show full text]
  • The Landscape of R Packages for Automated Exploratory Data Analysis by Mateusz Staniak and Przemysław Biecek
    CONTRIBUTED RESEARCH ARTICLE 1 The Landscape of R Packages for Automated Exploratory Data Analysis by Mateusz Staniak and Przemysław Biecek Abstract The increasing availability of large but noisy data sets with a large number of heterogeneous variables leads to the increasing interest in the automation of common tasks for data analysis. The most time-consuming part of this process is the Exploratory Data Analysis, crucial for better domain understanding, data cleaning, data validation, and feature engineering. There is a growing number of libraries that attempt to automate some of the typical Exploratory Data Analysis tasks to make the search for new insights easier and faster. In this paper, we present a systematic review of existing tools for Automated Exploratory Data Analysis (autoEDA). We explore the features of fifteen popular R packages to identify the parts of the analysis that can be effectively automated with the current tools and to point out new directions for further autoEDA development. Introduction With the advent of tools for automated model training (autoML), building predictive models is becoming easier, more accessible and faster than ever. Tools for R such as mlrMBO (Bischl et al., 2017), parsnip (Kuhn and Vaughan, 2019); tools for python such as TPOT (Olson et al., 2016), auto-sklearn (Feurer et al., 2015), autoKeras (Jin et al., 2018) or tools for other languages such as H2O Driverless AI (H2O.ai, 2019; Cook, 2016) and autoWeka (Kotthoff et al., 2017) supports fully- or semi-automated feature engineering and selection, model tuning and training of predictive models. Yet, model building is always preceded by a phase of understanding the problem, understanding of a domain and exploration of a data set.
    [Show full text]
  • Lecture #1: Exploratory Data Analysis CS 109A, STAT 121A, AC 209A: Data Science
    lecture #1: exploratory data analysis CS 109A, STAT 121A, AC 209A: Data Science . Weiwei Pan, Pavlos Protopapas, Kevin Rader Fall 2016 Harvard University announcements ■ How to optimally communicate with us (Piazza then help line) ■ Issues with Gutenberg in HW0 (for now stop using Gutenberg) ■ Extension students: labs stream on Friday (posted same day), interactive lab via zoom, TF office hours, extra day on HW ■ Cross-registered students already have or will get full access to Canvas today (please check!) ■ HW0 must be submitted (through Vocareum)! ■ HW0 submission deadline extended to Wednesday (today) at 11:59pm. ■ Time and effort required to complete HW0 will vary depending on familiarity with programming (this is the learning curve), and is a good indicator of fit of the class (the programming will not get easier) ■ HW1 is released 1 the data science process On the first day of class you were introduced to the “data science” process. ■ Ask questions ■ Data Collection ■ Data Exploration ■ Data Modeling ■ Data Analysis ■ Visualization and Presentation of Results Note: This process is not linear! 2 lecture outline What Is Data? Exploring Data Descriptive Statistics Data Visualization An Example What Next? 3 .what is data? what is data? “A datum is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data is multiple such measurements.” Provocative claim: everything is (can be) data! 5 where does it come from? ■ Internal sources: already collected by or is part of the overall data collection of you organization. For example: business-centric data that is available in the organization data base to record day to day operations; scientific or experimental data ■ Existing External Sources: available in ready to read format from an outside source for free or for a fee.
    [Show full text]
  • Database Benchmarking for Supporting Real-Time Interactive
    Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data Leilani Battle, Philipp Eichmann, Marco Angelini, Tiziana Catarci, Giuseppe Santucci, Yukun Zheng, Carsten Binnig, Jean-Daniel Fekete, Dominik Moritz To cite this version: Leilani Battle, Philipp Eichmann, Marco Angelini, Tiziana Catarci, Giuseppe Santucci, et al.. Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data. SIGMOD ’20 - International Conference on Management of Data, Jun 2020, Portland, OR, United States. pp.1571-1587, 10.1145/3318464.3389732. hal-02556400 HAL Id: hal-02556400 https://hal.inria.fr/hal-02556400 Submitted on 28 Apr 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data Leilani Battle Philipp Eichmann Marco Angelini University of Maryland∗ Brown University University of Rome “La Sapienza”∗∗ College Park, Maryland Providence, Rhode Island Rome, Italy [email protected] [email protected] [email protected] Tiziana Catarci Giuseppe Santucci Yukun Zheng University of Rome “La Sapienza”∗∗ University of Rome “La Sapienza”∗∗ University of Maryland∗ [email protected] [email protected] [email protected] Carsten Binnig Jean-Daniel Fekete Dominik Moritz Technical University of Darmstadt Inria, Univ.
    [Show full text]
  • An Interactive Visualization Environment for Data Exploration
    An Interactive Visualization Environment for Data Exploration Mark Derthick and John Kolojejchick and Steven F. Roth {mad+,jake+,roth+}@cs.cmu.edu Carnegie Mellon University School of Computer Science Pittsburgh, PA 15213 Abstract increase sales” it is impossible to formally characterize what constitutes an answer. A surprising observation Exploratory data analysis is a process of sifting may lead to new goals, which lead to new surprises, and through data in search of interesting information or patterns. Analysts’ current tools for exploring data in- in turn more goals. To facilitate this kind of iterative clude database management systems, statistical anal- exploration, we would like to provide the analyst rapid, ysis packages, data mining tools, visualization tools, incremental, and reversible operations giving continu- and report generators. Since the exploration process ous visual feedback. seeks the unexpected in a data-driven manner, it is In some ways, the state of the art 30 years ago crucial that these tools are seamlessly integrated so was closer to this goal of a tight loop between explo- analysts can flexibly select and compose tools to use at ration, hypothesis formation, and hypothesis testing each stage of analysis. Few systems have integrated all these capabilities either architecturally or at the user than it is today. In the era of slide rules and graph pa- interface level. Visage’s information-centric approach per, the analyst’s forced intimacy with the data bred allows coordination among multiple application user strong intuitions. As computer power and data sets interfaces. It uses an architecture that keeps track of have grown, analysis retaining the human focus has the mapping of visual objects to information in shared become on-line analytical processing (OLAP).
    [Show full text]
  • Exploring the Data Wilderness Through Examples
    Exploring the Data Wilderness through Examples Davide Mottin Matteo Lissandrini Aarhus University Aalborg University [email protected] [email protected] Yannis Velegrakis Themis Palpanas Utrecht University Paris Descartes University [email protected] [email protected] Machine Machine ABSTRACT Interface learning Exploration is one of the primordial ways to accrue knowl- edge about the world and its nature. As we accumulate, Example-based approaches mostly automatically, data at unprecedented volumes and Middleware speed, our datasets have become complex and hard to un- derstand. In this context exploratory search provides a handy Data structures and hardware tool for progressively gather the necessary knowledge by starting from a tentative query that hopefully leads to an- Relational Graph Text swers at least partially relevant and that can provide cues about the next queries to issue. Recently, we have witnessed a rediscovery of the so-called example-based methods, in which Figure 1: A view of example-based data exploration. the user or the analyst circumvent query languages by using examples as input. This shift in semantics has led to a num- ber of methods receiving as query a set of example members 1 SCOPE OF THE TUTORIAL of the answer set. The search system then infers the entire Data exploration includes methods to eciently extract knowl- answer set based on the given examples and any additional edge from data, even if we do not know what exactly we are information provided by the underlying database. In this tu- looking for, nor how to precisely describe our needs. In such torial, we present an excursus over the main example-based exploration, the user progressively acquires the knowledge methods for exploratory analysis, show techniques tailored by issuing a sequence of generic queries to gather intelli- to dierent data types, and provide a unifying view of the gence about the data.
    [Show full text]
  • Astronomy in the Big Data Era Search
    Astronomy in the big data era Search... \ Log in Register Share: A- Alt. Download A+ Display Proceedings Papers Astronomy in the Big Data Era Yongheng Authors: Yanxia Zhang , Zhao Abstract The fields of Astrostatistics and Astroinformatics are vital for dealing with the big data issues now faced by astronomy. Like other disciplines in the big data era, astronomy has many V characteristics. In this paper, we list the different data mining algorithms used in astronomy, along with data mining software and tools related to astronomical applications. We present SDSS, a project often referred to by other astronomical projects, as the most successful sky survey in the history of astronomy and describe the factors influencing its success. We also discuss the success of Astrostatistics and Astroinformatics organizations and the conferences and summer schools on these issues that are held annually. All the above indicates that astronomers and scientists from other areas are ready to face the challenges and opportunities provided by massive data volume. Keywords: Big data , Data mining , Astrostatistics , Astroinformatics How to Cite: Zhang, Y. and Zhao, Y., 2015. Astronomy in the Big Data Era. Data Science Journal, 14, p.11. DOI: http://doi.org/10.5334/dsj-2015-011 12912 1197 4 19 14 Views Downloads Citations Twitter Facebook Published on 22 May 2015 Peer Reviewed CC BY 4.0 1 Introduction At present, the continuing construction and development of ground-based and space-born sky surveys ranging from gamma rays and X-rays, ultraviolet, optical, and infrared to radio bands is bringing astronomy into the big data era.
    [Show full text]
  • Big Data Visualization Tools ?
    Big Data Visualization Tools ? Nikos Bikakis ATHENA Research Center, Greece 1 Synonyms Visual exploration; Interactive visualization; Information visualization; Vi- sual analytics; Exploratory data analysis. 2 Definition Data visualization is the presentation of data in a pictorial or graphical for- mat, and a data visualization tool is the software that generates this presenta- tion. Data visualization provides users with intuitive means to interactively explore and analyze data, enabling them to effectively identify interesting patterns, infer correlations and causalities, and supports sense-making activ- ities. 3 Overview Exploring, visualizing and analysing data is a core task for data scientists and analysts in numerous applications. Data visualization2 [1] provides intuitive ways for the users to interactively explore and analyze data, enabling them arXiv:1801.08336v2 [cs.DB] 22 Feb 2018 to effectively identify interesting patterns, infer correlations and causalities, and support sense-making activities. ? This article appears in Encyclopedia of Big Data Technologies, Springer, 2018 2 Throughout the article, terms visualization and visual exploration, as well as terms tool and system are used interchangeably. 1 2 Nikos Bikakis The Big Data era has realized the availability of a great amount of massive datasets that are dynamic, noisy and heterogeneous in nature. The level of difficulty in transforming a data-curious user into someone who can access and analyze that data is even more burdensome now for a great number of users with little or no support and expertise on the data processing part. Data visualization has become a major research challenge involving several issues related to data storage, querying, indexing, visual presentation, interaction, personalization [2,3,4,5,6,7,8,9].
    [Show full text]
  • Vizcertify: a Framework for Secure Visual Data Exploration
    VizCertify: A framework for secure visual data exploration Lorenzo De Stefani, Leonhard F. Spiegelberg, Eli Upfal Tim Kraska Department of Computer Science CSAIL Brown University Massachusetts Institute of Technology Providence, United States of America Cambridge, United States of America florenzo, lspiegel, [email protected] [email protected] Abstract—Recently, there have been several proposals to insightful then. Rather, users most likely would infer that develop visual recommendation systems. The most advanced French wines are generally rated higher; generalizing their systems aim to recommend visualizations, which help users to insight to all wines. Statistically savvy users will now test find new correlations or identify an interesting deviation based on the current context of the user’s analysis. However, when whether this generalization is actually statistically valid using recommending a visualization to a user, there is an inherent risk an appropriate test. Even more technically-savvy users will to visualize random fluctuations rather than solely true patterns: consider other hypotheses they tried and adjust the statistical a problem largely ignored by current techniques. testing procedure to account for the multiple comparisons In this paper, we present VizCertify, a novel framework to problem. This is important since every additional hypothesis, improve the performance of visual recommendation systems by quantifying the statistical significance of recommended vi- explicitly expressed as a test or implicitly observed through a sualizations. The proposed methodology allows controlling the visualization, increases the risk of finding insights which are probability of misleading visual recommendations using a novel just spurious effects. application of the Vapnik Chervonenkis (VC) dimension which The issue with visual recommendation engines: What results in an effective criterion to decide whether a candidate happens when visualization recommendations are generated visualization recommendation corresponds to an actually statis- tically significant phenomenon.
    [Show full text]
  • The New Analytics Lifecycle Accelerating Insight to Drive Innovation
    The New Analytics Lifecycle Accelerating Insight to Drive Innovation Dave Wells November 2017 The New Analytics Lifecycle About the Author Dave Wells is an advisory consultant, educator, and research analyst dedicated to building meaningful connections along the path from data to business impact. He works at the intersection of information and business, driving value through analytics, business intelligence, and innovation. With nearly five decades of combined experience in information management and business management, Dave has a unique perspective about the connections of business, information, data, and technology. Knowledge sharing and skills building are Dave’s passions, carried out through consulting, speaking, teaching, research, and writing. He is a continuous learner—fascinated with understanding how we think—and a student and practitioner of systems thinking, critical thinking, design thinking, divergent thinking, and innovation. About Eckerson Group Eckerson Group is a research and consulting firm that helps business and analytics leaders use data and technology to drive better insights and actions. Through its reports and advisory services, the firm helps companies maximize their investments in data and analytics. Its researchers and consultants each have more than 20 years of experience in the field and are uniquely qualified to help business and technical leaders succeed with business intelligence, analytics, data management, data governance, performance management, and data science. About This Report This report is sponsored by Alteryx, Amazon Web Services and Tableau. © Eckerson Group 2017 www.eckerson.com 2 The New Analytics Lifecycle Executive Summary Analytics is at the forefront of modern business management. The role of analytics for informed decision making is well known, and much attention is given to the value of insight.
    [Show full text]
  • Data Exploration and Visualization
    Data Exploration and Visualization Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no’lu İstanbul Big Data Eğitim ve Araştırma Merkezi Projesi dahilinde gerçekleştirilmiştir. İçerik ile ilgili tek sorumluluk Bahçeşehir Üniversitesi’ne ait olup İSTKA veya Kalkınma Bakanlığı’nın görüşlerini yansıtmamaktadır. Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. Visualization of data is one of the most powerful and appealing techniques for data exploration. – Humans have a well developed ability to analyze large amounts of information that is presented visually – Can detect general patterns and trends – Can detect outliers and unusual patterns Examples of Business Questions • Simple (descriptive) Stats • “Who are the most profitable customers?” • Hypothesis Testing • “Is there a difference in value to the company of these customers?” • Segmentation/Classification • What are the common characteristics of these customers? • Prediction • Will this new customer become a profitable customer? If so, how profitable? adapted from Provost and Fawcett, “Data Science for Business” Ask an interesting What is the scientific goal? What would you do if you had all the data? question. What do you want to predict or estimate? How were the data sampled? Which data are relevant? Get the data. Are there privacy issues? Plot the data. Are there anomalies? Explore the data. Are there patterns? Build a model. Model the data. Fit the model. Validate the model. Communicate and What did we learn? Do the results make sense? visualize the results.
    [Show full text]