Some Remarks on Interactive Data Analysis

Jeffrey Heer

The Data Deluge

2010: 1,200 exabytes
10x increase over 5 years
Gantz et al., 2008, 2010

[Images: cabspotting.org; Gapminder.org; Wikipedia History Flow (IBM)]

The Rise of the “Data Scientist”

Looking for a career where your services will be in high demand? … Provide a scarce, complementary service to something that is getting ubiquitous and cheap.

So what is ubiquitous and cheap? Data. And what is complementary to data? Analysis.

Hal Varian, Google’s Chief Economist

The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids…

Hal Varian, Google’s Chief Economist

Shifting Analysis Practices

Example: Fox Audience Network

Greenplum DB
  - 42 Sun X4500s, each with: 48 500GB drives, 16GB RAM, 2 dual-core Opterons
Big and growing
  - 200 TB data (mirrored)
  - Fact table of 1.5 trillion rows
  - Growing 5TB per day
  - 4-7 billion rows per day
Variety of data
  - Ad logs, CRM, user data
Research & Reporting
  - Diversity of users: from Sales Mgrs to Research Scientists
  - Tools from MicroStrategy UI to command-line SQL
  - Extensive use of R and MapReduce / Hadoop

As reported by FAN, Feb 2009

Example Analysis Questions

How many female WWF fans under the age of 30 visited the Toyota community over the last 4 days and saw a Class A ad?

How are these people similar to those that visited Nissan?

Greenplum: SQL & MapReduce

Unified execution of SQL, MapReduce on a common parallel execution engine
Analyze structured or unstructured data, inside or outside the database
Scale out parallelism on commodity hardware

[Diagram labels: ODBC, JDBC, etc. (SQL); MapReduce Code (Perl, Python, etc.); Query Planner and Optimizer; Transaction Manager & Log; Parallel DataFlow Engine; External Storage; Database Storage]

Agile Analysts

Statistically savvy, diversity of training & tools
Ad-hoc queries and analytics are common
Want access to the full data, not just samples
Bring new data to analysis tasks
  - e.g., correlate sales with weather
  - Willing to clean, integrate data

[Diagram: stages of the data life-cycle: Acquisition, Cleaning, Integration, Visualization, Analysis, Presentation, Dissemination]

Interactive Visualization

Goals of Visualization Research

1. Understand how people use visualizations to gain insight

2. Design principles and techniques for effective visual analysis systems

Visual Data Analysis

1. Data diagnostics

[Figure panels: Node-link, Matrix, Matrix]

Visual Data Analysis

1. Data diagnostics
2. Exploratory data analysis

  Set A        Set B        Set C        Set D
 X      Y     X      Y     X      Y     X      Y
10   8.04    10   9.14    10   7.46     8   6.58
 8   6.95     8   8.14     8   6.77     8   5.76
13   7.58    13   8.74    13  12.74     8   7.71
 9   8.81     9   8.77     9   7.11     8   8.84
11   8.33    11   9.26    11   7.81     8   8.47
14   9.96    14   8.10    14   8.84     8   7.04
 6   7.24     6   6.13     6   6.08     8   5.25
 4   4.26     4   3.10     4   5.39    19  12.50
12  10.84    12   9.11    12   8.15     8   5.56
 7   4.82     7   7.26     7   6.42     8   7.91
 5   5.68     5   4.74     5   5.73     8   6.89

Summary Statistics
μX = 9.0    σX = 3.317
μY = 7.5    σY = 2.03

Linear Regression
Y = 3 + 0.5 X
R² = 0.67

Anscombe 1973

[Figure: scatter plots of Y against X for Set A, Set B, Set C, and Set D]
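The point of the quartet can be checked numerically. The sketch below recomputes the summary statistics and the least-squares fit for each set from the tabulated values. It assumes the sample (n - 1) standard deviation, which is what reproduces σX = 3.317; the helper names (`mean`, `sd`, `summarize`) are illustrative only.

```javascript
// Anscombe's quartet, transcribed from the slide's data table.
const X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5];
const quartet = {
  A: { x: X_ABC, y: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68] },
  B: { x: X_ABC, y: [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.11, 7.26, 4.74] },
  C: { x: X_ABC, y: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73] },
  D: { x: [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
       y: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89] },
};

const mean = (v) => v.reduce((s, a) => s + a, 0) / v.length;

// Sample standard deviation (n - 1 denominator); an assumption about the slide's convention.
function sd(v) {
  const m = mean(v);
  return Math.sqrt(v.reduce((s, a) => s + (a - m) ** 2, 0) / (v.length - 1));
}

// Ordinary least-squares fit y = intercept + slope * x, plus R^2.
function summarize({ x, y }) {
  const mx = mean(x);
  const my = mean(y);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < x.length; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  const slope = sxy / sxx;
  return {
    meanX: mx, sdX: sd(x), meanY: my, sdY: sd(y),
    slope, intercept: my - slope * mx,
    r2: (sxy * sxy) / (sxx * syy),
  };
}

const summaries = Object.fromEntries(
  Object.entries(quartet).map(([name, set]) => [name, summarize(set)])
);

// Print each set's statistics rounded to three decimals.
for (const [name, s] of Object.entries(summaries)) {
  console.log(name, Object.fromEntries(
    Object.entries(s).map(([k, v]) => [k, Number(v.toFixed(3))])
  ));
}
```

All four sets agree with the slide's figures to the precision shown, even though their scatter plots look nothing alike, which is the case for plotting before trusting summaries.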

Visual Data Analysis

1. Data diagnostics
2. Exploratory data analysis
3. Assessing analytic results

While it is often most helpful to “plot the data,” this is rarely enough. We need also to “plot the results of analysis” as a routine matter. There is often more analysis than there was data.

J. W. Tukey, The Future of Data Analysis, 1962.

Effects of Latency

Milliseconds Matter

< 100ms: perception of animation, causation
Wait times > 1s may interrupt flow of thought
To reduce (perceived) latency, drop details

Example: Latency in results
  - Latency (+300ms) reduces searches ~0.5%
  - Reduction persists after performance resumes

Goal: support analysis at interactive rates to enable fluid conversations with data.

Parallel processing essential for scalable, interactive analysis.
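One way to keep an analysis responsive is to report partial answers as they become available instead of waiting for a full scan. The sketch below is a minimal illustration of that idea, not a description of any system named in this talk: a generator yields a running mean after each chunk, and the consumer may stop at any point. The function name `incrementalMean`, the chunk size, and the data are all hypothetical.

```javascript
// Yield a running estimate of the mean after each chunk of values, so a
// caller can display partial results and stop consuming at any time.
function* incrementalMean(values, chunkSize) {
  let sum = 0;
  let seen = 0;
  for (let start = 0; start < values.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, values.length);
    for (let i = start; i < end; i++) sum += values[i];
    seen = end;
    yield { seen, total: values.length, estimate: sum / seen };
  }
}

// Hypothetical data: the integers 1..1000.
const rows = Array.from({ length: 1000 }, (_, i) => i + 1);

const updates = [];
for (const update of incrementalMean(rows, 100)) {
  updates.push(update);
  // A UI would redraw here; breaking out of the loop interrupts the scan.
}

const firstUpdate = updates[0];
const lastUpdate = updates[updates.length - 1];
console.log(firstUpdate, lastUpdate);
```

Because this sketch scans the values in stored order, early estimates can be badly biased (here the first chunk holds only the smallest values). A practical version would visit rows in random order or attach an error bound to each estimate.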

Accelerate query processing and analytics
Estimate time to completion
Incremental results, streaming/dynamic data
Interruptible execution and steering
Auto-tuning layout, indexing, query sub-plans
Predicting data of interest; what is “near”?

Use domain-specific languages to simplify specification; enable retargeting and optimization.

Protovis: A Declarative Language for Visualization
http://protovis.org/

[Figure: mark types Area, Bar, Dot, Image, Line, Label, Rule, Wedge]

Protovis

Create customized visualizations using a declarative specification language.

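    // Bar chart: one bar per datum, 20px wide, placed 25px apart.
    var vis = new pv.Panel();
    vis.add(pv.Bar)
        .data([1, 1.2, 1.7, 1.5, .7])
        .bottom(10).width(20)
        // Properties given as functions are evaluated once per datum.
        .height(function(d) { return d * 70; })
        .left(function() { return this.index * 25 + 20; });
    vis.render();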
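To make the declarative style concrete, the following sketch evaluates the same property functions by hand and prints the resulting bar geometry. It is a hand-rolled illustration that does not use Protovis and is not how Protovis is implemented; `spec` and `resolveMarks` are hypothetical names introduced only for this example.

```javascript
// Same data and property definitions as the bar example; Protovis is not required.
const data = [1, 1.2, 1.7, 1.5, 0.7];
const spec = {
  bottom: () => 10,
  width: () => 20,
  height: (d) => d * 70,
  left: (d, index) => index * 25 + 20,
};

// Evaluate every property once per datum (hypothetical helper, not a Protovis API).
function resolveMarks(spec, data) {
  return data.map((d, index) => {
    const mark = {};
    for (const [prop, fn] of Object.entries(spec)) mark[prop] = fn(d, index);
    return mark;
  });
}

const bars = resolveMarks(spec, data);
console.log(bars);
```

The specification says what each bar's properties are as functions of the data; a separate step decides when and how to evaluate them, which is what leaves room for retargeting and optimization.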

Protovis (protovis.org)

    // lon, lat, tmp, color, and paneIndex are not defined on the slide
    // and are assumed to exist elsewhere.
    var army = pv.nest(napoleon.army, "dir", "group");
    var vis = new pv.Panel();

    // Troop paths: one line per army group.
    var lines = vis.add(pv.Panel).data(army);
    lines.add(pv.Line)
        .data(function() { return army[this.idx]; })
        .left(lon).top(lat).size(function(d) { return d.size / 8000; })
        .strokeStyle(function() { return color[army[paneIndex][0].dir]; });

    // City labels.
    vis.add(pv.Label).data(napoleon.cities)
        .left(lon).top(lat)
        .text(function(d) { return d.city; }).font("italic 10px Georgia")
        .textAlign("center").textBaseline("middle");

    // Temperature grid lines with degree labels.
    vis.add(pv.Rule).data([0,-10,-20,-30])
        .top(function(d) { return 300 - 2*d - 0.5; }).left(200).right(150)
        .lineWidth(1).strokeStyle("#ccc")
      .anchor("right").add(pv.Label)
        .font("italic 10px Georgia")
        .text(function(d) { return d + "°"; }).textBaseline("center");

    // Temperature line with temperature and date labels.
    vis.add(pv.Line).data(napoleon.temp)
        .left(lon).top(tmp)
        .strokeStyle("#0")
      .add(pv.Label)
        .top(function(d) { return 5 + tmp(d); })
        .text(function(d) { return d.temp + "° " + d.date.substr(0,6); })
        .textBaseline("top").font("italic 10px Georgia");

Exploiting DSLs & Parallelism

DSL has led to faster designs, less code
Job Voyager: 5x less code, 10x less dev time
Retargeting across platforms / devices
Parallel execution. Use language “lifting” to (a) identify dependencies (b) streamline code generation
20x scalability boost over comparable systems

Data Wrangler

Declarative data transformation language
  - Formatting, reshaping, cleaning & verification
  - Type transformation & lookup tables

Visual interface for data diagnostics
  - Interactive data transformation & cleaning
  - Type inference; transform by example

Example: Parsing a log file

User highlights text in first table cell. We give user candidate transformations
Or user can highlight more cells, and we prune the space of suggested transformations
User continues parsing…

Output Wrangler Script

    split(0).on('\n').result(ROW).max_splits(NO_MAX)
    split(-1).on(' - -')
    split(-1).on('\\[')
    split(-1).on('\\] "')
    replace(-1).on('"')
    split(-1).on(' ').max_splits(NO_MAX)
    drop([8,7,5,4,3,1,0])
    columnName().to(['ip','date','method', 'url', 'protocol','resp','resp2'])
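The sequence of steps in the script (split into rows, peel off fields at fixed delimiters, strip a quote, split on spaces, name the columns) can be approximated in plain JavaScript. The sketch below is only an analogy and does not use or reproduce the Wrangler language. The two log lines are hypothetical, written in the common web-server access-log layout that the delimiters suggest, and `splitOnce`, `parseLogLine`, and `parseLog` are illustrative names.

```javascript
// Column names taken from the columnName() step of the script.
const COLUMNS = ["ip", "date", "method", "url", "protocol", "resp", "resp2"];

// Split text at the first occurrence of sep; returns [before, after].
function splitOnce(text, sep) {
  const i = text.indexOf(sep);
  return i < 0 ? [text, ""] : [text.slice(0, i), text.slice(i + sep.length)];
}

function parseLogLine(line) {
  const [ip, afterIp] = splitOnce(line, " - -");         // cf. split on ' - -'
  const [, afterBracket] = splitOnce(afterIp, "[");      // cf. split on '['
  const [date, request] = splitOnce(afterBracket, '] "'); // cf. split on '] "'
  const fields = request.replace('"', "").split(" ");    // drop closing quote, split on ' '
  const values = [ip, date, ...fields];
  const row = {};
  COLUMNS.forEach((name, i) => { row[name] = values[i]; });
  return row;
}

// Split on newlines into rows, skipping empty lines.
function parseLog(text) {
  return text.split("\n").filter((l) => l.length > 0).map(parseLogLine);
}

// Hypothetical input; not taken from the talk.
const sampleLog = [
  '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
  '192.168.0.7 - - [10/Oct/2000:13:56:01 -0700] "POST /login HTTP/1.1" 302 512',
].join("\n");

const parsedRows = parseLog(sampleLog);
console.log(parsedRows);
```

Unlike the hand-written version, the Wrangler script is produced by demonstration: the user highlights text and picks among suggested transformations, and the script is the recorded result.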

Next Steps

Iterative design of the Wrangler UI
Build corpus of example data and scripts, explore type inference and script reuse
Leverage parallelism on server to support interactive cleaning of large data sets
Scala implementation of Wrangler language

Recap

Creation of and access to data continue to rise
Agile, deep analysis practices around Big Data
Interactive visualization within the data life-cycle
Parallel computing can be a key enabler of interactive, iterative data analysis

Questions?

Acknowledgements
Mike Bostock, Sean Kandel, Joe Hellerstein, Brian Dolan, Greenplum, Pat Hanrahan

Jeffrey Heer hci.stanford.edu/jheer