Why Visualization? The Critical Role of Visualization Methods for Data Analysis
The Critical Role of Visualization Methods for Data Analysis
First argued effectively in the 1960s by John Tukey, Frank Anscombe, Cuthbert Daniel, and others

The Critical Role of Visualization Methods for Data Analysis: Part I
Visual displays of data are essential for
• understanding the patterns in a dataset
• determining which mathematical learning methods and models are appropriate for the data
Using mathematical methods and models without understanding the patterns risks
• missing important information in the data
• incorrect conclusions

The Critical Role of Visualization Methods for Data Analysis: Part II
But one cannot get far with just visualization of the raw data
Need the mathematical learning methods at the outset as well
Fit mathematical structures to aid in visualizing the patterns in the data

The Critical Role of Visualization Methods for Data Analysis
Mathematical methods and visualization methods are symbiotic. Both should be applied from the moment the data arrive.

About the Course: Technical
It is really a course about how to analyze data
Visualization methods
Some attention to enhancing our ability to perceive important effects in using these methods
We will use R in class and for homework
• the most highly honored and probably the most used language for data analysis
• it’s free
We will use the trellis display
• very powerful visualization system for data analysis
• implemented in R: lattice graphics package
There will be data everywhere, in class and in homework
We will make a foray into big data methods, but without big data
Yes, you can learn big data ideas without actually using big data
• use cluster with 2 nodes
• package datadr

About the Course: Non-Technical
No exams
Expectations: come to class and do homework
Homework is iterative
• submit to Xiaosu Tong
• get feedback
• resubmit if needed
• iterate until convergence (Xiaosu says “OK, You’re done.”)
• then your homework grade is A+
Course web page: ml.stat.purdue.edu/stat695t
http://ml.stat.purdue.edu/stat695t/writings/sarkar.lattice.book/

Barley Data
Agricultural experiment from the 1930s
Factors
• 6 sites
• 10 varieties
• 2 years
120 barley yields
Analysis
• reported by experimenters, 1934
• published by Fisher, 1935
• analyzed by others including Anscombe and Daniel

[Figure: trellis display of the barley data. Dot plots of yield against variety, one panel per site and year. Rows, top to bottom: Waseca, Crookston, Morris, University Farm, Duluth, Grand Rapids. Columns: 1932, 1931. Varieties, top to bottom: Trebi, Wisconsin No. 38, No. 457, Glabron, Peatland, Velvet, No. 475, Manchuria, No. 462, Svansota. Horizontal axis: Barley Yield (bushels/acre), 20 to 60.]
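
The counts on the Barley Data slide can be checked directly in R. This is a minimal sketch, assuming the `barley` data frame that ships with the lattice package is the same dataset as the one on the slides. The variable names `n_obs`, `n_lvls`, and `cell_counts` are illustrative only.

```r
library(lattice)  # the barley data frame ships with lattice

# Structure: one numeric column (yield) and three factors
str(barley)

# Number of yields and number of levels of each factor
n_obs  <- nrow(barley)
n_lvls <- sapply(barley[c("site", "variety", "year")], nlevels)
print(n_obs)
print(n_lvls)

# Each site-by-year cell should hold one yield per variety
cell_counts <- with(barley, table(site, year))
print(cell_counts)
```

If the assumption holds, the output shows 6 sites, 10 varieties, 2 years, and 120 yields, with 10 yields in each site-by-year cell.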

Main Effects Ordering
Ordering of levels of an unordered categorical variable
Order by category medians of numerical variable (or by any other location estimate)

Barley Data
Anomaly: patterns across sites not monotone for each year separately
Morris appears out of place

Visual Perception
Main Effects Ordering
Establishes regularities in patterns that make comparison of different panels more effective
Judgement on a Common Horizontal or Vertical Scale
Comparison of variety-yield patterns across sites for each year: enhanced by common horizontal scale

What Happens If We Order Arbitrarily
Example: alphabetically
Might as well order randomly
Cannot as readily detect the Morris anomaly
Greatly reduced ability to judge the effect of variety
Cannot as readily detect the outlier for No. 462 at Waseca in 1931

[Figure: the same display with levels ordered alphabetically. Rows, top to bottom: Waseca, University Farm, Morris, Grand Rapids, Duluth, Crookston. Columns: 1931, 1932. Varieties, top to bottom: Wisconsin No. 38, Velvet, Trebi, Svansota, Peatland, No. 475, No. 462, No. 457, Manchuria, Glabron. Horizontal axis: Barley Yield (bushels/acre), 20 to 60.]

[Figure: the display with main effects ordering, repeated for comparison. Rows, top to bottom: Waseca, Crookston, Morris, University Farm, Duluth, Grand Rapids. Columns: 1932, 1931. Varieties, top to bottom: Trebi, Wisconsin No. 38, No. 457, Glabron, Peatland, Velvet, No. 475, Manchuria, No. 462, Svansota. Horizontal axis: Barley Yield (bushels/acre), 20 to 60.]
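
Main effects ordering and the alphabetical counterexample can be sketched in R. This is one possible approach, assuming lattice's built-in `barley` data and using base R's `reorder()` to sort factor levels by median yield. The column names `variety_alpha`, `site_alpha`, `variety_med`, and `site_med` are made up for the illustration.

```r
library(lattice)

b <- barley

# Arbitrary ordering: factor() on a character vector sorts levels alphabetically
b$variety_alpha <- factor(as.character(b$variety))
b$site_alpha    <- factor(as.character(b$site))

# Main effects ordering: sort the levels by the median yield of each category
b$variety_med <- reorder(b$variety_alpha, b$yield, FUN = median)
b$site_med    <- reorder(b$site_alpha, b$yield, FUN = median)

# Medians listed in level order, so they come out non-decreasing
variety_medians <- tapply(b$yield, b$variety_med, median)
site_medians    <- tapply(b$yield, b$site_med, median)
print(variety_medians)
print(site_medians)

# The same dot plot under the two orderings
p_alpha <- dotplot(variety_alpha ~ yield | year * site_alpha, data = b,
                   layout = c(2, 6),
                   xlab = "Barley Yield (bushels/acre)")
p_med <- dotplot(variety_med ~ yield | year * site_med, data = b,
                 layout = c(2, 6),
                 xlab = "Barley Yield (bushels/acre)")
print(p_alpha)
print(p_med)
```

Any other location estimate can be swapped in through the `FUN` argument, for example `FUN = mean`.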

Barley Data: Trellis Display
Yield vs. variety given year and site
Panel variables: variety, yield
Conditioning variable 1
• year (2 levels)
• categorical variable
• levels ordered by medians
Conditioning variable 2
• site (6 levels)
• categorical variable
• levels ordered by medians
Layout: 2 columns, 6 rows, 1 page
Panel method: dot plot

Subsets
12 subsets: values of variety and yield for each of the 12 combinations of year and site
Conditioning variables and their levels are ordered
Year: 1932, 1931
Site: Grand Rapids, Duluth, University Farm, Morris, Crookston, Waseca
Subset Order
1. 1932 Grand Rapids
2. 1931 Grand Rapids
3. 1932 Duluth
...
12. 1931 Waseca

Panel Order
11  12
 9  10
 7   8
 5   6
 3   4
 1   2

Panel and Subset Order
1932 Waseca             1931 Waseca
1932 Crookston          1931 Crookston
1932 Morris             1931 Morris
1932 University Farm    1931 University Farm
1932 Duluth             1931 Duluth
1932 Grand Rapids       1931 Grand Rapids

[Figure: the barley display rearranged into 6 columns and 2 rows. Columns, left to right: Grand Rapids, Duluth, University Farm, Morris, Crookston, Waseca. Rows: 1931 (top), 1932 (bottom). Varieties, top to bottom: Trebi, Wisconsin No. 38, No. 457, Glabron, Peatland, Velvet, No. 475, Manchuria, No. 462, Svansota. Horizontal axis: Barley Yield (bushels/acre), 20 to 60 in each column.]
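
The display specification, subset order, and panel order can be sketched with lattice. This assumes the built-in `barley` data, whose factor levels appear to be stored in the order the slides use, and lattice's default of filling panels left to right starting from the bottom row. The names `subsets` and `panel_grid` are illustrative.

```r
library(lattice)

# Yield vs. variety given year and site: 2 columns, 6 rows, 1 page
p <- dotplot(variety ~ yield | year * site, data = barley,
             layout = c(2, 6),
             xlab = "Barley Yield (bushels/acre)")
print(p)

# Subset order: the first conditioning variable (year) varies fastest
subsets <- expand.grid(year = levels(barley$year),
                       site = levels(barley$site))
subsets$subset <- seq_len(nrow(subsets))
print(subsets)

# Panel order: subsets fill each row left to right, starting at the bottom row.
# Build the grid bottom-up, then flip the rows so it prints as it looks on the page.
panel_grid <- matrix(subsets$subset, nrow = 6, ncol = 2, byrow = TRUE)
panel_grid <- panel_grid[6:1, ]
print(panel_grid)

# Swapping the conditioning variables and the layout gives 6 columns and 2 rows
p_wide <- dotplot(variety ~ yield | site * year, data = barley,
                  layout = c(6, 2),
                  xlab = "Barley Yield (bushels/acre)")
print(p_wide)
```

Passing `as.table = TRUE` to `dotplot()` would instead fill panels from the top row down, which is often easier to read but does not match the panel order on the slides.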

Barley Data
Comparison of variety-yield patterns across years for each site separately is enhanced by common horizontal scale
There is a year reversal at Morris
Cannot compare variety-yield patterns across sites because of the loss of the common scale
Therefore: with trellis display you often see a lot more by rearranging panels through altering the conditioning and the layout

Barley Data: Trellis Display
Yield vs. variety given year and site
Panel variables: variety, yield
Conditioning variable 1
• site (6 levels)
• categorical variable
• levels ordered by medians
Conditioning variable 2
• year (2 levels)
• categorical variable
• levels ordered by medians
Layout: 6 columns, 2 rows, 1 page
Panel method: dot plot

Juxtaposition vs. Superposition
We have been juxtaposing data for different sites and for different years
We can superpose
But superposition is limited to just a few categories

[Figure: barley display with the two years superposed in each panel. Key: 1932, 1931. One panel per site, top to bottom: Waseca, Crookston, Morris, University Farm, Duluth, Grand Rapids. Varieties, top to bottom: Trebi, Wisconsin No. 38, No. 457, Glabron, Peatland, Velvet, No. 475, Manchuria, No. 462, Svansota. Horizontal axis: Barley Yield (bushels/acre), 20 to 60.]

Barley Data: Trellis Display
Yield and year vs. variety given site
Panel variables: yield, variety, year
Conditioning variable
• site
• categorical variable
• ordered by medians
Layout: (1, 6, 1), that is, 1 column, 6 rows, 1 page
Panel method: dot plot with year encoded by symbol

Barley Data
A key observation emerges
Absolute differences between the two years at Morris have an overall level similar to those at other sites
Either nature produced an amazing coincidence or there is a mistake

Pages
Trellis displays can go across pages
The panels can be thought of as being in 3-space with coordinates: column (x), row (y), and page (z)

[Figure, page 1 (1932): yield against site, one panel per variety, 2 columns by 5 rows. Panel rows, top to bottom: Wisconsin No. 38 and Trebi; Glabron and No. 457; Velvet and Peatland; Manchuria and No. 475; Svansota and No. 462. Sites in each panel, top to bottom: Waseca, Crookston, Morris, University Farm, Duluth, Grand Rapids. Horizontal axis: Barley Yield (bushels/acre), 20 to 60.]

[Figure, page 2 (1931): the same layout for the 1931 yields.]
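
Superposition, the year differences behind the Morris observation, and a multi-page display can be sketched as follows. This assumes lattice's built-in `barley` data. The names `p_super`, `y`, `d`, `site_med_diff`, and `p_pages` are illustrative, and `layout = c(2, 5)` for the paged display is inferred from the two page figures, not stated on the slides.

```r
library(lattice)

# Superposition: one panel per site, year encoded by plotting symbol
p_super <- dotplot(variety ~ yield | site, data = barley,
                   groups = year,
                   layout = c(1, 6),
                   auto.key = list(columns = 2),
                   xlab = "Barley Yield (bushels/acre)")
print(p_super)

# Year differences: arrange the yields as a variety x site x year array.
# Each cell holds a single yield, so mean() just returns that value.
y <- with(barley, tapply(yield, list(variety, site, year), mean))
d <- y[, , "1931"] - y[, , "1932"]   # 10 varieties by 6 sites

# Median difference per site. The year reversal at Morris should show up
# as a sign opposite to that of the other sites.
site_med_diff <- apply(d, 2, median)
print(round(site_med_diff, 1))

# Pages: 20 panels (10 varieties x 2 years) at 2 columns x 5 rows per page.
# Variety varies fastest, so each year fills one page.
p_pages <- dotplot(site ~ yield | variety * year, data = barley,
                   layout = c(2, 5),
                   xlab = "Barley Yield (bushels/acre)")
print(p_pages)
```

Superposing on `year` works here because it has only two levels. With many categories the symbols become hard to tell apart, which is the limitation noted on the Juxtaposition vs. Superposition slide.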