
Why plot your data?

Graphs help us to see patterns, trends, anomalies and other features not otherwise easily apparent from numerical summaries.

Graphical Methods for Data & Multivariate

Michael Friendly
Psychology 6140

Well, at least I noticed!
Source: http://xkcd.com/523/

How graphs can change your life (n=1)

Personal

A statistician contracts diabetes, and uses graphs to monitor his blood sugar.
• Graphs shown: 15 yr. blood sugar, pre-diagnosis; daily average, after diagnosis
→ Visual feedback on diet & exercise reinforces behavioral change
→ Residual plots show average hourly variation
• Residuals (after removing daily average and hourly variation): unexplained events, possibly important

Different graphs for different purposes

Graphs (& tables) as communication:
• What audience?
• What message?
• Analysis graphs: design to see patterns, trends, aid the process of data description, interpretation
• Presentation graphs: design to attract attention, make a point, illustrate a conclusion

Ref: Wainer & Velleman, Looking at blood sugar, Chance, 2008, 21(4), 56-61

Comparing groups: Analysis vs. Presentation graphs

Six different graphs for comparing groups in a one-way design
• which group means differ?
• equal variability?
• distribution shape?
• what do error bars mean?
• unusual observations?

Never use dynamite plots
Always explain what error bars mean
Consider tradeoff between summarization & exposure

Different graphs for different purposes

Ah ha! Ooh!

Presentation graphs: single image for a large audience
Exploratory graphs: many images for a narrow audience (you!)

Presentation graph: Nightingale’s coxcomb

Deaths in the Crimean war from battle vs. other causes (disease, wounds)
She used this to argue for better field hospitals (MASH units)

The best presentation graphs pass the Interocular Traumatic Test: The message hits you between the eyes!

Presentation: Turning tables into graphs

Graphs of model coefficients are often clearer than tables

Source: tables2graphs.com

Rhetorical graph: Common Sense Revolution

Effective data display

• Make the data stand out
  - Fill the data region (axes, ranges)
  - Use visually distinct symbols (shape, color) for different groups
  - Avoid junk, heavy grid lines that detract from the data
• Facilitate comparison
  - Emphasize the important comparisons visually
  - Side-by-side easier than in separate panels
  - “data” vs. a “standard” easier against a horizontal line
  - Show uncertainty where possible

Make comparisons direct
• Points not bars
• Connect similar by lines
• Same panel rather than different panels

Published in: Ian Gordon; Sue Finch; Journal of Computational and Graphical Statistics 2015, 24, 1210-1229. DOI: 10.1080/10618600.2014.989324
Copyright © 2015 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America

Showing uncertainty

• Standard plots of observed vs. predicted lack a basis for assessment of uncertainty
• Confidence envelopes indicate extent of deviation
• Identify “noteworthy” observations to track them down

Example: Normal QQ plots used to assess normality of data

Analysis graph: Screening

Side-by-side boxplots of variables in the baseball data show the shapes of distributions, an aid to transformation
• Each is standardized to allow comparison.
• Plot is produced by datachk macro.

See: http://datavis.ca/sasmac/datachk.html

Exploratory graphs: Transformations

Data often needs to be transformed to meet analysis assumptions:
• Symmetry (~ Normality)
• Linear relations
• Constant variance

For symmetry, a symbox plot shows a variable transformed to various powers.

SAS: symbox macro
R: car package: symbox()

Diagnostic graphs: Transformations

Diagnostic plots can be used to suggest corrective action, often by a power transformation, $y \to y^p$

Symmetry transformation plot:
• Constructed so symmetric data plots as horizontal line
• Slope (b) of data line → power $p = 1 - b$ → $y^p = y^{1-b}$

Other diagnostic plots use the same idea: slope (b) → $y^{1-b}$
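
The effect of moving along the ladder of powers can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated (not from the lecture), and `power_transform()` and `skewness()` are hypothetical helper names, not functions from the car package or the SAS symbox macro.

```r
# Ladder-of-powers transformation: y^p for p != 0, log(y) for p == 0.
# Negative powers are negated so that the order of the data is preserved.
power_transform <- function(y, p) {
  stopifnot(all(y > 0))
  if (p == 0) log(y) else if (p > 0) y^p else -(y^p)
}

# Moment-based skewness, used only as a rough numerical summary of asymmetry.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Illustrative positively skewed data (simulated, not from the lecture).
set.seed(42)
y <- rlnorm(500, meanlog = 3, sdlog = 0.8)

powers <- c(-1, -0.5, 0, 0.5, 1)
skew <- sapply(powers, function(p) skewness(power_transform(y, p)))
names(skew) <- powers
print(round(skew, 2))

# Side-by-side boxplots of the standardized transformed values,
# similar in spirit to a symbox display.
transformed <- lapply(powers, function(p) as.numeric(scale(power_transform(y, p))))
names(transformed) <- powers
boxplot(transformed, xlab = "Power p", ylab = "Standardized transformed y")
```

The power whose boxplot looks most symmetric (here, skewness closest to zero) is the candidate transformation.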

Model diagnosis: regression quartet

Statistical software should make it easy to get informative diagnostic plots

In R, plotting a lm model object → the “regression quartet” of plots

> model <- lm(prestige ~ income + education)
> plot(model)

(SAS has similar, using ODS graphics)

[Figure: regression quartet with panels Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance contours); labeled points: minister, contractor, reporter, conductor]

Model diagnosis: Influence in regression

Multiple regression model: prestige ~ income + education

Influence plots can show:
• model residual
• leverage (potential impact)
• influence ~ residual x leverage (Cook D statistic)
• contour of influence
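
A minimal, self-contained sketch of the regression quartet and the ingredients of an influence plot follows. It assumes the built-in `mtcars` data as a stand-in for the occupational prestige data, so the model formula is illustrative only.

```r
# Assumption: mtcars stands in for the occupational prestige data.
model <- lm(mpg ~ wt + hp, data = mtcars)

# The "regression quartet": four diagnostic plots on one page.
op <- par(mfrow = c(2, 2))
plot(model)
par(op)

# Ingredients of an influence plot.
diagnostics <- data.frame(
  rstudent = rstudent(model),        # studentized residuals
  leverage = hatvalues(model),       # potential impact
  cooks_d  = cooks.distance(model)   # combined influence
)

# A simple influence plot: residual vs. leverage, symbol area ~ Cook's D.
plot(diagnostics$leverage, diagnostics$rstudent,
     cex = 10 * sqrt(diagnostics$cooks_d),
     xlab = "Leverage (hat value)", ylab = "Studentized residual")
abline(h = c(-2, 0, 2), lty = 2)

# The most influential observations by Cook's distance.
print(head(diagnostics[order(-diagnostics$cooks_d), ], 3))
```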

Scatterplots: A basic workhorse for quantitative data

• Show the relation between two quantitative (Q) variables (ignoring all others!)
• More useful when enhanced to show visual summaries
• Vary point color/shape to show strata/groups
• Combine in multi-panel displays to show more
  - Scatterplot matrix: all pairs
  - Conditional relations: Y vs. X stratified by Group

Scatterplots: Scales matter

Computer plots are usually generated with a given aspect ratio, to conform to the page or screen.

A better idea is to scale the plot so that slopes of lines or curves average ~ 45 degrees.

In the rescaled version, we can see that, within each cycle, sunspots tend to increase more quickly than they decline.
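
One common way to bank a curve to about 45 degrees is the median-absolute-slope rule, sketched below in base R. The function name `bank_to_45()` is hypothetical, and the built-in `sunspot.year` series is assumed to be comparable to the sunspot data on the slide.

```r
# Median-absolute-slope banking: choose the aspect ratio (height / width)
# so that the median absolute slope of the line segments is 1 (45 degrees).
bank_to_45 <- function(x, y) {
  dx <- diff(x) / diff(range(x))   # segment widths as a fraction of the x-range
  dy <- diff(y) / diff(range(y))   # segment heights as a fraction of the y-range
  keep <- dx != 0
  1 / median(abs(dy[keep] / dx[keep]))
}

year  <- as.numeric(time(sunspot.year))
spots <- as.numeric(sunspot.year)

aspect <- bank_to_45(year, spots)
print(aspect)   # height / width of the banked plot

# Number of years in which the series rises vs. falls.
change <- diff(spots)
print(c(rising = sum(change > 0), falling = sum(change < 0)))
```

The resulting ratio can be used to size the plotting device; in the lattice package, `aspect = "xy"` applies a banking rule of this kind automatically.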

Scatterplots: Annotations enhance perception

Data from the US draft lottery, 1970
• Birth dates were drawn at random to assign a “draft priority value” (1=bad)
• Can you see any pattern or trend?

Scatterplots: Annotations enhance perception

Drawing a smooth curve shows a systematic decrease toward the end of the year.
• The smooth curve is fit by loess, a form of non-parametric regression.

Visual explanation:

Me (May 7): → priority

Scatterplots: Data ellipses

Galton’s (1886) semi-graphic table, showing relation of mid-parent’s height to children’s height.

• Contours of equal frequency formed ellipses
• Regression lines of Y on X and X on Y are the loci of vertical and horizontal tangents
• Major/minor axes are the principal components

Scatterplots: Data ellipses

Galton’s data on child & mid-parent heights, shown as a sunflower plot: each sunflower symbol shows the number of observations in the (x, y) cell.

As shown: 2D density estimate of bivariate surface

Scatterplots: Data ellipses

Any scatterplot can be summarized by data ellipses (assuming normality). These show: means, standard deviations, and allow correlations & regression lines to be visually estimated.

Data ellipse: $D^2(\mathbf{y}) \le \chi^2_p(1 - \alpha)$

Galton data, 40%, 68% & 95% data ellipses, with regression lines E(y|x) and E(x|y). Sizes are:
• $\chi^2_2(0.40) = 1.0$
• $\chi^2_2(0.68) = 2.28$
• $\chi^2_2(0.95) = 6.0$

Visualizing multivariate data

Showing relations among 3 or more variables:
• Scatter plot matrices (enhance with visual summaries, thin for many variables)
• Conditional plots: Y ~ X | (Z, Group)
• Seeing multivariate profiles, clusters:
  - Star plots, face plots, parallel coordinates
• Biplots: project data into low-D view
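
The data ellipse sizes quoted on the data ellipse slide are chi-square quantiles with 2 degrees of freedom, and the ellipse itself follows from the means and covariance matrix. A sketch in base R, assuming `mtcars` as stand-in data (Galton's data are not part of base R) and a hypothetical helper `data_ellipse()`:

```r
# Points on the data ellipse of a given coverage level for bivariate data:
# all points p with (p - mean)' S^{-1} (p - mean) = qchisq(level, df = 2).
data_ellipse <- function(x, y, level = 0.68, n = 100) {
  center <- c(mean(x), mean(y))
  S <- cov(cbind(x, y))
  radius <- sqrt(qchisq(level, df = 2))
  theta <- seq(0, 2 * pi, length.out = n)
  circle <- rbind(cos(theta), sin(theta))
  # t(chol(S)) maps the unit circle onto the ellipse defined by S.
  pts <- t(center + radius * (t(chol(S)) %*% circle))
  colnames(pts) <- c("x", "y")
  pts
}

# Squared radii for 40%, 68% and 95% coverage.
coverage <- c(0.40, 0.68, 0.95)
print(round(qchisq(coverage, df = 2), 2))   # 1.02 2.28 5.99

# Assumption: mtcars (weight vs. mpg) stands in for Galton's height data.
x <- mtcars$wt
y <- mtcars$mpg
plot(x, y, xlab = "wt", ylab = "mpg",
     xlim = range(x) + c(-1, 1), ylim = range(y) + c(-8, 8))
for (lev in coverage) lines(data_ellipse(x, y, level = lev))
points(mean(x), mean(y), pch = 3, cex = 2)   # the centroid
```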

Scatterplot matrix

• Fitness data: Oxy ~ Age + Weight + Runtime + Rstpulse
• Box, rug plots show univar. distributions
• Quadratic regressions show linear/non-linear relations

Questions:
• What is the best predictor of Oxy?
• Which two predictors are most highly correlated?

Scatterplot matrix

• Occ. prestige: Prestige ~ %women + Educ + Income
• Each panel shows row var vs. col var
• Reg line shows linear relation (loess would be better)

Questions:
• How should Educ be modeled?
• How should Income be modeled?

Larger data sets: Visual thinning

Baseball data: log(Salary) ~ performance variables
• Too much data to show individual points
• Each scatterplot is summarized by a loess smoothed curve and a data ellipse

Questions:
• Which variables are most strongly related to logSal?
• Which relations are strongly nonlinear?
• Which predictors are too highly correlated?

Larger data sets: Corrgrams

A corrgram shows the pattern of correlations for many variables.

Variables are re-ordered to make the groupings most visually apparent.

This graphic assumes that all relations are linear, which is not necessarily always true.

Graph using SAS corrgram macro, http://datavis.ca/sasmac/corrgram.html
R: corrgram package
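
The re-ordering idea behind a corrgram can be sketched in base R. This is a simplified version and not the corrgram package's own algorithm: variables are ordered by the angle of their loadings on the first two principal components. It assumes the built-in `mtcars` data in place of the baseball data.

```r
# Assumption: mtcars stands in for the baseball data.
R <- cor(mtcars)

# Order variables by the angle of their loadings on the first two
# principal components of the correlation matrix.
ev <- eigen(R)$vectors[, 1:2]
angle <- atan2(ev[, 2], ev[, 1])
ord <- order(angle)
R_ordered <- R[ord, ord]

# A bare-bones shaded display: blue for positive, red for negative.
p <- ncol(R_ordered)
shades <- colorRampPalette(c("red", "white", "blue"))(21)
image(1:p, 1:p, t(R_ordered[p:1, ]), zlim = c(-1, 1), col = shades,
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:p, labels = colnames(R_ordered), las = 2)
axis(2, at = 1:p, labels = rev(rownames(R_ordered)), las = 2)
```

After re-ordering, variables with similar correlation patterns sit next to each other, so blocks of like-colored cells become visible.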

Corrgrams: Different renderings

The value of a correlation may be rendered in different ways, with different visual impact.
• Shading levels: help detect similar values
• Pie symbols: make it easier to compare for larger/smaller

Graph using R corrgram package

[Figure: corrgram of the Baseball data, PC2/PC1 order; variable labels as printed: Assis, Error, Atbat, Hits, Runs, Walk, RBI, Putou, Home, logSa, Years]

Conditional plots: Y ~ X | Z

Often want to explore how the relation between Y and X depends on / varies with some other variable(s) Z.
• Moderator variables
• Interactions

Emission of NOx from ethanol in relation to engine compression ratio and richness of air/ethanol mixture (EE)

Graph using R lattice package

Conditional plots: Y ~ X | Z

The same data is shown in a different format, with
• loess smooth curves
• curves banked to ~ 45°

The joint dependence on CR and EE is now much clearer

(These are examples of lattice plots, produced using R software.)

3D plots

Often not useful, unless done with great care.

This plot shows the loess smoothed predicted values of NOx in relation to EE and CR. (But, not shown.)

Color is used to show the predicted NOx, using a “heatmap” color scale. The interpretation is simple!
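
A conditional plot of the ethanol data with loess curves and 45-degree banking can be sketched with the lattice package, following the pattern of lattice's own documented example. The choice of compression ratio as X and the equivalence ratio as the conditioning variable is an assumption about the slide's figure.

```r
library(lattice)  # a recommended package shipped with standard R installs

# ethanol: NOx emissions, compression ratio (C), equivalence ratio (E).
# Split E into overlapping intervals ("shingles") to condition on it.
EE <- equal.count(ethanol$E, number = 9, overlap = 1/4)

p <- xyplot(NOx ~ C | EE, data = ethanol,
            type = c("p", "smooth"),   # points plus a loess curve per panel
            span = 1,                  # loess span, as in lattice's example
            aspect = "xy",             # bank the curves to about 45 degrees
            xlab = "Compression ratio (C)",
            ylab = "NOx")
print(p)
```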

3D plots

3D plots can be enormously useful with dynamic, interactive software

This plot shows a relation of occupational prestige to income & education.
• points are shown in perspective, connected to the fitted surface
• the fitted surface (linear, quadratic, smoothed) can be changed interactively
• the plot can be rotated dynamically to see other views

Seeing multivariate clusters: face plot

A faces plot assigns variables to facial features, to show configural patterns of many variables.

Pros: Easy to see similar patterns in large data sets.

Cons:
• Hard to connect features to variables for interpretation
• No good rules/ideas for assigning variables to features.

Graph using SAS faces macro, http://datavis.ca/sasmac/faces.html

Seeing multivariate clusters: face plot

Means, by make & origin

Biplots: variables and obs. in low-D View

• Based on PCA: data is shown in 2D (3D) view that accounts for greatest variance
• Observations: plotted as points
• Variables: vectors from origin (=mean)
• Angles between vectors ~ correlations
• Projection of point on vector ~ score

[Figure: schematic biplot with an observation point o1, variable vectors v1, v2, v3, and projections x11, x12]
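
The biplot construction described on the biplot slide can be sketched with base R's `prcomp()` and `biplot()`. The built-in `USArrests` data (4 variables, 50 states) is an assumption standing in for the crime data used in the lecture.

```r
# Assumption: USArrests stands in for the lecture's US crime rates data.
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of total variance shown in the 2D view.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(sum(var_explained[1:2]), 3))

# Observations as points (labels), variables as vectors from the origin.
biplot(pca, cex = 0.6)

# Angles between variable vectors approximate correlations: compare the
# cosine of the angle between two variable vectors in the 2D view with
# the actual correlation of the two variables.
V <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])
cos_angle <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
print(cos_angle(V["Murder", ], V["Assault", ]))
print(cor(USArrests$Murder, USArrests$Assault))
```

The approximation is only as good as the share of variance captured by the two dimensions, which is why that percentage is worth reporting with every biplot.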

Biplot: US crime rates

Dim1: ~ Overall crime rate
Dim2: Property vs. personal

Note: clusters of southern, New England, western states

This 2D biplot only shows 76.5% of total variance.
Still, it gives a useful summary of 9 variables and 50 observations.

Biplot: Baseball data

Baseball hitters’ data:
• Dim1: batting performance
• Dim2: fielding, -years

Players identified by position, with data ellipses for each
• IF: more assists, errors
• DH: older

This 2D biplot only shows 63.7% of total variance.

HE plots for MANOVA, MMReg

HE plots provide a way to visualize hypothesis tests in MANOVA and multivariate multiple regression, using data ellipses for fitted (H) and residual (E) co-variances.

Graphic ideas:
(a) Data ellipses summarize H & E (co)variation;
(b) Scale H ellipse so it projects outside E ellipse iff effect is significant (Roy test)

HE plot matrices

HE plots in a scatterplot matrix show effects for all pairs of responses.

For the iris data, the Species means are highly correlated on all variables except Sepal width.
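
The H and E matrices that an HE plot displays can be extracted from a base R MANOVA fit. The sketch below uses the built-in `iris` data and does not draw the HE plot itself (that requires an add-on package); it only shows the quantities involved.

```r
# One-way MANOVA for the iris data: four responses, Species as the factor.
Y <- as.matrix(iris[, 1:4])
fit <- manova(Y ~ Species, data = iris)

# Sums-of-squares-and-products matrices for hypothesis (H) and error (E).
ssp <- summary(fit, test = "Roy")$SS
H <- ssp$Species
E <- ssp$Residuals

# H + E reproduces the total SSP matrix of the responses.
total <- crossprod(scale(Y, center = TRUE, scale = FALSE))
print(all.equal(H + E, total, check.attributes = FALSE))

# Roy's test is based on the largest eigenvalue of H E^{-1}.
print(summary(fit, test = "Roy"))

# Correlations among the species means on the four responses.
means <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
mean_cor <- cor(means[, -1])
print(round(mean_cor, 2))
```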

HE plots: MMRA

Rohwer data: Cognitive ability and PA tests: n=37, Low SES group

(SAT, PPVT, Raven) ~ n + s + ns + na + ss

• Only one predictor, NA, is (barely) significant
• Yet, overall multivariate test: H0: B = 0 is highly so!

HE plots: MMRA & MANCOVA

Rohwer data: Low SES & Hi SES groups

(SAT, PPVT, Raven) ~ SES + n + s + ns + na + ss

Dynamic, interactive graphics

Interactive graphics & data analysis provide:
• Identifying points
• Model & display controls

SAS/Insight: mpg ~ weight, linear fit
SAS/Insight: mpg ~ weight, quadratic fit

These methods are much more highly developed in R
• googleVis
• shiny
• ggvis
• ggobi -> rggobi

Dynamic, interactive graphics

Dynamic graphics provide multiple, linked views of a data set

Selecting points, regions in one plot (“brushing”) selects the same observations in all other plots

Image source: (Paul Velleman)
See: http://www.activstats.com/products/mediadx/custom/lessonbook/nyheart.shtml

shiny: dynamic app showing downloads of R packages
https://gallery.shinyapps.io/087-crandash/

shiny: Latent distance analysis of a corpus of papers
https://gallery.shinyapps.io/LDAelife/

Uses MDS to find a 2D space from distances among terms

Multivariate frequency data: mosaic plots

A contingency table can be visualized by tiles whose area ~ cell frequency.

Shading: ~ Pearson residual, $d_{ij} = (O_{ij} - E_{ij}) / \sqrt{E_{ij}}$

Color:
• blue: $O_{ij} > E_{ij}$; red: $O_{ij} < E_{ij}$

Interp: + association (dark hair, dark eyes), (light hair, light eyes)
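
The Pearson residuals that drive the mosaic shading can be computed directly. The sketch below uses R's built-in `HairEyeColor` table collapsed over Sex, which is assumed to match the hair and eye color data on the slide.

```r
# Hair x Eye color, collapsed over Sex (built-in HairEyeColor data).
haireye <- margin.table(HairEyeColor, c(1, 2))

# Expected frequencies under independence, and Pearson residuals.
E <- outer(rowSums(haireye), colSums(haireye)) / sum(haireye)
d <- (haireye - E) / sqrt(E)
print(round(d, 2))

# chisq.test() reports the same residuals.
print(all.equal(unclass(d), unclass(chisq.test(haireye)$residuals),
                check.attributes = FALSE))

# Mosaic plot: tile area ~ cell frequency, shading ~ Pearson residual.
mosaicplot(haireye, shade = TRUE, main = "Hair and eye color")
```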

N-way tables

3+ way tables: split each tile ~ conditional proportions of the next variable.

Now, there are several different models that can be fit.
• Mutual independence: [H][E][S] → all vars unassociated
• Residuals: show associations not acct’d for by the model

N-way tables

All models fit to the same table have same-sized tiles ($O_{ijk}$), but different residuals.

This model of conditional independence, [HS][ES] → H, E independent given Sex.
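
The loglinear models named on these N-way table slides, including the joint independence model [HE][S], can be fit with base R's `loglin()`. This sketch assumes the built-in `HairEyeColor` table; `fit_model()` is a hypothetical helper name.

```r
# HairEyeColor is a built-in 3-way table: Hair (1) x Eye (2) x Sex (3).
fit_model <- function(margins) {
  fit <- loglin(HairEyeColor, margin = margins, print = FALSE)
  c(G2 = fit$lrt, df = fit$df)   # likelihood-ratio statistic and its df
}

models <- rbind(
  "[H][E][S] mutual independence"     = fit_model(list(1, 2, 3)),
  "[HS][ES] conditional independence" = fit_model(list(c(1, 3), c(2, 3))),
  "[HE][S] joint independence"        = fit_model(list(c(1, 2), 3))
)
print(round(models, 2))

# Mosaic displays: the tiles are the same for every model; only the
# residual shading depends on the model named in `margin`.
mosaicplot(HairEyeColor, shade = TRUE,
           main = "[H][E][S]: mutual independence")
mosaicplot(HairEyeColor, shade = TRUE, margin = list(1:2, 3),
           main = "[HE][S]: joint independence")
```

A smaller G2 relative to its degrees of freedom indicates a better fit, and the cells that remain shaded show where a model still fails.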

N-way tables

The model of joint independence, [HE][S], allows Hair, Eye color association, but → [HE] assoc is independent of Sex.

This model obviously fits much better, except for blue-eyed blonds, where females are more prevalent than the model allows.

Summary

• Goal of statistical analysis: summarization
• Goals of graphical analysis: exposure!
  - Often more useful when enhanced with visual summaries (fitted curve, data ellipse)
• Different graphs for different purposes:
  - Reconnaissance (overview)
  - Exploration (detecting patterns, trends)
  - Model diagnosis (assumptions)

Summary

• Multivariate data requires novel graphs to display increasing # of variables
  - Enhanced scatterplot matrices
  - Visual thinning: less is often more
  - Low-D views (biplots / MDS)
  - HE plots to visualize multivariate tests
  - Mosaic plots to visualize n-way frequency tables.