Visualizing Multivariate Uncertainty: Some Graphical Methods for Multivariable Spatial Data

Median Ranks by Region Crime_prop Infants Instruction Suicides Donations Crime_pers

Michael Friendly York University http://www.math.yorku.ca/SCS/friendly.html

National Academy of Sciences Washington, DC, Mar, 2005 Visualizing Multivariate Uncertainty outline

Outline

Multivariate uncertainty and “moral ” A. M. Guerry’s Moral Statistics of France Guerry’s data and analyses Multivariate analyses: Data-centric displays Bivariate plots and data ellipses Canonical discriminant plots HE plots for multivariate linear models Multivariate mapping: Map-centric displays Star maps Reduced-rank color maps

National Academies of Sciences, March 2005 1 c Michael Friendly Visualizing Multivariate Uncertainty guerry1

Multivariate Uncertainty and “Moral Statistics” ∼ 1800

It is a capital mistake to theorize before one has data. Sherlock Homes in Scandal in Bohemia What to do about crime? Liberal view: increase education, literacy Conservative view: build more prisons What to do about poverty? Liberal view: increase social assistance Conservative view: build more poor-houses But: Little actual data – all armchair theorizing No ways to understand or visualize relationships between variables • just invented (Playfair)— line graph, , • All 1D or 1.5D ()

National Academies of Sciences, March 2005 2 c Michael Friendly Visualizing Multivariate Uncertainty background

The rise of “moral statistics” and modern social science

Political arithmetic: William Petty (and others) 1654— first attempt at scientific survey (on Irish estates) 1687— idea that wealth and strength of a state depended on its subjects (number and characteristics) : Johann Peter Sussmilch¨ (1741)— importance of measuring and analyzing population distributions idea that ethical and state policies could encourage growth and wealth (increase birth rate, decrease death rate) • discourage alcohol, gambling, prostitution & priestly celibacy • encourage state support for medical care, distribution of land, lower taxes Statistik: Numbers of the state (1800–1820), Germany and France collect data on imports, exports, transportation, ... Guerry & Quetelet Quetelet: Concepts of “average man” and “social physics” Guerry: First real social data analysis (Guerry, 1833)

National Academies of Sciences, March 2005 3 c Michael Friendly Visualizing Multivariate Uncertainty data

Guerry’s data

Compte gen´ eral´ de l’administration de la justice criminelle en France The first national compilation of official justice data (1825) • detailed data on all charges and disposition • collected quarterly in all 86 departments. Other sources: Bureau de Longitudes (illegitimate births); Parent-Duchateletˆ (prostitutes in Paris); Compte du ministere du guerre (military desertions); ... Moral variables: Scaled so ’more’ is ’better’ Crime pers Population per Crime against persons Crime prop Population per Crime against property Donations Donations to the poor Infants Population per illegitimate birth Literacy Percent who can read & write Suicides Population per suicide Tried to define these to ensure comparability and representativeness • Crime: Use number of accused rather than convicted • Literacy: Reported levels of education unreliable; use data from military draft examinations (% of young men able to read and write) Other variables: Ranks by department: wealth, commerce, ...

National Academies of Sciences, March 2005 4 c Michael Friendly Visualizing Multivariate Uncertainty questions

Guerry’s Questions

Should crime and other moral variables be considered as structural, lawful characteristics of society, or simply as indicants of individual behavior? Statistical regularity as the key to social science (“social physics”) social equivalent of “law of large numbers”) Guerry showed that rates of crime had nearly invariant distributions over time (1825–1830) when classified by region, sex of accused, type of crime, etc. “We would be forced to recognze that the facts of moral order, like those of physical order, obey invariant laws...” (p.14)

Relations between crime and other moral variables Do crimes against persons and crimes against property show the same or different trends? How does crime relate to education and literacy? • Some “armchair” arguments had suggested increasing literacy to decrease crime: “The definitive result shows that 67 out of 100 prisoners can neither read nor write. What stronger proof could there be that ignorance is the mother of all vices” (A. Taillander, 1828) Does crime vary coherently over regions of France (C, N, S, E, W)?

National Academies of Sciences, March 2005 5 c Michael Friendly Visualizing Multivariate Uncertainty maps

Guerry’s maps

National Academies of Sciences, March 2005 6 c Michael Friendly Visualizing Multivariate Uncertainty maps

Guerry’s maps

Population per Crime against persons Population per Crime against property Per cent who can Read and Write

59 69 5 26 62 56

83 38 53 39 66 85 2 19 60 52 64 78 72 31 68 10 9 80 34 24 62 42 7 12 65 52 65 64 86 71 27 68 15 33 77 79 1322 12 3 1 14 7082 76 73 56 54 21 56 68 76 37 7 79 52 49 46 34 4 6 62 6 68 73 76 55 67 30 68 22 83 74 5 66 13 12 60 84 36 49 55 16 5 32 50 82 37 57 29 47 60 73 51 24 26 48 78 81 67 57 64 20 17 74 47 9 36 22 26 85 54 64 76 53 3 14 65 51 85 50 82 75 43 44 77 29 8 38 40 25 35 11 48 70 48 22 3 77 84 42 86 86 17 30 45 28 8 3 56 43 63 32 71 18 83 80 81 45 40 12 31 26 44 82 56 1 31 53 38 29 72 79 85 10 35 15 61 8 19 33 41 73 46 52 47 26 50 80 2 6 63 23 20 26 58 7 15 61 32 40 35 35 58 21 20 42 35 23 17 25 59 50 29 22 47 16 27 14 42 14 20 18 75 78 17 69 44 56 44 17 31 42 58 39 60 35 11 28 71 74 66 39 3 70 10 4 45 35

Rank 1 - 9 10 - 19 20 - 28 1 Rank 1 - 9 10 - 19 20 - 28 10 Rank 1 - 8 10 - 17 20 - 26 62 29 - 38 39 - 47 48 - 57 29 - 38 39 - 47 48 - 57 29 - 35 38 - 47 48 - 56 58 - 66 67 - 76 77 - 86 58 - 66 67 - 76 77 - 86 58 - 66 68 - 76 77 - 86 Population per Illegitimate birth Donations to the poor Population per Suicide

8 4 51 55 22 17 17 43 13 3 26 37 58 62 56 7 12 43 48 49 3 22 5 36 45 20 46 85 74 34 65 70 49 16 15 41 11 28 27 6 23 38 1 15 23 2935 80 1 2 29 61 39 64 48 53 4 83 67 78 75 25 50 29 84 39 25 31 40 20 8 50 84 46 60 30 65 33 49 27 5 54 45 34 80 10 7 81 20 42 54 46 10 42 12 36 73 11 63 19 50 9 19 24 21 51 35 61 37 44 51 41 18 59 21 21 64 69 54 66 68 32 47 59 15 55 86 56 60 79 71 26 75 39 36 74 82 83 63 38 35 43 69 82 81 45 56 24 70 79 55 2 78 3 52 27 71 53 66 78 76 77 52 22 26 42 80 77 52 14 82 32 66 59 63 33 79 41 33 12 58 81 86 9 85 40 68 44 17 13 57 31 69 37 25 58 73 47 4 68 9 47 64 67 38 18 11 62 83 18 44 1 30 32 70 76 75 40 16 57 67 28 77 23 76 31 62 6 28 14 2 7 8 72 48 5 14 16 34 19 6 73 71 13 57 54 10 85 74 65 24 84 30 72 61

Rank 1 - 9 10 - 19 20 - 28 72 Rank 1 - 9 10 - 19 20 - 28 86 Rank 1 - 9 10 - 19 20 - 28 60 29 - 38 39 - 47 48 - 57 29 - 38 39 - 47 48 - 57 29 - 38 39 - 47 48 - 57 58 - 66 67 - 76 77 - 86 58 - 66 67 - 76 77 - 86 58 - 66 67 - 76 77 - 86

National Academies of Sciences, March 2005 7 c Michael Friendly Visualizing Multivariate Uncertainty maps

Guerry’s analyses

st Relate variables by comparing maps and ranked lists (1 || coordinate plot) Conclusion: no clear relation between crime and literacy

Literacy Ranked lists Crimes against persons Similar analyses for other variables (suicide, illegitimate births, ...)

National Academies of Sciences, March 2005 8 c Michael Friendly Visualizing Multivariate Uncertainty multivar0

Graphical methods for multivariate data

Bivariate displays: Bivariate displays can be enhanced to show statistical relations more clearly and effectively Scatterplots with data (concentration) ellipses and smoothed (loess) curves Scatterplot matrices Corrgrams and visual thinning Reduced-rank displays: Multivariate visualization techniques can show the statistical data in simple ways, using dimension reduction techniques. Biplots - show variables and observations in space accounting for greatest Canonical discriminant plots - show variables and observations in space accounting for greatest between-group variation HE plots: Visualization for Multivariate Linear Models

National Academies of Sciences, March 2005 9 c Michael Friendly Visualizing Multivariate Uncertainty bivar

Bivariate plots: Points and visual summaries

40000 Creuse Ardennes

Indre Cote-d’Or

30000

Haute-MarneJura Meuse

20000 Pop. per Crime against persons Pop. per Crime against

10000 Doubs

Haut-Rhin Ariege

Corsica 0

10 20 30 40 50 60 70 80 Percent Read & Write Scatterplot with line

National Academies of Sciences, March 2005 10 c Michael Friendly Visualizing Multivariate Uncertainty dataellipse0

The Data Ellipse: Galton’s Discovery

Pearson (1920): “... one of the most noteworthy scientific discoveries arising from pure analysis of observations.”

National Academies of Sciences, March 2005 11 c Michael Friendly Visualizing Multivariate Uncertainty dataellipse0

The Data Ellipse: Details

Visual summary for bivariate marginal relations Shows: , standard deviations, correlation, regression line(s) 2 Defined: set of points whose squared Mahalanobis distance ≤ c , D2(y) ≡ (y − y¯)T S−1 (y − y¯) ≤ c2

S = sample variance-covariance matrix 2 2 Radius: when y is approx. bivariate normal, D (y) has a large-sample χ2 distribution with 2 degrees of freedom. 2 2 • c = χ2(0.40) ≈ 1: 1 std. dev univariate ellipse– 1D shadows: y¯ ± 1s 2 2 • c = χ2(0.68)=2.28: 1 std. dev bivariate ellipse 2 • Small samples: c ≈ 2F2,n−2(1 − α) Construction: Transform the unit circle, U =(sinθ, cos θ),

1/2 Ec = y¯ + cS U

1/2 S = any “square root” of S (e.g., Cholesky) Robust version: Use robust covariance estimate (MCD, MVE) Nonparametric version: Use kernel

National Academies of Sciences, March 2005 12 c Michael Friendly Visualizing Multivariate Uncertainty bivar

Bivariate plots: Data ellipse and smoothing

40000 Creuse Ardennes

Indre Cote-d’Or

30000

Haute-MarneJura Meuse

20000 Pop. per Crime against persons Pop. per Crime against

10000 Doubs

Haut-Rhin Ariege

Corsica 0

10 20 30 40 50 60 70 80 Percent Read & Write Scatterplot with 68% data ellipse and smoothed (loess) curve

National Academies of Sciences, March 2005 13 c Michael Friendly Visualizing Multivariate Uncertainty bivar

Bivariate plots: Region differences

40000 Creuse Ardennes

Indre Cote-d’Or

30000

Haute-MarneJura Meuse CW N

E 20000

S Pop. per Crime against persons Pop. per Crime against

10000 Doubs

Haut-Rhin Ariege

Corsica 0

10 20 30 40 50 60 70 80 Percent Read & Write

National Academies of Sciences, March 2005 14 c Michael Friendly Visualizing Multivariate Uncertainty bivar

Bivariate plots: Scatterplot matrices

37014

Crime_pers

2199 20235

Crime_prop

1368 74

Literacy

12 62486

Infants

2660

National Academies of Sciences, March 2005 15 c Michael Friendly Visualizing Multivariate Uncertainty corrgram

Corrgrams— Correlation matrix displays

How to show a correlation matrix for different purposes? (Friendly, 2002) Render a correlation to depict sign and magnitude (tasks: lookup, comparison, detection) Correlation value (x 100) -100 -85 -70 -55 -40 -25 -10 5 20 35 50 65 80 Numbe 95 r Circle Ellipse Bars Shade d

Task-specific renderings:

Task Lookup Comparison Detection Rendering Number Circle Shading

National Academies of Sciences, March 2005 16 c Michael Friendly Visualizing Multivariate Uncertainty corrgram

Corrgrams— Rendering

Baseball data: (lower) Patterns vs. (upper) comparison Baseball data: PC2/1 order

Years

logSal

Homer

Putouts

RBI

Walks

Runs

Hits

Atbat

Errors Atbat Assists Years logSa Home Putou RBI Walks Runs Hits Errors Assis l r ts ts

National Academies of Sciences, March 2005 17 c Michael Friendly Visualizing Multivariate Uncertainty corrgram

Corrgrams— Variable ordering

Baseball data: (a) alpha vs. (b) correlation ordering (Friendly and Kwan, 2003)

(a) Alpha order (b) PC2/1 order

Assists Years

Atbat logSal

Errors Homer

Hits Putouts

Homer RBI

logSal Walks

Runs Putouts

RBI Hits Assist Runs Errors Years Atbat Atbat Homer Walks Hits Walks Hits Runs Runs Errors RBI Errors Homer Walks Atbat RBI Years Putouts logSal Assists logSal Assists Years Putouts s

See: http://www.math.yorku.ca/SCS/sasmac/corrgram.html

National Academies of Sciences, March 2005 18 c Michael Friendly Visualizing Multivariate Uncertainty corrgram

Corrgrams— Variable ordering

Reorder variables to show similarities: PC1 or angles (PC2/PC1) 1.5 1.0 Assists Errors

0.5

Atbat 0.0 Hits Runs Walks Dimension 2 (17.4%) Putouts RBI -0.5 Homer logSal Years -1.0 -1.0 -0.5 0.0 0.5 1.0 1 .5 Dimension 1 (46.3%)

National Academies of Sciences, March 2005 19 c Michael Friendly Visualizing Multivariate Uncertainty corrgram

Corrgrams— Guerry data

Guerry Variables

Crime_pers

Desertion

Wealth

Infanticide

Literacy

Instruction

Clergy

Suicides

Lottery

Crime_prop

Infants Crime_perDesertionWealth Infanticide LiteracyInstruction Clergy Suicides Lottery Crime_pro Infants

s p

National Academies of Sciences, March 2005 20 c Michael Friendly Visualizing Multivariate Uncertainty corrgram

Guerry data— Variable ordering

1.2 Clergy

0.8 InstructionLiteracy Suicides 0.4 Lottery

0.0 Infanticide Crime_prop Infants Wealth Dimension 2 (15.8%) -0.4 Desertion

-0.8 Crime_pers

-1.2

-1.6 -1.2 -0.8 -0.4 0.0 0.4 0.8 1.2

Dimension 1 (35.5%)

National Academies of Sciences, March 2005 21 c Michael Friendly Visualizing Multivariate Uncertainty corrgram

Visual thinning: Minimal summaries for large data sets

Guerry data: schematic scatterplot matrix: 68% data ellipse + loess smooth

37014

rime_pers

2199 86

Desertion

1 86

Wealth

1 86

Infanticide

1 74

Literacy

12 86

Instruction

1 86

Clergy

1 163241

Suicides

3460 86

Lottery

1 20235

Crime_prop

1368 62486

Infants

2660

National Academies of Sciences, March 2005 22 c Michael Friendly Visualizing Multivariate Uncertainty multivar1

Multivariate analyses: Reduced rank displays

Multivariate visualization techniques can show the statistical data in simple ways, using dimension reduction techniques. Biplots - show variables and departments in space accounting for greatest variance Canonical discriminant plots - show variables and departments in space accounting for greatest between-region variation Can try to show geographic location by color coding or other visual attributes. Color code by region Show data ellipse to summarize regions → Data-centric displays: The multivariate data is shown directly; geographic relations indirectly

National Academies of Sciences, March 2005 23 c Michael Friendly Visualizing Multivariate Uncertainty multivar1

Biplots

Biplots represent both variables (attributes) and observations (departments) in the same plot— a low-rank (2D) approximation to a data matrix (Gabriel, 1971) d T T Y = Y − Y·· ≈ AB = akbk k=1

Variables are usually represented by vectors from origin () Observations are usually represented by points Can show clusters of observations by data ellipses Properties: Angles between vectors show correlations (r ≈ cos(θ)) Length of variable vectors ∼ % variance accounted for T yij ≈ ai bj: projection of observation on variable vector Dimensions are uncorrelated overall (but not necessarily within group)

National Academies of Sciences, March 2005 24 c Michael Friendly Visualizing Multivariate Uncertainty multivar1

Biplots: Guerry data

National Academies of Sciences, March 2005 25 c Michael Friendly Visualizing Multivariate Uncertainty -bb

Biplots: Baseball data

1.5

Assists Errors

1.0 IF IF

IF IF IFIF IF IF IF IFIF IF IF IF IF IF IF IF IF IF IF 0.5 IF IF IF IF IF IF IF IF IF IF IF IFIFIF IF IF IFIFIF IF IF IF IF IFIF IFIF Atbat IF IF IF IF IF IF IFOF IFIF IF IF 1B 1B Hits IF IF1BIF IFOF IFIF OFIF IF IFOF IF C1B IF CUTIFIFUTOF C IF IF OF1B CCUT IFIF OF OF1B Runs IFOF IFOF1B IF IFUTOFOFOFOFC 1B OF 0.0 CC OFC CCOF 1B IF OF C IF 1BIF 1B OF OFOFC OF UTIF OFIFC OFOFOF 1B OFCOFOF OF OF IFOF OFOF UTOFCOF COFOFOF OF OF OFOFCIF 1B Dimension 2 (17.4%) UTCDHOF C1BOFOF OF OFOFOFOF OF OF Walks COFC OFOFOFUTOFDHIFC 1BOF OF 1B 1B 1B OFOFC OFOF COFOFOF1BOFPutouts 1B UTIF OF IF OF OF OFCDHOFCOFOF OF 1B RBI OFUTOF COF OF OFOFOF DH 1BOF C DHOFDHOF COF C DH1B DH 1B OF DH OF OFOFOF -0.5 DH DH 1B Homer DH DH logSal Years

-1.0 -1.0 -0.5 0.0 0.5 1.0 1.5 Dimension 1 (46.3%) Biplot of baseball data, players labeled by position

National Academies of Sciences, March 2005 26 c Michael Friendly Visualizing Multivariate Uncertainty biplot-bb

1.5

Assists Errors

1.0 IF IF

IF IF IFIF IF IF IF IFIF IF IF IF IF IF IF IF IF IF IF 0.5 IF IF IF IF IF IF IF IF IF IF IF IFIFIF IF IF IF IFIFIF IF IF IF IF IFIF IFIF Atbat IF IF IF IF IF IF IFOF IFIF IF IF 1B 1B Hits IF IF1BIF IFOF IFIF OFIF IF IFOF IF C1B IF CUTIFIFUT OFC IF IF OF1B CCUT IFIF OF OF1B Runs IFOF IFOF1B IF IFUTOFOFOFOFC 1B OF 0.0 CC OFC CCOFUT1B IF OF C 1BIF 1BIF 1B OF OFOFC OFCUTIF OFIFC OFOFOF 1B OFCOFOF OF OFOF IFOF OFOF UTOFCOF COFOFOF OF OF OFOFCIF 1B Dimension 2 (17.4%) Dimension 2 UTCDHOF C1BOFOF OF OFOFOFOF OF OF Walks COFC OFOFOFUTOFDHIFC 1BOF OF 1B 1B 1B OFOFC OFOF COFOFOF1BOFPutouts 1B UTIF OF IF OF OF OFCDHOFDHCOFOF OF 1B RBI OFUTOF COF OF OFOFOF DH 1BOF C DHOFDHOF COF C DH1B DH 1B OF DH OF OFOFOF -0.5 DH DH 1B Homer DH DH logSal Years

-1.0 -1.0 -0.5 0.0 0.5 1.0 1.5 Dimension 1 (46.3%) Biplot of baseball data, with data ellipses by position

National Academies of Sciences, March 2005 27 c Michael Friendly Visualizing Multivariate Uncertainty biplot-bb

1.5

Assists Errors

1.0

0.5 IF Atbat Hits Runs 0.0 UT 1B C OF

Dimension 2 (17.4%) Dimension 2 Walks Putouts DH RBI

-0.5 Homer logSal Years

-1.0 -1.0 -0.5 0.0 0.5 1.0 1.5 Dimension 1 (46.3%) Biplot of baseball data, with data ellipses by position (thinned)

National Academies of Sciences, March 2005 28 c Michael Friendly Visualizing Multivariate Uncertainty multivar1

Canonical discriminant plots

Project the variables into a low-rank (2D) space that maximally discrimates among regions (Friendly, 1991) Visual summary of a MANOVA Canonical dimensions are linear combinations of the variables with maximum univariate F -statistics. Vectors from the origin (grand mean) for the observed variables show the correlations with the canonical dimensions Properties: Canonical variates are uncorrelated 2 Circles of radius χ2(1 − α)/ni give confidence regions for group means. Variable vectors show how variables discriminate among groups Lengths of variable vectors ∼ contribution to discrimination

National Academies of Sciences, March 2005 29 c Michael Friendly Visualizing Multivariate Uncertainty multivar1

Canonical discriminant plots: Guerry data, by Region

National Academies of Sciences, March 2005 30 c Michael Friendly Visualizing Multivariate Uncertainty cdaplots-bb

CDA plots: Baseball data, by player position

6 assists errors putouts

4

2 atbat logsal hitsyears 0 walks rbi runs homer -2 Canonical Dimension 2 (27.1%) Canonical Dimension -4

Position 1B C DH -6 IF OF UT -6-4-202468 Canonical Dimension 1 (72.9%)

National Academies of Sciences, March 2005 31 c Michael Friendly Visualizing Multivariate Uncertainty heplots

HE plots: Visualization for Multivariate Linear Models

How are p responses, Y =(y1, y2,...,yp) related to q predictors, X =(x1, x2,...,xq)? (Friendly, 2004a)⎫ ⎪ MANOVA: X ∼ discrete factors ⎬ All the same MLM: X ∼ MMRA: quantitative predictors ⎪ Y = X B + E MANCOVA, response surface models,⎭ (n×p) (n×q)(q×p) (n×p)

Analogs of univariate tests: Explained variation: MSH −→ (p × p) covariance matrix, H Residual variation: MSE −→ (p × p) covariance matrix, E Test statistics: F −→ | H − λE| =0→ λ1,λ2,...λs How big is H relative to E ? Latent roots λ1,λ2,...λs measure the “size” of H relative to E in s =min(p, dfh) orthogonal directions. Test statistics: Wilks’ Λ, Pillai trace, Hotelling-Lawley trace, Roy’s maximum root combine these into a single number

National Academies of Sciences, March 2005 32 c Michael Friendly Visualizing Multivariate Uncertainty heplots

HE plots: Visualization for Multivariate Linear Models

HE plot: for two response variables, (y1, y2), plot a H ellipse and E ellipse

HE plot matrices: For all p responses, plot an HE scatterplot matrix

→ Shows: size, dimensionality, and effect-correlation of H relative to E . Individual group scatter Between and Within Scatter Y2 Y2 40 40

Deviations of group means from How big is H relative to E? Scatter around group means grand mean (outer) and pooled represented by each ellipse within-group (inner) ellipses.

30 30 H matrix

8

20 4 7 20

6 E matrix 3

5 10 1 10

2

(a) (b) 0 0 0 102030400 10203040 Y1 Y1 Essential ideas behind multivariate tests: (a) Data ellipses; (b) H and E ellipses

National Academies of Sciences, March 2005 33 c Michael Friendly Visualizing Multivariate Uncertainty heplots

Simple example: Iris data

80 80

70 70 Virginica Virginica

Versicolor 60 60 Versicolor Sepal length in mm. Sepal length Sepal length in mm. Setosa 50 50 Setosa

40 Matrix Hypothesis Error 40 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Petal length in mm. Petal length in mm. (a) Data ellipses and (b) H and E ellipses

H ellipse: Shows 2D covariation of predicted values (means) E ellipse: Shows 2D covariation of residuals points: show group means on both variables

National Academies of Sciences, March 2005 34 c Michael Friendly Visualizing Multivariate Uncertainty he-bb

Baseball data: Variation by position

How do relations among variables vary with player’s position? Fit MANOVA model, (logSal Years Homer Runs Hits RBI Atbat Walks Putouts Assists Errors) = Position HE plots for selected pairs: (Years, logSal), (Putouts, Assists)

Model: logSal Years ... Errors = Position Model: logSal Years ... Errors = Position 3.2 300 IF

3.0

200

2.8

1B DH 2.6 100 UT OF IF C Assists log Salary UT C 2.4 OF 0 DH

2.2

Matrix Hypothesis Error Matrix Hypothesis Error 2.0 -100 0 5 10 15 20 0 100 200 300 400 500 600 Years in the Major Leagues Put Outs

National Academies of Sciences, March 2005 35 c Michael Friendly Visualizing Multivariate Uncertainty he-bb

Model: logSal Years ... Errors = Position 3.2

3.0

2.8

1B DH 2.6 OF IF C log Salary UT

2.4

2.2

Matrix Hypothesis Error 2.0 0 5 10 15 20 Years in the Major Leagues

National Academies of Sciences, March 2005 36 c Michael Friendly Visualizing Multivariate Uncertainty he-bb

Model: logSal Years ... Errors = Position 300 IF

200

100 UT Assists

C

OF 0 DH

Matrix Hypothesis Error -100 0 100 200 300 400 500 600 Put Outs

National Academies of Sciences, March 2005 37 c Michael Friendly Visualizing Multivariate Uncertainty he-guerry

Guerry data: Predicting crime

How do rates of crime vary with other variables? Fit MANCOVA model, (Crime_pers Crime_prop) = Region + Wealth + Suicides + Literacy + Donations + Infants HE plots: Overall, plus for Region and covariate effects

Region effect Covariates: Wealth Suicides Literacy Donations Infants Creuse Creuse 20000 20000 Haute-Loire Haute-Loire

Ain Ain 15000 15000 Correze Correze Suicides Wealth Donation 10000 C Region 10000 Infants LIteracy S . E W .

N 5000 5000 Pop. per Crime against propertyPop. per Crime against propertyPop. per Crime against propertyPop. per Crime against propertyPop. per Crime against propertyPop. per Crime against property Pop. per Crime against property against property Pop. per CrimePop. per Crime

0 Seine 0 Seine 0 10000 20000 30000 40000 0 10000 20000 30000 40000 Pop. per Crime against persons Pop. per Crime against persons

National Academies of Sciences, March 2005 38 c Michael Friendly Visualizing Multivariate Uncertainty he-guerry

Region effect Creuse 20000 Haute-Loire

Ain 15000 Correze

10000 C Region S . E W

N 5000 Pop. per Crime against property property Crime against Crime against Pop. perPop. per

0 Seine 0 10000 20000 30000 40000 Pop. per Crime against persons

Overall: Predicted crimes against persons and property are negatively correlated Larger variation in crimes against property Region variation greater in crimes against persons

National Academies of Sciences, March 2005 39 c Michael Friendly Visualizing Multivariate Uncertainty he-guerry

Covariates: Wealth Suicides Literacy Donations Infants Creuse 20000 Haute-Loire

Ain 15000 Correze Suicides Wealth Donation 10000 Infants LIteracy .

5000 Pop. per Crime against property property property property property property Crime against Crime against Crime against Crime against Crime against Crime against Pop. perPop. perPop. perPop. perPop. perPop. per

0 Seine 0 10000 20000 30000 40000 Pop. per Crime against persons

Each quantitative variable (covariate) plots as a 1D ellipse (vector) Orientation: relation of xi to y1, y2 Length: strength of relation

National Academies of Sciences, March 2005 40 c Michael Friendly Visualizing Multivariate Uncertainty multimap

Multivariate mapping: Map-centric displays

How to generalize choropleth maps to many variables? Star maps: Show multivariate data on the map using star icons, variable ∼ length of ray Reduced-rank RGB displays: → (F1, F2, F3) factor scores → (R, G, B) shading PREFMAP (x, y) maps: Fit data variables to (Long, Lat) map coordinates. Display variables as vectors in map coordinates.

National Academies of Sciences, March 2005 41 c Michael Friendly Visualizing Multivariate Uncertainty multimap

Star maps

Star map of Guerry Variables (Ranks) Crime_prop Infants

Instruction Suicides

Donations Crime_pers

Departments colored by Region

National Academies of Sciences, March 2005 42 c Michael Friendly Visualizing Multivariate Uncertainty multimap

Star maps: by region

Median Ranks by Region Crime_prop Infants Instruction Suicides Donations Crime_pers

National Academies of Sciences, March 2005 43 c Michael Friendly Visualizing Multivariate Uncertainty multimap

Star maps: Multivariate boxplots by region

Crime_prop Infants

Instruction Suicides

Donations Crime_pers

stars for Q1, Median, Q3 How to show unusual depts?

National Academies of Sciences, March 2005 44 c Michael Friendly Visualizing Multivariate Uncertainty gfactor

Reduced-rank color-coded displays

Use dimension-reduction technique (PCA, Factor analysis, ...) to produce scores for observations (departments) on 3 dimensions (F1,F2,F3) Factor1 Factor2 Factor3 Variable Civil society Moral values Crime Pop per Crime against persons 0 97 Pop per Crime against property 0 75 0 39 Percent Read & Write -0 72 Pop per illegitimate birth 0 62 0 42 Donations to the poor 0 89 Pop per suicide 0 80

Scale (F1,F2,F3) → [0,1]

Color mapping function, e.g., C(F1,F2,F3) → rgb(Fi,Fj,Fk)

National Academies of Sciences, March 2005 45 c Michael Friendly Visualizing Multivariate Uncertainty gfactor

Reduced-rank color-coded displays

RGB 3-factor map: R=f1, G=f2, B=f3 Variables: Crime_pers Crime_prop Literacy Infants Donations Suicides

62 59

80

76 2 8 60 57 14 27 50 51 55 75 54 67 78 61 77 22 29 28 10 88 35 53 52 72 68 56 45 89 70 41 21 49 44 25 37 58 18 39 85 36 71 79 86 3 1 23 69 87 42 17 16 63

19 38 43 24 15 33 7 26 5 46 48 47 12 4 82 30 84 40 32 81 F1 34 13 83 64 31 65 11 9 66

200

F3 F2

National Academies of Sciences, March 2005 46 c Michael Friendly Visualizing Multivariate Uncertainty summary

Summary and future directions

Guerry’s challenge: Understanding uncertainty in multivariate, spatial data How visualize and understand relations among many variables? How to relate these to geographic information? Understanding multivariate variation Visual summaries (data ellipses, smoothings) can show statistical relations more clearly and effectively Reduced-rank visualization methods can show simpler, approximate views, based on several criteria. Multivariate statistical models need their own visualization methods, just beginning – HE plots as an example. Understanding multivariate, spatial variation Multivariate visualizations applied to spatial data can be revealing, but still need work Statistical methods for spatial data need to be extended to the multivariate setting.

National Academies of Sciences, March 2005 47 c Michael Friendly Visualizing Multivariate Uncertainty

References

Friendly, M. (1991). SAS System for Statistical Graphics. Cary, NC: SAS Institute, 1st edn.

Friendly, M. (2002). Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56(4), 316–324.

Friendly, M. (2004a). Graphical methods for the multivariate . Journal of Computational and Graphical Statistics. (submitted 2/17/04; resubmitted: 10/02/04).

Friendly, M. (2004b). Milestones in the history of data visualization: A case study in statistical historiography. In W. Gaul and C. Weihs, eds., Studies in Classification, Data Analysis, and Knowledge Organization. New York: Springer. (In press).

Friendly, M. and Kwan, E. (2003). Effect ordering for data displays. Computational Statistics and Data Analysis, 43(4), 509–539.

Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal components analysis. Biometrics, 58(3), 453–467.

Guerry, A.-M. (1833). Essai sur la statistique morale de la France. Paris. English translation: Hugh P. Whitt and Victor W. Reinking, Lewiston, N.Y. : Edwin Mellen Press, c2002.

Pearson, K. (1920). Notes on the history of correlation. Biometrika, 13(1), 25–45.

National Academies of Sciences, March 2005 48 c Michael Friendly