Visualizing Multivariate Uncertainty: Some Graphical Methods for Multivariable Spatial Data
Median Ranks by Region Crime_prop Infants Instruction Suicides Donations Crime_pers
Michael Friendly York University http://www.math.yorku.ca/SCS/friendly.html
National Academy of Sciences Washington, DC, Mar, 2005 Visualizing Multivariate Uncertainty outline
Outline
Multivariate uncertainty and “moral statistics” A. M. Guerry’s Moral Statistics of France Guerry’s data and analyses Multivariate analyses: Data-centric displays Bivariate plots and data ellipses Biplots Canonical discriminant plots HE plots for multivariate linear models Multivariate mapping: Map-centric displays Star maps Reduced-rank color maps
National Academies of Sciences, March 2005 1 c Michael Friendly Visualizing Multivariate Uncertainty guerry1
Multivariate Uncertainty and “Moral Statistics” ∼ 1800
It is a capital mistake to theorize before one has data. Sherlock Homes in Scandal in Bohemia What to do about crime? Liberal view: increase education, literacy Conservative view: build more prisons What to do about poverty? Liberal view: increase social assistance Conservative view: build more poor-houses But: Little actual data – all armchair theorizing No ways to understand or visualize relationships between variables • Statistical graphics just invented (Playfair)— line graph, bar chart, pie chart • All 1D or 1.5D (time series)
National Academies of Sciences, March 2005 2 c Michael Friendly Visualizing Multivariate Uncertainty background
The rise of “moral statistics” and modern social science
Political arithmetic: William Petty (and others) 1654— first attempt at scientific survey (on Irish estates) 1687— idea that wealth and strength of a state depended on its subjects (number and characteristics) Demography: Johann Peter Sussmilch¨ (1741)— importance of measuring and analyzing population distributions idea that ethical and state policies could encourage growth and wealth (increase birth rate, decrease death rate) • discourage alcohol, gambling, prostitution & priestly celibacy • encourage state support for medical care, distribution of land, lower taxes Statistik: Numbers of the state (1800–1820), Germany and France collect data on imports, exports, transportation, ... Guerry & Quetelet Quetelet: Concepts of “average man” and “social physics” Guerry: First real social data analysis (Guerry, 1833)
National Academies of Sciences, March 2005 3 c Michael Friendly Visualizing Multivariate Uncertainty data
Guerry’s data
Compte gen´ eral´ de l’administration de la justice criminelle en France The first national compilation of official justice data (1825) • detailed data on all charges and disposition • collected quarterly in all 86 departments. Other sources: Bureau de Longitudes (illegitimate births); Parent-Duchateletˆ (prostitutes in Paris); Compte du ministere du guerre (military desertions); ... Moral variables: Scaled so ’more’ is ’better’ Crime pers Population per Crime against persons Crime prop Population per Crime against property Donations Donations to the poor Infants Population per illegitimate birth Literacy Percent who can read & write Suicides Population per suicide Tried to define these to ensure comparability and representativeness • Crime: Use number of accused rather than convicted • Literacy: Reported levels of education unreliable; use data from military draft examinations (% of young men able to read and write) Other variables: Ranks by department: wealth, commerce, ...
National Academies of Sciences, March 2005 4 c Michael Friendly Visualizing Multivariate Uncertainty questions
Guerry’s Questions
Should crime and other moral variables be considered as structural, lawful characteristics of society, or simply as indicants of individual behavior? Statistical regularity as the key to social science (“social physics”) social equivalent of “law of large numbers”) Guerry showed that rates of crime had nearly invariant distributions over time (1825–1830) when classified by region, sex of accused, type of crime, etc. “We would be forced to recognze that the facts of moral order, like those of physical order, obey invariant laws...” (p.14)
Relations between crime and other moral variables Do crimes against persons and crimes against property show the same or different trends? How does crime relate to education and literacy? • Some “armchair” arguments had suggested increasing literacy to decrease crime: “The definitive result shows that 67 out of 100 prisoners can neither read nor write. What stronger proof could there be that ignorance is the mother of all vices” (A. Taillander, 1828) Does crime vary coherently over regions of France (C, N, S, E, W)?
National Academies of Sciences, March 2005 5 c Michael Friendly Visualizing Multivariate Uncertainty maps
Guerry’s maps
National Academies of Sciences, March 2005 6 c Michael Friendly Visualizing Multivariate Uncertainty maps
Guerry’s maps
Population per Crime against persons Population per Crime against property Per cent who can Read and Write
59 69 5 26 62 56
83 38 53 39 66 85 2 19 60 52 64 78 72 31 68 10 9 80 34 24 62 42 7 12 65 52 65 64 86 71 27 68 15 33 77 79 1322 12 3 1 14 7082 76 73 56 54 21 56 68 76 37 7 79 52 49 46 34 4 6 62 6 68 73 76 55 67 30 68 22 83 74 5 66 13 12 60 84 36 49 55 16 5 32 50 82 37 57 29 47 60 73 51 24 26 48 78 81 67 57 64 20 17 74 47 9 36 22 26 85 54 64 76 53 3 14 65 51 85 50 82 75 43 44 77 29 8 38 40 25 35 11 48 70 48 22 3 77 84 42 86 86 17 30 45 28 8 3 56 43 63 32 71 18 83 80 81 45 40 12 31 26 44 82 56 1 31 53 38 29 72 79 85 10 35 15 61 8 19 33 41 73 46 52 47 26 50 80 2 6 63 23 20 26 58 7 15 61 32 40 35 35 58 21 20 42 35 23 17 25 59 50 29 22 47 16 27 14 42 14 20 18 75 78 17 69 44 56 44 17 31 42 58 39 60 35 11 28 71 74 66 39 3 70 10 4 45 35
Rank 1 - 9 10 - 19 20 - 28 1 Rank 1 - 9 10 - 19 20 - 28 10 Rank 1 - 8 10 - 17 20 - 26 62 29 - 38 39 - 47 48 - 57 29 - 38 39 - 47 48 - 57 29 - 35 38 - 47 48 - 56 58 - 66 67 - 76 77 - 86 58 - 66 67 - 76 77 - 86 58 - 66 68 - 76 77 - 86 Population per Illegitimate birth Donations to the poor Population per Suicide
8 4 51 55 22 17 17 43 13 3 26 37 58 62 56 7 12 43 48 49 3 22 5 36 45 20 46 85 74 34 65 70 49 16 15 41 11 28 27 6 23 38 1 15 23 2935 80 1 2 29 61 39 64 48 53 4 83 67 78 75 25 50 29 84 39 25 31 40 20 8 50 84 46 60 30 65 33 49 27 5 54 45 34 80 10 7 81 20 42 54 46 10 42 12 36 73 11 63 19 50 9 19 24 21 51 35 61 37 44 51 41 18 59 21 21 64 69 54 66 68 32 47 59 15 55 86 56 60 79 71 26 75 39 36 74 82 83 63 38 35 43 69 82 81 45 56 24 70 79 55 2 78 3 52 27 71 53 66 78 76 77 52 22 26 42 80 77 52 14 82 32 66 59 63 33 79 41 33 12 58 81 86 9 85 40 68 44 17 13 57 31 69 37 25 58 73 47 4 68 9 47 64 67 38 18 11 62 83 18 44 1 30 32 70 76 75 40 16 57 67 28 77 23 76 31 62 6 28 14 2 7 8 72 48 5 14 16 34 19 6 73 71 13 57 54 10 85 74 65 24 84 30 72 61
Rank 1 - 9 10 - 19 20 - 28 72 Rank 1 - 9 10 - 19 20 - 28 86 Rank 1 - 9 10 - 19 20 - 28 60 29 - 38 39 - 47 48 - 57 29 - 38 39 - 47 48 - 57 29 - 38 39 - 47 48 - 57 58 - 66 67 - 76 77 - 86 58 - 66 67 - 76 77 - 86 58 - 66 67 - 76 77 - 86
National Academies of Sciences, March 2005 7 c Michael Friendly Visualizing Multivariate Uncertainty maps
Guerry’s analyses
st Relate variables by comparing maps and ranked lists (1 || coordinate plot) Conclusion: no clear relation between crime and literacy
Literacy Ranked lists Crimes against persons Similar analyses for other variables (suicide, illegitimate births, ...)
National Academies of Sciences, March 2005 8 c Michael Friendly Visualizing Multivariate Uncertainty multivar0
Graphical methods for multivariate data
Bivariate displays: Bivariate displays can be enhanced to show statistical relations more clearly and effectively Scatterplots with data (concentration) ellipses and smoothed (loess) curves Scatterplot matrices Corrgrams and visual thinning Reduced-rank displays: Multivariate visualization techniques can show the statistical data in simple ways, using dimension reduction techniques. Biplots - show variables and observations in space accounting for greatest variance Canonical discriminant plots - show variables and observations in space accounting for greatest between-group variation HE plots: Visualization for Multivariate Linear Models
National Academies of Sciences, March 2005 9 c Michael Friendly Visualizing Multivariate Uncertainty bivar
Bivariate plots: Points and visual summaries
40000 Creuse Ardennes
Indre Cote-d’Or
30000
Haute-MarneJura Meuse
20000 Pop. per Crime against persons Pop. per Crime against
10000 Doubs
Haut-Rhin Ariege
Corsica 0
10 20 30 40 50 60 70 80 Percent Read & Write Scatterplot with linear regression line
National Academies of Sciences, March 2005 10 c Michael Friendly Visualizing Multivariate Uncertainty dataellipse0
The Data Ellipse: Galton’s Discovery
Pearson (1920): “... one of the most noteworthy scientific discoveries arising from pure analysis of observations.”
National Academies of Sciences, March 2005 11 c Michael Friendly Visualizing Multivariate Uncertainty dataellipse0
The Data Ellipse: Details
Visual summary for bivariate marginal relations Shows: means, standard deviations, correlation, regression line(s) 2 Defined: set of points whose squared Mahalanobis distance ≤ c , D2(y) ≡ (y − y¯)T S−1 (y − y¯) ≤ c2
S = sample variance-covariance matrix 2 2 Radius: when y is approx. bivariate normal, D (y) has a large-sample χ2 distribution with 2 degrees of freedom. 2 2 • c = χ2(0.40) ≈ 1: 1 std. dev univariate ellipse– 1D shadows: y¯ ± 1s 2 2 • c = χ2(0.68)=2.28: 1 std. dev bivariate ellipse 2 • Small samples: c ≈ 2F2,n−2(1 − α) Construction: Transform the unit circle, U =(sinθ, cos θ),
1/2 Ec = y¯ + cS U
1/2 S = any “square root” of S (e.g., Cholesky) Robust version: Use robust covariance estimate (MCD, MVE) Nonparametric version: Use kernel density estimation
National Academies of Sciences, March 2005 12 c Michael Friendly Visualizing Multivariate Uncertainty bivar
Bivariate plots: Data ellipse and smoothing
40000 Creuse Ardennes
Indre Cote-d’Or
30000
Haute-MarneJura Meuse
20000 Pop. per Crime against persons Pop. per Crime against
10000 Doubs
Haut-Rhin Ariege
Corsica 0
10 20 30 40 50 60 70 80 Percent Read & Write Scatterplot with 68% data ellipse and smoothed (loess) curve
National Academies of Sciences, March 2005 13 c Michael Friendly Visualizing Multivariate Uncertainty bivar
Bivariate plots: Region differences
40000 Creuse Ardennes
Indre Cote-d’Or
30000
Haute-MarneJura Meuse CW N
E 20000
S Pop. per Crime against persons Pop. per Crime against
10000 Doubs
Haut-Rhin Ariege
Corsica 0
10 20 30 40 50 60 70 80 Percent Read & Write
National Academies of Sciences, March 2005 14 c Michael Friendly Visualizing Multivariate Uncertainty bivar
Bivariate plots: Scatterplot matrices
37014
Crime_pers
2199 20235
Crime_prop
1368 74
Literacy
12 62486
Infants
2660
National Academies of Sciences, March 2005 15 c Michael Friendly Visualizing Multivariate Uncertainty corrgram
Corrgrams— Correlation matrix displays
How to show a correlation matrix for different purposes? (Friendly, 2002) Render a correlation to depict sign and magnitude (tasks: lookup, comparison, detection) Correlation value (x 100) -100 -85 -70 -55 -40 -25 -10 5 20 35 50 65 80 Numbe 95 r Circle Ellipse Bars Shade d
Task-specific renderings:
Task Lookup Comparison Detection Rendering Number Circle Shading
National Academies of Sciences, March 2005 16 c Michael Friendly Visualizing Multivariate Uncertainty corrgram
Corrgrams— Rendering
Baseball data: (lower) Patterns vs. (upper) comparison Baseball data: PC2/1 order
Years
logSal
Homer
Putouts
RBI
Walks
Runs
Hits
Atbat
Errors Atbat Assists Years logSa Home Putou RBI Walks Runs Hits Errors Assis l r ts ts
National Academies of Sciences, March 2005 17 c Michael Friendly Visualizing Multivariate Uncertainty corrgram
Corrgrams— Variable ordering
Baseball data: (a) alpha vs. (b) correlation ordering (Friendly and Kwan, 2003)
(a) Alpha order (b) PC2/1 order
Assists Years
Atbat logSal
Errors Homer
Hits Putouts
Homer RBI
logSal Walks
Runs Putouts
RBI Hits Assist Runs Errors Years Atbat Atbat Homer Walks Hits Walks Hits Runs Runs Errors RBI Errors Homer Walks Atbat RBI Years Putouts logSal Assists logSal Assists Years Putouts s
See: http://www.math.yorku.ca/SCS/sasmac/corrgram.html
National Academies of Sciences, March 2005 18 c Michael Friendly Visualizing Multivariate Uncertainty corrgram
Corrgrams— Variable ordering
Reorder variables to show similarities: PC1 or angles (PC2/PC1) 1.5 1.0 Assists Errors
0.5
Atbat 0.0 Hits Runs Walks Dimension 2 (17.4%) Putouts RBI -0.5 Homer logSal Years -1.0 -1.0 -0.5 0.0 0.5 1.0 1 .5 Dimension 1 (46.3%)
National Academies of Sciences, March 2005 19 c Michael Friendly Visualizing Multivariate Uncertainty corrgram
Corrgrams— Guerry data
Guerry Variables
Crime_pers
Desertion
Wealth
Infanticide
Literacy
Instruction
Clergy
Suicides
Lottery
Crime_prop
Infants Crime_perDesertionWealth Infanticide LiteracyInstruction Clergy Suicides Lottery Crime_pro Infants
s p
National Academies of Sciences, March 2005 20 c Michael Friendly Visualizing Multivariate Uncertainty corrgram
Guerry data— Variable ordering
1.2 Clergy
0.8 InstructionLiteracy Suicides 0.4 Lottery
0.0 Infanticide Crime_prop Infants Wealth Dimension 2 (15.8%) -0.4 Desertion
-0.8 Crime_pers
-1.2
-1.6 -1.2 -0.8 -0.4 0.0 0.4 0.8 1.2
Dimension 1 (35.5%)
National Academies of Sciences, March 2005 21 c Michael Friendly Visualizing Multivariate Uncertainty corrgram
Visual thinning: Minimal summaries for large data sets
Guerry data: schematic scatterplot matrix: 68% data ellipse + loess smooth
37014
rime_pers
2199 86
Desertion
1 86
Wealth
1 86
Infanticide
1 74
Literacy
12 86
Instruction
1 86
Clergy
1 163241
Suicides
3460 86
Lottery
1 20235
Crime_prop
1368 62486
Infants
2660
National Academies of Sciences, March 2005 22 c Michael Friendly Visualizing Multivariate Uncertainty multivar1
Multivariate analyses: Reduced rank displays
Multivariate visualization techniques can show the statistical data in simple ways, using dimension reduction techniques. Biplots - show variables and departments in space accounting for greatest variance Canonical discriminant plots - show variables and departments in space accounting for greatest between-region variation Can try to show geographic location by color coding or other visual attributes. Color code by region Show data ellipse to summarize regions → Data-centric displays: The multivariate data is shown directly; geographic relations indirectly
National Academies of Sciences, March 2005 23 c Michael Friendly Visualizing Multivariate Uncertainty multivar1
Biplots
Biplots represent both variables (attributes) and observations (departments) in the same plot— a low-rank (2D) approximation to a data matrix (Gabriel, 1971) d T T Y = Y − Y·· ≈ AB = akbk k=1
Variables are usually represented by vectors from origin (mean) Observations are usually represented by points Can show clusters of observations by data ellipses Properties: Angles between vectors show correlations (r ≈ cos(θ)) Length of variable vectors ∼ % variance accounted for T yij ≈ ai bj: projection of observation on variable vector Dimensions are uncorrelated overall (but not necessarily within group)
National Academies of Sciences, March 2005 24 c Michael Friendly Visualizing Multivariate Uncertainty multivar1
Biplots: Guerry data
National Academies of Sciences, March 2005 25 c Michael Friendly Visualizing Multivariate Uncertainty biplot-bb
Biplots: Baseball data
1.5
Assists Errors
1.0 IF IF
IF IF IFIF IF IF IF IFIF IF IF IF IF IF IF IF IF IF IF 0.5 IF IF IF IF IF IF IF IF IF IF IF IFIFIF IF IF IFIFIF IF IF IF IF IFIF IFIF Atbat IF IF IF IF IF IF IFOF IFIF IF IF 1B 1B Hits IF IF1BIF IFOF IFIF OFIF IF IFOF IF C1B IF CUTIFIFUTOF C IF IF OF1B CCUT IFIF OF OF1B Runs IFOF IFOF1B IF IFUTOFOFOFOFC 1B OF 0.0 CC OFC CCOF 1B IF OF C IF 1BIF 1B OF OFOFC OF UTIF OFIFC OFOFOF 1B OFCOFOF OF OF IFOF OFOF UTOFCOF COFOFOF OF OF OFOFCIF 1B Dimension 2 (17.4%) UTCDHOF C1BOFOF OF OFOFOFOF OF OF Walks COFC OFOFOFUTOFDHIFC 1BOF OF 1B 1B 1B OFOFC OFOF COFOFOF1BOFPutouts 1B UTIF OF IF OF OF OFCDHOFCOFOF OF 1B RBI OFUTOF COF OF OFOFOF DH 1BOF C DHOFDHOF COF C DH1B DH 1B OF DH OF OFOFOF -0.5 DH DH 1B Homer DH DH logSal Years
-1.0 -1.0 -0.5 0.0 0.5 1.0 1.5 Dimension 1 (46.3%) Biplot of baseball data, players labeled by position
National Academies of Sciences, March 2005 26 c Michael Friendly Visualizing Multivariate Uncertainty biplot-bb
1.5
Assists Errors
1.0 IF IF
IF IF IFIF IF IF IF IFIF IF IF IF IF IF IF IF IF IF IF 0.5 IF IF IF IF IF IF IF IF IF IF IF IFIFIF IF IF IF IFIFIF IF IF IF IF IFIF IFIF Atbat IF IF IF IF IF IF IFOF IFIF IF IF 1B 1B Hits IF IF1BIF IFOF IFIF OFIF IF IFOF IF C1B IF CUTIFIFUT OFC IF IF OF1B CCUT IFIF OF OF1B Runs IFOF IFOF1B IF IFUTOFOFOFOFC 1B OF 0.0 CC OFC CCOFUT1B IF OF C 1BIF 1BIF 1B OF OFOFC OFCUTIF OFIFC OFOFOF 1B OFCOFOF OF OFOF IFOF OFOF UTOFCOF COFOFOF OF OF OFOFCIF 1B Dimension 2 (17.4%) Dimension 2 UTCDHOF C1BOFOF OF OFOFOFOF OF OF Walks COFC OFOFOFUTOFDHIFC 1BOF OF 1B 1B 1B OFOFC OFOF COFOFOF1BOFPutouts 1B UTIF OF IF OF OF OFCDHOFDHCOFOF OF 1B RBI OFUTOF COF OF OFOFOF DH 1BOF C DHOFDHOF COF C DH1B DH 1B OF DH OF OFOFOF -0.5 DH DH 1B Homer DH DH logSal Years
-1.0 -1.0 -0.5 0.0 0.5 1.0 1.5 Dimension 1 (46.3%) Biplot of baseball data, with data ellipses by position
National Academies of Sciences, March 2005 27 c Michael Friendly Visualizing Multivariate Uncertainty biplot-bb
1.5
Assists Errors
1.0
0.5 IF Atbat Hits Runs 0.0 UT 1B C OF
Dimension 2 (17.4%) Dimension 2 Walks Putouts DH RBI
-0.5 Homer logSal Years
-1.0 -1.0 -0.5 0.0 0.5 1.0 1.5 Dimension 1 (46.3%) Biplot of baseball data, with data ellipses by position (thinned)
National Academies of Sciences, March 2005 28 c Michael Friendly Visualizing Multivariate Uncertainty multivar1
Canonical discriminant plots
Project the variables into a low-rank (2D) space that maximally discrimates among regions (Friendly, 1991) Visual summary of a MANOVA Canonical dimensions are linear combinations of the variables with maximum univariate F -statistics. Vectors from the origin (grand mean) for the observed variables show the correlations with the canonical dimensions Properties: Canonical variates are uncorrelated 2 Circles of radius χ2(1 − α)/ni give confidence regions for group means. Variable vectors show how variables discriminate among groups Lengths of variable vectors ∼ contribution to discrimination
National Academies of Sciences, March 2005 29 c Michael Friendly Visualizing Multivariate Uncertainty multivar1
Canonical discriminant plots: Guerry data, by Region
National Academies of Sciences, March 2005 30 c Michael Friendly Visualizing Multivariate Uncertainty cdaplots-bb
CDA plots: Baseball data, by player position
6 assists errors putouts
4
2 atbat logsal hitsyears 0 walks rbi runs homer -2 Canonical Dimension 2 (27.1%) Canonical Dimension -4
Position 1B C DH -6 IF OF UT -6-4-202468 Canonical Dimension 1 (72.9%)
National Academies of Sciences, March 2005 31 c Michael Friendly Visualizing Multivariate Uncertainty heplots
HE plots: Visualization for Multivariate Linear Models
How are p responses, Y =(y1, y2,...,yp) related to q predictors, X =(x1, x2,...,xq)? (Friendly, 2004a)⎫ ⎪ MANOVA: X ∼ discrete factors ⎬ All the same MLM: X ∼ MMRA: quantitative predictors ⎪ Y = X B + E MANCOVA, response surface models,⎭ (n×p) (n×q)(q×p) (n×p)
Analogs of univariate tests: Explained variation: MSH −→ (p × p) covariance matrix, H Residual variation: MSE −→ (p × p) covariance matrix, E Test statistics: F −→ | H − λE| =0→ λ1,λ2,...λs How big is H relative to E ? Latent roots λ1,λ2,...λs measure the “size” of H relative to E in s =min(p, dfh) orthogonal directions. Test statistics: Wilks’ Λ, Pillai trace, Hotelling-Lawley trace, Roy’s maximum root combine these into a single number
National Academies of Sciences, March 2005 32 c Michael Friendly Visualizing Multivariate Uncertainty heplots
HE plots: Visualization for Multivariate Linear Models
HE plot: for two response variables, (y1, y2), plot a H ellipse and E ellipse
HE plot matrices: For all p responses, plot an HE scatterplot matrix
→ Shows: size, dimensionality, and effect-correlation of H relative to E . Individual group scatter Between and Within Scatter Y2 Y2 40 40
Deviations of group means from How big is H relative to E? Scatter around group means grand mean (outer) and pooled represented by each ellipse within-group (inner) ellipses.
30 30 H matrix
8
20 4 7 20
6 E matrix 3
5 10 1 10
2
(a) (b) 0 0 0 102030400 10203040 Y1 Y1 Essential ideas behind multivariate tests: (a) Data ellipses; (b) H and E ellipses
National Academies of Sciences, March 2005 33 c Michael Friendly Visualizing Multivariate Uncertainty heplots
Simple example: Iris data
80 80
70 70 Virginica Virginica
Versicolor 60 60 Versicolor Sepal length in mm. Sepal length Sepal length in mm. Setosa 50 50 Setosa
40 Matrix Hypothesis Error 40 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Petal length in mm. Petal length in mm. (a) Data ellipses and (b) H and E ellipses
H ellipse: Shows 2D covariation of predicted values (means) E ellipse: Shows 2D covariation of residuals points: show group means on both variables
National Academies of Sciences, March 2005 34 c Michael Friendly Visualizing Multivariate Uncertainty he-bb
Baseball data: Variation by position
How do relations among variables vary with player’s position? Fit MANOVA model, (logSal Years Homer Runs Hits RBI Atbat Walks Putouts Assists Errors) = Position HE plots for selected pairs: (Years, logSal), (Putouts, Assists)
Model: logSal Years ... Errors = Position Model: logSal Years ... Errors = Position 3.2 300 IF
3.0
200
2.8
1B DH 2.6 100 UT OF IF C Assists log Salary UT C 2.4 OF 0 DH
2.2
Matrix Hypothesis Error Matrix Hypothesis Error 2.0 -100 0 5 10 15 20 0 100 200 300 400 500 600 Years in the Major Leagues Put Outs
National Academies of Sciences, March 2005 35 c Michael Friendly Visualizing Multivariate Uncertainty he-bb
Model: logSal Years ... Errors = Position 3.2
3.0
2.8
1B DH 2.6 OF IF C log Salary UT
2.4
2.2
Matrix Hypothesis Error 2.0 0 5 10 15 20 Years in the Major Leagues
National Academies of Sciences, March 2005 36 c Michael Friendly Visualizing Multivariate Uncertainty he-bb
Model: logSal Years ... Errors = Position 300 IF
200
100 UT Assists
C
OF 0 DH
Matrix Hypothesis Error -100 0 100 200 300 400 500 600 Put Outs
National Academies of Sciences, March 2005 37 c Michael Friendly Visualizing Multivariate Uncertainty he-guerry
Guerry data: Predicting crime
How do rates of crime vary with other variables? Fit MANCOVA model, (Crime_pers Crime_prop) = Region + Wealth + Suicides + Literacy + Donations + Infants HE plots: Overall, plus for Region and covariate effects
Region effect Covariates: Wealth Suicides Literacy Donations Infants Creuse Creuse 20000 20000 Haute-Loire Haute-Loire
Ain Ain 15000 15000 Correze Correze Suicides Wealth Donation 10000 C Region 10000 Infants LIteracy S . E W .
N 5000 5000 Pop. per Crime against propertyPop. per Crime against propertyPop. per Crime against propertyPop. per Crime against propertyPop. per Crime against propertyPop. per Crime against property Pop. per Crime against property against property Pop. per CrimePop. per Crime
0 Seine 0 Seine 0 10000 20000 30000 40000 0 10000 20000 30000 40000 Pop. per Crime against persons Pop. per Crime against persons
National Academies of Sciences, March 2005 38 c Michael Friendly Visualizing Multivariate Uncertainty he-guerry
Region effect Creuse 20000 Haute-Loire
Ain 15000 Correze
10000 C Region S . E W
N 5000 Pop. per Crime against property property Crime against Crime against Pop. perPop. per
0 Seine 0 10000 20000 30000 40000 Pop. per Crime against persons
Overall: Predicted crimes against persons and property are negatively correlated Larger variation in crimes against property Region variation greater in crimes against persons
National Academies of Sciences, March 2005 39 c Michael Friendly Visualizing Multivariate Uncertainty he-guerry
Covariates: Wealth Suicides Literacy Donations Infants Creuse 20000 Haute-Loire
Ain 15000 Correze Suicides Wealth Donation 10000 Infants LIteracy .
5000 Pop. per Crime against property property property property property property Crime against Crime against Crime against Crime against Crime against Crime against Pop. perPop. perPop. perPop. perPop. perPop. per
0 Seine 0 10000 20000 30000 40000 Pop. per Crime against persons
Each quantitative variable (covariate) plots as a 1D ellipse (vector) Orientation: relation of xi to y1, y2 Length: strength of relation
National Academies of Sciences, March 2005 40 c Michael Friendly Visualizing Multivariate Uncertainty multimap
Multivariate mapping: Map-centric displays
How to generalize choropleth maps to many variables? Star maps: Show multivariate data on the map using star icons, variable ∼ length of ray Reduced-rank RGB displays: Factor analysis → (F1, F2, F3) factor scores → (R, G, B) shading PREFMAP (x, y) maps: Fit data variables to (Long, Lat) map coordinates. Display variables as vectors in map coordinates.
National Academies of Sciences, March 2005 41 c Michael Friendly Visualizing Multivariate Uncertainty multimap
Star maps
Star map of Guerry Variables (Ranks) Crime_prop Infants
Instruction Suicides
Donations Crime_pers
Departments colored by Region
National Academies of Sciences, March 2005 42 c Michael Friendly Visualizing Multivariate Uncertainty multimap
Star maps: Medians by region
Median Ranks by Region Crime_prop Infants Instruction Suicides Donations Crime_pers
National Academies of Sciences, March 2005 43 c Michael Friendly Visualizing Multivariate Uncertainty multimap
Star maps: Multivariate boxplots by region
Crime_prop Infants
Instruction Suicides
Donations Crime_pers
stars for Q1, Median, Q3 How to show unusual depts?
National Academies of Sciences, March 2005 44 c Michael Friendly Visualizing Multivariate Uncertainty gfactor
Reduced-rank color-coded displays
Use dimension-reduction technique (PCA, Factor analysis, ...) to produce scores for observations (departments) on 3 dimensions (F1,F2,F3) Factor1 Factor2 Factor3 Variable Civil society Moral values Crime Pop per Crime against persons 0 97 Pop per Crime against property 0 75 0 39 Percent Read & Write -0 72 Pop per illegitimate birth 0 62 0 42 Donations to the poor 0 89 Pop per suicide 0 80
Scale (F1,F2,F3) → [0,1]
Color mapping function, e.g., C(F1,F2,F3) → rgb(Fi,Fj,Fk)
National Academies of Sciences, March 2005 45 c Michael Friendly Visualizing Multivariate Uncertainty gfactor
Reduced-rank color-coded displays
RGB 3-factor map: R=f1, G=f2, B=f3 Variables: Crime_pers Crime_prop Literacy Infants Donations Suicides
62 59
80
76 2 8 60 57 14 27 50 51 55 75 54 67 78 61 77 22 29 28 10 88 35 53 52 72 68 56 45 89 70 41 21 49 44 25 37 58 18 39 85 36 71 79 86 3 1 23 69 87 42 17 16 63
19 38 43 24 15 33 7 26 5 46 48 47 12 4 82 30 84 40 32 81 F1 34 13 83 64 31 65 11 9 66
200
F3 F2
National Academies of Sciences, March 2005 46 c Michael Friendly Visualizing Multivariate Uncertainty summary
Summary and future directions
Guerry’s challenge: Understanding uncertainty in multivariate, spatial data How visualize and understand relations among many variables? How to relate these to geographic information? Understanding multivariate variation Visual summaries (data ellipses, smoothings) can show statistical relations more clearly and effectively Reduced-rank visualization methods can show simpler, approximate views, based on several criteria. Multivariate statistical models need their own visualization methods, just beginning – HE plots as an example. Understanding multivariate, spatial variation Multivariate visualizations applied to spatial data can be revealing, but still need work Statistical methods for spatial data need to be extended to the multivariate setting.
National Academies of Sciences, March 2005 47 c Michael Friendly Visualizing Multivariate Uncertainty
References
Friendly, M. (1991). SAS System for Statistical Graphics. Cary, NC: SAS Institute, 1st edn.
Friendly, M. (2002). Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56(4), 316–324.
Friendly, M. (2004a). Graphical methods for the multivariate general linear model. Journal of Computational and Graphical Statistics. (submitted 2/17/04; resubmitted: 10/02/04).
Friendly, M. (2004b). Milestones in the history of data visualization: A case study in statistical historiography. In W. Gaul and C. Weihs, eds., Studies in Classification, Data Analysis, and Knowledge Organization. New York: Springer. (In press).
Friendly, M. and Kwan, E. (2003). Effect ordering for data displays. Computational Statistics and Data Analysis, 43(4), 509–539.
Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal components analysis. Biometrics, 58(3), 453–467.
Guerry, A.-M. (1833). Essai sur la statistique morale de la France. Paris. English translation: Hugh P. Whitt and Victor W. Reinking, Lewiston, N.Y. : Edwin Mellen Press, c2002.
Pearson, K. (1920). Notes on the history of correlation. Biometrika, 13(1), 25–45.
National Academies of Sciences, March 2005 48 c Michael Friendly