Design principles for presentation: Translating into

Andrew Eppig, UC Berkeley
CAIR Annual Conference
8 November 2012

Overview
• Design Principles
• Data and Summary
• Why Matter
• Visualization Process
• Selection
• Layout
• Aesthetics
• Self-Sufficiency Check
• Institutional Examples and Applications

CAIR 2012 2 Andrew Eppig, 8 November 2012

Guiding Design Principles
• Good visualizations start with good data and detailed analysis
  • Know your data
• Good visualizations directly answer specific, focused questions
  • Know what question(s) you are asking
• Good visualizations get out of the way of the data
  • Let the data tell its story without excess clutter or distraction

“Too often we pay more attention to ‘pretty’ than to the most important element: .” -- Dona Wong, The Secrets of Graphics Presentation

Data Sample, Summary, and Metrics

Name             Weight (lb.)   Height (in.)   BMI
Batman           210            74             27.0
Michael Phelps   165            75             20.6
Wonder Woman     130            72             17.6
Hope Solo        140            69             20.7

BMI = 703 x weight[lb] / height[in]²

Gender   Group      N       BMI Mean   BMI Std. Dev.
Male     Comics     1,239   26.0       4.2
Male     Athletes   403     23.8       3.5
Male     Models     493     21.6       2.3
Female   Comics     505     20.3       3.4
Female   Athletes   254     22.0       3.4
Female   Models     489     18.2       2.7
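As a quick check of the metric, the BMI formula can be applied to the four sample rows. This is a minimal sketch; the function and variable names are illustrative and not part of the presentation.

```python
def bmi(weight_lb: float, height_in: float) -> float:
    """Body Mass Index from weight in pounds and height in inches."""
    return 703 * weight_lb / height_in ** 2

# (name, weight in lb., height in in.) taken from the data sample slide.
sample = [
    ("Batman", 210, 74),
    ("Michael Phelps", 165, 75),
    ("Wonder Woman", 130, 72),
    ("Hope Solo", 140, 69),
]

for name, weight, height in sample:
    print(f"{name}: BMI = {bmi(weight, height):.1f}")
```

Rounded to one decimal place, the results should agree with the BMI column of the sample table.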

Default Excel Charting

[Chart: Average BMI by Group and Gender; series: Female, Male; groups: Models, Athletes, Comics; y-axis: 0.000 to 35.000]

Excessive Chart Junk

[Chart: Average BMI by Group and Gender; series: Models, Athletes, Comics; groups: Female, Male; y-axis: BMI, 17.000 to 27.000; data labels: 26.000, 23.800, 22.000, 21.600, 20.300, 18.200]

Improved Excel Charting

[Chart: Average BMI by Group and Gender]
Male: Comics 26, Athletes 24, Models 22
Female: Comics 20, Athletes 22, Models 18

Visualization Checklist
• What stories does the data tell? Which story do you want to tell?
• What visualization will best aid the story?
• Who is the audience?
• What metric should you use?
• Which type of chart should you use?
• What is the layout of the visualization?
• How can details enhance the chart?
• Font, color, lines/shading, and text

Source: Kaiser Fung, Junk Charts

“What are the content-reasoning tasks that this display is supposed to help with?”
-- , Beautiful Evidence

Chart Selection

[Infographic: “Super Humans - or - Super Models?”, Body Mass Index distributions for superheroes, athletes, and models, by gender]

“Meaningful quantitative information always involves relationships. When displayed in graphs, these relationships always boil down to one or more of eight specific relationships: time series, ranking, part-to-whole, deviation, distribution, correlation, geospatial, nominal comparison.”
-- Stephen Few, Designing Effective Tables and Graphs

Visualization Layout

[Infographic: “Super Humans - or - Super Models?”]

“Tufte’s (1990) recommendation of ‘small multiples’ [...] uses the replication in the display to facilitate comparison to the implicit model of no change between the displays.”
-- Andrew Gelman, Exploratory Data Analysis for Complex Models

Aesthetic Considerations
• Font: How do you increase legibility and decrease distraction?
• Color: Which color palette is appropriate?
• Line/Shading: Which weight, color, and style will enhance the final product?
• Text: Can adding labels and narrative provide useful context?

Source: Chemical Color Plate Corp.

“Hue contrast is easy to overuse to the point of visual clutter. A better approach is to use a few high chroma colors as color contrast in a presentation consisting primarily of grays and muted colors.”
-- Maureen Stone, Choosing Colors for

Final

Superheroes: Super Humans - or - Super Models?
Analysis, Writing, Art, and Lettering by: Andrew Eppig

Analysis Question: Are comic book superheroes’ bodies more like top athletes’ or top models’ bodies? Are comparisons the same for both men and women? To account for differences in height and weight, Body Mass Index (BMI) is used: BMI = 703 x weight / height². Comparing the BMI distributions will reveal similarities or differences.

Body Mass Index Distributions
[Six panels; x-axis: Body Mass Index, 15 to 35; y-axis: Probability. The blue shaded area shows the middle 50% of the superhero BMI distributions.]

Group                N       Share of distribution inside shaded area   Median BMI
Male Superheroes     1,239   50%                                         25.2
Female Superheroes   505     50%                                         19.7
Male Athletes        403     39%                                         23.2
Female Athletes      254     31%                                         21.4
Male Models          493     17%                                         21.6
Female Models        489     28%                                         17.7

Summary: Male superheroes tend to have a higher BMI than top male athletes and a much higher BMI than top male models. So male superheroes are beyond super human - but more like top male athletes than like top male models. Female superheroes tend to have a lower BMI than top female athletes and a higher BMI than top female models. So female superheroes are neither super humans nor super models but between top female athletes and top female models. Female superheroes also have less variation in their BMI than male superheroes or top athletes of either gender.

Sources: DC Comics (dc.wikia.com); Marvel Comics (Marvel.com/universe); 2008 US Olympic Team (www.2008.nbcolympics.com); Models.com (models.com)
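One plausible reading of the "inside shaded area" figures is that each comparison group is measured against the interquartile range of the matching superhero distribution. The sketch below follows that reading. The quartile method, the function names, and the BMI values are all assumptions made for illustration, not the presentation's data or code.

```python
from statistics import quantiles

def middle_half(values):
    """Return (q1, q3), the bounds of the middle 50% of the values."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3

def share_inside(values, low, high):
    """Fraction of values that fall inside the closed interval [low, high]."""
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)

# Made-up BMI values for illustration only.
reference = [22.0, 23.5, 24.8, 25.2, 26.1, 27.0, 29.3]   # stands in for superheroes
comparison = [20.5, 21.9, 23.0, 23.6, 24.9, 25.5]        # stands in for athletes or models

low, high = middle_half(reference)
print(f"shaded band: {low:.2f} to {high:.2f}")
print(f"share of comparison group inside band: {share_inside(comparison, low, high):.0%}")
```

By construction the reference group has about half of its own values inside the band, which is consistent with the 50% shown in both superhero panels.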

Self-Sufficiency Test

[Infographic panels with numbers and annotations removed: Male Superheroes, Female Superheroes, Male Athletes, Female Athletes, Male Models, Female Models; x-axis: Body Mass Index; y-axis: Probability]

“Can the graphical elements stand on their own feet? If one removes the numbers from the graphic, can one still understand the key messages?”
-- Kaiser Fung, Junk Charts

Chart Function and Selection

Analytical Relationship   Highlighted Feature
Time Series               Changes over time
Ranking                   Relative position
Part-to-Whole             Fraction of whole
Deviation                 Differences between sets
Distribution              Range and frequency
Correlation               Relationship between sets
Geospatial                Location
Nominal Comparison        Group values

Source: Stephen Few, Show Me the Numbers (2nd edition)

Line Charts - Time Series

✘ x-axis has too many labels and labels are slanted
✘ y-axis scale is too large
✘ y-axis gridlines are too heavy

[Chart: UC Berkeley Undergraduate Fall Enrollment, 1983-2012; y-axis: Headcount, 20,000 to 26,000; x-axis: 1980 to 2010 in 5-year steps]
✓ x-axis labels are horizontal and legibly spaced
✓ data range is roughly 2/3 of the y-axis scale
✓ y-axis gridlines are light in color and weight
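The "data range is roughly 2/3 of the y-axis scale" guideline can be turned into a small calculation. This sketch assumes the spare room is split evenly above and below the data, which is a choice made here rather than something the slide specifies. The function name and headcounts are hypothetical.

```python
def y_axis_limits(values, data_share=2 / 3):
    """Return (low, high) so the data range fills about `data_share` of the axis."""
    lo, hi = min(values), max(values)
    axis_span = (hi - lo) / data_share
    padding = (axis_span - (hi - lo)) / 2  # split the spare room evenly
    return lo - padding, hi + padding

# Made-up headcounts for illustration only.
low, high = y_axis_limits([22_000, 23_500, 26_000])
print(f"y-axis from {low:,.0f} to {high:,.0f}")
```

In practice the limits would then be rounded outward to the nearest major increment so the axis labels stay clean.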

Source: UC Berkeley, Cal Answers

Line Charts - Distribution

✘ y-axis labels have extraneous decimal points
✘ x-axis labels use an unintuitive interval
✘ axis labels are too heavy

[Chart: Undergraduate Years to Degree, UC Berkeley Fall 2004 Cohort; y-axis: Headcount, 0 to 2,500; x-axis: Years to Graduation, 1 to 7]
✓ y-axis labels are rounded to the nearest major increment
✓ axis labels use natural intervals
  1, 2, 3...
  2, 4, 6...
  10, 20, 30...
✓ x-axis labels are light in size and weight

Source: UC Berkeley, Cal Answers

Bar Charts - Ranking

✘ x-axis labels are slanted and small
✘ units are ranked alphabetically

[Chart: Undergraduate Major Headcount by Division, UC Berkeley Fall 2012]
L&S-Social Sciences            3,757
College of Engineering         3,158
College of Natural Resources   1,774
L&S-Arts & Humanities          1,731
L&S-Undergraduate              1,464
L&S-Bio Sciences               1,151
L&S-Administered Programs      1,101
L&S-Math & Phys. Sciences      977
College of Chemistry           846
Haas School of Business        680
College of Env. Design         579
Other EVCP Programs            6

✓ labels are horizontal and easy to read
✓ units are usefully ranked by descending value

Source: UC Berkeley, Cal Answers

Bar Charts - Part-to-Whole

✘ values are hard to determine
✘ gaps between bars are too large

[Chart: Undergraduate Demographic Shares by Race/Ethnicity, UC Berkeley Fall 2012]
Asian                            39%
White                            29%
Chicano/Latino                   13%
International                    10%
Decline to State                 5%
African American                 3%
Native American/Alaskan Native   1%
Pacific Islander                 0%

✓ values are easy to read
✓ gaps between bars are 25-50% of the bar width

Source: UC Berkeley, Cal Answers

Bar Charts -

✘ y-axis gridlines are too dense
✘ data labels distract, add clutter

[Chart: UC Berkeley Undergraduate Cumulative GPA, Spring 2012; y-axis: Count, 0 to 2,500; x-axis: Cumulative GPA, 0.0 to 4.0]
✓ y-axis gridlines are only on the major increments
✓ shape of the distribution is easily seen without distraction

Source: UC Berkeley, Cal Answers

Bar Charts - Time Series

✘ time variable is shown from right to left
✘ y-axis is used for time variable
✘ multiple hues are used for the same kind of data

[Chart: UC Berkeley International Undergraduate Fall Enrollment, 1990-2012; y-axis: Headcount, 0 to 3,000; x-axis: 1990 to 2010]
✓ years increase from left to right
✓ years are plotted on x-axis
✓ bars are colored using shades of a single hue

Source: UC Berkeley, Cal Answers

Dot Plots - Deviation

✘ points are too small
✘ group colors are too similar

[Chart: UC Berkeley New Freshmen 6-Year Graduation Rates by Race/Ethnicity, Fall 2004 Cohort; series: Men, Women; categories: Asian, Other/Decline to State, White, Pacific Islander, Native American/Alaskan Native, Chicano/Latino, International, African American; x-axis: 6-Year Graduation Rate, 60% to 100%]
✓ dots’ size makes them easy to see
✓ group colors are complementary

Source: UC Berkeley, Cal Answers

Scatter Plots - Correlation

✘ too many bright hues are used to mark groups
✘ chart length distorts the correlation of the data

[Chart: Impact of Critical Mass on Respect Rates by Race/Ethnicity and by UC Campus, 2007-2008 AY; y-axis: Respect Rate*, 40% to 100%; x-axis: 0% to 50%; groups: White, Asian/Pacific Islander, Chicano/Latino, African American]
✓ groups are marked using muted hues
✓ chart dimensions are close to the golden ratio (1:1.618)
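The golden-ratio guideline reduces to one multiplication. The sketch below assumes that 1:1.618 is read as height:width, so the chart is wider than it is tall. The function name and the example height are illustrative.

```python
GOLDEN_RATIO = (1 + 5 ** 0.5) / 2  # about 1.618

def golden_size(height: float) -> tuple[float, float]:
    """Return (width, height) with width / height equal to the golden ratio."""
    return height * GOLDEN_RATIO, height

width, height = golden_size(4.0)  # 4.0 is an arbitrary example height
print(f"{width:.2f} wide by {height:.2f} tall")
```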

x-axis: Share of Race/Ethnicity Among New Students on Campus

* Respect Rate = percentage of students of a given race/ethnicity who responded strongly agree, agree, or somewhat agree to the prompt “students of my race/ethnicity are respected at this campus” on UCUES.

Source: UC Accountability Report, 2011

Pie Charts - Part-to-Whole

✘ more than five groups are shown
✘ slices are blown out of the pie
✘ smallest slices are given too prominent location

[Chart: UC Berkeley College of Letters & Sciences Enrollment Shares, Fall 2012; slices: Social Sciences (37%), Other L&S Divisions (25%), Arts & Humanities (17%), Biological Sciences and Math & Physical Sciences (11% and 10%)]
✓ only five groups are shown with the rest aggregated together
✓ center of the pie is shown making angles visible
✓ groups are ordered usefully with the largest slices at the top

Source: UC Berkeley, Cal Answers

Visualization Layout: Attention Areas

High Visual Focus Medium Visual Focus

Good for primary content Good for secondary content

Medium Visual Focus Low Visual Focus

Good for secondary content Good for tertiary content

Visualization Aesthetics: Color

[Color palette examples: Bright (high chroma) and Muted]
Source: Dona Wong, The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures

✘ avoid alternating high contrast hues
✘ avoid using more than one high chroma hue
✓ use a palette mostly of grays and muted hues
✓ choose a few high chroma colors for contrast
✓ use shades and tints to ensure that a black-and-white copy will still be coherent
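One way to test the black-and-white guideline is to convert each palette color to an approximate gray level and look at the smallest gap between neighbors. This is a rough sketch: the Rec. 601 luma weights are a common approximation rather than a model of any particular printer or copier, and the function names and palettes are hypothetical.

```python
def gray_level(hex_color: str) -> float:
    """Approximate gray level (0 = black, 255 = white) using Rec. 601 luma weights."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b

def min_gray_gap(palette):
    """Smallest difference in gray level between any two neighboring colors."""
    levels = sorted(gray_level(color) for color in palette)
    return min(b - a for a, b in zip(levels, levels[1:]))

# Hypothetical palettes for illustration only.
single_hue_shades = ["#08306b", "#4292c6", "#c6dbef"]
bright_hues = ["#ff0000", "#0000ff"]

print(f"single-hue shades, smallest gap: {min_gray_gap(single_hue_shades):.0f}")
print(f"bright hues, smallest gap: {min_gray_gap(bright_hues):.0f}")
```

A small gap suggests that two colors will be hard to tell apart once the chart is printed or copied in grayscale.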

Visualization Aesthetics: Font

[Chart, shown in two font treatments: UC Berkeley New Freshmen Yield Rates by Race/Ethnicity, Fall 2010 Cohort; series: Women, Men; categories: Asian, Other/Decline to State, African American, Pacific Islander, Chicano/Latino, White, Native American/Alaskan Native, International; x-axis: Yield Rate, 20% to 50%]

✘ bold and condensed fonts confuse the viewer
✘ multiplicity of fonts deters legibility
✓ font choice, weight, and spacing aid clarity
✓ single font used for labels -- second font only used for the title

Source: UC Berkeley, Cal Answers

Visualization Aesthetics: Lines/Shading

✘ more than four groups are identified in one chart
✘ weight of lines blurs trend details
✘ label position makes identification hard

[Chart: UC New Fall Undergraduate Enrollment by Campus, 1995-2011; y-axis: Headcount, 0 to 7,000; x-axis: 1995 to 2010; lines: UC Berkeley, Other UC Campuses]
✓ only two groups are identified
✓ line weights are used for emphasis
✓ lines are directly labeled

Source: UC Accountability Report, 2011

Visualization Aesthetics: Labels/Text

[Chart: UC Berkeley Undergraduate New Enrollment Shares by Gender and Race/Ethnicity, 1983-2012; four panels: Share White, Share Asian/Pacific Islander, Share Underrepresented Minority, Share International; lines: Women, Men; y-axis: 0% to 30%; x-axis: 1980 to 2010; annotation: Prop 209 enacted]

Prop 209 banned affirmative action in 1997, precipitating a sharp decline in underrepresented minority (URM) student shares, which have yet to recover. The overall gender gap, with women outnumbering men, is driven by Asian and URM students, where the gender gaps are largest.

Source: UC Berkeley, Cal Answers

Summary
• Know what question you are asking a visualization to answer
• Choose the best metric for your analysis and your audience
• Choose your chart to fit your question rather than your question to fit your chart
• Let the data tell its story without excess clutter or distraction
• Keep the focus of the visualization on the data
• Make sure every use of font, color, shading, and text enhances rather than distracts
• Provide narrative to contextualize the highlights of the data

Contact Information

Please feel free to contact me with questions or comments.

Andrew Eppig
Analyst
Equity & Inclusion
UC Berkeley

104 California Hall #1500
Berkeley, CA 94720-1500
[email protected]

Web Resources

• Junk Charts -- Kaiser Fung
  • http://junkcharts.typepad.com
• Flowing Data -- Nathan Yau
  • http://flowingdata.com/
• Charts ‘n’ Things -- NY Times Graphics Department
  • http://chartsnthings.tumblr.com/
• Perceptual Edge -- Stephen Few
  • http://www.perceptualedge.com

Print Resources

Edward Tufte
• The Visual Display of Quantitative Information, 1983, Cheshire, CT: Graphics Press
• Visual Explanations: Images and Quantities, Evidence and Narrative, 1997, Cheshire, CT: Graphics Press

William Cleveland
• The Elements of Graphing Data, 1994, revised ed., Murray Hill, NJ: AT&T Bell Laboratories

Dona Wong
• The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures, 2010, : W.W. Norton and Co.

Stephen Few
• Information Dashboard Design: The Effective of Data, 2006, Oakland, CA: Press
• Now You See It: Simple Visualization Techniques for Quantitative Analysis, 2009, Oakland, CA: Analytics Press
• Show Me the Numbers: Designing Tables and Graphs to Enlighten, 2012, second ed., Oakland, CA: Analytics Press

Appendices

Classic Charts

Charles Minard's 1869 chart showing the number of men in Napoleon’s 1812 Russian campaign army, their movements, as well as the temperature they encountered on the return path. Lithograph, 62 x 30 cm

Classic Charts

Detail from John Snow's spot of the Golden Square outbreak [1854 London cholera outbreak] showing area enclosed within the Voronoi network . Snow's original dotted line to denote equidistance between the Broad Street pump and the nearest alternative pump for procuring water has been replaced by a solid line for legibility. Fold lines and tear in original (adapted from CIC, between 106 and 07).

Bad Chart Examples

[Figure caption: The 1978 dollar should be twice as big as shown.]

The problem:
• The 1978 dollar should be roughly half as big as the 1958 dollar ($0.44 vs $1.00) instead of roughly one quarter as big

How the problem occurred:
• The chart uses 2-D graphics (i.e., representations of dollar bills with length and width), and both the length and the width were scaled by 1/2 -- resulting in the area being scaled by 1/4 (1/2 x 1/2)

The fix:
• When dealing with 2-D area representations (never use 3-D), remember to scale the area rather than scaling each dimension separately
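The fix amounts to scaling each side by the square root of the value ratio, so that the area, not each dimension, tracks the value. This is a minimal sketch: the function name and the base size of the bill are made up for illustration, and only the $0.44 and $1.00 values come from the slide.

```python
from math import sqrt

def scaled_sides(width, height, value, base_value):
    """Scale both sides by the same factor so that AREA tracks value / base_value."""
    factor = sqrt(value / base_value)
    return width * factor, height * factor

# 6.0 x 2.5 is a made-up base size for the 1958 dollar.
w, h = scaled_sides(6.0, 2.5, value=0.44, base_value=1.00)
print(f"side scale factor: {w / 6.0:.3f}")
print(f"area ratio: {(w * h) / (6.0 * 2.5):.2f}")
```

The same square-root rule applies to the radius of a circle or pie when its area is meant to carry the value.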

Source: Tufte, 1983

Bad Chart Examples

The problem:
• The message (growth of medical spending in emerging markets) is obfuscated and exaggerated

How the problem occurred:
• The chart uses too many bold colors, which creates visual confusion
• The chart uses pie charts for each year, which makes it hard to see trends
• The chart scales the pie charts incorrectly by scaling only the radius as opposed to the area, which distorts the changes

The fix:
• When dealing with trend data, time series using line charts are the best choice

Source: “Expanding Circles of Error”, Junk Charts

via Visualization

[Figure: Historical Example. John Arbuthnot’s 1710 analysis of London Bills of Mortality did not utilize graphical methods. Wainer’s plot of the data is a revelation. Source: Howard Wainer, 2009.]

Howard Wainer’s visualization of John Arbuthnot’s 1710 analysis of London Bills of Mortality not only depicts historical incidents, it also provides a check for . The 1704 spike is not associated with any historical incident. A check of the data reveals a transcription error by Arbuthnot where the 1674 data point was mistakenly labeled as 1704.

Source: Wainer, 2009

Infographic Creation Details

Steps

• Source identification
•
• Data analysis

Infographic Source Identification

• Super heroes and villains: DC and Marvel
  • http://dc.wikia.com/
  • http://marvel.com/universe/Main_Page
• Top athletes: 2008 US Olympic Team
  • http://www.2008.nbcolympics.com/athletes/index.html
• Top models: models.com listings
  • http://models.com/

Infographic Data Collection

• Create Python web scraper
• Crawl web sites
• Download web pages
• Extract height, weight, and gender data
• Save data to file
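The extraction step might look something like the sketch below. It is not the scraper used for the infographic. The page fragment, field labels, and function name are hypothetical, and the crawl and download steps are left out so that the example runs offline.

```python
import re

# Hypothetical page fragment; the real source pages are laid out differently.
SAMPLE_HTML = """
<ul>
  <li>Gender: Male</li>
  <li>Height: 6' 2"</li>
  <li>Weight: 210 lbs</li>
</ul>
"""

HEIGHT_RE = re.compile(r"Height:\s*(\d+)'\s*(\d+)")
WEIGHT_RE = re.compile(r"Weight:\s*(\d+)\s*lbs")
GENDER_RE = re.compile(r"Gender:\s*(Male|Female)")

def extract_record(html):
    """Return gender, height (in.), and weight (lb.), or None if a field is missing."""
    height = HEIGHT_RE.search(html)
    weight = WEIGHT_RE.search(html)
    gender = GENDER_RE.search(html)
    if not (height and weight and gender):
        return None
    feet, inches = int(height.group(1)), int(height.group(2))
    return {
        "gender": gender.group(1).lower(),
        "height_in": feet * 12 + inches,
        "weight_lb": int(weight.group(1)),
    }

record = extract_record(SAMPLE_HTML)
print(record)
```

Returning None for incomplete pages keeps bad records out of the saved file and leaves them easy to count during the quality check.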

Infographic Data Scrubbing

• Check data quality
  • Did extraction get correct height and weight?
  • Are there duplicate entries?
• Remove super hero and super villain
  • Define height window based on athlete and model data
  • Define weight window based on athlete and model data
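The scrubbing steps can be sketched as a duplicate check followed by a window filter. The slides do not say how the height and weight windows were defined, so this sketch assumes the simplest option, the minimum and maximum seen in the athlete and model data. The records and names are made up.

```python
def window(values):
    """Return the (lowest, highest) value seen in the reference data."""
    return min(values), max(values)

def scrub(heroes, reference):
    """Drop duplicate names and records outside the reference height/weight windows."""
    h_lo, h_hi = window([r["height_in"] for r in reference])
    w_lo, w_hi = window([r["weight_lb"] for r in reference])
    seen, kept = set(), []
    for rec in heroes:
        if rec["name"] in seen:
            continue  # duplicate entry
        seen.add(rec["name"])
        if h_lo <= rec["height_in"] <= h_hi and w_lo <= rec["weight_lb"] <= w_hi:
            kept.append(rec)
    return kept

# Made-up records for illustration only.
reference = [
    {"name": "Athlete A", "height_in": 62, "weight_lb": 105},
    {"name": "Model B", "height_in": 70, "weight_lb": 120},
    {"name": "Athlete C", "height_in": 80, "weight_lb": 290},
]
heroes = [
    {"name": "Hero X", "height_in": 74, "weight_lb": 210},
    {"name": "Hero X", "height_in": 74, "weight_lb": 210},     # duplicate
    {"name": "Giant Y", "height_in": 300, "weight_lb": 4000},  # outside both windows
]

print([rec["name"] for rec in scrub(heroes, reference)])
```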

Infographic Data Analysis

• Combine all data in
  • Super heroes and villains, athletes, and models
• Create dummy variables
  • Gender: male, female
  • Source: super hero/villain, athlete, model
• Calculate BMI for each record
• Check summary
  • Data ranges, mean,
• Run t-tests between groups
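The slides do not say which t-test variant was run. As one possibility, a Welch (unequal-variance) t statistic can be computed directly from the group summary statistics. The function name is illustrative, and the inputs below are the male Comics and male Athletes rows of the BMI summary table.

```python
from math import sqrt

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic for two groups described by mean, std. dev., and N."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Male Comics (N = 1,239, mean 26.0, SD 4.2) vs. male Athletes (N = 403, mean 23.8, SD 3.5).
t = welch_t(26.0, 4.2, 1239, 23.8, 3.5, 403)
print(f"t = {t:.2f}")
```

A full test would also need the degrees of freedom and a p-value, which a statistics package would normally supply from the record-level data.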

Height Distributions
[Six panels; x-axis: height (ft.), 5’0” to 7’0”; y-axis: Probability]

Group                N       Share of distribution inside shaded area   Median height
Male Superheroes     1,289   50%                                         6’0”
Female Superheroes   1,289   50%                                         5’7”
Male Athletes        403     35%                                         6’1”
Female Athletes      254     42%                                         5’8”
Male Models          493     66%                                         6’0”
Female Models        479     59%                                         5’8”

Weight Distributions
[Six panels; x-axis: weight (lbs.), 100 to 250; y-axis: Probability]

Group                N       Share of distribution inside shaded area   Median weight (lbs.)
Male Superheroes     1,289   50%                                         185
Female Superheroes   1,289   50%                                         128
Male Athletes        403     40%                                         175
Female Athletes      254     38%                                         140
Male Models          493     35%                                         160
Female Models        479     39%                                         115

With Revised Data
[Six panels; x-axis: Body Mass Index, 15 to 35; y-axis: Probability]

Group                N       Share of distribution inside shaded area   Median BMI
Male Superheroes     1,289   50%                                         25.4
Female Superheroes   1,289   50%                                         19.7
Male Athletes        3,631   47%                                         23.9
Female Athletes      2,610   32%                                         21.5
Male Models          493     17%                                         21.6
Female Models        479     28%                                         18.3