A Short History of Data Visualisation: A Single Variable
Eric J. Beh
School of Mathematics & Physical Sciences, University of Newcastle, Australia

Why Visualise Data?
If statistical graphics, although born just yesterday, extends its reach every day, it is because it replaces long tables of numbers and it allows one not only to embrace at a glance the series of phenomena, but also to signal the correspondence or anomalies, to find the causes, to identify the laws.
Émile Cheysson (1836 – 1919) (Cheysson, circa 1877)
In statistical graphics, information is contained in shapes and patterns. . . . Just as interpretation of a painting or drawing requires understanding of the artist's context, interpreting a graph requires an understanding of the statistical context that surrounds the graph. As in art, conclusions about a graph without understanding the context are likely to be wrong, or off the point at best.
Cook, R. D. and Weisberg, S. (1999), Graphs in Statistical Analysis: Is the medium the message?, The American Statistician, 53, 29 – 37.

Data Visualisation vs Data Modelling
Benzecri thought . . . “model should follow the data, not the inverse”
Michael Greenacre says in the introduction to the English translation of his 1978 PhD thesis:
“In one phrase, this “axiom” seemed to annihilate all the decades of statistical distribution theory and hypothesis testing . . . However, what is actually intended here . . . Look at your data first, try to understand your observations to the fullest degree possible before you start “tampering” with them and manipulating them to fit your own ideas”
Jean-Paul Benzecri is a retired French linguist and the “father of correspondence analysis”. Greenacre is a former student of his and is considered the “guru of correspondence analysis”.

Some Turn of the Century Texts
[Book covers: 1910; 1914 (reprint 1919), the first widely circulated book on statistical graphics; 1939; 2001]

First Documented Graphical Display?
This graph is considered the first known graphical display to summarise data.
It is thought to date back to the tenth or maybe eleventh century and is part of a manuscript discovered by Sigmund Gunther in 1877.
Funkhouser (1936)
The graph reflects the inclination of the orbits of the planets of our Solar System over time.
Vertical axis: the Latin names of the planets and the sun. Horizontal axis: the 30 zodiacal zones, representing time.

Napoleon's invasion of, and retreat from, Russia (June 1812)
[Figure: flow map with panels labelled "Invasion of Russia" and "Retreat from Russia"]
(French original appears in Marey, E. J. (1885), La Méthode Graphique, Paris, pp. 73, and in Tufte, 2001, pg. 40; English translation, 2002.)

A Single Variable
• Numerical
  – Histogram
  – Boxplot
  – Line Graph
• Categorical
  – Pie Chart
  – Pareto Distribution
  – Bar Chart
Numerical Data
Who developed the Histogram?

The Histogram?
Sir Karl Pearson (1857 – 1936)
. . . on the 20th of November 1891, he used the term in reference to a ‘time-diagram’ that appeared in a Gresham Lecture on ‘Maps and Chartograms’.
Pearson explained that the histogram could be used for historical purposes to create blocks of time, for “charts about reigns or sovereigns or periods of different prime ministers”.
So the etymology of histogram may have come from “historical graphs”.
Wikipedia says: sometimes it is said to be derived from the Ancient Greek ἱστός (histos) – "anything set upright" . . . and γράμμα (gramma) – "drawing, record, writing".
At the turn of the 20th century, when Pearson was laying the foundations for many of the statistical concepts we use today, he was also interested in the visualisation of data.
Pearson gave the first published histogram in
Pearson, K. (1895), Contributions to the mathematical theory of evolution II: Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London A, 186, 343 – 414.
He studied data from the 1887 Presidential address of Mr Goschen to the Royal Statistical Society. These data summarise the cost of house and shop properties in England and Wales in 1885 – 1886.
This is the first published histogram; it graphically summarises the house and shop property data Pearson considered.
We can see here that Pearson (1895, Plate 13, Figure 13) was also the first to use the term histogram, although there is some evidence that suggests he may have used the term as early as 1891 in a seminar.
• Much has been discussed on the choice of the width of the bins. Sturges first discussed the issue:
Sturges, H. (1926), “The Choice of a Class Interval,” Journal of the American Statistical Association, 21, 65 – 66.
• Debate has grown over the years about equal width bins vs equal area (“e-a”) bins. Equal bin area leads to “e-a histograms”; see, for example,
Denby, L. & Mallows, C. (2009), “Variations on the Histogram”, Journal of Computational and Graphical Statistics, 18, 21 – 31.

Living Histograms!!
Living histogram of 143 student heights at University of Connecticut.
The Hartford Courant (1996), "Reaching New Heights," November 23, 1996; photo by K. Hanley.

Living histogram of 175 male college students.
Blakeslee, A. F. (1914), "Corn and Men," The Journal of Heredity, 5, 511 - 518.
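Returning to the bin-width debate, the two choices can be compared with a small sketch. It assumes the usual statement of Sturges' rule (about 1 + log2(n) classes) and takes "equal area" to mean bin edges placed at sample quantiles, so that each bin holds roughly the same number of observations. The sample is made up, and the function names are hypothetical, not taken from Sturges or from Denby & Mallows.

```python
import math
import random
import statistics


def sturges_bin_count(n):
    """Number of classes from Sturges' rule, 1 + log2(n), rounded up."""
    return math.ceil(math.log2(n)) + 1


def equal_width_edges(data, k):
    """k + 1 edges that split the data range into k equally wide bins."""
    lo, hi = min(data), max(data)
    # The last edge is set to hi explicitly to avoid floating-point drift.
    return [lo + (hi - lo) * i / k for i in range(k)] + [hi]


def equal_area_edges(data, k):
    """k + 1 edges at sample quantiles, so each bin holds about n/k points."""
    cuts = statistics.quantiles(data, n=k, method="inclusive")
    return [min(data)] + cuts + [max(data)]


def bin_counts(data, edges):
    """Count points per bin; bins are [left, right), the last one is closed."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    return counts


# Made-up, right-skewed sample (illustration only, not data from the slides).
random.seed(1)
sample = [random.expovariate(1.0) for _ in range(143)]

k = sturges_bin_count(len(sample))
width_counts = bin_counts(sample, equal_width_edges(sample, k))
area_counts = bin_counts(sample, equal_area_edges(sample, k))
print("bins:", k)
print("equal-width counts:", width_counts)
print("equal-area counts: ", area_counts)
```

With skewed data the equal-width counts pile up in the first few bins, while the quantile-based bins keep the counts nearly constant and let the bin widths vary instead.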
Who developed the Boxplot?
Some credit it to John W. Tukey (1915 – 2000).
Tufte’s 2001 book is dedicated to the memory of Tukey.

Variations of the Boxplot
Tufte’s “Box”-plot
In an attempt to further simplify the boxplot, Tufte (2001, pp. 123 – 125) considered removing the box and median line, representing the median with a point and using differently weighted ‘whiskers’ to reflect the range and interquartile range of the data.
McGill, Tukey & Larsen (1978)
• Data: a single month's telephone bills for a group of Chicago residence customers.
• The data are grouped by the number of years the resident has lived in the city.
• However . . . not everything known is shown.
• One might conclude that the overall median for all groups combined is about $21.
• This is absolutely wrong. The actual overall median is about $14.
• What isn’t the plot showing? Frequencies
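The trap described above is easy to reproduce. The sketch below uses made-up bills (not the Chicago telephone data) in three groups of very different sizes: reading the "typical" value off the group medians overstates the pooled median, because the largest group also has the lowest bills.

```python
import statistics

# Hypothetical monthly bills in dollars, grouped by years lived in the city.
groups = {
    "0-1 years": [10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 18, 20],
    "2-5 years": [15, 19, 22, 26],
    "6+ years": [24, 30, 35],
}

# What side-by-side boxplots show: one median per group.
group_medians = {name: statistics.median(bills) for name, bills in groups.items()}

# A tempting but wrong summary: the median of the group medians.
median_of_medians = statistics.median(list(group_medians.values()))

# The correct overall median pools every observation first.
pooled = [bill for bills in groups.values() for bill in bills]
overall_median = statistics.median(pooled)

print(group_medians)
print("median of group medians:", median_of_medians)  # 20.5
print("overall (pooled) median:", overall_median)     # 15
```

The gap between the two numbers comes entirely from the group frequencies, which an ordinary boxplot does not display.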
Variable-Width Boxplot
McGill, Tukey & Larsen (1978)
• The information available but not displayed is the number of customers in the various groups.
• Here the width of each box has been made proportional to the square root of the number of customers in the corresponding group.
• The viewer's attention is immediately drawn to the size differences.
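The square-root rule for box widths can be sketched in a few lines. The group sizes and the `max_width` scaling below are assumptions made for illustration; only the proportionality to the square root of the group size comes from the slide.

```python
import math


def box_widths(group_sizes, max_width=0.8):
    """Box widths proportional to sqrt(n), scaled so the largest group gets max_width."""
    roots = {name: math.sqrt(n) for name, n in group_sizes.items()}
    largest = max(roots.values())
    return {name: max_width * root / largest for name, root in roots.items()}


# Hypothetical numbers of customers per group.
sizes = {"0-1 years": 100, "2-5 years": 25, "6+ years": 4}
widths = box_widths(sizes)
print(widths)
```

A group with a quarter of the customers gets a box half as wide, so size differences stay visible without the largest group dominating the plot. If plotting with matplotlib, these values could be passed to the `widths` argument of `Axes.boxplot`.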
Violin Boxplot
Hintze and Nelson (1998)
• Since the boxplot doesn’t give any real indication of the shape of the data (only of certain quartiles, which help with assessing centre and spread), one may consider the violin boxplot.
• Basically, it is the same thing as a boxplot but with the shape of the data reflected.
Vaseplot: Benjamini, Y. (1988), Opening the box of a boxplot, The American Statistician, 42, 257 - 262.
Notched boxplot: McGill, R., Tukey, J. W. & Larsen, W. A. (1978), Variations of box plots, The American Statistician, 32, 12 - 16.
The “Bivariate” Boxplot - The Rangefinder boxplot
Becketti & Gould (1987), Rangefinder box plots: A note, The American Statistician, 41, pg 149.
[Figure: a scatterplot of 1980 divorce rates against birth rates in the 49 states other than Nevada, annotated with the median of both variables and the whiskers]
The Rangefinder boxplot . . . problems
Lenth, R. V. (1988), Comment on Becketti and Gould, The American Statistician, 42, pg 87.
Concerns:
• No labels on axes
• No “boxes”
• Length of the whiskers (at the limits of the other variable) reflects the width of the IQR
Becketti & Gould rejoinder
The “Bivariate” Boxplot - The Bagplot
Rousseeuw, Ruts & Tukey (1999)
• The depth median is the deepest location, and it is surrounded by a "bag" containing the n/2 observations with largest depth.
• Magnifying the bag by a factor 3 yields the "fence". Observations between the bag and the fence are marked by a light gray loop, whereas observations outside the fence are flagged as outliers.
• The bagplot visualises the location, spread, correlation, skewness, and tails of the data.
The “Bivariate” Boxplot - The “Relplot” and “Quelplot”
Goldberg, K. M. and Iglewicz, B. (1992), Bivariate extensions of the boxplot, Technometrics, 34, 307 - 320.

So . . . who developed the Boxplot?

The Origins of the Boxplot
The Range Bar
Mary Eleanor Spear (1952), Charting Statistics, McGraw-Hill (page 166).
The “Rangebar” Chart
Haemer, K. W. (1948), Range-bar charts, The American Statistician, 2, 23.

Categorical Data
Who developed the pie chart?

The Pie Chart
The story of the pie chart begins with William Playfair (1759 – 1823). After proposing the line graph and bar chart in 1786, Playfair (in 1801) constructed the pie chart as a visual aid to compare the geographical size of each of the European regions, and the areas around the world they occupied.
Playfair, W. (1801), The Statistical Breviary: Shewing, on a Principle Entirely New, the Resources of Every State and Kingdom in Europe, T. Bensley.
The area of the circles allows for a comparison of the land area (in square miles) each region occupied, given the geopolitical situation in Europe in 1801.
Green coloured regions = countries that were adjudged a maritime power; red coloured regions = countries with no maritime power. The vertical lines compare the population of each region (the red line) with its tax revenues (the green line).
Funkhouser (1937, pg 273) says that Playfair may be regarded as the “father of the graphic method in statistics”.
Bertillon, J. (1891), Atlas de Statistique Graphique de la Ville de Paris, année 1889. Paris.
Playfair’s contribution to data visualisation in Great Britain remained relatively unknown. Spence (2005) suggests that this may be because Playfair was involved in “failed and sometimes fraudulent business ventures in London and Paris since the early 1780’s”.
Friendly and Denis (2005, pg 106) add that “Playfair was indeed a sinner, graphic and otherwise".
Tufte (2001, pg 178) said of the pie chart: “A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between pies. . .”
Brinton (1914) said “the circle with sectors is not a desirable form of presentation”.
[Figures: Brinton (1914, pg 5); Brinton (1914, pg 6)]
Haskell (1922, pg 9) calls a version of the pie chart a circular percentage chart. Brinton (1939, pg 81) notes that alternative names for the pie chart include the sector chart and divided circle. He dedicates a whole chapter (Chp 9) to their study.
Interestingly . . . .
For expressing component parts the circle chart or "pie diagram" is admittedly inferior to the composite bar chart. However, there are times when it is advantageous for the sake of popularization to make use of the "pie diagram."
Croxton, F. E. (1922), A percentage protractor, Journal of the American Statistical Association, 18, 108 - 109.

The “Dumb” Pie Chart
. . . Some dumb pie charts courtesy of Microsoft Word . . .
. . . Some dumb pie charts (pie chart in a pie chart . . .): “Inception pie charts”
Exact source unknown . . . but from someone who describes himself as "an innovation leader in delivering analytics."
http://www.opiniondynamics.com/

The “Fun” Pie Chart
http://chandoo.org/wp/tag/bad-charts/
The Pareto Chart
Who invented the Pareto Chart?
Vilfredo Pareto (1848 – 1923)
• Born – in Paris on 15 July 1848 to an exiled Genoese family
• Parents named him Fritz Wilfried, and renamed him Vilfredo Federico on the family's return to Italy in 1858
• Nationality – Italian
• Died – in Switzerland on 19 August 1923
• Worked and resided – France, Germany, Austria, England, Belgium and Switzerland

Many phenomena, like wealth and survival times, follow the Pareto distribution
$$P(X > x) = \left(\frac{x_m}{x}\right)^{\alpha}, \qquad x \ge x_m, \quad \alpha > 0,$$
where $x_m$ is the minimum possible value of $x$.
The cumulative line was added in 1951?
Juran, J. M (1951), The economics of quality, in Quality Control Handbook, (ed. J. M. Juran), New York: McGraw-Hill, pp. 1–41.
Juran stumbled across Pareto’s work in 1941.
As a result, he postulated the Pareto principle, or 80-20 rule, in honour of Vilfredo Pareto.
Joseph M. Juran (1904 – 2008), Vilfredo Pareto (1848 – 1923)

Pareto Principle
• 80% of the land in Italy was owned by 20% of the population (α = 1.16).
• Juran originally described the fundamental principle as “the vital few and the trivial many”.
Juran
• confessed “that I had mistakenly applied the wrong name to the principle” (1975)
• revised the idea to “the vital few and the useful many”
• was older brother to Academy Award winner Nathan H. Juran, who won for “Best Art Direction” for “How Green was my Valley” (1942)
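The link between α = 1.16 and the 80-20 rule can be checked numerically. The sketch below assumes the standard result that, for a Pareto distribution with α > 1, the richest fraction p of the population holds a share p^(1 − 1/α) of the total. That formula is not stated on the slides, and the function names are hypothetical.

```python
import math


def pareto_survival(x, x_m, alpha):
    """P(X > x) for a Pareto distribution with minimum x_m and shape alpha."""
    if x < x_m:
        return 1.0
    return (x_m / x) ** alpha


def top_share(p, alpha):
    """Share of the total held by the top fraction p (requires alpha > 1)."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1")
    return p ** (1 - 1 / alpha)


# With alpha = 1.16, the top 20% hold roughly 80% of the total.
print(round(top_share(0.2, 1.16), 3))

# The shape giving exactly 80-20 is log(5) / log(4), about 1.161.
exact_alpha = math.log(5) / math.log(4)
print(round(exact_alpha, 3), top_share(0.2, exact_alpha))
```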
Leland Wilkinson (2006) raised some concerns about the Pareto chart
• It is meaningless to superimpose a density onto a distribution.
• Some remedies include making dual vertical scales.
• He felt dual scales are confusing, and . . . .
• . . . . the rationale for aligning the scales is arbitrary.
Wilkinson (2006), Revising the Pareto Chart, The American Statistician, 60, 332 – 334.
Note: Wilkinson was Senior VP of SPSS (1994-2007). Now Adjunct Prof at University of Illinois.
To deal with the “dual scale” problem, Juran reshaped the Pareto diagram to be a cumulative bar chart, although this has not gained wide attention or application.
Note here that the height of each “bar” on the revised chart matches up with the height of each bar on the original chart and appears under the cumulative frequency.
Wilkinson (2006) raised some concerns about the Pareto chart
• A well known problem: the cumulative line forces the bars towards the bottom of the graph and can make it difficult to distinguish the bars.
• Experts advise users to truncate the plot by making an “Other” category, but this is rather ad hoc.
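The ingredients of a Pareto chart discussed above (bars sorted by frequency, a cumulative percentage line, and the optional “Other” category) can be sketched as a small table builder. The defect counts and the `keep` parameter are made up for illustration; this is not Juran's or Wilkinson's construction.

```python
def pareto_table(counts, keep=None):
    """Rows of (label, count, cumulative %) sorted by decreasing count.

    If keep is given, categories beyond the first `keep` are lumped
    into a single "Other" row placed last (the ad hoc truncation).
    """
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if keep is not None and keep < len(ordered):
        other_total = sum(value for _, value in ordered[keep:])
        ordered = ordered[:keep] + [("Other", other_total)]
    total = sum(value for _, value in ordered)
    rows = []
    running = 0
    for label, value in ordered:
        running += value
        rows.append((label, value, 100.0 * running / total))
    return rows


# Hypothetical defect counts.
defects = {"chip": 10, "scratch": 50, "warp": 4, "dent": 30, "stain": 6}

for row in pareto_table(defects):
    print(row)
print(pareto_table(defects, keep=3))
```

The bar heights come from the second column and the cumulative line from the third. Since the line always ends at 100%, the two need either separate scales or a shared scale that squashes the bars, which is the problem raised above.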
DataVis.ca
Michael Friendly at CARME2015 (Naples, Italy) © Pieter Kroonenberg

What’s next??
Part 2: Two Variables . . . . (maybe)
Two Numerical Variables
– Scatterplot
– Anscombe’s Quartet
– Trellis Plot
Two Categorical Variables
– Fourfold Display
– Mosaic Displays (Popularised by Michael Friendly)
– Correspondence Plot
– Biplot
One Categorical & One Numerical Variable
– Side-by-side boxplot
Other
– Chernoff Faces (multiple variables)
– Andrews curves
– Dendrograms

Thank you

Pie Chart
An early variation of the pie chart was called the coxcomb; see the diagram on the next page. It was proposed by Florence Nightingale (1820 – 1910).

The Impact of Florence Nightingale
Famous English nurse Florence Nightingale was also a highly accomplished statistician who became enraged at the deplorable sanitary conditions of army hospitals during the Crimean War* (1853 – 1856) and fought for reforms to improve their conditions. Her literary and mathematical skills were instrumental in these reforms, and many of her arguments were highly statistical.
In 1859 Nightingale was the first elected female member of the Royal Statistical Society and she later became an honorary member of the American Statistical Association.
* The Crimean War was fought along the Crimean peninsula, which is located on the northern coast of the Black Sea (now part of modern day Ukraine). It was between the Russian Empire and the alliance formed between the French, British and Ottoman Empires and Sardinia.

Pie Chart
This coxcomb appears in Nightingale's 1858 Notes on matters affecting the Health, Efficiency and Hospital Administration of the British Army. It highlights death rates from April 1855 to March 1856 as a result of battle wounds, preventable diseases and deaths due to other causes. The death rates are reflected by the area of each bar of the coxcomb, not its length.
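The last point, area rather than length, has a simple geometric consequence. For wedges of equal angle θ, the area is θr²/2, so the radius must grow with the square root of the value. The sketch below assumes twelve equal wedges (one per month) and uses made-up rates; it is not a reconstruction of Nightingale's data.

```python
import math


def wedge_radius(value, n_wedges=12, area_per_unit=1.0):
    """Radius of an equal-angle wedge whose area is area_per_unit * value."""
    theta = 2 * math.pi / n_wedges
    return math.sqrt(2 * area_per_unit * value / theta)


def wedge_area(radius, n_wedges=12):
    """Area of a circular sector with angle 2*pi / n_wedges."""
    theta = 2 * math.pi / n_wedges
    return 0.5 * theta * radius ** 2


# Hypothetical monthly rates: the second is four times the first.
low, high = 25.0, 100.0
r_low, r_high = wedge_radius(low), wedge_radius(high)
print(r_high / r_low)                          # radius only doubles
print(wedge_area(r_low), wedge_area(r_high))   # areas recover the rates
```

Drawing the radius proportional to the value itself would make a fourfold rate look sixteen times larger in area.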