
A Short History of Visualisation: A Single Variable

Eric J. Beh
School of Mathematics & Physical Sciences
University of Newcastle, Australia
[email protected]

Why Visualise Data?

If statistical graphics, although born just yesterday, extends its reach every day, it is because it replaces long tables of numbers and it allows one not only to embrace at a glance the series of phenomena, but also to signal the correspondence or anomalies, to find the causes, to identify the laws.

Émile Cheysson (1836 – 1919) (Cheysson, circa 1877)

In statistical graphics, information is contained in shapes and patterns. . . . Just as interpretation of a painting or drawing requires understanding of the artist's context, interpreting a graph requires an understanding of the statistical context that surrounds the graph. As in art, conclusions about a graph without understanding the context are likely to be wrong, or off the point at best.

Cook, R. D. and Weisberg, S. (1999), Graphs in Statistical Analysis: Is the medium the message?, The American Statistician, 53, 29 – 37.

Data Visualisation vs Modelling

Jean-Paul Benzecri is a retired French linguist and . . . the “father of correspondence analysis”. Greenacre is a former student of his and considered the “guru of correspondence analysis”.

Michael Greenacre says in the introduction to the English translation of his 1978 PhD thesis, on Benzecri's philosophy:

Benzecri thought . . . “model should follow the data, not the inverse”

In one phrase, this “axiom” seemed to annihilate all the distribution theory and hypothesis testing of decades of statistical . . .

However, what is actually intended here . . . “Look at your data first, try to understand your observations to the fullest degree possible before you start “tampering” with them and manipulating them to fit your own ideas”

Some Turn of the Century Texts

[Book covers: 1910; 1914 (reprint 1919), the first widely circulated book on statistical graphics; 1939; 2001]

First Documented Graphical Display?

This graph is considered the first known graphical display to summarise data

This graph is thought to date back to the tenth or maybe eleventh century and is part of a manuscript discovered by Sigmund Gunther in 1877.

Funkhouser (1936)

The graph reflects the inclination of the orbits of the planets of our Solar System over time.

Vertical axis: the Latin names of the planets and the sun.
Horizontal axis: the 30 zodiacal zones, representing time.

Napoleon’s invasion of, and retreat from, Russia (June 1812)

[Figure labels: Invasion of Russia; Retreat from Russia]

(Appears in Tufte, 2001, pg. 40; 2002 English translation of the French. Marey, E. J. (1885), La Méthode Graphique. Paris, pp. 73)

A Single Variable

• Numerical
  – Histogram
  – Boxplot
  – Line Graph

• Categorical
  – Pie Chart
  – Pareto Distribution
  – Bar Chart

Numerical Data

Who developed the Histogram?

The Histogram?

. . . on the 20th of November 1891. He used the term in reference to a ‘time-diagram’ that appeared in a Gresham Lecture on ‘Maps and Chartograms’.

Pearson explained that the histogram could be used for historical purposes to create blocks of time of “charts about reigns or sovereigns or periods of different prime ministers”.

So the etymology of histogram may have come from “historical graphs”.

Sir Karl Pearson (1857 – 1936)

Wikipedia says

Sometimes it is said to be derived from the Ancient Greek ἱστός (histos) – "anything set upright" . . . and γράμμα (gramma) – "drawing, record, writing".

At the turn of the 20th century, when Pearson was laying the foundations for many of the statistical concepts we use today, he was also interested in the visualisation of data.

Pearson gave the first published histogram in

Pearson, K. (1895), Contributions to the mathematical theory of evolution II: Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London A, 186, 343 – 414.

He studied data from the 1887 Presidential address of Mr Goschen to the Royal Statistical Society. This data summarises the cost of house and shop properties in England and Wales in 1885 – 1886.

This is the first histogram published and graphically summarises the cost of house and shop properties data Pearson considered.

We can see here that Pearson (1895, Plate 13, Figure 13) was also the first to use the term histogram, although there is some evidence that suggests he may have used the term as early as 1891 in a seminar.
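Drawing a histogram such as Pearson's means choosing class intervals first. As a minimal sketch (not Pearson's own procedure), the Python below contrasts equal-width bins, with the number of classes taken from Sturges' (1926) rule, against equal-area bins whose edges sit at sample quantiles. The sample and all function names are hypothetical, chosen only for illustration.

```python
import bisect
import math


def sturges_classes(n):
    """Sturges' rule: about 1 + log2(n) classes for n observations."""
    return math.ceil(1 + math.log2(n))


def equal_width_edges(data, k):
    """Edges of k bins of identical width spanning the range of the data."""
    lo, hi = min(data), max(data)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k)] + [hi]


def equal_area_edges(data, k):
    """Edges at sample quantiles, so each bin holds about n/k observations."""
    s = sorted(data)
    n = len(s)
    return [s[(i * n) // k] for i in range(k)] + [s[-1]]


def bin_counts(data, edges):
    """Count observations per bin; the last bin includes its right edge."""
    k = len(edges) - 1
    counts = [0] * k
    for x in data:
        i = min(bisect.bisect_right(edges, x) - 1, k - 1)
        counts[i] += 1
    return counts


# Hypothetical right-skewed sample (not Pearson's property data).
sample = [(i / 10) ** 2 for i in range(128)]
k = sturges_classes(len(sample))
width_counts = bin_counts(sample, equal_width_edges(sample, k))
area_counts = bin_counts(sample, equal_area_edges(sample, k))
print(k, width_counts, area_counts)
```

With equal-width bins the counts vary and the bar heights carry the shape; with equal-area bins the counts are (nearly) constant, so the shape shows up in the varying bin widths once each bar is drawn with height count / (n × width).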

The Histogram?

• Much has been discussed on the choice of width of the bins. Sturges first discussed the issue:
  Sturges, H. (1926), “The Choice of a Class Interval,” Journal of the American Statistical Association, 21, 65 – 66.

• Debate has grown over the years about equal width bins vs equal area (“e-a”) bins. Equal bin area leads to “e-a histograms”; see, for example,
  Denby, L. & Mallows, C. (2009), “Variations on the Histogram”, Journal of Computational and Graphical Statistics, 18, 21 – 31.

Living Histograms!!

Living histogram of 143 student heights at University of Connecticut
The Hartford Courant (1996), "Reaching New Heights," November 23, 1996; photo by K. Hanley.

Living histogram of 175 male college students
Blakeslee, A. F. (1914), "Corn and Men," The Journal of Heredity, 5, 511 - 518.

Who developed the Boxplot?

Some credit it to John W. Tukey (1915 – 2000).

Tufte’s 2001 book is dedicated to the memory of Tukey.

Variations of the Boxplot

Tufte’s “Box”-plot

In an attempt to further simplify the boxplot, Tufte (2001, pp. 123 – 125) considered removing the box and line and representing the median with a point and differently weighted ‘whiskers’ to reflect the . . . and . . . of the data.

McGill, Tukey & Larsen (1978)

• Data: a single month's telephone bills for a group of Chicago residence customers.

• The data are grouped by the number of years the resident has lived in the city.

• However . . . not everything known is shown.

• One might conclude that the overall median for all groups combined is about $21.

• This is absolutely wrong. The actual overall median is about $14.

• What isn’t the data showing? Frequencies

Variable-Width Boxplot
McGill, Tukey & Larsen (1978)

• The information available but not displayed is the number of customers in the various groups.

• Here the width of each box has been made proportional to the square root of the number of customers in the corresponding group.

• The viewer's attention is immediately drawn to the size differences
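The two points above (group medians can mislead about the overall median, and box widths can encode group size) can be sketched in a few lines of Python. The bills below are made-up numbers, not the McGill, Tukey & Larsen data, and the group labels and variable names are hypothetical.

```python
import math
import statistics

# Hypothetical monthly bills in dollars, grouped by years of residence.
# These values are invented for illustration; they are not the 1978 data.
bills = {
    "under 5 years": [8, 9, 10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 17, 18, 19, 20],
    "5 to 20 years": [18, 20, 22, 24],
    "over 20 years": [25, 27, 29, 31],
}

# What equal-width boxes show: one median per group.
group_medians = {g: statistics.median(v) for g, v in bills.items()}

# A tempting but wrong reading: take the "middle" of the group medians.
naive_guess = statistics.median(group_medians.values())

# The actual overall median needs the pooled data, i.e. the frequencies.
pooled = [x for v in bills.values() for x in v]
overall_median = statistics.median(pooled)

# Variable-width boxes: width proportional to sqrt(group size),
# scaled here so that the largest group has width 1.
largest = max(len(v) for v in bills.values())
box_widths = {g: math.sqrt(len(v) / largest) for g, v in bills.items()}

print(group_medians, naive_guess, overall_median, box_widths)
```

Because the largest group also has the lowest bills, the pooled median sits well below the middle group median, which is the same trap described for the telephone data.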

Violin Boxplot
Hintze and Nelson (1998)

• Since the boxplot doesn’t give any real indication of the shape of the data (only of certain quartiles, which help with assessing centre and spread), one may consider the violin boxplot.

• Basically, the same thing as a boxplot but with the shape of the data reflected.

Vaseplot
Benjamini, Y. (1988), Opening the box of a boxplot, The American Statistician, 42, 257-262.

Notched boxplot
McGill, R., Tukey, J. W. & Larsen, W. A. (1978), Variations of box plots, The American Statistician, 32, 12 - 16.

The “Bivariate” Boxplot - The Rangefinder boxplot

Becketti & Gould (1987), Rangefinder box plots: A note, The American Statistician, 41, pg 149.

[Figure annotations: median of both variables; whiskers]

A scatterplot of 1980 divorce rates against birth rates in the 49 states other than Nevada


The “Bivariate” Boxplot - The Rangefinder boxplot . . . problems

Lenth, R. V. (1988), Comment on Becketti and Gould, The American Statistician, 42, pg 87.

Concerns:

• No labels on axes

• No “boxes”

• Length of the whiskers (at the limits of the other variable) reflects the width of the IQR

The “Bivariate” Boxplot - Becketti & Gould rejoinder

The “Bivariate” Boxplot - The Bagpipe boxplot
Rousseeuw, Ruts & Tukey (1999)

• The depth median is the deepest location, and it is surrounded by a "bag" containing the n/2 observations with largest depth.

• Magnifying the bag by a factor 3 yields the "fence". Observations between the bag and the fence are marked by a light gray loop, whereas observations outside the fence are flagged as outliers.

• The bagplot visualises the location, spread, correlation, skewness, and tails of the data.

The “Bivariate” Boxplot - The “Relplot” and “Quelplot”

Goldberg, K.M. and Iglewicz, B. (1992), Bivariate extensions of the boxplot, Technometrics, 34, 307-320.

So . . . who developed the Boxplot?

The Origins of the Boxplot

The Range Bar

Mary Eleanor Spear (1952), Charting Statistics, McGraw-Hill (page 166).

The “Rangebar” Chart

Haemer, K. W. (1948), Range-bar charts, The American Statistician, 2, 23.

Categorical Data

Who developed the pie chart?

The Pie Chart

The story of the pie chart begins with William Playfair (1759 – 1823). After proposing the line graph and bar chart in 1786, Playfair (in 1801) constructed the pie chart as a visual aid to compare the geographical size of each of the European regions, and the areas around the world they occupied.

William Playfair (1759 - 1823)

Playfair, W. (1801) The Statistical Breviary: Shewing, on a Principle Entirely New, the Resources of Every State and Kingdom in Europe, T. Bensley.

The area of the circles allows for a comparison of the land area (in square miles) each region occupied, given the geopolitical situation in Europe in 1801.

Green coloured regions = countries that were adjudged a maritime power. Red coloured regions = countries with no maritime power. Vertical lines compare the population of each region (the red line) and its tax revenues (the green line).

Funkhouser (1937, pg 273) says that Playfair may be regarded as the “father of the graphic method in statistics".

Bertillon, J. (1891), Atlas de Statistique Graphique de la Ville de Paris, année 1889. Paris.

Playfair’s contribution to data visualisation in Great Britain remained relatively unknown. Spence (2005) suggests that this may be because Playfair was involved in “failed and sometimes fraudulent business ventures in London and Paris since the early 1780’s”.

Friendly and Denis (2005, pg 106) add that “Playfair was indeed a sinner, graphic and otherwise".

Tufte (2001, pg 178) said of the pie chart

“A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between pies. . .”

Brinton (1914) said

“the circle with sectors is not a desirable form of presentation”

Brinton (1914, pg 5) Brinton (1914, pg 6)

Haskell (1922, pg 9) calls a version of the pie chart a circular percentage chart. Brinton (1939, pg 81) notes that alternative names of the pie chart include the sector chart and divided circle. He dedicates a whole chapter (Chp 9) to their study.

Interestingly . . . .

For expressing component parts the circle chart or "pie diagram" is admittedly inferior to the composite bar chart. However, there are times when it is advantageous for the sake of popularization to make use of the "pie diagram."

Croxton, F.E. (1922) A percentage protractor. Journal of the American Statistical Association, 18, 108--109.

The “Dumb” Pie Chart

. . . Some dumb pie charts courtesy of Microsoft Word . . .

. . . Some dumb pie charts (pie chart in a pie chart . . .): “Inception pie charts”

Exact source unknown . . . But from someone who describes himself as "an innovation leader in delivering analytics."

http://www.opiniondynamics.com/

The “Fun” Pie Chart

http://chandoo.org/wp/tag/bad-charts/

Who invented the Pareto Chart?

The Pareto Chart

• Born – in Paris on 15 July 1848 to an exiled Genoese family

• Parents named him Fritz Wilfried, and renamed him Vilfredo Federico on the family's return to Italy in 1858

• Nationality – Italian

• Died – in Switzerland on 19 August 1923

• Worked and resided – France, Germany, Austria, England, Belgium and Switzerland

Vilfredo Pareto (1848 – 1923)

Many phenomena, like wealth and survival times, follow the Pareto distribution

$$P(X > x) = \left(\frac{x_m}{x}\right)^{\alpha}, \qquad x \ge x_m, \quad \alpha > 0,$$

where $x_m$ is the minimum possible value of $x$.


The cumulative line was added in 1951?

Juran, J. M. (1951), The economics of quality, in Quality Control Handbook, (ed. J. M. Juran), New York: McGraw-Hill, pp. 1–41.

Juran stumbled across Pareto’s work in 1941.

As a result, he postulated the Pareto principle, or 80-20 rule, in honour of Vilfredo Pareto.

Joseph M. Juran (1904 – 2008)
Vilfredo Pareto (1848 – 1923)

Pareto Principle

• 80% of the land in Italy was owned by 20% of the population (α = 1.16).

• Juran originally described the fundamental principle as “the vital few and the trivial many”

• Juran confessed “that I had mistakenly applied the wrong name to the principle” (1975)

• Juran revised the idea to “the vital few and the useful many”

• Juran was older brother to Academy Award winner Nathan H. Juran, who won for “Best Art Direction” for “How Green was my Valley” (1942)
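The link between the 80-20 rule and α = 1.16 can be checked numerically. For a Pareto distribution with α > 1, a standard result is that the richest fraction p of the population holds a share p^(1 − 1/α) of the total. The sketch below assumes that result; the function names are hypothetical and nothing here comes from Pareto's or Juran's own calculations.

```python
def pareto_survival(x, x_m, alpha):
    """P(X > x) for a Pareto distribution with minimum x_m and shape alpha."""
    if x < x_m:
        return 1.0
    return (x_m / x) ** alpha


def top_share(p, alpha):
    """Share of the total held by the top fraction p (requires alpha > 1)."""
    if alpha <= 1:
        raise ValueError("the mean is infinite unless alpha > 1")
    return p ** (1 - 1 / alpha)


# With alpha = 1.16 the top 20% hold roughly 80% of the total.
share = top_share(0.20, 1.16)
print(round(share, 3))
```

Larger values of α give a more equal split; for example `top_share(0.20, 2.0)` is about 0.45, so the 80-20 pattern is specific to shapes near 1.16.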

Leland Wilkinson (2006) raised some concerns about the Pareto chart

• It is meaningless to superimpose a density onto a distribution

• Some remedies include making dual vertical scales.

• He felt dual scales are confusing, and . . . .

• . . . . the rationale for aligning the scales is arbitrary

Wilkinson (2006), Revising the Pareto Chart, The American Statistician, 60, 332 – 334.

Note: Wilkinson was Senior VP of SPSS (1994-2007). Now Adjunct Prof at University of Illinois.

To deal with the “dual scale” problem Juran reshaped the Pareto diagram to be a cumulative bar chart, although this has not gained wide attention or application.

Note here that the height of each “bar” on the revised chart matches up with the height of each bar on the original chart and appears under the cumulative . . .

Wilkinson (2006) raised some concerns about the Pareto chart

• A well known problem: The cumulative line forces the bars towards the bottom of the graph and can make it difficult to distinguish the bars.

• Experts advise users to truncate the plot by making an “Other” category but this is rather ad hoc
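The ingredients of a Pareto chart, and the “Other” truncation mentioned above, can be sketched as a small table computation. The complaint categories and counts are invented for illustration, and the helper names are hypothetical; this is not Wilkinson's revised chart, only the conventional bars-plus-cumulative-line layout.

```python
def pareto_table(counts):
    """Rows of (category, count, cumulative %) sorted by decreasing count."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for name, count in ordered:
        running += count
        rows.append((name, count, 100 * running / total))
    return rows


def lump_small(counts, keep):
    """Keep the `keep` largest categories and pool the rest as 'Other'."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ordered[:keep])
    rest = sum(count for _, count in ordered[keep:])
    if rest:
        kept["Other"] = rest
    return kept


# Hypothetical complaint counts (made up for this example).
complaints = {
    "Late delivery": 50,
    "Damaged item": 30,
    "Wrong item": 10,
    "Billing": 6,
    "Packaging": 4,
}
table = pareto_table(complaints)
lumped = lump_small(complaints, keep=3)
for row in table:
    print(row)
```

The bar heights use the count column while the line uses the cumulative percentage column, which is why the two need separate vertical scales. The choice of `keep` is arbitrary, which illustrates why the “Other” remedy is described as ad hoc.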


DataVis.ca

Michael Friendly at CARME2015 (Naples, Italy) © Pieter Kroonenberg

What’s next??

Part 2: Two Variables . . . . (maybe)

• Two Numerical Variables
  – Scatterplot
  – Anscombe’s Quartet
  – Trellis Plot

• Two Categorical Variables
  – Fourfold Display
  – Mosaic Displays (Popularised by Michael Friendly)
  – Correspondence Plot

• One Categorical & One Numerical Variable
  – Side-by-side boxplot

• Other
  – Chernoff Faces (multiple variables)
  – Andrew’s curves
  – Dendrograms

Thank you

Pie Chart

An early variation of the pie chart was called the coxplot – see the diagram on the next page. It was proposed by Florence Nightingale (1820 – 1910).

The Impact of Florence Nightingale

Famous English nurse Florence Nightingale was also a highly accomplished statistician who became enraged at the deplorable sanitary conditions of army hospitals during the Crimean War* (1853 – 1856) and fought for reforms to improve their conditions. Her literary and mathematical skills were instrumental in these reforms, and many of her arguments were highly statistical.

In 1859 Nightingale was the first elected female member of the Royal Statistical Society and she later became an honorary member of the American Statistical Association.

* The Crimean War was fought along the Crimean peninsula that is located on the northern coast of the Black Sea (now part of modern day Ukraine). It was between the Russian Empire and the alliance formed between the French, British, Ottoman Empires and Sardinia.

Pie Chart

This coxplot appears in Nightingale's 1858 Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army. It highlights death rates from April 1855 to March 1856: those as a result of battle wounds, those due to preventable diseases, and deaths due to other causes. The death rates are reflected by the area of each bar of the coxplot, not its length.
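Encoding a value by the area of a wedge rather than its length has a simple geometric consequence: with a fixed angle per month, the radius must grow with the square root of the value. The sketch below assumes twelve equal-angle wedges and uses invented death rates, not Nightingale's figures; the function names are hypothetical.

```python
import math


def wedge_radius(value, n_wedges=12):
    """Radius of an equal-angle wedge whose AREA equals `value`."""
    theta = 2 * math.pi / n_wedges       # angle of each wedge in radians
    return math.sqrt(2 * value / theta)  # from area = 0.5 * theta * r**2


def wedge_area(radius, n_wedges=12):
    """Area of an equal-angle wedge with the given radius."""
    theta = 2 * math.pi / n_wedges
    return 0.5 * theta * radius ** 2


# Hypothetical monthly death rates (made up for illustration).
rates = [100, 400]
radii = [wedge_radius(r) for r in rates]
print(radii)
```

A fourfold increase in the rate only doubles the radius, so a reader who judges the wedges by their length will understate the difference.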