Examining Data: I

Georges Monette

MATH 4330: Applied Categorical Data Analysis – Fall 2017

September 2017 (Updated: September 12 2017 19:06)

Introduction

This material is based on Fox (2016) ch. 4, pp. 28–80.

How to explore a data set? The first step is not to jump in and plot the data. The very first step should be to get or create the data biography:

• where does it come from?
• who collected it?
• from whom or what?
• when?
• was there some form of random selection of sampled units?
  – what was the sampling frame?
  – how was sampling carried out?
  – are there sampling weights to adjust for possible biases?
  – was sampling done effectively so that the sampled units might be representative of a larger population?
• was there random assignment of units to different levels of some variable (treatments)?
  – was this done effectively so the differences in outcomes could be interpreted as the result of a causal effect?

You may not find answers to all these questions but it is essential that you be aware of the answers you don’t have and of the consequential limitations on the interpretation of results of analyses of these data. Many data sets used in teaching lack a good data biography. That includes, unfortunately, many data sets used in this course. See Krause (2017) for a discussion of the role of data biographies.

The second step is to find or create a data directory or ‘codebook’ describing each of the variables in the data.

The third step is to think about what interesting questions this data set could help explore. Everything you present about a dataset should be related to a question. As you will see, there are a great many ways of displaying a variable. The better ways answer interesting questions. As you explore data, keep asking yourself whether a graph or an analysis addresses an interesting and meaningful question. Present those that do, and don’t waste your audience’s time and mental energy with those that don’t.

This script focuses on the initial examination of data using graphical and numeric methods. We use some basic tools in R and vary parameters so you can judge what works best.
R has a number of ‘systems’ for graphics. They represent different stages of the development of R and its ancestor, S. There is a good overview of graphics in R by Michael Friendly (???). The graphics systems can be classified as:

1. Standard base graphics, developed in the early days of S at Bell Labs in the 1970s.
2. Lattice, latticeExtra, and grid graphics, modelled on ‘trellis’ graphics also developed at Bell Labs in the 1990s.
3. 3d graphics based on ‘rgl’, an implementation in R of OpenGL, and packages that use ‘rgl’ such as p3d.
4. ggplot2, which is one of a large number of packages developed by Hadley Wickham in the last decade.

The easiest system to use is, by far, the base graphics system. Next in ease is the lattice system which, although daunting, is quite powerful once you master a few techniques. Most of the examples in this file use lattice graphics, a powerful set of functions and conventions that allow you to make easy graphs to explore data quickly and to produce, with more effort, graphical displays suitable for professional presentations.

The best way to learn is to play with the code. Try to vary arguments. Look up help for functions and experiment with the examples. Don’t hesitate to also explore the other graphics systems.

We also use some simple data manipulation tools in the ‘spida2’ package. Hadley Wickham’s packages are more powerful but also somewhat more complex.

We will use the following data set(s) from Fox (2016).

knitr::opts_chunk$set(cache = TRUE)
# Download these files manually.
# Make sure to change the working directory
# to the directory in which this script was saved
# using RStudio menus:
# Session > Set Working Directory > To Source File Directory
fox_data <- "http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/"
download.file(paste0(fox_data,'UnitedNations.txt'),'UnitedNations.txt')
download.file(paste0(fox_data,'Titanic.txt'),'Titanic.txt')
# download bibliography file
download.file('http://blackwell.math.yorku.ca/MATH4330/common/4330.bib','4330.bib')
# usual packages:
library(car)
library(spida2)
library(lattice)
library(latticeExtra)
| Loading required package: RColorBrewer
# read data
un <- read.table('UnitedNations.txt', header = TRUE)
titanic <- read.table('Titanic.txt', header = TRUE)
head(un)
|                region  tfr contraception educationMale educationFemale
| Afghanistan      Asia 6.90            NA            NA              NA
| Albania        Europe 2.60            NA            NA              NA
| Algeria        Africa 3.81            52          11.1             9.9
| American.Samoa   Asia   NA            NA            NA              NA
| Andorra        Europe   NA            NA            NA              NA
| Angola         Africa 6.69            NA            NA              NA
|                lifeMale lifeFemale infantMortality GDPperCapita
| Afghanistan        45.0       46.0             154         2848
| Albania            68.0       74.0              32          863
| Algeria            67.5       70.3              44         1531
| American.Samoa     68.0       73.0              11           NA
| Andorra              NA         NA              NA           NA
| Angola             44.9       48.1             124          355
|                economicActivityMale economicActivityFemale illiteracyMale
| Afghanistan                    87.5                    7.2         52.800
| Albania                          NA                     NA             NA
| Algeria                        76.4                    7.8         26.100
| American.Samoa                 58.8                   42.4          0.264
| Andorra                          NA                     NA             NA
| Angola                           NA                     NA             NA
|                illiteracyFemale
| Afghanistan               85.00
| Albania                      NA
| Algeria                   51.00
| American.Samoa             0.36
| Andorra                      NA
| Angola                       NA
dim(un)
| [1] 207 13
un$country <- rownames(un) # rownames identify the rows of a data frame
head(titanic)
|                                                 survived     age
| Allen, Miss Elisabeth Walton                         yes 29.0000
| Allison, Miss Helen Loraine                           no  2.0000
| Allison, Mr Hudson Joshua Creighton                   no 30.0000
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)       no 25.0000
| Allison, Master Hudson Trevor                        yes  0.9167
| Anderson, Mr Harry                                   yes 47.0000
|                                                 passengerClass    sex
| Allen, Miss Elisabeth Walton                               1st female
| Allison, Miss Helen Loraine                                1st female
| Allison, Mr Hudson Joshua Creighton                        1st   male
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)            1st female
| Allison, Master Hudson Trevor                              1st   male
| Anderson, Mr Harry                                         1st   male
dim(titanic)
| [1] 1313 4

Do you know of any famous passengers on the Titanic? For example the ‘unsinkable’ Molly Brown:
grep('Brown', rownames(titanic)) # selecting subset of strings with a regular expression
| [1] 37 38 52 348 349 350 600 821
grep('Brown', rownames(titanic)) %>% titanic[., ]
|                                                  survived age
| Brown, Mrs James Joseph (Margaret Molly Tobin)        yes  44
| Brown, Mrs John Murray (Caroline Lane Lamson)         yes  59
| Case, Mr Howard Brown                                  no  49
| Brown, Miss Edith E.                                  yes  15
| Brown, Mr Thomas William Solomon                       no  45
| Brown, Mrs Thomas William Solomon (Elizabeth C.)      yes  40
| Brown, Miss Mildred                                   yes  24
| Goldsmith, Mrs Frank John (Emily A. Brown)            yes  NA
|                                                  passengerClass    sex
| Brown, Mrs James Joseph (Margaret Molly Tobin)              1st female
| Brown, Mrs John Murray (Caroline Lane Lamson)               1st female
| Case, Mr Howard Brown                                       1st   male
| Brown, Miss Edith E.                                        2nd female
| Brown, Mr Thomas William Solomon                            2nd   male
| Brown, Mrs Thomas William Solomon (Elizabeth C.)            2nd female
| Brown, Miss Mildred                                         2nd female
| Goldsmith, Mrs Frank John (Emily A. Brown)                  3rd female

Who were the 10 youngest (except NAs)?
sortdf(titanic,~ age)[1:10,]

|                                        survived    age passengerClass
| Dean, Miss Elizabeth Gladys (Millvena)      yes 0.1667            3rd
| Danbom, Master Gilbert Sigvard Emanuel       no 0.3333            3rd
| Caldwell, Master Alden Gates                yes 0.8333            2nd
| Richards, Master George Sidney              yes 0.8333            2nd
| Aks, Master Philip                          yes 0.8333            3rd
| Allison, Master Hudson Trevor               yes 0.9167            1st
| Becker, Master Richard F.                   yes 1.0000            2nd
| Hamalainen, Master Viljo                    yes 1.0000            2nd
| LaRoche, Miss Louise                        yes 1.0000            2nd
| Dean, Master Bertram Vere                   yes 1.0000            3rd
|                                           sex
| Dean, Miss Elizabeth Gladys (Millvena) female
| Danbom, Master Gilbert Sigvard Emanuel   male
| Caldwell, Master Alden Gates             male
| Richards, Master George Sidney           male
| Aks, Master Philip                       male
| Allison, Master Hudson Trevor            male
| Becker, Master Richard F.                male
| Hamalainen, Master Viljo                 male
| LaRoche, Miss Louise                   female
| Dean, Master Bertram Vere                male

The ten oldest
sortdf(titanic,~ I(- age)) %>% head(10)
|                                                           survived age
| Artagaveytia, Mr Ramon                                          no  71
| Goldschmidt, Mr George B.                                       no  71
| Mitchell, Mr Henry Michael                                      no  71
| Crosby, Captain Edward Gifford                                  no  70
| Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead)      yes  69
| Straus, Mr Isidor                                               no  67
| Millet, Mr Francis Davis                                        no  65
| Dewan, Mr Frank                                                 no  65
| Compton, Mrs Alexander Taylor (Mary Eliza Ingersoll)           yes  64
| Fortune, Mr Mark                                                no  64
|                                                           passengerClass
| Artagaveytia, Mr Ramon                                               1st
| Goldschmidt, Mr George B.                                            1st
| Mitchell, Mr Henry Michael                                           2nd
| Crosby, Captain Edward Gifford                                       1st
| Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead)            1st
| Straus, Mr Isidor                                                    1st
| Millet, Mr Francis Davis                                             1st
| Dewan, Mr Frank                                                      3rd
| Compton, Mrs Alexander Taylor (Mary Eliza Ingersoll)                 1st
| Fortune, Mr Mark                                                     1st
|                                                              sex
| Artagaveytia, Mr Ramon                                      male
| Goldschmidt, Mr George B.                                   male
| Mitchell, Mr Henry Michael                                  male
| Crosby, Captain Edward Gifford                              male
| Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead) female
| Straus, Mr Isidor                                           male
| Millet, Mr Francis Davis                                    male
| Dewan, Mr Frank                                             male
| Compton, Mrs Alexander Taylor (Mary Eliza Ingersoll)      female
| Fortune, Mr Mark                                            male

Is age a useful variable? Who had NAs for age?

tab(titanic,~ is.na(age))
| is.na(age)
| FALSE  TRUE Total
|   633   680  1313

Find out what happened to others, e.g. Benjamin Guggenheim, or the Strauses, who co-owned Macy’s.

Look at the data directory (or codebook) for these data frames at http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/

There are no recipes for examining data. The ‘best’ way depends on the nature of the data. You want to be able to see things you expect to see but you also want to have a good chance of finding things you didn’t expect.

When exploring data it’s always a challenge to avoid overinterpreting patterns that could have occurred by chance without missing those that are, in fact, meaningful. See Gunter and Tong (2017) for a current article on this problem.

It’s important to understand the nature of the data and the relationships and patterns that are expected before examining the data, to distinguish between findings that are confirmatory and those that are exploratory. In the best research this distinction is crucial and, increasingly, top journals in many medical fields require preregistration of research hypotheses before the data are gathered. Any findings regarding hypotheses that were not preregistered are treated as exploratory and are not considered confirmed until they have been the subject of a future study in which they had been preregistered.

Only a small portion of current research is subjected to these stringent requirements. It is nevertheless important to keep in mind that data dredging and p-hacking will produce reams of spurious results. See ‘Science isn’t broken: It’s just a hell of a lot harder than we give it credit for’, Aschwanden (2015).
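To get a feel for how easily data dredging produces spurious findings, here is a minimal simulation sketch. It is not part of the course data: the number of tests, the sample sizes, and the names `n_tests`, `p_values` and `false_positive_rate` are illustrative assumptions. Every comparison is between two samples of pure noise, so any ‘significant’ result is a false positive.

```r
# Hypothetical simulation: many two-sample t-tests on pure noise.
# There is no real effect anywhere, so every p < 0.05 is spurious.
set.seed(2017)
n_tests <- 1000
p_values <- replicate(n_tests, t.test(rnorm(20), rnorm(20))$p.value)

# Proportion of tests that look 'significant' at the 5% level
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate

# Number of 'findings' a determined data dredger could report
sum(p_values < 0.05)
```

Roughly one test in twenty should come out ‘significant’, which is why an unplanned search through many variables or subgroups is almost guaranteed to turn something up.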
This file is divided in four parts:

• univariate including conditional displays:
  – numerical: histograms, quantile-comparisons, non-parametric density estimation, boxplots
  – categorical: barcharts
• bivariate numerical
• multivariate numerical and categorical
• data with special structures: longitudinal data

We necessarily only scratch the surface. Specialized methods are needed in special applications.

Continuous Univariate

We will use the United Nations data set on infant mortality and GDPperCapita and the Titanic passenger data set. First, think of the questions you would be interested in exploring with these data. Google ‘RMS Titanic’ if you aren’t familiar with the story of the Titanic.

Quick look

library(spida2)
library(car)
library(latticeExtra)

Quick uniform quantile plots and normal quantile plots

These quick quantile plots are suitable for exploration and analysis but should not be used in a polished report.
xqplot(un) # uniform quantile plot

Figure: Uniform quantile plots for United Nations infant mortality data in 1998
xqplot(un, ptype = 'normal') # normal quantile plot

Figure: Normal quantile plots for United Nations infant mortality data in 1998
xqplot(titanic) # uniform quantile plot

Figure: Uniform quantile plots for passengers on the Titanic
xqplot(titanic, ptype = 'normal') # normal quantile plot

Figure: Normal quantile plots for passengers on the Titanic

A normal quantile plot for a normally distributed variable will be close to a diagonal. Look carefully to see whether the data are plotted on the x or on the y axis. With a normal quantile plot, if the data are on the y axis, then an S-shaped curve implies a platykurtic distribution (review) and a reverse-S-shaped curve implies a leptokurtic distribution. Sample means of highly leptokurtic distributions converge to normality much more slowly than those of platykurtic distributions.

Question: How large a sample does it take for the sample mean from a normal distribution to converge to normality?

Look at the quantile plot to see whether a numeric distribution is quite continuous or whether the values are clumped.

Beware: Missing values are often coded with a value like 99 or 999. If you see a clump of unusual values or outliers, you must question why before analyzing the data. To use a missing value code as if it were a valid numerical value is a very serious error . . . like a surgeon amputating the wrong leg!
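The damage done by a missing value code can be seen in a small made-up example. The vector `age_raw`, the code 999, and the inline text table are all hypothetical; the sketch only uses base R, including the `na.strings` argument of `read.table`.

```r
# Hypothetical ages in which 999 was used as a missing value code
age_raw <- c(23, 35, 999, 41, 29, 999, 52)

# Treating the code as a real value badly distorts the mean
mean_wrong <- mean(age_raw)

# Recode 999 to NA, then summarize only the observed values
age <- replace(age_raw, age_raw == 999, NA)
mean_right <- mean(age, na.rm = TRUE)
c(wrong = mean_wrong, right = mean_right)

# The recoding can also be done while reading the data:
# na.strings lists the strings that read.table turns into NA
txt <- "id age
1 23
2 999
3 41"
d <- read.table(text = txt, header = TRUE, na.strings = c("NA", "999"))
d
```

Note that `na.strings` applies to every column, so it is only safe when the code cannot occur as a legitimate value anywhere in the file.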

Numerical variables

Using the histogram function in lattice

The following illustrates panels defined by the variable following | in the one-sided formula in the ‘histogram’ function. To use the ‘histogram’ function:

• we specify the variable name in a one-sided formula, e.g. ~ age,
• we specify variables that define panels after a vertical bar in the formula, e.g. ~ age | sex,
• we can specify colors, etc. through parameters of the histogram function or by setting global parameters. The gd function in spida2 is intended to make this easy. Later, we will see the benefits of using gd instead of using graphic function parameters.

head(titanic)
|                                                 survived     age
| Allen, Miss Elisabeth Walton                         yes 29.0000
| Allison, Miss Helen Loraine                           no  2.0000
| Allison, Mr Hudson Joshua Creighton                   no 30.0000
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)       no 25.0000
| Allison, Master Hudson Trevor                        yes  0.9167
| Anderson, Mr Harry                                   yes 47.0000
|                                                 passengerClass    sex
| Allen, Miss Elisabeth Walton                               1st female
| Allison, Miss Helen Loraine                                1st female
| Allison, Mr Hudson Joshua Creighton                        1st   male
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)            1st female
| Allison, Master Hudson Trevor                              1st   male
| Anderson, Mr Harry                                         1st   male
histogram(~age, titanic)

histogram(~age, titanic, col = 'gray', border = 'black')

histogram(~age| sex, titanic, col = 'gray', border= 'black')

The ‘gd’ function in spida2 helps to set plotting parameters

For the histogram function, the global parameter that controls color is ‘plot.polygon’.
gd()
gd(plot.polygon = list(col = 'darkred', border = 'black'))
# this is the parameter that changes the defaults for bars in histograms
# -- no logic to this, just something to note and remember
histogram(~age| sex, titanic , layout = c(1,2))
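For readers without spida2, the same kind of global setting can be sketched with lattice alone, using `trellis.par.get` and `trellis.par.set`. This assumes that `gd` ultimately changes these lattice settings, which is an assumption about spida2 rather than something stated in these notes. The data frame `d` is simulated and purely illustrative.

```r
library(lattice)

# Inspect the current defaults for polygons (the bars of a histogram)
old <- trellis.par.get("plot.polygon")

# Change the defaults globally; later histograms pick them up
trellis.par.set(plot.polygon = list(col = "darkred", border = "black"))
new <- trellis.par.get("plot.polygon")
new$col

# Simulated ages for two groups (hypothetical data, not the Titanic file)
set.seed(1)
d <- data.frame(age = rnorm(200, mean = 30, sd = 12),
                sex = rep(c("female", "male"), times = 100))

# No col or border arguments: the bars use the global settings
p <- histogram(~ age | sex, d, layout = c(1, 2))
print(p)
```

Setting the parameter once keeps the plotting calls short and makes a whole series of graphs consistent.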

histogram(~age| sex* passengerClass, titanic)

histogram(~age| sex* passengerClass, titanic) %>% useOuterStrips

Note how variable names after the conditioning bar, |, in the formula define panels.
histogram(~ infantMortality, un)

histogram(~ infantMortality| region, un)

histogram(~ infantMortality| region+ cut(GDPperCapita,2), un)

histogram(~ infantMortality| region+ cut(GDPperCapita,2, labels = c('Low','High')), un)

Non-parametric densities

With the densityplot function, as with most lattice functions, you can identify subgroups with:

• panels (argument following the | in the formula),
• the groups argument,
• or both.

Often, you want to experiment to find the more effective way of using groups and panels. ‘densityplot’ is a ‘full-featured’ lattice function that can take a panels argument (after the | in the formula) and a groups argument. Different colors, line types, point types, etc. are used to distinguish different groups within each panel. Many of the parameters can be set with the gd() function in ‘spida2’.
densityplot(~age , titanic)

densityplot(~age| passengerClass , titanic) # panels

# densities are easier to compare when they are lined up:
densityplot(~age| passengerClass , titanic, layout = c(1,3))

densityplot(~age| passengerClass* sex , titanic)

# pick a blue you like:
pals('blue') # too many

# just blues without a number at the end
# the following regular expression specifies color names that
# contain the substring blue AND do not end with an integer 1 to 9
pals('*blue*[^1-9]$')

pals('*red*[^1-9]$')

gd(col = 'darkblue', lwd =2, lty =2)
# you can set graphics parameters with `gd_`
# `lwd` stands for 'line width', `lty` for 'line type'
# Use `?par` for a list of common graphical parameters
densityplot(~age| passengerClass* sex , titanic) %>% useOuterStrips

Using panels puts different subgroups in different panels; using groups plots different subgroups within each panel.
densityplot(~age| passengerClass, titanic, groups = sex)

gd(2, lty =1, lwd =2) # to set colors for groups, first argument is number of colors
densityplot(~age, titanic, groups = sex)

gd(col = c('blue','red')) # you can also list the colors
densityplot(~age, titanic, groups = sex, auto.key = TRUE)

densityplot(~age, titanic, groups = sex, auto.key = list(space='right'))

Using panels and groups

densityplot(~age| passengerClass* survived, titanic, groups = sex,
            layout = c(2,3), xlim = c(0,80),
            auto.key = list(space='right')) %>% useOuterStrips

Improving information in strips:
titanic$Survived <- paste0('Survived: ', titanic$survived)
titanic$Class <- paste0('Class: ', titanic$passengerClass)
densityplot(~age| Class* Survived, titanic, groups = sex,
            layout = c(2,3), xlim = c(0,80),
            auto.key = list(title='sex',space='right')) %>% useOuterStrips

Combining histogram and densityplot

histogram(~age, titanic, col = 'light blue', border = 'blue', type = 'density')+
  densityplot(~age, titanic, col = 'black', lwd =2)

histogram(~age| passengerClass, titanic, col = 'light blue', border = 'blue',
          type = 'density', layout = c(1,3))+
  densityplot(~age| passengerClass, titanic, col = 'black', lwd =2)

Quantile comparison plots

These plots compare the empirical distribution of a variable with a theoretical distribution or with the empirical distribution of another variable.
qqmath(~ age, titanic)

qqmath(~ age| passengerClass, titanic)

qqmath(~ age| passengerClass, titanic, groups = sex, pch =1)+
  layer(panel.qqmathline(...)) # normal by default

qqmath(~ GDPperCapita| region, un)

qqmath(~ infantMortality| region, un)
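The earlier remark about S-shaped and reverse-S-shaped normal quantile plots can be checked on simulated data. The sketch below uses base graphics (`qqnorm` and `qqline`, which put the data on the y axis) rather than lattice, and the helper `upper_tail_gap` is a hypothetical function written for this illustration. It measures how far the 99th percentile of the sample lies above or below the reference line through the quartiles.

```r
# Simulated data (illustrative only, not from the course data sets)
set.seed(4330)
n <- 2000
platykurtic <- runif(n)       # light tails
leptokurtic <- rt(n, df = 3)  # heavy tails

# Signed gap between an upper sample quantile and the reference line
# that qqline() draws through the first and third quartiles.
upper_tail_gap <- function(x, p = 0.99) {
  probs <- c(0.25, 0.75)
  qx <- quantile(x, probs, names = FALSE)
  slope <- diff(qx) / diff(qnorm(probs))
  intercept <- qx[1] - slope * qnorm(probs[1])
  quantile(x, p, names = FALSE) - (intercept + slope * qnorm(p))
}

gap_platy <- upper_tail_gap(platykurtic)  # negative: upper tail bends below the line
gap_lepto <- upper_tail_gap(leptokurtic)  # positive: upper tail bends above the line
c(platykurtic = gap_platy, leptokurtic = gap_lepto)

# Side-by-side normal quantile plots with the data on the y axis
op <- par(mfrow = c(1, 2))
qqnorm(platykurtic, main = "Uniform (platykurtic)"); qqline(platykurtic)
qqnorm(leptokurtic, main = "t, 3 df (leptokurtic)"); qqline(leptokurtic)
par(op)
```

With the data on the y axis, the uniform sample flattens out at both ends (an S shape) while the t sample shoots away from the line at both ends (a reverse S).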

Boxplots

Boxplots display only the medians, quartiles and extreme points.

• The groups argument does not work, but you can use an ‘x’ or ‘y’ variable in the formula to create side-by-side boxplots.
• Note: ‘bwplot’ stands for ‘box and whisker’ plot.

bwplot(~ age, titanic)

bwplot(age~ sex, titanic)

bwplot(sex~ age, titanic)

bwplot(age~ sex| passengerClass, titanic)

Questions: Review the construction of boxplots, also known as box-and-whisker plots. What are inner and outer fences? How can you identify the IQR from the boxplot?
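As a starting point for these review questions, the fences can be computed by hand and compared with what base R flags. The vector `x` is made up for illustration. The sketch follows Tukey's convention, which `boxplot.stats` also uses: the box runs between the hinges from `fivenum`, inner fences lie 1.5 box lengths beyond the hinges and outer fences 3 box lengths beyond.

```r
# Made-up ages with one low and one high extreme value
x <- c(2, 18, 21, 24, 25, 28, 30, 33, 36, 41, 71)

# Five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
hinges <- five[c(2, 4)]

# The length of the box is the spread between the hinges (the IQR)
iqr <- diff(hinges)

# Inner fences: 1.5 box lengths beyond the hinges
inner_fences <- c(hinges[1] - 1.5 * iqr, hinges[2] + 1.5 * iqr)
# Outer fences: 3 box lengths beyond the hinges
outer_fences <- c(hinges[1] - 3 * iqr, hinges[2] + 3 * iqr)

# Points beyond the inner fences are drawn individually
beyond_inner <- x[x < inner_fences[1] | x > inner_fences[2]]
# Points beyond the outer fences are 'far out'
beyond_outer <- x[x < outer_fences[1] | x > outer_fences[2]]

# Compare with the points that base R flags
flagged <- boxplot.stats(x)$out
rbind(inner_fences, outer_fences)
beyond_inner
flagged

boxplot(x, horizontal = TRUE)
```

On the plot, the IQR is simply the length of the box, and the whiskers stop at the most extreme observations that are still inside the inner fences.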

Categorical variables

Purely categorical data

Purely categorical data, i.e. a data set all of whose variables are categorical, can be represented in three equivalent ways (you can go back and forth among them). An example of categorical data would be the Titanic data omitting the age variable, or the United Nations data if we created ordered categories for some continuous variables.
titan_cat <- titanic[ ,- grep('age', names(titanic))]
rownames(titan_cat) <- NULL
head(titan_cat)
|   survived passengerClass    sex      Survived      Class
| 1      yes            1st female Survived: yes Class: 1st
| 2       no            1st female  Survived: no Class: 1st
| 3       no            1st   male  Survived: no Class: 1st
| 4       no            1st female  Survived: no Class: 1st
| 5      yes            1st   male Survived: yes Class: 1st
| 6      yes            1st   male Survived: yes Class: 1st
head(un)
|                region  tfr contraception educationMale educationFemale
| Afghanistan      Asia 6.90            NA            NA              NA
| Albania        Europe 2.60            NA            NA              NA
| Algeria        Africa 3.81            52          11.1             9.9
| American.Samoa   Asia   NA            NA            NA              NA
| Andorra        Europe   NA            NA            NA              NA
| Angola         Africa 6.69            NA            NA              NA
|                lifeMale lifeFemale infantMortality GDPperCapita
| Afghanistan        45.0       46.0             154         2848
| Albania            68.0       74.0              32          863
| Algeria            67.5       70.3              44         1531
| American.Samoa     68.0       73.0              11           NA
| Andorra              NA         NA              NA           NA
| Angola             44.9       48.1             124          355
|                economicActivityMale economicActivityFemale illiteracyMale
| Afghanistan                    87.5                    7.2         52.800
| Albania                          NA                     NA             NA
| Algeria                        76.4                    7.8         26.100
| American.Samoa                 58.8                   42.4          0.264
| Andorra                          NA                     NA             NA
| Angola                           NA                     NA             NA
|                illiteracyFemale        country
| Afghanistan               85.00    Afghanistan
| Albania                      NA        Albania
| Algeria                   51.00        Algeria
| American.Samoa             0.36 American.Samoa
| Andorra                      NA        Andorra
| Angola                       NA         Angola
qqmath(~infantMortality, un)

quantile(un$infantMortality, na.rm = TRUE)
|   0%  25%  50%  75% 100%
|    2   12   30   66  169
quantile(un$infantMortality, na.rm = TRUE, probs = c(0,33,66,100)/100)
|   0%  33%  66% 100%
|    2   16   51  169
quantile(un$infantMortality, na.rm = TRUE, probs = c(0,33,66,100)/100 ) -> qs
un$infantMortality_level <- cut(un$infantMortality, breaks = qs,
                                include.lowest = TRUE,
                                labels = c('Low','Middle','High'))
tab(~infantMortality_level, un) # note that the order is not lexicographical
| infantMortality_level
|    Low Middle   High   <NA>  Total
|     68     66     67      6    207

Creating a categorical variable from a numeric variable:
un$GDP_Q <- cut(un$GDPperCapita,
                breaks = quantile(un$GDPperCapita, na.rm = TRUE), # quartiles by default
                include.lowest = TRUE,
                labels = paste0("Q",1:4))
tab(~GDP_Q, un)
| GDP_Q
|    Q1    Q2    Q3    Q4  <NA> Total
|    50    49    49    49    10   207
tab(~GDP_Q+ I(is.na(GDPperCapita)), un)
|         I(is.na(GDPperCapita))
| GDP_Q   FALSE TRUE Total
|   Q1       50    0    50
|   Q2       49    0    49
|   Q3       49    0    49
|   Q4       49    0    49
|   <NA>      0   10    10
|   Total   197   10   207

Question: What happens if you specify include.lowest = FALSE in cut and then run the tab function above?
tab(~GDP_Q+ infantMortality_level, un)
|         infantMortality_level
| GDP_Q   Low Middle High <NA> Total
|   Q1      1      9   40    0    50
|   Q2      6     25   18    0    49
|   Q3     16     26    6    1    49
|   Q4     40      5    1    3    49
|   <NA>    5      1    2    2    10
|   Total  68     66   67    6   207
tab(~GDP_Q+ infantMortality_level, un, useNA = 'no')
|         infantMortality_level
| GDP_Q   Low Middle High Total
|   Q1      1      9   40    50
|   Q2      6     25   18    49
|   Q3     16     26    6    48
|   Q4     40      5    1    46
|   Total  63     65   65   193
tab(~GDP_Q+ infantMortality_level, un, useNA = 'no', pct =1) %>% round(1)
|         infantMortality_level
| GDP_Q    Low Middle High Total
|   Q1     2.0   18.0 80.0 100.0
|   Q2    12.2   51.0 36.7 100.0
|   Q3    33.3   54.2 12.5 100.0
|   Q4    87.0   10.9  2.2 100.0
|   All   32.6   33.7 33.7 100.0
un_cat <- un[, grep('level|Q$|region', names(un))]
head(un_cat)
|                region infantMortality_level GDP_Q
| Afghanistan      Asia                  High    Q3
| Albania        Europe                Middle    Q2
| Algeria        Africa                Middle    Q2
| American.Samoa   Asia                   Low  <NA>
| Andorra        Europe                  <NA>  <NA>
| Angola         Africa                  High    Q1
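The binning step used above can be tried on a small vector, which is also one way to check your answer to the question about `include.lowest`. The vector `x` and the object names are made up for illustration; only base R is used, so `table` stands in for `tab`.

```r
# Made-up values to be split into three groups of equal size
x <- c(3, 8, 1, 12, 20, 7, 15, 30, 25)

# Breaks at the minimum, the two tertiles and the maximum
qs <- quantile(x, probs = c(0, 1/3, 2/3, 1))

# Intervals are open on the left by default: (a, b].
# include.lowest = TRUE closes the first interval so the minimum is kept.
with_lowest <- cut(x, breaks = qs, include.lowest = TRUE,
                   labels = c('Low', 'Middle', 'High'))

# Same call with the default include.lowest = FALSE
without_lowest <- cut(x, breaks = qs,
                      labels = c('Low', 'Middle', 'High'))

table(with_lowest, useNA = 'ifany')
table(without_lowest, useNA = 'ifany')

# Which value was lost?
x[is.na(without_lowest)]
```

Because the labels are given in order, the factor levels come out as Low, Middle, High rather than alphabetically, which is why the tables above are not in lexicographical order.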

Formats for purely categorical data

The 3 formats for representing categorical data are:

1. Data frame with one row per case. The ‘raw’ titanic and un data are examples.
2. Frequency data frame with one row per combination of levels of variables and a frequency variable to record the number of cases with that particular combination of levels.
3. Contingency table, which is an array containing frequencies each of whose dimensions corresponds to a variable and the levels correspond to the levels of the variable.

The un_cat and titan_cat data frames are in format ‘1’. In general, if a data set includes continuous variables, then format ‘1’ is the only format in which it can be represented without losing information on the continuous variable.

The tab_ function easily transforms format ‘1’ to format ‘3’:
(un_tab <- tab_(un_cat)) # note how '(...)' cause the assigned value to be printed
| , , GDP_Q = Q1
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      1   30    0
|   America   0      0    1    0
|   Asia      0      6    9    0
|   Europe    1      2    0    0
|   Oceania   0      0    0    0
|
| , , GDP_Q = Q2
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      3   11    0
|   America   1      7    2    0
|   Asia      1      7    3    0
|   Europe    4      5    0    0
|   Oceania   0      3    2    0
|
| , , GDP_Q = Q3
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    2      2    4    0
|   America   6     13    0    1
|   Asia      1      6    2    0
|   Europe    6      1    0    0
|   Oceania   1      4    0    0
|
| , , GDP_Q = Q4
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    1      0    0    0
|   America   7      2    0    0
|   Asia      8      2    1    0
|   Europe   21      0    0    3
|   Oceania   3      1    0    0
|
| , , GDP_Q = NA
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      0    1    0
|   America   1      0    0    0
|   Asia      2      1    1    0
|   Europe    0      0    0    1
|   Oceania   2      0    0    1
(titan_tab <- tab_(titan_cat))
| , , sex = female, Survived = Survived: no, Class = Class: 1st
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       9   0   0
|   yes      0   0   0
|
| , , sex = male, Survived = Survived: no, Class = Class: 1st
|
|          passengerClass
| survived 1st 2nd 3rd
|   no     120   0   0
|   yes      0   0   0
|
| , , sex = female, Survived = Survived: yes, Class = Class: 1st
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0   0   0
|   yes    134   0   0
|
| , , sex = male, Survived = Survived: yes, Class = Class: 1st
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0   0   0
|   yes     59   0   0
|
| , , sex = female, Survived = Survived: no, Class = Class: 2nd
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0  13   0
|   yes      0   0   0
|
| , , sex = male, Survived = Survived: no, Class = Class: 2nd
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0 148   0
|   yes      0   0   0
|
| , , sex = female, Survived = Survived: yes, Class = Class: 2nd
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0   0   0
|   yes      0  94   0
|
| , , sex = male, Survived = Survived: yes, Class = Class: 2nd
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0   0   0
|   yes      0  25   0
|
| , , sex = female, Survived = Survived: no, Class = Class: 3rd
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0   0 134
|   yes      0   0   0
|
| , , sex = male, Survived = Survived: no, Class = Class: 3rd
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0   0 440
|   yes      0   0   0
|
| , , sex = female, Survived = Survived: yes, Class = Class: 3rd
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0   0   0
|   yes      0   0  79
|
| , , sex = male, Survived = Survived: yes, Class = Class: 3rd
|
|          passengerClass
| survived 1st 2nd 3rd
|   no       0   0   0
|   yes      0   0  58

Note that ‘un_tab’ is an object of class ‘table’:
class(un_tab)
| [1] "table"

To list functions that have methods for this class, use:
methods(class='table')
|  [1] [             aperm         as.data.frame Axis          barchart
|  [6] cloud         coerce        contourplot   dotplot       head
| [11] initialize    levelplot     lines         plot          points
| [16] print         show          slotsFromS3   summary       tab
| [21] tail
| see '?methods' for accessing help and source code

Not all the methods are very useful but you can experiment and get help:
plot(un_tab) # this is a mosaic plot, however not a fair example

dotplot(un_tab)

dotplot(un_tab, auto.key = TRUE)

barchart(un_tab, auto.key = TRUE, col = heat.colors(5)) # note what happens to the legend

gd(plot.polygon=list(col= heat.colors(5)))
gd(superpose.polygon=list(col= heat.colors(5)))
un_tab
| , , GDP_Q = Q1
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      1   30    0
|   America   0      0    1    0
|   Asia      0      6    9    0
|   Europe    1      2    0    0
|   Oceania   0      0    0    0
|
| , , GDP_Q = Q2
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      3   11    0
|   America   1      7    2    0
|   Asia      1      7    3    0
|   Europe    4      5    0    0
|   Oceania   0      3    2    0
|
| , , GDP_Q = Q3
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    2      2    4    0
|   America   6     13    0    1
|   Asia      1      6    2    0
|   Europe    6      1    0    0
|   Oceania   1      4    0    0
|
| , , GDP_Q = Q4
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    1      0    0    0
|   America   7      2    0    0
|   Asia      8      2    1    0
|   Europe   21      0    0    3
|   Oceania   3      1    0    0
|
| , , GDP_Q = NA
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      0    1    0
|   America   1      0    0    0
|   Asia      2      1    1    0
|   Europe    0      0    0    1
|   Oceania   2      0    0    1
barchart(un_tab, auto.key = TRUE)

barchart(un_tab, auto.key = TRUE, horizontal = FALSE) # legend is reversed

barchart(un_tab, auto.key = list(space = 'right',reverse.rows = TRUE), horizontal = FALSE)

barchart(aperm(un_tab,c(3,1,2)), auto.key = TRUE)

gd(superpose.polygon=list(col= heat.colors(3)))
barchart(aperm(un_tab,c(3,1,2)), auto.key = TRUE)

Going from ‘3’ to ‘2’ is easy:
un_freq <- as.data.frame(un_tab)
head(un_freq) # note the new variable `Freq`
|    region infantMortality_level GDP_Q Freq
| 1  Africa                   Low    Q1    0
| 2 America                   Low    Q1    0
| 3    Asia                   Low    Q1    0
| 4  Europe                   Low    Q1    1
| 5 Oceania                   Low    Q1    0
| 6  Africa                Middle    Q1    1
(titan_freq <- as.data.frame(titan_tab)) %>% head
|   survived passengerClass    sex     Survived      Class Freq
| 1       no            1st female Survived: no Class: 1st    9
| 2      yes            1st female Survived: no Class: 1st    0
| 3       no            2nd female Survived: no Class: 1st    0
| 4      yes            2nd female Survived: no Class: 1st    0
| 5       no            3rd female Survived: no Class: 1st    0
| 6      yes            3rd female Survived: no Class: 1st    0

Going from ‘2’ to ‘1’ is conceptually easy. We need to replicate each row as many times as given by Freq, and then get rid of the ‘Freq’ variable so we don’t confuse the data frame with a frequency data frame. We can avoid a lot of typing by using pipes:
un_freq %>%
  .[ rep(1:nrow(.), .$Freq),] %>%
  .[,- grep("Freq", names(.))] -> un_df
head(un_freq)
|    region infantMortality_level GDP_Q Freq
| 1  Africa                   Low    Q1    0
| 2 America                   Low    Q1    0
| 3    Asia                   Low    Q1    0
| 4  Europe                   Low    Q1    1
| 5 Oceania                   Low    Q1    0
| 6  Africa                Middle    Q1    1
head(un_df)
|     region infantMortality_level GDP_Q
| 4   Europe                   Low    Q1
| 6   Africa                Middle    Q1
| 8     Asia                Middle    Q1
| 8.1   Asia                Middle    Q1
| 8.2   Asia                Middle    Q1
| 8.3   Asia                Middle    Q1

This was an advanced example of pipes. We’ll see many easier examples. Normally the ‘%>%’ operator ‘pipes’ the output of the left-hand side (lhs) as the first argument of the function on the right-hand side (rhs) of the pipe. Wherever there’s a ‘.’ on the rhs, the output of the lhs is substituted for the ‘.’. In RStudio, you can type ‘%>%’ by pressing Control-Shift-M.

Question: Can you explain what each line in the pipeline does and how it achieves the goal of turning a frequency data frame (with one row per combination of variable levels) into a data frame with one row per case?

Pipes are relatively new in R but the idea dates back to early Unix in the 1970s and many modern languages use related ideas extensively.
head(un_df)
|     region infantMortality_level GDP_Q
| 4   Europe                   Low    Q1
| 6   Africa                Middle    Q1
| 8     Asia                Middle    Q1
| 8.1   Asia                Middle    Q1
| 8.2   Asia                Middle    Q1
| 8.3   Asia                Middle    Q1
head(un_cat)
|                region infantMortality_level GDP_Q
| Afghanistan      Asia                  High    Q3
| Albania        Europe                Middle    Q2
| Algeria        Africa                Middle    Q2
| American.Samoa   Asia                   Low  <NA>
| Andorra        Europe                  <NA>  <NA>
| Angola         Africa                  High    Q1
# The order of rows might be different
all.equal(tab(un_df), tab(un_cat))
| [1] TRUE

With tab_, going from format ‘2’ to ‘3’ is easy:
head(un_freq)
|    region infantMortality_level GDP_Q Freq
| 1  Africa                   Low    Q1    0
| 2 America                   Low    Q1    0
| 3    Asia                   Low    Q1    0
| 4  Europe                   Low    Q1    1
| 5 Oceania                   Low    Q1    0
| 6  Africa                Middle    Q1    1
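The row-replication step in the pipeline above can be followed on a tiny hand-made frequency data frame, written here without pipes so each intermediate result can be inspected. The data frame `freq` and its values are hypothetical; only base R is used.

```r
# A small hypothetical frequency data frame (format '2')
freq <- data.frame(
  region = c("Africa", "Asia", "Europe"),
  level  = c("High", "Middle", "Low"),
  Freq   = c(2, 0, 3)
)

# Step 1: repeat each row index as many times as its frequency.
# Rows with Freq == 0 drop out because their index is repeated zero times.
idx <- rep(seq_len(nrow(freq)), freq$Freq)
idx

# Step 2: select rows by the repeated indices, one row per case
cases <- freq[idx, ]

# Step 3: drop the Freq column so the result is not mistaken
# for a frequency data frame
cases <- cases[, names(cases) != "Freq"]
cases

# Check: tabulating the cases recovers the non-zero frequencies
table(cases$region)
```

In the piped version, each ‘.’ stands for the data frame produced by the previous step, which is what `freq` and `cases` hold here.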

un_tab2 <- tab_(Freq~ region+ infantMortality_level+ GDP_Q, un_freq)
un_tab2
| , , GDP_Q = Q1
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      1   30    0
|   America   0      0    1    0
|   Asia      0      6    9    0
|   Europe    1      2    0    0
|   Oceania   0      0    0    0
|
| , , GDP_Q = Q2
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      3   11    0
|   America   1      7    2    0
|   Asia      1      7    3    0
|   Europe    4      5    0    0
|   Oceania   0      3    2    0
|
| , , GDP_Q = Q3
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    2      2    4    0
|   America   6     13    0    1
|   Asia      1      6    2    0
|   Europe    6      1    0    0
|   Oceania   1      4    0    0
|
| , , GDP_Q = Q4
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    1      0    0    0
|   America   7      2    0    0
|   Asia      8      2    1    0
|   Europe   21      0    0    3
|   Oceania   3      1    0    0
|
| , , GDP_Q = NA
|
|           infantMortality_level
| region    Low Middle High <NA>
|   Africa    0      0    1    0
|   America   1      0    0    0
|   Asia      2      1    1    0
|   Europe    0      0    0    1
|   Oceania   2      0    0    1
un_tab2 %>% tab_(useNA='no')
| , , GDP_Q = Q1
|
|           infantMortality_level
| region    Low Middle High
|   Africa    0      1   30
|   America   0      0    1
|   Asia      0      6    9
|   Europe    1      2    0
|   Oceania   0      0    0
|
| , , GDP_Q = Q2
|
|           infantMortality_level
| region    Low Middle High
|   Africa    0      3   11
|   America   1      7    2
|   Asia      1      7    3
|   Europe    4      5    0
|   Oceania   0      3    2
|
| , , GDP_Q = Q3
|
|           infantMortality_level
| region    Low Middle High
|   Africa    2      2    4
|   America   6     13    0
|   Asia      1      6    2
|   Europe    6      1    0
|   Oceania   1      4    0
|
| , , GDP_Q = Q4
|
|           infantMortality_level
| region    Low Middle High
|   Africa    1      0    0
|   America   7      2    0
|   Asia      8      2    1
|   Europe   21      0    0
|   Oceania   3      1    0
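The conversions among the three formats can also be sketched in base R, without spida2. The example below uses the `Titanic` table that ships with R in the datasets package. That is a different data set from the Titanic.txt file used in these notes: it is already a contingency table, includes the crew, and records age only as Child or Adult. Here `as.data.frame` and `xtabs` stand in for the roles played by `tab_` above.

```r
# Format '3': a contingency table (4-way array of counts) built into R
tab3 <- datasets::Titanic
dim(tab3)

# Format '3' to format '2': one row per combination of levels, plus Freq
freq2 <- as.data.frame(tab3)
head(freq2)

# Format '2' back to format '3': Freq on the left of the formula
back3 <- xtabs(Freq ~ Class + Sex + Age + Survived, data = freq2)

# The round trip recovers the original counts and labels
all(back3 == tab3)
isTRUE(all.equal(dimnames(back3), dimnames(tab3)))

# A marginal table: sum over Sex and Age, keep Class (1) and Survived (4)
class_by_survival <- margin.table(tab3, c(1, 4))
class_by_survival
```

The marginal table shows why ‘a table of a table’ is a natural idea: collapsing a contingency table over some of its dimensions gives another, smaller contingency table.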

Presenting categorical data

A standard way to summarize purely categorical data is to use tables. R has a number of functions to produce tables: table and xtabs being the two major ones. The spida2 package has a function tab that uses a formula argument. It simplifies the creation of proportion tables and percentage tables, which are proportion tables multiplied by 100. It also works on all three formats for categorical data. Thus, a table of a table is an easy way to produce a marginal table. (Think about what this could possibly mean!)

head(titanic)
|                                                 survived     age
| Allen, Miss Elisabeth Walton                         yes 29.0000
| Allison, Miss Helen Loraine                           no  2.0000
| Allison, Mr Hudson Joshua Creighton                   no 30.0000
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)       no 25.0000
| Allison, Master Hudson Trevor                        yes  0.9167
| Anderson, Mr Harry                                   yes 47.0000
|                                                 passengerClass    sex
| Allen, Miss Elisabeth Walton                               1st female
| Allison, Miss Helen Loraine                                1st female
| Allison, Mr Hudson Joshua Creighton                        1st   male
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)            1st female
| Allison, Master Hudson Trevor                              1st   male
| Anderson, Mr Harry                                         1st   male
|                                                      Survived      Class
| Allen, Miss Elisabeth Walton                    Survived: yes Class: 1st
| Allison, Miss Helen Loraine                      Survived: no Class: 1st
| Allison, Mr Hudson Joshua Creighton              Survived: no Class: 1st
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  Survived: no Class: 1st
| Allison, Master Hudson Trevor                   Survived: yes Class: 1st
| Anderson, Mr Harry                              Survived: yes Class: 1st

One-way tables:

tab(titanic, ~ survived)
| survived
|    no   yes Total
|   864   449  1313
tab(titanic, ~ survived, pr = 0) # proportion table (0 means proportions of entire table)
| survived
|       no      yes    Total
| 0.658035 0.341965 1.000000
tab(titanic, ~ survived, pct = 0) # percentage table
| survived
|       no      yes    Total
|  65.8035  34.1965 100.0000
round(tab(titanic, ~ survived, pct = 0), 1) # percentage table
| survived
|    no   yes Total
|  65.8  34.2 100.0

Here are some easier examples of pipes.

titanic %>% tab(~ passengerClass) # note how the pipe uses 'titanic' as the first argument of 'tab'
| passengerClass
|   1st   2nd   3rd Total
|   322   280   711  1313
titanic %>% tab(~ passengerClass, pct = 0) %>% round(1)
| passengerClass
|   1st   2nd   3rd Total
|  24.5  21.3  54.2 100.0

Two-way tables

Frequency table:

titanic %>% tab(~ passengerClass + survived)
|               survived
| passengerClass  no yes Total
|   1st          129 193   322
|   2nd          161 119   280
|   3rd          574 137   711
|   Total        864 449  1313

Table percentages:

titanic %>% tab(~ passengerClass + survived, pct = 0) %>% round(1)
|               survived
| passengerClass   no  yes Total
|   1st           9.8 14.7  24.5
|   2nd          12.3  9.1  21.3
|   3rd          43.7 10.4  54.2
|   Total        65.8 34.2 100.0

Row percentages:

titanic %>% tab(~ passengerClass + survived, pct = 1) %>% round(1)
|               survived
| passengerClass   no  yes Total
|   1st          40.1 59.9 100.0
|   2nd          57.5 42.5 100.0
|   3rd          80.7 19.3 100.0
|   All          65.8 34.2 100.0

Column percentages:

titanic %>% tab(~ passengerClass + survived, pct = 2) %>% round(1)
|               survived
| passengerClass    no   yes   All
|   1st           14.9  43.0  24.5
|   2nd           18.6  26.5  21.3
|   3rd           66.4  30.5  54.2
|   Total        100.0 100.0 100.0

Question: Do the previous two tables give roughly equivalent information? How would you describe in words the information each table gives? Which is more ‘interesting’ in some sense?

Three+-way tables:

titanic %>% tab(~ passengerClass + survived + sex)
| , , sex = female
|
|               survived
| passengerClass  no yes Total
|   1st            9 134   143
|   2nd           13  94   107
|   3rd          134  79   213
|   Total        156 307   463
|
| , , sex = male
|
|               survived
| passengerClass  no yes Total
|   1st          120  59   179
|   2nd          148  25   173
|   3rd          440  58   498
|   Total        708 142   850
|
| , , sex = Total
|
|               survived
| passengerClass  no yes Total
|   1st          129 193   322
|   2nd          161 119   280
|   3rd          574 137   711
|   Total        864 449  1313

‘Flat’ 3+ way table:

titanic %>% tab(~ passengerClass + survived + sex) %>% ftable
|                         sex female male Total
| passengerClass survived
| 1st            no                9  120   129
|                yes             134   59   193
|                Total           143  179   322
| 2nd            no               13  148   161
|                yes              94   25   119
|                Total           107  173   280
| 3rd            no              134  440   574
|                yes              79   58   137
|                Total           213  498   711
| Total          no              156  708   864
|                yes             307  142   449
|                Total           463  850  1313
titanic %>%
  tab(~ passengerClass + sex + survived) %>% # last variable is the col var
  ftable
|                       survived  no yes Total
| passengerClass sex
| 1st            female            9 134   143
|                male            120  59   179
|                Total           129 193   322
| 2nd            female           13  94   107
|                male            148  25   173
|                Total           161 119   280
| 3rd            female          134  79   213
|                male            440  58   498
|                Total           574 137   711
| Total          female          156 307   463
|                male            708 142   850
|                Total           864 449  1313
titanic %>% tab(~ passengerClass + sex + survived, pct = c(1,2)) # not very effective
| , , survived = no
|
|               sex
| passengerClass     female       male        All
|   1st            6.293706  67.039106  40.062112
|   2nd           12.149533  85.549133  57.500000
|   3rd           62.910798  88.353414  80.731364
|   All           33.693305  83.294118  65.803503
|
| , , survived = yes
|
|               sex
| passengerClass     female       male        All
|   1st           93.706294  32.960894  59.937888
|   2nd           87.850467  14.450867  42.500000
|   3rd           37.089202  11.646586  19.268636
|   All           66.306695  16.705882  34.196497
|
| , , survived = Total
|
|               sex
| passengerClass     female       male        All
|   1st          100.000000 100.000000 100.000000
|   2nd          100.000000 100.000000 100.000000
|   3rd          100.000000 100.000000 100.000000
|   All          100.000000 100.000000 100.000000
titanic %>%
  tab(~ passengerClass + sex + survived, pct = c(1,2)) %>%
  ftable # but good as a flat table
|                       survived        no       yes      Total
| passengerClass sex
| 1st            female            6.293706 93.706294 100.000000
|                male             67.039106 32.960894 100.000000
|                All              40.062112 59.937888 100.000000
| 2nd            female           12.149533 87.850467 100.000000
|                male             85.549133 14.450867 100.000000
|                All              57.500000 42.500000 100.000000
| 3rd            female           62.910798 37.089202 100.000000
|                male             88.353414 11.646586 100.000000
|                All              80.731364 19.268636 100.000000
| All            female           33.693305 66.306695 100.000000
|                male             83.294118 16.705882 100.000000
|                All              65.803503 34.196497 100.000000
titanic %>%
  tab_(~ passengerClass + sex + survived, pct = c(1,2)) %>% # removing 'Total'
  ftable
|                       survived        no       yes
| passengerClass sex
| 1st            female            6.293706 93.706294
|                male             67.039106 32.960894
|                All              40.062112 59.937888
| 2nd            female           12.149533 87.850467
|                male             85.549133 14.450867
|                All              57.500000 42.500000
| 3rd            female           62.910798 37.089202
|                male             88.353414 11.646586
|                All              80.731364 19.268636
| All            female           33.693305 66.306695
|                male             83.294118 16.705882
|                All              65.803503 34.196497
titanic %>%
  tab__(~ passengerClass + sex + survived, pct = c(1,2)) %>% # removing 'Total' and 'All'
  ftable %>%
  round(1)
|                       survived   no  yes
| passengerClass sex
| 1st            female           6.3 93.7
|                male            67.0 33.0
| 2nd            female          12.1 87.9
|                male            85.5 14.5
| 3rd            female          62.9 37.1
|                male            88.4 11.6

We see that the survival rate among women is much higher than among men and that survival rate is strongly related to class.
There’s a big drop going from 1st class to 2nd class among men. Among women, the marked drop occurs going from 2nd to 3rd class.

A limitation of this analysis is that it does not take age into account. The proportion of children could have been quite different among the two sexes and the different classes. A higher survival rate among children could account for some of the pattern shown in these tables. The age variable does not provide a good solution to control for age since it is missing for such a high percentage (namely 51.8%) of the passengers.

Note: This was an example of an ‘inline’ R expression. Another example: R even knows that 2 + 2 is 4. This information was the second element of the vector:

tab(titanic, ~ is.na(age), pct = 0)
| is.na(age)
|     FALSE      TRUE     Total
|  48.21021  51.78979 100.00000

Note that the tab function can use a table as input:

tt <- tab(titanic, ~ sex + survived)
tt
|         survived
| sex       no yes Total
|   female 156 307   463
|   male   708 142   850
|   Total  864 449  1313
tab(tt, ~ sex) # but this doesn't make sense with a table that includes Totals
| sex
| female   male  Total  Total
|    926   1700   2626   5252
tab_(tt) # removes totals
|         survived
| sex       no yes
|   female 156 307
|   male   708 142
tab_(tt) %>% tab(~ sex)
| sex
| female   male  Total
|    463    850   1313

Same output as

tab(titanic, ~ sex)
| sex
| female   male  Total
|    463    850   1313
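For comparison, the same ideas can be sketched with base R alone, which may help with the question of what a ‘table of a table’ means. This is an illustration on a made-up data frame called toy, not the titanic data, and it uses xtabs, addmargins, prop.table and margin.table rather than tab.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  sex      = c("female", "female", "male", "male", "male", "female"),
  survived = c("yes", "yes", "no", "no", "yes", "no")
)

tt <- xtabs(~ sex + survived, data = toy)    # two-way frequency table
addmargins(tt)                               # the same table with totals added
round(100 * prop.table(tt, margin = 1), 1)   # row percentages
margin.table(tt, margin = 1)                 # marginal table for sex

# Re-tabulating a table that already contains totals counts every case
# once in the cells, once in each margin and once in the grand total,
# so a two-way table with margins sums to 4 times the number of cases.
sum(addmargins(tt)) / sum(tt)
```

The last line mirrors what happened with tab(tt, ~ sex) above, where the grand total of 5252 is 4 times 1313.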

Visualizing categorical data

The barchart is a graphical display that helps visualize the information in a table. Barcharts work best on frequency, proportion and percentage tables. The following uses a lot of material from the markdown example in the first class.

Frequency table:

some(titanic)
|                                            survived age passengerClass
| Carter, Master William T. II                    yes  11            1st
| Smith, Mr Richard William                        no  NA            1st
| Widener, Mr George Dunton                        no  50            1st
| Becker, Miss Marion Louise                      yes   4            2nd
| Davis, Master John Morgan                       yes   8            2nd
| Renouf, Mrs Peter Henry (Lillian Jefferys)      yes  30            2nd
| Goldsmith, Mr Nathan                             no  41            3rd
| Lindblom, Miss Augusta Charlotta                 no  NA            3rd
| Morley, Mr Henry Samuel                          no  NA            3rd
| Peter (Joseph), Mrs Catherine                    no  NA            3rd
|                                               sex      Survived      Class
| Carter, Master William T. II                 male Survived: yes Class: 1st
| Smith, Mr Richard William                    male  Survived: no Class: 1st
| Widener, Mr George Dunton                    male  Survived: no Class: 1st
| Becker, Miss Marion Louise                 female Survived: yes Class: 2nd
| Davis, Master John Morgan                    male Survived: yes Class: 2nd
| Renouf, Mrs Peter Henry (Lillian Jefferys) female Survived: yes Class: 2nd
| Goldsmith, Mr Nathan                         male  Survived: no Class: 3rd
| Lindblom, Miss Augusta Charlotta           female  Survived: no Class: 3rd
| Morley, Mr Henry Samuel                      male  Survived: no Class: 3rd
| Peter (Joseph), Mrs Catherine              female  Survived: no Class: 3rd
dim(titanic)
| [1] 1313    6
tab(titanic, ~ survived + sex + passengerClass)
| , , passengerClass = 1st
|
|         sex
| survived female male Total
|   no          9  120   129
|   yes       134   59   193
|   Total     143  179   322
|
| , , passengerClass = 2nd
|
|         sex
| survived female male Total
|   no         13  148   161
|   yes        94   25   119
|   Total     107  173   280
|
| , , passengerClass = 3rd
|
|         sex
| survived female male Total
|   no        134  440   574
|   yes        79   58   137
|   Total     213  498   711
|
| , , passengerClass = Total
|
|         sex
| survived female male Total
|   no        156  708   864
|   yes       307  142   449
|   Total     463  850  1313
tab_(titanic, ~ survived + sex + passengerClass) # No totals
| , , passengerClass = 1st
|
|         sex
| survived female male
|   no          9  120
|   yes       134   59
|
| , , passengerClass = 2nd
|
|         sex
| survived female male
|   no         13  148
|   yes        94   25
|
| , , passengerClass = 3rd
|
|         sex
| survived female male
|   no        134  440
|   yes        79   58

Percentage within each ‘sex by passengerClass’ grouping:

tab(titanic, ~ survived + sex + passengerClass, pct = c(2,3))
| , , passengerClass = 1st
|
|         sex
| survived     female       male        All
|   no       6.293706  67.039106  40.062112
|   yes     93.706294  32.960894  59.937888
|   Total  100.000000 100.000000 100.000000
|
| , , passengerClass = 2nd
|
|         sex
| survived     female       male        All
|   no      12.149533  85.549133  57.500000
|   yes     87.850467  14.450867  42.500000
|   Total  100.000000 100.000000 100.000000
|
| , , passengerClass = 3rd
|
|         sex
| survived     female       male        All
|   no      62.910798  88.353414  80.731364
|   yes     37.089202  11.646586  19.268636
|   Total  100.000000 100.000000 100.000000
|
| , , passengerClass = All
|
|         sex
| survived     female       male        All
|   no      33.693305  83.294118  65.803503
|   yes     66.306695  16.705882  34.196497
|   Total  100.000000 100.000000 100.000000
tab(titanic, ~ survived + sex + passengerClass, pct = c(2,3)) %>% round(1) # nicer
| , , passengerClass = 1st
|
|         sex
| survived female  male   All
|   no        6.3  67.0  40.1
|   yes      93.7  33.0  59.9
|   Total   100.0 100.0 100.0
|
| , , passengerClass = 2nd
|
|         sex
| survived female  male   All
|   no       12.1  85.5  57.5
|   yes      87.9  14.5  42.5
|   Total   100.0 100.0 100.0
|
| , , passengerClass = 3rd
|
|         sex
| survived female  male   All
|   no       62.9  88.4  80.7
|   yes      37.1  11.6  19.3
|   Total   100.0 100.0 100.0
|
| , , passengerClass = All
|
|         sex
| survived female  male   All
|   no       33.7  83.3  65.8
|   yes      66.3  16.7  34.2
|   Total   100.0 100.0 100.0
tab_(titanic, ~ survived + sex + passengerClass, pct = c(2,3)) %>% round(1) # No Total
| , , passengerClass = 1st
|
|         sex
| survived female male  All
|   no        6.3 67.0 40.1
|   yes      93.7 33.0 59.9
|
| , , passengerClass = 2nd
|
|         sex
| survived female male  All
|   no       12.1 85.5 57.5
|   yes      87.9 14.5 42.5
|
| , , passengerClass = 3rd
|
|         sex
| survived female male  All
|   no       62.9 88.4 80.7
|   yes      37.1 11.6 19.3
|
| , , passengerClass = All
|
|         sex
| survived female male  All
|   no       33.7 83.3 65.8
|   yes      66.3 16.7 34.2

tab__(titanic, ~ survived + sex + passengerClass, pct = c(2,3)) %>% round(1) # No Total and no All: just the cells of the table
| , , passengerClass = 1st
|
|         sex
| survived female male
|   no        6.3 67.0
|   yes      93.7 33.0
|
| , , passengerClass = 2nd
|
|         sex
| survived female male
|   no       12.1 85.5
|   yes      87.9 14.5
|
| , , passengerClass = 3rd
|
|         sex
| survived female male
|   no       62.9 88.4
|   yes      37.1 11.6
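The idea of percentages within each ‘sex by class’ grouping can also be sketched in base R with prop.table, whose margin argument names the dimensions that are held fixed. The sketch below uses R's built-in Titanic table, which is a different data set from the titanic data frame above (it includes the crew and has 2201 people), so the numbers will not match the tables in these notes.

```r
# Built-in 4-way table with dimensions Class, Sex, Age, Survived.
# Collapse over Age and put Survived first: Survived x Sex x Class.
tt <- margin.table(Titanic, c(4, 2, 1))

# Percentages within each Sex-by-Class group (dimensions 2 and 3 held fixed)
pct <- 100 * prop.table(tt, margin = c(2, 3))
round(pct, 1)

# Check: within every Sex-by-Class group the percentages add up to 100
apply(pct, c(2, 3), sum)

# A flat version, as with ftable above
ftable(round(pct, 1), row.vars = c("Class", "Sex"))
```

The margin = c(2, 3) here plays the role that pct = c(2,3) plays in the tab calls above.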

Visualizing frequencies

Barchart of frequencies

Setting up colors: you can see the currently assigned colors with:

show.settings()

cols <- brewer.pal(8, 'Dark2')
pal(cols)

|         red green blue
| #1B9E77  27   158  119
| #D95F02 217    95    2
| #7570B3 117   112  179
| #E7298A 231    41  138
| #66A61E 102   166   30
| #E6AB02 230   171    2
| #A6761D 166   118   29
| #666666 102   102  102

You can choose something more to your liking by using one of these palettes:

display.brewer.all()

Or play with the new RStudio ‘colorpicker’ by clicking on “Addins” in the second menu line above.

gd(superpose.polygon = list(col = cols, border = 'black')) # use 'superpose.polygon' for colors
show.settings()

tab(titanic, ~ survived + sex + passengerClass) %>%
  barchart(auto.key = TRUE, stack = FALSE)

# get rid of 'Total'
tab_(titanic, ~ survived + sex + passengerClass) %>%
  barchart(auto.key = TRUE)

# not really informative, ... so
tab_(titanic, ~ survived + sex + passengerClass) %>%
  barchart(auto.key = TRUE,
           ylab = 'survived',
           xlab = 'number of passengers',
           xlim = c(0, 850))

# Not really informative, so:
tab_(titanic, ~ sex + passengerClass + survived) %>%
  barchart(xlab = 'number of passengers',
           xlim = c(0, 550),
           auto.key = list(space = 'right', title = 'survived'))

tab_(titanic, ~ sex + passengerClass + survived) %>%
  barchart(xlab = 'number of passengers',
           xlim = c(0, 550),
           layout = c(1, 3),
           auto.key = list(space = 'right', title = 'survived'))

# If we want to compare survival within different classes and sexes,
# we really want proportions within classes and sexes
tab_(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(xlab = 'percentage of passengers',
           xlim = c(0, 100),
           layout = c(1, 4),
           auto.key = list(space = 'right', title = 'survived'))

# If we want to get rid of 'All' and 'Total'
tab__(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(xlab = 'percentage of passengers',
           xlim = c(0, 100),
           layout = c(1, 3),
           auto.key = list(space = 'right', title = 'survived'))

Vertical bars: Using a different palette

gd_(superpose.polygon = list(col = rev(brewer.pal(3, 'Blues'))))
tab__(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(ylab = 'percentage of passengers',
           ylim = c(0, 100),
           horizontal = FALSE,
           layout = c(1, 3),
           auto.key = list(space = 'right', title = 'survived'))

But the legend and the graph have their colors in reverse order, so we reverse the rows of the legend:

tab__(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(ylab = 'percentage of passengers',
           ylim = c(0, 100),
           horizontal = FALSE,
           layout = c(1, 3),
           as.table = TRUE,
           auto.key = list(space = 'right', title = 'survived', reverse.rows = TRUE))

Not stacked:

tab__(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(ylab = 'percentage of passengers',
           ylim = c(0, 100),
           horizontal = FALSE,
           layout = c(1, 3),
           stack = FALSE,
           as.table = TRUE,
           auto.key = list(space = 'right', title = 'survived', reverse.rows = TRUE))

What happens with 4 variables:

titanic2 <- titanic
titanic2$Age <- cut(titanic2$age, breaks = c(-Inf, 18, Inf), labels = c('child', 'adult'))
tab(titanic2, ~ Age)
| Age
| child adult  <NA> Total
|   107   526   680  1313
tab__(titanic2, ~ sex + passengerClass + Age + survived)
| , , Age = child, survived = no
|
|         passengerClass
| sex      1st 2nd 3rd
|   female   1   1  12
|   male     1   7  22
|
| , , Age = adult, survived = no
|
|         passengerClass
| sex      1st 2nd 3rd
|   female   4   9  17
|   male    81  99  98
|
| , , Age = NA, survived = no
|
|         passengerClass
| sex      1st 2nd 3rd
|   female   4   3 105
|   male    38  42 320
|
| , , Age = child, survived = yes
|
|         passengerClass
| sex      1st 2nd 3rd
|   female   9  17  14
|   male     6  11   6
|
| , , Age = adult, survived = yes
|
|         passengerClass
| sex      1st 2nd 3rd
|   female  87  58  14
|   male    37  10  12
|
| , , Age = NA, survived = yes
|
|         passengerClass
| sex      1st 2nd 3rd
|   female  38  19  51
|   male    16   4  40
tab__(titanic2, ~ sex + passengerClass + Age + survived) %>%
  barchart(auto.key = TRUE) %>%
  useOuterStrips

tab__(titanic2, ~ sex + passengerClass + Age + survived) %>%
  barchart(auto.key = list(reverse.rows = TRUE), horizontal = FALSE) %>%
  useOuterStrips

tab__(titanic2, ~ sex + passengerClass + Age + survived, pct = c(1,2,3)) %>%
  barchart(auto.key = list(reverse.rows = TRUE), horizontal = FALSE) %>%
  useOuterStrips

We like this barchart so we will polish it up and write a caption for it. Figures you use in a report or presentation should always have informative captions.

tab__(titanic2, ~ sex + passengerClass + Age + survived, pct = c(1,2,3)) %>%
  barchart(auto.key = list(space = 'left', reverse.rows = TRUE, title = 'survived'),
           horizontal = FALSE,
           ylim = c(-1, 101),
           ylab = 'percent') %>%
  useOuterStrips

Figure (number the figures in a polished report): Percentage of passengers on the Titanic who survived categorized by age (above or below 18), sex and passenger class. Note that age information was missing for 51.8% of the passengers. These results are therefore not reliable since age is not likely to be missing at random.

We saw from this barchart that, with a tab__ formula of the form ‘~ a + b + c + d’:
• a is the x-axis variable
• b and c define panels
• d defines groups within panels

Exercise: The last graph shows survival rate within subgroups. Can you think of other barchart graphs that would reveal interesting information about the Titanic?

Exercise: Use the last barchart and add the argument stack = FALSE. Does this look better? When would it make more sense to use stack = FALSE?

Exercise: Find another data set, perhaps among the datasets available for the text, and produce some barcharts that tell a story. Explain clearly what the interesting questions are and how your barchart addresses them.

Experimenting with colors: Package RColorBrewer has interesting palettes:
1. Sequential palettes: from light to dark to show a gradient
2. Divergent palettes with neutral at the centre and radiating in two directions
3. Qualitative palettes: not ordered

library(RColorBrewer) # automatically loaded when you use `gd()`
display.brewer.all()

brewer.pal.info
|          maxcolors category colorblind
| BrBG            11      div       TRUE
| PiYG            11      div       TRUE
| PRGn            11      div       TRUE
| PuOr            11      div       TRUE
| RdBu            11      div       TRUE
| RdGy            11      div      FALSE
| RdYlBu          11      div       TRUE
| RdYlGn          11      div      FALSE
| Spectral        11      div      FALSE
| Accent           8     qual      FALSE
| Dark2            8     qual       TRUE
| Paired          12     qual       TRUE
| Pastel1          9     qual      FALSE
| Pastel2          8     qual      FALSE
| Set1             9     qual      FALSE
| Set2             8     qual       TRUE
| Set3            12     qual      FALSE
| Blues            9      seq       TRUE
| BuGn             9      seq       TRUE
| BuPu             9      seq       TRUE
| GnBu             9      seq       TRUE
| Greens           9      seq       TRUE
| Greys            9      seq       TRUE
| Oranges          9      seq       TRUE
| OrRd             9      seq       TRUE
| PuBu             9      seq       TRUE
| PuBuGn           9      seq       TRUE
| PuRd             9      seq       TRUE
| Purples          9      seq       TRUE
| RdPu             9      seq       TRUE
| Reds             9      seq       TRUE
| YlGn             9      seq       TRUE
| YlGnBu           9      seq       TRUE
| YlOrBr           9      seq       TRUE
| YlOrRd           9      seq       TRUE
cols <- brewer.pal(8, 'Dark2') # 'Dark2' has at most 8 colors (see maxcolors above)
pal(cols)
|         red green blue
| #1B9E77  27   158  119
| #D95F02 217    95    2
| #7570B3 117   112  179
| #E7298A 231    41  138
| #66A61E 102   166   30
| #E6AB02 230   171    2
| #A6761D 166   118   29
| #666666 102   102  102
cols <- brewer.pal(8, 'Dark2')[c(3,5)] # 'Dark2' has at most 8 colors
gd(superpose.polygon = list(col = cols))
tab__(titanic, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(ylab = 'percentage of passengers',
           ylim = c(0, 100),
           horizontal = FALSE,
           layout = c(3, 1),
           auto.key = list(space = 'right', title = 'survived', reverse.rows = TRUE))

# Change the order of the factor
titanic2$survived <- factor(titanic2$survived, levels = c('yes', 'no'))
gd(superpose.polygon = list(col = rev(cols)))
# use titanic2, the data frame in which the factor was reordered
tab__(titanic2, ~ sex + passengerClass + survived, pct = c(1,2)) %>%
  barchart(ylab = 'percentage of passengers',
           ylim = c(0, 100),
           horizontal = FALSE,
           layout = c(3, 1),
           auto.key = list(space = 'right', title = 'survived', reverse.rows = TRUE))
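Colors can also be set for a single plot without changing the global settings. The following is a sketch that assumes only the lattice package: it passes the two Dark2 colors listed above (entries 3 and 5) through the par.settings argument of barchart instead of calling gd. It uses R's built-in Titanic table, which includes the crew and is not the titanic data frame used in these notes.

```r
library(lattice)

# Two Dark2 colors copied from the palette listing above
cols <- c("#7570B3", "#66A61E")

# Sex x Class x Survived counts, then percentages within Sex-by-Class
tt  <- margin.table(Titanic, c(2, 1, 4))
pct <- 100 * prop.table(tt, margin = c(1, 2))

# par.settings applies to this plot only, and the legend
# produced by auto.key picks up the same colors
p <- barchart(pct,
              horizontal = FALSE,
              ylim = c(0, 100),
              ylab = "percentage",
              layout = c(4, 1),
              par.settings = list(superpose.polygon = list(col = cols, border = "black")),
              auto.key = list(space = "right", title = "Survived", reverse.rows = TRUE))
p   # printing the trellis object draws the plot
```

Keeping the colors with the plot makes a chunk reproducible on its own, whereas gd changes the settings for every later plot in the session.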

Exercise: Ordinal categorical variables

Create a factor with three levels, Low, Middle, High for infant mortality. Hint: Use the cut function with suitable breaks and labels. Do the same for GDP per capita with at least 3 levels. Explore ways of displaying the relationship between GDP per capita and infant mortality, treating both variables as categorical. Explore whether the relationship looks different in different regions of the world.

You can use ‘barchart’ with a frequency data frame. However, using a table created by tab on the original data frame is usually a better approach.
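As a hint for the mechanics of cut, here is a small sketch on made-up numbers. The vector infant_mortality and the cut points 10 and 50 are arbitrary choices for illustration, not recommended breaks for the UN data.

```r
# Made-up rates (illustrative values only)
infant_mortality <- c(4, 12, 35, 80, 150, 22, 7, 60)

# Intervals are closed on the right by default: (-Inf, 10], (10, 50], (50, Inf)
im_level <- cut(infant_mortality,
                breaks = c(-Inf, 10, 50, Inf),
                labels = c("Low", "Middle", "High"),
                ordered_result = TRUE)

data.frame(infant_mortality, im_level)
table(im_level)
```

Because the intervals are closed on the right, a value equal to a break falls in the lower category. Choosing breaks from quantile() instead of fixed values would give groups of roughly equal size.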

Barchart using frequency data frame

To use ‘barchart’ on a frequency data frame, you need to include all variables in the data frame in the formula. This gives little flexibility in trying different aggregations. However, you can use ‘groups’ to produce side by side bars. This is an example from the help file for barchart:

head(barley)
|      yield   variety year            site
| 1 27.00000 Manchuria 1931 University Farm
| 2 48.86667 Manchuria 1931          Waseca
| 3 27.43334 Manchuria 1931          Morris
| 4 39.93333 Manchuria 1931       Crookston
| 5 32.96667 Manchuria 1931    Grand Rapids
| 6 28.96667 Manchuria 1931          Duluth
tab(barley, ~ variety + year + site)
| , , site = Grand Rapids
|
|                     year
| variety            1932 1931 Total
|   Svansota            1    1     2
|   No. 462             1    1     2
|   Manchuria           1    1     2
|   No. 475             1    1     2
|   Velvet              1    1     2
|   Peatland            1    1     2
|   Glabron             1    1     2
|   No. 457             1    1     2
|   Wisconsin No. 38    1    1     2
|   Trebi               1    1     2
|   Total              10   10    20
|
| , , site = Duluth
|
|                     year
| variety            1932 1931 Total
|   Svansota            1    1     2
|   No. 462             1    1     2
|   Manchuria           1    1     2
|   No. 475             1    1     2
|   Velvet              1    1     2
|   Peatland            1    1     2
|   Glabron             1    1     2
|   No. 457             1    1     2
|   Wisconsin No. 38    1    1     2
|   Trebi               1    1     2
|   Total              10   10    20
|
| , , site = University Farm
|
|                     year
| variety            1932 1931 Total
|   Svansota            1    1     2
|   No. 462             1    1     2
|   Manchuria           1    1     2
|   No. 475             1    1     2
|   Velvet              1    1     2
|   Peatland            1    1     2
|   Glabron             1    1     2
|   No. 457             1    1     2
|   Wisconsin No. 38    1    1     2
|   Trebi               1    1     2
|   Total              10   10    20
|
| , , site = Morris
|
|                     year
| variety            1932 1931 Total
|   Svansota            1    1     2
|   No. 462             1    1     2
|   Manchuria           1    1     2
|   No. 475             1    1     2
|   Velvet              1    1     2
|   Peatland            1    1     2
|   Glabron             1    1     2
|   No. 457             1    1     2
|   Wisconsin No. 38    1    1     2
|   Trebi               1    1     2
|   Total              10   10    20
|
| , , site = Crookston
|
|                     year
| variety            1932 1931 Total
|   Svansota            1    1     2
|   No. 462             1    1     2
|   Manchuria           1    1     2
|   No. 475             1    1     2
|   Velvet              1    1     2
|   Peatland            1    1     2
|   Glabron             1    1     2
|   No. 457             1    1     2
|   Wisconsin No. 38    1    1     2
|   Trebi               1    1     2
|   Total              10   10    20
|
| , , site = Waseca
|
|                     year
| variety            1932 1931 Total
|   Svansota            1    1     2
|   No. 462             1    1     2
|   Manchuria           1    1     2
|   No. 475             1    1     2
|   Velvet              1    1     2
|   Peatland            1    1     2
|   Glabron             1    1     2
|   No. 457             1    1     2
|   Wisconsin No. 38    1    1     2
|   Trebi               1    1     2
|   Total              10   10    20
|
| , , site = Total
|
|                     year
| variety            1932 1931 Total
|   Svansota            6    6    12
|   No. 462             6    6    12
|   Manchuria           6    6    12
|   No. 475             6    6    12
|   Velvet              6    6    12
|   Peatland            6    6    12
|   Glabron             6    6    12
|   No. 457             6    6    12
|   Wisconsin No. 38    6    6    12
|   Trebi               6    6    12
|   Total              60   60   120
barchart(yield ~ variety | site, data = barley,
         groups = year, layout = c(1, 6), stack = TRUE,
         auto.key = list(columns = 2),
         ylab = "Barley Yield (bushels/acre)",
         scales = list(x = list(rot = 45)))

barchart(yield ~ variety | site, data = barley,
         groups = year, layout = c(1, 6), stack = FALSE,
         auto.key = list(columns = 2),
         ylab = "Barley Yield (bushels/acre)",
         scales = list(x = list(rot = 45)))

barley$Year <- factor(barley$year, levels = c('1931', '1932'))
barchart(yield ~ variety | site, data = barley,
         groups = Year, layout = c(1, 6), stack = FALSE,
         auto.key = list(columns = 2),
         ylab = "Barley Yield (bushels/acre)",
         scales = list(x = list(rot = 45)))

# To make quantities comparable it is much better to have a 0 origin:
# the bars then have an area proportional to the quantity
barchart(yield ~ variety | site, data = barley,
         ylim = c(0, 70),
         groups = Year, layout = c(1, 6), stack = FALSE,
         auto.key = list(columns = 2),
         ylab = "Barley Yield (bushels/acre)",
         scales = list(x = list(rot = 45)))

Summary

We have looked at four lattice graphics functions:

1. histogram
   • specify panels in formula: ~ y | panel_var
   • no ‘groups’
   • control color with gd(plot.polygon = list(col = 'blue', border = 'black'))
2. densityplot
   • specify panels in formula: ~ y | panel_var
   • specify groups with groups = group_var
3. bwplot
4. barchart on a table: ~ a + b + c + ... + d
   • a is the axis variable
   • d is the within-panel group variable
   • b, c, ... are panel variables
   • decide whether to stack or not: stack = FALSE
   • decide whether to use horizontal or vertical bars: horizontal = FALSE
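The roles of the table dimensions in item 4 can be tried out without spida2. The following sketch assumes only the lattice package and R's built-in Titanic table (which includes the crew, so it is not the titanic data frame used in these notes). It relies on the barchart method for tables, which puts the first dimension on the axis, uses the last dimension for groups and the remaining ones for panels.

```r
library(lattice)

# Built-in 4-way table; the dimension order is Class, Sex, Age, Survived
names(dimnames(Titanic))

# Reorder the dimensions so that, in the notation of the summary,
# a = Sex (axis), b = Class and c = Age (panels), d = Survived (groups)
tt <- aperm(Titanic, c(2, 1, 3, 4))

p <- barchart(tt,
              stack = FALSE,        # side-by-side bars
              horizontal = FALSE,   # vertical bars
              auto.key = list(title = "Survived", space = "right"))
p   # printing the trellis object draws the plot
```

Changing the permutation passed to aperm is a quick way to see how each position in the table affects the layout.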

Exercises:

1. In the context of R, what’s the difference between a data set and a data frame?
2. What are the differences among the concepts of lexicographic, alphabetic and numeric orderings? Give an example of two numbers whose lexicographical and numerical order are different.
3. Suppose dd is a data frame with 4 categorical variables, a, b, c, and d. Which variables would be axis, panel and group variables with the command: dd %>% tab(~a+b+c+d) %>% barchart?
4. Explore and play with the mosaic function in the package vcd. Use it with the titanic and un data sets. How does it compare with barchart?

References

Aschwanden, Christie. 2015. “Science Isn’t Broken: It’s Just a Hell of a Lot Harder Than We Give It Credit for.” FiveThirtyEight.com. https://fivethirtyeight.com/features/science-isnt-broken/.

Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models. 3rd ed. Sage Publications.

Gunter, Bert, and Christopher Tong. 2017. “What Are the Odds!? The ‘Airport Fallacy’ and Statistical Inference.” Significance 14 (4): 38–41. doi:10.1111/j.1740-9713.2017.01057.x.
