Examining Data: I

Georges Monette
MATH 4330: Applied Categorical Data Analysis – Fall 2017
September 2017 (Updated: September 12 2017 19:06)

Introduction

This material is based on Fox (2016) ch. 4, pp. 28 – 80.

How to explore a data set? The first step is not to jump in and plot the data. The very first step should be to get or create the data biography:

• where does it come from?
• who collected it?
• from whom or what?
• when?
• was there some form of random selection of sampled units?
  – what was the sampling frame?
  – how was sampling carried out?
  – are there sampling weights to adjust for possible biases?
  – was sampling done effectively so that the sampled units might be representative of a larger population?
• was there random assignment of units to different levels of some variable (treatments)?
  – was this done effectively so the differences in outcomes could be interpreted as the result of a causal effect?

You may not find answers to all these questions, but it is essential that you know which answers you don’t have and how that limits the interpretation of any analysis of these data. Many data sets used in teaching lack a good data biography. That includes, unfortunately, many data sets used in this course. See Krause (2017) for a discussion of the role of data biographies.

The second step is to find or create a data directory or ‘codebook’ describing each of the variables in the data.

The third step is to think about which interesting questions this data set could help explore. Everything you present about a dataset should be related to a question. As you will see, there are a great many ways of displaying a variable. The better ways answer interesting questions. As you explore data, keep asking yourself whether a graph or an analysis addresses an interesting and meaningful question.
Present those that do, and don’t waste your audience’s time, or mental energy, on those that don’t.

This script focuses on the initial examination of data using graphical and numeric methods. We use some basic tools in R and vary parameters so you can judge what works best.

R has a number of ‘systems’ for graphics. They represent different stages of the development of R and its ancestor, S. There is a good overview of graphics in R by Michael Friendly (???). The graphics systems can be classified as:

1. Standard base graphics, developed in the early days of S at Bell Labs in the 1970s.
2. Lattice, latticeExtra, and grid graphics, modelled on ‘trellis’ graphics, also developed at Bell Labs, in the 1990s.
3. 3d graphics based on ‘rgl’, an implementation in R of OpenGL, and packages that use ‘rgl’ such as p3d.
4. ggplot2, one of a large number of packages developed by Hadley Wickham in the last decade.

The easiest system to use is, by far, the base graphics system. Next in ease is the lattice system which, although daunting at first, is quite powerful once you master a few techniques. Most of the examples in this file use lattice graphics, whose functions and conventions let you make easy graphs to explore data quickly and, with more effort, produce graphical displays suitable for professional presentations.

The best way to learn is to play with the code. Try to vary arguments. Look up help for functions and experiment with the examples. Don’t hesitate to also explore the other graphics systems.

We also use some simple data manipulation tools in the ‘spida2’ package. Hadley Wickham’s packages are more powerful but also somewhat more complex.

We will use the following data set(s) from Fox (2016).

# 'cache' is a chunk option, so it is set through opts_chunk
knitr::opts_chunk$set(cache = TRUE)
# Download these files manually.
# Make sure to change the working directory
# to the folder in which this script was saved
# using RStudio menus:
# Session > Set Working Directory > To Source File Location
fox_data <- "http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/"
download.file(paste0(fox_data, 'UnitedNations.txt'), 'UnitedNations.txt')
download.file(paste0(fox_data, 'Titanic.txt'), 'Titanic.txt')
# download bibliography file
download.file('http://blackwell.math.yorku.ca/MATH4330/common/4330.bib', '4330.bib')
# usual packages:
library(car)
library(spida2)
library(lattice)
library(latticeExtra)
| Loading required package: RColorBrewer
# read data
un <- read.table('UnitedNations.txt', header = TRUE)
titanic <- read.table('Titanic.txt', header = TRUE)
head(un)
|                region  tfr contraception educationMale educationFemale
| Afghanistan      Asia 6.90            NA            NA              NA
| Albania        Europe 2.60            NA            NA              NA
| Algeria        Africa 3.81            52          11.1             9.9
| American.Samoa   Asia   NA            NA            NA              NA
| Andorra        Europe   NA            NA            NA              NA
| Angola         Africa 6.69            NA            NA              NA
|                lifeMale lifeFemale infantMortality GDPperCapita
| Afghanistan        45.0       46.0             154         2848
| Albania            68.0       74.0              32          863
| Algeria            67.5       70.3              44         1531
| American.Samoa     68.0       73.0              11           NA
| Andorra              NA         NA              NA           NA
| Angola             44.9       48.1             124          355
|                economicActivityMale economicActivityFemale illiteracyMale
| Afghanistan                    87.5                    7.2         52.800
| Albania                          NA                     NA             NA
| Algeria                        76.4                    7.8         26.100
| American.Samoa                 58.8                   42.4          0.264
| Andorra                          NA                     NA             NA
| Angola                           NA                     NA             NA
|                illiteracyFemale
| Afghanistan               85.00
| Albania                      NA
| Algeria                   51.00
| American.Samoa             0.36
| Andorra                      NA
| Angola                       NA
dim(un)
| [1] 207  13
un$country <- rownames(un) # rownames identify the rows of a data frame
head(titanic)
|                                                 survived     age
| Allen, Miss Elisabeth Walton                         yes 29.0000
| Allison, Miss Helen Loraine                           no  2.0000
| Allison, Mr Hudson Joshua Creighton                   no 30.0000
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)       no 25.0000
| Allison, Master Hudson Trevor                        yes  0.9167
| Anderson, Mr Harry                                   yes 47.0000
|                                                 passengerClass    sex
| Allen, Miss Elisabeth Walton                               1st female
| Allison, Miss Helen Loraine                                1st female
| Allison, Mr Hudson Joshua Creighton                        1st   male
| Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)            1st female
| Allison, Master Hudson Trevor                              1st   male
| Anderson, Mr Harry                                         1st   male
dim(titanic)
| [1] 1313    4

Do you know of any famous passengers on the Titanic? For example, the ‘unsinkable’ Molly Brown:

grep('Brown', rownames(titanic)) # selecting subset of strings with a regular expression
| [1]  37  38  52 348 349 350 600 821
grep('Brown', rownames(titanic)) %>% titanic[., ]
|                                                  survived age
| Brown, Mrs James Joseph (Margaret Molly Tobin)        yes  44
| Brown, Mrs John Murray (Caroline Lane Lamson)         yes  59
| Case, Mr Howard Brown                                  no  49
| Brown, Miss Edith E.                                  yes  15
| Brown, Mr Thomas William Solomon                       no  45
| Brown, Mrs Thomas William Solomon (Elizabeth C.)      yes  40
| Brown, Miss Mildred                                   yes  24
| Goldsmith, Mrs Frank John (Emily A. Brown)            yes  NA
|                                                  passengerClass    sex
| Brown, Mrs James Joseph (Margaret Molly Tobin)               1st female
| Brown, Mrs John Murray (Caroline Lane Lamson)                1st female
| Case, Mr Howard Brown                                        1st   male
| Brown, Miss Edith E.                                         2nd female
| Brown, Mr Thomas William Solomon                             2nd   male
| Brown, Mrs Thomas William Solomon (Elizabeth C.)             2nd female
| Brown, Miss Mildred                                          2nd female
| Goldsmith, Mrs Frank John (Emily A. Brown)                   3rd female

Who were the 10 youngest (excluding NAs)?

sortdf(titanic, ~ age)[1:10, ]
|                                        survived    age passengerClass
| Dean, Miss Elizabeth Gladys (Millvena)      yes 0.1667            3rd
| Danbom, Master Gilbert Sigvard Emanuel       no 0.3333            3rd
| Caldwell, Master Alden Gates                yes 0.8333            2nd
| Richards, Master George Sidney              yes 0.8333            2nd
| Aks, Master Philip                          yes 0.8333            3rd
| Allison, Master Hudson Trevor               yes 0.9167            1st
| Becker, Master Richard F.                   yes 1.0000            2nd
| Hamalainen, Master Viljo                    yes 1.0000            2nd
| LaRoche, Miss Louise                        yes 1.0000            2nd
| Dean, Master Bertram Vere                   yes 1.0000            3rd
|                                           sex
| Dean, Miss Elizabeth Gladys (Millvena) female
| Danbom, Master Gilbert Sigvard Emanuel   male
| Caldwell, Master Alden Gates             male
| Richards, Master George Sidney           male
| Aks, Master Philip                       male
| Allison, Master Hudson Trevor            male
| Becker, Master Richard F.                male
| Hamalainen, Master Viljo                 male
| LaRoche, Miss Louise                   female
| Dean, Master Bertram Vere                male

The ten oldest:

sortdf(titanic, ~ I(- age)) %>% head(10)
|                                                           survived age
| Artagaveytia, Mr Ramon                                          no  71
| Goldschmidt, Mr George B.                                       no  71
| Mitchell, Mr Henry Michael                                      no  71
| Crosby, Captain Edward Gifford                                  no  70
| Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead)      yes  69
| Straus, Mr Isidor                                               no  67
| Millet, Mr Francis Davis                                        no  65
| Dewan, Mr Frank                                                 no  65
| Compton, Mrs Alexander Taylor (Mary Eliza Ingersoll)           yes  64
| Fortune, Mr Mark                                                no  64
|                                                           passengerClass
| Artagaveytia, Mr Ramon                                               1st
| Goldschmidt, Mr George B.                                            1st
| Mitchell, Mr Henry Michael                                           2nd
| Crosby, Captain Edward Gifford                                       1st
| Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead)            1st
| Straus, Mr Isidor                                                    1st
| Millet, Mr Francis Davis                                             1st
| Dewan, Mr Frank                                                      3rd
| Compton, Mrs Alexander Taylor (Mary Eliza Ingersoll)                 1st
| Fortune, Mr Mark                                                     1st
|                                                              sex
| Artagaveytia, Mr Ramon                                      male
| Goldschmidt, Mr George B.                                   male
| Mitchell, Mr Henry Michael                                  male
| Crosby, Captain Edward Gifford                              male
| Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead) female
| Straus, Mr Isidor                                           male
| Millet, Mr Francis Davis                                    male
| Dewan, Mr Frank                                             male
| Compton, Mrs Alexander Taylor (Mary Eliza Ingersoll)      female
| Fortune, Mr Mark                                            male

Is age a useful variable? How many passengers had NAs for age?

tab(titanic, ~ is.na(age))
| is.na(age)
| FALSE  TRUE Total
|   633   680  1313

Find out what happened to others, e.g. Benjamin Guggenheim, or the Strauses, who co-owned Macy’s.

Look at the data directory (or codebook) for these data frames at http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/

There are no recipes for examining data. The ‘best’ way depends on the nature of the data. You want to be able to see things you expect to see but you also want to have a good chance of finding things you didn’t expect.
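The selection, sorting, and missing-value steps above rely on sortdf and tab from ‘spida2’. As a rough sketch of the same ideas in base R only, the code below works on a seven-row excerpt copied from the rows printed earlier, so it runs without the download. The object names (mini, allisons, youngest, oldest, missing_age, missing_by_class) are illustrative assumptions and are not part of the course script.

```r
# Seven rows copied from the output printed earlier (not the full data set).
mini <- data.frame(
  survived = c("yes", "no", "no", "no", "yes", "yes", "yes"),
  age = c(29, 2, 30, 25, 0.9167, 47, NA),
  passengerClass = c("1st", "1st", "1st", "1st", "1st", "1st", "3rd"),
  sex = c("female", "female", "male", "female", "male", "male", "female"),
  row.names = c(
    "Allen, Miss Elisabeth Walton",
    "Allison, Miss Helen Loraine",
    "Allison, Mr Hudson Joshua Creighton",
    "Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)",
    "Allison, Master Hudson Trevor",
    "Anderson, Mr Harry",
    "Goldsmith, Mrs Frank John (Emily A. Brown)"
  ),
  stringsAsFactors = FALSE
)

# Select rows whose row name matches a regular expression.
allisons <- mini[grepl("Allison", rownames(mini)), ]

# Youngest first; order() places NAs last by default.
youngest <- mini[order(mini$age), ]

# Oldest first; NAs still go last because na.last = TRUE is the default.
oldest <- mini[order(mini$age, decreasing = TRUE), ]

# Count missing ages overall, then by passenger class.
missing_age <- table(is.na(mini$age))
missing_by_class <- table(mini$passengerClass, is.na(mini$age),
                          dnn = c("passengerClass", "age missing"))
print(missing_by_class)
```

Cross-tabulating missingness against another variable, as in missing_by_class, is one way to start answering whether age is a useful variable: if the NAs are concentrated in one passenger class, analyses that drop them may describe mostly the other classes. On the full data the same call would be applied to titanic instead of mini.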
