Introduction to R and Exploratory Data Analysis August 26, 2009
Modern Statistical Software
- create static and dynamic graphics
- fit and critique complex models
- provide interactive environments for exploration of data and models
- are extensible high-level programming languages
- support object-oriented code
- allow speed-ups by creating compiled code or rewriting slow parts in C/FORTRAN
Features present in R, S/S-Plus, Matlab, Xlispstat
Agent Orange
Agent Orange is an herbicide mixture used during the Vietnam War to destroy forest cover/vegetation. Agent Orange, so-called from the orange color of its storage drums, contains the highly toxic impurity dioxin. The Operation Ranch Hand mission involved spraying 20 million gallons of Agent Orange over 3.6 million acres of Vietnamese land. Agent Orange has been linked to cancers and other diseases in several epidemiological studies. About 3 million Americans served in the armed forces in Vietnam during the Vietnam War.
Dioxin Levels
Dioxin concentrations in parts per trillion (ppt) for 646 ground combat Vietnam veterans and 97 veterans who did not serve in Vietnam.
- Nonrandom sample of Vietnam veterans who served during 1967-1968
- Nonrandom sample of veterans who served in the US and Germany between 1965 and 1971
- Dioxin measurements were taken from blood serum in 1987 as a biological marker for previous Agent Orange exposure.
Startup R
- Unix command line: R
- Windows/Mac: double-click the R GUI
- Under emacs/ESS: enter M-x R
Saving Data/Ending your R Session
save.image()   # use to save workspace without quitting
q()            # quits R and prompts you to save data
Creating a Dataframe in R
> vets = read.table("case0302.csv", header=TRUE, sep=",")
> names(vets)
[1] "DIOXIN" "VETERAN"
Notes:
1. header=TRUE tells R that the first line of the file contains variable names
2. sep="," tells R that columns of data are separated by a comma (the csv format)
3. the names function extracts the names of the variables in a dataframe
Reading Data
read.csv    Comma-separated values format
read.fwf    Fixed-width format
read.delim  Tab-delimited files
See help(read.table) for options, such as setting the character string for NAs, choosing column separators, skipping lines, etc.
See also scan() for reading in large files
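As a rough illustration of the readers listed above, the sketch below writes a tiny comma-separated file to a temporary location and reads it back in two ways. The file contents are made-up values (not the case0302 data), and `tmp`, `demo1`, and `demo2` are hypothetical names chosen for this example.

```r
# Write a small made-up CSV file to a temporary path (illustration only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("DIOXIN,VETERAN",
             "4,VIETNAM",
             "NA,OTHER",
             "2,OTHER"), tmp)

# read.table with the options spelled out explicitly ...
demo1 <- read.table(tmp, header = TRUE, sep = ",", na.strings = "NA")

# ... and read.csv, which uses header = TRUE and sep = "," by default.
demo2 <- read.csv(tmp)

names(demo1)              # variable names taken from the first line
is.na(demo1$DIOXIN)       # the "NA" string is read as a missing value
all.equal(demo1, demo2)   # compare the two results
```

For a comma-separated file, read.csv saves typing; read.table is the more general function when separators or missing-value codes differ.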
Graphical Views
1. Univariate: histograms, density curves, boxplots, quantile-quantile plots
2. Bivariate: scatter plots with trend lines, side-by-side boxplots
3. Several variables: scatter plot matrices, lattice or trellis plots, 3-dimensional plots, dynamic plots
Histograms
[Figure: two histograms of Dioxin Concentration (ppt) on the density scale, titled "Default Number of Bins" and "50 Bins".]
hist(y, prob=TRUE, breaks=50, main="50 Bins", ylab="Density", xlab="Y")
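The effect of the breaks argument can be tried on any numeric vector. The sketch below uses simulated right-skewed values as a stand-in (an assumption for illustration; these are not the dioxin measurements), and `h_default` and `h_50` are hypothetical names.

```r
# Simulated right-skewed values; these are NOT the dioxin measurements.
set.seed(1)
y <- rlnorm(500, meanlog = 1.4, sdlog = 0.5)

# Default binning versus a request for roughly 50 bins.
h_default <- hist(y, prob = TRUE, main = "Default Number of Bins",
                  ylab = "Density", xlab = "Y")
h_50 <- hist(y, prob = TRUE, breaks = 50, main = "50 Bins",
             ylab = "Density", xlab = "Y")

# A single number for breaks is only a suggestion: R chooses "pretty"
# cut points, so the actual bin count may differ from 50.
length(h_default$counts)
length(h_50$counts)

# On the density scale the bar areas sum to 1.
sum(h_50$density * diff(h_50$breaks))
```

Plotting on the density scale (prob = TRUE) rather than as counts is what allows a density curve to be overlaid on the same axes.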
Estimating the Density Curve
[Figure: "Default Settings for Histogram and Kernel Density"; histogram of Dioxin Concentration (ppt) on the density scale with the kernel density estimate overlaid.]
dens = density(DIOXIN)
lines(dens, lwd=2, col="orange")
Kernel Density Estimation
Default settings lead to a “bumpy” estimate. Are these real features or artifacts of the estimation procedure?
density(y, bw, adjust, kernel, window)
- bw: bandwidth, controls the amount of smoothing
- adjust: the actual bandwidth is adjust*bw
- kernel, window: smoothing kernel
- see help(density) for other options
Reference: Silverman, B. W. (1986) Density Estimation. Chapman & Hall.
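The relationship between bw and adjust described above can be checked directly. This is a minimal sketch on simulated values (an assumption for illustration, not the dioxin data); `d1` and `d2` are hypothetical names.

```r
set.seed(1)
y <- rlnorm(500, meanlog = 1.4, sdlog = 0.5)  # simulated, not the dioxin data

d1 <- density(y)               # default bandwidth rule and Gaussian kernel
d2 <- density(y, adjust = 2)   # twice the default bandwidth: smoother curve

d1$bw                          # bandwidth actually used by the default fit
d2$bw / d1$bw                  # ratio of the two bandwidths reflects adjust

# Overlay both estimates on a density-scale histogram.
hist(y, prob = TRUE, main = "Histogram with kernel density estimates",
     xlab = "Y", ylab = "Density")
lines(d1, lwd = 2, col = "orange")
lines(d2, lwd = 2, lty = 2)
```

Changing adjust is usually more convenient than setting bw by hand, because it scales whatever bandwidth the default rule selects for the data.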
Adjusting Bandwidth
[Figure: four kernel density estimates of Dioxin Concentration (ppt), titled density(x = DIOXIN), density(x = DIOXIN, adjust = 1.25), density(x = DIOXIN, adjust = 1.5), and density(x = DIOXIN, adjust = 2).]
Adding Locations of Data
[Figure: density(x = DIOXIN, adjust = 1.25); x-axis Dioxin Concentration (ppt), y-axis Density.]
rug(DIOXIN, lwd=2)
Boxplots
- Shows quartiles and “outliers”
- Box is based on the lower (Q1) and upper (Q3) quartiles
- Line in box indicates the median (Q2)
- Whiskers extend to the smallest/largest observation within 1.5*IQR of the box
- Points are drawn for all cases beyond the whiskers (“outliers”)
Good for showing outliers, skewness/symmetry
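The whisker rule above can be reproduced numerically with boxplot.stats. The values below are made up for illustration, and `bs`, `hinges`, and `fences` are hypothetical names. Note that R's boxplot works from Tukey's hinges (as returned by fivenum), which can differ slightly from the quartiles returned by quantile.

```r
# Made-up values with one extreme observation (illustration only).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x, coef = 1.5)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # observations beyond the whiskers

# The same rule computed by hand from the hinges.
hinges <- fivenum(x)[c(2, 4)]
iqr <- diff(hinges)
fences <- c(hinges[1] - 1.5 * iqr, hinges[2] + 1.5 * iqr)
x[x < fences[1] | x > fences[2]]   # cases flagged as "outliers"
```

The whiskers stop at actual observations (here 1 and 9), not at the fences themselves.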
Boxplot of Dioxin
[Figure: two boxplots of Dioxin Concentration (ppt), titled "Default" and "with Horizontal=T".]
boxplot(DIOXIN)
Side-by-Side Boxplots
boxplot(DIOXIN ~ VETERAN, data=vets)
[Figure: side-by-side boxplots of Dioxin Concentration (ppt) for the groups OTHER and VIETNAM.]
Formula
Side-by-side boxplots are useful for showing the distribution of a quantitative variable for each level of a qualitative variable.
boxplot(DIOXIN ~ VETERAN, data=vets,
        ylab="Dioxin Concentration (ppt)",
        main="boxplot(DIOXIN ~ VETERAN, data=vets)")
- formula: DIOXIN ~ VETERAN creates a boxplot of dioxin for each level of VETERAN
- data: specifies the dataframe to use
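To see what the formula interface produces without the original file, the sketch below builds a small hypothetical data frame, `toy`, with the same column names as vets. The values are made up for illustration and are not the study data.

```r
# Hypothetical miniature data frame with the same column names as vets.
toy <- data.frame(
  DIOXIN  = c(2, 4, 5, 3, 9, 1, 4, 6),
  VETERAN = factor(c("VIETNAM", "VIETNAM", "VIETNAM", "VIETNAM", "VIETNAM",
                     "OTHER", "OTHER", "OTHER"))
)

# One box per level of VETERAN; data= tells R where to find both variables.
bp <- boxplot(DIOXIN ~ VETERAN, data = toy,
              ylab = "Dioxin Concentration (ppt)",
              main = "boxplot(DIOXIN ~ VETERAN, data = toy)")

bp$names        # group labels, in the order the boxes are drawn
bp$stats[3, ]   # the median of each group
```

The value returned by boxplot holds the statistics behind each box, which is handy for checking a plot against numerical summaries.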
Numerical Summaries
- means (averages): mean
- standard deviations: sd
- 5-point summary (min, lower quartile, median, upper quartile, max): quantile
- correlations: cor
- summary
- learn to use the commands aggregate and apply and its relatives
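The commands listed above might be used as follows. This is a sketch on a hypothetical data frame, `toy`, with made-up values (not the study data); cor is omitted because it needs two quantitative variables.

```r
# Hypothetical data with the same column names as vets (made-up values).
toy <- data.frame(
  DIOXIN  = c(2, 4, 5, 3, 9, 1, 4, 6),
  VETERAN = factor(c(rep("VIETNAM", 5), rep("OTHER", 3)))
)

mean(toy$DIOXIN)       # overall mean
sd(toy$DIOXIN)         # overall standard deviation
quantile(toy$DIOXIN)   # min, lower quartile, median, upper quartile, max
summary(toy$DIOXIN)    # quantiles plus the mean

# Group-wise summaries: tapply is one of apply's relatives.
med <- tapply(toy$DIOXIN, toy$VETERAN, median)
med

# aggregate returns a data frame with one row per group.
agg <- aggregate(DIOXIN ~ VETERAN, data = toy, FUN = mean)
agg
```

tapply returns a named vector, while aggregate returns a data frame, which is often easier to merge with other results.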
Notes
- mean and standard deviation are sensitive to outliers
- symmetric data: mean and median should be approximately equal
- skewed data: median is more appropriate as a summary
In both groups the median dioxin level is 4 ppt.
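The sensitivity of the mean and standard deviation to outliers, noted above, shows up even in a five-value example. The numbers below are made up for illustration; `x` and `x_out` are hypothetical names.

```r
x     <- c(2, 3, 4, 5, 6)    # symmetric made-up values
x_out <- c(2, 3, 4, 5, 60)   # same values, with the largest replaced by an outlier

# Symmetric data: mean and median agree.
c(mean = mean(x), median = median(x), sd = sd(x))

# One outlier pulls the mean and inflates the sd; the median does not move.
c(mean = mean(x_out), median = median(x_out), sd = sd(x_out))
```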
Questions/Conclusions
The original CDC study examined whether military records could be used to identify veterans who were likely to have been exposed to Agent Orange. The investigators found that exposure measures from military records or self-reported exposures did not correlate with blood dioxin levels.