Introduction to R and Exploratory Data Analysis
August 26, 2009

Modern Statistical Software

- create static and dynamic graphics
- fit and critique complex models
- provide interactive environments for exploration of data and models
- are extensible high-level programming languages
- support object-oriented code
- allow speed-ups by creating compiled code or re-writing slow parts in C/FORTRAN

Features present in R, S/S-Plus, Matlab, Xlispstat

Agent Orange

- Agent Orange is an herbicide mixture used during the Vietnam War to destroy forest cover and vegetation.
- Agent Orange, so called from the orange color of its storage drums, contains the highly toxic impurity dioxin.
- The Operation Ranch Hand mission involved spraying 20 million gallons of Agent Orange over 3.6 million acres of Vietnamese land.
- Agent Orange has been linked to cancers and other diseases in several epidemiological studies.
- About 3 million Americans served in the armed forces in Vietnam during the Vietnam War.

Dioxin Levels

Dioxin concentrations in parts per trillion (ppt) for 646 ground combat Vietnam veterans and 97 veterans who did not serve in Vietnam.

- Sample (nonrandom) of Vietnam vets who served during 1967-1968
- Sample (nonrandom) of vets who served in the US and Germany between 1965 and 1971
- Dioxin measurements were taken from blood serum in 1987 as a biological marker for previous Agent Orange exposure.

Startup R

- Unix command line: R
- Windows/Mac: double-click the R GUI
- Under emacs/ESS: enter M-x R

Saving Data/Ending your R Session

save.image()   # saves the workspace without quitting
q()            # quits R and prompts you to save data

Creating a Dataframe in R

> vets = read.table("case0302.csv", header=TRUE, sep=",")
> names(vets)
[1] "DIOXIN"  "VETERAN"

Notes:
1. header=TRUE tells R that the first line of the file contains variable names
2. sep="," tells R that columns of data are separated by commas (the csv format)
3. the names function extracts the names of the variables in a dataframe
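The read step can be tried without the course file. The sketch below writes a tiny stand-in CSV to a temporary file and reads it back; the three data rows are made up for illustration and are not values from case0302.csv.

```r
# Write a small stand-in file (hypothetical values, NOT the real case0302.csv)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("DIOXIN,VETERAN",
             "4,VIETNAM",
             "2,OTHER",
             "7,VIETNAM"),
           csv_path)

# First line holds the variable names; columns are separated by commas
vets <- read.table(csv_path, header = TRUE, sep = ",")

names(vets)   # variable names taken from the header line
str(vets)     # one numeric column, one character column

unlink(csv_path)  # remove the temporary file
```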

Reading Data

read.csv    comma-separated values format
read.fwf    fixed-width format
read.delim  tab-delimited files

See help(read.table) for options such as the character string used for NAs, the column separator, and the number of lines to skip.

See also scan() for reading in large files.
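As a minimal sketch of those options, the example below reads from an in-memory character vector through the text argument of read.table. The leading comment line, the "." missing-value code, and the data values are all assumptions made up for the illustration.

```r
# Hypothetical raw input: one line to skip, "." used as the missing-value code
raw <- c("exported from a made-up source",
         "DIOXIN,VETERAN",
         "4,VIETNAM",
         ".,OTHER",
         "7,VIETNAM")

dat <- read.table(text = raw,
                  header = TRUE,     # variable names follow the skipped line
                  sep = ",",         # column separator
                  skip = 1,          # ignore the first line
                  na.strings = ".")  # treat "." as NA

dat
sum(is.na(dat$DIOXIN))  # number of missing dioxin values
```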

Graphical Views

1. Univariate: histograms, density curves, boxplots, quantile-quantile plots
2. Bivariate: scatter plots with trend lines, side-by-side boxplots
3. Several variables: scatter plot matrices, lattice or trellis plots, 3-dimensional plots, dynamic plots

Histograms

[Figure: two histograms of dioxin concentration (ppt) on the density scale, titled "Default Number of Bins" and "50 Bins".]

hist(y, prob=TRUE, breaks=50, main="50 Bins", ylab="Density", xlab="Y")
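The two panels can be reproduced in outline with simulated data. In this sketch y is a made-up right-skewed sample, not the dioxin measurements. Note that R treats a single number given to breaks as a suggestion and rounds to "pretty" cut points, so the result may not have exactly 50 bins.

```r
set.seed(1)
# Hypothetical right-skewed sample standing in for the dioxin data
y <- rlnorm(200, meanlog = 1.4, sdlog = 0.5)

op <- par(mfrow = c(1, 2))  # two panels side by side
h_default <- hist(y, prob = TRUE, main = "Default Number of Bins",
                  ylab = "Density", xlab = "Y")
h_50 <- hist(y, prob = TRUE, breaks = 50, main = "50 Bins",
             ylab = "Density", xlab = "Y")
par(op)

# Number of bins actually used in each panel
length(h_default$counts)
length(h_50$counts)

# On the density scale the bar areas sum to 1
sum(h_50$density * diff(h_50$breaks))
```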

Estimating the Density Curve

[Figure: histogram of dioxin concentration (ppt) with a kernel density estimate overlaid, titled "Default Settings for Histogram and Kernel Density".]

dens = density(DIOXIN)
lines(dens, lwd=2, col="orange")

Kernel Density Estimation

Default settings lead to a “bumpy” estimate. Are these real features or artifacts of the estimation procedure?

density(y, bw, adjust, kernel, window)

- bw: bandwidth; controls the amount of smoothing
- adjust: the actual bandwidth is adjust*bw
- kernel, window: smoothing kernel
- see help(density) for other options

Reference: Silverman, B. W. (1986) Density Estimation. Chapman & Hall.
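The effect of adjust can be checked directly, since the object returned by density() stores the bandwidth it used in its bw component. This is a sketch on simulated data; y is a hypothetical sample, not the dioxin measurements.

```r
set.seed(1)
# Hypothetical right-skewed sample standing in for the dioxin data
y <- rlnorm(200, meanlog = 1.4, sdlog = 0.5)

d_default <- density(y)              # default bandwidth
d_smooth  <- density(y, adjust = 2)  # bandwidth doubled

# The bandwidth actually used is adjust * bw
d_default$bw
d_smooth$bw

# Compare the two estimates and mark the data locations
plot(d_default, main = "Default vs. adjust = 2", xlab = "Y")
lines(d_smooth, lwd = 2, col = "orange")
rug(y)
```

A larger adjust gives a smoother curve with fewer bumps, at the cost of possibly hiding real features.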

Adjusting Bandwidth

[Figure: four kernel density estimates of dioxin concentration (ppt), with panels titled density(x = DIOXIN), density(x = DIOXIN, adjust = 1.25), density(x = DIOXIN, adjust = 1.5), and density(x = DIOXIN, adjust = 2).]

Adding Locations of Data

[Figure: kernel density estimate of dioxin concentration (ppt), titled density(x = DIOXIN, adjust = 1.25), with tick marks along the axis at the observed values.]

rug(DIOXIN, lwd=2)

Boxplots

Shows quartiles and “outliers”

- Box is based on the lower (Q1) and upper (Q3) quartiles
- Line in box indicates the median (Q2)
- Whiskers are based on the smallest/largest observation within 1.5*IQR of the box
- Points are drawn for all cases beyond the whiskers (“outliers”)

Good for showing outliers, skewness/symmetry
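The whisker rule can be worked through on a small example with boxplot.stats(), which returns the numbers a boxplot is drawn from. The ten values below are made up for illustration. R computes the box from Tukey's hinges (fivenum), which can differ slightly from the quartiles returned by quantile().

```r
# Hypothetical values with one large observation
y <- c(1, 2, 3, 4, 4, 5, 5, 6, 7, 20)

bs <- boxplot.stats(y)  # default coef = 1.5

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Fences: 1.5 * (box length) beyond each end of the box
box_len <- bs$stats[4] - bs$stats[2]
lower_fence <- bs$stats[2] - 1.5 * box_len
upper_fence <- bs$stats[4] + 1.5 * box_len
c(lower_fence, upper_fence)

# Observations beyond the fences are plotted as individual points
bs$out
```

Here the box runs from 3 to 6, so the fences are at -1.5 and 10.5; the upper whisker stops at 7, the largest value inside the fences, and 20 is flagged.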

Boxplot of Dioxin

[Figure: two boxplots of dioxin concentration (ppt): the default vertical layout and one drawn with horizontal=TRUE.]

boxplot(DIOXIN)

Side-by-Side Boxplots

[Figure: side-by-side boxplots of dioxin concentration (ppt) for the OTHER and VIETNAM groups, titled boxplot(DIOXIN ~ VETERAN, data=vets).]

Formula

Side-by-side boxplots are useful for showing the distribution of a quantitative variable for each level of a qualitative variable.

boxplot(DIOXIN ~ VETERAN, data=vets,
        ylab="Dioxin Concentration (ppt)",
        main="boxplot(DIOXIN ~ VETERAN, data=vets)")

- the formula DIOXIN ~ VETERAN creates a boxplot of dioxin for each level of VETERAN
- data specifies the dataframe to use
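To try the formula interface without the course file, the sketch below builds a stand-in dataframe with the same column names and group sizes (646 and 97). The DIOXIN values are simulated and are not the study measurements.

```r
set.seed(1)
# Simulated stand-in for vets: same column names and group sizes, made-up values
vets <- data.frame(
  DIOXIN  = rlnorm(646 + 97, meanlog = 1.4, sdlog = 0.5),
  VETERAN = factor(rep(c("VIETNAM", "OTHER"), times = c(646, 97)))
)

# One box per level of VETERAN; the values plotted are returned invisibly
b <- boxplot(DIOXIN ~ VETERAN, data = vets,
             ylab = "Dioxin Concentration (ppt)",
             main = "boxplot(DIOXIN ~ VETERAN, data=vets)")

b$names  # group labels, in factor-level (alphabetical) order
b$n      # number of observations in each group
```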

Numerical Summaries

- means (averages): mean
- standard deviations: sd
- 5-point summary (min, lower quartile, median, upper quartile, max): quantile
- correlations: cor
- summary
- learn to use the commands aggregate and apply and its relatives
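A short sketch of these commands, including group-wise summaries with tapply (one of the relatives of apply) and aggregate. The dataframe is a simulated stand-in with the same column names as vets; its values are made up.

```r
set.seed(1)
# Simulated stand-in for vets (hypothetical values)
vets <- data.frame(
  DIOXIN  = rlnorm(646 + 97, meanlog = 1.4, sdlog = 0.5),
  VETERAN = factor(rep(c("VIETNAM", "OTHER"), times = c(646, 97)))
)

mean(vets$DIOXIN)
sd(vets$DIOXIN)
quantile(vets$DIOXIN)  # min, lower quartile, median, upper quartile, max
summary(vets$DIOXIN)   # the same five numbers plus the mean

# Median dioxin level within each group, two ways
group_medians <- tapply(vets$DIOXIN, vets$VETERAN, median)
group_table   <- aggregate(DIOXIN ~ VETERAN, data = vets, FUN = median)

group_medians
group_table
```

tapply returns a named vector, while aggregate returns a dataframe with one row per group, which is often easier to pass on to further analysis.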

Notes

- mean and standard deviation are sensitive to outliers
- symmetric data: mean and median should be approximately equal
- skewed data: median is more appropriate as a summary

In both groups the median dioxin level is 4 ppt.
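The sensitivity of the mean and standard deviation to outliers can be seen by changing a single value. The numbers below are made up for illustration.

```r
# Hypothetical symmetric sample: mean and median agree
y <- c(2, 3, 4, 5, 6)
mean(y)
median(y)

# Replace the largest value with an extreme one
y_out <- c(2, 3, 4, 5, 45)
mean(y_out)    # pulled upward by the single outlier
median(y_out)  # unchanged
sd(y) < sd(y_out)  # the standard deviation is inflated as well
```

The mean moves from 4 to 11.8 while the median stays at 4.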

Questions/Conclusions

The original CDC study examined whether military records could be used to identify veterans who were likely to have been exposed to Agent Orange. They found that exposure measures from military records or self-reported exposures did not correlate with blood dioxin levels.
