Introduction to R and Exploratory Data Analysis August 26, 2009

Modern Statistical Software

- create static and dynamic graphics
- fit and critique complex models
- provide interactive environments for exploration of data and models
- are extensible high-level programming languages
- support object-oriented code
- allow speed-ups by creating compiled code or rewriting slow parts in C/FORTRAN

Features present in R, S/S-Plus, Matlab, Xlispstat

Agent Orange

- Agent Orange is an herbicide mixture used during the Vietnam War to destroy forest cover and vegetation.
- Agent Orange, so called from the orange color of its storage drums, contains the highly toxic impurity dioxin.
- The Operation Ranch Hand mission involved spraying 20 million gallons of Agent Orange over 3.6 million acres of Vietnamese land.
- Agent Orange has been linked to cancers and other diseases in several epidemiological studies.
- About 3 million Americans served in the armed forces in Vietnam during the Vietnam War.

Dioxin Levels

Dioxin concentrations in parts per trillion (ppt) for 646 ground combat Vietnam veterans and 97 veterans who did not serve in Vietnam.

- nonrandom sample of Vietnam vets who served during 1967-1968
- nonrandom sample of vets who served in the US and Germany between 1965 and 1971
- dioxin measurements taken from blood serum in 1987 as a biological marker for previous Agent Orange exposure

Startup R

- Unix command line: R
- Windows/Mac: double-click the R GUI
- Under emacs/ESS: enter M-x R

Saving Data/Ending your R Session

    save.image()   # use to save workspace without quitting
    q()            # quits R and prompts you to save data

Creating a Dataframe in R

    > vets = read.table("case0302.csv", header=T, sep=",")
    > names(vets)
    [1] "DIOXIN"  "VETERAN"

Notes:
1. header=T tells R that the first line of the file contains variable names
2. sep="," tells R that columns of data are separated by a comma (the csv format)
3. the names function extracts the names of variables in a dataframe

Reading Data

    read.csv     comma-separated values format
    read.fwf     fixed width format
    read.delim   tab-delimited files

See help(read.table) for options, such as setting the character string for NAs, column separators, skipping lines, etc.
See also scan() for reading in large files.

Graphical Views

1. Univariate: histograms, density curves, boxplots, quantile-quantile plots
2. Bivariate: scatter plots with trend lines, side-by-side boxplots
3. Several variables: scatter plot matrices, lattice or trellis plots, 3-dimensional plots, dynamic plots

Histograms

[Figure: two histograms of Dioxin Concentration (ppt) on the density scale; panels "Default Number of Bins" and "50 Bins"]

    hist(y, prob=T, breaks=50, main="50 Bins", ylab="Density", xlab="Y")

Estimating the Density Curve

[Figure: "Default Settings for Histogram and Kernel Density"; histogram of Dioxin Concentration (ppt) with the kernel density curve overlaid]

    dens = density(DIOXIN)
    lines(dens, lwd=2, col="orange")

Kernel Density Estimation

Default settings lead to a “bumpy” estimate. Are these real features or artifacts of the estimation procedure?

    density(y, bw, adjust, kernel, window)

    bw               bandwidth; controls the amount of smoothing
    adjust           actual bandwidth is adjust*bw
    kernel, window   smoothing kernel

See help(density) for other options.

Reference: Silverman, B. W. (1986) Density Estimation. Chapman & Hall.
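The histogram, density overlay, and bandwidth adjustment described above can be tried out with a short sketch. The data here are simulated with rlnorm purely for illustration (an assumption; they are not the dioxin measurements in case0302.csv), and the variable names y, dens_default, and dens_smooth are hypothetical.

```r
# Simulated right-skewed data; NOT the dioxin measurements from case0302.csv.
set.seed(1)
y <- rlnorm(500, meanlog = 1.4, sdlog = 0.5)

# Histogram on the density scale, so a density curve can be overlaid.
hist(y, prob = TRUE, breaks = 50, main = "50 Bins",
     ylab = "Density", xlab = "y (simulated)")

# Default kernel density estimate, then a smoother one with adjust = 2.
dens_default <- density(y)
dens_smooth  <- density(y, adjust = 2)
lines(dens_default, lwd = 2, col = "orange")
lines(dens_smooth,  lwd = 2, lty = 2)

# adjust multiplies the bandwidth that density() would otherwise use.
dens_default$bw
dens_smooth$bw
```

Comparing the two bw values shows that adjust = 2 doubles the default bandwidth, which is why the dashed curve is smoother than the orange one.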
Adjusting Bandwidth

[Figure: four kernel density estimates of Dioxin Concentration (ppt); panels density(x = DIOXIN), density(x = DIOXIN, adjust = 1.25), density(x = DIOXIN, adjust = 1.5), density(x = DIOXIN, adjust = 2)]

Adding Locations of Data

[Figure: density(x = DIOXIN, adjust = 1.25) for Dioxin Concentration (ppt), with the data locations marked along the axis]

    rug(DIOXIN, lwd=2)

Boxplots

- Shows quartiles and “outliers”
- Box is based on lower (Q1) and upper (Q3) quartiles
- Line in box indicates median (Q2)
- Whiskers based on smallest/largest observation within 1.5*IQR of the box
- Points for all cases beyond whiskers (“outliers”)
- Good for showing outliers, skewness/symmetry

Boxplot of Dioxin

[Figure: two boxplots of Dioxin Concentration (ppt); panels "Default" and "with Horizontal=T"]

    boxplot(DIOXIN)

Side-by-Side Boxplots

    boxplot(DIOXIN ~ VETERAN, data=vets)

[Figure: side-by-side boxplots of Dioxin Concentration (ppt) for the groups OTHER and VIETNAM]

Formula

Side-by-side boxplots are useful for showing the distribution of a quantitative variable for each level of a qualitative variable.

    boxplot(DIOXIN ~ VETERAN, data=vets,
            ylab="Dioxin Concentration (ppt)",
            main="boxplot(DIOXIN ~ VETERAN, data=vets)")

- the formula DIOXIN ~ VETERAN creates a boxplot of dioxin for each level of VETERAN
- data specifies the dataframe to use

Numerical Summaries

- means (averages): mean
- standard deviations: sd
- 5-point summary: quantile (min, lower quartile, median, upper quartile, max)
- correlations: cor
- summary
- learn to use the commands aggregate and apply and its relatives

Notes

- mean and standard deviation are sensitive to outliers
- symmetric data: mean and median should be approximately equal
- skewed data: median is more appropriate as a summary

In both groups the median dioxin level is 4 ppt.

Questions/Conclusions

The original CDC study examined whether military records could be used to identify veterans who were likely to have been exposed to Agent Orange. They found that exposure measures from military records or self-reported exposures did not correlate with blood dioxin levels.
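As a rough illustration of the numerical summary commands named earlier (mean, sd, quantile, summary, aggregate, and tapply, one of the relatives of apply), the sketch below computes overall and per-group summaries. The data frame vets_sim is a simulated stand-in, an assumption made so the code runs without case0302.csv; only the column names DIOXIN and VETERAN and the group labels come from the slides, and the values are made up.

```r
# Simulated stand-in for the vets data frame; the values are made up.
set.seed(2)
vets_sim <- data.frame(
  DIOXIN  = c(rlnorm(60, meanlog = 1.4, sdlog = 0.5),
              rlnorm(40, meanlog = 1.4, sdlog = 0.5)),
  VETERAN = factor(rep(c("VIETNAM", "OTHER"), times = c(60, 40)))
)

# Overall summaries of the quantitative variable.
mean(vets_sim$DIOXIN)
sd(vets_sim$DIOXIN)
quantile(vets_sim$DIOXIN)   # min, lower quartile, median, upper quartile, max
summary(vets_sim$DIOXIN)

# Group-wise summaries: one value per level of VETERAN.
group_medians <- tapply(vets_sim$DIOXIN, vets_sim$VETERAN, median)
group_means   <- aggregate(DIOXIN ~ VETERAN, data = vets_sim, FUN = mean)
group_medians
group_means
```

With the real data, replacing vets_sim by the vets data frame read earlier would give the per-group medians that the Notes slide compares.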
