Tuan V. Nguyen
Genetics Epidemiology of Osteoporosis Lab
Garvan Institute of Medical Research
Garvan Institute Biostatistical Workshop, 17 April 2014 © Tuan V. Nguyen
Introduction to R
• A brief history
• Installation
• Packages
• Essential grammar
• A session with R
Previously …
• Many statistical packages were/are available
• Popular packages include Systat, Minitab, Statistica, BMDP, S+, Gauss, Spida, JMP, SPSS, Stata, SAS and now R
R is gaining popularity
[Figure: Number of scholarly articles that reference each software by year (Source: Muenchen R. The popularity of data analysis software, r4stat.com/articles/popularity)]
R is gaining popularity
[Figure: Number of scholarly articles that reference each software by year, after removing the top two, SPSS and SAS (Source: Muenchen R. The popularity of data analysis software, r4stat.com/articles/popularity)]
A brief history
• R is a “statistical and graphical programming language”
• Originated from S
  – 1988 - S2: RA Becker, JM Chambers, A Wilks
  – 1992 - S3: JM Chambers, TJ Hastie
  – 1998 - S4: JM Chambers
• R was initially written by Ross Ihaka and Robert Gentleman (Univ of Auckland, New Zealand) in the 1990s
• From 1997: international “R-core”, 15 people
What can R do?
• It is a statistical language
• All models of statistical analysis
• Great for simulation work
• Programming (do you want to take a challenge?)
Why R ?
• Open source – totally free!
• Developed by professional and academic statisticians
• Runs on Windows, Unix, MacOS
• Keeps up-to-date with methodological developments
• Speaks the language of experts (bioinformatics and statistics)
• Large user community
Installation
cran.r-project.org
Installation of R on Windows
• Select Windows
• Select “base”
• Run → OK → Next
• Then Finish – R icon on your desktop
A screenshot of R
RStudio
An “add-on” of R
RStudio http://rstudio.org
Introduction to RStudio
• An IDE (Integrated Development Environment) for R
• Provides some convenient functions for running R
• R also has a number of other IDEs:
  • TinnR
  • R commander
R and RStudio
Can run R within RStudio (you don’t need to start R)
RStudio
[Screenshot of the RStudio window. Panes: Workspace (variables), R console, Files, Packages]
“R is a real demonstration of the power of collaboration” (Ihaka)
Packages
• R = Base + Packages
• Base R includes basic functions for simple analyses
• Packages are modules for specific analyses
• More than 6000 packages in R!
Common packages
Hmisc: Miscellaneous for data manipulation
car: Companion to regression analysis
foreign: For reading data from other software
tables: For tabulation of data
gmodels: Programming tools
ggplot2: Advanced graphics
sciplot: Scientific graphs
Zelig: “Everyone’s statistical software”
rms: Regression modeling strategies
survival: Survival analyses
epiR: Epidemiological analyses
epicalc: Epidemiological analyses
boot: Bootstrap analyses
cluster: Cluster analysis
psych: Psychometrics and descriptive statistics
Basic management of packages
• Installing new packages (try now!)
install.packages(c("Hmisc", "rms", "tables", "foreign", "gmodels", "ggplot2", "sciplot", "Zelig", "car", "survival", "epiR", "epicalc", "boot", "cluster", "psych", "binom", "BMA", "ExactCIdiff", "lattice", "mgcv", "gam", "nlme", "quantreg"))
• To find out which packages you have installed
library()
R Grammar: a quick introduction
Interacting with R
• Start up R
• Can use up/down arrow keys to retrieve command history
• Can use left/right keys to edit a command line
• Can use TAB to complete a command – very useful!
• Multiple commands can be written in 1 line by using the “;” separator
Variable names
• Use letters, numbers, and the signs . and _ (not -, which R reads as minus)
• Assignment symbol: <- or =
• Distinction between upper and lower case letters
Genotype = 5; genotype <- 7; Geno.type = Genotype + genotype
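The naming rules above can be tried directly at the console. The following is only a small sketch: it reuses the three names from the slide and adds one base R helper, make.names(), to show what R does with a name that contains a minus sign.

```r
# Names that differ only in case are different objects.
Genotype <- 5
genotype <- 7
Geno.type <- Genotype + genotype
Geno.type                    # 12

# "geno-type" is not a valid name: R would read it as geno minus type.
# make.names() shows the nearest syntactically valid name.
make.names("geno-type")      # "geno.type"

# exists() reports whether a name is currently defined.
exists("Genotype")
```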
Object-oriented language
R is an object-oriented language
• Function
• Vector
• Matrix
• Dataframe
Function
• R “commands” = functions
• A function has arguments
• Arguments include variables (name), parameters, options, etc
• Example: fitting a linear regression model y = a + bx
m1 = lm(y ~ x, data=test)
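The slides do not show what the dataset test contains, so the sketch below builds a hypothetical one by simulation (true values a = 2 and b = 0.5 are assumptions chosen for illustration) and then runs the same lm() call on it.

```r
# Hypothetical data: `test` is simulated here, it is not the author's dataset.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)
test <- data.frame(x, y)

# object name <- function(arguments)
m1 <- lm(y ~ x, data = test)

class(m1)     # "lm": the fitted model is itself an object
coef(m1)      # estimates of a (intercept) and b (slope)
```

Because the result is stored in m1, other functions such as summary(m1) can be applied to it later without refitting.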
Reading m1 = lm(y ~ x, data=test):
Object name: m1
Function: lm = linear model
Arguments: variables (y, x) and dataset name (test)
Vector
• Vectors are the basic building blocks in R
• Vector = a series of values
• Values can be numeric or character
score = c(4,2,1,5)
gender = c('F','M','F','M')
c (concatenation) for direct data entry
Matrix
• Rectangular data → rows, columns
• A matrix can be a collection of vectors
1 3 6 7
3 4 7 9
5 7 8 0
v1 = c(1,3,5)
v2 = c(3,4,7)
v3 = c(6,7,8)
v4 = c(7,9,0)
m = cbind(v1,v2,v3,v4)
m
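As a small sketch of the same idea, the four vectors can be combined in more than one way. The names m2 and r are introduced here for illustration only; cbind(), rbind() and matrix() are base R functions.

```r
v1 <- c(1, 3, 5); v2 <- c(3, 4, 7); v3 <- c(6, 7, 8); v4 <- c(7, 9, 0)

m  <- cbind(v1, v2, v3, v4)                  # each vector becomes a column
m2 <- matrix(c(v1, v2, v3, v4), nrow = 3)    # same values, filled column by column
dim(m)                                       # 3 rows, 4 columns
all(m == m2)                                 # TRUE: identical values

r <- rbind(v1, v2, v3, v4)                   # each vector becomes a row
dim(r)                                       # 4 rows, 3 columns
```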
Reference to matrix
• Row first, column later
• Flexible in R
> m
     v1 v2 v3 v4
[1,]  1  3  6  7
[2,]  3  4  7  9
[3,]  5  7  8  0
> m[2,3]
v3
 7
> m[1,]
v1 v2 v3 v4
 1  3  6  7
> m[,2:3]
     v2 v3
[1,]  3  6
[2,]  4  7
[3,]  7  8
> m[1:2,]
     v1 v2 v3 v4
[1,]  1  3  6  7
[2,]  3  4  7  9
> m[,3:4]*m[1,2]
     v3 v4
[1,] 18 21
[2,] 21 27
[3,] 24  0
Dataframe
Dataset in R = “Dataframe” = matrix
Columns = fields = variables; rows = records = observations
ID         Gender       Math       Reading
1          F            5          8
2          M            5          2
3          F            7          3
4          F            8          6
(numeric)  (character)  (numeric)  (numeric)
Reference to field/column in a dataframe
• Attach the dataframe (attach) prior to analysis to refer to its fields by name
• Reference to field: (dataframe name)$(field name)
• Example:
v1 = c(1,3,5)
v2 = c(3,4,7)
v3 = c(6,7,8)
v4 = c(7,9,0)
dat = data.frame(v1, v2, v3, v4)
attach(dat)
dat$sum = dat$v1 + dat$v3
sum1 = v1 + v3
dat
The effect of $
> dat
  v1 v2 v3 v4 sum
1  1  3  6  7   7
2  3  4  7  9  10
3  5  7  8  0  13
There is NO sum1 in dat! sum1 was created without $, so it is a separate object in the workspace, not a column of the dataframe.
Data coding in R
id = c(1, 2, 3, 4, 5)
gender = c("male", "female", "male", "female", "female")
dat = data.frame(id, gender)
We want to create a new variable called sex with numeric values (1, 2)
dat$sex = NA   # create the column first so that it has one value per row
dat$sex[dat$gender=="male"] <- 1
dat$sex[dat$gender=="female"] <- 2
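The same recoding can also be written in one step. This is a sketch of two common base R alternatives, ifelse() and factor(); the column name gender.f is introduced here for illustration.

```r
id <- c(1, 2, 3, 4, 5)
gender <- c("male", "female", "male", "female", "female")
dat <- data.frame(id, gender)

# ifelse() recodes the whole column in one vectorised step.
dat$sex <- ifelse(dat$gender == "male", 1, 2)

# A factor keeps the labels and stores integer codes underneath.
dat$gender.f <- factor(dat$gender, levels = c("male", "female"))
as.integer(dat$gender.f)      # 1 2 1 2 2

# Cross-tabulate old and new codes to check the recoding.
table(dat$gender, dat$sex)
```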
Character and numeric coding
Character to numeric
X = c("1", "2", "3", "4", "5")
We want to create a new variable called Y with numeric values (for calculation)
Y = as.numeric(X)
mean(Y)
Numeric to character
Y = 1:10
We want to create a new variable called X with character values
X = as.character(Y)
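Two situations are worth trying when converting between types: text that is not a number, and factors. The values below are made up for illustration.

```r
# Text that cannot be read as a number becomes NA (R also issues a warning,
# suppressed here so the example runs quietly).
X <- c("1", "2", "three")
Y <- suppressWarnings(as.numeric(X))
Y                              # 1 2 NA
mean(Y, na.rm = TRUE)          # 1.5

# For a factor, as.numeric() returns the internal level codes,
# so convert to character first to recover the values.
f <- factor(c("10", "20", "10"))
as.numeric(f)                  # 1 2 1   (level codes)
as.numeric(as.character(f))    # 10 20 10 (the values)
```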
Sorting data: sort()
X = rnorm(10); X
 [1]  1.5651300 -0.5382971 -0.1995302  1.0111098  0.3590144 -1.5245237
 [7] -0.3192534  0.1323256 -0.7916954 -0.0664167
sort(X)
 [1] -1.5245237 -0.7916954 -0.5382971 -0.3192534 -0.1995302 -0.0664167
 [7]  0.1323256  0.3590144  1.0111098  1.5651300
Merging datasets
id = c(1,2,3,4)
sex=c("M","F","M","F")
dat1=data.frame(id,sex)
id = c(1,2,3,4,5)
age=c(21,34,45,32,18)
dat2=data.frame(id,age)
dat = merge(dat1, dat2, by="id")   # keeps only the ids found in both datasets
dat = merge(dat1, dat2, by="id", all.x=T, all.y=T)   # keeps every id; unmatched fields become NA
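A short sketch of what the two merge() calls return, using the same two datasets. The names inner and outer are introduced here for illustration; all = TRUE is shorthand for all.x and all.y together.

```r
dat1 <- data.frame(id = c(1, 2, 3, 4), sex = c("M", "F", "M", "F"))
dat2 <- data.frame(id = c(1, 2, 3, 4, 5), age = c(21, 34, 45, 32, 18))

inner <- merge(dat1, dat2, by = "id")               # ids present in both
outer <- merge(dat1, dat2, by = "id", all = TRUE)   # every id from either side
nrow(inner)                   # 4
nrow(outer)                   # 5
outer$sex[outer$id == 5]      # NA: id 5 has no record in dat1

# sort() works on a vector; to sort a dataframe, index its rows with order().
outer[order(outer$age), ]
```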
An R Session (demo)
To work with R …
• R, like most statistical programs, works on observations (rows) and variables
• You should keep in mind
  – Name of dataframe
  – Name of variables
Allison and Cicchetti’s study
Truett Allison; Domenic V. Cicchetti. Sleep in Mammals: Ecological and Constitutional Correlates. Science 1976; 194:732-734.
R Session
• Reading a file into R for analysis
  Filename: allison.csv
• Some graphical analyses
• Some descriptive (and not so descriptive) analyses
Allison T, Cicchetti DV (1976). Sleep in mammals: ecological and constitutional correlates. Science 194, 732–734.
Species                 BodyWt  BrainWt  NonDreaming  Dreaming  TotalSleep  LifeSpan  Gestation  Predation  Exposure  Danger
Africanelephant         6654    5712     NA           NA        3.3         38.6      645        3          5         3
Africangiantpouchedrat  1       6.6      6.3          2         8.3         4.5       42         3          1         3
ArcticFox               3.385   44.5     NA           NA        12.5        14        60         1          1         1
Arcticgroundsquirrel    0.92    5.7      NA           NA        16.5        NA        25         5          2         3
Asianelephant           2547    4603     2.1          1.8       3.9         69        624        3          5         4
Baboon                  10.55   179.5    9.1          0.7       9.8         27        180        4          4         4
Bigbrownbat             0.023   0.3      15.8         3.9       19.7        19        35         1          1         1
Braziliantapir          160     169      5.2          1         6.2         30.4      392        4          5         4
Cat                     3.3     25.6     10.9         3.6       14.5        28        63         1          2         1
Chimpanzee              52.16   440      8.3          1.4       9.7         50        230        1          1         1
Chinchilla              0.425   6.4      11           1.5       12.5        7         112        5          4         4
Cow                     465     423      3.2          0.7       3.9         30        281        5          5         5
Deserthedgehog          0.55    2.4      7.6          2.7       10.3        NA        NA         2          1         2
Donkey                  187.1   419      NA           NA        3.1         40        365        5          5         5
EasternAmericanmole     0.075   1.2      6.3          2.1       8.4         3.5       42         1          1         1
Reading file csv
• Locate your folder and filename
• Use the function read.csv
• On a Mac, you can simply drag the file to the R command line
dat = read.csv("~/Dropbox/Garvan Lectures 2014/Datasets and Teaching Materials/allison.csv", header=T, na.strings="NA")
Reading file through file.choose()
f = file.choose() # find the file
dat = read.csv(f, header=T, na.strings="NA")
attach(dat) # attach the data before analysis
names(dat) # want to know variable names
dim(dat) # how many rows and columns?
summary(dat) # summarize data
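To try read.csv without the workshop file, the sketch below writes a three-row CSV to a temporary file and reads it back. The temporary file is a hypothetical stand-in for allison.csv, and it holds only five of the columns; the three rows are taken from the study's data table.

```r
# A temporary file stands in for allison.csv (an assumption for this sketch).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Species,BodyWt,BrainWt,TotalSleep,LifeSpan",
  "Africanelephant,6654,5712,3.3,38.6",
  "Arcticgroundsquirrel,0.92,5.7,16.5,NA",
  "Cat,3.3,25.6,14.5,28"
), csv_path)

dat <- read.csv(csv_path, header = TRUE, na.strings = "NA")
dim(dat)                            # 3 rows, 5 columns
names(dat)                          # variable names from the header line
sum(is.na(dat$LifeSpan))            # 1 missing value
mean(dat$LifeSpan, na.rm = TRUE)    # mean of the non-missing values

# with() evaluates an expression inside the dataframe without attach().
with(dat, BrainWt / BodyWt)
```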
Summary: an overall “picture”
> summary(dat)
                   Species       BodyWt            BrainWt
 Africanelephant       : 1   Min.   :   0.005   Min.   :   0.14
 Africangiantpouchedrat: 1   1st Qu.:   0.600   1st Qu.:   4.25
 ArcticFox             : 1   Median :   3.342   Median :  17.25
 Arcticgroundsquirrel  : 1   Mean   : 198.790   Mean   : 283.13
 Asianelephant         : 1   3rd Qu.:  48.203   3rd Qu.: 166.00
 Baboon                : 1   Max.   :6654.000   Max.   :5712.00
 (Other)               :56
  NonDreaming        Dreaming       TotalSleep       LifeSpan
 Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000
 1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625
 Median : 8.350   Median :1.800   Median :10.45   Median : 15.100
 Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878
 3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750
 Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000
 NA's   :14       NA's   :12      NA's   :4       NA's   :4
   Gestation        Predation        Exposure         Danger
 Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000
 1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000
 Median : 79.00   Median :3.000   Median :2.000   Median :2.000
 Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613
 3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000
 NA's   :4
Descriptive statistics: counting
library(tables)
tabular(factor(Exposure) ~ (n=1 + Percent("col")))
                  All
                  n
 factor(Exposure) All  Percent
 1                27   43.548
 2                13   20.968
 3                 4    6.452
 4                 5    8.065
 5                13   20.968
Descriptive statistics: mean, SD, etc
tabular(factor(Exposure) ~ LifeSpan*(n=1 + mean + median + sd), data=na.omit(dat))
                  LifeSpan
                  n
 factor(Exposure) All  mean   median  sd
 1                18   15.17   5.75   24.103
 2                 9   13.81   7.00   15.622
 3                 4   25.40  26.50   13.846
 4                 4   20.30  23.60    9.428
 5                 7   33.34  30.00   18.452
Descriptive statistics: graph
means=with(na.omit(dat), tapply(LifeSpan, Exposure, mean))
barplot(sort(means), horiz=T, las=1, col="blue", xlab="Life Span", ylab="Exposure")
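The counts, percentages and group means above can also be obtained with base R alone. The sketch below uses a small made-up dataframe d (hypothetical values, not the Allison data) so that the arithmetic is easy to check by hand.

```r
# Hypothetical values, for illustration only (not the Allison data).
d <- data.frame(
  Exposure = c(1, 1, 1, 2, 2, 3),
  LifeSpan = c(4, 6, NA, 10, 20, 30)
)

# Counts and column percentages of the grouping variable.
n <- table(d$Exposure)
round(100 * prop.table(n), 1)       # 50.0 33.3 16.7

# Mean by group, ignoring missing values.
means <- tapply(d$LifeSpan, d$Exposure, mean, na.rm = TRUE)
means                               # 5 15 30

# Several statistics at once; the formula interface drops rows with NA.
# A group with a single value has no SD, so sd is NA for Exposure 3.
aggregate(LifeSpan ~ Exposure, data = d,
          FUN = function(v) c(n = length(v), mean = mean(v),
                              median = median(v), sd = sd(v)))
```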
Descriptive statistics: graph
library(sciplot)
bargraph.CI(Exposure, LifeSpan, lc=F, data=na.omit(dat))
Box plot
boxplot(LifeSpan ~ Exposure, notch=F, col="blue")
[Figure: box plots of LifeSpan (0 to 100) by Exposure (1 to 5)]
Even better box plot
library(ggplot2)
qplot(x=factor(Exposure), y=LifeSpan, data=dat, geom=c("boxplot", "jitter"), fill=Exposure)
[Figure: box plots with jittered points; x axis factor(Exposure), y axis LifeSpan, fill legend Exposure]
Histogram
hist(LifeSpan, prob=T, col="blue")
lines(density(LifeSpan, na.rm=T), col="red", lwd=3)
[Figure: Histogram of LifeSpan with density curve; x axis LifeSpan, y axis Density]
Histogram with ggplot2
# theme() replaces opts(), which is no longer available in current ggplot2
qplot(x=LifeSpan) + geom_histogram(col="white", fill="blue") + theme(legend.position="none")
[Figure: histogram; x axis LifeSpan, y axis count]
Histogram and density with ggplot2
m = ggplot(data=dat, aes(x=LifeSpan))
m + geom_histogram(binwidth=20, aes(y=..density..), col="white", fill="blue", lwd=0.5) + geom_density()
[Figure: histogram with density curve; x axis LifeSpan, y axis density]
More “fancy” histogram
library(ggplot2)
# theme() replaces opts(), which is no longer available in current ggplot2
qplot(x=LifeSpan, geom="density", fill=factor(Exposure), alpha=I(0.5)) + theme(legend.position="top")
Scatter plot
plot(BodyWt, BrainWt, pch=16, col="blue")
Scatter plot with labels
plot(BodyWt, BrainWt, pch=16, col="blue")
text(BodyWt, BrainWt, labels=Species, cex=0.5)
Scatter plot with transformation
plot(log(BodyWt), log(BrainWt), pch=16, col="blue")
Scatter plot with straight line
plot(log(BrainWt) ~ log(BodyWt), pch=16, col="blue")
abline(lm(log(BrainWt) ~ log(BodyWt)), col="red")
Scatter plot coloured by a 3rd variable
qplot(x=log(BodyWt), y=log(BrainWt), col=Exposure) + stat_smooth(method="lm", se=T)
Scatter plot scaled by size
qplot(x=log(BodyWt), y=log(BrainWt), size=Danger, col=Exposure) + stat_smooth(method="lm", se=T)
Multiple scatter plots with straight line
qplot(log(BodyWt), log(BrainWt), data=dat, facets=~Danger) + geom_abline()
Correlogram
library(psych)
vars=cbind(log(BodyWt), log(BrainWt), TotalSleep, Dreaming, LifeSpan, Gestation)
pairs.panels(vars)
[Figure: pairs.panels correlogram. Correlations shown in the upper panels, variables in the order of vars:]
              log(BrainWt)  TotalSleep  Dreaming  LifeSpan  Gestation
log(BodyWt)   0.96          -0.53       -0.23     0.61      0.77
log(BrainWt)                -0.56       -0.34     0.71      0.78
TotalSleep                              0.73      -0.41     -0.63
Dreaming                                          -0.30     -0.45
LifeSpan                                                    0.61
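The numbers in a correlogram are ordinary pairwise correlations, which cor() returns as a matrix. The sketch below uses simulated data (hypothetical values and variable names, not the Allison data) with a few missing values, and also fits the log-log straight line that abline(lm(...)) draws on a scatter plot.

```r
# Hypothetical data, for illustration only (not the Allison data).
set.seed(42)
body  <- exp(rnorm(30, mean = 1, sd = 2))            # skewed, like body weight
brain <- exp(0.75 * log(body) + rnorm(30, sd = 0.3))
sleep <- 12 - 0.8 * log(body) + rnorm(30)
sleep[c(3, 17)] <- NA                                 # some missing values

vars <- cbind(logBody = log(body), logBrain = log(brain), sleep)

# Each correlation uses all rows where that pair of variables is observed.
r <- cor(vars, use = "pairwise.complete.obs")
round(r, 2)

# Intercept and slope of the straight line on the log-log scale.
fit <- lm(log(brain) ~ log(body))
coef(fit)
```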
Factor analysis
library(psych)
vars=cbind(BodyWt, BrainWt, LifeSpan, Gestation, TotalSleep, Danger, Predation)
fit = factanal(na.omit(vars), 2, rotation="varimax")
fit
Loadings:
            Factor1  Factor2
BodyWt       0.933
BrainWt      0.995
LifeSpan     0.511
Gestation    0.771    0.264
TotalSleep  -0.333   -0.614
Danger                0.996
Predation             0.948

                Factor1  Factor2
SS loadings       2.834    2.345
Proportion Var    0.405    0.335
Cumulative Var    0.405    0.740

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 62.94 on 8 degrees of freedom.
The p-value is 1.23e-10
Summary
• R – an important development in statistical science
• Absolutely free, powerful, highly flexible
• Widely used around the world
• Fits all statistical models
• Very useful for simulation work
• High quality (e.g. publishable) graphics
Books and references
Dalgaard P (2008) Introductory Statistics with R. New York: Springer, 2nd edition.
Seefeld K, Linder E (2007) Statistics using R with biological examples. Available online (free). http://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf
Braun WJ, Murdoch DJ (2007) A First Course in Statistical Programming with R. Cambridge: Cambridge University Press.
Wickham H (2009) ggplot2: Elegant Graphics for Data Analysis. New York: Springer.
Useful websites
www.rseek.org (Google)