Introduction to R
by Debby Kermer
George Mason University Library
Data Services Group
http://dataservices.gmu.edu
http://dataservices.gmu.edu/workshop/r
1. Create a folder with your name on the S: Drive
2. Copy titanic.csv from R Workshop Files to that folder
History
R ≈ S ≈ S-Plus
Open Source, Free
www.r-project.org
Comprehensive R Archive Network
Download R from: cran.rstudio.com
RStudio: www.rstudio.com
R Console
See…
  >  prompt for new command
  +  waiting for rest of command
  R "guesses" whether you are done, considering () and symbols
Type…
  ↑ [Up] to get previous command
Type in everything in font: Courier New
  3+2
  3-
  2
Objects
  nine <- 9
  nine
  three <- nine / 3
  three
  my.school <- "gmu"
  my.school
Historical Conventions
  Use <- to assign values
  Use . to separate names
Current Capabilities
  = is okay now in most cases
  _ is okay now in most cases
RStudio: Press Alt - (minus) to insert "assignment operator"
Global Environment
Script Files
Vectors & Lists
  numbers <- c(101,102,103,104,105)
  numbers <- 101:105               the same
  numbers <- c(101:104,105)        the same
  numbers[ 2 ]
  numbers[ c(2,4,5)]
  numbers[-c(2,4,5)]
  Vector  Variable
  numbers[ numbers > 102 ]
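The indexing forms on this slide can be tried on their own. The sketch below only restates the slide's vector; the names second, picked, dropped, and over_102 are made up for illustration.

```r
# Build the same values three ways, as on the slide
numbers <- c(101, 102, 103, 104, 105)
same1   <- 101:105
same2   <- c(101:104, 105)

# 101:105 is stored as integers and c(101, ...) as doubles,
# so compare the values with == rather than identical()
all(numbers == same1)
all(numbers == same2)

second   <- numbers[2]               # one element by position
picked   <- numbers[c(2, 4, 5)]      # several positions at once
dropped  <- numbers[-c(2, 4, 5)]     # everything except those positions
over_102 <- numbers[numbers > 102]   # a logical test keeps matching values
```

A negative index drops positions rather than counting from the end, which differs from some other languages.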
RStudio: Press Ctrl-Enter to run the current line, or press the Run button
Files & Functions
Files Pane
1. Browse Drives
2. Choose the "S" Drive
3. Choose your folder
Working Directory
Functions
  read.table( datafile, header=TRUE, sep = "," )
  read.table    Function
  datafile      Positional Argument
  header=TRUE   Named Argument
  sep = ","     Named Argument
Reading Files
Thus, these would be identical:
  titanic <- read.table( datafile, header=TRUE, sep = "," )
  titanic <- read.csv( datafile )
If you have set a working directory, the file name is enough:
  titanic <- read.csv("titanic.csv")
If you have not set a working directory, use the whole path:
  titanic <- read.csv("S:/name/titanic.csv")
  titanic <- read.csv("S:\\name\\titanic.csv")
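To see the positional and named arguments at work without the workshop file, the sketch below writes a small stand-in CSV to a temporary folder and reads it both ways. The file name mini.csv and its contents are invented for illustration. read.csv() also presets a few other defaults (quoting, fill, comment handling), so the two results match for a simple file like this one but might not for every file.

```r
# Write a tiny stand-in CSV (hypothetical data, not titanic.csv)
datafile <- file.path(tempdir(), "mini.csv")
writeLines(c("name,age,pclass",
             "Allen,29,1",
             "Brown,44,1",
             "Cook,NA,3"), datafile)

# Positional first argument, then two named arguments
a <- read.table(datafile, header = TRUE, sep = ",")

# read.csv() presets header = TRUE and sep = ","
b <- read.csv(datafile)

identical(a, b)
str(a)
```

file.path() builds the path with forward slashes, which R accepts on Windows as well, so the doubled backslashes are only needed when typing a Windows-style path by hand.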
RStudio: Import Dataset
Working with Data
Data Frames
  str(titanic)      think structure
  int / num = Numeric (Interval / Ratio)
  Factor = Categorical (Nominal / Ordinal)
  titanic$pclass
  titanic <- read.csv("titanic.csv", as.is = "name")
Factors - Categorical Variables
  titanic$pclass.f <- factor( titanic$pclass,
      levels = c(1,2,3),                                    # current values
      labels = c("1st Class", "2nd Class", "3rd Class"),    # labels in the same order
      ordered = TRUE )                                      # ordinal variable
  It is convention to add .f; you can also give a new name or rewrite the original.
NA and NULL
Delete Variable
  titanic$embarked <- NULL
Set Values to Missing
  titanic$age[titanic$age == 99] <- NA
  same thing while reading in data:
  titanic <- read.csv("titanic.csv", na.strings = "99")
Ignore NAs Option
  na.rm = TRUE
  primarily needed for base R functions
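A short sketch of why the recoding and na.rm matter, using made-up ages in which 99 stands for "missing". The vector and the small data frame are hypothetical.

```r
# Hypothetical ages where 99 was used as a "missing" code
age <- c(22, 99, 4, 35, 99)

mean(age)                  # the 99s are averaged in as if they were real ages

age[age == 99] <- NA       # recode 99 as missing
mean(age)                  # a single NA makes the result NA
mean(age, na.rm = TRUE)    # drop the NAs before averaging

# Assigning NULL removes a column from a data frame
df <- data.frame(age = age, embarked = c("S", "C", "S", "Q", "S"))
df$embarked <- NULL
names(df)
```

One difference worth keeping in mind: na.strings in read.csv() applies to every column of the file, while the assignment form touches only the column you name.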
Review
Words with Stuff
  word             (Object)
  word[ stuff ]    (Object Part)
  word( stuff )    (Function)
  "word"           (String)
Words that are not Objects
  TRUE or T
  FALSE or F
  NA      (Not Available)
  NaN     (Not a Number)
  NULL    (Empty)
  Inf     (Infinity)
Packages
Packages must be both Installed and Loaded
To Install: install.packages("name")
To Load: library( name ) or require( name ), or check the box
In the RStudio Packages pane: listed = Installed, checked = Loaded
Confirm these are installed:
  dplyr
  tidyr
  descr
  ggplot2
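One way to confirm the four packages from the console instead of the Packages pane is sketched below. It only reports what is missing; the install line is left commented out, and the names pkgs, installed, and to_install are made up for illustration.

```r
pkgs <- c("dplyr", "tidyr", "descr", "ggplot2")

# TRUE for each package whose namespace can be loaded on this machine
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of anything still to install
to_install <- pkgs[!installed]
if (length(to_install) > 0) {
  message("Still to install: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # uncomment to install
}
```

requireNamespace() checks availability without attaching the package, so library( name ) is still needed before using its functions by bare name.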
Data Frame Alternatives
History Fact: Hadley Wickham, who created dplyr, works at RStudio
  package       dplyr
  data.frame    tbl_df
  data.table    tbl_dt
package titanic
  library(dplyr)
  tt <- tbl_df(titanic)
  tt
  str(tt)
Choose Variables
base
  tt$name
  tt[,"name"]
  tt[,c("age","gender")]
dplyr
  select( tt, name)
  select( tt, -name)
  select( tt, age, gender)
  select( tt, gender : pclass)
  select( tt, starts_with("p"))
  also: contains, starts_with, ends_with, matches, distinct
Choose Cases
base
  tt[tt$age < 5 , ]
  titanic[titanic$age < 5 , ]
  attach(titanic)
  titanic[age < 5 , ]
  titanic[age < 5 & is.na(age) == F , ]
dplyr
  filter(tt, age < 5 )
  filter(tt, age < 5, pclass.f == "1st Class" )
  filter(tt, age < 5 | pclass == 1 )
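The is.na() condition in the base examples is there because a missing age makes the comparison itself NA. A minimal sketch with four invented passengers shows the effect; the data frame and the names with_na, a, and b are hypothetical.

```r
# Hypothetical passengers; one age is missing
tt <- data.frame(name = c("Ann", "Ben", "Cal", "Dee"),
                 age  = c(3, NA, 40, 1))

# The NA comparison yields an NA index, which returns a row of NAs
with_na <- tt[tt$age < 5, ]
with_na

# Two base R ways to leave that row out
a <- tt[tt$age < 5 & !is.na(tt$age), ]
b <- tt[which(tt$age < 5), ]
a
```

dplyr's filter() drops rows where the condition evaluates to NA, which is why the dplyr versions need no extra is.na() test.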
Change data
base
  tt$child <- tt$age <= 12
  tt$totfam <- tt$sibsp + tt$parch
  tt$bigfam <- as.numeric(tt$totfam > 4)
dplyr
  tt <- mutate(tt, child = age <= 12)
  tt <- mutate(tt, totfam = sibsp + parch,
                   bigfam = as.numeric(totfam > 4) )
Chaining / Piping
%>%   Read "then…"
RStudio: Ctrl+Shift+M
History Fact: from magrittr, originally %.%
  select(tt, name, age)
  vs
  tt %>% select(name, age)
  tt %>% filter(age<5) %>% select(name, age)
works anytime the 1st argument is the dataset
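The "then" reading can be tried without installing anything by using R's built-in |> pipe, which assumes R 4.1 or later. This is a stand-in for illustration, not the magrittr %>% from the slide, and base subset() stands in for filter() and select(). The data frame is invented.

```r
tt <- data.frame(name = c("Ann", "Ben", "Cal"),
                 age  = c(3, 30, 4),
                 fare = c(10, 50, 12))

# Nested calls: read from the inside out
nested <- subset(subset(tt, age < 5), select = c(name, age))

# Piped: take tt, then keep the young passengers, then keep two columns
piped <- tt |>
  subset(age < 5) |>
  subset(select = c(name, age))

identical(nested, piped)
```

Each step receives the result of the previous one as its first argument, which is the same rule the slide gives for %>%.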
Hadley Wickham's Packages
library(dplyr)
  select    : Choose variables
  filter    : Choose cases
  mutate    : Change values
  summarize : Aggregate values
  group_by  : Create groups
  arrange   : Order cases
  bind_rows
library(tidyr)
  spread
  gather
  separate
library(lubridate)
library(stringr)
History Fact: dplyr - update of plyr for data tables
Descriptive Statistics
Summarize
base
  summary(tt)
  mean(tt$age)
  mean(tt$age , na.rm = T )
  sd(tt$age , na.rm = T )
dplyr
  summarize(tt, xbar=mean(age, na.rm=T))
  summarize(tt, n=n(), sd=sd(sibsp))
descr Package
  library(descr)
  freq(tt$pclass)
  freq(tt$pclass.f)
  freq(tt$age)
  CrossTable(tt$pclass, tt$survived)
  CrossTable(tt$pclass, tt$survived,
             prop.t = F, prop.c = F, prop.r = T,    # T is default for all
             digits = 2 )
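For comparison, the counts and row proportions behind a cross-tabulation can be sketched in base R with table() and prop.table(). This is a stand-in that does not reproduce the descr output layout, and the six observations are invented.

```r
# Hypothetical class and survival codes
pclass   <- c(1, 1, 2, 3, 3, 3)
survived <- c(1, 0, 1, 0, 0, 1)

counts <- table(pclass, survived)            # cell counts
counts

row_pct <- prop.table(counts, margin = 1)    # proportions within each row
round(row_pct, 2)
```

margin = 1 gives row proportions, which corresponds to keeping prop.r and turning off prop.c and prop.t in the CrossTable() call.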
Pivot Table
  library(dplyr)
  library(tidyr)
  tt %>%
    group_by(pclass, gender) %>%
    summarize( pct = mean(survived) ) %>%
    spread( gender, pct )
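The same kind of table, survival rate by class and gender, can be sketched in base R with tapply(). This is offered as a point of comparison rather than as the slide's method, and the six rows of data are invented.

```r
tt <- data.frame(pclass   = c(1, 1, 1, 3, 3, 3),
                 gender   = c("female", "male", "male", "female", "female", "male"),
                 survived = c(1, 0, 1, 1, 0, 0))

# Mean of survived within each pclass x gender cell;
# rows are pclass, columns are gender
pivot <- tapply(tt$survived, list(tt$pclass, tt$gender), mean)
pivot
```

Because survived is coded 0/1, the mean of each cell is the proportion who survived, which is what pct holds in the dplyr version.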
ggplot2
History Fact: Created by Hadley Wickham, based on the book "Grammar of Graphics" by Leland Wilkinson
  plot( tt$age )
  plot( tt$age, tt$fare )
  library(ggplot2)
  qplot(age, data=tt)
  qplot(age, fare, data=tt)
See full documentation at: www.ggplot2.org
Levels of Measurement
  plot( tt$pclass, tt$survived )
  tt$survived.f <- factor( tt$survived, labels = c("Died","Survived") )
  levels(tt$gender) <- c("Males", "Females")    # new names must follow the order shown by levels(tt$gender)
  plot( tt$pclass.f, tt$survived.f )
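Labels are matched to levels by position, so it helps to look at the existing levels before renaming them. The sketch below uses small invented vectors rather than the Titanic columns.

```r
# Numeric codes turned into a labelled factor
survived   <- c(0, 1, 1, 0)
survived.f <- factor(survived, levels = c(0, 1),
                     labels = c("Died", "Survived"))
table(survived.f)

# Renaming the levels of an existing factor
gender <- factor(c("male", "female", "female"))
levels(gender)                             # sorted alphabetically: "female" "male"
levels(gender) <- c("Females", "Males")    # new names in that same order
as.character(gender)
```

Giving the new names in the wrong order would silently swap the groups, since R does not check the names against the old ones.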
qplot
  qplot(pclass.f, data=tt, fill=survived.f)
  qplot(age, fare, data=tt, color=survived.f)
  qplot(age, data=tt, fill = survived.f, alpha = I(0.3), position = "identity")

  qplot( x, y, data=,
         color=, shape=, size=, fill=,
         method=, formula=,
         alpha=,            #transparency
         geom=,             #type
         facets=,           #matrix
         xlim=, ylim=,      #axis ranges
         xlab=, ylab=,      #axis labels
         main=, sub=        #titles
       )
Referring to Variables
  mean( tt$age )
  qplot( age, data=tt )
  select( tt, age )
  attach(titanic)
  names(df) <- c( "var1","var2","var3" )
  names(df)[ 1 ] <- "var1"
Statistical Analysis Writing Models
  ~   predicted from
  +   include
  :   interaction
  *   factorial

Statistical Equation                                                                 R Formula
  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$                                      Y ~ X
  $Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \varepsilon_i$                        Y ~ X + Z
  $Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \beta_3 X_i Z_i + \varepsilon_i$      Y ~ X * Z  or  Y ~ X + Z + X:Z
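That X * Z is shorthand for X + Z + X:Z can be checked by asking R which columns each formula produces. The sketch uses model.matrix() on a tiny invented data frame; d, star, and spelt are made-up names.

```r
# Hypothetical numeric predictors
d <- data.frame(X = c(1, 2, 3, 4), Z = c(0, 1, 0, 1))

# Column names of the design matrix each formula builds
star  <- colnames(model.matrix(~ X * Z, data = d))
spelt <- colnames(model.matrix(~ X + Z + X:Z, data = d))

star
identical(star, spelt)
```

The "(Intercept)" column corresponds to the constant term in the equations; R adds it unless the formula removes it.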
t.test( fare ~ gender, data = tt )
Analysis Objects
  tt.anova <- aov( fare ~ gender*pclass, data = tt )
  summary(tt.anova)
  tt.logistic <- glm( survived ~ gender + pclass + gender:pclass + age + child,
                      family = binomial, data = tt )
  summary(tt.logistic)
  gender*pclass and gender + pclass + gender:pclass are the same
More with Analysis Objects
  plot(tt.logistic)
  tt.pred <- predict(tt.logistic)
  tt.resid <- residuals(tt.logistic)
  plot(tt.pred, tt.resid)
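What predict() and residuals() return for a logistic model is easy to miss, so here is a sketch on simulated data (not the Titanic file). The variable names, the sample size, and the coefficients used to simulate are all invented for illustration.

```r
set.seed(1)

# Simulated stand-in data: survival made to depend on age
n        <- 200
age      <- runif(n, 1, 70)
survived <- rbinom(n, 1, plogis(1 - 0.04 * age))
d        <- data.frame(age, survived)

fit <- glm(survived ~ age, family = binomial, data = d)

class(fit)    # the fitted model is an object with its own class
coef(fit)     # estimated intercept and slope

pred  <- predict(fit)                       # default: linear predictor (log-odds)
prob  <- predict(fit, type = "response")    # predicted probabilities
resid <- residuals(fit)                     # default: deviance residuals
```

By default predict() on a glm gives values on the log-odds scale, so type = "response" is needed when probabilities are wanted.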
More Capabilities
R Markdown
Analysis Environments
R Commander
  Separate Interface
  More/Better Statistics
  www.rcommander.com
  install.packages("Rcmdr")
  library(Rcmdr)
Deducer
  Adds to R Interface (not RStudio!)
  Easier Data Management
  www.deducer.org
  install.packages("Deducer")
  library(Deducer)
Deducer Plot Builder
Data Mining GUI
  install.packages("rattle")
  require("rattle")
  rattle()
Next Steps
Finding Packages
Swirl
  install.packages("swirl")
  require("swirl")
  install_from_swirl("Course")
  swirl()
  http://swirlstats.com
Tutorials
http://dataservices.gmu.edu/software/r
http://tryr.codeschool.com/
https://www.datacamp.com/
Slides are available at: http://dataservices.gmu.edu/workshops/r
© 2015 by Debby Kermer, Mason Library Data Services
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/