Introduction to R

by Debby Kermer
George Mason University Library
Data Services Group
http://dataservices.gmu.edu
[email protected]

http://dataservices.gmu.edu/workshop/r

1. Create a folder with your name on the S: Drive
2. Copy titanic.csv from R Workshop Files to that folder

History

R ≈ S ≈ S-Plus
Open Source, Free
www.r-project.org

Comprehensive R Archive Network

Download R from: cran..com

RStudio: www.rstudio.com

R Console

See…
    >    prompt for new command
    +    waiting for rest of command
R "guesses" whether you are done, considering () and symbols

Type…
    ↑ [Up] to get previous command

Type in everything in font: Courier New
    3+2
    3-
    2

Objects
    nine <- 9
    nine
    three <- nine / 3
    three
    my.school <- "gmu"
    my.school

Historical Conventions
    Use <- to assign values
    Use . to separate names

Current Capabilities
    = is okay now in most cases
    _ is okay now in most cases
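The assignment examples above can be tried as one short script. This is a minimal sketch; the name my_school is a hypothetical variant of my.school, used only to show = and _ in action.

```r
# Assign with <- and print an object by typing its name
nine <- 9
three <- nine / 3
three                  # prints [1] 3

# = also assigns at the top level, and _ is allowed in names
my_school = "gmu"      # hypothetical name, not from the workshop files
toupper(my_school)     # prints [1] "GMU"
```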

RStudio: Press Alt+- (Alt and the minus key) to insert the "assignment operator", <-

Global Environment

Script Files

Vectors & Lists
    numbers <- c(101,102,103,104,105)
    numbers <- 101:105                  # the same
    numbers <- c(101:104,105)           # the same
    numbers[ 2 ]
    numbers[ c(2,4,5)]
    numbers[-c(2,4,5)]
    numbers[ numbers > 102 ]
Vector  Variable
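The indexing forms above give the following results. A small sketch; the names a and b are introduced here only to compare the two ways of building the vector, which hold the same values but are stored as different types.

```r
numbers <- 101:105

numbers[2]               # 102
numbers[c(2, 4, 5)]      # 102 104 105
numbers[-c(2, 4, 5)]     # 101 103 (negative indexes drop elements)
numbers > 102            # FALSE FALSE TRUE TRUE TRUE
numbers[numbers > 102]   # 103 104 105 (a logical vector keeps the TRUE positions)

# Same values, different storage type
a <- c(101, 102, 103, 104, 105)
b <- 101:105
all(a == b)              # TRUE
class(a)                 # "numeric"
class(b)                 # "integer"
```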

RStudio: Press Ctrl-Enter to run the current line, or press the Run button.

Files & Functions

Files Pane
1. Browse Drives
2. Choose the "S" Drive
3. Choose your folder

Working Directory

Functions
    read.table( datafile, header=TRUE, sep = ",")

    read.table       Function
    datafile         Positional Argument
    header=TRUE      Named Argument
    sep = ","        Named Argument
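The difference between positional and named arguments can be seen with any function. This sketch uses base R's round(x, digits) instead of read.table so that it runs without a data file.

```r
# Positional: arguments are matched by their order
round(3.14159, 2)                # 3.14

# Named: the second argument is matched by its name
round(3.14159, digits = 2)       # 3.14

# When every argument is named, the order no longer matters
round(digits = 2, x = 3.14159)   # 3.14
```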

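Reading Files

read.csv( ) is read.table( ) with header = TRUE and sep = "," as the defaults. Thus, for a typical CSV file these give the same result:
    titanic <- read.table( datafile, header=TRUE, sep = "," )
    titanic <- read.csv( datafile )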

If you have not set a working directory, use the whole path:
    titanic <- read.csv("S:/name/titanic.csv")
    titanic <- read.csv("S:\\name\\titanic.csv")
If the working directory is set, the file name is enough:
    titanic <- read.csv("titanic.csv")
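To check that the two reading functions agree without needing the S: drive, the sketch below writes a tiny stand-in CSV to a temporary file. The three-column data is made up for illustration and is not the workshop's titanic.csv.

```r
# Write a small stand-in CSV to a temporary file (hypothetical data)
datafile <- tempfile(fileext = ".csv")
writeLines(c("name,age,pclass",
             "Ann,29,1",
             "Bob,4,3"), datafile)

a <- read.table(datafile, header = TRUE, sep = ",")
b <- read.csv(datafile)
isTRUE(all.equal(a, b))   # TRUE for this file

# file.path() builds a path with forward slashes, which R accepts on Windows
file.path("S:", "name", "titanic.csv")   # "S:/name/titanic.csv"
```

In a string, a single backslash starts an escape sequence, which is why the Windows-style path has to be written with doubled backslashes.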

RStudio: Import Dataset

Working with Data

Data Frames
    str(titanic)        # think "structure"
    titanic$pclass

    int / num = Numeric (Interval / Ratio)
    Factor = Categorical (Nominal / Ordinal)

    titanic <- read.csv("titanic.csv", as.is = "name")    # keep name as text, not a Factor

Factors - Categorical Variables

    titanic$pclass.f <- factor( titanic$pclass,
        levels = c(1,2,3),                                    # current values
        labels = c("1st Class", "2nd Class", "3rd Class"),    # labels in the same order
        ordered = TRUE )                                      # ordinal variable

It is convention to add .f; you can also give a new name or rewrite the original.

NA and NULL

Delete Variable
    titanic$embarked <- NULL

Set Values to Missing
    titanic$age[titanic$age == 99] <- NA

The same thing while reading in data (note that na.strings applies to every column, not only age):
    titanic <- read.csv("titanic.csv", na.strings = "99")

Ignore NAs Option
    na.rm = TRUE        # primarily needed for base R functions
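The factor and missing-value steps above can be tried together on a few made-up rows. The data frame toy is a hypothetical stand-in for the workshop data, with 99 used as the missing-value code for age.

```r
# Hypothetical stand-in for a few rows of the workshop data
toy <- data.frame(pclass = c(1, 3, 2, 3),
                  age    = c(29, 99, 4, 40))

# Labelled, ordered factor built from the numeric codes
toy$pclass.f <- factor(toy$pclass,
                       levels  = c(1, 2, 3),
                       labels  = c("1st Class", "2nd Class", "3rd Class"),
                       ordered = TRUE)
toy$pclass.f < "3rd Class"     # TRUE FALSE TRUE FALSE (ordered factors can be compared)

# Recode the missing-value code, then summarize
toy$age[toy$age == 99] <- NA
mean(toy$age)                  # NA, because one value is missing
mean(toy$age, na.rm = TRUE)    # 24.33333
sum(is.na(toy$age))            # 1
```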

Review

Words with Stuff
    word              (Object)
    word[ stuff ]     (Object Part)
    word( stuff )     (Function)
    "word"            (String)

Words that are not Objects
    TRUE or T
    FALSE or F
    NaN     (Not a Number)
    NA      (Not Available)
    NULL    (Empty)
    Inf     (Infinity)

Packages

Packages must be both Installed and Loaded

To Install: install.packages("name")
To Load:    library( name )
            or, require( name )
            or, check the box in the Packages pane

Confirm these are installed:
    dplyr
    tidyr
    descr
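One way to confirm the three packages from the console is sketched below. The names wanted and installed are introduced here for illustration, and the install line is commented out so that nothing is downloaded unintentionally.

```r
wanted <- c("dplyr", "tidyr", "descr")

# TRUE for each package that is already installed in the library
installed <- wanted %in% rownames(installed.packages())
names(installed) <- wanted
installed

# Install only what is missing
# install.packages(wanted[!installed])
```

library( name ) stops with an error when the package is missing, while require( name ) returns FALSE with a warning, so library( ) is usually the safer choice at the top of a script.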

Data Frame Alternatives

dplyr package
    data.frame → tbl_df
    data.table → tbl_dt

History Fact: Hadley Wickham, who created dplyr, works at RStudio.

titanic
    library(dplyr)
    tt <- tbl_df(titanic)
    tt
    str(tt)

Choose Variables

base
    tt$name
    tt[,"name"]
    tt[,c("age","gender")]

dplyr
    select( tt, name)
    select( tt, -name)
    select( tt, age, gender)
    select( tt, gender : pclass)
    select( tt, starts_with("p"))

Also: contains, starts_with, ends_with, matches, distinct

Choose Cases

base
    tt[tt$age < 5 , ]
    titanic[titanic$age < 5 , ]
    attach(titanic)
    titanic[age < 5 , ]
    titanic[age < 5 & !is.na(age) , ]

dplyr
    filter(tt, age < 5 )
    filter(tt, age < 5, pclass.f == "1st Class" )
    filter(tt, age < 5 | pclass == 1 )
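The extra is.na() test in the base examples matters because a comparison with NA is itself NA, and base subsetting then returns a row of NAs. A minimal sketch on made-up data (toy is hypothetical); dplyr's filter() drops such rows on its own.

```r
toy <- data.frame(name = c("Ann", "Bob", "Cy", "Di"),
                  age  = c(29, 4, NA, 2))

toy$age < 5                              # FALSE TRUE NA TRUE

young_raw   <- toy[toy$age < 5, ]                     # 3 rows: Bob, an all-NA row, Di
young_clean <- toy[toy$age < 5 & !is.na(toy$age), ]   # 2 rows: Bob, Di
young_which <- toy[which(toy$age < 5), ]              # which() ignores NA: Bob, Di
young_sub   <- subset(toy, age < 5)                   # subset() also drops NA: Bob, Di

nrow(young_raw)     # 3
nrow(young_clean)   # 2
```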

Change data

base
    tt$child <- tt$age <= 12
    tt$totfam <- tt$sibsp + tt$parch
    tt$bigfam <- as.numeric(tt$totfam > 4)

dplyr
    tt <- mutate(tt, child = age <= 12)
    tt <- mutate(tt, totfam = sibsp + parch,
                     bigfam = as.numeric(totfam > 4) )

Chaining / Piping

    select(tt, name, age)
        vs
    tt %>% select(name, age)

    tt %>% filter(age<5) %>% select(name, age)

Read %>% as "then…"; it works anytime the 1st argument is the dataset.

RStudio: Ctrl+Shift+M inserts %>%

History Fact: %>% is from magrittr; originally %.%
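A short sketch of the same idea on made-up rows, assuming dplyr is installed. The data frame toy and the names nested and chained are hypothetical; the column names follow the workshop data.

```r
library(dplyr)

toy <- data.frame(name  = c("Ann", "Bob", "Cy", "Di"),
                  age   = c(29, 4, 12, 2),
                  sibsp = c(0, 3, 1, 4),
                  parch = c(0, 2, 0, 1))

# Nested calls read inside-out; the chained version reads left to right
nested  <- select(filter(toy, age < 5), name, age)
chained <- toy %>% filter(age < 5) %>% select(name, age)
identical(nested, chained)   # TRUE

# Several steps in one chain: create variables, keep children, keep three columns
kids <- toy %>%
  mutate(child  = age <= 12,
         totfam = sibsp + parch,
         bigfam = as.numeric(totfam > 4)) %>%
  filter(child) %>%
  select(name, totfam, bigfam)
kids   # Bob (5, 1), Cy (1, 0), Di (5, 1)
```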

Hadley Wickham's Packages

library(dplyr)
    select : Choose variables
    filter : Choose cases
    mutate : Change values
    summarize : Aggregate values
    group_by : Create groups
    arrange : Order cases
    bind_rows

library(tidyr)
    spread
    gather
    separate

library(lubridate)
library(stringr)

History Fact: dplyr is an update of plyr for data tables

Descriptive Statistics

Summarize

base
    summary(tt)
    mean(tt$age)                  # NA if age has missing values
    mean(tt$age , na.rm = T )
    sd(tt$age , na.rm = T )

dplyr
    summarize(tt, xbar=mean(age, na.rm=T))
    summarize(tt, n=n(), sd=sd(sibsp))

descr Package
    library(descr)
    freq(tt$pclass)
    freq(tt$pclass.f)
    freq(tt$age)

    CrossTable(tt$pclass, tt$survived)
    CrossTable(tt$pclass, tt$survived,
               prop.t = F,
               prop.c = F,
               prop.r = T,        # T is default for all
               digits = 2 )

Pivot Table
    library(dplyr)
    library(tidyr)
    tt %>%
      group_by(pclass, gender) %>%
      summarize( pct = mean(survived) ) %>%
      spread( gender, pct )
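For comparison, the same kind of table can be sketched in base R with tapply(), which needs no extra packages. The six rows in toy are made up for illustration; this is an alternative to the dplyr and tidyr chain above, not a replacement for it.

```r
toy <- data.frame(pclass   = c(1, 1, 1, 3, 3, 3),
                  gender   = c("female", "male", "male", "female", "female", "male"),
                  survived = c(1, 1, 0, 1, 0, 0))

# Mean of survived for every pclass (rows) by gender (columns) combination
pct <- with(toy, tapply(survived, list(pclass, gender), mean))
pct

pct["1", "female"]   # 1
pct["1", "male"]     # 0.5
pct["3", "female"]   # 0.5
pct["3", "male"]     # 0
```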

ggplot2

    plot( tt$age )
    plot( tt$age, tt$fare )

    library(ggplot2)
    qplot(age, data=tt)
    qplot(age, fare, data=tt)

History Fact: Created by Hadley Wickham, based on the book "Grammar of Graphics" by Leland Wilkinson.

See full documentation at: www.ggplot2.org

Levels of Measurement
    plot( tt$pclass, tt$survived )

    tt$survived.f <- factor(tt$survived,
                            labels = c("Died","Survived") )
    levels(tt$gender) <- c("Males", "Females")    # must be in the same order as the current levels(tt$gender)

    plot( tt$pclass.f, tt$survived.f )

qplot
    qplot(pclass.f, data=tt, fill=survived.f)

    qplot(age, fare, data=tt,
          color=survived.f)

    qplot(age, data=tt,
          fill = survived.f,
          alpha = I(0.3),
          position = "identity")

    qplot( x, y, data=,
           color=, shape=, size=, fill=,
           alpha=,                # transparency
           geom=,                 # type
           method=, formula=,
           facets=,               # matrix
           xlim=, ylim=,          # axis ranges
           xlab=, ylab=,          # axis labels
           main=, sub=            # titles
         )
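qplot() has been deprecated in recent ggplot2 releases, so the same scatterplot idea is sketched below with the full ggplot() syntax. It assumes ggplot2 is installed and uses R's built-in mtcars data as a stand-in for tt, with wt, mpg and am playing the roles of age, fare and survived.f.

```r
library(ggplot2)

# Map variables to x, y and color, then add a point layer
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  labs(x = "Weight", y = "Miles per gallon", color = "am")

p   # printing the object draws the plot
```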

Referring to Variables
    mean( tt$age )
    qplot( age, data=tt )
    select( tt, age )
    attach(titanic)

    names(df) <- c( "var1","var2","var3" )
    names(df)[ 1 ] <- "var1"
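The renaming forms above can be tried on a small made-up data frame. The data frame df and the new names "second" and "third" are hypothetical; with() is shown as a way to use bare variable names without attach().

```r
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)

names(df) <- c("var1", "var2", "var3")      # rename every column
names(df)[2] <- "second"                    # rename by position
names(df)[names(df) == "var3"] <- "third"   # rename by current name
names(df)                                   # "var1" "second" "third"

# with() evaluates an expression using the columns of df
with(df, mean(var1))                        # 1.5
```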

Statistical Analysis

Writing Models

    ~    predicted from
    +    include
    :    interaction
    *    factorial

    Statistical Equation                                                          R Formula
    $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$                                 Y ~ X
    $Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \varepsilon_i$                   Y ~ X + Z
    $Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \beta_3 X_i Z_i + \varepsilon_i$ Y ~ X * Z   or   Y ~ X + Z + X:Z

t.test( fare ~ gender, data = tt )
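The formula operators can be checked on R's built-in mtcars data, used here as a stand-in for tt. The sketch fits the factorial form and the spelled-out form of the same model and confirms that the coefficients match; the object names full and spelled are hypothetical.

```r
# Y ~ X * Z expands to Y ~ X + Z + X:Z
full    <- lm(mpg ~ wt * hp, data = mtcars)
spelled <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

names(coef(full))                       # "(Intercept)" "wt" "hp" "wt:hp"
all.equal(coef(full), coef(spelled))    # TRUE

# The same formula style drives a two-group t test (am has two values)
tt.test <- t.test(mpg ~ am, data = mtcars)
tt.test$p.value
```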

Analysis Objects
    tt.anova <- aov(fare ~ gender*pclass , data=tt )
    summary(tt.anova)

    tt.logistic <- glm( survived ~
                          gender + pclass + gender:pclass    # the same as gender*pclass
                          + age + child ,
                        family = binomial,
                        data = tt )
    summary(tt.logistic)

More with Analysis Objects
    plot(tt.logistic)
    tt.pred <- predict(tt.logistic)
    tt.resid <- residuals(tt.logistic)
    plot(tt.pred, tt.resid)
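What predict() and residuals() return for a logistic model can be explored on the built-in mtcars data, again a stand-in for tt. In this sketch the object names fit, link, prob and res are hypothetical.

```r
# Logistic regression: transmission type (am, coded 0/1) predicted from weight
fit <- glm(am ~ wt, family = binomial, data = mtcars)
class(fit)                                # "glm" "lm"

link <- predict(fit)                      # default scale is the log-odds ("link")
prob <- predict(fit, type = "response")   # probabilities between 0 and 1
all.equal(prob, plogis(link))             # TRUE: plogis() converts log-odds to probabilities

res <- residuals(fit)                     # deviance residuals by default
length(res) == nrow(mtcars)               # TRUE: one residual per case

coef(summary(fit))                        # coefficient table as a matrix
```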

More Capabilities

R Markdown

Analysis Environments

    R Commander                      Deducer
    Separate Interface               Adds to R Interface (not RStudio!)
    More/Better Statistics           Easier Data Management
    www.rcommander.com               www.deducer.org
    install.packages("Rcmdr")        install.packages("Deducer")
    library(Rcmdr)                   library(Deducer)

Deducer Plot Builder

Data Mining GUI
    install.packages("rattle")
    require("rattle")
    rattle()

Next Steps

Finding Packages

Swirl
    install.packages("swirl")
    require("swirl")
    install_from_swirl("Course")
    swirl()
http://swirlstats.com

Tutorials

http://dataservices.gmu.edu/software/r
http://tryr.codeschool.com/
https://www.datacamp.com/

Slides are available at: http://dataservices.gmu.edu/workshops/r [email protected] [email protected]

© 2015 by Debby Kermer, Mason Library Data Services. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/