Introduction to R
by Debby Kermer
George Mason University Library, Data Services Group
http://dataservices.gmu.edu
[email protected]
http://dataservices.gmu.edu/workshop/r

1. Create a folder with your name on the S: Drive
2. Copy titanic.csv from R Workshop Files to that folder


History

R ≈ S ≈ S-Plus
Open Source, Free
www.r-project.org
Comprehensive R Archive Network (CRAN)
Download R from: cran.rstudio.com
RStudio: www.rstudio.com


R Console

See…
    >   prompt for a new command
    +   waiting for the rest of the command
    R "guesses" whether you are done by considering () and symbols
Type…
    ↑ [Up] to get the previous command
    Type in everything shown in the Courier New font:

    3+2
    3-
    2


Objects

    nine <- 9
    nine
    three <- nine / 3
    three
    my.school <- "gmu"
    my.school

Historical Conventions
    Use <- to assign values
    Use . to separate names
Current Capabilities
    = is okay now in most cases
    _ is okay now in most cases

RStudio: Press Alt - (minus) to insert the "assignment operator"


Global Environment

Script Files


Vectors & Lists

    # these three are the same
    numbers <- c(101,102,103,104,105)
    numbers <- 101:105
    numbers <- c(101:104,105)

    numbers[ 2 ]
    numbers[ c(2,4,5) ]         # Variable[ Vector ]
    numbers[ -c(2,4,5) ]
    numbers[ numbers > 102 ]

RStudio: Press Ctrl-Enter to run the current line, or press the Run button.


Files & Functions

Files Pane
1. Browse Drives
2. Choose the "S" Drive
3. Choose your folder

Working Directory


Functions

    read.table( datafile, header=TRUE, sep = "," )

    read.table      Function
    datafile        Positional Argument
    header=TRUE     Named Argument
    sep = ","       Named Argument


Reading Files

Thus, these would be identical:

    titanic <- read.table( datafile, header=TRUE, sep = "," )
    titanic <- read.csv( datafile )

If you have not set a working directory, use the whole path:

    titanic <- read.csv("S:/name/titanic.csv")
    titanic <- read.csv("S:\\name\\titanic.csv")

If the working directory is set, the file name is enough:

    titanic <- read.csv("titanic.csv")

RStudio: Import Dataset


Working with Data

Data Frames

    str(titanic)        # think "structure"
    titanic$pclass

    int / num = Numeric (Interval / Ratio)
    Factor    = Categorical (Nominal / Ordinal)

    titanic <- read.csv("titanic.csv", as.is = "name")


Factors - Categorical Variables

    # it is convention to add .f; you can also give a new name or rewrite the original
    titanic$pclass.f <- factor( titanic$pclass,
        levels = c(1,2,3),                  # current values
        labels = c("1st Class",             # labels in the same order
                   "2nd Class",
                   "3rd Class"),
        ordered = TRUE )                    # ordinal variable


NA and NULL

Delete Variable
    titanic$embarked <- NULL

Set Values to Missing
    titanic$age[titanic$age == 99] <- NA

    # same thing while reading in data:
    titanic <- read.csv("titanic.csv", na.strings = "99")

Ignore NAs Option
    na.rm = TRUE        # primarily needed for base R functions


Review

Words with Stuff
    word              (Object)
    word[ stuff ]     (Object Part)
    word( stuff )     (Function)
    "word"            (String)

Words that are not Objects
    TRUE or T
    FALSE or F
    NaN       (Not a Number)
    NA        (Not Available)
    NULL      (Empty)
    Inf       (Infinity)


Packages

Packages must be both Installed and Loaded

To Install:
    install.packages("name")
To Load:
    library( name )
    or, require( name )
    or, check the box in the Packages pane

Confirm these are installed: dplyr, tidyr, descr, ggplot2


Data Frame Alternatives

    base                  data.frame    ->  dplyr tbl_df
    data.table package    data.table    ->  dplyr tbl_dt

History Fact: Hadley Wickham, who created dplyr, works at RStudio.

    titanic
    library(dplyr)
    tt <- tbl_df(titanic)
    tt
    str(tt)


Choose Variables

base
    tt$name
    tt[,"name"]
    tt[,c("age","gender")]

dplyr
    select( tt, name)
    select( tt, -name)
    select( tt, age, gender)
    select( tt, gender : pclass)
    select( tt, starts_with("p"))

    Also available: contains, starts_with, ends_with, matches, distinct


Choose Cases

base
    tt[tt$age < 5 , ]
    titanic[titanic$age < 5 , ]
    attach(titanic)
    titanic[age < 5 , ]
    titanic[age < 5 & is.na(age) == F , ]

dplyr
    filter(tt, age < 5 )
    filter(tt, age < 5, pclass.f == "1st Class" )
    filter(tt, age < 5 | pclass == 1 )


Change data

base
    tt$child <- tt$age <= 12
    tt$totfam <- tt$sibsp + tt$parch
    tt$bigfam <- as.numeric(tt$totfam > 4)

dplyr
    tt <- mutate(tt, child = age <= 12)
    tt <- mutate(tt, totfam = sibsp + parch,
                     bigfam = as.numeric(totfam > 4) )


Chaining / Piping

    %>%     Read "then…"
    RStudio: Ctrl+Shift+M

History Fact: %>% comes from magrittr; the operator was originally %.%

    select(tt, name, age)
    # vs
    tt %>% select(name, age)

    tt %>% filter(age<5) %>% select(name, age)

This works anytime the 1st argument is the dataset.


Hadley Wickham's Packages

library(dplyr)
    select    : Choose variables
    filter    : Choose cases
    mutate    : Change values
    summarize : Aggregate values
    group_by  : Create groups
    arrange   : Order cases
    bind_rows

library(tidyr)
    spread
    gather
    separate

library(lubridate)
library(stringr)

History Fact: dplyr is an update of plyr for data tables.


Descriptive Statistics

Summarize

base
    summary(tt)
    mean(tt$age)
    mean(tt$age , na.rm = T )
    sd(tt$age , na.rm = T )

dplyr
    summarize(tt, xbar=mean(age, na.rm=T))
    summarize(tt, n=n(), sd=sd(sibsp))


descr Package

    library(descr)
    freq(tt$pclass)
    freq(tt$pclass.f)
    freq(tt$age)

    CrossTable(tt$pclass, tt$survived)
    CrossTable(tt$pclass, tt$survived,
               prop.t = F,
               prop.c = F,
               prop.r = T,      # T is the default for all of the prop.* arguments
               digits = 2 )


Pivot Table

    library(dplyr)
    library(tidyr)
    tt %>%
        group_by(pclass, gender) %>%
        summarize( pct = mean(survived) ) %>%
        spread( gender, pct )


ggplot2

History Fact: Created by Hadley Wickham, based on the book "Grammar of Graphics" by Leland Wilkinson.

base
    plot( tt$age )
    plot( tt$age, tt$fare )

ggplot2
    library(ggplot2)
    qplot(age, data=tt)
    qplot(age, fare, data=tt)

See full documentation at: www.ggplot2.org


Levels of Measurement

    plot( tt$pclass, tt$survived )

    tt$survived.f <- factor(tt$survived, labels = c("Died","Survived") )
    # rename factor levels with levels(), in the order of the existing levels
    levels(tt$gender) <- c("Males", "Females")

    plot( tt$pclass.f, tt$survived.f )

qplot
    qplot(pclass.f, data=tt, fill=survived.f)


qplot

    qplot( x, y, data=,
           color=, shape=, size=, fill=,
           method=, formula=,
           alpha=,              #transparency
           geom=,               #type
           facets=,             #matrix
           xlim=, ylim= ,       #axis ranges
           xlab=, ylab=,        #axis labels
           main=, sub=          #titles
         )

    qplot(age, fare, data=tt, color=survived.f)

    qplot(age, data=tt, fill = survived.f,
          alpha = I(0.3), position = "identity")


Referring to Variables

    mean( tt$age )
    qplot( age, data=tt )
    select( tt, age )
    attach(titanic)

    names(df) <- c( "var1","var2","var3" )
    names(df)[ 1 ] <- "var1"


Statistical Analysis

Writing Models

    ~   predicted from
    +   include
    :   interaction
    *   factorial

    Statistical Equation                                R Formula
    Y_i = β0 + β1 X_i + ε_i                             Y ~ X
    Y_i = β0 + β1 X_i + β2 Z_i + ε_i                    Y ~ X + Z
    Y_i = β0 + β1 X_i + β2 Z_i + β3 X_i Z_i + ε_i       Y ~ X * Z   or   Y ~ X + Z + X:Z

    t.test( fare ~ gender, data = tt )


Analysis Objects

    tt.anova <- aov(fare ~ gender*pclass , data=tt )
    summary(tt.anova)

    # gender*pclass is the same as gender + pclass + gender:pclass
    tt.logistic <- glm( survived ~ gender + pclass + gender:pclass + age + child ,
                        family = binomial, data = tt )
    summary(tt.logistic)


More with Analysis Objects

    plot(tt.logistic)
    tt.pred <- predict(tt.logistic)
    tt.resid <- residuals(tt.logistic)
    plot(tt.pred, tt.resid)


More Capabilities

R Markdown


Analysis Environments

    R Commander                     Deducer
    Separate Interface              Adds to R Interface (not RStudio!)
    More/Better Statistics          Easier Data Management
    www.rcommander.com              www.deducer.org
    install.packages("Rcmdr")       install.packages("Deducer")
    library(Rcmdr)                  library(Deducer)

Deducer Plot Builder


Data Mining GUI

    install.packages("rattle")
    require("rattle")
    rattle()


Next Steps

Finding Packages

Swirl
    install.packages("swirl")
    require("swirl")
    install_from_swirl("Course")
    swirl()
    http://swirlstats.com

Tutorials
    http://dataservices.gmu.edu/software/r
    http://tryr.codeschool.com/
    https://www.datacamp.com/

Slides are available at: http://dataservices.gmu.edu/workshops/r

© 2015 by Debby Kermer, Mason Library Data Services. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/
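
Worked recap: factors and missing values

As a recap of the Factors and NA slides, here is a minimal base-R sketch. It does not use the workshop files: toy is a small made-up data frame standing in for titanic.csv, and the names toy and mean.age are hypothetical. It shows recoding a missing-value code to NA, building an ordered factor, and why na.rm = TRUE matters for base functions.

```r
# Hypothetical toy data standing in for titanic.csv (all values are made up).
toy <- data.frame(
  pclass   = c(1, 3, 2, 3, 1, 2),
  age      = c(29, 99, 4, 35, 99, 18),  # 99 is assumed to be the missing-value code
  survived = c(1, 0, 1, 0, 1, 0)
)

# Set the missing-value code to NA.
toy$age[toy$age == 99] <- NA

# Build an ordered factor; labels are given in the same order as levels.
toy$pclass.f <- factor(toy$pclass,
                       levels  = c(1, 2, 3),
                       labels  = c("1st Class", "2nd Class", "3rd Class"),
                       ordered = TRUE)

# Without na.rm the result is NA, because missing values propagate.
mean(toy$age)

# With na.rm = TRUE the NAs are dropped before averaging.
mean.age <- mean(toy$age, na.rm = TRUE)
mean.age

# Counts per class, shown with the factor labels.
table(toy$pclass.f)
```

The same recoding could be done at read time with na.strings = "99", as on the NA and NULL slide; the assignment form is used here only because the toy data is built in memory.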