read.csv("titanic.csv", as.is = "name") Factors - Categorical Variables

Introduction to R
by Debby Kermer
George Mason University Library Data Services Group
http://dataservices.gmu.edu
[email protected]
http://dataservices.gmu.edu/workshop/r

1. Create a folder with your name on the S: drive
2. Copy titanic.csv from "R Workshop Files" to that folder

History

R ≈ S ≈ S-Plus
Open source, free: www.r-project.org
Comprehensive R Archive Network (CRAN). Download R from: cran.rstudio.com
RStudio: www.rstudio.com

R Console

See…
    >   prompt for a new command
    +   waiting for the rest of a command
R "guesses" whether you are done by considering () and symbols.
Type… ↑ [Up] to get the previous command.
Type in everything shown in the code font (Courier New on the slides, indented here):

    3+2
    3-
    2

Objects

    nine <- 9
    nine
    three <- nine / 3
    three
    my.school <- "gmu"
    my.school

Historical conventions:
    Use <- to assign values
    Use . to separate names
Current capabilities:
    = is okay now in most cases
    _ is okay now in most cases
RStudio: Press Alt and - (minus) to insert the "assignment operator".
RStudio panes: Global Environment, Script Files

Vectors & Lists

    numbers <- c(101,102,103,104,105)

The same:

    numbers <- 101:105
    numbers <- c(101:104,105)

Index a vector variable with [ ]:

    numbers[ 2 ]
    numbers[ c(2,4,5) ]
    numbers[ -c(2,4,5) ]
    numbers[ numbers > 102 ]

RStudio: Press Ctrl-Enter to run the current line, or press the Run button.

Files & Functions

Files pane:
1. Browse drives
2. Choose the "S" drive
3. Choose your folder
Then set that folder as the working directory.

Functions

    read.table( datafile, header=TRUE, sep = "," )

    read.table      function
    datafile        positional argument
    header=TRUE     named argument
    sep = ","       named argument

Reading Files

read.csv() is read.table() with header=TRUE and sep="," as defaults, so for a plain comma-separated file these give the same result:

    titanic <- read.table( datafile, header=TRUE, sep = "," )
    titanic <- read.csv( datafile )

If you have not set a working directory, use the whole path:

    titanic <- read.csv("S:/name/titanic.csv")
    titanic <- read.csv("S:\\name\\titanic.csv")

If you have set it, the file name is enough:

    titanic <- read.csv("titanic.csv")

RStudio: Import Dataset

Working with Data

Data Frames

    str(titanic)        # think "structure"

    int / num = Numeric (Interval / Ratio)
    Factor    = Categorical (Nominal / Ordinal)

    titanic$pclass
    titanic <- read.csv("titanic.csv", as.is = "name")

Factors - Categorical Variables

It is convention to add .f to the name; you can also give a new name or overwrite the original.

    titanic$pclass.f <- factor( titanic$pclass,
        levels  = c(1,2,3),             # current values
        labels  = c("1st Class",        # labels in the
                    "2nd Class",        # same order
                    "3rd Class"),
        ordered = TRUE )                # ordinal variable

NA and NULL

Delete a variable:

    titanic$embarked <- NULL

Set values to missing:

    titanic$age[titanic$age == 99] <- NA

The same thing while reading in the data (note that na.strings applies to every column, not only age):

    titanic <- read.csv("titanic.csv", na.strings = "99")

Ignore NAs: the option na.rm = TRUE is primarily needed for base R functions.

Review

Words with stuff:
    word            (Object)
    word[ stuff ]   (Object Part)
    word( stuff )   (Function)
    "word"          (String)

Words that are not objects:
    TRUE or T
    FALSE or F
    NaN     (Not a Number)
    NA      (Not Available)
    NULL    (Empty)
    Inf     (Infinity)

Packages

Packages must be both installed and loaded.

To install:

    install.packages("name")

To load:

    library( name )
    or, require( name )
    or, check the box in the RStudio Packages pane

Confirm these are installed: dplyr, tidyr, descr, ggplot2

Data Frame Alternatives

    data.frame  ->  tbl_df   (dplyr package)
    data.table  ->  tbl_dt   (data.table package)

History Fact: Hadley Wickham, who created dplyr, works at RStudio.

    titanic
    library(dplyr)
    tt <- tbl_df(titanic)
    tt
    str(tt)

(tbl_df() was deprecated in dplyr 1.0.0; as_tibble() is the current equivalent.)

Choose Variables

base:

    tt$name
    tt[,"name"]
    tt[,c("age","gender")]

dplyr:

    select( tt, name)
    select( tt, -name)
    select( tt, age, gender)
    select( tt, gender : pclass)
    select( tt, starts_with("p"))

Helpers usable inside select(): contains, starts_with, ends_with, matches
Related verb: distinct

Choose Cases

base:

    tt[tt$age < 5 , ]
    titanic[titanic$age < 5 , ]
    attach(titanic)
    titanic[age < 5 , ]
    titanic[age < 5 & !is.na(age) , ]

dplyr:

    filter(tt, age < 5 )
    filter(tt, age < 5, pclass.f == "1st Class" )
    filter(tt, age < 5 | pclass == 1 )

Change data

base:

    tt$child <- tt$age <= 12
    tt$totfam <- tt$sibsp + tt$parch
    tt$bigfam <- as.numeric(tt$totfam > 4)

dplyr:

    tt <- mutate(tt, child = age <= 12)
    tt <- mutate(tt, totfam = sibsp + parch,
                     bigfam = as.numeric(totfam > 4) )

Chaining / Piping

    %>%     Read "then…"
RStudio: Ctrl+Shift+M inserts %>%
History Fact: %>% comes from magrittr; dplyr's original pipe was %.%

    select(tt, name, age)
        vs
    tt %>% select(name, age)

    tt %>% filter(age<5) %>% select(name, age)

This works any time the first argument of the function is the dataset.

Hadley Wickham's Packages

library(dplyr)
    select    : Choose variables
    filter    : Choose cases
    mutate    : Change values
    summarize : Aggregate values
    group_by  : Create groups
    arrange   : Order cases
    bind_rows

library(tidyr)
    spread
    gather
    separate

library(lubridate)
library(stringr)

History Fact: dplyr is an update of plyr for data tables.

Descriptive Statistics

Summarize

base:

    summary(tt)
    mean(tt$age)
    mean(tt$age , na.rm = T )
    sd(tt$age , na.rm = T )

dplyr:

    summarize(tt, xbar=mean(age, na.rm=T))
    summarize(tt, n=n(), sd=sd(sibsp))

descr Package

    library(descr)
    freq(tt$pclass)
    freq(tt$pclass.f)
    freq(tt$age)
    CrossTable(tt$pclass, tt$survived)
    CrossTable(tt$pclass, tt$survived,
               prop.t = F,      # T is the default
               prop.c = F,      # for all of the
               prop.r = T,      # prop.* arguments
               digits = 2 )

Pivot Table

    library(dplyr)
    library(tidyr)
    tt %>%
      group_by(pclass, gender) %>%
      summarize( pct = mean(survived) ) %>%
      spread( gender, pct )

(In tidyr 1.0.0 and later, pivot_wider() supersedes spread().)

ggplot2

base:

    plot( tt$age)
    plot( tt$age, tt$fare )

ggplot2:

    library(ggplot2)
    qplot(age, data=tt)
    qplot(age, fare, data=tt)

History Fact: ggplot2 was created by Hadley Wickham, based on the book "The Grammar of Graphics" by Leland Wilkinson.
See full documentation at: www.ggplot2.org
(qplot() is deprecated as of ggplot2 3.4.0; ggplot() is the general interface.)

Levels of Measurement

    plot( tt$pclass, tt$survived )
    tt$survived.f <- factor(tt$survived, labels = c("Died","Survived") )
    levels(tt$gender) <- c("Males", "Females")   # check levels(tt$gender) first: new names must follow the existing order
    plot( tt$pclass.f, tt$survived.f )

qplot

    qplot(pclass.f, data=tt, fill=survived.f)
    qplot(age, fare, data=tt, color=survived.f)
    qplot(age, data=tt, fill = survived.f, alpha = I(0.3), position = "identity")

Arguments:

    qplot( x, y, data=,
           color=, shape=, size=, fill=,
           method=, formula=,
           alpha=,          # transparency
           geom=,           # type
           facets=,         # matrix
           xlim=, ylim=,    # axis ranges
           xlab=, ylab=,    # axis labels
           main=, sub=      # titles
         )

Referring to Variables

    mean( tt$age )
    qplot( age, data=tt )
    select( tt, age )
    attach(titanic)

    names(df) <- c( "var1","var2","var3" )
    names(df)[ 1 ] <- "var1"

Statistical Analysis

Writing Models

    ~   predicted from
    +   include
    :   interaction
    *   factorial

    Statistical Equation                        R Formula
    Yi = β0 + β1Xi + εi                         Y ~ X
    Yi = β0 + β1Xi + β2Zi + εi                  Y ~ X + Z
    Yi = β0 + β1Xi + β2Zi + β3XiZi + εi         Y ~ X * Z   or   Y ~ X + Z + X:Z

    t.test( fare ~ gender, data = tt )

Analysis Objects

    tt.anova <- aov(fare ~ gender*pclass , data=tt )
    summary(tt.anova)

    tt.logistic <- glm( survived ~
                          gender + pclass + gender:pclass    # the same as gender*pclass
                          + age + child ,
                        family = binomial, data = tt )
    summary(tt.logistic)

More with Analysis Objects

    plot(tt.logistic)
    tt.pred <- predict(tt.logistic)
    tt.resid <- residuals(tt.logistic)
    plot(tt.pred, tt.resid)

More Capabilities

R Markdown

Analysis Environments

R Commander: separate interface; more/better statistics
    www.rcommander.com

    install.packages("Rcmdr")
    library(Rcmdr)

Deducer: adds to the R interface (not RStudio!); easier data management; includes the Deducer Plot Builder
    www.deducer.org

    install.packages("Deducer")
    library(Deducer)

Data Mining GUI

    install.packages("rattle")
    require("rattle")
    rattle()

Next Steps

Finding Packages

Swirl

    install.packages("swirl")
    require("swirl")
    install_from_swirl("Course")
    swirl()

http://swirlstats.com

Tutorials
    http://dataservices.gmu.edu/software/r
    http://tryr.codeschool.com/
    https://www.datacamp.com/

Slides are available at: http://dataservices.gmu.edu/workshops/r
[email protected] [email protected]

© 2015 by Debby Kermer, Mason Library Data Services. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-nc-sa/4.0/
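Worked example: factors and missing values. The sketch below replays the "Factors" and "NA and NULL" steps on a tiny made-up data frame, so it runs without titanic.csv. The four rows and the name mean.age are assumptions for illustration only; the column names follow the slides.

```r
# Hypothetical stand-in for a few rows of titanic.csv (values are made up)
titanic <- data.frame(
  name   = c("A", "B", "C", "D"),
  pclass = c(1, 3, 2, 3),
  age    = c(29, 99, 4, 40),     # 99 is used here as a "missing" code
  stringsAsFactors = FALSE
)

# Set the placeholder value to missing
titanic$age[titanic$age == 99] <- NA

# Build an ordered factor; labels are given in the same order as levels
titanic$pclass.f <- factor(titanic$pclass,
                           levels  = c(1, 2, 3),
                           labels  = c("1st Class", "2nd Class", "3rd Class"),
                           ordered = TRUE)

# Base R functions return NA unless told to drop missing values
mean(titanic$age)                            # NA
mean.age <- mean(titanic$age, na.rm = TRUE)  # mean of 29, 4 and 40

# An ordered factor supports comparisons between levels
below.first <- titanic$pclass.f > "1st Class"

str(titanic)
table(titanic$pclass.f)
```

Because pclass.f is ordered, the comparison in below.first is meaningful; with an unordered factor R would return NA with a warning instead.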

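Worked example: formula notation. The "Writing Models" slide states that Y ~ X * Z is shorthand for Y ~ X + Z + X:Z. A minimal way to check this is to fit both forms and compare the coefficients. The data below are simulated, and the names d, fit.star and fit.colon are hypothetical; lm() stands in for the aov() and glm() calls used on the slides.

```r
# Simulated data (not from the workshop): y depends on x, z and their interaction
set.seed(1)
d <- data.frame(x = rnorm(50), z = rnorm(50))
d$y <- 1 + 2 * d$x - d$z + 0.5 * d$x * d$z + rnorm(50, sd = 0.1)

# Factorial shorthand
fit.star <- lm(y ~ x * z, data = d)

# The same model written out term by term
fit.colon <- lm(y ~ x + z + x:z, data = d)

# Both fits contain an intercept, two main effects and one interaction
names(coef(fit.star))
same.model <- isTRUE(all.equal(coef(fit.star), coef(fit.colon)))
same.model
```

The same expansion applies inside aov() and glm(), since all three functions read the formula in the same way.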