Reading In, Reordering, and Merging Data; Changing the Shape of Data
Statistical Computing

Last week: Simulation

• Simulation is an integral part of being a statistician in the 21st century
• R provides us with utility functions for simulations from a wide variety of distributions
• To make your simulation results reproducible, you must set the seed, using set.seed()
• There is a natural connection between iteration, functions, and simulations
• Saving and loading results can be done in two formats: rds and rdata

Part I

Reading in and reordering data

Reading in data from the outside

All along, we’ve already been reading in data from the outside, using:
• readLines(): reads in lines of text from a file or webpage; returns a vector of strings (see the short sketch below)
• read.table(): reads in a data table from a file or webpage; returns a data frame
• read.csv(): like the above; returns a data frame
This week we’ll focus on read.table(), read.csv() and their counterparts write.table(), write.csv(), respectively
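As a quick illustration of readLines() (just a sketch, reusing the strikes .csv file that appears later in these notes), the lines come back as plain strings, not parsed into columns:

strike.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp/data/strikes.csv")
head(strike.lines, 3) # First few lines, as raw strings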

Reading in data from a previous R session

Reminder that we’ve seen two ways to read/write data in specialized R formats:
• readRDS(), saveRDS(): functions for reading/writing single R objects from/to a file
• load(), save(): functions for reading/writing any number of R objects from/to a file
Advantage: these can be a lot more memory efficient than what we’ll cover this week. Disadvantage: they’re limited in scope in the sense that they can only communicate with R. A small sketch of both idioms follows below.
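A minimal sketch of both idioms; the object and file names here are made up for illustration:

x.vec = rnorm(100)
y.mat = matrix(rnorm(20), 5, 4)

# Single object: saveRDS() writes it out, readRDS() returns it (you choose the name when reading back)
saveRDS(x.vec, file="x.vec.rds")
x.vec.copy = readRDS("x.vec.rds")

# Several objects: save() stores them by name, load() recreates those names in your workspace
save(x.vec, y.mat, file="objects.rdata")
load("objects.rdata")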

read.table(), read.csv()

Have a table full of data, just not in the R file format? Then read.table() is the function for you. It works as in:
• read.table(file=file.name, sep=" "), to read data from a local file on your computer called file.name, assuming (say) space separated data
• read.table(file=webpage.link, sep="\t"), to read data from a webpage up at webpage.link, assuming (say) tab separated data

The function read.csv() is just a shortcut for using read.table() with sep=",". Both read.csv() and read.table() can read in a .csv file. (But note: these two actually differ on some of their default inputs!)

Examples of reading in data

# This data table is comma separated, so we can use read.csv()
strike.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp/data/strikes.csv")
head(strike.df)

## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA

sapply(strike.df, class)

## country year strike.volume unemployment
## "factor" "integer" "integer" "numeric"
## inflation left.parliament centralization density
## "numeric" "numeric" "numeric" "numeric"

# This data table is tab separated, so let's specify sep="\t"
anss.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/anss.dat",
  sep="\t")
head(anss.df)

## V1 V2 V3 V4 V5 V6
## 1 Date Time Lat Lon Depth Mag
## 2 2002/01/01 10:39:06.82 -55.2140 -129.0000 10.00 6.00
## 3 2002/01/01 11:29:22.73 6.3030 125.6500 138.10 6.30
## 4 2002/01/02 14:50:33.49 -17.9830 178.7440 665.80 6.20
## 5 2002/01/02 17:22:48.76 -17.6000 167.8560 21.00 7.20
## 6 2002/01/03 07:05:27.67 36.0880 70.6870 129.30 6.20

sapply(anss.df, class)

## V1 V2 V3 V4 V5 V6
## "factor" "factor" "factor" "factor" "factor" "factor"

# Oops! It comes with column names, so let's set header=TRUE
anss.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/anss.dat",
  sep="\t", header=TRUE)
head(anss.df)

## Date Time Lat Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000 10.0 6.0
## 2 2002/01/01 11:29:22.73 6.303 125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983 178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600 167.856 21.0 7.2
## 5 2002/01/03 07:05:27.67 36.088 70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664 168.004 10.0 6.6

sapply(anss.df, class)

## Date Time Lat Lon Depth Mag
## "factor" "factor" "numeric" "numeric" "numeric" "numeric"

Helpful input arguments

The following inputs apply to either read.table() or read.csv() (though these two functions actually have different default inputs in general—e.g., header defaults to FALSE in read.table() but TRUE in read.csv()):
• header: boolean, TRUE if the first line should be interpreted as column names
• sep: string, specifies what separates the entries; the empty string "" is interpreted to mean any white space
• quote: string, specifies what set of characters signify the beginning and end of quotes; the empty string "" disables quotes altogether
Other helpful inputs: skip, row.names, col.names. You can read about them in the help file for read.table(); a small sketch of skip and col.names follows below.
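As a hedged sketch of skip and col.names, suppose we had a hypothetical tab-separated file my.data.txt (not one of the course data sets) with two lines of commentary at the top and no header row:

# Skip the first 2 lines, and supply our own column names
my.df = read.table(file="my.data.txt", sep="\t", skip=2, header=FALSE,
                   col.names=c("id","height","weight"))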

Another example of reading in data

# This data table is tab separated, and it comes with column names
sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.m.dat",
  sep="\t", header=TRUE)

## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 981 did not have 8 elements

# Oops! It turns out it has some apostrophe marks in the City column,
# so we have to set quote=""
sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.m.dat",
  sep="\t", header=TRUE, quote="")
head(sprint.m.df)

## Rank Time Wind Name Country Birthdate City Date
## 1 1 9.58 0.9 Usain Bolt JAM 21.08.86 Berlin 16.08.2009
## 2 2 9.63 1.5 Usain Bolt JAM 21.08.86 London 05.08.2012
## 3 3 9.69 0.0 Usain Bolt JAM 21.08.86 Beijing 16.08.2008
## 4 3 9.69 2.0 Tyson Gay USA 09.08.82 Shanghai 20.09.2009
## 5 3 9.69 -0.1 Yohan Blake JAM 26.12.89 Lausanne 23.08.2012
## 6 6 9.71 0.9 Tyson Gay USA 09.08.82 Berlin 16.08.2009

unique(grep("'", sprint.m.df$City, value=TRUE)) # This is the troublemaker

## [1] "Villeneuve d'Ascq"

write.table(), write.csv()

To write a data frame (or matrix) to a text file, use write.table() or write.csv(). These are the counterparts to read.table() and read.csv(), and they work as in:
• write.table(my.df, file="my.df.txt", sep=" ", quote=FALSE), to write my.df to the text file "my.df.txt" (to be created in your working directory), with (say) space separation, and no quotes around the entries
• write.csv(my.df, file="my.df.csv", quote=FALSE), to write my.df to the text file "my.df.csv" (to be created in your working directory), with comma separation, and no quotes
Note that quote=FALSE, signifying that no quotes should be put around the printed data entries, seems always preferable (the default is quote=TRUE). Also, setting row.names=FALSE and col.names=FALSE will disable the printing of row and column names (the defaults are row.names=TRUE and col.names=TRUE). A small sketch follows below.
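This sketch writes out the strikes data frame read in earlier; the output file names are arbitrary choices:

# Write the strikes data back out, space separated and comma separated, without quotes or row names
write.table(strike.df, file="strikes.txt", sep=" ", quote=FALSE, row.names=FALSE)
write.csv(strike.df, file="strikes2.csv", quote=FALSE, row.names=FALSE)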

read_excel

• Use library(readxl)
• data <- read_excel("filename.xlsx"); variable names are read in from the first row by default (see the sketch below)
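A minimal sketch, assuming an Excel file called my.data.xlsx sits in your working directory (readxl is not part of base R, so it needs to be installed once with install.packages("readxl")):

library(readxl)
my.df = read_excel("my.data.xlsx")      # Reads the first sheet; first row supplies column names
# read_excel("my.data.xlsx", sheet = 2) # A specific sheet can be requested with the sheet argument
head(my.df)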

Reordering data

Sometimes it’s convenient to reorder our data, say the rows of our data frame (or matrix). Recall:
• The function order() takes in a vector, and returns the vector of indices that put this vector in increasing order
• Set the input decreasing=TRUE in order() to get decreasing order
• We can compute an appropriate vector of indices, and then use it on the rows of our data frame to reorder all of the columns simultaneously

Examples of reordering

# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.m.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.m.df.slow = sprint.m.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.m.df.slow)

## Rank Time Wind Name Country Birthdate City
## 2682 2691 10.09 1.8 Mel Lattany USA 10.08.59 Colorado Springs
## 2683 2691 10.09 -0.9 Mel Lattany USA 10.08.59 Zürich
## 2684 2691 10.09 1.3 Carl Lewis USA 01.07.61 Walnut
## 2685 2691 10.09 0.6 Calvin Smith USA 08.01.61 Athens
## 2686 2691 10.09 -1.7 Carl Lewis USA 01.07.61 Indianapolis
## 2687 2691 10.09 -0.9 Calvin Smith USA 08.01.61 Zürich
## Date
## 2682 30.07.1978
## 2683 19.08.1981
## 2684 25.04.1982
## 2685 14.05.1982
## 2686 02.07.1982
## 2687 18.08.1982

# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.m.df$Name) # By sprinter name
sprint.m.df.name = sprint.m.df[i.name,] # Reorder rows by name
head(sprint.m.df.name)

## Rank Time Wind Name Country Birthdate City
## 1373 1281 10.03 0.5 Aaron Armstrong TTO 14.10.77 Port of Spain
## 1552 1456 10.04 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 2326 2145 10.07 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 568 491 9.96 2.0 Aaron Brown CAN 27.05.92 Montverde
## 1114 1011 10.01 1.8 Aaron Brown CAN 27.05.92 Montverde
## 1836 1683 10.05 1.9 Aaron Brown CAN 27.05.92 Eugene
## Date
## 1373 20.06.2009
## 1552 25.06.2005
## 2326 13.08.2011
## 568 11.06.2016
## 1114 11.06.2016
## 1836 05.06.2013

Part II

Merging data

Merging data frames

Suppose you have two data frames X, Y, and you want to combine them:
• Simplest case: the data frames have exactly the same number of rows, the rows represent exactly the same units, and you want all columns from both; just use data.frame(X,Y) (see the toy sketch below)
• Next best case: you know that the two data frames have the same rows, but you only want certain columns from each; just use, e.g., data.frame(X$col1,X$col5,Y$col3)
• Next best case: same number of rows but in different order; put one of them in the same order as the other, with order(). Alternatively, use merge()
• Worse cases: different numbers of rows ... hard to line up rows ...
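A toy sketch of the two simplest cases; the data frames X and Y here are made up for illustration:

# Two small data frames describing the same 3 units, in the same row order
X = data.frame(id=c("a","b","c"), height=c(1.7, 1.8, 1.6))
Y = data.frame(weight=c(60, 75, 55), age=c(30, 41, 25))

data.frame(X, Y)               # All columns from both
data.frame(id=X$id, age=Y$age) # Only the columns we want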

Example: big city drivers

People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true? Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:

fha = read.csv(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/fha.csv",
  colClasses=c("character","double","double","double"),
  na.strings="NA")
nrow(fha)

## [1] 498

head(fha)

## City Population Miles.of.Road
## 1 New York--Newark, NY--NJ--CT 18351295 43893
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877
## 3 Chicago, IL--IN 8608208 25905
## 4 Miami, FL 5502379 15641
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610
## Daily.Miles.Traveled
## 1 286101
## 2 270807
## 3 172708
## 4 125899
## 5 99190
## 6 125389

Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:

ua = read.csv(file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/ua.dat", sep=";")
nrow(ua)

## [1] 3598

head(ua)

## UACE NAME POP HU AREALAND AREALANDSQMI AREAWATER
## 1 37 Abbeville, LA 19824 8460 29222871 11.28 300497
## 2 64 Abbeville, SC 5243 2578 11315197 4.37 19786
## 3 91 Abbotsford, WI 3966 1616 5363441 2.07 13221
## 4 118 Aberdeen, MS 4666 2050 7416616 2.86 52732
## 5 145 Aberdeen, SD 25977 12114 33002447 12.74 247597
## 6 172 Aberdeen, WA 29856 13139 39997951 15.44 1929689
## AREAWATERSQMI POPDEN LSADC
## 1 0.12 1757.0 76
## 2 0.01 1200.1 76
## 3 0.01 1915.2 76
## 4 0.02 1629.4 76
## 5 0.10 2038.6 76
## 6 0.75 1933.3 76

Difficulties in navigating the merge

• ≈ 500 cities versus ≈ 3600 “urbanized areas”
• fha orders cities by population, ua is alphabetical by name
• Both have place names, but those don’t always agree
• Not even common names for the shared columns

Lesson: find the unique identifier

But both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:

length(unique(fha$Population)) == nrow(fha)

## [1] TRUE

ua.pop.top498 = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)]
max(abs(fha$Population - ua.pop.top498))

## [1] 0

First way to merge

Reorder the area column in the second table by population, then append it to the first table:

ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)

## [1] 498

head(df1)

## City Population Roads Mileage Area
## 1 New York--Newark, NY--NJ--CT 18351295 43893 286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877 270807 1736.02
## 3 Chicago, IL--IN 8608208 25905 172708 2442.75
## 4 Miami, FL 5502379 15641 125899 1238.61
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867 99190 1981.37
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610 125389 1779.13

Second way to merge

Use the merge() function:

df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP")
nrow(df2)

## [1] 498

tail(df2, 3)

## Population City Miles.of.Road
## 496 8608208 Chicago, IL--IN 25905
## 497 12150996 Los Angeles--Long Beach--Anaheim, CA 24877
## 498 18351295 New York--Newark, NY--NJ--CT 43893
## Daily.Miles.Traveled UACE NAME
## 496 172708 16264 Chicago, IL--IN
## 497 270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498 286101 63217 New York--Newark, NY--NJ--CT
## HU AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332 2442.75 105649916 40.79 3524.0 75
## 497 4217448 4496266014 1736.02 61141327 23.61 6999.3 75
## 498 7263095 8935981360 3450.20 533176599 205.86 5318.9 75

merge()

The merge() function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol"), to join the two data frames x, y, by matching the columns "SomeXCol" and "SomeYCol"
• The default (no by.x, by.y specified) is to match on all columns with common names (see the small sketch below)
• The output will be a new data frame that has all the columns of both data frames
• If you know databases, then merge() is doing a JOIN
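A small sketch of the default behavior, with made-up toy data frames; merge() matches rows on the shared column name "id", regardless of row order:

left = data.frame(id=c("a","b","c"), height=c(1.7, 1.8, 1.6))
right = data.frame(id=c("c","a","b"), weight=c(55, 60, 75))
merge(left, right) # No by.x, by.y given: matched on the common column "id"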

Using order() and manual tricks versus merge()

• Reordering is easier to grasp; merge() takes some learning
• Reordering is simplest when there’s only one column to merge on; merge() handles many columns
• Reordering is simplest when the data frames are the same size; merge() handles different sizes automatically

So, do bigger cities mean more driving?

# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
     xlab="City area in square miles",
     ylab="Miles driven per person per day")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")

[Figure: scatterplot of city area in square miles (log-scale horizontal axis) against miles driven per person per day (vertical axis), with a nearly flat red regression line.]

More about merge() function

• merge(..., all, all.x, all.y)
• all.x: logical; if TRUE, then extra rows are added to the output, one for each row in x that has no matching row in y. These rows get NAs in the columns that would usually be filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output
• all.y: logical; analogous to all.x
• all: logical; if TRUE, then all.x=all.y=TRUE; if FALSE, then all.x=all.y=FALSE
A small sketch of all.x follows below.
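A small sketch of all.x, again with made-up toy data frames: the unmatched row of left is kept, with an NA filling the column that would have come from right:

left = data.frame(id=c("a","b","c"), height=c(1.7, 1.8, 1.6))
right = data.frame(id=c("a","b"), weight=c(60, 75))
merge(left, right)             # Inner join: the row with id "c" is dropped
merge(left, right, all.x=TRUE) # The row with id "c" is kept, with weight = NA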

Converting data from wide form to long form: collapsing the data frame

library(reshape2)
stocks = data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10,0,1),
  Y = rnorm(10,0,2),
  Z = rnorm(10,0,4)
)
stocks

## time X Y Z
## 1 2009-01-01 -1.06406981 -2.3648384 5.7489691
## 2 2009-01-02 -1.26495502 1.9489011 -1.7535504
## 3 2009-01-03 -0.28043553 2.7983741 0.2840863
## 4 2009-01-04 0.06302827 1.3669634 -7.7664304
## 5 2009-01-05 1.47793978 -3.0911516 -1.6267468
## 6 2009-01-06 -1.21225693 -0.2473569 8.2811592
## 7 2009-01-07 -1.06717890 0.7619787 1.2049675
## 8 2009-01-08 1.42640535 -0.1728733 -4.1799281
## 9 2009-01-09 0.95018420 -1.2599393 -1.3405610
## 10 2009-01-10 -0.77522305 2.4316198 3.8532028

stocksm = melt(stocks, id.vars = "time",
  variable.name = "stock", value.name = "price")
stocksm

## time stock price
## 1 2009-01-01 X -1.06406981
## 2 2009-01-02 X -1.26495502
## 3 2009-01-03 X -0.28043553
## 4 2009-01-04 X 0.06302827
## 5 2009-01-05 X 1.47793978
## 6 2009-01-06 X -1.21225693
## 7 2009-01-07 X -1.06717890
## 8 2009-01-08 X 1.42640535
## 9 2009-01-09 X 0.95018420
## 10 2009-01-10 X -0.77522305
## 11 2009-01-01 Y -2.36483838
## 12 2009-01-02 Y 1.94890111
## 13 2009-01-03 Y 2.79837413
## 14 2009-01-04 Y 1.36696336
## 15 2009-01-05 Y -3.09115155
## 16 2009-01-06 Y -0.24735694
## 17 2009-01-07 Y 0.76197870
## 18 2009-01-08 Y -0.17287326
## 19 2009-01-09 Y -1.25993929
## 20 2009-01-10 Y 2.43161979
## 21 2009-01-01 Z 5.74896907
## 22 2009-01-02 Z -1.75355043
## 23 2009-01-03 Z 0.28408625
## 24 2009-01-04 Z -7.76643040
## 25 2009-01-05 Z -1.62674678
## 26 2009-01-06 Z 8.28115919
## 27 2009-01-07 Z 1.20496746
## 28 2009-01-08 Z -4.17992812
## 29 2009-01-09 Z -1.34056099
## 30 2009-01-10 Z 3.85320284

stocks2 = data.frame(
  time = as.Date('2009-01-01') + 0:9,
  day = seq(1,10,1),
  X = rnorm(10,0,1),
  Y = rnorm(10,0,2),
  Z = rnorm(10,0,4)
)
stocksm2 = melt(stocks2, id.vars = c("time","day"),
  variable.name = "stock", value.name = "price")
stocksm2

## time day stock price
## 1 2009-01-01 1 X 1.17763333
## 2 2009-01-02 2 X -0.06978673
## 3 2009-01-03 3 X 0.68714193
## 4 2009-01-04 4 X -2.92975470
## 5 2009-01-05 5 X -0.81975142
## 6 2009-01-06 6 X -0.33862823
## 7 2009-01-07 7 X 0.58941584
## 8 2009-01-08 8 X 0.56775532
## 9 2009-01-09 9 X -0.92030090
## 10 2009-01-10 10 X -0.31174368
## 11 2009-01-01 1 Y 1.20567598
## 12 2009-01-02 2 Y 0.15188723
## 13 2009-01-03 3 Y 2.36377468
## 14 2009-01-04 4 Y -1.77815145
## 15 2009-01-05 5 Y 1.85695473
## 16 2009-01-06 6 Y -5.28711066
## 17 2009-01-07 7 Y 3.40392580
## 18 2009-01-08 8 Y 1.28097709
## 19 2009-01-09 9 Y -1.25175838
## 20 2009-01-10 10 Y 1.19347833
## 21 2009-01-01 1 Z 1.92106120
## 22 2009-01-02 2 Z 3.45684224
## 23 2009-01-03 3 Z -2.23146341
## 24 2009-01-04 4 Z -0.37878616
## 25 2009-01-05 5 Z -1.47832870
## 26 2009-01-06 6 Z -4.30045095
## 27 2009-01-07 7 Z -1.51357751
## 28 2009-01-08 8 Z -2.22612885
## 29 2009-01-09 9 Z -6.01073307
## 30 2009-01-10 10 Z -2.07128595

Converting data from long form to wide form

stocksmd1 = dcast(stocksm, time ~ stock, value.var = "price")
stocksmd1

## time X Y Z
## 1 2009-01-01 -1.06406981 -2.3648384 5.7489691
## 2 2009-01-02 -1.26495502 1.9489011 -1.7535504
## 3 2009-01-03 -0.28043553 2.7983741 0.2840863
## 4 2009-01-04 0.06302827 1.3669634 -7.7664304
## 5 2009-01-05 1.47793978 -3.0911516 -1.6267468
## 6 2009-01-06 -1.21225693 -0.2473569 8.2811592
## 7 2009-01-07 -1.06717890 0.7619787 1.2049675
## 8 2009-01-08 1.42640535 -0.1728733 -4.1799281
## 9 2009-01-09 0.95018420 -1.2599393 -1.3405610
## 10 2009-01-10 -0.77522305 2.4316198 3.8532028

stocksmd2 = dcast(stocksm, stock ~ time, value.var = "price")
stocksmd2

## stock 2009-01-01 2009-01-02 2009-01-03 2009-01-04 2009-01-05 2009-01-06
## 1 X -1.064070 -1.264955 -0.2804355 0.06302827 1.477940 -1.2122569
## 2 Y -2.364838 1.948901 2.7983741 1.36696336 -3.091152 -0.2473569
## 3 Z 5.748969 -1.753550 0.2840863 -7.76643040 -1.626747 8.2811592
## 2009-01-07 2009-01-08 2009-01-09 2009-01-10
## 1 -1.0671789 1.4264053 0.9501842 -0.7752231
## 2 0.7619787 -0.1728733 -1.2599393 2.4316198
## 3 1.2049675 -4.1799281 -1.3405610 3.8532028

What if there are two rows of stock prices for the same time?

stocksm3 = rbind(stocksm,
  list(time = as.Date('2009-01-01'), stock = "X", price = 0.2))
stocksmd3 = dcast(stocksm3, time ~ stock, value.var = "price",
  fun.aggregate = function(x) { mean(x) }) # Take the average
stocksmd3

## time X Y Z
## 1 2009-01-01 -0.43203491 -2.3648384 5.7489691
## 2 2009-01-02 -1.26495502 1.9489011 -1.7535504
## 3 2009-01-03 -0.28043553 2.7983741 0.2840863
## 4 2009-01-04 0.06302827 1.3669634 -7.7664304
## 5 2009-01-05 1.47793978 -3.0911516 -1.6267468
## 6 2009-01-06 -1.21225693 -0.2473569 8.2811592
## 7 2009-01-07 -1.06717890 0.7619787 1.2049675
## 8 2009-01-08 1.42640535 -0.1728733 -4.1799281
## 9 2009-01-09 0.95018420 -1.2599393 -1.3405610
## 10 2009-01-10 -0.77522305 2.4316198 3.8532028

Summary

• Read in data from a previous R session with readRDS(), load()
• Read in data from the outside with read.table(), read.csv()
• It can sometimes be tricky to get the arguments right in read.table(), read.csv()
• It sometimes helps to take a look at the original data files to see their structure
• For reordering data, use order(), rev(), and proper indexing
• For merging data, use merge(); but you can also do it manually using reordering tricks
• Use the melt() function to transform wide format to long format
• Use the dcast() function to transform long format to wide format
