Reading In, Reordering, and Merging Data; Changing the Shape of Data
Statistical Computing

Last week: Simulation

• Simulation is an integral part of being a statistician in the 21st century
• R provides us with utility functions for simulations from a wide variety of distributions
• To make your simulation results reproducible, you must set the seed, using set.seed()
• There is a natural connection between iteration, functions, and simulations
• Saving and loading results can be done in two formats: rds and rdata

Part I

Reading in and reordering data

Reading in data from the outside

All along, we’ve already been reading in data from the outside, using:
• readLines(): reads in lines of text from a file or webpage; returns a vector of strings (see the short sketch below)
• read.table(): reads in a data table from a file or webpage; returns a data frame
• read.csv(): like the above; returns a data frame
This week we’ll focus on read.table(), read.csv() and their counterparts write.table(), write.csv(), respectively
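As a quick illustration of readLines() (just a sketch, reusing the strikes .csv file that appears later in these notes), the lines come back as plain strings, not parsed into columns:

strike.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp/data/strikes.csv")
head(strike.lines, 3) # First few lines, as raw strings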

Reading in data from a previous R session

Reminder that we’ve seen two ways to read/write data in specialized R formats:
• readRDS(), saveRDS(): functions for reading/writing single R objects from/to a file
• load(), save(): functions for reading/writing any number of R objects from/to a file
Advantage: these can be a lot more memory efficient than what we’ll cover this week. Disadvantage: they’re limited in scope in the sense that they can only communicate with R. A small sketch of both idioms follows below.
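A minimal sketch of both idioms; the object and file names here are made up for illustration:

x.vec = rnorm(100)
y.mat = matrix(rnorm(20), 5, 4)

# Single object: saveRDS() writes it out, readRDS() returns it (you choose the name when reading back)
saveRDS(x.vec, file="x.vec.rds")
x.vec.copy = readRDS("x.vec.rds")

# Several objects: save() stores them by name, load() recreates those names in your workspace
save(x.vec, y.mat, file="objects.rdata")
load("objects.rdata")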

read.table(), read.csv()

Have a table full of data, just not in the R file format? Then read.table() is the function for you. It works as in:
• read.table(file=file.name, sep=" "), to read data from a local file on your computer called file.name, assuming (say) space separated data
• read.table(file=webpage.link, sep="\t"), to read data from a webpage up at webpage.link, assuming (say) tab separated data

The function read.csv() is just a shortcut for using read.table() with sep=",". Both read.csv() and read.table() can read in a .csv file. (But note: these two actually differ on some of their default inputs!)

Examples of reading in data

# This data table is comma separated, so we can use read.csv()
strike.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp/data/strikes.csv")
head(strike.df)

## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA

sapply(strike.df, class)

## country year strike.volume unemployment
## "factor" "integer" "integer" "numeric"
## inflation left.parliament centralization density
## "numeric" "numeric" "numeric" "numeric"

# This data table is tab separated, so let's specify sep="\t"
anss.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/anss.dat",
  sep="\t")
head(anss.df)

## V1 V2 V3 V4 V5 V6
## 1 Date Time Lat Lon Depth Mag
## 2 2002/01/01 10:39:06.82 -55.2140 -129.0000 10.00 6.00
## 3 2002/01/01 11:29:22.73 6.3030 125.6500 138.10 6.30
## 4 2002/01/02 14:50:33.49 -17.9830 178.7440 665.80 6.20
## 5 2002/01/02 17:22:48.76 -17.6000 167.8560 21.00 7.20
## 6 2002/01/03 07:05:27.67 36.0880 70.6870 129.30 6.20

sapply(anss.df, class)

## V1 V2 V3 V4 V5 V6
## "factor" "factor" "factor" "factor" "factor" "factor"

# Oops! It comes with column names, so let's set header=TRUE
anss.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/anss.dat",
  sep="\t", header=TRUE)
head(anss.df)

## Date Time Lat Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000 10.0 6.0
## 2 2002/01/01 11:29:22.73 6.303 125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983 178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600 167.856 21.0 7.2
## 5 2002/01/03 07:05:27.67 36.088 70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664 168.004 10.0 6.6

sapply(anss.df, class)

## Date Time Lat Lon Depth Mag
## "factor" "factor" "numeric" "numeric" "numeric" "numeric"

Helpful input arguments

The following inputs apply to either read.table() or read.csv() (though these two functions actually have different default inputs in general—e.g., header defaults to FALSE in read.table() but TRUE in read.csv()):
• header: boolean, TRUE if the first line should be interpreted as column names
• sep: string, specifies what separates the entries; the empty string "" is interpreted to mean any white space
• quote: string, specifies what set of characters signify the beginning and end of quotes; the empty string "" disables quotes altogether
Other helpful inputs: skip, row.names, col.names. You can read about them in the help file for read.table(); a small sketch of skip and col.names follows below.
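As a hedged sketch of skip and col.names, suppose we had a hypothetical tab-separated file my.data.txt (not one of the course data sets) with two lines of commentary at the top and no header row:

# Skip the first 2 lines, and supply our own column names
my.df = read.table(file="my.data.txt", sep="\t", skip=2, header=FALSE,
                   col.names=c("id","height","weight"))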

Another example of reading in data

# This data table is tab separated, and it comes with column names
sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.m.dat",
  sep="\t", header=TRUE)

## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 981 did not have 8 elements

# Oops! It turns out it has some apostrophe marks in the City column,
# so we have to set quote=""
sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.m.dat",
  sep="\t", header=TRUE, quote="")
head(sprint.m.df)

## Rank Time Wind Name Country Birthdate City Date
## 1 1 9.58 0.9 Usain Bolt JAM 21.08.86 Berlin 16.08.2009
## 2 2 9.63 1.5 Usain Bolt JAM 21.08.86 London 05.08.2012
## 3 3 9.69 0.0 Usain Bolt JAM 21.08.86 Beijing 16.08.2008
## 4 3 9.69 2.0 Tyson Gay USA 09.08.82 Shanghai 20.09.2009
## 5 3 9.69 -0.1 Yohan Blake JAM 26.12.89 Lausanne 23.08.2012
## 6 6 9.71 0.9 Tyson Gay USA 09.08.82 Berlin 16.08.2009

unique(grep("'", sprint.m.df$City, value=TRUE)) # This is the troublemaker

## [1] "Villeneuve d'Ascq"

write.table(), write.csv()

To write a data frame (or matrix) to a text file, use write.table() or write.csv(). These are the counterparts to read.table() and read.csv(), and they work as in:
• write.table(my.df, file="my.df.txt", sep=" ", quote=FALSE), to write my.df to the text file "my.df.txt" (to be created in your working directory), with (say) space separation, and no quotes around the entries
• write.csv(my.df, file="my.df.csv", quote=FALSE), to write my.df to the text file "my.df.csv" (to be created in your working directory), with comma separation, and no quotes
Note that quote=FALSE, signifying that no quotes should be put around the printed data entries, seems always preferable (the default is quote=TRUE). Also, setting row.names=FALSE and col.names=FALSE will disable the printing of row and column names (the defaults are row.names=TRUE and col.names=TRUE). A small sketch follows below.
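This sketch writes out the strikes data frame read in earlier; the output file names are arbitrary choices:

# Write the strikes data back out, space separated and comma separated, without quotes or row names
write.table(strike.df, file="strikes.txt", sep=" ", quote=FALSE, row.names=FALSE)
write.csv(strike.df, file="strikes2.csv", quote=FALSE, row.names=FALSE)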

read_excel

• Use library(readxl)
• data <- read_excel("filename.xlsx"); variable names are read in from the first row by default (see the sketch below)
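A minimal sketch, assuming an Excel file called my.data.xlsx sits in your working directory (readxl is not part of base R, so it needs to be installed once with install.packages("readxl")):

library(readxl)
my.df = read_excel("my.data.xlsx")      # Reads the first sheet; first row supplies column names
# read_excel("my.data.xlsx", sheet = 2) # A specific sheet can be requested with the sheet argument
head(my.df)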

Reordering data

Sometimes it’s convenient to reorder our data, say the rows of our data frame (or matrix). Recall:
• The function order() takes in a vector, and returns the vector of indices that put this vector in increasing order
• Set the input decreasing=TRUE in order() to get decreasing order
• We can compute an appropriate vector of indices, and then use it on the rows of our data frame to reorder all of the columns simultaneously

Examples of reordering

# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.m.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.m.df.slow = sprint.m.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.m.df.slow)

## Rank Time Wind Name Country Birthdate City
## 2682 2691 10.09 1.8 Mel Lattany USA 10.08.59 Colorado Springs
## 2683 2691 10.09 -0.9 Mel Lattany USA 10.08.59 Zürich
## 2684 2691 10.09 1.3 Carl Lewis USA 01.07.61 Walnut
## 2685 2691 10.09 0.6 Calvin Smith USA 08.01.61 Athens
## 2686 2691 10.09 -1.7 Carl Lewis USA 01.07.61 Indianapolis
## 2687 2691 10.09 -0.9 Calvin Smith USA 08.01.61 Zürich
## Date
## 2682 30.07.1978
## 2683 19.08.1981
## 2684 25.04.1982
## 2685 14.05.1982
## 2686 02.07.1982
## 2687 18.08.1982

# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.m.df$Name) # By sprinter name
sprint.m.df.name = sprint.m.df[i.name,] # Reorder rows by name
head(sprint.m.df.name)

## Rank Time Wind Name Country Birthdate City
## 1373 1281 10.03 0.5 Aaron Armstrong TTO 14.10.77 Port of Spain
## 1552 1456 10.04 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 2326 2145 10.07 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 568 491 9.96 2.0 Aaron Brown CAN 27.05.92 Montverde
## 1114 1011 10.01 1.8 Aaron Brown CAN 27.05.92 Montverde
## 1836 1683 10.05 1.9 Aaron Brown CAN 27.05.92 Eugene
## Date
## 1373 20.06.2009
## 1552 25.06.2005
## 2326 13.08.2011
## 568 11.06.2016
## 1114 11.06.2016
## 1836 05.06.2013

Part II

Merging data

Merging data frames

Suppose you have two data frames X, Y, and you want to combine them:
• Simplest case: the data frames have exactly the same number of rows, the rows represent exactly the same units, and you want all columns from both; just use data.frame(X,Y) (see the toy sketch below)
• Next best case: you know that the two data frames have the same rows, but you only want certain columns from each; just use, e.g., data.frame(X$col1,X$col5,Y$col3)
• Next best case: same number of rows but in different order; put one of them in the same order as the other, with order(). Alternatively, use merge()
• Worse cases: different numbers of rows ... hard to line up rows ...
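A toy sketch of the two simplest cases; the data frames X and Y here are made up for illustration:

# Two small data frames describing the same 3 units, in the same row order
X = data.frame(id=c("a","b","c"), height=c(1.7, 1.8, 1.6))
Y = data.frame(weight=c(60, 75, 55), age=c(30, 41, 25))

data.frame(X, Y)               # All columns from both
data.frame(id=X$id, age=Y$age) # Only the columns we want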

Example: big city drivers

People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true? Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:

fha = read.csv(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/fha.csv",
  colClasses=c("character","double","double","double"),
  na.strings="NA")
nrow(fha)

## [1] 498

head(fha)

## City Population Miles.of.Road
## 1 New York--Newark, NY--NJ--CT 18351295 43893
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877
## 3 Chicago, IL--IN 8608208 25905
## 4 Miami, FL 5502379 15641
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610
## Daily.Miles.Traveled
## 1 286101
## 2 270807
## 3 172708
## 4 125899
## 5 99190
## 6 125389

Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:

ua = read.csv(file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/ua.dat", sep=";")
nrow(ua)

## [1] 3598

head(ua)

## UACE NAME POP HU AREALAND AREALANDSQMI AREAWATER
## 1 37 Abbeville, LA 19824 8460 29222871 11.28 300497
## 2 64 Abbeville, SC 5243 2578 11315197 4.37 19786
## 3 91 Abbotsford, WI 3966 1616 5363441 2.07 13221
## 4 118 Aberdeen, MS 4666 2050 7416616 2.86 52732
## 5 145 Aberdeen, SD 25977 12114 33002447 12.74 247597
## 6 172 Aberdeen, WA 29856 13139 39997951 15.44 1929689
## AREAWATERSQMI POPDEN LSADC
## 1 0.12 1757.0 76
## 2 0.01 1200.1 76
## 3 0.01 1915.2 76
## 4 0.02 1629.4 76
## 5 0.10 2038.6 76
## 6 0.75 1933.3 76

Difficulties in navigating the merge

• ≈ 500 cities versus ≈ 3600 “urbanized areas”
• fha orders cities by population, ua is alphabetical by name
• Both have place names, but those don’t always agree
• Not even common names for the shared columns

Lesson: find the unique identifier

But both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:

length(unique(fha$Population)) == nrow(fha)

## [1] TRUE

ua.pop.top498 = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)]
max(abs(fha$Population - ua.pop.top498))

## [1] 0

First way to merge

Reorder the area column in the second table by population, then append it to the first table:

ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)

## [1] 498

head(df1)

## City Population Roads Mileage Area
## 1 New York--Newark, NY--NJ--CT 18351295 43893 286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877 270807 1736.02
## 3 Chicago, IL--IN 8608208 25905 172708 2442.75
## 4 Miami, FL 5502379 15641 125899 1238.61
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867 99190 1981.37
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610 125389 1779.13

Second way to merge

Use the merge() function:

df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP")
nrow(df2)

## [1] 498

tail(df2, 3)

## Population City Miles.of.Road
## 496 8608208 Chicago, IL--IN 25905
## 497 12150996 Los Angeles--Long Beach--Anaheim, CA 24877
## 498 18351295 New York--Newark, NY--NJ--CT 43893
## Daily.Miles.Traveled UACE NAME
## 496 172708 16264 Chicago, IL--IN
## 497 270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498 286101 63217 New York--Newark, NY--NJ--CT
## HU AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332 2442.75 105649916 40.79 3524.0 75
## 497 4217448 4496266014 1736.02 61141327 23.61 6999.3 75
## 498 7263095 8935981360 3450.20 533176599 205.86 5318.9 75

merge()

The merge() function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol"), to join the two data frames x, y, by matching the columns "SomeXCol" and "SomeYCol"
• The default (no by.x, by.y specified) is to match on all columns with common names (see the small sketch below)
• The output will be a new data frame that has all the columns of both data frames
• If you know databases, then merge() is doing a JOIN
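A small sketch of the default behavior, with made-up toy data frames; merge() matches rows on the shared column name "id", regardless of row order:

left = data.frame(id=c("a","b","c"), height=c(1.7, 1.8, 1.6))
right = data.frame(id=c("c","a","b"), weight=c(55, 60, 75))
merge(left, right) # No by.x, by.y given: matched on the common column "id"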

Using order() and manual tricks versus merge()

• Reordering is easier to grasp; merge() takes some learning
• Reordering is simplest when there’s only one column to merge on; merge() handles many columns
• Reordering is simplest when the data frames are the same size; merge() handles different sizes automatically

So, do bigger cities mean more driving?

# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
     xlab="City area in square miles",
     ylab="Miles driven per person per day")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")

[Figure: scatterplot of city area in square miles (log-scale horizontal axis) against miles driven per person per day (vertical axis), with a nearly flat red regression line.]

More about merge() function

• merge(..., all, all.x, all.y)
• all.x: logical; if TRUE, then extra rows are added to the output, one for each row in x that has no matching row in y. These rows get NAs in the columns that would usually be filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output
• all.y: logical; analogous to all.x
• all: logical; if TRUE, then all.x=all.y=TRUE; if FALSE, then all.x=all.y=FALSE
A small sketch of all.x follows below.
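A small sketch of all.x, again with made-up toy data frames: the unmatched row of left is kept, with an NA filling the column that would have come from right:

left = data.frame(id=c("a","b","c"), height=c(1.7, 1.8, 1.6))
right = data.frame(id=c("a","b"), weight=c(60, 75))
merge(left, right)             # Inner join: the row with id "c" is dropped
merge(left, right, all.x=TRUE) # The row with id "c" is kept, with weight = NA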

Converting data from wide form to long form: collapsing the data frame

library(reshape2)
stocks = data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10,0,1),
  Y = rnorm(10,0,2),
  Z = rnorm(10,0,4)
)
stocks

## time X Y Z
## 1 2009-01-01 -1.06406981 -2.3648384 5.7489691
## 2 2009-01-02 -1.26495502 1.9489011 -1.7535504
## 3 2009-01-03 -0.28043553 2.7983741 0.2840863
## 4 2009-01-04 0.06302827 1.3669634 -7.7664304
## 5 2009-01-05 1.47793978 -3.0911516 -1.6267468
## 6 2009-01-06 -1.21225693 -0.2473569 8.2811592
## 7 2009-01-07 -1.06717890 0.7619787 1.2049675
## 8 2009-01-08 1.42640535 -0.1728733 -4.1799281
## 9 2009-01-09 0.95018420 -1.2599393 -1.3405610
## 10 2009-01-10 -0.77522305 2.4316198 3.8532028

stocksm = melt(stocks, id.vars = "time",
  variable.name = "stock", value.name = "price")
stocksm

## time stock price
## 1 2009-01-01 X -1.06406981
## 2 2009-01-02 X -1.26495502
## 3 2009-01-03 X -0.28043553
## 4 2009-01-04 X 0.06302827
## 5 2009-01-05 X 1.47793978
## 6 2009-01-06 X -1.21225693
## 7 2009-01-07 X -1.06717890
## 8 2009-01-08 X 1.42640535
## 9 2009-01-09 X 0.95018420
## 10 2009-01-10 X -0.77522305
## 11 2009-01-01 Y -2.36483838
## 12 2009-01-02 Y 1.94890111
## 13 2009-01-03 Y 2.79837413
## 14 2009-01-04 Y 1.36696336
## 15 2009-01-05 Y -3.09115155
## 16 2009-01-06 Y -0.24735694
## 17 2009-01-07 Y 0.76197870
## 18 2009-01-08 Y -0.17287326
## 19 2009-01-09 Y -1.25993929
## 20 2009-01-10 Y 2.43161979
## 21 2009-01-01 Z 5.74896907
## 22 2009-01-02 Z -1.75355043
## 23 2009-01-03 Z 0.28408625
## 24 2009-01-04 Z -7.76643040
## 25 2009-01-05 Z -1.62674678
## 26 2009-01-06 Z 8.28115919
## 27 2009-01-07 Z 1.20496746
## 28 2009-01-08 Z -4.17992812
## 29 2009-01-09 Z -1.34056099
## 30 2009-01-10 Z 3.85320284

stocks2 = data.frame(
  time = as.Date('2009-01-01') + 0:9,
  day = seq(1,10,1),
  X = rnorm(10,0,1),
  Y = rnorm(10,0,2),
  Z = rnorm(10,0,4)
)
stocksm2 = melt(stocks2, id.vars = c("time","day"),
  variable.name = "stock", value.name = "price")
stocksm2

## time day stock price
## 1 2009-01-01 1 X 1.17763333
## 2 2009-01-02 2 X -0.06978673
## 3 2009-01-03 3 X 0.68714193
## 4 2009-01-04 4 X -2.92975470
## 5 2009-01-05 5 X -0.81975142
## 6 2009-01-06 6 X -0.33862823
## 7 2009-01-07 7 X 0.58941584
## 8 2009-01-08 8 X 0.56775532
## 9 2009-01-09 9 X -0.92030090
## 10 2009-01-10 10 X -0.31174368
## 11 2009-01-01 1 Y 1.20567598
## 12 2009-01-02 2 Y 0.15188723
## 13 2009-01-03 3 Y 2.36377468
## 14 2009-01-04 4 Y -1.77815145
## 15 2009-01-05 5 Y 1.85695473
## 16 2009-01-06 6 Y -5.28711066
## 17 2009-01-07 7 Y 3.40392580
## 18 2009-01-08 8 Y 1.28097709
## 19 2009-01-09 9 Y -1.25175838
## 20 2009-01-10 10 Y 1.19347833
## 21 2009-01-01 1 Z 1.92106120
## 22 2009-01-02 2 Z 3.45684224
## 23 2009-01-03 3 Z -2.23146341
## 24 2009-01-04 4 Z -0.37878616
## 25 2009-01-05 5 Z -1.47832870
## 26 2009-01-06 6 Z -4.30045095
## 27 2009-01-07 7 Z -1.51357751
## 28 2009-01-08 8 Z -2.22612885
## 29 2009-01-09 9 Z -6.01073307
## 30 2009-01-10 10 Z -2.07128595

Converting data from long form to wide form

stocksmd1 = dcast(stocksm, time ~ stock, value.var = "price")
stocksmd1

## time X Y Z
## 1 2009-01-01 -1.06406981 -2.3648384 5.7489691
## 2 2009-01-02 -1.26495502 1.9489011 -1.7535504
## 3 2009-01-03 -0.28043553 2.7983741 0.2840863
## 4 2009-01-04 0.06302827 1.3669634 -7.7664304
## 5 2009-01-05 1.47793978 -3.0911516 -1.6267468
## 6 2009-01-06 -1.21225693 -0.2473569 8.2811592
## 7 2009-01-07 -1.06717890 0.7619787 1.2049675
## 8 2009-01-08 1.42640535 -0.1728733 -4.1799281
## 9 2009-01-09 0.95018420 -1.2599393 -1.3405610
## 10 2009-01-10 -0.77522305 2.4316198 3.8532028

stocksmd2 = dcast(stocksm, stock ~ time, value.var = "price")
stocksmd2

## stock 2009-01-01 2009-01-02 2009-01-03 2009-01-04 2009-01-05 2009-01-06
## 1 X -1.064070 -1.264955 -0.2804355 0.06302827 1.477940 -1.2122569
## 2 Y -2.364838 1.948901 2.7983741 1.36696336 -3.091152 -0.2473569
## 3 Z 5.748969 -1.753550 0.2840863 -7.76643040 -1.626747 8.2811592
## 2009-01-07 2009-01-08 2009-01-09 2009-01-10
## 1 -1.0671789 1.4264053 0.9501842 -0.7752231
## 2 0.7619787 -0.1728733 -1.2599393 2.4316198
## 3 1.2049675 -4.1799281 -1.3405610 3.8532028

What if there are two rows of stock prices for the same time?

stocksm3 = rbind(stocksm,
  list(time = as.Date('2009-01-01'), stock = "X", price = 0.2))
stocksmd3 = dcast(stocksm3, time ~ stock, value.var = "price",
  fun.aggregate = function(x) { mean(x) }) # Take the average
stocksmd3

## time X Y Z
## 1 2009-01-01 -0.43203491 -2.3648384 5.7489691
## 2 2009-01-02 -1.26495502 1.9489011 -1.7535504
## 3 2009-01-03 -0.28043553 2.7983741 0.2840863
## 4 2009-01-04 0.06302827 1.3669634 -7.7664304
## 5 2009-01-05 1.47793978 -3.0911516 -1.6267468
## 6 2009-01-06 -1.21225693 -0.2473569 8.2811592
## 7 2009-01-07 -1.06717890 0.7619787 1.2049675
## 8 2009-01-08 1.42640535 -0.1728733 -4.1799281
## 9 2009-01-09 0.95018420 -1.2599393 -1.3405610
## 10 2009-01-10 -0.77522305 2.4316198 3.8532028

Summary

• Read in data from a previous R session with readRDS(), load()
• Read in data from the outside with read.table(), read.csv()
• It can sometimes be tricky to get the arguments right in read.table(), read.csv()
• It sometimes helps to take a look at the original data files to see their structure
• For reordering data, use order(), rev(), and proper indexing
• For merging data, use merge(); but you can also do it manually using reordering tricks
• Use the melt() function to transform wide format to long format
• Use the dcast() function to transform long format to wide format
