R Course for the NSOs in the Arab Countries, Part 3: R Data Management
Valentin Todorov
United Nations Industrial Development Organization (UNIDO), Vienna
18-20 May 2015

Outline

1. Motivation
2. Data exchange with other statistical tools
3. Reading and Writing in Excel Format
4. Reading Data in SDMX Format
5. R database interfaces and Relational DBMSs
6. Case study: UNIDO database
7. R packages for database access
8. Exercise: The Data Expo 2006
9. Accessing international statistical databases
10. Summary and conclusions

Motivation

1. A statistical organization uses a number of statistical software tools and other programs.
2. Data exchange between such systems (SAS, SPSS, EViews, Stata, Excel, Matlab, Octave, etc.) is essential.
3. Reading and writing data from/to Excel is very important because of its extreme popularity.
4. Data are often stored in relational databases (MS Access, MySQL, DB2, MS SQL Server, Sybase, etc.), and their size does not allow extracting them into flat files before analysis.
5. Using SDMX for data and metadata exchange is becoming more and more important.

Data exchange with other statistical tools

R as a mediator

[Figure: R as a mediator]

Package foreign

• Package foreign reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase and others, and writes in the format of some of these.
• Some limitations apply:
  1. read.dta(), write.dta() - read and write data in Stata format. Supports versions 9-12; versions higher than 12 will not be supported.
  2. read.spss() - reads a file stored by the SPSS save or export commands. It was originally written in 2000 and has limited support for later changes in the SPSS format.
  3. read.ssd() - reads a SAS data set (sas7bdat). For this purpose it generates a SAS program that converts the ssd contents to SAS transport format and then uses read.xport() to obtain a data frame. Requires SAS to be available.

Example: Read and write Stata files

> ## Read and write Stata files
> library(foreign)
> data(rock)                        # from 'datasets' in base
> rockfile <- tempfile()            # get a temporary file
> write.dta(rock, file=rockfile)    # write in Stata format
> head(read.dta(rockfile))          # read Stata data
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

Example: Read and write SAS files

1. Using read.ssd() from foreign:

> ## Read and write SAS files
> library(foreign)
> libname <- "C:/MYSASDATA"
> members <- c("data1", "data2", "data3")
> SAS <- "C:/Program Files/SAS/SAS 9.1/SAS.EXE"
> sasdata <- read.ssd(libname=libname, sectionnames=members,
+                     sascmd=SAS)
> is.data.frame(sasdata$data1)

2. Using sas.get() from Hmisc (quite similar to read.ssd(), though):

> library(Hmisc)
> sasdata <- sas.get("C:/MYSASDATA", "data1")

3. Using function read.sas7bdat() from sas7bdat: http://sas-and-r.blogspot.co.at/2011/07/really-useful-r-package-sas7bdat.html

Reading and Writing in Excel Format

• Spreadsheets, especially Microsoft Excel, are one of the most common methods for data exchange.
• R provides several ways to access Excel files.
• The trivial way is to save the data to a CSV or a tab-delimited file and then use read.csv() or read.table() (manual intervention necessary).
• A non-comprehensive list of R packages follows:
  1. gdata: Requires installation of Perl on Windows. Supports only reading from Excel.
  2. RODBC: On Windows, function odbcConnectExcel(). Too complicated; there are better ways of reading Excel files.
  3. xlsx: Based on Java, could be slow. Prefer read.xlsx2() over read.xlsx(): it is significantly faster for large data sets.

> ## Read and write Excel files using the package 'xlsx'
> library(xlsx)
> data(rock)                        # from datasets
> fname <- paste(tempdir(), "/myfile.xlsx", sep="")
> write.xlsx(rock, fname)
> ## Now read the file and display the beginning
> df <- read.xlsx2(fname, 1)        # read first sheet
> head(df)
  X. area    peri     shape perm
1  1 4990  2791.9 0.0903296  6.3
2  2 7002  3892.6  0.148622  6.3
3  3 7558 3930.66  0.183312  6.3
4  4 7352 3869.32  0.117063  6.3
5  5 7943 3948.54  0.122417 17.1
6  6 7979 4010.15  0.167045 17.1

  4. xlsReadWrite: Available for Windows only; uses a non-OSS Delphi component; does not support .xlsx files, which is a serious drawback.
     Not available for 64-bit platforms. It has recently been removed from CRAN.
  5. XLConnect: Based on Java, might be slow for large data sets but very powerful otherwise. http://miraisolutions.wordpress.com/ Uses the Apache POI API as the underlying interface: http://poi.apache.org/. More details and examples follow.
  6. RExcel: An add-in for Excel. Allows access to R from within Excel, i.e. Excel becomes a GUI for R. http://rcom.univie.ac.at/download.html

XLConnect

• It seems that this is currently the best package for manipulating Excel files
• Allows you to produce formatted Excel reports, including graphics
• Main functions:
  - Reading and writing of Excel worksheets (via data.frames)
  - Reading and writing of named ranges (via data.frames)
  - Creating, removing, renaming and cloning worksheets
  - Adding graphics
  - Controlling sheet visibility
  - Defining column width and row height
  - Merge/unmerge, cell formulas, formula recalculation, auto-filters, cell styles

XLConnect: writing data to a worksheet

> ## Create a new workbook and write a data frame
> ## to a worksheet
> library(XLConnect)
> data(rock)                        # from datasets
> fname1 <- paste(tempdir(), "/rock-1.xlsx", sep="")
> wb <- loadWorkbook(fname1, create=TRUE)
> createSheet(wb, name="rock")
> writeWorksheet(wb, data=rock, sheet="rock")
> saveWorkbook(wb)                  # Do not forget this one

• saveWorkbook() does the actual file creation.
• Position the data anywhere using startRow and startCol

XLConnect: writing data to a worksheet with one call

> ## Writing to worksheet with one call
> library(XLConnect)
> data(rock)                        # from datasets
> fname2 <- paste(tempdir(), "/rock-2.xlsx", sep="")
> writeWorksheetToFile(fname2, data=rock, sheet="rock",
+                      startRow=3, startCol=4)

• writeWorksheetToFile() loads the workbook, creates the sheet and finally saves the workbook.
• Useful when you only need to write one sheet into an Excel file
• Position the data anywhere using startRow and startCol

XLConnect: reading data from worksheet

> ## Reading data from worksheet
> library(XLConnect)
> wb = loadWorkbook(fname1, create=TRUE)
> data = readWorksheet(wb, sheet="rock")
> head(data)
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

• Sheets (and regions) can be referenced by name as well
• Alternatively use startRow and startCol

XLConnect: reading data from worksheet with one call

> ## Reading data from worksheet with one call
> library(XLConnect)
> data = readWorksheetFromFile(fname2, sheet="rock")
> head(data)
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

• Recognizes square regions automatically
• Useful when you only need to read one sheet from an Excel file
• Deals with opening/closing connections
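
The automatic region detection and the startRow/startCol alternative mentioned above can be compared side by side. The following is only a minimal sketch, assuming XLConnect and a Java runtime are installed; the file name rock-offset.xlsx and the variable names auto and explicit are hypothetical and chosen for illustration.

```r
## Minimal sketch: read back data that does not start in cell A1.
## Assumes XLConnect (and Java) is available; file name is hypothetical.
library(XLConnect)

data(rock)                                          # from 'datasets'
fname <- file.path(tempdir(), "rock-offset.xlsx")
if (file.exists(fname)) file.remove(fname)          # start from a clean file

## Write the data frame with its top-left corner at row 3, column 4
writeWorksheetToFile(fname, data = rock, sheet = "rock",
                     startRow = 3, startCol = 4)

## 1) Let XLConnect detect the data region on its own
auto <- readWorksheetFromFile(fname, sheet = "rock")

## 2) Give the top-left corner of the region explicitly
explicit <- readWorksheetFromFile(fname, sheet = "rock",
                                  startRow = 3, startCol = 4)

## Both results should have the same shape and column names as 'rock'
dim(auto)
identical(names(auto), names(explicit))
```

Giving startRow and startCol explicitly is mainly useful when a sheet holds more than one block of data, so that the automatically detected region would not be the one you want.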
Reading Data in SDMX Format

SDMX Example

• Several publicly available files in SDMX format can be found on the IMF web site http://www.imf.org/oecd
• We will consider the annual exchange rates for the period 1980 to 2007
• The file starts with a header and then continues with a time series block for each country

Listing 1: Read exchange rates in SDMX format.