R Course for the NSOs in the Arab countries, Part 3: R Data Management

Valentin Todorov

United Nations Industrial Development Organization (UNIDO), Vienna

18-20 May 2015

Todorov (UNIDO), R Course for the NSOs in the Arab countries, Part 3: R Data Management, 18-20 May 2015

Outline

1 Motivation

2 Data exchange with other statistical tools

3 Reading and Writing in Excel Format

4 Reading Data in SDMX Format

5 R data base interfaces and Relational DBMSs

6 Case study: UNIDO database

7 R packages for database access

8 Exercise: The Data Expo 2006

9 Accessing international statistical databases

10 Summary and conclusions

Motivation

1. A number of statistical software tools and other programs are in use in a statistical organization.
2. Data exchange between such systems (SAS, SPSS, EViews, Stata, Excel, Matlab, Octave, etc.) is essential.
3. Reading and writing data from/to Excel is very important due to its extreme popularity.
4. Data are often stored in relational databases (MS Access, MySQL, DB2, MS SQL Server, Sybase, etc.) and their size does not allow extracting them into flat files before analysis.
5. Using SDMX for data and metadata exchange is becoming more and more important.

R as a mediator

Package foreign

• Package foreign reads data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase and others, and writes in the format of some of these.
• Some limitations apply:
1. read.dta(), write.dta(): read and write data in Stata format. Stata formats up to version 12 are supported; files written by later versions are not.
2. read.spss(): reads a file stored by the SPSS save or export commands. It was originally written in 2000 and has limited support for later changes in the SPSS format.
3. read.ssd(): reads a SAS data set (sas7bdat). It generates a SAS program that converts the ssd contents to SAS transport format and then uses read.xport() to obtain a data frame. Requires SAS to be available.

Example: Read and write STATA files

> ## Read and write STATA files
> library(foreign)
> data(rock)                      # from 'datasets' in base
> rockfile <- tempfile()          # get a temporary file
> write.dta(rock, file=rockfile)  # write in STATA format
> head(read.dta(rockfile))        # read STATA data
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

Example: Read and write SAS files

1. Using read.ssd() from foreign:
> ## Read and write SAS files
> library(foreign)
> libname <- "C:/MYSASDATA"
> members <- c("data1", "data2", "data3")
> SAS <- "C:/Program Files/SAS/SAS 9.1/SAS.EXE"
> sasdata <- read.ssd(libname=libname, sectionnames=members,
+     sascmd=SAS)
> is.data.frame(sasdata$data1)
2. Using sas.get() from Hmisc (quite similar to read.ssd() though):
> library(Hmisc)
> sasdata <- sas.get("C:/MYSASDATA", "data1")
3. Using function read.sas7bdat() from sas7bdat: http://sas-and-r.blogspot.co.at/2011/07/really-useful-r-package-sas7bdat.html

Reading and Writing in Excel Format

• Spreadsheets (especially Excel) are one of the most common methods for data exchange
• R provides several ways to access Excel files
• The trivial way is to save the data to a CSV or tab-delimited file and then use read.csv() or read.table() (manual intervention necessary)
• A non-exhaustive list of R packages follows:

1. gdata: Requires installation of Perl on Windows. Supports only reading from Excel.
2. RODBC: On Windows, function odbcConnectExcel(). Too complicated; there are better ways of reading Excel files.


3. xlsx: Based on Java, could be slow. Prefer read.xlsx2() over read.xlsx(); it is significantly faster for large data sets.
> ## Read and write Excel files using the package 'xlsx'
> library(xlsx)
> data(rock)                      # from datasets
> fname <- paste(tempdir(), "/myfile.xlsx", sep="")
> write.xlsx(rock, fname)
> ## Now read the file and display the beginning
> df <- read.xlsx2(fname, 1)      # read first sheet
> head(df)
  X. area    peri     shape perm
1  1 4990  2791.9 0.0903296  6.3
2  2 7002  3892.6  0.148622  6.3
3  3 7558 3930.66  0.183312  6.3
4  4 7352 3869.32  0.117063  6.3
5  5 7943 3948.54  0.122417 17.1
6  6 7979 4010.15  0.167045 17.1


4. xlsReadWrite: Available for Windows only; uses a non-OSS Delphi component; does not support .xlsx files, which is a serious drawback. Not available for 64-bit platforms. It has been removed from CRAN.
5. XLConnect: Based on Java, might be slow for large data sets but very powerful otherwise. http://miraisolutions.wordpress.com/ Uses the Apache POI API as the underlying interface: http://poi.apache.org/. More details and examples follow.
6. RExcel: An add-in for Excel. Allows access to R from within Excel, i.e. Excel becomes a GUI for R. http://rcom.univie.ac.at/download.html
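The CSV route mentioned at the top of this list can be sketched in base R as follows. This is only an illustration: the temporary file stands in for a CSV file that would normally be exported from Excel by hand, and the object names are chosen for the example.

```r
## Round trip through a CSV file using base R only
data(rock)                             # from 'datasets'
csvfile <- tempfile(fileext = ".csv")  # illustrative temporary file
write.csv(rock, csvfile, row.names = FALSE)
df <- read.csv(csvfile)                # header row is read by default
str(df)
```

No extra packages are needed for this route, but cell formatting, formulas and multiple sheets are lost, which is why the packages listed above are usually preferred.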

XLConnect

• This seems to be the best package for manipulating Excel files at the moment
• Allows you to produce formatted Excel reports, including graphics
• Main functions:
  - Reading and writing of Excel worksheets (via data.frames)
  - Reading and writing of named ranges (via data.frames)
  - Creating, removing, renaming and cloning worksheets
  - Adding graphics
  - Controlling sheet visibility
  - Defining column width and row height
  - Merge/unmerge, cell formulas, formula recalculation, auto-filters, cell styles

XLConnect: writing data to a worksheet

> ## Create a new workbook and write a data frame
> ## to a worksheet
> library(XLConnect)
> data(rock)                      # from datasets
> fname1 <- paste(tempdir(), "/rock-1.xlsx", sep="")
> wb <- loadWorkbook(fname1, create=TRUE)
> createSheet(wb, name="rock")
> writeWorksheet(wb, data=rock, sheet="rock")
> saveWorkbook(wb)                # Do not forget this one

• saveWorkbook() does the actual file creation.
• Position the data anywhere using startRow and startCol

XLConnect: writing data to a worksheet with one call

> ## Writing to worksheet with one call
> library(XLConnect)
> data(rock)                      # from datasets
> fname2 <- paste(tempdir(), "/rock-2.xlsx", sep="")
> writeWorksheetToFile(fname2, data=rock, sheet="rock",
+     startRow=3, startCol=4)

• writeWorksheetToFile() loads the workbook, creates the sheet and finally saves the workbook.
• Useful when you only need to write one sheet into an Excel file
• Position the data anywhere using startRow and startCol

XLConnect: reading data from worksheet

> ## Reading data from worksheet
> library(XLConnect)
> wb <- loadWorkbook(fname1)      # the file already exists, no need to create it
> data <- readWorksheet(wb, sheet="rock")
> head(data)
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

• Sheets (and regions) can be referenced by name as well
• Alternatively use startRow and startCol

XLConnect: reading data from worksheet with one call

> ## Reading data from worksheet with one call
> library(XLConnect)
> data <- readWorksheetFromFile(fname2, sheet="rock")
> head(data)
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

• Recognizes rectangular data regions automatically
• Useful when you only need to read one sheet from an Excel file
• Deals with opening/closing connections

SDMX Example

• Several publicly available files in SDMX format can be found on the IMF web site http://www.imf.org/oecd
• We will consider the annual exchange rates for the period 1980 to 2007
• The file starts with a Header and then continues with a time series block for each country

Listing 1: Read exchange rates in SDMX format.
> library(sdmxer)
SDMX utilities (version 0.1-00)
> uri <- "http://sdmx.imf.org/oecd/ExchangeRates_IMF_IFS_Annual_80_07.xml"
> doc <- xmlRoot(xmlTreeParse(uri))

SDMX Example (II)

Listing 2: Function to select a time series.
> ## Define a function to select time series
> select.ts <- function(x, ct="Korea", key="..RF.ZF...")
+ {
+     if(xmlName(x) == "Header")
+         return(FALSE)
+
+     att <- as.list(xmlAttrs(x))
+     if(att$CountryName == ct && substr(att$TS_Key, 4, 15) == key)
+         return(TRUE)
+
+     return(FALSE)
+ }

SDMX Example (III)

Listing 3: Select the records of interest.
> kr <- doc[[which(xmlSApply(doc, select.ts))]]
> xmlAttrs(kr)
                        Frequency                          Database
                              "A"                             "IFS"
                      CountryName                           Country
                          "Korea"                             "542"
                           TS_Key                        Descriptor
                  "542..RF.ZF..."                     "MARKET RATE"
                            Units                             Scale
"National Currency per US Dollar"                            "None"

SDMX Example (IV)

Listing 4: Extract the time series.
> ex <- xmlSApply(kr, function(x) as.vector(as.numeric(xmlAttrs(x))))
> colnames(ex) <- NULL
> kr.exrate <- ts(ex[2,], start=ex[1,1])
> kr.exrate
Time Series:
Start = 1980
End = 2007
Frequency = 1
 [1]  607.4325  681.0283  731.0842  775.7483  805.9758
 [6]  870.0200  881.4542  822.5675  731.4683  671.4558
[11]  707.7642  733.3533  780.6508  802.6708  803.4458
[16]  771.2733  804.4533  951.2892 1401.4367 1188.8167
[21] 1130.9575 1290.9946 1251.0883 1191.6142 1145.3192
[26] 1024.1167  954.7905  929.2137

SDMX Example (V)

Listing 5: Plot the time series.
> plot(kr.exrate, xlab="Time", ylab="Korea",
+     main="Market Rate, National Currency per US Dollar")

[Figure: time series plot titled "Market Rate, National Currency per US Dollar"; x-axis "Time" (1980 to 2007), y-axis "Korea".]
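The steps of Listings 1 to 4 can also be tried offline on a tiny in-line document. This is only a sketch under stated assumptions: the element names (DataSet, Series, Obs) and the observation attribute names (TIME_PERIOD, OBS_VALUE) are invented for illustration and need not match the real IMF file; the Korea values are the first three from Listing 4 and the second series is a dummy. The sketch uses the XML package directly rather than sdmxer.

```r
library(XML)

## Mock document: element and attribute names are assumptions,
## not the actual layout of the IMF file
txt <- paste0(
  '<DataSet>',
  '<Header><ID>mock</ID></Header>',
  '<Series CountryName="Korea" TS_Key="542..RF.ZF...">',
  '<Obs TIME_PERIOD="1980" OBS_VALUE="607.4325"/>',
  '<Obs TIME_PERIOD="1981" OBS_VALUE="681.0283"/>',
  '<Obs TIME_PERIOD="1982" OBS_VALUE="731.0842"/>',
  '</Series>',
  '<Series CountryName="Dummy" TS_Key="000..RF.ZF...">',
  '<Obs TIME_PERIOD="1980" OBS_VALUE="1"/>',
  '</Series>',
  '</DataSet>')
doc <- xmlRoot(xmlTreeParse(txt, asText = TRUE))

## Same selection logic as in Listing 2
select.ts <- function(x, ct = "Korea", key = "..RF.ZF...") {
  if (xmlName(x) == "Header") return(FALSE)
  att <- as.list(xmlAttrs(x))
  att$CountryName == ct && substr(att$TS_Key, 4, 15) == key
}

kr <- doc[[which(xmlSApply(doc, select.ts))]]   # the Korea series node
## One column per observation: row 1 = year, row 2 = value
ex <- xmlSApply(kr, function(x) as.numeric(xmlAttrs(x)))
colnames(ex) <- NULL
kr.exrate <- ts(ex[2, ], start = ex[1, 1])
kr.exrate
```

Building the string without whitespace between elements keeps blank text nodes out of the parsed tree, so every child of a series node is an observation.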

Relational databases: Why use statistical software?

Why use statistical software at all? Are today's DBMSs not so powerful and sophisticated that they can do everything (cf. OLAP)?

Why use statistical software?

• SQL has limited numerical and statistical features. For example, it has no least squares fitting procedures, and finding quantiles requires a sophisticated query
• In many cases the numerical algorithms used in the basic SQL aggregate functions are not implemented to safeguard numerical accuracy
• The wide range of data types may have drawbacks when it comes to performing calculations across a row: some of the conversions from one numeric type to another may produce unexpected truncation and rounding
• The algorithms used in a DBMS are seldom publicly documented
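To illustrate the first point, both tasks are single calls in R. The sketch below uses the rock data set from the earlier examples; the choice of model (log permeability on the three other variables) is arbitrary and only meant as an illustration.

```r
data(rock)   # from 'datasets'

## Least squares fit: one call in R, no direct counterpart in basic SQL
fit <- lm(log(perm) ~ area + peri + shape, data = rock)
coef(fit)

## Quartiles of permeability: again a single call
q <- quantile(rock$perm, probs = c(0.25, 0.5, 0.75))
q
```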

Why use database?

On the other hand:
• There are limitations on the types of data that R handles well.
• R is not well suited to extremely large data sets:
  - All data being manipulated by R are resident in memory
  - Several copies of the data can be created during execution of a function
  - Data objects that are more than a (few) hundred megabytes in size can cause R to run out of memory, particularly on a 32-bit operating system
• A database provides fast access to data subsets.
• R does not easily support concurrent access to data.
• R does support persistence of data, but the format of the stored data is specific to R and not easily manipulated by other systems.
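The memory point can be checked directly with object.size() from base R. The sketch below is illustrative only; the vector length is an arbitrary choice.

```r
## A numeric vector stores 8 bytes per value, all of it in memory
x <- numeric(1e6)                     # one million doubles
print(object.size(x), units = "MB")

## A data frame with two such columns needs about twice as much
df <- data.frame(a = x, b = x)
print(object.size(df), units = "MB")
```

Scaling this up shows why a table with hundreds of millions of rows is better left in the database and queried in subsets.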

Overview of Relational data sources

• Public domain and open-source systems
  - SQLite (www.sqlite.org)
  - MySQL (www.mysql.com)
  - PostgreSQL (www.postgres.org)
• Commercial systems
  - Microsoft Access (Windows)
  - Oracle
  - IBM DB2
  - Microsoft SQL Server
• ODBC (Open Database Connectivity): a standard for using many different data sources. It originated on Windows but is also implemented on Linux/Unix/Mac OS X.

SQLite

• www.sqlite.org
• SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
• No separate server process; reads and writes directly to ordinary disk files.
• SQLite is said to be the most widely deployed SQL database engine.
• The source code for SQLite is in the public domain.
• Current version: 3.8.4.3 of SQLite (recommended for all new development).

SQLite command shell

• A command-line shell for accessing and modifying SQLite databases; compatible with all versions of SQLite through 3.8.4.3 and beyond.
• sqlite3.exe on Windows
• Download from www.sqlite.org
• Quick start
C:>sqlite3 ex1
SQLite version 3.8.4.3 2014-04-03 16:53:12
Enter ".help" for usage hints.
sqlite> create table tbl1(one varchar(10), two smallint);
sqlite> insert into tbl1 values('hello!',10);
sqlite> insert into tbl1 values('goodbye', 20);
sqlite> select * from tbl1;
hello!|10
goodbye|20
sqlite> .quit

SQLite Manager: Add On for Firefox

https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/

ODBC: Setting up Windows DSN

How to set up a Windows Data Source Name (DSN)

Case study: UNIDO INDSTAT database

• INDSTAT2, 2013 edition, at the 2-digit level of ISIC Revision 3.
• stat.unido.org
• Contains time series data from 1963 onwards.
• Data are arranged by country, year and ISIC code at the 2-digit level of ISIC (Revision 3).
• Number of countries: 166
• Reference period: 1963-2010
• Coverage in terms of years, as well as data items, may vary from country to country depending on data availability.


• A limited version of INDSTAT2 2013 in MS Access format: upload.unido.co/NTTS2015/xindstat2.mdb
• A limited version of INDSTAT2 2013 in SQLite format: upload.unido.co/NTTS2015/xindstat2.db

Exercise: Investigate the structure of the database using SQL and the SQLite command shell or the SQLite Manager add-on for Firefox

.tables
.schema
SELECT * FROM data;
SELECT * FROM data WHERE year = 2000 AND countrycode = 156;
SELECT CountryCode, Tablecode, year, isiccode, value FROM data WHERE year = 2000 AND countrycode = 156;

R packages for database access

• DBI
  - RMySQL
  - RSQLite
  - ROracle
  - RPostgreSQL
• RODBC
• others

RSQLite

• The interface is broken into three main elements:
  - The driver facilitates the communication between the R session and a particular type of DBMS (e.g. SQLite);
  - The connection encapsulates the actual connection (with the aid of the driver) to a particular DBMS and carries out the requested queries; and
  - The result tracks the status of a query, such as the number of rows that have been fetched and whether or not the query has completed.
• SQL queries can be sent by either dbSendQuery() or dbGetQuery()
• There are convenient interfaces to read/write/test/delete tables in the database: dbReadTable(), dbWriteTable(), dbExistsTable() and dbRemoveTable()
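The difference between dbSendQuery() and dbGetQuery() can be sketched with a throwaway in-memory database. This is an illustration under assumptions, not part of the course material: the table name "rock" and the query are chosen for the example, and the rock data set stands in for a real database table.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")   # throwaway in-memory database
data(rock)                               # from 'datasets'
dbWriteTable(con, "rock", rock)

sql <- "SELECT area, perm FROM rock WHERE perm > 100"

## dbSendQuery() returns a result object; rows are fetched on request
res <- dbSendQuery(con, sql)
first <- dbFetch(res, n = 5)     # only the first 5 rows
dbHasCompleted(res)              # have all rows been delivered yet?
rest <- dbFetch(res)             # all remaining rows
dbClearResult(res)               # always release the result

## dbGetQuery() sends, fetches everything and clears in one step
all_rows <- dbGetQuery(con, sql)

dbDisconnect(con)
```

Fetching in chunks through the result object is the variant to prefer when the full query result would not fit comfortably in memory.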

RSQLite example

> library(RSQLite)
> drv <- SQLite()
> con <- dbConnect(drv, "C:/.AMMAN/xindstat2.db")
> dbListTables(con)
[1] "ExchangeRate"    "Isic"            "Metadata"        "Revision"
[5] "TableCode"       "TableDefinition" "country"         "data"
> dbDisconnect(con)
[1] TRUE
> dbUnloadDriver(drv)
[1] TRUE

RSQLite example II

> con <- dbConnect(SQLite(), "C:/.Amman/xindstat2.db")
> df <- dbGetQuery(con, "SELECT tablecode, countrycode, isiccode, value
+     FROM data where countrycode=156 and year=2000 and tablecode=4")
> dim(df)
[1] 24  4
> head(df)
  TableCode CountryCode IsicCode   Value
1         4         156       15 3619000
2         4         156       16  259000
3         4         156       17 4829000
4         4         156       18 3284000
5         4         156       19       0
6         4         156       20  500000
> dbDisconnect(con)
[1] TRUE

RODBC

• RODBC provides an interface to database sources supporting an ODBC interface, which is very widely available; this allows the same R code to access different database systems.
• Open a connection with odbcConnect() or odbcDriverConnect()
• Close the connection with odbcClose()
• sqlSave() copies an R data frame to a table in the database
• sqlFetch() copies a table in the database to an R data frame
• An SQL query can be sent to the database by a call to sqlQuery()

RODBC example

The following will not work with MS Access on 64-bit platforms because there is no 64-bit ODBC driver for MS Access. Use the 32-bit version of the package.
> library(RODBC)
> ## conODBC <- odbcConnect("INDSTAT2") # DSN name
> conODBC <- odbcConnectAccess("C:/.NTTS2015/xindstat2.mdb")
> df <- sqlQuery(conODBC, "SELECT * from data where countrycode='040' and year=2000")
> dim(df)
> head(df)
> odbcClose(conODBC)              # close the connection when done

Exercise: The Data Expo 2006

• The Data Expo 2006 data set consists of several atmospheric measurements taken at many different locations and at several time points. The Data Expo 2006 web site is http://stat-computing.org/dataexpo/2006/.
• These data were used by Paul Murrell in his book about information technologies (Murrell, 2009) and are also available from the following web site: http://statmath.wu.ac.at/courses/data-analysis/week5.html
• Three tables: measurements, locations and dates.
• An SQLite database: upload.unido.co/NTTS2015/dataexpo.db
• An MS Access database: upload.unido.co/NTTS2015/dataexpo.mdb


Accessing international statistical databases

• World Development Indicators (WDI) from the World Bank: package WDI
• COMTRADE, UNCTAD and WTO for international trade data
• International Financial Statistics (IFS) from the International Monetary Fund (IMF)
• Industrial statistics databases (INDSTAT) by UNIDO

Summary

• R is very useful for data exchange between different systems (SAS, SPSS, EViews, Stata, Excel, Matlab, Octave, etc.), relational databases and SDMX
• Reading and writing data from/to Excel is (still) very important due to its extreme popularity. An excellent package for doing this is XLConnect
• There are many reasons to access data in relational databases (RDBMS). R provides means (packages) for connecting to most commercial and open-source RDBMS systems.
• RSQLite and RODBC
