Transform and Tidy (Wrangle) Data with R - Required Read Bojan Duric

Transform and Tidy (Wrangle) Data with R - Required Read Bojan Duric This Notebook is selection of “A Very (short) Introduction to R” by Paul Torfs & Claudia Brauer and “R for Data Scince”" by Hadley Wickham and Garrett Grolemund 1 Introduction What is R? R is a powerful language and environment for statistical computing and graphics. It is a public domain (a so called “GNU”) project which is similar to the commercial S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S, or in language terms different dialect of S. The main advantages of R are the fact that R is freeware and that there is a lot of help available online. It is quite similar to other programming packages such as MatLab (not freeware), but more user-friendly than programming languages such as C++ or Fortran. You can use R as it is, but for educational purposes we prefer to use R in combination with the RStudio interface (also freeware), which has an organized layout and several extra options. The R language came to use quite a bit after S had been developed. One key limitation of the S language was that it was only available in a commercial package, S-PLUS. In 1991, R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland. In 1993 the first announcement of R was made to the public. In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the GNU General Public License to make R free software. This was critical because it allowed for the source code for the entire R system to be accessible to anyone who wanted to tinker with it (more on free software later). In 1996, a public mailing list was created (the R-help and R-devel lists) and in 1997 the R Core Group was formed, containing some people associated with S and S-PLUS. Currently, the core group controls the source code for R and is solely able to check in changes to the main R source tree. Finally, in 2000 R version 1.0.0 was released to the public. Limitations of R No programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on almost 50 year old technology, going back to the original S system developed at Bell Labs. There was originally little built in support for dynamic or 3-D graphics (but things have improved greatly since the “old days”). Another commonly cited limitation of R is that objects must generally be stored in physical memory. This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity has continued to grow over time and amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time. 1 At a higher level one “limitation” of R is that its functionality is based on consumer demand and (voluntary) user contributions. If no one feels like implementing your favorite method, then it’s your job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. 2 Getting started 2.1 Install R To install R on your computer (legally for free!), go to the home website of R: http://www.r-project.org/ and do the following (assuming you work on a windows computer): • click download CRAN in the left bar • choose a download site • choose Windows as target operation system • click base • choose Download R 3.** for Windows and • choose default answers for all questions It is also possible to run R and RStudio from a USB stick instead of installing them. This could be useful when you don’t have administrator rights on your computer. See our separate note “How to use portable versions of R and RStudio” for help on this topic. 2.2 Install RStudio After finishing this setup, you should see an “R” icon on you desktop. Clicking on this would start up the standard interface. We recommend, however, to use the RStudio interface. To install RStudio, go to: http://www.rstudio.org/ and do the following (assuming you work on a windows computer): • click Download RStudio • click Download RStudio Desktop • click Recommended For Your System • download the .exe file and run it (choose default answers for all questions) 2.3 RStudio layout The RStudio interface consists of several windows (see Figure 1). • Bottom left: console window (also called command window). Here you can type simple commands after the “>” prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff. • Top left: editor window (also called script window). Collections of commands (scripts) can be edited and saved. When you don’t get this window, you can open it with File -> New -> R script Just typing a command in the editor window is not enough, it has to get into the command window before R executes the command. If you want to run a line from the script window (or the whole script), you can click Run or press CTRL+ENTER to send it to the command window. • Top right: workspace / history window. In the workspace window you can see which data and values R has in its memory. You can view and edit the values by clicking on them. The history window shows what has been typed before. 2 • Bottom right: files / plots / packages / help / viewer window. Here you can open files, view plots (also previous plots), install and load packages or use the help function. You can change the size of the windows by dragging the grey bars between the windows. Figure 1: Figure 1 The editor, workspace, console and viewer windows in RStudio. 2.4 Working directory Your working directory is the folder on your computer in which you are currently working. When you ask R to open a certain file, it will look in the working directory for this file, and when you tell R to save a data file or figure, it will save it in the working directory. Before you start working, please set your working directory to where all your data and script files are or should be stored. Type in the command window: setwd(“directoryname”). For example: # setwd("C:/Hydrology/R/") Make sure that the slashes are forward slashes and that you don’t forget the apostrophes (for the reason of the apostrophes, see section 10.1). R is case sensitive, so make sure you write capitals where necessary. Within RStudio you can also go to Tools / Set working directory. 2.5 Libraries R can do many statistical and data analyses. They are organized in so-called packages or libraries. With the standard installation, most common packages are installed. To get a list of all installed packages, go to the packages window or type library() in the console window. If the box in front of the package name is ticked, the package is loaded (activated) and can be used. # library() 3 There are many more packages available on the R website. If you want to install and use a package (for example, the package called “dplyr”) you should: • Install the package: click install packages in the packages window and type geometry or type install.packages(“dplyr”) in the command window. # install.packages("dplyr") • Load the package: check box in front of geometry or type library(“dplyr”) in the command window. ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union 3 Some first examples of R commands This is section is dedicated to basic R use and syntax. Please do read this section to understand simple R syntax and possible data formats and types. It is highly recommended to complete some of the swirl() exercises. 3.1 Calculator R can be used as a calculator. You can just type your equation in the command window after the “>” or editor: 10^2+ 36 ## [1] 136 and R will give the answer Practice Compute the difference between 2014 and the year you started at this university and divide this by the difference between 2014 and the year you were born. Multiply this with 100 to get the percentage of your life you have spent at this university. Use brackets if you need them. If you use brackets and forget to add the closing bracket, the “>” on the command line changes into a “+”. The “+” can also mean that R is still busy with some heavy computation. If you want R to quit what it was doing and give back the “>”, press ESC (see the reference list on the last page). # this is comment # use hastag to comment your code # write your code: 3.2 Workspace You can also give numbers a name.

Load more