Appendix: Introduction to R A
Total Page:16
File Type:pdf, Size:1020Kb
Appendix: Introduction to R A Background The open-source software R was developed as a free implementation of the language S which was designed as a language for statistical computation, statistical program- ming, and graphics. The main intention was to allow users to explore data in an easy and interactive way, supported by meaningful graphical representations. The statistical software R was originally created by Ross Ihaka and Robert Gentleman (University of Auckland, New Zealand). Installation and Basic Functionalities • The “base” R version, i.e. the software with its most relevant commands, can be downloaded from https://www.r-project.org/. After installing R, it is recom- mended to install an editor too. An editor allows the user to conveniently save and display R-code, submit this code to the R console (i.e. the R software itself), and control the settings and the output. A popular choice of editor is RStudio (free of charge) which can be downloaded from https://www.rstudio.com/ (see Fig. A.1 for a screenshot of the software). There are alternative good editors, for example “Tinn-R” (http://sourceforge.net/projects/tinn-r/). • A lot of additional user-written packages are available online and can be installed within the R console or using the R menu. Within the console, the install.packages("package to install") function can be used. Please note that an internet connection is required. © Springer International Publishing Switzerland 2016 297 C. Heumann et al., Introduction to Statistics and Data Analysis, DOI 10.1007/978-3-319-46162-5 298 Appendix A: Introduction to Fig. A.1 Screenshot of R Studio with the command window (top left), the output window (“console”, i.e. R itself, bottom left), and a plot window (right). Other windows (e.g. about the environment or package updates) are closed R Appendix A: Introduction to R 299 Statistics has a close relationship to algebra: data sets can be seen as matrices, and variables as vectors. R makes use of these structures and this is why we first intro- duce data structure functionalities before explaining some of the most relevant basic statistical commands. R as a Calculator, Basic Data Structures and Arithmetic Operations • The character # marks the beginning of a comment. All characters until the end of the line are ignored by R. We use # to comment on our R-code. • If we know the name of a command we would like to use, and we want to learn about its functionality, typing ?command in the R command line prompt displays a help page, e.g. ?mean displays a help page for the arithmetic mean function. • Using example(mean) shows application examples of the respective function. • The command c(1,2,3,4,5) combines the numbers 1, 2, 3, 4 and 5 into a vector (e.g. a variable). • Vectors can be assigned to an “object”. For example, X <- c(2,12,22,32) assigns a numeric vector of length 4 to the object X. In general, the arrow sign (<−) is a very important concept to store data, summaries, and outputs in objects (i.e. the name in the front of the <− sign). Note that R is case sensitive; i.e. X and x are two different variable names. Instead of “<−”, one can also use “=”. • Sequences of numbers can be created using the seq and rep commands. For example, seq(1,10) and rep(1,10) yield [1]12345678910 and [1]1111111111 respectively. 300 Appendix A: Introduction to R • Basic data structures are vectors, matrices, arrays, lists, and data frames. They can contain numeric values, logical values or even characters (strings). In the latter case, arithmetic operations are not allowed. – A numeric vector of length 5 can be constructed by the command x <- vector(mode="numeric", length=5) The elements can be accessed by squared brackets: []. For example, x[3] <- 4 assigns the value 4 to the third element of the object x. Logical vectors containing the values TRUE and FALSE are also possible. In arithmetic operations, TRUE corresponds to 1 and FALSE corresponds to 0 (which is the default). Consider the following example: x.log <- vector(mode="logical", length=4) x.log[1] = x.log[3] = x.log[4] = TRUE mean(x.log) returns as output 0.75 because the mean of (1, 0, 1, 1) =(TRUE, FALSE, TRUE, TRUE) is 0.75. – A matrix can be constructed by the matrix() command: x <- matrix(nrow=4, ncol=2, data=1:8, byrow=T) creates a 4 × 2 matrix, where the data values are the natural numbers 1, 2, 3,..., 8 which are stored row-wise in the matrix, [,1] [,2] [1,] 1 2 [2,] 3 4 [3,] 5 6 [4,] 7 8 because of the parameter byrow=T (which is equivalent to byrow=TRUE). The default is byrow=F which would store the data column-wise. – Arrays are more general data structures than vectors and matrices in the sense that they can have more than two dimensions. For instance, x <- array(data=1:12, dim=c(3,2,2)) creates a three-dimensional array with 3 · 2 · 2 = 12 elements. – A list can contain objects of different types. For example, a list element can be a vector or matrix. Lists can be initialized by the command list and can grow dynamically. It is important to understand that list elements should be accessed by the name of the entry via the dollar sign or using double brackets: Appendix A: Introduction to R 301 x <- list(one=c(1,2,3,4,5),two=c("Hello", "world", "!")) x $one [1]12345 $two [1] "Hello" "world" "!" x[[2]] [1] "Hello" "world" "!" x$one [1]12345 – A data frame is the standard data structure for storing a data set with rows as observations and columns as variables. Many statistical procedures in R (such as the lm function for linear models) expect a data frame. A data frame is conceptually not much different from a matrix and can either be initialized by reading a data set from an external file or by binding several column vectors. As an example, we consider three variables (age, favourite hobby, and favourite animal) each with five observations: age <- c(25,33,30,40,28) hobby <- c("Reading","Sports","Games","Reading","Games") animal <- c("Elephant", "Giraffe", NA, "Monkey", "Cat") dat <- data.frame(age,hobby,animal) names(dat) <- c("Age","Favourite.hobby","Favourite.animal") dat The resulting output is > dat Age Favourite.hobby Favourite.animal 1 25 Reading Elephant 2 33 Sports Giraffe 3 30 Games <NA> 4 40 Reading Monkey 5 28 Games Cat where NA denotes a missing value. With write.table or a specialized version thereof such as write.csv (for writing the data in a file using comma-separated fields), a data frame can be saved in a file. The command sequence write.csv(x=dat,file="toy.csv",row.names=FALSE) read.dat <- read.csv(file="toy.csv") read.dat 302 Appendix A: Introduction to R saves the data frame as an external (comma-separated) file and then loads the data again using read.csv. Individual elements of the data frame can be accessed using squared brackets, as for matrices. For example, dat[1,2] returns the first observation of the second variable column for the data set dat. Individual columns (variables) can also be selected using the $ sign: dat$Age returns the age column: [1] 25 33 30 40 28 • The factor command is very useful to store nominal variables, and the command ordered is ideal for ordinal variables. Both commands are extremely important since factor variables with more than two categories are automatically expanded into several columns of dummy variables if necessary, e.g. if they are included as covariates in a linear model. In the previous paragraph, two factor variables have already been created. This can be confirmed by typing is.factor(dat$Favourite.hobby) is.factor(dat$Favourite.animal) which return the value TRUE. Have a look at the following two factor variables: sex <- factor("female","male","male","female","female") grade <- ordered(c("low", "medium", "low", "high", "high"), levels=c("low", "medium","high")) Please note that by default alphabetical order is used to order the categories (e.g. female is coded as 1 and male as 2). However, the mapping of integers to strings can be controlled by the user as seen for the variable “grade”: grade returns [1] low medium low high high Levels: low < medium < high Appendix A: Introduction to R 303 • Basic arithmetic operations can be applied directly to a numeric vector. Basic operations are addition +, subtraction −, multiplication ∗ and division /, inte- ger division %/%, modulo operation %%, and exponentiation with two possible notations: ∗∗ or ˆ. Examples are given as: 2^3 # command [1] 8 # output 2**3 # command [1] 8 # output 2^0.5 # command [1] 1.414214 # output c(2,3,5,7)^2 # command: application to a vector [1] 4 9 25 49 # output c(2,3,5,7)^c(2,3) # command: !!! ATTENTION! [1] 4 27 25 343 # output c(1,2,3,4,5,6)^c(2,3,4) # command [1] 1 8 81 16 125 1296 #output c(2,3,5,7)^c(2,3,4) # command: !!! WARNING MESSAGE! [1] 4 27 625 49 Warning message: longer object length is not a multiple of shorter object length in: c(2, 3, 5, 7)^c(2, 3, 4) The last four commands show the “recycling property” of R. It tries to match the vectors with respect to the length if possible. In fact, c(2,3,5,7)^c(2,3) is expanded to c(2,3,5,7)^c(2,3,2,3) The last example shows that R gives a warning if the length of the shorter vector cannot be expanded to the length of the longer vector by a simple multiplication with a natural number (2, 3, 4,...).