Introduction to R 1 ,2 , Part I
Total Page:16
File Type:pdf, Size:1020Kb
1 General Ecology Introduction to R 2 1 Introduction to R1,2 , Part I 2 3We are going to learn some of the basics for R, a language for statistical computing and graphics. 4It is available for free from http://www.r-project.org/. Hopefully you have already downloaded 5it and installed it on your computer. Today we will practice some basic data manipulation 6However, R has a huge number of different capabilities; and we will only be able to cover a few 7of them in class. I would strongly encourage all of you to go over the Intro to R documentation 8found at the R project web site 9(http://cran.r-project.org/doc/manuals/R-intro.html, see also 10http://cran.r-project.org/doc/contrib/Short-refcard.pdf). 11 . 12 13 Numerical Vectors 14 15 The most basic type of data structure in R is a vector. Vectors are a series of elements of a 16single data type – numbers, characters, logical statements, etc. - contained in one named object. 17We are going to start by creating and manipulating vectors of numbers. 18 The first thing we want to do is use the c(concatenate) function to create a short numerical 19vector. The c function combines multiple elements into a single vector in the order in which they 20appear. For example type: 21 22 Monkeys = c(1,20,9) 23 24 This will create a vector named Monkeys containing these three numbers. Commands that do 25not include an assignment will print the output of the command to the screen. (Sometimes you 26will see "<-" rather than "=", but they are equivalent for assigning values to objects.) For 27example, to see the vector that you just created type “Monkeys” and hit enter. Make sure you 28don’t use all lower case this time, “monkeys”; R is case sensitive. Typing “c(8,5,74)” without 29assigning it to a variable will print that set of elements to the screen. 30 There are several ways to create vectors of regular sequences. The rep function creates a 31vector by repeating a given value a given number of times. Type: 32 33 are = rep(2,6) 34 35 This creates a vector named are with the number 2 repeated 6 times. Type “are” to see what 36this vector looks like. You can also use rep to repeat more complex command structures, such as 37other vectors. Type: 38 39 lotsa = rep(Monkeys,3) 40 41 Look at the vector you created by typing “lotsa”. The Monkeys vector has been repeated 42three times. 43 Colons are extremely useful as they produce a vector of consecutive integers over some 44range that you define. Type:
3 1 Adapted from lab by NM Hallinan, UC Berkeley, http://ib.berkeley.edu/courses/ib200b/Labs/4 Intro to 4R/Introduction to R Lab.pdf
5 Page 1 45 46 fun = 3:7 47 48 That should have made a vector containing all the integers from 3 to 7. 49 Finally you can use the c command to combine vectors and numbers into a single vector. 50Let’s replace the Monkeys vector with a new vector containing all of our vectors and the number 5134. 52 53 Monkeys = c(Monkeys,are,lotsa,fun,34) 54 55 Does that look right? Cool. 56 To find out how many elements your vector has type: 57 58 length(Monkeys) 59 60 61 Vector Indexing 62 63 Often you do not want to return an entire vector but instead only a few elements of a vector. 64In that case you must type the name of the vector followed by the index of the appropriate 65elements surrounded by brackets. For example type; 66 67 Monkeys[4] 68 69 This will return the fourth element of the vector Monkeys. It is also possible to return not just 70a single element of the vector but several elements by using a vector of integers referring to the 71index of the various elements. For example: 72 73 Monkeys[c(7,10,14,21)] 74 75 Will return the 7th, 10th, 14th and 21st elements of vector Monkeys. 76 For this type of indexing it is often very useful to use vectors created using the colon. 77 78 CertainMonkeys = Monkeys[5:17] 79 80 Will create a vector than containing the 5th to the 17th elements of vector Monkeys. 81 82 An alternative way to index is to use a logical vector as the index. The logical vector should 83be the same length as the vector being indexed. However, instead of containing numeric values 84this index contains logical values, in other words TRUE or FALSE based on some condition. If 85this logical vector is used as an index then R will return only the elements for which the logical 86vector is TRUE. 87 Let's assume the values in Monkeys are body masses for different species of monkey. Now 88let’s create a logical vector. Type: 89 90
6 Page 2 91 ItsASmallMonkey = Monkeys<5 92 93 This will create a logical vector, ItsASmallMonkey, the same length as Monkeys, which 94contains a TRUE if that element of Monkeys is less than 5 and a FALSE otherwise. Type 95“ItsASmallMonkey” to view this vector. The logical operators that R recognizes are <, <=, >, >=, 96== for exact equality and != for inequality. You can also use & (for “and”) and | (for “or”) to 97create more complex conditions. 98 Now let’s use that vector to index Monkeys. Type: 99 100 Monkeys[ItsASmallMonkey] 101 102 This will return a vector containing only the values of Monkeys that are less than 5. You 103could bypass the creation of ItsASmallMonkey by just typing: 104 105 Monkeys[Monkeys<5] 106 107 One last thing I want to point out about indexing is that it can be used not just to return 108certain elements of a vector, but also to modify only certain elements. For example type: 109 110 Monkeys[Monkeys<5] = 10 111 112 Now type “Monkeys” to see what you’ve created. You have identified all the elements of 113Monkeys that are less than 5 and changed them to 10. 114 115 116 Vector Arithmetic 117 118 Now let’s use these vectors to do some math. There are two basic principles for doing vector 119arithmetic. You can either use an equation containing a vector and a constant, or you can use an 120equation containing two vectors of the same length. 121 If you use a vector and a constant then you will apply that equation to every element of the 122vector. For example type: 123 124 foo = Monkeys^2 125 126 foo will be a vector that has all the elements of Monkeys squared. Similarly Monkeys+3 127would return a vector with all the elements of Monkeys having 3 added to them, etc. 128 There are some functions in R which take a single number as their argument and return a 129single number. For example log returns the natural logarithm of a number and exp returns e 130raised to the argument. If a vector is used as the argument for one of these functions, then the 131function will work just like the above example, such that the function will be applied to each 132element of the vector separately. To see how this works type: 133 134 log(Monkeys) 135
7 Page 3 136 If two vectors of the same length are included in an equation, then the equation will be 137applied element by element to both vectors. That means the equation x = y+z, indicates that 138x[1]=y[1]+z[1], x[2]=y[2]+z[2], x[3]=y[3]+z[3], etc. To see how this works type: 139 140 hope = foo+Monkeys 141 142 Finally, you can create an equation combining multiple vectors, constants and functions, such 143as: 144 145 3/Monkeys+foo^(log(hope)+1) 146 147 Keep in mind the order of operations whenever using a complex equation. You can always 148use parentheses, if you aren’t confident that you have it right. 149 It is often useful to create a vector of evenly ordered numbers that are separated by more than 1501. In this case it is a good idea to use the colon, but keep in mind that the first operation is to 151create the vector defined by the colon, and then to apply the other operations to that vector. Thus: 152 153 1:3*5+2 154 155 will return the vector c(7,12,17), not 1:17. To return that second vector you should use: 156 157 1:(3*5+2) 158 159 160 Arrays 161 An array is a more complex data structure than a vector, in that it contains multiple 162dimensions of elements. In other words a vector is a one dimensional array. While a vector is 163just a list of values, in an array those values are arranged into rows and columns, or possibly into 164even more dimensions. The function dim returns a vector in which the elements are the size of 165each dimension of the array. Similarly the dim function can be used to transform the dimensions 166of an array or vector. Let’s turn Monkeys into a 6 by 4 array. To do this type 167 168 dim(Monkeys) = c(6,4) 169 170 Did that work? The product of the dimensions must equal the number of elements in the 171object. If you got an error, then you probably didn’t follow all my instructions before. To correct 172this type length(Monkeys). If Monkeys does not have 24 elements, add or subtract elements, so 173that it does. 174 Type Monkeys to see what it looks like. As you can see the values of Monkeys have been 175rearranged into an array with 6 rows and 4 columns (rows always appear before columns when 176assigning dimensions). In order to make this transformation R fills in the first column using the 177first 6 elements, the second using the next six, etc. 178 179 180 181
8 Page 4 182 Array Indexing 183 184 Array indexing works very much like vector indexing, except that commas are now used to 185separate the indexes for the different dimensions of the array. The first number refers to the row, 186the second number refers to the column and any subsequent numbers refer to the higher order 187dimensions. For example type: 188 189 Monkeys[5,2] 190 191 This should return the element in the 5th row and the the 2nd column. Just like for vector 192indexing, a vector can be used to refer to multiple elements. Type: 193 194 Monkeys[2:5,c(2,4)] 195 196 Typing this command will return all the elements from rows 2 through 5 in columns 2 and 4. 197Furthermore leaving one of the dimensions blank will refer to every element in that dimension. 198Type: 199 200 Monkeys[,2] 201 202 This command will return all the elements in column 2, no matter what row they are found 203in. 204 A logical array can also be used just like a logical vector to index an array of the same 205dimensions. However, this command will return a vector, not an array, as there is no guarantee 206that the output will have the same number of elements as the original array. 207 208 209 Dataframes 210 211 One of the powerful aspects of R is its ability to handle many different types of data. For 212example, we might have one vector of species names like this: 213 214 species = c('capuchin','howler','colobus','spider','gibbon','vervet','macaque') 215 216 These data are of the type 'character', as can be seen by typing class(species). In contrast, the 217data in Monkeys are 'numeric', as can be seen by typing class(Monkeys). But these two types of 218data can be combined in a single unit called a dataframe. Type: 219 220 data.frame(species , Monkeys[1:7]) 221 222 You can see we get a table with column headings "species" and "V2" (for variable 2). I 223specified only the first seven elements of Monkeys since there are only seven elements in species. 224Clearly this is an artificial way to create a dataset. We can provide a more useful second column 225heading and assign it all to an object called monkeys.df by typing: 226 227 monkeys.df = data.frame(species, mass = Monkeys[1:7])
9 Page 5 228 229 Now let's say we also have data on lifespan for these species. 230 231 lifespan = c(15,3,9,12,26,6,12) 232 233 This can be combined with our existing dataframe by typing: 234 235 monkeys.df = cbind(monkeys.df, lifespan) 236 237 Here, the command cbind tacks on (or "binds") columns together in the order that they are 238specified. There is a similar function for binding together rows called rbind. 239 240 So, now we have a nice fake dataset. Type monkeys.df to have a look. Each column has a 241column name, and unlike in arrays the columns do not necessarily have to be all of the same data 242type. We can refer to any one column within a dataframe by using the $ between the dataframe 243name and the column name like so: 244 245 monkeys.df$mass 246 monkeys.df$lifespan 247 248 249 Plotting 250 251 R is a powerful tool for creating graphs and images, but the variety of commands can be 252somewhat daunting. Let's just start with some basics. Type: 253 254 ?plot 255 256 to get the help page for the plotting command. You can see that at the very least, the plot 257command expects you to give it x and y variables to plot, and that there are a whole host of other 258possible arguments for refining the plot. 259 260 plot(monkeys.df$mass, monkeys.df$lifespan) 261 262 We could add custom axis labels, a title, and change the color by specifying a few more 263arguments: 264 265 plot(monkeys.df$mass, monkeys.df$lifespan, xlab="Mass (kg)", ylab = "Lifespan 266 (yrs)", main = "Lifespan versus Body Mass", col = "red") 267 268 For a more complete list of possible parameters that could help modify a basic plot, type "? 269par". 270 271 272 273
10 Page 6 274 R 275 Programming for Statistical Analysis and Graphics, Part II 276 277 278Introductory R Lab 279 280 1. Objectives 281 a. Learn basics of data entry 282 b. Learn simple data manipulation 283 c. Learn simple statistical tests 284 i. Independent Samples t-test 285 ii. Pearson’s Correlation 286 iii. Chi-square goodness of fit 287 iv. Chi-square test of independence 288 2. Data Entry and Creating Data 289 a. Creating Vectors 290 i. Entering Data 291 1. By hand 292 a. x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) 293 i. x = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 294 2. importing a datasheet 295 a. read.table(C:\file.csv) or read.table(C:\file.txt) 296 b. yourdata = read.csv(‘C:\My Documents\file.txt’, header=T, 297 sep=”,”) 298 ii. Creating Data 299 1. Creating a sequence 300 a. x = (1:10) 301 i. x = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 302 2. Creating random numbers 303 a. runif(n) = n random numbers from a uniform distribution 304 b. rnorm(n, mean=c, sd=x) = n random numbers from a normal 305 distribution with a mean of c and a standard deviation (sd) of x. 306 3. Making new from old 307 a. Mathematical Functions performed on entire objects (here 308 vectors) 309 i. log(x) = log base e of x; y = log(x) 310 ii. exp(x) = antilog of x (ex); y = exp(x) 311 iii. log(x,n) = log base n of x; y = log(x,n) 312 iv. log10(x) =log base 10 of x; y = log10(x) 313 v. sqrt(x) = square root of x; y = sqrt(x) 314 iii. Missing data values 315 1. rm.na=T 316 3. Simple data manipulation 317 a. Summary statistics/Vector Functions 318 i. min(x) 319 ii. max(x) 320 iii. range(x)
11 Page 7 321 iv. median(x) 322 v. mean(x) 323 1. Removing missing data 324 a. Find missing data values 325 i. which(is.na(x)) 326 b. Replace missing data values with zero 327 i. Y = ifelse(is.na(x),0,x) 328 c. Remove missing data value to perform function 329 d. mean(x, na.rm=T) 330 vi. var(x) = variance of x 331 vii. sd(x) = standard deviation of x 332 viii. sum(x) 333 ix. length(x) = the number of values in the object x (here the object is a vector) 334 4. Simple Statistical Tests 335 a. Independent samples t-test 336 i. Example 337 1. Create a practice data frame 338 x = (1:10) 339 y = (6:15) 340 data=data.frame(x,y) 341 data 342 data2=stack(data) 343 data2 344 mass = data2[,1] 345 species = data2[,2] 346 data3=data.frame(mass, species) 347 data3 348 2. Perform a t-test 349 #t.test(formula = dependent variable ~ independent variable) 350 t.test(formula = mass~species) 351 rm(list=ls()) 352 *Clear R: >rm(list=ls()) 353 b. Correlation 354 i. Example 355 1. Create a practice data frame 356 frogs = c(8, 6, 7, 5, 3, 10, 9) 357 tadpoles = c(1.87, 1.5, 1.7, 1.25, 0.4, 2.14, 2.01) 358 frogs 359 tadpoles 360 cor.test(frogs,tadpoles) 361 rm(list=ls()) 362 c. Chi-square Goodness of Fit 363 i. Example 364 count<-c(152, 39, 53, 6) 365 type<-gl(4,1,4, labels=c("YellowSmooth", "YellowWrinkled", 366 "GreenSmooth", "GreenWrinkled")) 367 ##*’gl’ is the command to generate levels. It takes the form: gl(number of 368 ##levels, repeats, total length, labels =”x”, “y”, “etc”)
12 Page 8 369 ##Create a data frame 370 seeds<-data.frame(type, count) 371 seeds 372 ##Create a table that puts the data in a form that is appropriate for a Chi- 373 square 374 seeds.xtab <- xtabs(count~type, seeds) 375 ##View the table 376 seeds.xtab 377 ##Check that no expected values are less than 5 378 chisq.test(seeds.xtab, p = c(9/16, 3/16, 3/16, 1/16), correct = F)$exp 379 ##No expected values are less than 5; proceed with Chi-squared Goodness 380 of Fit Test 381 chisq.test(seeds.xtab, p = c(9/16, 3/16, 3/16, 1/16), correct = F) 382 ##What does your p<0.05 mean? It means that you reject the null 383 hypothesis that your 384 ##data came from a population with a ratio of 9:3:3:1 385 rm(list=ls()) 386 387 d. Chi-square Test of Independence 388 i. Example 389 birds<-c(11, 4, 5, 12) 390 birds 391 location=gl(2, 2, 4, labels=c("Tree", "Ground")) 392 ##The above line could also read: 393 ##location = c(“Tree”, “Tree”, “Ground”, “Ground”) 394 location 395 sex=c("Males", "Females", "Males", "Females") 396 sex 397 twoway<-data.frame(birds, location, sex) 398 twoway 399 twoway2<-tapply(birds, list(sex, location), sum) 400 twoway2 401 ##check that there are no expected values less than 5 402 chisq.test(twoway2, corr=F)$exp 403 ##perform the Chi-square test 404 chisq.test(twoway2, corr=F) 405 rm(list=ls())
13 Page 9