Introduction to the R Project for Statistical Computing for Use at ITC
D G Rossiter
University of Twente
Faculty of Geo-information Science & Earth Observation (ITC)
Enschede (NL)
http://www.itc.nl/personal/rossiter
August 14, 2012

[Cover figures: "Actual vs. modelled straw yields" (Actual vs. Modelled); grain yield, lbs per plot, by column number; "Frequency histogram, Meuse lead concentration" (Frequency vs. lead concentration, mg kg−1; counts shown above bar, actual values shown with rug plot); "GLS 2nd-order trend surface, subsoil clay %" (E and N coordinates)]

Version 4.0. Copyright © D G Rossiter 2003 – 2012. All rights reserved. Non-commercial reproduction and dissemination of the work as a whole freely permitted if this original copyright notice is included. To adapt or translate please contact the author.

Contents

0 If you are impatient ...
1 What is R?
2 Why R for ITC?
  2.1 Advantages
  2.2 Disadvantages
  2.3 Alternatives
    2.3.1 S-PLUS
    2.3.2 Statistical packages
    2.3.3 Special-purpose statistical programs
    2.3.4 Spreadsheets
    2.3.5 Applied mathematics programs
3 Using R
  3.1 R console GUI
    3.1.1 On your own Windows computer
    3.1.2 On the ITC network
    3.1.3 Running the R console GUI
    3.1.4 Setting up a workspace in Windows
    3.1.5 Saving your analysis steps
    3.1.6 Saving your graphs
  3.2 Working with the R command line
    3.2.1 The command prompt
    3.2.2 On-line help in R
  3.3 The RStudio development environment
  3.4 The Tinn-R code editor
  3.5 Writing and running scripts
  3.6 The Rcmdr GUI
  3.7 Loading optional packages
  3.8 Sample datasets
4 The S language
  4.1 Command-line calculator and mathematical operators
  4.2 Creating new objects: the assignment operator
  4.3 Methods and their arguments
  4.4 Vectorized operations and re-cycling
  4.5 Vector and list data structures
  4.6 Arrays and matrices
  4.7 Data frames
  4.8 Factors
  4.9 Selecting subsets
    4.9.1 Simultaneous operations on subsets
  4.10 Rearranging data
  4.11 Random numbers and simulation
  4.12 Character strings
  4.13 Objects and classes
    4.13.1 The S3 and S4 class systems
  4.14 Descriptive statistics
  4.15 Classification tables
  4.16 Sets
  4.17 Statistical models in S
    4.17.1 Models with categorical predictors
    4.17.2 Analysis of Variance (ANOVA)
  4.18 Model output
    4.18.1 Model diagnostics
    4.18.2 Model-based prediction
  4.19 Advanced statistical modelling
  4.20 Missing values
  4.21 Control structures and looping
  4.22 User-defined functions
  4.23 Computing on the language
5 R graphics
  5.1 Base graphics
    5.1.1 Mathematical notation in base graphics
    5.1.2 Returning results from graphics methods
    5.1.3 Types of base graphics plots
    5.1.4 Interacting with base graphics plots
  5.2 Trellis graphics
    5.2.1 Univariate plots
    5.2.2 Bivariate plots
    5.2.3 Trivariate plots
    5.2.4 Panel functions
    5.2.5 Types of Trellis graphics plots
    5.2.6 Adjusting Trellis graphics parameters
  5.3 Multiple graphics windows
    5.3.1 Switching between windows
  5.4 Multiple graphs in the same window
    5.4.1 Base graphics
    5.4.2 Trellis graphics
  5.5 Colours
6 Preparing your own data for R
  6.1 Preparing data directly in R
  6.2 A GUI data editor
  6.3 Importing data from a CSV file
  6.4 Importing images
7 Exporting from R
8 Reproducible data analysis
  8.1 The NoWeb document
  8.2 The LaTeX document
  8.3 The PDF document
  8.4 Graphics in Sweave
9 Learning R
  9.1 Task views
  9.2 R tutorials and introductions
  9.3 Textbooks using R
  9.4 Technical notes using R
  9.5 Web Pages to learn R
  9.6 Keeping up with developments in R
10 Frequently-asked questions
  10.1 Help! I got an error, what did I do wrong?
  10.2 Why didn’t my command(s) do what I expected?
  10.3 How do I find the method to do what I want?
  10.4 Memory problems
  10.5 What version of R am I running?
  10.6 What statistical procedure should I use?
A Obtaining your own copy of R
  A.1 Installing new packages
  A.2 Customizing your installation
  A.3 R in different human languages
B An example script
C An example function
References
Index of R concepts

List of Figures

1 The RStudio screen
2 The Tinn-R screen
3 The R Commander screen
4 Regression diagnostic plots
5 Finding the closest point
6 Default scatterplot
7 Plotting symbols
8 Custom scatterplot
9 Scatterplot with math symbols, legend and model lines
10 Some interesting base graphics plots
11 Trellis density plots
12 Trellis scatter plots
13 Trellis trivariate plots
14 Trellis scatter plot with some added elements
15 Available colours
16 Example of a colour ramp
17 R graphical data editor
18 Example PDF produced by Sweave and LaTeX
19 Results of an RSeek search
20 Results of an R site search
21 Visualising the variability of small random samples

List of Tables

1 Methods for adding to an existing base graphics plot
2 Base graphics plot types
3 Trellis graphics plot types
4 Packages in the base R distribution for Windows

0 If you are impatient ...

1. Install R and RStudio on your MS-Windows, Mac OS X or Linux system (§A);
2. Run RStudio; this will automatically start R within it;
3. Follow one of the tutorials (§9.2) such as my “Using the R Environment for Statistical Computing: An example with the Mercer & Hall wheat yield dataset”1 [48];
4. Experiment!
5. Use this document as a reference.

1 What is R?

R is an open-source environment for statistical computing and visualisation. It is based on the S language developed at Bell Laboratories in the 1980s [20], and is the product of an active movement among statisticians for a powerful, programmable, portable, and open computing environment, applicable to the most complex and sophisticated problems, as well as “routine” analysis, without any restrictions on access or use.
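To give a flavour of what "statistical computing and visualisation" looks like in practice, here is a minimal sketch of a first R session; it could also serve as a starting point for the "Experiment!" step of §0. It uses only functions and a sample dataset from the base R distribution. The choice of the `women` dataset and of a simple linear model is an assumption made for illustration; it is not an example taken from this document.

```r
# A first interactive session, using only base R.
# Assumption: the 'women' dataset (height, weight) from the standard
# 'datasets' package is used purely as an illustration.

R.version.string                 # which version of R is running

data(women)                      # load a built-in sample dataset
str(women)                       # structure: a data frame, 15 obs. of 2 variables
summary(women)                   # descriptive statistics for each variable

# Fit a simple linear model: weight as a function of height
m <- lm(weight ~ height, data = women)
coef(m)                          # intercept and slope
summary(m)$r.squared             # proportion of variance explained

# Visualise the data and add the fitted regression line
plot(weight ~ height, data = women,
     main = "Weight vs. height, 'women' dataset")
abline(m, col = "red")
```

Each line can be typed at the R console prompt, or the whole block can be run as a script from RStudio; any other sample dataset listed by `data()` could be substituted.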
Here is a description from the R Project home page:2 “R is an integrated suite of software facilities