Literate Data Analysis Using the R Environment for Statistical Computing and the Knitr Package
Technical Note: Literate data analysis using the R environment for statistical computing and the knitr package

D G Rossiter

December 26, 2012

Version 1.0. Copyright © 2012 D G Rossiter. All rights reserved. Reproduction and dissemination of the work as a whole (not parts) freely permitted if this original copyright notice is included. Sale or placement on a web site where payment must be made to access this document is strictly prohibited. To adapt or translate please contact the author (http://www.itc.nl/personal/rossiter).

Contents

1 Overview
2 Tutorial
  2.1 First version
  2.2 Second version: adding graphics
  2.3 Third version: in-line calculations
  2.4 Writing an R source code file
3 Details
  3.1 Multiple graphics on one line
  3.2 Putting graphics in the figure environment
  3.3 Production graphics
  3.4 R code formatting and comments
  3.5 Hiding code from the reader
  3.6 Showing code without executing it
  3.7 Hiding output from the reader
  3.8 Formatting R code and output
4 Learning to use the tools
  4.1 knitr
  4.2 LaTeX
  4.3 R
  4.4 Emacs
References
Index of R concepts
A Input NoWeb files
  A.1 First version of NoWeb source
  A.2 Second version of NoWeb source
  A.3 Third version of NoWeb source
B Intermediate files
  B.1 First version of LaTeX file
C Output files
  C.1 PDF
  C.2 R source code
1 Overview

In 1992 Donald Knuth published a book with the title "Literate Programming" [3], showing the advantages of, and techniques for, writing computer programs to be read and understood by humans, as well as executed by a digital computer. This technical note advocates the same approach for data analysis: the computer code (here, in the R environment) and the output from executing it are an integral part of a document that explains what the analyst did, why, and what was discovered. This is part of reproducible research [2,6,9]:

"By reproducible research, we mean research papers with accompanying software tools that allow the reader to directly reproduce the results and employ the computational methods that are presented in the research paper."
– Gentleman and Lang [1]; see also the CRAN Task View "Reproducible Research" (footnote 1)

The advantages of this approach are several:

1. Every processing step is transparent, since the R code is shown in the document;
2. Therefore, anyone can repeat the analysis if they are given access to the same data;
3. The analysis can easily be expanded or adapted;
4. If the data sources are edited, the entire analysis can be re-run and the results updated without any editing;
5. The analyst's explanations (motivation, justification, choice of methods, interpretations, ...) can be presented along with the results of the analysis;
6. The results of the computer processing are generated with the document, so they are by definition synchronized;
7. Figures are generated from code and are part of the output.

The approach presented here is only one element of fully reproducible research; this also requires that the original data and full details of its acquisition and manipulation be presented; for details see Mesirov [6].
Footnote 1: http://cran.r-project.org/web/views/ReproducibleResearch.html

The tools we use are:

Data processing: The R environment for statistical computing (http://www.r-project.org/) [7];

Literate programming: The knitr ("knit R") R package (http://yihui.name/knitr/) [10], which processes the literate programming source "NoWeb" file to produce both a LaTeX document and R code;

Text processing: LaTeX (http://www.latex-project.org/) [5], a document preparation system, to produce the final PDF document;

Text and code editor: To prepare the literate programming source. There are several good choices:

- Emacs (http://www.gnu.org/software/emacs/), with the AUCTeX extension for working with LaTeX documents and the ESS ("Emacs Speaks Statistics", http://ess.r-project.org/) extension for running R under Emacs. Learning Emacs is an investment in a lifetime of programming productivity, but not an overnight business.
- RStudio (http://rstudio.org/), an attractive GUI for R and NoWeb source files, with a reasonable code editor and built-in help on R commands.
- Microsoft Windows only: WinEdt (http://www.winedt.com/) and the R-WinEdt R package to communicate with it; another option is Tinn-R (http://www.sciviews.org/Tinn-R/).
- Any plain-text editor such as Notepad.

The flow is as follows:

1. In a text editor, you create a source document with extension .Rnw (a so-called "NoWeb" file; see http://www.cs.tufts.edu/~nr/noweb/); this source document includes LaTeX markup, your own text, and "chunks" of executable R code, using the NoWeb syntax (explained below) to show which parts of the source are executable code.
2. You run this NoWeb source through R with the R function knit of the knitr package; this produces a LaTeX file (extension .tex) which includes your original LaTeX markup and text, together with the output from R (which may include graphics).
3. You process the LaTeX file with LaTeX to produce a PDF document.
4. Optional: You run the NoWeb source through R with the R function purl, also of the knitr package, to produce an R source code file with the same name and extension .R; this can be executed in an R session with the R function source.

As you create your source document, you can also execute lines or chunks of code in the R console to see their effect. From some text editors (Emacs + ESS, RStudio) you can directly send lines or chunks of code from the NoWeb source to a linked R console; otherwise you have to work in the two environments separately. Thus you have an interactive data analysis as you work, but write it up in a document to be read by others.

Note: The term "knit" is a wordplay on the original "Weave" from Knuth (think "knitter", one who knits), who used that term as a reference to a poem by Sir Walter Scott: "Oh, what a tangled web we weave when first we practise to deceive" (Marmion, VI:17). Knuth's original literate programming system was called WEB, so he decided to use "Weave" for the process of making the readable document and "Tangle" for the process of making the executable code. The author of the knitr package uses the term "purl" for the latter; this is a type of knitting stitch. So now you know.

Task 1: Set up your computing environment: text editor, R environment, LaTeX, and a PDF viewer. Within the R environment, install the knitr package and its dependencies from CRAN (http://cran.r-project.org/web/packages/knitr/index.html). •

To install the package from the R prompt:

    > install.packages("knitr", dependencies=TRUE)

We now give a tutorial example (§2), and then get into some of the details and complications (§3).

2 Tutorial

We will do a small literate data analysis on one of R's example datasets, trees:

1. Examine the dataset structure;
2. Summarize the variables;
3. Graph the relation between them;
4. Build a linear model to predict tree volume from tree girth and height.
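As a rough preview, the four steps could look like this in plain R, before any NoWeb markup is added. This is only a sketch: the choice of a scatterplot matrix for step 3 and of an additive model for step 4 are assumptions made for illustration, and the tutorial's own code may differ.

```r
# Sketch of the four analysis steps on R's built-in 'trees' dataset.
data(trees)

# 1. Examine the dataset structure (variables Girth, Height, Volume)
str(trees)

# 2. Summarize the variables
summary(trees)

# 3. Graph the relation between the variables
#    (a scatterplot matrix is an assumed choice for this sketch)
pairs(trees)

# 4. Linear model predicting volume from girth and height
#    (the additive form is an assumption, not necessarily the note's model)
m <- lm(Volume ~ Girth + Height, data = trees)
summary(m)
```

Each of these calls can later be placed in a code chunk of the NoWeb source, so that the code and its output appear in the finished document.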
All of this is accompanied by our commentary; this is where we explain ("literately" we hope) what we are doing, why, and what conclusions we draw.

2.1 First version

Task 2: Create a NoWeb source file named test1.Rnw, open it in the text editor, and set up the LaTeX document. •

Note: The .Rnw extension is used for NoWeb source files containing R code.

This is the usual document skeleton, naming the document class, loading packages, defining macros, etc. A minimal skeleton is:

    \documentclass[11pt]{article}
    \begin{document}
    % LaTeX macros and text go here
    \end{document}

There is usually a title, author, and date:

    \documentclass[11pt]{article}
    \title{Modelling tree volume}
    \author{D.\ W.\ Luo}
    \date{\today}
    \begin{document}
    \maketitle
    % LaTeX macros and text go here
    \end{document}

Task 3: Write the introductory text in the document section of the NoWeb source file (i.e., within the document environment). •

This should be your description (to your reader) of the purpose of this data analysis. It can be any valid LaTeX source. Here is my text:

    Here we use the \texttt{trees} dataset supplied with R to
    illustrate a simple data analysis:
    \begin{enumerate}
    \item describing the variables and cases;
    \item investigating the inter-relation between variables; and
    \item modelling tree volume as a function of tree height and/or tree girth.
    \end{enumerate}

Task 4: Write the code and commentary to load the example dataset. •

For this first example you just need to know one thing about NoWeb syntax: a code chunk is written between <<>>= and @; these must be the only text on their respective lines of NoWeb source. Anything between these is considered R code and will be formatted, executed, and the output written to the LaTeX source file.

    <<>>=
    # R code here
    @

Anything not in a code chunk is regular LaTeX source; this is where you write comments and explanations.
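To make the delimiter rule concrete, the following sketch locates a chunk in the lines of a small NoWeb source held in an R character vector. It only mirrors the rule stated above and is not how knitr itself parses files; the sample lines and the variable names are hypothetical.

```r
# Illustrative only: find the code chunk in the lines of a NoWeb source.
rnw_lines <- c(
  "\\documentclass[11pt]{article}",
  "\\begin{document}",
  "Text outside a chunk is ordinary LaTeX source.",
  "<<>>=",
  "x <- c(1, 2, 3)",
  "mean(x)",
  "@",
  "\\end{document}"
)

# A delimiter counts only if it is the sole text on its line
chunk_open  <- grep("^<<.*>>=$", rnw_lines)
chunk_close <- grep("^@$", rnw_lines)

# Lines strictly between the two delimiters are the R code of the chunk
chunk_code <- rnw_lines[(chunk_open + 1):(chunk_close - 1)]
print(chunk_code)
```

Everything outside the located range, including the document skeleton, is passed through as LaTeX.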
My code and commentary are shown in §A.1; I added the following:

    First, load the dataset, examine its structure, and summarize the
    variables:
    \par
    <<>>=
    data(trees)
    str(trees)
    summary(trees)
    @

Task 5: Run this source file through the knit function within R to create a LaTeX source file with the same name but extension .tex.
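A minimal sketch of this step, together with the optional purl step from the overview, is given here. It assumes the knitr package is installed. To keep the example self-contained it writes a small stand-in for test1.Rnw to a temporary directory; the file contents and the variable names are hypothetical.

```r
# Sketch of the knit and purl steps, assuming knitr is installed.
library(knitr)

# A tiny stand-in for test1.Rnw, written to a temporary directory
work_dir <- tempdir()
rnw_file <- file.path(work_dir, "test1.Rnw")
writeLines(c(
  "\\documentclass[11pt]{article}",
  "\\begin{document}",
  "First, load the dataset and summarize the variables:",
  "<<>>=",
  "data(trees)",
  "summary(trees)",
  "@",
  "\\end{document}"
), rnw_file)

# knit(): NoWeb source to a LaTeX file holding the text, the code and its output
tex_file <- knit(rnw_file, output = file.path(work_dir, "test1.tex"), quiet = TRUE)

# purl(): NoWeb source to a plain R script holding only the code chunks
r_file <- purl(rnw_file, output = file.path(work_dir, "test1.R"), quiet = TRUE)
```

With the real test1.Rnw in the R working directory, the call reduces to knit("test1.Rnw"); the resulting test1.tex is then processed with LaTeX to give the PDF.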