John W. Emerson, Department of Statistics, Yale University © 2013 1

STAT 625: Statistical Case Studies

John W. Emerson

Yale University

Abstract

This term, I’ll generally present brief class notes and scripts, but much of the lecture material will simply be presented in class and/or available online. However, student participation is a large part of this course, and at any time I may call upon any of you to share some of your work. Be prepared. This won’t be a class where you sit and absorb material. I strongly encourage you to work with other students; you can learn a lot from each other. To get anything out of this course you’ll need to put in the time for the sake of learning and strengthening your skills, not for the sake of jumping through the proverbial hoop.

Lesson 1: Learning how to communicate. This document was created using LaTeX and Sweave (which is really an R package). There are some huge advantages to a workflow like this, relating to a topic called reproducible research. Here’s my thought: pay your dues now, reap the benefits later. Don’t trust me on everything, but trust me on this. A sister document was created using R Markdown and the markdown package, and this is also acceptable. Microsoft Office is not.

So, in today’s class I’ll show you how I produced these documents and I’ll give you the required files. I recommend you figure out how to use LaTeX on your own computer. You already have Sweave, because it comes with R. Use of markdown is even easier (it’s a simple installation from CRAN), though it doesn’t easily produce nice PDF files; it’s better for publishing to the web. Both are integrated nicely in the RStudio environment.

Secondly, you need to learn to use our department server, Euler, which is “euler.stat.yale.edu”, for the purpose of building a course web page. I’ll talk about this more in class. A brief word of caution, though: if you visit http://euler.stat.yale.edu/ you will be redirected to http://statistics.yale.edu (which does not “live” on euler). In contrast, http://euler.stat.yale.edu/~jay “lives” on euler. There is no such thing as http://statistics.yale.edu/~jay.

1 Our computing environment(s)

I’m sure we have a mix of Mac and PC users (and perhaps a Linux enthusiast or two). In theory, everything we do should be platform independent, and you are encouraged to become more familiar with advanced computing on your personal computers. However, some aspects of this course may be most efficient if we work together in the same environment once in a while. And I’d like you to have a web page for your course work, hence the following material. Everyone has (or could soon have) an account on Euler, the department Linux server. Everyone can log into Euler remotely, regardless of their type of personal computer or location on (or off) campus.

1.1 Accessing Euler from a Mac via SSH

Good news, Mac users: you can open a terminal (if you have never done this, ask for help) and simply type one of the following commands:

    ssh euler.stat.yale.edu -l NETID
    ssh NETID@euler.stat.yale.edu

Then enter your password when prompted, and you’re in!

1.2 Accessing Euler from a PC via SSH (or PuTTY)

An SSH client is needed to connect remotely, and may or may not be available on Yale’s computers. If you need it on your own PC, you can download PuTTY from http://www.yale.edu/software/. Once it’s installed, you want to connect to euler.stat.yale.edu with your NETID and password. You may want WinSCP for file transfers (I’m not sure if a secure FTP is included with PuTTY, but I think it is).

1.3 Accessing Euler’s filesystem

I grudgingly admit that “drag and drop” interactivity with the Linux filesystem can be convenient, but it is not sufficient for this course. You’ll need to become proficient at a basic level with SSH (and perhaps the sister program, SFTP, for transferring files).

2 Getting started on Euler

Once you’re in, you should have a screen like the one in Figure 1. At this point, a good rule of thumb is: DON’T USE THE MOUSE. As a first example, we’ll create a folder (directory) for the course, and make sure a few folder permissions are set properly. Linux is case-sensitive, so be careful. I’ll discuss these commands in class, and they should soon become second nature.

    pwd
    ls -lat
    chmod 755 .
    ls -lat
    mkdir Stat625
    ls -lat
    chmod 755 Stat625
    ls -lat
    cd Stat625
    pwd
    ls

Figure 1: A screenshot showing an SSH session to euler.

You can see what many of these commands do, although chmod may be less than obvious. Each file or directory has a set of permissions, such as drwxr-xr-x; this particular example says “this is a directory” (because it starts with a d), “I have read, write, and execute permission” (because of the first triplet rwx), and “people in my group as well as everyone else on the system have read and execute permission, but not write permission” (because of the second and third triplets, r-x). Read each triplet as a three-bit binary number (111 101 101) and you get the octal digits 755; think about it. Essentially, anyone can enter the directory and see the contents, but only you can create files inside this directory. In fact, depending on your user defaults, the chmod commands, above, may not have been necessary. Use of 644 is common for files where you want to have read-write permission for yourself, but only read permission for anyone else; 700 (for directories) or 600 (for files) restricts use to you and you alone. Maybe we can explore this in class, but it isn’t critical.

The Linux system has help pages (manual pages, or man for short), too; for help on pwd, for example:

    -bash-3.2$ man pwd

Press the space bar to page through the manual, and q to exit. The following web page has a nice summary of a bunch of useful Linux commands:

    http://www.ss64.com/bash/
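Returning to the permission triplets: the arithmetic behind a mode like 755 can be sketched in a few lines of shell. Each triplet is treated as a three-bit number, with r worth 4, w worth 2, and x worth 1. The helper name perm_to_octal is made up for this illustration; it is not a standard Linux command, and the sketch ignores the leading d and any special bits.

```shell
# perm_to_octal: hypothetical helper that turns a nine-character
# permission string (without the leading "d" or "-") into octal digits.
perm_to_octal() (
    perms=$1
    mode=""
    for start in 1 4 7; do
        # Pull out one triplet: characters 1-3, 4-6, or 7-9.
        triplet=$(printf '%s' "$perms" | cut -c "${start}-$((start + 2))")
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac   # read    = binary 100
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac   # write   = binary 010
        case $triplet in ??x) digit=$((digit + 1)) ;; esac   # execute = binary 001
        mode="$mode$digit"
    done
    printf '%s\n' "$mode"
)

perm_to_octal rwxr-xr-x   # prints 755
perm_to_octal rw-r--r--   # prints 644
perm_to_octal rwx------   # prints 700
```

On a real file, ls -l shows the string form and chmod accepts the octal form, so this is only a way to check your own reading of the two notations.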

2.1 R on Euler

Next, let’s fire up R and play a bit. Simply type R at the prompt. Your prompt may look a little different, depending on the default settings:

    -bash-3.2$ R

The result should be familiar, as will the following commands. Notice how easy it is to integrate graphics into a document. Now I admit a single “cut and paste” into Word isn’t that bad, but... graphics change, and you’ll generally have many more of them, which can be a pain. Of course, being able to display code and results in a document without painful formatting by hand is pretty cool.

    > sayhello <- function(x) cat(paste("Hello, ", x, "!\n", sep=""))
    > sayhello("Jay")
    Hello, Jay!
    > normalsample <- rnorm(100)
    > summary(normalsample)
       Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    -3.0110  -0.5820   0.1094   0.1036   1.0090   2.5890
    > ls()
    [1] "normalsample" "sayhello"

Now, if you are logged into Euler and running R interactively, you can’t view any graphics. However, if you are using Sweave the story is different... as this document demonstrates. To see the example, examine the Computing1.Rnw file and see where Figure 2 is produced.

[Plot: “Histogram of normalsample”; x-axis: normalsample (−3 to 3), y-axis: Frequency (0 to 20)]

Figure 2: A sample histogram produced using Sweave.

3 Editing files

This gets a little stickier. When I used to work in Windows I would transfer the file to my Windows laptop, edit the file locally using “Notepad” (not Microsoft Word!), and then transfer the file back to euler. This is kind of clunky, though, and problems sometimes occur with end-of-line characters (see below for more on this). There is another potential danger: you might lose track of which copy is which (the local one or the one on the server) and mistakenly edit the wrong one. There is a real potential for lost work and general confusion with this approach. There are basically two alternatives: (1) edit the file directly on Euler inside SSH using a Linux editor like “vi” or “emacs”, or (2) edit the file over a network connection to the filesystem. See below for information on this second route.

3.1 Using “vi”

A neat little introduction to vi is

    http://heather.cs.ucdavis.edu/~matloff/UnixAndC/Editors/ViIntro.html

although after the introduction it moves quickly onto features that, frankly, I never use. There is a small set of commands which I find are most useful.

    Command      Action
    arrow keys   move around
    ESC          exit from insertion mode once you get into it; the most
                 commonly used ways of entering insertion mode follow
    i            insert here at my present location
    a            insert one character after my present location
    o            open a new line below the present one for insertion
    O            open a new line above the present one for insertion
    x            delete this character
    dd           delete this line
    ZZ           save and exit
    :q!          exit without saving
    :w           save the file without exiting

A more advanced (or, at least, exhaustive) reference is available from the author of vi, Bill Joy:

http://docs.freebsd.org/44doc/usd/12.vi/paper.html

3.2 Using “emacs”

I used to use this regularly. I’m sure there are many web pages, but the following looks like a pretty nice introduction:

http://www.cs.colostate.edu/helpdocs/emacs.html

3.3 Editing files over a network connection to the filesystem

I’m not a big fan of this, partly because of problems with end-of-line characters. Different platforms (Mac, Windows, Linux) use different conventions. If (when) problems crop up, we’ll discuss this in more detail. For now, be aware of the following commands on Euler, which can help recover from such problems.

    unix2dos FILENAME
    dos2unix FILENAME
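To make the end-of-line issue concrete, here is a small sketch in plain shell. Windows-style text files end each line with a carriage return followed by a line feed, while Linux uses a line feed alone. The sketch builds a throwaway file with Windows endings, counts the carriage returns, and strips them with tr, which for a file like this has the same effect as dos2unix. The helper name count_cr and the scratch file names are invented for the example.

```shell
# Scratch directory so nothing real is touched.
scratch=$(mktemp -d)

# A two-line file with Windows-style endings (carriage return + line feed).
printf 'first line\r\nsecond line\r\n' > "$scratch/windows.txt"

# count_cr: hypothetical helper that counts carriage-return bytes in a file.
count_cr() {
    tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

count_cr "$scratch/windows.txt"   # prints 2, one carriage return per line

# Delete every carriage return, leaving Linux-style line feeds only.
tr -d '\r' < "$scratch/windows.txt" > "$scratch/unix.txt"

count_cr "$scratch/unix.txt"      # prints 0
```

If a script or an Rnw file behaves strangely after being edited on another platform, a nonzero count from a check like this is a good hint that line endings are the culprit.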

Basically, you can (sometimes) access folders on Euler from Mac or Windows PCs and edit the files as if they were on your local machine. Some sites at Yale offer help setting this up, but access should be automatic in our department computer lab. For more information about access from your own computer, see the following page (Euler behaves just like the old Pantheon cluster):

http://its.yale.edu/how-to/mount-pantheon-drive-your-computer-windows
http://its.yale.edu/how-to/mounting-pantheon-space-drive-mac-os-x

Note that you may have to use VPN for security reasons. If you have problems, talk with each other; ask people like me or Le-Minh to have a look. Or maybe seek support from the ITS folks if all else fails. It’s gotta be possible. I don’t do this myself any longer (it isn’t part of my daily workflow), but I wanted to acknowledge it as an option.

3.4 Producing this document

Instructions would vary slightly by computer, but the big picture is more important. On your computer, you would need to download the files over the web rather than doing the copy, below. If you are working on Euler, try something like the following (if you have created the obvious subdirectories):

    cd ~/public_html/625
    cp -r /home/jay/public_html/625/Week2 .
    cd Week2
    mkdir graphics
    R CMD Sweave Computing1
    pdflatex Computing1
    pdflatex Computing1

Yes, do that last command twice. The first time, LaTeX can’t quite resolve some of the automatic references to figure numbers, and might want to keep track of sections. The second run resolves any possible confusion. Instead of the R CMD Sweave Computing1 command, you could just start R and do what I did earlier. Note that creation of the graphics folder is necessary to hold the temporary graphics. You should now have Computing1.pdf in that directory. Check it.

Next, look at the “source” file for the document, Computing1.Rnw. This is LaTeX with some R commands mixed in. Sweave actually runs the R commands and collects the results, integrating them into the document, producing Computing1.tex. This is a pure LaTeX document, which then is processed most easily using pdflatex. Done.

WARNING! Do not edit the tex file. You’ll lose all your work. Instead, edit the Rnw file, then re-process it using Sweave, getting a revised tex file. Then create the PDF from this new tex file. You have been warned.

There are many good reasons to consider investing a little time in this, particularly if you might write a dissertation in a few years, or a scientific journal article. For now, I’ll close with a teaser:

\[ \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \]

Try doing that quickly and easily in Microsoft Word. I provide a copy of the Sweave manual in this Week2 directory. You can also find it on the web, along with many LaTeX references.
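One way to keep the Rnw-to-PDF steps straight is to wrap them in a small shell function, so that the tex file is always regenerated from the Rnw source and never edited by hand. This is only a sketch: build_doc is a made-up name, and it assumes R and pdflatex are on your path and that you run it from the directory holding the Rnw file.

```shell
# build_doc: hypothetical helper; usage is "build_doc Computing1".
build_doc() {
    name=$1
    # Always start from the Rnw source, never from an edited tex file.
    if [ ! -f "$name.Rnw" ]; then
        echo "No $name.Rnw in this directory." >&2
        return 1
    fi
    # Folder for the temporary graphics (no error if it already exists).
    mkdir -p graphics
    # Run the R code and write a fresh tex file.
    R CMD Sweave "$name.Rnw" || return 1
    # First pass, then a second pass to resolve figure and section references.
    pdflatex "$name" || return 1
    pdflatex "$name" || return 1
}
```

After editing Computing1.Rnw, calling build_doc Computing1 would repeat the whole chain and overwrite the old tex and PDF files.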

4 Homework for Thursday, September 12

Get yourself up and running, preferably with both LaTeX and R Markdown. On a Linux machine, you may already be able to use Sweave and pdflatex with no problems. On a Mac, Google MacTeX. On Windows, you’ll need something like MiKTeX, I expect. R Markdown is best used with RStudio in my opinion, and you’ll probably have to install the markdown package.

4.1 Starting a course web page

You have an Euler account, so I’d like you to try this on your own, first. If you have trouble, ask another student for help. If everything fails, we’ll spend more time on this during class. On Euler, make sure you created a subdirectory (or folder) that is of the form

    /home/NETID/public_html/625/

To check that this is done correctly, test the following URL in a web browser:

    http://www.stat.yale.edu/~NETID/625/

You’ll keep all your course work inside that folder (or, more precisely, in subdirectories of that folder). Some of it will be “world readable” and other parts may be private; you’ll use the chmod command to control this. You will need to be able to connect to Euler during class to modify permissions and show off your work, so please be prepared for this.
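As a rough sketch of the folder setup, the commands below build the same directory layout under a temporary stand-in for /home/NETID, so they can be tried anywhere without touching a real account. The private subfolder is a hypothetical example of something you might not want world readable; on Euler you would run the equivalent commands in your real home directory.

```shell
# Stand-in for /home/NETID (an assumption made so the sketch runs anywhere).
fake_home=$(mktemp -d)

# Create the course folder, a Week2 folder, and a hypothetical private folder.
mkdir -p "$fake_home/public_html/625/Week2"
mkdir -p "$fake_home/public_html/625/private"

# World readable: anyone may enter and list, only the owner may write.
chmod 755 "$fake_home/public_html" \
          "$fake_home/public_html/625" \
          "$fake_home/public_html/625/Week2"

# Private: only the owner may enter or list.
chmod 700 "$fake_home/public_html/625/private"

# Inspect the result; the first column shows the permission strings.
ls -l "$fake_home/public_html/625"
```

In the listing, Week2 should show drwxr-xr-x and private should show drwx------, which is the difference between “world readable” and private described above.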

4.2 Diving (except for Jessica)

Continue (just a little bit) your diving explorations, but the focus for Thursday is getting up to speed with basic Linux and creating a course web page. Using Sweave and/or R Markdown, produce a PDF and/or HTML of your explorations, adding a little text discussion of what you are doing. Post these in your 625/Week2 folders and make sure they work from the web. If you can’t do this, you should explain why. If you succeeded with Sweave and/or R Markdown (but couldn’t work in Linux), then you should upload the document(s) to the classesv2 site Dropbox.

5 HINT

Work together on computing stuff like this! There are lots of little computing “gotchas” that can be annoying, but fighting through them and succeeding is important. When you graduate and get a real job (or head to graduate school, or whatever), being able to solve problems and be computationally self-sufficient will be valuable. Yes, we’re here to study statistics (or data analysis, I would prefer to say). But we aren’t here just to think about problems, we’re here to work on problems. Thinking is necessary, but not sufficient.