BIO503: Introduction to Programming and Statistical Modeling in R

January 2016

1 Course Description

This course is an introduction to R, a powerful and flexible statistical language and environment that also provides more flexible graphics capabilities than other popular statistical packages.

The course will introduce students to the basics of using R for statistical programming, computation, graphics, and modeling. We will start with a basic introduction to the R language, reading and writing data, and graphics. We then discuss writing functions in R and tips on programming in R. The latter part of the course will focus on using R to fit some important statistical models, including basic linear regression, generalized linear models, and survival analysis. The class will include a short introduction to producing professional-looking reports (with pretty plots and tables) that meet the standard necessary for reproducible research and documentation. The first 4 lectures will focus on R essentials. I am happy to tailor the last lecture to students' interests, and I can provide an introduction to analysis of genomics data in Bioconductor should there be interest among students.

The class goal is to get students up and running with R so that they can use R in their research and are in a good position to expand their knowledge of R on their own. The course notes are written to provide students with a useful and extensive reference manual on R (it's over 200 pages!).

2 Course Website

The URL on isites for Bio503 2015 is https://canvas.harvard.edu/courses/11126/

3 Learning Objectives

After taking the course, students will be able to

1. Use R for statistical programming, computation, graphics, and modeling,

2. Write functions and use R in an efficient way,

3. Perform basic statistical analysis in R and fit basic statistical models,

4. Use R in their own research, and produce reports which meet the standards for reproducible research,

5. Expand their knowledge of R on their own.

4 Course Schedule

The Winter Session 2016 course will consist of 5 lectures, each just under 3 hours long. The class will be held in computer lab Kresge LL6 from 9:30am-12:20pm on the following dates:

• January 5 (Tuesday)

• January 7 (Thursday)

• January 12 (Tuesday)

• January 14 (Thursday)

• January 19 (Tuesday)

5 Intended Audience and Prerequisites

There are no formal prerequisites, but in order to appreciate the abilities of R and for the later classes that explore various statistical models, a basic knowledge of statistics is useful. The intended audience is students who need a flexible statistical environment for their research. We do not expect any prior experience with R, but experience with another programming or statistical language may be helpful to a limited extent. Beginning R users with basic knowledge may also find the course useful.

6 Instructors, Staff

Primary classroom and grading instructor: Aedin Culhane
Department of Biostatistics
Dana-Farber Cancer Institute
Biostatistics and Computational Biology
Office: Dana-Farber Cancer Institute, Smith 822C (8th floor of the Smith building at the end of Shattuck St)
Phone: (617) 617-2468
e-mail: [email protected]
web: http://www.hsph.harvard.edu/aedin-culhane/

Teaching Assistant: BJ Stubbs
Channing Laboratory
e-mail: [email protected]

Faculty sponsor: John Quackenbush

7 Course Material

Course text: Students may find one of the following books useful depending on their needs and background. Reviews and more information about these textbooks are listed both on my Amazon wishlist http://amzn.com/w/2PTDZDB6JMG8Q and on the Harvard Coop textbook website http://harvardcoopbooks.bncollege.com/webapp/wcs/stores/servlet/TBWizardView?catalogId=10001&langId=-1&storeId=52084

1. Peter Dalgaard. Introductory Statistics with R (Paperback), 2nd Edition. Springer-Verlag New York, Inc. ISBN 978-0387790534
Introductory Statistics with R provides a very basic introduction to R, targeting both statistician and non-statistician scientists. It may be sufficient for students who may use R for basic statistics. This book is a good introductory text but is getting older (published 2008), so one of the other listed books is also worth considering. http://www.amazon.com/Introductory-Statistics-R-Peter-Dalgaard/dp/0387954759
Answers to the examples in the book are available on the author's webpage http://staff.pubhealth.ku.dk/~pd/ISwR.html

2. Andy Field, Jeremy Miles, Zoe Field. Discovering Statistics Using R. SAGE Publications Ltd; 1st edition (April 4, 2012)
An excellent but irreverent introduction to both statistics and R, written in an engaging, student-friendly manner. For example, chapter titles include "Why Is My Evil Lecturer Forcing Me to Learn Statistics?" Nevertheless, it is quite a comprehensive introduction to basic statistics and R. It includes correlation, simple and multiple regression, logistic regression, anova, glm, basic analysis of categorical data, and exploratory factor analysis. This is aimed at students with little or no statistical experience. Experienced statisticians may prefer a different textbook. http://www.amazon.com/dp/1446200469/ref=wl_it_dp_o_pd_nS_ttl?_encoding=UTF8&colid=2PTDZDB6JMG8Q&coliid=I3TJGIQOBZIPYS
R scripts to run the examples provided in the textbook are available from http://www.sagepub.com/dsur/main.htm. Updates/errata to the book are available on the authors' website http://discoveringstatistics.com/docs/dsurerrata.pdf.

3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics (Book 103). Springer; 2013 edition (August 12, 2013)
A recently published, well-written introduction to state-of-the-art, advanced statistical analysis approaches. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. http://www.amazon.com/dp/1461471370/ref=wl_it_dp_o_pC_nS_ttl?_encoding=UTF8&colid=2PTDZDB6JMG8Q&coliid=I3VY53JI14IXZA

4. Joseph Adler. R in a Nutshell: A Desktop Quick Reference. O’Reilly Media; 1st edition (January 11, 2010). ISBN 978-0596801700 http://www.amazon.com/R-Nutshell-Desktop-Quick-Reference/dp/059680170X
A good introductory textbook on R from O’Reilly. Provides a more in-depth introduction to R than Dalgaard’s book.

Additional course texts: A full list of recommended texts is available online as an Amazon wish list http://amzn.com/w/2PTDZDB6JMG8Q

• An Introduction to R. Online manual on the R website at http://cran.r-project.org/manuals.html

• Robert A. Muenchen. R for SAS and SPSS Users (Statistics and Computing). Springer; 2nd edition (July 26, 2011). ISBN 978-1461406846
Useful for those converting between SAS or SPSS and R. It presents a translation between the 3 languages, so one can find equivalent R functions for SAS statements or SPSS commands. http://www.amazon.com/SAS-SPSS-Users-Statistics-Computing/dp/0387094172

8 Software

8.1 R

R is available for free from http://cran.r-project.org/ for UNIX/Linux, Windows, and Mac. It is also available in the IT microlabs.

8.2 RStudio

We will use RStudio, an IDE and nicer interface to R, available free for all platforms from http://Rstudio.org

8.3 LaTeX

Those familiar with LaTeX may find it useful to produce LaTeX documents which embed R code. We will not demonstrate this in class unless there is sufficient interest.

• LaTeX: Windows MiKTeX http://miktex.org/, or Mac OS MacTeX http://www.tug.org/mactex

• LaTeX editors: TeXworks http://www.tug.org/texworks/, or Mac users can use TeXShop http://pages.uoregon.edu/koch/texshop/

9 Class Format

There will be five 3-hour class sessions. They will be held in Kresge LL6 and will combine lecture, demonstration, and laboratory components, with an emphasis on demonstration and hands-on experience.

10 Grading/Assessment

Pass/Fail or Ordinal grading option only.

There will be 2 practical assignments and in-class quizzes, which will require students to use and expand on the material discussed in class. Grading will be based on both attendance and performance on assignments. Submit assignments online through the course website.

• In-class short quizzes and class attendance (20%)

• Homework 1 (40%)

  – Assignment available from Thurs January 7th.
  – Due Monday January 11th 11am or earlier.

• Homework 2 (40%)

  – Assignment available from Thurs January 14th.
  – Due Monday January 18th 11am or earlier.

11 Course Topics

1. Introduction to the R language

  • History of R. Overview of the R project
  • Obtaining, installing and managing R
  • Objects: types of objects, classes, creating and accessing objects
  • Creating, modifying and accessing objects in R. Simulating data in R.
  • Arithmetic and matrix operations
  • Introduction to functions (these are the R equivalent of PROC or methods)

2. More details on working with R

  • Reading and writing data from local files, databases and online
  • R libraries
  • Working with RStudio
  • Writing reports in R, rmarkdown
  • Functions and R programming
    – the if statement
    – looping: for, repeat, while
    – writing functions
    – function arguments and options

3. Graphics

  • Basic plotting. Manipulating the plotting window.
  • Advanced plotting using ggplot, googleVis, googleMaps. Creating tag clouds, basic graphs/networks
  • Saving plots. Publication quality graphics.

4. Standard statistical models in R

  • Basic statistical analysis in R (Student's t, chi-sq test, anova, etc.)
  • Categorical data
  • Model formulae and model options, output and extraction from fitted models in R
  • Statistical models will include
    – Linear regression: lm()
    – Logistic regression: glm()
    – Survival analysis: Surv(), coxph()

5. Advanced R

  • Extensions of topics discussed in lectures 1-4 and additional items to be decided. The final topics in Lecture 5 will be based on a course survey that will be available to students after lecture 4. It may include
    – Data management (importing, subsetting, merging, new variables, missing data, etc.)
    – Plotting
    – Loops and functions
    – More advanced plotting and graphics in R. Interactive graphics in R.
    – More on writing R functions, optimizing R code
    – "Big Data", parallel processing and options for big data in R.
    – Creating packages in R.
    – Bioconductor, analysis of genomics data.
    – Reproducible research in R. More on Rmarkdown, IO slides, Shiny, etc.
    – Creating output that looks like SAS/SPSS, etc.
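As a rough preview of the modeling functions named in topic 4, the sketch below fits one model of each kind. It is not taken from the course notes: the datasets (`mtcars` from base R and `lung` from the survival package) and the variable names `fit_lm`, `fit_glm` and `fit_cox` are assumptions chosen only for illustration.

```r
# Illustrative sketch only: built-in datasets stand in for real course data.
library(survival)  # provides Surv(), coxph() and the lung dataset

# Linear regression: fuel economy (mpg) as a function of car weight (wt)
fit_lm <- lm(mpg ~ wt, data = mtcars)

# Logistic regression: transmission type (am, coded 0/1) as a function of weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Cox proportional hazards model: survival time by age and sex
fit_cox <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Extraction from fitted models
coef(fit_lm)                   # estimated intercept and slope
summary(fit_glm)$coefficients  # estimates, standard errors, z values, p-values
summary(fit_cox)$conf.int      # hazard ratios with 95% confidence intervals
```

All three calls share the same model formula syntax (`response ~ predictors`), which is why the formula interface appears as its own topic before the individual models.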
