Basic Introduction to R and Bioconductor. Aedin Culhane, 18th Sept 2009 ([email protected])

R (www.r-project.org) is an open source interactive computer system for visualization and analysis of statistical data. Bioconductor (www.bioconductor.org) is a project within R. The Bioconductor project is a source of many resources relevant to data analysis for high-throughput biology.

Installing R and Bioconductor

1. Download R from http://www.r-project.org. The primary portal for R is http://cran.r-project.org. Many local mirrors exist and these may provide quicker downloads.

R can be downloaded precompiled for Linux, Mac OS and Windows.

To install R, simply click on download and select the version appropriate for your operating system. If you use Linux, you may be able to install R through the package manager instead (for example apt-get install r-base on Debian or Ubuntu, or yum install R on Fedora or Red Hat), depending on the distribution you use.

For most users, this precompiled version of R is sufficient. However, if you wish to write an extension package for R/Bioconductor, you need to install R from source. Please refer to the R FAQ for directions on this. Windows users will need to install Unix tools, Perl and LaTeX; please refer to the excellent guide at http://www.murdoch-sutherland.com/Rtools/. There is also a more complete (and more complex) description on the Bioconductor developers wiki (http://wiki.fhcrc.org/bioc/HowTo/Build_R_on_Windows).

R Extension Packages

To find out about contributed packages, go to http://cran.r-project.org/ (or the mirror you are using) and click on the link Contributed extension packages at the bottom of the main download page. This will link to a page with lots of information on contributed packages and to “CRAN Task Views”.

CRAN task views group contributed packages into many categories and make installation of these packages simpler. For example, to install the genetics packages type

    install.packages("ctv")
    library(ctv)
    install.views("Genetics")

Note that these categories contain many packages; for example, the Genetics view contains 29 packages.

Installing Individual Contributed or Extension Packages

Although a team of statisticians and programmers maintains R, many independent groups submit contributed packages. This is a very extensive resource of hundreds of programs, which are not installed by default.

R extensions can be installed easily. R extensions are stored in repositories. Select the package repositories using Packages - Select repositories

The R repositories are:

CRAN: basic R distribution

CRAN (extras): Contributed R packages. There are several hundred.

Bioconductor: The Bioconductor Project produces an open source software framework that will assist biologists and statisticians working in bioinformatics, with primary emphasis on inference using DNA microarrays. A CRAN style repository is available via http://www.bioconductor.org/.

Omegahat: The Omegahat Project for Statistical Computing provides a variety of open-source software for statistical applications, with special emphasis on web-based software, Java, the Java virtual machine, and distributed computing. A CRAN style R package repository is available via http://www.omegahat.org/R/.

Select a repository. You may be offered a selection of CRAN mirrors; choose one close by. You can now download extension packages.
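The same choices can also be made from the command line, which helps on systems without the GUI menus. A minimal sketch, assuming the CRAN cloud mirror https://cloud.r-project.org as an example URL (substitute any mirror near you):

```r
# Show the repositories currently configured for this session.
print(getOption("repos"))

# Point install.packages() at a specific CRAN mirror for this session.
# Assumption: the mirror URL below is only an example; any mirror works.
repos <- getOption("repos")
repos["CRAN"] <- "https://cloud.r-project.org"
options(repos = repos)

# Interactive equivalents of the menu entries:
#   chooseCRANmirror()   # pick a CRAN mirror from a list
#   setRepositories()    # pick which repositories (CRAN, Bioconductor, ...) to use
```

The setting lasts only for the current R session unless it is placed in a startup file such as .Rprofile.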

There are several ways to install extension packages:

1. From the GUI, select Packages - Install Packages. Note that only the packages from the selected repositories will be shown.

2. From the command line, type install.packages("packagename")

3. Download a compressed copy of the package from the R website and install from the local .zip or .tar.gz file.

Installing Bioconductor

Bioconductor can be installed using this approach; however, a simpler method is to use the following commands:

    source("http://www.bioconductor.org/biocLite.R")
    biocLite()

This will automatically download the core Bioconductor packages. Because Bioconductor contains so many packages, and many of them are large, many useful packages will not be installed by this step; these can be installed as described above.

For more complex installs of specific subsets of packages see http://www.bioconductor.org/docs/install/

To install a specific package, simply include the package name between the brackets, e.g. biocLite("biomaRt")
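Note that biocLite() belongs to the Bioconductor releases current when this guide was written. In recent Bioconductor releases it has been replaced by the BiocManager package, which is installed from CRAN. A minimal sketch of the equivalent steps, wrapped in a helper whose name (install_bioc) is made up for illustration:

```r
# Hypothetical helper: install Bioconductor packages via BiocManager.
install_bioc <- function(pkgs) {
  # BiocManager itself is an ordinary CRAN package.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Installs the named packages plus their dependencies.
  BiocManager::install(pkgs)
}

# Usage (needs a network connection):
#   install_bioc("biomaRt")
```

Nothing is downloaded until the helper is called, so the definition can safely sit at the top of an install script.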

On Windows, you can also use the drop down menus Packages -> install package(s), but ensure that you have first selected the Bioconductor repository using Packages -> select repositories. By default only CRAN is selected. To install Bioconductor packages, Bioconductor needs to be selected. Then select Packages -> install package(s) and choose one or more packages from the long list. Hold down the shift key to select several consecutive packages, or hold control to select non-consecutive packages.

Using a script to simplify Installation of R and Bioconductor

Once you are familiar with R and Bioconductor, you may wish to script the above process so that updates to R are simpler ;-)

I use the following script, which I saved to a plain text file called bioC_install.R

    ### Packages for R.
    install.packages("ade4")
    install.packages("scatterplot3d")
    install.packages("RCurl")
    install.packages("fastICA")
    install.packages("e1071")

    ### BioC Packages for R.
    source("http://www.bioconductor.org/biocLite.R")
    biocLite()
    biocLite("made4")
    biocLite("hgu133plus2.db")
    biocLite("biomaRt")

To install these packages, I simply type the command source("./bioC_install.R") and then go for coffee ;-)
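When such a script is rerun, it can help to skip packages that are already present. A small sketch of that idea; the helper name missing_packages is an assumption for illustration and is not part of the script above:

```r
# Return the subset of `pkgs` that is not yet installed in any library path.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

cran_pkgs <- c("ade4", "scatterplot3d", "RCurl", "fastICA", "e1071")
to_install <- missing_packages(cran_pkgs)
print(to_install)

# Uncomment to install only what is missing (needs a network connection):
# if (length(to_install) > 0) install.packages(to_install)
```

After an upgrade to a new R version the library is usually empty, so the helper then returns the full list and the script behaves like the original.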

R and Bioconductor are pretty “intelligent”: if you wish to install a package that has dependencies (needs other packages), these will be installed automatically.

Bioconductor Packages Task Overview

Bioconductor (http://www.bioconductor.org) has a really nice “task views” interface to the packages in Bioconductor. To find out more about these, click on Software.

The package task views are grouped into 3 categories:

• Software
• Annotation Data
• Experiment Data

Click on each of these links to explore the packages in Bioconductor.

Bioconductor Task Views - Categories

Software packages are subdivided into the categories:

• Annotation
• AssayDomains
• AssayTechnologies
• BiologicalDomains
• Infrastructure
• Bioinformatics

Each contains a long list of contributed packages. For example, clicking on AssayTechnologies returns a list of technology categories:

• Microarray
• MicrotitrePlateAssay
• MassSpectrometry
• SAGE
• FlowCytometry
• Sequencing
• HighThroughputSequencing

MassSpectrometry, in turn, contains a list of relevant packages, and each package page gives details, help and a link to a vignette (tutorial).

Under Documentation, click on the link to the pdf file. This is the vignette, which gives an overview (tutorial) and examples of work-flows for the package.

Bioconductor Annotation Data Packages

Bioconductor contains hundreds of annotation packages. These are an incredibly valuable resource in Bioconductor. This resource can be searched by organism, chip manufacturer, chip name, etc.

Subviews

• Organism
• ChipManufacturer
• CustomCDF
• CustomArray
• CustomDBSchema
• FunctionalAnnotation
• SequenceAnnotation
• ChipName

For example, click on ChipManufacturer - Agilent or Organism - Homo sapiens. Note that annotation packages have the ending “.db”. If you click on Affymetrix, you will see that each array is represented by 3 annotation packages: in addition to the .db package, there is also a cdf package and a probe package.

These contain probe annotation (.db), the chip description file (cdf) and probe sequence data (probe). The latter two are compiled from Affymetrix or manufacturer information.

To install an annotation package, follow the same guidelines as for package installation:

    source("http://www.bioconductor.org/biocLite.R")
    biocLite("hgu133plus2.db")

To find out about a package:

    library(hgu133plus2.db)
    hgu133plus2()

Quality control information for hgu133plus2
Date built: Created: Mon Sep 25 20:14:15 2006

Number of probes: 54675
Probe number missmatch: None
Probe missmatch: None

Mappings found for probe based rda files:
    hgu133plus2ACCNUM found 54675 of 54675
    hgu133plus2CHRLOC found 42738 of 54675
    hgu133plus2CHR found 47342 of 54675
    hgu133plus2ENTREZID found 47430 of 54675
    hgu133plus2ENZYME found 4755 of 54675
    hgu133plus2GENENAME found 41845 of 54675
    hgu133plus2GO found 35836 of 54675
    hgu133plus2MAP found 46947 of 54675
    hgu133plus2OMIM found 27681 of 54675
    hgu133plus2PATH found 10232 of 54675
    hgu133plus2PMID found 46000 of 54675
    hgu133plus2REFSEQ found 45868 of 54675
    hgu133plus2SUMFUNC found 0 of 54675
    hgu133plus2SYMBOL found 47391 of 54675
    hgu133plus2UNIGENE found 47147 of 54675

Mappings found for non-probe based rda files:
    hgu133plus2CHRLENGTHS found 25
    hgu133plus2ENZYME2PROBE found 792
    hgu133plus2GO2ALLPROBES found 6978
    hgu133plus2GO2PROBE found 5128
    hgu133plus2ORGANISM found 1
    hgu133plus2PATH2PROBE found 183
    hgu133plus2PFAM found 33303
    hgu133plus2PMID2PROBE found 130997
    hgu133plus2PROSITE found 23213
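Once the annotation package is installed, mappings such as the probe-to-symbol map listed above can be queried directly. A hedged sketch, assuming a current release of hgu133plus2.db and AnnotationDbi (which provides select()); the helper name probe_symbols and the example probe set IDs are chosen here purely for illustration:

```r
# Hypothetical helper: look up gene symbols for Affymetrix probe set IDs.
probe_symbols <- function(probe_ids) {
  if (!requireNamespace("hgu133plus2.db", quietly = TRUE)) {
    stop("Install hgu133plus2.db first (see the commands above).")
  }
  # select() returns a data frame with one row per probe-to-symbol match;
  # probe sets without a symbol are reported with NA.
  AnnotationDbi::select(hgu133plus2.db::hgu133plus2.db,
                        keys = probe_ids,
                        columns = "SYMBOL",
                        keytype = "PROBEID")
}

# Usage (requires the annotation package):
#   probe_symbols(c("1007_s_at", "1053_at"))
```

The other mappings (for example ENTREZID or GENENAME) can be requested the same way by changing the columns argument.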

Bioconductor Experiment Data Packages

Experiment data packages contain published data pre-prepared for Bioconductor. For example, these packages include data from leukemia gene expression studies (Golub et al., 1999), colon cancer gene expression profiling (Alon et al., 1999), etc. Many of the tutorials (vignettes) in Bioconductor use these data in exercises.
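As an illustration, the leukemia data of Golub et al. are distributed in the golubEsets experiment data package. A minimal sketch, assuming golubEsets (and with it Biobase) has been installed from Bioconductor; the code only reports a message if the package is absent:

```r
# Check whether the experiment data package is available.
have_golub <- requireNamespace("golubEsets", quietly = TRUE)

if (have_golub) {
  library(golubEsets)
  data(Golub_Merge)                          # combined training and test samples
  print(dim(Golub_Merge))                    # features x samples
  print(head(Biobase::pData(Golub_Merge)))   # sample annotation
} else {
  message("golubEsets is not installed; install it from Bioconductor first.")
}
```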

What Packages Should I download?

There are so many packages that it can be very confusing to decide which are the best to download.

There is no one answer to this question. Bioconductor is expanding rapidly to enable analysis of many different data types for example microarray data, proteomics, sequencing, etc. So it really depends on the analysis you wish to do.

There is a list of packages for different workflows (http://www.bioconductor.org/docs/workflows/). I would also recommend that a new user go through the workshops, for which the code and lectures are available (http://www.bioconductor.org/workshops/2009).

For example, an Introduction to Bioconductor course, which was held earlier this year, is online (http://www.economia.unimi.it/projects/marray/2009/). Click on the link Course Material on the right side of the page.

My very biased list of useful packages (i.e. have a look at the vignettes and see if they are useful for your analysis)

Preprocessing
------
affy - Affy normalization (rma, dChip Li & Wong normalization, mas5.0)
lumi - for normalization of Illumina data

Feature Selection (and preprocessing)
------
limma - for pre-processing/normalization of spotted array data and for linear models, feature selection

Data visualization (EDA)
------
made4 - completely biased, it's my package

Annotation of genes
------
annaffy, annotate - for annotation of affy data
biomaRt - runs biomart

Gene Set Approaches
------
GlobalAncova, globaltest - for testing association between sets of genes and clinical covariates
GSEAlm, goProfiles, sigPathway

Machine Learning, Supervised Analysis
------
MLInterfaces - wrapper around many machine learning approaches (e.g. svm, nn, knn, etc.)

Connection to online data repositories
------
GEOquery, ArrayExpress (which have the functions getGEO and getAE respectively to pull data from GEO and ArrayExpress)

Bioconductor and R Help

The first place to find help on R and Bioconductor is their websites. These provide an excellent source of information.

In particular it is worth examining the manuals, FAQ and searchable archives of the mailing lists. Do subscribe to the Bioconductor and R mailing lists.
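R's built-in help system is also worth knowing about. The calls below are all part of a standard R installation (the utils package); the topics and package names used are only examples:

```r
# Find functions whose names start with "install".
found <- apropos("^install")
print(found)

# List the vignettes shipped with an installed package (here: grid).
v <- vignette(package = "grid")
print(v$results[, c("Item", "Title"), drop = FALSE])

# Interactive use:
#   help("install.packages")          # or ?install.packages
#   help.search("cluster analysis")   # or ??"cluster analysis"
#   browseVignettes("limma")          # vignettes of an installed package, in a browser
```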

R Help

My favorite beginner's guide to R is from Emmanuel Paradis. It's available from http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

Introduction to R classes and objects http://cran.r-project.org/doc/manuals/R-intro.html

Tom Short’s R reference card (http://cran.r-project.org/doc/contrib/Short-refcard.pdf) and other contributed documents are useful: http://cran.r-project.org/other-docs.html

Thomas Girke (UC Riverside) has also written an excellent intro to R and Bioconductor http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html

Bioconductor Help

I found the Bioconductor course and workshop examples very useful when I was learning. These are available at http://www.bioconductor.org/workshops

For more information on getting started in R and Bioconductor, please see Vince Carey’s guide at http://www.biostat.harvard.edu/~carey/rr.pdf

GUIs for R

Editors which color code ;-)

TinnR http://www.sciviews.org/Tinn-R/

Notepad ++ http://notepad-plus.sourceforge.net/uk/site.htm

On Linux I use Kate (KDE) or Emacs, which has a “plugin”, ESS

GUI interface to R

R commander http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/

R WorkBench (from EBI, http://www.ebi.ac.uk/tc-test/microarray-as/biocep)

For more information on GUI and editors see the R wiki http://wiki.r-project.org/rwiki/doku.php?id=getting-started:system-setup:system-setup

Fun R Websites (Graphics and More)

R Graph Gallery http://addictedtor.free.fr/graphiques/

Rggobi http://www.ggobi.org/demos/ GGobi is an open source visualization program for exploring high-dimensional data; it lets you create movies and rotating 3D plots.

Useful Books

Bioconductor Case Studies (Use R) (Paperback) by Florian Hahne (Author), Wolfgang Huber (Author), Robert Gentleman (Author), Seth Falcon (Author) http://www.amazon.com/Bioconductor-Case-Studies-Use-R/dp/0387772391

Introductory Statistics with R (Statistics and Computing) (Paperback) by http://www.amazon.com/Introductory-Statistics-R-Computing/dp/0387790535/ref=pd_sim_b_1