Introduction to R, Day 1: “Getting Oriented” Michael Chimenti and Diana Kolbe Dec 5 & 6 2019 Iowa Institute of Human Genetics
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Bioconductor: Open Software Development for Computational Biology and Bioinformatics Robert C
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Collection Of Biostatistics Research Archive Bioconductor Project Bioconductor Project Working Papers Year 2004 Paper 1 Bioconductor: Open software development for computational biology and bioinformatics Robert C. Gentleman, Department of Biostatistical Sciences, Dana Farber Can- cer Institute Vincent J. Carey, Channing Laboratory, Brigham and Women’s Hospital Douglas J. Bates, Department of Statistics, University of Wisconsin, Madison Benjamin M. Bolstad, Division of Biostatistics, University of California, Berkeley Marcel Dettling, Seminar for Statistics, ETH, Zurich, CH Sandrine Dudoit, Division of Biostatistics, University of California, Berkeley Byron Ellis, Department of Statistics, Harvard University Laurent Gautier, Center for Biological Sequence Analysis, Technical University of Denmark, DK Yongchao Ge, Department of Biomathematical Sciences, Mount Sinai School of Medicine Jeff Gentry, Department of Biostatistical Sciences, Dana Farber Cancer Institute Kurt Hornik, Computational Statistics Group, Department of Statistics and Math- ematics, Wirtschaftsuniversitat¨ Wien, AT Torsten Hothorn, Institut fuer Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universitat Erlangen-Nurnberg, DE Wolfgang Huber, Department for Molecular Genome Analysis (B050), German Cancer Research Center, Heidelberg, DE Stefano Iacus, Department of Economics, University of Milan, IT Rafael Irizarry, Department of Biostatistics, Johns Hopkins University Friedrich Leisch, Institut fur¨ Statistik und Wahrscheinlichkeitstheorie, Technische Universitat¨ Wien, AT Cheng Li, Department of Biostatistical Sciences, Dana Farber Cancer Institute Martin Maechler, Seminar for Statistics, ETH, Zurich, CH Anthony J. Rossini, Department of Medical Education and Biomedical Informat- ics, University of Washington Guenther Sawitzki, Statistisches Labor, Institut fuer Angewandte Mathematik, DE Colin Smith, Department of Molecular Biology, The Scripps Research Institute, San Diego Gordon K. -
Statistical Computing with Pathway Tools Using Rcyc
Statistical Computing with Pathway Tools using RCyc Statistical Computing with Pathway Tools using RCyc Tomer Altman [email protected] Biomedical Informatics, Stanford University Statistical Computing with Pathway Tools using RCyc R & BioConductor S: software community over 30 years of statistical computing, data mining, machine learning, and data visualization knowledge R: open-source S with a lazy Scheme interpreter at its heart (including closures, symbols, and even macros!) RCyc: an R package to allow the interaction between Pathway / Genome Databases and the wealth of biostatistics software in the R community Statistical Computing with Pathway Tools using RCyc BioConductor Figure: BioConductor: Thousands of peer-reviewed biostatistics packages. Statistical Computing with Pathway Tools using RCyc Software`R'-chitecture C code extension to R to allow Unix socket access Common Lisp code to hack in XML-based communication Make the life of *Cyc API developers easier. Currently supports exchange of numbers, strings, and lists R code and documentation Provides utilities for starting PTools and marshaling data types Assumes user is familiar with the PTools API: http://bioinformatics.ai.sri.com/ptools/api/ All wrapped up in R package Easily installs via standard command-line R interface Statistical Computing with Pathway Tools using RCyc Simple Example callPToolsFn("so",list("'meta")) callPToolsFn("get-slot-value",list("'proton", "'common-name")) callPToolsFn("get-class-all-instances",list("'|Reactions|")) Statistical Computing with Pathway Tools using RCyc Availability http://github.com/taltman/RCyc Linked from PTools website Statistical Computing with Pathway Tools using RCyc Next Steps Dynamic instantiation of API functions in R Coming next release (coordination with BRG) Make development of *Cyc APIs easier, less boilerplate code Frame to Object import/export Provide \RCelot" functionality to slurp Ocelot frames directly into R S4 reference objects for direct data access Support for more exotic data types Symbols, hash tables, arrays, structures, etc. -
Installing R
Installing R Russell Almond August 29, 2020 Objectives When you finish this lesson, you will be able to 1) Start and Stop R and R Studio 2) Download, install and run the tidyverse package. 3) Get help on R functions. What you need to Download R is a programming language for statistics. Generally, the way that you will work with R code is you will write scripts—small programs—that do the analysis you want to do. You will also need a development environment which will allow you to edit and run the scripts. I recommend RStudio, which is pretty easy to learn. In general, you will need three things for an analysis job: • R itself. R can be downloaded from https://cloud.r-project.org. If you use a package manager on your computer, R is likely available there. The most common package managers are homebrew on Mac OS, apt-get on Debian Linux, yum on Red hat Linux, or chocolatey on Windows. You may need to search for ‘cran’ to find the name of the right package. For Debian Linux, it is called r-base. • R Studio development environment. R Studio https://rstudio.com/products/rstudio/download/. The free version is fine for what we are doing. 1 There are other choices for development environments. I use Emacs and ESS, Emacs Speaks Statistics, but that is mostly because I’ve been using Emacs for 20 years. • A number of R packages for specific analyses. These can be downloaded from the Comprehensive R Archive Network, or CRAN. Go to https://cloud.r-project.org and click on the ‘Packages’ tab. -
Bioconductor: Open Software Development for Computational Biology and Bioinformatics Robert C
Bioconductor Project Bioconductor Project Working Papers Year 2004 Paper 1 Bioconductor: Open software development for computational biology and bioinformatics Robert C. Gentleman, Department of Biostatistical Sciences, Dana Farber Can- cer Institute Vincent J. Carey, Channing Laboratory, Brigham and Women’s Hospital Douglas J. Bates, Department of Statistics, University of Wisconsin, Madison Benjamin M. Bolstad, Division of Biostatistics, University of California, Berkeley Marcel Dettling, Seminar for Statistics, ETH, Zurich, CH Sandrine Dudoit, Division of Biostatistics, University of California, Berkeley Byron Ellis, Department of Statistics, Harvard University Laurent Gautier, Center for Biological Sequence Analysis, Technical University of Denmark, DK Yongchao Ge, Department of Biomathematical Sciences, Mount Sinai School of Medicine Jeff Gentry, Department of Biostatistical Sciences, Dana Farber Cancer Institute Kurt Hornik, Computational Statistics Group, Department of Statistics and Math- ematics, Wirtschaftsuniversitat¨ Wien, AT Torsten Hothorn, Institut fuer Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universitat Erlangen-Nurnberg, DE Wolfgang Huber, Department for Molecular Genome Analysis (B050), German Cancer Research Center, Heidelberg, DE Stefano Iacus, Department of Economics, University of Milan, IT Rafael Irizarry, Department of Biostatistics, Johns Hopkins University Friedrich Leisch, Institut fur¨ Statistik und Wahrscheinlichkeitstheorie, Technische Universitat¨ Wien, AT Cheng Li, Department -
Supplementary Materials
Tomic et al, SIMON, an automated machine learning system reveals immune signatures of influenza vaccine responses 1 Supplementary Materials: 2 3 Figure S1. Staining profiles and gating scheme of immune cell subsets analyzed using mass 4 cytometry. Representative gating strategy for phenotype analysis of different blood- 5 derived immune cell subsets analyzed using mass cytometry in the sample from one donor 6 acquired before vaccination. In total PBMC from healthy 187 donors were analyzed using 7 same gating scheme. Text above plots indicates parent population, while arrows show 8 gating strategy defining major immune cell subsets (CD4+ T cells, CD8+ T cells, B cells, 9 NK cells, Tregs, NKT cells, etc.). 10 2 11 12 Figure S2. Distribution of high and low responders included in the initial dataset. Distribution 13 of individuals in groups of high (red, n=64) and low (grey, n=123) responders regarding the 14 (A) CMV status, gender and study year. (B) Age distribution between high and low 15 responders. Age is indicated in years. 16 3 17 18 Figure S3. Assays performed across different clinical studies and study years. Data from 5 19 different clinical studies (Study 15, 17, 18, 21 and 29) were included in the analysis. Flow 20 cytometry was performed only in year 2009, in other years phenotype of immune cells was 21 determined by mass cytometry. Luminex (either 51/63-plex) was performed from 2008 to 22 2014. Finally, signaling capacity of immune cells was analyzed by phosphorylation 23 cytometry (PhosphoFlow) on mass cytometer in 2013 and flow cytometer in all other years. -
Decode™ Bioinformatic Analysis
PROTOCOL Decode™ Bioinformatic Analysis An introductory guide for the use of high-throughput sequencing to determine primary hits from an shRNA pooled screen. Table of contents 1. Legal disclaimers 1. Legal disclaimers 1 SOFTWARE LICENSE TERMS. With respect to any software products 2. Summary 2 incorporated in or forming a part of the software provided or described 3. Intended audience 2 hereunder (“Software”), Horizon Discovery (“Seller”) and you, the purchaser, 4. Software requirements 2 licensee and/or end-user of the software (“Buyer”) intend and agree that 5. FASTQ files from high-throughput sequencing 2 such software products are being licensed and not sold, and that the words 6. Align the FASTQ files to reference shRNA FASTA files 3 “purchase”, “sell” or similar or derivative words are understood and agreed to A. Create a Bowtie index 3 mean “license”, and that the word “Buyer” or similar or derivative words are B. Align using Bowtie 3 understood and agreed to mean “licensee”. Notwithstanding anything to the C. Batch process 4 contrary contained herein, Seller or its licensor, as the case may be, retains all D. Alignment summaries 4 rights and interest in software products provided hereunder. 7. Differential expression analysis 5 A. Convert Bowtie files into DESeq input 5 Seller hereby grants to Buyer a royalty-free, non-exclusive, nontransferable B. Run DESeq on .rtable files 7 license, without power to sublicense, to use software provided hereunder 8. Conclusions 9 solely for Buyer’s own internal business purposes and to use the related documentation solely for Buyer’s own internal business purposes. -
R / Bioconductor for High-Throughput Sequence Analysis
R / Bioconductor for High-Throughput Sequence Analysis Nicolas Delhomme1 21 October - 26 October, 2013 [email protected] Contents 1 Day2 of the workshop2 1.1 Introduction............................................2 1.2 Main Bioconductor packages of interest for the day......................2 1.3 A word on High-throughput sequence analysis.........................2 1.4 A word on Integrated Development Environment (IDE)...................2 1.5 Today's schedule.........................................2 2 Prelude 4 2.1 Purpose..............................................4 2.2 Creating GAlignment objects from BAM files.........................4 2.3 Processing the files in parallel..................................4 2.4 Processing the files one chunk at a time............................5 2.5 Pros and cons of the current solution..............................6 2.5.1 Pros............................................6 2.5.2 Cons............................................6 3 Sequences and Short Reads7 3.1 Alignments and Bioconductor packages............................7 3.1.1 The pasilla data set...................................7 3.1.2 Alignments and the ShortRead package........................8 3.1.3 Alignments and the Rsamtools package........................9 3.1.4 Alignments and other Bioconductor packages..................... 13 3.1.5 Resources......................................... 17 4 Interlude 18 5 Estimating Expression over Genes and Exons 20 5.1 Counting reads over known genes and exons.......................... 20 5.1.1 -
Open-Source Statistical Software for the Analysis of Microarray Data
The Bioconductor Project: Open-source Statistical Software for the Analysis of Microarray Data Sandrine Dudoit Division of Biostatistics University of California, Berkeley www.stat.berkeley.edu/~sandrine EMBO Practical Course on Analysis and Informatics of Microarray Data Wellcome Trust Genome Campus, Hinxton, Cambridge, UK March 18, 2003 © Copyright 2003, all rights reserved Materials from Bioconductor short courses developed with Robert Gentleman, Rafael Irizarry. Expanded version of this course: Fred Hutchinson Cancer Research Center, December 2002 Biological question Experimental design Microarray experiment Image analysis Expression quantification Pre-processing A Normalization n a l Testing Estimation Clustering Prediction y s Biological verification i and interpretation s Statistical computing Everywhere … • Statistical design and analysis: – image analysis, normalization, estimation, testing, clustering, prediction, etc. • Integration of experimental data with biological metadata from WWW-resources – gene annotation (GenBank, LocusLink); – literature (PubMed); – graphical (pathways, chromosome maps). Outline • Overview of the Bioconductor project • Annotation • Visualization • Pre-processing: spotted and Affy arrays • Differential gene expression • Clustering and classification Acknowledgments • Bioconductor core team • Vince Carey, Biostatistics, Harvard • Yongchao Ge, Statistics, UC Berkeley • Robert Gentleman, Biostatistics, Harvard • Jeff Gentry, Dana-Farber Cancer Institute • Rafael Irizarry, Biostatistics, Johns Hopkins -
The Split-Apply-Combine Strategy for Data Analysis
JSS Journal of Statistical Software April 2011, Volume 40, Issue 1. http://www.jstatsoft.org/ The Split-Apply-Combine Strategy for Data Analysis Hadley Wickham Rice University Abstract Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece inde- pendently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored. The paper includes two case studies showing how these insights make it easier to work with batting records for veteran baseball players and a large 3d array of spatio-temporal ozone measurements. Keywords: R, apply, split, data analysis. 1. Introduction What do we do when we analyze data? What are common actions and what are common mistakes? Given the importance of this activity in statistics, there is remarkably little research on how data analysis happens. This paper attempts to remedy a very small part of that lack by describing one common data analysis pattern: Split-apply-combine. You see the split-apply- combine strategy whenever you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This crops up in all stages of an analysis: During data preparation, when performing group-wise ranking, standardization, or nor- malization, or in general when creating new variables that are most easily calculated on a per-group basis. -
Hadley Wickham, the Man Who Revolutionized R Hadley Wickham, the Man Who Revolutionized R · 51,321 Views · More Stats
12/15/2017 Hadley Wickham, the Man Who Revolutionized R Hadley Wickham, the Man Who Revolutionized R · 51,321 views · More stats Share https://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/ 1/10 12/15/2017 Hadley Wickham, the Man Who Revolutionized R “Fundamentally learning about the world through data is really, really cool.” ~ Hadley Wickham, prolific R developer *** If you don’t spend much of your time coding in the open-source statistical programming language R, his name is likely not familiar to you -- but the statistician Hadley Wickham is, in his own words, “nerd famous.” The kind of famous where people at statistics conferences line up for selfies, ask him for autographs, and are generally in awe of him. “It’s utterly utterly bizarre,” he admits. “To be famous for writing R programs? It’s just crazy.” Wickham earned his renown as the preeminent developer of packages for R, a programming language developed for data analysis. Packages are programming tools that simplify the code necessary to complete common tasks such as aggregating and plotting data. He has helped millions of people become more efficient at their jobs -- something for which they are often grateful, and sometimes rapturous. The packages he has developed are used by tech behemoths like Google, Facebook and Twitter, journalism heavyweights like the New York Times and FiveThirtyEight, and government agencies like the Food and Drug Administration (FDA) and Drug Enforcement Administration (DEA). Truly, he is a giant among data nerds. *** Born in Hamilton, New Zealand, statistics is the Wickham family business: His father, Brian Wickham, did his PhD in the statistics heavy discipline of Animal Breeding at Cornell University and his sister has a PhD in Statistics from UC Berkeley. -
The Rockerverse: Packages and Applications for Containerisation
PREPRINT 1 The Rockerverse: Packages and Applications for Containerisation with R by Daniel Nüst, Dirk Eddelbuettel, Dom Bennett, Robrecht Cannoodt, Dav Clark, Gergely Daróczi, Mark Edmondson, Colin Fay, Ellis Hughes, Lars Kjeldgaard, Sean Lopp, Ben Marwick, Heather Nolis, Jacqueline Nolis, Hong Ooi, Karthik Ram, Noam Ross, Lori Shepherd, Péter Sólymos, Tyson Lee Swetnam, Nitesh Turaga, Charlotte Van Petegem, Jason Williams, Craig Willis, Nan Xiao Abstract The Rocker Project provides widely used Docker images for R across different application scenarios. This article surveys downstream projects that build upon the Rocker Project images and presents the current state of R packages for managing Docker images and controlling containers. These use cases cover diverse topics such as package development, reproducible research, collaborative work, cloud-based data processing, and production deployment of services. The variety of applications demonstrates the power of the Rocker Project specifically and containerisation in general. Across the diverse ways to use containers, we identified common themes: reproducible environments, scalability and efficiency, and portability across clouds. We conclude that the current growth and diversification of use cases is likely to continue its positive impact, but see the need for consolidating the Rockerverse ecosystem of packages, developing common practices for applications, and exploring alternative containerisation software. Introduction The R community continues to grow. This can be seen in the number of new packages on CRAN, which is still on growing exponentially (Hornik et al., 2019), but also in the numbers of conferences, open educational resources, meetups, unconferences, and companies that are adopting R, as exemplified by the useR! conference series1, the global growth of the R and R-Ladies user groups2, or the foundation and impact of the R Consortium3. -
Installing R and Cran Binaries on Ubuntu
INSTALLING R AND CRAN BINARIES ON UBUNTU COMPOUNDING MANY SMALL CHANGES FOR LARGER EFFECTS Dirk Eddelbuettel T4 Video Lightning Talk #006 and R4 Video #5 Jun 21, 2020 R PACKAGE INSTALLATION ON LINUX • In general installation on Linux is from source, which can present an additional hurdle for those less familiar with package building, and/or compilation and error messages, and/or more general (Linux) (sys-)admin skills • That said there have always been some binaries in some places; Debian has a few hundred in the distro itself; and there have been at least three distinct ‘cran2deb’ automation attempts • (Also of note is that Fedora recently added a user-contributed repo pre-builds of all 15k CRAN packages, which is laudable. I have no first- or second-hand experience with it) • I have written about this at length (see previous R4 posts and videos) but it bears repeating T4 + R4 Video 2/14 R PACKAGES INSTALLATION ON LINUX Three different ways • Barebones empty Ubuntu system, discussing the setup steps • Using r-ubuntu container with previous setup pre-installed • The new kid on the block: r-rspm container for RSPM T4 + R4 Video 3/14 CONTAINERS AND UBUNTU One Important Point • We show container use here because Docker allows us to “simulate” an empty machine so easily • But nothing we show here is limited to Docker • I.e. everything works the same on a corresponding Ubuntu system: your laptop, server or cloud instance • It is also transferable between Ubuntu releases (within limits: apparently still no RSPM for the now-current Ubuntu 20.04)