R / Bioconductor for High-Throughput Sequence Analysis
Nicolas Delhomme
21 October - 26 October, 2013

Contents

1 Day2 of the workshop
  1.1 Introduction
  1.2 Main Bioconductor packages of interest for the day
  1.3 A word on High-throughput sequence analysis
  1.4 A word on Integrated Development Environment (IDE)
  1.5 Today's schedule
2 Prelude
  2.1 Purpose
  2.2 Creating GAlignment objects from BAM files
  2.3 Processing the files in parallel
  2.4 Processing the files one chunk at a time
  2.5 Pros and cons of the current solution
    2.5.1 Pros
    2.5.2 Cons
3 Sequences and Short Reads
  3.1 Alignments and Bioconductor packages
    3.1.1 The pasilla data set
    3.1.2 Alignments and the ShortRead package
    3.1.3 Alignments and the Rsamtools package
    3.1.4 Alignments and other Bioconductor packages
    3.1.5 Resources
4 Interlude
5 Estimating Expression over Genes and Exons
  5.1 Counting reads over known genes and exons
    5.1.1 The alignments
  5.2 Discovering novel transcribed regions
  5.3 Using easyRNASeq
  5.4 Where to from here
Chapter 1
Day2 of the workshop

1.1 Introduction

This portion of the workshop introduces the use of R [15] and Bioconductor [5] for the analysis of high-throughput sequence (HTS) data; specifically, the manipulation of HTS read alignments and their use to estimate expression over exons, transcripts and genes. The workshop is structured as a series of short remarks followed by group exercises. The exercises explore the diversity of tasks for which R / Bioconductor are appropriate, but are far from comprehensive.

The goals of this part of the workshop are to: (1) develop familiarity with R / Bioconductor packages for high-throughput analysis; (2) in particular, with those needed to manipulate HTS read alignment files and to derive expression estimates over genic features; and (3) provide inspiration and a framework for further independent exploration.

1.2 Main Bioconductor packages of interest for the day

Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Among these, we will focus principally on three: ShortRead, Rsamtools and GenomicRanges.

1.3 A word on High-throughput sequence analysis

Recent technological developments have introduced high-throughput sequencing approaches. A variety of experimental protocols and analysis workflows address gene expression, regulation, and encoding of genetic variants. Experimental protocols produce a large number (tens of millions per sample) of short (e.g., 35-250 nucleotides long, single or paired-end) nucleotide sequences. These are aligned to a reference or other genome. Analysis workflows use the alignments to infer levels of gene expression (RNA-seq), binding of regulatory elements to genomic locations (ChIP-seq), or prevalence of structural variants (e.g., SNPs, short indels, large-scale genomic rearrangements). Sample sizes range from minimal replication (e.g., 2 samples per treatment group) to thousands of individuals.

1.4 A word on Integrated Development Environment (IDE)

There are numerous tools to support developing programs and software in R. For this course, we have selected one of them: the RStudio environment, which provides a feature-rich, user-friendly, cross-platform environment for working with R.

1.5 Today's schedule

Table 1.1: EMBO2013 AHTSD workshop day2 Schedule

Time   Description
09:00  Lecture: Representing and manipulating alignments
09:45  Practical: Representing and manipulating alignments
10:30  Coffee break
10:45  Practical (continued): Representing and manipulating alignments
12:30  Lunch
13:30  Lecture: Estimating expression over genes and exons
14:30  Practical: Estimating expression over genes and exons
15:30  Coffee break
15:45  Lecture: Working without a "reference" genome
16:30  Practical: Discovering novel transcribed regions
17:30  Question and Answer session - preferably at the Red Lion
18:30  Dinner

Chapter 2
Prelude

2.1 Purpose

Before getting familiar with the functionality of the Bioconductor packages presented in the lecture, we will first apply the knowledge you have gathered so far to the computational challenges faced when using HTS data, i.e. resource and time consumption. In the lecture, the readGAlignmentsFromBam function from the Rsamtools package was introduced and used to extract a GAlignments object. However, most of the time, an experiment will NOT consist of a single sample (of only 2.5M reads!) and an obvious way to speed up the process is to parallelize. In the following three sections, we will see how to do this, before discussing the pros and cons of the implemented method.

2.2 Creating GAlignment objects from BAM files

Exercise 1
First of all, locate the BAM files and implement a function to read them sequentially. Have a look at the lapply function man page for doing so.
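
As a reminder of how lapply behaves, here is a minimal sketch that is independent of the workshop data. It applies one function to every element of a character vector of file paths and returns a list of the same length. The temporary text files and the countLines helper are made up for illustration; they only stand in for the BAM files and the reading function of the exercise.

```r
# Hypothetical stand-in files; the exercise itself uses BAM files.
paths <- file.path(tempdir(), c("sampleA.txt", "sampleB.txt"))
writeLines(c("read1", "read2", "read3"), paths[1])
writeLines(c("read1", "read2"), paths[2])

# Illustrative helper: count the lines of one file.
countLines <- function(path) length(readLines(path))

# lapply() calls countLines() once per path, in order, and returns a list.
res <- lapply(paths, countLines)
names(res) <- basename(paths)
str(res)
```

The key point for the exercise is that the function passed to lapply takes a single file path as its first argument, so any reader with that shape can be dropped in.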

Solution:

> library(Rsamtools)
> # dir() expects a regular expression, hence "\\.bam$" rather than a glob
> bamfiles <- dir(system.file("bigdata","bam",package="EMBO2013Day2"),
+                 pattern="\\.bam$",full.names=TRUE)
> # read every BAM file in turn; the result is a list of GAlignments
> gAlns <- lapply(bamfiles,readGAlignmentsFromBam)

Nothing complicated so far; if anything is unclear, speak up. We process both files sequentially and get a list of GAlignments objects stored in the gAlns object. Apart from the coding enhancement - with one line, we can process all our samples - there is no other gain.

2.3 Processing the files in parallel

Modern laptop CPUs possess several cores that can perform tasks independently, commonly 2 to 4. Computational servers usually have many CPUs (commonly 8), each having several cores. An obvious enhancement to our previous solution is to take advantage of this CPU architecture and to process our samples in parallel.

Exercise 2
Have a look at the parallel package, and in particular at the mclapply function, to re-implement the previous function in a parallel manner.

Solution:

> library(parallel)
> gAlns <- mclapply(bamfiles,readGAlignmentsFromBam)

Exercise 3
Could you figure out how many cores were used in parallel when running the previous line? Can you explain why that was so?

Solution:
It is NOT because there were 2 files to process. The mclapply function has a number of default parameters - see ?mclapply for details - including mc.cores, which defaults to 2. If you want to process more samples in parallel, set that parameter value accordingly.

This new implementation has the obvious advantage of being up to X times faster (with X being the number of cores used; slightly less in practice, as parallelization comes with a small processing cost), but it puts a different strain on the system. As several files are being processed in parallel, the memory requirement also increases by a factor X (assuming the files to be processed are of roughly equal size). This might be fine on a computational server, but given the constant increase in the number of sequencing reads produced per run, it will eventually become a limitation.
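
To make the role of mc.cores concrete, the following sketch checks how many cores the machine reports and passes an explicit value to mclapply. It uses a toy function rather than BAM files; slowSquare and the choice of two workers are illustrative assumptions, not part of the workshop code. Note that mclapply relies on forking, which is unavailable on Windows, where mc.cores must be 1.

```r
library(parallel)

# Number of cores reported by the system (can be NA on some platforms).
nCores <- detectCores()

# Forking is not available on Windows, so fall back to a single worker there.
workers <- if (.Platform$OS.type == "windows") 1L else 2L

# Toy task standing in for reading one BAM file.
slowSquare <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# One element of the input is handed to each worker at a time.
res <- mclapply(1:4, slowSquare, mc.cores = workers)
unlist(res)
```

Setting mc.cores above the number of files brings no benefit, and every additional worker holds its own result in memory, which is the trade-off discussed above.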

Exercise 4
Can you think of a way this memory issue could be addressed? That is, what could we modify in the way we read/process the file to limit the memory required at a given moment?

Solution:
No, buying more memory is usually not an option. And anyway, at the moment, the number of reads sequenced per run increases faster than memory capacity doubles. So, let us move on to the next section and have a go at addressing the issue.

2.4 Processing the files one chunk at a time

To limit the memory required at any moment, one approach is to process the file not as a whole, but chunk-wise. As we can assume that reads are stored independently in BAM files (or almost so; think of how paired-end data is stored!), we can simply decide to parse, e.g., 1,000,000 reads at a time. This will of course require a new way to represent a BAM file in R, i.e. not just as a character string as we have had until now in our bamfiles object.

Exercise 5
The Rsamtools package again comes in handy. Look up the ?BamFile help page and try to work out how we could take advantage of the BamFile or BamFileList classes for our purpose.

Solution:
The yieldSize parameter of either class looks like exactly what we want. Let us recode our bamfiles character object into a BamFileList.

> bamFileList <- BamFileList(bamfiles,yieldSize=10^6)

Now that we have the BAM files described in a way that lets us process them chunk-wise, let us do so. The paradigm is as follows:

> # pick one file from the list, e.g. the first
> bamFile <- bamFileList[[1]]
> open(bamFile)
> # each call returns the next yieldSize reads; an empty chunk ends the loop
> while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+   message(length(chunk))
+ }
> close(bamFile)

Exercise 6
In the paradigm above, we process one BAM file chunk-wise and report the sizes of the chunks. In our case, these would be 1M reads each, apart from the last one, which would be smaller than or equal to 1M (it is unlikely that a sequencing file contains an exact multiple of our chunk size).
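
The same open / read-until-empty / close pattern can be tried without any BAM file. The sketch below is only an analogy: it uses a plain text connection from base R, with chunkSize playing the role of yieldSize, and it is not a description of how Rsamtools reads BAM files internally. The file of 25 fake "reads" is made up for illustration.

```r
# Hypothetical stand-in for a BAM file: a text file holding 25 "reads".
path <- tempfile(fileext = ".txt")
writeLines(sprintf("read%02d", 1:25), path)

chunkSize <- 10L               # plays the role of yieldSize
con <- file(path, open = "r")  # an open connection remembers its position
sizes <- integer(0)

# Each readLines() call returns the next chunk; a zero-length result
# signals the end of the file and stops the loop.
while (length(chunk <- readLines(con, n = chunkSize))) {
  sizes <- c(sizes, length(chunk))
}
close(con)

sizes
```

With 25 records and a chunk size of 10, the loop sees two full chunks and a shorter final one, which mirrors the behaviour described for the 1M-read chunks.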
Now, try to implement the above paradigm in the function we implemented previously (see the solution in Section 2.3) so as to process both our BAM files in parallel, chunk-wise.

Solution:

> gAlns <- mclapply(bamFileList,function(bamFile){
+   open(bamFile)
+   # start from an empty object and append each chunk as it is read
+   gAln <- GAlignments()
+   while(length(chunk <- readGAlignmentsFromBam(bamFile))){
+     gAln <- c(gAln,chunk)
+   }
+   close(bamFile)
+   return(gAln)
+ })

2.5 Pros and cons of the current solution

Exercise 7
Before reading my comments below, take the time to jot down what you think are the advantages and drawbacks of the method implemented above. My own comments below are certainly not exhaustive, and I would be curious to hear any of yours that do not match mine.

Solution:

2.5.1 Pros

a. We have written a streamlined piece of code, using up-to-date functionality from other packages.