Flexible Approach for Novel Transcript Reconstruction from RNA-Seq Data Using Maximum Likelihood Integer Programming
Serghei Mangul, Georgia State University, Atlanta, GA, 30303, USA
Adrian Caciula, Georgia State University, Atlanta, GA, 30303, USA
Sahar A. Seesi, University of Connecticut, Storrs, CT, 06269, USA
Dumitru Brinza, Life Technologies Corporation, Foster City, CA, 94404, USA
Abdul R. Banday, University of Connecticut, Storrs, CT, 06269, USA
Rahul Kanadia, University of Connecticut, Storrs, CT, 06269, USA
Ion Mandoiu, University of Connecticut, Storrs, CT, 06269, USA
Alex. Zelikovsky, Georgia State University, Atlanta, GA, 30303, USA

Abstract

In this paper, we propose a novel, intuitive and flexible approach for transcriptome reconstruction from single-end RNA-Seq reads, called the “Maximum Likelihood Integer Programming” (MLIP) method. MLIP creates a splice graph based on aligned RNA-Seq reads and enumerates all maximal paths corresponding to putative transcripts. The problem of selecting true transcripts is formulated as an integer program which minimizes the number of selected candidate transcripts. The purpose of our method is to predict the minimum number of transcripts explaining the set of input reads with the highest quantification accuracy. This is achieved by coupling an integer programming formulation with an expectation maximization model for transcript expression estimation. MLIP has the advantage of offering different levels of stringency that gear the results towards higher precision or higher sensitivity, according to the user's preference. We test the MLIP method on simulated and real data, and we show that MLIP outperforms both Cufflinks and IsoLasso.

1 Introduction

Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, with its ability to generate full transcriptome data at the single transcript level, provides a powerful tool for gene and isoform specific expression profiling. As a result, RNA-Seq has become the technology of choice for performing transcriptome analysis, rapidly replacing array-based technologies. Most current research using RNA-Seq employs methods that depend on existing transcriptome annotations. Unfortunately, as shown by recent targeted RNA-Seq studies [1], existing transcript libraries still miss large numbers of transcripts. The incompleteness of annotation libraries poses a serious limitation to using this powerful technology, since accurate normalization of RNA-Seq data critically requires knowledge of expressed transcript sequences [2, 3, 4, 5]. As a result, transcript discovery from RNA-Seq has been the focus of much research in recent years. The sequences of novel transcripts can be reconstructed from deep RNA-Seq data, but this is computationally challenging due to sequencing errors, uneven coverage of expressed transcripts, and the need to distinguish between highly similar transcripts produced by alternative splicing.

2 Related Work

A number of recent works have addressed the problem of transcriptome reconstruction from RNA-Seq reads. These methods fall into three categories: “genome-guided”, “genome-independent” and “annotation-guided” methods [6]. Genome-independent methods such as Trinity [7] or Trans-ABySS [8] directly assemble reads into transcripts. A commonly used approach for such methods is the de Bruijn graph [9] built over k-mers. The use of genome-independent methods becomes essential when there is no trusted genome reference that can be used to guide reconstruction. On the other end of the spectrum, annotation-guided methods [10, 11, 12] make use of available information in existing transcript annotations to aid in the discovery of novel transcripts. RNA-Seq reads can be mapped onto the reference genome, reference annotations, exon-exon junction libraries, or combinations thereof, and the resulting alignments are used to reconstruct transcripts.

Many transcriptome reconstruction methods fall in the genome-guided category. They typically start by mapping sequencing reads onto the reference genome using spliced alignment tools such as TopHat [13] or SpliceMap [14]. The spliced alignments are used to identify exons and transcripts that explain the alignments. While some methods aim to achieve the highest sensitivity, others work to predict the smallest set of transcripts explaining the given input reads. Furthermore, some methods aim to reconstruct the set of transcripts that would ensure the highest quantification accuracy. Scripture [15] constructs a splicing graph from the mapped reads and reconstructs isoforms corresponding to all possible paths in this graph. It then uses paired-end information to filter out some transcripts. Although Scripture achieves very high sensitivity, it may predict many incorrect isoforms. The method of Trapnell et al. [16, 17], referred to as Cufflinks, constructs a read overlap graph and generates candidate transcripts by finding a minimal size path cover via a reduction to maximum matching in a weighted bipartite graph. TRIP [18] uses an integer programming model where the objective is to select the smallest set of putative transcripts that yields a good statistical fit between the fragment length distribution empirically determined during library preparation and the fragment lengths implied by mapping read pairs to selected transcripts. Cufflinks, Scripture, and TRIP do not target quantification accuracy. IsoLasso [19] uses the LASSO [20] algorithm, and it aims to achieve a balance between quantification accuracy and predicting the minimum number of isoforms. It formulates the problem as a quadratic program, with additional constraints to ensure that all exons and junctions supported by the reads are included in the predicted isoforms. CLIIQ [21] uses an integer linear programming solution that minimizes the number of predicted isoforms explaining the RNA-Seq reads while minimizing the difference between estimated and observed expression levels of exons and junctions within the predicted isoforms.

In this paper, we present a genome-guided method for transcriptome reconstruction from RNA-Seq reads. Our method aims to predict the minimum number of transcripts explaining the set of input reads with the highest quantification accuracy. This is achieved by coupling an integer programming formulation with an expectation maximization model for isoform expression estimation.

Recent advances in Next Generation Sequencing (NGS) technologies have made it possible to produce longer single-end reads, with length comparable to the fragment length of paired-end technology [22]. The primary goal of our study is to develop a method for longer single-end reads. In the future we plan to extend our method to support paired-end reads. We test our method on simulated and real data and compare it with two of the available genome-guided methods, Cufflinks and IsoLasso.

3 Methods

The maximum likelihood integer programming (MLIP) method starts from a set of putative transcripts and selects the subset of these transcripts with the highest support from the RNA-Seq reads. We formulate this problem as an integer program. The objective is to select the smallest set of putative transcripts that sufficiently explains the RNA-Seq data. Further, a maximum likelihood estimator is applied to all possible combinations of putative transcripts of the minimum size determined by the integer program. Our method offers different levels of stringency, from low to high. Low stringency gives priority to sensitivity of reconstruction over precision, while high stringency gives priority to precision over sensitivity. The default setting of the MLIP method is medium stringency, which achieves a balance between sensitivity and precision of reconstruction.

3.1 Model description

We use a splice graph (SG) to represent alternatively spliced isoforms for every gene in a sample. An SG is a directed acyclic graph where each vertex represents a segment of a gene. Two segments are connected by an edge if they are adjacent in at least one transcript. To partition a gene into a set of non-overlapping segments, information about alternative variants is used. Typically, alternative variants occur due to alternative transcriptional events and alternative splicing events [23]. Transcriptional events include alternative first exon and alternative last exon. Splicing events include exon skipping, intron retention, alternative 5’ splice site (A5SS), and alternative 3’ splice site (A3SS). Transcriptional events may consist only of non-overlapping exons. If exons partially overlap and serve as first or last exons, we refer to such an event as A5SS or A3SS, respectively.

Figure 1-A shows an example of a gene with 4 different exons and 3 transcripts produced by alternative splicing. To represent such alternative variants, we suggest processing the gene as a set of so-called “pseudo-exons” based on alternative variants obtained from aligned RNA-Seq reads. A pseudo-exon is a region of a gene between consecutive transcriptional or splicing events, i.e. the start or end of an exon, as shown in Figure 1-B. Hence every gene has a set of non-overlapping pseudo-exons, from which it is possible to reconstruct a set of putative transcripts.

[...] can use existing annotation for PAS or specialized protocols such as the PolyA-Seq protocol [24].

[Figure: panel A, “Map single reads to genome and identify the pseudo-exons” (genome with pse1, intron, pse2, pse3, pse4); panel B, “Splice Graph (SG)” (vertices 1 to 4); legend: pseudo-exons, introns.]

3.2 Maximum Likelihood Integer Programming Solution

Here we introduce a 2-step approach for novel transcript reconstruction from single-end RNA-Seq reads. First, we introduce the integer program (IP) formulation, whose objective is to minimize the number of transcripts that cover the observed reads sufficiently well. Since such a formulation can lead to many equally optimal solutions, we use an additional step to select the maximum likelihood solution based on the deviation between observed and expected read frequencies.
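The pseudo-exon partition, the splice graph with its maximal paths, and the idea of a smallest explaining transcript set can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not the authors' implementation: exons are given as hypothetical half-open coordinate intervals, a read is represented as the tuple of pseudo-exon indices it spans, a transcript "explains" a read if those indices appear contiguously in its path, and a brute-force search stands in for the integer program. All function names are illustrative, and the maximum likelihood tie-breaking step is not shown.

```python
from itertools import combinations

def pseudo_exons(exons):
    """Split exon intervals (start, end), end exclusive, at every exon boundary."""
    cuts = sorted({p for s, e in exons for p in (s, e)})
    # Keep a segment only if some exon covers it; uncovered segments are intronic.
    return [(a, b) for a, b in zip(cuts, cuts[1:])
            if any(s <= a and b <= e for s, e in exons)]

def splice_graph(n, reads):
    """Adjacency sets over n pseudo-exons; edge u -> v if a read joins u to v."""
    adj = {v: set() for v in range(n)}
    for read in reads:
        for u, v in zip(read, read[1:]):
            adj[u].add(v)
    return adj

def maximal_paths(adj):
    """All source-to-sink paths of the DAG, i.e. the candidate transcripts."""
    has_incoming = {v for targets in adj.values() for v in targets}
    paths = []

    def extend(path):
        successors = sorted(adj[path[-1]])
        if not successors:            # reached a sink: the path is maximal
            paths.append(tuple(path))
        for v in successors:
            extend(path + [v])

    for source in sorted(v for v in adj if v not in has_incoming):
        extend([source])
    return paths

def explains(path, read):
    """True if the read's pseudo-exons occur contiguously in the path."""
    k = len(read)
    return any(path[i:i + k] == read for i in range(len(path) - k + 1))

def smallest_covers(paths, reads):
    """All minimum-size subsets of paths explaining every read (brute force,
    a stand-in for the integer program; only practical for tiny examples)."""
    for size in range(1, len(paths) + 1):
        found = [c for c in combinations(paths, size)
                 if all(any(explains(p, r) for p in c) for r in reads)]
        if found:
            return found
    return []

# Hypothetical gene: 4 distinct exons used by 3 transcripts.
exons = [(0, 100), (200, 300), (200, 350), (400, 500)]
segments = pseudo_exons(exons)

# Hypothetical reads over the 4 pseudo-exons (indices into `segments`).
reads = [(0, 1), (1, 3), (0, 3), (1, 2), (2, 3)]
paths = maximal_paths(splice_graph(len(segments), reads))
covers = smallest_covers(paths, reads)

# A second toy graph where two different minimum covers exist.
tie_reads = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (3, 5)]
tie_paths = maximal_paths(splice_graph(6, tie_reads))
tie_covers = smallest_covers(tie_paths, tie_reads)
```

In the second toy graph, four candidate paths exist and two different pairs of them explain all reads, which is the kind of tie that the second step of the method resolves by maximum likelihood.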