Data Structures and Algorithms for Analysis of Alternative Splicing with RNA-Seq Data
Total Page:16
File Type:pdf, Size:1020Kb
Data Structures and Algorithms for Analysis of Alternative Splicing with RNA-Seq Data Marcel H. Schulz June 2010 Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) am Fachbereich Mathematik und Informatik der Freien Universität Berlin Gutachter: Prof. Dr. Martin Vingron Prof. Dr. Jens Stoye 1. Referent: Prof. Dr. Martin Vingron 2. Referent: Prof. Dr. Jens Stoye Tag der Promotion: 26 August 2010 Preface The first part of the thesis in Chapter 3 was published in the journal Science, as part of a collaborative study that addressed the first application of RNA-Seq experiments to two human cell lines in 2008 [147]. The methods for prediction and quantification of alternative isoforms and their application presented in Chapter 3 and 4 appeared in the journal Nucleic Acids Research in 2010 [129]. The last part about the de novo transcriptome assembly method Oases has not been published yet, but a manuscript is in preparation. The successful application of Oases to human, fly, and worm RNA- Seq data will be published in the report about the RGASP competition this year. My contributions to these papers was the design of statistical methods and the anal- ysis of alternative splicing with junctions reads for the Science paper. I designed, implemented, and analyzed the CASI and DASI method. I was involved in the con- ception and analysis of the POEM method, analysis of the exon array data, as well as primer design for the PCR experiments and their evaluation. I implemented a preliminary R version of the initial steps of the transcriptome assembler, addressing loci and trivial transcript reconstruction. I developed the theory in Section 5.1 and made the analysis with Oases in Sections 5.3 and 5.4. I was involved in the algorithm design in Section 5.3 and error analysis of the Oases software. I did the transcriptome assembly and parts of the downstream analysis for the RGASP submissions. There are a number of other contributions that are unfortunately not in the thesis. I have implemented linear time algorithms for the construction of variable order Markov chains and the first algorithm for the score distribution computation for ontological similarity searches, presented at the WABI conference in 2008 and 2009 [142, 141]. Also I designed and implemented a linear time truncated suffix tree algorithm [140]. Further, I was involved as a co-author in projects about clinical diagnostics with i ontologies [81], basepair-precise breakpoint detection of human structural variations in resequencing data [27, 170], the influence of highly conserved sequence elements on gene expression [132, 54], and algorithms for frequency pattern mining [164]. Acknowledgements I am grateful to my supervisor Martin Vingron for his ideas, support, initiating of collaborations, and especially for his suggestion to start working with de Bruijn graphs. It is a wonderful atmosphere that he created in his group that I have enjoyed throughout the years. I also want to thank him and the International Max Planck Research School for Computational Biology and Scientific Computing for funding my Phd time and my stay in Cambridge. I also like to thank Hugues Richard, who supervised me for the project about statistical methods with RNA-Seq data (Chapter 3 and 4). My knowledge about statistics, R, and many other important subjects in life have grown considerably due to his influence. I acknowledge his writing of the functions for the POEM algorithm, help with its design and analysis, primer design, exon array analysis, and application of POEM on the RGASP data. I want to thank Daniel R. Zerbino for sharing his expertise about assembly algorithms, which has been an important contribution to the project. In particular, I acknowledge his conception for the treatment of cycles, adaptations for the transitive reduction algorithm of Myers, and running ABySS. I am very grateful to Daniel that he accepted to implement the ideas described in Section 5.2 in the Oases software, which made it possible to have results in time for the RGASP competition. I further want to thank Ewan Birney for financing my stay in Cambridge, hosting me at the EBI, and inspiring and motivating discussions about transcriptome assembly. I want to thank Marc Sultan for conducting the RNA-Seq, PCR, and exon array experiments, help with primer design, and many interesting discussions about wet lab biology. Additionally, I thank Marie-Laure Yaspo for financing the experiments, as well as, Hans Lehrach, Asja Nürnberger, Sabine Schrinner, Daniela Balzereit, and Emilie Dagand, who conducted or supervised parts of RNA-Seq, PCR, and exon array experiments for Chapter 3 and 4. Further, I would like to thank Stefan Haas for sev- eral discussions about alternative splicing and help with EST analysis, paper writing, and primer design. I further thank David Weese for help with setting up a pipeline for the RGASP competition and other interesting projects with did together. Thanks to Axel Rasche for sharing his knowledge about exon array analysis and preprocessing the human exon array data. Thanks to Andreas Klingenhoff, Alon Magen, Dmitri ii Parkhomchuk, Matthias Scherf, and Martin Seifert for preprocessing and prepara- tion of data for the Science paper. Thanks to Cole Trapnell, Ali Mortazavi, and Dian Trout for sharing the mouse C2C12 data. I want to thank Knut Reinert and Peter N. Robinson for constant support and guid- ance. Thanks to Jens Stoye for reviewing my thesis, Roland Krause for joining my "Verteidigungskommission" on very short notice. I would like to thank Hannes Luz for support throughout my Phd and especially in the last minutes as well as Kirsten Kelleher for support. Also thanks to Anne-Kathrin Emde, Stefan Haas, Marta Luk- sza, Alena Mysickova, Hugues Richard, Christian Rödelsperger, Marc Sultan, Ewa Szczurek, David Weese, and Daniel R. Zerbino for great comments from proofread- ing. In addition, I would like to thank Sebastian Bauer, Sebastian Köhler, Jonathan Göke, Tobias Rausch, Christian Rödelsperger, Stefan Roepcke, and Silke Stahlberg and all so far unmentioned members of the Vingron department for help, parties, and other interesting projects I was involved in. I want to thank Markus Bauer, Florian Markowetz, Ole Schulz-Trieglaff as well as the EBI Pre- and Postdocs for sharing many private parties and taking care of me during my stay in Cambridge. Finally, I want to express my biggest gratitude to my parents and my girlfriend Susi. Their long lasting support and love is the foundation of all my achievements. I love you. iv Contents 1 Introduction 1 1.1 DNA, Gene Expression, and Alternative Splicing . .1 1.2 DNA sequencing . .4 1.3 Methods for Detection of Alternative Splicing . .4 1.4 Thesis Organization . .7 2 Sequences and Graphs in Computational Biology 11 2.1 Definitions . 11 2.1.1 Strings . 11 2.1.2 Graphs . 12 2.1.3 Sensitivity and Specificity . 12 2.2 Sequence Alignment . 13 2.3 Data structures and Algorithms for Genome Assembly . 14 2.3.1 Genome Assembly with the Overlap-Layout-Consensus Paradigm 14 2.3.2 Genome Assembly Using the Eulerian Path Paradigm . 15 2.3.3 The Velvet Genome Assembler . 18 2.4 Data Structures and Algorithms for EST Assembly and Analysis . 20 2.4.1 EST Assembly . 20 2.4.2 Splicing Graphs . 20 3 Prediction of Alternative Isoforms 23 3.1 Prediction of Alternative Splicing Events from RNA-Seq Data . 23 3.1.1 A General Stochastic Count Model for Transcriptome Analysis 23 3.2 Prediction of Alternative Splicing Events with Exon Junction Read Evidence . 25 3.2.1 Reference based Spliced Alignment of RNA-Seq Reads . 25 3.2.2 Application to Human RNA-Seq data . 28 v Contents 3.3 Prediction of Alternative Isoforms with Exon Expression Levels . 29 3.3.1 Alternative Exon Usage within a Condition . 31 3.3.2 Alternative Exon Usage between two Conditions . 38 4 Quantification of Alternative Isoforms 45 4.1 From Gene to Transcript Expression Levels . 45 4.2 Quantification of Transcript Expression Levels . 46 4.2.1 Simulations . 51 4.2.2 Proportion Estimation with Junction Reads . 52 4.3 Application to Human RNA-Seq Data . 53 4.3.1 Experimental Validation . 54 5 De Novo Assembly of Transcripts considering Alternative Isoforms 57 5.1 De Novo Assembly of Transcript Sequences . 57 5.1.1 Problem Statement . 59 5.1.2 Transcript de Bruijn Graphs . 59 5.1.3 Recognition of Alternative Exon Events in Transcript de Bruijn Graphs . 64 5.2 Oases: a de novo Transcriptome Assembler Based on Transcript de Bruijn Graphs . 68 5.2.1 Error Correction and Collapsing . 69 5.2.2 Scaffolding of Loci . 69 5.2.3 Recognition of Trivial Structures . 73 5.2.4 Prediction of Full Length Transcript Sequences . 74 5.2.5 Merged Assemblies . 77 5.2.6 Transcript Confidence Scores . 78 5.2.7 Prediction of Alternative Exon Events . 78 5.3 Influence of Repeats, Domains and Paralogs . 79 5.4 Application to Paired-End RNA-Seq data . 82 5.4.1 Data Sets . 82 5.4.2 Influence of Parameter k ..................... 83 5.4.3 Comparison with ABySS . 85 5.4.4 Comparison with Cufflinks . 87 6 Discussion 91 vi Contents Bibliography 97 Notation and Definitions 119 Zusammenfassung 123 Summary 125 Software Availability 127 Appendix 129 Ehrenwörtliche Erklärung 139 vii viii List of Figures 1.1 Transcription and Alternative Splicing . .2 1.2 Alternative Splicing Events . .3 1.3 Comparison of Traditional Sanger and Next-Generation Sequencing Approaches . .6 1.4 Summary of the RNA-Seq Protocol . .8 2.1 Overview of Error Correction in Velvet . 17 2.2 Twin Nodes in Velvet . 18 3.1 Complete Splicing Graphs for Splice Junction Extraction . 26 3.2 Distribution of Major AS Events in HEK and B Cell Lines .