Next-Generation Sequencing Algorithms: From Read Mapping to Variant Detection

Dissertation zur Erlangung des Grades Doktor der Naturwissenschaften (Dr. rer. nat.)

am Fachbereich Mathematik und Informatik der Freien Universität Berlin

vorgelegt von Anne-Katrin Emde

Berlin, 15. November 2012

Betreuer:

Prof. Dr. Knut Reinert, Freie Universität Berlin
Prof. Dr. Martin Vingron, Max-Planck-Institut für Molekulare Genetik, Berlin

Gutachter:

Prof. Dr. Knut Reinert, Freie Universität Berlin
Prof. Dr. Martin Vingron, Max-Planck-Institut für Molekulare Genetik, Berlin
Prof. Dr. Peter Nürnberg, Universität Köln

Datum der Disputation: 17. April 2013

Abstract

Next-generation sequencing (NGS) has brought about a revolution in sequence analysis, with a broad spectrum of applications ranging from genome resequencing to transcriptomics or metagenomics, and from fundamental research to diagnostics. The tremendous amounts of data necessitate highly efficient computational analysis tools for the wide variety of NGS applications. This thesis addresses a broad range of key computational aspects of resequencing applications, where a reference genome sequence is known and heavily used for the interpretation of the newly sequenced sample. It presents tools for read mapping and benchmarking, for partial read mapping of small RNA reads and for structural variant/indel detection, and finally tools for detecting and genotyping SNVs and short indels. Our tools efficiently scale to large NGS data sets and are well-suited for advances in sequencing technology, since their generic algorithm design allows handling of arbitrary read lengths and variable error rates. Furthermore, they are implemented within the robust C++ library SeqAn, making them open-source, easily available, and potentially adaptable for the bioinformatics community. Among other applications, our tools have been integrated into a large-scale analysis pipeline and have been applied to large datasets, leading to interesting discoveries of human retrocopy variants and insights into the genetic causes of X-linked intellectual disabilities.

Zusammenfassung

Neueste DNA-Sequenzierungstechnologien (kurz NGS-Technologien) ermöglichen revolutionäre neue Anwendungen, die sowohl von Genomresequenzierung über Transkriptomsequenzierung bis zu Metagenomik als auch von Grundlagenforschung bis Diagnostik reichen. Problematisch ist dabei die Flut an Daten, die eine große Herausforderung für die Bioinformatik darstellt. Hocheffiziente Analysesoftware ist von enormer Wichtigkeit für das breite Spektrum von NGS-Anwendungen. Diese Arbeit adressiert mehrere Schlüsselaspekte der Analyse von Resequenzierungsdaten, bei der ein bereits sequenziertes Referenzgenom als Grundlage für die Interpretation eines neu sequenzierten Datensatzes dient. Es werden Algorithmen und Programme präsentiert für das sogenannte Read-Mapping-Problem und für die Auswertung der Güte seiner Lösung, für partielles Read Mapping, welches in miRNA-Studien und bei der Suche nach strukturellen Variationen Anwendung findet, sowie letztlich zum Auffinden und Genotypisieren von Basenmutationen und kurzen Insertionen/Deletionen im Genom. Die vorgestellten Algorithmen sind effizient und so gestaltet, dass sie auch bei Fortschritten in Sequenzierungstechnologien weiterhin anwendbar und skalierbar bleiben. Zudem sind sie in der robusten C++-Bibliothek SeqAn implementiert, was sie leicht zugänglich und adaptierbar macht. Unter anderem wurden unsere Tools in eine Hochdurchsatz-Analysepipeline integriert und auf große Datensätze angewendet, wodurch interessante biologische Erkenntnisse (vor allem im Zusammenhang mit X-Chromosom-gebundener geistiger Behinderung) gewonnen werden konnten.

Acknowledgements

First and foremost, I want to thank my supervisors Knut Reinert and Martin Vingron, who gave me the opportunity to work on this thesis and therefore on the many diverse aspects of next-generation sequencing. I very much enjoyed working in the two groups and gaining experience from a theoretical as well as a more applied perspective. I am particularly grateful to Knut, who has been a true mentor for many years and whose ideas, advice and encouragement I greatly appreciate. Furthermore, I want to thank Stefan Haas, from whom I learned a lot and whose supervision made my usually systematic and theoretical approach more real-world applicable. I would also like to thank Peter Nürnberg for acting as external thesis examiner. I am very grateful to have been part of many fruitful and enjoyable collaborations. With David Weese I enjoyed working on RazerS. A lot of the work I did is based on his work and ideas. I thank Manuel Holtgrewe for the collaboration on Rabema, Marcel Grunert for the joint work on MicroRazerS and Marcel Schulz for working with me on SplazerS. Further thanks for pleasant collaborations go to Magdalena Feldhahn at the University in Tübingen, and Peter Krawitz and Na Zhu at the Humboldt University. I furthermore enjoyed working with Hugues Richard, Ruping Sun, Tomasz Zemojtel, and Vera Kalscheuer at the MPI-MG. Thank you to Birte Kehr, Stefan Haas, and René Rahn for proof-reading and to Patrick Schultz for helping me with figures. Special thanks to Birte for extensive proof-reading, moral support and many good conversations, both scientific and non-scientific. Also thanks to my other office mates during the years: Tobias Rausch, Sandro Andreotti, Sean O'Keefe, Kathrin Trappe. Very important: the many good coffee breaks, both at the Max-Planck-Institute as well as the university. I am grateful to all members of the two groups for creating a great working (and coffee-drinking) atmosphere. I am indebted to the International Max Planck Research School and particularly our great coordinators Kirsten Kelleher and Hannes Luz, who is missed dearly. A huge thank you to my friends, family, roommates, and all the special people in my life.

Contents

1 Introduction

2 Background
   2.1 Biological Background: Genomic Variation
      2.1.1 Types of Genomic Variation
      2.1.2 Functional Impact of Variants
   2.2 Methodological Background
      2.2.1 Next-Generation Sequencing Technologies
      2.2.2 Sequencing Data Characteristics
      2.2.3 Computational Analysis Framework
      2.2.4 Genomic Variant Detection Methods
      2.2.5 Sequencing-Based vs. Array-Based Methods
   2.3 Overview of Thesis
      2.3.1 SeqAn
      2.3.2 Developed SeqAn Tools

3 Read Mapping
   3.1 Background: Filtering and Verification Methods
      3.1.1 Filtering Techniques
      3.1.2 Index Data Structures
      3.1.3 Verification Techniques
   3.2 Overview of Read Mapping Tools
   3.3 RazerS - Overview of Read Mapping Algorithm
      3.3.1 Filtering Algorithm
      3.3.2 Match Verification
   3.4 RazerS - Filter Parameter Computation
      3.4.1 Computing Sensitivities by Dynamic Programming
      3.4.2 Computing Heavy Lossless Shapes
      3.4.3 ParamChooser
   3.5 RazerS - Results
      3.5.1 Evaluation of RazerS' Mapping Sensitivity Estimation
      3.5.2 Comparison with Other Read Mappers
   3.6 Improved Benchmarking of Read Mapping Tools
      3.6.1 Rabema - Generating a Gold Standard
      3.6.2 Rabema - Evaluation of Read Mapping Tools
   3.7 Chapter Summary

4 Partial Read Mapping and Applications
   4.1 Small RNA Read Mapping
      4.1.1 Strategies for Mapping Small RNA Reads
      4.1.2 MicroRazerS - Algorithm
      4.1.3 MicroRazerS - Evaluation and Comparison with Other Tools
   4.2 Overview of Split Read Mapping Tools
   4.3 SplazerS - Split Read Mapping for Indel Detection
      4.3.1 Split Match Definition
      4.3.2 Algorithm
   4.4 SplazerS - Results
      4.4.1 Evaluation Datasets
      4.4.2 Anchored Indel Detection on 1000 Genomes Project Data
      4.4.3 Anchored Indel Detection on 100 bp Illumina Reads
      4.4.4 Unanchored Indel Detection on Simulated Data
      4.4.5 Unanchored Indel Detection on Targeted Resequencing Data
   4.5 Chapter Summary

5 Small Variant Detection
   5.1 SNV/Indel Calling: Challenges in NGS Data
   5.2 Small Variant Detection Strategies
      5.2.1 Variant Detection Tools and Their Realignment Strategies
   5.3 SnpStore - Variant Calling Algorithm
      5.3.1 Realignment
      5.3.2 Variant Calling and Genotyping
   5.4 SnpStore - Results
      5.4.1 Evaluation Datasets
      5.4.2 Comparison with Other Tools on Simulation Data
      5.4.3 Comparison with Other Tools on 1000 Genomes Data
      5.4.4 Application to Targeted Resequencing Data
   5.5 Chapter Summary

6 Discussion and Conclusions
   6.1 Outlook

Chapter 1

Introduction

When the structure of DNA, the carrier of genetic information, was discovered by Watson and Crick (1953), and Franklin and Gosling (1953), it was surely hard to imagine that only about 50 years later, in 2001, the whole DNA sequence of a human would be deciphered. Another ten years later, with revolutionary advancements in DNA sequencing technology (Mardis, 2008b), we today know the sequence of hundreds of human individuals and have sequenced thousands of other species (NCBI, 2012). Nowadays, DNA sequencing is routinely used to answer questions in fundamental research as well as in medicine and diagnostics. For example, the 1000 Genomes Project (Durbin et al., 2010), a large-scale sequencing endeavor launched in 2008 by a consortium of researchers from more than 75 universities and companies around the world, has compiled the most complete catalogue of human variation to date. This catalogue of normal differences occurring between individual human genomes serves as a solid basis for investigations into disease. Gradually paving the way towards personalized medicine, DNA sequencing has already been successfully used to identify the genetic causes of many diseases (Johnston et al., 2010; Ng et al., 2010a,b), predominantly advancing diagnostics of Mendelian disorders, where pathogenic variations affecting a single gene provoke a strong phenotype. Developed by Sanger et al. (1977), the first method for DNA sequencing, through chain termination, was quite laborious and slow, but was soon automated to achieve higher throughput (Smith et al., 1986). In fact, automated Sanger sequencing was the technology used to produce the first human genome draft sequences in 2001 (Venter et al., 2001; Lander et al., 2001) – still at comparably low throughput and high cost. Altogether it took more than ten years to finish this draft sequence, at an estimated total cost of $3 billion (Collins et al., 2003). In the following years, enormous efforts were put into the development of superior sequencing technologies. By the middle of the decade, three new and revolutionary sequencing technologies were in sight: Roche (454 Life Sciences) pyrosequencing, ABI SOLiD (Life Technologies) sequencing, and Illumina (Solexa) sequencing, collectively called next-generation sequencing (NGS) methods (Mardis, 2008b). Compared to Sanger sequencing, these NGS technologies offered extraordinary throughput and thereby became affordable for exciting new applications. Not only has sequencing since been used for inspecting the genomes of different species and individuals, it has also found a range of further applications.


For example, it is used for sequencing and characterizing transcriptomes (Wang et al., 2009b) (the entirety of DNA transcribed into RNA in a cell). Other prominent NGS applications include sequencing of DNA bound by certain proteins (Park, 2009), e.g. modified histones. Altogether, these applications have already led to fundamental insights into gene regulation, e.g. that histone modification levels are quantitatively predictive of gene expression (Karlić et al., 2010), and ultimately our understanding of molecular cell biology is profiting immensely. Over the course of the last years, sequencing throughput has increased continuously, with Illumina (formerly Solexa) being the most widely adopted sequencing instrument provider. In one Illumina HiSeq instrument run, a total of 600 Gb is generated within ten days, theoretically enough to reconstruct an entire human genome (Illumina Inc., 2012). For comparison, the most efficient Sanger sequencing machines (Applied Biosystems, 2012) produce four orders of magnitude less, requiring more than 500 years for the equivalent amount of data. With this immense data flood, the bottleneck has shifted from laboratory work to computational analysis (Pop and Salzberg, 2008). Challenges are not only posed by the sheer amounts of data – with sequencing power rising more quickly than storage capacity, even data storage has become a challenge (Shumway et al., 2010) – but the nature and characteristics of the data also complicate analysis. Compared to Sanger sequencing, the length of DNA fragment that can be read in one piece is significantly shorter. These short sequencing reads, in early stages of NGS development as short as 30 base pairs, are harder to make sense of than the kilobase-long Sanger reads. In particular, longer reads facilitate analysis of the many highly repetitive DNA stretches within a genome. For illustration, one can imagine assembling a puzzle of a world map: if puzzle pieces are large, and contain some distinguishable part (perhaps part of a continent), they are easily placed. If pieces are small, possibly from somewhere within an ocean, we are faced with more uncertainty and potentially ambiguity. Larger pieces thus give us more information, which contributes significantly to ease of placement. Analogously, the longer the read sequence, the higher its power to bridge repetitive genome regions and reach into an informative, unique part of the genome. As a consequence of this data flood, efficient bioinformatics methods are in great demand. While many tools from the Sanger sequencing era exist, such as for genome assembly and for mapping of newly sequenced reads onto an existing reference genome sequence (Huang and Madan, 1999; Ning et al., 2001), most of them scale poorly to NGS data. In addition, the emergence of novel NGS-based applications necessitates the development of entirely new computational methods. Sophisticated tools built on clever algorithms addressing all steps of computational NGS data analysis are quintessential for continuous progress and for exhausting NGS capacities to the fullest. This work deals with the development and implementation of several novel methods for key steps of computational NGS data analysis. In particular, our focus is on methods for identifying genomic variants in re-sequencing data, i.e. characterizing a newly sequenced sample genome with respect to a reference genome. Returning to the analogy above, in re-sequencing we have a guiding – albeit slightly different and imperfect – picture of the puzzle we aim to reconstruct.
This stands in contrast to de-novo genome assembly, where we are essentially clueless about the underlying picture, or at least equipped with much less information. In genome re-sequencing, our ability to efficiently place reads at their correct locations is of major importance and a prerequisite for

accurate identification of genomic variation. We therefore devised strategies for highly sensitive read mapping and for evaluating the accuracy of read mapping tools (Chapter 3). Especially when reads span larger genomic variants, or when certain sequencing protocols for specific types of data are in use, specialized read mapping algorithms are required (Chapter 4). We developed two strategies for so-called partial read mapping, with applications in small regulatory RNA sequencing and in the detection of structural variants that comprise more than a few base pairs. Major difficulties in variant detection after (supposedly correct) read placement lie in distinguishing real variants from sequencing errors (Chapter 5), as sequencing machines unfortunately do not necessarily read out DNA fragments in an error-free fashion. Also, DNA sample preparation before sequencing may introduce biases that need to be taken into consideration when making a variant call. Our tools have led to the discovery of causal mutations in X-linked intellectual disability patients (Kalscheuer et al., submitted) and have identified a surprisingly high number of a certain type of genomic variant, retroposed genes (Emde et al., 2012). In comparisons with other state-of-the-art tools, our methods especially excel in sensitivity (Weese et al., 2009; Emde et al., 2010, 2012), an important factor particularly in disease studies. Especially noteworthy is that all of our methods are implemented in the well-established, high-quality C++ library SeqAn (Döring et al., 2008) and are thus well-documented, well-maintained and integrated into a thoroughly tested framework. Several SeqAn tools, including the ones introduced in this work, have been developed in the last years, and are applicable to large-scale real-world data sets while relying on solid theoretical foundations. At a time when fast and efficient algorithm development and testing is of particular interest, the SeqAn framework offers a fruitful method development environment where data structures and algorithmic components are geared towards high performance and generic (re-)usability. Efficient computational methods as well as systematically designed biological studies will continuously contribute to progress in fundamental biological research and especially in diagnostics. With the expected continuous advances in sequencing technology – third-generation sequencing technologies already at the doorstep – personalized healthcare is a near future in which computational aspects will play a key role.

Thesis Structure and Relevant Publications. In the following, we start by laying out the biological and methodological foundations in Chapter 2. Chapters 3 to 5 present our own methods in read mapping, specialized read mapping and variant calling, together with their evaluation. In Chapter 6, we discuss our findings, conclude and give an outlook on current and future developments. Our results are mainly published in the following articles:

• RazerS – fast read mapping with sensitivity control. David Weese, Anne-Katrin Emde, Tobias Rausch, Andreas Döring, and Knut Reinert. Genome Research (2009).

• MicroRazerS: rapid alignment of small RNA reads. Anne-Katrin Emde, Marcel Grunert, David Weese, Knut Reinert, and Silke R. Sperling. Bioinformatics (2010).

• A novel and well-defined benchmarking method for second generation read mapping. Manuel Holtgrewe, Anne-Katrin Emde, David Weese, and Knut Reinert. BMC Bioinformatics (2011).

• Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS. Anne-Katrin Emde, Marcel H. Schulz, David Weese, Ruping Sun, Martin Vingron, Vera M. Kalscheuer, Stefan A. Haas, and Knut Reinert. Bioinformatics (2012).

• Draining the pond: 14 novel candidate genes for X-linked intellectual disability. Vera M. Kalscheuer, ... Anne-Katrin Emde... Hans-Hilger Ropers. Manuscript submitted.

Chapter 2

Background

This background chapter lays the foundation for the following chapters, where our own contributions to next-generation sequencing (NGS) data analysis, in particular variant detection, will be addressed. In section 2.1, we briefly introduce some essential biological background, with a focus on genomic variants in humans. We assume the reader is familiar with the terms DNA, RNA, protein, gene and gene structure, and has basic knowledge of genome biology, including DNA replication, recombination, transcription and splicing. Next, we dive into the methodological background in section 2.2, introducing NGS data characteristics and the fundamental computational analysis methods.

2.1 Biological Background: Genomic Variation

Genomic variation is a natural consequence of evolutionary processes: mutations arise with a certain frequency during recombination and DNA replication and escape correction by DNA repair mechanisms. Mutations happening in germ cells (resulting in germline variants) will be present in all cells of the developing organism and may thus be passed on to progeny and form part of the gene pool. Mutations in somatic cells (resulting in somatic variants) remain within the organism, possibly specific to one cell or tissue (e.g. in cancer, where mutations lead to uncontrolled cell growth). Typically, extinct or endangered species have a low amount of genomic variation within the population (O'Brien, 1994). This low genomic diversity decreases their ability to adapt to environmental changes and puts them at higher risk of extinction. Genomic diversity thus is a key component of a population's fitness. Depending on its frequency within a population, a genomic variant is classified as a polymorphism (allele frequency ≥ 1%) or as a rare variant (allele frequency < 1%). Variants associated with or causal of a disease are expected to be rare, while polymorphisms present in many individuals are less likely to have pathogenic impact. Polymorphic variants are thus often excluded from suspicion when searching for disease-causing variants. The human variation catalogue compiled by the 1000 Genomes Project constitutes the most comprehensive population-scale variant set to date and is continuously growing towards a higher degree of completion.


This improves its usefulness as a background variant set. However, in the absence of a complete catalogue, the terms polymorphism and variant are not necessarily used correctly due to lack of evidence. In this work, we will simply speak of a "variant" irrespective of the population allele frequency. Many different types of variation can be found in the (human) genome, all of which have been linked to diseases (Stenson et al., 2009). In the following, we first review and show examples of the different types of variation. Next, we discuss their potential functional impact.

2.1.1 Types of Genomic Variation

Usually, a somewhat arbitrary size cutoff is used to divide genomic variation into two categories: small variants, for all variants up to the size cutoff, and structural variants (SVs), for all variants larger than that. The most common size cutoff in use is 50 bp (Alkan et al., 2011). Figure 2.1 shows the most frequent types of variation, further subdividing SVs into balanced SVs (no gain or loss of genetic material) and unbalanced SVs (associated with sequence gain or loss). Small variants comprise single nucleotide variants (SNVs), where a single nucleotide is mutated, and short insertions/deletions (indels), where a few base pairs are inserted or deleted. SNVs are the most frequent source of genomic variation. On average, two human individuals differ in 1 SNV per ∼1 kb (International HapMap Consortium, 2003) and 1 indel per ∼8 kb (Lunter, 2007).

Figure 2.1: Genomic variation can be divided into small variants (A), and balanced (B) and unbalanced (C) structural variants. Irrelevant sequence is shown as solid lines, whereas variable sequence segments are shown as yellow boxes. Small variants comprise single nucleotide variants (SNVs) and short indels smaller than 50 bp. Structural variants are variants larger than 50 bp. Unbalanced variants lead to a loss or gain of genetic material, whereas for balanced ones there is no such loss or gain.

SVs occur with lower frequency than small variants. However, in recent years it has been shown that human variation due to SVs is significantly more frequent than previously thought; while SVs vary strongly in size, it is estimated that on average at least ∼4-5 Mb of an individual human genome are involved in structural variation (Tuzun et al., 2005; Redon et al., 2006). Balanced SVs include inversions and translocations, where the inverted or translocated sequence is cut out and re-inserted without significant loss or gain of genetic material. Translocations can be intra- or interchromosomal, i.e. within one chromosome or between different chromosomes. Unbalanced variants (or copy number variants) are associated with significant sequence loss or gain, such as a sequence deletion or a novel sequence insertion. Further types of unbalanced SVs are duplications, where sequence is copied and inserted either right after the original copy (tandem duplication) or at another location in the genome (dispersed duplication). Also, individual genomes often differ in their content of repeats. Repeats are families of highly similar sequences that can be found in large numbers in the human genome. The most common genomic repeats are long interspersed nuclear elements (LINEs, ∼2-6 kb in size) and short interspersed nuclear elements (SINEs, <500 bp). They are inserted into the genome through retrotransposition (DNA transcription to RNA, reverse transcription back to DNA, and subsequent re-integration into the genome) and are therefore often referred to as mobile elements. Another special case of mobile elements are retroposed genes. Here, mature mRNA is reverse transcribed to DNA and re-integrated into the genome, producing a copy of the intronless transcript of the parental gene. Variable number tandem repeats (VNTRs) are expansions or contractions of short repetitive sequence, also called satellites, and are usually caused by polymerase slippage during replication. As VNTRs can be shorter than 50 bp, they belong to both the small and the structural variation class.

2.1.2 Functional Impact of Variants

Genomic variants can be anything from beneficial to neutral to functionally disruptive, i.e. showing a strong phenotypic effect, in the extreme case being lethal for the host cell/organism. Figure 2.2 shows how mutations can have functional impact on the protein level. We can differentiate between small variants in the coding sequence (A), variants disrupting the gene structure (B) and variants leading to gene dosage/regulatory effects (C). Variants in the gene coding sequence can be synonymous or non-synonymous. A synonymous variant (not shown in the figure) has no effect on the amino acid level, i.e. it does not alter the amino acid sequence. A non-synonymous variant, however, can affect the amino acid sequence in different ways. We speak of a missense mutation when a SNV changes a codon such that a different amino acid is incorporated. A prominent example is a mutation in hemoglobin leading to incorporation of valine instead of glutamic acid, which causes sickle-cell disease (Ingram, 1957). Frameshift, nonsense and splice site mutations have been shown to be causal mutations in Duchenne muscular dystrophy (DMD) (Muntoni et al., 2003), an X-chromosome-linked (X-linked) disease affecting the dystrophin gene that occurs in about one out of 3,500 male births. Also repeat expansion mutations that lead to incorporation of additional amino acids have been found to be the cause of many diseases (Orr and Zoghbi, 2007). One example is the expansion of a short (CTG) repeat in the DMPK gene, which leads to myotonic muscular dystrophy (Fu et al., 1992), a different kind of dystrophy more common in adults.

Figure 2.2: Functional impact of variants on the protein level can be divided into alteration of the coding sequence (A), disruption of the gene structure (B), and altered gene dosage (C). Small variants in the coding sequence can at the same time disrupt the gene structure (shown as overlap), for example when a splice site is mutated or a frameshift mutation disrupts the reading frame.

Structural variants that overlap with a gene or several genes can have severe impact and result in loss of functional protein products. Again, we can see examples in muscular dystrophy, where large deletions spanning several kilobases of the dystrophin locus are the genetic cause in more than 50% of DMD patients (Chamberlain et al., 1988). Gene fusions are formed when parts of two genes are joined by translocation¹. Fusion transcripts are strongly associated with and often causal of cancer (Mitelman et al., 2007). Genes present in variable copy number (through duplication, deletion or retroposition) can have an influence through altered dosage. For example, a high copy number of the beta-defensin gene has been shown to increase susceptibility to psoriasis (Hollox et al., 2008), an autoimmune disease affecting skin cells. Differential protein dosage can also be caused by SVs or small variants affecting regulatory elements, leading to increased or decreased transcription levels. On top of the direct influence variants can have on one allele, they can also play together in the diploid cell. Just as a functioning allele can correct for a damaged recessive one, loss of a functioning allele can unmask a damaged one (loss of heterozygosity).

¹ Gene fusions can also result from read-through of adjacent genes, which is only visible on the RNA level.

In the following, we turn to the methodological background, including sequencing technologies and computational analysis for genomic variant detection.

2.2 Methodological Background

The availability of high-throughput DNA sequencing technologies has greatly paved the way towards studying genomic variation on a large scale (Durbin et al., 2010). Different protocols for studying whole genomes (Bentley et al., 2008; Wheeler et al., 2008) or certain parts of a genome, such as the exome (Ng et al., 2009; Albert et al., 2007), have emerged and have made it possible to study specific aspects in depth. We call this application of DNA sequencing, where a reference sequence is available, genome re-sequencing. Re-sequencing has been successfully applied in many studies to identify disease-associated genomic variants (Ng et al., 2010a,b; Lupski et al., 2010; Rios et al., 2010). In general, we can differentiate between reference-guided and de-novo sequencing applications (Figure 2.3A and B). De-novo genome sequencing refers to the sequencing of an organism for which no reference sequence is available yet. The main computational method here is an assembly algorithm that reconstructs the underlying sequence. Our focus, however, will be on reference-guided applications. In addition to genome resequencing, other reference-guided applications of DNA sequencing have arisen. For example, protocols have been developed for sequencing RNA, collectively called RNA-Seq, by extracting RNA from the cell, reverse transcribing it and then sequencing the resulting cDNA (Wang et al., 2009c). Usually, RNA-Seq refers to mRNA sequencing (where mRNA is captured by its poly-A tails) and investigates alternative splicing and transcript expression levels within a certain tissue or cell line (Mortazavi et al., 2008; Sultan et al., 2008a). Another type of RNA-Seq is small RNA sequencing, where miRNAs or other small regulatory RNAs are the center of interest (Morin et al., 2008). ChIP-Seq refers to sequencing after chromatin immunoprecipitation of protein-bound DNA (Park, 2009). Just like its predecessor ChIP-Chip (Ren et al., 2000), which uses array technology instead of sequencing, it has the power to identify binding of proteins such as transcription factors (Valouev et al., 2008), but has the advantage of requiring less a priori knowledge. Depending on the application, different computational analysis steps are necessary. However, most reference-guided applications fundamentally rely on read mapping with subsequent "feature" discovery. In the following sections, we first introduce NGS data and its characteristics and then turn to the computational analysis protocol that all above-mentioned reference-guided applications usually share. We then concentrate on genome resequencing and variant detection in more detail. For a review of other DNA sequencing applications, the interested reader is referred to Mardis (2008b).

2.2.1 Next-Generation Sequencing Technologies

NGS technologies are based on sequencing-by-synthesis. That is, complementary bases are added to a growing DNA strand using a single-stranded DNA fragment as template (synthesis), while incorporation of new nucleotides is recorded by a camera taking fluorescence images (sequencing).

Figure 2.3: Sequencing applications can be divided into reference-guided applications (A), where a reference sequence is known, and de-novo genome sequencing (B), where a new organism is sequenced. The probably most common reference-guided applications are resequencing of genomic sequence (DNA-Reseq), transcriptome sequencing (RNA-Seq), and sequencing of chromatin-immunoprecipitated DNA (ChIP-Seq). The diverse reference-guided applications all employ a read mapping step, while the main computational step in de-novo sequencing is an assembly step.

This process happens in a highly parallel fashion, with hundreds of thousands to millions of DNA fragments sequenced simultaneously. Fluorescence images are then processed by a base calling algorithm that infers the sequence of incorporated nucleotides. In contrast to Sanger sequencing, no bacterial cloning step is necessary for fragment amplification; instead, DNA fragments are amplified directly on the technology-specific sequencing surface. While the characteristics of sequencing data vary a lot depending on the technology, all NGS methods have in common that they generate orders of magnitude more data and shorter reads than Sanger sequencing. Table 2.1 gives an overview of the throughput of the Illumina, 454 and ABI SOLiD technologies (the three most successful NGS pioneers), showing the tremendous advances within the last four years. Especially Illumina has increased its throughput significantly and has thus become the most widely adopted sequencing technology. Recently, so-called third-generation sequencing methods have started emerging (Schadt et al., 2010; Branton et al., 2008; Rothberg et al., 2011). Also called single-molecule sequencing methods, they do not require a fragment amplification step but work on single DNA molecules. These methods are expected to deliver longer reads and lower costs per run. As of yet, they are not widely adopted. However, the definite trend in DNA sequencing is decreasing cost with increasing throughput and/or data quality. In the following, we briefly outline the Illumina and 454 technologies, as these are most relevant for this work. For a detailed review of NGS technologies, see Mardis (2008b). As we mainly use Illumina sequencing data in this work, we will from here on assume Illumina read data unless stated otherwise.

Roche 454 sequencing: A specific adapter sequence is attached to each DNA fragment, with which fragments then bind to agarose beads carrying the complementary adapter sequence. Each bead holds one DNA fragment, which is then amplified by emulsion PCR. The beads, now holding about a million copies of the original DNA fragment, are then poured onto a picoliter plate with hundreds of thousands of wells. Each well can hold a single bead and contains enzymes and reactants for pyrosequencing: when a nucleotide is incorporated, pyrophosphate is released. Through a series of reactions, this produces light whose intensity is proportional to the number of nucleotides incorporated. The most common errors in 454 sequencing are insertion and deletion errors due to over- or undercalling the number of incorporated bases. Especially in long stretches of the same nucleotide (homopolymer runs), base calling is very unreliable due to saturation. Substitution errors, however, are rare for this technology.

Illumina sequencing: In Illumina sequencing, too, a specific adapter sequence is attached to the DNA fragments. The fragment library is then poured onto a flow cell, which consists of 8 channels (lanes). Fragments attach to the flow cell surface through binding to complementary adapter sequence. Next, so-called bridge amplification results in clusters of about a million copies of the same DNA fragment. Sequencing-by-synthesis then proceeds in a cyclic fashion, incorporating one nucleotide per cycle in each fragment cluster. All four nucleotides are added simultaneously, and the appropriate nucleotide is incorporated into each growing strand. Nucleotides carry reversible terminators that ensure that only one nucleotide is incorporated per cycle. Furthermore, each nucleotide is fluorescently labeled. After fluorescence imaging, reactants are washed off, terminators are chemically removed, and another sequencing cycle can take place. In the end, all reads have the same length, as the number of sequencing cycles is the same for each cluster. Sequencing quality, i.e. the certainty of the incorporated nucleotide, decreases with each cycle. With improvements in sequencing chemistry, more than 100 cycles are nowadays possible. The main type of error is base miscalling, which stems mainly from clusters becoming desynchronized with increasing cycle number.

                 Illumina       454            ABI SOLiD
                 2008   2012    2008   2012    2008   2012
 Gb/run          1.3    600     0.1    0.9     3      155
 Run time (days) 4      10      0.3    0.8     5      8
 Read length     36     100     250    700     35     75
 Mb/h            13.5   2500    14.3   45      25     807.3
 US $/Mb         6      0.1     84     20      6      0.2

Table 2.1: Next-generation sequencing technologies and their throughput, 2008 compared to 2012. Data collected from (Mardis, 2008a) and sequencing company websites.

Figure 2.4: A typical Illumina error profile along read positions. Figure from (Dohm et al., 2008), courtesy of Juliane Dohm.

2.2.2 Sequencing Data Characteristics

There are a number of characteristics of sequencing data that will be of relevance in this work. In the following, we introduce quality values and read pair sequencing. Next, we mention biases associated with certain sequencing protocols that need to be taken into account when analyzing sequencing data.

Sequencing Quality

Apart from identifying which nucleotides were incorporated, base calling algorithms also assign a measure of certainty, i.e. a base call quality value, to each called nucleotide. This is an estimate of the probability of the base call being incorrect and is calculated from the signal intensities recorded during image analysis. While some platforms initially used a different scaling method, these error probabilities can always be transformed to, and are usually represented by, a Phred-scaled quality value (Ewing and Green, 1998):

Q_Phred = -10 · log10 P(error)    (2.1)

Thus, a quality value of 10 means an error probability of 0.1, a quality value of 20 an error probability of 0.01, and so on. A typical, representative error profile of early Illumina reads (Illumina Genome Analyzer I) is shown in Figure 2.4 (taken from (Dohm et al., 2008)), with the error rate clearly increasing toward the 3'-end. Note that this profile is based on alignment errors, i.e. it includes real single nucleotide variants as well as sequencing errors. However, the occurrence of a variant is independent of read position, and thus the clearly visible increase in error rate towards the 3'-end is solely due to sequencing quality degradation. Quality values are used in some downstream computational analyses and, for example, aid in distinguishing small variants from sequencing errors.
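To make equation (2.1) concrete, here is a minimal sketch that decodes a quality string and recovers the error probabilities it encodes; it assumes the common Phred+33 ASCII encoding of FASTQ files, and the quality string itself is made up for illustration.

#include <cmath>
#include <iostream>
#include <string>

// Invert equation (2.1): a Phred-scaled quality Q encodes the error
// probability P(error) = 10^(-Q/10).
double phredToErrorProb(int q) { return std::pow(10.0, -q / 10.0); }

int main() {
    // Hypothetical FASTQ quality string, assuming the common Phred+33 encoding.
    std::string quals = "II?5+";
    for (char c : quals) {
        int q = c - 33;  // decode ASCII character to Phred value
        std::cout << "Q=" << q << "  P(error)=" << phredToErrorProb(q) << "\n";
    }
    // Q=10 yields 0.1 and Q=20 yields 0.01, matching the examples in the text.
}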

Read Pair Sequencing

Read pair sequencing produces pairs of reads with a known relative layout. There are two commonly used protocols. The main principle behind paired-end sequencing is a size selection step in which DNA fragments of a specific length (e.g. 400 bp) are isolated by use of gel electrophoresis. The size-selected fragments, which all have approximately the same size, are sequenced first from one end and then from the other end, producing read pairs. While we do not know the sequence in between a read pair, we do know the relative orientation and approximate distance of the two reads. So-called mate-pair sequencing involves a different sample preparation protocol that produces read pairs with a different orientation. Essentially, the ends of large size-selected DNA fragments (e.g. 3 kb) are tagged with biotin and then circularized. After random shearing, fragments containing biotin are extracted, again size-selected and sequenced from both ends as in paired-end sequencing. As we know the length of the originally circularized fragment, we know the approximate distance between the reads of a pair. In contrast to paired-end sequencing, the sequenced read pairs face away from each other. In both paired-end and mate-pair sequencing, the outer distance of the reads is called the insert size. On the one hand, the read pair approach makes sample preparation and sequencing more laborious and costly. On the other hand, it provides valuable information for downstream computational analysis, as we will see later.
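To make the read pair geometry concrete, the following sketch computes the insert size (outer distance) of a mapped pair and checks the orientation expected from paired-end data; the record layout is hypothetical, for illustration only.

#include <algorithm>
#include <iostream>

// Minimal mapped-read record (hypothetical layout, for illustration only).
struct MappedRead {
    int  beginPos;  // leftmost mapped position on the reference
    int  length;    // read length
    bool forward;   // mapping strand
};

// Insert size = outer distance: from the leftmost base of the left read
// to the rightmost base of the right read.
int insertSize(const MappedRead& a, const MappedRead& b) {
    int leftBegin = std::min(a.beginPos, b.beginPos);
    int rightEnd  = std::max(a.beginPos + a.length, b.beginPos + b.length);
    return rightEnd - leftBegin;
}

// Paired-end mates are expected to face each other: left mate on the forward
// strand, right mate on the reverse strand (mate pairs face away instead).
bool concordantPairedEnd(const MappedRead& a, const MappedRead& b) {
    const MappedRead& left  = (a.beginPos <= b.beginPos) ? a : b;
    const MappedRead& right = (a.beginPos <= b.beginPos) ? b : a;
    return left.forward && !right.forward;
}

int main() {
    MappedRead r1{10000, 100, true}, r2{10300, 100, false};
    std::cout << insertSize(r1, r2) << " "             // prints 400
              << concordantPairedEnd(r1, r2) << "\n";  // prints 1
}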

Biases

Several steps of the sequencing protocol can introduce biases. Initial DNA fragmentation, i.e. shearing, is not entirely random, leading to biases in the produced DNA fragments (Schwartz and Farman, 2010). Furthermore, DNA fragments with extreme GC content, in particular very low GC content, are harder to sequence and amplify (Aird et al., 2011). This leads to underrepresentation of these sequences, or relative overrepresentation of sequences with balanced GC content. Targeted re-sequencing refers to a special sequencing protocol where specific genomic regions of interest are targeted and captured, for example with an oligonucleotide array (Ng et al., 2009). Certain sequences are harder to target, as it is harder to design unique probes (this affects repetitive sequence), and some are harder to capture due to lower binding affinity (Zhang et al., 2003). Again, this leads to GC bias. Moreover, preferential capture of the reference allele introduces bias towards the reference allele and may complicate identification of variant alleles (Turner et al., 2009).
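As a simple illustration of how such biases can be diagnosed, the sketch below computes the GC fraction per reference window; comparing mapped read counts against these fractions is one common way to visualize GC bias. This is a generic diagnostic, not a method from this thesis.

#include <iostream>
#include <string>
#include <vector>

// Fraction of G/C bases in each non-overlapping window of a sequence.
// Plotting per-window read counts against these fractions exposes the
// under-/overrepresentation described above.
std::vector<double> gcPerWindow(const std::string& seq, std::size_t window) {
    std::vector<double> gc;
    for (std::size_t i = 0; i + window <= seq.size(); i += window) {
        std::size_t count = 0;
        for (std::size_t j = i; j < i + window; ++j)
            if (seq[j] == 'G' || seq[j] == 'C') ++count;
        gc.push_back(double(count) / double(window));
    }
    return gc;
}

int main() {
    std::string chunk = "ACGTGGCCATATATGGCGCGAT";  // toy reference chunk
    for (double f : gcPerWindow(chunk, 10)) std::cout << f << "\n";
}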

2.2.3 Computational Analysis Framework

After sequencing, computational analysis now deals with a set of sequenced reads including quality values and possibly read pair information. From this information, we want to reconstruct the underlying events or features. The main computational steps for reference-guided analyses can be divided into four stages:

1. Read cleaning and pre-processing: read filtering and clipping, adapter or tag removal, error correction

2. Read mapping: finding the location of origin of each read in a reference sequence, possibly including several rounds or specialized types of read mapping, depending on the application

3. Postprocessing of mapped reads: multi-read assignment, PCR artifact removal, error correction

4. Feature discovery: depending on the application, e.g. genomic variant detection and genotyping or transcript detection and quantification

The details of steps 1 and 3 can vary widely and sometimes these steps, especially step 3, are even skipped. Their importance depends on the type of application and the quality and type of sequencing data. In the following, an overview of each of these steps will be given, mainly with the application of genome resequencing in mind, i.e. with the goal of detecting genomic variants, such as SNVs, indels, and structural variants.
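Purely to illustrate the data flow between these four stages, here is a minimal skeleton; all types and function bodies are hypothetical placeholders and do not correspond to SeqAn's interface.

#include <string>
#include <utility>
#include <vector>

// Hypothetical placeholder types standing in for real data structures.
struct Read {};
struct Alignment {};
struct Variant {};

std::vector<Read> cleanReads(std::vector<Read> raw) {              // stage 1
    return raw;  // e.g. quality clipping, adapter removal (often optional)
}
std::vector<Alignment> mapReads(const std::vector<Read>& reads,    // stage 2
                                const std::string& /*reference*/) {
    return std::vector<Alignment>(reads.size());
}
std::vector<Alignment> postprocess(std::vector<Alignment> alns) {  // stage 3
    return alns;  // e.g. multi-read assignment, PCR duplicate removal (optional)
}
std::vector<Variant> callVariants(const std::vector<Alignment>&) { // stage 4
    return {};    // e.g. SNV/indel/SV detection and genotyping
}

std::vector<Variant> analyze(std::vector<Read> raw, const std::string& ref) {
    auto reads = cleanReads(std::move(raw));
    auto alns  = postprocess(mapReads(reads, ref));
    return callVariants(alns);
}

int main() { analyze(std::vector<Read>(3), "ACGT"); }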

Read Cleaning

There are different sources of noise in NGS data, some of which can be treated before reads are mapped. Apart from making the data less noisy for downstream analysis, this also has the advantage of reducing the read set and thereby saving computational resources. Read cleaning approaches can be roughly divided into simple quality-based clipping or filtering strategies (Smeds and Künstner, 2011), more sophisticated sequencing-error correction methods (Salmela, 2010; Kelley et al., 2010) and contaminant-removal procedures (Schmieder et al., 2010; Chen et al., 2007). In quality-based clipping (Smeds and Künstner, 2011), bases with low base call quality are clipped from the 3' end or from both ends of a read. If the overall quality of a read is low, the whole read may be discarded. This removes read (sub)sequences likely to contain sequencing errors. Error correction methods, on the other hand, aim to correct read bases likely to be sequencing errors by using information from other reads. These methods are usually based on k-mer counting or suffix array algorithms (Yang et al., 2012) implemented in stand-alone tools (Salmela, 2010; Kelley et al., 2010) or as built-in components of de-novo assembly tools (Zerbino and Birney, 2008). Contaminant removal can be necessary for some sample preparation protocols that lead to reads containing adaptor/tag sequence or viral contamination. Strategies basically rely on alignment methods such as Blat (Kent, 2002) for contaminant identification and masking (Schmieder et al., 2010; Chen et al., 2007). However, with all read cleaning approaches there is always the risk of losing information through over-cleaning. Depending on the type of sequencing data and application, this step is frequently skipped or limited to basic quality control.
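A minimal sketch of the simplest flavor of quality-based 3' clipping: trailing bases below a quality threshold are removed. This deliberately simple rule is for illustration only; published tools use more elaborate criteria (e.g. maximal-scoring suffixes).

#include <iostream>
#include <string>

// Return the length of the read after clipping a trailing run of bases whose
// Phred quality falls below the threshold (assuming Phred+33 encoding).
std::size_t clippedLength(const std::string& quals, int threshold, int offset = 33) {
    std::size_t end = quals.size();
    while (end > 0 && (quals[end - 1] - offset) < threshold)
        --end;
    return end;
}

int main() {
    std::string seq   = "ACGTACGTACGT";
    std::string quals = "IIIIIIIII###";           // '#' encodes Phred 2
    std::size_t len   = clippedLength(quals, 10); // clip trailing Q<10 bases
    std::cout << seq.substr(0, len) << "\n";      // prints ACGTACGTA
}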

Read Mapping

The goal of read mapping is to locate, for each sequenced read, its position of origin in a reference sequence. Read mapping is thus an alignment problem where the read is usually allowed to differ slightly from the reference sequence, i.e. some alignment errors are allowed. There are essentially two major difficulties in read mapping: Firstly, the tremendous amount of data necessitates extremely efficient algorithms.

Depending on the sequencing platform, type of application and sequencing depth, the set of input reads can be in the range of billions of short reads (<100 bp). Secondly, the ambiguous nature and error-proneness of the data complicate confident read assignment. The longer the reads, the higher the probability that they can be unambiguously mapped to a reference sequence position (Figure 2.5), i.e. the higher their uniqueness or mappability. Using paired-end sequencing also has a positive effect on mappability, especially with increasing fragment insert size (Chikhi, 2012). Obviously, mappability furthermore strongly depends on the organism. Characteristics of sequencing data and patterns of sequencing errors are specific to the different sequencing technologies. Thus, read mapping tools specifically geared toward certain types of sequencing data have been developed (Li et al., 2008a; Rumble et al., 2009). Also, the type of experiment needs to be taken into consideration. For example, in RNA-Seq data reads may span spliced-out introns, which makes mapping even harder (Trapnell et al., 2009; Au et al., 2010; Wang et al., 2010). Different types of read mapping algorithms will be the focus of Chapters 3 and 4 of this thesis.
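To illustrate the filter-then-verify idea underlying most read mappers (treated in depth in Chapter 3), here is a toy seed-and-verify sketch: an exact k-mer lookup proposes candidate positions, which are then verified under Hamming distance. Reference, read and parameters are made up, and real filters (q-gram counting, index structures) are far more elaborate.

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const std::string ref = "ACGTACGATTACGTACCATTACG";  // toy reference
    const std::size_t k = 5;

    // Filtering: index all k-mers of the reference by their positions.
    std::unordered_multimap<std::string, std::size_t> index;
    for (std::size_t i = 0; i + k <= ref.size(); ++i)
        index.emplace(ref.substr(i, k), i);

    const std::string read = "TTACGTACG";  // toy read, one mismatch allowed
    const int maxErrors = 1;

    // Use the read's first k-mer as seed; verify each candidate position.
    auto range = index.equal_range(read.substr(0, k));
    for (auto it = range.first; it != range.second; ++it) {
        std::size_t pos = it->second;
        if (pos + read.size() > ref.size()) continue;
        int errors = 0;  // verification: count Hamming-distance mismatches
        for (std::size_t j = 0; j < read.size(); ++j)
            errors += (ref[pos + j] != read[j]);
        if (errors <= maxErrors)
            std::cout << "match at " << pos << " with " << errors << " error(s)\n";
    }
}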

Postprocessing of Mapped Reads

Postprocessing of mapped reads addresses the assignment of reads mapped to multiple locations (multi-reads), the removal of noise or errors only visible after alignment, and the correction of minor alignment artifacts. The most straightforward and common way to deal with multi-reads is to ignore them in subsequent analyses, which however results in an obvious loss of information. Another strategy is to distribute them randomly over their possible mapping locations (Li et al., 2008a). However, more sophisticated methods for assigning multi-reads have been developed (Hashimoto et al., 2009; Ji et al., 2011) that use coverage and sequence information from reads mapped to flanking sequence. Concerning noise removal, PCR amplification artifacts can be identified after mapping as stacks of reads mapping to the same genomic coordinates.

Figure 2.5: Uniqueness of single and paired-end reads for the E. coli and H. sapiens genomes plotted over read length. Paired uniqueness is shown for a fragment size of 300 bp. Figure from (Chikhi and Lavenier, 2009), courtesy of Rayan Chikhi.

It is common to keep only a certain number of reads from each stack (no more than expected from the overall sequencing coverage) to avoid propagation of errors possibly introduced during amplification (Li et al., 2009a). Furthermore, alignment errors unlikely to stem from real variation can be removed with a posteriori read clipping, i.e. read clipping based not only on quality values but also on alignment errors (Li and Durbin, 2009). Finally, as regular read mapping maps reads in a pairwise read-to-reference fashion, there may exist small alignment inconsistencies and artifacts. By looking at all reads mapped to a location, realignment methods correct small alignment errors and make the multiple-read-to-reference alignment more consistent (Homer and Nelson, 2010). Realignment is very important for small variant detection and will be treated in more detail in Chapter 5.
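As a sketch of the duplicate-removal idea described above (reads stacking at identical coordinates are suspect), the following example flags all reads beyond a per-stack maximum. The record layout is hypothetical; real tools additionally consider mate positions and library layout.

#include <iostream>
#include <map>
#include <utility>
#include <vector>

// Hypothetical mapped-read record; reads with identical start position and
// strand form a "stack" of potential PCR duplicates.
struct MappedRead { int beginPos; bool forward; bool duplicate = false; };

void markDuplicates(std::vector<MappedRead>& reads, std::size_t maxPerStack) {
    std::map<std::pair<int, bool>, std::size_t> stackSize;
    for (auto& r : reads) {
        std::size_t& n = stackSize[{r.beginPos, r.forward}];
        if (++n > maxPerStack) r.duplicate = true;  // surplus reads are flagged
    }
}

int main() {
    std::vector<MappedRead> reads{{100, true}, {100, true}, {100, true}, {250, false}};
    markDuplicates(reads, 2);  // keep at most 2 reads per stack
    for (const auto& r : reads)
        std::cout << r.beginPos << (r.duplicate ? " duplicate\n" : " kept\n");
}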

Feature Discovery

Feature discovery, obviously a very broad category, depends on the type of sequencing data and the goal of the experiment. The three probably most common types are shown in Figure 2.3: detecting transcript expression in RNA-Seq data, identifying protein-bound DNA in ChIP-Seq data, and discovering genomic variants in DNA resequencing data. In the first case, RNA-Seq, the goal is to detect which transcripts are expressed in the sample, including prediction of alternatively spliced transcript isoforms. The number of reads mapped onto a transcript isoform is then used to quantify expression. Several tools for RNA-Seq data analysis have been developed, including pipelines that start with read mapping (usually multiple steps of mapping) and finally output transcripts and expression levels (Mortazavi et al., 2008; Trapnell et al., 2009, 2012). Specific tools are available for detecting gene fusions, of special interest in cancer RNA-Seq (Maher et al., 2009; Iyer et al., 2011). In ChIP-Seq data, the goal is to identify which sequences have been bound by a protein, such as a transcription factor. Many tools have been developed for detecting binding peaks (Zhang et al., 2008a; Zang et al., 2009), where the difficulty lies in distinguishing real binding from background noise. Once bound sequences have been identified, they can be further searched for binding motifs (Bailey, 2011). The last case, genomic variant discovery, will be treated in more detail in the following section.

2.2.4 Genomic Variant Detection Methods

We differentiate between small variant and structural variant (SV) detection. Methods for detecting small variants mainly rely on inspecting the alignment of the mapped reads, looking at all reads overlapping a candidate variant and trying to distinguish real variants from sequencing errors by using cutoffs or statistical calculations. Chapter 5 will deal with small variant detection in more detail. Here, we want to focus on methods for SV detection. SVs are of special interest because they are more likely to be gene-disruptive and are at the same time harder to detect. Reference-guided SV detection methods can be divided into three categories: read pair, read depth, and split read methods, as shown in Figure 2.6.
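To give one concrete instance of the "statistical calculations" mentioned above for small variants, the sketch below computes the probability that the observed non-reference bases at a site are all sequencing errors, under a simple independent-error model. Real callers (Chapter 5) additionally model base qualities, strand information and genotypes.

#include <cmath>
#include <iostream>

// Probability of seeing at least k non-reference bases among n overlapping
// reads if all of them were independent sequencing errors at rate p:
// the upper tail of a Binomial(n, p) distribution.
double pValueAllErrors(int n, int k, double p) {
    double tail = 0.0;
    for (int i = k; i <= n; ++i) {
        // log of the binomial coefficient C(n, i) via lgamma
        double logC = std::lgamma(n + 1.0) - std::lgamma(i + 1.0)
                    - std::lgamma(n - i + 1.0);
        tail += std::exp(logC + i * std::log(p) + (n - i) * std::log1p(-p));
    }
    return tail;
}

int main() {
    // 6 of 20 reads show an alternative base; per-base error rate 1%.
    std::cout << pValueAllErrors(20, 6, 0.01) << "\n";  // tiny -> likely a real SNV
}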

Most SV detection methods exploit read pair information (Korbel et al., 2009; Chen et al., 2009; Lee et al., 2009; Hormozdiari et al., 2009). As the approximate distance and relative orientation of read pairs are known, shifts in the mapped distance or changes in relative orientation can be used to locate SV events. The main advantage of these methods lies in the wide range of SVs they can detect (green boxes in Figure 2.6). However, read pair methods require tight insert size distributions for accurate discovery of small to medium-sized SVs and for confident localization of breakpoints (Alkan et al., 2011). Split read methods (Ye et al., 2009; Karakoc et al., 2012; Emde et al., 2012) directly take advantage of reads crossing a breakpoint and can thereby detect SVs with base pair precision. Splitting, however, immensely increases mapping ambiguity and hence decreases mapping confidence. To increase confidence in split-read mapping, it is usually used in conjunction with read pair data for so-called anchored split read mapping. Here, the split read is anchored by a confidently mapped paired read, which greatly decreases the search space for the split read.

Figure 2.6: The three main read mapping-based strategies for SV detection using sequencing data (read pair, read depth, and split read), illustrated for deletions, insertions, tandem duplications, inversions, and translocations of a donor relative to the reference. Read pair methods are able to detect all types of SVs (indicated by green boxes), read depth methods can only detect unbalanced SVs, but no novel insertions or balanced SVs (red boxes), and split read methods can theoretically identify all SV types but have limited power for novel insertions due to read length (yellow box).

This strategy was first developed by Ye et al. (2009) and is particularly useful for short reads (<100 bp). In Chapter 4 of this thesis, we will introduce our own split read based SV detection method, which can deal with paired-end as well as single-end data (Emde et al., 2012). The third strategy shown in Figure 2.6, read depth methods (Xie and Tammi, 2009; Yoon et al., 2009), works on single-end as well as paired-end data and is able to detect very large deletions and duplications, i.e. changes in copy number, by comparing the observed number of mapped reads in a genomic region to the expected one. These methods can detect large unbalanced SVs, i.e. copy number variants (CNVs), but provide low resolution and hence no exact breakpoints (Alkan et al., 2011). Furthermore, they are vulnerable to coverage fluctuations due to mapping and sequencing biases (Medvedev et al., 2009). Apart from the three approaches shown in Figure 2.6, reference-guided assembly methods are becoming more common (Hajirasouliha et al., 2010; Li et al., 2011). These methods perform a local de-novo assembly, often using one-end anchored reads, and therefore possess the unique ability to detect large novel sequence insertions. In addition, completely different methods for SV breakpoint detection exist, e.g. the BreakPointer method (Sun et al., 2012), which exploits read mapping artifacts close to SV breakpoints (accumulation of alignment errors and a sudden drop in coverage depth). One important finding of the 1000 Genomes Project has been that the different SV detection strategies share a rather low overlap of predicted SVs (Alkan et al., 2011). Due to their different strengths and weaknesses, none of the methods is comprehensive. Hence, recent developments have brought new tools that combine these strategies in order to increase SV detection sensitivity. For example, SVseq (Zhang et al., 2012) combines split read and read pair information, while GenomeSTRiP combines the read pair with the read depth approach (Handsaker et al., 2011). Nevertheless, these combined strategies depend on accurate and efficient basic steps, mostly using the strategies briefly introduced here as building blocks.
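As a concrete illustration of the read pair signal: a deletion in the donor stretches the mapped insert size beyond the library mean, while an insertion compresses it. The sketch below flags such discordant pairs with a simple z-score cutoff; the record layout and thresholds are illustrative, not taken from any published tool.

#include <iostream>
#include <vector>

// A mapped read pair, reduced to its insert size (outer distance on the
// reference). Hypothetical layout for illustration.
struct ReadPair { int insertSize; };

// Flag pairs whose insert size deviates from the library mean by more than
// nSigma standard deviations: stretched pairs suggest a deletion in the
// donor, compressed pairs an insertion.
void flagDiscordant(const std::vector<ReadPair>& pairs, double mean, double sd,
                    double nSigma = 3.0) {
    for (const auto& p : pairs) {
        double dev = (p.insertSize - mean) / sd;
        if (dev > nSigma)
            std::cout << p.insertSize << ": deletion candidate (+" << dev << " sd)\n";
        else if (dev < -nSigma)
            std::cout << p.insertSize << ": insertion candidate (" << dev << " sd)\n";
    }
}

int main() {
    // Library with mean insert size 400 bp and standard deviation 30 bp.
    std::vector<ReadPair> pairs{{395}, {410}, {640}, {290}};
    flagDiscordant(pairs, 400.0, 30.0);  // 640 -> deletion, 290 -> insertion
}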

2.2.5 Sequencing-Based vs. Array-Based Methods

Large unbalanced SVs (CNVs) can also be detected with array-based techniques (Iafrate et al., 2004; Redon et al., 2006; Conrad et al., 2010). In fact, array-based methods have so far been the major strategy for inferring copy number of genomic sequences. There are two major array techniques for this purpose: array comparative genomic hybridization (arrayCGH) (Iafrate et al., 2004) and SNP microarrays (Bignell et al., 2004), both based on similar principles. Most importantly, in both cases array design requires solid prior knowledge of the genome. In arrayCGH, differentially labeled test and control DNA are competitively hybridized on the probe array and the copy number ratio is inferred from fluorescence signal ratios. The probes are usually long oligonucleotides (∼ 50 bp) or bacterial artificial chromosome (BAC) clones. SNP arrays also carry oligonucleotide probes, but probe sequences are shorter (∼ 20 bp) and designed to be specific to single nucleotide differences. Only test DNA is hybridized to the SNP array and binding intensity is measured. To generate a copy number ratio as in arrayCGH, the measured intensities are compared to signal

intensities of control DNA on a different array. In general, arrayCGH offers a higher signal-to-noise ratio than SNP arrays (Greshock et al., 2007; Alkan et al., 2011). However, SNP microarrays carry SNP allele-specific probes and can thereby detect CNV alleles with higher sensitivity (Alkan et al., 2011). Both technologies have difficulty inferring high copy counts due to saturation and have limited power to detect variation in repetitive regions. The cost-effectiveness of array techniques for CNV detection has enabled their use in diagnostics (Zhang et al., 2008b). However, with decreasing cost, sequencing is becoming the standard method for CNV detection, as it offers several advantages. Not only is sequencing in principle free from the necessity of prior knowledge of the genome, but it also does not suffer from saturation. In theory, it is able to detect all copy number ratios and in general offers a lower noise level. Furthermore, SV detection at base pair resolution is possible, and the full range of genomic variation including balanced SVs can be discovered (Alkan et al., 2011). While repeat sequences are difficult to map and analyze, they are nevertheless theoretically detectable with sequencing data; this will mostly require longer reads and larger fragment library sizes as well as more sophisticated computational handling. Apart from cost, another limiting factor for the use of sequencing-based SV detection in diagnostics has so far been a lack of experience in reliable computational handling, especially with relatively short read lengths and sequencing biases, which can complicate analysis. But with coming advances in sequencing technology and, importantly, in computational method development, sequencing will most likely become the method of choice in the near future. The availability of efficient algorithms for clever handling of the huge amounts of sequencing data and for comprehensive variant detection will play a key role in this process.

2.3 Overview of Thesis

This thesis deals with some of the main computational steps in NGS data analysis, focusing on read mapping and variant detection. The methods developed within this work aim to remain flexible to advances in NGS technology and sufficiently efficient to handle large datasets. Ultimately, this work thus contributes to advancing computational method development towards comprehensive and efficient algorithmic tools for NGS data. In Chapter 3, we will focus on "general-purpose" read mapping, which finds application in all reference-guided studies. Apart from introducing a read mapping tool with sensitivity control (joint work with David Weese), we will shed light on the non-trivial task of benchmarking read mapping tools (joint work with Manuel Holtgrewe). The subject of Chapter 4 will be two strategies for specialized, partial read mapping. We apply the first method to small RNA sequencing data (joint work with Marcel Grunert). We then introduce a tool for split-read mapping, algorithmically an extension of this method. Employed on different data sets, we demonstrate its flexibility and high accuracy in detecting large indel variants. Chapter 5 then concentrates on small variant detection and introduces a tool for SNV and indel calling. All methods developed in this work have been implemented within the SeqAn C++ library (Döring et al., 2008), a comprehensive library of C++ functionality for sequence analysis (more detail

in the following section). This makes our tools easily available to the community, reusable and adaptable for possible changes or special purposes, and, very importantly, efficient.

2.3.1 SeqAn

SeqAn (Döring et al., 2008) is an open-source C++ library providing extensive functionality for biological sequence analysis. Its core components are efficient data structures and algorithms available to programmers, but an increasing number of SeqAn applications for direct use by end users is also becoming available. SeqAn uses a generic template-based design, and its main goals are efficiency, integrability, generality and usability. Algorithmic components such as standard alignment algorithms are implemented in a reusable fashion such that they may be employed in many applications. SeqAn has been under active development since 2005 and is continuously growing. Automatic nightly builds and tests ensure high quality. Any tool and code that is part of the library is thus required to meet SeqAn's quality standards.

2.3.2 Developed SeqAn Tools

A number of tools that are available from the SeqAn project webpage (www.seqan.de) were developed in the course of this work:

1. RazerS: a general read mapping tool for gapped and ungapped semiglobal mapping (Chapter 3). Joint project with David Weese, own contribution in tool development (mainly the ParamChooser module/tool), testing and manuscript preparation.

2. MicroRazerS: a prefix-based read mapping tool, specifically for application in small RNA sequencing, built on the RazerS core engine. Joint project with Marcel Grunert, own contribution mainly in tool development and manuscript preparation.

3. SplazerS: a split-read mapping tool, employing a prefix and suffix alignment strategy, which can identify deletions of arbitrary size and up to medium-sized insertions, built on the RazerS core engine.

4. SnpStore: a variant detection tool that uses a set of mapped reads and a reference sequence as input and calls and genotypes SNVs and indels.

5. Helper tools: indelSimulator and variantComp, for simulation of a variant-containing sample genome and for comparison of predicted with reference variants, respectively.

Chapter 3

Read Mapping

Given a set of sequenced reads, the goal of read mapping is to find the location of origin of each read in a reference genome. In this chapter we address the most conventional type of read mapping, semiglobal read mapping, where the read sequence needs to align with its entire length in order for the read to be mapped successfully. Semiglobal read mapping finds application in most reference-guided NGS data analyses, including genome resequencing and RNA-Seq (Bentley et al., 2008; Mortazavi et al., 2008). It is one of the most fundamental steps in NGS data analysis (as seen in the computational analysis framework introduced in section 2.2.3) and of particular importance as all subsequent steps will be based on the mapped reads.

From a computer scientist’s point of view, read mapping is an approximate string search problem: we search for approximate matches of many short sequences (the reads) in a long text (the reference sequence). Difficulties lie mainly in the computational complexity due to the huge number of reads (millions to billions) and the length of the reference sequence (for example 3 billion base pairs for the human genome). Also, reads are relatively short (30 to 150 bp), making it hard to unambiguously assign them to a specific location in the genome, especially in the presence of sequencing errors and repeat sequences.

In this chapter, we will introduce the basic algorithms and concepts that play key roles in read mapping, placing the focus on semiglobal read mapping. First, we will give the necessary notation. Next, we turn to read mapping algorithms and point out their basic steps, which will serve as background knowledge for the remainder of this and also for the following chapter. After briefly describing some different read mapping tools, we will then focus on the q-gram counting based RazerS (Weese et al., 2009), which was developed in collaboration with David Weese. We will concentrate on own contributions, which are predominantly in RazerS' parameter choosing module (section 3.4) and in the comparison and evaluation of read mapping tools (section 3.5). Section 3.6 presents further results for benchmarking of read mapping tools, which was done in collaboration with Manuel Holtgrewe.


Notation

Given a set of reads $R$ with $r \in R$ and a reference sequence $g$, where $r$ and $g$ are sequences (or strings) over the alphabet $\Sigma = \{A, C, G, T, N\}$. A subsequence (or substring) of a sequence $s$ starting in position $i$ and ending in position $j$ is denoted by $s_{i,j}$. Furthermore, the operator $|\cdot|$ denotes the length of a sequence or the size of a set. Given an alignment $a_r^{g_{i,j}}$ of a read sequence $r$ and a reference subsequence $g_{i,j}$, we define a distance function $d(a_r^{g_{i,j}})$ that returns a validity measure of an alignment (see section "Distance Functions" below). Super- and subscripts will be dropped when the context is clear.

Semiglobal Read Mapping

Given a read $r \in R$, a reference sequence $g$, a distance function $d$ and a maximal distance $k$, a valid match of $r$ is an alignment $a_r^{g_{i,j}}$ with $d(a_r^{g_{i,j}}) \leq k$. For each read, there may exist multiple valid matches. A best match $a'$ of read $r$ is one for which $d(a') = \min_{a \in A} d(a)$ holds, where $A$ is the set of all valid matches of $r$. If only one valid match exists for a read, the read is considered unique. For short reads, it is common to call a read unique if it has only one valid best match, i.e. suboptimal matches may exist. Generally, the goal in read mapping is to find all valid matches, but different objectives are in use, as defined in (Holtgrewe et al., 2011): apart from finding all valid matches (the all problem), it is common to report all best valid matches only (all-best), or up to $c$ best valid matches (c-best, or any-best if $c = 1$).

Distance Functions

Different distance functions are used, Hamming and Levenshtein (edit) distance being among the most popular choices in short read mapping. In the case of Hamming distance, the alignment is allowed to have matches and mismatches only, and the distance function d returns the number of mismatches in the alignment. For Levenshtein distance, gaps are additionally allowed, and d returns the sum of the number of gaps and mismatches. However, some read mapping tools (Rumble et al., 2009; Li and Durbin, 2009) measure alignment similarity by using a scoring function as in Smith-Waterman alignment (Smith and Waterman, 1981). In addition, base call quality values may be incorporated into the distance function, penalizing alignment errors as a function of the sequencing quality of mismatched bases (Li et al., 2008a). In this work, we will concentrate on Hamming and Levenshtein distance read mapping.
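To make the two distance functions concrete, the following minimal C++ sketch (our illustration, not SeqAn code; the function names are ours) computes the Hamming distance of two equal-length sequences and the Levenshtein distance via the textbook dynamic programming recurrence with unit costs:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hamming distance: number of mismatching positions (equal-length sequences).
int hammingDistance(const std::string& a, const std::string& b) {
    int d = 0;
    for (size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

// Levenshtein (edit) distance: mismatches and gaps, unit cost each.
int levenshteinDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = (int)j;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = (int)i;
        for (size_t j = 1; j <= b.size(); ++j)
            cur[j] = std::min({ prev[j - 1] + (a[i - 1] != b[j - 1]),
                                prev[j] + 1,        // gap in a
                                cur[j - 1] + 1 });  // gap in b
        std::swap(prev, cur);
    }
    return prev[b.size()];
}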

3.1 Background: Filtering and Verification Methods

Figure 3.1 illustrates the two basic steps that efficient read mapping algorithms usually rely on. The first step identifies potential matches using a fast filtering technique based on an index data structure. Identified potential matches are then inspected by an exact and usually slow alignment method.


Figure 3.1: Many read mapping algorithms are based on two steps: a fast filtering step that identifies potential matches and a slow alignment verification method that evaluates the potential match and classifies it as true (valid) or false (invalid).

If the alignment fulfills the maximum distance criterion k, it is returned as a valid match, i.e. it classifies as a true positive of the filter. In the following we briefly review different filtering techniques, including index data structures commonly in use and alignment algorithms for verification of potential matches.

Notation

We call a short sequence of length $q$ a q-gram¹. q-grams shared between two sequences are called q-matches. A gapped q-gram or shape is a set $Q = \{p_1, p_2, \ldots, p_s\}$ of positions $p_1 < p_2 < \ldots < p_s$.

Here, we enforce that $p_1 = 1$. The span $s(Q)$ is the maximum entry in $Q$, i.e. $s(Q) = p_s$. The weight $q$ or $w(Q)$ is the cardinality $|Q|$ of the set. Instead of the set notation, we will sometimes use the more graphical representation of shapes, e.g. ###--##-# for $Q = \{1, 2, 3, 6, 7, 9\}$, where each "#" represents a position contained in $Q$ and each "-" is a joker position not contained in $Q$. One-gapped shapes are a special case of gapped shapes that contain exactly one contiguous, arbitrary-length stretch of jokers. A shape can be applied to $n - s(Q) + 1$ overlapping positions in a sequence of length $n$, thereby producing $n - s(Q) + 1$ q-grams. An alignment transcript is a string over the edit operation alphabet $\{M, R\}$ in the case of Hamming distance, and $\{M, R, I, D\}$ in the case of Levenshtein distance, where M, R, I and D correspond to match, replacement (mismatch), insertion and deletion, respectively. An alignment $a_{s_2}^{s_1}$ between two sequences $s_1$ and $s_2$ is unambiguously described by exactly one transcript $T$. The distance $d(a_{s_2}^{s_1})$ is given by the sum of occurrences of the characters R, I and D in $T$, i.e. the sum of error operations, which we also denote as $\|T\|$. A $Q$-match produced by two matching $Q$-grams can then be represented by a transcript sequence of length $q$ that consists of M operations only. We will also use the term seed to refer to a short match.
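As an illustration of the shape notation, the following C++ sketch (our own helper, not SeqAn code) applies a shape given in its graphical #/- representation to every window of a sequence and extracts the resulting gapped q-grams:

#include <string>
#include <vector>

// Apply shape Q (e.g. "###--##-#") to all n - s(Q) + 1 windows of seq;
// a gapped q-gram keeps only the '#' positions, jokers ('-') are skipped.
std::vector<std::string> applyShape(const std::string& seq, const std::string& shape) {
    std::vector<std::string> qgrams;
    if (seq.size() < shape.size()) return qgrams;
    for (size_t i = 0; i + shape.size() <= seq.size(); ++i) {
        std::string g;
        for (size_t p = 0; p < shape.size(); ++p)
            if (shape[p] == '#') g += seq[i + p];
        qgrams.push_back(g);   // one q-gram of length w(Q) per window
    }
    return qgrams;
}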

3.1.1 Filtering Techniques

A filter is characterized by its sensitivity and selectivity. Filter sensitivity is the fraction of true positive matches returned among all valid matches. Filters that have sensitivity = 1 hence guarantee to identify all valid matches and are called lossless, while filters with sensitivity < 1 may

¹ Alternatively, the term k-mer for a short sequence of length k is often used in the literature.

miss valid matches and are called lossy. The selectivity of a filter is the fraction of true positive matches among all potential matches returned. Thus, the lower the selectivity of a filter, the higher is the fraction of false positive potential matches. Usually, sensitivity increases when selectivity decreases and vice versa. Therefore, filter algorithms usually try to find a good tradeoff between the two, or try to maximize selectivity while remaining lossless. By inspecting the properties of valid match transcripts, efficient filters can be designed. Filtering approaches can be roughly divided into seed-based and counting approaches. While in seed-based filtering a single exact match serves to seed/initiate an alignment, counting approaches look for regions containing at least a certain number of matches and return those regions as potential matches. Furthermore, we differentiate between single shape approaches and approaches based on multiple shapes (also called seed or shape families). Usually, a high match weight $q$ is desirable for efficiency, as we want to avoid random q-matches, whose probability is inversely proportional to $|\Sigma|^q$. Two of the most prominent principles in filtering are the q-gram counting lemma and the pigeonhole principle, both described in the following.

Pigeonhole principle

Formalized by Myers (1994) for approximate string matching, and here adapted to fit our notation, the pigeonhole lemma states:

Lemma 3.1.1. Given an alignment $a_{s'}^{s}$ of two sequences $s$ and $s'$ with Hamming or Levenshtein distance $k = d(a_{s'}^{s})$, let $s = s_1 s_2 \cdots s_j$ be a concatenation of arbitrary-length non-empty subpatterns, and let $m_1, m_2, \ldots, m_j$ be non-negative integers such that $M = \sum_{i=1}^{j} m_i$. Then, for some $i \in \{1, \ldots, j\}$, $s'$ includes a substring $s''$ such that $d(a_{s''}^{s_i}) \leq \lfloor \frac{m_i k}{M} \rfloor$.

The basic intuition behind this principle is easily seen in an example (Figure 3.2A). Given an alignment with $k$ errors ($k = 2$ in the example figure), if the alignment transcript is divided into $k+1$ non-overlapping parts, there must exist at least one part that contains no errors. The corresponding transcript $T$ thus contains at least one $\lfloor |T|/(k+1) \rfloor$-match. This holds for Hamming as well as Levenshtein distance. Similarly, if the alignment is divided into $k+2$ parts, there must be at least two parts without errors. Several early NGS read mapping tools (first used in (Cox, 2006)) exploit this special case of the pigeonhole lemma for $k = 2$. Instead of searching for two parts independently, they use a seed family of one contiguous and two one-gapped shapes in order to increase the weight of the seed (Figure 3.2C). This works for Hamming distance, as mismatches placed within the gap are tolerated. However, insertions and deletions lead to a different distance between the two matching parts in reference and read, and are thus not tolerated. Therefore, this special two-seed pigeonhole strategy is lossless only for Hamming distance up to $k = 2$, and lossy for Levenshtein distance with $k > 1$.
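The basic pigeonhole seeding step can be made concrete with a minimal C++ sketch (our illustration): a read is split into k + 1 non-overlapping pieces, of which at least one must occur exactly in any alignment with at most k errors; each piece is then searched exactly and its hits are verified.

#include <string>
#include <vector>

// Split a read into k+1 non-overlapping pigeonhole seeds.
std::vector<std::string> pigeonholeSeeds(const std::string& read, int k) {
    std::vector<std::string> seeds;
    size_t pieces = k + 1, len = read.size() / pieces;
    for (size_t i = 0; i < pieces; ++i)
        seeds.push_back(read.substr(i * len,
                        i + 1 == pieces ? std::string::npos : len));
    return seeds;  // at least one seed matches exactly if <= k errors occur
}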


Figure 3.2: Example of filtering techniques for k = 2. The first row shows an alignment of length 12 with Levenshtein distance 2. A) Following the pigeonhole principle, there must be at least one matching contiguous (ungapped) 4-gram, or B) following the q-gram counting approach with overlapping q-grams, there must be at least 4 matching 3-grams. The second row shows an alignment of length 12 with Hamming distance 2. C) According to the two-seed pigeonhole principle, which uses an ungapped and two gapped shapes, there must be at least one matching (gapped or ungapped) 6-gram, or D) according to the gapped q-gram counting approach, there is at least one gapped 4-gram of shape ##---#-#. Note that the gapped shape would not tolerate insertion or deletion errors.

Q-gram counting

While the pigeonhole principle looks at non-overlapping seeds or seed parts, the q-gram counting approach uses overlapping seeds. The q-gram lemma (Jokinen and Ukkonen, 1991) states:

Lemma 3.1.2. Given an alignment $a_{s'}^{s}$ of two sequences $s$ and $s'$ with Hamming or Levenshtein distance $k = d(a_{s'}^{s})$, then $s$ and $s'$ share at least $t_0 = \max(|s|, |s'|) - (k + 1) \cdot q + 1$ common q-grams.

This lemma holds for ungapped q-grams only. We call $t_0$ the optimal lossless threshold, i.e. the highest threshold value that guarantees full sensitivity. A high threshold value $t$ is preferred to a lower one, as this increases selectivity (given the same $q$). A q-gram counting filter is then characterized by the tuple $(q, t)$. An example of the worst case error scenario of a (3,4)-filter is shown in Figure 3.2B. Placing gaps into the q-gram has been shown to dramatically increase sensitivity for Hamming distance (Burkhardt and Kärkkäinen, 2001). Conversely, one can increase the weight of the gapped q-gram while maintaining the same sensitivity as with an ungapped q-gram, thereby increasing selectivity. For example, for sequence length $n = 18$ and $k = 2$, one can find a gapped 7-gram that has $t_0 > 0$ while the ungapped 7-gram cannot guarantee full sensitivity with a threshold > 0. However, no closed formula for the minimum number of q-matches, i.e. the threshold $t_0$, is known for gapped shapes, and optimal threshold computation has been shown to be NP-complete (Nicolas and Rivals, 2005). Algorithms for the computation of the optimal threshold and for the calculation of filtering sensitivities have been proposed (Herms and Rahmann, 2008), but assume alignment transcripts to be generated by a Markov process, i.e. are unaware of the position within the alignment. This information, however, is of importance in read mapping, as sequencing quality decreases towards the read end and the alignment error probability therefore increases. Also, there is no simple way of computing "good" shapes that lead to a high filtration efficiency (Burkhardt

and Kärkkäinen, 2001), and this problem is NP-hard (Nicolas and Rivals, 2005). We will focus on q-gram counting as implemented in RazerS in section 3.4, more specifically on sensitivity and threshold computation in section 3.4.1 and on shape computation in section 3.4.2.
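For ungapped q-grams, the optimal lossless threshold of Lemma 3.1.2 can be computed directly, as the following small C++ sketch (our illustration) shows; the example values reappear in section 3.4.1:

// Optimal lossless threshold for ungapped q-grams (Lemma 3.1.2).
int losslessThreshold(int n, int k, int q) {
    return n - (k + 1) * q + 1;  // <= 0 means no lossless (q,t)-filter exists
}
// Example: n = 36, k = 2, q = 11 gives 36 - 3*11 + 1 = 4,
// i.e. the (11,4)-filter is the most selective lossless choice here.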

3.1.2 Index Data Structures

In order to quickly identify exact q-matches (or seed matches), all read mapping algorithms build an index of the reads, of the genome, or of both. A popular index structure is the suffix array (Manber and Myers, 1993), which stores all suffixes of the indexed sequence in lexicographically sorted order for fast access. However, due to its high space consumption for large sequences (around 48 GB for the human genome) it is not very common in read mapping. The most recent read mapping tools to date use a compressed version of the suffix array, the FM-index (Manzini and Ferragina, 2004) (around 3 GB for human), which uses the Burrows-Wheeler transform for compression (Burrows and Wheeler, 1994). Most early read mapping tools, however, use a simpler, more access-efficient data structure: a hash table or q-gram index, which stores all subsequences of length q in sorted order. In the following we describe the q-gram index in more detail, as it is heavily used and of most importance for this work.

Q-gram index

A q-gram index consists of two arrays: one that stores the occurrence positions of q-grams in sequence $s$ in lexicographically sorted order (occ), and one that stores for each q-gram a pointer to the first position in occ corresponding to the respective q-gram (dict), see Figure 3.3. Construction time is linear in $|s|$, in the case of gapped q-grams $O(q \cdot |s|)$. In a full q-gram index, dict has size $|\Sigma|^q$. Each possible q-gram $p$ can be quickly accessed with its hash value $h = \sum_{i=1}^{q} |\Sigma|^{q-i} \cdot \mathrm{ord}(p_i)$, where the function ord assigns a value $0, \ldots, |\Sigma| - 1$ to each character of the alphabet. For $q > 14$ and the DNA alphabet, this q-gram index becomes too large in practice, as it requires at least $|\Sigma|^q \cdot 4$ bytes for dict. In this case, a q-gram index with open addressing may be used: by using a non-trivial hash function it requires less space for the array dict, but then requires more time for q-gram lookups.


Figure 3.3: The q-gram index data structure. The dictionary dict provides fast lookup of q-grams in the sorted occurrence table occ.
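The following C++ sketch shows how such a full q-gram index with the two arrays dict and occ can be built with a counting sort; it is a simplified stand-in for SeqAn's implementation, restricted to ungapped q-grams over the DNA alphabet (q <= 14, so that dict fits in memory):

#include <string>
#include <utility>
#include <vector>

struct QGramIndex {
    int q;
    std::vector<int> dict;  // size 4^q + 1: bucket start for each hash value
    std::vector<int> occ;   // occurrence positions, grouped by q-gram

    static int ordValue(char c) {              // 'ord' from the text
        switch (c) { case 'A': return 0; case 'C': return 1;
                     case 'G': return 2; default: return 3; }
    }
    static int hashValue(const std::string& s, int pos, int q) {
        int h = 0;                             // h = sum |Sigma|^(q-i) * ord(p_i)
        for (int i = 0; i < q; ++i) h = 4 * h + ordValue(s[pos + i]);
        return h;
    }
    QGramIndex(const std::string& s, int q_) : q(q_) {
        int buckets = 1 << (2 * q);            // 4^q
        int n = (int)s.size() - q + 1;         // number of q-grams in s
        dict.assign(buckets + 1, 0);
        for (int i = 0; i < n; ++i) ++dict[hashValue(s, i, q) + 1];
        for (int h = 1; h <= buckets; ++h) dict[h] += dict[h - 1];
        occ.resize(n);
        std::vector<int> next(dict.begin(), dict.end() - 1);
        for (int i = 0; i < n; ++i) occ[next[hashValue(s, i, q)]++] = i;
    }
    // all occurrences of the q-gram starting at position pos of query p:
    std::pair<const int*, const int*> lookup(const std::string& p, int pos) const {
        int h = hashValue(p, pos, q);
        return { occ.data() + dict[h], occ.data() + dict[h + 1] };
    }
};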

3.1.3 Verification Techniques

Depending on the filter technique and index type used, a potential match is either identified and verified through a backtracking procedure through the index (usually for suffix array or FM index based methods), or it is returned as a potentially matching subsequence of the genome. In the latter case, a classical alignment method then verifies whether a valid read-to-reference alignment exists. We concentrate on this case, as it is more general and used in q-gram counting filters. Let read $r$ have a potential match in genome subsequence $s$. In the case of Hamming distance, it suffices to naively test each possible alignment starting position $i$ and scan the alignment $a_r^{s_{i,i+|r|-1}}$ for mismatches going from left to right. Scanning stops once either the maximum distance $k$ cannot be fulfilled anymore, because $k+1$ mismatches have been encountered already, or the end of the alignment has been reached, in which case a valid match is returned. If gaps are allowed in the alignment (or in general if a scoring scheme other than Hamming distance is used), usually a dynamic programming approach is used. In the following we briefly provide the basics of dynamic programming.
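A minimal C++ sketch of this naive Hamming verification (our illustration):

#include <string>

// Test read r against every alignment start in region s; a diagonal is
// abandoned as soon as k+1 mismatches have been seen (early abort).
bool verifyHamming(const std::string& r, const std::string& s, int k, int& matchPos) {
    for (size_t i = 0; i + r.size() <= s.size(); ++i) {
        int errors = 0;
        size_t j = 0;
        for (; j < r.size(); ++j)
            if (r[j] != s[i + j] && ++errors > k) break;
        if (j == r.size()) { matchPos = (int)i; return true; }  // valid match
    }
    return false;
}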

Dynamic programming verification

Alignment of two sequences with dynamic programming (DP) methods has its origin in Needleman-Wunsch alignment (Needleman and Wunsch, 1970). Given two sequences $r$ and $s$, a DP matrix $F$ is computed that stores in each cell $F_{i,j}$ the score of the optimal alignment of prefixes $r_i$ and $s_j$. A traceback through the matrix gives the actual alignment. While the Needleman-Wunsch algorithm computes a global alignment, i.e. the one that ends in cell $F_{|r|,|s|}$, modifications such as the Smith-Waterman algorithm (Smith and Waterman, 1981) to compute local alignments have subsequently been proposed. Another variation is semiglobal alignment, where the optimal alignment of one sequence to a subsequence of the other sequence is computed, as is the case for alignment of a read to a potential match region. Given read $r$ and genomic subsequence $s$, the DP matrix is computed with the following initialization and recursion:

1. Initialization

$$F_{i,0} = d(r_i, \text{'-'}) \cdot i, \quad 0 \leq i \leq |r|; \qquad F_{0,j} = 0, \quad 0 \leq j \leq |s| \tag{3.1}$$

2. Recursion

$$F_{i,j} = \max \begin{cases} F_{i-1,j-1} + d(r_i, s_j) \\ F_{i-1,j} + d(r_i, \text{'-'}) \\ F_{i,j-1} + d(\text{'-'}, s_j) \end{cases} \tag{3.2}$$

where $d(x, y)$ returns a score for aligning character $x$ with character $y$. If we set $d(x, y) = -1$ for $x \neq y$ and $d(x, y) = 0$ if $x = y$, the cells in $F$ contain the Levenshtein distance (with negative sign). Tracing back the alignment from the cell with the best score in the last row gives us the optimal semiglobal alignment. There may be ambiguity in the optimal alignment, as shown in Figure 3.4. Banded alignment methods reduce complexity by only computing values within a band of the


alignment matrix, thereby avoiding the calculation of cells that cannot be part of the optimal alignment anyway.

Figure 3.4: A semiglobal alignment within a band of the dynamic programming matrix (gray background). Branches of the alignment trace (black line) indicate ambiguity among ending positions.
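The recursion (3.1)/(3.2) translates directly into code; the following C++ sketch (our illustration, unbanded and with unit costs) computes the score of the best semiglobal alignment, i.e. the negated Levenshtein distance of the read against the best-fitting subsequence of s:

#include <algorithm>
#include <string>
#include <vector>

int semiglobalScore(const std::string& r, const std::string& s) {
    std::vector<std::vector<int>> F(r.size() + 1, std::vector<int>(s.size() + 1));
    for (size_t i = 0; i <= r.size(); ++i) F[i][0] = -(int)i;  // gaps in genome
    for (size_t j = 0; j <= s.size(); ++j) F[0][j] = 0;        // free start in s
    for (size_t i = 1; i <= r.size(); ++i)
        for (size_t j = 1; j <= s.size(); ++j)
            F[i][j] = std::max({ F[i-1][j-1] - (r[i-1] != s[j-1]),  // (mis)match
                                 F[i][j-1] - 1,                     // gap in read
                                 F[i-1][j] - 1 });                  // gap in genome
    // best value in the last row = best alignment ending anywhere in s
    return *std::max_element(F[r.size()].begin(), F[r.size()].end());
}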

3.2 Overview of Read Mapping Tools

Table 3.1 gives an overview of some popular read mapping tools with some of their key algorithmic properties. Many more tools have been developed (see (Fonseca et al., 2012) for a recent comprehensive list), but we focus here on the ones that are among the most popular and relevant for this work. While earlier tools were mostly based on read indices, later tools prefer the space-efficient compressed FM index of the genome. Eland is Illumina's commercial software and was therefore one of the first read mapping tools geared toward short Illumina reads. It uses the two-seed pigeonhole strategy, which was also adopted by Maq (Li et al., 2008a), Soap (Li et al., 2008b) and SeqMap (Jiang and Wong, 2008). These tools are fully sensitive for up to two errors under Hamming distance only. Zoom (Lin et al., 2008) is also commercial; it uses multiple gapped seeds and guarantees full sensitivity for Hamming distance up to five errors, but is heuristic when gaps are allowed. Shrimp (Rumble et al., 2009) and RazerS (Weese et al., 2009) are the only q-gram counting based tools. Shrimp differs somewhat from the other read mapping tools, as it actually performs local Smith-Waterman alignment in its verification step. Therefore it can detect local (partial) as well as semiglobal alignments, making it applicable to a wider range of problems. While Shrimp uses q-gram counting, its newer version Shrimp2 uses multiple gapped seeds. However, both versions do not provide sensitivity control, and the default settings will be lossy in many cases. The newer read mapping tools Bowtie (Langmead et al., 2009), BWA (Li and Durbin, 2009), and Soap2 (Li et al., 2009b) are all based on the space-efficient FM index and are usually very fast for reporting a best match. However, for full sensitivity they are usually restricted to low numbers of errors, and they use heuristics when gaps are allowed. RazerS is the only tool that provides fine-grained sensitivity control for Hamming as well as Levenshtein distance, arbitrary read lengths, and error rates up to 10%. In particular, the 100% sensitivity switch that it provides for more than 2 errors was unique at the time of development and publication. In the following, we explain the RazerS algorithm in more detail.

Table 3.1: Short read mapping tools with their characteristics. Extends Table 1 from (Weese et al., 2009).

Eland (unpublished): two-seed pigeonhole filtering; Hamming distance; q-gram index on the reads; read length ≤ 32; full sensitivity only for up to 2 mismatches; cannot output all (suboptimal) hits.

Maq (2008): two-seed pigeonhole filtering; Hamming distance (Smith-Waterman for the second mate); q-gram index on the reads; read length ≤ 127; full sensitivity for up to 3 mismatches in the 28 bp prefix; cannot output all hits.

Soap (2008): two-seed pigeonhole filtering; Hamming distance or one gap of up to 3 bp; q-gram index on the reference; read length ≤ 60; full sensitivity for up to 2 mismatches; cannot output all hits.

SeqMap (2008): two-seed pigeonhole filtering; Hamming or edit distance with at most 5 errors; q-gram index on the reads; arbitrary read length; full sensitivity; can output all hits.

Zoom (2008): multiple gapped seeds; Hamming or edit distance with at most one gap; q-gram index on the reads; read length ≤ 63; switch to guarantee full sensitivity; can output all hits.

Shrimp (2009): q-gram counting; edit distance or Smith-Waterman scoring in the mapping step; q-gram index on the reads; arbitrary read length; no help for parameter choice, defaults will be lossy for many settings; can output all hits.

RazerS (2009): q-gram counting; Hamming and Levenshtein distance; q-gram index on the reads; arbitrary read length; arbitrarily adjustable sensitivity; can output all hits.

Bowtie (2009): FM index backtracking; Hamming distance; FM index on the reference; read length ≤ 1024; switch for full sensitivity for best matches; cannot output all hits.

BWA (2009): FM index backtracking; Levenshtein distance (Smith-Waterman for the second mate); FM index on the reference; arbitrary read length; switch for full sensitivity for best matches; cannot output all hits.

Soap2 (2009): pigeonhole seeding using the FM index; Hamming distance; FM index on the reference; read length ≤ 1024; full sensitivity only for up to 2 mismatches; cannot output all hits.

Shrimp2 (2010): multiple gapped seeds; Smith-Waterman scoring; q-gram index on the reference; arbitrary read length; no help for parameter choice, defaults will be lossy for most settings; can output all hits.

3.3 RazerS - Overview of Read Mapping Algorithm

In RazerS, the filtering and verification algorithm is preceded by a parameter choosing step, as visualized in Figure 3.5. RazerS was developed in collaboration with David Weese; own contributions are predominantly in this additional step. RazerS makes use of a q-gram counting filter for identifying potential read matches.

Figure 3.5: RazerS consists of three steps: a parameter choosing step followed by the two typical read mapping steps, filtering and verification. During parameter choosing, a shape Q and threshold t that guarantee the desired mapping sensitivity are chosen. Next, in the filtering phase, a q-gram index is built and the reference sequence is scanned for potential matches using the Swift algorithm. Potential matches are then verified by an alignment method.

In its first step, the parameter choosing phase, a q-gram shape and threshold guaranteeing a certain filtering sensitivity are chosen (explained in detail in section 3.4.3). Then, we build a q-gram index containing all read q-grams. Optionally, q-grams occurring more than a user-defined number of times can be masked to avoid investing computation time in counting highly repetitive/unspecific q-grams. Next, the reference sequence is scanned using the SWIFT filter algorithm (Rasmussen et al., 2005), counting q-grams in parallelograms (details in section 3.3.1). Potentially matching parallelograms are passed on to a verification algorithm (details in section 3.3.2). If multiple matches for one read exist, then up to a user-defined maximum number m are recorded. Once m exact (errorless) matches are found for a read, the read is disabled, such that no more time-consuming verifications are triggered. The user may set different parameters including the allowed error rate, the type of distance measure (Hamming/Levenshtein), the sensitivity level, and practical options such as the type of output file (e.g. SAM (Li et al., 2009a) or general feature format (GFF)). The SWIFT filtering algorithm and subsequent verification are described in the following two subsections. The parameter choosing step is described in more detail in the next section.

3.3.1 Filtering algorithm

RazerS implements the SWIFT q-gram filtering algorithm (Rasmussen et al., 2005). Given a $(q, t)$-filtering criterion, it returns parallelograms (i.e. bands) in the read-to-reference DP matrix that contain at least $t$ q-matches. It is based on the observation that any match with at most $k$ errors will lie within a band of at most $k + 1$ diagonals in the DP matrix. One could now check each parallelogram of width $k+1$ separately. Instead, the SWIFT filter sets the parallelogram width to $w$ with $w > k + 1$ and lets parallelograms overlap in $k$ diagonals. This way, each $(k+1)$-parallelogram is fully contained in one $w$-parallelogram (see Figure 3.6). Each $w$-parallelogram stores a counter for contained q-matches. Now we slide over the genome sequence, look up the current q-gram in the read q-gram index and increase the counter(s) of the corresponding parallelogram(s). For each read, only a low number of parallelograms is active at a time and therefore only few counters need to be kept in memory. Whenever a parallelogram becomes inactive (because we have scanned past it), we check whether its counter is larger than or equal to the threshold and, if so, pass the parallelogram on to verification. For Hamming distance one can use parallelograms of width 1, as any valid read match will always lie within a single diagonal (as no indels are allowed that could lead away from this diagonal).
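The core counting idea can be sketched as follows (our simplified illustration, not RazerS' actual implementation): every q-match defines a diagonal d = genomePos - readPos, and the counters of all parallelograms of width w (spaced w - k diagonals apart, hence overlapping in k diagonals) containing d are increased. The Hit type and the preceding index lookup are assumptions of this sketch.

#include <algorithm>
#include <map>
#include <utility>
#include <vector>

using Hit = std::pair<int, int>;  // (readId, readPos) from the q-gram index

// For one genome position, update the (read, parallelogram) counters of all
// q-matches; assumes d = genomePos - readPos >= 0.
void countQGramHits(const std::vector<Hit>& hits, int genomePos, int w, int k,
                    std::map<std::pair<int, int>, int>& counters) {
    int step = w - k;                             // parallelogram spacing
    for (const Hit& h : hits) {
        int d = genomePos - h.second;             // diagonal of this q-match
        // parallelogram b covers diagonals [b*step, b*step + w - 1]
        int bMin = (d - w + 1 + step - 1) / step; // ceil((d - w + 1) / step)
        int bMax = d / step;
        for (int b = std::max(bMin, 0); b <= bMax; ++b)
            ++counters[{h.first, b}];
    }
}
// Parallelograms whose counter reaches t become potential matches and are
// handed to verification once they are scanned past.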

3.3.2 Match Verification

For Hamming distance, the naive algorithm (diagonal scanning, see section 3.1.3) is implemented in RazerS. For Levenshtein distance, RazerS uses the bit-vector algorithm by Myers (1999), which computes the edit DP matrix exploiting bit parallelism. 64 cells of the matrix can be computed at once on a 64-bit CPU by maintaining a number of bit vectors. If a valid match of read $r$ exists, the bit-vector algorithm identifies its ending position $j$ in the genomic subsequence $s$. In order to


Figure 3.6: SWIFT filtering in overlapping parallelograms. Each parallelogram keeps a counter of matching 3-grams. The parallelogram with counter 8 holds a valid read alignment with Levenshtein distance k=4 which contains seven 3-matches. An additional random match contributes to the corresponding parallelogram counter. Figure from (Weese et al., 2009).

identify the starting position, the sequence $s_{1,j}$ and $r$ are then reversed and the bit-vector algorithm is repeated.
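For patterns of up to 64 characters, the search variant of Myers' algorithm fits into a few lines; the following C++ sketch (our illustration, a simplified stand-in for RazerS' banded implementation) returns the minimal edit distance over all end positions in s and reports the best ending position:

#include <cstdint>
#include <string>

int myersSemiglobal(const std::string& r, const std::string& s, int& bestEnd) {
    const int m = (int)r.size();                 // requires 1 <= m <= 64
    uint64_t Peq[256] = {0};                     // per-character match masks
    for (int i = 0; i < m; ++i)
        Peq[(unsigned char)r[i]] |= 1ULL << i;
    uint64_t Pv = ~0ULL, Mv = 0;                 // vertical +1 / -1 bit vectors
    int score = m, best = m + 1;
    bestEnd = -1;
    for (size_t j = 0; j < s.size(); ++j) {
        uint64_t Eq = Peq[(unsigned char)s[j]];
        uint64_t Xv = Eq | Mv;
        uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
        uint64_t Ph = Mv | ~(Xh | Pv);           // horizontal +1 deltas
        uint64_t Mh = Pv & Xh;                   // horizontal -1 deltas
        if (Ph & (1ULL << (m - 1))) ++score;     // track score in the last row
        else if (Mh & (1ULL << (m - 1))) --score;
        Ph <<= 1; Mh <<= 1;                      // top row is free (semiglobal)
        Pv = Mh | ~(Xv | Ph);
        Mv = Ph & Xv;
        if (score < best) { best = score; bestEnd = (int)j; }
    }
    return best;                                 // minimal edit distance
}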

3.4 RazerS - Filter Parameter Computation

We now describe how to automatically choose efficient q-gram filtration parameters for RazerS. The goal is to choose a shape and a threshold that optimize the running time while achieving a certain filter sensitivity. In other words, we want to allow the filter to be lossy in order to speed up the filtration process, but we want to be able to control how lossy our filter may be. More specifically, our goal here is to minimize the running time of RazerS while guaranteeing the desired read mapping sensitivity given position-dependent sequencing error probabilities as are typically seen for Illumina reads (see section 2.2.1). The two main computational issues here are 1) finding a set of "good" shapes which are considered as candidate shapes and 2) computing their sensitivity in combination with different thresholds, given a read length n, number of errors k (Hamming or Levenshtein) and position-dependent error probabilities. First, we introduce some new notation. Then we will give a DP formulation for the problem of computing filter sensitivities for a given shape (section 3.4.1). Next, a heuristic approach for computing good shapes (with high weight) for a low number of errors is explained (section 3.4.2). Finally, we devise a simple strategy for choosing efficient filter parameters, implemented in the SeqAn tool ParamChooser, which is integrated into RazerS.

Notation

A position-dependent error profile is a vector $p^X = \{p_1^X, p_2^X, \ldots, p_n^X\}$ where $p_i^X$ describes the probability of error type (edit operation) $X \in \{R, I, D\}$ at position $i$ in a read of length $n$. The probability of a match is set to $p_i^M = 1 - p_i^R - p_i^I - p_i^D$. We can then calculate the occurrence probability of a transcript $T$ as $p(T) = \prod_{i=1}^{|T|} p_{r(T,i)}^{T_i}$, where $r(T, i)$ returns the read position corresponding to transcript position $i$. For insertions, we define $r(T, i)$ as the read position before the insertion.

The loss rate of a filter for matches with exactly $k$ errors is defined as the probability that a transcript with $k$ errors is missed by the filter. Conversely, the sensitivity of a filter for matches with $k$ errors is defined as the probability that a transcript with $k$ errors is identified by the filter as a potential match.
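A small C++ sketch (our illustration) of the occurrence probability p(T); the read-position bookkeeping encodes one possible convention, following the text's rule that an insertion uses the read position before it, and is an assumption of this sketch:

#include <array>
#include <string>
#include <vector>

// profile[i] = {p^M_i, p^R_i, p^I_i, p^D_i} for (0-based) read position i.
double transcriptProbability(const std::string& T,
                             const std::vector<std::array<double, 4>>& profile) {
    auto opIndex = [](char op) {
        return op == 'M' ? 0 : op == 'R' ? 1 : op == 'I' ? 2 : 3;
    };
    double p = 1.0;
    int readPos = 0;                                   // r(T, i)
    for (char op : T) {
        // assumption: I consumes no read base and uses the position before it
        int at = (op == 'I' && readPos > 0) ? readPos - 1 : readPos;
        p *= profile[at][opIndex(op)];
        if (op != 'I') ++readPos;
    }
    return p;
}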

3.4.1 Computing Sensitivities by Dynamic Programming

We include here the formulation of the dynamic programming approach for Hamming distance and gapped shapes as it is explained and proven correct in (Weese et al., 2009). An extension to Levenshtein distance is formulated in (Weese et al., 2009). The algorithm was originally developed and implemented by David Weese. Own contributions were in extending the implementation to gapped q-grams and to position-dependent error probabilities. Let $Q$ be a gapped shape with weight $q$ and span $s(Q)$. For a $(Q, t)$-filter, we are interested in the probability that a transcript $T$ with $k$ errors contains at least $t$ $Q$-matches: $P(T \text{ contains} \geq t\ Q\text{-matches} \mid \|T\| = k)$. Note that this probability is only a lower bound for the sensitivity of a $(Q, t)$-filter that counts in overlapping windows/parallelograms, as cross matches, i.e. randomly matching $Q$-grams that are not part of the optimal alignment of two sequences (see Figure 3.6), may contribute to the number of shared $Q$-grams. We denote the sum of the occurrence probabilities of all transcripts of length $n$ with $k$ errors containing at least $t$ $Q$-matches by $S(n, k, t)$. The total sum of the occurrence probabilities of all transcripts of length $n$ with $k$ errors is then $S(n, k, 0)$, and

$$P(T \text{ contains} \geq t\ Q\text{-matches} \mid \|T\| = k) = \frac{S(n, k, t)}{S(n, k, 0)} \tag{3.3}$$

By enumerating all transcripts with $k$ errors, calculating their occurrence probabilities and counting the number of $Q$-matches, we could naively compute $S(n, k, t)$. However, this quickly becomes infeasible with increasing $n$ and $k$, as the number of different Hamming transcripts is $\binom{n}{k}$. Instead, we use a dynamic programming approach similar to the one in (Burkhardt and Kärkkäinen, 2001). We denote the occurrence probability of a sub-transcript $T$ starting at position $j$ of a read by $p(T, j) = \prod_{i=1,\ldots,|T|} p_{i+j}^{T[i]}$. Let $R(i, e, t, T_2)$ be the sum of the occurrence probabilities of all transcripts $T_1$, s.t. $|T_1| = i$ and $\|T_1\| = e$ and the concatenation $T_1 T_2$ contains at least $t$ $Q$-matches. Then it suffices to enumerate all transcripts of length $s(Q)$, i.e. $\sum_{e=0}^{k} \binom{s(Q)}{e}$ many Hamming transcripts, and we can calculate

$$S(n, e, t) = \sum_{|T| = s(Q),\ \|T\| \leq e} R(n - s(Q),\ e - \|T\|,\ t,\ T) \cdot p(T, n - s(Q)) \tag{3.4}$$

where $p(T, n - s(Q))$ is the occurrence probability of transcript $T$ at the end of a transcript of length $n$. $R(n - s(Q),\ e - \|T\|,\ t,\ T)$ is the sum of the occurrence probabilities of all possible transcript beginnings with $e - \|T\|$ errors such that the concatenation of such a beginning with $T$ results in a transcript of length $n$ with exactly $e$ errors and at least $t$ $Q$-matches. Summing up over all

transcript endings $T$ we get the occurrence probability sum of all such transcripts of length $n$. We can calculate $R$ recursively, for $e = 0, \ldots, k$, $i = 1, \ldots, n$, $t = 0, \ldots, t_{max}$ and for all $T \in \{M, R\}^{s(Q)}$:

$$R(1, e, t, T) = \begin{cases} 1 & \text{if } e = 0,\ t \leq \delta(T) \\ 0 & \text{else} \end{cases} \tag{3.5}$$

$$R(i, e, t, T) = p_i^M \cdot R(i-1,\ e,\ t - \delta(T),\ \mathrm{shift}(M, T)) + p_i^R \cdot R(i-1,\ e-1,\ t - \delta(T),\ \mathrm{shift}(R, T)) \tag{3.6}$$

with $\mathrm{shift}(x, T) = x T_{2,|T|}$, and $\delta(T) = \begin{cases} 1 & \text{if } T \text{ contains a } Q\text{-match} \\ 0 & \text{else} \end{cases}$

Remarks

Sequencing errors in Illumina reads tend to accumulate towards the 3' ends. From a filtering efficiency standpoint, this makes lossy filtering very profitable, even more so than with a uniform error profile. When taking non-uniform probabilities into account, the threshold can be set to a higher value (thereby improving filter selectivity) while maintaining the same sensitivity level as with a uniform profile. This is demonstrated with an example in Figure 3.7, where the sensitivity of an ungapped 11-gram for read length $n = 36$ and Hamming distance $k = 2$ is plotted (typical values for early Illumina reads). Sensitivities were computed using the DP algorithm above. An (11,6)-filter achieves sensitivity 95% with uniform error probabilities. In practice though, with an Illumina error profile, one can use an (11,8)-filter and achieve the same sensitivity, thus increasing the threshold by 2. If one uses a gapped 11-gram, the effect is also seen but is a bit less pronounced. In particular, sensitivity drops more rapidly (from 100% with $t = 5$ to below 95% with $t = 6$).

Figure 3.7: Example of sensitivity as a function of threshold, given read length n = 36 with k = 2 errors (Hamming), with and without position-dependent error probabilities, for a) the ungapped 11-gram and b) the gapped shape #####---#-------##### with weight 11.

The tradeoff between sensitivity and selectivity (increasing the threshold) is less smooth, due to the higher span of the q-gram and therefore the lower total number of q-grams that can be placed in a sequence of fixed length. However, for 100% sensitivity we can see in this example that the optimal lossless threshold $t_0 = 4$ for the ungapped 11-gram is smaller than $t_0 = 5$ for a gapped shape with the same weight. Gapped shapes are thus especially valuable when 100% sensitivity is required (and only mismatches are allowed).

3.4.2 Computing Heavy Lossless Shapes

An important feature of RazerS is that it can guarantee 100% mapping sensitivity if desired. For this reason, and as lossless filtering is the hardest case, we are particularly interested in having efficient filter settings for the 100% sensitivity case. For Levenshtein distance, ungapped q-grams will be the most sensitive choice, as joker positions do not tolerate indels. For Hamming distance, however, good gapped shapes are of importance (as seen in Figure 3.7). We will later see that the weight of the shape, i.e. parameter $q$, turned out to have the greatest impact on the running time of RazerS. Therefore, we derived a greedy approach to computationally determine heavy lossless shapes with $t_0 > 0$, given sequence length $n$ and up to $k$ mismatches. We show that in many cases the method returns a shape of maximum weight, in some cases even the optimal shape as determined in (Burkhardt and Kärkkäinen, 2001). As the complexity is exponential in $k + 1$, we only use the method for $n \leq 50$ and $k \leq 3$. Given a shape $Q$, we can naively compute the optimal threshold with the following calculations:

$$t = n - s(Q) + 1 - \max(M) \tag{3.7}$$

$$\text{with } M = \sum_{i=1}^{n-s(Q)+1} I_i \tag{3.8}$$

$$\text{and } I_i(j, u, \ldots, v) = \begin{cases} 1 & \text{if } \exists w \in \{j, u, \ldots, v\} : w - i \in Q \\ 0 & \text{otherwise} \end{cases} \tag{3.9}$$

where $n - s(Q) + 1$ gives the total number of $Q$-grams in a sequence of length $n$. $M$ is a $k$-dimensional array recording how many q-grams are destroyed when mismatches cooccur at the indexed positions. For example, if $k = 2$ and $M(u, v) = 3$, then 3 q-grams are destroyed if the two errors cooccur at positions $u$ and $v$. Thus, we actually only need to store a part of the matrix (the upper triangle in the two-dimensional case, or $\binom{n}{k}$ many entries in general). The maximum entry in $M$ describes the worst case arrangement of $k$ errors and thus gives the maximum number of q-matches that can be destroyed by $k$ errors. The matrices $I_i$ are binary indicator matrices and store the information whether a q-gram starting at position $i$ in the alignment transcript is destroyed if mismatches cooccur at the indexed positions. As each possible combination of $k$ mismatches is inspected, formulae 3.7 to 3.9 describe an exact method for computing $t_0$.
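This exact but naive computation can be written down directly; the following C++ sketch (our illustration) enumerates all placements of k mismatches, counts the destroyed Q-grams per formulae 3.8/3.9, and returns the threshold per formula 3.7:

#include <algorithm>
#include <functional>
#include <string>
#include <vector>

int optimalThreshold(const std::string& shape, int n, int k) {  // shape e.g. "###-#--##"
    int span = (int)shape.size();
    int numGrams = n - span + 1;                  // total number of Q-grams
    std::vector<int> offsets;                     // '#' positions of Q (0-based)
    for (int i = 0; i < span; ++i)
        if (shape[i] == '#') offsets.push_back(i);
    int worst = 0;                                // max(M)
    std::vector<int> pos(k);                      // current mismatch positions
    std::function<void(int, int)> rec = [&](int depth, int start) {
        if (depth == k) {                         // one placement of k mismatches
            int destroyed = 0;
            for (int i = 0; i < numGrams; ++i)    // I_i: Q-gram at i destroyed?
                for (int o : offsets)
                    if (std::binary_search(pos.begin(), pos.end(), i + o)) {
                        ++destroyed; break;
                    }
            worst = std::max(worst, destroyed);
            return;
        }
        for (int p = start; p < n; ++p) { pos[depth] = p; rec(depth + 1, p + 1); }
    };
    rec(0, 0);
    return numGrams - worst;                      // t, formula (3.7)
}
// Example: optimalThreshold("###########", 36, 2) yields 4, matching Lemma 3.1.2.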

With the help of $M$ and $I_i$, we can greedily compute good shapes by iteratively adding positions to a candidate shape $Q$ with fixed span, keeping the arrays $M$ and $I_i$ up to date in each iteration. We test positions from left to right: the leftmost position is inserted if the maximum of the updated array $M$ remains low enough such that a minimum required threshold minT is maintained according to equation 3.7. Otherwise, the position will be a joker position, i.e. not inserted into $Q$, and we move on to the next position. The algorithm is given in pseudocode in the appendix. We run the algorithm on different spans and fix the first and the last position of $Q$. The output is the shape that has 1) highest weight and 2) largest span. We test spans from $\lceil n/(k+1) \rceil + 1$ to $n - k$. With an increasing number of errors, the complexity of this method quickly becomes prohibitive. We thus computed shapes with this method for read lengths up to 50 bp and up to 3 errors only. In special applications of read mapping, such as the partial read mapping tools introduced in the next chapter, the filtration criterion will be based on a seed part of the read, i.e. independent of the entire read length. This seed length is usually relatively small (e.g. between 12 and 30 bp) and supposed to be of high quality, such that values of $k > 3$ are uncommon, making our method usable and valuable in practice.

Observations

The method has a tendency to produce periodic shapes, which happen to be quite frequent among optimally weighted shapes (Kucherov et al., 2005). As a basic evaluation, we tested whether some maximum weight shapes given in the literature could be retrieved by our method (implemented in R).

• Given in (Kucherov et al., 2005): ###-#--###-#--###-# is the optimal weight shape for n = 25, k = 2. Our method returns the same shape and requires < 1 second.

• Given in (Burkhardt and Kärkkäinen, 2001): #####-##---#####-## is the optimal weight shape for n = 50, k = 4. Our method returns the same shape and requires 5.5 min and 4.4 GB of memory.

After these positive results, we additionally checked for all $n = 10, \ldots, 30$ and $k = 1, \ldots, 3$ whether there existed a shape with weight $w(Q_{greedy}) + 1$, where $Q_{greedy}$ is our greedily computed shape. As we exhaustively enumerate all such $(w(Q_{greedy}) + 1)$-shapes, the test becomes prohibitive for larger values of $n$. However, for all cases tested, we found no shape with higher weight than our greedily computed shape. Still, we do know that the method is not optimal, as in some cases it provably fails to detect the heaviest shape for a fixed span. For example, for $n = 20$ and $k = 2$ there exists an 8-shape ###-#--#-#--##, but instead our method returns the 7-shape ###-#--##----# for span 14.

3.4.3 ParamChooser

We make use of the methods for filter sensitivity and lossless shape computation (described in the previous two sections) in our tool ParamChooser. Integrated into RazerS but also usable as a

stand-alone tool, ParamChooser supports two basic tasks:

1. computing loss rates for a given error profile and a given set of shapes, saving results to parameter files

2. choosing a "best" filtering criterion $(Q, t)$ for a given read length $n$, $k$ errors, and a maximum allowed loss rate, by parsing precomputed parameter files

In this context, "best" means those settings that optimize the running time of RazerS while complying with the loss rate. The parameter choosing module in RazerS makes use of the second step only, i.e. it parses precomputed parameter files and chooses the best filtering criterion. The two steps are explained in the following two subsections.

Precomputing Parameter Files for RazerS

We compiled a set of candidate shapes from the literature (Burkhardt and Kärkkäinen, 2001; Li et al., 2003) and by using the greedy algorithm explained in the previous section. These shapes are hard-coded into the ParamChooser source code. Optionally, the user can specify additional shapes through the ParamChooser command line. Using the DP algorithm from section 3.4.1, ParamChooser then computes loss rates for the user-specified read length $n$, allowing an error rate of up to 10% using either Hamming or Levenshtein distance. As a default error profile, uniform error probabilities are assumed. The results are then recorded in parameter files that contain human-readable filtering criterion information (example shown in Figure 3.8). Each entry (row) in these files stores a 5-tuple $(e, Q, t, L, P)$ where $e$ is the number of errors, $Q$ and $t$ are shape and threshold, $L$ is the computed filtering loss rate, and $P$ is the number of potential matches as observed in a simulation run² of RazerS with this filter setting. For precomputing parameter files for RazerS, we used a typical Illumina error profile (Dohm et al., 2008) for 32 bp reads and interpolated for other read lengths³. In the case of Levenshtein distance, insertion/deletion errors received a probability of 0.002, independent of the read base position. Parameter files were produced for all $n = 16, \ldots, 75$, once using Hamming and once using Levenshtein distance.

Choosing Filter Parameters

Given $n$, $k$ and a user-specified loss rate $L_{user}$, ParamChooser parses the corresponding parameter

file and considers among all $(e, Q, t, L, P)$ tuples only those for which $e = k$ and $L \leq L_{user}$. Then the one that is optimal with respect to the following order is chosen: 1) $w(Q)$ is maximized, 2) $P$ is minimized, 3) $t$ is maximized, 4) $s(Q)$ is maximized. Only if $t = 1$, we check whether the shape with the second best weight, i.e. $w(Q) - 1$, can be used with a threshold > 2.

² Simulating 50,000 reads from a simulated 1 Mb genome and mapping them back with RazerS.
³ The standard read length at the time was 32 bp. Interpolation rather than extrapolation was chosen, as advances in sequencing technology producing longer reads would be accompanied by improved quality, i.e. stretching the error profile.

results_N65_L.dat

errors shape t lossrate PM

0  1111111111  55  0       18738
1  1111111111  46  0       41703
2  1111111111  37  0.1399  52567
2  1111111111  36  0       53846
...

Figure 3.8: Example of a parameter file for read length 65 bp (N65 in filename) and edit distance (L for Levenshtein distance). The first column states the number of errors, the second column the shape, column three the threshold and column four the computed loss rate. Column five is used to get an estimate of the filtration efficiency and records the number of potential matches as observed on a simulation run of RazerS with the filter settings of this row.

[Figure 3.9: heatmap of RazerS running times; x-axis: threshold t = 1, ..., 5; y-axis: q-gram weight 10, ..., 14; color: runtime on a log10(sec) scale. The measured runtimes in seconds are:]

weight \ threshold        1        2       3       4       5
10                  24008.8  11600.2  7490.2  5511.0  5033.4
11                   9096.5   5108.8  3904.7  3074.9  2591.7
12                   4154.0   2861.1  2464.4  2002.9  1814.4
13                   2466.8   1876.4  1670.9  1317.2  1203.0
14                   1725.1   1409.1  1262.3  1046.7   935.3

Figure 3.9: Running time of RazerS in seconds using different filtering parameters, measured on a set of 1 M 50 bp reads on human chromosome X. The weight of the q-gram has the greatest influence on running time. When t = 1, it pays off to use a q-gram with weight smaller by one if in that case t ≥ 3 (indicated by arrows).

This selection process is based on the observation that the weight has the greatest influence on RazerS' running time for read lengths and error rates typical for Illumina reads at that time. Figure 3.9 clearly shows that the weight of the q-gram has the largest influence on running time. (Note that q ≤ 14, as RazerS uses a full q-gram index, which would consume too much memory for larger values of q.) Even if t = 1, it does not pay off to use a lower-weight q-gram unless it can be used with t > 2 (indicated by the arrows in Figure 3.9). For thresholds higher than one, we always use the highest-weight q-gram possible.

For read lengths greater than 75, we extrapolate from shorter read lengths. To this end, we choose a setting (n_e, k_e) with n_e as large as possible, such that k_e/n_e ≥ k/n. This ensures that the loss rate for (n_e, k_e) is a lower bound for the loss rate for (n, k). We can then set the threshold value t = ⌊t_e · n/n_e⌋. This is illustrated by an example: if we extrapolate the filtering setting for (100, 6) from (50, 3), we can imagine concatenating two reads of length 50 bp; we can therefore also expect to see at least twice as many q-grams. The extrapolated value for t is not necessarily optimal, as q-grams crossing the concatenation point are not considered. However, as it constitutes a lower bound, we can still guarantee a lower bound on sensitivity.
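A minimal sketch of this extrapolation follows (assuming a list of precomputed settings; names and layout are illustrative, not the actual ParamChooser code):

```cpp
// Pick (n_e, k_e) with n_e maximal such that k_e/n_e >= k/n, then scale the
// threshold: t = floor(t_e * n / n_e). Integer cross-multiplication avoids
// floating point in the rate comparison.
#include <vector>

struct Setting { int n, k, t; };  // precomputed read length, error count, threshold

int extrapolateThreshold(const std::vector<Setting>& precomputed, int n, int k) {
    int bestN = -1, bestT = 1;
    for (const Setting& s : precomputed)
        if ((long)s.k * n >= (long)k * s.n && s.n > bestN) {  // k_e/n_e >= k/n, n_e maximal
            bestN = s.n;
            bestT = (s.t * n) / s.n;                          // t = floor(t_e * n / n_e)
        }
    return bestN > 0 ? bestT : 1;  // fall back to the weakest filter, t = 1
}
```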

3.5 RazerS - Results

We evaluated RazerS in terms of correctness of its mapping sensitivity control and its computational performance in comparison with five other read mapping tools popular at the time of publication: Shrimp, Maq, Soap, Zoom and SeqMap (see Table 3.1). This evaluation section is heavily based on (Weese et al., 2009).

Evaluation Data

We used two real read data sets for the evaluation:

• SRR001815 from the short read archive: Drosophila melanogaster whole genome resequencing, 10,760,364 reads of length 36 bp, Illumina Genome Analyzer.

• SRR006387 from the short read archive: whole genome resequencing of human HapMap individual NA12005, 2 × 7,894,743 76 bp paired-end reads, trimmed to the first 63 bp (in order to be able to include all tools in the comparison), Illumina Genome Analyzer.

In addition, we simulated two read data sets for the evaluation of RazerS’ sensitivity estimation:

• A set of 1,000,000 36 bp reads simulated from the human X chromosome. Replacement errors (mismatches) were introduced according to the Illumina error profile (Dohm et al., 2008) that we used earlier for filtering parameter computation.

• A set of 1,000,000 36 bp reads simulated from the human X chromosome. In addition to replacement errors, insertion/deletion errors were introduced with a probability of 0.002 at each read position.

As reference sequences, we used the human reference genome hg18 (NCBI build 36.3) and the Drosophila melanogaster reference genome (FlyBase Release 5.9), both repeat-masked (Smit et al., 2004).

3.5.1 Evaluation of RazerS’ Mapping Sensitivity Estimation

In order to validate our sensitivity estimation, we compared calculated and observed mapping sensitivities. We used simulated and real 36 bp read data and tested Hamming as well as Levenshtein distance. The evaluation strategy was developed in collaboration with David Weese; the evaluation was scripted and run by the author, and plots were generated by David Weese.

Evaluation Setup

To determine the observed, i.e. empirical, loss rate for a given number of errors k, we first categorize reads according to their true error level. On the simulated sets, we know the number of errors introduced during simulation and can therefore use the simulated error level for categorizing; on the real data, we first run RazerS with lossless filtration parameters (100% sensitivity) and then categorize reads according to the number of errors in their best alignment. We then define the empirical

loss rate at error level k as the number of unmapped reads with k errors divided by the total number of reads with k errors. We run this test for different sensitivity settings, i.e. different filter settings, and inspect the difference between the empirical loss rate and the calculated loss rate. For the simulated read set, the calculated loss rate is computed by the DP algorithm with the error probabilities used during simulation; for the real data set, we use the a posteriori error profile as observed after 100% sensitivity mapping.
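The empirical loss rate itself is a simple ratio; the following is a minimal sketch of its computation (assuming per-read true error counts and mapped flags as inputs; not the actual evaluation script):

```cpp
// Empirical loss rate at error level k: the fraction of reads with k true errors
// that the lossy filter setting failed to map.
#include <cstddef>
#include <vector>

double empiricalLossRate(const std::vector<int>& trueErrors,  // per read: true error count
                         const std::vector<bool>& mapped,     // per read: mapped by lossy setting?
                         int k) {
    long total = 0, unmapped = 0;
    for (std::size_t i = 0; i < trueErrors.size(); ++i) {
        if (trueErrors[i] != k) continue;
        ++total;
        if (!mapped[i]) ++unmapped;
    }
    return total > 0 ? double(unmapped) / double(total) : 0.0;
}
```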

Results

Results are plotted for four test cases (real/simulated 36 bp data, each with Hamming/Levenshtein distance) in Figure 3.10 (results and figures from (Weese et al., 2009)). The dashed lines show the mean relative difference between empirical and calculated loss rates, (1 − empirical loss rate / calculated loss rate), as a means to measure sensitivity estimation accuracy. For Hamming distance we see a very good correlation between empirical and calculated loss rates, on simulated as well as real data sets. We are particularly interested in low loss rates (< 10%) as these will be of most practical relevance. For Hamming distance, the mean relative difference remains very low (below 0.1%) for loss rates < 10% and stays low (< 1%) throughout the whole range. For Levenshtein distance, on the other hand, we observe a more pronounced disagreement, which however still stays below 10% in all cases. The higher disagreement compared to Hamming distance is mainly explained by cross matches that contribute to the q-gram count and thereby lead to more parallelogram counters reaching the threshold. In Hamming distance, parallelograms are single diagonals and thus no cross matches occur. At loss rates < 10%, the mean relative difference is < 4% for simulated data and < 2.8% for real data. For simulated data, there is the additional aspect that reads may be mapped with fewer errors than they were initially simulated with (for example, if an insertion was simulated next to a deletion). In total, we observe that the empirical loss rate correlates well with the calculated one and, most importantly, never exceeds the calculated one, i.e. the calculated loss rate serves as an upper bound on the empirical loss rate (equivalently, the calculated sensitivity is a lower bound on the achieved sensitivity).

3.5.2 Comparison with Other Read Mappers

We evaluate RazerS' performance in terms of running time, space consumption, and number of mapped reads in comparison to five other read mapping tools: Zoom (Lin et al., 2008), Shrimp (Rumble et al., 2009), SeqMap (Jiang and Wong, 2008), Soap (Li et al., 2008b), and Maq (Li et al., 2008a). Again, we evaluate different settings, testing Hamming as well as Levenshtein distance and using the two real read data sets introduced earlier in this section. For additional comparisons on longer simulated reads and real paired-end data, see (Weese et al., 2009).

Evaluation Setup

Three test settings were used: Hamming distance 2 for the 36 bp reads from Drosophila (setting A), Levenshtein distance 2 on the same data set (setting B), and Hamming distance 5 on the human 63 bp reads (setting C). We ran our tests on an AMD Opteron 2.8 GHz machine with 64 GB of RAM.

Figure 3.10: Comparison of empirical and calculated loss rates for varying parameter settings q = 8, ..., 14 and t = 1, ..., 20. The left column shows Hamming distance results, the right column Levenshtein distance results. The first row is on simulated, the second row on real Drosophila reads. The dashed line reflects the mean of the relative differences (1 − empirical loss rate / calculated loss rate) over all calculated loss rates below the loss rate level given on the x-axis. Figure from (Weese et al., 2009).

We set all tools to report only one best match. We carefully set each tool's mapping parameters so as to make the comparison as fair as possible. However, as all tools have slightly different objectives, this is not straightforward in all cases. For example, Shrimp uses Smith-Waterman alignment for verification, which makes it possible to compute Hamming distance alignments (by setting gap penalties very high and using a score cutoff that allows only valid semiglobal matches to be reported), but it cannot exactly compute Levenshtein alignments. Still, we try to emulate Levenshtein distance by setting gap penalties equal to mismatch penalties and requiring a score cutoff that guarantees not to miss valid Levenshtein alignments. Also Maq requires special treatment, as it minimizes the sum of mismatching quality values in its verification, and only the number of mismatches in the 28 bp prefix can be set. For setting A, we set the number of mismatches in the prefix to 2, for setting C to 3 (the maximum possible). We then post-filter

matches with more than 2 errors for setting A and more than 5 errors for setting C. Neither Maq nor Soap supports gapped alignment, i.e. both are only used for the Hamming distance test cases. Even though the Soap manual states 60 bp as the maximum read length, we experienced no problems running it on 63 bp reads (no trimming took place). Zoom has a switch to guarantee 100% sensitivity for Hamming distance, which works for k = 2 but gives an error message for k = 5. However, requiring 100% sensitivity for matches with up to 4 errors works, so we use this for setting C instead. We run RazerS in two sensitivity settings: 100% and 99% sensitivity. Details on program calls are given in the appendix.

Results

The results of running all tools on the three test settings are summarized in Table 3.2. By sacrificing 1% in sensitivity (RazerS100 to RazerS99), we gain a significant speed-up of at least a factor of 2. In setting C (long human reads), speed is even improved by a factor of 13. For setting A (short Drosophila reads), RazerS100 is already fast, with only Zoom being faster. Except for mapping 1M reads in setting A, where Zoom is still slightly faster, RazerS99 is the fastest tool. The other q-gram counting based tool, Shrimp, is always slowest. In general, allowing mismatches only is faster for all tools, with the exception of Shrimp, which computes Smith-Waterman alignments in any case. The running time improvement in RazerS when going from Levenshtein to Hamming distance (in the most extreme case from 163 h in setting B to 10.6 h in setting A) is mainly explained by the use of gapped q-grams, which immensely speed up the filtering phase. In terms of memory consumption, RazerS is comparable to other tools. Setting C with the entire ∼ 8 M reads requires > 4 GB of memory for all tools except Zoom. SeqMap is most

                                  RazerS100   RazerS99      Zoom     Shrimp    SeqMap       Soap              Maq
(A) Dm 36 bp   1M   time (min)         2.13       1.63      1.47       15.3      6.70       9.27             4.10
Hamming             space (GB)         1.31       1.30      0.72       0.68      6.56       0.67             0.60
                    #mapped         505,506    503,595   505,506    505,084   505,059    506,476   503,999 (605K)
               all  time (min)         10.6       5.55      7.80        145      12.8        116             9.68
                    space (GB)         4.10       3.92      3.77       5.80      11.1       0.67             5.36
                    #mapped       5,353,287  5,335,554 5,353,287  5,349,007 5,348,776  5,414,337 5,338,676 (6.5M)
(B) Dm 36 bp   1M   time (min)         12.3       5.92      32.7       13.6      15.5          -                -
Levenshtein         space (GB)         0.48       0.53      0.72       0.68      8.38          -                -
                    #mapped         512,477    511,695   512,139    515,080   512,477          -                -
               all  time (min)          163      68.45       267        146     abort          -                -
                    space (GB)         4.58       4.59      3.77       5.90         -          -                -
                    #mapped       5,431,142  5,424,088 5,427,589  5,486,467         -          -                -
(C) Hs 63 bp   1M   time (h)           3.14       0.40      26.1       10.7      48.8       3.88             2.43
Hamming             space (GB)         1.14       1.86      1.27       6.10      8.10       6.20             0.70
                    #mapped         352,725    351,767   352,617    352,742   349,721    354,020   323,893 (362K)
               all  time (h)           25.4       1.95      45.3      > 3 d     abort       33.8             5.74
                    space (GB)         5.60       6.13      2.89          -         -        6.2             4.38
                    #mapped       3,102,320  3,095,435 3,091,063          -         -  3,133,920 2,817,561 (3.0M)

Table 3.2: Results for mapping 1 M and ∼ 10 M (all) reads of length 36 bp onto the Drosophila genome (Dm) allowing for up to 2 Hamming (A) or Levenshtein (B) errors, and results for mapping 1 M and ∼ 8 M reads of length 63 bp onto the human genome (Hs) allowing for up to 5 Hamming errors (C). Soap and Maq do not support Levenshtein distance. Maq also reports matches with more errors; the total count of mapped reads including reads with more errors than allowed is shown in brackets. Table from (Weese et al., 2009).

space-consuming and aborted when run on the entire read sets, except in setting A. All tools report similar numbers of mapped reads. RazerS99 always recognizes between 99.62% and 99.87% of the matches found by RazerS100. As the sensitivity level applies to the highest number of allowed errors (i.e. RazerS99 has 99% sensitivity for 5-error matches in setting C but is most likely lossless for < 5 errors), the overall sensitivity is higher than 99%. Soap maps the highest number of reads due to its differential treatment of 'N's: it counts a match if 'N' is aligned with 'A', a behavior that could not be switched off. Shrimp also treats 'N's differently and aligns 'N' with 'N', therefore reporting more matches than RazerS100 in setting C. In setting B, Shrimp finds additional matches due to its Smith-Waterman verifier, which cannot be set to report valid Levenshtein matches only. SeqMap does not detect matches that reach into reference regions consisting of 'N's, thereby always reporting fewer matches than RazerS100. Zoom reaches 100% sensitivity in setting A only. It loses matches in setting B, as it allows only one gap; for setting C, Zoom could not be set to 100% sensitivity for matches with 5 errors. As Maq uses a quality-based scoring method that is more permissive in the number of alignment errors, we filtered its output for matches conforming to our maximum error level. This produces numbers more comparable to those of the other tools (the total count of mapped reads is shown in brackets). In conclusion, RazerS compared very well with other read mapping tools at the time. Its fine-grained sensitivity control is a unique feature among tools and gives a good tradeoff between mapping sensitivity and speed. Another important conclusion from this evaluation was that comparing read mapping tools is not a trivial task. Most read mappers have slightly different objectives and different parameters that can be set. For a fair comparison, and in particular for a quantitative evaluation of accuracy, a benchmarking method that compares results to a reference truth is needed.

3.6 Improved Benchmarking of Read Mapping Tools

From the evaluation of RazerS it became clear that measuring the performance of read mapping tools is not a trivial task. In the previous section, we used the number of mapped reads as an indicator of mapping accuracy. This is the measure commonly used to compare read mapping tools (Li et al., 2008b; Jiang and Wong, 2008; Langmead et al., 2009), the tool mapping the highest number of reads being the best. Obviously, this is not a very sophisticated way of comparing tools, as 1) best matches could be missed and suboptimal matches reported instead, and 2) even false positive matches could be reported. Also, for different variants of the read mapping problem (e.g. all or all-best, see the beginning of this chapter, "Semiglobal read mapping"), this measure is insufficient, as it reveals no information about the completeness of the found match set. Ideally, we would like to compare the result of read mapping tools to a gold standard, i.e. the perfect answer a read mapper should find. Therefore, we devised a method to do exactly this: a benchmarking method called Rabema (Holtgrewe et al., 2011). This is mainly work by Manuel Holtgrewe. Own contributions are mostly in the evaluation of results, including a comparison of several read mapping tools (see section 3.6.2). In the following, we first outline the method and then show results.

3.6.1 Rabema - Generating a Gold Standard

The Rabema method computes a gold standard for a set of reads by using RazerS in 100% sensitivity mode. However, some additional considerations and computations are necessary for creating the gold standard and for comparing a set of reported matches to it. First of all, given two matches we need to be able to decide if they are the same. This is surprisingly involved and represents the core difficulty in creating a well-defined benchmarking method. As indicated earlier in Figure 3.4, there can be some ambiguity in match begin and end positions whenever gaps are allowed in the alignment, making it insufficient to simply check whether two matches' begin and end positions are exactly the same. Additionally, it is not straightforward how to treat matches in repeat regions. Repeat regions can lead to many possibilities to align a read, all of which are essentially the same; for example, it is clearly not the goal of read mapping to report every single matching position of the read "AAA" in a long 'A' homopolymer run. We therefore compute match-equivalent intervals within which all matches are defined as equal. By definition, we identify each match by its end position (or interval of end positions) in the reference genome. Given a maximum number of allowed errors k, we define two equivalence classes that we need in order to define such intervals: k-trace-equivalence for the first case (alignment ambiguity, Figure 3.11A), and neighbor-equivalence for the second case (repeat regions, Figure 3.11B). We first introduce trace-equivalence, which we need in order to understand k-trace-equivalence. In simple words, two matches are trace-equivalent if they share a part of their trace. This is the case if their end positions trace back to the same begin position (by definition, the rightmost one). End positions that lie between two trace-equivalent matches that have ≤ k errors are "smoothed", i.e. included in the k-trace-equivalent interval, even if they have more than k errors. Two matches are neighbor-equivalent if their end positions are only separated by end positions that stay below error level k, which is the case in short tandem repeats. With the help of these two equivalence classes, we can finally compute our match-equivalent intervals. Formal definitions and further examples are given in (Holtgrewe et al., 2011). In the end, the gold standard consists of a set of match-equivalent intervals, where each interval is annotated with its error level e ≤ k. The output of a read mapper is then compared to this set of intervals, where depending on the objective either all true matches, all best true matches, any best true match or any true match needs to be detected by the read mapper. The benchmark is also able to identify false positive matches (which in practice are rare). These are matches not contained in any interval, or contained only in an interval with an error level larger than the reported one.

Figure 3.11: The two cases that complicate the match equivalence definition: alignment ambiguity (A) and tandem repeats (B). They are tackled through the definitions of k-trace equivalence and neighbor equivalence. A) The read TATCC can align to the reference TATCG either with a mismatch at the last base (TATCC) or with a gap (TAT-C); such end positions are k-trace equivalent. B) The read ACTACTACT matches the tandem repeat ACTACTACTACTACTACT at several shifted positions; their end positions are neighbor equivalent. Lines represent alignment traces with distance ≤ k.
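To make the comparison against the gold standard concrete, the following sketch checks a single reported match against the match-equivalent intervals (the data layout is assumed for illustration; the actual Rabema implementation in SeqAn differs):

```cpp
// A reported match (identified by its end position and reported error count) is a
// true positive iff some gold-standard interval contains the end position and has
// an annotated error level <= the reported one; otherwise it is a false positive.
#include <vector>

struct GoldInterval { long beginPos, endPos; int errorLevel; };  // match-equivalent interval

bool isTruePositive(const std::vector<GoldInterval>& gold,
                    long matchEndPos, int reportedErrors) {
    for (const GoldInterval& iv : gold)
        if (iv.beginPos <= matchEndPos && matchEndPos <= iv.endPos &&
            iv.errorLevel <= reportedErrors)
            return true;
    return false;
}
```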

3.6.2 Rabema - Evaluation of Read Mapping Tools

In the following we show the results of a sample evaluation on three datasets, comparing mapping sensitivities of four popular read mappers: BWA (Li and Durbin, 2009), Bowtie (Langmead et al., 2009), Soap2 (Li et al., 2009b) and Shrimp2 (David et al., 2011).

Evaluation Data

All data sets are from Drosophila melanogaster, which has a moderately sized genome (∼ 130 Mb).

• read set SRR026674 containing 36 bp Illumina reads.

• read set SRR049254 containing 100 bp Illumina reads.

• read set SRR034673 containing 454 reads of average length 273 bp.

From each set we use 10,000 randomly sampled reads.

Evaluation Setup

On the Illumina reads, we run Bwa in default mode, as advised by its author. Shrimp2 is also run with default settings, except for the use of weighted seeds (advised by the authors). For Bowtie and Soap2, we tested different parameter settings and include in our results the most sensitive ones we found as well as the default settings. For all tools, we set the maximum number of matches to report to 100. On the 454 reads, we only evaluate Bwa (using bwasw) and Shrimp2, because Bowtie and Soap2 do not support gaps and are therefore practically unusable on 454 reads. Details on program calls can be found in the appendix. To measure sensitivity, we use the measure of normalized found intervals: if a read matches in x intervals, each found interval contributes 1/x points, such that by summing up, a read can contribute at most one point in total. We sum up all points of all reads and then divide (normalize) by the total number of reads. Multiplying by 100 gives us the percentage of normalized found intervals. We evaluate the all and the any-best problems.
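A minimal sketch of this measure follows (per-read interval counts assumed as input; not the actual Rabema evaluation code):

```cpp
// Normalized found intervals: each of a read's x gold-standard intervals is worth
// 1/x points when found, so a read contributes at most one point; the point sum is
// normalized by the number of reads and reported as a percentage.
#include <cstddef>
#include <vector>

double normalizedFoundIntervals(const std::vector<int>& totalIntervals,  // x per read
                                const std::vector<int>& foundIntervals)  // found per read
{
    if (totalIntervals.empty()) return 0.0;
    double points = 0.0;
    for (std::size_t i = 0; i < totalIntervals.size(); ++i)
        if (totalIntervals[i] > 0)
            points += double(foundIntervals[i]) / double(totalIntervals[i]);
    return 100.0 * points / double(totalIntervals.size());
}
```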

Results

Sensitivity results for the all and any-best categories are shown in Figure 3.12. Subfigures (A) and (B) show the results for the shortest read length tested (36 bp). Here, all tools achieve high sensitivities, particularly in the any-best category. In the any-best category, the tuned versions of Soap2 and Bowtie (denoted by Soap2* and Bowtie*) perform the same as their default versions. For finding all matches, the tuned Soap2* performs significantly better than the default version Soap2, showing that Soap2's default settings are geared towards returning just one best match. For all tools, we see a slight decrease in sensitivity with increasing error rate. On the longer 100 bp reads (subfigures (C) and (D)), we see a more dramatic decrease in sensitivity. Also, the difference between overall performance in the all compared to the any-best category is more pronounced. Most notably, the default versions of Soap2 and Bowtie experience a sudden drop in sensitivity when the error rate passes 2%. It appears that the default versions are geared to finding matches with at most 2 errors (corresponding to a 2% error rate on the 100 bp reads). The tuned versions are able to achieve much better sensitivities (with a significant increase in running time, around a factor of 3 to 4). Overall, Bwa performs best in the any-best category, while in the all category Shrimp2 remains highly sensitive for all tested error rates. On the 454 reads, we do not see a significant difference between the all and the any-best categories (subfigures (E) and (F)). As the reads are much longer than the Illumina reads, they have fewer mapping locations and are thus less ambiguous, leading to smaller differences between all and any-best. We consistently see Shrimp2 being 10-20 percentage points more sensitive than Bwa. It is worth mentioning, though, that Shrimp2 requires about an order of magnitude more memory and running time. In conclusion, it seems that Soap2 and Bowtie are geared towards short reads (and gapless alignment), while Bwa and Shrimp2 perform well on different read lengths. Bwa is a very sensitive mapping tool, especially well suited for quickly reporting any best match of a read. Shrimp2 is overall the most sensitive at reporting all matches and therefore may be especially interesting in studies where sensitivity is valued over running time. We note that the parameterization of tools is immensely important, as it can make a difference in sensitivity of more than 10 percentage points. This shows the importance of a benchmarking method, which can be used prior to deciding on a tool and a parameter setting for a certain task. Extending Rabema to other objectives in read mapping, e.g. for paired-end reads or for partial read mapping, will be even more involved, but might prove particularly fruitful for specific applications such as structural variant detection.

ALL ANY-BEST A) B)

100 100 als [%] als [%] 95 95

90 90

Illumina 36bp 36bp Illumina 85 85 Bowtie Bwa Soap2 Bowtie* Shrimp2 Soap2* normalized found interv normalized 80 found interv normalized 80 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 error rate [%] error rate [%] C) D)

100 100 als [%] 95 als [%] 95

90 90

85 85

Illumina 100bp 100bp Illumina 80 80 normalized found interv normalized 75 found interv normalized 75 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 error rate [%] error rate [%] E) F) 100 100 90 90 als [%] als [%] 80 80 70 70

454 60 60 50 50 40 40 normalized found interv normalized normalized found interv normalized 30 30 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 error rate [%] error rate [%]

Figure 3.12: Comparison of second-generation read mapping tools using the Rabema benchmarking method. The left column shows results for the all mapping problem, the right column for the any-best problem. Plots in the first two rows are for Illumina reads (36 and 100 bp), and in the last row for 454 reads. Figure from (Holtgrewe et al., 2011).

3.7 Chapter Summary

In this chapter, we

• gave an overview of algorithms and tools for read mapping.

• developed RazerS, a semiglobal read mapping tool based on a q-gram counting approach that supports Hamming as well as edit distance.

• integrated a mapping sensitivity control switch, which is based on a dynamic programming procedure for filter sensitivity calculation and provides a controlled runtime-sensitivity tradeoff.

• derived a heuristic for computing good (heavy) gapped shapes for Hamming distance mapping with 100% sensitivity.

• compared RazerS with other read mapping tools and tested its sensitivity estimation.

• introduced the benchmarking method Rabema, which uses RazerS in 100% sensitivity mode (plus additional calculations) to create a well-defined read mapping gold standard.

• finally gave a sample evaluation of read mapping tools using this gold standard.

We observed that

• q-gram counting is rare among read mapping tools; most tools rely on a pigeonhole-based filtering-and-verification approach or use an (FM-)index backtracking procedure.

• RazerS’ sensitivity estimation method gave a good lower bound on achieved mapping sensi- tivity.

• the speed-up in read mapping was significant even when sacrificing only a little in terms of sensitivity.

• the shape computation heuristic found an optimum-weight shape for all tested cases (up to 50 bp).

• using gapped shapes for Hamming distance mapping led to a significant speed-up compared to edit distance.

• RazerS could efficiently deal with NGS read data and was among the fastest tools, especially in 99% sensitivity mode.

• comparing read mapping tools is not straightforward: all tools had slightly different objectives and parameters.

• comparing read mapping accuracy of different tools is not trivial either, and most publications (including the RazerS publication) simply use the number of mapped reads as a sensitivity estimate.

• creating a benchmarking method is quite involved, as defining when two matches are the same is complicated by alignment ambiguity and repeat sequences.

• many popular read mappers are lossy, especially for matches with more than 2 errors.

From this we conclude that

• RazerS’ sensitivity switch, which is a unique feature among read mapping tools, gives it an effective sensitivity-speed tradeoff.

• RazerS can report all matches, i.e. guarantee 100% sensitivity, for many different settings, including suboptimal matches, which gives it an important advantage over other tools.

• we created the first benchmarking method that takes a more sophisticated approach to comparing the results of read mappers (compared to simply counting mapped reads).

• only in comparison to a gold standard can we quantitatively evaluate the sensitivity of read mapping tools.

Chapter 4

Partial Read Mapping and Applications

Semiglobal read mapping, the focus of the previous chapter, assumes reads to differ only slightly from the reference sequence. Reads that differ too much from the reference, e.g. because they contain or intersect with a genomic variant, will remain unmapped, or in the worst case may even be assigned to a wrong location. However, these are the reads we are especially interested in, as they carry the potential to identify these variants. Accurate variant discovery is of major importance for our understanding of genome evolution and for our power to identify disease-causing variants, ultimately on a diagnostic level (Voelkerding et al., 2009). For variant-spanning reads, we require specialized read mapping types that are tailored to the problem at hand. In addition, the type of read data and the use of the reference sequence can necessitate different read mapping approaches. Figure 4.1 gives an overview of some partial read mapping types with their diverse applications. In prefix mapping, the read alignment starts at the first read base but does not have to extend all the way to the last base. Prefix mapping has applications in small RNA read mapping (Emde et al., 2010) and in SV (especially long insertion) breakpoint discovery (Suzuki et al., 2011; Karakoc et al., 2012). Split read mapping, where the read alignment is split into parts, has become a popular method for detecting SVs. Its main strength lies in the power to locate SVs with base pair precision, which is crucial for downstream analysis of SVs, such as predicting the functional impact on the protein level or studying variant formation processes (Lam et al., 2010). Usually, the read is split into two parts that map in collinear order, interrupted only by a longer gap in the read-to-reference alignment (prefix-suffix mapping). Prefix-suffix mapping has applications in indel detection (Ye et al., 2009; Emde et al., 2012; Karakoc et al., 2012). Another prominent application of split read mapping is in RNA-Seq, for mapping reads crossing splice junctions (Au et al., 2010; Wang et al., 2010). If the read may be split into more than two parts (multi-split mapping) and the collinearity of mapped read parts is not enforced, we have the most generic form of split mapping. Generic split mapping (Trappe, 2012) has the potential to reveal more complex types of SVs such as interchromosomal rearrangements or gene fusions, which are of particular interest in cancer


[Figure 4.1: three panels. A) Prefix mapping: insertion breakpoints in the donor; small RNA/miRNA reads containing adapter. B) Prefix-suffix mapping: deletions in the donor; introns in transcript reads. C) Generic split mapping: translocations; multiple introns.]

Figure 4.1: Overview of the different types of partial read mapping. A) Prefix-based mapping, for example, finds application in insertion breakpoint discovery or in mapping of small RNA reads containing 3' adapter sequence. B) Prefix-suffix mapping can identify reads that span deletions or introns in transcriptomic data. C) The most generic form of split mapping can additionally map reads spanning complex SVs or multiple introns.

studies (Campbell et al., 2008; Maher et al., 2009). Due to the already short read lengths and the partitioning of the read into even smaller parts, split read mapping struggles with even more alignment ambiguity and computational complexity challenges than semiglobal read mapping. Shorter sequence parts produce more random matches, and the more freely and distantly these match parts are allowed to be located on the genome, the more combinatorial explosion we face. So far, split mapping has therefore been used mainly for paired-end reads, where one end is mapped and serves as an anchor, while the other end could not be mapped, presumably because of an SV (Ye et al., 2009). We speak of one-end anchored reads in this case. Through one-end anchoring, we know approximately where the unmapped read should map. Anchoring thus significantly reduces the search space for split mapping, thereby also reducing mapping ambiguity and uncertainty. As read lengths increase with advances in sequencing technology, split read mapping has become, and is still becoming, more and more powerful, and therefore more important as a method for SV detection with base pair precision.

In this chapter we address two methods for partial read mapping that find application in small RNA read mapping and in indel discovery. First, we describe and evaluate the prefix mapping approach implemented in MicroRazerS (Emde et al., 2010) for detection of miRNAs. To our knowledge, MicroRazerS was the first tool to implement a prefix-based mapping approach that makes adapter trimming of small RNA reads unnecessary and efficiently scales to large NGS data sets. MicroRazerS is joint work with Marcel Grunert. Then, we will turn to the main focus of this chapter: split read mapping with SplazerS (Emde et al., 2012) for indel detection. Here, the novelty lies in its support for paired- as well as single-end data and especially its successful application to indel detection in a large-scale single-end resequencing data set.

4.1 Small RNA Read Mapping

Small RNA reads are special in that the sequenced read length may surpass the small RNA length (19-25 bp in the case of miRNA). Sequencing may therefore reach into the sequencing adapter. When mapping small RNA reads onto a reference sequence (usually a genome or a database of known small RNAs), this needs to be taken into account.

4.1.1 Strategies for Mapping Small RNA Reads

Mapping small RNA reads either requires the adapter sequence to be removed first, such that traditional semiglobal read mapping can be applied, or involves an alignment method that can directly handle the adapter sequence. Adapter removal is far from trivial (Wang et al., 2009a; Kong, 2011), as the sequenced adapter may itself contain sequencing errors, all the more because sequencing qualities degrade toward the 3' end where the adapter is located. Unrecognized adapter sequence will prevent the read from mapping semiglobally. Instead of removing adapter sequence, early miRNA studies (Friedländer et al., 2008; Morin et al., 2008) therefore used the local alignment tool Mega BLAST (Zhang et al., 2000) to map small RNA reads, rigorously filtering its output for matches adhering to certain criteria. These criteria include a certain minimum match length and a maximum number of errors; additionally, the match is required to start at the first read base. Building on these strategies, we developed MicroRazerS, which uses q-gram counting for finding a high-quality prefix seed match, which is then extended until the first mismatch is encountered. Adapter sequence does not have to be trimmed, thereby avoiding problems that come from undertrimming (an undertrimmed read cannot be mapped semiglobally) or overtrimming (an overtrimmed read may not map uniquely). The algorithm is explained in the following.

4.1.2 MicroRazerS - Algorithm

MicroRazerS is joint work with Marcel Grunert. Marcel Grunert was mainly responsible for tool evaluation, while own work was mainly in tool implementation. Both contributed equally through discussion and manuscript preparation.

Given a minimum match length m and a number of allowed mismatches k, we define a valid prefix match of read r as an alignment a_{r_{1,x}} for which the following holds:

1. m ≤ x ≤ |r|

2. d(a_{r_{1,m}}) ≤ k

3. d(a_{r_{m+1,x}}) = 0

4. d(a_{r_{x+1}}) = 1

In other words, at least the first m bases of r need to be aligned (1). The m-prefix alignment (which we call the seed from now on) is allowed to have at most k mismatches (2). Outside the seed, the alignment is required to be error-free (3), and the alignment needs to be maximally extended (4). Note that MicroRazerS only supports Hamming distance, and at most one mismatch is allowed in the seed, i.e. k ≤ 1. If multiple such matches exist, they are ranked according to 1) the number of mismatches in the seed (the lower the better) and 2) the total alignment length x (the longer the better). All best matches, or up to a user-defined maximum number of matches, are reported. MicroRazerS' core algorithm is very similar to RazerS (see pseudocode in the appendix) and the implementation reuses many parts of the RazerS code (mostly implemented by David Weese). An overview is given in Figure 4.2.

Parameter Choosing and Filtering

Like RazerS, MicroRazerS makes use of the parameter choosing functionality of ParamChooser (see section 3.4.3). In contrast to RazerS, the parameter choosing part only considers the seed of length m, independent of total read length. Parameter files were computed using a uniform error profile, the rationale being that the seed part (usually 16-20 bp) should be high quality with essentially equal sequencing error probabilities. Given seed length m and k mismatches, the corresponding parameter files are searched for the best filter criterion. The q-gram index is built over all q-grams contained in the seed part of the reads. Filtering then proceeds in the same way as in RazerS, scanning over the genome sequence with the SWIFT filtering method (described in section 3.3.1).

Verification

We use the simple diagonal scanning approach as in RazerS (section 3.3.2). A seed match is found if the prefix of length m matches the reference sequence with at most k mismatches. Multiple seed matches may exist per parallelogram. A best seed match is one that has the minimum number of mismatches among all seed matches in the current parallelogram. Each best seed match is extended until the first mismatch is encountered. The longest total match is returned. In case of ambiguity, the first such match found is returned.
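The verification step can be sketched as follows (a simplified stand-alone illustration of the Hamming-distance definition above, not the actual SeqAn implementation):

```cpp
// Verify a candidate prefix match of read r at genome position g0: at most k
// mismatches in the m-prefix seed (conditions 1-2), then error-free extension
// until the first mismatch (conditions 3-4).
#include <cstddef>
#include <optional>
#include <string>

struct PrefixMatch { int seedErrors; int matchLength; };  // mismatches in seed; total length x

std::optional<PrefixMatch> verifyPrefixMatch(const std::string& genome, std::size_t g0,
                                             const std::string& read, int m, int k) {
    if (g0 + read.size() > genome.size()) return std::nullopt;  // simplified bounds check
    int seedErrors = 0;
    for (int i = 0; i < m; ++i)                        // seed: at most k mismatches
        if (genome[g0 + i] != read[i] && ++seedErrors > k) return std::nullopt;
    int x = m;                                         // extend error-free past the seed
    while (x < (int)read.size() && genome[g0 + x] == read[x]) ++x;
    return PrefixMatch{seedErrors, x};                 // maximal: stops at first mismatch
}
```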

Figure 4.2: Overview of the MicroRazerS algorithm. In the filtering phase, a q-gram index is built over all prefix q-grams and the reference sequence is scanned for potential prefix seed matches. These are then verified and extended during match verification.

4.1.3 MicroRazerS - Evaluation and Comparison with Other Tools

In the following, we provide our evaluation of MicroRazerS in comparison to other mapping tools, as in (Emde et al., 2010).

Evaluation Data

Small RNAs were isolated from human normal heart RNA and prepared for Illumina sequencing. Sequencing of the small RNA library yielded 9,286,222 36 bp reads. Small RNA read data sets are highly redundant; we therefore first sorted all reads and removed redundant sequences (keeping a count of how many times each read sequence occurred). In this way, the ∼ 9 M reads were reduced to 2,402,361 unique read sequences. As reference sequence we use the human reference genome hg18 (NCBI build 36.1).

Evaluation Setup

We mapped the ∼ 2.4 M unique reads to the reference genome using Mega BLAST, Soap2, Bowtie and MicroRazerS. For MicroRazerS, we set the seed length to 16, not allowing any errors. We set 56 CHAPTER 4. PARTIAL READ MAPPING AND APPLICATIONS

the maximum number of matches to report to 20 (miRNAs can appear multiple times in the genome). Reads having more than 20 matches were discarded. Soap2 also allows setting a seed length, which was accordingly set to 16 bp. As Soap2 and Bowtie compute semiglobal alignments, we allow 20 mismatches (read length minus seed length) to guarantee finding all 16 bp prefix matches regardless of the rest of the read. Mega BLAST computes local alignments and was run in default mode. Details on program calls are given in the appendix. The output of Mega BLAST, Soap2, and Bowtie was filtered to get the best (longest) prefix matches. Reads with more than 20 such alignments were discarded. Of all mapped reads, we count how many map to sequence annotated as miRNA according to miRBase (Release 13.0) (Griffiths-Jones et al., 2008). Additionally, we count the number of different miRNAs that have a read count of more than 150 (as miRNAs detected with high confidence). All running times were measured on an AMD Opteron 2384 with 32 GB memory running a 64-bit Linux system.

Results

Our comparison of MicroRazerS with Mega BLAST, Soap2 and Bowtie is summarized in Table 4.1. Mega BLAST is an order of magnitude slower than the other tools. Soap2 and Bowtie are very fast but require an index construction preprocessing step that is computationally expensive. Bowtie's index construction takes even more time than the Mega BLAST search. The most space-consuming tool is Soap2, with more than 8 GB of memory usage. The output file size is huge (> 6 GB) for BLAST and Soap2, and filtering of their results to meet the prefix match criteria requires an additional, time-consuming post-processing step (∼ 30 min using Perl scripting). Of note, MicroRazerS provides the most convenient handling of small RNA reads, as no pre- or post-processing of reads is necessary. Of all programs, MicroRazerS is able to map the highest number of reads. As argued in the previous chapter, simply counting the number of mapped reads is not the best measure for sensitivity. However, MicroRazerS also yields the highest number of detected miRNAs together with Soap2, indicating highest sensitivity.

                              MicroRazerS      BLAST      SOAP2     Bowtie
Running time [min]                     24        194          6          5
Building index [min]                    -          -         84        206
Output size [GB]                      0.1        8.6        6.8        0.7
Memory usage [GB]                     3.4        1.4        8.3        2.3
Unique sequences aligned        1,319,218    891,215  1,318,504  1,184,590
Mappable reads                  7,743,516  7,001,832  7,742,266  7,410,239
Reads annotated as miRNA        5,819,189  5,746,588  5,819,184  5,667,027
MiRNAs                                381        372        381        372
MiRNAs (read count > 150)             101         96        101         99

Table 4.1: Evaluation of small RNA mapping tools. We used a query dataset of ∼2.4M non-redundant read sequences (length 36 bp) representing a total of ∼9.3M reads. 4.2. OVERVIEW OF SPLIT READ MAPPING TOOLS 57

Furthermore, Cordero et al. (2012) evaluated MicroRazerS and other, more recent mapping tools specialized in, or providing a specific set of parameters for, small RNA read mapping. Their results on miRNA spike-in data "clearly indicate that SHRiMP and MicroRazerS provide the best miRNA detection rate" (Cordero et al., 2012). According to Table 2 in their article, MicroRazerS is also the fastest tool in the comparison.

4.2 Overview of Split Read Mapping Tools

Split read alignments are alignments of reads split into two or more parts. As introduced in section 2.2.4, split read methods are among the most popular approaches for discovering SVs using sequencing data. Usually, only reads that are unmapped, but anchored by a confidently mapped paired read, are considered; this technique is called anchored split-read mapping. Compared to read pair or read depth methods (see section 2.2.4), the strength of split-read methods lies in the potential to identify SVs at base pair resolution.

Ye et al. (2009) published the first tool, Pindel, for split mapping of short NGS reads. It requires reads to be in paired-end layout, performing anchored split mapping only. In its initial version, it only considered collinear prefix-suffix matches for insertion and deletion detection; later versions have been extended to also detect inversions and tandem duplications. Pindel is based on a pattern growth algorithm and requires prefix and suffix matches to be unique in order to be combined into a split read match. Two more generic read mapping tools that also provide split read mapping functionality are GSNAP (Wu and Nacu, 2010) and BWA (Li and Durbin, 2009). GSNAP indexes every third 12-gram in the genome and then processes each read one by one. Its filtering strategy combines the pigeonhole principle with a q-gram counting approach and guarantees full sensitivity only for matches containing an exact match of at least 14 contiguous base pairs. BWA, introduced in chapter 3, uses a heuristic approach that can allow one long gap in its FM-index backtracking procedure. Neither GSNAP nor BWA has indel calling functionality integrated; both output collinear prefix-suffix matches, which can then be used as input for an indel calling tool. Moreover, neither tool provides direct support for anchored split mapping. We thus developed our own split read mapping tool, SplazerS, which will be described in detail in the following section. Its main advantage over other tools (at that time and still) is that it has an anchored/paired-end mode as well as an unanchored/single-end mode. Especially with increasing read lengths, single-end split read mapping is becoming more valuable, as we will see later. Two recent methods that were published around the same time as SplazerS are Splitread (Karakoc et al., 2012) and SVseq2 (Zhang et al., 2012). Splitread performs an anchored split read mapping step, artificially partitioning anchored reads into balanced and unbalanced splits. A maximum parsimony approach is taken to find the minimum number of breakpoint events indicated by the split reads, using a set cover approximation algorithm. Indels are required to have at least one balanced split read, as these are more trustworthy than unbalanced ones. SVseq2 is to our knowledge the

first published tool to combine split read mapping with the read pair method. First, discordant read pairs identify candidate deletions, which are then called if also supported by split reads. Further methods such as SpliceMap (Au et al., 2010), MapSplice (Wang et al., 2010), or SplitSeek (Ameur et al., 2010) follow similar approaches but are geared toward splice junction discovery in RNA-Seq data. Typical drawbacks of these methods for indel detection are a lack of functionality for insertion detection, or their requirement for donor/acceptor splicing patterns. In the following, we introduce our own SeqAn tool SplazerS in detail.

4.3 SplazerS - Split Read Mapping for Indel Detection

SplazerS (Emde et al., 2012) is, like RazerS and MicroRazerS, implemented in the SeqAn library. It detects collinear prefix-suffix matches and supports anchored as well as unanchored split read mapping. In anchored mode, it will only detect split matches of one-end anchored reads within the region defined by the mapped anchoring read. In the more general case, split matches of unanchored/single-end reads anywhere in the reference sequence are reported. The input reads may be of arbitrary length, and both Hamming and Levenshtein distance are supported. First, we define what a valid split match is in SplazerS. Then, we turn to the algorithmic details of how these matches are found. Finally, we evaluate SplazerS' performance in indel detection by using its output in conjunction with a simple indel calling strategy. The remainder of this chapter is heavily based on and in part taken from (Emde et al., 2012).

4.3.1 Split Match Definition

Let p be a (non-empty) prefix of read r that aligns to genome subsequence g_{i,j}, and let s be a (non-empty) suffix of r that aligns to g_{u,v} (see Figure 4.3). W.l.o.g. we assume our reads to match only the forward strand, and define a collinear prefix-suffix read alignment as an alignment where the following holds:

1. |p| + |s| = |r| and j + 1 < u (read indicates a deletion), or

2. |p| + |s| < |r| and j + 1 = u (read indicates an insertion),

where 1 ≤ i ≤ j < u ≤ v ≤ |g|. In order for the split read alignment to be a valid prefix-suffix match, SplazerS employs a number of additional criteria. Given a minimum match length m, a maximum error rate ε, a maximum number of prefix (suffix) errors e_p (e_s), and a maximum gap length δ, SplazerS reports collinear prefix-suffix matches where the following holds:

1. |p| ≥ m and |s| ≥ m

2. d(a_{p_{1,m}}) ≤ e_p and d(a_{s_{|s|−m+1,|s|}}) ≤ e_s

3. (d(a_p) + d(a_s)) / (|p| + |s|) ≤ ε

4. u − j − 1 ≤ δ

Figure 4.3: Two examples of split read alignments, given parameters m = 7, e_p = e_s = 1 and ε = 0.1. a) is a valid alignment spanning a deletion; b) spans an insertion but is not a valid match, as the error rate condition is violated.

The first condition ensures that prefix p and suffix s have at least the minimum match length m. The second condition ensures that the number of errors in the minimum-length prefix (suffix) match is at most a certain maximum number e_p (e_s). Condition 3 guarantees that the sum of the numbers of errors in the prefix and suffix match, divided by the combined length of prefix and suffix, lies within the allowed error rate ε. Note that we use an error rate instead of a number of errors k to make the number of allowed errors depend on the actually aligned read length |p| + |s|. The maximum gap length δ puts a constraint on the distance between prefix and suffix match. An example of a valid and an invalid split read match is given in Figure 4.3. For valid anchored split read matches, we further restrict the genomic region that the read is allowed to map to. This region is defined by the location of the confidently mapped mate and the expected insert size (e.g. insert size ± 3 standard deviations).
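The four conditions translate directly into a validity check, sketched here under the notation above (parameter and variable names are illustrative, not SplazerS' actual interface):

```cpp
// Check conditions 1-4 for a candidate collinear prefix-suffix match.
bool isValidSplitMatch(int pLen, int sLen,          // |p|, |s|
                       int pSeedErr, int sSeedErr,  // errors in the m-length seed parts
                       int pErr, int sErr,          // d(a_p), d(a_s)
                       int gapLen,                  // u - j - 1
                       int m, int ep, int es, double epsilon, int delta) {
    if (pLen < m || sLen < m) return false;                    // 1) minimum match length
    if (pSeedErr > ep || sSeedErr > es) return false;          // 2) seed error bounds
    if (pErr + sErr > epsilon * (pLen + sLen)) return false;   // 3) error rate epsilon
    return gapLen <= delta;                                    // 4) maximum gap length
}
```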

If there are multiple split matches for a read, SplazerS ranks them according to a score. Each match with a middle gap of length ≥ 1 receives the score sc = |p| + |s| − 2 · (d(a_p) + d(a_s)) − c(r), while matches without a middle gap, i.e. matches that map normally and do not indicate an indel event, receive the score sc = |p| + |s| − 2 · (d(a_p) + d(a_s)). The parameter c(r) puts a penalty on the existence of a gap, independent of its length, and is set depending on the average error probability of the reads (by default c(r) = ⌊0.03 · |r|⌋). Matches with the same score are ranked according to the length of the middle gap (the shorter the better). We define a read match as unique if it is the single match with the highest score. Multiple best and also suboptimal matches can be reported. Depending on total read length, minimum match length and the numbers of errors allowed on prefix and suffix match, the probability of a random match can get quite high, especially for unanchored split read mapping, where the whole reference sequence is searched. We therefore provide an estimate of the expected number of random matches given a split match validity criterion, which can help the user choose sensible parameters. For the calculations, see the appendix.
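The ranking score is a direct transcription of the formulas above; a minimal sketch (naming is illustrative):

```cpp
// Score of a split match: aligned length minus twice the errors, with an extra
// gap-existence penalty c(r) = floor(0.03 * |r|) if a middle gap is present.
int splitMatchScore(int pLen, int sLen, int pErr, int sErr, int gapLen, int readLen) {
    int score = pLen + sLen - 2 * (pErr + sErr);
    if (gapLen >= 1)
        score -= (3 * readLen) / 100;  // default c(r) = floor(0.03 * |r|)
    return score;
}
```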

4.3.2 Algorithm

Figure 4.4 shows an overview of the SplazerS algorithm. It is based on the same core algorithm as RazerS (see appendix for pseudocode). SplazerS uses two filters in the filtering phase, one identifying potential prefix matches, the other potential suffix matches. These are verified in the

match verification phase and combined into split matches during match combination. The steps will be explained in detail in the next subsections.

Filtering

Like all tools in the RazerS family, SplazerS is based on a SWIFT filter (section 3.3.1). More precisely, it is based on two filters (as originally implemented by David Weese in RazerS' paired-end module). The left filter detects potential prefix matches, while the right filter detects potential suffix matches. Parameters for filtering are chosen by ParamChooser (section 3.4.3). Assuming a uniform error profile here, we estimate the lower bound on sensitivity by the product of the sensitivity rates of the left and the right filter, as these operate independently of each other. This is just an approximation, as it does not take valid split match transcript frequencies into account.

It is therefore just an estimate for the sensitivity of detecting an m-prefix match with e_p errors and an m-suffix match with e_s errors at the same time. For the two filters, we need two q-gram indices: the left index containing all q-grams of all read prefixes of length m, and the right index containing all q-grams of all read suffixes of length m. We start scanning the reference sequence with the right filter until we encounter a potential suffix match of a read r. We let the left filter follow up to the current position of the right index, recording all potential prefix matches within the allowed distance δ in a queue. If there is at least one such potential prefix match for read r, this triggers the verification of the potential suffix match of r. Verification proceeds as described in the next paragraph. Only if the verification returns a valid suffix match are all potential prefix matches also subjected to verification. In order to avoid having to verify a potential prefix match more than once, the queue keeps track of the verification status of each potential match. Potential prefix matches outside the allowed region, as defined by δ, are removed from the queue.
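The coordination of the two filters can be sketched as follows (heavily simplified; the actual implementation additionally tracks per-read verification status, as described above):

```cpp
// Maintain a queue of potential prefix match candidates; a potential suffix match
// at suffixPos triggers verification only if a queued prefix candidate lies within
// distance delta. Candidates that fall out of the window are discarded.
#include <deque>

struct Candidate { long genomePos; bool verified = false; bool valid = false; };

bool hasPrefixCandidateInRange(std::deque<Candidate>& queue, long suffixPos, long delta) {
    while (!queue.empty() && queue.front().genomePos + delta < suffixPos)
        queue.pop_front();   // prefix candidate too far left of the suffix match
    return !queue.empty();   // at least one candidate within the allowed distance
}
```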

Match Verification

We will describe match verification for prefix matches only, as the procedure transfers directly to suffix match verification by reversing read and reference subsequence. Our goal is to identify the longest prefix match, the extended prefix match, that contains up to k = ⌊|r| · ε⌋ errors. This is the maximum number of errors allowed on the whole split read match. As prefix and suffix match verification are independent of each other, we have to assume the worst-case error scenario. In the worst case for the prefix, the suffix match is error-free, such that k errors are allowed on the prefix match (maintaining the condition of at most e_p errors on the prefix of length m). A valid extended prefix match is thus a match a_{r_{1,z}} with m ≤ z ≤ |r| − m where d(a_{r_{1,m}}) ≤ e_p and d(a_{r_{1,z}}) ≤ k. The longest possible read prefix match length is |r| − m, as a suffix of at least length m still needs to match. For Hamming distance, we use the diagonal scanning approach (section 3.1.3) and search for the longest extended prefix match that maintains the above criteria. In case of ambiguity, the first such match found is recorded. For Levenshtein distance, we first use Myers' bit-vector algorithm to find an m-prefix match a_{r_{1,m}} with at most e_p errors. This match is then extended using gapped

Figure 4.4: Overview of the SplazerS algorithm. During filtering, the reference sequence is scanned with two q-gram indices: the left index storing prefix q-grams and the right index storing suffix q-grams. Whenever a potential prefix and a potential suffix match are found within the allowed distance δ, the suffix match is verified first; if it is verified positively, the prefix match is also subjected to verification. Match verification relies on a seed-and-extend approach, first verifying the minimum-length prefix (suffix) and then extending maximally to the right (left). During match combination, compatible extended prefix and suffix matches are combined into a split match and the optimal breakpoint position is located.

Figure 4.5: Extended prefix and suffix match overlap A) on the read sequence if the read spans a deletion, B) on the genome sequence if the read spans an insertion.

X-drop extension (Zhang et al., 1999), allowing a score drop of at most k − d(a_{r_{1,m}}), where we use a Levenshtein distance scoring scheme (mismatch = gap = −1, match = 0). The extension stops once the additionally collected errors would exceed k − d(a_{r_{1,m}}), or when read position |r| − m is reached. Again, in case of ambiguity, the first such match found is returned.

Breakpoint Computation

We now want to combine an extended prefix match a_p^{g_{i,j}} and an extended suffix match a_s^{g_{u,v}}, where p is a prefix and s a suffix of read r. As a_p^{g_{i,j}} and a_s^{g_{u,v}} have been extended as far as possible, collecting up to k errors each, they will overlap (as shown in Figure 4.5) if they can be combined into a valid split read match. First, we check whether the basic criteria from section 4.3.1 can be fulfilled. We use the function d_g(x) to denote the maximum number of allowed gaps given x errors; we then have d_g(k) = k for Levenshtein distance and d_g(k) = 0 for Hamming distance. We know that a_p^{g_{i,j}} and a_s^{g_{u,v}} cannot be combined into a valid split match if one of the following holds (a code sketch follows the list):

1. v − i > δ + dg(k) (matches are too far apart, maximum gap length δ cannot be met)

2. v − i < 2 · m − dg(ep) − dg(es) (matches are too close together, minimum prefix and suffix lengths m cannot be met)

3. v − i ≥ |r| + dg(k) and |p| + |s| < |r| (r would indicate a deletion, but matches do not touch with respect to read sequence, maximum error criterion cannot be met)

4. $v - i < |r| - d_g(k)$ and $j + 1 < u$ (r would indicate an insertion, but matches do not touch with respect to the genome sequence, maximum error criterion cannot be met)
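The following sketch mirrors conditions (1)-(4) exactly as stated (struct and function names are ours, not SplazerS' API):

```cpp
// Genomic intervals: prefix match covers [i, j], suffix match [u, v];
// lenP = |p|, lenS = |s|. dgK, dgEp, dgEs are d_g(k), d_g(e_p), d_g(e_s)
// (equal to k, e_p, e_s for Levenshtein distance, 0 for Hamming distance).
struct SplitCandidate {
    long i, j, u, v;
    long lenP, lenS;
};

bool maybeCombinable(const SplitCandidate& c, long readLen, long delta,
                     long m, long dgK, long dgEp, long dgEs)
{
    long span = c.v - c.i;
    if (span > delta + dgK) return false;               // (1) too far apart
    if (span < 2 * m - dgEp - dgEs) return false;       // (2) too close together
    if (span >= readLen + dgK &&
        c.lenP + c.lenS < readLen) return false;        // (3) deletion, but matches
                                                        //     do not touch on the read
    if (span < readLen - dgK &&
        c.j + 1 < c.u) return false;                    // (4) insertion, but matches
                                                        //     do not touch on the genome
    return true;                                        // candidate survives all tests
}
```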

In the following, we explain breakpoint computation for the case of a potential deletion. In the case of a potential insertion, we have the analogous situation with r and g switched (an insertion in r corresponds to a deletion in g). Following condition (3) above, the matches overlap on the read sequence, i.e. $|p| + |s| > |r|$, such that we need to resolve the overlapping read part. The read overlap begins at read position $|r| - |s| + 1$ and ends in position $|p|$, so our goal is to resolve the alignment of subsequence $r_{|r|-|s|+1,|p|}$. The non-ambiguous alignment parts $a_{r_{1,|r|-|s|}}^{g_{i,j'}}$ and $a_{r_{|p|+1,|r|}}^{g_{u',v}}$ are fixed. We denote with $j'$ and $u'$ the genomic positions aligned with read positions $|r| - |s|$ and $|p| + 1$, respectively.


Figure 4.6: Example for the breakpoint computation of a deletion-spanning read. The fixed prefix and suffix alignment parts are represented by solid lines (where the prefix and suffix of minimum length m are indicated by dark gray background) and the dashed lines show the overlapping part that is resolved through a banded alignment procedure (light gray). Red squares represent cells with optimal scores per row.

To compute the optimal breakpoint position for Hamming distance, we again use a simple scanning approach. By scanning along the overlapping part of the prefix and suffix alignment, we find the read position x with $|r| - |s| + 1 \le x \le |p|$ where $d(a_{r_{1,x}}^{g_{i,f}}) + d(a_{r_{x+1,|r|}}^{g_{h,v}})$ is minimal. We denote with f and h the genomic positions aligned with read positions x and x + 1, respectively. In case of ambiguity, we choose the leftmost position x, which is the common choice in variant detection methods.

For Levenshtein distance, the computation is more complex. Figure 4.6 illustrates the breakpoint computation procedure with an example. It is not sufficient to scan the existing prefix/suffix alignments, as these are not necessarily optimal with respect to the optimal breakpoint position. Instead, we compute two global alignment matrices: $F^p$ for $r_{|r|-|s|+1,|p|}$ with $g_{j',j}$ and $F^s$ for $r_{|r|-|s|+1,|p|}$ with $g_{u,u'}$, using a Levenshtein distance scoring scheme (mismatch = gap = −1, match = 0). Both alignment matrix computations can be banded by $e = k - d(a_{r_{1,|r|-|s|}}^{g_{i,j'}}) - d(a_{r_{|p|+1,|r|}}^{g_{u',v}})$, as at most k errors are allowed in total and $d(a_{r_{1,|r|-|s|}}^{g_{i,j'}}) + d(a_{r_{|p|+1,|r|}}^{g_{u',v}})$ errors have already been consumed by the fixed alignment parts.

We are interested in locating the combination of prefix and suffix match, i.e. the read position x, where the sum of scores in $F^p$ and $F^s$ is optimal. Note that suffix sequences are actually reversed to compute scores for the alignment starting in read position $|p| + 1$ and genome position $u' + 1$, which we want to extend to the left. For easier notation, however, we use absolute read and reference positions here. Thus, we are searching for the read position x and genome positions z and z' where $F^p_{x,z} + F^s_{x+1,z'}$ is optimal. In order to find this optimal breakpoint, we could keep the whole matrices and for each row (i.e. each read position x) check for the best combination of cells from $F^p$ and $F^s$. This would require $o \cdot (2e + 1)^2$ comparisons, where o is the number of overlapping read positions and $2e + 1$ is the bandwidth. However, as we are only interested in optimal combinations, i.e. those with maximal score, it suffices to store the optimal score value per row. We therefore keep two vectors $f^p$ and $f^s$ that store the best score for each row in $F^p$ and $F^s$, respectively.

In addition, we store in $b^p$ and $b^s$ the corresponding genome position for each entry in $f^p$ and $f^s$, respectively. If there is more than one cell with maximal score, we keep the leftmost (for the suffix: rightmost) position for technical reasons. This choice effectively maximizes the middle gap length in the case of ambiguity. Finding the read position x that maximizes $f^p_x + f^s_{x+1}$ then gives the optimal score, and the corresponding entries $b^p_x$ and $b^s_{x+1}$ provide the corresponding genome positions z and z'. The number of combinations that we need to check is thus reduced to o. As for Hamming distance, there may be multiple such optimal breakpoint positions, and we keep the leftmost one by convention.
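The per-row combination step can be sketched as follows (assuming the banded matrices have already been computed and reduced to their per-row optima; all names are illustrative):

```cpp
#include <climits>
#include <cstddef>
#include <vector>

struct Breakpoint {
    std::size_t x;   // read position: prefix ends at x, suffix starts at x+1
    long z, zPrime;  // corresponding genome positions z and z'
    int score;
};

// fp[x]/fs[x] hold the best score of row x of F^p/F^s; bp[x]/bs[x] hold the
// genome position of that best cell. Rows are indexed over the overlapping
// read part; all four vectors have the same length o.
Breakpoint bestBreakpoint(const std::vector<int>& fp, const std::vector<int>& fs,
                          const std::vector<long>& bp, const std::vector<long>& bs)
{
    Breakpoint best{0, 0, 0, INT_MIN};
    for (std::size_t x = 0; x + 1 < fp.size(); ++x) {
        int s = fp[x] + fs[x + 1];       // combine prefix row x with suffix row x+1
        if (s > best.score)              // strict '>' keeps the leftmost optimum
            best = {x, bp[x], bs[x + 1], s};
    }
    return best;                         // o combinations instead of o*(2e+1)^2
}
```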

Remarks: Note that if $|r| - k \le v - i \le |r| + k$, the prefix alignment start i and the suffix alignment end v lie within the k-band of a semiglobal read alignment. If $v - i < |r|$, we are more likely to be dealing with a potential insertion; in case of $v - i > |r|$, a deletion is more likely. However, we cannot be sure, so we actually need to test both cases, insertion and deletion, and then store the one with the better score. Also note that by choosing to store the outermost position of best score in $b^p$ and $b^s$, we favor insertions over mismatches immediately next to the breakpoint position. Thus we may have a 1 bp insertion next to a 100 bp deletion, rather than a mismatch next to a 99 bp deletion. This seems reasonable, as larger variants are often accompanied by micro-indels. However, this behavior is a matter of definition and can easily be changed during match output.

Our breakpoint computation method is very similar to a method developed independently around the same time. The AGE method (Abyzov and Gerstein, 2011) also computes two alignment matrices, but uses a Smith-Waterman-like scoring scheme (rewarding matches). Instead of storing the best score value per row, it computes two additional matrices that store the maximum of the leading/trailing submatrix. It can therefore accommodate combinations of SVs (for example a large deletion next to a large insertion), but does not necessarily guarantee unambiguous breakpoint placement.

4.4 SplazerS - Results

We extensively tested SplazerS and its ability to correctly detect indels in paired-end as well as single-end data. To this end, we use SplazerS in conjunction with a simple indel calling method (using the SeqAn tool SnpStore, which will be the subject of the next chapter). The indel calling procedure considers each pair of reference position and indel length observed in the split-mapped reads as an indel candidate. All reads that overlap by a certain minimum number of base pairs with the indel candidate are considered spanning reads. Two thresholds are then checked for each indel candidate: a minimum number and a minimum percentage of spanning reads that are required to support it. In our analysis we will always require at least two or three supporting reads, constituting at least 25% or 50% of the spanning reads.
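A minimal sketch of this two-threshold check (function and parameter names are hypothetical, not SnpStore's interface):

```cpp
// Two-threshold indel candidate check: require an absolute minimum number
// of supporting reads and a minimum fraction of all spanning reads.
bool callIndel(unsigned supporting, unsigned spanning,
               unsigned minSupport,   // e.g. 2 or 3
               double minFraction)    // e.g. 0.25 or 0.5
{
    return spanning > 0 &&
           supporting >= minSupport &&
           double(supporting) / double(spanning) >= minFraction;
}
```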

SplazerS Parameter Setup

There are several parameters in SplazerS that can be set. In most cases, the choice of the error rate $\varepsilon$, and therefore also of $e_p$ and $e_s$, is rather straightforward, depending mainly on the error rate and error pattern specific to the sequencing technology used and/or on the relatedness of sample and reference genome. The choice of parameters m and δ is a matter of tradeoff between sensitivity and specificity, and practically mainly a matter of runtime.

In the following results we set $\varepsilon = 0.05$ and $e_p = e_s = 1$ unless stated otherwise. For the minimum match length m, we use values between 14 bp (anchored) and 23 bp (unanchored). The distance parameter δ will be set to values between 5,000 and 50,000.

Indel Comparison

Once we have a predicted set of indel variants, we measure its quality by comparing it to a reference set of variants. For this purpose, we implemented a SeqAn tool called variantComp (see appendix). Comparing a predicted indel with a reference indel is not trivial: indels, especially if located in a tandem repeat, can be placed at quite large distances from each other while still constituting the same basic indel event. To rectify indel positions, we adopt the computation of the equivalent indel region, as defined by Krawitz et al. (2010), which accounts for repeats and gives a window of possible locations for each indel. In the following evaluation, predicted indel size is allowed to vary by 10% of the reference indel size in all real data sets, but has to be exactly the same as the implanted indel size in the simulation experiments. We will use the measures sensitivity and PPV (positive predictive value):

$$\text{Sensitivity} = 1 - \frac{FN}{|I_m|} \quad \text{and} \quad \text{PPV} = \frac{TP}{|I_p|}$$

where true positives (TP) and false negatives (FN) are computed by comparing the set of predicted indels $I_p$ with the set of implanted indels $I_m$ using the indel comparison method described above. The $F_1$-measure serves as a combined measure:

$$F_1 = 2 \cdot \frac{\text{Sensitivity} \cdot \text{PPV}}{\text{PPV} + \text{Sensitivity}}$$
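For illustration, the three measures can be computed from the comparison counts as follows (a minimal sketch with names of our choosing):

```cpp
#include <cstddef>

struct Accuracy { double sensitivity, ppv, f1; };

// tp/fn: true positives and false negatives from the indel comparison;
// numImplanted = |I_m|, numPredicted = |I_p| (both assumed > 0).
Accuracy evaluate(std::size_t tp, std::size_t fn,
                  std::size_t numImplanted, std::size_t numPredicted)
{
    Accuracy a;
    a.sensitivity = 1.0 - double(fn) / double(numImplanted);
    a.ppv         = double(tp) / double(numPredicted);
    a.f1          = 2.0 * a.sensitivity * a.ppv / (a.ppv + a.sensitivity);
    return a;
}
```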

4.4.1 Evaluation Datasets

We evaluate on four different datasets of real and simulated reads. As Pindel is SplazerS' main competitor, we compare the two tools on two different data sets. Details on program calls are given in the appendix.

76 bp paired-end data from 1000 Genomes Project

We use two sets of Illumina reads for HapMap individual NA12878, available from the 1000 Genomes project page: one containing reads mapped or assigned to chromosome 22 and the other containing unmapped reads that could not be assigned to a chromosome (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12878/alignment/NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam and NA12878.unmapped.ILLUMINA.bwa.CEU.high_coverage.20100311.bam). On this data set we compare SplazerS with Pindel and SVseq2.

We extracted a subset¹ of 76 bp reads accounting for ∼20x coverage and used all 952,401 one-end anchored reads as input for Pindel (version 2.2) and SplazerS in paired-end mode. Additionally, we used a total of 39,479,705 unmapped 76 bp reads as input for SplazerS in single-end mode. For a fair comparison, we use the same cutoff as Pindel and thus require at least 3 indel-supporting reads. We furthermore require the indel-supporting reads to constitute at least 50% of reads spanning the putative indel coordinate. Again similar to Pindel's settings, we set SplazerS' maximum distance parameter such that deletions up to 8 kb can be detected. Anchored split-mapping was performed with m = 14, unanchored split-mapping with m = 16. For SVseq2 (version 2.0.1), we use the whole set of reads mapped/assigned to chromosome 22 as input, as in addition to one-end anchored split mapping it also uses mapped read pairs to detect variants. No parameters on the number of supporting reads or variant size can be set, so we run SVseq2 in default mode. We compare the indel prediction results with two reference sets: 1) from the 1000 Genomes project (Durbin et al., 2010) and 2) from a Sanger sequencing study (Mills et al., 2011).

100 bp paired-end data for HapMap individual NA18507

We downloaded a set of 100 bp paired-end reads of chromosome 21 of HapMap individual NA18507, available from the Illumina webpage (http://www.illumina.com/truseq/tru_resources/datasets.ilmn). This file contained a total of 15,069,635 reads, of which 904,616 reads are one-end anchored. Again, we use these reads as input for anchored split mapping and indel detection with SplazerS and Pindel. In contrast to the previous read set, this data does not only offer longer reads but also significantly higher quality reads (the average error rate from the alignment of mapped reads is more than three times lower than for the previous set). Coverage is also higher, at ∼30x. We use the same parameters as in the previous evaluation, but here conduct a second split-mapping step: all one-end anchored reads still unmapped after anchored Hamming distance split mapping are mapped with SplazerS using Levenshtein distance, i.e. allowing additional gaps in prefix and suffix matches. As a reference set, we use a set of indels provided by Illumina using the Casava 1.7 pipeline. For indels larger than 100 bp we check whether our predictions are present in DGV (Iafrate et al., 2004) or dbSNP (Sherry et al., 2001), as no high confidence reference set particularly for this individual is available.

Simulated single-end data

For the simulation of single-end reads (details given in the appendix), we first generate a manipulated reference sequence by randomly choosing 1000 known indels from a reference set (dbSNP129 (Sherry et al., 2001) and DGV (Iafrate et al., 2004)) and implanting them into human chromosome 21. Furthermore, we add single base substitutions at a rate of 0.001 to simulate SNPs. We then generate single-end reads from the manipulated chromosome, using the SeqAn tool Mason (Holtgrewe, 2010) for read simulation with typical Illumina sequencing error settings.

¹ More precisely, we extracted 15 lanes: ERR003975-ERR003989 (consecutively numbered)

We repeated this simulation procedure with different read lengths: 100 bp, 125 bp, and 150 bp. Simulated coverages are 5, 10 and 30x. After mapping the set of simulated reads onto the whole human reference genome with "conventional" ungapped mapping with RazerS, we retrieve all unmapped reads for subsequent split mapping. The unmapped reads are split-mapped with SplazerS and, for comparison, with GSNAP (version 2010-07-27) and BWA (version 0.5.8a). For all tools, only one gap of at most length δ = 5,000 was allowed on each read. For BWA we tested the "log-scaled gap penalty for long deletions" feature, but achieved better results with the n-difference mode, which we consequently use. SplazerS was tested with different minimum match lengths; we will focus on the results with m = 16, but also show results for m = 18 and m = 20. We then use the same indel detection method for all three tools (SnpStore, see chapter 5), as it can detect indels of all size ranges. Only unique best hits are used for indel calling with GSNAP, and only hits with mapping quality > 0 are used for BWA². We set the indel detection method very sensitively: at least two reads and 25% of spanning reads are required to support the indel.

Real single-end data

We used our method in a large-scale targeted resequencing study of 248 male patients with X-linked intellectual disability collected by the EURO-MRX consortium (Kalscheuer et al., submitted). X chromosome exons were targeted by solution hybridization selection (SureSelect, Agilent) (Johnston et al., 2010). Illumina sequencing of captured DNA yielded single-end reads of length 76 bp. After mapping onto the human reference genome with RazerS using edit distance, all unmapped reads were retrieved for mapping with SplazerS. This constituted a total of almost 1.5 billion unmapped reads (on average ∼6M per patient). With m = 23, split mapping was rather strict. Indel detection was not done using SnpStore but with different post-processing scripts (implemented by the author and by Stefan Haas). However, the same basic method was used, requiring that at least 3 reads and 50% of spanning reads support the indel.

4.4.2 Anchored Indel Detection on 1000 Genomes Project Data

The results of comparing indels predicted by SplazerS, Pindel and SVseq2 on anchored and unanchored paired-end reads are summarized in Table 4.2. Since we use previously mapped reads from the 1000 Genomes Project webpage, a large part of the chromosome is already covered by mapped reads, including reads with small indels. Thus, our detected indels do not constitute the whole set of chromosome 22 indels, but rather additional ones discovered through split mapping. Using anchored reads only, the SplazerS approach yielded a total of 392 indel calls, 301 (76.8%) of which are contained in at least one of the two reference sets (1000G (Durbin et al., 2010) and Mills (Mills et al., 2011)). Pindel called only 209 indels, of which 142 (67.9%) are in one of the reference sets. Of the 129 variants predicted by SVseq2, only 14 are annotated (10.85%). SVseq2, using the combined split mapping and read pair approach, is clearly geared towards larger variants. Of all 39 detected indels ≤ 50 bp (38 of which are insertions), only one insertion is contained in the

² Mapping quality > 0 means unique best hits with respect to BWA's quality-based scoring

                        Pindel   SVseq2   SplazerS PE   SplazerS PE+SE
small indels
  Deletions                 88        0        183           233
    Overlap 1000G           55        0        127           161
    Overlap Mills           56        0        112           142
  Insertions                62       18        105           145
    Overlap 1000G           35        0         58            82
    Overlap Mills           42        0         83           111
medium indels
  Deletions                 25        1         60            83
    Overlap 1000G            4        0         17            19
    Overlap Mills            8        0         24            32
  Insertions                24       20         28            40
    Overlap 1000G            0        0          0             0
    Overlap Mills           12        1         21            26
large deletions
  Deletions                 10       44         13            27
    Overlap 1000G            2        9          3             3
    Overlap Mills            0        5          1             1
SV deletions
  Deletions                  0       21          3             6
    Overlap 1000G            0        3          3             5
    Overlap Mills            0        0          0             0

Table 4.2: Number of detected indels on the 1000 Genomes Project data set for NA12878. Pindel and SplazerS PE use anchored reads only; SplazerS PE+SE additionally uses unanchored reads. SVseq2 uses anchored paired-end reads for split mapping and read pair information. Small indels are ≤ 10 bp. Medium indels are > 10 bp, ≤ 50 bp. Large deletions are > 50 bp, ≤ 1000 bp. Large SV deletions are > 1 kb, ≤ 5 kb.

reference set, indicating a high false positive rate or, possibly but less likely, an incomplete/biased reference set.

Pindel's and SplazerS' results are more comparable to each other, showing a similar distribution of predicted variant sizes. The Pindel and SplazerS call sets overlap in 130 indels (62.2% of the Pindel call set). Of the 79 indels unique to Pindel, 44 (55.7%) correspond to an indel in the reference set. Of the 262 indels unique to SplazerS, 195 (74.4%) are in the reference set. These results not only demonstrate a significantly higher sensitivity for SplazerS, but also indicate higher specificity.

Adding SplazerS’ unanchored split mapping results, the SplazerS set of called indels increases from 392 to 534 (last column in Table 4.2). Of the additional 142 indels, 90 (63.4%) are contained in one of the reference sets. This indicates a slight decrease in indel calling specificity, but at the gain of much improved sensitivity: more than 30% additional indels are attributed to single-end split mapping.

In summary, SplazerS was able to recover 391 known indels, while Pindel could only recover 142. Our analysis suggests that Pindel's sensitivity is less than 50% of SplazerS' sensitivity on anchored reads alone. In the following evaluation, we compare SplazerS and Pindel on another read data set (longer, higher quality reads and higher coverage) and look more closely into the falsely predicted and missed indels.

4.4.3 Anchored Indel Detection on 100 bp Illumina Reads

On the set of paired-end 100 bp reads, our indel calling procedure yielded a total of 1481 indel calls, of which 1196 (80.76%) overlap with a reference set indel. Pindel predicted a total of 1232 indels, with 850 (68.99%) matching a reference variant. Table 4.3 summarizes the results and shows the overlap with the Illumina and dbSNP/DGV reference sets. Again, our analysis suggests a much lower sensitivity for Pindel (∼70% of SplazerS' sensitivity). Pindel also seems to be less precise than SplazerS, with the percentage of predictions matching a reference indel being lower for small and medium-sized indels. For large deletions, precision is comparable but sensitivity is lower.

Closer inspection of putative false positive (FP) indel calls of Pindel revealed a significant decrease in precision for the short arm of chromosome 21 (21p) as compared to the long arm (21q). Pindel's precision for called indels is 18.43% in 21p and 79.9% in 21q; SplazerS' precision is 44.4% in 21p and 81.56% in 21q, thus also exhibiting an enrichment of FPs in 21p. 21p is a highly repetitive and polymorphic region (Lyle et al., 2007), which leads to many low confidence split read matches (as indicated by many multiply mapped reads in SplazerS' mapping output). In addition, uniquely mapped split matches are often mutually contradicting, i.e. leading to indel candidates that do not pass the percentage filter criterion (at least 50% supporting reads), which explains the higher precision of SplazerS. Furthermore, 21p is homologous to other regions in the genome, possibly leading to wrongly anchored split reads, which may explain a general enrichment of FPs in 21p.

The results of adding the Levenshtein distance split-mapped reads ("SplazerS+L") are given in the last column of Table 4.3. We observe a small increase (2.6%) in detected indel variants while the overlap percentage remains essentially the same. Levenshtein distance mapping also revealed additional indel-contradicting reads: the effect can be seen for medium-sized insertions, where the number of predictions decreases by one. As an example, Figure 4.7 shows two indels that were detected by SplazerS only and that were also matched with entries in the reference set.

                    Pindel          SplazerS        SplazerS+L
small
  Deletions            533               662               675
    Overlap   375 (70.36%)      538 (81.27%)      550 (81.48%)
  Insertions           532               644               665
    Overlap   423 (79.51%)      580 (90.06%)      597 (89.77%)
medium
  Deletions             64                85                87
    Overlap    17 (26.56%)       37 (43.53%)       37 (42.53%)
  Insertions            61                29                28
    Overlap    11 (18.03%)        8 (27.59%)        8 (28.57%)
large
  Deletions             42                61                63
    Overlap    24 (57.14%)       33 (54.10%)       34 (53.97%)

Table 4.3: Number of detected indels by the different methods tested. Small indels are ≤ 10 bp. Medium indels are > 10 bp, ≤ 100 bp. Large indels are > 100 bp. Small and medium-sized indels were overlapped with an Illumina reference set. Large deletions were overlapped with a set of DGV and dbSNP indels > 100 bp. Percentages in brackets give the fraction of predictions that are contained in the reference set.

3 bp deletion: rs71996349    SNP: rs4816330    10 bp insertion: rs71996349

REF (chr21:29025932..29026035):
TAAAACTTAAAGTATAATAATAAAAAAAAAAAAAAGAAAGAATTGAGCAGTCCCTTCATGACATAGGGCTAGAGAG------AAAAGAGAAGTGAGCAGTCCCTGGAGCC

READS:
TAAAACTTAAAGTATAATAAT---AAAAAAAAAAAGAAAGAAATGAGCAGTCCCTTCATGACATAGGGCTAGAGAGAAAAGAGAAAAAAAGAGAAGTGAGCAG
 AAAACTTAAAGTATAATAAT---AAAAAAAAAAAGAAAGAAATGAGCAGTCCCTTCATGACATAGGGCTAGAGAGAAAAGAGAAAAAAAGAGAAGTGAGCAGT
 AAAACTTAAAGTATAATAAT---AAAAAAAAAAAGAAAGAAATGAGCAGTCCCTTCATGACATAGGGCTAGAGAGAAAAGAGAAAAAAAGAGAAGTGAGCAGT
      aagtataataat---aaaaaaaaaaagaaagaaatgagcagtcccttcatgacatagggctagagagaaaagagaaaaaaagagaagtgagcagtccctggag
        gtataataat---aaaaaaaaaaagaaagaaatgagcagtcccttcatgacatagggctagagagaaaagagaaaaaaagagaagtgagcagtccctggagcc

Figure 4.7: Example of a complex variant region: a 3 bp deletion, a SNP and a 10 bp insertion are colocated in a 65 bp window. The shown region is chr21:29025932..29026035. dbSNP accession IDs (rs numbers) are given for each variant. Reads in capital letters are mapped on the forward strand, while reads in small letters are mapped on the reverse strand. Note that the placement of both indels is ambiguous. By convention, SplazerS places the indel to the leftmost position.

                       SplazerS           GSNAP              BWA
                       SN      PPV        SN      PPV        SN      PPV
Ins     10 - 30 bp     99.35   99.33      95.87   98.63      95.21   96.83
        4 - 9 bp       98.29   98.21      97.44   96.62      100.0   76.39
        1 - 3 bp       98.78   99.65      98.60   99.47      98.94   93.31
Del     1 - 3 bp       96.03   99.80      96.00   98.74      96.99   92.30
        4 - 9 bp       94.54   100.0      92.26   100.0      94.54   87.17
        10 - 50 bp     92.73   99.52      85.18   98.12      87.51   93.71
        51 - 500 bp    70.98   98.38      61.13   89.37      0       NA
        0.5 - 1 kb     94.69   100.0      92.47   97.56      0       NA
SV Del  1 - 5 kb       100.0   94.44      96.67   90.61      0       NA

Table 4.4: Sensitivity (SN) and PPV results of simulations at 30x coverage with read length 125 bp, for different indel size categories. Each category has at least 50 representatives.

Pindel does not detect these indels, as the 10 bp insertion is close to a 3 bp deletion (53 bp upstream) and a single nucleotide polymorphism (34 bp upstream). Both indels and also the T→A SNP are known variants contained in dbSNP. In particular, SplazerS is thus more sensitive and robust in variant-rich regions.

4.4.4 Unanchored Indel Detection on Simulated Data

We now turn to single-end data, i.e. unanchored split mapping. On the simulated read data set, we tested the indel prediction accuracy of SplazerS, BWA and GSNAP for varying read lengths (100-150 bp) and coverages (5-30x). Figure 4.8 shows the results in terms of sensitivity, PPV and the combined accuracy measure, while Table 4.4 shows the results in terms of sensitivity and PPV for 125 bp reads at ∼30x coverage, dividing indels into different size ranges: three small indel classes for insertions/deletions of 1-3 bp, 4-9 bp, and 10-50 bp in size; and three large structural variant classes for sizes 51-500 bp, 501 bp-1 kb and > 1 kb. Figure 4.8A shows how sensitivity increases with coverage and with read length. In all settings, SplazerS is the most sensitive and most precise tool. At 30x, it already recovers > 90% of implanted indels on the 100 bp reads; for the 150 bp reads, sensitivity reaches > 95%. GSNAP is always a few percentage points behind SplazerS, with SplazerS' lead being more pronounced on longer reads and higher coverage (1.8% lead on 100 bp at 5x compared to 4.3% on 150 bp at 30x).


Figure 4.8: Sensitivity, PPV and combined F1-measure of indel detection for increasing coverage and increasing read length. (A) Sensitivity. (B) PPV. (C) F1-measure. Panels 1-3 show read lengths 100 bp, 125 bp and 150 bp, respectively. Note the different axis scaling.

            SnpStore               Dindel                 Dindel-Filtered (q > 30)
            Sensitivity   PPV      Sensitivity   PPV      Sensitivity   PPV
SplazerS    95.0%         99.2%    96.4%         82.7%    82.0%         95.5%
GSNAP       90.7%         97.6%    92.4%         57.2%    78.1%         84.9%
BWA         82.5%         89.2%    84.1%         85.9%    69.6%         94.8%

Table 4.5: Sensitivity and PPV when replacing SnpStore with Dindel on simulated data (125 bp reads, 30x). Dindel assigns a quality value to each predicted indel; when filtering out indels with quality lower than 30 (Dindel-Filtered), precision increases significantly; however, this comes at the cost of much decreased sensitivity.

Due to its exact 12-mer matching approach, GSNAP systematically fails to detect certain indels, even in high coverage data. BWA is less sensitive and less precise than the other tools (Figure 4.8B). Table 4.4 shows that BWA's sensitivity mainly suffers from missed deletions > 50 bp. With increased read length, it can also detect larger indels and thus its overall sensitivity increases. For indels < 10 bp, BWA is the most sensitive tool, but with the lowest PPV. For indels ≥ 10 bp, SplazerS has the highest sensitivity. Furthermore, it maintains the highest PPV in all indel categories. Most notably, SplazerS achieves the highest sensitivities in the SV deletion categories. Both GSNAP and SplazerS exhibit a "temporary" drop in sensitivity for the smallest SV deletion class. The low sensitivity is due to low complexity and repeat sequences, where reads are often

                 BWA      GSNAP    SplazerS   SplazerS
                                   m = 16     m = 20
chr21   index    49.8s    12.2s    -          -
        time     44.4s    49.7s    143.5s     52.9s
        space    154 MB   122 MB   185 MB     1.2 GB
genome  index    94.0m    16.9m    -          -
        time     10.8m    18.2m    193.2m     57.4m
        space    3.7 GB   4.6 GB   3.5 GB     5.6 GB

Table 4.6: Running time and memory measurements for 100,000 simulated 125 bp reads. SplazerS runs are shown for different minimum match lengths (m). BWA and GSNAP require an additional preprocessing step for index construction.

          100 bp               125 bp               150 bp
          Sensitivity   PPV    Sensitivity   PPV    Sensitivity   PPV
m = 16    92.80         97.48  94.93         99.06  96.55         99.13
m = 18    92.55         97.94  94.70         99.06  96.55         99.23
m = 20    91.85         98.03  94.60         99.21  96.35         99.23

Table 4.7: Sensitivity and PPV when varying parameter m on the simulation datasets.

ambiguously mappable or even wrongly mapped without a middle gap. Nevertheless, SplazerS maintains a high PPV and a higher sensitivity than GSNAP for these difficult-to-map indels. In order to investigate whether our results are robust with respect to different indel calling programs, we conducted an additional analysis replacing our in-house tool SnpStore with Dindel (Albers et al., 2011). This analysis exhibits similar relative sensitivity results for SplazerS, GSNAP, and BWA, and furthermore demonstrates that SplazerS was again the most accurate tool in terms of sensitivity as well as PPV (see Table 4.5).

Running times and memory

Note that running time is not an issue for anchored split mapping (on the order of a few minutes for all tools tested). For single-end reads that may map anywhere in the reference sequence, however, running time can become prohibitive. Table 4.6 summarizes the running times for the simulated read data set. SplazerS' high sensitivity comes at the price of increased running time compared with index-based heuristic mappers. However, the parameter m can be used to achieve a significant speedup without compromising much sensitivity on high-coverage data sets and longer read lengths (see Table 4.7). The observed memory increase with m is explained by a larger value of q being used for q-gram index construction. The running time disadvantage nearly disappears when mapping onto a smaller reference sequence, as demonstrated by mapping the simulation reads onto chromosome 21 only.

4.4.5 Unanchored Indel Detection on Targeted Resequencing Data

We now turn to the hardest case: real 76 bp single-end reads from targeted exon resequencing. Using SplazerS, we predicted on average 67 indels per patient. The overlap of predicted indels with dbSNP and DGV was between 38.89% and 71.79% per patient, with a mean overlap of 54.99%. Of the total non-redundant set of indels predicted in at least one patient, 39.02% were present in dbSNP or DGV. Figure 4.9 shows the size distribution of all indels > 5 bp (the majority of smaller indels were predicted with a different method using edit distance alignments). As expected, the majority of indels is located in non-coding sequences (417 out of 456). Non-coding indels occur mostly in tandem repeat regions in units of 2, 3, or 4. Of the 39 coding indels, 29 (74.35%) are multiples of 3, usually having lesser impact on the protein level as the open reading frame may be maintained.

Large deletions ≥ 100 bp are rather rare (61 in total). On average, 3 large deletions were predicted per patient. Interestingly, we observed a strong variance between patients. Closer inspection revealed that locations of large deletions often cluster on the chromosome. Figure 4.10A visualizes these clusters in 1 Mb bins over the X chromosome. Surprisingly, deletions do not only colocalize, but often their boundaries also coincide exactly with annotated exon-intron boundaries. Figure 4.10B shows one such case where 5 predicted deletions span all introns of the PQBP1 gene. These "intron deletions" strongly suggest the presence of a retrocopy of this gene. PCR experiments with several primer pairs confirmed a complete, possibly functional retrocopy of PQBP1 (Vera Kalscheuer). This finding is particularly interesting, as it has been shown previously that mutations in PQBP1 cause X-linked intellectual disability (Kalscheuer et al., 2003; Lenski et al., 2004). Additional (partial) retrocopies were predicted for FAM104B, MSN, MPP1, EIF1AX, RBMX, and OPHN1. In the OPHN1 gene, large deletions spanned 19 introns, causing the large peak close to the centromere in Figure 4.10A.


Figure 4.9: Histogram of indels of sizes > 5 bp and ≤ 40 bp. Indels are more abundant in non-coding sequences. The majority of indels in coding sequences are multiples of three, i.e. codon-length.


Figure 4.10: (A) Histogram of predicted deletions ≥ 100 bp over their genomic coordinate on chromosome X. Clusters of large deletions are often due to retroposed genes, where spliced introns are missing. (B) A screenshot of the UCSC Genome Browser shows five large deletions that coincide exactly with the introns of the PQBP1 gene. The existence of this complete retrocopy of PQBP1 was confirmed by PCR.

This finding is consistent with the recent publication of the tool Splitread (Karakoc et al., 2012). In their exome study, they also detect many gene retrocopies (processed pseudogenes) and build a set of retrocopies which they propose to include in the reference sequence set for read mapping.

4.5 Chapter Summary

In this chapter, we

• developed the prefix-based read mapping tool MicroRazerS, based on the same core algorithm as RazerS and specialized in mapping small RNA reads.

• compared MicroRazerS with other read alignment tools on miRNA sequencing data, measuring the number of detected annotated miRNAs.

• developed the split read mapping tool SplazerS, also based on the RazerS core machinery, specialized in prefix-suffix mapping for indel detection, supporting Hamming as well as Levenshtein distance, and applicable to anchored paired-end as well as unanchored single-end reads.

• applied SplazerS to anchored paired-end reads from the 1000 Genomes project and to newer reads from Illumina, comparing indel detection accuracy with Pindel and SVseq2.

• compared SplazerS with GSNAP and BWA on simulated single-end reads, measuring indel detection accuracy for different read lengths and coverages.

• applied SplazerS to a large-scale targeted resequencing data set of single-end reads from 248 patients with X-linked intellectual disability.

We observed that

• MicroRazerS had the highest miRNA detection sensitivity in our tests and in tests conducted by others (Cordero et al., 2012), together with Soap2 and Shrimp.

• in contrast to other tools, mapping small RNA reads with MicroRazerS did not require preprocessing (adapter-trimming) or postprocessing (read match filtering).

• SplazerS was significantly more sensitive and accurate than Pindel on anchored indel detection, especially in repeat and variant-prone genomic regions.

• SVseq2 is geared towards large deletions > 50 bp.

• on simulated single-end data, SplazerS showed highest indel detection accuracy compared with GSNAP and BWA, especially on large deletions.

• while on large genomes SplazerS was slower than the genome-index based tools GSNAP and BWA, on small reference sequences, such as single chromosomes, performance was similar.

• sensitivity can be effectively traded for speed through SplazerS’ minimum match length parameter.

• on real single-end data, the indel size distribution was as expected, with coding indels mostly conforming to the reading frame and non-coding indels occurring with higher frequency.

• SplazerS discovered several variant/polymorphic gene retrocopies where spliced out introns look like large deletions.

• we tested one particularly interesting retrocopy and confirmed it by PCR; mutations in its gene of origin (PQBP1) are known to be disease-associated.

From this we conclude that

• MicroRazerS’ prefix-based approach is convenient, sensitive and efficient in its handling of small RNA reads.

• SplazerS is the first published tool (to our knowledge) that can perform anchored as well as unanchored split mapping for indel detection, and achieves highest sensitivity in both modes.

• single-end split read mapping is feasible, even with 76 bp reads, and will become more powerful with increasing read length and increasing coverage.

• there are surprisingly many retrocopies, as was also discovered by a similar study using paired-end reads conducted around the same time (Karakoc et al., 2012).

Chapter 5

Small Variant Detection

In the previous chapter, we saw how structural variants and indels can be detected using split read mapping. We now turn to smaller indels and single nucleotide variants (SNVs). SNVs and small indels are the most frequent source of genomic variation (Durbin et al., 2010) and have furthermore been shown to account for many diseases (Stenson et al., 2009). Accurate detection and genotyping is therefore of great importance, for diagnostics as well as for our understanding of genome evolution.

In general, small variant detection is based on inspecting columns in the multiple-read-to-reference alignment induced by the mapped reads. One can distinguish between multi- and single-sample variant calling. Here we only consider single-sample data, i.e. data from one individual that is expected to be diploid. Furthermore, one differentiates between variant calling, i.e. classifying a genomic variant as such, and genotyping, i.e. assigning a genotype to a variant (hetero-/homozygous). Our tool SnpStore, which like all our tools is implemented within the SeqAn library, addresses both variant calling and genotyping.

In the following section, we will first point to the difficulties and challenges in small variant calling. Next, we introduce the currently most popular tools in this field (section 5.2), and then go into detail on SnpStore and its main features (section 5.3). Finally, we provide an evaluation of SnpStore in comparison with other tools on a 1000 Genomes Project data set and demonstrate its performance on a large-scale targeted resequencing data set (the same data sets as used in section 4.4.2).

5.1 SNV/Indel Calling: Challenges in NGS Data

The main challenge of small variant calling and genotyping lies in distinguishing real variants from sequencing errors. Depending on the sequencing technology, variant callers have to cope with different types of sequencing errors (see section 2.2.1). For example, 454 sequencing is prone to miscalling the number of incorporated nucleotides in homopolymer runs, while in Illumina sequencing the main source of errors is base substitution miscalling. Targeted resequencing studies further suffer from preferential capture of the reference allele, leading to biases in allele distribution (Turner et al., 2009).


Figure 5.1: An excerpt of an alignment of five reads (454 data) to a reference sequence. (A) Multiple sequence alignment induced by mapped reads. (B) Multiple sequence alignment after read realignment.

Additional biases can be introduced during PCR amplification of sample DNA fragments and through read mapping (Heinrich et al., 2011). Standard read mapping tools, such as those introduced in Chapter 3, align reads independently of each other, i.e. they produce pairwise read-to-reference alignments unaware of the multiple read context. Sequence variants may therefore not be consistently aligned among multiple reads covering the same genomic position. In addition, sequencing errors can cause further inconsistency in the alignment of multiple reads. Especially in repeat regions, alignment inconsistencies can complicate variant detection. Homer and Nelson (2010) first showed that read realignment taking into account all reads mapped to a genomic position significantly improves variant detection and genotyping accuracy. Figure 5.1 shows an example of a multiple read alignment, induced by reads mapped in the standard pairwise fashion (A), and after realignment (B). After realignment, the homozygous deletion is clearly visible, demonstrating the benefit of realignment. The next section reviews different variant detection methods, placing focus on realignment strategies. For a more detailed review of SNP/genotype calling methods, the reader is referred to Nielsen et al. (2011).

5.2 Small Variant Detection Strategies

Variant detection tools scan the multiple read alignment for differences compared to the reference genome. When looking at a specific variant candidate, they either use certain threshold criteria, such as a minimum number of variant-supporting reads, to call a variant/genotype, or they use a probabilistic approach to determine the most likely event. The threshold approach was mainly used in earlier tools, as the probabilistic approach taking into account base quality values was soon proven to be superior. All tools mentioned in the following belong to this latter category, employing a probabilistic framework. Other steps common to most variant detection tools include PCR artifact removal (removal of duplicates/pile-up correction), removal of low quality bases through read clipping, and assignment of quality (confidence) scores for post filtering. The most popular variant detection tools to date incorporate a realignment step, which we discuss in more detail in the following.

5.2.1 Variant Detection Tools and Their Realignment Strategies

SRMA (Homer and Nelson, 2010) is one of the first tools implementing read realignment for improved variant detection. The algorithm is based on constructing a variant graph incorporating all candidate variants observed in the pairwise read-to-reference alignments. Each read is then realigned to this variant graph and if an alignment superior to the original one (i.e. with better alignment score) exists, the original alignment is replaced with the improved one. As only a small window of the genome needs to be inspected at a time, the variant graph remains small and realignment can be done space-efficiently. The graph is further pruned by only maintaining variants that are observed in a significant fraction or number of original alignments. However, SRMA only performs realignment of reads, but does not implement a variant detection method.

The authors therefore used SRMA in conjunction with Samtools (Li, 2011), a suite of tools for analysis and manipulation of SAM files. Samtools does SNV and indel calling, but did not provide realignment functionality in early versions. Later versions do a simple realignment around indels, but the algorithmic strategy is not well documented.

One of the first indel detection tools incorporating realignment is Dindel (Albers et al., 2011). Its realignment step takes into consideration combinations of candidate variants thereby creating up to a certain maximum number of candidate haplotypes. Reads are realigned to each candidate haplotype and posterior probabilities of each haplotype are calculated to obtain the most probable realignments and haplotype.

Currently, the most popular small variant detection tool with realignment method is probably GATK, the Genome Analysis Tool Kit (DePristo et al., 2011). Similar to Dindel, it first constructs candidate haplotypes by considering indels observed in individual read alignments (and optionally known indel sites). Next, gapless realignment of each read to each haplotype is performed and the most likely haplotype is chosen. Reads are then assigned to either the reference haplotype or the alternative haplotype. Apart from realignment, GATK provides other functionality useful for variant detection. Its base quality score recalibration, for example, has been shown to improve variant detection accuracy (DePristo et al., 2011). GATK also performs a read clipping step that discards the lowest quality read suffix with respect to a quality threshold value.

All realignment methods mentioned above can only detect indels that are present in at least one initial read-to-reference alignment. In order to avoid combinatorial explosion, many tools further prune the search space by discarding indels with low frequency. The realignment module in our SeqAn tool SnpStore aims at being less restrictive by performing essentially a de novo multiple sequence alignment (MSA) of all reads covering a candidate position, and then realigning the reference sequence to the read MSA. We started developing SnpStore in 2009, when no other established variant calling tool was yet available. Many developments and real-data-relevant heuristics were therefore discovered and implemented simultaneously with their development in other tools. In the following we introduce SnpStore in detail.


Figure 5.2: Overview of the SnpStore algorithm. A) Reads are parsed window by window. For each window, B) pileup correction removes likely amplification artifacts, and C) read clipping removes low-quality bases and likely sequencing errors close to read borders. D) Each separate group of reads is realigned and finally E) SNVs and indels are called.

5.3 SnpStore - Variant Calling Algorithm

Like all of our tools, SnpStore is implemented in SeqAn. Most importantly, SnpStore supports SNV and indel detection as well as genotyping. It essentially implements the same strategy as Samtools/Maq for assigning posterior probabilities to genotypes, assuming diploid data (section 5.3.2). For indel calling, a threshold model is used. Alternatively, the threshold approach can also be used for SNV calling. SnpStore's realignment procedure is based on the Anson-Myers ReAligner (Anson and Myers, 1997) and will be explained in more detail in section 5.3.1. Before realignment and subsequent variant calling, SnpStore undertakes several steps (see also Figure 5.2):

1. Parsing reads in windows

By default, reads are parsed in genomic windows of 10,000 bp. This is done to keep a low memory footprint and to avoid large realignment units, which would negatively impact runtime. Reads overlapping with window borders, or with other reads which overlap with window borders, are considered in both windows in order to avoid realignment border artifacts.

2. Pileup correction

The pileup correction (or duplicate removal) procedure discards stacks of reads most likely resulting from PCR amplification artifacts. Of all reads mapping to the same chromosomal coordinates and strand, only the x highest quality reads are kept, where x is specified by the user and should be set according to the expected sequencing coverage. Pileup correction is by default done on the merged set of reads from all input files. When multiple read files from separately prepared samples

(e.g. replicates with separate PCR amplification step) are used as input, pileup correction can be done on each file separately.
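A minimal sketch of this step (all names hypothetical; SnpStore's actual implementation may differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// A mapped read reduced to what pileup correction needs.
struct MappedRead {
    long pos;           // mapping coordinate
    bool reverseStrand; // mapped strand
    double meanQual;    // mean base quality, used to rank the stack
};

// Of all reads sharing (coordinate, strand), keep only the x best by quality.
std::vector<MappedRead> pileupCorrection(std::vector<MappedRead> reads,
                                         std::size_t x)
{
    std::map<std::pair<long, bool>, std::vector<MappedRead>> stacks;
    for (const MappedRead& r : reads)
        stacks[{r.pos, r.reverseStrand}].push_back(r);

    std::vector<MappedRead> kept;
    for (auto& entry : stacks) {
        std::vector<MappedRead>& stack = entry.second;
        std::sort(stack.begin(), stack.end(),
                  [](const MappedRead& a, const MappedRead& b) {
                      return a.meanQual > b.meanQual;
                  });
        if (stack.size() > x) stack.resize(x);  // drop likely PCR duplicates
        kept.insert(kept.end(), stack.begin(), stack.end());
    }
    return kept;
}
```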

3. Read clipping

An additional feature of SnpStore is its read clipping routine. Read clipping is applied in order to trim low quality read ends, leading to a cleaner set of reads and hence cleaner variant calls. Clipping positions are determined by sliding a window of length w (by default w = 10 bp) over the read, starting at the suffix (prefix, respectively). If there is at least one base with a quality value lower than a certain threshold value $Q_t$ (by default $Q_t = 10$), the window is slid to the next position.

Once all quality values in the current window are larger than or equal to $Q_t$, the procedure stops and the suffix (prefix, respectively) outside the current window is trimmed. Optionally, it can additionally be required that there are no alignment errors within m base pairs of the clipping position (default m = 3); otherwise clipping is extended to remove the misaligned bases. Clipping positions are not computed within SnpStore, but are currently determined by a Perl script which adds clip tags to the mapped read files (clipping developed and implemented by Stefan Haas). These tags are read and applied by SnpStore.
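The suffix-clipping part of this heuristic could look as follows (a minimal sketch under the stated defaults; the actual clipping is performed by the external Perl script mentioned above, and prefix clipping is symmetric):

```cpp
#include <cstddef>
#include <vector>

// Returns the position after the last base to keep (i.e. keep [0, end)).
// quals holds Phred values in 5'->3' order; the window starts at the 3'
// end and slides inward until all w qualities reach at least Qt.
std::size_t clipSuffixPosition(const std::vector<int>& quals,
                               std::size_t w = 10, int Qt = 10)
{
    if (quals.size() < w) return quals.size(); // read shorter than window: keep all
    for (std::size_t end = quals.size(); end >= w; --end) {
        bool allGood = true;
        for (std::size_t i = end - w; i < end; ++i)
            if (quals[i] < Qt) { allGood = false; break; }
        if (allGood) return end;   // clip everything from position 'end' on
    }
    return 0;                      // no acceptable window: clip the whole read
}
```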

5.3.1 Realignment

For realignment, SnpStore integrates the Anson-Myers ReAligner (Anson and Myers, 1997), which was implemented in SeqAn by Tobias Rausch as part of the SeqCons tool for insert assembly (Rausch et al., 2009). For each window of parsed reads, SnpStore treats each group of overlapping reads separately (see Figure 5.2). First, a multiple read alignment maximizing consistency among reads with respect to the implied consensus sequence is computed. In order to be able to call variants with respect to the reference genome, this multiple read alignment is then realigned to the reference sequence.

Notation

We define an alphabet $\Sigma^* = \{A, C, G, T, N, -\}$ of alignment characters or observations. Given an alignment column C, we denote the total number of alignment characters in this column as n. For $X \in \Sigma^*$ we further use $n_X$ to refer to the number of times we observe character X. The consensus c records the most frequent character of the column; c records multiple characters if there are ties.

Anson-Myers’ ReAligner

The ReAligner algorithm (Anson and Myers, 1997) expects as input a relatively good MSA that is then refined by a round robin procedure, extracting one sequence at a time and realigning it to the remaining MSA (called a profile). The extracted sequence is realigned in a band around its original alignment, which enforces it to differ only slightly from its original alignment, i.e. the global layout of the MSA is maintained. The scoring function used when realigning a read to a profile in a classical dynamic programming manner is a combination of two scores: the consensus

score $\delta_c$ and the fractional score $\delta_a$. When aligning observation X to alignment column C, we compute

$$\delta_c(X, C) = \begin{cases} 0 & \text{if } X \in c \text{ or } C \text{ is empty} \\ 1 & \text{otherwise} \end{cases} \qquad (5.1)$$

and

$$\delta_a(X, C) = \begin{cases} n_X / n & \text{if } n > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (5.2)$$

The overall score $\delta_{a+c}$ is then a combination of both scores, giving them equal weight:

$$\delta_{a+c}(X, C) = 0.5 \cdot \delta_c(X, C) + 0.5 \cdot \delta_a(X, C) \qquad (5.3)$$

This realignment strategy has been shown to perform well for assembly finishing (Anson and Myers, 1997). In contrast to the original scoring scheme, which uses linear gap costs, we use affine gap costs, which assign a higher penalty when a gap is opened, i.e. when a new column is inserted or the first gap in a column is inserted. As we ultimately want to call variants with respect to the reference sequence, we finally realign the reference sequence to the multi-read MSA.
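The column score can be evaluated as in the following sketch, which mirrors equations (5.1)-(5.3) exactly as stated (our own illustration; counts[] is assumed to hold the profile column's character counts $n_X$ over Σ*):

```cpp
#include <array>
#include <cstddef>

// Evaluates delta_{a+c} of eqs. (5.1)-(5.3) for aligning character X to a
// profile column. counts[i] holds n_X for the i-th character of "ACGTN-";
// xIdx is X's index in that alphabet.
double deltaAC(std::size_t xIdx, const std::array<int, 6>& counts)
{
    int n = 0, maxCount = 0;
    for (int c : counts) { n += c; if (c > maxCount) maxCount = c; }
    int nX = counts[xIdx];
    // (5.1): 0 if X belongs to the consensus (ties included) or C is empty
    double deltaC = (n == 0 || (nX == maxCount && maxCount > 0)) ? 0.0 : 1.0;
    // (5.2): fractional score n_X / n
    double deltaA = (n > 0) ? double(nX) / double(n) : 0.0;
    // (5.3): equal-weight combination
    return 0.5 * deltaC + 0.5 * deltaA;
}
```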

5.3.2 Variant Calling and Genotyping

Given the initial or realigned read alignment, our objective now is to distinguish real variants from sequencing errors. After introducing the necessary notation, we describe our threshold model, which can be used for indel as well as SNV calling and genotyping. Next, we explain the Bayesian SNV genotyping model that was originally developed in Maq (Li et al., 2008a) and that we integrated into SnpStore with some adaptations. Note that while the Maq model assumes diploid data, the threshold model can in principle be applied to multi-sample data. However, we will not evaluate multi-sample data here. Independent of the variant calling model used, a number of heuristics for improving variant detection accuracy can be employed in SnpStore, which we explain before finally turning to the evaluation section.

Notation

In addition to the notation of the previous section, we record for each column the quality values associated with each involved alignment character. For each possible alignment character $X \in \Sigma^*$, we maintain a vector $Q^X = \{Q^X_1, \ldots, Q^X_{n_X}\}$ where $Q^X_i$ with $1 \le i \le n_X$ is a Phred-scaled quality value (see section 2.2.2). These quality values are either the raw base call quality values or recalibrated quality values, e.g. possibly taking into account the mapping quality of the respective reads. The gap character "−" receives the average quality value of its neighboring bases in the respective read. Further, we have the reference sequence character $\tau \in \Sigma^* = \{A, C, G, T, N, -\}$.

Threshold Method for SNV and Indel Calling

The threshold model in SnpStore is the only model offered for indel calling/genotyping. For SNV calling, the alternative probabilistic model will be introduced later. In the threshold model, each alignment column is checked for a number of different criteria. For the current alignment column, we denote the most frequent non-reference character as $\hat{X}$ with $\hat{X} \in \Sigma^* \setminus \{\tau\}$. A variant is called if the following requirements are satisfied (a minimal code sketch of the resulting decision procedure is given at the end of this subsection):

• the alignment column needs to be covered by at least a certain minimum number of reads $\text{min}_{depth}$, i.e.
$$n \ge \text{min}_{depth} \qquad (5.4)$$

• $\hat{X}$ needs to be observed at least a certain minimum number of times $\text{min}_{count}$, i.e.
$$n_{\hat{X}} \ge \text{min}_{count} \qquad (5.5)$$

• $\hat{X}$ needs to be observed on at least a certain minimum fraction of reads $\text{min}_{frac}$, i.e.
$$\frac{n_{\hat{X}}}{n} \ge \text{min}_{frac} \qquad (5.6)$$

• $\hat{X}$ needs to be observed with a certain minimum quality $\text{min}_{qual}$, i.e.
$$\frac{\sum_i Q^{\hat{X}}_i}{n_{\hat{X}}} \ge \text{min}_{qual} \qquad (5.7)$$

A called variant is classified as homozygous if the variant is observed with a certain minimum fraction $\text{min}_{gfrac}$:
$$\frac{n_{\hat{X}}}{n} \ge \text{min}_{gfrac} \qquad (5.8)$$
Otherwise, it is classified as heterozygous. All threshold parameters can be specified by the user and may be set for indels and SNVs separately. On Illumina data, it makes sense to set the thresholds for indels lower than for SNVs, as indel sequencing errors are less common and it is in general harder to correctly align indel-containing reads.
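A minimal sketch of the resulting decision procedure (names hypothetical, not SnpStore's interface; assumes $\text{min}_{count} \ge 1$ so the division is safe):

```cpp
#include <numeric>
#include <vector>

// Thresholds of eqs. (5.4)-(5.8); all values are user-configurable.
struct Thresholds {
    unsigned minDepth, minCount;
    double minFrac, minQual, minGFrac;
};

enum class Call { None, Heterozygous, Homozygous };

// n: total column depth; nX: count of the most frequent non-reference
// character X^; qualsX: Phred qualities of the nX supporting bases.
Call thresholdCall(unsigned n, unsigned nX,
                   const std::vector<int>& qualsX, const Thresholds& t)
{
    if (n < t.minDepth) return Call::None;                       // (5.4)
    if (nX < t.minCount || nX == 0) return Call::None;           // (5.5)
    double frac = double(nX) / double(n);
    if (frac < t.minFrac) return Call::None;                     // (5.6)
    double meanQual =
        std::accumulate(qualsX.begin(), qualsX.end(), 0.0) / nX;
    if (meanQual < t.minQual) return Call::None;                 // (5.7)
    return frac >= t.minGFrac ? Call::Homozygous
                              : Call::Heterozygous;              // (5.8)
}
```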

Bayesian Model for Genotyping (Maq Model)

The Bayesian model for SNV genotype calling is more involved and will be described only briefly here. More detailed formulae are given in the appendix. Given the two most frequent bases $\hat{X}$ and $\hat{Y}$ in an alignment column, we are ultimately interested in calculating the posterior probabilities of the genotypes $(\hat{X}, \hat{X})$, $(\hat{Y}, \hat{Y})$, and $(\hat{X}, \hat{Y})$, given the observed data D. For each position along the reference sequence, we therefore first determine $\hat{X}$ and $\hat{Y}$. Given that we observe base $\hat{X}$ k times and

base $\hat{Y}$ $n - k$ times, we denote the homozygote likelihoods as

$$P(D \mid (\hat{X}, \hat{X})) = \alpha_{n,k} \qquad (5.9)$$

$$P(D \mid (\hat{Y}, \hat{Y})) = \alpha_{n,n-k} \qquad (5.10)$$

The values $\alpha_{n,k}$ are derived by applying binomial statistics and then correcting for correlation of sequencing errors caused by mapping biases and amplification artifacts (see appendix). Reads mapped to different strands are modeled as being independent of each other, such that the final value $\alpha_{n,k}$ is calculated as the product of $\alpha_{n_F,k_F}$ and $\alpha_{n_R,k_R}$, where the subscripts F and R denote the corresponding values only considering the forward and reverse strand, respectively. Furthermore, the model takes into account individual base error probabilities. Heterozygotes in Maq are also modeled using a binomial distribution, with probability 0.5 of observing one or the other allele:

$$P(D \mid (\hat{X}, \hat{Y})) = \frac{1}{2^n} \binom{n}{k} \qquad (5.11)$$

To get to posterior probabilities, a prior needs to be set for each event. In Maq, the prior r for heterozygotes is set to 0.001 (the SNP rate in humans) by default, and each homozygote prior is set to $(1 - r)/2$. If we set $\hat{r} = 2r/(1 - r)$, we arrive at the posterior probabilities

$$P((\hat{X}, \hat{Y}) \mid D) = \frac{\hat{r} \cdot P(D \mid (\hat{X}, \hat{Y}))}{\hat{r} \cdot P(D \mid (\hat{X}, \hat{Y})) + P(D \mid (\hat{X}, \hat{X})) + P(D \mid (\hat{Y}, \hat{Y}))}$$

$$P((\hat{X}, \hat{X}) \mid D) = \frac{P(D \mid (\hat{X}, \hat{X}))}{\hat{r} \cdot P(D \mid (\hat{X}, \hat{Y})) + P(D \mid (\hat{X}, \hat{X})) + P(D \mid (\hat{Y}, \hat{Y}))}$$

$$P((\hat{Y}, \hat{Y}) \mid D) = \frac{P(D \mid (\hat{Y}, \hat{Y}))}{\hat{r} \cdot P(D \mid (\hat{X}, \hat{Y})) + P(D \mid (\hat{X}, \hat{X})) + P(D \mid (\hat{Y}, \hat{Y}))}$$

The genotype $\hat{g}_1$ maximizing the posterior probability is called, with a genotype quality confidence value describing how likely the second-best genotype $\hat{g}_2$ is compared to $\hat{g}_1$, i.e. $P(\hat{g}_2 \mid D)/P(\hat{g}_1 \mid D)$ transformed to a Phred quality.
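Given the three likelihoods, the posterior computation and the Phred-scaled confidence can be sketched as follows (the likelihoods $\alpha_{n,k}$ are assumed to be computed elsewhere, see appendix; all names are ours):

```cpp
#include <cmath>

struct Posteriors { double het, homX, homY; };

// likHet  = P(D | (X^,Y^)) from eq. (5.11) (or its corrected variant),
// likHomX = alpha_{n,k}, likHomY = alpha_{n,n-k}; r is the heterozygote prior.
Posteriors genotypePosteriors(double likHet, double likHomX, double likHomY,
                              double r = 0.001)
{
    double rHat = 2.0 * r / (1.0 - r);
    double denom = rHat * likHet + likHomX + likHomY;
    return { rHat * likHet / denom, likHomX / denom, likHomY / denom };
}

// Phred-scaled confidence: how likely the second-best genotype g2 is
// relative to the best genotype g1, -10 log10(P(g2|D) / P(g1|D)).
double phredConfidence(double pSecondBest, double pBest)
{
    return -10.0 * std::log10(pSecondBest / pBest);
}
```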

Adapted Maq Model in SnpStore

SnpStore implements the Maq model for genotype calling with some adaptations concerning 1) the usage of quality values in computing homozygote probabilities, 2) the additional computation of an SNV quality value, and 3) the calculation of heterozygote probabilities. Firstly, while Maq requires a mapping quality value for each read, SnpStore resorts to raw base call qualities if no mapping qualities are given. For each base, the minimum of its base quality and the mean read base quality of the respective read is used. The rationale here is to trust no read base more than the read it is contained in. Secondly, SnpStore computes SNV quality values similarly to genotype quality values, by simply measuring how likely the homozygous reference genotype $\hat{g}_R$ is compared to the called genotype

$\hat{g}_1$, i.e. $P(\hat{g}_R \mid D)/P(\hat{g}_1 \mid D)$, and transforming it to a Phred quality value. Finally, SnpStore offers an alternative approach for modeling heterozygote likelihoods. This model is based on the observation that the PCR amplification process leads to a broader variance in observed allele frequency than expected from binomial statistics (Heinrich et al., 2011). Heinrich et al. analysed this in exome data, seeing a mean fraction of 0.54 of the reference allele at heterozygote positions. The shift is due to enrichment bias (the reference allele is more likely to be captured than the variant allele) and due to read mapping (reads carrying the reference allele are more likely to be mapped). Apart from this shifted mean, they observed that the allele frequency is best described by a branching process. They finally derived an analytical model, a normal distribution with larger variance, that fits well with empirical data. In collaboration with the authors (Peter Krawitz, Verena Heinrich and Na Zhu), we adapted the Maq genotype calling model to incorporate this distribution. With reference allele probability p = 0.54, the mean reference allele frequency at a position covered by n reads is $\mu = n \cdot p$ and the variance is $\sigma^2 = n \cdot p \cdot (1 - p) + \text{Var}_B$. $\text{Var}_B$ is the asymptotic variance of a random variable describing the allele distribution after amplification, modeled with a branching process (Heinrich et al., 2011). The variance depends on the number of amplification cycles K, the initial number of fragments of each allele N, and the amplification probability q.

$$\mathrm{Var}_B = \frac{2(1+q)^{-1} - 2(1+q)^{-K-1} + (1+q)^{-K} - 1}{8N} \qquad (5.12)$$

We can then replace formula 5.11 with

$$P(D \mid (\hat{X},\hat{Y})) = \int_{k-0.5}^{k+0.5} f(k; \mu, \sigma^2)\,dk \qquad (5.13)$$

which we can efficiently calculate using the cumulative distribution function of $f$, the normal distribution density. By default, we use the realistic values $N = 10$, $q = 0.3$, and $K = 18$ (Krawitz, personal communication), but these parameters can be set on the command line.
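For illustration, the following C++ sketch evaluates equation 5.13 via the normal CDF, together with the branching-process variance of equation 5.12. The helper names and structure are assumptions made for this example, not SnpStore's actual implementation:

#include <cmath>

// Normal CDF via the error function (sketch helper).
double normalCdf(double x, double mu, double sigma)
{
    return 0.5 * (1.0 + std::erf((x - mu) / (sigma * std::sqrt(2.0))));
}

// Asymptotic branching-process variance Var_B, cf. eq. (5.12).
double varBranching(double q, int K, int N)
{
    return (2.0 * std::pow(1.0 + q, -1) - 2.0 * std::pow(1.0 + q, -K - 1)
            + std::pow(1.0 + q, -K) - 1.0) / (8.0 * N);
}

// P(D | heterozygote) for k reference-allele observations among n reads,
// cf. eq. (5.13), with the defaults p = 0.54, N = 10, q = 0.3, K = 18.
double hetLikelihood(int k, int n, double p = 0.54,
                     int N = 10, double q = 0.3, int K = 18)
{
    double mu    = n * p;
    double sigma = std::sqrt(n * p * (1.0 - p) + varBranching(q, K, N));
    return normalCdf(k + 0.5, mu, sigma) - normalCdf(k - 0.5, mu, sigma);
}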

Additional Filter Criteria for Both Models

A number of additional criteria are checked for each candidate alignment column, independent of whether the threshold or the Bayesian model is used. These criteria include

• the candidate variant needs to be observed on both strands

• a minimum number of different read positions need to be involved in the candidate observation

• additionally, these positions can be required to have a minimum distance from the read borders

• there may not be more than a certain fraction of noise reads, i.e. reads showing neither the candidate variant nor the reference character

These criteria were developed through inspection of many real-data cases where, without these heuristics, wrong variant calls would be made.

5.4 SnpStore - Results

We evaluate and compare SnpStore with other tools on simulated and real data. Our evaluation cannot investigate all parameters; we limit the analysis to key parameters, such as whether realignment is performed or read clipping is applied. In the following, we explain the different tool settings we test and then introduce the different data sets used for evaluation.

Tool Setup

We will compare SnpStore, GATK and Samtools, switching on a number of key features. For SnpStore, the default mode will be denoted as "SnpStore". When feature X is enabled, this will be denoted as "SnpStore+X". The features tested are realignment (R), read clipping (C), variance-corrected heterozygote probability computation (H), and using the threshold model (T) instead of the default probabilistic SNV calling model. For example, "SnpStore+R+C+H" denotes that SnpStore was run with realignment, read clipping and heterozygote correction. Analogously, for GATK we switch on realignment (R), read clipping (C) and quality value recalibration (Q). Samtools is only run in its recommended default mode (which already employs a simple realignment step).

5.4.1 Evaluation Datasets

Simulation data

We simulated read data from a manipulated chromosome to study how well SnpStore, GATK and Samtools retrieve randomly implanted variants. To this end, we again use human chromosome 21 for simulation. We implant indels (1-5 bp) and SNVs at a ratio of 1:8 and simulate diploid data with a 2/3 probability for heterozygotes. Indel length, sequence and position as well as SNV position and base are sampled randomly. We then simulate 10 million 100 bp reads (∼30× coverage) from the two generated haplotypes using the SeqAn tool Mason in Illumina default mode (including quality value simulation). These reads are mapped onto the original chromosome 21 using RazerS (in default mode with Levenshtein distance enabled). We use the mapped reads as input for GATK, Samtools and SnpStore. As RazerS does not assign a mapping quality, but mapping qualities are required by GATK and Samtools, we simply assign each uniquely mapped read a mapping quality of 254 (the maximum possible value) and each non-uniquely mapped read a mapping quality of 0. For additional comparison, we map the simulated reads with BWA, the mapping tool probably most widely used in conjunction with GATK and Samtools.

Given a variant prediction set P and the reference set of implanted variants S, we then define sensitivity, precision and genotyping precision as:

$$\text{Sensitivity} = 1 - \frac{FN}{|S|} \qquad \text{Precision} = \frac{TP}{|P|} \qquad \text{Genotyping precision} = \frac{TP_{geno}}{TP}$$

where $TP$ denotes the number of true positive variant calls, $FN$ the number of false negative variant calls, and $TP_{geno}$ the number of true positive variants called with the correct genotype¹. The genotype is considered correct if a predicted heterozygous variant is matched with a heterozygous variant in the reference set, or a predicted homozygous variant analogously with a homozygous reference variant. For SNVs, we count all predicted variants as true positives if they are detected at the exact same position and with the exact same base (and genotype for $TP_{geno}$) as a reference set SNV. For indels, the predicted variant may vary slightly in position (±5 bp) and size (10%) compared to the reference set indel. We furthermore employ the extended indel region computation (Krawitz et al., 2010) to correct for larger indel placement uncertainty due to repetitive sequence.

1000 Genomes Data

Next, we evaluate SNV and indel calling and genotyping on a 1000 Genomes data set from whole genome resequencing of HapMap individual NA12878. This is the same data set as used in section 4.4.2, but instead of extracting unmapped reads, we use the whole data set of 38,632,492 reads mapped to chromosome 22. Again, we compare SnpStore with the two popular variant detection tools GATK and Samtools. In order to measure performance, we compare our detected variants with the high confidence variants detected and published by the 1000 Genomes Project for NA12878 (daughter of the CEU trio). Our NA12878 data set constitutes a good evaluation set, as 1) it contains reads from different Illumina sequencing instruments and with different read lengths, and 2) a high confidence variant reference set is available from the 1000 Genomes Project (Durbin et al., 2010). Note that the 1000 Genomes Project used both GATK and Samtools to compile the SNV reference set, keeping the intersection of the two prediction sets. For indels, they used Dindel, which is very similar to the indel calling algorithm implemented in GATK. These obvious circularities should give GATK and Samtools an advantage over SnpStore. However, they used earlier versions of the tools (and additional filtering and knowledge from Mendelian segregation), and most importantly used a different read data set (including 454 as well as Illumina reads). That read data set was produced by earlier sequencing machines (yielding shorter, exclusively single-end reads), and the covered portion of the genome was estimated to be 80% (Durbin et al., 2010). Thus, the reference set is not complete and we expect to see additional variants in the prediction sets. However, it gives us high confidence genotype calls, as calls were made from trio data, where mother and father genotypes help to accurately assign child genotypes.

¹Note that for SNVs we have $TP + FN = |S|$. For indels this does not necessarily hold, as more than one predicted indel may match the same reference set indel.

We therefore use this set only for sensitivity estimation and genotyping precision, and estimate precision with respect to variants contained in dbSNP (sensitivity, precision and genotyping precision are defined as in the previous section).

Targeted Resequencing Data

Here, we use the same targeted resequencing data set of 248 male X-linked intellectual disability patients (Kalscheuer et al., submitted) as described in section 4.4.1. We call variants on the RazerS- and SplazerS-mapped reads using SnpStore, and further inspect them with respect to functional classes, e.g. synonymous or non-synonymous mutations (see section 2.1.2).

5.4.2 Comparison with Other Tools on Simulation Data

Table 5.1 summarizes the results for calling SNVs on the simulation data, comparing SnpStore, GATK and Samtools. For all tools, we show results for the default mode and with realignment switched on where possible. For SnpStore, we additionally test the threshold model as an alternative to the default probabilistic SNV calling model.

                     Sensitivity (FN)   Precision (FP)   Genotyping prec.

SnpStore+T           99.45% (132)       99.17% (199)     99.92%
SnpStore             99.79% (51)        98.31% (411)     99.95%
SnpStore+R           99.78% (52)        99.85% (37)      99.99%
SnpStore+R (BWA)     99.72% (68)        99.86% (34)      99.97%

GATK                 99.49% (122)       99.34% (159)     99.93%
GATK+R               99.50% (120)       99.40% (145)     99.91%
GATK+R (BWA)         99.41% (141)       99.46% (130)     99.95%

Samtools             99.75% (60)        99.89% (26)      99.98%
Samtools (BWA)       99.61% (94)        99.88% (28)      99.97%

Table 5.1: Sensitivity, precision and genotyping precision for SNV calling on the simulation data set. "SnpStore+T" uses the threshold model for SNV calling, "SnpStore" the default probabilistic model; "+R" means that realignment was performed before SNV calling. The additional "(BWA)" indicates that results are based on BWA-mapped reads; otherwise RazerS-mapped reads were used.

                     Sensitivity (FN)   Precision (FP)   Genotyping prec.

SnpStore             62.17% (1,135)     39.37% (2,941)   89.11%
SnpStore+R           90.56% (283)       93.69% (183)     97.79%
SnpStore+R (BWA)     86.40% (408)       93.95% (167)     99.30%

GATK                 56.07% (1,318)     47.20% (1,932)   94.09%
GATK+R               88.60% (342)       85.05% (484)     91.94%
GATK+R (BWA)         87.63% (371)       99.34% (18)      93.33%

Samtools             74.87% (754)       82.39% (480)     99.42%
Samtools (BWA)       96.93% (92)        97.70% (67)      98.52%

Table 5.2: Sensitivity, precision and genotyping precision for indel calling on the simulation data set. For the legend on tool settings, see Table 5.1.

To test how the choice of read mapper affects results, we show results for each tool's best setting on BWA- instead of RazerS-mapped reads. Table 5.2 compiles the corresponding results for indel calling. Note that there is no additional entry for "SnpStore+T", as indel calling is always done with the threshold model.

For SNV calling, we see that all tools generally achieve very high sensitivity (> 99.4%). For all tools, using the RazerS-mapped reads as input gives higher sensitivity than using BWA-mapped reads. However, GATK's and SnpStore's precision increases slightly with BWA-mapped reads. The overall most sensitive setting is SnpStore without realignment on RazerS-mapped reads. But here we also see by far the lowest precision (98.31%, while all other precision values are > 99%). When switching on realignment in SnpStore, SNV calling sensitivity essentially stays the same (losing only one true positive SNV call), while the number of false positive calls drops by a factor of more than 10 (37 compared to 411 false positives). The gain through realignment is also seen for GATK, but less drastically than for SnpStore. Using RazerS-mapped reads with either SnpStore with realignment or Samtools seems to be the best choice for SNV calling, where SnpStore is slightly more sensitive and Samtools slightly more precise. Interestingly, GATK shows the lowest sensitivity and the lowest precision. All tools perform well in genotyping (with genotyping precisions > 99.9%). Genotyping on this simulation data set is easy because reads were sampled uniformly and with random errors, and no sequencing biases or amplification artifacts complicate genotyping.

For indel calling, we see much lower sensitivity and precision values than for SNV calling. For both GATK and SnpStore, the realigned versions perform significantly better than the unrealigned ones. While Samtools with BWA-mapped reads is the most sensitive combination, GATK with BWA is the most precise one. For all tools, precision is higher when using the BWA mappings as input. However, both GATK and SnpStore are more sensitive on RazerS than on BWA mappings. Genotyping is also harder than for SNVs, indicating that indel-containing reads are harder to map and align correctly.

While unidentified SNVs are generally due to repetitive regions that make unique mapping difficult or impossible, unidentified indels are in addition due to mapping biases. In general, SNV identification is much more robust, while indel identification is quite vulnerable to the mapping tool used and the indel calling tool's settings. Now we turn to real data, where we expect to see more difficulties, especially for indels, as they tend to lie in repetitive regions.

5.4.3 Comparison with Other Tools on 1000 Genomes Data

On the 1000G data, we show results for SnpStore in five different modes. In addition to the ones tested previously, we test two more features that are mainly of relevance on real read data: read clipping (C) and variance-corrected heterozygote probability computation (H). GATK and Samtools were run in the same modes as before, but GATK additionally with quality recalibration (Q). Table 5.3 compiles the results for SNV calling, Table 5.4 for indel calling. There are fewer rows in the indel calling table, as the heterozygote correction (+H) only concerns SNV calling, and again there is only one calling model for indel calling (threshold-based), as opposed to the two alternatives in SNV calling (default probabilistic or threshold-based).

                   |P|      Sensitivity (FN)   Precision (FP_dbSNP)   Genotyping prec.

SnpStore+T         46,768   97.67% (766)       92.43% (3,541)         99.39%
SnpStore           48,255   98.53% (485)       93.16% (3,303)         99.42%
SnpStore+R         47,673   98.51% (491)       93.81% (2,952)         99.36%
SnpStore+R+C       46,336   98.37% (538)       94.90% (2,363)         99.29%
SnpStore+R+C+H     46,931   98.45% (509)       94.33% (2,659)         99.32%

GATK               48,763   98.62% (456)       94.45% (2,704)         99.56%
GATK+R             48,139   98.57% (472)       95.09% (2,364)         99.55%
GATK+R+Q           48,303   98.61% (458)       94.97% (2,429)         99.55%

Samtools           44,260   97.31% (885)       97.64% (1,043)         99.53%

Table 5.3: SNV calling results on the NA12878 data set.

                   |P|     Sensitivity (FN)   Precision (FP_dbSNP)   Genotyping prec.

SnpStore           4,898   64.17% (1,433)     70.25% (1,457)         81.83%
SnpStore+R         4,194   71.24% (1,150)     80.04% (837)           92.74%
SnpStore+R+C       3,898   69.57% (1,217)     83.50% (643)           92.60%

GATK               4,031   67.17% (1,313)     90.03% (402)           97.06%
GATK+R             4,928   75.52% (979)       88.84% (550)           96.52%
GATK+R+Q           4,934   75.52% (979)       88.79% (553)           96.53%

Samtools           4,030   71.52% (1,139)     88.96% (445)           95.73%

Table 5.4: Small indel calling results on the NA12878 data set.

For SNV calling, we see high sensitivity (> 97%) and precision (> 92%) for all tools and settings. For indel calling, on the other hand, we see significantly lower sensitivities (for all tools > 20% lower than for SNVs) and also considerably lower precision (between ∼4 and 20% lower). Genotyping, too, is more precise for SNVs (> 99.3% for all tools) than for indels (∼92-97%, with the exception of SnpStore without realignment). For both SnpStore and GATK, there is a significant increase in indel calling sensitivity (> 7%) when realignment is switched on. This comes with a slight decrease (< 0.06%) in sensitivity for SNVs, but also a more significant increase in SNV calling precision (> 0.6%). Interestingly, the total number of predicted indels increases with realignment for GATK but decreases for SnpStore. SnpStore seems to handle the BWA-mapped reads less accurately than GATK and produces many spurious indel calls as well as multiple calls for the same indel (called at slightly different positions due to inconsistency in the unrealigned reads).

As expected, SnpStore's default probabilistic SNV calling model is clearly superior to the threshold approach. Inspecting the other features tested, we see that SnpStore's read clipping is very effective at increasing precision for both SNVs and indels. At the same time, sensitivity is decreased, especially for indels. This implies that clipping discards read bases spanning or reaching into indels that are aligned as mismatches but carry potential for identifying indel variants. We also switched on GATK's read clipping procedure, but as BWA performs essentially the exact

same clipping procedure as GATK, this had no effect on variant calling results (not shown in tables). Also, quality recalibration had little influence on the variant calling outcome, which is explained by the fact that the 1000 Genomes Project reads available for download already come quality-recalibrated. Still, switching on quality recalibration has a small influence on GATK's results: we see a slight increase in SNV sensitivity at the cost of a slight decrease in precision. Finally, SnpStore's heterozygote correction effectively adds more false positive than true positive SNV calls. Also, the increase in genotyping precision is only marginal. All in all, SnpStore is the least precise tool for genotyping, especially for indels, where no sophisticated probabilistic model is implemented.

When taking into consideration the nature of the reference set (for SNVs computed by GATK and Samtools, and for indels computed by a method similar to GATK), it is surprising that SnpStore achieves higher SNV calling sensitivities than Samtools. However, precision is higher for Samtools, pointing to Samtools' defaults being stricter than SnpStore's. Overall, GATK is the most sensitive method for both SNVs and indels, not only with respect to the reference sets, but also considering the total number of predicted variants.

The Venn diagrams in Figure 5.3 show how the call sets of the three tools (SnpStore+R+C, GATK+R, Samtools) overlap. There is a much larger overlap between all three tools on SNVs than on indels. Samtools and SnpStore show the lowest overlap with each other, while GATK shares a high overlap with both other tools. In SNV calling, SnpStore has the highest number of calls unique to itself, whereas in indel calling it has the lowest number. Again, this indicates that SnpStore is stronger in SNV calling, where a probabilistic model is used, than in indel calling, where the simple threshold model is used. All in all, however, there is surprisingly high discordance on indels between the three tools, showing that there is room for method improvement in indel detection.

In Figure 5.4, we see the tools’ behavior at different read depths (SnpStore+R+C, GATK+R, Samtools). SNV calling reaches full sensitivity at read depth 20-30 and remains fully sensitive for


Figure 5.3: Venn diagrams comparing SNV (A) and indel (B) calling results of SnpStore (SnpStore+R+C), GATK (GATK+R) and Samtools.

higher depths, apart from some outliers where the total number of calls is low. Samtools reaches its peak sensitivity a bit more slowly than GATK and SnpStore. Indels are obviously harder to identify, with sensitivity increasing much more slowly than for SNVs, and read depths of more than 50 are required for full sensitivity. Even at high read depth, we see more outliers that show low sensitivity.

SNV calling precision is highest at depths 20-60 and decreases/destabilizes at higher depths, with Samtools consistently maintaining the highest precision. While real SNVs are unlikely to be missed at high read depth, more artifacts also accumulate there (such as amplification artifacts and mapping artifacts due to repetitive regions), leading to decreased precision. The lower precision may also be due to a lesser power to ascertain SNVs in ambiguous/repetitive regions, i.e. true variants may simply be missing from the reference set. This may especially be the case for indels, where the reference set is generally less complete. For indels, we see lower precision overall, with the highest confidence for indels called at read depths 30-50.

SNV genotyping can be done with very high confidence for read depths larger than 20. For indels, this confidence is only achieved at depths larger than 50, again indicating a stronger mapping bias toward the reference allele than for SNVs. Mostly, differences between the tools are explained by their behavior in low coverage regions (< 15). Very high coverage regions also give rise to differences (as seen in the precision plots), but low coverage accounts for a much larger fraction of the dataset, giving rise to most false positives and false negatives.

Running times

All running times are roughly of the same order of magnitude. SnpStore's runtime is around 22 min without realignment and increases by a factor of ∼3 with realignment (71 min). GATK requires approximately 15 min, plus an additional 20 min for realignment and 14 min for recalibration. Samtools takes about 28 min.

(Figure 5.4 panels: SNV and indel calling sensitivity, precision, and genotyping precision over read depth, each from 0 to 100.)

Figure 5.4: Precision and sensitivity of the three tested SNV/indel calling methods over read depth at calling positions. Results are for SnpStore with realignment and read clipping (SnpStore+R+C), GATK with realignment (GATK+R), and Samtools default (Samtools).

                     |P|    overlap dbSNP    novel and unique to one individual

Non-coding           4605   2795 (60.69%)    1468 (31.88%)
Synonymous           1108    864 (77.98%)     235 (21.21%)
Non-synonymous       1570   1018 (64.84%)     531 (33.82%)
  Missense           1438    962 (66.90%)     461 (32.06%)
  Nonsense             23     10 (43.48%)      13 (56.52%)
  In-frame indels      36     17 (47.22%)      18 (50.00%)
  Frameshift indels    51     19 (37.25%)      29 (56.86%)
  Splice-sites         22     10 (45.45%)      10 (45.45%)

Table 5.5: Small variant calling results on targeted resequencing data, split into functional categories.

5.4.4 Application to Targeted Resequencing Data

On the large-scale targeted resequencing data, SnpStore identified a total of 7,283 distinct variants within target regions. Of this non-redundant set of variants called in at least one patient, 35.6% are novel (not contained in dbSNP), whereas in the redundant set (i.e. per patient) we find on average only 17.4% novel variants. This is due to the fact that variants recurrent in the patient cohort are more likely to be in dbSNP, whereas variants unique to one patient are more likely to be novel and hence overrepresented in the non-redundant set. Table 5.5 separates the set of variants into functional categories (based on Table 2 from Kalscheuer et al. (manuscript submitted)). The majority of variants lies in non-coding regions. Focusing on variants in coding sequence only, we can distinguish between synonymous variants that do not lead to a change in amino acid and non-synonymous variants that (presumably) alter the amino acid sequence. Non-synonymous variants are less likely to be contained in dbSNP than synonymous ones. While the 1000 Genomes Project found an approximate 1:0.8 ratio of non-synonymous to synonymous variants in exon capture experiments (Durbin et al., 2010), we see non-synonymous variants at a higher frequency in our data set (1:0.7 ratio). However, as we are looking at X chromosome exons from X-linked intellectual disability patients, we expect this ratio to be higher. In total, several novel candidate genes for X-linked intellectual disability were identified (Kalscheuer et al., submitted).

5.5 Chapter summary

In this chapter, we

• developed a small variant detection tool SnpStore incorporating a read realignment step.

• introduced additional features of SnpStore such as read clipping, heterozygote variance correction in the Maq-like genotype probability calculation, and several filter criteria inspired by inspection of real data.

• applied SnpStore to simulated as well as real 1000 Genomes Project data and compared it with the two popular variant detection tools Samtools and GATK.

• applied SnpStore to large-scale targeted resequencing data of 248 X-linked intellectual disability patients.

We observed that

• SnpStore showed performance similar to GATK and Samtools.

• read realignment significantly increased the number of correctly called indels and reduced false positive SNV calls.

• SnpStore’s read clipping effectively reduced the number of false positives for SNVs as well as indels, but also decreased sensitivity.

• variance-corrected heterozygote probability calculation slightly improved sensitivity and genotyping accuracy, but also increased the number of false positive SNV calls significantly.

• on simulated data, SnpStore with RazerS mapped reads was the most sensitive tool on SNVs, while Samtools with BWA mapped reads was most sensitive and precise on indels.

• GATK showed highest sensitivity on 1000 Genomes Project data, but lowest sensitivity on simulated data.

• SNV calling was most accurate at read depths 20 to 60, while accurate indel calling required read depths > 50.

• the tested tools exhibited high agreement (overlap) on SNV calls but low agreement on indel calls.

From this we conclude that

• we developed a generic small variant calling tool that successfully employs Myers' realigner algorithm on large-scale NGS data and offers several heuristics relevant for real data.

• especially when considering the expected biases in the 1000 Genomes Project reference set and dbSNP towards GATK and Samtools, SnpStore is highly sensitive, particularly for SNVs.

• the choice of read mapper plays a key role, especially in indel detection accuracy.

• there is still much room for improvement in indel detection, as indicated by the high discordance in indel call sets.

Chapter 6

Discussion and Conclusions

Next generation sequencing has truly led to a revolution in genomics research. Not only is this visible from the large number of scientific publications (currently, Illumina lists 3360 and Roche 2380 peer-reviewed sequencing publications on their websites), but also from the broad spectrum of topics addressed, ranging from genome resequencing (Bentley et al., 2008; Wheeler et al., 2008; Durbin et al., 2010) to transcriptomics (Mortazavi et al., 2008; Sultan et al., 2008b) or metagenomics (Qin et al., 2010; Gilbert and Dupont, 2011), and from fundamental research to diagnostics (Voelkerding et al., 2009). Furthermore, the crucial role that computational analysis plays becomes obvious when looking at the multitude of software tools that are publicly available. For example, a very active NGS-centered forum (SEQanswers, 2012) currently lists 578 different tools for NGS data analysis, and this list is likely incomplete. The length of the list owes in part to the variety of NGS applications, but also to the fact that sequencing technologies are advancing rapidly and thus new tools are continuously developed. However, this also points to the importance of methods that remain applicable as sequencing characteristics change. For example, increasing read lengths and changing error rates are factors that can be taken into account early on in method development.

In this work, we addressed a broad range of key computational aspects of resequencing applications, where a reference genome sequence is known and heavily used for interpretation of the newly sequenced sample. We developed tools for read mapping (Weese et al., 2009) and benchmarking (Holtgrewe et al., 2011), for partial read mapping of small RNA reads (Emde et al., 2010) and for structural variant/indel detection (Emde et al., 2012), and finally tools for detecting and genotyping SNVs and short indels. All methods are generic in that they can handle arbitrary read lengths and variable error rates. Furthermore, they are implemented in the SeqAn library (Döring et al., 2008), making them open-source, easily available and potentially adaptable for the bioinformatics community.

Concerning semiglobal read mapping, we saw that the many published tools can be roughly divided into two generations. The earlier generation mainly makes use of the two-seed pigeonhole principle, as pioneered in Eland (Cox, unpublished) and Maq (Li et al., 2008a), and usually uses an index built on the read sequences. Later tools usually build an index of the genome


(most commonly the space-efficient FM-index) and use index searching strategies to map the reads (Langmead et al., 2009; Li and Durbin, 2009). While this later generation of tools is computationally very efficient, it is more likely to output best matches only and is thus not as robust in the face of mapping ambiguity. Our SeqAn tool RazerS, as introduced in Chapter 3, uses a quite different strategy than most other tools and is based on a q-gram counting filter. Its novelty lies in its fine-scale sensitivity switch, which allows setting a lower bound on the desired mapping sensitivity. Most other read mappers use heuristics to speed up alignment and cannot guarantee 100% sensitivity, at least not for many different error rates, read lengths, and Hamming as well as Levenshtein distance.

That most read mappers are not fully sensitive can be seen by comparing read mapping results to a gold standard, as we did with our benchmarking method Rabema in Chapter 3. Benchmarking turned out to be difficult, not only because tools have a variety of different parameters, but also because the definition of match equivalence is not trivial. This difficulty has also been realized by others (Fonseca et al., 2012). Our benchmarking results showed that the currently most popular read mappers are not fully sensitive and especially lose sensitivity for matches with more than 2 errors. Depending on the application, this can have potentially harmful effects on downstream analyses. For example, multi-read assignment methods (Hashimoto et al., 2009; Ji et al., 2011), especially important for transcript quantification in RNA-seq (Wang et al., 2009b), rely on a complete set of possible mapping positions in order to assign multi-reads to their most likely location. In variant detection, wrongly mapped multi-reads can lead to wrong variant calls.

The problem of mapping ambiguity diminishes with longer reads; in other words, as sequencing technologies advance and reads become longer, they become easier to assign. However, ambiguity in the human genome remains considerably high even with read lengths > 100 bp (Whiteford et al., 2005). Also, while sequencing error rates tend to decrease with sequencing chemistry improvements, longer reads will invariably have to be mapped with more than 2 errors. Thus, many currently popular tools will not scale well with technology advances. On the other hand, RazerS should remain sensitive and, especially in its newer, parallelized version, efficient enough to be applied to the increasing amounts of read data (Weese et al., 2012). A current limitation of the benchmark and also of RazerS is the lack of support for quality-value based mapping. However, Weese et al. (2012) show that quality values do not necessarily improve read placement, but may even decrease the mapping accuracy of variant-spanning reads.

Increasing read lengths also make partial read mapping more powerful. In Chapter 4, we saw two different applications of partial read mapping. The first one, based on read prefix mapping, is specifically designed for small RNA read mapping. As one of the first high-performance small RNA mappers at the time, our SeqAn tool MicroRazerS achieved the highest sensitivity compared to other tools. Also, in a more recent study involving different miRNA-specialized tools (Cordero et al., 2012), MicroRazerS was shown to be the most sensitive and computationally efficient small RNA mapping tool.
The prefix-based strategy is apparently an excellent approach to small RNA mapping, as it avoids errors made during adapter trimming that can prevent reads from mapping with their whole length. However, partial read mapping is especially interesting for SV detection, where reads spanning variants can only be mapped if split into two (or more) local match parts that map to the breakpoint-flanking sequences. Split mapping of these breakpoint-spanning reads

therefore carries the potential to identify SVs with base pair precision. In this work, we developed a strategy for split-read indel detection that, in contrast to other methods, works not only on paired-end but also on single-end read data. On paired-end read data, our split-mapping tool SplazerS significantly outperformed the state-of-the-art tool Pindel (Ye et al., 2009) in variant detection sensitivity. In our evaluation, we also included the recently published tool SVseq2 (Zhang et al., 2012), which uses a combination of two SV detection strategies: first, discordant paired-end reads are used to identify SV candidates, and then split mapping is performed in candidate regions. Our results indicate that this combination of read pair and split read approach works well for large deletions, but is unsuitable for small to medium-sized indels (< 50 bp), presumably because the read pairs are not classified as discordant. For these variants, the split read approach alone thus remains more powerful. Possibly a different way of combining the split-read approach with another SV detection strategy could yield a broader range of detectable variants and improved accuracy, but this remains future work.

An advantage of SplazerS over other tools is its versatility. To our knowledge, SplazerS is still the only split mapping tool designed specifically for large deletion detection in single-end read data. While computational resource requirements are greater on single-end reads, as the whole genome and not just a target region needs to be searched, SplazerS remains applicable to large-scale data sets. On the large-scale targeted resequencing single-end read data of X-linked intellectual disability patients, we observed an interesting type of variation quite rarely described in humans: gene retroposition. Of particular interest is a retrocopy of PQBP1, a gene previously linked to intellectual disability, which, if functional, could have a dosage effect. In any case, the discovery of several retrocopies, some of which seem to be polymorphic in humans, was surprising; similar findings were published by Karakoc et al. (2012) around the same time as SplazerS.

Having already used it for indel calling on SplazerS' split reads, we finally evaluated our SNV and indel calling tool SnpStore. As shown by Homer and Nelson (2010), performing a read realignment step before variant calling leads to a significant improvement in calling accuracy. We can clearly confirm this for SnpStore's realignment feature. Additionally, SnpStore incorporates a number of heuristics that were developed through close inspection of real sequencing data. For example, read clipping turned out to be an effective real-data heuristic for reducing the number of false positive variant calls. Compared to other variant callers (DePristo et al., 2011; Li, 2011), which were mostly developed as part of the 1000 Genomes Project (Durbin et al., 2010), SnpStore performs similarly and is even among the most sensitive ones for SNV calling, as observed on real as well as simulated data.

In this work, we evaluated variant calling tools by using a 1000 Genomes Project data set and overlapping our calls with 1000 Genomes Project calls and the dbSNP variant database (Sherry et al., 2001). This way of evaluating is biased, as 1000 Genomes Project variant call sets are contained in the most recent versions of dbSNP, and these calls were generated by tools that were part of the evaluation. A better way of evaluating would be to calculate agreement with SNP array variant calls.
However, first of all, SNP array data needs to be available, and secondly, this strategy only works well for SNVs and not for indels. And especially indel detection seems to require more attention: while SNV calling results are quite similar between different tools, indel results are astonishingly dissimilar (even with the same mapped read input). This indicates that further

improvement in indel variant detection is necessary.

In summary, the methods developed in the course of this work are valuable for advancing the computational analysis of NGS data. They are applicable to real-world data and, as part of the SeqAn C++ library, adhere to its coding standards. Their development took place simultaneously with other groups developing similar methods, as NGS data analysis has been a quickly advancing field. In fact, MicroRazerS, SplazerS and SnpStore were developed out of direct necessity, because there were no other established methods available at the time. In all our methods, we placed key emphasis on sensitivity, and in our evaluations we saw that, compared to other tools, our methods usually excel in this realm. Our methods furthermore aim at being compatible with different sequencing technologies and different read lengths, hence remaining compatible with advances in sequencing technology.

However, with coming technological advances (Schadt et al., 2010; Branton et al., 2008; Rothberg et al., 2011), we expect novel computational approaches to emerge. Quite likely, assembly-based methods (Li et al., 2011; Simpson and Durbin, 2012) that can reconstruct novel or difficult parts of a sample genome without the help of a reference sequence will receive more and more attention, especially with increased fragment insert sizes becoming available for read pair sequencing. In addition, we anticipate that methods for comparing multiple genomes (Darling et al., 2010; Angiuoli and Salzberg, 2011), which can then directly compare (de novo) assembled genomes, will gain importance. Further in the future, integrative approaches combining diverse sources of information, e.g. genomic, epigenetic, and transcriptomic, in a common framework promise a more sophisticated, unified view of genomics (Hawkins et al., 2010).

The wide range of high throughput sequencing applications has already enabled significant insights into fundamental genome research (Johnson et al., 2007; Green et al., 2010; Durbin et al., 2010) and especially disease diagnostics (Ng et al., 2010b; Johnston et al., 2010; Voelkerding et al., 2009). With sequencing costs decreasing, we can anticipate further great advancements, particularly in the active field of cancer genomics (International Cancer Genome Consortium et al., 2010), that will truly fuel the shift toward personalized medicine.

6.1 Outlook

Building on the SplazerS split mapping tool, we are currently working on a new approach to multi-split mapping for SV detection and for RNA-Seq read mapping. A first evaluation of this method, in the scope of a master's thesis by Kathrin Trappe, shows promising results (Trappe, 2012). The proposed method uses a different, more generic approach to detect split matches and will have much wider application, e.g. also for the alignment of assembled contigs. First, local matches of reads (or contigs) are computed using STELLAR (Kehr et al., 2011), a fully sensitive local aligner with respect to Levenshtein distance. Next, we construct a read-centric compatibility graph for each read. In this graph, local matches are nodes, and edges between nodes represent breakpoints. Exact breakpoints are computed with the dynamic programming method described in section 4.3.2. By traversing the graph from an artificial start node to an end node, we obtain chains of local matches, i.e. (multi-)split matches. Edges carry weights that represent penalties for chaining two adjacent matches. For example, if two adjacent nodes, i.e. two compatible matches according to read coordinates, map

onto two different chromosomes, they will receive a higher penalty than two collinearly mapped matches. Furthermore, if many reads support the same breakpoint, the corresponding edges will be penalized less than unsupported edges. The main difficulties in this approach concern the handling of repeat regions and the design of a sensible scoring scheme for the edge weights.

Appendix

Expected Number of Random Split Matches

The following formulae were derived together with Marcel Schulz. Consider a set of reads $R$ containing reads $r$, all of the same length $|r|$. At a given position in a random DNA sequence $s$, the probability of a split read match with minimum match length $m$, maximum number of prefix errors $e_p$, maximum number of suffix errors $e_s$, and maximum total number of errors $k = \max(\lfloor \varepsilon \cdot |r| \rfloor, e_s + e_p)$, where $\varepsilon$ denotes the error rate, can be calculated as

$$p(|r|, m, e_p, e_s, k) = \sum_{i_1=0}^{e_p} \sum_{i_2=0}^{e_s} \sum_{i=0}^{k-i_1-i_2} \binom{m}{i_1}\binom{m}{i_2}\binom{|r|-2m}{i} \left(\frac{3}{4}\right)^{i+i_1+i_2} \left(\frac{1}{4}\right)^{|r|-(i+i_1+i_2)} \qquad (6.1)$$

Equation (6.1) assumes an identical frequency of $\frac{1}{4}$ for each DNA character. We denote the number of possible breakpoint positions on the read as

cuts(|r|, m) = |r| − 2m + 1 (6.2)

In the following, we calculate the expected number of deletion-indicating matches $E_{del}$ and the expected number of insertion-indicating matches $E_{ins}$ separately. Both $E_{ins}$ and $E_{del}$ are reported to the user when running SplazerS in verbose mode.

Deletions: To compute the expected number of random matches indicating a deletion of length ≤ δ, with δ + |r| < |s|, in the read sequence, we first determine the number of possible mapping locations of the read within s. Recall that we assume independence between the mapping locations.

$$del(|r|, |s|, \delta) = \sum_{i=1}^{\delta} \left(|s| - |r| - i + 1\right) = \delta \cdot (|s| - |r| + 1) - \frac{\delta(\delta+1)}{2} \qquad (6.3)$$

To get the total number of mapping configurations, we can multiply $cuts(|r|, m)$ by $del(|r|, |s|, \delta)$. The expected number of random deletion-indicating matches for the whole set of reads then is


$$E_{del} = |R| \cdot p(|r|, m, e_p, e_s, k) \cdot cuts(|r|, m) \cdot del(|r|, |s|, \delta) \qquad (6.4)$$

Insertions: The maximum size of an insertion in the read sequence is $|r| - 2m$. An insertion of length $i$ essentially shortens the read by $i$ characters. Therefore, the number of possible insertion breakpoints on the read, given by $cuts(|r|-i, m)$, depends on $|r|-i$ rather than $|r|$. Also, the probability of a random read match now depends on $|r|-i$, and the total number of errors then is $e = \max(\lfloor \varepsilon \cdot (|r|-i) \rfloor, e_s + e_p)$. For a read indicating an insertion of length $i$, the total number of different mapping positions in the DNA sequence is $|s| - |r| - i - 1$. Hence, for the whole set of reads we expect to see $E_{ins}$ insertion-indicating matches:

$$E_{ins} = |R| \cdot \sum_{i=1}^{|r|-2m} (|s| - |r| - i - 1) \cdot p(|r|-i, m, e_p, e_s, e) \cdot cuts(|r|-i, m) \qquad (6.5)$$
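For a quick sanity check, the quantities above can be computed directly. The following C++ sketch implements equations (6.1)-(6.4) under the stated assumptions (uniform base frequencies, mismatches only); the function names are illustrative, not SplazerS source code. $E_{ins}$ follows analogously by summing over insertion lengths:

#include <cmath>

// Binomial coefficient as a double (sufficient for expectation estimates).
double binom(int n, int k)
{
    if (k < 0 || k > n) return 0.0;
    double r = 1.0;
    for (int i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;
    return r;
}

// Eq. (6.1): probability of a random split match at a fixed position.
double pSplit(int len, int m, int ep, int es, int k)
{
    double p = 0.0;
    for (int i1 = 0; i1 <= ep; ++i1)
        for (int i2 = 0; i2 <= es; ++i2)
            for (int i = 0; i <= k - i1 - i2; ++i)
            {
                int e = i + i1 + i2;  // total mismatches in this term
                p += binom(m, i1) * binom(m, i2) * binom(len - 2 * m, i)
                     * std::pow(0.75, e) * std::pow(0.25, len - e);
            }
    return p;
}

// Eq. (6.2): number of possible breakpoint positions on the read.
double cuts(int len, int m) { return len - 2 * m + 1; }

// Eq. (6.3): number of deletion mapping configurations in a sequence of length s.
double delConfigs(int len, double s, int delta)
{
    return delta * (s - len + 1) - delta * (delta + 1) / 2.0;
}

// Eq. (6.4): expected number of random deletion-indicating matches.
double eDel(double numReads, int len, double s, int m, int ep, int es, int k, int delta)
{
    return numReads * pSplit(len, m, ep, es, k) * cuts(len, m) * delConfigs(len, s, delta);
}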

Variant Comparison Tool

SeqAn tool variantComp: variantComp takes three input files: a file containing predicted variants (GFF), a file containing reference set variants (GFF), and a reference sequence (Fasta). It can compare SNVs, insertions, deletions, inversions and duplications (intrachromosomal events) and computes statistics about true positives, false positives, and false negatives, optionally for variants of different size ranges separately (SNVs are always treated separately). The most important parameters to set are

• Computation of the equivalent indel region, eir, as defined in (Krawitz et al., 2010), to account for indel/SV placement ambiguity (for indels, inversions, duplications); this requires the inserted sequence to be supplied in the input files.

• Position tolerance, allowing variant positions to be inexact by the specified number of base pairs. This is especially important if indel breakpoints are not necessarily expected to be exact, or insertion sequence is not given.

• Size tolerance, allowing predicted and reference variant to vary in size by the given percent of reference variant size.

Additionally, a file with variant size ranges for which to output statistics may be specified.

Implementation

As the most notable implementation detail, variantComp makes use of interval trees to quickly look up intersecting variants. The tool proceeds as follows:

1. File parsing: read input GFF files and reference sequence as well as range file if specified. SNVs are stored separately from other variants.

2. Compare variants (except for SNVs): for each chromosome, build an interval tree of the reference variants; query each predicted variant on this chromosome against the interval tree (allowing position tolerance/eir); for each potentially matching pair of predicted/reference variants, check whether all criteria are met.

3. Compare SNVs: for each chromosome, sort and stream over reference and predicted variants simultaneously, and check whether each called SNV matches a reference SNV.
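A minimal sketch of the streaming comparison in step 3 is shown below; the Snv struct and function name are hypothetical, not variantComp's actual types:

#include <cstddef>
#include <vector>

struct Snv { int pos; char alt; };  // position and called base

// Stream over two position-sorted SNV lists and count exact matches.
int countMatches(std::vector<Snv> const & ref, std::vector<Snv> const & pred)
{
    int tp = 0;
    std::size_t i = 0, j = 0;
    while (i < ref.size() && j < pred.size())
    {
        if (ref[i].pos < pred[j].pos)      ++i;
        else if (ref[i].pos > pred[j].pos) ++j;
        else                               // same position: compare the base
        {
            if (ref[i].alt == pred[j].alt) ++tp;
            ++i; ++j;
        }
    }
    return tp;
}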

Indel Simulator

The SeqAn tool indelSimulator was originally developed to simulate indels in haploid data only. Indels are either simulated at random positions and with random lengths (uniformly within bins) or drawn from a given GFF file of indels (as used in Chapter 4). Later, SNV simulation and diploid data simulation were added (as used in Chapter 5). IndelSimulator takes as input a reference sequence/chromosome and, optionally, a GFF file containing indel variants to draw from. Another optional input is a file describing which size ranges of indels should be simulated. If no insertion sequences are given, they are generated randomly in half of the cases, with a 1/4 probability for each nucleotide, and in the other half through duplication of the upstream sequence (emulating tandem duplication). The output is the manipulated reference sequence and a file of the implanted indels and SNVs (with coordinates with respect to the original sequence).

Implementation

Given a number of SNVs and indels to simulate, the indelSimulator tool proceeds as follows:

• File parsing: a reference sequence and optionally a GFF file containing reference indels are parsed.

• SNVs are placed randomly in a manipulated copy of the reference sequence. In the case of diploid simulation, we have two copies of the reference sequence and each SNV is applied to both copies with probability 1/3.

• A set of indels is either randomly generated or randomly drawn from the set of parsed reference variants. If multiple size ranges are specified, each range receives equal probability. For each indel, we check that it maintains a certain minimum distance from the next indel. The final list of indels is then sorted according to genome position.

• Indels are introduced one by one into the manipulated chromosome, updating the manipu- lated sequence from left to right.

• SNVs and indels are written to a file, with their positions given with respect to the original reference sequence.
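The left-to-right update can be sketched as follows; the Indel struct and the offset bookkeeping are illustrative assumptions, not indelSimulator's source code:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// pos refers to the original reference; del bases are removed, ins inserted.
struct Indel { int pos; int del; std::string ins; };

std::string applyIndels(std::string seq, std::vector<Indel> indels)
{
    std::sort(indels.begin(), indels.end(),
              [](Indel const & a, Indel const & b) { return a.pos < b.pos; });
    long ofs = 0;  // accumulated length difference to original coordinates
    for (Indel const & v : indels)
    {
        seq.replace(static_cast<std::size_t>(v.pos + ofs), v.del, v.ins);
        ofs += static_cast<long>(v.ins.size()) - v.del;
    }
    return seq;
}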

Single-end read simulation

Simulation procedure and experiment design

We performed our simulation experiments according to the following steps (taken from the supplement of (Emde et al., 2012)):

1. Choose 1000 indels from dbSNP130+DGV and implant these into chromosome 21. Indels are drawn uniformly from range buckets: [−30, −10], [−9, −1], [1, 9], [10, 30] and [31, 3000], where negative numbers indicate insertions. Each bucket has approximately 200 representatives. The "natural" distribution of indel sizes is only retained within range buckets. Note that our sampling procedure does not represent a realistic indel distribution, but gives us a sufficient sample size for testing different indel size ranges, in particular medium-sized to large indels. The maximum insertion length thus is 30 bp, the maximum deletion length 3,000 bp. Insertion sequences are generated randomly. Indels are simulated at a distance such that no read will contain two indels. Single base substitutions are added at a rate of 0.001 to simulate SNPs. This yields the manipulated

chromosome $chr_m$ and a set of reference indels $I_m$.

2. Simulate a set of single-end reads $R$ from $chr_m$, using the Mason read simulator¹ and typical Illumina sequencing error settings: an average substitution rate of 0.005 (with error probabilities increasing from the 5' to the 3' end) and an insertion and deletion rate of 0.0002 each. The total number of reads $|R|$ is chosen such that the coverage is approximately 30×. Remark: indel sequencing errors in Illumina data have been observed to be strongly biased towards long homopolymer runs (Albers et al., Genome Research, 2010). The current version of the read simulator does not take this into account, but simulates indels at constant rates independent of sequence context.

3. Map the whole set of reads onto the human reference genome (NCBI build 36) with an error rate of at most 5%. Only mismatches, i.e. Hamming distance, are allowed. To this end, we use RazerS in 100% sensitivity mode.

4. Map unmapped reads using SplazerS, setting the minimum match length to 16 bp, allowing for one error in both prefix and suffix (i.e. $e_p = e_s = 1$), and requiring an overall sequence identity of ≥ 95% ($\varepsilon$ = 0.05). The maximum gap length $\delta$ is set to 5 kb.

5. Call indels with snpStore. An indel is called whenever an absolute number of at least 2 reads and a relative number of at least 30% of the reads spanning the position support the indel event. Only unique matches are considered, i.e. only matches with single best scores. This yields a set

of predicted indels $I_p$.

6. Compare $I_m$ with $I_p$, counting the number of true positives (TP), false positives (FP) and false negatives (FN). In order to determine whether two indels are the same, we calculate the extended indel region as defined in (Krawitz et al., 2010).

¹www.seqan.de/projects/mason

SnpStore

Maq Model for genotype likelihoods of homozygotes

The following formulae are in large part taken from the Maq supplementary material. Given the two most frequent bases $X$ and $Y$ in an alignment column, we want to calculate the probabilities of the homozygotes $(X,X)$ and $(Y,Y)$. To derive these probabilities, we first need to calculate the probability of observing exactly $k$ errors in $n$ bases. We denote this probability as $\alpha_{nk}$ and

$$\alpha_{nk} = (1 - \beta_{nk}) \prod_{i=0}^{k-1} \beta_{ni} \qquad (6.6)$$

where $\beta_{nk}$ is defined as

$$\beta_{nk} := \begin{cases} P(\text{at least } k+1 \text{ errors} \mid \text{at least } k \text{ errors in } n \text{ bases}) & \text{if } k > 0 \\ P(\text{at least 1 error in } n \text{ bases}) & \text{if } k = 0 \end{cases} \qquad (6.7)$$

The probability of seeing exactly $k$ errors is the product of the probability of seeing at least $k$ errors in $n$ bases ($\prod_{i=0}^{k-1} \beta_{ni}$) and the probability of not seeing the $(k+1)$-th error ($1 - \beta_{nk}$). As $\beta_{nn} = 0$, it holds that

$$\sum_{k=0}^{n} \alpha_{nk} = \sum_{k=0}^{n} (1 - \beta_{nk}) \prod_{i=0}^{k-1} \beta_{ni} = 1 \qquad (6.8)$$
and

$$\beta_{nk} = 1 - \frac{\alpha_{nk}}{1 - \sum_{i=0}^{k-1} \alpha_{ni}} = \frac{1 - \sum_{i=0}^{k} \alpha_{ni}}{1 - \sum_{i=0}^{k-1} \alpha_{ni}} \qquad (6.9)$$

Now, based on a binomial distribution with $n$ realizations (read bases) and $k$ observations of an event (errors) that occurs with probability $\bar\varepsilon$, i.e. a random variable $X \sim B(n, \bar\varepsilon)$, the probability of seeing $k$ errors in $n$ bases is

$$\bar\alpha_{nk}(\bar\varepsilon) = \binom{n}{k} \bar\varepsilon^{\,k} (1 - \bar\varepsilon)^{n-k}$$
with uniform base error rate $\bar\varepsilon$. Furthermore:

$$\bar\beta_{nk}(\bar\varepsilon) = \frac{1 - \sum_{i=0}^{k} \bar\alpha_{ni}}{1 - \sum_{i=0}^{k-1} \bar\alpha_{ni}}$$

In real sequencing data, the error independence assumption does not hold due to amplification and mapping artifacts, which make errors accumulate. In Maq, $\beta_{nk}$ is therefore modeled by

$$\beta_{nk}(\bar\varepsilon) = \bar\beta_{nk}^{\,f_k}(\bar\varepsilon)$$

where $f_k = 0.85^k$ in practice. Thus, we have

$$\alpha_{nk}(\bar\varepsilon) = (1 - \bar\beta_{nk}^{\,f_k}(\bar\varepsilon)) \prod_{i=0}^{k-1} \bar\beta_{ni}^{\,f_i}(\bar\varepsilon) = (1 - \bar\beta_{nk}^{\,f_k}(\bar\varepsilon)) \prod_{i=0}^{k-1} \left(\frac{\bar\beta_{ni}(\bar\varepsilon)}{\bar\varepsilon}\right)^{f_i} \cdot \prod_{i=0}^{k-1} \bar\varepsilon^{\,f_i} = c_{nk}(\bar\varepsilon) \cdot \prod_{i=0}^{k-1} \bar\varepsilon^{\,f_i}$$

with

$$c_{nk}(\bar\varepsilon) = (1 - \bar\beta_{nk}^{\,f_k}(\bar\varepsilon)) \prod_{i=0}^{k-1} \left(\frac{\bar\beta_{ni}(\bar\varepsilon)}{\bar\varepsilon}\right)^{f_i}$$
Finally, individual base error probabilities are incorporated by approximating

$$\alpha_{nk}(\varepsilon_1, \ldots, \varepsilon_k; \varepsilon_{k+1}, \ldots, \varepsilon_n) = c_{nk}(\bar\varepsilon) \cdot \prod_{i=0}^{k-1} \varepsilon_{i+1}^{\,f_i} \qquad (6.10)$$
where

$$\log(\bar\varepsilon) = \frac{\sum_{i=0}^{k-1} f_i \log(\varepsilon_{i+1})}{\sum_{i=0}^{k-1} f_i} \qquad (6.11)$$
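This approximation can be evaluated directly; the following C++ sketch follows equations (6.6)-(6.11) under the stated assumptions (it only handles k ≥ 1 and is not Maq's or SnpStore's source code):

#include <cmath>
#include <vector>

// Binomial coefficient as a double.
double binomCoeff(int n, int k)
{
    double r = 1.0;
    for (int i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;
    return r;
}

// alpha_{nk} with error-dependence correction, given the error probabilities
// eps[0..k-1] of the k putative error bases, cf. eqs. (6.6)-(6.11).
double alphaNK(int n, int k, std::vector<double> const & eps)
{
    if (k < 1) return 1.0;  // degenerate case, out of scope for this sketch

    // f_i = 0.85^i and the effective uniform error rate ebar, eq. (6.11)
    std::vector<double> f(k);
    double num = 0.0, den = 0.0;
    for (int i = 0; i < k; ++i)
    {
        f[i] = std::pow(0.85, i);
        num += f[i] * std::log(eps[i]);
        den += f[i];
    }
    double ebar = std::exp(num / den);

    // binomial alpha-bar_{ni} and derived beta-bar_{ni} for rate ebar, eq. (6.9)
    std::vector<double> bbar(k + 1);
    double cum = 0.0;  // sum of alpha-bar_{nj} for j < i
    for (int i = 0; i <= k; ++i)
    {
        double abar = binomCoeff(n, i) * std::pow(ebar, i)
                      * std::pow(1.0 - ebar, n - i);
        bbar[i] = (1.0 - cum - abar) / (1.0 - cum);
        cum += abar;
    }

    // c_{nk} times the product over individual error probabilities, eq. (6.10)
    double result = 1.0 - std::pow(bbar[k], std::pow(0.85, k));
    for (int i = 0; i < k; ++i)
        result *= std::pow(bbar[i] / ebar, f[i]) * std::pow(eps[i], f[i]);
    return result;
}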

Pseudocode for Computing Heavy Lossless Shapes

For an alignment of two sequences of length $n$ with exactly $k$ mismatches, we can compute a lossless shape bestShape with optimal threshold $t_0 \geq minT$, given a minimum allowed span minSpan and a maximum allowed span maxSpan:

 1: function heaviestShape(n, k, minSpan, maxSpan, minT)
 2:   for span in minSpan : maxSpan do
 3:     shape = (1, span)  // first and last position of shape are fixed
 4:     for i in 1 : n - span + 1 do
 5:       initialize k-dimensional I_i with shape
 6:     end for
 7:     pos = 2
 8:     while (pos < span) do
 9:       tempShape = shape.insert(pos)
10:       for i in 1 : n - span + 1 do
11:         tempI_i = I_i
12:         add 1s to tempI_i for each entry indexed with pos
13:       end for
14:       M = sum(tempI_i)
15:       t = n - span + 1 - max(M)
16:       if t ≥ minT then
17:         shape = tempShape
18:         for i in 1 : n - span + 1 do
19:           I_i = tempI_i
20:         end for
21:       end if
22:       pos = pos + 1
23:     end while
24:     if (length(shape) ≥ length(bestShape)) then
25:       bestShape = shape
26:     end if
27:   end for
28:   return sort(bestShape)
29: end function

Pseudocode for RazerS, MicroRazerS and SplazerS

The following pseudocode listings give a very high-level description of the RazerS, MicroRazerS and SplazerS algorithms. We denote with $G$ our reference genome sequence and with $R$ our set of reads. We assume here that all reads have the same length readlength and denote with $k$ the maximum allowed number of errors. Other variables and functions are named such that they are self-explanatory.

RazerS

 1: function RazerS(G, R, k, maxHits)
 2:   chooseParameters(readlength, k)
 3:   S = SwiftFinder(R)
 4:   for each g ∈ G do
 5:     while (hit = nextHit(S)) do
 6:       if (match = verifyMatch(hit)) then
 7:         matches.insert(match)
 8:       end if
 9:     end while
10:   end for
11:   output matches
12: end function

Remark: Line 6 evaluates as "false" if function verifyMatch returns invalid.

1: function verifyMatch(swiftHit)
2:   r = swiftHit.read
3:   g = swiftHit.gInfix
4:   a = semiglobalAlignment(r, g)
5:   if (distance(a) ≤ k) then
6:     return a
7:   end if
8:   return invalid
9: end function

MicroRazerS

In addition to the parameters above, MicroRazerS has a prefix seed length parameter seedLen.

 1: function MicroRazerS(G, R, seedLen, k, maxHits)
 2:   chooseParameters(seedLen, k)
 3:   S = SwiftFinder(prefixes(R, seedLen))
 4:   for each g ∈ G do
 5:     while (hit = nextHit(S)) do
 6:       if (match = verifyPrefixMatch(hit, seedLen, k)) then
 7:         matches.insert(match)
 8:       end if
 9:     end while
10:   end for
11:   output matches
12: end function

 1: function verifyPrefixMatch(swiftHit, seedLen, k)
 2:   r = swiftHit.read
 3:   g = swiftHit.gInfix
 4:   seed = semiglobalAlignment(prefix(r, seedLen), g)
 5:   if (distance(seed) ≤ k) then
 6:     a = extendSeedRight(seed, 0)  // allow 0 errors in extension
 7:     return a
 8:   end if
 9:   return invalid
10: end function

SplazerS

Here, we denote the minimum match length as $m$, the maximum number of prefix errors as $e_p$, the maximum number of suffix errors as $e_s$, and the maximum distance between prefix and suffix match as $\delta$.

 1: function SplazerS(G, R, m, e_p, e_s, k, maxHits)
 2:   chooseParameters(m, e_p)
 3:   S_p = SwiftFinder(prefixes(R, m))
 4:   chooseParameters(m, e_s)
 5:   S_s = SwiftFinder(suffixes(R, m))
 6:   for each g ∈ G do
 7:     Q = empty queue
 8:     while (hit_s = nextHit(S_s)) do
 9:       clean(Q, δ)  // remove out-of-window prefix hits
10:       while ((hit_p = nextHit(S_p)) & (hit_p.pos < hit_s.pos)) do
11:         if (hit_p within allowed distance δ of hit_s) then
12:           Q.insert(hit_p)
13:         end if
14:       end while
15:       for each potential prefix match hit_p in Q do
16:         if (hit_s not yet verified) then
17:           if (!(match_s = verifySuffixMatch(hit_s, m, e_s, k))) then
18:             break
19:           end if
20:         end if
21:         if (hit_p not yet verified) then
22:           if (!(match_p = verifyPrefixMatch(hit_p, m, e_p, k))) then
23:             remove hit_p from Q
24:             continue
25:           end if
26:         end if
27:         if (match = combineMatches(match_p, match_s, k, δ)) then
28:           matches.insert(match)
29:         end if
30:       end for
31:     end while
32:   end for
33:   output matches
34: end function

 1: function verifyPrefixMatch(swiftHit, m, e_p, k)
 2:   r = swiftHit.read
 3:   g = swiftHit.gInfix
 4:   seed = semiglobalAlignment(prefix(r, m), g)
 5:   if (distance(seed) ≤ e_p) then
 6:     a = extendSeedLeft(seed, k - distance(seed))
 7:     return a
 8:   end if
 9:   return invalid
10: end function

 1: function verifySuffixMatch(swiftHit, m, e_s, k)
 2:   r = swiftHit.read
 3:   g = swiftHit.gInfix
 4:   seed = semiglobalAlignment(suffix(r, m), g)
 5:   if (distance(seed) ≤ e_s) then
 6:     a = extendSeedRight(seed, k - distance(seed))
 7:     return a
 8:   end if
 9:   return invalid
10: end function

The following function describes the breakpoint computation as described in section 4.3.2. Remember that the two vectors $f^p$ and $f^s$ store the best score for each row in the banded alignment matrices of the prefix and suffix match, respectively. The vectors $b^p$ and $b^s$ store the corresponding projected alignment end/begin position for each entry in $f^p$ and $f^s$, respectively. The function rev reverses a sequence.

1: function combineMatches(match_p, match_s, k, δ)
2:   if (match_s and match_p do not overlap) then  // or criteria from Section 4.3.2 violated
3:     return invalid
4:   end if
5:   if (match_s and match_p share largest overlap on read sequence) then
6:     seq1_p = read infix that overlaps
7:     seq2_p = reference infix aligned to seq1_p through match_p
8:     seq1_s = seq1_p
9:     seq2_s = reference infix aligned to seq1_s through match_s
10:  else if (match_s and match_p share largest overlap on reference sequence) then
11:    seq1_p = reference infix that overlaps
12:    seq2_p = read infix aligned to seq1_p through match_p
13:    seq1_s = seq1_p
14:    seq2_s = read infix aligned to seq1_s through match_s
15:  end if
16:  fixed_p = fixed alignment part of match_p
17:  fixed_s = fixed alignment part of match_s
18:  band = k - distance(fixed_p) - distance(fixed_s)
19:  (f^p, b^p) = bandedGlobalAlignment(seq1_p, seq2_p, band)
20:  (rev(f^s), rev(b^s)) = bandedGlobalAlignment(rev(seq1_s), rev(seq2_s), band)
21:  set x such that f^p[x] + f^s[x+1] is minimal
22:  if ((f^p[x] + f^s[x+1] + distance(fixed_p) + distance(fixed_s)) > k) then
23:    return invalid
24:  end if
25:  z = b^p[x]
26:  z' = b^s[x+1]
27:  set match_p end positions according to x and z
28:  set match_s begin positions according to x and z'
29:  return splitMatch(match_p, match_s)
30: end function
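The core of the breakpoint search (lines 21-26) reduces to a single scan over the two score vectors. The following is a small C++ sketch under the assumption that f^p and f^s have already been computed and brought into the same (forward) orientation, and that the budget equals k - distance(fixed_p) - distance(fixed_s); everything else here is hypothetical glue.

#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Pick the split row x minimizing fp[x] + fs[x+1]; fp[x] is the best
// prefix-alignment score ending in row x, fs[x+1] the best suffix-alignment
// score starting in row x+1. Returns (x, combined score), or x = -1 if the
// combined distance exceeds the remaining error budget.
std::pair<int, int> bestSplit(const std::vector<int>& fp,
                              const std::vector<int>& fs, int budget) {
    int bestX = -1, bestScore = INT_MAX;
    for (std::size_t x = 0; x + 1 < fp.size(); ++x) {  // fp, fs: same length
        int s = fp[x] + fs[x + 1];
        if (s < bestScore) { bestScore = s; bestX = static_cast<int>(x); }
    }
    if (bestScore > budget)
        return {-1, bestScore};  // total distance would exceed k: invalid
    return {bestX, bestScore};
}

The projected positions b^p[x] and b^s[x+1] then fix the end of the prefix match and the begin of the suffix match, exactly as in lines 25-28 above.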

Detailed program command lines

RazerS

Commands are given in the order: 1. Experiment A (Drosophila 36 bp, up to 2 Hamming errors), 2. Experiment C (Human 63 bp, up to 5 Hamming errors), and (if applicable) 3. Experiment B (Drosophila 36 bp, up to 2 Levenshtein errors).

RazerS with 100% sensitivity was run with:
razers -rr 100 -i 92 -m 1 genome.fa reads.fa -o razers.out
razers -rr 100 -i 92 -m 1 genome.fa reads.fa -o razers.out
razers -id -rr 100 -i 92 -m 1 genome.fa reads.fa -o razers.out

RazerS with 99% sensitivity was run with:
razers -rr 99 -i 92 -m 1 genome.fa reads.fa -o razers.out
razers -rr 99 -i 92 -m 1 genome.fa reads.fa -o razers.out
razers -id -rr 99 -i 92 -m 1 genome.fa reads.fa -o razers.out

SeqMap was run with:
seqmap 2 reads.fa genomeFasta seqmap.out /output_top_matches:1
seqmap 5 reads.fa genomeFasta seqmap.out /output_top_matches:1
seqmap 2 reads.fa genomeFasta seqmap.out /allow_insdel:2 /output_top_matches:1

Zoom was run with:
zoom -i reads.fa -g genome.fa -o zoom.out -mk 1 -mm 2
zoom -i reads.fa -g genome.fa -o zoom.out -mk 1 -mm 5 -sv r4
zoom -i reads.fa -g genome.fa -o zoom.out -mk 1 -ed 2

Shrimp was run with:
shrimp -m 1 -i 0 -e -100 -f -100 -g -100 -q -100 -h 34 -o 1 reads.fa genome.fa
shrimp -m 1 -i 0 -e -100 -f -100 -g -100 -q -100 -h 58 -o 1 reads.fa genome.fa
shrimp -m 1 -i -1 -e -1 -f -1 -g -1 -q -1 -h 32 -o 1 reads.fa genome.fa

Soap was run with:
soap -v 2 -a reads.fa -d genome.fa -f 2 -o soap.out
soap -v 5 -a reads.fa -d genome.fa -f 5 -o soap.out

Maq was run with:
maq map -n 2 -e 80 maq.out genome.bfa reads.bfq
maq map -n 3 -e 200 maq.out genome.bfa reads.bfq

Rabema

Let K denote the allowed number of errors. Bowtie was run with (default and improved parameterization):
bowtie --sam -q -k 100 genome.fa reads.fq out.sam
bowtie --sam -q -k 100 --maxbts 800 --seedmms 3 -y --best --maqerr 400 \
    genome reads.fq out.sam

On Illumina reads, Bwa was run with:
bwa aln -f out.sai -n K genome reads.fq
bwa samse -n 99 -f out.sam -n K genome out.sai reads.fq
and on 454 reads with:
bwa bwasw -f out.sam -n K genome.fa reads.fq

Soap2 was run with (default and improved parameterization):
soap2 -v K -a reads.fq -D genome -o out.soap
soap2 -l 60 -r 2 -v K -a reads.fq -D genome -o out.soap

Shrimp was run with:
gmapper -E -o 100 reads.fa genome.fa -H -s w16 > output.sam

More detailed information about parameterization is given in the supplementary material in (Holtgrewe et al., 2011).

MicroRazerS

MicroRazerS was run with:
micro_razers -m 20 -pa -sL 16 Hs.dna.fa reads.fa -o reads.out

SOAP2 was run with:
soap2 -I 16 -M 0 -v 20 -a reads.fa -D genomeIndex -o soap.out

Bowtie was run with:
bowtie -e 500 --strata --best -n 0 -l 16 genomeIndex reads.fq

For Bowtie and SOAP2, the outputs were filtered to keep only the longest hits with at most 20 mapping positions.

SplazerS - 1000 Genomes Data NA12878

In the 1000 Genomes Project data analysis, Pindel (version 2.2) was run with:
pindel -x 4 -u 0.05 -e 0.05 -c 22 -f chr22.fa -p pindelIn -o pindelOut

SVseq2 (version 2.0.1) was run with:

SVseq2 -c 22 -r /project/general/Genomes/Hs/GRCh37/Hs.dna.fa -b reads.bam --o delsOut
SVseq2 -insertion -c 22 -b reads.bam --o insOut

Output files containing predicted deletions and insertions were then converted to a sorted GFF file for subsequent overlap computation. SplazerS in paired-end mode was run with:
splazers -an -sm 14 -ep 1 -es 1 -i 95 -s 11011011011 -t 1 -maxG 8092 -ll 300 -le 280 \
    -m 3 chr22.fa reads.sam -o splitMapped.gff

In single-end mode, SplazerS was run with:
splazers -sm 16 -ep 1 -es 1 -i 95 -s 11011011011011 -t 1 -maxG 8092 -pc 4 -m 3 \
    -o unanchoredSplitMapped.gff Hs19.dna.fa reads.fa

Subsequent indel detection was performed with snpStore:
snpStore -mp 1 -oa -it 3 -ipt 0.5 -mc 3 -ebi 6 -id indels.gff chr22.fa \
    "splitMapped.gff unanchoredSplitMapped.gff"

SplazerS - Paired-end 100 bp NA18507

Pindel was run with:
pindel -x 4 -u 0.05 -e 0.03 -c 21 -f chr21.fa -p pindelIn -o pindelOut

SplazerS was run with:
splazers -an -sm 16 -ep 1 -es 1 -i 95 -s 11011011011011 -t 1 -maxG 8092 -ll 312 \
    -le 57 -m 3 chr21.fa reads.sam -o splitMapped.gff

For the edit distance mapping step, SplazerS was run with:
splazers -an -sm 16 -ep 1 -es 1 -i 95 -s 11111111 -t 1 -maxG 8092 -ll 312 \
    -le 57 -m 3 chr21.fa unmapped.sam -o splitMapped_edit.gff

Subsequent indel detection was performed with snpStore:
snpStore -mp 1 -oa -it 3 -ipt 0.5 -mc 3 -ebi 6 -id indels.gff chr21.fa \
    "splitMapped_edit.gff splitMapped.gff"

SplazerS - Simulation Data Analysis

SplazerS was run with:
splazers -i 95 -sm 16 -ep 1 -es 1 -maxG 5000 -m 5 -o run01.100.split.gff \
    -s 11011011011011 -t 1 Hs.dna.fa run01.100.indeled.reads.unmapped.fa

GSNAP (version 2010-07-27) was run with:
gsnap -i 0 -m 5 -y 30 -z 5000 -d NHGD_R36 -D NHGD_R36 -n 5 -t 4 -A sam \
    run01.100.indeled.reads.unmapped.fa > run01.100.gsnap.split.sam

BWA (version 0.5.8a) was run with:
bwa aln -i 20 -l 20 -k 1 -n 5 -o 1 -e 5000 -d 1000 -t 4 -R 5 \
    hg18 run01.100.indeled.reads.unmapped.fa -f run01.100.bwa.split.bam

Formats are converted to snpStore input, i.e., GFF files. For GSNAP, a match is tagged as "unique" if the 0x100 bit (secondary alignment) is not set in the SAM flag. For BWA, matches with mapping quality > 0 are tagged as unique; a small sketch of these two filters follows the command below. Only unique reads are used for subsequent indel calling. SnpStore was called with:
snpStore -it 2 -ipt 0.25 -mc 2 -id run01.100.indels.gff Hs.dna.fa \
    "run01.100.split.gff run01.100.indeled.reads.gff"
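The two uniqueness filters are trivial but worth stating precisely; this C++ sketch only restates the criteria from the text and assumes the SAM flag and mapping quality have already been parsed.

#include <cstdint>

// GSNAP: a match counts as unique if it is a primary alignment,
// i.e. the 0x100 (secondary alignment) bit of the SAM flag is unset.
bool isUniqueGsnap(std::uint16_t samFlag) { return (samFlag & 0x100) == 0; }

// BWA: a match counts as unique if its mapping quality is positive.
bool isUniqueBwa(int mappingQuality) { return mappingQuality > 0; }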

SplazerS - Targeted Resequencing Data

SplazerS was run with:
splazers -m 1 -pa -of 3 -i 95 -sm 23 -s 111001110011100111 -t 2 \
    -maxG 50000 genome.fa reads.fa -o splazers.out

Indel calling on split mapped reads was done with Perl scripts.

SnpStore - Simulation Data

RazerS mapping was done with:
razers genome.fa reads.fa -rr 100 -id -i 93 -m 2 -of 4 -o mapped_read.sam

Bwa (version 0.6.2) mapping was done with:
bwa aln genome reads.fq > aln.sai
bwa samse genome aln.sai reads.fq > mapped_reads.sam

GATK (version 1.4-37) realignment was run with:

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -I mapped_reads.bam \
    -R genome.fa -o intervals.out
java -jar GenomeAnalysisTK.jar -T IndelRealigner -I mapped_reads.bam \
    -R genome.fa -targetIntervals intervals.out -o mapped_reads.realigned.bam

GATK SNV and indel calling was run with:
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R genome.fa \
    -I mapped_reads.realigned.bam -o snvs.out
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R genome.fa \
    -I mapped_reads.realigned.bam -o indels.out -glm INDEL

Samtools (0.1.18) was run with:
samtools mpileup -ug -m 1 -F 0.002 -d8000 -f genome.fa mapped_reads.bam \
    | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D180 > variants.out

SnpStore was run (without and with realignment):
snpStore genome.fa mapped_reads.sam -o snvs.out -id indels.out -mc 2 \
    -if 1 -oa -dp 2 -eb 2 -of 1 -hq -it 2 -ipt 0.15 -mpr 10 -bsi
snpStore genome.fa mapped_reads.sam -o snvs.out -id indels.out -mc 2 \
    -re -if 1 -oa -dp 2 -eb 2 -of 1 -hq -it 2 -ipt 0.15 -mpr 10 -bsi

SNV calls were filtered to have at least read depth 3.

SnpStore - 1000 Genomes Data NA12878

SnpStore, Samtools and GATK were run with the same commands as in the previous section. Additionally, GATK quality recalibration was run after realignment with:
java -jar GenomeAnalysisTK.jar -T CountCovariates -R genome.fa \
    -I mapped_reads.realigned.bam -recalFile reads.table.recal.csv \
    -knownSites dbsnp
java -jar GenomeAnalysisTK.jar -T TableRecalibration -R genome.fa \
    -I mapped_reads.realigned.bam -recalFile reads.table.recal.csv \
    -o mapped_reads.realigned.recalib.bam

SnpStore's heterozygote correction was switched on with option "-ch". Read clipping was done with default values Q_t = 10, w = 10 and m = 3 using a Perl script.

SnpStore - Targeted Resequencing Data

SnpStore was run on the merged read data for each individual with:
snpStore -fc 10 -mc 3 -oa -mp 1 -mmq 10 -re genome.fa reads.gff

List of Figures

2.1 Genomic variation can be divided into small variants (A), and balanced (B) and unbalanced (C) structural variants. Irrelevant sequence here is shown as solid lines, whereas variable sequence segments are shown as yellow boxes. Small variants comprise single nucleotide variants (SNVs) and short indels smaller than 50 bp. Structural variants are variants larger than 50 bp. Unbalanced variants lead to a loss or gain of genetic material, whereas for balanced ones there is no such loss or gain.

2.2 Functional impact of variants on the protein level can be divided into alteration in the coding sequence (A), disruption of the gene structure (B), and altered gene dosage (C). Small variants in the coding sequence can at the same time disrupt the gene structure (shown as overlap), for example when a splice site is mutated or a frameshift mutation disrupts the reading frame.

2.3 Sequencing applications can be divided into reference-guided applications (A), where a reference sequence is known, and de-novo genome sequencing (B), where a new organism is sequenced. Probably the most common reference-guided applications are resequencing of genomic sequence (DNA-Reseq), transcriptome sequencing (RNA-Seq), and sequencing of chromatin-immunoprecipitated DNA (ChIP-Seq). The diverse reference-guided applications all employ a read mapping step, while the main computational step in de-novo sequencing is an assembly step.

2.4 A typical Illumina error profile along read positions. Figure from (Dohm et al., 2008), courtesy of Juliane Dohm.

2.5 Uniqueness of single and paired-end reads for the E. coli and H. sapiens genomes plotted over read length. Paired uniqueness is shown for a fragment size of 300 bp. Figure from (Chikhi and Lavenier, 2009), courtesy of Rayan Chikhi.

2.6 The three main read mapping-based strategies for SV detection using sequencing data. Read pair methods are able to detect all types of SVs (indicated by green boxes), read depth methods can only detect unbalanced SVs, but no novel insertions or balanced SVs (red boxes), and split read methods can theoretically identify all SV types but have limited power for novel insertions due to read length (yellow box).


3.1 Many read mapping algorithms are based on two steps: a fast filtering step that identifies potential matches and a slow alignment verification method that evaluates the potential match and classifies it as true (valid) or false (invalid).

3.2 Example of filtering techniques for k = 2. The first row shows an alignment of length 12 with Levenshtein distance 2. A) Following the pigeonhole principle there must be at least one matching contiguous (ungapped) 4-gram or B) following the q-gram counting approach with overlapping q-grams there must be at least 4 matching 3-grams. The second row shows an alignment of length 12 with Hamming distance 2. C) According to the two-seed pigeonhole principle, which uses an ungapped and two gapped shapes, there must be at least one matching (gapped or ungapped) 6-gram or D) according to the gapped q-gram counting approach there is at least one gapped 4-gram of shape ##---#-#. Note that the gapped shape would not tolerate insertion or deletion errors.

3.3 The q-gram index data structure. The dictionary dict provides fast lookup of q-grams in the sorted occurrence table occ.

3.4 A semiglobal alignment within a band of the dynamic programming matrix. Branches of the alignment trace (black line) indicate ambiguity among ending positions.

3.5 RazerS consists of three steps: a parameter choosing step followed by the two typical read mapping steps filtering and verification. During parameter choosing, a shape Q and threshold t that guarantee the desired mapping sensitivity are chosen. Next, in the filtering phase, a q-gram index is built and the reference sequence is scanned for potential matches using the Swift algorithm. Potential matches are then verified by an alignment method.

3.6 SWIFT filtering in overlapping parallelograms. Each parallelogram keeps a counter of matching 3-grams. The parallelogram with counter 8 holds a valid read alignment with Levenshtein distance k = 4 which contains seven 3-matches. An additional random match contributes to the corresponding parallelogram counter. Figure from (Weese et al., 2009).

3.7 Example of sensitivity as a function of threshold, given read length n = 36 with k = 2 errors (Hamming), with and without position-dependent error probabilities, for a) the ungapped 11-gram and b) a gapped shape with weight 11.

3.8 Example of a parameter file for read length 65 bp (N65 in filename) and edit distance (L for Levenshtein distance). The first column states the number of errors, the second column the shape, column three the threshold and column four the computed loss rate. Column five is used to get an estimate of the filtration efficiency and records the number of potential matches as observed on a simulation run of RazerS with the filter settings of this row.

3.9 Running time of RazerS in seconds using different filtering parameters, measured on a set of 1M 50 bp reads on human chromosome X. The weight of the q-gram has the greatest influence on running time. When t = 1 it pays off to use a q-gram with weight smaller by one if in that case t ≥ 3 (indicated by arrows).

3.10 Comparison of empirical and calculated loss rates for varying parameter settings q = 8, ..., 14 and t = 1, ..., 20. The left column shows Hamming distance, the right column Levenshtein distance results. The first row is on simulated, the second row on real Drosophila reads. The dashed line reflects the mean of the relative differences 1 − (empirical loss rate / calculated loss rate) of all calculated loss rates below the loss rate level given on the X-axis. Figure from (Weese et al., 2009).

3.11 The two cases that complicate match equivalence definition: alignment ambiguity (A) and repeats (B). They are tackled through the definitions of k-trace equivalence and neighbor equivalence. Lines represent alignment traces with distance ≤ k.

3.12 Comparison of second generation read mapping tools using the Rabema benchmarking method. Left column shows results for the all mapping problem, right column for the any-best problem. Plots in the first two rows are for Illumina reads (36 and 100 bp), and in the last row for 454 reads. Figure from (Holtgrewe et al., 2011).

4.1 Overview of the different types of partial mapping. A) Prefix-based mapping for example finds application in insertion breakpoint discovery or in mapping of small RNA reads containing 3' adapter sequence. B) Prefix-suffix mapping can identify reads that span deletions or introns in transcriptomic data. C) The most generic form of split mapping can additionally map reads spanning complex SVs or multiple introns.

4.2 Overview of the MicroRazerS algorithm. In the filtering phase, a q-gram index is built over all prefix q-grams and the reference sequence is scanned for potential prefix seed matches. These are then verified and extended during match verification.

4.3 Two examples of split read alignments, given parameters m = 7, e_p = e_s = 1 and ε = 0.1. a) is a valid alignment spanning a deletion and b) spans an insertion but is not a valid match as the error rate condition is violated.

4.4 Overview of the SplazerS algorithm. During filtering, the reference sequence is scanned with two q-gram indices: the left index storing prefix q-grams and the right index storing suffix q-grams. Whenever a potential prefix and a potential suffix match are found within the allowed distance δ, first the suffix match is verified and, if verified positively, then also the prefix match is subjected to verification. Match verification relies on a seed-and-extend approach, first verifying the minimum length prefix (suffix) and then extending maximally to the right (left). During match combination, compatible extended prefix and suffix matches are combined into a split match and the optimal breakpoint position is located.

4.5 Extended prefix and suffix match overlap on A) the read sequence if the read spans a deletion, B) on the genome sequence if the read spans an insertion.

4.6 Example of the breakpoint computation for a deletion-spanning read. The fixed prefix and suffix alignment parts are represented by solid lines (where the prefix and suffix of minimum length m are indicated by dark gray background) and the dashed lines show the overlapping part that is resolved through a banded alignment procedure (light gray). Red squares represent cells with optimal scores per row.

4.7 Example of a complex variant region: a 3 bp deletion, a SNP and a 10 bp insertion are colocated in a 65 bp window. The shown region is chr21:29025932..29026035. dbSNP accession IDs (rs numbers) are given for each variant. Reads in capital letters are mapped on the forward strand, while reads in small letters are mapped on the reverse strand. Note that the placement of both indels is ambiguous. By convention, SplazerS places the indel at the leftmost position.

4.8 Sensitivity, PPV and combined F1-measure of indel detection for increasing coverage and increasing read length. (A) Read length 100 bp. (B) Read length 125 bp. (C) Read length 150 bp. Note the different axis scaling.

4.9 Histogram of indels of sizes > 5 bp and ≤ 40 bp. Indels are more abundant in non-coding sequences. The majority of indels in coding sequences are multiples of three, i.e., codon-length.

4.10 (A) Histogram of predicted deletions ≥ 100 bp over their genomic coordinate on chromosome X. Clusters of large deletions are often due to retroposed genes, where spliced introns are missing. (B) A screenshot of the UCSC Genome Browser shows five large deletions that coincide exactly with the introns of the PQBP1 gene. The existence of this complete retrocopy of PQBP1 was confirmed by PCR.

5.1 An excerpt of an alignment of five reads (454 data) to a reference sequence. (A) Multiple sequence alignment induced by mapped reads. (B) Multiple sequence alignment after read realignment.

5.2 Overview of the SnpStore algorithm. A) Reads are parsed window by window. For each window, B) pileup correction removes likely amplification artifacts, and C) read clipping removes low-quality bases and likely sequencing errors close to read borders. D) Each separate group of reads is realigned and finally E) SNVs and indels are called.

5.3 Venn diagrams comparing SNV (A) and indel (B) calling results of SnpStore (SnpStore+R+C), GATK (GATK+R) and Samtools.

5.4 Precision and sensitivity of the three tested SNV/indel calling methods over read depth at calling positions. Results are for SnpStore with realignment and read clipping (SnpStore+R+C), GATK with realignment (GATK+R), and Samtools default (Samtools).

List of Tables

2.1 Next-generation sequencing technologies and their throughput, 2008 compared to 2012. Data collected from (Mardis, 2008a) and sequencing company websites.

3.1 Short read mapping tools with their characteristics. Extends Table 1 from (Weese et al., 2009).

3.2 Results for mapping 1 M and ∼10 M (all) reads of length 36 bp onto the Drosophila genome (Dm) allowing for up to 2 Hamming (A) or Levenshtein (B) errors, and results for mapping 1 M and ∼8 M reads of length 63 bp onto the human genome allowing for up to 5 Hamming errors (C). Soap and Maq do not support Levenshtein distance. Maq also reports matches with more errors; the total count of mapped reads including reads with more errors than allowed is shown in brackets. Table from (Weese et al., 2009).

4.1 Evaluation of small RNA mapping tools. We used a query dataset of ∼2.4 M non-redundant read sequences (length 36 bp) representing a total of ∼9.3 M reads.

4.2 Number of detected indels on the 1000 Genomes Project data set for NA12878. Pindel and SplazerS PE use anchored reads only, SplazerS PE+SE additionally uses unanchored reads. SVseq2 uses anchored paired-end reads for split mapping and read pair information. Small indels are ≤ 10 bp. Medium indels are > 10 bp, ≤ 50 bp. Large deletions are > 50 bp, ≤ 1000 bp. Large SV deletions are > 1 kb, ≤ 5 kb.

4.3 Number of detected indels by the different methods tested. Small indels are ≤ 10 bp. Medium indels are > 10 bp, ≤ 100 bp. Large indels are > 100 bp. Small and medium-sized indels were overlapped with an Illumina reference set. Large deletions were overlapped with a set of DGV and dbSNP indels > 100 bp. Percentages in brackets give the fraction of predictions that are contained in the reference set.

4.4 Sensitivity (SN) and PPV results of simulations at 30x coverage with read length 125 bp, for different indel size categories. Each category has at least 50 representatives.

4.5 Sensitivity and PPV when replacing SnpStore with Dindel on simulated data (125 bp reads, 30x). Dindel assigns a quality value to each predicted indel; when filtering out indels with quality lower than 30 (Dindel-Filtered), precision increases significantly; however, this comes at the cost of much decreased sensitivity.


4.6 Running time and memory measurements for 100,000 simulated 125 bp reads. SplazerS runs are shown for different minimum match lengths (m). BWA and GSNAP require an additional preprocessing step for index construction.

4.7 Sensitivity and PPV when varying parameter m on the simulation datasets.

5.1 Sensitivity, precision and genotyping precision results for SNV calling on the simulation data set. "SnpStore+T" uses the threshold model for SNV calling, "SnpStore" the default probabilistic model; "+R" means that realignment was performed before SNV calling. The additional "(BWA)" indicates that results are based on Bwa-mapped reads; otherwise RazerS-mapped reads were used.

5.2 Sensitivity, precision and genotyping precision results for indel calling on the simulation data set. For the legend on tool settings see Table 5.1.

5.3 SNV calling results on the NA12878 data set.

5.4 Small indel calling results on the NA12878 data set.

5.5 Small variant calling results on targeted resequencing data, split into functional categories.

Bibliography

Abyzov, A. and Gerstein, M. (2011). Age: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics, 27(5), 595–603.

Aird, D., Ross, M. G., Chen, W.-S., Danielsson, M., Fennell, T., Russ, C., Jaffe, D. B., Nus- baum, C., and Gnirke, A. (2011). Analyzing and minimizing pcr amplification bias in illumina sequencing libraries. Genome Biol, 12(2), R18.

Albers, C. A., Lunter, G., MacArthur, D. G., McVean, G., Ouwehand, W. H., and Durbin, R. (2011). Dindel: accurate indel calls from short-read data. Genome Res, 21(6), 961–973.

Albert, T. J., Molla, M. N., Muzny, D. M., Nazareth, L., Wheeler, D., Song, X., Richmond, T. A., Middle, C. M., Rodesch, M. J., Packard, C. J., Weinstock, G. M., and Gibbs, R. A. (2007). Direct selection of human genomic loci by microarray hybridization. Nat Methods, 4(11), 903–905.

Alkan, C., Coe, B. P., and Eichler, E. E. (2011). Genome structural variation discovery and genotyping. Nat Rev Genet, 12(5), 363–376.

Ameur, A., Wetterbom, A., Feuk, L., and Gyllensten, U. (2010). Global and unbiased detection of splice junctions from rna-seq data. Genome Biol, 11(3), R34.

Angiuoli, S. V. and Salzberg, S. L. (2011). Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics, 27(3), 334–342.

Anson, E. L. and Myers, E. W. (1997). ReAligner: A program for refining DNA sequence multi-alignments. pages 9–16.

Applied Biosystems (2012). Capillary sequencing systems.

Au, K. F., Jiang, H., Lin, L., Xing, Y., and Wong, W. H. (2010). Detection of splice junctions from paired-end rna-seq data by splicemap. Nucleic Acids Res, 38(14), 4570–4578.

Bailey, T. L. (2011). Dreme: motif discovery in transcription factor chip-seq data. Bioinformatics, 27(12), 1653–1659.


Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., Hall, K. P., Evers, D. J., Barnes, C. L., Bignell, H. R., et al. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), 53–9.

Bignell, G. R., Huang, J., Greshock, J., Watt, S., Butler, A., West, S., Grigorova, M., Jones, K. W., Wei, W., Stratton, M. R., Futreal, P. A., Weber, B., Shapero, M. H., and Wooster, R. (2004). High-resolution analysis of dna copy number using oligonucleotide microarrays. Genome Res, 14(2), 287–295.

Branton, D., Deamer, D. W., Marziali, A., Bayley, H., Benner, S. A., Butler, T., Di Ventra, M., Garaj, S., Hibbs, A., Huang, X., Jovanovich, S. B., Krstic, P. S., Lindsay, S., Ling, X. S., Mastrangelo, C. H., Meller, A., Oliver, J. S., Pershin, Y. V., Ramsey, J. M., Riehn, R., Soni, G. V., Tabard-Cossa, V., Wanunu, M., Wiggin, M., and Schloss, J. A. (2008). The potential and challenges of nanopore sequencing. Nat Biotechnol, 26(10), 1146–1153.

Burkhardt, S. and Kärkkäinen, J. (2001). Better filtering with gapped q-grams. In CPM '01: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, pages 73–85, London, UK. Springer-Verlag.

Burrows, M. and Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm. Technical Report 124.

Campbell, P. J., Stephens, P. J., Pleasance, E. D., O’Meara, S., Li, H., Santarius, T., Stebbings, L. A., Leroy, C., Edkins, S., Hardy, C., Teague, J. W., Menzies, A., Goodhead, I., Turner, D. J., Clee, C. M., Quail, M. A., Cox, A., Brown, C., Durbin, R., Hurles, M. E., Edwards, P. A. W., Bignell, G. R., Stratton, M. R., and Futreal, P. A. (2008). Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet, 40(6), 722–729.

Chamberlain, J. S., Gibbs, R. A., Ranier, J. E., Nguyen, P. N., and Caskey, C. T. (1988). Deletion screening of the duchenne muscular dystrophy locus via multiplex dna amplification. Nucleic Acids Res, 16(23), 11141–11156.

Chen, K., Wallis, J. W., McLellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., McGrath, S. D., Wendl, M. C., Zhang, Q., Locke, D. P., et al. (2009). Breakdancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods, 6(9), 677–681.

Chen, Y.-A., Lin, C.-C., Wang, C.-D., Wu, H.-B., and Hwang, P.-I. (2007). An optimized procedure greatly improves est vector contamination removal. BMC Genomics, 8, 416.

Chikhi, R. (2012). Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data. Ph.D. thesis, École Normale Supérieure de Cachan.

Chikhi, R. and Lavenier, D. (2009). Paired-end read length lower bounds for genome re-sequencing. BMC Bioinformatics, 10(Suppl 13), O2.

Collins, F. S., Morgan, M., and Patrinos, A. (2003). The human genome project: lessons from large-scale biology. Science, 300(5617), 286–290.

Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews, T. D., Barnes, C., Campbell, P., Fitzgerald, T., Hu, M., Ihm, C. H., Kristiansson, K., Macarthur, D. G., Macdonald, J. R., Onyiah, I., Pang, A. W. C., Robson, S., Stirrups, K., Valsesia, A., Walter, K., Wei, J., Wellcome Trust Case Control Consortium, Tyler-Smith, C., Carter, N. P., Lee, C., Scherer, S. W., and Hurles, M. E. (2010). Origins and functional impact of copy number variation in the human genome. Nature, 464(7289), 704–712.

Cordero, F., Beccuti, M., Arigoni, M., Donatelli, S., and Calogero, R. A. (2012). Optimizing a massive parallel sequencing workflow for quantitative mirna expression analysis. PLoS One, 7(2), e31630.

Cox, A. J. (2006). Eland: efficient local alignment of nucleotide data. Unpublished.

Darling, A. E., Mau, B., and Perna, N. T. (2010). progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One, 5(6), e11147.

David, M., Dzamba, M., Lister, D., Ilie, L., and Brudno, M. (2011). Shrimp2: Sensitive yet practical short read mapping. Bioinformatics.

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell, T. J., Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler, D., and Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation dna sequencing data. Nat Genet, 43(5), 491–498.

Dohm, J., Lottaz, C., Borodina, T., and Himmelbauer, H. (2008). Substantial biases in ultra-short read data sets from high-throughput dna sequencing. Nucleic Acids Res., 36.

Döring, A., Weese, D., Rausch, T., and Reinert, K. (2008). SeqAn – an efficient, generic C++ library for sequence analysis. BMC Bioinf., 9, 11.

Durbin, R. M., Abecasis, G. R., Altshuler, D. L., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., Hurles, M. E., and McVean, G. A. (2010). A map of human genome variation from population-scale sequencing. Nature, 467(7319), 1061–1073.

Emde, A.-K., Grunert, M., Weese, D., Reinert, K., and Sperling, S. R. (2010). Microrazers: rapid alignment of small rna reads. Bioinformatics, 26(1), 123–124.

Emde, A.-K., Schulz, M. H., Weese, D., Sun, R., Vingron, M., Kalscheuer, V. M., Haas, S. A., and Reinert, K. (2012). Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using splazers. Bioinformatics, 28(5), 619–627.

Ewing, B. and Green, P. (1998). Base-calling of automated sequencer traces using phred. ii. error probabilities. Genome Res, 8(3), 186–194.

Fonseca, N. A., Rung, J., Brazma, A., and Marioni, J. C. (2012). Tools for mapping high-throughput sequencing data. Bioinformatics.

Franklin, R. E. and Gosling, R. G. (1953). Molecular configuration in sodium thymonucleate. Nature, 171(4356), 740–741.

Friedländer, M. R., Chen, W., Adamidi, C., Maaskola, J., Einspanier, R., Knespel, S., and Rajewsky, N. (2008). Discovering micrornas from deep sequencing data using mirdeep. Nat Biotechnol, 26(4), 407–415.

Fu, Y. H., Pizzuti, A., Fenwick, Jr, R., King, J., Rajnarayan, S., Dunne, P. W., Dubel, J., Nasser, G. A., Ashizawa, T., and de Jong, P. (1992). An unstable triplet repeat in a gene related to myotonic muscular dystrophy. Science, 255(5049), 1256–1258.

Gilbert, J. A. and Dupont, C. L. (2011). Microbial metagenomics: beyond the genome. Ann Rev Mar Sci, 3, 347–371.

Green, R. E., Krause, J., Briggs, A. W., Maricic, T., Stenzel, U., Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M. H.-Y., Hansen, N. F., Durand, E. Y., Malaspinas, A.-S., Jensen, J. D., Marques-Bonet, T., Alkan, C., Prüfer, K., Meyer, M., Burbano, H. A., Good, J. M., Schultz, R., Aximu-Petri, A., Butthof, A., Höber, B., Höffner, B., Siegemund, M., Weihmann, A., Nusbaum, C., Lander, E. S., Russ, C., Novod, N., Affourtit, J., Egholm, M., Verna, C., Rudan, P., Brajkovic, D., Kucan, Z., Gusic, I., Doronichev, V. B., Golovanova, L. V., Lalueza-Fox, C., de la Rasilla, M., Fortea, J., Rosas, A., Schmitz, R. W., Johnson, P. L. F., Eichler, E. E., Falush, D., Birney, E., Mullikin, J. C., Slatkin, M., Nielsen, R., Kelso, J., Lachmann, M., Reich, D., and Pääbo, S. (2010). A draft sequence of the neandertal genome. Science, 328(5979), 710–722.

Greshock, J., Feng, B., Nogueira, C., Ivanova, E., Perna, I., Nathanson, K., Protopopov, A., Weber, B. L., and Chin, L. (2007). A comparison of dna copy number profiling platforms. Cancer Res, 67(21), 10173–10180.

Griffiths-Jones, S., Saini, H. K., van Dongen, S., and Enright, A. J. (2008). mirbase: tools for microrna genomics. Nucleic Acids Res, 36(Database issue), D154–D158.

Hajirasouliha, I., Hormozdiari, F., Alkan, C., Kidd, J. M., Birol, I., Eichler, E. E., and Sahinalp, S. C. (2010). Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics, 26(10), 1277–1283.

Handsaker, R. E., Korn, J. M., Nemesh, J., and McCarroll, S. A. (2011). Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet, 43(3), 269–276.

Hashimoto, T., de Hoon, M. J. L., Grimmond, S. M., Daub, C. O., Hayashizaki, Y., and Faulkner, G. J. (2009). Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using mumrescuelite. Bioinformatics, 25(19), 2613–2614.

Hawkins, R. D., Hon, G. C., and Ren, B. (2010). Next-generation genomics: an integrative ap- proach. Nat Rev Genet, 11(7), 476–486.

Heinrich, V., Stange, J., Dickhaus, T., Imkeller, P., Krüger, U., Bauer, S., Mundlos, S., Robinson, P. N., Hecht, J., and Krawitz, P. M. (2011). The allele distribution in next-generation sequencing data sets is accurately described as the result of a stochastic branching process. Nucleic Acids Res.

Herms, I. and Rahmann, S. (2008). Computing alignment seed sensitivity with probabilistic arithmetic automata. In WABI, pages 318–329.

Hollox, E. J., Huffmeier, U., Zeeuwen, P. L. J. M., Palla, R., Lascorz, J., Rodijk-Olthuis, D., van de Kerkhof, P. C. M., Traupe, H., de Jongh, G., den Heijer, M., Reis, A., Armour, J. A. L., and Schalkwijk, J. (2008). Psoriasis is associated with increased beta-defensin genomic copy number. Nat Genet, 40(1), 23–25.

Holtgrewe, M. (2010). Mason – a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut für Mathematik und Informatik, Freie Universität Berlin.

Holtgrewe, M., Emde, A.-K., Weese, D., and Reinert, K. (2011). A novel and well-defined bench- marking method for second generation read mapping. BMC Bioinformatics, 12, 210.

Homer, N. and Nelson, S. F. (2010). Improved variant discovery through local re-alignment of short-read next-generation sequencing data using srma. Genome Biol, 11(10), R99.

Hormozdiari, F., Alkan, C., Eichler, E. E., and Sahinalp, S. C. (2009). Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res, 19(7), 1270–1278.

Huang, X. and Madan, A. (1999). Cap3: A dna sequence assembly program. Genome Res, 9(9), 868–877.

Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., Scherer, S. W., and Lee, C. (2004). Detection of large-scale variation in the human genome. Nat Genet, 36(9), 949–951.

Illumina Inc. (2012). Hiseq systems.

Ingram, V. M. (1957). Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature, 180(4581), 326–328.

International Cancer Genome Consortium, Hudson, T. J., Anderson, W., Artez, A., Barker, A. D., Bell, C., Bernabé, R. R., Bhan, M. K., Calvo, F., Eerola, I., Gerhard, D. S., et al. (2010). International network of cancer genome projects. Nature, 464(7291), 993–998.

International HapMap Consortium (2003). The international hapmap project. Nature, 426(6968), 789–796.

Iyer, M. K., Chinnaiyan, A. M., and Maher, C. A. (2011). Chimerascan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics, 27(20), 2903–2904.

Ji, Y., Xu, Y., Zhang, Q., Tsui, K.-W., Yuan, Y., Norris, Jr, C., Liang, S., and Liang, H. (2011). Bm-map: Bayesian mapping of multireads for next-generation sequencing data. Biometrics, 67(4), 1215–1224.

Jiang, H. and Wong, W. H. (2008). Seqmap: mapping massive amount of oligonucleotides to the genome. Bioinformatics, 24(20), 2395–2396.

Johnson, D. S., Mortazavi, A., Myers, R. M., and Wold, B. (2007). Genome-wide mapping of in vivo protein-dna interactions. Science, 316, 1497–1502.

Johnston, J. J., Teer, J. K., Cherukuri, P. F., Hansen, N. F., Loftus, S. K., N. I. H. Intramural Sequencing Center (NISC), Chong, K., Mullikin, J. C., and Biesecker, L. G. (2010). Massively parallel sequencing of exons on the x chromosome identifies rbm10 as the gene that causes a syndromic form of cleft palate. Am J Hum Genet, 86(5), 743–748.

Jokinen, P. and Ukkonen, E. (1991). Two algorithms for approximate string matching in static texts. Lecture Notes in Comp. Sci., 520(06), 240–248.

Kalscheuer, V., Hu, H., Haas, S. A., Chelly, J., Esch, H. V., Raynaud, M., de Brouwer, A., Zemojtel, T., Weinert, S., Froyen, G., Frints, S. G., Laumonnier, F., Love, M. I., Richard, H., Emde, A.-K., Bienek, M., Jensen, C., Hambrock, M., Langnick, C., Feldkamp, M., Wissink-Lindhout, W., Lebrun, N., Castelnau, L., Shaw, M., Corbett, M. A., Gardner, A., Willis-Owen, S., Tan, C., Friend, K. L., Belet, S., van Roozendaal, K. E., Jimenez-Pocquet, M., Moizard, M.-P., Ronce, N., Sun, R., O'Keeffe, S., Chenna, R., Mysickova, A., Göke, J., Hackett, A., Field, M., Haan, E., Nelson, J., Turner, G., Baynam, G., Gillessen-Kaesbach, G., Müller, U., Steinberger, D., Budny, B., Badura-Stronka, M., Latos-Bielenska, A., Ousager, L. B., Wieacker, P., Criado, G. R., Bondeson, M.-L., Dufke, A., Cohen, M., Maldergem, L. V., Vincent-Delorme, C., Echenne, B., Simon-Bouy, B., Raymond, L., Kleefstra, T., Willemsen, M., Fryns, J.-P., Devriendt, K., Vingron, M., Wrogemann, K., Ullmann, R., Gecz, J., Tzschach, A., van Bokhoven, H., Wienker, T. F., Jentsch, T. J., Chen, W., and Ropers, H.-H. (submitted). Draining the pond: 14 novel candidate genes for x-linked intellectual disability.

Kalscheuer, V. M., Freude, K., Musante, L., Jensen, L. R., Yntema, H. G., Gécz, J., Sefiani, A., Hoffmann, K., Moser, B., Haas, S., et al. (2003). Mutations in the polyglutamine binding protein 1 gene cause x-linked mental retardation. Nat Genet, 35(4), 313–315.

Karakoc, E., Alkan, C., O’Roak, B. J., Dennis, M. Y., Vives, L., Mark, K., Rieder, M. J., Nickerson, D. A., and Eichler, E. E. (2012). Detection of structural variants and indels within exome data. Nat Methods, 9(2), 176–178.

Karlić, R., Chung, H.-R., Lasserre, J., Vlahovicek, K., and Vingron, M. (2010). Histone modification levels are predictive for gene expression. Proc Natl Acad Sci U S A, 107(7), 2926–2931.

Kehr, B., Weese, D., and Reinert, K. (2011). Stellar: fast and exact local alignments. BMC Bioinformatics, 12 Suppl 9, S15.

Kelley, D. R., Schatz, M. C., and Salzberg, S. L. (2010). Quake: quality-aware detection and correction of sequencing errors. Genome Biol, 11(11), R116.

Kent, W. (2002). Blat – the blast-like alignment tool. Genome Res., 12(4), 656–64.

Kong, Y. (2011). Btrim: a fast, lightweight adapter and quality trimming program for next- generation sequencing technologies. Genomics, 98(2), 152–153.

Korbel, J. O., Abyzov, A., Mu, X. J., Carriero, N., Cayting, P., Zhang, Z., Snyder, M., and Gerstein, M. B. (2009). Pemer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol, 10(2), R23.

Krawitz, P., Rödelsperger, C., Jäger, M., Jostins, L., Bauer, S., and Robinson, P. N. (2010). Microindel detection in short-read sequence data. Bioinformatics, 26(6), 722–729.

Kucherov, G., Noé, L., and Roytberg, M. (2005). Multiseed lossless filtration. IEEE/ACM Trans Comput Biol Bioinform, 2(1), 51–61.

Lam, H. Y. K., Mu, X. J., Stütz, A. M., Tanzer, A., Cayting, P. D., Snyder, M., Kim, P. M., Korbel, J. O., and Gerstein, M. B. (2010). Nucleotide-resolution analysis of structural variants using breakseq and a breakpoint library. Nat Biotechnol, 28(1), 47–55.

Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al., and International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860–921.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. (2009). Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol, 10(3), R25.

Lee, S., Hormozdiari, F., Alkan, C., and Brudno, M. (2009). Modil: detecting small indels from clone-end sequencing with mixtures of distributions. Nat Methods, 6(7), 473–474.

Lenski, C., Abidi, F., Meindl, A., Gibson, A., Platzer, M., Kooy, R. F., Lubs, H. A., Stevenson, R. E., Ramser, J., and Schwartz, C. E. (2004). Novel truncating mutations in the polyglutamine tract binding protein 1 gene (pqbp1) cause renpenning syndrome and x-linked mental retardation in another family with microcephaly. Am J Hum Genet, 74(4), 777–780.

Li, H. (2011). A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987–2993.

Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14), 1754–1760.

Li, H., Ruan, J., and Durbin, R. (2008a). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18(11), 1851–1858.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009a). The sequence alignment/map format and samtools. Bioinformatics, 25(16), 2078–2079.

Li, M., Ma, B., Kisman, D., and Tromp, J. (2003). Patternhunter II: highly sensitive and fast homology search. Genome Inform., 14, 164–175.

Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008b). Soap: short oligonucleotide alignment program. Bioinformatics, 24(5), 713–714.

Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., and Wang, J. (2009b). Soap2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15), 1966–1967.

Li, Y., Zheng, H., Luo, R., Wu, H., Zhu, H., Li, R., Cao, H., Wu, B., Huang, S., Shao, H., Ma, H., Zhang, F., Feng, S., Zhang, W., Du, H., Tian, G., Li, J., Zhang, X., Li, S., Bolund, L., Kristiansen, K., de Smith, A. J., Blakemore, A. I. F., Coin, L. J. M., Yang, H., Wang, J., and Wang, J. (2011). Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nat Biotechnol, 29(8), 723–730.

Lin, H., Zhang, Z., Zhang, M. Q., Ma, B., and Li, M. (2008). Zoom! zillions of oligos mapped. Bioinformatics, 24(21), 2431–2437.

Lunter, G. (2007). Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics, 23(13), i289–i296.

Lupski, J. R., Reid, J. G., Gonzaga-Jauregui, C., Rio Deiros, D., Chen, D. C. Y., Nazareth, L., Bainbridge, M., Dinh, H., Jing, C., Wheeler, D. A., McGuire, A. L., Zhang, F., Stankiewicz, P., Halperin, J. J., Yang, C., Gehman, C., Guo, D., Irikat, R. K., Tom, W., Fantin, N. J., Muzny, D. M., and Gibbs, R. A. (2010). Whole-genome sequencing in a patient with charcot-marie-tooth neuropathy. N Engl J Med, 362(13), 1181–1191.

Lyle, R., Prandini, P., Osoegawa, K., ten Hallers, B., Humphray, S., Zhu, B., Eyras, E., Castelo, R., Bird, C. P., Gagos, S., et al. (2007). Islands of euchromatin-like sequence and expressed polymorphic sequences within the short arm of human chromosome 21. Genome Res, 17(11), 1690–1696.

Maher, C. A., Kumar-Sinha, C., Cao, X., Kalyana-Sundaram, S., Han, B., Jing, X., Sam, L., Barrette, T., Palanisamy, N., and Chinnaiyan, A. M. (2009). Transcriptome sequencing to detect gene fusions in cancer. Nature, 458(7234), 97–101.

Manber, U. and Myers, E. (1993). Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5), 935–948.

Manzini, G. and Ferragina, P. (2004). Engineering a lightweight suffix array construction algorithm. Algorithmica, 40(1), 33–50.

Mardis, E. R. (2008a). The impact of next-generation sequencing technology on genetics. Trends Genet, 24(3), 133–141.

Mardis, E. R. (2008b). Next-generation dna sequencing methods. Annu Rev Genomics Hum Genet, 9, 387–402.

Medvedev, P., Stanciu, M., and Brudno, M. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nat Methods, 6(11 Suppl), S13–S20.

Mills, R. E., Pittard, W. S., Mullaney, J. M., Farooq, U., Creasy, T. H., Mahurkar, A. A., Kemeza, D. M., Strassler, D. S., Ponting, C. P., Webber, C., and Devine, S. E. (2011). Natural genetic variation caused by small insertions and deletions in the human genome. Genome Res, 21(6), 830–839.

Mitelman, F., Johansson, B., and Mertens, F. (2007). The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer, 7(4), 233–245.

Morin, R. D., O’Connor, M. D., Griffith, M., Kuchenbauer, F., Delaney, A., Prabhu, A.-L., Zhao, Y., McDonald, H., Zeng, T., Hirst, M., Eaves, C. J., and Marra, M. A. (2008). Application of massively parallel sequencing to microrna profiling and discovery in human embryonic stem cells. Genome Res., 18, 610–621.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, 5(7), 621–628.

Muntoni, F., Torelli, S., and Ferlini, A. (2003). Dystrophin and mutations: one gene, several proteins, multiple phenotypes. Lancet Neurol, 2(12), 731–740.

Myers, E. W. (1994). A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4-5), 345–374.

Myers, E. W. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic programming. JACM , 46(3), 395–415.

NCBI (2012). Genome project.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Molecular Biol., 48, 443–453.

Ng, S. B., Turner, E. H., Robertson, P. D., Flygare, S. D., Bigham, A. W., Lee, C., Shaffer, T., Wong, M., Bhattacharjee, A., Eichler, E. E., et al. (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461(7261), 272–276.

Ng, S. B., Bigham, A. W., Buckingham, K. J., Hannibal, M. C., McMillin, M. J., Gildersleeve, H. I., Beck, A. E., Tabor, H. K., Cooper, G. M., Mefford, H. C., Lee, C., Turner, E. H., Smith, J. D., Rieder, M. J., Yoshiura, K.-I., Matsumoto, N., Ohta, T., Niikawa, N., Nickerson, D. A., Bamshad, M. J., and Shendure, J. (2010a). Exome sequencing identifies mll2 mutations as a cause of kabuki syndrome. Nat Genet, 42(9), 790–793.

Ng, S. B., Buckingham, K. J., Lee, C., Bigham, A. W., Tabor, H. K., Dent, K. M., Huff, C. D., Shannon, P. T., Jabs, E. W., Nickerson, D. A., Shendure, J., and Bamshad, M. J. (2010b). Exome sequencing identifies the cause of a mendelian disorder. Nat Genet, 42(1), 30–35.

Nicolas, F. and Rivals, E. (2005). Hardness of optimal spaced seed design. In Proceedings of CPM 2005, LNCS, pages 187–209. Springer Verlag.

Nielsen, R., Paul, J. S., Albrechtsen, A., and Song, Y. S. (2011). Genotype and snp calling from next-generation sequencing data. Nat Rev Genet, 12(6), 443–451.

Ning, Z., Cox, A. J., and Mullikin, J. C. (2001). Ssaha: a fast search method for large dna databases. Genome Res, 11(10), 1725–1729.

O’Brien, S. J. (1994). Genetic and phylogenetic analyses of endangered species. Annu Rev Genet, 28, 467–489.

Orr, H. T. and Zoghbi, H. Y. (2007). Trinucleotide repeat disorders. Annu Rev Neurosci, 30, 575–621.

Park, P. J. (2009). Chip-seq: advantages and challenges of a maturing technology. Nat Rev Genet, 10(10), 669–680.

Pop, M. and Salzberg, S. L. (2008). Bioinformatics challenges of new sequencing technology. Trends Genet, 24(3), 142–149.

Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D. R., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J.-M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H. B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., MetaHIT Consortium, Bork, P., Ehrlich, S. D., and Wang, J. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464(7285), 59–65.

Rasmussen, K., Stoye, J., and Myers, G. (2005). Efficient q-gram filters for finding all epsilon-matches over a given length. In Proceedings of the Ninth Conference on Research in Computational Molecular Biology, pages 189–203. Springer.

Rausch, T., Koren, S., Denisov, G., Weese, D., Emde, A.-K., Döring, A., and Reinert, K. (2009). A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics, 25(9), 1118–1124.

Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, T. D., Fiegler, H., Shapero, M. H., Carson, A. R., Chen, W., Cho, E. K., Dallaire, S., Freeman, J. L., González, J. R., Gratacós, M., Huang, J., Kalaitzopoulos, D., Komura, D., MacDonald, J. R., Marshall, C. R., Mei, R., Montgomery, L., Nishimura, K., Okamura, K., Shen, F., Somerville, M. J., Tchinda, J., Valsesia, A., Woodwark, C., Yang, F., Zhang, J., Zerjal, T., Zhang, J., Armengol, L., Conrad, D. F., Estivill, X., Tyler-Smith, C., Carter, N. P., Aburatani, H., Lee, C., Jones, K. W., Scherer, S. W., and Hurles, M. E. (2006). Global variation in copy number in the human genome. Nature, 444(7118), 444–454.

Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T. L., Wilson, C. J., Bell, S. P., and Young, R. A. (2000). Genome-wide location and function of DNA binding proteins. Science, 290(5500), 2306–2309.

Rios, J., Stein, E., Shendure, J., Hobbs, H. H., and Cohen, J. C. (2010). Identification by whole- genome resequencing of gene defect responsible for severe hypercholesterolemia. Hum Mol Genet, 19(22), 4313–4318.

Rothberg, J. M., Hinz, W., Rearick, T. M., Schultz, J., Mileski, W., Davey, M., Leamon, J. H., Johnson, K., Milgrew, M. J., Edwards, M., Hoon, J., Simons, J. F., Marran, D., Myers, J. W., Davidson, J. F., Branting, A., Nobile, J. R., Puc, B. P., Light, D., Clark, T. A., Huber, M., Branciforte, J. T., Stoner, I. B., Cawley, S. E., Lyons, M., Fu, Y., Homer, N., Sedova, M., Miao, X., Reed, B., Sabina, J., Feierstein, E., Schorn, M., Alanjary, M., Dimalanta, E., Dressman, D., Kasinskas, R., Sokolsky, T., Fidanza, J. A., Namsaraev, E., McKernan, K. J., Williams, A., Roth, G. T., and Bustillo, J. (2011). An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475(7356), 348–352.

Rumble, S. M., Lacroute, P., Dalca, A. V., Fiume, M., Sidow, A., and Brudno, M. (2009). SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol, 5(5), e1000386.

Salmela, L. (2010). Correction of sequencing errors in a mixed set of reads. Bioinformatics, 26(10), 1284–1290.

Sanger, F., Nicklen, S., and Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12), 5463–5467.

Schadt, E. E., Turner, S., and Kasarskis, A. (2010). A window into third-generation sequencing. Hum Mol Genet, 19(R2), R227–R240.

Schmieder, R., Lim, Y. W., Rohwer, F., and Edwards, R. (2010). TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics, 11, 341.

Schwartz, S. L. and Farman, M. L. (2010). Systematic overrepresentation of DNA termini and underrepresentation of subterminal regions among sequencing templates prepared from hydrodynamically sheared linear DNA molecules. BMC Genomics, 11, 87.

SEQanswers (2012). Next-generation sequencing community forum.

Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., and Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Res, 29(1), 308–311.

Shumway, M., Cochrane, G., and Sugawara, H. (2010). Archiving next generation sequencing data. Nucleic Acids Res, 38(Database issue), D870–D871.

Simpson, J. T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Res, 22(3), 549–556.

Smeds, L. and Künstner, A. (2011). ConDeTri – a content dependent read trimmer for Illumina data. PLoS One, 6(10), e26314.

Smit, A. F. A., Hubley, R., and Green, P. (1996-2004). RepeatMasker Open-3.0. http://www.repeatmasker.org.

Smith, L. M., Sanders, J. Z., Kaiser, R. J., Hughes, P., Dodd, C., Connell, C. R., Heiner, C., Kent, S. B., and Hood, L. E. (1986). Fluorescence detection in automated DNA sequence analysis. Nature, 321(6071), 674–679.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol, 147, 195–197.

Stenson, P. D., Mort, M., Ball, E. V., Howells, K., Phillips, A. D., Thomas, N. S., and Cooper, D. N. (2009). The human gene mutation database: 2008 update. Genome Med, 1(1), 13.

Sultan, M., Schulz, M. H., Richard, H., Magen, A., Klingenhoff, A., Scherf, M., Seifert, M., Borodina, T., Soldatov, A., Parkhomchuk, D., Schmidt, D., O'Keeffe, S., Haas, S., Vingron, M., Lehrach, H., and Yaspo, M.-L. (2008). A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321(5891), 956–960.

Sun, R., Love, M. I., Zemojtel, T., Emde, A.-K., Chung, H.-R., Vingron, M., and Haas, S. A. (2012). Breakpointer: using local mapping artifacts to support sequence breakpoint discovery from single-end reads. Bioinformatics, 28(7), 1024–1025.

Suzuki, S., Yasuda, T., Shiraishi, Y., Miyano, S., and Nagasaki, M. (2011). ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information. BMC Bioinformatics, 12 Suppl 14, S7.

Trapnell, C., Pachter, L., and Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-seq. Bioinformatics, 25(9), 1105–1111.

Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. L., Rinn, J. L., and Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc, 7(3), 562–578.

Trappe, K. (2012). Multi-Split Mapping of NGS Reads for Variant Detection. Master's thesis, Freie Universität Berlin.

Turner, E. H., Lee, C., Ng, S. B., Nickerson, D. A., and Shendure, J. (2009). Massively parallel exon capture and library-free resequencing across 16 genomes. Nat Methods, 6(5), 315–316.

Tuzun, E., Sharp, A. J., Bailey, J. A., Kaul, R., Morrison, V. A., Pertz, L. M., Haugen, E., Hayden, H., Albertson, D., Pinkel, D., Olson, M. V., and Eichler, E. E. (2005). Fine-scale structural variation of the human genome. Nat Genet, 37(7), 727–732.

Valouev, A., Johnson, D. S., Sundquist, A., Medina, C., Anton, E., Batzoglou, S., Myers, R. M., and Sidow, A. (2008). Genome-wide analysis of transcription factor binding sites based on ChIP-seq data. Nat Methods, 5(9), 829–834.

Venter, J. C., ..., Reinert, K., et al. (2001). The sequence of the human genome. Science, 291(5507), 1304–1351.

Voelkerding, K. V., Dames, S. A., and Durtschi, J. D. (2009). Next-generation sequencing: from basic research to diagnostics. Clin Chem, 55(4), 641–658.

Wang, K., Singh, D., Zeng, Z., Coleman, S. J., Huang, Y., Savich, G. L., He, X., Mieczkowski, P., Grimm, S. A., Perou, C. M., et al. (2010). MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res, 38(18), e178.

Wang, W.-C., Lin, F.-M., Chang, W.-C., Lin, K.-Y., Huang, H.-D., and Lin, N.-S. (2009a). miRExpress: analyzing high-throughput sequencing data for profiling microRNA expression. BMC Bioinformatics, 10, 328.

Wang, Z., Gerstein, M., and Snyder, M. (2009b). RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1), 57–63.

Watson, J. D. and Crick, F. H. C. (1953). Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature, 171, 737–738.

Weese, D., Emde, A.-K., Rausch, T., D¨oring,A., and Reinert, K. (2009). RazerS–fast read mapping with sensitivity control. Genome Res, 19(9), 1646–1654.

Weese, D., Holtgrewe, M., and Reinert, K. (2012). RazerS 3: faster, fully sensitive read mapping. Bioinformatics, 28(20), 2592–2599.

Wheeler, D. A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., He, W., Chen, Y.-J., Makhijani, V., Roth, G. T., et al. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature, 452(7189), 872–876.

Whiteford, N., Haslam, N., Weber, G., Prügel-Bennett, A., Essex, J. W., Roach, P. L., Bradley, M., and Neylon, C. (2005). An analysis of the feasibility of short read sequencing. Nucleic Acids Res, 33(19), e171.

Wu, T. D. and Nacu, S. (2010). Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics, 26(7), 873–881.

Xie, C. and Tammi, M. T. (2009). CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 10, 80.

Yang, X., Chockalingam, S. P., and Aluru, S. (2012). A survey of error-correction methods for next-generation sequencing. Brief Bioinform.

Ye, K., Schulz, M. H., Long, Q., Apweiler, R., and Ning, Z. (2009). Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21), 2865–2871.

Yoon, S., Xuan, Z., Makarov, V., Ye, K., and Sebat, J. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res, 19(9), 1586–1592.

Zang, C., Schones, D. E., Zeng, C., Cui, K., Zhao, K., and Peng, W. (2009). A clustering approach for identification of enriched domains from histone modification ChIP-seq data. Bioinformatics, 25(15), 1952–1958.

Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 18(5), 821–829.

Zhang, J., Wang, J., and Wu, Y. (2012). An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data. BMC Bioinformatics, 13 Suppl 6, S6.

Zhang, L., Miles, M. F., and Aldape, K. D. (2003). A model of molecular interactions on short oligonucleotide microarrays. Nat Biotechnol, 21(7), 818–821.

Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., Nusbaum, C., Myers, R. M., Brown, M., Li, W., and Liu, X. S. (2008a). Model-based analysis of ChIP-seq (MACS). Genome Biol, 9(9), R137.

Zhang, Z., Berman, P., Wiehe, T., and Miller, W. (1999). Post-processing long pairwise alignments. Bioinformatics, 15(12), 1012–1019.

Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000). A greedy algorithm for aligning DNA sequences. J Comput Biol, 7(1-2), 203–214.

Zhang, Z.-F., Ruivenkamp, C., Staaf, J., Zhu, H., Barbaro, M., Petillo, D., Khoo, S. K., Borg, A., Fan, Y.-S., and Schoumans, J. (2008b). Detection of submicroscopic constitutional chromosome aberrations in clinical diagnostics: a validation of the practical performance of different array platforms. Eur J Hum Genet, 16(7), 786–792.

Selbstständigkeitserklärung (Declaration of Authorship)

I declare that I have written this thesis independently and have used no sources or aids other than those indicated.

Berlin, October 30, 2012 (Anne-Katrin Emde)

Curriculum Vitae

For reasons of data protection, the curriculum vitae is not published in the online version.
