Identification of Genetic Variation Using Next-Generation Sequencing

Technische Universitat¨ Munchen¨ Lehrstuhl fürBioinformatik Identification of genetic variation using Next-Generation Sequencing Sebastian Hubert Eck VollständigerAbdruck der von der FakultätWissenschaftszentrum Weihen- stephan fürErnährung,Landnutzung und Umwelt der Technischen Univer- sitätMünchen zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Univ.-Prof. Dr. H.-R. Fries Prüferder Dissertation: 1. Univ.-Prof. Dr. H-W. Mewes 2. Univ.-Prof. Dr. R. Zimmer, Ludwig-Maximilians UniversitätMünchen 3. Priv.-Doz. Dr. T. M. Strom Die Dissertation wurde am 30.07.2013 bei der Technischen UniversitätMünchen eingereicht und am 27.01.2014 durch die FakultätWissenschaftszentrum Wei- henstephan fürErnährung,Landnutzung und Umwelt angenommen. Summary The field of genetic research was drastically changed by the advent of next- generation DNA sequencing technologies during the last three years. These technologies are able to generate unparalleled amounts of DNA sequence information at a cost that is several orders of magnitude lower than standard sequencing techniques. Employing this technology allows identification of disease causative and disease associated mutations from a frequency spectrum previously not accessible with standard techniques. Even though showing great promise, researchers are also facing new challenges. In particular the handling and analysis of unprecedented amounts of data. The goal of this thesis was the conception and implementation of an auto- mated, computational analysis pipeline for next generation sequencing data in general and whole exome sequencing data in particular in order to robustly identify disease related mutations. The pipeline covers all key analysis steps including alignment to a reference genome, variant calling, quality filtering and annotation of identified variants, storage of all variants in a relational database and identification of putatively causative mutations. The pipeline is a combination of public available software and custom developed programs implemented primarily in Perl. Identification of candidate mutations for specific disorders is facilitated through preformulated database queries via a web interface. The database integrates external ressources such as dbSNP, the Human Gene Mutation Database and Polyphen functional consequences prediction in order to provide key information to select putatively causative mutations. The user may specify regions of particular interest, for instance from a previous linkage analysis, which are then highlighted during analysis. The pipeline was employed in four distinct projects to show the appli- cability for mutation identification from a broad frequency spectrum. In the following all components of the analysis pipeline and the complemen- tary database are discussed in detail. Additionally, example results of the four projects, involving the identification of disease associated variants in Leukemia, Intelectual Disability, Parkinson's Disease and Myocardial Infarc- tion, employing the pipeline are presented. 2 Zusammenfassung In den letzten drei Jahren wurde das Feld der Genetik durch die Einführung von Next-Generation Sequencing Technologien drastisch revolutioniert. Diese Technologien ermöglichen es noch nie dagewesene Mengen an DNA Sequen- zdaten zu generieren, bei Kosten, die mehrere Grössenordnungen niedriger sind als bei den bisherigen Standardtechniken. Der Einsatz dieser Techniken erlaubt die Identifikation von Krankheitsverusachenden und Kranheitsassozi- ierten Mutationen aus einem Frequenzspektrum, das bisher nicht erreichbar war. Obwohl dieser Ansatz vielversprechend ist sehen sich Wissenschafftler auch mit neuen Herausforderung konfrontiert, insbesondere der Umgang und die Analyse von beispielosen Mengen an Daten. Das Ziel dieser Arbeit war die Entwicklung und Implementierung einer automatisierten, rechnergestützenAnalysepipeline fürNext-Generation Se- quencing Daten im allgemeinen und Whole-Exome Sequencing Daten im besonderen, um krankheitsassoziierte Mutationen robust und reproduzierbar detektieren zu können.Die Pipeline deckt alle notwendigen Auswerteschritte ab: Alignment der Sequenzierdaten an ein Referenzgenom, Identifikation von Varianten, Qualitätsfilteringund Annotation der identifizierten Varianten, Abspeicherung der Varianten in einer relationalen Datenbank und Detektion von möglichen Krankheitsverusachenden Mutationen. Die Pipeline ist eine Kombination von frei verfügbarerund spezifisch entwickelter Software, die vorrangig in Perl implementiert ist. Die Detektion von kranheitsrelevanten Varianten wird durch Datenbankabfragen über ein Webinterface ermöglicht. Die Datenbank integriert externe Ressourcen wie dbSNP, die Human Gene Mutation Database und Polyphen Vorhersagen fürfunktionelle Konsequen- zen, um essentielle Informationen fürdie Selektion von kranheitsrelevanten Mutationen zur Verfügungzu stellen. Benutzer könnenoptional Regionen von besonderer Signifikanz spezifieren, zum Beispiel von vorherigen Linkage Analysen, die dann währendder Analyse besonders hervorgehoben werden. Die Analysepipeline wurde Anhand von vier Projekten eingesetzt, um die Anwendbarkeit zur Identifikation von Mutationen aus einem breiten Fre- quenzspektrum zu demonstrieren. Im Folgenden werden alle Komponen- ten der Analysepipeline und der zugehörigenDatenbank im Detail erläutert und diskutiert. Die Ergebnisse des Einsatzes der Pipeline in vier Projekten, in denen neue, krankheitsassoziierte Varianten bei Patienten mit Leukämie, geistiger Behinderung, der Parkinson Krankheit und Myokardinfarkt erfol- greich identifiziert wurden, wird beispielhaft dargestellt. 4 5 Contents 1 Introduction 12 1.1 Human Genetic Disorders . 12 1.2 Identification of Causative Mutations in Genetic Disorders . 14 1.3 Common and Rare Variants in Genetic Disorders . 16 1.4 DNA Sequencing . 19 1.5 Next-Generation Sequencing . 23 1.6 Roche/454 Sequencing . 24 1.7 SOLiD - Sequencing by Ligation . 25 1.8 Solexa Technology . 28 1.8.1 Sequencing-by-Synthesis . 28 1.8.2 Illumina Data Analysis Pipeline . 32 1.8.3 Genome Analyzer IIx . 34 1.8.4 HiSeq2000 . 35 1.9 Short-read Alignment Algorithms . 36 1.9.1 Hash table based Algorithms . 36 1.9.1.1 MAQ . 37 1.9.2 Suffix-Array-based Algorithms . 39 1.9.2.1 Burrows-Wheeler Transformation . 39 1.9.2.2 BWA . 40 1.9.2.3 SAMtools . 41 1.10 Exome Sequencing . 42 1.10.1 In-Solution Enrichment . 43 1.10.2 Applications . 45 2 Results 47 2.1 Analysis Pipeline for Next-Generation Sequencing Data . 47 2.2 Components . 48 2.2.1 Alignment . 49 6 2.2.2 Variant Calling . 50 2.2.2.1 SNV Calling . 50 2.2.2.2 Variant Filtering . 51 2.2.3 Run Statistics . 52 2.3 Exome Variant Database . 53 2.3.1 Candidate Gene identification . 54 2.3.2 Query Parameters . 54 2.3.3 Query Types . 56 2.4 Identification of Disease Causing and Disease Associated Mu- ations . 59 2.5 Somatic Mutations - Identification of recurring tumor-specific somatic mutations in acute myeloid leukemia by transcriptome sequencing. 59 2.6 Monogenic Disorders - Identification of mutations in Adaptor Protein Complex 4 proteins as cause of Intellectual Disability . 64 2.7 Common Disease I - Identification of a mutation in VPS35 causing late-onset Parkinson . 66 2.8 Common Disease II - Dysfunctional nitric oxide signaling in- creases risk of myocardial infarction . 69 3 Discussion 73 3.1 General Mutation Identification Strategies for Exome Sequenc- ing................................. 73 3.2 Accuracy of Variant Calls . 76 3.2.1 Sequencing Artifacts . 76 3.3 Enrichment Bias . 79 3.3.1 Coverage Distribution . 79 3.3.2 Incomplete Enrichment . 80 3.4 Statistical Analysis of 732 Exomes . 81 3.4.1 Average Exome Statistics . 81 3.4.2 Variant Distribution per Gene . 84 3.4.3 Comparison with Gene Mutation Database HGMD . 85 3.4.3.1 Carrier Burden and Identification of Litera- ture Misannotations . 87 4 Outlook 89 4.1 Third-Generation Sequencing . 89 4.1.1 Single-Molecule Real-Time Sequencing . 90 7 4.1.2 Nanopore Sequencing . 94 4.1.3 Sequencing by microscopy techniques. 97 4.1.4 Complete Genomics . 98 4.1.5 Ion Torrent . 100 4.1.6 Clinical Diagnostics and Personalized Medicine . 102 4.1.7 Summary and Conclusion of Next-Generation Sequenc- ing Instruments . 103 4.2 Appendix - Manuscripts . 106 4.2.1 Identification of recurring tumor-specific somatic mutations in acute myeloid leukemia by transcriptome sequencing. 106 4.2.2 Adaptor Protein Complex 4 Deficiency Causes Severe Autosomal-Recessive Intellectual Disability, Progres- sive Spastic Paraplegia, Shy Character, and Short Stature114 4.2.3 A Mutation in VPS35, Encoding a Subunit of the Retromer Complex, Causes Late-Onset Parkinson Disease . 123 4.2.4 Complete List of Publications . 132 8 List of Tables 1.1 Sequencing Throughput Development of the Illumina Platform 35 1.2 Sequence Alignment Algorithms . 36 1.3 BWT of string .ANANAS . 40 1.4 Comparison of exome enrichment platforms . 43 2.1 Individual Pipeline Programs and Scripts . 49 2.2 Confirmed tumor-specific mutations . 63 2.3 Stepwise variant filtering for two PD cases . 67 2.4 Occurence of mutations in the MI extended pedigree . 70 3.1 Mutation Identification Strategies . 74 3.2 Sequencing artifact filter . 78 3.3 Average exome statistics . 83 3.4 Average variant statistics . 84 4.1 Summary of Sequencing Instuments Performance . 105 9 List of Figures 1.1 Manhatten Plot . 16 1.2 Spectrum of Genetic Disorders . 19 1.3 Development of DNA sequencing cost . 22 1.4 Developmet of whole genome sequencing cost . 23 1.5 454 Sequencing

Identification of Genetic Variation Using Next-Generation Sequencing

DNA Sequencing

Targeted Genes and Methodology Details for Neuromuscular Genetic Panels

Preclinical Research 2020 Sections

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Conserved and Novel Properties of Clathrin-Mediated Endocytosis in Dictyostelium Discoideum" (2012)

DNA Sequencing - I

NICU Gene List Generator.Xlsx

Supplementary Information

AP4S1 Splice-Site Mutation in a Case of Spastic Paraplegia Type 52 with Polymicrogyria

Cldn19 Clic2 Clmp Cln3

Deep Multiomics Profiling of Brain Tumors Identifies Signaling Networks

DNA Sequencing  Genetically Inherited Diseases