Computational Gene Structure Prediction

Computational Gene Structure Prediction Dissertation zur Erlangung des akademischen Grades Dr. rer. nat. an der Fakultat¨ fur¨ Mathematik, Informatik und Naturwissenschaften der Universitat¨ Hamburg eingereicht beim Fach-Promotionsausschuss Informatik von Gordon Gremme aus Wesel August 2012 Gutachter: Prof. Dr. Stefan Kurtz, Universitat¨ Hamburg Prof. Dr. Wolfgang Menzel, Universitat¨ Hamburg Prof. Dr. Volker Brendel, Indiana University Bloomington Tag der Disputation: 15. Mai 2013 Dedicated to my mother. I miss you. 1 Acknowledgments I want to thank Stefan Kurtz for his guidance and encouraging support over the years. He never lost his patience with me, for what I am very grateful. Furthermore, I want to thank Volker Brendel for his helpful explanations concerning gene structure prediction and the nice stays with him in Ames, Iowa. I also want to thank Volker’s group for their hospitality and interesting discussions, in particular Michael Sparks for writing an XML parser for GenomeThreader and showing me around Ames. At the Center for Bioinformatics (ZBH) I want to thank all my colleagues for the great time I had while being there, it was an inspiring experience. In particular, I want to thank Ute Willhoft¨ for many helpful discussions on biology, Sascha Steinbiß for the good and fruitful collaboration, and Karin Lundt for her uplifting support. From Matthias Rarey’s group I especially want to thank Patrick Maaß for our countless discussions and for what he taught me about programming. It meant a lot to me. I also want to thank him and Jorg¨ Degen, Axel Griewel, Juri Parn,¨ Ingo Reulecke, Ingo Schellhammer, Jochen Schlosser, and Katrin Stierand for the fun times playing tabletop soccer together. Last but not least I want to thank my friends, my family, and my loved ones who helped writing this dissertation with their knowledge, their patience, and their love. You are too many to thank individually here, but I hope you know who you are. I will be forever in your debt and all I can offer is a sincere “Thank you!”. 2 Abstract Modern molecular biology research is characterized by the availability of an increasing amount of biological data which is often fuzzy due to the nature of the experimental methods used to derive it. Bioinformatics, a branch of computer science, deals with the storage, retrieval, and analysis of this data. DNA, the basic information carrier of life, is now sequenced industrially in large quantities and assembled to complete genomes. The automatic annotation of genes in these genomes, a process called computational gene structure prediction, is the scope of this thesis. This dissertation describes the computational gene structure prediction software Genome- Threader which uses homologous biological sequences (so-called cDNAs/ESTs and/or protein sequences) to predict gene structures by computing spliced alignments. GenomeThreader uses a multi-phase approach, filtering the possibly very large sequence data sets in early phases to obtain promising gene candidates which are then refined by more computationally expensive algorithms in later phases. The results of this gene structure predictions, genome annotations, can become quite large and cumbersome to process. To deal with such annotations easily and efficiently, the GenomeTools genome analysis system has been developed, which is also described in this thesis. The prediction quality of GenomeThreader was evaluated on a variety of datasets and the results show that the software performs very well on common gene structure prediction tasks. The quality of the results is comparable with the results of the best other programs and in some cases it is even better. The software is very easy to use due to its integrated nature, a feature which distinguishes it from its competitors. GenomeThreader has been adopted widely in the scientific community, it has approx. 150 users world-wide and over 30 publications cite the scientific article which describes an earlier version of the software. The open source package GenomeTools was used as a foundation for 10 published sequence and annotation processing tools. 3 Kurzfassung Moderne molekularbiologische Forschung ist durch die Verfugbarkeit¨ stetig wachsender Daten- mengen charakterisiert. Diese Daten sind auf Grund der experimentellen Methoden, die sie erzeugen, oftmals fehlerbehaftet. Die Bioinformatik, ein Teilbereich der Informatik, beschaftigt¨ sich damit molekularbiologische Daten zu speichern, abzurufen und zu analysieren. DNA, der grundlegende Informationstrager¨ des Lebens, wird heutzutage im industriellen Maßstab sequen- ziert und zu kompletten Genomen zusammengefugt.¨ Diese Disseration befasst sich mit der au- tomatische Annotation von Genen in vollstandig¨ sequenzierten Genomen, ein Prozess der rech- nergestutzte¨ Genstrukturvorhersage genannt wird. Diese Dissertation beschreibt die Methoden und Techniken, die die Grundlage der Gen- strukturvorhersagesoftware GenomeThreader bilden. GenomeThreader benutzt homologe biol- ogische Sequenzen (sogenannte cDNA/EST und/oder Proteinsequenzen) und berechnet Spliced Alignments, die Genstrukturen beschreiben. Zur Vorhersage der Genstrukturen wird ein mehr- phasiger Ansatz benutzt. Dabei werden die unter Umstanden¨ sehr großen Sequenzdatenmengen in fruhen¨ Phasen auf vielversprechende Genkandidaten reduziert, die dann in spateren¨ Phasen durch rechenaufwendigere Algorithmen verfeinert werden. Die Resultate dieser Genstruktur- vorhersagen, die Genomannotationen, konnen¨ sehr umfangreich werden und aufwendige Schritte der Weiterverarbeitung erfordern. Um mit solchen Annotationen einfach und effizient umgehen zu konnen,¨ wurde das GenomeTools Genomanalysesystem entwickelt, das ebenfalls in dieser Arbeit beschrieben wird. Die Vorhersagequalitat¨ von GenomeThreader wurde auf verschiedenen Datensatzen¨ evaluiert. Es zeigt sich, dass GenomeThreader fur¨ die ublichen¨ Genvorhersageaufgaben sehr gute Ergeb- nisse liefert. Die Qualitat¨ der Ergebnisse ist vergleichbar mit den Ergebnissen der besten anderen Programme und teilweise sogar besser. Durch die gelungene Integration der einzelnen Phasen ist die Software sehr einfach zu benutzen, eine Eigenschaft, die sie von ihren Wettbewerbern unter- scheidet. GenomeThreader hat weite Verbreitung in der Wissenschaftsgemeinde gefunden. Es gibt ca. 150 Nutzer weltweit und 30 Publikationen zitieren den wissenschaftlichen Artikel, der eine fruhe¨ Version der Software beschreibt. Das quelloffene Softwarepaket GenomeTools diente als Grundlage fur¨ 10 weitere publizierte Werkzeuge zur Sequenz- und Annotationsverarbeitung. 4 Contents 1 Introduction 16 1.1 Background . 16 1.2 Contributions . 17 1.3 Structure of this Thesis . 18 2 Biology Background 20 2.1 The Science of Life . 20 2.2 Prokaryotes and Eukaryotes . 20 2.3 Basic Macromolecules . 21 2.3.1 Nucleic Acids . 21 2.3.2 Proteins . 22 2.4 Information Carrier DNA . 23 2.5 Single Pieces of Information: Genes . 23 2.6 Flow of Information: From a Gene to a Protein . 23 2.6.1 Splicing . 24 2.6.2 Translation . 25 2.7 What is an EST? . 25 2.7.1 Construction of ESTs . 26 2.8 Next-Generation Sequencing Methods . 26 2.8.1 RNA-Seq . 28 2.9 What is a Gene? Attempting a Definition . 29 2.9.1 Gene as a Heredity Unit . 29 2.9.2 Gene as a Distinct Locus . 29 2.9.3 Gene as a Protein Blueprint . 29 2.9.4 Gene as a Physical Molecule . 29 2.9.5 Gene as Transcribed Code . 30 2.9.6 Gene as ORF Sequence Pattern . 30 2.9.7 Gene as Annotated Genomic Entity Stored in a DB (pre-ENCODE) . 31 2.9.8 Problems with the Current Definition . 31 2.9.9 Experience of the ENCODE Project . 32 2.9.10 Gene as Dispersed Genome Activity (ENCODE) . 34 5 3 Computational Background 37 3.1 Basic Definitions . 37 3.2 Gene Prediction Categories . 41 3.2.1 Ab initio Methods . 41 3.2.2 Comparative Methods . 41 3.2.3 Homology methods . 42 3.2.4 Combiners . 43 3.3 Measures of Prediction Accuracy . 43 3.3.1 Nucleotide Level . 43 3.3.2 Exon Level . 45 3.3.3 Gene Level . 47 3.4 Prediction Accuracy . 48 3.5 Related Work: Ab initio Methods . 49 3.5.1 GENSCAN ................................. 49 3.5.2 AUGUSTUS ................................ 49 3.5.3 mGene ................................... 49 3.6 Related Work: Comparative Methods . 49 3.6.1 TWINSCAN ................................ 49 3.7 Related Work: Homology Methods . 50 3.7.1 GMAP ................................... 50 3.7.2 EuGENE´ .................................. 50 3.8 Related Work: Combiners . 50 3.8.1 JIGSAW .................................. 50 3.8.2 Evigan ................................... 51 4 GenomeThreader Gene Prediction Software 52 4.1 The Computational Problem . 52 4.1.1 Basic Notions . 54 4.1.2 The Spliced Alignment Problem for cDNA/EST Sequences . 54 4.1.3 The Spliced Alignment Problem for Protein Sequences . 57 4.2 Easy-to-Use Bayesian Splice Site Models (BSSMs) . 59 4.3 Computing Optimal Spliced Alignments with ESTs . 59 4.4 Computing Optimal Spliced Alignments with Proteins . 60 4.5 The Intron Cutout Technique . 63 4.5.1 Computing cDNA/EST Matches . 64 4.5.2 Chaining the cDNA/EST Matches . 64 4.5.3 Computing and Chaining Protein Matches . 66 4.5.4 Chain Enrichment . 67 4.5.5 The Cutout Step . 69 4.6 Jump Tables . 69 4.7 Computing Consensus Spliced Alignments . 73 4.8 GenomeThreader Implementation . 77 4.8.1 Fast Matching for Filtering of Exon Candidates . 79 6 4.8.2 Chaining . 80 4.8.3 Representation of BSSMs . 80 4.8.4 Dynamic Programming . 81 4.8.5 Representation of Spliced Alignments . 82 4.8.6 Consensus Spliced Alignments . 83 4.8.7 Output of Spliced Alignments . 84 4.8.8 Incremental Updates . 84 4.8.9 Software Development Tools . 85 4.8.10 Test Strategy . 85 4.8.11 Practical Applications . 86 5 GenomeTools Genome Analysis Software 88 5.1 Basic Notions . 90 5.2 The Generic Feature Format Version 3 (GFF3) . 91 5.2.1 Meta Lines . 91 5.2.2 Comment Lines . 92 5.2.3 Feature Lines . 92 5.2.4 Termination Lines . 93 5.2.5 Example GFF3 File . 94 5.3 GenomeTools Overview . 94 5.4 Representing GFF3 Files with Genome Nodes . 96 5.5 Processing Genome Nodes with Node Streams and Node Visitors . 99 5.5.1 Sorted Streams .

Load more