Computational Annotation of Eukaryotic Gene Structures: Algorithms Development and Software Systems Michael Edward Sparks Iowa State University

Iowa State University Capstones, Theses and Retrospective Theses and Dissertations Dissertations 2007 Computational annotation of eukaryotic gene structures: algorithms development and software systems Michael Edward Sparks Iowa State University Follow this and additional works at: https://lib.dr.iastate.edu/rtd Part of the Bioinformatics Commons Recommended Citation Sparks, Michael Edward, "Computational annotation of eukaryotic gene structures: algorithms development and software systems" (2007). Retrospective Theses and Dissertations. 15614. https://lib.dr.iastate.edu/rtd/15614 This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Retrospective Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. Computational annotation of eukaryotic gene structures: algorithms development and software systems by Michael Edward Sparks A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Major: Bioinformatics and Computational Biology Program of Study Committee: Volker Brendel, Co-major Professor Jonathan F. Wendel, Co-major Professor Karin S. Dorman Xun Gu Xiaoqiu Huang Iowa State University Ames, Iowa 2007 Copyright c Michael Edward Sparks, 2007. All rights reserved. UMI Number: 3289422 UMI Microform 3289422 Copyright 2008 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company 300 North Zeeb Road P.O. Box 1346 Ann Arbor, MI 48106-1346 ii DEDICATION This work is dedicated to Dorothy Osborne, June 1926–March 2007. iii TABLE OF CONTENTS 1. GENERALINTRODUCTION . 1 Abstract........................................... 1 Background......................................... 2 ThesisOrganization ................................. 5 Major open source software systems developed in this work . ......... 6 CHAPTER 2. INCORPORATION OF SPLICE SITE PROBABILITY MODELS FOR NON-CANONICAL INTRONS IMPROVES GENE STRUCTURE PREDICTIONINPLANTS. 7 Abstract........................................... 7 Introduction....................................... 8 SystemsandMethods .................................. 10 Trainingdata ..................................... 10 Informationplotsandpictograms . .... 11 Splicesiteprobabilitymodels . .... 12 Comparisonofsplicesitemodels . 13 Incorporation of splice site probability models in the scoring of spliced alignments 14 Programsused .................................... 14 Testdata ....................................... 15 Splicedalignmentassays. 16 Ab initio assays.................................... 16 Identification of GC–AG introns in Arabidopsis andrice .............. 17 Gene Ontology (GO) annotation . 17 iv ResultsandDiscussion. .. .. .. .. .. .. .. .. .... 18 Informationplotsandpictograms . .... 18 Differences between species-specific models . ....... 18 Splicedalignment................................... 19 Spliced alignment detection of U12-type introns . ......... 20 Ab initio gene structure and splice site prediction for GC–AG intron containing genes ..................................... 20 Characterization of GC–AG introns within their gene structure . 21 Classification of GC–AG intron containing genes . 22 Conclusions........................................ 23 Acknowledgments.................................... 24 CHAPTER 3. MARKOV MODEL VARIANTS FOR APPRAISAL OFCODINGPOTENTIALINPLANTDNA . 37 Abstract........................................... 37 Introduction....................................... 38 MaterialsandMethods................................ 39 DataAccumulation.................................. 39 Fixed-orderMarkovModels(FO) . 40 Interpolated Markov Models (IMMs) . 41 Dynamically and partially modulating Markov models (DMMMs, PMMMs) . 44 AccountingforG+CContent . 45 TestDesign ...................................... 46 Results ........................................... 48 Discussion......................................... 49 Acknowledgments.................................... 53 Appendix .......................................... 53 CHAPTER 4. MetWAMer: EUKARYOTIC TRANSLATION INITIATIONSITEPREDICTION . 65 v Abstract........................................... 65 Background......................................... 66 The MetWAMer system ................................ 68 Datasets ....................................... 72 Stratifiedtrainingandtesting . .... 73 Testdesign ...................................... 74 Results ........................................... 75 Discussion......................................... 78 Conclusions........................................ 81 Acknowledgments.................................... 83 CHAPTER 5. EUKARYOTIC GENE STRUCTURE PREDICTION USING THE PASIF SYSTEM .......................... 94 Abstract........................................... 94 Introduction....................................... 94 MaterialsandMethods................................ 96 RandomRestartHill-Climbing . 96 Valid Local Search Operators and Computational Complexity .......... 97 Descriptive Feature Vector for Gene Structures . ........ 98 ObjectiveFunctionandClassifier . .... 100 gthXML and ToolSAQ ................................. 101 Datasets ....................................... 101 TestDesign ...................................... 102 Results ........................................... 103 Discussion......................................... 104 Acknowledgments.................................... 106 CHAPTER6. CONCLUSIONS . .. .. .. .. .. .. .. ... .. .. .. .. 112 BIBLIOGRAPHY ................................... 114 ACKNOWLEDGMENTS............................... 132 1 CHAPTER 1. GENERAL INTRODUCTION Abstract An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate auto- mated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology. The use of extrinsic (homology) information has been shown as a quite successful strategy for this task, though it is not a perfect solution, for a variety of reasons. More recently, gene prediction methods leveraging information present in syntenic genomic sequences have become favorable, though these too, have limitations. Identifying genes by inspection of genomic sequence alone thoroughly tests our theoretical understanding of the gene recognition process as it occurs in vivo, and where we encounter failure, excellent opportunities for meaningful research are revealed. Therefore, the continued development of methods not reliant on homology information—the so-called ab initio gene prediction methods—should help to more rapidly achieve a comprehensive understanding of gene content in our model organisms, at least. This thesis explores the development of novel algorithms in an attempt to advance the current state-of-the-art in gene prediction, with particular emphasis on ab initio approaches. The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and to generate novel biological knowledge with respect to plant taxa, in particular. This chapter provides the reader with a broad overview of modern, general approaches to gene annotation, with more detailed discussions of topic-specific literature being provided in the chapters that address them. Finally, the organization of this thesis and a listing of relevant software systems developed throughout, are presented. 2 Background Since the first-ever complete genome sequence, that of bacteriophage φX174, was obtained thirty years ago [1], genome sequencing technology has advanced considerably, and with it has come a tremendous volume of new sequence data in need of analysis. According to the most recent version of the Genomes On Line Database (GOLD; dated November 2007) [2], over 3,000 genome sequencing projects are at present either complete or ongoing. Such resources are helping to considerably advance both basic and applied biological sciences, and are likely to continue to do so for decades to come. The availability of a complete genome sequence is not generally sufficient for derivation of useful biological knowledge per se, as it is also necessary to identify those elements serving as phenotypic determinants, i.e., the genes. A thorough catalog of a genome’s genic content provides the modern biologist with an excellent starting point for, e.g., experimental characterization via reverse genetics approaches. Genes, which are operationally defined as complementation groups, can be identified by genetic screens and subsequently mapped in or- der to pinpoint their exact locations in a host genome. While such approaches can have their advantages, it has become far more typical to identify genes using in silico methods, which are often dramatically more time- and cost-effective. A gene can encode a variety of biological agents with functional utility in vivo, including a diverse range of RNA molecules having functional roles that are either primarily structural or catalytic, as for example with tRNAs or ribonucleases such as RNase H, respectively. A wide variety

Computational Annotation of Eukaryotic Gene Structures: Algorithms Development and Software Systems Michael Edward Sparks Iowa State University

Functional Aspects and Genomic Analysis

Gene Prediction Using Deep Learning

Complete Genome Sequence of the Hyperthermophilic Bacteria- Thermotoga Sp

Cep-2020-00633.Pdf

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D

Genes & Gene Finding

Computational Gene Finding

Scheme and Syllabus of B. Tech. Biotechnology (BT) Batch 2011 Onwards

Basics of Molecular Biology

Computational Gene Prediction

Yersinia Pestis Genes Essential Across a Narrow Or a Broad Range of Environmental Conditions Nicola J

Introduction to Ab Initio and Evidence-Based Gene Finding