Plant Genome Informatics: Evaluation and Analysis of Genomic DNA Features Involved in the Transcriptional Processing of Protein
Total Page:16
File Type:pdf, Size:1020Kb
Iowa State University Capstones, Theses and Retrospective Theses and Dissertations Dissertations 2006 Plant genome informatics: evaluation and analysis of genomic DNA features involved in the transcriptional processing of protein coding genes Shannon Dwayne Schlueter Iowa State University Follow this and additional works at: https://lib.dr.iastate.edu/rtd Part of the Bioinformatics Commons, Genetics Commons, and the Molecular Biology Commons Recommended Citation Schlueter, Shannon Dwayne, "Plant genome informatics: evaluation and analysis of genomic DNA features involved in the transcriptional processing of protein coding genes " (2006). Retrospective Theses and Dissertations. 1866. https://lib.dr.iastate.edu/rtd/1866 This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Retrospective Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. Plant genome informatics: Evaluation and analysis of genomic DNA features involved in the transcriptional processing of protein coding genes by Shannon Dwayne Schlueter A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Major: Bioinformatics and Computational Biology Program of Study Committee: Volker Brendel, Co-major Professor Randy C. Shoemaker, Co-major Professor Xiaoqiu Huang Thomas Peterson Xun Gu Iowa State University Ames, Iowa 2006 Copyright © Shannon Dwayne Schlueter, 2006. All rights reserved. UMI Number: 3243822 UMI Microform 3243822 Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company 300 North Zeeb Road P.O. Box 1346 Ann Arbor, MI 48106-1346 ii Dedicated to: Jessica, Harmon, and Elyse My love, My pride, My inspiration iii TABLE OF CONTENTS CHAPTER 1. GENERAL INTRODUCTION Introduction 1 Dissertation Organization 4 References 5 CHAPTER 2. XGDB: OPEN-SOURCE COMPUTATIONAL INFRASTRUCTURE FOR THE INTEGRATED EVALUATION AND ANALYSIS OF GENOME FEATURES Abstract 9 Rationale 9 Features and Capabilities 12 xGDB Internals 19 Conclusions 25 xGDB Software Requirements 25 xGDB Support 25 Acknowledgments 26 References 26 CHAPTER 3. COMMUNITY-BASED GENE STRUCTURE ANNOTATION Abstract 36 When is a genome finished? 37 Arabidopsis genome annotation 39 Quality assessment of predicted gene structures 41 Community-based annotation 44 Outreach 45 Conclusions 46 Acknowledgments 46 References 47 Appendix: 53 iv CHAPTER 4. TSIP: TRANSCRIPTION START SITE IDENTIFICATION IN PLANTS Abstract 57 Introduction 57 Results 59 Discussion 61 Methods 65 Acknowledgments 70 References 70 CHAPTER 5. GENERAL CONCLUSIONS Conclusions 84 ACKNOWLEDGMENTS 86 1 CHAPTER 1. GENERAL INTRODUCTION Introduction Genome sequencing efforts in the plant kingdom have produced an astounding amount of sequencing information in the last decade. Arabidopsis thaliana (TAGI, 2000) and Oryza sativa (Goff et al., 2002; Yu et al., 2002) were the first major plant genomes to be sequenced. However, genome sequencing efforts are now underway in Zea mays , Medicago truncatula , Lotus japonicus , Lycopersicon esculentum , Manihot esculenta , Populus trichocarpa , and most recently Glycine max . These projects are amassing a tremendous sequence resource for investigation of a wide range of biological questions. The computational tools to both manage and interpret this data, however, are often difficult to apply to hypothesis driven research. Furthermore, many of these tools are designed for use with data from Human or other vertebrate sequencing projects and have limited effectiveness for plant genome analysis. In order to provide a sound infrastructure for analyzing the various sources of genomic data available in the plant sciences and for interpreting the results of computational tools applied to this data, the xGDB resource has been created. The first implementation of the xGDB structure was applied to the model plant Arabidopsis thaliana . Spliced alignments of EST, cDNA and the annotations of the Arabidopsis genome were parsed and imported into a MySQL relational database. An elaborate web interface was designed for the database to allow users to browse the genome and query the database by sequence similarity, identifiers, or description 2 (http://www.plantgdb.org/AtGDB/ ). In general, the web interface is composed of three parts: the genomic context view, the query view, and the sequence view. The genomic context view allows users to browse a specific genomic region in the context of multiple annotation resources. The region graphic displays multiple sources of alignment information relative to one another. The query view allows users to view and interact with the results of a user query. Stored EST/cDNA alignments and annotated transcripts each have an individual page, the sequence view, which brings together sequence data, analysis tools, and related external links. The web interface efficiently presents the database entries on the fly and facilitates data access and utilization. This system has been used in the analysis of gene annotation quality, 5’- and 3’- UTRs, non-canonical splicing, U12-specific splicing, alternative splicing, abnormal intron and exon sizes, and conserved homologous sequences. (Zhu et al., 2003; Schlueter et al., 2003; Schlueter et al., submitted). This system was integrated into the PlantGDB framework (Dong et al., 2004; Dong et al., 2005) and now provides information and tools for Arabidopsis , rice, maize, Medicago , Lotus , Populus , tomato, soybean, Brassica , wheat, and sorghum. In addition, this system has been used in conjunction with the yrGATE gene-structure annotation tool (Wilkerson et al., 2006) to annotate homeologous soybean BAC sequences (Schlueter et al., 2006a; Schlueter et al., 2006b). Albeit the current methods of gene structure annotation are vastly more accurate and provide more complete coverage than those employed less than five years ago, the support of annotations on a per-gene basis differs extraordinarily. Annotation quality can be directly attributed to the presence of expressed sequence alignments. The dependence of gene structure annotation on available EST and cDNA sequences makes static 3 assignments of gene structure problematic. For this reason, methods of analyzing and maintaining gene annotations must acknowledge confidence in a predicted gene structure. To provide these confidence estimators and an effective method for querying gene annotations based on these values, the Genome Annotation Evaluation Algorithm, GAEVAL, was developed (Schlueter et al., 2005; Schlueter et al, unpublished results). Inconsistency of gene structure annotation is a limitation to research in the post- genome era. It is unrealistic to hope for better software solutions in the near future that will solve all or even a majority of the problems encountered by computational annotation tools. This issue is all the more urgent with an increasing number of species being sequenced and analyzed by comparative genomics – erroneous annotations could easily propagate. To address this limitation, a dynamic and economically feasible solution to the annotation predicament was developed by providing broad-based, web- technology-enabled community annotation tools (Schlueter et al., 2005; Wilkerson et al., 2006). Previous understanding of the factors and sequence elements responsible for the initiation of eukaryotic gene transcription has been established primarily through conventional genetic analysis. However, these signals have been found to differ considerably among plants, animals and yeast. The majority of effort devoted to the in silico prediction of these regions and the analysis of their corresponding signals has been carried out in vertebrate species. As such, available programs that attempt to identify promoters by prediction of transcriptional start sites often perform poorly on plant sequences. Using a highly refined collection of sequences from Arabidopsis, a program 4 called TSiP was developed to predict plant promoter regions in anonymous DNA sequence. Dissertation Organization This dissertation is organized into five chapters. Chapter 1 contains a brief introduction to the areas of interest. Chapters 2 and 3 each consist of a published manuscript. Chapter 4 consists of a manuscript in preparation for publication. Chapter 5 is a summary of conclusions reached during the course of this dissertation research. Chapter 2, entitled “xGDB: open-source computational infrastructure for the integrated evaluation and analysis of genome features” has been published in Genome Biology in 2006, Volume 7 electronic publication R111 Contributions to this work from co-authors include the following. Volker Brendel provided computational hardware and helpful discussion during the course of this work. Matt Wilkerson provided support in the development of this system and feedback during the writing of the manuscript. Qunfeng Dong has maintained the individual plant species xGDB instances as PlantGDB (http://www.plantgdb.org/ ) and has provided feedback during the writing of the manuscript. All of the code was developed by Shannon Dwayne Schlueter who was responsible for the writing of the manuscript. Chapter 3, entitled “Community-based