Hundreds of Novel Composite Genes and Chimeric Genes with Bacterial

Total Page:16

File Type:pdf, Size:1020Kb

Hundreds of Novel Composite Genes and Chimeric Genes with Bacterial Hundreds of novel composite genes and chimeric genes with bacterial origins contributed to haloarchaeal evolution Raphaël Méheust, Andrew Watson, Francois-Joseph Lapointe, R. Thane Papke, Philippe Lopez, Eric Bapteste To cite this version: Raphaël Méheust, Andrew Watson, Francois-Joseph Lapointe, R. Thane Papke, Philippe Lopez, et al.. Hundreds of novel composite genes and chimeric genes with bacterial origins contributed to haloarchaeal evolution. Genome Biology, BioMed Central, 2018, 19, pp.75. 10.1186/s13059-018- 1454-9. hal-01830047 HAL Id: hal-01830047 https://hal.sorbonne-universite.fr/hal-01830047 Submitted on 4 Jul 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution| 4.0 International License Méheust et al. Genome Biology (2018) 19:75 https://doi.org/10.1186/s13059-018-1454-9 RESEARCH Open Access Hundreds of novel composite genes and chimeric genes with bacterial origins contributed to haloarchaeal evolution Raphaël Méheust1, Andrew K. Watson1, François-Joseph Lapointe3, R. Thane Papke2, Philippe Lopez1 and Eric Bapteste1* Abstract Background: Haloarchaea, a major group of archaea, are able to metabolize sugars and to live in oxygenated salty environments. Their physiology and lifestyle strongly contrast with that of their archaeal ancestors. Amino acid optimizations, which lowered the isoelectric point of haloarchaeal proteins, and abundant lateral gene transfers from bacteria have been invoked to explain this deep evolutionary transition. We use network analyses to show that the evolution of novel genes exclusive to Haloarchaea also contributed to the evolution of this group. Results: We report the creation of 320 novel composite genes, both early in the evolution of Haloarchaea during haloarchaeal genesis and later in diverged haloarchaeal groups. One hundred and twenty-six of these novel composite genes derived from genetic material from bacterial genomes. These latter genes, largely involved in metabolic functions but also in oxygenic lifestyle, constitute a different gene pool from the laterally acquired bacterial genes formerly identified. These novel composite genes were likely advantageous for their hosts, since they show significant residence times in haloarchaeal genomes—consistent with a long phylogenetic history involving vertical descent and lateral gene transfer—and encode proteins with optimized isoelectric points. Conclusions: Overall, our work encourages a systematic search for composite genes across all archaeal major groups, in order to better understand the origins of novel prokaryotic genes, and in order to test to what extent archaea might have adjusted their lifestyles by incorporating and recycling laterally acquired bacterial genetic fragments into new archaeal genes. Background transitions (a process we termed “haloarchaeal genesis”) Haloarchaea (also called Halobacteria) is an archaeal implied that Haloarchaea faced at least two major issues. class in which all members thrive in oxygenated hypersa- It involved numerous genetic events to transform their line environments using aerobic respiration and reduced physiology, as well as amino acid optimizations, which carbon sources. This lifestyle is in distinct contrast with allowed their proteins to remain soluble, resulting in the physiology of their methanogenic ancestors, which lower isoelectric points than their homologs outside this were autotrophic, and lived in oxygen-free habitats [1]. group [3]. While the latter changes can result from point Furthermore, Haloarchaea adapted to extreme osmotic mutation, abundant lateral gene transfers (LGT) from challenges by adopting a salt-in strategy making their bacteria have repeatedly been invoked to explain the cytosolic salinity equal to that of their environment – evolution and adaptation to oxygenic lifestyle of this ar- halophilic methanogens use compatible solutes to chaeal lineage [4]. balance their osmotic pressures [2]. These major lifestyle Phylogenetic studies, largely focused on the acquisition of full-sized genes by Haloarchaea from bacterial donors, * Correspondence: [email protected] proposed either a sudden and massive introgressive 1Sorbonne Universités, UPMC Univ Paris 06, Institut de Biologie Paris Seine, process [5, 6], or a more gradual and piecemeal process Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7138 [7, 8] to explain the gains of a thousand gene families Evolution Paris Seine, 75005 Paris, France Full list of author information is available at the end of the article with bacterial origins in the haloarchaeal group [5, 6]. © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Méheust et al. Genome Biology (2018) 19:75 Page 2 of 12 Integrative modeling of gene and genome evolution in networks [19] that rely on both full and partial (e.g. the archaea has also suggested that, though gene families protein domain) pairwise similarity values to analyze are largely vertically transmitted within archaea, LGT similarity between sequences and to test whether gene has had a significant impact on archaeal genome remodeling was involved in the emergence of Haloarch- evolution, outnumbering expansion of genomes by du- aea and in their subsequent evolution. We report the plication of existing archaeal gene families in the major- creation of hundreds of novel composite genes, both ity of branches of the archaeal tree, including in the early in the evolution of Haloarchaea (during haloarch- haloarchaea where rates of LGT were particularly high aeal genesis) and later in diverged haloarchaeal groups. [9]. Importantly, in addition to these recognized LGTs, Based on the taxonomic assignment of the components the evolution of novel genes within the group could also of the composite genes, we distinguish three classes of explain how Haloarchaea arose and thrive, and this is composite genes, exclusively found in Haloarchaea. First, supported by the discovery of gene families correspond- we identified class I composite genes, which are formed ing to potential de novo origins in the Haloarchaea [9]. from unique associations of DNA components from ar- This type of change has rarely been assessed because lit- chaeal genomes (i.e. ARC-ARC and ARC-HALO com- tle is known about the origins of novel genes in prokary- posite genes). Second, we identified class II composite otic genomes [10, 11]. There are a range of different genes, which are derived from unique combinations of mechanisms that can produce novel genes, including de prokaryotic DNA, since these components cannot be novo genes, synthesized either partly or completely from confidently assigned either to a bacterial or an archaeal non-coding DNA [12], from the divergence of an exist- host (i.e. PROK-PROK, PROK-ARC, PROK-HALO com- ing protein-coding sequence beyond the point at which posite genes). Third, we identified chimeric composite it is recognizable as a homologue (e.g. following gene (ChiC) genes in Haloarchaea, which are made up of (at duplication events), or by fusion or fission of existing least one) component from bacterial genomes (i.e. protein-coding sequences [13]. New genes have been BAC-HALO, ARC-BAC, and PROK-BAC composite shown to be able to fulfil crucial roles in biological genes). Importantly, these latter genes constitute a differ- processes after relatively short evolutionary times for ent gene pool from the laterally acquired bacterial genes different lineages, highlighting their potential importance in detected by Nelson-Sathi [5]. Hence, haloarchaeal ChiC biological transitions [14]. Yet, events of gene remodeling genes reveal an additional substantial bacterial contribu- leading to the creation of novel genes, possibly contributing tion during the evolution of Haloarchaea. Many of these to haloarchaeal genesis and to the success of their descen- novel composite genes, i.e. ChiC genes and class I and II dants, remain to be systematically investigated. composite genes, were likely advantageous for these Gene remodeling has been described in prokaryotes, hosts. They showed significant residence times in mainly resulting from the fusion and fission of full-sized haloarchaeal genomes, as assessed by their taxonomic genes [15]. These two processes produce detectable distributions, consistent with a long phylogenetic history composite genes, i.e. genes composed of dissociable/as- involving vertical descent, and by the optimized isoelec- sociable parts, called components. Moreover, the transfer tric points
Recommended publications
  • The Biocyc Collection of Microbial Genomes and Metabolic Pathways Peter D
    Briefings in Bioinformatics, 2017, 1–9 doi: 10.1093/bib/bbx085 Paper Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbx085/4084231 by guest on 02 May 2019 The BioCyc collection of microbial genomes and metabolic pathways Peter D. Karp, Richard Billington, Ron Caspi, Carol A. Fulcher, Mario Latendresse, Anamika Kothari, Ingrid M. Keseler, Markus Krummenacker, Peter E. Midford, Quang Ong, Wai Kit Ong, Suzanne M. Paley and Pallavi Subhraveti Corresponding author: Peter D. Karp, Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, AE206, Menlo Park, CA 94025, USA. E-mail: [email protected] Abstract BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. Recent advances in BioCyc in- clude an expansion in the content of BioCyc in terms of both the number of genomes and the types of information available for each genome; an expansion in the amount of curated content within BioCyc; and new developments in the BioCyc soft- ware tools including redesigned gene/protein pages and metabolite pages; new search tools; a new sequence-alignment tool; a new tool for visualizing groups of related metabolic pathways; and a facility called SmartTables, which enables biolo- gists to perform analyses that previously would have required a programmer’s assistance. Key words: genome databases; microbial genome databases; metabolic pathway databases Peter Karp is the Director of the SRI Bioinformatics Research Group (BRG).
    [Show full text]
  • Biological Capacities Clearly Define a Major Subdivision in Domain Bacteria 4 5 Raphaël Méheust+,1,2, David Burstein+,1,3,8, Cindy J
    bioRxiv preprint doi: https://doi.org/10.1101/335083; this version posted May 30, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 1 Research article 2 3 Biological capacities clearly define a major subdivision in Domain Bacteria 4 5 Raphaël Méheust+,1,2, David Burstein+,1,3,8, Cindy J. Castelle1,2,4 and Jillian F. Banfield1,2,4,5,6,7,*,# 6 7 1Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, CA, USA 8 2Innovative Genomics Institute, Berkeley, CA, USA 9 3 California Institute for Quantitative Biosciences (QB3), University of California Berkeley, CA USA 10 4Chan Zuckerberg Biohub, San Francisco, CA, USA 11 5University of Melbourne, Melbourne, VIC, Australia 12 6Lawrence Berkeley National Laboratory, Berkeley, CA, USA 13 7Department of Environmental Science, Policy and Management, University of California, Berkeley, 14 Berkeley, CA, USA 15 8Present address: School of Molecular and Cell Biology and Biotechnology, George S. Wise Faculty of 16 Life Sciences, Tel Aviv University, Tel Aviv, Israel 17 +These authors contributed equally to this work 18 *Correspondence: [email protected] 19 #Lead Contact 20 21 Resume 22 Phylogenetic analyses separate candidate phyla radiation (CPR) bacteria from other bacteria, but 23 the degree to which their proteomes are distinct remains unclear. Here, we leveraged a proteome 24 database that includes sequences from thousands of uncultivated organisms to identify protein 25 families and examine their organismal distributions. We focused on widely distributed protein 26 families that co-occur in genomes, as they likely foundational for metabolism.
    [Show full text]
  • Psortm: a Bacterial and Archaeal Protein Subcellular Localization Prediction Tool for Metagenomics Data Michael A
    Bioinformatics, 36(10), 2020, 3043–3048 doi: 10.1093/bioinformatics/btaa136 Advance Access Publication Date: 28 February 2020 Original Paper Sequence analysis PSORTm: a bacterial and archaeal protein subcellular localization prediction tool for metagenomics data Michael A. Peabody1,†, Wing Yin Venus Lau1,†, Gemma R. Hoad1,2, Baofeng Jia1, Finlay Maguire3, Kristen L. Gray1, Robert G. Beiko3 and Fiona S. L. Brinkman1,* 1Department of Molecular Biology and Biochemistry, 2Research Computing Group, Simon Fraser University, Burnaby, BC V5A 1S6, Canada and 3Department of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada *To whom correspondence should be addressed. †The authors wish it to be known that these authors contributed equally. Associate Editor: Yann Ponty Received on July 3, 2019; revised on January 23, 2020; editorial decision on February 19, 2020; accepted on February 25, 2020 Abstract Motivation: Many methods for microbial protein subcellular localization (SCL) prediction exist; however, none is readily available for analysis of metagenomic sequence data, despite growing interest from researchers studying microbial communities in humans, agri-food relevant organisms and in other environments (e.g. for identification of cell-surface biomarkers for rapid protein-based diagnostic tests). We wished to also identify new markers of water quality from freshwater samples collected from pristine versus pollution-impacted watersheds. Results: We report PSORTm, the first bioinformatics tool designed for prediction of diverse bacterial and archaeal protein SCL from metagenomics data. PSORTm incorporates components of PSORTb, one of the most precise and widely used protein SCL predictors, with an automated classification by cell envelope. An evaluation using 5-fold cross-validation with in silico-fragmented sequences with known localization showed that PSORTm maintains PSORTb’s high precision, while sensitivity increases proportionately with metagenomic sequence fragment length.
    [Show full text]
  • Pathway Tools Version 24.0: Integrated Software for Pathway/Genome Informatics and Systems Biology
    Pathway Tools version 24.0: Integrated Software for Pathway/Genome Informatics and Systems Biology Peter D. Karp∗, Suzanne M. Paley, Peter E. Midford, Markus Krummenacker, Richard Billington, Anamika Kothari, Wai Kit Ong, Pallavi Subhraveti, Ingrid M. Keseler, and Ron Caspi Bioinformatics Research Group, SRI International 333 Ravenswood Ave, Menlo Park, CA 94025 [email protected] ∗To whom correspondence should be addressed. arXiv:1510.03964v4 [q-bio.GN] 12 Nov 2020 1 Abstract Pathway Tools is a bioinformatics software environment with a broad set of capabilities. The software provides genome-informatics tools such as a genome browser, sequence align- ments, a genome-variant analyzer, and comparative-genomics operations. It offers metabolic- informatics tools, such as metabolic reconstruction, quantitative metabolic modeling, pre- diction of reaction atom mappings, and metabolic route search. Pathway Tools also provides regulatory-informatics tools, such as the ability to represent and visualize a wide range of reg- ulatory interactions. The software creates and manages a type of organism-specific database called a Pathway/Genome Database (PGDB), which the software enables database curators to interactively edit. It supports web publishing of PGDBs and provides a large number of query, visualization, and omics-data analysis tools. Scientists around the world have created more than 35,000 PGDBs by using Pathway Tools, many of which are curated databases for important model organisms. Those PGDBs can be exchanged using a peer-to-peer database- sharing system called the PGDB Registry. Biographical Note Dr. Peter Karp is the Director of the Bioinformatics Research Group at SRI International. He received the PhD degree in Computer Science from Stanford University.
    [Show full text]
  • A Sample Article Title
    Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts Gabor Melli1, Martin Ester 1, Anoop Sarkar1 1School of Computing Science, Simon Fraser University, Burnaby, British Columbia, CANADA Email addresses: GM: [email protected] ME: [email protected] AS: [email protected] Abstract Background Research into semantic relation recognition from text has focused on the identification of binary relations that are contained within one sentence. In the domain of biomedical documents however relations of interest can have more than two arguments and can also have their entity mentions located on different sentences. An example of this scenario is the ternary relation of “subcellular localization” which relates whether an organism’s (O) protein (P) has subcellular location (L) as one of its target destinations. Empirical evidence suggests that approximately one half of the mentions for this ternary relation reside on multi-sentence passages. Results We introduce a relation recognition algorithm that can detect n-ary relations across multiple sentences in a document, and use the subcellular localization relation as a motivating example. The approach uses a text-graph representation of the entire document that is based on intrasentential edges derived from each sentence’s predicted syntactic parse trees, and on intersentential edges based on either the linking of adjacent sentences or the linking of coreferents, if reliable coreference predictions are available. From the text graph state-of-the-art features such as named-entity features and syntactic features are produced for each argument pairing. We test the approach on the task of recognizing , in PubMed abstracts, experimentally validated subcellular localization relations that have been curated by biomedical researchers.
    [Show full text]
  • Computational Prediction and Comparative Analysis of Protein Subcellular Localization in Bacteria
    COMPUTATIONAL PREDICTION AND COMPARATIVE ANALYSIS OF PROTEIN SUBCELLULAR LOCALIZATION IN BACTERIA Jennifer Leigh Gardy B.Sc., University of British Columbia, 2000 Graduate Certificate in Biotechnology, McGill University, 2001 THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY In the Department of Molecular Biology and Biochemistry O Jennifer Gardy 2006 SIMON FRASER UNIVERSITY Spring 2006 All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author APPROVAL Name: Jennifer Leigh Gardy Degree: Doctor of Philosophy Title of Thesis: Computational Prediction and Comparative Analysis of Protein Subcellular Localization in Bacteria Examining Committee: Chair: Dr. Erika Plettner Assistant Professor, Department of Chemistry Dr. Fiona S.L. Brinkman Senior Supervisor Assistant Professor, Department of Molecular Biology and Biochemistry Dr. Jamie K. Scott Supervisor Professor, Department of Molecular Biology and Biochemistry Dr. Frederic Pio Supervisor Assistant Professor, Department of Molecular Biology and Biochemistry Dr. Margo Moore Internal Examiner Professor, Department of Biological Sciences Dr. Chris Upton External Examiner Associate Professor, Department of Biochemistry and Microbiology University of Victoria Date DefendedlApproved: Friday, April 7, 2006 SIMON FRASER ,,@ UNIVERSIW~ ibra ry DECLARATION OF PARTIAL COPYRIGHT LICENCE The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
    [Show full text]
  • Ontoloki: an Automatic, Instance-Based Method for the Evaluation of Biological Ontologies on the Semantic Web1
    OntoLoki: an automatic, instance-based method for the evaluation of biological ontologies on the Semantic Web1 Benjamin M. Good, Gavin Ha, Chi K. Ho, Mark D. Wilkinson Background In recent years, the explosive growth of data in the life sciences has produced the need for many new biological and medical ontologies [1, 2]. To meet this need, the biomedical informatics community has developed powerful tools for building ontologies such as Protégé [3, 4], established institutions devoted to improving and expanding the practice of biomedical ontology design [5], and produced an ever-increasing number of ontologies and ontology-based applications [6]. Outside of the biomedical community, the broader W3C-lead semantic Web initiative has produced new, standardized languages for sharing ontologies on the Web, such as OWL [7], as well as efficient algorithms for automated, deductive reasoning over the knowledge represented within them [8, 9]. These advances aid the processes of building, sharing, and using biological and medical ontologies; however, one well-recognized, yet largely unmet need remains the development of effective, objective methods for the evaluation of ontology quality [10, 11]. As Jeremy Rogers points out, “the medical and non-medical ontology engineering communities have yet to define, much less to regularly practice, a comprehensive and systematic methodology for assuring, or improving, the quality of their product” [12]. In much the same way that consistent standards for designing, building, and evaluating the products of physical engineering efforts can contribute to the efficiency of the engineering process and the reliability of the end-product, an effective, systematic methodology for ontology design and evaluation would be a great benefit to the community.
    [Show full text]
  • View Course Material
    PROTEIN ENGINEERING B.Tech Biotechnology SBT1206 UNIT-IV Protein Databases on the Internet Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. These data cannot be handled without using computer databases. Searching databases is often the first step in the study of a new protein. Without the prior knowledge obtained from such searches, known information about the protein could be missed, or an experiment could be repeated unnecessarily. Comparison between proteins and protein classification provide information about the relationship between proteins within a genome or across different species, and hence offer much more information than can be obtained by studying only an isolated protein. In this sense, protein comparison through databases allows one to view life as a forest instead of individual trees. In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand evolution, structure, and function of a protein. Protein databases are especially powered by the Internet. Unlike traditional media, such as the CD-ROM, the Internet allows databases to be easily maintained and frequently updated with minimum cost. Researchers with limited resources can afford to set up their own databases and disseminate their data quickly. Notably, many small databases on specific types of proteins, such as the EF-Hand Calcium-Binding Proteins Data Library (http://structbio.vanderbilt.edu/cabp_database/), are widely available. Users worldwide can easily access the most up-to-date version through a user-friendly interface.
    [Show full text]
  • Dbmloc: a Database of Proteins with Multiple Subcellular Localizations Song Zhang†, Xuefeng Xia†, Jincheng Shen, Yun Zhou and Zhirong Sun*
    BMC Bioinformatics BioMed Central Database Open Access DBMLoc: a Database of proteins with multiple subcellular localizations Song Zhang†, Xuefeng Xia†, Jincheng Shen, Yun Zhou and Zhirong Sun* Address: MOE Key Laboratory of Bioinformatics, State Key Laboratory of Biomembrane and Membrane Biotechnology, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing 100084, China Email: Song Zhang - [email protected]; Xuefeng Xia - [email protected]; Jincheng Shen - [email protected]; Yun Zhou - [email protected]; Zhirong Sun* - [email protected] * Corresponding author †Equal contributors Published: 28 February 2008 Received: 22 July 2007 Accepted: 28 February 2008 BMC Bioinformatics 2008, 9:127 doi:10.1186/1471-2105-9-127 This article is available from: http://www.biomedcentral.com/1471-2105/9/127 © 2008 Zhang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Subcellular localization information is one of the key features to protein function research. Locating to a specific subcellular compartment is essential for a protein to function efficiently. Proteins which have multiple localizations will provide more clues. This kind of proteins may take a high proportion, even more than 35%. Description: We have developed a database of proteins with multiple subcellular localizations, designated DBMLoc. The initial release contains 10470 multiple subcellular localization-annotated entries. Annotations are collected from primary protein databases, specific subcellular localization databases and literature texts.
    [Show full text]
  • Psortdb—An Expanded, Auto-Updated, User-Friendly Protein Subcellular Localization Database for Bacteria and Archaea Nancy Y
    Published online 10 November 2010 Nucleic Acids Research, 2011, Vol. 39, Database issue D241–D244 doi:10.1093/nar/gkq1093 PSORTdb—an expanded, auto-updated, user-friendly protein subcellular localization database for Bacteria and Archaea Nancy Y. Yu, Matthew R. Laird, Cory Spencer and Fiona S.L. Brinkman* Department of Molecular Biology & Biochemistry, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada Received September 15, 2010; Revised October 15, 2010; Accepted October 17, 2010 ABSTRACT or vaccine targets, reveals suitable targets for microbial diagnostics and provides directions for experimental The subcellular localization (SCL) of a microbial design. For biomedical applications, identification of cell protein provides clues about its function, its suit- surface and secreted proteins from pathogenic bacteria ability as a drug, vaccine or diagnostic target and may lead to the discovery of novel therapeutic targets. aids experimental design. The first version of Characterizing cell surface and extracellular proteins PSORTdb provided a valuable resource comprising associated with non-pathogenic Bacteria and Archaea a data set of proteins of known SCL (ePSORTdb) as can have industrial uses, or play a role in environmental well as pre-computed SCL predictions for prote- detection. omes derived from complete bacterial genomes The first SCL prediction software, called PSORT, was (cPSORTdb). PSORTdb 2.0 (http://db.psort.org) developed in 1991 by Kenta Nakai for bacteria, animals extends user-friendly functionalities, significantly and plants (1, 2). PSORT II, iPSORT and WoLF PSORT were subsequently developed for eukaryotic species (3–5). expands ePSORTdb and now contains pre- PSORTb and PSORTb 2.0 were later developed in 2003 computed SCL predictions for all prokaryotes— and 2005 specifically for Gram-negative and -positive including Archaea and Bacteria with atypical cell bacterial protein SCL prediction, with a focus on wall/membrane structures.
    [Show full text]
  • Ortholugedb: a Bacterial and Archaeal Orthology Resource for Improved Comparative Genomic Analysis Matthew D
    D366–D376 Nucleic Acids Research, 2013, Vol. 41, Database issue Published online 29 November 2012 doi:10.1093/nar/gks1241 OrtholugeDB: a bacterial and archaeal orthology resource for improved comparative genomic analysis Matthew D. Whiteside, Geoffrey L. Winsor, Matthew R. Laird and Fiona S. L. Brinkman* Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada Received September 28, 2012; Revised and Accepted November 1, 2012 ABSTRACT INTRODUCTION Prediction of orthologs (homologous genes that The increase in genomic sequencing throughput has diverged because of speciation) is an integral generated a rapid increase in the number of available bac- component of many comparative genomics terial and archaeal genome sequences (1,2). To make methods. Although orthologs are more likely to effective use of this growing resource, computational methods for comparative genomics analysis must keep have similar function versus paralogs (genes that pace, ideally without sacrificing accuracy or performance. diverged because of duplication), recent studies Computationally predicted orthologs are integral to many have shown that their degree of functional conser- comparative genomics analyses. Orthologs, related genes vation is variable. Also, there are inherent problems between species that have diverged as a result of speci- with several large-scale ortholog prediction ation, are thought to more likely have similar functions approaches. To address these issues, we previously than paralogs, which are homologous genes that have developed Ortholuge, which uses phylogenetic arisen through gene duplication (3). This ortholog func- distance ratios to provide more precise ortholog tional conservation hypothesis or conjecture is the basis assessments for a set of predicted orthologs.
    [Show full text]
  • Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance
    Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance by Matthew Daratha Whiteside B.Sc., (Hons., Bioinformatics), University of Waterloo, 2006 Thesis Submitted In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Department of Molecular Biology and Biochemistry Faculty of Science © Matthew Daratha Whiteside 2013 SIMON FRASER UNIVERSITY Spring 2013 Approval Name: Matthew Daratha Whiteside Degree: Doctor of Philosophy Title of Thesis: Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance Examining Committee: Chair: Dr. Ralph Pantophlet Assistant Professor, Faculty of Health Sciences Dr. Fiona S.L. Brinkman Senior Supervisor Professor, Department of Molecular Biology and Biochemistry Dr. Jack Chen Supervisor Associate Professor, Department of Molecular Biology and Biochemistry Dr. Margo M. Moore Supervisor Professor, Department of Biological Sciences Dr. Ryan D. Morin Internal Examiner Assistant Professor, Department of Molecular Biology and Biochemistry Dr. Rosemary J. Redfield External Examiner Professor, Department of Zoology University of British Columbia Date Defended/Approved: March 8th, 2013 ii Partial Copyright Licence iii Abstract Orthologs are genes that diverged from an ancestral gene when the species diverged. High-throughput computational methods for ortholog prediction are a key component of many computational biology analyses. A fundamental premise in these analyses is that orthologs (when predicted correctly) are functionally equivalent and can be used to transfer gene annotations across species. Currently, many existing ortholog prediction methods generate a sizeable number of incorrect ortholog predictions, especially in cases of complex gene evolution. My thesis examines the functional equivalence hypothesis further and presents one solution that increases the precision of ortholog prediction.
    [Show full text]