Computational Prediction and Comparative Analysis of Protein Subcellular Localization in Bacteria

Total Page:16

File Type:pdf, Size:1020Kb

Computational Prediction and Comparative Analysis of Protein Subcellular Localization in Bacteria COMPUTATIONAL PREDICTION AND COMPARATIVE ANALYSIS OF PROTEIN SUBCELLULAR LOCALIZATION IN BACTERIA Jennifer Leigh Gardy B.Sc., University of British Columbia, 2000 Graduate Certificate in Biotechnology, McGill University, 2001 THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY In the Department of Molecular Biology and Biochemistry O Jennifer Gardy 2006 SIMON FRASER UNIVERSITY Spring 2006 All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author APPROVAL Name: Jennifer Leigh Gardy Degree: Doctor of Philosophy Title of Thesis: Computational Prediction and Comparative Analysis of Protein Subcellular Localization in Bacteria Examining Committee: Chair: Dr. Erika Plettner Assistant Professor, Department of Chemistry Dr. Fiona S.L. Brinkman Senior Supervisor Assistant Professor, Department of Molecular Biology and Biochemistry Dr. Jamie K. Scott Supervisor Professor, Department of Molecular Biology and Biochemistry Dr. Frederic Pio Supervisor Assistant Professor, Department of Molecular Biology and Biochemistry Dr. Margo Moore Internal Examiner Professor, Department of Biological Sciences Dr. Chris Upton External Examiner Associate Professor, Department of Biochemistry and Microbiology University of Victoria Date DefendedlApproved: Friday, April 7, 2006 SIMON FRASER ,,@ UNIVERSIW~ ibra ry DECLARATION OF PARTIAL COPYRIGHT LICENCE The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection, and, without changing the content, to translate the thesislproject or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work. The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission. Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence. The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive. Simon Fraser University Library Burnaby, BC, Canada ABSTRACT Predicting the subcellular localization of a protein is a critical step in processes ranging from genome annotation to drug and vaccine target discovery. Previously developed methods for localization prediction in bacteria exhibit poor predictive performance and are not conducive to the high-throughput analysis required in this era of genome-scale biological analysis. We therefore developed PSORTb, a high-precision, high-throug hput tool for the prediction of bacterial protein localization. PSORTb implements a multi-component approach to prediction, incorporating the detection of several sequence features known to influence subcellular localization. With a reported overall precision of 96%, it is the most precise method available and one of the most comprehensive methods - capable of assigning a query protein to one or more of four Gram-positive or five Gram-negative localization sites. The PSORTb algorithm comprises a series of analytical steps, each step - or module - being an independent piece of software which scans the protein for the presence or absence of a particular sequence feature. Modules include: SCL-BLAST for homology-based detection, the HMMTOP transmembrane helix prediction tool, a signal peptide prediction tool, a series of frequent subsequence-based support vector machines, as well as motif and profile-matching modules. The modules return as output either a predicted localization site or - if the feature is not detected - a result of "unknown". The output is then integrated by a Bayesian network into a final prediction. Development of PSORTb also required the creation of PSORTdb, a database storing both known and predicted localization information for bacterial proteins. This is a valuable resource to both the localization prediction and microbial research communities, providing a source of training data for new predictive algorithms and acting as a discovery space. The release of PSORTb v.2.0 allowed us to carry out a number of analyses related to localization. We performed the first genome-wide computational and laboratory screen for N- terminal signal peptides in the opportunistic pathogen Pseudomonas aeruginosa, used PSORTb as a complement to laboratory-based high-throughput 2D gel studies of individual cellular compartments, and examined protein localization in a global context, revealing trends with implications for adaptive evolution in microbes. Keywords: Bioinformatics, localization, pathogenomics, proteomics, bacteria. ACKNOWLEDGEMENTS I would like to extend my sincere thanks to everyone involved in the PSORTb project. First and foremost, Dr. Fiona S.L. Brinkman and the members of the Brinkman Lab - past and present - who collaborated on the PSORTb work: Michael Acab, Matthew Laird, Dr. Sebastien Rey, Cory Spencer, and Christopher Walsh. Thank you also to our collaborators in Simon Fraser University's Computing Science Department: Fei Chen, Dr. Martin Ester, Dr. Oliver Schulte, Rong She, and Dr. Ke Wang; at the University of British Columbia: Dr. Raphael Gottardo, Dr. Robert E.W. Hancock, and Dr. Shawn Lewenza; and around the world: Katalin deFays, Dr. Sujun Hua, Dr. Christophe Lambert, Dr. lstvan Simon, and Dr. Gabor Tusnady. Special thanks to Dr. Kenta Nakai, who developed the first version of PSORT in 1991 and graciously allowed us access to his source code and offered much helpful advice. TABLE OF CONTENTS .. Approval ..............................................................................................................11 ... Abstract ..............................................................................................................III Dedication............................................................................................................ v Acknowledgements ...........................................................................................vi .. Table of Contents ..............................................................................................VII List of Figures ....................................................................................................xi .. List of Tables .....................................................................................................XII ... Glossary ............................................................................................................XIII 1 An introduction to protein subcellular localization in bacteria ................1 1. 1 Protein localization in the bacterial cell ...............................................1 1.2 Signals governing protein targeting in bacteria ...................................3 1.2.1 Bacterial transport systems ............................................................. 3 1.2.2 Type I transporters - ABC transporters ...........................................4 1.2.3 Type II transporters - the general secretory pathway (GSP) ...........4 1.2.4 Type V transporters - autotransporters ...........................................9 1.2.5 Type Ill and IV transporters - delivery of DNAIprotein into host cells .......................................................................................I0 1.2.6 Differences in protein transport in Gram-positive bacteria ............. 11 1.3 Laboratory-based methods for localization determination ................. 12 1.3.1 Microscopy-based visualization .....................................................12 1.3.2 PhoA fusion ...................................................................................13 1.3.3 Subcellular fractionation ................................................................14 1.3.4 Limitations of laboratory-based methods .......................................14 1.4 Computational methods for localization prediction in bacteria .......... I6 1.4.1 The importance of computational predictive methods ................... 16 1.4.2 Early computational methods for localization prediction in bacteria ..........................................................................................18 1.4.3 Recent computational methods for localization prediction in bacteria.......................................................................................... 22 1.5 Goal of the present research ............................................................27 2 ePSORTdb: A database of bacterial proteins of known localization.................................................................................................. 28 2.1 Summary ..........................................................................................28
Recommended publications
  • The Biocyc Collection of Microbial Genomes and Metabolic Pathways Peter D
    Briefings in Bioinformatics, 2017, 1–9 doi: 10.1093/bib/bbx085 Paper Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbx085/4084231 by guest on 02 May 2019 The BioCyc collection of microbial genomes and metabolic pathways Peter D. Karp, Richard Billington, Ron Caspi, Carol A. Fulcher, Mario Latendresse, Anamika Kothari, Ingrid M. Keseler, Markus Krummenacker, Peter E. Midford, Quang Ong, Wai Kit Ong, Suzanne M. Paley and Pallavi Subhraveti Corresponding author: Peter D. Karp, Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, AE206, Menlo Park, CA 94025, USA. E-mail: [email protected] Abstract BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. Recent advances in BioCyc in- clude an expansion in the content of BioCyc in terms of both the number of genomes and the types of information available for each genome; an expansion in the amount of curated content within BioCyc; and new developments in the BioCyc soft- ware tools including redesigned gene/protein pages and metabolite pages; new search tools; a new sequence-alignment tool; a new tool for visualizing groups of related metabolic pathways; and a facility called SmartTables, which enables biolo- gists to perform analyses that previously would have required a programmer’s assistance. Key words: genome databases; microbial genome databases; metabolic pathway databases Peter Karp is the Director of the SRI Bioinformatics Research Group (BRG).
    [Show full text]
  • Biological Capacities Clearly Define a Major Subdivision in Domain Bacteria 4 5 Raphaël Méheust+,1,2, David Burstein+,1,3,8, Cindy J
    bioRxiv preprint doi: https://doi.org/10.1101/335083; this version posted May 30, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 1 Research article 2 3 Biological capacities clearly define a major subdivision in Domain Bacteria 4 5 Raphaël Méheust+,1,2, David Burstein+,1,3,8, Cindy J. Castelle1,2,4 and Jillian F. Banfield1,2,4,5,6,7,*,# 6 7 1Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, CA, USA 8 2Innovative Genomics Institute, Berkeley, CA, USA 9 3 California Institute for Quantitative Biosciences (QB3), University of California Berkeley, CA USA 10 4Chan Zuckerberg Biohub, San Francisco, CA, USA 11 5University of Melbourne, Melbourne, VIC, Australia 12 6Lawrence Berkeley National Laboratory, Berkeley, CA, USA 13 7Department of Environmental Science, Policy and Management, University of California, Berkeley, 14 Berkeley, CA, USA 15 8Present address: School of Molecular and Cell Biology and Biotechnology, George S. Wise Faculty of 16 Life Sciences, Tel Aviv University, Tel Aviv, Israel 17 +These authors contributed equally to this work 18 *Correspondence: [email protected] 19 #Lead Contact 20 21 Resume 22 Phylogenetic analyses separate candidate phyla radiation (CPR) bacteria from other bacteria, but 23 the degree to which their proteomes are distinct remains unclear. Here, we leveraged a proteome 24 database that includes sequences from thousands of uncultivated organisms to identify protein 25 families and examine their organismal distributions. We focused on widely distributed protein 26 families that co-occur in genomes, as they likely foundational for metabolism.
    [Show full text]
  • Hundreds of Novel Composite Genes and Chimeric Genes with Bacterial
    Hundreds of novel composite genes and chimeric genes with bacterial origins contributed to haloarchaeal evolution Raphaël Méheust, Andrew Watson, Francois-Joseph Lapointe, R. Thane Papke, Philippe Lopez, Eric Bapteste To cite this version: Raphaël Méheust, Andrew Watson, Francois-Joseph Lapointe, R. Thane Papke, Philippe Lopez, et al.. Hundreds of novel composite genes and chimeric genes with bacterial origins contributed to haloarchaeal evolution. Genome Biology, BioMed Central, 2018, 19, pp.75. 10.1186/s13059-018- 1454-9. hal-01830047 HAL Id: hal-01830047 https://hal.sorbonne-universite.fr/hal-01830047 Submitted on 4 Jul 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution| 4.0 International License Méheust et al. Genome Biology (2018) 19:75 https://doi.org/10.1186/s13059-018-1454-9 RESEARCH Open Access Hundreds of novel composite genes and chimeric genes with bacterial origins contributed to haloarchaeal evolution Raphaël Méheust1, Andrew K. Watson1, François-Joseph Lapointe3, R. Thane Papke2, Philippe Lopez1 and Eric Bapteste1* Abstract Background: Haloarchaea, a major group of archaea, are able to metabolize sugars and to live in oxygenated salty environments. Their physiology and lifestyle strongly contrast with that of their archaeal ancestors.
    [Show full text]
  • Psortm: a Bacterial and Archaeal Protein Subcellular Localization Prediction Tool for Metagenomics Data Michael A
    Bioinformatics, 36(10), 2020, 3043–3048 doi: 10.1093/bioinformatics/btaa136 Advance Access Publication Date: 28 February 2020 Original Paper Sequence analysis PSORTm: a bacterial and archaeal protein subcellular localization prediction tool for metagenomics data Michael A. Peabody1,†, Wing Yin Venus Lau1,†, Gemma R. Hoad1,2, Baofeng Jia1, Finlay Maguire3, Kristen L. Gray1, Robert G. Beiko3 and Fiona S. L. Brinkman1,* 1Department of Molecular Biology and Biochemistry, 2Research Computing Group, Simon Fraser University, Burnaby, BC V5A 1S6, Canada and 3Department of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada *To whom correspondence should be addressed. †The authors wish it to be known that these authors contributed equally. Associate Editor: Yann Ponty Received on July 3, 2019; revised on January 23, 2020; editorial decision on February 19, 2020; accepted on February 25, 2020 Abstract Motivation: Many methods for microbial protein subcellular localization (SCL) prediction exist; however, none is readily available for analysis of metagenomic sequence data, despite growing interest from researchers studying microbial communities in humans, agri-food relevant organisms and in other environments (e.g. for identification of cell-surface biomarkers for rapid protein-based diagnostic tests). We wished to also identify new markers of water quality from freshwater samples collected from pristine versus pollution-impacted watersheds. Results: We report PSORTm, the first bioinformatics tool designed for prediction of diverse bacterial and archaeal protein SCL from metagenomics data. PSORTm incorporates components of PSORTb, one of the most precise and widely used protein SCL predictors, with an automated classification by cell envelope. An evaluation using 5-fold cross-validation with in silico-fragmented sequences with known localization showed that PSORTm maintains PSORTb’s high precision, while sensitivity increases proportionately with metagenomic sequence fragment length.
    [Show full text]
  • Pathway Tools Version 24.0: Integrated Software for Pathway/Genome Informatics and Systems Biology
    Pathway Tools version 24.0: Integrated Software for Pathway/Genome Informatics and Systems Biology Peter D. Karp∗, Suzanne M. Paley, Peter E. Midford, Markus Krummenacker, Richard Billington, Anamika Kothari, Wai Kit Ong, Pallavi Subhraveti, Ingrid M. Keseler, and Ron Caspi Bioinformatics Research Group, SRI International 333 Ravenswood Ave, Menlo Park, CA 94025 [email protected] ∗To whom correspondence should be addressed. arXiv:1510.03964v4 [q-bio.GN] 12 Nov 2020 1 Abstract Pathway Tools is a bioinformatics software environment with a broad set of capabilities. The software provides genome-informatics tools such as a genome browser, sequence align- ments, a genome-variant analyzer, and comparative-genomics operations. It offers metabolic- informatics tools, such as metabolic reconstruction, quantitative metabolic modeling, pre- diction of reaction atom mappings, and metabolic route search. Pathway Tools also provides regulatory-informatics tools, such as the ability to represent and visualize a wide range of reg- ulatory interactions. The software creates and manages a type of organism-specific database called a Pathway/Genome Database (PGDB), which the software enables database curators to interactively edit. It supports web publishing of PGDBs and provides a large number of query, visualization, and omics-data analysis tools. Scientists around the world have created more than 35,000 PGDBs by using Pathway Tools, many of which are curated databases for important model organisms. Those PGDBs can be exchanged using a peer-to-peer database- sharing system called the PGDB Registry. Biographical Note Dr. Peter Karp is the Director of the Bioinformatics Research Group at SRI International. He received the PhD degree in Computer Science from Stanford University.
    [Show full text]
  • A Sample Article Title
    Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts Gabor Melli1, Martin Ester 1, Anoop Sarkar1 1School of Computing Science, Simon Fraser University, Burnaby, British Columbia, CANADA Email addresses: GM: [email protected] ME: [email protected] AS: [email protected] Abstract Background Research into semantic relation recognition from text has focused on the identification of binary relations that are contained within one sentence. In the domain of biomedical documents however relations of interest can have more than two arguments and can also have their entity mentions located on different sentences. An example of this scenario is the ternary relation of “subcellular localization” which relates whether an organism’s (O) protein (P) has subcellular location (L) as one of its target destinations. Empirical evidence suggests that approximately one half of the mentions for this ternary relation reside on multi-sentence passages. Results We introduce a relation recognition algorithm that can detect n-ary relations across multiple sentences in a document, and use the subcellular localization relation as a motivating example. The approach uses a text-graph representation of the entire document that is based on intrasentential edges derived from each sentence’s predicted syntactic parse trees, and on intersentential edges based on either the linking of adjacent sentences or the linking of coreferents, if reliable coreference predictions are available. From the text graph state-of-the-art features such as named-entity features and syntactic features are produced for each argument pairing. We test the approach on the task of recognizing , in PubMed abstracts, experimentally validated subcellular localization relations that have been curated by biomedical researchers.
    [Show full text]
  • Ontoloki: an Automatic, Instance-Based Method for the Evaluation of Biological Ontologies on the Semantic Web1
    OntoLoki: an automatic, instance-based method for the evaluation of biological ontologies on the Semantic Web1 Benjamin M. Good, Gavin Ha, Chi K. Ho, Mark D. Wilkinson Background In recent years, the explosive growth of data in the life sciences has produced the need for many new biological and medical ontologies [1, 2]. To meet this need, the biomedical informatics community has developed powerful tools for building ontologies such as Protégé [3, 4], established institutions devoted to improving and expanding the practice of biomedical ontology design [5], and produced an ever-increasing number of ontologies and ontology-based applications [6]. Outside of the biomedical community, the broader W3C-lead semantic Web initiative has produced new, standardized languages for sharing ontologies on the Web, such as OWL [7], as well as efficient algorithms for automated, deductive reasoning over the knowledge represented within them [8, 9]. These advances aid the processes of building, sharing, and using biological and medical ontologies; however, one well-recognized, yet largely unmet need remains the development of effective, objective methods for the evaluation of ontology quality [10, 11]. As Jeremy Rogers points out, “the medical and non-medical ontology engineering communities have yet to define, much less to regularly practice, a comprehensive and systematic methodology for assuring, or improving, the quality of their product” [12]. In much the same way that consistent standards for designing, building, and evaluating the products of physical engineering efforts can contribute to the efficiency of the engineering process and the reliability of the end-product, an effective, systematic methodology for ontology design and evaluation would be a great benefit to the community.
    [Show full text]
  • View Course Material
    PROTEIN ENGINEERING B.Tech Biotechnology SBT1206 UNIT-IV Protein Databases on the Internet Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. These data cannot be handled without using computer databases. Searching databases is often the first step in the study of a new protein. Without the prior knowledge obtained from such searches, known information about the protein could be missed, or an experiment could be repeated unnecessarily. Comparison between proteins and protein classification provide information about the relationship between proteins within a genome or across different species, and hence offer much more information than can be obtained by studying only an isolated protein. In this sense, protein comparison through databases allows one to view life as a forest instead of individual trees. In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand evolution, structure, and function of a protein. Protein databases are especially powered by the Internet. Unlike traditional media, such as the CD-ROM, the Internet allows databases to be easily maintained and frequently updated with minimum cost. Researchers with limited resources can afford to set up their own databases and disseminate their data quickly. Notably, many small databases on specific types of proteins, such as the EF-Hand Calcium-Binding Proteins Data Library (http://structbio.vanderbilt.edu/cabp_database/), are widely available. Users worldwide can easily access the most up-to-date version through a user-friendly interface.
    [Show full text]
  • Dbmloc: a Database of Proteins with Multiple Subcellular Localizations Song Zhang†, Xuefeng Xia†, Jincheng Shen, Yun Zhou and Zhirong Sun*
    BMC Bioinformatics BioMed Central Database Open Access DBMLoc: a Database of proteins with multiple subcellular localizations Song Zhang†, Xuefeng Xia†, Jincheng Shen, Yun Zhou and Zhirong Sun* Address: MOE Key Laboratory of Bioinformatics, State Key Laboratory of Biomembrane and Membrane Biotechnology, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing 100084, China Email: Song Zhang - [email protected]; Xuefeng Xia - [email protected]; Jincheng Shen - [email protected]; Yun Zhou - [email protected]; Zhirong Sun* - [email protected] * Corresponding author †Equal contributors Published: 28 February 2008 Received: 22 July 2007 Accepted: 28 February 2008 BMC Bioinformatics 2008, 9:127 doi:10.1186/1471-2105-9-127 This article is available from: http://www.biomedcentral.com/1471-2105/9/127 © 2008 Zhang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Subcellular localization information is one of the key features to protein function research. Locating to a specific subcellular compartment is essential for a protein to function efficiently. Proteins which have multiple localizations will provide more clues. This kind of proteins may take a high proportion, even more than 35%. Description: We have developed a database of proteins with multiple subcellular localizations, designated DBMLoc. The initial release contains 10470 multiple subcellular localization-annotated entries. Annotations are collected from primary protein databases, specific subcellular localization databases and literature texts.
    [Show full text]
  • Psortdb—An Expanded, Auto-Updated, User-Friendly Protein Subcellular Localization Database for Bacteria and Archaea Nancy Y
    Published online 10 November 2010 Nucleic Acids Research, 2011, Vol. 39, Database issue D241–D244 doi:10.1093/nar/gkq1093 PSORTdb—an expanded, auto-updated, user-friendly protein subcellular localization database for Bacteria and Archaea Nancy Y. Yu, Matthew R. Laird, Cory Spencer and Fiona S.L. Brinkman* Department of Molecular Biology & Biochemistry, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada Received September 15, 2010; Revised October 15, 2010; Accepted October 17, 2010 ABSTRACT or vaccine targets, reveals suitable targets for microbial diagnostics and provides directions for experimental The subcellular localization (SCL) of a microbial design. For biomedical applications, identification of cell protein provides clues about its function, its suit- surface and secreted proteins from pathogenic bacteria ability as a drug, vaccine or diagnostic target and may lead to the discovery of novel therapeutic targets. aids experimental design. The first version of Characterizing cell surface and extracellular proteins PSORTdb provided a valuable resource comprising associated with non-pathogenic Bacteria and Archaea a data set of proteins of known SCL (ePSORTdb) as can have industrial uses, or play a role in environmental well as pre-computed SCL predictions for prote- detection. omes derived from complete bacterial genomes The first SCL prediction software, called PSORT, was (cPSORTdb). PSORTdb 2.0 (http://db.psort.org) developed in 1991 by Kenta Nakai for bacteria, animals extends user-friendly functionalities, significantly and plants (1, 2). PSORT II, iPSORT and WoLF PSORT were subsequently developed for eukaryotic species (3–5). expands ePSORTdb and now contains pre- PSORTb and PSORTb 2.0 were later developed in 2003 computed SCL predictions for all prokaryotes— and 2005 specifically for Gram-negative and -positive including Archaea and Bacteria with atypical cell bacterial protein SCL prediction, with a focus on wall/membrane structures.
    [Show full text]
  • Ortholugedb: a Bacterial and Archaeal Orthology Resource for Improved Comparative Genomic Analysis Matthew D
    D366–D376 Nucleic Acids Research, 2013, Vol. 41, Database issue Published online 29 November 2012 doi:10.1093/nar/gks1241 OrtholugeDB: a bacterial and archaeal orthology resource for improved comparative genomic analysis Matthew D. Whiteside, Geoffrey L. Winsor, Matthew R. Laird and Fiona S. L. Brinkman* Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada Received September 28, 2012; Revised and Accepted November 1, 2012 ABSTRACT INTRODUCTION Prediction of orthologs (homologous genes that The increase in genomic sequencing throughput has diverged because of speciation) is an integral generated a rapid increase in the number of available bac- component of many comparative genomics terial and archaeal genome sequences (1,2). To make methods. Although orthologs are more likely to effective use of this growing resource, computational methods for comparative genomics analysis must keep have similar function versus paralogs (genes that pace, ideally without sacrificing accuracy or performance. diverged because of duplication), recent studies Computationally predicted orthologs are integral to many have shown that their degree of functional conser- comparative genomics analyses. Orthologs, related genes vation is variable. Also, there are inherent problems between species that have diverged as a result of speci- with several large-scale ortholog prediction ation, are thought to more likely have similar functions approaches. To address these issues, we previously than paralogs, which are homologous genes that have developed Ortholuge, which uses phylogenetic arisen through gene duplication (3). This ortholog func- distance ratios to provide more precise ortholog tional conservation hypothesis or conjecture is the basis assessments for a set of predicted orthologs.
    [Show full text]
  • Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance
    Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance by Matthew Daratha Whiteside B.Sc., (Hons., Bioinformatics), University of Waterloo, 2006 Thesis Submitted In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Department of Molecular Biology and Biochemistry Faculty of Science © Matthew Daratha Whiteside 2013 SIMON FRASER UNIVERSITY Spring 2013 Approval Name: Matthew Daratha Whiteside Degree: Doctor of Philosophy Title of Thesis: Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance Examining Committee: Chair: Dr. Ralph Pantophlet Assistant Professor, Faculty of Health Sciences Dr. Fiona S.L. Brinkman Senior Supervisor Professor, Department of Molecular Biology and Biochemistry Dr. Jack Chen Supervisor Associate Professor, Department of Molecular Biology and Biochemistry Dr. Margo M. Moore Supervisor Professor, Department of Biological Sciences Dr. Ryan D. Morin Internal Examiner Assistant Professor, Department of Molecular Biology and Biochemistry Dr. Rosemary J. Redfield External Examiner Professor, Department of Zoology University of British Columbia Date Defended/Approved: March 8th, 2013 ii Partial Copyright Licence iii Abstract Orthologs are genes that diverged from an ancestral gene when the species diverged. High-throughput computational methods for ortholog prediction are a key component of many computational biology analyses. A fundamental premise in these analyses is that orthologs (when predicted correctly) are functionally equivalent and can be used to transfer gene annotations across species. Currently, many existing ortholog prediction methods generate a sizeable number of incorrect ortholog predictions, especially in cases of complex gene evolution. My thesis examines the functional equivalence hypothesis further and presents one solution that increases the precision of ortholog prediction.
    [Show full text]