Protein Domain Architectures Provide a Fast, Efficient and Scalable

F1000Research 2016, 5:1987 Last updated: 09 APR 2020 RESEARCH ARTICLE Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics [version 1; peer review: 1 approved, 2 approved with reservations] Jasper J. Koehorst 1, Edoardo Saccenti1, Peter J. Schaap1, Vitor A. P. Martins dos Santos1,2, Maria Suarez-Diez 1 1Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Stippeneng, The Netherlands 2LifeGlimmer, LifeGlimmer GmBH, Berlin, Germany First published: 15 Aug 2016, 5:1987 ( Open Peer Review v1 https://doi.org/10.12688/f1000research.9416.1) Second version: 24 Nov 2016, 5:1987 ( https://doi.org/10.12688/f1000research.9416.2) Reviewer Status Latest published: 27 Jun 2017, 5:1987 ( https://doi.org/10.12688/f1000research.9416.3) Invited Reviewers 1 2 3 Abstract A functional comparative genome analysis is essential to understand the version 3 mechanisms underlying bacterial evolution and adaptation. Detection of (revision) functional orthologs using standard global sequence similarity methods 27 Jun 2017 faces several problems; the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition and the high computational cost for finding bi-directional best matches at a large version 2 scale. (revision) report 24 Nov 2016 We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 version 1 bacterial genomes sampled at different taxonomic levels. 15 Aug 2016 report report report We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of 1 Antonio Rosato , University of Florence, functionally equivalent proteins within and across taxonomic bounderies. Sesto Fiorentino, Italy As the computational cost scales linearly, and not quadratically with the number of genomes, it is suitable for large scale comparative analysis. 2 Robert D. Finn , European Bioinformatics Running both methods in parallel pinpoints potential functional adaptations Institute, Cambridge, UK that may add to bacterial fitness. David M. Kristensen, The University of Iowa, Keywords 3 Bacterial genomics , Bacterial functionome , Orthology , Horizontal gene Iowa City, USA transfer , clustering , semantic annotation Any reports and responses or comments on the article can be found at the end of the article. Page 1 of 43 F1000Research 2016, 5:1987 Last updated: 09 APR 2020 This article is included in the International Society for Computational Biology Community Journal gateway. Corresponding author: Jasper J. Koehorst ([email protected]) Competing interests: No competing interests were disclosed. Grant information: This work was partly supported by the European Union’s Horizon 2020 research and innovation programme (EmPowerPutida, Contract No. 635536, granted to Vitor A P Martins dos Santos). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Copyright: © 2016 Koehorst JJ et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. How to cite this article: Koehorst JJ, Saccenti E, Schaap PJ et al. Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics [version 1; peer review: 1 approved, 2 approved with reservations] F1000Research 2016, 5:1987 (https://doi.org/10.12688/f1000research.9416.1) First published: 15 Aug 2016, 5:1987 (https://doi.org/10.12688/f1000research.9416.1) Page 2 of 43 F1000Research 2016, 5:1987 Last updated: 09 APR 2020 Introduction To overcome these bottlenecks, protein domains have been sug- Comparative analysis of genome sequences has been pivotal to gested as an alternative for defining groups of functionally equiv- unravel mechanisms shaping bacterial evolution like gene dupli- alent proteins8–10 and have been used to perform comparative cation, loss and acquisition1,2, and helped in shedding light on analyses of Escherichia coli9, Pseudomonas10, Streptococcus11 and pathogenesis and genotype-phenotype associations3,4. for protein functional annotation12,13. A protein domain architecture describes the arrangement of domains contained in a protein and Comparative analysis relies on the identification of sets of orthol- is exemplified in Figure 1. As protein domains capture key struc- ogous and paralogous genes and subsequent transfer of func- tural and functional features, protein domain architectures may be tion to the encoding proteins. Technically orthologs are defined considered to be better proxies to describe functional equivalence as best bi-directional hits (BBH) obtained via pairwise sequence than a global sequence similarity14. The concept of using the domain comparison among multiple species and thus exploits sequence architecture to precisely describe the extent of functional equiva- similarity for functional grouping. Sequence similarity-based lence is exemplified in Figure 2. Moreover, once the probabilistic (SB) methods present a number of shortcomings. First, a gener- domain models have been defined, mining large sets of individual alized minimal alignment length and similarity cut-off need to be genome sequences for their occurrences is a considerably less arbitrarily selected for all, which may hamper proper functional demanding computational task than an exploration of all possible grouping. Second, sequence and function might differ across bi-directional hits between them15,16. evolutionary scales. Protein sequences change faster than protein structure and proteins with same function but with low sequence Building on these observations we aim at exploring the potential similarity have been identified5,6. SB methods may fail to group of domain architecture-based (DAB) methods for large scale func- them hampering a functional comparison. This limitation becomes tional comparative analysis by comparing functionally equivalent even more critical when comparing either phylogenetically distant sets of proteins, defined using domain architectures, with standard genomes or gene sequences that were acquired with horizontal gene clusters of orthogonal proteins obtained with SB methods. We transfer events. compared the SB and DAB approach by analysing i) the retrieved number of singletons (i.e. clusters containing only one protein) Third, time and memory requirements scale quadratically with and ii) the characteristics of the inferred pan- and core-genome the number of genomes to be compared. Recent technological size considering a selection of bacterial genomes (both gram advancements are resulting in thousands of organisms and billions positive and negative) sampled at different taxonomic levels of proteins being sequenced7, rendering SB approaches of limited (species, genus, family, order and phylum). We show that the applicability for comparisons at the larger scales. DAB approach provides a fast and efficient alternative to SB Figure 1. Domain architecture as a formal description of functional equivalence. Although the proteins obviously share a common core, four distinct domain architectures involving six protein domains were observed in (1) Enterobacteriacee, (2) H. pylori, (3) Pseudomonas and (4) Cyanobacteria. Page 3 of 43 F1000Research 2016, 5:1987 Last updated: 09 APR 2020 Figure 2. Relationship between domain architecture based (DAB) and sequence similarity based (SB) clustering. Domains are probabilistic models of amino acids coordinates obtained by hidden Markov modeling (HMM) built from (structure based) multiple sequence alignments. Domain architectures are linear combinations of these domains representing the functional potential of a given protein sequence and constitute the input for DAB clustering. SB-orthology clusters inherit functional annotations via best bi-directional hits above a predefined sequence similarity cut-off score. The information content decreases when moving from the overall function to the sequence level. methods to identify groups of functionally equivalent/related (Listeria monocytogenes and Helicobacter pylori), three genera proteins for comparative genome analysis and that the functional (Streptococcus, Pseudomonas, Bacillus), one family (Enterobacte- pan-genome is more closed in comparison to the sequence based riaceae), one order (Corynebacteriales), and one phylum (Cyano- pan-genome. DAB approaches can complement standardly applied bacteria) were selected. For each a set of 60 genome sequences sequence similarity methods and can pinpoint potential functional were considered, except for L. monocytogenes for which only 26 adaptations. complete genome sequences were available. Maximal diversity among genome sequences was ensured by sampling divergent Methods species (when possible) at each taxonomic level. Genome sequences Genome sequence retrieval were retrieved from the European Nucleotide Archive database Bacterial species were chosen on the basis of the availability of (www.ebi.ac.uk/ena). A full list of genomes analyzed is available fully sequenced genomes in the public domain: two species in the Data availability section. Page 4 of 43 F1000Research 2016,

Protein Domain Architectures Provide a Fast, Efficient and Scalable

Doctoral Thesis 2016 UNDERSTANDING the AROMATIC HYDROCARBON DEGRADATION POTENTIAL of PSEUDOMONAS STUTZERI

Comparative Genomic Analysis of Three Pseudomonas

Pseudomonas Chloritidismutans Sp. Nov., a Non- Denitrifying, Chlorate-Reducing Bacterium

Étude Des Communautés Microbiennes Rhizosphériques De Ligneux Indigènes De Sols Anthropogéniques, Issus D’Effluents Industriels Cyril Zappelini

Pseudomonas Stutzeri Biology Of

16S Rrna Gene Sequence Analysis Relative to Genomovars of Pseudomonas Stutzen' and Proposal of Pseudomonas Balearica Sp

Wo 2010/132341 A2

A Primary Assessment of the Endophytic Bacterial Community in a Xerophilous Moss (Grimmia Montana) Using Molecular Method and Cultivated Isolates

Aquatic Microbial Ecology 80:15

Pyani Documentation Release 0.2.9

Occupational Exposure to Microorganisms When Using Biological Degreasing Stations

Genomic and Phenotypic Insights Point to Diverse Ecological Strategies by Facultative Anaerobes Obtained from Subsurface Coal Seams Silas H