Functional Inference from Orthology and Domain Architecture

Functional Inference from Orthology and Domain Architecture Mateusz Kaduk Academic dissertation for the Degree of Doctor of Philosophy in Biochemistry towards Bioinformatics at Stockholm University to be publicly defended on Tuesday 12 June 2018 at 14.00 in Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B. Abstract Proteins are the basic building blocks of all living organisms. They play a central role in determining the structure of living beings and are required for essential chemical reactions. One of the main challenges in bioinformatics is to characterize the function of all proteins. The problem of understanding protein function can be approached by understanding their evolutionary history. Orthology analysis plays an important role in studying the evolutionary relation of proteins. Proteins are termed orthologs if they derive from a single gene in the species' last common ancestor, i.e. if they were separated by a speciation event. Orthologs are useful because they retain their function more often than other homologs. Inference of a complete set of orthologs for many species is computationally intensive. Currently, the fastest algorithms rely on graph-based approaches, which compare all-vs-all sequences and then cluster top hits into groups of orthologs. The initial step of performing all-vs-all comparisons is usually the primary computational challenge as it scales quadratically with the number of species. A new, more scalable and less computationally demanding method was developed to solve this problem without sacrificing accuracy. The Hieranoid 2 algorithm reduces computational complexity to almost linear by overcoming the necessity to perform all-vs-all similarity searches. The algorithm progresses along a known species tree, from leaves to root. Starting at the leaves, ortholog groups are predicted conventionally and then summarized at internal nodes to form pseudo-species. These pseudo-species are then re-used to search against other (pseudo-)species higher in the tree. This way the algorithm aggregates new ortholog groups hierarchically. The hierarchy is a natural structure to store and view large multi-species ortholog groups, and provides a complete picture of inferred evolutionary events. To facilitate explorative analysis of hierarchical groups of orthologs, a new online tool was created. The HieranoiDB website provides precomputed hierarchical groups of orthologs for a set of 66 species. It allows the user to search for orthology assignments using protein description, protein sequence, or species. Evolutionary events and meta information is added to the hierarchical groups of orthologs, which are shown graphically as interactive trees. This representation allows exploring, searching, and easier visual inspection of multi-species ortholog groups. The majority of orthology prediction methods focus on treating the whole protein sequence as a single evolutionary unit. However, proteins are often composed of individual units, called protein domains, that can have different evolutionary histories. To extend the full sequence based methodology to a domain-aware method, a new approach called Domainoid is proposed. Here, domains are extracted from full-length sequences and subjected to orthology inference. This allows Domainoid to find orthology that would be missed by a full sequence approach. Networks are a convenient graphical representation for showing a large number of functional associations between genes or proteins. They allow various analyses of graph properties, and can help visualize complex relationships. A framework for inferring comprehensive functional association networks was developed, called FunCoup. A major difference compared to other networks is FunCoup's extensive use of orthology relationships between species, which significantly boosts its coverage. Using naïve Bayesian classifiers to integrate 10 different evidence types and orthology transfer, FunCoup captures functional associations of many types, and provides comprehensive networks for 17 species across five gold- standards. Keywords: Orthology, Functional coupling networks, Association networks, Hierarchical groups of orthologs. Stockholm 2018 http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-155096 ISBN 978-91-7797-252-5 ISBN 978-91-7797-253-2 Department of Biochemistry and Biophysics Stockholm University, 106 91 Stockholm FUNCTIONAL INFERENCE FROM ORTHOLOGY AND DOMAIN ARCHITECTURE Mateusz Kaduk Functional Inference from Orthology and Domain Architecture Mateusz Kaduk ©Mateusz Kaduk, Stockholm University 2018 ISBN print 978-91-7797-252-5 ISBN PDF 978-91-7797-253-2 Printed in Sweden by Universitetsservice US-AB, Stockholm 2018 Distributor: Department of Biochemistry and Biophysics, Stockholm University Abstract Proteins are the basic building blocks of all living organisms. They play a central role in determining the structure of living beings and are required for essential chemical reactions. One of the main challenges in bioinformatics is to characterize the function of all proteins. The problem of understanding protein function can be approached by understanding their evolutionary history. Or- thology analysis plays an important role in studying the evolutionary relation of proteins. Proteins are termed orthologs if they derive from a single gene in the species’ last common ancestor, i.e. if they were separated by a speciation event. Orthologs are useful because they retain their function more often than other homologs. Inference of a complete set of orthologs for many species is computationally intensive. Currently, the fastest algorithms rely on graph-based approaches, which compare all-vs-all sequences and then cluster top hits into groups of orthologs. The initial step of performing all-vs-all comparisons is usually the primary computational challenge as it scales quadratically with the number of species. A new, more scalable and less computationally demanding method was developed to solve this problem without sacrificing accuracy. The Hieranoid 2 algorithm reduces computational complexity to almost linear by overcoming the necessity to perform all-vs-all similarity searches. The algorithm progresses along a known species tree, from leaves to root. Starting at the leaves, ortholog groups are predicted conventionally and then summarized at internal nodes to form pseudo-species. These pseudo-species are then re-used to search against other (pseudo-)species higher in the tree. This way the algorithm aggregates new ortholog groups hierarchically. The hierarchy is a natural structure to store and view large multi-species ortholog groups, and provides a complete picture of inferred evolutionary events. To facilitate explorative analysis of hierarchical groups of orthologs, a new online tool was created. The HieranoiDB website provides precomputed hierarchical groups of orthologs for a set of 66 species. It allows the user to search for orthology assignments using protein description, protein sequence, or species. Evolutionary events and meta information is added to the hierarchical groups of orthologs, which are shown graphically as interactive trees. This representation allows exploring, searching, and easier visual inspection of multi-species ortholog groups. The majority of orthology prediction methods focus on treating the whole protein sequence as a single evolutionary unit. However, proteins are often composed of individual units, called protein domains, that can have different evolutionary histories. To extend the full sequence based methodology to a domain-aware method, a new approach called Domainoid is proposed. Here, domains are extracted from full-length sequences and subjected to orthology inference. This allows Domainoid to find orthology that would be missed by a full sequence approach. Networks are a convenient graphical representation for showing a large number of functional associations between genes or proteins. They allow various analyses of graph properties, and can help visualize complex relationships. A framework for inferring comprehensive functional association networks was developed, called FunCoup. A major difference compared to other networks is FunCoup’s extensive use of orthology relationships between species, which significantly boosts its coverage. Using naïve Bayesian classifiers to integrate 10 different evidence types and orthology transfer, FunCoup captures functional associations of many types, and provides comprehensive networks for 17 species across five gold-standards. This thesis is dedicated to my grandfather Franciszek Dmitrowski. List of Papers The following papers, referred to in the text by their Roman numerals, are included in this thesis. PAPER I: Improved orthology inference with Hieranoid 2 Mateusz Kaduk, Erik Sonnhammer Bioinformatics, 8, 1154- 1159 (2017). DOI: https://doi.org/10.1093/bioinformatics/btw774 PAPER II: HieranoiDB: a database of orthologs inferred by Hieranoid Mateusz Kaduk, Christian Riegler, Oliver Lemp, Erik Sonnham- mer Nucleic Acids Research, 1, 1154-1159 (2016). DOI: https://doi.org/10.1093/nar/gkw923 PAPER III: Domainoid: Domain-oriented orthology inference Mateusz Kaduk, Kristoffer Forslund, Erik Sonnhammer Inter- national Society for Computational Biology, Manuscript, (2018). PAPER IV: FunCoup 4: new species, data, and visualization Christoph Ogris†, Dimitri Guala†, Mateusz Kaduk†, Erik Sonnham- mer Nucleic Acids Research, 46, D601-D607 (2017). DOI: https://doi.org/10.1093/nar/gkx1138 † Contributed equally. Reprints were made with

Functional Inference from Orthology and Domain Architecture

Applied Category Theory for Genomics – an Initiative

Gene Prediction: the End of the Beginning Comment Colin Semple

Meeting Review: Bioinformatics and Medicine – from Molecules To

Prospects & Overviews Orthology Prediction Methods: a Quality Assessment Using Curated Protein Families

UNIVERSITY of CALIFORNIA SAN DIEGO Making Sense of Microbial

Methodology for Predicting Semantic Annotations of Protein Sequences by Feature Extraction Derived of Statistical Contact Potentials and Continuous Wavelet Transform

Standardised Benchmarking in the Quest for Orthologs

Complementary Symbiont Contributions to Plant Decomposition in a Fungus-Farming Termite

Peregrine and Saker Falcon Genome Sequences Provide Insights Into Evolution of a Predatory Lifestyle

Multiscale Modeling in Systems Biology

EMBO Encounters Issue43.Pdf

Assigning Folds to the Proteins Encoded by the Genome of Mycoplasma Genitalium (Protein Fold Recognition͞computer Analysis of Genome Sequences)