Rooting Phylogenies
Total Page:16
File Type:pdf, Size:1020Kb
Rooting phylogenies Dissertation Submitted in fulfilment of the requirements for the degree Doktor der Naturwissenschaften (Dr. rer. Nat.) in the Faculty of Mathematics and Natural Sciences of the Christian-Albrechts University Kiel Submitted by Fernando Domingues Kümmel Tria Kiel, August 2018 1 2 First examiner: Prof. Dr. Tal Dagan Second examiner: Prof. Dr. Bernhard Haubold Date of the oral examination: 12.10.2018 3 4 Declaration I hereby declare that the thesis entitled “Rooting Phylogenies“ has been carried out in the Institute of General Microbiology at the Christian-Albrechts University of Kiel, Kiel, Germany, under the guidance of Prof. Dr. Tal Dagan and Dr. Giddy Landan. The work is original and has not been submitted in part or full by me for any degree at any other University. I further declare that the material obtained from other sources has been duly acknowledged in the thesis. My work has been produced in compliance to the principles of good scientific practice in accordance with the guidelines of the German science foundation. Kiel, 01.08.2018 ___________________________ Fernando Domingues Kümmel Tria 5 6 Table of Contents 1 Abstract ..................................................................................................................................... 8 2 Zusammenfasung (abstract in German) .................................................................................... 9 3 Introduction ............................................................................................................................. 10 4 Rooting trees with minimal ancestor deviation ......................................................................... 13 4.1 Results ............................................................................................................................. 13 4.1.1 Algorithm ................................................................................................................... 13 4.1.2 Performance ............................................................................................................. 16 4.2 Conclusions ..................................................................................................................... 20 4.3 Methodology..................................................................................................................... 21 4.3.1 Datasets preparation ................................................................................................. 21 4.3.2 Detailed algorithm ..................................................................................................... 22 5 Rooting species trees .............................................................................................................. 26 5.1 Terminology ..................................................................................................................... 26 5.2 Results ............................................................................................................................. 28 5.2.1 Demonstrative datasets ............................................................................................. 28 5.2.2 Phylogenomic rooting by majority rule ....................................................................... 28 5.2.3 The root support test for alternative root partitions .................................................... 31 5.2.4 Phylogenetic signal from partial and multi-copy gene trees ....................................... 38 5.2.5 Root inferences in biological datasets ....................................................................... 41 5.3 Conclusions ..................................................................................................................... 60 5.4 Methodology..................................................................................................................... 64 6 Outlook.................................................................................................................................... 65 7 References .............................................................................................................................. 66 8 Acknowledgments ................................................................................................................... 69 9 Supplementary ........................................................................................................................ 70 7 1 Abstract Ancestor-descendent relations play a cardinal role in evolutionary theory. Those relations are determined by rooting phylogenetic trees. Existing rooting methods are hampered by evolutionary rate heterogeneity among lineages or the unavailability of auxiliary phylogenetic information. In this thesis I propose two novel rooting approaches, each approach applicable to address different research questions. In section 4, I introduce a general method to infer the roots of phylogenetic trees, without assuming prior knowledge about phylogenetic relations among the studied lineages. The method, named Minimal Ancestor Deviation (MAD), takes as input any type of unrooted tree and infers the most likely root using branch length and topological information contained in the tree. When applied to biological datasets, I show that MAD is more accurate and more robust to known confounding factors than existing methods. In the next sections, I use Ancestor Deviations (r) in a phylogenomic context to infer the roots of species trees, using whole genomes for the inferences. The approach is grounded in a statistical framework that evaluates all candidate roots of the underlying species tree and formally tests the relative strength of competing root hypotheses. This phylogenomic rooting approach uses information from multiple gene trees and does not require knowledge of the species tree, making it suitable for root inferences even in face of reticulated evolution. When applied to biological datasets, our approaches reveal evidence for: 1) the origin of photosynthesis in the ocean; 2) the anaerobic and chemolithoautotrophic lifestyle of the last common ancestor of proteobacteria; and 3) the chimeric nature of modern archaea genomes. 8 2 Zusammenfasung (abstract in German) Die Beziehungen zwischen Vorfahren und ihren Nachfahren spielen eine entscheidende Rolle in der Evolutionstheorie. Diese Beziehungen werden durch die Bestimmung der Wurzeln von phylogenetischen Stammbäumen ermittelt. Solche Wurzeln können durch verschiedene Methoden festgelegt werden, deren Anwendung jedoch durch heterogene Evolutionsraten oder fehlende phylogenetische Information sehr eingeschränkt ist. In dieser Arbeit stelle ich zwei neue Wurzel-Methoden vor, welche auf unterschiedliche Fragestellungen anwendbar sind. Der erste Ansatz wird in Kapitel 4 dieser Arbeit vorgestellt und ist eine Methode zur generellen Bestimmung von phylogenetischen Wurzeln. Sie ist ohne Vorwissen über die zu untersuchenden Abstammungen anwendbar. Die neue Methode, genannt‚ Minimal Ancestor Deviation (kurz MAD), kann mit jeglicher Art von ungewurzeltem, phylogenetischem Baum durchgeführt werden. In der MAD-Methode wird mit Hilfe der Ast-Längen und topologischer Informationen des Stammbaumes die wahrscheinlichste Wurzel bestimmt. Ich zeige weiterhin, dass die MAD-Methode bei biologischen Daten ein genaueres Ergebnis produziert als bisherige Methoden und sich stabiler gegenüber Störfaktoren verhält. Im anschließenden Kapitel verwende ich die Vorfahren-Abweichungs-Statistik (Ancestor Deviations, r) in einem phylogenetischen Kontext um die Wurzel von Speziesbäumen anhand von kompletten Genomen zu bestimmen. Diese Methode basiert auf einem statistischen Vorgehen, bei welchem alle möglichen Wurzeln eines Stammbaumes direkt verglichen und evaluiert werden. Zur Bestimmung der Abstammungswurzel werden hier die Informationen von mehreren Genbäumen und nicht nur die einzelner Speziesbäume berücksichtigt. Dadurch ist diese Methode auch anwendbar, wenn eine netzartige Evolution vorliegt. Mit den neuen Methoden aus dieser Arbeit zeige ich abschließend, 1) dass der Ursprung von Photosynthese in den Ozeanen liegt, 2) dass der letzte gemeinsame Vorfahr von Proteobakterien eine anaerobe und chemolithoautotrophische Lebensweise hatte und 3) dass Archaeen chimäre Genome, zusammengesetzt aus unterschiedlichen Spezies, aufweisen. 9 3 Introduction Phylogenetic trees are used to describe and investigate the evolutionary relations between entities. A phylogenetic tree is an acyclic bifurcating graph whose topology is inferred from a comparison of the sampled entities. In the field of molecular evolution, phylogenetic trees are mostly reconstructed from DNA or protein sequences (Fitch and Margoliash 1967). Other types of data have also been used to reconstruct phylogenetic trees, including species phenotypic characteristics, biochemical makeup as well as language vocabularies (for a historical review see (Ragan 2009)). In most tree reconstruction methods the inferred phylogeny is unrooted, and the ancestral relations between the taxonomic units are not resolved. The determination of ancestor-descendant relations in an unrooted tree is achieved by the inference of a root node, which a priori can be located on any of the branches of the unrooted tree. The root represents the last common ancestor (LCA) from which all operational taxonomic units (OTUs) in the tree descended. Several root inference methods have been described in the literature, differing in the type of data that can be analyzed, the assumptions regarding