Bromberg-Dissertation-2016

ESTABLISHING A TRUSTWORTHY FIRST APPROXIMATION FOR EVOLUTIONARY DISTANCES APPROVED BY SUPERVISORY COMMITTEE Zbyszek Otwinowski Nick V. Grishin Yuh Min Chook Lora Hooper Khuloud Jaqaman DEDICATION I am thankful to everyone who helped and supported me during my time performing this work. ESTABLISHING A TRUSTWORTHY FIRST APPROXIMATION FOR EVOLUTIONARY DISTANCES by RAQUEL BROMBERG DISSERTATION Presented to the Faculty of the Graduate School of Biomedical Sciences The University of Texas Southwestern Medical Center at Dallas In Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY The University of Texas Southwestern Medical Center at Dallas Dallas, Texas May, 2016 Copyright by RAQUEL BROMBERG, 2016 All Rights Reserved ESTABLISHING A TRUSTWORTHY FIRST APPROXIMATION FOR EVOLUTIONARY DISTANCES RAQUEL BROMBERG, Ph.D. The University of Texas Southwestern Medical Center at Dallas, 2016 ZBYSZEK OTWINOWSKI, Ph.D. NICK V. GRISHIN, Ph.D. Advances in sequencing have generated a large number of complete genomes. Traditionally, phylogenetic analysis relies on alignments of orthologs, but defining orthologs and separating them from paralogs is a complex task that may not always be suited to the large datasets of the future. An alternative to traditional, alignment-based approaches are whole-genome, alignment-free methods. These methods are scalable and require minimal manual intervention. I developed SlopeTree, a new alignment-free method that estimates evolutionary distances by measuring the decay of exact sub-sequence matches as a function of match length. SlopeTree corrects for horizontal gene transfer, for composition variation and low complexity sequences, and for branch-length nonlinearity caused by multiple v mutations at the same site. SlopeTree also includes several optional features for removing mobile elements from proteomes, for reducing proteomes to their conserved core, for automatically identifying poor quality proteomes in large inputs, and for explicitly identifying pairs of organisms that have horizontally transferred genes and then identifying those genes. I tested SlopeTree on large and diverse sets of bacteria and archaea, and I also applied it at the strain level. I compared the SlopeTree trees to the NCBI taxonomy, to trees based on concatenated alignments, and to trees produced by other alignment-free methods. The results were consistent with current knowledge about prokaryotic evolution. I assessed differences in tree topology over different methods and settings and found that the majority of bacteria and archaea have a core set of proteins that evolves by descent. In trees built from complete genomes rather than from sets of core genes, I observed some grouping by phenotype rather than phylogeny. In general, SlopeTree generates sensible topologies which are relatively stable between whole proteome and reduced proteome inputs, which validates the concept of species and phyla as having a core proteome evolving by descent, but not necessarily coevolving with the ribosome and its proteins. TABLE OF CONTENTS ABSTRACT .........................................................................................................................v PRIOR PUBLICATIONS ...................................................................................................xiv LIST OF TABLES .............................................................................................................. xv LIST OF FIGURES ............................................................................................................xvi LIST OF APPENDICES ................................................................................................. xviii CHAPTER ONE: INTRODUCTION ................................................................................. 20 1.1 DIVERSITY OF CELLULAR LIFE .......................................................................... 20 Domains of cellular life ............................................................................................... 20 Horizontal gene transfer ............................................................................................... 21 1.2 HISTORY OF PHYLOGENETICS FOR PROKARYOTES ...................................... 22 Prokaryotic phylogeny before molecular phylogenetics ............................................... 22 Molecular phylogenetics .............................................................................................. 23 Problems with traditional methods: different genes tell different stories ....................... 25 1.3 THE GENOMIC DATA FLOOD .............................................................................. 26 Advances in sequencing technology............................................................................. 26 From highly curated data to Big Data........................................................................... 27 How many species are there? ....................................................................................... 28 1.4 ALIGNMENT-FREE WHOLE-GENOME METHODS ............................................ 29 Types of methods ........................................................................................................ 29 Survey of current alignment-free methods.................................................................... 30 vii Composition Vector Trees (CVTree) ................................................................................... 30 Feature Frequency Profiles (FFP)......................................................................................... 31 D2 statistics ......................................................................................................................... 32 Co-phylog ........................................................................................................................... 33 Spaced Words (SW) ............................................................................................................ 35 Average Common Substring (ACS) ..................................................................................... 36 Kmacs ................................................................................................................................. 36 ALFRED-G ......................................................................................................................... 37 The Kr method .................................................................................................................... 38 1.5 WHY IS PHYLOGENETICS IMPORTANT? ........................................................... 38 Understanding how life evolves ................................................................................... 39 Phylogenetics underlies taxonomy ............................................................................... 39 Microbiome characterization ....................................................................................... 40 1.6 DISSERTATION OVERVIEW ................................................................................. 40 SlopeTree overview ..................................................................................................... 41 Horizontal gene transfer and alignment-free methods................................................... 42 The SlopeTree package ................................................................................................ 43 Assessing SlopeTree .................................................................................................... 44 CHAPTER TWO ................................................................................................................ 48 2.1 MOTIVATION ............................................................................................................. 48 2.2 SLOPETREE OVERVIEW ........................................................................................... 49 The main SlopeTree algorithm......................................................................................... 49 2.3 IMPLEMENTATION.................................................................................................... 50 Assigning unique ordinals to proteomes and proteins ................................................... 50 Assembling the k-mer lists ........................................................................................... 51 Removing low complexity sequences .......................................................................... 52 Counting unique matches................................................................................................. 53 Match-counting algorithm 1 ......................................................................................... 53 Match-counting algorithm 2 ......................................................................................... 54 Scoring matches .............................................................................................................. 55 The SlopeTree match-count histogram ............................................................................. 57 Background subtraction ................................................................................................... 58 Identifying left and right bounds for the SlopeTree data ................................................... 60 Estimating evolutionary distances .................................................................................... 62 Fitting the data ............................................................................................................

Load more