Topology of Reticulate Evolution
Total Page:16
File Type:pdf, Size:1020Kb
Topology of Reticulate Evolution Kevin Joseph Emmett Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2016 ⃝c 2016 Kevin Joseph Emmett All Rights Reserved ABSTRACT Topology of Reticulate Evolution Kevin Joseph Emmett The standard representation of evolutionary relationships is a bifurcating tree. How- ever, many types of genetic exchange, collectively referred to as reticulate evolution, involve processes that cannot be modeled as trees. Increasing genomic data has pointed to the prevalence of reticulate processes, particularly in microorganisms, and underscored the need for new approaches to capture and represent the scale and frequency of these events. This thesis contains results from applying new techniques from applied and computa- tional topology, under the heading topological data analysis, to the problem of characterizing reticulate evolution in molecular sequence data. First, we develop approaches for analyz- ing sequence data using topology. We propose new topological constructions specific to molecular sequence data that generalize standard constructions such as Vietoris-Rips. We draw on previous work in phylogenetic networks and use homology to provide a quantitative measure of reticulate events. We develop methods for performing statistical inference using topological summary statistics. Next, we apply our approach to several types of molecular sequence data. First, we examine the mosaic genome structure in phages. We recover inconsistencies in existing morphology-based taxonomies, use a network approach to construct a genome-based repre- sentation of phage relationships, and identify conserved gene families within phage popu- lations. Second, we study influenza, a common human pathogen. We capture widespread patterns of reassortment, including nonrandom cosegregation of segments and barriers to subtype mixing. In contrast to traditional influenza studies, which focus on the phyloge- netic branching patterns of only the two surface-marker proteins, we use whole-genome data to represent influenza molecular relationships. Using this representation, we identify un- expected relationships between divergent influenza subtypes. Finally, we examine a set of pathogenic bacteria. We use two sources of data to measure rates of reticulation in both the core genome and the mobile genome across a range of species. Network approaches are used to represent the population of S. aureus and analyze the spread of antibiotic resistance genes. The presence of antibiotic resistance genes in the human microbiome is investigated. Contents List of Figures iii List of Tables v 1 Introduction 1 1.1 Molecular Evolution and the Tree Paradigm .................. 2 1.2 Reticulate Processes and the Universal Tree .................. 6 1.3 Evolution as a Topological Space ........................ 8 1.4 Thesis Organization ................................ 11 2 Background 13 2.1 Evolutionary Biology and Genomics ....................... 13 2.1.1 Genes and Genomes ........................... 14 2.1.2 Evolutionary Processes .......................... 14 2.1.3 Mathematical Models of Evolution ................... 19 2.1.4 Phylogenetic Methods .......................... 22 2.2 Topological Data Analysis ............................ 29 2.2.1 Preliminaries ............................... 33 2.2.2 Persistent Homology ........................... 38 2.2.3 Mapper .................................. 48 2.3 Applying TDA to Molecular Sequence Data . 50 2.3.1 Topology of Tree-like Metrics ...................... 51 2.3.2 The Fundamental Unit of Reticulation . 51 2.3.3 A Complete Example ........................... 53 2.3.4 The Space of Trees, Revisited ...................... 54 I Theory 57 3 Quantifying Reticulation Using Topological Complex Constructions Be- yond Vietoris-Rips 59 3.1 Introduction .................................... 59 3.2 Sensitivity of the Vietoris-Rips Construction . 60 3.3 The Median Complex Construction ....................... 62 3.4 Čech Complex Construction as an Optimization Problem . 67 i 3.5 Conclusions .................................... 70 4 Parametric Inference using Persistence Diagrams 73 4.1 Introduction .................................... 73 4.2 The Coalescent Process .............................. 75 4.3 Statistical Model for Coalescent Inference .................... 76 4.4 Conclusions .................................... 81 II Applications 83 5 Phage Mosaicism 85 5.1 Introduction .................................... 85 5.2 Data ........................................ 88 5.3 Measuring Phage Mosaicism with Persistent Homology . 89 5.4 Representing Phage Relationships with Mapper . 91 5.5 Conclusions .................................... 96 6 Reassortment in Influenza Evolution 101 6.1 Introduction .................................... 101 6.2 Influenza Virology ................................ 103 6.3 Influenza Reassortment .............................. 103 6.4 Nonrandom Association of Genome Segments . 106 6.5 Multiscale Flu Reassortment . 110 6.6 Conclusions .................................... 111 7 Reticulate Evolution in Pathogenic Bacteria 113 7.1 Introduction .................................... 113 7.2 Evolutionary Scales of Recombination in the Core Genome . 115 7.3 Protein Families as a Proxy for Genome Wide Reticulation . 118 7.4 Antibiotic Resistance in Staphylococcus aureus . 121 7.5 Microbiome as a Reservoir of Antibiotic Resistance Genes . 122 7.6 Conclusions .................................... 124 8 Conclusions 127 Bibliography 129 ii List of Figures 1.1 Charles Darwin and the Evolutionary Tree ..................... 3 1.2 Carl Woese’s Three Domain Tree of Life ...................... 5 1.3 Ford Doolittle’s Reticulate Tree of Life ....................... 8 1.4 Topological equivalence of the coffee mug and the donut ............. 9 1.5 Treelike and reticulate phylogenies ......................... 11 2.1 Viral recombination and reassortment ........................ 16 2.2 Three modes of bacterial reticulation ........................ 17 2.3 Two models for simulating evolutionary data .................... 21 2.4 Rooted vs. Unrooted trees .............................. 23 2.5 The four point condition for additivity ....................... 25 2.6 Counting Tree Topologies .............................. 27 2.7 Tree Space ....................................... 29 2.8 Example of a Splits Network ............................. 30 2.9 Simplices: The building blocks of topological complexes . 34 2.10 Simplicial Complex: A discrete topological space . 34 2.11 Relationship between the chain group, cycle group, and boundary group . 36 2.12 Simplicial Homology ................................. 36 2.13 Vietoris-Rips and ČechComplexes .......................... 38 2.14 Multiscale Topological Structure ........................... 41 2.15 Barcode Diagram for the Two Circles Example ................... 42 2.16 The Persistence Pipeline ............................... 42 2.17 Level-Set Filtrations ................................. 43 2.18 The stability of persistent homology under noise . 45 2.19 Confidence sets on the persistence diagram ..................... 46 2.20 Persistence landscape of the persistence diagram . 47 2.21 Dimensionality Reduction for EDA ......................... 48 2.22 The Mapper Algorithm ................................ 49 2.23 Fundamental Unit of Reticulation .......................... 53 2.24 Applying TDA to Molecular Sequence Data .................... 55 2.25 Topology expands tree space to include reticulation . 56 3.1 Two examples of reduced sensitivity of the Vietoris-Rips Complex . 62 3.2 The Median Operation on Binary Sequences .................... 63 3.3 The Median Complex Recovers Reticulation in Example One . 64 iii 3.4 The Median Complex Recovers Reticulation in Example Two . 65 3.5 Recombination in D. melanogaster ......................... 67 3.6 Species hybridization in genus Ranunculus ..................... 68 4.1 Barcode Diagrams for Three Coalescent Simulations . 75 4.2 Distributions of statistics defined on the H1 persistence diagram for different model parameters ................................... 77 4.3 Fitting the Poisson parameter ζ(ρ, n) for the expected number of invariants, K¯ . 79 4.4 Fitting the distribution of birth times to a Gamma . 80 4.5 Fitting the length distribution to an exponential . 81 4.6 Inference of recombination rate ρ using topological information . 82 4.7 Comparing traditional estimators of ρ to the TDA-estimator . 82 5.1 Inconsistency of morphological classifications in bacteriophage . 87 5.2 Summary annotations of 306 bacteriophage strains used in this study . 89 5.3 S306 Bacteriophage Barcode Diagram ........................ 91 5.4 Caudovirales H0 dendrogram ............................ 92 5.5 Caudovirales Barcode Diagrams ........................... 93 5.6 Phage Mapper Network ............................... 94 5.7 Phage Network Colored by Taxonomic Family ................... 95 5.8 Modularity Scores for Different Divisions of the Phage Network . 95 5.9 Phage Network Colored by Host ........................... 97 5.10 Phage Network with MCL Clustering ........................ 99 6.1 Structure of an influenza virus particle . 104 6.2 Influenza Dataset Statistics ............................. 105 6.3 Influenza Genome Segment Barcodes . 107 6.4 Influenza Concatenated Genome Barcode . 108 6.5 Influenza Networks By HA Subtype . 109 6.6 Influenza Nonrandom Reassortment . 110 6.7 H1