Download It Using the Following Command
Total Page:16
File Type:pdf, Size:1020Kb
UC San Diego UC San Diego Electronic Theses and Dissertations Title New methods for inferring, assessing, and using phylogenetic trees from genomic and microbiome data Permalink https://escholarship.org/uc/item/8nc7x677 Author Sayyari, Erfan Publication Date 2019 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA SAN DIEGO New methods for inferring, assessing, and using phylogenetic trees from genomic and microbiome data A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering (Signal and Image Processing) by Erfan Sayyari Committee in charge: Professor Siavash Mirarab, Chair Professor Vineet Bafna Professor Alon Orlitsky Professor Greg Rouse Professor Behrouz Touri 2019 Copyright Erfan Sayyari, 2019 All rights reserved. The dissertation of Erfan Sayyari is approved, and it is ac- ceptable in quality and form for publication on microfilm and electronically: Chair University of California San Diego 2019 iii DEDICATION I dedicate my dissertation to my beloved mother and father. iv EPIGRAPH Nothing in Biology Makes Sense Except in the Light of Evolution. —Theodosius Dobzhansky v TABLE OF CONTENTS Signature Page . iii Dedication . iv Epigraph . v Table of Contents . vi List of Figures . x List of Tables . xiv Acknowledgements . xv Vita . xviii Abstract of the Dissertation . xx Chapter 1 Introduction . 1 Chapter 2 Anchoring quartet-based phylogenetic distances and applications to species tree reconstruction . 8 2.1 Background . 9 2.2 Methods . 11 2.2.1 General theoretical results . 12 2.2.2 Tree inference algorithms . 19 2.2.3 DISTIQUE (theory) . 22 2.2.4 DISTIQUE (algorithmic design) . 23 2.2.5 Computing pseudo-counts for allpairs-max . 24 2.2.6 Difficulties with long branches and need for majority con- sensus . 25 2.2.7 Experimental setup . 33 2.3 Results . 36 2.3.1 Comparison between DISTIQUE variants . 36 2.3.2 DISTIQUE versus other methods . 43 2.3.3 Biological results . 47 2.4 Discussion . 49 2.5 Conclusions . 51 2.6 Acknowledgements . 51 vi Chapter 3 Fast coalescent-based computation of local branch support from quartet frequencies . 56 3.1 Introduction . 57 3.2 New Approaches . 60 3.2.1 Calculation of local posterior probability . 61 3.2.2 Calculation of branch length . 69 3.2.3 Calculation of quartet support . 72 3.2.4 Other considerations . 73 3.3 Materials and Methods . 73 3.3.1 Datasets . 73 3.3.2 Evaluation procedure . 78 3.4 Results . 79 3.4.1 A-200 dataset . 79 3.4.2 Avian . 89 3.4.3 Posterior and MLBS . 90 3.4.4 Biological datasets . 95 3.5 Discussions . 105 3.5.1 Gene tree estimation error . 106 3.5.2 High support despite high discordance . 108 3.5.3 Limitations and future work . 109 3.6 Acknowledgements . 109 Chapter 4 Testing for polytomies in phylogenetic species trees using quartet fre- quencies . 110 4.1 Introduction . 111 4.2 Materials and Methods . 115 4.2.1 Background . 115 4.2.2 A statistical test of polytomy . 115 4.2.3 Evaluations . 119 4.3 Results . 123 4.3.1 Simulated datasets . 123 4.3.2 Biological dataset . 126 4.4 Discussion . 132 4.4.1 Power . 132 4.4.2 Divergence from the MSC model and connections to localPP135 4.4.3 The effective number of gene trees . 137 4.4.4 Interpretation . 140 4.5 Conclusions . 141 4.6 Acknowledgements . 141 vii Chapter 5 Fragmentary gene sequences negatively impact gene tree and species tree reconstruction . 142 5.1 Introduction . 143 5.2 Results . 147 5.2.1 Simulation Results . 147 5.2.2 Empirical Results . 151 5.3 Discussion . 157 5.3.1 Impacts of fragments . 157 5.3.2 Methodological limitations . 162 5.4 Materials and Methods . 163 5.4.1 Analysis pipeline for insect dataset . 163 5.4.2 Simulation procedure . 184 5.4.3 Statistical tests . 186 5.5 Acknowledgements . 186 Chapter 6 DiscoVista: interpretable visualizations of gene tree discordance . 187 6.1 Introduction . 187 6.2 Results . 190 6.3 Acknowledgements . 206 Chapter 7 TADA . 207 7.1 Introduction . 208 7.2 The TADA method . 210 7.2.1 Background and Notations . 210 7.2.2 Generative model used in TADA . 211 7.2.3 Data augmentation procedure . 215 7.2.4 Balancing . 220 7.3 Experimental setup . 222 7.3.1 Datasets . 222 7.3.2 Experiments . 223 7.3.3 Evaluation procedure. 224 7.3.4 Method details . 224 7.4 Results . 228 7.4.1 E1: complete datasets . 228 7.4.2 E2: unbalanced class labels . 230 7.4.3 E3: biased class labels . 234 7.5 Discussions and conclusions . 234 7.5.1 Using a Beta-Binomial distribution with two parameters 237 7.6 Acknowledgements . 243 Appendix A Anchored distances for quartet-based estimation of phylogenetic trees and applications to coalescent-based analyses . 244 A.1 Commands and version numbers . 244 viii Appendix B Supplementary Material for “Fast coalescent-based computation of local branch support from quartet frequencies” . 246 B.1 Commands and version numbers . 247 B.1.1 ASTRAL . 247 B.1.2 MP-EST . 247 Appendix C Supplementary Material for “DiscoVista: interpretable visualizations of gene tree discordance” . 248 C.1 Structure of parameter files . 248 C.2 Installation, Commands, and Usage . 249 Bibliography . 251 ix LIST OF FIGURES Figure 2.1: Possible ways of adding anchors to a quartet. 14 Figure 2.2: An example where long branches can cause problems. See section 2.3 for descriptions. 24 Figure 2.3: Accuracy of different implementations of DISTIQUE for the Mammalian dataset. 37 Figure 2.4: Accuracy of different implementations of DISTIQUE for the Avian dataset. 38 Figure 2.5: Impact of number of rounds of anchor sampling on accuracy. 40 Figure 2.6: Impact of number of rounds of anchor sampling on accuracy. 41 Figure 2.7: Impact of distance method on accuracy. 42 Figure 2.8: DISTIQUE versus other methods on (A) simPhy-size and (B) simPhy- ILS datasets using estimated gene trees. Boxes: (A) number of genes and (B) levels of ILS. 44 Figure 2.9: Running time comparisons on the simPhy-size datasets with 1000 genes. 46 Figure 2.10: Running times of DISTIQUE versus other methods for the simPhy-size dataset. 46 Figure 2.11: The accuracy of methods on Avian (A) and Mammalian (B) datasets using estimated gene trees. Left: number of genes is fixed (1000 for avian, 200 for mammalian) and ILS levels change. 48 Figure 2.12: Species trees generated using DISTIQUE on Avian biological dataset [214]. 49 Figure 2.13: Species trees generated using ASTRID (NJst) and ASTRAL on Avian biological dataset [214]..