Hierarchical Phylogeny Construction
Iowa State University Capstones, Theses and Dissertations
Graduate Theses and Dissertations
2019

Hierarchical phylogeny construction
Anindya Das
Iowa State University

Part of the Bioinformatics Commons, and the Computer Sciences Commons

Recommended Citation
Das, Anindya, "Hierarchical phylogeny construction" (2019). Graduate Theses and Dissertations. 17433.
https://lib.dr.iastate.edu/etd/17433

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository.

Hierarchical phylogeny construction

by

Anindya Das

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Major: Computer Science

Program of Study Committee:
Xiaoqiu Huang, Major Professor
David Fernandez-Baca
Oliver Eulenstein
Peng Liu
Dennis V Lavrov

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University
Ames, Iowa
2019

Copyright © Anindya Das, 2019. All rights reserved.

DEDICATION

I would like to dedicate this dissertation to my wife Soma and to my parents without whose support I would not have been able to complete this work. I would also like to thank my friends and family for their loving guidance and continuous encouragement and assistance during the writing of this work.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1. OVERVIEW
  1.1 Introduction
  1.2 Outline
    1.2.1 Background And Literature Review
    1.2.2 Identification Of Informative Regions For Construction Of Phylogeny
    1.2.3 Hierarchical Phylogeny Construction
    1.2.4 Summary And Conclusion

CHAPTER 2. BACKGROUND
  2.1 Phylogeny
    2.1.1 Branch Length
    2.1.2 Bipartition
  2.2 Genome Sequence Alignment
    2.2.1 Construction Of Alignment From SNPs
  2.3 Construction Of Phylogeny From Sequence Alignment
  2.4 Models Of DNA Evolution
    2.4.1 Jukes-Cantor Model
    2.4.2 K80 Model
    2.4.3 GTR Model
  2.5 Methods For Computing Phylogeny From Sequence Alignment
    2.5.1 Distance Based Methods
    2.5.2 Maximum Parsimony Method
    2.5.3 Bayesian Method
    2.5.4 Maximum Likelihood Method
  2.6 Programs For Maximum Likelihood Phylogeny Construction
    2.6.1 PhyML
    2.6.2 RAxML
    2.6.3 IQ-TREE
    2.6.4 FastTree
  2.7 Shortcomings Of Existing Approach

CHAPTER 3. RECONSTRUCTION OF LOW RESOLUTION CLADES
  3.1 Introduction
  3.2 Methodology
    3.2.1 Detection Of Informative Regions
    3.2.2 Selection Of Clades With Low Resolution
    3.2.3 Reconstruction Of Phylogeny Of Closely Related Isolates
  3.3 Results
    3.3.1 Real Data
    3.3.2 Simulated Data
  3.4 Discussion

CHAPTER 4. HIERARCHICAL PHYLOGENY CONSTRUCTION
  4.1 Introduction
  4.2 Methods
    4.2.1 Partition Of Isolates
    4.2.2 Construction Of Phylogeny For The Current Group
    4.2.3 Learning Optimal Values For Parameters of HPC
    4.2.4 Workflow of HPC
  4.3 Results
    4.3.1 Datasets With Two Levels Of Sequence Similarity
    4.3.2 Comparison Of HPC With Existing Approach
    4.3.3 Datasets With Three Levels Of Sequence Similarity
    4.3.4 Computational Resources And Average Wall-clock Time
    4.3.5 Settings And Commands For RAxML, IQ-TREE And FastTree
  4.4 Discussion

CHAPTER 5. SUMMARY AND CONCLUSION
  5.1 Summary
    5.1.1 Identification Of Informative Columns And Closely Related Isolates
    5.1.2 Hierarchical Phylogeny Construction
  5.2 Conclusion

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 Deviation of likelihood for the alignment S
Table 3.2 Noise alignment from sequence alignment
Table 3.3 Deviation of likelihood for the alignment S′
Table 3.4 Number (percentage) of informative columns
Table 3.5 Robinson-Foulds distance of trees built from whole alignment (Tw) and from informative columns (Tr)
Table 4.1 Robinson-Foulds distance (average over 5 instances) from true phylogenies for various lengths of alignment
Table 4.2 Time required by three versions of HPC and three existing programs
Table 4.3 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Wide alignments
Table 4.4 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Wide alignments
Table 4.5 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Wide alignments
Table 4.6 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Wide alignments
Table 4.7 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Wide alignments
Table 4.8 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Wide alignments
Table 4.9 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Wide alignments
Table 4.10 Sum of normalized Robinson-Foulds distances over 24 groups in 20 Wide alignments
Table 4.11 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Narrow alignments
Table 4.12 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Narrow alignments
Table 4.13 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Narrow alignments
Table 4.14 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Narrow alignments
Table 4.15 Sum of normalized Robinson-Foulds distances over 24 groups in 40 Narrow alignments
Table 4.16 Average normalized Robinson-Foulds distance for each program
Table 4.17 Normalized Robinson-Foulds distance for 16 subgroups in 10 alignments

LIST OF FIGURES

Figure 2.1 Example of a phylogenetic tree of 4 species
Figure 2.2 Example of an alignment of 3 species with 4 columns
Figure 2.3 A tree of 3 species for calculating the likelihood
Figure 2.4 A tree of 3 species with character states in the nodes
Figure 2.5 Nearest neighbor interchange on an edge
Figure 3.1 Algorithm for selection of informative columns from a sequence alignment and corresponding maximum likelihood tree
Figure 3.2 The maximum likelihood tree of 11 Fusarium isolates from an alignment of 137,718 columns
Figure 3.3 The maximum likelihood tree of 4 F. virguliforme isolates from an alignment of 319 informative columns