Statistical Analysis of Species Level Phylogenetic Trees
Meg Elizabeth Ferguson

A Thesis Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

December 2017

Committee:
John T. Chen, Advisor
Junfeng Shang
Craig Zirbel

ABSTRACT

John T. Chen, Advisor

In this thesis, statistical methods are used to analyze the generation of species-level phylogenies. Two software packages, one phylogenetic and one statistical, are used to investigate differences in phylogeny topology across three methods. Maximum likelihood estimation, neighbor-joining, and UPGMA are applied in this comparison to study how accurately each software package places taxa relative to the true phylogeny. The comparison uses four genes, with sequences and genes of variable length, from forty-seven squid species. In addition, missing data techniques are employed to assess the impact of missing data on phylogeny generation.

Two software platforms were used to generate phylogenies for the genes 16S rRNA, 18S rRNA, 28S rRNA, and the mitochondrial gene cytochrome c oxidase I (COI). The phylogenetic software platform MEGA was used, as well as the statistical software platform R; within R, the packages ape, phangorn, and seqinr were used in tree generation. Results show discrepancies between the phylogenies generated across the four single-gene trees and the multiple-gene trees; only phylogenies generated using missing data in the form of partial sequences grouped all families correctly. Results from this study highlight the difficulty of determining the best software package to use for phylogenetic analyses. In general, MEGA generated a more accurate single-gene phylogeny from gene 18S rRNA, while R generated a more accurate single-gene phylogeny from gene 28S rRNA. Results also showed that sequences with 50% missing characters could be accurately placed within generated phylogenies.

ACKNOWLEDGMENTS

I would like to express my appreciation and gratitude to Dr. John T. Chen for his support and encouragement on this thesis and in my degree. Thanks are also extended to my committee members, Dr. Junfeng Shang and Dr. Craig Zirbel for their valuable feedback and time. I would also like to thank the Department of Mathematics and Statistics and Department of Applied Statistics and Operations Research at Bowling Green State University for the education I received. Last but not least, I would like to thank my mother, father, sister, brother, and friends for their continued support and encouragement throughout my schooling and life.

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
    1.1 Motivation for Study
    1.2 Data Description
CHAPTER 2: SINGLE-GENE TREE BUILDING
    2.1 Introduction
        2.1.1 Aligning DNA Sequences
        2.1.2 Correct Placement Definition
        2.1.3 Methodology
    2.2 Maximum Likelihood Estimation and Bootstrapping Trees
        2.2.1 Maximum Likelihood Estimation Introduction
        2.2.2 Maximum Likelihood Tree Building in MEGA
        2.2.3 Maximum Likelihood Tree Building in R
    2.3 Neighbor-Joining Trees
        2.3.1 Neighbor-Joining Introduction
        2.3.2 Neighbor-Joining Tree Building in MEGA
        2.3.3 Neighbor-Joining Tree Building in R
    2.4 UPGMA Trees
        2.4.1 UPGMA Introduction
        2.4.2 UPGMA Tree Building in MEGA
        2.4.3 UPGMA Tree Building in R
    2.5 Maximum Likelihood and Neighbor-Joining Tree Comparison in R
CHAPTER 3: MULTIPLE-GENE TREE BUILDING
    3.1 Multiple-Gene Tree Introduction
    3.2 Procedure for Creating Super-Gene Matrix
    3.3 Multiple-Gene Maximum Likelihood Trees
    3.4 Multiple-Gene Neighbor-Joining Trees
    3.5 Multiple-Gene UPGMA Trees
CHAPTER 4: MISSING DATA
    4.1 Literature Review
    4.2 Genetic-Based Missing Data Analysis
        4.2.1 Genetic-Based Procedure
        4.2.2 Multiple-Gene Maximum Likelihood Trees
        4.2.3 Multiple-Gene Neighbor-Joining Trees
        4.2.4 Multiple-Gene UPGMA Trees
    4.3 Trait-Based Missing Data Analysis
CHAPTER 5: CONCLUSION
    5.1 Maximum Likelihood Phylogenies
    5.2 Neighbor-Joining Phylogenies
    5.3 UPGMA Phylogenies
    5.4 Missing Data Analysis
    5.5 Areas For Future Study
    5.6 Final Remarks
REFERENCES
APPENDIX A: SUPPLEMENTAL TABLES
APPENDIX B: SUPPLEMENTAL FIGURES

LIST OF FIGURES

2.1 Maximum Likelihood Estimation Tree for 16S Gene Generated in MEGA
2.2 Maximum Likelihood Estimation Tree for 18S Gene Generated in MEGA
2.3 Maximum Likelihood Estimation Tree for 28S Gene Generated in MEGA
2.4 Maximum Likelihood Estimation Tree for COI Gene Generated in MEGA
2.5 Maximum Likelihood Estimation Tree for 16S Gene Generated in R
2.6 Maximum Likelihood Estimation Tree for 18S Gene Generated in R
2.7 Maximum Likelihood Estimation Tree for 28S Gene Generated in R
2.8 Maximum Likelihood Estimation Tree for COI Gene Generated in R
2.9 Neighbor-Joining Tree for 16S Gene Generated in MEGA
2.10 Neighbor-Joining Tree for 18S Gene Generated in MEGA
2.11 Neighbor-Joining Tree for 28S Gene Generated in MEGA
2.12 Neighbor-Joining Tree for COI Gene Generated in MEGA
2.13 Neighbor-Joining Tree for 16S Gene Generated in R
2.14 Neighbor-Joining Tree for 18S Gene Generated in R
2.15 Neighbor-Joining Tree for 28S Gene Generated in R
2.16 Neighbor-Joining Tree for COI Gene Generated in R
2.17 UPGMA Tree for 16S Gene Generated in MEGA
2.18 UPGMA Tree for 18S Gene Generated in MEGA
2.19 UPGMA Tree for 28S Gene Generated in MEGA
2.20 UPGMA Tree for COI Gene Generated in MEGA