INTRODUCTION to PHYLOGENETICS: a STUDY of MAXIMUM PARSIMONY and MAXIMUM LIKELIHOOD METHODS by NAOMI R. IUHASZ a THESIS PRESENTED
Total Page:16
File Type:pdf, Size:1020Kb
INTRODUCTION TO PHYLOGENETICS: A STUDY OF MAXIMUM PARSIMONY AND MAXIMUM LIKELIHOOD METHODS By NAOMI R. IUHASZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2010 c 2010 Naomi R. Iuhasz 2 To my sister, Andreea Iuhasz 3 ACKNOWLEDGMENTS I take this opportunity to thank my supervisor, Dr. Sergei Pilyugin, for his expertise, kindness, and most of all, for his patience. I thank him also for many insightful conversations during the development of the ideas in this thesis, for the questions that challenged my understanding of the material, and for helpful comments on the text. I would like to express sincere gratitude to my thesis co-advisor Dr. Edward Braun, for taking me under his wing long before I knew anything about bio-mathematics and for his constant guidance throughout the journey that lead to this thesis. I also thank Dr. Rebecca Kimball and the entire Braun-Kimball lab for allowing an outsider to enter their niche and to learn from their expertise. I thank Dr. Murali Rao for his valuable suggestions. I am greatly indebted to Prof. Billy Gunnells for challenging me to pursue my interests in biology. My thanks and gratitude goes to Prof. Claire Kurtgis-Hunter for recognizing my potential and for being the first to encourage me to pursue a graduate degree. Above all, I thank my family who stood beside me and encouraged me constantly. My thanks to my grandparents for their love and sound advice, to my aunt for taking on the responsibility to see me through college, and especially to my father who is my role model. Special thanks and appreciation to Camilo for his constant support and unending encouragement. 4 TABLE OF CONTENTS page ACKNOWLEDGMENTS..................................4 LIST OF TABLES......................................6 LIST OF FIGURES.....................................7 ABSTRACT.........................................8 CHAPTER 1 INTRODUCTION...................................9 2 PRELIMINARIES................................... 12 2.1 Definition of Terms............................... 12 2.2 X-trees..................................... 13 2.3 Tree Shapes.................................. 15 2.4 X-splits..................................... 18 2.5 Characters and Convexity........................... 23 2.6 Character Compatibility............................ 27 3 MAXIMUM PARSIMONY.............................. 34 3.1 Classical Parsimony.............................. 34 3.2 Optimization on a Fixed Tree......................... 41 3.3 Tree Rearrangement Operations....................... 46 3.4 Relevance.................................... 49 4 MAXIMUM LIKELIHOOD.............................. 51 4.1 Basic Principles................................ 52 4.2 Models of Sequence Evolution........................ 54 4.3 Calculating Change Probabilities....................... 58 4.4 Differences in Perspective between Parsimony and Likelihood....... 60 REFERENCES....................................... 62 BIOGRAPHICAL SKETCH................................ 65 5 LIST OF TABLES Table page 2-1 Characters χ1, χ2, and χ3............................... 32 6 LIST OF FIGURES Figure page 2-1 (a) An X-tree. (b) A binary phylogenetic X-tree................... 13 2-2 The two tree shapes for B(6)............................. 16 2-3 A rooted tree shape.................................. 17 2-4 Edges e1 and e2 induce f1, 2, 3, 4gjf5, 6, 7, 8, 9g and f1, 2, 3, 4, 5, 6, 7gjf8, 9g X-splits respectively.................................. 19 2-5 One iteration of tree popping............................. 23 2-6 An X-tree and a mapping χt ............................. 24 2-7 Vertex disjoint subtrees................................ 26 2-8 Compatibility of characters. (a) A restricted chordal completion of int(C). (b) A maximal clique tree representation of G. (c) An X-tree on which each character in C is convex..................................... 33 3-1 (a) The X-tree. (b) A minimum extension of a character. (c) Edges forming a cut-set......................................... 37 3-2 A maximum-sized set of edge-disjoint proper paths................ 39 3-3 An Erdos¨ -Szekely´ path system........................... 40 3-4 Deficiencies of forward pass reconstruction..................... 43 3-5 (a) The X-tree. (b) Forward pass. (c) Backward pass............... 45 3-6 A schematic representation on the generic TBR operation............ 46 3-7 A schematic representation on the generic SPR operation............ 47 3-8 A schematic representation on the generic NNI operation............. 47 4-1 Overview of the calculation of the likelihood of a tree. (a) Hypothetical sequence alignment. (b) An unrooted tree for the four taxa in (a). (c) Tree after rooting at arbitrary interior vertex, in this case (6). (d) Likelihood of character χ..... 54 7 Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science INTRODUCTION TO PHYLOGENETICS: A STUDY OF MAXIMUM PARSIMONY AND MAXIMUM LIKELIHOOD METHODS By Naomi R. Iuhasz May 2010 Chair: Sergei S. Pilyugin Major: Mathematics In this thesis we investigate the conceptual framework of phylogenetics together with two of the most popular methods of inferring phylogenetic relationships. Some notation and background concepts of graph theory are introduced. The general class of X-trees is presented together with its properties, shapes of trees, and X-splits. The functions of characters applied to trees make the connection between mathematics and biology. The notions of character convexity and compatibility are formalized. We also introduce and analyze the maximum parsimony and maximum likelihood methods of creating phylogenies, with comparison and contrast between the methods. 8 CHAPTER 1 INTRODUCTION The subject of this thesis was motivated by the undergraduate research conducted by this author in the field of computational biology. The premise of the undergraduate research was to re-evaluate the general assumption that the rate of molecular evolution and the rate of morphological evolution are in effect dissociated from each other. If there is a correlation between the two rates, there would be a universal rate of evolution for organisms and the expected number of changes (either morphological or molecular) during a certain period of evolutionary history would be proportional to that universal rate. Even with a universal rate, some variation in the number of changes is expected. This variance would reflect a Poisson process in the ideal case. The research tested whether the negative binomial method is an improved alternative over the Poisson method to accommodate the variation in the number of changes. Also, it tested whether there is a need to assume more variance to fit morphological change to the universal rate than to fit molecular change. Both negative binomial and Poisson methods were applied to phylogenetic trees that were computed using one of the widely known software packages for inferring phylogenetic trees, PAUP*. This thesis is a continuation of the undergraduate research in that it investigates the mathematical concepts on which phylogenetic trees are based on and how they are inferred from the available data. Phylogenetics is the biological discipline which studies the evolutionary relatedness among different organisms based on molecular and morphological information. This relatively new branch of biology has its roots in Charles Darwin’s theory of evolution. The quest to use present-day characteristics of a group of species to infer the historical relationships between them and their evolution from a common ancestor has been the subject of numerous studies. These relationships are consistently represented by an evolutionary (phylogenetic) tree, structure first proposed by Darwin himself. Initially, the relationships were drawn after studying the morphological characteristics of the species. 9 However, such criterion of comparison has its limitations due to simplistic assumptions about evolutionary processes and difficulty of comparison between very distantly related species or morphologically identical yet different species. The field of phylogenetics flourished with the discovery and study of molecular data which began in the late 1960s. Protein and genetic sequences provide an immensely richer pool of information which can be explored through a steadily growing number of methods and techniques. [32] The field of phylogenetics developed with a strong interdisciplinary foundation as it incorporates elements of mathematics, statistics, and computer science with biology. Here we will look at the mathematical aspect of the field. The reconstruction and analysis of phylogenetic trees involves almost exclusively discrete mathematics, mainly graph theory and probability theory. [32] Inferring a phylogenetic tree is an estimation procedure since the true tree is essentially unknowable. The estimation models employed calculate a ”best estimate” of an evolutionary history based on the incomplete information contained in the data. Phylogenetic inference methods seek to accomplish this goal in one of two ways: by defining a specific algorithm that leads to the determination of a tree or by defining a criterion for comparing alternative phylogenies to one another and deciding which is better. While in a purely algorithmic method the algorithm defines the tree selection criterion, in the criterion-based methods the