<<

"Searching for the 'true tree' for twenty species is like trying to find a needle in very large haystack. Yet biologists are now routinely attempting to build trees with hundreds and sometimes thousands of species." Mathematical Aspects of the'Tree of Life' Charles Semple and Mike Steel University of Canterbury, New Zealand

rees have long been used for illustrating certain phe- nomena in nature. They describe the structure of braid- Ted rivers, classify acyclic hydrocarbons in chemistry, and model the growth of cell division in physiology. In evolu- tionary biology, trees describe how species evolved from a common ancestor. Evolutionary trees date back (at least) to , who made an early sketch of one in a note- book from 1837, more than 20 years before his Origin of Species. In Darwin's day, the main evidence available for reconstructing such trees, apart from some fossils, was mor- phology and physiology- the physical details of different species (does an animal have wings or not, how many petals does a plant have, and so forth). As Darwin wrote in 1872: We possess no pedigrees or amorial bearings; and we have to discover and trace the many diverging lines of descent in our natural genealogies, by characters ofany kind which have long been inherited. A phylogenetic (evolutionary) tree of 42 representative mammals The problem with morphological and fossil data is that they reconstructed using a Bayesian approach by concatenating 12 pro- are patchy, and can be misleading. Also the evolution of mor- tein coding mitochondrial genes. The main mammalian groups are phological characters is often compleX: tind difficult to model. colour-coded (Frederic Delsuc and Nicolas Lartillot, unpublished). Today, biologists prefer genetic date/to re~onstruct and study evolutionary trees, using methods that increasingly are based In this article, we will describe some of the mathematical on mathematical ideas. This field is known as phylogenetics- aspects of phylogenetics, starting from the elementary princi- phylo means race or tribe, and thus phylogenetics is the study pIes, but also describing one or two deeper and more recent of grouping together species with similar genetic characteris- results. Definitions of some basic combinatorial terms used in tics. Since it attempts to reconstruct the distant past, it is usu- this article are provided in a box on page 8. ally impossible to check directly whether an inferred tree is correct or not. So phylogenetics is somewhat akin to a field Counting Trees like forensic science that aims to reconstruct a crime scene for One ofthe most famous enumeration formulae in combina- which no witness was present. Phylogenetic methods are used torics is Arthur Cayley's formula from 1889, for the number of to study other problems in biology-such as how different trees with n labeled vertices: This number is nn-2, a simple for- strains of a virus (for example, of HIV or influenza) are relat- mula that has many known proofs, most of which are surpris- ed, how much biodiversity is threatened by extinction, and ingly complicated. how populations, such as modern humans, came to be distrib- In phylogenetics, we are interested in counting a different uted across the planet. Phylogenetic techniques have even class of trees, called phylogenetic (evolutionary) trees. A been used to study the evolution of languages and the copying phylogenetic tree is a tree whose leaves are labeled, but the history of old manuscripts. remaining vertices are unlabeled and of degree at least three. Searching for the Tree of Life... WWW.MAA.ORG/MATHHORIZONS 5 MATH HORIZONS MATH HORIZONS

The leaves correspond to the species we see today, while the known and elementary result from called the How large is N(n)? Well, for 10 species we have N(10) = We denote such a split by writing AlB (or, equivalently, BIA). unlabeled vertices correspond to unknown ancestral species. Handshaking Lemma: the sum of the degrees ofthe vertices of 2,027,025; for n = 20, we have N(20) = 2X 1020 , prompting the The set of splits that are generated from a tree in this way carry The most 'informative' type of phylogenetic trees are those any graph is equal to twice the number of edges. For a binary biologist Walter Fitch to remark some years ago that: enough information that one can always uniquely recover the phylogenetic tree, if we denote the number of unlabeled (non- that have the largest number of edges possible for the number We have,for twenty species, more than a gram molecular tree just from its splits. Indeed, there is an easy and fast algo- of leaves- these are sometimes calledfully resolved or binary leaf) vertices by 1 and the number of edges by E, then the weight ofevolutionary tree! rithm for doing this called tree-popping, developed in 1983 by phylogenetic trees-and this is the set of trees we will count. H~ndshaking Lemma gives: a biologist Christopher Meacham. Searching for the 'true tree' for twenty species is like trying Every vertex in a binary philogenetic tree that is not a leaf (1) 31 + n =2E. to find a needle in very large haystack. Yet biologists are now must have degree 3, for otherwise two edge§(,!ncident with 3 5 But a tree has one more vertex than its number of edges (see routinely attempting to build trees with hundreds and some- such a vertex v could be plucked from v and joined\together at 1 e' box), so we also have: a new vertex, and a single new edge could be placed with one times thousands of species. To realize the smallness ofthe nee- dle, consider how many millenia it would take a computer that 4 end at the new vertex and the other end at v, resulting in a phy- E=n+l-1. (2) 2 V 4 can check a million trees per second to search through N(20) logenetic tree with the same leaves as the original, and one Combining Equations (1) and (2) gives 31 + n = 2(n + 1- 1), 2 A trees. 6 7 more edge. ;, and so 1 =n- 2. Since, by (2), E =n + 1 -1, the number of 67 binary,~'phylogenet­ This leads to a very natural question- is there enough DNA Even though the unlabeled vertices of a edges in a binary phylogenetic tree is Cutting an edge produces the split {1 ,2,3,5}1{4,6,7}. Cutting both sequence data available to reconstruct accurately the one-in-a- e ic tree all have degree 3, the reason they are called 'binary' is and produces the partition {{1,2}, {3,5}, {4,6,7}}. E= 2n- 3, (3) zillion tree that describes the evolution of the species you are e e' because if we direct all the edges away from some leaf (which interested in? Or are the trees being reconstructed almost cer- might represent some 'outgroup' species that is distantly relat- an equation that one could also prove by induction on n. Curi- However, not every set of splits of {I, 2, ... , n} corresponds tain to be false in some details? We can give a crude counting ed to the other species), then each non-leaf vertex encountered ously, Equation (3) does not depend on the shape of the tree, to the splits of a phylogenetic tree with leaf set {I, 2, ... , n}- argument to provide a lower bound on the genome length k as one traverses the tree can be viewed as a speciation event- just the number of leaves. To determine N(n), we need anoth- the precise conditions for this were found by Peter Buneman in required to accurately reconstruct binary phylogenetic trees on where each species splits into two daughter species. These er insight-each binary phylogenetic tree T on leaf set {I, 2, 1971. Buneman was studying trees for a quite different pur- n species. The number of genomes of length k is 4k (since a concepts are illustrated below. ... , n} is obtained from a binary phylogenetic tree T' on leaf set pose, namely reconstructing the copying history of old manu- genome is an ordered sequence whose alphabet consists of the {I, 2, ... , n-1} by subdividing an edge and attaching leaf n by scripts. Notice that if we were to cut the edge e' in the tree four bases of DNA - A, C, G and T). Representing each a new edge to the vertex resulting from this subdivision as shown above, the resulting split A' IB', where A' ={3, 5} and species by its genome, the number of data sets consisting of n shown. B' ={1, 2, 4, 6, 7}, would have the property that B n A' is the genomes of length k is therefore (4k)'~ =4nk. Now, suppose we vertex (degree 4) empty set. This is no accident-it is easy to see that ifAlB and have a method that can reconstruct each binary phylogenetic A' IB' are arbitrary splits of any phylogenetic tree, then one of tree from some possible data set. Think of this method as a 1~/edge the four intersections :>-----<2 reconstruct each possible tree means that we want this function compatibility' condition is not just necessary, but also suffi- Different choices of T' or edges to subdiyide produce dif- to be onto, and so IBI :::; IAI. In other words: cient for a set of splits to be realized by some phylogenetic (a) 4 (b) 3 nk ferent choices for T and, as each T' has (2n - 5) edges (by (3», N(n) :::; 4 . (7) tree. (a) A phylogenetic tree on six leaves. (b) The three possible binary we have: Many splits are well known in biology-for example, we Some straightforward analysis (applying Stirling's approx- phylogenetic trees on four leaves. have the vertebrates and invertebrates. We would hope that N(n) = (2n - 5) N(n - 1). (4) imation for n! to (6) in (7» shows th~t k must grow at least at these splits correspond to splits of the underlying evolutionary The observant reader may have noticed that in the above the rate log(n). Of course, this bound is very crude, and it This recurrence, together with the boundary condition N(3) =1, tree. More generally, we would like any partition of a set of definition, a phylogenetic tree has no root (representing a com- would be useful to derive an upper bound on k under some implies that species to 'fit' their'underlying evolutionary tree in the sense mon ancestor). In practice, most reconstruction methods build model of genome evolution, and see how different to log(n) it that we could delete a set of edges so that the leaves that an unrooted phylogenetic tree. The often controversial N(n)=(2n-5)x(2n-7)x .. ·x3x1. (5) is. We will describe the surprising answer to this problem at remain connected form the blocks of the partition. For exam- decision of where to place the root is done later. the end of this article, but first we need to say more about That is, N(n) is the product ofthe first n - 2 odd numbers. Writ-' ple, consider the partition {{I, 2}, {3, 5}, {4, 6, 7}} and the To count the number of binary phylogenetic trees, we may mathematical aspects of phylogenetic trees and how they can ten (2n -5)!!, this number is sometimes called a 'semi- assume that the leaves are labeled 1,2, ... , n. There are three tree shown above. If we delete both edges e and e', we obtain factorial' . It can also be written as a ratio of factorials and describe data. unlabeled binary trees for n = 4, shown above. How many this partition. We say that the partition is convex on the tree. powers of 2 as follows: binary phylogenetic trees are possible on n leaves? This prob- Convexity has a biological rationale-imagine that the (2n-4)! Properties of Phylogenetic Trees and Data lem was solved long ago-in fact, 21 years after Darwin pub- (6) species in t

6 FEBRUARY 2009 WWW.MAAORG/MATHHORIZONS 7 :MATH HORIZONS :MATH HORIZONS

there are sets of partitions where, for each pair of partitions, a convex. How many partitions do we need to generate for this required developing a fundamentally new approach to tree tree exists on which both partitions are convex yet for the to hold? It is unlikely to be a small number (like the 4 we reconstruction, and an analysis that combined delicate combi- entire collection, no such tree exists. Indeed, given a set ofpar- described above) particularly for a large tree, since the parti- natoral and probabilistic arguments. • titions, deciding whether there exists a tree on which the entire tions are randomly generated, not deliberately chosen by a set of partitions is convex is an NP-complete problem. Loose- mathematician! But it still doesn't need to grow too quickly For Further Reading ly speaking, this means that, in general, the best solution we with n (the number of leaves of T); log(n) turns out to be fast We have provided just a snapshot of some results in phylo- have is to check each tree T individually to see if each of the enough, at least under some mild restrictions. genetics and omitted mentioning many other areas where partitions in the set are convex on T. Hmmm... , the small nee- More precisely, suppose that B:5 pe :5 q where B >0and plays a crucial role. For more details the reader is dle and large haystack does not make this good news! q < 1/2 are constants that are independent of n. Then the num- referred to Phylogenetics (C. Semple and M. Steel, Oxford Nevertheless, we can ask the following question: For any ber of partitions required so that T is (with high probability) University Press, 2003). More recent mathematical develop- phylogenetic tree T, what is the smallest collection of parti- the only tree on which all the partitions are convex is: ments can be found in Mathematics ofEvolution and Phyloge- tions required so that T is the only phylogenetic tree on which log(n) ny (0. Gascuel. (ed.) , 2005). The last c-- all the partitions are convex? For this question to make sense, E chapter of this book provides the mathematical details T must be binary (otherwise we could always exhibit another where c depends just on q and on the probability with which described in the final section above. This second book also tree that resolves T and that tree would also satisfy the con- we wish to recover Tcon-ectly. When q is greater than 1/2, the deals with some complexity of molecular evolution which we vexity condition for any partition that T does). For splits, the number of partitions required for tree reconstruction grows have not mentioned; for example, even the use of trees in rep- answer to the question is easy- we need one split for each much faster (but still polynomially) with n. resenting evolution can sometimes be overly simple, and in edge of T that is not incident with a leaf (otherwise we could How about models for describing the evolution of individ- primitive organisms such as bacteria, processes such as hori- delete that edge and identify its two end-vertices to get a phy- ual DNA sites? Since we have just 4 bases, it is unrealistic to zontal gene transfer can make a directed acyclic graph a more logenetic tree on which each split in the collection is still con- expect evolution to be homoplasy-free-ifwe can change state suitable representation. For a more biological and statistical vex). There are n - 3 such internal edges in a binary phyloge- from (say) A to G, then we should be able later to change state treatment of phylogenetics, the reader is refen-ed to Inferring netic tree with n leaves (by Equation (3)), so for splits, we need from G back to A. Biologists typically model DNA site evo- Phylogenies (J. Felsenstein, Sinauer Press, 2004). n -3partitions. The answer to the posed question for sets of lution as a 4-state Markov process. For example, the tree of general partitions is more interesting. It was recently shown the 42 mammals shown above was constructed using such a Acknowledgments that, for any binary phylogenetic tree T, a set of at most four model in a Bayesian statistical approach. We thank Frederic Delsuc and Nicolas Lartillot for partitions exists so that T is the only phylogenetic tree on The question of how many sequence permission to reproduce their mammal tree. which these partitions are convex (i.e. could have evolved sites we need to reconstruct a phylogenetic without homoplasy). What makes this result surprising is that tree accurately when the sites evolve inde- it does not depend at all on n (the number ofleaves of the tree), pendently and identically under some unlike the n -3result for splits. finite-state Markov process is difficult. University of California, Riverside However, it was solved recently by March 14,2009 Models of Evolution Elchanan Mossel and colleagues at UC Pi Day! The result that a tree can be uniquely reconstructed from Berkeley, and their result is impressive. We Registration fee 0% described by homoplasy-free evolution on that tree (and from just four partitions is a combinatorial 'best case' situation. It saw earlier that a very crude bound 'from Lunch fee 0% any hypothetical ancestral vertex). Of course, homoplasy can would be interesting to know how many partitions we would (7)) requires that the sequence length'~eed-' Inaccessible math occur for certain characteristics-for example, wings can need to reconstruct a tree if these partitions 'evolved' on the ed to reconstruct binary phylogentic trees talks 0% evolve in some species and then be lost in some later descen- tree according to some random model. A particularly simple from DNA sequences must grow atthe rate dant species (reverse evolution), or wings can evolve inde- model is to suppose that each character state (e.g. eye colour) log(n)-this argument was based just on pendently in different lineages (convergent evolution), and so can randomly change on any edge of the tree, and when it does counting, and had nothing to do with Travel funding is homoplasy-free evolution is an 'ideal' situation. Nevertheless, so it always takes on a new state that has not so-far arisen' Markov processes. What Mossel and col- available!! certain genomic data (such as 'retroposons') exhibit very low (thereby producing partitions that are convex on the tree). leagues have shown is that, under some Funding Deadline: levels of homoplasy; this has been helpful in resolving, for We can exactly describe the partitions generated in this way restrictions on the Markov process, the FebruarY 21, 2009 example, the tree of placental mammals (including the place- by the following simple 'random cluster' process. For each log(n) growth rate on sequence length is Register Online: ment of whales as a sister to the hippopotamus). edge e of the tree T, cut edge e with probability pe or leave it enough to recover each tree with high prob- www.pcumc-math.org It should be clear that not all sets of partitions can be intact with probability 1- Pe' Perform this process independ- ability. In other words, the growth in the Registration described by homoplasy-free evolution (i.e. be convex) on the ently across the edges of T and consider the subsets of the sequence length k of at least log(n) is not Deadline: 20% same tree-for bipartitions (splits), the condition for this is leaves of T in the resulting components. This gives us a ran- only necessary for reconstructing all binary March 7, 2009 precisely Buneman's pairwise compatibility condition dom partition of the leaves. If we independently generate a trees on n species (by the crude counting PCUMC is sponsored by Loyola Marymounl Universily, Lewis and Clark Col/ege, and Pepperdine University and described above. For sets of general partitions, the situation is large number of partitions in this way, then, with high proba- argument), but it is also sufficient, at least receives major funding from the NSA and NSF Grant OMS-0241090 through the MAA RUMC program. more interesting. Firstly, a pairwise condition won't work- bility, T will be the only tree on which all the partitions are for certain Markov processes. The result

8 FEBRUARY 2009 WWW.MAA.ORG/MATHHORIZONS n