IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 1, NO. 1, JANUARY-MARCH 2004 13 Phylogenetic Networks: Modeling, Reconstructibility, and Accuracy
Bernard M.E. Moret, Luay Nakhleh, Tandy Warnow, C. Randal Linder, Anna Tholse, Anneke Padolina, Jerry Sun, and Ruth Timme
Abstract—Phylogenetic networks model the evolutionary history of sets of organisms when events such as hybrid speciation and horizontal gene transfer occur. In spite of their widely acknowledged importance in evolutionary biology, phylogenetic networks have so far been studied mostly for specific data sets. We present a general definition of phylogenetic networks in terms of directed acyclic graphs (DAGs) and a set of conditions. Further, we distinguish between model networks and reconstructible ones and characterize the effect of extinction and taxon sampling on the reconstructibility of the network. Simulation studies are a standard technique for assessing the performance of phylogenetic methods. A main step in such studies entails quantifying the topological error between the model and inferred phylogenies. While many measures of tree topological accuracy have been proposed, none exist for phylogenetic networks. Previously, we proposed the first such measure, which applied only to a restricted class of networks. In this paper, we extend that measure to apply to all networks, and prove that it is a metric on the space of phylogenetic networks. Our results allow for the systematic study of existing network methods, and for the design of new accurate ones.
Index Terms—Phylogenetic networks, reticulate evolution, error metric, Robinson-Foulds, bipartitions, tripartitions. æ
1INTRODUCTION
HYLOGENIES are the main tool for representing evolu- In phylogenetic reconstruction, a standard methodology Ptionary relationships among biological entities. Their for assessing the quality of reconstruction methods is pervasiveness has led biologists, mathematicians, and simulation [9], [10]. Simulation allows a direct comparison computer scientists to design a variety of methods for their between the “true” phylogeny and its reconstruction, reconstruction (see, e.g., [28]). Almost all such methods, something that is generally not possible with real data, however, construct trees; yet, biologists have long recog- where the true phylogeny is at best only partially known. In nized that trees oversimplify our view of evolution, since a simulation study, a model phylogeny is generated, they cannot take into account such events as hybrid typically in two stages: A topology is created, and then speciation and horizontal gene transfer. These nontree the evolution of a set of molecular sequences is simulated events, usually called reticulations, give rise to edges that on that topology. The set of sequences obtained at the leaves connect nodes from different branches of a tree, creating a is then fed to the reconstruction methods under study and directed acyclic graph structure that is usually called a their output compared with the model phylogeny. While phylogenetic network. To date, no accepted methodology for computationally expensive (the large parameter space and network reconstruction has been proposed. Many research- the need to obtain statistically significant results necessitate ers have studied closely related problems, such as the a very large number of tests), simulation studies provide an compatibility of tree splits [2], [3] and other indications that unbiased assessment of the quality of reconstruction, as a tree structure is inadequate for the data at hand [15], well as a first step in the very difficult process of deriving detection and identification of horizontal gene transfer [7], formal bounds on the behavior of reconstruction methods. [8], and, more generally, detection and identification of A crucial part of a simulation study is the comparison recombination events [19], [20]. A number of biological between the reconstructed and the true phylogenies. Many studies of reticulation have also appeared [21], [22], [23], methods have been devised to compute the error rate [25]. Our group has proposed a first, very simple method between two phylogenetic trees but, with the exception of [17] based on an observation of Maddison [13], but it our work [16], no such method has been proposed for remains limited to just a few reticulations. phylogenetic networks. We proposed a measure of the relative error between two phylogenetic networks based on the tripartitions induced by the edges of the network [16], a measure that naturally extends a standard error measure . B.M.E. Moret and A. Tholse are with the Department of Computer Science, used for trees (the Robinson-Foulds measure), and pro- University of New Mexico, Albuquerque, NM 87131. E-mail: {moret, tholse}@cs.umn.edu. vided experimental results showing that our measure . L. Nakhleh, T. Warnow, and J. Sun are with the Department of Computer exhibited desirable properties. Sciences, University of Texas at Austin, Austin, TX 78712. In this paper, we formalize our results about this E-mail: {nakhleh, tandy, jsun}@cs.utexas.edu. measure. We establish a framework for phylogenetic net- . C.R. Linder, A. Padolina, and R. Timme are with the School of Biological works in terms of directed acyclic graphs (DAGs) and Sciences, University of Texas at Austin, Austin, TX 78712. E-mail: {rlinder, annekes, retimme}@mail.utexas.edu. define a set of properties that a DAG must have in order to reflect a realistic phylogenetic network. A crucial aspect of Manuscript received 23 Jan. 2004; revised 21 May 2004; accepted 25 May 2004. our model is our distinction between model networks and For information on obtaining reprints of this article, please send e-mail to: reconstructible networks: The former represent what really [email protected], and reference IEEECS Log Number TCBB-0005-0104. happened (at the level of simplification of the model, of
1545-5963/04/$20.00 ß 2004 IEEE Published by the IEEE CS, NN, and EMB Societies & the ACM 14 IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 1, NO. 1, JANUARY-MARCH 2004
Fig. 1. Hybrid speciation: the network and its two induced trees. Fig. 2. Horizontal transfer: the network and its two induced trees. course), while the latter represent what can be inferred from trees in terms of their edge structure, are often called data on current organisms. Using the properties of model measures of topological1 accuracy. and reconstructible networks, we extend our original Theorem 1 [24]. The pair ðT ;mÞ, where T is the space of measure [16] to obtain a true metric, thereby showing that phylogenetic trees on n leaves and m is the RF distance, is a the combination of these DAG properties and our distance metric space. measure provides a sound theoretical as well as practical basis for the analysis of phylogenetic networks and for the A metric space is a set of objects with an equivalence assessment of network reconstruction methods. relation, , and a binary distance function, d, which The rest of the paper is organized as follows: In Section 2, together satisfy the following three conditions, for every we briefly review phylogenetic trees, bipartitions, and the three objects x, y, and z: Robinson-Foulds error measure. In Section 3, we define . positivity: dðx; yÞ 0 ^½dðx; yÞ¼0 , x y , model phylogenetic networks as DAGs that obey certain . symmetry: dðx; yÞ¼dðy; xÞ, and properties. In Section 4, we discuss the ramifications of . triangle inequality: dðx; yÞþdðy; zÞ dðx; zÞ. missing taxa and other problems arising with biological datasets on the identifiability of reticulation events, and distinguish between model networks and reconstructible 3MODEL PHYLOGENETIC NETWORKS ones. In Section 5, we briefly review the measure we 3.1 Nontree Evolutionary Events introduced in [16] and prove that our measure is a metric on We consider two types of evolutionary events that give rise the space of phylogenetic networks. In Section 6, we discuss to network (as opposed to tree) topologies: hybrid speciation future work needed to bring network reconstruction on a and horizontal gene transfer (also called lateral gene transfer). par with tree reconstruction. In hybrid speciation, two lineages recombine to create a new species. We can distinguish diploid hybridization,in 2PHYLOGENETIC TREES AND BIPARTITIONS which the new species inherits one of the two homologs for each chromosome from each of its two parents—so that the A phylogenetic tree is a leaf-labeled tree that models the new species has the same number of chromosomes as its evolution of a set of a taxa (species, genes, languages, parents, and polyploid hybridization,inwhichthenew placed at the leaves) from their most recent common species inherits the two homologs of each chromosome ancestor (placed at the root). The internal nodes of the tree from both parents—so that the new species has the sum of correspond to the speciation events. Many algorithms have the numbers of chromosomes of its parents. Under this last been designed for the inference of phylogenetic trees, heading, we can further distinguish allopolyploidization,in mainly from biomolecular (DNA, RNA, or amino-acid) which two lineages hybridize to create a new species whose sequences [28]. Similarly, several measures have been used ploidy level is the sum of the ploidy levels of its two parents to assess the accuracy of tree reconstruction; the most (the expected result), and autopolyploidization, a regular commonly used measure is the Robinson-Foulds (RF) metric speciation event that does not involve hybridization, but [24], which we now define. which doubles the ploidy level of the newly created lineage. Every edge e in an unrooted leaf-labeled tree T defines a Prior to hybridization, each site on each homolog has bipartition e on the leaves (induced by the deletion of e), so evolved in a tree-like fashion, although, due to meiotic recombination, different strings of sites may have different that we can define the set CðTÞ¼f e : e 2 EðTÞg, where EðTÞ is the set of all internal edges of T.IfT is a model tree histories. Thus, each site in the homologs of the parents of the hybrid evolved in a tree-like fashion on one of the trees and T 0 is the tree inferred by a phylogenetic reconstruction induced by (contained inside) the network representing the method, we define the false positives to be the edges of the hybridization event, as illustrated in Fig. 1. set CðT 0Þ CðTÞ and the false negatives to be those of the set 0 In horizontal gene transfer, genetic material is trans- CðTÞ CðT Þ. ferred from one lineage to another without resulting in the production of a new lineage. In an evolutionary scenario . The false positive rate (FP)isjCðT 0Þ CðTÞj=ðn 3Þ. 0 involving horizontal transfer, certain sites are inherited . The false negative rate (FN)isjCðTÞ CðT Þj=ðn 3Þ. through horizontal transfer from another species, while all (When both trees are binary, we have FP ¼ FN.) Since an others are inherited from the parent, as symbolized in Fig. 2. unrooted binary tree on n leaves has n 3 internal edges, When the evolutionary history of a set of taxa involves the false positive and false negative rates are values in the processes such as hybrid speciation or horizontal gene range ½0; 1 . The RF distance between T and T 0 is simply the 1. The word “topological” is not used here in a mathematical sense, but average of these two rates, ðFN þ FPÞ=2. Measures, such as only to signify that the measure of accuracy relates only to the structure of the RF distance, that quantify the distance between two the tree, not to any other parameters. MORET ET AL.: PHYLOGENETIC NETWORKS: MODELING, RECONSTRUCTIBILITY, AND ACCURACY 15 transfer, trees can no longer represent the evolutionary relationship; instead, we turn to rooted directed acyclic graphs (rooted DAGs). 3.2 Notation Given a (directed) graph G, let EðGÞ denote the set of (directed) edges of G and V ðGÞ denote the set of nodes of G. Let ðu; vÞ denote a directed edge from node u to node v; u is the tail and v the head of the edge and u is a parent of v.2 The Fig. 3. A phylogenetic network N on six species. Disks denote the tree indegree of a node v is the number of edges whose head is v, nodes and squares denote the network nodes, while solid lines denote while the outdegree of v is the number of edges whose tail is the tree edges and dashed lines denote the network edges. v. A node of indegree 0 is a leaf (often called a tip by systematists). A directed path of length k from u to v in G is . indegreeðvÞ¼0 and outdegreeðvÞ¼2: root, a sequence u0u1 uk of nodes with u ¼ u0, v ¼ uk, and . indegreeðvÞ¼1 and outdegreeðvÞ¼0: leaf, or 8i; 1 i k; ðui 1;uiÞ2EðGÞ; we say that u is the tail of p . indegreeðvÞ¼1 and outdegreeðvÞ¼2: internal tree and v is the head of p. Node v is reachable from u in G, node. denoted u e> v, if there is a directed path in G from u to v; A node v is a network node if and only if we have u v we then also say that is an ancestor of .Acycle in a graph indegreeðvÞ¼2 and outdegreeðvÞ¼1. is a directed path from a vertex back to itself; trees never contain cycles: in a tree, there is always a unique path Tree nodes correspond to regular speciation or extinction between two distinct vertices. Directed acyclic graphs (or events, whereas network nodes correspond to reticulation DAGs) play an important role on our model; note that every events (such as hybrid speciation and horizontal gene DAG contains at least one vertex of indegree 0. A rooted transfer). We clearly have V \ V ¼;and can easily verify directed acyclic graph, in the context of this paper, is then a T N that we have V [ V ¼ V . We extend the node categories to DAG with a single node of indegree 0, the root; note that all T N corresponding edge categories as follows: all other nodes are reachable from the root by a (directed) path of edges. Definition 2. An edge e ¼ðu; vÞ2E is a tree edge if and only if v is a tree node; it is a network edge if and only if v is a 3.3 Phylogenetic Networks network node. Strimmer et al. [26] proposed DAGs as a model for describing the evolutionary history of a set of sequences The tree edges are directed from the root of the network under recombination events. They also described a set of toward the leaves and the network edges are directed from properties that a DAG must possess in order to provide a their tree-node endpoint towards their network-node end- realistic model of recombination. Later, Strimmer et al. [27] point. Fig. 3 shows an example of a network in which the proposed adopting ancestral recombination graphs (ARGs), species at node Z is the product of (homoploid or allopoly- due to Hudson [11] and Griffiths and Marjoram [5], as a ploid) hybrid speciation. In such a network, a species appears more appropriate model of phylogenetic networks. ARGs as a directed path p that does not contain any network edge are rooted graphs that provide a way to represent linked (since a network edge connects two existing species or an collections of trees (assuming that the trees are ultrametric existing species and a newly created one). If p1 and p2 are two or nearly so) by a single network. Hallett and Lagergren [6] directed paths that define two distinct species, then p1 and p2 described a similar set of conditions on rooted DAGs to use must be edge-disjoint, that is, the two paths cannot share them as models for evolution under horizontal transfer edges. For example, the directed path p from node X to node B events. (Another network-like model is pedigrees, designed in Fig. 3 could define species B (as could the path from R to B), to represent the parentage of individual organisms that whereas the directed path from node X to node C does not propagate through sexual reproduction—so that the inde- define a species, since it contains a network edge. gree of each internal node of a pedigree is either 0 or 2 [18].) A phylogenetic network N ¼ðV;EÞ defines a partial Like Strimmer et al. [26], we use DAGs to describe the order on the set V of nodes. We can also assign times to the topology of our phylogenetic networks and, like Hallett and nodes of N, associating time tðuÞ with node u; such an Lagergren [6], we add a set of (mostly simpler) conditions to assignment, however, must be consistent with the partial ensure that the resulting DAGs reflect the properties of order. Call a directed path p from node u to node v that hybrid speciation. Unlike these authors, however, we care- contains at least one tree edge a positive-time directed path.If fully distinguish between model networks (the representation there exists a positive-time directed path from u to v, then of the actual evolutionary scenario) and reconstructible net- we must have tðuÞ Fig. 5. A small phylogenetic network illustrating the distinction between model networks and reconstructible networks. (a) The model network and (b) its best reconstruction. Fig. 4. A scenario illustrating two nodes X and Y that cannot coexist in sampling have the same consequences: The data do not reflect time. all of the various lineages that contributed to the current situation. Abnormal conditions include insufficient differ- clearly imply t1