Reconstructing Phylogenetic Networks Using Maximum Parsimony
Total Page:16
File Type:pdf, Size:1020Kb
Reconstructing Phylogenetic Networks Using Maximum Parsimony Luay Nakhleh Guohua Jin Fengmei Zhao John Mellor-Crummey Department of Computer Science, Rice University Houston, TX 77005 {nakhleh,jin,fzhao,johnmc}@cs.rice.edu Abstract combination, hybrid speciation, and horizontal gene trans- fer result in networks of relationships rather than trees of re- Phylogenies—the evolutionary histories of groups of lationships. While this problem is widely appreciated, there organisms—are one of the most widely used tools through- has been comparatively little work on computational meth- out the life sciences, as well as objects of research within ods for estimating and studying evolutionary networks (see systematics, evolutionary biology, epidemiology, etc. Al- [15, 17] for surveys of work on phylogenetic networks). most every tool devised to date to reconstruct phylogenies Due to the pervasiveness of phylogenies, many different produces trees; yet it is widely understood and accepted methods have been proposed for their reconstruction. Par- that trees oversimplify the evolutionary histories of many simony has been studied and used extensively for phyloge- groups of organims, most prominently bacteria (because of netic trees [22]; it is based on a minimum-information prin- horizontal gene transfer) and plants (because of hybrid spe- ciple (similar to Occam’s razor): in absence of information ciation). to the contrary, the best explanation for the observed data Various methods and criteria have been introduced for is that involving the smallest number of manipulations—in phylogenetic tree reconstruction. Parsimony is one of the the case of evolutionary histories, that involving the fewest most widely used and studied criteria, and various accu- evolutionary events. As pointed out by Hein [10, 11], par- rate and efficient heuristics for reconstructing trees based simony can be extended to phylogenetic networks: he ob- on parsimony have been devised. Jotun Hein suggested served that each individual site in a set of sequences label- a straightforward extension of the parsimony criterion to ing a network evolves down a tree contained in the network phylogenetic networks. In this paper we formalize this con- (i.e., a tree whose edges are edges of the network). In con- cept, and provide the first experimental study of the qual- sequence, the obvious extension is to define the parsimony ity of parsimony as a criterion for constructing and evalu- score of a network as the sum, over all sites, of the parsi- ating phylogenetic networks. Our results show that, when mony score of the best possible tree contained within the extended to phylogenetic networks, the parsimony criterion network for each site. produces promising results. In a great majority of the cases This generalization suffers from a major flaw: adding in our experiments, the parsimony criterion accurately pre- reticulation events (in the form of additional edges) to the dicts the numbers and placements of non-tree events. network can only lower the parsimony score, since intro- Keywords: phylogenetic networks, maximum parsimony, ducing a new edge cannot increase the score of any site, but horizontal gene transfer. may help lower the score of some. In consequence, this cri- terion may lead to a gross overestimation of the number of reticulation events needed to explain the data. In this paper, 1 Introduction we generalize Hein’s parsimony concept to blocks of sites, rather than individual sites, and study the performance of Phylogenies—the evolutionary histories of groups of the generalized criterion. This generalization is a more real- organisms—play a major role in representing the interrela- istic reflection of the biology of reticulate evolution: blocks tionships among biological entities. Many methods for re- of sites, or genes, are inherited as single units from their constructing and studying such phylogenies have been pro- “parents”. posed, almost all of which assume that the underlying his- Reconstructing maximum parsimony phylogenetic net- tory of a given set of species can be represented by a binary works is NP-hard, since it is a generalization of the max- tree. Although many biological processes can be effectively imum parsimony problem on phylogenetic trees, which is modeled and summarized in this fashion, others cannot: re- NP-hard [4, 8]. Further, whereas computing the parsimony Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference (CSB’05) 0-7695-2344-7/05 $20.00 © 2005 IEEE score of a given phylogenetic tree can be done in time linear removing exactly one of the two edges incoming into each in the number of tree nodes, using Fitch’s algorithm [6, 9], reticulation node in N and using any applicable forced con- computing the parsimony score of phylogenetic networks is tractions. We denote by T (N) the set of all trees induced a hard problem (possibly NP-hard). However, the problem by a network N. While a phylogenetic network models the is fixed-parameter tractable (FPT) [5], as we show. evolutionary history of a set of organisms, the evolutionary Since we are interested in assessing the quality of the histories of individual genes are still trees [16], which are parsimony criterion for phylogenetic networks (rather than the trees contained inside the network. heuristics), and due to the absence of any efficient algo- Reticulation events impose time constraints on the phy- rithms for solving the problem, we implemented an exhaus- logenetic network. [18] A phylogenetic network N = tive search method that traversed the entire space of net- (V,E) defines a partial order on the set V of nodes. If we works, and considered the parsimony score of every net- associate time t(u) with node u of N, then, if there exists work in the space. We considered a version of the phylo- a directed path p from u to some other node v such that p genetic network reconstruction problem that applies to hor- contains at least one tree edge, we must have t(u) <t(v) izontal gene transfer: given an organismal (species) tree, in order to respect the time flow; moreover, if e =(u, v) is compute an additional set of edges whose addition to the a network edge, then we must have t(u)=t(v), because tree explains the horizontal gene transfer events that oc- hybridization is, at the scale of evolution, an instantaneous curred during the evolutionary history of the sequences. process. Given a network N, we say that p is a positive-time Since these events are unknown, we apply the parsimony directed path from u to v,ifp is a directed path from u to criterion and seek the solution that is optimal with respect to v and p contains at least one tree edge. Given a network N, this criterion. Our experimental results show that the parsi- two nodes u and v cannot co-exist in time if there exists a mony criterion, when used carefully, can be very promising sequence P = p1,p2,...,pk of paths such that: (1) pi is in both reconstructing phylogenetic networks, and quantify- a positive-time directed path, for every 1 ≤ i ≤ k,(2)u ing their quality (in terms of capturing the true evolutionary is the tail of p1, and v is the head of pk, and (3) for every events). 1 ≤ i ≤ k − 1, there exists a network node whose two parents are the head of pi and the tail of pi+1. Since events 2 Phylogenetic Networks such as horizontal gene transfer occur between two lineages (nodes in the network) that co-exist in time [16, 19, 18], a N When events such as horizontal gene transfer occur, the phylogenetic network must satisfy the following prop- evolutionary history of the set of organisms may not be erty: modeled by phylogenetic trees; in this case, phylogenetic • If two nodes x and y cannot co-exist in time, then they networks provide the correct model. In horizontal gene cannot participate in a reticulation event, that is, the transfer (HGT), genetic material is transferred from one network cannot include either of the two edges (x, y) lineage to another; see the phylogenetic network in Fig- and (y, x). ure 1(a). In an evolutionary scenario involving horizon- tal transfer, certain sites (specified by a specific substring It is important to note that the aforementioned model and within the DNA sequence of the species into which the temporal constraints may be violated in practice, due to in- horizontally transferred DNA was inserted) are inherited complete sampling, extinction, etc. Moret et al. addressed through horizontal transfer from another species (as in Fig- these issues and extended the phylogenetic network models ure 1(c)), while all others are inherited from the parent (as to capture these scenarios, whenever they may occur [18]. in Figure 1(b)). Thus, each site evolves down one of the trees induced by the network. We adopt the general model of phylogenetic networks 3 The Parsimony Criterion formalized by Moret et al. [18] Definition 1 A phylogenetic network N =(V,E) with a Parsimony is one of the most popular methods used for set L ⊆ V of n leaves, is a directed acyclic graph in which phylogenetic tree reconstruction. Roughly this method is exactly one node, the root, has no incoming edges, and all based on the assumption that “evolution is parsimonious”, other nodes have either one incoming edge—tree nodes—or i.e., the best evolutionary trees are the ones that minimize two incoming edges—reticulation nodes. the number of changes along the edges of the tree. We now formulate this concept. In this paper, we focus on binary networks, i.e., networks in which the outdegree of a tree node is 2 and the outdegree Definition 2 The Hamming distance between two equal- of a reticulation node is 1.AtreeT is contained (or in- length sequences x and y, denoted by H(x, y), is the num- duced) inside a network N,ifT can be obtained from N by ber of positions j such that xj = yj.