Quick viewing(Text Mode)

Live Phylogeny with Polytomies

Live Phylogeny with Polytomies

arXiv:1603.03315v2 [cs.DS] 29 Oct 2016 • • ist ul rewt h mletpsil ot In cost. possible smallest the [ with methods tree statistical years, a recent build to aims epeetaogteiptseis ocp termed concept a , input phylogeny the among present be evolutionary to correspond edges their species. that between and lengths distances unrooted, Phy- with or species tree. labelled rooted a of are be by set can represented a trees often logenetic of are history relationships evolutionary whose the is Phylogeny oe.I h rao aenooy netr fspecies of ancestors paleontology, of area the In to shown nodes. [ been strains networks have specific lution patients of within emergence HIV detect of evolution [ demonstrated genera of Marburgvirus successfully and ancestors Ebolavirus been common the has input with known data or identify in- molecular hypothetical to that using either ability case be The the can species. often nodes is ancestor it ternal stemmatics, How- linguisti tree. phylogenetic paleontology, the virology, and of as leaves such the domains on in ever, placed where are trees species bifurcating extant fully all on focused primarily have ods [ searches tree heuristic fast perform to predict [ [ to characters data morphological genomic method for on widely-used based phylogenies and viral correct effective parsimony an maximum However, remains phylo- domains. constructing certain in for genies approaches parsimony maximum soitsacs ihec hrce uain( mutation character each with cost a associates I 1 .Ppmcali ihteDprmn fCmue cec,N Science, Computer of Department Greece. the Athens, College, with is Papamichail G. E [email protected] Jersey, E-mail: New are of Miller College A. The 08628. and Science, Ott Computer J. of Kennedy, Department E. Huang, A. Papamichail, D. hsatcefcsso hlgne hr netr can ancestors where phylogenies on focuses article This Parsimony Maximum ne Terms Index possibl all of enumeration exhaustive to resorting without developed. been have solutions efficient numerous which for Abstract nqepoete fms asmnoslv hlgntct phylogenetic live parsimonious most of bee properties have unique and polyt solutions and algorithmic phylogeny), harder (live computationally nodes tree ancestral on species phylogenet and linguistics, paleontology, transmission, NTRODUCTION iePyoeywt oyois idn the Finding Polytomies: with Phylogeny Live yTle ta.i [ in al. et Telles by iirsPpmcal neaHag dadKney Jan-L Kennedy, Edward Huang, Angela Papamichail, Dimitris Cntuto fpyoeei re a rdtoal foc traditionally has trees phylogenetic of —Construction hc eebeteswt xatancestor extant with trees resemble which , Pyoeeis aiu asmn,Lv hlgn,Polyt Phylogeny, Live Parsimony, Maximum —, otCmatPrioiu Trees Parsimonious Compact Most samto htuses that method a is 9 .Eitn hlgntcmeth- phylogenetic Existing ]. 6 ,t ul uetes[ supertrees build to ], 1 ,[ ], 11 8 2 ]. aesupplanted have ] ,using ], 10 erisPapamichail Georgios event .Patterns ]. 3 characters eilevo- serial csemtc,otnrqiepyoeyifrneta invo that inference phylogeny require often stemmatics, ic ,[ ], ig NJ, wing, retopologies. tree e 7 ihthe with wYork ew 4 eswt oyois n ecienvlagrtm ofidt find to algorithms novel describe and polytomies, with rees preyeaie nteltrtr odt.I hsarticl this In date. to literature the in examined sparsely n ,and ), me.Teerqieet,dsieterpeaec,lead prevalence, their despite requirements, These omies. ,and ], ,[ ], eti plcto oan huh uha ia evolutio viral as such though, domains application Certain live cs, 5 ], , sdo iaytesweealseisapa nlae,apro a leaves, on appear species all where trees binary on used ✦ h omnacso fseis[ species recording of of guarantee ancestor high common a incom provide the is not record does it fossil for and the plete, account Notably, nodes. that internal methods labeled need reconstruction the prompting phylogenetic characterized, for well and known be may ySele l n eti fterpoete aebeen have properties their of certain and [ al. in examined named et been and Steel have polytomies trees by Such for nodes. allow species that internal trees parsimony maximum h ubro vnsaogteegs sw i ocreate to aim we as trees. edges, parsimonious compact the most along to the events criterion of parsimony with number tree additional the a an in now nodes becomes of nodes, As species number polytomies. internal the of) of among degree minimization the contractions such, increase edge (or introduce do which as size, tree npltme oue ncntn iehuitcimprove- heuristic time [ constant ments on to focused polytomies uncertainty in of [ reconstructions source phylogenetic a the th added in limitations algorithm to due implemented structure bifurcating the a Forcing all to polytomies. phylogenies polytomy viruses, of RNA number of a different study contained 38 phylogenies a from reconstructed in were example, that For str ones. of for instead bifurcating trees account viral multifurcating to as utilizing stemmatics, such polytomies, domains, phylogenetic certain and in important transmission also is it nodes, paper. this in presented ancestors, methods [ algorithmic known the package [ incorporate paleotree (e.g. the which as graptolites trees such build for known, to case highly efforts the are as ancestors common extensivel such been extinct has and record fossil studied the where species certain diinlcmlxt oyoisadt narayhard already the an of [ to result problem for a add computational be allowing polytomies may area complexity tree this additional in parsimonious work of most Lack polytomies. the identifying for ihti ok eamt xlr h osrcinof construction the explore to aim we work, this With eie loigfriptseist pero internal on appear to species input for allowing Besides omies. 17 ;orwr nta oue nntv methods native on focuses instead work our ]; 22 .Mpigseist nenlndsreduces nodes internal to species Mapping ]. csOt nrwMiller, Andrew Ott, ucas 18 ,[ ], 19 15 ,[ ], ,cngetybnfi from benefit greatly can ], 16 20 12 vspaiginput placing lves .Sm rvoswork previous Some ]. ,[ ], .Hwvr hr are there However, ]. epoesome prove we e 21 to 13 esc trees such he and n ]. ,[ ], 14 ) Existing ]). blem X-trees ictly n y e 1 - 2

The rest of the paper is structured as follows: In section 2 The potential cost of an edge (u, v), is the number of mu- we provide terminology for most terms encountered in this tations between a pair of fits of u and v. The cost of an paper. Section 3 examines the number of phylogenetic trees edge (u, v) is the number of mutations between the values with n species that make up our search space, and compares of u and v. The minimum cost or min-cost of an edge (u, v) its magnitude to the number of cubic trees with n species, is defined as the minimum number of mutations between which is explored in traditional phylogenetic algorithms. In all pairs of fits between u and v, and is equal to md(u, v). section 4 we describe Hartigan’s algorithm, which solves The cost of a tree fit is the sum of costs along the tree’s edges. the small parsimony problem with polytomies, and adapt it The most parsimonious cost (MP-cost) of a tree is the minimum from rooted to unrooted trees. In section 5 we describe an sum of potential costs along all of its edges for any tree fit. algorithm to find the most compact parsimonious tree using An MP-cost tree fit is called a best fit. edge contractions. We present results that demonstrate the efficiency of the contraction algorithm in section 6 and 3 ENUMERATING MIXED TREES conclude with observations and discussion in section 7. n−2 According to Flight [25], there are T (n,m) unrooted 2 DEFINITIONS mX=0 A rooted tree where all nodes have a maximum degree of mixed labelled trees, where all leaf nodes are labelled, and 3 is called a binary or bifurcating tree. If all internal nodes internal nodes may be labelled, where T (n,m) is the num- except for the root have a degree of 3 (one parent and two ber of unique trees with n labelled nodes and m unlabelled children) then the rooted tree is called a full binary tree. nodes. Observably, there are four different ways to construct An unrooted tree where all nodes have either a degree of a tree with n labelled species from a tree with n 1 labelled − 1 (leaves) or 3 (internal nodes) we will call a cubic tree, species, allowing polytomies: following the terminology of [23]. A tree whose nodes can 1) Insert an unlabelled node into any of the n + m 3 have degrees > 3 is called multifurcating. Nodes in a tree can edges of any of the T (n 1,m 1) trees and have− the be labelled, i.e. assigned values. A labelled-leaf tree has values nth labelled node descend− from− it. assigned to all of its leaves. In our study we will define a 2) Insert the labelled node directly into any of the n+m 2 − mixed-labelled tree (or mixed tree) to be a tree where all leaves edges of any of the T (n 1,m) trees. are labelled, and internal nodes may be labelled. 3) Make the labelled node the− child of any of the n+m 1 The following definitions follow to a large extent the ter- available nodes belonging to any of the T (n 1,m)−. − minology in [24]. Let S bea set of n objects S1,S2,...,Sn . 4) Label any of the m +1 unlabelled nodes in any of the { } We will refer to these objects as species. Each species has a T (n 1,m + 1) trees. − set of m ordered features C, called characters. Each character This leads to the following recurrence: can take a constant number of values, called states. Each species Si, 1 i n is a fixed tuple of m- a T (n 1,m 1) ≤ ≤ · − − character states (C(i)1, C(i)2,...,C(i) ). Character states T(n,m)=  + b T (n 1,m) m · − are unordered (Fitch parsimony). Species can be assigned  + c T (n 1,m + 1) · − to nodes in a tree, which are then considered labelled. As where a = m + n 3 ifm> 0 or a =0 otherwise, b =2n + such, every labelled (species) node in a tree will have an 2m 3, and c = m−+1 if n>m +2 or c =0 otherwise. The m-tuple of character states associated with it, which will base− case of this recurrence is: T (1, 0) = 1 and T (1,i)=0 be the value V of the node. Each labelled node v will also i,i > 0. Utilizing sequence A005263 from N.J.A. Sloane’s have a root set V V associated with it, which is an m-tuple of ∀Online Encyclopedia of Integer Sequences [26] we identified character state singleton sets. For example, a labelled node the following closed form formula as an approximation for v corresponding to species S will have a value V (i) = i j the number of trees as a function of the labelled nodes n: (A,B,...,Z) and a root set V V (i) = ( A , B ,..., Z ), nn−2 { } { } { } 3 where C(j)1 = A, C(j)2 = B,...,C(j)m = Z. Unlabelled n 1 n− 2 √ 2 2 nodes in the tree will also have root sets V V , whose state 2e 2 e  −  sets can contain more than one state. If an unlabelled node The exhaustive search method on cubic trees with la- u is assigned a single state for each character, then the node belled leaves and unlabelled internal nodes is computa- is called fitted and the assignment is called a node fit f, with tionally impractical for any but the smallest input sets [8]. f V V (u)1 V V (u)2 V V (u)m.A tree fit is an Comparatively, the number of mixed multifurcating trees assignment∈ of× node fits to×···× all unlabelled nodes in the tree. grows at a hyper-exponentially faster rate, as can be seen in A mutation or event is a change between states of a Fig. 1. This motivates a need for an alternative method to character. A single mutation will carry a unit cost. Let exhaustive enumeration of all n-species trees. neii(x, y): Xi Xi 0, 1 , where Xi is the powerset × →C { } of the states of character i, be a function such that 4 MAXIMUM PARSIMONYFORTREESWITHLA- 0 if x y = nei(x, y)= ∩ 6 ∅ BELLED LEAVES  1 otherwise According to the parsimony criterion, we seek a tree that The minimum distance md(u, v) between two adjacent nodes explains divergence of species with the fewest number of u and v is defined as evolutionary events. As such, we seek to identify a tree with n md(u, v)= neii(V V (u)i, V V (v)i) labelled nodes and fitted unlabelled nodes such that the 1≤Xi≤m tree cost, which is the sum of edge costs and therefore the 3

total number of mutations, is minimized. This problem can AB AB be broken into two subproblems. Small parsimony problem (SPP): Given a tree τ with n species nodes and a specified A B AB B B C topology, compute its MP-cost. Large parsimony problem u2 u2 (LPP): Given a set of n species, find the tree(s) with the BC AB A AB BC minimum MP-cost among all possible tree topologies. Such C tree(s) is/are called the most parsimonious tree(s) (MP-trees). A B A B

4.1 Hartigan's algorithm A B BC A B BC Hartigan’s algorithm provides a powerful framework for a b calculating best fits of a given tree. It solves the SPP for multifurcating rooted trees with n labelled leaves [24]. The bottom-up procedure of Hartigan’s algorithm processes ev- Figure 2. Hartigan’s algorithm: (a) Computation of upper VU and lower VL sets of all unlabelled nodes, as well as root V V set of root node, ery unlabelled internal node ui, 1 i n 2 that has ≤ ≤ − after bottom-up step. (b) Computation of root sets V V of all unlabelled children vi,i 2. The procedure recursively calculates ≥ nodes after top-down step. upper VU(u)i and lower VL(u)i sets for every character of every unlabelled node u as follows (theorem 2 in [24]): If k(A) the number of times a value A occurs in the sets MP-costs are computed by arbitrarily rooting the trees. To VU(u)i of every child v of u, and K = max k(A), then maintain bifurcation, a root can be added to a tree by 1) VU(u)i = A k(A)= K replacing an edge (v1, v2) with a new unlabelled root node { | } 2) VL(u)i = A k(A)= K 1 r and two edges (r, v1) and (r, v2). It is evident that the cost { | − } Hartigan’s top down refinement allows the computation of the new tree will remain unaltered, since the root node of optimal assignments to each node. For any character can be assigned the same root set and value as one of either i of the root node, selecting any of the candidate states v1 or v2. from its root set would yield a most parsimonious labelling. Conversely, the following theorem also holds true: The algorithm then proceeds to compute the root sets of characters of internal nodes v using the following result Theorem 1. Removing the root of a binary tree, as well as any (theorem 3 in [24]): unlabelled node of degree 2, does not change the MP-cost of the For v child of u: tree. 1) If V V (u)i VU(v)i, then V V (v)i = V V (u)i ⊆ 2) Otherwise, V V (v)i = VU(v)i (V V (u)i VL(v)i) Proof. Hartigan’s algorithm on a rooted binary tree com- ∪ ∩ By storing all optimal and next-to-optimal values in sets putes the root sets of all internal nodes, including the root VU(u) and VL(u) respectively, and by computing V V (u), set V V (r) of the root node r. Let V V (x) and V V (y) be the Hartigan’s algorithm can be used to find all co-optimal corresponding root sets of the root’s children x and y. Any assignment of a state Si V V (r)j to the character Cj of the solutions to the SPP. An example of Hartigan’s algorithm ∈ can be seen in Fig. 2. root node will result in 1) A cost of 0 for mutating this character from the root to 4.2 Unrooting trees both children, if V V (r)j (V V (x)j V V (y)j ) or ∈ ∩ Tree enumeration for the LPP on rooted full binary trees 2) A cost of 1 otherwise (when V V (r)j (V V (x)j ∈ ∪ involves the systematic generation of cubic trees, for which V V (y)j )).

Figure 1. Comparison of growth rates of mixed and leaf labelled trees. 4

vd-2 vd-2 … ) d-1 … i,v ) VR(u ) c(u v ,v d-1 c(u ,v) k d-1 k v VR(u ) i VR(u ) 0 c(u ,v) c(u d-1 i i k c(u 0 v2 u u ui i ,v v c(u ui i k d ) 2 u k ,v v v i uk d ) vd 0 vd v1 v1 VR(u l) ul

a b a b

Figure 3. Moving an internal labelled node to a leaf while maintaining Figure 4. Node split to reduce degree of a node while maintaining tree tree cost cost

Removing the root and connecting nodes x and y directly Lemma 2. In leaf-labelled trees, a multifurcating n-species MP- with an edge will not cause an increase to the MP-cost of tree cannot have a lower cost fit than an n-species cubic MP-tree. the tree, as the same assignments that would minimize the Proof. We will prove this lemma by construction, once again edge costs between the root and its children will now be without modifying the MP-tree cost. A multifurcating tree maintained on the edge (x, y), meaning 0 for each character has two types of nodes that do not appear in a cubic j whose state does not mutate (V V (x)j V V (y)j = ), and tree, nodes of degree 2 and nodes of degrees 4. We 1 when the state mutates. ∩ 6 ∅ have seen how to remove unlabelled nodes of degree≥ 2 in Therefore, cubic trees with n labelled leaves share the Theorem 1 without increasing the MP-tree cost. To remove same MP-cost with binary rooted counterparts (not neces- tree nodes with degrees greater than 3 we will introduce sarily full). a split operation that will create a new node, reduce the degree of an existing node by 1, and conserve the tree cost. Consider a node ui with degree d > 3. Node ui will be 5 TOWARDS A COMPACT MOST PARSIMONIOUS adjacent to d other nodes v1, v2,...,vd. We will create a new TREE unlabelled node uk with the same root set as ui, which we will connect to u . Then we will disconnect nodes v 1 and Our ultimate goal is to find the most compact parsimonious i d− v from u , and connect them to u . The modified node u is n-species trees. To solve this problem, in this section we d i k i now connected to nodes v1, v2,...,v 2,u and has degree will demonstrate that it is sufficient to find the cubic n- d− k d 1, where node u is adjacent to u , v 1 and v , and has species MP-trees and contract them. Towards that goal we k i d− d degree− 3. The degrees of all other nodes are unchanged. The will prove that most compact n-species MP-tree cannot have MP-tree cost remains the same, as new edge (u ,u ) has a a lower cost fit than the cubic n-species MP-tree. To prove i k potential cost of 0 with an original tree best fit, and removed this claim we will utilize the following lemmas: edges (ui, vd−1) and (ui, vd) carry the same potential cost Lemma 1. An n-species MP-tree with labelled internal nodes with added edges (uk, vd−1) and (uk, vd) respectively. The cannot have a lower cost than an n-species MP-tree with n labelled split operation is shown in Fig. 4. leaves. Repeating the split operation on all nodes with degrees 4 until their degrees are reduced to 3 will produce a cubic Proof. We will prove by construction, while maintaining the tree≥ with the same MP-cost as the original multifurcating invariant of lowest tree cost. Consider an n-species MP-tree tree. with labelled internal nodes. Let ui be one of these nodes. Let (ui, v) be an edge connecting ui with another node v. Lemma 3. Unlabelled nodes with degrees d< 3 can be removed We will create a new internal node uk with the same root from an n-species tree without increasing its MP-cost. set as u , meaning V V (u )= V V (u ). We will then remove i k i Proof. We have seen how to remove unlabelled nodes of the (u , v) edge and connect u to u and v with two edges i k i degree 2 in Theorem 1 without increasing the tree MP- (u ,u ) and (u , v). We will then create a new leaf node u i k k l cost. To remove an unlabelled leaf v, we can notice that its with V V (u ) = V V (u ) and connect it to node u with an l i k incident edge can always have a cost of 0 for any given fit, edge (u ,u ). Finally we will move the label from u to u , k l i l since we can always set V V (v) = V V (u), where u is the effectively removing a labelled internal node and creating a single neighbor of v. As such, removal of v and its incident labelled leaf. The construction is shown in Fig. 3. edge does not increase the tree cost. The tree cost remains unchanged during these opera- tions, since edge (uk, v) will have the same potential cost Lemma 4. A most compact parsimonious n-species tree will have (for the same fit of v) as edge (ui, v) had, where the other at most n 2 unlabelled nodes. − new edges (ui,uk) and (uk,ul) will have potential costs of 0, since they connect nodes with the same single-fit root sets. Proof. Based on Lemma 3, all leaves of a most compact MP- We can repeat this process independently on every internal tree will be labelled. Thus, such a tree will have n leaves. labelled node, until the only labelled nodes are leaves, while Assume to the contrary of our stated lemma that a most the MP-cost of the tree remains the same. compact n-species MP-tree has k n 1 internal nodes, all of which have degrees 3, as per≥Lemma− 3. Then the total ≥ 5

number of nodes of the tree is n+k. A tree with n+k nodes Let w be the new node created once edge (u, v) is con- has n + k 1 edges. The sum of the node degrees then will tracted. A value V (w) with V (w)i = x V V (u)i (we select − ∈ be 2n +2k 2, since every edge contributes 2 to the total u without loss of generality) would set MM(τc) >MM(τ). − sum. To see that, let us root τ at u. Clearly V V (u)i V V (v)i, 6⊆ The sum of the degrees of the n leaves is n, which means which means, based on theorem 2 of [24] and Lemma 5, that the sum of degrees of the internal nodes S = n+2k 2. that V V (v)i = VU(v)i (V V (u)i VL(v)i). But then − ∪ ∩ Since every internal node has a degree 3, the k internal x VU(v)i, since x VU(v)i = x V V (v)i = ≥ 6∈ ∈ ⇒ ∈ ⇒ nodes will have a sum of degrees S 3k n + 2k V V (v)i V V (u)i = . Also x VL(v)i, since x ≥ ⇔ − ∩ 6 ∅ 6∈ ∈ 2 3k k n 2, which contradicts our assumption VL(v)i = x V V (u)i VL(v)i = x V V (v)i = ≥ ⇔ ≤ − ⇒ ∈ ∩ ⇒ ∈ ⇒ k n 1. V V (v)i V V (u)i = . Thus an assignment of x to V (v)i ≥ − in τ would∩ increase6 the∅ cost of the subtree rooted at v by Now we can proceed with the proof of our main theo- 2 more than any other assignment from V V (v)i = VU(v)i. rem: Even with the gain of one mutation from the contraction Theorem 2. The most compact n-species MP-tree cannot have a of edge (u, v), MM(τc) > MM(τ), which is a contradic- lower cost fit than the n-species cubic MP-tree. tion.

Proof. Assume to the contrary that there exists a tree τc on n Corollary 2. After an edge (u, v) contraction, the new node w will have root set V V (w): i1 i m,VV (w)i = V V (u)i species S that has a lower cost than the cubic MP-tree τ on S. ∀ ≤ ≤ ∩ Using the construction in Lemma 1 we can move all labelled V V (v)i. internal nodes to leaves without increasing the MP-cost of Proof. The proof follows from the same process as in τ c. Based on Lemma 4 we could remove all unlabelled nodes Lemma 6. with degrees 2 without altering the MP-cost of τc as well. ≤ Now τc has only nodes with degree 1 or degree 3. Using Our algorithm for contracting a cubic tree to the most ≥ the construct of Lemma 2 we can convert τc to a cubic compact tree on n species with the same cost is described in tree, by successively splitting nodes of degree higher than Algorithm 1. 3, again without affecting the MP-cost of τc. The resulting tree is cubic, has all species in S associated with leaves, and Algorithm 1 Tree contraction algorithm a lower cost than τ. 1: loop For each 0-min-cost edge 2: Contract edge (u, v), remove u, v and create new Theorem 2 enables us to build the most compact MP- node w tree without enumerating all n-species trees, but only cubic 3: Update root set of w: i1 i mV V (w)i = trees with n species. It also supplies us with a systematic ∀ ≤ ≤ V V (u)i V V (v)i procedure to create the most compact parsimonious tree ∩ 4: Root tree on w and run DFS to: by reversing the process described in Theorem 2. Starting 5: (a) Update root sets of all unlabelled tree nodes and with the n-species cubic MP-trees, we can contract edges 6: (b) Update list of 0-min-cost edges with 0 min-cost, effectively reversing the split operation. But which is the right order to contract edges, so that we can produce the most compact parsimonious tree? The relation Theorem 3. Algorithm 1 yields the most compact parsimonious R : V V : (u, v) R md(u, v)=0 is not transitive, tree and edge→ contraction∈ order⇐⇒ can matter. Therefore we will consider all possible orders of edge contractions. Proof. The proof of correctness of Algorithm 1 follows from the reversal of the conversion of the most compact MP-tree V V v v Lemma 5. The root tuple ( ) of a node is independent to a cubic MP-tree. We are exhaustively enumerating all n- of the placement of the root of the tree and its character sets are species cubic trees and, for the most parsimonious of them, maximal. we are considering all possible orders of edge contractions. Proof. V V (v) indicates the tuple of maximal sets of states Contracting edges reverses the node split operation that was that can be assigned to corresponding characters of v in a utilized in Theorem 2. most parsimonious fit. These are computed by the top-down The initial cubic tree (before any contractions) has n procedure of Hartigan’s algorithm , the correctness of which labelled and n 2 unlabelled nodes, therefore 2n 2 nodes is proven in theorem 3 of [24]. and 2n 3 edges.− The minimum number of nodes a− compact − Corollary 1. V V (v)= VU(v) when v is placed on the root. tree can have is n, so the maximum number of consecutive contractions that can be performed is n 2. In the worst case 2 3 The following lemma will help us prove the correctness our algorithm can iterate (2n 3)(2n −4) n = n− − − ····· n−2 of our contraction algorithm. times. Each iteration involves a DFS traversal that takes Lemma 6. Only edges with 0 min-cost can be contracted without linear time as a function of the size of the tree. Therefore increasing tree cost. the worst-case time complexity of the edge contraction algorithm is hyperexponential. Previous work in tree refine- Proof. Assume to the contrary that we can contract an edge ment, where maximum parsimony is pursued by contracting with min-cost > 0 where the contracted tree τc has the same edges in trees, has shown that the tree refinement problem MP-cost MM(τ) as the initial MP-tree τ. Let (u, v) be such is NP-hard [27], indicating that our problem may not have an edge. Then i, 1 i m : V V (v)i V V (u)i = . efficient solutions in the worst case without bounding the ∃ ≤ ≤ ∩ ∅ 6

values of any parameters. On average we would expect the 7 CONCLUSION edge contraction algorithm to be much more efficient, as the In this work we have established a novel connection be- probability of contracting a tree edge decreases exponen- tween mixed MP-trees and cubic MP-trees, and shown a tially as a function of the number of characters examined, mapping from the cubic MP-trees to the most compact assuming character independence. mixed MP-trees, enabling more efficient algorithms for live phylogeny with polytomies. We have designed and im- plemented an efficient optimal algorithm to generate the 6 EXPERIMENTAL RESULTS most compact MP-trees for n species by enumerating all We implemented two branch-and-bound algorithms to cubic n-species trees, finding the most parsimonious ones, identify the most compact MP-trees for n species. The first and optimally contracting them. Although contraction re- algorithm, the Mixed Tree Enumeration Algorithm (MTEA), quires potentially hyper-exponential time as a function of exhaustively enumerates all mixed trees to identify the most the number of species, the running time of our algorithm compact MP-trees. The second algorithm, the Cubic Tree is superior to the enumeration of all multifurcating trees Enumeration and Edge Contraction Algorithm (CTEECA), with n species, even in the worst case. On average we exhaustively enumerates all cubic trees to find the MP-trees, expected the contraction algorithm to be comparatively very on which it applies our edge contraction method to identify efficient, an expectation that was confirmed experimentally. the most compact MP-trees. Furthermore, cubic tree enumeration has been refined in We run these two algorithms on a dataset of viral se- several existing phylogenetic software suites for many years quences from the genes lef-8 and ac22 of the Baculoviridae [30], [31], and a large number of heuristics, approximations, family, analyzed in [28]. A multiple alignment of these and parallel algorithms have been developed and used sequences was downloaded from TreeBASE [29], and the effectively to speed up enumeration [17], [27], [32], [33], [34], first 30 characters of each taxon were used. We excluded 4 [35], [36], [37], advancements of which our edge contraction taxa for which the first 30 characters were not known. algorithm can easily take advantage to further improve its efficiency. We run our two algorithms to find most compact MP- It is our hope that our theoreticaly advances in the un- trees for n species with 4 n 11. The n species derstanding of maximum live parsimony with polytomies were selected at random from≤ the 35≤ available sequences, and our optimal algorithms for identifying the most com- and for each value of n we run 10 seperate randomized pact MP-trees for n species – providing the ability to handle experiments, averaging the results. All experiments were polytomies and input species on internal nodes natively – run on a desktop computer with an Intel i7-4820k processor will enhance studies and enable new advances in evolu- running at 3.7Ghz with 16 GB of RAM, an amount adequate tionary virology, paleontology, linguistics, and phylogenetic for all data to be stored in memory once the input sequences stemmatics. were imported from the solid state drive. The results of our experiments are shown in Table 1. Execution times displayed include the time needed to enu- ACKNOWLEDGMENTS merate and score mixed trees, or the time needed to enumer- ate, score, and contract cubic trees. Running times varied This work has been supported by NSF Grant CCF-1418874 significantly among trials for any given n, due to the nature and The College of New Jersey Mentored Undergraduate of branch and bound algorithms; in some trials, low scoring Summer Experience (MUSE) program. trees were found earlier on in the enumeration, allowing for more efficient pruning of the search space. The number of REFERENCES compact mixed trees reported is the mean number of most parsimonious mixed trees that had the fewest number of [1] J. Huelsenbeck and F. Ronquist, “Mrbayes: Bayesian nodes. The number of most compact trees from the cubic inference of phylogenetic trees,” Bioinformatics, vol. 17, enumeration includes all possible most compact trees gen- no. 8, pp. 754–755, 2001. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/11524383 erated by contracting edges in the most parsimonious cubic [2] J. Felsenstein, “Evolutionary trees from DNA sequences: A maxi- trees. This comparatively high number includes possible du- mum likelihood approach,” Journal of Molecular Evolution, vol. 17, plicate trees as well as possible rerootings of the same tree. no. 6, pp. 368–376, 1981. [3] D. Hillis, J. Bull, M. White, M. Badgett, and I. Molineux, “Experi- The number of contractions is the mean number of zero min- mental phylogenetics: generation of a known phylogeny,” Science, cost edges that were contracted in the most parsimonious vol. 255, no. 5044, pp. 589–592, 1992. cubic trees. This shows the average difference in the number [4] T. Zhu, B. Korber, A. Nahmias, E. Hooper, P. Sharp, and D. Ho, of nodes between the most compact trees and the cubic trees “An african hiv-1 sequence from 1959 and implications for the origin of the epidemic,” Nature, vol. 391, no. 6667, pp. 594–597, from which they were generated. Our results experimentally 1998. demonstrate that the CTEECA outperforms the MXEA by [5] R. Bush, C. Bender, K. Subbarao, N. Cox, and W. Fitch, at least an order of magnitude on reasonable biological “Predicting the evolution of human influenza A,” Science, vol. 286, no. 5446, pp. 1921–1925, 1999. [Online]. Available: datasets, with similar behavior observed on datasets from http://www.ncbi.nlm.nih.gov/pubmed/10583948 the domain of phylogenetic stemmatics. [6] G. D. Wilson and G. D. Edgecombe, “The triassic isopod The programs used to perform these experiments protamphisopus wianamattensis (chilton) and comparison were written in the Java programming language with extant taxa (crustacea, phreatoicidea),” Journal of Paleontology, vol. 77, no. 3, pp. 454–470, 2003. [Online]. Avail- and the documented source code be downloaded at: able: http://www.jstor.org/stable/pdf/4094794.pdf?seq=1#page https://github.com/ottj3/phylotreecontract. scan tab contents 7

Table 1 Mixed Tree Enumeration vs Cubic MP-Tree Contraction

Number of MTEA CTEECA Compact Mixed Cubic Contracted Cubic Number of Species Time (ms) Time (ms) MP-Trees (#) MP-Trees (#) MP-Trees (#) Contractions 4 10 8 1.1 1.1 3 1.4 5 36 10 1.6 1.9 5.4 1.7 6 191 43 2 2.7 13.4 1.8 7 796 167 3.5 4.4 162 2.3 8 8540 811 2.5 4 30.2 2.4 9 128242 10458 4.8 10.5 328.8 2.8 10 865839 139922 5.2 10.1 752 3.1 11 3831436 1778987 3.9 18.7 1716.8 3.5

[7] N. Salamin, T. R. Hodkinson, and V. Savolainen, “Building Intelligence and Lecture Notes in Bioinformatics), vol. 8486 LNCS, supertrees: an empirical assessment using the grass family 2014, pp. 79–88. (Poaceae).” Systematic biology, vol. 51, no. 1, pp. 136–150, 2002. [22] C. Semple and M. Steel, Phylogenetics, ser. Oxford [8] D. L. Swofford and J. Sullivan, “Phylogeny inference based lecture series in mathematics and its applications. on parsimony and other methods using PAUP,” in The Oxford University Press, 2003. [Online]. Available: Phylogenetic Handbook, 2nd ed., P. Lemey, M. Salemi, and https://books.google.gr/books?id=uR8i2qetjSAC A.-M. Vandamme, Eds. Cambridge University Press, 2009, [23] G. Exoo, “A simple method for constructing small cubic pp. 267–312, cambridge Books Online. [Online]. Available: graphs of girths 14, 15, and 16.” The Electronic Journal of http://dx.doi.org/10.1017/CBO9780511819049.010 Combinatorics, vol. 3, no. 1, p. 30, 1996. [Online]. Available: [9] G. P. Telles, N. F. Almeida, R. Minghim, and M. E. http://eudml.org/doc/119250 M. T. Walter, “Live phylogeny.” Journal of computational [24] J. A. Hartigan, “Minimum mutation fits to a given tree,” biology : a journal of computational molecular cell biology, Biometrics, vol. 29, no. 1, pp. 53–65, 1973. [Online]. Available: vol. 20, no. 1, pp. 30–7, 2013. [Online]. Available: http://www.jstor.org/stable/2529676 http://www.ncbi.nlm.nih.gov/pubmed/23294270 [25] C. Flight, “How many stemmata?” Manuscripta, vol. 34, [10] S. A. Carroll, J. S. Towner, T. K. Sealy, L. K. McMullan, no. 2, pp. 122–128, 1990. [Online]. Available: M. L. Khristova, F. J. Burt, R. Swanepoel, P. E. Rollin, and http://dx.doi.org/10.1484/J.MSS.3.1335 S. T. Nichol, “Molecular evolution of viruses of the family [26] N. Sloane. (2010) The On-Line Encyclopedia of Integer Sequences. filoviridae based on 97 whole-genome sequences,” Journal of [Online]. Available: http://oeis.org Virology, vol. 87, no. 5, pp. 2608–2616, 2013. [Online]. Available: [27] M. Bonet, M. Steel, T. Warnow, and S. Yooseph, “Better http://jvi.asm.org/content/87/5/2608.abstract methods for solving parsimony and compatibility,” J Comput [11] P. Buendia and G. Narasimhan, “Serial evolutionary networks of Biol, vol. 5, no. 3, pp. 391–407, 1998. [Online]. Available: within-patient HIV-1 sequences reveal patterns of evolution of X4 http://www.ncbi.nlm.nih.gov/pubmed/9773340 strains.” BMC Systems Biology, vol. 3, p. 62, 2009. [28] E. a. Herniou, J. a. Olszewski, D. R. O’Reilly, and J. S. Cory, [12] M. Foote, “On the probability of ancestors in the fossil record,” “Ancient coevolution of baculoviruses and their insect hosts.” Paleobiology, vol. 22, pp. 141–151, 1996. [Online]. Available: Journal of virology, vol. 78, no. 7, pp. 3244–3251, 2004. http://journals.cambridge.org/article S0094837300016146 [29] W. H. Piel, M. Donoghue, and M. Sanderson, “TreeBASE : A [13] P. Mierzejewski, “A new graptolite, intermediate between database of phylogenetic information,” in Proceedings of the 2nd the tuboidea and the. camaroidea,” Acta Palaeontologica International Workshop of Species 2000, 2002, pp. 41–47. [Online]. Polonica, vol. 46, no. 3, pp. 367–376, 2001. [Online]. Available: Available: http://phylogeny.harvard.edu/treebase. https://www.app.pan.pl/archive/published/app46/app46-367.pdf [30] J. Felsenstein, “Phylip: phylogeny inference package (version [14] A. Urbanek, “Oligophyly and evolutionary parallelism: A case 3.2),” , vol. 5, pp. 164–166, 1989. study of silurian graptolites,” Acta Palaeontologica Polonica, vol. 43, [31] D. L. Swofford, “Phylogenetic Analysis Using Parsimony,” no. 4, pp. 549–572, 1998. Options, vol. 42, pp. 294–307, 2003. [Online]. Available: http://www.springerlink.com/index/10.1007/BF02198856 [15] D. W. Bapst, “paleotree: an R package for paleontological and [32] M. Yan and D. A. Bader, “D.a.: Fast character optimization in parsi- phylogenetic analyses of evolution,” Methods in Ecology and mony phylogeny reconstruction,” Georgia Institute of Technology, Evolution, vol. 3, no. 5, pp. 803–807, 2012. [Online]. Available: Tech. Rep., 2003. http://dx.doi.org/10.1111/j.2041-210X.2012.00223.x [33] D. A. Bader, V. P. Chandu, and M. Yan, “ExactMP: An efficient [16] A. F. Y. Poon, L. W. Walker, H. Murray, R. M. McCloskey, parallel exact solver for phylogenetic tree reconstruction using P. R. Harrigan, and R. H. Liang, “Mapping the Shapes of maximum parsimony,” in Proceedings of the International Conference Phylogenetic Trees from Human and Zoonotic RNA Viruses,” on Parallel Processing, 2006, pp. 65–73. PLoS ONE, vol. 8, no. 11, pp. 78–122, 2013. [Online]. Available: [34] S. Sridhar, F. Lam, G. E. Blelloch, R. Ravi, and R. Schwartz, “Mixed http://dx.doi.org/10.1371%2Fjournal.pone.0078122 integer linear programming for maximum-parsimony phylogeny [17] P. A. Goloboff, “Optimization of polytomies: state set and parallel inference,” in IEEE/ACM Transactions on Computational Biology and operations.” Molecular phylogenetics and evolution, vol. 22, no. 2, pp. Bioinformatics, vol. 5, no. 3, 2008, pp. 323–331. 269–275, 2002. [35] A. Goeffon, J. M. Richer, and J. K. Hao, “Progressive tree neighbor- [18] R. L. Graham and L. R. Foulds, “Unlikelihood that minimal hood applied to the maximum parsimony problem,” IEEE/ACM phylogenies for a realistic biological study can be constructed in Transactions on Computational Biology and Bioinformatics, vol. 5, reasonable computational time,” Mathematical Biosciences, vol. 60, no. 1, pp. 136–145, 2008. no. 2, pp. 133–142, 1982. [36] N. Alon, B. Chor, F. Pardi, and A. Rapoport, “Approximate maxi- [19] W. H. E. Day, “Computationally difficult parsimony problems in mum parsimony and ancestral maximum likelihood,” IEEE/ACM phylogenetic ,” Journal of Theoretical Biology, vol. 103, Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 429–438, 1983. no. 1, pp. 183–187, 2010. [20] W. H. E. Day, D. S. Johnson, and D. Sankoff, “The computational [37] W. T. J. White and B. R. Holland, “Faster exact maximum parsi- complexity of inferring rooted phylogenies by parsimony,” Math- mony search with XMP,” Bioinformatics, vol. 27, no. 10, pp. 1359– ematical Biosciences, vol. 81, no. 1, pp. 33–42, 1986. 1367, 2011. [21] A. Carmel, N. Musa-Lempel, D. Tsur, and M. Ziv-Ukelson, “The worst case complexity of maximum parsimony,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial