Reconstructing Unrooted Phylogenetic Trees from Symbolic Ternary Metrics
Stefan Grünewald
CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Key Laboratory of Computational Biology, 320 Yue Yang Road, Shanghai 200032, China
Email: [email protected]

Yangjing Long
School of Mathematics and Statistics, Central China Normal University, Luoyu Road 152, Wuhan, Hubei 430079, China
Email: [email protected]

Yaokun Wu
Department of Mathematics and MOE-LSC, Shanghai Jiao Tong University, Dongchuan Road 800, Shanghai 200240, China
Email: [email protected]

arXiv:1702.00190v3 [math.CO] 17 Jan 2018

Abstract

In 1998, Böcker and Dress presented a one-to-one correspondence between symbolically dated rooted trees and symbolic ultrametrics. We consider the corresponding problem for unrooted trees. More precisely, given a tree $T$ with leaf set $X$ and a proper vertex coloring of its interior vertices, we can map every set of three distinct leaves to the color of its median vertex. We characterize all ternary maps that can be obtained in this way in terms of 4- and 5-point conditions, and we show that the corresponding tree and its coloring can be reconstructed from a ternary map that satisfies those conditions. Further, we give an additional condition that characterizes whether the tree is binary, and we describe an algorithm that reconstructs general trees in a bottom-up fashion.

Keywords: symbolic ternary metric; median vertex; unrooted phylogenetic tree

1 Introduction

A phylogenetic tree is a rooted or unrooted tree whose leaves are labeled by some objects of interest, usually taxonomic units (taxa) such as species. Each edge has a positive length, so the tree defines a metric on the set of taxa. It is a classical result in phylogenetics that the tree can be reconstructed from this metric if it is unrooted or ultrametric. The latter means that the tree is rooted and all taxa are equally far away from the root. An ultrametric tree is realistic whenever the edge lengths are proportional to time and the taxa are species that can be observed in the present. In an ultrametric tree, the distance between two taxa is twice the distance between each of the taxa and their last common ancestor (lca), hence pairs of taxa with the same lca must have the same distance. For three taxa $x, y, z$, it follows that there is no unique maximum among their three pairwise distances, and thus we have $d(x, y) \le \max\{d(x, z), d(y, z)\}$. This 3-point condition turns out to be sufficient for a metric to be ultrametric, too, and it is the key to reconstructing ultrametric trees from their distances.

In 1995, Bandelt and Steel [1] observed that the complete ordering of the real numbers is not necessary to reconstruct trees, and they showed that the real-valued distances can be replaced by maps from the pairs of taxa into a cancellative abelian monoid. Later, Böcker and Dress [2] pushed this idea to the limit by proving that the image set of the symmetric map does not need any structure at all (see Section 2 for details). While this result is useful for understanding how little information it takes to reconstruct an ultrametric phylogenetic tree, it was not until recently that it turned out to have some practical applications. In 2013, Hellmuth et al. [8] found an alternative characterization of symbolic ultrametrics in terms of cographs and showed that, for perfect data, phylogenetic trees can be reconstructed from orthology information. By adding some optimization tools, this concept was then applied to analyze real data [9].

Motivated by the practical applicability of symbolic ultrametrics, we consider their unrooted version. However, in an unrooted tree there is in general no interior vertex associated to a pair of taxa that would correspond to the last common ancestor in a rooted tree.
Instead, there is a median associated to every set of three taxa that represents, for every possible rooting of the tree, a last common ancestor of at least two of the three taxa. Therefore, we consider ternary maps from the triples of taxa into an image set without any structure. We will show that an unrooted phylogenetic tree with a proper vertex coloring can be reconstructed from the function that maps every triple of taxa to the color of its median. In order to apply our results to real data, we need some way to assign a state to every set of three taxa, with the property that 3-sets with the same median will usually have the same state. For symbolic ultrametrics, the first real application was found 15 years after the development of the theory. In addition to the hope that something similar happens with symbolic ternary metrics, we have some indication that they can be useful to construct unrooted trees from orthology relations (see Section 6).

Consider an unrooted tree $T$ with vertex set $V$, edge set $E$, and leaf set $X$, and a dating map $t : V \to M^{\odot}$, where $M^{\odot} = M \cup \{\odot\}$, such that $t(x) = \odot$ for all $x \in X$, and $t(v_1) \neq t(v_2)$ if $v_1 v_2 \in E$. For any $S = \{x, y\} \in \binom{V}{2}$ there is a unique path $[x, y]$ with end points $x$ and $y$, and for any 3-set $S = \{x, y, z\} \in \binom{V}{3}$ there is a unique triple point or median $\mathrm{med}(x, y, z)$ such that $[x, y] \cap [y, z] \cap [x, z] = \{\mathrm{med}(x, y, z)\}$. Putting $[x, x] = \{x\}$, the definition also works if some or all of $x$, $y$ and $z$ are equal. Given a phylogenetic tree $T$ on $X$ and a dating map $t : V \to M^{\odot}$, we can define the symmetric symbolic ternary map $d_{(T,t)} : X \times X \times X \to M^{\odot}$ by $d_{(T,t)}(x, y, z) = t(\mathrm{med}(x, y, z))$.

In this set-up, our question can be phrased as follows: Suppose we are given an arbitrary symbolic ternary map $\delta : X \times X \times X \to M^{\odot}$. Can we determine whether there is a pair $(T, t)$ for which $d_{(T,t)}(x, y, z) = \delta(x, y, z)$ holds for all $x, y, z \in X$?

The rest of this paper is organized as follows.
In Section 1.1, we present the basic and relevant concepts used in this paper. In Section 2 we recall the one-to-one correspondence between symbolic ultrametrics and symbolically dated trees, and introduce our main results, Theorem 4 and Theorem 5. In Section 3 we give the proof of Theorem 4. In order to prove our main result, we first introduce the connection between phylogenetic trees and quartet systems on $X$ in Subsection 3.1. Then we use a graph representation to analyze all cases of the map $\delta$ for 5-taxa subsets of $X$ in Subsection 3.2. In Section 4 we use a similar method to prove Theorem 5, which gives a necessary and sufficient condition to reconstruct a binary phylogenetic tree on $X$. In Section 5, we give a criterion to identify all pseudo-cherries of the underlying tree from a symbolic ternary metric. This result makes it possible to reconstruct the tree in a bottom-up fashion. In the last section we discuss some open questions and future work.

1.1 Preliminaries

We introduce the relevant basic concepts and notation. Unless stated otherwise, we will follow the monographs [11] and [4]. In the remainder of this paper, $X$ denotes a finite set of size at least three.

An (unrooted) tree $T = (V, E)$ is an undirected connected acyclic graph with vertex set $V$ and edge set $E$. A vertex of $T$ is a leaf if it is of degree 1, and all vertices with degree at least two are interior vertices. A rooted tree $T = (V, E)$ is a tree that contains a distinguished vertex $\rho_T \in V$ called the root. We define a partial order $\preceq_T$ on $V$ by setting $v \preceq_T w$ for any two vertices $v, w \in V$ for which $v$ is a vertex on the path from $\rho_T$ to $w$. In particular, if $v \preceq_T w$ and $v \neq w$, we call $v$ an ancestor of $w$.

An unrooted phylogenetic tree $T$ on $X$ is an unrooted tree with leaf set $X$ that does not contain any vertex of degree 2. It is binary if every interior vertex has degree 3.
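
As an illustration of the definitions so far, the following Python sketch builds a small unrooted tree, checks the degree conditions, and evaluates the median and the ternary map described in the Introduction. The tree, the colors, and all names (`EDGES`, `COLORS`, `LEAF`, `adjacency`, `path`, `median`, `ternary_map`) are assumptions made for this example and do not come from the paper; connectivity and acyclicity of the input are assumed rather than verified.

```python
from itertools import combinations

# Hypothetical example tree (not from the paper): leaves a..e and interior
# vertices u, v, w, given as a list of undirected edges.
EDGES = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"),
         ("v", "w"), ("d", "w"), ("e", "w")]

LEAF = "leaf"  # stands in for the reserved symbol assigned to every leaf
# Hypothetical dating map: adjacent interior vertices get different colors.
COLORS = {"u": "red", "v": "blue", "w": "red",
          "a": LEAF, "b": LEAF, "c": LEAF, "d": LEAF, "e": LEAF}


def adjacency(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for p, q in edges:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    return adj


def is_unrooted_phylogenetic(adj):
    """No vertex of degree 2 (the input is assumed to be a tree)."""
    return all(len(nbrs) != 2 for nbrs in adj.values())


def is_binary(adj):
    """Every vertex is a leaf (degree 1) or has degree exactly 3."""
    return all(len(nbrs) in (1, 3) for nbrs in adj.values())


def path(adj, x, y):
    """Vertex set of the unique path [x, y] in a tree; [x, x] = {x}."""
    parent = {x: None}
    stack = [x]
    while stack:
        cur = stack.pop()
        if cur == y:
            break
        for nxt in adj[cur]:
            if nxt not in parent:
                parent[nxt] = cur
                stack.append(nxt)
    verts = set()
    cur = y
    while cur is not None:  # walk back from y to x along parent pointers
        verts.add(cur)
        cur = parent[cur]
    return verts


def median(adj, x, y, z):
    """The single vertex shared by the paths [x, y], [y, z] and [x, z]."""
    common = path(adj, x, y) & path(adj, y, z) & path(adj, x, z)
    (m,) = common  # in a tree the intersection has exactly one vertex
    return m


def ternary_map(adj, t, leaves):
    """Map every 3-set of leaves to the color of its median."""
    return {frozenset(s): t[median(adj, *s)]
            for s in combinations(sorted(leaves), 3)}


ADJ = adjacency(EDGES)
DELTA = ternary_map(ADJ, COLORS, "abcde")
```

In this example, the 3-set {a, b, c} has median u and the 3-set {a, c, d} has median v, so `DELTA` sends them to the colors of u and v, respectively. Calling `median(ADJ, "a", "a", "b")` returns the leaf a itself, matching the convention $[x, x] = \{x\}$.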
A rooted phylogenetic tree $T$ on $X$ is a rooted tree with leaf set $X$ that does not contain any vertices with in- and out-degree one, and whose root $\rho_T$ has in-degree zero. For a set $A \subseteq X$ with cardinality at least 2, we define the last common ancestor of $A$, denoted by $\mathrm{lca}_T(A)$, to be the unique vertex in $T$ that is the greatest lower bound of $A$ under the partial order $\preceq_T$. In case $A = \{x, y\}$ we put $\mathrm{lca}_T(x, y) = \mathrm{lca}_T(\{x, y\})$.

Given a set $Q$ of four taxa $\{a, b, c, d\}$, there always exist exactly three partitions into two pairs: $\{\{a, b\}, \{c, d\}\}$, $\{\{a, c\}, \{b, d\}\}$ and $\{\{a, d\}, \{b, c\}\}$. These partitions are called quartets, and they represent the three non-isomorphic unrooted binary trees with leaf set $Q$. These trees are usually called quartet trees, and they, as well as the corresponding quartets, are symbolized by $ab|cd$, $ac|bd$ and $ad|bc$, respectively. We use $\mathcal{Q}(X)$ to denote the set of all quartets with four taxa in $X$. A phylogenetic tree $T$ on $X$ displays a quartet $ab|cd \in \mathcal{Q}(X)$ if the path from $a$ to $b$ in $T$ is vertex-disjoint from the path from $c$ to $d$.
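
The notion of a displayed quartet can be checked directly from the definition. The sketch below is a minimal illustration in Python, assuming the tree is given as a list of undirected edges; the example trees and the names `BINARY_EDGES`, `STAR_EDGES`, `adjacency`, `path`, `displays` and `displayed_quartets` are hypothetical and not taken from the paper.

```python
from itertools import combinations

# Hypothetical example trees (not from the paper).
BINARY_EDGES = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"),
                ("v", "w"), ("d", "w"), ("e", "w")]
STAR_EDGES = [("a", "s"), ("b", "s"), ("c", "s"), ("d", "s")]


def adjacency(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for p, q in edges:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    return adj


def path(adj, x, y):
    """Vertex set of the unique path between x and y in a tree."""
    parent = {x: None}
    stack = [x]
    while stack:
        cur = stack.pop()
        if cur == y:
            break
        for nxt in adj[cur]:
            if nxt not in parent:
                parent[nxt] = cur
                stack.append(nxt)
    verts = set()
    cur = y
    while cur is not None:  # walk back from y to x along parent pointers
        verts.add(cur)
        cur = parent[cur]
    return verts


def displays(adj, a, b, c, d):
    """True if the tree displays ab|cd: the two paths share no vertex."""
    return path(adj, a, b).isdisjoint(path(adj, c, d))


def displayed_quartets(adj, leaves):
    """All quartets on the given leaves that the tree displays."""
    found = set()
    for a, b, c, d in combinations(sorted(leaves), 4):
        # The three partitions of {a, b, c, d} into two pairs.
        splits = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
        for (p, q), (r, s) in splits:
            if displays(adj, p, q, r, s):
                found.add(frozenset({frozenset({p, q}), frozenset({r, s})}))
    return found


BINARY = adjacency(BINARY_EDGES)
STAR = adjacency(STAR_EDGES)
```

For the binary example tree, each of the five 4-subsets of leaves yields exactly one displayed quartet, whereas the star tree displays none, because every path between two of its leaves passes through the central vertex.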