Reconstructing unrooted phylogenetic trees from symbolic ternary metrics

Stefan Grünewald
CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Key Laboratory of Computational Biology, 320 Yue Yang Road, Shanghai 200032, China

Yangjing Long
School of Mathematics and Statistics, Central China Normal University, Luoyu Road 152, Wuhan, Hubei 430079, China

Yaokun Wu
Department of Mathematics and MOE-LSC, Shanghai Jiao Tong University, Dongchuan Road 800, Shanghai 200240, China

Abstract

In 1998, Böcker and Dress presented a 1-to-1 correspondence between symbolically dated rooted trees and symbolic ultrametrics. We consider the corresponding problem for unrooted trees. More precisely, given a tree T with leaf set X and a proper coloring of its interior vertices, we can map every triple of three different leaves to the color of its median vertex. We characterize all ternary maps that can be obtained in this way in terms of 4- and 5-point conditions, and we show that the corresponding tree and its coloring can be reconstructed from a ternary map that satisfies those conditions. Further, we give an additional condition that characterizes whether the tree is binary, and we describe an algorithm that reconstructs general trees in a bottom-up fashion.

Keywords: symbolic ternary metric; median vertex; unrooted phylogenetic tree

arXiv:1702.00190v3 [math.CO] 17 Jan 2018

1 Introduction

A phylogenetic tree is a rooted or unrooted tree whose leaves are labeled by some objects of interest, usually taxonomic units (taxa) such as species. The edges have positive lengths, so the tree defines a metric on the taxa set. It is a classical result in phylogenetics that the tree can be reconstructed from this metric if it is unrooted or ultrametric. The latter means that the tree is rooted and all taxa are equally far away from the root. An ultrametric tree is realistic whenever the edge lengths are proportional to time and the taxa are species that can be observed in the present. In an ultrametric tree, the distance between two taxa is twice the distance between each of the taxa and their last common ancestor (lca), hence pairs of taxa with the same lca must have the same distance. For three taxa x, y, z, it follows that there is no unique maximum among their three pairwise distances, thus we have d(x, y) ≤ max{d(x, z), d(y, z)}. This 3-point condition turns out to be sufficient for a metric to be ultrametric, too, and it is the key to reconstructing ultrametric trees from their distances.

In 1995, Bandelt and Steel [1] observed that the complete ordering of the real numbers is not necessary to reconstruct trees, and they showed that the real-valued distances can be replaced by maps from the pairs of taxa into a cancellative abelian monoid. Later, Böcker and Dress [2] pushed this idea to the limit by proving that the image set of the symmetric map does not need any structure at all (see Section 2 for details). While this result is useful for understanding how little information it takes to reconstruct an ultrametric phylogenetic tree, it was not until recently that it turned out to have some practical applications. In 2013, Hellmuth et al.
[8] found an alternative characterization of symbolic ultrametrics in terms of cographs and showed that, for perfect data, phylogenetic trees can be reconstructed from orthology information. By adding some optimization tools, this concept was then applied to analyze real data [9].

Motivated by the practical applicability of symbolic ultrametrics, we consider their unrooted version. However, in an unrooted tree there is in general no interior vertex associated to a pair of taxa that would correspond to the last common ancestor in a rooted tree. Instead, there is a median associated to every set of three taxa that represents, for every possible rooting of the tree, a last common ancestor of at least two of the three taxa. Therefore, we consider ternary maps from the triples of taxa into an image set without any structure. We will show that an unrooted phylogenetic tree with a proper vertex coloring can be reconstructed from the function that maps every triple of taxa to the color of its median.

In order to apply our results to real data, we need some way to assign a state to every set of three taxa, with the property that 3-sets with the same median will usually have the same state. For symbolic ultrametrics, the first real application was found 15 years after the development of the theory. In addition to the hope that something similar happens with symbolic ternary metrics, we have some indication that they can be useful to construct unrooted trees from orthology relations (see Section 6).

Consider an unrooted tree T with vertex set V, edge set E, and leaf set X, and a dating map t : V → M^⊙, where M^⊙ = M ∪ {⊙}, such that t(x) = ⊙ for all x ∈ X, and t(v1) ≠ t(v2) if v1v2 ∈ E. For any S = {x, y} ∈ $\binom{V}{2}$ there is a unique path [x, y] with end points x and y, and for any 3-set S = {x, y, z} ∈ $\binom{V}{3}$ there is a unique triple point or median med(x, y, z) such that [x, y] ∩ [y, z] ∩ [x, z] = {med(x, y, z)}. Putting [x, x] = {x}, the definition also works if some or all of x, y and z are equal.

Given a phylogenetic tree T on X and a dating map t : V → M^⊙, we can define the symmetric symbolic ternary map d_(T;t) : X × X × X → M^⊙ by d_(T;t)(x, y, z) = t(med(x, y, z)). In this set-up, our question can be phrased as follows: Suppose we are given an arbitrary symbolic ternary map δ : X × X × X → M^⊙. Can we determine whether there is a pair (T; t) for which d_(T;t)(x, y, z) = δ(x, y, z) holds for all x, y, z ∈ X?

The rest of this paper is organized as follows. In Section 1.1, we present the basic and relevant concepts used in this paper.
In Section 2 we recall the one-to-one correspondence between symbolic ultrametrics and symbolically dated trees, and introduce our main results Theorem 4 and Theorem 5. In Section 3 we give the proof of Theorem 4. In order to prove our main result, we first introduce the connection between phylogenetic trees and quartet systems on X in Subsection 3.1. Then we use a graph representation to analyze all cases of the map δ for 5-taxa subsets of X in Subsection 3.2. In Section 4 we use a similar method to prove Theorem 5, which gives a sufficient and necessary condition to reconstruct a binary phylogenetic tree on X. In Section 5, we give a criterion to identify all pseudo-cherries of the underlying tree from a symbolic ternary metric. This result makes it possible to reconstruct the tree in a bottom-up fashion. In the last section we discuss some open questions and future work.
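To make the definition of the ternary map concrete, the following Python sketch computes medians and the map d_(T;t) for one small example. The example tree (a caterpillar with leaves x1, x2, y, z1, z2 and interior vertices v1, v2, v3), the colors "A" and "B", the placeholder string used for the value assigned to leaves, and all function names are assumptions made for illustration only.

```python
from collections import deque

NON_EVENT = "non-event"  # hypothetical stand-in for the special value given to leaves

def build_adjacency(edges):
    """Turn an edge list into an adjacency dict of an undirected tree."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def path(adj, a, b):
    """Vertices on the unique path [a, b]; path(adj, a, a) is [a]."""
    parent = {a: None}
    queue = deque([a])
    while queue:  # breadth-first search from a records a parent for every vertex
        v = queue.popleft()
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    vertices = [b]
    while vertices[-1] != a:  # walk back from b to a
        vertices.append(parent[vertices[-1]])
    return vertices[::-1]

def median(adj, x, y, z):
    """The single vertex lying on all three paths [x, y], [y, z], [x, z]."""
    common = set(path(adj, x, y)) & set(path(adj, y, z)) & set(path(adj, x, z))
    (m,) = common  # in a tree the intersection has exactly one element
    return m

def ternary_map(adj, t):
    """Return the map (x, y, z) -> t(med(x, y, z))."""
    return lambda x, y, z: t[median(adj, x, y, z)]

# Illustrative example: x1, x2 hang off v1; y off v2; z1, z2 off v3.
LEAVES = ["x1", "x2", "y", "z1", "z2"]
EDGES = [("x1", "v1"), ("x2", "v1"), ("v1", "v2"), ("y", "v2"),
         ("v2", "v3"), ("z1", "v3"), ("z2", "v3")]
ADJ = build_adjacency(EDGES)
DATING = {"v1": "A", "v2": "B", "v3": "A", **{leaf: NON_EVENT for leaf in LEAVES}}
d = ternary_map(ADJ, DATING)

print(d("x1", "x2", "y"), d("x1", "y", "z1"), d("x1", "x1", "y"))
```

In this example the dating map is proper because adjacent interior vertices receive different colors, while the non-adjacent vertices v1 and v3 share the color "A".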

1.1 Preliminaries

We introduce the relevant basic concepts and notation. Unless stated otherwise, we will follow the monographs [11] and [4]. In the remainder of this paper, X denotes a finite set of size at least three.

An (unrooted) tree T = (V, E) is an undirected connected acyclic graph with vertex set V and edge set E. A vertex of T is a leaf if it is of degree 1, and all vertices with degree at least two are interior vertices.

A rooted tree T = (V, E) is a tree that contains a distinguished vertex ρ_T ∈ V called the root. We define a partial order ⪯_T on V by setting v ⪯_T w for any two vertices v, w ∈ V for which v is a vertex on the path from ρ_T to w. In particular, if v ⪯_T w and v ≠ w we call v an ancestor of w.

An unrooted phylogenetic tree T on X is an unrooted tree with leaf set X that does not contain any vertex of degree 2. It is binary if every interior vertex has degree 3. A rooted phylogenetic tree T on X is a rooted tree with leaf set X that does not contain any vertices with in- and out-degree one, and whose root ρ_T has in-degree zero. For a set A ⊆ X with cardinality at least 2, we define the last common ancestor of A, denoted by lca_T(A), to be the unique vertex in T that is the greatest lower bound of A under the partial order ⪯_T. In case A = {x, y} we put lca_T(x, y) = lca_T({x, y}).

Given a set Q of four taxa {a, b, c, d}, there always exist exactly three partitions into two pairs: {{a, b}, {c, d}}, {{a, c}, {b, d}} and {{a, d}, {b, c}}. These partitions are called quartets, and they represent the three non-isomorphic unrooted binary trees with leaf set Q. These trees are usually called quartet trees, and they, as well as the corresponding quartets, are symbolized by ab|cd, ac|bd, ad|bc respectively. We use Q(X) to denote the set of all quartets with four taxa in X. A phylogenetic tree T on X displays a quartet ab|cd ∈ Q(X) if the path from a to b in T is vertex-disjoint from the path from c to d.
The collection of all quartets that are displayed by T is denoted by Q_T.

Let M be a non-empty finite set, let ⊙ denote a special element not contained in M, and let M^⊙ := M ∪ {⊙}. Note that in biology the symbol ⊙ corresponds to a "non-event" and is introduced for purely technical reasons [8]. A symbolic ternary map is a mapping from X × X × X to M^⊙. Given a symbolic ternary map δ : X × X × X → M^⊙, we say δ is symmetric if the value of δ(x, y, z) depends only on the set {x, y, z} and not on the ordering of x, y, z, i.e., if δ(x, y, z) = δ(y, x, z) = δ(z, y, x) = δ(x, z, y) for all x, y, z ∈ X. For simplicity, if a map δ : X × X × X → M^⊙ is symmetric, then we can define δ on the set {x, y, z} to be δ(x, y, z). For a set S, we define |S| to be the number of elements in S.
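The notion of a tree displaying a quartet can be checked directly from the definition: ab|cd is displayed when the a-b path and the c-d path share no vertex. The sketch below does this for an assumed example tree (a caterpillar with leaves x1, x2, y, z1, z2); the tree, the representation of a quartet as a set of two pairs, and the function names are illustrative assumptions.

```python
from collections import deque
from itertools import combinations

def path(adj, a, b):
    """Vertices on the unique a-b path of a tree given as an adjacency dict."""
    parent = {a: None}
    queue = deque([a])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    vertices = [b]
    while vertices[-1] != a:
        vertices.append(parent[vertices[-1]])
    return vertices

def quartet(a, b, c, d):
    """Represent ab|cd as an unordered pair of unordered pairs."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})

def displays(adj, a, b, c, d):
    """True if the a-b path and the c-d path are vertex-disjoint."""
    return not set(path(adj, a, b)) & set(path(adj, c, d))

def displayed_quartets(adj, leaves):
    """All quartets on the given leaves that the tree displays."""
    result = set()
    for a, b, c, d in combinations(leaves, 4):
        # the three ways to split {a, b, c, d} into two pairs
        for p, q, r, s in ((a, b, c, d), (a, c, b, d), (a, d, b, c)):
            if displays(adj, p, q, r, s):
                result.add(quartet(p, q, r, s))
    return result

# Illustrative example tree.
EDGES = [("x1", "v1"), ("x2", "v1"), ("v1", "v2"), ("y", "v2"),
         ("v2", "v3"), ("z1", "v3"), ("z2", "v3")]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)
LEAVES = ["x1", "x2", "y", "z1", "z2"]

Q_T = displayed_quartets(ADJ, LEAVES)
print(len(Q_T))
```

Since the example tree is binary, it displays exactly one quartet for each of the five 4-element subsets of its leaf set.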

2 Symbolic ultrametrics and our main results

In this section, we first recall the main result concerning symbolic ultrametrics by Böcker and Dress [2]. Suppose δ : X × X → M^⊙ is a map. We call δ a symbolic ultrametric if it satisfies the following conditions:

(U1) δ(x, y) = ⊙ if and only if x = y;
(U2) δ(x, y) = δ(y, x) for all x, y ∈ X, i.e., δ is symmetric;
(U3) |{δ(x, y), δ(x, z), δ(y, z)}| ≤ 2 for all x, y, z ∈ X; and
(U4) there exists no subset {x, y, u, v} ∈ $\binom{X}{4}$ such that δ(x, y) = δ(y, u) = δ(u, v) ≠ δ(y, v) = δ(x, v) = δ(x, u).

Now suppose that T = (V, E) is a rooted phylogenetic tree on X and that t : V → M^⊙ is a map such that t(x) = ⊙ for all x ∈ X. We call such a map t a symbolic dating map for T; it is discriminating if t(u) ≠ t(v) for all edges {u, v} ∈ E. Given (T, t), we associate the map d_(T;t) on X × X by setting, for all x, y ∈ X, d_(T;t)(x, y) = t(lca_T(x, y)). Clearly δ = d_(T;t) satisfies Conditions (U1), (U2), (U3), (U4), and we say that (T; t) is a symbolic representation of δ. Böcker and Dress established in 1998 the following fundamental result, which gives a 1-to-1 correspondence between symbolic ultrametrics and symbolic representations [2], i.e., the map defined by (T, t) ↦ d_(T;t) is a bijection from the set of symbolically dated trees into the set of symbolic ultrametrics.

Theorem 1 (Böcker and Dress 1998 [2]). Suppose δ : X × X → M^⊙ is a map. Then there is a discriminating symbolic representation of δ if and only if δ is a symbolic ultrametric. Furthermore, up to isomorphism, this representation is unique.

Similarly, we consider unrooted trees. Suppose that T = (V, E) is an unrooted tree on X and that t : V → M^⊙ is a symbolic dating map, i.e., t(x) = ⊙ for all x ∈ X; it is discriminating if t(x) ≠ t(y) for all {x, y} ∈ E. Given the pair (T; t), we associate the map δ_(T;t) on X × X × X by setting, for all x, y, z ∈ X, δ_(T;t)(x, y, z) = t(med(x, y, z)). Before stating our main results, we need the following definition:

Definition 2 (n-m partitioned).
Suppose δ : X × X × X → M^⊙ is a symmetric map. We say that a subset S of X is n-m partitioned (by δ) if among all the 3-element subsets of S there are in total 2 different values of δ, and n of those 3-sets are mapped to one value while all other m 3-sets are mapped to the other value. Note that S can be n-m partitioned only when $\binom{|S|}{3}$ = m + n.

Definition 3 (symbolic ternary metrics). We say δ : X × X × X → M^⊙ is a symbolic ternary metric if the following conditions hold.

(1) δ is symmetric, i.e., δ(x, y, z) = δ(y, x, z) = δ(z, y, x) = δ(x, z, y) for all x, y, z ∈ X.
(2) δ(x, y, z) = ⊙ if and only if x = z or y = z or x = y.
(3) For any distinct x, y, z, u we have |{δ(x, y, z), δ(x, y, u), δ(x, z, u), δ(y, z, u)}| ≤ 2, and when equality holds then {x, y, z, u} is 2-2 partitioned by δ.
(4) There is no 5-element subset {x, y, z, u, e} of X which is 5-5 partitioned by δ.

We will refer to these conditions throughout the paper. Our main result is:

Theorem 4. There is a 1-to-1 correspondence between the discriminating symbolically dated phylogenetic trees and the symbolic ternary metrics on X.

Let δ be a symbolic ternary metric on X. Then we call δ fully resolved if the following condition holds:

(*) If |{δ(x, y, z), δ(x, y, u), δ(x, z, u), δ(y, z, u)}| = 1, then there exists e ∈ X such that e can resolve xyzu, i.e., the set {x, y, z, u, e} is 4-6 partitioned by δ.

Now we can characterize symbolic ternary metrics that correspond to binary phylogenetic trees:

Theorem 5. There is a 1-to-1 correspondence between the discriminating symbolically dated binary phylogenetic trees and the fully resolved symbolic ternary metrics on X.
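Conditions (3), (4) and (*) can be tested by brute force over all 4- and 5-element subsets. The following Python sketch assumes that δ is stored as a dictionary keyed by 3-element frozensets of distinct taxa, so that Conditions (1) and (2) hold by construction. The two example maps (one derived from a 5-leaf caterpillar tree with colors "A" and "B", one from a pentagon-style 5-5 partition) and all names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def partition_sizes(delta, S):
    """Sorted multiplicities of the delta-values on the 3-element subsets of S."""
    counts = Counter(delta[frozenset(T)] for T in combinations(S, 3))
    return sorted(counts.values())

def satisfies_condition_3(delta, X):
    """Every 4-set is either constant or 2-2 partitioned."""
    return all(partition_sizes(delta, S) in ([4], [2, 2])
               for S in combinations(X, 4))

def satisfies_condition_4(delta, X):
    """No 5-set is 5-5 partitioned."""
    return all(partition_sizes(delta, S) != [5, 5]
               for S in combinations(X, 5))

def is_fully_resolved(delta, X):
    """Condition (*): each constant 4-set extends to a 4-6 partitioned 5-set."""
    for S in combinations(X, 4):
        if partition_sizes(delta, S) == [4]:
            if not any(partition_sizes(delta, S + (e,)) == [4, 6]
                       for e in X if e not in S):
                return False
    return True

# Example 1: map of a caterpillar tree; triples containing {x1, x2} or
# {z1, z2} have a median colored "A", all other triples one colored "B".
X1 = ("x1", "x2", "y", "z1", "z2")
TREE_DELTA = {
    frozenset(T): "A" if {"x1", "x2"} <= set(T) or {"z1", "z2"} <= set(T) else "B"
    for T in combinations(X1, 3)
}

# Example 2: a 5-5 partition; the value of a triple depends on whether the
# two omitted taxa are neighbors on the cycle 0-1-2-3-4-0.
X2 = (0, 1, 2, 3, 4)
def _pentagon_value(T):
    a, b = sorted(set(X2) - set(T))
    return "cycle" if b - a in (1, 4) else "chord"
PENTAGON_DELTA = {frozenset(T): _pentagon_value(T) for T in combinations(X2, 3)}

print(satisfies_condition_4(TREE_DELTA, X1), satisfies_condition_4(PENTAGON_DELTA, X2))
```

The pentagon example passes the 4-point test but fails the 5-point test, which illustrates why Condition (4) is stated separately.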

3 Reconstructing a symbolically dated phylogenetic tree

The aim of this section is to prove Theorem 4.

3.1 Quartet systems

We will use quartet systems to prove Theorem 4. In 1981, Colonius and Schulze [3] found that, for a quartet system Q on a finite taxa set X, there is a phylogenetic tree T on X such that Q = Q_T if and only if certain conditions on subsets of X with up to five elements hold. The following theorem (Theorem 3.7 in [4]) states their result.

A quartet system Q is thin if for every 4-subset {a, b, c, d} ⊆ X, at most one of the three quartets ab|cd, ac|bd and ad|bc is contained in Q. It is transitive if for any 5 distinct elements a, b, c, d, e ∈ X, the quartet ab|cd is in Q whenever both of the quartets ab|ce and ab|de are contained in Q. It is saturated if for any five distinct elements a, b, c, d, e ∈ X with ab|cd ∈ Q, at least one of the two quartets ae|cd and ab|ce is also in Q.

Theorem 6. A quartet system Q ⊆ Q(X) is of the form Q = Q_T for some phylogenetic tree T on X if and only if Q is thin, transitive and saturated.

We can encode a phylogenetic tree on X in terms of a quartet system by taking all the quartets displayed by the tree, as two phylogenetic trees on X are isomorphic if and only if the associated quartet systems coincide [4]. Hence, a quartet system that satisfies the conditions of Theorem 6 uniquely determines a phylogenetic tree.
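The three properties in Theorem 6 are finite conditions and can be checked by enumeration. The sketch below is one straightforward way to do so; the representation of a quartet as a set of two pairs, the function names, and the example system (the five quartets displayed by a caterpillar tree on x1, x2, y, z1, z2) are assumptions made for illustration.

```python
from itertools import combinations, permutations

def quartet(a, b, c, d):
    """Represent ab|cd as an unordered pair of unordered pairs."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})

def is_thin(Q, X):
    """At most one of ab|cd, ac|bd, ad|bc is in Q for every 4-subset."""
    return all(
        sum(s in Q for s in (quartet(a, b, c, d), quartet(a, c, b, d), quartet(a, d, b, c))) <= 1
        for a, b, c, d in combinations(X, 4)
    )

def is_transitive(Q, X):
    """ab|ce and ab|de in Q imply ab|cd in Q."""
    return all(
        quartet(a, b, c, d) in Q
        for a, b, c, d, e in permutations(X, 5)
        if quartet(a, b, c, e) in Q and quartet(a, b, d, e) in Q
    )

def is_saturated(Q, X):
    """ab|cd in Q implies that ae|cd or ab|ce is in Q."""
    return all(
        quartet(a, e, c, d) in Q or quartet(a, b, c, e) in Q
        for a, b, c, d, e in permutations(X, 5)
        if quartet(a, b, c, d) in Q
    )

# Illustrative example: quartets displayed by a caterpillar tree in which
# x1, x2 form one cherry, z1, z2 the other, and y sits in the middle.
X = ["x1", "x2", "y", "z1", "z2"]
Q_TREE = {
    quartet("x1", "x2", "y", "z1"),
    quartet("x1", "x2", "y", "z2"),
    quartet("x1", "x2", "z1", "z2"),
    quartet("x1", "y", "z1", "z2"),
    quartet("x2", "y", "z1", "z2"),
}

print(is_thin(Q_TREE, X), is_transitive(Q_TREE, X), is_saturated(Q_TREE, X))
```

Iterating over all ordered 5-tuples is wasteful but keeps the code close to the wording of the definitions, which quantify over every labeling of the five elements.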

3.2 Graph representations of a ternary map

Suppose we have a symmetric map δ : X × X × X → M^⊙. Then we can represent the restriction of δ to all 3-element subsets of any 5-element subset {x, y, z, u, v} of X by an edge-colored complete graph on the 5 vertices x, y, z, u, v. Two edges ab and cd of this graph have the same color if and only if the value of δ for {x, y, z, u, v}\{a, b} is the same as for {x, y, z, u, v}\{c, d}. It follows from Condition (3) in the definition of a symbolic ternary metric that, for any vertex of the graph, either 2 incident edges have one color and the other 2 edges have another color, or all 4 incident edges have the same color. Up to isomorphism, there are exactly five such graph representations.

Lemma 1. Let the edges of a K5 be colored such that for each vertex, the 4 incident edges are either colored by the same color, or 2 of them colored by one color and the other 2 by another color. Then there are exactly 5 non-isomorphic colorings, and they are depicted in Figure 1.

Figure 1: The 5 non-isomorphic colorings of K5 for which every color class induces an Eulerian graph (panels: Type 1, Type 2, Type 3, Type 4, Type 5). Note that in Type 1 there are three kinds of edges: solid, dotted, and dashed.

Proof. It follows from the condition on the coloring that every color class induces an Eulerian subgraph (a graph where all vertices have even degree) of K5. Therefore, ignoring isolated vertices, every such induced subgraph either is a cycle or it contains a vertex of degree four. Since there are only ten edges, the only way to have three color classes is two triangles and one 4-cycle. In that case each of the triangles must contain two non-adjacent vertices of the 4-cycle and the vertex that is not in the 4-cycle, thus we get a coloring isomorphic to Type 1 in Figure 1. If there are exactly two color classes, then one of them has to be a cycle and the other one its complement. This yields Types 2, 3, and 4, if the length of the cycle is 5, 4, and 3, respectively. Finally, if there is only one color, we get Type 5.
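The count in Lemma 1 can also be confirmed by exhaustive search. The following Python sketch enumerates all partitions of the ten edges of K5 into color classes, keeps those in which every class has even degree at every vertex (equivalent to the 4 or 2+2 condition of the lemma), and counts them up to relabeling of vertices and colors. The function names and the enumeration strategy are illustrative choices, not part of the proof.

```python
from itertools import combinations, permutations

VERTICES = range(5)
EDGES = list(combinations(VERTICES, 2))  # the 10 edges of K5

def set_partitions(n):
    """Yield all set partitions of range(n) as restricted growth strings."""
    def extend(prefix, blocks):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(blocks + 1):  # reuse a block or open a new one
            prefix.append(label)
            yield from extend(prefix, max(blocks, label + 1))
            prefix.pop()
    yield from extend([], 0)

def is_admissible(labels):
    """Every color class must have even degree at every vertex."""
    degree = {}
    for (u, v), color in zip(EDGES, labels):
        degree[u, color] = degree.get((u, color), 0) + 1
        degree[v, color] = degree.get((v, color), 0) + 1
    return all(deg % 2 == 0 for deg in degree.values())

def as_partition(labels):
    """Forget color names: a coloring is a set of edge classes."""
    classes = {}
    for edge, color in zip(EDGES, labels):
        classes.setdefault(color, set()).add(frozenset(edge))
    return frozenset(frozenset(cls) for cls in classes.values())

def relabel(partition, perm):
    """Apply a vertex permutation to every edge of every class."""
    return frozenset(
        frozenset(frozenset(perm[v] for v in edge) for edge in cls)
        for cls in partition
    )

def count_colorings_up_to_isomorphism():
    seen, count = set(), 0
    for labels in set_partitions(len(EDGES)):
        if not is_admissible(labels):
            continue
        partition = as_partition(labels)
        if partition in seen:
            continue
        count += 1  # first member of a new isomorphism class
        seen.update(relabel(partition, perm) for perm in permutations(VERTICES))
    return count

NUMBER_OF_TYPES = count_colorings_up_to_isomorphism()
print(NUMBER_OF_TYPES)  # 5, matching Lemma 1
```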

Note that the vertices are not labeled and it does not matter which color we are using.

We will prove Theorem 4 by obtaining a quartet system from any symbolic ternary metric. More precisely, we say that the symbolic ternary metric δ on X generates the quartet xy|zu if either

δ(x, z, u) = δ(y, z, u) ≠ δ(x, y, z) = δ(x, y, u),

or |{δ(x, y, z), δ(x, y, u), δ(x, z, u), δ(y, z, u)}| = 1 and there is e ∈ X such that

δ(x, y, e) = δ(x, y, z) = δ(x, y, u) = δ(x, z, u) = δ(z, u, e) = δ(y, z, u) ≠ δ(x, u, e) = δ(x, z, e) = δ(y, z, e) = δ(y, u, e).

In the latter case, we say that e resolves x, y, z, u. Note that the 3-sets obtained by adding e to the pairs of the generated quartet both have the same δ-value as the subsets of {x, y, z, u}. The following lemma will show that the set of all quartets generated by a symbolic ternary metric is thin.

Lemma 2. Let δ : X × X × X → M^⊙ be a symbolic ternary metric and let x, y, z, u ∈ X be four different taxa with |{δ(x, y, z), δ(x, y, u), δ(x, z, u), δ(y, z, u)}| = 1. Let e, e′ ∈ X − {x, y, z, u} such that {x, y, z, u, e} and {x, y, z, u, e′} are both 4-6-partitioned, and let

δ(x, y, e) = δ(x, y, z) = δ(x, y, u) = δ(x, z, u) = δ(z, u, e) = δ(y, z, u) ≠ δ(x, u, e) = δ(x, z, e) = δ(y, z, e) = δ(y, u, e).

Then we also have

δ(x, y, e′) = δ(x, y, z) = δ(x, y, u) = δ(x, z, u) = δ(z, u, e′) = δ(y, z, u) ≠ δ(x, u, e′) = δ(x, z, e′) = δ(y, z, e′) = δ(y, u, e′).

Proof. We already know that |{δ(x, y, z), δ(x, y, u), δ(x, z, u), δ(y, z, u)}| = 1 and that {x, y, z, u, e′} is 4-6-partitioned. So there are three possible cases for the values of δ on {x, y, z, u, e′}.

(1) δ(x, y, z) = δ(x, y, u) = δ(x, z, u) = δ(y, z, u) and the remaining 6 values are equal. Then {x, y, z, e′} is 1-3 partitioned instead of 2-2 partitioned, which contradicts the definition of a symbolic ternary metric, thus this case cannot happen.

(2) δ(x, y, z) = δ(x, y, u) = δ(x, z, u) = δ(y, z, u) = δ(e′, a, b) = δ(e′, a, c) where {a, b, c} ∈ $\binom{\{x, y, z, u\}}{3}$. There are 12 different cases in total. Since x, y, z, u are symmetric, w.l.o.g. we assume δ(x, y, z) = δ(x, y, u) = δ(x, z, u) = δ(y, z, u) = δ(e′, x, y) = δ(e′, x, z) and the remaining values are equal. Then {e′, x, y, z} is 1-3 partitioned instead of 2-2 partitioned, which contradicts the definition of a symbolic ternary metric, thus this case cannot happen.

(3) δ(x, y, z) = δ(x, y, u) = δ(x, z, u) = δ(y, z, u) = δ(e′, a, b) = δ(e′, c, d) where {a, b, c, d} = {x, y, z, u}. There are 3 different cases in total. Suppose the statement of the lemma is wrong. Then, because x, y, z, u are symmetric, w.l.o.g. we can assume that

δ(x, z, e′) = δ(x, y, z) = δ(x, z, u) = δ(y, u, e′) = δ(x, y, u) = δ(y, z, u) ≠ δ(x, u, e′) = δ(x, y, e′) = δ(y, z, e′) = δ(z, u, e′).

Case (3a): δ(x, u, e) ≠ δ(x, u, e′). We assume that δ(x, u, e) is dashed, δ(x, u, e′) is dotted, and δ(x, y, z) is solid. Since δ is a symbolic ternary metric, by Lemma 1, the graph representation of {y, z, u, e, e′} has to be Type 1, so the color classes are one 4-cycle and two 3-cycles. The values of δ for the sets that contain at most one of e and e′ are shown in Figure 2. There is a path of length 3 that is colored with δ(x, y, z) (solid), and there are paths of length 2 colored with δ(x, u, e) (dashed) and δ(x, u, e′) (dotted), respectively. It follows that we can only get Type 1 by coloring the edges connecting the end vertices of each of those paths with the same color as the edges on the path. We get δ(u, e, e′) = δ(x, y, z) (solid), δ(z, e, e′) = δ(x, u, e′) (dotted), and δ(y, e, e′) = δ(x, u, e) (dashed). Now doing the same analysis for {x, y, z, e, e′} yields δ(z, e, e′) = δ(x, u, e), in contradiction to δ(z, e, e′) = δ(x, u, e′).

Figure 2: The partial coloring of K5 (on the vertices y, z, u, e, e′) as described in Case (3a).

Case (3b): δ(x, u, e) = δ(x, u, e′). The graph representation of {y, z, u, e, e′} can be obtained from Figure 2 by identifying the colors dashed and dotted. It contains a path of length 3 of edges colored with δ(x, y, z) and a path of length 4 of edges colored with δ(x, u, e). Since a path is not an Eulerian graph, both colors must be used for at least one of the remaining three edges, thus Type 1 is not possible. Since δ is a symbolic ternary metric, {y, z, u, e, e′} is not 5-5-partitioned, and since there are only two colors with at least 4 edges of one color and 5 edges of the other color, Lemma 1 implies that the corresponding graph representation must be Type 3 and therefore {y, z, u, e, e′} is 4-6-partitioned. We get δ(u, e, e′) = δ(x, y, z) and δ(y, e, e′) = δ(z, e, e′) = δ(x, u, e). Now we consider the graph representation of {x, z, u, e, e′}, and we observe that the edges colored with δ(x, y, z) contain a path of length 4, and the edges colored with δ(x, u, e) contain a 5-cycle. Hence, {x, z, u, e, e′} is 5-5-partitioned, in contradiction to Condition (4).
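The rule by which a symbolic ternary metric generates quartets translates into a short procedure that tries, for every 4-set, each of its three splits into two pairs. The Python sketch below assumes δ is given as a dictionary on 3-element frozensets; the example map (that of a caterpillar tree on x1, x2, y, z1, z2 with colors "A" and "B") and all names are assumptions for illustration.

```python
from itertools import combinations

def quartet(a, b, c, d):
    """Represent ab|cd as an unordered pair of unordered pairs."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})

def resolves(d, x, y, z, u, e):
    """True if e resolves x, y, z, u in favor of xy|zu (delta constant on xyzu)."""
    c = d(x, y, z)
    mixed = {d(x, u, e), d(x, z, e), d(y, z, e), d(y, u, e)}
    return d(x, y, e) == c == d(z, u, e) and len(mixed) == 1 and c not in mixed

def generated_quartets(delta, X):
    """All quartets generated by delta according to the two rules above."""
    d = lambda *taxa: delta[frozenset(taxa)]
    Q = set()
    for S in combinations(X, 4):
        x = S[0]
        for y in S[1:]:  # pairing x with each other taxon gives the three splits
            z, u = (v for v in S if v not in (x, y))
            values = {d(x, y, z), d(x, y, u), d(x, z, u), d(y, z, u)}
            if d(x, z, u) == d(y, z, u) != d(x, y, z) == d(x, y, u):
                Q.add(quartet(x, y, z, u))  # first rule: 2-2 partitioned 4-set
            elif len(values) == 1 and any(
                resolves(d, x, y, z, u, e) for e in X if e not in S
            ):
                Q.add(quartet(x, y, z, u))  # second rule: some e resolves the 4-set
    return Q

# Illustrative example: triples containing {x1, x2} or {z1, z2} get "A",
# all other triples get "B".
X = ("x1", "x2", "y", "z1", "z2")
DELTA = {
    frozenset(T): "A" if {"x1", "x2"} <= set(T) or {"z1", "z2"} <= set(T) else "B"
    for T in combinations(X, 3)
}
Q = generated_quartets(DELTA, X)
print(len(Q))
```

In this example δ is constant on the 3-subsets of {x1, x2, z1, z2}, so the quartet x1x2|z1z2 is only found through the second rule, with y as the resolving taxon.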

Proof of Theorem 4. By the definition of the median, the ternary map δ_(T;t) associated with a discriminating symbolically dated phylogenetic tree (T, t) on X satisfies Conditions (1) and (2). For any distinct leaves x, y, z, u, the smallest subtree of T connecting those four leaves has at most two vertices of degree larger than two. If there are two such vertices, then each of them has degree 3 and is the median of two 3-sets in {x, y, z, u}. Therefore, δ_(T;t) satisfies Condition (3). For any 5 distinct leaves, the smallest subtree of T connecting them either has three vertices of degree 3, or one vertex of degree 3 and one of degree 4, or one vertex of degree 5, while all other vertices have degree 1 or 2. The first case is depicted in Figure 3. There v1 is the median for the 3-sets that contain x1 and x2, v3 is the median for the 3-sets that contain z1 and z2, and v2 is the median for the remaining four 3-sets. Hence, either {x1, x2, y, z1, z2} is 4-6-partitioned (if t(v1) = t(v3)), or there are three different values of δ_(T;t) within those five taxa. For the other two cases, the set of five taxa is either 3-7-partitioned or δ_(T;t) is constant on all its subsets with 3 taxa. Hence, no subset of X of cardinality five is 5-5-partitioned by δ_(T;t), thus δ_(T;t) satisfies Condition (4).

On the other hand, let δ be a symbolic ternary metric on X. By Lemma 1, taking any 5-element subset of X, the possible graph representations of the delta system satisfying (1), (2), and (3) are shown in Figure 1. Except for Type 2, all other types satisfy (4).

For the first type, the delta system is δ(y, z, u) = δ(x, y, z) = δ(w, y, z) ≠ δ(w, x, z) = δ(w, x, u) = δ(w, x, y) ≠ δ(x, z, u) = δ(w, z, u) = δ(x, y, u) = δ(w, y, u) ≠ δ(y, z, u). The corresponding quartet system is {xw|yu, xw|zu, xw|yz, uw|yz, xu|yz}.

For the third type, the delta system is δ(y, z, u) = δ(x, z, u) = δ(w, z, u) = δ(x, y, w) = δ(y, u, w) = δ(y, z, w) ≠ δ(x, y, z) = δ(w, x, z) = δ(x, y, u) = δ(w, x, u). The corresponding quartet system is {wy|xu, wy|xz, xy|zu, wy|zu, wx|zu}.

For the fourth type, the delta system is δ(w, x, z) = δ(w, x, u) = δ(w, x, y) ≠ δ(y, z, u) = δ(x, y, z) = δ(x, z, u) = δ(w, z, u) = δ(w, y, z) = δ(x, y, u) = δ(w, y, u). The corresponding quartet system is {xw|yu, xw|zu, xw|yz}.

For the fifth type, the delta system is δ(y, z, u) = δ(x, y, z) = δ(w, y, z) = δ(w, x, z) = δ(w, x, u) = δ(w, x, y) = δ(x, z, u) = δ(w, z, u) = δ(x, y, u) = δ(w, y, u). The corresponding quartet system is ∅.

All these quartet systems are thin, transitive, and saturated. Indeed, the delta systems of Types 1 and 3 generate all quartets displayed by a binary tree, Type 4 generates all quartets displayed by a tree with exactly one interior edge, and Type 5 corresponds to the star tree with 5 leaves.

Now we take the union of all quartets generated by δ. The resulting quartet system is thin in view of Lemma 2, it is easy to see that it is also transitive and saturated, and by Theorem 6, every delta system satisfying Conditions (1), (2), (3) and (4) uniquely determines a phylogenetic tree T on X.

It only remains to show that two 3-element subsets of X that have the same median in T must be mapped to the same value of δ, since then we can define t to be the dating map with t(v) = δ(x, y, z) for every interior vertex v of T and for all 3-sets {x, y, z} whose median is v. It suffices to consider two sets which intersect in two taxa, as the general case follows by exchanging one taxon up to three times. We assume that v is the median of both {w, x, y} and {w, x, z}. If δ(w, x, y) ≠ δ(w, x, z), then by the definition of a symbolic ternary metric, {x, y, w, z} is 2-2-partitioned by δ, thus δ generates one of the quartets wy|xz and wz|xy, and T must display that quartet. However, in both cases {w, x, y} and {w, x, z} do not have the same median in T.

We also claim that the associated t is discriminating. Suppose otherwise; then there is an edge uv in T such that t(u) = t(v). Then for any four leaves x1, x2, y1, y2 such that med(x1, x2, y1) = med(x1, x2, y2) = u and med(y1, y2, x1) = med(y1, y2, x2) = v, the quartet x1x2|y1y2 is displayed by T. Further we have δ(x1, x2, y1) = δ(x1, x2, y2) = δ(y1, y2, x1) = δ(y1, y2, x2). Because the quartet x1x2|y1y2 is generated by δ, there is a leaf e which resolves x1x2y1y2. Since uv is an edge in T, there exist i and j such that med(xi, yj, e) ∈ {u, v}, so we have δ(xi, yj, e) = δ(x1, x2, y1), which means e cannot resolve x1x2y1y2, a contradiction. Hence a thin, transitive and saturated quartet system uniquely determines a discriminating symbolically dated tree.

Figure 3: The leaves (x1, x2, y, z1, z2) and median vertices (v1, v2, v3) for a 5-taxa binary tree.

The set of quartets generated by Type 2 does not satisfy the condition of being saturated from Theorem 6. Without loss of generality, label the vertices by x, y, z, u, w as in Figure 4. Then the quartet system is {yw|zu, xu|yz, xz|uw, xy|zw, xw|yu}. In order to be saturated, the presence of xu|yz would require that we have xw|yz or xu|yw, but we have xy|zw and xw|yu instead.
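The failure of saturation for the Type 2 quartet system can be confirmed mechanically. The following Python sketch lists every labeling a, b, c, d, e for which ab|cd is present but neither ae|cd nor ab|ce is; the parsing helper, the quartet representation and the function names are assumptions for illustration.

```python
from itertools import permutations

def parse(text):
    """Turn a string such as 'xu|yz' into an unordered pair of unordered pairs."""
    left, right = text.split("|")
    return frozenset({frozenset(left), frozenset(right)})

def quartet(a, b, c, d):
    return frozenset({frozenset({a, b}), frozenset({c, d})})

def saturation_violations(Q, X):
    """All (a, b, c, d, e) with ab|cd in Q but neither ae|cd nor ab|ce in Q."""
    return [
        (a, b, c, d, e)
        for a, b, c, d, e in permutations(X, 5)
        if quartet(a, b, c, d) in Q
        and quartet(a, e, c, d) not in Q
        and quartet(a, b, c, e) not in Q
    ]

X = "xyzuw"
TYPE_2 = {parse(s) for s in ("yw|zu", "xu|yz", "xz|uw", "xy|zw", "xw|yu")}

violations = saturation_violations(TYPE_2, X)
print(("x", "u", "y", "z", "w") in violations)
```

The tuple checked in the last line is the instance discussed in the text: xu|yz is present, while xw|yz and xu|yw are both absent.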

4 Reconstructing a binary phylogenetic tree

The aim of this section is to prove Theorem 5. A quartet system Q on X is complete if

|Q ∩ {ab|cd, ac|bd, ad|bc}| = 1

holds for all {a, b, c, d} ∈ $\binom{X}{4}$. Using the easy observation that a phylogenetic tree is binary if and only if it displays a quartet for every 4-set, the following result is a direct consequence of Theorem 6.

Corollary 1. A quartet system Q ⊆ Q(X) is of the form Q = Q_T for some binary phylogenetic tree T on X if and only if Q is complete, transitive, and saturated.

Condition (*) ensures that a ternary metric δ generates a quartet for every set of four taxa, even if δ is constant on all of its 3-taxa subsets. In view of Lemma 2 we also have that δ cannot generate two different quartets for the same 4-set. Hence, we have the following corollary.

Figure 4: The graph representation (on the vertices x, y, z, u, w) of a ternary map satisfying Conditions (1), (2), and (3), but not (4).

Corollary 2. A symbolic ternary metric δ : X × X × X → M^⊙ that satisfies Condition (*) generates a complete quartet system on X.

Now we prove Theorem 5.

Proof. Let (T, t) be a symbolically dated binary phylogenetic tree. By Theorem 4, δ_(T;t) is a symbolic ternary metric. Since T is binary, it displays a quartet for every 4-taxa subset {x, y, z, u} of X. Assume that T displays xu|yz, thus med(x, y, z) ≠ med(x, y, u). If |{δ_(T;t)(x, y, z), δ_(T;t)(x, y, u), δ_(T;t)(x, z, u), δ_(T;t)(y, z, u)}| = 1, then there is at least one vertex v on the path in T connecting med(x, y, z) and med(x, y, u) with t(v) ≠ t(med(x, y, z)), as t is discriminating. Hence, there is a leaf e ∈ X such that v = med(x, y, e). It follows that the set {x, y, z, u, e} is 4-6 partitioned by δ_(T;t), thus δ_(T;t) satisfies Condition (*).

On the other hand, if δ is a symbolic ternary metric on X and satisfies (*), then by Corollary 2, it corresponds to a unique complete quartet system, thus it encodes a binary phylogenetic tree T in view of Corollary 1. As in the last paragraph of the proof of Theorem 4, we can define a dating map t by t(v) = δ(x, y, z) for every interior vertex v of T and for all 3-sets {x, y, z} whose median is v. Hence, (T, t) is a symbolically dated binary phylogenetic tree.

5 The recognition of pseudo-cherries

In Theorem 4, we have established a 1-to-1 correspondence between symbolically dated phylogenetic trees and symbolic ternary metrics on X, and a bijection is given by mapping (T, t) to d_(T;t). To get the inverse of this map, we can first compute the set of all quartets generated by a symbolic ternary metric, and then apply an algorithm that reconstructs a phylogenetic tree from the collection of all its displayed quartets. Finally, the dating map is defined as in our proof of Theorem 4. This approach would correspond to first extracting rooted triples from a symbolic ultrametric and then reconstructing the rooted tree (see Section 7.6 of [11]).

However, a more direct way to reconstruct the corresponding tree from a symbolic ultrametric was presented in [8]. It is based on identifying maximal sets of at least two taxa that are adjacent to the same interior vertex, so-called pseudo-cherries. These can iteratively be identified into a single new taxon, thereby reconstructing the corresponding tree in a bottom-up fashion. The main advantage of such an algorithm is that it might be used to heuristically construct a tree, even if the input is not a symbolic ternary metric. In terms of running time, using pseudo-cherries is also slightly better than having to compute all O(n^4) quartets, but the O(n^3) input size limits the speed of every algorithm that deals with ternary maps.

Here we only show how to find the pseudo-cherries of T from a symbolic ternary metric d_(T;t). An algorithm to reconstruct T can be designed exactly as in [8] and is therefore omitted. We point out that it is not necessary to check Condition (4) of a symbolic ternary metric, as a violation would make the algorithm recognize that the ternary map does not correspond to a tree.

Let δ : X × X × X → M^⊙ be an arbitrary symbolic ternary map satisfying Conditions (1), (2), and (3). For x, y ∈ X and m ∈ M^⊙, we say that x and y are m-equivalent if there is z ∈ X such that δ(x, y, z) = m, and for all u, v ∈ X − {x, y}, δ(x, u, v) = m if and only if δ(y, u, v) = m.

Lemma 3. If x and y are m-equivalent and y and z are m′-equivalent, then m = m′ and x and z are m-equivalent.

Proof. Assume δ(x, y, z) ≠ m. Then let u ∈ X with δ(x, y, u) = m. Since δ is not constant on {x, y, z, u}, this 4-set must be 2-2-partitioned, thus exactly one of δ(x, u, z) = m and δ(y, u, z) = m must hold, in contradiction to x and y being m-equivalent. Hence, we have δ(x, y, z) = m, and by symmetry we also have δ(x, y, z) = m′, thus m = m′.

In order to verify that x and z must be m-equivalent, recall that we have already shown δ(x, y, z) = m. For w, w′ ∈ X − {x, y, z}, we have δ(x, w, w′) = m if and only if δ(y, w, w′) = m, since x and y are m-equivalent, and we have δ(y, w, w′) = m if and only if δ(z, w, w′) = m, since y and z are m-equivalent. Finally, we have δ(x, y, w) = m if and only if δ(x, z, w) = m if and only if δ(y, z, w) = m. Hence x and z are m-equivalent.


We say x, y ∈ X are δ-equivalent, denoted by x ∼δ y, if there exists m ∈ M^⊙ such that x and y are m-equivalent.

Lemma 4. The relation of being δ-equivalent is an equivalence relation.

Proof. For any x ∈ X, since δ(x, x, y) = ⊙ for any y ∈ X, by definition x and x are ⊙-equivalent, hence x and x are δ-equivalent. Hence ∼δ is reflexive. For any x ∼δ y, we know that there exists an m ∈ M^⊙ such that x and y are m-equivalent. Since δ is symmetric, by the definition of m-equivalence, y and x are also m-equivalent, thus y ∼δ x. Hence, ∼δ is symmetric. To prove the transitivity of ∼δ, assume x ∼δ y and y ∼δ z; by Lemma 3 we know that x ∼δ z. Therefore, δ-equivalence is an equivalence relation.

Suppose that T is a phylogenetic tree on X. Let C ⊆ X be a subset of X with |C| ≥ 2. We call C a pseudo-cherry of T if there is an interior vertex v of T such that C is the set of all leaves adjacent to v.

Theorem 7. If (T, t) is a symbolically dated phylogenetic tree, then a non-empty subset C of X is a non-trivial equivalence class of ∼δ(T,t) if and only if C is a pseudo-cherry of T.

Proof. For ease of notation, we let δ = δ(T,t). Since t is discriminating, the definition of a pseudo-cherry immediately implies that any pseudo-cherry of T must be contained in a non-trivial equivalence class of ∼δ. Conversely, if a non-trivial equivalence class C of ∼δ is not a pseudo-cherry, then there are x1, x2 ∈ C such that the path in T connecting x1 and x2 has length at least 3, and since t is discriminating, it has at least 2 interior vertices labeled by two different elements of M. Suppose that all elements of C are m-equivalent, and that v is an interior vertex on the path from x1 to x2 such that t(v) = m′ and m′ ≠ m. Further, let y ∈ X such that v = med(x1, x2, y). Since x1 and x2 are m-equivalent, there is z ∈ X such that δ(x1, x2, z) = m. Then the median u of x1, x2, z is also on the path from x1 to x2, and we assume without loss of generality that u is on the path from x1 to v. It follows that δ(x1, y, z) = m but δ(x2, y, z) = m′, in contradiction to x1 and x2 being m-equivalent.
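Theorem 7 can be illustrated on a small example. The sketch below is only an illustration under stated assumptions: the tree, its coloring, and all function names (`path`, `median`, `delta_equivalent`, `equivalence_classes`, `pseudo_cherries`) are hypothetical. It computes the ternary map of a colored tree via medians, groups the taxa into δ-equivalence classes by brute force, and compares the non-trivial classes with the pseudo-cherries read off the tree.

```python
from itertools import combinations

# Hypothetical tree: interior vertices p, q, r; leaves a, ..., f.
EDGES = [("p", "a"), ("p", "b"), ("p", "c"), ("p", "q"),
         ("q", "d"), ("q", "r"), ("r", "e"), ("r", "f")]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)
# Adjacent interior vertices receive different colors.
COLOR = {"p": "red", "q": "blue", "r": "red"}
LEAVES = sorted(v for v in ADJ if len(ADJ[v]) == 1)


def path(adj, s, t):
    """Vertices on the unique s-t path of a tree (adjacency dict)."""
    stack, parent = [s], {s: None}
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                stack.append(w)
    out = [t]
    while out[-1] != s:
        out.append(parent[out[-1]])
    return out


def median(adj, x, y, z):
    """The unique vertex lying on all three pairwise paths."""
    (v,) = (set(path(adj, x, y)) & set(path(adj, x, z))
            & set(path(adj, y, z)))
    return v


def delta(x, y, z):
    """Color of the median of three pairwise distinct leaves."""
    return COLOR[median(ADJ, x, y, z)]


def delta_equivalent(delta, taxa, x, y):
    """True if distinct taxa x, y are m-equivalent for some value m."""
    others = [w for w in taxa if w not in (x, y)]
    # Only values attained by some delta(x, y, z) can serve as m.
    return any(all((delta(x, u, v) == m) == (delta(y, u, v) == m)
                   for u, v in combinations(others, 2))
               for m in {delta(x, y, z) for z in others})


def equivalence_classes(delta, taxa):
    """Greedy grouping; relies on the relation being an equivalence."""
    classes = []
    for x in taxa:
        for cls in classes:
            if delta_equivalent(delta, taxa, cls[0], x):
                cls.append(x)
                break
        else:
            classes.append([x])
    return [frozenset(c) for c in classes]


def pseudo_cherries(adj):
    """Sets of at least two leaves adjacent to a common interior vertex."""
    leaves = {v for v in adj if len(adj[v]) == 1}
    return {frozenset(adj[v] & leaves) for v in adj
            if v not in leaves and len(adj[v] & leaves) >= 2}


nontrivial = {c for c in equivalence_classes(delta, LEAVES) if len(c) >= 2}
print(nontrivial == pseudo_cherries(ADJ))
```

For this tree, both sets consist of {a, b, c} and {e, f}, while d forms a trivial class. The grouping is deliberately naive and makes no attempt to match the running time discussed above.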

6 Discussions and open questions

The proofs of our main results heavily rely on extracting the corresponding quartet set from a symbolic ternary metric and then checking that our Conditions (3) and (4) guarantee the quartet system to be thin, transitive, and saturated, and adding (*) makes the quartet system complete. It might seem that (3) corresponds to thin and transitive, (4) to saturated, and (*) to complete. However, this is not true, and removing (4) from Theorem 5 does not necessarily yield a transitive complete quartet system. While for five taxa, a 5-5-partition yields the only non-saturated complete transitive quartet system, Lemma 2 does not hold without Condition (4). Indeed, the ternary map that is visualized in Figure 5 satisfies Conditions (1), (2), (3), and (*), but it generates two quartets on each of {a1, a2, b1, b2} and {a1, a2, c1, c2}. It can be shown by checking the remaining 5-sets in Case 2 of our proof of Lemma 2 that every ternary map on 6 taxa satisfying Conditions (1), (2), (3), (*) that does not yield a thin quartet system is isomorphic to this example. This raises the question whether ternary maps satisfying these four conditions can be completely characterized. The hope is to observe something similar to the Clebsch trees that were observed by Jan Weyer-Menkhoff [13]. As a result, a phylogenetic tree with all interior vertices of degree 3, 5, or 6 can be reconstructed from every transitive complete quartet set.

Figure 5: The 5-taxa trees respectively graph representations generated by a ternary map satisfying Conditions (1), (2), (3), (*) but not (4).

Another direction to follow up this work would be to consider more general graphs than trees. A median graph is a graph for which every three vertices have a unique median. Given a vertex-colored median graph and a subset X of its vertex set, we can get a symmetric ternary map on X × X × X by associating the color of the median to every 3-subset of X. It would be interesting to see whether this map can be used to reconstruct the underlying graph for other classes of median graphs than phylogenetic trees. In phylogenetics, median graphs are used to represent non-treelike data. Since the interior vertices of those so-called splits graphs do not, in general, correspond to any ancestor of some of the taxa, reconstructing a collection of splits from the ternary map induced by a vertex-colored splits graph is probably limited to split systems that are almost compatible with a tree.

It is one of the main observations of [8] that the 4-point condition for symbolic ultrametrics can be formulated in terms of cographs, which are graphs that do not contain an induced path of length 3. For the special case that δ : X × X → M with |M| = 2 and m ∈ M, consider the graph with vertex set X where two vertices x, y are adjacent if and only if δ(x, y) = m. Then deciding whether δ is a symbolic ultrametric can be reduced to checking whether this graph (as well as its complement) is a cograph. This is useful for analyzing real data, which will usually not provide a perfect symbolic ultrametric, so some approximation is required. For ternary maps and unrooted trees, the 5-taxa case looks promising, as Condition (4) translates to a forbidden graph representation that splits the edges of a K5 into two 5-cycles, thus we have a self-complementary forbidden induced subgraph. However, for more taxa, the 3-sets that are mapped to the same value of a ternary map δ define a 3-uniform hypergraph on X, and formulating Condition (4) in terms of this hypergraph does not seem to be promising. In addition, even if there are only two values of δ for 3-sets, Condition (3) does not become obsolete. We leave it as an open question whether an alternative characterization of symbolic ternary metrics exists that makes it easier to solve the corresponding approximation problem.
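The cograph test mentioned above for two-valued maps on pairs can be sketched as a brute-force search for an induced path on four vertices. This is an illustrative sketch under assumptions, not the method of [8]: the function names and the two example maps `good` and `bad` are hypothetical, and the search takes O(n^4) time, whereas faster cograph recognition algorithms exist.

```python
from itertools import combinations, permutations


def has_induced_p4(vertices, adjacent):
    """Brute-force search for an induced path a-b-c-d on four vertices."""
    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            if (adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
                    and not adjacent(a, c) and not adjacent(a, d)
                    and not adjacent(b, d)):
                return True
    return False


def color_graph(delta, m):
    """Adjacency test of the graph with x, y adjacent iff delta(x, y) = m."""
    return lambda x, y: delta[frozenset((x, y))] == m


X = "abcd"

# Hypothetical map of a rooted tree ((a, b), (c, d)) whose root is
# labeled "S" and whose two other interior vertices are labeled "D".
good = {frozenset(p): "S" for p in combinations(X, 2)}
good[frozenset("ab")] = good[frozenset("cd")] = "D"

# Hypothetical map whose "S"-graph is the path a-b-c-d.
bad = {frozenset(p): "D" for p in combinations(X, 2)}
for pair in ("ab", "bc", "cd"):
    bad[frozenset(pair)] = "S"

print(has_induced_p4(X, color_graph(good, "S")))
print(has_induced_p4(X, color_graph(bad, "S")))
```

The first map yields a 4-cycle as its "S"-graph, which contains no induced path on four vertices, while the second map is rejected. Since the path on four vertices is self-complementary, testing the graph of either of the two values gives the same answer.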

7 Note added in proof

It was brought to our attention during the refereeing process that, in the context of game theory and using different notation, Vladimir Gurvich published a result equivalent to Theorem 4. Already in 1984, the work was published in Russian [5], as well as an English translation [6]. More recently, Gurvich published another article on the topic [7], providing some details of the proofs that were previously omitted, and the result is included in a survey on graph entropy by Simonyi [12]. We point out that Gurvich's result does not only imply our theorem, but also Theorem 1 by Böcker and Dress [2] and the interpretation of symbolic ultrametrics in terms of cographs by Hellmuth et al. [8]. In addition, Huber et al. [10] independently published a preprint that contains the characterization of symbolic ternary metrics. Their work, as well as Gurvich's papers, reduces the problem to the rooted equivalent and then applies Theorem 1. Therefore, our quartet-based proof is the only one that stays within an unrooted setting, and the constraints on the quartet systems that it provides may be useful for future applications.

Acknowledgements

We thank the anonymous referees for their helpful comments and suggestions. We thank Peter F. Stadler for suggesting to consider general median graphs and Zeying Xu for some useful comments. This work is supported by the NSFC (11671258) and STCSM (17690740800). YL acknowledges support of Postdoctoral Science Foundation of China (No. 2016M601576).

References

[1] Hans-Jürgen Bandelt and Michael Anthony Steel. Symmetric matrices representable by weighted trees over a cancellative abelian monoid. SIAM Journal on Discrete Mathematics, 8(4):517–525, 1995.
[2] Sebastian Böcker and Andreas WM Dress. Recovering symbolically dated, rooted trees from symbolic ultrametrics. Advances in mathematics, 138(1):105–125, 1998.
[3] Hans Colonius and Hans Hennig Schulze. Tree structures for proximity data. British Journal of Mathematical and Statistical Psychology, 34(2):167–180, 1981.
[4] Andreas Dress, Katharina T Huber, and Jacobus Koolen. Basic phylogenetic combinatorics. Cambridge University Press, 2012.
[5] Vladimir Gurvich. Decomposing complete edge-chromatic graphs and hypergraphs (in Russian). Doklady Akad. Nauk SSSR, 279(6):1306–1310, 1984.
[6] Vladimir Gurvich. Some properties and applications of complete edge-chromatic graphs and hypergraphs. In Soviet math. dokl, volume 30, pages 803–807, 1984.
[7] Vladimir Gurvich. Decomposing complete edge-chromatic graphs and hypergraphs. Revisited. Discrete Applied Mathematics, 157(14):3069–3085, 2009.
[8] Marc Hellmuth, Maribel Hernandez-Rosales, Katharina T Huber, Vincent Moulton, Peter F Stadler, and Nicolas Wieseke. Orthology relations, symbolic ultrametrics, and cographs. Journal of mathematical biology, 66(1-2):399–420, 2013.
[9] Marc Hellmuth, Nicolas Wieseke, Marcus Lechner, Hans-Peter Lenhof, Martin Middendorf, and Peter F. Stadler. Phylogenetics from paralogs. Proc. Natl. Acad. Sci. USA, 112:2058–2063, 2015. doi: 10.1073/pnas.1412770112.
[10] Katharina T Huber, Vincent Moulton, and Guillaume E Scholz. Three-way symbolic tree-maps and ultrametrics. arXiv preprint arXiv:1707.08010, 2017.
[11] Charles Semple and Mike A Steel. Phylogenetics, volume 24. Oxford University Press on Demand, 2003.
[12] Gábor Simonyi. Perfect graphs and graph entropy. An updated survey. Perfect graphs, pages 293–328, 2001.
[13] Jan Weyer-Menkhoff. New quartet methods in phylogenetic combinatorics. PhD thesis, Universität Bielefeld, 2003.
