A short(?) introduction to phylogenetic networks
Céline Scornavacca
ISE-M, Équipe Phylogénie & Évolution Moléculaires, Montpellier, France
Rooted species trees ...
... are oriented, connected and acyclic graphs, where terminal nodes are associated to a set of species:
the leaves or taxa represent extant organisms
internal nodes represent hypothetical ancestors
each internal node represents the lowest common ancestor of all taxa below it (clade)
the only node without an ancestor is called the root
[Figure: rooted species tree on Bos, Homo, Pan, Pongo, Macaca and Mus, with a TIME axis]
Gene trees
Gene trees are built by analyzing a gene family, i.e., homologous molecular sequences appearing in the genomes of different organisms.
[Figure: gene trees (G1, ...) on Mouse, Rat, Dog and Bat]
Used, among other things, to estimate species trees.
Gene trees can significantly differ from the species tree for:
methodological reasons
biological reasons
We usually use several gene families...
http://sulab.org/2013/06/sequenced-genomes-per-year/
Reconstruction of phylogenies for multiple datasets
The two main classic approaches:
Supermatrix approach: assembling primary data
Supertree approach: assembling trees
An implicit assumption
The implicit assumption of using trees is that, at a macroevolutionary scale, each (current or extinct) species or gene only descends from one ancestor.
Darwin described evolution as "descent with modification", a phrase that does not necessarily imply a tree representation...
A new approach: building phylogenetic networks
Why do we need them?
Due to reticulate evolutionary phenomena (hybridization, recombination, horizontal gene transfer), the evolution of a set of species sometimes cannot be described using phylogenetic trees.
The network of life
[Figure: the network of life. Doolittle, 1999; Daniel Huson, 2010]
Three different paradigms
We (want to) see only the tree
It is a big mess, no chance to retrieve the past
There is an underlying tree structure, with some reticulate events
Doolittle, 1999
An example - a split network
J. Wägele and C. Mayer. Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evolutionary Biology, 7(1):147, 2007

An example - a reduced median network
[Figure: reduced median network of haplogroup D1 (D1a1a, D1a1a-1, D1a1a-2, D1a2, D1a3, D1b, D1c1, D1c2, D1c3, D1d, D1e, D1f, D1g, D1h, D1i). Legend: domestic pigs and wild boars from Northeast Asia, South China, the Mekong region and regions MDYZ, UMYR, URYZ and DRYR, feral pigs, Japanese domestic pig and ancient DNA, other; node size scale 1 to 100; branch unit: one mutation; * marks the coalescent root type of haplogroup D1]
G.-S. Wu, Y.-G. Yao, K.-X. Qu, Z.-L. Ding, H. Li, M. Palanichamy, Z.-Y. Duan, N. Li, Y.-S. Chen, and Y.-P. Zhang. Population phylogenomic analysis of mitochondrial DNA in wild boars and domestic pigs revealed multiple domestication events in East Asia. Genome Biology, 8(11):R245, 2007

An example - a minimum spanning network
C. M. Miller-Butterworth, D. S. Jacobs, and E. H. Harley. Strong population substructure is correlated with morphology and ecology in a migratory bat. Nature, 424(6945):187-191, 2003

An example - a DTLR network
P.J. Planet, S.C. Kachlany, D.H. Fine, R. DeSalle, and D.H. Figurski. The widespread colonization island of Actinobacillus actinomycetemcomitans. Nature Genetics, 34:193–198, 2003.

An example - a recombination network
[Figure: recombination network with root sequence 00000 and leaves a = 10000, b = 11000, c = 11100, d = 00110, e = 00111; internal nodes carry the sequences 10000, 11000, 00100, 11100 and 00110, with one reticulation node r]
Daniel H. Huson, Regula Rupp, Celine Scornavacca. Phylogenetic Networks. Cambridge University Press. 2011
Phylogenetic networks
In the presence of reticulate events, phylogenies are networks, not trees
The study of phylogenetic networks is a new interdisciplinary field: maths, CS, biology…
[Figure: timeline of works on phylogenetic networks: Doolittle, Science, 1999; then 2008, 2010, 2011, 2013, 2014]
A phylogenetic network ...
... is any connected graph, where terminal nodes are associated to a set of species.
[Figure: unrooted network on Homo, Pongo, Pan and Macaca]

A rooted phylogenetic network ...
... is any single-rooted directed acyclic graph, where terminal nodes are associated to a set of species.
[Figure: rooted network on Spear Mint, Peppermint and Water Mint, with a TIME axis]
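The definition of a rooted phylogenetic network can be checked mechanically. The following is a minimal sketch, not taken from the slides: the function name `analyse_rooted_network`, the edge-list encoding and the internal node names (`root`, `s`, `w`, `h`) are assumptions for illustration, and the mint topology (Peppermint below a hybrid node whose parents lead to Spear Mint and Water Mint) is an assumed reading of the figure.

```python
from collections import defaultdict, deque


def analyse_rooted_network(edges):
    """Check that a directed edge list is a single-rooted DAG.

    Returns (root, leaves, reticulations); raises ValueError otherwise.
    """
    children = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    # A rooted network has exactly one node without an ancestor.
    roots = [n for n in nodes if indeg[n] == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")

    # Kahn's algorithm: every node is reached iff there is no directed cycle.
    remaining = {n: indeg[n] for n in nodes}
    queue = deque(roots)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in children[u]:
            remaining[v] -= 1
            if remaining[v] == 0:
                queue.append(v)
    if visited != len(nodes):
        raise ValueError("graph contains a directed cycle")

    leaves = {n for n in nodes if not children[n]}
    # Reticulation nodes have more than one parent (hybridization, transfer, ...).
    reticulations = {n for n in nodes if indeg[n] > 1}
    return roots[0], leaves, reticulations


# Hypothetical encoding of the mint example.
mint_edges = [
    ("root", "s"), ("root", "w"),
    ("s", "Spear Mint"), ("s", "h"),
    ("w", "Water Mint"), ("w", "h"),
    ("h", "Peppermint"),
]
mint_root, mint_leaves, mint_reticulations = analyse_rooted_network(mint_edges)
```

A tree is the special case in which the set of reticulation nodes is empty.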
Abstract VS explicit phylogenetic networks
Split network: shows conflicting placement of taxa
Hybridization network: shows putative hybridization history
Daniel Huson, 2010

The plan of the survey
1 combinatorial and distance methods not accounting for incomplete lineage sorting (ILS)
unrooted networks
rooted networks (explicit or not)
2 methods accounting for ILS (always explicit)
Unrooted phylogenetic networks
[Figure: title page and table of contents of the User Manual for SplitsTree4 V4.12.8, Daniel H. Huson and David Bryant, November 8, 2012]

Reconstruction of unrooted phylogenetic networks
from splits
from distances (via splits or not)
from trees (via splits)
from sequences (via splits or not)
Splits
A split $A \mid B$ on $\mathcal{X}$ is a partition of a taxon set $\mathcal{X}$ into two non-empty sets.
(a) Unrooted tree T [figure: tree with leaves a, b, c, d, e]
(b) Split encoding of T:
{a} | {b,c,d,e}
{b} | {a,c,d,e}
{c} | {a,b,d,e}
{d} | {a,b,c,e}
{e} | {a,b,c,d}
{a,b} | {c,d,e}
{a,b,e} | {c,d}
Figure 5.2 (a) An unrooted phylogenetic tree T on $\mathcal{X} = \{a,\dots,e\}$. (b) The seven splits represented by T.

... $A \cup B = \mathcal{X}$. If the edges of the tree have lengths or weights, then these can be assigned to the corresponding splits, as well.
If T does not contain any unlabeled nodes of degree two, then any two different edges e and f always represent two different splits, that is, $\sigma_T(e) \neq \sigma_T(f)$ must hold. The only situation in which a phylogenetic tree can contain an unlabeled node of degree 2 is when it has a root with outdegree two. In this case, the two edges e and f that originate at the root give rise to the same split $\sigma_T(e) = \sigma_T(f)$, but to two complementary clusters.
A common construction to avoid this special case at the root is to attach an additional leaf to the root that is labeled by a special taxon o, which we call a (formal) outgroup. Then the root of a phylogenetic tree is specified as the node to which the leaf edge of o attaches.
All phylogenetic trees that are discussed in this chapter are unrooted. However, by using this outgroup trick much of what we discuss concerning unrooted trees can also be adapted to rooted phylogenetic trees.
Let T be an unrooted phylogenetic tree on $\mathcal{X}$. We define the split encoding $\mathcal{S}(T)$ to be the set of all splits represented by T, that is,
$\mathcal{S}(T) = \{\, \sigma_T(e) \mid e \text{ is an edge in } T \,\}$.
The term encoding is justified by the observation that the tree T can be uniquely reconstructed from $\mathcal{S}(T)$, as we see in the next section. Figure 5.2 shows an unrooted phylogenetic tree that has seven edges and thus gives rise to seven different splits.
We can determine all splits associated with any given unrooted phylogenetic tree T in quadratic time using the following algorithm:

Algorithm 5.2.2 (Splits from tree) The set $\mathcal{S}(T)$ of all splits associated with an unrooted phylogenetic tree T on $\mathcal{X}$ can be computed as follows:
(i) Choose a start leaf and assume that all edges of T are directed away from it.
(ii) In a postorder traversal of T, for each node v compute the set L(v) of taxon labels that are encountered in the subtree rooted at v.
(iii) For each edge e = (u,v) of T, add the split $\sigma_T(e) = L(v) \mid (\mathcal{X} \setminus L(v))$ to $\mathcal{S}(T)$.

Exercise 5.2.3 (Splits on a tree) Consider the following set of splits
S = { {a}|{b,c,d,e}, {b}|{a,c,d,e}, {c}|{a,b,d,e}, {d}|{a,b,c,e}, {e}|{a,b,c,d}, {a,b}|{c,d,e}, {a,e}|{b,c,d} }.

Compatible splits
Definition (Compatible splits)
Two splits $S_1 = A_1 \mid B_1$ and $S_2 = A_2 \mid B_2$ on $\mathcal{X}$ are called compatible, if one of the following four possible intersections of their split parts is empty: $A_1 \cap A_2$, $A_1 \cap B_2$, $B_1 \cap A_2$ or $B_1 \cap B_2$. Otherwise, the two splits are called incompatible. A set of splits $\mathcal{S}$ is called compatible if all pairs of splits in $\mathcal{S}$ are compatible.
Example
S1 = { {a}|{b,c,d,e}, {b}|{a,c,d,e}, {c}|{a,b,d,e}, {d}|{a,b,c,e}, {e}|{a,b,c,d}, {a,b}|{c,d,e}, {a,b,e}|{c,d} }
S2 = { {a,b,d,e,h}|{c,f,g}, {a,c,d,e,g,h}|{b,f}, {a,c,e,g}|{b,d,f,h}, {a,c,g}|{b,d,e,f,h}, {a,c,e,f,g}|{b,d,h}, {a,e,h}|{b,c,d,f,g} }
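The pairwise test in the definition translates directly into code. This is a minimal sketch; the helper names `split`, `compatible` and `is_compatible` and the tuple-of-frozensets encoding are illustrative assumptions, while `S1` and `S2` are the two example sets above.

```python
from itertools import combinations


def split(a, b):
    """Encode the split A | B as a pair of frozensets of one-letter taxa."""
    return (frozenset(a), frozenset(b))


def compatible(s1, s2):
    """Two splits are compatible if one of the four part intersections is empty."""
    (a1, b1), (a2, b2) = s1, s2
    return any(not (p & q) for p in (a1, b1) for q in (a2, b2))


def is_compatible(splits):
    """A set of splits is compatible if all of its pairs are."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))


S1 = [split("a", "bcde"), split("b", "acde"), split("c", "abde"),
      split("d", "abce"), split("e", "abcd"),
      split("ab", "cde"), split("abe", "cd")]
S2 = [split("abdeh", "cfg"), split("acdegh", "bf"), split("aceg", "bdfh"),
      split("acg", "bdefh"), split("acefg", "bdh"), split("aeh", "bcdfg")]

S1_ok = is_compatible(S1)  # all pairs pass the test
S2_ok = is_compatible(S2)  # the first two splits of S2 already intersect in all four ways
```

For the first two splits of S2, the four intersections are {a,d,e,h}, {b}, {c,g} and {f}, none of which is empty, so S2 is not compatible.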
Theorem (Compatible splits)
Let $\mathcal{S}$ be a set of splits on $\mathcal{X}$ and assume that $\mathcal{S}$ contains all trivial splits on $\mathcal{X}$. There exists a unique unrooted phylogenetic tree T that realizes $\mathcal{S}$, that is, with $\mathcal{S}(T) = \mathcal{S}$, if and only if $\mathcal{S}$ is compatible.
[Figure 5.2: (a) an unrooted phylogenetic tree T on $\mathcal{X} = \{a,\dots,e\}$; (b) the seven splits represented by T: {a}|{b,c,d,e}, {b}|{a,c,d,e}, {c}|{a,b,d,e}, {d}|{a,b,c,e}, {e}|{a,b,c,d}, {a,b}|{c,d,e}, {a,b,e}|{c,d}]
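The split encoding of a tree can be computed along the lines of Algorithm 5.2.2 (direct the tree away from a start leaf, collect the taxa below each node, emit one split per edge). The sketch below is one possible reading of those steps, not the book's implementation: the adjacency-dictionary encoding, the function names and the internal node names `u`, `v`, `w` are assumptions, and the tree shape is inferred from the seven splits of Figure 5.2.

```python
def make_split(a, b):
    """Encode A | B as an unordered pair (frozenset of two frozensets)."""
    return frozenset({frozenset(a), frozenset(b)})


def splits_from_tree(adj, taxa):
    """Return the split encoding of an unrooted tree given as an adjacency dict.

    Leaves are assumed to be named by their taxon label.
    """
    taxa = frozenset(taxa)
    start = min(taxa)  # step (i): pick a start leaf, direct edges away from it
    parent = {start: None}
    order = []
    stack = [start]
    while stack:  # iterative depth-first search; `order` is a preorder
        v = stack.pop()
        order.append(v)
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                stack.append(w)

    below = {}
    splits = set()
    for v in reversed(order):  # step (ii): children are handled before parents
        labels = {v} if v in taxa else set()
        for w in adj[v]:
            if parent.get(w) == v:
                labels |= below[w]
        below[v] = frozenset(labels)
        if parent[v] is not None:  # step (iii): one split per edge (parent, v)
            splits.add(make_split(below[v], taxa - below[v]))
    return splits


# Assumed shape of the tree T in Figure 5.2: a and b hang off u, e off v, c and d off w.
tree_T = {
    "a": ["u"], "b": ["u"], "u": ["a", "b", "v"],
    "e": ["v"], "v": ["u", "e", "w"],
    "c": ["w"], "d": ["w"], "w": ["v", "c", "d"],
}
encoding_T = splits_from_tree(tree_T, "abcde")
```

The tree has seven edges, so the encoding contains seven splits, matching panel (b) of the figure.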
Circular splits
Definition (Circular splits)
A set of splits $\mathcal{S}$ on $\mathcal{X}$ is called circular, if there exists a linear ordering $\pi = (x_1,\dots,x_n)$ of the elements of $\mathcal{X}$ such that each split $S \in \mathcal{S}$ is interval-realizable, that is, has the form
$S = \{x_p, x_{p+1}, \dots, x_q\} \mid (\mathcal{X} \setminus \{x_p, x_{p+1}, \dots, x_q\})$,
for appropriately chosen $1 < p \le q \le n$.
Example
{a,b,d,e,h} | {c,f,g}
{a,c,d,e,g,h} | {b,f}
{a,c,e,g} | {b,d,f,h}
{a,c,g} | {b,d,e,f,h}
{a,c,e,f,g} | {b,d,h}
{a,e,h} | {b,c,d,f,g}
[Figure: the taxa a, ..., h arranged around a circle in an ordering for which the six splits are interval-realizable]

Circular splits
Problem (Consecutive Ones problem)
Let M be a binary matrix. Does there exist a permutation of the columns of the matrix M such that in each row, all ones in the row occur in a single consecutive run?
Example
split                       (a) Input matrix   (b) Permuted matrix
                            abcdefgh           aehdbfcg
{a,b,d,e,h} | {c,f,g}       00100110           00000111
{a,c,d,e,g,h} | {b,f}       01000100           00001100
{a,c,e,g} | {b,d,f,h}       01010101           00111100
{a,c,g} | {b,d,e,f,h}       01011101           01111100
{a,c,e,f,g} | {b,d,h}       01010001           00111000
{a,e,h} | {b,c,d,f,g}       01110110           00011111
Figure 5.11 (a) An input matrix M for the Consecutive Ones problem that corresponds to the set of splits shown in Figure 5.9(a). (b) A solution obtained by permuting the columns of M in the order suggested by Figure 5.9(b).

We call such an ordering $(x_1,\dots,x_n)$ a circular ordering for $\mathcal{S}$, as it holds that $(x_1,\dots,x_n)$ is a circular ordering for $\mathcal{S}$, if and only if $(x_n, x_{n-1},\dots,x_1)$ and $(x_2, x_3,\dots,x_n, x_1)$ both are.
How to determine whether a set of splits $\mathcal{S}$ is circular? This is equivalent to a well-known problem in computer science, the Consecutive Ones problem (Problem 5.7.2) stated above.
It is straightforward to translate the problem of determining whether $\mathcal{S}$ is circular into an instance of the Consecutive Ones problem. Simply define a binary matrix M in which each row r corresponds to some split $S \in \mathcal{S}$ and every column c corresponds to some taxon $x \in \mathcal{X}$. Then set M(r,c) = 1 if and only if S separates x from the taxon represented by the first column of M.

Exercise 5.7.3 (Circular ordering and Consecutive Ones) Prove the following statement: There exists a circular ordering of $\mathcal{X}$ for $\mathcal{S}$ if and only if a solution of the Consecutive Ones problem exists for M. Hint: construct a binary matrix M whose rows represent splits and whose columns represent taxa (see Figure 5.11).

The Consecutive Ones problem can be solved in linear time [23]. However, if no circular ordering exists for a given set of splits $\mathcal{S}$ on $\mathcal{X}$, then one might try to determine a solution that is as good as possible. For our purposes, an appropriate optimization goal is to find an ordering of $\mathcal{X}$ that minimizes the number of runs of ones in the matrix M. This problem, known as the Optimal Consecutive Ones problem, is NP-hard.
A useful heuristic for finding an optimal circular ordering for a set of non-circular splits $\mathcal{S}$ on $\mathcal{X}$ is to define a distance matrix D on $\mathcal{X}$ by setting the distance between any two taxa a and b equal to the sum of all weights of the splits that separate them. This distance matrix D is then provided as input to the neighbor-net algorithm, described in Section 10.4, which outputs an ordering of $\mathcal{X}$ that is guaranteed to be circular whenever the given input set $\mathcal{S}$ is circular [31].
Circular split systems are of particular interest because they can always be represented graphically in an appealing way, namely by a split network that is drawn in the plane ...
Note
The decision problem is polynomially solvable.
Finding an ordering of $\mathcal{X}$ that minimizes the number of runs of ones in the matrix M (Optimal Consecutive Ones problem) is NP-hard.
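Checking a proposed ordering is much simpler than finding one. The sketch below builds the matrix rows as described above (a 1 marks taxa separated from the taxon in the first column) and counts runs of ones; it only verifies a given ordering and does not implement the linear-time Consecutive Ones algorithm. The function names are illustrative assumptions; the splits and the two orderings are those of the example.

```python
def split(a, b):
    """Encode the split A | B as a pair of frozensets of one-letter taxa."""
    return (frozenset(a), frozenset(b))


def ones_row(s, ordering):
    """Row of M for split s: 1 iff s separates the taxon from ordering[0]."""
    a, b = s
    far_side = b if ordering[0] in a else a
    return [1 if x in far_side else 0 for x in ordering]


def runs_of_ones(row):
    """Number of maximal runs of consecutive ones in a 0/1 row."""
    return sum(1 for i, bit in enumerate(row) if bit and (i == 0 or not row[i - 1]))


def is_circular_ordering(splits, ordering):
    """True if every split is interval-realizable with respect to the ordering."""
    return all(runs_of_ones(ones_row(s, ordering)) == 1 for s in splits)


example_splits = [
    split("abdeh", "cfg"), split("acdegh", "bf"), split("aceg", "bdfh"),
    split("acg", "bdefh"), split("acefg", "bdh"), split("aeh", "bcdfg"),
]
input_ok = is_circular_ordering(example_splits, "abcdefgh")     # input column order
permuted_ok = is_circular_ordering(example_splits, "aehdbfcg")  # permuted column order
total_runs_input = sum(runs_of_ones(ones_row(s, "abcdefgh")) for s in example_splits)
```

The total number of runs, as in `total_runs_input`, is the quantity that the Optimal Consecutive Ones problem asks to minimize.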
Theorem (Circular implies outer-labeled planar)
A set of splits $\mathcal{S}$ on $\mathcal{X} = \{x_1,\dots,x_n\}$ is circular if and only if it can be represented by a split network N that is outer-labeled planar.
Example
{a,b,d,e,h} | {c,f,g}
{a,c,d,e,g,h} | {b,f}
{a,c,e,g} | {b,d,f,h}
{a,c,g} | {b,d,e,f,h}
{a,c,e,f,g} | {b,d,h}
{a,e,h} | {b,c,d,f,g}
[Figure: outer-labeled planar split network representing these six splits, with the taxa a, ..., h on the outside]
Weakly compatible splits
Definition (Weakly compatible splits)
A set of splits $\mathcal{S}$ on $\mathcal{X}$ is called weakly compatible, if any three distinct splits in $\mathcal{S}$ are weakly compatible. Three such splits $S_1 = A_1 \mid B_1$, $S_2 = A_2 \mid B_2$ and $S_3 = A_3 \mid B_3$ are called weakly compatible, if
1 at least one of the following four intersections is empty: $A_1 \cap A_2 \cap A_3$, $A_1 \cap B_2 \cap B_3$, $B_1 \cap A_2 \cap B_3$ and $B_1 \cap B_2 \cap A_3$,
2 at least one of the following four intersections is empty: $B_1 \cap B_2 \cap B_3$, $B_1 \cap A_2 \cap A_3$, $A_1 \cap B_2 \cap A_3$ and $A_1 \cap A_2 \cap B_3$.
Example
S1 = { {a,b,d,e,h}|{c,f,g}, {a,c,d,e,g,h}|{b,f}, {a,c,e,g}|{b,d,f,h}, {a,c,g}|{b,d,e,f,h}, {a,c,e,f,g}|{b,d,h}, {a,e,h}|{b,c,d,f,g} }
S2 = { {a,b,d,e,h}|{c,f,g}, {a,c,d,e,g,h}|{b,f}, {a,c,e,g}|{b,d,f,h}, {a,c,d,e}|{b,f,g}, {a,b}|{c,d,e,f,g}, {a,e,f}|{b,c,d,g} }
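The two conditions of the definition can be checked triple by triple. This is a minimal sketch under the same illustrative encoding as before (pairs of frozensets; the function names are assumptions). The set `circular_splits` is S1 from the example; `tree_splits` and `three_quartets` are small hypothetical inputs added here to show a positive and a negative case.

```python
from itertools import combinations


def split(a, b):
    """Encode the split A | B as a pair of frozensets of one-letter taxa."""
    return (frozenset(a), frozenset(b))


def weakly_compatible_triple(s1, s2, s3):
    """Conditions 1 and 2 of the definition for three splits."""
    (a1, b1), (a2, b2), (a3, b3) = s1, s2, s3
    first = [a1 & a2 & a3, a1 & b2 & b3, b1 & a2 & b3, b1 & b2 & a3]
    second = [b1 & b2 & b3, b1 & a2 & a3, a1 & b2 & a3, a1 & a2 & b3]
    return any(not x for x in first) and any(not x for x in second)


def is_weakly_compatible(splits):
    """A set of splits is weakly compatible if every three distinct splits are."""
    return all(weakly_compatible_triple(*t) for t in combinations(splits, 3))


circular_splits = [
    split("abdeh", "cfg"), split("acdegh", "bf"), split("aceg", "bdfh"),
    split("acg", "bdefh"), split("acefg", "bdh"), split("aeh", "bcdfg"),
]
# Hypothetical inputs: the splits of a tree, and the three quartet splits on four taxa.
tree_splits = [split("a", "bcde"), split("b", "acde"), split("c", "abde"),
               split("d", "abce"), split("e", "abcd"),
               split("ab", "cde"), split("abe", "cd")]
three_quartets = [split("wx", "yz"), split("wy", "xz"), split("wz", "xy")]

circular_wc = is_weakly_compatible(circular_splits)
tree_wc = is_weakly_compatible(tree_splits)
quartets_wc = is_weakly_compatible(three_quartets)
```

For the three quartet splits on {w,x,y,z}, the four intersections of condition 1 are {w}, {x}, {y} and {z}, so the triple is not weakly compatible.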
Why all this?!?
Phylogenetic networks reconstructed from weakly compatible splits are easier than the ones reconstructed from generic splits

UPN from distances or "how to get the splits from distances"
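From weighted splits to distances
Any split $S \in \mathcal{S}$ can be used to define a distance matrix $D_S$ on $\mathcal{X}$, by setting:
$$d_S(x,y) = \begin{cases} 1 & \text{if } S \text{ separates } x \text{ and } y,\\ 0 & \text{else,} \end{cases}$$
for all taxa x and y in $\mathcal{X}$.
Assume we are given a weighting of a set of splits, $\omega : \mathcal{S} \to \mathbb{R}^{\ge 0}$, then we define the weighted split metric $D(\mathcal{S},\omega)$ as:
$$d_{(\mathcal{S},\omega)}(x,y) = \sum_{S \in \mathcal{S}} \omega(S) \times d_S(x,y)$$
for all taxa x and y in $\mathcal{X}$.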
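The weighted split metric is a plain sum over the splits that separate the two taxa. A minimal sketch, with an assumed dictionary encoding (split to weight) and hypothetical weights on the seven splits of a five-taxon tree:

```python
def split(a, b):
    """Encode the split A | B as a pair of frozensets of one-letter taxa."""
    return (frozenset(a), frozenset(b))


def separates(s, x, y):
    """d_S(x, y): True iff x and y lie on different sides of the split."""
    a, _ = s
    return (x in a) != (y in a)


def split_metric(weighted_splits, x, y):
    """Sum of the weights of all splits separating x and y."""
    return sum(w for s, w in weighted_splits.items() if separates(s, x, y))


# Hypothetical weights: all 1.0 except the split {a,b} | {c,d,e}.
weights = {
    split("a", "bcde"): 1.0, split("b", "acde"): 1.0, split("c", "abde"): 1.0,
    split("d", "abce"): 1.0, split("e", "abcd"): 1.0,
    split("ab", "cde"): 2.5, split("abe", "cd"): 1.0,
}
d_ab = split_metric(weights, "a", "b")  # separated by {a}|... and {b}|...
d_ac = split_metric(weights, "a", "c")  # separated by four splits
d_ae = split_metric(weights, "a", "e")  # separated by three splits
```

When the splits come from a tree and the weights are its edge lengths, this sum is the path length between the two leaves.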
PN from distances: the split decomposition
The split decomposition algorithm [Bandelt and Dress, 1992]:
Given a distance matrix D on $\mathcal{X} = \{x_1,\dots,x_n\}$, we start by computing the isolation index for quartets and splits:
for any four taxa w, x, y and z with $\{w,x\} \cap \{y,z\} = \emptyset$, but not necessarily $w \neq x$ or $y \neq z$:
$$\hat{\alpha}_D(\{w,x\} \mid \{y,z\}) = \tfrac{1}{2}\left(\max\{d(w,x)+d(y,z),\ d(w,y)+d(x,z),\ d(w,z)+d(x,y)\} - d(w,x) - d(y,z)\right).$$
for any (partial) split $S = A \mid B$:
$$\alpha_D(S) = \min\{\, \hat{\alpha}_D(\{w,x\} \mid \{y,z\}) \;:\; w, x \in A,\ y, z \in B \,\} \ge 0.$$
[Figure: quartet on the taxa w, x, y, z]
Then, we set $\mathcal{X}_0 = \emptyset$ and $\mathcal{S}_0 = \emptyset$. Now, assume that the set of splits $\mathcal{S}_i$ on the first i taxa $\mathcal{X}_i = \{x_1,\dots,x_i\}$ has been computed. To obtain $\mathcal{S}_{i+1}$ on $\mathcal{X}_{i+1} = \{x_1,\dots,x_{i+1}\}$, for each split $A \mid B \in \mathcal{S}_i$ do:
1 Consider $S = (A \cup \{x_{i+1}\}) \mid B$. If $\alpha_D(S) > 0$, set $\omega(S) = \alpha_D(S)$ and add S to $\mathcal{S}_{i+1}$.
2 Consider $S = A \mid (B \cup \{x_{i+1}\})$. If $\alpha_D(S) > 0$, set $\omega(S) = \alpha_D(S)$ and add S to $\mathcal{S}_{i+1}$.
3 Consider $S = \mathcal{X}_i \mid \{x_{i+1}\}$. If $\alpha_D(S) > 0$, set $\omega(S) = \alpha_D(S)$ and add S to $\mathcal{S}_{i+1}$.
The result is given by $\mathcal{S}_n$.
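The incremental procedure above can be sketched directly from the two isolation-index formulas. This is a naive illustration, not an optimized or reference implementation: the function names and the dict-of-dicts distance encoding are assumptions, the trivial split of step 3 is tried once per added taxon, and the isolation index is recomputed from scratch for every candidate. The example distances are hypothetical: the tree metric of a four-taxon tree in which a, b form one cherry, c, d the other, and every edge has length 1.

```python
def quartet_index(d, w, x, y, z):
    """Isolation index of the quartet {w,x} | {y,z}."""
    best = max(d[w][x] + d[y][z], d[w][y] + d[x][z], d[w][z] + d[x][y])
    return 0.5 * (best - d[w][x] - d[y][z])


def isolation_index(d, A, B):
    """Isolation index of the (partial) split A | B: minimum over all quartets."""
    return min(quartet_index(d, w, x, y, z)
               for w in A for x in A for y in B for z in B)


def split_decomposition(d, taxa):
    """Return {split: weight} for all D-splits, adding one taxon at a time."""
    current = []  # D-splits on the taxa seen so far, as (A, B) pairs
    seen = []
    for t in taxa:
        candidates = [(A | {t}, B) for A, B in current]   # step 1
        candidates += [(A, B | {t}) for A, B in current]  # step 2
        if seen:                                          # step 3
            candidates.append((frozenset(seen), frozenset({t})))
        current = [(A, B) for A, B in candidates if isolation_index(d, A, B) > 0]
        seen.append(t)
    return {frozenset({A, B}): isolation_index(d, A, B) for A, B in current}


# Hypothetical tree metric: d(a,b) = d(c,d) = 2, all other distinct pairs at distance 3.
taxa = ["a", "b", "c", "d"]
cherries = {frozenset("ab"): 2, frozenset("cd"): 2}
D = {x: {y: 0 if x == y else cherries.get(frozenset((x, y)), 3) for y in taxa}
     for x in taxa}
d_splits = split_decomposition(D, taxa)
```

On this tree metric the result consists of the four trivial splits and {a,b} | {c,d}, each with weight 1, that is, exactly the edges of the tree.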
A split S whose isolation index $\alpha_D(S)$ is greater than 0 is called a D-split.
D-splits are always weakly compatible. It follows from this that the split decomposition always computes a set of weakly compatible splits.
The split decomposition (SD) is a conservative method.
It can be used for a small number of taxa or low divergence.
[Figure: split network computed by split decomposition for the taxa WE_LA, SK, BS_LD, BS_LC, JC, HT, HY, NG, KM, BD_LB, BE_LC, Gg1 to Gg3 and the HG samples; scale bar 0.01]
PN from distances: Neighbor-Net
Given a distance matrix D on $\mathcal{X}$, the Neighbor-Net algorithm [Bryant and Moulton, 2004] computes a circular ordering $\pi$ of $\mathcal{X}$ from D and then a set of weighted splits $\mathcal{S}$ that are interval-realizable with respect to $\pi$:
produces circular splits
used together with the circular network algorithm to get planar networks
can be used for a large number of taxa and high divergence
[Figure: Neighbor-Net split network shown next to the split decomposition network for the same taxa (WE_LA, SK, BS_LD, BS_LC, JC, HT, HY, NG, KM, BD_LB, BE_LC, Gg1 to Gg3 and the HG samples); scale bars 0.01]
Theorem (Neighbor-Net consistency)
Given a circular distance matrix D on $\mathcal{X}$, the Neighbor-Net algorithm produces a circular ordering $\pi$ and a set of weighted splits $\mathcal{S}$ that are interval-realizable with respect to $\pi$ such that $D = D(\mathcal{S})$.

PN from distances
Other algorithms from distances:
Minimum spanning network
T-Rex
...
A great source of information: http://phylnet.univ-mlv.fr/
UPN from splits or "what to do with the splits?"

PN from splits: the Convex hull algorithm
Let $\mathcal{S}$ be a set of splits on $\mathcal{X}$ comprising all trivial ones. Assume that we have already computed a split network N for $\mathcal{S}$. To obtain a split network for $\mathcal{S} \cup \{S\}$, where $S = A \mid B$ is a new non-trivial split, modify N as follows:
1 Compute the two convex hulls H(A) and H(B) in N and let M be the split graph induced in N by the set of nodes $H(A) \cap H(B) \neq \emptyset$.
2 Create a copy M' of M and for each node v and edge e in M let v' and e' denote their copies in M', respectively.
3 For every edge f that leads from some node u in $H(B) \setminus H(A) \neq \emptyset$ to a node v in M, redirect the edge f so that it leads from u to v'.
4 Connect each pair of nodes v in M and v' in M' by a new edge f = (v, v') and set $\sigma(f) = S$.
[Figure: (a) Network N0, (b) Hulls for S1, (c) Network N1, (d) Hulls for S2, (e) Network N2, (f) Hulls for S3, (g) Network N3, with S1 = {a,b,c,d} | {e,f,g,h}, S2 = {a,b,g,h} | {c,d,e,f} and S3 = {a,c,f} | {b,d,e,g,h}]

PN from splits: the circular network algorithm
Let $\mathcal{S}$ be a set of splits on $\mathcal{X}$ comprising all trivial ones. Assume that we have already computed a split network N for $\mathcal{S}$. To obtain a split network for $\mathcal{S} \cup \{S\}$, where $S = \{x_p,\dots,x_q\} \mid (\mathcal{X} \setminus \{x_p,\dots,x_q\})$ is a new non-trivial split, modify N as follows (splits have to be considered in a certain order):
1 Determine the path $M(x_p, x_q)$ and let $\dot{M}$ denote the path obtained by removing the first and last (leaf) edges from $M(x_p, x_q)$.
2 Create a copy $\dot{M}'$ of $\dot{M}$ and for each node v and edge e in $\dot{M}$ let v' and e' denote their copies, respectively.
3 Redirect every edge f that leads from the node u labeled by taxon $x_i$ to some node v in $\dot{M}$ so that it leads from u to v', for all i = p,...,q.
4 Connect each pair of nodes v in $\dot{M}$ and v' in $\dot{M}'$ by a new edge f = (v, v') and set $\sigma(f) = S_{t+1}$.
[Figure: (a) Network N2, (b) Network N3, obtained by adding the split {a,f,g,h} | {b,c,d,e}]
PN from splits: attention!!!
All four different split networks shown below represent the same set of splits. The split networks shown in (a) and (b) were computed using the circular network algorithm processing the splits and taxa in two different orders. The one shown in (c) was constructed using the convex hull algorithm. The split network shown in (d) can be obtained by deleting superfluous edges in any of the first three.
[Figure: four split networks (a), (b), (c) and (d) on the taxa a, b, c, d, e, f]
UPN from trees or "how to get splits from a bunch of trees"

PN from trees: Consensus split networks
Consensus splits [Holland et al, 2004]
Input: Trees on identical taxon sets
Determine splits in more than X% of trees
For >50%, the result is compatible
[Figure 11.1 panels: (a) Tree T1, (b) Tree T2, (c) Tree T3, (d) Tree T4, (e) Tree T5, (f) Tree T6, (g) Majority, (h) d = 2, (i) d = 5, (j) All splits]
Figure 11.1 (a)-(f) Six different phylogenetic trees T1,...,T6 on $\mathcal{X} = \{a,b,c,d,e,f\}$. (g) Their majority consensus tree and (h) their consensus split network for d = 2, representing all splits that are present in more than 1/3 of the trees. Note that in this case the network is still a tree, but more resolved than the majority consensus tree. (i) The consensus split network for d = 5 and (j) the split network representing all splits present in the six trees.
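Selecting consensus splits amounts to counting in how many trees each split occurs. The sketch below assumes that each input tree is already given as its set of splits and that the parameter d selects splits present in more than 1/(d+1) of the trees, a relation read off the caption above (d = 2 gives "more than 1/3"); the function names and the three four-taxon input trees are hypothetical.

```python
from collections import Counter


def make_split(a, b):
    """Encode A | B as an unordered pair (frozenset of two frozensets)."""
    return frozenset({frozenset(a), frozenset(b)})


def consensus_splits(trees, d):
    """Splits occurring in more than 1/(d+1) of the trees (d = 1: majority)."""
    counts = Counter(s for tree in trees for s in set(tree))
    # Integer comparison avoids rounding: count/len(trees) > 1/(d+1).
    return {s for s, c in counts.items() if c * (d + 1) > len(trees)}


# Hypothetical input: non-trivial splits of three trees on the taxa a, b, c, d.
ab_cd = make_split("ab", "cd")
ac_bd = make_split("ac", "bd")
trees = [{ab_cd}, {ab_cd}, {ac_bd}]

majority = consensus_splits(trees, 1)     # more than 1/2 of the trees
one_third = consensus_splits(trees, 2)    # more than 1/3 of the trees
one_quarter = consensus_splits(trees, 3)  # more than 1/4 of the trees
```

With d = 3 the two selected splits are incompatible, so the result can no longer be drawn as a tree and needs a split network.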
and p 0.30, respectively. Additionally, we show the majority consensus tree. In these = drawings, the length of an edge is scaled to represent the number of trees that support the corresponding split. This figure shows clearly that there is some disagreement among the gene trees as to where the outgroup taxon C. albicans attaches to the phylogeny, the two main choices being an attachment to either the lineage leading to S. castelli or to S. kluyveri. There is some disagreement concerning how to arrange the other five taxa, the largest contention being whether S. kudriavzevii and S. bayanus are sister taxa. In the depicted consensus tree, the majority of gene trees place C. albicans as a sister of S. kluyveri and the information that a significant number of genes prefer an alternative placement is suppressed.
11.2 Consensus super split networks for unrooted trees
The above approach to calculating consensus trees and consensus split networks requires that all input trees are on exactly the same set of taxa X. However, in practice it often occurs that some (or even all) of the input trees lack some of the members of the total taxon set X. For example, when studying multiple gene trees, particular gene sequences may be absent for certain taxa. In this section we discuss how to address this problem.
Let X be a set of taxa. A phylogenetic tree T on X′ is called a partial tree on X if X′ ⊆ X.
PN from trees: Consensus super splits networks
Consensus super splits [Huson et al, 2004, Whitfield et al, 2008]
Input: Trees on overlapping taxon sets
Use the Z-closure to complete partial splits
Use the “distortion” values to filter splits
11.2. Consensus super split networks for unrooted trees 243
[Figure 11.3: panels (a)–(e) Trees T1–T5; (f) Super network N]
Figure 11.3 (a)–(e) Five partial gene trees T1,...,T5 on 13–25 plant taxa. (f) The corresponding super split network N on all 26 taxa, computed using the Z-closure method. The edges in N are scaled to represent the number of input trees that contain the edge. The network N shows that the placement of the pair of taxa Physaria bellii and Physaria gracilis differs in the five trees.
The split network used to represent $\bar{S}_p(\mathcal{T})$ for any value of p from 0 to 1 is called a super split network.
We conjecture that the set of strict consensus splits $\bar{S}_{strict}(\mathcal{T})$ is always compatible and thus representable by a phylogenetic tree. If this is true, then this construction provides a super tree method. However, the set of majority consensus splits $\bar{S}_{majority}(\mathcal{T})$ is not necessarily compatible, in general. In fact, the latter statement holds for $\bar{S}_p(\mathcal{T})$ with any choice of p between 0 and 1.
38 / 76 The Z-closure
The Z-closure method takes a set of partial splits on $\mathcal{X}$ as input and infers a set of complete splits on $\mathcal{X}$ as output.
Two partial splits $S_1 = \frac{A_1}{B_1} \in \mathcal{S}$ and $S_2 = \frac{A_2}{B_2} \in \mathcal{S}$ are said to be in Z-relation to each other if exactly one of the four intersections $A_1 \cap A_2$, $A_1 \cap B_2$, $B_1 \cap A_2$ or $B_1 \cap B_2$ is empty. In this case, if the empty intersection is $A_1 \cap B_2$, say, then we write $\frac{A_1}{B_1}\, Z\, \frac{A_2}{B_2}$.
The Z-operation applied to $S_1$ and $S_2$ is defined as the creation of two new splits
$$S_1' = \frac{A_1}{B_1 \cup B_2} \quad \text{and} \quad S_2' = \frac{A_1 \cup A_2}{B_2}.$$
If at least one of the two new splits contains more taxa than its predecessor, the pair of splits is called productive.
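The Z-relation and the Z-operation defined above can be sketched as follows, together with a simple loop that keeps applying the operation for as long as some pair of splits is productive. This is only an illustrative sketch: a partial split is assumed to be a pair of disjoint frozensets, the function names and the two-split example are hypothetical, and the final step of adding trivial splits is left out.

```python
from itertools import combinations

def z_operation(s1, s2):
    """Return (S1', S2') if s1 and s2 are in Z-relation, otherwise None."""
    a1, b1 = s1
    a2, b2 = s2
    empty = [(x, y) for x in (a1, b1) for y in (a2, b2) if not (x & y)]
    if len(empty) != 1:          # Z-relation: exactly one empty intersection
        return None
    x, y = empty[0]
    # Swap sides if needed so that the empty intersection is A1 ∩ B2.
    a1, b1 = (a1, b1) if x == a1 else (b1, a1)
    a2, b2 = (a2, b2) if y == b2 else (b2, a2)
    return (a1, b1 | b2), (a1 | a2, b2)

def size(split):
    """Number of taxa covered by a partial split."""
    return len(split[0]) + len(split[1])

def z_closure(splits):
    """Apply the Z-operation to productive pairs until none is left."""
    splits = [(frozenset(a), frozenset(b)) for a, b in splits]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(splits)), 2):
            result = z_operation(splits[i], splits[j])
            if result is None:
                continue
            new_i, new_j = result
            if size(new_i) > size(splits[i]) or size(new_j) > size(splits[j]):
                splits[i], splits[j] = new_i, new_j   # productive: replace the pair
                changed = True
    # Report each split as an unordered pair of sides.
    return {frozenset(s) for s in splits}

# Hypothetical example: ab|cd and bc|de are completed to ab|cde and abc|de.
result = z_closure([("ab", "cd"), ("bc", "de")])
print(sorted(sorted("".join(sorted(side)) for side in s) for s in result))
```

The loop terminates because every productive step strictly increases the total number of taxa covered by the splits, and that total is bounded.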
We obtain a set of complete splits from a set of partial splits $\mathcal{S} = \{S_1, \dots, S_m\}$ on $\mathcal{X}$ as follows: While $\mathcal{S}$ contains a productive pair of splits $\{S_i, S_j\}$, apply the Z-operation to obtain two new splits $\{S_i', S_j'\}$ and then replace the former pair by the latter pair in $\mathcal{S}$. Finally, add all trivial splits on $\mathcal{X}$.
39 / 76 The distortion values
Let $\mathcal{T} = (T_1, \dots, T_k)$ be a set of partial trees on $\mathcal{X}$. For any complete split $S$ on $\mathcal{X}$ we define the distortion of $S$ (with respect to $\mathcal{T}$) as
$$\delta(S) = \sum_{i=1}^{k} \delta(T_i, S),$$
where $\delta(T_i, S)$ denotes the minimal homoplasy score for $S$ on the input tree $T_i$, i.e. the smallest number of (SPR or TBR) branch-swapping operations required to transform some refinement of $T_i$ into a tree that contains the split $S$.
The distortion of a split can be efficiently computed using dynamic programming.
40 / 76 UPN from sequences
41 / 76 Median networks
For a condensed¹ multiple alignment $M$ of binary sequences on $\mathcal{X}$, its median network is a phylogenetic network $N = (V, E, \sigma, \lambda)$ whose node set is given by the median closure $V = \bar{M}$ and in which any two nodes $a$ and $b$ are connected by an edge $e \in E$ of color $\sigma(e) = i$ if and only if they differ in exactly their $i$-th position (as haplotypes). An associated taxon labeling $\lambda : \mathcal{X} \to V$ maps each taxon $x$ onto the node $\lambda(x)$ that represents the corresponding sequence.
¹ Each set of identical sequences is pooled into a single haplotype, then all constant columns are removed and finally, every set of columns that have the same pattern is replaced by a single column $i$ that is assigned a weight $\omega(i)$ that equals the number of columns in the represented set.
9.4. Median networks 197
[Figure 9.2: (a) Median calculation; (b) Parsimonious tree]
a        01010
b        10001
c        11100
m(a,b,c) 11000
Figure 9.2 (a) The median m(a,b,c) of three binary sequences a, b and c. The median m(a,b,c) minimizes the parsimony score of the unrooted phylogenetic tree on {a,b,c} shown in (b), which in this case is five.
[Figure 9.3: (a) Alignment M; (b) Median network N]
a 0000000
b 0110000
c 1101100
d 1110110
e 0110101
Additional nodes of N: 0100000, 0100100, 0110100, 1100100, 1110100. Edges are labeled 1 to 7.
Figure 9.3 (a) A condensed multiple alignment M of binary sequences on X = {a,...,e}. (b) The median network N that represents M. The nodes are labeled by haplotypes and the edges are labeled by the corresponding columns in M.
42 / 76
Let M be a multiple alignment of binary sequences on X. The median closure of M is defined as the set $\bar{M}$ of all binary sequences that can be obtained by repeatedly taking the median of any three sequences in the set and adding it to the set, until no new sequences can be produced. The median network is defined as follows:
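As an illustration of the median closure just described, here is a minimal sketch that treats haplotypes as equal-length strings of 0s and 1s. The function names are hypothetical, and the brute-force loop over all triples is chosen for clarity rather than speed.

```python
from itertools import combinations

def median(a, b, c):
    """Position-wise majority of three equal-length binary strings."""
    return "".join(
        "1" if (x, y, z).count("1") >= 2 else "0" for x, y, z in zip(a, b, c)
    )

def median_closure(sequences):
    """Add medians of triples until no new sequence appears."""
    closure = set(sequences)
    while True:
        new = {median(a, b, c) for a, b, c in combinations(sorted(closure), 3)}
        new -= closure
        if not new:
            return closure
        closure |= new

# The three sequences of Figure 9.2 and the alignment M of Figure 9.3.
print(median("01010", "10001", "11100"))
M = ["0000000", "0110000", "1101100", "1110110", "0110101"]
print(sorted(median_closure(M)))
```

For the alignment of Figure 9.3 the closure consists of the five input haplotypes plus the five additional nodes drawn in the figure.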
Definition 9.4.1 (Median network) Let M be a condensed multiple alignment of binary sequences on X. The median network associated with M is a phylogenetic network $N = (V, E, \sigma, \lambda)$ whose node set is given by the median closure $V = \bar{M}$ and in which any two nodes a and b are connected by an edge e of color $\sigma(e) = i$ in E if and only if they differ in exactly their i-th position (as haplotypes). An associated taxon labeling $\lambda : X \to V$ maps each taxon x onto the node $\lambda(x)$ that represents the corresponding sequence.
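Read literally, the definition joins two haplotypes by an edge whenever they differ in exactly one column, and colors the edge by that column. A small sketch of this edge construction is given below; it assumes the node set (the median closure) is already available as a collection of binary strings, and the function name is hypothetical. Columns are numbered from 1, as in Figure 9.3.

```python
from itertools import combinations

def median_network_edges(nodes):
    """Return (u, v, i) for every pair of nodes differing only in column i."""
    edges = []
    for u, v in combinations(sorted(nodes), 2):
        diff = [i for i, (x, y) in enumerate(zip(u, v), start=1) if x != y]
        if len(diff) == 1:
            edges.append((u, v, diff[0]))
    return edges

# Node set of the median network in Figure 9.3 (input haplotypes plus medians).
nodes = [
    "0000000", "0110000", "1101100", "1110110", "0110101",
    "0100000", "0100100", "0110100", "1100100", "1110100",
]
for u, v, color in median_network_edges(nodes):
    print(u, v, color)
```

Comparing all pairs is quadratic in the number of nodes, which is acceptable for a small example like this one.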
An example of a median network is shown in Figure 9.3. Let M be a condensed multiple alignment of binary sequences on X. Definition 9.4.1 suggests the following naive algorithm for computing the median network N for M: First compute the median closure $\bar{M}$. Then construct the graph N that has node set $\bar{M}$ and in which any two nodes are connected by an edge e of color $\sigma(e) = i$ if and only if they differ exactly in their i-th position (as haplotypes). Alternatively, one can compute the
Quasi median networks
204 Chapter 9. Phylogenetic Networks from Sequences