Quick viewing(Text Mode)

Introduction to Phylogenetic Networks

Ashort(?)introductiontophylogeneticnetworks

Celine´ Scornavacca

ISE-M, Equipe Phylog´enie & Evolution Mol´eculaires Montpellier, France

1/43 Rooted trees ...

... are oriented connected and acyclic graphs, where terminal nodes are associated to a set of species:

the leaves or taxa represent extant organisms internal nodes represent TIME hypothetical ancestors each internal node represents the lowest common ancestor of Bos Homo Pan Pongo Macaca Mus all taxa below it () the only node without ancestor is called root

2/43 Used, among other things, to estimate species trees. We usually use several families...

Gene trees

Gene trees are built by analyzing a gene family,i.e.,homologous molecular sequences appearing in the of di↵erent organisms.

G1

Mouse Dog Bat Rat

Mouse Dog Rat Bat

3/43 We usually use several gene families...

Gene trees

Gene trees are built by analyzing a gene family.

Mouse Rat Dog Bat Mouse Dog Rat Bat Mouse Dog Rat Bat

Used, among other things, to estimate species trees.

3/43 We usually use several gene families...

Gene trees

Gene trees are built by analyzing a gene family.

Mouse Rat Dog Bat Mouse Dog Rat Bat Mouse Dog Rat Bat

Used, among other things, to estimate species trees.

3/43 We usually use several gene families...

Gene trees

Gene trees are built by analyzing a gene family.

Mouse Rat Dog Bat Mouse Dog Rat Bat Mouse Dog Rat Bat

Used, among other things, to estimate species trees. Gene trees can significantly di↵er from the species tree for: methodological reasons biological reasons

3/43 Gene trees

Gene trees are built by analyzing a gene family.

Mouse Rat Dog Bat Mouse Dog Rat Bat Mouse Dog Rat Bat

Used, among other things, to estimate species trees. Gene trees can significantly di↵er from the species tree for: methodological reasons biological reasons We usually use several gene families...

3/43 Gene trees

Gene trees are built by analyzing a gene family.

Mouse Rat Dog Bat Mouse Dog Rat Bat Mouse Dog Rat Bat

Used, among other things, to estimate species trees. Gene trees can significantly di↵er from the species tree for: methodological reasons biological reasons We usually use several gene families... http://sulab.org/2013/06/sequenced-genomes-per-year/

3/43 assembling trees

Supertree approach:

Reconstruction of phylogenies for multiple datasets

The two main classic approaches: Supermatrix approach: assembling primary data

4/43 Reconstruction of phylogenies for multiple datasets

The two main classic approaches: Supermatrix approach: assembling primary data

Supertree approach: assembling trees

4/43 An implicit assumption

The implicit assumption of using trees is that, at a macroevolutionary scale, each (current or extinct) species or gene only descends from one ancestor. Darwin described evolution as ”descent with modification”, a phrase that does not necessarily imply a tree representation...

5/43 A new approach: building phylogenic networks

Why do we need them? Due to reticulate evolutionary phenomena (hybridization, recombination, ) the evolution of a set of species sometimes cannot be described using phylogenetic trees.

6/43 A new approach: building phylogenic networks

Why do we need them? Due to reticulate evolutionary phenomena (hybridization, recombination, horizontal gene transfer) the evolution of a set of species sometimes cannot be described using phylogenetic trees.

6/43 A new approach: building phylogenic networks

Why do we need them? Due to reticulate evolutionary phenomena (hybridization, recombination, horizontal gene transfer) the evolution of a set of species sometimes cannot be described using phylogenetic trees.

6/43 A new approach: building phylogenic networks

Why do we need them? Due to reticulate evolutionary phenomena (hybridization, recombination, horizontal gene transfer) the evolution of a set of species sometimes cannot be described using phylogenetic trees.

6/43 The network of life

Doolittle, 1999

Daniel Huson 2010 6

7/43 Three di↵erent paradigms

We (want to) see only the tree

Doolittle, 1999

Daniel Huson 2010 6

8/43 Three di↵erent paradigms

It is a big mess, no chance to retrieve the past

Doolittle, 1999

8/43 Three di↵erent paradigms

There is an underlying tree structure, with some reticulate events

Doolittle, 1999

8/43 An example - a split network

J. Wagele and C. Mayer. Visualizing di↵erences in phylogenetic information content of alignments and distinction of three classes of long-branch e↵ects. BMC Evolutionary Biology, 7(1):147, 2007 9/43 An example - a reduced median network

Northeast Asia domestic pig Domestic pig in region MDYZ Domestic pig in South China Other Northeast Asia wild boar Wild boar in region MDYZ Wild boar in South China Feral pigs Domestic pig in region UMYR Domestic pig in the Mekong region Domestic pig in region URYZ Japanese domestic pig and ancient DNA Domestic pig in region DRYR Wild boar in the Mekong region Wild boar in region URYZ * Coalescent root type of haplogroup D1

D1g D1f D1h D1e

D1i D1d D1a1a-2 * D1a1a D1a1a-1

D1c2 D1c3

D1c1

100 D1a2 60

D1a3 20 D1b 10 5 One mutation 1

G.-S. Wu, Y.-G. Yao, K.-X. Qu, Z.-L. Ding, H. Li, M. Palanichamy, Z.-Y. Duan, N. Li, Y.-S. Chen, and Y.-P. Zhang. Population phylogenomic analysis of mitochondrial DNA in wild boars and domestic pigs revealed multiple domestication events in East Asia. Genome Biology, 8(11):R245, 2007 9/43 An example - a minimum spanning network

C. M. Miller-Butterworth, D. S. Jacobs, and E. H. Harley. Strong population sub- structure is correlated with morphology and ecology in a migratory bat. Nature, 424(6945):187-191, 2003 9/43 An example - a DTLR network

P.J. Planet, S.C. Kachlany, D.H. Fine, R. DeSalle, and D.H. Figurski. The wide spread colonization island of actinobacillus actinomycetemcomitans. Nature Genetics, 34:193–198, 2003. 9/43 An example - a recombination network 00000

1 3 10000 2 11000 00100 r 4 11100 00110 5 a b c d e 10000 11000 11100 00110 00111

Daniel H. Huson, Regula Rupp, Celine Scornavacca. Phylogenetic Networks. Cambridge University Press. 2011

9/43 Phylogenetic networks

Phylogenetic networks

In the presence of reticulate events, phylogenies are networks, not trees

The study of phylogenetic networks Doolittle is a new interdisciplinary field: Science maths, CS, biology… 1999 Phylogenetic networks

2008 2010 2011 2013 2014

10 / 43 Phylogenetic networks

Phylogenetic networks

In the presence of reticulate events, phylogenies are networks, not trees

The study of phylogenetic networks Doolittle is a new interdisciplinary field: Science maths, CS, biology… 1999 Phylogenetic networks

2008 2010 2011 2013 2014

10 / 43 A phylogenetic network ...

... is any connected graph, where terminal nodes are associated to a set of species.

Homo Pongo

Pan Macaca

11 / 43 A rooted phylogenetic network ...

... is any single-rooted directed acyclic graph, where terminal nodes are associated to a set of species.

TIME

Spear Mint Peppermint Water Mint

12 / 43 Phylogenetic networks

Phylogenetic networks

In the presence of reticulate events, phylogenies are networks, not trees

The study of phylogenetic networks Doolittle is a new interdisciplinary field: Science maths, CS, biology… 1999 Phylogenetic networks

2008 2010 2011 2013 2014

13 / 43 Abstract VS explicit phylogenetic networks

Split network: Hybridization network: Split network: Hybridization network:

ShowsShows conflictingconflicting ShowsShows putative putative placementplacement of of taxa taxa hybridizationhybridization history history Daniel Huson 2010 8 14 / 43 The plan of the survey

1 combinatorial and distance methods not accounting for ILS unrooted networks rooted networks (explicit or not) 2 methods accounting for ILS (always explicit)

15 / 43 Unrooted phylogenetic networks

16 / 43 User Manual for SplitsTree4 V4.12.8

Daniel H. Huson and David Bryant

November 8, 2012

Contents

Contents 17 / 43 1

1 Introduction 4

2 Getting Started 5

3 Obtaining and Installing the Program 5

4 Program Overview 5

5 Splits, Trees and Networks 7

6 Opening, Reading and Writing Files 9

7 Estimating Distances 9

8 Building and Processing Trees 10

9 Building and Drawing Networks 11

10 Main Window 11

1 Reconstruction of unrooted phylogenetic networks

from splits from distances (via splits or not) from trees (via splits) from sequences (via splits or not)

18 / 43 Splits 5.2. Splits 81 A split A B on is a partition of a set into two non-empty sets. | X X b {a} {b,c,d,e} | {b} {a,c,d,e} | a {c} {a,b,d,e} | c {d} {a,b,c,e} | {e} {a,b,c,d} | {a,b} {c,d,e} | e {a,b,e} {c,d} d | (a) Unrooted tree T (b) Split encoding of T

19 / 43 Figure 5.2 (a) An unrooted T on X {a,...,e}. (b) The seven splits represented by T . =

A B X . If the edges of the tree have lengths or weights, then these can be assigned to = the corresponding splits, as well. If T does not contain any unlabeled nodes of degree two, then any two different edges e and f always represent two different splits, that is, (e) (f ) must hold. The only T = T situation in which a phylogenetic tree can contain an unlabeled node of degree 2 is when it it has a root with outdegree two. In this case, the two edges e and f that originate at the root give rise to the same split (e) (f ), but to two complementary clusters. T = T A common construction to avoid this special case at the root is to attach an additional leaf to that is labeled by a special taxon o, which we call a (formal) . Then the root of a phylogenetic tree is specified as the node to which the leaf edge of o attaches. All phylogenetic tree that are discussed in this chapter are unrooted. However, by using this outgroup trick much of what we discuss concerning unrooted trees can also be adapted to rooted phylogenetic trees. Let T be an unrooted phylogenetic tree on X . We define the split encoding S (T ) to be the set of all splits represented by T , that is,

S (T ) {(e) e is an edge in T }. = | The term encoding is justified by the observation that the tree T can be uniquely recon- structed from S (T ), as we see in the next section. Figure 5.2 shows an unrooted phylogenetic tree that has seven edges and thus gives rise to seven different splits. We can determine all splits associated with any given unrooted phylogenetic tree T in quadratic time using the following algorithm:

Algorithm 5.2.2 (Splits from tree) The set S (T ) of all splits associated with an unrooted phylogenetic tree T on X can be computed as follows:

(i) Choose a start leaf and assume that all edges of T are directed away from . (ii) In a postorder traversal of T , for each node v compute the set L(v) of taxon labels that are encountered in the subtree rooted at v. L(v) (iii) For each edge e (u,v) of T , add the split (e) X L(v) to S (T ). = = Exercise 5.2.3 (Splits on a tree) Consider the following set of splits

S {a} , {b} , {c} , {d} , {e} , {a,b} , {a,e} . = {b,c,d,e} {a,c,d,e} {a,b,d,e} {a,b,c,e} {a,b,c,d} {c,d,e} {b,c,d} Compatible splits

Two splits are S = A B and S = A B are compatible, if one of 1 1| 1 2 2| 2 the A A , A B , B A or B B is empty. A set of splits is 1 \ 2 1 \ 2 1 \ 2 1 \ 2 S called compatible if all pairs of splits in are compatible. S Example a b, c, d, e { }|{ } a, b, d, e, h c, f , g b a, c, d, e { }|{ } { }|{ } a, c, d, e, g, h b, f c a, b, d, e { }|{ } { }|{ } a, c, e, g b, d, f , h 1 = d a, b, c, e 2 = { }|{ } S { }|{ } S a, c, g b, d, e, f , h e a, b, c, d { }|{ } { }|{ } a, c, e, f , g b, d, h a, b c, d, e { }|{ } { }|{ } a, e, h b, c, d, f , g a, b, e c, d { }|{ } { }|{ }

20 / 43 Compatible splits

A set of compatible splits corresponds univocally to a unrooted phylogenetic tree.

5.2. Splits 81

b {a} {b,c,d,e} | {b} {a,c,d,e} | a {c} {a,b,d,e} | c {d} {a,b,c,e} | {e} {a,b,c,d} | {a,b} {c,d,e} | e {a,b,e} {c,d} d | (a) Unrooted tree T (b) Split encoding of T

Figure 5.2 (a) An unrooted phylogenetic tree T on X {a,...,e}. (b) The seven splits represented by T . = 21 / 43

A B X . If the edges of the tree have lengths or weights, then these can be assigned to = the corresponding splits, as well. If T does not contain any unlabeled nodes of degree two, then any two different edges e and f always represent two different splits, that is, (e) (f ) must hold. The only T = T situation in which a phylogenetic tree can contain an unlabeled node of degree 2 is when it it has a root with outdegree two. In this case, the two edges e and f that originate at the root give rise to the same split (e) (f ), but to two complementary clusters. T = T A common construction to avoid this special case at the root is to attach an additional leaf to that is labeled by a special taxon o, which we call a (formal) outgroup. Then the root of a phylogenetic tree is specified as the node to which the leaf edge of o attaches. All phylogenetic tree that are discussed in this chapter are unrooted. However, by using this outgroup trick much of what we discuss concerning unrooted trees can also be adapted to rooted phylogenetic trees. Let T be an unrooted phylogenetic tree on X . We define the split encoding S (T ) to be the set of all splits represented by T , that is,

S (T ) {(e) e is an edge in T }. = | The term encoding is justified by the observation that the tree T can be uniquely recon- structed from S (T ), as we see in the next section. Figure 5.2 shows an unrooted phylogenetic tree that has seven edges and thus gives rise to seven different splits. We can determine all splits associated with any given unrooted phylogenetic tree T in quadratic time using the following algorithm:

Algorithm 5.2.2 (Splits from tree) The set S (T ) of all splits associated with an unrooted phylogenetic tree T on X can be computed as follows:

(i) Choose a start leaf and assume that all edges of T are directed away from . (ii) In a postorder traversal of T , for each node v compute the set L(v) of taxon labels that are encountered in the subtree rooted at v. L(v) (iii) For each edge e (u,v) of T , add the split (e) X L(v) to S (T ). = = Exercise 5.2.3 (Splits on a tree) Consider the following set of splits

S {a} , {b} , {c} , {d} , {e} , {a,b} , {a,e} . = {b,c,d,e} {a,c,d,e} {a,b,d,e} {a,b,c,e} {a,b,c,d} {c,d,e} {b,c,d} Circular splits

A set of splits on is called circular,ifthereexistsalinearordering S X ⇡ =(x ,...,x ) of the elements of for such that each split S 1 n X S 2S is interval-realizable,thatis,hastheform S = x , x ,...,x ( x , x ,...,x ), { p p+1 q}| X\{p p+1 q} for appropriately chosen 1 < p q n.   Example

a, b, d, e, h c, f , g b { }|{ } 6 1 a, c, d, e, g, h b, f d b d h f { }|{ } h f a, c, e, g b, d, f , h { }|{ } 4 a, c, g b, d, e, f , h e c { }|{ } e a, c, e, f , g b, d, h a g { }|{ } a g, c a, e, h b, c, d, f , g { }|{ }

22 / 43 Circular splits

A set of circular splits corresponds to a unrooted network that is outer-labeled planar.

b a, b, d, e, h c, f , g { }|{ } d a, c, d, e, g, h b, f h f { }|{ } a, c, e, g b, d, f , h { }|{ } a, c, g b, d, e, f , h e { }|{ } a, c, e, f , g b, d, h a g, c { }|{ } a, e, h b, c, d, f , g { }|{ }

(a) Planar network (b) Circular splits

23 / 43 Weakly compatible splits

Three splits S = A1 , S = A2 ,andS = A3 weakly compatible,if 1 B1 2 B2 3 B3 1 at least one of the following four intersections is empty: A A A , A B B , B A B and B B A , 1 \ 2 \ 3 1 \ 2 \ 3 1 \ 2 \ 3 1 \ 2 \ 3 2 at least one of the following four intersections is empty: B B B , B A A , A B A and A A B . 1 \ 2 \ 3 1 \ 2 \ 3 1 \ 2 \ 3 1 \ 2 \ 3 A set of splits on is called weakly compatible,ifanythree distinct S X splits in are weakly compatible. S Example a, b, d, e, h c, f , g a, b, d, e, h c, f , g {a, c, d, e, g}|{, h b, f } {a, c, d, e, g}|{, h b, f } {a, c, e, g b}|{, d, f , h} {a, c, e, g b}|{, d, f , h} 1 = { }|{ } 2 = { }|{ } S a, c, g b, d, e, f , h S a, c, d, e b, f , g {a, c, e,}|{f , g b, d, h} {a, b c}|{, d, e, f , g} {a, e, h b}|{, c, d, f , g} {a, e,}|{f b, c, d, g} { }|{ } { }|{ }

24 / 43 Weakly compatible splits

Phylogenetic networks reconstructed from weakly compatible are easier than the ones reconstructed from generic splits

25 / 43 UPN from splits or“what to do with the splits?”

26 / 43 PN from splits: the Convex hull algorithm

A We start with the start tree and we add a split S = B as follows: 1 Compute the two convex hulls H(A)andH(B)inN and let M be the graph induced by the nodes in H(A) H(B). \ 2 Create a copy M0 of M and denote v 0 and e0 the copies of a node v and an edge e in M. 3 Substitute any edge f =(u, v)whereu in H(B) H(A) = and v in \ 6 ; M with edge f =(u, v 0).

4 Connect each pair of nodes v in M and v 0 in M0 by a new edge.

I = A1 = {a,b,c,d} I A2 {a,b,g,h} S1 B {e,f ,g,h} S2 = B = {c,d,e,f } 1 c b c2 b c b c b d a d a d a d a

h h e e e h e h f g f g f g f g I I (a) Network N0 (b) Hulls for S1 (c) Network N1 (d) Hulls for S2 cbcb c b d a d a d a

e h e h e h f g f g f g I (e) Network N2 (f) Hulls for S3 (g) Network N3 SI = A3 = {a,c,f } 3 B3 {b,d,e,g,h} 27 / 43 PN from splits: the circular network algorithm

x ,...,x { p q } We start with the start tree and we add a split S = x ,...,x as follows: X\{ p q } (splits have to be considered in a certain order)

1 Determine the path M(xp, xq) and let M˙ denote the path obtained by removing the first and last (leaf) edges from M(xp, xq).

2 Create a copy M˙ 0 of M˙ and denote v 0 and e0 the copies of a node v and an edge e in M˙ .

3 Substitute any edge f =(u, v)whereu = (xi )andv in M˙ with edge f =(u, v 0), for all i = p,...,q.

4 Connect each pair of nodes v in M˙ and v 0 in M˙ 0 by a new edge.

cb cb d a d a

e e h h f g f g

(a) Network N2 (b) Network N3 A {a,f ,g,h} B = {b,c,d,e}

28 / 43 PN from splits: attention!!!

All four di↵erent split networks shown below represent the same set of splits.

e f e f e f e f

d a d a d a d a

c b c b c b c b

29 / 43 UPN from distances or“how to get the splits from distances”

30 / 43 PN from distances: the split decomposition

Given a distance matrix D on = x1,...,xn the split decomposition algorithm [Bandelt and Dress,X 1992]{ starts by computing} the isolation index for quartets and splits: w y

for any four taxa w, x, y and z with w, x y, z = ,: x z w,x 1 { }\{ } ; ↵ˆD { y,z } = 2 (max d(w, x)+d(y, z), d(w, y)+d(x, z), d(w, z)+d(x, y) d(w, x) d(y, z)). { } { } w,x for any (partial) split S: ↵D (S)=min ↵ˆD { y,z } w, x A, y, z B 0. { { } | 2 2 } A B

Then, we set X0 = and 0 = . Given the set of splits i on the first i taxa, we obtain ; by,S for each; split A doing: S Si+1 B 2Si A x 1 [{ i+1} Consider S = B . If ↵D (S) > 0, set !(S)=↵D (S)andaddS to i+1. S A 2 Xi Do the same with S = B x and S = x [{ i+1} { i+1} The result is given by . Sn

31 / 43 PN from distances: the split decomposition

AsplitS whose isolation index ↵D (S) is greater than 0 is called a D-split. D-splits are always weakly compatible. It follows from this that the split decomposition always computes a set of weakly compatible splits

32 / 43 PN from distances: the split decomposition

AsplitS whose isolation index ↵D (S) is greater than 0 is called a D-split. D-splits are always weakly compatible. It follows from this that the split decomposition always computes a set of weakly compatible splits The SD is a conservative method It can be used for small number of taxa or low divergence

0.01

WE_LA

SK

BS_LD JC

HG45 Gg1 BS_LC

Gg2

HG7 HT HG2 HG16 HG1 HG65 HG184 HG108HG25 HG179 HG72 Gg3 HG40 HY HG169, HG146, HG140, HG114, HG113, HG110, HG103, HG95, HG91, HG88, HG85, HG75, HG61, HG56, HG49, HG48, HG4, HG20 HG157 NG

KM

BD_LB BE_LC

32 / 43 PN from distances: Neighbor-Net Given a distance matrix D on ,theNeighbor-Net algorithm [Bryant and Moulton, 2004] computesX a circular ordering ⇡ of from D and then a set of weighted splits that are X interval-realizable with respect to ⇡: S produces circular splits uses together with circular network algorithm to get planar networks can be used for large number of taxa and high divergence

0.01

0.01 WE_LA

HG16 HG7 HG110 HG140 HG56HG48 HG65 HG2 HG113 HG91 HG49 HG114 HG25 HG20 HG75 HG103 HG88

HG146 HG40 Gg2 HG61HG85 Gg1 HG179 HG184 HG95 HG169 HG4 Gg3 HG45 HG108 HG72 HG1 HG157 NG

HT

SK BE_LC

BD_LB

BS_LD JC HY

KM

HG45 Gg1 BS_LC JC

Gg2 SK BS_LC HG7 HT HG2 HG16 HG1 HG65 HG184 HG108HG25 BS_LD HG179 HG72 Gg3 HG40 HY HG169, HG146, HG140, HG114, HG113, HG110, HG103, HG95, HG91, HG88, HG85, HG75, HG61, HG56, HG49, HG48, HG4, HG20 HG157 NG WE_LA KM

BD_LB BE_LC

33 / 43 PN from distances

Other algorithms from distances: Minimum spanning network T-Rex ... Agreatsourceofinformation: http://phylnet.univ-mlv.fr/

34 / 43 UPN from trees or“how to get splits from a bunch of trees”

35 / 43 PN from trees: Consensus split networks

Consensus splits [Holland et al, 2004] Input: Trees on identical taxon sets Determine splits in more than X% of trees

For >24050%, the result is compatibleChapter 11. Phylogenetic Networks from Trees

f d f c f c e a e a e b

c b b d d a

(a) Tree T1 (b) Tree T2 (c) Tree T3 f c f b f b e a e a e a

d b c d d c

(d) Tree T4 (e) Tree T5 (f) Tree T6

f f f f b c c b e e e e b a a a d d c d c b d a (g) Majority (h) d 2 (i) d 5 (j) All splits = =

Figure 11.1 (a)– (f) Six different phylogenetic trees T1,...,T6 on 36 / 43 X {a,b,c,d,e, f }. (g) Their majority consensus tree and (h) their consensus = 1 split network for d 2, representing all splits that are present in more than 3 of the trees. Note that= in this case the network is still a tree, but more resolved than the majority consensus tree. (i) The consensus split network for d 5 and (j) the split network representing all splits present in the six trees. =

and p 0.30, respectively. Additionally, we show the majority consensus tree. In these = drawings, the length of an edge is scaled to represent the number of trees that support the corresponding split. This figure shows clearly that there is some disagreement among the gene trees as to where the outgroup taxon C. albicans attaches to the phylogeny, the two main choices being an attachment to either the leading to S. castelli or to S. kluyveri. There is some disagreement concerning how to arrange the other five taxa, the largest contention being whether S. kudriavzevii and S. bayanus are sister taxa. In the depicted consensus tree, the majority of gene trees place C. albicans as a sister of S. kluyveri and the information that a significant number of prefer an alternative placement is suppressed.

11.2 Consensus super split networks for unrooted trees The above approach to calculating consensus trees and consensus split networks requires that all input trees are on exactly the same set of taxa X . However, in practice it often occurs that some (or even all) of the input trees lack some of the members of the total taxon set X . For example, when studying multiple gene trees, particular gene sequences may be absent for certain taxa. In this section we discuss how to address this problem.

Let X be a set of taxa. A phylogenetic tree T on X is called a partial tree on X , if X PN from trees: Consensus super splits networks

Consensus super splits [Huson et al, 2004, Whitfield et al 2008]. Input: Trees on overlapping taxon sets Use the Z–closure to complete partial splits

Use the “distortion”11.2. Consensus super split values networks for unrooted to filter trees splits 243

P. radicata O. cabulica P. enysii P. exilis Physaria bellii O. pumila P. wallii Arabis pauciflora P. fastigiata P. stellata Physaria gracilis Arabis glabra P. stellata P. fastigiata Crucihimalaya P. exilis P. latisilique O. pumila Crucihimalaya O. cabulica A. thaliana A. halleri camelina Neslia Moricandia Neslia Cardaminopsis Aethionema camelina A. neglecta Physaria bellii A. lyrata Mathiola A. thaliana Physaria gracilis Arabis alpina Mathiola Arabis pauciflora Cardaminopsis Moricandia Arabis glabra A. lyrata Arabis alpina A. halleri A. neglecta

(a) Tree T1 (b) Tree T2

P. wallii P. fastigiata P. fastigiata P. exilis P. nouvaezelandia Crucihimalaya Arabis glabra Arabis glabra P. exilis P. stellata Arabis pauciflora Arabis pauciflora camelina P. stellata Neslia Physaria gracilis Moricandia Neslia A. thaliana A. thaliana A. halleri Cardaminopsis A. neglecta A. lyrata Cardaminopsis A. neglecta A. halleri

(c) Tree T3 (d) Tree T4

A. lyrata Arabis glabra A. neglecta P. stellata O. cabulica Arabis pauciflora Crucihimalaya O. pumila Cardaminopsis A. lyrata P. fastigiata O. pumila P. enysii P. latisilique A. halleri O. cabulica Aethionema P. nouvaezelandia A. thaliana P. exilis camelina Moricandia camelina P. radicata Neslia Neslia P. wallii Arabis pauciflora A. halleri Crucihimalaya Physaria gracilis Arabis glabra A. neglecta Physaria bellii Physaria bellii Arabis alpina A. thaliana Physaria gracilis Mathiola P. exilis P. stellata Moricandia Aethionema

(e) Tree T5 (f) Super network N 37 / 43

Figure 11.3 (a)–(e) Five partial gene trees T1,...,T5 on 13–25 plant taxa. (f) The corresponding super split network N on all 26 taxa, computed using the Z-closure method. The edges in N are scaled to represent the number of input trees that contain the edge. The network N shows that the placement of the pair of taxa Physaria bellii and Physaria gracilis differs in the five trees.

The split network used to represent S¯p (T ) for any value of p from 0 to 1 is called a super split network.

We conjecture that the set of strict consensus splits S¯strict (T ) is always compatible and thus representable by a phylogenetic tree. If this is true, then this construction provides a

super tree method. However, the set of majority consensus splits S¯majority(T ) is not neces-

sarily compatible, in general. In fact, the latter statement holds for S¯p (T ) with any choice of p between 0 and 1. The Z-closure

Two partial splits S = A1 and S = A2 are said to be in 1 B1 2 B2 Z-relation to each other, if2 exactlyS one of the2 fourS intersections A A , 1 \ 2 A1 B2, B1 A2 or B1 B2 is empty. Then we can create of two new splits\ (the Z-operation\ ) \

A1 A1 A2 S 0 = and S 0 = [ . 1 B B 2 B 1 [ 2 2 If at least one of the two new splits contains more taxa than its predecessor, the pair of splits is called productive.

From a set partial splits on , Z-closure method infers a set of complete splits on as follows: WhileS Xcontains a productive pair of splits S , S , X S { i j } apply the Z-operation to obtain two new splits Si0, Sj0 and then replace the former pair by the latter pair in . Finally, add{ all trivial} splits on . S X

38 / 43 UPN from sequences

39 / 43 Median9.4. Median networks networks 197 a 01010 b 10001 a 01010 = For a multiple alignmentbM10001of binary sequences on ,itsmedian = c 11100 m(a,b,c) 11000 X network is a phylogenetic= network N =(V , E,,) whose node set is m(a,b,c) 11000 ¯ given by the median closure= V = M and inc which 11100 any two nodes a and b are connected by(a) an Median edge calculation e of color (b)(e Parsimonious)=i E tree, if any only if they di↵er in exactly in their i-th position (as haplotypes).2 An associated Figure 9.2 (a) The median m(a,b,c) of three binary sequences a, b and c. The taxon labelingmedian m(a:,bX,c) minimizesV maps the parsimony each taxon score ofx theonto unrooted the phylogenetic node (x)that representstree the on {a corresponding,b,c} shown! in (b), which sequence. in this case is five.

a 0000000 2

0100000 a 0000000 b 0110000 5 b 0110000 0100100 c 1101100 0110100 1 d 1110110 7 1100100 e 0110101 e 0110101 3 4 1110100

6 c 1101100 d 1110110 (a) Alignment M (b) Median network N

Figure 9.3 (a) A condensed multiple alignment M of binary sequences on X {a,...,e}. (b) The median network N that represents M. The nodes are labeled= by haplotypes and the edges are labeled by the corresponding columns in M. 40 / 43

Let M be a multiple alignment of binary sequences on X . The median closure of M is defined as the set M¯ of all binary sequences that can be obtained by repeatedly taking the median of any three sequences in the set and adding it to the set, until no new sequences can be produced. The median network is defined as follows:

Definition 9.4.1 (Median network) Let M be a condensed multiple alignment of binary sequences on X .Themedian network associated with M is a phylogenetic network N = (V,E,,) whose node set is given by the median closure V M¯ and in which any two = nodes a and b are connected by an edge e of color (e) i in E, if any only if they differ in = exactly in their i-th position (as haplotypes). An associated taxon labeling : X V maps each taxon x onto the node (x) that represents the corresponding sequence.

An example of a median network is shown in Figure 9.3. Let M be a condensed multiple alignment of binary sequences on X . Definition 9.4.1 suggests the following naive algorithm for computing the median network N for M: First compute the median closure M¯ . Then construct the graph N that has node set M¯ and in which any two nodes are connected by an edge e of color (e) i, if any only if they = differ exactly in their i th position (as haplotypes). Alternatively, one can compute the Quasi median networks

204 Chapter 9. Phylogenetic Networks from Sequences

•• ------a AAAAA a 000000000 a 0000000 b BBAAA b 110000000 b 1100000 c ABABB c 010001110 c 0100011 d AABBC d 001101101 d 0011010 e AACBC e 001011101 e 0010110

(a) Input M (b) Binary expansion M1 (c) Condensed M1

•• ------0000000 000000000 202AAAAA Chapter 9. Phylogenetic Networks from Sequences 1100000 110000000 BBAAA 0100011 010001110 ABABBA0 A000 A0000 A00000 0011010 001101101 AACBCB1 B110 B1100 B11000 0010110 001011101 AABBC C101 C1010 C10100 0010010 001001101 AA* BC D1001 D10010 0000010 000001100 AAAB* E10001 0100000 010000000 ABAAA(a) 2 states (b) 3 states (c) 4 states (d) 5 states 0100010 010001100 ABAB* (d) Median closure M2 (e) Expanded M2 (f) Multi-states M3 Figure 9.9 The binary expansion M1 of a condensed multiple alignment M of DNA sequences is obtained by replacing each column that contains 2, 3, 4 or 5 AAAAA edifferentAACBC states by a group of 1 to 5 binary columns, converting each state A–E BBAAA d AABBC AAABC into a corresponding binary pattern, as indicated in (a) to (d), respectively. ABABB AA*BC AABBC AAABC = AACBC ABABC AACBC AABBC AAABB AAABA c ABABB AAABC AAAB* AAABB AAABA = AAABA ofABABA binary columns, to then compute the median network for the alignment of binary se- AAABC AAABB quences, anda AAAAA then finally to convert back to multi-state characters, being careful to ex- ABABA ABAAA ABAAA pand each encountered “virtual median” into a set of sequences that display all appro- ABAB* ABABB = ABABA priate multi-state characters. ABABC b ABABC BBAAA In more detail, assume that we are given a condensed multiple alignment M of DNA (g) Expansion of (h) Final matrix M4 (i) Quasi-median network N virtual medians sequences on X . For ease of exposition, assume that the characters in M are labeled A, B, , E, representing the four DNA bases and the gap character. ··· Figure 9.10 For the condensed multiple sequence alignment M ofThe multi-state first step is to compute the binary expansion M1 of M in which every column of M characters in (a) we first compute the binary expansion M1 (b).is We represented then generate by a group of binary columns in M1. For each column C of M we compute the median closure M M¯ , first collapsing all identical columns in M (c) 2 1 the corresponding1 group of columns in M1 as follows: If the41/ column 43 C contains only two ¯ = before computing M1 (d). We record which columns are collapseddifferent together states, then the corresponding group of columns in M contains only one col- (marked and in (b)) and then expand them all again to obtain M (e). We then 1 • umn,2 which is obtained from C by changing all occurrences of A by 0 and all occurrences compute the multi-state representation M3 for M2 (f). Finally, we obtain the of B by 1. If the number of different states contained in the column C is d 2, then the matrix M4 by expanding all virtual medians in M3 (g) and then removing any > duplicate copies (h). In (i) we show the quasi-median network Ncolumnfor M. C gives rise to a group of d binary columns, in which each of the letters A – E is represented by a specific binary pattern as prescribed in Figure 9.9.

The third step is to compute the median closure M2 M¯ 1. For computational reasons, some original input multiple sequence alignment M0. To take this into account, whenever = one should first collapse all identical columns in M before computing M¯ and then ex- we compare two sequences in M we must weight each position in M by the number of 1 1 pand them all again to obtain M2. original positions in M0 that it represents. This gives rise to a distance matrix D in which the distance d(a,b) between any two sequences a and b in M is givenThe fourth by the step sum is of to convert M2 back to a representation M3 of multi-state data. To weights of all positions at which a and b differ. do this, each group of columns is converted back to a single column using the tables Consider the graph G (V,E,) that has node set V X andshown whose edge in Figure set E 9.9.con- Note that a group of three or more columns may contain binary = = patterns that are not listed in the tables. Such binary patterns are produced by the median operation and we refer to them as virtual medians, as they are not the medians of binary characters but rather of our binary encoding of multi-state characters. Virtual medians are represented in M3 by the state ‘*’. By definition, a virtual median is obtained as the median of three different binary pat- terns that represent three different multi-state characters, according to the tables shown in Figure 9.9. (Please convince yourself that the binary patterns listed in each of the ta- bles have the property that the median of any three different ones is not contained in the table.)

In the fifth step we obtain a new multiple alignment M4 of multi-state characters from M3 by expanding each sequence M3 that contains a virtual median * into a set of new How to keep the complexity of the network down...

The number of nodes of the quasi-median network can be very large, even for a small number of short sequences. Thus, the quasi-median network is rarely useful in practice. There exist two alternative methods: median-joining algorithm, which aims at computing an UPN that is as informative as a quasi-median network, but usually much smaller. The algorithm has a parameter that is used to control how complex the resulting phylogenetic network will be. geodesically-pruned quasi-median networks: a method that aims at computing a pruned version of the full quasi-median network by considering only those sequences that lie on a geodesic between two of the original input sequences.

42 / 43 How to keep the complexity of the network down...

UPN from ... quartets ... QNet http://www2.cmp.uea.ac.uk/~vlm/qnet/ http://phylnet.univ-mlv.fr/

42 / 43 Recombination networks 00000

1 3 10000 2 11000 00100 r 4 11100 00110 5 a b c d e 10000 11000 11100 00110 00111

Daniel H. Huson, Regula Rupp, Celine Scornavacca. Phylogenetic Networks. Cambridge University Press. 2011

43 / 43 Methods for reconstructing rooted phylogenetic networks not accounting for ILS

some slides have been kindly provided by Fabio Pardi Trees displayed by a network

Trees displayed by a network

In a phylogenetic network, a reticulate event is represented as a reticulation, where branches converge to give rise to a new lineage:

a b c d e f Trees displayed by a network

Trees displayed by a network

In a phylogenetic network, a reticulate event is represented as a reticulation, where branches converge to give rise to a new lineage:

The genome at the start of the new lineage is a composition of those of the parent lineages.

a b c d e f Trees displayed by a network

Trees displayed by a network

In a phylogenetic network, a reticulate event is represented as a reticulation, where branches converge to give rise to a new lineage:

The genome at the start of the new lineage is a composition of those of the parent lineages.

The evolution of each part independently inherited is described by a “gene” tree

a b c d e f Trees displayed by a network

Trees displayed by a network

In a phylogenetic network, a reticulate event is represented as a reticulation, where branches converge to give rise to a new lineage:

The genome at the start of the new lineage is a composition of those of the parent lineages.

The evolution of each part independently inherited is described by a “gene” tree

a b c d e f

a b c d e f Trees displayed by a network

Trees displayed by a network

In a phylogenetic network, a reticulate event is represented as a reticulation, where branches converge to give rise to a new lineage:

The genome at the start of the new lineage is a composition of those of the parent lineages.

The evolution of each part independently inherited is described by a “gene” tree

a b c d e f

a b c d e f a b c d e f Trees displayed by a network

Trees displayed by a network

In a phylogenetic network, a reticulate event is represented as a reticulation, where branches converge to give rise to a new lineage:

The genome at the start of the new lineage is a composition of those of the parent lineages.

The evolution of each part independently inherited is described by a “gene” tree

In the absence of deep a b c d e f coalescence and allopolyploidy, the gene trees are displayed by the network

a b c d e f a b c d e f Trees displayed by a network

Trees displayed by a network

a b c def a b c def Switch on and off reticulated edged

a b c def a b c def Trees displayed by a network

Trees displayed by a network

Delete switched off a aa a b c def bb cc dedeff edges bandc unlabelleddef leaves and suppress outdgree-1 indegree-1 nodes

a b c def aa bb cc dedeff a b c def Trees displayed by a network

Trees displayed by a network

a a b c def b c def 2r possible trees!!!

a b c def a b c def Phylogene5c networks

Phylogene5c network inference

An op5miza5on problem where a candidate N network is evaluated on the basis of how well the trees it displays fit the data: a b c d e f

Many possible formula1ons: a b c d e f a b c d e f

Data: ... Trees with 3 taxa: (inferred from other data) a b c c f a d e f d f b a c d Goal: Find the network N with the lower hybridiza5on number such that the triplets are `consistent’ with one of the trees displayed by N subject to constraints on the complexity of N Triplets

Triplets - Software

• LEV1ATHAN: A practical algorithm for reconstructing level-1 phylogenetic networks. Combines any set of phylogenetic trees into a level-1 phylogenetic network that is consistent with a large number of the triplets of the input trees.

• SIMPLISTIC: Returns a phylogenetic network with minimum level consistent with all input triplets

• MARLON: Constructs a level-1 phylogenetic networks with a minimum number of reticulations consistent with a dense set of triplets, if such a network exists

• LEVEL2: Constructs a level-2 phylogenetic network consistent with a dense set of triplets, if such a network exists

Phylogene5c networks

Phylogene5c network inference

An op5miza5on problem where a candidate N network is evaluated on the basis of how well the trees it displays fit the data: a b c d e f

Many possible formula1ons: a b c d e f a b c d e f

Data: Clusters of taxa: a, b , d, e , d, e, f , a, b, c, d, e, f , e, f , c, d, e, f ,... { } { } { } { } { } { } Goal: Find the network N with the lower hybridiza5on number such that the input clusters are `explained’ by one of the trees displayed by N subject to constraints on the complexity of N 3 P-2 Methodologie

Cette section pr´esente une nouvelle approche pour reconstruire un r´eseau phylog´en´etique `apartir d’un ensemble 3 de s´equencesmol´eculaires. Un sch´ema synth´etique de l’approche propos´ee est pr´esent´een Figure 3, ainsi qu’une P-2br`eve Methodologie description de chaque ´etape.

données génomiques

Cette section pr´esentedétection une homologie nouvelle approche pour reconstruire un r´eseau phylog´en´etique `apartir d’un ensemble ETAPE 0 + de s´equencesmol´eculaires.alignement Un sch´ema synth´etique de l’approche propos´ee est pr´esent´een Figure 3, ainsi qu’une jeux de données homologues br`eve description de chaque ´etape. Dans ce projet, nous ne nous int´eressons pas `ala d´etection de l’homo- J1 J2 ... Jk logie et `al’alignement de s´equences, nous supposons l’´etape 0 r´esolue.

groupement de certains ETAPE 1 données Chaque jeu de donn´ees est appel´e famille de g`enes. L’´etape 1 consiste jeux de données génomiques (clustering) `adiviser l’ensemble de toutes les familles de g`enes `anotre disposition clusters de données détection homologie ETAPE 0 en sous-probl`emes (clusters). Chaque cluster est compos´ede familles C1 C2 ... Ck' + Clusters alignement de g`enesqui ont une histoire ´evolutive similaire et qui contiennent Clustersjeux de données homologues susamment d’information pour chaque esp`ece. reconstruction d'un arbre ETAPE 2 daté pour chaque Dans ce projet, nous ne nous int´eressons pas `ala d´etection de l’homo- CASS Jalgorithm1 J 2: search... for theJk level-k network containing a set of Chaque sous-probl`emeest alors analys´eavec une approche de super- clusters cluster de données 20 50 33 matrice qui permet de reconstruire une phylog´eniedat´ee (´etape 2). logie et `al’alignement de s´equences, nous supposons l’´etape 0 r´esolue. ... 15 25 groupement20 de certains Ensuite, ces arbres sources sont combin´esdans un super-arbre dat´e 10 Chaque jeu de donn´ees est appel´e famille de g`enes. L’´etape 1 consiste ETAPE12 1 43 3 5 8 3 2 jeux de 6données (´etape 3) contenant l’ensemble des esp`eces ´etudi´ees. Des m´ethodes (clustering) `adiviser l’ensemble de toutes les familles de g`enes `anotre disposition clusters de données de r´econciliation d’arbres sont ensuite utilis´ees pour comparer le reconstruction ETAPE 3 en sous-probl`emes (clusters). Chaque cluster est compos´ede familles d'un super-arbre super-arbre dat´eet chaque arbre source afin de d´etecter des macro- C1 C2 ... Cdaték' 50 ´ev`enementsde g`enesqui de r´eticulation(e.g. ont une histoire un ´echange ´evolutive d’un similaire grand nombre et qui de contiennent g`enesentresusamment deux esp`eces, d’information des ´ev`enements pour chaque d’hybridation, esp`ece. etc.). Enfin, reconstruction d'un arbre ETAPE 2 33 43 15 20 12 25 daté pour chaque le super-arbreChaque sous-probl`emeest est modifi´epour incorporer alors analys´eavec ces ´ev`enements une : on approche obtient de super- 2 6 cluster de données 20 50 33 ainsimatrice un super-r´eseau qui permet qui explique de reconstruire l’´evolution desune s´equences phylog´eniedat´ee de d´epart. (´etape 2). reconstruction ETAPE 4 ... 15 25 d'un super-réseau20 Ensuite, ces arbres sources sont combin´esdans un super-arbre dat´e 10 12 43 3 5 8 3 2 6 (´etape 3) contenant l’ensemble des esp`eces ´etudi´ees. Des m´ethodes de r´econciliation d’arbres sont ensuite utilis´ees pour comparer le reconstruction ETAPE 3 d'un super-arbre super-arbre dat´eet chaque arbre source afin de d´etecter des macro- daté 50 ´ev`enements de r´eticulation(e.g. un ´echange d’un grand nombre de g`enesentre deux esp`eces, des ´ev`enements d’hybridation, etc.). Enfin, Figure 3 – Sch´emaet description synth´etique de l’approche propos´ee dans ce projet. 33 43 15 20 12 25 le super-arbre est modifi´epour incorporer ces ´ev`enements : on obtient 2 6 ainsi un super-r´eseau qui explique l’´evolution des s´equences de d´epart.

reconstruction ETAPE 4 d'un super-réseau P-2.1 Detail´ de l’etape´ 1

Dans la Section P-1, nous avons ´evoqu´edeux approches pour combiner des donn´ees provenant de plusieurs sources : l’approche de super-matrice et l’approche de super-abre. Nous allons bri`evement discuter les avantages et les inconv´enients de ces deux approches et expliquer comment elles peuvent ˆetrecombin´ees lors de l’analyse de grandes quantit´esFigure de donn´ees. 3 – Sch´emaet description synth´etique de l’approche propos´ee dans ce projet.

Approche de super-matrice. La premi`ere approche consiste `aconcat´enerles s´equences d’origine dans une seule grande matrice appel´ee super-matrice. Cette approche a l’avantage que toute information contenue dans chaque jeu de donn´eesest conserv´ee.Ceci est en accord avec l’approche total evidence [Klu89, SPH98] c’est-`a-dire le principe P-2.1suivant D lequeletail´ la meilleure de l’etape´ solution est 1 celle provenant de toutes les donn´ees disponibles. Cependant, cette approche comporte plusieurs limites. On discute ici les plus importantes dans le contexte d’une approche diviser pour r´egner. Tout d’abord, cette strat´egieest intenable pour l’assemblage des phylog´eniesde plus en plus importantes [SPH98]. En e↵et, quand seulement un petit nombre de sont communs entre les jeux de donn´ees, la plupart des entr´ees deDans la matrice la Section combin´ee P-1, sont nous des avons donn´eesmanquantes. ´evoqu´edeux approches Par exemple, pour une combiner des plus grandes des donn´ees super-matrices provenant jamais de plusieurs sourcesanalys´ee[DAB : l’approche+04], obtenue de super-matrice de la concat´enation et l’approche de 1131 de alignements super-abre. de prot´eineset Nous allons contenant bri`evement 469.497 discuter sites pour les avantages et70 les taxons, inconv´enients ´etaitcompos´ee de ces pour deux 92 approches % de donn´ees et manquantes. expliquer comment L’analyse elles d’une peuvent super-matrice ˆetrecombin´ees avec trop de lors donn´ees de l’analyse de grandesmanquantes quantit´es peut ˆetre, de donn´ees. dans certains cas, peu fiable [SPH98], notamment lorsque le signal concat´en´econtenu dans la super-matrice n’est pas assez fort. En outre, mˆeme si les jeux de donn´eescontiennent tous le mˆemeensemble Approche de super-matrice. La premi`ere approche consiste `aconcat´enerles s´equences d’origine dans une seule grande matrice appel´ee super-matrice. Cette approcheScornavacca a Celine l’avantage que toute information contenue dans chaque jeu de donn´eesest conserv´ee.Ceci est en accord avec l’approche total evidence [Klu89, SPH98] c’est-`a-dire le principe suivant lequel la meilleure solution est celle provenant de toutes les donn´ees disponibles. Cependant, cette approche comporte plusieurs limites. On discute ici les plus importantes dans le contexte d’une approche diviser pour r´egner. Tout d’abord, cette strat´egieest intenable pour l’assemblage des phylog´eniesde plus en plus importantes [SPH98]. En e↵et, quand seulement un petit nombre de taxons sont communs entre les jeux de donn´ees, la plupart des entr´ees de la matrice combin´ee sont des donn´eesmanquantes. Par exemple, une des plus grandes super-matrices jamais analys´ee[DAB+04], obtenue de la concat´enation de 1131 alignements de prot´eineset contenant 469.497 sites pour 70 taxons, ´etaitcompos´ee pour 92 % de donn´ees manquantes. L’analyse d’une super-matrice avec trop de donn´ees manquantes peut ˆetre, dans certains cas, peu fiable [SPH98], notamment lorsque le signal concat´en´econtenu dans la super-matrice n’est pas assez fort. En outre, mˆeme si les jeux de donn´eescontiennent tous le mˆemeensemble

Scornavacca Celine Phylogene5c networks

Phylogene5c network inference

An op5miza5on problem where a candidate N network is evaluated on the basis of how well the trees it displays fit the data: a b c d e f

Many possible formula1ons: a b c d e f a b c d e f

Data: ... Any trees on the same taxa: (inferred from other data) a c d e f c f a b d e f Goal: Find the network N with the lower hybridiza5on number such that the input trees are `consistent’ with one of the trees displayed by N subject to constraints on the complexity of N Reconstruction of hybridization networks

Software

ultraNet Phylogene5c networks

Phylogene5c network inference

An op5miza5on problem where a candidate N network is evaluated on the basis of how well the trees it displays fit the data: a b c d e f

Many possible formula1ons: a b c d e f a b c d e f

Data: ... Any trinets on the same taxa: (inferred from other data) Goal: Find the network N with the lower hybridiza5on number such that the input trees are `consistent’ with the N subject to constraints on the complexity of N Trinets

Trinets

TriLoNet (Trinet Level One Network) : constructs rooted level-1 phylogenetic networks from aligned DNA sequence data using a trinet-based approach

Oldman et al. TriLoNet: Piecing Together Small Networks to Reconstruct Reticulate Evolutionary Histories. 2016 Phylogenetic networks

Explicit phylogenetic networks (rDAG) v1 Show e putative e2 4 reticulated histories v2 e e v 3 v3 3' 4 γ 1-γ e5 e6 e9 v6 e7 e8 v5 v7 v8 v9 Phylogene5c networks

Phylogene5c network inference

An op5miza5on problem where a candidate N network is evaluated on the basis of how well the trees it displays fit the data: a b c d e f

(N): T

Many possible formula1ons: a b c d e f a b c d e f

Data: Sequence alignments: (typically given in blocks) ...

Goal: m Find N that minimizes F (N A1,A2,...,Am)= min F (T Ai) T (N) | | i=1 2T X subject to constraints on the complexity of N. F() is the parsimony score. Jin et al. Parsimony Score of Phylogene5c Networks: Hardness Results and a Linear-Time Heuris5c. TCCB. 2009. Phylogene5c networks

Phylogene5c network inference

An op5miza5on problem where a candidate N network is evaluated on the basis of how well the trees it displays fit the data: a b c d e f NEPAL Phylogene)c Networks Parsimony and Likelihood (N): Toolkit T

Many possible formula1ons: a b c d e f a b c d e f

Data: Sequence alignments: (typically given in blocks) ...

Goal: m Find N that minimizes F (N A1,A2,...,Am)= min F (T Ai) T (N) | | i=1 2T X subject to constraints on the complexity of N. F() is the parsimony score. Jin et al. Parsimony Score of Phylogene5c Networks: Hardness Results and a Linear-Time Heuris5c. TCCB. 2009. Phylogene5c networks

Phylogene5c network inference

An op5miza5on problem where a candidate N network is evaluated on the basis of how well the trees it displays fit the data: p 1-p a b c d e f

(N): T

Many possible formula1ons: a b c d e f a b c d e f

Data: Sequence alignments: (typically given in blocks) ...

Goal: m m Find N that maximises Pr(A ,A ,...,A N)= Pr(A N)= Pr(A T )Pr(T N) 1 2 m| i| 0 i| | 1 i=1 i=1 T (N) Y Y 2XT @ A Jin et al.Maximum likelihood of phylogene5c networks. Bioinforma5cs 2006. Phylogene5c networks

Phylogene5c network inference

An op5miza5on problem where a candidate N network is evaluated on the basis of how well the trees it displays fit the data: p 1-p a b c d e f NEPAL Phylogene)c Networks Parsimony and Likelihood (N): Toolkit T

Many possible formula1ons: a b c d e f a b c d e f

Data: Sequence alignments: (typically given in blocks) ...

Goal: m m Find N that maximises Pr(A ,A ,...,A N)= Pr(A N)= Pr(A T )Pr(T N) 1 2 m| i| 0 i| | 1 i=1 i=1 T (N) Y Y 2XT @ A Jin et al.Maximum likelihood of phylogene5c networks. Bioinforma5cs 2006. The strategy

A1 A2 A3 A4 INPUT DATA

rNNI, rSPR,... ? reticulation-0 networks rNNI, rSPR,... ? reticulation-1 networks ... ? rNNI, rSPR,.. reticulation-r networks Problems

Some issues

• Searching the space of phylogene5c networks The space of networks with k re5cula5ons is infinite. • Controlling for Model Complexity Because any network with k re5cula5ons provides a more complex model than any network with (k-1) re5cula5ons, we must handle the model selec5on problem (AIC, BIC, K-fold cross-valida5on, …). • Iden5fiability issues

m m Pr(A ,A ,...,A N)= Pr(A N)= Pr(A T )Pr(T N) 1 2 m| i| 0 i| | 1 i=1 i=1 T (N) Y Y 2XT @ A • Not accoun5ng for ILS and allopolyploidy Iden5fiability problems

Different networks can display the same trees

Some networks display exactly N1 N2 the same trees:

d d

a b c a b c

T1 T2 T3

d d d

abcab ca b c Iden5fiability problems

Different networks can display the same trees

Some networks display exactly N1 N2 the same trees:

d d

Because N1 and N2 display the same trees, they are a b c a b c equally good to any of the T T T inference methods we saw 1 2 3 – no ma8er the input data d d d

abcab ca b c

Data (Recall that a network is evaluated on the basis of how well the trees it displays fit the data) Iden5fiability problems

Different networks can display the same trees

Some networks display exactly N1 N2 the same trees:

d d

Because N1 and N2 display the same trees, they are a b c a b c equally good to any of the T T T inference methods we saw 1 2 3 – no ma8er the input data UNIDENTIFIABILITY d d d

abcab ca b c Iden5fiability problems

Indis5nguishable networks

Branch lengths do not .1 N1 .1 N2 eliminate non- .15 .15 .1 .1 iden5fiability… .1 .1 .1 The same hold for p 1-p .1 .12 .07 inheritance probabili5es .1 .2 .05 .1 .15 .15 .2 .15 .1 .05 .05 .03 .05 .05 .05 .05 a b c d e a b c d e

.15 .1 .1 .1 .1 .35 .35 .1 .2 .25 .2 .25 .2 .2 .2 .15 .15 .05 .05 .05 .05 .05 .05 .1 a b c d e a b c d e a d b c e

N1 and N2 display the same trees (i.e. including branch lengths) and are thus indis1nguishable even to methods accoun5ng for lengths Canonical networks

What it means for the evolu5onary biologist

If N is reconstructed by a "classic" inference method, then even assuming perfect and unlimited data, the best you can hope is that the true phylogene5c network is just one of the many that are indis5nguishable from N …

.1 .1 N N* .15 .15 .1 .1 .1 .1 .1 .1 .1 .2 .2 .05 .15 .2 .1 .2 .15 .1 .05 .15 .05 .05 .05 .05 a b c d e a b c d e

The canonical form of N is a unique representa5ve of the networks indis5nguishable from N, that excludes their unrecoverable aspects… Methods for reconstructing rooted phylogenetic networks not accounting for ILS The model

Deep coalescence (ILS)

(a) Population view (b) Reconciliation representation

Figure 1: Impact of incomplete lineage sorting on simple populations of 4 haploid individuals. The originating population contains a single blue allele for the considered gene. First, a mutation leads to a new green allele at this locus, then a first speciation takes place, rapidly followed by a second one. As the blue and green alleles still co-exist when the second speciation takes place, both alleles still have a chance to be fixed in the resulting child species B and C. For these species, the history of this gene will hence di↵er from the species history due to ILS.

and thus always returns an optimal time-consistent reconciliation. A detailed comparison with the models and algorithms of [21] and [18] is also provided.

2 Preliminaries

Given a tree T , its node set, branches, and leaf set are respectively denoted V (T ), E(T ), L(T ). The label of each leaf u is denoted by (u), while the set of labels of leaves of T is denoted by (T ). L If T is rooted,L we denote its root by r(T ). Given a node u V (T ), we denote its 2 parent by up, and the subtree of T rooted at u by Tu.Giventwonodesu and v of T ,we write u T v (u

3 The model

ILS in phylogentic networks

The true gene tree is not p1 p2 p1 p2 displayed by the network because it needs to use h h both edges entering the node a c dba c db (a) (b)

Yu et al. Maximum likelihood inference of reticulate evolutionary histories, 2014 The model

Allopolyploidy

The true gene tree is not p1 p2 displayedp by1 the network p2 because it needs to use h both edges entering h the hybrid node a c dba c db (a) (b) Parental trees

The multi-labelled tree U*(N)

N U*(N)

a c d a b c d b

a ac dbc dba c d a b c da b adc dc dc b dc b (a) (a) (b) (b) (c) (c) • nodes are the directed paths in N starting at r(N) • for each pair of paths p,p' in N, there is an edge in U*(N) from p to p' if and only if p=p'e for some edge e in N • each node in U*(N) corresponding to a path in N that starts at r(N) and ends at x in X is labelled by x Parental trees

Parental trees

a c d b

a c d b a c d b a c dba c d b a dc dc b (a) (b) (c)

a c d b a d c b

A phylogenetic tree T on X is a parental tree of N if it is displayed by U*(N) Huber et al. Folding and unfolding phylogenetic trees and networks, 2016 [weakly displayed] Zhu al. In the light of deep coalescence: revisiting trees within networks, 2016 Zhu and Degnan. Displayed trees do not determine distinguishability under the network multispecies coalescent, 2016 Parental trees

Parental trees

a c d b

a c d b a c d b a c dba c d b a dc dc b (a) (b) (c)

p1 p2 p1 p2 c a d b a d b c h h

a c dba c db Huber et al. Folding and unfolding phylogenetic trees and networks, 2016 [weakly displayed(a)] (b) Zhu al. In the light of deep coalescence: revisiting trees within networks, 2016 Zhu and Degnan. Displayed trees do not determine distinguishability under the network multispecies coalescent, 2016 Parental trees

Parental trees

a c d b

a c d b a c d b a c dba c d b a dc dc b (a) (b) (c)

p1 p2 p1 p2 a c d b a h d c b h

a c dba c db Huber et al. Folding and unfolding phylogenetic trees(a) and networks, 2016 [weakly displayed(b)] Zhu al. In the light of deep coalescence: revisiting trees within networks, 2016 Zhu and Degnan. Displayed trees do not determine distinguishability under the network multispecies coalescent, 2016 Parental trees

Parental trees can be multi-labelled

d

a b c a b d b b c (a) (b)

multiple individuals per species are allowed Scoring schemes based on parental trees (NMSC)

N γ-1 γ a b c d e f

a b c d e f a b c d e f Data: Sequence alignments: (typically given in blocks)

Goal: m m Find N that maximises Pr(A1,A2,...,Am N)= Pr(Ai N)= Pr(Ai T )Pr(T N) | | 0 | | 1 i=1 i=1 T (N) Y Y 2XT Yu et al. The Probability of a Gene Tree Topology within a Phylogenetic Network@ with Applications to A Hybridization Detection, 2012 Yu et al. Maximum likelihood inference of reticulate evolutionary histories, 2014 Wen el al. PLOS Genetics 2016 (Bayesian method)

3

on its consistency with collections of such data will not be able to distinguish between networks N1 and N2. This includes all the methods whose data consists of clusters of taxa (e.g., [34]), triples (e.g., [35]), quartets (e.g., [36]), or any trees (e.g., [37]). The same holds for many, sequence-based, maximum parsimony and maximum likelihood approaches proposed in recent papers. For maximum parsimony, a practical approach [2,29–31] is to consider that the input is partitioned in a number of alignments A1,A2,...,Am, each from a different non-recombining genomic region (possibly consisting of just one site each), and then take, for each of these alignments, the best parsimony score Ps(T Ai) among all those of the trees displayed by a network N. The parsimony score of N is then the sum| of all the parsimony scores thus obtained. Formally, we have

m Ps(N A1,A2,...,Am)= min Ps(T Ai). | T (N) | i=1 2T X ItScoring is clear that schemes if two networks based display theon same parental set of trees (astrees in Fig. (NMSC) 1), then their parsimony score with respect to any input alignments will be the same — because they take the minimum value over the same set (N) — and thus they are indistinguishable to any method based on the maximum parsimony principleT above. N As for maximum likelihood (ML), Nakhleh and collaborators [2,γ 32,33,-1 38] have proposed an elegant framework whereby a phylogenetic network N is not only describedγ by a network topology, but also edge lengths and inheritance probabilities associated to the reticulationsa b c d ofe Nf . As a result, any tree T displayed by N has edge lengths — allowing the calculation of its likelihood Pr(A T ) with respect to any alignment A — and an associated probability of being observed Pr(T N). The likelihood| function with respect to a PhyloNet | set of alignments A1,A2,...,Am, each from a different non-recombining genomic region, is then given by:

m a b c d e mf a b c d e f Pr(A ,A ,...,A N)= Pr(A N)= Pr(A T )Pr(T N). 1 2 m| i| i| | Data: i=1 i=1 T (N) Y Y 2XT Sequence alignments: m (typically given in blocks) Pr(A ,A ,...,A N)= Pr(A T )p(G N). 1 2 m| i| | i=1 G Y Z Goal: m Find N that maximises Pr(A1,A2,...,Am N)= p(Gi N). | | i=1 Y Note Note that an important difference with the consistency-based and parsimony methods described aboveZhu and is Degnan that any. Displayed tree T treesdisplayed do not determine by a network distinguishability has now under edge the lengths network and multispecies an associated probability coalescent, 2016 Pr (T N). Unfortunately,| this ML framework is also subject to identifiability problems. For example, it does not allow us to distinguish between networks with topologies N1 and N2 in Fig. 1: for every assignment of edge lengths and inheritance probabilities to N1, there exist corresponding assignments to N2 that make the resulting networks indistinguishable, that is, displaying the same trees, with the same edge lengths and the same probabilities of being observed (see the last section in the Supporting Information, S1 Text). As a result, the likelihoods of these two networks will be identical, regardless of the data, and no method based on this definition of likelihood will be able to favour one of them over the other. We refer to S1 Text for a more detailed discussion about networks with inheritance probabilities and likelihood-based reconstruction. In general, we believe that these identifiability problems affect all network inference methods which seek consistency with unordered collections of sequence alignments or pre-inferred attributes such as clusters, triples, quartets or trees. Scoring schemes based on parental trees (NMSC)

p1 = 1/3 p2 = 2/3 q1=7/9 and q2 = 3/7 x=y =1/2 and λi=1, for all i Scoring schemes based on parental trees (NMSC)

b1 b2 b1 b2

p1 = 1/3 p2 = 2/3 q1=7/9 and q2 = 3/7 x=y =1/2 and λi=1, for all i

This may solve the g=((((a,d),c),b1),b2) identifiability issues for several practical cases but we need −6 −6 P(g|N1) 7.7 x 10 , P(g|N1) 7.6 x10 more samples per species “well positioned” in the phylogeny SNaQ(Species Networks applying Quartets) – pseudo-likelihood

Input: quartet CFs Output: level-1 semidirected networks

• quartet CFs do not depend on the root placement à semidirected networks • if n=4, k=2,3 reticulations cannot be detected because equivalent to a tree

Solís-Lemus and Ané. Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting, 2016.

SNaQ (Species Networks applying Quartets) – an example of how to cope with indistinguishability

• quartet CFs do not depend on the root placement à semidirected networks • if n=4, k=2,3 reticulations cannot be detected because equivalent to a tree • if n=4, k=4, reticulations can be detected but not the “placement” SNaQ (Species Networks applying Quartets) – an example of how to cope with indistinguishability

• quartet CFs do not depend on the root placement à semidirected networks • if n=4, k=2,3 reticulations cannot be detected because equivalent to a tree • if n=4, k=4, reticulations can be detected but not the “placement” • for n4, k=2 reticulations are not detectable, k=3 sometimes and k=4 yes in general if n5, along with the placement SNaQ (Species Networks applying Quartets) – an example of how to cope with indistinguishability

With only 4 taxa, there are more parameters than equations (3 quartet CFs), so focus on the case n 5. • If k=3, parameters are identifiable if n1,n2,n3 2 , and setting t12 = 0. • If k = 4, parameters are identifiable if either n0 2 (or n2 , symmetrically), or if both n1 and n3 2 . Parameters are not all identifiable in the remaining 2 cases (bad diamonds I & II) • If k=5, all the parameters are identifiable. SNaQ (Species Networks applying Quartets) – an example of how to cope with indistinguishability

They search only in the space of identifiable networks: • k = 2 not considered • k = 3, only n1,n2,n3 2, and setting t12 = 0 • For bad diamonds I, they reparametrized the 3 nonidentifable -t0 -t1 values (γ, t1, t0) into 2 identifiable ones γ(1-e ) and (1-γ)(1-e ). For bad diamonds II, they set t13 = 0 and kept the other 5 parameters (γ, t0, t1, t2, t3). Thank you for your aden5on Another (home made) approach

Figure 1 Ae. umbellulata Tr257 Ae. umbellulata Tr266 58 Ae. umbellulata Tr268 Ae. caudata Tr276 a Ae. caudata Tr275 b 38 Ae. caudata Tr139 Ae. comosa Tr272 Comopyrum Ae. comosa Tr271 82 Ae. uniaristata Tr357 Ae. uniaristata Tr402 *"Ae. uniaristata Tr403

Ae. uniaristata Tr404 Ae. bicornis Tr407 83 Ae. bicornis Tr408 D" Ae. bicornis Tr406 91 Ae. longissima Tr241 Sitopsis Ae. longissima Tr242 85 Ae. longissima Tr355 88 Ae. sharonensis Tr265 Ae. sharonensis Tr264 Ae. searsii Tr165 47 Ae. searsii Tr164 44 Ae. searsii Tr161 Ae. tauschii Tr125 Ae. tauschii Tr180 Ae. tauschii Tr352 Ae. tauschii Tr351 T. boeoticum TS10 T. boeoticum TS3 82 T. boeoticum TS8 99 T. boeoticum TS4 T. urartu Tr315 A" T. urartu Tr232 T. urartu Tr309 T. urartu Tr317 Ae. speltoides Tr251 68 Ae. speltoides Tr320 Ae. speltoides Tr323 49 Ae. speltoides Tr223 Ae. mutica Tr332 B" 78 Ae. mutica Tr244 Ae. mutica Tr329 100 Ae. mutica Tr237 Ta. caput-medusae TB2 S. vavilovii Tr279 Er. bonaepartis TB1 H. vulgare HVens23

15" 10" 5" 0"""Myr" 26 24 22 20 18 16 14 12 10 8 6 4 2 0 Figure 3 Another (home made) approach

a b D species as hybrids between A and B lineages Hybridization index ( ) along 3 Ae. mutica Ae. speltoides γ 1.00 0.7 Sitopsis T. boeoticum 0.6 ● Comopyrum

0.5 in D ● ● 0.75 ● ● ● ● ● ● ● 0.4 ● ● ● ● ● ● ● ● ) = % of B in D ● ●

γ 0.3 ● ● ● ● ● ● ● ● Ae. mutica Ae. 0.50 ● ● ● 0.7 ● ● ●

0.6 T. urartu ● ● ● 0.25 ● 0.5 Proportion of 0.4 Hybridization index ( Hybridization index 0.3 0.00 0 100 200 300 400 500 Position along chromosome (Mb) Ae. searsii Ae. searsii Ae. Ae. tauschii Ae. tauschii Ae. Ae. bicornisAe. bicornisAe. Ae. comosa Ae. comosa Ae. Ae. caudata Ae. caudata Ae. Ae. uniaristata Ae. uniaristata Ae. Ae. longissima Ae. longissima Ae. Ae. umbellulata Ae. umbellulata Ae. Ae. sharonensis Ae. sharonensis Ae. D focal species Another (home made) approach Figure 4

Introgression from the Triticum ancestor Triticum A 1 into the Ae. mutica ancestor

2 Origin of the D clade by hybridization between the A clade ancestor and the Ae. Comopyrum mutica ancestor

3 Introgression of the Sitopsis ancestor by the Ae. speltoides ancestor Ae. caudata 4 4 Complex gene flows during the divergence of Ae. caudata and Ae. umbellulata ? probably involving three events (or more) Ae. umbellulata 2 D

Ae. tauschii

1 Sitopsis

3 Ae. mutica B

Ae. speltoides

6 4 2 0 Myr Another (home made) approach Figure 2 Data! Scenario! 1)"Probabili1es"of"the"network’s"components" Out ATTGCTACGTCAT αa" αb" γ γ ! (1 γ ) γ ! (1 γ ) (1 γ ) ! γ (1 γ ) ! Τ1 1 2 1 2 1 2 1 2 A ...... A... γ 1 1 γ1 αc" α B .....G...A... Τ2 d" γ ! C .C...... 2 1 γ2

D .C...G...... A" B" C" D" A" B" C" D" A" B" C" D" A" C" B" D" A" C" B" D"

2)"Probabili1es"of"coalescent"trees"embedded"into"components"

No%ILS% No%ILS% …" …"

A" B" C" D" A" C" B" D"

With%ILS% With%ILS% Site%pa(erns% …" …" Out""A"""B"""C"""D "Counts !Expecta0ons!

0"""""""0"""0"""1"""1 """"""v1 " """""""""""p1" A" B" C" D" A" C" B" D" 0"""""""0"""1"""0"""1 """"""v2 " """""""""""p2" …" …" …" …"""""" """" " """"""… " """"""""…" …"

0"""""""1"""1"""0"""0 """"""v10 " """""""""""p10"""

3)"Probabili1es"of"informa1ve"site"paDerns"given"a"coalescent"tree"

Likelihood!of!the!scenario! Two%possible%informa9ve%muta9ons" ! !%Two%site%pa(erns" !!Choice!of!the!best!scenario! …" …"

A" B" C" D" A" C" B" D" 1! 1! 1! 0! 0! 0! 1! 1! 1! 1! 0! 0! 1! 1!…" 0! 0! …" …" …"