
Phylogeny reconstruction

Eva Stukenbrock & Julien Dutheil


Max Planck Institute for Terrestrial Microbiology – Marburg

19 January 2011

Why study phylogenetics?

Phylogenetics: the study of the evolution of (taxonomic) groups of organisms (e.g. species, populations).

• Infer the history of taxa
   • Taxonomy
   • Phylogeography
• Necessary for comparative analysis, as species are not “independent”

Phylogenetics applications

Phylogenetic tree: a graph depicting the ancestor-descendant relationships between organisms or gene sequences. The sequences are the tips of the tree. Branches of the tree connect the tips to their (unobservable) ancestral sequences (Holder 2003).

Some applications:
• Resolving orthology and paralogy
• Dating: estimating divergence times
• Reconstructing ancestral sequences
• Detecting sites under positive selection
• Predicting the structure of a molecule
• ...

What is a tree?

[Figure: unrooted trees with 2 leaves (A, B), 3 leaves (A, B, C) and 4 leaves (A, B, C, D, with the alternative topologies numbered 1 and 2), followed by a four-leaf tree drawn with branch lengths and one drawn with a clock.]

For n leaves, there are

\[ \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!} \]

distinct unrooted (bifurcating) trees.
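The count of unrooted bifurcating trees, (2n − 5)! / (2^(n−3) (n − 3)!), can be checked numerically. The sketch below is illustrative only; the function names are not from the slides. It evaluates the closed form and compares it with the product obtained by inserting leaves one at a time, where each new leaf can be attached to any existing branch.

```python
from math import factorial


def count_unrooted_trees(n: int) -> int:
    """Closed form for the number of unrooted bifurcating trees with n labelled leaves."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def count_by_stepwise_insertion(n: int) -> int:
    """Same count, built by adding one leaf at a time to a 3-leaf tree."""
    count = 1  # there is a single unrooted tree with 3 leaves
    for k in range(4, n + 1):
        # A tree with k - 1 leaves has 2(k - 1) - 3 = 2k - 5 branches,
        # and the k-th leaf can be attached to any of them.
        count *= 2 * k - 5
    return count


for n in (3, 4, 5, 10, 15):
    print(n, count_unrooted_trees(n))
```

The count grows very quickly with n, which is why exhaustive enumeration of topologies is only feasible for small data sets.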

From sequences to trees

Taxon 1   AAGACATGTGGCA
Taxon 2   AGGAC-TGTGGCA
Taxon 3   AGTAC-TGTGA-A
Taxon 4   AGCAC-TGTG--T
Taxon 5   AGCACATGTGA-A

[Figure: the alignment above, with one column highlighted as a “site”, next to an unrooted tree connecting Taxon 1 to Taxon 5.]

• Aligned homologous positions: sites
• Each site is a realization of a random variable

The phenetic approach

• Uses an overall similarity measure (commonly used in morphology)
• Considers that similarity is (inversely) correlated with evolutionary distance
• Builds a tree from a pairwise distance matrix:

      T1    T2    T3    T4    T5
T1    0
T2    2     0
T3    3.5   4.5   0
T4    8     10    8     0
T5    6     8     9     3     0

The WPGMA clustering method
Weighted Pair Group Method using Average

1. Pick the smallest distance
2. Gather the corresponding groups
3. Recompute distances from the new group:

\[ d_{ij,k} = \frac{d_{i,k} + d_{j,k}}{2} \]

(Assumes that the rate of change is constant over time: molecular clock)

Initial matrix (smallest distance: T1, T2):

      T1    T2    T3    T4    T5
T1    0
T2    2     0
T3    3.5   4.5   0
T4    8     10    8     0
T5    6     8     9     3     0

After gathering T1 and T2 (smallest distance: T4, T5):

      T12   T3    T4    T5
T12   0
T3    4     0
T4    9     8     0
T5    7     9     3     0

After gathering T4 and T5 (smallest distance: T12, T3):

      T12   T3    T45
T12   0
T3    4     0
T45   8     8.5   0

After gathering T12 and T3:

      T123  T45
T123  0
T45   8.25  0

Resulting tree, with branch lengths:

(((T1: 1.0, T2: 1.0): 1.0, T3: 2.0): 2.125, (T4: 1.5, T5: 1.5): 2.625)
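The three WPGMA steps can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `wpgma`, the lower-triangular input layout and the Newick-style output string are choices made here, not part of the slides. Each new node is placed at half the distance between the two groups it joins, which is where the molecular clock assumption enters.

```python
from itertools import combinations


def wpgma(names, lower):
    """Cluster taxa with WPGMA and return the tree as a Newick-style string.

    names: taxon labels; lower[i] holds the distances from names[i]
    to names[0..i-1] (lower-triangular matrix without the diagonal).
    """
    # Symmetric lookup table d[a][b], keyed by cluster label.
    d = {a: {} for a in names}
    for i, a in enumerate(names):
        for j, value in enumerate(lower[i]):
            d[a][names[j]] = d[names[j]][a] = float(value)
    height = {a: 0.0 for a in names}  # height of each node above the tips

    while len(d) > 1:
        # Step 1: pick the smallest distance.
        i, j = min(combinations(d, 2), key=lambda pair: d[pair[0]][pair[1]])
        # Step 2: gather the two groups; the new node sits at half their distance.
        h = d[i][j] / 2
        new = f"({i}:{h - height[i]:g},{j}:{h - height[j]:g})"
        # Step 3: recompute the distances from the new group to all others.
        others = [k for k in d if k not in (i, j)]
        row = {k: (d[i][k] + d[j][k]) / 2 for k in others}
        del d[i], d[j]
        for k in others:
            del d[k][i], d[k][j]
            d[k][new] = row[k]
        d[new] = row
        height[new] = h

    (root,) = d
    return root + ";"


NAMES = ["T1", "T2", "T3", "T4", "T5"]
LOWER = [[], [2], [3.5, 4.5], [8, 10, 8], [6, 8, 9, 3]]
print(wpgma(NAMES, LOWER))
```

Run on the five-taxon matrix of the example, the sketch recovers the same groupings and branch lengths (1.0, 1.5, 2.0, 2.125 and 2.625) as the worked example; only the left-to-right order of the subtrees in the printed string may differ from a hand-drawn tree.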

Distance methods

“Distance” methods are extensions of clustering techniques for the purpose of phylogenetics.

Two steps:
1. getting a distance matrix
2. building a tree from the matrix

When using molecular sequences, the distance for a pair of species is the divergence estimated from the sequences. It can be:
• an edit distance (alignment score)
• a proportion of mismatches
• a divergence estimated from the observed matches (requires a model of evolution)
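Two of the distances listed above are easy to illustrate. The sketch below computes the proportion of mismatches between two aligned sequences and a model-based divergence under the Jukes and Cantor (1969) model, using the standard correction d = −(3/4) ln(1 − 4p/3). Ignoring every column that contains a gap is an assumption made here for simplicity, and the function names are illustrative.

```python
from math import log


def p_distance(seq1: str, seq2: str, gap: str = "-") -> float:
    """Proportion of mismatches over the columns where neither sequence has a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped column to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p: float) -> float:
    """Expected number of substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("sequences are saturated: the JC69 distance is undefined")
    return -0.75 * log(1 - 4 * p / 3)


taxon1 = "AAGACATGTGGCA"
taxon2 = "AGGAC-TGTGGCA"
p = p_distance(taxon1, taxon2)
print(f"p-distance = {p:.4f}, JC69 distance = {jc69_distance(p):.4f}")
```

The model-based distance is always at least as large as the raw proportion of mismatches, because it accounts for substitutions hidden by later changes at the same site.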

Phylogenetic clustering
Additive tree

Additive distance matrix: a distance matrix M is additive if there exists a corresponding phylogenetic tree T such that $d_{a,b} = d_{a,c} + d_{b,c}$, where c denotes the ancestral node of species a and b.

M     a     b     c     d     e
a     0     11    10    9     15
b     11    0     3     12    18
c     10    3     0     11    17
d     9     12    11    0     8
e     15    18    17    8     0

⇔ [Figure: unrooted tree with leaves a, b, c, d, e and branch lengths:]

(a: 4, (b: 2, c: 1): 5, (d: 1, e: 7): 4)
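Additivity of the example matrix can be checked directly: summing branch lengths along the path between two leaves must give back the corresponding entry of M. In the sketch below, the internal node names `u`, `v` and `w` are hypothetical labels, and the assignment of branch lengths to branches (a = 4, b = 2, c = 1, d = 1, e = 7, internal branches 5 and 4) is reconstructed from the matrix itself.

```python
from itertools import combinations

# Tree edges as (node, node, branch length); u, v, w are internal nodes.
EDGES = [
    ("a", "u", 4), ("u", "v", 5), ("v", "b", 2), ("v", "c", 1),
    ("u", "w", 4), ("w", "d", 1), ("w", "e", 7),
]

LEAVES = "abcde"
M = [
    [0, 11, 10, 9, 15],
    [11, 0, 3, 12, 18],
    [10, 3, 0, 11, 17],
    [9, 12, 11, 0, 8],
    [15, 18, 17, 8, 0],
]


def path_lengths(edges, start):
    """Distance from `start` to every node, following the unique path in the tree."""
    neighbours = {}
    for x, y, length in edges:
        neighbours.setdefault(x, []).append((y, length))
        neighbours.setdefault(y, []).append((x, length))
    dist = {start: 0}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt, length in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + length
                stack.append(nxt)
    return dist


def tree_matches_matrix(edges, leaves, matrix):
    """True if every pairwise path length equals the matrix entry."""
    for i, j in combinations(range(len(leaves)), 2):
        if path_lengths(edges, leaves[i])[leaves[j]] != matrix[i][j]:
            return False
    return True


print(tree_matches_matrix(EDGES, LEAVES, M))
```

Distance matrices estimated from real sequences generally fail such a check for every candidate tree, which motivates searching for the closest additive matrix instead.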

Phylogenetic clustering
Biological distance matrices

• Distance matrices from real data are not additive
• The goal is to find a tree for which the corresponding additive matrix is as close as possible to the measured one (for instance according to a least-squares criterion)
• This is a very difficult problem! Heuristics are needed
• The most famous heuristic: Neighbor Joining
   • Saitou N and Nei M (1987). Molecular Biology and Evolution, 4:406-425 (one of the most cited papers in biology!)
• Other heuristics: BioNJ, FastME

The cladistic approach

• Reconstruct the evolutionary history of the data
• A plethora of scenarios, not all equally likely
• Assess the probability of a given scenario
   • for a site
   • for an alignment

[Figure: two example sites mapped on a five-taxon tree. “Good” site: states G, G, G, A, A, explained by a single G→A change. “Bad” site: states C, T, C, C, T, requiring two C→T changes.]

General hypotheses

1. All sites evolve essentially by substitutions (insertions, deletions and inversions are not accounted for)
2. All sites evolve independently
3. All sites undergo the same process, notably:
   • All sites evolve at the same rate
   • The substitution rate is constant over time

Maximum parsimony

William of Ockham (1288-1346): English medieval logician and Franciscan friar. Well known for his principle of parsimony, or Ockham’s razor: the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory.

This is often paraphrased as “All other things being equal, the simplest solution is the best.”

⇒ A general statistical and scientific method.

Maximum parsimony

Three possible topologies:

Case 1: ((A, B), (C, D))
Case 2: ((A, C), (B, D))
Case 3: ((A, D), (B, C))

[Figure: the three four-taxon topologies, with states X and Y of one site mapped onto the tips and internal nodes.]

Three types of informative sites + non-informative sites:

          Informative     Non-informative
Site      1    2    3     4    5
A         X    X    X     X    X
B         X    Y    Y     X    X
C         Y    X    Y     X    Y
D         Y    Y    X     Y    Z
Case 1    1    2    2     1    2
Case 2    2    1    2     1    2
Case 3    2    2    1     1    2

(The “Case” rows give the minimum number of changes each topology requires at each site.)

• For one site, we take the most parsimonious scenario
• For an alignment, we take the scenario in agreement with the majority of sites
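The per-site change counts for the three topologies can be reproduced with Fitch's small-parsimony algorithm, a standard way of counting the minimum number of changes on a fixed tree. It is named here as a swapped-in technique: the slides do not state which algorithm was used. Trees are written as nested pairs, rooted arbitrarily on the internal branch, which does not affect the parsimony score.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes) for a subtree.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    states: dict mapping each leaf name to its state at the site considered.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The two children can share a state: no extra change is needed.
        return common, left_changes + right_changes
    # Otherwise one change is required on one of the two branches.
    return left_set | right_set, left_changes + right_changes + 1


TOPOLOGIES = {
    "Case 1": (("A", "B"), ("C", "D")),
    "Case 2": (("A", "C"), ("B", "D")),
    "Case 3": (("A", "D"), ("B", "C")),
}

# One string per site, giving the states of A, B, C, D in that order.
SITES = ["XXYY", "XYXY", "XYYX", "XXXY", "XXYZ"]


def parsimony_scores(tree, sites):
    """Minimum number of changes at each site for the given topology."""
    return [fitch(tree, dict(zip("ABCD", site)))[1] for site in sites]


for name, tree in TOPOLOGIES.items():
    scores = parsimony_scores(tree, SITES)
    print(name, scores, "total =", sum(scores))
```

The last two sites give the same score on every topology, which is what makes them non-informative for choosing between the three cases.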

How to find the best tree?

1. Try all possible topologies (not possible if n > 15)
2. Stepwise addition:
   • Start with three random sequences
   • Pick a new sequence randomly, and find the best position (3 possibilities)
   • Pick a new sequence randomly, and find the best position (5 possibilities)
   • Pick a new sequence randomly, and find the best position (7 possibilities)
   • etc.
3. Start from an existing tree, and try to improve it
4. A combination of ‘2’ and ‘3’
5. A combination of a distance method and ‘3’

Topology ‘movements’

• Nearest Neighbor Interchange (NNI)
• Subtree Pruning Regrafting (SPR)
• Tree Bisection Reconnection (TBR)

Problems with the parsimony approach

[Figure: four-taxon tree with taxa 3 and 4 in state Y, taxa 1 and 2 in state X, and both internal nodes in state X.]

      a  b  c     d  e  f
1     X  X  X     X  X  X
2     X  Y  Y     X  X  X
3     Y  X  Y     Y  X  Y
4     Y  Y  X     Z  Y  X

• The site pattern supporting the ‘right’ tree is b: ((1, 3), (2, 4)), but sites of type a are more frequent
• According to the maximum parsimony criterion, the tree supported by a is the correct one
• Using non-informative sites (d, e, f) can help resolve the issue

Long-branch attraction

Markov model

• We model the evolution of each position (site) of a sequence independently
• The state X(t) of a site at time t depends only on its state at the earlier time t_0:

\[
\begin{aligned}
\Pr(X(t) = A) = {} & \Pr(X(t_0) = A) \times \Pr(A \to A) \\
{} + {} & \Pr(X(t_0) = C) \times \Pr(C \to A) \\
{} + {} & \Pr(X(t_0) = G) \times \Pr(G \to A) \\
{} + {} & \Pr(X(t_0) = T) \times \Pr(T \to A)
\end{aligned}
\tag{1}
\]

Similar equations can be written for Pr(X(t) = C), Pr(X(t) = G) and Pr(X(t) = T).

Matrix notation

We can gather all equations in a more compact form. We note

\[ x(t) = \begin{pmatrix} \Pr(X(t) = A) & \Pr(X(t) = C) & \Pr(X(t) = G) & \Pr(X(t) = T) \end{pmatrix} \]

And we can write

\[
x(t) = x(t_0) \times
\underbrace{\begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}}_{P}
\]

where $p_{ij} = \Pr(i \to j)$. ‘P’ defines the substitution process.

A few more considerations

We have

\[ \forall i, \quad \sum_j p_{i,j} = 1 \]

that is, for i = A:

\[ \Pr(A \to A) + \Pr(A \to C) + \Pr(A \to G) + \Pr(A \to T) = 1 \]

If we assume that all types of substitutions are equiprobable (Jukes and Cantor, 1969), we can simplify:

\[
P(\mathrm{JC69}) =
\begin{pmatrix}
1 - 3r & r & r & r \\
r & 1 - 3r & r & r \\
r & r & 1 - 3r & r \\
r & r & r & 1 - 3r
\end{pmatrix}
\]
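A small numerical sketch may help make the matrix form concrete. It builds the Jukes-Cantor matrix for an arbitrary, hypothetical value r = 0.1 and applies x(t) = x(t0) × P to a site known to be in state A, with the state order A, C, G, T. The function names are illustrative and plain Python lists are used instead of a linear algebra library.

```python
def jc69_step_matrix(r: float):
    """Jukes-Cantor matrix: probability r for each change, 1 - 3r for no change."""
    return [[1 - 3 * r if i == j else r for j in range(4)] for i in range(4)]


def step(x, P):
    """Row vector times matrix: x(t) = x(t0) * P."""
    return [sum(x[i] * P[i][j] for i in range(4)) for j in range(4)]


P = jc69_step_matrix(0.1)

# Every row of P sums to one: a site always ends up in some state.
print([round(sum(row), 12) for row in P])

# A site that is A with certainty (state order: A, C, G, T).
x = [1.0, 0.0, 0.0, 0.0]
x = step(x, P)
print(x)

# Applying the same step repeatedly drives all four probabilities towards 1/4.
for _ in range(100):
    x = step(x, P)
print(x)
```

After many steps the starting state is forgotten, which is the saturation effect that limits how much divergence can be read from very distant sequences.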

Continuous time

We assume that the process does not change over time, so we can write the equations for any time t:

\[ t = t_0 + dt_0, \qquad r = \alpha \cdot dt_0 \]

\[
x(t_0 + dt_0) = x(t_0) \times
\begin{pmatrix}
1 - 3\alpha\,dt_0 & \alpha\,dt_0 & \alpha\,dt_0 & \alpha\,dt_0 \\
\alpha\,dt_0 & 1 - 3\alpha\,dt_0 & \alpha\,dt_0 & \alpha\,dt_0 \\
\alpha\,dt_0 & \alpha\,dt_0 & 1 - 3\alpha\,dt_0 & \alpha\,dt_0 \\
\alpha\,dt_0 & \alpha\,dt_0 & \alpha\,dt_0 & 1 - 3\alpha\,dt_0
\end{pmatrix}
\]

\[ x(t_0 + dt_0) = x(t_0) + x(t_0) \cdot Q\,dt_0 \]

\[ \frac{x(t_0 + dt_0) - x(t_0)}{dt_0} = x(t_0) \cdot Q \]

Continuous time

We obtain a differential equation by letting dt_0 → 0:

\[ \frac{\partial x(t)}{\partial t} = x(t) \cdot Q \]

Q is called the generator of the substitution process, and we have

\[
Q(\mathrm{JC69}) =
\begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\]

with

\[ \forall i, \quad \sum_j q_{i,j} = 0 \]

The solution of this equation is

\[ x(t) = x(t_0) \cdot \exp(Q \cdot t) \]

where t is the time elapsed since t_0.
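The matrix exponential exp(Q · t) can be evaluated numerically. The sketch below sums the Taylor series of the exponential, a simple approach chosen here for illustration rather than numerical robustness, and compares the result with the well-known closed form for the Jukes-Cantor model: 1/4 + 3/4 e^(−4αt) on the diagonal and 1/4 − 1/4 e^(−4αt) elsewhere. The values α = 0.1 and t = 2 are arbitrary, and the function names are illustrative.

```python
from math import exp


def jc69_generator(alpha: float):
    """Generator Q of the Jukes-Cantor model (each row sums to zero)."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, terms: int = 60):
    """exp(M) as the truncated series I + M + M^2/2! + ..."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, M)]  # M^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def jc69_closed_form(alpha: float, t: float):
    """Analytical transition probabilities of the Jukes-Cantor model."""
    e = exp(-4 * alpha * t)
    return [[0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e for j in range(4)]
            for i in range(4)]


alpha, t = 0.1, 2.0
Qt = [[q * t for q in row] for row in jc69_generator(alpha)]
numeric = mat_exp(Qt)
exact = jc69_closed_form(alpha, t)
print(max(abs(numeric[i][j] - exact[i][j]) for i in range(4) for j in range(4)))
```

For models without a closed form, the same exponential is usually computed through an eigendecomposition of Q rather than a truncated series.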

Conclusion

We can compute the probability that a certain sequence (x(t_0)) transforms into another given sequence (x(t)) after a known time (t), given a certain substitution process specified by its generator (Q).

So what? If we have two sequences and Q, we can compute the t which maximizes this probability → a maximum-likelihood estimate of the divergence between the two sequences!

Evolution along a tree

[Figure: a tree whose nodes carry nucleotides (G, A, C, T) and whose six branches carry the probabilities P1 to P6. Each assignment of ancestral states contributes one product of the six branch probabilities, and these products are added over all assignments.]

$$L = \sum_{\text{Ancestors}} P_1 \times P_2 \times P_3 \times P_4 \times P_5 \times P_6$$
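The sum over ancestors can be sketched for a single site on a small rooted tree ((t1, t2), (t3, t4)), which has six branches and three unobserved internal nodes. The tree shape, the branch labels, the JC69 transition probabilities, and the uniform 1/4 probability for the root state are all assumptions made for this illustration. The second function computes the same quantity with Felsenstein's pruning recursion, which is not in the slides but avoids enumerating every combination of ancestral states.

```python
import itertools
import math

BASES = "ACGT"

def jc69_prob(x, y, alpha_t):
    """JC69 probability of going from state x to state y along a branch."""
    e = math.exp(-4.0 * alpha_t)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood_brute_force(tips, b):
    """Sum, over all ancestral states, of the product of six branch probabilities."""
    t1, t2, t3, t4 = tips
    total = 0.0
    for root, u, v in itertools.product(BASES, repeat=3):
        total += (0.25                                  # assumed uniform root state
                  * jc69_prob(root, u, b["root-u"])
                  * jc69_prob(root, v, b["root-v"])
                  * jc69_prob(u, t1, b["u-t1"])
                  * jc69_prob(u, t2, b["u-t2"])
                  * jc69_prob(v, t3, b["v-t3"])
                  * jc69_prob(v, t4, b["v-t4"]))
    return total

def site_likelihood_pruning(tips, b):
    """Same likelihood, computed node by node from the tips to the root."""
    t1, t2, t3, t4 = tips
    # Partial likelihoods: probability of the tips below a node, given its state.
    part_u = {u: jc69_prob(u, t1, b["u-t1"]) * jc69_prob(u, t2, b["u-t2"])
              for u in BASES}
    part_v = {v: jc69_prob(v, t3, b["v-t3"]) * jc69_prob(v, t4, b["v-t4"])
              for v in BASES}
    total = 0.0
    for root in BASES:
        below_u = sum(jc69_prob(root, u, b["root-u"]) * part_u[u] for u in BASES)
        below_v = sum(jc69_prob(root, v, b["root-v"]) * part_v[v] for v in BASES)
        total += 0.25 * below_u * below_v
    return total

# Hypothetical branch lengths, expressed as alpha*t.
branch_lengths = {"root-u": 0.05, "root-v": 0.08,
                  "u-t1": 0.10, "u-t2": 0.12, "v-t3": 0.07, "v-t4": 0.20}
site = ("G", "A", "C", "T")
print(site_likelihood_brute_force(site, branch_lengths))
print(site_likelihood_pruning(site, branch_lengths))
```

For an alignment, the per-site likelihoods are multiplied across sites (or their logarithms added), under the usual assumption that sites evolve independently.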

Maximum likelihood

Maximum-likelihood estimation (MLE)
MLE is a method for estimating the parameters of a statistical model. For a given dataset and underlying statistical model, the maximum-likelihood estimator is the set of parameter values that maximizes the likelihood function. (The method was initially proposed by the statistician Ronald Aylmer Fisher in 1922.)

- General statistical framework
- Allows model comparisons
- Provides confidence intervals for the estimates
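In symbols, writing D for the data and θ for the model parameters (notation assumed here, not taken from the slides), the estimator and one common way of comparing nested models can be written as follows. The likelihood-ratio statistic is only approximately chi-square distributed, under the usual regularity conditions, with k the number of extra free parameters in the richer model.

```latex
\hat{\theta} = \arg\max_{\theta} L(\theta; D),
\qquad L(\theta; D) = \Pr(D \mid \theta)

% Likelihood-ratio test between a null model (0) nested in a richer model (1)
2 \left( \ln L_1(\hat{\theta}_1; D) - \ln L_0(\hat{\theta}_0; D) \right)
\;\overset{\text{approx.}}{\sim}\; \chi^2_k
```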


However... is tree topology (really) a parameter?

The bootstrap procedure

1. Sample sites, with replacement, from the original alignment
2. Build a new phylogeny from the resampled data set
3. Compare the new phylogeny to the one inferred from the original data: which internal branches are present?
4. Repeat from step 1 many times

⇒ This gives the extent to which each internal branch is supported by the data.
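The four steps can be sketched on a toy case with four taxa, where the tree has a single internal branch and therefore only three possible topologies. The reconstruction step used here (choose the pairing of taxa with the smallest sum of within-pair p-distances) is a deliberately simple stand-in for a real tree-building method, and the alignment, taxon names, and function names are invented for the example.

```python
import random

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def best_quartet_split(alignment):
    """Toy 'phylogeny' for four taxa: the pairing with the smallest summed distance."""
    a, b, c, d = sorted(alignment)
    candidates = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    def score(split):
        (p, q), (r, s) = split
        return (p_distance(alignment[p], alignment[q])
                + p_distance(alignment[r], alignment[s]))
    return min(candidates, key=score)       # ties go to the first candidate

def resample_sites(alignment, rng):
    """Step 1: draw columns with replacement, keeping the alignment length."""
    n = len(next(iter(alignment.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

def bootstrap_support(alignment, replicates=200, seed=0):
    """Steps 2 to 4: fraction of replicates recovering the original internal branch."""
    rng = random.Random(seed)
    reference = best_quartet_split(alignment)
    hits = sum(best_quartet_split(resample_sites(alignment, rng)) == reference
               for _ in range(replicates))
    return reference, hits / replicates

alignment = {                                # hypothetical data
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGTACGAACGTACGT",
    "C": "ACTTACGGACGTTCGTACCT",
    "D": "ACTTACGGACGTTCGAACCT",
}
split, support = bootstrap_support(alignment)
print(split, support)
```

With more than four taxa, the same loop applies, but each internal branch of the reference tree is tracked separately as a bipartition of the taxa.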


What can bias the reconstruction?

- Long branch attraction
- Model misspecification
- Inconsistency between the history of sequences and the history of species:
  - Stochasticity, incomplete lineage sorting
  - Introgression, horizontal gene transfer
  - Selection, convergence
