3: Methods to reconstruct phylogenies

A generalized protocol for molecular phylogenetics, and the associated concerns with each step

Step: associated concerns

1. Collect homologous sequences: gene tree-species tree / paralogy–orthology / trees within trees
2. Multiple sequence alignment: positional homology / gaps / subjectivity-objectivity / methods
3. Phylogeny estimation: philosophy / methods / consistency / power and accuracy
4. Test reliability or fit of phylogenetic estimates: branch support / tree comparison / statistical issues with trees
5. Interpretation and application: independent contrasts / impact of error on conclusions

Molecular characters

• DNA sequences
  structural genes (protein-coding, RNA, regulatory)
  non-structural genes (introns, intergenic sequences, pseudogenes)
• Protein sequences
  translate DNA to protein
  thousands to choose from
• INDELS: nucleotides, amino acids, genes, segments of DNA, etc.
• DNA-DNA hybridization
• Restriction fragment length polymorphisms (RFLPs)
• Large scale genomic rearrangements


Molecular characters: multiple sequence alignment

Some methods:
1. sum-all-pairs method: count the cost of aligning all pairs of sequences and select the alignment that minimizes the total cost
2. star alignment: alignment based on a tree that assumes all sequences are equally related
3. tree alignment: uses “known” information about relationships of sequences (lineages) to guide the alignment
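The sum-all-pairs idea can be sketched in a few lines of Python. This is only an illustration: the unit costs (mismatch = 1, gap = 2), the three toy sequences and the names `pair_cost` and `sum_of_pairs` are assumptions made for the example, not values or names from the lecture.

```python
from itertools import combinations

def pair_cost(a, b, mismatch=1, gap=2):
    """Cost of one pair of aligned rows (the unit costs are assumed)."""
    cost = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue              # gap opposite gap costs nothing
        if x == "-" or y == "-":
            cost += gap           # residue opposite a gap
        elif x != y:
            cost += mismatch      # two different residues
    return cost

def sum_of_pairs(alignment):
    """Total cost of an alignment, summed over all pairs of rows."""
    rows = list(alignment.values())
    return sum(pair_cost(a, b) for a, b in combinations(rows, 2))

# Two candidate alignments of the same three toy sequences (TAG, TAG, TAAG).
candidate_1 = {"s1": "TAG-", "s2": "TA-G", "s3": "TAAG"}
candidate_2 = {"s1": "TA-G", "s2": "TA-G", "s3": "TAAG"}

cost_1 = sum_of_pairs(candidate_1)
cost_2 = sum_of_pairs(candidate_2)
preferred = "candidate_2" if cost_2 < cost_1 else "candidate_1"
```

Under these assumed costs the second candidate has the lower total and would be the one selected. A real aligner also has to search over candidate alignments, which this sketch does not attempt.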

Take the course BIOC 4010 / BIOL 4041 to learn more about alignments.

Species 1  … T A G …
Species 2  … T A G …
Species 3  … T A A …
Species 4  … T C A …
Species 5  … C C A …

Molecular characters: multiple sequence alignment

Molecular characters: DNA alignment

Alignment of the nucleotide character states of the β-globin gene from five species of mammals

Taxa, in order: human, cow, rabbit, rat, opossum

human, part 1: GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC AAG GCC GCC TGG GGC AAG GTT GGC GCG CAC
human, part 2: GCT GGC GAG TAT GGT GCG GAG GCC CTG GAG AGG ATG TTC CTG TCC TTC CCC ACC ACC AAG
human, part 3: ACC TAC TTC CCG CAC TTC GAC CTG AGC CAC GGC TCT GCC CAG GTT AAG GGC CAC GGC AAG

[The rows for cow, rabbit, rat and opossum, shown relative to human, are too garbled in the extracted text to reproduce.]

The order of DNA sequences in the alignment is specified by the order of the taxa in the list. To fit it on the page, the alignment is broken into three parts; such alignments are called INTERLEAVED. The complete DNA sequence is shown for the first taxon (human). All the other sequences are shown relative to human, with the dot, “.”, signifying a match in the character state with the human sequence. Differences are indicated by using the single-letter nucleotide code (A, C, T or G). Note that this alignment could also be analyzed by using distance, likelihood, and Bayesian methods.

Positional homology is always assumed when constructing alignments

Molecular characters: presence-absence data matrix (easy)

Hypothetical presence-absence data matrix for a diversity of molecular characters

Species 1  1 0 0 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 1 1 0 1 1 1 0 0
Species 2  1 0 1 0 1 1 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0 1
Species 3  1 1 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 1
Species 4  1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1
Species 5  1 0 1 0 1 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 1 1 1

Column groups, left to right:
• Amino acid INDELS in 8 different genes
• Pseudogene / functional gene
• Presence / absence of transposon elements at 8 different genomic locations
• Tandem gene duplication events

Molecular characters: positional homology of gaps is a real pain in the ass

                    10         20         30         40         50         60
            ....|....| ....|....| ....|....| ....|....| ....|....| ....|....|
Mus2.FAS    MTTPALLPLS -----GRRIP PLNL--GPP- ----SFPHHR ATLRLSEKFI LLLILSAFIT
Human_GIA   ------MNSNF ITFDLKMSLL PSNLFSAFIT
Human_GIB   MTTPALLPLS -----GRRIP PLNL--GPP- ----SFPHHR ATLRLSEKFI LLLILSAFIT
Mus_GIA     MPVGGLLPLF SSPGGGGLGS GLGGGLGGG- ----RKGSGP AAFRLTEKFV LLLVFSAFIT
Rabbit_GIA  ------
Sus_GIA     MPVGGLLPLF SSPAGGGLGG GLGGGLGGGG GGGGRKGSGP SAFRLTEKFV LLLVFSAFIT

                    70         80         90        100        110        120
            ....|....| ....|....| ....|....| ....|....| ....|....| ....|....|
Mus2.FAS    LCFGAFFFLP DSSKHKRFDL G-LEDVLIPH VDAGKG---- AKNPGVFLIH GPDEHRHREE
Human_GIA   LCFGAIFFLP DSSKLLSGVL FHSSPALQPA ADHKPGPGAR AEDAAEGRAR RREEGAPGDP
Human_GIB   LCFGAFFFLP DSSKHKRFDL G-LEDVLIPH VDAGKG---- AKNPGVFLIH GPDEHRHREE
Mus_GIA     LCFGAIFFLP DSSKLLSGVL FHSNPALQPP AEHKPGLGAR AEDAAEGRVR HREEGAPGDP
Rabbit_GIA  ------AEDAADGRAR PGEEGAPGDP
Sus_GIA     LCFGAIFFLP DSSKLLSGVL FHSSPALQPA ADHKPGPGAR AEDAADGRAR PGEEGAPGDP

An alignment of real amino acid sequences of the mannosidase protein

Molecular characters: multiple sequence alignment

• software is far from flawless (many different methods)
• all alignments must be inspected “by eye”
• any manual adjustment (by eye) introduces subjectivity

• one solution is to publish alignments:
  • public database
  • in scientific paper
  • supplementary online materials of a scientific journal


Molecular phylogenetics: methods

We divide methods up by two criteria (data and method):

Type of data:
1. characters: discrete character states at positionally homologous sites in a multiple sequence alignment (hence, discrete character methods)
2. distances: evolutionary distance, measured as the average number of substitutions per positionally homologous site, between all pairs of taxa (hence, distance methods)

Species 1  … T A G …
Species 2  … T A G …
Species 3  … T A A …
Species 4  … T C A …
Species 5  … C C A …
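To make the character/distance distinction concrete, the three aligned sites above can be converted into a pairwise distance matrix. The sketch below uses the uncorrected proportion of differing sites (the p-distance). This is a simplifying assumption: distances expressed as substitutions per site would normally be corrected with a model of evolution. The name `p_distance` is illustrative.

```python
from itertools import combinations

# The three aligned sites from the five-species example.
alignment = {
    "Species 1": "TAG",
    "Species 2": "TAG",
    "Species 3": "TAA",
    "Species 4": "TCA",
    "Species 5": "CCA",
}

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# One distance for every pair of taxa: character data become distance data.
distances = {
    (s, t): p_distance(alignment[s], alignment[t])
    for s, t in combinations(alignment, 2)
}

for (s, t), d in distances.items():
    print(f"{s} vs {t}: {d:.3f}")
```

Note what is lost in the conversion: the distance matrix records how different two sequences are, but no longer which sites differ.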


Type of tree-building:
1. clustering algorithm: computationally “build a tree” according to a specific set of “steps”.
2. optimality criterion: a criterion for scoring a tree and comparing different trees with the goal of finding the tree with the best (optimal) score. [also called objective function]

Molecular phylogenetics: most common methods

                        Type of data
Tree-building method    Character                  Distance

Clustering algorithm    -                          UPGMA
                                                   Neighbor-joining (NJ)

Optimality criterion    Maximum parsimony (MP)     Least squares
                        Maximum likelihood (ML)    Minimum evolution (ME)

Molecular phylogenetics: maximum parsimony

The parsimony principle is derived from the philosophical principle called Occam’s Razor: plurality should not be posited without necessity (Pluralitas non est ponenda sine necessitate; William of Occam, medieval English philosopher [ca. 1285-1349]).

The “simplest” hypothesis is the one that is chosen under the maximum parsimony criterion (principle of economy).

Molecular phylogenetics: maximum parsimony as optimality criterion

Simple dataset: one positionally homologous DNA site with two character states (A and G)

Species 1  A
Species 2  A
Species 3  G
Species 4  G

We do not know if these states reflect homology or homoplasy. What do we do?

Tree length: the number of character-state changes required on a tree, counted in “steps”

What happens if we consider the other two character states that are possible?

Molecular phylogenetics: maximum parsimony as optimality criterion

Parsimony optimality criterion: minimum tree length in steps
• can be used to reconstruct ancestral character states
• can be used to compare different tree topologies
• tree length is a sum over all the characters in a given dataset
• e.g., tree 1 = 10 steps; tree 2 = 27 steps


Shortest tree (in steps) = “most parsimonious tree”

What if two or more trees share the lowest number of steps? These are “equally parsimonious trees”.

Molecular phylogenetics: maximum parsimony as optimality criterion

Example of the maximum parsimony principle in phylogenetics:

SITE        1 2 3 4 5 6 7 8 9 0 1 2
SPECIES 1   A T G T T G T G A T A A
SPECIES 2   A T G T T C T G G T A A
SPECIES 3   A T G T T A T C A T A A
SPECIES 4   A T G T T A T C G T A A
                      *   * *
(* = the variable sites 6, 8 and 9)

Lengths of three possible trees:
TREE 1, ((1,2),(3,4)): 5 steps
TREE 2, ((1,3),(2,4)): 5 steps
TREE 3, ((1,4),(2,3)): 6 steps

[Figure: the character states at sites 6, 8 and 9 mapped onto each of the three trees, with the reconstructed ancestral states at the internal nodes.]
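The three tree lengths in this example can be checked with a short script. The sketch below counts steps with Fitch's algorithm for unordered characters. That is one standard way to count steps, and the slides do not say which counting procedure was used. Each four-taxon tree is written as a pair of pairs (rooted on its internal branch, which does not change the length). The function names are illustrative.

```python
def fitch(tree, states):
    """Return (possible ancestral states, steps) for one site.

    tree is either a taxon name or a tuple (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two subtrees agree on at least one state: no change needed here.
        return shared, left_steps + right_steps
    # No shared state: one change is required on this part of the tree.
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, alignment):
    """Tree length = sum of steps over all sites in the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total

# The 12-site example, one string per species.
alignment = {
    "1": "ATGTTGTGATAA",
    "2": "ATGTTCTGGTAA",
    "3": "ATGTTATCATAA",
    "4": "ATGTTATCGTAA",
}

trees = {
    "TREE 1": (("1", "2"), ("3", "4")),
    "TREE 2": (("1", "3"), ("2", "4")),
    "TREE 3": (("1", "4"), ("2", "3")),
}

lengths = {name: tree_length(tree, alignment) for name, tree in trees.items()}
print(lengths)
```

Only sites 6, 8 and 9 contribute steps; the nine constant sites add nothing to any tree, so they cannot help choose among topologies.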

Molecular phylogenetics: distance method with an optimality criterion

Least squares method: criterion is the fit of the pairwise distances to the branch lengths of the tree.

Minimum evolution: criterion is the sum of the branch lengths (t or L).

Linear programming techniques under branch length constraints

Example of distance-based approach to molecular phylogenetics:

Obtain set of homologous gene sequences and produce an alignment.

Transform primary data into a matrix of pairwise genetic distance values. (Most often done by using a model of evolution.)

Select a method of inferring a tree from distance data; in this case it is the least squares method.

In this case, determine the S statistic for the set of candidate trees, and select a tree that minimizes S. Note that S is a function of both the tree topology and its branch lengths.

[Figure: tree of human, chimp, gorilla and orang]
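The slides name the S statistic without writing it out. A commonly used textbook form of the least-squares criterion, offered here as an assumption about what S denotes, is:

```latex
S = \sum_{i<j} w_{ij}\,\bigl(D_{ij} - d_{ij}\bigr)^{2}
```

Here D_ij is the observed pairwise distance between taxa i and j, d_ij is the sum of the branch lengths on the path joining i and j in the candidate tree, and w_ij is a weight (w_ij = 1 for ordinary least squares; w_ij = 1/D_ij^2 is another common choice). Because d_ij depends on which branches lie on the path, S changes with both the topology and the branch lengths.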

Molecular phylogenetics: distance method with an algorithm

Clustering algorithm: computationally “build a tree” according to a specific set of “steps”.

Common algorithms:
1. Stepwise addition
2. Star decomposition
3. Quartet puzzling
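What “a specific set of steps” means can be seen in a minimal sketch of UPGMA, the simplest clustering method in the table of common methods (it is not one of the three algorithms listed above). The distances are made-up illustrative numbers, not data from the lecture, and the function returns only the topology as nested tuples, without branch lengths.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names: list of taxon names.
    dist:  dict mapping (name_a, name_b) -> distance, one entry per pair.
    """
    # Each cluster id maps to (subtree, number of leaves in the subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    index = {name: i for i, name in enumerate(names)}
    d = {frozenset((index[a], index[b])): v for (a, b), v in dist.items()}
    next_id = len(names)
    while len(clusters) > 1:
        # Step 1: find the pair of clusters separated by the smallest distance.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Step 2: the distance from the merged cluster to every other cluster
        # is the size-weighted average of the two old distances.
        for k in clusters:
            d[frozenset((next_id, k))] = (
                n_i * d[frozenset((i, k))] + n_j * d[frozenset((j, k))]
            ) / (n_i + n_j)
        # Step 3: replace the two clusters by the merged one and repeat.
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree

# Hypothetical distances, chosen only to illustrate the steps.
names = ["human", "chimp", "gorilla", "orang"]
dist = {
    ("human", "chimp"): 2.0,
    ("human", "gorilla"): 4.0,
    ("chimp", "gorilla"): 4.0,
    ("human", "orang"): 8.0,
    ("chimp", "orang"): 8.0,
    ("gorilla", "orang"): 8.0,
}

tree = upgma(names, dist)
print(tree)
```

The algorithm never scores or compares alternative trees: it follows its steps once and stops. That is the difference from an optimality criterion.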

Molecular phylogenetics: distance method with an algorithm

Obtaining a tree by star decomposition

[Figure: four successive stages of star decomposition for taxa A to F, starting from a star tree and grouping lineages step by step.]

Molecular phylogenetics: maximum likelihood as optimality criterion

Maximum likelihood is a statistical tool to answer the question “how adequate is a given hypothesis at explaining the data in hand?”
• Data: multiple alignment
• Hypothesis: Tree (and model of sequence evolution)

The method of maximum likelihood measures this notion of “adequacy” by the probability of observing the data (e.g., DNA sequence) given a hypothesis (e.g., tree topology and model).

The model is required because we need to compute probabilities of character-state change.

We write the likelihood as L = Prob(D|H). (Note: trees are scored in log-likelihoods.)
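One reason for working with log-likelihoods can be written out explicitly. The sketch below assumes that sites evolve independently. This is a standard modelling assumption that the slide does not state. The symbols D_i (column i of the alignment), T (the tree with its branch lengths) and θ (the parameters of the substitution model) are introduced here for illustration.

```latex
L = \Pr(D \mid H) = \prod_{i=1}^{n} \Pr(D_i \mid T, \theta),
\qquad
\ln L = \sum_{i=1}^{n} \ln \Pr(D_i \mid T, \theta)
```

Each per-site probability is small, so the product over n sites quickly becomes too small to represent numerically; the sum of logarithms stays manageable and is maximized by the same tree and parameter values.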


Molecular phylogenetics: another BIG problem

Optimality-based methods require comparison of different trees.

The notion of the “best”, or optimal, tree implies that all trees have been scored and compared.

Tree space is vast!

N_T = 3 × 5 × 7 × … × (2n − 5), where n is the number of lineages

Molecular phylogenetics: tree space is vast!

Number of lineages    Number of unrooted trees
3                     1
4                     3
5                     15
6                     105
7                     945
8                     10,395
9                     135,135
10                    2,027,025
11                    34,459,425
12                    654,729,075
13                    13,749,310,575
14                    316,234,143,225
15                    7,905,853,580,625
20                    221,643,095,476,699,771,875
50                    ~3 × 10^74
100                   ~2 × 10^182

Getting close to Eddington’s number!!! [Avogadro’s number is only 6 × 10^23]
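The product formula is easy to evaluate directly, which gives a quick check on any entry in the table. A minimal sketch follows (the function name is illustrative).

```python
def n_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n lineages: 3 x 5 x ... x (2n - 5)."""
    if n < 3:
        raise ValueError("the formula applies to three or more lineages")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # k = 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (3, 4, 5, 10, 15, 20):
    print(n, f"{n_unrooted_trees(n):,}")
```

Python integers have arbitrary precision, so the same function also returns exact values for 50 or 100 lineages. Each additional lineage multiplies the count by the next odd number, which is why exhaustive search stops being practical very early.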

Molecular phylogenetics: big datasets require heuristic methods

Heuristic: a procedure that seeks a solution with no guarantee of finding the globally optimal solution.

Phylogenetics uses branch-swapping algorithms to search tree space.

Branch-swapping algorithms make small rearrangements in the tree topology, and the optimality criterion is recalculated after each rearrangement.

These methods work by trying to climb a hill (blind)

You have seen “hill-climbing” before…

[Figure: likelihood plotted against p (horizontal axis 0 to 1, vertical axis 0 to 0.25). Maximum likelihood score = 0.228; ML estimate of p = 0.42]

Gene 1: ATG ATC CTG CCG ACT … … … … … … TAA
Gene 2: ATG ATT CTG CCA ACT … … … … … … TAA

[Figure: mouse (gene 1) and human (gene 2) joined to their common ancestor by branches of length t0 and t1]

Genetic distance between gene 1 and 2 (t) = t0 + t1
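Hill-climbing on a likelihood curve can be sketched for the distance t between two sequences. The example uses only the first five codons shown for gene 1 and gene 2 and assumes the Jukes-Cantor (JC69) substitution model. The model choice, the starting value, the step size and the function names are all assumptions made for illustration, not the analysis behind the figures.

```python
import math

# First five codons of each gene, as shown above (15 sites, 2 differences).
GENE_1 = "ATGATCCTGCCGACT"
GENE_2 = "ATGATTCTGCCAACT"

def log_likelihood(t, seq_a, seq_b):
    """JC69 log-likelihood of two aligned sequences separated by distance t."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)   # probability of an identical pair at a site
    p_diff = 0.25 * (0.25 - 0.25 * e)   # probability of one specific mismatch
    return sum(math.log(p_same if x == y else p_diff)
               for x, y in zip(seq_a, seq_b))

def hill_climb(f, x, step=0.1, tol=1e-6, lower=1e-9):
    """Walk uphill in fixed steps; halve the step when no neighbour is better."""
    while step > tol:
        neighbours = [c for c in (x - step, x + step) if c > lower]
        best = max(neighbours, key=f)
        if f(best) > f(x):
            x = best          # a neighbour is higher: move there
        else:
            step /= 2.0       # stuck at this resolution: look more finely
    return x

t_hat = hill_climb(lambda t: log_likelihood(t, GENE_1, GENE_2), x=0.5)

# For JC69 the maximum is also known in closed form, which checks the climb.
p = sum(x != y for x, y in zip(GENE_1, GENE_2)) / len(GENE_1)
t_closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(t_hat, t_closed_form)
```

With one parameter and a single peak, the climb reaches the closed-form answer. The same strategy offers no such guarantee on a surface with several peaks, which is the situation in tree space.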

Molecular phylogenetics: hill climbing

An extremely simplified example (but from real data):

Task 1: parameter estimation
Parameters: t and ω
Gene: acetylcholine α receptor

[Figure: likelihood surface as a function of t and ω for the human-mouse acetylcholine receptor gene; mouse and human are joined to their common ancestor, separated by distance t]

Molecular phylogenetics: the problem with hill climbing

Task 1: Estimation of dS and dN
• Numbers of substitutions are calculated from … and ….
• Numbers of sites (S and N) are calculated from … by fixing ω = 1.

For a detailed explanation, see Yang, Z. 2006. Computational Molecular Evolution. Oxford University Press, Oxford, England.


Random versus systematic error

Random error is defined as the deviation between the true value of a parameter and an estimate of that parameter that is due solely to the effects of finite sample size.

Systematic error is the deviation between the true value of a parameter and an estimate that is due to incorrect assumptions of the estimation method.

*An important difference between these two types of error is that while random error decreases with increasing sample size, systematic errors persist, and sometimes intensify, as sample sizes increase.

See lecture notes for much more detail…

Uncertainty in a phylogeny: bootstrap

Bootstrap: sample with replacement ⇒ pseudo-sample

Jackknife: sample without replacement ⇒ pseudo-sample

A phylogenetic tree is not a numerical quantity

Bootstrap: sampling a data matrix with replacement

Original data:
Site         1  2  3  4  5  6  7  8  9 10
Species 1    T  C  A  G  T  T  C  G  A  T
Species 2    C  C  G  G  T  G  A  C  A  T
Species 3    A  C  A  T  T  T  A  G  A  A
Species 4    G  C  A  T  T  G  A  C  A  A

One possible bootstrap pseudosample:
Site         9  4  3  1  4  8 10  3  9 10
Species 1    A  G  A  T  G  G  T  A  A  T
Species 2    A  G  G  C  G  C  T  G  A  T
Species 3    A  T  A  A  T  G  A  A  A  A
Species 4    A  T  A  G  T  C  A  A  A  A
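The resampling step can be sketched in Python. Whole columns (sites) are drawn with replacement, so every species keeps its own state at each sampled site. The function names and the random seed are illustrative. The second half of the example replays the site order shown in the pseudosample above.

```python
import random

# The original data matrix, one string per species (sites 1 to 10).
original = {
    "Species 1": "TCAGTTCGAT",
    "Species 2": "CCGGTGACAT",
    "Species 3": "ACATTTAGAA",
    "Species 4": "GCATTGACAA",
}

def resample(alignment, sites):
    """Build a pseudosample from a list of 0-based site indices."""
    return {name: "".join(seq[i] for i in sites)
            for name, seq in alignment.items()}

def bootstrap_pseudosample(alignment, rng):
    """Draw as many sites as the original has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    sites = [rng.randrange(n_sites) for _ in range(n_sites)]
    return sites, resample(alignment, sites)

# A random pseudosample (the seed is arbitrary).
sites, pseudo = bootstrap_pseudosample(original, random.Random(1))

# Replaying the site order from the example above (1-based on the slide).
slide_sites = [9, 4, 3, 1, 4, 8, 10, 3, 9, 10]
slide_pseudo = resample(original, [s - 1 for s in slide_sites])
for name, seq in slide_pseudo.items():
    print(name, " ".join(seq))
```

In a full bootstrap analysis a tree is estimated from each of many pseudosamples, and branch support is the proportion of those trees that contain a given grouping. That step is not shown here.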


Rooted trees provide polarity

[Figure: three trees for four taxa with character states G, G, A, A. Left: “G” is derived and “A” is ancestral. Centre: we don’t know what is derived and what is ancestral. Right: “A” is derived and “G” is ancestral.]

Rooting a phylogeny with an outgroup

Let’s define some terms:

INGROUP: A group of lineages, assumed to be monophyletic, whose phylogenetic relationships are of primary interest.

OUTGROUP: One or more terminal taxa that are assumed to be outside of the monophyletic group that has been specified as the ingroup. Unlike the ingroup, the outgroup does not have to be monophyletic.

ROOT: The most ancestral point of a phylogeny. The root orients the direction of change along a phylogeny relative to time.

CHARACTER POLARITY: The evolutionary relationship between two or more states for a given character. Say we have a character with two states, “a” and “b”. By mapping them on a phylogeny we can determine that “b” preceded “a” in evolutionary history; hence “a” is the derived state and “b” is the primitive state.

Rooting a phylogeny with an outgroup

Rooting a phylogenetic tree by placing the root between the ingroup and outgroup

IG: ingroup
OG: outgroup

[Figure, three panels: an unrooted tree of OG and IG-1 to IG-4; placing the root between ingroup and outgroup; the resulting rooted tree]
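Outgroup rooting can be sketched as a small tree operation: put the root on the branch that joins the outgroup to the rest of the tree, then read the tree outward from there. The unrooted topology below, including the internal node labels x, y and z, is a hypothetical arrangement of the five taxa chosen for illustration; it is not read off the figure.

```python
def root_on_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to a single outgroup taxon.

    adjacency maps every node to the list of its neighbours.
    Returns the rooted tree as a Newick string.
    """
    (neighbour,) = adjacency[outgroup]   # a terminal taxon touches one node only

    def subtree(node, parent):
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node                  # terminal taxon
        return "(" + ",".join(subtree(c, node) for c in children) + ")"

    # The new root sits between the outgroup and everything else.
    return "(" + outgroup + "," + subtree(neighbour, outgroup) + ");"

# Hypothetical unrooted tree: x, y and z are unnamed internal nodes.
unrooted = {
    "OG": ["x"],
    "IG-4": ["x"],
    "IG-3": ["y"],
    "IG-1": ["z"],
    "IG-2": ["z"],
    "x": ["OG", "IG-4", "y"],
    "y": ["x", "IG-3", "z"],
    "z": ["y", "IG-1", "IG-2"],
}

rooted = root_on_outgroup(unrooted, "OG")
print(rooted)
```

The unrooted relationships are unchanged by this operation; what the root adds is direction, which is what allows character polarity to be read from the tree.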

See lecture notes for much more detail…
