Today’s topics

Inferring phylogeny Overview of phylogenetic inferences Methodology Introduction ! Distance methods ! Parsimony method Methods using distance data UPGMA, Neighbor-joining, minimum Methods using discrete data !"#$%&'(!)* Maximum parsimony +,- Maximum likelihood Bayesian inference .'/01!23454(6!7!2845*0&4'9#6!:&454(6 ;<=>?@AB=C?DEF

1 2

Milestones of molecular evolution studies Contributions to molecular evolution 1800 1900 2000 The patterns of evolution Cladistics and : Emergence of cladistics: Willi Hennig et al. The blooming of Willi Hennig (1913-1976) evo devo studies Discovery of DNA structure: James Watson & Francis Technical breakthrough Crick (1953) in the 1990s: computer The mechanisms Genetics: Gregor Mendel speed, methodological (1856-1863) First comparison of amino improvement The neutral theory of molecular evolution: acid sequence (insulin) Motoo Kimura (!"#$, 1924-1994) among different species: Technical breakthrough in Fred Sanger et al. (1955) the 1980s: PCR, DNA sequencing, etc. First example of molecular systematics: Vince Sarich The models and methodological concerns & Allan Wilson (1967) on The uses of parsimony, maximum likelihood methods in phylogenetics: Origin of species: Charles human evolution Joseph Felsenstein Darwin (1859) Genome Science and of , , Seattle

Nature (1968) 217: 6243 3 The neutral theory of molecular evolution 4

Reconstructing phylogeny The basics of evolutionary trees

The concepts of trees The tree shapes The data used to construct phylogenetic trees Morphological data Molecular data Homology of characters in data matrices Sequence alignment

Are these two trees different?

5 Baum et al. (2005) Science 310:9796 5 6 Most recent common ancestor (MRCA) Types of data

Character and character types Quantitative and qualitative characters Binary and multistate characters Assumptions about character evolution Substitution models Step matrix

Which is the MRCA of a mushroom and a sponge? d Which is the MRCA of a mouse and a fern? b

Plant Molecular Evolution 8 7 8

Distance method Evolution of DNA sequences

UPGMA (Unweighted Pair Group Method using A SpA ACAGCCGATTGT Arithmetic averages) TCAGCCGACTGT Transformed distance method T TCAGCCGACTGT Neighbor-Joining method C SpB TCAGCCGATTGT Minimum evolution TCAGACGACTGT

SpC TCAGCCGACTGT TCAGACGACTGT

SpD TCAGACGACTGT

9 10 9 10

In distance method, sequences were grouped based on their Measuring genetic distance similarity, we need to measure the differences among sequences How different are the two sequences we have? SpA ACAGCCGATTGT The ways of measuring nucleotide substitution Simplest way Hamming distance = n/N x100% SpB TCAGCCGATTGT SpA ACAGCCGA-TGT SpB TCAGCCGA-TGT SpC TCAGCCGACTGT 2/12 = 16.6% SpC TCAGCCGACTGT 1/12 = 8.3% SpD TCAGACGACTGT SpD TCAGACGACTGT

11 12 Measuring differences between sequences The probability of nucleotide substitution

The common ways SpC TCAGCCGACTGT Direct counting SpD TCAGACGACTGT The probability of substitution at certain nucleotide site change from i to j. In order to know the probability of the Nucleotide substitution models sequences to be different (p), we can calculate Jukes & Cantor(1969)’s one parameter model the probability of the sequences being the same Kimura(1980)'s two parameter model (I(t)) first. F81 (Felsenstein, 1981) Let’s start with Jukes-Cantor model HKY85 (Hasegawa et al., 1985) Generalized time reversible model (GTR)

13 13 14

We can just use another estimator, K, the number of substitutions per site since time of Example of calculation divergence between the two sequences If there are 80 transitions and 20 transversions between two

" sequences (length=1000bp), the number of substitutions per A G site K can be estimated: X Under J-C model, 3"t 3"t " " " " 3 4 Y Z K = ! ln(1! p) = 0.10732 (p=0.1) " 4 3 C T Therefore, K = 2 x 3"t = 6"t Under Kimura 2P model: 1 1 1 1 4 3 4 K = ln( ) + ln( ) (P=0.08, Q=0.02) ! 8!t = "ln(1" p) " K = ! ln(1! p) 2 1! 2P ! Q 4 1! 2Q 3 4 3 = 0.10943

where p = the observed proportions of different nucleotides between the two sequences where P = the proportions of transitions and Q = the proportions of transversions. p is the proportion of total substitution. 16 15 16

Overview of Distance method UPGMA

OTU A B C 1. Transform the data OTU (AB) C d matrix into distance table B AB d C (AB)C d d C AC BC d d D (AB)D CD d d d D AD BD CD

where d(AB)C = (dAC + dBC)/2 and d(AB)D = (dAD + dBD)/2

2. Use this new data matrix to evaluate/ construct the tree

(Page & Holme 1998)17 (Graur & Li 2000)18 17 18 Example Example

Felsenstein (2004) 19 Felsenstein (2004) 20 19 20

Example UPGMA tree

Felsenstein (2004) 21 Felsenstein (2004) 22 21 22

3 Neighbor-Joining (NJ) method True tree SP1: AAACGCGCTTAGATCGCCTAACGTACAAGC

10 SP2: TATAAAGGGTAGAAAACCTAAGGATCAACG Saitou & Nei (1987); Studier N1: AATCGCGCTTAGATCGCCTAACGTACAACG 2 & Keppler (1988)

N2: AATCGCGGGTAGATCGCCTAACGTACAACG 3 SP3: AATCGCGGGTATTTCGCCTAACGTACATCG NJ does not assume a clock and approximates the minimum evolution method 7 SP1 SP2

SP2 15 4 SP1 3 SP3 8 13 4 SP3 SP1 SP2 SP3

24 23 24 NJ procedure 1. For each tip, compute n Limitation on distance methods Di ui = # j: j"i (n ! 2) All nucleotide sites change independently. i j D -u -u 2. Choose the and for which ij i j When evolutionary rates vary from site to site, is smallest than the data set needs to be corrected 3. Join i and j, compute the new node The substitution rate is constant over time and in (ij) to every other nodes different lineages. (Dik + D jk ! Dij ) The base composition is at equilibrium. D(ij ),k = 2 The conditional probabilities of nucleotide

4. Delete tips i and j and make new substitutions are the same for all sites and do not table with the new node (ij) change over time.

5. Continue till resolve the tree 25 26 25 26

Discrete methods Parsimony methods

Maximum parsimony (%&'()*) The goal is to find the most parsimonious tree Maximum likelihood (+,-./)*) The criteria are to calculate the changes of Bayesian inference (01234567) character states, i.e. the evolutionary steps Others First, we have to know the way to evaluate a given tree

28 27 28

A simple data matrix w/ discrete characters A simple data matrix w/ discrete characters SpA: TCAGACGATTGTCAGACCATTG For position no. 3 SpA: TCAGACGATTGTCAGACCATTG SpB: TCAGTCGACTGTCAAACCATTG SpB: TCAGTCGACTGTCAAACCATTG SpC: TCGGTCAATTGTCAAACGATTG SpC: TCGGTCAATTGTCAAACGATTG SpD: TCGGTCAATTGTCAAACGATTG SpD: TCGGTCAATTGTCAAACGATTG

A A G G A A G G A G G A SpA SpB SpC SpD SpA SpB SpD SpC SpA SpC SpD SpB SpA SpB SpC SpD SpA SpB SpD SpC SpA SpC SpD SpB

A to G

A A G G SpA SpB SpC SpD

Character change (step) = 2 Character change (step) = 1

29 Character change (step) = 2 30 29 30 A simple data matrix w/ discrete characters A simple data matrix w/ discrete characters For position no. 7 SpA: TCAGACGATTGTCAGACCATTG SpA: TCAGACGATTGTCAGACCATTG SpB: TCAGTCGACTGTCAAACCATTG SpB: TCAGTCGACTGTCAAACCATTG SpC: TCGGTCAATTGTCAAACGATTG SpC: TCGGTCAATTGTCAAACGATTG SpD: TCGGTCAATTGTCAAACGATTG SpD: TCGGTCAATTGTCAAACGATTG

G G A A G G A A G A A G G G A A G G A A G A A G SpA SpB SpC SpD SpA SpB SpD SpC SpA SpC SpD SpB SpA SpB SpC SpD SpA SpB SpD SpC SpA SpC SpD SpB

G to A Character change (step) = 9 Character change (step) = 9 Character change (step) = 6

Character change (step) = 2 Character change (step) = 2 Character change (step) = 1

31 32 31 32

Consensus trees

Finding the underlying information among trees

So, how many trees out there are we dealing with?

(Page & Holme 1998)33 34 33 34

Tree searching 3 taxa The number of unrooted trees: (2n ! 5)! N = unrooted 2n!3(n ! 3)!

4 taxa The number of rooted trees: (2n ! 3)! N = rooted 2n!2(n ! 2)!

35 36 35 36 Taxon number All possible unrooted tree number 3 1 Note 4 3 5 15 6 105 6 7 945 Say a computer can evaluate 10 trees per 8 10,395 9 135,135 second. 10 2,027,025 11 34,459,425 If we want to evaluate all of the trees for 25 taxa, 12 654,729,075 13 13,749,310,575 we will need 14 316,234,143,225 x = 1029/106/60/60/24/365 = 3.17x1015!"#$% 15 7,905,853,580,625 16 213,458,046,676,875 !& 17 6,190,283,353,629,375 18 191,898,783,962,510,625 19 6,332,659,870,762,850,625 We got to have some ways to approximate the true tree. 20 221,643,095,476,699,771,875 21 8,200,794,532,637,891,559,375 22 319,830,986,772,877,770,815,625 23 13,113,070,457,687,988,603,440,625 24 563,862,029,680,583,509,947,946,875 25 25,373,791,335,626,257,947,657,609,375 >1029 ... etc. 37 38 37 38

Ways to solve time paradigm The procedure can be simplified

Skip unnecessary calculation SpA: TCAGACGATTGTCAGACCATTG Change the tree searching ways SpB: TCGGTCGACTGTCAGACCATTG SpC: TCAGTCGATTGTCA-ACGATTG SpD: TCAGTCGATTGTCA-ACGATTG SpE: TCAGTCGATCGTCA-ACGATTG

Parsimonious Parsimonious uninformative informative

39 40 39 40

Searching for optimal trees Exhausted search Exhausted search Branch-and-bound method Heuristic approach Stepwise/closest/random addition Star decomposition Branch swapping

41 Page & Holme (1998) Molecular evolution 41 42 Tree-island profile: First Last First Times Distribution of tree length Island Size tree tree Score replicate hit ------Frequency distribution of lengths of 1000000 random trees: 1 101 1471 1571 3890 3 1 Results of heuristic search, 2 69 4961 5029 3890 10 1 3 1 14236 14236 3890 30 1 with random addition from mean=6528.095481 sd=65.005662 g1=-0.322317 g2=0.159848 4 5 14237 14241 3890 31 1 random starting points 6114.00000 /------5 1236 1 1236 3891 1 2 6146.30000 | (1) 6 96 8816 8911 3891 20 1 6178.60000 | (3) 7 256 9524 9779 3891 23 2 6210.90000 | (24) 8 960 14242 15201 3891 32 2 6243.20000 | (84) 9 624 15202 15825 3891 33 1 10 3085 15870 18954 3891 37 1 6275.50000 | (302) 11 1488 1911 3398 3892 5 1 6307.80000 | (1071) 12 626 3448 4073 3892 7 2 6340.10000 |# (3493) 13 211 4074 4284 3892 8 1 6372.40000 |### (9300) 14 96 5935 6030 3892 12 1 6404.70000 |######## (22649) 15 156 7211 7366 3892 14 2 6437.00000 |################# (48576) 16 236 7367 7602 3892 15 2 17 1053 7667 8719 3892 18 1 6469.30000 |################################# (94873) 18 291 9780 10070 3892 24 1 6501.60000 |################################################## (144323) 19 576 10071 10646 3892 25 1 6533.90000 |################################################################# (187000) 20 96 18987 19082 3892 39 1 6566.20000 |##################################################################### (199355) 21 676 4285 4960 3893 9 1 6598.50000 |##################################################### (153445) 22 905 5030 5934 3893 11 1 6630.80000 |############################### (88764) 23 96 8720 8815 3893 19 1 6663.10000 |############# (36489) 24 612 8912 9523 3893 21 1 25 32 18955 18986 3893 38 1 6695.40000 |### (8865) 26 18 19083 19100 3893 40 1 6727.70000 | (1266) 27 339 1572 1910 3894 4 2 6760.00000 | (117) 28 64 7603 7666 3894 17 1 \------29 3589 10647 14235 3894 26 1 30 234 1237 1470 3895 2 1 31 49 3399 3447 3895 6 1 32 44 15826 15869 3895 35 1 Data of bird Data of bird ovomucoids 33 1180 6031 7210 3896 13 1 ovomucoids

43 44

Branch and bound searching A shortest Hamiltonian path problem

Point x y Hendy & Penny (1982) first used this method for 1 0.537 0.061 2 0.274 0.222 inferring phylogenies. 3 0.016 0.837 Random route: 4 0.871 0.400 Total length=5.4342 An example - shortest Hamiltonian path (SHP) 5 0.399 0.740 6 0.815 0.531 For n cities, there will be n-1 cities to come next, so 7 0.587 0.946 8 0.992 0.733 there will be n! possible solution 9 0.268 0.481 10 0.895 0.068

Adding by greedy Optimal solution: manner: Length=2.7812 Length=2.8027

Hendy, M.D. & D. Penny (1982) Mathem. Biosci. 60: 133-142.

Felsenstein (2004) Inferring phylogeny45 Felsenstein (2004) Inferring phylogeny46 45 46

Branch-and-bound heuristic searching for best tree search

#This method can greatly improve the speed of search, but it still cannot escape from the constraints of the NP- hardness proof.

Page & Holme (1998) Molecular evolution 47 Felsenstein (2004) Inferring phylogeny48 47 48 Heuristic approach- Branch swapping: NNIs How greedy should we be?

In a tree with n tips, there will be n-3 interior Nearest- branches. neighbor In all, 2(n-3) neighbors will be examined for interchanges each tree. (NNI) Should we do NNI for each of the best trees, or stop at some point, or should we start from scratch more often?

Swofford et al. (1996) Molecular Systematics49 50 49 50

The space of all 15 possible unrooted trees The space of all 15 possible unrooted trees

NNI is to moving in this graph

Felsenstein (2004) Inferring phylogeny51 Felsenstein (2004) Inferring phylogeny52 51 52

Heuristic approach- Branch swapping: TBR Browsing tree space

Rearrangement Tree bisection and Random starting point strategy to avoid being reconnection (TBR) trapped in “tree island” (Maddison 1991) Parsimony ratchet Parsimony ratchet uses a re-weighted subset of characters as a starting point to browse the tree In a n1+n2 species in the tree, there will space, and tries to find as many tree islands as be (2n1-3)(2n2-3) possible ways to reconnect the two trees. possible by using adjacent-tree searching method. (Nixon 1999)

Maddison, D.R. (1991) Syst. Zool. 40: 315-328. Nixon, K.C. (1999) Cladistics 15: 407-414. Swofford et al. (1996) Molecular Systematics53 54 53 54 Parsimony ratchet Parsimony ratchet procedure

Described by Nixon (1999), and available in Generate a starting “Wagner” tree various programs (NONA, WINCLADA, Randomly select a subset of character (5-25%) PAUPRat, etc.). and add 1 weight score Parsimony ratchet uses a re-weighted subset of Performing TBR branch swapping, keep 1 tree characters as a starting point to browse the tree Reset the weighting to original space, and tries to find as many tree islands as possible by using adjacent-tree searching Performing branch swapping, keep the best tree method. Return to step 2, repeat the iteration (step2-6) 50-200 times

Nixon, K.C. (1999) Cladistics 15: 407-414. 55 56 55 56

Notes on parsimony ratchet

It seems to improve the effectivity of tree searching It can be used with any objective function based on character data: compatibility, distance matrix, and likelihoods. The strategy can be modified

Nixon, K.C. (1999) Cladistics 15: 407-414.57 58 57 58

Weighted parsimony method The Sankoff algorithm applied to the tree

Incorporating simple nucleotide models into MP analysis C A C A G ! 0 ! ! 0 ! ! ! ! 0 ! ! 0 ! ! ! ! ! 0 ! Transition vs. transversion scores

A C G T 2.52.5 3.5 3.5 1 5 1 5 A 0 2.5 1 2.5 A C G T 3.5 3.5 3.5 4.5 C 2.5 0 2.5 1 A 0 2.5 1 2.5 G 1 2.5 0 2.5 C 2.5 0 2.5 1 1 2.5 0 2.5 6 6 7 8 G A C G T S = min S0(i) T 2.5 1 2.5 0 T 2.5 1 2.5 0 Modified from Felsenstein (2004) Inferring phylogeny 60 59 60 Evaluating trees

Character properties (CI, RI) How do we know the Examining how “clean” is the data on a given tree Confidence level of trees tree we got is right? Comparisons between obtained trees Partition distance Kishino-Hasegawa test The evaluation Distance test Likelihood ratio test Bootstrap/jackknife (internal support) Bremer support (for parsimony)

62 61 62

Properties of characters Properties of characters

Consistency index (CI) of each character is: Retention index (RI) of each characters is:

character CI = mi/Si M i ! si Tree CI for all characters in a specific tree: character RI = n M i ! mi !miw i i=1 tree CI = n w s ! i i m ='()*+,-./i01/23)*+,45 (minimum conceivable steps) i=1 i Mi ='()*+,-./i01/2<)*+,45 (maximum conceivable steps) 06+,-./ 01/+,45 mi ='()*+,-./i01/23)*+,45 (minimum conceivable steps) Si = i

Si =06+,-./i01/+,45

wi = i01/789n:01;

63 64 63 64

Example: calculating CI & RI Evaluating trees

Character properties (CI, RI) Tree 1 Tree 2 Confidence level of trees Taxon1 0 Taxon1 0 Taxon1 0 0 0 0 Taxon2 0 0 0 1 Taxon2 0 Taxon4 1 Comparisons between obtained trees Taxon3 1 1 0 1 Taxon3 1 Taxon3 1 Taxon4 1 1 1 1 Bootstrap/jackknife (internal support) Taxon5 1 1 0 1 Taxon4 1 Taxon2 0 Taxon5 1 Taxon5 1 Bremer support (for parsimony)

Character(1) CI=1/1=1 Character CI=1/2=0.5 Character(1) RI=(2-1)/(2-1)=1 Character RI=(2-2)/(2-1)=0

1+1+1+1 1+1+1+1 TreeCI = =1 TreeCI = = 0.67 1+1+1+1 2 + 2 +1+1

65 66 65 66 Characters 1 2 3 4 5 ...... N A Bootstrap method B C Bootstrap method Taxa D E F G

7 33 29 72 7 86 45 78 16 98 21 2 47 108

M times replica; M=100, 500, 1000, or any desired number

N N

Make a tree Make a tree Make a tree for each replica

100 100

100 Make a consensus tree based on majority rule, numbers indicates 100 99 the % of supported topology. 100 Usually only >50% branches are shown.

100 86 100

Diagram explaining the bootstrapping procedure used in phylogenetic analysis. Page & Holme (1998) Molecular evolution 67 68 67 68

1 Characters 1 2 3 4 5 ...... N Primate mtDNA 2 A Jackknife method B C Taxa D phylogeny 3 E F G 4 Take out X% of characters, X=30, 50, or any number 5 29 72 7 86 45 98 21 2 47 108 MP criteria Bootstrap 1000 replicates 6 M times replica; M=100, 500, 1000, or any desired number 7

N-X characters N-X characters 1234567 Freq % Make a tree Make a tree Make one tree for each replica ------.....** 1000.00 100.0% ...**** 999.00 99.9% 100 100 ....*** 945.00 94.5% 100 .**.... 497.33 49.7% 100 Make a consensus tree based on majority rule, numbers indicates Gorilla grouped with Homo/Pan 99 the % of supported topology 100 or other primates Usually only >50% branches are shown. ..***** 484.83 48.5% 100 86 100

Diagram explaining the Jackknife procedure used in phylogenetic analysis. 69 69 70

Bootstrapping Decay/support index

Pseudo-resampling. Also called ‘Bremer support’ (Bremer 1988, An approximate measure of repeatability and 1994; Donoghue et al. 1992) accuracy of data. An index of support calculating the difference in Will not correct the inconsistency caused by tree lengths between the shortest trees that reconstruction method. contain versus lack a specific group. The tree constructed in every replicate still depend Constraint trees can be generated by AutoDecay. on the methods you choose, therefore subject to associated problems

71 72 71 72 Method of calculation Bremer Support (decay index)

1. Obtain most parsimonious tree(s) (MP) the number of extra steps it takes to collapse a Generate a strict consensus tree group 2. Obtain all trees one step longer (MP + 1) example: 1 MP tree, 110 steps Generate a strict consensus tree If branch is not supported, Decay index = 1 Haliperp HPE232926 3. Obtain all trees one step longer (MP + 2) Arabaren AAU43231 Generate a strict consensus tree Arabsuec ASU43227 If branch is not supported, Decay index = 2 Arabthal ATU43224 4. Obtain all trees one step longer (MP + 3) Arabsuec ASU43226 Arabthal ATH232900 Generate a strict consensus tree If branch is not supported, Decay index = 3 Capsrube CRU232912 Capsrube CRU232913

73 74 73 74

Bremer Support (decay index) Bremer Support (decay index)

3 trees <= 111 steps 3 trees <= 111 steps

Haliperp HPE232926 Haliperp HPE232926

Arabaren AAU43231 Arabaren AAU43231

Arabsuec ASU43227 Arabsuec ASU43227

Arabthal ATU43224 Arabthal ATU43224

Arabsuec ASU43226 Arabsuec ASU43226

Arabthal ATH232900 Arabthal ATH232900

Capsrube CRU232912 Capsrube CRU232912

Capsrube CRU232913 Capsrube CRU232913

75 76 75 76

Bremer Support (decay index) Bremer Support (decay index)

5 trees <= 117 steps Can not be larger than the branch length No direct connection to branch length otherwise Quantification of support in a parsimony Haliperp HPE232926 framework Haliperp HPE232926 Arabaren AAU43231 6 Arabaren AAU43231 13 Arabsuec ASU43227 10 Arabsuec ASU43227 Arabthal ATU43224 13 Arabthal ATU43224

Arabsuec ASU43226 1 Arabsuec ASU43226 Arabthal ATH232900 2 Arabthal ATH232900

Capsrube CRU232912 15 Capsrube CRU232912

Capsrube CRU232913 26 Capsrube CRU232913

77 78 77 78 Methodological concerns Effects of sampling

Sampling problem Performance of phylogenetic methods under computer simulation

0 change 3 changes

Makes No.2 a fast evolving taxon

(Page & Holmes, 1998) 80 79 80

Long branch attraction Simulation analysis

Joe Felsenstein (1978, Syst. Zool. 27: 401-410)

UPGMA Parsimony

Page & Holme (1998) Molecular evolution81 Page & Holme (1998) Molecular evolution82 81 82

Equal rate of evolution Unequal rate of evolution

Page & Holme (1998) Molecular evolution83 Page & Holme (1998) Molecular evolution84 83 84 References of phylogenetics Parsimony

Weighted Graur, D. and W.-H. Li. 2000. Fundamentals of Molecular parsimony Evolution. 2nd ed., Sinauer Assoc., Sunderland, MA, USA.

UPGMA Hall, B. G. 2004. Phylogenetic trees made easy: a how-to manual, 2nd ed. Sinauer Assoc., Sunderland, MA. Hillis, D. M., C. Moritz, and B. K. Mable (eds) 1996. Molecular Lake!s invariants systematics. Sinauer Assoc., Sunderland, MA. Page, R. D. M., and E. C. Holmes. 1998. Molecular evolution - A Maximum likelihood (Jukes-Cantor) phylogenetic approach. Blackwell Science Ltd, Oxford, the United Kingdom. Maximum likelihood (Kimura) Yang, Z. H. 2006. Computational molecular evolution. Oxford University Press.

Weighted least squares

Swofford et al. (1996) Molecular Systematics85 86 85 86