Inferring Phylogeny
Today's topics

Overview of phylogenetic inferences
Methodology
  Methods using distance data
    UPGMA, Neighbor-joining, minimum evolution
  Methods using discrete data
    Maximum parsimony
    Maximum likelihood
    Bayesian inference

Inferring phylogeny

Introduction - Distance methods - Parsimony method

Milestones of molecular evolution studies
(Timeline: 1800 - 1900 - 2000)
  Genetics: Gregor Mendel (1856-1863)
  Origin of species: Charles Darwin (1859)
  Emergence of cladistics: Willi Hennig et al.
  Discovery of DNA structure: James Watson & Francis Crick (1953)
  First comparison of amino acid sequence (insulin) among different species: Fred Sanger et al. (1955)
  First example of molecular systematics: Vince Sarich & Allan Wilson (1967) on human evolution
  The neutral theory of molecular evolution
  Technical breakthrough in the 1980s: PCR, DNA sequencing, etc.
  Technical breakthrough in the 1990s: computer speed, methodological improvement
  The blooming of evo devo studies

Contributions to molecular evolution
  The patterns of evolution
    Cladistics and phylogenetics: Willi Hennig (1913-1976)
  The mechanisms
    The neutral theory of molecular evolution: Motoo Kimura (1924-1994), Nature (1968) 217: 624
  The models and methodological concerns
    The uses of parsimony, maximum likelihood methods in phylogenetics: Joseph Felsenstein, Genome Sciences and Biology, University of Washington, Seattle

Reconstructing phylogeny
  The basics of evolutionary trees
    The concepts of trees
    The tree shapes
  The data used to construct phylogenetic trees
    Morphological data
    Molecular data
  Homology of characters in data matrices
    Sequence alignment

Are these two trees different?
Baum et al.
(2005) Science 310: 979

Most recent common ancestor (MRCA)
  Which is the MRCA of a mushroom and a sponge? Node d
  Which is the MRCA of a mouse and a fern? Node b

Types of data
  Character and character types
    Quantitative and qualitative characters
    Binary and multistate characters
  Assumptions about character evolution
    Substitution models
    Step matrix
Plant Molecular Evolution

Distance method
  UPGMA (Unweighted Pair Group Method using Arithmetic averages)
  Transformed distance method
  Neighbor-Joining method
  Minimum evolution

Evolution of DNA sequences
(Figure: sequences evolving along a tree, with substitutions marked on the branches. The ancestral sequences shown at internal nodes include TCAGCCGACTGT and TCAGACGACTGT; the four tip sequences are listed below.)
  SpA ACAGCCGATTGT
  SpB TCAGCCGATTGT
  SpC TCAGCCGACTGT
  SpD TCAGACGACTGT

In distance methods, sequences are grouped based on their similarity, so we need to measure the differences among sequences.

Measuring genetic distance
  How different are the two sequences we have?
  The ways of measuring nucleotide substitution
  Simplest way: Hamming distance = n/N x 100%, where n is the number of sites that differ and N is the number of sites compared
    SpA ACAGCCGA-TGT
    SpB TCAGCCGA-TGT
    SpC TCAGCCGACTGT
    SpD TCAGACGACTGT
  Worked values on the slide: 2/12 = 16.7% and 1/12 = 8.3% (for instance, SpC and SpD differ at 1 of 12 sites)

Measuring differences between sequences
  The common ways
    Direct counting
    Nucleotide substitution models, including:
      Jukes & Cantor (1969) one-parameter model
      Kimura (1980) two-parameter model

The probability of nucleotide substitution
  SpC TCAGCCGACTGT
  SpD TCAGACGACTGT
  This is the probability that a given nucleotide site changes from i to j.
  To find the probability that two sequences differ at a site (p), we can first calculate the probability that they are the same (I(t)).
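The simple distance n/N described under "Measuring genetic distance" can be sketched in a few lines of Python. This is only an illustration: the function name `p_distance` is made up here, and the ungapped sequences from the "Evolution of DNA sequences" slide are used so that every site can be compared.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ (n/N)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n_diff = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return n_diff / len(seq_a)


# Tip sequences from the "Evolution of DNA sequences" slide.
seqs = {
    "SpA": "ACAGCCGATTGT",
    "SpB": "TCAGCCGATTGT",
    "SpC": "TCAGCCGACTGT",
    "SpD": "TCAGACGACTGT",
}

# SpA and SpC differ at sites 1 and 9; SpC and SpD differ at site 5 only.
print(f"SpA vs SpC: {p_distance(seqs['SpA'], seqs['SpC']):.1%}")
print(f"SpC vs SpD: {p_distance(seqs['SpC'], seqs['SpD']):.1%}")
```

The two printed values correspond to 2/12 and 1/12. Direct counting of this kind ignores multiple substitutions at the same site, which is what the substitution models are meant to correct for.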
Further nucleotide substitution models:
  F81 (Felsenstein, 1981)
  HKY85 (Hasegawa et al., 1985)
  Generalized time reversible model (GTR)

Let's start with the Jukes-Cantor model.

(Figure: under the Jukes-Cantor model each of the four nucleotides A, G, C and T changes to each of the others at the same rate $\alpha$. An ancestral sequence X gives rise to sequences Y and Z, and each branch accumulates $3\alpha t$ substitutions per site.)

We can just use another estimator, K, the number of substitutions per site since the time of divergence between the two sequences.

Therefore, $K = 2 \times 3\alpha t = 6\alpha t$

$8\alpha t = -\ln\left(1 - \frac{4}{3}p\right)$, so $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$

where p = the observed proportion of different nucleotides between the two sequences.

Example of calculation
If there are 80 transitions and 20 transversions between two sequences (length = 1000 bp), the number of substitutions per site K can be estimated:

Under the J-C model (p = 0.1):
$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right) = 0.1073$

Under the Kimura 2P model (P = 0.08, Q = 0.02):
$K = \frac{1}{2}\ln\left(\frac{1}{1 - 2P - Q}\right) + \frac{1}{4}\ln\left(\frac{1}{1 - 2Q}\right) = 0.10943$

where P = the proportion of transitions and Q = the proportion of transversions; p is the proportion of total substitutions (p = P + Q).

Overview of distance method
  1. Transform the data matrix into a distance table
  2. Use this new data matrix to evaluate/construct the tree
(Page & Holmes 1998)

UPGMA

  OTU   A       B       C
  B     d_AB
  C     d_AC    d_BC
  D     d_AD    d_BD    d_CD

  OTU   (AB)      C
  C     d_(AB)C
  D     d_(AB)D   d_CD

where $d_{(AB)C} = (d_{AC} + d_{BC})/2$ and $d_{(AB)D} = (d_{AD} + d_{BD})/2$
(Graur & Li 2000)

Example
(Figures: a worked UPGMA example over three slides, ending with the UPGMA tree; Felsenstein 2004.)

Neighbor-Joining (NJ) method
  Saitou & Nei (1987); Studier & Keppler (1988)
  NJ does not assume a clock and approximates the minimum evolution method.

True tree
  SP1: AAACGCGCTTAGATCGCCTAACGTACAAGC
  SP2: TATAAAGGGTAGAAAACCTAAGGATCAACG
  N1:  AATCGCGCTTAGATCGCCTAACGTACAACG
  N2:  AATCGCGGGTAGATCGCCTAACGTACAACG
  SP3: AATCGCGGGTATTTCGCCTAACGTACATCG
(Figure: the true tree with its branch lengths, and a tree built from the pairwise distances below.)

        SP1   SP2
  SP2   15
  SP3   8     13

NJ procedure
  1. For each tip, compute $u_i = \sum_{j:\, j \neq i} \frac{D_{ij}}{n - 2}$
  2. Choose the i and j for which $D_{ij} - u_i - u_j$ is smallest
  3. Join i and j, and compute the distance from the new node (ij) to every other node k: $D_{(ij),k} = \frac{D_{ik} + D_{jk} - D_{ij}}{2}$
  4. Delete tips i and j and make a new table with the new node (ij)
  5. Continue until the tree is resolved

Limitations of distance methods
  All nucleotide sites change independently.
  When evolutionary rates vary from site to site, the data set needs to be corrected.
  The substitution rate is constant over time and in different lineages.
  The base composition is at equilibrium.
  The conditional probabilities of nucleotide substitutions are the same for all sites and do not change over time.

Discrete methods
  Maximum parsimony
  Maximum likelihood
  Bayesian inference
  Others

Parsimony methods
  The goal is to find the most parsimonious tree.
  The criterion is the number of character-state changes, i.e. the evolutionary steps.
  First, we have to know the way to evaluate a given tree.

A simple data matrix with discrete characters
  SpA: TCAGACGATTGTCAGACCATTG
  SpB: TCAGTCGACTGTCAAACCATTG
  SpC: TCGGTCAATTGTCAAACGATTG
  SpD: TCGGTCAATTGTCAAACGATTG

For position no. 3 (SpA = A, SpB = A, SpC = G, SpD = G), the three possible trees require:
  ((SpA,SpB),(SpC,SpD))   Character change (step) = 1 (a single A to G change)
  ((SpA,SpC),(SpB,SpD))   Character change (step) = 2
  ((SpA,SpD),(SpB,SpC))   Character change (step) = 2

For position no. 7 (SpA = G, SpB = G, SpC = A, SpD = A):
  ((SpA,SpB),(SpC,SpD))   Character change (step) = 1 (a single G to A change)
  ((SpA,SpC),(SpB,SpD))   Character change (step) = 2
  ((SpA,SpD),(SpB,SpC))   Character change (step) = 2

Summed over all 22 positions:
  ((SpA,SpB),(SpC,SpD))   Character change (step) = 6
  ((SpA,SpC),(SpB,SpD))   Character change (step) = 9
  ((SpA,SpD),(SpB,SpC))   Character change (step) = 9

Consensus trees
  Finding the underlying information among trees
(Page & Holmes 1998)

So, how many trees out there are we dealing with?

Tree searching
(Figure: the possible trees for 3 taxa and for 4 taxa.)

The number of unrooted trees:
$N_{\text{unrooted}} = \frac{(2n - 5)!}{2^{n-3}(n - 3)!}$

The number of rooted trees:
$N_{\text{rooted}} = \frac{(2n - 3)!}{2^{n-2}(n - 2)!}$

  Taxon number   All possible unrooted tree number
  3              1
  4              3
  5              15
  6              105
  7              945
  8              10,395
  9              135,135
  10             2,027,025
  11             34,459,425
  12             654,729,075
  13             13,749,310,575
  14             316,234,143,225
  15             7,905,853,580,625
  16             213,458,046,676,875
  17             6,190,283,353,629,375
  18             191,898,783,962,510,625
  19             6,332,659,870,762,850,625
  20             221,643,095,476,699,771,875
  21             8,200,794,532,637,891,559,375
  22             319,830,986,772,877,770,815,625
  23             13,113,070,457,687,988,603,440,625
  24             563,862,029,680,583,509,947,946,875
  25             25,373,791,335,626,257,947,657,609,375
  ...            > 10^29 ... etc.

Note
Say a computer can evaluate 10^6 trees per second.
If we want to evaluate all of the trees for 25 taxa (rounding the count up to about 10^29), we will need
x = 10^29 / 10^6 / 60 / 60 / 24 / 365 = 3.17 x 10^15 years.
We have got to have some ways to approximate the true tree.
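The two tree-count formulas and the running-time note can be checked with a short Python sketch. The function names are illustrative, and the rate of 10^6 trees per second is the figure assumed in the note.

```python
from math import factorial


def n_unrooted(n: int) -> int:
    """Number of unrooted bifurcating trees for n taxa: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def n_rooted(n: int) -> int:
    """Number of rooted bifurcating trees for n taxa: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for taxa in (3, 4, 5, 10, 20, 25):
    print(f"{taxa:>2} taxa: {n_unrooted(taxa):,} unrooted, {n_rooted(taxa):,} rooted")

# Years needed to score every unrooted tree for 25 taxa at 10^6 trees per second.
seconds_per_year = 60 * 60 * 24 * 365
years = n_unrooted(25) / 1e6 / seconds_per_year
print(f"about {years:.2e} years")
```

Python integers are arbitrary precision, so the counts are exact. With the exact count for 25 taxa (about 2.5 x 10^28) the estimate comes out somewhat below the round figure obtained from 10^29, but it remains hopelessly large, which is the point of the note.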
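To see how a given tree is evaluated under parsimony, the step counts for the four-species matrix (SpA to SpD) can be reproduced with a small sketch. It uses Fitch's counting algorithm, which the slides do not name, so treat it as one possible way to count steps. Trees are written as nested tuples, and the arbitrary rooting does not change the count.

```python
def fitch(tree, states):
    """Return (possible state set, steps) for one site on a rooted binary tree."""
    if isinstance(tree, str):  # a tip: its observed state, no change needed
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra step
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def site_steps(tree, alignment, site):
    """Steps required at one site (0-based index)."""
    return fitch(tree, {taxon: seq[site] for taxon, seq in alignment.items()})[1]


def tree_length(tree, alignment):
    """Total steps over all sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(site_steps(tree, alignment, i) for i in range(n_sites))


alignment = {
    "SpA": "TCAGACGATTGTCAGACCATTG",
    "SpB": "TCAGTCGACTGTCAAACCATTG",
    "SpC": "TCGGTCAATTGTCAAACGATTG",
    "SpD": "TCGGTCAATTGTCAAACGATTG",
}

# The three possible groupings of four taxa.
trees = {
    "AB|CD": (("SpA", "SpB"), ("SpC", "SpD")),
    "AC|BD": (("SpA", "SpC"), ("SpB", "SpD")),
    "AD|BC": (("SpA", "SpD"), ("SpB", "SpC")),
}

for name, tree in trees.items():
    print(name,
          "site 3:", site_steps(tree, alignment, 2),
          "site 7:", site_steps(tree, alignment, 6),
          "total:", tree_length(tree, alignment))
```

The tree that groups SpA with SpB needs the fewest steps in total, so it is the most parsimonious of the three.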
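The five steps of the NJ procedure can be sketched as follows. This is a minimal illustration rather than a full implementation: it returns only the topology (no branch lengths), it stops when three nodes remain because an unrooted tree is then resolved, and the four-taxon distance matrix is a hypothetical example built from a tree with unequal branch lengths.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return an unrooted topology in Newick-like form.

    dist maps a pair (a, b) of names to the distance between them.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Step 1: u_i = sum of distances from i to all other nodes, over (n - 2).
        u = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) / (n - 2)
             for i in nodes}
        # Step 2: choose the pair minimising D_ij - u_i - u_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        # Step 3: distances from the new node (ij) to every other node k.
        new = f"({i},{j})"
        rest = [k for k in nodes if k not in (i, j)]
        for k in rest:
            d[frozenset((new, k))] = (
                d[frozenset((i, k))] + d[frozenset((j, k))] - d[frozenset((i, j))]
            ) / 2
        # Step 4: drop i and j from the table and add the new node.
        nodes = rest + [new]
    # Step 5: with three nodes left, the unrooted tree is resolved.
    return "(" + ",".join(nodes) + ");"


# Hypothetical additive distances from the tree ((A:1,B:3):1,(C:1,D:3)).
dist = {
    ("A", "B"): 4, ("A", "C"): 3, ("A", "D"): 5,
    ("B", "C"): 5, ("B", "D"): 7, ("C", "D"): 4,
}
print(neighbor_joining(["A", "B", "C", "D"], dist))
```

In this example the smallest raw distance is between A and C, yet the criterion in step 2 joins A with B, which matches the tree the distances were generated from. That is the sense in which NJ does not assume a clock.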
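For comparison, here is a minimal UPGMA sketch applied to the SP1, SP2 and SP3 sequences from the "True tree" slide. The averaging rule d(AB)C = (dAC + dBC)/2 is generalised by weighting each cluster by its number of tips, which reduces to the slide's formula when both joined clusters are single tips. Function names are illustrative.

```python
from itertools import combinations


def n_differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(names, dist):
    """Return the list of (cluster, height) merges; dist is keyed by frozenset pairs."""
    d = dict(dist)
    sizes = {name: 1 for name in names}  # cluster label -> number of tips
    merges = []
    while len(sizes) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2  # node height under a clock
        new = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Size-weighted average distance from the new cluster to each remaining one.
        for k in sizes:
            d[frozenset((new, k))] = (
                size_a * d[frozenset((a, k))] + size_b * d[frozenset((b, k))]
            ) / (size_a + size_b)
        sizes[new] = size_a + size_b
        merges.append((new, height))
    return merges


seqs = {
    "SP1": "AAACGCGCTTAGATCGCCTAACGTACAAGC",
    "SP2": "TATAAAGGGTAGAAAACCTAAGGATCAACG",
    "SP3": "AATCGCGGGTATTTCGCCTAACGTACATCG",
}
dist = {frozenset(p): n_differences(seqs[p[0]], seqs[p[1]])
        for p in combinations(seqs, 2)}

merges = upgma(list(seqs), dist)
for cluster, height in merges:
    print(cluster, height)
```

The pairwise differences are 15 (SP1 vs SP2), 8 (SP1 vs SP3) and 13 (SP2 vs SP3), so UPGMA joins SP1 with SP3 first simply because they are the closest pair.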
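The calculation example for the Jukes-Cantor and Kimura two-parameter distances (80 transitions and 20 transversions in 1000 bp) can be reproduced with the sketch below. The function names are illustrative, and the range checks are an added safeguard, since the logarithms are undefined once the sequences are too divergent.

```python
import math


def jukes_cantor(p: float) -> float:
    """K = -(3/4) ln(1 - 4p/3), with p the observed proportion of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P: float, Q: float) -> float:
    """K = (1/2) ln(1/(1 - 2P - Q)) + (1/4) ln(1/(1 - 2Q)).

    P is the proportion of transitions, Q the proportion of transversions.
    """
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the Kimura 2P correction")
    return 0.5 * math.log(1 / a) + 0.25 * math.log(1 / b)


length = 1000
transitions, transversions = 80, 20
P = transitions / length    # 0.08
Q = transversions / length  # 0.02

print(f"J-C:       {jukes_cantor(P + Q):.5f}")
print(f"Kimura 2P: {kimura_2p(P, Q):.5f}")
```

Both corrected distances are a little larger than the observed proportion p = 0.1, because they account for sites that changed more than once.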
Ways to solve the time problem
  Skip unnecessary calculation
  Change the tree searching ways

The procedure can be simplified
  SpA: TCAGACGATTGTCAGACCATTG
  SpB: TCGGTCGACTGTCAGACCATTG
  SpC: TCAGTCGATTGTCA-ACGATTG
  SpD: TCAGTCGATTGTCA-ACGATTG
  SpE: TCAGTCGATCGTCA-ACGATTG
  Parsimony-uninformative sites can be skipped, for example position 3, where only SpB differs (A G A A A).
  Parsimony-informative sites are the ones that matter, for example position 18 (C C G G G).

Searching for optimal trees
  Exhaustive search
  Branch-and-bound method
  Heuristic approach
    Stepwise/closest/random addition
    Star decomposition
    Branch swapping

Exhaustive search
(Figure: Page & Holmes (1998) Molecular Evolution.)

Results of heuristic search, with random addition from random starting points

Tree-island profile:
                 First    Last              First      Times
  Island  Size   tree     tree     Score    replicate  hit
  ----------------------------------------------------------------------
  1       101    1471     1571     3890     3          1
  2       69     4961     5029     3890     10         1
  3       1      14236    14236    3890     30         1
  4       5      14237    14241    3890     31         1

Distribution of tree length

Frequency distribution of lengths of 1000000 random trees:
mean=6528.095481 sd=65.005662 g1=-0.322317 g2=0.159848
6114.00000 /---------------------------------------------------------------------