Systematics - BIO 615
Derek S. Sikes
University of Alaska

Outline
1. Optimality Criteria: Parsimony continued
2. Distance vs character methods
3. “Building” a tree vs “finding” a tree - Clustering vs Optimality criterion methods
4. Performance of Distance and clustering methods

Four steps - each should be explained in methods
1. Character (data) selection (not too fast, not too slow) “Why did you choose these data?”
2. Alignment of Data (hypotheses of primary homology) “How did you align your data?”
3. Analysis selection (choose the best model / method(s)) - data exploration “Why did you choose your analysis method?”
4. Conduct analysis

Parsimony
Two problems to solve
1. Determine optimality criterion score (tree length) for each tree (easy / fast)
2. Search over all possible trees to find the tree(s) that is/are the best according to the optimality criterion (e.g. shortest; hard for more than 11 OTUs)

Trees
Some relationships
- An unrooted binary tree of n OTUs has 2n-3 branches
- Thus, an unrooted tree with n OTUs has 2n-3 rooted versions.

Parsimony
Parsimony & Cladistics
- Strict cladists typically use only parsimony methods & justify this choice on philosophical grounds, e.g. it provides the “least falsified hypothesis”
- Parsimony has also been interpreted as a fast approximation to maximum likelihood. Cavalli-Sforza and Edwards (1967:555) stated that parsimony’s “… success is probably due to the closeness of the solution it gives to the projection of the ‘maximum likelihood’ tree” and parsimony “certainly cannot be justified on the grounds that evolution proceeds according to some minimum principle…”.
- Parsimony is often used in conjunction with other methods (by those who use statistical phylogenetic methods)

Parsimony
• Given a set of characters, such as aligned sequences, parsimony analysis works by determining the fit (number of steps) of each character on a given tree
• The sum over all characters is called Tree Length
• Most parsimonious trees (MPTs) have the minimum tree length needed to explain the observed distributions of all the characters

Parsimony
Determining tree length

          123456
species1  AAAACA
species2  AAAAAA
species3  AAAACA
species4  AAAAAT
species5  AAAAGA

[Figure: site 5 mapped on a tree; tip states C, A, C, A, G; state sets at internal nodes AG, AC, ACG, AC]

- Sites 1-4 are constant: parsimony uninformative
- Site 6 is an autapomorphy = 1 change only, ignored by cladistic software, counted by PAUP (variable but parsimony uninformative)

Tree searching

Number of OTUs   Number of unrooted trees
 4               3
 5               15
 6               105
 7               945
 8               10,395
 9               135,135
10               2,027,025
11               34,459,425
12               654,729,075
13               13,749,310,575
14               316,234,143,225
15               7,905,853,580,625
16               213,458,046,676,875
17               6,190,283,353,629,375
18               191,898,783,962,510,625
19               6,332,659,870,762,850,625
20               221,643,095,476,699,771,875
21               8,200,794,532,637,891,559,375

For n taxa:
\[ \#\,\text{trees} = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!} \]

At 100k trees/sec PAUP would take over 2 billion years to evaluate all trees for 21 OTUs

Tree searching
If 11 or fewer OTUs can do an exhaustive search
- this guarantees the shortest tree(s) will be found (an exact solution)
- every possible tree for n taxa examined
- slowest and most rigorous method
- provides a frequency histogram of tree scores

Exhaustive Search
[Figure: stepwise construction of all possible trees]
Step 1: Starting tree, any 3 taxa (A, B, C)
Step 2a, 2b, 2c: Add fourth taxon (D) in each of three possible positions -> three trees
Add fifth taxon (E) in each of the five possible positions on each of the three trees -> 15 trees, and so on ....

Tree searching
If 12-25 OTUs can do a branch and bound search
- this also guarantees the shortest tree(s) will be found but not all trees are examined (also an exact solution)
- families of trees that cannot lead to shorter trees are discarded and not examined - save time
- read text for details on method
- faster than exhaustive search
- no histogram of tree scores

Tree searching
For more than 25 OTUs (most datasets) must use other methods, heuristic searching - approximate methods
- do not guarantee the shortest tree will be found
- fastest method (but less rigorous)
- many issues to consider to employ best strategy for searching tree space
- can get trapped in local optima while searching for global optima (shortest trees)
(to be continued… see lecture on large datasets)

Tree searching
heuristic searching - approximate methods
1. A starting tree is obtained by some clustering method, e.g. stepwise addition or neighbor-joining
2. This tree is then subjected to branch swapping (movement of branches to new places on tree)
- each swap makes a new topology which is scored using the Optimality Criterion
- hope is to find a more optimal tree through extensive branch swapping

Data types
Character data - OTU x character matrix

OTUs      characters
          1234567890
Species1  ATGCTTGCCA
Species2  ATGCTTGTCA
Species3  ATGCATGACA
Species4  ATGTGTGGCA

Data types
Distance data - OTU x OTU matrix

          species1  species2  species3  species4
species1  0
species2  0.1       0
species3  0.2       0.2       0
species4  0.3       0.3       0.1       0

These are uncorrected distances - they can be corrected to improve chance of finding correct tree (see upcoming lecture on models of evolution)

Data types
Distance methods vs Character methods
- Depending on the data can prefer the same or different topologies
- Even if the same topology is chosen, only character based methods allow an estimate of ancestral states (which characters are changing where on the tree)
- Once characters have been converted to distances we lose this information
- Important for understanding historical forms of molecules or adaptations!
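The conversion from an OTU x character matrix to an OTU x OTU matrix of uncorrected distances can be sketched in a few lines of Python. This is only a minimal illustration, assuming gap-free sequences of equal length; the function names `p_distance` and `distance_matrix` are hypothetical and are not taken from PAUP or the course text. The four sequences are the ones from the character-data example.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected distance: proportion of aligned sites that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the sites where the two OTUs carry different states.
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Uncorrected distance for every unordered pair of OTUs."""
    return {
        (name_a, name_b): p_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }


# OTU x character matrix from the character-data example (assumed gap-free).
alignment = {
    "Species1": "ATGCTTGCCA",
    "Species2": "ATGCTTGTCA",
    "Species3": "ATGCATGACA",
    "Species4": "ATGTGTGGCA",
}

for (otu_a, otu_b), dist in distance_matrix(alignment).items():
    print(f"{otu_a} vs {otu_b}: {dist:.1f}")
```

Each pair is reduced to a single number, so the record of which sites differ is gone, which is the loss of information described above. For these four sequences the sketch gives 0.1 for Species1 vs Species2 and 0.3 for Species3 vs Species4; the distance table in the notes is a separate illustration, so its entries need not coincide with these.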
Optimality Criteria vs clustering methods
Methods of Phylogenetic Inference
- A more fundamental split is not between methods that use distance data vs character data
- But between methods that use optimality criteria vs those that do not (sometimes called algorithmic methods, or clustering methods)
- Clustering methods are fast because they build only one tree (“one-tree methods”)
- But, often a dataset will be explained equally well by multiple, sometimes thousands of trees…

Building a tree vs Finding a tree
“Building” or “Constructing” a tree vs “Finding” a tree
- Each tree is a hypothesis of relationships
- For n OTUs we know how many alternative hypotheses there are before we do any analysis
- The question to be answered: “Is the signal in your data strong enough to weed through these hypotheses?”
- i.e. “Can your data reject all but one (or a few) of these hypotheses?”

Building a tree vs Finding a tree
- Methods that “build” a tree do not use optimality criteria
- They do not test hypotheses, they do not evaluate the alternative hypotheses for n taxa
- Instead, they create a single tree from the data, they “build” (create, construct) a tree
- Since hypotheses are not tested I contend these trees are not scientific, they are good places to start a search for optimal trees, or a means to explore the data
- (But will find the true tree if the distance matrix is an exact reflection of the true tree)

Building a tree vs Finding a tree
Comparison:
Clustering method
1. Build a tree (e.g. with NJ)
2. Stop
Optimality Criteria (e.g. parsimony)
1. Build a tree (e.g. with NJ)
2. Score tree with criterion
3. Try to improve tree with branch swapping
4. Goto step 1 until +/- all trees are scored
5. Stop when search is done

Clustering Methods - UPGMA
UPGMA (Unweighted Pair-Group Method using Arithmetic averages) for ultrametric data
- Sokal & Michener 1958 - phenetics / phenograms
- assumes / requires equal rates of evolution on all branches - false for most real data
- descendants are equidistant from ancestor
[Figure: ultrametric tree with nodes labelled A-J and branch lengths e1-e9; axis: ABSOLUTE TIME or DIVERGENCE]
Constraints:
e1 = e2
e4 = e1 + e3
e5 + e4 = e9 + e8 + e6
etc.

Clustering Methods - UPGMA

   A   B   C   D
A  0
B  17  0
C  21  12  0
D  27  18  14  0

- Very sensitive to unequal rates among lineages
- Real data rarely are ultrametric
- Should no longer be used
[Figure: UPGMA tree (tip order B, C, D, A; branch lengths 6, 6, 8, 2, 10.83, 2.83) beside the true tree (tips A, B, C, D; branch lengths 13, 4, 4, 10, 2, 2)]

Clustering Methods - NJ
Neighbor-Joining (NJ) - Saitou & Nei (1987)
- unjustifiably popular method but better than UPGMA
- clustering method to deal with non-ultrametric data
- no assumption of clock-like evolution
- approximation of the Minimum Evolution tree (thus not “phenetic”)
- for many datasets, however, NJ fails to produce the ME tree - it gets close but better trees often exist
- e.g. 27,000 equal or better trees than the NJ tree published by Hedges et al (1992) - Swofford et al. 1996
- thus, hard to justify an NJ tree as a reasonable estimate of phylogeny (Swofford & Sullivan, text)

Clustering Methods - NJ
Neighbor-Joining (NJ)
- Farris et al. (1996) Cladistics 12:99-124 - “NJ obscures ambiguities in the data”
- sensitive to the input order of the OTUs
- Depending on the sorting of the OTUs different trees might result
- This problem was addressed by “jumbling” the input order for building multiple NJ trees
- Typically people do NJ-bootstrapping which reduces but does not eliminate the problem of ignoring ambiguities in the data

Clustering Methods - NJ
Farris et al. (1996) Cladistics 12:99-124
[Figure: Two NJ trees built from same dataset with differently sorted OTUs]
No means to decide which is better (no optimality criterion)

Clustering Methods - NJ
Your text - Distance methods (Yves Van de Peer)
NJ often considered inferior but is preferred by those who consider a quickly produced tree more important than accuracy or honest presentation of the
- Repeatedly mentions that optimality criteria methods