
Systematics - BIO 615

Outline

1. Optimality Criteria: Parsimony continued

2. Distance vs character methods

3. “Building” a tree vs “finding” a tree - Clustering vs Optimality criterion methods

4. Performance of Distance and clustering methods

Derek S. Sikes University of Alaska

Four steps - each should be explained in methods

1. Character (data) selection (not too fast, not too slow) “Why did you choose these data?”

2. Alignment of Data (hypotheses of primary homology) “How did you align your data?”

3. Analysis selection (choose the best model / method(s)) - data exploration “Why did you choose your analysis method?”

4. Conduct analysis

Parsimony

Two problems to solve

1. Determine optimality criterion score (tree length) for each tree (easy / fast)

2. Search over all possible trees to find the tree(s) that is/are the best according to the optimality criterion (e.g. shortest; hard for more than 11 OTUs)

Trees

Some relationships

- An unrooted binary tree of n OTUs has 2n-3 branches (a rooted binary tree has 2n-2)

- Thus, an unrooted tree with n OTUs has 2n-3 rooted versions (the root can be placed on any branch).

Parsimony

- Strict cladists typically use only parsimony methods & justify this choice on philosophical grounds, eg it provides the “least falsified hypothesis”

- Parsimony has also been interpreted as a fast approximation to maximum likelihood. Cavalli-Sforza and Edwards (1967:555) stated that parsimony’s “… success is probably due to the closeness of the solution it gives to the projection of the ‘maximum likelihood’ tree” and that parsimony “certainly cannot be justified on the grounds that evolution proceeds according to some minimum principle…”.

- Parsimony is often used in conjunction with other methods (by those who use statistical phylogenetic methods)


Parsimony

• Given a set of characters, such as aligned sequences, parsimony analysis works by determining the fit (number of steps) of each character on a given tree

• The sum over all characters is called Tree Length

• Most parsimonious trees (MPTs) have the minimum tree length needed to explain the observed distributions of all the characters

Parsimony

Determining tree length

          123456
species1  AAAACA
species2  AAAAAA
species3  AAAACA
species4  AAAAAT
species5  AAAAGA

[Figure: site 5 mapped onto a tree. Tip states C, A, C, A, G; state sets at internal nodes AG, AC, ACG, AC]

Sites 1-4 are constant = parsimony uninformative. Site 6 is 1 change only (variable but parsimony uninformative): ignored by cladistic software, counted by PAUP.
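The first of the two problems, scoring one tree, can be sketched with Fitch's counting procedure for unordered characters. This is a minimal illustration, not PAUP's implementation: the matrix is the one from the Determining tree length slide, but the tree topology, the nested-tuple tree representation, and the function names `fitch_site` and `tree_length` are assumptions made for the example.

```python
# Fitch counting for unordered characters on a rooted binary tree.
# A tree is a nested tuple of OTU names. The topology used below is
# hypothetical; it is not taken from the slide's figure.

def fitch_site(tree, states):
    """Return (state_set, steps) for one character on one tree."""
    if isinstance(tree, str):              # a tip: its observed state, no steps
        return {states[tree]}, 0
    left_set, left_steps = fitch_site(tree[0], states)
    right_set, right_steps = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:                             # descendants agree: no extra step
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, matrix):
    """Sum the steps over all characters; also return the per-site steps."""
    n_sites = len(next(iter(matrix.values())))
    per_site = []
    for i in range(n_sites):
        states = {otu: seq[i] for otu, seq in matrix.items()}
        per_site.append(fitch_site(tree, states)[1])
    return sum(per_site), per_site

matrix = {
    "species1": "AAAACA",
    "species2": "AAAAAA",
    "species3": "AAAACA",
    "species4": "AAAAAT",
    "species5": "AAAAGA",
}
tree = ((("species1", "species3"), ("species2", "species4")), "species5")
length, per_site = tree_length(tree, matrix)
print(length, per_site)
```

On this tree the constant sites 1-4 contribute no steps, site 5 needs two steps and site 6 needs one, so the tree length is 3. Site 6 costs one step on every tree, which is why it is uninformative for choosing among trees.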

Number of OTUs    Number of unrooted trees

 4                3
 5                15
 6                105
 7                945
 8                10,395
 9                135,135
10                2,027,025
11                34,459,425
12                654,729,075
13                13,749,310,575
14                316,234,143,225
15                7,905,853,580,625
16                213,458,046,676,875
17                6,190,283,353,629,375
18                191,898,783,962,510,625
19                6,332,659,870,762,850,625
20                221,643,095,476,699,771,875
21                8,200,794,532,637,891,559,375

For n taxa, # trees = $\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

At 100k trees/sec PAUP would take over 2 billion years to evaluate all trees for 21 OTUs

Tree searching

If 11 or fewer OTUs can do an exhaustive search

- this guarantees the shortest tree(s) will be found (an exact solution)

- every possible tree for n taxa examined

- slowest and most rigorous method

- provides a frequency histogram of tree scores
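The tree counts and the time estimate can be checked directly from the formula. The sketch below assumes the 100k trees/sec rate quoted on the slide and a 365.25-day year; the function names are illustrative only.

```python
from math import factorial

def n_unrooted_trees(n):
    """(2n-5)! / (2^(n-3) * (n-3)!) unrooted binary trees for n >= 3 OTUs."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def n_rooted_trees(n):
    """Each unrooted tree has 2n-3 branches, each a possible root position."""
    return n_unrooted_trees(n) * (2 * n - 3)

TREES_PER_SECOND = 100_000                 # rate quoted on the slide
SECONDS_PER_YEAR = 365.25 * 24 * 3600      # assumption: Julian year

for n in (4, 11, 21):
    trees = n_unrooted_trees(n)
    years = trees / TREES_PER_SECOND / SECONDS_PER_YEAR
    print(n, trees, f"{years:.3g} years")
```

For 21 OTUs this gives roughly 2.6 billion years, consistent with "over 2 billion years".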

Exhaustive Search

Step 1: Starting tree, any 3 taxa (A, B, C)

Step 2 (2a, 2b, 2c): Add fourth taxon (D) in each of three possible positions -> three trees

Add fifth taxon (E) in each of the five possible positions on each of the three trees -> 15 trees, and so on ....

[Figure: starting tree of A, B, C; the three trees 2a, 2b, 2c with D added; E being added to each]

Tree searching

If 12-25 OTUs can do a branch and bound search

- this also guarantees the shortest tree(s) will be found but not all trees are examined (also an exact solution)

- families of trees that cannot lead to shorter trees are discarded and not examined - save time

- read text for details on method

- faster than exhaustive search

- no histogram of tree scores
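The stepwise construction used by the exhaustive search (start with three taxa, then add each further taxon to every branch of every tree built so far) can be sketched as follows. This only enumerates topologies; an exhaustive search would then score each one, and branch and bound would discard partial trees that are already too long. The nested-tuple representation, in which an unrooted tree is drawn rooted on the branch leading to the first taxon, is an assumption of this sketch.

```python
def insert_everywhere(tree, taxon):
    """Yield every tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)                    # the branch above this subtree
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insert_everywhere(left, taxon):
            yield (new_left, right)
        for new_right in insert_everywhere(right, taxon):
            yield (left, new_right)

def all_unrooted_trees(taxa):
    """Enumerate all unrooted binary topologies for three or more taxa."""
    first, second, third, *rest = taxa
    trees = [(second, third)]              # step 1: the single 3-taxon tree
    for taxon in rest:                     # step 2 onward: add one taxon at a time
        trees = [t for old in trees for t in insert_everywhere(old, taxon)]
    return [(first, t) for t in trees]

for n in range(3, 8):
    print(n, len(all_unrooted_trees("ABCDEFG"[:n])))
```

The counts printed for 4 and 5 taxa are the three and 15 trees of the slide, and the later counts match the table of unrooted trees.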


Tree searching

For more than 25 OTUs (most datasets) must use other methods: heuristic searching

- approximate methods - do not guarantee the shortest tree will be found

- fastest method (but less rigorous)

- many issues to consider to employ best strategy for searching tree space

- can get trapped in local optima while searching for global optima (shortest trees) (to be continued… see lecture on large datasets)

Tree searching

heuristic searching - approximate methods

1. A starting tree is obtained by some clustering method, eg stepwise addition or neighbor-joining

2. This tree is then subjected to branch swapping (movement of branches to new places on tree)

- each swap makes a new topology which is scored using the Optimality Criterion

- hope is to find a more optimal tree through extensive branch swapping

Data types

Character data - OTU x character matrix

OTUs        characters
            1234567890
Species1    ATGCTTGCCA
Species2    ATGCTTGTCA
Species3    ATGCATGACA
Species4    ATGTGTGGCA

Data types

Distance data - OTU x OTU matrix

          species1  species2  species3  species4
species1  0
species2  0.1       0
species3  0.2       0.2       0
species4  0.3       0.3       0.1       0

These are uncorrected distances - they can be corrected to improve chance of finding correct tree (see upcoming lecture on models of evolution)

Data types

Distance methods vs Character methods

- Depending on the data can prefer the same or different topologies

- Even if the same topology is chosen, only character based methods allow an estimate of ancestral states (which characters are changing where on the tree)

- Once characters have been converted to distances we lose this information

- Important for understanding historical forms of molecules or adaptations!
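Converting characters to uncorrected distances can be sketched as a simple proportion of differing sites. The sequences below are the four from the character data slide; the function name `p_distance` is an assumption of the example. Five of the six pairs reproduce the values in the distance matrix slide, while for Species3 vs Species4 this count gives 0.3 rather than the 0.1 listed there, so the slide's matrix may not have been computed from exactly these sequences.

```python
from itertools import combinations

matrix = {
    "Species1": "ATGCTTGCCA",
    "Species2": "ATGCTTGTCA",
    "Species3": "ATGCATGACA",
    "Species4": "ATGTGTGGCA",
}

def p_distance(seq_a, seq_b):
    """Uncorrected distance: proportion of aligned sites that differ."""
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)

distances = {
    (x, y): p_distance(matrix[x], matrix[y])
    for x, y in combinations(matrix, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

Each distance is a single number per pair: it no longer records which sites differ or which states they carry, which is the information loss described above.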


Optimality Criteria vs clustering methods

Methods of Phylogenetic Inference

- A more fundamental split is not between methods that use distance data vs character data

- But between methods that use optimality criteria vs those that do not (sometimes called algorithmic methods, or clustering methods)

- Clustering methods are fast because they build only one tree (“one-tree methods”)

- But, often a dataset will be explained equally well by multiple, sometimes thousands of trees…

Building a tree vs Finding a tree

“Building” or “Constructing” a tree vs “Finding” a tree

- Each tree is a hypothesis of relationships

- For n OTUs we know how many alternative hypotheses there are before we do any analysis

- The question to be answered: “Is the signal in your data strong enough to weed through these hypotheses?” - ie “Can your data reject all but one (or a few) of these hypotheses?”

Building a tree vs Finding a tree

- Methods that “build” a tree do not use optimality criteria

- They do not test hypotheses, they do not evaluate the alternative hypotheses for n taxa

- Instead, they create a single tree from the data, they “build” (create, construct) a tree

- Since hypotheses are not tested I contend these trees are not scientific, they are good places to start a search for optimal trees, or a means to explore the data

- (But will find the true tree if the distance matrix is an exact reflection of the true tree)

Building a tree vs Finding a tree

Comparison:

Clustering method
1. Build a tree (eg with NJ)
2. Stop

Optimality Criteria (eg parsimony)
1. Build a tree (eg with NJ)
2. Score tree with criterion
3. Try to improve tree with branch swapping
4. Goto step 1 until +/- all trees are scored
5. Stop when search is done
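The optimality-criterion loop in the comparison (score a starting tree, try swaps, keep any improvement, stop when nothing improves) can be sketched as a simple hill climb. Several choices here are assumptions for illustration: the starting tree is arbitrary rather than an NJ tree, the swaps are nearest-neighbor-interchange style rearrangements on a rooted nested-tuple tree (one of several kinds of branch swapping, and some neighbors are only re-rootings with the same score), and the data are the five-species matrix from the Determining tree length slide.

```python
def fitch_length(tree, matrix):
    """Parsimony tree length: Fitch steps summed over all sites."""
    n_sites = len(next(iter(matrix.values())))

    def site(node, i):
        if isinstance(node, str):
            return {matrix[node][i]}, 0
        (ls, lc), (rs, rc) = site(node[0], i), site(node[1], i)
        both = ls & rs
        return (both, lc + rc) if both else (ls | rs, lc + rc + 1)

    return sum(site(tree, i)[1] for i in range(n_sites))

def nni_neighbors(tree):
    """Yield trees one interchange away from `tree` (rooted NNI-style swaps)."""
    if isinstance(tree, str):
        return
    left, right = tree
    if not isinstance(left, str):
        a, b = left
        yield ((a, right), b)
        yield ((b, right), a)
    if not isinstance(right, str):
        c, d = right
        yield ((left, c), d)
        yield ((left, d), c)
    for new_left in nni_neighbors(left):
        yield (new_left, right)
    for new_right in nni_neighbors(right):
        yield (left, new_right)

def hill_climb(start, matrix):
    """Accept the first shorter neighbor; stop when no swap shortens the tree."""
    best, best_len = start, fitch_length(start, matrix)
    improved = True
    while improved:
        improved = False
        for cand in nni_neighbors(best):
            cand_len = fitch_length(cand, matrix)
            if cand_len < best_len:
                best, best_len, improved = cand, cand_len, True
                break
    return best, best_len

matrix = {
    "species1": "AAAACA", "species2": "AAAAAA", "species3": "AAAACA",
    "species4": "AAAAAT", "species5": "AAAAGA",
}
start = ((("species1", "species2"), "species3"), ("species4", "species5"))
best, best_len = hill_climb(start, matrix)
print(fitch_length(start, matrix), "->", best_len, best)
```

Because only strictly shorter trees are accepted, a search like this can stop on a local optimum; on larger datasets that is the trap heuristic searches have to work around.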

Clustering Methods - UPGMA

UPGMA (Unweighted Pair-Group Method using Arithmetic averages) for ultrametric data

- Sokal & Michener 1958

- phenetics / phenograms

- assumes / requires equal rates of evolution on all branches - false for most real data

- descendants are equidistant from ancestor

[Figure: ultrametric tree with nodes labelled A-J and branches e1-e9, drawn against an axis of ABSOLUTE TIME or DIVERGENCE]

Constraints:
e1 = e2
e4 = e1 + e3
e5 + e4 = e9 + e8 + e6
etc.
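UPGMA itself is short enough to sketch. The version below is a minimal illustration (the function name, the nested-tuple trees and the dictionary layout are assumptions), run on the four-taxon A-D distance matrix used in this lecture's UPGMA example.

```python
from itertools import combinations

def upgma(names, d):
    """Cluster OTUs by average distance; return (tree, node heights)."""
    dist = {frozenset(k): float(v) for k, v in d.items()}
    size = {name: 1 for name in names}        # tips per cluster
    height = {name: 0.0 for name in names}    # node height of each cluster
    active = list(names)
    while len(active) > 1:
        # join the two closest clusters; the new node sits at half their distance
        x, y = min(combinations(active, 2), key=lambda p: dist[frozenset(p)])
        new = (x, y)
        height[new] = dist[frozenset((x, y))] / 2
        size[new] = size[x] + size[y]
        active = [c for c in active if c not in (x, y)]
        for other in active:
            # arithmetic average over all tip pairs (size-weighted mean)
            dist[frozenset((new, other))] = (
                size[x] * dist[frozenset((x, other))]
                + size[y] * dist[frozenset((y, other))]
            ) / size[new]
        active.append(new)
    return active[0], height

d = {("A", "B"): 17, ("A", "C"): 21, ("B", "C"): 12,
     ("A", "D"): 27, ("B", "D"): 18, ("C", "D"): 14}
tree, height = upgma(["A", "B", "C", "D"], d)
print(tree)
for node in (("B", "C"), ("D", ("B", "C")), tree):
    print(node, round(height[node], 2))
```

On this matrix the method joins B with C first (height 6), then D (height 8), then A (height 10.83). Every tip ends up equidistant from the root, which is the equal-rates constraint built into the method.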


Clustering Methods - UPGMA

     A    B    C    D
A    0
B    17   0
C    21   12   0
D    27   18   14   0

[Figure: UPGMA tree (tips B, C, D, A; branch lengths 6, 6, 8, 10.83, 2, 2.83) beside the True tree (tips A, B, C, D; branch lengths 13, 4, 4, 10, 2, 2)]

Very sensitive to unequal rates among lineages

Real data rarely are ultrametric

Should no longer be used

Clustering Methods - NJ

Neighbor-Joining (NJ) - Saitou & Nei (1987)

- unjustifiably popular method but better than UPGMA

- clustering method to deal with non-ultrametric data

- no assumption of clock-like evolution

- approximation of the Minimum Evolution tree (thus not “phenetic”)

- for many datasets, however, NJ fails to produce the ME tree - it gets close but better trees often exist - eg 27,000 equal or better trees than the NJ tree published by Hedges et al (1992) - Swofford et al. 1996

- thus, hard to justify an NJ tree as a reasonable estimate of phylogeny (Swofford & Sullivan, text)
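For comparison, neighbor-joining can be sketched on the same four-taxon matrix. This is a minimal illustration of the standard Saitou & Nei procedure, not any particular program's implementation; the function name, the nested-tuple trees and the return values are assumptions of the example.

```python
from itertools import combinations

def neighbor_joining(names, d):
    """Return (tree, branch length above each node, length of the final branch)."""
    dist = {frozenset(k): float(v) for k, v in d.items()}
    active = list(names)
    length = {}
    while len(active) > 2:
        n = len(active)
        total = {i: sum(dist[frozenset((i, k))] for k in active if k != i)
                 for i in active}

        def q(pair):
            # rate-corrected criterion: (n-2)*d(i,j) - r_i - r_j
            i, j = pair
            return (n - 2) * dist[frozenset(pair)] - total[i] - total[j]

        i, j = min(combinations(active, 2), key=q)   # ties: first pair in input order
        dij = dist[frozenset((i, j))]
        length[i] = dij / 2 + (total[i] - total[j]) / (2 * (n - 2))
        length[j] = dij - length[i]
        new = (i, j)
        active = [k for k in active if k not in (i, j)]
        for k in active:
            dist[frozenset((new, k))] = (
                dist[frozenset((i, k))] + dist[frozenset((j, k))] - dij
            ) / 2
        active.append(new)
    i, j = active
    return (i, j), length, dist[frozenset((i, j))]

d = {("A", "B"): 17, ("A", "C"): 21, ("B", "C"): 12,
     ("A", "D"): 27, ("B", "D"): 18, ("C", "D"): 14}
tree, length, middle = neighbor_joining(["A", "B", "C", "D"], d)
print(tree, [length[x] for x in "ABCD"], middle)
```

Here NJ groups A with B and C with D, with terminal branches of 13, 4, 4 and 10 and an internal branch of 4, so the tree reproduces every distance in the matrix exactly; UPGMA on the same data groups B with C. Note that `min` breaks ties by input order. In this example the tied pairs imply the same topology, but in general tie-breaking is one way the order of the OTUs can affect the result.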

Clustering Methods - NJ

Neighbor-Joining (NJ)

- Farris et al. (1996) Cladistics 12:99-124

- “NJ obscures ambiguities in the data”

- sensitive to the input order of the OTUs

- Depending on the sorting of the OTUs different trees might result

- This problem was addressed by “jumbling” the input order for building multiple NJ trees

- Typically people do NJ-bootstrapping which reduces but does not eliminate the problem of ignoring ambiguities in the data

Clustering Methods - NJ

Farris et al. (1996) Cladistics 12:99-124

[Figure: Two NJ trees built from same dataset with differently sorted OTUs]

No means to decide which is better (no optimality criterion)

Clustering Methods - NJ

Your text - Distance methods (Yves Van de Peer)

- Repeatedly mentions that optimality criteria methods “suffer the drawback that all tree topologies have to be investigated”

- Clustering methods, in contrast, produce only one tree (- I consider this a huge weakness)

- This is like saying: the advantage of NJ is that it ignores alternative equally good or possibly better estimates of phylogeny

- Similar to the pheneticists' goal to produce a stable tree that was not necessarily a phylogeny

Clustering Methods - NJ

NJ often considered inferior but is preferred by those who consider a quickly produced tree more important than accuracy or honest presentation of the signal in their data

If a more rigorous method, an optimality criterion method, finds the same tree as NJ…

Publish the more rigorously obtained tree!

(the NJ tree in this case tells us nothing extra - and publishing only the NJ tree tells readers that not much effort or care was put into the analysis)


Clustering Methods

Advocates argue that we can do NJ Bootstrapping

- method to assess the strength of the signal in the data

- to assess the precision of the estimate

- consider a bogus clustering method that builds a tree by alphabetically ordering the OTUs according to their names & ignores the data

- bootstrapping this method would produce scores of 100% for every branch - obviously meaningless

Clustering Methods

NJ is the primary method of the “DNA Barcoding” protocol:

Hebert, P.D.N., Penton, E.H., Burns, J.M., Janzen, D.H. & Hallwachs, W. 2004. Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proceedings of the National Academy of Sciences of the USA 101, 14812–14817.

Brower, A.V.Z. 2006. Problems with DNA barcodes for species delimitation: “ten species” of Astraptes fulgerator reassessed (Lepidoptera: Hesperiidae). Systematics and Biodiversity

Brower bootstrapped their NJ tree - found support for “at least 3 but not more than 7 that may correspond to cryptic species”

- Not 10!

Suspected cryptic species - adults look identical
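The bootstrapping argument made under Clustering Methods (a bogus method that orders the OTUs alphabetically and ignores the data still gets 100% on every branch) can be demonstrated in a few lines. Everything here is an illustrative assumption: the function names, the ladder-shaped tree, the 100 replicates, and the use of the four sequences from the character data slide.

```python
import random

def alphabetical_tree(matrix):
    """Bogus 'clustering': ladder of OTUs in name order; the data are never read."""
    names = sorted(matrix)
    tree = names[0]
    for name in names[1:]:
        tree = (tree, name)
    return tree

def clades(tree):
    """Set of tip-name sets, one per internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = walk(node[0]) | walk(node[1])
        found.add(tips)
        return tips

    walk(tree)
    return found

def bootstrap_support(matrix, build, replicates=100, seed=0):
    """Percent of pseudoreplicates in which each clade of the original tree recurs."""
    rng = random.Random(seed)
    n_sites = len(next(iter(matrix.values())))
    reference = clades(build(matrix))
    counts = dict.fromkeys(reference, 0)
    for _ in range(replicates):
        # resample sites (columns) with replacement
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        pseudo = {otu: "".join(seq[c] for c in cols) for otu, seq in matrix.items()}
        for clade in clades(build(pseudo)) & reference:
            counts[clade] += 1
    return {clade: 100 * n / replicates for clade, n in counts.items()}

matrix = {
    "Species1": "ATGCTTGCCA", "Species2": "ATGCTTGTCA",
    "Species3": "ATGCATGACA", "Species4": "ATGTGTGGCA",
}
support = bootstrap_support(matrix, alphabetical_tree)
for clade, pct in support.items():
    print(sorted(clade), pct)
```

Every clade gets 100% because the builder returns the same tree for every pseudoreplicate. Bootstrap values measure how repeatably a method returns a branch, not whether the method is a sensible estimator.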

Optimality Criterion for Distances

Minimum Evolution (ME)

- Uses distance data (preferably corrected data)

- Optimality criterion (searches tree space, much more rigorous & time consuming than NJ)

- Tree that minimizes the sum of the lengths of the branches is the best estimate of phylogeny

- “Parsimony using distance data”

- Better than NJ but still weaker than character methods

Minimum Evolution Example

Phillips, M. J., F. Delsuc, D. Penny. 2004. Genome-scale phylogeny and the detection of systematic biases. MBE 21(7):1455-1458

- Reevaluated Rokas et al’s (2003) dataset of 8 yeast genomes (106 nuclear genes, 127,026 nucleotides)

- Compared the two types of statistical error:

Stochastic (random) error - deviation between estimate and true value due to sampling, by definition will vanish with infinite data - assessed using branch support measures

Systematic error - deviation between estimate and true value due to violated assumptions in the estimation method - will not vanish with infinite data - branch support tells us nothing


Minimum Evolution Example

- Random error +/- gone with huge dataset of 127,026 characters

- Systematic error evident in ME analysis (tree on right) (even with corrected distance data!) (& loss of data)

- 100% branch support values indicate no random error

Minimum Evolution Example

Zwickl, D. J. & D. M. Hillis 2002. Increased taxon sampling greatly reduces phylogenetic error. Sys. Biol. 51(4):588-598

- investigated impact of taxon sampling on accuracy of estimates of phylogenies

- all optimality criteria saw a reduction in error with increased taxon sampling

- however, Minimum Evolution saw the smallest benefit and had overall the highest error rates

- another source of phylogenetic error - use of a method that has a higher failure rate than other methods

Summary

1. Two data types: Distance & Character

2. Two ways to “get a tree”: Clustering methods & Optimality Criteria methods (building vs searching)

3. All clustering methods are distance based - But not all distance methods are clustering methods (Minimum Evolution)

                         Data type
Tree Method              Distances            Characters
Clustering algorithm     UPGMA, NJ
Optimality Criterion     Minimum evolution    Parsimony, Maximum Likelihood

Summary

Distance methods

• Should only be used with corrected data (see lecture on models of evolution). Phylogenetic error can result from use of distances that don’t reflect true distances

• Assessment of strength of signal is critical to modern … but seemingly the opposite goal of clustering (one-tree) methods

• Character evolution / ancestral state data are lost

• Branch lengths sometimes estimated as less than minimum possible (observed = minimum)

• Less capable with difficult dataset (eg Rokas et al) than character methods (eg Parsimony)

Summary

6. If dataset is clean, strong historical signal, low rate heterogeneity, distance methods typically do fine

7. If a comparison is made with more powerful, non-distance methods, as it should be, why bother using distance methods at all?

8. Only justification I can think of, which is weak, is that distance methods are very efficient [ = fast] (and if you are lucky, they will choose the same tree as the more powerful methods)

9. Due to their speed these methods are ideal for data exploration prior to final analyses

10. Clustering methods (eg NJ) do not try to find globally optimal trees

Terms - from lecture & readings

2n-3
tree length
MPT (most parsimonious tree(s))
exhaustive (exact) search
branch & bound search
heuristic search
tree space
tree islands
branch swapping
optimality criteria
clustering method
distance data
character (discrete) data
UPGMA
ultrametric data
Neighbor-joining
minimum evolution
stochastic (random) error
systematic error
ancestral states


Study questions

What does ‘tree length’ mean for a Parsimony analysis?

What are the differences between the 3 different means of searching for optimal trees? When would you use each? Pros & cons?

What are tree islands & why are they important? How does NJ deal with tree islands?

Compare & contrast optimality criteria methods with clustering methods. Is the requirement that tree space be extensively searched a drawback or advantage of optimality criterion methods?

What are the requirements of the data for an ultrametric method? Are these requirements typically met by real data?

Why is it hard to justify an NJ tree as a reasonable estimate of phylogeny?

Compare & contrast stochastic error with systematic error.

All clustering methods use (distance? or character?) data. All distance methods are clustering methods (True or False)?
