Phylogenetics Topic 3: Methods of inferring phylogenies

Because no one was present to directly observe the evolution of a group of organisms, biologists must infer phylogenies from the characters of living and fossil taxa. These days, the vast majority of phylogenies are reconstructed from variation among nucleotide or amino acid sequences. However, a wide variety of other types of molecular data can be used to reconstruct phylogenies; examples include restriction fragment length polymorphisms (RFLPs), insertion-deletion events (indels), chromosomal rearrangements, and DNA-DNA hybridization, to name a few. Numerous methods of reconstructing trees have been implemented. This lecture gives a very brief, non-technical introduction to the most common methods.

A generalized protocol for molecular phylogenetics, and the concerns associated with each step

1. Collect homologous sequences. Concerns: gene tree-species tree / paralogy–orthology / trees within trees

2. Multiple sequence alignment. Concerns: positional homology / gaps / subjectivity-objectivity / methods

3. Phylogeny estimation. Concerns: philosophy / methods / consistency / power and accuracy

4. Test reliability or fit of phylogenetic estimates. Concerns: branch support / tree comparison / statistical issues with trees

5. Interpretation and application. Concerns: independent contrasts / impact of error on conclusions

Classification of tree-reconstruction methods

PARSIMONY METHODS: These methods utilize variation in CHARACTER STATES to reconstruct phylogenies. Character states are most often variation in the nucleotide (see below) or amino acid “states” at a site in a sequence of such characters. Such sequences often correspond to genes, but other sorts of sequences of characters could be just as useful; examples include nucleotides of introns or inter-genic regions, restriction site polymorphisms, or morphological characters.

Alignment of the nucleotide character states of the α-globin gene from five species of mammals

[Alignment figure, three interleaved blocks. Only the reference (human) row is reproduced here; in the original, the other four rows use “.” for a match to human and a letter for a difference.]

human   GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC AAG GCC GCC TGG GGC AAG GTT GGC GCG CAC

human   GCT GGC GAG TAT GGT GCG GAG GCC CTG GAG AGG ATG TTC CTG TCC TTC CCC ACC ACC AAG

human   ACC TAC TTC CCG CAC TTC GAC CTG AGC CAC GGC TCT GCC CAG GTT AAG GGC CAC GGC AAG

The order of DNA sequences in the alignment is specified by the order of the taxa in the list. To fit it on the page, the alignment is broken into three parts; such alignments are called INTERLEAVED. The complete DNA sequence is shown for the first taxon (human). All the other sequences are shown relative to human, with the dot, “.”, signifying a match in the character state with the human sequence. Differences are indicated by using the single-letter nucleotide code (A, C, T or G). Note that this alignment could also be analyzed by using distance, likelihood, and Bayesian methods.

The parsimony principle is derived from the philosophical principle called Occam’s Razor: plurality should not be posited without necessity (Pluralitas non est ponenda sine necessitate; William of Occam, medieval English philosopher [ca. 1285-1349]). Thus the “simplest” hypothesis is the one that is chosen under the MAXIMUM PARSIMONY criterion. Let’s take a nucleotide dataset as an example. In this case an individual tree is a hypothesis, and the “best tree” for the dataset is the one that requires the fewest nucleotide substitutions to explain those data. One first computes the minimum number of evolutionary changes required to fit a given dataset to a tree. This number, often called the “number of STEPS”, is recorded for all candidate trees. The tree that requires the minimum number of steps is selected as the best estimate of the phylogenetic tree, and is called the MAXIMUM PARSIMONY TREE. When two or more trees share the same minimum number of steps, such trees are called EQUALLY PARSIMONIOUS TREES. The length of a tree in steps is called the “TREE LENGTH”. The appeal of maximum parsimony is that the shortest tree is the one that requires the fewest homoplasies. Remember, homoplasies are events such as parallelisms, convergences, and reversals; as such they represent non-phylogenetic similarities. “Longer trees” require more assumptions of homoplasy and thus are more complex than the maximum parsimony tree. When the truth is not parsimonious, the parsimony tree length underestimates the true evolutionary distances.

Example of the maximum parsimony principle in phylogenetics:

SITE        1  2  3  4  5  6  7  8  9 10 11 12
SPECIES 1   A  T  G  T  T  G  T  G  A  T  A  A
SPECIES 2   A  T  G  T  T  C  T  G  G  T  A  A
SPECIES 3   A  T  G  T  T  A  T  C  A  T  A  A
SPECIES 4   A  T  G  T  T  A  T  C  G  T  A  A

Lengths of three possible trees:
TREE 1: 5 steps
TREE 2: 6 steps
TREE 3: 6 steps

[Figure: character-state reconstructions for the three variable sites (6, 8 and 9) on each of the three possible unrooted trees for the four species (TREE 1, TREE 2, TREE 3). Square brackets mark an alternative, equally parsimonious state at an internal node.]
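The step counting described above can be sketched in a few lines. The block below is a minimal illustration, assuming Fitch's set-based counting for unordered characters (one common way to obtain the minimum number of steps; the lecture does not name a specific algorithm). The alignment `toy`, the taxon names and the function names are hypothetical and are not the lecture's data.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum steps) for one site.

    `tree` is a rooted binary tree written as nested 2-tuples of taxon names;
    `states` maps each taxon name to its character state at this site.
    """
    if isinstance(tree, str):          # a tip: its state is observed
        return {states[tree]}, 0
    left_set, left_steps = fitch_site(tree[0], states)
    right_set, right_steps = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:                         # children can agree: no extra change
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Sum the per-site step counts to get the tree length."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[k] for taxon, seq in alignment.items()})[1]
        for k in range(n_sites)
    )


# Hypothetical toy alignment (assumption: made up for illustration only).
toy = {"w": "AGAC", "x": "AGCC", "y": "TCAC", "z": "TCCG"}

# The three possible groupings of four taxa; where the root is drawn
# does not affect the step count.
candidates = {
    "wx|yz": (("w", "x"), ("y", "z")),
    "wy|xz": (("w", "y"), ("x", "z")),
    "wz|xy": (("w", "z"), ("x", "y")),
}
lengths = {name: tree_length(tree, toy) for name, tree in candidates.items()}
best = min(lengths, key=lengths.get)
print(lengths, "->", best)
```

For this toy alignment the three candidate trees need 5, 6 and 7 steps, so the first grouping is the maximum parsimony tree.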

A problem arises when the underlying mechanism of molecular evolution is sufficiently complex that the number of homoplasies exceeds the true phylogenetic signal in the data. When this happens, methods that choose simple solutions are sometimes “fooled” by the data: the simplest way to fit a tree to such data is to treat the homoplasies as the true signal and the true signal as the homoplasies. In that case we say that maximum parsimony is INCONSISTENT under such a mode of molecular evolution.

DISTANCE MATRIX METHODS: If one looks at a phylogeny with branch lengths scaled to some evolutionary distance, such as the mean number of changes per site in a gene, it is easy to see that there is a relationship between evolutionary distance and a measure of pairwise similarity between the lineages. For example, a pair of sister taxa on a tree will have a shorter distance between each other than either will have with any other lineage on such a tree. Distance methods seek to utilize this form of information to reconstruct phylogenies. All distance methods start by converting the original data, say a set of gene sequences, into a matrix of pairwise distance values between all pairs of lineages in the sample. Next a tree is inferred either (i) by some type of sequential joining method, or (ii) by evaluating a set of candidate trees and applying a type of OPTIMALITY CRITERION to select the best tree. Note that the maximum parsimony method described above is one example of an optimality criterion, one that may be used on discrete character data. An optimality criterion for distance data with a justification similar to that of parsimony is MINIMUM EVOLUTION. Under the minimum evolution criterion, the tree with the smallest sum of branch lengths is chosen as the best estimate. As with character-based datasets, there are a variety of optimality criteria that one can use with distance data.

Example of distance based approach to molecular phylogenetics:

Obtain a set of homologous gene sequences and produce an alignment.

Transform the primary data into a matrix of pairwise genetic distance values.

Select a method of inferring a phylogenetic tree from distance data; in this case it is the least squares method.

In this case, determine the S statistic for the set of candidate trees, and select a tree that minimizes S. Note that S is a function of both the tree topology and its branch lengths.

[Figure: example tree for human, chimp, gorilla and orang.]

Distance methods have a number of attractive qualities for phylogenetics. First and foremost, the distance calculations between all pairs of sequences are based on an explicit model of molecular evolution. If the most important features of the process of evolution are contained in the model, then inconsistency problems such as long-branch attraction are reduced or eliminated. For those who are interested, I have placed on the course website a short summary of the more popular models of nucleotide and amino acid evolution. We will return to the problem of using model-based methods to obtain “corrected” estimates of evolutionary distance later in this course. Another very useful feature of distance methods is a statistical framework for evaluating models or hypotheses, which is not available under parsimony methods. A noteworthy drawback of distance methods is that the information content of the dataset is reduced in the step of transforming the primary data into a matrix of pairwise distance values. The practical effect is that the power of distance methods could be lower than that of character-based methods in certain circumstances.
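The transformation from an alignment to a matrix of pairwise distances can be illustrated as follows. This is only a sketch under stated assumptions: it uses the Jukes-Cantor (JC69) correction as the explicit model, which is one of several models one might choose, and the sequences in `toy` as well as all function names are hypothetical.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:                      # correction is undefined at saturation
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Corrected distances between all pairs of sequences."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = jc69_distance(p_distance(alignment[a], alignment[b]))
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical aligned sequences (assumption: made up for illustration only).
toy = {"w": "ACGTACGT", "x": "ACGTACGA", "y": "ACGAACTA", "z": "TCGAACTA"}
matrix = distance_matrix(toy)
for name in toy:
    print(name, [round(matrix[name][other], 3) for other in toy])
```

The corrected distance is always at least as large as the raw proportion of differences, because the model accounts for multiple changes at the same site. A least squares or minimum evolution search would then take this matrix as its input.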

MAXIMUM LIKELIHOOD METHODS: Maximum likelihood is a standard statistical framework that can be applied to the problem of tree-reconstruction when a stochastic model of evolution is assumed. Note that maximum likelihood was invented by the great British statistician Ronald A. Fisher in 1912, and is of central importance to the field of statistics in its own right. Many of the well-known statistical estimates are maximum likelihood estimates. Remember the binomial distribution that we used earlier to study the problem of genetic drift? The fraction of heads is a maximum likelihood estimate of the parameter of a binomial distribution. Likelihood methods measure the probability of the data given the hypothesis [i.e., Prob (D|H)]. Note that this is NOT the probability of the hypothesis given the data; that would be Prob (H|D). In terms of the phylogeny problem we attempt to measure the probability of the data given a particular tree topology [i.e., Prob (D|τ)], which we call the LIKELIHOOD SCORE. Given an explicit model of evolution it is relatively straightforward to compute the likelihood of a tree (τ), although it is computationally slow as there are many terms in the likelihood function. Supplementary notes that describe the calculation of the likelihood of a tree given a sample of nucleotide data are posted on the course website. Given a set of candidate trees, the likelihood score is used as an optimality criterion, and the tree that yields the largest probability of observing the data in hand (i.e., the largest likelihood score) is taken as the best estimate of the tree. We call this tree the MAXIMUM LIKELIHOOD TREE, and its score is the maximum likelihood score.


Like parsimony methods, likelihood-based methods are based on characters rather than pairwise distance. Unlike parsimony, likelihood uses an explicit model of evolution to compute probabilities of character-state changes along a tree. The task of computing the likelihood of the data given a tree is broken down into separate calculations of the probability that the nucleotide data at a site evolved along a given tree (under a given model of evolution). At the end these SITE LIKELIHOODS are multiplied to get the likelihood of the complete dataset.
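To make the site-by-site calculation concrete, here is a minimal sketch that assumes the Jukes-Cantor (JC69) model with equal base frequencies and fixed, hypothetical branch lengths. It sums over the unknown states at internal nodes recursively (often called the pruning approach) and adds log site likelihoods, which is equivalent to multiplying site likelihoods. The tree encoding, names and numbers are assumptions made for illustration, not the course's supplementary method.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69_prob(start, end, t):
    """Probability of observing `end` after time t given `start` (JC69)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def conditional(node, states):
    """P(tip data below `node` | state x at `node`), for each x.

    A tip is a taxon name; an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == states[node] else 0.0 for x in NUCLEOTIDES}
    result = {x: 1.0 for x in NUCLEOTIDES}
    for child, t in node:
        below = conditional(child, states)
        for x in NUCLEOTIDES:
            result[x] *= sum(jc69_prob(x, y, t) * below[y] for y in NUCLEOTIDES)
    return result


def site_likelihood(tree, states):
    """Average the root's conditional likelihoods over equal base frequencies."""
    root = conditional(tree, states)
    return sum(0.25 * root[x] for x in NUCLEOTIDES)


def log_likelihood(tree, alignment):
    """Sum of log site likelihoods (the log of their product)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {n: s[k] for n, s in alignment.items()}))
        for k in range(n_sites)
    )


# Hypothetical four-taxon tree and alignment (assumptions for illustration).
inner_ab = (("a", 0.1), ("b", 0.1))
inner_cd = (("c", 0.1), ("d", 0.1))
tree = ((inner_ab, 0.05), (inner_cd, 0.05))
toy = {"a": "ACGT", "b": "ACGA", "c": "TCGA", "d": "TCGA"}
print(log_likelihood(tree, toy))
```

In a real analysis the branch lengths and model parameters would themselves be optimized for each candidate tree before the scores are compared.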

BAYESIAN METHODS: Bayesian methods have become very popular for phylogenetics over the past few years. Bayesian inference of phylogenies involves making an inference from the posterior distribution of trees. Because the posterior is extremely complicated there is no analytical formula for it, and a technique called Markov Chain Monte Carlo (MCMC) is used to approximate the posterior. We will not cover Bayesian phylogenetics in this course; however, this approach, as well as the others mentioned above, is covered in detail in the fourth-year course, Bioinformatics (BIOL 4041 / BIOC 4010). For those who can’t wait, I have placed a copy of an excellent review of Bayesian inference in phylogenetics on the course web site (Huelsenbeck et al. 2001. Science 294: 2310-2314.).

ALGORITHMIC METHODS: Rather than comparing alternate topologies based on some criterion of optimality (e.g., parsimony or likelihood), algorithmic methods computationally “build a tree” according to a specific set of “steps”. All the algorithms break the task up into steps where a decision is made concerning the relationship of a small set of taxa according to some criterion. The steps are repeated until all taxa have been placed into a phylogenetic tree. The most commonly used algorithmic methods include UPGMA, STEPWISE ADDITION, STAR DECOMPOSITION, and QUARTET PUZZLING. Let’s take five taxa as an example to look at stepwise addition. A three-taxon tree is selected as the starting point. In turn, each of the remaining two taxa is placed on each of the three branches and the result of each placement is evaluated, say for tree length. The best four-taxon tree is selected and all others are discarded. The four-taxon tree is used as the starting point for evaluating the placement of the last taxon at each of the five possible places on the four-taxon tree. The best is selected and all others are discarded. This procedure can be applied to any number of taxa. Star decomposition reverses the process above; rather than building up a tree one species at a time, this method starts with all the taxa present as a completely unresolved tree, called a star tree. A completely resolved bifurcating tree is obtained by resolving the tree, step by step, by grouping two lineages in each step. Note that there are multiple pathways for decomposing a star tree into a bifurcating tree. The figure below illustrates one pathway of star decomposition.

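Obtaining a tree by star decomposition

[Figure: four successive stages of star decomposition for six taxa (A-F), from the completely unresolved star tree to a fully resolved bifurcating tree.]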

Algorithmic methods are very fast because they proceed directly toward a final solution, discarding alternatives along the way. Unfortunately algorithmic methods provide no measure of suitability of the discarded alternatives. Are they nearly as good as the tree obtained by the algorithm or are they much worse? Consider the case where an algorithm resolves a single tree, but where parsimony identifies 500 equally parsimonious trees. How much confidence should be placed in the single tree obtained from the algorithmic method?
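The stepwise-addition procedure described earlier in this section can be sketched as below. This is a simplified illustration under several assumptions: taxa are added in a fixed, user-supplied order (rather than trying every remaining taxon at each step), each candidate placement is scored by parsimony tree length using Fitch counting, and trees are written as nested 2-tuples. The alignment and all names are hypothetical.

```python
from itertools import islice


def fitch_length(tree, alignment):
    """Parsimony length of a rooted binary tree (nested 2-tuples of names)."""
    def site(node, k):
        if isinstance(node, str):
            return {alignment[node][k]}, 0
        (lset, lsteps), (rset, rsteps) = site(node[0], k), site(node[1], k)
        shared = lset & rset
        if shared:
            return shared, lsteps + rsteps
        return lset | rset, lsteps + rsteps + 1
    n_sites = len(next(iter(alignment.values())))
    return sum(site(tree, k)[1] for k in range(n_sites))


def attach_below(node, taxon):
    """Yield `node` with `taxon` attached above it or above any descendant."""
    yield (node, taxon)
    if not isinstance(node, str):
        left, right = node
        for new_left in attach_below(left, taxon):
            yield (new_left, right)
        for new_right in attach_below(right, taxon):
            yield (left, new_right)


def placements(tree, taxon):
    """One rooted representative per branch of the unrooted tree."""
    left, right = tree
    yield (tree, taxon)                       # the branch that carries the root
    # Skip the first item on each side: it is the same unrooted branch again.
    for new_left in islice(attach_below(left, taxon), 1, None):
        yield (new_left, right)
    for new_right in islice(attach_below(right, taxon), 1, None):
        yield (left, new_right)


def stepwise_addition(alignment, order):
    """Add taxa one at a time, keeping only the best placement at each step."""
    first, second, third = order[:3]
    tree = ((first, second), third)           # the only three-taxon topology
    for taxon in order[3:]:
        tree = min(placements(tree, taxon),
                   key=lambda candidate: fitch_length(candidate, alignment))
    return tree


# Hypothetical alignment (assumption: made up for illustration only).
toy = {"A": "TCTC", "B": "TCTC", "C": "ACAC", "D": "AGAG", "E": "AGAG"}
result = stepwise_addition(toy, ["A", "C", "D", "B", "E"])
print(result, fitch_length(result, toy))
```

Note that the search evaluates only 3 + 5 = 8 candidate trees for five taxa and never revisits a discarded placement, which is why the result is not guaranteed to be the globally shortest tree.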

Selected list of the more popular methods of inferring a phylogenetic tree.

MAXIMUM PARSIMONY: Character-based method that selects the tree that minimizes the net amount of evolutionary character-state transformations.

WEIGHTED PARSIMONY: Character-based method that selects the tree that minimizes the net amount of evolutionary character-state transformations after applying weights to different subsets of the possible transformations.

TRANSVERSION PARSIMONY: Character-based method that selects a tree based on DNA sequences by minimizing the net amount of transversional transformations. Transitions are ignored.

DOLLO PARSIMONY: Character-based method that selects the tree that minimizes the net amount of evolutionary character-state transformations under the assumption that such transformations are irreversible.

LEAST-SQUARES: Distance method that selects the tree that minimizes the discrepancy between the observed distance values and the branch lengths predicted by a tree.

MINIMUM EVOLUTION: Distance method that selects the tree that minimizes the sum of the branch lengths (total tree length) of the reconstructed tree.

MAXIMUM LIKELIHOOD: Character-based method that assumes a model of evolution and selects the tree that maximizes the probability of the data set in hand given the assumed evolutionary model.

NEIGHBOUR-JOINING: A star-decomposition algorithm that proceeds by minimizing the total length of the tree. There is no guarantee that neighbour-joining will recover the tree with the globally shortest tree length.

UPGMA: A clustering algorithm that assumes a linear relationship between evolutionary distance and divergence time. UPGMA stands for unweighted pair-group method with arithmetic means. It is among the simplest methods for tree reconstruction.

Some assumptions that [nearly] all methods of tree-reconstruction make for gene sequence data

• The sequence data contain no errors.

• The genes are homologous.

• Each position in the alignment has positional homology, although it might differ in character state.

• Evolution at each position is independent of the other positions.

• The sequence variation contained in the sample of gene sequences is representative of the broader population of genes in the genome and lineages within the group of interest.

• The signal-to-noise ratio in the gene sequences is sufficient to resolve the problem of interest.

Role of assumptions in the form of an evolutionary model

MAXIMUM PARSIMONY: Implicit rather than explicit

DISTANCE METHODS: Used for distance corrections

MAXIMUM LIKELIHOOD: Full and explicit use

The problem of tree searching

For those methods based on an optimality criterion, the best solution to the tree problem would be to evaluate every possible tree topology and compute the tree score for each (e.g., the likelihood score). Let’s call the set of all possible trees TREE SPACE. It would be a simple task to keep track of the best score during the search of tree space, and replace the current “best tree” with any that are found to be better. The best tree (or list of equally good trees) at the end of the search of tree space is the best estimate of the phylogeny under the chosen optimality criterion. The problem is that as the number of lineages in the data set increases, the size of tree space increases spectacularly. Let’s take a look at the size of tree space for unrooted bifurcating trees. We will focus on unrooted trees as this is the type of tree space that the vast majority of phylogenetic methods will search. The number of such trees is given by:

NT = 3 × 5 × 7 × … × (2n – 5),

where n is the number of species. A table showing the increase in the size of tree space with increasing number of lineages is presented below.

Number of lineages    Number of unrooted trees
3                     1
4                     3
5                     15
6                     105
7                     945
8                     10,395
9                     135,135
10                    2,027,025
11                    34,459,425
12                    654,729,075
13                    13,749,310,575
14                    316,234,143,225
15                    7,905,853,580,625
20                    221,643,095,476,699,771,875
50                    ~3 × 10^74 (getting close to Eddington’s number!)
100                   ~2 × 10^182

[For comparison, Avogadro’s number is only 6 × 10^23.]

At 50 lineages, one is getting close to Eddington’s number, the number of electrons in the universe!
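The growth of tree space follows directly from the product formula NT = 3 × 5 × 7 × … × (2n – 5). The short sketch below evaluates it with exact integer arithmetic; the function name is hypothetical.

```python
def count_unrooted_trees(n_lineages):
    """Number of unrooted bifurcating trees: 3 x 5 x ... x (2n - 5)."""
    if n_lineages < 3:
        raise ValueError("need at least three lineages")
    count = 1
    for k in range(4, n_lineages + 1):
        # A tree with k - 1 tips has 2k - 5 branches on which
        # the k-th lineage can be attached.
        count *= 2 * k - 5
    return count


for n in (3, 4, 5, 10, 20):
    print(n, f"{count_unrooted_trees(n):,}")
print(50, f"{count_unrooted_trees(50):.1e}")
```

Each factor in the product is the number of branches available for the next lineage, which is also why stepwise addition evaluates 3, then 5, then 7 placements as taxa are added.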

An alternative to exhaustive searching is a method called the BRANCH-AND-BOUND SEARCH. Here the algorithm is able to eliminate parts of tree space that contain suboptimal trees as it proceeds through the search. [Note that here the algorithm is one for searching tree space, not for constructing trees.] Although very effective, the algorithm is only practical up to about 20 lineages. For more than 20 lineages only HEURISTIC searches of tree space are possible. Heuristic methods employ algorithms that will conclude a search in an acceptable amount of time, but at the cost of not being able to guarantee that the globally optimal solution has been found. These methods start with an initial tree (provided by the user, obtained at random, or from a method like stepwise addition), and conduct a process called BRANCH-SWAPPING. Branch swapping involves making small rearrangements to the tree topology. Following a branch swapping event, the score of the resulting topology is computed. The process of branch-swapping is continued until no more improvements can be made to the optimality criterion. This is a type of HILL-CLIMBING ALGORITHM, and unfortunately can result in a tree that represents a local optimum in the optimality criterion rather than the global optimum.

A “nice” likelihood surface as a function of the length of a two taxon tree (t) and the model parameter ω; there is only one peak on the surface


Outgroups

As we have seen above, mapping characters on a rooted phylogeny allows us to distinguish between homology and analogy. We can also distinguish between the primitive and derived character states. Such inferences are simply not possible with unrooted trees. Because nearly all modern methods of phylogenetic inference produce unrooted trees, correctly identifying the root is an important aspect of phylogenetic analysis. Today, the overwhelming majority of biologists use the OUTGROUP METHOD to root phylogenies.

Let’s define some terms:

INGROUP: A group of lineages, assumed to be monophyletic, whose phylogenetic relationships are of primary interest.

OUTGROUP: One or more terminal taxa that are assumed to be outside of the monophyletic group that has been specified as the ingroup. Unlike the ingroup, the outgroup does not have to be monophyletic.

ROOT: The most evolutionarily basal point of a phylogeny. The root orients the direction of change along a phylogeny relative to time.

CHARACTER POLARITY: The evolutionary relationship between two or more states for a given character. Say we have a character with two states, “a” and “b”. By mapping them on a phylogeny we can determine that “b” preceded “a” in evolutionary history; hence “a” is the derived state and “b” is the primitive state.

In the outgroup method the outgroup or outgroup taxa serve only one purpose; following a phylogenetic analysis their location determines where to place the root on an unrooted tree. The root is placed along the branch that connects the ingroup with the outgroup.

Rooting a phylogenetic tree by placing the root between the ingroup and outgroup

[Figure, three panels: (1) an unrooted tree of four ingroup taxa (IG-1 to IG-4) and one outgroup taxon (OG); (2) the root placed on the branch between the ingroup and the outgroup; (3) the resulting rooted tree. IG: ingroup; OG: outgroup.]
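Placing the root on the branch that connects the ingroup with the outgroup is a simple operation once the unrooted tree is stored as a graph. The sketch below assumes a single outgroup tip and a hypothetical topology for the taxa IG-1 to IG-4 and OG; the internal node labels "u", "v" and "w" and the function name are made up for illustration.

```python
def root_with_outgroup(neighbours, outgroup):
    """Root an unrooted tree on the branch leading to a single outgroup tip.

    `neighbours` maps every node (tips and internal nodes) to a list of the
    nodes it is connected to. The result is a rooted tree as nested tuples.
    """
    (attachment,) = neighbours[outgroup]      # a tip has exactly one neighbour

    def subtree(node, parent):
        children = [n for n in neighbours[node] if n != parent]
        if not children:                      # reached a tip
            return node
        return tuple(subtree(child, node) for child in children)

    # The root sits on the branch between the outgroup and the rest of the tree.
    return (outgroup, subtree(attachment, outgroup))


# Hypothetical unrooted topology (assumption: chosen only for illustration).
unrooted = {
    "u": ["IG-1", "IG-2", "v"],
    "v": ["u", "IG-3", "w"],
    "w": ["v", "IG-4", "OG"],
    "IG-1": ["u"], "IG-2": ["u"], "IG-3": ["v"], "IG-4": ["w"], "OG": ["w"],
}
rooted = root_with_outgroup(unrooted, "OG")
print(rooted)
```

The unrooted analysis itself is unchanged by this step; only the orientation of the tree, and therefore the polarity of character change, is decided by where the outgroup attaches.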

An important point is that the phylogenetic analysis must contain both the ingroup and outgroup taxa. Usually, no constraints are placed on either the ingroups or outgroups for the purposes of conducting the phylogenetic analysis. There have been many misconceptions about the significance and role of outgroups. To avoid further confusion, a concept map of the outgroup method is presented below.

Flowchart of the general method of outgroup analysis. This method is based on simultaneous phylogenetic analysis of ingroup and outgroup lineages.

1. Define ingroup, usually by synapomorphies.

2. Define outgroup, usually by more inclusive synapomorphies.

3. Combine ingroup and outgroup into a single dataset. (Treat outgroups as terminal taxa.)

4. Conduct unrooted phylogenetic analysis. (Any method can be used: parsimony, likelihood, etc.)

5. Root tree between ingroup and outgroup.

6. Read characters from phylogeny. (Distinguish between primitive and derived, and between homology and analogy.)

Note: Other methods do not use outgroups; e.g., mid-point methods and hypothetical ancestors.

Many myths have developed about the use of outgroups in phylogenetic analysis. The following figure illustrates the most common misconceptions about how to use outgroups.

Outgroup myths:

Myth 1: The character state in the outgroup should be considered primitive. In reality, character states in the outgroup can be, and often are, highly derived features of the organism.

Myth 2: The outgroup should be the sister taxon to the ingroup. There are many reasons why this is desirable; however, it is not absolutely necessary. It is possible to root a tree by using an outgroup that is more distantly related to the ingroup than its sister group.

Myth 3: More than one outgroup is required to root a tree. Of course larger sample sizes are generally better than smaller ones, but as we have shown above, it is possible to place a root on a tree by using only a single outgroup taxon.

Comparison of phylogenetic methods

Random versus systematic error

In any statistical analysis there will be two potential sources of error (systematic and random). RANDOM ERROR is defined as the deviation between a parameter and an estimate of that parameter that is due solely to the effects of finite sample size. Since all phylogenetic methods are applied to finite sets of data, all estimates of a phylogeny will be subject to sampling error to some degree. SYSTEMATIC ERROR is the deviation between a parameter and an estimate that is due to incorrect assumptions of the estimation method. An important difference between these two types of error is that while random error decreases with increasing sample size, systematic errors persist, and sometimes intensify, as sample sizes increase.

Many commonly used methods of phylogenetic inference are not based on explicit assumptions. At first this might seem to be an advantage over model-based methods, because if an assumption of a model is violated it could lead to systematic errors. However, a lack of stated assumptions does not mean that a method is assumption-free. In the “model-free” methods, the assumptions are merely implicit rather than explicit. Fortunately, phylogenetic methods are not automatically invalidated when one or more of their assumptions are violated; in fact, a very simple model can be useful. An advantage of the model-based methods is the explicit nature of their assumptions; the fit of a specific assumption, or even the entire model, to the data can be evaluated.

Long branch attraction as an example of systematic error:

There are many modes of molecular evolution that can lead to inconsistency. Some examples are (i) rate heterogeneity among branches in the true tree; (ii) nucleotide compositional heterogeneity among the branches; and (iii) non-phylogenetic convergence in site-specific rates of evolution. Rate heterogeneity among branches can lead to something called LONG-BRANCH ATTRACTION (LBA). LBA occurs when two distantly related branches have very high rates such that the number of non-phylogenetic similarities between those branches exceeds the true signal in the data. When this happens many methods will yield a tree where those two unrelated branches are put together; hence, the “long branch attraction” effect. The figure below provides an example of this effect.

Long branch attraction: lineages 1 and 3 are not sister taxa, but are recovered incorrectly as sister taxa because high evolutionary rates lead to an excess of non-phylogenetic similarities in character states in those two lineages.

[Figure, two panels: the true tree, in which lineages 1 and 3 have an extremely high rate of substitution; and the inferred tree.]

Note that maximum parsimony appears to be particularly sensitive to LBA.

Evaluation criteria

All phylogenetic methods have advantages and disadvantages. There are a variety of criteria (see list below) by which one can judge a method; and, a method that does not perform well by one standard often will perform well when measured by another standard.

• Consistency: An estimation method is said to be (statistically) consistent if the estimate converges to the true value of the parameter as the amount of data approaches infinity. A tree construction method is said to be consistent if the estimated tree is the true tree when the number of sites in the sequence goes to infinity. For model-based methods, the definition of consistency assumes that the model is correct. There has been a lot of discussion about this criterion since Felsenstein (1978) demonstrated that parsimony can be inconsistent under some model-tree combinations. A method that is inconsistent might be said to have a systematic error.

• Efficiency: A measure of how often we recover the true relationship given limited data. It is usually measured by the probability of recovering the correct tree or subtree (represented by internal nodes of the tree) when there is a limited number of sites in the sequence. In finite data, every method has random errors or sampling errors, and can get the tree wrong just by chance if there is not enough information in the sample of data. However, a more efficient method has smaller sampling errors than an inefficient method and will recover the correct tree more often than the inefficient method at a given finite sample size.

• Robustness: A method is robust if it still gives correct answers when the assumptions of the method are wrong; that is, a robust method is not sensitive to violations of its assumptions.

• Computational speed: Distance-based methods are very fast. Likelihood is the slowest.

• Philosophical justifications (typically vacuous arguments)

Evaluation methods

Given a wide variety of methods, we would like to know how each performs at recovering a set of phylogenetic relationships. However, there are a variety of approaches for this, and each approach has its own advantages and disadvantages as well.

• Computer simulation. You can simulate many replicate data sets under a simulation model. You then use various tree reconstruction methods to analyse the data and see whether each method recovers the true tree, which you used during the simulation. You can change the variables in the simulation such as the sequence length, the shape of the true tree, the sequence divergence, etc. to see their effects. Simulation is probably the most commonly used method for evaluating phylogeny reconstruction methods, and there are computer programs for simulating data sets such as seq-gen by Andrew Rambaut in Oxford and evolver in Ziheng Yang’s paml package.

A criticism of computer simulation is that the models used for simulation do not reflect the true complexity of molecular evolution.

• Lab-generated phylogenies. Hillis et al. (1992 Science 255:589-92) generated a known phylogeny in the lab using the bacteriophage T7. Since the phage was sequenced at different stages and then separated to produce different lineages, the phylogeny as well as the ancestral sequences is known. They then used tree reconstruction methods to reconstruct the history. All methods performed extremely well!

A criticism of lab-generated phylogenies is that one lacks the control of specific aspects of molecular evolution that one has in a simulation study.

• Well-established phylogenies. In some cases, the phylogenetic relationship is almost certain, and such well-established phylogenies can be used to evaluate the performance of tree reconstruction methods or the utility of different genes.

A criticism of this approach is that the number of cases of well-established phylogenies is so low that this method is impossible to apply to many questions.

Some general comments about the powers and pitfalls of different methods

Given both the complexity of molecular evolution and the wide variety of approaches to phylogeny reconstruction, it is difficult to make any general recommendations. It appears that all methods tend to give incorrect trees when sequence length is small, and there are conditions that can cause all methods to be inconsistent. Perhaps the only safe recommendation is that one should not reject an entire method simply because it did not perform well in a particular computer simulation or some other study of performance. The utility of different methods will vary greatly among datasets, and also will depend on the analytical objectives. For example, you might not want to use branch lengths from a parsimony tree to estimate divergence times for very deep lineages, yet you might want to use parsimony to estimate a tree topology from a large sample of closely related species.

Phylogenetic uncertainty

Because most of the above applications rely on the assumption that a phylogeny is known without error, there is always the possibility that some conclusions might be overturned if a subsequent phylogenetic study suggests a different tree topology.

One of the most common methods of minimizing dependence on a single estimate of a phylogeny is to conduct phylogenetic analyses using several methods or several different datasets. The idea is that those parts of the phylogeny that are robust to method or dataset are likely to be good estimates of the true phylogeny. The different topologies can be combined to form a STRICT CONSENSUS TREE, which is a single tree that maintains only the monophyletic groups found in all the individual estimates of the phylogeny. Phylogeny-based analyses are then conducted using only the information about evolutionary relationships contained in the consensus tree.
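The strict consensus idea can be expressed as a set intersection: collect the groups (clades) implied by each tree and keep only those present in every tree. The sketch below assumes rooted trees written as nested tuples of taxon names; the two example trees and the function names are hypothetical.

```python
def clades(tree):
    """Return (all tips below `tree`, set of multi-taxon clades in `tree`)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips = frozenset()
    found = set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)                    # the group defined by this node
    return tips, found


def strict_consensus_clades(trees):
    """Groups that appear in every one of the input trees."""
    clade_sets = [clades(tree)[1] for tree in trees]
    return set.intersection(*clade_sets)


# Two hypothetical estimates of the same phylogeny (assumption: made up).
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "B"), "D"), ("C", "E"))
shared = strict_consensus_clades([tree_1, tree_2])
print(sorted(sorted(group) for group in shared))
```

Here only the group A + B (and the trivial group of all five taxa) survives, so the strict consensus would show A and B as sisters and leave the remaining relationships unresolved.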

In statistics, when we estimate a parameter, we also need to calculate the confidence interval to indicate how reliable our point estimate is. For example, a 95% confidence interval, say 110 ± 10, means that if the same experiment is repeated many times, we would expect the interval to cover the true value of the parameter in 95% of the replicates. Clearly it is desirable for us to provide a measure of the reliability of the estimated phylogeny. However, the phylogeny represents a complex structure that is quite different from a conventional parameter, and this difference makes it difficult to construct a confidence interval for the estimated phylogeny. No rigorously justified analytical solution to the problem is available. There are many approximate methods currently in use, including some that are known not to work. We will discuss one of the most popular, the nonparametric bootstrap.
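The repeated-experiment meaning of a 95% confidence interval can be checked with a small simulation. The sketch below is only an illustration under assumed conditions: the data are normal with a known standard deviation, and the true mean (110), standard deviation (20), sample size (16), number of replicates and random seed are all hypothetical values, chosen so that the interval half-width is close to the ± 10 of the example.

```python
import math
import random

def coverage(true_mean=110.0, sigma=20.0, n=16, replicates=2000, seed=1):
    """Fraction of replicate experiments whose 95% interval covers true_mean."""
    rng = random.Random(seed)
    # Known-sigma normal interval: sample mean +/- 1.96 * sigma / sqrt(n)
    half_width = 1.96 * sigma / math.sqrt(n)  # 9.8 with the defaults
    hits = 0
    for _ in range(replicates):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / replicates

cov = coverage()
print(f"Proportion of intervals covering the true mean: {cov:.3f}")
```

The printed proportion should be close to 0.95. No equally simple recipe exists for a tree topology, which is why approximate methods such as the bootstrap are used.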

Bootstrap proportions

The NONPARAMETRIC BOOTSTRAP is the most commonly used method, and also the most respected. This method generates many (e.g., 100 or 500) pseudo data sets (called bootstrap samples) by resampling sites from the original data set with replacement. Each bootstrap sample has the same number of sites as the original sequence. Each site in the bootstrap sample is chosen at random from the original data set, so some sites in the original data set might not be sampled at all and some others might be sampled multiple times. For example, in the figure below sites 3, 4, 9 and 10 were each included twice in the bootstrap pseudosample, while sites 2, 5, 6 and 7 were not sampled at all. Of course, different sites will be sampled in different bootstrap samples. Each pseudo data set is analysed using a phylogeny reconstruction method in the same way as the original data set, and the proportion of bootstrap samples that supports a particular clade is recorded. This is known as the bootstrap support, or bootstrap proportion, for that clade.

Original data:

Site        1  2  3  4  5  6  7  8  9  10
Species 1   T  C  A  G  T  T  C  G  A  T
Species 2   C  C  G  G  T  G  A  C  A  T
Species 3   A  C  A  T  T  T  A  G  A  A
Species 4   G  C  A  T  T  G  A  C  A  A

One possible bootstrap pseudosample:

Site        9  4  3  1  4  8  10 3  9  10
Species 1   A  G  A  T  G  G  T  A  A  T
Species 2   A  G  G  C  G  C  T  G  A  T
Species 3   A  T  A  A  T  G  A  A  A  A
Species 4   A  T  A  G  T  C  A  A  A  A
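The resampling step can be sketched in Python using the four-species data above. The function names `take_sites` and `bootstrap_sample` are made up for this illustration and do not come from any phylogenetics package. The first call reproduces the pseudosample shown above; the second draws a new one at random.

```python
import random

# Original data: one string of 10 sites per species.
alignment = {
    "Species 1": "TCAGTTCGAT",
    "Species 2": "CCGGTGACAT",
    "Species 3": "ACATTTAGAA",
    "Species 4": "GCATTGACAA",
}

def take_sites(aln, sites):
    """Build a data set from the given 1-based site numbers, in that order."""
    return {name: "".join(seq[s - 1] for s in sites) for name, seq in aln.items()}

def bootstrap_sample(aln, rng):
    """Resample sites with replacement, keeping the original number of sites."""
    n_sites = len(next(iter(aln.values())))
    sites = [rng.randint(1, n_sites) for _ in range(n_sites)]
    return sites, take_sites(aln, sites)

# The pseudosample shown in the notes.
shown = take_sites(alignment, [9, 4, 3, 1, 4, 8, 10, 3, 9, 10])
print(shown["Species 1"])  # AGATGGTAAT

# A new random pseudosample (the seed is arbitrary).
sites, pseudo = bootstrap_sample(alignment, random.Random(42))
print(sites)
print(pseudo)
```

Whole sites (columns) are resampled, so every species receives the same set of columns and the alignment between species is preserved.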

Overview of the bootstrap in phylogeny reconstruction:

[Figure: the sample data for a phylogenetic analysis are resampled to give pseudosamples 1, 2, 3, ..., n; each pseudosample yields a bootstrap tree (Bootstrap Tree 1, 2, 3, ..., n). The variation among the bootstrap trees reflects the sampling variance of the estimate of the true phylogeny obtained from the sample data.]

The bootstrap is a method for assessing the sampling (random) error in the data. Intuitively, if the different sites have consistent phylogenetic signals, there will be little conflict among the bootstrap pseudosamples and high bootstrap proportions will be achieved for many clades. However, if the reconstruction method has systematic errors (for example, if the method is inconsistent), it can give you high bootstrap support for a wrong clade!
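The tallying of bootstrap proportions can be sketched as follows. This is a minimal illustration that assumes the bootstrap trees have already been estimated by some reconstruction method; each tree is written as the set of clades it contains, and the five trees listed are hypothetical.

```python
from collections import Counter

def bootstrap_proportions(bootstrap_trees):
    """Proportion of bootstrap trees that contain each clade."""
    # Each tree is a set, so a clade is counted at most once per tree.
    counts = Counter(clade for tree in bootstrap_trees for clade in tree)
    n = len(bootstrap_trees)
    return {clade: count / n for clade, count in counts.items()}

# Hypothetical trees estimated from five pseudosamples of four species.
grouping_a = {frozenset({"Sp1", "Sp2"}), frozenset({"Sp3", "Sp4"})}
grouping_b = {frozenset({"Sp1", "Sp3"}), frozenset({"Sp2", "Sp4"})}
trees = [grouping_a, grouping_a, grouping_b, grouping_a, grouping_a]

support = bootstrap_proportions(trees)
for clade, proportion in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), f"{proportion:.0%}")
```

Here the clade (Sp1, Sp2) appears in four of the five bootstrap trees, giving a bootstrap proportion of 80%. A real analysis would use far more replicates than five.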

Example of gene trees with the non-parametric bootstrap method used to assess the support for individual branches of the trees. Bootstrap proportions were obtained from 2000 replications and values > 70% are shown above the involved branches.

From: Ward et al. (2002) PNAS 99:9278-9283.

Note that in molecular phylogenetics the interpretation of bootstrap proportions is controversial.

Bayesian posterior probabilities

The recent application of Bayesian methods to phylogenetic inference has provided biologists with a much more sophisticated tool to account for phylogenetic uncertainty. Here, Markov chain Monte Carlo (MCMC) methods are used to approximate the Bayesian posterior probabilities of different tree topologies. These probabilities are used to identify the set of trees that has a 95% probability of including the true tree. It is then a matter of analyzing each of the tree topologies in this set and weighting each result by the probability of the tree on which the analysis was performed. Although application of Bayesian methods to phylogenetics is still in its infancy, this analytical framework should greatly reduce the sensitivity of future evolutionary studies to the assumption that a single estimate of a phylogeny is known without error.
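The weighting step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any specific Bayesian program: the posterior probabilities and the per-tree results are hypothetical numbers, the 95% set is built by adding topologies in order of decreasing probability, and the weights are renormalized within that set (one possible choice).

```python
def credible_set(posterior, level=0.95):
    """Smallest set of topologies, taken in decreasing order of posterior
    probability, whose probabilities sum to at least `level`."""
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for topology, prob in ranked:
        chosen.append((topology, prob))
        total += prob
        if total >= level:
            break
    return chosen

def weighted_result(chosen, result_on_tree):
    """Average the per-tree results, weighting each by its tree's probability."""
    total = sum(prob for _, prob in chosen)  # renormalize within the set
    return sum(prob * result_on_tree[top] for top, prob in chosen) / total

# Hypothetical posterior probabilities approximated by MCMC.
posterior = {"((A,B),C)": 0.70, "((A,C),B)": 0.26, "((B,C),A)": 0.04}
# Hypothetical result of some downstream analysis performed on each tree.
estimates = {"((A,B),C)": 2.0, "((A,C),B)": 3.0, "((B,C),A)": 10.0}

trees_95 = credible_set(posterior)
answer = weighted_result(trees_95, estimates)
print([top for top, _ in trees_95])
print(round(answer, 3))
```

With these numbers the 95% set contains the two most probable topologies, and the final result is an average of the two per-tree results that leans towards the more probable tree.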

Appendix I: Comparison of tree-reconstruction methods

Parsimony methods

Advantages:

1. Intuitive appeal: Occam’s razor.
2. Very powerful implementation in current software.
3. Well studied, hence its properties are well understood.
4. Identifies synapomorphies.
5. Can be “weighted” to improve performance.
6. High efficiency when weighted properly, or when implicit assumptions are not violated.
7. Only framework for some types of molecular data (e.g., patterns of large scale genomic changes, SINEs, LINEs).

Disadvantages:

1. Implicit assumption that rates are low and largely homogeneous, and that nucleotide composition is largely homogeneous (relative to other methods).
2. Branch lengths are substantially underestimated when rates are high.
3. Unweighted parsimony is not as robust to violation of implicit assumptions; it appears to be inconsistent under wider conditions than most other methods.
4. Under realistic patterns of sequence evolution unweighted parsimony has lower efficiency than many other methods.
5. Controversy about how to properly weight parsimony.
6. Difficult to treat in a statistical framework because there is no way to compute means and variances of minimum numbers of substitutions.

Distance methods

Advantages:

1. High computational speed for most tree selection methods (e.g., UPGMA, NJ, ME)
2. Can change assumptions of underlying models
3. Some tree selection methods are well studied, and their properties are well understood (e.g., UPGMA, NJ, ME)
4. Some tree selection methods provide a statistical framework for hypothesis testing.

Disadvantages:

1. Software for some tree selection methods (e.g., least squares) less developed than for other distance or parsimony based tree selection methods
2. Some tree selection methods are strictly algorithmic and have no optimality criterion (e.g., UPGMA and NJ)
3. Loss of information in conversion of nucleotide or amino acid sequences to genetic distances
4. Sometimes produce biologically impossible branch lengths
5. Appear to be less efficient than ML or weighted parsimony methods

Likelihood methods

Advantages:

1. Sound statistical framework for evaluating a wide variety of evolutionary hypotheses.
2. Model based approach with no loss of information, as occurs in the distance based methods
3. Can test and change assumptions of underlying models
4. Very efficient when model is correct

Disadvantages:

1. Computational burden for tree selection is much higher than all other methods
2. Uncertainty over how to choose substitution parameter values when computational costs prevent their estimation by ML.
3. Difficulties of treating a tree topology as a parameter in a statistical approach to its estimation; a tree is not a numerical quantity.