
MOLECULAR PHYLOGENETIC METHODS
Course Contents - I. UPGMA II. Neighbor Joining III. Minimum Evolution IV. Maximum Parsimony V. Maximum Likelihood VI. Bayesian Inference

Gurumayum Suraj Sharma

MOLECULAR TREE BUILDING METHODS
‹ Mathematical/statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them.
‹ Many phylogenetic methods are available, each having strengths and weaknesses.

‹ Cluster analysis is one such method, in which OTUs are arranged in order of decreasing similarity.

DISTANCE BASED METHODS
‹ Distance-based methods begin construction of a tree by calculating pairwise distances between molecular sequences.
‹ A matrix of pairwise scores for all aligned protein (or nucleic acid) sequences is used to generate a tree.
‹ GOAL - Find a tree in which branch lengths correspond as closely as possible to the observed distances.
‹ Main distance-based methods:
I. Unweighted Pair Group Method with Arithmetic Mean [UPGMA]
II. Neighbor Joining [NJ]
‹ Distance-based methods of phylogeny are computationally fast.
‹ Particularly useful for analyses of a larger number of sequences (e.g., >50 or 100).


‹ USES A DISTANCE METRIC
o Number of amino acid changes between the sequences
o Distance score

‹ Distance is calculated as the dissimilarity between the sequences of each pair of taxa.
‹ While similarities are useful, distances (which are not the same as simple differences) offer appealing properties for describing the relationships between objects.
‹ Distance-based methods are fast but overlook a substantial amount of information in a multiple sequence alignment.
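As a minimal sketch of how a pairwise distance matrix might be computed from an alignment, the Python below uses the proportion of differing sites (the p-distance) as the dissimilarity score. The function names and the three toy sequences are illustrative assumptions, not part of the course material, and no correction for multiple substitutions is applied.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Build a symmetric nested dict: matrix[x][y] is the p-distance of x and y."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = p_distance(alignment[x], alignment[y])
        matrix[x][y] = matrix[y][x] = d
    return matrix


# Hypothetical toy alignment (three taxa, ten sites).
toy_alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTAA",
    "taxon3": "ACGAACGGAA",
}
matrix = distance_matrix(toy_alignment)
```

Here taxon1 and taxon2 differ at one site out of ten, so their distance is 0.1, while taxon1 and taxon3 differ at three sites (0.3). A matrix of this kind is the input to the distance-based methods described in these slides.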

DISTANCE-BASED METHOD: UPGMA

UPGMA algorithm introduced by Sokal & Michener [1958].
Example:
‹ Consider five sequences whose distances can be represented as points in a plane.
‹ Also represent them in a distance matrix.
‹ Some sequences, such as 1 & 2, are closely similar.
‹ Others (1 & 3) are far less related.
‹ UPGMA clusters these sequences as follows.


1. Begin with a distance matrix.
o Identify the least dissimilar groups (i.e. the two OTUs that are most closely related).
o All OTUs are given equal weights. If there are several equidistant minimal pairs, one is picked randomly.
o E.g. OTUs 1 and 2 have the smallest distance.

2. Combine to form a new group.
o E.g. groups 1 & 2 have the smallest distance (0.1) and are combined to form cluster (1,2).
o This results in a new, clustered distance matrix having one fewer row and column than the initial matrix.
o Dissimilarities not involved in formation of the new cluster remain unchanged.
o The values for the clustered taxa (1,2) are the averages of the distances of OTUs 1 and 2 to each of the other OTUs.
o E.g. the distance of OTU 1 to OTU 4 was initially 0.8 and that of OTU 2 to OTU 4 was 1.0, so the distance of cluster (1,2) to OTU 4 becomes (0.8 + 1.0)/2 = 0.9.

3. Connect through a new node on the nascent tree.
o This node corresponds to the newly formed group (1,2).

4. Identify the next smallest dissimilarity, & combine those taxa to generate a second clustered dissimilarity matrix.
o Two OTUs may be joined (if they share the least dissimilarity), or a single OTU may be joined with a cluster, or two clusters may be joined.
o The dissimilarity of a single OTU with a cluster is computed simply by taking the average dissimilarity between that OTU and the members of the cluster.
o In this process a new distance matrix is formed, and the tree continues to be constructed.

5. Continue until there are only two remaining groups, and join these.
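The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the data layout (a dict keyed by `frozenset` pairs) and most of the distances are assumptions made for the example. Only the values 0.1 for (1,2), 0.8 for (1,4) and 1.0 for (2,4) come from the slides. Ties are broken by label order here rather than randomly, so that the output is reproducible.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels (strings).
    dist:  dict mapping frozenset({a, b}) -> distance for every pair.
    Returns the merges in order as (group_a, group_b, node_height).
    """
    size = {name: 1 for name in names}   # number of OTUs in each current group
    d = dict(dist)                       # working copy of the distance matrix
    merges = []
    while len(size) > 1:
        # Steps 1 and 4: find the least dissimilar pair of current groups.
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        new = f"({a},{b})"
        # Step 3: the new node sits at half the distance between the groups.
        merges.append((a, b, d[pair] / 2))
        # Step 2: distance from the new group to every other group is the
        # average over all member OTUs (a size-weighted mean).
        for other in size:
            if other in (a, b):
                continue
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            d[frozenset((new, other))] = (
                size[a] * d_a + size[b] * d_b
            ) / (size[a] + size[b])
        del d[pair]
        size[new] = size.pop(a) + size.pop(b)
    return merges


# Hypothetical five-taxon matrix; only d(1,2), d(1,4) and d(2,4) are from the text.
raw = {
    ("1", "2"): 0.1, ("1", "3"): 0.8, ("1", "4"): 0.8, ("1", "5"): 0.9,
    ("2", "3"): 0.8, ("2", "4"): 1.0, ("2", "5"): 0.9,
    ("3", "4"): 0.4, ("3", "5"): 0.5, ("4", "5"): 0.2,
}
example = {frozenset(pair): value for pair, value in raw.items()}
merges = upgma(["1", "2", "3", "4", "5"], example)
```

With this toy matrix the first merge joins 1 and 2, the second joins 4 and 5, the third joins 3 to cluster (4,5), and the last merge connects the two remaining groups, giving a rooted tree.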


I. Each sequence is assigned to its own cluster. A distance matrix, based on some metric, quantitates the distance between each pair of objects. Circles represent sequences.
II. The taxa with the closest distance (1 and 2) are identified and connected. This allows an internal node to be named. The distance matrix is reconstructed counting taxa 1 and 2 as a group. The next closest sequences are then identified.
III. The next closest sequences are combined into a cluster, and the matrix is again redrawn. In the tree, taxa 4 and 5 are now connected by a new node, 7. The next smallest distance is then identified, corresponding to the union of sequence 3 with cluster (4,5).
IV. The newly formed group (cluster 4,5 joined with sequence 3) is represented on the emerging tree with new node 8.
V. Finally, all sequences are connected in a rooted tree.

INPUT / INITIAL SETTING

‹ Start with clusters of individual points and a distance/proximity matrix.

[Figure: distance/proximity matrix with rows and columns p1, p2, p3, p4, p5, ...]

INTERMEDIATE STATE

‹ After some merging steps, we have some clusters.

[Figure: clusters C1, C2, C3, C4, C5 and the corresponding distance/proximity matrix]

INTERMEDIATE STATE

‹ Merge the two closest clusters (C2 and C5) and update the distance matrix.

[Figure: clusters C1 to C5 with C2 and C5 merged, and the updated distance/proximity matrix]

[Figure: panels STEP 1, STEP 2, STEP 3, STEP 4]

Critical Assumption of UPGMA -
‹ The rate of nucleotide or amino acid substitution is constant for all branches in the tree, i.e., the molecular clock applies to all evolutionary lineages.
‹ If this assumption is true, branch lengths can be used to estimate the dates of divergence, & the sequence-based tree mimics a species tree.
‹ The UPGMA tree is rooted because of its assumption of a molecular clock.
‹ If the assumption is violated & there are unequal substitution rates along different branches of the tree, the method can produce an incorrect tree.

Other methods (including neighbour-joining) do not automatically produce a root, but a root can be placed by choosing an outgroup or by applying midpoint rooting.

UPGMA Method -
‹ Commonly used distance method in a variety of applications, including microarray data analysis.
‹ In phylogenetic analyses using molecular sequence data, its simplifying assumptions tend to make it significantly less accurate than other distance-based methods such as neighbor-joining.

DISTANCE-BASED METHODS: NEIGHBOUR JOINING [SAITOU AND NEI, 1987]

‹ The neighbor-joining method is used for building trees by distance methods.
‹ Produces both topology & branch lengths.
‹ Neighbours are a pair of OTUs connected through a single interior node X in an unrooted, bifurcating tree.

‹ The method is related to the cluster method.
‹ Does not require that all lineages have diverged by equal amounts.
‹ Especially suited for datasets comprising lineages with largely varying rates of evolution.
‹ Can be used in combination with methods that allow correction for superimposed substitutions.

‹ Neighbor-joining method - A special case of the Star Decomposition Method.
‹ Keeps track of nodes on the tree rather than taxa or clusters of taxa.
‹ Raw data are provided as a distance matrix & the initial tree is a STAR TREE.
‹ A modified distance matrix is constructed in which the separation between each pair of nodes is adjusted on the basis of their average divergence from all other nodes.
‹ The tree is constructed by linking the least-distant pair of nodes in this modified matrix.
‹ When two nodes are linked, their common ancestral node is added & the terminal nodes with their respective branches are removed.
‹ The process converts the newly added common ancestor into a terminal node on a tree of reduced size.
‹ At each stage two terminal nodes are replaced by one new node.
‹ The process is complete when two nodes remain, separated by a single branch.

‹ The process of starting with a star-like tree and finding and joining neighbours is continued until the topology of the tree is completed.
‹ The neighbour-joining algorithm minimizes the sum of branch lengths at each stage of clustering OTUs, although the final tree is not necessarily the one with the shortest overall branch lengths.
‹ Results may differ from minimum evolution strategies or maximum parsimony.

‹ Because it does not assume a constant rate of evolution, neighbour joining produces an unrooted tree topology unless an outgroup is specified or midpoint rooting is applied.

NJ method distance-based algorithm:
I. OTUs are first clustered in a starlike tree. “Neighbours” are defined as OTUs that are connected by a single interior node in an unrooted, bifurcating tree.
II. The two closest OTUs are identified. These neighbours are connected to the other OTUs via the internal branch XY. The OTUs [neighbours] selected are the ones that yield the smallest sum of branch lengths. The process is repeated until the entire tree is generated.
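A single neighbour-joining step can be sketched as follows. The slides do not give a formula for the "modified distance matrix", so the sketch assumes the commonly used criterion Q(i,j) = (n - 2)·d(i,j) - Σd(i,k) - Σd(j,k) and the usual branch-length formula for the two joined OTUs. The function name and the five-taxon matrix are hypothetical.

```python
def nj_pick_pair(names, d):
    """One neighbour-joining step on a symmetric nested dict d[a][b].

    Returns (a, b, q, len_a, len_b): the pair with the smallest Q value,
    that Q value, and the branch lengths from a and b to their new node.
    """
    n = len(names)
    # Total divergence of each node from all other nodes.
    totals = {a: sum(d[a][b] for b in names if b != a) for a in names}
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Assumed Q criterion: raw distance adjusted by total divergence.
            q = (n - 2) * d[a][b] - totals[a] - totals[b]
            if best is None or q < best[0]:
                best = (q, a, b)
    q, a, b = best
    # Branch lengths from the joined pair to the new interior node.
    len_a = d[a][b] / 2 + (totals[a] - totals[b]) / (2 * (n - 2))
    len_b = d[a][b] - len_a
    return a, b, q, len_a, len_b


# Hypothetical distance matrix for five OTUs.
names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}
result = nj_pick_pair(names, d)
```

In this toy matrix the smallest raw distance is between d and e (3), yet the adjusted criterion selects a and b (Q = -50 versus -48 for d and e). The two branches also receive unequal lengths (2 and 3), which shows that no constant rate of evolution is assumed. A full implementation would replace a and b by the new node, recompute the matrix, and repeat.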

ADVANTAGES & DISADVANTAGES

ADVANTAGES
o Fast, and thus suited for large datasets and for bootstrap analysis
o Permits lineages with largely different branch lengths
o Permits correction for multiple substitutions

DISADVANTAGES
o Sequence information is reduced
o Gives only one possible tree
o Strongly dependent on the model of evolution used

MINIMUM EVOLUTION

MAIN IDEA -
‹ Based on the assumption that the tree with the smallest sum of branch length estimates is most likely to be the true one.
‹ Branch lengths are computed from the pairwise distances between the sequences.
‹ Somewhat similar to the parsimony method.
‹ Trees obtained by ME and parsimony methods are often nearly identical in topology and branch length.
‹ Available in the PHYLIP & ClustalW packages.

ADVANTAGES & DISADVANTAGES

ADVANTAGES
o Easy to perform & quick to calculate
o Suited to sequences having high similarity scores

DISADVANTAGES
o Loss of information, since the sequences themselves are not considered
o All sites are treated equally [differences in substitution rates are not considered]
o Not applicable to distantly related, divergent sequences

PHYLOGENETIC INFERENCE: MAXIMUM PARSIMONY

Parsimony: from Latin parcere, meaning “to spare”. Refers to simplicity of assumptions in a logical formulation.

MAIN IDEA - The best tree is the one with the shortest total length, i.e. the one requiring the fewest changes.

‹ Hennig (1966) and Eck & Dayhoff (1966) used parsimony-based approaches in generating phylogenetic trees based on morphological characters.

Dayhoff et al. (1972) used maximum parsimony analysis to infer the relationships and history of 13 globins.
[Figure: Arrow 1 indicates a node corresponding to the LCA (last common ancestor) of the group of vertebrate globins; Arrow 2 indicates the ancestor of insect and vertebrate globins.]

‹ According to maximum parsimony, having fewer changes to account for the way a group of sequences evolved is preferable to more complicated explanations of molecular evolution.
‹ Thus one seeks the most parsimonious explanation for the observed data.
‹ The assumption of phylogenetic systematics is that genes exist in a nested hierarchy of relatedness, and this is reflected in a hierarchical distribution of shared characters in the sequences.
‹ The most parsimonious tree is supposed to best describe the relationships of proteins (or genes) that are derived from common ancestors.

STEPS IN MAXIMUM PARSIMONY ANALYSIS

I. Identify informative sites.
‹ If a site is constant, then it is not informative.
‹ Non-informative sites include constant sites & positions in which there are not at least two states (e.g. two different amino acid residues) with at least two taxa having each state.

II. Construct trees.
‹ Every tree is assigned a cost, & the tree with the lowest cost is sought.
‹ When a reasonable number of taxa are evaluated, such as about a dozen or fewer, all possible trees are evaluated and the one with the shortest total branch length is chosen.
‹ When necessary, a heuristic search is performed to reduce the complexity of the search by ignoring large families of trees that are unlikely to contain the shortest tree.

III. Count the number of changes and select the shortest tree (or trees).
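The three steps above can be illustrated with a short Python sketch. The slides do not say how changes are counted on a given tree. The sketch assumes Fitch's counting procedure for unordered characters on a bifurcating tree, and it evaluates all three groupings of a hypothetical four-taxon alignment exhaustively. Taxon names, sequences and function names are made up for the example.

```python
def is_informative(column):
    """Step I: a site is informative if at least two states each occur in
    at least two taxa."""
    counts = {}
    for state in column:
        counts[state] = counts.get(state, 0) + 1
    return sum(1 for c in counts.values() if c >= 2) >= 2


def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one site.

    tree is a taxon name or a 2-tuple of subtrees (Fitch counting, assumed).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared state: one change is needed on this node's descendants.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Step III: total number of changes over all sites, each site treated
    independently."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical alignment: four taxa, five sites.
alignment = {"A": "AAAAA", "B": "ACAAA", "C": "GAGAA", "D": "GCGTA"}
n_sites = 5
informative = [
    i for i in range(n_sites)
    if is_informative([seq[i] for seq in alignment.values()])
]

# Step II: for four taxa there are only three groupings to evaluate.
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
lengths = {name: tree_length(t, alignment) for name, t in trees.items()}
best_tree = min(lengths, key=lengths.get)
```

For this toy data only the first three sites are informative. The groupings require 5, 6 and 7 changes respectively, so ((A,B),(C,D)) is selected as the most parsimonious tree.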


Example: four amino acid residues from five different species
‹ Maximum parsimony identifies the simplest (most parsimonious) evolutionary path by which those sequences might have evolved from ancestral sequences.
‹ Two possible trees describe these sequences, each with hypothetical sequences assigned to ancestral nodes.
‹ One requires seven changes from its common ancestor, while the other requires nine changes.
‹ Thus, maximum parsimony would select the tree with seven changes, since it requires fewer changes to explain how the observed sequences evolved from a hypothetical common ancestor.
‹ *Each site is treated independently.

LONG-BRANCH ATTRACTION -
‹ An artifact that sometimes occurs in phylogenetic inference.
‹ Parsimony approaches are particularly susceptible.

‹ In phylogenetic reconstruction from protein/DNA sequences, branch length indicates the number of substitutions that occur between two taxa.
‹ Parsimony algorithms assume that all taxa evolve at the same rate and that all characters contribute the same amount of information.
‹ Long-branch attraction is a phenomenon in which rapidly evolving taxa are placed together on a tree, not because they are closely related, but artifactually because they both have many substitutions.

EXAMPLE -
‹ Consider the true tree in which Taxon 2 represents a DNA/protein sequence that changes rapidly relative to Taxa 1 & 3.
‹ The OUTGROUP is (by definition) more distantly related to Taxa 1, 2 & 3 than they are to each other.
‹ A maximum parsimony algorithm may generate an inferred tree in which Taxon 2 is “attracted” toward another long branch (the outgroup), since these two taxa have a large number of substitutions.

‹ Any time two long branches are present, they may be “attracted”.

Long-branch attraction:
‹ The true tree includes a taxon [2] that evolves more quickly than the other taxa.
‹ It shares a common ancestor with taxon 3.
‹ In the inferred tree, taxon 2 is placed separately from the other taxa because it is attracted by the long branch of the outgroup.

MODEL-BASED PHYLOGENETIC INFERENCE: MAXIMUM LIKELIHOOD
‹ Maximum likelihood - A general statistical method for estimating unknown parameters of a probability model.
‹ Parameter - Some descriptor of the model.

‹ A familiar model might be the normal distribution of a population, with two parameters: the mean and the variance.
‹ In phylogenetics, there are many parameters:
I. Rates
II. Differential transformation costs
III. The tree itself [most important]

Likelihood
‹ A quantity proportional to the probability of observing the data given the model.
‹ For a model (i.e. the tree and parameters), one can calculate the probability that the observations would actually have been observed, as a function of the model.

‹ One then examines this likelihood function to see where it is greatest; the value of the parameter of interest (usually the tree and/or branch lengths) at that point is the maximum likelihood estimate of the parameter.
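To make the idea of "examining the likelihood function to see where it is greatest" concrete, the sketch below estimates a single parameter: the branch length d separating two aligned DNA sequences. It assumes the Jukes-Cantor (JC69) substitution model, which the slides do not name, and hypothetical counts of 10 differing sites out of 100. The helper names are illustrative.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of branch length d (expected substitutions per site)
    for two sequences under the assumed JC69 model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # The factor 0.25 is the equilibrium frequency of the first base.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def maximise(f, lo, hi, iterations=200):
    """Ternary search for the maximum of a single-peaked function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# Hypothetical data: 100 aligned sites, 10 of which differ.
n_sites, n_diff = 100, 10
d_hat = maximise(lambda d: jc69_log_likelihood(d, n_sites, n_diff), 1e-6, 5.0)

# Closed-form JC69 estimate, for comparison with the numerical maximum.
d_closed = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)
```

The numerical maximum agrees with the closed-form value of about 0.107 substitutions per site, slightly more than the observed proportion of differences (0.10) because the model allows for multiple hits at the same site. In a full analysis the same principle is applied jointly to the tree topology, all branch lengths, and the model parameters.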

MAXIMUM LIKELIHOOD -
‹ An approach designed to determine the tree topology & branch lengths that have the greatest likelihood of producing the observed data set.

‹ A likelihood is calculated for each residue in an alignment, using some MODEL of the nucleotide or amino acid substitution process.
‹ Among the most computationally intensive, but also most flexible, methods available.
‹ Maximum parsimony methods may sometimes FAIL when there are large amounts of evolutionary change in different branches of a tree.
‹ Maximum likelihood, in contrast, provides a STATISTICAL MODEL for evolutionary change that varies across branches.

‹ EXAMPLE - Maximum likelihood can be used to estimate positive & negative selection across individual branches of a tree.
‹ The relative merits of maximum parsimony and maximum likelihood continue to be explored.
‹ When sequences evolve in a heterogeneous fashion over time, maximum parsimony can outperform maximum likelihood.
‹ Maximum likelihood [like parsimony methods] also evaluates alternative trees (hypotheses of relationship), but considers the probability, based on some selected model of evolution, that each tree explains the data.
‹ The tree that has the highest probability of explaining the data is preferred over trees having a lower probability.

Example - Maximum likelihood as used in practice for molecular sequence data
‹ There are three possible trees.
‹ Maximum likelihood evaluates each tree and calculates, for each character, the probability of every possible assignment of nucleotides to the nodes of the tree.
‹ For each character these individual probabilities are added together, and the per-character values are then combined (multiplied, or added as log-likelihoods) to give the total probability for all characters.
‹ This total probability is compared with that for the other trees.
‹ The tree having the greatest overall probability is preferred over the others.

ADVANTAGES & DISADVANTAGES

ADVANTAGES
‹ The method is statistically well understood.
‹ Advantage over parsimony: estimation of the pattern of evolutionary history can take into account the probabilities of different nucleotide substitutions (e.g., purine to purine versus purine to pyrimidine) as well as varying rates of nucleotide substitution.
‹ Maximum likelihood methods may also eliminate the problem of long-branch attraction.
‹ Better accounting for branch lengths.
‹ Incorporates “multiple hits”, thereby providing more realistic branch lengths.
‹ Also, information is derived from sites that would be uninformative under parsimony.

DISADVANTAGES
‹ Computationally intensive and slow.
‹ Susceptible to asymmetrical presence of data in partitions: likelihood-based phylogenetic analyses can give misleading results in the presence of missing data.
‹ The result depends on the model used; the information derived from sites that are uninformative under parsimony is due only to the model used.
‹ Questionably applicable to complex data like morphology, given the difficulty of modeling the numerous processes.
‹ Philosophically less well established, especially in terms of the applicability of probabilities and statistical measures to unique historical events (vs. parsimony as a general principle).
‹ Fundamental distinction between reconstruction & estimation:
‹ “Although the true phylogeny may be “unknowable” it can nonetheless be estimated …” Phylogenetic Inference

UPGMA & Neighbor Joining are clustering algorithms.
o They can make quick trees but are not the most reliable, especially when dealing with deeper divergence times.
o These methods are good for giving an idea about the available data, but are usually not acceptable for publication.

Maximum parsimony & Minimum evolution are methods that try to minimize branch lengths by either minimizing distance (minimum evolution) or minimizing the number of mutations (maximum parsimony).
o The major problem with these methods is that they fail to take into account many factors of sequence evolution (e.g. reversals, convergence, and homoplasy).
o Thus, the deeper the divergence times, the more likely it is that these methods will lead to erroneous or poorly supported groupings.

TREE INFERENCE: BAYESIAN METHODS Bayesian Inference - A statistical approach to modelling uncertainty in complex models.

‹ Bayesian inference seeks the probability of a tree conditional on the data (that is, based on the observations, such as a given multiple sequence alignment).
‹ Bayesian estimation of phylogeny is focused on a quantity called the posterior probability distribution of trees.
‹ Read as “the probability of observing a tree given the data”.
‹ For a given tree, the posterior probability is the probability that the tree is correct, and our goal is to identify the tree with the maximum probability.
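The posterior probability of a tree follows from Bayes' theorem: it is proportional to the prior probability of the tree multiplied by the likelihood of the data given that tree, normalised over all candidate trees. The sketch below shows this for three candidate trees. The tree labels, the uniform prior and the likelihood values are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def posterior(priors, likelihoods):
    """Posterior probability of each tree: prior * likelihood, normalised."""
    joint = {tree: priors[tree] * likelihoods[tree] for tree in priors}
    evidence = sum(joint.values())   # summation over all candidate trees
    return {tree: value / evidence for tree, value in joint.items()}


# Hypothetical inputs: a uniform prior and made-up likelihoods P(data | tree).
priors = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}
likelihoods = {"T1": 2e-5, "T2": 1e-5, "T3": 1e-5}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)
```

With these numbers tree T1 receives a posterior probability of 0.5 and the other two trees 0.25 each. For realistic data sets the summation cannot be enumerated in this way, because the number of trees is far too large and branch lengths and model parameters must also be integrated over, so the posterior has to be approximated.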

One can apply a Bayesian inference approach using the MrBayes software program. There are four steps.

FIRST - Provide a multiple sequence alignment of interest.

SECOND - Specify the evolutionary model.
‹ This includes options for data that are DNA (whether coding or not) or ribosomal DNA (for the analysis of paired stem regions).
‹ Before performing the analysis, one specifies a prior probability distribution for the parameters of the likelihood model.
‹ There are six types of parameters that are set as the priors for the model in the case of the analysis of nucleotide sequences:
(1) Topology of trees (e.g. some nodes can be constrained to always be present)
(2) Branch lengths
(3) Stationary frequencies of the four nucleotides
(4) Six nucleotide substitution rates (for A→C, A→G, A→T, C→G, C→T, and G→T)
(5) Proportion of invariant sites
(6) Shape parameter of the distribution of rate variation

THIRD - Run the analysis.
‹ Done with the MCMC [Markov chain Monte Carlo] command.
‹ The posterior probability of a phylogenetic tree is ideally calculated with a summation over all possible trees.
‹ For each tree, all combinations of branch lengths & parameters are evaluated.
‹ In practice this probability cannot be determined analytically, but it can be approximated using MCMC.
‹ This is done by drawing many samples from the posterior distribution.
‹ MrBayes runs two simultaneous, independent analyses beginning with distinct, randomly initiated trees.
‹ This helps to ensure that the analysis includes a good sampling from the posterior probability distribution.

FOURTH - Summarize the samples.
‹ MrBayes provides a variety of summary statistics, including a phylogram, branch lengths (in units of the expected number of substitutions per site), and credibility values.
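The idea behind the third and fourth steps, drawing many samples from the posterior and then summarising them, can be shown with a toy Metropolis sampler. This is not how MrBayes is implemented: real analyses propose changes to topology, branch lengths and model parameters. Here the state space is just three hypothetical trees with made-up unnormalised posterior values, and the function name is illustrative.

```python
import random


def metropolis_tree_sampler(unnormalised_posterior, n_steps, seed=0):
    """Toy Metropolis sampler over a small, fixed set of candidate trees.

    Returns the fraction of steps spent on each tree, which approximates
    its posterior probability without computing the normalising sum.
    """
    rng = random.Random(seed)
    trees = list(unnormalised_posterior)
    current = rng.choice(trees)          # randomly initiated starting tree
    counts = {tree: 0 for tree in trees}
    for _ in range(n_steps):
        # Symmetric proposal: pick one of the other trees uniformly.
        proposal = rng.choice([t for t in trees if t != current])
        ratio = unnormalised_posterior[proposal] / unnormalised_posterior[current]
        # Accept with probability min(1, ratio).
        if rng.random() < ratio:
            current = proposal
        counts[current] += 1
    return {tree: c / n_steps for tree, c in counts.items()}


# Hypothetical prior * likelihood values for three trees (not normalised).
weights = {"T1": 2e-5, "T2": 1e-5, "T3": 1e-5}

# Two independent runs from different random starts.
run_a = metropolis_tree_sampler(weights, n_steps=50_000, seed=1)
run_b = metropolis_tree_sampler(weights, n_steps=50_000, seed=2)
```

Both runs should spend roughly half of their steps on T1 and a quarter on each of the other trees, matching the exact posterior for these weights. Agreement between independent runs is the kind of evidence used to judge whether the posterior has been sampled well.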

‹ Bayesian inference of phylogeny resembles maximum likelihood because each method seeks to identify a quantity called the likelihood, which is proportional to the probability of observing the data conditional on a tree.
‹ The methods differ in that Bayesian inference includes specification of prior information and uses MCMC to estimate the posterior probability distribution.
‹ Although they were introduced relatively recently, Bayesian approaches to phylogeny are becoming increasingly commonplace.

Source - Pevsner J. Bioinformatics & Functional Genomics