Phylogenetics Topic 3: Methods of Inferring Phylogenies

Because no one was present to observe the evolution of a group of organisms directly, biologists must infer phylogenies from the characters of living and fossil taxa. These days, the vast majority of phylogenies are reconstructed from variation among nucleotide or amino acid sequences. However, a wide variety of other types of molecular data can be used to reconstruct phylogenies; examples include restriction fragment length polymorphisms (RFLPs), insertion-deletion events (indels), chromosomal rearrangements, and DNA-DNA hybridization, to name a few. Numerous methods of reconstructing trees have been implemented. This lecture gives a very brief, non-technical introduction to the most common methods.

A generalized protocol for molecular phylogenetics, and the concerns associated with each step

  Step                                                Concerns
  Collect homologous sequences                        gene tree-species tree / paralogy–orthology / trees within trees
  Multiple sequence alignment                         positional homology / gaps / subjectivity-objectivity / methods
  Phylogeny estimation                                philosophy / methods / consistency / power and accuracy
  Test reliability or fit of phylogenetic estimates   branch support / tree comparison / statistical issues with trees
  Interpretation and application                      independent contrasts / impact of error on conclusions

Classification of tree-reconstruction methods

PARSIMONY METHODS: These methods use variation in CHARACTER STATES to reconstruct phylogenies. Character states are most often the nucleotide (see below) or amino acid “states” observed at a site in a sequence of such characters. Such sequences often correspond to genes, but other sorts of character sequences can be just as useful; examples include the nucleotides of introns or inter-genic regions, restriction site polymorphisms, and morphological characters.
Alignment of the nucleotide character states of the β-globin gene from five species of mammals

  human    GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC AAG GCC GCC TGG GGC AAG GTT GGC GCG CAC
  cow      ... ... ... G.C ... ... ... T.. ..T ... ... ... ... ... ... ... ... ... .GC A..
  rabbit   ... ... ... ..C ..T ... ... ... ... A.. ... A.T ... ... .AA ... A.C ... AGC ...
  rat      ... ..C ... G.A .AT ... ..A ... ... A.. ... AA. TG. ... ..G ... A.. ..T .GC ..T
  opossum  ... ..C ..G GA. ..T ... ... ..T C.. ..G ..A ... AT. ... ..T ... ..G ..A .GC ...

  human    GCT GGC GAG TAT GGT GCG GAG GCC CTG GAG AGG ATG TTC CTG TCC TTC CCC ACC ACC AAG
  cow      ... ..A .CT ... ..C ..A ... ..T ... ... ... ... ... ... AG. ... ... ... ... ...
  rabbit   .G. ... ... ... ..C ..C ... ... G.. ... ... ... ... T.. GG. ... ... ... ... ...
  rat      .G. ..T ..A ... ..C .A. ... ... ..A C.. ... ... ... GCT G.. ... ... ... ... ...
  opossum  ..C ..T .CC ..C .CA ..T ..A ..T ..T .CC ..A .CC ... ..C ... ... ... ..T ... ..A

  human    ACC TAC TTC CCG CAC TTC GAC CTG AGC CAC GGC TCT GCC CAG GTT AAG GGC CAC GGC AAG
  cow      ... ... ... ..C ... ... ... ... ... ... ... ..G ... ... ..C ... ... ... ... G..
  rabbit   ... ... ... ..C ... ... ... T.C .C. ... ... ... .AG ... A.C ..A .C. ... ... ...
  rat      ... ... ... T.T ... A.T ..T G.A ... .C. ... ... ... ... ..C ... .CT ... ... ...
  opossum  ..T ... ... ..C ... ... ... ... TC. .C. ... ..C ... ... A.C C.. ..T ..T ..T ...

The sequences in each part of the alignment appear in the same order: human, cow, rabbit, rat, opossum. To fit on the page, the alignment is broken into three parts; such alignments are called INTERLEAVED. The complete DNA sequence is shown for the first taxon (human). All the other sequences are shown relative to human, with a dot, “.”, signifying a match with the character state in the human sequence. Differences are indicated by the single-letter nucleotide code (A, C, T or G). Note that this alignment could also be analyzed by using distance, likelihood, and Bayesian methods.
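The dot convention described above is easy to undo mechanically. The following is a minimal sketch, assuming each row has exactly the same length and spacing as the reference row; the function name `expand_dots` is hypothetical, and only the first five codons of the human, cow, and rat rows are used.

```python
def expand_dots(reference, row):
    """Replace each '.' in `row` with the reference character at that position."""
    if len(reference) != len(row):
        raise ValueError("row and reference must have the same length")
    return "".join(ref if ch == "." else ch for ref, ch in zip(reference, row))


# First five codons of the first alignment block; human is the reference row.
human = "GTG CTG TCT CCT GCC"
cow = "... ... ... G.C ..."
rat = "... ..C ... G.A .AT"

cow_full = expand_dots(human, cow)
rat_full = expand_dots(human, rat)
print(cow_full)  # GTG CTG TCT GCC GCC
print(rat_full)  # GTG CTC TCT GCA GAT
```

Spaces between codons are copied through unchanged, so the expanded rows stay aligned with the reference.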
The parsimony principle is derived from the philosophical principle called Occam’s Razor: plurality should not be posited without necessity (Pluralitas non est ponenda sine necessitate; William of Occam, medieval English philosopher [ca. 1285-1349]). Thus the “simplest” hypothesis is the one that is chosen under the MAXIMUM PARSIMONY criterion.

Let’s take a nucleotide dataset as an example. In this case an individual tree is a hypothesis, and the “best tree” for the dataset is the one that requires the fewest nucleotide substitutions to explain those data. One first computes the minimum number of evolutionary changes required to fit a given dataset to a tree. This number, often called the “number of STEPS”, is recorded for all candidate trees. The tree that requires the minimum number of steps is selected as the best estimate of the phylogenetic tree, and is called the MAXIMUM PARSIMONY TREE. When two or more trees share the same minimum number of steps, such trees are called EQUALLY PARSIMONIOUS TREES. The length of a tree in steps is called the “TREE LENGTH”.

The appeal of maximum parsimony is that the shortest tree is the one that requires the fewest homoplasies. Remember, homoplasies are events such as parallelisms, convergences, and reversals; as such they represent non-phylogenetic similarities. “Longer trees” require more assumptions of homoplasy and thus are more complex than the maximum parsimony tree. When the truth is not parsimonious, parsimony tree length underestimates the true evolutionary distances.
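The step-counting procedure described above can be sketched in a few lines. The example below uses Fitch's algorithm, a standard way of counting the minimum number of changes at a site that the lecture itself does not name, and applies it to a small hypothetical alignment of four taxa (`a` to `d`); the data and all function names are illustrative assumptions. Each unrooted four-taxon tree is written as a rooted pair of pairs, which does not affect the step count.

```python
def fitch_site(tree, states):
    """Return (state_set, steps) for one site on a rooted binary tree.

    tree   : a taxon name (leaf) or a 2-tuple of subtrees
    states : dict mapping taxon name -> character state at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_site(tree[0], states)
    right_set, right_steps = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state: no extra change is needed.
        return common, left_steps + right_steps
    # No shared state: one change is required on this node's branches.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Sum the Fitch step counts over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical toy alignment (not the lecture's data).
alignment = {"a": "AAGCT", "b": "AAGCC", "c": "GTGCT", "d": "GTGAC"}

# The three possible unrooted trees for four taxa.
trees = {
    "((a,b),(c,d))": (("a", "b"), ("c", "d")),
    "((a,c),(b,d))": (("a", "c"), ("b", "d")),
    "((a,d),(b,c))": (("a", "d"), ("b", "c")),
}

lengths = {name: tree_length(tree, alignment) for name, tree in trees.items()}
best = min(lengths.values())
mp_trees = [name for name, steps in lengths.items() if steps == best]
print(lengths)   # {'((a,b),(c,d))': 5, '((a,c),(b,d))': 6, '((a,d),(b,c))': 7}
print(mp_trees)  # ['((a,b),(c,d))']
```

In this toy dataset the last site supports a different grouping than the first two, so even the maximum parsimony tree must explain it as a homoplasy (two steps instead of one). If several trees tied for the minimum, `mp_trees` would list all of the equally parsimonious trees.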
Example of the maximum parsimony principle in phylogenetics:

  SITE        1 2 3 4 5 6 7 8 9 0 1 2     Lengths of three possible trees:
  SPECIES 1   A T G T T G T G A T A A
  SPECIES 2   A T G T T C T G G T A A     TREE 1: 5 steps
  SPECIES 3   A T G T T A T C A T A A     TREE 2: 6 steps
  SPECIES 4   A T G T T A T C G T A A     TREE 3: 6 steps

[Figure: character states at sites 6, 8 and 9 mapped onto the three possible unrooted trees. TREE 1 groups species 1 with 2 and species 3 with 4; TREE 2 groups 1 with 3 and 2 with 4; TREE 3 groups 1 with 4 and 2 with 3. Where an internal state is ambiguous, the alternative is given in brackets, e.g. A[G].]

A problem arises when the underlying mechanism of molecular evolution is sufficiently complex that the number of homoplasies exceeds the true phylogenetic signal in the data. When this happens, methods that choose simple solutions are sometimes “fooled” by the data: the simplest way to fit a tree to such data is to treat the homoplasies as the true signal and the true signal as homoplasy. When this happens we say that maximum parsimony is INCONSISTENT under such a mode of molecular evolution.

DISTANCE MATRIX METHODS: If one looks at a phylogeny with branch lengths scaled to some evolutionary distance, such as the mean number of changes per site in a gene, it is easy to see that there is a relationship between evolutionary distance and a measure of pairwise similarity between the lineages. For example, a pair of sister taxa on a tree will have a shorter distance between each other than either will have with any other lineage on such a tree. Distance methods seek to use this form of information to reconstruct phylogenies.

All distance methods start by converting the original data, say a set of gene sequences, into a matrix of pairwise distance values between all pairs of lineages in the sample. Next a tree is inferred either (i) by some type of sequential joining method, or (ii) by evaluating a set of candidate trees and applying a type of OPTIMALITY CRITERION to select the best tree.
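The first step of any distance method, turning aligned sequences into a matrix of pairwise distances, can be sketched as follows. This is a minimal illustration that assumes the simplest possible measure, the uncorrected proportion of differing sites (often called the p-distance), rather than a model-based distance. The sequences are the first five codons of the human, cow, and rat rows of the β-globin alignment with the dots written out; the function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n_diff = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return n_diff / len(seq_a)


def distance_matrix(alignment):
    """Return a symmetric nested dict of pairwise p-distances."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# First five codons (15 sites) of three taxa from the beta-globin alignment.
alignment = {
    "human": "GTGCTGTCTCCTGCC",
    "cow": "GTGCTGTCTGCCGCC",
    "rat": "GTGCTCTCTGCAGAT",
}

matrix = distance_matrix(alignment)
for name, row in matrix.items():
    print(name, {other: round(d, 3) for other, d in row.items()})
```

Each row of aligned characters has been reduced to a handful of numbers; a tree-building step (sequential joining or an optimality criterion) would then operate on `matrix` alone.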
Note that the maximum parsimony method described above is one example of an optimality criterion that may be used on discrete character data. An optimality criterion for distance data with a justification similar to that of parsimony is MINIMUM EVOLUTION. Under the minimum evolution criterion, the tree with the smallest sum of branch lengths is chosen as the best estimate. As with character-based datasets, there are a variety of optimality criteria that one can use with distance data.

Example of a distance-based approach to molecular phylogenetics:

  1. Obtain a set of homologous gene sequences and produce an alignment.
  2. Transform the primary data into a matrix of pairwise genetic distance values.
  3. Select a method of inferring a phylogenetic tree from distance data; in this case it is the least squares method.
  4. Determine the S statistic for the set of candidate trees, and select a tree that minimizes S. Note that S is a function of both the tree topology and its branch lengths.

[Figure: a candidate tree relating human, chimp, gorilla, and orang.]

Distance methods have a number of attractive qualities for phylogenetics. First and foremost, the distance calculations between all pairs of sequences are based on an explicit model of molecular evolution. If the most important features of the process of evolution are contained in the model, then inconsistency problems such as long-branch attraction are reduced or eliminated. For those who are interested, I have placed on the course website a short summary of the more popular models of nucleotide and amino acid evolution. We will return to the problem of using model-based methods to obtain “corrected” estimates of evolutionary distance later in this course.

Another very useful feature of distance methods is that they provide a statistical framework for evaluating models and hypotheses, something that is not available under parsimony methods.
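The least squares step in the example above can be illustrated with a small sketch. It assumes the usual unweighted form of the statistic, S = Σ (d_ij − p_ij)², where d_ij is the observed distance between taxa i and j and p_ij is the path length between them on the tree; the lecture does not spell out its definition of S, so this form is an assumption. The distances are hypothetical values invented for illustration, not real primate data, and the helper names are made up for this sketch. For each of the three four-taxon trees, branch lengths are fitted by ordinary least squares and S is reported.

```python
from itertools import combinations


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares_s(split, taxa, dist):
    """Return (S, branch_lengths) for the four-taxon tree that separates
    the two taxa in `split` from the other two.

    Branches are ordered as one terminal branch per taxon, then the
    single internal branch. Lengths are not constrained to be positive.
    """
    pairs = list(combinations(taxa, 2))
    rows = []
    for a, b in pairs:
        row = [1.0 if t in (a, b) else 0.0 for t in taxa]
        # The internal branch lies on the path only if a and b are on
        # opposite sides of the split.
        row.append(1.0 if (a in split) != (b in split) else 0.0)
        rows.append(row)
    d = [dist[frozenset(p)] for p in pairs]
    k = len(taxa) + 1
    # Normal equations (A^T A) x = A^T d.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atd = [sum(r[i] * y for r, y in zip(rows, d)) for i in range(k)]
    branches = solve_linear(ata, atd)
    fitted = [sum(x * v for x, v in zip(r, branches)) for r in rows]
    s = sum((obs - fit) ** 2 for obs, fit in zip(d, fitted))
    return s, branches


taxa = ["human", "chimp", "gorilla", "orang"]
# Hypothetical pairwise distances, for illustration only.
dist = {
    frozenset(("human", "chimp")): 0.2,
    frozenset(("human", "gorilla")): 0.4,
    frozenset(("human", "orang")): 0.6,
    frozenset(("chimp", "gorilla")): 0.4,
    frozenset(("chimp", "orang")): 0.6,
    frozenset(("gorilla", "orang")): 0.6,
}

results = {}
for partner in ("chimp", "gorilla", "orang"):
    split = {"human", partner}
    results[partner] = least_squares_s(split, taxa, dist)[0]
    print(f"human+{partner}: S = {results[partner]:.4f}")

best_partner = min(results, key=results.get)
print("tree minimizing S groups human with", best_partner)
```

Because these made-up distances fit the human+chimp tree exactly, its S is zero (up to rounding) while the other two trees leave a residual. A fuller implementation would also constrain branch lengths to be non-negative; this sketch does not.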
A noteworthy drawback of distance methods is that the information content of the dataset is reduced in the step of transforming the primary data into a matrix of pairwise distance values. The practical effect is that the power of distance methods could be lower than that of character-based methods in certain circumstances.

MAXIMUM LIKELIHOOD METHODS: Maximum likelihood is a standard statistical framework that can be applied to the problem of tree reconstruction when a stochastic model of evolution is assumed.
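To make the idea concrete, here is a minimal sketch of maximum likelihood in the simplest possible setting: estimating the single branch length (evolutionary distance) separating two sequences. It assumes the Jukes-Cantor (JC69) model, which the lecture does not name and which is chosen here only because its likelihood has a simple form. The counts are taken from the first block of the β-globin alignment, where human and cow differ at 7 of 60 sites. Full tree reconstruction repeats the same idea over many branches and topologies; the function names are hypothetical.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d (expected substitutions per site)
    for two aligned sequences under the JC69 model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site keeps its base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    n_same = n_sites - n_diff
    # Each site also carries the equilibrium base frequency 1/4.
    return (n_same * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def maximize(f, lo, hi, tol=1e-7):
    """Golden-section search for the maximum of a unimodal function."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


# Human vs cow, first block of the alignment: 7 differences in 60 sites.
n_sites, n_diff = 60, 7

d_ml = maximize(lambda d: jc69_log_likelihood(d, n_sites, n_diff), 1e-6, 2.0)

# Closed-form JC69 distance, for comparison with the numerical maximum.
p = n_diff / n_sites
d_closed = -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(f"numerical ML estimate: {d_ml:.5f}")
print(f"closed-form JC69:      {d_closed:.5f}")
```

The numerical maximum agrees with the closed-form JC69 distance, and both exceed the raw proportion of differences (7/60), because the model allows for multiple changes at the same site.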
