Quick viewing(Text Mode)

Phylogenetic Analysis 2 Introducing Some of the Most Commonly Used Methods for Phylogenetic Analysis

Phylogenetic Analysis 2 Introducing Some of the Most Commonly Used Methods for Phylogenetic Analysis

Subjects of this lecture

Introduction to 1 Introducing some of the terminology of . Phylogenetic Analysis 2 Introducing some of the most commonly used methods for phylogenetic analysis. Irit Orr 3 Explain how to construct phylogenetic trees.

Phylogenetics - WHY?

¾ Find evolutionary ties between . - is the of (Analyze changes occuring in different organisms classification of organisms. during ). Phylogeny - is the evolution of a ¾ Find (understand) relationships between an genetically related group of organisms. ancestral sequence and it descendants. Or: a study of relationships between (Evolution of of sequences) collection of "things" (, , organs..) that are derived from a common ¾ Estimate time of divergence between a group of organisms that share a common ancestor. ancestor. Basic Of

Eukaryote tree

Eubacteria tree

Similar sequences, common ancestor...

From a common ancestor sequence, two DNA sequences are diverged.

Each of these two sequences start to accumulate substitutions.

The number of these are used in ... common ancestor, similar analysis. How we calculate the Relationships of Phylogenetic Degree of Divergence Analysis and Sequences Analysis

When 2 sequences found in 2 organisms are If two sequences of length N differ from very similar, we assume that they have derived each other at n sites, then their degree of from one ancestor. divergence is: AAGAATC AAGAGTT n/N or n/N*100%.

AAGA(A/G)T(/ T)

The sequences alignment reveal which positions are conserved from the ancestor sequence.

Relationships of Phylogenetic Relationships of Phylogenetic Analysis and Sequences Analysis Analysis and Sequences Analysis

The progressive multiple alignment of a group of Most phylogenetic methods assume that each sequences, first aligns the most similar pair. position in a sequence can change Then it adds the more distant pairs. independently from the other positions. The alignment is influenced by the “most Gaps in alignments represent mutations in similar” pairs and arranged accordingly, but….it sequences such as: , , genetic does not always correctly represent the rearrangments. evolutionary history of the occured changes. Gaps are treated in various ways by the Not all phylogenetic methods work this way. phylogenetic methods. Most of them ignore gaps. Relationships of Phylogenetic Analysis and Sequences Analysis What is a ?

An illustration of the evolutionary Another approach to treat gaps is by using relationships among a group of organisms. sequences similarity scores as the base for the phylogenetic analysis, instead of using the alignment itself, and trying to is another name for a phylogenetic tree. decide what happened at each position. The similarity scores based on scoring A tree is composed of nodes and branches. matrices (with gaps scores) are used by One branch connects any two adjacent the DISTANCE methods. nodes. Nodes represent the taxonomic units. (sequences)

Rooted Phylogenetic Tree What is a phylogenetic tree?

™ E.G: 2 very similar sequences will be seqA Leaves = Outer branches Represent the taxa (sequences) neighbors on the outer branches and will be 1 connected by a common internal branch. seqB 2 * Nodes = 1 2 3 Types of Trees * 3 seqC Represent the relationships Trees Networks Among the taxa (sequences) * e.g Node 1 represent the ancestor seqD seq from which seqA and seqB derived.

Only one path between More than one path Branches * any pair of nodes between any pair of nodes The length of the branch represent the # of changes that occurred in the seqs prior to the next level of separation. External branches - are branches that end with a tip. (FA,FB,GC,GD,IE) In a phylogenetic tree... A more recent diversions 9 Each NODE represents a event in F evolution. Beyond this point any sequence B changes that occurred are specific for each H branch (specie). I C G 9 The BRANCH connects 2 NODES of the tree. D The length of each BRANCH between one NODE to the next, represents the # of changes Internal branches - are that occurred until the next separation branches that do not end E (speciation). with a tip. (IH,HF,HG) more ancient diversions

In a phylogenetic tree... Tree structure

9 NOTE: The amount of evolutionary time that ♠Terminal nodes - represent the data (e.g passed from the separation of the 2 sequences sequences) under comparison is not known. The phylogenetic analysis can only estimate the # of changes that occurred from the (A,B,C,D,E), also known as OTUs, time of separation. (Operational Taxonomic Units). 9 After the branching event, one (sequence) can undergo more mutations then the other ♠Internal nodes - represent inferred taxon. ancestral units (usually without empirical data), also known as HTUs, (Hypothetical 9 Topology of a tree is the branching pattern of a Taxonomic Units). tree. Slide taken from Dr. Itai Yanai The Hypothesis Different kinds of trees can be used to depict different aspects of evolutionary history All the mutations occur in the same rate in all 1. : the tree branches. simply shows relative recency of common ancestry The rate of the mutations is the same for all positions along the sequence.

7 2. Additive trees: 2 a cladogram with branch lengths, 3 4 3 also called phylograms and metric trees The Molecular Clock Hypothesis is most suitable for 1 5 3 1 closely related .

1 3 1 3. Ultrametric trees: (dendograms) special kind of additive tree in which the 1 3 tips of the trees are all equidistant from the root 2 1 1 1 1

Unrooted Tree = Phenogram Rooted Tree = Cladogram

A phylogenetic tree where all A phylogenetic tree that the "objects" on it are related all the "objects" on it descendants - but there is B share a known common not enough information to ancestor (the root). specify the common There exists a particular ancestor (root).

root node. A The paths from the root The path between nodes of to the nodes Root B the tree do not specify an correspond to evolutionary time. evolutionary time. C Slide taken from Dr. Itai Yanai Rooted versus Unrooted

Types of Trees Rooted vs. Unrooted ™The number of tree topologies of rooted tree is much higher than that of the unrooted tree for the same number of Branches Nodes OTUs. Rooted Interior M – 2 M – 1 Total 2M – 2 2M – 1 Unrooted Interior M – 3 M – 2 Total 2M – 3 2M – 2 ™Therefore, the of the unrooted tree topology is smaller than that of the rooted M is the number of OTU’s tree.

Slide taken from Dr. Itai Yanai

The number of rooted and unrooted trees: Orthologs - genes related by speciation Possible Number of Number events. Meaning same genes in different of OTU’s Rooted trees Unrooted trees 2 1 1 species. 3 3 1 4 15 3 5 105 15 6 945 105 Paralogs - genes related by duplication 7 10395 945 events. Meaning duplicated genes in the 8 135135 10395 9 2027025 135135 same species. 10 34459425 2027025

OTU – Operational Taxonomical Unit Selecting sequences for phylogenetic Selecting sequences for phylogenetic analysis analysis ™Non-coding DNA regions have more What type of sequence to use, or substitution than coding regions. DNA? ™Proteins are much more conserved since they "need" to conserve their function. The rate of is assumed to be the So it is better to use sequences that same in both coding and non-coding mutate slowly (proteins) than DNA. regions. However, if the genes are very small, or However, there is a difference in the they mutate slowly, we can use them for substitution rate. building the trees.

Known Problems of Multiple Alignments Alignment of a should be compared with the alignment of their protein sequences, to be sure about the placement of gaps.

Important sites could be misaligned by the T Y R R S R ACA TAC AGG CGA software used for the . T Y R R T Y R R S R ACA TAC AGG CGA That will effect the significance of the site - T Y R R and the tree. T Y R - S R ACA TAC AGG --- T Y R - For example: ATG as start codon, or specific T Y R - S R ACA TAC --- CGA T Y - R amino acids in functional domains. T Y R R S R ACA TAC AGG CGA T Y R R Gaps - Are treated differently by different alignment programs and should play no part in building trees. Known Problems of Multiple Alignments Selecting sequences for phylogenetic analysis ♥ Low complexity regions - effect the multiple ‡ Sequences that are being compared belong alignment because they create random bias together (orthologs). for various regions of the alignment. ‡ If no ancestral sequence is available you may ♥ Low complexity regions should be removed use an "" as a reference to measure from the alignment before building the tree. distances. In such a case, for an outgroup you need to choose a close relative to the group being compared. If you delete these regions you need to ‡ For example: if the group is of mammalian consider the affect of the deletions on the sequences then the outgroup should be a sequence branch lengths of the whole tree. from and not .

How to choose a phylogenetic method? Taken from Dr.Itai Yanai

Given a multiple alignment, how do we construct the tree? Choose set of related seqs Obtain MultipleAlignment (DNA or Proteins

Is there a strong similarity? A - GCTTGTCCGTTACGAT B – ACTTGTCTGTTACGAT C – ACTTGTCCGAAACGAT ? Strong similarity D - ACTTGACCGTTTCCTT Distant (weak) similarity E – AGATGACCGTTTCGAT Maximum Parsimony Distance methods F - ACTACACCCTTATGAG

Check validity of the Very weak similarity results Maximum Likelihood Building Phylogenetic Trees Statistical Methods

Main methods: 9 Bootstrapping Analysis – ¾ Distances matrix methods Is a method for testing how good a dataset fits a Neighbour Joining, UPGMA evolutionary model. This method can check the branch arrangement ¾ Character based methods: (topology) of a phylogenetic tree. Parsimony methods In Bootstrapping, the program re-samples Maximum Likelihood method columns in a multiple aligned group of ¾ Validation method: sequences, and creates many new alignments, Bootstrapping (with replacement the original dataset). Jack Knife These new sets represent the population.

Statistical Methods Taken from Dr. Itai Yanai

The process is done at least 100 times. Given the following tree, estimate the confidence of the two internal branches Phylogenetic trees are generated from all the sets. Part of the results will show the # of times a particular branch point occurred out of all the trees that were built. Gorilla Orang-utan

The higher the # - the more valid the branching point. Taken from Dr. Itai Yanai Statistical Methods

Estimating Confidence from the Resamplings Bootstrap values between 90-100 are 1. Of the 100 trees: 41/100 28/100 31/100 considered statistically significant Gorilla Human Chimpanzee Human Chimpanzee Human

Gibbon

Gibbon Gibbon Gorilla Orang-utan Chimpanzee Gorilla Orang-utan Orang-utan 2. Upon the original tree we superimpose bootstrap values: Chimpanzee Human 41 In 41 of the 100 trees, Gibbon chimp and gorilla are split from the rest. 100 In 100 of the 100 trees, Gorilla gibbon and orang-utan Orang-utan are split from the rest.

Character Based Methods Character Based Methods

All Character Based Methods assume that Q: How do you find the minimum # of changes each character substitution is independent needed to explain the data in a given tree? of its neighbors. A: The answer will be to construct a set of possible ways to get from one set to the other, and choose ƒ Maximum Parsimony () the "best". (for example: Maximum Parsimony) - in this method one tree will be given CCGCCACGA (built) with the fewest changes required to P P R explain (tree) the differences observed in CGGCCACGA the data. R P R Character Based Methods - Maximum Parsimony

∞ Not all sites are informative in parsimony. Maximum Parsimony Start by classifying the sites: ∞ Informative site, is a site that has at least 2 123456789012345678901 characters, each appearing at least in 2 of CTTCGTTGGATCAGTTTGATA CCTCGTTGGATCATTTTGATA the sequences of the dataset. Dog CTGCTTTGGATCAGTTTGAAC Human CCGCCTTGGATCAGTTTGAAC ------Invariant * * ******** ***** Variant ** * * ** ------Taken from Informative ** ** Dr. Itai Yanai Non-inform. * *

123456789012345678901 Mouse CTTCGTTGGATCAGTTTGATA Rat CCTCGTTGGATCATTTTGATA 123456789012345678901 Dog CTGCTTTGGATCAGTTTGAACTaken from Mouse CTTCGTTGGATCAGTTTGATA Human CCGCCTTGGATCAGTTTGAACDr. Itai Yanai Rat CCTCGTTGGATCATTTTGATA Maximum Parsimony ** * Dog CTGCTTTGGATCAGTTTGAAC Taken from Human CCGCCTTGGATCAGTTTGAAC Informative ** ** Dr. Itai Yanai

Mouse Dog Mouse Mouse Rat G T T GT G G GG G G Site 5: G C G C T C Rat HumanRat Human Dog Human Mouse Dog Dog Mouse Mouse Rat

T T T MouseC C Dog C C Mouse MouseT TT C Rat C Rat RatHuman Human Dog Human Site 2:C C C C T C Rat HumanRat Human Dog Human 3 1 T G G G G T TG G T 0 Mouse Dog Mouse Mouse Rat T G T G G G Site 3: Rat HumanRat Human Dog Human Character Based Methods - Maximum Maximum Parsimony Methods are Parsimony Available…

∞ The Maximum Parsimony method is good for ¾For DNA in Programs: similar sequences, a sequences group with paup, molphy,phylo_win small amount of variations In the Phylip package: DNAPars, DNAPenny, etc.. Maximum Parsimony methods do not give the branch lengths only the branch . ¾For Protein in Programs: paup, molphy,phylo_win For larger set it is recommended to use In the Phylip package: the “branch and bound” method instead PROTPars Of Maximum Parsimony.

Character Based Methods - Character Based Methods - Maximum Likelihood Maximum Likelihood

Basic idea of Maximum Likelihood method is Maximum Likelihood method – using a tree building a tree based on mathemaical model. model for nucleotide substitutions, it will try to This method find a tree based on probability find the most likely tree (out of all the trees of calculations that best accounts for the large the given dataset). amount of variations of the data (sequences) set. The Maximum Likelihood methods are very slow and cpu consuming. Maximum Likelihood method (like the Maximum Parsimony method) performs its analysis on Maximum Likelihood methods can be found in each position of the multiple alignment. , paup or puzzle. This is why this method is very heavy on CPU. Maximum Likelihood method Character Based Methods

Are available in the Programs: paup or puzzle The Maximum Likelihood methods are In phylip package in programs: very slow and cpu consuming (computer DNAML and DNAMLK expensive).

Maximum Likelihood methods can be found in phylip, paup or puzzle.

Distances Matrix Methods Distances Matrix Methods

Distance methods assume a molecular , Distance - the number of substitutions per site per meaning that all mutations are neutral and time period. therefore they happen at a random clocklike Evolutionary distance are calculated based on one rate. of DNA evolutionary models. This assumption is not true for several reasons: Different environmental conditions affect mutation Neighbors – pairs of sequences that have the rates. smallest number of substitutions between them. This assumption ignores selection issues which are On a phylogenetic tree, neighbors are joined by a different with different time periods. node (common ancestor). Distances Matrix Methods Distance method steps

Distance methods vary in the way they construct 1 Multiple alignments - based on all against the trees. all pairwise comparisons. 2 Building of all the compared Distance methods try to place the correct sequences (all pair of OTUs). positions of all the neighbors, and find the 3 Disregard of the actual sequences. correct branches lengths. 4 Constructing a guide tree by clustering the distances. Iteratively build the relations Distance based clustering methods: (branches and internal nodes) between all Neighbor-Joining (unrooted tree) OTUs. UPGMA (rooted tree)

Distance method steps Distances Matrix Methods

Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmatic Mean (UPGMA) Distances matrix methods can be found in the First, construct a distance matrix: following Programs:

A B C D E Clustalw, Phylo_win, Paup A - GCTTGTCCGTTACGAT B 2 B – ACTTGTCTGTTACGAT In the GCG software package: C – ACTTGTCCGAAACGAT C 4 4 D - ACTTGACCGTTTCCTT D 6 6 6 Paupsearch, distances E – AGATGACCGTTTCGAT E 6 6 6 4 F - ACTACACCCTTATGAG F 8 8 8 8 8 In the Phylip package: DNADist, PROTDist, Fitch, Kitch, Neighbor

From http://www.icp.ucl.ac.be/~opperd/private/.html Mutations as data source for Correction of Distances between evolutionary analysis DNA sequences

Mutation - an error in DNA replication or In order to detect changes in DNA DNA repair. sequences we compare them to each other. Only mutations that occur in We assume that each observed change in cells play a rule in evolution. However, in similar sequences, represent a “single some organisms there is no distinction mutation event”. between germline or mutation. The greater the number of changes, the Only mutations that were fixed in the more possible types of mutations. population are called substitutions.

Original Sequence Mutations - Substitutions A C T G Q: What do we measure by sequence alignment? A A A: Substitutions in the aligned sequences. C G T The rate of substitution in regions that A A A C C > A Single Substitution evolve under no constraints are assumed T T G G to be equal to the . Multiple A > C > T A Substitution A coincidental Substitution A C > G C > A G Parallel Substitution G T > A T > A Mutations - Substitutions Mutations

- mutation in a single ♣Deletion - removal of one or more nucleotide. from the DNA. Segmental mutation - mutation in several ♣Insertion - addition of one or more adjacent nucleotides. nucleotides to the DNA. Substitution mutation - replacement of one nucleotide with another. ♣Inversion - rotation by 180 of a double- Recombination - exchange of a sequence stranded DNA segment comprising 2 or more with another. base-pairs.

1 AGGCAAACCTACTGGTCTTAT Original Sequence * Substitution Mutations 2 AGGCAAATCCTACTGGTCTTAT Transtion c-t

3 AGGCAAACCTACTGCTCTTAT g-c - a change between * (A,G) or between (T,C). 4 AGGCAAACCTACTGGTCTTAT recombination gtctt

ACCTA 5 AGGCAA CTGGTCTTAT deletion accta Transversion - a change between purines (A,G) to pyrimidines (T,C). 6 AGGCAAACCTACTAAAGCGGTCTTAT insertion aagcg

Substitution mutations usually arise from 7 AGGTTTGCCTACTGGTCTTAT inversion from 5' gcaaac3' to 5' gtttgc 3' mispairing of bases during replication. Let's assume that.. Where do the mutations occur?

A mutation of a codon to another Synonymous substitutions mostly occur at sense codon, occur in equal for the 3rd position of the codon. all the codons. Nonsynonymous substitutions mostly occur at the 2nd position of the codon. If so we can now compute the expected proportion of different types of substitution mutations from the . Any substitution in the 2nd position is nonsynonymous

Correction of Distances between Jukes & Cantor one-parameter model DNA sequences This model assumes that substitutions between the 4 bases occur with equal frequency. There are several evolutionary models Meaning no bias in the direction of the change. used to correct for the likelihood of A multiple mutations and reversions in DNA α G sequences. α Is the rate of substitutions These evolutionary models use a In each of the 3 directions normalized distance measurement that is For one base. the average degree of change per length α ( Is the one parameter). of aligned sequences. C T Kimura two-parameter model

This model assumes that transitions (A - G or T - C) occur more often than ( -). These evolutionary models improve the distance calculations between the A sequences. α G These evolutionary models have less α Is the rate of transitional effect in phylogenetic predictions of Substitutions. closely related sequences. Is the rate of transversional βββ substitutions. These evolutionary models have better effect with distant related sequences.

C T α

DNA Evolution Models How to read the tree?

Start at the base and follow A basic process in DNA sequence rat The progression of the branch evolution is the substitution of one points (nodes)

nucleotide with another. bovine

This process is slow and can not be human observed directly.

The study of DNA changes is used to estimate the , and the evolution history of organisms. Xeno How to draw Trees? (Building trees software) A Tip... * Unrooted trees should be plotted using the DRAWGRAM program (phylip), or For DNA sequences use the Kimura's similar. model in the building trees programs.

* Rooted trees should be plotted using the For PROTEINS the differences lie with DRAWTREE program (phylip), or similar. the scoring (substitution) matrices used. For more distant sequences you should use BLOSUM with lower # (i.e., for * On a PC use the TreeView program distant proteins use blosum45 and for similar proteins use blosum60).

Known problems of Phylogenetic Analysis Known problems of Phylogenetic Analysis

™ Order of the input data (sequences) - ™ The definition of "best tree" is ambiguous. The order of the input sequences effects the tree It might mean the most likely tree, or a tree with construction. You can "correct" this effect in some the fewest changes, or a tree best fit to a known of the programs (like phylip), using the Jumble model, etc.. option. (J in phylip set to 10).

The trees that result from various methods differ ™ The number of possible trees is huge for large datasets. Often it is not possible to construct all from each other. Never the less, in order to trees, but can guarantee only "a good" tree not compare trees, one need to assume some the "best tree". evolutionary model so that the trees may be tested. Known problems of Phylogenetic Analysis How many trees to build?

∆ The amount of data used for the tree construction is not always "informative". For example, if we ! For each dataset it is recommended to build compare proteins that are 100% similar in a site, we more than one tree. Build a tree using a do not have information to infer to phylogenetic distance method and if possible also use a relationships between these proteins. character-based method, like maximum parsimony. ∆ Population effects are often to be considered, especially if we have a lot of variety (large # of for one protein). ! The core of the tree should be similar in both methods, otherwise you may suspect that your tree is incorrect.