
Introduction to Bioinformatics

Prof. Dr. Nizamettin AYDIN
[email protected]

What is Molecular Phylogenetics

• Phylogenetics is the study of evolutionary relationships
• Example:
  – relationship among species

[Figure: example trees relating primates, rodents, marsupials, crocodiles, birds, lizards and snakes]


A Brief History of Molecular Phylogenetics

• 1900s
  – Immunochemical studies
    • cross-reactions stronger for closely related organisms
  – Nuttall (1902) - apes are closest relatives to humans!
• 1960s - 1970s
  – Protein sequencing methods, electrophoresis, DNA hybridization and PCR contributed to a boom in molecular phylogeny
• late 1970s to present
  – Discoveries using molecular phylogeny
    • Endosymbiosis - Margulis, 1978
    • Divergence of phyla and kingdom - Woese, 1987
    • Many Tree of Life projects completed or underway

Molecular data vs. Morphology/Physiology

Molecular data:
• Strictly heritable entities
• Data is unambiguous
• Regular & predictable evolution
• Quantitative analyses
• Ease of homology assessment
• Relationship of distantly related organisms can be inferred
• Abundant and easily generated with PCR and sequencing

Morphology/Physiology:
• Can be influenced by environmental factors
• Ambiguous modifiers: “reduced”, “slightly elongated”, “somewhat flattened”
• Unpredictable evolution
• Qualitative argumentation
• Homology difficult to assess
• Only close relationships can be confidently inferred
• Problems when working with micro-organisms and where visible morphology is lacking


Phylogenetic concepts: Interpreting a Phylogeny

• Physical position in tree is not meaningful
• Swiveling can only be done at the nodes
• Only tree structure matters

[Figure: the same tree drawn twice, once with the tips ordered Sequence A, B, C, D, E and once ordered Sequence A, B, E, D, C; horizontal axis: Time, ending at Present]
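The point that only tree structure matters can be checked mechanically. The sketch below is a minimal illustration, not the slide's own tree: the five-tip topology and the nested-tuple representation are assumptions made for the example. It converts each tree to an order-free form, so two drawings that differ only by swiveling at nodes compare equal.

```python
def topology(tree):
    """Return an order-free form of a rooted tree given as nested tuples."""
    if isinstance(tree, str):          # a tip (sequence name)
        return tree
    # A frozenset ignores the left/right order of the children at a node.
    return frozenset(topology(child) for child in tree)

# Hypothetical five-sequence tree (not taken from the slide figure).
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = (("E", "D"), ("C", ("B", "A")))   # tree_1 swiveled at several nodes
tree_3 = ((("A", "C"), "B"), ("D", "E"))   # B and C moved to different branches

same = topology(tree_1) == topology(tree_2)        # swiveling keeps the tree
different = topology(tree_1) != topology(tree_3)   # regrouping changes it
print(same, different)
```

Swiveling changes where a tip is drawn but never which tips share a node, which is why the frozenset comparison treats `tree_1` and `tree_2` as the same phylogeny.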


Copyright 2000 N. AYDIN. All rights reserved.

Tree Terminology

• Relationships are illustrated by a phylogenetic tree / dendrogram
  – Combination of Greek dendro/tree and gramma/drawing
  – A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.
  – Dendrograms are often used in computational biology to illustrate the clustering of genes or samples, sometimes on top of heatmaps.
• A cladogram is a type of phylogenetic tree that only shows tree topology, the shape indicating relatedness.
  – It shows that, say, humans are more closely related to chimpanzees than to gorillas, but not the time or genetic distance between the species.
  – Combination of Greek clados/branch and gramma/drawing
• The branching pattern is called the tree’s topology
• Trees can be represented in several forms:

[Figure: Rectangular cladogram; Slanted cladogram]


• Same tree - seven different views:
  Rectangular Phylogram, Rectangular Cladogram, Slanted Cladogram, Circular Phylogram, Circular Cladogram, Radial Phylogram and Radial Cladogram

Tree Terminology

[Figure: Circular cladogram]


Tree Terminology

[Figure: tree of taxa A to F with labels for the root, internal nodes, terminal nodes, branches, sisters, and operational taxonomic units (OTU) / taxa]

• Taxon, plural taxa: any group or rank in a biological classification into which related organisms are classified.

[Figure: taxa A to F drawn as a rooted tree and as an unrooted tree]

• Rooted trees:
  – Have a root that denotes common ancestry
• Unrooted trees:
  – Only specify the degree of kinship among taxa but not the evolutionary path


Tree Terminology

[Figure: taxa A to F drawn as a scaled tree and as an unscaled tree]

• Scaled trees:
  – Branch lengths are proportional to the number of nucleotide/amino acid changes that occurred on that branch (usually a scale is included).
• Unscaled trees:
  – Branch lengths are not proportional to the number of nucleotide/amino acid changes (usually used to illustrate evolutionary relationships only).

[Figure: monophyletic groups shown on a tree of Saturnite 1, Saturnite 2, Saturnite 3, Martian 1, Martian 3 and Martian 2; paraphyletic groups shown on a tree of Jupiterian 32, Jupiterian 5, Jupiterian 67, Human 11, Jupiterian 8 and Human 3]

• Monophyletic groups:
  – All taxa within the group are derived from a single common ancestor and members form a natural clade.
• Paraphyletic groups:
  – The common ancestor of the group is also shared by taxa outside the group, so members do not form a natural clade.
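The monophyly definition above can be tested on a tree: a group is monophyletic when its members are exactly the tips below some node. The sketch below assumes a nested-tuple tree, and the topology used for the Jupiterian/Human taxa is hypothetical (the slide figure's exact branching is not recoverable from the text).

```python
def leaves(tree):
    """Set of tip names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Tip sets of every subtree, i.e. every clade in the tree."""
    found = {leaves(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clades(child)
    return found

def is_monophyletic(tree, group):
    """True when the group is exactly the set of tips below one node."""
    return frozenset(group) in clades(tree)

# Hypothetical topology: the two humans branch off inside the Jupiterians.
tree = (("Jupiterian 32", "Jupiterian 5"),
        ("Jupiterian 67", (("Human 11", "Human 3"), "Jupiterian 8")))

humans = {"Human 11", "Human 3"}
jupiterians = {"Jupiterian 32", "Jupiterian 5", "Jupiterian 67", "Jupiterian 8"}

print(is_monophyletic(tree, humans))       # humans form a clade here
print(is_monophyletic(tree, jupiterians))  # Jupiterians alone do not
```

In this assumed tree the Jupiterians are paraphyletic: their common ancestor is the root, which is also an ancestor of the humans.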

Methods in Phylogenetic Reconstruction

• Distance methods
  – calculate pairwise distances between sequences, and group sequences that are most similar.
  – This approach has potential for computational simplicity and therefore speed
• Maximum Parsimony
  – assumes that shared characters in different entities result from common descent.
  – Groups are built on the basis of such shared characters, and the simplest explanation for the evolution of characters is taken to be the correct, or most parsimonious one.
• Maximum Likelihood
  – compute the probability that a data set fits a tree derived from that data set, given a specified model of sequence evolution.

Comparison of Methods

Distance:
• Uses only pairwise distances
• Minimizes distance between nearest neighbors
• Very fast
• Easily trapped in local optima
• Good for generating tentative tree, or choosing among multiple trees

Maximum parsimony:
• Uses only shared derived characters
• Minimizes total distance
• Slow
• Assumptions fail when evolution is rapid
• Best option when tractable (<30 taxa, homoplasy rare)

Maximum likelihood:
• Uses all data
• Maximizes tree likelihood given specific parameter values
• Very slow
• Highly dependent on assumed evolution model
• Good for very small data sets and for testing trees built using other methods


Methods in Phylogenetic Reconstruction

• Distance
  – Using a sequence alignment, pairwise distances are calculated
  – Creates a distance matrix
  – A phylogenetic tree is calculated with clustering algorithms, using the distance matrix.
  – Examples of clustering algorithms include the Unweighted Pair Group Method using Arithmetic averages (UPGMA) and Neighbor Joining clustering.

[Figure: sequences A, B, C and D joined step by step into a tree]

• Maximum Parsimony
  – All possible trees are determined for each position of the sequence alignment
  – Each tree is given a score based on the number of evolutionary steps needed to produce said tree
  – The most parsimonious tree is the one that has the fewest evolutionary changes for all sequences to be derived from a common ancestor
  – Usually several equally parsimonious trees result from a single run.
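The distance workflow listed above (distance matrix first, then clustering) can be sketched with UPGMA. This is a minimal illustration rather than a production implementation: it returns only the topology as nested tuples, without branch lengths, and the distance values, the `upgma` function name and the input layout are assumptions made for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA; dist maps frozenset({x, y}) to a distance."""
    # Each cluster is (tree, tips): tree is a name or a nested tuple.
    clusters = [(name, (name,)) for name in names]

    def average_distance(c1, c2):
        # UPGMA: unweighted mean over all tip pairs between two clusters.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Illustrative pairwise distances for four sequences (assumed values).
distances = {
    frozenset("AB"): 3, frozenset("AC"): 7, frozenset("AD"): 8,
    frozenset("BC"): 6, frozenset("BD"): 7, frozenset("CD"): 3,
}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

With these values A joins B and C joins D before the two pairs are merged. UPGMA assumes a constant rate of change along all branches, so the result should be read as a tentative tree.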


Maximum parsimony: exhaustive stepwise addition

[Figure: Step 1 shows the tree of A, B and C; Step 2 shows the three trees obtained by adding D; Step 3 shows the trees obtained by adding E to each Step 2 tree]

Methods in Phylogenetic Reconstruction

• Maximum Likelihood
  – Creates all possible trees like the Maximum Parsimony method, but instead of retaining trees with the shortest evolutionary steps…
  – Employs a model of evolution in which different transition/transversion ratios can be used
  – For each tree generated, the probability that it reflects each position of the sequence data is calculated.
  – Calculation is repeated for all nucleotide sites
  – Finally, the tree with the best probability is shown as the maximum likelihood tree - usually only a single tree remains
  – It is a more realistic tree estimation because it does not assume an equal transition-transversion ratio for all branches.


How confident are we about the inferred phylogeny?

[Figure: inferred tree of rat, human, turtle, fruit fly, oak and duckweed with a "?" at each internal node]

• Bootstrapping
  • Bootstrap analysis is a kind of statistical analysis to test the reliability of certain branches in the evolutionary tree
  • It involves resampling one's own data, with replacement, to create a series of bootstrap samples of the same size as the original data.
  • In the case of nucleic acid (amino acid) sequences, the resampled data are the nucleotides (amino acids) of a sequence, while the statistical significance of a specific cluster is given by the fraction of trees, based on the resampled data, containing that cluster.

The Bootstrap

• Computational method to estimate the confidence level of a certain phylogenetic tree.

  Sample
    column     0123456789
    rat        GAGGCTTATC
    human      GTGGCTTATC
    turtle     GTGCCCTATG
    fruitfly   CTCGCCTTTG
    oak        ATCGCTCTTG
    duckweed   ATCCCTCCGG

  Pseudo sample 1
    column     001122234556667
    rat        GGAAGGGGCTTTTTA
    human      GGTTGGGGCTTTTTA
    turtle     GGTTGGGCCCCTTTA
    fruitfly   CCTTCCCGCCCTTTT
    oak        AATTCCCGCTTCCCT
    duckweed   AATTCCCCCTTCCCC

  Pseudo sample 2
    column     445556777888899
    rat        CCTTTTAAATTTTCC
    human      CCTTTTAAATTTTCC
    turtle     CCCCCTAAATTTTGG
    fruitfly   CCCCCTTTTTTTTGG
    oak        CCTTTCTTTTTTTGG
    duckweed   CCTTTCCCCGGGGGG

  Many more replicates (between 100 - 1000)

[Figure: a tree of rat, human, turtle, fruit fly, oak and duckweed is built from each pseudo sample and compared with the inferred tree]
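The column resampling behind the pseudo samples can be sketched as follows. The sample alignment and the column indices of pseudo sample 1 are taken from the slide; the function names, the fixed random seed and the choice of 10 draws for the random replicate are assumptions for illustration.

```python
import random

sample = {
    "rat":      "GAGGCTTATC",
    "human":    "GTGGCTTATC",
    "turtle":   "GTGCCCTATG",
    "fruitfly": "CTCGCCTTTG",
    "oak":      "ATCGCTCTTG",
    "duckweed": "ATCCCTCCGG",
}

def resample_columns(alignment, columns):
    """Build a pseudo sample by copying the chosen alignment columns."""
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

def bootstrap_columns(n_sites, n_draws, rng):
    """Draw column indices with replacement (sorted only for readability)."""
    return sorted(rng.randrange(n_sites) for _ in range(n_draws))

# Column indices printed above pseudo sample 1 on the slide.
pseudo_1_columns = [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 5, 6, 6, 6, 7]
pseudo_1 = resample_columns(sample, pseudo_1_columns)
print(pseudo_1["rat"])

# One random replicate; the seed is fixed here only for repeatability.
rng = random.Random(0)
replicate = resample_columns(sample, bootstrap_columns(10, 10, rng))
```

Every sequence is resampled with the same column indices, so each pseudo sample is still a valid alignment from which a tree can be built. The bootstrap value of a cluster is then the fraction of replicate trees that contain it.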


Bootstrap values

[Figure: inferred tree of rat, human, turtle, fruit fly, oak and duckweed with bootstrap values 100, 65, 0 and 55 at the internal nodes]

• Values are in percentages
• Conventional practice: only values 60-100% are shown

Some Discoveries Made Using Molecular Phylogenetics

• Universal Tree of Life
  – Using rRNA sequences
  – Able to study the relationships of uncultivated organisms, obtained from a hot spring in Yellowstone National Park


Some Discoveries Made Using Molecular Phylogenetics

• Endosymbiosis: Origin of the Mitochondrion and Chloroplast

[Figure: rooted tree with Archaea, Eukaryotes, Cyanobacteria, α-Purple Bacteria and Other bacteria; Chloroplasts and Mitochondria are placed among the bacteria]

• Mitochondria and chloroplasts are derived from the α-purple bacteria and the cyanobacteria respectively, via separate endosymbiotic events.

• Relationships within species: HIV subtypes

[Figure: tree of HIV subtypes A, B, C, D, F and G, with isolates from Rwanda, Ivory Coast, Italy, U.S., Uganda, India, U.K., Ethiopia, S. Africa, Tanzania, Netherlands, Russia, Romania, Taiwan, Cameroon and Brazil]


Problems and Errors in Phylogenetic Reconstruction

• Inherent strengths and weaknesses in different tree-making methodologies.
• More is better
  – Errors in inferred phylogeny may be caused by small data sets and/or limited sampling.
• Unsuitable sequences
  – those undergoing rapid nucleotide changes, or slow to zero changes over time, may skew phylogenetic estimations
• Duplications, inversions, insertions, deletions etc. can give inaccurate signals
• Genomic hotspots:
  – small regions of rapid evolution are not easily detected
• Homoplasy:
  – nucleotide changes that are similar but occurred independently in separate lineages are mistakenly assumed to be inherited changes
• Sample contamination / mislabeling:
  – always a possibility when working with large data sets


Maximum Parsimony - example

• Maximum parsimony methods predict the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences.
  – First, a multiple sequence alignment must be obtained.
• For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified.
• This continues for each position in the alignment.
• Those trees that produce the smallest number of changes overall for all sequence positions are identified.
  – This is a rather time-consuming algorithm that only works well if the sequences have a strong sequence similarity.


Maximum Parsimony - example

• Assuming we have 4 sequences
  – There are 3 possible trees
• The optimal tree is obtained by adding the number of changes at each informative site for each tree, and picking the tree requiring the least total number of changes.
• For a large number of sequences the number of trees to examine becomes so large that it might not be possible to examine all possible trees.

[Figure: two candidate trees for the sequences AAA, AAG, AGA and GGA, with inferred ancestral sequences and the number of substitutions on each branch; left tree: Total #substitutions = 3; right tree: Total #substitutions = 4]

• The left tree is preferred over the right tree.


Maximum Parsimony - example

• Consider the following sequences

  S1   C A C C C C T T
  S2   A A C C C C A T
  S3   C A C T G C T T
  S4   A A C T G C T A

  Changes required at each position, and in total:

  (S1,S2),(S3,S4)   2 0 0 1 1 0 1 1   6 √
  (S1,S3),(S2,S4)   1 0 0 2 2 0 1 1   7

[Figure: position 1 (S1 = C, S2 = A, S3 = C, S4 = A) drawn on both trees; tree (S1,S2),(S3,S4) needs mutation = 2, tree (S1,S3),(S2,S4) needs mutation = 1]
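The per-position counts in the table above can be reproduced with Fitch's small-parsimony algorithm, which is one standard way to count the minimum number of changes on a fixed tree. The slides do not name the counting procedure, so treat this as an assumed technique; the nested-tuple tree encoding and function names are also assumptions.

```python
seqs = {
    "S1": "CACCCCTT",
    "S2": "AACCCCAT",
    "S3": "CACTGCTT",
    "S4": "AACTGCTA",
}

def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one alignment column."""
    if isinstance(tree, str):                 # a tip: its observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_scores(tree, alignment):
    """Minimum number of changes at each position of the alignment."""
    length = len(next(iter(alignment.values())))
    return [fitch_site(tree, {n: s[i] for n, s in alignment.items()})[1]
            for i in range(length)]

tree_1 = (("S1", "S2"), ("S3", "S4"))
tree_2 = (("S1", "S3"), ("S2", "S4"))

scores_1 = parsimony_scores(tree_1, seqs)
scores_2 = parsimony_scores(tree_2, seqs)
print(scores_1, sum(scores_1))
print(scores_2, sum(scores_2))
```

The trees are rooted on their internal branch only to make the recursion convenient; the number of changes does not depend on where an unrooted tree is rooted.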


Distance Methods - example

• For phylogenetic analysis, the distance score is counted as either
  – the number of mismatched positions in the alignment, or
  – the number of sequence positions that must be changed to generate the other sequence.
• The Fitch and Margoliash method uses a distance table.
  – The sequences are combined in threes to define the branches of the predicted tree and to calculate the branch lengths of the tree.

• Phylogeny reconstruction for 3 sequences
  – There is a single tree topology
  – The branch lengths (a, b, c):
      a + b = D_AB
      b + c = D_BC
      a + c = D_AC

         A     B     C
    A    --    a+b   a+c
    B    --    --    b+c
    C    --    --    --

[Figure: unrooted tree joining A, B and C at one internal node, with branch lengths a, b and c]

• Input:
  – D_AB, D_BC and D_AC (pairwise distances)
• Output:
      a = (D_AB + D_AC - D_BC) / 2
      b = (D_AB + D_BC - D_AC) / 2
      c = (D_AC + D_BC - D_AB) / 2


Distance Methods - example

• Distance matrix of 3 sequences and unrooted tree

         A     B     C
    A    --    22    39
    B    --    --    41
    C    --    --    --

[Figure: unrooted tree joining A, B and C, with branch lengths a, b and c]

  – distance from A to B = a + b = 22 (1)
  – distance from A to C = a + c = 39 (2)
  – distance from B to C = b + c = 41 (3)
• subtracting (2) from (3) yields:
    (b + c) - (a + c) = b - a = 41 - 39 = 2 (4)
• adding (1) and (4) yields
    (a + b) + (b - a) = 2b = 22 + 2 = 24
    b = 24 / 2 = 12
• so
    a + b = a + 12 = 22;
    a = 22 - 12 = 10
• finally
    a + c = 10 + c = 39;
    c = 39 - 10 = 29

[Figure: the resulting tree with branch lengths A = 10, B = 12 and C = 29]


Distance Methods - example

• Consider the alignment:

    A   ACGCGTTGGGCGATGGCAAC
    B   ACGCGTTGGGCGACGGTAAT
    C   ACGCATTGAATGATGATAAT
    D   ACACATTGAGTGATAATAAT

• The distances between these sequences can be shown as a table:

         A    B    C    D
    A    -    3    7    8
    B    -    -    6    7
    C    -    -    -    3
    D    -    -    -    -

• Using this information, an unrooted tree showing the relationship between these sequences can be drawn:

[Figure: unrooted tree with A and B on one side and C and D on the other; branch lengths A = 2, B = 1, C = 1, D = 2, internal branch = 4]
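The distance table for this alignment can be recomputed by counting mismatched positions for every pair of sequences. The sketch below uses the four sequences from the slide; the function and variable names are assumptions.

```python
from itertools import combinations

alignment = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}

def mismatches(seq_1, seq_2):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_1) != len(seq_2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_1, seq_2))

# One entry per unordered pair, i.e. the upper triangle of the table.
distances = {
    (p, q): mismatches(alignment[p], alignment[q])
    for p, q in combinations(alignment, 2)
}

for (p, q), d in distances.items():
    print(p, q, d)
```

This raw mismatch count does not correct for multiple changes at the same site, so it tends to underestimate the distance between strongly diverged sequences.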

