Introduction to Bioinformatics What Is Molecular Phylogenetics
Introduction to Bioinformatics
Phylogenetics
Prof. Dr. Nizamettin AYDIN

What is Molecular Phylogenetics
• Phylogenetics is the study of evolutionary relationships
• Example:
  – relationship among species
[Figure: example tree relating crocodiles, birds, lizards, snakes, rodents, primates, and marsupials]

A Brief History of Molecular Phylogenetics
• 1900s
  – Immunochemical studies
    • cross-reactions stronger for closely related organisms
  – Nuttall (1902) - apes are closest relatives to humans!
• 1960s - 1970s
  – Protein sequencing methods, electrophoresis, DNA hybridization and, later, PCR contributed to a boom in molecular phylogeny
• late 1970s to present
  – Discoveries using molecular phylogeny
    • Endosymbiosis - Margulis, 1978
    • Divergence of phyla and kingdom - Woese, 1987
    • Many Tree of Life projects completed or underway

Molecular data vs. Morphology/Physiology
Molecular data | Morphology/Physiology
Strictly heritable entities | Can be influenced by environmental factors
Data is unambiguous | Ambiguous modifiers: “reduced”, “slightly elongated”, “somewhat flattened”
Regular & predictable evolution | Unpredictable evolution
Quantitative analyses | Qualitative argumentation
Ease of homology assessment | Homology difficult to assess
Relationship of distantly related organisms can be inferred | Only close relationships can be confidently inferred
Abundant and easily generated with PCR and sequencing | Problems when working with micro-organisms and where visible morphology is lacking

Phylogenetic concepts: Interpreting a Phylogeny
• Physical position in tree is not meaningful
• Swiveling can only be done at the nodes
• Only tree structure matters
[Figure: the same tree drawn twice along a time axis ending at "Present", once with tips ordered Sequence A, B, C, D, E and once with tips ordered Sequence A, B, E, D, C]

Copyright 2000 N. AYDIN. All rights reserved.

Tree Terminology
• Relationships are illustrated by a phylogenetic tree / dendrogram
  – Combination of Greek dendron (tree) and gramma (drawing)
  – A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.
  – Dendrograms are often used in computational biology to illustrate the clustering of genes or samples, sometimes on top of heatmaps.
• A cladogram is a type of phylogenetic tree that only shows tree topology, that is, the shape indicating relatedness.
  – Combination of Greek klados (branch) and gramma (drawing)
• It shows that, say, humans are more closely related to chimpanzees than to gorillas, but not the time or genetic distance between the species.

Tree Terminology
• The branching pattern is called the tree’s topology
• Trees can be represented in several forms:
  – Rectangular cladogram
  – Slanted cladogram
  – Circular cladogram
• Same tree - seven different views: Rectangular Phylogram, Rectangular Cladogram, Slanted Cladogram, Circular Phylogram, Circular Cladogram, Radial Phylogram and Radial Cladogram

Tree Terminology
[Figure: tree with tips A to F, annotated with the terms operational taxonomic units (OTU) / taxa, terminal nodes, internal nodes, branches, root, sisters, and polytomy]
• Taxon, plural taxa (taxonomy): any group or rank in a biological classification into which related organisms are classified.

Tree Terminology
[Figure: a rooted tree and an unrooted tree, each with tips A to F]
• Rooted trees:
  – Have a root that denotes common ancestry
• Unrooted trees:
  – Only specify the degree of kinship among taxa but not the evolutionary path
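The earlier point that swiveling at the nodes leaves a phylogeny unchanged can be checked mechanically. The sketch below is only an illustration and is not taken from the slides: it assumes rooted trees written as nested Python tuples, and the topology chosen for sequences A to E is hypothetical (the slides give only the two tip orders, A B C D E and A B E D C). The helper names `canonical` and `tip_order` are also made up for this example.

```python
def canonical(tree):
    """Return an order-independent form of a rooted tree.

    A tip is a string; an internal node is a tuple of child subtrees.
    Children are collected into a frozenset, so the left-to-right drawing
    order at every node is deliberately thrown away.
    """
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)


def tip_order(tree):
    """List the tips from top to bottom as they would appear in a drawing."""
    if isinstance(tree, str):
        return [tree]
    return [tip for child in tree for tip in tip_order(child)]


# Hypothetical topology, drawn with tips in the order A, B, C, D, E.
tree_1 = (("A", "B"), (("C", "D"), "E"))
# Same topology after swiveling two nodes: tips now read A, B, E, D, C.
tree_2 = (("A", "B"), ("E", ("D", "C")))
# A genuinely different branching pattern (A grouped with C instead of B).
tree_3 = (("A", "C"), (("B", "D"), "E"))

print(tip_order(tree_1), tip_order(tree_2))
print("tree_1 and tree_2 same topology:", canonical(tree_1) == canonical(tree_2))
print("tree_1 and tree_3 same topology:", canonical(tree_1) == canonical(tree_3))
```

Comparing the canonical forms rather than the tuples themselves is what encodes "only tree structure matters": two drawings are treated as the same tree whenever they differ only by rotations at nodes.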
Tree Terminology
[Figure: a scaled tree and an unscaled tree, each with tips A to F]
• Scaled trees:
  – Branch lengths are proportional to the number of nucleotide/amino acid changes that occurred on that branch (usually a scale is included).
• Unscaled trees:
  – Branch lengths are not proportional to the number of nucleotide/amino acid changes (usually used to illustrate evolutionary relationships only).

Tree Terminology
[Figure: under "Monophyletic groups", a tree with tips Saturnite 1, Saturnite 2, Saturnite 3, Martian 1, Martian 3, Martian 2; under "Paraphyletic groups", a tree with tips Jupiterian 32, Jupiterian 5, Jupiterian 67, Human 11, Jupiterian 8, Human 3]
• Monophyletic groups:
  – All taxa within the group are derived from a single common ancestor and members form a natural clade.
• Paraphyletic groups:
  – The group's common ancestor is also shared by taxa outside the group, so the members do not form a natural clade.

Methods in Phylogenetic Reconstruction
• Distance methods
  – calculate pairwise distances between sequences, and group sequences that are most similar.
  – This approach has potential for computational simplicity and therefore speed.
• Maximum Parsimony
  – assumes that shared characters in different entities result from common descent.
  – Groups are built on the basis of such shared characters, and the simplest explanation for the evolution of characters is taken to be the correct, or most parsimonious, one.
• Maximum Likelihood
  – computes the probability that a data set fits a tree derived from that data set, given a specified model of sequence evolution.

Comparison of Methods
Distance | Maximum parsimony | Maximum likelihood
Uses only pairwise distances | Uses only shared derived characters | Uses all data
Minimizes distance between nearest neighbors | Minimizes total distance | Maximizes tree likelihood given specific parameter values
Very fast | Slow | Very slow
Easily trapped in local optima | Assumptions fail when evolution is rapid | Highly dependent on assumed evolution model
Good for generating tentative tree, or choosing among multiple trees | Best option when tractable (<30 taxa, homoplasy rare) | Good for very small data sets and for testing trees built using other methods

Methods in Phylogenetic Reconstruction
• Distance
  – Using a sequence alignment, pairwise distances are calculated
  – Creates a distance matrix
  – A phylogenetic tree is calculated with clustering algorithms, using the distance matrix.
  – Examples of clustering algorithms include the Unweighted Pair Group Method using Arithmetic averages (UPGMA) and Neighbor Joining clustering.
[Figure: a tree built up step by step as sequences A, B, C, and D are clustered]

Methods in Phylogenetic Reconstruction
• Maximum Parsimony
  – All possible trees are determined for each position of the sequence alignment
  – Each tree is given a score based on the number of evolutionary steps needed to produce said tree
  – The most parsimonious tree is the one that has the fewest evolutionary changes for all sequences to be derived from a common ancestor
  – Usually several equally parsimonious trees result from a single run.

Maximum parsimony: exhaustive stepwise addition
[Figure: Step 1 shows the single tree for taxa A, B, C; Step 2 shows the trees obtained by adding D at each possible position; Step 3 shows the trees obtained by adding E to each of those, and so on]

Methods in Phylogenetic Reconstruction
• Maximum Likelihood
  – Creates all possible trees, as the Maximum Parsimony method does, but does not simply retain the trees with the fewest evolutionary steps
  – Employs a model of evolution in which different transition/transversion ratios can be used
  – For each tree, the probability that it reflects each position of the sequence data is calculated.
  – Calculation is repeated for all nucleotide sites
  – Finally, the tree with the best probability is shown as the maximum likelihood tree - usually only a single tree remains
  – It is a more realistic tree estimation because it does not assume equal transition-transversion ratio for all branches.

How confident are we about the inferred phylogeny?
[Figure: tree of rat, human, turtle, fruit fly, oak, and duckweed with "?" marks on its groupings]
• Bootstrapping
• Bootstrap analysis is a kind of statistical analysis to test the reliability of certain branches in the evolutionary tree
• It involves resampling one's own data, with replacement, to create a series of bootstrap samples of the same size as the original data.
• In the case of nucleic acid (amino acid) sequences, the resampled data are the nucleotides (amino acids) of a sequence, while the statistical significance of a specific cluster is given by the fraction of trees, based on the resampled data, containing that cluster.

The Bootstrap
• Computational method to estimate the confidence level of a certain phylogenetic tree.

Sample (columns 0123456789)
rat       GAGGCTTATC
human     GTGGCTTATC
turtle    GTGCCCTATG
fruitfly  CTCGCCTTTG
oak       ATCGCTCTTG
duckweed  ATCCCTCCGG

Pseudo sample 1 (columns 001122234556667)
rat       GGAAGGGGCTTTTTA
human     GGTTGGGGCTTTTTA
turtle    GGTTGGGCCCCTTTA
fruitfly  CCTTCCCGCCCTTTT
oak       AATTCCCGCTTCCCT
duckweed  AATTCCCCCTTCCCC

Pseudo sample 2 (columns 445556777888899)
rat       CCTTTTAAATTTTCC
human     CCTTTTAAATTTTCC
turtle    CCCCCTAAATTTTGG
fruitfly  CCCCCTTTTTTTTGG
oak       CCTTTCTTTTTTTGG
duckweed  CCTTTCCCCGGGGGG

[Figure: a tree of rat, human, turtle, fruit fly, oak, and duckweed is inferred from each pseudo sample]
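The column resampling behind the pseudo samples can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecturer's implementation: the function names are invented here, random replicates draw as many columns as the original alignment has (the slide's two pseudo samples happen to list 15 columns each), and no tree is built. As a crude stand-in for "the fraction of trees containing that cluster", the sketch counts the replicates in which two sequences are strictly closer to each other than to anything else, using the plain proportion of differing sites as the distance.

```python
import random

# Sample alignment copied from the slide (columns numbered 0-9).
ALIGNMENT = {
    "rat":      "GAGGCTTATC",
    "human":    "GTGGCTTATC",
    "turtle":   "GTGCCCTATG",
    "fruitfly": "CTCGCCTTTG",
    "oak":      "ATCGCTCTTG",
    "duckweed": "ATCCCTCCGG",
}


def resample_columns(alignment, columns):
    """Build a pseudo sample by copying the listed alignment columns in order."""
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def draw_columns(n_columns, rng):
    """Draw n_columns column indices with replacement (sorted only for readability)."""
    return sorted(rng.randrange(n_columns) for _ in range(n_columns))


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pair_is_closest(sample, a, b):
    """True if a and b are strictly closer to each other than to any other sequence."""
    d_ab = p_distance(sample[a], sample[b])
    others = [name for name in sample if name not in (a, b)]
    return all(d_ab < p_distance(sample[x], sample[o]) for x in (a, b) for o in others)


def pair_support(alignment, a, b, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which a and b pair up (a proxy, not a tree)."""
    rng = random.Random(seed)
    n_columns = len(next(iter(alignment.values())))
    hits = sum(
        pair_is_closest(resample_columns(alignment, draw_columns(n_columns, rng)), a, b)
        for _ in range(replicates)
    )
    return hits / replicates


# Reproduce pseudo sample 1 from the column indices printed on the slide.
slide_columns = [int(c) for c in "001122234556667"]
print(resample_columns(ALIGNMENT, slide_columns)["rat"])  # GGAAGGGGCTTTTTA, as on the slide

# Support for the rat + human pairing over 200 random replicates.
print(pair_support(ALIGNMENT, "rat", "human"))
```

A real analysis would infer a full tree from every replicate with the same method used for the original data and then report, for each cluster, the percentage of replicate trees that contain it.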
[Figure, continued: many more replicates (between 100 and 1000), followed by the inferred tree]

Bootstrap values
[Figure: inferred tree of rat, human, turtle, fruit fly, oak, and duckweed with bootstrap values 100, 65, 0, and 55 on its branches]
• Values are in percentages
• Conventional practice: only values 60-100% are shown

Some Discoveries Made Using Molecular Phylogenetics
• Universal Tree of Life
  – Using rRNA sequences
  – Able to study the relationships of uncultivated organisms, obtained from a hot spring in Yellowstone National Park

Some Discoveries Made Using Molecular Phylogenetics
• Endosymbiosis: Origin of the Mitochondrion and Chloroplast
[Figure: tree with branches labeled α-Purple Bacteria, Other bacteria, Chloroplasts, and Mitochondria]

Some Discoveries Made Using Molecular Phylogenetics
• Relationships within species: HIV subtypes
[Figure: tree of HIV isolates labeled by country of origin (Rwanda, Ivory Coast, Italy, U.S., Uganda, India, U.K.) with subtype labels such as A and B]