Bias in Phylogenetic Estimation and Its Relevance to the Choice
										Recommended publications
									
								- 
  Lecture Notes: The Mathematics of Phylogenetics
  Elizabeth S. Allman, John A. Rhodes. IAS/Park City Mathematics Institute, June-July 2005; University of Alaska Fairbanks, Spring 2009, 2012, 2016. © 2005, Elizabeth S. Allman and John A. Rhodes. Contents: 1 Sequences and Molecular Evolution (1.1 DNA structure; 1.2 Mutations; 1.3 Aligned Orthologous Sequences). 2 Combinatorics of Trees I (2.1 Graphs and Trees; 2.2 Counting Binary Trees; 2.3 Metric Trees; 2.4 Ultrametric Trees and Molecular Clocks; 2.5 Rooting Trees with Outgroups; 2.6 Newick Notation; 2.7 Exercises). 3 Parsimony (3.1 The Parsimony Criterion; 3.2 The Fitch-Hartigan Algorithm; 3.3 Informative Characters; 3.4 Complexity; 3.5 Weighted Parsimony; 3.6 Recovering Minimal Extensions; 3.7 Further Issues; 3.8 Exercises). 4 Combinatorics of Trees II (4.1 Splits and Clades; 4.2 Refinements and Consensus Trees; 4.3 Quartets; 4.4 Supertrees; 4.5 Final Comments; 4.6 Exercises). 5 Distance Methods (5.1 Dissimilarity Measures; 5.2 An Algorithmic Construction: UPGMA; 5.3 Unequal Branch Lengths; 5.4 The Four-point Condition; 5.5 The Neighbor Joining Algorithm; 5.6 Additional Comments; 5.7 Exercises). 6 Probabilistic Models of DNA Mutation (6.1 A first example; 6.2 Markov Models on Trees; 6.3 Jukes-Cantor and Kimura Models)
- 
  Phylogeny: Codon Models
  Codon models. Last lecture: the poor man's way of calculating dN/dS (Ka/Ks): tabulate synonymous and non-synonymous substitutions, normalize by the possibilities, and transform to a genetic distance (K_JC or K_K2P). In reality we use a codon model, in which amino acid substitution rates meet nucleotide models and the unit is the codon (nucleotide triplet). Codon model parameterization: stop codons are not allowed, reducing the matrix from 64x64 to 61x61. The entire codon matrix can be parameterized using κ (kappa), the transition/transversion ratio; ω (omega), the dN/dS ratio (optimizing this parameter gives an estimate of the selection force); and π_j, the equilibrium frequency of codon j (Goldman and Yang, MBE 1994). Empirical codon substitution matrix, observations: instantaneous rates of double nucleotide changes seem to be non-zero, so there should be a mechanism for mutating 2 adjacent nucleotides at once (Kosiol and Goldman). Phylogeny. Last lecture: inferring distance. Phylogenetic trees given an alignment: how do we infer trees and distances (branch length and topology) given an alignment?
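The 64-to-61 reduction and the transition/transversion distinction behind κ can be checked with a few lines of Python. This is only an illustrative sketch: it assumes the standard genetic code (stop codons TAA, TAG, TGA), and the helper name `single_change` is hypothetical rather than taken from the lecture.

```python
from itertools import product

# Assumption: standard genetic code, whose stop codons are TAA, TAG and TGA.
STOP_CODONS = {"TAA", "TAG", "TGA"}
PURINES = {"A", "G"}

codons = ["".join(c) for c in product("ACGT", repeat=3)]
sense_codons = [c for c in codons if c not in STOP_CODONS]

def single_change(codon_a, codon_b):
    """Classify a one-nucleotide difference between two codons.

    Returns "transition" (purine<->purine or pyrimidine<->pyrimidine),
    "transversion" (purine<->pyrimidine), or None when the codons differ
    at zero or at more than one position.
    """
    diffs = [(x, y) for x, y in zip(codon_a, codon_b) if x != y]
    if len(diffs) != 1:
        return None
    x, y = diffs[0]
    return "transition" if (x in PURINES) == (y in PURINES) else "transversion"

print(len(codons), len(sense_codons))
print(single_change("AAA", "AAG"), single_change("AAA", "AAC"))
```

Pairs for which `single_change` returns `None` are the multi-nucleotide changes whose rates the empirical observations above suggest are not exactly zero.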
- 
  Introduction to Phylogenetics
  Workshop on Molecular Evolution 2018, Marine Biological Lab, Woods Hole, MA, USA. Mark T. Holder, University of Kansas (slides by Paul Lewis). Outline: 1. phylogenetics is crucial for comparative biology; 2. tree terminology; 3. why phylogenetics is difficult; 4. parsimony; 5. distance-based methods; 6. theoretical basis of multiple sequence alignment.
  Part 1: phylogenetics is crucial for biology.
    Species   Habitat       Photoprotection
    1         terrestrial   xanthophyll
    2         terrestrial   xanthophyll
    3         terrestrial   xanthophyll
    4         terrestrial   xanthophyll
    5         terrestrial   xanthophyll
    6         aquatic       none
    7         aquatic       none
    8         aquatic       none
    9         aquatic       none
    10        aquatic       none
  Phylogeny reveals the events that generate the pattern: 1 pair of changes (coincidence?) versus 5 pairs of changes (much more convincing). Many evolutionary questions require a phylogeny: determining whether a trait tends to be lost more often than gained, or vice versa; estimating divergence times (Tracy Heath, Sunday + next Saturday); distinguishing homology from analogy; inferring parts of a gene under strong positive selection (Joe Bielawski and Belinda Chang, next Monday).
  Part 2: tree terminology. [Figure: tree with tips A to E.] Terminal node (or leaf, degree 1); interior node (or vertex, degree 3+); split (bipartition), also written AB|CDE or portrayed **---; branch (edge); root node of tree (degree 2). Monophyletic groups ("clades") are the basis of phylogenetic classification. Black state = a synapomorphy; white state = a plesiomorphy; paraphyletic; polyphyletic; grey state is an autapomorphy (images from Wikipedia). Branch rotation does not matter. Rooted vs unrooted trees: the ingroup is the focal taxa; the outgroup is the taxa that are more distantly related. Assuming that the ingroup is monophyletic with respect to the outgroup lets one root the tree.
- 
  Heterotachy and Long-Branch Attraction in Phylogenetics
  Hervé Philippe, Yan Zhou, Henner Brinkmann, Nicolas Rodrigue, Frédéric Delsuc. Heterotachy and long-branch attraction in phylogenetics. BMC Evolutionary Biology, BioMed Central, 2005, 5, pp.50. 10.1186/1471-2148-5-50. HAL Id: halsde-00193044, https://hal.archives-ouvertes.fr/halsde-00193044, submitted on 30 Nov 2007. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Research article, Open Access. Authors: Hervé Philippe (1, corresponding author), Yan Zhou (1), Henner Brinkmann (1), Nicolas Rodrigue (1) and Frédéric Delsuc (1, 2). Address: (1) Canadian Institute for Advanced Research, Centre Robert-Cedergren, Département de Biochimie, Université de Montréal, Succursale Centre-Ville, Montréal, Québec H3C3J7, Canada and (2) Laboratoire de Paléontologie, Phylogénie et Paléobiologie, Institut des Sciences de l'Evolution, UMR 5554-CNRS, Université
- 
  C3020 – Molecular Evolution Exercises #3: Phylogenetics
  Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from the earlier version to make computation easier).
    1   ACAAACAGTT CGATCGATTT GCAGTCTGGG
    2   ACAAACAGTT TCTAGCGATT GCAGTCAGGG
    3   ACAGACAGTT CGATCGATTT GCAGTCTCGG
    4   ACTGACAGTT CGATCGATTT GCAGTCAGAG
    5   ATTGACAGTT CGATCGATTT GCAGTCAGGA
    O   TTTGACAGTT CGATCGATTT GCAGTCAGGG
  1. Make a distance matrix using raw distances (number of differences) for the five ingroup sequences.
      |  1   2   3   4   5
    1 |  -
    2 |  9   -
    3 |  2  11   -
    4 |  4  12   4   -
    5 |  5  13   5   3   -
  2. Infer the UPGMA tree for these sequences from your matrix. Label the branches with their lengths.
                                     /------------1------------- 1
                          /---1.25---|
                          |          \------------1------------- 3
    /---------3.375-------|
    |                     |          /------------1.5----------- 4
    |                     \---0.75---|
    |                                \------------1.5----------- 5
    \---------5.625--------------------------------------------- 2
  To derive this tree, start by clustering the two species with the lowest pairwise difference (1 and 3) and apportion the distance between them equally on the two branches leading from their common ancestor to the two taxa. Treat them as a single composite taxon, with distances from this taxon to any other species equal to the mean of the distances from that species to each of the species that make up the composite (the use of arithmetic means explains why some branch lengths are fractions). Then group the next most similar pair of taxa (here it is 4 and 5) and apportion the distance equally. Repeat until you have the whole tree and its lengths. 2 is most distant from all other taxa and composite taxa, so it must be the sister to the clade of the other four taxa.
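The clustering procedure described in this exercise can be sketched in Python as follows. This is a minimal illustration, not the course's own code: the function name `upgma` and the data layout are assumptions, and the input is the distance matrix exactly as given in the exercise answer. Distances between composite taxa are taken as the arithmetic mean over all pairs of original taxa, and each merge is placed at half the mean distance.

```python
from itertools import combinations

def upgma(labels, d):
    """UPGMA sketch. d maps frozenset({a, b}) to the raw distance between taxa a and b.

    Returns a Newick string with branch lengths and the list of
    (members, node height) recorded at each merge.
    """
    def mean_dist(c1, c2):
        # Arithmetic mean over all pairs of original taxa in the two clusters.
        pairs = [(a, b) for a in c1[0] for b in c2[0]]
        return sum(d[frozenset(p)] for p in pairs) / len(pairs)

    # Each cluster is (member taxa, node height, Newick text).
    clusters = [((x,), 0.0, x) for x in labels]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: mean_dist(*p))
        height = mean_dist(c1, c2) / 2  # split the distance equally
        newick = "({}:{:g},{}:{:g})".format(
            c1[2], height - c1[1], c2[2], height - c2[1])
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append((c1[0] + c2[0], height, newick))
        merges.append((c1[0] + c2[0], height))
    return clusters[0][2] + ";", merges

# Lower-triangular distance matrix copied from the exercise answer.
labels = ["1", "2", "3", "4", "5"]
lower = {"2": [9], "3": [2, 11], "4": [4, 12, 4], "5": [5, 13, 5, 3]}
d = {frozenset((row, labels[j])): v
     for row, vals in lower.items() for j, v in enumerate(vals)}

newick, merges = upgma(labels, d)
print(newick)
```

The merge heights come out as 1, 1.5, 2.25 and 5.625, which give the branch lengths 1, 1.5, 1.25, 0.75, 3.375 and 5.625 shown on the tree.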
- 
  Characters and Parsimony Analysis
  Introduction to characters and parsimony analysis.
  Genetic relationships: genetic relationships exist between individuals within populations. These include ancestor-descendant relationships and more indirect relationships based on common ancestry. Within sexually reproducing populations there is a network of relationships. Genetic relations within populations can be measured with a coefficient of genetic relatedness.
  Phylogenetic relationships: phylogenetic relationships exist between lineages (e.g. species, genes). These include ancestor-descendant relationships and more indirect relationships based on common ancestry. Phylogenetic relationships between species or lineages are (expected to be) tree-like. Phylogenetic relationships are not measured with a simple coefficient. Traditionally, phylogeny reconstruction was dominated by the search for ancestors and ancestor-descendant relationships. In modern phylogenetics there is an emphasis on indirect relationships. Given that all lineages are related, closeness of phylogenetic relationships is a relative concept. Two lineages are more closely related to each other than to some other lineage if they share a more recent common ancestor; this is the cladistic concept of relationships. Phylogenetic hypotheses are hypotheses of common ancestry. [Figure: Frog, Toad and Oak with a hypothetical ancestral lineage, written (Frog,Toad)Oak.]
  Phylogenetic trees. [Figure: a cladogram with leaves A to J, terminal branches, interior branches, node 1, node 2, a polytomy and the root.] Cladograms and phylograms. [Figure: trees scaled by relative time and by absolute time or divergence.] Trees, rooted and unrooted. [Figure: rooted and unrooted versions of the tree on A to J.]
  Characters and character states: organisms comprise sets of features. When organisms/taxa differ with respect to a feature (e.g.
- 
  Molecular Phylogenetics
  Module 12: Molecular Phylogenetics. http://evolution.gs.washington.edu/sisg/2014/ MTH. Thanks to Paul Lewis, Tracy Heath, Joe Felsenstein, Peter Beerli, Derrick Zwickl, and Joe Bielawski for slides.
  Monday July 14, Day I. 8:30AM to 10:00AM: Introduction (Mark Holder); Parsimony methods for phylogeny reconstruction (Mark Holder); Distance-based methods for phylogeny reconstruction (Mark Holder). 10:30AM to noon: Topology Searching (Mark Holder); Parsimony and distances demo in PAUP* (Mark Holder). 1:30PM to 3:00PM: Nucleotide Substitution Models and Transition Probabilities (Jeff Thorne); Likelihood (Joe Felsenstein). 3:30PM to 5:00PM: PHYLIP lab: likelihood (Joe Felsenstein); PAUP* lab (Mark Holder).
  Tuesday July 15, Day II. 8:30AM to 10:00AM: Bootstraps and Testing Trees (Joseph Felsenstein); Bootstrapping in Phylip (Joe Felsenstein). 10:30AM to noon: More Realistic Evolutionary Models (Jeff Thorne). 1:30PM to 3:00PM: Bayesian Inference and Bayesian Phylogenetics (Jeff Thorne). 3:30PM to 5:00PM: MrBayes Computer Lab (Mark Holder). 5:00PM to 6:00PM: Tutorial (questions and answers session).
  Wednesday July 16, Day III. 8:30AM to 10:00AM: Divergence Time Estimation (Jeff Thorne); BEAST demo (Mark Holder). 10:30AM to noon: The Coalescent (Joe Felsenstein); The Comparative Method (Joe Felsenstein); Future Directions (Joe Felsenstein).
  Darwin's 1859 "On the Origin of Species" had one figure. Human family tree from Haeckel, 1874 (Fig. 20, p. 171, in Gould, S. J. 1977. Ontogeny and phylogeny. Harvard University Press, Cambridge, MA). Are desert green algae adapted to high light intensities?
    Species   Habitat       Photoprotection
    1         terrestrial   xanthophyll
    2         terrestrial   xanthophyll
    3         terrestrial   xanthophyll
    4         terrestrial   xanthophyll
    5         terrestrial   xanthophyll
    6         aquatic       none
    7         aquatic       none
    8         aquatic       none
    9         aquatic       none
    10        aquatic       none
  Phylogeny reveals the events that generate the pattern: 1 pair of changes.
- 
  Short Branch Attraction, the Fundamental Bipartition in Cellular Life, and Eukaryogenesis
  Amanda A. Dick, PhD. University of Connecticut, OpenCommons@UConn, Doctoral Dissertations, University of Connecticut Graduate School, 12-16-2016. Recommended citation: Dick, Amanda A. PhD, "Short Branch Attraction, the Fundamental Bipartition in Cellular Life, and Eukaryogenesis" (2016). Doctoral Dissertations. 1479. https://opencommons.uconn.edu/dissertations/1479
  Abstract: Short Branch Attraction is a phenomenon that occurs when BLAST searches are used as a surrogate method for phylogenetic analysis. This results from branch length heterogeneity, but it is the short branches, not the long, that are attracting. The root of the cellular tree of life is on the bacterial branch, meaning the Archaea and eukaryotic nucleocytoplasm form a clade. Because this split is the first in the cellular tree of life, it represents a taxonomic ranking higher than the domain, the realm. I name the clade containing the Archaea and eukaryotic nucleocytoplasm the Ibisii based on shared characteristics having to do with information processing and translation. The Bacteria are the only known members of the other realm, which I call the Bacterii. Eukaryogenesis is the study of how the Eukarya emerged from a prokaryotic state. The beginning state of the process is represented by the relationship between Eukarya and their closest relative, the Archaea. The ending state is represented by the location of the root within the Eukarya.
- 
  Long-Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods
  Sébastien Roch (Department of Mathematics, University of Wisconsin–Madison, 480 Lincoln Dr, Madison WI 53706), Michael Nute (Department of Statistics, The University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign IL 61820), Tandy Warnow (Department of Computer Science, The University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana IL 61801-2302). March 8, 2018. arXiv:1803.02800v1 [q-bio.PE] 7 Mar 2018.
  Abstract: With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, "gene tree heterogeneity", which is when different genomic regions can evolve differently, is a common phenomenon in multi-locus datasets, and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be "statistically consistent"). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained.
- 
  Hennig + Characters
  Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS", Spring 2010, University of California, Berkeley. B.D. Mishler, Jan. 26, 2010. The Hennig Principle: homology; synapomorphy; rooting issues; character analysis -- what is a data matrix?
  I. Introduction. Genealogical relationships themselves are invisible, so how can we know them? Is there an objective, logically sound method by which one can reconstruct the tree of life? Recent advances in theories and methods for phylogenetic reconstruction, along with copious new data from the molecular level, have made possible a new scientific understanding of the relationships of organisms. This understanding of relationships has led in turn to improved taxonomic classifications as well as a wealth of comparative methods for testing biogeographic, ecological, behavioral, and other functional hypotheses.
  II. The Hennig Principle. The fundamental idea driving recent advances in phylogenetics is known as the Hennig Principle, and is as elegant and fundamental in its way as was Darwin's principle of natural selection. It is indeed simple, yet profound in its implications. It is based on the idea of homology, one of the most important concepts in systematics, but also one of the most controversial. What does it mean to say that two organisms share the same characteristic? The modern concept is based on evidence for historical continuity of information; homology would then be defined as a feature shared by two organisms because of descent from a common ancestor that had that feature (more on homology below). Hennig's seminal contribution was to note that in a system evolving via descent with modification and splitting of lineages, characters that changed state along a particular lineage can serve to indicate the prior existence of that lineage, even after further splitting occurs.
- 
  The Concatenation Question
  David Bryant, Matthew W. Hahn. The Concatenation Question. In: Scornavacca, Celine; Delsuc, Frédéric; Galtier, Nicolas (eds.), Phylogenetics in the Genomic Era, No commercial publisher | Authors open access book, pp.3.4:1–3.4:23, 2020. hal-02535651. HAL Id: hal-02535651, https://hal.archives-ouvertes.fr/hal-02535651, submitted on 10 Apr 2020. Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License. Chapter 3.4, The Concatenation Question. David Bryant, Department of Mathematics and Statistics, University of Otago, P.O. Box 56, Dunedin 9054, New Zealand. Matthew W. Hahn, Department of Biology, Department of Computer Science, Indiana University, Bloomington IN 47405, USA.
  Abstract: Gene tree discordance is now recognized as a major source of biological heterogeneity. How to deal with this heterogeneity is an unsolved problem, as the accurate inference of individual gene tree topologies is difficult. One solution has been to simply concatenate all of the data together, ignoring the underlying heterogeneity. Another approach infers gene tree topologies separately and combines the individual estimates in order to explicitly model this heterogeneity.
- 
  The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Philippe Lemey, Marco Salemi, and Anne-Mieke Vandamme (Eds.)
  5 Phylogenetic inference based on distance methods. THEORY. Yves Van de Peer. 5.1 Introduction. In addition to maximum parsimony (MP) and likelihood methods (see Chapters 6, 7 and 8), pairwise distance methods form the third large group of methods to infer evolutionary trees from sequence data (Fig. 5.1). In principle, distance methods try to fit a tree to a matrix of pairwise genetic distances (Felsenstein, 1988). For every two sequences, the distance is a single value based on the fraction of positions in which the two sequences differ, defined as p-distance (see Chapter 4). The p-distance is an underestimation of the true genetic distance because some of the nucleotide positions may have experienced multiple substitution events. Indeed, because mutations are continuously fixed in the genes, there has been an increasing chance of multiple substitutions occurring at the same sequence position as evolutionary time elapses. Therefore, in distance-based methods, one tries to estimate the number of substitutions that have actually occurred by applying a specific evolutionary model that makes particular assumptions about the nature of evolutionary changes (see Chapter 4). When all the pairwise distances have been computed for a set of sequences, a tree topology can then be inferred by a variety of methods (Fig. 5.2). Correct estimation of the genetic distance is crucial and, in most cases, more important than the choice of method to infer the tree topology. Using an unrealistic evolutionary model can cause serious artifacts in tree topology, as previously shown
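To make the point about underestimation concrete, the sketch below computes a p-distance and then applies one possible model-based correction. The choice of the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), is an assumption made here for illustration (the excerpt above does not name a specific model), and the two sequences are invented.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed p-distance.

    Assumes equal base frequencies and equal substitution rates; the
    correction is undefined once p reaches 3/4 (saturation).
    """
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical aligned sequences differing at 4 of 20 positions.
a = "ACGTACGTACGTACGTACGT"
b = "ACGTACGAACGTTCGTACCA"

p = p_distance(a, b)
d = jukes_cantor(p)
print(p, round(d, 4))
```

The corrected value is larger than the observed p-distance, which is the sense in which the uncorrected fraction of differing sites underestimates the number of substitutions per site.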