Bias in Phylogenetic Estimation and Its Relevance to the Choice


Syst. Biol. 50(4):525–539, 2001

Bias in Phylogenetic Estimation and Its Relevance to the Choice between Parsimony and Likelihood Methods

DAVID L. SWOFFORD,1,6 PETER J. WADDELL,2 JOHN P. HUELSENBECK,3 PETER G. FOSTER,1,7 PAUL O. LEWIS,4 AND JAMES S. ROGERS5

1 Laboratory of Molecular Systematics, National Museum of Natural History, Smithsonian Institution Museum Support Center, 4210 Silver Hill Road, Suitland, Maryland 20746, USA; E-mail: [email protected]
2 Institute of Molecular BioSciences, Massey University, Palmerston North, New Zealand; E-mail: [email protected]
3 Department of Biology, University of Rochester, Rochester, New York 14627, USA; E-mail: [email protected]
4 Department of Ecology and Evolutionary Biology, The University of Connecticut, U-43, 75 N. Eagleville Road, Storrs, Connecticut 06269-30437, USA; E-mail: [email protected]
5 Department of Biological Sciences, University of New Orleans, New Orleans, Louisiana 70148, USA; E-mail: [email protected]
6 Current address: The Natural History Museum, Cromwell Road, London SW7 5BD, U.K.; E-mail: [email protected]
7 Current address: School of Computational Science and Information Technology, Florida State University, Tallahassee, Florida, 32306-4120.

It is now widely recognized that under relatively simple models of stochastic change, phylogenetic inference methods can actively mislead investigators attempting to estimate evolutionary trees from molecular sequences and other data. One instance of this phenomenon is "long-branch attraction," in which some pairs of taxa have a higher probability of sharing the same character state because of parallel or convergent changes along long branches than do taxa that are more closely related because they have retained some same state from a common ancestor. Methods that systematically underestimate the actual amount of divergence may then become statistically inconsistent or "positively misleading" (Felsenstein, 1978; Hendy and Penny, 1989), estimating an incorrect tree with an increasing certainty as the amount of character data increases. Although usually associated with parsimony methods, long-branch attraction can also afflict maximum likelihood and distance analyses when the assumed substitution models of these methods are strongly violated (e.g., Huelsenbeck and Hillis, 1993; Huelsenbeck, 1995; Waddell, 1995:377–404; Gaut and Lewis, 1995; Chang, 1996a; Lockhart et al., 1996; Sullivan and Swofford, 1997). In this case, although the methods are explicitly designed to deal with superimposed substitutions (multiple hits), the underlying models predict fewer of these than actually occur and thus do not go far enough in correcting for the problem. Inconsistency can also arise under parsimony, even when all branches have the same length (Kim, 1996), although in this case there must still be particular imbalances in the total lengths of the paths from internal nodes to tips of the tree; "long-path attraction" would describe this phenomenon.

Long-branch attraction has been widely used, and abused, in justifying choices of methods and in explaining anomalous results. Critics of the relevance of long-branch attraction and related artifacts have generally taken two tacks. The first (e.g., Farris, 1983) claims that the demonstration of long-branch attraction requires simple and unrealistic models of evolutionary change. As pointed out by Kim (1996), this argument lacks force because conditions that lead to inconsistency are much more general and complex than those outlined by Felsenstein (1978); further relaxation of Felsenstein's conditions simply exacerbates the problem. The second line of argument (e.g., Siddall and Kluge, 1997) follows from the fact that "truth" is unknowable in science generally; because it is not possible to be certain that the analysis of a real data set has been compromised by long-branch attraction, the ability of a method to converge, in principle, to the correct solution with increasing amounts of data is irrelevant. In this view, "'accuracy' is rendered empty as an empirical claim" (Siddall and Kluge, 1997:318). Proponents of model-based (or statistical) methods that seek to avoid inconsistency attributable to long-branch or long-path artifacts have not been dissuaded by this argument. They certainly appreciate the elusiveness of "truth" but understand that all methods are susceptible to failure under certain conditions. Consequently, these proponents seek methods and models that will succeed under a wide range of plausible conditions and that are less likely to yield misleading results purely because of artifacts. Historically, the different perspectives have led to a schism between those who would approach phylogenetics from a statistical perspective and those who place strong faith in one particular approach over all others. In many areas of science, the statistical modeling viewpoint tends to become more predominant as a subject matures. However, proponents of model-based methods in phylogenetics have not always helped their case by making overly assertive and sometimes misleading claims about the superiority of these methods (see Sidow, 1994; Hillis et al., 1994).

Against this backdrop of confusing and often acrimonious debate, Siddall (1998) offered a new challenge to the position that considerations of long-branch attraction favor model-based methods. Siddall's position, which seems reasonable at least on the surface, can be summarized simply: Although maximum likelihood and corrected-distance methods outperform parsimony methods in the so-called Felsenstein zone (four-taxon tree with two long, but unrelated, terminal branches and all other branches short), parsimony is better able to infer the correct tree topology in what Siddall calls the Farris zone, where the two long terminal branches instead lead to sister taxa (or are adjacent on an unrooted tree). Thus, if an unrooted phylogeny contains two long (terminal) branches plus three short branches, and the long branches are expected to lead to sister taxa about as often as they lead to nonsister taxa, then one might argue that there is no compelling reason for preferring one method over another on the basis of long-branch attraction. Waddell (1995) too had earlier referred to the Farris zone, calling it the "anti-Felsenstein" zone. Neither of these designations seems entirely appropriate, and we will use the term "inverse-Felsenstein zone" here. Siddall refers to the poor performance of maximum likelihood in the inverse-Felsenstein zone as "long-branch repulsion," a term used by Waddell (1995) for the significantly different problem of performance in the inverse-Felsenstein zone when the model is misspecified (e.g., overcorrection for among-site rate variation). However, we will use this term in Siddall's context for the present purposes.

We, and undoubtedly others, realized long ago that when long-branch attraction favors the correct unrooted tree for four taxa rather than one of the two incorrect trees, parsimony would outperform maximum likelihood in choosing a topology. Parsimony "succeeds" in the inverse-Felsenstein zone because it is a strongly biased method, the direction of the bias favoring the correct tree rather than an incorrect one, in contrast to the situation in the Felsenstein zone. This point was obvious enough not to merit publication on its own, although we have mentioned it in various other contexts (e.g., Swofford et al., 1995; Waddell, 1995). However, the portrayal by Siddall (1998) of this observation as a victory for parsimony methods demands closer scrutiny. Properly interpreted, the results of Siddall's simulations actually support the superiority of model-based methods for dealing with long-branch artifacts just as strongly as did those from earlier studies that concentrated on the Felsenstein zone. We emphasize at the outset, however, that the following analysis is not intended as a general criticism of the parsimony method. Rather, we show that results such as those of Siddall (1998) should not be taken as a vindication of parsimony with respect to one particular problem: sensitivity to long-branch-attraction artifacts.

PERFORMANCE OF MAXIMUM LIKELIHOOD IN THE INVERSE-FELSENSTEIN ZONE

Siddall's (1998) simulation results are summarized in Figure 1. Siddall's branch-length parameters were defined (Siddall, 1998:212) as the "expected percentage change of the ... branches." This refers to the expected percentage of sites for which the nucleotide at one end of a branch (internode or edge) differs from the nucleotide at the other end. To avoid ambiguity, we prefer to call this quantity the expected percentage difference. Under the model used for his simulations, this value, expressed as a proportion p, is a lower bound on the expected number of changes (substitutions) per site including multiple hits, which we will call d. The two measures are related by using the familiar distance equation of Jukes and Cantor (1969):

$$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$

and its inverse:

$$p = \frac{3}{4} - \frac{3}{4}\exp\left(-\frac{4}{3}d\right).$$

Thus, the longest branch length simulated by Siddall, p = 0.75, corresponds to an infinitely long branch, and the next longest, p = 0.675, corresponds to a mean of about 1.7 substitutions per site along the branch.

FIGURE 1. (a) Four-taxon model tree used by Siddall (1998). The probability of a difference in character states between the nodes incident to branches labeled a and b is given by p_a and p_b, respectively. (b) Parameter-space investigated by Siddall, showing relative performance of parsimony and likelihood (under the Jukes–Cantor model) in various regions of this space.

Close examination of Siddall's (1998) simulation results immediately reveals some anomalies. The first involves his claim (1998:213 and his Fig. 4) that parsimony