Dissfernando.Pdf

Zusammenfassung (Summary)

Phylogenetic trees represent the evolutionary relationships among different organisms (species) that presumably share common ancestors. The outer nodes of a phylogenetic tree (taxa) denote living species for which molecular data are available or can be sequenced. The inner nodes denote hypothetical extinct species, for which generally no data source is available.

In recent years the field has changed dramatically with the introduction of so-called NGS (Next-Generation Sequencing) methods. These methods, which established themselves quickly, allow for increased throughput. As a result, current molecular databases are growing faster than predicted by Moore's law. This creates the challenge of analyzing such large volumes of molecular data efficiently.

Maximum-likelihood tree reconstruction attempts to find the tree with the highest likelihood score for a given sequence alignment (the input). From a computer science perspective, the challenge lies in efficiently computing the Phylogenetic Likelihood Function (PLF), which is used to score alternative topologies.

In this thesis we investigate and develop methods that can make this task easier for large-scale datasets. We present three different approaches: reducing the memory requirements of the PLF, reducing run times, and automating phylogenetic analyses.

Reducing the memory requirements of the PLF: We developed three methods that can reduce the memory required for intermediate results. The (i) out-of-core and (ii) recomputation methods have in common that only a part of the intermediate results needs to be held in main memory. The remaining intermediate results are either stored on disk (out-of-core) or computed again (recomputation trade-off). Our evaluations indicate that the recomputation approach (trade-off) is considerably more efficient, since for typical datasets it halves the memory requirements at the cost of a 10% increase in run time. The third method is called SEV (Subtree Equality Vectors) and reduces the memory requirements almost proportionally to the fraction of missing data in the dataset, without increasing the run time. In general, these methods can be applied simultaneously.

Reducing run times: We ported the PLF implementation to GPUs using OpenCL. In doing so, we adapted the memory layout to achieve optimal GPU performance. Our evaluations show that, for sufficiently long datasets, offloading computations to the GPU can be up to twice as fast as the most efficiently vectorized CPU code. From an algorithmic perspective, we developed the so-called Backbone Algorithm, which can restrict the search space of all possible trees. Applying this method can reduce the run time by up to 50% while yielding trees that are statistically comparable to those found without search space restrictions.

Automating phylogenetic analyses: The last part of this thesis addresses the problem that existing phylogenetic trees soon cease to reflect the most recent data in the rapidly growing databases. We developed a framework named PUmPER (Phylogenies Updated PERpetually). PUmPER can be used to build pipelines that iteratively assemble new sequence alignments and extend the trees from previous iterations. The framework can either be configured as a stand-alone pipeline or run compute-intensive tasks on a cluster.

Although the methods described in this thesis were implemented as a proof of concept within the RAxML codebase, they are nonetheless relatively easy to implement in other state-of-the-art likelihood-based codes. In general, we expect these methods to help current phylogenetic applications scale better and thereby also to enable phylogenetic analyses based on complete genomes.

Acknowledgements

This research project would not have been possible without the support of many people. I wish to express my gratitude to my supervisor, Prof. Dr. Alexandros Stamatakis, who was abundantly helpful and offered invaluable assistance, support and dedicated guidance.
Deepest gratitude also to Prof. Dr. Arndt von Haeseler, without whose knowledge and assistance this study would not have been successful. In addition, I would like to express my sincere gratitude to Casey Dunn, Stephen A. Smith, and John Cazes for kindly hosting me at their lab during my research stay at Brown University. Special thanks also to all my collaborators and fellow group members: Nikos Alachiotis, Simon Berger, Andre Aberer, Pavlos Pavlidis, Solon P. Pissis, Tomas Flouri, Dilrini de Silva, Paschalia Kapli, Kassian Kobert, Jiajie Zhang and Alexey Kozlov, for being an always-collaborative and helpful group.

I would like to thank my parents, brothers and family for their unconditional support, as well as my friends from Valladolid and Deggendorf, who have always stood by me during my time in Heidelberg. Last but not least, my dearest thanks to Beifei for being an endless source of encouragement and inspiration.

This work was funded via the German Science Foundation (DFG) grants STA 860/2 and STA 860/3. Part of this work used resources provided by the iPlant Collaborative (funded by NSF grant #DBI-0735191).

Contents

Zusammenfassung
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Scientific Contribution
  1.3 Structure of this thesis
2 Computational Molecular Phylogenetics
  2.1 Statistical Models of Evolution
  2.2 Sequence Alignment
    2.2.1 Pairwise Sequence Alignment
    2.2.2 Multiple Sequence Alignment
  2.3 Phylogenetic Trees
  2.4 Tree Reconstruction Methods
  2.5 Computing the Likelihood of a Tree
    2.5.1 Accounting for rate heterogeneity
  2.6 Maximum Likelihood Tree Search
  2.7 Phylogenetic Likelihood Library
  2.8 RAxML-family implementation concepts
    2.8.1 Node records
    2.8.2 Internal tree nodes
    2.8.3 Data structure for Trees
    2.8.4 Computing the likelihood on a Tree
    2.8.5 Optimizing branch lengths
3 Memory-Saving Techniques
  3.1 Memory requirements for the PLF
  3.2 Out-of-Core
    3.2.1 Related Work
    3.2.2 Computing the PLF Out-of-Core
    3.2.3 Experimental Setup & Results
  3.3 Recomputation of Ancestral Vectors
    3.3.1 Recomputation of Ancestral Probability Vectors
    3.3.2 Experimental Setup and Results
    3.3.3 Evaluation of traversal overhead
  3.4 Subtree Equality Vectors
    3.4.1 Gappy Subtree Equality Vectors
    3.4.2 Generation of Biological Test Datasets
    3.4.3 SEV Performance
  3.5 Summary
4 The Backbone Algorithm
  4.1 Constraining tree search to a backbone tree
  4.2 Algorithm
    4.2.1 Starting tree
    4.2.2 Tip Clustering
    4.2.3 Backbone construction
    4.2.4 Backbone-constrained Tree Search
  4.3 Evaluation and Results
    4.3.1 Performance
    4.3.2 Simulated Datasets (Accuracy)
  4.4 Summary
5 Introduction to GPU Programming
  5.1 Overview
  5.2 CUDA
    5.2.1 CUDA Hardware and Architecture
    5.2.2 CUDA Programming Model
    5.2.3 Performance Considerations
  5.3 OpenCL
    5.3.1 OpenCL performance portability
6 GPU implementation of Phylogenetic Kernels
  6.1 Related work
  6.2 Generic Vectorization
  6.3 GPU Implementation
    6.3.1 Kernel Implementation
    6.3.2 GPU Memory Organization
    6.3.3 OpenCL Implementation
  6.4 Experimental setup and results
  6.5 Summary
7 Perpetual Phylogenies with PUmPER
  7.1 Related Work
  7.2 Framework Overview
    7.2.1 MSA Construction/Extension with PHLAWD
    7.2.2 Phylogenetic Inference
    7.2.3 Manual and automatic tree updates
  7.3 Software and Availability
    7.3.1 Standalone implementation
    7.3.2 Distributed implementation
    7.3.3 Custom iPlant setup
  7.4 Evaluation and Results
    7.4.1 Biological examples
    7.4.2 Simulated Data
  7.5 Discussion
  7.6 Summary
8 Conclusion and Outlook
  8.1 Conclusion
  8.2 Future Work
    8.2.1 GPUs
    8.2.2 PUmPER
  8.3 Outlook
List of Figures
List of Tables
List of Acronyms
Bibliography

Chapter 1
Introduction

1.1 Motivation

The landscape of genomics and phylogenetics has changed dramatically since the arrival of Next-Generation Sequencing (NGS) technologies [75]. As a consequence of the increases in throughput, the amount of data accumulated in molecular databases keeps growing