Alignment-Free Phylogeny Reconstruction Based on Quartet Trees
Dissertation for the award of the degree “Doctor rerum naturalium” of the Georg-August-Universität Göttingen within the doctoral program “Computer Science” (PCS) of the Georg-August-University School of Science (GAUSS)

submitted by Thomas Dencker from Nordenham

January 27, 2020

Members of the Thesis Committee

Prof. Dr. Burkhard Morgenstern (1st Referee), Institute for Microbiology and Genetics, Georg-August-Universität Göttingen
Prof. Dr. Stephan Waack (2nd Referee), Institute for Computer Science, Georg-August-Universität Göttingen
Dr. Johannes Söding, Quantitative and Computational Biology, Max-Planck Institute for Biophysical Chemistry Göttingen

Further Members of the Examination Board

Prof. Dr. Christoph Bleidorn, Johann-Friedrich-Blumenbach Institute for Zoology & Anthropology, Georg-August-Universität Göttingen
Prof. Dr. Anja Sturm, Institute for Mathematical Stochastics, Georg-August-Universität Göttingen
Prof. Dr. Jan de Vries, Institute for Microbiology and Genetics, Georg-August-Universität Göttingen

Date of the oral examination: March 4, 2020

Affidavit

I hereby confirm that this thesis has been written independently and with no other sources and aids than quoted.

Göttingen, January 27, 2020
Thomas Dencker

Acknowledgements

At this point, I would like to thank everyone who has accompanied and supported me over the past years.

First and foremost, I thank my doctoral supervisor, Prof. Dr. Burkhard Morgenstern, for supervising this dissertation and for his continued support ever since my bachelor's thesis. Without him, I would never have ended up in bioinformatics. I am especially grateful for the many pieces of advice that helped me improve my presentation skills in particular. He also gave me the opportunity to attend RECOMB-CG in Canada and to gain many new experiences. I will remember our time together fondly for a long time.

I also thank Dr. Söding and Prof. Dr. Waack for their constructive criticism in the committee meetings. Furthermore, I would like to thank Dr. Peter Meinecke, Heiner Klingenberg, Lars Hahn, and Matthias Blanke for the many helpful discussions and the good feedback on my presentations. I am particularly grateful to my friend and former colleague Chris Leimeister for always standing by me with good advice and for the many helpful conversations that motivated me again and again. My very special thanks also go to my parents, who supported me throughout my studies and always believed in me.

Contents

1 Introduction
1.1 Sequence alignment
1.2 Tree building methods
1.3 Supertree methods
1.4 Alignment-free methods
1.5 Filtered spaced word matches
1.6 Objectives of this thesis and overview
2 ‘Multi-SpaM’: a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees
3 Insertions and deletions as a phylogenetic signal in an alignment-free context
4 Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points
5 Further improvements and experiments
5.1 SH-like support values
5.1.1 Correctness and RF
5.2 Neighbor-Joining
5.2.1 Test results
5.3 Bootstrapping
6 Discussion
6.1 Alignment-free methods based on micro-alignments
6.2 Limitations
7 Conclusion and outlook
7.1 Possible improvements
7.2 Further applications

Publication List

Journal Papers

• Thomas Dencker, Chris-André Leimeister, Michael Gerth, Christoph Bleidorn, Sagi Snir, Burkhard Morgenstern. ‘Multi-SpaM’: a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genomics and Bioinformatics, 2:lqz013, 2020.
• Thomas Dencker, Niklas Birth, and Burkhard Morgenstern.
Insertions and deletions as a phylogenetic signal in an alignment-free context. Prepared for submission.
• Chris-Andre Leimeister, Thomas Dencker, and Burkhard Morgenstern. Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points. Bioinformatics, 35(2):211–218, 2019.
• Andrzej Zielezinski, Hani Z Girgis, Guillaume Bernard, Chris-Andre Leimeister, Kujin Tang, Thomas Dencker, Anna Katharina Lau, Sophie Röhling, Jae Jin Choi, Michael S Waterman, Matteo Comin, Sung-Hou Kim, Susana Vinga, Jonas S Almeida, Cheong Xin Chan, Benjamin T James, Fengzhu Sun, Burkhard Morgenstern, Wojciech M Karlowski. Benchmarking of alignment-free sequence comparison methods. Genome Biol., 20(1):144, 2019.
• Sophie Röhling, Alexander Linne, Jendrik Schellhorn, Morteza Hosseini, Thomas Dencker and Burkhard Morgenstern. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLOS ONE, in press, 2020.

Conference Papers (peer reviewed)

• Thomas Dencker, Chris-Andre Leimeister, Michael Gerth, Christoph Bleidorn, Sagi Snir, and Burkhard Morgenstern. Multi-SpaM: a Maximum-Likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. In Comparative Genomics, pages 227–241. Springer International Publishing, 2018.

Abstract

Traditional methods for phylogeny reconstruction are based on multiple sequence alignments and character-based methods. This combination of computationally expensive methods leads to very accurate results, but it is ill-suited to handle the enormous amount of sequence data that is available today. As a consequence, very fast alignment-free methods have been developed. These methods calculate pairwise distances in order to build phylogenetic trees. However, current alignment-free methods are generally less accurate than traditional methods.

In this thesis, I developed Multi-SpaM, a novel alignment-free approach that tries to combine the best of both worlds. This method quickly finds small gap-free ‘micro-alignments’, so-called blocks, involving four sequences. A binary pattern defines the positions at which the nucleotides have to match. At the remaining don’t care positions, the possibly mismatching nucleotides are first used to remove random matches with a filtering procedure previously introduced in Filtered Spaced-Word Matches (FSWM). Then, the character-based method RAxML is used to find the optimal quartet tree for each block. Subsequently, all quartet trees are amalgamated into a supertree with Quartet MaxCut. This approach can be used to build phylogenetic trees of high quality.

Furthermore, I showed several ways in which Multi-SpaM could be improved. The distances between two adjacent blocks involving the same four sequences can be used to identify putative insertions and deletions from which accurate quartet trees can be derived. These trees could be used both on their own and in combination with the quartet trees produced by Multi-SpaM to build or improve phylogenetic trees using Quartet MaxCut. As an alternative, we also used Maximum-Parsimony to infer accurate phylogenies from these putative insertions and deletions. In other experiments, I tried to weight the individual quartet trees based on SH-like support values and tried to use Neighbor-Joining in order to speed up Multi-SpaM.

Moreover, I contributed to another extension of the FSWM approach. Here, we used these matches as anchor points for a genome alignment tool called mugsy. We found that a higher number of homologous pairs could be aligned for more distantly related species in comparison to other anchor points used with the same alignment program.

1 Introduction

Evolution is a central concept in biology.
A formal theory of evolution was first formulated by Charles Darwin in his book “On the Origin of Species” [23]. We now believe that all species [89] ultimately evolved from a common ancestor [137]. The evolutionary history of a group of species is called a phylogeny. It is commonly visualized as a phylogenetic tree. In some cases, the phylogeny is depicted as a phylogenetic network [58] to take even more complex relationships into account. Reconstructing phylogenies is a fundamental task in the life sciences. Ultimately, one major goal is to identify the evolutionary relationships between all species and thus reconstruct the tree of life [53].

For a long time, relationships between species were determined purely by morphological features such as bone structures. Based on these features, species can be assigned to certain taxa. Taxa are groups of species on different levels (domain, kingdom, phylum, class, order, family, genus, and species) and can be thought of as subtrees of the tree of life. Reconstructing phylogenies based on morphological features can be a challenging task. One problem is that a feature can appear independently in different taxa. The ability to fly was developed by both birds and bats and is a common example of convergent evolution. Such a feature is also called a homoplasy [8]. In contrast, features are called homologies when they exist due to divergent evolution, i.e. they were inherited from a shared ancestor. A correct phylogeny has to be reconstructed from sufficiently many homologies. However, the morphological approach reaches its limits when homologies are hard to find. For example, for certain taxa, such as closely related bacteria, it is difficult to make out any differences. In such a case, the genetic material can provide