Computational Problems in Evolution
Multiple Alignment, Genome Rearrangements, and Tree Reconstruction

ISAAC ELIAS

Doctoral Thesis
Stockholm, Sweden 2006

TRITA CSC-A 2006-22
ISSN 1653-5723
ISRN KTH/CSC/A--06/22--SE
ISBN 91-7178-511-6
ISBN 978-91-7178-511-4

KTH School of Computer Science and Communication
SE-100 44 Stockholm
SWEDEN

Academic dissertation which, with the permission of the Royal Institute of Technology (Kungl Tekniska högskolan), is submitted for public examination for the degree of Doctor of Technology in Computer Science on Monday, 15 January 2007, at 10:00 in FA32, Albanova, Roslagstullsbacken 21, Stockholm.

© Isaac Elias, Nov 2006
Printed by: Universitetsservice US AB

Abstract

Reconstructing the evolutionary history of a set of species is a fundamental problem in biology. This thesis concerns computational problems that arise in different settings and stages of phylogenetic tree reconstruction, but also in other contexts. The contributions include:

• A new distance-based tree reconstruction method with optimal reconstruction radius and optimal runtime complexity. Included in the result is a greatly simplified proof that the NJ algorithm also has optimal reconstruction radius. (co-author Jens Lagergren)

• NP-hardness results for the most common variations of Multiple Alignment. In particular, it is shown that SP-score, Star Alignment, and Tree Alignment are NP-hard for all metric symbol distances over all binary or larger alphabets.

• A 1.375-approximation algorithm for Sorting By Transpositions (SBT). SBT is the problem of sorting a permutation using as few block-transpositions as possible. The complexity of this problem is still open, and improving the best known approximation ratio of 1.5 had been an open problem for ten years. The 1.375-approximation algorithm is based on a new upper bound on the diameter of 3-permutations. Moreover, a new lower bound on the transposition diameter of the symmetric group is presented and the exact transposition diameter of simple permutations is determined.
(co-author Tzvika Hartman)

• Approximation, fixed-parameter tractable, and fast heuristic algorithms for two variants of the Ancestral Maximum Likelihood (AML) problem: when the phylogenetic tree is known and when it is unknown. AML is the problem of reconstructing the most likely genetic sequences of extinct ancestors along with the most likely mutation probabilities on the edges, given the phylogenetic tree and sequences at the leaves. (co-author Tamir Tuller)

• An algorithm for computing the number of mutational events between aligned DNA sequences which is several hundred times faster than the famous PHYLIP package. Since pairwise distance estimation is a bottleneck in distance-based phylogeny reconstruction, the new algorithm improves the overall running time of many distance-based methods by a factor of several hundred. (co-author Jens Lagergren)

Contents

1 Introduction
  1.1 Basic Introduction to Computation
  1.2 Basic Biological Background
2 Introduction to Computational Problems
  2.1 Phylogenetic Tree Reconstruction using Distance Methods
  2.2 Tree Reconstruction based on Sequences that have Evolved through Substitutions
  2.3 Multiple Sequence Alignment - Local Point Mutations
  2.4 Genome Rearrangements - Global Mutations
3 Present Investigation
  3.1 Multiple Sequence Alignment is NP-hard
  3.2 Approximating Sorting By Transpositions
  3.3 Fast Neighbor Joining
  3.4 Fast Distance Estimation from Aligned Sequences
  3.5 Ancestral Sequence Reconstruction using Likelihood
  3.6 Conclusions and Open Problems
Bibliography

Acknowledgments

After four and a half years of studying and research I have finally come to the point where I can write down my thanks to all the people that have helped me. I have been part of the theoretical computer science group at KTH and the Stockholm Bioinformatics Center.
Further, I have spent more than a year and a half in Israel, visiting Tel Aviv University and also the Technion. Everywhere I have met people who have shown great kindness and who have helped me with my research, studies, and life.

The person I owe the greatest gratitude to is my advisor Jens Lagergren. Jens, you have given me freedom and always been there to give advice and to guide. You have understood my needs, both professional and personal, summarized them for me, and helped me get to where I am today. Without you this would never have happened. Thank you so very much!

To the people that I have been in contact with in the theory group. Johan Håstad, your classes have been a great inspiration; it has been a joy to have heard such complex subjects. Mikael Goldmann, your positive attitude in the classroom and to problem solving has been a shining example. To the other people, attending seminars and classes with smart and enthusiastic people like you has been a great experience. Thanks!

To the people at the Stockholm Bioinformatics Center. Bengt Sennblad, you always made yourself available to help me understand issues in models of evolution. Without you I would have been lost more than once. Erik Lindahl, your rapid and detailed explanations directly contributed to one of the papers. Ali Tofigh, you have on numerous occasions helped me with C++ and provided me with precise explanations of its workings. Thanks!

To the people in Israel. Benny Chor, without your kindness I would never have come to Israel, where I have found so much happiness. I am very grateful for all your help and shared interest in coffee and sailing. Tzvika Hartman, I remember the first time we sat down doing research together. It was a great experience to make progress with our elusive problem. Tamir Tuller, your exceptional ability to always be "squeezed" while still being happy to do even more research is both impressive and inspiring.
Also special thanks to Ron Pinter for the help during my last visit to Israel.

There are many people who have not been part of my academic life while still being of utmost importance to me. My parents, I love you! You have been there and given me a home away from the front. To the rest of my family and friends, I love you! Finally, wonderful, beautiful, loving Tali, you have given me more happiness than I have ever felt before.

Publications and Organization of the Thesis

This thesis is a summary of the papers listed below; the papers appear after this summary. The results are presented in three chapters. The first chapter contains a basic introduction to computation and biology. This is followed by an introduction to the problems that this thesis is concerned with. The last chapter contains a largely self-contained presentation of most of the results included in the papers below.

• Settling the Intractability of Multiple Alignment [46, 45]
  I. Elias
  Journal of Computational Biology 2006
  Conference version in Int. Symp. on Algorithms and Computation 2003

• A 1.375-Approximation Algorithm for Sorting by Transpositions [48, 47]
  I. Elias and T. Hartman
  To appear in IEEE/ACM Trans. on Comp. Biology and Bioinformatics
  Conference version in Workshop on Algorithms in Bioinformatics 2005

• Fast Neighbor Joining [49]
  I. Elias and J. Lagergren
  Int. Coll. on Automata, Languages and Programming 2005

• Fast Computation of Distance Estimators [50]
  I. Elias and J. Lagergren
  Submitted

• Reconstruction of Ancestral Genomic Sequences Using Likelihood [51]
  I. Elias and T. Tuller
  Submitted

Isaac has contributed at least half of the work in each of the papers.

Chapter 1

Introduction

This is a thesis in computer science and it concerns the interdisciplinary field of computational biology. In this field, techniques from computer science are applied to solve problems inspired by biology.
A central notion in biology is that of models of evolution, which describe the origin and descent of species. The major concern of this thesis is the computational problems that biologists face when tracing evolution under different models. This chapter contains basic background on relevant issues of computation and biology.

1.1 Basic Introduction to Computation

A common problem in evolutionary biology is the reconstruction of the evolutionary history of a set of species, usually represented by a phylogenetic tree. One such example is the reconstruction of the phylogenetic tree of the great apes, see Figure 1.1. There is frequently little ancient information, and instead the evolutionary biologist has to rely on the biological features of the extant species, i.e., living species. Classic evolutionary biology dealt mainly with physical and morphological traits, such as the shape and size of the skull, while modern evolutionary biology mainly uses information extracted from genetic material, such as DNA sequences.

[Figure 1.1: The phylogeny of the great apes. Leaves: Orangutan, Gorilla, Human, Chimpanzee.]

The underlying assumption when reconstructing phylogenetic trees is that two species with a close common ancestor are more similar than two species with a more remote common ancestor. With regard to DNA sequences, similarity is measured with respect to a model of evolution, which describes the types of mutations that cause the sequences to change over time. A simple measure of similarity between two sequences is the number of mutations needed to transform one sequence into the other. In this variant, two species are considered similar if a small number of mutations can explain the differences in their DNA sequences.

Once a model of evolution has been chosen, the problem of reconstructing the correct tree becomes purely computational. With respect to the simple model above, the evolutionary biologist has to find the tree that explains the extant species using the least number of mutations. Such computational problems are called optimization problems. The objective, for this particular problem, is to find the tree that minimizes the number of mutations.
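
To make the optimization problem above concrete, the following is a minimal sketch under strong simplifying assumptions: the sequences are already aligned, have equal length, and change only through substitutions. The function names (`hamming_distance`, `fitch_score`), the nested-tuple tree encoding, and the toy sequences are hypothetical and chosen only for illustration; none of them come from the thesis or its papers. Scoring a given tree uses Fitch's small parsimony algorithm, a standard textbook technique that is swapped in here as an example of "counting the mutations a tree needs".

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair (left subtree, right subtree).
Tree = Union[str, tuple]


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def _fitch_site(node: Tree, seqs: Dict[str, str], site: int) -> Tuple[Set[str], int]:
    """Return (candidate ancestral states, substitutions) for one column."""
    if isinstance(node, str):
        return {seqs[node][site]}, 0
    left_states, left_cost = _fitch_site(node[0], seqs, site)
    right_states, right_cost = _fitch_site(node[1], seqs, site)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no extra substitution needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is charged at this node.
    return left_states | right_states, left_cost + right_cost + 1


def fitch_score(tree: Tree, seqs: Dict[str, str]) -> int:
    """Minimum number of substitutions needed to explain seqs on this tree."""
    length = len(next(iter(seqs.values())))
    return sum(_fitch_site(tree, seqs, site)[1] for site in range(length))


# Toy data (invented for illustration, not real ape DNA).
toy_seqs = {
    "Human": "ACGT",
    "Chimpanzee": "ACGA",
    "Gorilla": "ACCA",
    "Orangutan": "TCCA",
}

# Topology of Figure 1.1 and one alternative candidate topology.
figure_tree = ("Orangutan", ("Gorilla", ("Human", "Chimpanzee")))
other_tree = (("Human", "Orangutan"), ("Chimpanzee", "Gorilla"))

if __name__ == "__main__":
    print(hamming_distance(toy_seqs["Human"], toy_seqs["Chimpanzee"]))
    print(fitch_score(figure_tree, toy_seqs))
    print(fitch_score(other_tree, toy_seqs))
```

On this toy input the topology of Figure 1.1 needs fewer substitutions than the alternative, so it would be preferred under the simple model. Note that the sketch only scores trees that are handed to it; the optimization problem in the text additionally requires searching over all candidate trees.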