High Performance Computing on Biological Sequence Aligment
Total Page:16
File Type:pdf, Size:1020Kb
Nom/Logotip de la Universitat on s’ha llegit la tesi High performance computing on biological sequence aligment Miquel Orobitg Cortada Dipòsit Legal: L.290-2013 http://hdl.handle.net/10803/110930 Títol de la tesi està subjecte a una llicència de Reconeixement- NoComercial-CompartirIgual 3.0 No adaptada de Creative Commons (c) any, nom i cognom/s de l'autor/a de la tesi Escola Politècnica Superior Departament d’Informàtica i Enginyeria Industrial High performance computing on biological sequence alignment Memòria presentada per obtenir el grau de Doctor per la Universitat de Lleida per Miquel Orobitg Cortada Dirigida per Dr. Fernando Cores Prado Dr. Fernando Guirado Fernández Universitat de Lleida Programa de doctorat en Enginyeria Lleida, febrer de 2013 Abstract Multiple Sequence Alignment (MSA) is an extremely powerful tool for such important biological applications, as phylogenetic analysis, identification of conserved motifs and domains and structure prediction. The main idea behind MSA is to place protein residues into columns according to selected criteria. These criteria can be the structural, evolutionary, functional or sequence sim- ilarity. The first three criteria are based on biological meaning, but the fourth is not. Then the best match alignment may not exhibit the best biological meaning. The other issue is that MSAs are computationally difficult to calculate, and most formulations of the problem lead to NP-hard optimization problems. MSA computation therefore depends on approximate heuristics or algorithms. Accurate and fast construction of MSAs has been extensively researched in recent years, and a variety of methods have been developed. Nowadays, the new challenges in the genomics era are focused on align- ing thousands, or even hundreds of thousands, of sequences. For this reason, alignment methods must be adapted to avoid being relegated from everyday use. Many MSA methods have introduced High Performance Computing ca- pabilities to take advantage of the new technologies and infrastructures. In particular, there are many approaches that improve alignment performance by parallelizing the whole MSA application. However, these distributed methods exhibit scalability problems when the number of sequences increases, as they are constrained by data dependencies that guide the alignment process. In this thesis, three different approaches to solving or reducing some limita- tions of the MSA methods are proposed. The first of these is aimed at reducing data dependencies to increase the degree of parallelism in the applications and iii improve their scalability. The second one is devoted to reducing the memory requirements of a consistency-based method. And the last one is designed to increase the alignment accuracy by improving the guide tree. From the parallelism point of view, we propose a new guide tree construc- tion algorithm to exploit the degree of parallelism in the final alignment pro- cess in order to resolve the bottleneck of the progressive alignment stage in MSA. This new contribution increases the number of parallel alignments to take advantage of increasing computer resources by balancing the guide tree, but taking biological accuracy properties into account. The results achieved demonstrated that the proposed heuristic is capable of increasing the degree of parallelism in the parallel MSA method, thus reducing the execution time while maintaining the biological quality of the alignment. Regarding the memory requirements, we propose a new approach to reduce the computational requirements of the consistency-based method, T-Coffee. The main goal of the proposal is to reduce the number of consistency data stored. This allows a reduction in the execution time and therefore an improve- ment in the scalability of TC to enable the method to align more sequences. Furthermore, the proposed methodology is focused on achieve a minimal the impact over accuracy of the alignments and it is the user who decides the de- gree of optimization. The results obtained proved that this approach is able to reduce the memory consumption and increase both performance and the number and the length of the sequences that the method can align. In connection with biological accuracy, we propose Multiple Trees Align- ment (MTA). MTA is a new MSA method for aligning multiple guide-tree variations for the same input sequences, evaluating the alignments obtained and selecting the best one as a result. MTA is implemented in parallel to pro- vide a good compromise between time and accuracy. Besides, this approach could be applied to any progressive alignment method that accepts a guide tree as an input parameter, and the resulting alignments could be evaluated with any scoring function. Furthermore, we also propose two meta-score metrics, obtained through the combination of single ones. These metrics, which are de- vised using evolutionary algorithms, are able to improve the single ones. The study of different alignment scoring metrics for the selection of the best align- ment proved that in general the best ones are the two proposed meta-scores, followed by the ones that use structural information. Finally, the results ob- tained demonstrated that MTA with Clustal-W and T-Coffee considerably improves the quality of the alignments. Acknowledgements I would like to acknowledge everybody who supported me during the course of this thesis. First and foremost, I want express my gratitude to my supervisors, Dr. Fernando Cores Prado and Dr. Fernando Guirado Fernández, for their continued encouragement and invaluable suggestions throughout this work. They have patiently supervised every little issue, always guiding me in the right direction. Without their help, I could not have finished my thesis successfully. Special thanks are due to all the seniors from the Group of Distributed Computing (GCD) at the Universitat de Lleida (UdL). I must say that has been a great experience working closely with such good people: Concepció Roig, Francesc Giné, Francesc Solsona, Josep Ll. Lèrida, Josep M. Solà, Valentí Pardo and Xavier Faus. During my research, I have shared great moments with all my colleagues: Josep Rius, Ignasi Barri, Ivan Teixidó, Héctor Blanco, Anabel Usié, Damià Castellà, Alberto Montañola, Jordi Vilaplana, Eloi Gabaldon, Jordi Lladós and Ismael Arroyo. Many thanks to all of them for the wonderful times we shared. Furthermore, I want to thank Porfidio Hernández and Toni Espinosa from Universitat Autònoma de Barcelona (UAB) for all their support. I also want to thank all the members of the Comparative Bioiformatics group from the Centre for Genomic Regulation (CRG), especially Cedric Notredame, Carsten Kemena and Paolo Di Tommaso, for their good advice and help, and for wel- coming me as if I were one more during one month. Finally, I want to thank all the members of the Edinburgh Data-Intensive Research group at the University of Edinburgh. During three month, I had the chance to stay in this group and work with great people who made me feel at home. It was a great experience vii viii that marked me in many positive aspects. I would especially like to acknowl- edge Malkolm Atkinson, Paolo Besana, David Rodríguez, Rosa Filgueira and Gary McGilvary for all their support and for welcoming me as if I were another member of a big family. Last but not least, I would like to dedicate this thesis to all my friends and, foremost, to my closest family. Especially to my father Miquel and my sister Marta for the patience and understanding they have shown during these long years. To my girlfriend Laura for her emotional support. And especially, I want to dedicate this thesis to my mother Elena who unfortunately is no longer with us. Contents Contents ix List of Figures xiii List of Tables xvii 1 Introduction 1 1.1 Bioinformatics . .3 1.1.1 Molecular Biology . .4 1.1.1.1 Biological preliminaries . .4 1.1.1.2 Central Dogma of molecular Biology . .8 1.1.2 Bioinformatic Research Areas . .8 1.2 Sequence Alignment . 12 1.2.1 Pairwise Sequence Alignment . 14 1.2.2 Multiple Sequence Alignment . 14 1.2.2.1 Applications . 15 1.2.2.2 Constraints and Requirements . 15 1.2.2.3 Evaluation . 17 1.2.2.4 Current Status . 17 1.3 High Performance Computing . 18 1.3.1 Parallel Computing . 19 1.3.1.1 Hardware . 21 1.3.1.2 Software . 28 1.4 Motivation . 29 1.5 Thesis Objectives . 31 ix x CONTENTS 1.6 Overview . 32 2 Related Work 35 2.1 Introduction . 35 2.2 Pairwise Sequence Alignment Methods . 36 2.2.1 Dot-matrix Methods . 36 2.2.2 Dynamic Programming . 36 2.2.3 Word Methods . 40 2.3 Multiple Sequence Alignment Methods . 40 2.3.1 Exact Algorithms . 41 2.3.2 Progressive Alignment . 42 2.3.3 Iterative Alignment . 44 2.3.3.1 Stochastic Iterative Algorithms . 44 2.3.3.2 Non-stochastic Iterative Algorithms . 48 2.3.4 Consistency-based Methods . 51 2.3.5 Meta-aligners . 54 2.3.6 Template-based Methods . 55 2.4 Benchmarking . 57 2.5 High Performance Computing in MSA . 60 2.5.1 Parallel Algorithms for Pairwise Alignment . 61 2.5.1.1 Wavefront Approach . 61 2.5.1.2 Divide and Conquer Approach . 61 2.5.2 Parallel Multiple Sequence Alignments . 63 2.5.2.1 Multi-Threading and Multi-Process Parallel Aligners . 63 2.5.2.2 Distributed Memory Parallel Aligners . 65 2.5.3 GPUs based Parallel Aligners . 69 2.5.4 Conclusions . 71 3 Balanced Guide Tree 73 3.1 Introduction . 73 3.2 Problem analysis . 74 3.2.1 Progressive Alignment analysis . 76 3.3 Balanced Guide Tree . 78 CONTENTS xi 3.3.1 Neighbor-Joining Algorithm . 79 3.3.2 Balancing Guide Tree Algorithm . 80 3.4 Experimentation . 84 3.4.1 Balanced Guide Tree Performance . 84 3.4.1.1 BalancedParallel-TCoffee . 85 3.4.1.2 Balanced-TCoffee .