A Multi-Objective Ab-Initio Model for Protein Folding Prediction at an Atomic Conformation Level
Total Page:16
File Type:pdf, Size:1020Kb
NATIONAL UNIVERSITY OF COLOMBIA A Multi-objective Ab-initio Model for Protein Folding Prediction at an Atomic Conformation Level by David Camilo Becerra Romero A thesis submitted in partial fulfillment for the degree of Msc. System Engineering and Computation in the Engineering Computer Science February 2010 Declaration of Authorship The following articles were published during my period of research in the master program. A Novel Ab-Initio Genetic Based Approach for Protein Folding Prediction, In Proceedings of the 2007 Conference on Genetic and Evolutionary Computation (GECCO'07) (London, UK, July 7 - 11, 2007), pages 393-400, ACM Press. A Novel Method for Protein Folding Prediction Based on Genetic Algorithms and Hashing, In Proceedings of the II Computation Colombian Congress (2CCC) (Bogot´a,Colombia, April 18 - 20, 2007). A Model for Resource Assignment to Transit Routes in Bogota Transportation Sys- tem Transmilenio, In Proceedings of the III Computation Colombian Congress (3CCC) (Medell´ın,Colombia, April 23 - 25, 2008). An Association Rule Based Model For Information Extraction From Protein Sequence Data, In Proceedings of the III Computation Colombian Congress (3CCC) (Medelln, Colombia, April 23 - 25, 2008). Un Modelo de Asignacion de Recursos a Rutas en el Sistema de Transporte Masivo Trans- milenio, Avances en Sistemas e Inform´atica. Special Volume Vol. 5 Number. 2, June 2008, ISBN 1657-7663. An Association Rule Based Model For Information Extraction From Protein Sequence Data, Avances en Sistemas e Inform´atica.Special Volume Vol. 5 Number. 2, June 2008, ISBN 1657-7663. An Artificial Network Topology Based on Association Rules for Protein Secondary Struc- ture Prediction (Best paper award), III Biotechnology Colombian Congress (Bogot´a, Colombia, July 29 August 01, 2008). A Novel Image Encryption Scheme Based on a Generalized Chinese Remainder Theorem, Poster Presentation of the 2008 International Workshop on Combinatorial Algorithms (IWOCA 2008), (Nagoya, Japan. September 13-15 de 2008). An Association Rule based Approach for Biological Sequence Feature Classification, In Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2009) (Trondheim, Norway, May 18 - 21, 2009). i ii Una revisi´ona las t´ecnicas de soluci´onal problema de la subsecuencia com´unm´aslarga, To appear in Proceedings of the Encuentro Nacional de Investigaciones de Posgrado (ENIP 2009) (Bogot´a,Colombia, Dec 02 - 04, 2009). A Novel Ab-Initio Model for Protein Folding Prediction, To appear in Proceedings of the Encuentro Nacional de Investigaciones de Posgrado (ENIP 2009) (Bogota, Colombia, Dec 02 - 04, 2009). Propuesta de un modelo multi-objetivo para la predicci´onde complejos prote´ına-ligando, To appear in Proceedings of the Encuentro Nacional de Investigaciones de Posgrado (ENIP 2009) (Bogota, Colombia, Dec 02 - 04, 2009). A New Chinese Remainder Algorithm for Image-based Encryption, To appear in Colom- bian Journal of Computation (RCC), 10(1), 2009, ISSN 1657-2831. An Algorithm for Constrained LCS, To appear in Proceedings of the 8th ACS/IEEE Inter- national Conference on Computer Systems and Applications (AICCSA'10), Hammamet, Tunisia, 2010. Great works are performed, not by strength, but by perseverance. Samuel Johnson NATIONAL UNIVERSITY OF COLOMBIA Abstract Engineering Computer Science Msc. System Engineering and Computation by David Camilo Becerra Romero This thesis focuses on the protein structure prediction problem, one of the most important prob- lems tackled by bioinformatics and molecular biology. The basic problem consists of predicting the three-dimensional structure of a protein from its amino-acid sequence. Specifically, in this work a parallel multiobjective ab-initio approach at an atomic conformation level for protein structure prediction is proposed. The proposed model incorporated several existing software tools combined with others developed by us. Experiments were carried out over a set of different proteins (1ROP, 1PLW, 1UTG, 1CRT and 1ZDD) to validate the model obtaining very promising results. The most important features of the experimental framework are: i) A trigonometric representation was chosen to represent and compute the backbone and side-chain torsion angles of a protein; ii) The Chemistry at HARvard Macromolecular Mechanics (CHARMm) function was used in order to evaluate the structure of protein conformations; iii) The multi-objective genetic algorithm (NSGA-II) was used to evolve protein conformations directed by an atom-interaction scoring function; iv) An island model of the evolutionary algorithm was developed to speed up the process and to improve the effectiveness of the algorithm. NATIONAL UNIVERSITY OF COLOMBIA Abstract Engineering Computer Science Msc. System Engineering and Computation by David Camilo Becerra Romero La tesis se enfoca en el problema de la predicci´onde estructuras de prote´ınas, uno de los problemas m´asimportantes abordados por la bionform´aticay la biolog´ıamolecular. El problema b´asicoconsiste en la predicci´onde la estructura tridimensional de una prote´ınadada su secuencia de amino-´acidos.De forma particular, en ´este trabajo se presenta una aproximaci´onab-initio, multi-objetivo y paralela para la predicci´onde estructuras de prote´ınasa un nivel at´omico. El modelo propuesto incorpor´odiferentes herramientas de software combinadas con otras her- ramientas desarrolladas por nosotros. Algunos experimentos se llevaron a cabo sobre un conjunto de cinco diferentes prote´ınas(1ROP, 1PLW, 1UTG, 1CRT y 1ZDD) para validar el modelo, obte- niendo resultados prometedores. Las caracter´ısticasm´asimportantes del marco experimental son: i) Se escogi´ouna representaci´ontrigonom´etrica para representar y computar los ´angulos de torsi´ondel backbone y la cadena lateral de una prote´ına;ii) La funci´onde potenciales de energ´ıaThe Chemistry at HARvard Macromolecular Mechanics (CHARMm) es utilizada para evaluar en t´erminosenerg´eticosla estructura de las conformaciones de prote´ınas;iii) El algo- ritmo gen´eticomulti-objetivo (NSGA-II) es utilizado para evolucionar las conformaciones de prote´ınasdirigidas por una funci´onde puntajes de interacciones at´omicas;iv) Un modelo de islas del algoritmo evolutivo fue desarrollado para acelerar el proceso de c´omputoy mejorar la efectividad del algoritmo. Acknowledgements First I would like to thank Professor Luis Nino for his support, encouragement and teachings. I will be always grateful with him because he introduced me to the wonderful world of proteins. I really appreciate his patience and all the help and guidance that he always brought me. I would also like to thank Professor Yoan Pinzon, because he along with Professor Nino became the research role model to follow. They taught me great things, not only in the academic field, but for my personal life. They introduced me to the research area, and gave me the opportunity to attend international and national conferences that help me grow as a student and a human being. I will always be grateful with my friends from the Intelligent System Research Lab (LISI) and bioinformatics group. They have helped me a lot during my master program. I got an academical enrichment from his criticism, comments and ideas, but most important, I built a strong friendship with all of them. There are many other people that have helped me during this wonderful period. Thanks to the other professors from the Computer Science Department, to the members of the Intelligent Security Systems Research Laboratory (ISSRL), to the members of the Colombian Immunology Institute (FIDIC), and to the members of the Biotechnology Institute (IBUN). The biggest thanks go to my family, my best friend (Sergio) and my girlfriend (Luisa). Every second of my life, they have contributed to this work. There is not enough words to thank them, but they know that this work is dedicated to them. vi Contents Declaration of Authorship i Abstract iv Resumen v Acknowledgements vi List of Figures ix List of Tables xi Abbreviations xii Physical Constants xiii Symbols xiv 1 Introduction 1 2 Protein Folding Prediction: A survey 12 2.1 Abstract . 12 2.2 Introduction . 12 2.3 Problem Presentation . 13 2.4 Comparative Methods . 15 2.5 Threading Methods . 17 2.6 Ab-Initio Methods . 19 2.6.1 Conformation Representations . 21 2.6.1.1 Lattice Models . 22 2.6.1.2 Off-lattice Methods . 23 2.6.2 Scoring Functions . 24 2.6.3 Search Conformational Methods . 25 2.6.3.1 Molecular Dynamics . 26 2.6.3.2 Statistical Mechanics . 27 2.6.3.3 Monte Carlo Methods . 27 2.6.3.4 Probabilistic Road Maps . 29 vii Contents viii 3 The Proposed Multi-objective Ab-Initio Approach 31 3.1 Representation of the conformations . 31 3.1.1 Secondary Structure Prediction . 36 3.1.2 Rotamer Libraries . 37 3.2 Cost Functions (Potential Energy Functions) . 38 3.2.1 CHARMm . 39 3.2.2 Amber . 40 3.3 Search Conformation Procedure . 41 3.3.1 Multi-objective Optimization . 41 3.3.1.1 The Multi objective evolutionary algorithm . 43 3.3.2 The Multi-objective Formulation . 44 3.3.3 Decision-making Phase . 46 3.4 Root Mean Square Deviation . 48 4 The Proposed Parallel Implementation 49 4.1 Introduction . 49 4.2 Background . 51 4.2.1 JavaSpaces . 51 4.2.2 Island Models . 52 4.3 The proposed parallel implementation . 53 4.3.1 Initiating work and slave registration . 53 4.3.2 Handling Migrations . 55 4.3.3 Finishing work and sending results . 56 5 Experimental Framework 58 5.1 Test proteins suits . 60 5.1.1 Repressor of primer 1ROP . 60 5.1.2 Disulphide-stabilized mini protein A domain 1ZDD . 63 5.1.3 Met-enkephalin 1PLW . 74 5.1.4 Crambin 1CRN . 76 5.1.5 Uteroglobin 1UTG . 86 5.1.6 Comparisons with other approaches . 89 6 Conclusions and Further Work 96 A Web Service on a computer cluster 99 Bibliography 102 List of Figures 1.1 The central dogma of molecular biology . 2 1.2 Protein Structures . 3 1.3 Models for protein folding . 6 1.4 Energy landscape model for protein folding .