NATIONAL UNIVERSITY OF COLOMBIA

A Multi-objective Ab-initio Model for Folding Prediction at an Atomic Conformation Level

by David Camilo Becerra Romero

A thesis submitted in partial fulfillment for the degree of Msc. System Engineering and Computation

in the Engineering Computer Science

February 2010 Declaration of Authorship

The following articles were published during my period of research in the master program.

 A Novel Ab-Initio Genetic Based Approach for Prediction, In Proceedings of the 2007 Conference on Genetic and Evolutionary Computation (GECCO’07) (London, UK, July 7 - 11, 2007), pages 393-400, ACM Press.

 A Novel Method for Protein Folding Prediction Based on Genetic Algorithms and Hashing, In Proceedings of the II Computation Colombian Congress (2CCC) (Bogot´a,Colombia, April 18 - 20, 2007).

 A Model for Resource Assignment to Transit Routes in Bogota Transportation Sys- tem Transmilenio, In Proceedings of the III Computation Colombian Congress (3CCC) (Medell´ın,Colombia, April 23 - 25, 2008).

 An Association Rule Based Model For Information Extraction From Protein Sequence Data, In Proceedings of the III Computation Colombian Congress (3CCC) (Medelln, Colombia, April 23 - 25, 2008).

 Un Modelo de Asignacion de Recursos a Rutas en el Sistema de Transporte Masivo Trans- milenio, Avances en Sistemas e Inform´atica. Special Volume Vol. 5 Number. 2, June 2008, ISBN 1657-7663.

 An Association Rule Based Model For Information Extraction From Protein Sequence Data, Avances en Sistemas e Inform´atica.Special Volume Vol. 5 Number. 2, June 2008, ISBN 1657-7663.

 An Artificial Network Topology Based on Association Rules for Protein Secondary Struc- ture Prediction (Best paper award), III Biotechnology Colombian Congress (Bogot´a, Colombia, July 29 August 01, 2008).

 A Novel Image Encryption Scheme Based on a Generalized Chinese Remainder Theorem, Poster Presentation of the 2008 International Workshop on Combinatorial Algorithms (IWOCA 2008), (Nagoya, Japan. September 13-15 de 2008).

 An Association Rule based Approach for Biological Sequence Feature Classification, In Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2009) (Trondheim, Norway, May 18 - 21, 2009). i ii

 Una revisi´ona las t´ecnicas de soluci´onal problema de la subsecuencia com´unm´aslarga, To appear in Proceedings of the Encuentro Nacional de Investigaciones de Posgrado (ENIP 2009) (Bogot´a,Colombia, Dec 02 - 04, 2009).

 A Novel Ab-Initio Model for Protein Folding Prediction, To appear in Proceedings of the Encuentro Nacional de Investigaciones de Posgrado (ENIP 2009) (Bogota, Colombia, Dec 02 - 04, 2009).

 Propuesta de un modelo multi-objetivo para la predicci´onde complejos prote´ına-ligando, To appear in Proceedings of the Encuentro Nacional de Investigaciones de Posgrado (ENIP 2009) (Bogota, Colombia, Dec 02 - 04, 2009).

 A New Chinese Remainder Algorithm for Image-based Encryption, To appear in Colom- bian Journal of Computation (RCC), 10(1), 2009, ISSN 1657-2831.

 An Algorithm for Constrained LCS, To appear in Proceedings of the 8th ACS/IEEE Inter- national Conference on Computer Systems and Applications (AICCSA’10), Hammamet, Tunisia, 2010. Great works are performed, not by strength, but by perseverance.

Samuel Johnson NATIONAL UNIVERSITY OF COLOMBIA

Abstract

Engineering Computer Science

Msc. System Engineering and Computation

by David Camilo Becerra Romero

This thesis focuses on the prediction problem, one of the most important prob- lems tackled by bioinformatics and molecular biology. The basic problem consists of predicting the three-dimensional structure of a protein from its amino-acid sequence. Specifically, in this work a parallel multiobjective ab-initio approach at an atomic conformation level for protein structure prediction is proposed.

The proposed model incorporated several existing software tools combined with others developed by us. Experiments were carried out over a set of different (1ROP, 1PLW, 1UTG, 1CRT and 1ZDD) to validate the model obtaining very promising results. The most important features of the experimental framework are: i) A trigonometric representation was chosen to represent and compute the backbone and side-chain torsion angles of a protein; ii) The Chemistry at HARvard Macromolecular Mechanics (CHARMm) function was used in order to evaluate the structure of protein conformations; iii) The multi-objective genetic algorithm (NSGA-II) was used to evolve protein conformations directed by an atom-interaction scoring function; iv) An island model of the evolutionary algorithm was developed to speed up the process and to improve the effectiveness of the algorithm. NATIONAL UNIVERSITY OF COLOMBIA

Abstract

Engineering Computer Science

Msc. System Engineering and Computation

by David Camilo Becerra Romero

La tesis se enfoca en el problema de la predicci´onde estructuras de prote´ınas, uno de los problemas m´asimportantes abordados por la bionform´aticay la biolog´ıamolecular. El problema b´asicoconsiste en la predicci´onde la estructura tridimensional de una prote´ınadada su secuencia de amino-´acidos.De forma particular, en ´este trabajo se presenta una aproximaci´onab-initio, multi-objetivo y paralela para la predicci´onde estructuras de prote´ınasa un nivel at´omico.

El modelo propuesto incorpor´odiferentes herramientas de software combinadas con otras her- ramientas desarrolladas por nosotros. Algunos experimentos se llevaron a cabo sobre un conjunto de cinco diferentes prote´ınas(1ROP, 1PLW, 1UTG, 1CRT y 1ZDD) para validar el modelo, obte- niendo resultados prometedores. Las caracter´ısticasm´asimportantes del marco experimental son: i) Se escogi´ouna representaci´ontrigonom´etrica para representar y computar los ´angulos de torsi´ondel backbone y la cadena lateral de una prote´ına;ii) La funci´onde potenciales de energ´ıaThe Chemistry at HARvard Macromolecular Mechanics (CHARMm) es utilizada para evaluar en t´erminosenerg´eticosla estructura de las conformaciones de prote´ınas;iii) El algo- ritmo gen´eticomulti-objetivo (NSGA-II) es utilizado para evolucionar las conformaciones de prote´ınasdirigidas por una funci´onde puntajes de interacciones at´omicas;iv) Un modelo de islas del algoritmo evolutivo fue desarrollado para acelerar el proceso de c´omputoy mejorar la efectividad del algoritmo. Acknowledgements

First I would like to thank Professor Luis Nino for his support, encouragement and teachings. I will be always grateful with him because he introduced me to the wonderful world of proteins. I really appreciate his patience and all the help and guidance that he always brought me. I would also like to thank Professor Yoan Pinzon, because he along with Professor Nino became the research role model to follow. They taught me great things, not only in the academic field, but for my personal life. They introduced me to the research area, and gave me the opportunity to attend international and national conferences that help me grow as a student and a human being.

I will always be grateful with my friends from the Intelligent System Research Lab (LISI) and bioinformatics group. They have helped me a lot during my master program. I got an academical enrichment from his criticism, comments and ideas, but most important, I built a strong friendship with all of them.

There are many other people that have helped me during this wonderful period. Thanks to the other professors from the Computer Science Department, to the members of the Intelligent Security Systems Research Laboratory (ISSRL), to the members of the Colombian Immunology Institute (FIDIC), and to the members of the Biotechnology Institute (IBUN).

The biggest thanks go to my family, my best friend (Sergio) and my girlfriend (Luisa). Every second of my life, they have contributed to this work. There is not enough words to thank them, but they know that this work is dedicated to them.

vi Contents

Declaration of Authorship i

Abstract iv

Resumen v

Acknowledgements vi

List of Figures ix

List of Tables xi

Abbreviations xii

Physical Constants xiii

Symbols xiv

1 Introduction 1

2 Protein Folding Prediction: A survey 12 2.1 Abstract ...... 12 2.2 Introduction ...... 12 2.3 Problem Presentation ...... 13 2.4 Comparative Methods ...... 15 2.5 Threading Methods ...... 17 2.6 Ab-Initio Methods ...... 19 2.6.1 Conformation Representations ...... 21 2.6.1.1 Lattice Models ...... 22 2.6.1.2 Off-lattice Methods ...... 23 2.6.2 Scoring Functions ...... 24 2.6.3 Search Conformational Methods ...... 25 2.6.3.1 Molecular Dynamics ...... 26 2.6.3.2 Statistical Mechanics ...... 27 2.6.3.3 Monte Carlo Methods ...... 27 2.6.3.4 Probabilistic Road Maps ...... 29

vii Contents viii

3 The Proposed Multi-objective Ab-Initio Approach 31 3.1 Representation of the conformations ...... 31 3.1.1 Secondary Structure Prediction ...... 36 3.1.2 Rotamer Libraries ...... 37 3.2 Cost Functions (Potential Energy Functions) ...... 38 3.2.1 CHARMm ...... 39 3.2.2 Amber ...... 40 3.3 Search Conformation Procedure ...... 41 3.3.1 Multi-objective Optimization ...... 41 3.3.1.1 The Multi objective evolutionary algorithm ...... 43 3.3.2 The Multi-objective Formulation ...... 44 3.3.3 Decision-making Phase ...... 46 3.4 Root Mean Square Deviation ...... 48

4 The Proposed Parallel Implementation 49 4.1 Introduction ...... 49 4.2 Background ...... 51 4.2.1 JavaSpaces ...... 51 4.2.2 Island Models ...... 52 4.3 The proposed parallel implementation ...... 53 4.3.1 Initiating work and slave registration ...... 53 4.3.2 Handling Migrations ...... 55 4.3.3 Finishing work and sending results ...... 56

5 Experimental Framework 58 5.1 Test proteins suits ...... 60 5.1.1 Repressor of primer 1ROP ...... 60 5.1.2 Disulphide-stabilized mini protein A domain 1ZDD ...... 63 5.1.3 Met-enkephalin 1PLW ...... 74 5.1.4 Crambin 1CRN ...... 76 5.1.5 Uteroglobin 1UTG ...... 86 5.1.6 Comparisons with other approaches ...... 89

6 Conclusions and Further Work 96

A Web Service on a computer cluster 99

Bibliography 102 List of Figures

1.1 The central dogma of molecular biology ...... 2 1.2 Protein Structures ...... 3 1.3 Models for protein folding ...... 6 1.4 model for protein folding ...... 7 1.5 Example of HP lattice model ...... 7 1.6 Stages of the methodology for the proposed model ...... 8 1.7 Parameters to define in the proposed model ...... 9 1.8 The PSP proposed model ...... 10

2.1 Taxonomy of PF problem ...... 14 2.2 Protein Squence Flow ...... 16 2.3 Comparative Modeling ...... 18 2.4 Common lattice models ...... 23

3.1 Formats of structural representation ...... 33 3.2 The planar peptide group ...... 34 3.3 Ramachandran plot ...... 35 3.4 Secondary Structure Prediction Approach ...... 37 3.5 Mapping in MOOP ...... 42 3.6 Multi-objective optimization procedure ...... 43 3.7 Schematic of the NSGA-II procedure ...... 44 3.8 Chromosome used in the GA ...... 46 3.9 Primary structure representation ...... 46 3.10 Angle-based Focus to find knees ...... 47

4.1 JavaSpaces ...... 52 4.2 A space-based compute-server ...... 52 4.3 A space-based compute-server ...... 53 4.4 JavaSpace technology components ...... 54 4.5 Initiating work phase ...... 54 4.6 Slave Registration phase. The master takes the unapproved slave entry ...... 54 4.7 Slave Registration phase. The slave takes the approved entry ...... 55 4.8 The master will periodically check the status of its current slaves ...... 55 4.9 Handling migrations phase: slaves send EmigrantEntries to JavaSpace...... 55 4.10 Handling migrations phase: master takes all EmigrantEntries ...... 56 4.11 Handling migrations phase: master writes all ImmigrantEntries...... 56 4.12 Handling migrations phase: each slave takes its corresponding ImmigrantEntry . 57 4.13 Finishing and sending results phase ...... 57

5.1 Native and predicted conformations for 1ROP protein ...... 61 5.2 Behavior of the minimum RMSD in each island ...... 63

ix List of Figures x

5.3 Minimum, average and maximum RMSD in each island (1ROP) ...... 64 5.4 Histrogram of RMSD ranges (1ROP) ...... 65 5.5 Pareto Fronts in each island (1ROP) ...... 66 5.6 Relationship between Energy and RMSD (1ROP) ...... 67 5.7 Behavior of the minimum RMSD in each island ...... 69 5.8 Minimum, average and maximum RMSD in each island (1ZDD) ...... 70 5.9 Histrogram of RMSD ranges (1ZDD) ...... 71 5.10 Pareto Fronts in each island (1ZDD) ...... 72 5.11 Relationship between Energy and RMSD (1ZDD) ...... 73 5.12 Native and predicted conformations for 1PLW protein ...... 75 5.13 Behavior of the minimum RMSD in each island for the 1PLW protein ...... 75 5.14 Minimum, average and maximum RMSD in each island (1PLW) ...... 77 5.15 Histrogram of RMSD ranges (1PLW) ...... 78 5.16 Pareto Fronts in each island (1PLW) ...... 79 5.17 Relationship between Energy and RMSD (1PLW) ...... 80 5.18 Native and predicted conformations for 1CRN protein ...... 81 5.19 Behavior of the minimum RMSD in each island for the 1CRN protein ...... 82 5.20 Minimum, average and maximum RMSD in each island (1CRN) ...... 83 5.21 Histrogram of RMSD ranges (1CRN) ...... 84 5.22 Pareto Fronts in each island (1CRN) ...... 85 5.23 Relationship between Energy and RMSD (1CRN) ...... 87 5.24 Behavior of the minimum RMSD in each island for the 1CRN protein ...... 89 5.25 Minimum, average and maximum RMSD in each island (1UTG) ...... 90 5.26 Histrogram of RMSD ranges (1UTG) ...... 91 5.27 Pareto Fronts in each island (1UTG) ...... 92 5.28 Relationship between Energy and RMSD (1UTG) ...... 93

A.1 Web service framework architecture ...... 99 A.2 Web service use case model ...... 101 List of Tables

2.1 A comparison of protein folding models ...... 26 2.2 Complexities estimate for serial GAs ...... 29

3.1 Ideal Values of Bond Angles ...... 35 3.2 Ideal Values of Interatomic Distances ...... 35 3.3 Number of χ angles ...... 35 3.4 Secondary Structure Constraints ...... 36 3.5 Rotamer library values ...... 38

5.1 Experimental Test Proteins ...... 60 5.2 Island architecture, migration policies and MOEA’s parameters for PSP . . . . . 60 5.3 Computed protein conformation for the 1ROP protein target ...... 62 5.4 Computed protein conformation for the 1ZDD protein target ...... 68 5.5 Computed protein conformation for the 1PLW protein target ...... 74 5.6 Computed protein conformation for the 1ZDD protein target ...... 81 5.7 Computed protein conformation for the 1UTG protein target ...... 88 5.8 Proposed approach versus other approaches for Met-enkephalin peptide . . . . . 94 5.9 Proposed approach versus other approaches for Crambin protein ...... 94 5.10 Proposed approach versus other approaches for Disulphide-stabilized protein . . 94 5.11 Proposed approach versus other approaches for Represor of Primer ...... 95 5.12 Proposed approach versus other approaches for Uteroglobin ...... 95

xi Abbreviations

AA CASP Critical Assessment of Techniques for PSP GA Genetic Algorithms MD Molecular Dynamics MOEA Multiobjective Optimization Evolutionary Algorithm MOOP Multiobjective Optimization Problem NSGA Non-Dominated Sorting Genetic Algorithm PDB Protein Data Bank PF Protein Folding PRM Probabilistic Road Maps PSP Protein Structure Prediction RMSD Root Mean Square Deviation VDW Van Der Waals

xii Physical Constants

Armstrongs Ao = 0.1 nanometers

xiii Symbols

φ torsion angle phi ψ torsion angle psi χ torsion angle chi ω torsion angle omega

Cα Alpha-Carbon kcal Kilo-calories mol Mole (a unit of amount of substance)

xiv Dedicated to my GRANDMOM. . .

xv Chapter 1

Introduction

For decades scientists have studied the complex processes that determine the structure, prop- erties and functionality of proteins. Nowadays, many of these research topics converge to the protein folding problem, in which extremely challenging and complex issues still remain.

Information encoded in the DNA contains the instructions needed to construct proteins. Pro- cessing of DNA information is carried out by cells in a process that consists of two phases: transcription and translation (See. Figure 1.1). In transcription, the genetic information in- cluded in the DNA is transcribed into the messenger RNA (mRNA). This molecule is segmented into codons - three consecutive nucleotides that encode an amino acid - which are then trans- lated into an amino acid sequence. In prokaryotes the mRNA is directly generated from DNA, meanwhile, in eukaryotes two types of DNA segments exist in the nucleus, namely, introns (non-coding segments) and exons (coding segments). Subsequently, introns are discarded and exons are concatenated into mRNA, in order to reach the cytoplasm. Once the mRNA is in the cytoplasm, the translation of each codon takes place. However, this translation is based on the capacity of the codons to form 64 possible encodings (each nucleotide has four different conformations), only 20 standard amino acids which will be used by cells in protein biosynthesis are represented. Thus, a protein primary structure is reached after codons have been translated into amino acids and included in the current sequence [1].

An amino acid is a molecule that contains an amine and a carboxile group; amino acids are the proteins’ building blocks. Based on the differences in structure, size, electric charge and solubility in water of the amino acids side chains, they can be classified as either hydrophobic, hydrophilic or amphipathic. Hydrophilic amino acids are able to have hydrogen bonds with other amino acids, with polar organic molecules and with water. Hydrophobic amino acids interact by Van der Waals bonds and they are generally located in the core of the protein trying to reside in a non-aqueous environment. Specifically the R group of these amino acids is non- polar and non-charge, making them insoluble in water. Amphipathic amino acids have both polar and non-polar characteristics and they tend to serve as interfaces between hydrophobic and hydrophilic molecules [1].

1 Chapter 1. Introduction 2

Figure 1.1: The central dogma of molecular biology (picture taken from http://www.iusd.k12.ca.us/uhs/cs2/images/centraldogma.gif)

Proteins are linear polymers composed by a sequence of amino acid residues that are connected by peptide bonds. Proteins are formed by several amino acids in a folding process from which a three-dimensional structure is obtained. This three-dimensional structure is highly important because it determines the function of the protein. In order to understand protein structure and formation, it is convenient to consider four structural levels. Primary structure consists of the order of the amino acids in the sequence. Secondary structure consists of regular components like α-helices, β-sheets and β-turns, that contribute to the stabilization of the protein folding. In tertiary structure, the elements of secondary structure are folded forming an almost solid compact structure that is stabilized by weak interactions that involve polar and nonpolar groups; the three-dimensional structure of the amino acids is obtained after proteins fold into their native state; it is important to stress that at this level proteins become functional. In the quaternary structure, several polypeptides chains with tertiary structure are joined by weak connections - non-covalent - to form a protein complex; in other words, different polypeptide chains with tertiary structure interact to build a protein complex [2] (See Figure 1.2)

Although the three dimensional structure of a protein can be precisely determined using empir- ical and experimental methods, an enormous amount of money and time should be invested in Chapter 1. Introduction 3

Figure 1.2: Protein Structures (picture taken from [1]) these methods before any results could be achieved. Then a significant amount of effort have been given in the development of efficient and effective computational models. The protein folding problem is one of the most important open interdisciplinary problems and it consists of determining the tertiary protein structure from its amino acid sequence [1]. The protein fold- ing problem is considered as one of the most compelling puzzles and questions facing scientists today. According to science magazine [3], the prediction of protein folding is one of the 125 big questions that face scientific inquiry over the next quarter-century. Understanding the complex processes that determine the structure, properties and functionality of proteins is important mainly for the following reasons:

• Proteins carry out a wide variety of vital functions, for example, proteins are involved in the catalysis of cellular chemical reactions, transport of molecules, signal transduction, segregation of genetic material and production and use of energy [2].

• Significant progress in understanding protein folding mechanisms will represent advances in drug design, treatments for many hereditary and infectious diseases.

• Classical methods for protein structure analysis such as X-ray crystallography and nuclear magnetic resonance (NMR) [4] are very time consuming, error prone and expensive. Then, improvements in the accuracy of protein folding prediction methods may save money and time. Additionally, the existent gap between sequence (number of known protein se- quences) and structure (three-dimensional structures) protein information could decrease [5].

The protein folding problem presents numerous challenges to other fields of study, including com- puter science, biochemistry, medicine, biology, engineering and scientific disciplines [6]. Specif- ically, molecular biology focuses on understanding how protein three-dimensional structures Chapter 1. Introduction 4 are achieved, meanwhile computer science is concerned with developing efficient algorithms to predict such three-dimensional structures. It is important to emphasize the difference between the protein folding problem (PF) and the protein structure prediction problem (PSP). The first one is interested in determining a protein tertiary structure from its amino acid sequence trying to understand the folding path which leads the folding process. On the other hand, the aim of protein structure prediction is at determining the configuration of a folded protein regardless of the folding process, i.e., the PSP is not concerned with the folding process, but only with the final three dimensional structure [7]. Protein structure prediction methods are based on two groups of principles and theories: the first group is the laws of physics, biology and chemistry; and the second the theory of evolution. Each set of principles that apply to natural protein sequences gave rise to a class of protein structure prediction methods [8]. Comparative models and fold recognition methods are based on the first group, and ab-initio methods are based on the second one. Comparative methods rely on the folding similarity between a target protein and known protein structures. A three-dimensional model is built for a protein whose structure is unknown (the target) based on related proteins of known structure (the templates) [9]. In contrast, ab-initio methods use primary principles to predict a protein structure from its amino acid sequence without relying on similarity at the fold level between a target structure and a set of templates [7]. Although ab-initio methods are highly demanding from a computational point of view, they are very important because they overcome the inherent problems of comparative methods:

• Ab-initio methods could be applied when it is not possible to find homologous or similar structures related to a target protein. In a general way, such methods can be used on any amino acid sequence because the information used only refers to the physical properties of the amino acid atoms.

• Ab-initio methods can discriminate between correct and incorrect assumptions of the model and have a deeper understanding of protein folding mechanisms [10].

Anfinsen thermodynamic hypothesis and Lenvinthal’s paradox are two key concepts that sup- port the ab-initio methods. Based on some laboratory experiments over an active RNase en- zyme, Anfinsen showed that the sequence of amino acids in a peptide chain determines the folding pattern. Then, the protein folding process can be explained entirely by the physical and chemical interaction among the amino acids [11]. Anfisten also stablished that the native or natural conformation occurs because this is thermodynamically the most stable shape or conformation in a particular intracellular environment [12]. On the other hand Levinthal fo- cused on the contradiction between the small quantity of time taken by a protein in the folding process with respect to the time used in the search of the most stable (minimum energy) con- formation from all the possible spacial conformations. Specifically, he focused on the fact that Chapter 1. Introduction 5 many naturally-occurring proteins fold reliably and quickly to their native states despite the astronomical number of possible configurations. Then it was stablished that finding the native folded state of a protein by a random search among all possible configurations will never succeed [13]. Several folding models could be used based on the protein characteristics and the scope of the specific research. The main goal of these models is to avoid the Levinthal’s paradox and to provide closer folding paths to those present in natural conditions. There are mainly four folding models: hydrophobic collapse, framework model, nucleation-condensation and energy landscapes (funnel or novo model). The hydrophobic collapse hypothesises that the native pro- tein conformation forms by rearrangement of a compact collapsed structure and it is based on the observation that a hydrophobic core of nonpolar amino acids is found in the protein’s inte- rior, meanwhile the polar residues which are thought to stabilize folding intermediates (molten globule) are found on the exposed protein surface [14]. The framework model represents a step-wise mechanism to narrow the conformation search and involves a hierarchical assembly. In this process the secondary structure is formed according to the primary structure. Then these elements collide to form the tertiary structure [15]. In the nucleation-condensation model an early formation of a diffuse protein-folding nucleus catalyses further folding. The nucleus consists of few adjacent residues with secondary structure interactions. This nucleus (large proteins may have several nuclei) is stable only in the presence of correct tertiary structure interactions [16]. In the energy landscape model proteins are thought to have globally funneled energy landscapes that are directed towards the native state. This model explains when and why are unique behaviors in protein folding [17]. Figure 1.3 and Figure 1.4 depict the main ideas of these models. There are two main issues to face in the de novo protein structure model. i) The formulation of an energy function to model the different local and global interactions contributing to protein folding and ii) The size of the space of possible conformations, which as mentioned before, cannot be explored exhaustively. Then, substantial improvement in the accuracy of this model crucially relies on the progress both in the design of appropriate energy functions and the development of specialized efficient sampling methods. There are four main tasks that should be considered in the development of an ab-initio model.

1. Representation of the conformations:

2. Cost functions:

3. Search conformation procedures:

4. Metrics to evaluate the similarity:

To represent the conformations and to overcome the high complexity in sampling protein con- formations, most of the methods need a significant reduction of the complexity. Methods for Chapter 1. Introduction 6

Figure 1.3: a- Framework Model, b-) Hydrophobic collapse Model, c-) Nucleation- condensation Model (image taken from http://www.pitb.de/nolting/prot00/models.gif) reducing protein structure to discrete low-complexity models can be divided into two major classes: lattice and off-lattice models [7].

Spatial lattices or grids models can be used to represent amino acid space and to allow folding having discrete degrees of freedom (See Figure 1.5). In lattice models the evaluation of the energy function and methods involving exhaustive searches of the available conformations can be achieved quite efficiently. However, even such models have fundamental theoretical relevance, they cannot be directly applied to real proteins; this is due to the fact that, typically, they have a restricted ability to represent subtle geometric considerations and to reproduce the backbone with enough accuracy.

Most off-lattice models fix the degrees of freedom and the bond lengths of the polypeptide . Although some algorithms use multiple representations, a few of them are commonly used: all-atom three dimensional coordinates, all-heavy-atom coordinates, backbone atom co- ordinates plus side-chain centroids, Cα coordinates, and backbone and side-chain torsion angles [7]. The atomic representation is a lower level of abstraction of the real protein conformations. Chapter 1. Introduction 7

Figure 1.4: A schematic energy landscape for protein folding (image taken from http://www.nature.com/nature/journal/v426/n6968/images/nature02261-f1.2.jpg)

Figure 1.5: Example of HP lattice model Chapter 1. Introduction 8

Once a protein representation model that sufficiently reduces the complexity of the confor- mational search is chosen, a scoring, energy or potential function that works in the chosen low-complexity space must be defined. The potential energy functions or force fields allow the evaluation of the structure of protein conformations returning a value for the energy based on the conformation of the protein. The energy function must adequately represent the forces responsible for protein structure; also, this energy function should be efficiently calculated, be- cause it needs to be intensively computed in the exploring of the search conformational space [18]. Quantum mechanics provides the natural set of principles to built an energy function, but its computational complexity makes it impractical to use in modelling complex systems.

The next task in determining a protein’s native state is the search of the lowest energy confor- mation in a vast conformational space. All algorithms currently used for native state prediction combine domain information and local search techniques in order to reduce the complexity of the search on high-dimensional conformation spaces. Evolutionary methods have been com- monly used as search methods of conformations in an energy landscape. Specifically, genetic algorithms, which are systematic methods based on biological evolution used to solve search and optimization problems, have been widely used. The protein folding problem (PF) and the pro- tein structure prediction problem (PSP) have been considered as large optimization problems that use a single-objective potential energy function; however, those problems can be tackled as multi-objective optimization problems that involve more than one single objective funtion.

The final task in the devolopment of ab-initio methods is the use of some metrics to evaluate the similarity between the predicted and the native conformations. One of the most used metrics is the root mean square deviation (RMSD). Specifically, RMSD measures the similarity of atomic positions with respect to two superposed conformations.

Figure 1.6: Stages of the methodology for the proposed model Chapter 1. Introduction 9

Mutation Rates Migration rates. Cross-over Rates Number of solutions to migrate Selection Mechanism Mechanism to select emigrating solutions Population size Mechanism to select solutions to be replaced Number of evaluations Island topology (number of islands, connections)

MOEA INSTANCE MOEA INSTANCE

Multi-objective algorithms: (NSGA-II, SPEA2, PAES, PESA-II,OMOPSO, MOCell, AbYSS, MOEA/D, Densea, CellDE, GDE3, FastPGA, IBEA) Potential Energy Functions: Amber (ff94, ff96, ff98 and ff99), CHARMM (19 and 27), MMFF, AMOEBA, Allinger MM (MM2-1991 and MM3-2000), OPLS (OPLS-UA, OPLS-AA and OPLS-AA/L)) Multi-objective with two or three objectives:

MOEA INSTANCE MOEA INSTANCE

Figure 1.7: Parameters to define in the proposed model

The proposed research project focuses on the exploration of a multi-objective approach as a suitable methodology to model the protein folding prediction problem. The proposed method is based on the hypothesis that the protein folding problem can be formulated as a multi-objective optimization problem and that this focus would have significant benefits when compared to single-objective approaches. The hypothesis are mainly based on the ideas that the PF problem involves multiple objectives where different 3D conformations may produce a trade-off in the funnel landscape among different objectives [7] and on the idea that the folded state is a small ensemble of conformational structures compared to the conformational entropy present in the unfolded ensemble [19]. Then, a multi-objective genetic algorithm is used to evolve protein con- formations directed by an atom-interaction scoring function. Figure 1.6 depicts in a squematic manner the stages of the methodology for the proposed model.

The proposed model is implemented incorporating many existing software tools along with oth- ers that have been developed by us. This fact makes the proposed approach flexible respect to the use of different parameters. For example, the score functions CHARMM (19 and 27) AMBER (ff94, ff96, ff98 and ff99), Allinger MM (MM2-1991 and MM3-2000), OPLS (OPLS- UA, OPLS-AA and OPLS-AA/L), Merck Molecular Force Field (MMFF) and AMOEBA can be used to evaluate the structure of protein conformations. The MOEA algorithms NSGA- II, SPEA2, PAES, PESA-II, OMOPSO, MOCell, AbYSS, MOEA/D, Densea, CellDE, GDE3, FastPGA, IBEA can be used to evolve protein conformations. A different number of policies such as number of islands, percentage of migrations, percentage of population involved in mi- grations, to name some, can be used toward the proposed parallel implementation. The parallel implementation is mainly used to speed up the computations and improve the effectiveness of Chapter 1. Introduction 10 a 1 YES Rotamer Libraries PSIPRED METHOD PSIPRED rate migration Decision-making criteri NO MOEA n MOEA INSTANCE YES NO rate migration NO ediction YES max .PDB FILE evaluations? .FASTA FILE MOEA 2 MOEA INSTANCE Rotamer constraints NO Secondary Structure Pr YES rate migration NO NO ction ithm resentation MOEA 1 MOEA INSTANCE 1- Multi-objective algor 3- Potential Fun Energy 1 2- Torsional-angle Rep

Figure 1.8: The PSP proposed model Chapter 1. Introduction 11 the algorithm discovering what solutions arise from simultaneous executions of multiple protein folding instances. The relation of these parameters can be depicted in Figure 1.7.

This thesis develops a protein folding prediction model. Specifically, a parallel multiobjective ab-initio approach at an atomic conformation level is presented (see Figure 1.8 for a scheme of the proposed model). This presentation has been divided in six parts; each part presents a component of the proposed model and the protein folding problem, but seen as a whole this thesis presents a clear methodology to develop an ab-initio method for the protein structure prediction problem.

The first part introduces the problem and defines the proposed approach. The second part is a comprehensive review of the ab-initio protein structure prediction methods. The third part explore in detail the proposed approach. Specifically, four main tasks, namely representation of the conformations, cost functions, search conformation procedures and metrics to evaluate similarity, are considered. The fourth part focuses in the proposed parallel implementation. The fifth part addresses and presents a complete experimental framework of the porposed model over different benchmark proteins. The concluding remarks and further work form the last part of the thesis. Chapter 2

Protein Folding Prediction: A survey

2.1 Abstract

Though there has been extensive work on understanding and modeling the complex processes that determine the structure, properties and functionality of proteins, there are extremely chal- lenging and complex issues that still remain. The protein structure prediction problem is an interdisciplinary open problem that consists of determining the tertiary structure of a protein from its amino-acid sequence trying to understand the folding paths. In this work, a compre- hensive review of the protein structure prediction problem is described. Specifically the main methods, their essential features, advantages and limitations are summarized.

2.2 Introduction

Protein folding understanding and modeling is an open interdisciplinary problem which consists of determining the tertiary protein structure from its amino-acid sequence; a proteins three- dimensional conformation allows it to carry out its function.

Understanding the complex processes that determine a proteins’ structure, properties and func- tionality is important because proteins carry out a wide variety of vital functions of living organisms. This understanding will definitely facilitate drug and vacuum design and may help to find the cures and treatments for several diseases. In addition, improvements in the accuracy of protein folding prediction methods may potentially save money, time, and technical personnel currently invested in experimental methods and it will decrease the gap between sequence and structure protein information.

The protein folding problem represents a big challenge to the scientific community from different fields, not only because of its inherent difficulty, but because it is a interdisciplinary problem. A significant amount of work has been devoted to the protein folding problem, however, most

12 Chapter 2. Protein Folding Prediction: A survey 13 research has tackled the problem using a very particular point of view. Some of the work and even survey papers are highly influenced by a particular scientific field such as physics, biochemistry, medicine, biology or computer science. Thus a comprehensive perspective of the protein structure prediction problem may not appear very clear. Accordingly, the main goal of this survey is to give a comprehensive survey of the protein folding problem. Therefore, a general review of the main ab-initio protein folding methods is provided. Figure 2.1 shows the proposed taxonomy map.

2.3 Problem Presentation

The protein folding problem is one of the most important scientific open problems, and it presents many challenges to different fields of study, including computer science, biochemistry, medicine, biology, engineering and scientific disciplines. Meanwhile molecular biology focuses on understanding how protein three-dimensional structures are achieved, computer science is concerned about developing efficient algorithms to predict such three-dimensional structures.

Computational methods for protein folding prediction are very important for two main reasons. Classical methods for protein structure analysis such as X-ray crystallography and nuclear magnetic resonance (NMR) are very time consuming, error prone and expensive [4, 20]. The second reason is the need to decrease the existing gap between the number of known protein sequences and protein three-dimensional structures [5].

It is important to emphasize the difference between the protein folding problem (PF) and the protein structure prediction (PSP). The first one is interested in determining a protein tertiary structure from its amino-acid sequence trying to understand the folding path which leads the folding process. On the other hand, the aim of protein structure prediction is at determining the configuration of a folded protein without understanding and regarding the folding process, i.e., the PSP is not interested in the folding process, but only in the final three dimensional structure [21].

Protein structure prediction methods are based on two groups of principles and theories: the laws of physics, biology and chemistry; and the theory of evolution [8]. Each set of principles that apply to natural protein sequences gave rise to a class of protein structure prediction method. Meanwhile comparative models and fold recognition methods are based on the first group, ab-initio methods are based on the second one. Comparative methods rely on the folding similarity between a target protein and known protein structures. A three-dimensional model is built for a protein whose structure is unknown (the target) based on related proteins of known structure (the templates) [8, 9, 22]. In contrast, ab-initio methods use the laws of science to predict a protein structure from its amino-acid sequence [8] without relying on similarity at the fold level between a target structure and known structures [7]. Chapter 2. Protein Folding Prediction: A survey 14

Figure 2.1: Taxonomy of the Protein Folding problem Chapter 2. Protein Folding Prediction: A survey 15

C.A. Floudas [23] proposed that protein structure prediction methods may be classified into four main groups: i) comparative models; ii) fold recognition; iii) primary principle methods with database information; and iv) first principle methods without database information (ab- initio). Most successful methods for protein structure prediction are comparative modeling and fold recognition based on homology [24]. The methods that predict secondary structures and local motifs are the most successful prediction methods when homologous sequences of known structure are not available [25, 26]. On the other hand, the accuracy and reliability of ab-initio methods is much lower than comparative models based on alignments with sequence identity higher than 30%; but the basic structure of a protein can in several cases be reasonably predicted [5].

If a sequence identity higher than 30% exists, then comparative methods are the best option to choose; but if the sequence has little or no primary sequence similarity to any sequence with a known structure and some model from the structure library represents the true fold of the sequence then threading methods are the best option. If any of the conditions described above are assured then ab-initio methods should be the choice. (See Figure 2.2)

Richards [27] stated ”In theory, all one needs to know in order to fold a protein into its biologi- cally active shape is the sequence of its constituent amino-acids. Why has nobody been able to put theory into practice?”, For ab-initio methods the answer to this question could be highly influenced by the fact that current potential functions have limited accuracy and the confor- mational space is huge. On the other hand, protein structure modeling based on homology is influenced by similarity measures between a target sequence and template structures, and the alignments between them.

2.4 Comparative Methods

Comparative methods and threading rely on detectable similarity between the modeled sequence and at least one known structure [8]. When the structure of a protein in a protein family has been determined by experiments, other proteins in the same family can be modeled based on aligning them to the known structure. Comparative approaches are based on the hypothesis that a small change in the protein sequence usually results in a small change in its 3D structure [28] and that the protein structures in the same family are more conserved than their primary sequences [29].

The usefulness of comparative modeling has been improved due to the increasing number of experimentally determined protein structures reported in specialized data bases. On november 2009, the RCS PDB protein data bank had approximately 61418 structures [30]. The accuracy of comparative methods is closely related to the percentage of sequence identity on which the methods are based on; particularly, they correlate the sequence and structural similarity of two proteins [31]. Comparative models with high accuracy are based on more than 50% sequence Chapter 2. Protein Folding Prediction: A survey 16

Figure 2.2: Protein Squence Flow identity to their templates and they have about 1Ao root mean square deviation (RMSD) error for the main-chain atoms, which is comparable to the accuracy of a medium-resolution nuclear magnetic resonance (NMR) structure or a low-resolution X-ray structure. Medium-accuracy comparative models are based on 30% to 50% sequence identity. These models have about 90% of the main-chain modeled with 1.5Ao RMSD error. Finally, low-accuracy comparative models are based on less than 30% sequence identity, where alignment errors rapidly increase below this sequence identity threshold [32].

In [8] Andre´asFiser et. al. determined five main steps in comparative model methods. The first step consists of searching for structures related to the target sequence. Comparative mod- eling usually starts by searching for known protein structures in the PDB [33] using the target sequence as a query. This search is generally done by comparing the target sequence with the sequence of each structure in the database. In this first step, either a pairwise sequence-sequence Chapter 2. Protein Folding Prediction: A survey 17 or multiple sequences comparison could be performed. The second step consists of selecting tem- plates from biological data bases using search methods. The reliability of a template increase with the sequence similarity to the target and decrease with the number and length of gaps in the alignment. The third step is the alignment of the target sequence with one or more struc- tures found in the data base. Therefore, once the templates are selected, an alignment method should be used to align them with the target sequence. The fourth step consists of building the model. In order to do this, the modeling by assembly of rigid bodies method [9, 34], or modeling by segment matching or coordinate reconstruction [25, 35] or modeling by satisfaction of spatial restraints [36] are mainly used. The final step is evaluating the model in order to check for possible errors. There are two types of evaluations which can be carried out. The internal evaluation checks whether a model satisfies the constraints used to calculate it or not. The external evaluation relies on information that was not used in the calculation of the model [37].

However, different factors such as the environment, influence the accuracy of a model. It could be thought that a sequence identity above 30% is a relative good predictor with the expected accuracy. But if the target-template sequence identity falls below 30%, the sequence identity becomes significantly less reliable as a measure of expected accuracy of a single model and the first purpose of an evaluation should be to test whether or not a correct template was used [8].

Methods based on knowledge for predicting protein structures have been widely criticized be- cause they do not provide information about the mechanisms and forces that direct the formation of such structures. Additionally, when homolog structures related with the target protein in a repository are not found, or the target protein has unique structural features or different to those characteristics that have been reported, there is no possibility to use those methods [23]. In other cases, when the similarity in sequence between the target and template proteins is lower than 30%, it is not advisable to use these methods because the alignment errors increase rapidly and become the most substantial origin of errors in comparative models. Kihara et. al. in [5] argue that other factors such as template selection and alignments usually have a larger impact on the accuracy of the model; Specially for models based on less than 40% sequence identity to the templates. Figure 2.3 depicts a flow graph of the comparative modeling steps.

2.5 Threading Methods

Fold recognition methods are based on the hypothesis that proteins often adopt similar folds despite no significant sequence similarity is found, this hypothesis is explained given that pro- tein structure is more conserved than protein sequence [28, 38] and that nature is apparently restricted to a limited number of protein folds [39, 40].

Threading methods have characteristics of both homology modeling and ab-initio methods, for example, as homology modeling, threading methods use known protein structures as templates Chapter 2. Protein Folding Prediction: A survey 18

Figure 2.3: Steps in Comparative Modeling for sequences of unknown structure; As ab-initio methods threading methods try to optimize a score function to measure the fit of the sequence in a spatial configuration.

Fold recognition methods can be generally divided into methods that derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles and into methods that consider the full 3-D structure of the protein template. In an 1-D profile-based representation, the structures of proteins are modeled as chains of residue positions which do not interact. In a 2-D representation the model also includes interacting pairs of residue positions. In [41], an 1-D model of a protein structure is a sequence of states representing the residue as if it was embedded in a 3-D structural environment. There are two distinct types of features frequently used to characterize a state; structural features and amino-acid sequence features. 2-D models try to capture the contribution of interactions between pairs of residues. Particularly, these models overlay representations of pairs of residues that are neighbors in the native structure. On the other hand, 3-D models attempts to identify regularities of structures that can not be represented by considering amino-acid pairs.

The general procedure of a fold recognition method consists of taking the amino-acid sequence of a protein and evaluates how well it fits one of the known three-dimensional protein structures or structural elements that have been experimentally observed. This procedure is complemented Chapter 2. Protein Folding Prediction: A survey 19 with scoring functions that determine the fit of a sequence to a given fold. The scores are sorted and the sequence adopts the structure with the best score. These functions are usually built using a database of well-resolved structures. The main difference with respect to the comparative modeling is that the prediction is determined by assembling small structural components, where assembly is guided by an energy function [42].

In [41], it is established that predictions by threading require choosing both the correct structural model from a library and the correct alignment from the space of possible sequence-structure alignments. Once chosen, the alignment establishes a correspondence between amino-acids in the sequence and spatial positions in the model. Assigning each aligned amino-acid to its corresponding spatial position places the sequence into the three-dimensional (3D) protein fold represented by the model.

As an example of threading procedures, in [43] an approach to fold recognition is presented, whereby sequences are matched directly onto the backbone coordinates of known protein struc- tures. The method for protein fold recognition involved automatic modelling of protein struc- tures using a given sequence, and was based on the frameworks of known protein folds. Addi- tionally, it was considered as the first time that the term threading was coined.

The quality of sequence-structure fit is typically evaluated using inter-residue potentials of mean force or other statistical parameters [44]. Then generally protein threading methods require a representation of the sequence, a library of structural models, an objective function, a method of aligning the sequence to the model, and a method of selecting a model from the library.

2.6 Ab-Initio Methods

Ab-initio methods try to directly predict the three-dimensional structure without structural information of the target protein’s family. Based on [10], it is possible to state that although such methods are very demanding at a computational level, they are extremely important because they overcome the inherent problems of comparative methods. In other words, Ab- initio methods could be applied when it is not possible to find homologous structures related to a target protein, and in a general way they can be used on any amino-acid sequence because the information makes only reference to the physics properties of the amino-acids atoms. Moreover they can discriminate between correct and incorrect assumptions of the model and have a deeper understanding of protein folding mechanisms.

The transition between a folded and un-folded protein was considered as a process of two states where it should be represented the native conformation and the unfolded conformation. Although, thanks to kinetic studies, which were developed to explain the Levinthal paradox, the existence of intermediated states was demonstrated [45]. Then, two main models of folding could be used based on the characteristic of a protein and the scope of a study, the first is called a simple model and it is used in small and globular proteins having two states. Specifically, Chapter 2. Protein Folding Prediction: A survey 20 it has an intermediate state where the free energy is higher than the minimum free energy of a spontaneously state. In more complex folding models (second class) there are more than one intermediate state which explain in a detailed way the folding process under real biological conditions.

There are four main basic models to explain the folding process, the hydrophobic collapse, the framework, the nucleation and the energy-landscape models. The hydrophobic collapse model hypothesizes that the native protein conformation is formed by a rearrangement of a compact collapsed structure [46, 47]. The hydrophobic collapse (determined by hydrophobic forces) is a result of the repulsion between hydrophobic side chains of the protein and the hydrophilic water molecules of the environment. This model is generally used to describe the initial stage of protein folding. This model introduces the term ’molted globule’ that is defined as a compact, partially folded protein that has native-like secondary structure and backbone folding topology, but lacks the extensive, specific side-chain packing interactions of the native structure [48]. According to this model after the apparition of the molted globule the protein suffer some re-arrangements in the structure to generate the native structure.

The framework model explains the protein folding process as a hierarchical process where it describes a step-wise mechanism to greatly narrow the conformational search [49, 50]. This involves an assembly where local elements of secondary structure are formed according to their primary structure, but independent from tertiary structure. This model consists of four main steps; the first step is represented by the unfolded chain, next the secondary structures are formed and fluctuate around its native position, next the secondary structures interact with each other and a native-like secondary structure and a folding pattern are found, finally the native 3D structure is obtained.

The third model is the nucleation theory model; it establishes that in the early stages of folding there are possible folding nuclei in the protein structure, which initiate the folding. These nuclei are usually those parts of the protein, which have residues very close to each other not only sequentially, but also structurally in the native protein structure. The nucleation model describes the continual formation of distinct unstable structures which are formed as fast as they are destroyed, although at certain point a sufficiently stable structure is formed to develop the native structure [16].

A recent unified model called the new view in which protein folding is represented for different paths in a landscape has been proposed. These landscapes could have a funnel shape and a high level of irregularity. In this model, the folding is closer to a native state if it is lower in the landscape. Different theories and experimental contributions have been carried out to corroborate the model [51]. Those experiments have also shown the evolution of the process as a function of the folding energy and represents the different thermodynamic and kinetic variations in the landscape, then different variations in energy will let us get closer to the native state of Chapter 2. Protein Folding Prediction: A survey 21 the protein. The postulate deviated markedly from the old pathway doctrine and improved our understanding at the conceptual level [52].

The new vision has replaced progressively the ideas of folding paths. The metaphor of landscape provides us a conceptual frame to understand both kinetics of folding: two and multiple states. Experimental results support the new vision theory and prove that the polypeptide chain (ini- tially unfolded) folds through of an heterogeneous population of intermediate folded structures in a fluctuant equilibrium [45, 51]. If it is assumed that proteins adopt a native conformation or a unique three-dimensional structure that is near to a global free-energy minimum [53], the problem of finding a native conformation could be divided into two different but complementary sub-problems. The first one consists of developing an accurate potential energy function used to evaluate possible conformations and the second is developing an efficient protocol to search conformations in an energy landscape. In [32], it is stated that in order to allow a rapid and an efficient search in the conformational space, only a subset of the atoms in the protein chain should be explicity represented; the potential functions must then include terms that reflect the averaged-out effects of the omitted atoms and solvent molecules.

Bonneau et. al. [7] stated that in spite of the big progress and evolution that ab-initio methods have had, many issues must still be resolved if a reliable prediction scheme is developed. For example no one method performs consistently across all classes of proteins, the accuracy and reliability of models produced by de novo methods is much lower than that of comparative models based on alignments with more than 30% sequence identity, all methods do not have good prediction scores trying to predict sequences longer than 150 residues in length. On the other hand, in [5], for roughly 40% of proteins shorter than 150 amino-acids that were examined, one of the five most commonly recurring models generated by Rosetta has sufficient global similarity to the true structure to recognize it in a search of the protein structure database. Then they concluded that reasonable models can, in some cases, be produced for domains of even very large proteins by using multiple sequence alignments to identify domain boundaries. Additionally, it is concluded that the accuracy of de novo models is too low for problems requiring high-resolution structure information. Instead, the low-resolution models produced by these methods can reveal structural and functional relationships between proteins not visible from their amino-acid sequences and provide a framework for analyzing spatial relationships between evolutionarily conserved residues or between residues shown experimentally to be functionally important.

2.6.1 Conformation Representations

To overcome some of the sampling conformation problems, most of the methods for protein structure prediction involve some significant complexity reduction. Methods for reducing protein structure to discrete low-complexity models can be divided into two major classes: lattice and off-lattice models [7]. Chapter 2. Protein Folding Prediction: A survey 22

2.6.1.1 Lattice Models

Lattice models have a fundamental theoretical relevance due to their analytical and computa- tional simplicity [46]. Specifically the evaluation of energies on a lattice can be achieved quite efficiently, and methods involving exhaustive search of the available conformational space be- come feasible [54, 55, 56]. Although these methods cannot be applied directly to real proteins due to their restricted ability to represent subtle geometric considerations and to reproduce the backbone with accuracy greater than approximately half the lattice spacing [57], when carefully parameterized lattice models can be applied to structure prediction and can give encouraging results [58, 59]. The form of the potential and the use of a lattice leads to a highly oversimpli- fied model, but the results of such simulations have provided important information on possible protein folding scenarios [60, 61].

One of the most used lattice models is the HP model, which takes into account hydrophobic interactions as the main driving forces in protein folding, involving only attraction-interaction [62]. In the HP model [63], each amino-acid is represented as a bead, and connecting bonds are represented as lines. Then a protein is composed of a specific sequence of only two types of beads, H (bead-Hydrophobic/non-polar) or P (bead- hydrophilic/Polar); in other words, the 20 amino-acids can be divided into two classes: H and P.

In the essence of lattice models an automated procedure that reduces the dimensionality of protein structure by simplifying the representation of the primary sequence is performed. Ex- plicitly HP models transform the 20 letter AA alphabet into a two letter hydrophobic/polar (HP) alphabet [63]. Although restrains to the residue location of the target protein on a lat- tice could be performed [64], sometimes they are applied to real non-constrained proteins. In [59], the performance of several learning methods applied to predicting the coordination num- ber for lattice-based proteins, over real proteins with either HP alphabet or AA alphabet, was compared, as a result, it was reported that the gap between the performance in HP and AA alphabets is significant.

In [18], authors recognize two types of lattice model simulations, aimed at two distinct objectives. The first type [65] was designed to understand the basic physics governing the protein folding process, where a key feature of this type of lattice model is its simplicity. Through lattice model simulations, Wolynes and coworkers [16, 51] postulated their hypothesis about the existence of a funnel-like energy landscape which guides the proteins toward their native structures. Using lattice models Dill et. al. [46] emphasized the importance of hydrophobic interactions. The second class of lattice models have the characteristic of have been used with real folds and they are parameterized using real proteins as templates by statistical sampling of the available structures [66, 67] and are referred to statistical potentials or knowledge based potentials. Some of the most representative examples of these models can be found in [59, 67]. Chapter 2. Protein Folding Prediction: A survey 23

Among the most popular candidates to lattice based on space filling, the cubic, hex prism, truncated octahedron, cube octahedron and truncated octahedron are the most used models (See Figure 2.4). Regarding the vector based models, the face centered cube (FCC), side plus FCC and the extended FCC could be mentioned [68, 69]. The ideal attributes for these models are, a edge length of 3.8Ao, minimum distance between any two vertices of 3.8Ao, supporting mainly 90o and 120o angles.

Figure 2.4: Common lattice models

2.6.1.2 Off-lattice Methods

Most off-lattice reduced complexity models fix all side chain degrees of freedom and all bond lengths. Although some algorithms use multiple representations, there are few conformation- representations which are commonly used [21]: all-atom three-dimensional coordinates [70, 71]; all-heavy-atom coordinates; backbone atom coordinates plus side-chain centroids [72, 73]; Cα coordinates; and backbone and side-chain torsion angles [21]. Developing methods which are able to reliably differentiate native states from non-native ones has been a clearly necessity in the computational study of protein folding process [18]. In protein structure prediction the most used approaches have been based on residue-level models with statistical potentials obtained from the PDB [30]. A growing tendency in the community has been the development of atomic-level statistical potentials [74, 75] in attempts to improve the accuracy. An improved level of accuracy has been obtained through a combination of an all-atom representation of protein and a continuum model of solvent [71]. These models find a good equilibrium between the accuracy of the representation and the computational cost Chapter 2. Protein Folding Prediction: A survey 24 because they can significantly reduce the number of particles included in the calculation even after considering the overhead due to the added complexity of the continuum solvent model. An all atom representation is a lower level of abstraction of the real protein conformation because it guarantees the generality and allows further extension. Generally, their parameters are obtained through high-level quantum mechanical calculations on short peptide fragments [18], where the availability of more accurate quantum mechanical methods allows further refinement.

Direct simulation of protein folding using an all-atom model has been called the ”holy grail” because it tries to elucidate the folding mechanism and the necessary protein processes to reach their native state [76]. In[77], it is enunciated that complementary approaches are essential to obtain a detailed understanding of the events occurring during the folding of even a simple protein. Given the inherent complexity of all-atom simulations, both with implicit and explicit solvent, they generally focus on specific problems [78].

2.6.2 Scoring Functions

In [18], it is emphasized that once a model for representing the protein that sufficiently reduces the complexity of the conformational search is chosen, a scoring or energy function that works in the chosen low-complexity space must be developed. The energy function must adequately represent the forces responsible for protein structure. Additionally, they must be efficient making it computationally feasible (given the huge number of energy evaluation needed in exploring the search conformation space).

To come up with some good functions it would be natural to use quantum mechanics, but it is too computationally complex to be practical in modeling large systems, then classical physics is a common approach to overcome the computational limitations [21].

Computational approaches to potential energy may be divided into two broad categories: quan- tum mechanics [79] and molecular mechanics [80]. In the same way two different types of force fields have been widely studied, low and high resolution methods. The basis for the division of computational approaches depends on the incorporation of the Schr¨odingerequation or it’s ma- trix equivalent, where these methods support one another attempting to understand chemical and biological behavior at a molecular level. The determination of feasibility of the meth- ods is mainly determined by the complexity of the problem, its time constraints, its computer requirements, among other limiting factors [81].

In [82], it has been stated that the application of computer-based models using analytical potential energy functions within the framework of classical mechanics has proven to be an increasingly powerful tool for studying molecules of biochemical and organic chemical interest. Most typical all atom energy functions have the form shown in Equation 2.6.2, where R is the vector representing the conformation of the molecule, typically in Cartesian coordinates or in Chapter 2. Protein Folding Prediction: A survey 25 torsion angles, and [others] refer to specific terms such as hydrogen bonding in biochemical systems.

P P P P E(R)= bonds B(R) + angles A(R) + torsions T (R) + non−bonded N(R) + [others] (2.1)

Most molecular mechanic equations are similar in the terms they contain. However, there are some differences in the forms of the equations that can affect the choice of a force field and parameters for the systems of interest [81]. The sum of terms which describes the total energy can be relatively simple or extremely complex. AMBER [83, 84] and CHARMm [85] are two famous molecular mechanic equations for biological molecules.

In [86], it is clearly stablished that low resolution solvation-based scoring schemes favors place- ment of hydrophobic amino-acids at buried positions and of hydrophilic acids at exposed posi- tions. Then, these uncomplicated scoring schemes can be functional in the prediction of small hydrophobic cores. In [86], it is determined that most low-resolution potential-scoring functions utilize an empirically derived pair potential in place of or in addition to the residue-environment term. Although, the most common of these potentials are functions of position of a single center per residue, Cα or centroid/united atom center, all-atom functions have also been used . Sequence independent scoring functions can also be used to explain several protein features (association of beta strands as sheets). Multiple sequence alignments on homologous sequences available for protein families could be used, followed of a contact prediction based on covariance patterns in these alignments. In [18] authors stated that low-resolution methods try to nar- row the possible conformations from an exponentially large number to a number small enough that more computationally expensive methods can be applied. However, reduced complexity approaches cannot be expected to consistently generate predictions with resolutions of better than 3-7 Ao, it is necessary to make some improvements in high-resolution potentials to achieve significant progress in the field of ab initio protein fold prediction.

2.6.3 Search Conformational Methods

The task of determining a protein’s native state can be treated as a search for the lowest energy structure in a vast conformational space. Several algorithms currently used for protein structure prediction combine domain information with local search techniques to avoid the complexity of high-dimensional conformation spaces. Table 2.1 (taken from [87]) reports a comparison of different protein folding models. Chapter 2. Protein Folding Prediction: A survey 26

Table 2.1: A comparison of protein folding models Approach Landscape Paths Path Quality Comp. Time Native-St needed Molecular dynamic No 1 Good Long No Monte Carlo No 1 Good Long No Statistical Model Yes 0 N/A Fast Yes PRM-based Yes Many Approx Fast Yes Lattice Models N/A N/A N/A N/A N/A

2.6.3.1 Molecular Dynamics

Molecular dynamics (MD) methods attempt to predict protein structure by simulating the folding process that occurs in nature [88]. This is achieved by simulating the interactions of all forces acting between atoms of the protein and a solvent. The interactions are modeled using Newton’s law at time steps equivalent to atomic thermal vibrations [89].

Molecular dynamics [90, 91] relies on information about atomic physiochemical interactions, in conjunction with a simulation of the equations of motion of the entire physical system [92, 93]. Specifically, molecular dynamics methods use the values of the energy and its analytical first derivative to simulate the motion of the atoms in the presence of thermal energy numerically integrating Newton’s equations of motion for the polypeptide chain.

Molecular dynamics yield information about the amplitudes and rates of atomic motion on the picoseconds time-scale, and it has been shown that MD performs an anneal process of conformations in which the unfavorable interactions that remain after energy minimization are removed using the thermal energy to hop over small energetic barriers [93]. Subsequent energy minimization is then able to reach a lower energy value [94].

The folding trajectories obtained by molecular dynamics simulations are generally in agreement with experimental evidence. MD problems significantly reduce the likelihood that the native state will be found at the global free-energy minimum using current potentials. Although MD methods have been used for a long time, MD methods have applicability in current research where encouraging progress has been made in the active application of molecular dynamics simulations to study the folding process.

In [89], it is clearly stated that evaluating the forces acting upon all molecules at every time step of a folding protein is biologically accurate, but computationally prohibitively expensive. Advances in simulation strategies and the increasing computer power have considerably extended simulation times; for example the IBM’s Blue Gene project [95] has implemented some software and hardware technologies in order to overcome the inherent problems of MD.

Like other local search techniques, MD is susceptible to spend a significant amount of compu- tation in local minima. To escape from a minimum, thermal vibrations must sample a state outside of the minimum. A second and perhaps a more serious problem to overcome is the inad- equacy of current potential functions for macromolecules in water and derived in an uncertainty Chapter 2. Protein Folding Prediction: A survey 27 of the parameters used in molecular mechanics potentials. Given the fundamental importance of water in protein folding, the explicit representation of both proteins and solvent is expected to play an increasingly role in our understanding of protein folding mechanisms. Another prob- lem is the representation of electrostatics, where the large difference in the dielectric properties of the solvent and the protein, and the uncertainties in the magnitude and location of atomic partial charges are considerable challengers [86].

With the promise of even more accurate models and considerably faster computer speed, to- gether with the advancement of experimental approaches, the goal of understanding the folding of proteins must be achievable in the near future [18], but even with the continued exponential increase in computational power, it is unlikely that MD will be a practical solution to protein folding for several decades. Currently the best use of MD methods may be in refining and discriminating among models produced by lower-resolution methods.

2.6.3.2 Statistical Mechanics

The mechanical methods could be thought of a feed back to physics where proteins are com- pletely described by the physico-chemical properties of their constituent atoms and bonds. Computational statistical mechanics is in charge of calculating the dynamics by repeated in- tegration of the forces acting on each atom, and as the other methods, a minimum energy conformation is assumed to be the native state. The determination of functions that describe the forces acting on the atoms, the use of numerical integration methods to calculate the motion of the atoms due to the forces acting on them and the long time propagation of the equations of motion are the three main aspects of mechanical methods.

However, statistical mechanical methods have also been successful in studying protein folding and they have provided estimates of the transition state ensemble, folding rates, and phi-values, approaches applied to analyze the folding assume extremely simplified molecular interactions and are limited to studying global averages of folding kinetics. For example in the study of large protein a very simplified energy function that depends only on the topology of the protein’s native state and, hence, are not as accurate as the distance from the native state increases. Given that the determination of the density of states of the given protein is a very difficult task with conventional methods, the main task in the statistical mechanical treatment of protein folding is the determination of that density [96].

2.6.3.3 Monte Carlo Methods

In [97], the term Monte Carlo method is defined as any member of a very large class of com- putational methods that use randomness to generate ’typical’ instances of a problem under investigation. Additionally, authors stated that typical instances are generated because it is im- practical or even impossible to generate all instances, where a set of typical instances is supposed Chapter 2. Protein Folding Prediction: A survey 28 to help us learn something about a problem of interest. The expression ’Monte Carlo method’ is actually very general, but it could be established that they are stochastic techniques (mean- ing that they are based on the use of random numbers and probability statistics to investigate problems).

The term Monte Carlo Method was initially coined in [98] by S. Ulam and Nicholas Metropolis in reference to games of chance, a popular attraction in Monte Carlo, Monaco. Since then Monte Carlo techniques have been widely used in different problems where the protein folding problem has not been the exception. Monte Carlo methods are the most commonly used search methods of conformations in an energy landscape. Many improvements as the replica Monte Carlo [99], multiplexed-replica exchange method [100], parallel tempering [101], jump walking [102], multi-canonical jump walking [103], smart walking [104], the multi-canonical ensemble method [105], local energy flattening [106], importance sampling [107], sampling-importance resampling [108], methods based on weighted histograms [109], entropic sampling [110], space annealing [111], tabu search [112], simulated annealing [113], genetic algorithms [55, 58], and others have been proposed.

Some of the Monte Carlo methods are inspired in biological theory and use simulated evolution processes to find the solution or an approximately one to NP-complete or complicated compu- tational problems that are difficult to solve using deterministic or conventional methods. Two of those methods are genetic algorithms and artificial immune systems which have been used in the protein folding problem.

Genetic algorithms are systematic methods based on biological evolution used to solve search and optimization problems. A population of individuals that typically represent the solutions to a particular problem is evolved, based on the survival of the fittest principle and intro- ducing genetic variation in the individuals of the population. Thus individuals are encoded as chromosomes, and genetic operators are applied over them to introduce changes that al- low an exploration of the solution search space. The individuals compete among them, and the environment that consists of other possible solutions produces a selective pressure over the population to favor survival of the fittest individuals. Genetic information in the chromosomes will be preserved and transmitted to the next generations [114, 115]. Some variants of genetic algorithms used in protein folding are simple genetic algorithm (sGA), messy GA (mGA), fast messy GA (fmGA), linkage learning GA (LLGA) and Multiobjective fmGA (MOfmGA). The different complexities estimate for serial GAs is shown in the Table 2.2 taken from [15], where l is the length of chromosome, n is the size of population, q is group size for tournament selection, g is the number of generations.

In [116], an artificial immune systems is defined as the use of immune system components and processes as inspiration to construct computational systems. Moreover in [62], it is detailed that artificial immune systems represent a field of biologically inspired computing that attempts to exploit theories, principles, and concepts of modern immunology to design immune system-based Chapter 2. Protein Folding Prediction: A survey 29

Table 2.2: Complexities estimate for serial GAs Phase sGA ssGA mGA fmGA Initialization O(ln) O(ln) O(lk) O(l) Recombination O(g ∗ n ∗ q) O(g) O(lk) O(l) Primordial O(0) N/A O(0) O(l2)c Juxtapositional N/A N/A O(l log l) O(l log l) Overall mGA O(ln) O(ln) O(lk) O(l2) applications in science and engineering. The artificial immune system is applied in a protein folding problem to try to solve or to find approximate solutions to the NP-complete problem of search in the space of conformations of the landscape. Then, the idea is to use an evolutionary approach with different evolutionary operators as cloning, hypermutation, hypermacromutation, aging and others to find low-energy conformations of peptides [117].

2.6.3.4 Probabilistic Road Maps

The roadmap is a global approach to build a representation of the connectivity of a configuration space as a network of curves [118]. Probabilistic roadmap (PRM) methods try to capture the connectivity of a high dimensional space via random sampling using a graph. A major strength of PRM is that they are simple to apply, even for problems with highdimensional configuration spaces, requiring only the ability to randomly generate points in C-space, and then test them for feasibility.

PRM methods try to accomplish the motion planning’s goal of computing a sequence of valid intermediate states that transform a given initial state into some desired final state. Then, the search of conformation space in the protein folding problem could be simulated using PRM because it could generate an ensemble of pathways rather than an individual one to be consistent with the protein folding kinetics, where it replaces a single folding pathway with an energy landscape and a folding funnel. The general idea of PRM method for protein folding is to build an approximate map of a protein’s potential energy landscape. This map contains thousands of feasible folding pathways to the known native state enabling the study of global landscape properties [119].

In robotics, given a description of the environment and a robot, the motion planning goal con- sists of finding a feasible path that takes the robot from a given starting point to a given goal configuration. This problem is different from the protein structure prediction PRM approach in that in PF the traditional collision-free constraint is replaced by a preference for low energy con- formations favoring the transitions from configurations with higher potential to configurations with lower potential. It is also different by the fact that in PF it is not sufficient to find any feasible path connecting the start and goal configurations, it should find the most energetically favorable paths [120]. Chapter 2. Protein Folding Prediction: A survey 30

In [87, 121] the authors give a general description of how a probabilistic road map method to protein structure prediction works. First some points should be sampled in the protein’s conformation space where the sampling is biased to increase density near the known native state. Then, these points are connected to form a graph or the roadmap. Weights are assigned to directed edges to reflect the energetic feasibility of transition between the conformations corresponding to the two end points. Finally, folding pathways are extracted from the roadmap using standard graph search techniques.

Finally, in [122] it is claimed that PRM methods obtain a better statistically characterization of molecular motion properties and they allow scientists to study some important properties of the folding process, such as secondary structure formation order, and kinetic measures such as folding rates or population kinetics. Chapter 3

The Proposed Multi-objective Ab-Initio Approach

As stated before, there are four main tasks that should be considered in the development of an ab-initio model.

1. Representation of the conformations

2. Cost functions

3. Search conformation procedures

4. Metrics to evaluate similarity

The following subsections deal with each of these tasks, and explain in detail the proposed schemes.

3.1 Representation of the conformations

The tertiary structure of a protein is the spatial structure of its polypeptide chains. In principle, this structure is given by the spatial coordinates of the centers of all atoms in the protein. The tertiary structure of a protein depends on the shape of its backbone and the positions of the R groups relative to the backbone chain. Then, in general, the degrees of freedom in a structural representation are the backbone and side-chain torsion angles (φ, ψ, χ, ω).

Lattice and off-lattice models are used to represent the conformations and to overcome the high complexity in sampling protein conformations. Given that this thesis is proposed to work in an atomic level, the off-lattice models are the obvious option to choose. In off-lattice models few conformation-representations are commonly used:

• All-atom three dimensional coordinates.

31 Chapter 3. The Proposed Multi-objective Ab-Initio Approach 32

• All-heavy-atom coordinates.

• Backbone atom coordinates plus side-chain centroids.

• Cα coordinates.

• Backbone and side-chain torsion angles

Additionally, there are four main formats to implement the chosen structural representations. Each one of them may have its own advantages in representing certain types of structural data or computing the structure with certain types of methods (see Figure 3.1).

1. Algebraic Representation: Cartesian coordinates.

The algebraic representation makes reference to computing three coordinates (x,y,z) for each atom. This is the most straightforward method to represent a molecular structure. This is a useful method because the potential energy functions can be expressed, evaluated, or minimized more easily when a structure has been defined in terms of cartesian coordi- nates. Moreover, structural changes or dynamics are described in terms of the coordinates of the atoms as well.

2. Geometric Representation: Interatomic distances.

The geometric representation is based on the computation of the inter-atomic distances between protein’s atoms. The main advantage of this representation consists of the fact that inter-atomic distances are known for certain pairs of atoms, such as the bond lengths between the bonded atoms, where the distances between hydrogen atoms can be detected by experimental methods. Then, it is clear that a protein structure could be determined if the distances between all pairs of atoms are given.

3. Trigonometric Representation: Dihedral angles.

The trigonometric representation is based on the idea that the distances and the angles could determine the structure of a protein. The main advantage of the trigonometric representation relies on the fact that each residue type requires a fixed number of torsion angles to fix the three-dimensional coordinates of all atoms in a protein.

4. Probabilistic Representation: Electron density distribution.

The probabilistic representation uses entropy methods as a tool for the reconstruction of the electron density function based on information obtained from the X-ray crystallogra- phy technique.

The trigonometric representation was chosen to model and compute the backbone and side-chain torsion angles of a protein. The internal coordinate representation (torsion angles) Chapter 3. The Proposed Multi-objective Ab-Initio Approach 33

Figure 3.1: Formats of structural representation was chosen based on the fact that each residue requires a fixed number of torsion angles to fix the three-dimensional coordinates of all atoms. Additionally, the trigonometric implementation was used given its facility to simulate conformational features (geometry of the peptide bond) and steric constrains. Furthermore, this is the easiest and cheapest (with respect to computational resources) manner to represent a protein conformation in the chosen search space method. For example, for a single aminoacid as tryptophan, 7 dihedral angles are neccesary at most in the trigonometric representation, meanwhile, using cartesian or geometric representation, a top of 27 atoms or 26 distances must be computed, respectively.

Even if in the proposed approach, the trigonometric respresentation was the chosen represen- tation model, an algorithm to convert from trigonometric to cartesian models was used. This algorithm is necessary given two main facts: i) The used potential energy functions are evalu- ated based on cartesian coordinates. ii-) The final protein conformations are reported using a file format based on cartesian coordinates (i.e., PDB).

The tertiary structure of a protein rely on the shape of its backbone, which depends on the geom- etry of the peptide bonds and Cα atoms. The conventional definitions and notation concerning the geometry of amino acids and peptide bonds are presented in Figure 3.2 [1]. Chapter 3. The Proposed Multi-objective Ab-Initio Approach 34

Figure 3.2: The planar peptide group

To specify the conformation of an amino acid unit in a protein chain, it is necessary to specify torsion angles about both of the two single bonds provided for the Cα to the chain. These torsion angles are indicated by the symbols φ (around Cα − N bond) and ψ (around Cα − Cobond). By convention, both φ and ψ are defined as 180o when the polypeptide is in its fully extended conformation and all peptide groups are in the same plane.

Even if an astronomical number of possible conformations is generated by the rotation of the protein torsion angles, there are some steric constraints on the torsion angles of a polypeptide backbone that limit the number of permissible conformations. One of the most common con- straints is the fact that no two atoms may approach one another more closely than it is allowed by their van der Waals radii, i.e., φ and ψ can have any value between −180o and +180o, but many values are prohibited by steric interference between atoms in the polypeptide backbone and amino acid side chains.

The whole range of possible combinations of φ and ψ are plotted (φ versus ψ) in a conformational map (see Figure 3.3) indicating the allowable combinations of the two angles within the blocked areas.

In Figure 3.3, the shaded dark blue areas reflect conformations that involve no steric overlap and thus are fully allowed; medium blue indicates conformations allowed at the extreme limits for unfavorable atomic contacts; the lightest blue area reflects conformations that are permissible if a little flexibility is allowed in the bond angles. The asymmetry of the plot results from the L stereochemistry of the amino acid residues.

In the internal coordinate representation the degrees of freedom in the conformations were fixed to handle the φ, ψ and χi torsional angles. The ω angle is not taken into account because its bond lengths and angles are fixed at their ideal values. The bond angles and interactomic distances used in this work are shown in Table 3.1 and Table 3.2 respectively. The number of χ angles used depends on the residue type (see Table 3.3). Chapter 3. The Proposed Multi-objective Ab-Initio Approach 35

Figure 3.3: Ramachandran plot

Table 3.1: Ideal Values of Bond Angles Bond Angles Radians O − C − N 2.1457 C − N − Cα 2.1293 Cα − N − O 2.0245 Cα − C − O 2.108

Table 3.2: Ideal Values of Interatomic Distances Atomic Bond Distance (Armstrongs) O − C 1.231 O − N 1.33 N − Cα 1.45 O − Cα 1.52

Table 3.3: Number of χ angles Residue χ angles GLI, ALA, PRO main chain SER, CYS, THR, VAL χ1 ILE, LEU, ASP, ASN, HIS, PHE, TYR, TRP χ1, χ2 MET, GLU, GLN χ1, χ2, χ3 LYS, ARG, χ1, χ2, χ3, χ4

In order to reduce the size of the conformational space, backbone torsion angles are bounded in regions derived from secondary structure prediction. Also, side-chain torsion angles are constrained in regions derived from a backbone-independent rotamer library. The protein con- formations are still greatly flexible under these constraints, and the structure can take on various conformations that are vastly different between them, and the native conformation. Chapter 3. The Proposed Multi-objective Ab-Initio Approach 36

3.1.1 Secondary Structure Prediction

Secondary structure can be described as the local conformation of the polypeptide backbone. This local conformation plays a crucial role in the formation of a protein native structure. There are two main frequent backbone configurations named α-helix and β-sheet. In tertiary structure, the elements of secondary structure are folded forming an almost solid compact structure that is stabilized by weak interactions that involve polar and nonpolar groups; the three-dimensional structure of amino acids is obtained after proteins fold into their native states.

An α-helix is formed by hydrogen bonds between residues that are close to each other in the sequence. In contrast, β-sheets are formed by hydrogen bonds between two parts of the polypep- tide chain that come close to each other in the three dimensional space, but may be far away from each other in the sequence.

Secondary structure prediction methods can be categorized in four different generations. First generation methods were based on propensities of single residues, i.e., they were based on single amino acid propensities for finding a specific amino acid in a specific structural element. Second generation methods were based on propensities of segments as opposed to isolated amino acids. In third generation methods, information from homologue sequences to the query sequence and state of the art machine learning methods were used. In fourth generation approaches, a matching between secondary and tertiary protein structure was used; in other words, informa- tion about 3D protein conformation was added to secondary structure predictive methods (see Figure 3.4).

Predicting a protein secondary structure consists of the classification of its amino acids as belonging to either helices(H) or sheets (E) or coils (C). In the proposed approach the PSIPRED (Protein Structure Prediction Server) [123] was used to obtain the secondary structure of the proteins. PSIPRED is a highly accurate method for protein secondary structure prediction, and it is one of the best performing prediction methods. PSIPRED uses information from sequences that are homologues of the query sequence (third generation methods). Specifically, PSIPRED incorporates two feed forward neural networks which perform an analysis on output obtained from PSI-BLAST.

In this thesis, backbone torsion angles are bounded in regions derived from secondary structure prediction as it is shown in Table 3.4.

Table 3.4: Secondary Structure Constraints Structures φ ψ H(α − helix)[−39o, −67o][−16o, −57o] E(β − strand)[−130o, −110o] [110o, 130o] C(coil)[−180o, 180o][−180o, 180o] Chapter 3. The Proposed Multi-objective Ab-Initio Approach 37

Figure 3.4: Secondary Structure Prediction Approach

3.1.2 Rotamer Libraries

Rotamers are conformers arising from restricted rotation about one single bond. Then, rotamers are usually defined as low energy side-chain conformations. A library of rotamers is a source of rotamer torsion values and probabilities. For given phi and psi (φ and ψ) angles of the backbone of a specific residue, the rotamer library contains the rotamers which have been observed in some collection of crystal structures, along with the frequency at which they are observed. Rotamers are generated by a systematic conformational search, where different side-chain conformations of a particular amino-acid are generated. Side-chain conformations are generated by systematically rotating bonds in a side-chain by discrete increments.

The core assumption for using a rotamer library, in protein structure methods, is that there is enough data in the PDB for each side-chain conformation to attempt statistical determination of conformation preferences. Then, the exponential growth of the number of sequences reported in the PDB and the growth in the number of predicted high-resolution protein structures have been fundamental to improve the accuracy of the rotamer libraries.

In protein structure prediction methods, the use of a rotamer library allows a structure to be determined or modeled trying the most likely side-chain conformations, saving time and producing a structure that is more likely to be correct. In this thesis two rotamer libraries were considered. Specifically, the libraries reported in [124] and [125] were studied to generate the rotamers reported in Table 3.5. Chapter 3. The Proposed Multi-objective Ab-Initio Approach 38

Table 3.5: Rotamer library values Residue χ1 χ2 χ3 χ4 ARG [−177o, 62o][−167o, 180o][−65o, 180o][−175o, 180o] LYS [−177o, 62o][−68o, 180o][−68o, 180o][−65o, 180o] MET [−177o, 62o][−65o, 180o][−75o, 180o] N/A GLU [−177o, 70o][−80o, 180o][−60o, 60o] N/A GLN [−177o, 70o][−75o, 180o][−100o, 100o] N/A ASP [−177o, 62o][−60o, 65o] N/A N/A ASN [−177o, 62o][−80o, 120o] N/A N/A ILE [−177o, 62o][−60o, 170o] N/A N/A LEU [−177o, 62o] [65o, 175o] N/A N/A HIS [−177o, 62o][−165o, 165o] N/A N/A TRP [−177o, 62o][−105o, 95o] N/A N/A TYR [−177o, 62o][−85o, 90o] N/A N/A PHE [−177o, 62o][−85o, 90o] N/A N/A PRO [−30o, 30o] N/A N/A N/A THR [−177o, 62o] N/A N/A N/A VAL [−60o, 175o] N/A N/A N/A SER [−177o, 62o] N/A N/A N/A CYS [−177o, 62o] N/A N/A N/A

3.2 Cost Functions (Potential Energy Functions)

It is difficult to perform a detailed study of the relationship between protein structure and function using three dimensional structures from experimental techniques (such as X-ray crista- lography and NMR). These structures are predominately presented as static in nature where only a minimum of information concerning the relationship of structure to energetics is experi- mentally accessible. Then, it is claimed that theoretical studies of proteins based on empirical force field calculations can overcome those limitations [126].

A potential energy function is an equation that relates mainly structure to energy. In order to be effective, the parameters that have to be input into the equations have been optimized to represent real chemical systems. After that step, the force field could be used for energy minimization and MD simulations.

The use of a cost or energy function is necessary in order to evaluate the structure of a confor- mation. In terms of the searching process (in the energy landscape), a cost function is necessary to compare any new solution with the solutions found at the moment. This comparison is performed using the values returned by the energy function based on the conformation of the molecule. These returned values attempt to represent the actual physical forces and chemi- cal reactions occurring in a protein in an atomic detail. Furthermore, this empirical functions provide information about the relationship of protein structure and function.

Most typical all atom energy functions have the form shown in Equation 3.2, where R is the vector representing the conformation of the molecule, typically in Cartesian coordinates or in Chapter 3. The Proposed Multi-objective Ab-Initio Approach 39 torsion angles and [others] refers to specific terms such as hydrogen bonding in biochemical systems.

P P P P E(R)= bonds B(R) + angles A(R) + torsions T (R) + non−bonded N(R) + [others] (3.1)

In general, equation 3.2 could be separated into two groups, mainly, the internal terms, including the bond, angle, and dihedral contributions, and the non-bonded or external terms that include the electrostatic (or coulombic) and van der Waals (vdW) terms. Then, molecular mechanics is an approach to represent the energy surface for bond and non-bonded interactions with a simple, analytical and easily differentiable energy function.

Although an accurate measure of protein conformation must imply considering quantum me- chanics principles, it is a too computationally complex method to become practical. Numerous approximations are introduced which lead to certain limitations. However, the simulations be- come much more tractable when turning to empirical potential energy functions, which are much less computationally demanding.

In this work, the molecular modeling package TINKER is used (available at http://dasher. wustl.edu/tinker/). Tinker is a general package for molecular mechanics and dynamics. Tinker makes available the use of any of several force fields such as Amber (ff94, ff96, ff98 and ff99), CHARMm (19 and 27), Allinger MM (MM2-1991 and MM3-2000), OPLS (OPLS-UA, OPLS-AA and OPLS-AA/L), Merck Molecular Force Field (MMFF), Liam Dang’s polarizable potentials, and their own AMOEBA polarizable atomic multipole force field. From those, there are two widely used molecular mechanics equations, CHARMm and Amber.

3.2.1 CHARMm

The value of the energy in the Chemistry at HARvard Macromolecular Mechanics (CHARMm) function is calculated as a sum of internal terms, which describe the bonds, angles and bond rotations in a molecule, and a sum of external terms, which account for interactions between non-bonded atoms or atoms separated by 3 or more covalent bonds. The CHARMm energy function has the form shown in following equation. Chapter 3. The Proposed Multi-objective Ab-Initio Approach 40

X 2 X 2 X 2 X ECharmm = kb(b − b0) + kU B(S − S0) + kθ(θ − θ0) + kχ[1 + cos(nχ − δ)] bonds UB angles torsion

" 12  6# X 2 X R minij R minij qiqj + kimp(φ − φ0) + εij − + rij rij erij impropers non−bond where

• b is the bond length, b0 is the bond equilibrium distance and kb is the bond force constant.

• S is the distance between two atoms separated by two covalent bonds, S0 is the equilibrium

distance and kUB is the Urey Bradley force constant.

• θ is the valence angle, θ0 is the equilibrium angle and Kθ is the valence angle force constant.

• χ is the dihedral or torsion angle, kχ is the dihedral force constant, n is the muliplicity and δ is the phase angle.

• φ is the improper angle, φ0 is the quilibrium improper angle and kimp is the improper force constant.

• εij is the Lennard Jones well depth, rij is the distance between atoms i and j, R minij

is the minimum interaction radius, qi is a partial atomic charge and e is the dielectric constant.

3.2.2 Amber

AMBER-99 is the third generation update to AMBER-94, including updated parameters for both amino and nucleic acids. The primary changes for peptides/proteins is in the torsional potentials. Amber represents a class of minimalist functions, where the bond and angles repre- sented by a simple diagonal harmonic expression, the VDW interaction represented by a 6-12 potential, electrostatic interactions modeled by a Coulombic interaction of atom-centered point charges, and dihedral energies represented with a simple set of parameters, often only specified by the two central atoms. The amber energy function has the form depicted in the following equation.

X X X Vn E = K (r − r )2 + K (θ − θ )2 + [1 + cos(nφ − γ)] total r eq θ eq 2 bonds angles dihedrals " # " # X Aij Bij qiqj X Cij Dij + 12 − 6 + + 12 − 10 R R Rij R R i

3.3 Search Conformation Procedure

Numerous problems encountered in computational biology can be formulated as optimization problems. Traditionally, optimization is conducted with respect to a single objective function, but the possibility of optimizing multiple objectives simultaneously has become of great interest. In [127] a categorization based on the different types of contexts in which a multi-objective approach may be usefully exploited in biological problems is presented.

The protein structure prediction problem has been usually approached as a single-objective optimization problem using a single-objective potential energy function. In this work a multi- objective approach is proposed in order to evaluate the suitability of facing the PSP problem as a multi-objective optimization problem (MOOP).

3.3.1 Multi-objective Optimization

The task of finding one or more optimal solutions in an optimization problem, where the problem involves more than one objective function is known as a multi-objective optimization problem. Typically, the objectives estimate very different, incommensurable and often in conflict aspects of the solution. A general multi-objective optimization problem can be defined as follows:

Definition 3.1 (a multi-objective optimization problem – MOOP). Finding a vector T x = [x1, x2, . . . , xn] which:

i) satisfies the r equality constraints, hi(x) = 0, 1 ≤ i ≤ r,

ii) is subject to the s inequality constraints, gi(x) ≥ 0, 1 ≤ i ≤ s,

T iii) and which optimizes the vector function, z = f(x) = [f1(x), f2(x), . . . , fm(x)]

Then, it is clear that an MOOP problem is focused on searching for the optimal values of the decision variables (vector x) that minimize/maximize the objective functions (vector f(x)) while satisfying the constraints. The vector x is an n-dimensional decision vector or solution and X is the decision space, i.e., the set of all expressible solutions. z = f(x) is an objective vector that maps X into

A concept of dominance is important in multi-objective optimization algorithms. Then, two solutions are compared on the basis of whether one dominates the other solution or not.

Definition 3.2 (Dominance). A vector solution x is said to dominate a solution x0, denoted 0 0 0 0 0 by x ≺ x , iff z = f(x) is partially less than z = f(x ), i.e. ∀i : zi ≤ zi, ∧ , ∃i : zi < zi, for 1 ≤ i ≤ m.

In other words, x ≺ x0 if the solution x is no worse than x0 in all objectives, and x is strictly better than x0 in at least one objective. Note that if solution x does not dominate solution x0, Chapter 3. The Proposed Multi-objective Ab-Initio Approach 42

xn zm objective decision space X space Z

x2 z2

x1 z1

Figure 3.5: The n-dimensional parameter space maps to the m-dimensional objective space. this does not imply that x0 dominates x. Furthermore, there are three possible outcomes of this relation: x dominates x0; x is dominated by x0; or x and x0 do not dominate each other.

The Pareto optimal set X of solutions consists of all those that it is impossible to improve in any objective without a simultaneous worsening in some other objective. Then, a Pareto optimum solution can be defined as:

Definition 3.3 (Pareto Optimum Solution). A solution x ∈ X is said to be a Pareto optimum solution if and only if there is no x0 ∈ X such that x0 ≺ x.

There are two invariant relations that are preserved in optimum sets, i) Any two solutions of non-dominated set must be non-dominated with respect to each other, and ii) Any solution not belonging to non-dominated set is dominated by at least one member of the non-dominated set.

It is clear that a Pareto optimal set can contain more than one element given that there exist different trade-off solutions to the problem with different compromises with respect to the objectives. Then, a higher making decision level is involved to choose a solution that is Pareto optimal. Figure 3.5 shows the principles in an ideal multi-objective optimization procedure [128]. Step 1 has multiple tradeoff solutions. Thereafter, in Step 2, higher level information is used to choose one solution from Step 1.

A multi-objective optimization evolutionary algorithm (MOEA) is a stochastic, population- based computational procedures that mimics nature’s evolutionary principles to drive its search towards an optimal solution of problems with multiple objectives. Evolutionary algorithms are usually suitable to solve MOPs, because, in each iteration, they deal with a set of possible solutions, instead of a single one. Since a population of solutions are processed, several members of the Pareto optimal set in a single run of the algorithm, instead of having to perform a series of separate runs as in classical search and optimization algorithms. Additionally, evolutionary algorithms are less susceptible to the shape or continuity of the Pareto front.

In this thesis a multi-objective evolutionary algorithm called Elitist Non-Dominated Sorting Genetic Algorithm (NSGA-II) was used in the experimental framework. Chapter 3. The Proposed Multi-objective Ab-Initio Approach 43

Figure 3.6: Schematic of an ideal multi-objective optimization procedure.

3.3.1.1 The Multi objective evolutionary algorithm

This algorithm was suggested in 2000 by Deb. et. al. [129]. NSGA-II uses an elite-preservation strategy and an explicit diversity-preserving mechanism. A schematic graph of NSGA-II is depicted in Figure 3.7 and the NSGA-II procedure is outlined in the following [128].

NSGA-II:

S • Step 1: Combine parent and offspring population and create Rt = Pt Qt.

Perform a non-dominated sorting to Rt and identify different fronts: Fi, i=1, 2,...,etc.

• Step 2: Set new population Pt+1 = 0. Set a counter i = 1. S Until |Pt+1| + Fi < N, perform Pt+1 = Pt+1 Fi and i = i + 1

• Step 3: Perform a crowding-sort procedure and include the most widely spread (N −

|Pt+1|) solutions by using the crowding distance values in the sorted Fi to Pt+1.

• Step 4: Create offspring population Qt+1 from Pt+1 by using the crowded tournament selection, crossover and mutation operators.

Respect to the algorithm complexity analysis, it could be stablished that Step 1 requires at most O(MN 2) computations performing a non-dominated sorting of a population of size 2N. The crowded tournament selection in Step 4 requires the crowding distance computation O(N log

N) of the complete population Pt+1 of size N. This requires O(MN log N). Clearly the overall complexity of the algorithm is determined by Step 1. Therefore, the overall time complexity of NSGA-II is bounded by O(MN 2). Chapter 3. The Proposed Multi-objective Ab-Initio Approach 44

Figure 3.7: Schematic of the NSGA-II procedure

The use of no extra niching parameter in the introduction of diversity (using crowding distance) among non-dominated solutions is one of the main advantages of NSGA-II. Additionally, the crowding distance can be calculated either, in the objective or variable space. The main disad- vantage of the algorithm is that it loses its converge properties when the crowded distance is used to restrict the population size.

In this work, the NSGA II algorithm is used as the method to evolve the protein conformations. Specifically, an object-oriented Java-based framework called jMetal was used at the development, experimentation, and study of multi-objective algorithm for solving the PSP problem. (jMetal is available for download at http://jmetal.sourceforge.net/)

3.3.2 The Multi-objective Formulation

In any multi-objective problem, two spaces should be defined: the decision space (the set of all expressible solutions), and the objective space (the space where the image of the solutions is the set of all attainable solutions). In general, to model any specific problem as a MOOP, three basic sets should be modeled, namely, a set of objective functions, a set of decision variables and a set of equality/inequality constrains.

For the problem under discussion, the energy of the protein conformation should be minimized. Moreover, more than one possible energy interacton which compute different kind of interactions could be defined. Then, we have:

• Decision Space [decision variables] In the design of a genetic algorithm, a chromosome is a set of parameters that represent a solution to the problem that the genetic algorithm is trying to solve. In multi-objective genetic algorithms (as NSGA-II) this set of parameters is equivalent to the variable space. Chapter 3. The Proposed Multi-objective Ab-Initio Approach 45

In this work and in order to represent the protein conformations, a trigonometric respre- sentation was chosen to represent and compute the backbone and side-chain torsion angles of a protein. Additionally, backbone torsion and side-chain angles are bounded in regions derived from structure prediction and rotamer libraries, respectively (See section 3.1 for further detail).

Figure 3.8 depicts in a schematic way the coding that was used in the proposed approach. Note that each amino acid will have a maximum of six positions (ex. LYS) and a minimum of two (ex. GLY). Then, the length of the chromosome depends on the number and the class of residues of the target protein. Figure 3.9 shows how the primary structure of the Met-enkephaline peptide is represented by the proposed approach.

• Objective space [objective functions] In the proposed approach, the different terms of the potential energy functions were transformed into several objectives to fit the multi- objective evolutionary optimization formulation of the PSP problem. Therefore, a problem with two objectives and another with three are considered.

In the two objective problem, the bonds, angles and torsion interactions between atoms (i.e., local interaction) were considered as the first objective, the second objective is composed by all the interactions between all the atoms not connected by chemical bond, which are atoms separated by at least three or more covalent bonds (i.e., non-local interaction). Then, these cost energy equations have the following form.

P` f1 = Ebond = Ek k=1 (3.2) PC f2 = Enon−bond = k=` Ek

Where the bond energy takes into account the interactions between aminoacids that are neighbors along the primary structure. The non-bond term represents the interaction be- tween residues that are separate in the primary sequence by at least two aminoacids. ` is the number of sums in the potential energy functions that characterizes the bond interactions, and C is the total number of interactions characterized by the potential function.

In the three objective problem, the Van Der Wall energy term is separated in a new objective. This is due to the fact that this term has higher change range than the other non-bond energy terms. Thus, a cost function with three objectives is used to optimize the energy of the conformation appropriately. Then, these cost energy equations have the following form. Chapter 3. The Proposed Multi-objective Ab-Initio Approach 46

P` f1 = Ebond = k=1 Ek Pc−1 f2 = k=` Ek (3.3) Pc f3 = k=c−1 Evdw

• Feasible regions [equality/inequality constrains] Therefore, the constraints which rep- resent unfeasible regions are the steric constrains. This constraints are modeled based on the cost function and not as optimization constraints.

Figure 3.8: Chromosome used in the genetic algorithm

Figure 3.9: Chromosome used to represent the Met-enkephaline primary structure

3.3.3 Decision-making Phase

As stated, an multi-objective evolutionary algorithm called NSGA II is used to evolve protein conformations in the reported experimental framework. These methods requires an high level information phase in order to select one solution from the final dominance set. This phase could be hard to accomplish when the number of objectives and solutions is large.

In this work four different implementations of the decision-making phase based on the Pareto knees were computed. The knees are solutions of the Pareto front where a small improvement in one objective leads to a deterioration in at least one other objective. In [130], the authors presented two algorithms for finding the knees in the Pareto front: an angle-based and an utility-based method.

In the angle-based method the trade-offs in either direction can be estimated by the slopes of the two lines through an individual and its four closest neighbours. The angle between these slopes can be regarded as an indication of whether the individual is at a knee or not. Figure 3.10 depicts an illustration of this fact. Specifically, note that the larger the angle α, β, γ or δ Chapter 3. The Proposed Multi-objective Ab-Initio Approach 47 between the lines, the worse the trade-offs in either direction, and the more clearly the solution can be classified as a knee [130].

Figure 3.10: Four neighbors to calculate the maximum of α, β, γ, δ

The expected marginal utility that a specific solution provides to a decision maker is the second presented method to measure the relevance of this solution. The individuals marginal utility is defined as the additional cost that must be accepted if a particular individual (the best one) would not be availabe and the second best individual must be chosen. Formally, this idea can be expressed in the following manner.

( 0 0 0 0 0 minj6=iU(xj, λ ) − U(xi, λ )) if i = argmin U(xj, λ ) U (xi, λ ) = 0 otherwise

0 P 0 P 0 where U(xi, λ ) = λifi(x) is a linear utility function with λi = 1. The expected marginal utilities can be approximated by sampling. Specifically, the marginal utility for all individuals for a number of randomly chosen utility functions is calculated, next the average as expected marginal utility is taken.

Based on these algorithms, in this work four different decision-making schemes were adopted.

1. First, detect the solution, which is a knee of the Pareto front, using the angle-based method. Then, select the solution with the largest angle.

2. First, detect the solution, which is a knee of the Pareto front, using the angle-based method. Then, select the solution with the lowest energy function value from these sam- ples.

3. First, detect the solution, which is a knee of the Pareto front, using the utility-based method. Then, select the solution with the biggest utility. Chapter 3. The Proposed Multi-objective Ab-Initio Approach 48

4. First, detect the solution, which is a knee of the Pareto front, using the utility-based method. Then, select the solution with the lowest energy function value from these sam- ples.

3.4 Root Mean Square Deviation

The root mean square deviation (RMSD) is a well known metric to evaluate how similar the predicted protein conformation is to the native one (i.e., structural similarity). When an optimal superposition of the two structures that are compared is computed, the RMSD measures the average distance between their backbones.

Low RMSD values between the predicted conformation and the native protein are best, zero indicates exact equality. RMSD is given by the following equation.

v u n uX 2 u |rai − rbi| t i=1 RMSD(a,b)= n (3.5)

where rai and rbi are the positions of atom i of structures a and b, respectively, and where structures a and b have been optimally superimposed.

RMSD values do not depend only on conformational differences but also on other features, for example the dimensions of the structures that are compared, best alignments do not always mean minimal RMSD values, significance of RMSD depends on the size of the structures and it can vary with protein type.

In this work, the program RmsCalc from the BioShell suite of tools is used (available at http: //bioshell.chem.uw.edu.pl/rmscalc.html). RmsCalc calculates crmsd distances between a bunch of target and template structures in PDB or XYZ format, and calculate crmsd distance between any target and any template. Two different parameter were used for this program, ’-allatom’ which uses all atoms from both structures for distance calculations, and ’-caonly’ which uses only CA atoms from both structures for distance calculations. Chapter 4

The Proposed Parallel Implementation

4.1 Introduction

Parallel computing refers to the simultaneous computation of instructions to solve a single problem. The processing elements of the computation can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or a combination of the above. Parallel computing has been used to model difficult scientific and engineering problems found in the real world. Optimization problems are among the most interesting and challenging, and they are suitable for the use of parallel techniques. Therefore they have received an increasing amount of attention in the research and industry community in last decades.

Parallel and distributed computing has been widely used in the design and implementation of multi-objective optimization algorithms to speed up the search process. They have also been used to improve the precision of the model, the quality of the obtained Pareto fronts, the of the obtained solutions, and to solve large scale problems.

In addition, parallelization techniques have always been a focus of research in evolutionary algorithms for single objective optimization problems because of their population-based na- ture. As a natural extension, parallelization techniques have also been used for evolutionary multi-objective optimization. Thus, different parallel approaches applied to single objective optimization problems have been widely suited and reported, particularly, their modeling, per- formance and behavior have been investigated. However, a modest insight has been given to parallel approaches to solve multi-objective optimization problems. Different parallel models have been proposed in the design of multi-objective optimization algorithms, but only a few of the reported parallel approaches are accessible to compare the performance of those imple- mentations. Then, determining the efficiency and effectiveness of parallel implementations is usually difficult.

49 Chapter 4. The Proposed Parallel Implementation 50

In the early work by Veldhuizen et al. [131, 132], they presented the benefits of parallelizing an MOEA. Additionally, some design issues and considerations were explored and analyzed on a real-world application, particularly, the protein structure prediction problem. Also, a perfor- mance analysis in a cluster of an MOEA algorithm was carried out in [133] and extended by the same authors in [134]. This analysis was developed using benchmark functions to evaluate the algorithms in different ranges of features such as convexity, discreteness, multimodality and uniformity. On the other hand, in [135], Streicher proposed a divide and conquer approach to parallelize MOEAs, this work aimed at improving the speed of convergence beyond a par- allel island MOEA with migration. Additionally, a clustering based parallelization scheme for MOEAs was suggested and compared on standard multi-objective test functions. Recently, in [136] a comprehensive overview of parallel approaches for multi-objective optimization was pre- sented. Particularly, the design aspect of the algorithms as well as the implementation aspects on different parallel and distributed architectures is discussed.

The advances in the use of multi-objective evolutionary approaches for real-world applications, containing multiple objectives and high dimensionality, have led to the exploitation of the inher- ent parallelism of such algorithms. Since the protein structure prediction problem is a complex NP-hard, CPU time consuming process, hence these features make it a perfect candidate to be solved using a parallel model. Then, the efficiency (i.e., how well it performs computation- ally) and effectiveness (i.e. how good its reported solutions are) of PSP could be improved if increasing the number of processors allocated to it.

Given that an MOEA approach for the PSP problem was already implemented, the main goal of this work is to study how a parallelization model can improve the performance of an MOEA. Specifically, the focus of the proposed approach in this work is to improve the effectiveness of the algorithm discovering what solutions arise from simultaneous execution of multiple PSP instances.

Accordingly, some groups or categories can be defined based on common features of the al- gorithms. Specifically, the parallel evolutionary algorithms can be included in one of three categories: master-slave, islands and diffusion models. In the master-slave model the objective function evaluations are distributed among several slave processors, while a master processor executes the MOEA and other overhead functions. In an island model, MOEAs populations are relatively isolated from each other but individuals within some particular island occasion- ally migrate to another one. As in the master-slave model, the diffusion model deals with one conceptual population, except that each processor holds only a few individuals [132]. Clearly, in all these models a different trade-off between exploration and exploitation of the search space is required.

In this chapter, the proposed parallel implementation for the PSP problem is explained in detail. The outline of the chapter is as follows: The next section provides the background details necessary to understand the essential features of the implementation. Specifically, this Chapter 4. The Proposed Parallel Implementation 51 section is devoted to giving a brief review of the used parallel paradigm (i.e., island models), and an explanation of the JavaSpaces technology. Then, the proposed parallel implementation is described in the next section.

4.2 Background

The proposed parallel version of PSP was modeled and implemented using the JavaSpaces technology. JavaSpaces was the chosen high-level coordination tool for gluing processes together into our distributed application because it allowed the development of a simple design and implementation. In addition, the proposed approach uses an island MOEA model as its parallel paradigm. Par- ticularly, a master processor was used for management tasks and several slave processors to execute the MOEA algorithm. Accordingly, the master processor generates tasks, writes them into a space and collects results from the space. A background necessary to understand the parallel implementation of the MOEA to solve the PSP problem is presented in this section. Specifically, the next subsections briefly summarize the JavaSpaces technology and explain the parallel paradigm called island models.

4.2.1 JavaSpaces

The construction of distributed applications usually demand passing messages between processes or invoking methods on remote objects. Therefore, several technologies can be used to build these applications, including low-level sockets, message passing, and remote method invocation (RMI). In contrast, in JavaSpaces applications, the processes do not communicate directly, but their activities are coordinated by exchanging objects through persistent object exchange areas (i.e. space or shared memory). Then, JavaSpaces provides a different programming model that views an application as a collection of processes cooperating via the flow of objects into and out of one or more spaces. This approach can simplify the design and implementation of sophisticated distributed applications. Hence, a process can write new objects into a space, take objects from a space, or read (make a copy of) objects in a space, as depicted in Figure 4.1 [137]. JavaSpaces technology has a significant number of java-interfaces and java-classes that provide a valuable set of tools for developers. One of these tools is the compute-server implementation, which is an implementation of an all-purpose computing engine using a JavaSpace. A compute- server provides a service that accepts tasks, computes them, and returns results. The server itself is responsible for computing the results and managing the resources that complete the job. The service may use multiple processors or special-purpose hardware to compute the tasks faster than a single-CPU machine. The typical space-based compute-server is depicted in Figure 4.2. Chapter 4. The Proposed Parallel Implementation 52

Figure 4.1: Processes use spaces and simple operations to coordinate activities (figure taken from http://java.sun.com/developer/Books/images/javaspacesil.gif)

Figure 4.2: A space-based compute-server (Figure taken from [137])

4.2.2 Island Models

The island paradigm is based on the biological evolution of natural populations in relative isolation, such as those that might occur within an ocean island chain with limited migration. Then, in the island model, every processor runs an independent MOEA using a separate sub- population. The processors might cooperate by regularly exchanging migrants which are good individuals in their sub-populations [138].

The island models can be categorized into two main groups, cooperating subpopulations and a multi-start approach [136]. The first methods (cooperating subpopulations) are based on Chapter 4. The Proposed Parallel Implementation 53 the partition of the search space. In this group, the population is divided into subpopulations and the algorithms attempt to distribute the task of finding the complete Pareto optimal front among the available islands. This way each processor is destined to find a particular portion of the Pareto-optimal front. In the multi-start approach, each processor independently runs an optimization algorithm. In [136] it is stated that the basic idea of using such a model is that running several optimization algorithms with different initial seeds is more valuable than executing only one single run for a very long time.

The island paradigm is termed coarse-grained parallelism because each island contains a large number of individual solutions. Rings, meshes, toruses, triangles, and hypercubes (see Figure 4.3) are some of the logical or physical structures in which communication backbones can connect processors. The communication backbone, the island model architecture (as the number of islands) and the migration policies (such as how often migration occurs, the number of solutions to migrate, how to select emigrating solutions, and which solutions are replaced by immigrants) are some of the key issues that should be faced when an island model is implemented.

Figure 4.3: A space-based compute-server

4.3 The proposed parallel implementation

As mentioned earlier, the proposed approach is based on an island model as parallel paradigm and on a compute server model with regard to the JavaSpace implementation. The proposed implementation performs three main tasks: initiating work and slave registration, handling migrations, and finishing work and sending and collecting results. Figure 4.4 depicts the main components of the JavaSpace technology. Note that the arrows represent one of the three actions (write, take or read) that can be performed by the different clients (either master or slave) in order to communicate and synchronize through the shared JavaSpace area.

4.3.1 Initiating work and slave registration

The proposed implementation requires that the master keeps an updated list of the slaves working for it at any time so it can properly synchronize them. In order to accomplish this, all the clients must register with their master. Chapter 4. The Proposed Parallel Implementation 54

Figure 4.4: JavaSpace technology components

Once a master starts, the first step consists of writing (W) an entry (InitEntry) into the JavaS- pace. This entry will then be taken by slaves that can perform the task. This entry is read (R) by the slaves, as seen in Figure 4.5. Note that the entry must only be read (instead of being taken) by the slaves, this is because the entry ”Init” must be kept in the JavaSpace during all the processes so that new slaves can be added at any time.

Figure 4.5: Initiating work phase

Subsequently, once a slave receives the InitEntry, it requests registration from the master. The slave then writes an unapproved SlaveEntry. Then, the master takes this entry. This process is depicted in Figure 4.6.

Figure 4.6: Slave Registration phase. The master takes the unapproved slave entry

Notice that the master writes the same entry that was taken before. The only difference is that this entry has been approved by the master.

In the last step, as shown in Figure 4.7, the slave takes this entry, and writes it back; the reason for this rewriting is that the slave must be the author of the entry in the space in order to keep its lease alive. The lease of this entry lease is renewed in the background by the slave (see Figure Chapter 4. The Proposed Parallel Implementation 55

4.8) so that the master can know when the slave is not working on the task anymore, either because it completed the task successfully or due to an unexpected error, such as a network error or unavailability of the entry.

At the end of this step, the master has a list of all the slaves that are working for him. The master will periodically check the status of its current slaves; it will also repeat this process to receive new slaves.

Figure 4.7: Slave Registration phase. The slave takes the approved entry

Figure 4.8: The master will periodically check the status of its current slaves

4.3.2 Handling Migrations

At a given point of the execution of the algorithm a migration will occur, and each of the slaves will send EmigrantEntries to JavaSpace, as depicted in Figure 4.9, and then wait for an ImmigrantEntry to be sent to it by the master.

Figure 4.9: Handling migrations phase: slaves send EmigrantEntries to JavaSpace. Chapter 4. The Proposed Parallel Implementation 56

The master will then collect all EmigrantEntries that are sent out by its slaves and will store them, as shown in Figure 4.10.

Figure 4.10: Handling migrations phase: master takes all EmigrantEntries.

After all slaves have sent their emigrants, the master creates an ImmigrantEntry for each one of its slaves based on the emigrant populations that it has collected in the previous step, and, as shown in Figure 4.11, it will write all of them to the JavaSpace.

Figure 4.11: Handling migrations phase: master writes all ImmigrantEntries

Finally, each slave is able to take its corresponding immigrant and continue the execution of its task, i.e., the MOEA. See Figure 4.12

4.3.3 Finishing work and sending results

When a slave client has reached the maximum number of evaluations in the MOEA algorithm, it writes a ResultEntry into the JavaSpace. These entries are then taken by the master, which adds them into his final result set. When the first ResultEntry is taken by the master, it also cancels the InitEntry, so that no new slaves begin working on it past this point in time. After this point the master waits for its slave list to empty, either because the slaves complete their work or due to an unexpected error that prevented that slave from finishing its computation. Chapter 4. The Proposed Parallel Implementation 57

Figure 4.12: Handling migrations phase: each slave takes its corresponding ImmigrantEntry.

Once this occurs, the master presents its final result set, comprised of all the results received. The aforementioned process is presented in Figure 4.13.

Figure 4.13: Finishing and sending results phase. Chapter 5

Experimental Framework

In this chapter, an experimental framework to report the results obtained using the parallel multi-objective approach is presented. Specifically, an analysis of the results obtained over the set of proteins shown in Table 5.1 is provided. This set of proteins has been used as a benchmark given that it contains some of the most studied proteins both theoretically and experimentally.

The proposed algorithm was compiled using java software and the open source IDE Eclipse. The experiments run on a work-station equipped with two Quad Core Intel Xeon Processor X5450 (2.33GHz,2X6M L2,1333), with 32GB, DDR2 SDRAM FBD Memory, 667MHz.

The main goal of the experiments presented in this chapter was to evaluate the behavior of the proposed model in the protein structure prediction problem. However, the proposed model allows the use of different parameters, such as number of islands, number of migrations, percent- age of migrated population, the MOEA algorithm, the MOEA’s mutation and crossover rates, number of objectives etc. Some parameters were fixed and several experiments were carried out using those values. Understanding the role of the different parameters and policies in their impact on MOEA outcomes are left for further studies.

In the experiments, a model with four identical islands (same parameters) evolving synchronously was used. The number of individuals in the system was set to 100, the number of evaluations was set depending on the target protein. The used MOEA was a standard NSGA-II algorithm, so the results could later be used to further study the use of islands in MOEA approaches. A real value representation was used (the length of the chromosome depends on the protein target). A crossover and a mutation operator at rates of 0.7 and 0.01 were used, respectively. Migrations ocurred every 100000 (one hundred thousand) evaluations, and their size was ten percent of the population. A ring topology was used for the island model. During migrations, an exchange between two neighboring island is carried out.

For each experiment, seven different analyses were performed.

• The computation of the rmsdCα and energy value were computed over different protein conformations. These conformations are obtained from different evaluations of the MOEA

58 Chapter 5. Experimental Framework 59

approach using the four implemented decision-making algorithms. The results are gener- ally reported as a table. These results give us important information about how different is the predicted protein conformation with respect to its native counterpart. Additionally, we will observe the output of the algorithm as an energy minimization process.

• A visualization of the protein conformations is depicted using the molecule viewer soft- ware JMOL. Different styles in the visualization are used in order to facilitate the visual inspection of the predicted protein conformation with respect to its native counterpart.

• The computation of the minimum rmsdCα obtained for each island is plotted and ana- lyzed. As explained in chapter 3 and 4, the final sets of the different and independent MOEA instances are merged together in islandT. Specifically, islandT reports a solution set (with the same length of the population parameter) obtained from the MOEA’s final

sets. The analysis of the minimum rmsdCα obtained for each island is important be- cause it give us valuable information about the behavior of the MOEA algorithm and the behavior of the decision-making implementations.

• The behavior of each island, with respect to the maximum, minimum and average rmsdCα found in different stages of the algorithm is also studied. Then, a plot showing the dy- namics of these three values is shown and discussed.

• Given that the average values could be affected by very high or low rmsdCα errors, a plot reporting the frecuency histograms of these errors in specific intervals is provided. In

previous graphs, the dynamic of the rmsdCα was already studied. Then, this analysis is particularly important because it allows to precisely determine if the number of good solu-

tions (i.e., protein conformations with low rmsdCα errors) is increased along the evolution of the populations.

• The next analysis corresponds to the dynamic of the Pareto fronts. These plots depict the Pareto fronts of each island in different stages of the algorithm. This analysis is important because it allows the study of the muliobjective algorithm and the different objectives chosen for the experimentation. Moreover, these plots allow an understanding of the decision making methods, given that these methods are based on the geometry of the Pareto front. The impact and performance of the parallel implementation can be understood analyzing the fronts after migrations have occurred.

• The low correlation between the energy of a conformation and its rmsdCα error is one of the main factors why ab-initio protein structure prediction methods have low accucary prediction rates. Then, even if a configuration with minimum energy is found by the algorithm, this conformation usually is not close to the optimum conformation in nature.

The analysis of the energy versus rmsdCα errors tries to unveil the relationship of these factors in the multi-objective algorithm. Chapter 5. Experimental Framework 60

Table 5.1: Experimental Test Proteins PDB-ID Length (aa) Molecule Method Resolution (Ao) 1PLW 5 Met-enkephalin 1 Solution NMR N/A 1ZDD 35 Protein A domain Solution NMR N/A 1CRN 46 Crambin X-Ray Difraction 1.5 1ROP 63 ROP protein X-Ray Difraction 1.7 1UTG 70 Uteroglobin X-Ray Difraction 1.34

Table 5.2: Island architecture, migration policies and MOEA’s parameters for PSP Parameter or Policie Value Island Topology Ring Number of islands 4 Migration rates every 100000 generations Population to be migrated 10% Number of evaluations 2000000 Population size 100 SBX Crossover ηc = 5, pc = 0.9 Polynomial Mutation ηm = 10, pm = 0.01%

5.1 Test proteins suits

5.1.1 Repressor of primer 1ROP

ROP is a small four-helix bundle protein formed by the antiparallel interaction of two helix- turn-helix identical monomers of 56 residues. The protein data bank ID for this protein is 1ROP. This protein is found in bacteria (it is expressed in Escherichia coli) and its role is to regulate the number of copied genes of plasmids. The ROP protein’s structure has been solved at a resolution of 1.7Ao. In general, the 1ROP protein is a highly used test suit in de novo protein design and in the novo protein prediction, given that its conformational characteristics (four-helix bundle) allow the understanding of the the relationship between amino acid sequence and structure.

Based on the decision-making methods described in chapter 3, the algorithm matches the crystal o −1 structure with an rmsdCα of 3.7754 A and energy -626.75 kcalmol (see Figure 5.1). Table 5.3 reports the computed structures in different iterations of the algorithm; specifically, it shows a comparison between the structures in the final archives in different iterations based on different decision-making criteria.

A more detailed inspection of the conformations with minimum RMSD for different iterations of the algorithm in each island is depicted in Figure 5.2. We would like to underline the fact that even if island 1 and island 2 evolve in worst protein conformations, the general procedure that combines the results (island T) is not affected by this fact. Then, the proposed parallel implementation provides an adequate cover of the search space improving the precision of the model and the quality of the obtained Pareto fronts. The conformation with minimum RMSD Chapter 5. Experimental Framework 61

(a) Native conformation (schematic structure) (b) Predicted conformation (schematic struc- ture)

(c) Native conformation (cord structure) (d) Predicted conformation (cord structure)

Figure 5.1: Native and predicted conformations for 1ROP protein Chapter 5. Experimental Framework 62

Table 5.3: Computed protein conformation for the 1ROP protein target

kcal Iterations Decision Maker rmsdCα Energy ( mol ) 500000 Utility 3.9550 -212.9578 Utility + Energy 3.9550 -212.9578 Angles 3.9802 -225.442 Angles + Energy 3.9802 -225.442 1000000 Utility 3.9156 -476.0725 Utility + Energy 3.9117 -477.2129 Angles 3.9259 -477.7539 Angles + Energy 3.9259 -477.7539 1500000 Utility 3.8063 -584.0034 Utility + Energy 3.8063 -584.0034 Angles 3.8016 -583.8831 Angles + Energy 3.8187 -532.0886 2000000 Utility 3.8265 -656.4 Utility + Energy 3.8235 -656.4 Angles 3.7754 -626.7585 Angles + Energy 3.8265 -656.1283

o evolved by the algorithm has an rmsdCα error of 3.3935A , which is significatively better than the minimum reported errors for this protein. A more detailed plot of the minimum, maximum and average rmsdCα values at different time-steps of the algorithm is presented in Figure 5.3. Given that the mean values could be affected by very high or low RMSD errors. Figure 5.4 shows the frequency histograms of these errors in specific intervals.

Note that Figure 5.4 depicts the amount of individuals belonging to a specific rmsdCα interval. This measure is highly important because it allows to understand the role of the evolution in the minimization of the RMSD errors without propensities to outlier values. Specifically, from Figure 5.4 (e) can be concluded that the number of protein conformations belonging to the interval [3A, 5A] is close to the fifty percent of the population. Notice that the number of conformations belonging to this interval at the beginning of the evolution was less than two percent of the population. Figure 5.4 stresses the importance of the parallel model. The plot marked as Final Results maintains the good behavior of island 3 and it is not very sensitive to the behaviors of the other islands.

Figure 5.5 shows the different Pareto fronts found by the algorithm in different iterations. It should be stressed that, although the NSGA-II algorithm finds well distributed fronts and it emphasizes the convergence, there are some individuals in the optimal set that present high energy levels. Typically these individuals show a marked difference in the objectives tradeoff. Note that the Pareto front reported in Figure 5.5 (e) Final results, is the set of dominant solutions of joining the Pareto fronts of islands one, two, three and four. Chapter 5. Experimental Framework 63

1ROP 10 Island 1 Island 2 Island 3 9 Island 4 Island T

8

7

6 RMSD (Amstrongs)

5

4

3 0 200000 400000 600000 800000 1e+06 1.2e+06 1.4e+06 1.6e+06 1.8e+06 2e+06 Evaluations

Figure 5.2: Behavior of the minimum RMSD in each island

In the plot of Figure 5.6, the correlation between the energy and the rmsdCα for conformations sampled by the MOEA algorithm in each island is shown. From this plot it is worth to note that it is possible to produce structures with lower energy than that of their reported native structure. It is an important point because even if a minimum energy conformation is found by the algorithm, this conformation usually is not close to the native conformation.

5.1.2 Disulphide-stabilized mini protein A domain 1ZDD

The 1ZDD protein is a two-helix peptide of 34 residues. This protein defines 192 torsional angles and it has 71% helical (2 helices, 25 residues) with respect to the reported DSSP secondary structure.

The 1ZDD protein has been considering an interesting test bed to better understand how a given folding algorithm works on a problem with a considerable number of variables. In this thesis it is a very nice test protein to understand the performance of the algorithm with respect to three main challenges and features. There are no knees on the Pareto front of 1ZDD protein, the secondary structure prediction used by other methods is generally performed by the SCRATCH software, and there is one aminoacid that is not recognized in the .fasta file. These characteristics are used in this subsection to analyze the impact of these features in the proposed approach. Specifically, the same running parameters that have been used so far are kept for the 1ZDD experiments. In other words, the same secondary structure prediction method (PSIPRED) instead of the SCRATCH method was used, the not recognized amino-acid was resigned from be included in the experiments, and the same decision-making methods were used even if there are not knees on the Pareto fronts. Chapter 5. Experimental Framework 64

1ROP Island 1 1ROP Island 2 25 25 MAX MAX MIN MIN AVG AVG

20 20

15 15 RMSD (Armstrongs) RMSD (Armstrongs) 10 10

5 5

0 500000 1e+06 1.5e+06 2e+06 0 500000 1e+06 1.5e+06 2e+06 Evaluations Evaluations

(a) Island 1 (b) Island 2

1ROP Island 3 1ROP Island 4 25 25 MAX MAX MIN MIN AVG AVG

20 20

15 15 RMSD (Armstrongs) RMSD (Armstrongs) 10 10

5 5

0 500000 1e+06 1.5e+06 2e+06 0 500000 1e+06 1.5e+06 2e+06 Evaluations Evaluations

(c) Island 3 (d) Island 4

1ROP FINAL RESULTS 25 MAX MIN AVG

20

15 RMSD (Armstrongs) 10

5

0 500000 1e+06 1.5e+06 2e+06 Evaluations

(e) Final Results

Figure 5.3: Minimum, average and maximum RMSD in each island (1ROP) Chapter 5. Experimental Framework 65

1ROP Island 1 1ROP Island 2 100 60 [3A-5A] [3A-5A] [5A-7A] [5A-7A] [7A-) [7A-)

50 80

40 60

30 Individuals Individuals 40 20

20 10

0 0 200 500000 1000000 1500000 2000000 200 500000 1000000 1500000 2000000 Evaluations Evaluations

(a) Island 1 (b) Island 2

1ROP Island 3 1ROP Island 4 100 100 [3A-5A] [3A-5A] [5A-7A] [5A-7A] 90 [7A-) 90 [7A-)

80 80

70 70

60 60

50 50 Individuals Individuals 40 40

30 30

20 20

10 10

0 0 200 500000 1000000 1500000 2000000 200 500000 1000000 1500000 2000000 Evaluations Evaluations

(c) Island 3 (d) Island 4

1ROP Island T 90 [3A-5A] [5A-7A] 80 [7A-)

70

60

50

40 Individuals

30

20

10

0 200 500000 1000000 1500000 2000000 Evaluations

(e) Final Results

Figure 5.4: Histrogram of RMSD ranges (1ROP) Chapter 5. Experimental Framework 66

1ROP Island 1 1ROP Island 2 14000 14000 100000 evaluations 100000 evaluations 500000 evaluations 500000 evaluations 1000000 evaluations 1000000 evaluations 12000 1500000 evaluations 12000 1500000 evaluations 2000000 evaluations 2000000 evaluations

10000 10000

8000 8000

6000 6000

4000 4000 Non-Bond (kcal/mol) Non-Bond (kcal/mol)

2000 2000

0 0

-2000 -2000 410 420 430 440 450 460 470 480 410 420 430 440 450 460 470 480 Bond (kcal/mol) Bond (kcal/mol)

(a) Island 1 (b) Island 2

1ROP Island 3 1ROP Island 4 14000 14000 100000 evaluations 100000 evaluations 500000 evaluations 500000 evaluations 1000000 evaluations 1000000 evaluations 12000 1500000 evaluations 12000 1500000 evaluations 2000000 evaluations 2000000 evaluations

10000 10000

8000 8000

6000 6000

4000 4000 Non-Bond (kcal/mol) Non-Bond (kcal/mol)

2000 2000

0 0

-2000 -2000 410 420 430 440 450 460 470 480 410 420 430 440 450 460 470 480 Bond (kcal/mol) Bond (kcal/mol)

(c) Island 3 (d) Island 4

1ROP FINAL RESULTS 14000 100000 evaluations 500000 evaluations 1000000 evaluations 12000 1500000 evaluations 2000000 evaluations

10000

8000

6000

4000 Non-Bond (kcal/mol)

2000

0

-2000 410 420 430 440 450 460 470 480 Bond (kcal/mol)

(e) Final Results

Figure 5.5: Pareto Fronts in each island (1ROP) Chapter 5. Experimental Framework 67

1ROP Island 1 1ROP Island 2 3000 3000 100000 evaluations 100000 evaluations 500000 evaluations 500000 evaluations 1000000 evaluations 1000000 evaluations 2500 1500000 evaluations 2500 1500000 evaluations 2000000 evaluations 2000000 evaluations

2000 2000

1500 1500

1000 1000

Energy (kcal/mol) 500 Energy (kcal/mol) 500

0 0

-500 -500

-1000 -1000 3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7 RMSD (Armstrongs) RMSD (Armstrongs)

(a) Island 1 (b) Island 2

1ROP Island 3 1ROP Island 4 3000 3000 100000 evaluations 100000 evaluations 500000 evaluations 500000 evaluations 1000000 evaluations 1000000 evaluations 2500 1500000 evaluations 2500 1500000 evaluations 2000000 evaluations 2000000 evaluations

2000 2000

1500 1500

1000 1000

Energy (kcal/mol) 500 Energy (kcal/mol) 500

0 0

-500 -500

-1000 -1000 3 3.5 4 4.5 5 5.5 6 6.5 7 3 3.5 4 4.5 5 5.5 6 6.5 7 RMSD (Armstrongs) RMSD (Armstrongs)

(c) Island 3 (d) Island 4

1ROP FINAL RESULTS 3000 100000 evaluations 500000 evaluations 1000000 evaluations 2500 1500000 evaluations 2000000 evaluations

2000

1500

1000

Energy (kcal/mol) 500

0

-500

-1000 3 3.5 4 4.5 5 5.5 6 6.5 7 RMSD (Armstrongs)

(e) Final Results

Figure 5.6: Relationship between Energy and RMSD (1ROP) Chapter 5. Experimental Framework 68

The performance of the proposed approach was measured in terms of the rmsdCα with respect to the structure stored in PDB (1ZDD): the results are reported in Table 5.4. The conformation with minimum rmsdCα matches the crystal structure with a coordinate root mean square deviation value of 3.8135Ao.

Table 5.4: Computed protein conformation for the 1ZDD protein target

kcal Iterations Decision Maker rmsdCα Energy ( mol ) 500000 Utility 3.8023 -1185.1286 Utility + Energy 3.8023 -1185.1286 Angles 3.8023 -1185.129 Angles + Energy 3.7980 -1185.462 700000 Utility 3.8618 -1202.807 Utility + Energy 3.8618 -1202.807 Angles 3.8588 -1200.944 Angles + Energy 3.8213 -1187.277 900000 Utility 10.5937 -1231.737 Utility + Energy 10.5937 -1231.737 Angles 7.5044 24477.172 Angles + Energy 10.5937 -1231.737 1100000 Utility 10.6401 -1250.410 Utility + Energy 10.6401 -1250.410 Angles 3.8135 -1218.566 Angles + Energy 10.6401. -1250.410

A detailed inspection of the conformations with minimum RMSD for different iterations of the algorithm in each island is shown in Figure 5.7. It is important to stress that three out of four islands are initialized with protein conformations having a lower rmsdCα with respect to their

final rmsdCα. This fact suggests the presence of multiple local minima in the 1ZDD energy landscape and a weak correlation between the minimum energy conformations and these inital protein conformations. Note that, the elitist algorithm as NSGA-II was not able to keep the initial good conformations for subsequent generations.

A more detailed plot of the minimum, maximum and average rmsdCα values at different time- steps of the algorithm is presented in Figure 5.8. In this plot it could be noted that the behavior of the minimum values is not shared for the maximum and average values. Those values show a decrease in its rmsdCα errors through the evolution of their conformations. Island 2 appears to have the most stable conformations. This fact can be observed in Figure 5.9. Particularly, in this graph it could be noted how in general the quantity of individuals with higher RMSD errors decrease as the populations evolve. Island 4 presents a contrary behavior with respect to the ideas stated before. This island increase the number of higher RMSD conformation in the evolution process. This suggest that this island was trapped in a local mimima from which it could never escape. Those ideas can be contrasted and proved analyzing Figure 5.11. Chapter 5. Experimental Framework 69

1ZDD 9 Island 1 Island 2 Island 3 Island 4 8 Island T

7

6 RMSD (Armstrongs) 5

4

3 0 200000 400000 600000 800000 1e+06 1.2e+06 Evaluations

Figure 5.7: Behavior of the minimum RMSD in each island

The fact that there are no knees on the Pareto front of 1ZDD protein can be noted by inspecting Figure 5.10 and can be confirmed by different works [21]. In this thesis the same decision-making methods are used even if they are based on the presence of knees. The reason is that it is difficult for the algorithm to determine when a Pareto front contains or not knees. Then, no additional heuristic information is used in this thesis and the behavior of these methods are evaluated under these conditions.

In the plot of Figure 5.11, the correlation between the energy and the rmsdCα for conformations sampled by the MOEA algorithm in each island is shown. From this plot it is worth to notice that it is possible to produce structures with lower energy than that of their reported native structure. It is an important point because even if a minimum energy is found by the algorithm, this conformation usually is not close to the optimum conformation for the nature. This fact is underlined in the general procedure that combine the results (island T), where the algorithm gets stuck in a set of minimum energy conformations with high rmsdCα values (see Figure 5.11). Particularly, one of the energy groups in the 1100000 evaluations presents a very good value with respect to its energy but a bad value with respect to its RMSD error. This energy group was inherited from island four and correspond to the local mimima from where this island keep trapped.

In this experiment, the NSGA-II algorithm was not able to find a good relationship between the best energy structure and the best RMSD conformation in the different stages of the algorithm. In other words, the algorithm was not able to produce an ensemble of good quality structures even if it made a high sampling of the conformational search space. Additionally, it should be stressed that the algorithm was highly influenced by the three stated features of this protein Chapter 5. Experimental Framework 70

1ZDD Island 1 1ZDD Island 2 16 16 MAX MAX MIN MIN AVG AVG 14 14

12 12

10 10

8 8 RMSD (Armstrongs) RMSD (Armstrongs)

6 6

4 4

0 200000 400000 600000 800000 1e+06 0 200000 400000 600000 800000 1e+06 Evaluations Evaluations

(a) Island 1 (b) Island 2

1ZDD Island 3 1ZDD Island 4 16 16 MAX MAX MIN MIN AVG AVG 14 14

12 12

10 10

8 8 RMSD (Armstrongs) RMSD (Armstrongs)

6 6

4 4

0 200000 400000 600000 800000 1e+06 0 200000 400000 600000 800000 1e+06 Evaluations Evaluations

(c) Island 3 (d) Island 4

1ZDD FINAL RESULTS 16 MAX MIN AVG 14

12

10

8 RMSD (Armstrongs)

6

4

0 200000 400000 600000 800000 1e+06 Evaluations

(e) Final Results

Figure 5.8: Minimum, average and maximum RMSD in each island (1ZDD) Chapter 5. Experimental Framework 71

1ZDD Island 1 1ZDD Island 2 100 70 [3A-6A] [3A-6A] [6A-9A] [6A-9A] 90 [9A-) [9A-) 60 80

70 50

60 40

50

Individuals Individuals 30 40

30 20

20 10 10

0 0 100000 300000 500000 700000 900000 1100000 100000 300000 500000 700000 900000 1100000 Evaluations Evaluations

(a) Island 1 (b) Island 2

1ZDD Island 3 1ZDD Island 4 80 90 [3A-6A] [3A-6A] [6A-9A] [6A-9A] [9A-) [9A-) 70 80

70 60

60 50

50 40 40 Individuals Individuals 30 30

20 20

10 10

0 0 100000 300000 500000 700000 900000 1100000 100000 300000 500000 700000 900000 1100000 Evaluations Evaluations

(c) Island 3 (d) Island 4

1ZDD Island T 70 [3A-6A] [6A-9A] [9A-) 60

50

40

Individuals 30

20

10

0 100000 300000 500000 700000 900000 1100000 Evaluations

(e) Final Results

Figure 5.9: Histrogram of RMSD ranges (1ZDD) Chapter 5. Experimental Framework 72

1ZDD Island 1 1ZDD Island 2 10000 10000 100000 evaluations 100000 evaluations 300000 evaluations 300000 evaluations 500000 evaluations 500000 evaluations 700000 evaluations 700000 evaluations 8000 900000 evaluations 8000 900000 evaluations 1100000 evaluations 1100000 evaluations

6000 6000

4000 4000

Non-Bond (kcal/mol) 2000 Non-Bond (kcal/mol) 2000

0 0

-2000 -2000 225 230 235 240 245 250 255 260 225 230 235 240 245 250 255 260 Bond (kcal/mol) Bond (kcal/mol)

(a) Island 1 (b) Island 2

1ZDD Island 3 1ZDD Island 4 10000 10000 100000 evaluations 100000 evaluations 300000 evaluations 300000 evaluations 5000000 evaluations 500000 evaluations 700000 evaluations 700000 evaluations 8000 9000000 evaluations 8000 900000 evaluations 1100000 evaluations 1100000 evaluations

6000 6000

4000 4000

Non-Bond (kcal/mol) 2000 Non-Bond (kcal/mol) 2000

0 0

-2000 -2000 225 230 235 240 245 250 255 260 225 230 235 240 245 250 255 260 Bond (kcal/mol) Bond (kcal/mol)

(c) Island 3 (d) Island 4

1ZDD FINAL RESULTS 10000 100000 evaluations 300000 evaluations 500000 evaluations 700000 evaluations 8000 900000 evaluations 1100000 evaluations

6000

4000

Non-Bond (kcal/mol) 2000

0

-2000 225 230 235 240 245 250 255 260 Bond (kcal/mol)

(e) Final Results

Figure 5.10: Pareto Fronts in each island (1ZDD) Chapter 5. Experimental Framework 73

1ZDD Island 1 1ZDD Island 2 1500 1500 100000 evaluations 100000 evaluations 300000 evaluations 300000 evaluations 500000 evaluations 500000 evaluations 700000 evaluations 700000 evaluations 1000 900000 evaluations 1000 900000 evaluations 1100000 evaluations 1100000 evaluations

500 500

0 0 Energy (kcal/mol) Energy (kcal/mol) -500 -500

-1000 -1000

-1500 -1500 3 4 5 6 7 8 9 10 11 12 3 4 5 6 7 8 9 10 11 12 RMSD (Armstrongs) RMSD (Armstrongs)

(a) Island 1 (b) Island 2

1ZDD Island 3 1ZDD Island 4 1500 1500 100000 evaluations 100000 evaluations 300000 evaluations 300000 evaluations 500000 evaluations 500000 evaluations 700000 evaluations 700000 evaluations 1000 900000 evaluations 1000 900000 evaluations 1100000 evaluations 1100000 evaluations

500 500

0 0 Energy (kcal/mol) Energy (kcal/mol) -500 -500

-1000 -1000

-1500 -1500 3 4 5 6 7 8 9 10 11 12 3 4 5 6 7 8 9 10 11 12 RMSD (Armstrongs) RMSD (Armstrongs)

(c) Island 3 (d) Island 4

1ZDD FINAL RESULTS 1500 100000 evaluations 300000 evaluations 500000 evaluations 700000 evaluations 1000 900000 evaluations 1100000 evaluations

500

0 Energy (kcal/mol) -500

-1000

-1500 3 4 5 6 7 8 9 10 11 12 RMSD (Armstrongs)

(e) Final Results

Figure 5.11: Relationship between Energy and RMSD (1ZDD) Chapter 5. Experimental Framework 74 mentioned above. Even if the minimization process got better results with respect to the reported energy, no correlation with the rmsdCα was observed.

5.1.3 Met-enkephalin 1PLW

Met-enkephalin is a very small neuropeptide with only five amino-acids (TYR-GLY-GLY-PHE- MET), 22 torsional angles and 75 atoms. The peptide Met-enkephalin is part of an important class of molecules to study, mainly because it is difficult to obtain experimental information about them. From an optimization point of view, this penta-peptide has become a standard test given that it is a paradigmatic example of multiple minima problem (having more than 1011 locally optimal conformations). A substantial amount of in silico experiments and a considerable number of computational studies using different conformational search methodologies have been done.

Figure 5.12 shows the predicted and the native structure of the peptide 1PLW. This struc- ture matches the crystal structure of 1PLW with an rmsdall−atoms = 2.5816 and rmsdCα = 1.2629. Additionally, this structure has the lowest energy value of the overall final Pareto front, kcal −22.732 mol . Table 5.5 shows more information about other found conformations in the dy- namics of different number of evaluations. The knee-based decision making methods have been applied in these cases to select the conformation among the set of solutions in the Pareto front after executing the corresponding migration procedure.

Table 5.5: Computed protein conformation for the 1PLW protein target

kcal Iterations Decision Maker rmsdCα rmsdall Energy ( mol ) 200000 Utility 1.2482 3.5678 2653.610 Utility + Energy 1.2482 3.5678 2653.610 Angles 1.3175 2.610 -19.020 Angles + Energy 1.3175 2.610 -19.020 400000 Utility 1.3024 2.5475 -19.945 Utility + Energy 1.3024 2.5475 -19.945 Angles 1.2821 3.6680 4312.663 Angles + Energy 1.3161 2.6200 -19.588 600000 Utility 1.2629 2.5816 -22.732 Utility + Energy 1.2629 2.5816 -22.732 Angles 1.2605 3.6282 3884.746 Angles + Energy 1.2627 2.6207 -22.179

As in previous experiments, a more detailed study of the conformations with minimum RMSD for different number of evaluations of individuals in each island (see Figure 5.13) is carried out. From these plots it can be stablished that good RMSD structures are generated in all the islands since the beginning of the experimentation. After a significant number of evaluations the energy and the RMSD value of those conformations start to be stable. Clearly, one of the Chapter 5. Experimental Framework 75

(a) Native conformation 1PLW (b) Predicted conformation

Figure 5.12: Native and predicted conformations for 1PLW protein most stable conformations are presented in the islandT proving the advantage of the proposed parallel implementation as stated in previous experiments. Note that the information of this dynamic is depicted for the optimal solution set built after the migrations have occurred.

1PLW 2 Island 1 Island 2 Island 3 Island 4 Island T

1.5

1 RMSD (Armstrongs)

0.5

0 200000400000600000 Evaluations

Figure 5.13: Behavior of the minimum RMSD in each island for the 1PLW protein

A more detailed plot of the minimum, maximum and average rmsdCα values at different time- steps of the algorithm is presented in Figure 5.14. Given that the mean values could be affected by excessively low or high RMSD errors, Figure 5.15 plots the frecuency histograms of these Chapter 5. Experimental Framework 76 errors in specific intervals. These histograms show that a significant quantity of good conforma- tions is kept by the algorithm through the evolution process. A good relationship between the energy structure and the RMSD conformation in the final archive was observed. By carefully inspecting the final conformation, good values both in terms of RMSD and energy are found for them. Then, for 1PLW protein, the algorithm was able to produce an ensemble of good quality structures.

Figure 5.16 depicts the dynamic of the found Pareto fronts in different stages of the algorithm. Notice that the graph plotted as 1PLW final results is the outcome of the procedure that define the minimum-energy conformation set based on the results given by each of the four islands. Notice that this found set keeps the diversity of the Pareto front meanwhile it preserves only the best individuals.

In the plot of Figure 5.17, the correlation between the energy and the rmsdCα for conformations sampled by the MOEA algorithm in each island is shown.

5.1.4 Crambin 1CRN

Crambin is a 46-residue protein that is found in the plant seeds of Abyssinian cabbage. Its biological function is unknown and it is not related to any human diseases. It has two alpha- helices and two beta-strands forming an anti-parallel sheet. Additionally, it has six cysteine residues, accounting for 13% of the structure. It has three disulphide bonds, whose constrains are not taken into account in this research. 1CRN is a peptide that has been studied extensively both theoretically and experimentally, because crystals of crambin diffract well.

The best computed structure, using the proposed computational method, matches the crystal kcal structure with and rmsdCα = 7.3214 and energy 262.675 mol . The best computed structure was found in the last generation of the algorithm as in the previous experiments. It is important to note that the PSIPRED method was used to predict the protein secondary structure. This protein could be correctly modeled using supersecondary structure prediction methods (which are not used in the proposed approach). In Figure 5.18, the native and the predicted protein conformations are plotted using the software JMOL. This graph allows a visual inspection of the two compared conformations. Specifically, an identification of the secondary structures can be understood. A wrong secondary structure prediction was made in the 1CRN protein: a beta- strand forming the anti-parallel sheet was not predicted by the PSIPRED method. Although, the algorithm was able to build a β -Strand, the algorithm was not able to reach a native like structure.

As in the previous experiments, a more detailed study of the conformations with minimum RMSD for the dynamic of different evaluations of the individuals in each island is shown in Figure 5.19. Additionally the information of this dynamic is depicted for the optimal solution set built after the migrations have occurred. It is interesting to note that the minimum errors Chapter 5. Experimental Framework 77

1PLW Island 1 1PLW Island 2 2 2 MAX MAX MIN MIN AVG AVG

1.5 1.5

1 1 RMSD (Armstrongs) RMSD (Armstrongs)

0.5 0.5

0 0 200000400000600000 200000400000600000 Evaluations Evaluations

(a) Island 1 (b) Island 2

1PLW Island 3 1PLW Island 4 2 2 MAX MAX MIN MIN AVG AVG

1.5 1.5

1 1 RMSD (Armstrongs) RMSD (Armstrongs)

0.5 0.5

0 0 200000400000600000 200000400000600000 Evaluations Evaluations

(c) Island 3 (d) Island 4

1PLW FINAL RESULTS 2 MAX MIN AVG

1.5

1 RMSD (Armstrongs)

0.5

0 200000400000600000 Evaluations

(e) Final Results

Figure 5.14: Minimum, average and maximum RMSD in each island (1PLW) Chapter 5. Experimental Framework 78

1PLW Island 1 1PLW Island 2 90 100 [0A-1.3A) [0A-1.3A) [1.3A-1.4A) [1.3A-1.4A) 80 [1.4A-) 90 [1.4A-)

80 70

70 60 60 50 50 40 Individuals Individuals 40 30 30

20 20

10 10

0 0 200000 400000 600000 200000 400000 600000 Evaluations Evaluations

(a) Island 1 (b) Island 2

1PLW Island 3 1PLW Island 4 80 60 [0A-1.3A) [0A-1.3A) [1.3A-1.4A) [1.3A-1.4A) [1.4A-) [1.4A-) 70 50

60

40 50

40 30 Individuals Individuals 30 20

20

10 10

0 0 200000 400000 600000 200000 400000 600000 Evaluations Evaluations

(c) Island 3 (d) Island 4

1PLW Island T 80 [0A-1.3A) [1.3A-1.4A) [1.4A-) 70

60

50

40 Individuals 30

20

10

0 200000 400000 600000 Evaluations

(e) Final Results

Figure 5.15: Histrogram of RMSD ranges (1PLW) Chapter 5. Experimental Framework 79

1PLW Island 1 1PLW Island 2

200000 evaluations 200000 evaluations 40 400000 evaluations 40 400000 evaluations 600000 evaluations 600000 evaluations

20 20

0 0

Non-Bond (kcal/mol) -20 Non-Bond (kcal/mol) -20

-40 -40

20 22 24 26 28 30 20 22 24 26 28 30 Bond (kcal/mol) Bond (kcal/mol)

(a) Island 1 (b) Island 2

1PLW Island 3 1PLW Island 4

200000 evaluations 200000 evaluations 40 4000000 evaluations 40 400000 evaluations 600000 evaluations 600000 evaluations

20 20

0 0

Non-Bond (kcal/mol) -20 Non-Bond (kcal/mol) -20

-40 -40

20 22 24 26 28 30 20 22 24 26 28 30 Bond (kcal/mol) Bond (kcal/mol)

(c) Island 3 (d) Island 4

1PLW FINAL RESULTS

200000 evaluations 40 400000 evaluations 600000 evaluations

20

0

Non-Bond (kcal/mol) -20

-40

20 22 24 26 28 30 Bond (kcal/mol)

(e) Final Results

Figure 5.16: Pareto Fronts in each island (1PLW) Chapter 5. Experimental Framework 80

1PLW Island 1 1PLW Island 2

200000 evaluations 200000 evaluations 1400 400000 evaluations 1400 400000 evaluations 600000 evaluations 600000 evaluations

1200 1200

1000 1000

800 800

600 600 Energy (kcal/mol) Energy (kcal/mol) 400 400

200 200

0 0

-200 -200 1 1.1 1.2 1.3 1.4 1.5 1 1.1 1.2 1.3 1.4 1.5 RMSD (Armstrongs) RMSD (Armstrongs)

(a) Island 1 (b) Island 2

1PLW Island 3 1PLW Island 4

200000 evaluations 200000 evaluations 1400 400000 evaluations 1400 400000 evaluations 600000 evaluations 600000 evaluations

1200 1200

1000 1000

800 800

600 600 Energy (kcal/mol) Energy (kcal/mol) 400 400

200 200

0 0

-200 -200 1 1.1 1.2 1.3 1.4 1.5 1 1.1 1.2 1.3 1.4 1.5 RMSD (Armstrongs) RMSD (Armstrongs)

(c) Island 3 (d) Island 4

1PLW FINAL RESULTS

200000 evaluations 1400 400000 evaluations 600000 evaluations

1200

1000

800

600 Energy (kcal/mol) 400

200

0

-200 1 1.1 1.2 1.3 1.4 1.5 RMSD (Armstrongs)

(e) Final Results

Figure 5.17: Relationship between Energy and RMSD (1PLW) Chapter 5. Experimental Framework 81

Table 5.6: Computed protein conformation for the 1ZDD protein target

kcal Iterations Decision Maker rmsdCα Energy ( mol ) 500000 Utility 7.5838 357.921 Utility + Energy 7.5838 357.921 Angles 8.0975 355.255 Angles + Energy 8.0951 351.992 700000 Utility 8.0869 353.476 Utility + Energy 8.0869 353.476 Angles 7.6694 360.125 Angles + Energy 8.0951 351.992 900000 Utility 7.3932 288.728 Utility + Energy 7.3932 288.728 Angles 7.3895 291.810 Angles + Energy 7.3895 291.810 1100000 Utility 7.3355 256.843 Utility + Energy 7.3355 256.843 Angles 7.3214 262.675 Angles + Energy 7.3356 257.353

(a) Native conformation 1CRN (b) Predicted conformation 1CRN

Figure 5.18: Native and predicted conformations for 1CRN protein stay stable during the dynamic of evaluations. The minimum RMSD value was found in the initial population and it can be considered as an effect of the chance (the initial population was built at random). The island number four is the MOEA instance with minimum RMSD error, although it barely crossed the threshold of 7 Armstrongs. Then, that suggest that the algorithm was not able to make a complete exploration of the energy landscape, this fact could be due to a wrong secondary structure prediction. Chapter 5. Experimental Framework 82

1CRN 8.5 Island 1 Island 2 Island 3 Island 4 Island T 8

7.5

7 RMSD (Armstrongs)

6.5

6 0 200000 400000 600000 800000 1e+06 1.2e+06 1.4e+06 1.6e+06 1.8e+06 2e+06 Evaluations

Figure 5.19: Behavior of the minimum RMSD in each island for the 1CRN protein

A more detailed plot of the minimum, maximum and average RMSD values at different time- steps of the algorithm is presented in Figure 5.20. From these plots it can be stablished that the most stable value correspond to the minimum, and the island with the best behavior (see the average tendence) corresponds to island four as stated before. These conclusions can be shown in Figure 5.21. Given that the mean values could be affected by very high or low RMSD errors, Figure 5.21 plots the frecuency histograms of these errors in specific intervals. In this plot, it is clear that island four has the biggest number of protein conformation belonging to interval one (i.e., [0A, 7A)). Unfortunately, this same behavior is not shared by islandT, where the number of individuals which belongs to the first interval rarely cross the 20% threshold. This fact suggests that there is not a good correlation between the RMSD error and the two energy (i.e., bond and non-bond interactions) objectives. Despide of this lack of correlation, the decision-making methods were able to choose the best individuals with respect to their RMSD errors.

Figure 5.22 depicts the dynamic of the found Pareto fronts in different stages of the algorithm. At the beggining of the evolution process, the Pareto front solutions tend to be grouped into three individual clusters of non-dominated compact solutions, however, after several evaluations the distinction of these clusters is not obvious. It is worth noting that the graph plotted as 1CRN final results is the outcome of the procedure that define the minimum-energy conformation set based on the results given by each of the four islands. Notice that, this set keeps the diversity of the Pareto front meanwhile it preserves only the best individuals. A clear relation between the Pareto front of the last iteration and the energy versus RMSD plot could not be determined for 1CRN protein. Chapter 5. Experimental Framework 83

1CRN Island 1 1CRN Island 2 20 20 MAX MAX MIN MIN AVG AVG 18 18

16 16

14 14

12 12 RMSD (Armstrongs) RMSD (Armstrongs)

10 10

8 8

6 6 100000015000002000000500000200 100000015000002000000500000200 Evaluations Evaluations

(a) Island 1 (b) Island 2

1CRN Island 3 1CRN Island 4 20 20 MAX MAX MIN MIN AVG AVG 18 18

16 16

14 14

12 12 RMSD (Armstrongs) RMSD (Armstrongs)

10 10

8 8

6 6 100000015000002000000500000200 100000015000002000000500000200 Evaluations Evaluations

(c) Island 3 (d) Island 4

1CRN FINAL RESULTS 20 MAX MIN AVG 18

16

14

12 RMSD (Armstrongs)

10

8

6 100000015000002000000500000200 Evaluations

(e) Final Results

Figure 5.20: Minimum, average and maximum RMSD in each island (1CRN) Chapter 5. Experimental Framework 84

1CRN Island 1 1CRN Island 2 100 100 [0A-7A) [0A-7A) [7A-8A) [7A-8A) 90 [8A-) 90 [8A-)

80 80

70 70

60 60

50 50 Individuals Individuals 40 40

30 30

20 20

10 10

0 0 200 500000 1000000 1500000 2000000 200 500000 1000000 1500000 2000000 Evaluations Evaluations

(a) Island 1 (b) Island 2

1CRN Island 3 1CRN Island 4 100 100 [0A-7A) [0A-7A) [7A-8A) [7A-8A) 90 [8A-) 90 [8A-)

80 80

70 70

60 60

50 50 Individuals Individuals 40 40

30 30

20 20

10 10

0 0 200 500000 1000000 1500000 2000000 200 500000 1000000 1500000 2000000 Evaluations Evaluations

(c) Island 3 (d) Island 4

1CRN Island T 100 [0A-7A) [7A-8A) 90 [8A-)

80

70

60

50 Individuals 40

30

20

10

0 200 500000 1000000 1500000 2000000 Evaluations

(e) Final Results

Figure 5.21: Histrogram of RMSD ranges (1CRN) Chapter 5. Experimental Framework 85

1CRN Island 1 1CRN Island 2 800 800 100000 evaluations 100000 evaluations 500000 evaluations 500000 evaluations 1000000 evaluations 1000000 evaluations 1500000 evaluations 1500000 evaluations 2000000 evaluations 2000000 evaluations 600 600

400 400

200 200 Non-Bond (kcal/mol) Non-Bond (kcal/mol)

0 0

-200 -200 370 380 390 400 410 420 430 370 380 390 400 410 420 430 Bond (kcal/mol) Bond (kcal/mol)

(a) Island 1 (b) Island 2

1CRN Island 3 1CRN Island 4 800 800 100000 evaluations 100000 evaluations 500000 evaluations 500000 evaluations 1000000 evaluations 1000000 evaluations 1500000 evaluations 1500000 evaluations 2000000 evaluations 2000000 evaluations 600 600

400 400

200 200 Non-Bond (kcal/mol) Non-Bond (kcal/mol)

0 0

-200 -200 370 380 390 400 410 420 430 370 380 390 400 410 420 430 Bond (kcal/mol) Bond (kcal/mol)

(c) Island 3 (d) Island 4

1CRN FINAL RESULTS 800 100000 evaluations 500000 evaluations 1000000 evaluations 1500000 evaluations 2000000 evaluations 600

400

200 Non-Bond (kcal/mol)

0

-200 370 380 390 400 410 420 430 Bond (kcal/mol)

(e) Final Results

Figure 5.22: Pareto Fronts in each island (1CRN) Chapter 5. Experimental Framework 86

In the plot of Figure 5.23, the correlation between the energy and the rmsdCα for conforma- tions sampled by the MOEA algorithm in each island is shown. Note that island 1 got better individuals with respect to its energy terms than island four. However, island four found better individuals with respect their RMSD errors. Given that the process of combining the solutions (island T) is based on the energy terms, the final solution lost some of the good individuals generated by island 1.

5.1.5 Uteroglobin 1UTG

Uteroglobin is a protein with 70 residues, this protein presents a four helix secondary struc- ture. Uteroglobin is a progesterone binding protein that inhibits the enzyme phospholipase A2 (PLA2). Initially, Uteroglobin was found to be secreted by the lining of the uterus in rabbits. Uteroglobin appears as one of the most extensively studied proteins, particularly its physico- chemical properties, including its crystal structure and its gene.

The best computed structure, using the proposed computational method, matches the crystal kcal structure with and rmsdCα = 10.7731 and energy 1383445 mol . Contrary to the previous experiments, the best computed structure was not found in the last generation of the algorithm but after 1500000 evaluations. Another important result of this experiment is the fact that the best computed structure has a high value of energy. In general, and contrary to the former experiments there is not a closed relation between the conformations found by the four decision- making methods.

The results obtained in this experiments are not as good as expected. Some good conformations were considered through the population evolution (conformations with less than 6, 5Ao RMSD error), but these conformations were lost in the selection procedures. This fact suggest a lack of correlation between the energy of the conformations and their RMSD. Another possible expla- nation of the results is that the algorithm needs more iterations to find better conformations. This last point opens an important question: how to determine the stopping criteria of the algorithm?.

Table 5.7 depicts more information about other found conformations in the dynamics of different number of evaluations. The knee-based decision making methods have been applied in these cases to select the conformation among the set of solutions in the Pareto front after excecuting the corresponding migration procedure.

Given the native conformation of the 1UTG protein, we think that better results could be ob- tained if this protein were modeled using supersecondary structure prediction methods. How- ever, in this work, the PSIPRED method was used to predict secondary structures, which does not predict supersecondary structures.

A study of the conformations with minimum RMSD for the dynamic of different evaluations of the individuals in each island is depicted in Figure 5.24. From this plot, it can be seen that Chapter 5. Experimental Framework 87

1CRN Island 1 1CRN Island 2

100000 evaluations 100000 evaluations 1400 500000 evaluations 1400 500000 evaluations 1000000 evaluations 1000000 evaluations 1500000 evaluations 1500000 evaluations 2000000 evaluations 2000000 evaluations 1200 1200

1000 1000

800 800 Energy (kcal/mol) Energy (kcal/mol)

600 600

400 400

200 200 6.5 7 7.5 8 8.5 9 9.5 10 6.5 7 7.5 8 8.5 9 9.5 10 RMSD (Armstrongs) RMSD (Armstrongs)

(a) Island 1 (b) Island 2

1CRN Island 3 1CRN Island 4

100000 evaluations 100000 evaluations 1400 500000 evaluations 1400 500000 evaluations 1000000 evaluations 1000000 evaluations 1500000 evaluations 1500000 evaluations 2000000 evaluations 2000000 evaluations 1200 1200

1000 1000

800 800 Energy (kcal/mol) Energy (kcal/mol)

600 600

400 400

200 200 6.5 7 7.5 8 8.5 9 9.5 10 6.5 7 7.5 8 8.5 9 9.5 10 RMSD (Armstrongs) RMSD (Armstrongs)

(c) Island 3 (d) Island 4

1CRN FINAL RESULTS

100000 evaluations 1400 500000 evaluations 1000000 evaluations 1500000 evaluations 2000000 evaluations 1200

1000

800 Energy (kcal/mol)

600

400

200 7 8 9 10 11 12 RMSD (Armstrongs)

(e) Final Results

Figure 5.23: Relationship between Energy and RMSD (1CRN) Chapter 5. Experimental Framework 88 conformation with RMSD errors lower than 7Ao are produced by island3 and islandT before one million evaluations. Then, these conformations are lost by the pressure selection mechanism. At the end of the two million iterations only island2 contains conformations with RMSD errors lower than 8Ao. From Figure 5.24 it is clear that the behaviour of islandT is totally influenced by island3. This fact can be explained making an exploration of Figure 5.27, where the different Pareto fronts are depicted. A detailed plot of the minimum, maximum and average RMSD values at different time-steps of the algorithm is presented in Figure 5.25. These plots confirm the behaviours stated before. In the first one million iterations the island with the best behavior (with respect to the minimum tendence) corresponds to island three. From one million to two million iterations the island with the best behavior (with respect to its minimum values) corresponds to island two. The behavior of IslandT is very similar to that of island3 during all the evaluations.

Table 5.7: Computed protein conformation for the 1UTG protein target

kcal Iterations Decision Maker rmsdCα Energy ( mol ) 500000 Utility 13.4578 578.839 Utility + Energy 13.4578 578.839 Angles 13.6797 588.896 Angles + Energy 13.7082 581.971 1000000 Utility 13.1114 265.1213 Utility + Energy 13.1114 265.1213 Angles 13.2902 310591 Angles + Energy 14.0841 651.1969 1500000 Utility 12.3560 164.3547 Utility + Energy 12.3560 164.3547 Angles 10.7731 1383446 Angles + Energy 12.4517 174687 2000000 Utility 12.6578 91.5209 Utility + Energy 12.6578 91.5209 Angles 12.0989 2782151.915 Angles + Energy 12.4559 136808

Given that the mean values could be affected by excessively high or low RMSD errors, Figure 5.26 plots the frecuency histograms of these errors in specific intervals. In this plot it is clear that island three has the highest number of protein conformation belonging to interval one (i.e., [0A, 11A)) during the first one million evaluations. After the threshold of one million evaluations is crossed, island two has the biggest number of protein conformation belonging to interval one (i.e., [0A, 11A)). From this plot it is clear that islandT has in a fewer scale the same behaviour than island3. Figure 5.27 depicts the dynamic of the found Pareto fronts in different stages of the algorithm. These plots are useful to corroborate the ideas stated before. The Pareto front with minimum energy belongs to island three. In Figure 5.27 c) it is clear that the axis X got energy values Chapter 5. Experimental Framework 89

1UTG 20 Island 1 Island 2 Island 3 18 Island 4 Island T

16

14

12 RMSD (Armstrongs)

10

8

6 0 200000 400000 600000 800000 1e+06 1.2e+06 1.4e+06 1.6e+06 1.8e+06 2e+06 Evaluations

Figure 5.24: Behavior of the minimum RMSD in each island for the 1UTG protein

kcal kcal close to 600 mol , meanwhile the other islands barely got values close to 200 mol . Then, it is clear now why the behaviour of islandT is reasonably similar to the behaviour of island2. Specifically, in the moment that a dominant-based algorithm is applied over the four population resulting from the migrations, the resulting population will contain more individual from island2 than from the other islands.

In the plot of Figure 5.28, the correlation between the energy and the rmsdCα for conformations sampled by the MOEA algorithm in each island is shown. Specifically, for the 1UTG protein this correlation is not clear. This is one of the reasons why the scale in the axis is bigger than the scales of previous experiments. In this analysis of correlation, it is not clear that the island2 is the MOEA instance with the best performance as stated before. The reason is that the conformations with better RMSD value have high energy values that are not plotted in Figure 5.28.

5.1.6 Comparisons with other approaches

In this section, the results of the proposed approach are compared to other works in the litera- ture. Given the differences between the approaches, it is not possible to get a final conclusion about the performance of the proposed approach, but it presents a useful numeric comparison with respect to some of the reported protein structure prediction methods. Also, there is not a ”best” protein structure method reported in the literature. Even if some servers for protein structure prediction methods are considered as good methods (according to the recent experi- ments and competitions as CASP), there is not a consensus of a standard method to solve the PSP problem. This fact can be explained by the huge folded space represented by the proteins. Chapter 5. Experimental Framework 90

1UTG Island 1 1UTG Island 2

MAX MAX 24 MIN 24 MIN AVG AVG 22 22

20 20

18 18

16 16

14 14 RMSD (Armstrongs) RMSD (Armstrongs) 12 12

10 10

8 8

6 6 0 500000 1e+06 1.5e+06 2e+06 0 500000 1e+06 1.5e+06 2e+06 Evaluations Evaluations

(a) Island 1 (b) Island 2

1UTG Island 3 1UTG Island 4

MAX MAX 24 MIN 24 MIN AVG AVG 22 22

20 20

18 18

16 16

14 14 RMSD (Armstrongs) RMSD (Armstrongs) 12 12

10 10

8 8

6 6 0 500000 1e+06 1.5e+06 2e+06 0 500000 1e+06 1.5e+06 2e+06 Evaluations Evaluations

(c) Island 3 (d) Island 4

1UTG FINAL RESULTS

MAX 24 MIN AVG 22

20

18

16

14 RMSD (Armstrongs) 12

10

8

6 0 500000 1e+06 1.5e+06 2e+06 Evaluations

(e) Final Results

Figure 5.25: Minimum, average and maximum RMSD in each island (1UTG) Chapter 5. Experimental Framework 91

1UTG Island 1 1UTG Island 2 100 80 [0A-11A) [0A-11A) [11A-13A) [11A-13A) [13A-) [13A-) 70

80 60

50 60

40 Individuals Individuals 40 30

20 20

10

0 0 200 500000 1000000 1500000 2000000 200 500000 1000000 1500000 2000000 Evaluations Evaluations

(a) Island 1 (b) Island 2

1UTG Island 3 1UTG Island 4 70 100 [0A-11A) [0A-11A) [11A-13A) [11A-13A) [13A-) [13A-) 60 80

50

60 40

Individuals 30 Individuals 40

20

20 10

0 0 200 500000 1000000 1500000 2000000 200 500000 1000000 1500000 2000000 Evaluations Evaluations

(c) Island 3 (d) Island 4

1UTG Island T 90 [0A-11A) [11A-13A) 80 [13A-)

70

60

50

40 Individuals

30

20

10

0 200 500000 1000000 1500000 2000000 Evaluations

(e) Final Results

Figure 5.26: Histrogram of RMSD ranges (1UTG) Chapter 5. Experimental Framework 92

1UTG Island 1 1UTG Island 2

100000 evaluations 100000 evaluations 1400 500000 evaluations 1400 500000 evaluations 1000000 evaluations 1000000 evaluations 1500000 evaluations 1500000 evaluations 1200 2000000 evaluations 1200 2000000 evaluations

1000 1000

800 800

600 600

Non-Bond (kcal/mol) 400 Non-Bond (kcal/mol) 400

200 200

0 0

-200 -200 580 590 600 610 620 630 640 650 580 590 600 610 620 630 640 650 Bond (kcal/mol) Bond (kcal/mol)

(a) Island 1 (b) Island 2

1UTG Island 3 1UTG Island 4 700 100000 evaluations 100000 evaluations 600 500000 evaluations 500000 evaluations 1000000 evaluations 600 1000000 evaluations 1500000 evaluations 1500000 evaluations 2000000 evaluations 2000000 evaluations 400 500

400 200

300 0 200

Non-Bond (kcal/mol) -200 Non-Bond (kcal/mol) 100

-400 0

-100 -600

-200 580 590 600 610 620 630 640 650 580 590 600 610 620 630 640 650 Bond (kcal/mol) Bond (kcal/mol)

(c) Island 3 (d) Island 4

1UTG FINAL RESULTS

100000 evaluations 600 500000 evaluations 1000000 evaluations 1500000 evaluations 2000000 evaluations 400

200

0

Non-Bond (kcal/mol) -200

-400

-600

580 590 600 610 620 630 640 650 Bond (kcal/mol)

(e) Final Results

Figure 5.27: Pareto Fronts in each island (1UTG) Chapter 5. Experimental Framework 93

1UTG Island 1 1UTG Island 2 3500 3500 100000 evaluations 100000 evaluations 500000 evaluations 500000 evaluations 1000000 evaluations 1000000 evaluations 3000 1500000 evaluations 3000 1500000 evaluations 2000000 evaluations 2000000 evaluations

2500 2500

2000 2000

1500 1500 Energy (kcal/mol) Energy (kcal/mol) 1000 1000

500 500

0 0

8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15 RMSD (Armstrongs) RMSD (Armstrongs)

(a) Island 1 (b) Island 2

1UTG Island 3 1UTG Island 4 3500 3500 100000 evaluations 100000 evaluations 500000 evaluations 500000 evaluations 1000000 evaluations 1000000 evaluations 3000 1500000 evaluations 3000 1500000 evaluations 2000000 evaluations 2000000 evaluations

2500 2500

2000 2000

1500 1500 Energy (kcal/mol) Energy (kcal/mol) 1000 1000

500 500

0 0

8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15 RMSD (Armstrongs) RMSD (Armstrongs)

(c) Island 3 (d) Island 4

1UTG FINAL RESULTS 3500 100000 evaluations 500000 evaluations 1000000 evaluations 3000 1500000 evaluations 2000000 evaluations

2500

2000

1500 Energy (kcal/mol) 1000

500

0

8 9 10 11 12 13 14 15 RMSD (Armstrongs)

(e) Final Results

Figure 5.28: Relationship between Energy and RMSD (1UTG) Chapter 5. Experimental Framework 94

As expected, the homology methods will predict the foldings with more accuracy than the ab- initio methods. Then, in the following comparisons we will not include pure homology and threading methods in the following tables. Table 5.8 reports the comparison of the proposed approach versus other approaches for Met- enkephalin peptide. Table 5.9, Table 5.10, Table 5.11 and Table 5.12 report the comparison of the proposed approach versus other approaches for Crambin, Disulphide-stabilized, Represor of Primer proteins and Uteroglobin, respectively.

Table 5.8: Proposed approach versus other approaches for Met-enkephalin peptide o o Algorithm rmsdCα(A ) rmsdall−atom(A ) OUR 1.2629 2.5816 Calvo et al. [139] (2 obj) N/A 2.650 REGAL (real cod.) [140] N/A 3.23 Lamarkian (binary cod.) [140] N/A 3.33 I-PAES [21] 1.740 3.605 Baldwinian (binary cod.) [140] N/A 3.96 SGA (binary cod.) [140] N/A 4.51

Table 5.9: Proposed approach versus other approaches for Crambin protein o Algorithm rmsdCα(A ) OUR 7.3214 NSGA-II (low-level operators) [21] 10.34 NSGA-II (high-level operators) [21] 6.447 HC-GA (with hydrophobic term) [141] 5.6 HC-GA (no hydrophobic term) [141] 6.8 I-PAES [21] 4.43 Dadekar et al.-GA [142] 5.4 Scatter [143] 9.43 HC-GA (with hydrophobic term) [141] 5.6 MD [144] 3.94 (1+1)-PAES1[145] 6.18 (1+1)-PAES2[145] 7.89

Table 5.10: Proposed approach versus other approaches for Disulphide-stabilized protein o Algorithm rmsdCα(A ) OUR 3.8135 OUR (DSSP) 2.8527 Hybrid Method [146] ≈ 5.00 GAdecoy[147] 3.27 GPS [148] 3.87 MADS [148] 13.486 I-PAES [21] 2.27

The proposed approach outperform the good RMSD values obtained by different techniques in the Met-enkephalin peptide. This fact is highly important because it may be the most Chapter 5. Experimental Framework 95

Table 5.11: Proposed approach versus other approaches for Represor of Primer o Algorithm rmsdCα(A ) OUR 3.7754 I-PAES [21] 3.70 Scatter [143] 17.25 HC-GA (with hydrophobic term) [141] 5.6 Bhageerath [149] 4.3 CReF [150] 7.1 (1+1)-PAES1[145] 6.31 (1+1)-PAES2[145] 8.665

Table 5.12: Proposed approach versus other approaches for Uteroglobin o Algorithm rmsdCα(A ) OUR 10.77 I-PAES [21] 4.60 (1+1)-PAES1[145] 6.04 (1+1)-PAES2[145] 5.56 CReF [150] 11.7 MD [151] 8.30 FF [152] 5.4 Rosseta [153] 4.6 ABLE [154] 8.44 studied and used peptide in the protein structure prediction. It is also important, because it is the most standard peptide (no secondary structure prediction is performed over it) to study the performance of algorithms. The approach also shows very good results in the Represor of Primer protein, which is a difficult and big protein to predict. The proposed approach does not outperform the better results in the 1ZDD and 1CRN tests.

As stated before, it is clear that the proposed model is a good protein folding predictor, although, more work needs to be done to improve the quality of the energy function and the policies and parameters used by the model. Additionally, more experiments over different protein must be performed in order to cover a bigger part of the folding space.

In order to study the proposed model thoroughly, several specific parameter were fixed and some experiments were performed over a set of five benchmark proteins. Those parameters were fixed given the huge number of possibilities to combine these parameters. In the experimental framework these parameters were kept constant without relevance of the tested protein and no heuristic or additional information was taken into account for the evaluation of the model. Better results over the test proteins suits could be obtained if those parameter were optimized, but this task is not a main goal in the performed experimental framework. It should be stressed that in this thesis, the fixed parameters are not proposed or concluded as the optimum set of parameters to be used in the modeling of an ab-initio PSP, those parameter are simply a well-known set of options from the huge number of possibilities. Chapter 6

Conclusions and Further Work

The protein structure prediction problem is an open problem that joins biological and computa- tional concepts. Currently, there is not a best or even standard method to find protein solutions using ab-initio methods. Although scientists have worked on this problem for more than five decades, there is still a lot of work to do to find acceptable solutions to the PSP problem for proteins of realistic sizes.

This thesis contributes to the PSP research field dealing with a novel procedure for protein structure prediction based on a parallel multiobjective ab-initio approach at an atomic confor- mation level. Additionally, in this research several important concepts regarding the modeling of an ab-initio model are taking into account. An introduction, A comprehensive review of the state of the art respect to ab-initio methods, an the modelling of the ab-initio problem and its parallel implementation were deeply studied.

This work presents a complete methodology to develop ab-initio methods for the PSP problem. In a similar manner, this work can be seen as a bioinformatics tool for the protein structure prediction problem. This tool was implemented incorporating several existing software tools along with others developed as part of this work. The software developed allows the use of different score functions to evaluate the energy of protein conformations, as well as the use of different MOEA algorithms to evolve protein conformations. All these features can be computed over a parallel implementation based on an island model as the parallel paradigm and JavaSpaces as its implementation model. The bioinformatics tool is currently being implemented as a web service to allow its use by different research groups. This fact is highly important because it will allow to get a feedback about the advantages and disadvantages of the proposed model for further improvements.

The proposed model is a good protein folding predictor, although, additional work may be needed to improve the quality of the energy functions and the policies and parameters used by the model. The proposed approach can certainly benefit from other improvements, such as refinement of the MOEA algorithm and parallel implementation and the addition of new features including amino acid specific information and biological heuristic information.

96 Chapter 6. Conclusions and Further Work 97

With respect to the ab-initio modeling it can be concluded that:

• The trigonometric representation is a computational feasible spatial representation that facilitates the implementation of the chromosome in a genetic-based approach. Addition- ally, this spatial representation has biological significance allowing to model secondary structures and some spatial characteristic of the protein such as torsion and bond angles. This spatial representation also allows the reconstruction of the protein conformations and their representation in cartesian coordinates. In conclusion, the trigonometric rep- resentation allows a reduction in the number of variables that an MOEA algorithm has to manange without losing flexibility in the search process and without losing biological significance.

• From the experimentation it could be stated that even though there is not a clear cor- relation between energy and structures, the energy functions could be used to lead the evolution of the conformations in a MOEA approach. It is necessary to make some im- provements in high-resolution potentials to achieve significant progress in the field of ab initio protein fold prediction.

• A multi-objective evolutionary algorithm was capable of finding a good set of three- dimensional conformations inside a protein folded state. The use of bond and non-bond interactions are appropriate as the objectives to optimize in the MOEA. They can be considered as the main forces to direct the folding towards the native state.

• The use of secondary structure information is fundamental for the accuracy of the pre- dicted structures, given the importance of those conformations in the protein folding process present in nature. In protein structure prediction methods, the use of a rotamer library allows a structure to be determined or modeled trying the most likely side-chain conformations, saving time and producing a structure that is more likely to be correct. The use of these two heuristics highly decreases the search space, however, they have a big impact in the accuracy of the model if wrong predictions are computed.

This thesis has proposed and implemented a novel model to one of the most important, open, and interdisciplinary problems; the protein structure prediction problem. As a whole, we believe that this work makes a valuable contribution to the field because it presents a clear methodology to develop ab-initio methods. Additionally, a complete review of the state of the art in the problem is also presented.

This thesis is a small contribution respect to the long distance that must be covered to find a final solution to the PSP problem, but it is an important step that could help to direct the problem toward new unexplored roads. The obtained results correspond to the expected outcomes for an ab-initio method. It is clear that the task that ab-initio methods correctly find Chapter 6. Conclusions and Further Work 98 protein structures, even for small proteins, is still not achieved. But due to new algorithms to use the available computer power and novel algorithms to tackle the protein structure prediction problem, much progress is being made to solve the main difficulties.

We believe that this work will have a considerable impact and influence in local research com- munity. Though the results obtained in this work were very encouraging, further exploration is still necessary. With respect to the future work, this work has opened different research areas where valuable contributions can be done. This work will follow two main lines: one to improve the quality of the solutions and another to explore closely related research problems based on the proposed approach.

• A more complete study of the different parallelization alternatives, parameters and poli- cies, including their performance and scalability behavior for more complex proteins need to be studied.

• Additional work can be done on the MOEA. This optimization cannot only be restrained to the genetic operators, but also different MOEA algorithms and different energy functions in the PSP problem can be explored.

• Also, future work will focus on the use of this research to predict 20 polypeptides as part of the research on developing synthetic vaccines at the Colombia Institute of Immunology (FIDIC).

• Additional work needs to be carried out to evaluate the proposed approach, thus, we are planning to participate in CASP9. CASP (Critical Assessment of Techniques for Pro- tein Structure Prediction), a community-wide experiment for protein structure prediction which takes place every two years since 1994. CASP establishes the current state of the art in protein structure prediction methods and it identifies what progress has been made based on the prediction of a set of known structures using different approaches and techniques of research groups worldwide. Appendix A

Web Service on a computer cluster

This appendix present, in a general way, the services and content which will be available on the web server of the protein structure prediction project. To date, the design and part of the implementation are ready. Some more work must be perfomed in the implementation of the proposed approach and in the test of the web services. The web service implementation will be ready soon in the following site http://ungrid.unal.edu.co/services/pfs-index.htm.

Web services and service-oriented architectures usually refer to services that are usable by machines. In this appendix the services provided to humans will be presented. On the other hand, the services provided to automatic servers or applications (as the necessary services to participate in CASP competition) will be tackled in future works.

Figure A.1 depicts the used framework (service-oriented architecture) to provide the bioinfor- matic services (protein structure prediction ) to human users.

unGrid

4 1 2 3 Sakai/ INET Master INET 10 9 6 5

7 8

DB

Figure A.1: Web service framework architecture

The architecture is based on an enhanced version of a standard client-server architecture. A standard web browser can be used in the client side. In the server side, the new high-performance computing cluster obtained for the National University of Colombia will be used to perform the computation of the MOEA instances. Specifically, the computation of the fitness functions of the 99 Appendix A. Web Service 100 individuals for each island will be excecuted in parallel on the grid, therefore achieving significant speed-up in the predictor. This speed-up is highly important given the time constrains imposed to participated in the CASP meeting, and to improve the service to the human clients (waiting less time for a predicted conformation).

The main service that the application will offer to human clients is the prediction of a protein structure given its amino-acid sequence. Then, the main task performed by the web service could be listed as follows.

• Protein Structure Prediction: A file with extension .pdb must be generated with the information about the predicted structure based on the amino-acid sequence entry (.fasta file).

• Molecular viewer: A visualization of the predicted protein conformation must be generated using the software JMOL.

• Amino-acid sequence entry: The human user must have at least two different ways to bring the amino-acid sequence. Write down the complete sequence or bring a .fasta file.

• Outcome report: An outcome report must be generated if the user checked that option. Additional information about the performance of the algorithm and the used parameters and policies could be expanded.

• Publications: Some information about the publications refering the protein predictor must be depicted.

Figure A.2 depicts the general use case model for the implemented web service. As depidted by the figure, there are four main services performed by the server.

• Registration: This function will allow the human user to be part of the system, giving to him a user account and a unique password. In this manner, the human user will have user privileges and will be able to ask for services.

• Session start: This case will allow the user to access the application services through the logging with his username and password.

• Services: In this case, the user run the protein structure predictor. The user must define the parameters used by the algoithm running.

• Outcome: After an user has requested a service and the protein structure predictor has finished its iterations, the system must send the outcomes of the service (.pdb file and another information requested by the user) to his/her registered email account.

• References: The user must be able to have information about the bioinformatic group publications. Appendix A. Web Service 101

Figure A.2: Web service use case model Bibliography

[1] David L. Nelson and Michael M. Cox. Lehninger Principles of Biochemistry, Fourth Edition. W. H. Freeman, 4 edition, April 2004. ISBN 0716743396.

[2] Robert K. Murray, Daryl K. Granner, Peter A. Mayes, and Victor W. Rodwell. Harper’s Biochemistry. McGraw-Hill Publishing Co, 25th edition, August 1999. ISBN 0838536840.

[3] Donald Kennedy. 125. Science, 309:19, July 2005. doi: 10.1126/science.1115951. URL http://www.sciencemag.org.

[4] Carl Ivar Branden and John Tooze. Introduction to Protein Structure. Garland Science, 2 edition, 1999. ISBN 0815323050.

[5] Daisuke Kihara, Yang Zhang, Hui Lu, Andrzej Kolinski, and Jeffrey Skolnick. Ab ini- tio protein structure prediction on a genomic scale: Application to the mycoplasma genitalium genome. Proceedings of the National Academy of Sciences of the United States of America, 99:5993?5998, April 2002. doi: 10.1073/pnas.092135699. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=122890.

[6] Jr. Merz Kenneth M. The Protein Folding Problem and Tertiary Structure Prediction. Birkhauser Boston, 1 edition, August 1994. ISBN 0817636935.

[7] R Bonneau and D Baker. Ab initio protein structure prediction: progress and prospects. Annual Review of Biophysics and Biomolecular Structure, 30:173–89, 2001. ISSN 1056- 8700. doi: 11340057. URL http://www.ncbi.nlm.nih.gov/pubmed/11340057.

[8] Andr´asFiser, Michael Feig, Charles L Brooks, and Andrej Sali. Evolution and physics in comparative protein structure modeling. Accounts of Chemical Research, 35:413–21, June 2002. ISSN 0001-4842. doi: 12069626. URL http://www.ncbi.nlm.nih.gov/pubmed/ 12069626.

[9] T. L. Blundell, B. L. Sibanda, M. J. E. Sternberg, and J. M. Thornton. Knowledge-based prediction of protein structures and the design of novel molecules. Nature, 326:347–352, March 1987. doi: 10.1038/326347a0. URL http://dx.doi.org/10.1038/326347a0.

102 Bibliography 103

[10] F. Allen, G. Almasi, W. Andreoni, D. Beece, B. J. Berne, A. Bright, J. Brunheroto, C. Cascaval, J. Castanos, P. Coteus, P. Crumley, A. Curioni, M. Denneau, W. Do- nath, M. Eleftheriou, B. Fitch, B. Fleischer, C. J. Georgiou, R. Germain, M. Gi- ampapa, D. Gresh, M. Gupta, R. Haring, H. Ho, P. Hochschild, S. Hummel, T. Jonas, D. Lieber, G. Martyna, K. Maturu, J. Moreira, D. Newns, M. Newton, R. Philhower, T. Picunko, J. Pitera, M. Pitman, R. Rand, A. Royyuru, V. Salapura, A. Sanomiya, R. Shah, Y. Sham, S. Singh, M. Snir, F. Suits, R. Swetz, W. C. Swope, N. Vish- numurthy, T. J. C. Ward, H. Warren, and R. Zhou. Blue gene: a vision for pro- tein science using a petaflop supercomputer. IBM Syst. J., 40:310–327, 2001. URL http://portal.acm.org/citation.cfm?id=1017221.1017226.

[11] C.B. Anfinsen, R.R. Redfield, W.L. Choate, J. Page, and W.R. Carroll. Studies on the gross structure, cross-linkages, and terminal sequences in ribonuclease. Journal of Biolog- ical Chemistry, 207(1):201–210, 1954.

[12] C.B. Anfinsen. Principles that govern the folding of protein chains. Science, 181(4096): 223–230, 1973.

[13] C. Levinthal. Are there pathways for protein folding. J. Chim. Phys, 65(1):44–45, 1968.

[14] G.J.A. Vidugiris, J.L. Markley, and C.A. Royer. Evidence for a molten globule-like tran- sition state in protein folding from determination of activation volumes. Biochemistry, 34 (15):4909–4912, 1995.

[15] VI Abkevich, AM Gutin, and EI Shakhnovich. Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry, 33(33):10026–10036, 1994.

[16] P.G. Wolynes, J.N. Onuchic, and D. Thirumalai. Navigating the folding routes. Science, 267(5204):1619–1620, 1995.

[17] JD Bryngelson, JN Onuchic, ND Socci, and PG Wolynes. Funnels, pathways and the energy landscape of protein folding: a synthesis. Journal reference: Proteins-Struct. Func. and Genetics, 21:167, 1995.

[18] Y. Duan and P. A. Kollman. Computational protein folding: From lattice to all-atom. IBM Systems Journal, 40:297–309, 2001.

[19] S.S. Plotkin and J.N. Onuchic. Investigation of routes and funnels in protein folding by free energy functional methods. Proceedings of the National Academy of Sciences, 97(12): 6509–6514, 2000.

[20] Kensal E van Holde, Curtis Johnson, and Pui Shing Ho. Principles of Physical Biochem- istry. Prentice Hall, 1st edition, 1998. ISBN 0137204590. Bibliography 104

[21] Vincenzo Cutello, Giuseppe Narzisi, and Giuseppe Nicosia. A multi-objective evolutionary approach to the protein structure prediction problem. Journal of The Royal Society Interface, 3:139–151, February 2006. doi: 10.1098/rsif.2005.0083. URL http://dx.doi. org/10.1098/rsif.2005.0083.

[22] M S Johnson, N Srinivasan, R Sowdhamini, and T L Blundell. Knowledge-based protein modeling. Critical Reviews in Biochemistry and Molecular Biology, 29:1–68, 1994. ISSN 1040-9238. doi: 8143488. URL http://www.ncbi.nlm.nih.gov/pubmed/8143488.

[23] C.A. Floudas, H.K. Fung, S.R. McAllister, M. Mönnigmann, and R. Rajgaria. Ad- vances in protein structure prediction and de novo protein design: A review. Chemical En- gineering Science, 61:966–988, February 2006. doi: 10.1016/j.ces.2005.04.009. URL http: //www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6TFK-4HGM7FH-1&_ user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_version=1&_urlVersion= 0&_userid=10&md5=ad20ca5f94a28754f5a466f5662969ad.

[24] John Moult, Krzysztof Fidelis, Adam Zemla, and Tim Hubbard. Critical assessment of methods of protein structure prediction (casp)-round v. Proteins: Structure, Function, and Genetics, 53:334–339, 2003. doi: 10.1002/prot.10556. URL http://dx.doi.org/10. 1002/prot.10556.

[25] Christopher Bystroff and David Baker. Prediction of local structure in pro- teins using a library of sequence-structure motifs1. Journal of Molecular Bi- ology, 281:565–577, August 1998. doi: 10.1006/jmbi.1998.1943. URL http: //www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WK7-45S498H-77&_ user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_version=1&_urlVersion= 0&_userid=10&md5=47d1e02176644fc373ff263bd4e82a86.

[26] Kevin Karplus, Christian Barrett, Melissa Cline, Mark Diekhans, Leslie Grate, and Richard Hughey. Predicting protein structure using only sequence information. Pro- teins: Structure, Function, and Genetics, 37:121–125, 1999. doi: 10.1002/(SICI) 1097-0134(1999)37:3+h121::AID-PROT16i3.0.CO;2-Q. URL http://dx.doi.org/10. 1002/(SICI)1097-0134(1999)37:3+<121::AID-PROT16>3.0.CO;2-Q.

[27] F.M. Richards. The protein folding problem. Scientific American, 264(1):54–63, 1991.

[28] C. Chothia and A M Lesk. The relation between the divergence of sequence and structure in proteins. The EMBO Journal, 5:823?826, April 1986. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1166865& rendertype=abstract.

[29] A M Lesk and C Chothia. How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. Journal of Molecular Bibliography 105

Biology, 136:225–70, 1980. ISSN 0022-2836. doi: 7373651. URL http://www.ncbi.nlm. nih.gov/pubmed/7373651.

[30] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank. Nucl. Acids Res., 28:235–242, 2000. doi: 10.1093/nar/28.1.235. URL http://nar.oxfordjournals. org/cgi/content/abstract/28/1/235.

[31] R S´anchez and A Sali. Large-scale protein structure modeling of the saccharomyces cere- visiae genome. Proceedings of the National Academy of Sciences of the United States of America, 95:13597–602, November 1998. ISSN 0027-8424. doi: PMC24864. URL http://www.ncbi.nlm.nih.gov/pubmed/9811845.

[32] David Baker and Andrej Sali. Protein structure prediction and structural genomics. Science, 294:93–96, October 2001. doi: 10.1126/science.1065659. URL http://www. sciencemag.org/cgi/content/abstract/294/5540/93.

[33] John Westbrook, Zukang Feng, Shri Jain, T. N. Bhat, Narmada Thanki, Veerasamy Ravichandran, Gary L. Gilliland, Wolfgang Bluhm, Helge Weissig, Douglas S. Greer, Philip E. Bourne, and Helen M. Berman. The protein data bank: unifying the archive. Nucl. Acids Res., 30:245–248, 2002. doi: 10.1093/nar/30.1.245. URL http://nar. oxfordjournals.org/cgi/content/abstract/30/1/245.

[34] J Greer. Comparative modeling methods: application to the family of the mammalian serine proteases. Proteins, 7:317–34, 1990. ISSN 0887-3585. doi: 2381905. URL http: //www.ncbi.nlm.nih.gov/pubmed/2381905.

[35] R Unger, D Harel, S Wherland, and J L Sussman. A 3d building blocks approach to analyzing and predicting structure of proteins. Proteins, 5:355–73, 1989. ISSN 0887-3585. doi: 2798411. URL http://www.ncbi.nlm.nih.gov/pubmed/2798411.

[36] T.F. Havel and M.E. Snow. A new method for building protein conformations from sequence alignments with homologues of known structure. J. Mol. Biol, 217(1):1–7, 1991.

[37] Roland Luthy, James U. Bowie, and David Eisenberg. Assessment of protein models with three-dimensional profiles. Nature, 356:83–85, March 1992. doi: 10.1038/356083a0. URL http://dx.doi.org/10.1038/356083a0.

[38] A. M. Lesk. Protein Architecture: A Practical Approach. Oxford University Press, USA, November 1991. ISBN 0199630550.

[39] C. Chothia. Proteins?1000 families for the molecular biologist. Nature, 357:543–544, 1992.

[40] Catching a common fold, 1993. URL http://www.proteinscience.org/cgi/content/ citation/2/6/877. Bibliography 106

[41] R. H. Lathrop, R. G. Rogers, J. Bienkowska, B. K. M. Bryant, L. J. Buturovic, C. Gai- tatzes, R. Nambudripad, J. V. White, and T. F. Smith. Analysis and algorithms for protein sequence-structure alignment. New Comprehensive Biochemistry, 32:227–283, 1998.

[42] Lisa N. Kinch, James O. Wrabl, S. Sri Krishna, Indraneel Majumdar, Ruslan I. Sadreyev, Yuan Qi, Jimin Pei, Hua Cheng, and Nick V. Grishin. Casp5 assessment of fold recognition target predictions. Proteins: Structure, Function, and Genetics, 53:395–409, 2003. doi: 10.1002/prot.10557. URL http://dx.doi.org/10.1002/prot.10557.

[43] D. T. Jones, W. R. Taylort, and J. M. Thornton. A new approach to protein fold recog- nition. Nature, 358:86–89, July 1992. doi: 10.1038/358086a0. URL http://dx.doi.org/ 10.1038/358086a0.

[44] Burkhard Rost, Reinhard Schneider, Chris S, and er. Protein fold recognition by prediction-based threading1. Journal of Molecular Biology, 270:471–480, July 1997. doi: 10.1006/jmbi.1997.1101. URL http://www.sciencedirect.com/ science?_ob=ArticleURL&_udi=B6WK7-45RFFJF-2B&_user=10&_rdoc=1&_fmt= &_orig=search&_sort=d&view=c&_version=1&_urlVersion=0&_userid=10&md5= 5512596da9a38b9d514a717123d8fd41.

[45] S´ılviaSalamanca Segu´ı. Plegamiento y funcionalidad biol´ogica del inhibidor de metalo- carboxipeptidasas lci, April 2004. URL http://dialnet.unirioja.es/servlet/tesis? codigo=4859.

[46] K. A. Dill, S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas, and H. S. Chan. Principles of protein folding – a perspective from simple exact models. Protein Sci, 4: 561–602, April 1995. URL http://www.proteinscience.org/cgi/content/abstract/ 4/4/561.

[47] Oleg Ptitsyn. How molten is the molten globule? Nat Struct Mol Biol, 3:488–490, June 1996. doi: 10.1038/nsb0696-488. URL http://dx.doi.org/10.1038/nsb0696-488.

[48] Ob Ptitsyn. Molten globule and protein folding. Advances in protein chemistry, 47:83–229, 1995.

[49] HJ Dyson and PE Wright. Peptide conformation and protein folding. Current Opinion in Structural Biology, 3:60–60, 1993.

[50] Peter S. Kim and Robert L. Baldwin. Intermediates in the folding reactions of small proteins, November 2003. URL http://arjournals.annualreviews.org/doi/abs/10. 1146/annurev.bi.59.070190.003215.

[51] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, and P. G. Wolynes. Funnels, pathways, and the energy landscape of protein folding: A synthesis. Proteins-New York, 21:167–167, 1995. Bibliography 107

[52] Ken A. Dill and Hue Sun Chan. From levinthal to pathways to funnels. Nat Struct Mol Biol, 4:10–19, 1997. doi: 10.1038/nsb0197-10. URL http://dx.doi.org/10.1038/ nsb0197-10.

[53] On the thermodynamic hypothesis of protein folding, May 1998. URL http://www.pnas. org/content/95/10/5545.abstract.

[54] J. Skolnick and A. Kolisnkia. Dynamic monte carlo simulations of a new lattice model of globular protedin folding, stucture and dynamics. Journal of molecular biology, 221(2): 499–531, 1991.

[55] Ron Unger and John Moult. Genetic algorithms for protein folding simulations. University of Maryland at College Park, 1992. URL http://portal.acm.org/citation.cfm?id= 132510.

[56] D. A. Hinds and M. Levitt. Exploring conformational space with a simple lattice model for protein structure. J Mol. Rid, 243:6W, 1994.

[57] Ba Reva, Av Finkelstein, Mf Sanner, and Aj Olson. Adjusting potential energy functions for lattice models of chain molecules. Proteins, 25:379–388, 1996.

[58] Sergio Duarte, David Becerra, Luis Nino, and Yoan Pinzon. A novel ab-initio genetic- based approach for protein folding prediction. pages 393–400, London, England, 2007. ACM. ISBN 978-1-59593-697-4. doi: 10.1145/1276958.1277039. URL http://portal. acm.org/citation.cfm?id=1277039.

[59] Jeffrey Skolnick and Andrzej Kolinski. Simulations of the folding of a globular protein. Science, 250:1121–1125, November 1990. doi: 10.1126/science.250.4984.1121. URL http: //www.sciencemag.org/cgi/content/abstract/250/4984/1121.

[60] Robert L. Baldwin. Matching speed and stability. Nature, 369:183–184, May 1994. doi: 10.1038/369183a0. URL http://dx.doi.org/10.1038/369183a0.

[61] A. Sali, E. Shakhnovich, and M. Karplusj. Kinetics of protein folding a lattice model study of the requirements for folding to the native state. J. Mol. ??, 235:1614–1636, 1994.

[62] V. Cutello, G. Nicosia, M. Pavone, and J. Timmis. An immune algorithm for protein structure prediction on lattice models. IEEE Transactions on Evolutionary Computation, 11:101, 2007.

[63] K. A. Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24: 1501–1509, 1985.

[64] K. Yue, K.M. Fiebig, P.D Thomas, H.S Chan, E.I. Shakhnovich, and K.A Dill. A test of lattice protein folding algorithms. National Academy of Sciences, 1995. URL http: //www.pnas.org/content/92/1/325.abstract. Bibliography 108

[65] N Go. Theoretical studies of protein folding, November 2003. URL http://arjournals. annualreviews.org/doi/abs/10.1146/annurev.bb.12.060183.001151.

[66] I. Bahar and R. L. Jernigan. Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation1. Journal of Molecular Biology, 266:195–214, February 1997. doi: 10.1006/jmbi.1996.0758. URL http: //www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WK7-45RFG0C-70&_ user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_version=1&_urlVersion= 0&_userid=10&md5=58eeffde2bcbce96e9aaccc4ca493124.

[67] S. Miyazawa and R. L. Jernigan. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules, 18:534– 552, 1985.

[68] Arvind Gupta, Jan Manuch, and Ladislav Stacho. Structure-approximating inverse protein folding problem in the 2d hp model, December 2005. URL http://www.liebertonline. com/doi/abs/10.1089/cmb.2005.12.1328.

[69] Arvind Gupta, Jan Manuch, and Ladislav Stacho. Inverse protein folding in 2d hp mode (extended abstract). pages 311–318. IEEE Computer Society, 2004. ISBN 0-7695-2194-0. URL http://portal.acm.org/citation.cfm?id=1018417.1019262.

[70] S S Sung. Helix folding simulations with various initial conformations. Biophys. J., 66: 1796–1803, June 1994. URL http://www.biophysj.org/cgi/content/abstract/66/6/ 1796.

[71] S S Sung. Folding simulations of alanine-based peptides with lysine residues. Biophys. J., 68:826–834, March 1995. URL http://www.biophysj.org/cgi/content/abstract/68/ 3/826.

[72] Kim T. Simons, Charles Kooperberg, Enoch Huang, and David Baker. Assem- bly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions1. Journal of Molecu- lar Biology, 268:209–225, April 1997. doi: 10.1006/jmbi.1997.0959. URL http: //www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WK7-45VGF7T-N&_ user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_version=1&_urlVersion= 0&_userid=10&md5=8eb18cf341e02b770ec7c1fb744e8f42.

[73] M J Sternberg, F E Cohen, and W R Taylor. A combinational approach to the prediction of the tertiary fold of globular proteins. Biochemical Society Transactions, 10:299–301, October 1982. ISSN 0300-5127. doi: 6183155. URL http://www.ncbi.nlm.nih.gov/ pubmed/6183155. Bibliography 109

[74] D. Frishman and P. Argos. Knowledge-based protein secondary structure assignment. Proteins: Struct. Funct. Genet, 23:566?579, 1995.

[75] Ram Samudrala and John Moult. An all-atom distance-dependent conditional proba- bility discriminatory function for protein structure prediction1. Journal of Molecular Biology, 275:895–916, February 1998. doi: 10.1006/jmbi.1997.1479. URL http: //www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WK7-45P0FFJ-1F&_ user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_acct=C000050221&_ version=1&_urlVersion=0&_userid=10&md5=3285e848bef486a5ce218f1645b5abd1.

[76] H J Berendsen. A glimpse of the holy grail. Science (New York, N.Y.), 282:642–3, October 1998. ISSN 0036-8075. doi: 9841417. URL http://www.ncbi.nlm.nih.gov/ pubmed/9841417.

[77] Christopher M Dobson and Martin Karplus. The fundamentals of protein folding: bringing together theory and experiment. Current Opinion in Structural Biol- ogy, 9:92–101, February 1999. doi: 10.1016/S0959-440X(99)80012-8. URL http: //www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6VS6-3W8SDFR-D&_ user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_acct=C000050221&_ version=1&_urlVersion=0&_userid=10&md5=5b7493aaab7a229f3e6adc4183e4133d.

[78] Themis Lazaridis and Martin Karplus. ”new view” of protein folding reconciled with the old through multiple unfolding simulations. Science, 278:1928–1931, December 1997. doi: 10.1126/science.278.5345.1928. URL http://www.sciencemag.org/cgi/content/ abstract/278/5345/1928.

[79] W. J. Hehre. Ab initio molecular orbital theory. Accounts of Chemical Research, 9: 399–406, 1976.

[80] U. Burkert and N. L. Allinger. Molecular Mechanics. An American Chemical Society Publication, 1982.

[81] C. S. Tsai. An Introduction to Computational Biochemistry. John Wiley & Sons, Inc. New York, NY, USA, 2002.

[82] W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. Journal of the American Chemical Society, 117:5179–5197, 1995.

[83] S. J. Weiner, P. A. Kollman, D. A. Case, U. C. Singh, C. Ghio, G. Alagona, S. Profeta, and P. Weiner. A new force field for molecular mechanical simulation of nucleic acids and proteins. Journal of the American Chemical Society, 106:765–784, 1984. Bibliography 110

[84] SJ Weiner, PA Kollman, DT Nguyen, and DA Case. An all atom force field for simulations of proteins and nucleic acids. Journal of computational chemistry, 7:230–252, 1986.

[85] BR Brooks, RE Bruccoleri, BD Olafson, DJ States, S. Swaminathan, and M. Karplus. Charmm: a program for macromolecular energy, minimization, and dynamics calculations. Journal of computational chemistry, 4:187–217, 1983.

[86] Ram Samudrala, Yu Xia, Enoch Huang, and Michael Levitt. Ab initio protein structure prediction using a combined hierarchical approach. Proteins: Structure, Function, and Ge- netics, 37:194–198, 1999. doi: 10.1002/(SICI)1097-0134(1999)37:3+h194::AID-PROT24i 3.0.CO;2-F. URL http://dx.doi.org/10.1002/(SICI)1097-0134(1999)37:3+<194:: AID-PROT24>3.0.CO;2-F.

[87] Shawna Thomas, Guang Song, and Nancy M. Amato. Protein folding by motion planning. Physical Biology, 2:S148–S155, 2005. ISSN 1478-3975. URL http://www.iop.org/EJ/ abstract/1478-3975/2/4/S09.

[88] C. D. Snow, H. Nguyen, V. S. Pande, and M. Gruebele. Absolute comparison of simulated and experimental protein-folding dynamics. Nature, 420:102–106, 2002.

[89] T. J. Brunette and O. Brock. Model-based search to determine minima in molecular energy landscapes. Dept. of Computer Science, University of Massachusetts, Amherst, MA, 2005.

[90] Martin Karplus. Molecular dynamics of biological macromolecules: A brief history and perspective. Biopolymers, 68:350–358, 2003. doi: 10.1002/bip.10266. URL http://dx. doi.org/10.1002/bip.10266.

[91] Tomas Hansson, Chris Oostenbrink, and WilfredF van Gunsteren. Molecular dy- namics simulations. Current Opinion in Structural Biology, 12:190–196, April 2002. doi: 10.1016/S0959-440X(02)00308-1. URL http://www.sciencedirect. com/science?_ob=ArticleURL&_udi=B6VS6-45JPC7V-B&_user=10&_rdoc=1&_fmt= &_orig=search&_sort=d&view=c&_version=1&_urlVersion=0&_userid=10&md5= 9dc1830efc3e43f52ce1d905ef97ed29.

[92] L. Kale, R. Skeel, M. Bhandarkar, R. Brunner, A. Gursoy, N. Krawetz, J. Phillips, A. Shi- nozaki, K. Varadarajan, and K. Schulten. Namd2: Greater scalability for parallel molec- ular dynamics. Journal of Computational Physics, 151:283–312, 1999.

[93] M. Levitt. Molecular dynamics of native protein. i: Computer simulation of trajectories. Journal of molecular biology, 168:595–620, 1983.

[94] M Levitt. Protein folding by restrained energy minimization and molecular dynamics. Journal of Molecular Biology, 170:723–64, November 1983. ISSN 0022-2836. doi: 6195346. URL http://www.ncbi.nlm.nih.gov/pubmed/6195346. Bibliography 111

[95] Blake Fitch, Aleksandr Rayshubskiy, Maria Eleftheriou, T.J. Ward, Mark Giampapa, Yuri Zhestkov, Michael Pitman, Frank Suits, Alan Grossfield, Jed Pitera, William Swope, Ruhong Zhou, Scott Feller, and Robert Germain. Blue Matter: Strong Scaling of Molecular Dynamics on Blue Gene/L, pages 846–854. 2006. URL http://dx.doi.org/10.1007/ 11758525_113.

[96] Ming-Hong Hao and Harold A. Scheraga. Statistical thermodynamics of protein folding: Comparison of a mean-field theory with monte carlo simulations. The Journal of Chemical Physics, 102:1334–1348, 1995. doi: 10.1063/1.468920. URL http://link.aip.org/link/ ?JCP/102/1334/1.

[97] I. Beichl and F. Sullivan. Guest editors’ introduction: Monte carlo methods. Computing in Science & Engineering, 8:7–8, 2006. ISSN 1521-9615. doi: 10.1109/MCSE.2006.27.

[98] N Metropolis and S Ulam. The monte carlo method. Journal of the American Statistical Association, 44:335–41, September 1949. ISSN 0162-1459. doi: 18139350. URL http: //www.ncbi.nlm.nih.gov/pubmed/18139350.

[99] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21: 1087, 1953.

[100] Young Min Rhee and Vijay S. Pande. Multiplexed-replica exchange molecular dynamics method for protein folding simulation. Biophys. J., 84:775–786, February 2003. URL http://www.biophysj.org/cgi/content/abstract/84/2/775.

[101] Ulrich H. E Hansmann. Parallel tempering algorithm for conformational studies of biolog- ical molecules. physics/9710041, October 1997. URL http://arxiv.org/abs/physics/ 9710041.

[102] D. D. Frantz, D. L. Freeman, and J. D. Doll. Reducing quasi-ergodic behavior in monte carlo simulations by j-walking: Applications to atomic clusters. The Journal of Chemical Physics, 93:2769–2784, 1990. doi: 10.1063/1.458863. URL http://link.aip.org/link/ ?JCP/93/2769/1.

[103] Huafeng Xu and B. J. Berne. Multicanonical jump walking: A method for efficiently sampling rough energy landscapes. The Journal of Chemical Physics, 110:10299–10306, June 1999. doi: 10.1063/1.478963. URL http://link.aip.org/link/?JCP/110/10299/ 1.

[104] Ruhong Zhou and B. J. Berne. Smart walking: A new method for boltzmann sampling of protein conformations. The Journal of Chemical Physics, 107:9185–9196, December 1997. doi: 10.1063/1.475210. URL http://link.aip.org/link/?JCP/107/9185/1. Bibliography 112

[105] Bernd A. Berg and Thomas Neuhaus. Multicanonical ensemble: A new approach to simulate first-order phase transitions. Physical Review Letters, 68:9, 1992. doi: 10.1103/ PhysRevLett.68.9. URL http://link.aps.org/abstract/PRL/v68/p9.

[106] Yang Zhang, Daisuke Kihara, and Jeffrey Skolnick. Local energy landscape flattening: Parallel hyperbolic monte carlo sampling of protein folding. Proteins: Structure, Function, and Genetics, 48:192–201, 2002. doi: 10.1002/prot.10141. URL http://dx.doi.org/10. 1002/prot.10141.

[107] Geraint L. Thomas, Richard B. Sessions, and Martin J. Parker. Density guided importance sampling: application to a reduced model of protein folding. Bioinformatics, 21:2839– 2843, June 2005. doi: 10.1093/bioinformatics/bti421. URL http://bioinformatics. oxfordjournals.org/cgi/content/abstract/21/12/2839.

[108] Oivind Skare, Erik Bolviken, and Lars Holden. Improved sampling-importance resampling and reduced bias importance sampling. Scandinavian Journal of Statistics, 30:719–737, 2003. doi: 10.1111/1467-9469.00360. URL http://dx.doi.org/10.1111/1467-9469. 00360.

[109] Shankar Kumar, John M. Rosenberg, Djamal Bouzida, Robert H. Swendsen, and Pe- ter A. Kollman. The weighted histogram analysis method for free-energy calculations on biomolecules. i. the method. Journal of Computational Chemistry, 13:1011–1021, 1992. doi: 10.1002/jcc.540130812. URL http://dx.doi.org/10.1002/jcc.540130812.

[110] Jooyoung Lee. New monte carlo algorithm: Entropic sampling. Physical Review Let- ters, 71:211, July 1993. doi: 10.1103/PhysRevLett.71.211. URL http://link.aps.org/ abstract/PRL/v71/p211.

[111] Jooyoung Lee, Harold A. Scheraga, and S. Rackovsky. New optimization method for conformational energy calculations on polypeptides: Conformational space anneal- ing. Journal of Computational Chemistry, 18:1222–1232, 1997. doi: 10.1002/(SICI) 1096-987X(19970715)18:9h1222::AID-JCC10i3.0.CO;2-7. URL http://dx.doi.org/10. 1002/(SICI)1096-987X(19970715)18:9<1222::AID-JCC10>3.0.CO;2-7.

[112] F. Glover and M. Laguna. Tabu search, Modern heuristic techniques for combinatorial problems. John Wiley & Sons, Inc., New York, NY, 1993.

[113] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, May 1983. doi: 10.1126/science.220.4598.671. URL http://www. sciencemag.org/cgi/content/abstract/220/4598/671.

[114] Darrell Whitley. An overview of evolutionary algorithms: practical issues and common pitfalls. Information and Software Technology, 43:817–831, December 2001. doi: Bibliography 113

10.1016/S0950-5849(01)00188-4. URL http://www.sciencedirect.com/science? _ob=ArticleURL&_udi=B6V0B-44D4196-3&_user=10&_rdoc=1&_fmt=&_orig=search&_ sort=d&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5= 6d01279f5f7b0988de18faa769305855.

[115] Jonathan Shapiro. Genetic Algorithms in Machine Learning, pages 146–168. 2001. URL http://dx.doi.org/10.1007/3-540-44673-7_7.

[116] L. N. De Castro and J. Timmis. Artificial Immune Systems: A New Computational Intelligence Approach. Springer, 2002.

[117] A. Anile, V. Cutello, G. Narzisi, G. Nicosia, and S. Spinella. Determination of pro- tein structure and dynamics combining immune algorithms and pattern search meth- ods. Natural Computing, 6:55–72, March 2007. doi: 10.1007/s11047-006-9027-3. URL http://dx.doi.org/10.1007/s11047-006-9027-3.

[118] Yong K. Hwang and Narendra Ahuja. Gross motion planning, survey. ACM Comput. Surv., 24:219–291, 1992. doi: 10.1145/136035.136037. URL http://portal.acm.org/ citation.cfm?id=136035.136037.

[119] Shawna Thomas, Xinyu Tang, Lydia Tapia, and Nancy M. Amato. Simulating protein motions with rigidity analysis, August 2007. URL http://www.liebertonline.com/doi/ abs/10.1089/cmb.2007.R019.

[120] Guang Song and Nancy M. Amato. Using motion planning to study protein folding pathways. pages 287–296, Montreal, Quebec, Canada, 2001. ACM. ISBN 1-58113-353-7. doi: 10.1145/369133.369239. URL http://portal.acm.org/citation.cfm?id=369239.

[121] G. Song, S. Thomas, K. Dill, JM Scholtz, and NM Amato. A path planning-based study of protein folding with a case study of hairpin formation in protein g and l. Biocomputing 2003, 2002.

[122] L. Tapia, X. Tang, S. Thomas, and N. M. Amato. Roadmap-based methods for studying protein folding kinetics. oct 2006.

[123] L.J. McGuffin, K. Bryson, and D.T. Jones. The PSIPRED protein structure prediction server. Bioinformatics, 16(4):404–405, 2000.

[124] S.C. Lovell, J.M. Word, J.S. Richardson, and D.C. Richardson. The penultimate rotamer library. Proteins Structure Function and Genetics, 40(3):389–408, 2000.

[125] R. More. Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Science, 6(8):1661–1681, 1997.

[126] A MacKerell. Protein Force Fields. Encyclopedia of Computational Chemistry, 1998. Bibliography 114

[127] J. Handl, D.B. Kell, and J. Knowles. Multiobjective optimization in bioinformatics and computational biology. IEEE/ACM Transactions on Computational Biology and Bioin- formatics, pages 279–292, 2007.

[128] K. Deb. Multi-objective optimization using evolutionary algorithms. Wiley, 2001.

[129] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. Lecture Notes in Computer Science, pages 849–858, 2000.

[130] J. Branke, K. Deb, H. Dierolf, and M. Osswald. Finding knees in multi-objective opti- mization. Lecture Notes in Computer Science, pages 722–731, 2004.

[131] D.A. Van Veldhuizen, J.B. Zydallis, and G.B. Lamont. Issues in parallelizing multiobjec- tive evolutionary algorithms for real world applications. In Proceedings of the 2002 ACM symposium on Applied computing, pages 595–602. ACM New York, NY, USA, 2002.

[132] DA Van Veldhuizen, JB Zydallis, and GB Lamont. Considerations in engineering parallel multiobjective evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7(2):144–173, 2003.

[133] F. De Toro, J. Ortega, and B. Paechter. Parallel single front genetic algorithm: Perfor- mance analysis in a cluster system. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International, page 8, 2003.

[134] F. de Toro Negro, J. Ortega, E. Ros, S. Mota, B. Paechter, and JM Martın. PSFGA: Parallel processing and evolutionary computation for multiobjective optimisation. Parallel Computing, 30(5-6):721–739, 2004.

[135] F. Streichert, H. Ulmer, and A. Zell. Parallelization of multi-objective evolutionary algo- rithms using clustering algorithms. In EMO, pages 92–107. Springer, 2005.

[136] E.G. Talbi, S. Mostaghim, T. Okabe, H. Ishibuchi, G. Rudolph, and C.A.C. Coello. Par- allel Approaches for Multiobjective Optimization. Lecture Notes In Computer Science, pages 349–372, 2008.

[137] E. Freeman, K. Arnold, and S. Hupfer. JavaSpaces principles, patterns, and practice. Addison-Wesley Longman Ltd. Essex, UK, UK, 1999.

[138] C.A.C. Coello, G.B. Lamont, and D.A. Van Veldhuizen. Evolutionary algorithms for solving multi-objective problems. Springer-Verlag New York Inc, 2007.

[139] JC Calvo and J. Ortega. Parallel protein structure prediction by multiobjective opti- mization. In 17th Euromicro Int. Conference on Parallel, Distributed and Network-based processing, PDP, 2009. Bibliography 115

[140] C.E. Kaiser Jr, G.B. Lamont, L.D. Merkle, G.H. Gates Jr, and R. Pachter. Polypeptide structure prediction: real-value versus binary hybrid genetic algorithms. In Proceedings of the 1997 ACM symposium on Applied computing, pages 279–286. ACM New York, NY, USA, 1997.

[141] LR Cooper, DW Corne, and MJ Crabbe. Use of a novel Hill-climbing genetic algorithm in protein folding simulations. Computational biology and chemistry, 27(6):575, 2003.

[142] T. Dandekar and P. Argos. Identifying the tertiary fold of small proteins with different topologies from sequence and secondary structure using the genetic algorithm and ex- tended criteria specific for strand regions. Journal of Molecular Biology, 256(3):645–660, 1996.

[143] C. Kehyayan, N. Mansour, and H. Khachfe. Evolutionary Algorithm for Protein Structure Prediction. In Advanced Computer Theory and Engineering, 2008. ICACTE’08. Interna- tional Conference on, pages 925–929, 2008.

[144] Y. Cui, R.S. Chen, and W.H. Wong. Protein folding simulation with genetic algorithm and supersecondary structure constraints. Proteins Structure Function and Genetics, 31 (3):247–257, 1998.

[145] M.E. Algorithms. Computational Studies of Peptide and Protein Structure Prediction Problems via Multiobjective Evolutionary Algorithms. Multiobjective problem solving from nature: from concepts to applications, page 93, 2008.

[146] M. Dorn, A. Breda, and O.N. Souza. A Hybrid Method for the Protein Structure Predic- tion Problem. Lecture Notes in Computer Science, 5167:47–56, 2008.

[147] M. Sykes. Using a Genetic Algorithm to Generate Decoy Sets for Protein Structure Prediction. Not defined, 0, 0.

[148] G. Nicosia and G. Stracquadanio. Generalized pattern search algorithm for peptide struc- ture prediction. Biophysical Journal, 95(10):4988–4999, 2008.

[149] B. Jayaram, K. Bhushan, S.R. Shenoy, P. Narang, S. Bose, P. Agrawal, D. Sahu, and V. Pandey. Bhageerath: an energy based web enabled computer software suite for limiting the search space of tertiary structures of small globular proteins. Nucleic acids research, 34(21):6195, 2006.

[150] M. Dorn and O.N. de Souza. CReF: a central-residue-fragment-based method for predict- ing approximate 3-D polypeptides structures. In Proceedings of the 2008 ACM symposium on Applied computing, pages 1261–1267. ACM New York, NY, USA, 2008. Bibliography 116

[151] K.K. Koretke, Z. Luthey-Schulten, and P.G. Wolynes. Self-consistently optimized en- ergy functions for protein structure prediction by molecular dynamics. Proceedings of the National Academy of Sciences, 95(6):2932, 1998.

[152] A. Verma and W. Wenzel. Protein structure prediction by all-atom free-energy refinement. BMC Structural Biology, 7(1):12, 2007.

[153] K.T. Simons, C. Strauss, and D. Baker. Prospects for ab initio protein structural genomics. Journal of Molecular Biology, 306(5):1191–1199, 2001.

[154] T. Ishida, T. Nishimura, M. Nozaki, T. Inoue, T. Terada, S. Nakamura, and K. Shimizu. Development of an ab initio protein structure prediction system ABLE. GENOME IN- FORMATICS SERIES, pages 228–237, 2003.