Protein Folding Prediction with Genetic Algorithms
Total Page:16
File Type:pdf, Size:1020Kb
Protein Folding Prediction with Genetic Algorithms A Thesis Submitted to the Faculty of Department of Computer Science and Engineering National Sun Yat-sen University by Yi-Yao Huang In Partial Fulfillment of the Requirements for the Degree of Master of Science July 9, 2004 TABLE OF CONTENTS Page LIST OF FIGURES ................................. 4 LIST OF TABLES .................................. 5 ABSTRACT ..................................... 0 Chapter 1. Introduction .............................. 1 Chapter 2. Preliminaries .............................. 3 2.1 Amino Acids in Proteins . .................... 3 2.2 Levels of Protein Structures . .................... 3 2.3 The Hydrophobic-hydrophilic Model .................... 7 2.4 Genetic Algorithm . ........................... 8 Chapter 3. Protein Structure Prediction Methods ................ 11 3.1 Strategies of Protein Structure Prediction . ............. 11 3.2 Representations of Protein Sequences on the Lattice Model........ 13 3.3 Previous PSP Methods for the HP Model . ............. 15 Chapter 4. A New Method Based on the Lattice Model ............. 17 4.1 Motivation .................................. 17 4.2 Overall Steps of the Algorithm . .................... 17 4.3 The Fitness Function . ........................... 18 4.4 Example . .................................. 22 Chapter 5. Experimental Results ......................... 27 Page Chapter 6. Conclusions ............................... 31 BIBLIOGRAPHY .................................. 33 LIST OF FIGURES Figure Page 2.1 Levels of protein structures. (a)Primary structure. (b)Secondary structure. (c)Tertiary structure. ........................... 6 2.2 One folding conformation of a protein sequence on the 2D HP model with free energy 6. The sequence starts from “Start” and ends at “End”, where amino acids are linked by solid lines. The solid node represents “hy- drophobic” and the hollow one represents “hydrophilic”. Non-sequential hydrophobic amino acid pairs are linked by dot lines, where two amino acids of each pair are not sequential in sequence (linked by solid line) and are adjacent on the grid points of the lattice model. ......... 8 2.3 The procedure of the genetic algorithm. ............. 10 3.1 An example of absolute encoding and relative encoding. ......... 13 3.2 One-point mutation for relative encoding. ............. 14 4.1 Copying the corresponding coordinates. ............. 24 4.2 The folding conformation. The solid node represents “1” while the hollow one represents “0”. ........................... 25 4.3 The smooth folding conformation. .................... 25 4.4 The final predicted conformation. .................... 26 LIST OF TABLES Table Page 2.1 Names and abbreviations of twenty amino acids. ............. 4 2.2 Features of twenty amino acids. Each amino acid is classified into one of the eight groups, where a mark “¯” indicates the amino acid is in that group. 5 4.1 Abbreviations of secondary structure. ............. 23 5.1 Parameters used in GA. ........................... 27 5.2 Weights of features. ........................... 28 5.3 Comparison of our method with the original method for target: 1LIN(146). 28 5.4 Comparison of our method with the original method for target: 1QG1(104). 29 5.5 Comparison of our method with the original method for target: 5CPV(108). 29 5.6 Comparison of our method with the original method for target: 1SHD(101). 29 5.7 Comparison of our method with method of Dovier et al.. ......... 30 ABSTRACT It is well known that the biological function of a protein depends on its 3D struc- ture. Therefore, solving the problem of protein structures is one of the most important works for studying proteins. However, protein structure prediction is a very challenging task because there is still no clear feature about how a protein folds to its 3D structure yet. In this thesis, we propose a genetic algorithm (GA) based on the lattice model to predict the 3D structure of an unknown protein, target protein, whose primary sequence and secondary structure elements (SSEs) are assumed known. Hydrophobic-hydrophilic model (HP model) is one of the most simplified and popular protein folding models. These models consider the hydrophobic-hydrophobic interactions of protein structures, but the results of prediction are still not encouraged enough. Therefore, we suggest that some other features should be considered, such as SSEs, charges, and disulfide bonds. That is, the fitness function of GA in our method considers not only how many hydrophobic-hydrophobic pairs there are, but also what kind of SSEs these amino acids belong to. The lattice model is in fact used to help us get a rough folding of the target protein, since we have no idea how they fold at the very beginning. We show that these additional features do improve the prediction accuracy by com- paring our prediction results with their real structures with RMSD. CHAPTER 1 Introduction Protein folding prediction, sometimes called protein structure prediction (PSP), is one of the most important issues for understanding living organisms, because it is be- lieved that the biological function of a protein is determined by its structure, and that is why proteins can play so many different roles in cells, such as catalyzing, regulating, and transporting, and so on. Biological scientists have dedicated themselves to solving the structures of proteins by experimentations, like X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy [22, 26, 29]. Both of them, however, are time- consuming and the difficulty of crystallizing the structures is increasing due to the high complexity degree of the protein structures. As a result, many computer scientists have taken part in the research of PSPs by using computational strategies. Generally speaking, there are three strategies for solving PSPs: ab initio, homology modeling, and threading methods. PSPs based on lattice models have been proved to be NP-complete problems [4, 8, 14, 18, 27]. Therefore, it is unfeasible to use the thorough search of one protein’s confor- mation for these problems, which is so-called the brute force algorithm. Consequently, heuristic optimization methods are considered to solve PSP problems. In recent years, there have been many heuristic methods for PSPs. And, genetic algorithm (GA) is one of the most popular strategies used by many scientists [9, 12, 20, 21, 28]. Whereas the results of traditional hydrophobic-hydrophilic models (HP models) that consider the hydrophobicity only seem not to be very satisfied, and many methods 1 that only obtain improvements in how to get better folding algorithms do not upgrade the accuracy of PSPs explicitly, we suggest that some other characteristics, besides the hydrophobicity, have to be considered [5]. In this thesis, we propose a hybrid method of homology modeling and folding strategies. We use GA as our method’s architecture and perform folding on the 3D lattice model to help us obtain a rough conformation of a target protein, because at the beginning we have no idea about how it folds into its tertiary structure even we have its primary sequence and features of each amino acid in the sequence. So, at first, we let the sequence fold on the 3D lattice and then we have the initiative global aspect of its structure from the model. Next, we consider the hydrophobicity, the charges of the side chains, the specificity of some amino acids, and most important one, the secondary structure elements (SSEs), in our fitness function of GA to improve the prediction accuracy. In the past, some methods have ever used the SSEs for improving the accuracy of PSPs [7, 11]. The rest chapters of this thesis are as follows. Chapter 2 introduces the levels of protein structures, the HP model, and the genetic algorithm (GA). In Chapter 3, we introduce strategies of protein structure prediction and survey several methods of protein structure prediction based on the HP model. In Chapter 4, we propose a new method to predict the protein tertiary structure. Chapter 5 demonstrates our experimental results on some small proteins and shows that our results,RMSD (Root Mean Square Distance) values, are better than those of the previous methods. The conclusion is given in Chapter 6. 2 CHAPTER 2 Preliminaries 2.1 Amino Acids in Proteins There are twenty kinds of amino acids in proteins, and these amino acids have several common chemistry structures: an alpha carbon atom, an amino group, a carboxyl group, a hydrogen and the side chain. These amino acids have different side chains, or the so-called residues. Table 2.1 lists these twenty amino acids and their abbreviations. We always represent a protein sequence as a string of characters, and each character denotes one of the one-letter abbreviations. These amino acids can be classified into different groups as shown in Table 2.2. 2.2 Levels of Protein Structures Proteins are macromolecules. They are polymers consisting of amino acids linked by covalent peptide bonds as chains. A protein may have many different conformations (three-dimensional structures). However, only one or a few of these structures, which are so-called native conformations, have biological activity. Because protein structures are so complicated, they are categorized into four levels of structures, as shown in Figure 2.1. Primary structure, the first level, of a protein is the amino acid sequence. The amino acids are covalently linked to form an ordered polypeptide chain. Then the order of amino acids can be written as a sequence. The sequence