Protein Folding Prediction with Genetic Algorithms
A Thesis Submitted to the Faculty
of
Department of Computer Science and Engineering National Sun Yat-sen University
by
Yi-Yao Huang
In Partial Fulfillment of the Requirements for the Degree
of
Master of Science
July 9, 2004 TABLE OF CONTENTS
Page
LIST OF FIGURES ...... 4
LIST OF TABLES ...... 5
ABSTRACT ...... 0
Chapter 1. Introduction ...... 1
Chapter 2. Preliminaries ...... 3
2.1 Amino Acids in Proteins ...... 3 2.2 Levels of Protein Structures ...... 3 2.3 The Hydrophobic-hydrophilic Model ...... 7 2.4 Genetic Algorithm ...... 8
Chapter 3. Protein Structure Prediction Methods ...... 11
3.1 Strategies of Protein Structure Prediction ...... 11 3.2 Representations of Protein Sequences on the Lattice Model...... 13 3.3 Previous PSP Methods for the HP Model ...... 15
Chapter 4. A New Method Based on the Lattice Model ...... 17
4.1 Motivation ...... 17 4.2 Overall Steps of the Algorithm ...... 17 4.3 The Fitness Function ...... 18 4.4 Example ...... 22
Chapter 5. Experimental Results ...... 27 Page
Chapter 6. Conclusions ...... 31
BIBLIOGRAPHY ...... 33
LIST OF FIGURES
Figure Page
2.1 Levels of protein structures. (a)Primary structure. (b)Secondary structure. (c)Tertiary structure...... 6
2.2 One folding conformation of a protein sequence on the 2D HP model with free energy 6. The sequence starts from “Start” and ends at “End”, where amino acids are linked by solid lines. The solid node represents “hy- drophobic” and the hollow one represents “hydrophilic”. Non-sequential hydrophobic amino acid pairs are linked by dot lines, where two amino acids of each pair are not sequential in sequence (linked by solid line) and are adjacent on the grid points of the lattice model...... 8
2.3 The procedure of the genetic algorithm...... 10
3.1 An example of absolute encoding and relative encoding...... 13
3.2 One-point mutation for relative encoding...... 14
4.1 Copying the corresponding coordinates...... 24
4.2 The folding conformation. The solid node represents “1” while the hollow one represents “0”...... 25
4.3 The smooth folding conformation...... 25
4.4 The final predicted conformation...... 26 LIST OF TABLES
Table Page
2.1 Names and abbreviations of twenty amino acids...... 4
2.2 Features of twenty amino acids. Each amino acid is classified into one of
the eight groups, where a mark “¯” indicates the amino acid is in that group. 5
4.1 Abbreviations of secondary structure...... 23
5.1 Parameters used in GA...... 27
5.2 Weights of features...... 28
5.3 Comparison of our method with the original method for target: 1LIN(146). 28
5.4 Comparison of our method with the original method for target: 1QG1(104). 29
5.5 Comparison of our method with the original method for target: 5CPV(108). 29
5.6 Comparison of our method with the original method for target: 1SHD(101). 29
5.7 Comparison of our method with method of Dovier et al...... 30 ABSTRACT
It is well known that the biological function of a protein depends on its 3D struc- ture. Therefore, solving the problem of protein structures is one of the most important works for studying proteins. However, protein structure prediction is a very challenging task because there is still no clear feature about how a protein folds to its 3D structure yet. In this thesis, we propose a genetic algorithm (GA) based on the lattice model to predict the 3D structure of an unknown protein, target protein, whose primary sequence and secondary structure elements (SSEs) are assumed known. Hydrophobic-hydrophilic model (HP model) is one of the most simplified and popular protein folding models. These models consider the hydrophobic-hydrophobic interactions of protein structures, but the results of prediction are still not encouraged enough. Therefore, we suggest that some other features should be considered, such as SSEs, charges, and disulfide bonds. That is, the fitness function of GA in our method considers not only how many hydrophobic-hydrophobic pairs there are, but also what kind of SSEs these amino acids belong to. The lattice model is in fact used to help us get a rough folding of the target protein, since we have no idea how they fold at the very beginning. We show that these additional features do improve the prediction accuracy by com- paring our prediction results with their real structures with RMSD. CHAPTER 1
Introduction
Protein folding prediction, sometimes called protein structure prediction (PSP), is one of the most important issues for understanding living organisms, because it is be- lieved that the biological function of a protein is determined by its structure, and that is why proteins can play so many different roles in cells, such as catalyzing, regulating, and transporting, and so on. Biological scientists have dedicated themselves to solving the structures of proteins by experimentations, like X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy [22, 26, 29]. Both of them, however, are time- consuming and the difficulty of crystallizing the structures is increasing due to the high complexity degree of the protein structures. As a result, many computer scientists have taken part in the research of PSPs by using computational strategies. Generally speaking, there are three strategies for solving PSPs: ab initio, homology modeling, and threading methods.
PSPs based on lattice models have been proved to be NP-complete problems [4, 8, 14, 18, 27]. Therefore, it is unfeasible to use the thorough search of one protein’s confor- mation for these problems, which is so-called the brute force algorithm. Consequently, heuristic optimization methods are considered to solve PSP problems. In recent years, there have been many heuristic methods for PSPs. And, genetic algorithm (GA) is one of the most popular strategies used by many scientists [9, 12, 20, 21, 28]. Whereas the results of traditional hydrophobic-hydrophilic models (HP models) that consider the hydrophobicity only seem not to be very satisfied, and many methods
1 that only obtain improvements in how to get better folding algorithms do not upgrade the accuracy of PSPs explicitly, we suggest that some other characteristics, besides the hydrophobicity, have to be considered [5]. In this thesis, we propose a hybrid method of homology modeling and folding strategies. We use GA as our method’s architecture and perform folding on the 3D lattice model to help us obtain a rough conformation of a target protein, because at the beginning we have no idea about how it folds into its tertiary structure even we have its primary sequence and features of each amino acid in the sequence. So, at first, we let the sequence fold on the 3D lattice and then we have the initiative global aspect of its structure from the model. Next, we consider the hydrophobicity, the charges of the side chains, the specificity of some amino acids, and most important one, the secondary structure elements (SSEs), in our fitness function of GA to improve the prediction accuracy. In the past, some methods have ever used the SSEs for improving the accuracy of PSPs [7, 11]. The rest chapters of this thesis are as follows. Chapter 2 introduces the levels of protein structures, the HP model, and the genetic algorithm (GA). In Chapter 3, we introduce strategies of protein structure prediction and survey several methods of protein structure prediction based on the HP model. In Chapter 4, we propose a new method to predict the protein tertiary structure. Chapter 5 demonstrates our experimental results on some small proteins and shows that our results,RMSD (Root Mean Square Distance) values, are better than those of the previous methods. The conclusion is given in Chapter 6.
2 CHAPTER 2
Preliminaries
2.1 Amino Acids in Proteins
There are twenty kinds of amino acids in proteins, and these amino acids have several common chemistry structures: an alpha carbon atom, an amino group, a carboxyl group, a hydrogen and the side chain. These amino acids have different side chains, or the so-called residues. Table 2.1 lists these twenty amino acids and their abbreviations. We always represent a protein sequence as a string of characters, and each character denotes one of the one-letter abbreviations. These amino acids can be classified into different groups as shown in Table 2.2.
2.2 Levels of Protein Structures
Proteins are macromolecules. They are polymers consisting of amino acids linked by covalent peptide bonds as chains. A protein may have many different conformations (three-dimensional structures). However, only one or a few of these structures, which are so-called native conformations, have biological activity. Because protein structures are so complicated, they are categorized into four levels of structures, as shown in Figure 2.1. Primary structure, the first level, of a protein is the amino acid sequence. The amino acids are covalently linked to form an ordered polypeptide chain. Then the order of amino acids can be written as a sequence. The sequence has directivity that is from the
N-terminal end to the C-terminal end.
3 Table 2.1 Names and abbreviations of twenty amino acids.
One-letter Abbreviation Three-latter Abbreviation Name 1 A Ala Alanine 2 C Cys Cysteine 3 D Asp Aspartic Acid 4 E Glu Glutamic Acid 5 F Phe Phenylalanine 6 G Gly Glycine 7 H His Histidine 8 I IlE Isoleucine 9 K Lys Lysine 10 L Leu Leucine 11 M Met Methionine 12 N Asn Asparagine 13 P Pro Proline 14 Q Gln Glutamine 15 R Arg Arginine 16 S Ser Serine 17 T Thr Threonine 18 V Val Valine 19 W Trp Tryptophan 20 Y Tyr Tyrosine
Secondary structure is the second level. The regular arrangements that can be found in proteins have repetitive hydrogen bonding between the amide N-H and the car- boxyl group of the peptide backbone. The bonds in the peptide backbone are important for structures. α helix and β sheet are two major types of secondary structure. Other random parts are called loops. The α helix is rodlike as a spiral spring and is formed by only one polypeptide chain. The helical conformation is stabilized because of the hydrogen bonds parallel to the helix axis within this backbone. Each turn of the helix has 3.6 residues and the pitch of the helix, the linear distance between the starting and the ending of one turn, is 5.4 A.û Figure 2.1(b) shows the diagram of the α helix.
4 Table 2.2 Features of twenty amino acids. Each amino acid is classified into one of the
eight groups, where a mark “¯” indicates the amino acid is in that group.
Group Name Amino Acid Hydrophobicity Acidic Basic Aliphatic Amide Aromatic Hydroxyl Imino Sulfur
Ala H ¯
Cys P ¯
Asp P ¯
Glu P ¯
Phe H ¯
Gly HorP ¯
His P ¯
IlE H ¯
Lys P ¯
Leu H ¯
Met H ¯
Asn P ¯
Pro H ¯
Gln P ¯
Arg P ¯
Ser P ¯
Thr P ¯
Val H ¯
Trp H ¯
Tyr P ¯
5 (a)
(b) α helix β sheet
parallel antiparallel
3.6 residues per turn (pitch) H-bond
C=O NH Ca
(c)
Figure 2.1 Levels of protein structures. (a)Primary structure. (b)Secondary structure. (c)Tertiary structure.
6 The β sheet may be formed by several polypeptide chains, β strands, which are ar- ranged as an two-dimensional array, and neighboring strands are hydrogen bonded. These β strands are not limited to be sequential in one protein sequence. If alternating strands run in the same direction, they are parallel. On the other hand, if they run in opposite directions, they are antiparallel. Figure 2.1(b) shows these two types of the β sheet. Tertiary structure is the further folding of secondary structures and loops. The ter- tiary structure of a protein is stabilized by the forces of hydrophobic interactions between amino acids mainly and other forces, like disulfide bonds that can be formed by cysteine’s, charge attractions, and so on. As a result of these stabilizing forces, in three dimensional structure, residues that are far in the primary sequence may be close to each other. Quaternary structure is formed by more than one polypeptide chain. Each chain is called a subunit.
2.3 The Hydrophobic-hydrophilic Model
The hydrophobic-hydrophilic model (HP model) proposed by Dill [10] is one of the most simplified and popular of protein folding models [4, 12, 20, 21, 23, 28], where H represents hydrophobic (non-polar or “water-hating”) and P represents hydrophilic (polar or “water-liking”). In the HP model, amino acids in a protein sequence were abstracted by hydrophobic (H) or by hydrophilic (P). And then, the protein sequence folds on this lattice such that each amino acid occupies on one grid point and continuative two amino acids, called sequential amino acids, in the sequence are also adjacent on the grid points of the lattice model. Besides, any two amino acids can not occupy on the same grid point. Therefore, the folding of a protein sequence is restricted on the lattice space. For one protein sequence, there may be many feasible folding arrangements of it as long as they follow the rules mentioned above.
7 0 0
0 1 1 Start 0 1 1 0 End 11 0
0 0
Figure 2.2 One folding conformation of a protein sequence on the 2D HP model with free energy 6. The sequence starts from “Start” and ends at “End”, where amino acids are linked by solid lines. The solid node represents “hydrophobic” and the hollow one represents “hydrophilic”. Non-sequential hydrophobic amino acid pairs are linked by dot lines, where two amino acids of each pair are not sequential in sequence (linked by solid line) and are adjacent on the grid points of the lattice model.
According to the law of thermodynamics, natural proteins or the so-called folded proteins are under the states of minimal free energy. For the above reason, on the HP model, the natural conformation of a protein is supposed to have the minimal free en- ergy when there are maximal non-sequential hydrophobic amino acid pairs. That is, the main provider of the free energy is defined as the interactions between hydrophobic amino acids such that the hydrophobic amino acids often form a hydrophobic core interior the global conformation surrounded by hydrophilic ones. Figure 2.2 shows one folding con- formation of a protein sequence on the 2D HP model with free energy 6, where six non- sequential hydrophobic amino acid pairs (pairs linked by dot lines) are included.
2.4 Genetic Algorithm
The genetic algorithm (GA) was invented by Holland [19]. The basis of GA is “survival of the fittest”, which was referred to Darwin’s evolutionary theory, and the evo- lution of species is dominated by recombination and mutation of gene. Therefore, the
8 idea of GA is to simulate the process as natural evolution such that individuals of the population undergo recombination and mutation to adapt the new environment. In other words, individuals with better fitness have larger chances to survive. The GAs are adaptive heuristic search methods for solving the optimization prob- lems according to the fitness functions [3,17], especially for the problems that do not have precisely-defined solving methods. The steps of GA are as follows [2, 13]. At first, the environment is considered as the problem that we intend to solve. Then a population is created randomly, which is a set of individuals (the candidate solutions of the problem).
There are variations among these individuals, and as a result, some of the individuals have better fitness than others. That is, these individuals with better fitness are better so- lutions of the problem. Next, evolution process are applied during every iteration. The process operators include recombination or the so-called crossover, where corresponding substrings between two individuals are exchanged, mutation, where one or more charac- ters of the individual are replaced by other feasible characters, and other changes on the individuals. After the evolution, more fitting individuals of the new generation present higher abilities of survival and then are selected to the new population. The evolution it- eration continues until the best solution with the highest fitness value is generated. Figure 2.3 sketches the GA. Furthermore, there are some important aspects of GAs: (1) how to define the fitness function, (2) which representation of sequences are used, and (3) what genetic operators are applied. GAs have been widely applied to many areas, such as DNA fragment assembly, multiple sequence alignment of gene and protein sequences [30], and prediction of RNA secondary structures and protein structures prediction [9, 20, 21], etc.
9 Initialize the population Select the Individual 1 individuals with Individual 2 better fitness Individual 3 score as parents
Individual n
Individual 1 Individual 2 Calculate the fitness Individual 3 score and put back the population
Individual k
Individual i
Individual j