Protein Folding Prediction with Genetic Algorithms

A Thesis Submitted to the Faculty

of

Department of Computer Science and Engineering National Sun Yat-sen University

by

Yi-Yao Huang

In Partial Fulfillment of the Requirements for the Degree

of

Master of Science

July 9, 2004

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ABSTRACT

Chapter 1. Introduction

Chapter 2. Preliminaries

    2.1 Amino Acids in Proteins
    2.2 Levels of Protein Structures
    2.3 The Hydrophobic-hydrophilic Model
    2.4 Genetic Algorithm

Chapter 3. Protein Structure Prediction Methods

    3.1 Strategies of Protein Structure Prediction
    3.2 Representations of Protein Sequences on the Lattice Model
    3.3 Previous PSP Methods for the HP Model

Chapter 4. A New Method Based on the Lattice Model

    4.1 Motivation
    4.2 Overall Steps of the Algorithm
    4.3 The Fitness Function
    4.4 Example

Chapter 5. Experimental Results

Chapter 6. Conclusions

BIBLIOGRAPHY

LIST OF FIGURES

2.1 Levels of protein structures. (a) Primary structure. (b) Secondary structure. (c) Tertiary structure.

2.2 One folding conformation of a protein sequence on the 2D HP model with free energy 6. The sequence starts from “Start” and ends at “End”, where amino acids are linked by solid lines. The solid node represents “hydrophobic” and the hollow one represents “hydrophilic”. Non-sequential hydrophobic pairs are linked by dotted lines, where the two amino acids of each pair are not sequential in the sequence (not linked by a solid line) and are adjacent on the grid points of the lattice model.

2.3 The procedure of the genetic algorithm.

3.1 An example of absolute encoding and relative encoding.

3.2 One-point mutation for relative encoding.

4.1 Copying the corresponding coordinates.

4.2 The folding conformation. The solid node represents “1” while the hollow one represents “0”.

4.3 The smooth folding conformation.

4.4 The final predicted conformation.

LIST OF TABLES

2.1 Names and abbreviations of twenty amino acids.

2.2 Features of twenty amino acids. Each amino acid is classified into one of the eight groups, where a mark “¯” indicates the amino acid is in that group.

4.1 Abbreviations of secondary structure.

5.1 Parameters used in GA.

5.2 Weights of features.

5.3 Comparison of our method with the original method for target: 1LIN(146).

5.4 Comparison of our method with the original method for target: 1QG1(104).

5.5 Comparison of our method with the original method for target: 5CPV(108).

5.6 Comparison of our method with the original method for target: 1SHD(101).

5.7 Comparison of our method with method of Dovier et al.

ABSTRACT

It is well known that the biological function of a protein depends on its 3D structure. Therefore, determining protein structures is one of the most important tasks in the study of proteins. However, protein structure prediction is very challenging because it is still not clearly understood how a protein folds into its 3D structure. In this thesis, we propose a genetic algorithm (GA) based on the lattice model to predict the 3D structure of an unknown protein, the target protein, whose primary sequence and secondary structure elements (SSEs) are assumed to be known. The hydrophobic-hydrophilic model (HP model) is one of the simplest and most popular protein folding models. It considers only the hydrophobic-hydrophobic interactions of protein structures, and its prediction results are still not encouraging enough. Therefore, we suggest that other features should also be considered, such as SSEs, charges, and disulfide bonds. That is, the fitness function of the GA in our method considers not only how many hydrophobic-hydrophobic pairs there are, but also which kinds of SSEs these amino acids belong to. The lattice model is used to obtain a rough folding of the target protein, since we have no idea how it folds at the very beginning. We show that these additional features do improve the prediction accuracy by comparing our predicted structures with the real structures in terms of RMSD.

CHAPTER 1

Introduction

Protein folding prediction, sometimes called protein structure prediction (PSP), is one of the most important issues for understanding living organisms, because it is believed that the biological function of a protein is determined by its structure. This is why proteins can play so many different roles in cells, such as catalysis, regulation, and transport. Biologists have dedicated themselves to determining the structures of proteins experimentally, with methods such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy [22, 26, 29]. Both methods, however, are time-consuming, and crystallization becomes more difficult as the complexity of protein structures increases. As a result, many computer scientists have joined the research on PSP with computational strategies. Generally speaking, there are three strategies for solving PSP: ab initio, homology modeling, and threading methods.

PSP problems based on lattice models have been proved to be NP-complete [4, 8, 14, 18, 27]. Therefore, it is infeasible to search exhaustively over all conformations of a protein, that is, to use the so-called brute force algorithm. Consequently, heuristic optimization methods are considered for solving PSP problems. In recent years, many heuristic methods have been proposed for PSP, and the genetic algorithm (GA) is one of the most popular strategies [9, 12, 20, 21, 28].

The results of traditional hydrophobic-hydrophilic models (HP models), which consider hydrophobicity only, are not very satisfactory, and many methods that only improve the folding algorithm do not clearly improve the accuracy of PSP. We therefore suggest that other characteristics, besides hydrophobicity, have to be considered [5]. In this thesis, we propose a hybrid method of homology modeling and folding strategies. We use GA as the architecture of our method and perform folding on the 3D lattice model to obtain a rough conformation of a target protein, because at the beginning we have no idea how it folds into its tertiary structure even if we have its primary sequence and the features of each amino acid in the sequence. So, at first, we let the sequence fold on the 3D lattice, which gives us an initial global view of its structure. Next, we consider the hydrophobicity, the charges of the side chains, the specificity of some amino acids, and, most importantly, the secondary structure elements (SSEs) in the fitness function of the GA to improve the prediction accuracy. Some previous methods have also used SSEs to improve the accuracy of PSP [7, 11].

The rest of this thesis is organized as follows. Chapter 2 introduces the levels of protein structures, the HP model, and the genetic algorithm (GA). In Chapter 3, we introduce strategies of protein structure prediction and survey several methods of protein structure prediction based on the HP model. In Chapter 4, we propose a new method to predict the protein tertiary structure. Chapter 5 presents our experimental results on some small proteins and shows that our results, measured by RMSD (Root Mean Square Distance) values, are better than those of the previous methods. The conclusion is given in Chapter 6.

CHAPTER 2

Preliminaries

2.1 Amino Acids in Proteins

There are twenty kinds of amino acids in proteins, and these amino acids share a common chemical structure: an alpha carbon atom, an amino group, a carboxyl group, a hydrogen atom, and a side chain. These amino acids differ in their side chains, or the so-called residues. Table 2.1 lists these twenty amino acids and their abbreviations. We always represent a protein sequence as a string of characters, where each character is the one-letter abbreviation of an amino acid. These amino acids can be classified into different groups as shown in Table 2.2.

2.2 Levels of Protein Structures

Proteins are macromolecules. They are polymers consisting of amino acids linked into chains by covalent peptide bonds. A protein may have many different conformations (three-dimensional structures). However, only one or a few of these structures, the so-called native conformations, have biological activity. Because protein structures are so complicated, they are categorized into four levels of structure, as shown in Figure 2.1.

Primary structure, the first level, is the amino acid sequence of a protein. The amino acids are covalently linked to form an ordered polypeptide chain, so the order of amino acids can be written as a sequence. The sequence has a direction, from the N-terminal end to the C-terminal end.

Table 2.1 Names and abbreviations of twenty amino acids.

No.  One-letter Abbreviation  Three-letter Abbreviation  Name
1    A                        Ala                        Alanine
2    C                        Cys                        Cysteine
3    D                        Asp                        Aspartic Acid
4    E                        Glu                        Glutamic Acid
5    F                        Phe                        Phenylalanine
6    G                        Gly                        Glycine
7    H                        His                        Histidine
8    I                        Ile                        Isoleucine
9    K                        Lys                        Lysine
10   L                        Leu                        Leucine
11   M                        Met                        Methionine
12   N                        Asn                        Asparagine
13   P                        Pro                        Proline
14   Q                        Gln                        Glutamine
15   R                        Arg                        Arginine
16   S                        Ser                        Serine
17   T                        Thr                        Threonine
18   V                        Val                        Valine
19   W                        Trp                        Tryptophan
20   Y                        Tyr                        Tyrosine

Secondary structure is the second level. The regular arrangements found in proteins have repetitive hydrogen bonding between the amide N-H and the carbonyl (C=O) group of the peptide backbone. The bonds in the peptide backbone are important for structures. The α helix and the β sheet are the two major types of secondary structure. Other, irregular parts are called loops. The α helix is rodlike, like a spring, and is formed by only one polypeptide chain. The helical conformation is stabilized by the hydrogen bonds within the backbone, which run parallel to the helix axis. Each turn of the helix has 3.6 residues, and the pitch of the helix, the linear distance between the start and the end of one turn, is 5.4 Å. Figure 2.1(b) shows the diagram of the α helix.

Table 2.2 Features of twenty amino acids. Each amino acid is classified into one of the eight groups, where a mark “¯” indicates the amino acid is in that group.

Group Name: Acidic, Basic, Aliphatic, Amide, Aromatic, Hydroxyl, Imino, Sulfur

Amino Acid  Hydrophobicity  Group
Ala         H               ¯
Cys         P               ¯
Asp         P               ¯
Glu         P               ¯
Phe         H               ¯
Gly         H or P          ¯
His         P               ¯
Ile         H               ¯
Lys         P               ¯
Leu         H               ¯
Met         H               ¯
Asn         P               ¯
Pro         H               ¯
Gln         P               ¯
Arg         P               ¯
Ser         P               ¯
Thr         P               ¯
Val         H               ¯
Trp         H               ¯
Tyr         P               ¯

Figure 2.1 Levels of protein structures. (a) Primary structure. (b) Secondary structure (α helix, with 3.6 residues per turn; parallel and antiparallel β sheets). (c) Tertiary structure.

The β sheet may be formed by several polypeptide chains, the β strands, which are arranged as a two-dimensional array, with neighboring strands hydrogen bonded to each other. These β strands need not be sequential in one protein sequence. If neighboring strands run in the same direction, they are parallel. On the other hand, if they run in opposite directions, they are antiparallel. Figure 2.1(b) shows these two types of β sheet.

Tertiary structure is the further folding of secondary structures and loops. The tertiary structure of a protein is stabilized mainly by the hydrophobic interactions between amino acids, together with other forces, such as disulfide bonds formed between cysteines, charge attractions, and so on. As a result of these stabilizing forces, residues that are far apart in the primary sequence may be close to each other in the three-dimensional structure.

Quaternary structure is formed by more than one polypeptide chain. Each chain is called a subunit.

2.3 The Hydrophobic-hydrophilic Model

The hydrophobic-hydrophilic model (HP model) proposed by Dill [10] is one of the simplest and most popular protein folding models [4, 12, 20, 21, 23, 28], where H represents hydrophobic (non-polar or “water-hating”) and P represents hydrophilic (polar or “water-liking”). In the HP model, each amino acid in a protein sequence is abstracted as either hydrophobic (H) or hydrophilic (P). The protein sequence then folds on a lattice such that each amino acid occupies one grid point and any two consecutive amino acids in the sequence, called sequential amino acids, are also adjacent on the grid points of the lattice model. Besides, no two amino acids can occupy the same grid point. Therefore, the folding of a protein sequence is restricted to the lattice space. For one protein sequence, there may be many feasible foldings, as long as they follow the rules mentioned above.

Figure 2.2 One folding conformation of a protein sequence on the 2D HP model with free energy 6. The sequence starts from “Start” and ends at “End”, where amino acids are linked by solid lines. The solid node represents “hydrophobic” and the hollow one represents “hydrophilic”. Non-sequential hydrophobic amino acid pairs are linked by dotted lines, where the two amino acids of each pair are not sequential in the sequence (not linked by a solid line) and are adjacent on the grid points of the lattice model.

According to the laws of thermodynamics, natural proteins, or the so-called folded proteins, are in the state of minimal free energy. For this reason, on the HP model, the natural conformation of a protein is supposed to have the minimal free energy when the number of non-sequential hydrophobic amino acid pairs is maximal. That is, the main contribution to the free energy is defined as the interactions between hydrophobic amino acids, so that the hydrophobic amino acids often form a hydrophobic core in the interior of the global conformation, surrounded by hydrophilic ones. Figure 2.2 shows one folding conformation of a protein sequence on the 2D HP model with free energy 6, where six non-sequential hydrophobic amino acid pairs (pairs linked by dotted lines) are included.
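The counting rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `hp_contacts`, the use of integer (x, y) tuples for grid points, and the validity checks are assumptions made here, not part of the thesis.

```python
from typing import List, Tuple

def hp_contacts(hp: str, coords: List[Tuple[int, int]]) -> int:
    """Count non-sequential H-H pairs that are adjacent on a 2D lattice."""
    if len(hp) != len(coords):
        raise ValueError("one coordinate is needed per amino acid")
    if len(set(coords)) != len(coords):
        raise ValueError("two amino acids occupy the same grid point")
    for a, b in zip(coords, coords[1:]):
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) != 1:
            raise ValueError("sequential amino acids must be lattice neighbours")

    index_at = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if hp[i] != "H":
            continue
        # Look only right and up, so every neighbouring pair is visited once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and hp[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts
```

For example, the sequence HPPH folded into a unit square has exactly one non-sequential hydrophobic pair (its first and last amino acids), while the same sequence laid out in a straight line has none.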

2.4 Genetic Algorithm

The genetic algorithm (GA) was invented by Holland [19]. The basis of GA is “survival of the fittest”, which refers to Darwin’s theory of evolution, in which the evolution of species is driven by recombination and mutation of genes. Therefore, the idea of GA is to simulate the process of natural evolution, such that individuals of the population undergo recombination and mutation to adapt to a new environment. In other words, individuals with better fitness have larger chances to survive.

GAs are adaptive heuristic search methods for solving optimization problems according to fitness functions [3, 17], especially problems that do not have precisely defined solution methods. The steps of GA are as follows [2, 13]. At first, the environment is considered as the problem that we intend to solve. Then a population, which is a set of individuals (the candidate solutions of the problem), is created randomly. There are variations among these individuals, and as a result, some of the individuals have better fitness than others. That is, the individuals with better fitness are better solutions of the problem. Next, the evolution process is applied in every iteration. The operators include recombination, or the so-called crossover, where corresponding substrings of two individuals are exchanged; mutation, where one or more characters of an individual are replaced by other feasible characters; and other changes to the individuals. After the evolution, the fitter individuals of the new generation have higher chances of survival and are selected into the new population. The evolution iterations continue until the best solution, with the highest fitness value, is generated. Figure 2.3 sketches the GA.

Furthermore, there are some important aspects of GAs: (1) how the fitness function is defined, (2) which representation of sequences is used, and (3) which genetic operators are applied. GAs have been widely applied to many areas, such as DNA fragment assembly, multiple sequence alignment of gene and protein sequences [30], prediction of RNA secondary structures, and protein structure prediction [9, 20, 21].
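The loop described above can be sketched as follows. This is a minimal, generic illustration and not the configuration used in this thesis: the function name `run_ga`, the truncation selection that keeps the better half as parents, the fixed number of generations, and all default parameter values are assumptions chosen for brevity.

```python
import random
from typing import Callable

def run_ga(fitness: Callable[[str], float], length: int, alphabet: str = "LRF",
           pop_size: int = 30, generations: int = 50,
           mutation_rate: float = 0.05, seed: int = 0) -> str:
    """Evolve strings over `alphabet` and return the fittest one found."""
    rng = random.Random(seed)
    # Random initial population of candidate solutions.
    population = ["".join(rng.choice(alphabet) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the better half survives and serves as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)        # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: each character may be replaced by a random one.
            child = "".join(rng.choice(alphabet) if rng.random() < mutation_rate else c
                            for c in child)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```

A call such as `run_ga(lambda s: s.count("L"), length=8)` evolves strings of eight moves toward a toy objective; for structure prediction, the fitness function would score the conformation encoded by each string instead.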

Figure 2.3 The procedure of the genetic algorithm. (Initialize the population; select the individuals with better fitness scores as parents; randomly choose individuals; perform the evolution processes, crossover and mutation, on the individuals; calculate the fitness scores and put the individuals back into the population.)

CHAPTER 3

Protein Structure Prediction Methods

3.1 Strategies of Protein Structure Prediction

Although sequence alignment can reveal the relationship between a protein sequence and the sequences of known protein families, and thereby give information about biological evolution, 3D protein structures are more helpful clues for understanding protein functions. A protein sequence is cheap to obtain experimentally, whereas its 3D structure may be expensive or even impossible to determine by experiments; in such cases we would like to predict the structure. In general, protein structure prediction methods are classified into three strategies: homology modeling, threading, and ab initio methods [15].

1. Homology modeling. In a literal sense, homology modeling methods perform prediction by finding a template among proteins of known structure to determine the rough structure of the target protein. The homology is based on high sequence similarity, membership in the same family, or other criteria. Previous research showed that prediction of protein structures based on homology modeling is more successful than prediction with the other two methods if a good template with a highly homologous structure can be found. The standard procedure of homology modeling is as follows:

    1. For a target protein sequence, whose structure is unknown, search for template proteins whose structures are known and produce the global sequence alignment of the target protein sequence and the template sequence(s).

    2. Take the backbone of one template structure as a rough model of the target protein.

    3. Use a loop-modeling procedure to improve the regions that contain gaps in either the target protein sequence or the template sequence.

    4. Put the side chains back onto the backbone and optimize them.

    5. Optimize the structure so that it has minimal free energy and is in its most stable state.

2. Threading. Threading methods take the target protein sequence and compute some model structures based on a set of known structures. They then evaluate the models and determine the most probable one for the target protein. However, the results of prediction with threading methods are often not very good. Too many features may need to be considered, so the problem becomes relatively complicated when the models are computed. Nevertheless, threading is more successful for remote homologies that cannot be detected by sequence alignment.

3. Ab initio. Ab initio methods perform prediction based only on protein sequences. That is, they do not refer to known structures. They often take a long time to search for better solutions for target proteins. Thus, the search algorithm is critical. For example, the Monte Carlo sampling method is one famous method of this kind.

Figure 3.1 An example of absolute encoding and relative encoding.

3.2 Representations of Protein Sequences on the Lattice Model

There are several common ways to represent a sequence on a lattice, such as Cartesian coordinates, internal coordinates, and a distance matrix. In Cartesian coordinates, the location of every character is specified independently. In internal coordinates, a sequence is represented as a string of moving steps on the lattice from one character to the next. In contrast to these two, a distance matrix describes the inter-character distances of the sequence. Proteins are often represented on the lattice model with internal coordinates, for which there are two kinds of representation: absolute encoding and relative encoding.

Figure 3.2 One-point mutation for relative encoding. Panels (a), (b), and (c).

For absolute encoding, a protein sequence is encoded as a string of characters of absolute directions. On the 3D lattice, the directions include U, D, L, R, F and B, corresponding to the Up, Down, Left, Right, Forward and Backward moving directions, respectively. Thus, a sequence of length n is encoded as a string in $\{U, D, L, R, F, B\}^{n-1}$. Take Figure 3.1 for example. When using absolute encoding, the protein sequence of length 12 is encoded as DRURRULULDL.

In contrast with absolute encoding, the relative encoding of a protein sequence is a string in $\{U, D, L, R, F\}^{n-2}$, where each direction is defined as the move relative to the previous direction. Therefore, all conformations are at least one-step self-avoiding because there is no backward move. For the same folding in Figure 3.1, the representation of relative encoding is (F)LLRFLLRLLR.
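To make the absolute encoding concrete, the following sketch decodes a direction string into lattice coordinates and checks whether the folding is self-avoiding. The axis assignment in `STEP` (L/R along x, F/B along y, U/D along z), the starting point at the origin, and the function names are assumptions made for illustration only.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

# Assumed axis convention: L/R move along x, F/B along y, U/D along z.
STEP = {
    "U": (0, 0, 1), "D": (0, 0, -1),
    "L": (-1, 0, 0), "R": (1, 0, 0),
    "F": (0, 1, 0), "B": (0, -1, 0),
}

def decode_absolute(moves: str) -> List[Coord]:
    """Turn n-1 absolute moves into the n grid points of the folding."""
    position: Coord = (0, 0, 0)
    coords = [position]
    for move in moves:
        dx, dy, dz = STEP[move]
        position = (position[0] + dx, position[1] + dy, position[2] + dz)
        coords.append(position)
    return coords

def is_self_avoiding(coords: List[Coord]) -> bool:
    """A folding is feasible only if no grid point is used twice."""
    return len(set(coords)) == len(coords)
```

Decoding the string DRURRULULDL from the text yields 12 distinct grid points, one per amino acid, whereas a string such as UD immediately steps back onto its starting point and is therefore infeasible.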

Furthermore, it is worth observing that absolute encoding and relative encoding behave significantly differently when genetic operators are applied in the GA approach. Figure 3.2 shows the effect of a one-point mutation on these two representations. In Figure 3.2 (a), the relative encoding is (F)LLRLRRFLLR while the absolute encoding is RULULURRULU. The one-point mutation for relative encoding occurs on the bold-faced R (grey square box), the fifth relative direction, which corresponds to the bold-faced U, the sixth direction, in the absolute encoding. Thus, the mutated sequence in relative encoding is either (F)LLRLFRFLLR or (F)LLRLLRFLLR, as shown in Figure 3.2 (b) or (c), respectively. From Figure 3.2, we can observe that after a one-point mutation in the relative encoding, the structure is rotated at the mutated point. However, the absolute encodings for the cases of Figure 3.2 (b) and (c) are RULULLUULDL and RULULDLLDRD, respectively, where every direction from the sixth one onward is “mutated”. That is, a rotation at one point changes the absolute directions of all the remaining amino acids after the mutated point.
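The relationship between the two encodings in the 2D examples above can be checked with a short conversion routine. This is a sketch under stated assumptions: it handles only the planar moves L, R, and F, the name `relative_to_absolute` is hypothetical, and the turning convention (L is a counter-clockwise turn of the current heading) was chosen because it reproduces the strings quoted in the text.

```python
HEADINGS = "URDL"  # clockwise order: Up, Right, Down, Left

def relative_to_absolute(relative: str, first: str = "R") -> str:
    """Convert 2D relative moves (L, R, F) into absolute moves.

    `first` is the absolute direction of the initial step, written as (F)
    in the relative encoding.
    """
    heading = HEADINGS.index(first)
    absolute = [first]
    for move in relative:
        if move == "L":
            heading = (heading - 1) % 4   # turn counter-clockwise
        elif move == "R":
            heading = (heading + 1) % 4   # turn clockwise
        elif move != "F":
            raise ValueError(f"unknown relative move: {move!r}")
        absolute.append(HEADINGS[heading])
    return "".join(absolute)
```

With this convention, (F)LLRLRRFLLR decodes to RULULURRULU, and changing only its fifth relative move to F or L gives RULULLUULDL or RULULDLLDRD: a single change in the relative string alters every absolute direction after the mutated point.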

3.3 Previous PSP Methods for the HP Model

In 1993, Unger and Moult [28] applied a basic GA to the PSP problem with absolute encoding on the 2D and 3D lattices. They only considered feasible conformations, that is, self-avoiding foldings on the lattice, and each evolutionary operation was applied repeatedly until a feasible conformation was generated.

Patton et al. [24] described a GA with relative encoding in 1995. They used penalty functions for infeasible conformations. For example, a conformation is penalized if two or more amino acids occupy the same grid point. Besides, a hydrophobic-hydrophobic interaction is not counted when one or both of the two hydrophobic amino acids share a grid point with another amino acid. They showed that their method obtains better results than the previous method on the same test sets.

In 1999, Krasnogor et al. [21] considered three factors that may affect the performance of GAs for solving PSP problems. First, the commonly used representations for these kinds of problems, such as absolute and relative encodings, were evaluated, and the equivalences between different operators under these two representations were presented. Second, they proposed a new method to formulate the energy potential for the HP model. They relied on a distance-dependent potential to preserve the distinction between conformations with the same number of hydrophobic-hydrophobic interactions. Third, they presented the use of penalty methods for the self-avoiding constraints. Their experimental results showed that the relative encoding performs better.

In 2003, a combined method for protein folding simulations was presented by Jiang et al. [20]. They combined tabu search [16] and the genetic algorithm to retain the advantages of both. Tabu search is able to avoid becoming trapped in local optima and to keep searching for a near-optimal solution. They demonstrated that tabu search used in the crossover operator works well for the folding problem based on the HP model and that the results of this combined method are better than those of methods with genetic algorithms alone.

CHAPTER 4

A New Method Based on the Lattice Model

4.1 Motivation

Actually, protein folding based on the HP model does not work very well and cannot predict the structure of a target protein successfully. When GA is used to solve this problem, the fitness function should be well defined. For the HP model, the fitness function considers hydrophobicity only. Though the hydrophobicity of amino acids does provide the energy and the force for a protein to fold, there are other characteristics that affect the conformation. More importantly, on the way from a sequence to a 3D structure, several segments of the polypeptide chain first form regular structure segments, the secondary structure elements (SSEs), and then these SSEs fold further into a 3D structure. Besides, the secondary structure elements keep their basic conformations in the stable structure. That is, while folding the sequence, we suggest that the secondary structure elements should be kept and considered. At the same time, some other characteristics, such as side chain charges and disulfide bonds, should be considered, too.

4.2 Overall Steps of the Algorithm

The overall steps of the algorithm are given as follows.

Input: A target protein sequence S1, whose secondary structure is known but whose tertiary structure is unknown.

Output: The folding conformation of S1.

Step 1: Perform sequence alignment between S1 and each sequence in a database, such as PDB, to find one template sequence S2 that has the highest sequence similarity with S1. If more than one sequence has the highest sequence similarity with S1, randomly choose one of them as S2.

Step 2: Find the structurally conserved regions, which have 50% or higher sequence similarity and a positive sequence alignment score. Copy the coordinates of the structurally conserved regions, except gaps, from the template structure S2 to the target protein structure S1.

Step 3: For the remaining regions, called R-regions, that are not structurally conserved regions, apply the folding algorithm on the 3D lattice model with GA to predict their folding conformations.

Step 4: For each R-region Ri, smooth the folding conformation so that it is feasible as a real structure. Then search for segments with the same length as Ri in the set of proteins of known structure and align the folding structure with these segments of known structure [6]. Copy the coordinates from the most similar protein structure segment, that is, the one that gets the highest score.

Step 5: Construct the complete protein structure model of S1.

4.3 The Fitness Function

We define the fitness score function with the characteristics mentioned above. Some methods worked on the distribution and partners of β sheets [1, 25]. The purpose of these methods is to offer information about secondary structures that may improve protein structure prediction, especially for the portions of β sheets. In this view, we propose a new lattice model that considers these features, especially the secondary structures, when calculating the fitness scores in the GA. Each remaining region contains one or more elements of secondary structure. Each element may be an α helix, a β strand, or a loop. The scoring aspects are as follows:

1. Hydrophobic-hydrophobic interactions.

The hydrophobic-hydrophobic interactions are roughly the same as in the HP model, except under some conditions. For the HP model, a hydrophobic-hydrophobic interaction occurs between two amino acids that are not consecutive in sequence but are adjacent to each other on the grid points of the lattice model. However, we suggest that the interaction should be restricted to the pairs of amino acids belonging to the same secondary structure element.

The interactions between the amino acids in an α helix should occur only for amino acids that are three or four positions apart in the sequence. This is because there are 3.6 residues per turn, so amino acids at such a distance are located on the same side of the helix and have chances to interact. Thus, for each residue $A_j$ in an α helix segment, whenever a lattice-adjacent amino acid also belongs to this segment, we hope that it is $A_{j-3}$, $A_{j-4}$, $A_{j+3}$, or $A_{j+4}$, and then the fitness score is set higher.

On the other hand, for a β sheet, the interactions should occur only between amino acids that belong to different β strands and should not occur between amino acids within the same β strand. Roughly, each strand of a β sheet is straight and not bent too much. Therefore, interactions between amino acids belonging to one strand rarely occur.

For the above reasons, interactions under these conditions are invalid and are penalized. That is, such a pair gets a negative fitness score.

Then, we define the fitness function of this parameter as follows:

$$S_1 = (\text{\# of H-H interactions}) + Pen_H \times (\text{\# of invalid pairs})$$

where $Pen_H$ is the penalty of invalid pairs; since invalid pairs receive a negative fitness score, $Pen_H$ takes a negative value.

2. Charges of side chains.

For two adjacent amino acids on the grid points of the lattice model, if they have opposite charges, that is, one is a positively charged residue and the other is a negatively charged one, there is attraction between this pair of amino acids, and it gets a positive fitness score. On the other hand, residues with the same charge should be located far from each other. For two adjacent amino acids on the grid points of the lattice model, if they have the same charge, there is repulsion between this pair of amino acids, and it gets a negative fitness score.

$$S_2 = Bon_C \times (\text{\# of different charge pairs} - \text{\# of same charge pairs})$$

where $Bon_C$ is the bonus of charge pairs.

3. The segment conformation of α helix.

As described above, in an α helix, there are 3.6 residues per turn. Thus the folding conformation of an α helix segment is regular, and each pitch of it should have the same length. Therefore, we define the pitch $P_k$ of an α helix as the span from $A_k$ to $A_{k+4}$. Next, we take the length of pitch $P_1$ as the standard pitch length $l$, and we denote the length of pitch $P_k$ by $l_k$, for $k = 1, 2, 3, \ldots$. Then we hope that the difference between each $l_k$ and $l$ is not too large. The total penalty of the fitness score of this α helix, summed over its $k$ pitches, is defined as follows:

$$S_3 = \begin{cases} \sum_{i=1}^{k} |l_i - l| \times Pen_A & \text{if the element is an } \alpha \text{ helix} \\ 0 & \text{otherwise} \end{cases}$$

where $Pen_A$ is the penalty of deviations.

4. The segment conformation of β sheet.

For the reason mentioned before, we also hope that the folding conformation of each β strand tends to be straight and, in most conditions, close strands can be formed as a β sheet, parallel or antiparallel. Assume the length, number of amino

acids, of one β strand is l and A1 (Al) is the first (last) residue of this β strand. And

 we define the distance from A1 to Al on the lattice to be str. We hope l str is not

too large either. Thus, the best conformation of this β strand is that str l.

Therefore, the penalty of fitness score of this β strand is defined as follows:



\[ S_4 = \begin{cases} (l - str) \times PenB & \text{if the element is a } \beta \text{ strand} \\ 0 & \text{otherwise} \end{cases} \]

where PenB is the penalty of deviations.

5. The strong strength of disulfide bonds.

For a protein structure, the disulfide bond may be one of the strongest forces shaping the conformation. Though only two to three amino acids contain sulfur (S) and disulfide bonds are not very common, once such a bond forms, it is very strong and has a large effect on the structure. Thus, for two adjacent amino acids on the grid points of the lattice model, if a disulfide bond can be formed between them, they get a larger positive fitness score.

\[ S_5 = BonS \times (\#\text{ of disulfide bonds}) \]

where BonS is the bonus of disulfide bonds.
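The scores of features 2 to 5 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this thesis: a conformation is assumed to be a list of integer lattice coordinates (one per residue), only non-sequential residue pairs are counted as contacts, distances are Euclidean, each cysteine joins at most one disulfide bond, and the charge classes in POSITIVE and NEGATIVE are an assumed assignment. The function names are hypothetical, and the default weights follow Table 5.2. The functions return the values exactly as the formulas are written; how the penalty terms enter the total with respect to sign is not decided here.

```python
import math
from itertools import combinations

# Assumed charge classes at neutral pH (not taken from the thesis).
POSITIVE, NEGATIVE = set("KRH"), set("DE")

def adjacent(p, q):
    """True when two lattice points are exactly one grid step apart."""
    return sum(abs(a - b) for a, b in zip(p, q)) == 1

def contact_pairs(coords):
    """Index pairs that touch on the lattice but are not chain neighbours."""
    return [(i, j) for i, j in combinations(range(len(coords)), 2)
            if j - i > 1 and adjacent(coords[i], coords[j])]

def charge(residue):
    """Return +1, -1, or 0 for a one-letter amino acid code."""
    return 1 if residue in POSITIVE else -1 if residue in NEGATIVE else 0

def s2_charges(seq, coords, bon_c=1):
    """BonC * (# opposite-charge pairs - # same-charge pairs)."""
    different = same = 0
    for i, j in contact_pairs(coords):
        product = charge(seq[i]) * charge(seq[j])
        if product < 0:
            different += 1
        elif product > 0:
            same += 1
    return bon_c * (different - same)

def s3_helix(coords, pen_a=10):
    """PenA * sum over k of |l - l_k|, where l_k = distance(A_k, A_(k+4))."""
    pitches = [math.dist(coords[k], coords[k + 4])
               for k in range(len(coords) - 4)]
    return pen_a * sum(abs(pitches[0] - lk) for lk in pitches) if pitches else 0.0

def s4_strand(coords, pen_b=10):
    """PenB * (l - str), where str is the end-to-end distance of the strand."""
    return pen_b * (len(coords) - math.dist(coords[0], coords[-1]))

def s5_disulfide(seq, coords, bon_s=5):
    """BonS * (# cysteine pairs in contact, each cysteine used at most once)."""
    used, bonds = set(), 0
    for i, j in contact_pairs(coords):
        if seq[i] == seq[j] == "C" and i not in used and j not in used:
            used.update((i, j))
            bonds += 1
    return bon_s * bonds
```

Under this literal reading of $S_4$, a perfectly straight strand of $l$ residues spans $l - 1$ lattice steps, so the sketch returns PenB rather than 0 for it; whether $str$ is meant to be counted in residues or in steps is an open assumption.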

Therefore, the fitness score function is as follows:

\[ F = \sum_{i=1}^{n} R_i \]

where $R_i$ is the fitness score of the $i$th remaining region,

\[ R_i = \sum_{j=1}^{m} E_{ij} \]

where $E_{ij}$ is the fitness score of the $j$th element of $R_i$, and

\[ E_{ij} = \sum_{k=1}^{5} S_{ijk} \]

where $S_{ijk}$ is the $S_k$ score of the $j$th element of $R_i$.
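The three nested sums can be sketched in Python as follows. The data layout is an assumption made for illustration only: a list of remaining regions, each region a list of elements, and each element a list of its five feature scores $S_1$ to $S_5$. The function names are hypothetical.

```python
def element_score(scores):
    """E_ij: the sum of the five feature scores S_ij1..S_ij5 of one element."""
    return sum(scores)

def region_score(elements):
    """R_i: the sum of E_ij over the elements of one remaining region."""
    return sum(element_score(scores) for scores in elements)

def fitness(regions):
    """F: the sum of R_i over all remaining regions."""
    return sum(region_score(elements) for elements in regions)

# Two regions; the first has two elements, the second has one.
example = [
    [[1, 2, 0, 0, 5], [0, -1, 0, 3, 0]],
    [[2, 0, 0, 0, 0]],
]
print(fitness(example))
```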

4.4 Example

We take an example to illustrate the algorithm step by step. Assume the target protein sequence is S1:

S1 = SSKCSRLKTFPQNLVQACVYHK

and its secondary structure is Ss:

Ss = ----SS--HHHHHHHH-SSTT-

The abbreviations of secondary structure elements are listed in Table 4.1 [?]. Our goal is to predict the folding structure of S1. First, we perform sequence alignment with PAM-250 as the score matrix to find one protein sequence that has high sequence similarity to S1. Suppose S2 is the template sequence.

Table 4.1 Abbreviations of secondary structure.

Abbreviation   Representation
H              α helix
B              residue in isolated β bridge
E              extended β strand
G              3-10 helix
I              pi helix
T              hydrogen bonded turn
S              bend

S2 = SVYCSSLACSDHN

The alignment result is as follows:

    S1  SSKCSRLKTFPQNLVQACVYHK
        |--||.|---------||--|.
    S2  SVYCSSL---------ACSDHN

where | represents that residues are identical, . represents that they are similar, and - represents a gap.

Next, we find the structurally conserved regions from the alignment:

    SSKCSRL     KTFPQNLVQ   ACVYHK
    |--||.|     ---------   ||--|.
    SVYCSSL     ---------   ACSDHN
    Conserved   Remaining   Conserved
    Region      Region      Region
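The split into conserved and remaining regions can be sketched in Python as follows. This is a simplified illustration, not the procedure of this thesis: it assumes the target row of the alignment contains no gaps, treats every maximal run of gaps in the template row as a remaining region, and writes the template row with nine gap characters so that both rows have 22 columns. The function name and the `flank` parameter are hypothetical; `flank=1` extends the region by one residue on each side, which reproduces the segment LKTFPQNLVQA used as R1 in this example.

```python
def remaining_regions(target_aln, template_aln, flank=0):
    """Return target segments that face a run of gaps in the template row."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned rows must have equal length")
    regions = []
    start = None
    # The sentinel "X" closes a gap run that reaches the end of the row.
    for col, ch in enumerate(template_aln + "X"):
        if ch == "-" and start is None:
            start = col                      # a gap run begins here
        elif ch != "-" and start is not None:
            lo = max(0, start - flank)       # extend left by `flank` residues
            hi = min(len(target_aln), col + flank)
            regions.append(target_aln[lo:hi])
            start = None
    return regions

target = "SSKCSRLKTFPQNLVQACVYHK"
template = "SVYCSSL---------ACSDHN"
print(remaining_regions(target, template))           # gap columns only
print(remaining_regions(target, template, flank=1))  # with flanking residues
```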

We then copy the coordinates of these conserved regions to the corresponding regions of the target protein, as shown in Figure 4.1.

Then for the remaining region, R1, we translate the residues to 1 or 0 according to their hydrophobicity, where 1 represents “hydrophobic” and 0 represents “hydrophilic”.

R2 is the sequence of hydrophobicity of R1.

Figure 4.1 Copying the corresponding coordinates.

R1 = LKTFPQNLVQA

R2 = 10011001101

Besides, its corresponding secondary structure segment, R3, is shown as follows:

R1 = LKTFPQNLVQA

R3 = --HHHHHHHH-
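The derivation of R2 and R3 from R1 can be sketched in Python as follows. The set of hydrophobic residues in HYDROPHOBIC is an assumption chosen for illustration (it is consistent with this example but is not necessarily the classification used in this thesis), and both function names are hypothetical.

```python
# Assumed hydrophobic residues (one-letter codes); all others count as hydrophilic.
HYDROPHOBIC = set("AVLIMFWP")

def to_hp(segment):
    """Translate residues to '1' (hydrophobic) or '0' (hydrophilic)."""
    return "".join("1" if residue in HYDROPHOBIC else "0" for residue in segment)

def secondary_of(target, secondary, segment):
    """Cut out the secondary structure string that lies under `segment`."""
    start = target.find(segment)
    if start < 0:
        raise ValueError("segment does not occur in the target sequence")
    return secondary[start:start + len(segment)]

s1 = "SSKCSRLKTFPQNLVQACVYHK"
ss = "----SS--HHHHHHHH-SSTT-"
r1 = "LKTFPQNLVQA"
print(to_hp(r1))                 # the hydrophobicity string R2
print(secondary_of(s1, ss, r1))  # the secondary structure segment R3
```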

Then, we run the GA-based folding algorithm to predict its possible folding conformation. Figure 4.2 shows an example of a folding conformation of R1. After obtaining the conformation, we smooth it as shown in Figure 4.3 and then find the segment that is most similar to R1, based on structure alignment with curve alignment [6].

Figure 4.2 The folding conformation. The solid node represents “1” while the hollow one represents “0”.

Figure 4.3 The smooth folding conformation.

Figure 4.4 The final predicted conformation.

Finally, we copy the coordinates of this segment and combine it with the conserved segments to form the final predicted conformation. Figure 4.4 shows the final predicted result.

CHAPTER 5

Experimental Results

In this chapter, we show some experimental results by comparing our model with the folding method based on the original HP model [6]. Most of the results show that the additional characteristics we use do improve the predicted structures.

For the GA, several parameters need to be defined: the number of generations, the initial population size, the crossover rate, and the mutation rate. Table 5.1 lists the parameters used in the GA. In addition, we define some coefficients in the fitness function to adjust the fitness values of the features, as shown in Table 5.2.

Table 5.3 shows that we obtain fair improvements in the RMSDs, even though the results are still not very good. Here, we notice that the first template protein, 1CFD, is not a good template for the target protein 1LIN: the two have 100% sequence similarity, but their real structures are quite different. Thus, the lack of good template proteins may be one of the factors that affect the prediction results.

Table 5.1 Parameters used in GA.

Parameter         Value
Generation        500
Population Size   100
Crossover Rate    0.8
Mutation Rate     0.05
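To illustrate how the parameters of Table 5.1 enter a GA run, the following Python sketch folds a hydrophobicity string on the 2D square lattice. It is a generic illustration under several assumptions and is not the algorithm of this thesis: conformations use absolute encoding (one of U, D, L, R per bond), the fitness counts only non-sequential 1-1 contacts as in the plain HP model, and binary tournament selection, one-point crossover, and per-gene mutation are choices made here for brevity. All function names are hypothetical.

```python
import random

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def decode(moves):
    """Walk the move string from the origin; None if the chain revisits a point."""
    pos, coords = (0, 0), [(0, 0)]
    for m in moves:
        pos = (pos[0] + MOVES[m][0], pos[1] + MOVES[m][1])
        if pos in coords:
            return None
        coords.append(pos)
    return coords

def hp_fitness(hp, moves):
    """Number of non-sequential 1-1 lattice contacts; -1 for a self-colliding fold."""
    coords = decode(moves)
    if coords is None:
        return -1
    return sum(
        1
        for i in range(len(hp))
        for j in range(i + 2, len(hp))
        if hp[i] == hp[j] == "1"
        and abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1]) == 1
    )

def tournament(scored, rng):
    """Binary tournament: the fitter of two random individuals wins."""
    return max(rng.sample(scored, 2))[1]

def run_ga(hp, generations=500, pop_size=100, crossover_rate=0.8,
           mutation_rate=0.05, seed=0):
    """Return (best move string, its fitness) found during the run."""
    rng = random.Random(seed)
    n = len(hp) - 1  # one move per bond between consecutive residues
    pop = ["".join(rng.choice("UDLR") for _ in range(n)) for _ in range(pop_size)]
    best = max((hp_fitness(hp, m), m) for m in pop)
    for _ in range(generations):
        scored = [(hp_fitness(hp, m), m) for m in pop]
        best = max(best, max(scored))
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(scored, rng), tournament(scored, rng)
            if rng.random() < crossover_rate:  # one-point crossover
                cut = rng.randrange(1, n)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):  # per-gene mutation
                children.append("".join(
                    rng.choice("UDLR") if rng.random() < mutation_rate else c
                    for c in child))
        pop = children[:pop_size]
    scored = [(hp_fitness(hp, m), m) for m in pop]
    best = max(best, max(scored))
    return best[1], best[0]
```

A call such as `run_ga("10011001101")` uses the values of Table 5.1 as defaults; the `seed` argument only makes a run repeatable.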

Table 5.2 Weights of features.

Feature   Fitness Value
PenH      -1
BonC      1
PenA      10
PenB      10
BonS      5

Table 5.3 Comparison of our method with the original method for target: 1LIN(146).

Template Protein   Sequence Similarity   RMSD(Old)   RMSD(Ours)
1CFD               100%                  7.34        -
1TNW               69%                   18.72       13.37
1IQ5               55%                   15.15       9.18
1DTL               52.9%                 10.22       7.48
5PAL               36.4%                 12.18       8.43
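For reference, the RMSD between a predicted and a real structure can be sketched in Python as follows. The sketch assumes that both structures are given as equally long lists of coordinates for corresponding residues and that they have already been superimposed; the superposition step and the choice of representative atoms are not covered here. The function name is hypothetical.

```python
import math

def rmsd(predicted, actual):
    """Root mean square deviation between paired coordinate lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(predicted, actual))
    return math.sqrt(total / len(predicted))

# Every residue is displaced by a 3-4-5 vector, so the deviation is 5.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(3, 4, 0), (4, 4, 0)]))
```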

However, in Table 5.3, we also find that some prediction results from templates with lower sequence similarity are better than those from templates with higher sequence similarity. As a result, we suggest that choosing the template with the highest sequence similarity is not always the right choice.

Table 5.4 demonstrates another case concerning template proteins. 1JYQ and 1JYU both have 90.4% sequence similarity with 1QG1. However, 1JYQ is clearly a better template protein than 1JYU. Table 5.5 and Table 5.6 show two further cases, with no notable improvements.

Table 5.7 shows the comparison between our method and the method of Dovier et al. [11]. The input to their method consists of the primary sequence and the secondary structure of the target protein, but they do not use information from other known structures in the database. In most cases, we improve the prediction results as measured by RMSD. Take 1E0M for example: for the entire sequence of 37 amino acids, the RMSD of their method is 7.2 while that of our method is 6.05.

Table 5.4 Comparison of our method with the original method for target: 1QG1(104).

Template Protein   Sequence Similarity   RMSD(Old)   RMSD(Ours)
1JYQ               90.4%                 4.15        -
1JYU               90.4%                 13.89       -
1SHA               46.7%                 4.82        4.82
1SHD               45.2%                 8.89        6.77
1PDR               24.4%                 10.55       8.0

Table 5.5 Comparison of our method with the original method for target: 5CPV(108).

Template Protein   Sequence Similarity   RMSD(Old)   RMSD(Ours)
1CDP               100%                  0.16        -
1BU3               82.7%                 1.62        -
2PVB               76.4%                 5.6         5.72
1C7V               32.5%                 5.32        4.92

Table 5.6 Comparison of our method with the original method for target: 1SHD(101).

Template Protein   Sequence Similarity   RMSD(Old)   RMSD(Ours)
1SHA               98%                   3.86        -
1JYQ               48.3%                 9.98        7.43
1QG1               45.2%                 9.04        7.35
1PDR               29.2%                 8.52        9.68

Table 5.7 Comparison of our method with the method of Dovier et al.

Target Protein    Method of Dovier et al.    Our Method
Name   Length     Time(s)   RMSD             Time(s)   RMSD   Template(Similarity)
1VII   36         5460      10.0             9.6       6.01   1E0M(11.6%)
                            7.5(4-32)                  4.98
1E0M   37         69420     7.2              27.4      6.05   1LE3(27%)
                            5.9(7-22)                  4.11
1ED0   46         64200     7.3              37.7      6.45   1LE3(5.7%)
                            3.7(7-30)                  6.18
1ENH   54         12240     10.0             28.1      5.57   1CMG(36%)
                            5.4(8-52)                  4.29

Furthermore, for the segment from the 7th to the 22nd amino acid, the result of their method is 5.9 while the result of our method is 4.11. We may note, in passing, that the execution times of the two methods differ greatly. However, it is not entirely fair to compare them, because the two methods do not apply the same strategy.

For most of these test cases, we find that the improvements are more evident in α helices than in β sheets. This may be because it is not easy for the folding to bring the strands close together, so the conformations of β sheets are not predicted well. Nevertheless, our method can at least tend to make the β strands straight.

CHAPTER 6

Conclusions

In this thesis, we survey some methods that predict protein folding with GA based on the HP model. We find that the foldings predicted on the HP model are not accurate enough. Thus, we suggest that some other features need to be considered in addition to hydrophobicity.

We propose a GA-based method on the lattice model to predict the folding of proteins. Our algorithm is a hybrid method, consisting of the homology modeling method and the folding algorithm. In the fitness function of the GA, we consider the information of secondary structure, hydrophobicity, charges of side chains, and disulfide bonds. The results of our experiments show that these features indeed improve the prediction accuracy compared with the previous methods based on the HP model.

From the experimental results, we find that even though considering these features improves the predictions, the results are still far from perfect. This may be because the lattice model itself is not a very suitable model, apart from being easy to implement. Therefore, in the future, the folding of a protein can be based on some other lattice models. For example, the 3D triangular lattice should be a more suitable model for protein structures, because it allows more folding angles and the folding pathways are closer to the real structures. Besides, more chemical characteristics of amino acids can be considered when predicting protein structures, such as the structures of residues.

Furthermore, the statistics of secondary structure elements over all proteins in a large protein database, research on supersecondary structures and domains, and so on, also offer useful information. More and more structure classification databases are being developed, and they also contribute greatly to protein structure prediction. It seems reasonable to conclude that protein structure prediction still poses some challenges.

BIBLIOGRAPHY

[1] P. Baldi, G. Pollastri, C. A. F. Andersen, and S. Brunak, “Matching protein beta-sheet partners by feedforward and recurrent neural network,” Proceedings of the 2000 Conference on Intelligent Systems for Molecular Biology (ISMB00), AAAI Press, 2000.

[2] D. H. Ballard, An Introduction to Natural Computation. MIT Press, 1999.

[3] D. Beasley, D. Bull, and R. Martin, “An overview of genetic algorithms: Part 2, research topics,” University Computing, Vol. 15, No. 4, pp. 170-181, 1993.

[4] B. Berger and T. Leighton, “Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete,” Journal of Computational Biology, Vol. 5, No. 1, pp. 27-40, 1998.

[5] M. K. Campbell and S. O. Farrell, Biochemistry. Brooks Cole, fourth ed., 2002.

[6] Y. Y. Chen, C. B. Yang, and K. T. Tseng, “Prediction of protein structures based on curve alignment,” Proceedings of the 20th Workshop on Combinatorial Mathematics and Computation Theory, Chiayi, Taiwan, pp. 34-44, 2003.

[7] R. S. Cheng, C. B. Yang, and K. T. Tseng, “Protein structure prediction based on secondary structure alignment,” Proceedings of 2004 Symposium on Digital Life and Internet Technologies (Abstract, full text in CD), Tainan, Taiwan, pp. 29-29, 2004.

[8] P. Crescenzi, D. Goldman, C. Papadimitriou, A. Piccolboni, and M. Yannakakis, “On the complexity of protein folding,” Journal of Computational Biology, Vol. 5, No. 1, pp. 409-422, 1998.

[9] Y. Cui, R. S. Chen, and W. H. Wong, “Protein folding simulation with genetic algorithm and supersecondary structure constraints,” Proteins, Vol. 31, pp. 247-257, 1998.

[10] K. A. Dill, “Theory for the folding and stability of globular proteins,” Biochemistry, Vol. 24, pp. 1501-1509, 1985.

[11] A. Dovier, M. Burato, and F. Fogolari, “Using secondary structure information for protein folding in CLP(FD),” Electronic Notes in Theoretical Computer Science (M. Comini and M. Falaschi, eds.), Vol. 76, Elsevier, 2002.

[12] S. Duarte-Flores and J. Smith, “Study of fitness landscapes for the HP model of protein structure prediction,” In Proceedings of the Congress on Evolutionary Computation 2003 (CEC’2003), Vol. 1, Canberra, Australia, IEEE Service Center, pp. 2338-2345, 2003.

[13] S. Forrest, “Genetic algorithms,” ACM Computing Surveys, Vol. 28, pp. 77-80, 1996.

[14] A. Fraenkel, “Complexity of protein folding,” Bulletin of Mathematical Biology, pp. 1199-1210, 1993.

[15] C. Gibas and P. Jambeck, Developing Bioinformatics Computer Skills. O’Reilly & Associates, Inc., first ed., 2001.

[16] F. Glover, “Future paths for integer programming and links to artificial intelligence,” Computers and Operations Research, Vol. 13, pp. 533-549, 1986.

[17] D. Goldberg, Genetic Algorithms. Addison Wesley Publishing, first ed., 1988.

[18] W. Hart and S. Istrail, “Robust proofs of NP-hardness for protein folding: general lattices and energy potentials,” Journal of Computational Biology, Vol. 4, No. 1, pp. 1-22, 1997.

[19] J. Holland, “Adaptation in natural and artificial systems.” Technical Report. The University of Michigan Press, USA, 1975.

[20] T. Jiang, Q. Cui, G. Shi, and S. Ma, “Protein folding simulations of the hydrophobic-hydrophilic model by combining tabu search with genetic algorithm,” Journal of Chemical Physics, Vol. 119, No. 8, pp. 4592-4596, 2003.

[21] N. Krasnogor, W. Hart, J. Smith, and D. Pelta, “Protein structure prediction with evolutionary algorithms,” In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, GECCO-99: Proceedings of the Genetic and Evolutionary Computation Conference, Morgan Kaufmann, 1999.

[22] R. C. T. Lee, “Computational biology.” http://www.csie.ncnu.edu.tw/, Department of Computer Science and Information Engineering, National Chi-Nan University, Taiwan, 2001.

[23] M. Milostan, P. Lukasiak, K. Dill, and J. Blazewicz, “A tabu search strategy for finding low energy structures of proteins in HP-model,” Proceedings of Seventh Annual International Conference on Research in Computational Molecular Biology, Berlin, Germany, 2003.

[24] A. Patton, W. P. III, and E. Goodman, “A standard GA approach to native protein structure prediction,” Proceedings of 6th International Conference On Genetic Algorithm, Dublin, Ireland, pp. 574-581, 1995.

[25] I. Ruczinski, C. Kooperberg, R. Bonneau, and D. Baker, “Distributions of beta sheets in proteins with application to structure prediction,” Proteins, Vol. 48, pp. 85-97, 2002.

[26] J. Setubal and J. Meidanis, Introduction to Computational Molecular Biology. PWS Publishing Company, Boston, second ed., 1997.

[27] R. Unger and J. Moult, “Finding the lowest free energy conformation of a protein is NP-hard problem: Proof and implications,” Bulletin of Mathematical Biology, Vol. 55, No. 6, pp. 1183-1198, 1993.

[28] R. Unger and J. Moult, “Genetic algorithms for protein folding simulations,” Journal of Molecular Biology, Vol. 231, No. 1, pp. 75-81, 1993.

[29] M. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, London: CRC Press, 1995.

[30] C. Zhang and A. K. Wong, “A genetic algorithm for multiple molecular sequence alignment,” Comput. Appl. Biosci., Vol. 13, pp. 565-581, 1997.
