Protein Secondary Structure Prediction by Fuzzy Min Max Neural Network with Compensatory Neurons
Thesis submitted in partial fulfillment of the requirements for the degree of
Master of Technology In Computer Science & Engineering
By
Sudipta Saha (Roll No. 06CS6023)
Under the supervision of Prof. Jayanta Mukhopadhyay
Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur-721302 West Bengal, India May, 2008.
Department Of Computer Science & Engineering
Indian Institute of Technology
Kharagpur-721302, India
Certificate
This is to certify that the thesis titled “Protein Secondary Structure Prediction by Fuzzy Min-Max Neural Network with Compensatory Neurons”, submitted by Sudipta Saha, to the Department of Computer Science and Engineering, in partial fulfillment for the award of the degree of Master of Technology is a bona fide record of work carried out by him under my supervision and guidance. The thesis has fulfilled all the requirements as per the regulations of the institute and, in my opinion, has reached the standard needed for submission.
Prof. Jayanta Mukhopadhyay Dept. of Computer Science and Engineering Indian Institute of Technology Kharagpur -721302, India
Dedicated To My parents and wife
Acknowledgement
I take this opportunity to express my deep sense of gratitude to my guide Dr. Jayanta Mukhopadhyay for his guidance, support and inspiration throughout the duration of the work.
I would like to specially acknowledge the help and encouragement I received from Dr. P. K. Biswas of the Department of Electronics and Electrical Communication Engineering and Dr. A. K. Majumdar of the Department of Computer Science & Engineering, IIT Kharagpur. In addition, I am thankful to all the faculty members, staff and research scholars of the Department of Computer Science and Engineering and to my friends for providing me adequate help whenever required.
I am also grateful to my parents for their constant encouragement and financial support in my years of studies. Lastly I would like to thank my wife Sipra Saha, for all the love.
Sudipta Saha Dept. of Computer Science and Engineering Indian Institute of Technology Kharagpur -721302, India Date: May 05, 2008
Abstract
Neural networks are now used extensively for predicting the three dimensional structures of proteins, and different types of neural networks have been employed for this task. There are several levels of three dimensional structure in proteins. In our work a special kind of neural network has been employed for predicting the secondary structure of proteins from the primary structure. This neural network combines the neural network concept with fuzzy logic. The method of prediction also uses the algorithm described by Chou-Fasman [4] to break ties between different classes of predictions. The basic algorithm used here for the fuzzy min-max neural network has been taken from [3]. Some small drawbacks of the training algorithm have been identified and removed as a part of our work. The prediction is carried out with the improved neural network.
Apart from these, some domain knowledge relating to the nature of protein secondary structures is also used to post-process the prediction output of the basic neural network to improve the prediction accuracy. So far more than 25000 proteins have been sequenced and their three dimensional structures determined. In our work we have tried to extract as much information as possible from these proteins. To achieve this, we have employed multiple instances of the neural network and trained them with different sets of data. The Protein Data Bank is used as the primary source of protein data for both training and testing our prediction system. The overall accuracy (Q3) achieved is around 70%. This is better than existing statistics based prediction systems like Chou-Fasman [1] and GOR I [2], and it is comparable to some of the neural network based systems.
Contents
1. Introduction
    1.1 Motivation of the Work
    1.2 Objective of the Work
    1.3 Organization of the Thesis
2. Protein Structure
    2.1 Introduction
    2.2 Molecular Structure of Protein
    2.3 3D Structures of Protein
    2.4 Ramachandran Diagram
    2.5 Different Secondary Structures of Protein
    2.6 Levinthal’s Paradox
    2.7 Summary
3. Literature Survey
    3.1 Introduction
    3.2 Chou-Fasman Method
    3.3 GOR Method
    3.4 PhD Method
    3.5 PSI-Pred Method
    3.6 JPred Method
    3.7 Summary
4. Neural Network Architecture
    4.1 Introduction
    4.2 FMNN
    4.3 FMCN
    4.4 Improvements on FMCN
    4.5 Summary
5. Secondary Structure Prediction with Improved FMCN
    5.1 Introduction
    5.2 Application of FMCN
    5.3 Accuracy Measurement Techniques
    5.4 Complexity in Using FMCN
    5.5 Experimental Results
    5.6 Summary
6. Multiple Instantiations of FMCN-units
    6.1 Introduction
    6.2 System Architecture
    6.3 Experimental Results
    6.4 Summary
7. Conclusion and Future Works
    7.1 Conclusion
    7.2 Future Work
Appendix A
Appendix B
Appendix C
Appendix D
Bibliography
Chapter1.Introduction
Chapter 1
Introduction
1.1 Motivation of the Work
Proteins play a central role in every living organism. In 1838, the Swedish chemist Jöns Jakob Berzelius first described and named these molecules. The word ‘protein’ originates from the Greek word ‘proteios’, which means ‘of primary importance’. For example, insulin is a protein: if the pancreas does not produce a sufficient amount of insulin, or if the cells become resistant to its effects, the body cannot use glucose effectively, and the disease diabetes mellitus results. Insulin was also the first protein whose primary chemical structure was determined, by Frederick Sanger, who won the Nobel Prize for this achievement in 1958. Every protein has its own unique three dimensional structure, and there is a very close relationship between this structure and the function of the protein. For example, the proteins which are mainly responsible for giving strength to the muscular tissues are generally of spiral shape. So, if the three dimensional structure of a protein is known in advance, that information can give us many clues about what the functions of the protein are and how it carries them out.
But so far there is no well-established, generally applicable way to determine these three dimensional structures. The structure results from the energy minimization of a very large function with many parameters, and even the number of contributing factors is not fully known, so the straightforward computational way does not work. Experimental determination of the structure is also not possible for many proteins because of several limitations: techniques such as X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy are very costly and time consuming. For these reasons we need to take the help of prediction mechanisms. Many kinds of prediction mechanisms have been developed, but so far the best of them gives an average accuracy (Q3) of 77%. There are more than 100,000 distinct proteins in the set of proteins specified by the human genome alone. But the number of proteins whose structures are fully determined is very small compared to the total number of possible proteins from all organisms. There are also many proteins whose chemical structures are known but whose three dimensional structures are not. Therefore, there is a strong demand for an intelligent system that predicts the three dimensional structure of a protein from its primary structure.
1.2 Objective of the Work
The main objective of our work was to build an intelligent system that predicts the secondary structure of a protein from its amino acid sequence with good accuracy. Many statistical and neural network based methods have been applied in this area so far, and many new types of classifiers have also been used. On the other hand, the Fuzzy Min-Max Neural Network Classifier with Compensatory Neurons (FMCN) [2] has already been tested on different types of data such as Iris, Thyroid and Ionosphere, and achieved very good prediction accuracies. Our objective was to design a prediction system with FMCN as the main component. At present there are more than 25000 proteins in the Protein Data Bank (PDB, http://www.rcsb.org/pdb). The three dimensional coordinates of the atoms of the amino acids in these proteins have been fully determined by experiment, and the secondary structures of these proteins can be calculated from these coordinate values. The application of FMCN is a knowledge based approach. Naturally, the quality of the prediction also depends on the knowledge acquired by the classifier. Our aim was therefore also to take full advantage of the entire database available in PDB. The existing prediction systems which achieve more than 70% accuracy on average use several other components to improve the performance, such as homology modeling as a preprocessing step, post-processing of the output, jury systems etc. Our system does not reach the accuracy of the best existing methods. But as a bare classifier algorithm with very little post-processing, it gave the desired performance. Therefore it is expected that our method, combined with good preprocessing of the training and testing data and a training set encompassing almost all the proteins sequenced so far, will give better results.
1.3 Organization of the Thesis
The thesis is divided into 7 chapters: one introductory chapter (Chapter 1), five main chapters (Chapter 2 to Chapter 6) and one chapter drawing the conclusion and describing the scope of future work (Chapter 7). Chapter 2 deals with the different levels of structure of proteins. It also describes the different secondary structures and presents the paradox regarding protein folding. Chapter 3 briefly surveys the past work in the field of protein secondary structure prediction. Two statistics based and two neural network based systems are discussed in brief. Chapters 4, 5 and 6 give the detailed description of our work. Chapter 4 starts with the basics of the prediction mechanism and its relation with fuzzy logic, and briefly describes the drawbacks of FMNN. It mainly illustrates the FMCN architecture with its algorithms for learning and recall. The drawbacks of these algorithms are pointed out and removed in the modified algorithms, which are also presented in that chapter. Chapter 5 deals with the mapping of the problem of predicting the secondary structure of proteins to the problem of classification by FMCN. It describes the enumeration schemes used and their significance. Finally, it presents three typical result sets and the overall results for the first phase of the experiment. Chapter 6 explains in detail the techniques used for creating multiple instances and for using a large knowledge base for prediction. It describes the algorithm used for combining the outputs from multiple instances of the FMCN-units and the algorithm for post-processing the prediction output to get the final output. It also furnishes the final results for the second, i.e. final, phase of our experiment. Chapter 7 concludes the thesis and deals with the scope of possible future work. It mainly describes the reasons for not achieving higher accuracy and the steps to be taken in the future to get better accuracy.
Chapter2.Protein Structure
Chapter 2
Protein Structure
2.1 Introduction
This chapter deals with the background knowledge of biochemistry related to our work. Though this knowledge is not mandatory, it helps to visualize the full scenario of protein structure. The chapter begins with a brief description of the amino acids, their chemical structures, peptide bonds, polypeptides and proteins. Section 2.3 briefly discusses the different protein structures with diagrams. Section 2.4 describes the dihedral angles and the Ramachandran diagram. Section 2.5 describes the different classes of secondary structure found in proteins and their relation with the torsion angles. Finally, Section 2.6 presents the thought experiment in the theory of protein folding known as ‘Levinthal’s Paradox’. All the research work going on in the field of protein structure prediction aims at a fruitful solution to this paradox.
2.2 Molecular Structure of Protein
2.2.1 Amino Acid
The basic unit of all proteins is the amino acid molecule. Amino acids are very small bio-molecules having an average molecular weight of about 135 Daltons (1 Da = $1.660 \times 10^{-27}$ kg). They are organic acids. In nature, they exist in a zwitterionic state, in which the carboxylic acid group is ionized (negative) and the basic amino group is protonated (positive). All amino acids are composed of an organic carboxylic acid group and an amino group, both attached to a saturated carbon atom. Apart from these two groups, two more groups are attached to this central carbon atom (also called the alpha-carbon atom). One is a simple hydrogen atom and the other is called the residue (R). All amino acid molecules have the same structure except that the residue is different for different amino acids. The simplest amino acid is Glycine, where the alpha-carbon atom is attached to two hydrogen atoms in addition to the carboxylic acid and amino groups, so here the residue is a hydrogen atom itself. In all other amino acids a larger residue is attached to the alpha-carbon atom.
There are 20 different types of amino acids. Each of them is denoted by a three-letter code and a one-letter code. The 3-letter and 1-letter codes of all the amino acids are listed in Appendix A (Table A.1). The side of the amino acid molecule that carries the amino group is called the N-terminal, and the other side, where the carboxylic acid group is attached, is called the C-terminal.
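[Figure: structural formula +H3N-CH(R)-COO-, with the hydrogen atom H and the residue R attached to the central carbon atom]
Fig 2.1: General atomic structure of amino acid in zwitterionic state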
2.2.2 Peptide bond, Polypeptide and Protein
The covalent bond between the carboxylic acid group of one amino acid and the amino group of another amino acid is called the peptide bond. When two or more amino acids are linked in this way, they are called peptides. Polypeptides are long chains of amino acids joined by peptide bonds. For sequencing all these chain like structures, the N-terminal to C-terminal direction is followed. Polypeptides typically have large molecular weights. A protein is also a kind of polypeptide: a large macromolecule containing many peptide bonds. Generally proteins contain polypeptide chains with 50 to 2000 amino acid residues. The difference between general polypeptides and proteins is that a protein always takes the same three dimensional shape in nature, and that shape is unique to that protein. This is not true for general polypeptides, which can take many possible shapes that need not be unique.
2.3 3D Structures of Protein
Proteins are one of the major classes of macromolecules found in the cells of every living organism. They perform the major bio-chemical tasks, and to perform these tasks it is very important that they form a particular three-dimensional shape. For example, an object with a spiral or helical structure can exhibit more strength than a flat one. For this reason the proteins which are responsible for giving strength to our muscles have helical structures. There are many such examples.
Proteins are formed by peptide bonds between different amino acid molecules. The primary structure of a protein refers to the sequence of amino acids of that particular protein. For example, some portion of a protein may have the following sequence of amino acids: ... N I R V I A R V R P V T K E D G E G P E A T N A V T F D A D D D S I I H L L H K G K P V S F E L D K V F S P Q A S Q Q D V F Q E V ... (each letter denotes the one-letter code of an amino acid). The three dimensional shape of the protein fully depends on this primary structure. To allow a clear analysis of the different shapes, the 3D structure of a protein is divided into three levels: secondary structure, tertiary structure and quaternary structure.
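As a small illustration of how a primary structure can be handled as data, the following sketch parses the fragment quoted above and counts its amino acid composition. The function name `parse_primary_structure` and the variable names are hypothetical and introduced only for this example; the one-letter to three-letter mapping is the conventional one (the thesis lists its own copy in Table A.1).

```python
from collections import Counter

# Conventional one-letter -> three-letter amino acid codes
# (assumed to match Table A.1 of the thesis).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def parse_primary_structure(text):
    """Return the residues of a sequence written as spaced one-letter codes."""
    residues = [ch for ch in text.upper() if ch.isalpha()]
    unknown = sorted(set(residues) - set(ONE_TO_THREE))
    if unknown:
        raise ValueError(f"unknown residue codes: {unknown}")
    return residues


# Fragment quoted in the text, read in N-terminal to C-terminal order.
fragment = (
    "N I R V I A R V R P V T K E D G E G P E A T N A V T F D A D "
    "D D S I I H L L H K G K P V S F E L D K V F S P Q A S Q Q D "
    "V F Q E V"
)
residues = parse_primary_structure(fragment)
composition = Counter(residues)

print(len(residues), "residues")
print([ONE_TO_THREE[r] for r in residues[:5]])
print(composition.most_common(3))
```

Representing the sequence as a list of one-letter codes keeps the N-terminal to C-terminal order explicit, which is the order any window-based predictor would read it in.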
Generally the sequence of amino acid residues is long, and different portions of that sequence take different 3-D shapes. This is called the secondary structure of the protein. These shapes may be spirals, flat sheets or loops. The primary structure of the protein contains the full information on how the secondary structure will be formed. The secondary structure is stabilized mainly by hydrogen bonds. The most common examples are the alpha helix, the beta sheet and turns or loops. Because secondary structures are local, many regions of different secondary structure can be present in the same protein molecule.
Tertiary structure is the overall 3-D shape of a single protein molecule. It arises from the spatial relationship between the secondary structures of different portions of the same protein. Tertiary structure is generally stabilized by nonlocal interactions, most commonly the formation of a hydrophobic core; salt bridges, hydrogen bonds and disulfide bonds also contribute. The term tertiary structure is often used synonymously with the term fold. For example, Myoglobin is a protein which acts as the oxygen carrier in muscle. It has 153 amino acids. About 70% of its chain of amino acids is folded into eight helical regions, and these 8 regions are connected by turns or loops. Fig 2.2 gives a graphical view of Myoglobin [14].
Fig 2.2: Myoglobin, a protein with 8 alpha helical regions connected by loops.
Fig 2.3: CD4, a protein which is found on cell surfaces. It consists of 4 similar subunits.
The primary structure describes the amino acid sequence. The secondary structure describes the local arrangement of amino acid residues. The tertiary structure describes the non-local arrangement of amino acid residues and gives the overall structure of the protein. But some proteins contain more than one polypeptide chain, and a fourth level of structural organization can be seen in such proteins. Each polypeptide chain is called a subunit, and the quaternary structure describes the arrangement of the subunits. For example, human hemoglobin, the oxygen-carrying protein in blood, has 4 subunits. Due to this kind of structure, it can carry oxygen very efficiently. Fig 2.4 shows a pictorial view of Hemoglobin [14].
Fig 2.4: Hemoglobin, the oxygen carrying protein in human blood, contains 4 subunits.
2.4 Ramachandran Diagram
Peptide bonds are strong in nature and do not permit any rotation around the bond. But this is not true for the bond between the central carbon atom and the amino group, or for the bond between the central carbon atom and the carbonyl group. Two adjacent units can rotate about these bonds and take different orientations. This rotation is the main reason why proteins can take different shapes. Dihedral angles, also called torsion angles, measure the rotation about a bond. The angle of rotation about the bond between the nitrogen and the central carbon atom is called the phi angle, and that about the bond between the central carbon atom and the carbonyl carbon atom is called the psi angle. Fig 2.5 shows the peptide bonds and the places of rotation along the chain of amino acids.
[Figure: a chain of amino acids with the labels N-terminal, C-terminal, peptide bonds, the bonds where rotation occurs, and one single amino acid]
Fig 2.5: Peptide bond and phi-psi angles
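The torsion angles described above can be computed from atomic coordinates. The sketch below is a minimal, standard-library illustration and not part of the thesis: the function `dihedral_deg` and its helpers are hypothetical names. It uses the common atan2 formulation for the dihedral angle defined by four points. For phi the four atoms would conventionally be C(i-1), N(i), CA(i), C(i), and for psi they would be N(i), CA(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0, p1, p2, p3):
    """Dihedral angle in degrees about the bond p1-p2, in the range [-180, 180]."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not real atoms): the first three points are fixed and
# only the last point is moved around the central bond.
cis = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, quarter, trans)
```

Using atan2 instead of an arccos of the normalised dot product keeps the sign of the angle, which matters because a Ramachandran diagram distinguishes positive from negative phi and psi values.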
Not all combinations of phi-psi angles are possible, because of collisions between atoms. G. N. Ramachandran plotted the allowed values of the torsion angles in the diagram called the Ramachandran diagram [14, 15]. The Ramachandran diagram for proteins shows three confined regions for the allowed combinations of phi-psi angles. Fig 2.6 shows the different regions in the Ramachandran diagram. It also shows the preferred combinations of these angles for the different secondary structures, which are discussed in the next section.
[Figure: Ramachandran plot with Φ angles and Ψ angles as axes. Annotations in the figure: the dark green region is the most favorable region; one region contains the combinations of angles generally found in beta sheets; one region contains the combinations generally found in right handed alpha helices; one region, very rarely seen in practice, contains the angles found only in rare left handed alpha helices; angles in the forbidden region are not used due to collisions between atoms.]
Fig 2.6: A typical Ramachandran diagram. The dark green regions are the most favorable combinations of phi-psi angles. The lightly shaded regions are less favorable and the white regions show the forbidden regions.
2.5 Different Secondary Structures of Protein
Due to different possible combinations of phi-psi angles, different secondary structures are created. These combinations are mentioned in different regions of the Ramachandran diagram in Fig 2.6. The possible secondary structures are: Alpha-helix, Beta-sheet and Turn or loop.
The helical structure was discovered by Linus Pauling in 1951 (he also discovered the beta sheet). A helix has a direction (from N-terminal to C-terminal) and can be either left handed or right handed. The shape is spiral. The combinations of phi-psi angles responsible for these shapes lie in the regions marked in the Ramachandran diagram (Fig 2.6). Fig 2.7 shows a protein whose structure is almost entirely helical.
Fig 2.7: Ferritin, an iron storage protein. It has a bundle of alpha helices connected by loops.
There are generally three types of helices. One is the alpha helix. The other two are known as the 3-helix (also known as the 3/10 helix) and the 5-helix (also known as the pi helix). The helical structure is a result of hydrogen bonds, which are formed between the amino group of one amino acid and the carbonyl group of another amino acid. The helices are classified by the compactness of the shape, which depends on the density of the hydrogen bonds. In alpha helices every 4th residue is connected by hydrogen bonds. In the 3-helix and the 5-helix, every 3rd and every 5th residue are connected, respectively. The 3-helix and 5-helix are rarely found in nature.
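The hydrogen-bond spacing that separates the three helix types can be made concrete with a short sketch. The names `HELIX_SPACING` and `backbone_hbond_pairs` are hypothetical, residues are indexed from 0 for convenience, and the segment is assumed to be an ideal helix in which every possible bond is formed.

```python
# Residue spacing of the backbone hydrogen bonds for each helix type,
# as described in the text (3-helix: 3, alpha helix: 4, 5-helix: 5).
HELIX_SPACING = {"3-helix": 3, "alpha helix": 4, "5-helix": 5}


def backbone_hbond_pairs(n_residues, spacing):
    """List the (i, i + spacing) residue pairs bonded in an ideal helical segment."""
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    return [(i, i + spacing) for i in range(n_residues - spacing)]


for name, spacing in HELIX_SPACING.items():
    pairs = backbone_hbond_pairs(10, spacing)
    print(f"{name}: {len(pairs)} bonds, first {pairs[0]}, last {pairs[-1]}")
```

For a segment of fixed length, a smaller spacing yields more bonded pairs, which is one way to read the statement that the helix types differ in the density of their hydrogen bonds.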
In 1951, Pauling and Corey predicted the beta-pleated sheet structure as an alternative secondary structure to the alpha-helix in proteins. The individual strands are extended, like sticks. Fig 2.8 shows a protein consisting mainly of beta-sheets. This structure also has a direction (indicated by the arrows in the picture).
Fig 2.8: A protein rich in beta sheet. There is one small portion of helix and some more loops. This is actually a fatty acid binding protein.
Single beta-strands are not stable structures, so they occur in association with neighboring strands. They can be found in either parallel or anti-parallel form, with respect to the N-terminal to C-terminal direction of the adjacent peptide strands. Like alpha-helices, beta-pleated sheet backbones are fully hydrogen bonded, but here the H-bonds occur between neighboring strands (inter-strand). The H-bond geometry is different in parallel and anti-parallel beta sheets.
To connect helices and sheets in their various combinations, protein structures also contain turns or loops that allow the peptide backbone to fold back. These turns are almost always found on the surface of proteins and often contain Proline and/or Glycine. Fig 2.9 shows a protein having many turns on its surface.
Fig 2.9: A protein having a lot of loops or turns on its surface.
2.6 Levinthal’s Paradox
Levinthal’s paradox [13] is a very interesting paradox in the field of protein folding. The amino acid residues in a polypeptide chain have a very large number of possible conformations. If a protein had to reach its correct three dimensional shape by sequentially sampling all possible conformations (even at a rate of one per picosecond or nanosecond), it would take a time far greater than the age of the universe. For example, if we consider a 100 residue protein in which each residue can take only 3 positions, there are $3^{100}$ possible conformations. If it takes $10^{-12}$ s to convert from one conformation to another, an exhaustive search would take about $1.6 \times 10^{28}$ years. But proteins fold spontaneously within microseconds, milliseconds or at most minutes. The paradox therefore leads to the question:
“Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”
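The order of magnitude behind the paradox can be checked with a few lines of arithmetic. This is only a sketch under the assumptions stated in the example above (100 residues, 3 positions per residue, $10^{-12}$ s per conformation); the function name `exhaustive_search_years` is hypothetical, and a year is taken as 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(n_residues, states_per_residue, seconds_per_step):
    """Years needed to visit every conformation once, one after another."""
    conformations = states_per_residue ** n_residues  # exact integer
    return conformations * seconds_per_step / SECONDS_PER_YEAR


years = exhaustive_search_years(100, 3, 1e-12)
print(f"{3 ** 100:.3e} conformations, about {years:.2e} years")
```

Changing `seconds_per_step` by a few orders of magnitude barely affects the conclusion, since the exponential number of conformations dominates the result.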
Levinthal’s paradox poses a great challenge to the problem of protein structure determination. It is obvious from the paradox that proteins do not make an exhaustive search. But the exact way in which a protein folds in such a short time is still unknown. All the related work in the area of protein structure prediction tries to resolve this paradox. Our work in the domain of secondary structure prediction also tries to open a new avenue towards resolving it.
2.7 Summary
Protein structure prediction is a problem from computational biology, where computational techniques are applied to solve problems related to biology. To appreciate our work fully, some relevant knowledge of biochemistry is required. This chapter introduced that domain knowledge in brief. Starting from the molecular structure of the amino acids, it went through the different 3D structures of proteins. Finally it ended with the well-known paradox in the field of protein folding, which is still a great challenge to solve.
Chapter3.Literature Survey
Chapter 3
Literature Survey
3.1 Introduction
The prediction of the secondary structure of a protein from its primary structure has proved to be an extremely difficult problem. Two fundamentally different approaches have been adopted for predicting the three dimensional structures of proteins. One is the ab initio prediction approach. It does not use any previously known protein structure. It employs computer based applications to minimize a very large function corresponding to the free energy of the molecule; in effect it tries to simulate the real folding process. But this approach is limited, as the number of possible conformations is exponential. The other approach is known as the knowledge-based approach. Here, amino acid sequences with known structures are used as the source of knowledge. This knowledge is extracted and stored in some appropriate manner. The knowledge-base is then used to predict the structure of any new amino acid sequence, i.e. of proteins with known sequence but unknown structure. Our approach to this problem is also a knowledge based approach.
The knowledge based methods used in the past can be categorized into two groups: statistics based and neural network based. Other work has also been carried out using SVMs, different data-mining tools, graph-based models etc. Among all these, the neural network based methods have been the most effective. Some of the methods from past work are described in brief in the following sections. This chapter briefly discusses the best-known existing neural network based methods. The first good statistics based algorithm, by Chou and Fasman, is described in somewhat more detail.
3.2 Chou-Fasman Method
Chou and Fasman [3] defined the first good algorithm for determining the secondary structure. The algorithm depends entirely on the propensity values of the different amino acids. These are probability-like values. For all 20 types of amino acids, the propensity values were calculated from the structures of the 29 (later 64) proteins available at that time. The calculated propensity values for the 20 amino acids are given in Table 3.1. The values shown in bold face in the table indicate that the corresponding amino acids are strong formers of the corresponding secondary structure.
The propensity values are calculated by studying the database of protein sequences and the corresponding secondary structures. Let R be the amino acid residue whose propensity value is to be calculated. The following algorithm is used to calculate the propensity values.
1. Count the number of occurrences of the residue R in the helical regions all over the database. Let this value be A.
2. Count the number of all residues in helical regions all over the database. Let this value be B.
3. Count the number of occurrences of the residue R in the entire database. Let this value be C.
4. Count the number of all residues in the entire database of proteins. Let this value be D.
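The four counts above can be gathered in a single pass over a database of sequences and their annotated secondary structures. The sketch below is illustrative only: the function names, the use of "H" as the helix label and the two toy proteins are all assumptions made for this example, not data from the thesis. The ratio formed in `propensity` is the usual Chou-Fasman form, (A / C) divided by (B / D), which is algebraically the same as (A / B) / (C / D).

```python
def helix_counts(residue, sequences, structures, helix_label="H"):
    """Return the counts (A, B, C, D) for one residue type.

    A: occurrences of `residue` inside helical regions
    B: all residues inside helical regions
    C: occurrences of `residue` anywhere in the database
    D: all residues in the database
    """
    a = b = c = d = 0
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for aa, state in zip(seq, ss):
            d += 1
            in_helix = state == helix_label
            if in_helix:
                b += 1
            if aa == residue:
                c += 1
                if in_helix:
                    a += 1
    return a, b, c, d


def propensity(a, b, c, d):
    """Frequency of the residue in helices relative to the overall helix frequency."""
    return (a / c) / (b / d)


# Hypothetical toy database: H = helix, E = sheet, C = coil.
sequences = ["AEAKG", "GAVEA"]
structures = ["HHHHC", "CEEHH"]

counts = helix_counts("A", sequences, structures)
print(counts, propensity(*counts))
```

A value above 1 means the residue occurs in helices more often than an average residue does; the same counting scheme applies to sheets or turns by changing `helix_label`.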
The propensity value of R for the alpha helical region is: