Protein Secondary Structure Prediction by Fuzzy Min Max Neural Network with Compensatory Neurons

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Technology In Computer Science & Engineering

By

Sudipta Saha (Roll No. 06CS6023)

Under the supervision of Prof. Jayanta Mukhopadhyay

Department of Computer Science & Engineering
Indian Institute of Technology
Kharagpur-721302, West Bengal, India
May, 2008

Department Of Computer Science & Engineering

Indian Institute of Technology

Kharagpur-721302, India

Certificate

This is to certify that the thesis titled “Protein Secondary Structure Prediction by Fuzzy Min-Max Neural Network with Compensatory Neurons”, submitted by Sudipta Saha, to the Department of Computer Science and Engineering, in partial fulfillment for the award of the degree of Master of Technology is a bona fide record of work carried out by him under my supervision and guidance. The thesis has fulfilled all the requirements as per the regulations of the institute and, in my opinion, has reached the standard needed for submission.

Prof. Jayanta Mukhopadhyay
Dept. of Computer Science and Engineering
Indian Institute of Technology
Kharagpur-721302, India

Dedicated To My parents and wife

Acknowledgement

I take this opportunity to express my deep sense of gratitude to my guide Dr. Jayanta Mukhopadhyay for his guidance, support and inspiration throughout the duration of the work.

I would like to specially acknowledge the help and encouragement I received from Dr. P. K. Biswas of the Department of Electronics and Electrical Communication Engineering and Dr. A. K. Majumdar of the Department of Computer Science & Engineering, IIT Kharagpur. In addition, I am thankful to all the faculty members, staff and research scholars of the Department of Computer Science and Engineering and to my friends for providing me adequate help whenever required.

I am also grateful to my parents for their constant encouragement and financial support in my years of studies. Lastly I would like to thank my wife Sipra Saha, for all the love.

Sudipta Saha
Dept. of Computer Science and Engineering
Indian Institute of Technology
Kharagpur-721302, India
Date: May 05, 2008


Abstract

Neural networks are now extensively used for predicting the three dimensional structures of proteins, and different types of neural networks have been employed for this task. There are several levels of three dimensional structure in proteins. In our work a special kind of neural network has been employed for predicting the secondary structure of proteins from the primary structure. This neural network combines the neural network concept with fuzzy logic. The prediction method also uses the algorithm described by Chou–Fasman [4] to break ties between different classes of predictions. The basic algorithm used here for the fuzzy min-max neural network has been taken from [3]. Some small drawbacks of the training algorithm have been identified and removed as a part of our work, and the prediction has been tried with the improved neural network.

Apart from this, some domain knowledge relating to the nature of protein secondary structures is also used to post-process the prediction output of the basic neural network and obtain improved prediction accuracy. So far more than 25000 proteins have been sequenced and their three dimensional structures determined. In our work we have tried to extract as much information as possible from these already sequenced proteins. To achieve this, we have employed multiple instances of the neural network and trained them with different sets of data. The Protein Data Bank is used as the primary resource of protein data for both training and testing our prediction system. The overall accuracy (Q3) achieved is around 70%. It is better than existing statistics based prediction systems like Chou-Fasman [1] and GOR I [2], and it is comparable to some of the neural network based systems.


Contents

List of Figures v

List of Tables vi

1. Introduction ...... 1
1.1 Motivation of the Work ...... 1
1.2 Objective of the Work ...... 2
1.3 Organization of the Thesis ...... 3

2. Protein Structure ...... 5
2.1 Introduction ...... 5
2.2 Molecular Structure of Protein ...... 6
2.3 3D Structures of Protein ...... 7
2.4 Ramachandran Diagram ...... 10
2.5 Different Secondary Structures of Protein ...... 12
2.6 Levinthal's Paradox ...... 14
2.7 Summary ...... 15

3. Literature Survey ...... 16
3.1 Introduction ...... 16
3.2 Chou-Fasman Method ...... 17
3.3 GOR Method ...... 19
3.4 PhD Method ...... 20
3.5 PSI-Pred Method ...... 21
3.6 JPred Method ...... 22
3.7 Summary ...... 23


4. Neural Network Architecture ...... 24
4.1 Introduction ...... 24
4.2 FMNN ...... 25
4.3 FMCN ...... 31
4.4 Improvements on FMCN ...... 40
4.5 Summary ...... 48

5. Secondary Structure Prediction with Improved FMCN ...... 49
5.1 Introduction ...... 49
5.2 Application of FMCN ...... 50
5.3 Accuracy Measurement Techniques ...... 59
5.4 Complexity in Using FMCN ...... 62
5.5 Experimental Results ...... 63
5.6 Summary ...... 68

6. Multiple Instantiations of FMCN-units ...... 69
6.1 Introduction ...... 69
6.2 System Architecture ...... 70
6.3 Experimental Results ...... 78
6.4 Summary ...... 85

7. Conclusion and Future Works ...... 86
7.1 Conclusion ...... 86
7.2 Future Work ...... 88

Appendix A ...... 90

Appendix B ...... 91

Appendix C ...... 94

Appendix D ...... 110

Bibliography ...... 114


List of Figures

Fig 2.1 ...... 6
Fig 2.2 ...... 8
Fig 2.3 ...... 9
Fig 2.4 ...... 9
Fig 2.5 ...... 10
Fig 2.6 ...... 11
Fig 2.7 ...... 12
Fig 2.8 ...... 13
Fig 2.9 ...... 14

Fig 4.1 ...... 26
Fig 4.2 ...... 27
Fig 4.3 ...... 29
Fig 4.4 ...... 30
Fig 4.5 ...... 31
Fig 4.6 ...... 33
Fig 4.7 ...... 35
Fig 4.8 ...... 41
Fig 4.9 ...... 46

Fig 5.1 ...... 58
Fig 5.2 ...... 60

Fig 6.1 ...... 72


List of Tables

Table 3.1 ...... 18

Table 5.1 ...... 52
Table 5.2 ...... 53
Table 5.3 ...... 55
Table 5.4 ...... 55
Table 5.5 ...... 56
Table 5.6 ...... 64
Table 5.7 ...... 64
Table 5.8 ...... 65
Table 5.9 ...... 66
Table 5.10 ...... 68

Table 6.1 ...... 79
Table 6.2 ...... 82
Table 6.3 ...... 83
Table 6.4 ...... 85

Table A.1 ...... 90

Table C.1 ...... 94
Table C.2 ...... 94
Table C.3 ...... 95
Table C.4 ...... 96
Table C.5 ...... 97
Table C.6 ...... 98
Table C.7 ...... 98
Table C.8 ...... 99


Table C.9 ...... 100
Table C.10 ...... 101
Table C.11 ...... 102
Table C.12 ...... 102
Table C.13 ...... 103
Table C.14 ...... 104
Table C.15 ...... 105
Table C.16 ...... 106
Table C.17 ...... 108

Table D.1 ...... 110
Table D.2 ...... 110
Table D.3 ...... 111
Table D.4 ...... 111
Table D.5 ...... 112
Table D.6 ...... 112
Table D.7 ...... 113


Chapter 1

Introduction

1.1 Motivation of the Work

Proteins play a central role in every living organism. In 1838, the Swedish chemist Jöns Jakob Berzelius first described and named these molecules. The word 'protein' originates from the Greek word 'prota', which means 'of primary importance'. For example, insulin is a protein: if the pancreas does not produce a sufficient amount of insulin, or if the cells become resistant to its effects, the body cannot use glucose effectively and the disease diabetes mellitus results. Insulin is also the protein whose primary chemical structure was determined first, by Frederick Sanger, who won the Nobel Prize for this achievement in 1958. Every protein has its own unique three dimensional structure, and there is a very close relationship between a protein's structure and its function. For example, the proteins which are mainly responsible for giving strength to muscular tissue are generally helical in shape. So, if the three dimensional structure of a protein is known in advance, that information gives many clues about how the protein carries out its function and what its functions are.


However, there is so far no well-established experimental shortcut to determine these three dimensional structures. The structure actually depends on the minimization of a very large energy function with a great many parameters; even the number of factors involved is not fully known, so the straightforward approach does not work. Experimental determination of the structures is also not possible for many proteins because of several limitations: the techniques involved, such as X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy, are very costly and time consuming. For these reasons we need to take the help of prediction mechanisms. Many kinds of prediction mechanisms have been developed, but so far the best of them gives an average accuracy (Q3) of about 77%. There are more than 100,000 distinct proteins specified by the human genome alone, but the number of proteins whose structures are fully determined is very small compared to the total number of possible proteins from all organisms. There are also many proteins whose chemical structures are known but whose three dimensional structures are not. Therefore, there is a strong demand for an intelligent system to predict the three dimensional structure of a protein from its primary structure.

1.2 Objective of the Work

The main objective of our work was to come up with an intelligent system for predicting the secondary structure of a protein from its sequence with good prediction accuracy. Many statistical and neural network based methods have been applied in this area, along with several newer types of classifiers. On the other hand, the Fuzzy Min-Max Neural Network Classifier with Compensatory Neurons (FMCN) [2] had already been tested on different types of data, such as the Iris, Thyroid and Ionosphere data sets, and achieved very good prediction accuracies. Our objective was to design a prediction system with FMCN as the main component. There are so far more than 25000 proteins in the protein data bank (PDB, http://www.rcsb.org/pdb). The three dimensional coordinates of the different atoms of the amino acids in these proteins have been fully determined by experiment, and the secondary structures of these proteins can be calculated from these coordinate values. The application of FMCN is a knowledge based approach. Naturally, the quality of the prediction will also depend on the acquired knowledge of the classifier.


Our aim was also to take full advantage of the entire database available in PDB. The existing prediction systems which achieve more than 70% accuracy on average use several additional components to improve performance, such as homology modeling for preprocessing the data, post-processing of the output, jury systems, etc. Our system does not reach the accuracy of those existing prediction methods, but as a bare classifier with very little post-processing it gave the desired performance. Therefore it is highly expected that our method, combined with good preprocessing of the training/testing data and a training set encompassing almost all the proteins sequenced so far, will give better results.

1.3 Organization of the Thesis

The thesis is divided into 7 chapters: one introductory chapter (chapter 1), followed by five main chapters (chapters 2–6) and one chapter drawing the conclusion and describing the scope of future work (chapter 7). Chapter 2 deals with the different levels of protein structure and describes the different secondary structures; the paradox regarding protein folding is also presented in this chapter. Chapter 3 briefly surveys past works in the field of protein secondary structure prediction; two statistics based and two neural network based systems are discussed in brief. Chapters 4, 5 and 6 give the detailed description of our work. Chapter 4 starts with the basics of the prediction mechanism and its relation with fuzzy logic, briefly describes the drawbacks of FMNN, and mainly illustrates the FMCN architecture with its algorithms for learning and recall; the identified drawbacks were removed in the modified algorithms, which are also presented in that chapter. Chapter 5 deals with mapping the problem of predicting the secondary structure of proteins to the problem of classification by FMCN. It describes the enumeration schemes used and their significance. Finally, it comes up with three typical result sets and the overall results for the first phase of this experiment.


Chapter 6 explains in detail the techniques used for creating multiple instances and using a large knowledge base for prediction. It describes the algorithm used for combining the outputs from multiple instances of the FMCN-units and the algorithm for post-processing the prediction output to get the final output. It also furnishes the final results for the second, i.e. final, phase of our experiment. Chapter 7 deals with the conclusion and the scope of possible future work. It mainly describes the reasons for not achieving higher accuracy and the necessary steps to be taken as future work to get better accuracy.


Chapter 2

Protein Structure

2.1 Introduction

This chapter deals with the background knowledge of biochemistry related to our work. Though these ideas are not mandatory, they will help to visualize the full scenario of protein structure. The beginning of this chapter gives a brief description of the amino acids, their chemical structures, peptide bonds, polypeptides and proteins. Section 2.3 briefly discusses the different protein structures with diagrams. Section 2.4 describes the different dihedral angles and the Ramachandran diagram. The next section, section 2.5, briefly describes the different classes of secondary structure found in proteins and their relation with the torsion angles. Finally, the chapter ends with Section 2.6, which presents the thought experiment in the theory of protein folding known as 'Levinthal's Paradox'. All the research work going on in the field of protein structure prediction actually aims at a fruitful solution to this paradox.

2.2 Molecular Structure of Protein

2.2.1 Amino Acid

The basic unit of all proteins is the amino acid molecule. Amino acids are very small bio-molecules with an average molecular weight of about 135 Daltons (1 Dalton = 1.660 × 10^-27 kg). They are actually organic acids. In nature, they exist in a zwitterionic state: the carboxylic acid group is ionized (negative) and the basic amino group is protonated. All types of amino acids are composed of an organic carboxylic acid group and an amino group. These two groups are attached to a saturated carbon atom. Apart from these two groups, there are two more groups attached to the central carbon atom (also called the alpha-carbon atom): one is a simple hydrogen atom and the other is called the residue (R). All amino acid molecules have the same structure except that the residue differs from one amino acid to another. The simplest amino acid is glycine, where the saturated carbon atom is attached to two hydrogen atoms along with the other two groups mentioned above, so here the residue is a hydrogen atom itself. In all other cases there is one residue attached to the alpha-carbon atom.

There are 20 different types of amino acids. Each of them is denoted by a three-letter code and a one-letter code. The 3-letter and 1-letter codes of all the amino acids are listed in Appendix A (Table A.1). The side of the amino acid molecule carrying the amino group is called the N-terminal, and the other side, to which the carboxylic acid group is attached, is called the C-terminal.

Fig 2.1: General atomic structure of an amino acid in the zwitterionic state (+H3N–CHR–COO−).


2.2.2 Peptide, Polypeptide and Protein

The covalent bond between the carboxylic acid group of one amino acid and the amino group of another amino acid is called the peptide bond. When two or more amino acids are linked in this way, they are called peptides. Polypeptides are long chains of amino acids joined by peptide bonds. For sequencing all these chain-like structures, the N-terminal to C-terminal direction is followed. Polypeptides typically have large molecular weights. A protein is also a kind of polypeptide: a large macromolecule composed of several peptide bonds, generally containing polypeptide chains with 50 to 2000 amino acid residues. The difference between general polypeptides and proteins is that a protein always takes the same three dimensional shape in nature, and that shape is unique to that protein. This is not true for general polypeptides: they can take any possible shape, and these shapes may not be unique.

Proteins are among the major macromolecules found in the cells of every living organism, and they perform the major bio-chemical tasks. To perform these tasks it is very important for a protein to form a particular three-dimensional shape. For example, instead of being a flat object, if the structure is spiral or helical it can exhibit more strength; for this reason the proteins which are responsible for giving strength to our muscles have helical structures. There are many such examples.

Proteins are formed by peptide bonds between different amino acid molecules. The primary structure of a protein basically refers to the sequence of the amino acids of that particular protein. For example, some portion of a protein may have the following sequence of amino acids: …N I R V I A R V R P V T K E D G E G P E A T N A V T F D A D D D S I I H L L H K G K P V S F E L D K V F S P Q A S Q Q D V F Q E V… (each letter denotes the one-letter code of an amino acid). The three dimensional shape of the protein fully depends on this primary structure. For a clearer analysis of the different shapes, the 3D structure of a protein is divided into three levels: secondary structure, tertiary structure and quaternary structure.


Generally the sequence of amino acid residues is long, and different portions of that sequence take on different local 3-D shapes; this is called the secondary structure of the protein. These shapes may be spiral, like a flat sheet, or like loops. The primary structure of the protein contains the full information on how the secondary structure will be formed. The secondary structure is stabilized mainly by hydrogen bonds. The most common examples are the alpha helix, the beta sheet, and turns or loops. Because secondary structures are local, many regions of different secondary structure can be present in the same protein molecule.

Tertiary structure is the overall 3-D shape of a single protein molecule. The different secondary structures of different portions of the same protein have a spatial relationship to one another, and this relationship gives rise to the tertiary structure. Tertiary structure is generally stabilized by nonlocal interactions, most commonly the formation of a hydrophobic core; salt bridges, hydrogen bonds and disulfide bonds also contribute. The term tertiary structure is often used as a synonym for the term fold. For example, myoglobin is the protein which serves as the oxygen carrier in muscle. It has 153 amino acids; about 70% of the amino acid chain is folded into eight helical regions, and these 8 regions are connected by turns or loops. Fig 2.2 gives a graphical view of myoglobin [14].

Fig 2.2: Myoglobin – a protein with 8 alpha helical regions connected by loops.


Fig 2.3: CD4 – a protein found on cell surfaces. It consists of 4 similar subunits.

The primary structure describes the amino acid sequence. The secondary structure of the protein describes the local arrangement of amino acid residues. The tertiary structure describes the non-local arrangement of amino acid residues and gives us the overall structure of the protein. But some proteins contain more than one polypeptide chain, and a fourth level of structural organization can be seen in such proteins. Each polypeptide chain is called a subunit, and the quaternary structure describes the arrangement of the subunits. For example, human hemoglobin, the oxygen-carrying protein in blood, has 4 subunits. Due to this kind of structure, it can carry oxygen very efficiently. Fig 2.4 shows a pictorial view of hemoglobin [14].

Fig 2.4: Hemoglobin – the oxygen-carrying protein in human blood – contains 4 subunits.


2.4 Ramachandran Diagram

Peptide bonds are rigid in nature and do not permit any rotation around the bond. But this is not true for the bond between the central carbon atom and the amino group, or for the bond between the central carbon atom and the carbonyl group: the two adjacent units can rotate about these bonds and take different orientations. This rotation is the main reason why proteins can take different shapes. Dihedral angles, also called torsion angles, measure the rotation about a bond. The angle of rotation between the nitrogen and the central carbon atom is called the phi angle, and that between the central carbon atom and the carbonyl carbon atom is called the psi angle. Fig 2.5 pictorially describes the peptide bonds and the places of rotation along the chain of amino acids.

Fig 2.5: Peptide bonds and the phi-psi angles. Rotation occurs about the bonds on either side of each amino acid's central carbon atom, along the chain from the N-terminal to the C-terminal.

Not all combinations of phi-psi angles are possible, owing to collisions between atoms. G. N. Ramachandran plotted the allowed values of the torsion angles in the diagram called the Ramachandran diagram [14, 15].


The Ramachandran diagram for proteins shows three confined regions for the allowed combinations of phi-psi angles. Fig 2.6 shows the different regions in the Ramachandran diagram. It also shows the preferred combinations of these angles for the different secondary structures, which are discussed in the next section.

Fig 2.6: A typical Ramachandran diagram (Φ angles on the horizontal axis, Ψ angles on the vertical axis). The dark green regions are the most favorable combinations of phi-psi angles; the annotated regions correspond to beta sheets, to right handed alpha helices, and to the rare left handed alpha helices (a region very rarely seen in practice). The lightly shaded regions are less favorable, and the white regions are forbidden due to collisions between atoms.


2.5 Different Secondary Structures of Protein

Due to the different possible combinations of phi-psi angles, different secondary structures are created. These combinations are marked in the different regions of the Ramachandran diagram in Fig 2.6. The possible secondary structures are: alpha-helix, beta-sheet and turn or loop.

The helical structure was discovered by Linus Pauling in 1951 (he also discovered the beta sheet). The helix has a direction (from the N-terminal to the C-terminal) and can be either left handed or right handed; its shape is a spiral. The combinations of phi-psi angles responsible for this type of shape are found in the corresponding region of the Ramachandran diagram (Fig 2.6). Fig 2.7 shows a protein whose structure is almost entirely helical.

Fig 2.7: An iron storage protein with a bundle of alpha helices connected by loops.

There are generally three types of helices. One is the alpha helix; the other two are known as the 3-helix (also called the 3/10 helix) and the 5-helix (also called the pi helix). The alpha helix structure is a result of hydrogen bonds, which are formed between the amino group of one amino acid and the carbonyl group of another. The classification of the helices is based on the compactness of the shape, which depends on the density of the hydrogen bonds. In alpha helices every 4th residue is connected by a hydrogen bond; in the 3-helix and the 5-helix, every 3rd and every 5th residue respectively is connected.


The 3-helix and the 5-helix are rarely found in nature.

In 1952, Pauling and Corey predicted the beta-pleated sheet structure as an alternative secondary structure to the alpha-helix in proteins. The strands have an extended, stick-like shape. Fig 2.8 shows a protein consisting mainly of beta-sheets. This structure also has a direction (indicated by the arrows in the picture).

Fig 2.8: A protein rich in beta sheets, with one small helical portion and several loops. This is actually a fatty acid binding protein.

Single beta-strands are not stable structures, so they occur in association with neighboring strands. They can thus be found in either parallel or anti-parallel form with respect to the N-terminal to C-terminal direction of the adjacent peptide strands. Like alpha-helices, beta-pleated sheet backbones are fully hydrogen bonded, but here the H-bonds occur between neighboring strands rather than within a single strand. The H-bond geometry is different in parallel and anti-parallel beta sheets.

To combine helices and sheets in their various arrangements, protein structures also contain turns or loops that allow the peptide backbone to fold back. These turns are almost always found on the surface of proteins and often contain proline and/or glycine residues. Fig 2.9 shows a protein having many turns on its surface.


Fig 2.9: A protein having many loops or turns on its surface.

2.6 Levinthal’s Paradox

Levinthal's paradox [13] is a very interesting paradox in the field of protein folding. The amino acid molecules in a polypeptide chain have a very large number of possible conformations. If a protein needed to reach its correct three dimensional shape by sequentially sampling all possible conformations (even at a rate of one per picosecond or nanosecond), it would take an extremely long time, far greater than the age of the universe. For example, if we consider a 100-residue protein and assume that each residue can take only 3 positions, there are 3^100 possible conformations. If it takes 10^-12 sec to convert from one structure to another, an exhaustive search would take about 1.6 × 10^26 years! But proteins fold spontaneously within microseconds, milliseconds, or at most minutes. So the paradox is stated as:

“Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”
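The arithmetic behind this estimate can be checked directly from the figures quoted above (3 positions per residue, 100 residues, one conformation sampled per picosecond); the following minimal Python sketch simply reproduces that back-of-the-envelope calculation.

```python
# Back-of-the-envelope check of the Levinthal estimate quoted above.
conformations = 3 ** 100            # 3 positions per residue, 100 residues
seconds = conformations * 1e-12     # one conformation sampled per picosecond
years = seconds / (365.25 * 24 * 3600)
print(f"{conformations:.2e} conformations -> {years:.2e} years")
# Prints roughly 5.15e+47 conformations -> 1.63e+28 years, i.e. many orders
# of magnitude longer than the age of the universe (~1.4e10 years).
```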


Levinthal's paradox poses a great challenge to the problem of protein structure determination. It is obvious from the paradox that proteins do not perform an exhaustive search, but the exact way a protein folds in such a short time is still unknown. All related work in the area of protein structure prediction actually tries to resolve this paradox. Our work in the domain of secondary structure prediction also tries to open some new avenue towards solving it.

2.7 Summary

The problem of protein structure prediction belongs to computational biology, where computational techniques are applied to solve different problems related to biology. To get the full essence of our work, it is necessary to have the relevant knowledge from biochemistry. This chapter briefly introduced that related domain knowledge. Starting from the molecular structure of the amino acids, it went through the different 3D structures of proteins and ended with the well-known paradox in the field of protein folding, which is still a great challenge to solve.


Chapter 3

Literature Survey

3.1 Introduction

The prediction of the secondary structure of a protein from its primary structure has proved to be an extremely difficult problem. Two fundamentally different approaches have been adopted for predicting the three dimensional structures of proteins. One is the ab initio prediction approach. It does not take any help from previously known protein structures; it employs computational methods to minimize a very large function corresponding to the free energy of the molecule, thereby trying to simulate the real folding process. But this approach is limited because the number of possible conformations is exponential. The other approach is known as the knowledge-based approach. In this approach, amino acid sequences with known structures are used as the source of knowledge. This knowledge is extracted and stored in some appropriate manner, and the knowledge base is then used to predict new amino acid sequences, i.e., proteins with known sequence but unknown structure.


Our approach to this problem is also a knowledge based approach.

The knowledge based methods used in the past can be categorized into two groups: statistics based and neural network based. Other work has also been carried out using SVMs, different data-mining tools, graph-based models, etc. Among all of these, the neural network based methods have shown the best performance. Some of the methods from past work are described briefly in the following sections: the best-known neural network based methods are discussed in brief, and the first good statistics based algorithm, by Chou-Fasman, is described in somewhat more detail.

3.2 Chou-Fasman Method

Chou-Fasman [3] defined the first good algorithm for determining the secondary structure. The algorithm depends entirely on the propensity values of the different amino acids. These values are probability-like values. For all 20 types of amino acids, the propensity values were calculated from the 29 (later 64) proteins whose structures were available at that time, by studying all those protein structures. The calculated propensity values for the 20 amino acids are given in Table 3.1. The values shown in bold face in the original table indicate that the corresponding amino acids are strong formers of the corresponding secondary structure.

The propensity values are calculated by studying the database of protein sequences and the corresponding secondary structures. Let R be the amino acid residue whose propensity value is to be calculated. The following algorithm is used to calculate the propensity values.

1. Count the number of occurrences of the residue R in the helical regions over the whole database. Let this value be A.

2. Count the number of all residues in helical regions over the whole database. Let this value be B.

3. Count the number of occurrences of the residue R in the entire database. Let this value be C.


4. Count the number of all residues in the entire database of proteins. Let this value be D.

The propensity value of R for the alpha helical region is:

Palpha(R) = (A / B) / (C / D)
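As an illustration of this calculation, the sketch below computes alpha-helix propensities from a labelled data set. The input format (parallel strings of one-letter residue codes and of secondary structure labels, with 'H' marking helical positions) is an assumption made for the example only.

```python
from collections import Counter

def helix_propensities(sequences, structures):
    """Compute Chou-Fasman-style alpha-helix propensities.

    sequences  : list of amino acid strings (one-letter codes)
    structures : list of matching label strings, 'H' marking helical residues
    Returns a dict: residue -> Palpha = (A/B) / (C/D) as defined above.
    """
    in_helix = Counter()   # A, per residue
    total = Counter()      # C, per residue
    helix_len = 0          # B
    db_len = 0             # D
    for seq, ss in zip(sequences, structures):
        for res, state in zip(seq, ss):
            total[res] += 1
            db_len += 1
            if state == 'H':
                in_helix[res] += 1
                helix_len += 1
    # Assumes the database contains at least one helical residue.
    return {res: (in_helix[res] / helix_len) / (total[res] / db_len)
            for res in total}

# Tiny illustration with made-up data:
# print(helix_propensities(["ACDEFG"], ["HHH---"]))
```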

Table 3.1: Propensity values (Palpha, Pbeta, Pturn) and turn (bend) position frequencies f(i), f(i+1), f(i+2), f(i+3) corresponding to each amino acid residue (3-letter codes).

Residue  Palpha  Pbeta  Pturn  f(i)    f(i+1)  f(i+2)  f(i+3)
Ala      1.29    0.90   0.78   0.060   0.076   0.035   0.058
Cys      1.11    0.74   0.80   0.149   0.050   0.117   0.128
Leu      1.30    1.02   0.59   0.061   0.025   0.036   0.070
Met      1.47    0.97   0.39   0.068   0.082   0.014   0.055
Glu      1.44    0.75   1.00   0.056   0.060   0.077   0.064
Gln      1.27    0.80   0.97   0.074   0.098   0.037   0.098
His      1.22    1.08   0.69   0.102   0.085   0.190   0.152
Lys      1.23    0.77   0.96   0.055   0.115   0.072   0.095
Val      0.91    1.49   0.47   0.062   0.048   0.028   0.053
Ile      0.97    1.45   0.51   0.043   0.034   0.013   0.056
Phe      1.07    1.32   0.58   0.059   0.041   0.065   0.065
Tyr      0.72    1.25   1.05   0.082   0.065   0.114   0.125
Trp      0.99    1.14   0.75   0.077   0.013   0.064   0.167
Thr      0.82    1.21   1.03   0.086   0.108   0.065   0.079
Gly      0.56    0.92   1.64   0.102   0.085   0.190   0.152
Ser      0.82    0.95   1.33   0.120   0.139   0.125   0.106
Asp      1.04    0.72   1.41   0.147   0.110   0.179   0.081
Asn      0.90    0.76   1.28   0.161   0.083   0.191   0.091
Pro      0.52    0.64   1.91   0.102   0.301   0.034   0.068
Arg      0.96    0.99   0.88   0.070   0.106   0.099   0.085

These parameters are used in the main algorithm for prediction. The main algorithm is based on a set of rules which are presented below in brief.

1. Prediction of Helices: A cluster of four helix former residues (those whose values are shown in bold face in the Palpha column of the table) out of six consecutive residues along the protein sequence initiates a helix (a sketch of this nucleation-and-extension procedure is given after the list of rules). The helical segment is extended in both directions until a set of tetra-peptide breakers (residues with Palpha < 1.00) is reached. Any segment that is at least six residues long with average Palpha > 1.03 and average Palpha > average Pbeta is predicted as helix. The above process is repeated to find all the helical regions in the sequence.


2. Prediction of Beta Sheets: A cluster of three beta sheet former residues (those whose values are shown in bold face in the Pbeta column of the table) out of five consecutive residues along the sequence initiates a beta sheet. The beta sheet is propagated in both directions until terminated by a set of tetra-peptide breakers (Pbeta < 1.00). Any segment with average Pbeta > 1.05 as well as average Pbeta > average Palpha is predicted as beta sheet.

3. Prediction of Turns: The probability of a bend at residue i is calculated from the formula p(t) = f(i) × f(i+1) × f(i+2) × f(i+3). Tetrapeptides with p(t) > 0.75 × 10^-4, average Pturn > 1.00, and Pturn greater than both Palpha and Pbeta, are predicted as turn. Here f(i), f(i+1), f(i+2) and f(i+3) are the frequencies of occurrence of the residues at the first, second, third and fourth positions of a turn.

4. In general, overlapping regions are compared on their calculated average values. If the average Palpha is greater than the average Pbeta, the region is assigned as helical; if the situation is reversed, the region is assigned as beta sheet. For turn prediction it is also checked whether the average Pturn is greater or less than the other two averages; if the average Pturn is less than the other two averages, the region is not predicted as turn.
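The nucleation-and-extension procedure of rule 1 can be sketched as follows. The window sizes and average-propensity thresholds follow the rules above, but two simplifications are assumed: a 'helix former' is approximated by Palpha >= 1.00 (the original method uses the former/breaker classes behind the bold entries of Table 3.1), and only forward extension is shown.

```python
def predict_helices(seq, p_alpha, p_beta):
    """Sketch of Chou-Fasman rule 1: helix nucleation and forward extension.

    seq     : protein sequence as a string of one-letter codes
    p_alpha : dict mapping residue -> Palpha propensity (Table 3.1)
    p_beta  : dict mapping residue -> Pbeta propensity  (Table 3.1)
    Returns a list of (start, end) half-open index ranges predicted as helix.
    """
    n = len(seq)
    helices = []
    i = 0
    while i + 6 <= n:
        window = seq[i:i + 6]
        # Nucleation: at least 4 of 6 consecutive residues are helix formers
        # (approximated here as Palpha >= 1.00).
        if sum(p_alpha.get(r, 0.0) >= 1.00 for r in window) >= 4:
            start, end = i, i + 6
            # Extend while the next tetrapeptide is not a breaker set
            # (average Palpha < 1.00 terminates the segment).
            while end + 4 <= n and \
                    sum(p_alpha.get(r, 0.0) for r in seq[end:end + 4]) / 4 >= 1.00:
                end += 1
            avg_a = sum(p_alpha.get(r, 0.0) for r in seq[start:end]) / (end - start)
            avg_b = sum(p_beta.get(r, 0.0) for r in seq[start:end]) / (end - start)
            # Accept the segment only if it passes the average-propensity test.
            if end - start >= 6 and avg_a > 1.03 and avg_a > avg_b:
                helices.append((start, end))
            i = end
        else:
            i += 1
    return helices
```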

The three-state (Q3) accuracy of this prediction method was originally reported as 70-80%, but it was later revealed that the actual accuracy is about 55%. It is possible that the accuracy of the Chou-Fasman algorithm suffered because of the small size and limited composition of the protein database from which the conformational propensities were calculated, so recalculating the parameters from the current set of proteins may yield some additional prediction accuracy.

3.3 GOR Method

GOR (Garnier, Osguthorpe and Robson) [6] is another secondary structure prediction method, based on statistical analysis of the available database of proteins with known structures. There are five versions of GOR (GOR I – GOR V). The method mainly employs information theory. It takes a window of size 17 and calculates the probability of the middle residue being in a particular secondary structure, given the whole sequence around it (i.e., the 8 residues on each side).


Finally, it takes the maximum value over the three possible structures and predicts the corresponding structure. The earliest version of GOR has a mean accuracy of 64.4% for a three state prediction. The most improved version of GOR is GOR V [7].

It shows an average Q3 accuracy of 73.5%.
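The window-and-maximum structure of the GOR prediction can be sketched as follows; the lookup table `info` of per-offset, per-residue scores is a hypothetical placeholder standing in for the information values that GOR derives from the database, not the actual GOR parameters.

```python
def gor_predict(seq, info, half_window=8):
    """Sketch of GOR-style prediction over a 17-residue window.

    seq  : protein sequence (one-letter codes)
    info : hypothetical table, info[state][offset][residue] -> score, where
           offsets run from -8 to +8 (placeholder for the GOR statistics).
    For every position, the scores of the three states (helix 'H', sheet 'E',
    turn 'T') are accumulated over the window and the maximum is predicted.
    """
    states = ('H', 'E', 'T')
    prediction = []
    for i in range(len(seq)):
        scores = dict.fromkeys(states, 0.0)
        for offset in range(-half_window, half_window + 1):
            j = i + offset
            if 0 <= j < len(seq):
                for s in states:
                    scores[s] += info[s][offset].get(seq[j], 0.0)
        prediction.append(max(states, key=scores.get))
    return ''.join(prediction)
```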

Originally GOR used only information theory based calculations. It considers the effect of the residues from position i−8 to i+8 on the secondary structure of the i-th residue, in a window of size 17. Later, prediction based on residue pairs instead of single residues was tried, and it proved to be better. GOR V also takes the help of the evolutionary information of proteins and gives the highest accuracy of all the earlier versions. The algorithms applied in GOR are relatively simple and have low computational resource requirements.

3.4 PhD Method

The PhD method [4] uses back-propagation based artificial neural networks for predicting the protein secondary structure. But this method does not use the available protein structure data directly for training: the data is first preprocessed to generate a sequence profile, which is then used for training. The method takes the help of the multiple sequence alignment technique to generate the sequence profile, and it also uses a jury system.

The full architecture of this method consists of three levels. The main building block of the first level is a back propagation based artificial neural network with one input layer, one hidden layer and one output layer. The network is trained with a sequence profile: a window of 13 residues is selected from the input sequence, the sequence profile is prepared over that window, and that profile is used to train the neural network. Each node in the input layer receives a number which is the frequency of a particular amino acid residue (out of the 20 amino acid residues). The output of this neural network is three probability-like values representing the chance that the middle residue (that is, the 7th residue) has the secondary structure helix, sheet or turn.


The next level is also a back propagation based artificial neural network. It takes the three values calculated by the previous level for each of the 17 positions of the window selected for prediction. The output of this level is again three probability-like values representing the chance of the middle residue of this window being in one of the three different secondary structures.

The last level is a jury system that takes the average of the outputs of several networks. Each of the networks has the same architecture as described above, but different training parameters, a different order of training, etc. The secondary structure having the highest average value is predicted as the output of the method. The jury takes the arithmetic average, and the random noise created by the individual artificial neural networks is reduced by this averaging.
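The jury stage amounts to an arithmetic average of the per-class outputs of the individual networks followed by a maximum selection; a minimal sketch, with three-element vectors standing for the (helix, sheet, turn) scores described above:

```python
def jury_decision(network_outputs):
    """Average the (helix, sheet, turn) output vectors of several networks
    and return the index of the winning class, as in the jury stage above."""
    n_nets = len(network_outputs)
    summed = [0.0, 0.0, 0.0]
    for out in network_outputs:            # out is a 3-element sequence
        for k in range(3):
            summed[k] += out[k]
    avg = [s / n_nets for s in summed]
    return max(range(3), key=lambda k: avg[k])   # 0=helix, 1=sheet, 2=turn

# Example: jury_decision([(0.7, 0.2, 0.1), (0.6, 0.3, 0.1)]) -> 0 (helix)
```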

Finally, it removes obvious errors by eliminating single or double predictions that stand alone in a long stretch of another class. The overall accuracy achieved by this method was reported as 69.7%. It is a successful application of neural networks and sequence profiling in the field of protein secondary structure prediction.

3.5 PSI-Pred Method

PSI-Pred [16, 17] is also a neural network based protein secondary structure prediction method. Like the PHD method, it uses two stages of neural networks, and its architecture is more or less the same as that of the PHD method. It also uses a sequence profile to train the first neural network, but the technique is not as time consuming as the PHD method because it uses a different technique for generating the sequence profile: it employs PSI-BLAST [18]. The PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) program is used to search for the relatives of a protein in an incremental way. Initially a small set of related proteins is created; this list is used to generate a sequence profile, which is thereafter used to search the database of proteins again. This generates another list of related proteins, from which another sequence profile is generated, and the process repeats. In this way the method creates a list of distant relatives of the initial protein given for the search. PSI-BLAST is much more sensitive in finding evolutionary information than any other type of BLAST.


The prediction system is composed of three stages. They are: generation of a sequence profile, prediction of initial secondary structure, and finally filtering the

predicted structure. On average the accuracy (Q3) achieved by this method was reported to be 77%.

3.6 JPred Method

JPred [8] is a prediction method which combines the prediction outputs from six existing prediction techniques: DSC [9], PHD [4], NNSSP [10], PREDATOR [11], ZPRED [12] and MULPRED [13]. When a protein sequence is given as input to the JPred server, it first generates the corresponding sequence profile and uses it as input to all six methods. The outputs from all the methods are then combined by a certain algorithm to generate the final output. The advantage of the JPred server is that it does not rely on any particular technique.

A variety of techniques are employed by these components. NNSSP is based on a nearest neighbor method. PHD relies on jury-decision neural networks (back propagation based). DSC (Discrimination of protein Secondary structure Class) is a statistics based method; it uses the GOR residue attributes, applies the hydrophobicity information of the amino acids, and also uses the multiple sequence alignment technique for evolutionary information. MULPRED is a consensus based single sequence method, itself a combination of five different methods: the Lim [19], GOR, Chou-Fasman, Rose [20] and Wilmot & Thornton [21] turn prediction methods. ZPRED is a conservation-number-weighted prediction method. JPred also incorporates PREDATOR, a secondary structure prediction method based on the recognition of potentially hydrogen-bonded residues in a single amino acid sequence. Therefore, JPred uses almost all types of existing techniques, exploiting sequence profiles or single sequences in different ways, to generate its prediction output. The overall accuracy (Q3) was reported to be 72.9%.


3.7 Summary

In this chapter, different approaches to solving the problem of protein secondary structure prediction have been discussed. The Chou-Fasman algorithm was described in more detail because we have taken the help of this algorithm in our work. The other techniques discussed in the chapter use different methodologies such as nearest neighbor techniques, consensus based techniques, neural network based techniques, etc. Apart from these, support vector machines, several other data mining tools and many graph based models have also been employed in this area, so the range of methodologies applied is very wide. A general conclusion which can be drawn from the survey of past works is that the systems with good prediction accuracy are not based on a single algorithm; there is more than one component in those systems. A significant amount of improvement is achieved by using sequence profiles generated from sequence alignments of various proteins, which is a very good approach for taking advantage of the evolutionary information in proteins. It can also be seen from this survey that the neural network based methods perform better than many of the statistics based methods. Influenced by this observation, we too have chosen a new kind of neural network based classifier as the basic unit in our work. The next chapter describes the classifier used in detail.


Chapter 4

Neural Network Architecture

4.1 Introduction

From the review of past works in the field of protein secondary structure prediction, it can easily be seen that, in comparison with all other techniques applied as knowledge based approaches, the neural network based approaches achieve better prediction accuracy. The neural networks applied are basically variations of Multi Layer Perceptrons, Recurrent Neural Networks or Hierarchical Neural Networks. In our work, we have applied one new type of neural network to this area. It is basically a classifier named the Fuzzy min-max neural network classifier with compensatory neurons (FMCN) [2]. This neural network had already been applied to the Ionosphere, Thyroid and Wine data sets available from standard sources, and it showed a very good performance in all these classification tasks. There are some small problems in the basic training algorithm of FMCN; the identified problems were removed as a part of our work, and this improved FMCN is used as the basic unit of computation in our work. To improve the efficiency we have also used the Chou–Fasman parameters.

The fuzzy logic concepts were successfully combined with the concepts of a neural network classifier in the Fuzzy Min-Max Neural Network (FMNN) [1], but this architecture had some basic drawbacks in its learning algorithm. In section 4.2 the basics of FMNN and its drawbacks are described in brief. In section 4.3, the architecture of FMCN and its training and recall processes are described in detail, and it is also discussed how this architecture fully resolves the problems found in FMNN. The identified deficiencies and the revised algorithms for training and recall are presented in section 4.4.

4.2 FMNN

In this section a brief idea of FMNN and the drawbacks of its learning algorithm is presented.

4.2.1 Fuzziness of the Neural Network

A subset A within the universe of discourse X can be called a fuzzy set if it has a partial membership function that tells the degree to which any object belongs to that set. So every fuzzy set can be defined as an ordered pair {x, μA(x)}, where μA is the membership function and x is any object that belongs to the set with the degree μA(x); μA(x) is a value between 0 and 1. In the neural network architectures described in this section and the later sections, the knowledge is stored as the boundaries of hyperboxes in the n-dimensional space. Each hyperbox is treated as a fuzzy set. The hyperbox Bj is defined as a fuzzy set as follows:

Bj = { Ah, Vj, Wj, bj(Ah, Vj, Wj) }        (4.1)

Here, Ah is a point in the n-dimensional space, Vj is the min-point and Wj is the max-point of the hyperbox Bj in that n-dimensional space. bj(Ah, Vj, Wj) is the corresponding fuzzy-membership function of the hyperbox Bj; it tells the degree to which the point Ah belongs to the hyperbox Bj. The activation functions in the neural networks actually play the role of the fuzzy membership functions. Fig 4.1 shows a hyperbox in 3-dimensional space.


Fig 4.1: One hyperbox Bj in the 3-dimensional pattern space, with min point Vj and max point Wj.

4.2.2 Pattern Classification

Classification of an n-dimensional pattern is generally done by passing the pattern through a set of discriminant or characteristic functions; the characteristic function with the largest value is then chosen for generating the output. Fig 4.2 pictorially describes the basic architecture of a general pattern classification system using discriminant or characteristic functions. Different methods exist for implementing the discriminant functions, for example probability density functions or nearest neighbor methods. Neural networks are also used in this field; for example, back propagation based neural networks try to minimize a cost function and create a nonlinear decision boundary. The fuzzy min-max neural network approaches the classification problem in a different way. While learning, it generates subsets of the total pattern space and thereby builds the decision boundaries. A pair of points, a min-point and a max-point, defines one hyperbox in the pattern space. The membership function of the hyperbox defines a fuzzy subset of the n-dimensional pattern space, and a set of hyperboxes defines a particular class. During the training procedure, many hyperboxes of different classes are created in the n-dimensional space.


A particular class is defined as the union of the fuzzy hyperbox sets. For example, the pattern class Ck is defined as follows:

Ck = ∪ (j ∈ K) Bj        (4.2)

Here K is the set of indices of those hyperboxes which are associated with the class Ck, and bj is the fuzzy membership function of the j-th hyperbox Bj. At the time of recall, the maximum selector determines which class's hyperbox exhibits the highest closeness to the given test point, and the corresponding class is predicted.

Fig 4.2: Schematic architecture of a pattern classification system. The input pattern (feature vector) Ah = (a_h1, a_h2, ..., a_hn) is passed through c discriminant functions (membership functions), one per pattern class; the maximum selector (decision function) picks the highest output value, which decides the pattern class of the input.


4.2.3 Learning Process of FMNN

Fuzzy Min-Max neural networks are trained with sample data points in the hyperspace. A training sample consists of a point in the hyperspace (an N-dimensional space, where N depends on the user's choice or on the application area) together with its associated class, meaning that the point belongs to that class. Similarly, for testing the neural network, test samples are used which consist of only a single point of the same dimension; the task of the neural network is to predict the class with which the point is associated. In the learning phase of FMNN, the samples are taken one by one and the network tries to accommodate each of them in a previously created hyperbox. If the sample does not fall into the region of any existing hyperbox of the same class, the network tries to expand the existing hyperbox of that class which is closest to the sample point in the hyperspace. The expansion is possible only if the total size of the hyperbox after expansion does not cross a certain predefined limit; if it does not, the expansion occurs. As a natural consequence, after expansion the expanded hyperbox may overlap with other previously existing hyperboxes of the same class or of a different class. For this reason an overlap test is carried out to check which hyperboxes of different classes have been overlapped due to the expansion. The result of this test may show that two or more hyperboxes of different classes have overlapped.
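The expansion step can be sketched as below. The size test used here — the summed extent of the expanded box must not exceed n·θ for a user-chosen θ — is the criterion commonly used in FMNN-style networks and is an assumption about the exact form of the 'predefined limit' mentioned above.

```python
def try_expand(v, w, a, theta):
    """Attempt to expand a hyperbox (min point v, max point w) to include
    the sample point a.  Expansion is allowed only if the expanded box
    satisfies  sum_i( max(w_i, a_i) - min(v_i, a_i) ) <= n * theta
    (the usual FMNN-style size criterion; assumed here).  Returns True and
    updates v and w in place when the expansion succeeds."""
    n = len(a)
    new_v = [min(v[i], a[i]) for i in range(n)]
    new_w = [max(w[i], a[i]) for i in range(n)]
    if sum(new_w[i] - new_v[i] for i in range(n)) <= n * theta:
        v[:], w[:] = new_v, new_w
        return True
    return False
```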

Now, if any test sample falls within this overlapped region, there will be considerable confusion about the class of the test sample, as the point will lie in all the hyperboxes (of different classes) that overlap in that region. To avoid this type of confusion during the recall process, 'contraction of hyperboxes' is performed after the overlap test. Fig 4.3 describes the process of contraction; the hyperboxes are drawn as two-dimensional boxes.

4.2.4 Drawbacks

Any type of neural network incorporates some way of storing the information or knowledge extracted from the sample training data set. The FMNN does this by preparing the hyperboxes; the min-max points of the hyperboxes store the actual extracted information.


But it can easily be noticed that the min-max points are modified during the process of contraction of the hyperboxes. This creates classification errors during the recall process, as the system then relies on new min-max points which have a weaker relationship with the training information. The classification error produced is illustrated in Fig 4.4.

Fig 4.4 shows two overlapped hyperboxes of different classes. Hyperbox 1 of class 1 has min point B and max point D; hyperbox 2 of class 2 has min point A and max point C. After the contraction process the hyperboxes are isolated, but they have changed their min-max points. As a result, point B, which was previously classified as class 1, will now be classified as class 2.

Fig 4.3: The process of removing the overlaps between the hyperboxes in the FMNN learning algorithm. Before contraction, hyperbox 1 of class 1 is overlapped with hyperbox 2 of class 2; after contraction, the hyperboxes are isolated from each other.


Fig 4.4: Removing the overlaps between the hyperboxes in the FMNN learning algorithm creates a classification problem. Hyperbox 1 of class 1 (min point B, max point D) is overlapped with hyperbox 2 of class 2 (min point A, max point C); after contraction the hyperboxes are isolated, but point C, previously classified as class 2, is now classified as class 1, and similarly point B changes class.

Similarly, point C, which was previously classified as class 2, will now be classified as class 1. This is a case of a simple overlap between two hyperboxes of different classes; if one hyperbox is partially or fully contained in another hyperbox of a different class, the scenario after the contraction process becomes even worse. A lot of information loss occurs, which necessarily results in misclassification.


4.3 FMCN

In the following two subsections, details of the FMCN architecture are described.

4.3.1 Basic Architecture

Three types of neurons are used in this architecture. They all get the inputs from the same input layer. Fig 4.5 describes the working principle of a single classifying neuron (CLN). Fig 4.6 describes the architecture of the FMCN classifier. Each CLN actually represents one hyperbox in the n-dimensional space. One CLN gets the input from the input layer. Each input layer node is connected to the CLN

(say Bj) by two weights: one is the i-th coordinate of the max-point (wji) and the other is the i-th coordinate of the min-point (vji) of the hyperbox represented by that CLN. The activation function of the CLN is given in equation 4.3.

Fig 4.5: Representation of a hyperbox as a neuron with the input values (a_h1, ..., a_hn), the edge weights (Vj, Wj), and the activation function bj.

bj(Ah, Vj, Wj) = min over i = 1..n of [ min( 1 − f(a_hi − w_ji, γ), 1 − f(v_ji − a_hi, γ) ) ]        (4.3)

where Ah is the input point in the n-dimensional space, that is, Ah is the sequence (a_h1, a_h2, ..., a_hn), and Vj and Wj are the min and max points of the hyperbox represented by the CLN Bj (Vj and Wj are the sequences (v_j1, v_j2, ..., v_jn) and (w_j1, w_j2, ..., w_jn) respectively).


f(x, γ) is the two-parameter ramp threshold function given by

f(x, γ) = 1,    if xγ > 1
f(x, γ) = xγ,   if 0 ≤ xγ ≤ 1        (4.4)
f(x, γ) = 0,    if xγ < 0

γ is a fuzziness control parameter .

The calculated value bj actually gives the fuzzy membership value of the sample point Ah in the hyperbox Bj. It is a value between 0 and 1 indicating the closeness of the sample point to the existing hyperbox Bj. If a point in the n-dimensional space falls within the hyperbox, bj gives the value 1.0; otherwise the value is less than 1, and it depends on the distance of the point from the boundary of the hyperbox. Each hyperbox is associated with a particular class. Which hyperbox is associated with which class – this information is stored in a matrix U.
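The CLN activation of equations 4.3 and 4.4 can be written directly in code; a minimal sketch, assuming input points scaled to the unit hypercube as is usual for fuzzy min-max networks:

```python
def ramp(x, gamma):
    """Two-parameter ramp threshold function f(x, gamma) of equation 4.4."""
    xg = x * gamma
    if xg > 1.0:
        return 1.0
    if xg < 0.0:
        return 0.0
    return xg

def cln_membership(a, v, w, gamma=1.0):
    """Fuzzy membership b_j of point a in the hyperbox with min point v and
    max point w (equation 4.3): 1.0 inside the box, decreasing with the
    distance of a from the box boundary outside it."""
    return min(
        min(1.0 - ramp(a_i - w_i, gamma), 1.0 - ramp(v_i - a_i, gamma))
        for a_i, v_i, w_i in zip(a, v, w)
    )

# Example: a point inside the box gets 1.0, a point outside gets < 1.0.
# print(cln_membership([0.5, 0.5], [0.2, 0.2], [0.8, 0.8]))  # -> 1.0
```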

Similarly, each of the overlap compensatory neurons (OCN) and containment compensatory neurons (CCN) is also connected to all the input layer nodes with the same kind of edge weights as the CLNs, but they have different kinds of activation functions. The activation function for the OCN is given in equation 4.5.

d_j^p(Ah, Vj, Wj) = U( bj(Ah, Vj, Wj) − 1 ) × ( −1 + bj^p(Ah, Vj, Wj) ),   p = 1, 2        (4.5)

Each OCN represents the partially overlapped (or partially contained) region between two hyperboxes of different classes in the n-dimensional space, so it also has its own min-max points Vj and Wj. bj(Ah, Vj, Wj) is the same membership function as for the CLNs, but its parameters refer to the overlapped region. In the above function, p denotes the two classes (actually the two hyperboxes of different classes between which the overlap has occurred), n is the number of dimensions, and U(x) is a unit step function, so U(x) gives zero for x < 0. Only if a point in the n-dimensional space falls within an overlapped region does the OCN become active; otherwise d_j gives zero. When active, d_j gives two compensation values, one for each of the two overlapping classes.


Fig 4.6: The main architecture of the fuzzy min-max neural network classifier with compensatory neurons. A = (a1, a2, a3, ..., an) is the input vector used to train or test the network. b1, b2, ..., bj are the main classifying neurons responsible for the basic hyperboxes; d1, d2, ..., dp are the overlap compensatory neurons and e1, e2, ..., ei are the containment compensatory neurons. C1i, C2i and C3i are the nodes denoting the three classes, one of which is to be predicted; they act as intermediate storage for the calculated values. I1, I2 and I3 are intermediate nodes that compute the minimum value over the two types of compensation values. U, Y and Z are the matrices that keep track of the connections, i.e. which neurons are connected to which class node. Finally, the maximum value is chosen from the three class membership grade values.


The matrix Y stores which OCN is associated with which two classes (that is, which OCN represents the overlapped region between which two classes).

The activation function of the CCNs is

e_j(A_h) = -1 \times U\big( b_j(A_h, V_j, W_j) - 1 \big)        (4.6)

Each CCN represents a hyperbox region in the n-dimensional space, belonging to a particular class, which is fully contained inside some other hyperbox of a different class. For a CCN, two pieces of information could be relevant: which is the containing hyperbox and which is the contained one. The latter is not needed, because the contained hyperbox is given priority and its membership value is not to be reduced. The former is needed, because the effect of the containing hyperbox over the contained region has to be nullified.

Here, b_j(A_h, V_j, W_j) is the same function as in the CLNs and OCNs, but its parameters are the min-max points of the region represented by the CCN, and U is the same unit step function as defined for the previous activation function. For each CCN, this activation function gives only one value, and it is -1, applied to the containing hyperbox, whenever the test point falls inside the contained region. Which CCN is associated with which hyperbox - this information is stored in the matrix Z. Each type of neuron calculates its own activation value, and all the values arrive at the hidden-layer class nodes C_{1i}, C_{2i}, C_{3i} (as in our application area the number of classes is 3). For each class, the minimum compensation value between the OCN and CCN activations is computed in the intermediate layer I_1, I_2, I_3. Finally, for each class the membership value and the corresponding compensation are added, and the maximum value among the classes is chosen as the output of the whole neural network.

4.3.2 Algorithms of FMCN

The following two subsections precisely describe both the learning algorithm and the recall algorithm of FMCN. They also explain how FMCN removes drawbacks found in FMNN with the help of the compensatory neurons.


Fig 4.7: The hyperboxes and their inter-class and intra-class overlaps. Two-dimensional space is considered as the pattern space, and three subsets of the pattern space are selected as three classes with several hyperboxes in each class. The figure marks: (i) full containment, where a smaller hyperbox of one class lies entirely inside a larger hyperbox of another class, handled by the containment compensatory neurons (CCN); (ii) partial overlap between two hyperboxes of different classes, handled by the overlap compensatory neurons (OCN); (iii) overlaps between hyperboxes of the same class, which cause no harm; and (iv) the nonlinear decision boundaries between the pattern classes.


4.3.2.1 Learning Algorithm

The basic working principles of the FMNN and the FMCN are the same. Both take one training sample (a point in the n-dimensional space associated with a class) and try to accommodate it in an existing hyperbox of the same class. If that is not possible, they try to expand an existing hyperbox of the same class; if that too is not possible, they create a new hyperbox. The difference comes after this step. After expansion, the hyperbox may overlap with some hyperbox of a different class. In the FMNN, the ambiguity was removed by the process of 'contraction of hyperboxes', but contraction changed the min-max points of the hyperboxes and therefore caused information loss. In the FMCN, instead of contraction, new sets of neurons are created which remember the overlapped regions. When activated, they give compensation values which adjust the original membership values to cope with the overlaps. These new neurons are called compensatory neurons. There are two types: overlap compensatory neurons (OCN) and containment compensatory neurons (CCN), whose activation functions were discussed in the previous section. The outline of the learning algorithm is given below. During training of the neural network, a sample is presented in the form of an ordered pair (A_h, C_h), where C_h is the class label of the point A_h. In the algorithm, the sample point refers to A_h and its associated class refers to C_h.

Begin Training
for each training sample {
    if training is done for the first time {
        Create a hyperbox for the given sample point with the associated class,
        and set the corresponding min-max points of the hyperbox.
    }
    else {
        Find a hyperbox, if any, that can include the sample point presented for training.
        if such a hyperbox of the same class is found {
            Do nothing and go to the next sample.
        }
        else {
            Find the hyperbox closest to the sample point and test an
            expandability function for that hyperbox.
            if the expandability function returns true and the expanded hyperbox
            would not overlap any previous hyperbox of a separate class {
                Begin expansion
                for each dimension {
                    Adjust the coordinate values of the min-max points of the hyperbox.
                }
                End expansion
            }
            else {
                Create a new hyperbox for the training sample point having the
                same min and max points as the sample point.
            }
        }
        Begin Adjustment
        if expansion has occurred {
            for each pair of hyperboxes of different classes {
                Perform Isolation_Test
                if the isolation test gives a negative result {
                    Perform Containment_Test
                    if the containment test result is positive {
                        Create_CCN
                    }
                    Create_OCN
                }
            }
        }
        End Adjustment
    } /* end else - this is not the first sample */
} /* end for each training sample */
End Training

The algorithms for creating the compensatory neurons and for the isolation and containment tests are given in Appendix B. In these algorithms we can see that, unlike in the earlier FMNN, there is no information loss: to remove the ambiguity in the overlapped regions, instead of contracting the hyperboxes, the overlapped regions are remembered with the help of the additional sets of compensatory neurons. In the recall procedure, these compensatory neurons give the corresponding compensation values, which are used to arrive at the final prediction. Fig 4.7 pictorially describes the knowledge clusters formed in a two-dimensional space after the learning process of the FMCN: different hyperboxes and nonlinear decision boundaries are created, and the possible cases of overlapping hyperboxes are also shown in the diagram.
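The expandability function referred to in the pseudocode is not spelled out in this chapter (it is deferred to the appendix). In the fuzzy min-max literature it is usually a maximum-hyperbox-size criterion controlled by a parameter often written theta, which is the parameter whose tested values 0.11 and 0.12 are reported in Section 5.4. The following Python sketch shows that commonly used test and the expansion step, under the assumption that this is the criterion used here; it is an illustration, not the thesis code.

    import numpy as np

    def can_expand(v, w, a, theta):
        """Assumed fuzzy min-max expandability test: the hyperbox (v, w) may grow
        to include point a only if every resulting edge stays within the maximum
        hyperbox size theta."""
        v, w, a = np.asarray(v), np.asarray(w), np.asarray(a)
        new_v = np.minimum(v, a)
        new_w = np.maximum(w, a)
        return bool(np.all(new_w - new_v <= theta))

    def expand(v, w, a):
        """Expansion step of the learning algorithm: stretch the min and max
        points just enough to cover the new sample point a."""
        a = np.asarray(a)
        return np.minimum(v, a), np.maximum(w, a)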

4.3.2.2 Recall Algorithm

The recall algorithm is simple, and the basic idea has already been described. When a test sample A = (a_1, a_2, ..., a_n) (each a_i being the i-th coordinate value of the test sample point, which is a point in the n-dimensional space) is presented at the input layer of the FMCN, the network produces the predicted class after all the necessary calculations.

The CLN section produces its membership values for each class using the activation function described in subsection 4.3.1 [equation 4.3]. For each class there will be a number of hyperboxes, and the test sample point will be closest to one of the hyperboxes of each class, so for each class the maximum membership value is chosen as the output of this section. The OCN section likewise produces its compensation values (if any) for each class using its activation function [subsection 4.3.1].

For each class there will be a number of OCNs attached to it. Each OCN holds two compensation values, one for each of its two associated classes (the two hyperboxes of different classes). If a test sample falls within an overlapped region, the compensation is calculated for both classes associated with that region; the compensation values of all other classes are not affected. The CCN section also produces its compensation values for each class with the help of its activation function [subsection 4.3.1]. Here, too, the minimum value over multiple values for the same class is taken. For each class, the minimum of the compensation values is chosen and added to the membership value computed above, and finally the class with the maximum resulting value is predicted. The algorithm is presented below.

Begin Recall
Step 1: For each class, compute the maximum membership value of the test point.
        Let the value produced by the CLN section for the i-th class be b_i.
Step 2: For each class, compute the minimum overlap compensation value (all the
        compensation values are negative). Let the overlap compensation
        contributed for class i be d_i.
Step 3: For each class, compute the minimum containment compensation value (it
        is negative). Let the containment compensation value for class i be e_i.
Step 4: The final value for each class i is calculated as
        F_i = b_i + min(d_i, e_i).
Step 5: The above value is calculated for each class, the maximum value is
        chosen and the corresponding class is predicted.
End Recall
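A compact sketch of the recall combination described above (maximum CLN membership per class, minimum of the two kinds of compensation, and a final maximum over classes) is given below. The function name and the data layout, dictionaries of per-neuron values grouped by class, are illustrative assumptions rather than the thesis implementation.

    def recall(cln_by_class, ocn_by_class, ccn_by_class):
        """Combine CLN memberships with OCN/CCN compensations and pick a class.

        cln_by_class[c]: memberships of the test point in all class-c hyperboxes
        ocn_by_class[c]: overlap compensation values (<= 0) contributed to class c
        ccn_by_class[c]: containment compensation values (<= 0) for class c
        """
        scores = {}
        for c in cln_by_class:
            b = max(cln_by_class[c])                       # Step 1: max membership
            d = min(ocn_by_class.get(c, [0.0]) or [0.0])   # Step 2: min overlap comp.
            e = min(ccn_by_class.get(c, [0.0]) or [0.0])   # Step 3: min containment comp.
            scores[c] = b + min(d, e)                      # Step 4: final value
        return max(scores, key=scores.get)                 # Step 5: predicted class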

4.4 Improvements On FMCN

As a part of our work, we have identified some drawbacks of FMCN. The modified algorithms for training and recall are described in this section, and we have applied them for learning and recall. Some of the drawbacks are only identified, not rectified; these are left as future work.

(i) In the algorithm described above, an overlapped region is remembered as another hyperbox with its own set of min-max points. When a sample falls within the overlapped region, its distance from the borders of the two overlapping hyperboxes is measured, and the compensation values are derived from these distances: the larger the distance, the smaller the chance of belonging to that class, and vice versa. There are, however, situations in which this produces wrong compensation values.

One such situation is depicted in Fig 4.8. Suppose the points (0.28, 0.26) and (0.32, 0.21), both of class 1, have already been presented as training samples, and the point (0.32, 0.24) then arrives as a test sample. Since the test point lies close to the border of the class-1 hyperbox, it will be predicted as class 2. But the test point is closer to the two class-1 training points than to the border of the class-1 hyperbox, so it should be classified as class 1.

In the modified version of the FMCN, one extra step, associated with the overlap compensatory neurons, has been added to rectify this problem: the class of the majority of training points lying near the test point is given priority. The basic training algorithm is repeated below, with the portion added to rectify the problem marked as the added step. To recover from the problem, the distribution of points of the different classes inside every overlapped region is remembered in an appropriate data structure. At recall time, the Euclidean distances between the remembered points and the test point are calculated and normalized to form a weight value, and these values are used to further compensate the final value. The details of the steps are given in the modified algorithm.

Figure 4.8: A simple example showing one drawback of FMCN. A class-1 hyperbox and a class-2 hyperbox overlap; two training samples of class 1 lie at (0.28, 0.26) and (0.32, 0.21), and the test sample (0.32, 0.24) falls between them.

Begin Training
for each training sample {
    if training is done for the first time {
        Create a hyperbox for the given sample point with the associated class,
        and set the corresponding min-max points of the hyperbox.
    }
    else {
        Find a hyperbox, if any, that can include the sample point presented for training.
        if such a hyperbox of the same class is found {
            Do nothing and go to the next sample.
        }
        else if the sample point falls within a partially overlapped region, i.e.
        an overlap compensatory neuron is found whose coordinates fully
        encompass the sample point {                      /* added step */
            Add this new information to the data structure associated with that
            overlap compensatory neuron. The information is stored as a pair,
            i.e. the sample point coordinates and the corresponding class of the
            point; it will be used in the recall procedure.
        }
        else {
            Find the hyperbox closest to the sample point and test an
            expandability function for that hyperbox.
            if the expandability function returns true and the expanded hyperbox
            would not overlap any previous hyperbox of a separate class {
                Begin expansion
                for each dimension {
                    Adjust the coordinate values of the min-max points of the hyperbox.
                }
                End expansion
            }
            else {
                Create a new hyperbox for the training sample point having the
                same min and max points as the sample point.
            }
        }
        if expansion has occurred {        /* Adjustment of hyperboxes */
            for each pair of hyperboxes of different classes {
                Perform Isolation_Test
                if the isolation test gives a negative result {
                    Perform Containment_Test
                    if the containment test result is positive {
                        Create_CCN
                    }
                    Create_OCN
                }
            }
        } /* end of adjustment of the hyperboxes */
    } /* end else - this is not the first sample */
} /* end for each training sample */
End Training

In the improved algorithm, the organization of the neurons remains the same; only the OCNs keep some additional data about the training points falling inside the overlapped regions. For this reason there is also a minor change in the recall procedure: one new factor has to be calculated at recall time, which reflects, for each class, how likely that class is to contain the test point.

The CLN section produces its membership values for each class using the activation function described in subsection 4.3.1. For each class there will be a number of hyperboxes, and the test sample point will be closest to one of the hyperboxes of each class, so for each class the maximum membership value is chosen as the output of this section. The OCN section produces its compensation values for each class using its activation function.

For each class there will be a number of OCNs attached to it. Each OCN, as in the earlier algorithm, creates two compensation values for its two associated classes (the two hyperboxes of different classes). In the previous algorithm, when a test sample fell within an overlapped region we calculated the compensation only for the two classes associated with that region; in the modified algorithm we define a compensation value for all the classes, with the irrelevant classes simply receiving no compensation. At training time, the points of the different classes falling inside each partially overlapped region have already been stored, and the matrix Y already records which overlapped region is associated with which two classes. Using these two pieces of information, for each overlapped region and for each class we calculate a value associated with that region (these values can either be computed dynamically for the relevant overlapped region only, or computed once and stored in the data structure, in which case further learning requires recalculation). Let δ_1, δ_2, ..., δ_C be these values for a particular overlapped region, where C is the number of classes. Each δ_c is the average of the Euclidean distances between the test sample point and the stored points of class c in that region. The values are then normalized to make them small: let δ_max = max(δ_1, δ_2, ..., δ_C); all the values are divided by δ_max × K, where K is a parameter that decides how small the effect is. If the value of K is large the effect is small, and vice versa; it was tested with various values and 7 was found to be an optimum. So the values that ultimately participate in determining the classes are

S_c = δ_c / (δ_max × K), c = 1, 2, ..., C,

one small value per class for each overlapped region. When a test sample falls in an overlapped region, we calculate this value for each class; the larger the value for a class, the smaller the probability of predicting that class, so this value is subtracted from the final value being calculated for prediction. The CCN section also produces its compensation values for each class with the help of its activation function; here, too, the minimum value over multiple values for the same class is taken. For each class, the values mentioned above are calculated and combined, and then the class with the maximum value is predicted. The full revised algorithm is given below.


Begin Recall
Step 1: For each class, compute the maximum membership value of the test point.
        Let the value produced by the CLN section for the i-th class be b_i.
Step 2: For each class, compute the minimum overlap compensation value (all the
        compensation values are negative). Let the overlap compensation
        contributed for class i be d_i.
Step 3: for each overlapped region {
            for each class c {
                Calculate
                δ_c = (1 / N_c) Σ_{j=1}^{N_c} sqrt( Σ_{k=1}^{n} (a_k - p_{jk})^2 ),
                where a_k is the k-th coordinate of the sample point being tested
                for classification, P_j is the j-th stored point of class c in the
                overlapped region, p_{jk} is its k-th coordinate, and N_c is the
                number of stored points of class c available in that region.
            }
        }
        Set δ_max = max(δ_1, δ_2, ..., δ_C).
        for each c from 1 to C {
            Set S_c = δ_c / (δ_max × K)
        }
        K is an integer value (usually taken as 5) that keeps the result a small
        compensation value.
Step 4: For each class, compute the minimum containment compensation value (it
        is negative). Let the containment compensation value for class i be e_i.
Step 5: The final value for each class i is calculated as
        F_i = b_i + min(d_i, e_i) - S_i.
Step 6: The above value is calculated for each class, the maximum value is
        chosen and the corresponding class is predicted.
End Recall
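Step 3 of the modified recall can be sketched in Python as follows, assuming the reconstruction above: for the overlap region the test point falls in, the average Euclidean distance to the stored points of each class is computed, normalized by the largest such distance and an integer factor K, and the resulting small value is later subtracted from that class's score. The names and the exact normalization are assumptions consistent with the description, not the original code.

    import numpy as np

    def overlap_distance_compensation(test_point, stored_points_by_class, K=5):
        """Per-class compensation S_c for one overlapped region.

        stored_points_by_class[c] is the list of training points of class c that
        were remembered inside this overlapped region during training. A larger
        average distance to a class yields a larger value subtracted from that
        class's final score."""
        a = np.asarray(test_point)
        avg_dist = {}
        for c, pts in stored_points_by_class.items():
            if pts:  # classes with no stored points receive no compensation
                avg_dist[c] = float(np.mean([np.linalg.norm(a - np.asarray(p)) for p in pts]))
        if not avg_dist:
            return {}
        d_max = max(avg_dist.values()) or 1.0  # guard against division by zero
        return {c: d / (d_max * K) for c, d in avg_dist.items()}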

(ii) The sample points used for training create the hyperboxes. As can be seen pictorially in Fig 4.9, once a point has been observed there is a good probability that another point lying within a small region around it, in any direction, will have the same class. But in this algorithm only one side of the point is favoured, not the other sides. The figure illustrates this.

Fig 4.9: A hyperbox of class 1 with a training point near its boundary and two test points at comparable distances from the training point, one towards the hyperbox and one away from it. If the training point is of class 1, both test points should have equal membership values with respect to class 1, but the test point lying towards the hyperbox is given more preference than the one lying away from it.


In the modified algorithm discussed in the next chapter, the method of creating hyperboxes is changed to cope with this problem.

(iii) Both FMNN and FMCN assume that the information is stored in the form of the boundaries of the hyperboxes, so once the hyperboxes are created they are not disturbed unless expansion is required. Since the information resides in the boundaries, the innermost region of a hyperbox should not be given as much priority as it receives in the FMCN and FMNN recall algorithms. In both techniques, when a test sample point lies near a hyperbox, the calculated membership value is close to 1.0 but not 1.0, whereas when the test sample falls inside the hyperbox it is exactly 1.0. This should not be the case, because the boundaries, not the enclosed area, are the main storage of information. To remove this problem the created hyperboxes would need to be broken further into smaller regions; we leave this as future work.

(iv) There is a further drawback in the way the algorithm is applied to any problem area. FMCN requires the information to be enumerated within 0.0 and 1.0. In the learning process of the FMCN, the knowledge is stored as the boundaries of the different hyperboxes, and the membership value of a point is calculated from its distance to those boundaries. Because of this, the enumeration scheme affects the prediction accuracy: under different enumerations of the data points, some points come closer together and others move farther apart, so the prediction result differs. Ideally, the prediction result should be independent of the applied enumeration scheme. In our experiments, three types of enumeration schemes are applied and the majority of their outputs is selected as the prediction result, to make the process less dependent on any single enumeration scheme.


4.5 Summary

In this chapter, the basic architecture of the FMNN was described to give an idea of how fuzzy logic is applied to the prediction problem. The drawbacks of the FMNN were the main reason for its poor performance in many cases; the FMCN removes those drawbacks with a new architecture that has performed very well in many practical cases. The drawbacks of the FMCN itself were then discussed, some of which we removed as a part of our work. This chapter also described the modified FMCN algorithms for learning and recall. This improved FMCN is the backbone algorithm used for predicting the secondary structure in our work; its application to protein secondary structure prediction is described in detail in the next chapter.


Chapter 5

Prediction of Protein Secondary Structure

5.1 Introduction

The previous chapter described only the basic architecture of the FMCN and its algorithms for learning and recall. The goal of our work is to apply this neural-network-based classifier to protein secondary structure prediction. Section 5.2 describes the mapping of the protein secondary structure prediction problem onto a classification problem solved with the FMCN, and also discusses the basic set-up for the prediction. As mentioned in the previous chapter, we need some method to make the prediction system independent of the enumeration scheme applied to the amino acid residues; subsection 5.2.2 describes the use of multiple instantiations of the FMCN with different enumeration schemes to achieve this independence. Section 5.3 discusses the different accuracy measurement techniques used in this work, Section 5.4 focuses on the complexity of using FMCN, and finally Section 5.5 presents the different result sets of the first phase of our experiment.

5.2 Application of FMCN

The primary structure of a protein contains the full information needed to determine its three-dimensional structure. But how that structure is determined, and in particular how it is determined in a very small time, remains a paradox: Levinthal's paradox indicates that proteins do not reach their 3D shape by exhaustive search; folding is a spontaneous process. The main problem domain is 'protein folding', and one of its main sub-problems is 'protein secondary structure prediction'. The aim of this work is to obtain a fruitful way of predicting the secondary structure from the primary structure of a given protein. The number of different amino acids is 20, so the primary sequence of a protein can be represented mathematically as an element of {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}^n, where the letters are the one-letter codes of the amino acid residues and n is the length of the protein to be predicted; this denotes a general protein containing n residues, each of which can be any one of the 20 letters. The secondary structures are taken here to be of three kinds: alpha helix, beta sheet and turn (denoted by H, L and E respectively), so the secondary structure string can be represented mathematically as an element of {H, L, E}^n. Finally, the problem of secondary structure prediction can be represented as the mapping problem

{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}^n → {H, L, E}^n

The input is a sequence like AGVGTVPMTAYGNDIQYYGQVT… and the output is like ----HHHHHHHHLLLLEEEEEE…, where ---- means that the structures of the first four residues are unpredicted (in some cases some residues are not predicted). So, from a sequence over a 20-letter alphabet we have to predict a sequence over a 3-letter alphabet.

This section defines the mapping between these two problems from different domains: it describes in detail how a problem from computational biology is mapped onto a prediction problem.


5.2.1 Feature Vector selection

A study of the previous algorithms used for protein secondary structure prediction shows that the best approach is to treat the secondary structure of a particular amino acid residue, the target residue, as a function of the residues present on both sides of it in the primary sequence. Biologically this also makes good sense. The amino acids are linked by peptide bonds, and the actual structure of the sequence depends on the positions of the residues in the primary sequence in such a way that the total energy of the full sequence is minimized. This energy minimization is certainly a function of the full sequence; it also depends on molecular weights, ionization and other properties. Therefore, the sequence of residues carries the full information required to find the secondary structure of the protein. Mathematically this can be stated as follows:

S_i = f(r_{i-l}, ..., r_{i-1}, r_i, r_{i+1}, ..., r_{i+m})        (5.1)

where S_i ∈ {H, L, E} denotes the secondary structure (alpha helix, beta sheet or turn) of the i-th residue r_i of the primary sequence of a particular protein, and l and m are the selected extents of the window of residues on the two sides of the target residue. Generally l and m are taken to be equal.

In the prediction algorithm for the FMCN, the prediction input has to be presented as a point in a w-dimensional space. When the w-dimensional point is presented to a trained FMCN, it associates a particular class with that point. The point coordinates must lie within 0.0 and 1.0; this is just a convention to limit the total space considered. To fulfil the requirement of the input structure of the FMCN, and also in a biologically meaningful way, the feature vector is selected from the primary sequence of the protein as a 17-residue-long window:

(r_{i-8}, ..., r_{i-1}, r_i, r_{i+1}, ..., r_{i+8})
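As a concrete illustration of this feature-vector construction, the following sketch slides a 17-residue window over an enumerated primary sequence and emits one 17-dimensional point per predictable position, labelled with the secondary structure class of the central residue. The function name and data layout are placeholders; the actual enumeration schemes are described in Section 5.2.2.

    def window_features(enumerated_seq, sec_struct, half_window=8):
        """Build (feature vector, class) training pairs from one protein.

        enumerated_seq: list of residue enumeration values in [0, 1]
        sec_struct:     list of class labels (0 = helix, 1 = sheet, 2 = turn),
                        one per residue
        Residues closer than half_window to either end have no full window and
        are skipped (their structure is left unpredicted, as in the thesis)."""
        samples = []
        n = len(enumerated_seq)
        for i in range(half_window, n - half_window):
            window = enumerated_seq[i - half_window : i + half_window + 1]  # 17 values
            samples.append((window, sec_struct[i]))
        return samples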


5.2.2 Enumeration of the Amino Acid Residues

5.2.2.1 Basic Idea

As mentioned earlier, the prediction mechanism requires the data to be presented as a point in the w-dimensional space where each coordinate value lies within [0, 1]. In the earlier stages of experimentation with the FMCN, the residues were enumerated with a flat distribution of numbers from 0.12 to 0.84. Table 5.1 shows the full set of enumerated values (at a fixed interval) for each amino acid residue; the single letters in the left-hand column are the one-letter codes of the residues.

Table 5.1: Previously used enumerated values for the amino acid residues

Amino Acid Residue        Enumerated Value
A (ALANINE)               0.04
C (CYSTEINE)              0.08
D (ASPARTIC ACID)         0.12
E (GLUTAMIC ACID)         0.16
F (PHENYLALANINE)         0.20
G (GLYCINE)               0.24
H (HISTIDINE)             0.28
I (ISOLEUCINE)            0.32
K (LYSINE)                0.36
L (LEUCINE)               0.40
M (METHIONINE)            0.44
N (ASPARAGINE)            0.48
P (PROLINE)               0.52
Q (GLUTAMINE)             0.56
R (ARGININE)              0.60
S (SERINE)                0.64
T (THREONINE)             0.68
V (VALINE)                0.72
W (TRYPTOPHAN)            0.76
Y (TYROSINE)              0.80

But this scheme did not carry any meaningful sense. Rather, the residues should be enumerated in such a way that two residues having close chances of forming a particular secondary structure are also placed close together in the enumeration table. The following example elaborates the idea.

Suppose seven amino acid residues r1, r2, r3, r4, r5, r6 and r7 are enumerated with the values in Table 5.2, and assume that the enumeration is based on the chances of the residues forming alpha helices, which is denoted by class 1.

Table 5.2: Sample enumerated values

r1    0.10
r2    0.11
r3    0.14
r4    0.63
r5    0.65
r6    0.67
r7    0.70

Now, suppose the neural network has been trained with the information that (r4, r5, r6) has class 1, i.e. the point (0.63, 0.65, 0.67) in the three-dimensional space is associated with class 1, and that (r1, r2, r3) has class 2, i.e. the point (0.10, 0.11, 0.14) is associated with class 2. If a test point now arrives with the coordinates (0.65, 0.67, 0.70), i.e. for the residue sequence (r5, r6, r7), we can easily see that this point is near (0.63, 0.65, 0.67) and far from (0.10, 0.11, 0.14). So the chance of mis-prediction is reduced, thereby increasing the number of true positives. With this enumeration scheme we try to force all the points of the same class into a region where other points of that class are already clustered: the points that construct the region for a particular type of secondary structure come closer together, and a compact hyperbox is created, which implies a lower probability of partial and full overlap between hyperboxes. These advantages are not available with the earlier flat enumeration of the residues. The most significant areas of a primary sequence for this type of enumeration are the continuous regions, for example a continuous region of alpha helices; the behaviour of the scheme outside the continuous regions is not as good. The expectation behind these different enumerations was that the helical region, the beta-sheet region and the turn region of a particular protein would each be predicted with better accuracy in the instance whose enumeration is based on the corresponding parameters, so that the majority of the three outputs would improve the overall accuracy.

5.2.2.2 Use of Three Instances of FMCN

An enumeration scheme based on the propensity values of the residues for forming alpha helices will perform a little better for predicting the alpha-helical regions, and the same is true for the other structures and their related enumeration schemes. For this reason, three sets of enumeration schemes are applied for training the network, and ultimately the majority of the three outputs is taken as the final result of the unit.

The enumerated values are calculated from the Chou-Fasman parameters. Among the P_alpha values (the propensity values of the different residues for forming the alpha-helical region) found in the Chou-Fasman parameters, the maximum value is 1.51, for the amino acid with one-letter code E, and the least value is 0.57, for the amino acids P and G. But we have an enumeration space ranging from 0.00 to 1.00, so a straightforward mapping procedure is applied: 1.51 is mapped to 0.95, and accordingly every other residue is enumerated with its actual Chou-Fasman parameter value multiplied by 0.95/1.51. The enumerated values are listed in Table 5.3.

The values marked 'special case' are not the actual calculated values: the calculation produced nearly identical enumeration values for those residues, so a gap of 0.01 was created to keep them distinguishable. With the above enumeration scheme, the training data set is generated and the neural network is trained with it. The other two enumerations (of the same amino acid residues) use the same technique as described above, but the calculations are based on the Chou-Fasman parameters P_beta and P_turn. Table 5.4 and Table 5.5 show the calculated enumeration values for beta sheets and turns.


Table 5.3: The enumerated values for the amino acid residues depending on the Chou-Fasman parameters (i.e. the propensity values) related to the alpha helices.

Amino acid residue (one-letter code)    Enumerated value
E    0.95
M    0.91
A    0.89
L    0.76
K    0.73
F    0.71
Q    0.69
I    0.67    (special case)
W    0.66    (special case)
V    0.65    (special case)
D    0.63
H    0.62
R    0.61
T    0.52
S    0.48
C    0.44
N    0.42
P    0.35    (special case)
G    0.34    (special case)

Table 5.4: The enumerated values for the amino acid residues depending on the Chou-Fasman parameters (i.e. the propensity values) related to the beta sheets.

Amino acid residue (one-letter code)    Enumerated value
V    0.95
I    0.89
Y    0.82
F    0.77
W    0.76
L    0.72
T    0.65    (special case)
C    0.64    (special case)
Q    0.61
M    0.59
R    0.52
N    0.50
H    0.49
A    0.46
G    0.42    (special case)
S    0.41    (special case)
K    0.40    (special case)
P    0.31
D    0.30    (special case)
E    0.21    (special case)


Table 5.5: The enumerated values for the amino acid residues depending on the Chou-Fasman parameters (i.e. the propensity values) related to the turns.

Amino acid residue (one-letter code)    Enumerated value
P    0.95
G    0.85
S    0.75
D    0.73    (special case)
N    0.72    (special case)
H    0.57    (special case)
E    0.56    (special case)
C    0.55    (special case)
K    0.54    (special case)
R    0.53    (special case)
T    0.50
Y    0.42
Q    0.40
A    0.38
M    0.25    (special case)
L    0.24    (special case)
F    0.23    (special case)
W    0.22    (special case)
I    0.16
V    0.12
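The scaling used to build Tables 5.3-5.5 can be written down directly: each Chou-Fasman propensity is multiplied by 0.95 divided by the largest propensity in the set, so that the most propense residue maps to 0.95. The small propensity dictionary in the sketch below is only an illustrative excerpt; the tables above list the values actually used, including the hand-adjusted 'special case' entries that keep near-equal residues 0.01 apart.

    def scale_propensities(propensity, top=0.95):
        """Map raw Chou-Fasman propensities into the (0, 1) enumeration space."""
        p_max = max(propensity.values())
        return {res: round(p * top / p_max, 2) for res, p in propensity.items()}

    # Excerpt of helix propensities (illustrative values; E is the maximum, 1.51).
    helix_propensity = {"E": 1.51, "M": 1.45, "A": 1.42, "P": 0.57, "G": 0.57}
    print(scale_propensities(helix_propensity))
    # E maps to 0.95; the other residues scale by 0.95 / 1.51.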

5.2.2.3 The Architecture of One FMCN-Unit

Fig 5.1 describes how the three enumeration schemes are applied within one FMCN unit. The goal is to get good prediction accuracy for all three classes. An enumeration scheme based on a particular secondary structure, say alpha helices, performs better for predicting the alpha-helical regions of the primary sequence, and the same holds for the other two enumeration schemes. So, as the overall architecture, three different enumeration schemes are applied in three different instantiations of the improved FMCN. These three networks are trained with the same data set, and after training they are given the same test data. Each of the three neural networks writes its output to a separate file, and these three files, which are prediction results for the same sequence, are used as the input to a majority selector. The majority-selection process checks, for each position in the primary sequence, which secondary structure is predicted by most of the files and chooses it. In the worst case of a tie, i.e. when all three files differ, it marks the position as unpredicted. So, only when two or more files agree on a particular position is that value chosen as the prediction; in all other cases the result is given as unpredicted.

5.2.3 Used Proteins for Training and Testing

Protein sequence files freely available from the Protein Data Bank website (PDB website URL: http://www.rcsb.org/pdb) are used as the main source of information for this work. The data stored in the .pdb files are basically the primary structure sequences of the proteins and the three-dimensional coordinates of all the atoms of the amino acid molecules, i.e. the residues in the sequence. The format in which the secondary structure is given in these files is not suitable for our work, so the help of the software named DSSP is taken. The DSSP software was written by Wolfgang Kabsch and Chris Sander. The PDB files contain the atomic coordinates, and these are given as input to the DSSP program, which defines the secondary structure, geometrical features and solvent exposure of proteins. The program does not predict the protein secondary structure; it calculates the structures based on the Ramachandran diagram discussed in chapter 2, which describes the possible phi-psi angles for all possible secondary structure conformations. From the 3-dimensional coordinates the angles can be found, and from those angles the corresponding secondary structures can be derived. The program takes the .pdb files as input and generates .dssp files that represent the secondary structure of the full protein in a convenient way. These files are already available at the FTP site ftp://ftp.cmbi.ru.nl/pub/molbio/data/dssp. The following codes are used in the DSSP files to represent secondary structures.

Three types of helices:
  H = alpha helix
  G = 3-helix (3/10 helix)
  I = 5-helix (pi helix)

Two types of sheets:
  B = residue in isolated beta-bridge
  E = extended strand, participates in beta ladder

Two types of turns:
  T = hydrogen-bonded turn
  S = bend

Fig 5.1: The recall procedure of the neural network system with three different enumeration schemes. Three improved-FMCN instances, trained with the same data set but with the enumeration values of Tables 5.3, 5.4 and 5.5 respectively, are each given the same protein primary sequence. Each instance writes its prediction to an intermediate file, and a majority-selection process over the three files produces the final prediction of the unit.
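The majority-selection step at the end of Fig 5.1 reduces, per residue position, to the rule described above: keep a prediction only when at least two of the three per-scheme predictions agree. A minimal sketch follows; the '-' symbol for an unpredicted position is an assumption matching the output format shown in Section 5.2.

    def majority_select(pred1, pred2, pred3, unpredicted="-"):
        """Combine the outputs of the three enumeration-specific FMCN instances.

        Each prediction is a string over {'H', 'L', 'E'} of equal length; a
        position is kept only if at least two instances agree on it."""
        final = []
        for a, b, c in zip(pred1, pred2, pred3):
            if a == b or a == c:
                final.append(a)
            elif b == c:
                final.append(b)
            else:
                final.append(unpredicted)  # all three disagree
        return "".join(final)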


In total, 7 classes are used there, but for our work these 7 classes have been converted into 3 classes. H, G and I are three different types of helices, so instead of these three codes only H is used; B and E are two types of sheets, and in place of them L is used; T and S denote the turns or loops, so instead of these letters only E is used. The DSSP database contains more than 25000 proteins; in our work a total of 1500 proteins has been used. From these 25000 proteins, 150 proteins were initially chosen at random and used to generate the sample data points for training the network under the three different enumeration schemes. Some snapshots of the sample training data points with the enumeration schemes, and the whole process of generating the training data, are presented in Fig 5.2.

The same DSSP files are also converted into two other training files with different enumeration schemes; the three files use the enumeration schemes described in Tables 5.3, 5.4 and 5.5. These files are then used to train three different instances of the FMCN, so the three instances are trained with the same data but with different enumeration schemes.
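The class reduction described above is a simple table lookup; a sketch follows, using the thesis's own letter convention (H for helices, L for sheets, E for turns). How residues outside the seven listed DSSP codes (e.g. blanks) were handled is not stated in the thesis, so treating them as turns here is an assumption.

    # DSSP code -> thesis 3-class code (H, G, I: helices; B, E: sheets; T, S: turns)
    DSSP_TO_3CLASS = {
        "H": "H", "G": "H", "I": "H",
        "B": "L", "E": "L",
        "T": "E", "S": "E",
    }

    def reduce_dssp(dssp_string, default="E"):
        """Map a DSSP secondary-structure string onto the three thesis classes."""
        return "".join(DSSP_TO_3CLASS.get(ch, default) for ch in dssp_string)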

5.3 Accuracy Measurement Techniques

Mainly two types of prediction accuracy measures are employed in this work: the general technique and the Matthews correlation coefficient (MCC). The most commonly used measure of secondary structure prediction accuracy is the percentage of correct predictions, denoted Q3 and calculated as

Q_3 = \frac{N_H + N_L + N_E}{N} \times 100\%        (5.2)

where N denotes the total number of residues and N_H, N_L, N_E are the numbers of amino acid residues of secondary structure type alpha helix, beta sheet and turn, respectively, that are predicted correctly. Though it is used in all the experiments, it does not carry enough information about the accuracy of the prediction. There are actually four types of quantities to be taken care of during any prediction - correct prediction, miss-prediction, under-prediction and over-prediction - and Q3 does not take care of all of them. Moreover, this technique does not give the prediction accuracy of the three types of secondary structures separately. In the result sets, the Q3 accuracy is given in percentage form.

Fig 5.2: The process of generating the training files from the PDB files. The PDB files are converted into DSSP files, from which a training file is produced containing training samples that are 17-dimensional points, each associated with one of three classes 0, 1 or 2, where 0 denotes helix, 1 denotes sheet and 2 denotes turn.

MCC is a more meaningful way to measure the prediction accuracy. For a particular secondary structure type i, it is defined as follows:


C_i = \frac{p_i n_i - u_i o_i}{\sqrt{(p_i + u_i)(p_i + o_i)(n_i + u_i)(n_i + o_i)}}        (5.3)

where p_i is the number of residues of type i properly predicted (correctly predicted), also known as TP, i.e. true positives; n_i is the number properly rejected, also known as TN, i.e. true negatives; u_i is the number under-predicted, i.e. residues of type i in the actual structure whose predicted type is not i, also known as FN, i.e. false negatives; and o_i is the number over-predicted, i.e. residues that are not of type i in the actual structure but are predicted as type i, also known as FP, i.e. false positives. The value of the correlation coefficient ranges from -1.0 to +1.0: the value -1.0 indicates that the actual and predicted structures are perfectly anti-correlated, and the value +1.0 indicates that they are perfectly correlated.

Apart from these two techniques, two other measures commonly used for general classification systems are employed: recall (or true positive rate) and precision. The recall or true positive rate for secondary structure type i is calculated as the proportion

R_i = \frac{p_i}{p_i + u_i}        (5.4)

and the precision for secondary structure type i as the proportion

P_i = \frac{p_i}{p_i + o_i}        (5.5)

Here p_i, u_i and o_i carry the same meanings as in the Matthews correlation coefficient. The precision and recall values for any classification task range from 0.0 to 1.0. A precision score of 1.0 indicates that the number of false positives is zero: every prediction made for a particular class (say the class of alpha helices) is accurate. But there may be many more residues of that class which are not predicted correctly, so precision says nothing about how many residues of the class were missed. A recall value of 1.0 means that the number of false negative cases is zero.


negative’ cases is zero. It indicates that every residue of the corresponding class is predicted correctly. But here also many more residues may be there such that they does not belong to the class but predicted to be of that class. So both of the recall value and precision values do not include all possible cases of prediction. For this reason, the MCC which takes care of all the cases gets much significance for measuring accuracy.

5.4 Complexity in Using FMCN

If one protein file contains R residues in its primary sequence, the number of entries (w-dimensional sample points with their associated classes) generated in the training file corresponding to this protein is approximately R - w + 1, where w is the chosen window size. So P proteins, each containing R residues on average, generate a training file of about P(R - w + 1) training samples, each of which is a point in the w-dimensional hyperspace. Training the neural network with such a file takes time linearly proportional to the number of training samples in it. The recall time is smaller: it depends on the number of distinct hyperboxes created over all the classes, since the distances from these hyperboxes must be calculated, and as the number of hyperboxes is bounded above by the number of training samples, this closeness calculation is the most time-consuming task. In this area of protein structure prediction, however, many sample points fall inside existing hyperboxes, and many redundant sample points occur repeatedly, so the number of hyperboxes stays smaller. It also depends on the value of the maximum hyperbox size parameter θ chosen in the FMCN training algorithm: if θ is large, the number of hyperboxes for a particular class is smaller, and vice versa. In our work, several window sizes and θ values were tested; the optimum values, which gave the best results, are θ = 0.11 and 0.12 and window sizes 17 and 11. These results are relative, i.e. they cannot be declared best for all cases: for some cases the window size 17 is good and for many others 11 is good, and the same is true for the value of θ.
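As a rough worked example under assumed numbers (the thesis reports 150-250 training proteins and 50,000-60,000 samples per file, but not an average protein length), 200 proteins of about 300 residues each with a 17-residue window give

200 \times (300 - 17 + 1) = 56{,}800

training samples, which is consistent with the reported training-file sizes.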


5.5 Experimental Results

In the first phase of our experiment, with a single FMCN unit consisting of three FMCN instances with different enumeration schemes, we obtained a very wide range of result values. The prediction accuracy depends entirely on the knowledge the neural network has. So far, more than 25000 proteins have been fully sequenced with known structure, and the neural-network-based techniques available on the web, such as PHD, PSI-PRED and NNPredict, have trained their networks with all possible proteins sequenced so far, so they give a very high accuracy level in all cases of prediction. In the first phase of our experiment we tried to incorporate around 150 to 250 proteins for training in each set. Because of memory and time constraints, the size of the training files was kept such that each contained around 50000 to 60000 sample points with associated classes. Several sets of experiments were carried out; three typical result sets, together with the final overall results, are presented and discussed here.

Result Set 1: Tables C.1 and C.2 in Appendix C show the training and testing sets of proteins for this part of the experiment; there are around 180 proteins in the training set. The average values achieved in this experiment are given in Table 5.6. The protein codes given in the tables are PDB codes. A wide range of prediction accuracy values was found: the proteins for which the helix, sheet and turn predictions are good are given in separate tables (Tables C.3, C.4 and C.5 respectively) together with their average values. This experiment shows an average quality of prediction.

This first result set does not show any improvement over previous work; rather, the quality can be described as below average. Apart from the average values, however, some individual values are relatively remarkable; these are put in bold face in the tables showing the full results (in Appendix C). The turn predictions are better for this training and test set of proteins compared with the other classes.


Table 5.6: The average values of the accuracy measurements for result set 1 in the 1st phase of our experiment.

Average Palpha 58.96%

Average Pbeta 50.73%

Average Pturn 61.10%

Q3 Accuracy 56.93%

Average Calpha 0.3296

Average Cbeta 0.27

Average Cturn 0.44

Average Recallalpha 0.6144

Average Recallbeta 0.54

Average Recallturn 0.44

Average Precisionalpha 0.59

Average Precisionbeta 0.50

Average Precisionturn 0.61

Result Set 2: Tables C.6 and C.7 (Appendix C) show the lists of training and test proteins for this part of the experiment, and Tables C.8, C.9 and C.10 (Appendix C) show the results achieved for the different classes of prediction. Table 5.7 shows the average values achieved for this set. This is an example of poor results within the whole 1st phase of the experiment.

Table 5.7 : List of average results for result set 2.

Average Palpha 44.25

Average Pbeta 41.15

Average Pturn 38.68

Q3 Accuracy 41.36

Average Calpha 0.468

Average Cbeta 0.11

Average Cturn 0.37

Average Recallalpha 0.44

Average Recallbeta 0.37

Average Recallturn 0.33

Average Precisionalpha 0.44

Average Precisionbeta 0.41

Average Precisionturn 0.38


The natural moral to draw from these scenarios is that the neural network should be trained with a large data set containing few redundant training sample points, and that the network should have a well-balanced knowledge base of all three classes. For this reason, the next aim of our work was to train the neural network with a large number of proteins. In this result set, the test proteins whose sequences were somewhat similar to those of the training proteins gave very good prediction results; these are highlighted in the result tables in Appendix C.

Result Set 3: Tables C.11 and C.12 (Appendix C) show the lists of training and test proteins for this part of the experiment, and Tables C.13, C.14 and C.15 (Appendix C) show the results achieved for the different classes of prediction. Table 5.8 shows the average values. This experiment is a typical example of a good match between the knowledge base created from the training set and the proteins used in the test set, and it shows good prediction accuracies.

Table 5.8: The best average measures of this experiment.

Average Palpha 72.22%

Average Pbeta 75.19%

Average Pturn 73.23%

Q3 Accuracy 73.54%

Average Calpha 0.57

Average Cbeta 0.61

Average Cturn 0.50

Average Recallalpha 0.72

Average Recallbeta 0.72

Average Recallturn 0.604

Average Precisionalpha 0.63

Average Precisionbeta 0.75

Average Precisionturn 0.73

Apart from these three experiments, a lot of other proteins were also tested in the 1st phase with a variety of different training sets. Table 5.9 shows some typical results for the alpha helices; typical results for the other classes are presented in Appendix C (Tables C.16 and C.17). Table 5.10 lists the average results over all the experiments carried out in the 1st phase.

Table 5.9: Table showing the results for different proteins for alpha helix prediction. If some protein name repeats in the set then they were tested for different training sets. It also shows the average over all these predictions.

Protein P- C- Recall- Precision Protein P-Helix C-helix Recall- Precision Code Helix helix Helix - Helix Code Helix - Helix 9abp: 98.23 0.59 0.44 1 2c4y: 80.95 0.8 0.87 0.81 6rat: 98.12 0.95 0.93 1 2a4u: 80.88 0.72 0.91 0.81 2a4v: 98.11 0.43 0.21 1 6icd: 79.88 0.67 0.87 0.8 2c44: 98.05 0.92 0.92 1 2a46: 79.63 0.79 0.91 0.8 2cbf: 98.01 0.94 0.93 1 2c4z: 79.13 0.65 0.86 0.79 2cbi: 97.6 0.83 0.74 1 2cbn: 78.95 0.67 0.88 0.79 2cbq: 97.57 0.91 0.85 1 6cp4: 78.85 0.67 0.93 0.79 2cbr: 97.56 0.89 0.83 1 6cpp: 78.85 0.67 0.93 0.79 2cbv: 97.34 0.91 0.83 1 2cbm: 74.2 0.29 0.96 0.74 2cid: 97.37 0.56 0.33 1 2cbx: 73.87 0.78 1 0.74 2cir: 97.12 0.65 0.61 1 2cbl: 73.83 0.34 0.97 0.74 2cis: 97.11 0.68 0.58 1 6r1r: 73.47 0.65 0.94 0.73 2cin: 97.05 0.88 0.85 0.98 2b6x: 73.08 0.5 0.9 0.73 2cia: 96.88 0.95 0.97 0.97 2cir: 72.73 0.31 0.47 0.73 2a42: 96.49 0.68 0.98 0.96 2cig: 72.73 0.66 0.97 0.73 2c4x: 96.03 0.86 0.92 0.96 6adh: 72.73 0.18 0.29 0.73 2cil: 95.24 0.91 0.91 0.95 2cbu: 72.65 0.53 0.82 0.73 2cim: 95.24 0.87 0.85 0.95 2b6t: 72.64 0.5 0.92 0.73 2a4c: 94.71 0.92 0.98 0.95 2b6w: 72.64 0.5 0.92 0.73 2a4e: 94.67 0.74 0.69 0.95 2b6y: 72.64 0.5 0.92 0.73 2c4w: 94.55 0.88 0.95 0.95 6gsv: 72.3 0.53 0.84 0.72 2a40: 94.44 0.81 0.77 0.94 6gsx: 71.83 0.51 0.83 0.72 2cbs: 94.44 0.9 0.89 0.94 6gpb: 71.19 0.62 0.95 0.71 6gsp: 94.12 0.89 0.89 0.94 2a45: 71.15 0.65 0.81 0.71 6rnt: 94.12 0.96 1 0.94 6abp: 71.03 0.66 0.94 0.71 1bas: 94.12 0.89 0.95 0.94 6gst: 70.89 0.46 0.8 0.71 6rnt: 94.12 0.96 1 0.94 6cts: 70.25 0.24 0.94 0.7 2cip: 93.9 0.84 0.94 0.94 6cpa: 69.75 0.68 0.93 0.7 6ccp: 93.88 0.86 0.96 0.94 2cbj: 69.23 0.6 0.56 0.69 2a4x: 93.68 0.9 0.95 0.94 6gsu: 68.9 0.46 0.8 0.69 2ciq: 92.86 0.73 0.72 0.93 7cpp: 68.75 0.47 0.75 0.69 2cio: 91.8 0.69 0.99 0.92 6gsw: 68.37 0.48 0.83 0.68 2c47: 91.79 0.89 0.95 0.92 2c48: 68.25 0.68 0.81 0.68 2a43: 91.07 0.51 1 0.91 2cic: 67.86 0.73 0.83 0.68 2a4n: 90.97 0.91 0.99 0.91 6gss: 67.19 0.53 0.93 0.67 2ciu: 90.71 0.88 0.91 0.91 7gpb: 66.67 0.6 0.83 0.67 2cbo: 90.21 0.89 0.96 0.9 1a2t: 66.67 0.4 0.56 0.67 2cij: 89.39 0.79 0.9 0.89 1a2u: 66.67 0.4 0.56 0.67 2cik: 89.26 0.6 0.98 0.89 2c4v: 66.67 0.72 0.89 0.67 2a4o: 88.64 0.85 0.93 0.89 2cbg: 66.67 0.54 0.86 0.67 6gsy: 88.57 0.71 0.87 0.89 6mht: 66 0.52 0.73 0.66 6pcy: 88.24 0.4 0.31 0.88 2cie: 65.91 0.61 0.97 0.66 6req: 88.17 0.78 0.94 0.88 6gss: 65 0.08 0.08 0.65 2c43: 88 0.84 0.94 0.88 2c4a: 64.91 0.54 0.84 0.65 2c45: 88 0.84 0.94 0.88 6r1r: 64.86 -0.04 0.32 0.65 2cbk: 88 0.87 0.96 0.88 2a4m: 63.64 0.73 0.88 0.64 2c40: 87.44 0.59 0.98 0.87 2cif: 63.64 0.59 0.97 0.64 2a4p: 86.67 0.81 0.91 0.87 6xim: 63.62 0.47 0.87 0.64 2c46: 86.67 0.79 0.89 0.87 2b0z: 63.25 0.34 0.73 0.63 2a4t: 85.87 0.87 0.99 0.86 1a25: 62.5 0.21 0.21 0.62 6ca2: 85.71 0.79 0.82 0.86 6cts: 61.51 0.47 0.98 0.62 2a41: 85.71 0.59 1 0.86 6gst: 61.05 0.2 0.61 0.61 2a4q: 85.71 0.87 0.99 0.86 6q21: 60.83 0.46 0.73 0.61 2a4s: 85.11 0.87 1 0.85 6rxn: 60 0.62 0.86 0.6 2c4t: 85.08 0.8 0.93 0.85 1a2c: 60 0.55 0.68 0.6 2ciw: 85 0.77 0.72 0.85 2c49: 60 0.16 0.46 0.6 2a48: 84.81 0.87 1 0.85 2a4w: 59.9 0.21 0.9 0.6 2a4y: 84.78 0.8 0.95 0.85 6dfr: 59.46 0.48 0.69 0.59 2a4z: 84.17 0.8 0.95 0.84 6prc: 59.14 0.5 0.92 0.59 6std: 83.33 0.4 0.33 0.83 7cel: 58 0.58 0.91 0.58 2a4r: 83 0.85 1 0.83 7fab: 58 0.58 0.91 0.58 2cit: 82.76 0.8 0.96 0.83 2a44: 57.89 0.41 0.35 0.58


Table 5.9 continued from previous page.

Protein Code P-Helix C-helix Recall-Helix Precision - Helix

6acn: 53.23 0.47 0.81 0.53 5tss: 53.06 0.27 0.71 0.53 2civ: 52.82 0.24 0.63 0.53 1a2w: 51.72 0.41 0.61 0.52 2b05: 51.22 0.17 0.95 0.51 2b0t: 50.85 0.2 0.7 0.51 7pck: 50 -0.19 0.31 0.5 2b0c: 50 0.33 0.83 0.5 7tli: 49.6 0.36 0.74 0.5 1a2g: 49.32 0.2 0.75 0.49 4mon: 48.99 0.24 0.94 0.49 1a2s: 48.89 0.25 0.76 0.49 6gsy: 48.62 -0.02 0.58 0.49 6ldh: 48.61 0.34 0.79 0.49 1a7l: 48.57 0.28 0.71 0.49 6a3h: 48.25 0.46 0.85 0.48 2b0o: 48.19 0.14 0.69 0.48 6lzm: 48.11 0.15 0.8 0.48 1a2f: 47.97 0.19 0.75 0.48 5cev: 47.53 0.32 0.69 0.48 8jdw: 47.24 0.22 0.67 0.47 6cgt: 47.06 0.42 0.67 0.47 1a7f: 47.06 0.28 0.73 0.47 6gsw: 46.77 -0.02 0.49 0.47 4trx: 46.21 0.24 0.69 0.46 6prn: 46.15 0.23 0.19 0.46 1a7g: 46.15 0.34 0.67 0.46 6gsu: 45.61 -0.1 0.41 0.46 2b69: 45.53 0.25 0.71 0.46 2c41: 45.49 0.36 0.85 0.45 6acn: 45.45 -0.15 0.26 0.45 8ice: 45.24 0.24 0.37 0.45 6ald: 45.15 0.21 0.73 0.45 2b65: 44.74 0.19 0.54 0.45 1a27: 44.59 0.24 0.77 0.45 2b0m: 44.51 0.29 0.76 0.45 8gss: 44.44 0.21 0.68 0.44 1a2b: 44.12 0.16 0.56 0.44 2b6c: 44.01 0.13 0.86 0.44 6cpa: 44 0.06 0.12 0.44 1a2a: 43.95 0.36 0.83 0.44 7prc: 43.86 0.22 0.66 0.44 1a77: 43.8 0.11 0.63 0.44 2b0q: 43.56 0.24 0.67 0.44 2b6a: 43.13 0.18 0.56 0.43 2b0h: 43.12 0.03 0.87 0.43 2b6d: 42.34 0.19 0.53 0.42 1a76: 42.25 0.16 0.68 0.42 6adh: 42.11 0.09 0.3 0.42 1a2y: 41.94 0.32 0.5 0.42 1a7a: 41.87 0.24 0.71 0.42 7abp: 41.67 0.25 0.43 0.42 7yas: 41.67 0.24 0.42 0.42 2b0u: 41.48 0.21 0.41 0.41 5lzm: 41.46 0.15 0.32 0.41 1a7d: 41.18 0.14 0.92 0.41 2b63: 40.99 0.17 0.52 0.41 7icg: 40.55 0.25 0.63 0.41 5daa: 40.48 0.14 0.31 0.4 5hck: 40.48 0.13 0.31 0.4 1a22: 40.46 0.2 0.61 0.4 7ahi: 40.36 0.12 0.93 0.4 1a7j: 40.35 0.26 0.72 0.4 4ovo: 40.16 -0.15 0.84 0.4 6pah: 40.14 0.17 0.71 0.4 1a2v: 40.1 0.2 0.34 0.4 Average 67.74 0.47 0.74 0.67


Table 5.10: The overall results of all the experiments done in the 1st phase

Average of P-alpha                     67.74%
Average of P-Beta                      65.31%
Average of P-Turn                      68.88%
Average Q3 Accuracy                    67.31%
Average of C-alpha                     0.47
Average of C-beta                      0.44
Average of C-turn                      0.43
Average Recall for alpha helix         0.74
Average Recall for Beta-sheet          0.61
Average Recall for Turn                0.561
Average Precision for Alpha Helix      0.67
Average Precision for Beta sheet       0.63
Average Precision for Turn             0.69

5.6 Summary

This chapter has focused on how our improved FMCN is applied to the problem of predicting secondary structures. After discussing the significance of the different enumeration schemes and the overall architecture of the system, it presented several sets of results in section 5.5. There is a wide range of result values, of both poor and good quality, so the deviation of the results from the average is very wide in both directions. The main reason is that the neural network unit used for training does not encompass a large training set, so its knowledge may not be enough to predict the structures of some proteins. In the second phase of our experiment we have tried to alleviate this problem by using multiple instances of the FMCN units; the next chapter discusses this technique in detail, together with the results.


Chapter 6

Multiple Instantiation of FMCN-Unit

6.1 Introduction

In chapter 5, the experiment with three different enumeration schemes, run on three separate instantiations of the neural network, was described along with detailed results. The main reason for the poor results was identified as the limited amount of knowledge that a single neural network unit can hold at a time. This chapter describes the proposed scheme for handling a much larger knowledge base.

Section 6.2 describes the system architecture and the implementation-level techniques applied for training the neural networks with a large amount of data. Section 6.3 presents the experimental results, and Section 6.4 summarizes the chapter. Overall, this chapter describes the proposed technique for achieving good prediction accuracy over a wide range of input data using several FMCN-units.

6.2 System Architecture

So far in our experiment, only 150-250 proteins with already known secondary structures could be used as the training set for a single neural network. This is far smaller than the number of proteins with known structures (more than 25000). The natural solution is therefore to use multiple processes, each using a different set of proteins as its training set. In this work, a total of 4 different instances of the neural network have been created. Each instance is actually a unit containing three networks that use the three different enumeration schemes, and each unit is trained with approximately 60000-70000 samples derived from approximately 200-250 proteins. In the earlier use of multiple instantiation, for the different enumeration schemes, the three networks were trained with the same set of proteins, so they held the same knowledge; this did not help in predicting a wide range of proteins. To predict a wide range of proteins well, the units have to be trained with different sets of proteins. Each set contained around 200 proteins, corresponding to around 60000 samples, and the proteins were chosen at random.

6.2.1 Multiple Instantiation

The concept of multiple instantiation was already introduced in chapter 5. Training of the neural network with different enumeration schemes is actually done by three instances of the same neural network. From the implementation point of view, the instances are created as separate processes in the system using the UNIX fork() system call. To use the different enumeration schemes, three processes run the same code but take different parameters, such as the codes used to enumerate the amino acid residues, the input training files, the output files and several other settings. After completing their work, they each write their output to their own files. Once they exit, the parent process combines the results into the final output files. This is the technique behind the scheme introduced in chapter 5. The same technique is applied for handling a knowledge base that is too large for a single process.

For training, the main process (the parent of all processes) first creates N different processes, and each of these N processes then creates three children of its own, so in total 3N + N + 1 processes are created. It is not mandatory to run them in parallel, but doing so saves time because the processes perform a lot of I/O. All training files must be ready before training starts. After training, each of the N units contains three networks trained with the different enumeration schemes. At recall time, each process receives the same list of protein files whose structures are to be predicted. Each unit (a group of three processes) writes its intermediate results to an intermediate file. When all units have exited, the parent combines the results, generates the final output file, and applies some post-processing to it.
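The process layout just described can be sketched roughly as follows. This is a minimal illustration only: the thesis implementation uses the UNIX fork() system call directly, whereas the sketch below uses Python's multiprocessing module, and train_fmcn() as well as the file names are hypothetical placeholders.

from multiprocessing import Process

N_UNITS = 4                   # number of FMCN-units
SCHEMES = ["E1", "E2", "E3"]  # the three enumeration schemes

def train_fmcn(unit, scheme, training_file, out_file):
    # Hypothetical: train one network of one unit with the given enumeration
    # scheme and write its output file (F1, F2 or F3).
    ...

def run_unit(unit):
    # Each unit process spawns three children, one per enumeration scheme.
    children = [Process(target=train_fmcn,
                        args=(unit, s, f"train_unit{unit}.dat", f"unit{unit}_{s}.out"))
                for s in SCHEMES]
    for c in children:
        c.start()
    for c in children:
        c.join()
    # The unit process then merges F1, F2 and F3 into its intermediate file.

if __name__ == "__main__":
    units = [Process(target=run_unit, args=(u,)) for u in range(1, N_UNITS + 1)]
    for u in units:
        u.start()
    for u in units:
        u.join()
    # The parent now combines the N intermediate files and post-processes them
    # as described in Section 6.2.2.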

6.2.2 Generation of Final Output

The following two subsections describe the algorithm applied for combining the outputs of the N units and the algorithm for the final post-processing of the combined output.

6.2.2.1 Algorithm to Combine the Outputs

After the N units of neural networks have been fully trained, a huge amount of knowledge is stored in these networks, but each unit knows different things. Their knowledge domains may overlap in some portions, while most areas remain disjoint; how disjoint or overlapping they are is determined by the choice of the training sets. In our work the training sets have been chosen at random. The training protein files are different for all the N units, but the sequences have not been individually checked for similarity; this will be included in our future work. So, depending on the randomness of the selected training files, the units hold somewhat different knowledge about the classifications: some are better at classifying certain patterns, while other units are better at classifying other types of patterns. In this case a simple majority selection over all N outputs will not work. Instead, the basic idea behind the fuzzy min-max algorithm is used to combine the outputs.

Fig 6.1: The architecture of the overall system of processes used in this work. Unit 1, ..., Unit N are the N units, each of which is a group of three neural networks (E1, E2, E3) trained with the same dataset but different enumeration schemes; each unit is trained with a different set of data. The arrows describe the recall process: each network in a unit writes its prediction to the files F1, F2 and F3, the outputs are combined into intermediate files, and finally the intermediate outputs are combined and post-processed to generate the final output.

The FMCN classifier predicts that a particular pattern of amino acids (i.e. the middle residue of a window) belongs to class Helix, Sheet or Loop if the pattern lies close enough to a hyperbox of that class, i.e. if its fuzzy membership value is above some threshold. If that threshold is set very low, samples lying far from every hyperbox are still classified, which produces many more conflicts and confusions between classes and may lead to wrong predictions for samples that are essentially outside the knowledge of that neural network. If, on the other hand, the threshold is kept high, the network predicts only what it knows and does not go much beyond that knowledge; in all other cases it marks the residue as unpredicted. Since the same file is fed to all N units, each of them predicts only what it knows, so each unit's output contains many unpredicted regions. However, if the units have been trained with largely disjoint sets of proteins, their unpredicted regions overlap only in small stretches. The natural way to combine the outputs is therefore to fill the unpredicted regions of one unit with the majority decision of the others.
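The per-unit decision rule described above can be written as a small sketch; it assumes that, for each class, the classifier exposes the best fuzzy membership over that class's hyperboxes, and the class labels and threshold value are illustrative only.

def predict_residue(memberships, threshold=0.8):
    # memberships: dict mapping class label -> best membership over its hyperboxes.
    best_class, best_m = max(memberships.items(), key=lambda kv: kv[1])
    if best_m < threshold:
        return "U"        # outside this unit's knowledge: mark as unpredicted
    return best_class     # 'H' (helix), 'E' (sheet) or 'L' (turn/loop)

With a high threshold a unit answers only inside its own knowledge domain, e.g. predict_residue({"H": 0.95, "E": 0.40, "L": 0.30}) returns 'H', while predict_residue({"H": 0.55, "E": 0.52, "L": 0.30}) returns 'U'.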

Begin Combine_outputs

For each predicted position in each intermediate file {

    If all N units mark the position as unpredicted, then
        Mark it as unpredicted in the final output file.
    Else {
        Count how many of the N units predict each class (Helix, Sheet and Turn), ignoring the unpredicted cases.
        If there is no tie between any two classes
            Select the majority class as the predicted class.
        Else {
            Compute the Chou-Fasman coefficient for that particular residue according to the algorithm described in chapter 2.
            If that value tells with high confidence (i.e. it is above a threshold value) that the residue belongs to one of the tied classes, then
                Choose that class as the final output.
            Else
                Mark it as unpredicted.
        }
    }
}

End Combine_outputs
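A runnable sketch of this combination step is given below; it assumes per-residue labels 'H', 'E', 'L' and 'U' (unpredicted) from the N units, and chou_fasman_class() is a hypothetical helper standing in for the tie-breaking rule of chapter 2.

from collections import Counter

def chou_fasman_class(residue, candidates):
    # Hypothetical tie-breaker: return the candidate class whose Chou-Fasman
    # propensity for this residue is above the confidence threshold, else None.
    ...

def combine_outputs(unit_predictions, residues):
    # unit_predictions: list of N per-residue prediction lists; residues: the amino acid sequence.
    combined = []
    for i, column in enumerate(zip(*unit_predictions)):
        votes = [v for v in column if v != "U"]      # ignore the unpredicted units
        if not votes:
            combined.append("U")
            continue
        counts = Counter(votes).most_common()
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            combined.append(counts[0][0])            # clear majority
        else:
            tied = [c for c, n in counts if n == counts[0][1]]
            choice = chou_fasman_class(residues[i], tied)
            combined.append(choice if choice else "U")
    return combined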

6.2.2.2 Algorithm for post-processing

The main idea behind the post-processing relies on a fact that comes from direct study of the nature of protein secondary structures: helices and sheets occur in runs of at least 4 or 5 residues and do not appear as single, double or triple consecutive residues; in most cases at least 4 residues are needed to form the structure. This is less true for turns, which can occur more or less at random. Therefore, some post-processing can be performed on the output obtained after combining the outputs of the N units, in order to improve the accuracy. The simple post-processing algorithm is presented below.


Begin Post-processing

The same neighbourhood rule is applied, in the following order, to seven kinds of short regions found by sliding a window over the combined prediction:

1. Windows of size three in which all three residues are unpredicted.
2. Windows of size two in which both residues are unpredicted.
3. Single unpredicted residues.
4. Windows of size two in which both residues are predicted as H (isolated two-residue helices).
5. Single residues predicted as H.
6. Windows of size two in which both residues are predicted as L (isolated two-residue loops).
7. Single residues predicted as L.

For each such window, let i and j be its first and last positions (so j = i, i+1 or i+2):

    Check the class of the single residue on each side of the window; if the window is the first (respectively last) window of the sequence, check only the residue after (respectively before) it.

    If the flanking classes are the same — and, for the H and L cases, different from H (respectively L) — change all residues of the window to that class.

    If they are not the same, check the residues lying two and three positions beyond the window on each side, i.e. residues i-2 and i-3 on the left and j+2 and j+3 on the right (only one side is checked when the window is at the beginning or end of the sequence). If all of these checked residues have the same predicted class, change all the residues in between to that class.

    Otherwise, do nothing.

End Post-processing
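The post-processing can be pictured as two smoothing passes over the combined prediction string. The sketch below is a simplified illustration under that reading: it fills short unpredicted gaps and relabels isolated short H (or L) runs when the flanking residues agree, but it does not reproduce the "two residues further out" fallback of the pseudocode.

def fill_short_gaps(pred, max_gap=3):
    # pred: list of 'H', 'E', 'L' or 'U'.  Fill each run of 'U' of length
    # <= max_gap with the class of its neighbours when they agree (or with the
    # single neighbour at either end of the chain).
    out = list(pred)
    i = 0
    while i < len(out):
        if out[i] != "U":
            i += 1
            continue
        j = i
        while j < len(out) and out[j] == "U":
            j += 1
        left = out[i - 1] if i > 0 else None
        right = out[j] if j < len(out) else None
        fill = left if right is None else right if left is None else (left if left == right else None)
        if fill and (j - i) <= max_gap:
            out[i:j] = [fill] * (j - i)
        i = j
    return out

def relabel_short_runs(pred, classes=("H", "L"), max_run=2):
    # Relabel isolated 1-2 residue runs of H or L when both neighbours agree on
    # a different class (helices and sheets rarely span fewer than 4 residues).
    out = list(pred)
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1
        if out[i] in classes and (j - i) <= max_run:
            left = out[i - 1] if i > 0 else None
            right = out[j] if j < len(out) else None
            if left and right and left == right and left != out[i]:
                out[i:j] = [left] * (j - i)
        i = j
    return out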

6.3 Experimental Results

Our system was tested with 4 instantiations. Each of the four units was trained with a different training set containing approximately 190 to 220 proteins with known structures. The total system was tested with more than 100 proteins. Tables D.1, D.2, D.3 and D.4 in Appendix D list the proteins used to train units 1, 2, 3 and 4 respectively, while the proteins used for testing are listed in Tables D.5 and D.6 (Appendix D). Tables 6.1, 6.2 and 6.3 present selected results from the second phase of the experiment.


The protein codes in the result tables are the codes used by the Protein Data Bank. The good results for the three different classes are given in separate tables for greater clarity. Px, Cx, Recall-x and Precision-x (where x may be helix, sheet or turn) are the accuracy measures discussed in chapter 5. Table 6.4 contains the average values over all results obtained in the second phase of our experiment. The average Q3 accuracy is 69.53%, so this phase achieved an increase of 2.25% in the Q3 accuracy over the 1st phase. More importantly, the deviation of the individual results from the average value is not as bad as in the 1st phase; most of the values are above 60%. This is the result of the larger overall training data set used for training the different instances of the FMCN-unit.
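The Q3 value and the per-class recall and precision figures reported in these tables can be computed as in the short sketch below; the class labels are illustrative, and the correlation measure Cx is the one defined in chapter 5 and is not repeated here.

def class_scores(pred, obs, cls):
    # recall and precision of one secondary-structure class over a protein.
    tp = sum(p == cls and o == cls for p, o in zip(pred, obs))
    fp = sum(p == cls and o != cls for p, o in zip(pred, obs))
    fn = sum(p != cls and o == cls for p, o in zip(pred, obs))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

def q3(pred, obs):
    # percentage of residues assigned to the correct class.
    return 100.0 * sum(p == o for p, o in zip(pred, obs)) / len(obs)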

Table 6.1 : Prediction result for the helical regions in the proteins

Protein P- C- Recall- Precision- Protein P-helix C-helix Recall- Precision- Code helix helix helix helix Code helix helix 3rpb: 100 0.41 0.28 1 103m: 84.3 0.21 0.93 0.84 3rsp: 100 0.79 0.72 1 104m: 84.3 0.14 0.91 0.84 4gal: 100 0.22 0.08 1 107m: 84.3 0.1 0.91 0.84 4hb1: 100 1 1 1 12ca: 84.2 0.59 0.56 0.84 4ccp: 98 0.84 0.91 0.98 111m: 84.2 0.2 0.92 0.84 4ccx: 97.3 0.87 0.94 0.97 8icz: 84 0.55 0.8 0.84 3rat: 96.2 0.71 0.66 0.96 8icy: 83.7 0.5 0.76 0.84 8rat: 96.2 0.8 0.76 0.96 110m: 83.6 0.11 0.92 0.84 4ca2: 94.9 0.77 0.71 0.95 4fis: 83 0.07 0.85 0.83 4hhb: 93.4 0.34 0.91 0.93 8icf: 83 0.61 0.83 0.83 4ald: 93.1 0.67 0.81 0.93 3mbp: 82.9 0.56 0.76 0.83 8gep: 92.4 0.78 0.84 0.92 105m: 82.6 0.17 0.87 0.83 3rsd: 92.3 0.81 0.8 0.92 8acn: 82.6 0.68 0.82 0.83 3srn: 92.3 0.75 0.73 0.92 8ice: 82.5 0.56 0.82 0.82 3vsb: 90.9 0.78 0.82 0.91 8ics: 82.5 0.53 0.79 0.82 4aop: 90.2 0.78 0.87 0.9 4cts: 82.4 0.33 0.76 0.82 4gep: 90.1 0.78 0.87 0.9 8icc: 82.3 0.55 0.83 0.82 3rsk: 88.5 0.8 0.82 0.88 8icj: 81.8 0.58 0.84 0.82 4cac: 88.1 0.75 0.74 0.88 4csc: 81.7 0.54 0.94 0.82 4er4: 88 0.4 0.27 0.88 101m: 81.7 0.15 0.91 0.82 11bg: 87.5 0.63 0.64 0.88 8icx: 81.5 0.54 0.82 0.81 4fap: 87.1 0.45 0.71 0.87 8icw: 81.5 0.56 0.81 0.81 132l: 86.8 0.8 0.92 0.87 3tms: 81.4 0.62 0.79 0.81 4fxc: 86.7 0.42 0.38 0.87 8ick: 81.1 0.61 0.86 0.81


Protein P- C- Recall- Precision- Protein P-helix C-helix Recall- Precisio Code helix helix helix helix Code helix n-helix 8ico: 80.4 0.54 0.84 0.8 252l: 73.2 0.37 0.8 0.73 3mag: 79.8 0.61 0.84 0.8 179l: 73.1 0.42 0.83 0.73 4ake: 79.7 0.5 0.79 0.8 8icv: 73.1 0.48 0.82 0.73 8adh: 79.4 0.53 0.65 0.79 4er1: 73 0.38 0.34 0.73 8icr: 79.4 0.56 0.86 0.79 11as: 73 0.19 0.51 0.73 8icq: 79.4 0.62 0.88 0.79 8icm: 73 0.46 0.81 0.73 112m: 79.3 0.06 0.91 0.79 3leu: 72.7 0.78 1 0.73 108m: 79.2 0.13 0.9 0.79 3lyz: 72.6 0.65 0.88 0.73 8ica: 79 0.55 0.85 0.79 148l: 72.6 0.3 0.8 0.73 3sic: 78.5 0.49 0.58 0.78 174l: 72.3 0.2 0.78 0.72 16vp: 78.5 0.31 0.81 0.78 3ng1: 72.1 0.28 0.75 0.72 4cyh: 78.3 0.56 0.55 0.78 4ayk: 71.8 0.46 0.61 0.72 8icg: 77.9 0.55 0.85 0.78 200l: 71.7 0.43 0.88 0.72 3il8: 77.8 0.62 0.74 0.78 232l: 71.7 0.35 0.84 0.72 8ruc: 77.7 0.49 0.73 0.78 9atc: 71.4 0.37 0.57 0.71 3lyt: 77.6 0.61 0.8 0.78 456c: 71.3 0.4 0.55 0.71 8icb: 77.4 0.56 0.85 0.77 4ape: 71 0.28 0.24 0.71 3orc: 76.9 0.36 0.62 0.77 8api: 70.9 0.12 0.37 0.71 966c: 76.9 0.38 0.52 0.77 228l: 70.8 0.4 0.87 0.71 3tmy: 76.8 0.47 0.77 0.77 238l: 70.8 0.36 0.85 0.71 8ich: 76.7 0.53 0.84 0.77 104l: 70.7 0.09 0.74 0.71 8icp: 76.5 0.62 0.91 0.77 4enl: 70.6 0.41 0.75 0.71 151l: 76.5 0.3 0.69 0.76 168l: 70.5 0.23 0.8 0.7 8icl: 76.4 0.57 0.88 0.76 3pva: 70.3 0.27 0.37 0.7 4dcg: 76.1 0.47 0.76 0.76 107l: 70.2 0.31 0.82 0.7 3rub: 76 0.38 0.67 0.76 213l: 70.2 0.31 0.82 0.7 12as: 76 0.28 0.56 0.76 3sxl: 69.9 0.46 0.61 0.7 8icn: 75.6 0.56 0.87 0.76 244l: 69.8 0.33 0.84 0.7 3pgm: 75.4 0.37 0.61 0.75 8lyz: 69.8 0.6 0.86 0.7 3wrp: 75.3 0.17 0.89 0.75 176l: 69.8 0.32 0.82 0.7 3sak: 75 0.35 0.8 0.75 8xia: 69.6 0.31 0.76 0.7 259l: 75 0.31 0.81 0.75 4dmr: 69.6 0.3 0.59 0.7 8rnt: 75 0.37 0.39 0.75 214l: 69.2 0.29 0.82 0.69 187l: 74.5 0.38 0.85 0.75 209l: 69.1 0.34 0.82 0.69 193l: 74.5 0.67 0.88 0.75 4fit: 69.1 0.26 0.54 0.69 3seb: 74.4 0.29 0.36 0.74 130l: 68.9 0.34 0.85 0.69 4gsa: 74.3 0.33 0.62 0.74 181l: 68.9 0.34 0.85 0.69 4blm: 74.2 0.27 0.63 0.74 3mds: 68.8 0.22 0.73 0.69 3tss: 74.2 0.45 0.47 0.74 3mba: 68.8 0.05 0.86 0.69 4eug: 74.1 0.28 0.59 0.74 4cla: 68.8 0.14 0.43 0.69 3ktq: 74.1 0.26 0.73 0.74 112l: 68.6 0.32 0.84 0.69 169l: 73.9 0.3 0.79 0.74 192l: 68.6 0.37 0.86 0.69 120l: 73.6 0.28 0.81 0.74 256l: 68.6 0.37 0.86 0.69 182l: 73.6 0.41 0.87 0.74 3tmk: 68.5 0.35 0.74 0.68 4er2: 73.5 0.34 0.29 0.74 3pal: 68.3 0.19 0.77 0.68 3pfl: 73.2 0.27 0.71 0.73 3pah: 68.2 0.19 0.67 0.68 3lri: 68.2 0.34 0.56 0.68 216l: 65.5 0.16 0.78 0.66


173l: 68 0.31 0.82 0.68 3pat: 65.5 0.14 0.67 0.65 830c: 68 0.35 0.51 0.68 8icd: 65.5 0.24 0.62 0.65 110l: 67.9 0.33 0.85 0.68 141l: 65.4 0.3 0.84 0.65 11ba: 67.9 0.4 0.52 0.68 3xin: 65.3 0.3 0.76 0.65 262l: 67.7 0.23 0.81 0.68 3r1r: 65.2 0.27 0.71 0.65 150l: 67.7 0.25 0.82 0.68 131l: 65.1 0.34 0.86 0.65 189l: 67.7 0.13 0.75 0.68 245l: 65.1 0.33 0.85 0.65 246l: 67.6 0.29 0.82 0.68 201l: 65.1 0.2 0.78 0.65 4gst: 67.6 0.08 0.61 0.68 226l: 65.1 0.37 0.86 0.65 13gs: 67.5 0.22 0.75 0.67 3xim: 65 0.29 0.75 0.65 180l: 67.5 0.24 0.81 0.67 10gs: 65 0.27 0.78 0.65 3xis: 67.4 0.34 0.79 0.67 199l: 64.8 0.36 0.89 0.65 212l: 67.3 0.28 0.82 0.67 158l: 64.5 0.29 0.84 0.64 164l: 67 0.36 0.87 0.67 241l: 64.5 0.22 0.81 0.64 206l: 67 0.28 0.83 0.67 234l: 64.4 0.3 0.84 0.64 240l: 67 0.1 0.76 0.67 247l: 64.2 0.27 0.84 0.64 3ifb: 66.7 0.24 0.23 0.67 149l: 64.2 0.37 0.88 0.64 3rhn: 66.7 0.4 0.61 0.67 155l: 64.2 0.29 0.84 0.64 4caa: 66.7 0.14 0.42 0.67 163l: 64.2 0.34 0.87 0.64 4gss: 66.7 0.25 0.77 0.67 198l: 64.2 0.33 0.86 0.64 113l: 66.7 0.28 0.82 0.67 251l: 64.2 0.21 0.81 0.64 118l: 66.7 0.3 0.83 0.67 8tln: 64.1 0.21 0.6 0.64 8pti: 66.7 0.37 0.5 0.67 16gs: 63.9 0.12 0.71 0.64 8tli: 66.7 0.28 0.62 0.67 195l: 63.9 0.38 0.9 0.64 3pvi: 66.4 0.39 0.7 0.66 157l: 63.8 0.19 0.79 0.64 8gss: 66.4 0.27 0.78 0.66 175l: 63.8 0.19 0.79 0.64 126l: 66.4 0.3 0.85 0.66 210l: 63.8 0.26 0.82 0.64 229l: 66.4 0.33 0.86 0.66 Average 74.3 0.38 0.75 0.74 215l: 66.4 0.35 0.84 0.66 3nos: 66.1 0.28 0.66 0.66 4a3h: 66.1 0.31 0.65 0.66 103l: 66 0.17 0.8 0.66 108l: 66 0.29 0.83 0.66 114l: 66 0.27 0.82 0.66 233l: 66 0.35 0.86 0.66 167l: 66 0.37 0.87 0.66 4kmb: 66 0.22 0.54 0.66 3lhm: 66 0.47 0.73 0.66 14gs: 66 0.14 0.73 0.66 11gs: 65.9 0.21 0.76 0.66 20gs: 65.8 0.2 0.75 0.66 155c: 65.7 0.03 0.43 0.66 19gs: 65.6 0.17 0.74 0.66 22gs: 65.6 0.29 0.8 0.66 4gtu: 65.6 0.24 0.71 0.66


Table 6.2 : Prediction result for the beta sheet regions in the proteins

Protein P-sheet C- Recall- Precision- Protein P-sheet C-sheet Recall- Precision- Code sheet sheet sheet Code sheet sheet 3rsd: 88.64 0.82 0.91 0.89 4ake: 57.97 0.45 0.53 0.58 3vsb: 88.46 0.79 0.81 0.88 3ptd: 57.38 0.37 0.51 0.57 3rsp: 86.36 0.8 0.9 0.86 3znb: 57.35 0.32 0.57 0.57 3ptb: 85.71 0.7 0.82 0.86 3lyo: 57.14 0.17 0.21 0.57 3rsk: 84.78 0.8 0.93 0.85 3lym: 57.14 0.14 0.19 0.57 4cac: 84.42 0.71 0.82 0.84 3rla: 57.05 0.22 0.33 0.57 3tgi: 83.16 0.65 0.8 0.83 4est: 56.99 0.32 0.71 0.57 3srn: 82.61 0.78 0.93 0.83 8gep: 83.5 0.72 0.77 0.83 4azu: 82.56 0.7 0.83 0.83 8ca2: 82.28 0.69 0.82 0.82 4ccp: 81.82 0.87 0.95 0.82 8icp: 82.05 0.57 0.51 0.82 4aop: 80 0.72 0.81 0.8 8ici: 79.55 0.63 0.61 0.8 4gep: 80 0.73 0.82 0.8 8rat: 79.55 0.78 0.95 0.8 3tgj: 78.95 0.62 0.81 0.79 163l: 76.92 0.39 0.29 0.77 4ald: 77.78 0.71 0.75 0.78 8acn: 76.13 0.64 0.72 0.76 3tms: 77.14 0.63 0.75 0.77 8adh: 74.73 0.56 0.67 0.75 4cyh: 76.92 0.66 0.83 0.77 12ca: 74.03 0.55 0.73 0.74 3ptn: 76.62 0.59 0.79 0.77 8icq: 72.73 0.53 0.52 0.73 4ca2: 76.54 0.72 0.9 0.77 521p: 72.09 0.46 0.58 0.72 4fgf: 74.07 0.63 0.89 0.74 176l: 71.43 0.32 0.25 0.71 3phv: 72.5 0.31 0.69 0.73 167l: 70 0.33 0.27 0.7 3tgk: 72.04 0.52 0.76 0.72 8pti: 69.23 0.26 0.43 0.69 3rhn: 72 0.45 0.55 0.72 8icw: 68.42 0.51 0.51 0.68 3tpi: 70.79 0.51 0.73 0.71 8ica: 68.29 0.49 0.5 0.68 3rat: 70.45 0.64 0.86 0.7 8ics: 68.29 0.53 0.55 0.68 4csc: 70 0.29 0.15 0.7 8icf: 67.44 0.54 0.57 0.67 4aiy: 69.57 0.25 0.21 0.7 8icl: 66.67 0.43 0.45 0.67 4enl: 68.92 0.53 0.59 0.69 8icn: 66.67 0.44 0.45 0.67 3il8: 68.42 0.5 0.68 0.68 8icg: 65 0.48 0.5 0.65 4ccx: 68.18 0.78 0.94 0.68 8icj: 65 0.49 0.51 0.65 4cha: 67.27 0.42 0.71 0.67 8ico: 65 0.51 0.53 0.65 3pfk: 67.21 0.34 0.42 0.67 8icb: 64.44 0.44 0.47 0.64 3tss: 66.67 0.55 0.87 0.67 193l: 64.29 0.28 0.27 0.64 3trx: 66.67 0.32 0.46 0.67 194l: 64.29 0.2 0.21 0.64 3pnp: 66.67 0.39 0.55 0.67 8icz: 64.29 0.48 0.51 0.64 3kvt: 65 0.28 0.37 0.65 8icm: 64.1 0.42 0.42 0.64 3mag: 62 0.46 0.56 0.62 8gch: 63.95 0.38 0.71 0.64 3tmk: 60.92 0.32 0.36 0.61 149l: 63.64 0.26 0.19 0.64 3tlh: 60.42 0.34 0.81 0.6 8ick: 63.04 0.44 0.48 0.63 3tim: 60 0.53 0.66 0.6 8icu: 62.79 0.47 0.5 0.63 3uag: 60 0.28 0.43 0.6 8ice: 61.7 0.52 0.59 0.62 3sc2: 59.3 0.35 0.49 0.59 179l: 60 0.25 0.19 0.6 3xin: 58.49 0.31 0.31 0.58 192l: 60 0.35 0.32 0.6 3rnt: 58.06 0.32 0.6 0.58 13pk: 59.62 0.37 0.45 0.6 3sic: 58 0.48 0.71 0.58 Average 67.9229 0.4676 0.5836 0.6791


Table 6.3 : Prediction result for the turn/loop regions in the proteins

Protein P - C - turn Recall - Precision Protein P - turn C - turn Recall - Precision Code turn turn -turn Code turn –turn 8rat: 87.5 0.86 0.93 0.88 2ciq: 78.57 0.85 1 0.79 2c48: 86.79 0.75 0.74 0.87 3ptb: 77.61 0.67 0.83 0.78 1d2w: 86.21 0.67 0.69 0.86 2cib: 77.55 0.59 0.66 0.78 2cio: 85.71 0.78 0.75 0.86 1d2c: 77.22 0.64 0.77 0.77 2c4a: 85.61 0.59 0.61 0.86 6cpp: 77.22 0.62 0.66 0.77 2c4w: 85.19 0.82 0.88 0.85 1d46: 75.71 0.59 0.71 0.76 6icd: 85 0.65 0.65 0.85 4ccp: 75.41 0.77 0.9 0.75 2ciu: 84.57 0.66 0.73 0.85 3rsd: 75 0.74 0.89 0.75 2cij: 84.29 0.76 0.81 0.84 3srn: 75 0.74 0.88 0.75 6ca2: 84.06 0.68 0.76 0.84 3ptn: 75 0.59 0.76 0.75 6r1r: 83.4 0.6 0.62 0.83 1d2t: 75 0.65 0.77 0.75 3rsk: 83.33 0.76 0.83 0.83 1d2u: 75 0.65 0.77 0.75 2c46: 83.33 0.76 0.89 0.83 2cii: 75 0.54 0.48 0.75 6req: 83.27 0.74 0.78 0.83 2civ: 75 0.57 0.68 0.75 2cbq: 82.14 0.75 0.79 0.82 1d6p: 74.42 0.33 0.44 0.74 6abp: 82 0.49 0.46 0.82 2c40: 74.19 0.58 0.55 0.74 4ccx: 81.97 0.78 0.85 0.82 2cis: 73.91 0.78 1 0.74 1d2f: 81.97 0.7 0.75 0.82 2c4z: 73.87 0.63 0.74 0.74 1d2g: 81.97 0.62 0.65 0.82 4azu: 73.55 0.53 0.63 0.74 6a3h: 80.26 0.44 0.52 0.8 4fgf: 73.53 0.62 0.76 0.74 6cpa: 79.73 0.65 0.73 0.8 6xim: 73.46 0.48 0.54 0.73 6prn: 79.41 0.56 0.62 0.79 3tgi: 73.42 0.62 0.78 0.73 6gss: 79.31 0.57 0.59 0.79 2cik: 73.33 0.6 0.58 0.73 2cit: 79.17 0.62 0.76 0.79 4ca2: 73.24 0.6 0.76 0.73 6acn: 78.82 0.5 0.56 0.79 6prc: 72.97 0.54 0.63 0.73 6gpb: 78.63 0.6 0.6 0.79 3tgj: 72.6 0.61 0.77 0.73 2cbu: 72.16 0.62 0.68 0.72 8gep: 54.55 0.55 0.76 0.55 8ca2: 72.06 0.64 0.8 0.72 8icy: 54.29 0.49 0.68 0.54 3vsb: 71.05 0.69 0.89 0.71 8icr: 54.24 0.38 0.51 0.54 8gch: 69.49 0.57 0.73 0.69 4csc: 54.1 0.47 0.58 0.54 132l: 69.05 0.51 0.71 0.69 8icw: 53.95 0.51 0.73 0.54 4gat: 68.42 0.21 0.59 0.68 3sak: 53.85 0.34 0.33 0.54 4hir: 68.42 0.61 0.93 0.68 3sgq: 53.85 0.22 0.41 0.54 4gal: 68.33 0.41 0.53 0.68 200l: 53.85 0.4 0.48 0.54 3rsp: 67.74 0.74 0.95 0.68 3nll: 53.57 0.33 0.47 0.54 3leu: 66.67 0.68 1 0.67 8icf: 53.42 0.45 0.64 0.53 4ins: 66.67 0.32 0.29 0.67 3lck: 53.33 0.34 0.52 0.53 4gep: 66.28 0.6 0.72 0.66 8icb: 53.12 0.44 0.61 0.53 3tgk: 65.75 0.53 0.73 0.66 8icv: 53.12 0.4 0.55 0.53 4cac: 64.71 0.57 0.79 0.65 3pdz: 51.85 0.38 0.67 0.52 4cyh: 64.58 0.57 0.79 0.65 3znb: 51.72 0.41 0.65 0.52


Protein P - C - turn Recall - Precision Protein P - turn C - turn Recall - Precision Code turn turn -turn Code turn –turn 8acn: 63.64 0.56 0.71 0.64 3pyp: 51.72 0.3 0.48 0.52 4aop: 63.1 0.58 0.72 0.63 8ica: 51.72 0.44 0.59 0.52 4kiv: 62.5 0.43 0.91 0.62 8icn: 51.72 0.42 0.57 0.52 8ict: 61.54 0.49 0.62 0.62 8ico: 51.67 0.42 0.57 0.52 8ice: 60.32 0.54 0.68 0.6 8icg: 51.61 0.4 0.55 0.52 4bu4: 60 0.32 0.58 0.6 8icx: 51.61 0.41 0.56 0.52 3znf: 60 0.06 0.5 0.6 3tss: 51.43 0.4 0.56 0.51 8ici: 59.68 0.53 0.67 0.6 3mag: 51.11 0.4 0.53 0.51 192l: 59.26 0.41 0.47 0.59 3lyz: 50 0.37 0.68 0.5 3lhm: 59.09 0.39 0.65 0.59 4gsp: 50 0.33 0.62 0.5 3tim: 58.54 0.6 0.77 0.59 3lzt: 50 0.33 0.61 0.5 193l: 58.14 0.55 0.83 0.58 3ncm: 50 0.34 0.59 0.5 3rat: 58.06 0.51 0.72 0.58 3ng1: 49.47 0.39 0.52 0.49 3tpi: 57.83 0.46 0.74 0.58 3sgb: 49.02 0.18 0.39 0.49 3mbp: 57.14 0.46 0.6 0.57 Average 66.38 0.52 0.67 0.66 8icq: 57.14 0.49 0.63 0.57 8ich: 56.45 0.42 0.55 0.56 8lyz: 56.41 0.4 0.65 0.56 11ba: 56.36 0.39 0.55 0.56 3usn: 56.25 0.34 0.64 0.56 3sic: 55.56 0.44 0.67 0.56 221l: 55.56 0.34 0.42 0.56 8icp: 55.56 0.45 0.6 0.56 4hoh: 55.36 0.28 0.55 0.55 4cha: 55 0.43 0.67 0.55 12ca: 54.93 0.51 0.8 0.55 8icu: 54.84 0.47 0.62 0.55 35c8: 54.67 0.35 0.48 0.55 194l: 54.55 0.52 0.83 0.55

Examining the outputs of the individual FMCN-units, we can observe that some proteins are predicted well by one instance while remaining mostly unpredicted in another. When the outputs are combined, therefore, all the proteins get predicted with better accuracy.


Table 6.4: Averages of the results in the 2nd phase of our experiment

Average P-helix 74.3%
Average P-sheet 67.92%
Average P-turn 66.38%
Average Q3 accuracy 69.53%
Average C-helix 0.38
Average C-sheet 0.47
Average C-turn 0.52
Average Recall-helix 0.75
Average Recall-sheet 0.58
Average Recall-turn 0.67
Average Precision-helix 0.74
Average Precision-sheet 0.68
Average Precision-turn 0.66

6.4 Summary

This chapter described the final phase of our experiment in detail. It focused on the multiple instantiation technique, the algorithm for combining the outputs from the different units, and the algorithm for post-processing the combined output; the post-processing acts essentially as a smoothing function. The chapter also elaborated the experimental results by presenting the full result set for all three classes. From these tables it can be seen that many individual predictions are quite remarkable.

Our system achieved good overall prediction accuracy, with an average Q3 of around 70%. In many cases, however, it also showed very poor accuracy. Several components that would improve prediction are not yet incorporated in our system, and their absence is responsible for these poor results. These deficiencies are introduced in the next chapter and will be addressed in our future work.


Chapter 7

Conclusion and Future Work

7.1 Conclusion

It can be seen that neural network based methods are generally more accurate than earlier methods based on statistical measurements. The prediction accuracy (Q3) of the FMCN based system used in our experiment is around 70%, and almost 20% of the predictions are more than 80% accurate. These strong predictions are distributed throughout the whole set of proteins chosen for testing. As the training set size increases, good accuracies become more widespread because the knowledge area of the neural network grows. The prediction accuracy achieved in our experiment may be sufficient to determine the different folded regions of a protein, but it is still well below what is required for many applications, such as detailed tertiary structure prediction or quaternary structure prediction. Our work shows that FMCN can be applied to secondary structure prediction in proteins, and it is likely that FMCN will also be useful in recognizing other structural motifs in protein and DNA sequences.

The most important observation is that in the first phase of the experiment the result values spanned a wide range and deviated considerably from the mean, and this was seen for every set of experiments in that phase. The average Q3 accuracy achieved in the first phase is around 67%, while in the second phase the average Q3 value was around 69.5%, an improvement of around 2.5%. Although the improvement itself is modest, the most noticeable change is that in the second phase the number of proteins predicted correctly increased, so the deviation of the individual result values from the mean is not high. We have tested our system with only four instances of the FMCN-unit; with more computing power, more FMCN-units can be instantiated and trained with a larger number of proteins, which should result in better prediction accuracy.

The results of our work are based on applying the bare FMCN classifier algorithm to the protein data. Neural network based prediction schemes that achieve more than 75% accuracy generally rely not only on the neural network itself but also on significant preprocessing of the primary sequence data and several post-processing steps on the predicted data. Apart from this, the FMCN algorithm can be improved further, as mentioned in chapter 4, by dividing the hyperboxes into more granular regions or by using an appropriate probability density function. As future work, we will try to add preprocessing and further post-processing techniques to our system to make it stronger. We will also try to improve the FMCN algorithm itself and use the modified version for prediction, in the hope that the resulting system will perform considerably better than the current one. The scope of future research is outlined in the following section.


7.2 Future Work

In chapter 4, it was pointed out that the FMCN algorithms can be improved further. Our first aim is therefore to develop the algorithms at a more granular level and come up with a more efficient version of FMCN.

It was also mentioned earlier that our system does not include any preprocessing of the data. From a close study of our training system and the training process, we have seen that if a training file contains around 60000 sample points, there is no guarantee that all of them will contribute to the learning of the network. Once a hyperbox has been created and expanded to its maximum limit, any new point falling inside it adds no new knowledge to the neural network, and since no density function of the sample points is used, these extra points are effectively wasted. The training file may also contain many repeated sample points, caused by sequence similarities among the proteins used to generate it. Before generating the training file we therefore need to perform some kind of similarity search and then select proteins whose sequences are largely dissimilar; only those should be used for generating the training file. This will improve the knowledge base of the neural network with fewer proteins, so the wastage will be smaller.

In the multiple instantiation of the FMCN-units discussed in chapter 6, the different instances of the FMCN need to be trained with different training sets. Moreover, the training sets should be chosen so that the proteins in any pair of sets are disjoint in their primary structures. This will make optimum use of the multiple instantiation of the system.

To give our neural network a good amount of knowledge for all three classes of secondary structure, we first need to divide the proteins into three groups: one containing mainly alpha helices, one containing mostly sheet-like structures and one containing plenty of turns. A good mix of the three types of proteins should then be selected for training the neural network. This will result in a well-balanced training of the network and will also help in obtaining similar accuracies for all three classes of secondary structure.

Finally, as the most useful preprocessing step, we will adopt a technique to generate the sequence profile of a group of proteins. The use of such evolutionary information has been shown to give better prediction accuracies. We will use multiple sequence alignment to generate the sequence profile, which will then be used for training our FMCN based system.

After augmenting our multiply instantiated FMCN based prediction system with these preprocessing mechanisms, the whole system should become considerably stronger. It is expected that a system designed in this way will give good prediction accuracy for all three classes of secondary structure.


Appendix A

Table A.1 : The list of amino acids with their three letter code and the one letter code

Amino acid                      Three-letter code   One-letter code
alanine                         ala                 A
arginine                        arg                 R
asparagine                      asn                 N
aspartic acid                   asp                 D
asparagine or aspartic acid     asx                 B
cysteine                        cys                 C
glutamic acid                   glu                 E
glutamine                       gln                 Q
glutamine or glutamic acid      glx                 Z
glycine                         gly                 G
histidine                       his                 H
isoleucine                      ile                 I
leucine                         leu                 L
lysine                          lys                 K
methionine                      met                 M
phenylalanine                   phe                 F
proline                         pro                 P
serine                          ser                 S
threonine                       thr                 T
tryptophan                      trp                 W
tyrosine                        tyr                 Y
valine                          val                 V


Appendix B

1. Expandability criterion checked before expanding a hyperbox

Let $b_j$ be the hyperbox being tested for expandability, with min point $V_j = (v_{j1}, \dots, v_{jn})$ and max point $W_j = (w_{j1}, \dots, w_{jn})$, and let $A = (a_1, \dots, a_n)$ be the input sample. The following condition is checked for expandability:

$$\sum_{i=1}^{n} \bigl( \max(w_{ji}, a_i) - \min(v_{ji}, a_i) \bigr) \;\le\; n\,\theta$$

where $n$ is the dimensionality and $\theta$ is the expansion coefficient; $w_{ji}$ and $v_{ji}$ are the $i$-th coordinate values of the max and min points of the $j$-th hyperbox $b_j$. The left-hand side is the total expansion, summed over all dimensions, that would result from including the sample point $A$ within the hyperbox $b_j$.
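A direct sketch of this test, assuming the min point, max point and sample are plain coordinate lists of length n:

def is_expandable(v, w, a, theta):
    # v, w: min and max points of hyperbox b_j; a: input sample; theta: expansion coefficient.
    n = len(a)
    growth = sum(max(w[i], a[i]) - min(v[i], a[i]) for i in range(n))
    return growth <= n * theta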

2. The Procedures Used in the Training algorithms

The following procedures are applied in the training algorithm of both the old and improved FMCN. All these functions are used in the adjustment section of the training algorithm.

Procedure: Isolation_Test

Input: One pair of hyperboxes

Output: True ,if the input hyperboxes are isolated. False, if they are not isolated.

Begin Isolation_Test

for each dimension, test {

    whether the ranges of the two hyperboxes in this dimension (between their min points and max points) overlap; if they overlap, remember that.

}

if for all dimensions the above test shows that the ranges overlap,


Declare that the two hyperboxes are not isolated by returning false.

else, i.e. if for at least one dimension the ranges do not overlap,

Declare that hyperboxes are isolated by returning true.

End Isolation Test
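A sketch of Isolation_Test on hyperboxes represented as (min point, max point) pairs of coordinate lists:

def is_isolated(box1, box2):
    (v1, w1), (v2, w2) = box1, box2
    for i in range(len(v1)):
        if w1[i] < v2[i] or w2[i] < v1[i]:   # ranges do not overlap in dimension i
            return True                       # isolated
    return False                              # ranges overlap in every dimension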

Procedure : Containment_Test between the hyperboxes bj and bk

Input: One pair of hyperboxes

Output: True if one of the input hyperboxes fully or partially contains the other; False if neither hyperbox fully or partially contains the other.

Begin Containment Test

for each dimension {

    check whether, in this dimension, the range of one hyperbox (between its min point and max point) fully encompasses the range of the other; check both possibilities, and if so, remember it.

}

if the above test shows that in at least one dimension one range completely encompasses the other range,

then declare that the hyperbox whose coordinate range (from min point to max point) encompasses that of the other contains the other hyperbox, by returning true.

Else, i.e. if in no dimension does one range fully encompass the other, then

Declare that no containment has occurred, by returning false.

End Containment Test
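A sketch of Containment_Test implementing the check exactly as stated above, on the same (min point, max point) representation:

def has_containment(box1, box2):
    (v1, w1), (v2, w2) = box1, box2
    for i in range(len(v1)):
        one_in_two = v2[i] <= v1[i] and w1[i] <= w2[i]   # box1's range inside box2's
        two_in_one = v1[i] <= v2[i] and w2[i] <= w1[i]   # box2's range inside box1's
        if one_in_two or two_in_one:
            return True
    return False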


Procedure : Create_OCN

Begin

Calculate the min point and max point of the hyperbox corresponding to the Overlap Compensatory Neuron.

for each dimension ,

Begin

Set the min-point value for this coordinate as the maximum of two values of the min-point coordinates.

Set the max-point value for this coordinate as the minimum of two values of the max-point coordinates.

End

End

Procedure : Create_CCN

Begin

Calculate the min point and max point of the hyperbox corresponding to the Containment Compensatory Neuron.

for each dimension ,

Begin

Set the min-point value for this coordinate as the maximum of two values of the min-point coordinates.

Set the max-point value for this coordinate as the minimum of two values of the max-point coordinates.

End

End
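Create_OCN and Create_CCN share the same min/max computation; on the (min point, max point) representation used above it amounts to taking the region common to the two hyperboxes:

def overlap_box(box1, box2):
    (v1, w1), (v2, w2) = box1, box2
    v = [max(v1[i], v2[i]) for i in range(len(v1))]   # min point: maximum of the two min coordinates
    w = [min(w1[i], w2[i]) for i in range(len(w1))]   # max point: minimum of the two max coordinates
    return v, w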


Appendix C

Table C.1 : List of proteins with which the neural network was trained for result set 1 ( Chapter 5) in the first phase of our experiment.

1c0g 1c0i 1c0k 1c0l 1c0m 1c0n 1c0p 1c0t 1c0u 1c0v 1c0w 1c40 1c41 1c43 1c44 1c45 1c46 1c47 1c48 1c49 1c4a 1c4b 1c4t 1c4u 1c4v 1c4w 1c4x 1c4y 1c4z 1c80 1c8c 1c8d 1c8m 1c8n 1c8o 1c8p 1c8q 1c8r 1c8s 1c8t 1c8u 1c8v 1c8w 1c8x 1c8y 1c8z 1cbf 1cbg 1cbh 1cbi 1cbj 1cbk 1cbl 1cbm 1cbn 1cbo 1cbq 1cbr 1cbs 1cbu 1cbv 1cbw 1cbx 1cf0 1cf1 1cf2 1cf3 1cf4 1cf5 1cf7 1cf8 1cf9 1cfa 1cfb 1cfc 1cfd 1cfe 1cff 1cfg 1cfh 1cfi 1cfj 1cfm 1cfn 1cfp 1cfq 1cfr 1cfs 1cft 1cfv 1cfw 1cfy 1cfz 1cia 1cib 1cic 1cid 1cie 1cif 1cig 1cih 1cii 1cij 1cik 1cil 1cim 1cin 1cio 1cip 1ciq 1cir 1cis 1cit 1ciu 1civ 1ciw 1cm0 1cm1 1cm2 1cm3 1cm4 1cm5 1cm7 1cm8 1cm9 1cma 1cmb 1cmc 1cmf 1cmg 1cmi 1cmj 1cmk 1cml 1cmn 1cmo 1cmp 1cmq 1cmr 1cms 1cmt 1cmu 1cmv 1cmw 1cmx 1cmy 1cmz 1cq0 1cq1 1cq2 1cq3 1cq4 1cq6 1cq7 1cq8 1cq9 1cqa 1cqd 1cqe 1cqf 1cqg 1cqh 1cqi 1cqj 1cu0 1cu1 1cu2 1cu3 1cu4 1cu5 1cu6 1cua 1cub 1cuc 1cud 1cue 1cuf 1cug 1cuh 1cui 1cuj 1cuk 1cul 1cun

Table C.2: List of proteins with which the neural network was tested for prediction for result set 1 (Chapter 5) in the first phase of our experiment.

2cip 2ciq 2cir 2cis 2cit 2ciu 2civ 2ciw 4er4 4fua 4gat 4lve 4mon 4ovo 4trx 4upj 4wbc 599n 5bj4 5cev 5daa 5er1 5hck 5lip 5lzm 5rsa 5tss 6ame 6ca2 6cpp 6hbi 6mht 6pcy 6q21 6std 7abp 7ac2 7ahi 7cel 7cpp 7fab 7gpb 7icg 7ick 7lpr 7pck 7prc 7rxn 7tli 7yas 8cpa 8gss 8ice 8jdw 8pch 8rat 8xia 9abp


Table C.3 : Selected Prediction results where alpha helix predictions were of average quality in result set 1 ( Chapter 5) for the first phase of our experiment.

Protein P-Helix C-Helix Recall- Precision - Code Helix Helix 2ciq: 57.14 0.42 0.62 0.57 9abp: 100 0.59 0.44 1 6pcy: 88.24 0.4 0.31 0.88 6std: 83.33 0.4 0.33 0.83 8cpa: 74.55 0.35 0.68 0.75 2cir: 72.73 0.31 0.47 0.73 7cpp: 68.75 0.47 0.75 0.69 7gpb: 66.67 0.6 0.83 0.67 6mht: 66 0.52 0.73 0.66 7cel: 58 0.58 0.91 0.58 7fab: 58 0.58 0.91 0.58 8pch: 55.26 0.33 0.61 0.55 2cis: 54.55 0.07 0.25 0.55 5tss: 53.06 0.27 0.71 0.53 2civ: 52.82 0.24 0.63 0.53 7pck: 50 -0.19 0.31 0.5 7tli: 49.6 0.36 0.74 0.5 4mon: 48.99 0.24 0.94 0.49 5cev: 47.53 0.32 0.69 0.48 8jdw: 47.24 0.22 0.67 0.47 4trx: 46.21 0.24 0.69 0.46 8ice: 45.24 0.24 0.37 0.45 8gss: 44.44 0.21 0.68 0.44 7prc: 43.86 0.22 0.66 0.44 7abp: 41.67 0.25 0.43 0.42 Average 58.9552 0.3296 0.6144 0.59


Table C.4 : Selected Prediction results where beta sheet predictions were of average quality in result set 1( Chapter 5) for the first phase of our experiment.

Protein P-sheet C-sheet Recall - Precision - Code sheet sheet 2ciq: 77.78 0.52 0.67 0.78 2cir: 50 0.14 0.25 0.5 2cis: 50 0.34 0.57 0.5 2cit: 50 0.17 0.19 0.5 2ciu: 50.99 0.21 0.52 0.51 2ciw: 53.88 0.27 0.79 0.54 4mon: 41.67 0.13 0.1 0.42 5cev: 54.74 0.31 0.47 0.55 5er1: 44.26 0.12 0.44 0.44 5lip: 66.67 0.28 0.35 0.67 5lzm: 36 0.29 0.68 0.36 5rsa: 44.5 0.27 0.82 0.44 5tss: 63.06 0.39 0.49 0.63 6cpp: 52.46 0.2 0.43 0.52 6mht: 55.56 0.33 0.57 0.56 6std: 42.67 0.23 0.78 0.43 7abp: 51.22 0.34 0.64 0.51 7cpp: 48.28 0.42 0.67 0.48 7icg: 42.16 0.09 0.36 0.42 7ick: 43.2 0.39 0.86 0.43 7lpr: 48.12 -0.1 0.53 0.48 7prc: 46.39 0.27 0.41 0.46 7tli: 54.14 0.23 0.48 0.54 7yas: 47.73 0.33 0.67 0.48 8cpa: 61.9 0.48 0.57 0.62 8ice: 45.24 0.3 0.64 0.45 8pch: 56.36 0.29 0.46 0.56 8rat: 48.15 0.21 0.34 0.48 9abp: 44.16 0.38 0.91 0.44 Average 50.73 0.27 0.54 0.50


Table C.5 : Selected Prediction results where the turn / loop - predictions were of average quality in result set 1( Chapter 5) for the first phase of our experiment.

Protein Code P-Turn C-Turn Recall-Turn Precision- Turn 2cit: 62.5 0.17 0.48 0.62 2ciu: 65.71 0.26 0.47 0.66 2civ: 46.74 0.17 0.42 0.47 2ciw: 63.35 0.27 0.44 0.63 4er4: 69.57 0.25 0.44 0.7 4fua: 66.67 0.16 0.41 0.67 4gat: 70.83 0.26 0.46 0.71 4lve: 68.18 0.23 0.43 0.68 4mon: 59.38 0.19 0.2 0.59 4ovo: 28.57 -0.01 0.1 0.29 4trx: 65.71 0.37 0.49 0.66 4upj: 38.46 0.1 0.34 0.38 4wbc: 67.96 0.17 0.34 0.68 599n: 61.59 0.26 0.5 0.62 5cev: 55.1 0.26 0.43 0.55 5daa: 60.29 0.19 0.46 0.6 5hck: 63.77 0.25 0.5 0.64 5lip: 70.93 0.42 0.58 0.71 5lzm: 64.29 0.25 0.5 0.64 5rsa: 73.12 0.29 0.42 0.73 5tss: 58.76 0.38 0.47 0.59 6ame: 61.54 0.26 0.5 0.62 6ca2: 33.33 -0.02 0.09 0.33 6cpp: 47.93 0.2 0.44 0.48 6mht: 70.77 0.54 0.62 0.71 6pcy: 50 0.3 0.45 0.5 6std: 42.31 0.18 0.34 0.42 7abp: 70.37 0.4 0.6 0.7 7ahi: 54.84 0.17 0.2 0.55 7cel: 71.11 0.39 0.6 0.71 7cpp: 66.67 0.38 0.44 0.67 7fab: 71.11 0.39 0.6 0.71 7gpb: 66.67 0.37 0.64 0.67 7icg: 55.46 0.25 0.41 0.55 7ick: 79.35 0.38 0.56 0.79 7lpr: 66.04 0.3 0.35 0.66 7prc: 64.39 0.29 0.45 0.64 7yas: 73.97 0.42 0.57 0.74 8gss: 49.18 0.22 0.4 0.49 8ice: 70.51 0.41 0.6 0.71 8rat: 59.57 0.15 0.31 0.6 8xia: 60 -0.04 0.46 0.6 Average 61.10 0.25 0.44 0.61


Table C.6 : List of proteins with which the neural network was trained for result set 2( Chapter 5) for the first phase of our experiment.

3c2c 3caa 3cao 3car 3ccx 3cd2 3cd4 3cel 3cev 3cgt 3chb 3ci2 3ck0 3cla 3cln 3cmh 3cms 3cna 3cp4 3cpa 3cro 3crx 3csc 3csm 3csu 3cti 3ctn 3cts 3cyh 3cyr 3cys 3cyt 3gal 3gar 3gat 3gb1 3gbp 3gbq 3gcb 3gcc 3gch 3gct 3geo 3gly 3gpb 3gpd 3grs 3grt 3grx 3gsb 3gsp 3gss 3gst 3gwx 3icb 3icd 3ifb 3il8 3ink 3ins 4bdp 4blc 4blm 4bp2 4bu4 4gal 4gat 4gbq 4gep 4gpb 4gpd 4gr1 4grt 4gsa 4gsp 4gss 4gst 4gtu 4i1b 4ifm 4ins 4sbv 4sdh 4sga 4sgb 4skn 4sli 4srn 4std 4ubp 4ukd 4ull 4upj 4xis 721p 7acn 7adh 7ahl 7ame 7at1 7atj 7ca2 7ccp 7cei 7cel 7cpp 7dfr 7fab 7fd1 7fdr 7gat 7gch 7gpb 7hsc 7i1b 7icd 7ice 7icf 7icg 7ich 7ici 7icj 7ick 7icl 7icm 7icn 7ins 7jdw 7kme 7lpr 7mdh 7msf 7msi 7nn9 7odc 7pck 7pcy 7prc 7prn 7ptd 7pti 7r1r 7rat 7req 7rsa 7rxn 7std 7tim 7tli 7tln 7wga 7xim 7yas 7znf 966c 9abp 9ame 9atc 9ca2 9gaa 9gac 9gaf 9gpb 9gss 9hvp 9ica 9icb 9icc 9icd 9ict 9icu 9icv 9icw 9icx 9icy 9ilb 9ins 9jdw 9ldb 9lpr

Table C.7 : List of proteins with which the neural network was tested for result set 2( Chapter 5) in the first phase of our experiment.


6a3h 6abp 6acn 6adh 6ald 6ame 6apr 6at1 6ca2 6ccp 6cgt 6cha 6chy 6cmh 6cox 6cp4 6cpa 6cpp 6cts 6dfr 6enl 6est 6gep 6gpb 6gsp 6gss 6gst 6gsu 6gsv 6gsw 6gsx 6gsy 6i1b 6icd 6ldh 6lyt 6lyz 6lzm 6pad 6pah 6pax 6paz 6pcy 6prc 6prn 6q21 6r1r 6rat 6req 6rhn 6rlx 6rnt 6rxn 6xim



Table C.8 : List of results on Helix Prediction in result set 2( Chapter 5) in the first phase of our experiment.

Protein Code P-Helix C-helix Recall-Helix Precision - Helix 6gsp: 80 0.52 0.44 0.8 6adh: 72.73 0.18 0.29 0.73 6cts: 70.25 0.24 0.94 0.7 6gss: 65 0.08 0.08 0.65 6r1r: 64.86 -0.04 0.32 0.65 6gst: 61.05 0.2 0.61 0.61 6gsv: 57.24 0.07 0.58 0.57 6abp: 57.14 0.07 0.33 0.57 6ccp: 56.82 -0.12 0.58 0.57 6chy: 56.08 0.05 0.91 0.56 6prc: 55.46 0.07 0.52 0.55 6xim: 55.4 -0.16 0.52 0.55 6gep: 54.76 0.12 0.28 0.55 6gsy: 48.62 -0.02 0.58 0.49 6gsw: 46.77 -0.02 0.49 0.47 6gsu: 45.61 -0.1 0.41 0.46 6acn: 45.45 -0.15 0.26 0.45 6cpa: 44 0.06 0.12 0.44 6gpb: 38.74 -0.08 0.4 0.39 6pad: 38.46 0.14 0.17 0.38 6icd: 37.78 -0.06 0.14 0.38 6rlx: 37.1 0.29 0.96 0.37 6pah: 36.7 0.11 0.6 0.37 6pcy: 36.11 -0.11 0.54 0.36 6rxn: 35.29 0.35 0.86 0.35 6at1: 35 0 0.05 0.35 6q21: 33.89 0.12 0.84 0.34 6dfr: 33.33 0.09 0.47 0.33 6enl: 33.33 0.05 0.25 0.33 6gsx: 33.33 -0.17 0.14 0.33 6lzm: 32.56 -0.14 0.22 0.33 6cgt: 31.82 0.23 0.82 0.32 6ald: 31.03 -0.19 0.38 0.31 6rhn: 30.95 0.22 0.68 0.31 6a3h: 28.77 0.04 0.62 0.29 6ca2: 27.27 -0.09 0.57 0.27 6paz: 26.53 -0.1 0.43 0.27 6cox: 25.76 0.01 0.53 0.26 6ame: 25 0.12 0.33 0.25 Average 44.25 0.04 0.468 0.44


Table C.9 : List of results on Sheet Prediction in result set 2( Chapter 5) in the first phase of our experiment.

Protein P-sheet C-sheet Recall - Precision - Code sheet sheet 6a3h: 48.94 0.3 0.4 0.49 6ame: 83.33 0.47 0.42 0.83 6apr: 43.33 0.01 0.23 0.43 6at1: 31.03 0.04 0.67 0.31 6ca2: 50 0.09 0.05 0.5 6cha: 50 0.07 0.05 0.5 6cox: 40.38 0.04 0.23 0.4 6cpa: 28.21 -0.06 0.59 0.28 6cpp: 42.39 0.08 0.53 0.42 6dfr: 48.48 0.11 0.36 0.48 6enl: 31.65 0 0.42 0.32 6est: 39.34 -0.14 0.29 0.39 6gep: 30.67 -0.01 0.4 0.31 6gpb: 31.37 -0.01 0.2 0.31 6gsp: 30.3 0.09 0.5 0.3 6gss: 23.44 0.13 0.78 0.23 6gst: 27.66 0.18 0.43 0.28 6gsu: 27.59 0.2 0.44 0.28 6gsw: 20.48 0.01 0.23 0.2 6gsx: 22.22 0.09 0.41 0.22 6i1b: 25 -0.22 0.05 0.25 6icd: 21.05 -0.02 0.54 0.21 6ldh: 23.64 -0.01 0.21 0.24 6lyz: 50 0.13 0.08 0.5 6pad: 47.44 0.13 0.61 0.47 6pah: 27.91 -0.05 0.17 0.28 6paz: 35.71 -0.04 0.13 0.36 6prn: 51.43 0.03 0.31 0.51 6q21: 44.44 0.05 0.05 0.44 6rhn: 56 0.28 0.45 0.56 6rlx: 88.89 0.27 0.2 0.89 6rnt: 75.86 0.78 0.96 0.76 6rxn: 60 0.62 0.86 0.6 Average 41.15 0.11 0.37 0.41


Table C.10 : List of results on Turn/Loop Prediction in result set 2( Chapter 5) in the first phase of our experiment.

Protein P-Turn C-Turn Recall-Turn Precision- Code Turn 6abp: 7.14 -0.26 0.09 0.07 6acn: 29.41 0.22 0.71 0.29 6adh: 17.39 -0.1 0.36 0.17 6ald: 29.17 -0.05 0.37 0.29 6ame: 54.55 0.33 0.8 0.55 6apr: 51.81 0.08 0.35 0.52 6at1: 37.1 0.04 0.33 0.37 6ca2: 39.13 -0.02 0.32 0.39 6ccp: 29.17 -0.1 0.28 0.29 6cgt: 58.33 0.2 0.45 0.58 6cha: 50 0.04 0.34 0.5 6cox: 34.29 -0.03 0.26 0.34 6cp4: 30.77 0.05 0.31 0.31 6cpa: 33.33 0.01 0.27 0.33 6cpp: 29.5 -0.08 0.31 0.29 6cts: 40 0.14 0.19 0.4 6dfr: 31.43 0.01 0.31 0.31 6enl: 30.88 -0.1 0.3 0.31 6est: 63.04 0.28 0.42 0.63 6gep: 36.23 0.22 0.56 0.36 6gsp: 44.12 -0.07 0.41 0.44 6gss: 33.33 0.05 0.33 0.33 6gst: 38.1 0.09 0.29 0.38 6gsu: 22.37 -0.15 0.21 0.22 6gsv: 48 0.15 0.27 0.48 6gsw: 28 -0.03 0.25 0.28 6gsy: 32 0 0.2 0.32 6i1b: 23.08 -0.03 0.33 0.23 6icd: 45.21 0.15 0.37 0.45 6ldh: 35.71 -0.05 0.29 0.36 6lyt: 72.73 0.12 0.13 0.73 6lyz: 48.28 -0.11 0.23 0.48 6lzm: 29.09 -0.05 0.34 0.29 6pad: 36.84 0.01 0.39 0.37 6pah: 31.03 -0.04 0.25 0.31 6pax: 57.14 -0.19 0.4 0.57 6paz: 29.41 0.08 0.42 0.29 6pcy: 24.24 -0.05 0.24 0.24 6prc: 52.7 0.13 0.37 0.53 6prn: 42.65 0.05 0.32 0.43 6q21: 43.75 0.09 0.25 0.44 6r1r: 27.59 0.04 0.29 0.28 6rat: 35.71 0.03 0.29 0.36 6req: 28.81 -0.06 0.29 0.29 6rhn: 36.36 -0.06 0.22 0.36 6rnt: 93.55 0.77 0.81 0.94 6rxn: 60 0.09 0.19 0.6 6xim: 24.35 -0.03 0.29 0.24 Average 38.68 0.037 0.333 0.38


Table C.11 : List of proteins with which the neural network was trained for result set 3 (Chapter 5) in the first phase of our experiment.

1a46 1a47 1a48 1a49 1a4a 1a4b 1a4c 1a4e 1a4f 1a4g 1a4h 1a4i 1a4j 1a4k 1a4l 1a4m 1a4o 1a4p 1a4q 1a70 1a71 1a72 1a73 1a74 1a75 1a76 1a77 1a78 1a79 1a7a 1a7b 1a7c 1a7d 1a7e 1a7f 1a7g 1a7h 1a7i 1a7j 1a7k 1a7l 1a7m 1a7n 1a7o 1a7p 1a7q 1a7r 1a7s 1a7t 1a7u 1a7v 1a7w 1a7x 1a7y 1a7z 1a90 1a91 1a92 1a93 1a94 1a95 1a96 1a97 1a98 1a99 1a9b 1a9c 1a9e 1a9m 1a9n 1a9o 1a9p 1a9q 1a9r 1a9s 1a9t 1a9u 1a9v 1a9w 1a9x 1a9y 1a9z 2a20 2a21 2a22 2a23 2a25 2a26 2a29 2a2l 2a2m 2a2n 2a2o 2a2p 2a2r 2a2s 2a2u 2a2v 2a2y 2a2z 2a40 2a41 2a42 2a45 2a46 2a47 2a48 2a49 2a4a 2a4c 2a4d 2a4e 2a4f 2a4h 2a4j 2a4k 2a4m 2a4n 2a4o 2a4t 2a4v 2a4z 2a77 2a78 2a79 2a7a 2a7b 2a7c 2a7d 2a7f 2a7g 2a7h 2a7i 2a7j 2a7k 2a7l 2a7m 2a7o 3bam 3bbg 3bc2 3bcc 3bct 3bdo 3bdp 3bif 3bir 3bjl 3blg 3blm 3bls 3bp2 3bta 3btb 3btd 3bte 3btf 3btg 3bth 3btk 3bu4 4ca2 4caa 4cac 4cat 4ccp 4ccx 4cel 4cgt 4cha 4cla 4cms 4cox 4cp4 4cpa 4cpp 4cpv 4csc 4cts 4cyh 5eas 5eat 5eau 5ebx 5er1 5er2 5est 5eug 6taa 6tim 6tli 6tmn 7yas 8a3h 8aat 8abp 8acn 8adh 8api 8at1 966c 9abp 9ame 9atc 9ca2 9gaa 9gac 9gaf 9gpb 9gss 9hvp 9ica 9icb 9icc 9icd 9ict 9icu 9icv 9icw 9icx 9icy 9ilb 9ins 9jdw 9ldb 9lpr 9mht 9msi 9nse 9pcy 9pti 9rat 9rnt 9rsa 9wga 9xia

Table C.12 : List of proteins with which the neural network was tested for prediction in result set 3 (Chapter 5) in the first phase of our experiment.

1bas 2a40 2a41 2a42 2a43 2a44 2a45 2a46 2a47 2a48 2a49 2a4b 2a4c 2a4e 2a4m 2a4n 2a4o 2a4p 2a4q 2a4r 2a4s 2a4t 2a4u 2a4v 2a4w 2a4x 2a4y 2a4z 2b05 2b06 2b07 2b0a 2b0c 2b0d 2b0e 2b0h 2b0j 2b0l 2b0m 2b0o 2b0p 2b0q 2b0r 2b0t 2b0u 2b0v 2b0z 2b61 2b63 2b64 2b65 2b66 2b67 2b68 2b69 2b6a 2b6c 2b6d 2b6e 2b6g 2b6h 2b6n 2b6o 2b6p 2b6t 2b6w 2b6x 2b6y


Table C.13 : Selected Prediction results where the alpha-helix - predictions were of good quality in result set 3 (Chapter 5) in the first phase of our experiment..

Protein P-alpha C-beta Recall- Precision- alpha alpha 2a4v: 97.12 0.76 1 1 2a42: 96.49 0.8 0.96 0 2a4c: 94.71 0.92 0.95 0.92 2a4e: 94.67 0.63 0.95 0.94 2a40: 94.44 0.8 0.94 0.91 1bas: 94.12 0.93 0.94 0.94 2a4x: 93.68 0.87 0.94 0.91 2a43: 91.07 0.75 0.91 0.7 2a4n: 90.97 0.78 0.91 0.85 2a4o: 88.64 0.88 0.89 0.84 2a4p: 86.67 0.95 0.87 0.93 2a4t: 85.87 0.89 0.86 0.87 2a41: 85.71 0.7 0.86 0.8 2a4q: 85.71 0.91 0.86 0.89 2a4s: 85.11 0.91 0.85 0.88 2a48: 84.81 0.87 0.85 0.9 2a4y: 84.78 0.78 0.85 0.85 2a4z: 84.17 0.78 0.84 0.85 2a4r: 83 0.88 0.83 0.88 2a4u: 80.88 0.8 0.81 0.77 2a46: 79.63 0.78 0.8 0.92 2a4b: 75.68 0.41 0.76 0.85 2a47: 75.6 0.78 0.76 0.93 2a49: 75.45 0.64 0.75 0.7 2b6x: 73.08 0.5 0.73 0.48 2b6t: 72.64 0.5 0.73 0.48 2b6w: 72.64 0.5 0.73 0.48 2b6y: 72.64 0.5 0.73 0.48 2a45: 71.15 0.67 0.71 0.86 2a4m: 63.64 0.72 0.64 0.88 2b0z: 63.25 0.6 0.63 0.08 2a4w: 59.9 0.45 0.6 0.7 2a44: 57.89 0.53 0.58 0.87 2b64: 57.89 0.22 0.58 0.39 2b0l: 57.14 0.06 0.57 0.19 2b66: 56.84 0.19 0.57 0.34 2b0j: 55.61 0.31 0.56 0.34 2b0v: 54.55 0.36 0.55 0.76 2b07: 54.39 0.32 0.54 0.47 2b67: 52.56 0.21 0.53 0.38 2b05: 51.22 0.3 0.51 0.4 2b0t: 50.85 0.19 0.51 0.34 2b0c: 50 0.31 0.5 0.41 2b0o: 48.19 0.16 0.48 0.14 2b69: 45.53 0.28 0.46 0.45 2b65: 44.74 0.18 0.45 0.35 2b0m: 44.51 0.22 0.45 0.34 2b6c: 44.01 0.03 0.44 0.01 Average 72.22 0.57 0.72 0.63


Table C.14: Selected Prediction results where the Beta-sheet - predictions were of good quality in result set 3 (Chapter 5) in the first phase of our experiment..

Protein Code P-sheet C-sheet Recall -sheet Precision - sheet 1bas: 95.74 0.93 0.94 0.96 2a40: 85.71 0.8 0.91 0.86 2a44: 80.7 0.53 0.87 0.81 2a45: 72.96 0.67 0.86 0.73 2a46: 81.5 0.78 0.92 0.81 2a47: 80.93 0.78 0.93 0.81 2a48: 93.1 0.87 0.9 0.93 2a49: 74.4 0.64 0.7 0.74 2a4c: 96.38 0.92 0.92 0.96 2a4e: 64.81 0.63 0.94 0.65 2a4m: 94.39 0.72 0.88 0.94 2a4n: 83 0.78 0.85 0.83 2a4o: 96.34 0.88 0.84 1 2a4p: 94.45 0.95 0.93 1 2a4q: 98.51 0.91 0.89 0.99 2a4r: 95.77 0.88 0.88 0.96 2a4s: 100 0.91 0.88 1 2a4t: 98.51 0.89 0.87 0.99 2a4u: 92.45 0.8 0.77 0.92 2a4v: 75.56 0.76 1 0.76 2a4x: 90.19 0.87 0.91 0.9 2a4y: 81.97 0.78 0.85 0.82 2a4z: 81.97 0.78 0.85 0.82 2b07: 53.97 0.32 0.47 0.54 2b0a: 53.97 0.29 0.63 0.54 2b0c: 53.12 0.31 0.41 0.53 2b0d: 56.82 0.29 0.51 0.57 2b0e: 54.62 0.26 0.49 0.55 2b0j: 59.09 0.31 0.34 0.59 2b0m: 48.28 0.22 0.34 0.48 2b0v: 50.79 0.36 0.76 0.51 2b61: 63.93 0.34 0.41 0.64 2b64: 46.08 0.22 0.39 0.46 2b69: 48.28 0.28 0.45 0.48 2b6e: 59.09 0.38 0.72 0.59 2b6h: 73.47 0.46 0.59 0.73 2b6t: 66.67 0.5 0.48 0.67 2b6w: 65.67 0.51 0.45 0.67 2b6x: 66.67 0.5 0.47 0.67 2b6y: 68.67 0.52 0.48 0.67 Average 75.1935 0.61325 0.717 0.752


Table C.15: Selected prediction results where the turn/loop predictions were of good quality, in result set 3 (Chapter 5) in the first phase of our experiment.

Protein Code  P-Turn  C-Turn  Recall-Turn  Precision-Turn
1bas: 88.89 0.9 0.96 0.89
2a40: 78.57 0.78 0.92 0.79
2a41: 85.71 0.73 0.67 0.86
2a42: 62.5 0.64 0.71 0.62
2a44: 68.18 0.57 0.68 0.68
2a45: 82.35 0.59 0.61 0.82
2a46: 86.73 0.67 0.67 0.87
2a47: 86.67 0.62 0.59 0.87
2a48: 88.89 0.74 0.74 0.89
2a49: 78.04 0.56 0.55 0.78
2a4b: 62.09 0.43 0.54 0.62
2a4c: 92.17 0.89 0.9 0.92
2a4e: 78.48 0.59 0.63 0.78
2a4m: 74.42 0.74 0.86 0.74
2a4n: 95 0.85 0.86 0.95
2a4o: 76.39 0.76 0.9 0.76
2a4p: 84.13 0.82 0.9 0.84
2a4q: 94.55 0.88 0.88 0.95
2a4r: 98.11 0.88 0.84 0.98
2a4s: 93.1 0.88 0.9 0.93
2a4t: 90 0.86 0.9 0.9
2a4u: 79.17 0.69 0.76 0.79
2a4v: 84.85 0.77 0.88 0.85
2a4w: 61.9 0.32 0.35 0.62
2a4x: 88.45 0.83 0.86 0.88
2a4y: 92.54 0.78 0.76 0.93
2a4z: 92.31 0.76 0.74 0.92
2b05: 59.09 0.27 0.24 0.59
2b0a: 58.54 0.19 0.39 0.59
2b0c: 66.67 0.3 0.33 0.67
2b0d: 60.71 0.26 0.43 0.61
2b0e: 59.22 0.23 0.39 0.59
2b0m: 57.33 0.22 0.37 0.57
2b0p: 78.38 0.36 0.51 0.78
2b0q: 56.6 0.21 0.37 0.57
2b0t: 62.4 0.35 0.42 0.62
2b0u: 70.05 0.29 0.48 0.7
2b0v: 62.96 0.31 0.4 0.63
2b0z: 66.96 0.4 0.6 0.67
2b61: 65.17 0.32 0.49 0.65
2b63: 53.39 0.27 0.52 0.53
2b65: 63.73 0.37 0.59 0.64
2b66: 54.25 0.27 0.62 0.54
2b68: 50 -0.32 0.37 0.5
2b69: 64 0.31 0.39 0.64
2b6a: 52.48 0.24 0.43 0.52
2b6c: 50.68 0.27 0.38 0.51
2b6d: 64.08 0.36 0.58 0.64
2b6h: 52.38 0.22 0.42 0.52
2b6n: 69.01 0.19 0.41 0.69
2b6o: 56.82 0.36 0.46 0.57
2b6t: 80.77 0.55 0.51 0.81
2b6w: 80.77 0.55 0.51 0.81
2b6x: 78.57 0.55 0.54 0.79
2b6y: 80.77 0.55 0.51 0.81
Average 73.23 0.50 0.604 0.73


Table C.16: Summed-up results for beta-sheet prediction in the first phase of the experiment. Where a protein name repeats in the set, the protein was tested with different training sets. The table also shows the average over all these predictions.

Protein P-sheet C-sheet Recall - Precision Protein P-sheet C-sheet Recall - Precision Code sheet - sheet Code sheet - sheet 2a4o: 98.6 0.88 0.84 1 6gsp: 68.97 0.76 1 0.69 2a4p: 98.59 0.95 0.93 1 2b6y: 68.67 0.52 0.48 0.67 2a4s: 98.53 0.91 0.88 1 2c4a: 67.01 0.6 0.7 0.67 2c44: 98.52 0.98 0.97 1 6at1: 66.83 0.59 0.74 0.67 2a4q: 98.51 0.91 0.89 0.99 5lip: 66.67 0.28 0.35 0.67 2a4t: 98.51 0.89 0.87 0.99 1a29: 66.67 0.28 0.17 0.67 2a4c: 96.38 0.92 0.92 0.96 2b6t: 66.67 0.5 0.48 0.67 2a4r: 95.77 0.88 0.88 0.96 2b6x: 66.67 0.5 0.47 0.67 1bas: 95.74 0.93 0.94 0.96 2cit: 66.67 0.52 0.5 0.67 2c4w: 95.24 0.94 0.95 0.95 2c4z: 66.26 0.52 0.55 0.66 2cia: 95.08 0.92 0.95 0.95 6prn: 66.24 0.49 0.9 0.66 2a4m: 94.39 0.72 0.88 0.94 2b6w: 65.67 0.51 0.45 0.67 2a48: 93.1 0.87 0.9 0.93 6acn: 65.58 0.5 0.62 0.66 2a4u: 92.45 0.8 0.77 0.92 6ccp: 65 0.76 0.93 0.65 2cbx: 92.16 0.71 0.65 0.92 6gsx: 65 0.44 0.41 0.65 2cbv: 91.87 0.88 0.99 0.92 2civ: 65 0.57 0.66 0.65 2c4v: 91.46 0.8 0.85 0.91 2a4e: 64.81 0.63 0.94 0.65 2c4y: 90.48 0.83 0.89 0.9 2b61: 63.93 0.34 0.41 0.64 2a4x: 90.19 0.87 0.91 0.9 6est: 63.83 0.33 0.7 0.64 2c47: 89.2 0.84 0.89 0.89 6q21: 63.78 0.42 0.59 0.64 6rlx: 88.89 0.27 0.2 0.89 6cp4: 63.64 0.46 0.47 0.64 2c4u: 88.64 0.82 0.91 0.89 6cpp: 63.64 0.46 0.47 0.64 2cic: 88.58 0.8 0.96 0.89 5tss: 63.06 0.39 0.49 0.63 2cbj: 88 0.84 0.96 0.88 6icd: 62.5 0.6 0.76 0.62 2cbs: 88 0.83 0.99 0.88 2c43: 62.5 0.62 0.71 0.62 2cbq: 87.84 0.8 0.96 0.88 2c45: 62.5 0.62 0.71 0.62 2cbr: 87.66 0.83 0.99 0.88 2c46: 62.5 0.51 0.56 0.62 2cij: 86.54 0.88 0.94 0.87 8cpa: 61.9 0.48 0.57 0.62 6gpb: 86.18 0.63 0.59 0.86 6chy: 60.47 0.43 0.51 0.6 2a40: 85.71 0.8 0.91 0.86 6gsu: 60 0.38 0.36 0.6 2cbf: 85.45 0.88 0.98 0.85 6rxn: 60 0.62 0.86 0.6 2c48: 85 0.68 0.88 0.85 6prc: 59.18 0.25 0.23 0.59 2ciw: 84.91 0.77 0.96 0.85 2b0j: 59.09 0.31 0.34 0.59 2c4t: 84.53 0.72 0.79 0.85 2b6e: 59.09 0.38 0.72 0.59 2cbi: 84.42 0.79 0.98 0.84 6enl: 58.67 0.33 0.42 0.59 6rat: 84.09 0.84 0.97 0.84 6cgt: 58.02 0.39 0.66 0.58 2cbw: 83.85 0.78 0.92 0.84 6rhn: 57.69 0.31 0.48 0.58 2cip: 83.67 0.77 0.79 0.84 1a7g: 57.69 0.27 0.56 0.58 2cbo: 83.61 0.81 0.89 0.84 6dfr: 57.45 0.3 0.6 0.57 2cii: 83.33 0.5 0.34 0.83 6r1r: 57.31 0.44 0.49 0.57 6ame: 83.33 0.47 0.42 0.83 2b0d: 56.82 0.29 0.51 0.57 2a4n: 83 0.78 0.85 0.83 2cbg: 56.79 0.43 0.53 0.57 2cid: 83 0.77 1 0.83 8pch: 56.36 0.29 0.46 0.56 2c4x: 82.61 0.83 0.9 0.83 6rhn: 56 0.28 0.45 0.56 2cil: 82.28 0.83 0.97 0.82 6cha: 55.83 0.37 0.73 0.56 2a4y: 81.97 0.78 0.85 0.82 1a79: 55.66 0.16 0.45 0.56 2a4z: 81.97 0.78 0.85 0.82 6mht: 55.56 0.33 0.57 0.56 2a46: 81.5 0.78 0.92 0.81 6gsw: 55 0.31 0.3 0.55 2cim: 81.33 0.84 0.98 0.81 1a2o: 54.97 0.27 0.44 0.55 2cin: 81.33 0.84 0.98 0.81 5cev: 54.74 0.31 0.47 0.55 2a47: 80.93 0.78 0.93 0.81 2b0e: 54.62 0.26 0.49 0.55 2a44: 80.7 0.53 0.87 0.81 1a7c: 54.33 0.23 0.54 0.54 2cbk: 80 0.72 0.84 0.8 7tli: 54.14 0.23 0.48 0.54 6ca2: 79.75 0.8 0.95 0.8 2b07: 53.97 0.32 0.47 0.54 6req: 79.66 0.71 0.72 0.8 2b0a: 53.97 0.29 0.63 0.54 6gss: 77.78 0.48 0.38 0.78 2ciw: 53.88 0.27 0.79 0.54 2ciq: 77.78 0.52 0.67 0.78 2b0c: 53.12 0.31 0.41 0.53 2ciq: 77.78 0.72 0.88 0.78 6xim: 52.83 0.31 0.33 0.53 6i1b: 76.81 0.65 0.91 0.77 6gst: 52.5 0.34 0.35 0.52 6gsy: 76.74 0.68 0.67 0.77 6cpp: 52.46 0.2 0.43 0.52 6cpa: 75.93 0.52 0.55 0.76 1a21: 51.63 0.25 0.8 0.52 6rnt: 75.86 0.78 0.96 0.76 6prn: 51.43 0.03 0.31 0.51 6rnt: 75.86 0.78 0.96 0.76 7abp: 51.22 0.34 0.64 0.51 2a4v: 75.56 0.76 1 0.76 2ciu: 50.99 0.21 0.52 0.51 2cis: 75 0.77 0.92 0.75 2b0v: 50.79 0.36 0.76 0.51 2c41: 74.68 0.4 0.44 0.75 
6gsv: 50 0.34 0.36 0.5 2a49: 74.4 0.64 0.7 0.74 6rlx: 50 -0.01 0.1 0.5


Protein P-sheet C-sheet Recall - Precision Protein P-sheet C-sheet Recall - Precision Code sheet - sheet Code sheet - sheet 6dfr: 48.48 0.11 0.36 0.48 4wbc: 39.5 0.21 0.79 0.39 7cpp: 48.28 0.42 0.67 0.48 1a25: 39.34 0.3 0.84 0.39 2b0m: 48.28 0.22 0.34 0.48 6est: 39.34 -0.14 0.29 0.39 2b69: 48.28 0.28 0.45 0.48 6ldh: 39.29 0.19 0.35 0.39 1a71: 48.95 0.25 0.5 0.49 8gss: 39.13 0.08 0.25 0.39 6a3h: 48.94 0.3 0.4 0.49 Average 63.51 0.44 0.61 0.63 1a7l: 48.79 0.31 0.45 0.49 6pad: 48.72 0.08 0.31 0.49 6dfr: 48.48 0.11 0.36 0.48 7cpp: 48.28 0.42 0.67 0.48 2b0m: 48.28 0.22 0.34 0.48 2b69: 48.28 0.28 0.45 0.48 1a71: 48.95 0.25 0.5 0.49 6a3h: 48.94 0.3 0.4 0.49 1a7l: 48.79 0.31 0.45 0.49 6pad: 48.72 0.08 0.31 0.49 6dfr: 48.48 0.11 0.36 0.48 7cpp: 48.28 0.42 0.67 0.48 2b0m: 48.28 0.22 0.34 0.48 2b69: 48.28 0.28 0.45 0.48 1a71: 48.95 0.25 0.5 0.49 6a3h: 48.94 0.3 0.4 0.49 1a7l: 48.79 0.31 0.45 0.49 6pad: 48.72 0.08 0.31 0.49 6dfr: 48.48 0.11 0.36 0.48 7cpp: 48.28 0.42 0.67 0.48 2b0m: 48.28 0.22 0.34 0.48 2b69: 48.28 0.28 0.45 0.48 1a71: 48.95 0.25 0.5 0.49 8rat: 48.15 0.21 0.34 0.48 1a7s: 48.15 0.35 0.75 0.48 7lpr: 48.12 -0.1 0.53 0.48 6apr: 48.03 0.13 0.64 0.48 7yas: 47.73 0.33 0.67 0.48 6pad: 47.44 0.13 0.61 0.47 1a2v: 46.94 0.34 0.76 0.47 7prc: 46.39 0.27 0.41 0.46 2b64: 46.08 0.22 0.39 0.46 6ald: 46.02 0.18 0.31 0.46 1a2d: 45.7 0.09 0.71 0.46 8ice: 45.24 0.3 0.64 0.45 5rsa: 44.5 0.27 0.82 0.44 6q21: 44.44 0.05 0.05 0.44 6cox: 44.29 0.07 0.11 0.44 5er1: 44.26 0.12 0.44 0.44 9abp: 44.16 0.38 0.91 0.44 1a7t: 43.92 0.16 0.49 0.44 599n: 43.75 0.2 0.6 0.44 6adh: 43.59 0.24 0.46 0.44 6apr: 43.33 0.01 0.23 0.43 7ick: 43.2 0.39 0.86 0.43 6a3h: 43.1 0.26 0.44 0.43 1a7x: 43.02 0.26 0.7 0.43 1a7u: 42.98 0.13 0.32 0.43 1a2k: 42.7 0.16 0.51 0.43 6std: 42.67 0.23 0.78 0.43 6cpp: 42.39 0.08 0.53 0.42 1a28: 42.31 0.03 0.07 0.42 7icg: 42.16 0.09 0.36 0.42 1a7o: 41.96 0.15 0.75 0.42 1a76: 41.94 0.09 0.17 0.42 1a77: 41.94 0.1 0.18 0.42 6pcy: 41.89 0.25 0.79 0.42 4mon: 41.67 0.13 0.1 0.42 1a7n: 41.44 0.2 0.78 0.41 1a7p: 41.44 0.15 0.74 0.41 1a2y: 40.8 0.22 0.61 0.41 1a7r: 40.54 0.19 0.78 0.41 1a2n: 40.52 0.1 0.42 0.41 4trx: 40.38 0.17 0.32 0.4 6cox: 40.38 0.04 0.23 0.4 6paz: 40 0 0.46 0.4 6rxn: 40 0.3 0.57 0.4 1a7i: 40 -0.04 0.4 0.4 1a7q: 39.64 0.19 0.77 0.4


Table C.17: Results for individual proteins for turn/loop prediction. Where a protein name repeats in the set, the protein was tested with different training sets. The table also shows the average over all these predictions.

Protein P- C- Recall- Precisio Protein P- C- Recall- Precisio Protein P- C- Recall- Precisi Code Turn Turn Turn n-Turn Code Turn Turn Turn n-Turn Code Turn Turn Turn on- Turn 2a4r 98.11 0.88 0.84 0.98 2cib: 77.55 0.59 0.66 0.78 2b65: 63.73 0.37 0.59 0.64 6gsp 96.77 0.8 0.81 0.97 6cpp: 77.22 0.62 0.66 0.77 1a7m: 63.41 0.3 0.45 0.63 2cbv: 96.77 0.89 0.87 0.97 6lzm: 76.92 0.45 0.43 0.77 2ciw: 63.35 0.27 0.44 0.63 2a4n: 95 0.85 0.86 0.95 2cbg: 76.6 0.49 0.53 0.77 6est: 63.04 0.28 0.42 0.63 2a4q: 94.55 0.88 0.88 0.95 6at1: 76.47 0.58 0.68 0.76 2b0v: 62.96 0.31 0.4 0.63 2cbo: 94.21 0.84 0.85 0.94 2a4o: 76.39 0.76 0.9 0.76 2cbm: 62.96 0.35 0.31 0.63 6rat: 93.75 0.85 0.86 0.94 6cp4: 76.25 0.61 0.66 0.76 1a23: 62.79 0.31 0.46 0.63 6rnt: 93.55 0.77 0.81 0.94 1a2v: 76.12 0.42 0.51 0.76 1a7l: 62.55 0.28 0.41 0.63 6rnt: 93.55 0.77 0.81 0.94 1a7x: 75 0.44 0.57 0.75 2cit: 62.5 0.17 0.48 0.62 2cbj: 93.48 0.83 0.87 0.93 2cii: 75 0.54 0.48 0.75 2a42: 62.5 0.64 0.71 0.62 2cbf: 93.44 0.9 0.93 0.93 2civ: 75 0.57 0.68 0.75 2b0t: 62.4 0.35 0.42 0.62 2a4s: 93.1 0.88 0.9 0.93 2a4m: 74.42 0.74 0.86 0.74 1a2y: 62.37 0.17 0.41 0.62 2cic: 92.82 0.77 0.75 0.93 2c40: 74.19 0.58 0.55 0.74 2a4b: 62.09 0.43 0.54 0.62 2a4y: 92.54 0.78 0.76 0.93 7yas: 73.97 0.42 0.57 0.74 6dfr: 62.07 0.4 0.51 0.62 2a4z: 92.31 0.76 0.74 0.92 2cis: 73.91 0.78 1 0.74 2a4w: 61.9 0.32 0.35 0.62 2cbs: 92.31 0.78 0.75 0.92 2c4z: 73.87 0.63 0.74 0.74 6gst: 61.74 0.46 0.63 0.62 2a4c: 92.17 0.89 0.9 0.92 6xim: 73.46 0.48 0.54 0.73 599n: 61.59 0.26 0.5 0.62 2ciw: 91.86 0.77 0.77 0.92 2cik: 73.33 0.6 0.58 0.73 6ame: 61.54 0.26 0.5 0.62 2cbr: 91.84 0.79 0.76 0.92 5rsa: 73.12 0.29 0.42 0.73 2b0d: 60.71 0.26 0.43 0.61 2cie: 91.67 0.79 0.81 0.92 6prc: 72.97 0.54 0.63 0.73 5daa: 60.29 0.19 0.46 0.6 2cif: 91.67 0.82 0.85 0.92 6lyt: 72.73 0.12 0.13 0.73 8xia: 60 -0.04 0.46 0.6 2c4v: 91.36 0.81 0.86 0.91 1a78: 72.55 0.31 0.38 0.73 1a7e: 60 0.18 0.23 0.6 2cig: 91.3 0.81 0.84 0.91 1a2t: 72.22 0.42 0.55 0.72 2cbh: 60 0.15 0.55 0.6 2cip: 91.23 0.94 1 0.91 1a2u: 72.22 0.42 0.55 0.72 6a3h: 60 0.13 0.28 0.6 2cih: 90.91 0.8 0.83 0.91 1a7b: 72.16 0.38 0.52 0.72 6rxn: 60 0.09 0.19 0.6 6cts: 90.77 0.6 0.53 0.91 2cbu: 72.16 0.62 0.68 0.72 1a7p: 59.62 0.16 0.38 0.6 2cbw: 90.07 0.77 0.82 0.9 2cbl: 72 0.39 0.31 0.72 8rat: 59.57 0.15 0.31 0.6 2a4t: 90 0.86 0.9 0.9 6gsy: 71.79 0.67 0.82 0.72 4mon: 59.38 0.19 0.2 0.59 2cin: 90 0.79 0.84 0.9 2c44: 71.43 0.81 1 0.71 2b0e: 59.22 0.23 0.39 0.59 2cim: 89.86 0.78 0.83 0.9 6gsv: 71.19 0.54 0.67 0.71 6q21: 59.15 0.4 0.53 0.59 2cid: 89.74 0.73 0.73 0.9 2c4t: 71.13 0.61 0.66 0.71 2b05: 59.09 0.27 0.24 0.59 2cil: 89.71 0.76 0.8 0.9 7cel: 71.11 0.39 0.6 0.71 5tss: 58.76 0.38 0.47 0.59 2cbk: 89.23 0.77 0.77 0.89 7fab: 71.11 0.39 0.6 0.71 6enl: 58.54 0.21 0.34 0.59 2c47: 89.08 0.82 0.84 0.89 1a2p: 71.08 0.23 0.46 0.71 1a2l: 58.54 0.27 0.42 0.59 2c4u: 89.04 0.76 0.8 0.89 1a2b: 71.05 0.37 0.45 0.71 2b0a: 58.54 0.19 0.39 0.59 1bas: 88.89 0.9 0.96 0.89 5lip: 70.93 0.42 0.58 0.71 6rxn: 58.33 0.13 0.44 0.58 2a48: 88.89 0.74 0.74 0.89 4gat: 70.83 0.26 0.46 0.71 6cgt: 58.33 0.2 0.45 0.58 2c43: 88.89 0.76 0.83 0.89 6mht: 70.77 0.54 0.62 0.71 1a7o: 57.41 0.14 0.38 0.57 2c45: 88.89 0.76 0.83 0.89 6lyz: 70.73 0.22 0.48 0.71 6cox: 57.39 0.28 0.52 0.57 2cbi: 88.89 0.77 0.76 0.89 8ice: 70.51 0.41 0.6 0.71 2b0m: 57.33 0.22 0.37 0.57 6ccp: 88.52 0.78 0.81 0.89 6ldh: 70.49 0.36 0.42 0.7 2ciq: 57.14 0.52 0.73 0.57 2c4y: 88.46 0.8 0.87 0.88 1a2q: 70.42 0.31 0.5 0.7 1a7d: 57.14 0.17 0.22 0.57 2a4x: 88.45 0.83 0.86 0.88 6pax: 70.37 0.2 0.34 0.7 6pax: 57.14 -0.19 
0.4 0.57 2cbx: 88.37 0.79 0.84 0.88 7abp: 70.37 0.4 0.6 0.7 2c41: 57.09 0.3 0.37 0.57 1a2c: 87.34 0.68 0.75 0.87 6gsx: 70.34 0.57 0.7 0.7 6pah: 56.92 0.3 0.45 0.57 2cia: 86.96 0.85 0.91 0.87 2b0u: 70.05 0.29 0.48 0.7 2b6o: 56.82 0.36 0.46 0.57 2c4x: 86.89 0.86 0.93 0.87 6cha: 69.6 0.39 0.59 0.7 2b0q: 56.6 0.21 0.37 0.57 6i1b: 86.84 0.75 0.8 0.87 4er4: 69.57 0.25 0.44 0.7 1a28: 56.45 0.24 0.27 0.56 2c48: 86.79 0.75 0.74 0.87 6gep: 69.32 0.49 0.57 0.69 1a24: 56.41 0.28 0.43 0.56 2a46: 86.73 0.67 0.67 0.87 2b6n: 69.01 0.19 0.41 0.69 1a7u: 56.2 0.23 0.38 0.56 2a47: 86.67 0.62 0.59 0.87 6pcy: 68.97 0.09 0.43 0.69 7icg: 55.46 0.25 0.41 0.55 2a41: 85.71 0.73 0.67 0.86 1a7t: 68.42 0.32 0.46 0.68 5cev: 55.1 0.26 0.43 0.55 2cio: 85.71 0.78 0.75 0.86 6lyt: 68.29 0.21 0.47 0.68 1a2m: 54.95 0.27 0.45 0.55 2c4a: 85.61 0.59 0.61 0.86 4lve: 68.18 0.23 0.43 0.68 7ahi: 54.84 0.17 0.2 0.55 2c4w: 85.19 0.82 0.88 0.85 2a44: 68.18 0.57 0.68 0.68 1a7w: 54.55 0.37 0.46 0.55 6icd: 85 0.65 0.65 0.85 4wbc: 67.96 0.17 0.34 0.68 6ame: 54.55 0.33 0.8 0.55 2a4v: 84.85 0.77 0.88 0.85 6est: 67.8 0.4 0.56 0.68 2b66: 54.25 0.27 0.62 0.54 2ciu: 84.57 0.66 0.73 0.85 6gsu: 67.48 0.53 0.7 0.67 1a74: 54.17 0.04 0.39 0.54 2cij: 84.29 0.76 0.81 0.84 1a7q: 67.31 0.24 0.41 0.67 2cbn: 53.85 0.28 0.54 0.54 2a4p: 84.13 0.82 0.9 0.84 2b0z: 66.96 0.4 0.6 0.67 9abp: 53.7 0.19 0.33 0.54 6ca2: 84.06 0.68 0.76 0.84 4fua: 66.67 0.16 0.41 0.67 1a25: 53.7 0.27 0.44 0.54 6r1r: 83.4 0.6 0.62 0.83 7cpp: 66.67 0.38 0.44 0.67 1a2j: 53.66 0.24 0.4 0.54 2c46: 83.33 0.76 0.89 0.83 7gpb: 66.67 0.37 0.64 0.67 2b63: 53.39 0.27 0.52 0.53 6req: 83.27 0.74 0.78 0.83 2b0c: 66.67 0.3 0.33 0.67 1a75: 52.83 0.48 0.72 0.53 2a45: 82.35 0.59 0.61 0.82 1a7s: 66.15 0.35 0.58 0.66 6prc: 52.7 0.13 0.37 0.53 2cbq: 82.14 0.75 0.79 0.82 7lpr: 66.04 0.3 0.35 0.66 1a7a: 52.54 0.15 0.34 0.53 6abp: 82 0.49 0.46 0.82 6gsw: 65.79 0.52 0.68 0.66 2b6a: 52.48 0.24 0.43 0.52 2b6t: 80.77 0.55 0.51 0.81 6apr: 65.75 0.24 0.39 0.66 2b6h: 52.38 0.22 0.42 0.52 2b6w: 80.77 0.55 0.51 0.81 2ciu: 65.71 0.26 0.47 0.66 5er1: 52.17 0.23 0.41 0.52 2b6y: 80.77 0.55 0.51 0.81 4trx: 65.71 0.37 0.49 0.66 6apr: 51.81 0.08 0.35 0.52 6a3h: 80.26 0.44 0.52 0.8 1a2f: 65.57 0.3 0.43 0.66 1a73: 51.58 0.02 0.37 0.52 6cpa: 79.73 0.65 0.73 0.8 1a2g: 65.57 0.32 0.43 0.66 6ald: 51.36 0.21 0.32 0.51


Table C.17 (continued from previous page).

Protein P- C- Recall- Precision Code Turn Turn Turn -Turn 6prn: 79.41 0.56 0.62 0.79 7ick: 79.35 0.38 0.56 0.79 6gss: 79.31 0.57 0.59 0.79 1a2w: 79.31 0.39 0.46 0.79 2a4u: 79.17 0.69 0.76 0.79 2cit: 79.17 0.62 0.76 0.79 6acn: 78.82 0.5 0.56 0.79 6gpb: 78.63 0.6 0.6 0.79 2a40: 78.57 0.78 0.92 0.79 2b6x: 78.57 0.55 0.54 0.79 2ciq: 78.57 0.85 1 0.79 1a2s: 65.52 0.48 0.7 0.66 1a7n: 65.38 0.21 0.4 0.65 1a7r: 65.38 0.22 0.4 0.65 2b61: 65.17 0.32 0.49 0.65 1a7h: 64.86 0.34 0.48 0.65 1a2s: 65.52 0.48 0.7 0.66 1a7n: 65.38 0.21 0.4 0.65 1a7r: 65.38 0.22 0.4 0.65 2b61: 65.17 0.32 0.49 0.65 1a7h: 64.86 0.34 0.48 0.65 1a2s: 65.52 0.48 0.7 0.66 1a21: 50.79 0.15 0.31 0.51 1a2n: 50.7 0.2 0.33 0.51 2b6c: 50.68 0.27 0.38 0.51 6ame: 50 0.12 0.42 0.5 6rhn: 50 0.12 0.41 0.5 1a21: 50.79 0.15 0.31 0.51 1a2n: 50.7 0.2 0.33 0.51 6gsv: 48 0.15 0.27 0.48 2civ: 46.74 0.17 0.42 0.47 1a2k: 46.54 0.1 0.28 0.47 6adh: 46.42 0.17 0.59 0.46 1a22: 46.15 0.17 0.33 0.46 1a7j: 45.45 0.21 0.38 0.45 6icd: 45.21 0.15 0.37 0.45 1a27: 44.44 0.22 0.38 0.44 2c49: 44.44 0.29 0.57 0.44 8pch: 44.26 0.24 0.5 0.44 1a71: 44.19 0.14 0.38 0.44 6gsp: 44.12 -0.07 0.41 0.44 6q21: 43.75 0.09 0.25 0.44 1a77: 43.24 0.22 0.47 0.43 6prn: 42.65 0.05 0.32 0.43 6std: 42.31 0.18 0.34 0.42 1a7v: 41.67 0.02 0.12 0.42 6chy: 41.38 0.37 0.63 0.41 2cir: 41.18 0.51 1 0.41 6paz: 40.62 0.26 0.54 0.41 1a7c: 40.54 0.13 0.32 0.41 1a2d: 40.43 0.09 0.26 0.4 Average: 68.8 0.43 0.561 0.69


Appendix D

Table D.1: List of proteins used for training Unit 1 in the second phase of our experiment.

1aa0 1aa1 1aa2 1aa3 1aa4 1aa6 1aa7 1aa9 1aab 1aac 1aaf 1aaj 1aal 1aam 1aan 1aap 1aaq 1aar 1aaw 1aax 1aay 1aaz 1ab0 1ab1 1ab2 1ab3 1ab4 1ab5 1ab6 1ab7 1ab8 1ab9 1aba 1abb 1abe 1abf 1abi 1abj 1abo 1abq 1abr 1abs 1abt 1abv 1abw 1aby 1abz 1ac0 1aca 1acb 1acc 1acd 1acf 1aci 1acj 1acl 1acm 1aco 1acp 1acv 1acw 1acx 1acy 1acz 1ad6 1ad7 1ad8 1ad9 1adb 1adc 1add 1ade 1adf 1adg 1adi 1adj 1adl 1adn 1ado 1adq 1ae1 1ae2 1ae3 1ae5 1ae6 1ae7 1ae8 1ae9 1aeb 1aec 1aed 1aee 1aef 1aew 1aex 1aey 1af0 1af2 1af3 1af4 1af5 1af6 1af7 1af8 1af9 1afl 1afo 1afp 1afq 1afr 1afs 1aft 1afu 1afv 1afw 1ag0 1ag1 1ag2 1ag4 1ag6 1ag7 1ag8 1ag9 1agb 1agc 1agd 1age 1agf 1agg 1agi 1agj 1agm 1agn 1agp 1agq 1agr 1agt 1agw 1agx 1agy 1ah0 1ah1 1ah2 1ah3 1ah4 1ah5 1ah6 1ah7 1ah8 1ah9 1aha 1ahb 1ahc 1ahd 1ahe 1ahf 1ahg 1ahh 1ahi 1ahj 1ahk 1ahl 1ahm 1ahn 1aho 1ahp 1ahq 1ahr 1ahs 1aht 1ahu 1ahv 1ahw 1ahx 1ahy 1ahz

Table D.2: List of proteins used for training Unit 2 in the second phase of our experiment.

1ai0 1ai1 1ai2 1ai3 1ai4 1ai5 1ai6 1ai7 1ai8 1ai9 1aia 1aib 1aic 1aid 1aie 1aif 1aig 1aih 1aii 1aij 1aik 1ail 1aim 1aip 1aiq 1air 1ais 1aiu 1aiv 1aiw 1aix 1aiy 1aiz 1aja 1ajb 1ajc 1ajd 1aje 1ajg 1ajh 1ajj 1ajk 1ajm 1ajn 1ajo 1ajp 1ajq 1ajr 1ajs 1aju 1ajv 1ajw 1ajx 1ajy 1ajz 1ak5 1ak6 1ak7 1ak8 1ak9 1aka 1akb 1akc 1akd 1ake 1akg 1akh 1aki 1akj 1akk 1akl 1akm 1akn 1ako 1akp 1al0 1al1 1al2 1al3 1al4 1al6 1al7 1al8 1ala 1alb 1alc 1ald 1ale 1alu 1alv 1alw 1alx 1aly 1alz 1am1 1am2 1am4 1am5 1am6 1am7 1am9 1amk 1aml 1amm 1amn 1amo 1amp 1amq 1amr 1ams 1amt 1amu 1amw 1amx 1amy 1amz 1an0 1an1 1an2 1an4 1an5 1an7 1an8 1an9 1anb 1anc 1and 1ane 1anf 1ang 1ani 1anj 1ank 1ann 1ans 1ant 1anu 1anv 1anw 1anx 1ao0 1ao3 1ao5 1ao6 1ao7 1ao8 1aoa 1aob 1aoc 1aod 1aoe 1aof 1aog 1aoh 1aoi 1aoj 1aok 1aol 1aom 1aon 1aoo 1aop 1aoq 1aor 1aos 1aot 1aou 1aov 1aow 1aox 1aoy 1aoz


Table D.3: List of proteins used for training Unit 3 in the second phase of our experiment.

1ap0 1ap2 1ap4 1ap5 1ap6 1ap7 1ap8 1apa 1apb 1apc 1apf 1apg 1aph 1apj 1apl 1apm 1apn 1apo 1apq 1aps 1apt 1apu 1apv 1apw 1apx 1apy 1apz 1aqa 1aqb 1aqc 1aqd 1aqe 1aqf 1aqg 1aqh 1aqi 1aqj 1aqk 1aql 1aqm 1aqn 1aqp 1aqq 1aqr 1aqs 1aqt 1aqu 1aqv 1aqw 1aqx 1aqy 1ar4 1ar5 1ar6 1ar7 1ar8 1ar9 1arb 1arc 1ard 1are 1arf 1arg 1arh 1ari 1arj 1ark 1arl 1arm 1as0 1as2 1as3 1as4 1as5 1as6 1as7 1as8 1asa 1asb 1asc 1ast 1asu 1asv 1asw 1asx 1asy 1asz 1at0 1at1 1at3 1at5 1at6 1at9 1ati 1atj 1atk 1atl 1atn 1atp 1atr 1ats 1att 1atu 1atx 1aty 1atz 1au0 1au1 1au2 1au3 1au4 1au7 1au8 1au9 1aua 1auc 1aud 1aue 1aug 1aui 1auj 1auk 1aum 1aun 1auo 1aup 1auq 1aur 1aus 1aut 1auu 1auv 1auw 1aux 1auy 1auz 1av1 1av2 1av3 1av4 1av5 1av6 1av7 1av8 1ava 1avb 1avc 1avd 1ave 1avf 1avg 1avh 1avk 1avl 1avm 1avn 1avo 1avp 1avq 1avr 1avs 1avt 1avu 1avv 1avw 1avx 1avy 1avz 1aw0 1aw1 1aw2 1aw3 1aw5 1aw6 1aw7 1awb 1awc 1awd 1awe 1awf 1awh 1awi 1awj 1awo 1awp 1awq 1awr 1aws 1awt 1awu 1awv 1aww 1awx 1awy 1awz

Table D.4: List of proteins used for training Unit 4 in the second phase of our experiment.

1axa 1axb 1axc 1axd 1axe 1axg 1axh 1axi 1axj 1axk 1axm 1axn 1axq 1axr 1axs 1axt 1axw 1axx 1ay2 1ay3 1ay4 1ay5 1ay6 1ay7 1ay8 1ay9 1aya 1ayb 1ayc 1ayd 1aye 1ayf 1ayg 1ayi 1ayj 1ayk 1ayl 1aym 1az0 1az1 1az2 1az3 1az4 1az5 1az6 1az8 1azb 1azr 1azs 1azt 1azu 1azv 1azw 1azx 1azy 1azz 1ba0 1ba1 1ba2 1ba3 1ba4 1ba5 1ba6 1ba7 1ba8 1ba9 1bab 1bav 1baw 1bay 1baz 1bb0 1bb1 1bb3 1bb4 1bb5 1bb6 1bb7 1bb8 1bb9 1bbl 1bbn 1bbo 1bbp 1bbr 1bbs 1bbt 1bbu 1bbw 1bbx 1bby 1bbz 1bc0 1bc1 1bc2 1bc3 1bc4 1bc5 1bc6 1bc7 1bc8 1bc9 1bcc 1bcd 1bcf 1bcg 1bch 1bci 1bcj 1bck 1bcm 1bcn 1bco 1bcp 1bcr 1bcs 1bct 1bcu 1bcv 1bcw 1bcx 1bcy 1bcz 1bd0 1bd2 1bd3 1bd4 1bd6 1bd7 1bd8 1bd9 1bda 1bdb 1bdc 1bdd 1bde 1bdf 1bdg 1bdh 1bdi 1bdj 1bdk 1bdl 1bdm 1bdo 1bdq 1bdr 1bds 1bdt 1bdu 1bdv 1bdw 1bdy 1be0 1be1 1be2 1be3 1be4 1be6 1be7 1be8 1be9 1bea 1beb 1bec 1bed 1bee 1bef 1beg 1beh 1bei 1bej 1bek 1bel 1bem 1ben 1beo 1bep 1beq 1bes 1bet 1beu 1bev 1bex 1bey 1bez 1bfa 1bfb 1bfc 1bfd 1bfe 1bff 1bfg 1bfi 1bfj 1bfk 1bfm 1bfn 1bfo 1bfp 1bfr 1bfs 1bft 1bfu 1bfv 1bfw 1bfx 1bfy 1bfz 1bg4 1bg5 1bg6 1bg7 1bg8 1bg9 1bga 1bgb 1bgc 1bgd 1bge 1bgf 1bgg 1bgi 1bgj 1bgk 1bgl 1bgm 1bgn 1bgo
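
Each of the four units listed in Tables D.1 to D.4 is trained on its own block of PDB entries. The sketch below only illustrates one plausible way of producing such a split (consecutive, roughly equal, disjoint blocks); the chunking rule and the function name are assumptions for this example, not the exact procedure used to build these lists.

    # Minimal sketch (assumed partitioning rule, not the exact one used here):
    # divide a list of PDB codes into four roughly equal, disjoint training
    # blocks, one per network unit, in the spirit of Tables D.1-D.4.
    def split_into_units(pdb_codes, n_units=4):
        """Return n_units disjoint sublists that together cover pdb_codes."""
        size, rem = divmod(len(pdb_codes), n_units)
        blocks, start = [], 0
        for i in range(n_units):
            end = start + size + (1 if i < rem else 0)
            blocks.append(pdb_codes[start:end])
            start = end
        return blocks

    codes = ["1aa0", "1aa1", "1aa2", "1aa3", "1aa4", "1aa6", "1aa7", "1aa9"]
    for unit, block in enumerate(split_into_units(codes), start=1):
        print("Unit %d: %s" % (unit, block))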


Table D.5: List of proteins used for testing in the second phase of our experiment.

101m 103l 103m 104l 104m 105m 106m 107l 107m 108l 108m 10gs 10mh 110l 110m 111l 111m 112l 112m 113l 114l 117e 118l 119l 11as 11ba 11bg 11gs 120l 122l 123l 125l 126l 129l 12as 12ca 130l 131l 132l 135l 137l 138l 139l 13gs 13pk 141l 142l 143l 144l 145l 148l 149l 14gs 150l 151l 153l 154l 155c 155l 156l 157l 158l 15c8 160l 161l 162l 163l 164l 166l 167l 168l 169l 16gs 16pk 16vp 170l 172l 173d 173l 174l 175l 176l 177l 179l 180l 181l 182l 185l 186l 187l 188l 189l 18gs 191l 192l 193d 193l 194l 195l 198l 199l 19gs 200l 201l 206l 207l 208l 209d 209l 20gs 210l 212l 213l 214l 215l 216l 217l 219l 220l 221l 221p 222l 223l 225l 226l 227l 228l 229l 22gs 231l 232l 233l 234l 235l 238l 239l 240l 241l 244l 245l 246l 247l 248l 250l 251l 252l 253l 254l 256l 257l 258l 259l 260l 262l 316d 31bi 32c2 351c 35c8 41bi 421p 43ca 451c 456c 484d 521p 721p 830c 8acn 8acn 8adh 8api 8at1 8ca2 8cat 8cgt 8cho 8cpa 8cpp 8est 8fab 8gch 8gep 8gss 8hvp 8ica 8icb 8icc 8icd 8ice 8icf 8icg 8ich 8ici 8icj 8ick 8icl 8icm 8icn 8ico 8icp 8icq 8icr 8ics 8ict 8icu 8icv 8icw 8icx 8icy 8icz 8jdw 8ldh 8lpr 8lyz 8mht 8msi 8nse 8ohm 8paz 8pch 8prk 8pti 8rat 8rnt 8ruc 8tfv 8tli 8tln 8xia 8xim 966c 9abp 9ame 9atc

Table D.6: List of proteins used for testing in the second phase of our experiment.

3icd 3ifb 3il8 3ink 3ins 3jdw 3kar 3kbp 3kin 3kiv 3kmb 3ktq 3kvt 3lbd 3lck 3ldh 3leu 3lhm 3lip 3ljr 3lkf 3lpr 3lri 3lym 3lyn 3lyo 3lyt 3lyz 3lzm 3lzt 3mag 3man 3mat 3mba 3mbp 3mds 3mef 3mht 3min 3mon 3msi 3msp 3mth 3muc 3ncm 3ng1 3nla 3nll 3nn9 3nod 3nos 3np1 3nse 3nul 3orc 3p2p 3pah 3pal 3pat 3pax 3paz 3pbg 3pbh 3pca 3pcb 3pcc 3pcd 3pce 3pcf 3pcg 3pch 3pci 3pcj 3pck 3pcl 3pcm 3pdz 3pep 3pfk 3pfl 3pga 3pgh 3pgk 3pgm 3pgt 3phm 3phv 3phy 3pmg 3pnp 3por 3psg 3psr 3ptb 3ptd 3pte 3ptn 3pva 3pvi 3pyp 3r1r 3rab 3ran 3rap 3rat 3rdn 3req 3rhn 3rla 3rnt 3rp2 3rpb 3rsd 3rsk 3rsp 3rub 3sak 3sc2 3sdh 3sdp 3seb 3sgb 3sgq 3sic 3sil 3sod 3sqc 3srn 3sxl 3tdt 3tec 3tgf 3tgi 3tgj 3tgk 3tgl 3tim 3tlh 3tli 3tmk 3tmn 3tms 3tmy 3tpi 3trx 3ts1 3tss 3uag 3ubp 3ull 3upj 3usn 3vsb 3wrp 3xim 3xin 3xis 3znb 3znc 3znf 4a3h 4aig 4ait 4aiy 4ake 4ald 4ame 4aop 4ape 4apr 4atj 4ayk 4azu 4bdp 4blc 4blm 4bp2 4bu4 4ca2 4caa 4cac 4cat 4ccp 4ccx 4cel 4cgt 4cha 4cla 4cms 4cox 4cp4 4cpa 4cpp 4cpv 4csc 4cts 4cyh 4daa 4dbv 4dcg 4dfr 4dmr 4dpv 4eca 4eng 4enl 4er1 4er2 4er4 4est 4eug 4fap 4fbp 4fgf 4fis 4fit 4fiv 4fua 4fx2 4fxc 4gal 4gat 4gbq 4gep 4gpb 4gpd 4gr1 4grt 4gsa 4gsp 4gss 4gst 4gtu 4hb1 4hbi 4hck 4hhb 4hir 4hoh 4hvp 4i1b 4ifm 4ins 4kbp 4kiv 4kmb 4lbd 4lip


Table D.7: List of proteins used for testing in the second phase of our experiment.

1d00 1d01 1d02 1d03 1d04 1d05 1d06 1d07 1d08 1d09 1d0b 1d0c 1d0d 1d0e 1d0f 1d0g 1d0h 1d0i 1d0j 1d0k 1d0l 1d0m 1d0n 1d0o 1d0p 1d0q 1d0r 1d0s 1d0t 1d0u 1d0z 1d21 1d22 1d23 1d24 1d25 1d26 1d27 1d28 1d29 1d2a 1d2b 1d2c 1d2d 1d2f 1d2g 1d2i 1d2j 1d2k 1d2l 1d2m 1d2n 1d2o 1d2p 1d2q 1d2s 1d2t 1d2u 1d2v 1d2w 1d2x 1d2y 1d2z 1d46 1d47 1d48 1d49 1d4a 1d4b 1d4c 1d4e 1d4f 1d4g 1d4h 1d4i 1d4j 1d4k 1d4l 1d4m 1d4o 1d4p 1d4q 1d61 1d62 1d63 1d64 1d65 1d66 1d67 1d68 1d69 1d6l 1d6m 1d6n 1d6p 1d6q 1d6r 1d6s 1d6t 1d6u 1d6v 1d6w 1d6x 1d6y 1d6z 1d70 1d71 1d72 1d73 1d74 1d75 1d76 1d77 1d78 1d79 1d7a 1d7b 1d7c 1d7d 1d7e


Bibliography

[1] P. K. Simpson. Fuzzy min-max neural networks, Part I: Classification. IEEE Trans. Neural Networks, vol. 3, no. 5, pp. 776-786, Sep. 1992.

[2] A. V. Nandedkar and P. K. Biswas. A fuzzy min-max neural network classifier with compensatory neuron architecture. In Proc. 17th Int. Conf. on Pattern Recognition (ICPR 2004), Cambridge, U.K., Aug. 2004, vol. 4, pp. 553-556.

[3] Chou, P. Y. & Fasman, G. D. (1974). Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry, 13, 211-222.

[4] Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584-599.

[5] David T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. (1999) 292, 195-202.

[6] Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). Analysis and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120, 97-120.

[7] Sen, T. Z., Jernigan, R. L., Garnier, J. & Kloczkowski, A. (2005). GOR V server for protein secondary structure prediction. Bioinformatics, 21(11), 2787-2788.

[8] Cuff, J. A., Clamp, M. E., Siddiqui, A. S., Finlay, M. and Barton, G. J. (1998). Jpred: A Consensus Secondary Structure Prediction Server. Bioinformatics 14:892-893

[9] King, R. D. and Sternberg, M. J. E. (1996). Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Science, 5, 2298-2310.

[10] Salamov, A. A. & Solovyev, V. V. (1995). Prediction of protein secondary structure by combining nearest- neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247, 11-15.


[11] Frishman, D. & Argos, P. (1997). Seventy-five percent accuracy in protein secondary structure prediction. Proteins, 27, 329-335.

[12] Zvelebil, M., Barton, G., Taylor, W. & Sternberg, M. (1987). Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol. 195, 957-961.

[13] C. Levinthal (1968). Are there pathways for protein folding? Journal de Chimie Physique et de Physico-Chimie Biologique, 65, 44-45.

[14] Jeremy M. Berg, John L. Tymoczko, Lubert Stryer. Biochemistry, 5th Edition, Chapter 3, pp. 42-76.

[15] G. N. Ramachandran, C. Ramakrishnan & V. Sasisekharan (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95-99.

[16] Jones DT. (1999). Protein secondary structure prediction based on position- specific scoring matrices. J. Mol. Biol. 292: 195-202.

[17] Bryson K, McGuffin LJ, Marsden RL, Ward JJ, Sodhi JS. & Jones DT. (2005) Protein structure prediction servers at University College London. Nucl. Acids Res. 33(Web Server issue):W36-38.

[18] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

[19] V. I. Lim. Algorithms for prediction of alpha helices and beta structural regions in globular proteins. J. Mol. Biol., 88:873-894, 1974.

[20] G. D. Rose. Prediction of chain turns in globular proteins on a hydrophobic basis. Nature, 272:586-591, 1978.

[21] C. M. Wilmot and J. M. Thornton. Analysis and prediction of the different types of beta-turn in proteins. J. Mol. Biol., 203:221-232, 1988.

[22] W. Kabsch and C. Sander. A dictionary of protein secondary structure. Biopolymers, 22:2577-2637, 1983.
