National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2012) Proceedings published by International Journal of Computer Applications® (IJCA)

Residue Contact Prediction for Structure using 2-Norm Distances Nikita V. Mahajan L.G.Malik Department of Computer Science &Engg. Department of Computer Science &Engg. G.H. Raisoni College of Engineering, Nagpur G.H. Raisoni College of Engineering, Nagpur.

ABSTRACT

Bioinformatics is a field which uses technology and is rapidly Carboxyl end growing, to solve the problem related to biological area. Study CO푂− of protein, predicting its structure to know its function is an Amino end important field in bioinformatics. Protein unit that is twenty amino acids have total information for converting linear + sequences of into its unique and globular structures. 퐻3 푁 C H Protein folding problem is process to determine that protein is folding into its exact tertiary structure. Protein folds into matter Alpha Carbon of seconds to its stable 3-D structure; once it is stable it may perform proper functions. Contact map is an intermediate step Side chain R for converting 1-d structure to 3-d structure. Contact map is the representative graphical view, how the protein folds into its Figure 1: Amino Acid Structure proper structure. Here, 6 parameters are set, which consider 2- There are twenty different amino acids; they are represented as norm distances for generating the contact map. It will consider three letter codes, such as Alanine by ALA, as GLY, the co-ordinate data, only of the residue having alpha carbon as Proline as PRO, and Cytosine as CYS etc, which are grouped in contact type and map the contact as 1 if differences is greater such a fashion to represent the primary structure of protein.As than threshold value or -1 if less than threshold value. play a central role in nearly all biological processes at Keywords the cellular level [7]. Protein folding is the process by which proteins fold itself into Prediction, Amino Acids, Contact Map, Non- 3D structure. Protein chains fold into unique, tightly packed, local Contact Map, 2-Norm Distances. globular structures called folds [3]. The early work of Anfinsen 1. INTRODUCTION and Levinthal [6] established that a protein chain folds spontaneously and reproducibly to a unique three dimensional The most emerging field undergoing rapid, exciting growth is structure when placed in aqueous solution. The folding Bioinformatics. This has been mainly fuelled by advances in information process is present totally in linear sequences of DNA sequencing, Protein sequencing and mapping techniques. amino acids, determines its unique fold, and the geometry of a Key topic in computational structural proteomics is Protein protein fold largely determines its specific biological function. structure prediction [5] [6]. Predicting the tertiary structure of a protein by using its residue Protein is nothing but a20 linear amino acids chain which is sequence is called the Protein Folding Problem [4] [7]. covalently bounded with other amino acids (referred as residues). Length of these monomers ranges from ten to MET SER HIS TRP GLY thousand and they are very important part of mostly biochemical process [3].Amino acid is made up of organic compounds which LYS ASN GLY PRO GLU contain two functional groups that are amino group and ALA ILE ASP GLN THR carboxyl group. Amino group is represented as - 푁퐻3 which is LEU VAL ARG TYR PHE basic in nature. While carboxyl group is acidic in nature and it is SER GLY ALA ILE LEU represented as, - COOH. Amino acid is also termed as α amino ASN PRO VAL PRO ILE acid if both amino and carboxyl group is attached to some PRO GLY ASP THR GLU carbon atom. An alpha-amino acid has the generic formula LYS ILE VAL ILE ALA H2NCHRCOOH, where R is an organic substituent. R Group SER HIS ASN GLN LYS only changes in each amino acid ALA SER HIS VAL ILE PRO ILE ASP GLN THR Protein Folding LEU ASN PRO VAL PRO Linear amino acid to its structure GLY ...... Figure 2: Protein 3-D Structure Prediction

Indeed, proteins can be characterized by: (i) their primary structure, i.e. the alphanumerical amino acid sequence; (ii) their secondary structure, i.e. the spatial in- formation related to

17

National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2012) Proceedings published by International Journal of Computer Applications® (IJCA) amino acid residues, (iii) the it tertiary structure which describes model and applied input features are described in more details. the spatial coordinate for each atoms composing the whole Section IV presents experimental results. Finally, Section protein. [5].Protein don’t convert directly from primary structure Vprovides some concluding remarks. to tertiary structure, it goes from secondary structure which can be associated to one ofthe two possible states, namely α-helix, 2. METHODOLOGY β- strand. (α) is the common motif in the secondary structure of proteins. It’s a right-handed coiled or spiral The proposed structure calculation procedure contains three conformation. Every backbone N-H group donates a hydrogen stages: contact map generation, contact map prediction and bond to the backbone C=O group of the amino acid. One turn of secondary structure prediction. the helix represents 3.6 amino acid residues. A single turn of the a-helix involves 13 atoms from the O to the H of the H bond

[10].

Input protein atom data

Total Residue Contact

Map

Generation Residue X Residue Symmetric Matrix

Selecting alpha carbon residue

Figure 3: Alpha Helix Structure [13] 2- Norm Distance Contact Beta sheets consist of the beta strands connected laterally by at 퐷 >t퐷 < 푡 Map least two or three backbone hydrogen bonds, forming a 푖,푗 푖,푗 Prediction generally twisted, pleated sheet. In the parallel b-pleated sheet, adjacent chains run in the same direction (N ® C or C ® N). In Value=1 Value = -1 the anti-parallel b-pleated sheet, adjacent strands run in opposite directions [8] [10].

Secondary Structure Prediction

Flow Chart 1: Schematic overview of the structure prediction procedure

2.1. Contact Map Contact map is an intermediate step showing how amino acid can be converted from its linear form to its tertiary structure. There for it can be said that Contact map is guide for protein Figure 4: Structure [13] structure prediction. Contact map is the graphical representation

of amino acid residue. Therefore for predicting the 3D structure Recently a lot of effort has been put in solving the plays vital role [6]. The residues are said to Folding Problem (PFP).Most of them are based on heuristic be in contact if the carbon atom of one residue is in contact with methods, since the computational complexity of the underlying carbon atom of other residue. When amino acids come into models makes their complete computational simulation is very contact they give rise to various non-covalent bonds such as expensive and, sometimes, unfeasible [1].Reinforcement , hydrophobic bond [6] [2]. Learning (RL) is an approach to machine intelligence in which an agent can learn to behave in a certain way by receiving punishments or rewards on its chosen actions [11].SVM was 2.2.Generating Contact Map applied to the features to predict the structural relevance (i.e. if Input to contact map generation will be the file containing all in the same fold or not) [12]. But still the problem remains information about the amino acid. Given a protein sequence P unsolved [5]. One problem with protein data is that, it is very with M residue such as P = {푅1, 푅2 … … … … … . . 푅푚 }. noisy.An approach to solve this problem is representing residue And it can be said that two residue are in contact that is as contact. The principle behind contact map generation is just residue푅푖is in contact with residue 푅푗 if the distance between to show how the residue interacts with one another. These them are greater than that of specified threshold (t). That interactions will eventually lead to generation of tertiary is푅푗 − 푅푖 < 푡 structure of protein. Thus using such representation will Protein fold is normally defined on 3D Cartesian co-ordinates of amino acid atoms [3]. To find the contact between residue and reduce the dimensionality of problem down to more manageable residue 2-Normdistance is consider point [5]. .퐷 = |푅푗 − 푅푖| =

2 2 2 It is organized as follows. Section II provides definitions of (푋푗 − 푋푖) + (푌푗 − 푌푖) + (푍푗 − 푍푖) contact maps, non-local contacts. In section III, the proposed

18

National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2012) Proceedings published by International Journal of Computer Applications® (IJCA)

Where 푅푖 = (푋푖, 푌푖, 푍푖) are the three dimensional co-ordinate of 2.4.Input Parameter residue 푅푖 [3]. There are six types of input parameters for each pair of positions For plotting the contact map M x M symmetric matrix is in a protein sequence, thatcapture the different aspects of the considered. This matrix will have number of rows and number amino acids and the positions i and j of column equal to the residue number in protein sequence.  Sequence Length A pair wise residue distance matrix can be defined as Sequence length (SeqLgt) will be for the total number D = 퐷푖,푗 for each protein, where each element of contact map of amino acid residue in the input protein file. matrix 푀푖,푗 is set to 1 if the residue (푅푖, 푅푗 )are in contact  Minimum Sequence Separations otherwise it is set to -1. Minimum sequence separation (MinSeqSep) will the In other words, the contact map of protein with length N can be minimum sequences separation filter used for the displayed in matrix whose elements are defined as [4] creation of contact map. 퐷푖,푗 = 푀푖,푗 = 1 If 퐷푖,푗 < t  Maximum Sequence Separations 푀푖,푗 = -1 Otherwise Maximum sequence separation (MaxSeqSep) will the maximum sequences separation filter used for the 2.3.Contact Map Prediction creation of contact map Distance between two residuescan be defined in different ways:  Distance Cutoff  The distance between alpha carbon i.e. 퐶 − 퐶 Distance cutoff (DisCutoff) will be the threshold 푎 푎 specified in angstroms unit, used for creating Contact  The distance between beta carbon i.e. 퐶 − 퐶 푏 푏 Map.  The distance between All atoms  Chain Identification Distance between alpha carbon atoms퐶 − 퐶 is considered. 푎 푎 Chain Identification (ChainID) will tell that protein The first step in calculating a contact map between proteins is to data consider as input is perfect data or not calculate the alpha carbon distance between pair of their residues. It will first take the entire atom having the alpha  Contact Type carbon.Then in second step will subtract residue1 from all n Contact type (ConType) is used for creation of contact residues to find distance that is. This subtraction is nothing but map. It will specify that which residue carbon atom to calculating 2-Normdistance of one residue with all residues be considered for calculating the 2-Normdistance. present in protein.The third and final step is to compare the calculated distance with the threshold that is set, if greater then ALL AL All atoms will plot the dot that is will set the matrix as 1 and if not then set as -1. Ca The C-alpha atom of the backbone 푟푒푠푖푑푢푒1,푐푎 − 푟푒푠푖푑푢푒2,푐푎 > 푡Then 푀푖,푗 = 1 C The C-beta atom of the side-chain (C- 푟푒푠푖푑푢푒1,푐푎 − 푟푒푠푖푑푢푒2,푐푎 < 푡 Then 푀푖,푗 = −1 B . alpha for glycine) MXM Matrix Table 1: Contact type Supported

2.5. Significance of Contact Map

A host of useful information is provided by contact map. Contact map not only calculate the contact distance of atom but also help to find which atom will contribute for formation of secondary structure [6].

Secondary structures consist of alpha helix and beta sheets. To localize the secondary structure from contact map it can be said that. Diagonal area is always one as it assumes those residues

are always in contact with themselves [3]. The amino acid ij which form alpha helix will always appear as a thick band near

the diagonal as the contact is between single amino acid and its ALA PRO SER GLY CYS LEU THR four successors. Amino acid which forms beta sheets will 푀푖,푗 always appear as thin bands which would be parallel or anti parallel to the main diagonal.

Residue Data ATOM CA ALA 20 ......

ATOM CA ALA 92 ......

Figure 5: Process for Plotting Contact Map Figure 5 provide an example, how the contact map for protein is plotted

19

National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2012) Proceedings published by International Journal of Computer Applications® (IJCA)

The selected PDB input file for methodology is of Mycobacterium tuberculosis. This file has various sections such as date, author name, Atoms, Chain identification, residue number, residue Co-ordinates (x,y,z) in angstrom unit [2] [9]. 3.2. Result Present the result for predicting the contact map.The input PDB file contains 489 amino acids, thus the sequences length is of 489. The contact map matrix will be of 489x489. Theminimum Sequence separation is 1 and maximum Sequence Separation is 400. Contact type selected asCa,

Figure 7: Contact Map distance cutoff is 8 Angstroms

(a) (b)

Figure 6: Contact Map of Amino acid (a) [6]. 3 Dimensional Protein Structures from Contact map (b). [4]

Figure 6(a) is the perfect example showing the contact map of protein and figure 6(b) shows that how the contact map is being converted into tertiary structure. In figure 7(a) there is only single band near the diagonal, that amino acid had lead for formation of only one helix as represented into figure6 (b).There are four bands parallel to diagonal, that had lead for formation of four beta sheet in figure 6 (b).

3. EXPERIMENTAL RESULT

3.1. Data Set

Drs. Edgar Meyer and Walter Hamilton had founded Protein Data Bank (PDB) in 1971 at Brookhaveni National Laboratory. According to PDB selected list PDB contains 3-D structure of Protein and nucleic acid, which has non redundant protein structure with sequence identity lower than 25% [2] [9].The PDB format has 12 sections, in which 46 different fields are represented. The following Figure 8: Contact Map distance cutoff is 12 Angstroms fields are relevant for our model: SEQRES (defines the amino acids sequence of the protein), HELIX and SHEET (identify the Figure 7, Figure 8 shows contact map with distance cutoff of8, amino acids that form these secondary structures), and ATOM 12 Angstroms. The figure also shows the selected amino (represents the spatial distribution of the atoms of the protein in acidresidue number, residue name and also tell that amino acid its native conformation) belongs to helix as they are close to diagonal value and beta sheet as they are parallel or anti parallel to diagonal. 20

National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2012) Proceedings published by International Journal of Computer Applications® (IJCA)

4. CONCLUSION Transactions on Computational Biology and As the definition of contact map, contact map provide that Bioinformatics, vol. 8, NO. 4. which amino acid residue are in contact by assigning it 1 and [4]Nitin Gupta, NitinMangal and SomenathBiswas (2004), which is not by assigning it -1. Contact map take protein “Evolution and similarity evaluation of Protein Structures sequence, align the map to proper templates and map is then in Contact map space. consider as folding pathway for protein structure prediction. [5]Giuseppe Tradigo (2009), “On the Integration of Protein Contact Map Predictions”, In:IEEE 8 Angstrom 12 Angstrom [6]JingjingHu ,XiaolanShen , Yu Shao , Chris Bystroff , Features Less Features More feature available Mohammed J. Zaki (2002), “Mining Protein Contact available off off main diagonal Maps”, In:Workshop on Data Mining in Bioinformatics main diagonal [7]Kibeom Park, Michele Vendruscolo, and EytanDomany Recovery It is not so easy It is easy (2000), “Toward an Energy Function for the Contact Map Representation of Proteins”, In: PROTEINS: Structure, Function, and Genetics 40:237–248 Structure Structure Structure prediction is prediction prediction is easier [8]ZaferAydin, YucelAltunbasak, and HakanErdogan (2011), little bit hard “Bayesian Models and Algorithms for Protein B-Sheet Table 2:Comparison of Contact Map with Different Angstrom Value Prediction”.In: Transactions on computational biology and Thus it can be said that if the threshold value of contact map is bioinformatics, vol. 8, no. 2 increased, then the structural constraint number also increases, [9]The Protein Data Bank,” http://www.rcsb.org/pdb, 2009 and it will be easier to extracted more detail for 3D structure prediction. [10] Introduction to protein structure and structural bioinformatics, “secondary-sructure.html” 5. REFERENCE [11] Gabriela Czibula, Maria-IulianaBocicor and Istvan- [1]Fernanda Hembecker, HeitorSilvério Lopes (2010), “A GergelyCzibula (2000), “A Reinforcement Learning Model Molecular Model for Representing Protein Structures and for Solving the Folding Problem". its Application to Protein Folding” IEEE. [12] Muhammad Asif Khan, Muhammad A. Khan, Zahoor Jan, [2]NarjesKhatoonHabibi, Mohammad HosseinSaraee (2009), Hamid Ali and Anwar M. Mirza (2010). “Performance of “Protein Contact Map Prediction Based On an Ensemble Techniques in Protein Fold Recognition Learning Method”, In:International Conference on Problem”. In: IEEE. Computer Engineering and Technology. [13] http://www.ics.uci.edu/~baldig/betasheet_data.html, 2009. [3]YosiShibberu and Allen Holder (2011), “A Sepctral Approach to Protein Structure Alignment”. In:IEEE/ACM

21