2015 3rd AASRI Conference on Computational Intelligence and Bioinformatics (CIB 2015) ISBN: 978-1-60595-308-3

Application of Bioinformatics on Protein Structure Prediction

Xiaodan Liu
Jilin Agriculture Science and Technology College, Jilin, China

ABSTRACT: With the advent of the information age, bioinformatics has developed rapidly and has been applied widely in many fields. This paper describes the establishment and development of common protein databases, protein structure analysis, and protein secondary and tertiary structure prediction. It comprehensively summarizes the application of bioinformatics in protein structure analysis, providing a theoretical basis for the wide application of bioinformatics in protein engineering.

1. INTRODUCTION

Bioinformatics provides a shortcut to protein function prediction and structural analysis by working directly from gene or protein sequences. The field arose from the interaction of biology, computer science and applied mathematics. It covers the acquisition, processing, storage, retrieval and analysis of data in order to explain their biological significance. Because genome and proteome research yields very rich data, these data must be managed with a high degree of automation, which requires rigorous databases and corresponding software. This work has greatly promoted the development of bioinformatics, and bioinformatics will in turn play a special role in structure prediction research (Hagen, 2000).

The emergence of bioinformatics has brought protein engineering into a new era. In the future, more detailed knowledge of protein structure and function, together with advances in high-throughput technology, may greatly expand the capabilities of protein structure prediction.

Protein structure prediction is an important task in bioinformatics. The biological function of a protein depends to a large extent on its spatial structure, so predicting that structure is very important. Sometimes a candidate new gene has no detectable homolog in any sequence search; such a gene is called an orphan gene. For the proteins encoded by orphan genes, bioinformatics can be used to search for homologs by structural comparison, or to predict the higher-order structure and infer the function from it directly.

2. PROTEIN STRUCTURE DATABASE

By analyzing the structure and function of natural proteins recorded in databases, the spatial structure and biological function of a given amino acid sequence can be predicted, and the amino acid sequence and spatial structure of a protein can be designed for a specific biological function.

The basic database of three-dimensional protein structures is the Protein Data Bank (PDB; http://www.pdb.org/), founded at Brookhaven National Laboratory in New York in 1971. It is the most important database of three-dimensional structures of biological macromolecules (proteins, nucleic acids and sugars) (Sussman et al., 1998; Berman et al., 2000; Westbrook et al., 2003). The database holds atomic coordinate data measured by X-ray diffraction and nuclear magnetic resonance (NMR) experiments. Protein structure data in PDB are widely used in studies of protein function and evolution, and they serve as a basis for protein structure prediction (He et al., 2014).

The PDBFinder database (http://www.cmbi.kun.nl/swift/pdbfinder/) is a secondary database built from PDB, DSSP and HSSP. It contains the PDB sequence, authors, R factor, resolution, secondary structure and other items that are not easy to read directly from a PDB entry. PDBFinder is regenerated automatically at EBI whenever a new version of the PDB is released.

Homology-derived Secondary Structure of Proteins (HSSP; http://www.sander.embl-heidelberg.de/hssp/) is a secondary structure database derived from homology between proteins. Each PDB entry has a corresponding HSSP file (Dodge et al., 1998). The database also provides homology information for all protein sequences in the SWISS-PROT database.
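
The coordinate data that PDB distributes can be read with very little code. The following is a minimal sketch, assuming the legacy fixed-column PDB text format, in which ATOM and HETATM records store the atom name, residue name, chain, residue number and x, y, z coordinates at fixed column positions. The names Atom, parse_atoms, make_atom_line and ca_distance, and the two sample atoms, are hypothetical and only serve the illustration; they are not part of any PDB software.

```python
import math
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract ATOM/HETATM records from legacy fixed-column PDB text."""
    atoms = []
    for line in pdb_text.splitlines():
        # Skip every record type other than coordinates, and truncated lines.
        if not line.startswith(("ATOM  ", "HETATM")) or len(line) < 54:
            continue
        atoms.append(Atom(
            serial=int(line[6:11]),        # columns 7-11
            name=line[12:16].strip(),      # columns 13-16
            res_name=line[17:20].strip(),  # columns 18-20
            chain_id=line[21],             # column 22
            res_seq=int(line[22:26]),      # columns 23-26
            x=float(line[30:38]),          # columns 31-38
            y=float(line[38:46]),          # columns 39-46
            z=float(line[46:54]),          # columns 47-54
        ))
    return atoms


def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z) -> str:
    """Build a synthetic ATOM line with the fixed column layout (illustration only)."""
    return "ATOM  {:5d} {:<4s} {:>3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, res_name, chain, res_seq, x, y, z)


def ca_distance(a: Atom, b: Atom) -> float:
    """Euclidean distance in angstroms between two atoms."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# Two made-up alpha-carbon atoms, not taken from any real PDB entry.
SAMPLE = "\n".join([
    "HEADER    SYNTHETIC EXAMPLE",
    make_atom_line(1, " CA", "ALA", "A", 1, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA", "GLY", "A", 2, 3.8, 0.0, 0.0),
    "END",
])

if __name__ == "__main__":
    atoms = parse_atoms(SAMPLE)
    print(len(atoms), "atoms; CA-CA distance:", ca_distance(atoms[0], atoms[1]))
```

Slicing by column rather than splitting on whitespace is deliberate, because adjacent fields in this format can run together without a separating space. For real work an established parser would normally be preferable to a hand-written one.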

The Structural Classification of Proteins database (SCOP; http://scop.mrc-lmb.cam.ac.uk/scop/) is a comprehensive ordering of all proteins of known structure according to their evolutionary and structural relationships. It is produced by manual classification, in which the proteins of known structure are classified hierarchically (Murzin et al., 1995; Andreeva et al., 2004). This resource lets users check whether a query protein is similar to a protein of known structure. Protein domains in SCOP are hierarchically classified into families, superfamilies, folds and classes.

The Molecular Modeling Database (MMDB; http://www.ncbi.nih.gov/Structure/MMDB/mmdb.shtml) is the three-dimensional structure database used by the Entrez search tool (Wang et al., 2000). It stores the structure and sequence data of the PDB in ASN.1 format. NCBI also provides Cn3D, a complete program for displaying 3D structures.

MUFOLD-DB (http://mufold.org/mufolddb.php) is a web-based database that collects and processes the weekly PDB files, providing users with non-redundant, cleaned and partially predicted structure data. For each non-redundant sequence, the SCOP domain classification is annotated and the structures of missing regions are predicted by loop modelling. In addition, evolutionary information, secondary structure, disordered regions and the processed three-dimensional structure are computed and visualized to help users better understand the protein (He et al., 2014).

3. PROTEIN STRUCTURE PREDICTION

Protein structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence, that is, the prediction of its folding and its secondary, tertiary and quaternary structure from its primary structure. Structure prediction is fundamentally different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes) (https://en.wikipedia.org/wiki/Protein_structure_prediction).

3.1 Protein secondary structure prediction

Protein secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins from knowledge of their amino acid sequence alone. It plays a vital role as an intermediate step in solving tertiary structures, which in turn provide insight into protein function (Kloczkowski et al., 2002). For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. Different amino acid residues have different tendencies to form the different secondary structures. The success of a prediction is determined by comparing it to the results of the DSSP algorithm (or a similar one, e.g. STRIDE) applied to the crystal structure of the protein. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins (Mount, 2004). The accuracy of secondary structure prediction from a single sequence is about 60% to 80% (Petersen et al., 2000). The best modern methods of secondary structure prediction in proteins reach about 80% accuracy (Pirovano & Heringa, 2010); this high accuracy allows the predictions to be used as features that improve fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. Ten important and effective methods were tested and analyzed under a unified standard; the comparison showed that APSSP2, SSpro2.0 and PSIPRED perform best, with prediction accuracy above 80% (Zhang et al., 2003).

So far, more than 20 different secondary structure prediction methods have been developed. One of the first algorithms was the Chou-Fasman method, which relies predominantly on probability parameters determined from the relative frequencies of each amino acid's appearance in each type of secondary structure (Chou & Fasman, 1974). The next notable program was the GOR method, an information theory-based method that employs the more powerful probabilistic technique of Bayesian inference (Garnier et al., 1978). Another big step forward was the use of machine learning methods. Artificial neural networks were used first; they take solved structures as training sets to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted due to the lack of three-dimensional structural information that would allow assessment of hydrogen bonding patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet (Mount, 2004). PSIPRED and JPRED are among the best-known neural network programs for protein secondary structure prediction (Cuff & Barton, 2000).

The predicted secondary structural states are not cross-validated by any of the existing servers, so the level of accuracy for an individual sequence is not reported. NNvPDB overcame this: it not only reported a greater Q3 but also validates every prediction against homologous PDB entries. NNvPDB is based on a neural network, with a new approach of training the network each time on five PDB structures that are similar to the query sequence. The average accuracy is 76% for helix, 71% for beta sheet and 66% overall (helix, sheet and coil) (Sakthivel et al., 2015).

3.2 Protein tertiary structure prediction

Protein tertiary structure prediction has been an important scientific problem for several decades, especially in bioinformatics and computational biology (Eisenhaber et al., 1995). Although more and more native structures are deposited in the Protein Data Bank (PDB), the gap between sequenced proteins and native structures is still widening because of the exponential increase in protein sequences produced by large-scale genome and transcriptome sequencing. It is estimated that <1% of protein sequences have native structures in the PDB (Rigden, 2009). Therefore, accurate computational methods for protein tertiary structure prediction, which are much cheaper and faster than experimental structure determination techniques, are needed to reduce this large sequence-structure gap. Furthermore, computational structure prediction methods are important for obtaining the structures of membrane proteins, which are hard to determine by experimental techniques such as X-ray crystallography (Yonath, 2011).

At present, the main methods of 3D spatial structure prediction are homology modeling, fold recognition and ab initio prediction. Homology modeling is the most successful and useful protein structure prediction method; it can be used to explain experimental data and to guide mutant design and drug design. Today, SWISS-MODEL is one of the most widely used structure modelling web servers worldwide, with more than 0.9 million requests for protein models annually (Biasini et al., 2014). Several other groups have developed similar systems for automated homology modeling, e.g. ModPipe (http://www.salilab.org) (Sanchez & Sali, 1998), CPHmodels (http://www.cbs.dtu.dk/services/CPHmodels/) (Lund et al., 1997), 3D-JIGSAW (http://www.bmm.icnet.uk/~3djigsaw/) (Bates et al., 2001), ESyPred3D (http://www.fundp.ac.be/urbm/bioinfo/esypred/) (Lambert et al., 2002) or SDSC1 (http://cl.sdsc.edu/hm.html).

In the early 1990s, the outstanding advance in protein structure prediction was fold recognition, also known as "threading" or "reverse folding" (Sippl, 1999). Its main contribution is to predict the structure of proteins that have no obvious sequence homology to proteins of known structure but nevertheless share a similar structure with them. Current applications are mainly sequence-based. Evolutionary information is obtained by comparison with sequences of known structure, using methods such as PSI-BLAST (Altschul et al., 1997) and hidden Markov model (HMM)-based methods (Karplus et al., 2001). When the structures of the spike protein (S2) of the SARS virus and HIV-1 gp41 were compared, the analysis revealed that, although the protein responsible for viral-induced membrane fusion of HIV-1 (gp41) differs in length from S2 and has no sequence homology with it, the two viral proteins share the sequence motifs that construct their active conformation (Kliger & Levanon, 2003). These structural features can help in the further design of antiviral drugs.

The ab initio prediction method does not depend on existing structural data; it uses molecular dynamics theory to predict and infer structural information directly from the protein sequence. At present, most ab initio methods search the space of protein conformations using techniques such as Monte Carlo, simulated annealing and genetic algorithms. The energy of each structure is evaluated in order to find the conformation with the lowest free energy. Jones and McGuffin presented a fragment-based prediction approach in which supersecondary structural fragments are assembled by a simulated annealing (SA) algorithm to achieve de novo protein structure prediction (Jones & McGuffin, 2003). To address the problem of searching the protein conformational space in ab initio protein structure prediction, a novel method using abstract convex underestimation (ACUE) within the framework of an evolutionary algorithm was proposed. Computing such conformations, essential to associate structural and functional information with gene sequences, is challenging due to the high dimensionality and rugged energy surface of the protein conformational space (Hao et al., 2015).

4. ASSESSMENT FOR PROTEIN STRUCTURE PREDICTION

Methods for protein structure prediction are flourishing and becoming widely available to both experimentalists and computational biologists. But how good are they? What is their range of applicability, and how can we know which method is better suited for the task at hand? Several evaluation efforts address these questions, such as the worldwide Critical Assessment of Techniques for Protein Structure Prediction (CASP; http://www.predictioncenter.org) experiment (Moult et al., 1995; Tramontano et al., 2008) and automatic evaluation servers such as EVAluation of Automatic protein structure prediction (EVA; http://cubic.bioc.columbia.edu/eva/) (Koh et al., 2003) and LiveBench (http://bioinfo.pl/meta/livebench.pl) (Fischer et al., 2000).

5. OUTLOOK

Bioinformatics is a new interdisciplinary subject that is still progressing and improving. It is not a "panacea" for the study of molecular biology, and it cannot completely replace experimental work. The analysis and prediction of bioinformatics are based on knowledge of molecular biology: they make reasonable inferences through full and effective use of previous theoretical knowledge. Therefore, there may be errors, and the results still need to be verified and supplemented by laboratory work.

REFERENCES

[1] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-3402.
[2] Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., Murzin, A.G. 2004. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 32(Database issue):D226-229.
[3] Bates, P.A., Kelley, L.A., MacCallum, R.M., Sternberg, M.J. 2001. Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins 45(Suppl 5):39-46.
[4] Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., et al. 2000. The Protein Data Bank. Nucleic Acids Res 28:235-242.
[5] Biasini, M., Bienert, S., Waterhouse, A., et al. 2014. SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42(Web Server issue):W252-258.
[6] Chou, P.Y., Fasman, G.D. 1974. Prediction of protein conformation. Biochemistry 13(2):222-245.
[7] Cuff, J.A., Barton, G.J. 2000. Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40(3):502-511.
[8] Dodge, C., Schneider, R., Sander, C. 1998. The HSSP database of protein structure-sequence alignments and family profiles. Nucleic Acids Res 26(1):313-315.
[9] Eisenhaber, F., et al. 1995. Protein structure prediction: Recognition of primary, secondary, and tertiary structural features from amino acid sequence. Crit Rev Biochem Mol Biol 30:1-94.
[10] Fischer, D., Elofsson, A., Rychlewski, L. 2000. The 2000 Olympic Games of protein structure prediction; fully automated programs are being evaluated vis-a-vis human teams in the protein structure prediction experiment CAFASP2. Protein Eng 13(10):667-670.
[11] Garnier, J., Osguthorpe, D.J., Robson, B. 1978. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120(1):97-120.
[12] Hagen, J.B. 2000. The origins of bioinformatics. Nat Rev Genet 1(3):231-236.
[13] Hao, X., Zhang, G., Zhou, X., Yu, X. 2015. A Novel Method Using Abstract Convex Underestimation in Ab-initio Protein Structure Prediction for Guiding Search in Conformational Feature Space. IEEE/ACM Trans Comput Biol Bioinform. [Epub ahead of print]. PMID: 26552093.
[14] He, Z., Zhang, C., Xu, Y., Zeng, S., Zhang, J., Xu, D. 2014. MUFOLD-DB: a processed protein structure database for protein structure prediction and analysis. BMC Genomics 15(Suppl 11):S2.
[15] Jones, D.T., McGuffin, L.J. 2003. Assembling novel protein folds from super-secondary structural fragments. Proteins 53(Suppl 6):480-485.
[16] Karplus, K., Karchin, R., Barrett, C., Tu, S., Cline, M., Diekhans, M., Grate, L., Casper, J., Hughey, R. 2001. What is the value added by human intervention in protein structure prediction? Proteins Suppl 5:86-91.
[17] Kliger, Y., Levanon, E.Y. 2003. Cloaked similarity between HIV-1 and SARS-CoV suggests an anti-SARS strategy. BMC Microbiol 3:20.
[18] Kloczkowski, A., Ting, K.L., Jernigan, R.L., Garnier, J. 2002. Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence. Proteins 49(2):154-166.
[19] Koh, I.Y., Eyrich, V.A., Marti-Renom, M.A., Przybylski, D., Madhusudhan, M.S., Eswar, N., Grana, O., Pazos, F., Valencia, A., Sali, A., et al. 2003. EVA: evaluation of protein structure prediction servers. Nucleic Acids Res 31(13):3311-3315.
[20] Lambert, C., Leonard, N., De Bolle, X., Depiereux, E. 2002. ESyPred3D: Prediction of proteins 3D structures. Bioinformatics 18:1250-1256.
[21] Lund, O., Frimand, K., Gorodkin, J., Bohr, H., Bohr, J., Hansen, J., Brunak, S. 1997. Protein distance constraints predicted by neural networks and probability density functions. Protein Eng 10:1241-1248.
[22] Mount, D.W. 2004. Bioinformatics: Sequence and Genome Analysis, 2nd ed. Cold Spring Harbor Laboratory Press.
[23] Moult, J., Pedersen, J., Judson, R., Fidelis, K. 1995. A large-scale experiment to assess protein structure prediction methods. Proteins 23(3):ii-v.
[24] Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4):536-540.
[25] Petersen, T.N., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J., Brunak, S., Gippert, G.P., Lund, O. 2000. Prediction of protein secondary structure at 80% accuracy. Proteins 41(1):17-20.
[26] Pirovano, W., Heringa, J. 2010. Protein secondary structure prediction. Methods in Molecular Biology 609:327-348.
[27] Rigden, D.J. 2009. From Protein Structure to Function with Bioinformatics. Springer, Dordrecht.
[28] Sakthivel, S., S K, M. H. 2015. NNvPDB: Neural Network based Protein Secondary Structure Prediction with PDB Validation. Bioinformation 11(8):416-421.
[29] Sanchez, R., Sali, A. 1998. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci USA 95:13597-13602.
[30] Sippl, M.J. 1999. Who solved the protein folding problem? Structure 7(4):R81-83.
[31] Sussman, J.L., Lin, D., Jiang, J., Manning, N.O., Prilusky, J., Ritter, O., Abola, E.E. 1998. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr 54(Pt 6 Pt 1):1078-1084.
[32] Tramontano, A., Cozzetto, D., Giorgetti, A., Raimondo, D. 2008. The Assessment of Methods for Protein Structure Prediction. In: Protein Structure Prediction. Methods in Molecular Biology, vol. 413. Humana Press, pp. 43-57.
[33] Wang, Y., Addess, K.J., Geer, L., Madej, T., Marchler-Bauer, A., Zimmerman, D., Bryant, S.H. 2000. MMDB: 3D structure data in Entrez. Nucleic Acids Res 28(1):243-245.
[34] Westbrook, J., Feng, Z., Chen, L., Yang, H., Berman, H.M. 2003. The Protein Data Bank and structural genomics. Nucleic Acids Res 31(1):489-491.
[35] Yonath, A. 2011. X-ray crystallography at the heart of life science. Curr Opin Struct Biol 21:622-626.
[36] Zhang, H.X., Tang, H.W., Zhang, L.Z., et al. 2003. Evaluation on prediction methods of protein secondary structure. Computers and Applied Chemistry 20(6):735-740.
