Is the Growth Rate of Protein Data Bank Sufficient to Solve the Protein Structure Prediction Problem Using Template-Based Modeling?
Total Page:16
File Type:pdf, Size:1020Kb
Bio-Algorithms and Med-Systems 2015; 11(1): 1–7 Review Michal Brylinski* Is the growth rate of Protein Data Bank sufficient to solve the protein structure prediction problem using template-based modeling? Abstract: The Protein Data Bank (PDB) undergoes an Introduction exponential expansion in terms of the number of mac- romolecular structures deposited every year. A pivotal Linus Pauling, a chemist, peace activist, educator, and question is how this rapid growth of structural informa- the only person to be awarded two unshared Nobel Prizes tion improves the quality of three-dimensional models (Nobel Prize in Chemistry in 1954 and Nobel Peace Prize constructed by contemporary bioinformatics approaches. in 1962), concluded his Nobel Lecture with the following To address this problem, we performed a retrospective remark: “We may, I believe, anticipate that the chemist analysis of the structural coverage of a representative of the future who is interested in the structure of pro- set of proteins using remote homology detected by COM- teins, nucleic acids, polysaccharides, and other complex PASS and HHpred. We show that the number of proteins substances with high molecular weight will come to rely whose structures can be confidently predicted increased upon a new structural chemistry, involving precise geo- during a 9-year period between 2005 and 2014 on account metrical relationships among the atoms in the molecules of the PDB growth alone. Nevertheless, this encouraging and the rigorous application of the new structural prin- trend slowed down noticeably around the year 2008 and ciples, and that great progress will be made, through this has yielded insignificant improvements ever since. At the technique, in the attack, by chemical methods, on the current pace, it is unlikely that the protein structure pre- problems of biology and medicine” [1]. Indeed, struc- diction problem will be solved in the near future using tural biology has been rapidly developing over the past existing template-based modeling techniques. Therefore, half-century contributing to many major breakthroughs further advances in experimental structure determina- in life sciences. tion, qualitatively better approaches in fold recognition, The specific and unique three-dimensional struc- and more accurate template-free structure prediction tures give macromolecules the ability to perform various methods are desperately needed. cellular functions. Consequently, macromolecular structure, folding, and structural alterations affecting Keywords: comparative modeling; COMPASS; HHpred; the function of proteins and nucleic acids are of prime Protein Data Bank; protein fold recognition; protein importance to biologists. Biological structures are typi- structure prediction; protein threading; template-based cally resolved experimentally by X-ray crystallography modeling. and nuclear magnetic resonance spectroscopy; none- theless, these techniques are often expensive and time DOI 10.1515/bams-2014-0024 consuming. As a result, known biological sequences Received December 22, 2014; accepted January 8, 2015; previously greatly outnumber available structures; as of December published online February 7, 2015 2014, the National Center for Biotechnology Information Reference Sequence Database [2] contains 46,968,574 sequences, whereas the Protein Data Bank [3] (PDB) *Corresponding author: Michal Brylinski, Department of Biological features 105,097 structures. A high demand for protein Sciences, 202 Life Sciences Bldg., Louisiana State University, Baton structures stimulates the development of computational Rouge, LA 70803, USA; and Center for Computation and Technology, 2054 Digital Media Center, Louisiana State University, Baton Rouge, structure prediction methods, which in fact represent the LA 70803, USA, E-mail: [email protected]. only strategy to keep up with the rapidly growing volume http://orcid.org/0000-0002-6204-2869 of sequence data. Brought to you by | LSU LAW School Authenticated | [email protected] author's copy Download Date | 3/4/15 4:14 PM 2 Brylinski: Cheshire Cat in structure bioinformatics Prediction of protein tertiary Cheshire Cat in structure structures bioinformatics Current techniques for protein structure prediction fall Unquestionably, the past two decades have seen an encour- into two major categories: template-free and template- aging progress in protein structure prediction. The quality based methods [4, 5]. The first group comprises various of modeled structures is constantly improving, making algorithms that simulate the folding of polypeptide chains them suitable for a wide range of applications, including into their native conformations [6–8]. These approaches molecular function inference, the prediction of the effect build upon fundamental physical principles, that is, they of mutations, and rational drug design [34–39]. This pro- do not directly use structural information extracted from gress can be attributed to two major factors: the advances related proteins. Despite the promising progress in our in algorithms and methods for comparative protein struc- understanding of folding mechanisms and the develop- ture modeling and the continuous growth of structure ment of template-free methods [9], the quality of protein databases. It has been already established that the PDB models constructed by template-free modeling is gener- is likely complete at the level of compact, single domain ally insufficient for subsequent functional annotation protein structures, viz. suitable structure templates are and drug discovery compared to experimental structures present in the PDB to reliably model any protein sequence [10–12]. In contrast, template-based methods routinely [40, 41]. Therefore, developing efficient fold recognition generate models whose accuracy is often comparable to algorithms that are capable of detecting these templates that of low-resolution experimental structures [13–15]. can, in principle, solve the protein folding problem [42]. In comparative protein structure prediction, the three- On the other hand, perhaps the growth rate of the PDB dimensional model of a target protein is constructed is also sufficient for the current algorithms to be able to using a template that is typically an evolutionarily related effectively detect structure templates and build accurate protein whose structure has been solved experimentally models for any biological sequence in the near future. If [16–18]. In a nutshell, a template protein is first identified so, one can contentedly wait as a Cheshire Cat, a famous in the PDB and the target-to-template alignment is calcu- character in Lewis Carroll’s novel, for the protein folding lated. Next, according to this alignment, an initial model problem to be solved just by having enough structures for is generated using the coordinates of equivalent residues a near-complete mapping to the sequence space using in the template. Missing residues and loops are added to existing bioinformatics tools. For instance, it was esti- the initial model, which is subsequently subjected to a mated that solving 16,000 carefully selected structures refinement procedure to optimize intramolecular contacts would provide structural models for approximately 90% of and the packing of side chains. 300,000 sequences available in the Swiss-Prot and TrEMBL Certainly, the recognition of quality templates and [43] databases back in 2000 [44]. Considering the exponen- the construction of correct alignments are critical to the tial growth of the PDB, these structures were anticipated to success of comparative modeling. This can be achieved become available in about a decade [44, 45]. As expected, using purely sequence-based methods to identify evolu- the structural coverage of Swiss-Prot during that period tionarily closely related homologs [19–22]; however, detect- has increased [46]; nonetheless, the near-complete struc- ing remotely related templates in the “twilight zone” of tural mapping of the sequence space has yet to be attained. sequence similarity [23] requires more sensitive algorithms In this communication, we perform a retrospective such as protein threading and fold recognition [24]. Many analysis of the structural coverage of a representative set of these methods employ sequence profile alignments [25, of proteins using remote homology. Two state-of-the-art 26] to compare a profile of the target protein constructed sequence profile-based algorithms for fold recognition from homologous sequences against profiles generated for are tested: COMPASS [47] and HHpred [48]. We analyze 12 a nonredundant subset of proteins from the PDB. In addi- time-snapshots of the PDB downloaded from the archive tion, scoring functions used in threading commonly incor- of the Research Collaboratory for Structural Bioinformat- porate structural information in the form of pair potentials, ics that cover a period from 2005 to 2014. The results are solvent accessibility, secondary structure profiles, back- presented in terms of the success detecting structurally bone dihedral angles, and hydrophobic interactions [27– related proteins, the quality of target-to-template align- 31]. Finally, the accuracy of protein fold recognition can ments, and the overall prediction confidence. Although be further improved by combining several threading algo- the structural coverage is constantly increasing, we dem- rithms into meta-threading pipelines [32, 33]. onstrate that the protein folding problem is unlikely to be Brought to you by | LSU LAW School Authenticated