COFACTOR: an Accurate Comparative Algorithm for Structure-Based Protein Function Annotation Ambrish Roy, Jianyi Yang and Yang Zhang*
Total Page:16
File Type:pdf, Size:1020Kb
Published online 8 May 2012 Nucleic Acids Research, 2012, Vol. 40, Web Server issue W471–W477 doi:10.1093/nar/gks372 COFACTOR: an accurate comparative algorithm for structure-based protein function annotation Ambrish Roy, Jianyi Yang and Yang Zhang* Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109-2218, USA Received January 30, 2012; Revised March 31, 2012; Accepted April 12, 2012 ABSTRACT from the templates (2,4,5). However, the evidence of global structural similarity is usually insufficient for Downloaded from We have developed a new COFACTOR webserver accurate functional inference, as proteins possessing for automated structure-based protein function similar global fold can perform different biological func- annotation. Starting from a structural model, given tions. The classic examples include the proteins with by either experimental determination or computa- a-/b-barrel fold, which is inhabited by both enzymatic tional modeling, COFACTOR first identifies and non-enzymatic proteins (6). Accordingly, many con- template proteins of similar folds and functional temporary approaches have been designed to identify http://nar.oxfordjournals.org/ sites by threading the target structure through local structural similarity of functionally important three representative template libraries that have residues for drawing functional inferences (7,8). known protein–ligand binding interactions, Enzyme However, the functional annotation based on local struc- Commission number or Gene Ontology terms. The ture alone can result in high false-positive rate, especially when the target protein has a low sequence identity to biological function insights in these three aspects the template proteins or the target structure on its own are then deduced from the functional templates, has a low-resolution 3D structure (3,9). the confidence of which is evaluated by a scoring In this study, we describe a newly developed at University of Michigan on September 7, 2012 function that combines both global and local COFACTOR server, which combines both global and structural similarities. The algorithm has been local structural comparison algorithms to deduce the bio- extensively benchmarked by large-scale bench- logical functions of proteins, starting from their 3D struc- marking tests and demonstrated significant advan- ture. The output of the server includes function tages compared to traditional sequence-based annotations in three key aspects: protein–ligand binding methods. In the recent community-wide CASP9 interactions, Enzyme Commission (EC) (10) and Gene experiment, COFACTOR was ranked as the best Ontology (GO) (11). Keeping in mind that high-resolution method for protein–ligand binding site predictions. experimental structures are unavailable for most of the protein targets in genome databases, the algorithm has The COFACTOR sever and the template libraries are been extensively trained for low-resolution structures freely available at http://zhanglab.ccmb.med.umich generated from computational structure predictions. .edu/COFACTOR. Meanwhile, experimental structures undoubtedly meet the highest structural requirement and the predic- INTRODUCTION tion accuracy improves using these structures. In both large-scale benchmark (12) and blind experiments (2), The biological function of a protein molecule is decided the COFACTOR method has demonstrated significant by its 3D-shape, which eventually determines how the advantages over other state-of-the-art sequence- or molecule interacts with other molecules in living cells. structure-based comparative methods. As such, considerable efforts have been made to determine the structure of the protein molecules and to deduce the biological functions based on their 3D-shape MATERIALS AND METHODS (1–3). One of the most common structure-based approaches in protein function annotation is to detect COFACTOR algorithm homologous template proteins by global structure com- The input to the COFACTOR server is the 3D-structure parisons and then transfer known functional annotations of a target protein, which can be obtained from *To whom correspondence should be addressed. Tel: +1 734 647 1549; Fax: +1 734 615 6553; Email: [email protected] ß The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. W472 Nucleic Acids Research, 2012, Vol. 40, Web Server issue either structure prediction or experimental determination. the combination of the two), followed by Needleman– Figure 1 shows a general overview of the procedure Wunsch dynamic programming refinement (14). The followed on the COFACTOR server and the analysis objective function of the TM-align searching is done using the server, which includes detection of struc- TM-score (15): tural analogs in the PDB library and prediction of three 2 3 different aspects of protein function, namely, EC 61 XLali 1 7 numbers, GO terms and ligand binding sites. The TM À score ¼ max4 5 ð1Þ L 2 structure-based function inferences are made in two i¼1 1+ di steps, i.e. global structural alignment followed by local d0 structural similarity search. where di is the distance between ith pair of Ca atoms of query and template and Lali is the number of aligned Global structural similarity identification residue pairspffiffiffiffiffiffiffiffiffiffiffiffiffiffi identified by TM-align. d0 is given by 3 COFACTOR first identifies the template proteins of d0 ¼ 1:24 L À 15 À 1:8 and L is the length of the query similar fold/topology by matching the query structure protein. Since TM-score weights the short-distance residue with all proteins in three newly developed representative pairs stronger than the long-distance ones, it is more sen- Downloaded from functional libraries, which have known protein–ligand sitive to the global topology of proteins than the trad- binding information, EC numbers and GO terms itional structural similarity measurement RMSD. (J. Yang, A. Roy and Y. Zhang, submitted for Meanwhile, because only the aligned residues are publication). The global structure match is conducted by calculated in the summation which is normalized by the TM-align (13), a heuristic algorithm for global protein target length, TM-score in Equation (1) counts for both structure alignment, which starts from multiple seed align- alignment accuracy and the alignment coverage in a single http://nar.oxfordjournals.org/ ments (gapless threading, secondary structure match and parameter. Generally, a protein pair with TM-score >0.5 at University of Michigan on September 7, 2012 Figure 1. Illustration of structure-based function annotation by the COFACTOR server, starting from the query structure (shown in green). Nucleic Acids Research, 2012, Vol. 40, Web Server issue W473 indicates that they have the same fold while that with di is the Ca distance between ith aligned residue pair, Mii TM-score <0.3 have random structural similarity (16). is the normalized BLOSUM62 substitution scores between The global structural alignment between the query and ith pair of residues and d0 is the distance cutoff chosen to template structures is useful for exploring fold/family be 3.0 A˚. The second term in Equation (2) is to account for relationships of newly solved structure or predicted struc- the evolutionary information of the functional sites. For tural models. However, some folds are functionally each binding pocket on the template, this procedure is diverse and in these cases, function can be accurately pre- implemented for all the conserved query motifs and the dicted only by evaluating the similarity of active/binding one with the highest Lsim is recorded (Figure 2). site residues that are involved in the function. Moreover, in many cases, the functional motifs remain conserved during the evolution to maintain the function, even Functional analyses when the global structural similarity dwindles. Thus, a The COFACTOR server provides a variety of available local sequence and structural comparion of functional annotations for the query protein using the templates, sites may provide a more reliable way of functional anno- including EC number, GO and protein–ligand binding tation for the query proteins. sites. We provide a brief overview of the three aspects of Downloaded from Accordingly, on COFACTOR server, all template predicted functions by COFACTOR server below. proteins with a non-random structural similarity (i.e. TM-score > 0.3) (16) to the query structure (or up to 100 Enzyme Commission number top templates regardless of TM-score are used if <100 For the purpose of classifying enzymatic proteins, all non-random templates are identified) in each of the enzyme protein structures with annotated EC number(s) three function libraries (see below) are screened further have been collected from the PDB library (18) with the http://nar.oxfordjournals.org/ based on their local similarity to query structure. active site residue information mapped using Catalytic Site Atlas (19). As of January 2012, this compiled Local functional site identification enzyme template library contains 8392 protein structures. In the second step, a heuristic algorithm has been The active site motifs of the template structures are im- developed to identify the best local functional site match portant for the local