The Rough Guide to in Silico Function Prediction, Or How to Use Sequence and Structure Information to Predict Protein Function

CORE Metadata, citation and similar papers at core.ac.uk Provided by PubMed Central Education The Rough Guide to In Silico Function Prediction, or How To Use Sequence and Structure Information To Predict Protein Function Marco Punta1,2,3, Yanay Ofran4* 1 Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America, 2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), New York, New York, United States of America, 3 Northeast Structural Genomics Consortium (NESG), Columbia University, New York, New York, United States of America, 4 The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel function prediction methods. In the [7]. As we see, CbiF function comes in absence of independent benchmarks, different flavors: molecular/enzymatic comparing the figures reported by the (methyltransferase), metabolic (cobalamin developers is almost always comparing biosynthesis—directly—and DNA oranges and apples (for discussion of this biosynthesis—indirectly), and problem see [4]). Therefore, we refrain physiological (maintenance of healthy from reporting numerical assessments of nerve and red blood cells, through B12), specific methods. For those cases in which along with possible consequences related Introduction independent assessment of performance is to their malfunctioning. There are, available, we refer the reader to the obviously, numerous ways to describe Choosing the right function original publications. Finally, we discuss each of these aspects of the protein prediction tools. The vast majority of only methods that are either accessible as function. Enzymatic function, for known proteins have not yet been Web servers or freely available for example, may be characterized through: characterized experimentally, and there download (relevant Web links can be reaction (methylation), substrate (cobalt- is very little that is known about their found in Table S1). precorrin-4), or ligand (S-adenosyl-L- function. New unannotated sequences are What is protein function? The first methionine). added to the databases at a pace that far problem we face when dealing with Classifying and predicting. Since exceeds the one in which they are protein function is well-illustrated by the protein function has many facets, its annotated in the lab. Computational title of a 1998 article by Schubert et al. [5], prediction has different meaning for biology offers tools that can provide ‘‘The X-ray structure of a cobalamin different people. It may mean the insight into the function of proteins based biosynthetic enzyme, cobalt-precorrin-4 prediction of the cellular process in on their sequence, their structure, their methyltransferase.’’ What is the function which the protein is involved, or the evolutionary history, and their association of the protein that is described in this nitty-gritty of its enzymatic activity, or with other proteins. In this contribution, paper? The authors report the solution of rather its physiological role. Therefore, we attempt to provide a framework that the crystal structure of CbiF, which is an when attempting to predict protein will enable biologists and computational enzyme implicated in the biosynthesis of function one should first define clearly biologists to decide which type of vitamin B12 (cobalamin). More the kind of function she or he wants to computational tool is appropriate for the analysis of their protein of interest, and specifically, CbiF transfers a methyl predict. When predicting function what kind of insights into its function these group from an S-adenosyl-L-methionine automatically on a large scale, this tools can provide. In particular, we molecule to a precursor of vitamin B12 problem is intensified by the need to describe computational methods for (cobalt-precorrin-4). Vitamin B12 is a standardize and quantitatively assess the predicting protein function directly from compound that ‘‘helps maintain healthy similarity of functions between proteins. sequence or structure, focusing mainly on nerve cells and red blood cells, and is also While defining sequence and structural methods for predicting molecular needed to make DNA’’ [6]. Its deficiency similarity may be easy, there is no a priori function. We do not discuss methods that is related to anemia, as well as to several straightforward measure we can use to put rely on sources of information that are neurological and psychiatric symptoms a number on the similarity of functions beyond the protein itself, such as genomic context [1], protein–protein interaction networks [2], or membership in Citation: Punta M, Ofran Y (2008) The Rough Guide to In Silico Function Prediction, or How To Use Sequence and Structure Information To Predict Protein Function. PLoS Comput Biol 4(10): e1000160. doi:10.1371/ biochemical pathways [3]. When journal.pcbi.1000160 choosing a tool for function prediction, Published October 31, 2008 one would typically want to identify the best performing tool. However, a Editor: Fran Lewitter, Whitehead Institute, United States of America quantitative comparison of different tools Copyright: ß 2008 Punta, Ofran. This is an open-access article distributed under the terms of the Creative is a tricky task. While most developers Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. report their own assessment of their tool, in most cases there are no standard Funding: MP was supported by grant U54-GM75026-01 from the National Institutes of Health (NIH). datasets and generally agreed-upon Competing Interests: The authors have declared that no competing interests exist. measures and criteria for benchmarking * E-mail: [email protected] PLoS Computational Biology | www.ploscompbiol.org 1 October 2008 | Volume 4 | Issue 10 | e1000160 between two proteins. Prediction methods have since retained similarity in any of their ent functions [19,20]. For example, g- could not be developed, or rigorously properties is something that needs to be crystallin is a protein that plays a structural assessed, without such measure. Several checked in each individual case. An role in the eye lens of several species, while large-scale projects attempted to respond important distinction in this context is working as an enzyme in other tissues. to this challenge by building classification between orthologous and parologous Homologs of these proteins may retain systems or ontologies of biological sequences: orthologs are genes that only some of the original functions [21]. functions (see [8,9] for review). One such originated from a common ancestor As a consequence, function annotation enterprise was launched as early as 1955 through a speciation event, while paralogs transfer may result in erroneous or incom- by the International Congress of are the results of duplication events within plete assignments (Figure 1B). Biochemistry, which created the Enzyme the same genome. In general, function The multi-domain nature of many Commission to come up with a tends to be more conserved in orthologs proteins can also be the cause of annota- nomenclature for enzymes. In this than in paralogs [12]. So, when attempting tion transfer errors (Figure 1C). In fact, in numerical classification, each enzymatic to predict the function of an unannotated databases storing entire sequences (such as function could be described by a set of four protein based on its homology to an SWISS-PROT [18]), functional annota- numbers (which, together, are dubbed EC annotated one, one should search for tion of a protein may refer to any of its number). Each of these four numbers orthologs rather than paralogs (Figure 1A). domains. If the query protein (i.e., the represents specific description of the Although several databases have been protein whose function we wish to predict) enzyme and its activity. For instance, created to help identify orthologous genes does not align to that specific domain, when comparing carboxylesterase (e.g., COGs [13] and InParanoid [14]), annotation transfer is totally unjustified (3.1.1.1) and isochorismatase (3.3.2.1), ‘‘proven orthologs are as rare in the and will very likely result in a mis- one can tell that they share the basic literature as diamonds in bare rock’’ [12]. annotation. While a number of databases enzymatic activity of a hydrolase (all Orthologs, additionally, may also diverge and tools attempt to split proteins into hydrolases have 3 as the first number), functionally, sometimes more than domains based on sequence (Pfam [16], but they act on different types of bonds: corresponding paralogs [12]. Finally, PRODOM [22], SMART [23]), the most hydrolases with 3.1.-.- act on an ester bond there exist functional similarities between reliable way to identify protein domains is and those with 3.3.-.- act on an ether proteins that are not reflected in homology. by using, when possible, structural knowl- bond. This system is infinitely expandable These facts underline the difficulty of the edge (SCOP [24], CATH [25]). to include any new enzyme, but it does not task of transferring function from a Some of these problems can be mitigat- cover functions that are not enzymatic. homologous template. ed by the use of phylogenomic inference The Gene Ontology (GO) project provides In practice, the most common way to that frames sequence evolutionary rela- a controlled vocabulary to describe the infer homology is by detecting sequence tionship

Load more