From: ISMB-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Predicting Function from Sequence: A Systematic Appraisal

Imran Shah
Computational Sciences and Informatics, George Mason University, Fairfax, VA 22030 USA
[email protected]

Lawrence Hunter
Bldg. 38A, 9th fl, MS-54, National Library of Medicine, Bethesda, MD 20894 USA
[email protected]

Abstract

Gapped and ungapped sequence alignment were tested as possible methods to classify proteins into the functional classes defined by the International Enzyme Commission (EC). We exhaustively tested all 15,208 proteins labeled with any EC class in a recent release of the SwissProt database, evaluating all 1,327 relevant EC classes. We effectively tested all possible similarity thresholds that could be used for this assignment through the use of the ROC statistic. Approximately 60% of Enzyme Commission classes containing two or more proteins could not be perfectly discriminated by sequence similarity at any threshold. An analysis of the errors indicates that false positive matches dominate, and that various error mechanisms can be identified, including the multidomain nature of many proteins and polyproteins, convergent evolution, variation in enzyme specificity, and other factors. Many of the putative false positives are in fact biologically relevant. This work strongly suggests that functional assignment of enzymes should attempt to delimit functionally significant subregions, or domains, before matching to EC classes.

Keywords: protein function; enzyme classes; sequence alignment, pairwise; Enzyme Commission; receiver operating characteristic

Introduction

Ideally, the biological function of a gene is revealed by an understanding of the detailed structure of its protein product. However, since protein structure determination is difficult, it is advantageous to be able to infer function directly from sequence. This paper reports on a systematic test of the hypothesis that enzymatic activity, a large subclass of biological function, can be predicted directly by sequence similarity. A useful result of this systematic test is the identification of the approximately 60% of EC classes which cannot be well predicted in this way. We investigated these specific prediction failures, and have characterized several contributing factors as a prelude to more sophisticated function prediction strategies.

The Enzyme Commission Classification

Biological function is a complex and imprecise concept. However, enzymes, which make up a large fraction of proteins, can be precisely characterized by the reactions that they catalyze. The International Enzyme Commission (EC) (EC 1961) (NC-IUBMB 1992) has developed a classification scheme based on this observation. The scheme is hierarchical, with four levels. At the top of the hierarchy are six broad classes of enzymatic activity; at the bottom are around 3,500 specific reaction types. EC classes are generally expressed as a string of four numbers separated by periods. References to EC classes with fewer than four numbers indicate an internal node in the tree, including all of the subclasses or leaves below it. The numbers specify a path down the hierarchy, with the leftmost number identifying the highest level. All results reported in this paper are based on the EC database ENZYME (Bairoch 1994) release 20.

For example, consider the EC class identified as 1.2.3.4. This class is a member of the top level group 1, the oxidoreductases. The second level of the hierarchy identifies a sub-class; for the oxidoreductases, the second level specifies the kind of donor which is oxidized. In this case, sub-class 2 means the enzyme acts on the aldehyde or oxo group of donors. The third position in the oxidoreductase group specifies the kind of acceptor. In this example, the sub-sub-class 3 means that oxygen is the acceptor. The lowest level in the hierarchy (the leaf node) identifies a particular reaction. In this example, leaf 4 means the reaction is oxalate + O2 → CO2 + H2O2, which is catalyzed by two known proteins.

Table 1 summarizes the hierarchy and the distribution of 15,208 proteins which have been assigned an EC class. Each row corresponds to a top level classification. The number of subclasses at each level is then specified; for example, the oxidoreductases contain 19 second level groups, 64 third level groups and 311 leaves. The EC classes which are included in

                         nodes at level               number of proteins in the leaves
EC  Name              2     3      4     prots      1    2-5   6-10  11-50   >50
1   Oxidoreductases  19    64    311      3766     84    116     56     44    11
2   Transferases      9    24    333      4363     68    139     63     51    12
3   Hydrolases        8    42    430      4649    127    173     66     49    15
4   Lyases            7    15    131      1604     23     49     21     36     2
5   Isomerases        6    15     54       576      7     22      9     14     2
6   Ligases           5     9     68       655      6     17     26     18     1
    Total            54   169   1327     15208    315    516    241    212    43

Table 1: Summary of the distribution of protein sequences in the EC classes. (Due to the assignment of some proteins to multiple EC classes, the sum of the entries in the column labeled prots is more than 15,208.)

these counts have at least one protein. While there are 1,327 such classes, there are another 2,000 which do not have any proteins assigned to them. The total number of proteins in each top level classification is noted, as is the distribution of those proteins at the leaves. For oxidoreductases, there are 200 leaves with five or fewer proteins, and six leaves with more than 100 proteins. There are a total of 315 leaves which have only one protein.

Around 400 multifunction proteins are assigned more than one EC class; the remaining proteins have only one EC number. Furthermore, for some proteins the specific function is not known and only partial EC classifications have been made. About 122 proteins are given only top level classifications, 163 are classified only to the second level, and 1,809 are classified only to the third level. Also, a few multifunction proteins have some partial EC numbers assigned to them. Note that a majority of the leaf classes have five or fewer proteins assigned to them.

Functional Assignment by Sequence Comparison

Our goal was to be able to take the amino acid sequence of an enzyme as input, and identify the class that the EC had assigned to it on the basis of sequence similarity. We tested the most common gapped and ungapped sequence comparison tools, FASTA (Pearson 1990) and BLAST (Altschul et al. 1990).

Sequence similarity scores calculated by an alignment algorithm are used as a measure of the similarity between a pair of proteins. Ideally, we would be able to set a threshold such that any query that matched a protein in a particular EC class with a similarity greater than the threshold could be guaranteed to be in the same EC class. There are several reasons that this ideal cannot be met. First, not all proteins with the same function have a high degree of sequence similarity; neither are all functionally distinct proteins always dissimilar in sequence. Second, many EC classes have very few sequences associated with them. It may be hard to recognize such classes due to undersampling.

The ability of sequence similarity measures to accurately predict EC classes depends both on the similarity measure used and on the threshold set for prediction. There is a tradeoff between the sensitivity and specificity of predictions, based on the level at which the threshold is set. These terms can be defined formally as follows. Let $\theta_0$ be the threshold set for an EC class, and $\theta$ be the similarity score between a given query sequence and another sequence which is a member of the EC class. If $\theta > \theta_0$, the query is predicted to be a member of the EC class. If a query is genuinely a member of the EC class and $\theta > \theta_0$, this query is a true positive ($t^{+}$). If $\theta < \theta_0$, the query is a false negative ($f^{-}$). If the query is not genuinely a member of the EC class and $\theta < \theta_0$, then the query is a true negative ($t^{-}$). If a query is not genuinely a member of the EC class and $\theta > \theta_0$, then the query is a false positive ($f^{+}$). For a set of query sequences, let $T^{+}$ be the number of sequences classified $t^{+}$, and likewise for $T^{-}$, $F^{+}$ and $F^{-}$. Sensitivity $P^{+}$ and specificity $P^{-}$ can then be defined as follows:

$$P^{+} = \frac{T^{+}}{T^{+} + F^{-}} \qquad (1)$$

$$P^{-} = \frac{T^{-}}{T^{-} + F^{+}} \qquad (2)$$

Because $(P^{+}, P^{-})$ vary with $\theta_0$, either some optimal value of $\theta_0$, or a performance measure independent of $\theta_0$, is needed. By gradually increasing $\theta_0$ in such a way that $0 \le (P^{+}, P^{-}) \le 1$, a relationship between $(P^{+}, P^{-})$ is obtained. This is known as a receiver operating characteristic, ROC (Swets 1982). Since the perfect performance for all values of $\theta_0$ is $P^{+} = 1$ and $P^{-} = 1$, the area under the ROC should ideally be 1.
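The definitions above can be made concrete with a short sketch. It is illustrative only: the function names, the six scores, and the membership labels are hypothetical, and it assumes that a higher score means greater similarity and that a score exactly equal to the threshold counts as "not predicted" (the text leaves ties undefined). A measure such as an expectation value, where lower means more similar, would need to be negated first.

```python
from typing import List, Sequence, Tuple


def confusion_counts(scores: Sequence[float], is_member: Sequence[bool],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (T+, F-, T-, F+) for the rule 'predict member if score > threshold'."""
    t_pos = f_neg = t_neg = f_pos = 0
    for score, member in zip(scores, is_member):
        predicted = score > threshold  # assumption: ties count as "not predicted"
        if member and predicted:
            t_pos += 1
        elif member:
            f_neg += 1
        elif predicted:
            f_pos += 1
        else:
            t_neg += 1
    return t_pos, f_neg, t_neg, f_pos


def sensitivity_specificity(scores: Sequence[float], is_member: Sequence[bool],
                            threshold: float) -> Tuple[float, float]:
    """Equations (1) and (2): P+ = T+/(T+ + F-), P- = T-/(T- + F+)."""
    t_pos, f_neg, t_neg, f_pos = confusion_counts(scores, is_member, threshold)
    sens = t_pos / (t_pos + f_neg) if (t_pos + f_neg) else 0.0
    spec = t_neg / (t_neg + f_pos) if (t_neg + f_pos) else 0.0
    return sens, spec


def roc_points(scores: Sequence[float],
               is_member: Sequence[bool]) -> List[Tuple[float, float]]:
    """(specificity, sensitivity) at each threshold, from lowest to highest."""
    # One threshold below every score, then one at each distinct score.
    thresholds = [min(scores) - 1.0] + sorted(set(scores))
    points = []
    for th in thresholds:
        sens, spec = sensitivity_specificity(scores, is_member, th)
        points.append((spec, sens))
    return points


# Hypothetical similarity scores for six queries against one EC class.
scores = [9.0, 7.0, 6.0, 4.0, 3.0, 1.0]
is_member = [True, True, False, True, False, False]
counts = confusion_counts(scores, is_member, threshold=5.0)
curve = roc_points(scores, is_member)
```

With the lowest threshold every query is predicted positive (sensitivity 1, specificity 0), and with the highest none is (sensitivity 0, specificity 1); the points in between trace the tradeoff described in the text.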

[Figure 1 plot: panels (a) and (b); axes from 0.0 to 1.0; horizontal axis: specificity ($P^{-}$)]

Figure 1: Hypothetical ROC curves

The area under the ROC gives an overall measure of discrimination performance, P (equation 3).

$$P = \int_{0}^{1} P^{+} \, dP^{-} \qquad (3)$$

Figure 1 illustrates the use of ROC curves by hypothetical plots for two boundary cases. In figure 1(a) the sensitivity is less than 1 for a specificity of 0 ($P^{+} < 1$ for $P^{-} = 0$). No matter how low the threshold $\theta_0$, not all positive members will be identified. On the other hand, in figure 1(b), the sensitivity drops to 0 before the specificity reaches 1 ($P^{+} = 0$ and $P^{-} < 1$). In this case, no matter how high the threshold is set, there will always be false positive matches.

We can now use the ROC curves and P scores to determine how accurately sequence can be used to predict EC classification, independent of the choice of threshold.

Method

We used the SWISS-PROT release 33 (Bairoch & Boeckmann 1992) database to obtain sequences labeled with EC classifications. We tested 13,215 sequences of the enzymes which had complete EC numbers assigned to them. The entire SWISS-PROT database was searched using both FASTA and BLAST with their default parameter values and substitution matrix set to BLOSUM62, recording all matches and their scores. For each of the 1241 nodes in the EC hierarchy (after excluding the 315 leaf nodes with only a single protein instance), two ROC curves were calculated, using FASTA Z-score and BLAST expectation as similarity measures. The performance P was calculated by numerical integration of the area under the ROC curves.

Results

Pairwise alignment was sufficient to determine the correct EC class of 12,150 out of 13,215 proteins. Of the 1,065 proteins which were not correctly assigned, 315 cannot be assigned by sequence similarity, since there is only one protein in the EC class. Therefore, 750 out of 12,900 proteins (about 6%) are incorrectly assigned an EC class by sequence similarity.

The problem is much worse when considered class by class rather than protein by protein. Only 470 out of 1,221 EC classes (about 40%) at the leaves had P scores of 1, indicating that there exists a threshold that perfectly discriminates members of these classes by sequence similarity alone. No sequence similarity threshold can be used to reliably assign sequences to the remaining classes with P < 1. The majority of the remaining EC classes (about 48%) have 0.8 < P < 0.99. More sophisticated methods would be expected to improve the performance of these classes. The performance results for all the EC classes, calculated using BLAST, are given in table 2. The results for FASTA are quite close to these. The ROCs for each of the six top level EC classes are given in figure 2.

The remainder of this section examines representatives of the classes with P < 1. Most of the problems were due to false positive matches, where a query sequence would match sequences in several other EC classes as well as sequences in the correct class. It was not possible to set thresholds, even on a class by class basis, such that many false positives would not occur.

A number of classes have P = 0, indicating that sequence similarity is never useful in predicting membership in these classes. In these cases, no detectable sequence similarity could be found between the members of these classes. Pyruvate synthase, EC 1.2.7.1, for example, is made up of four sub-units (Bock et al. 1996). The four sequences in this class

[Figure 2 plot: six panels, EC 1 through EC 6; axes from 0.0 to 1.0; horizontal axis: specificity; legend: blast, fasta]

Figure 2: Each sub-plot shows the ROC for a top level EC class using BLAST (solid) and FASTA (dotted).
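The Method section states only that P was obtained by numerical integration of the area under each ROC curve. A minimal sketch of one way to do this is given below, assuming the trapezoidal rule (the paper does not say which rule was used) and assuming each curve is supplied as hypothetical (specificity, sensitivity) pairs.

```python
from typing import Iterable, Tuple


def roc_area(points: Iterable[Tuple[float, float]]) -> float:
    """Trapezoidal estimate of the area under sensitivity plotted against specificity.

    Each point is (specificity, sensitivity). Points are ordered by increasing
    specificity; where specificity ties, higher sensitivity comes first so that
    vertical drops in the curve contribute zero area.
    """
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical curves, not data from the paper.
perfect = [(0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]         # a threshold separates the class exactly
chance = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]          # diagonal: no discrimination
missed_members = [(0.0, 0.8), (1.0, 0.8), (1.0, 0.0)]  # sensitivity never reaches 1
```

Under these assumptions the perfect curve integrates to 1, the diagonal to 0.5, and a curve whose sensitivity never exceeds 0.8 cannot exceed 0.8.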

                      Performance range
EC        0   0-0.4   0.4-0.6   0.6-0.8   0.8-0.9   0.9-0.99     1
1        15       2         5        39        47         89   109
2        11       0         3        36        40         75   134
3         3       2         9        43        69         76   140
4         3       0         1         8        23         46    49
5         1       0         1        10        17         19    19
6         0       0         0         9        26         23    19
total    33       4        19       145       222        328   470

Table 2: Performance results using BLAST expectation for all nodes in the EC, separated by EC class. Each column refers to a range of P, P1 < P ≤ P2, or a single value.
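The headline percentages in the Results section follow directly from the totals row of Table 2 and the protein counts quoted in the text. The short check below reproduces that arithmetic; the variable names are illustrative, and only numbers stated in the paper are used.

```python
# Totals row of Table 2: number of EC nodes in each range of P.
table2_totals = {
    "0": 33, "0-0.4": 4, "0.4-0.6": 19, "0.6-0.8": 145,
    "0.8-0.9": 222, "0.9-0.99": 328, "1": 470,
}
n_classes = sum(table2_totals.values())
frac_perfect = table2_totals["1"] / n_classes   # classes with P = 1
frac_imperfect = 1.0 - frac_perfect             # classes with P < 1

# Protein-by-protein counts quoted in the Results section.
tested = 13215       # sequences with complete EC numbers
correct = 12150      # correctly assigned by pairwise alignment
singletons = 315     # sole member of their EC class, so unassignable by similarity
not_correct = tested - correct
misassigned = not_correct - singletons
assignable = tested - singletons
error_rate = misassigned / assignable

print(f"classes with P = 1: {frac_perfect:.1%} of {n_classes}")
print(f"misassigned proteins: {misassigned} of {assignable} ({error_rate:.1%})")
```

The class-level fraction rounds to the "about 40%" quoted in the text, and its complement to the "approximately 60%" of classes that cannot be perfectly discriminated.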

are fragments from each of the four subunits and share no sequence similarity. On the other hand, for haloacetate dehalogenase, EC 3.8.1.3, there are two distinct sequences for different forms of the enzyme. Examination of the high scoring alignments with one of the instances of EC 3.8.1.3 shows significant similarity with a sibling EC class, haloacid dehalogenases, EC 3.8.1.2. These are recorded as false positives even though the activities are closely related.

Muconate cycloisomerase, EC 5.5.1.1, also has P = 0 due to the occurrence of two distinct forms of the enzyme. The two enzymes are of bacterial and fungal origins and they may have evolved convergently (Mazur et al. 1994). As in the case of EC 3.8.1.3, there is appreciable functional and sequence similarity between a bacterial instance of EC 5.5.1.1 (CATB_PSEPU) and enzymes in the sibling EC node EC 5.5.1.7, chloromuconate cycloisomerase. Treating the EC numbers literally as labels makes this match count as a false positive. Yet sometimes a false positive may indeed have a different function. This is shown by a match between CATB_PSEPU and MANR_PSEPU, a mandelate racemase belonging to EC 5.1.2.2. Both enzymes are similar in sequence and structure but mechanistically distinct (Neidhart, Kenyan, & Petsko 1990).

In cases when 0 < P < 1 there is no value of $\theta_0$ for which $F^{+} = 0$ and $F^{-} = 0$. No matter what threshold is set, there will always be either false positives, false negatives or both. We have found various possible

[Figure 3 plot: two panels, BLAST and FASTA; axes from 0.0 to 1.0; horizontal axis: specificity]

Figure 3: ROC for EC 3.6.1.23

explanations for the low P classes. Many of these classes have very few sequences in them, suggesting that they are difficult because they are currently underrepresented in the database. This appears to be the explanation for the low P values observed for many of the classes with fewer than 10 proteins.

Assessing the performance of catalytic activities which occur as components of multidomain proteins is difficult. Consider the case of fatty acid synthetase (FAS), EC 2.3.1.85, which consists of seven catalytic domains: acetyltransferase, EC 2.3.1.38; malonyltransferase, EC 2.3.1.39; β-ketoacyl synthase, EC 2.3.1.41; β-ketoacyl reductase, EC 1.1.1.100; enoyl reductase, EC 1.3.1.10; β-hydroxyacyl dehydratase, EC 4.2.1.61; and thioesterase, EC 3.1.2.14. Not all instances of FAS in the database contain each of these domains. Sequence alignment with FAS yields some matches with other FAS proteins, followed by local matches with individual domains. If the whole protein is used as an instance of a component EC class, local matches with other classes are counted as false positives. If the extent of the domain on the protein is known, these false matches can be resolved. Although some of the domain coordinates are available in SWISS-PROT feature records, they are not used in the current treatment. The performance of FAS and the domains in it was found to lie within the range 0.7 < P < 0.9. The performance of EC classes which have domains in FAS is degraded due to false positives. Therefore those EC classes which have more single domain instances are found to have better performance.

The viral pol polyprotein is another example of a multifunction protein. It consists of: retropepsin, EC 3.4.23.16; in some cases dutpase, EC 3.6.1.23; reverse transcriptase, EC 2.7.7.49; and ribonuclease, EC 3.1.26.4. The polypeptide undergoes cleavage to yield mature proteins with individual activities. Sequence alignment with the whole sequence yields matches with other pol proteins first, followed by individual reverse transcriptase and then dutpase sequences. Dutpase therefore has a low performance score of P = 0.4 using BLAST and P = 0.46 using FASTA (figure 3) while reverse transcriptase has P = 0.54 using BLAST and FASTA.

Also, enzymes that are not multifunction may nevertheless share domains with other proteins and thereby generate false positives. Adenylate cyclase, 4.6.1.1, has P = 0.8 (figure 4) for both BLAST and FASTA. Most of the false matches are due to common protein phosphatase domains. The most significant of such matches were found to occur with serine and threonine specific protein phosphatase, 3.1.3.16, which has P = 0.98 for both BLAST and FASTA (figure 4). Proteins within 4.6.1.1 are less similar to each other than ones within 3.1.3.16. This lower within-class similarity makes it harder to set a threshold which distinguishes 4.6.1.1 from 3.1.3.16. The greater within-class similarity of 3.1.3.16 means that it is possible to set a relatively high threshold, which prevents the false positives which occur in 4.6.1.1.

Another reason for low P scores may be vagueness in the definitions of the EC classes. For example, 3.4.21.32, collagenolytic proteases, were found to match other aspartic endopeptidases, 3.4.23. The performance for FASTA, P = 0.71, was greater than that for BLAST, P = 0.62 (figure 5). A number of reasons could be attributed to this behavior: EC 3.4.21.32 was stated to have a broad specificity for peptide bonds; the false matches were with members of the peptidase family S1 (Rawlings & Barrett 1993), most of which possess some collagenolytic function. Another serine protease, tryptase, EC 3.4.21.59, was found to have low performance for similar reasons.

Performance can also be low due to matches with proteins which have not yet been assigned EC numbers. An example of this may be ribosomal RNA N-glycosidase, 3.2.2.22, which is a ribosome

[Figure 4 plot: two panels, EC 4.6.1.1 (P = 0.827) and EC 3.1.3.16 (P = 0.983); axes from 0.0 to 1.0; horizontal axis: specificity]

Figure 4: ROCs using BLAST for EC 3.1.3.16 and EC 4.6.1.1.

[Figure 5 plot: two panels, BLAST and FASTA; axes from 0.0 to 1.0]

Figure 5: ROC for EC 3.4.21.32

inactivating protein (RIP). It has P = 0.88 using BLAST and FASTA (figure 6). Inspection of the alignment lists shows matches with other RIPs which are not labeled with EC numbers.

Yet another possible reason for a low P score may be that proteins that are closely related in evolution may have diverged in function. Glutathione S-transferases (GST), 2.5.1.18, have P = 0.78 for FASTA and P = 0.75 for BLAST (figure 7). A large number of false positives are found with the eye lens protein, S-crystallin. Crystallins are evolutionarily related to GST (Zinov'eva & Piatigorsky 1994); however, they are catalytically inactive and are specialized for lens refraction (Chiou et al. 1995).

Alcohol dehydrogenase (ADH), EC 1.1.1.1, is an example which exhibits several of the problems discussed so far, and we will consider it in some detail. A value of P = 0.87 for BLAST and FASTA indicates enzymes in this class can be discriminated moderately well by sequence similarity. However, both false positives and false negatives are a problem in this class. In order to understand these problems, we analyzed both the sequences and the biology of the enzymes.

We clustered the pairwise sequence alignment scores of 120 ADH enzymes, identifying three distinct groups. These correspond to three known groups of structurally and mechanistically different forms of ADH (Jornvall et al. 1995). There are 80 zinc-containing, 45 short-chain, and five iron-containing types of ADH.

The zinc-containing forms of ADH bind two zinc atoms, only one of which is catalytically active. There are other forms of dehydrogenases in EC 1.1.1 which depend on zinc as a cofactor and share some sequence similarity with EC 1.1.1.1. The short-chain form of ADH belongs to a larger group of oxidoreductases which use NAD+ or NADP+ as cofactors, like EC 1.3.1. The iron-containing forms of ADH, which use iron as a cofactor, also share similarity with other enzymes in EC 1.1.1. Due to the evolutionary relationships between ADH enzymes in EC 1.1.1.1 and the sibling nodes of EC 1.1.1, false matches are produced. Also, because the relationships are different for each of the groups, their performance scores will be different if computed separately.

Some EC 1.1.1.1 enzymes are also bifunctional. In addition to the dehydrogenation of alcohols, they also catalyze the oxidation of formaldehyde when glutathione is present (EC 1.2.1.1). They belong to the zinc-containing ADH family; consequently there are many matches between these and other EC 1.1.1.1 members. These are examples of a single catalytic

[Figure 6 plot: two panels, BLAST (P = 0.884) and FASTA (P = 0.875); axes from 0.0 to 1.0]

Figure 6: ROC for EC 3.2.2.22

[Figure 7 plot: two panels, BLAST (P = 0.758) and FASTA (P = 0.784); axes from 0.0 to 1.0]

Figure 7: ROC for EC 2.5.1.18

domain having a broad specificity.

Although many other oxidoreductases use the same donors or acceptors as cofactors, little sequence similarity is observed in the known cases. Some short chain ADH enzymes in EC 1.1.1.159 and EC 1.3.1.28, and enzymes in EC 1.6.1.1 and EC 1.4.1.1, show similar NAD+ binding sites. These create false positive matches.

Discussion and Conclusions

We systematically tested the usefulness of sequence similarity to predict EC class. In general, both sequence similarity measures tested provide a moderately good method for predicting biological function as defined by EC class. However, there are hundreds of classes of enzymes for which this method performs poorly. One factor is the lack of sequences associated with some of these classes. This undersampling should improve as the number of sequences known for different EC classes increases. However, there are also problems that are not associated with sample coverage.

The performance of EC classes which occur in multidomain proteins was found to be degraded by false matches. In this work, only the similarity score of protein matches was used to predict EC class membership. We did not consider information about where in a protein a match occurred, or any other aspects of the alignment. If such information were available, e.g., from domain annotations in SWISS-PROT records, it might have been possible to use it to reduce false positives by restricting consideration of similarity to matches only within an appropriate region. Distinguishing among specialized enzyme functions may not be possible using only overall measures of sequence similarity.

Catalytic functions belonging to the EC sub-sub-classes referring to structurally similar substrates were also found to have low performance. Because there is a functional similarity among sibling nodes, those matches, although strictly speaking false, are near misses. This is reflected in the increasing P values for higher nodes in the EC hierarchy, where sibling nodes are less similar to each other than they are at lower levels. Again in this case, distinguishing among specialized enzyme functions using sequence similarity was difficult.

One weakness of our general approach is the absolute nature of class membership assignments, and the simple correct/incorrect scoring method. A query sequence may align well with a sequence from a different EC class (or poorly with another sequence in the same class) for a variety of biological reasons. Many of the instances we label false positives here (and a few of the false negatives) are a result of biologically significant relationships among (or within) the EC classes themselves.

Evolutionary relatedness (and, by implication, sequence similarity) is not strictly correlated with functional relatedness. Hence many evolutionarily related proteins have diverged functionally. Examples of these in the EC are GST (EC 2.5.1.18) and S-crystallin; muconate cycloisomerase (EC 5.5.1.1) and mandelate racemase (EC 5.1.2.2). It is also the case that some evolutionarily unrelated proteins have converged to the same function. Examples of convergent evolution in the EC include the bacterial and fungal muconate cycloisomerases (EC 5.5.1.1), and the mammalian and insect alcohol dehydrogenases (EC 1.1.1.1). The evolutionary relationships between proteins must be given due consideration in order to use the EC as a practical tool for sequence based classification of enzyme function.

It is also quite possible that the Enzyme Commission classification itself is flawed in various ways. The continuous and consensus-based process for updating the classification is designed recognizing the fact that errors can be made. Any method for predicting protein function based on EC classification can only be as accurate as the classification itself.

In conclusion, sequence similarity methods provide a reasonable baseline for prediction of enzyme function, and the ROC statistic is a useful measure of the performance of this approach. The issue of gapped versus ungapped sequence alignment, as represented by FASTA and BLAST respectively, did not make much of a difference in performance. However, there is room for significant improvement. In evaluation, evolutionary relationships need to be considered along with the naive comparison of EC class numbers. Perhaps most importantly, methods that delimit functionally significant subregions of a query sequence before attempting to classify that sequence appear to be required for improving the quality of functional prediction.

References

Altschul, S.; Gish, W.; Miller, W.; Myers, E. W.; and Lipman, D. J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Bairoch, A., and Boeckmann, B. 1992. The SWISS-PROT protein sequence data bank. Nucleic Acids Research 20:2019-2022.

Bairoch, A. 1994. The ENZYME data bank. Nucleic Acids Res. 22:3626-3627.

Bock, A.; Kunow, J.; Glasemacher, J.; and Schonheit, P. 1996. Catalytic properties, molecular composition and sequence alignments of pyruvate: ferredoxin oxidoreductase from the methanogenic archaeon Methanosarcina barkeri (strain Fusaro). Eur. J. Biochem. 237:35-44.

Chiou, S., et al. 1995. Octopus S-crystallins with endogenous glutathione S-transferase (GST) activity: sequence comparison and evolutionary relationships with authentic GST enzymes. Biochem J. 309:793-800.

1961. Report of the Commission on Enzymes of the International Union of Biochemistry. Pergamon Press, Oxford.

Jornvall, H.; Danielsson, O.; Hjelmqvist, L.; Persson, B.; and Shafqat, J. 1995. The alcohol dehydrogenase system. Adv Exp Med Biol 281-294.

Mazur, P.; Pieken, W. A.; Budihas, S. R.; Williams, S. E.; Wong, S.; and W., K. J. 1994. Cis,cis-muconate lactonizing enzyme from Trichosporon cutaneum: evidence for a novel class of cycloisomerases in eucaryotes. Biochemistry 33:1961-1970.

1992. Enzyme Nomenclature. Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, Academic Press, New York.

Neidhart, D.; Kenyan, J.; and Petsko, G. 1990. Mandelate racemase and muconate lactonizing enzyme are mechanistically distinct and structurally homologous. Nature 347:692-694.

Pearson, W. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology 183:63-98.

Rawlings, N., and Barrett, A. 1993. Evolutionary families of peptidases. Biochem. J. 290:205-208.

Swets, J. 1982. Measuring the Accuracy of Diagnostic Systems. New York: Academic Press.

Zinov'eva, R. D.; Tomarev, S.; and Piatigorsky, J. 1994. The evolutionary kinship of the crystallins of cephalopods and vertebrates with heat-shock proteins and stress-induced proteins. Izv Akad Nauk Ser Biol 566-576.
