From: ISMB-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Predicting Protein Folding Classes without Overly Relying on Homology*

Mark W. Craven
Computer Sciences Department
University of Wisconsin-Madison
1210 W. Dayton St.
Madison WI 53706
[email protected]

Richard J. Mural
Biology Division
Oak Ridge National Laboratory
P.O. Box 2008, Bldg. 9211
Oak Ridge TN 37831-8077
[email protected]

Loren J. Hauser
Computer Science & Mathematics Division
Oak Ridge National Laboratory
P.O. Box 2008, Bldg. 9211
Oak Ridge TN 37831-8077
[email protected]

Edward C. Uberbacher
Computer Science & Mathematics Division
Oak Ridge National Laboratory
P.O. Box 2008, Bldg. 6010
Oak Ridge TN 37831-6364
[email protected]

Abstract

An important open problem in molecular biology is how to use computational methods to understand the structure and function of proteins given only their primary sequences. We describe and evaluate a novel machine-learning approach to classifying protein sequences according to their structural folding class. Our work is novel in several respects: we use a set of protein classes that previously have not been used for classifying primary sequences, and we use a unique set of attributes to represent protein sequences to the learners. We evaluate our approach by measuring its ability to correctly classify proteins that were not in its training set. We compare our input representation to a commonly used input representation, amino-acid composition, and show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.

Introduction

A problem of fundamental importance in molecular biology is understanding the structure and function of the proteins found throughout nature. Currently, the growth of protein sequence databases is greatly outpacing the ability of biologists to characterize the proteins in these databases. Efficient computational methods for predicting protein structure and function are highly desirable because the conventional laboratory methods, X-ray crystallography and NMR, are expensive and time-consuming. Currently, the best method for predicting the structure and function of a protein is to identify a homologous protein that has already been characterized. However, current genome-sequencing efforts suggest that as many as half of newly discovered proteins do not have corresponding, well-understood homologs (Fields et al. 1994). The goal of our research program, therefore, is to develop protein-classification methods that are not overly reliant on sequence homology, but instead represent the essential properties of analogous proteins that have similar folds. In this paper, we describe and evaluate a novel machine-learning approach for classifying protein sequences according to their structural fold family. Our experiments indicate that our approach provides a promising alternative to homology-based methods for protein classification.

There is a wide variety of existing approaches for predicting structural or functional aspects of proteins given their primary (i.e., amino-acid) sequences. These approaches include tertiary structure prediction (e.g., Kolinski & Skolnick 1992), secondary structure prediction (e.g., Rost & Sander 1993), homology searching (e.g., Pearson & Lipman 1988; Altschul et al. 1990), and classification according to folding class (Dubchak, Holbrook, & Kim 1993; Ferran, Ferrara, & Pflugfelder 1993; Metfessel et al. 1993; Wu et al. 1993; Nakashima & Nishikawa 1994; Reczko & Bohr 1994). Our approach falls into this latter category, protein classification, which itself encompasses a wide variety of methods. Existing protein-classification methods vary on a number of dimensions, including: the intended purposes of the systems; the level of abstraction of the protein classes; and whether or not the systems are able to discover their own classes. The approach we describe is novel in at least two respects: we use a set of protein classes that previously have not been used for classifying primary sequences, and we use a physical set of attributes to represent protein sequences.

* This research was supported by the Office of Health and Environmental Research, U.S. Department of Energy under Contract No. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.

Our approach uses machine-learning methods to induce descriptions of sixteen protein-folding classes. The folding classes that we use were devised by Orengo et al. (1993) in a large-scale computational effort to cluster proteins according to their structural similarity. These classes comprise analogous, as well as homologous, proteins. Whereas Orengo et al. used structural information to classify proteins, we are interested in classifying proteins when only their primary sequences are available. Thus, the role of the machine-learning algorithms in our approach is to induce mappings from primary sequences to folding classes. A key aspect of our method is the way in which we represent primary sequences to the learner. Unlike most protein-classification approaches, which represent proteins by their amino-acid composition, our method represents proteins using attributes that better capture the commonalities of analogous proteins. We empirically compare learning systems that use our input representation to learning systems that use amino-acid composition as their input representation. We show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.

Table 1: Protein class representation. The middle column lists the classes we use in our classification method. The left column indicates class families, and the right column lists, for each class, the number of examples we use in our experiments.

    family   class               examples
    α        Globin                    27
    α        Orthogonal                14
    α        EF Hand                    5
    α        Up/Down                    7
    α        Metal Rich                16
    β        Orthogonal Barrel          5
    β        Greek Key                 24
    β        Jelly Roll                 5
    β        Complex Sandwich           7
    β        Trefoil                    7
    β        Disulphide Rich           11
    α/β      TIM Barrel                15
    α/β      Doubly Wound              26
    α+β      Mainly Alpha               9
    α+β      Sandwich                  20
    α+β      Beta Open Sheet           14
             total examples           212

Problem Representation

The task that we address is defined as follows: given the amino-acid sequence of a protein, assign the protein to one of a number of folding classes. This problem definition indicates that there are two fundamental issues in implementing a classification method for the task: determining the attributes that are to be used to represent protein sequences, and defining the classes that are to be predicted. The remainder of this section discusses how we address these two issues.

Class Representation

The classes that we use in our approach are the fold groups defined by Orengo et al. (1993) in their effort to identify protein-fold families. These classes represent proteins that have highly conserved structures, but often low sequence similarity. Thus, the classes represent analogous, as well as homologous, proteins. Table 1 lists the fold groups that we use as our classes, the number of examples in each class that we use in our experiments, and whether each class falls into the α (primarily alpha), β (primarily beta), α/β (alternating α and β), or α+β (non-alternating α and β) family (Levitt & Chothia 1976).

The method that Orengo et al. used to define their fold groups involved four primary steps:

1. A set of proteins with known folds was assembled from the Brookhaven Protein Data Bank (Bernstein et al. 1977).

2. The proteins in this set were clustered according to sequence similarity. Using the Needleman-Wunsch algorithm (Needleman & Wunsch 1970), they performed pairwise comparisons on 1410 protein sequences selected in the previous step. They then used single-linkage cluster analysis to form clusters of related sequences. In single-linkage cluster analysis, two proteins, a and b, are assigned to the same cluster if there exists a chain of proteins linking a and b such that each adjacent pair in the chain satisfies a defined measure of relatedness. Two proteins were deemed related, in this case, if their sequence identity was at least 35%. For small proteins, this threshold was adjusted using the equation of Sander and Schneider (Sander & Schneider 1991). Also, for proteins with 25-35% sequence identity, a significance test was used to determine whether the proteins were to be considered related.

3. A representative protein was selected from each of the clusters formed in the previous step, and the resulting set of proteins was clustered according to structural similarity. Pairwise comparisons of proteins in this set were done using a variant of the Needleman-Wunsch algorithm that compared structural environments rather than primary sequences.

4. Multidimensional scaling was applied to the resulting structural homology matrix to form clusters of proteins with similar folds. The final fold groups were defined from this clustering by human interpretation, with the aid of schematic representations and topology diagrams.

In summary, Orengo et al. organized proteins with known structures into classes representing similar folds, but not necessarily similar primary sequences.
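The single-linkage clustering in step 2 can be sketched with a union-find structure. The sequence indices, pairwise identity values, and fixed 35% threshold below are illustrative stand-ins; the actual procedure used Needleman-Wunsch alignments, a length-adjusted threshold for short sequences, and a significance test for borderline pairs.

```python
# Single-linkage clustering sketch: two sequences end up in the same
# cluster if a chain of pairwise-related sequences connects them.
# Relatedness here is a simple identity threshold; the real procedure
# used Needleman-Wunsch alignments (Sander & Schneider adjustments
# for short sequences are omitted).

def single_linkage(n_seqs, identity, threshold=0.35):
    """identity[(i, j)] -> fractional sequence identity for pair (i, j)."""
    parent = list(range(n_seqs))           # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (i, j), ident in identity.items():
        if ident >= threshold:             # i and j are "related"
            parent[find(i)] = find(j)      # merge their clusters

    clusters = {}
    for s in range(n_seqs):
        clusters.setdefault(find(s), []).append(s)
    return list(clusters.values())

# Toy example: 0-1 and 1-2 are related, so 0 and 2 share a cluster
# even though they are not directly related (the chaining effect).
print(single_linkage(4, {(0, 1): 0.6, (1, 2): 0.4, (0, 2): 0.1}))
# -> [[0, 1, 2], [3]]
```

Note the chaining behavior: sequence 0 joins sequence 2's cluster through the intermediate sequence 1, which is exactly the property that lets single-linkage clusters span families with low direct pairwise identity.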


Whereas Orengo et al. developed their classes by clustering proteins according to structural similarity, we are interested in classifying protein sequences whose structures have not been determined. Obviously, their method is not applicable in such cases, since it takes structural information as input. Our approach therefore uses machine-learning methods to induce mappings from primary sequences to folding classes.

Our data set is formed in the following manner. For each of the fold groups listed in Table 1, we select between one and five of the examples that are representatives (as listed by Orengo et al.) of the clusters formed in step 2 above. Note that this set contains sequences with only very limited homology. We then use each of these proteins as a query sequence to search the SWISS-PROT database (Bairoch & Boeckman 1992) for similar sequences. We use both BLAST (Altschul et al. 1990) and FASTA (Pearson & Lipman 1988) for sequence comparisons. As many as nine examples are extracted from each search and added to our data set to increase the number of examples for each fold.

Input Representation

In order to employ a machine-learning method in this task, it is necessary to define an input representation; that is, a scheme for representing the proteins that are given to the system. The input representation that we use for our protein-classification approach involves a small number of attributes that can be readily computed from the primary sequence of a given protein. The attributes that we use are the following:

• Average residue volume: Using values that represent the volume of each amino acid's side group (Dickerson & Geis 1969), we calculate the average residue volume for a given sequence.

• Charge composition: We use three attributes to represent the fractions of residues in a given sequence that have positive charge, negative charge, and neutral charge (Lehninger, Nelson, & Cox 1993).

• Polarity composition: We use three attributes to represent the fractions of residues in a given sequence that are polar, apolar, and neutral (Lehninger, Nelson, & Cox 1993).

• Predicted α-helix/β-sheet composition: One of these attributes represents the fraction of the protein's residues that are predicted to occur in α-helices; the other represents the fraction that are predicted to occur in β-sheets. Note that both of these values are merely predictions, since the problem of calculating secondary structure from primary structure is exceptionally difficult itself. Our predictions are generated by a neural network that we trained using the data set of Qian and Sejnowski (1988). The trained network is scanned along the protein sequence, generating a prediction of α, β, or coil for each residue. The numbers of α and β predictions are summed and then divided by the sequence length.

• Isoelectric point: Using the Wisconsin Sequence Analysis Package (version 6.0) (Devereux, Haeberli, & Smithies 1984), we calculate the isoelectric point of the given sequence.

• Fourier transform of hydrophobicity function: Using hydrophobicity values for each amino acid, we convert a given sequence into a one-dimensional hydrophobicity function, H. We calculate the modulus of the Fourier transform of this function as follows (Eisenberg, Weiss, & Terwilliger 1984):

    |H(ω)| = sqrt( [Σ_n H_n sin(nω)]² + [Σ_n H_n cos(nω)]² )

where |H(ω)| is the value for the periodicity with frequency ω, H_n is the hydrophobicity of the nth residue, and n ranges over the residues in the sequence. We calculate this function at 1° intervals from 0° (corresponding to a period of infinity) to 180° (corresponding to a period of 2 residues). Finally, the six hydrophobicity attributes we use are computed by averaging values over each non-overlapping 30° interval in [0°, 180°].

We normalize the values for the volume, isoelectric, and hydrophobicity attributes so that they fall in the range [0, 1]. Values for the other attributes naturally lie in this range.

Empirical Evaluation

The underlying hypotheses of our approach are:

• The folding class of a protein can be accurately predicted, given only its primary sequence.

• The best representation for this classification task is one that attempts to capture the commonalities of analogous proteins that are in the same folding class.

Many protein-classification studies have represented proteins by their amino-acid composition (Klein & Delisi 1986; Nakashima, Nishikawa, & Ooi 1986; Dubchak, Holbrook, & Kim 1993; Metfessel et al. 1993), or by some description of the amino-acid n-mers that occur in sequences (Ferran, Ferrara, & Pflugfelder 1993; Nakashima & Nishikawa 1994; Reczko & Bohr 1994). Our hypothesis is that this type of representation is not well suited to the classification of proteins that have no close relatives in existing databases, or that have no close relatives whose structures have been determined. We conjecture that methods trained using such a representation will perform poorly when asked to classify proteins that do not have homologs in the training set. Our view is that protein-classification methods should be aimed at characterizing proteins that do not have well-understood homologs.

In order to test our hypotheses, we present a number of experiments that evaluate our approach. First, we measure how well several machine-learning algorithms generalize to unseen examples (that is, how accurately they classify examples that are not in their training sets) after learning to classify proteins using our problem representation. As a baseline for comparison, we also measure generalization for the same learning algorithms when amino-acid composition is used as the input representation. Our second experiment evaluates the relative contributions of the various attributes that comprise our input representation. Our third experiment tests the ability of systems trained using our representation to generalize to test cases for which there are no close relatives in the training set. This is a key experiment because our approach is motivated by the need to characterize proteins for which there are not any well-understood homologs. Finally, we demonstrate that the accuracy of our approach can be improved by having trained classifiers classify only examples for which they are confident in their predictions.

Table 2: Test-set accuracy using leave-one-out cross-validation. For each algorithm listed in the left column, the middle column lists the resulting test-set accuracy when our input representation is used. The right column lists test-set accuracy when amino-acid composition is used as the input representation.

    learning method     our representation   amino-acid representation
    C4.5                     60.8%                 49.1%
    nearest-neighbor         80.7%                 76.9%
    neural networks          83.0%                 70.8%
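The Fourier-transform hydrophobicity attributes described in the Input Representation section can be sketched as follows. The per-residue hydrophobicity values below are hypothetical placeholders, not the scale actually used in the paper, and the binning of the 0-180 degree range is slightly simplified (the 180-degree endpoint is dropped so each bin averages exactly 30 values).

```python
import math

# Sketch of the Fourier-transform hydrophobicity attributes.  The
# hydrophobicity values below are illustrative placeholders covering
# only a few amino acids, not the scale used by Eisenberg, Weiss, &
# Terwilliger (1984).
HYDRO = {'A': 0.62, 'I': 1.38, 'L': 1.06, 'G': 0.48,
         'E': -0.74, 'K': -1.50, 'S': -0.18}   # hypothetical subset

def ft_modulus(h, omega_deg):
    """|H(omega)| of a hydrophobicity signal h at angle omega (degrees)."""
    w = math.radians(omega_deg)
    s = sum(hn * math.sin(n * w) for n, hn in enumerate(h, start=1))
    c = sum(hn * math.cos(n * w) for n, hn in enumerate(h, start=1))
    return math.sqrt(s * s + c * c)

def hydrophobicity_attributes(sequence):
    """Six attributes: |H(omega)| averaged over 30-degree bins of [0, 180)."""
    h = [HYDRO[aa] for aa in sequence]
    spectrum = [ft_modulus(h, w) for w in range(0, 181)]  # 1-degree steps
    # Average over the six non-overlapping 30-degree intervals
    # (the endpoint at 180 degrees is dropped for simplicity).
    return [sum(spectrum[b:b + 30]) / 30.0 for b in range(0, 180, 30)]

print(hydrophobicity_attributes("AILGEKSAIL"))   # six bin averages
```

Peaks near 100 degrees in the spectrum correspond to the periodicity of an amphipathic α-helix, which is why this transform is informative about fold families.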

Measuring Generalization

The first task that we address in our experiments is to measure the accuracy of learners trained using our input and output representations. As a baseline for comparison, we also evaluate learners trained using amino-acid composition as their input representation. This representation has twenty attributes, each of which represents the fraction of a protein sequence that is composed of a particular amino acid.

We use several different learning algorithms to evaluate these two representations, since we do not know a priori which algorithm has the most appropriate inductive bias for each representation. We evaluate three inductive learning algorithms: C4.5 (Quinlan 1993), feed-forward neural networks (Rumelhart, Hinton, & Williams 1986), and k-nearest-neighbor classifiers (Cover & Hart 1967). We evaluate the suitability of these algorithms for the protein-classification task by estimating their generalization ability. In order to estimate generalization for each learning method, we conduct leave-one-out cross-validation runs. (In leave-one-out cross-validation, classifiers are trained on n-1 of the n available examples and then tested on the example left out; this process is repeated n times, so that each example is used as the testing example exactly once.)

C4.5 is an algorithm for learning decision trees. The complexity of the trees induced by C4.5 can be controlled by pruning trees after learning. In our experiments, we run C4.5 both without pruning and with pruning confidence levels ranging from 10% to 90%.

The neural networks that we use in our experiments are fully connected between layers, and have 3, 5, 10, 20, or no hidden units. We use the logistic activation function for hidden units, and the "softmax" activation function (Bridle 1989) for output units. The softmax function defines the activation of unit i as:

    a_i = e^(x_i) / Σ_n e^(x_n)

where x_i is the net input to unit i, and n ranges over all of the output units. The networks are trained using the cross-entropy error function (Hinton 1989) and a conjugate-gradient learning method (Kramer & Sangiovanni-Vincentelli 1989), which obviates the need for learning-rate and momentum parameters. Networks are trained until either (1) they correctly classify all of the training-set examples, (2) they converge to a minimum, or (3) 1000 search directions have been tried. The networks have one output unit per class; the class associated with the most active unit is taken as the network's prediction for a given test example. Since the solution learned by a neural network depends on its initial weight values, for all of our neural-network experiments we perform four cross-validation runs, using different initial weight settings each time.

For k-nearest-neighbor classifiers, we use a Euclidean distance metric to measure proximity. We construct classifiers that use values of k ranging from 1 to 10. The class predicted by a nearest-neighbor classifier is the plurality class of the k training examples that are nearest to a given test example. Ties are broken in favor of the nearest neighbor.

Table 2 reports leave-one-out accuracy for the best parameter settings for each learning method. The middle column lists the measured accuracy values for classifiers trained using our input representation. The right column lists accuracy values for classifiers trained using amino-acid composition as their input representation. For both input representations, we found that pruning did not improve C4.5's generalization on this task, so we report the accuracy of unpruned trees. For neural networks, the best results were obtained using 20 hidden units for our input representation, and no hidden units for the amino-acid representation. For the nearest-neighbor method, the best results were obtained using k=3 for our input representation, and k=1 for the amino-acid representation.
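The evaluation protocol above, leave-one-out cross-validation with a k-nearest-neighbor classifier, can be sketched as follows. The attribute vectors and class labels are hypothetical; the actual experiments used the 212-example data set and the full attribute representation.

```python
import math
from collections import Counter

# Leave-one-out cross-validation with a k-nearest-neighbor classifier:
# Euclidean distance, plurality vote among the k nearest training
# examples, ties broken in favor of the single nearest neighbor.

def knn_predict(train, query, k):
    """train: list of (attribute_vector, label) pairs."""
    by_dist = sorted(train, key=lambda ex: math.dist(ex[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:   # tie between classes
        return by_dist[0][1]                      # favor nearest neighbor
    return top[0][0]

def leave_one_out_accuracy(data, k):
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]           # hold out example i
        correct += (knn_predict(train, x, k) == y)
    return correct / len(data)

# Hypothetical two-attribute examples from two folding classes.
data = [([0.1, 0.2], 'alpha'), ([0.2, 0.1], 'alpha'),
        ([0.8, 0.9], 'beta'),  ([0.9, 0.8], 'beta')]
print(leave_one_out_accuracy(data, k=1))   # -> 1.0
```

Because each example is held out exactly once, this procedure uses nearly all of the data for training while still yielding an unbiased test of every example.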

The accuracy values we list for the neural networks are averages over four cross-validation runs. The standard deviations for these averages are less than 0.1%, and thus we do not list them in the table. We omit standard deviations from the other tables in the paper for the same reason.

We draw two conclusions from this experiment. The first is that it is possible to classify protein primary sequences into Orengo et al.'s folding groups with high accuracy; the neural networks using our input representation were 83.0% accurate on test cases. The second is that our input representation is better than amino-acid composition for this task: for all three learning methods, our representation resulted in superior generalization.

Evaluating the Input Representation

In our second experiment, we seek to understand how much the individual attributes that comprise our input representation contribute to the overall performance of the classifiers. To measure this, we conduct a series of leave-one-out cross-validation runs using nearest-neighbor classifiers (with k=3, the best value of k in the previous experiment) and input representations that contain only subsets of the attributes defined in the Input Representation section of this paper. First, we conduct leave-one-out runs in which the input representations consist only of individual attribute groups. For example, in one run we classify instances using only the charge-composition attributes as our input representation. We also conduct leave-one-out runs in which we use all of the attributes except for a selected group. For example, in one run we classify examples using an input representation that consists of all of the attributes except for the charge-composition attributes.

Table 3 reports the results of this experiment. The left column in the table lists the attribute groups. The middle column reports test-set accuracy for runs in which we use only a single attribute group. The right column reports test-set accuracy for cross-validation runs in which we use all of the attributes except for the indicated group. Note that the last row in the table lists the cross-validation accuracy that we measured for nearest-neighbor classifiers using our original input representation.

Table 3: Evaluating attribute predictiveness. The table lists accuracy results for leave-one-out cross-validation runs with nearest-neighbor classifiers using selected attribute subsets. The middle column reports accuracy for input representations that use only the indicated attributes. The right column reports accuracy for input representations that omit the indicated attributes.

    attributes                      using only   leaving out
    average residue volume             18.9%        78.3%
    charge composition                 21.7         78.8
    polarity composition               31.6         79.2
    α-helix/β-sheet composition        29.7         77.8
    isoelectric point                  15.1         76.9
    FT of hydrophobicity               48.6         59.9
    all attributes                     80.7           -

As the values in the middle column indicate, none of the attribute groups alone is as predictive as the entire attribute set. The values in the right column indicate that every attribute group contributes to the predictiveness of the original input representation: although most of the accuracy values in this column are close to the accuracy of the entire attribute set, none of them equals or exceeds it. From these results we conclude that all of the attributes in our input representation make some contribution to the predictiveness of our classifiers.

Estimating the Role of Homology

Although the results presented in the first experiment indicate that we are able to classify proteins with high accuracy, they do not really address our primary concern; namely, that we want to accurately classify proteins when the classifier's training set does not contain sequences that are homologous to the test sequences. In this section, we present an experiment in which each test set has very limited homology to its corresponding training set.

As in previous experiments, we use cross-validation to estimate the accuracy of our classifiers. Unlike the previous experiments, however, the training and test sets in this experiment vary in size. The training and test sets for this experiment are formed by first partitioning the set of examples for each class into separate subsets, such that homologous proteins fall into a single subset. Recall that we formed our data set in the following manner: first, non-homologous sequences were selected from Orengo et al.'s data set. Sequences with 35% or greater sequence identity were considered homologous. For sequences with 25-35% sequence identity, a significance test was used to decide whether the sequences were considered homologous. We then expanded our data set by using each of these sequences as a query sequence to find close relatives in the SWISS-PROT database. The resulting groups of homologous sequences correspond to the partitions we use in this experiment.

This partitioning of the examples allows us to do cross-validation runs in which we ensure that for every member of the test set there is not a close relative in the classifier's training set. A cross-validation run, in this experiment, involves using each of the subsets as the test set exactly once. The examples for four of the classes (α: EF Hand, α: Up/Down, β: Complex Sandwich, and β: Trefoil) are not used in this experiment, since all members of these classes share high sequence identity.

As in the first experiment, we run each algorithm several times using a range of parameter settings. C4.5 is run with pruning confidence levels ranging from 10% to 90%, and without pruning. We train neural networks with 3, 5, 10, 20, and no hidden units. Nearest-neighbor classifiers use values of k from 1 to 10.

Table 4: Test-set accuracy factoring out the role of homology. The training sets in this experiment do not contain sequences that are homologous to the test sequences.

    learning method     our representation   amino-acid representation
    C4.5                     32.5%                 25.5%
    nearest-neighbor         36.6                  30.6
    neural networks          40.1                  23.1

Table 4 shows the results of this experiment for the best parameter settings for each learning method. As in our first experiment, we found that the best decision-tree generalization was achieved without pruning. As before, we also found that the best neural-network generalization was achieved with networks that had 20 hidden units for our input representation, and no hidden units for the amino-acid representation. For the nearest-neighbor method, the best results were obtained using k=4 for our input representation, and k=3 for the amino-acid representation.

The classification accuracy values reported for this experiment are all significantly worse than their counterparts in the first experiment. For the best classifier, a neural network trained using our input representation, generalization dropped from 83.0% to 40.1%. There are several reasons for this decrease in accuracy. One factor is that, due to our partitioning of the data, the training sets used in this experiment are smaller than those used in the first; some of the training sets had only one or two examples for some classes. The more important factor, however, is that homology plays a large role in the predictive ability of the classifiers described in our first experiment.

Although the results of this experiment are somewhat disappointing, they also lend support to our hypothesis that our input representation is better than one based on amino-acid composition. For all three learning algorithms, the classifiers trained using our input representation generalized significantly better than the classifiers using the amino-acid input representation. This result confirms that our representation embodies more of the commonalities of analogous proteins than does the amino-acid representation.

Table 5 shows per-class sensitivity values for our neural networks' predictions. The sensitivity of a set of predictions for class c is defined as the percentage of members of class c that are correctly identified as belonging to c. The table displays sensitivity values for both the first experiment (leave-one-out) and this one (no homologs).

Table 5: Neural-network sensitivity measurements by class. The middle column shows per-class sensitivity values for the first experiment (leave-one-out cross-validation). The right column shows sensitivity values for the third experiment (no test-set example has a homolog in the training set). Sensitivity is defined as the percentage of examples in a class that are correctly predicted. A dash indicates a class that was not used in the no-homologs experiment.

    class               leave-one-out   no-homologs
    Globin                  96.3%          96.3%
    Orthogonal              71.4            7.1
    EF Hand                 80.0             -
    Up/Down                 85.7             -
    Metal Rich              93.8           75.0
    Orthog. Barrel          60.0            0.0
    Greek Key               79.2           39.6
    Jelly Roll              80.0           10.0
    Complex Sand.           71.4             -
    Trefoil                 85.7             -
    Disulphide Rich         72.7            0.0
    TIM Barrel              80.0           43.3
    Doubly Wound            92.3           53.8
    Mainly Alpha            88.9           16.7
    Sandwich                65.0           15.0
    Beta Open Sheet         92.9            3.6

This table indicates that the accuracy of our predictions is not uniform across the classes, especially in the no-homologs experiment. As previously mentioned, poor performance for some classes is at least partly explained by sparse training sets. The poor predictability of some classes may also be due to the classes themselves being artificial constructs. It is important to keep in mind that the class structure itself was devised through a combination of automated clustering and human interpretation. Many of these classes encompass diverse groups of proteins, and in some cases the class boundaries are rather arbitrary.

Improving Accuracy by Rejecting Examples

Although the classification accuracy values reported in the previous experiment are rather low, we have found that the accuracy of our classifiers can be improved by taking into account the confidence of their predictions. In this section, we describe an experiment in which we employ a strategy that is commonly used in the domain of handwritten character recognition: classifiers "reject" (i.e., do not classify) examples for which they cannot confidently predict a class (Le Cun et al. 1989).
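This rejection strategy can be sketched as follows. The confidence values below are hypothetical stand-ins for the classifier confidence measures used in the experiments, and the helper name is our own.

```python
# Sketch of classification with rejection: the classifier abstains on
# examples whose confidence falls below a threshold, and accuracy is
# measured only on the examples it keeps.  The (confidence, correct?)
# pairs are hypothetical, not values from the actual experiments.

def accuracy_with_rejection(predictions, threshold):
    """predictions: list of (confidence, was_prediction_correct) pairs."""
    kept = [ok for conf, ok in predictions if conf >= threshold]
    rejected = 1.0 - len(kept) / len(predictions)
    accuracy = sum(kept) / len(kept) if kept else None
    return rejected, accuracy

preds = [(0.95, True), (0.90, True), (0.60, False),
         (0.55, True), (0.40, False), (0.30, False)]

# Sweeping the threshold traces out a rejection curve: accuracy on the
# retained examples rises as more low-confidence examples are rejected.
for t in (0.0, 0.5, 0.8):
    print(t, accuracy_with_rejection(preds, t))
```

When confidence correlates with correctness, as in this toy data, raising the threshold trades coverage for accuracy, which is the trade-off the rejection curves in the next experiment quantify.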

Craven 103

works and nearest-neighbor classifiers using the same training and test sets as in the previous experiment (i.e., no test-set example has a homolog in the training set). We use single-nearest-neighbor classifiers and neural networks with 20 hidden units. For every test-set example, a confidence value is output along with the predicted class.

There are various algorithm-specific heuristics that can be used to estimate a classifier's confidence in a given prediction. For nearest-neighbor algorithms, we measure a prediction's confidence as how large we can make k while ensuring that the k nearest neighbors are unanimous in the class they predict. For neural networks, we measure confidence as the fraction a_i / Σ_n a_n, where a_i is the activation of the unit corresponding to the predicted class, and n ranges over all of the output units (the normalization is actually superfluous when the softmax activation function is used).

By establishing a threshold on the confidence values, we are able to have classifiers reject examples for which they are uncertain. Figure 1 displays generalization as a function of the percentage of examples rejected, for both neural networks and nearest-neighbor classifiers. The left edge of the graph represents the case where no examples are rejected. The right edge of the graph represents the case where 80% of the examples are rejected. For both methods, the curves climb steadily, meaning that the classifiers improve their accuracy as they reject more examples.

[Figure 1: Rejection curves for neural networks and nearest-neighbor classifiers. The x-axis indicates the fraction of examples rejected; the y-axis indicates the corresponding accuracy.]

The results of this experiment are interesting because they suggest that even if we cannot develop a classifier that is highly accurate for a wide range of proteins, we can perhaps develop a classifier that is highly accurate for certain classes of proteins, and is able to determine when test cases fall into these classes.

Future Work and Conclusions

There are several open issues that we plan to explore in future research. These include: investigating alternative input representations, developing an alternative class structure, predicting folding classes for multi-domain proteins, and using distributed output representations. We discuss each of these in turn.

Although the experiments presented earlier indicate that our input representation is superior to the commonly used amino-acid representation, we believe that our representation can be improved by incorporating additional attributes. For example, we plan to investigate attributes based on the Fourier transform of signals formed by residue volumes and charges.

Defining a set of protein-folding classes is a difficult problem in itself. To date we have used the folding groups identified by Orengo et al. as our classes. Some of these classes are rather arbitrarily defined, however, and hence difficult to predict. We plan to re-evaluate our current class structure to determine if some of the classes should be redefined, aggregated, or discarded.

Another important and difficult issue to be addressed in our future research is how to predict the folding class of multi-domain proteins which do not fall completely into one of our existing classes. An accurate prediction for such a protein might involve labeling different domains of the protein with different classes. Since our approach enables subsequences of proteins to be represented and classified, the key problem to be solved in this task is how to parse protein sequences into subsequences. One possible approach is to generate secondary-structure predictions for a given protein, and then to use the predicted α-helix/β-sheet boundaries to suggest alternative parses.

Finally, we plan to investigate the utility of using a distributed output representation during learning. In a distributed representation, each of the problem classes is represented using a bit-string in which more than one bit is "on". A carefully engineered encoding scheme can result in significantly better generalization (Dietterich & Bakiri 1995).

We have presented a novel machine-learning approach to classifying proteins into folding groups. Our method uses attributes that can be easily computed from the primary sequence of a given protein. We have presented experiments that show that our approach is able to classify proteins with relatively high accuracy. We have also demonstrated that our input representation is superior to a representation based on amino-acid composition, especially when classifying proteins which have no homologs in the training set. The goal of our research program is to develop computational methods that are able to accurately classify proteins that have no well-understood homologs. We believe that the research presented here represents a promising start towards this goal.
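The confidence heuristics and rejection-curve procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation; all function names are our own. It assumes nearest-neighbor labels are supplied nearest-first, and that rejection proceeds from the least-confident examples upward.

```python
def softmax_confidence(activations):
    """Network confidence: the activation of the predicted (highest)
    output unit divided by the sum over all output units."""
    return max(activations) / sum(activations)

def knn_confidence(neighbor_labels):
    """Nearest-neighbor confidence: the largest k such that the k
    nearest neighbors (listed nearest-first) are unanimous."""
    k = 0
    for label in neighbor_labels:
        if label != neighbor_labels[0]:
            break
        k += 1
    return k

def rejection_curve(confidences, correct, fractions):
    """For each rejection fraction, the accuracy on the examples kept
    after discarding the least-confident portion of the test set."""
    # Rank correctness flags by decreasing confidence.
    ranked = [c for _, c in sorted(zip(confidences, correct), reverse=True)]
    curve = []
    for f in fractions:
        keep = max(1, round(len(ranked) * (1.0 - f)))
        curve.append(sum(ranked[:keep]) / keep)
    return curve
```

For example, with confidences [0.9, 0.8, 0.6, 0.4] and correctness flags [1, 1, 1, 0], rejecting the least-confident 25% of examples raises accuracy from 0.75 to 1.0, which is the climbing behavior the rejection curves display.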

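The distributed output representation mentioned in the future-work discussion can be illustrated with a small decoding sketch in the style of error-correcting output codes (Dietterich & Bakiri 1995). The class names and 7-bit codewords below are hypothetical, chosen only for illustration; a real code would be engineered to maximize the Hamming distance between codewords.

```python
# Hypothetical codewords for four classical folding classes; each class
# is a bit-string with more than one bit "on" (a distributed encoding).
CODEWORDS = {
    "all-alpha":  (1, 1, 1, 1, 1, 1, 1),
    "all-beta":   (0, 0, 0, 0, 1, 1, 1),
    "alpha/beta": (0, 0, 1, 1, 0, 0, 1),
    "alpha+beta": (0, 1, 0, 1, 0, 1, 0),
}

def decode(outputs):
    """Threshold the network's per-bit outputs at 0.5, then return the
    class whose codeword is closest in Hamming distance."""
    bits = [1 if o >= 0.5 else 0 for o in outputs]
    def hamming(code):
        return sum(b != c for b, c in zip(bits, code))
    return min(CODEWORDS, key=lambda name: hamming(CODEWORDS[name]))
```

Because decoding maps any output vector to the nearest codeword, a well-separated code can recover the correct class even when several individual output bits are wrong.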
Acknowledgments. The authors would like to thank Carolyn Allex and the anonymous reviewers for their insightful comments on this paper.

References

Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; and Lipman, D. J. 1990. Basic local alignment search tool. Journal of Molecular Biology 215(3):403-410.

Bairoch, A., and Boeckmann, B. 1992. The Swiss-Prot protein sequence data bank. Nucleic Acids Research 20:2019-2022.

Bernstein, F.; Koetzle, T.; Williams, G.; Meyer, E.; Brice, M.; Rodgers, J.; Kennard, O.; Shimanouchi, T.; and Tasumi, M. 1977. The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology 112:535-542.

Bridle, J. 1989. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Fogelman-Soulié, F., and Hérault, J., eds., Neurocomputing: Algorithms, Architectures, and Applications. New York, NY: Springer-Verlag.

Cover, T. M., and Hart, P. E. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1):21-27.

Devereux, J.; Haeberli, P.; and Smithies, O. 1984. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Research 12:387-395.

Dickerson, R. E., and Geis, I. 1969. The Structure and Action of Proteins. Menlo Park, CA: W. A. Benjamin.

Dietterich, T. G., and Bakiri, G. 1995. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2:263-286.

Dubchak, I.; Holbrook, S. R.; and Kim, S.-H. 1993. Prediction of protein folding class from amino acid composition. Proteins: Structure, Function, and Genetics 16:79-91.

Eisenberg, D.; Weiss, R. M.; and Terwilliger, T. C. 1984. The hydrophobic moment detects periodicity in protein hydrophobicity. Proceedings of the National Academy of Sciences, USA 81:140-144.

Ferran, E. A.; Ferrara, P.; and Pflugfelder, B. 1993. Protein classification using neural networks. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, 127-135. Bethesda, MD: AAAI Press.

Fields, C.; Adams, M. D.; White, O.; and Venter, J. C. 1994. How many genes in the human genome? Nature Genetics 7(3):345-346.

Hinton, G. 1989. Connectionist learning procedures. Artificial Intelligence 40:185-234.

Klein, P., and Delisi, C. 1986. Prediction of protein structural class from the amino acid sequence. Biopolymers 25:1659-1672.

Kolinski, A., and Skolnick, J. 1992. Discretized model of proteins. I. Monte Carlo study of cooperativity in homopolypeptides. Journal of Chemical Physics 97(12):9412-9426.

Kramer, A. H., and Sangiovanni-Vincentelli, A. 1989. Efficient parallel learning algorithms for neural networks. In Touretzky, D., ed., Advances in Neural Information Processing Systems, volume 1. San Mateo, CA: Morgan Kaufmann. 40-48.

Le Cun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Handwritten digit recognition with a back-propagation network. In Touretzky, D., ed., Advances in Neural Information Processing Systems, volume 2. San Mateo, CA: Morgan Kaufmann.

Lehninger, A. L.; Nelson, D. L.; and Cox, M. M. 1993. Principles of Biochemistry. New York, NY: Worth Publishers.

Levitt, M., and Chothia, C. 1976. Structural patterns in globular proteins. Nature (London) 261:552-557.

Metfessel, B. A.; Saurugger, P. N.; Connelly, D. P.; and Rich, S. S. 1993. Cross-validation of protein structural class prediction using statistical clustering and neural networks. Protein Science 2:1171-1182.

Nakashima, H., and Nishikawa, K. 1994. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. Journal of Molecular Biology 238:54-61.

Nakashima, H.; Nishikawa, K.; and Ooi, T. 1986. The folding type of a protein is relevant to its amino acid composition. Journal of Biochemistry (Tokyo) 99:153-162.

Needleman, S. B., and Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48:443-453.

Orengo, C. A.; Flores, T. P.; Taylor, W. R.; and Thornton, J. M. 1993. Identification and classification of protein fold families. Protein Engineering 6(5):485-500.

Pearson, W. R., and Lipman, D. J. 1988. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, USA 85:2444-2448.

Qian, N., and Sejnowski, T. 1988. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology 202:865-884.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Reczko, M., and Bohr, H. 1994. The DEF data base of sequence based protein fold class predictions. Nucleic Acids Research 22(17):3616-3619.

Rost, B., and Sander, C. 1993. Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology 232:584-599.

Rumelhart, D.; Hinton, G.; and Williams, R. 1986. Learning internal representations by error propagation. In Rumelhart, D., and McClelland, J., eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge, MA: MIT Press. 318-363.

Sander, C., and Schneider, R. 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Structure, Function and Genetics 9:56-68.

Wu, C.; Berry, M.; Fung, Y.-S.; and McLarty, J. 1993. Neural networks for molecular sequence classification. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, 429-437. Bethesda, MD: AAAI Press.
