From: ISMB-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Predicting Protein Folding Classes without Overly Relying on Homology*

Mark W. Craven
Computer Sciences Department
University of Wisconsin-Madison
1210 W. Dayton St.
Madison WI 53706
[email protected]

Richard J. Mural
Biology Division
Oak Ridge National Laboratory
P.O. Box 2008, Bldg. 9211
Oak Ridge TN 37831-8077
m91~ornl.gov

Loren J. Hauser
Computer Science & Mathematics Division
Oak Ridge National Laboratory
P.O. Box 2008, Bldg. 9211
Oak Ridge TN 37831-8077
[email protected]

Edward C. Uberbacher
Computer Science & Mathematics Division
Oak Ridge National Laboratory
P.O. Box 2008, Bldg. 6010
Oak Ridge TN 37831-6364
[email protected]

*This research was supported by the Office of Health and Environmental Research, U.S. Department of Energy under Contract No. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.

Abstract

An important open problem in molecular biology is how to use computational methods to understand the structure and function of proteins given only their primary sequences. We describe and evaluate an original machine-learning approach to classifying protein sequences according to their structural folding class. Our work is novel in several respects: we use a set of protein classes that previously have not been used for classifying primary sequences, and we use a unique set of attributes to represent protein sequences to the learners. We evaluate our approach by measuring its ability to correctly classify proteins that were not in its training set. We compare our input representation to a commonly used input representation - amino acid composition - and show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.

Introduction

A problem of fundamental importance in molecular biology is understanding the structure and function of the proteins found throughout nature. Currently, the growth of protein sequence databases is greatly outpacing the ability of biologists to characterize the proteins in these databases. Efficient computational methods for predicting protein structure and function are highly desirable because conventional laboratory methods, X-ray crystallography and NMR, are expensive and time-consuming. Currently, the best method for predicting the structure and function of a protein is to identify a homologous protein that has already been characterized. However, from current genome-sequencing efforts it appears that as many as half of the newly discovered proteins do not have corresponding, well-understood homologs (Fields et al. 1994). The goal of our research program, therefore, is to develop protein-classification methods that are not overly reliant on sequence homology, but instead represent the essential properties of analogous proteins that have similar folds. In this paper, we describe and evaluate a novel, machine-learning approach for classifying protein sequences according to their structural fold family. Our experiments indicate that our approach provides a promising alternative to homology-based methods for protein classification.

There is a wide variety of existing approaches for predicting structural or functional aspects of proteins given their primary (i.e., amino-acid) sequences. These approaches include tertiary structure prediction (e.g., Kolinski & Skolnick 1992), secondary structure prediction (e.g., Rost & Sander 1993), sequence homology searching (e.g., Pearson & Lipman 1988; Altschul et al. 1990), and classification according to folding class (Dubchak, Holbrook, & Kim 1993; Ferran, Ferrara, & Pflugfelder 1993; Metfessel et al. 1993; Wu et al. 1993; Nakashima & Nishikawa 1994; Reczko & Bohr 1994). Our approach falls into this latter category - protein classification - which itself encompasses a wide variety of methods. Existing protein classification methods vary on a number of dimensions including: the intended purposes of the systems; the level of abstraction of the protein classes; and whether or not the systems are able to discover their own classes. The approach we describe is novel in at least two respects: we use a set of protein classes that previously have not been used for classifying primary sequences, and we use a physical set of attributes to represent protein sequences.

Our approach uses machine-learning methods to induce descriptions of sixteen protein-folding classes. The folding classes that we use were devised by Orengo et al. (1993) in a large-scale computational effort to cluster proteins according to their structural similarity. These classes comprise analogous, as well as homologous, proteins. Whereas Orengo et al. used structural information to classify proteins, we are interested in classifying proteins when only their primary sequences are available. Thus, the role of the machine-learning algorithms in our approach is to induce mappings from primary sequences to folding classes. A key aspect of our method is the way in which we represent primary sequences to the learner. Unlike most protein-classification approaches, which represent proteins by their amino-acid composition, our method represents proteins using attributes that better capture the commonalities of analogous proteins. We empirically compare learning systems that use our input representation to learning systems that use amino-acid composition as their input representation. We show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.

Problem Representation

The task that we address is defined as follows: given the amino-acid sequence of a protein, assign the protein to one of a number of folding classes. This problem definition indicates that there are two fundamental issues in implementing a classification method for the task: determining the attributes that are to be used to represent protein sequences, and defining the classes that are to be predicted. The remainder of this section discusses how we address these two issues.

Class Representation

The classes that we use in our approach are the fold groups defined by Orengo et al. (1993) in their effort to identify protein-fold families. These classes represent proteins that have highly conserved structures, but often low sequence similarity. Thus, the classes represent analogous, as well as homologous, proteins. Table 1 lists the fold groups that we use as our classes, as well as the number of examples in each class that we use in our experiments, and whether each class falls into the α (primarily alpha), β (primarily beta), α/β (alternating α and β), or α+β (non-alternating α and β) family (Levitt & Chothia 1976).

Table 1: Protein class representation. The middle column lists the classes we use in our classification method. The left column indicates class families, and the right column lists, for each class, the number of examples we use in our experiments.

family   fold group          examples
α        Globin                    27
α        Orthogonal                14
α        EF Hand                    5
α        Up/Down                    7
α        Metal Rich                16
β        Orthogonal Barrel          5
β        Greek Key                 24
β        Jelly Roll                 5
β        Complex Sandwich           7
β        Trefoil                    7
β        Disulphide Rich           11
α/β      TIM Barrel                15
α/β      Doubly Wound              26
α+β      Mainly Alpha               9
α+β      Sandwich                  20
α+β      Beta Open Sheet           14
         total examples           212

The method that Orengo et al. used to define their fold groups involved four primary steps.

1. A set of proteins with known folds was assembled from the Brookhaven Protein Data Bank (Bernstein et al. 1977).

2. The proteins in this set were clustered according to sequence similarity. Using the Needleman-Wunsch algorithm (Needleman & Wunsch 1970), they performed pairwise comparisons on 1410 protein sequences selected in the previous step. They then used single-linkage cluster analysis to form clusters of related sequences. In single-linkage cluster analysis, two proteins, a and b, are assigned to the same cluster if there exists a chain of proteins linking a and b, such that each adjacent pair in the chain satisfies a defined measure of relatedness. Two proteins were deemed related, in this case, if their sequence identity was > 35%. For small proteins, this threshold was adjusted using the equation of Sander and Schneider (Sander & Schneider 1991). Also, for proteins with 25-35% sequence identity, a significance test was used to determine if the proteins were to be considered related.

3. A representative protein was selected from each of the clusters formed in the previous step, and the resulting set of proteins was clustered according to structural similarity. Pairwise comparisons of proteins in this set were done using a variant of the Needleman-Wunsch algorithm that compared structural environments rather than primary sequences.

4. Multidimensional scaling was applied to the resulting structural homology matrix to form clusters of proteins with similar folds. The final fold groups were defined from this clustering by human interpretation, with the aid of schematic representations and topology diagrams.

In summary, Orengo et al. organized proteins with known structures into classes representing similar folds, but not necessarily similar primary sequences.

coil for each residue. The number of α and β predic-

Whereas, Orengo et al. developed their classes