Comparison of Two Variations of Neural Network Approaches to The

From: ISMB-93 Proceedings. Copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Comparison of Two Variations of Neural Network Approach to the Prediction of Protein Folding Pattern

Inna Dubchak*, Stephen R. Holbrook# and Sung-Hou Kim*#

#Structural Biology Division, Lawrence Berkeley Laboratory
*Department of Chemistry, University of California at Berkeley
Berkeley, CA 94720
ILDUBCHAK@LBL.GOV

Abstract

We have designed, trained and tested two types of neural networks for the prediction of protein folding pattern from sequence. Here we describe the differences in the networks and compare their performance on a variety of proteins. Both network representations are generally successful in predicting protein fold and can also be used together to confirm a prediction.

Introduction

The prediction of protein structure from its amino acid sequence is a central, unsolved problem in molecular biology. Despite efforts by a large number of laboratories over many years, we are still far from a solution to this problem. The size and complexity of proteins make a strictly theoretical approach intractable. A more hopeful direction is to use the rapidly growing number of experimentally determined protein structures and sequences to empirically extract the principles which govern protein folding and then apply these rules to the prediction of new structures. This can be done at many levels, from the prediction of the secondary structure of an individual amino acid (Qian & Sejnowski 1988, Holley & Karplus 1989) to the surface exposure of amino acids (Holbrook, Muskal, & Kim 1990), and even tertiary interactions (Bohr et al. 1990). Unfortunately, as the complexity of the prediction and the number of parameters increase, the data to parameter ratio decreases. Thus, we have chosen a top-down path in which we first attempt to predict the overall fold of the protein using very few parameters. The detailed structure may then be completed using other prediction algorithms and energetic calculations. Here we study two variations of a neural network approach to prediction of the overall protein fold.

It is becoming clear that there are a limited number of protein folding patterns to serve as the core or scaffold around which variations are added to perform specific protein functions. In fact, there are now many examples of functionally similar and dissimilar proteins with the same fold (Finkelstein & Ptitsyn 1987). Most proteins pack their secondary structures into one of a limited number of basic geometries and chain topologies. The definitions "folding pattern" (Levitt & Chothia 1976) or "topologies" (Richardson 1977) ignore the variations of length and orientation of α- and β-segments and describe only their mutual positions. Some of the macrofolding patterns are: the globin fold, the immunoglobulin fold, the nucleotide binding (Rossmann) fold, 4α-helical bundles, parallel α/β barrels, and antiparallel β barrels (jelly roll fold). On a smaller scale, the Greek key motif, EF-hand, helix-loop-helix, kringle motifs and others form fragments of protein structure.

We extended the approach which Muskal and Kim (Muskal & Kim 1992) originally applied to prediction of secondary structure composition, to the discrimination of protein folding patterns (Dubchak, Holbrook, & Kim 1993). In this method (Representation I) there are 21 real-valued input nodes representing the number of amino acids and amino acid percent composition of the protein as a whole, or that of the domain of interest.

In the second version of folding class prediction (Representation II) we made attempts to reduce the number of inputs and accordingly variables by using reasonable physical subdivision of all amino acids into a few classes. In relating sequence and structure the basic classification of residues is usually in terms of their hydrophilic and hydrophobic character. Both groups of authors (Qian & Sejnowski 1988, Holley & Karplus 1989) used physicochemical properties of the amino acid residues such as hydrophobicity, charge, side chain bulk, and backbone flexibility as alternative ways of representing the inputs to the network for prediction of secondary structure. Qian and Sejnowski also provided the network with global information such as an average hydrophobicity of the protein and the position of the residue in the sequence. None of these attempts improved the prediction reliability over the basic scheme. However, Kneller and co-authors (Kneller, Cohen, & Langridge 1990) used the scheme with additional helix hydrophobic moment and strand hydrophobic moment input units, which improved the testing statistics slightly in comparison with (Qian & Sejnowski 1988).

Basically there are two opposite viewpoints (with a variety of opinions in between) regarding the distribution of hydrophobic and hydrophilic sites along the sequence of a protein. Chothia and Finkelstein (Chothia & Finkelstein 1990) proved that this distribution is unique in the case of the globin sequence. The formation of the native secondary structure of any protein brings together sites of the same character to form hydrophobic and hydrophilic surfaces. Protein interiors are occupied mainly by non-polar residues and occasionally by neutral residues. The polar atoms in neutral residues usually form hydrogen bonds within their own secondary structures, so the surfaces between secondary structures are almost entirely hydrophobic in character. The converse is also true: the sites in a protein that are highly exposed to the solvent are nearly always occupied by polar or neutral residues. The same is supposedly true for other folds. Since the architecture of a definite fold is a specific feature, it is reasonable to use this kind of classification for neural net input.

Local interactions inside structural segments can often determine the protein secondary structures despite the presence of much stronger long range interactions between different segments. Some of the hydrophobicity scales (Ponnuswamy 1993) take into account protein structural class information, namely the environment of each amino acid residue when estimating its hydrophobicity. The other scale (optimal matching hydrophobicity) was derived on the assumption that families of proteins that fold in the same way do so because they have the same pattern of residue hydrophobicities along their amino acid sequences (Eisenberg 1984).

At the same time, White and Jacobs (White & Jacobs 1990) studied the statistical distribution of hydrophobic residues along the length of protein chains using a binary hydrophobicity scale which assigns hydrophobic residues a value of one and non-hydrophobes a value of zero. The resulting binary sequences are tested for randomness using the standard run test. For the majority of the 5,247 proteins examined, the distribution of hydrophobic residues along a sequence cannot be distinguished from that expected for a random distribution. The authors suggest that (a) functional proteins may have originated from random sequences, (b) the folding of proteins into compact structures may be much more permissive with less sequence specificity than previously thought, and (c) the [...]

In our calculations we used a less simplistic classification than White and Jacobs, and made an attempt not only to group the residues according to their relative hydrophobicity, but to take into consideration their immediate environment.

Several scales for classification of hydrophobic-hydrophilic character of residues have been published. Most of these scales differ only in details and so show high correlation. We used the hydrophobicity groups from (Chothia & Finkelstein 1990), which are very similar to those of the consensus scale (Eisenberg 1982), which was designed to mitigate the effect of outlying values in any one scale produced by the peculiarities of the method.

Methods

Database: A critical problem in classification of protein folding types by neural networks is the limited number of protein examples of known three-dimensional structure. We have partly overcome this problem by using the much larger protein sequence database. Proteins of the same family from different organisms, showing high sequence homology with proteins of known structure, were assumed to have the same overall folding pattern and were used to greatly increase the number of input examples for use in network training and testing. Thus, we selected the proteins for which the crystal structure is known to have the fold of interest and then retrieved homologous proteins from the SWISS-PROT sequence database. We chose to begin our studies on four diverse folding patterns for which there exists a relatively large number of known structures and sequence analogs. Other folding motifs can easily be added to the prediction scheme as sufficient examples of known structure are characterized.

Still, in some cases the number of variable parameters did not exceed the number of independent observations which led to a lack of generalization. This problem, in turn, led to our experiments with an alternate input representation specified by fewer parameters.

Protein structure information was from the Brookhaven Protein Data Bank (Bernstein et al. 1977) (PDB) and publications describing proteins of known crystal structure but not yet deposited in the Brookhaven PDB. Sequence information was from the Swiss-Prot database (SP) Release 20 (Bairoch & Boeckmann 1991). Therefore, databases were compiled for each of the following folding patterns: 1) 4α-helical bundles (BUNDLE), 2) eight-stranded parallel α/β barrels (BARREL), 3) nucleotide binding or Rossmann (NBF) fold, [...]
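
The two sequence encodings discussed in the Introduction can be illustrated with a short Python sketch. It covers the Representation I input (the number of amino acids plus the percent composition of each of the 20 amino acids, 21 values in all) and a run test on a binary hydrophobicity sequence of the kind attributed to White and Jacobs. This is only an illustration under stated assumptions, not the authors' implementation: the residue ordering, the use of the raw sequence length as an input, the hydrophobic residue set, and the function names are all hypothetical choices made here, and the run test uses the standard normal approximation.

```python
import math

# The 20 standard amino acids; this ordering is an assumption of the sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical binary grouping, NOT the scale used by White and Jacobs.
HYDROPHOBIC = frozenset("AVLIMFWC")


def representation_one(sequence):
    """Return 21 values: sequence length, then percent composition
    of each of the 20 standard amino acids."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    # Non-standard letters count toward the length but get no input of their own.
    percents = [100.0 * seq.count(aa) / n for aa in AMINO_ACIDS]
    return [float(n)] + percents


def runs_test_z(sequence, hydrophobic=HYDROPHOBIC):
    """Z statistic of the two-category run test applied to the binary
    sequence (1 = hydrophobic, 0 = other residue)."""
    bits = [1 if aa in hydrophobic else 0 for aa in sequence.upper()]
    n1 = sum(bits)
    n2 = len(bits) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("sequence must contain both residue classes")
    n = n1 + n2
    # A new run starts at every change of symbol.
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    if var <= 0.0:
        raise ValueError("sequence too short for the normal approximation")
    return (runs - mean) / math.sqrt(var)
```

For a real sequence, representation_one gives the 21-element input vector, and a value of runs_test_z near zero is what a random arrangement of hydrophobic residues would produce. Strongly positive values indicate more alternation than expected, and strongly negative values indicate clustering.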
