
PROTEINS: Structure, Function, and Genetics 25:28-37 (1996)

Constructing Residue Substitution Classes Maximally Indicative of Local Structure

Michael J. Thompson¹ and Richard A. Goldstein¹,²
¹Biophysics Research Division, ²Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109-1055

ABSTRACT Using an information theoretic formalism, we optimize classes of amino acid substitution to be maximally indicative of local protein structure. Our statistically-derived classes are loosely identifiable with the heuristic constructions found in previously published work. However, while these other methods provide a more rigid idealization of physicochemically constrained residue substitution, our classes provide substantially more structural information with many fewer parameters. Moreover, these substitution classes are consistent with the paradigmatic view of the sequence-to-structure relationship in globular proteins which holds that the three-dimensional architecture is predominantly determined by the arrangement of hydrophobic and polar side chains with weak constraints on the actual amino acid identities. More specific constraints are imposed on the placement of prolines, glycines, and the charged residues. These substitution classes have been used in highly accurate predictions of residue solvent accessibility. They could also be used in the identification of homologous proteins, the construction and refinement of multiple sequence alignments, and as a means of condensing and codifying the information in multiple sequence alignments for secondary structure prediction and tertiary fold recognition. © 1996 Wiley-Liss, Inc.

Key words: protein evolution, structure prediction, information theory, amino acid substitution, multiple sequence alignment

Received July 14, 1995; revision accepted November 20, 1995. Address reprint requests to Richard A. Goldstein, Biophysics Research Division, Department of Chemistry, University of Michigan, Ann Arbor, MI 48109-1055.

INTRODUCTION

Although it is highly desirable to know the detailed three dimensional structure of a protein under study, such structures are few in number and laboriously determined. In contrast, amino acid sequences are often readily obtainable. When the sequences of homologous proteins are available as well, the sets of residue substitutions resulting from evolutionary differentiation can provide a rich source of information about the protein's function and structure. Structural information is contained in these families as a result of the degenerate relationship between sequence and structure. During evolution, a three-dimensional fold can persist despite significant divergence of the amino acid sequence. In order to preserve the structural or functional integrity of the protein, important sites in the sequence must either conserve some specific characteristics requiring conservation of amino acid identity, or preserve some more general property such as polarity, charge, or size. Variable sites with little or no structural or functional constraints may evolve by random fixation of neutral or nearly neutral mutations. As a result, alignments of homologous sequences provide more information about the architectural necessities of their commonly-adopted fold than do lone sequences, implicitly carrying information about the non-local interactions frequently invoked to explain the apparent 65% accuracy ceiling for secondary structure prediction.

Various methods have been developed for extracting information from multiple sequence alignments. The most straightforward approach has been to use the multiple members of a protein family in order to provide "signal averaging" of structure predictions over multiple homologous sequences. This kind of simple averaging, however, does not capture the information implicit in structurally characteristic patterns of side chain variability. Another approach has been the construction of consensus or signature sequence segments characteristic of particular structures, which are often used in database searches. By concentrating on the conserved residues at the expense of alignment positions which allow side chain variability, these methods lose much of the information contained in the multiple alignments, as discussed below. In addition, these particular methods create templates specific for each individual protein family, and do not give insight into the more universal patterns found in proteins.

Both the profile method of tertiary structure recognition developed by Eisenberg and coworkers

and the secondary structure and surface accessibility prediction work of Wako and Blundell employed environment-dependent mutation matrices and tables of structural propensities. A recent hybrid method used similar tables [25]. Measures of side chain conservation have been used to weight secondary structure predictions and to train neural networks, in combination with substitution profiles, to predict secondary structure and solvent accessibility [26,27]. While some of these more quantitative methods tend towards general applicability, they do not lend themselves to the abstraction of general principles relating protein structure and sequence. In contrast, Benner and Gerloff have developed methods for analyzing patterns of sequence variation using heuristic rules which vary as subsets of the alignments are manipulated. While this work has claimed certain successes, it lacks quantitative rigor and reproducibility. Benner et al. have also pioneered the consideration of correlated mutations as an approach to structure prediction. Such efforts are still at a preliminary stage. In spite of this work and the more common uses of multiple sequence alignments in the biochemical or crystallographic characterization of proteins, a rigorous and generally applicable classification of these structurally-constrained sets of residue substitutions is lacking.

In this article, we use information theory to construct sets of residue substitution classes which provide maximal information about local protein structure, classified into combinatorial categories of secondary structure and solvent accessibility. As this approach contains no preconceived notions concerning how the amino acid types should be clustered, beyond those proclivities present in the original mutation matrices used to construct the database of multiple sequence alignments, we have introduced no bias toward the development of any particular architecture of class membership and class linkages. To quantify the correspondence between amino acid sequence and local protein structure, we employ the concept of "mutual information." Briefly, mutual information is the amount of knowledge obtainable about one random variable by knowledge of another; in this case, how much knowledge of the local structure can be obtained by knowing the substitution class of the alignment position. This measure enables us to search the space of possible substitution classes over a database of representative protein structures and their homologs and find the set which possesses the strongest correlations with local structural environments.

The resulting substitution classes display consistency with a few well-known trends for patterns of amino acid mutations to correlate with the physicochemical properties of those amino acids. In the context of our substitution classes, however, these relationships are not nearly so highly idealized as in the hierarchical "Amino Acid Class Covering patterns" of Smith and Smith [37,38] or the Venn diagrams of Taylor [39]. This is due to the fact that our methodology contains no ad hoc elements, and allows for the unprejudiced identification of structurally informative classes of residue substitution. Our substitution classes are also conceptually similar to the clusters of "interior-indicating" and "surface-indicating" side chains used by Benner and coworkers in their prediction work. In contrast to their heuristic groupings, our classes are derived by a rigorous statistical methodology. A similarly rigorous approach to generalizing the patterns of residue substitutions found in multiple sequence alignments can be found in the work of Haussler and coworkers [40]. Their work, however, is aimed at developing improved methods for multiple sequence alignment and does not investigate the relationship between allowed substitutions and local protein structure.

In this article, we report on the set of 28 optimal substitution classes which provides close to the maximum resolution of sequence to structure relationships without significant dataset-dependence. These classes provide much more structural information than is extractable from single sequences. They also convey more structural information with fewer parameters than the covering classes developed by Smith and Smith or the Venn diagrams constructed by Taylor. These classes can be used to address questions concerning what, if any, structure-determining properties are being conserved, and generate qualitative insight into the relationship between sequence and structure. In addition to their use in structure prediction, potential applications include sequence alignment, detection of homologous proteins, and the further generation and refinement of biochemical insight.

METHODS

Data Encoding

A significant and often neglected issue is the non-uniformity of the databases of known protein sequences; a large probability of a given residue at any location may only indicate that that particular residue is characteristic of a protein subfamily over-represented in the database. Such biases can be rather extreme: mammals make up 69% of all myoglobin sequences found in the SWISS-PROT Database (release 28) [42]. In order to correct for these biases, it is necessary to either cull or weight the sequences in order to get a more even distribution, a process that involves substantial approximations and assumptions.

We approach the non-uniformity of the database in a manner similar to that used in the "Amino Acid Class Covering patterns" developed by Smith and Smith [37,38] and in the "minimal sets" suggested by Taylor [39]. These methods consider only compatibility of various amino-acid residues with a given location,

without regard to the number of sequences that contain any particular amino acid residue. In our work, a particular position in a multiple sequence alignment is characterized by the first substitution class that is sufficiently general to account for all of the side chain types observed at that site. As shown in the simplified example of Figure 1, an alignment position with a conserved valine would be assigned to the substitution class containing only valine, while the variable positions that contained any combination of valine, leucine, or isoleucine would be assigned to the second class, and so on. As positions are assigned to the first appropriate class, the order of classes is significant. For the actual set of classes constructed by our method, the first 20 substitution classes correspond to the conserved examples of the 20 amino acid types. The remaining classes increase in generality until the last substitution class includes all possible amino acid types, so that each position in the dataset corresponds to one of the substitution classes. There could conceivably be as many as 2^21 - 2 substitution classes corresponding to all of the possible combinations of amino acid types, and the possibility of a gap. The residues observed at each position in the multiple alignments are encoded as a 21-bit binary vector, where each of the bits corresponds to the presence or absence of an amino acid type or gap at that position. Membership in a substitution class can be assigned rapidly using logical bit operations.

    ALIGNMENT   ...LLVMC---...
                ...LIVMCSTL...
                ...IIVMCSTV...
                ...LIVLCSSV...
                ...LLVICTVL...
    CLASSES     [class labels for each position not recoverable]

Fig. 1. Illustrative example of assigning the residue positions of a multiple sequence alignment into substitution classes.

Theory

We perform a search for the optimal set of classes, defined as the set that maximizes the information about local structure furnished by class membership. We quantify this approach using concepts borrowed from information theory. (For a review, see reference 43; pioneering examples of the use of information theory in prediction efforts have also been reported.) Consider the random variable X, which can take on the possible values {x_1, x_2, ..., x_n}. The "Shannon entropy" of the probability distribution of X, H(X), represents the uncertainty of this random variable:

$$H(X) = -\sum_{i} p(x_i) \ln p(x_i) \qquad (1)$$

where p(x_i) is the probability of X having the value x_i. The joint entropy of two random variables X and Y, H(X,Y), is calculated from their joint probability distribution:

$$H(X,Y) = -\sum_{i,j} p(x_i, y_j) \ln p(x_i, y_j) \qquad (2)$$

The mutual information of these two random variables is defined as the difference between the sum of the entropies of their respective probability distributions and the entropy of their joint probability distribution.

$$M(X,Y) = H(X) + H(Y) - H(X,Y) \qquad (3)$$

If knowledge of the value of one variable provides knowledge about the value of the other variable, then the information provided by knowledge of both variables will be less than the sum of the knowledge of each variable considered independently. Here, the mutual information represents how much knowledge of one variable tells us about the value of the other. Clearly, if the variables are independent, and knowledge of one provides no information about the value of the other, then M = 0.

For our purposes, we write,

$$M(\{\Sigma\},\{C\}) = H(\{\Sigma\}) + H(\{C\}) - H(\{\Sigma\},\{C\})$$

where {Σ} indicates the set of eight local structure categories defined below, and {C} is the set of amino acid substitution classes being considered. The values are multiplied by 100 for ease of comparison. The mutual information quantifies how much information about the local structure is contained in knowledge of the substitution class. The set of classes that maximizes the mutual information is, therefore, the best set of classes for extracting information about local protein structure.

We use a Metropolis algorithm [53] with simulated annealing to search for optimal sets of substitution classes. The searches begin with a set of substitution classes with randomly constructed memberships. The mutual information is calculated for that set. One of the bits of the binary vectors representing one of the classes is altered at random. Another possible move is the random selection and switching of two of the substitution classes to rearrange the order of the classes. The mutual information of this new trial configuration is recalculated. If the new trial configuration has a mutual information value higher than the current configuration, the change is accepted and the trial configuration becomes the new current configuration. If the trial configuration's mutual information is less than that of the current configuration, the change is accepted with probability

TABLE I. List of 111 Protein Chains by PDB Identifier Code*

1aak   1ab2   1abmA  1ak3A  1apmE  1atr   1avhA  1baa   1babB  1bet
1bfg   1bmdA  1caj   1cauA  1cauB  1ccr   1cgt   1cobA  1cpcA  1cpcB
1ctm   1dlhB  1eco   1fbaA  1fha   1fkb   1fxiA  1gatA  1gd1O  1gp1A
1hbq   1hdxA  1hgeA  1hgeB  1hleA  1hsbA  1hunA  1huw   1ipd   1le4
1lenA  1lgaA  1mctA  1mfbH  1minA  1minB  1nbtA  1ndc   1nipA  1ofv
1pfkA  1pho   1php   1pk4   1pkp   1poa   1ppn   1r1a2  1rec   1rfbA
1rnd   1s01   1sacA  1scmA  1scmB  1scmC  1sltA  1smrA  1spa   1tbpA
1tfd   1tgxA  1tie   1tndA  1treA  1tys   1wsyB  1xyaA  1zaaC  2acq
2azaA  2btfA  2cas   2cpl   2ctc   2gda   2hmzA  2ihl   2lh2   2mge
2mipA  2plv1  2plv3  2reb   2sn3   2snv   2tgi   3cla   3pgm   3rubS
4blmA  4gcr   4mb    4htcI  5fbpB  5nn9   5p21   8catA  8tlnE  9ldtA
9rnt

*The fifth letters in the codes of multichain proteins designate the chains used.

$$\exp\left[(M_{\mathrm{trial}} - M_{\mathrm{current}})/T\right]$$

where T is an empirically defined decreasing function of the number of trials.

Databases

The dataset of protein chains was selected from the October 1994 PDBselect list of representative structures sharing less than 25% sequence identity between any pair, as compiled by Hobohm and Sander [54]. Alignments of homologs are extracted from "homology derived structures of proteins" (HSSP) files [55]. Application of a minimum length requirement of 60 residues for example proteins, of a minimum number of 10 homologs per sequence position and of a 40% lower bound on sequence identity between the example chain and its homologs produces the set of 111 protein chains (25,511 residues) listed in Table I. This dataset is well-distributed over the ranges of % β-strand content and % helical content observed in the protein structure databases. We do not classify the example proteins into discrete categories such as the typical "mostly α," "mostly β," "mixed α/β," and "irregular" due to the inconsistency among the various sets of classification criteria that have been proposed in the literature.

Structure Definitions

In the HSSP files, there are three types of gaps: end gaps, internal gaps, and insertions. End gaps in the homolog sequences are ignored due to the ambiguity regarding whether these regions outside the boundaries of the alignment bear any structural similarity to the example protein. Insertions in the homologous sequences are also discounted because they correspond to gaps in the example protein where no secondary structure information is available. Thus, the only gaps taken into account are those within the homolog sequences.

The local structure of a residue location is assigned as one of eight categorical combinations of the four standard secondary structure states: helix, strand, turn, and coil; and two solvent accessibility states: buried and exposed. Secondary structure and solvent accessibility information is taken from the "Dictionary of Protein Secondary Structure" (DSSP) files of Kabsch and Sander [57] which were constructed based on the coordinates supplied in the Protein Data Bank (PDB) files for the proteins. Solvent accessibility values reported for single chains of multimeric proteins were calculated based on the multi-chain complexes. The solvent accessibilities are normalized by the values obtained for glycine-X-glycine tri-peptides by Shrake and Rupley in order to generate fractional exposures. The choice of threshold for distinguishing buried and exposed states and the classification of non-standard secondary structures into the common four are explained in the results section.

RESULTS

Our first use of mutual information was in determining the optimal definition of the local structure categories. The DSSP files contain eight possible structural assignments, adding the β-bridge, π-helix, 3₁₀-helix, and bend to the canonical α-helix, β-strand, turn and coil. It is not obvious how to partition these additional assignments among the four more standard structures. For example, the GOR method [61] assigns bend and β-bridge locations to coil and π-helix and 3₁₀-helix locations to helix, while Rost and Sander group 3₁₀-helix locations with helix, and combine β-bridge, turn, bend, and π-helix locations into a "loop" category [26]. Likewise, there is a problem in choosing a surface accessibility threshold, as noted by other authors [27].

Our solution to these problems was to find the solvent accessibility threshold combined with the mapping of the four non-standard secondary structure types into the standard four which provided the maximum mutual information for a set of 40 classes (one variable class and one conserved class for each amino acid type). We chose to work with these classes for two reasons. One, these classes are "non-adjustable" and, therefore, we avoid the computational enormity of simultaneously or iteratively searching for a set of optimal substitution classes for every possible structural categorization. Two, these classes convey information about the residue identity and the basic conserved or variable nature of a sequence position. For each sequence position in the 111 protein dataset, the residue identity was taken from the example sequence while the conserved or variable nature of the position was determined by the alignment of the homologs. We exhaustively considered every possible secondary structure mapping for solvent accessibility thresholds ranging from 15 to 25%. The optimal structure categorization assigned β-bridge locations to β-strand, 3₁₀-helix and π-helix locations to helix, and bend locations to turn, for a solvent accessibility cutoff of 22%. This classified 47% of the 25,511 residues in our 111 protein chain dataset as exposed, 34% as helical, 23% as strand, 22% as turn, and 21% as coil. This local structure categorization was robust to the choice of dataset.

We also used mutual information to determine how distant homologs should be included in the multiple sequence alignments. Inclusion of more distant homologs increases the probability that the range of possible amino acid residues that can occupy each location is well defined. However, it also increases the probability that a rare but possible mutation may be included, and given the nature of the substitution classes, be accorded as much weight as a more likely residue. The accidental inclusion of non-homologous sequences becomes a concern as well as the general reliability of the multiple sequence alignments. Using the same type of search procedure with non-adjustable classes as before, we found that the mutual information plateaued in the range of 40-55% sequence identity. We chose to work with the dataset produced by the 40% lower bound as it provided the largest number of example proteins.

For all searches for optimal sets of residue substitution classes, 20 non-adjustable classes were explicitly considered, representing conserved examples of each of the 20 amino acid types. While a variable set of residues may represent conservation of a generic property of side chains associated with a particular local protein conformation, conservation of a certain side chain type more likely represents a strict functional constraint which may or may not be coupled with a conserved structural feature. Also, the structural propensities of conserved positions of a particular side chain are likely to be different than those of variable positions which allow that side chain. These twenty classes allow for separation of such differences without creating any additional complexity in the optimization searches.

With a very small number of residue substitution classes, only very generic physicochemical properties and their correlations with local protein structures are resolved. As the number of classes, N, increases, more specific residue properties and their correlations with particular local protein conformations are distinguished. As can be seen in Figure 2, the mutual information begins to saturate in the range of a relatively small number of classes. Therefore, the insignificant gain in mutual information provided by larger sets of optimal substitution classes is obtained at increasingly prohibitive computational costs as the number of possible configurations roughly grows as 2^(21N). There is also an increasing risk that the resulting classes represent memorization of dataset-specific sequence to structure correlations. This memorization reaches a maximum when each alignment position in the database which has a unique set of residues is allowed to be a substitution class itself. There were 7,705 such positions in our database of alignments. Therefore, the mutual information value calculated for these 7,705 classes, 86.22, was the maximum attainable value for this dataset. Here we report on the optimal set of 28 classes. At this resolution, many of the well-known relationships between generic residue properties and local protein structure are distinguished with a consequent large initial gain in mutual information. As shown in Figure 3, the Metropolis search scheme was capable of finding the optimal set of 28 substitution classes in a consistent and rapid fashion.

To address the issue of dataset dependence, we first split the database of 111 protein chains into subsets of 55 and 56 examples. The optimal sets of 28 classes derived for each dataset-half could be mapped 1 to 1 onto the 28 classes optimized for the entire 111 protein dataset. Changes between the subset-optimized and globally optimized classes involved small changes in the residue membership of some classes and a shift in the order of classes. For the half-dataset classes, 82% of the alignment positions were assigned to the same classes as for the globally optimal set. This indicates that while the exact appearance of the optimal set of classes shows some dataset dependence, the relationships between sequence and structure represented by these sets of substitution classes are relatively robust.

We separately addressed the possibility of memorization of particular residue substitution patterns by dividing the 111 protein chain dataset into fifths. The optimal set of substitution classes was found for four-fifths of the dataset and a mutual information value was calculated over the remaining fifth of the dataset using these classes. This was done cyclically for all permutations of the dataset-fifths, and an average mutual information was calculated. Similarly, the set of substitution classes optimized for the entire 111 protein chain database was used to calculate a mutual information value over each fifth of the database. These values were also averaged. The differences between the two averages were negligible for numbers of substitution classes from 21 to 28. This

[Figure 2: plot of mutual information (vertical axis) against number of classes (horizontal axis), with marked reference points labelled "Minimal Substitution Classes" and "PIMA".]

Fig. 2. Curve of mutual information for increasing numbers of substitution classes. In comparison to the value of 20.32 calculated for the optimal 28 substitution classes, we calculated values of 14.51 for the 20 single-residue amino acid classes based on single sequence data, 12.49 for the 37 class-nodes of the PIMA tree of Smith and Smith, and 18.54 for a set of 86 classes which include the "minimal sets" of Taylor. The maximum mutual information attainable for our 111 protein dataset, corresponding to complete memorization of the dataset, was 86.22.
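
As an illustration of how values such as those quoted for Figure 2 can be computed, the following Python sketch encodes alignment columns as 21-bit masks, assigns each column to the first class that covers it, and evaluates the mutual information of Eq. (3) scaled by 100. It is a minimal sketch under stated assumptions, not the authors' program: the function names, the bit order, and the toy columns and structure labels are hypothetical and were introduced here only for illustration.

```python
import math
from collections import Counter

# Assumed bit order: the 20 amino acid types followed by the gap symbol.
ALPHABET = "LIVFMCAWYTSHQNEDKRPG-"
BIT = {symbol: 1 << i for i, symbol in enumerate(ALPHABET)}


def encode(residues):
    """Bitmask of the residue types (or gap) observed at one alignment position."""
    mask = 0
    for symbol in residues:
        mask |= BIT[symbol]
    return mask


def assign_class(position_mask, class_masks):
    """Index of the first class whose membership covers every observed type."""
    for index, class_mask in enumerate(class_masks):
        if position_mask & ~class_mask == 0:
            return index
    raise ValueError("no class covers this position")


def entropy(counts):
    """Shannon entropy (natural log) of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def mutual_information(pairs):
    """100 * [H(X) + H(Y) - H(X,Y)] estimated from observed (x, y) pairs."""
    pairs = list(pairs)
    h_x = entropy(Counter(x for x, _ in pairs))
    h_y = entropy(Counter(y for _, y in pairs))
    h_xy = entropy(Counter(pairs))
    return 100.0 * (h_x + h_y - h_xy)


# Toy data (hypothetical): three ordered classes and six alignment columns.
classes = [encode("V"), encode("LIV"), encode(ALPHABET)]
columns = ["VVVV", "LIVL", "LVSK", "VVVV", "IILV", "KESD"]
structure = ["buried", "buried", "exposed", "buried", "buried", "exposed"]

labels = [assign_class(encode(column), classes) for column in columns]
score = mutual_information(zip(labels, structure))
print(labels, round(score, 2))
```

Because columns are assigned to the first covering class, reordering `classes` changes the labels and therefore the score, which is why class order is part of what a search over class sets has to optimize.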

[Figure 3: plot of mutual information (vertical axis, 17 to 20) against trial number (horizontal axis, 50,000 to 250,000).]

Fig. 3. Five sample searches for 28 optimal residue substitution classes, each seeded with a different random number.

demonstrated that there was effectively no memorization occurring at the resolution of sequence to structure correlations represented by the 28 substitution classes described in this article. In the space of possible sets of substitution classes, there are very many near-optimal sets forming a large, but broad and flat plateau in the mutual information landscape. The small dataset-dependence of the substitution classes discussed above corresponds to small peaks atop that plateau. Whatever dataset dependence exists does not substantially affect the information available when the classes are applied to other datasets.

As marked in Figure 2, the mutual information value for the 20 amino acid residues taken from the sequences of the 111 example proteins only, neglecting information from homologous sequences, was 14.51. The value attained for the optimal set of 28 classes was 20.32, a clear gain in information regarding local protein structure. Another comparative value was obtained by encoding the PIMA tree of Smith and Smith [37,38] as a set of 37 classes and calculating the mutual information with the eight local structure categories over the dataset of 111 protein chains. This resulted in a value of 12.49 which is less than that achieved by our set of 22 substitution classes or even that obtained from single sequence information. While the PIMA structure may be useful in the alignment applications suggested by its authors, it is relatively inadequate for structure-based applications. Similarly, we coded the 65 "minimal sets" of amino acids proposed by Taylor to characterize alignment positions and ". . . capture virtually all the useful information that can be extracted from a number of aligned sequences" [39] with 20 single-residue classes for the twenty conserved amino acids and one class with full amino acid and gap membership to account for all those alignment positions which failed to fit one of these 85 classes. The mutual information calculated for this combined set of 86 classes was 18.54, less than that calculated for our set of 25 substitution classes and providing only slightly more information than the 20 amino acid classes which do not include any information from homologous sequences. In comparison to other methods, our classification scheme provides more information with fewer parameters.

We can represent the propensity for various residues and substitution classes to exist in any structural context through the use of log-likelihood ratios. L(C,Σ), the log-likelihood ratio for class C to be in context Σ, is defined as the logarithm of P(C,Σ)/[P(C)P(Σ)], the ratio of the observed joint probability of class and context to the value expected if the two were independent. This ratio represents how much more likely it is for a given position described by a substitution class C to be in local structure Σ, compared with what would be expected at random. A listing of these log-likelihood ratios and the actual amino acid memberships of the optimal 28 substitution classes can be found in Table II.

TABLE II. Structural Likelihood Ratios and Residue Memberships for the Optimal Set of 28 Residue Substitution Classes*

                                Buried                  Exposed
Class                       α     β     t     c     α     β     t     c
L                          74    40    22    32  -133  -116  -159  -108
I                          49    79    -6    22   -91  -180  -166  -122
V                          34    97   -23   -10  -184   -11  -140  -139
F                          44    62    53    37  -183   -90  -147   -90
M                          45    55    39   -10  -106   -94  -102   -20
C                         -24    52    40    79   -97   -24  -101   -29
A                          69    23    32     0   -34  -108  -109  -114
W                          44    72    78     5  -323   -63  -158  -114
Y                          21    65   -27    34   -77    29  -106   -95
T                         -21    35    52    59  -117    50   -86   -18
S                          -9    33    13    68   -73   -41   -21   -28
H                          30    22    40    55   -61    -4   -81   -87
Q                          15     7   -11    21   -18    30   -67    11
N                           0     3    25    63   -73   -61   -10     6
E                          17   -25   -10   -27    44    21   -24   -42
D                         -22    -2    18    68   -40   -64    -4    15
K                         -28   -23   -70     6    50    62   -29   -11
R                          21   -16   -17     2     3    63   -71    -4
P                         -76   -39    77    89  -128   -41    -2    45
G                         -52     1   114    51  -203  -105    37   -37
LIVFMCA                   -63    78   -16     5  -160   -75  -181  -127
LIVFMCAWYTSH       G-      40    40    24    20   -89   -38   -59   -49
LIVFMCAWYTSHQNE  R  -       0   -28     0   -20    21    33   -11     9
LIVFMCAWYTSH   D RPG-     -30   -47    19    45   -49   -15    27    43
LIVFMC WYTSHQNE KR  -     -58   -77   -55   -81    56    90     0    34
LIVFMC WYTSHQN DKRPG-    -131  -126   -20   -35   -10    16    83    69
LIVFMCAWYTSHQNEDKR        -90  -139  -131  -132   108    29    22     4
LIVFMCAWYTSHQNEDKRPG-     -94  -182   -72   -67    54   -14    76    49

*- indicates gap, α indicates helix, β indicates strand, t indicates turn, and c indicates coil.

DISCUSSION

The first 20 classes listed in Table II are single-residue classes which account for conserved instances of the 20 amino acids. The structural propensities of these 20 classes in the standard four secondary structure states are highly correlated with those calculated in the GOR method [61] (data not shown). Small statistical differences result from our use of different categorizations of local protein structure and a different dataset. Greater disparities can be understood by considering that our classes represent only conserved examples of each of the 20 amino acids while the GOR method makes no such distinction. Thus, while the structural propensities calculated by the GOR method follow general trends in proteins, the structural propensities calculated for our 20 conserved classes reflect more specific structural or functional constraints.

The broad range of acceptable side chains displayed in the eight multi-residue classes is consistent with conclusions reached in mutagenesis experiments involving multiple structural locations in several proteins. Clearly, most of these classes do not display strict conservation of any of the physicochemical properties often used to typify the 20 amino acids, such as side-chain flexibility, bulk, or polarity. In fact, there are no one-to-one correspondences between our multi-residue substitution classes and the nodes of the PIMA tree of Smith and Smith or the "minimal sets" derived from the Venn diagram of Taylor. Examination of the eight multi-residue substitution classes reveals that they are apparently built up as combinations of a few single amino acid types and "blocks" of multiple amino acids. These "blocks" display a degree of clustering of amino acids with similar physicochemical attributes. For instance, one block which is found in all the multi-residue classes consists of leucine, isoleucine, valine, phenylalanine, methionine, and cysteine. Aside from the presence of cysteine, this block is similar to the "aliphatic.or.large-non-polar" set ({LIVFM}) defined by Taylor.

We also find a block of aromatic and short residues, including tyrosine, tryptophan, histidine, serine, and threonine. In the set-logic nomenclature of Taylor, the simplest description would be polar-aromatic.or.S.or.T. Using a set-logic description based on other common physicochemical characters not included in Taylor's Venn diagrams, it would perhaps be more illuminating to describe this block of residues as hydrogen-bonding-non-charged-non-flexible (ignoring the variable charge of histidine). The pair of residues, glutamine and asparagine, also either appears together or not at all, and could be regarded as the hydrogen-bonding-non-charged-flexible set, a partial complement to the preceding block.

[...] classes do not have high probabilities of pairwise mutations amongst themselves as indicated by mutation matrices [65]. Interestingly, there is a general lack of preferential mutations between members of the smaller constitutive blocks as well. While the first block, {LIVFMC}, contains the relatively exchangeable residues {LIVM} with phenylalanine (F) near to belonging to this group, cysteine is rather distinct and has low probabilities of mutation to several of the other members of this block. No significant pattern can be found in the Dayhoff matrix for either of the two remaining blocks {WYTSH} and {QN}.

Consideration of the structural propensities of the eight multi-residue classes indicates that the first two classes are strongly associated with solvent-inaccessible structures while the remaining six are associated with surface structures. This simple distinction, however, is certainly not the limit of information being conveyed by this small number of multi-residue classes. As seen in Figure 2, the point for 21 classes (20 conserved and one variable) corresponds to the point where the basic distinction is made between conserved and variable alignment positions and, hence, between buried and exposed structure locations. The mutual information value at this point is 7.57, which is not particularly high. With the addition of a few more multi-residue classes, however, the mutual information increases dramatically. This indicates that these classes which represent generalized patterns of amino acid substitutability are accounting for very significant correlations between amino acid substitution patterns and local protein structure. From the curve in Figure 2 we can also conclude that, on the whole, the information provided by patterns of variability as captured by these multi-residue classes is significantly greater than that provided by conserved residue locations.

CONCLUSION

It is unfortunate that biological systems resist the kind of reductionism that would yield "folding rules" specifying exact amino acid residue identities required for producing exact local protein structures. What rules might be set forth are so full of exceptions and caveats that they hardly qualify as rules. Instead we must develop descriptions based on propensities and inclinations. Such probabilistic descriptions are the natural province of information theory. Here, we have used information theory to
The remaining residues, which appear in var- extract information about protein structure from a ious combinations with one another and the blocks dataset of multiple sequence alignments for a struc- of residues just described include the “default”64 turally representative set of protein chains. This in- amino acid alanine, the “breakers” of repetitive formation is represented by a set of amino acid res- secondary structure proline and glycine, and the idue substitution classes. charged residues aspartic acid, glutamic acid, ly- The structurally indicative residue substitution sine, and arginine. classes obtained by our optimization method do not Due to the broad membership of most of the multi- correspond to the areas of Taylor’s Venn Dia- residue classes, the amino acid members of these gram~,3~the nodes of the PIMA tree of Smith and 36 M.J. THOMPSON AND R.A. GOLDSTEIN

Smith,38 or the clusters of “surface indicating” and “interior indicating” side chains used by Benner and coworkers32 in any simple and consistent fashion. This is due to the fact that these other groupings of amino acids are heuristic idealizations of the residue substitution data. In contrast, our substitution classes are derived through a rigorous statistical procedure that introduces no bias concerning how the amino acids “should” be grouped. Therefore, while our classes display segregation of the amino acids according to some generic physicochemical characteristics, mostly in the form of the “blocks” of residues as discussed above, this segregation is not particularly strict or defining. Although there is no simple mapping of our structurally-informative classes to the sequence-derived Dirichlet mixture priors of Haussler and coworkers,40 the results of that statistical approach also display only an emphasis of certain residues or common physicochemical properties.

Our results indicate that conserved alignment positions, on the whole, are less informative with regard to local protein structure than are variable alignment positions. This suggests that any method of sequence profiling of structural motifs should include consideration of alignment positions which display side chain variability. Since the mutual information begins to saturate with a small number of substitution classes, any potential application will benefit from this small number of parameters needed to characterize protein structure as subdivided into our eight structural categories.

We have demonstrated the utility of these classes in the prediction of solvent accessibility.41 They could also be used in improved methods of multiple sequence alignment, in the sequence profiling of structural patterns, and in improved secondary structure prediction. Such classes of side chain substitution could clearly be made increasingly general through the use of larger sets of proteins. They could also be improved through a self-consistent iterative approach wherein they would be used to construct better alignments from which more optimal classes could be derived.

ACKNOWLEDGMENTS

We thank Mark Saper for helpful guidance, and Kurt Hillig for computational assistance. Peter Wolynes made the original suggestion to use mutual information to optimize structure prediction methods. We extend a general thanks to those who solve protein structures and make this information available and to those who construct and maintain databases of protein sequences and structures. Financial support was provided by the College of Literature, Science, and the Arts, the Program in Protein Structure and Design, the Horace H. Rackham School of Graduate Studies at the University of Michigan, and NIH grant R29 LM05770.

REFERENCES

1. Chothia, C. Principles that determine the structure of proteins. Annu. Rev. Biochem. 53:537-572, 1984.
2. Chothia, C., Lesk, A.M. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823-826, 1986.
3. Dickerson, R.E., Timkovich, R., Almassy, R.J. The cytochrome fold and the evolution of bacterial energy metabolism. J. Mol. Biol. 100:473-491, 1976.
4. Pastore, A., Lesk, A.M. Comparison of the structures of globins and phycocyanins: Evidence for evolutionary relationship. Proteins 8:133-155, 1990.
5. Kimura, M. Evolutionary rate at the molecular level. Nature 217:624-626, 1968.
6. King, J.L., Jukes, T.H. Non-Darwinian evolution. Science 164:788-798, 1969.
7. Maxfield, F.R., Scheraga, H.A. Improvements in the prediction of protein backbone topography by reduction of statistical errors. Biochem. 18:697, 1979.
8. Sawyer, L., Fothergill-Gilmore, L.A., Russel, G.A. The predicted secondary structure of enolase. Biochem. J. 236:127-130, 1986.
9. Crawford, I.P., Niermann, T., Kirschner, K. Prediction of secondary structure by evolutionary comparison. Proteins 2:118-129, 1987.
10. Zvelebil, M.J., Barton, G.J., Taylor, W.R., Sternberg, M.J.E. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol. 195:957-961, 1987.
11. Perkins, S.J., Haris, P.I., Sim, R.B., Chapman, D. A study of the structure of human complement component factor H by Fourier transform infrared spectroscopy and secondary structure averaging methods. Biochemistry 27:4004-4012, 1988.
12. Levin, J.M., Pascarella, S., Argos, P., Garnier, J. Quantification of secondary structure prediction improvement using multiple alignments. Protein Eng. 6:849-854, 1993.
13. Goldstein, R.A., Katzenellenbogen, J.A., Luthey-Schulten, Z.A., Seielstad, D.A., Wolynes, P.G. Three-dimensional model for the hormone binding domains of steroid receptors. Proc. Natl. Acad. Sci., U.S.A. 90:9949-9953, 1993.
14. Taylor, W., Thornton, J. Recognition of super-secondary structure in proteins. J. Mol. Biol. 173:487-514, 1984.
15. Wierenga, R.K., Terpstra, P., Hol, W.G.J. Prediction of the occurrence of the ADP-binding βαβ-fold in proteins, using an amino acid sequence fingerprint. J. Mol. Biol. 187:101-107, 1986.
16. Webster, T.A., Lathrop, R.H., Smith, T.F. Prediction of a common structural domain in aminoacyl-tRNA synthetases through use of a new pattern-directed inference system. Biochemistry 26:6950-6957, 1987.
17. Barton, G.J., Sternberg, M.J.E. Flexible protein sequence patterns: A sensitive method to detect weak structural homologies. J. Mol. Biol. 212:389-402, 1987.
18. Bairoch, A. Prosite: A dictionary of sites and patterns in proteins. Nucleic Acids Res. 19:2241-2245, 1991.
19. Pickett, S.D., Saqi, M.A.S., Sternberg, M.J.E. Evaluation of the sequence template method for protein structure prediction. J. Mol. Biol. 228:170-187, 1992.
20. Bork, P. Mobile modules and motifs. Curr. Opin. Struct. Biol. 2:413-421, 1992.
21. Gribskov, M., McLachlan, A.D., Eisenberg, D. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci., U.S.A. 84:4355-4358, 1987.
22. Luthy, R., McLachlan, A.D., Eisenberg, D. Secondary structure-based profiles: Use of structure conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 10:229-239, 1991.
23. Bowie, J.U., Luthy, R., Eisenberg, D. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253:164-170, 1991.
24. Wako, H., Blundell, T. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. I. Solvent accessibility classes. J. Mol. Biol. 238:682-692, 1994.
25. Salamov, A., Solovyev, V. Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247:11-15, 1995.

26. Rost, B., Sander, C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19:55-72, 1994.
27. Rost, B., Sander, C. Conservation and prediction of solvent accessibility in protein families. Proteins 20:216-226, 1994.
28. Benner, S.A., Gerloff, D. Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure: The catalytic domain of protein kinases. Adv. Enzyme Regul. 31:121-181, 1991.
29. Rost, B., Sander, C. Jury returns on structure prediction. Nature 360:540, 1992.
30. Barton, G.J., Russel, R.B. Protein structure prediction. Nature 361:505-506, 1993.
31. Robson, B., Garnier, J. Protein structure prediction. Nature 361:506, 1993.
32. Benner, S.A., Badcoe, I., Cohen, M.A., Gerloff, D.L. Bona Fide prediction of aspects of protein conformation. J. Mol. Biol. 235:926-958, 1994.
33. Gobel, U., Sander, C., Schneider, R., Valencia, A. Correlated mutations and residue contacts in proteins. Proteins 18:309-317, 1994.
34. Neher, E. How frequent are correlated changes in families of protein sequences. Proc. Natl. Acad. Sci., U.S.A. 91:98-102, 1994.
35. Shindyalov, I., Kolchanov, N., Sander, C. Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations. Protein Eng. 7:349-358, 1994.
36. Taylor, W.R., Hatrick, K. Compensating changes in protein multiple sequence alignments. Protein Eng. 7:341-348, 1994.
37. Smith, R.F., Smith, T.F. Automatic generation of primary sequence patterns from sets of related protein sequences. Proc. Natl. Acad. Sci., U.S.A. 87:118-122, 1990.
38. Smith, R.F., Smith, T.F. Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. Protein Eng. 5:35-41, 1992.
39. Taylor, W. The classification of amino acid conservation. J. Theor. Biol. 119:205-218, 1986.
40. Brown, M., Hughey, R., Krogh, A., Mian, I., Sjolander, K., Haussler, D. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In: “Proceedings of the First International Conference on Intelligent Systems for Molecular Biology.” Hunter, L., Searles, D., Shavlik, J. (ed.). Menlo Park, CA: AAAI/MIT Press, 1993:47-55.
41. Thompson, M.J., Goldstein, R.A. Predicting solvent accessibility: Higher accuracy using Bayesian statistics and optimized residue substitution classes. Proteins 25:38-47, 1996.
42. Bairoch, A., Boeckmann, B. The SWISS-PROT protein sequence data bank: Current status. Nucleic Acids Res. 22:3578-3580, 1994.
43. Cover, T.M., Thomas, J.A. “Elements of Information Theory.” New York: John Wiley & Sons, Inc., 1991.
44. Garnier, J., Osguthorpe, D., Robson, B. Analysis of accuracy and implications of simple methods for predicting secondary structure of globular proteins. J. Mol. Biol. 120:97-120, 1978.
45. Schneider, T.D., Stormo, G.D., Gold, L. Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188:415-431, 1986.
46. Gibrat, J.F., Garnier, J., Robson, B. Further developments of protein secondary structure prediction using information theory. J. Mol. Biol. 198:425-443, 1987.
47. Bowie, J.U., Clarke, N.D., Pabo, C.O., Sauer, R.T. Identification of protein folds: Matching hydrophobicity patterns of sequence sets with solvent accessibility patterns of known structures. Proteins 7:257-264, 1990.
48. Stephens, R.M., Schneider, T.D. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol. 228:1124-1136, 1992.
49. Farber, R., Lapedes, A., Sirotkin, K. Determination of eukaryotic protein coding regions using neural networks and information theory. J. Mol. Biol. 226:471-479, 1992.
50. Papp, P.P., Chattoraj, D.K., Schneider, T.D. Information analysis of sequences that bind the replication initiator repA. J. Mol. Biol. 233:219-230, 1993.
51. Williamson, R.M. Information theory analysis of the relationship between primary sequence structure and ligand recognition among a class of facilitated transporters. J. Theor. Biol. 174:179-188, 1995.
52. Shannon, C.E., Weaver, W. “The Mathematical Theory of Communication.” Urbana, IL: University of Illinois Press, 1949.
53. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E. Equation of state calculations for fast computing machines. J. Chem. Phys. 21:1087, 1953.
54. Hobohm, U., Sander, C. Enlarged representative set of protein structures. Protein Sci. 3:522-524, 1994.
55. Sander, C., Schneider, R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56-68, 1991.
56. Eisenhaber, F., Persson, B., Argos, P. Protein structure prediction: Recognition of primary, secondary, and tertiary structural features from amino acid sequence. Crit. Rev. Biochem. Mol. Biol. 30:1-94, 1995.
57. Kabsch, W., Sander, C. Dictionary of protein secondary structures. Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-2637, 1983.
58. Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer, E.F., Jr., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., Tasumi, M. Protein data bank: A computer-based archival file for macromolecular structures. J. Mol. Biol. 112:535-542, 1977.
59. Abola, E.E., Bernstein, F.C., Bryant, S.H., Koetzle, T.F., Weng, J. Protein data bank. In: “Crystallographic Databases - Information Content, Software Systems, Scientific Applications.” Allen, F.H., Bergerhoff, G., Sievers, R. (eds.). Bonn: Data Commission of the International Union of Crystallography, 1987:107-132.
60. Shrake, A., Rupley, J.A. Environment and exposure to solvent of protein atoms: Lysozyme and insulin. J. Mol. Biol. 79:351-371, 1973.
61. Garnier, J., Robson, B. The GOR method for predicting secondary structure in proteins. In: “Prediction of Protein Structure and the Principles of Protein Conformation.” Fasman, G.D. (ed.). New York: Plenum Press, 1989:417-465.
62. Reidhaar-Olson, J.F., Sauer, R.T. Combinatorial cassette mutagenesis as a probe of the informational content of protein sequences. Science 241:53-57, 1988.
63. Kamtekar, S., Schiffer, J.M., Xiong, H., Babik, J.M., Hecht, M.H. Protein design by binary patterning of polar and nonpolar amino acids. Science 262:1680-1685, 1993.
64. Richardson, J.S., Richardson, D.C. Principles and patterns of protein conformation. In: “Prediction of Protein Structure and the Principles of Protein Conformation.” Fasman, G.D. (ed.). New York: Plenum Press, 1989:1-98.
65. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. A model of evolutionary change in proteins. In: “Atlas of Protein Sequence and Structure.” Vol. 5, suppl. 3. Dayhoff, M.O. (ed.). Washington, DC: National Biomedical Research Foundation, 1979:353-358.
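
As a supplementary illustration of the log-likelihood ratio and mutual information discussed in the text, the following is a minimal sketch, not the authors' implementation. It assumes natural logarithms, estimates probabilities as simple frequencies, and uses a hypothetical toy dataset; the function names, the class label "KRDE", and the "buried"/"exposed" labels are all invented for the example.

```python
import math
from collections import Counter


def joint_probabilities(observations):
    """Estimate P(C, S), P(C) and P(S) from (class, structure) pairs."""
    pair_counts = Counter(observations)
    total = sum(pair_counts.values())
    p_joint = {pair: n / total for pair, n in pair_counts.items()}
    p_class, p_struct = Counter(), Counter()
    for (c, s), p in p_joint.items():
        p_class[c] += p
        p_struct[s] += p
    return p_joint, p_class, p_struct


def log_likelihood_ratio(p_joint, p_class, p_struct, c, s):
    """log[P(c, s) / (P(c) P(s))]; the natural log is an assumption.

    Raises KeyError if the pair (c, s) was never observed.
    """
    return math.log(p_joint[(c, s)] / (p_class[c] * p_struct[s]))


def mutual_information(p_joint, p_class, p_struct):
    """Average of the log-likelihood ratio, weighted by P(c, s)."""
    return sum(
        p * math.log(p / (p_class[c] * p_struct[s]))
        for (c, s), p in p_joint.items()
    )


# Hypothetical toy data: one (substitution class, local structure) pair
# per alignment position. These counts are invented, not from the paper.
toy = (
    [("LIVFMC", "buried")] * 40
    + [("LIVFMC", "exposed")] * 10
    + [("KRDE", "buried")] * 10
    + [("KRDE", "exposed")] * 40
)
p_joint, p_class, p_struct = joint_probabilities(toy)
llr_buried = log_likelihood_ratio(p_joint, p_class, p_struct, "LIVFMC", "buried")
mi = mutual_information(p_joint, p_class, p_struct)
```

In this toy case the hydrophobic class has a positive ratio for the buried context and a negative one for the exposed context, and the mutual information is the probability-weighted average of those ratios over all observed pairs.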