Constructing Amino Acid Residue Substitution Classes Maximally Indicative of Local Protein Structure
Total Page:16
File Type:pdf, Size:1020Kb
PROTEINS: Structure, Function, and Genetics 25:28-37 (1996) Constructing Amino Acid Residue Substitution Classes Maximally Indicative of Local Protein Structure Michael J. Thompson' and Richard A. Goldstein'*2 'Biophysics Research Division, 'Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109-1055 ABSTRACT Using an information theo- in these families as a result of the degenerate rela- retic formalism, we optimize classes of amino tionship between sequence and structure. During acid substitution to be maximally indicative of evolution, a three-dimensional fold can persist de- local protein structure. Our statistically-de- spite significant divergence of the amino acid se- rived classes are loosely identifiable with the quence~.'-~In order to preserve the structural or heuristic constructions found in previously functional integrity of the protein, important sites published work. However, while these other in the sequence must either conserve some specific methods provide a more rigid idealization of characteristics requiring conservation of amino acid physicochemically constrained residue substi- identity, or preserve some more general property tution, our classes provide substantially more such as polarity, charge, or size. Variable sites with structural information with many fewer param- little or no structural or functional constraints may eters. Moreover, these substitution classes are evolve by random fixation of neutral or nearly neu- consistent with the paradigmatic view of the se- tral mutation^.^,^ As a result, alignments of homol- quence-to-structure relationship in globular ogous sequences provide more information about the proteins which holds that the three-dimen- architectural necessities of their commonly-adopted sional architecture is predominantly deter- fold than do lone sequences, implicitly carrying in- mined by the arrangement of hydrophobic and formation about the non-local interactions fre- polar side chains with weak constraints on the quently invoked to explain the apparent 65% accu- actual amino acid identities. More specific con- racy ceiling for secondary structure prediction. straints are imposed on the placement of pro- Various methods have been developed €or extract- lines, glycines, and the charged residues. These ing information from multiple sequence alignments. substitution classes have been used in highly The most straightforward approach has been to use accurate predictions of residue solvent accessi- the multiple members of a protein family in order to bility. They could also be used in the identifica- provide "signal averaging" of structure predictions tion of homologous proteins, the construction over multiple homologous sequence^.^-^^ This kind of and refinement of multiple sequence align- simple averaging, however, does not capture the in- ments, and as a means of condensing and cod- formation implicit in structurally characteristic pat- ifying the information in multiple sequence terns of side chain variability. Another approach has alignments for secondary structure prediction been the construction of consensus or signature and tertiary fold recognition. sequence segments characteristic of particular 0 1996 Wiley-Liss, Inc. structures, which are often used in database ~earches.'"~'~-~~By concentrating on the conserved Key words: protein evolution, structure pre- residues at the expense of alignment positions which diction, information theory, amino allow side chain variability, these methods lose much acid substitution, multiple se- of the information contained in the multiple align- quence alignment ments, as discussed below. In addition, these partic- ular methods create templates specific for each in- INTRODUCTION dividual protein family, and do not give insight into Although it is highly desirable to know the de- the more universal patterns found in proteins. tailed three dimensional structure of a protein un- Both the profile method of tertiary structure rec- der study, such structures are few in number and ognition developed by Eisenberg and coworker^"^-"^ laboriously determined. In contrast, amino acid se- quences are often readily obtainable. When the se- quences of homologous proteins are available as well, the sets of residue substitutions resulting from Received July 14, 1995; revision accepted November 20, evolutionary differentiation can provide a rich 1995. Address reprint requests to Richard A. Goldstein, Biophysics source of information about the protein's function Research Division, Department of Chemistry, University of and structure. Structural information is contained Michigan, Ann Arbor, MI 48109-1055. 0 1996 WILEY-LISS, INC. STRUCTURALLY INDICATIVE SUBSTITUTION CLASSES 29 and the secondary structure and surface accessibil- hierarchical “Amino Acid Class Covering patterns” ity prediction work of Wako and Bl~nde11~~em- of Smith and Smith37,38or the Venn diagrams of ployed environment-dependent mutation matrices Taylor.39 This is due to the fact that our methodol- and tables of structural propensities. A recent hy- ogy contains no ad hoc elements, and allows for the brid method used similar tables.25 Measures of side unprejudiced identification of structurally informa- chain conservation have been used to weight second- tive classes of residue substitution. Our substitution ary structure predictions” and to train neural net- classes are also conceptually similar to the clusters works, in combination with substitution profiles, to of “interior-indicating’’ and “surface-indicating’’ predict secondary structure and solvent accessibil- side chains used by Benner and coworkers in their ity.26,27While some of these more quantitative prediction In contrast to their heuristic methods tend towards general applicability, they do groupings, our classes are derived by a rigorous sta- not lend themselves to the abstraction of general tistical methodology. A similarly rigorous approach principles relating protein structure and sequence. to generalizing the patterns of residue substitutions In contrast, Benner and Gerloff have developed found in multiple sequence alignments can be found methods for analyzing patterns of sequence varia- in the work of Haussler and coworkers.40 Their tion using heuristic rules which vary as subsets of work, however, is aimed at developing improved the alignments are manipulated.” While this work methods for multiple sequence alignment and does has claimed certain successes, it lacks quantitative not investigate the relationship between allowed rigor and reprodu~ibility.~~-~~Benner et al. have substitutions and local protein structure. also pioneered the consideration of correlated muta- In this article, we report on the set of 28 optimal tions as an approach to structure ~rediction.~’Such substitution classes which provides close to the max- efforts are still at a preliminary In spite imum resolution of sequence to structure relation- of this work and the more common uses of multiple ships without significant dataset-dependence. These sequence alignments in the biochemical or crystal- classes provide much more structural information lographic characterization of proteins, a rigorous than is extractable from single sequences. They also and generally applicable classification of these convey more structural information with fewer pa- structurally-constrained sets of residue substitu- rameters than the covering classes developed by tions is lacking. Smith and Smith or the Venn diagrams constructed In this article, we use information theory to con- by Taylor. These classes can be used to address ques- struct sets of residue substitution classes which pro- tions concerning what, if any, structure-determin- vide maximal information about local protein struc- ing properties are being conserved, and generate ture, classified into combinatorial categories of qualitative insight into the relationship between se- secondary structure and solvent accessibility. As quence and structure. In addition to their use in this approach contains no preconceived notions con- structure predicti01-1,~~potential applications in- cerning how the amino acid types should be clus- clude sequence alignment, detection of homologous tered, beyond those proclivities present in the orig- proteins, and the further generation and refinement inal mutation matrices used to construct the of biochemical insight. database of multiple sequence alignments, we have introduced no bias toward the development of any METHODS particular architecture of class membership and Data Encoding class linkages. To quantify the correspondence be- A significant and often neglected issue is the non- tween amino acid sequence and local protein struc- uniformity of the databases of known protein se- ture, we employ the concept of “mutual informa- quences; a large probability of a given residue at any tion.” Briefly, mutual information is the amount of location may only indicate that that particular res- knowledge obtainable about one random variable by idue is characteristic of a protein subfamily over- knowledge of another; in this case, how much knowl- represented in the database. Such biases can be edge of the local structure can be obtained by know- rather extreme: mammals make up 69% of all myo- ing the substitution class of the alignment position. globin sequences found in the SWISS-PROT Data- This measure enables us to search the space of pos- base (release 28).42In order to correct for these bi- sible substitution classes over a database of repre- ases, it is necessary to