Theory for Protein Mutability and Biogenesis (Lattice Model/Exhaustive Simulation/Random Sequences/Protein Origins/Protein Folding) KIT FUN LAU and KEN A
Total Page:16
File Type:pdf, Size:1020Kb
Proc. Nati. Acad. Sci. USA Vol. 87, pp. 638-642, January 1990 Biophysics Theory for protein mutability and biogenesis (lattice model/exhaustive simulation/random sequences/protein origins/protein folding) KIT FUN LAU AND KEN A. DILL Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94143 Communicated by Frederic M. Richards, October 17, 1989 (received for review June 14, 1989) ABSTRACT Using an elementary physical model for pro- SEQUENCE CONFORMATIONAL tein folding, of self-avoiding short copolymer chains on two- SPACE SPACE ....I... dimensional square lattices, we address two questions regard- .7 HH...HHHH ing the evolution and origins of globular proteins. (i) How will COMPACT 2-< protein native structures and stabilities be affected by single- HP...HHIH probability that a HP...HHPH and double-site mutations? (ii) What is the HP...HPHH randomly chosen sequence of amino acids will be compact and HP...HPHP globular under folding conditions? For a large number of HP...HPPH C-i different sequences, we search the conformational space ex- PH...HHHH haustively to find unequivocally the "native" conformation(s), PH...HHHP ofglobal minimum free energy, for each sequence. We find that PH...HHPH Ir replacing nonpolar residues in the core by polar residues is Pp...PPPP generally destabilizing, that surface sites are less sensitive than FIG. 1. Mapping of sequence space to native structure(s). Se- core sites, that some mutations increase the degeneracy of quence space is the set of all possible sequences; conformational native states, and that overall it is most probable that a space is the set of all possible conformations. Two arrows to a native mutation will be neutral, having no effect on the native state from different sequences represents "convergence;" two ar- structure. These results support a "Continuity Principle," that rows to different native structures from a given sequence represents small changes in sequence seldom have large effects on struc- "degeneracy." H, nonpolar (hydrophobic) residue; P, other type of ture or stability of the native state. The simulations also show residue (see text). that (i) the number of "convergent" sequences (different sequences coding for the same native structure) is extremely (2D) square lattice. Thus, each amino acid is represented as large and (ii) most sequences become quite dense under folding simply occupying one lattice site, connected to its chain conditions. This implies that the probability of formation of a neighbor(s), and unable to occupy a site filled by any other globular protein from a random sequence of amino acids by residue. For each monomer, we define "connected" neigh- prebiotic or mutational methods is significantly greater than bors as units j and j + 1 adjacent along the chain, and zero. "topological" neighbors as units (i, j) adjacent in space but not adjacent in position along the sequence, i 4 j - 1, j + 1. The central question of protein evolution is how mutational Every HH contact between topological neighbors is assigned change in the amino acid sequence leads to change in the a contact energy equal to e (<0) and every other interaction structure and stability, and thereby to change in protein among all other possible types of neighbor pairs (H, P, or function. A fundamental question of protein origins is how solvent, S) has energy equal to 0. For sufficiently short chains different sequences, and how many of them, could fold up to every accessible conformation is then surveyed by computer, stable globular conformations with active sites. To address and its energy (number of HH contacts) is evaluated to find these questions of protein biogenesis and change at the the native structure(s), those with the maximum number of molecular level requires a "protein folding" algorithm, a HH contacts. For n < 11, native states are found in this predictor of the native structure of a protein from its amino manner for every possible sequence. For longer chains, we acid sequence. It further requires an algorithm sufficiently randomly choose 200 different sequences, and native states fast so as to be capable of exploring a suitably large fraction are found by exhaustive exploration ofall maximally compact of "sequence space," the set of all possible sequences of conformations (1). amino acids (Fig. 1). No such algorithm yet exists at atomic This model has the following basic features in common resolution. Our purpose here is to address these questions with real proteins (1). For small HH attraction, the predom- through use of a recently developed lower-resolution lattice inant populations of chain conformations are unfolded. For statistical mechanics model of protein folding (1). molecules composed of certain sequences (which we refer to as "folding" sequences), increasing the HH attraction leads The Lattice Model of Protein Sequence/Structure to a transition to a native state characterized by a very small number of conformations (most often, only one). The native The model has been described in detail elsewhere (1); we just conformations have the properties that they (i) are maximally summarize the principal aspects here. Each protein is a linear compact, (ii) are composed of a core of predominantly H chain of a specific sequence of n amino acids. Each amino residues, and (iii) have considerable amounts of 2D equiva- acid can be either oftwo types: H (nonpolar) or P (other). The lents of secondary structures: helices and sheets, principally fraction of the n residues which are of type H is (D; hence the antiparallel (2, 3). Moreover, their distributions of secondary fraction of type P is 1 - 41. A chain conformation is structures are approximately the same as those observed in represented as a self-avoiding walk on a two-dimensional three-dimensional (3D) protein structures (H. S. Chan and K.A.D., unpublished results). The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" Abbreviations: 2D and 3D, two- and three-dimensional or dimen- in accordance with 18 U.S.C. §1734 solely to indicate this fact. sions. 638 Downloaded by guest on October 2, 2021 Biophysics: Lau and Dill Proc. Natl. Acad. Sci. USA 87 (1990) 639 There are two principal virtues of this approach. (i) The with too few H residues do not have compact native struc- model is physical; it is based on the dominant driving forces tures, and sequences with too many H residues have many for folding: the solvent aversion ofnonpolar residues in water different compact structures of equal lowest energy and thus and the conformational freedom and excluded volume of the are not singly degenerate. chain. It has no adjustable parameters or additional ad hoc assumptions. (ii) It is sufficiently simple that for any given Plasticity of Native Structures to Sequence Mutations sequence, the entire conformational space can be searched exhaustively to determine unequivocally the "native" struc- First, we ask: if a sequence is changed at one or two residue ture(s), those which are at the global minimum offree energy. positions, then how much will the native structure and energy Moreover, the native structures can be found for a large change? The question is whether proteins are either (i) highly number of different sequences, permitting a broad explora- "plastic," whereby single-site mutations would seldom lead tion (exhaustive, in some cases) of the full sequence space. to a change ofthe native structure, or (ii) so precisely crafted This model is intended to address questions of general that a single mutation would usually significantly alter the principle rather than of atomic detail. For questions of this native structure. We test this here by choosing a sequence type, justification is given elsewhere (1) that the shortness of with known native structure, and making each possible the chains and 2D nature of the model should not be signif- single-site change, from H to P, or P to H, whichever is icantly limiting. We consider here only the nature of the appropriate, then determining the "mutant" native structure compact folded conformations of chains; we do not address and energy. For this purpose, from about 1000 random the thermodynamics offolding, which will be more sensitive sequences, 251 were chosen as good folders with singly to simplifications of this type. The native conformations are degenerate native states. We characterize structural change simply those of lowest energy, found by taking the limit e by a quantity Ad (defined by equation 13 of ref. 1), which -00 (1), which we refer to as "folding conditions." The simply counts the number ofbond rotation changes, weighted distinction that e is actually a free energy rather than an by the relative magnitude of each change. energy is irrelevant for present purposes. The distribution of structural change resulting from 3263 To decide whether a sequence is a "folder" or not, we use different single-site mutations on chains of length n = 13 is the compactness (1-3) p = t/tmax, where t is the total number shown in Fig. 2B. It is clear from Fig. 2 that proteins are of topological neighbors in a given conformation and tmax is predicted to be nearly maximally insensitive to single-site the maximum possible number of topological neighbors change. In this case, nearly 90% of all possible single-site which could be achieved by any conformation of a given changes cause no change in the native structure. It is inter- chain length. Below we consider different criteria for glob- esting that when a mutation does change the structure, it does ularity, ranging from the most strict (t = tmax) to less strict (t so cooperatively: the small peak at Ad = 3-8 implies that the 2 tmaX - 2); a conformation is considered folded if it satisfies typical single-site-modified mutant native structure has sev- this criterion under folding conditions.