
Consensus sequence design as a general strategy to create hyperstable, biologically active proteins

Matt Sternkeᵃ, Katherine W. Trippᵃ, and Doug Barrickᵃ,¹

ᵃThe T.C. Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218

Edited by Brian W. Matthews, University of Oregon, Eugene, OR, and approved April 26, 2019 (received for review October 6, 2018)

Consensus sequence design offers a promising strategy for designing proteins of high stability while retaining biological activity since it draws upon an evolutionary history in which residues important for both stability and function are likely to be conserved. Although there have been several reports of successful consensus design of individual targets, it is unclear from these anecdotal studies how often this approach succeeds and how often it fails. Here, we attempt to assess generality by designing consensus sequences for a set of six protein families with a range of chain lengths, structures, and activities. We characterize the resulting consensus proteins for stability, structure, and biological activities in an unbiased way. We find that all six consensus proteins adopt cooperatively folded structures in solution. Strikingly, four of six of these consensus proteins show increased thermodynamic stability over naturally occurring homologs. Each consensus protein tested for function maintained at least partial biological activity. Although peptide binding affinity by a consensus-designed SH3 is rather low, Km values for consensus enzymes are similar to values from extant homologs. Although consensus enzymes are slower than extant homologs at low temperature, they are faster than some thermophilic enzymes at high temperature. An analysis of sequence properties shows consensus proteins to be enriched in charged residues, and rarified in uncharged polar residues. Sequence differences between consensus and extant homologs are predominantly located at weakly conserved surface residues, highlighting the importance of these residues in the success of the consensus strategy.

protein stability | protein design | consensus sequence

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Significance

A major goal of protein design is to create proteins that have high stability and biological activity. Drawing on evolutionary information encoded within extant protein sequences, consensus sequence design has produced several successes in achieving this goal. Here, we explore the generality with which consensus design can be used to enhance protein stability and maintain biological activity. By designing and characterizing consensus sequences for six unrelated protein families, we find that consensus design shows high success rates in creating well-folded, hyperstable proteins that retain biological activities. Remarkably, many of these consensus proteins show higher stabilities than naturally occurring sequences of their respective protein families. Our study highlights the utility of consensus sequence design and informs the mechanisms by which it works.

Author contributions: M.S., K.W.T., and D.B. designed research; M.S. and K.W.T. performed research; M.S. and K.W.T. analyzed data; and M.S. and D.B. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Published under the PNAS license.
¹To whom correspondence may be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1816707116/-/DCSupplemental.

Exploiting the fundamental roles of proteins in biological signaling, catalysis, and mechanics, protein design offers a promising route to create and optimize biomolecules for medical, industrial, and biotechnological purposes (1–3). Many different strategies have been applied to designing proteins, including physics-based (4), structure-based (5, 6), and directed evolution-based approaches (7). While these strategies have generated proteins with high stability, implementation of the design strategies is often complex and success rates can be low (8–11). Although directed evolution can be functionally directed, de novo design strategies typically focus primarily on structure. Introducing specific activity into de novo-designed proteins is a significant challenge (8, 10).

Another strategy that has shown success in increasing thermodynamic stability of natural protein folds is consensus sequence design (12). For this design strategy, “consensus” residues are identified as the residues with the highest frequency at individual positions in a multiple sequence alignment (MSA) of extant sequences from a given protein family. The consensus strategy draws upon the hundreds of millions of years of sequence evolution by random mutation and natural selection that is encoded within the extant sequence distribution, with the idea that the relative frequencies of residues at a given position reflect the “relative importance” of each residue at that position for some biological attribute. As long as the importance at each position is largely independent of residues at other positions, a consensus residue at a given position should optimize stability, activity, and/or other properties that allow the protein to function in its biological context and ultimately contribute to organismal fitness.* By averaging over many sequences that share similar structure and function, the consensus design approach has the potential to produce proteins with high levels of thermodynamic stability and biological activity, since both attributes are likely to lead to residue conservation.

*These properties include rates of folding, unfolding, degradation, compartmentalization, oligomerization, and solubility.

There are two experimental approaches that have been used to examine the effectiveness of consensus information in design: “point-substitution” and “wholesale” substitution. In the first approach, single residues in a well-behaved protein that differ from the consensus are substituted with the consensus residue (13–17). In these studies, about one-half of the consensus point substitutions examined are stabilizing, but the other half are destabilizing. Although this frequency of stabilizing mutations is significantly higher than an estimated frequency around 1 in 10³ for random mutations (13, 18, 19), it suggests that combining individual consensus mutations may give minimal net increase in stability since stabilizing substitutions would be offset by destabilizing substitutions.

The “wholesale” approach does just this, combining all substitutions toward consensus into a single consensus polypeptide composed of the most frequent amino acid at each position in sequence. By making a large number of substitutions at once, wholesale consensus substitution may collectively combine the incremental effects from the individual substitutions as well as nonadditive effects arising from the substitution of each residue into the novel background of the consensus protein (20, 21). The stabilities of several globular proteins and several repeat proteins
have been increased using this approach (22–31). An increase in thermodynamic stability is seen in most (but not all) cases, but effects on biological activity are variable. In a recent study, we characterized a consensus-designed homeodomain sequence that showed a large increase in both thermodynamic stability and DNA-binding affinity (32). Unlike the point-substitution approach, where both stabilizing and destabilizing substitutions are reported, the success rate of the wholesale approach is not easy to determine from the literature, where publications present single cases of success, whereas failures are not likely to be published. One study of TIM barrels reported a few poorly behaved consensus designs that were then optimized to generate a folded, active protein (24). Although this study highlighted some limitations, it would not likely have been published if it did not end with success.

Here, we address these issues by applying the consensus sequence design strategy to a set of six taxonomically diverse protein families with different folds and functions (Fig. 1). We chose the three single domain families including the N-terminal domain of ribosomal protein L9 (NTL9), the SH3 domain, and the SH2 domain, and the three multidomain protein families including dihydrofolate reductase (DHFR), adenylate kinase (AK), and phosphoglycerate kinase (PGK). We characterized these six consensus proteins in terms of structure, stability, and function. We find that consensus sequences for all six protein families are quite soluble and adopt the native folds of their respective families. Strikingly, four of the six consensus proteins show increased thermodynamic stability compared with naturally occurring homologs; the other two consensus proteins show stabilities comparable to natural homologs. All consensus proteins assayed for biological activity retain their expected activities, including molecular recognition and enzymatic catalysis. An advantage of this multitarget comparison is that it allows us to examine sequence features of consensus-designed proteins and relate them to one another and to naturally occurring homologs. This shows that consensus proteins are enriched in charged residues and are depleted in polar uncharged residues, and highlights the importance of weakly conserved surface residues in enhancing stability through the consensus design strategy.

Results

Consensus Sequence Generation. The six families targeted for consensus design are shown in Fig. 1. To generate and refine MSAs from these families, we surveyed the Pfam (33), SMART (34), and InterPro (35) databases to obtain sets of aligned sequences for each family. For each family, we used the sequence set that maximized both the number of sequences and quality of the MSA (SI Appendix, Table S1). For five of the six protein families, we did not limit sequences using any structural or phylogenetic information. For AK, we selected sequences that contained a “long-type” LID domain containing a 27-residue insertion found primarily in bacterial sequences (36). The number of sequences in these initial sequence sets ranged from around 2,000 up to 55,000 (SI Appendix, Table S1).

To improve alignments and limit bias from copies of identical (or nearly identical) sequences, we removed sequences that deviated significantly (±30%) in length from the median length value, and removed sequences that shared greater than 90% sequence identity with another sequence. This curation reduced the size of sequence sets by an average of 65% (ranging from 30 to 80%; SI Appendix, Table S1). Final sequence sets ranged from 1,355 (NTL9) to 14,474 (SH3) sequences, and showed broad ranges of sequence variability (quantified by average pairwise identities among sequences in the set) and phylogenetic distribution (SI Appendix, Table S1).

MSAs showed varying levels of conservation across positions in sequence (SI Appendix, Fig. S1). Curated sequence sets were aligned, and consensus sequences were determined from the most probable residue at each position (SI Appendix, Table S2). Consensus sequences differ substantially from aligned sequences from which they are derived. The maximum identity between consensus sequences and the most similar extant sequence in the corresponding alignment ranges from 63% for SH2 to 80% for NTL9; the average pairwise identity between consensus sequences and each sequence in the corresponding alignment ranges from 58% for NTL9 and PGK to 40% for SH3 and SH2 (SI Appendix, Fig. S2).

Structural Characterization of Consensus Proteins. To determine whether the six consensus sequence proteins adopt their target folds, we expressed and purified consensus proteins (hereafter denoted with a “c,” e.g., cNTL9) for each of the six protein families and characterized them using circular dichroism (CD) and NMR spectroscopies. Far-UV CD spectra for cNTL9, cAK, and cPGK show minima at 222 and/or 208 nm (Fig. 2A), consistent with α-helical secondary structure. Spectra for cSH2 and cDHFR show single minima at 219 and 214 nm, respectively, consistent with predominantly β-sheet secondary structure. Consensus SH3 shows a far-UV CD spectrum with minima at 228 and 204 nm and a maximum at 219 nm, similar to published spectra for a naturally occurring SH3 domain (37).

With the exception of cDHFR, ¹H–¹⁵N heteronuclear single quantum coherence (HSQC) and transverse relaxation optimized spectroscopy (TROSY) (for cPGK) NMR spectra for each consensus protein show sharp resonances that are well dispersed in the ¹H dimension (Fig. 2 B–G), indicating that these consensus proteins adopt well-folded tertiary structures. These six consensus proteins show between 88 and 99% of expected cross-peaks in the ¹H–¹⁵N HSQC and TROSY (cPGK) spectra, suggesting that the consensus proteins maintain rigid tertiary structure over most of their sequence. The high signal-to-noise and sharp cross-peaks indicate that these consensus proteins have high solubility at NMR concentrations (ranging from 400 to 800 μM).

For cDHFR, the ¹H–¹⁵N HSQC spectrum shows rather broad resonances and only around 60% of expected cross-peaks (SI Appendix, Fig. S3). However, addition of the substrate analog methotrexate both sharpens resonances and increases the number of resolved cross-peaks (Fig. 2E), suggesting that cDHFR may undergo a large-scale conformational change and rigidification

Fig. 1. Targets for consensus design. A representative structure of an extant sequence is shown for each target family (NTL9, 2HBB; SH3, 1LKK; SH2, 4U1P; DHFR, 5DFR; AK, 1ANK; PGK, 1PHP). Length of consensus sequence, biological function, and number of sequences used in final MSA are noted.
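
The curation and consensus-calling steps described under Consensus Sequence Generation can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy alignment, the greedy redundancy filter, and the choice to skip columns whose most common symbol is a gap are all assumptions made for illustration, and identity is computed here on already-aligned sequences.

```python
from collections import Counter
from statistics import median

GAP = "-"  # assumed gap symbol in the aligned sequences


def ungapped_length(seq):
    """Number of residues in an aligned sequence, ignoring gaps."""
    return sum(1 for c in seq if c != GAP)


def filter_by_length(aligned, tol=0.30):
    """Keep sequences whose length is within +/- tol of the median length."""
    med = median(ungapped_length(s) for s in aligned)
    return [s for s in aligned if abs(ungapped_length(s) - med) <= tol * med]


def identity(a, b):
    """Fraction of identical positions, skipping columns where both are gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP or y != GAP]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def filter_redundant(aligned, max_identity=0.90):
    """Greedy filter (an assumption): drop any sequence sharing more than
    max_identity with a sequence that has already been kept."""
    kept = []
    for s in aligned:
        if all(identity(s, k) <= max_identity for k in kept):
            kept.append(s)
    return kept


def consensus(aligned):
    """Most frequent symbol per column; gap-dominated columns are skipped."""
    out = []
    for column in zip(*aligned):
        residue, _ = Counter(column).most_common(1)[0]
        if residue != GAP:
            out.append(residue)
    return "".join(out)


# Hypothetical toy alignment, not data from the paper.
toy_msa = ["MKVIL", "MKVIL", "MRVLL", "MKAIL", "MK---"]
curated = filter_redundant(filter_by_length(toy_msa))
print(len(curated), consensus(curated))
```

In this toy case the length filter removes the truncated sequence and the identity filter removes the exact duplicate before the per-column vote. For sequence sets of the size reported here, a dedicated clustering tool would likely replace the quadratic greedy filter.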


Fig. 2. Structural features of consensus proteins. (A) Far-UV CD spectra for all consensus proteins. (B–F) ¹H–¹⁵N HSQC spectra for (B) cNTL9, (C) cSH3, (D) cSH2, and (E) cDHFR with addition of 1:1 molar equivalents of methotrexate, and (F) cAK at 600 MHz. (G) ¹H–¹⁵N TROSY spectrum for consensus cPGK at 800 MHz. (H–J) Secondary chemical shift-based secondary structure probabilities using backbone resonances calculated by TALOS-N for (H) cNTL9, (I) cSH3, and (J) cSH2. Schematics above plots represent sequence positions of secondary structures in homology models of consensus sequences.

upon binding. This behavior has been observed in a study of two naturally occurring DHFR proteins (38).

To determine the location of α-helices and β-strands, we assigned backbone resonances for cNTL9, cSH3, and cSH2. Using standard triple-resonance experiments, we were able to assign backbone ¹H and ¹⁵N resonances, as well as ¹³C Cα, Cβ, and C′ chemical shifts, for 96%, 96%, and 89% of residues for cNTL9, cSH3, and cSH2, respectively (SI Appendix, Figs. S4–S6). Secondary chemical shift-based secondary structure predictions using these resonances show α-helix and β-strand boundaries that largely match secondary structure locations in consensus homology models, suggesting that the consensus proteins adopt their archetypal folds (Fig. 2 H–J).

Equilibrium Stability of Consensus Proteins. To determine equilibrium folding free energies, we measured guanidine hydrochloride (GdnHCl)- and temperature-induced unfolding transitions for all consensus proteins using a combination of CD and fluorescence spectroscopies. All consensus proteins show sigmoidal GdnHCl-induced unfolding transitions (Fig. 3A and SI Appendix, Fig. S8), indicating that the consensus proteins unfold in a cooperative manner similar to naturally occurring proteins. Four of the six consensus proteins showed sigmoidal thermal unfolding transitions, with Tm values ranging from 62 to 87 °C. For cNTL9 and cSH2, unfolding transitions were not observed up to the maximal temperature of 93 °C; addition of 4 and 2 M GdnHCl revealed thermal unfolding transitions, demonstrating that these proteins are hyperstable, with thermal unfolding transitions significantly above 93 °C in the absence of denaturant.

Of the four proteins that showed thermal unfolding transitions in the absence of GdnHCl, two (cSH3 and cAK) were not fully reversible. In contrast, by chemical denaturation, all consensus proteins were found to fold and unfold reversibly, allowing us to determine equilibrium folding free energies and m values by fitting a two-state linear extrapolation model to all consensus protein GdnHCl denaturations. Folding free energies ranged from −3.2 kcal/mol for cSH3 to −13.6 kcal/mol for cAK (Table 1). Measured m values for cNTL9 and cSH2 match values predicted from an empirical relationship based on chain length (39, 40) within 5%, suggestive of two-state folding. The measured m value for cSH3 is 30% larger than the predicted value. Measured m values for cDHFR, cAK, and cPGK are 20%, 50%, and 70% smaller than predicted values, suggesting the population of partly folded states in the folding transition region. As a result, estimated folding free energies for these constructs are likely to underestimate the free energy difference between the native and denatured states.

To compare the measured folding free energies of the consensus proteins with extant sequences, we gathered folding free energy values from the literature for naturally occurring sequences for each protein family. We limited our search to folding free energy values determined by chemical denaturation experiments and to proteins that appear to be monomeric.† For comparison, we have included the consensus homeodomain (cHD) that we recently characterized with this approach (32). The stabilities of six of the seven consensus proteins are greater than the mean stabilities of the extant homologs (Fig. 3B and SI Appendix, Table S3), with stability increases ranging from 8.3 kcal/mol for cAK to 1.4 kcal/mol for cPGK (although, for cPGK, fitted ΔG° parameters are likely to underestimate the free energy difference between native and denatured ensembles since the unfolding transition is almost certainly multistate). The sole exception to this observation is cSH3, which shows a stability that is 0.3 kcal/mol less than the mean value for the extant homologs. Furthermore, stabilities of five of the seven consensus proteins (cNTL9, cHD, cSH2, cDHFR, and cAK) are greater than that of the most stable extant homolog, ranging from 0.6 kcal/mol for cSH2 to 3.8 kcal/mol for cAK (Fig. 3B and SI Appendix, Table S3).

†Although for most families this search resulted in five or more free energy values, providing a good representation of the average stability, all of these stabilities may be biased by experimental constraints (expression, solubility, and baseline-resolved folding transitions).

Characterization of cSH3 Peptide Binding. To measure binding of cSH3 to a target peptide, we acquired a synthetic peptide (Ac-PLPPLPRRALSVW-NH2), which contains a proline-rich motif used in previous binding studies of the human Fyn SH3 domain, a well-studied extant sequence with peptide contact residues that match those in consensus SH3 (41, 42). ¹H–¹⁵N HSQC spectra of ¹⁵N-labeled cSH3 at increasing concentrations of unlabeled peptide show shifts in some (but not all) peaks (SI Appendix, Fig. S9). This behavior is consistent with formation of a complex that is in fast exchange on the chemical shift timescale. When plotted on a homology model of cSH3, the largest chemical shift perturbations
(CSPs) cluster on the canonical peptide-binding site (Fig. 4A). A global fit of a single-site binding model to the 10 residues displaying the largest CSPs gives a Kd of 795 ± 13 μM (Fig. 4B; the reported uncertainty is at a 67% confidence level derived from 5,000 bootstrap iterations in which residuals are randomly resampled with replacement). This measured binding affinity for cSH3 is ∼5,000-fold weaker than a binding affinity reported for human Fyn SH3 domain binding the same peptide sequence (41).

Fig. 3. Equilibrium folding of consensus proteins. (A) Representative guanidine hydrochloride-induced folding/unfolding curves for consensus proteins. Data were collected using CD (cDHFR, cAK, and cPGK) or fluorescence (cNTL9, cSH3, and cSH2) spectroscopies. The solid lines are obtained from fitting a two-state model to the data. (B) Comparison of measured folding free energies of consensus proteins to those of extant sequences. The open circles represent folding free energies of extant eukaryotic or mesophilic (blue) and thermophilic (red) sequences. The black dashed lines represent measured folding free energies of consensus sequences. Folding free energy values and sources are reported in SI Appendix, Table S3.

Steady-State Kinetics of Consensus Enzymes. To investigate how the consensus design strategy affects enzymatic activity, we characterized steady-state kinetics of catalysis for the three consensus enzymes in our set. DHFR, AK, and PGK catalyze various chemical reactions important for cellular metabolism:

Dihydrofolate + NADPH ⇌ Tetrahydrofolate + NADP+

ATP + AMP ⇌ 2ADP

3-phosphoglycerate + ATP ⇌ 1,3-bisphosphoglycerate + ADP.

For each enzyme, we measured reaction velocities at 20 °C as a function of substrate concentration, monitoring the oxidation of NAD(P)H spectrophotometrically either as a direct readout of activity (DHFR) or in an enzyme-coupled reaction (Materials and Methods). Strikingly, all three consensus enzymes show substantial activity (Fig. 5). Michaelis constants for the consensus enzymes are similar to those of extant enzymes, ranging from 10-fold higher to 3-fold lower than values determined under similar conditions (Table 2). Steady-state kcat values determined at 20 °C for the consensus enzymes are all smaller than those for mesophilic homologs (Table 2) but range from comparable to (1.3-fold lower) to slightly lower (roughly 6-fold lower) kcat values for thermophilic homologs.

The high thermal stability of the consensus enzymes allows us to measure activity at high temperatures where reaction rates likely increase considerably, providing a better comparison with thermophilic homologs. To determine the effects of temperature on activity, we measured kcat values for the consensus enzymes over a range of temperatures. For cDHFR, where absorbance provides a direct readout of catalysis, we were able to access temperatures up to 50 °C. However, for cPGK, we were restricted to 40 °C due to the limited stability of the coupling enzymes used in the absorbance assay. For cAK, we were able to expand this temperature range to 70 °C using a ³¹P NMR-based assay (AK; see SI Appendix and ref. 43), which directly monitors conversion of reactants to products. Over the accessible temperature range, all consensus enzymes show exponential increases in catalytic rates with increased temperature, consistent with Arrhenius kinetics (SI Appendix, Fig. S10), allowing us to extrapolate kcat values to other temperatures. On the whole, the consensus enzymes show levels of activity comparable to thermophilic homologs (SI Appendix, Table S4), although there are variations depending on the consensus enzyme and the thermophile being compared. Consensus DHFR has a larger turnover number than two thermophilic homologs (by 5- and 31-fold). Consensus PGK shows both larger (1.8-fold increase) and smaller turnover numbers (1.2-, 4-, 5-, and 13-fold decreases). In contrast, cAK has a smaller turnover number (by 21-fold).

Sequence Properties of Consensus Proteins and Consensus Mismatches. The studies above show that, for all six families examined here, consensus design is successful. That is, proteins adopt well-folded structures, these structures match the fold of the families from which they derived, they retain biological activity, and importantly, their stabilities equal or (more often) exceed median stabilities of extant sequences. With this broad collection of consensus-designed proteins, we are in the position to ask whether there are any sequence features that set these consensus proteins apart from extant sequences, and if so, which features are most important for increased stability. Such properties include general position-independent features such as sequence composition, charge, polarity, and hydrophobicity, as well as position-specific properties within each family such as degree of conservation and surface accessibility.

General sequence features. We compared various sequence properties of our six consensus sequences to those of the naturally occurring sequences within our curated MSAs. As with the stability comparisons above, we have included the cHD that we have described previously (32). For every sequence (consensus and every sequence in each MSA), we calculated the proportion of sequence (averaging over all sites) made up of charged residues (D, E, K, and R), polar uncharged residues (C, H, N, Q, S, and T), total polar residues (the sum of charged and polar uncharged), and nonpolar residues. Similarly, we calculated the net charge of each sequence (the difference between the number of positively and negatively charged residues, assuming full positive charges for R and K, and full negative charges for D and E). For each metric, the naturally occurring sequences appeared approximately normally distributed (SI Appendix, Fig. S11).

Consensus sequences show strong biases within the distributions of natural sequences for some of these sequence and structural features. Most notably, consensus sequences all lie toward the upper-tail end of the distributions for proportion of charged residues, averaging 2.1 SDs above the mean (Fig. 6A). Conversely, consensus sequences lie toward the lower-tail end of the distributions for proportion of polar uncharged residues, averaging 2.1 SDs below the mean. These two biases offset in

Table 1. Equilibrium folding free energies of consensus proteins

Protein   ΔG°H2O (kcal·mol⁻¹)   m value (kcal·mol⁻¹·M⁻¹)   ΔΔG°H2O (kcal·mol⁻¹)
cNTL9     −7.9 ± 0.1            1.23 ± 0.02                −4.7
cSH3      −3.2 ± 0.1            1.88 ± 0.05                +0.3
cHD       −10.8 ± 0.5           1.60 ± 0.08                −5.3
cSH2      −7.2 ± 0.1            2.28 ± 0.01                −3.6
cDHFR     −10.1 ± 0.1           4.32 ± 0.03                −6.7
cAK       −13.6 ± 0.5           3.83 ± 0.13                −8.3
cPGK      −7.5 ± 0.2            4.34 ± 0.12                −1.4

ΔG°H2O and m values represent the folding free energy extrapolated to water and the denaturant sensitivity of the folding free energy determined from a two-state fit. Reported uncertainties are SEMs from three independent experiments. ΔΔG°H2O values represent the difference of the measured folding free energy value of the consensus protein and the mean folding free energy value of the naturally occurring proteins from the literature (SI Appendix, Table S3).
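
To show how the two parameters in Table 1 relate to an unfolding curve, the sketch below evaluates a two-state linear extrapolation model, in which the folding free energy is taken to vary linearly with denaturant concentration: ΔG([D]) = ΔG°H2O + m[D]. This is an illustration, not the authors' fitting code. The function names are hypothetical, the temperature of 298.15 K is an assumption (the text above does not state the measurement temperature), and the sloping native and denatured baselines that a real fit to CD or fluorescence data would include are omitted.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal·mol^-1·K^-1


def delta_g_fold(denaturant_molar, dg_water, m_value):
    """Folding free energy at a given denaturant concentration (kcal/mol)."""
    return dg_water + m_value * denaturant_molar


def fraction_folded(denaturant_molar, dg_water, m_value, temperature_k=298.15):
    """Two-state folded population; temperature_k is an assumed value."""
    dg = delta_g_fold(denaturant_molar, dg_water, m_value)
    # Unfolding equilibrium constant is exp(+dG_fold / RT).
    return 1.0 / (1.0 + math.exp(dg / (R_KCAL * temperature_k)))


def midpoint(dg_water, m_value):
    """Denaturant concentration at which folded and unfolded are equal."""
    return -dg_water / m_value


# (dG in water, m value) pairs copied from Table 1.
table1 = {
    "cNTL9": (-7.9, 1.23),
    "cSH3": (-3.2, 1.88),
    "cHD": (-10.8, 1.60),
    "cSH2": (-7.2, 2.28),
    "cDHFR": (-10.1, 4.32),
    "cAK": (-13.6, 3.83),
    "cPGK": (-7.5, 4.34),
}

for name, (dg, m) in table1.items():
    print(f"{name}: midpoint = {midpoint(dg, m):.2f} M GdnHCl")
```

Under this model the midpoint is simply −ΔG°H2O/m, so a protein with a modest free energy but a small m value (cNTL9) can have a higher midpoint than one with a larger free energy and a steep m value (cDHFR).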

terms of total polar (charged and uncharged) residues: Consensus sequences lie near the middle of the distribution for the proportion of total polar residues, averaging 0.1 SDs below the mean. Likewise, consensus sequences lie near the middle of the distribution for the proportion of nonpolar residues, as would be expected given the proportion of total polar residues. Thus, it appears that the consensus design strategy implicitly creates sequences that are preferentially enhanced in charged residues at the expense of polar uncharged residues, without a large perturbation of the number of total polar versus nonpolar residues. Two protein families, SH3 domain and homeodomain, deviate from this trend, both showing an increase in total polar residue content and a decrease in nonpolar residue content. The net charges of six of the seven consensus protein sequences are also close to mean values from the MSA, with DHFR showing the only significant deviation.

The observed increase in the proportion of charged residues in consensus sequences relative to extant sequences must arise from differences in the proportions of each of the charged residues individually. However, an increase in the charged residue content does not require that the proportion of each charged residue increases. To determine the specific residues responsible for this overall increase, we compared the proportions of individual residues over the extant sequences in each MSA (again averaging over all sites) to those in the consensus sequence. Many residues show consistent enrichment or depletion in consensus sequences relative to the extant sequences. For example, consensus sequences are enriched in D, E, and K, and depleted in C, N, Q, S, and T in all or six of the seven protein families (SI Appendix, Fig. S12). These trends are highlighted when the relative differences between the consensus sequence and extant sequences for each residue are averaged over the seven protein families. Consensus sequences show the largest enrichments in E, K, G, L, and D, the largest depletion in Q, S, and I, and marginal effects on all other residues (Fig. 6B). Thus, the increase in charged residues for the consensus proteins results from increased proportions of D, E, and K, but not R, while the decrease in polar uncharged residues results from decreased proportions (albeit modest proportions for some residues) of all polar uncharged residues.

Fig. 4. Peptide binding of cSH3 by NMR spectroscopy. (A) Chemical shift perturbations (CSPs) of backbone ¹H–¹⁵N HSQC cross-peaks resulting from addition of a 20-fold excess unlabeled peptide to ¹⁵N-labeled consensus SH3. The red and blue lines indicate 1 and 2 SDs, respectively. Inset shows a homology model of cSH3 aligned to a peptide-bound structure of human Fyn SH3 (PDB 1A0N; peptide shown in green). Residues showing CSPs greater than 1 and 2 SDs are shown with α-carbons as spheres and side-chain atoms as lines, and colored red and blue, respectively. (B) Binding isotherms for the 10 consensus SH3 residues showing largest CSPs. The solid lines are obtained from a global fit to a single-site binding model using a common dissociation constant for all residues.

Position-specific sequence features. In addition to the general sequence features described above, which average over each sequence, we also examined the features of sequence substitution in a position-specific way. At each position in the MSA, we compared the residue of each extant sequence to the consensus residue at that position. We separated these residue-specific comparisons into “consensus matches” and “consensus mismatches” for the positions at which extant residue matched or differed from the consensus residue, respectively. Clearly, all changes in structure, stability, and function in consensus constructs compared with extant homologs result from the residues that differ from the consensus. Thus, the properties of these “consensus mismatches” are important for understanding, for example, the increase in stability we have observed for most of our consensus constructs.

On average, there will be more consensus mismatches at positions with low sequence conservation, although there will also be some number of mismatches at positions with high conservation. One simple question that can be asked about consensus mismatches is “how biased are consensus mismatches towards positions of low conservation?” To quantify the extent to which consensus mismatches are made at sites with high conservation, we calculated the sequence entropy (44) at each position i in each MSA using the following formula:

$$S_{\mathrm{seq},i} = -\sum_{j \in \{20\ \mathrm{AA}\}} p_j \ln p_j \qquad [1]$$

where p_j is the frequency of residue j occurring at position i, and plotted this distribution for each family (purple distributions in Fig. 6C and SI Appendix, Fig. S13). We then weighted this distribution by the fraction of sequences in the MSA that differ from the consensus at each position:

$$S_{\mathrm{mismatch},i} = p_{\mathrm{mismatch},i} \times S_{\mathrm{seq},i} \qquad [2]$$

(red distributions in Fig. 6C and SI Appendix, Fig. S13), and the fraction that match the consensus at each position:

$$S_{\mathrm{match},i} = p_{\mathrm{match},i} \times S_{\mathrm{seq},i} \qquad [3]$$

(blue distributions in Fig. 6C and SI Appendix, Fig. S13).
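
Eqs. 1–3 can be evaluated for a single alignment column with a few lines of Python. The sketch below is illustrative only: the function name and the toy columns are hypothetical, gaps are assumed to have been removed from the column beforehand, and ties for the most frequent residue are broken by first occurrence, which is an arbitrary choice.

```python
import math
from collections import Counter


def column_stats(column):
    """Sequence entropy of one MSA column and its match/mismatch weighting.

    `column` is a string or list of residues at one alignment position,
    assumed here to contain no gap characters.
    """
    counts = Counter(column)
    total = len(column)
    freqs = {res: n / total for res, n in counts.items()}

    # Eq. 1: natural-log entropy; residues with zero frequency contribute 0.
    s_seq = -sum(p * math.log(p) for p in freqs.values())

    # Consensus residue = most frequent residue at this position.
    consensus_residue = max(freqs, key=freqs.get)
    p_match = freqs[consensus_residue]
    p_mismatch = 1.0 - p_match

    return {
        "consensus": consensus_residue,
        "S_seq": s_seq,
        "S_mismatch": p_mismatch * s_seq,  # Eq. 2
        "S_match": p_match * s_seq,        # Eq. 3
    }


# Hypothetical toy columns, not data from the paper.
for col in ("AAAA", "AAAG", "ADEK"):
    stats = column_stats(col)
    print(col, stats["consensus"], round(stats["S_seq"], 3),
          round(stats["S_mismatch"], 3), round(stats["S_match"], 3))
```

Because p_match and p_mismatch sum to 1 at every position, the two weighted entropies always add back up to the unweighted entropy. With natural logarithms and 20 amino acids, the entropy at a position is bounded above by ln 20, roughly 3.0.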

A comparison of the distribution of consensus mismatches occurring at surface, intermediate, and buried positions shows that, for six of the seven proteins, consensus mismatches occur most frequently at surface positions (SI Appendix, Fig. S14; darker bars). Overall, mismatches occur at surface, intermediate, and buried positions with median proportions of 0.49, 0.27, and 0.22, respectively. However, these proportions depend on the overall proportions of surface, intermediate, and buried residues that make up each protein. The four smaller domain families of NTL9, SH3, HD, and SH2 all have more surface residues than intermediate or buried residues (SI Appendix, Fig. S14; lighter bars). In contrast, the larger proteins DHFR, AK, and PGK have more buried residues than surface or intermediate residues (consistent with a decreased surface-to-volume ratio for larger proteins).

To account for this underlying difference in surface accessibility, we divided the conditional probabilities that consensus mismatches are located at surface, intermediate, and buried positions [i.e., p(X | mismatch), where X represents the three structural classes; dark bars in SI Appendix, Fig. S14] by the overall marginal probabilities of the three structural classes (surface, intermediate, buried; light bars in SI Appendix, Fig. S14). By Bayes’ theorem, this ratio represents the true enrichment (or suppression) of consensus mismatches at surface, intermediate, and buried sites. This ratio shows an enrichment of consensus mismatches at surface positions for all seven protein families (Fig. 6D), with an average factor of 1.21. In contrast, there is a suppression of consensus mismatches at core positions in all seven protein families, with an average factor of 0.71. At intermediate positions, mismatches are neither enhanced nor suppressed (Fig. 6D; with an average factor of 1.03).
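The enrichment ratio p(X | mismatch) / p(X) described above can be sketched in a few lines of Python. By Bayes’ theorem it equals p(mismatch | X) / p(mismatch), which is why it corrects for the differing numbers of surface, intermediate, and buried sites. The function name and the toy per-position counts are hypothetical and are not taken from the paper's data.

```python
from collections import Counter

CLASSES = ("surface", "intermediate", "buried")

def mismatch_enrichment(site_classes, mismatch_counts):
    """Return p(X | mismatch) / p(X) for each structural class X.

    site_classes: structural class assigned to each consensus position.
    mismatch_counts: number of extant residues that differ from the
        consensus at each position (same order as site_classes).
    Assumes every class in CLASSES occurs at least once.
    """
    total_sites = len(site_classes)
    total_mismatches = sum(mismatch_counts)
    sites_per_class = Counter(site_classes)
    mismatches_per_class = Counter()
    for cls, n in zip(site_classes, mismatch_counts):
        mismatches_per_class[cls] += n
    ratios = {}
    for cls in CLASSES:
        p_x = sites_per_class[cls] / total_sites                       # marginal p(X)
        p_x_given_mm = mismatches_per_class[cls] / total_mismatches    # p(X | mismatch)
        ratios[cls] = p_x_given_mm / p_x
    return ratios

# Hypothetical toy protein with 4 surface, 3 intermediate, and 3 buried sites.
site_classes = ["surface"] * 4 + ["intermediate"] * 3 + ["buried"] * 3
mismatch_counts = [30, 25, 20, 25, 15, 15, 15, 10, 5, 10]
ratios = mismatch_enrichment(site_classes, mismatch_counts)
```

A ratio above 1 indicates enrichment of mismatches in that structural class and a ratio below 1 indicates suppression, matching the interpretation of Fig. 6D.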
As expected, we see a bias of consensus matches toward low sequence entropy positions and a bias of consensus substitutions toward high entropy positions. For example, at positions that have entropies below 0.5, nearly all sequences in the MSA match the consensus. Likewise, at positions that have entropies above a value of 2, most sequences in the MSA differ from consensus. However, at positions with intermediate entropies (ranging from 1 to 1.5), there are significant numbers of sequences in each MSA that differ from consensus. It is possible that these substitutions disproportionately contribute to the observed stability increase of consensus proteins.

Finally, we examined the extent to which these sequence differences occur at surface positions. Since surface residues tend to contribute less to overall stability than buried residues (45, 46), we might expect the stability increases of consensus proteins to have their origins in substitutions at buried positions. However, because surface positions tend to be more highly variable, we expect most consensus substitutions to occur at surface sites; as a result, surface residues may contribute to stability enhancement through the collective effect from many substitutions. Consistent with this, Makhatadze and coworkers (47, 48) have shown that stability can be increased through careful placement of surface charged residues. To identify surface positions within each MSA, we generated a homology model from the corresponding consensus sequence, calculated the side chain solvent-accessible surface area at each position, and referenced to the average side chain solvent-accessible surface area of the same residue in an extended peptide (SI Appendix, Materials and Methods).

Fig. 5. Steady-state kinetics of consensus enzymes. Representative Michaelis–Menten curves generated for each substrate for cDHFR (A, varying DHF with 150 μM NADPH; B, varying NADPH with 100 μM DHF), cAK (C, varying AMP with 5 mM ATP), and cPGK (D, varying 3-PG with 20 mM ATP; E, varying ATP with 10 mM 3-PG). The solid lines are obtained from fitting a Michaelis–Menten model to the data. In A, full activity was obtained at the lowest DHF concentration we could measure; as a result, the Km value for DHF could not be accurately determined.

Discussion

There are a number of reports in the literature in which consensus design has successfully been used to produce proteins that adopt their archetypical fold. This approach has found recent success for linear repeat-protein targets (49–52) but has also shown some success for globular protein targets (13, 14, 23, 26, 27). In most reports, globular consensus proteins have enhanced equilibrium stability and sometimes retain biological activity. However, point-substitution studies suggest stability gains from wholesale consensus substitution may be marginal, and protein engineering studies with a WW domain suggest that consensus proteins may lack the coupling energies needed to fold (53). Because publications of successful design of globular proteins have so far been “one target at a time,” and since unsuccessful designs are unlikely to be published, it is difficult to estimate how likely the consensus approach is to yield folded proteins, and to determine how to improve on the approach when it performs poorly. The studies presented here, where we generate and characterize consensus sequences for six protein families, provide an unbiased look at rates of success and failure. Comparisons for different targets are aided by the use of a common design protocol for all targets, a common purification pipeline, and a uniform set of protocols for measuring stability, structure, and function. This multitarget approach also allows us to evaluate sequence features that are likely to contribute to consensus stabilization in a general way, and correlate these features with measured biochemical properties.

Our results show that consensus sequence design is both a general and successful strategy for small and large globular proteins with diverse folds. For all six protein families targeted, the resulting consensus proteins expressed to high levels, remained soluble in solution, and adopted a well-folded tertiary structure. Structural information obtained from CD and NMR spectroscopies suggests that all of the consensus proteins adopt their archetypal fold. Support for folding is provided by the observation that all consensus proteins tested show their expected biological activities (Figs. 2, 4, and 5).

Our findings indicate that thermodynamic stabilization of consensus proteins is the norm, rather than the exception. Four

Table 2. Steady-state kinetic parameters of catalysis for consensus and naturally occurring enzymes

Enzyme                              Substrate   kcat, s⁻¹     Km, μM         kcat/Km, M⁻¹·s⁻¹
Consensus DHFR                      DHF         1.7 ± 0.1     <5*            >3.4 × 10⁵
Escherichia coli DHFR†              DHF         12.3 ± 0.7    0.7 ± 0.2      1.8 ± 0.03 × 10⁷
Bacillus stearothermophilus DHFR‡   DHF         2.2           3.5            6.3 × 10⁵
Consensus DHFR                      NADPH       1.6 ± 0.1     15 ± 2         1.0 ± 0.1 × 10⁵
E. coli DHFR†                       NADPH       10 ± 1        4.8 ± 1        2.1 ± 0.7 × 10⁶
B. stearothermophilus DHFR‡         NADPH       2.3           1.5            1.5 × 10⁶
Consensus AK                        AMP         5.1 ± 0.6     86 ± 10        5.9 ± 1.0 × 10⁴
E. coli AK§                         AMP         400 ± 10      120 ± 20       3.3 ± 0.6 × 10⁶
Aquifex aeolicus AK¶                AMP         30 ± 10       21             1.4 × 10⁶
Consensus PGK                       3-PG        12.8 ± 0.1    190 ± 10       6.7 ± 0.3 × 10⁴
Yeast PGK#                          3-PG        335           590            5.6 × 10⁵
Thermotoga maritima PGK‖            3-PG        85            30             NR
Consensus PGK                       ATP         14.1 ± 2.1    1,200 ± 400    1.2 ± 0.4 × 10⁴
Yeast PGK#                          ATP         345           370            4.1 × 10⁵
T. maritima PGK‖                    ATP         85            NR             NR

Michaelis–Menten parameters for consensus enzymes (top), a mesophilic enzyme (middle), and a thermophilic enzyme (bottom). Uncertainties reported for consensus enzymes represent SEMs from two independent experiments. Experiments for consensus enzymes were performed at 20 °C. NR denotes values that are not reported in the study.

*The Km value for DHF was found to be too low to accurately measure under experimental conditions. Consensus DHFR showed full activity at the lowest DHF concentration tested (5 μM).
†Data are from ref. 76. Experiments were performed at 25 °C.
‡Data are from ref. 77. Experiments were performed at 20 °C.
§Data are from ref. 78. Experiments were performed at 25 °C.
¶Data are from ref. 59. Experiments were performed at 20 °C.
#Data are from ref. 75. Experiments were performed at 25 °C. Errors were reported as <5%.
‖Data are from ref. 79. Experiments were performed at 20 °C.
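As a quick consistency check on Table 2, the catalytic efficiency kcat/Km can be recomputed from the tabulated kcat (s⁻¹) and Km (μM) values. The sketch below is illustrative only: the helper name `catalytic_efficiency` is hypothetical, and because the inputs are rounded, the recomputed values are expected to agree with the table only to within rounding and the reported uncertainties.

```python
def catalytic_efficiency(kcat_per_s, km_micromolar):
    """Return kcat/Km in M^-1 s^-1 from kcat (s^-1) and Km (micromolar)."""
    return kcat_per_s / (km_micromolar * 1e-6)  # convert uM to M

# (kcat, Km) central values for the consensus enzymes, copied from Table 2.
consensus_rows = {
    ("cDHFR", "NADPH"): (1.6, 15.0),
    ("cAK", "AMP"): (5.1, 86.0),
    ("cPGK", "3-PG"): (12.8, 190.0),
    ("cPGK", "ATP"): (14.1, 1200.0),
}

efficiencies = {
    key: catalytic_efficiency(kcat, km)
    for key, (kcat, km) in consensus_rows.items()
}

for (enzyme, substrate), value in efficiencies.items():
    print(f"{enzyme} ({substrate}): kcat/Km = {value:.2e} M^-1 s^-1")
```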

of the six [and five of the seven including the previously reported homeodomain (32)] consensus proteins showed stabilities greater than the most stable naturally occurring sequences identified in our literature search, and the other two showed stabilities near and above average (Fig. 3 and SI Appendix, Table S3). In short, consensus design can be expected to provide stability enhancements about three-quarters of the time. This result is particularly surprising, given that proteins are not under direct evolutionary selection for maximal stability (54, 55), but must simply retain sufficient stability to remain folded. It appears that by taking the most probable residue at each position, a large number of small stability enhancements sum to a large net increase in stability. Although naturally occurring extant proteins are of lower stability (perhaps because high stabilities are not required for function), this information gets encoded in alignments of large numbers of sequences. Taken at face value, these results highlight the importance of information encoded at the level of single residues. Though single-residue information does not explicitly include pairwise sequence correlations, favorable pairs may be retained in our consensus designs as long as each residue in the pair is highly conserved. Thus, the extent to which consensus sequences capture such “accidental” correlations and their contributions to the observed effects on stability require further studies.

The source of the relative instability of cSH3 is unclear. The consensus SH3 and the extant sequences from which it was generated have some exceptional features, although none of these is unique to SH3. The SH3 sequences are among the shortest sequence families examined (54 residues), although not the shortest (NTL9 is 46 residues). The consensus SH3 sequence has a high fraction of polar residues and the lowest fraction of uncharged residues (Fig. 6A), although the cHD is a close second. The SH3 sequence is also the only protein in our set that has an all-β structure. Taxonomically, sequences in the SH3 MSA show the lowest pairwise identity (26%; SI Appendix, Table S1), although the SH2 family is a close second (28%). It may be noteworthy that the increase in stability of cSH2 compared with the available extant values is the smallest stability increment observed for those proteins likely to conform to a two-state mechanism. It may also be noteworthy that, along with SH2, the SH3 multiple sequence alignment is uniquely dominated by eukaryotic sequences. In a study of consensus superoxide dismutase sequences, Goyal and Magliery (56) found successful consensus design to be highly dependent on the phylogeny represented in the MSA.

A particular advantage of consensus sequence design is that it draws upon the natural evolutionary history of a protein family. As a result, residues important for function are likely to be conserved (57). However, it cannot be assumed a priori that the resulting consensus proteins will show biological function, since consensus sequences are novel sequences that have experienced no evolutionary selection for function. Importantly, all consensus proteins we assayed for function maintained some level of expected biological activities of both molecular recognition and enzymatic catalysis (Figs. 4 and 5). This result, combined with previously reported studies showing consensus protein function (22–31), indicates that information necessary for protein function is retained in averaging over many sequences that each individually contain functionally important information.

In both this study and our previous investigations of a cHD (32), consensus substitution showed varying effects on molecular recognition. Consensus HD and cSH3 each showed two to three orders of magnitude differences in their binding affinities to cognate substrates relative to naturally occurring sequences, with cHD showing higher affinity and cSH3 showing lower affinity. The origins of these differences in substrate binding affinities remain unclear. It is possible that the sequences used to obtain the cHD sequence bind similar sequences (indeed, many of these sequences are from the engrailed superfamily), resulting in an “optimized” homeodomain, whereas sequences used to obtain the consensus SH3 possess different specificities, resulting in a sequence whose binding affinity has been “averaged out.” Testing this explanation will require an investigation of the binding specificities of the consensus proteins as well as those of the extant sequences used to generate them.

Fig. 6. Sequence properties of consensus sequences. (A) Z scores (the number of SDs that separate the consensus sequence from the mean value of sequences in the MSAs; SI Appendix, Eq. 3) for various sequence properties. Distributions for all protein families are shown in SI Appendix, Fig. S11. (B) Differences between residue frequencies in the consensus sequences and the MSAs averaged over all seven protein families. Residues are colored as follows: polar charged (red), polar uncharged (blue), and nonpolar (black). The vertical offset is used for clarity. (C) Distributions of sequence entropy values for all positions in the PGK MSA (purple), positions at which residues in extant sequences differ from the consensus sequence (consensus mismatches; red), and positions at which residues in extant sequences match the consensus sequence (consensus matches; blue) for PGK. Sequence entropy distributions for all protein families are shown in SI Appendix, Fig. S13. (D) Ratios of conditional probabilities of different structural environments (surface, intermediate, and buried; “X” in the y label) for consensus mismatches relative to overall probabilities of surface, intermediate, and buried residues at all positions. Conditional and overall probabilities for all protein families are shown in SI Appendix, Fig. S14. The legend is as in A.

For the three enzymes examined here, consensus substitution shows variable effects on catalysis. At low temperatures (20 °C), steady-state turnover numbers for all consensus enzymes were smaller than those for naturally occurring mesophilic sequences, but on par with those of thermophilic homologs (Table 2). This is consistent with the observation that thermophilic proteins show lower catalytic activities at low temperatures than their less stable mesophilic counterparts (58, 59). At higher temperatures, cPGK has a kcat value comparable to thermophilic homologs, whereas cDHFR and cAK have higher and lower kcat values, respectively. On the whole, this observation demonstrates that consensus enzymes can (but sometimes do not) achieve the same level of activities as their naturally occurring counterparts. The observed inverse correlations between enzyme stability and catalytic rates have widely been interpreted as resulting from a trade-off between dynamics and catalysis (60). As this issue is still debated (61, 62), consensus sequence design may offer a promising avenue to gain insights into the relationships among protein stability, dynamics, activity, and evolution.

The consensus design strategy used here appears to impart a strong bias on sequence: Taking the most probable residue at each position does not result in average composition. This bias may have significant effects on stability and function. The consensus sequences all have a high content of charged residues and low content of polar uncharged residues (Fig. 6). Consensus substitutions from uncharged to charged residues show a stronger bias toward positions of higher sequence entropy than substitutions among uncharged residues (SI Appendix, Fig. S15). Thus, the overall enrichment of charged residues in consensus sequences results from charged residues (E, D, and K) “winning” over uncharged residues at positions with low conservation.

Similar (but not identical) compositional biases have also been observed in thermophilic sequences, consistent with the high stabilities observed for the consensus proteins. Like our consensus sequences, thermophilic proteins have been shown to be enriched in E and K, and depleted in A, C, H, Q, S, and T (63, 64). However, thermophilic proteins have also been shown to be enriched in Y, R, and I, which are at or below average composition in our consensus sequences. Although it might be expected that the inclusion of sequences from thermophilic organisms in our MSAs contributes both to the composition bias and to high stabilities, most of the sequences in our MSAs are from mesophilic organisms. The sequences in the SH3 and SH2 MSAs are predominantly eukaryotic (as were our previous cHD sequences); aside from a small number of moderately thermophilic fungi, these sequences all derive from mesophiles. For the other four protein families (NTL9, DHFR, AK, and PGK), MSAs are composed of at most 5% of sequences from thermophilic or hyperthermophilic bacteria or archaea (SI Appendix, Table S5). If the identified thermophilic sequences are removed from the MSA before consensus sequence generation, the resulting consensus sequences have identities of 98.6% or greater to the consensus sequence derived from the full MSAs (SI Appendix, Table S5).‡

‡Though the 5% estimate of thermophilic sequences is likely to undercount the number of thermophilic sequences in our MSA, since not all sequences could be unambiguously assigned to a source organism, the total number of thermophiles in our MSAs is not likely to exceed 16% (the ratio of identified thermophiles to thermophiles plus mesophiles).

Makhatadze et al. (47, 48) have been able to increase stability by introducing charged residues at surface positions of several proteins and optimizing electrostatic interactions. It is unclear to what extent the locations of consensus charged residues optimize electrostatic interactions, or whether additional stability increases can be obtained by charge shuffling. Increases in stability and solubility have also been reported for “supercharged” proteins, which have similar numbers of charged residues (65); however, the consensus proteins studied here are generally close to electroneutrality, whereas supercharged proteins have highly imbalanced positive or negative charge.

It should be noted, however, that some of the consensus sequences deviate from the general trends observed in sequence biases. For instance, consensus SH3 and HD have a greater percentage of polar residues and a lower percentage of nonpolar residues, and cDHFR has a much lower net charge, compared with the average for each MSA. Thus, consensus sequence statistics appear to abide by general trends but not absolute rules.

Similarly, analysis of the positions at which the consensus sequences differ from extant sequences highlights important aspects about the consensus design strategy. These consensus mismatches occur mainly at positions with relatively low conservation and positions on the protein surface (Fig. 6), consistent with the well-known correlation between residue conservation and solvent-accessible surface area (66). This may highlight an implicit advantage of consensus sequence design, since substitutions at core positions are often destabilizing (67). However, the large observed effects of consensus substitution on both stability and activity indicate that these weakly conserved and surface positions play a sizable role in both stability and function, and considerable gains can be made by optimizing these positions. This is consistent with the observed functional impacts of nonconserved “rheostat” substitutions on the surface of lac repressor (68). Furthermore, the importance of these weakly conserved positions suggests that using a large number of sequences may be a key component of successful consensus design, since weakly conserved positions are most sensitive to phylogenetic noise and misalignment (69).

Our work here demonstrates that the consensus sequence design method is both a general and successful strategy to design proteins of high stability that retain biological activity. Compared with other rational, structure-based, or directed evolution methods, consensus sequence design provides a simple route to

accomplish longstanding goals of protein design. Furthermore, its foundation in provides a promising avenue to address questions regarding the relationships of protein sequence, biophysics, and evolution.

Materials and Methods

Design of Consensus Sequences. Sequences for each family were gathered from Pfam (33), SMART (34), or InterPro (35) databases. Resulting sequence sets for each domain were filtered by sequence length, removing sequences 30% longer or shorter than the median sequence length of the set. To avoid bias from sequence groups with high identity, we used CD-HIT (70) to cluster sequences at 90% identity and selected a single representative sequence from each cluster. This curated sequence set was used to generate an MSA using MAFFT (71).

At each position of the MSA, frequencies were determined for the 20 amino acids along with a gap frequency using an in-house script. Positions occupied by residues (as opposed to a gap) in at least half of the sequences were included as positions in the consensus sequence. The most frequent residue at each of these “consensus positions” was gathered to create the consensus sequence for that protein family.

NMR Spectroscopy. 15N- and 13C,15N-isotopically labeled proteins were expressed and purified as described in SI Appendix. NMR samples, data acquisition, and data analysis are also described in SI Appendix.

Peptide binding to cSH3 was monitored by heteronuclear NMR spectroscopy. A putative SH3-binding peptide (Ac-PLPPLPRRALSVW-NH2) was synthesized by GenScript. Samples containing 200 μM 15N-labeled cSH3 were prepared at 0-, 0.05-, 0.125-, 0.5-, 1.25-, 2.5-, 5-, 10-, and 20-fold molar equivalents of unlabeled peptide. 1H–15N HSQC spectra were collected on a Bruker Avance 600-MHz spectrometer in 25 mM NaPO4 (pH 7.0), 150 mM NaCl, and 5% D2O at 25 °C. 1H–15N chemical shifts varied monotonically with peptide concentrations such that assignments from the apo-protein could be transferred to the bound state. CSPs were calculated and globally fit using a single-site binding equation as described in SI Appendix.

CD and Fluorescence Spectroscopies. CD measurements were collected on an Aviv Model 435 CD spectropolarimeter. Far-UV CD spectra were collected using a 1-mm cuvette with protein concentrations ranging from 2 to 31 μM at 20 °C, averaging for 5 s with a 1-nm step size. Consensus NTL9, cSH3, cSH2, and cDHFR were collected in 150 mM NaCl and 25 mM NaPO4 (pH 7.0). Consensus AK and cPGK were collected in 50 mM NaCl, 0.5 mM TCEP, and 25 mM Tris·HCl (pH 8.0).

GdnHCl- and temperature-induced folding/unfolding transitions were monitored using either CD or fluorescence spectroscopy. All unfolding transitions were collected with protein concentrations ranging from 1 to 6 μM. GdnHCl melts were collected at 20 °C. UltraPure GdnHCl was purchased from Invitrogen. Concentrations of GdnHCl were verified using refractometry (72).

Temperature-induced unfolding transitions were generated by measuring CD or fluorescence in 2 °C increments. Samples were allowed to equilibrate for 2 min at each temperature before signal measurement. The signal at each temperature was then averaged for 30 s. Reversibility was assessed by cooling samples to 25 °C after thermal denaturation and comparing CD or fluorescence emission spectra to those collected immediately before thermal unfolding.

GdnHCl-induced unfolding of cDHFR was monitored by CD at 222 nm, signal averaging for 30 s at each GdnHCl concentration. Unfolding of cNTL9, cSH3, and cSH2 was monitored by tryptophan fluorescence on an Aviv Model 107 ATF. Fluorescence was measured using a 280-nm excitation and either a 332-nm (cSH2 and cSH3) or 348-nm (cNTL9) emission, signal averaging for 30 s at each GdnHCl concentration. For these four proteins, unfolding was found to equilibrate rapidly, so that titrations could be generated using a Hamilton automated titrator, with a 5-min equilibration period.

GdnHCl-induced unfolding of cAK and cPGK was found to equilibrate on a slower timescale than the other consensus proteins, prohibiting the use of an automated titrator. Therefore, samples at each denaturant concentration were made individually and equilibrated at room temperature for 24 h (cAK) or 5 h (cPGK). For each sample, the CD signal at 225 nm (cAK) or 222 nm (cPGK) was averaged for 30 s. Melts for both proteins were collected in buffer containing 50 mM NaCl, 0.5 mM TCEP, and 25 mM Tris·HCl (pH 8.0).

Titrations were carried out in triplicate for each protein. Thermodynamic folding/unfolding parameters were determined by fitting a two-state linear extrapolation model to the folding/unfolding curves (73).

Steady-State Enzyme Kinetics. Steady-state enzyme kinetic parameters at 20 °C were determined for cDHFR (in the direction of tetrahydrofolate formation), cAK (in the direction of ADP formation), and cPGK (in the direction of 1,3-bisphosphoglycerate formation) using absorbance spectroscopy to monitor the oxidation of NAD(P)H catalyzed either directly by the consensus enzyme (cDHFR) (74), or by the activity of an enzyme that is directly coupled to the products produced by the rate-limiting activity of the consensus enzyme (cAK and cPGK) (59, 75). The absorbance at 340 nm was monitored over time after rapid addition of consensus enzyme. Steady-state velocities were determined as the initial linear slope of the time course. Because these enzymes catalyze bisubstrate reactions, we were able to obtain Michaelis–Menten kinetic parameters for each substrate by varying the concentration of one substrate at a constant, saturating concentration of the other substrate.

For cDHFR assays, a concentration of 175 nM cDHFR was used in reaction buffer containing 25 mM Hepes (pH 7.5) and 150 mM NaCl. For cAK assays, a concentration of 53 nM cAK was used in reaction buffer containing 50 mM Hepes (pH 7.5), 100 mM NaCl, 20 mM MgCl2, 1 mM phosphoenolpyruvate, 0.1 mM NADH, 10 units of pyruvate kinase, and 10 units of lactate dehydrogenase. For cPGK assays, a concentration of 53 nM consensus PGK was used in reaction mixture containing 100 mM Tris·HCl (pH 8.0), 3 mM MgCl2, 0.1 mM NADH, and 5 units of glyceraldehyde phosphate dehydrogenase.

For cDHFR and cPGK, kcat values were measured at high temperatures using the absorbance spectroscopic assays described above at saturating concentrations of both substrates. Samples were allowed to equilibrate at the desired temperature for 5 min before initiation of the reaction. Consensus DHFR activity was measured up to 50 °C. Consensus PGK activity was measured up to 40 °C, the temperature of onset of denaturation of the coupling enzyme (glyceraldehyde phosphate dehydrogenase). For cAK, enzyme activity at various temperatures was measured using a direct 31P NMR assay previously used for an AK from Aquifex aeolicus (SI Appendix, Supplementary Methods) (43). Consensus AK activity was measured up to 70 °C.

ACKNOWLEDGMENTS. We thank Ananya Majumdar for assistance in collecting, and for discussions of, the NMR experiments. We thank the Johns Hopkins University Biomolecular NMR Center, Center for Molecular Biophysics, and Chemistry NMR Core Facility for providing facilities and resources. We thank Michael Harms for suggesting the BacDive database as a resource to classify bacterial growth temperatures. This work was supported by NIH Grant GM068462 to D.B. M.S. was supported by NIH Grants T32GM008403 and F31GM128295.
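The consensus-generation step described under Design of Consensus Sequences (keep positions occupied by a residue in at least half of the sequences, then take the most frequent residue) might be sketched as follows. This is a minimal illustration, not the authors' in-house script: the function name, the toy alignment, and the tie-breaking rule are assumptions, and the upstream length filtering, CD-HIT clustering, and MAFFT alignment are not shown.

```python
from collections import Counter

GAP = "-"

def consensus_sequence(msa, min_occupancy=0.5):
    """Build a consensus sequence from equal-length aligned sequences."""
    n_seqs = len(msa)
    consensus = []
    for column in zip(*msa):
        residues = [aa for aa in column if aa != GAP]
        # Skip positions where fewer than half of the sequences have a residue.
        if len(residues) / n_seqs < min_occupancy:
            continue
        # Most frequent residue; ties go to the residue seen first in the
        # column, which is an arbitrary choice made for this sketch.
        consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)

# Hypothetical toy alignment; the third column is mostly gaps and is dropped.
toy_msa = ["MK-LV", "MR-LV", "MKALI", "-K-LV"]
toy_consensus = consensus_sequence(toy_msa)
print(toy_consensus)
```

In this sketch the most frequent residue is chosen among non-gap characters only, which is one reading of the protocol above.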

1. J. E. Gagner, W. Kim, E. L. Chaikof, Designing protein-based biomaterials for medical applications. Acta Biomater. 10, 1542–1557 (2014).
2. S. Jemli, D. Ayadi-Zouari, H. B. Hlima, S. Bejar, Biocatalysts: Application and engineering for industrial purposes. Crit. Rev. Biotechnol. 36, 246–258 (2016).
3. G. A. Khoury, J. Smadbeck, C. A. Kieslich, C. A. Floudas, Protein folding and de novo protein design for biotechnological applications. Trends Biotechnol. 32, 99–109 (2014).
4. D. Röthlisberger et al., Kemp elimination catalysts by computational enzyme design. Nature 453, 190–195 (2008).
5. S. Kulshreshtha, V. Chaudhary, G. K. Goswami, N. Mathur, Computational approaches for predicting mutant protein stability. J. Comput. Aided Mol. Des. 30, 401–412 (2016).
6. C.-Y. Chen, I. Georgiev, A. C. Anderson, B. R. Donald, Computational structure-based redesign of enzyme activity. Proc. Natl. Acad. Sci. U.S.A. 106, 3764–3769 (2009).
7. C. Jäckel, P. Kast, D. Hilvert, Protein design by directed evolution. Annu. Rev. Biophys. 37, 153–173 (2008).
8. H. Kries, R. Blomberg, D. Hilvert, De novo enzymes by computational design. Curr. Opin. Chem. Biol. 17, 221–228 (2013).
9. Z. Li, Y. Yang, J. Zhan, L. Dai, Y. Zhou, Energy functions in de novo protein design: Current challenges and future prospects. Annu. Rev. Biophys. 42, 315–335 (2013).
10. P. Gainza, H. M. Nisonoff, B. R. Donald, Algorithms for protein design. Curr. Opin. Struct. Biol. 39, 16–26 (2016).
11. M. D. Lane, B. Seelig, Advances in the directed evolution of proteins. Curr. Opin. Chem. Biol. 22, 129–136 (2014).
12. B. T. Porebski, A. M. Buckle, Consensus protein design. Protein Eng. Des. Sel. 29, 245–251 (2016).
13. B. Steipe, B. Schiller, A. Plückthun, S. Steinbacher, Sequence statistics reliably predict stabilizing mutations in a protein domain. J. Mol. Biol. 240, 188–192 (1994).
14. B. J. Sullivan et al., Stabilizing proteins from sequence statistics: The interplay of conservation and correlation in triosephosphate isomerase stability. J. Mol. Biol. 420, 384–399 (2012).
15. K. M. Polizzi, J. F. Chaparro-Riggers, E. Vazquez-Figueroa, A. S. Bommarius, Structure-guided consensus approach to create a more thermostable penicillin G acylase. Biotechnol. J. 1, 531–536 (2006).
16. A. Rath, A. R. Davidson, The design of a hyperstable mutant of the Abp1p SH3 domain by sequence alignment analysis. Protein Sci. 9, 2457–2469 (2000).

17. P. V. Nikolova, J. Henckel, D. P. Lane, A. R. Fersht, Semirational design of active tumor suppressor p53 DNA binding domain with enhanced stability. Proc. Natl. Acad. Sci. U.S.A. 95, 14675–14680 (1998).
18. L. Giver, A. Gershenson, P. O. Freskgard, F. H. Arnold, Directed evolution of a thermostable . Proc. Natl. Acad. Sci. U.S.A. 95, 12809–12813 (1998).
19. H. Zhao, F. H. Arnold, Directed evolution converts E into a functional equivalent of . Protein Eng. 12, 47–53 (1999).
20. L. Serrano, A. Horovitz, B. Avron, M. Bycroft, A. R. Fersht, Estimating the contribution of engineered surface electrostatic interactions to protein stability by using double-mutant cycles. 29, 9343–9352 (1990).
21. Z. R. Sailer, M. J. Harms, Molecular ensembles make evolution unpredictable. Proc. Natl. Acad. Sci. U.S.A. 114, 11938–11943 (2017).
22. J. R. Desjarlais, J. M. Berg, Use of a zinc-finger consensus sequence framework and specificity rules to design specific DNA binding proteins. Proc. Natl. Acad. Sci. U.S.A. 90, 2256–2260 (1993).
23. M. Lehmann et al., The consensus concept for thermostability engineering of proteins: Further proof of concept. Protein Eng. 15, 403–411 (2002).
24. B. J. Sullivan, V. Durani, T. J. Magliery, Triosephosphate isomerase by consensus design: Dramatic differences in physical properties and activity of related variants. J. Mol. Biol. 413, 195–208 (2011).
25. B. T. Porebski et al., Smoothing a rugged protein folding landscape by sequence-based redesign. Sci. Rep. 6, 33958 (2016).
26. B. T. Porebski et al., Circumventing the Stability-Function Trade-Off in an Engineered FN3 Domain (Protein Eng Des Sel PEDS, 2016).
27. A. J. Stevens et al., Design of a split intein with exceptional protein splicing activity. J. Am. Chem. Soc. 138, 2162–2165 (2016).
28. D. Aerts et al., Consensus engineering of sucrose : The outcome reflects the sequence input. Biotechnol. Bioeng. 110, 2563–2572 (2013).
29. L. M. Blatt, J. M. Davis, S. B. Klein, M. W. Taylor, The biologic activity and molecular characterization of a novel synthetic interferon-alpha species, consensus interferon. J. Interferon Cytokine Res. 16, 489–499 (1996).
30. F. García-Guevara, I. Bravo, C. Martínez-Anaya, L. Segovia, specificity switch in by rational design and consensus engineering. Protein Eng. Des. Sel. 30, 533–541 (2017).
31. L. Huang et al., The design and recombinant protein expression of a consensus porcine interferon: CoPoIFN-α. Cytokine 57, 37–45 (2012).
32. K. W. Tripp, M. Sternke, A. Majumdar, D. Barrick, Creating a homeodomain with high stability and DNA binding affinity by sequence averaging. J. Am. Chem. Soc. 139, 5051–5060 (2017).
33. A. Bateman et al., The Pfam protein families database. Nucleic Acids Res. 32, D138–D141 (2004).
34. I. Letunic, T. Doerks, P. Bork, SMART: Recent updates, new developments and status in 2015. Nucleic Acids Res. 43, D257–D260 (2015).
35. R. D. Finn et al., InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199 (2017).
36. K. Fukami-Kobayashi, M. Nosaka, A. Nakazawa, M. Go, Ancient divergence of long and short isoforms of : Molecular evolution of the nucleoside monophosphate kinase family. FEBS Lett. 385, 214–220 (1996).
37. A. R. Viguera, J. C. Martínez, V. V. Filimonov, P. L. Mateo, L. Serrano, Thermodynamic and kinetic analysis of the SH3 domain of spectrin shows a two-state folding transition. Biochemistry 33, 2142–2150 (1994).
38. E. J. Loveridge et al., The role of large-scale motions in catalysis by dihydrofolate reductase. J. Am. Chem. Soc. 133, 20561–20570 (2011).
39. J. K. Myers, C. N. Pace, J. M. Scholtz, Denaturant m values and heat capacity changes: Relation to changes in accessible surface areas of protein unfolding. Protein Sci. 4, 2138–2148 (1995).
40. C. D. Geierhaas, A. A. Nickson, K. Lindorff-Larsen, J. Clarke, M. Vendruscolo, BPPred: A Web-based computational tool for predicting biophysical parameters of proteins. Protein Sci. 16, 125–134 (2007).
41. J.-P. Demers, A. Mittermaier, Binding mechanism of an SH3 domain studied by NMR and ITC. J. Am. Chem. Soc. 131, 4355–4367 (2009).
42. K. Saksela, P. Permi, SH3 domain ligand binding: What’s the consensus and where’s the specificity? FEBS Lett. 586, 2609–2614 (2012).
43. P. Rogne, T. Sparrman, I. Anugwom, J.-P. Mikkola, M. Wolf-Watz, Realtime 31P NMR investigation on the catalytic behavior of the enzyme adenylate kinase in the matrix of a switchable ionic liquid. ChemSusChem 8, 3764–3768 (2015).
44. P. S. Shenkin, B. Erman, L. D. Mastrandrea, Information-theoretical entropy as a measure of sequence variability. Proteins 11, 297–313 (1991).
45. J. F. Reidhaar-Olson, R. T. Sauer, Combinatorial cassette mutagenesis as a probe of the informational content of protein sequences. Science 241, 53–57 (1988).
46. D. Rennell, S. E. Bouvier, L. W. Hardy, A. R. Poteete, Systematic mutation of bacteriophage T4 . J. Mol. Biol. 222, 67–88 (1991).
47. S. S. Strickler et al., Protein stability and surface electrostatics: A charged relationship. Biochemistry 45, 2761–2766 (2006).
48. A. V. Gribenko et al., Rational stabilization of enzymes by computational redesign of surface charge-charge interactions. Proc. Natl. Acad. Sci. U.S.A. 106, 2601–2606 (2009).
49. L. K. Mosavi, D. L. J. Minor, Jr, Z.-Y. Peng, Consensus-derived structural determinants of the ankyrin repeat motif. Proc. Natl. Acad. Sci. U.S.A. 99, 16029–16034 (2002).
50. T. Aksel, A. Majumdar, D. Barrick, The contribution of entropy, enthalpy, and hydrophobic desolvation to in repeat-protein folding. Structure 19, 349–360 (2011).
51. E. R. G. Main, Y. Xiong, M. J. Cocco, L. D’Andrea, L. Regan, Design of stable alpha-helical arrays from an idealized TPR motif. Structure 11, 497–508 (2003).
52. F. Parmeggiani et al., Designed armadillo repeat proteins as general peptide-binding scaffolds: Consensus design and computational optimization of the hydrophobic core. J. Mol. Biol. 376, 1282–1304 (2008).
53. M. Socolich et al., Evolutionary information for specifying a protein fold. Nature 437, 512–518 (2005).
54. D. M. Taverna, R. A. Goldstein, Why are proteins marginally stable? Proteins 46, 105–109 (2002).
55. J. D. Bloom, A. Raval, C. O. Wilke, Thermodynamics of neutral protein evolution. 175, 255–266 (2007).
56. V. D. Goyal, T. J. Magliery, Phylogenetic spread of sequence data affects fitness of SOD1 consensus enzymes: Insights from sequence statistics and structural analyses. Proteins 86, 609–620 (2018).
57. J. A. Capra, M. Singh, Predicting functionally important residues from sequence conservation. 23, 1875–1882 (2007).
58. L. Z. Holland, M. McFall-Ngai, G. N. Somero, Evolution of -A homologs of barracuda fishes (genus Sphyraena) from different thermal environments: Differences in kinetic properties and thermal stability are due to amino acid substitutions outside the . Biochemistry 36, 3207–3215 (1997).
59. M. Wolf-Watz et al., Linkage between dynamics and catalysis in a thermophilic-mesophilic enzyme pair. Nat. Struct. Mol. Biol. 11, 945–949 (2004).
60. P. G. Varley, R. H. Pain, Relation between stability, dynamics and enzyme activity in 3-phosphoglycerate from yeast and Thermus thermophilus. J. Mol. Biol. 220, 531–538 (1991).
61. R. A. Studer, P.-A. Christin, M. A. Williams, C. A. Orengo, Stability-activity tradeoffs constrain the adaptive evolution of RubisCO. Proc. Natl. Acad. Sci. U.S.A. 111, 2223–2228 (2014).
62. J. Tian, J. C. Woodard, A. Whitney, E. I. Shakhnovich, Thermal stabilization of using Monte Carlo unfolding simulations and its functional consequences. PLoS Comput. Biol. 11, e1004207 (2015).
63. S. Kumar, C. J. Tsai, R. Nussinov, Factors enhancing protein thermostability. Protein Eng. 13, 179–191 (2000).
64. Y. Li, C. R. Middaugh, J. Fang, A novel scoring function for discriminating hyperthermophilic and mesophilic proteins with application to predicting relative thermostability of protein mutants. BMC Bioinformatics 11, 62 (2010).
65. M. S. Lawrence, K. J. Phillips, D. R. Liu, Supercharging proteins can impart unusual resilience. J. Am. Chem. Soc. 129, 10110–10112 (2007).
66. J. M. Koshi, R. A. Goldstein, Mutation matrices and physical-chemical properties: Correlations and implications. Proteins 27, 336–344 (1997).
67. N. Tokuriki, F. Stricher, J. Schymkowitz, L. Serrano, D. S. Tawfik, The stability effects of protein mutations appear to be universally distributed. J. Mol. Biol. 369, 1318–1332 (2007).
68. S. Meinhardt, M. W. J. Manley, Jr, D. J. Parente, L. Swint-Kruse, Rheostats and toggle switches for modulating protein function. PLoS One 8, e83502 (2013).
69. G. Talavera, J. Castresana, Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56, 564–577 (2007).
70. W. Li, L. Jaroszewski, A. Godzik, Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 282–283 (2001).
71. K. Katoh, K. Misawa, K. Kuma, T. Miyata, MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).
72. Y. Nozaki, The preparation of guanidine hydrochloride. Methods Enzymol. 26, 43–50 (1972).
73. C. N. Pace, Determination and analysis of urea and guanidine hydrochloride denaturation curves. Methods Enzymol. 131, 266–280 (1986).
74. G. Bhabha et al., Divergent evolution of protein conformational dynamics in dihydrofolate reductase. Nat. Struct. Mol. Biol. 20, 1243–1249 (2013).
75. M. Bentahir et al., Structural, kinetic, and calorimetric characterization of the cold-active phosphoglycerate kinase from the antarctic Pseudomonas sp. TACII18. J. Biol. Chem. 275, 11147–11153 (2000).
76. C. A. Fierke, K. A. Johnson, S. J. Benkovic, Construction and evaluation of the kinetic scheme associated with dihydrofolate reductase from Escherichia coli. Biochemistry 26, 4085–4092 (1987).
77. H. S. Kim, S. M. Damo, S.-Y. Lee, D. Wemmer, J. P. Klinman, Structure and hydride transfer mechanism of a moderate thermophilic dihydrofolate reductase from Bacillus stearothermophilus and comparison to its mesophilic and hyperthermophilic homologues. Biochemistry 44, 11428–11439 (2005).
78. H. Y. Aviram et al., Direct observation of ultrafast large-scale dynamics of an enzyme under turnover conditions. Proc. Natl. Acad. Sci. U.S.A. 115, 3243–3248 (2018).
79. H. Schurig et al., Phosphoglycerate kinase and triosephosphate isomerase from the hyperthermophilic bacterium Thermotoga maritima form a covalent bifunctional enzyme complex. EMBO J. 14, 442–451 (1995).
