Consensus Sequence Design As a General Strategy to Create Hyperstable, Biologically Active Proteins

Consensus sequence design as a general strategy to create hyperstable, biologically active proteins Matt Sternkea, Katherine W. Trippa, and Doug Barricka,1 aThe T.C. Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218 Edited by Brian W. Matthews, University of Oregon, Eugene, OR, and approved April 26, 2019 (received for review October 6, 2018) Consensus sequence design offers a promising strategy for de- tion in its biological context and ultimately contribute to or- signing proteins of high stability while retaining biological activity ganismal fitness.* By averaging over many sequences that share since it draws upon an evolutionary history in which residues similar structure and function, the consensus design approach important for both stability and function are likely to be con- has the potential to produce proteins with high levels of ther- served. Although there have been several reports of successful modynamic stability and biological activity, since both attributes consensus design of individual targets, it is unclear from these are likely to lead to residue conservation. anecdotal studies how often this approach succeeds and how There are two experimental approaches that have been used often it fails. Here, we attempt to assess generality by designing to examine the effectiveness of consensus information in protein “ ” consensus sequences for a set of six protein families with a range design: point-substitution and wholesale substitution. In the of chain lengths, structures, and activities. We characterize the first approach, single residues in a well-behaved protein that resulting consensus proteins for stability, structure, and biological differ from the consensus are substituted with the consensus – activities in an unbiased way. We find that all six consensus residue (13 17). In these studies, about one-half of the consensus proteins adopt cooperatively folded structures in solution. Strik- point substitutions examined are stabilizing, but the other half ingly, four of six of these consensus proteins show increased are destabilizing. Although this frequency of stabilizing mutations is significantly higher than an estimated frequency around thermodynamic stability over naturally occurring homologs. Each 3 consensus protein tested for function maintained at least partial 1in10 for random mutations (13, 18, 19), it suggests that biological activity. Although peptide binding affinity by a combining individual consensus mutations may give minimal net increase in stability since stabilizing substitutions would be offset BIOPHYSICS AND consensus-designed SH3 is rather low, Km values for consensus by destabilizing substitutions. COMPUTATIONAL BIOLOGY enzymes are similar to values from extant homologs. Although The “wholesale” approach does just this, combining all sub- consensus enzymes are slower than extant homologs at low tem- stitutions toward consensus into a single consensus polypeptide perature, they are faster than some thermophilic enzymes at high composed of the most frequent amino acid at each position in temperature. An analysis of sequence properties shows consensus sequence. By making a large number of substitutions at once, proteins to be enriched in charged residues, and rarified in un- wholesale consensus substitution may collectively combine the charged polar residues. Sequence differences between consensus incremental effects from the individual substitutions as well as and extant homologs are predominantly located at weakly con- nonadditive effects arising from the substitution of each residue served surface residues, highlighting the importance of these res- into the novel background of the consensus protein (20, 21). The idues in the success of the consensus strategy. stabilities of several globular proteins and several repeat proteins protein stability | protein design | consensus sequence Significance xploiting the fundamental roles of proteins in biological sig- A major goal of protein design is to create proteins that have Enaling, catalysis, and mechanics, protein design offers a high stability and biological activity. Drawing on evolutionary promising route to create and optimize biomolecules for medi- – information encoded within extant protein sequences, con- cal, industrial, and biotechnological purposes (1 3). Many dif- sensus sequence design has produced several successes in ferent strategies have been applied to designing proteins, achieving this goal. Here, we explore the generality with which including physics-based (4), structure-based (5, 6), and directed consensus design can be used to enhance protein stability and evolution-based approaches (7). While these strategies have maintain biological activity. By designing and characterizing generated proteins with high stability, implementation of the consensus sequences for six unrelated protein families, we find design strategies is often complex and success rates can be low – that consensus design shows high success rates in creating (8 11). Although directed evolution can be functionally directed, well-folded, hyperstable proteins that retain biological activi- de novo design strategies typically focus primarily on structure. ties. Remarkably, many of these consensus proteins show Introducing specific activity into de novo-designed proteins is a higher stabilities than naturally occurring sequences of their significant challenge (8, 10). respective protein families. Our study highlights the utility of Another strategy that has shown success in increasing ther- consensus sequence design and informs the mechanisms by modynamic stability of natural protein folds is consensus se- which it works. quence design (12). For this design strategy, “consensus” residues are identified as the residues with the highest frequency Author contributions: M.S., K.W.T., and D.B. designed research; M.S. and K.W.T. per- at individual positions in a multiple sequence alignment (MSA) formed research; M.S. and K.W.T. analyzed data; and M.S. and D.B. wrote the paper. of extant sequences from a given protein family. The consensus The authors declare no conflict of interest. strategy draws upon the hundreds of millions of years of sequence evolution by random mutation and natural selection that This article is a PNAS Direct Submission. is encoded within the extant sequence distribution, with the idea Published under the PNAS license. that the relative frequencies of residues at a given position reflect 1To whom correspondence may be addressed. Email: [email protected]. the “relative importance” of each residue at that position for This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. some biological attribute. As long as the importance at each 1073/pnas.1816707116/-/DCSupplemental. position is largely independent of residues at other positions, a consensus residue at a given position should optimize stability, *These properties include rates of folding, unfolding, degradation, compartmentaliza- activity, and/or other properties that allow the protein to function, oligomerization, and solubility. www.pnas.org/cgi/doi/10.1073/pnas.1816707116 PNAS Latest Articles | 1of10 Downloaded by guest on September 28, 2021 have been increased using this approach (22–31). An increase in Results thermodynamic stability is seen in most (but not all) cases, but Consensus Sequence Generation. The six families targeted for effects on biological activity are variable. In a recent study, we consensus design are shown in Fig. 1. To generate and refine characterized a consensus-designed homeodomain sequence that MSAs from these families, we surveyed the Pfam (33), SMART showed a large increase in both thermodynamic stability and (34), and InterPro (35) databases to obtain sets of aligned se- DNA-binding affinity (32). Unlike the point-substitution ap- quences for each family. For each family, we used the sequence proach, where both stabilizing and destabilizing substitutions are set that maximized both the number of sequences and quality of reported, the success rate of the wholesale approach is not easy the MSA (SI Appendix, Table S1). For five of the six protein to determine from the literature, where publications present families, we did not limit sequences using any structural or single cases of success, whereas failures are not likely to be phylogenetic information. For AK, we selected sequences that “ ” published. One study of TIM barrels reported a few poorly be- contained a long-type LID domain containing a 27-residue haved consensus designs that were then optimized to generate a insertion found primarily in bacterial sequences (36). The number of sequences in these initial sequence sets ranged from folded, active protein (24). Although this study highlighted some SI Appendix limitations, it would not likely have been published if it did not around 2,000 up to 55,000 ( , Table S1). end with success. To improve alignments and limit bias from copies of identical Here, we address these issues by applying the consensus se- (or nearly identical) sequences, we removed sequences that de- viated significantly (±30%) in length from the median length quence design strategy to a set of six taxonomically diverse value, and removed sequences that shared greater than 90% protein families with different folds and functions (Fig. 1). We sequence identity with another sequence. This curation reduced chose the three single domain families including the N-terminal the size of sequence sets by an average of 65% (ranging from

Consensus Sequence Design As a General Strategy to Create Hyperstable, Biologically Active Proteins

DNA Sequencing and Sorting: Identifying Genetic Variations

Structural Basis for the Sheddase Function of Human Meprin Β Metalloproteinase at the Plasma Membrane

Protein Family Classification with Neural Networks

Pyruvate-Phosphate Dikinase of Oxymonads and Parabasalia and the Evolution of Pyrophosphate-Dependent Glycolysis in Anaerobic Eukaryotes† Claudio H

Gent Forms of Metalloproteinases in Hydra

Annotation of Glycolysis and Gluconeogenesis Pathways

Rubisco Biogenesis and Assembly in Chlamydomonas Reinhardtii Wojciech Wietrzynski

Protein Family Classification Using Shallow and Deep Networks ABSTRACT 1 INTRODUCTION

An Aminoacyl-Trna Synthetase Paralog with a Catalytic Role in Histidine Biosynthesis

Bioinformatic Protein Family Characterisation

Independent Evolution of Polymerization in the Actin Atpase Clan Regulates Hexokinase Activity

A Tool for Detecting Base Mis-Calls in Multiple Sequence Alignments by Semi-Automatic Chromatogram Inspection